cam.sgnmt.misc package¶
Submodules¶
cam.sgnmt.misc.sparse module¶
This module adds support for sparse input or output features. In standard NMT we normally use a one-hot representation, and input and output layers are lookup tables (embedding matrices). The Blocks NMT implementation in SGNMT supports explicit definition of word representations as sparse features, in which more than one neuron can be activated at a time.
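Throughout this module, a sparse feature is a list of (dimension, value) tuples, whereas a dense feature is a plain list of length dim. The following standalone sketch illustrates the difference between a one-hot and a multi-hot feature; the helper function is for illustration only and not part of SGNMT:

    def to_dense(sparse, dim):
        """Roll a sparse vector [(dim1, val1), ...] out into a dense list."""
        dense = [0.0] * dim
        for d, val in sparse:
            dense[d] = val
        return dense

    one_hot = [(3, 1.0)]              # word id 3 in a 5-dimensional one-hot space
    multi_hot = [(0, 1.0), (3, 0.5)]  # sparse feature with two active neurons

    print(to_dense(one_hot, 5))    # [0.0, 0.0, 0.0, 1.0, 0.0]
    print(to_dense(multi_hot, 5))  # [1.0, 0.0, 0.0, 0.5, 0.0]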
class cam.sgnmt.misc.sparse.FileBasedFeatMap(dim, path)[source]¶
Bases: cam.sgnmt.misc.sparse.SparseFeatMap

This class loads the mapping from word to sparse feature from a file (see --src_sparse_feat_map and --trg_sparse_feat_map). The mapping from word to feature is a simple dictionary lookup. The mapping from feature to word is implemented with a trie-based nearest neighbor search and does not require an exact match. However, in case of an exact match, it runs linearly in the number of non-zero entries in the vector.

The constructor loads the feature map from the file system.

Parameters:
- dim (int) – Dimensionality of the feature representation
- path (string) – Path to the feature map file

Raises: IOError – If the file could not be loaded
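A possible usage pattern for this class, assuming a suitable feature map file exists (the file name, dimensionality, and word id below are hypothetical):

    from cam.sgnmt.misc.sparse import FileBasedFeatMap

    # Hypothetical path in the format expected by --src_sparse_feat_map /
    # --trg_sparse_feat_map; loading raises IOError if the file is missing.
    feat_map = FileBasedFeatMap(dim=50, path="feat_map.txt")

    vec = feat_map.word2dense(42)    # dictionary lookup: word id -> dense feature
    word = feat_map.dense2word(vec)  # trie-based nearest neighbor: feature -> word id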
class cam.sgnmt.misc.sparse.FlatSparseFeatMap(dim=0)[source]¶
Bases: cam.sgnmt.misc.sparse.SparseFeatMap

Can be used as a replacement where a SparseFeatMap is required but you wish to use flat word ids. It overrides the dense methods with the identity mapping.

Parameters:
- dim (int) – not used
class cam.sgnmt.misc.sparse.SparseFeatMap(dim)[source]¶
Bases: object

This is the super class for mapping strategies between sparse feature representations and symbolic word indices. The translation needs to be implemented in sparse2word and word2sparse.

The constructor initializes this map.

Parameters:
- dim (int) – Dimensionality of the feature representation
dense2nwords(feat, n=1)[source]¶
Returns the n closest words to feat.

Parameters:
- feat (list) – Dense feature vector
- n (int) – Number of words to retrieve

Returns: list. List of (wordid, distance) tuples with words which are close to feat.

Note: The default implementation does not use the n argument and always returns distance 0.
dense2sparse(dense, eps=0.5)[source]¶
Converts a dense vector to a sparse vector.

Parameters:
- dense (list) – Dense vector (list of length n)
- eps (float) – Values smaller than this are set to zero in the sparse representation

Returns: list. List of (dimension, value) tuples (sparse vector)
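As an illustration of the documented behavior (the base class is instantiated directly here, and the exact ordering of the returned tuples is an assumption):

    from cam.sgnmt.misc.sparse import SparseFeatMap

    m = SparseFeatMap(4)
    m.dense2sparse([0.0, 0.9, 0.1, 0.6])
    # -> [(1, 0.9), (3, 0.6)]   (0.0 and 0.1 fall below the default eps=0.5)
    m.dense2sparse([0.0, 0.9, 0.1, 0.6], eps=0.05)
    # -> [(1, 0.9), (2, 0.1), (3, 0.6)]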
dense2word(feat)[source]¶
Gets the word id for a dense feature.

Parameters:
- feat (list) – Dense feature vector to look up

Returns: int. Word ID of a match for feat, or None if no match could be found.

Raises: NotImplementedError.
sparse2dense(sparse)[source]¶
Converts a sparse vector to its dense representation.

Parameters:
- sparse (list) – Sparse vector (list of tuples)

Raises: IndexError – If the input vector exceeds the dimensionality of this map
sparse2nwords(feat, n=1)[source]¶
Returns the n closest words to feat. Subclasses can override this method to implement a search for close words. The default implementation returns the single best word only.

Parameters:
- feat (list) – Sparse feature
- n (int) – Number of words to retrieve

Returns: list. List of (wordid, distance) tuples with words which are close to feat.

Note: The default implementation does not use the n argument and always returns distance 0.
sparse2word(feat)[source]¶
Gets the word id for a sparse feature. The sparse feature format is a list of tuples [(dim1,val1),...,(dimN,valN)].

Parameters:
- feat (list) – Sparse feature to look up

Returns: int. Word ID of a match for feat, or None if no match could be found.

Raises: NotImplementedError.
word2dense(word)[source]¶
Gets the feature representation in dense format, i.e. a self.dim-dimensional vector as a list.

Parameters:
- word (int) – Word ID

Returns: list. Dense vector corresponding to word, or the null vector if no representation is found.
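A minimal subclass sketch: according to the class description above, a concrete map has to provide sparse2word and word2sparse. The toy mapping below is hypothetical and only meant to show the shape of such an implementation:

    from cam.sgnmt.misc.sparse import SparseFeatMap

    class ToyFeatMap(SparseFeatMap):
        """Hypothetical map from word ids to hand-crafted sparse features."""

        def __init__(self, dim, word2feat):
            # word2feat: dict from word id to sparse feature [(dim, val), ...]
            super(ToyFeatMap, self).__init__(dim)
            self.word2feat = word2feat
            # Reverse lookup keyed by the (hashable) tuple form of the feature.
            self.feat2word = {tuple(f): w for w, f in word2feat.items()}

        def word2sparse(self, word):
            return self.word2feat.get(word)

        def sparse2word(self, feat):
            return self.feat2word.get(tuple(feat))

    fmap = ToyFeatMap(4, {0: [(0, 1.0), (1, 1.0)], 1: [(2, 1.0), (3, 1.0)]})
    fmap.word2sparse(1)                     # -> [(2, 1.0), (3, 1.0)]
    fmap.sparse2word([(0, 1.0), (1, 1.0)])  # -> 0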
class cam.sgnmt.misc.sparse.TrivialSparseFeatMap(dim)[source]¶
Bases: cam.sgnmt.misc.sparse.SparseFeatMap

This is the null-object (GoF) implementation for SparseFeatMap. It corresponds to the usual one-hot representation.

The constructor passes through to SparseFeatMap.

Parameters:
- dim (int) – Dimensionality of the feature representation (should be the vocabulary size)
cam.sgnmt.misc.sparse.dense_euclidean(v1, v2)[source]¶
Calculates the Euclidean distance between two dense vectors.

Parameters:
- v1 (dict) – First dense vector
- v2 (dict) – Second dense vector

Returns: float. Distance between v1 and v2.
cam.sgnmt.misc.sparse.dense_euclidean2(v1, v2)[source]¶
Calculates the squared Euclidean distance between two dense vectors.

Parameters:
- v1 (dict) – First dense vector
- v2 (dict) – Second dense vector

Returns: float. Squared distance between v1 and v2.
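A sketch of what these two functions presumably compute, treating dense vectors as equal-length numeric sequences (a reimplementation for illustration, not the SGNMT code):

    import math

    def dense_euclidean2_sketch(v1, v2):
        """Squared Euclidean distance between two dense vectors."""
        return sum((a - b) ** 2 for a, b in zip(v1, v2))

    def dense_euclidean_sketch(v1, v2):
        """Euclidean distance: the square root of the squared distance."""
        return math.sqrt(dense_euclidean2_sketch(v1, v2))

    print(dense_euclidean2_sketch([1.0, 0.0], [0.0, 0.0]))  # 1.0
    print(dense_euclidean_sketch([3.0, 0.0], [0.0, 4.0]))   # 5.0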
cam.sgnmt.misc.trie module¶
This module contains SimpleTrie, which is a generic trie implementation based on strings (sequences) of integers.
class cam.sgnmt.misc.trie.SimpleNode[source]¶
Helper class representing a node in a SimpleTrie.

The constructor creates an empty node without children.
class cam.sgnmt.misc.trie.SimpleTrie[source]¶
This is a very simple trie implementation. It is simpler than the one in cam.sgnmt.predictors.grammar because it does not support non-terminals or removal. The only supported operations are add and get, but those are implemented very efficiently. For many applications (e.g. the cache in the greedy heuristic) this is already enough.

The implementation also supports keys in sparse representation, in which most of the elements in the sequence are zero (see add_sparse, get_sparse, and nearest_sparse). In this case, the key is a list of tuples [(dim1,val1),...,(dimN,valN)]. Internally, we store them as the sequence "dim1 val1 dim2 val2 ...". Note that we assume that the tuples are ordered by dimension!

The constructor creates an empty Trie data structure.
add(seq, element)[source]¶
Add an element to the Trie for the key seq. If seq already exists, override.

Parameters:
- seq (list) – Key
- element (object) – The object to store for key seq
add_sparse(key, element)[source]¶
Adds an element with a key in sparse representation.

Parameters:
- key (list) – Sparse key (list of tuples)
- element (object) – The object to store for the key
get(seq)[source]¶
Retrieve the element for a key seq.

Parameters:
- seq (list) – Query key

Returns: object. The element which has been added along with seq, or None if the key does not exist.
get_prefix(seq)[source]¶
Get the key in the Trie with the longest common prefix with seq.

Parameters:
- seq (list) – Query sequence

Returns: list. The longest key in the Trie which is a prefix of seq.
get_sparse(key, element)[source]¶
Retrieves an element with a key in sparse representation.

Parameters:
- key (list) – Query key in sparse representation

Returns: object. The element which has been added along with the key, or None if the key does not exist.
n_nearest_sparse(query, n=1)[source]¶
This method returns the n elements in the Trie with the closest keys to query in terms of Euclidean distance. The efficiency relies on sparseness: the more zeros in the vector, the more efficient.

Parameters:
- query (list) – Query key in sparse format
- n (int) – Number of elements to retrieve

Returns: list. List of (object, dist) pairs with the nearest elements to query in terms of the L2 norm and the squared L2 distance.
nearest_sparse(query)[source]¶
This method returns the element in the Trie with the closest key to query in terms of Euclidean distance. The efficiency relies on sparseness: the more zeros in the vector, the more efficient. If the Trie contains an exact match, this method runs linearly in the length of the query (i.e. independent of the number of elements in the Trie).

Parameters:
- query (list) – Query key in sparse format

Returns: tuple. (object, dist) pair with the nearest element to query in terms of the L2 norm and the squared L2 distance.
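A short usage sketch of the API documented above. Keys are lists of integers; sparse keys are lists of (dimension, value) tuples ordered by dimension. The stored objects and the exact distance value are illustrative:

    from cam.sgnmt.misc.trie import SimpleTrie

    trie = SimpleTrie()
    trie.add([4, 8, 15], "first")
    trie.add([4, 8, 16], "second")
    trie.get([4, 8, 15])             # -> "first"
    trie.get([4, 8])                 # -> None (no element stored for this key)
    trie.get_prefix([4, 8, 15, 23])  # -> [4, 8, 15]

    # Sparse keys are kept in a separate trie in this sketch.
    sparse_trie = SimpleTrie()
    sparse_trie.add_sparse([(0, 1.0), (3, 0.5)], "sparse-entry")
    element, dist = sparse_trie.nearest_sparse([(0, 0.9), (3, 0.5)])
    # element == "sparse-entry"; dist is the squared L2 distance (here 0.1**2)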
cam.sgnmt.misc.unigram module¶
This module contains classes which are able to store unigram probabilities and potentially collect them by observing a decoder instance. This can be used for heuristics.
class cam.sgnmt.misc.unigram.AllStatsUnigramTable[source]¶
Bases: cam.sgnmt.misc.unigram.UnigramTable

This unigram table collects statistics from all partial hypos.

The constructor passes through to the super class constructor.
class cam.sgnmt.misc.unigram.BestStatsUnigramTable[source]¶
Bases: cam.sgnmt.misc.unigram.UnigramTable

This unigram table collects statistics from the best full hypo.

The constructor passes through to the super class constructor.
class cam.sgnmt.misc.unigram.FileUnigramTable(path)[source]¶
Bases: cam.sgnmt.misc.unigram.UnigramTable

Loads a unigram table from an external file.

The constructor loads the unigram table from path.
class cam.sgnmt.misc.unigram.FullStatsUnigramTable[source]¶
Bases: cam.sgnmt.misc.unigram.UnigramTable

This unigram table collects statistics from all full hypos.

The constructor passes through to the super class constructor.
class cam.sgnmt.misc.unigram.UnigramTable[source]¶
Bases: cam.sgnmt.utils.Observer

A unigram table stores unigram probabilities for a certain vocabulary. These statistics can be loaded from an external file (FileUnigramTable) or collected during decoding.

The constructor creates a unigram table without entries.
estimate(word, default=0.0)[source]¶
Estimate the unigram score for the given word.

Parameters:
- word (int) – Word ID
- default (float) – Default value to be returned if word cannot be found in the table

Returns: float. Unigram score, or default if word is not in the table.
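A brief usage sketch, assuming a unigram file is available (the file name, word id, and default value below are hypothetical; the on-disk format is whatever FileUnigramTable expects):

    from cam.sgnmt.misc.unigram import FileUnigramTable

    table = FileUnigramTable("unigram_scores.txt")  # hypothetical path

    # Look up the unigram score for a word id, falling back to a large
    # negative default for words not in the table.
    score = table.estimate(42, default=-100.0)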
Module contents¶
This package contains miscellaneous routines and classes, e.g. for preprocessing and feature extraction, which are independent of the framework used (Blocks, TensorFlow, ...). For example, this includes rolling out sparse features into a dense representation or searching for the best surface form for a given attribute vector. trie contains a generic trie implementation; unigram can be used to keep track of unigram statistics during decoding.