cam.sgnmt.misc package¶
Submodules¶
cam.sgnmt.misc.sparse module¶
This module adds support for sparse input or output features. In standard NMT we normally use a one-hot-representation, and input and output layers are lookup tables (embedding matrices). The Blocks NMT implementation in SGNMT supports explicit definition of word representations as sparse features, in which more than one neuron can be activated at a time.
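To illustrate the difference, here is a minimal Python sketch (not part of SGNMT; all names are hypothetical) contrasting the standard one-hot representation with a sparse feature in which several neurons are active at once:

```python
def one_hot(word_id, dim):
    # Standard one-hot representation: exactly one active neuron.
    vec = [0.0] * dim
    vec[word_id] = 1.0
    return vec

# Sparse feature in (dimension, value) tuple format: several neurons
# can be active for a single word at the same time.
sparse_feat = [(0, 1.0), (3, 0.5), (7, 1.0)]

print(one_hot(2, 5))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```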
-
class
cam.sgnmt.misc.sparse.
FileBasedFeatMap
(dim, path)[source]¶ Bases:
cam.sgnmt.misc.sparse.SparseFeatMap
This class loads the mapping from word to sparse feature from a file (see
--src_sparse_feat_map
and --trg_sparse_feat_map
). The mapping from word to feature is a simple dictionary lookup. The mapping from feature to word is implemented with a Trie-based nearest neighbor search and does not require an exact match. However, in case of an exact match, it runs linearly in the number of non-zero entries in the vector.
Loads the feature map from the file system.
Parameters: - dim (int) – Dimensionality of the feature representation
- path (string) – Path to the feature map file
Raises: IOError – If the file could not be loaded
-
class
cam.sgnmt.misc.sparse.
FlatSparseFeatMap
(dim=0)[source]¶ Bases:
cam.sgnmt.misc.sparse.SparseFeatMap
Can be used as replacement if a
SparseFeatMap
is required but you wish to use flat word IDs. It overrides the dense methods with identity mappings. Parameters: dim (int) – not used
-
class
cam.sgnmt.misc.sparse.
SparseFeatMap
(dim)[source]¶ Bases:
object
This is the super class for mapping strategies between sparse feature representations and symbolic word indices. The translation needs to be implemented in
sparse2word
and word2sparse
. Initializes this map.
Parameters: dim (int) – Dimensionality of the feature representation -
dense2nwords
(feat, n=1)[source]¶ Returns the n closest words to
feat
.Parameters: - feat (list) – Dense feature vector
- n (int) – Number of words to retrieve
Returns: - List of (wordid, distance) tuples with words which
are close to
feat
.
Return type: list
Note
The default implementation does not use the
n
argument and always returns distance 0.
-
dense2sparse
(dense, eps=0.5)[source]¶ Converts a dense vector to a sparse vector.
Parameters: - dense (list) – Dense vector (list of length n)
- eps (float) – Values smaller than this are set to zero in the sparse representation
Returns: list. List of (dimension, value) tuples (sparse vector)
-
dense2word
(feat)[source]¶ Gets the word id for a dense feature.
Parameters: feat (list) – Dense feature vector to look up Returns: int. Word ID of a match for feat, or None if no match could be found
Raises: NotImplementedError.
-
sparse2dense
(sparse)[source]¶ Converts a sparse vector to its dense representation.
Parameters: sparse (list) – Sparse vector (list of tuples) Raises: IndexError – If the input vector exceeds the dimensionality of this map
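The conversion semantics of dense2sparse and sparse2dense can be sketched as follows (an illustrative re-implementation, not the SGNMT code; keeping values with val >= eps is an assumption about the threshold direction):

```python
def dense2sparse(dense, eps=0.5):
    # Values smaller than eps are set to zero, i.e. dropped from the
    # sparse representation; the rest become (dimension, value) tuples.
    return [(i, v) for i, v in enumerate(dense) if v >= eps]

def sparse2dense(sparse, dim):
    # Expand (dimension, value) tuples into a dense list of length dim.
    vec = [0.0] * dim
    for i, v in sparse:
        if i >= dim:
            raise IndexError("dimension exceeds dimensionality of this map")
        vec[i] = v
    return vec

print(dense2sparse([0.0, 0.7, 0.1, 0.9]))  # [(1, 0.7), (3, 0.9)]
print(sparse2dense([(1, 0.7)], dim=3))     # [0.0, 0.7, 0.0]
```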
-
sparse2nwords
(feat, n=1)[source]¶ Returns the n closest words to
feat
. Subclasses can override this method to implement n-best word search. The default implementation returns the single best word only. Parameters: - feat (list) – Sparse feature
- n (int) – Number of words to retrieve
Returns: - List of (wordid, distance) tuples with words which
are close to
feat
.
Return type: list
Note
The default implementation does not use the
n
argument and always returns distance 0.
-
sparse2word
(feat)[source]¶ Gets the word id for a sparse feature. The sparse feature format is a list of tuples [(dim1,val1),..,(dimN,valN)]
Parameters: feat (list) – Sparse feature to look up Returns: int. Word ID of a match for feat, or None if no match could be found
Raises: NotImplementedError.
-
word2dense
(word)[source]¶ Gets the feature representation in dense format, i.e. a
self.dim
-dimensional vector as a list. Parameters: word (int) – Word ID Returns: list. Dense vector corresponding to word,
or the null vector if no representation is found
-
-
class
cam.sgnmt.misc.sparse.
TrivialSparseFeatMap
(dim)[source]¶ Bases:
cam.sgnmt.misc.sparse.SparseFeatMap
This is the null-object (GoF) implementation for
SparseFeatMap
. It corresponds to the usual one-hot representation.Pass through to
SparseFeatMap
.Parameters: dim (int) – Dimensionality of the feature representation (should be the vocabulary size)
-
cam.sgnmt.misc.sparse.
dense_euclidean
(v1, v2)[source]¶ Calculates the Euclidean distance between two dense vectors.
Parameters: - v1 (dict) – First dense vector
- v2 (dict) – Second dense vector
Returns: float. Distance between
v1
andv2
.
-
cam.sgnmt.misc.sparse.
dense_euclidean2
(v1, v2)[source]¶ Calculates the squared Euclidean distance between two dense vectors.
Parameters: - v1 (dict) – First dense vector
- v2 (dict) – Second dense vector
Returns: float. Squared distance between
v1
andv2
.
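The two distance helpers can be sketched like this (an illustrative re-implementation, not the SGNMT code):

```python
import math

def dense_euclidean2(v1, v2):
    # Squared L2 distance between two dense vectors of equal length.
    return sum((a - b) ** 2 for a, b in zip(v1, v2))

def dense_euclidean(v1, v2):
    # L2 distance: square root of the squared distance.
    return math.sqrt(dense_euclidean2(v1, v2))

print(dense_euclidean2([0.0, 0.0], [3.0, 4.0]))  # 25.0
print(dense_euclidean([0.0, 0.0], [3.0, 4.0]))   # 5.0
```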
cam.sgnmt.misc.trie module¶
This module contains SimpleTrie
which is a generic trie
implementation based on strings of integers.
-
class
cam.sgnmt.misc.trie.
SimpleNode
[source]¶ Helper class representing a node in a
SimpleTrie
Creates an empty node without children.
-
class
cam.sgnmt.misc.trie.
SimpleTrie
[source]¶ This is a very simple Trie implementation. It is simpler than the one in
cam.sgnmt.predictors.grammar
because it does not support non-terminals or removal. The only supported operations are add
and get
, but those are implemented very efficiently. For many applications (e.g. the cache in the greedy heuristic) this is already enough. The implementation also supports keys in sparse representation, in which most of the elements in the sequence are zero (see
add_sparse
, get_sparse
, and nearest_sparse
). In this case, the key is a list of tuples [(dim1,val1),...,(dimN,valN)]. Internally, we store them as the sequence “dim1 val1 dim2 val2 ...”. Note that we assume that the tuples are ordered by dimension! Creates an empty Trie data structure.
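The internal flattening of sparse keys described above can be sketched as follows (hypothetical helper name, for illustration only):

```python
def flatten_sparse_key(key):
    # Flatten [(dim1, val1), ..., (dimN, valN)] into the internal
    # sequence dim1, val1, dim2, val2, ... used as the trie key.
    # Assumes the tuples are ordered by dimension, as required above.
    seq = []
    for dim, val in key:
        seq.append(dim)
        seq.append(val)
    return seq

print(flatten_sparse_key([(2, 1.0), (5, 0.5)]))  # [2, 1.0, 5, 0.5]
```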
-
add
(seq, element)[source]¶ Add an element to the Trie for the key
seq
. If seq
already exists, it is overridden. Parameters: - seq (list) – Key
- element (object) – The object to store for key
seq
-
add_sparse
(key, element)[source]¶ Adds an element with a key in sparse representation.
Parameters: - key (list) – Sparse key (list of tuples)
- element (object) – The object to store for the given key
-
get
(seq)[source]¶ Retrieve the element for a key
seq
.Parameters: seq (list) – Query key Returns: object. The element which has been added along with seq
or None
if the key does not exist.
-
get_prefix
(seq)[source]¶ Get the key in the Trie with the longest common prefix with
seq
.Parameters: seq (list) – Query sequence Returns: list. The longest key in the Trie which is a prefix of seq
.
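The get_prefix contract can be illustrated with a brute-force sketch over a plain list of stored keys (illustrative only; the Trie finds this prefix by walking down from the root instead of scanning all keys):

```python
def longest_prefix_key(stored_keys, seq):
    # Among the stored keys, return the longest one that is a
    # prefix of the query sequence seq.
    best = []
    for key in stored_keys:
        if len(key) > len(best) and seq[:len(key)] == key:
            best = key
    return best

keys = [[1], [1, 2], [1, 2, 3, 4], [2, 5]]
print(longest_prefix_key(keys, [1, 2, 3]))  # [1, 2]
```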
-
get_sparse
(key, element)[source]¶ Retrieves an element with a key in sparse representation.
Parameters: key (list) – Sparse key (list of tuples) Returns: object. The element which has been added along with the key,
or None
if the key does not exist.
-
n_nearest_sparse
(query, n=1)[source]¶ This method returns the n elements in the Trie with the closest keys to
query
in terms of Euclidean distance. The efficiency relies on sparseness: the more zeros in the vector, the more efficient. Parameters: - query (list) – Query key in sparse format
- n (int) – Number of elements to retrieve
Returns: List. List of (object,dist) pairs with the nearest element to
query
in terms of L2 norm and the squared L2 distance.
-
nearest_sparse
(query)[source]¶ This method returns the element in the Trie with the closest key to
query
in terms of Euclidean distance. The efficiency relies on sparseness: the more zeros in the vector, the more efficient. If the Trie contains an exact match, this method runs linearly in the length of the query (i.e. independent of the number of elements in the Trie). Parameters: query (list) – Query key in sparse format Returns: tuple. (object, dist) pair with the nearest element to query
in terms of L2 norm and the squared L2 distance.
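The semantics of this lookup can be shown with a brute-force sketch (illustrative only; the real SimpleTrie exploits sparseness instead of scanning every stored entry):

```python
def nearest_sparse_linear(entries, query):
    # entries: list of (sparse_key, object) pairs, keys as (dim, val)
    # tuples. Returns the (object, squared L2 distance) pair for the
    # key closest to query under the Euclidean distance.
    q = dict(query)
    best_obj, best_dist = None, float("inf")
    for key, obj in entries:
        k = dict(key)
        dims = set(q) | set(k)
        dist = sum((q.get(d, 0.0) - k.get(d, 0.0)) ** 2 for d in dims)
        if dist < best_dist:
            best_obj, best_dist = obj, dist
    return best_obj, best_dist

entries = [([(0, 1.0)], "a"), ([(1, 1.0)], "b")]
print(nearest_sparse_linear(entries, [(0, 0.9)]))  # ('a', ~0.01)
```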
-
cam.sgnmt.misc.unigram module¶
This module contains classes which are able to store unigram probabilities and potentially collect them by observing a decoder instance. This can be used for heuristics.
-
class
cam.sgnmt.misc.unigram.
AllStatsUnigramTable
[source]¶ Bases:
cam.sgnmt.misc.unigram.UnigramTable
This unigram table collects statistics from all partial hypos.
Pass through to super class constructor.
-
class
cam.sgnmt.misc.unigram.
BestStatsUnigramTable
[source]¶ Bases:
cam.sgnmt.misc.unigram.UnigramTable
This unigram table collects statistics from the best full hypo.
Pass through to super class constructor.
-
class
cam.sgnmt.misc.unigram.
FileUnigramTable
(path)[source]¶ Bases:
cam.sgnmt.misc.unigram.UnigramTable
Loads a unigram table from an external file.
Loads the unigram table from
path
.
-
class
cam.sgnmt.misc.unigram.
FullStatsUnigramTable
[source]¶ Bases:
cam.sgnmt.misc.unigram.UnigramTable
This unigram table collects statistics from all full hypos.
Pass through to super class constructor.
-
class
cam.sgnmt.misc.unigram.
UnigramTable
[source]¶ Bases:
cam.sgnmt.utils.Observer
A unigram table stores unigram probabilities for a certain vocabulary. These statistics can be loaded from an external file (
FileUnigramTable
) or collected during decoding. Creates a unigram table without entries.
-
estimate
(word, default=0.0)[source]¶ Estimate the unigram score for the given word.
Parameters: - word (int) – word ID
- default (float) – Default value to be returned if
word
cannot be found in the table
Returns: float. Unigram score or
default
if word
is not in table
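The estimate lookup can be sketched with a minimal stand-in class (hypothetical, not the SGNMT implementation):

```python
class SimpleUnigramTable:
    # Illustrative stand-in for UnigramTable: maps word IDs to unigram
    # scores and falls back to a default value for unseen words.
    def __init__(self):
        self.scores = {}

    def estimate(self, word, default=0.0):
        return self.scores.get(word, default)

table = SimpleUnigramTable()
table.scores[42] = -1.5
print(table.estimate(42))        # -1.5
print(table.estimate(7))         # 0.0
print(table.estimate(7, -99.0))  # -99.0
```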
-
Module contents¶
This package contains miscellaneous routines and classes, e.g. for
preprocessing and feature extraction, which are independent of the
framework used (Blocks, TensorFlow, ...). For example, this includes rolling
out sparse features into a dense representation or searching for the
best surface form for a given attribute vector. trie
contains a
generic trie implementation, unigram
can be used for keeping
track of unigram statistics during decoding.