cam.sgnmt.misc package

Submodules

cam.sgnmt.misc.sparse module

This module adds support for sparse input and output features. In standard NMT we normally use a one-hot representation, and input and output layers are lookup tables (embedding matrices). The Blocks NMT implementation in SGNMT supports the explicit definition of word representations as sparse features, in which more than one neuron can be activated at a time.
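As a small hypothetical illustration (not part of SGNMT), a one-hot vector activates exactly one neuron per word, while a sparse feature vector, stored throughout this module as a list of (dimension, value) tuples, may activate several:

```python
vocab_size = 5

def one_hot(word_id, dim):
    # Standard one-hot representation: exactly one active neuron.
    vec = [0.0] * dim
    vec[word_id] = 1.0
    return vec

# Sparse multi-hot feature: several neurons active at once, stored
# as (dimension, value) tuples as in the rest of this module.
sparse_feat = [(0, 1.0), (3, 0.5)]

one_hot(2, vocab_size)  # [0.0, 0.0, 1.0, 0.0, 0.0]
```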

class cam.sgnmt.misc.sparse.FileBasedFeatMap(dim, path)[source]

Bases: cam.sgnmt.misc.sparse.SparseFeatMap

This class loads the mapping from word to sparse feature from a file (see --src_sparse_feat_map and --trg_sparse_feat_map). The mapping from word to feature is a simple dictionary lookup.

The mapping from feature to word is implemented with a trie-based nearest-neighbor search and does not require an exact match. In case of an exact match, it runs in time linear in the number of non-zero entries in the vector.

Loads the feature map from the file system.

Parameters:
  • dim (int) –
  • path (string) –
Raises:IOError – If the file could not be loaded

sparse2nwords(feat, n=1)[source]
sparse2word(feat)[source]
word2sparse(word)[source]
class cam.sgnmt.misc.sparse.FlatSparseFeatMap(dim=0)[source]

Bases: cam.sgnmt.misc.sparse.SparseFeatMap

Can be used as a drop-in replacement where a SparseFeatMap is required but you wish to use flat word IDs. It overrides the dense methods with identity mappings.

Parameters:dim (int) – not used
dense2sparse(dense, eps=0.3)[source]
Raise:
NotImplementedError.
dense2word(feat)[source]

Identity.

sparse2dense(sparse)[source]
Raise:
NotImplementedError.
sparse2word(feat)[source]
Raise:
NotImplementedError.
word2dense(word)[source]

Identity.

word2sparse(word)[source]
Raise:
NotImplementedError.
words2dense(seq)[source]

Identity.

class cam.sgnmt.misc.sparse.SparseFeatMap(dim)[source]

Bases: object

This is the super class for mapping strategies between sparse feature representations and symbolic word indices. The translation needs to be implemented in sparse2word and word2sparse.

Initializes this map.

Parameters:dim (int) – Dimensionality of the feature representation
dense2nwords(feat, n=1)[source]

Returns the n closest words to feat.

Parameters:
  • feat (list) – Dense feature vector
  • n (int) – Number of words to retrieve
Returns:

List of (wordid, distance) tuples with words which are close to feat.

Return type:

list

Note

The default implementation does not use the n argument and always returns distance 0.

dense2sparse(dense, eps=0.5)[source]

Converts a dense vector to a sparse vector.

Parameters:
  • dense (list) – Dense vector (list of length n)
  • eps (float) – Values smaller than this are set to zero in the sparse representation
Returns:

list. List of (dimension, value) tuples (sparse vector)
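A minimal sketch of this contract (the exact thresholding rule is an assumption based on the description above, not SGNMT's actual code):

```python
def dense2sparse(dense, eps=0.5):
    # Entries smaller than eps are treated as zero and dropped;
    # the remaining entries become (dimension, value) tuples.
    return [(i, v) for i, v in enumerate(dense) if v >= eps]

dense2sparse([0.1, 0.9, 0.0, 0.6])  # [(1, 0.9), (3, 0.6)]
```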

dense2word(feat)[source]

Gets the word id for a dense feature.

Parameters:feat (list) – Dense feature vector to look up
Returns:
int. Word ID of a match for feat, or None if no match could be found
Raises:NotImplementedError.
dense2words(seq)[source]

Applies dense2word to a sequence.

sparse2dense(sparse)[source]

Converts a sparse vector to its dense representation.

Parameters:sparse (list) – Sparse vector (list of tuples)
Raises:IndexError – If the input vector exceeds the dimensionality of this map
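A sketch of the documented behavior; dim is passed explicitly here to keep the example self-contained (in SGNMT it would be the map's dim attribute):

```python
def sparse2dense(sparse, dim):
    # Expand a list of (dimension, value) tuples into a dense
    # dim-dimensional vector; unlisted dimensions stay zero.
    dense = [0.0] * dim
    for d, v in sparse:
        if d >= dim:
            raise IndexError("input vector exceeds map dimensionality")
        dense[d] = v
    return dense

sparse2dense([(1, 2.0)], 3)  # [0.0, 2.0, 0.0]
```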
sparse2nwords(feat, n=1)[source]

Returns the n closest words to feat. Subclasses can override this method to implement an n-best word search. The default implementation returns only the single best word.

Parameters:
  • feat (list) – Sparse feature
  • n (int) – Number of words to retrieve
Returns:

List of (wordid, distance) tuples with words which are close to feat.

Return type:

list

Note

The default implementation does not use the n argument and always returns distance 0.

sparse2word(feat)[source]

Gets the word id for a sparse feature. The sparse feature format is a list of tuples [(dim1,val1),..,(dimN,valN)]

Parameters:feat (list) – Sparse feature to look up
Returns:
int. Word ID of a match for feat, or None if no match could be found
Raises:NotImplementedError.
word2dense(word)[source]

Gets the feature representation in dense format, i.e. a self.dim-dimensional vector as a list.

Parameters:word (int) – Word ID
Returns:list. Dense vector corresponding to word, or the null vector if no representation is found
word2sparse(word)[source]

Gets the sparse feature representation for a word.

Parameters:word (int) – Word ID
Returns:
list. Sparse feature representation for word or None
if the word could not be converted
Raises:NotImplementedError.
words2dense(seq)[source]

Applies word2dense to a word sequence.

class cam.sgnmt.misc.sparse.TrivialSparseFeatMap(dim)[source]

Bases: cam.sgnmt.misc.sparse.SparseFeatMap

This is the null-object (GoF) implementation for SparseFeatMap. It corresponds to the usual one-hot representation.

Pass through to SparseFeatMap.

Parameters:dim (int) – Dimensionality of the feature representation (should be the vocabulary size)
sparse2word(feat)[source]

Returns feat[0][0]

word2sparse(word)[source]

Returns [(word, 1)]

cam.sgnmt.misc.sparse.dense_euclidean(v1, v2)[source]

Calculates the Euclidean distance between two dense vectors.

Parameters:
  • v1 (dict) – First dense vector
  • v2 (dict) – Second dense vector
Returns:

float. Distance between v1 and v2.

cam.sgnmt.misc.sparse.dense_euclidean2(v1, v2)[source]

Calculates the squared Euclidean distance between two dense vectors.

Parameters:
  • v1 (dict) – First dense vector
  • v2 (dict) – Second dense vector
Returns:

float. Squared distance between v1 and v2.

cam.sgnmt.misc.sparse.sparse_euclidean(v1, v2)[source]

Calculates the Euclidean distance between two sparse vectors.

Parameters:
  • v1 (dict) – First sparse vector
  • v2 (dict) – Second sparse vector
Returns:

float. Distance between v1 and v2.

cam.sgnmt.misc.sparse.sparse_euclidean2(v1, v2)[source]

Calculates the squared Euclidean distance between two sparse vectors.

Parameters:
  • v1 (dict) – First sparse vector
  • v2 (dict) – Second sparse vector
Returns:

float. Squared distance between v1 and v2.
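The sparse-distance helpers can be sketched as follows. The docstrings above type the arguments as dict, so dimensions missing from one vector are treated as zero; this is a sketch under that assumption, not the actual SGNMT code:

```python
import math

def sparse_euclidean2(v1, v2):
    # Squared L2 distance between sparse vectors given as dicts
    # mapping dimension -> value; absent dimensions count as zero.
    dims = set(v1) | set(v2)
    return sum((v1.get(d, 0.0) - v2.get(d, 0.0)) ** 2 for d in dims)

def sparse_euclidean(v1, v2):
    # L2 distance is the square root of the squared distance.
    return math.sqrt(sparse_euclidean2(v1, v2))

sparse_euclidean2({0: 1.0, 3: 2.0}, {3: 1.0})  # 2.0
```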

cam.sgnmt.misc.trie module

This module contains SimpleTrie which is a generic trie implementation based on strings of integers.

class cam.sgnmt.misc.trie.SimpleNode[source]

Helper class representing a node in a SimpleTrie

Creates an empty node without children.

class cam.sgnmt.misc.trie.SimpleTrie[source]

This is a very simple Trie implementation. It is simpler than the one in cam.sgnmt.predictors.grammar because it does not support non-terminals or removal. The only supported operations are add and get, but those are implemented very efficiently. For many applications (e.g. the cache in the greedy heuristic) this is already enough.

The implementation also supports keys in sparse representation, in which most of the elements in the sequence are zero (see add_sparse, get_sparse, and nearest_sparse). In this case, the key is a list of tuples [(dim1,val1),...,(dimN,valN)]. Internally, we store them as the sequence "dim1 val1 dim2 val2 ...". Note that we assume that the tuples are ordered by dimension!
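The documented internal encoding of a sparse key can be illustrated with a small hypothetical helper:

```python
def flatten_sparse_key(key):
    # Flatten an ordered list of (dim, val) tuples into the internal
    # sequence "dim1 val1 dim2 val2 ..." described above.
    seq = []
    for dim, val in key:
        seq.extend([dim, val])
    return seq

flatten_sparse_key([(1, 3), (4, 2)])  # [1, 3, 4, 2]
```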

Creates an empty Trie data structure.

add(seq, element)[source]

Add an element to the Trie for the key seq. If seq already exists, the stored element is overridden.

Parameters:
  • seq (list) – Key
  • element (object) – The object to store for key seq
add_sparse(key, element)[source]

Adds an element with a key in sparse representation.

Parameters:
  • key (list) – Sparse key (list of tuples)
  • element (object) – The object to store for key seq
get(seq)[source]

Retrieve the element for a key seq.

Parameters:seq (list) – Query key
Returns:object. The element which has been added along with seq or None if the key does not exist.
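A minimal sketch of the add/get semantics described above (hypothetical class names; this is not SGNMT's implementation):

```python
class _Node:
    # A trie node: children keyed by the next integer in the sequence,
    # plus the element stored at this node (if any).
    def __init__(self):
        self.children = {}
        self.element = None

class MiniTrie:
    def __init__(self):
        self.root = _Node()

    def add(self, seq, element):
        # Walk/create the path for seq, then store (override) element.
        node = self.root
        for sym in seq:
            node = node.children.setdefault(sym, _Node())
        node.element = element

    def get(self, seq):
        # Follow the path for seq; None if the key does not exist.
        node = self.root
        for sym in seq:
            if sym not in node.children:
                return None
            node = node.children[sym]
        return node.element
```

Both operations run in time linear in the key length, independently of the number of stored elements.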
get_prefix(seq)[source]

Get the key in the Trie with the longest common prefix with seq.

Parameters:seq (list) – Query sequence
Returns:list. The longest key in the Trie which is a prefix of seq.
get_sparse(key, element)[source]

Retrieves an element with a key in sparse representation.

Parameters:key (list) – Query key in sparse representation
Returns:object. The element which has been added along with the key or None if the key does not exist.
n_nearest_sparse(query, n=1)[source]

This method returns the n elements in the Trie whose keys are closest to query in terms of Euclidean distance. The efficiency relies on sparseness: the more zeros in the vectors, the more efficient the search.

Parameters:
  • query (list) – Query key in sparse format
  • n (int) – Number of elements to retrieve
Returns:

List. List of (object, dist) pairs with the elements nearest to query in terms of the L2 norm, together with the squared L2 distance.

nearest_sparse(query)[source]

This method returns the element in the Trie with the key closest to query in terms of Euclidean distance. The efficiency relies on sparseness: the more zeros in the vectors, the more efficient the search. If the Trie contains an exact match, this method runs in time linear in the length of the query (i.e. independent of the number of elements in the Trie).

Parameters:query (list) – Query key in sparse format
Returns:Tuple. (object, dist) pair with the element nearest to query in terms of the L2 norm, together with the squared L2 distance.

cam.sgnmt.misc.unigram module

This module contains classes which are able to store unigram probabilities and potentially collect them by observing a decoder instance. This can be used for heuristics.

class cam.sgnmt.misc.unigram.AllStatsUnigramTable[source]

Bases: cam.sgnmt.misc.unigram.UnigramTable

This unigram table collects statistics from all partial hypotheses.

Pass through to super class constructor.

notify(message, message_type=1)[source]

Update unigram statistics. We assume to observe a Decoder instance. We update the unigram table if the message type is MESSAGE_TYPE_POSTERIOR.

Parameters:
  • message (object) – Message from an observable Decoder
  • message_type (int) – Message type
class cam.sgnmt.misc.unigram.BestStatsUnigramTable[source]

Bases: cam.sgnmt.misc.unigram.UnigramTable

This unigram table collects statistics from the best full hypothesis.

Pass through to super class constructor.

notify(message, message_type=1)[source]

Update unigram statistics. We assume to observe a Decoder instance. We update the unigram table if the message type is MESSAGE_TYPE_FULL_HYPO.

Parameters:
  • message (object) – Message from an observable Decoder
  • message_type (int) – Message type
reset()[source]

This is called to reset collected statistics between each sentence pair.

class cam.sgnmt.misc.unigram.FileUnigramTable(path)[source]

Bases: cam.sgnmt.misc.unigram.UnigramTable

Loads a unigram table from an external file.

Loads the unigram table from path.

reset()[source]
class cam.sgnmt.misc.unigram.FullStatsUnigramTable[source]

Bases: cam.sgnmt.misc.unigram.UnigramTable

This unigram table collects statistics from all full hypotheses.

Pass through to super class constructor.

notify(message, message_type=1)[source]

Update unigram statistics. We assume to observe a Decoder instance. We update the unigram table if the message type is MESSAGE_TYPE_FULL_HYPO.

Parameters:
  • message (object) – Message from an observable Decoder
  • message_type (int) – Message type
class cam.sgnmt.misc.unigram.UnigramTable[source]

Bases: cam.sgnmt.utils.Observer

A unigram table stores unigram probabilities for a certain vocabulary. These statistics can be loaded from an external file (FileUnigramTable) or collected during decoding.

Creates a unigram table without entries.

estimate(word, default=0.0)[source]

Estimate the unigram score for the given word.

Parameters:
  • word (int) – word ID
  • default (float) – Default value to be returned if word cannot be found in the table
Returns:

float. Unigram score or default if word is not in table
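The estimate contract amounts to a dictionary lookup with a default; a hypothetical sketch (the attribute name scores is an assumption, not SGNMT's actual field):

```python
class MiniUnigramTable:
    def __init__(self, scores):
        # scores maps word ID -> unigram score.
        self.scores = scores

    def estimate(self, word, default=0.0):
        # Return the stored score, or default if word is unknown.
        return self.scores.get(word, default)

table = MiniUnigramTable({7: -1.5})
table.estimate(7)               # -1.5
table.estimate(3, default=-99.0)  # -99.0
```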

notify(message, message_type=1)[source]

Unigram tables usually observe the decoder, but some do not process decoder messages. This is an empty implementation of notify for those subclasses.

reset()[source]

This is called to reset collected statistics between each sentence pair.

Module contents

This package contains miscellaneous routines and classes, e.g. for preprocessing and feature extraction, which are independent of the framework used (Blocks, TensorFlow, ...). For example, this includes rolling out sparse features into a dense representation or searching for the best surface form for a given attribute vector. The trie module contains a generic trie implementation; the unigram module can be used for keeping track of unigram statistics during decoding.