Predictors

Predictors are scoring modules which define a distribution over target words given the translation history and some side information, such as the source sentence. If vocabulary sizes differ among predictors, gaps are filled with the respective predictor's UNK score.
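The weighted combination with UNK backoff can be sketched as follows; `combine_scores`, the toy posteriors, and the weights are purely illustrative and not part of SGNMT's API:

```python
import math

def combine_scores(posteriors, unk_scores, weights, word):
    """Log-linear combination of predictor scores for one word.
    Words missing from a predictor's posterior back off to that
    predictor's UNK score (hypothetical helper, not SGNMT code)."""
    total = 0.0
    for post, unk, w in zip(posteriors, unk_scores, weights):
        total += w * post.get(word, unk)
    return total

# Two toy predictors with different vocabularies
p1 = {1: math.log(0.7), 2: math.log(0.3)}
p2 = {1: math.log(0.6)}  # word 2 missing -> falls back to UNK score
score = combine_scores([p1, p2], [math.log(0.01), math.log(0.01)], [0.7, 0.3], 2)
```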

Predictors are specified using the --predictors and --predictor_weights arguments, e.g.:

$ python decode.py --predictors nmt,fst,nplm --predictor_weights 0.7,0.1,0.2 ...

See the Tutorial: Basics page for examples of how to use predictors for decoding.

Available predictors

The following predictors are available:

  • nmt: neural machine translation predictor. Requires Blocks/Theano or TensorFlow.

    Options: nmt_config, nmt_path, nmt_model_selector, cache_nmt_posteriors, nmt_engine

  • t2t: Predictor for tensor2tensor models. Requires Tensor2Tensor.

    Options: t2t_usr_dir, t2t_model, t2t_problem, t2t_hparams_set, t2t_checkpoint_dir, pred_src_vocab_size, pred_trg_vocab_size

  • nizza: Nizza alignment models. Requires Nizza.

    Options: nizza_model, nizza_hparams_set, nizza_checkpoint_dir, pred_src_vocab_size, pred_trg_vocab_size

  • lexnizza: Uses Nizza lexical scores to check coverage. Requires Nizza.

    Options: nizza_model, nizza_hparams_set, nizza_checkpoint_dir, pred_src_vocab_size, pred_trg_vocab_size, lexnizza_alpha, lexnizza_beta, lexnizza_shortlist_strategies, lexnizza_max_shortlist_length

  • srilm: n-gram language model. Requires swig-srilm.

    Options: srilm_path, srilm_order

  • nplm: neural n-gram language model. Requires nplm.

    Options: nplm_path, normalize_nplm_probs

  • rnnlm: RNN language model following Zaremba et al. (2014). Requires TensorFlow.

    Options: rnnlm_config, rnnlm_path

  • forced: Forced decoding with one reference

    Options: trg_test

  • bracket: Enforces well-formed bracket expressions

    Options: syntax_pop_id, syntax_max_terminal_id, syntax_max_depth, extlength_path

  • osm: Constrains output to valid OSM sequences

    Options: None

  • forcedosm: Forced alignment with an OSM model

    Options: trg_test

  • forcedlst: Forced decoding with a Moses n-best list (n-best list rescoring)

    Options: trg_test, forcedlst_match_unk, forcedlst_sparse_feat, use_nbest_weights

  • bow: Forced decoding with one bag-of-words ref.

    Options: trg_test, heuristic_scores_file, bow_heuristic_strategies, bow_accept_subsets, bow_accept_duplicates, pred_trg_vocab_size

  • bowsearch: Forced decoding with one bag-of-words ref.

    Options: hypo_recombination, trg_test, heuristic_scores_file, bow_heuristic_strategies, bow_accept_subsets, bow_accept_duplicates, pred_trg_vocab_size

  • fst: Deterministic translation lattices

    Options: fst_path, use_fst_weights, normalize_fst_weights, fst_to_log, fst_skip_bos_weight

  • nfst: Non-deterministic translation lattices

    Options: fst_path, use_fst_weights, normalize_fst_weights, fst_to_log, fst_skip_bos_weight

  • rtn: Recurrent transition networks as created by HiFST with late expansion.

    Options: rtn_path, use_rtn_weights, minimize_rtns, remove_epsilon_in_rtns, normalize_rtn_weights

  • lrhiero: Direct Hiero (left-to-right Hiero). This is an EXPERIMENTAL implementation of LRHiero.

    Options: rules_path, grammar_feature_weights, use_grammar_weights

  • wc: Number of words feature.

    Options: wc_word, wc_nonterminal_penalty, syntax_nonterminal_ids, syntax_min_terminal_id, syntax_max_terminal_id, pred_trg_vocab_size

  • unkc: Poisson model for number of UNKs.

    Options: unk_count_lambdas, pred_trg_vocab_size

  • ngramc: For using MBR n-gram posteriors.

    Options: ngramc_path, ngramc_order

  • length: Target sentence length model.

    Options: src_test_raw, length_model_weights, use_length_point_probs

  • extlength: External target sentence lengths.

    Options: extlength_path

All predictors can be combined with one or more wrapper predictors by adding the wrapper name, separated by a _ symbol. The following wrappers are available:
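For example, a hypothetical invocation that wraps the nmt predictor with idxmap and combines it with an fst predictor might look like this (all file names and paths are placeholders):

```shell
$ python decode.py --predictors idxmap_nmt,fst \
    --src_idxmap src.imap --trg_idxmap trg.imap \
    --fst_path lattices/%d.fst ...
```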

  • parse: Internal beam search over a representation which contains some pre-defined non-terminal ids, which should not appear in the output.

    Options: parse_tok_grammar, parse_bpe_path, syntax_path, syntax_bpe_path, syntax_word_out, normalize_fst_weights, syntax_norm_alpha, syntax_internal_beam, syntax_max_internal_len, syntax_allow_early_eos, syntax_consume_ooc, syntax_terminal_restrict, syntax_internal_only, syntax_eow_ids, syntax_terminal_ids

  • idxmap: Add this wrapper to predictors which use an alternative word map.

    Options: src_idxmap, trg_idxmap

  • altsrc: This wrapper loads source sentences from an alternative source.

    Options: altsrc_test

  • ngramize: Extracts n-gram posteriors from a predictor without feedback loop.

    Options: min_ngram_order, max_ngram_order, max_len_factor

  • skipvocab: Uses internal beam search to skip a subset of the predictor vocabulary.

    Options: beam, skipvocab_max_id, skipvocab_stop_size

  • unkvocab: This wrapper explicitly excludes matching word indices higher than pred_trg_vocab_size with UNK scores.

    Options: pred_trg_vocab_size

  • fsttok: Uses an FST to transduce SGNMT tokens to predictor tokens.

    Options: fsttok_path, fsttok_max_pending_score, fst_unk_id

  • word2char: Wraps word-level predictors when SGNMT is running on character level.

    Options: word2char_map

Note that you can use multiple instances of the same predictor. For example, nmt,nmt,nmt can be used for ensembling three NMT systems. You can often override parts of the predictor configuration for subsequent predictors by appending the predictor number to the argument name (e.g. see --nmt_config2 or --fst_path2).
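For instance, a hypothetical three-system ensemble with per-system configurations might be invoked as follows (config names are placeholders, and --nmt_config3 assumes the numbering pattern extends beyond 2):

```shell
$ python decode.py --predictors nmt,nmt,nmt \
    --predictor_weights 0.4,0.3,0.3 \
    --nmt_config cfg1.yaml --nmt_config2 cfg2.yaml --nmt_config3 cfg3.yaml ...
```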

Detailed descriptions are available below in the modules.

Predictor modules

cam.sgnmt.predictors.automata module

This module encapsulates the predictor interface to OpenFST and therefore depends on OpenFST. To enable Python support in OpenFST, use a recent version (>=1.5.4) and compile with --enable-python. Further information can be found here:

http://www.openfst.org/twiki/bin/view/FST/PythonExtension

This file includes the fst, nfst, and rtn predictors.

Note: If we use arc weights in FSTs, we multiply them by -1 since everything in SGNMT is a log probability, not a negative log probability as in the log or tropical semirings used by FSTs. You can disable this behavior with --fst_to_log

Note 2: The FSTs and RTNs are assumed to contain both <S> and </S>. This is for compatibility reasons, as lattices generated by HiFST contain these symbols.

cam.sgnmt.predictors.automata.EPS_ID = 0

OpenFST’s reserved ID for epsilon arcs.

class cam.sgnmt.predictors.automata.FstPredictor(fst_path, use_weights, normalize_scores, skip_bos_weight=True, to_log=True)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor can read determinized translation lattices. The predictor state consists of the current node. This is unique as the lattices are determinized.

Creates a new fst predictor.

Parameters:
  • fst_path (string) – Path to the FST file
  • use_weights (bool) – If false, replace all arc weights with 0 (=log 1).
  • normalize_scores (bool) – If true, we normalize the weights on all outgoing arcs such that they sum up to 1
  • skip_bos_weight (bool) – Add the score at the <S> arc to the </S> arc if this is false. This results in scores consistent with OpenFST’s replace operation, as <S> scores are normally ignored by SGNMT.
  • to_log (bool) – SGNMT uses normal log probs (scores) while arc weights in FSTs normally have cost (i.e. neg. log values) semantics. Therefore, if true, we multiply arc weights by -1.
consume(word)[source]

Updates the current node by following the arc labelled with word. If there is no such arc, we set cur_node to -1, indicating that the predictor is in an invalid state. In this case, all subsequent predict_next calls will return the empty set.

Parameters:word (int) – Word on an outgoing arc from the current node
Returns:float. Weight on the traversed arc
estimate_future_cost(hypo)[source]

The FST predictor comes with its own heuristic function. We use the shortest path in the fst as future cost estimator.

get_state()[source]

Returns the current node.

get_unk_probability(posterior)[source]

Always returns negative infinity: Words outside the translation lattice are not possible according to this predictor.

Returns:float. Negative infinity
initialize(src_sentence)[source]

Loads the FST from the file system and consumes the start of sentence symbol.

Parameters:src_sentence (list) – Not used
initialize_heuristic(src_sentence)[source]

Creates a matrix of shortest distances between nodes.

is_equal(state1, state2)[source]

Returns true if the current node is the same

predict_next()[source]

Uses the outgoing arcs from the current node to build up the scores for the next word.

Returns:dict. Set of words on outgoing arcs from the current node together with their scores, or an empty set if we currently have no active node or fst.
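The consume/predict_next behaviour over a deterministic lattice can be sketched on a toy dict-based structure (illustrative only; the real predictor walks OpenFST arcs):

```python
# Toy deterministic lattice: node -> {word_id: (next_node, arc_weight)}
# (hypothetical structure standing in for OpenFST arcs)
ARCS = {
    0: {4: (1, -0.1), 5: (2, -2.3)},
    1: {2: (3, 0.0)},  # word 2 plays the role of </S> here
}

def predict_next(cur_node):
    """Scores of words on outgoing arcs; empty dict if no active node."""
    if cur_node < 0 or cur_node not in ARCS:
        return {}
    return {word: weight for word, (_, weight) in ARCS[cur_node].items()}

def consume(cur_node, word):
    """Follow the arc labelled `word`; -1 marks the invalid state."""
    arc = ARCS.get(cur_node, {}).get(word)
    return arc[0] if arc else -1
```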
set_state(state)[source]

Sets the current node.

class cam.sgnmt.predictors.automata.NondeterministicFstPredictor(fst_path, use_weights, normalize_scores, skip_bos_weight=True, to_log=True)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor can handle non-deterministic translation lattices. In contrast to the fst predictor for deterministic lattices, we store a set of nodes which are all reachable from the start node through the current history.

Creates a new nfst predictor.

Parameters:
  • fst_path (string) – Path to the FST file
  • use_weights (bool) – If false, replace all arc weights with 0 (=log 1).
  • normalize_scores (bool) – If true, we normalize the weights on all outgoing arcs such that they sum up to 1
  • skip_bos_weight (bool) – If true, set weights on <S> arcs to 0 (= log1)
  • to_log (bool) – SGNMT uses normal log probs (scores) while arc weights in FSTs normally have cost (i.e. neg. log values) semantics. Therefore, if true, we multiply arc weights by -1.
consume(word)[source]

Updates the current nodes by searching for all nodes which are reachable from the current nodes by a path consisting of any number of epsilons and exactly one word label. If there is no such arc, we set the predictor in an invalid state. In this case, all subsequent predict_next calls will return the empty set.

Parameters:word (int) – Word on an outgoing arc from the current node
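The reachability rule above (any number of epsilons plus exactly one word label) can be sketched on a toy non-deterministic lattice; the dict-based structure is illustrative, not the OpenFST-backed implementation:

```python
EPS_ID = 0  # OpenFST's reserved epsilon label

# Toy non-deterministic lattice: node -> list of (label, next_node)
ARCS = {
    0: [(EPS_ID, 1), (7, 2)],
    1: [(7, 3)],
}

def epsilon_closure(nodes):
    """All nodes reachable from `nodes` via epsilon arcs only."""
    stack, closure = list(nodes), set(nodes)
    while stack:
        node = stack.pop()
        for label, dest in ARCS.get(node, []):
            if label == EPS_ID and dest not in closure:
                closure.add(dest)
                stack.append(dest)
    return closure

def consume(cur_nodes, word):
    """Nodes reachable by any number of epsilons plus one `word` arc."""
    reached = set()
    for node in epsilon_closure(cur_nodes):
        for label, dest in ARCS.get(node, []):
            if label == word:
                reached.add(dest)
    return reached  # empty set marks the invalid state
```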
estimate_future_cost(hypo)[source]

The FST predictor comes with its own heuristic function. We use the shortest path in the fst as future cost estimator.

get_state()[source]

Returns the set of current nodes

get_unk_probability(posterior)[source]

Always returns negative infinity: Words outside the translation lattice are not possible according to this predictor.

Returns:float. Negative infinity
initialize(src_sentence)[source]

Loads the FST from the file system and consumes the start of sentence symbol.

Parameters:src_sentence (list) – Not used
initialize_heuristic(src_sentence)[source]

Creates a matrix of shortest distances between all nodes

is_equal(state1, state2)[source]

Returns true if the current nodes are the same

predict_next()[source]

Uses the outgoing arcs from all current nodes to build up the scores for the next word. This method does not follow epsilon arcs: consume updates cur_nodes such that all reachable arcs with word ids are connected directly with a node in cur_nodes. If there are multiple arcs with the same word, we use the log sum of the arc weights as score.

Returns:dict. Set of words on outgoing arcs from the current node together with their scores, or an empty set if we currently have no active nodes or fst.
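The log-sum merging of duplicate arc labels might look like this (a sketch using numpy's logaddexp, not SGNMT's code):

```python
import math
import numpy as np

def merge_arc_scores(arcs):
    """If several outgoing arcs carry the same word, score the word
    with the log-sum of the individual arc log weights."""
    scores = {}
    for word, weight in arcs:
        if word in scores:
            scores[word] = np.logaddexp(scores[word], weight)
        else:
            scores[word] = weight
    return scores

# Word 4 appears on two arcs; its merged score is log(0.2 + 0.3)
posterior = merge_arc_scores(
    [(4, math.log(0.2)), (4, math.log(0.3)), (5, math.log(0.5))])
```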
set_state(state)[source]

Sets the set of current nodes

class cam.sgnmt.predictors.automata.RtnPredictor(rtn_path, use_weights, normalize_scores, to_log=True, minimize_rtns=False, rmeps=True)[source]

Bases: cam.sgnmt.predictors.core.Predictor

Predictor for RTNs (recurrent transition networks). This predictor assumes a directory structure as produced by HiFST. You can use this predictor for non-deterministic lattices too. This implementation supports late expansion: RTNs are only expanded as far as necessary to retrieve all currently reachable states.

cur_nodes contains the accumulated weights from the last consumed word (if ambiguous, the largest)

This implementation does not maintain a list of active nodes like the other automata predictors. Instead, we store the current history and search for the active nodes at each expansion. This is more expensive, but fstreplace might change state IDs so a list of active nodes might get corrupted.

Note that this predictor does not support FSTs in gzip format.

Creates a new RTN predictor.

Parameters:
  • rtn_path (string) – Path to the RTN directory
  • use_weights (bool) – If false, replace all arc weights with 0 (=log 1).
  • normalize_scores (bool) – If true, we normalize the weights on all outgoing arcs such that they sum up to 1
  • to_log (bool) – SGNMT uses normal log probs (scores) while arc weights in FSTs normally have cost (i.e. neg. log values) semantics. Therefore, if true, we multiply arc weights by -1.
  • minimize_rtns (bool) – Minimize the FST after each replace operation
  • rmeps (bool) – Remove epsilons in the FST after each replace operation
add_to_label_fst_map_recursive(label_fst_map, visited_nodes, root_node, acc_weight, history, func)[source]

Adds arcs to label_fst_map if they are labeled with an NT symbol and reachable from root_node via history.

Note: visited_nodes is maintained for each history separately

consume(word)[source]

Adds word to the current history.

expand_rtn(func)[source]

This method expands the RTN as far as necessary. This means that the RTN is expanded s.t. we can build the posterior for cur_history. In practice, this means that we follow all epsilon edges and replaces all NT edges until all paths with the prefix cur_history in the RTN have at least one more terminal token. Then, we apply func to all reachable nodes.

get_state()[source]

Returns the current history.

get_sub_fst(fst_id)[source]

Load sub fst from the file system or the cache

get_unk_probability(posterior)[source]

Always returns negative infinity: Words outside the RTN are not possible according to this predictor.

Returns:float. Negative infinity
initialize(src_sentence)[source]

Loads the root RTN and consumes the start of sentence symbol.

Parameters:src_sentence (list) – Not used
is_nt_label(label)[source]

Returns true if label is a non-terminal.

predict_next()[source]

Expands RTN as far as possible and uses the outgoing edges from nodes reachable by the current history to build up the posterior for the next word. If there are no such nodes or arcs, or no root FST is loaded, return the empty set.

set_state(state)[source]

Sets the current history.

cam.sgnmt.predictors.blocks_nmt module

This is the only module outside the blocks package with a dependency on the Blocks framework. It contains the neural machine translation predictor nmt. Code is partially taken from the neural machine translation example in Blocks.

https://github.com/mila-udem/blocks-examples/tree/master/machine_translation

Note that using this predictor slows down decoding compared to the original NMT decoding because search cannot be parallelized. However, it is much more flexible as it can be combined with other predictors.

class cam.sgnmt.predictors.blocks_nmt.BlocksNMTPredictor(nmt_model_path, gnmt_beta, enable_cache, config)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This is the neural machine translation predictor. The predicted posteriors are equal to the distribution generated by the decoder network in NMT. This predictor heavily relies on the NMT example in blocks. Note that this predictor cannot be used in combination with a target side sparse feature map. See BlocksUnboundedNMTPredictor for that case.

Creates a new NMT predictor.

Parameters:
  • nmt_model_path (string) – Path to the NMT model file (.npz)
  • gnmt_beta (float) – If greater than 0.0, add a Google NMT style coverage penalization term (Wu et al., 2016) to the predictive scores
  • enable_cache (bool) – The NMT predictor usually has a very limited vocabulary size, and a large number of UNKs in hypotheses. This enables reusing already computed predictor states for hypotheses which differ only by NMT OOV words.
  • config (dict) – NMT configuration
Raises:

ValueError. If a target sparse feature map is defined

consume(word)[source]

Feeds back word to the decoder network. This includes embedding word, running the attention network, and updating the recurrent decoder layer.

get_state()[source]

The NMT predictor state consists of the decoder network state, and (for caching) the current history of consumed words

get_unk_probability(posterior)[source]

Returns the UNK probability defined by NMT.

initialize(src_sentence)[source]

Runs the encoder network to create the source annotations for the source sentence. If the cache is enabled, empty the cache.

Parameters:src_sentence (list) – List of word ids without <S> and </S> which represent the source sentence.
is_equal(state1, state2)[source]

Returns true if the history is the same

is_history_cachable()[source]

Returns true if cache is enabled and history contains UNK

predict_next()[source]

Uses cache or runs the decoder network to get the distribution over the next target words.

Returns:np array. Full distribution over the entire NMT vocabulary for the next target token.
set_state(state)[source]

Set the NMT predictor state.

set_up_predictor(nmt_model_path)[source]

Initializes the predictor with the given NMT model. Code following blocks.machine_translation.main.

class cam.sgnmt.predictors.blocks_nmt.BlocksUnboundedNMTPredictor(nmt_model_path, gnmt_beta, config)[source]

Bases: cam.sgnmt.predictors.blocks_nmt.BlocksNMTPredictor, cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

This is a version of the NMT predictor which assumes an unbounded vocabulary. Therefore, this predictor can only be used when other predictors (like fst) define the words to score. Using this predictor is mandatory when a target sparse feature map is provided.

Creates a new NMT predictor with unbounded vocabulary.

Parameters:
  • nmt_model_path (string) – Path to the NMT model file (.npz)
  • config (dict) – NMT configuration
consume(word)[source]

Feeds back word to the decoder network. This includes embedding word, running the attention network, and updating the recurrent decoder layer.

get_unk_probability(posterior)[source]

Returns negative infinity as this is an unbounded predictor.

is_equal(state1, state2)[source]

Returns true if the history is the same

predict_next(words)[source]

Uses cache or runs the decoder network to get the distribution over the next target words.

Returns:Scores for the next target token, restricted to the requested words.
set_up_predictor(nmt_model_path)[source]

Initializes the predictor with the given NMT model. Code following blocks.machine_translation.main.

class cam.sgnmt.predictors.blocks_nmt.MyopticSearch(samples)[source]

Bases: blocks.search.BeamSearch

This class hacks into the Blocks beam search to reuse its initialization routines. Note that this has nothing to do with SGNMT's high level decoding in cam.sgnmt.decoding. We basically replace search() with single_step_decoding(), which generates the posteriors for the next word. Thus, it fits into the predictor framework. We try to use BeamSearch functionality wherever possible.

Calls the BeamSearch constructor

class cam.sgnmt.predictors.blocks_nmt.MyopticSparseSearch(samples, trg_sparse_feat_map)[source]

Bases: cam.sgnmt.blocks.sparse_search.SparseBeamSearch

Variant of MyopticSearch for target side sparse features.

Calls the SparseBeamSearch constructor

cam.sgnmt.predictors.bow module

cam.sgnmt.predictors.core module

This module contains the two basic predictor interfaces for bounded and unbounded vocabulary predictors.

class cam.sgnmt.predictors.core.Predictor[source]

Bases: cam.sgnmt.utils.Observer

A predictor produces the predictive probability distribution of the next word given the state of the predictor. The state may change during predict_next() and consume(). The functions get_state() and set_state() can be used for non-greedy decoding. Note: The state describes the predictor with the current history. It does not encapsulate the current source sentence, i.e. you cannot recover a predictor state if initialize() was called in between. predict_next() and consume() must be called alternately. This holds even when using get_state() and set_state(): Loading/saving states is transparent to the predictor instance.

Initializes current_sen_id with 0.
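A minimal standalone sketch of the interface contract described above (a real predictor would subclass cam.sgnmt.predictors.core.Predictor; ConstantPredictor and its toy behaviour are hypothetical):

```python
class ConstantPredictor:
    """Sketch of the Predictor interface: always predicts word 2.
    Illustrates the initialize/predict_next/consume/state contract."""

    def initialize(self, src_sentence):
        self.history = []  # reset per-sentence state

    def predict_next(self):
        return {2: 0.0}  # log prob 1 for word 2

    def consume(self, word):
        self.history.append(word)  # extend the history

    def get_unk_probability(self, posterior):
        return float("-inf")  # words outside the posterior are impossible

    # get_state()/set_state() make non-greedy search possible
    def get_state(self):
        return list(self.history)

    def set_state(self, state):
        self.history = list(state)

pred = ConstantPredictor()
pred.initialize([3, 4])
pred.consume(2)
```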

consume(word)[source]

Expand the current history by word and update the internal predictor state accordingly. Two calls of consume() must be separated by a predict_next() call.

Parameters:word (int) – Word to add to the current history
estimate_future_cost(hypo)[source]

Predictors can implement their own look-ahead cost functions. They are used in A* search if the --heuristics parameter is set to predictor. This function should return the future log cost (i.e. the lower the better) given the current predictor state, assuming that the last word in the partial hypothesis hypo is consumed next. This function must not change the internal predictor state.

Parameters:hypo (PartialHypothesis) – Hypothesis for which to estimate the future cost given the current predictor state
Returns
float. Future cost
finalize_posterior(scores, use_weights, normalize_scores)[source]

This method can be used to enforce the parameters use_weights and normalize_scores in predictors with dict posteriors.

Parameters:
  • scores (dict) – unnormalized log valued scores
  • use_weights (bool) – Set to false to replace all values in scores with 0 (= log 1)
  • normalize_scores – Set to true to make the exp of elements in scores sum up to 1
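A sketch of the described semantics (not SGNMT's implementation):

```python
import math

def finalize_posterior(scores, use_weights, normalize_scores):
    """Mimics the documented behaviour: zero out scores if weights
    are disabled, otherwise optionally renormalize in log space."""
    if not use_weights:
        return {w: 0.0 for w in scores}  # all scores become log 1
    if normalize_scores:
        total = math.log(sum(math.exp(s) for s in scores.values()))
        return {w: s - total for w, s in scores.items()}
    return dict(scores)

post = finalize_posterior({4: math.log(0.2), 5: math.log(0.6)}, True, True)
```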
get_state()[source]

Get the current predictor state. The state can be any object or tuple of objects which makes it possible to return to the predictor state with the current history.

Returns:object. Predictor state
get_unk_probability(posterior)[source]

This function defines the probability of all words which are not in posterior. This is usually used to combine open and closed vocabulary predictors. The argument posterior should have been produced with predict_next().

Parameters:posterior (list,array,dict) – Return value of the last call of predict_next
Returns:Score to use for words outside posterior
Return type:float
initialize(src_sentence)[source]

Initialize the predictor with the given source sentence. This resets the internal predictor state and loads everything which is constant throughout the processing of a single source sentence. For example, the NMT decoder runs the encoder network and stores the source annotations.

Parameters:src_sentence (list) – List of word IDs which form the source sentence without <S> or </S>
initialize_heuristic(src_sentence)[source]

This is called after initialize() if the predictor is registered as heuristic predictor (i.e. estimate_future_cost() will be called in the future). Predictors can implement this function for initialization of their own heuristic mechanisms.

Parameters:src_sentence (list) – List of word IDs which form the source sentence without <S> or </S>
is_equal(state1, state2)[source]

Returns true if two predictor states are equal, i.e. both states will always result in the same scores. This is used for hypothesis recombination

Parameters:
  • state1 (object) – First predictor state
  • state2 (object) – Second predictor state
Returns:

bool. True if both states are equal, false if not

notify(message, message_type=1)[source]

We implement the notify method from the Observer super class with an empty method here s.t. predictors do not need to implement it.

Parameters:message (object) – The posterior sent by the decoder
predict_next()[source]

Returns the predictive distribution over the target vocabulary for the next word given the predictor state. Note that the prediction itself can change the state of the predictor. For example, the neural predictor updates the decoder network state and its attention to predict the next word. Two calls of predict_next() must be separated by a consume() call.

Returns:dictionary,array,list. Word log probabilities for the next target token. All ids which are not set are assumed to have probability get_unk_probability()
set_current_sen_id(cur_sen_id)[source]

This function is called between initialize() calls to increment the sentence id counter. It can also be used to skip sentences for the --range argument.

Parameters:cur_sen_id (int) – Sentence id for the next call of initialize()
set_state(state)[source]

Loads a predictor state from an object created with get_state(). Note that this does not copy the argument but just references the given state. If state is going to be used in the future to return to that point again, you should copy the state with copy.deepcopy() before.

Parameters:state (object) – Predictor state as returned by get_state()
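A small illustration of why mutable states must be copied before reuse (the list below stands in for a mutable predictor state):

```python
import copy

hist = [1, 5, 7]                # stand-in for a mutable predictor state
snapshot = copy.deepcopy(hist)  # safe to return to later
hist.append(9)                  # the predictor keeps mutating its state
# `snapshot` is unaffected, so set_state(snapshot) would restore
# the earlier point; a plain reference would have been corrupted.
```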
class cam.sgnmt.predictors.core.UnboundedVocabularyPredictor[source]

Bases: cam.sgnmt.predictors.core.Predictor

Predictors under this class implement models with very large target vocabularies, for which it is too inefficient to list the entire posterior. Instead, they are evaluated only for a given list of target words. This list is usually created by taking all non-zero probability words from the bounded vocabulary predictors. An example of an unbounded vocabulary predictor is the ngram predictor: instead of listing the entire n-gram vocabulary, we run srilm only on the words which are possible according to the other predictors (e.g. fst or nmt). This is realized by introducing the trgt_words argument to predict_next.

Initializes current_sen_id with 0.

predict_next(trgt_words)[source]

Like in Predictor, returns the predictive distribution over target words given the predictor state. Note that the prediction itself can change the state of the predictor. For example, the neural predictor updates the decoder network state and its attention to predict the next word. Two calls of predict_next() must be separated by a consume() call.

Parameters:trgt_words (list) – List of target word ids.
Returns:dictionary,array,list. Word log probabilities for the next target token. All ids which are not set are assumed to have probability get_unk_probability(). The returned set should not contain any ids which are not in trgt_words, but it does not have to score all of them
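Unbounded-vocabulary scoring can be sketched with a toy score table (NGRAM_LM and the word ids are made up):

```python
import math

# Toy language model scores: word id -> log probability
NGRAM_LM = {1: math.log(0.5), 2: math.log(0.2), 3: math.log(0.3)}

def predict_next(trgt_words):
    """Score only the requested ids, unbounded-vocabulary style.
    Ids without an entry are left to get_unk_probability()."""
    return {w: NGRAM_LM[w] for w in trgt_words if w in NGRAM_LM}

posterior = predict_next([2, 3, 99])  # 99 falls back to the UNK score
```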

cam.sgnmt.predictors.ffnnlm module

This module integrates neural language models, for example feed-forward language models like NPLM. It depends on the Python interface to NPLM.

http://nlg.isi.edu/software/nplm/

class cam.sgnmt.predictors.ffnnlm.NPLMPredictor(path, normalize_scores)[source]

Bases: cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

NPLM language model predictor. Even though NPLM normally has a limited vocabulary size, we implement it as an unbounded vocabulary predictor because it is more efficient to score only a subset of the vocabulary. This predictor uses the Python interface to NPLM from

http://nlg.isi.edu/software/nplm/

Creates a new NPLM predictor instance.

Parameters:
  • path (string) – Path to the NPLM model file
  • normalize_scores (bool) – Whether to renormalize scores s.t. scores returned by predict_next sum up to 1
Raises:

NameError. If NPLM is not installed

consume(word)[source]

Extend current history by word

get_state()[source]

Returns the current history

get_unk_probability(posterior)[source]

Use NPLM UNK score if exists

initialize(src_sentence)[source]

Set the n-gram history to initial value.

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Returns true if the ngram history is the same

predict_next(words)[source]

Scores the words in words using NPLM.

set_state(state)[source]

Sets the current history

cam.sgnmt.predictors.forced module

This module contains predictors for forced decoding. This can be done either with one reference (forced, ForcedPredictor) or with multiple references in the form of an n-best list (forcedlst, ForcedLstPredictor).

class cam.sgnmt.predictors.forced.ForcedLstPredictor(trg_test_file, use_scores=True, match_unk=False, feat_name=None)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor can be used for direct n-best list rescoring. In contrast to the ForcedPredictor, it reads an n-best list in Moses format and uses its scores as predictive probabilities of the </S> symbol. Everywhere else it gives the predictive probability 1 if the history corresponds to at least one n-best list entry, 0 otherwise. From the n-best list we use the first column (sentence id), the second column (hypothesis in integer format), and the last column (score).

Note: Behavior is undefined if you have duplicates in the n-best list

TODO: Would be much more efficient to use Tries for cur_trgt_sentences instead of a flat list.
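The column layout described above could be parsed like this (a sketch; parse_moses_nbest_line and the sample line are illustrative, not part of SGNMT):

```python
def parse_moses_nbest_line(line):
    """Split a Moses-format n-best entry of the shape
    `id ||| hypothesis ||| features ||| score` into its columns."""
    fields = [f.strip() for f in line.split("|||")]
    sen_id = int(fields[0])                       # first column
    hypo = [int(tok) for tok in fields[1].split()]  # second column
    score = float(fields[-1])                     # last column
    return sen_id, hypo, score

sen_id, hypo, score = parse_moses_nbest_line(
    "0 ||| 12 7 4 ||| lm= -2.1 tm= -3.0 ||| -5.1")
```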

Creates a new n-best rescoring predictor instance.

Parameters:
  • trg_test_file (string) – Path to the n-best list
  • use_scores (bool) – Whether to use the scores from the n-best list. If false, use uniform scores of 0 (=log 1).
  • match_unk (bool) – If true, allow any word where the n-best list contains UNK.
  • feat_name (string) – Instead of the combined score in the last column of the Moses n-best list, we can use one of the sparse features. Set this to the name of the feature (denoted as <name>= in the n-best list) if you wish to do that.
consume(word)[source]

Extends the current history by word.

get_state()[source]

Returns the current history.

get_unk_probability(posterior)[source]

Return negative infinity unconditionally - words outside the n-best list are not possible according to this predictor.

initialize(src_sentence)[source]

Resets the history and loads the n-best list entries for the next source sentence

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Returns true if the history is the same

predict_next()[source]

Outputs 0.0 (i.e. prob=1) for all words for which there is a continuing entry in cur_trg_sentences, and the score stored in cur_trg_sentences if the current history is by itself equal to an entry in cur_trg_sentences.

TODO: The implementation here is fairly inefficient as it scans through all target sentences linearly. Would be better to organize the target sentences in a Trie

set_state(state)[source]

Sets the current history.

class cam.sgnmt.predictors.forced.ForcedPredictor(trg_test_file)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor realizes forced decoding. It stores one target sentence for each source sentence and outputs predictive probability 1 along this path, and 0 otherwise.

Creates a new forced decoding predictor.

Parameters:trg_test_file (string) – Path to the plain text file with the target sentences. Must have the same number of lines as the number of source sentences to decode
consume(word)[source]

If word matches the target sentence, we extend the current history by it. Otherwise, we put this predictor into an invalid state in which it always predicts </S>

Parameters:word (int) – Next word to consume
get_state()[source]

cur_trg_sentence can be changed, so it is part of the predictor state

get_unk_probability(posterior)[source]

Returns negative infinity unconditionally: Words which are not in the target sentence have assigned probability 0 by this predictor.

initialize(src_sentence)[source]

Fetches the corresponding target sentence and resets the current history.

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Returns true if the state is the same

predict_next()[source]

Returns a dictionary with one entry and value 0 (=log 1). The key is either the next word in the target sentence or (if the target sentence has no more words) the end-of-sentence symbol.

set_state(state)[source]

Set the predictor state.
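The contract above can be sketched in a few lines (a minimal illustration, not the actual implementation; the EOS id and class name are assumptions):

```python
NEG_INF = float("-inf")


class ForcedSketch:
    """Minimal sketch of forced decoding: follow one reference, else go invalid."""

    EOS = 2  # assumed end-of-sentence id

    def __init__(self, trg_sentence):
        self.trg_sentence = trg_sentence + [self.EOS]
        self.n_consumed = 0  # length of the matched history

    def predict_next(self):
        # Exactly one word has log prob 0 (= prob 1): the next reference
        # word, or EOS once the reference is exhausted or the state invalid.
        if self.n_consumed < len(self.trg_sentence):
            return {self.trg_sentence[self.n_consumed]: 0.0}
        return {self.EOS: 0.0}

    def consume(self, word):
        if (self.n_consumed < len(self.trg_sentence)
                and word == self.trg_sentence[self.n_consumed]):
            self.n_consumed += 1
        else:
            # Invalid state: from now on, only EOS is predicted
            self.n_consumed = len(self.trg_sentence)

    def get_unk_probability(self, posterior):
        return NEG_INF  # words outside the reference have probability 0
```
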

cam.sgnmt.predictors.grammar module

This module contains everything related to the hiero predictor. This predictor allows applying rules from a syntactical SMT system directly in SGNMT. The main interface is RuleXtractPredictor, which can be used like other predictors during decoding. The Hiero predictor follows the LRHiero implementation from

https://github.com/sfu-natlang/lrhiero

Efficient Left-to-Right Hierarchical Phrase-based Translation with Improved Reordering. Maryam Siahbani, Baskaran Sankaran and Anoop Sarkar. EMNLP 2013. Oct 18-21, 2013. Seattle, USA.

However, note that we modified the code to a) deal with an arbitrary number of non-terminals, b) work with ruleXtract, and c) allow spurious ambiguity.

ATTENTION: This implementation is experimental!!

class cam.sgnmt.predictors.grammar.Cell(init_hypo=None)[source]

Comparable to a CYK cell: a set of hypotheses. If duplicates are added, we perform hypothesis combination by combining the costs and retaining only one of them. Internally, the hypotheses are stored in a list sorted by the sum of the translation prefix.

Creates a new Cell with only one hypothesis.

Parameters:init_hypo (LRHieroHypothesis) – Initial hypothesis
add(hypo)[source]

Add a new hypothesis to the cell. If an equivalent hypothesis already exists, combine both hypotheses.

Parameters:hypo (LRHieroHypothesis) – Hypothesis to add under the key hypo.key
filter(pos, symb)[source]

Remove all hypotheses which do not have symb at pos in their trgt_prefix. Breaks if pos is out of range for some trgt_prefix

findIdx(key, a, b)[source]

Find the index of the first element with the given key. If there is no such key, return the index of the last element with the largest key smaller than key. This is a recursive function which only searches the interval [a, b].
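A stand-alone sketch of such a recursive search over a sorted key list (illustrative only; the actual method searches the hypotheses stored in the cell):

```python
def find_idx(keys, key, a, b):
    """Recursive binary search in the sorted list keys[a..b] (inclusive).

    Returns the index of the first element equal to key, or, if key is
    absent, the index of the last element with the largest key smaller
    than key (falling back to a if no smaller key exists).
    """
    if a >= b:
        # Fall back to the predecessor if we overshot the key
        if keys[a] > key and a > 0:
            return a - 1
        return a
    mid = (a + b) // 2
    if keys[mid] < key:
        return find_idx(keys, key, mid + 1, b)
    return find_idx(keys, key, a, mid)
```
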

pop()[source]

Removes a hypothesis from the cell.

Returns:LRHieroHypothesis. The removed hypothesis
class cam.sgnmt.predictors.grammar.LRHieroHypothesis(trgt_prefix, spans, cost)[source]

Represents a LRHiero hypothesis, which is defined by the accumulated cost, the target prefix, and open source spans.

Creates a new LRHiero hypothesis

Parameters:
  • trgt_prefix (list) – Target side translation prefix, i.e. the partial target sentence which is translated so far
  • spans (list) – List of spans which are not covered yet, in left-to-right order on target side
  • cost (float) – Cost of this partial hypothesis
is_final()[source]

Returns true if this hypothesis has no open spans

class cam.sgnmt.predictors.grammar.Node[source]

Represents a node in the Trie.

class cam.sgnmt.predictors.grammar.Rule(rhs_src, rhs_trgt, trgt_src_map, cost)[source]

A rule consists of rhs_src and rhs_trgt, both are sequences of integers. NTs are indicated with negative sign. The trgt_src_map defines which NT on the target side belongs to which NT on the source side.

Creates a new rule.

Parameters:
  • rhs_src (list) – Source on the right hand side of the rule
  • rhs_trgt (list) – Target on the right hand side of the rule
  • trgt_src_map (dict) – Defines which NT on the target side belongs to which NT on the source side
last_id = 0
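To illustrate the encoding (the word ids and the keying convention of trgt_src_map are hypothetical here; the exact convention is defined by this module):

```python
# Hypothetical word ids: 10="der", 20="the"; NTs carry a negative sign.
rhs_src = [-1, 10, -2]    # source:  X1 der X2
rhs_trgt = [-1, 20, -2]   # target:  X1 the X2

# Map each target-side NT to its source-side counterpart (keyed here by
# position in rhs_trgt; an assumption for illustration).
trgt_src_map = {0: 0, 2: 2}


def nt_positions(rhs):
    """Positions of non-terminals, i.e. entries with negative sign."""
    return [i for i, sym in enumerate(rhs) if sym < 0]
```
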
class cam.sgnmt.predictors.grammar.RuleSet[source]

This class stores the set of rules and provides efficient retrieval and matching functionality

Initializes the set by setting up the trie data structure for storing the rules.

INF = 10000
create_rule(rhs_src, rhs_trgt, weight)[source]

Creates a rule object (factory method)

Parameters:
  • rhs_src (list) – String sequence describing the source of the right-hand-side of the rule
  • rhs_trgt (list) – String sequence describing the target of the right-hand-side of the rule
  • weight (float) – Rule weight
Returns:

Rule or None if something went wrong

expand_hypo(hypo, src_seq)[source]

Combines getSpanRules() and GrowHypothesis() from Alg. 1 in (Siahbani, 2013). Gets all rules which match the given span.

  • If the p parameter of the span is a single non-terminal, we return hypotheses resulting from productions of this non-terminal. Note that rules might be applicable in many different ways: X -> A the B can be applied to foo the bar the baz in two ways. In this case, we add the translation prefix, but leave the borders of the span untouched, and change the p value to the rhs of the production (i.e. “A the B”). If p consists of multiple characters, the spans store the minimum and maximum length, not the begin and end, since the exact begin and end positions are variable.
  • If the p parameter of the span has length > 1, we return a set of hypotheses in which the first subspan has a single NT as p parameter.

Through this contract we can handle e.g. spurious ambiguity when two NTs are on the source side. However, resolving this ambiguity is implemented lazily: we delay fixing the span boundaries until we need to expand the hypothesis again, and then we fix only the boundaries of the first span.

Parameters:
  • hypo (LRHieroHypothesis) – Hypothesis to expand
  • src_seq (list) – Source sequence to match
parse(line, feature_weights=None)[source]

Parse a line in a rule file from ruleXtract and add the rule to the set.

Parameters:
  • line (string) –
  • feature_weights (list) – score or None to use uniform weights
update_span_len_range()[source]

This method updates the span_len_range variable by finding boundaries for the spans each non-terminal can cover. This is done iteratively: First, initialize the range for each NT to (0, inf). Then, iterate through all rules for a specific NT and adjust the boundaries given the ranges for all other NTs. Do this until the ranges do not change anymore. This is an expensive operation and should be done after adding all rules. Note also that the tries store a reference to self.span_len_range, i.e. the variable is propagated to all tries automatically.
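The fixed-point iteration can be sketched as a stand-alone function (a simplified version assuming every non-terminal has at least one non-recursive rule, so the minimum lengths converge):

```python
INF = 10000


def update_span_len_range(rules_by_nt):
    """Fixed-point computation of (min, max) source span lengths per NT.

    rules_by_nt maps each NT id (a negative int) to a list of source
    right-hand sides (lists of ints, NTs negative, terminals >= 0).
    """
    ranges = {nt: (0, INF) for nt in rules_by_nt}
    changed = True
    while changed:
        changed = False
        for nt, rhss in rules_by_nt.items():
            # A terminal covers exactly 1 token; an NT covers its range.
            lo = min(sum(1 if s >= 0 else ranges[s][0] for s in rhs)
                     for rhs in rhss)
            hi = min(INF, max(sum(1 if s >= 0 else ranges[s][1] for s in rhs)
                              for rhs in rhss))
            if (lo, hi) != ranges[nt]:
                ranges[nt] = (lo, hi)
                changed = True
    return ranges
```
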

class cam.sgnmt.predictors.grammar.RuleXtractPredictor(ruleXtract_path, use_weights, feature_weights=None)[source]

Bases: cam.sgnmt.predictors.core.Predictor

Predictor based on ruleXtract rules. Bins are organized according to the number of target words. We assume that no rule produces the empty word on the source side (but possibly on the target side). Hypotheses are produced iteratively s.t. the following invariant holds: The bins contain a set of (partial) hypotheses from which we can derive all full hypotheses which are consistent with the current target prefix (i.e. the prefix of the target sentence which has already been translated). This set is updated when calling either consume_word or predict_next: consume_word deletes all hypotheses which become inconsistent with the new word. predict_next requires all hypotheses to have a target_prefix length of at least one plus the number of consumed words. Therefore, predict_next expands hypotheses as long as they are shorter. This fits nicely with grouping hypotheses in bins of the same target prefix length: we expand until all low-rank bins are empty. We predict the next target word by using the cost of the best hypothesis with the word at the right position.

Note that this predictor is similar to the decoding algorithm in

Efficient Left-to-Right Hierarchical Phrase-based Translation with Improved Reordering. Maryam Siahbani, Baskaran Sankaran and Anoop Sarkar. EMNLP 2013. Oct 18-21, 2013. Seattle, USA.

without cube pruning, but it is extended to an arbitrary number of non-terminals as produced with ruleXtract.

Creates a new hiero predictor.

Parameters:
  • ruleXtract_path (string) – Path to the rules file
  • use_weights (bool) – If false, set all hypothesis scores uniformly to 0 (= log 1). If true, use the rule weights to compute hypothesis scores
  • feature_weights (list) – Rule feature weights to compute the rule scores. If this is none we use uniform weights
build_posterior()[source]

We need to scan all hypotheses in self.stacks and add up scores grouped by the symbol at the n_consumed+1-th position. Then, we add end-of-sentence probability by checking self.finals[n_consumed]
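A stand-alone sketch of this grouping (assuming hypothesis scores are log probabilities recombined with log-sum-exp; the actual combination used by the predictor may differ):

```python
import math
from collections import defaultdict


def logsumexp(scores):
    """Numerically stable log(sum(exp(s) for s in scores))."""
    m = max(scores)
    return m + math.log(sum(math.exp(x - m) for x in scores))


def build_posterior(hypos, n_consumed, eos_id, final_score=None):
    """Group hypothesis scores by the word following the consumed prefix.

    hypos is a list of (trgt_prefix, score) pairs; final_score, if given,
    is the score of a completed hypothesis and licenses end-of-sentence.
    """
    grouped = defaultdict(list)
    for trgt_prefix, score in hypos:
        if len(trgt_prefix) > n_consumed:
            grouped[trgt_prefix[n_consumed]].append(score)
    posterior = {w: logsumexp(s) for w, s in grouped.items()}
    if final_score is not None:
        posterior[eos_id] = final_score
    return posterior
```
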

consume(word)[source]

Remove all hypotheses with translation prefixes which do not match word

get_state()[source]

Predictor state consists of the stacks, the completed hypotheses, and the number of consumed words.

get_unk_probability(posterior)[source]

Returns negative infinity if the posterior is not empty, as words outside the grammar are not possible according to this predictor. If the posterior is empty, return 0 (= log 1)

initialize(src_sentence)[source]

Delete all bins and add the initial cell to the first bin

predict_next()[source]

For predicting the distribution of the next target tokens, we need to empty the stack with the current history length by expanding all hypotheses on it. Then, all hypotheses are in larger bins, i.e. have a longer target prefix than the current history. Thus, we can look up the possible next words by iterating through all active hypotheses.

set_state(state)[source]

Set the predictor state.

class cam.sgnmt.predictors.grammar.Span(p, borders)[source]

Span is defined by the start and end position and the corresponding sequence of terminal and non-terminal symbols p. Normally, p is just a single NT symbol. However, if there is ambiguity in how to apply a rule to a span (e.g. rule X -> X the X to span foo the bar the baz), we allow resolving it later on demand. In this case, p = X the X

Fully initializes a new Span instance.

Parameters:
  • p (list) – See class docstring for Span
  • borders (tuple) – (begin, end) with begin inclusive and end exclusive
class cam.sgnmt.predictors.grammar.Trie(span_len_range)[source]

This trie implementation allows matching NT symbols with arbitrary symbol sequences of certain lengths when searching. Note: This trie does not implement edge collapsing - each edge is labeled with exactly one word

Creates an empty trie data structure.

Parameters:span_len_range (tuple) – minimum and maximum span lengths for non-terminal symbols
add(seq, element)[source]

Add an element to the trie data structure. The key sequence seq can contain non-terminals with negative IDs. If an element with the same key already exists in the data structure, we do not delete it but store both items.

Parameters:
  • seq (list) – Sequence of terminals and non-terminals used as key in the trie
  • element (object) – Object to associate with seq
get_all_elements()[source]

Retrieve all elements stored in the trie

get_elements(src_seq)[source]

Get all elements (e.g. rules) which match the given sequence of source tokens.

Parameters:src_seq (list) – Sequence of terminals and non-terminals used as key in the trie
Returns:(rules, nt_span_lens). The first dictionary contains all applying rules. nt_span_lens lists the number of symbols each of the NTs on the source side covers. Make sure that self.span_len_range is updated
Return type:two dicts
replace(seq, element)[source]

Replaces all elements stored at seq with a single new element. This is equivalent to first removing all items with key seq, and then adding the new element with add(seq, element)

Parameters:
  • seq (list) – Sequence of terminals and non-terminals used as key in the trie
  • element (object) – Object to associate with seq
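The matching behaviour can be sketched with a simplified trie in which every non-terminal edge consumes a span whose length lies inside span_len_range (illustrative only; the real class tracks a per-NT range and also returns the nt_span_lens bookkeeping described under get_elements()):

```python
class TrieSketch:
    """Trie whose negative-id (non-terminal) edges match variable-length spans."""

    def __init__(self, span_len_range):
        self.span_len_range = span_len_range  # (min_len, max_len) for NTs
        self.root = {"children": {}, "elems": []}

    def add(self, seq, element):
        node = self.root
        for sym in seq:
            node = node["children"].setdefault(
                sym, {"children": {}, "elems": []})
        node["elems"].append(element)  # keep both items on key clash

    def get_elements(self, src_seq):
        """All elements whose key matches src_seq completely."""
        lo, hi = self.span_len_range
        results = []

        def visit(node, pos):
            if pos == len(src_seq):
                results.extend(node["elems"])
            for sym, child in node["children"].items():
                if sym >= 0:  # terminal edge: must match the next token
                    if pos < len(src_seq) and src_seq[pos] == sym:
                        visit(child, pos + 1)
                else:         # non-terminal edge: consume lo..hi tokens
                    for n in range(lo, min(hi, len(src_seq) - pos) + 1):
                        visit(child, pos + n)

        visit(self.root, 0)
        return results
```
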

cam.sgnmt.predictors.length module

cam.sgnmt.predictors.misc module

This module provides helper predictors and predictor wrappers which are not directly used for scoring. An example is the altsrc predictor wrapper which loads source sentences from a different file.

class cam.sgnmt.predictors.misc.AltsrcPredictor(src_test, slave_predictor)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This wrapper loads the source sentences from an alternative source file. The src_sentence arguments of initialize and initialize_heuristic are overridden with sentences loaded from the file specified via the argument --altsrc_test. All other methods are pass through calls to the slave predictor.

Creates a new altsrc wrapper predictor.

Parameters:
  • src_test (string) – Path to the text file with source sentences
  • slave_predictor (Predictor) – Instance of the predictor which uses the source sentences in src_test
consume(word)[source]

Pass through to slave predictor

estimate_future_cost(hypo)[source]

Pass through to slave predictor

get_state()[source]

Pass through to slave predictor

get_unk_probability(posterior)[source]

Pass through to slave predictor

initialize(src_sentence)[source]

Pass through to slave predictor but replace src_sentence with a sentence from self.altsens

initialize_heuristic(src_sentence)[source]

Pass through to slave predictor but replace src_sentence with a sentence from self.altsens

is_equal(state1, state2)[source]

Pass through to slave predictor

predict_next()[source]

Pass through to slave predictor

set_current_sen_id(cur_sen_id)[source]

We need to override this method to propagate current_sentence_id to the slave predictor

set_state(state)[source]

Pass through to slave predictor
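The pass-through pattern used throughout this wrapper can be sketched generically (hypothetical class; SGNMT's own implementation delegates each method explicitly rather than via `__getattr__`):

```python
class PassThroughSketch:
    """Sketch of the wrapper pattern used by AltsrcPredictor: every call
    is delegated to the slave predictor, except initialize(), which
    swaps in a sentence loaded from the alternative source file."""

    def __init__(self, src_test, slave):
        self.slave = slave
        with open(src_test) as f:
            self.altsens = [list(map(int, line.split())) for line in f]
        self.cur_sen_id = 0

    def initialize(self, src_sentence):
        # Ignore src_sentence; use the alternative source instead
        self.slave.initialize(self.altsens[self.cur_sen_id])

    def __getattr__(self, name):
        # All other methods are pass-through calls to the slave
        return getattr(self.slave, name)
```
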

class cam.sgnmt.predictors.misc.UnboundedAltsrcPredictor(src_test, slave_predictor)[source]

Bases: cam.sgnmt.predictors.misc.AltsrcPredictor, cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

This class is a version of AltsrcPredictor for unbounded vocabulary predictors. This needs an adjusted predict_next method to pass through the set of target words to score correctly.

Pass through to AltsrcPredictor.__init__

predict_next(trgt_words)[source]

Pass through to slave predictor

cam.sgnmt.predictors.ngram module

This module contains predictors for n-gram (Kneser-Ney) language modeling. This is an UnboundedVocabularyPredictor, as the vocabulary size of n-gram models normally does not permit complete enumeration of the posterior.

This module is based on the swig-srilm package.

https://github.com/desilinguist/swig-srilm

class cam.sgnmt.predictors.ngram.SRILMPredictor(path, ngram_order, convert_to_ln=False)[source]

Bases: cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

SRILM predictor based on swig https://github.com/desilinguist/swig-srilm

The predictor state is described by the n-gram history. The language model has to use word indices rather than the string word representations.

Creates a new n-gram language model predictor.

Parameters:
  • path (string) – Path to the ARPA language model file
  • ngram_order (int) – Order of the language model
Raises:

NameError. If swig-srilm is not installed

consume(word)[source]

Extends the current history by word

get_state()[source]

Returns the current n-gram history

get_unk_probability(posterior)[source]

Use the probability for ‘<unk>’ in the language model

initialize(src_sentence)[source]

Initializes the history with the start-of-sentence symbol.

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Returns true if the ngram history is the same

predict_next(words)[source]

Score the set of target words with the n-gram language model given the current history

Parameters:words (list) – Set of words to score
Returns:dict. Language model scores for the words in words
set_state(state)[source]

Sets the current n-gram history

cam.sgnmt.predictors.parse module

class cam.sgnmt.predictors.parse.BpeParsePredictor(grammar_path, bpe_rule_path, slave_predictor, word_out=True, normalize_scores=True, norm_alpha=1.0, beam_size=1, max_internal_len=35, allow_early_eos=False, consume_out_of_class=False, eow_ids=None, terminal_restrict=True, terminal_ids=None, internal_only_restrict=False)[source]

Bases: cam.sgnmt.predictors.parse.TokParsePredictor

Predict over a BPE-based grammar with two possible grammar constraints: one between non-terminals and BPE start-of-word tokens, and one over the BPE tokens in a word

Creates a new parse predictor wrapper which can be constrained to two grammars: one over non-terminals / terminals, and one internal grammar constraining BPE units within a single word.

Parameters:
  • grammar_path (string) – Path to the grammar file
  • bpe_rule_path (string) – Path to file defining rules between BPEs
  • slave_predictor – predictor to wrap
  • word_out (bool) – since this wrapper can be used for grammar constraint, this bool determines whether we also do internal beam search over non-terminals
  • normalize_scores (bool) – true if normalizing scores, e.g. if some are removed from the posterior
  • norm_alpha (float) – may be used for path weight normalization
  • beam_size (int) – beam size for internal beam search
  • max_internal_len (int) – max number of consecutive nonterminals before path is ignored by internal search
  • allow_early_eos (bool) – true if permitting EOS consumed even if it is not permitted by the grammar at that point
  • consume_out_of_class (bool) – true if permitting any tokens to be consumed even if not allowed by the grammar at that point
  • eow_ids (string) – path to file containing ids of BPEs that mark the end of a word
  • terminal_restrict (bool) – true if applying grammar constraint over nonterminals and terminals
  • terminal_ids (string) – path to file containing all terminal ids
  • internal_only_restrict (bool) – true if applying grammar constraint over BPE units inside words
get_all_terminals(terminal_ids)[source]
get_bpe_can_follow(rule_path)[source]
get_eow_ids(eow_ids)[source]
is_nt(word)[source]
predict_next(predicting_next_word=False)[source]

predict next tokens as permitted by the current stack and the BPE grammar

update_stacks(word)[source]
class cam.sgnmt.predictors.parse.InternalHypo(score, token_score, predictor_state, word_to_consume)[source]

Bases: object

Helper class for internal parse predictor beam search over nonterminals

extend(score, predictor_state, word_to_consume)[source]
class cam.sgnmt.predictors.parse.ParsePredictor(slave_predictor, normalize_scores=True, beam_size=4, max_internal_len=35, nonterminal_ids=None)[source]

Bases: cam.sgnmt.predictors.core.Predictor

Predictor wrapper allowing internal beam search over a representation which contains some pre-defined ‘non-terminal’ ids, which should not appear in the output.

Create a new parse wrapper for a predictor.

Parameters:
  • slave_predictor – predictor to wrap
  • normalize_scores (bool) – whether to normalize posterior scores, e.g. after some tokens have been removed
  • beam_size (int) – beam size for internal beam search over non-terminals
  • max_internal_len (int) – number of consecutive non-terminal tokens allowed in internal search before path is ignored
  • nonterminal_ids – file containing non-terminal ids, one per line
are_best_terminal(posterior)[source]

Return true if most probable tokens in posterior are all terminals (including EOS)

consume(word, internal=False)[source]
find_word_beam(posterior)[source]

Internal beam search over posterior until a beam of terminals is found

get_state()[source]

Returns the current state.

get_unk_probability(posterior)[source]

Return the unk probability as determined by the slave predictor.

Returns:float. The unk probability

initialize(src_sentence)[source]

Initializes slave predictor with source sentence

Parameters:src_sentence (list) –
initialize_heuristic(src_sentence)[source]

Creates a matrix of shortest distances between nodes.

initialize_internal_hypos(posterior)[source]
is_equal(state1, state2)[source]

Returns true if the current node is the same

maybe_add_new_top_tokens(top_terminals, hypo, next_hypos)[source]
predict_next(predicting_internally=False)[source]

Predict next tokens.

Parameters:predicting_internally (bool) – true if called from the internal beam search; prevents an infinite loop
set_state(state)[source]

Sets the current state.

class cam.sgnmt.predictors.parse.TokParsePredictor(grammar_path, slave_predictor, word_out=True, normalize_scores=True, norm_alpha=1.0, beam_size=1, max_internal_len=35, allow_early_eos=False, consume_out_of_class=False)[source]

Bases: cam.sgnmt.predictors.parse.ParsePredictor

Unlike ParsePredictor, this predictor scores tokens according to a grammar. Use BpeParsePredictor if including rules to connect BPE units inside words.

Creates a new parse predictor wrapper.

Parameters:
  • grammar_path (string) – Path to the grammar file
  • slave_predictor – predictor to wrap
  • word_out (bool) – since this wrapper can be used for grammar constraint, this bool determines whether we also do internal beam search over non-terminals
  • normalize_scores (bool) – true if normalizing scores, e.g. if some are removed from the posterior
  • norm_alpha (float) – may be used for path weight normalization
  • beam_size (int) – beam size for internal beam search
  • max_internal_len (int) – max number of consecutive nonterminals before path is ignored by internal search
  • allow_early_eos (bool) – true if permitting EOS consumed even if it is not permitted by the grammar at that point
  • consume_out_of_class (bool) – true if permitting any tokens to be consumed even if not allowed by the grammar at that point
consume(word)[source]
Parameters:word (int) – word token being consumed
find_word(posterior)[source]

Check whether the rhs of the best option in the posterior is a terminal. If it is, return the posterior for decoding; if not, take the best result and follow that path until a word is found. This follows a greedy 1-best or a beam path through non-terminals.
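The greedy variant can be sketched as a small loop in which callables stand in for the predictor's own methods (illustrative only):

```python
def find_word_greedy(posterior, is_nt, consume, predict_next):
    """Follow the single best token through non-terminals until the
    best option in the posterior is a terminal, then return it."""
    while True:
        best = max(posterior, key=posterior.get)
        if not is_nt(best):
            return posterior  # rhs of the best option is a terminal
        consume(best)         # descend into the non-terminal
        posterior = predict_next()
```
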

find_word_beam(posterior)[source]

Do an internal beam search over non-terminal functions to find the next best n terminal tokens, as ranked by normalized path score

Returns: posterior containing up to n terminal tokens and their normalized path scores
find_word_greedy(posterior)[source]
get_current_allowed()[source]
get_state()[source]

Returns the current state, including slave predictor state

initialize(src_sentence)[source]
norm_hypo_score(hypo)[source]
norm_score(score, beam_len)[source]
predict_next(predicting_next_word=False)[source]

predict next tokens as permitted by the current stack and the grammar

prepare_grammar()[source]
replace_lhs()[source]
set_state(state)[source]

Sets the current state

update_stacks(word)[source]
cam.sgnmt.predictors.parse.load_external_ids(path)[source]

Load a file of ids into a list.

cam.sgnmt.predictors.structure module

This module implements constraints which assure that highly structured output is well-formatted. For example, the bracket predictor checks for balanced bracket expressions, and the OSM predictor prevents any sequence of operations which cannot be compiled to a string.

class cam.sgnmt.predictors.structure.BracketPredictor(max_terminal_id, closing_bracket_id, max_depth=-1, extlength_path='')[source]

Bases: cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

This predictor constrains the output to well-formed bracket expressions. It also allows specifying the number of terminals with an external length distribution file.

Creates a new bracket predictor.

Parameters:
  • max_terminal_id (int) – All IDs greater than this are brackets
  • closing_bracket_id (string) – All brackets except these ones are opening. Comma-separated list of integers.
  • max_depth (int) – If positive, restrict the maximum depth
  • extlength_path (string) – If this is set, restrict the number of terminals to the distribution specified in the referenced file. Terminals can be implicit: We count a single terminal between each adjacent opening and closing bracket.
consume(word)[source]

Updates current depth and the number of consumed terminals.

get_state()[source]

Returns the current depth and number of consumed terminals

get_unk_probability(posterior)[source]

Always returns 0.0

initialize(src_sentence)[source]

Sets the current depth to 0.

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Trivial implementation

predict_next(words)[source]

If the maximum depth is reached, exclude all opening brackets. If history is not balanced, exclude EOS. If the current depth is zero, exclude closing brackets.

Parameters:words (list) – Set of words to score
Returns:dict.
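The three exclusion rules can be sketched as a pure function (the ids are illustrative; terminals are assumed unconstrained here, which the actual predictor may refine with the external length distribution):

```python
NEG_INF = float("-inf")


def bracket_scores(words, cur_depth, max_depth, max_terminal_id,
                   closing_bracket_ids, eos_id):
    """Score each candidate word: 0.0 (= log 1) if allowed, -inf otherwise.

    Opening brackets are blocked at maximum depth, EOS is blocked while
    brackets are unbalanced, and closing brackets are blocked at depth 0.
    """
    scores = {}
    for w in words:
        if w == eos_id:
            ok = cur_depth == 0               # EOS only when balanced
        elif w <= max_terminal_id:
            ok = True                         # terminals unconstrained here
        elif w in closing_bracket_ids:
            ok = cur_depth > 0                # no closing bracket at depth 0
        else:                                 # opening bracket
            ok = max_depth < 0 or cur_depth < max_depth
        scores[w] = 0.0 if ok else NEG_INF
    return scores
```
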
set_state(state)[source]

Sets the current depth and number of consumed terminals

class cam.sgnmt.predictors.structure.ForcedOSMPredictor(trg_test_file)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor allows forced decoding with an OSM output, which essentially means running the OSM in alignment mode. This predictor assumes well-formed operation sequences. Please combine this predictor with the osm constraint predictor to satisfy this requirement. The state of this predictor is the compiled version of the current history. It allows terminal symbols which are consistent with the reference. The end-of-sentence symbol is suppressed until all words in the reference have been consumed.

Creates a new forcedosm predictor.

Parameters:trg_test_file (string) – Path to the plain text file with the target sentences. Must have the same number of lines as the number of source sentences to decode
consume(word)[source]

Updates the compiled string and the head position.

get_state()[source]
get_unk_probability(posterior)[source]

Always returns -inf.

initialize(src_sentence)[source]

Resets compiled and head.

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Trivial implementation

predict_next()[source]

Apply word reference constraints.

Returns:dict.
set_state(state)[source]
class cam.sgnmt.predictors.structure.OSMPredictor[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor applies the following constraints to an OSM output:

  • The number of EOP (end-of-phrase) tokens must not exceed the number of source tokens.
  • JUMP_FWD and JUMP_BWD tokens are constrained to avoid jumping out of bounds.

Creates a new osm predictor.

consume(word)[source]

Updates the number of holes, EOPs, and the head position.

get_state()[source]
get_unk_probability(posterior)[source]

Always returns 0.0

initialize(src_sentence)[source]

Sets the number of source tokens.

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Trivial implementation

predict_next()[source]

Apply OSM constraints.

Returns:dict.
set_state(state)[source]
cam.sgnmt.predictors.structure.load_external_lengths(path)[source]

Loads a length distribution from a plain text file. The file must contain space-separated <length>:<score> pairs on each line.

Parameters:path (string) – Path to the length file.
Returns:list of dicts mapping a length to its scores, one dict for each sentence.
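A sketch of a parser for this file format (the function name is hypothetical; the real implementation lives in this module):

```python
def load_external_lengths_sketch(path):
    """Parse lines of space-separated <length>:<score> pairs into one
    dict per sentence, mapping each length to its score."""
    dists = []
    with open(path) as f:
        for line in f:
            pairs = (p.split(":") for p in line.split())
            dists.append({int(length): float(score)
                          for length, score in pairs})
    return dists
```
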

cam.sgnmt.predictors.tf_nizza module

This module integrates Nizza alignment models.

https://github.com/fstahlberg/nizza

class cam.sgnmt.predictors.tf_nizza.BaseNizzaPredictor(src_vocab_size, trg_vocab_size, model_name, hparams_set_name, checkpoint_dir, single_cpu_thread, nizza_unk_id=None)[source]

Bases: cam.sgnmt.predictors.core.Predictor

Common functionality for Nizza based predictors. This includes loading checkpoints, creating sessions, and creating computation graphs.

Initializes a nizza predictor.

Parameters:
  • src_vocab_size (int) – Source vocabulary size (called inputs_vocab_size in nizza)
  • trg_vocab_size (int) – Target vocabulary size (called targets_vocab_size in nizza)
  • model_name (string) – Name of the nizza model
  • hparams_set_name (string) – Name of the nizza hyper-parameter set
  • checkpoint_dir (string) – Path to the Nizza checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
  • single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
  • nizza_unk_id (int) – If set, use this as UNK id. Otherwise, the nizza model is assumed to have no UNKs
Raises:

IOError if checkpoint file not found.

create_session(checkpoint_dir)[source]

Creates a MonitoredSession for this predictor.

get_unk_probability(posterior)[source]

Fetch posterior[t2t_unk_id] or return NEG_INF if None.

class cam.sgnmt.predictors.tf_nizza.LexNizzaPredictor(src_vocab_size, trg_vocab_size, model_name, hparams_set_name, checkpoint_dir, single_cpu_thread, alpha, beta, shortlist_strategies, trg2src_model_name='', trg2src_hparams_set_name='', trg2src_checkpoint_dir='', max_shortlist_length=0, min_id=0, nizza_unk_id=None)[source]

Bases: cam.sgnmt.predictors.tf_nizza.BaseNizzaPredictor

This predictor is only compatible with Model1-like Nizza models which return lexical translation probabilities in precompute(). The predictor keeps a list of the same length as the source sentence and initializes it with zeros. At each timestep it updates this list by the lexical scores Model1 assigned to the last consumed token. The predictor score aims to bring up all entries in the list, and thus serves as a coverage mechanism over the source sentence.

Initializes a nizza predictor.

Parameters:
  • src_vocab_size (int) – Source vocabulary size (called inputs_vocab_size in nizza)
  • trg_vocab_size (int) – Target vocabulary size (called targets_vocab_size in nizza)
  • model_name (string) – Name of the nizza model
  • hparams_set_name (string) – Name of the nizza hyper-parameter set
  • checkpoint_dir (string) – Path to the Nizza checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
  • single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
  • alpha (float) – Score for each matching word
  • beta (float) – Penalty for each uncovered word at the end
  • shortlist_strategies (string) – Comma-separated list of shortlist strategies.
  • trg2src_model_name (string) – Name of the target2source nizza model
  • trg2src_hparams_set_name (string) – Name of the nizza hyper-parameter set for the target2source model
  • trg2src_checkpoint_dir (string) – Path to the Nizza checkpoint directory for the target2source model. The predictor will load the top most checkpoint in the checkpoints file.
  • max_shortlist_length (int) – If a shortlist exceeds this limit, initialize the initial coverage with 1 at this position. If zero, do not apply any limit
  • min_id (int) – Do not use IDs below this threshold (filters out most frequent words).
  • nizza_unk_id (int) – If set, use this as UNK id. Otherwise, the nizza model is assumed to have no UNKs
Raises:

IOError if checkpoint file not found.

consume(word)[source]

Update coverage.

estimate_future_cost(hypo)[source]

We use the number of uncovered words times beta as heuristic estimate.

get_state()[source]

The predictor state is the coverage vector.

get_unk_probability(posterior)[source]
initialize(src_sentence)[source]

Set src_sentence, reset consumed.

predict_next()[source]

Predict record scores.

set_state(state)[source]

The predictor state is the coverage vector.

class cam.sgnmt.predictors.tf_nizza.NizzaPredictor(src_vocab_size, trg_vocab_size, model_name, hparams_set_name, checkpoint_dir, single_cpu_thread, nizza_unk_id=None)[source]

Bases: cam.sgnmt.predictors.tf_nizza.BaseNizzaPredictor

This predictor uses Nizza alignment models to derive a posterior over the target vocabulary for the next position. It mainly relies on the predict_next_word() implementation of Nizza models.

Initializes a nizza predictor.

Parameters:
  • src_vocab_size (int) – Source vocabulary size (called inputs_vocab_size in nizza)
  • trg_vocab_size (int) – Target vocabulary size (called targets_vocab_size in nizza)
  • model_name (string) – Name of the nizza model
  • hparams_set_name (string) – Name of the nizza hyper-parameter set
  • checkpoint_dir (string) – Path to the Nizza checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
  • single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
  • nizza_unk_id (int) – If set, use this as UNK id. Otherwise, the nizza model is assumed to have no UNKs
Raises:

IOError if checkpoint file not found.

consume(word)[source]

Append word to the current history.

get_state()[source]

The predictor state is the complete history.

initialize(src_sentence)[source]

Set src_sentence, reset consumed.

is_equal(state1, state2)[source]

Returns true if the history is the same

predict_next()[source]

Call the T2T model in self.mon_sess.

set_state(state)[source]

The predictor state is the complete history.

cam.sgnmt.predictors.tf_nmt module

cam.sgnmt.predictors.tf_rnnlm module

cam.sgnmt.predictors.tf_t2t module

This is the interface to the tensor2tensor library.

https://github.com/tensorflow/tensor2tensor

Alternatively, you may use the following fork which has been tested in combination with SGNMT:

https://github.com/fstahlberg/tensor2tensor

The t2t predictor can read any model trained with tensor2tensor which includes the transformer model, convolutional models, and RNN-based sequence models.

class cam.sgnmt.predictors.tf_t2t.FertilityT2TPredictor(src_vocab_size, trg_vocab_size, model_name, problem_name, hparams_set_name, t2t_usr_dir, checkpoint_dir, t2t_unk_id=None, single_cpu_thread=False, max_terminal_id=-1, pop_id=-1)[source]

Bases: cam.sgnmt.predictors.tf_t2t.T2TPredictor

Use this predictor to integrate fertility models trained with T2T. Fertility models output the fertility for each source word instead of target words. We define the fertility of the i-th source word in a hypothesis as the number of tokens between the (i-1)-th and the i-th POP token.

TODO: This is not SOLID (violates the Liskov substitution principle)
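The fertility definition above (number of tokens between consecutive POP tokens) can be illustrated with a small hypothetical helper; `pop_id` marks the POP/closing bracket symbol:

```python
def fertilities(trg_tokens, pop_id):
    """Compute source-word fertilities from a target token sequence.

    The fertility of the i-th source word is the number of tokens
    between the (i-1)-th and the i-th POP token.
    """
    ferts = []
    count = 0
    for tok in trg_tokens:
        if tok == pop_id:
            ferts.append(count)  # POP closes the current source word
            count = 0
        else:
            count += 1
    return ferts
```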

Creates a new T2T predictor. The constructor prepares the TensorFlow session for predict_next() calls. This includes:

  • Load hyper parameters from the given set (hparams)
  • Update registry, load T2T model
  • Create TF placeholders for source sequence and target prefix
  • Create computation graph for computing log probs
  • Create a MonitoredSession object, which also handles restoring checkpoints
Parameters:
  • src_vocab_size (int) – Source vocabulary size.
  • trg_vocab_size (int) – Target vocabulary size.
  • model_name (string) – T2T model name.
  • problem_name (string) – T2T problem name.
  • hparams_set_name (string) – T2T hparams set name.
  • t2t_usr_dir (string) – See –t2t_usr_dir in tensor2tensor.
  • checkpoint_dir (string) – Path to the T2T checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
  • t2t_unk_id (int) – If set, use this ID to get UNK scores. If None, UNK is always scored with -inf.
  • single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
  • max_terminal_id (int) – If positive, maximum terminal ID. Needs to be set for syntax-based T2T models.
  • pop_id (int) – If positive, ID of the POP or closing bracket symbol. Needs to be set for syntax-based T2T models.
consume(word)[source]
get_state()[source]
get_unk_probability(posterior)[source]

Returns self.other_scores[n_aligned_words].

initialize(src_sentence)[source]

Set src_sentence, compute fertilities for first src word.

is_equal(state1, state2)[source]

Returns true if the history is the same

predict_next()[source]

Returns self.pop_scores[n_aligned_words] for POP and EOS.

set_state(state)[source]
cam.sgnmt.predictors.tf_t2t.POP = '##POP##'

Textual representation of the POP symbol.

class cam.sgnmt.predictors.tf_t2t.T2TPredictor(src_vocab_size, trg_vocab_size, model_name, problem_name, hparams_set_name, t2t_usr_dir, checkpoint_dir, t2t_unk_id=None, single_cpu_thread=False, max_terminal_id=-1, pop_id=-1)[source]

Bases: cam.sgnmt.predictors.tf_t2t._BaseTensor2TensorPredictor

This predictor implements scoring with Tensor2Tensor models. We follow the decoder implementation in T2T and do not reuse network states in decoding. We rather compute the full forward pass along the current history. Therefore, the decoder state is simply the full history of consumed words.
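The history-as-state pattern described above can be sketched as follows. This is a simplified, hypothetical class, not the actual T2TPredictor; `score_fn` stands in for the TensorFlow session call:

```python
class HistoryPredictor:
    """Sketch of the history-as-state pattern: instead of caching
    network states, store only the consumed history and recompute the
    full forward pass at each step."""

    def __init__(self, score_fn):
        self.score_fn = score_fn  # stand-in for the T2T session call
        self.consumed = []
        self.src_sentence = []

    def initialize(self, src_sentence):
        self.src_sentence = src_sentence
        self.consumed = []

    def consume(self, word):
        self.consumed.append(word)

    def predict_next(self):
        # Full forward pass along the current history
        return self.score_fn(self.src_sentence, self.consumed)

    def get_state(self):
        # The predictor state is the complete history
        return list(self.consumed)

    def set_state(self, state):
        self.consumed = list(state)

    def is_equal(self, state1, state2):
        return state1 == state2
```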

Creates a new T2T predictor. The constructor prepares the TensorFlow session for predict_next() calls. This includes:

  • Load hyper parameters from the given set (hparams)
  • Update registry, load T2T model
  • Create TF placeholders for source sequence and target prefix
  • Create computation graph for computing log probs
  • Create a MonitoredSession object, which also handles restoring checkpoints
Parameters:
  • src_vocab_size (int) – Source vocabulary size.
  • trg_vocab_size (int) – Target vocabulary size.
  • model_name (string) – T2T model name.
  • problem_name (string) – T2T problem name.
  • hparams_set_name (string) – T2T hparams set name.
  • t2t_usr_dir (string) – See –t2t_usr_dir in tensor2tensor.
  • checkpoint_dir (string) – Path to the T2T checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
  • t2t_unk_id (int) – If set, use this ID to get UNK scores. If None, UNK is always scored with -inf.
  • single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
  • max_terminal_id (int) – If positive, maximum terminal ID. Needs to be set for syntax-based T2T models.
  • pop_id (int) – If positive, ID of the POP or closing bracket symbol. Needs to be set for syntax-based T2T models.
consume(word)[source]

Append word to the current history.

get_state()[source]

The predictor state is the complete history.

initialize(src_sentence)[source]

Set src_sentence, reset consumed.

is_equal(state1, state2)[source]

Returns true if the history is the same

predict_next()[source]

Call the T2T model in self.mon_sess.

set_state(state)[source]

The predictor state is the complete history.

cam.sgnmt.predictors.tf_t2t.T2T_INITIALIZED = False

Set to true by _initialize_t2t() after first constructor call.

cam.sgnmt.predictors.tf_t2t.expand_input_dims_for_t2t(t)[source]

Expands a plain input tensor for using it in a T2T graph.

Parameters:t – Tensor
Returns:Tensor t expanded by 1 dimension on the left and two dimensions on the right.
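The shape change can be illustrated with a pure-Python sketch on nested lists (the real helper operates on TF tensors, e.g. via tf.expand_dims): a 1-D input of length n becomes shape (1, n, 1, 1).

```python
def expand_input_dims(t):
    """Illustrative sketch: wrap a flat sequence so it gains one axis
    on the left (batch) and two on the right, i.e. (n,) -> (1, n, 1, 1)."""
    return [[[[x]] for x in t]]
```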
cam.sgnmt.predictors.tf_t2t.log_prob_from_logits(logits)[source]

Log softmax function.
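What log_prob_from_logits computes can be written as a numerically stable log softmax. A pure-Python sketch (the real helper works on TF tensors):

```python
import math

def log_prob_from_logits(logits):
    """Numerically stable log softmax over a list of logits:
    subtract the max before exponentiating to avoid overflow."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]
```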

cam.sgnmt.predictors.tokenization module

This module contains wrapper predictors which support decoding with diverse tokenization. The Word2charPredictor can be used if the decoder operates on fine-grained tokens such as characters, but the tokenization of a predictor is coarse-grained (e.g. words or subwords).

The word2char predictor maintains an explicit list of word boundary characters and applies consume and predict_next whenever a word boundary character is consumed.

The fsttok predictor also masks coarse-grained predictors when SGNMT uses fine-grained tokens such as characters. This wrapper loads an FST which transduces character sequences to predictor-unit sequences.

class cam.sgnmt.predictors.tokenization.CombinedState(fst_node, pred_state, posterior, unconsumed=[], pending_score=0.0)[source]

Bases: object

Combines an FST state with a predictor state. Used by the fsttok predictor.

consume_all(predictor)[source]

Consume all unconsumed tokens and update pred_state, pending_score, and posterior accordingly.

Parameters:predictor (Predictor) – Predictor instance
consume_single(predictor)[source]

Consume a single token in self.unconsumed.

Parameters:predictor (Predictor) – Predictor instance
score(token, predictor)[source]

Returns a score which can be added if token is consumed next. This is not necessarily the full score but an upper bound on it: continuations will have a score lower than or equal to this. We only use the current posterior vector and do not consume tokens with the wrapped predictor.

traverse_fst(trans_fst, char)[source]

Returns a list of CombinedState objects with the same predictor state and posterior, but an fst_node which is reachable via the input label char. If the output tape contains symbols, add them to unconsumed.

Parameters:
  • trans_fst (Fst) – FST to traverse
  • char (int) – Index of character
Returns:

list. List of combined states reachable via char

update_posterior(predictor)[source]

If self.posterior is None, call predict_next to be able to score the next tokens.

cam.sgnmt.predictors.tokenization.EPS_ID = 0

OpenFST’s reserved ID for epsilon arcs.

class cam.sgnmt.predictors.tokenization.FSTTokPredictor(path, fst_unk_id, max_pending_score, slave_predictor)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This wrapper can be used if the SGNMT decoder operates on the character level, but a predictor uses a more coarse-grained tokenization. The mapping is defined by an FST which transduces character sequences to predictor-unit sequences. This wrapper maintains a list of CombinedState objects which are tuples of an FST node and a predictor state for which the following holds:

  • The input labels on the path to the node are consistent with the consumed characters
  • The output labels on the path to the node are consistent with the predictor states

Constructor for the fsttok wrapper

Parameters:
  • path (string) – Path to an FST which transduces characters to predictor tokens
  • fst_unk_id (int) – ID used to represent UNK in the FSTs (usually 999999998)
  • max_pending_score (float) – Maximum pending score in a CombinedState instance.
  • slave_predictor (Predictor) – Wrapped predictor
consume(word)[source]

Update self.states to be consistent with word and consume all the predictor tokens.

estimate_future_cost(hypo)[source]

Not implemented yet

get_state()[source]
get_unk_probability(posterior)[source]

Always returns negative infinity. Handling UNKs needs to be realized by the FST.

initialize(src_sentence)[source]

Pass through to slave predictor. The source sentence is not modified. states is updated to hold the initial FST node and the initial predictor posterior and state.

initialize_heuristic(src_sentence)[source]

Pass through to slave predictor. The source sentence is not modified

is_equal(state1, state2)[source]

Not implemented yet

predict_next()[source]
set_current_sen_id(cur_sen_id)[source]

We need to override this method to propagate current_sentence_id to the slave predictor

set_state(state)[source]
class cam.sgnmt.predictors.tokenization.Word2charPredictor(map_path, slave_predictor)[source]

Bases: cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

This predictor wraps word level predictors when SGNMT is running on the character level. The mapping between word ID and character ID sequence is loaded from the file system. All characters which do not appear in that mapping are treated as word boundary markers. The wrapper blocks consume and predict_next calls until a word boundary marker is consumed, and updates the slave predictor according to the word between the last two word boundaries. The mapping is done only on the target side, and the source sentences are passed through as they are. To use alternative tokenization on the source side, see the altsrc predictor wrapper. The word2char wrapper is always an UnboundedVocabularyPredictor.

Creates a new word2char wrapper predictor. The map_path file has to be a plain text file, each line containing the mapping from a word index to the character index sequence (format: word char1 char2... charn).

Parameters:
  • map_path (string) – Path to the mapping file
  • slave_predictor (Predictor) – Instance of the predictor with a different wmap than SGNMT
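The mapping file format described above (one word index followed by its character index sequence per line) could be parsed with a helper along these lines. This is a hypothetical sketch, not the wrapper's actual loader:

```python
def load_word2char_map(map_path):
    """Parse a word-to-character mapping file: each line is
    `word char1 char2 ... charn`, all integer IDs."""
    mapping = {}
    with open(map_path) as f:
        for line in f:
            fields = [int(x) for x in line.split()]
            mapping[fields[0]] = fields[1:]  # word ID -> char ID sequence
    return mapping
```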
consume(word)[source]

If word is a word boundary marker, truncate word_stub and let the slave predictor consume word_stub. Otherwise, extend word_stub by the character.

estimate_future_cost(hypo)[source]

Not supported

get_state()[source]

Pass through to slave predictor

get_unk_probability(posterior)[source]

This is about the unknown character, not the unknown word. Since the word level slave predictor has no notion of the unknown character, we return NEG_INF unconditionally.

initialize(src_sentence)[source]

Pass through to slave predictor. The source sentence is not modified

initialize_heuristic(src_sentence)[source]

Pass through to slave predictor. The source sentence is not modified

is_equal(state1, state2)[source]

Pass through to slave predictor

predict_next(trgt_words)[source]
set_current_sen_id(cur_sen_id)[source]

We need to override this method to propagate current_sentence_id to the slave predictor

set_state(state)[source]

Pass through to slave predictor

cam.sgnmt.predictors.vocabulary module

Predictor wrappers in this module work with the vocabulary of the wrapped predictor. An example is the idxmap wrapper which makes it possible to use an alternative word map.

class cam.sgnmt.predictors.vocabulary.IdxmapPredictor(src_idxmap_path, trgt_idxmap_path, slave_predictor, slave_weight)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This wrapper predictor can be applied to slave predictors which use different wmaps than SGNMT. It translates between SGNMT word indices and predictor indices each time the predictor is called. This mapping is transparent to both the decoder and the wrapped slave predictor.

Creates a new idxmap wrapper predictor. The index maps have to be plain text files, each line containing the mapping from an SGNMT word index to the slave predictor word index.

Parameters:
  • src_idxmap_path (string) – Path to the source index map
  • trgt_idxmap_path (string) – Path to the target index map
  • slave_predictor (Predictor) – Instance of the predictor with a different wmap than SGNMT
  • slave_weight (float) – Slave predictor weight
consume(word)[source]

Pass through to slave predictor

estimate_future_cost(hypo)[source]

Pass through to slave predictor

get_state()[source]

Pass through to slave predictor

get_unk_probability(posterior)[source]

ATTENTION: We should translate the posterior array back to slave predictor indices. However, the unk_id is translated to the identical index, and others normally do not matter when computing the UNK probability. Therefore, we refrain from a complete conversion and pass through posterior without changing its word indices.

initialize(src_sentence)[source]

Pass through to slave predictor

initialize_heuristic(src_sentence)[source]

Pass through to slave predictor

is_equal(state1, state2)[source]

Pass through to slave predictor

load_map(path)[source]

Load a index map file. Mappings should be bijections, but there is no sanity check in place to verify this.

Parameters:path (string) – Path to the mapping file
Returns:dict. Mapping from SGNMT index to slave predictor index
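The index map file (one `sgnmt_index slave_index` pair per line) suggests a loader along these lines. A hypothetical sketch of what load_map does, without the bijection check the docstring warns about:

```python
def load_map(path):
    """Load an index map file: each line holds two integers, an SGNMT
    word index followed by the slave predictor's word index.
    Mappings are assumed (but not verified) to be bijections."""
    mapping = {}
    with open(path) as f:
        for line in f:
            sgnmt_idx, slave_idx = (int(x) for x in line.split())
            mapping[sgnmt_idx] = slave_idx
    return mapping
```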
predict_next()[source]

Pass through to slave predictor

set_current_sen_id(cur_sen_id)[source]

We need to override this method to propagate current_sentence_id to the slave predictor

set_state(state)[source]

Pass through to slave predictor

class cam.sgnmt.predictors.vocabulary.SkipvocabInternalHypothesis(score, predictor_state, word_to_consume)[source]

Bases: object

Helper class for internal beam search in skipvocab.

class cam.sgnmt.predictors.vocabulary.SkipvocabPredictor(max_id, stop_size, beam, slave_predictor)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor wrapper masks predictors with a larger vocabulary than the SGNMT vocabulary. The SGNMT OOV words are not scored with UNK scores from the other predictors as usual, but are hidden by this wrapper. Therefore, this wrapper does not produce any word from the larger vocabulary, but searches internally until enough in-vocabulary word scores are collected from the wrapped predictor.

Creates a new skipvocab wrapper predictor.

Parameters:
  • max_id (int) – All words greater than this are skipped
  • stop_size (int) – Stop internal beam search when the best stop_size words are in-vocabulary
  • beam (int) – Beam size of internal beam search
  • slave_predictor (Predictor) – Wrapped predictor.
consume(word)[source]

Pass through to slave predictor

estimate_future_cost(hypo)[source]

Pass through to slave predictor

get_state()[source]

Pass through to slave predictor

get_unk_probability(posterior)[source]

Pass through to slave predictor

initialize(src_sentence)[source]

Pass through to slave predictor

initialize_heuristic(src_sentence)[source]

Pass through to slave predictor

is_equal(state1, state2)[source]

Pass through to slave predictor

predict_next()[source]

This method first performs beam search internally to update the slave predictor state to a point where the best stop_size entries in the predict_next() return value are in-vocabulary (bounded by max_id). Then, it returns the slave posterior in that state.

set_current_sen_id(cur_sen_id)[source]

We need to override this method to propagate current_sentence_id to the slave predictor

set_state(state)[source]

Pass through to slave predictor

class cam.sgnmt.predictors.vocabulary.UnboundedIdxmapPredictor(src_idxmap_path, trgt_idxmap_path, slave_predictor, slave_weight)[source]

Bases: cam.sgnmt.predictors.vocabulary.IdxmapPredictor, cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

This class is a version of IdxmapPredictor for unbounded vocabulary predictors. This needs an adjusted predict_next method to pass through the set of target words to score correctly.

Pass through to IdxmapPredictor.__init__

predict_next(trgt_words)[source]

Pass through to slave predictor

class cam.sgnmt.predictors.vocabulary.UnkvocabPredictor(trg_vocab_size, slave_predictor)[source]

Bases: cam.sgnmt.predictors.core.Predictor

If the predictor wrapped by the unkvocab wrapper produces an UNK with predict_next, this wrapper adds explicit NEG_INF scores to all in-vocabulary words not in its posterior. This can control which words are matched by the UNK scores of other predictors.

Creates a new unkvocab wrapper predictor.

Parameters:
  • trg_vocab_size (int) – Size of the target vocabulary
  • slave_predictor (Predictor) – Wrapped predictor
consume(word)[source]

Pass through to slave predictor

estimate_future_cost(hypo)[source]

Pass through to slave predictor

get_state()[source]

Pass through to slave predictor

get_unk_probability(posterior)[source]

Pass through to slave predictor

initialize(src_sentence)[source]

Pass through to slave predictor

initialize_heuristic(src_sentence)[source]

Pass through to slave predictor

is_equal(state1, state2)[source]

Pass through to slave predictor

predict_next()[source]

Pass through to slave predictor. If the posterior from the slave predictor contains util.UNK_ID, add NEG_INF for all word ids lower than trg_vocab_size that are not already defined
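The fill-in behaviour described above can be sketched on a plain dict posterior. This is a hypothetical helper; `UNK_ID` stands in for util.UNK_ID and the real wrapper works on the slave predictor's posterior:

```python
NEG_INF = float("-inf")
UNK_ID = 0  # stand-in for util.UNK_ID

def fill_unk_vocab(posterior, trg_vocab_size):
    """If the posterior contains UNK, explicitly score every
    in-vocabulary word missing from the posterior with NEG_INF."""
    if UNK_ID in posterior:
        for word in range(trg_vocab_size):
            if word not in posterior:
                posterior[word] = NEG_INF
    return posterior
```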

set_current_sen_id(cur_sen_id)[source]

We need to override this method to propagate current_sentence_id to the slave predictor

set_state(state)[source]

Pass through to slave predictor

Module contents

Predictors are the scoring modules used in SGNMT. They can be combined to form a joint search space and joint scores. Note that the configuration of predictors is not yet decoupled from the central configuration. Therefore, new predictors need to be referenced in blocks.decode, and their configuration parameters need to be added to blocks.ui.