Predictors

Predictors are scoring modules which define a distribution over target words given the translation history and some side information, such as the source sentence. If vocabulary sizes differ among predictors, gaps are filled with the respective predictor's UNK score.
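The weighted combination with UNK backoff can be sketched as follows; `combine_scores`, the toy posteriors, and the weights are purely illustrative and not part of SGNMT's API:

```python
import math

def combine_scores(posteriors, unk_scores, weights, word):
    """Log-linear combination of predictor scores for one word.
    Words missing from a predictor's posterior back off to that
    predictor's UNK score (hypothetical helper, not SGNMT code)."""
    total = 0.0
    for post, unk, w in zip(posteriors, unk_scores, weights):
        total += w * post.get(word, unk)
    return total

# Two toy predictors with different vocabularies
p1 = {1: math.log(0.7), 2: math.log(0.3)}
p2 = {1: math.log(0.6)}  # word 2 missing -> falls back to UNK score
score = combine_scores([p1, p2], [math.log(0.01), math.log(0.01)], [0.7, 0.3], 2)
```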

Predictors are specified using the --predictors and --predictor_weights arguments, e.g.:

$ python decode.py --predictors nmt,fst,nplm --predictor_weights 0.7,0.1,0.2 ...

See the Tutorial: Basics page for examples of how to use predictors for decoding.

Available predictors

The following predictors are available:

  • nmt: neural machine translation predictor. Requires Blocks/Theano or TensorFlow.

    Options: nmt_config, nmt_path, nmt_model_selector, cache_nmt_posteriors, nmt_engine

  • t2t: Predictor for tensor2tensor models. Requires Tensor2Tensor.

    Options: t2t_usr_dir, t2t_model, t2t_problem, t2t_hparams_set, t2t_checkpoint_dir, pred_src_vocab_size, pred_trg_vocab_size

  • nizza: Nizza alignment models. Requires Nizza.

    Options: nizza_model, nizza_hparams_set, nizza_checkpoint_dir, pred_src_vocab_size, pred_trg_vocab_size

  • lexnizza: Uses Nizza lexical scores to check coverage. Requires Nizza.

    Options: nizza_model, nizza_hparams_set, nizza_checkpoint_dir, pred_src_vocab_size, pred_trg_vocab_size, lexnizza_alpha, lexnizza_beta, lexnizza_shortlist_strategies, lexnizza_max_shortlist_length

  • srilm: n-gram language model. Requires swig-srilm.

    Options: srilm_path, srilm_order

  • nplm: neural n-gram language model. Requires nplm.

    Options: nplm_path, normalize_nplm_probs

  • rnnlm: RNN language model following Zaremba et al. (2014). Requires TensorFlow.

    Options: rnnlm_config, rnnlm_path

  • forced: Forced decoding with one reference

    Options: trg_test

  • bracket: Enforces well-formed bracket expressions

    Options: syntax_pop_id, syntax_max_terminal_id, syntax_max_depth, extlength_path

  • osm: Constrains output to valid OSM sequences

    Options: None

  • forcedosm: Forced alignment with an OSM model

    Options: trg_test

  • forcedlst: Forced decoding with a Moses n-best list (n-best list rescoring)

    Options: trg_test, forcedlst_match_unk, forcedlst_sparse_feat, use_nbest_weights

  • bow: Forced decoding with one bag-of-words ref.

    Options: trg_test, heuristic_scores_file, bow_heuristic_strategies, bow_accept_subsets, bow_accept_duplicates, pred_trg_vocab_size

  • bowsearch: Forced decoding with one bag-of-words ref.

    Options: hypo_recombination, trg_test, heuristic_scores_file, bow_heuristic_strategies, bow_accept_subsets, bow_accept_duplicates, pred_trg_vocab_size

  • fst: Deterministic translation lattices

    Options: fst_path, use_fst_weights, normalize_fst_weights, fst_to_log, fst_skip_bos_weight

  • nfst: Non-deterministic translation lattices

    Options: fst_path, use_fst_weights, normalize_fst_weights, fst_to_log, fst_skip_bos_weight

  • rtn: Recurrent transition networks as created by HiFST with late expansion.

    Options: rtn_path, use_rtn_weights, minimize_rtns, remove_epsilon_in_rtns, normalize_rtn_weights

  • lrhiero: Direct Hiero (left-to-right Hiero). This is an EXPERIMENTAL implementation of LRHiero.

    Options: rules_path, grammar_feature_weights, use_grammar_weights

  • wc: Number of words feature.

    Options: wc_word, wc_nonterminal_penalty, syntax_nonterminal_ids, syntax_min_terminal_id, syntax_max_terminal_id, pred_trg_vocab_size

  • unkc: Poisson model for number of UNKs.

    Options: unk_count_lambdas, pred_trg_vocab_size

  • ngramc: For using MBR n-gram posteriors.

    Options: ngramc_path, ngramc_order

  • length: Target sentence length model.

    Options: src_test_raw, length_model_weights, use_length_point_probs

  • extlength: External target sentence lengths.

    Options: extlength_path

All predictors can be combined with one or more wrapper predictors by adding the wrapper name, separated by a _ symbol. The following wrappers are available:
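For example, a hypothetical invocation that wraps the nmt predictor with idxmap and combines it with an fst predictor might look like this (all file names and paths are placeholders):

```shell
$ python decode.py --predictors idxmap_nmt,fst \
    --src_idxmap src.imap --trg_idxmap trg.imap \
    --fst_path lattices/%d.fst ...
```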

  • parse: Internal beam search over a representation which contains some pre-defined non-terminal ids, which should not appear in the output.

    Options: parse_tok_grammar, parse_bpe_path, syntax_path, syntax_bpe_path, syntax_word_out, normalize_fst_weights, syntax_norm_alpha, syntax_internal_beam, syntax_max_internal_len, syntax_allow_early_eos, syntax_consume_ooc, syntax_terminal_restrict, syntax_internal_only, syntax_eow_ids, syntax_terminal_ids

  • idxmap: Add this wrapper to predictors which use an alternative word map.

    Options: src_idxmap, trg_idxmap

  • altsrc: This wrapper loads source sentences from an alternative source.

    Options: altsrc_test

  • ngramize: Extracts n-gram posteriors from a predictor without feedback loop.

    Options: min_ngram_order, max_ngram_order, max_len_factor

  • skipvocab: Uses internal beam search to skip a subset of the predictor vocabulary.

    Options: beam, skipvocab_max_id, skipvocab_stop_size

  • unkvocab: This wrapper explicitly excludes matching word indices higher than pred_trg_vocab_size with UNK scores.

    Options: pred_trg_vocab_size

  • fsttok: Uses an FST to transduce SGNMT tokens to predictor tokens.

    Options: fsttok_path, fsttok_max_pending_score, fst_unk_id

  • word2char: Wraps word-level predictors when SGNMT is running on character level.

    Options: word2char_map

Note that you can use multiple instances of the same predictor. For example, nmt,nmt,nmt can be used for ensembling three NMT systems. You can often override parts of the predictor configuration for subsequent predictors by appending the predictor number to the argument name (e.g. see --nmt_config2 or --fst_path2).
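For instance, a hypothetical three-system ensemble with per-system configurations might be invoked as follows (config names are placeholders, and --nmt_config3 assumes the numbering pattern extends beyond 2):

```shell
$ python decode.py --predictors nmt,nmt,nmt \
    --predictor_weights 0.4,0.3,0.3 \
    --nmt_config cfg1.yaml --nmt_config2 cfg2.yaml --nmt_config3 cfg3.yaml ...
```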

Detailed descriptions are available below in the modules.

Predictor modules

cam.sgnmt.predictors.automata module

This module encapsulates the predictor interface to OpenFST and therefore depends on OpenFST. To enable Python support in OpenFST, use a recent version (>=1.5.4) and compile with --enable-python. Further information can be found here:

http://www.openfst.org/twiki/bin/view/FST/PythonExtension

This file includes the fst, nfst, and rtn predictors.

Note: If we use arc weights in FSTs, we multiply them by -1 since everything in SGNMT is a log probability, not a negative log probability as in the log or tropical semirings used by FSTs. You can disable this behavior with --fst_to_log

Note 2: The FSTs and RTNs are assumed to contain both <S> and </S>. This is for compatibility reasons, as lattices generated by HiFST contain these symbols.

cam.sgnmt.predictors.automata.EPS_ID = 0

OpenFST’s reserved ID for epsilon arcs.

class cam.sgnmt.predictors.automata.FstPredictor(fst_path, use_weights, normalize_scores, skip_bos_weight=True, to_log=True)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor can read determinized translation lattices. The predictor state consists of the current node. This is unique as the lattices are determinized.

Creates a new fst predictor.

Parameters:
  • fst_path (string) – Path to the FST file
  • use_weights (bool) – If false, replace all arc weights with 0 (=log 1).
  • normalize_scores (bool) – If true, we normalize the weights on all outgoing arcs such that they sum up to 1
  • skip_bos_weight (bool) – Add the score at the <S> arc to the </S> arc if this is false. This results in scores consistent with OpenFST’s replace operation, as <S> scores are normally ignored by SGNMT.
  • to_log (bool) – SGNMT uses normal log probs (scores) while arc weights in FSTs normally have cost (i.e. neg. log values) semantics. Therefore, if true, we multiply arc weights by -1.
consume(word)[source]

Updates the current node by following the arc labelled with word. If there is no such arc, we set cur_node to -1, indicating that the predictor is in an invalid state. In this case, all subsequent predict_next calls will return the empty set.

Parameters:word (int) – Word on an outgoing arc from the current node
Returns:float. Weight on the traversed arc
estimate_future_cost(hypo)[source]

The FST predictor comes with its own heuristic function. We use the shortest path in the fst as future cost estimator.

get_state()[source]

Returns the current node.

get_unk_probability(posterior)[source]

Always returns negative infinity: Words outside the translation lattice are not possible according to this predictor.

Returns:float. Negative infinity
initialize(src_sentence)[source]

Loads the FST from the file system and consumes the start of sentence symbol.

Parameters:src_sentence (list) – Not used
initialize_heuristic(src_sentence)[source]

Creates a matrix of shortest distances between nodes.

is_equal(state1, state2)[source]

Returns true if the current node is the same

predict_next()[source]

Uses the outgoing arcs from the current node to build up the scores for the next word.

Returns:dict. Set of words on outgoing arcs from the current node together with their scores, or an empty set if we currently have no active node or fst.
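The consume/predict_next behaviour over a deterministic lattice can be sketched on a toy dict-based structure (illustrative only; the real predictor walks OpenFST arcs):

```python
# Toy deterministic lattice: node -> {word_id: (next_node, arc_weight)}
# (hypothetical structure standing in for OpenFST arcs)
ARCS = {
    0: {4: (1, -0.1), 5: (2, -2.3)},
    1: {2: (3, 0.0)},  # word 2 plays the role of </S> here
}

def predict_next(cur_node):
    """Scores of words on outgoing arcs; empty dict if no active node."""
    if cur_node < 0 or cur_node not in ARCS:
        return {}
    return {word: weight for word, (_, weight) in ARCS[cur_node].items()}

def consume(cur_node, word):
    """Follow the arc labelled `word`; -1 marks the invalid state."""
    arc = ARCS.get(cur_node, {}).get(word)
    return arc[0] if arc else -1
```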
set_state(state)[source]

Sets the current node.

class cam.sgnmt.predictors.automata.NondeterministicFstPredictor(fst_path, use_weights, normalize_scores, skip_bos_weight=True, to_log=True)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor can handle non-deterministic translation lattices. In contrast to the fst predictor for deterministic lattices, we store a set of nodes which are all reachable from the start node through the current history.

Creates a new nfst predictor.

Parameters:
  • fst_path (string) – Path to the FST file
  • use_weights (bool) – If false, replace all arc weights with 0 (=log 1).
  • normalize_scores (bool) – If true, we normalize the weights on all outgoing arcs such that they sum up to 1
  • skip_bos_weight (bool) – If true, set weights on <S> arcs to 0 (= log1)
  • to_log (bool) – SGNMT uses normal log probs (scores) while arc weights in FSTs normally have cost (i.e. neg. log values) semantics. Therefore, if true, we multiply arc weights by -1.
consume(word)[source]

Updates the current nodes by searching for all nodes which are reachable from the current nodes by a path consisting of any number of epsilons and exactly one word label. If there is no such arc, we set the predictor in an invalid state. In this case, all subsequent predict_next calls will return the empty set.

Parameters:word (int) – Word on an outgoing arc from the current node
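The reachability rule above (any number of epsilons plus exactly one word label) can be sketched on a toy non-deterministic lattice; the dict-based structure is illustrative, not the OpenFST-backed implementation:

```python
EPS_ID = 0  # OpenFST's reserved epsilon label

# Toy non-deterministic lattice: node -> list of (label, next_node)
ARCS = {
    0: [(EPS_ID, 1), (7, 2)],
    1: [(7, 3)],
}

def epsilon_closure(nodes):
    """All nodes reachable from `nodes` via epsilon arcs only."""
    stack, closure = list(nodes), set(nodes)
    while stack:
        node = stack.pop()
        for label, dest in ARCS.get(node, []):
            if label == EPS_ID and dest not in closure:
                closure.add(dest)
                stack.append(dest)
    return closure

def consume(cur_nodes, word):
    """Nodes reachable by any number of epsilons plus one `word` arc."""
    reached = set()
    for node in epsilon_closure(cur_nodes):
        for label, dest in ARCS.get(node, []):
            if label == word:
                reached.add(dest)
    return reached  # empty set marks the invalid state
```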
estimate_future_cost(hypo)[source]

The FST predictor comes with its own heuristic function. We use the shortest path in the fst as future cost estimator.

get_state()[source]

Returns the set of current nodes

get_unk_probability(posterior)[source]

Always returns negative infinity: Words outside the translation lattice are not possible according to this predictor.

Returns:float. Negative infinity
initialize(src_sentence)[source]

Loads the FST from the file system and consumes the start of sentence symbol.

Parameters:src_sentence (list) – Not used
initialize_heuristic(src_sentence)[source]

Creates a matrix of shortest distances between all nodes

is_equal(state1, state2)[source]

Returns true if the current nodes are the same

predict_next()[source]

Uses the outgoing arcs from all current nodes to build up the scores for the next word. This method does not follow epsilon arcs: consume updates cur_nodes such that all reachable arcs with word ids are connected directly with a node in cur_nodes. If there are multiple arcs with the same word, we use the log sum of the arc weights as score.

Returns:dict. Set of words on outgoing arcs from the current node together with their scores, or an empty set if we currently have no active nodes or fst.
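The log-sum merging of duplicate arc labels might look like this (a sketch using numpy's logaddexp, not SGNMT's code):

```python
import math
import numpy as np

def merge_arc_scores(arcs):
    """If several outgoing arcs carry the same word, score the word
    with the log-sum of the individual arc log weights."""
    scores = {}
    for word, weight in arcs:
        if word in scores:
            scores[word] = np.logaddexp(scores[word], weight)
        else:
            scores[word] = weight
    return scores

# Word 4 appears on two arcs; its merged score is log(0.2 + 0.3)
posterior = merge_arc_scores(
    [(4, math.log(0.2)), (4, math.log(0.3)), (5, math.log(0.5))])
```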
set_state(state)[source]

Sets the set of current nodes

class cam.sgnmt.predictors.automata.RtnPredictor(rtn_path, use_weights, normalize_scores, to_log=True, minimize_rtns=False, rmeps=True)[source]

Bases: cam.sgnmt.predictors.core.Predictor

Predictor for RTNs (recurrent transition networks). This predictor assumes a directory structure as produced by HiFST. You can use this predictor for non-deterministic lattices too. This implementation supports late expansion: RTNs are only expanded as far as necessary to retrieve all currently reachable states.

cur_nodes contains the accumulated weights from the last consumed word (if ambiguous, the largest)

This implementation does not maintain a list of active nodes like the other automata predictors. Instead, we store the current history and search for the active nodes at each expansion. This is more expensive, but fstreplace might change state IDs so a list of active nodes might get corrupted.

Note that this predictor does not support FSTs in gzip format.

Creates a new RTN predictor.

Parameters:
  • rtn_path (string) – Path to the RTN directory
  • use_weights (bool) – If false, replace all arc weights with 0 (=log 1).
  • normalize_scores (bool) – If true, we normalize the weights on all outgoing arcs such that they sum up to 1
  • to_log (bool) – SGNMT uses normal log probs (scores) while arc weights in FSTs normally have cost (i.e. neg. log values) semantics. Therefore, if true, we multiply arc weights by -1.
  • minimize_rtns (bool) – Minimize the FST after each replace operation
  • rmeps (bool) – Remove epsilons in the FST after each replace operation
add_to_label_fst_map_recursive(label_fst_map, visited_nodes, root_node, acc_weight, history, func)[source]

Adds arcs to label_fst_map if they are labeled with an NT symbol and reachable from root_node via history.

Note: visited_nodes is maintained for each history separately

consume(word)[source]

Adds word to the current history.

expand_rtn(func)[source]

This method expands the RTN as far as necessary. This means that the RTN is expanded s.t. we can build the posterior for cur_history. In practice, this means that we follow all epsilon edges and replaces all NT edges until all paths with the prefix cur_history in the RTN have at least one more terminal token. Then, we apply func to all reachable nodes.

get_state()[source]

Returns the current history.

get_sub_fst(fst_id)[source]

Load sub fst from the file system or the cache

get_unk_probability(posterior)[source]

Always returns negative infinity: Words outside the RTN are not possible according to this predictor.

Returns:float. Negative infinity
initialize(src_sentence)[source]

Loads the root RTN and consumes the start of sentence symbol.

Parameters:src_sentence (list) – Not used
is_nt_label(label)[source]

Returns true if label is a non-terminal.

predict_next()[source]

Expands RTN as far as possible and uses the outgoing edges from nodes reachable by the current history to build up the posterior for the next word. If there are no such nodes or arcs, or no root FST is loaded, return the empty set.

set_state(state)[source]

Sets the current history.

cam.sgnmt.predictors.blocks_nmt module

This is the only module outside the blocks package with a dependency on the Blocks framework. It contains the neural machine translation predictor nmt. Code is partially taken from the neural machine translation example in Blocks.

https://github.com/mila-udem/blocks-examples/tree/master/machine_translation

Note that using this predictor slows down decoding compared to the original NMT decoding because search cannot be parallelized. However, it is much more flexible as it can be combined with other predictors.

class cam.sgnmt.predictors.blocks_nmt.BlocksNMTPredictor(nmt_model_path, gnmt_beta, enable_cache, config)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This is the neural machine translation predictor. The predicted posteriors are equal to the distribution generated by the decoder network in NMT. This predictor heavily relies on the NMT example in blocks. Note that this predictor cannot be used in combination with a target side sparse feature map. See BlocksUnboundedNMTPredictor for that case.

Creates a new NMT predictor.

Parameters:
  • nmt_model_path (string) – Path to the NMT model file (.npz)
  • gnmt_beta (float) – If greater than 0.0, add a Google NMT style coverage penalization term (Wu et al., 2016) to the predictive scores
  • enable_cache (bool) – The NMT predictor usually has a very limited vocabulary size, and a large number of UNKs in hypotheses. This enables reusing already computed predictor states for hypotheses which differ only by NMT OOV words.
  • config (dict) – NMT configuration
Raises:

ValueError. If a target sparse feature map is defined

consume(word)[source]

Feeds back word to the decoder network. This includes embedding word, running the attention network, and updating the recurrent decoder layer.

get_state()[source]

The NMT predictor state consists of the decoder network state, and (for caching) the current history of consumed words

get_unk_probability(posterior)[source]

Returns the UNK probability defined by NMT.

initialize(src_sentence)[source]

Runs the encoder network to create the source annotations for the source sentence. If the cache is enabled, empty the cache.

Parameters:src_sentence (list) – List of word ids without <S> and </S> which represent the source sentence.
is_equal(state1, state2)[source]

Returns true if the history is the same

is_history_cachable()[source]

Returns true if cache is enabled and history contains UNK

predict_next()[source]

Uses cache or runs the decoder network to get the distribution over the next target words.

Returns:np array. Full distribution over the entire NMT vocabulary for the next target token.
set_state(state)[source]

Set the NMT predictor state.

set_up_predictor(nmt_model_path)[source]

Initializes the predictor with the given NMT model. Code following blocks.machine_translation.main.

class cam.sgnmt.predictors.blocks_nmt.BlocksUnboundedNMTPredictor(nmt_model_path, gnmt_beta, config)[source]

Bases: cam.sgnmt.predictors.blocks_nmt.BlocksNMTPredictor, cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

This is a version of the NMT predictor which assumes an unbounded vocabulary. Therefore, this predictor can only be used when other predictors (like fst) define the words to score. Using this predictor is mandatory when a target sparse feature map is provided.

Creates a new NMT predictor with unbounded vocabulary.

Parameters:
  • nmt_model_path (string) – Path to the NMT model file (.npz)
  • config (dict) – NMT configuration
consume(word)[source]

Feeds back word to the decoder network. This includes embedding word, running the attention network, and updating the recurrent decoder layer.

get_unk_probability(posterior)[source]

Returns negative infinity as this is an unbounded predictor.

is_equal(state1, state2)[source]

Returns true if the history is the same

predict_next(words)[source]

Uses cache or runs the decoder network to get the distribution over the next target words.

Returns:Scores for the next target token, restricted to the requested words.
set_up_predictor(nmt_model_path)[source]

Initializes the predictor with the given NMT model. Code following blocks.machine_translation.main.

class cam.sgnmt.predictors.blocks_nmt.MyopticSearch(samples)[source]

Bases: blocks.search.BeamSearch

This class hacks into the Blocks beam search to reuse its initialization routines. Note that this has nothing to do with SGNMT's high level decoding in cam.sgnmt.decoding. We basically replace search() with single_step_decoding(), which generates the posteriors for the next word. Thus, it fits into the predictor framework. We try to use BeamSearch functionality wherever possible.

Calls the BeamSearch constructor

class cam.sgnmt.predictors.blocks_nmt.MyopticSparseSearch(samples, trg_sparse_feat_map)[source]

Bases: cam.sgnmt.blocks.sparse_search.SparseBeamSearch

Variant of MyopticSearch for target side sparse features.

Calls the SparseBeamSearch constructor

cam.sgnmt.predictors.bow module

cam.sgnmt.predictors.core module

This module contains the two basic predictor interfaces for bounded and unbounded vocabulary predictors.

class cam.sgnmt.predictors.core.Predictor[source]

Bases: cam.sgnmt.utils.Observer

A predictor produces the predictive probability distribution of the next word given the state of the predictor. The state may change during predict_next() and consume(). The functions get_state() and set_state() can be used for non-greedy decoding. Note: The state describes the predictor with the current history. It does not encapsulate the current source sentence, i.e. you cannot recover a predictor state if initialize() was called in between. predict_next() and consume() must be called alternately. This holds even when using get_state() and set_state(): Loading/saving states is transparent to the predictor instance.

Initializes current_sen_id with 0.
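A minimal standalone sketch of the interface contract described above (a real predictor would subclass cam.sgnmt.predictors.core.Predictor; ConstantPredictor and its toy behaviour are hypothetical):

```python
class ConstantPredictor:
    """Sketch of the Predictor interface: always predicts word 2.
    Illustrates the initialize/predict_next/consume/state contract."""

    def initialize(self, src_sentence):
        self.history = []  # reset per-sentence state

    def predict_next(self):
        return {2: 0.0}  # log prob 1 for word 2

    def consume(self, word):
        self.history.append(word)  # extend the history

    def get_unk_probability(self, posterior):
        return float("-inf")  # words outside the posterior are impossible

    # get_state()/set_state() make non-greedy search possible
    def get_state(self):
        return list(self.history)

    def set_state(self, state):
        self.history = list(state)

pred = ConstantPredictor()
pred.initialize([3, 4])
pred.consume(2)
```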

consume(word)[source]

Expand the current history by word and update the internal predictor state accordingly. Two calls of consume() must be separated by a predict_next() call.

Parameters:word (int) – Word to add to the current history
estimate_future_cost(hypo)[source]

Predictors can implement their own look-ahead cost functions. They are used in A* search if the --heuristics parameter is set to predictor. This function should return the future log cost (i.e. the lower the better) given the current predictor state, assuming that the last word in the partial hypothesis hypo is consumed next. This function must not change the internal predictor state.

Parameters:hypo (PartialHypothesis) – Hypothesis for which to estimate the future cost given the current predictor state
Returns
float. Future cost
finalize_posterior(scores, use_weights, normalize_scores)[source]

This method can be used to enforce the parameters use_weights and normalize_scores in predictors with dict posteriors.

Parameters:
  • scores (dict) – unnormalized log valued scores
  • use_weights (bool) – Set to false to replace all values in scores with 0 (= log 1)
  • normalize_scores – Set to true to make the exp of elements in scores sum up to 1
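A sketch of the described semantics (not SGNMT's implementation):

```python
import math

def finalize_posterior(scores, use_weights, normalize_scores):
    """Mimics the documented behaviour: zero out scores if weights
    are disabled, otherwise optionally renormalize in log space."""
    if not use_weights:
        return {w: 0.0 for w in scores}  # all scores become log 1
    if normalize_scores:
        total = math.log(sum(math.exp(s) for s in scores.values()))
        return {w: s - total for w, s in scores.items()}
    return dict(scores)

post = finalize_posterior({4: math.log(0.2), 5: math.log(0.6)}, True, True)
```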
get_state()[source]

Get the current predictor state. The state can be any object or tuple of objects which makes it possible to return to the predictor state with the current history.

Returns:object. Predictor state
get_unk_probability(posterior)[source]

This function defines the probability of all words which are not in posterior. This is usually used to combine open and closed vocabulary predictors. The argument posterior should have been produced with predict_next().

Parameters:posterior (list,array,dict) – Return value of the last call of predict_next
Returns:Score to use for words outside posterior
Return type:float
initialize(src_sentence)[source]

Initialize the predictor with the given source sentence. This resets the internal predictor state and loads everything which is constant throughout the processing of a single source sentence. For example, the NMT decoder runs the encoder network and stores the source annotations.

Parameters:src_sentence (list) – List of word IDs which form the source sentence without <S> or </S>
initialize_heuristic(src_sentence)[source]

This is called after initialize() if the predictor is registered as heuristic predictor (i.e. estimate_future_cost() will be called in the future). Predictors can implement this function for initialization of their own heuristic mechanisms.

Parameters:src_sentence (list) – List of word IDs which form the source sentence without <S> or </S>
is_equal(state1, state2)[source]

Returns true if two predictor states are equal, i.e. both states will always result in the same scores. This is used for hypothesis recombination

Parameters:
  • state1 (object) – First predictor state
  • state2 (object) – Second predictor state
Returns:

bool. True if both states are equal, false if not

notify(message, message_type=1)[source]

We implement the notify method from the Observer super class with an empty method here s.t. predictors do not need to implement it.

Parameters:message (object) – The posterior sent by the decoder
predict_next()[source]

Returns the predictive distribution over the target vocabulary for the next word given the predictor state. Note that the prediction itself can change the state of the predictor. For example, the neural predictor updates the decoder network state and its attention to predict the next word. Two calls of predict_next() must be separated by a consume() call.

Returns:dictionary,array,list. Word log probabilities for the next target token. All ids which are not set are assumed to have probability get_unk_probability()
set_current_sen_id(cur_sen_id)[source]

This function is called between initialize() calls to increment the sentence id counter. It can also be used to skip sentences for the --range argument.

Parameters:cur_sen_id (int) – Sentence id for the next call of initialize()
set_state(state)[source]

Loads a predictor state from an object created with get_state(). Note that this does not copy the argument but just references the given state. If state is going to be used in the future to return to that point again, you should copy the state with copy.deepcopy() before.

Parameters:state (object) – Predictor state as returned by get_state()
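A small illustration of why mutable states must be copied before reuse (the list below stands in for a mutable predictor state):

```python
import copy

hist = [1, 5, 7]                # stand-in for a mutable predictor state
snapshot = copy.deepcopy(hist)  # safe to return to later
hist.append(9)                  # the predictor keeps mutating its state
# `snapshot` is unaffected, so set_state(snapshot) would restore
# the earlier point; a plain reference would have been corrupted.
```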
class cam.sgnmt.predictors.core.UnboundedVocabularyPredictor[source]

Bases: cam.sgnmt.predictors.core.Predictor

Predictors under this class implement models with very large target vocabularies, for which it is too inefficient to list the entire posterior. Instead, they are evaluated only for a given list of target words. This list is usually created by taking all non-zero probability words from the bounded vocabulary predictors. An example of an unbounded vocabulary predictor is the ngram predictor: instead of listing the entire n-gram vocabulary, we run srilm only on the words which are possible according to the other predictors (e.g. fst or nmt). This is realized by introducing the trgt_words argument to predict_next.

Initializes current_sen_id with 0.

predict_next(trgt_words)[source]

Like in Predictor, returns the predictive distribution over target words given the predictor state. Note that the prediction itself can change the state of the predictor. For example, the neural predictor updates the decoder network state and its attention to predict the next word. Two calls of predict_next() must be separated by a consume() call.

Parameters:trgt_words (list) – List of target word ids.
Returns:dictionary,array,list. Word log probabilities for the next target token. All ids which are not set are assumed to have probability get_unk_probability(). The returned set should not contain any ids which are not in trgt_words, but it does not have to score all of them
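Unbounded-vocabulary scoring can be sketched with a toy score table (NGRAM_LM and the word ids are made up):

```python
import math

# Toy language model scores: word id -> log probability
NGRAM_LM = {1: math.log(0.5), 2: math.log(0.2), 3: math.log(0.3)}

def predict_next(trgt_words):
    """Score only the requested ids, unbounded-vocabulary style.
    Ids without an entry are left to get_unk_probability()."""
    return {w: NGRAM_LM[w] for w in trgt_words if w in NGRAM_LM}

posterior = predict_next([2, 3, 99])  # 99 falls back to the UNK score
```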

cam.sgnmt.predictors.ffnnlm module

This module integrates neural language models, for example feed-forward language models like NPLM. It depends on the Python interface to NPLM.

http://nlg.isi.edu/software/nplm/

class cam.sgnmt.predictors.ffnnlm.NPLMPredictor(path, normalize_scores)[source]

Bases: cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

NPLM language model predictor. Even though NPLM normally has a limited vocabulary size, we implement it as an unbounded vocabulary predictor because it is more efficient to score only a subset of the vocabulary. This predictor uses the Python interface to NPLM from

http://nlg.isi.edu/software/nplm/

Creates a new NPLM predictor instance.

Parameters:
  • path (string) – Path to the NPLM model file
  • normalize_scores (bool) – Whether to renormalize scores s.t. scores returned by predict_next sum up to 1
Raises:

NameError. If NPLM is not installed

consume(word)[source]

Extend current history by word

get_state()[source]

Returns the current history

get_unk_probability(posterior)[source]

Use NPLM UNK score if exists

initialize(src_sentence)[source]

Set the n-gram history to initial value.

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Returns true if the ngram history is the same

predict_next(words)[source]

Scores the words in words using NPLM.

set_state(state)[source]

Sets the current history

cam.sgnmt.predictors.forced module

This module contains predictors for forced decoding. This can be done either with one reference (forced, ForcedPredictor) or with multiple references in the form of an n-best list (forcedlst, ForcedLstPredictor).

class cam.sgnmt.predictors.forced.ForcedLstPredictor(trg_test_file, use_scores=True, match_unk=False, feat_name=None)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor can be used for direct n-best list rescoring. In contrast to the ForcedPredictor, it reads an n-best list in Moses format and uses its scores as predictive probabilities of the </S> symbol. Everywhere else it gives the predictive probability 1 if the history corresponds to at least one n-best list entry, 0 otherwise. From the n-best list we use the first column (sentence id), the second column (hypothesis in integer format), and the last column (score).

Note: Behavior is undefined if you have duplicates in the n-best list

TODO: Would be much more efficient to use Tries for cur_trgt_sentences instead of a flat list.
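The column layout described above could be parsed like this (a sketch; parse_moses_nbest_line and the sample line are illustrative, not part of SGNMT):

```python
def parse_moses_nbest_line(line):
    """Split a Moses-format n-best entry of the shape
    `id ||| hypothesis ||| features ||| score` into its columns."""
    fields = [f.strip() for f in line.split("|||")]
    sen_id = int(fields[0])                       # first column
    hypo = [int(tok) for tok in fields[1].split()]  # second column
    score = float(fields[-1])                     # last column
    return sen_id, hypo, score

sen_id, hypo, score = parse_moses_nbest_line(
    "0 ||| 12 7 4 ||| lm= -2.1 tm= -3.0 ||| -5.1")
```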

Creates a new n-best rescoring predictor instance.

Parameters:
  • trg_test_file (string) – Path to the n-best list
  • use_scores (bool) – Whether to use the scores from the n-best list. If false, use uniform scores of 0 (=log 1).
  • match_unk (bool) – If true, allow any word where the n-best list contains UNK.
  • feat_name (string) – Instead of the combined score in the last column of the Moses n-best list, we can use one of the sparse features. Set this to the name of the feature (denoted as <name>= in the n-best list) if you wish to do that.
consume(word)[source]

Extends the current history by word.

get_state()[source]

Returns the current history.

get_unk_probability(posterior)[source]

Return negative infinity unconditionally - words outside the n-best list are not possible according to this predictor.

initialize(src_sentence)[source]

Resets the history and loads the n-best list entries for the next source sentence

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Returns true if the history is the same

predict_next()[source]

Outputs 0.0 (i.e. prob=1) for all words for which there is a continuing entry in cur_trg_sentences, and the score stored in cur_trg_sentences if the current history is by itself equal to an entry in cur_trg_sentences.

TODO: The implementation here is fairly inefficient as it scans through all target sentences linearly. Would be better to organize the target sentences in a Trie

set_state(state)[source]

Sets the current history.

class cam.sgnmt.predictors.forced.ForcedPredictor(trg_test_file)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor realizes forced decoding. It stores one target sentence for each source sentence and outputs predictive probability 1 along this path, and 0 otherwise.

Creates a new forced decoding predictor.

Parameters:trg_test_file (string) – Path to the plain text file with the target sentences. Must have the same number of lines as the number of source sentences to decode
consume(word)[source]

If word matches the target sentence, we extend the current history by it. Otherwise, we put this predictor into an invalid state in which it always predicts </S>

Parameters:word (int) – Next word to consume
get_state()[source]

cur_trg_sentence can be changed, so it is part of the predictor state

get_unk_probability(posterior)[source]

Returns negative infinity unconditionally: Words which are not in the target sentence have assigned probability 0 by this predictor.

initialize(src_sentence)[source]

Fetches the corresponding target sentence and resets the current history.

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Returns true if the state is the same

predict_next()[source]

Returns a dictionary with one entry and value 0 (=log 1). The key is either the next word in the target sentence or (if the target sentence has no more words) the end-of-sentence symbol.

set_state(state)[source]

Set the predictor state.
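The contract above can be sketched in a few lines (a minimal illustration, not the actual implementation; the EOS id and class name are assumptions):

```python
NEG_INF = float("-inf")


class ForcedSketch:
    """Minimal sketch of forced decoding: follow one reference, else go invalid."""

    EOS = 2  # assumed end-of-sentence id

    def __init__(self, trg_sentence):
        self.trg_sentence = trg_sentence + [self.EOS]
        self.n_consumed = 0  # length of the matched history

    def predict_next(self):
        # Exactly one word has log prob 0 (= prob 1): the next reference
        # word, or EOS once the reference is exhausted or the state invalid.
        if self.n_consumed < len(self.trg_sentence):
            return {self.trg_sentence[self.n_consumed]: 0.0}
        return {self.EOS: 0.0}

    def consume(self, word):
        if (self.n_consumed < len(self.trg_sentence)
                and word == self.trg_sentence[self.n_consumed]):
            self.n_consumed += 1
        else:
            # Invalid state: from now on, only EOS is predicted
            self.n_consumed = len(self.trg_sentence)

    def get_unk_probability(self, posterior):
        return NEG_INF  # words outside the reference have probability 0
```
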

cam.sgnmt.predictors.grammar module

This module contains everything related to the hiero predictor. This predictor allows applying rules from a syntactical SMT system directly in SGNMT. The main interface is RuleXtractPredictor, which can be used like other predictors during decoding. The Hiero predictor follows the LRHiero implementation from

https://github.com/sfu-natlang/lrhiero

Efficient Left-to-Right Hierarchical Phrase-based Translation with Improved Reordering. Maryam Siahbani, Baskaran Sankaran and Anoop Sarkar. EMNLP 2013. Oct 18-21, 2013. Seattle, USA.

However, note that we modified the code to a) deal with an arbitrary number of non-terminals, b) work with ruleXtract, and c) allow spurious ambiguity.

ATTENTION: This implementation is experimental!!

class cam.sgnmt.predictors.grammar.Cell(init_hypo=None)[source]

Comparable to a CYK cell: a set of hypotheses. If duplicates are added, we perform hypothesis combination by combining the costs and retaining only one of them. Internally, the hypotheses are stored in a list sorted by the sum of the translation prefix.

Creates a new Cell with only one hypothesis.

Parameters:init_hypo (LRHieroHypothesis) – Initial hypothesis
add(hypo)[source]

Add a new hypothesis to the cell. If an equivalent hypothesis already exists, combine both hypotheses.

Parameters:hypo (LRHieroHypothesis) – Hypothesis to add under the key hypo.key
filter(pos, symb)[source]

Remove all hypotheses which do not have symb at pos in their trgt_prefix. Breaks if pos is out of range for some trgt_prefix

findIdx(key, a, b)[source]

Find the index of the first element with the given key. If there is no such key, return the index of the last element with the largest key smaller than key. This is a recursive function which only searches the interval [a, b].
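A stand-alone sketch of such a recursive search over a sorted key list (illustrative only; the actual method searches the hypotheses stored in the cell):

```python
def find_idx(keys, key, a, b):
    """Recursive binary search in the sorted list keys[a..b] (inclusive).

    Returns the index of the first element equal to key, or, if key is
    absent, the index of the last element with the largest key smaller
    than key (falling back to a if no smaller key exists).
    """
    if a >= b:
        # Fall back to the predecessor if we overshot the key
        if keys[a] > key and a > 0:
            return a - 1
        return a
    mid = (a + b) // 2
    if keys[mid] < key:
        return find_idx(keys, key, mid + 1, b)
    return find_idx(keys, key, a, mid)
```
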

pop()[source]

Removes a hypothesis from the cell.

Returns:LRHieroHypothesis. The removed hypothesis
class cam.sgnmt.predictors.grammar.LRHieroHypothesis(trgt_prefix, spans, cost)[source]

Represents a LRHiero hypothesis, which is defined by the accumulated cost, the target prefix, and open source spans.

Creates a new LRHiero hypothesis

Parameters:
  • trgt_prefix (list) – Target side translation prefix, i.e. the partial target sentence which is translated so far
  • spans (list) – List of spans which are not covered yet, in left-to-right order on target side
  • cost (float) – Cost of this partial hypothesis
is_final()[source]

Returns true if this hypothesis has no open spans

class cam.sgnmt.predictors.grammar.Node[source]

Represents a node in the Trie.

class cam.sgnmt.predictors.grammar.Rule(rhs_src, rhs_trgt, trgt_src_map, cost)[source]

A rule consists of rhs_src and rhs_trgt, both are sequences of integers. NTs are indicated with negative sign. The trgt_src_map defines which NT on the target side belongs to which NT on the source side.

Creates a new rule.

Parameters:
  • rhs_src (list) – Source on the right hand side of the rule
  • rhs_trgt (list) – Target on the right hand side of the rule
  • trgt_src_map (dict) – Defines which NT on the target side belongs to which NT on the source side
last_id = 0
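To illustrate the encoding (the word ids and the keying convention of trgt_src_map are hypothetical here; the exact convention is defined by this module):

```python
# Hypothetical word ids: 10="der", 20="the"; NTs carry a negative sign.
rhs_src = [-1, 10, -2]    # source:  X1 der X2
rhs_trgt = [-1, 20, -2]   # target:  X1 the X2

# Map each target-side NT to its source-side counterpart (keyed here by
# position in rhs_trgt; an assumption for illustration).
trgt_src_map = {0: 0, 2: 2}


def nt_positions(rhs):
    """Positions of non-terminals, i.e. entries with negative sign."""
    return [i for i, sym in enumerate(rhs) if sym < 0]
```
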
class cam.sgnmt.predictors.grammar.RuleSet[source]

This class stores the set of rules and provides efficient retrieval and matching functionality

Initializes the set by setting up the trie data structure for storing the rules.

INF = 10000
create_rule(rhs_src, rhs_trgt, weight)[source]

Creates a rule object (factory method)

Parameters:
  • rhs_src (list) – String sequence describing the source of the right-hand-side of the rule
  • rhs_trgt (list) – String sequence describing the target of the right-hand-side of the rule
  • weight (float) – Rule weight
Returns:

Rule or None if something went wrong

expand_hypo(hypo, src_seq)[source]

Combines getSpanRules() and GrowHypothesis() from Alg. 1 in (Siahbani, 2013). Gets all rules which match the given span.

  • If the p parameter of the span is a single non-terminal, we return hypotheses resulting from productions of this non-terminal. Note that rules might be applicable in many different ways: X -> A the B can be applied to foo the bar the baz in two ways. In this case, we add the translation prefix, but leave the borders of the span untouched, and change the p value to the rhs of the production (i.e. “A the B”). If p consists of multiple characters, the spans store the minimum and maximum length, not the begin and end, since the exact begin and end positions are variable.
  • If the p parameter of the span has length > 1, we return a set of hypotheses in which the first subspan has a single NT as p parameter.

Through this contract we can handle e.g. spurious ambiguity when two NTs are on the source side. However, resolving this ambiguity is implemented lazily: we delay fixing the span boundaries until we need to expand the hypothesis again, and then we fix only the boundaries of the first span.

Parameters:
  • hypo (LRHieroHypothesis) – Hypothesis to expand
  • src_seq (list) – Source sequence to match
parse(line, feature_weights=None)[source]

Parse a line in a rule file from ruleXtract and add the rule to the set.

Parameters:
  • line (string) –
  • feature_weights (list) – score or None to use uniform weights
update_span_len_range()[source]

This method updates the span_len_range variable by finding boundaries for the spans each non-terminal can cover. This is done iteratively: First, initialize the range for each NT to (0, inf). Then, iterate through all rules for a specific NT and adjust the boundaries given the ranges for all other NTs. Do this until the ranges do not change anymore. This is an expensive operation and should be done after adding all rules. Note also that the tries store a reference to self.span_len_range, i.e. the variable is propagated to all tries automatically.
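The fixed-point iteration can be sketched as a stand-alone function (a simplified version assuming every non-terminal has at least one non-recursive rule, so the minimum lengths converge):

```python
INF = 10000


def update_span_len_range(rules_by_nt):
    """Fixed-point computation of (min, max) source span lengths per NT.

    rules_by_nt maps each NT id (a negative int) to a list of source
    right-hand sides (lists of ints, NTs negative, terminals >= 0).
    """
    ranges = {nt: (0, INF) for nt in rules_by_nt}
    changed = True
    while changed:
        changed = False
        for nt, rhss in rules_by_nt.items():
            # A terminal covers exactly 1 token; an NT covers its range.
            lo = min(sum(1 if s >= 0 else ranges[s][0] for s in rhs)
                     for rhs in rhss)
            hi = min(INF, max(sum(1 if s >= 0 else ranges[s][1] for s in rhs)
                              for rhs in rhss))
            if (lo, hi) != ranges[nt]:
                ranges[nt] = (lo, hi)
                changed = True
    return ranges
```
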

class cam.sgnmt.predictors.grammar.RuleXtractPredictor(ruleXtract_path, use_weights, feature_weights=None)[source]

Bases: cam.sgnmt.predictors.core.Predictor

Predictor based on ruleXtract rules. Bins are organized according to the number of target words. We assume that no rule produces the empty word on the source side (but possibly on the target side). Hypotheses are produced iteratively s.t. the following invariant holds: The bins contain a set of (partial) hypotheses from which we can derive all full hypotheses which are consistent with the current target prefix (i.e. the prefix of the target sentence which has already been translated). This set is updated when calling either consume_word or predict_next: consume_word deletes all hypotheses which become inconsistent with the new word. predict_next requires all hypotheses to have a target_prefix length of at least one plus the number of consumed words. Therefore, predict_next expands hypotheses as long as they are shorter. This fits nicely with grouping hypotheses in bins of the same target prefix length: we expand until all low-rank bins are empty. We predict the next target word by using the cost of the best hypothesis with the word at the right position.

Note that this predictor is similar to the decoding algorithm in

Efficient Left-to-Right Hierarchical Phrase-based Translation with Improved Reordering. Maryam Siahbani, Baskaran Sankaran and Anoop Sarkar. EMNLP 2013. Oct 18-21, 2013. Seattle, USA.

without cube pruning, but it is extended to an arbitrary number of non-terminals as produced with ruleXtract.

Creates a new hiero predictor.

Parameters:
  • ruleXtract_path (string) – Path to the rules file
  • use_weights (bool) – If false, set all hypothesis scores uniformly to 0 (= log 1). If true, use the rule weights to compute hypothesis scores
  • feature_weights (list) – Rule feature weights to compute the rule scores. If this is none we use uniform weights
build_posterior()[source]

We need to scan all hypotheses in self.stacks and add up scores grouped by the symbol at the n_consumed+1-th position. Then, we add end-of-sentence probability by checking self.finals[n_consumed]
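A stand-alone sketch of this grouping (assuming hypothesis scores are log probabilities recombined with log-sum-exp; the actual combination used by the predictor may differ):

```python
import math
from collections import defaultdict


def logsumexp(scores):
    """Numerically stable log(sum(exp(s) for s in scores))."""
    m = max(scores)
    return m + math.log(sum(math.exp(x - m) for x in scores))


def build_posterior(hypos, n_consumed, eos_id, final_score=None):
    """Group hypothesis scores by the word following the consumed prefix.

    hypos is a list of (trgt_prefix, score) pairs; final_score, if given,
    is the score of a completed hypothesis and licenses end-of-sentence.
    """
    grouped = defaultdict(list)
    for trgt_prefix, score in hypos:
        if len(trgt_prefix) > n_consumed:
            grouped[trgt_prefix[n_consumed]].append(score)
    posterior = {w: logsumexp(s) for w, s in grouped.items()}
    if final_score is not None:
        posterior[eos_id] = final_score
    return posterior
```
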

consume(word)[source]

Remove all hypotheses with translation prefixes which do not match word

get_state()[source]

Predictor state consists of the stacks, the completed hypotheses, and the number of consumed words.

get_unk_probability(posterior)[source]

Returns negative infinity if the posterior is not empty, as words outside the grammar are not possible according to this predictor. If the posterior is empty, return 0 (= log 1)

initialize(src_sentence)[source]

Delete all bins and add the initial cell to the first bin

predict_next()[source]

For predicting the distribution of the next target tokens, we need to empty the stack with the current history length by expanding all hypotheses on it. Then, all hypotheses are in larger bins, i.e. have a longer target prefix than the current history. Thus, we can look up the possible next words by iterating through all active hypotheses.

set_state(state)[source]

Set the predictor state.

class cam.sgnmt.predictors.grammar.Span(p, borders)[source]

Span is defined by the start and end position and the corresponding sequence of terminal and non-terminal symbols p. Normally, p is just a single NT symbol. However, if there is ambiguity in how to apply a rule to a span (e.g. rule X -> X the X to span foo the bar the baz), we allow resolving it later on demand. In this case, p = X the X

Fully initializes a new Span instance.

Parameters:
  • p (list) – See class docstring for Span
  • borders (tuple) – (begin, end) with begin inclusive and end exclusive
class cam.sgnmt.predictors.grammar.Trie(span_len_range)[source]

This trie implementation allows matching NT symbols with arbitrary symbol sequences of certain lengths when searching. Note: This trie does not implement edge collapsing - each edge is labeled with exactly one word

Creates an empty trie data structure.

Parameters:span_len_range (tuple) – minimum and maximum span lengths for non-terminal symbols
add(seq, element)[source]

Add an element to the trie data structure. The key sequence seq can contain non-terminals with negative IDs. If an element with the same key already exists in the data structure, we do not delete it but store both items.

Parameters:
  • seq (list) – Sequence of terminals and non-terminals used as key in the trie
  • element (object) – Object to associate with seq
get_all_elements()[source]

Retrieve all elements stored in the trie

get_elements(src_seq)[source]

Get all elements (e.g. rules) which match the given sequence of source tokens.

Parameters:src_seq (list) – Sequence of terminals and non-terminals used as key in the trie
Returns:(rules, nt_span_lens). The first dictionary contains all applying rules. nt_span_lens lists the number of symbols each of the NTs on the source side covers. Make sure that self.span_len_range is updated
Return type:two dicts
replace(seq, element)[source]

Replaces all elements stored at seq with a single new element. This is equivalent to first removing all items with key seq, and then adding the new element with add(seq, element)

Parameters:
  • seq (list) – Sequence of terminals and non-terminals used as key in the trie
  • element (object) – Object to associate with seq
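The matching behaviour can be sketched with a simplified trie in which every non-terminal edge consumes a span whose length lies inside span_len_range (illustrative only; the real class tracks a per-NT range and also returns the nt_span_lens bookkeeping described under get_elements()):

```python
class TrieSketch:
    """Trie whose negative-id (non-terminal) edges match variable-length spans."""

    def __init__(self, span_len_range):
        self.span_len_range = span_len_range  # (min_len, max_len) for NTs
        self.root = {"children": {}, "elems": []}

    def add(self, seq, element):
        node = self.root
        for sym in seq:
            node = node["children"].setdefault(
                sym, {"children": {}, "elems": []})
        node["elems"].append(element)  # keep both items on key clash

    def get_elements(self, src_seq):
        """All elements whose key matches src_seq completely."""
        lo, hi = self.span_len_range
        results = []

        def visit(node, pos):
            if pos == len(src_seq):
                results.extend(node["elems"])
            for sym, child in node["children"].items():
                if sym >= 0:  # terminal edge: must match the next token
                    if pos < len(src_seq) and src_seq[pos] == sym:
                        visit(child, pos + 1)
                else:         # non-terminal edge: consume lo..hi tokens
                    for n in range(lo, min(hi, len(src_seq) - pos) + 1):
                        visit(child, pos + n)

        visit(self.root, 0)
        return results
```
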

cam.sgnmt.predictors.length module

cam.sgnmt.predictors.misc module

This module provides helper predictors and predictor wrappers which are not directly used for scoring. An example is the altsrc predictor wrapper which loads source sentences from a different file.

class cam.sgnmt.predictors.misc.AltsrcPredictor(src_test, slave_predictor)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This wrapper loads the source sentences from an alternative source file. The src_sentence arguments of initialize and initialize_heuristic are overridden with sentences loaded from the file specified via the argument --altsrc_test. All other methods are pass through calls to the slave predictor.

Creates a new altsrc wrapper predictor.

Parameters:
  • src_test (string) – Path to the text file with source sentences
  • slave_predictor (Predictor) – Instance of the predictor which uses the source sentences in src_test
consume(word)[source]

Pass through to slave predictor

estimate_future_cost(hypo)[source]

Pass through to slave predictor

get_state()[source]

Pass through to slave predictor

get_unk_probability(posterior)[source]

Pass through to slave predictor

initialize(src_sentence)[source]

Pass through to slave predictor but replace src_sentence with a sentence from self.altsens

initialize_heuristic(src_sentence)[source]

Pass through to slave predictor but replace src_sentence with a sentence from self.altsens

is_equal(state1, state2)[source]

Pass through to slave predictor

predict_next()[source]

Pass through to slave predictor

set_current_sen_id(cur_sen_id)[source]

We need to override this method to propagate current_sentence_id to the slave predictor

set_state(state)[source]

Pass through to slave predictor
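The pass-through pattern used throughout this wrapper can be sketched generically (hypothetical class; SGNMT's own implementation delegates each method explicitly rather than via `__getattr__`):

```python
class PassThroughSketch:
    """Sketch of the wrapper pattern used by AltsrcPredictor: every call
    is delegated to the slave predictor, except initialize(), which
    swaps in a sentence loaded from the alternative source file."""

    def __init__(self, src_test, slave):
        self.slave = slave
        with open(src_test) as f:
            self.altsens = [list(map(int, line.split())) for line in f]
        self.cur_sen_id = 0

    def initialize(self, src_sentence):
        # Ignore src_sentence; use the alternative source instead
        self.slave.initialize(self.altsens[self.cur_sen_id])

    def __getattr__(self, name):
        # All other methods are pass-through calls to the slave
        return getattr(self.slave, name)
```
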

class cam.sgnmt.predictors.misc.UnboundedAltsrcPredictor(src_test, slave_predictor)[source]

Bases: cam.sgnmt.predictors.misc.AltsrcPredictor, cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

This class is a version of AltsrcPredictor for unbounded vocabulary predictors. This needs an adjusted predict_next method to pass through the set of target words to score correctly.

Pass through to AltsrcPredictor.__init__

predict_next(trgt_words)[source]

Pass through to slave predictor

cam.sgnmt.predictors.ngram module

This module contains predictors for n-gram (Kneser-Ney) language modeling. This is an UnboundedVocabularyPredictor, as the vocabulary size of n-gram models normally does not permit complete enumeration of the posterior.

This module is based on the swig-srilm package.

https://github.com/desilinguist/swig-srilm

class cam.sgnmt.predictors.ngram.SRILMPredictor(path, ngram_order, convert_to_ln=False)[source]

Bases: cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

SRILM predictor based on swig https://github.com/desilinguist/swig-srilm

The predictor state is described by the n-gram history. The language model has to use word indices rather than the string word representations.

Creates a new n-gram language model predictor.

Parameters:
  • path (string) – Path to the ARPA language model file
  • ngram_order (int) – Order of the language model
Raises:

NameError. If swig-srilm is not installed

consume(word)[source]

Extends the current history by word

get_state()[source]

Returns the current n-gram history

get_unk_probability(posterior)[source]

Use the probability for ‘<unk>’ in the language model

initialize(src_sentence)[source]

Initializes the history with the start-of-sentence symbol.

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Returns true if the ngram history is the same

predict_next(words)[source]

Score the set of target words with the n-gram language model given the current history

Parameters:words (list) – Set of words to score
Returns:dict. Language model scores for the words in words
set_state(state)[source]

Sets the current n-gram history

cam.sgnmt.predictors.parse module

class cam.sgnmt.predictors.parse.BpeParsePredictor(grammar_path, bpe_rule_path, slave_predictor, word_out=True, normalize_scores=True, norm_alpha=1.0, beam_size=1, max_internal_len=35, allow_early_eos=False, consume_out_of_class=False, eow_ids=None, terminal_restrict=True, terminal_ids=None, internal_only_restrict=False)[source]

Bases: cam.sgnmt.predictors.parse.TokParsePredictor

Predict over a BPE-based grammar with two possible grammar constraints: one between non-terminals and BPE start-of-word tokens, and one over the BPE tokens in a word

Creates a new parse predictor wrapper which can be constrained to two grammars: one over non-terminals / terminals, and one internal grammar constraining BPE units within a single word.

Parameters:
  • grammar_path (string) – Path to the grammar file
  • bpe_rule_path (string) – Path to file defining rules between BPEs
  • slave_predictor – predictor to wrap
  • word_out (bool) – since this wrapper can be used for grammar constraint, this bool determines whether we also do internal beam search over non-terminals
  • normalize_scores (bool) – true if normalizing scores, e.g. if some are removed from the posterior
  • norm_alpha (float) – may be used for path weight normalization
  • beam_size (int) – beam size for internal beam search
  • max_internal_len (int) – max number of consecutive nonterminals before path is ignored by internal search
  • allow_early_eos (bool) – true if permitting EOS consumed even if it is not permitted by the grammar at that point
  • consume_out_of_class (bool) – true if permitting any tokens to be consumed even if not allowed by the grammar at that point
  • eow_ids (string) – path to file containing ids of BPEs that mark the end of a word
  • terminal_restrict (bool) – true if applying grammar constraint over nonterminals and terminals
  • terminal_ids (string) – path to file containing all terminal ids
  • internal_only_restrict (bool) – true if applying grammar constraint over BPE units inside words
get_all_terminals(terminal_ids)[source]
get_bpe_can_follow(rule_path)[source]
get_eow_ids(eow_ids)[source]
is_nt(word)[source]
predict_next(predicting_next_word=False)[source]

predict next tokens as permitted by the current stack and the BPE grammar

update_stacks(word)[source]
class cam.sgnmt.predictors.parse.InternalHypo(score, token_score, predictor_state, word_to_consume)[source]

Bases: object

Helper class for internal parse predictor beam search over nonterminals

extend(score, predictor_state, word_to_consume)[source]
class cam.sgnmt.predictors.parse.ParsePredictor(slave_predictor, normalize_scores=True, beam_size=4, max_internal_len=35, nonterminal_ids=None)[source]

Bases: cam.sgnmt.predictors.core.Predictor

Predictor wrapper allowing internal beam search over a representation which contains some pre-defined ‘non-terminal’ ids, which should not appear in the output.

Create a new parse wrapper for a predictor.

Parameters:
  • slave_predictor – predictor to wrap
  • normalize_scores (bool) – whether to normalize posterior scores, e.g. after some tokens have been removed
  • beam_size (int) – beam size for internal beam search over non-terminals
  • max_internal_len (int) – number of consecutive non-terminal tokens allowed in internal search before path is ignored
  • nonterminal_ids – file containing non-terminal ids, one per line
are_best_terminal(posterior)[source]

Return true if most probable tokens in posterior are all terminals (including EOS)

consume(word, internal=False)[source]
find_word_beam(posterior)[source]

Internal beam search over posterior until a beam of terminals is found

get_state()[source]

Returns the current state.

get_unk_probability(posterior)[source]

Return the unk probability as determined by the slave predictor.

Returns:float. The unk probability

initialize(src_sentence)[source]

Initializes slave predictor with source sentence

Parameters:src_sentence (list) –
initialize_heuristic(src_sentence)[source]

Creates a matrix of shortest distances between nodes.

initialize_internal_hypos(posterior)[source]
is_equal(state1, state2)[source]

Returns true if the current node is the same

maybe_add_new_top_tokens(top_terminals, hypo, next_hypos)[source]
predict_next(predicting_internally=False)[source]

Predict next tokens.

Parameters:predicting_internally (bool) – true if called from the internal beam search; prevents an infinite loop
set_state(state)[source]

Sets the current state.

class cam.sgnmt.predictors.parse.TokParsePredictor(grammar_path, slave_predictor, word_out=True, normalize_scores=True, norm_alpha=1.0, beam_size=1, max_internal_len=35, allow_early_eos=False, consume_out_of_class=False)[source]

Bases: cam.sgnmt.predictors.parse.ParsePredictor

Unlike ParsePredictor, this predictor scores tokens according to a grammar. Use BpeParsePredictor if including rules to connect BPE units inside words.

Creates a new parse predictor wrapper.

Parameters:
  • grammar_path (string) – Path to the grammar file
  • slave_predictor – predictor to wrap
  • word_out (bool) – since this wrapper can be used for grammar constraint, this bool determines whether we also do internal beam search over non-terminals
  • normalize_scores (bool) – true if normalizing scores, e.g. if some are removed from the posterior
  • norm_alpha (float) – may be used for path weight normalization
  • beam_size (int) – beam size for internal beam search
  • max_internal_len (int) – max number of consecutive nonterminals before path is ignored by internal search
  • allow_early_eos (bool) – true if permitting EOS consumed even if it is not permitted by the grammar at that point
  • consume_out_of_class (bool) – true if permitting any tokens to be consumed even if not allowed by the grammar at that point
consume(word)[source]
Parameters:word (int) – word token being consumed
find_word(posterior)[source]

Check whether the rhs of the best option in the posterior is a terminal. If it is, return the posterior for decoding; if not, take the best result and follow that path until a word is found. This follows a greedy 1-best or a beam path through non-terminals.
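The greedy variant can be sketched as a small loop in which callables stand in for the predictor's own methods (illustrative only):

```python
def find_word_greedy(posterior, is_nt, consume, predict_next):
    """Follow the single best token through non-terminals until the
    best option in the posterior is a terminal, then return it."""
    while True:
        best = max(posterior, key=posterior.get)
        if not is_nt(best):
            return posterior  # rhs of the best option is a terminal
        consume(best)         # descend into the non-terminal
        posterior = predict_next()
```
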

find_word_beam(posterior)[source]

Do an internal beam search over non-terminal functions to find the next best n terminal tokens, as ranked by normalized path score

Returns: posterior containing up to n terminal tokens and their normalized path scores
find_word_greedy(posterior)[source]
get_current_allowed()[source]
get_state()[source]

Returns the current state, including slave predictor state

initialize(src_sentence)[source]
norm_hypo_score(hypo)[source]
norm_score(score, beam_len)[source]
predict_next(predicting_next_word=False)[source]

predict next tokens as permitted by the current stack and the grammar

prepare_grammar()[source]
replace_lhs()[source]
set_state(state)[source]

Sets the current state

update_stacks(word)[source]
cam.sgnmt.predictors.parse.load_external_ids(path)[source]

Load a file of ids into a list.

cam.sgnmt.predictors.structure module

This module implements constraints which assure that highly structured output is well-formatted. For example, the bracket predictor checks for balanced bracket expressions, and the OSM predictor prevents any sequence of operations which cannot be compiled to a string.

class cam.sgnmt.predictors.structure.BracketPredictor(max_terminal_id, closing_bracket_id, max_depth=-1, extlength_path='')[source]

Bases: cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

This predictor constrains the output to well-formed bracket expressions. It also allows specifying the number of terminals with an external length distribution file.

Creates a new bracket predictor.

Parameters:
  • max_terminal_id (int) – All IDs greater than this are brackets
  • closing_bracket_id (string) – All brackets except these ones are opening. Comma-separated list of integers.
  • max_depth (int) – If positive, restrict the maximum depth
  • extlength_path (string) – If this is set, restrict the number of terminals to the distribution specified in the referenced file. Terminals can be implicit: We count a single terminal between each adjacent opening and closing bracket.
consume(word)[source]

Updates current depth and the number of consumed terminals.

get_state()[source]

Returns the current depth and number of consumed terminals

get_unk_probability(posterior)[source]

Always returns 0.0

initialize(src_sentence)[source]

Sets the current depth to 0.

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Trivial implementation

predict_next(words)[source]

If the maximum depth is reached, exclude all opening brackets. If history is not balanced, exclude EOS. If the current depth is zero, exclude closing brackets.

Parameters:words (list) – Set of words to score
Returns:dict.
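The three exclusion rules can be sketched as a pure function (the ids are illustrative; terminals are assumed unconstrained here, which the actual predictor may refine with the external length distribution):

```python
NEG_INF = float("-inf")


def bracket_scores(words, cur_depth, max_depth, max_terminal_id,
                   closing_bracket_ids, eos_id):
    """Score each candidate word: 0.0 (= log 1) if allowed, -inf otherwise.

    Opening brackets are blocked at maximum depth, EOS is blocked while
    brackets are unbalanced, and closing brackets are blocked at depth 0.
    """
    scores = {}
    for w in words:
        if w == eos_id:
            ok = cur_depth == 0               # EOS only when balanced
        elif w <= max_terminal_id:
            ok = True                         # terminals unconstrained here
        elif w in closing_bracket_ids:
            ok = cur_depth > 0                # no closing bracket at depth 0
        else:                                 # opening bracket
            ok = max_depth < 0 or cur_depth < max_depth
        scores[w] = 0.0 if ok else NEG_INF
    return scores
```
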
set_state(state)[source]

Sets the current depth and number of consumed terminals

class cam.sgnmt.predictors.structure.ForcedOSMPredictor(trg_test_file)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor allows forced decoding with an OSM output, which essentially means running the OSM in alignment mode. This predictor assumes well-formed operation sequences. Please combine this predictor with the osm constraint predictor to satisfy this requirement. The state of this predictor is the compiled version of the current history. It allows terminal symbols which are consistent with the reference. The end-of-sentence symbol is suppressed until all words in the reference have been consumed.

Creates a new forcedosm predictor.

Parameters:trg_test_file (string) – Path to the plain text file with the target sentences. Must have the same number of lines as the number of source sentences to decode
consume(word)[source]

Updates the compiled string and the head position.

get_state()[source]
get_unk_probability(posterior)[source]

Always returns -inf.

initialize(src_sentence)[source]

Resets compiled and head.

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Trivial implementation

predict_next()[source]

Apply word reference constraints.

Returns:dict.
set_state(state)[source]
class cam.sgnmt.predictors.structure.OSMPredictor[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor applies the following constraints to an OSM output:

  • The number of EOP (end-of-phrase) tokens must not exceed the number of source tokens.
  • JUMP_FWD and JUMP_BWD tokens are constrained to avoid jumping out of bounds.

Creates a new osm predictor.

consume(word)[source]

Updates the number of holes, EOPs, and the head position.

get_state()[source]
get_unk_probability(posterior)[source]

Always returns 0.0

initialize(src_sentence)[source]

Sets the number of source tokens.

Parameters:src_sentence (list) – Not used
is_equal(state1, state2)[source]

Trivial implementation

predict_next()[source]

Apply OSM constraints.

Returns:dict.
set_state(state)[source]
cam.sgnmt.predictors.structure.load_external_lengths(path)[source]

Loads a length distribution from a plain text file. The file must contain space-separated <length>:<score> pairs on each line.

Parameters:path (string) – Path to the length file.
Returns:list of dicts mapping a length to its scores, one dict for each sentence.
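A sketch of a parser for this file format (the function name is hypothetical; the real implementation lives in this module):

```python
def load_external_lengths_sketch(path):
    """Parse lines of space-separated <length>:<score> pairs into one
    dict per sentence, mapping each length to its score."""
    dists = []
    with open(path) as f:
        for line in f:
            pairs = (p.split(":") for p in line.split())
            dists.append({int(length): float(score)
                          for length, score in pairs})
    return dists
```
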

cam.sgnmt.predictors.tf_nizza module

This module integrates Nizza alignment models.

https://github.com/fstahlberg/nizza

class cam.sgnmt.predictors.tf_nizza.BaseNizzaPredictor(src_vocab_size, trg_vocab_size, model_name, hparams_set_name, checkpoint_dir, single_cpu_thread, nizza_unk_id=None)[source]

Bases: cam.sgnmt.predictors.core.Predictor

Common functionality for Nizza based predictors. This includes loading checkpoints, creating sessions, and creating computation graphs.

Initializes a nizza predictor.

Parameters:
  • src_vocab_size (int) – Source vocabulary size (called inputs_vocab_size in nizza)
  • trg_vocab_size (int) – Target vocabulary size (called targets_vocab_size in nizza)
  • model_name (string) – Name of the nizza model
  • hparams_set_name (string) – Name of the nizza hyper-parameter set
  • checkpoint_dir (string) – Path to the Nizza checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
  • single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
  • nizza_unk_id (int) – If set, use this as UNK id. Otherwise, the nizza model is assumed to have no UNKs
Raises:

IOError if checkpoint file not found.

create_session(checkpoint_dir)[source]

Creates a MonitoredSession for this predictor.

get_unk_probability(posterior)[source]

Fetch posterior[t2t_unk_id] or return NEG_INF if None.

class cam.sgnmt.predictors.tf_nizza.LexNizzaPredictor(src_vocab_size, trg_vocab_size, model_name, hparams_set_name, checkpoint_dir, single_cpu_thread, alpha, beta, shortlist_strategies, trg2src_model_name='', trg2src_hparams_set_name='', trg2src_checkpoint_dir='', max_shortlist_length=0, min_id=0, nizza_unk_id=None)[source]

Bases: cam.sgnmt.predictors.tf_nizza.BaseNizzaPredictor

This predictor is only compatible with Model1-like Nizza models which return lexical translation probabilities in precompute(). The predictor keeps a list of the same length as the source sentence and initializes it with zeros. At each timestep it updates this list by the lexical scores Model1 assigned to the last consumed token. The predictor score aims to bring up all entries in the list, and thus serves as a coverage mechanism over the source sentence.

Initializes a nizza predictor.

Parameters:
  • src_vocab_size (int) – Source vocabulary size (called inputs_vocab_size in nizza)
  • trg_vocab_size (int) – Target vocabulary size (called targets_vocab_size in nizza)
  • model_name (string) – Name of the nizza model
  • hparams_set_name (string) – Name of the nizza hyper-parameter set
  • checkpoint_dir (string) – Path to the Nizza checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
  • single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
  • alpha (float) – Score for each matching word
  • beta (float) – Penalty for each uncovered word at the end
  • shortlist_strategies (string) – Comma-separated list of shortlist strategies.
  • trg2src_model_name (string) – Name of the target2source nizza model
  • trg2src_hparams_set_name (string) – Name of the nizza hyper-parameter set for the target2source model
  • trg2src_checkpoint_dir (string) – Path to the Nizza checkpoint directory for the target2source model. The predictor will load the top most checkpoint in the checkpoints file.
  • max_shortlist_length (int) – If a shortlist exceeds this limit, initialize the initial coverage with 1 at this position. If zero, do not apply any limit
  • min_id (int) – Do not use IDs below this threshold (filters out most frequent words).
  • nizza_unk_id (int) – If set, use this as UNK id. Otherwise, the nizza model is assumed to have no UNKs
Raises:

IOError if checkpoint file not found.

consume(word)[source]

Update coverage.

estimate_future_cost(hypo)[source]

We use the number of uncovered words times beta as heuristic estimate.

get_state()[source]

The predictor state is the coverage vector.

get_unk_probability(posterior)[source]
initialize(src_sentence)[source]

Set src_sentence, reset consumed.

predict_next()[source]

Predict record scores.

set_state(state)[source]

The predictor state is the coverage vector.

class cam.sgnmt.predictors.tf_nizza.NizzaPredictor(src_vocab_size, trg_vocab_size, model_name, hparams_set_name, checkpoint_dir, single_cpu_thread, nizza_unk_id=None)[source]

Bases: cam.sgnmt.predictors.tf_nizza.BaseNizzaPredictor

This predictor uses Nizza alignment models to derive a posterior over the target vocabulary for the next position. It mainly relies on the predict_next_word() implementation of Nizza models.

Initializes a nizza predictor.

Parameters:
  • src_vocab_size (int) – Source vocabulary size (called inputs_vocab_size in nizza)
  • trg_vocab_size (int) – Target vocabulary size (called targets_vocab_size in nizza)
  • model_name (string) – Name of the nizza model
  • hparams_set_name (string) – Name of the nizza hyper-parameter set
  • checkpoint_dir (string) – Path to the Nizza checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
  • single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
  • nizza_unk_id (int) – If set, use this as UNK id. Otherwise, the nizza model is assumed to have no UNKs
Raises:

IOError if checkpoint file not found.

consume(word)[source]

Append word to the current history.

get_state()[source]

The predictor state is the complete history.

initialize(src_sentence)[source]

Set src_sentence, reset consumed.

is_equal(state1, state2)[source]

Returns true if the history is the same

predict_next()[source]

Call the T2T model in self.mon_sess.

set_state(state)[source]

The predictor state is the complete history.

cam.sgnmt.predictors.tf_nmt module

cam.sgnmt.predictors.tf_rnnlm module

cam.sgnmt.predictors.tf_t2t module

This is the interface to the tensor2tensor library.

https://github.com/tensorflow/tensor2tensor

Alternatively, you may use the following fork which has been tested in combination with SGNMT:

https://github.com/fstahlberg/tensor2tensor

The t2t predictor can read any model trained with tensor2tensor which includes the transformer model, convolutional models, and RNN-based sequence models.

class cam.sgnmt.predictors.tf_t2t.FertilityT2TPredictor(src_vocab_size, trg_vocab_size, model_name, problem_name, hparams_set_name, t2t_usr_dir, checkpoint_dir, t2t_unk_id=None, single_cpu_thread=False, max_terminal_id=-1, pop_id=-1)[source]

Bases: cam.sgnmt.predictors.tf_t2t.T2TPredictor

Use this predictor to integrate fertility models trained with T2T. Fertility models output the fertility for each source word instead of target words. We define the fertility of the i-th source word in a hypothesis as the number of tokens between the (i-1)-th and the i-th POP token.

TODO: This is not SOLID (violates the Liskov substitution principle)
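The fertility definition above (number of tokens between consecutive POP tokens) can be illustrated with a small hypothetical helper; `pop_id` marks the POP/closing bracket symbol:

```python
def fertilities(trg_tokens, pop_id):
    """Compute source-word fertilities from a target token sequence.

    The fertility of the i-th source word is the number of tokens
    between the (i-1)-th and the i-th POP token.
    """
    ferts = []
    count = 0
    for tok in trg_tokens:
        if tok == pop_id:
            ferts.append(count)  # POP closes the current source word
            count = 0
        else:
            count += 1
    return ferts
```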

Creates a new T2T predictor. The constructor prepares the TensorFlow session for predict_next() calls. This includes:

  • Load hyper parameters from the given set (hparams)
  • Update registry, load T2T model
  • Create TF placeholders for source sequence and target prefix
  • Create computation graph for computing log probs
  • Create a MonitoredSession object, which also handles restoring checkpoints
Parameters:
  • src_vocab_size (int) – Source vocabulary size.
  • trg_vocab_size (int) – Target vocabulary size.
  • model_name (string) – T2T model name.
  • problem_name (string) – T2T problem name.
  • hparams_set_name (string) – T2T hparams set name.
  • t2t_usr_dir (string) – See –t2t_usr_dir in tensor2tensor.
  • checkpoint_dir (string) – Path to the T2T checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
  • t2t_unk_id (int) – If set, use this ID to get UNK scores. If None, UNK is always scored with -inf.
  • single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
  • max_terminal_id (int) – If positive, maximum terminal ID. Needs to be set for syntax-based T2T models.
  • pop_id (int) – If positive, ID of the POP or closing bracket symbol. Needs to be set for syntax-based T2T models.
consume(word)[source]
get_state()[source]
get_unk_probability(posterior)[source]

Returns self.other_scores[n_aligned_words].

initialize(src_sentence)[source]

Set src_sentence, compute fertilities for first src word.

is_equal(state1, state2)[source]

Returns true if the history is the same

predict_next()[source]

Returns self.pop_scores[n_aligned_words] for POP and EOS.

set_state(state)[source]
cam.sgnmt.predictors.tf_t2t.POP = '##POP##'

Textual representation of the POP symbol.

class cam.sgnmt.predictors.tf_t2t.T2TPredictor(src_vocab_size, trg_vocab_size, model_name, problem_name, hparams_set_name, t2t_usr_dir, checkpoint_dir, t2t_unk_id=None, single_cpu_thread=False, max_terminal_id=-1, pop_id=-1)[source]

Bases: cam.sgnmt.predictors.tf_t2t._BaseTensor2TensorPredictor

This predictor implements scoring with Tensor2Tensor models. We follow the decoder implementation in T2T and do not reuse network states in decoding. We rather compute the full forward pass along the current history. Therefore, the decoder state is simply the full history of consumed words.
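The history-as-state pattern described above can be sketched as follows. This is a simplified, hypothetical class, not the actual T2TPredictor; `score_fn` stands in for the TensorFlow session call:

```python
class HistoryPredictor:
    """Sketch of the history-as-state pattern: instead of caching
    network states, store only the consumed history and recompute the
    full forward pass at each step."""

    def __init__(self, score_fn):
        self.score_fn = score_fn  # stand-in for the T2T session call
        self.consumed = []
        self.src_sentence = []

    def initialize(self, src_sentence):
        self.src_sentence = src_sentence
        self.consumed = []

    def consume(self, word):
        self.consumed.append(word)

    def predict_next(self):
        # Full forward pass along the current history
        return self.score_fn(self.src_sentence, self.consumed)

    def get_state(self):
        # The predictor state is the complete history
        return list(self.consumed)

    def set_state(self, state):
        self.consumed = list(state)

    def is_equal(self, state1, state2):
        return state1 == state2
```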

Creates a new T2T predictor. The constructor prepares the TensorFlow session for predict_next() calls. This includes:

  • Load hyper parameters from the given set (hparams)
  • Update registry, load T2T model
  • Create TF placeholders for source sequence and target prefix
  • Create computation graph for computing log probs
  • Create a MonitoredSession object, which also handles restoring checkpoints
Parameters:
  • src_vocab_size (int) – Source vocabulary size.
  • trg_vocab_size (int) – Target vocabulary size.
  • model_name (string) – T2T model name.
  • problem_name (string) – T2T problem name.
  • hparams_set_name (string) – T2T hparams set name.
  • t2t_usr_dir (string) – See –t2t_usr_dir in tensor2tensor.
  • checkpoint_dir (string) – Path to the T2T checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
  • t2t_unk_id (int) – If set, use this ID to get UNK scores. If None, UNK is always scored with -inf.
  • single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
  • max_terminal_id (int) – If positive, maximum terminal ID. Needs to be set for syntax-based T2T models.
  • pop_id (int) – If positive, ID of the POP or closing bracket symbol. Needs to be set for syntax-based T2T models.
consume(word)[source]

Append word to the current history.

get_state()[source]

The predictor state is the complete history.

initialize(src_sentence)[source]

Set src_sentence, reset consumed.

is_equal(state1, state2)[source]

Returns true if the history is the same

predict_next()[source]

Call the T2T model in self.mon_sess.

set_state(state)[source]

The predictor state is the complete history.

cam.sgnmt.predictors.tf_t2t.T2T_INITIALIZED = False

Set to true by _initialize_t2t() after first constructor call.

cam.sgnmt.predictors.tf_t2t.expand_input_dims_for_t2t(t)[source]

Expands a plain input tensor for using it in a T2T graph.

Parameters:t – Tensor
Returns:Tensor t expanded by 1 dimension on the left and two dimensions on the right.
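The shape change can be illustrated with a pure-Python sketch on nested lists (the real helper operates on TF tensors, e.g. via tf.expand_dims): a 1-D input of length n becomes shape (1, n, 1, 1).

```python
def expand_input_dims(t):
    """Illustrative sketch: wrap a flat sequence so it gains one axis
    on the left (batch) and two on the right, i.e. (n,) -> (1, n, 1, 1)."""
    return [[[[x]] for x in t]]
```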
cam.sgnmt.predictors.tf_t2t.log_prob_from_logits(logits)[source]

Log softmax function.
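What log_prob_from_logits computes can be written as a numerically stable log softmax. A pure-Python sketch (the real helper works on TF tensors):

```python
import math

def log_prob_from_logits(logits):
    """Numerically stable log softmax over a list of logits:
    subtract the max before exponentiating to avoid overflow."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]
```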

cam.sgnmt.predictors.tokenization module

This module contains wrapper predictors which support decoding with diverse tokenization. The Word2charPredictor can be used if the decoder operates on fine-grained tokens such as characters, but the tokenization of a predictor is coarse-grained (e.g. words or subwords).

The word2char predictor maintains an explicit list of word boundary characters and applies consume and predict_next whenever a word boundary character is consumed.

The fsttok predictor also masks coarse-grained predictors when SGNMT uses fine-grained tokens such as characters. This wrapper loads an FST which transduces character sequences to predictor-unit sequences.

class cam.sgnmt.predictors.tokenization.CombinedState(fst_node, pred_state, posterior, unconsumed=[], pending_score=0.0)[source]

Bases: object

Combines an FST state with a predictor state. Used by the fsttok predictor.

consume_all(predictor)[source]

Consume all unconsumed tokens and update pred_state, pending_score, and posterior accordingly.

Parameters:predictor (Predictor) – Predictor instance
consume_single(predictor)[source]

Consume a single token in self.unconsumed.

Parameters:predictor (Predictor) – Predictor instance
score(token, predictor)[source]

Returns a score which can be added if token is consumed next. This is not necessarily the full score but an upper bound on it: continuations will have a score lower than or equal to this. We only use the current posterior vector and do not consume tokens with the wrapped predictor.

traverse_fst(trans_fst, char)[source]

Returns a list of CombinedState objects with the same predictor state and posterior, but an fst_node which is reachable via the input label char. If the output tape contains symbols, add them to unconsumed.

Parameters:
  • trans_fst (Fst) – FST to traverse
  • char (int) – Index of character
Returns:

list. List of combined states reachable via char

update_posterior(predictor)[source]

If self.posterior is None, call predict_next to be able to score the next tokens.

cam.sgnmt.predictors.tokenization.EPS_ID = 0

OpenFST’s reserved ID for epsilon arcs.

class cam.sgnmt.predictors.tokenization.FSTTokPredictor(path, fst_unk_id, max_pending_score, slave_predictor)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This wrapper can be used if the SGNMT decoder operates on the character level, but a predictor uses a more coarse-grained tokenization. The mapping is defined by an FST which transduces character sequences to predictor-unit sequences. This wrapper maintains a list of CombinedState objects which are tuples of an FST node and a predictor state for which the following holds:

  • The input labels on the path to the node are consistent with the consumed characters
  • The output labels on the path to the node are consistent with the predictor states

Constructor for the fsttok wrapper

Parameters:
  • path (string) – Path to an FST which transduces characters to predictor tokens
  • fst_unk_id (int) – ID used to represent UNK in the FSTs (usually 999999998)
  • max_pending_score (float) – Maximum pending score in a CombinedState instance.
  • slave_predictor (Predictor) – Wrapped predictor
consume(word)[source]

Update self.states to be consistent with word and consume all the predictor tokens.

estimate_future_cost(hypo)[source]

Not implemented yet

get_state()[source]
get_unk_probability(posterior)[source]

Always returns negative infinity. Handling UNKs needs to be realized by the FST.

initialize(src_sentence)[source]

Pass through to slave predictor. The source sentence is not modified. states is updated to hold the initial FST node and the initial predictor posterior and state.

initialize_heuristic(src_sentence)[source]

Pass through to slave predictor. The source sentence is not modified

is_equal(state1, state2)[source]

Not implemented yet

predict_next()[source]
set_current_sen_id(cur_sen_id)[source]

We need to override this method to propagate current_sentence_id to the slave predictor

set_state(state)[source]
class cam.sgnmt.predictors.tokenization.Word2charPredictor(map_path, slave_predictor)[source]

Bases: cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

This predictor wraps word level predictors when SGNMT is running on the character level. The mapping between word ID and character ID sequence is loaded from the file system. All characters which do not appear in that mapping are treated as word boundary markers. The wrapper blocks consume and predict_next calls until a word boundary marker is consumed, and updates the slave predictor according to the word between the last two word boundaries. The mapping is done only on the target side, and the source sentences are passed through as they are. To use alternative tokenization on the source side, see the altsrc predictor wrapper. The word2char wrapper is always an UnboundedVocabularyPredictor.

Creates a new word2char wrapper predictor. The map_path file has to be a plain text file, each line containing the mapping from a word index to the character index sequence (format: word char1 char2... charn).

Parameters:
  • map_path (string) – Path to the mapping file
  • slave_predictor (Predictor) – Instance of the predictor with a different wmap than SGNMT
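The mapping file format described above (one word index followed by its character index sequence per line) could be parsed with a helper along these lines. This is a hypothetical sketch, not the wrapper's actual loader:

```python
def load_word2char_map(map_path):
    """Parse a word-to-character mapping file: each line is
    `word char1 char2 ... charn`, all integer IDs."""
    mapping = {}
    with open(map_path) as f:
        for line in f:
            fields = [int(x) for x in line.split()]
            mapping[fields[0]] = fields[1:]  # word ID -> char ID sequence
    return mapping
```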
consume(word)[source]

If word is a word boundary marker, truncate word_stub and let the slave predictor consume word_stub. Otherwise, extend word_stub by the character.

estimate_future_cost(hypo)[source]

Not supported

get_state()[source]

Pass through to slave predictor

get_unk_probability(posterior)[source]

This is about the unknown character, not the unknown word. Since the word level slave predictor has no notion of the unknown character, we return NEG_INF unconditionally.

initialize(src_sentence)[source]

Pass through to slave predictor. The source sentence is not modified

initialize_heuristic(src_sentence)[source]

Pass through to slave predictor. The source sentence is not modified

is_equal(state1, state2)[source]

Pass through to slave predictor

predict_next(trgt_words)[source]
set_current_sen_id(cur_sen_id)[source]

We need to override this method to propagate current_sentence_id to the slave predictor

set_state(state)[source]

Pass through to slave predictor

cam.sgnmt.predictors.vocabulary module

Predictor wrappers in this module work with the vocabulary of the wrapped predictor. An example is the idxmap wrapper which makes it possible to use an alternative word map.

class cam.sgnmt.predictors.vocabulary.IdxmapPredictor(src_idxmap_path, trgt_idxmap_path, slave_predictor, slave_weight)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This wrapper predictor can be applied to slave predictors which use different wmaps than SGNMT. It translates between SGNMT word indices and predictor indices each time the predictor is called. This mapping is transparent to both the decoder and the wrapped slave predictor.

Creates a new idxmap wrapper predictor. The index maps have to be plain text files, each line containing the mapping from an SGNMT word index to the slave predictor word index.

Parameters:
  • src_idxmap_path (string) – Path to the source index map
  • trgt_idxmap_path (string) – Path to the target index map
  • slave_predictor (Predictor) – Instance of the predictor with a different wmap than SGNMT
  • slave_weight (float) – Slave predictor weight
consume(word)[source]

Pass through to slave predictor

estimate_future_cost(hypo)[source]

Pass through to slave predictor

get_state()[source]

Pass through to slave predictor

get_unk_probability(posterior)[source]

ATTENTION: We should translate the posterior array back to slave predictor indices. However, the unk_id is translated to the identical index, and others normally do not matter when computing the UNK probability. Therefore, we refrain from a complete conversion and pass through posterior without changing its word indices.

initialize(src_sentence)[source]

Pass through to slave predictor

initialize_heuristic(src_sentence)[source]

Pass through to slave predictor

is_equal(state1, state2)[source]

Pass through to slave predictor

load_map(path)[source]

Load a index map file. Mappings should be bijections, but there is no sanity check in place to verify this.

Parameters:path (string) – Path to the mapping file
Returns:dict. Mapping from SGNMT index to slave predictor index
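The index map file (one `sgnmt_index slave_index` pair per line) suggests a loader along these lines. A hypothetical sketch of what load_map does, without the bijection check the docstring warns about:

```python
def load_map(path):
    """Load an index map file: each line holds two integers, an SGNMT
    word index followed by the slave predictor's word index.
    Mappings are assumed (but not verified) to be bijections."""
    mapping = {}
    with open(path) as f:
        for line in f:
            sgnmt_idx, slave_idx = (int(x) for x in line.split())
            mapping[sgnmt_idx] = slave_idx
    return mapping
```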
predict_next()[source]

Pass through to slave predictor

set_current_sen_id(cur_sen_id)[source]

We need to override this method to propagate current_sentence_id to the slave predictor

set_state(state)[source]

Pass through to slave predictor

class cam.sgnmt.predictors.vocabulary.SkipvocabInternalHypothesis(score, predictor_state, word_to_consume)[source]

Bases: object

Helper class for internal beam search in skipvocab.

class cam.sgnmt.predictors.vocabulary.SkipvocabPredictor(max_id, stop_size, beam, slave_predictor)[source]

Bases: cam.sgnmt.predictors.core.Predictor

This predictor wrapper masks predictors with a larger vocabulary than the SGNMT vocabulary. The SGNMT OOV words are not scored with UNK scores from the other predictors as usual, but are hidden by this wrapper. Therefore, this wrapper does not produce any word from the larger vocabulary, but searches internally until enough in-vocabulary word scores are collected from the wrapped predictor.

Creates a new skipvocab wrapper predictor.

Parameters:
  • max_id (int) – All words greater than this are skipped
  • stop_size (int) – Stop internal beam search when the best stop_size words are in-vocabulary
  • beam (int) – Beam size of internal beam search
  • slave_predictor (Predictor) – Wrapped predictor.
consume(word)[source]

Pass through to slave predictor

estimate_future_cost(hypo)[source]

Pass through to slave predictor

get_state()[source]

Pass through to slave predictor

get_unk_probability(posterior)[source]

Pass through to slave predictor

initialize(src_sentence)[source]

Pass through to slave predictor

initialize_heuristic(src_sentence)[source]

Pass through to slave predictor

is_equal(state1, state2)[source]

Pass through to slave predictor

predict_next()[source]

This method first performs beam search internally to update the slave predictor state to a point where the best stop_size entries in the predict_next() return value are in-vocabulary (bounded by max_id). Then, it returns the slave posterior in that state.

set_current_sen_id(cur_sen_id)[source]

We need to override this method to propagate current_sentence_id to the slave predictor

set_state(state)[source]

Pass through to slave predictor

class cam.sgnmt.predictors.vocabulary.UnboundedIdxmapPredictor(src_idxmap_path, trgt_idxmap_path, slave_predictor, slave_weight)[source]

Bases: cam.sgnmt.predictors.vocabulary.IdxmapPredictor, cam.sgnmt.predictors.core.UnboundedVocabularyPredictor

This class is a version of IdxmapPredictor for unbounded vocabulary predictors. This needs an adjusted predict_next method to pass through the set of target words to score correctly.

Pass through to IdxmapPredictor.__init__

predict_next(trgt_words)[source]

Pass through to slave predictor

class cam.sgnmt.predictors.vocabulary.UnkvocabPredictor(trg_vocab_size, slave_predictor)[source]

Bases: cam.sgnmt.predictors.core.Predictor

If the predictor wrapped by the unkvocab wrapper produces an UNK with predict_next, this wrapper adds explicit NEG_INF scores to all in-vocabulary words not in its posterior. This can control which words are matched by the UNK scores of other predictors.

Creates a new unkvocab wrapper predictor.

Parameters:
  • trg_vocab_size (int) – Size of the target vocabulary
  • slave_predictor (Predictor) – Wrapped predictor
consume(word)[source]

Pass through to slave predictor

estimate_future_cost(hypo)[source]

Pass through to slave predictor

get_state()[source]

Pass through to slave predictor

get_unk_probability(posterior)[source]

Pass through to slave predictor

initialize(src_sentence)[source]

Pass through to slave predictor

initialize_heuristic(src_sentence)[source]

Pass through to slave predictor

is_equal(state1, state2)[source]

Pass through to slave predictor

predict_next()[source]

Pass through to slave predictor. If the posterior from the slave predictor contains util.UNK_ID, add NEG_INF for all word ids lower than trg_vocab_size that are not already defined
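The fill-in behaviour described above can be sketched on a plain dict posterior. This is a hypothetical helper; `UNK_ID` stands in for util.UNK_ID and the real wrapper works on the slave predictor's posterior:

```python
NEG_INF = float("-inf")
UNK_ID = 0  # stand-in for util.UNK_ID

def fill_unk_vocab(posterior, trg_vocab_size):
    """If the posterior contains UNK, explicitly score every
    in-vocabulary word missing from the posterior with NEG_INF."""
    if UNK_ID in posterior:
        for word in range(trg_vocab_size):
            if word not in posterior:
                posterior[word] = NEG_INF
    return posterior
```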

set_current_sen_id(cur_sen_id)[source]

We need to override this method to propagate current_sentence_id to the slave predictor

set_state(state)[source]

Pass through to slave predictor

Module contents

Predictors are the scoring modules used in SGNMT. They can be combined to form a joint search space and joint scores. Note that the configuration of predictors is not yet decoupled from the central configuration. Therefore, new predictors need to be referenced in blocks.decode, and their configuration parameters need to be added to blocks.ui.