cam.sgnmt.predictors package¶
Submodules¶
cam.sgnmt.predictors.automata module¶
This module encapsulates the predictor interface to OpenFST. This
module depends on OpenFST. To enable Python support in OpenFST, use a
recent version (>=1.5.4) and compile with --enable_python.
Further information can be found here:
http://www.openfst.org/twiki/bin/view/FST/PythonExtension
This file includes the fst, nfst, and rtn predictors.
Note: If we use arc weights in FSTs, we multiply them by -1, as everything in SGNMT is a logprob, not a -logprob as in the log or tropical semirings of FSTs. You can disable this behavior with --fst_to_log
Note2: The FSTs and RTNs are assumed to have both <S> and </S>. This is for compatibility, as lattices generated by HiFST contain these symbols.
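The sign convention in the first note can be illustrated with a minimal sketch. The helper name cost_to_logprob is hypothetical and is not part of SGNMT; it only shows that negating a cost (a negative log probability) yields a log probability.

```python
import math


def cost_to_logprob(cost):
    """Convert an FST arc cost (negative log probability) to a log probability."""
    return -cost


# An arc with probability 0.25 carries the cost -log(0.25) in the log or
# tropical semiring. Multiplying by -1 recovers the log probability.
arc_cost = -math.log(0.25)
arc_score = cost_to_logprob(arc_cost)
```

After the conversion, higher scores are better, which matches the convention used by the other predictors.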
-
cam.sgnmt.predictors.automata.
EPS_ID
= 0¶ OpenFST’s reserved ID for epsilon arcs.
-
class
cam.sgnmt.predictors.automata.
FstPredictor
(fst_path, use_weights, normalize_scores, skip_bos_weight=True, to_log=True)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor can read determinized translation lattices. The predictor state consists of the current node. This is unique as the lattices are determinized.
Creates a new fst predictor.
Parameters: - fst_path (string) – Path to the FST file
- use_weights (bool) – If false, replace all arc weights with 0 (=log 1).
- normalize_scores (bool) – If true, we normalize the weights on all outgoing arcs such that they sum up to 1
- skip_bos_weight (bool) – Add the score at the <S> arc to the </S> arc if this is false. This results in scores consistent with OpenFST’s replace operation, as <S> scores are normally ignored by SGNMT.
- to_log (bool) – SGNMT uses normal log probs (scores) while arc weights in FSTs normally have cost (i.e. neg. log values) semantics. Therefore, if true, we multiply arc weights by -1.
-
consume
(word)[source]¶ Updates the current node by following the arc labelled with word. If there is no such arc, we set cur_node to -1, indicating that the predictor is in an invalid state. In this case, all subsequent predict_next calls will return the empty set.
Parameters: word (int) – Word on an outgoing arc from the current node
Returns: float. Weight on the traversed arc
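The node-following behaviour described above can be sketched without OpenFST. The class ToyFstState and its dict-based arc layout are assumptions made for this illustration, as is returning negative infinity when no arc matches; the real predictor works on an OpenFST object.

```python
NEG_INF = float("-inf")


class ToyFstState:
    """Toy stand-in for a determinized lattice.

    arcs[node][word] = (next_node, score) is a hypothetical layout.
    """

    def __init__(self, arcs, start_node=0):
        self.arcs = arcs
        self.cur_node = start_node

    def consume(self, word):
        """Follow the arc labelled with word and return its score."""
        if self.cur_node < 0:
            return NEG_INF
        arc = self.arcs.get(self.cur_node, {}).get(word)
        if arc is None:
            self.cur_node = -1  # invalid state: no arc carries this word
            return NEG_INF
        self.cur_node, score = arc
        return score

    def predict_next(self):
        """Return the words on outgoing arcs with their scores."""
        if self.cur_node < 0:
            return {}
        outgoing = self.arcs.get(self.cur_node, {})
        return {word: score for word, (_, score) in outgoing.items()}
```

Because the lattice is determinized, a single node id is enough to represent the state, and at most one arc per word leaves each node.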
-
estimate_future_cost
(hypo)[source]¶ The FST predictor comes with its own heuristic function. We use the shortest path in the fst as future cost estimator.
-
get_unk_probability
(posterior)[source]¶ Returns negative infinity if UNK is not in the lattice. Otherwise, return UNK score.
Returns: float. Negative infinity if UNK is not in the lattice, otherwise the UNK score
-
initialize
(src_sentence)[source]¶ Loads the FST from the file system and consumes the start of sentence symbol.
Parameters: src_sentence (list) – Not used
-
class
cam.sgnmt.predictors.automata.
NondeterministicFstPredictor
(fst_path, use_weights, normalize_scores, skip_bos_weight=True, to_log=True)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor can handle non-deterministic translation lattices. In contrast to the fst predictor for deterministic lattices, we store a set of nodes which are all reachable from the start node through the current history.
Creates a new nfst predictor.
Parameters: - fst_path (string) – Path to the FST file
- use_weights (bool) – If false, replace all arc weights with 0 (=log 1).
- normalize_scores (bool) – If true, we normalize the weights on all outgoing arcs such that they sum up to 1
- skip_bos_weight (bool) – If true, set weights on <S> arcs to 0 (= log1)
- to_log (bool) – SGNMT uses normal log probs (scores) while arc weights in FSTs normally have cost (i.e. neg. log values) semantics. Therefore, if true, we multiply arc weights by -1.
-
consume
(word)[source]¶ Updates the current nodes by searching for all nodes which are reachable from the current nodes by a path consisting of any number of epsilons and exactly one word label. If there is no such arc, we set the predictor in an invalid state. In this case, all subsequent predict_next calls will return the empty set.
Parameters: word (int) – Word on an outgoing arc from the current node
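The search over epsilon arcs can be sketched as an epsilon closure before and after the word arc. The function names and the adjacency-list layout below are assumptions for illustration, and arc weights are left out to keep the example short.

```python
EPS_ID = 0  # OpenFST's reserved id for epsilon arcs


def eps_closure(arcs, nodes):
    """Return all nodes reachable from nodes via epsilon arcs only."""
    closure = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for label, target in arcs.get(node, []):
            if label == EPS_ID and target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def consume_word(arcs, cur_nodes, word):
    """Nodes reachable by any number of epsilons and exactly one word arc.

    arcs[node] is a list of (label, next_node) pairs (hypothetical layout).
    An empty result corresponds to the invalid state.
    """
    before = eps_closure(arcs, cur_nodes)
    after_word = {
        target
        for node in before
        for label, target in arcs.get(node, [])
        if label == word
    }
    return eps_closure(arcs, after_word)
```

Applying the closure after the word arc means that every arc with a word label is directly attached to one of the returned nodes.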
-
estimate_future_cost
(hypo)[source]¶ The FST predictor comes with its own heuristic function. We use the shortest path in the fst as future cost estimator.
-
get_unk_probability
(posterior)[source]¶ Always returns negative infinity: Words outside the translation lattice are not possible according to this predictor.
Returns: float. Negative infinity
-
initialize
(src_sentence)[source]¶ Loads the FST from the file system and consumes the start of sentence symbol.
Parameters: src_sentence (list) – Not used
-
initialize_heuristic
(src_sentence)[source]¶ Creates a matrix of shortest distances between all nodes
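One way to build such a matrix is the Floyd-Warshall algorithm. The source does not state which algorithm the predictor uses, so the sketch below is an assumption, with a hypothetical arc list of (source, target, cost) triples and non-negative costs.

```python
def all_pairs_shortest(num_nodes, arcs):
    """Return dist where dist[i][j] is the cheapest path cost from i to j.

    arcs is a list of (src, dst, cost) triples (hypothetical layout).
    Unreachable pairs keep the value infinity.
    """
    inf = float("inf")
    dist = [[0.0 if i == j else inf for j in range(num_nodes)]
            for i in range(num_nodes)]
    for src, dst, cost in arcs:
        dist[src][dst] = min(dist[src][dst], cost)
    # Allow paths through intermediate node k, one node at a time.
    for k in range(num_nodes):
        for i in range(num_nodes):
            for j in range(num_nodes):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist
```

The distance from a node to the final node can then serve as a future cost estimate for hypotheses that end in that node.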
-
predict_next
()[source]¶ Uses the outgoing arcs from all current nodes to build up the scores for the next word. This method does not follow epsilon arcs: consume updates cur_nodes such that all reachable arcs with word ids are connected directly with a node in cur_nodes. If there are multiple arcs with the same word, we use the log sum of the arc weights as score.
Returns: dict. Set of words on outgoing arcs from the current node together with their scores, or an empty set if we currently have no active nodes or fst.
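The log sum over arcs that share a word can be sketched as follows. The function names and the flat list of (word, score) pairs are assumptions for this example; the scores are log probabilities.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Return log(exp(a) + exp(b)) in a numerically stable way."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def collect_scores(outgoing_arcs):
    """Merge (word, log_score) pairs from all active nodes into one dict."""
    scores = {}
    for word, score in outgoing_arcs:
        if word in scores:
            scores[word] = log_add(scores[word], score)
        else:
            scores[word] = score
    return scores
```

With two arcs for the same word carrying probabilities 0.2 and 0.1, the merged score corresponds to probability 0.3.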
-
class
cam.sgnmt.predictors.automata.
RtnPredictor
(rtn_path, use_weights, normalize_scores, to_log=True, minimize_rtns=False, rmeps=True)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
Predictor for RTNs (recursive transition networks). This predictor assumes a directory structure as produced by HiFST. You can use this predictor for non-deterministic lattices too. This implementation supports late expansion: RTNs are only expanded as far as necessary to retrieve all currently reachable states.
cur_nodes contains the accumulated weights from the last consumed word (if ambiguous, the largest).
This implementation does not maintain a list of active nodes like the other automata predictors. Instead, we store the current history and search for the active nodes at each expansion. This is more expensive, but fstreplace might change state IDs so a list of active nodes might get corrupted.
Note that this predictor does not support FSTs in gzip format.
Creates a new RTN predictor.
Parameters: - rtn_path (string) – Path to the RTN directory
- use_weights (bool) – If false, replace all arc weights with 0 (=log 1).
- normalize_scores (bool) – If true, we normalize the weights on all outgoing arcs such that they sum up to 1
- to_log (bool) – SGNMT uses normal log probs (scores) while arc weights in FSTs normally have cost (i.e. neg. log values) semantics. Therefore, if true, we multiply arc weights by -1.
- minimize_rtns (bool) – Minimize the FST after each replace operation
- rmeps (bool) – Remove epsilons in the FST after each replace operation
-
add_to_label_fst_map_recursive
(label_fst_map, visited_nodes, root_node, acc_weight, history, func)[source]¶ Adds arcs to label_fst_map if they are labeled with an NT symbol and reachable from root_node via history.
Note: visited_nodes is maintained for each history separately
-
expand_rtn
(func)[source]¶ This method expands the RTN as far as necessary. This means that the RTN is expanded s.t. we can build the posterior for cur_history. In practice, this means that we follow all epsilon edges and replace all NT edges until all paths with the prefix cur_history in the RTN have at least one more terminal token. Then, we apply func to all reachable nodes.
-
get_unk_probability
(posterior)[source]¶ Always returns negative infinity: Words outside the RTN are not possible according to this predictor.
Returns: float. Negative infinity
-
initialize
(src_sentence)[source]¶ Loads the root RTN and consumes the start of sentence symbol.
Parameters: src_sentence (list) – Not used
cam.sgnmt.predictors.bow module¶
This module contains predictors for bag of words experiments. These are the standard bow predictor and the bowsearch predictor, which first does an unrestricted search to construct a skeleton and then restricts the order of words by that skeleton (in addition to the bag restriction).
-
class
cam.sgnmt.predictors.bow.
BagOfWordsPredictor
(trg_test_file, accept_subsets=False, accept_duplicates=False, heuristic_scores_file='', collect_stats_strategy='best', heuristic_add_consumed=False, heuristic_add_remaining=True, diversity_heuristic_factor=-1.0, equivalence_vocab=-1)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor is similar to the forced predictor, but it does not enforce the word order in the reference. Therefore, it assigns probability 1 (score 0) to all hypotheses which have the words in the reference in any order, and -inf to all other hypos.
Creates a new bag-of-words predictor.
Parameters: - trg_test_file (string) – Path to the plain text file with the target sentences. Must have the same number of lines as the number of source sentences to decode. The word order in the target sentences is not relevant for this predictor.
- accept_subsets (bool) – If true, this predictor permits EOS even if the bag is not fully consumed yet
- accept_duplicates (bool) – If true, counts are not updated when a word is consumed. This means that we allow a word in a bag to appear multiple times
- heuristic_scores_file (string) – Path to the unigram scores which are used if this predictor estimates future costs
- collect_stats_strategy (string) – best, full, or all. Defines how unigram estimates are collected for heuristic
- heuristic_add_consumed (bool) – Set to true to add the difference between actual partial score and unigram estimates of consumed words to the predictor heuristic
- heuristic_add_remaining (bool) – Set to true to add the sum of unigram scores of words remaining in the bag to the predictor heuristic
- diversity_heuristic_factor (float) – Factor for diversity heuristic which penalizes hypotheses with the same bag as full hypos
- equivalence_vocab (int) – If positive, predictor states are considered equal if the remaining words within that vocab and OOVs regarding this vocab are the same. Only relevant when using hypothesis recombination
-
consume
(word)[source]¶ Updates the bag by deleting the consumed word.
Parameters: word (int) – Next word to consume
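The bag update can be sketched with a Counter. The class ToyBag is hypothetical and only mirrors the described effect of consume and of the accept_duplicates flag, not the actual implementation.

```python
from collections import Counter


class ToyBag:
    """Toy bag of target words that shrinks as words are consumed."""

    def __init__(self, target_words, accept_duplicates=False):
        self.bag = Counter(target_words)
        self.accept_duplicates = accept_duplicates

    def consume(self, word):
        """Delete one occurrence of word unless duplicates are accepted."""
        if self.accept_duplicates or word not in self.bag:
            return  # counts stay untouched
        self.bag[word] -= 1
        if self.bag[word] <= 0:
            del self.bag[word]

    def remaining(self):
        """Return the words still in the bag with their counts."""
        return dict(self.bag)
```

With accept_duplicates set, a consumed word stays in the bag and can therefore be produced again.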
-
estimate_future_cost
(hypo)[source]¶ The bow predictor comes with its own heuristic function. We use the sum of scores of the remaining words as future cost estimator.
-
get_unk_probability
(posterior)[source]¶ Returns negative infinity unconditionally: Words which are not in the target sentence are assigned probability 0 by this predictor.
-
initialize
(src_sentence)[source]¶ Creates a new bag for the current target sentence.
Parameters: src_sentence (list) – Not used
-
initialize_heuristic
(src_sentence)[source]¶ Calls reset of the used unigram table with estimates self.estimates to clear all statistics from the previous sentence
Parameters: src_sentence (list) – Not used
-
notify
(message, message_type=1)[source]¶ This gets called if this predictor observes the decoder. It updates unigram heuristic estimates by passing this message through to the unigram table self.estimates.
-
class
cam.sgnmt.predictors.bow.
BagOfWordsSearchPredictor
(main_decoder, hypo_recombination, trg_test_file, accept_subsets=False, accept_duplicates=False, heuristic_scores_file='', collect_stats_strategy='best', heuristic_add_consumed=False, heuristic_add_remaining=True, diversity_heuristic_factor=-1.0, equivalence_vocab=-1)[source]¶ Bases:
cam.sgnmt.predictors.bow.BagOfWordsPredictor
Combines the bag-of-words predictor with a proxy decoding pass which creates a skeleton translation.
Creates a new bag-of-words predictor with pre search
Parameters: - main_decoder (Decoder) – Reference to the main decoder instance, used to fetch the predictors
- hypo_recombination (bool) – Activates hypo recombination for the pre decoder
- trg_test_file (string) – Path to the plain text file with the target sentences. Must have the same number of lines as the number of source sentences to decode. The word order in the target sentences is not relevant for this predictor.
- accept_subsets (bool) – If true, this predictor permits EOS even if the bag is not fully consumed yet
- accept_duplicates (bool) – If true, counts are not updated when a word is consumed. This means that we allow a word in a bag to appear multiple times
- heuristic_scores_file (string) – Path to the unigram scores which are used if this predictor estimates future costs
- collect_stats_strategy (string) – best, full, or all. Defines how unigram estimates are collected for heuristic
- heuristic_add_consumed (bool) – Set to true to add the difference between actual partial score and unigram estimates of consumed words to the predictor heuristic
- heuristic_add_remaining (bool) – Set to true to add the sum of unigram scores of words remaining in the bag to the predictor heuristic
- equivalence_vocab (int) – If positive, predictor states are considered equal if the remaining words within that vocab and OOVs regarding this vocab are the same. Only relevant when using hypothesis recombination
-
consume
(word)[source]¶ Calls super class consume. If not in pre_mode, update skeleton info.
Parameters: word (int) – Next word to consume
-
get_state
()[source]¶ If in pre_mode, the state of this predictor is the current bag. Otherwise, it is the bag plus the skeleton state.
cam.sgnmt.predictors.core module¶
This module contains the two basic predictor interfaces for bounded and unbounded vocabulary predictors.
-
class
cam.sgnmt.predictors.core.
Predictor
[source]¶ Bases:
cam.sgnmt.utils.Observer
A predictor produces the predictive probability distribution of the next word given the state of the predictor. The state may change during predict_next() and consume(). The functions get_state() and set_state() can be used for non-greedy decoding. Note: The state describes the predictor with the current history. It does not encapsulate the current source sentence, i.e. you cannot recover a predictor state if initialize() was called in between. predict_next() and consume() must be called alternately. This holds even when using get_state() and set_state(): Loading/saving states is transparent to the predictor instance.
Initializes current_sen_id with 0.
-
consume
(word)[source]¶ Expand the current history by word and update the internal predictor state accordingly. Two calls of consume() must be separated by a predict_next() call.
Parameters: word (int) – Word to add to the current history
-
estimate_future_cost
(hypo)[source]¶ Predictors can implement their own look-ahead cost functions. They are used in A* if the --heuristics parameter is set to predictor. This function should return the future log cost (i.e. the lower the better) given the current predictor state, assuming that the last word in the partial hypothesis ‘hypo’ is consumed next. This function must not change the internal predictor state.
Parameters: hypo (PartialHypothesis) – Hypothesis for which to estimate the future cost given the current predictor state
Returns: float. Future cost
-
finalize_posterior
(scores, use_weights, normalize_scores)[source]¶ This method can be used to enforce the parameters use_weights and normalize_scores in predictors with dict posteriors.
Parameters: - scores (dict) – Unnormalized log valued scores
- use_weights (bool) – Set to false to replace all values in scores with 0 (= log 1)
- normalize_scores (bool) – Set to true to make the exp of elements in scores sum up to 1
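The effect of the two flags on a dict posterior can be sketched as below. This is an illustration written from the parameter descriptions, not the SGNMT implementation, and the function name is hypothetical.

```python
import math


def finalize_posterior_sketch(scores, use_weights, normalize_scores):
    """Apply use_weights and normalize_scores to a dict of log scores."""
    if not scores:
        return scores
    if not use_weights:
        # Replace every score with 0 (= log 1).
        scores = {word: 0.0 for word in scores}
    if normalize_scores:
        top = max(scores.values())
        if top == float("-inf"):
            return scores  # nothing to normalize
        # Stable log of the partition sum via the log-sum-exp trick.
        log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
        scores = {word: s - log_z for word, s in scores.items()}
    return scores
```

After normalization, the exponentiated scores sum to 1, so unnormalized weights 2 and 6 become probabilities 0.25 and 0.75.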
-
get_state
()[source]¶ Get the current predictor state. The state can be any object or tuple of objects which makes it possible to return to the predictor state with the current history.
Returns: object. Predictor state
-
get_unk_probability
(posterior)[source]¶ This function defines the probability of all words which are not in posterior. This is usually used to combine open and closed vocabulary predictors. The argument posterior should have been produced with predict_next()
Parameters: posterior (list,array,dict) – Return value of the last call of predict_next
Returns: Score to use for words outside posterior
Return type: float
-
initialize
(src_sentence)[source]¶ Initialize the predictor with the given source sentence. This resets the internal predictor state and loads everything which is constant throughout the processing of a single source sentence. For example, the NMT decoder runs the encoder network and stores the source annotations.
Parameters: src_sentence (list) – List of word IDs which form the source sentence without <S> or </S>
-
initialize_heuristic
(src_sentence)[source]¶ This is called after initialize() if the predictor is registered as heuristic predictor (i.e. estimate_future_cost() will be called in the future). Predictors can implement this function for initialization of their own heuristic mechanisms.
Parameters: src_sentence (list) – List of word IDs which form the source sentence without <S> or </S>
-
is_equal
(state1, state2)[source]¶ Returns true if two predictor states are equal, i.e. both states will always result in the same scores. This is used for hypothesis recombination
Parameters: - state1 (object) – First predictor state
- state2 (object) – Second predictor state
Returns: bool. True if both states are equal, false if not
-
notify
(message, message_type=1)[source]¶ We implement the notify method from the Observer super class with an empty method here s.t. predictors do not need to implement it.
Parameters: message (object) – The posterior sent by the decoder
-
predict_next
()[source]¶ Returns the predictive distribution over the target vocabulary for the next word given the predictor state. Note that the prediction itself can change the state of the predictor. For example, the neural predictor updates the decoder network state and its attention to predict the next word. Two calls of predict_next() must be separated by a consume() call.
Returns: dictionary,array,list. Word log probabilities for the next target token. All ids which are not set are assumed to have probability get_unk_probability()
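The alternation of predict_next() and consume() can be illustrated with a toy greedy decoding loop. ToyBigramPredictor, greedy_decode, and the end-of-sentence id EOS_ID = 2 are assumptions made up for this sketch; only the method names initialize, predict_next, and consume follow the interface described here.

```python
EOS_ID = 2  # hypothetical end-of-sentence id, chosen for this example


class ToyBigramPredictor:
    """Hypothetical predictor backed by a table of next-word log scores."""

    def __init__(self, table):
        self.table = table  # table[last_word] = {next_word: log_score}
        self.last_word = None

    def initialize(self, src_sentence):
        self.last_word = None  # None stands for the sentence start

    def predict_next(self):
        return self.table.get(self.last_word, {EOS_ID: 0.0})

    def consume(self, word):
        self.last_word = word


def greedy_decode(predictor, src_sentence, max_len=10):
    """Pick the best word at each step, calling the two methods in turn."""
    predictor.initialize(src_sentence)
    hypo = []
    for _ in range(max_len):
        posterior = predictor.predict_next()  # exactly one predict_next ...
        word = max(posterior, key=posterior.get)
        if word == EOS_ID:
            break
        hypo.append(word)
        predictor.consume(word)  # ... followed by exactly one consume
    return hypo
```

Each loop iteration issues one predict_next() and at most one consume(), so the two calls never occur twice in a row.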
-
set_current_sen_id
(cur_sen_id)[source]¶ This function is called between initialize() calls to increment the sentence id counter. It can also be used to skip sentences for the --range argument.
Parameters: cur_sen_id (int) – Sentence id for the next call of initialize()
-
set_state
(state)[source]¶ Loads a predictor state from an object created with get_state(). Note that this does not copy the argument but just references the given state. If state is going to be used in the future to return to that point again, you should copy the state with copy.deepcopy() before.
Parameters: state (object) – Predictor state as returned by get_state()
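The need for copy.deepcopy() can be shown with a toy predictor whose state is a mutable list. HistoryPredictor is hypothetical; it only mimics the reference semantics of set_state() described above.

```python
import copy


class HistoryPredictor:
    """Toy predictor whose state is simply the list of consumed words."""

    def __init__(self):
        self.history = []

    def consume(self, word):
        self.history.append(word)

    def get_state(self):
        return self.history

    def set_state(self, state):
        self.history = state  # keeps a reference, does not copy


predictor = HistoryPredictor()
predictor.consume(1)
saved = copy.deepcopy(predictor.get_state())  # snapshot of the state [1]
predictor.consume(2)                          # explore one continuation
predictor.set_state(copy.deepcopy(saved))     # return to the snapshot
predictor.consume(3)                          # explore another continuation
```

Because both the snapshot and the restored state are copies, saved still holds [1] and can be reused for further continuations.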
-
-
class
cam.sgnmt.predictors.core.
UnboundedVocabularyPredictor
[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
Predictors under this class implement models with very large target vocabularies, for which it is too inefficient to list the entire posterior. Instead, they are evaluated only for a given list of target words. This list is usually created by taking all non-zero probability words from the bounded vocabulary predictors. An example of an unbounded vocabulary predictor is the ngram predictor: Instead of listing the entire ngram vocabulary, we run srilm only on the words which are possible according to other predictors (e.g. fst or nmt). This is realized by introducing the trgt_words argument to predict_next.
Initializes current_sen_id with 0.
-
predict_next
(trgt_words)[source]¶ Like in Predictor, returns the predictive distribution over target words given the predictor state. Note that the prediction itself can change the state of the predictor. For example, the neural predictor updates the decoder network state and its attention to predict the next word. Two calls of predict_next() must be separated by a consume() call.
Parameters: trgt_words (list) – List of target word ids.
Returns: dictionary,array,list. Word log probabilities for the next target token. All ids which are not set are assumed to have probability get_unk_probability(). The returned set should not contain any ids which are not in trgt_words, but it does not have to score all of them
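The interplay between a bounded posterior and an unbounded vocabulary predictor can be sketched as follows. ToyUnboundedPredictor and combine are hypothetical, and adding the two log scores is a simplification chosen for this example rather than SGNMT's actual combination scheme.

```python
class ToyUnboundedPredictor:
    """Hypothetical predictor that scores only the words it is asked about."""

    def __init__(self, score_fn):
        self.score_fn = score_fn  # maps a word id to a log score

    def predict_next(self, trgt_words):
        return {word: self.score_fn(word) for word in trgt_words}


def combine(bounded_posterior, unbounded_predictor, unk_score=float("-inf")):
    """Score the words of a bounded posterior with an unbounded predictor."""
    candidates = list(bounded_posterior)
    unbounded = unbounded_predictor.predict_next(candidates)
    return {
        word: bounded_posterior[word] + unbounded.get(word, unk_score)
        for word in candidates
    }
```

The unbounded predictor never enumerates its full vocabulary: it only sees the candidate words that the bounded posterior already considers possible.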
-
cam.sgnmt.predictors.forced module¶
This module contains predictors for forced decoding. This can be
done either with one reference (forced, ForcedPredictor), or with
multiple references in the form of an n-best list (forcedlst, ForcedLstPredictor).
-
class
cam.sgnmt.predictors.forced.
ForcedLstPredictor
(trg_test_file, use_scores=True, match_unk=False, feat_name=None)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor can be used for direct n-best list rescoring. In contrast to the ForcedPredictor, it reads an n-best list in Moses format and uses its scores as predictive probabilities of the </S> symbol. Everywhere else it gives the predictive probability 1 if the history corresponds to at least one n-best list entry, 0 otherwise. From the n-best list we use
- First column: Sentence id
- Second column: Hypothesis in integer format
- Last column: score
Note: Behavior is undefined if you have duplicates in the n-best list
TODO: Would be much more efficient to use Tries for cur_trgt_sentences instead of a flat list.
Creates a new n-best rescoring predictor instance.
Parameters: - trg_test_file (string) – Path to the n-best list
- use_scores (bool) – Whether to use the scores from the n-best list. If false, use uniform scores of 0 (=log 1).
- match_unk (bool) – If true, allow any word where the n-best list contains UNK.
- feat_name (string) – Instead of the combined score in the last column of the Moses n-best list, we can use one of the sparse features. Set this to the name of the feature (denoted as <name>= in the n-best list) if you wish to do that.
-
get_unk_probability
(posterior)[source]¶ Return negative infinity unconditionally - words outside the n-best list are not possible according to this predictor.
-
initialize
(src_sentence)[source]¶ Resets the history and loads the n-best list entries for the next source sentence
Parameters: src_sentence (list) – Not used
-
predict_next
()[source]¶ Outputs 0.0 (i.e. prob=1) for all words for which there is an entry in cur_trg_sentences, and the score in cur_trg_sentences if the current history is by itself equal to an entry in cur_trg_sentences.
TODO: The implementation here is fairly inefficient as it scans through all target sentences linearly. Would be better to organize the target sentences in a Trie
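The linear scan over n-best entries can be sketched as below. The function nbest_predict_next, the (sentence, score) pair layout, and the end-of-sentence id EOS_ID = 2 are assumptions for this example.

```python
EOS_ID = 2  # hypothetical end-of-sentence id, chosen for this example


def nbest_predict_next(history, nbest):
    """Build the posterior for history from a list of (sentence, score) pairs."""
    posterior = {}
    n = len(history)
    for sentence, score in nbest:
        if sentence[:n] != history:
            continue  # entry is inconsistent with the current history
        if len(sentence) == n:
            posterior[EOS_ID] = score  # history is a complete entry
        else:
            posterior[sentence[n]] = 0.0  # next word on this entry, prob=1
    return posterior
```

Every call compares the history against all entries, which is the linear cost that the TODO proposes to avoid with a trie.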
-
class
cam.sgnmt.predictors.forced.
ForcedPredictor
(trg_test_file, spurious_words=[])[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor realizes forced decoding. It stores one target sentence for each source sentence and outputs predictive probability 1 along this path, and 0 otherwise.
Creates a new forced decoding predictor.
Parameters: - trg_test_file (string) – Path to the plain text file with the target sentences. Must have the same number of lines as the number of source sentences to decode
- spurious_words (list) – List of words that are permitted to occur anywhere in the sequence
-
consume
(word)[source]¶ If word matches the target sentence, we increase the current history by one. Otherwise, we set this predictor in an invalid state, in which it always predicts </S>
Parameters: word (int) – Next word to consume
-
get_unk_probability
(posterior)[source]¶ Returns negative infinity unconditionally: Words which are not in the target sentence are assigned probability 0 by this predictor.
-
initialize
(src_sentence)[source]¶ Fetches the corresponding target sentence and resets the current history.
Parameters: src_sentence (list) – Not used
cam.sgnmt.predictors.grammar module¶
This module contains everything related to the hiero predictor. This
predictor allows applying rules from a syntactical SMT system directly
in SGNMT. The main interface is RuleXtractPredictor, which can be
used like other predictors during decoding.
The Hiero predictor follows the LRHiero implementation from
https://github.com/sfu-natlang/lrhiero
Efficient Left-to-Right Hierarchical Phrase-based Translation with Improved Reordering. Maryam Siahbani, Baskaran Sankaran and Anoop Sarkar. EMNLP 2013. Oct 18-21, 2013. Seattle, USA.
However, note that we modified the code to a) deal with an arbitrary number of non-terminals, b) work with ruleXtract, and c) allow spurious ambiguity
ATTENTION: This implementation is experimental!!
-
class
cam.sgnmt.predictors.grammar.
Cell
(init_hypo=None)[source]¶ Comparable to a CYK cell: A set of hypotheses. If duplicates are added, we do hypo combination by combining the costs and retaining only one of them. Internally, the hypotheses are stored in a list sorted by the sum of the translation prefix
Creates a new Cell with only one hypothesis.
Parameters: init_hypo (LRHieroHypothesis) – Initial hypothesis
-
add
(hypo)[source]¶ Add a new hypothesis to the cell. If an equivalent hypothesis already exists, combine both hypotheses.
Parameters: hypo (LRHieroHypothesis) – Hypothesis to add under the key hypo.key
-
filter
(pos, symb)[source]¶ Remove all hypotheses which do not have symb at pos in their trgt_prefix. Breaks if pos is out of range for some trgt_prefix
-
-
class
cam.sgnmt.predictors.grammar.
LRHieroHypothesis
(trgt_prefix, spans, cost)[source]¶ Represents an LRHiero hypothesis, which is defined by the accumulated cost, the target prefix, and open source spans.
Creates a new LRHiero hypothesis
Parameters: - trgt_prefix (list) – Target side translation prefix, i.e. the partial target sentence which is translated so far
- spans (list) – List of spans which are not covered yet, in left-to-right order on target side
- cost (float) – Cost of this partial hypothesis
-
class
cam.sgnmt.predictors.grammar.
Rule
(rhs_src, rhs_trgt, trgt_src_map, cost)[source]¶ A rule consists of rhs_src and rhs_trgt, both are sequences of integers. NTs are indicated with negative sign. The trgt_src_map defines which NT on the target side belongs to which NT on the source side.
Creates a new rule.
Parameters: - rhs_src (list) – Source on the right hand side of the rule
- rhs_trgt (list) – Target on the right hand side of the rule
- trgt_src_map (dict) – Defines which NT on the target side belongs to which NT on the source side
-
last_id
= 0¶
-
class
cam.sgnmt.predictors.grammar.
RuleSet
[source]¶ This class stores the set of rules and provides efficient retrieval and matching functionality
Initializes the set by setting up the trie data structure for storing the rules.
-
INF
= 10000¶
-
create_rule
(rhs_src, rhs_trgt, weight)[source]¶ Creates a rule object (factory method)
Parameters: - rhs_src (list) – String sequence describing the source of the right-hand-side of the rule
- rhs_trgt (list) – String sequence describing the target of the right-hand-side of the rule
- weight (float) – Rule weight
Returns: Rule or None if something went wrong
-
expand_hypo
(hypo, src_seq)[source]¶ Similar to getSpanRules() and GrowHypothesis() in Alg. 1 in (Siahbani, 2013) combined. Gets all rules which match the given span.
- If the p parameter of the span is a single non-terminal, we return hypotheses resulting from productions of this non-terminal. Note that rules might be applicable in many different ways: X -> A the B can be applied to foo the bar the baz in two ways. In this case, we add the translation prefix, but leave the borders of the span untouched, and change the p value to the rhs of the production (i.e. “A the B”). If p consists of multiple characters, the spans store the minimum and maximum length, not the begin and end since the exact begin and end positions are variable.
- If the p parameter of the span has length > 1, we return a set of hypotheses in which the first subspan has a single NT as p parameter.
Through this contract we can e.g. handle spurious ambiguity, if two NT are on the source side. However, resolving this ambiguity is implemented in a lazy fashion: we delay fixing the span boundaries until we need to expand the hypothesis once more, and then we fix only the first boundaries for the first span.
Parameters: - hypo (LRHieroHypothesis) – Hypothesis to expand
- src_seq (list) – Source sequence to match
-
parse
(line, feature_weights=None)[source]¶ Parse a line in a rule file from ruleXtract and add the rule to the set.
Parameters: - line (string) –
- feature_weights (list) – score or None to use uniform weights
-
update_span_len_range
()[source]¶ This method updates the span_len_range variable by finding boundaries for the spans each non terminal can cover. This is done iteratively: First, guess the range for each NT to (0, inf). Then, iterate through all rules for a specific NT and adjust the boundaries given the ranges for all other NTs. Do this until ranges do not change anymore. This is an expensive operation and should be done after adding all rules. Note also that the tries store a reference to self.span_len_range, i.e. the variable is propagated to all tries automatically.
-
-
class
cam.sgnmt.predictors.grammar.
RuleXtractPredictor
(ruleXtract_path, use_weights, feature_weights=None)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
Predictor based on ruleXtract rules. Bins are organized according to the number of target words. We assume that no rule produces the empty word on the source side (but possibly on the target side). Hypotheses are produced iteratively s.t. the following invariant holds: The bins contain a set of (partial) hypotheses from which we can derive all full hypotheses which are consistent with the current target prefix (i.e. the prefix of the target sentence which has already been translated). This set is updated when calling either consume_word or predict_next: consume_word deletes all hypotheses which become inconsistent with the new word. predict_next requires all hypotheses to have a target_prefix length of at least one plus the number of consumed words. Therefore, predict_next expands hypotheses as long as they are shorter. This fits nicely with grouping hypotheses in bins of same target prefix length: we expand until all low rank bins are empty. We predict the next target word by using the cost of the best hypothesis with the word at the right position.
Note that this predictor is similar to the decoding algorithm in
Efficient Left-to-Right Hierarchical Phrase-based Translation with Improved Reordering. Maryam Siahbani, Baskaran Sankaran and Anoop Sarkar. EMNLP 2013. Oct 18-21, 2013. Seattle, USA.
without cube pruning, but it is extended to an arbitrary number of non-terminals as produced with ruleXtract.
Creates a new hiero predictor.
Parameters: - ruleXtract_path (string) – Path to the rules file
- use_weights (bool) – If false, set all hypothesis scores uniformly to 0 (= log 1). If true, use the rule weights to compute hypothesis scores
- feature_weights (list) – Rule feature weights to compute the rule scores. If this is none we use uniform weights
-
build_posterior
()[source]¶ We need to scan all hypotheses in self.stacks and add up scores grouped by the symbol at the n_consumed+1-th position. Then, we add end-of-sentence probability by checking self.finals[n_consumed]
-
get_state
()[source]¶ Predictor state consists of the stacks, the completed hypotheses, and the number of consumed words.
-
get_unk_probability
(posterior)[source]¶ Returns negative infinity if the posterior is not empty as words outside the grammar are not possible according to this predictor. If posterior is empty, return 0 (= log 1)
-
predict_next
()[source]¶ For predicting the distribution of the next target tokens, we need to empty the stack with the current history length by expanding all hypotheses on it. Then, all hypotheses are in larger bins, i.e. have a longer target prefix than the current history. Thus, we can look up the possible next words by iterating through all active hypotheses.
-
class
cam.sgnmt.predictors.grammar.
Span
(p, borders)[source]¶ A span is defined by the start and end position and the corresponding sequence of terminal and non-terminal symbols p. Normally, p is just a single NT symbol. However, if it is ambiguous how to apply a rule to a span (e.g. rule X -> X the X to span foo the bar the baz), we allow the ambiguity to be resolved later on demand. In this case, p = X the X
Fully initializes a new
Span
instance.
Parameters: - p (list) – See class docstring for
Span
- borders (tuple) – (begin, end) with begin inclusive and end exclusive
-
class
cam.sgnmt.predictors.grammar.
Trie
(span_len_range)[source]¶ This trie implementation allows matching NT symbols with arbitrary symbol sequences with certain lengths when searching. Note: This trie does not implement edge collapsing - each edge is labeled with exactly one word
Creates an empty trie data structure.
Parameters: span_len_range (tuple) – minimum and maximum span lengths for non-terminal symbols -
add
(seq, element)[source]¶ Add an element to the trie data structure. The key sequence
seq
can contain non-terminals with negative IDs. If an element with the same key already exists in the data structure, we do not delete it but store both items.
Parameters: - seq (list) – Sequence of terminals and non-terminals used as key in the trie
- element (object) – Object to associate with
seq
-
get_elements
(src_seq)[source]¶ Get all elements (e.g. rules) which match the given sequence of source tokens.
Parameters: seq (list) – Sequence of terminals and non-terminals used as key in the trie
Returns: (rules, nt_span_lens)
. The first dictionary contains all applying rules. nt_span_lens
lists the number of symbols each of the NTs on the source side covers. Make sure that self.span_len_range
is updated
Return type: two dicts
-
replace
(seq, element)[source]¶ Replaces all elements stored at a
seq
with a new single element element
. This is equivalent to first removing all items with key seq
, and then adding the new element with add(seq, element)
Parameters: - seq (list) – Sequence of terminals and non-terminals used as key in the trie
- element (object) – Object to associate with
seq
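The add and replace semantics described above can be sketched with a plain nested-dict trie. This is only an illustration: `SimpleTrie` and its `get` method are hypothetical names, and the span-length matching of non-terminals done by the real `Trie` is left out.

```python
class SimpleTrie:
    """Toy trie: one symbol per edge, possibly several elements per key."""

    def __init__(self):
        self.children = {}   # symbol -> SimpleTrie
        self.elements = []   # elements stored at this node

    def _node(self, seq):
        # Walk down the trie, creating missing nodes on the way.
        node = self
        for sym in seq:
            node = node.children.setdefault(sym, SimpleTrie())
        return node

    def add(self, seq, element):
        # Elements already stored under the same key are kept.
        self._node(seq).elements.append(element)

    def replace(self, seq, element):
        # Drop everything stored at seq, then store the single new element.
        self._node(seq).elements = [element]

    def get(self, seq):
        # Exact lookup only (no non-terminal matching in this sketch).
        node = self
        for sym in seq:
            node = node.children.get(sym)
            if node is None:
                return []
        return list(node.elements)
```

Negative integers can be used in `seq` to stand for non-terminals, as in the docstring of `add`.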
-
cam.sgnmt.predictors.length module¶
This module contains predictors that deal with the length of the
target sentence. The NBLengthPredictor
assumes a negative binomial
distribution on the target sentence lengths, where the parameters r and
p are linear combinations of features extracted from the source
sentence. The WordCountPredictor
adds the number of words as cost,
which can be used to prevent hypotheses from getting too short when
using a language model.
-
class
cam.sgnmt.predictors.length.
ExternalLengthPredictor
(path)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor loads the distribution over target sentence lengths from an external file. The file contains blank separated length:score pairs in each line which define the length distribution. The predictor adds the specified scores directly to the EOS score.
Creates an external length distribution predictor.
Parameters: path (string) – Path to the file with target sentence length distributions. -
get_unk_probability
(posterior)[source]¶ Returns 0=log 1 if the partial hypothesis does not exceed max length. Otherwise, predict next returns an empty set, and we set everything else to -inf.
-
initialize
(src_sentence)[source]¶ Fetches the corresponding target sentence length distribution and resets the word counter.
Parameters: src_sentence (list) – Not used
-
-
class
cam.sgnmt.predictors.length.
NBLengthPredictor
(text_file, model_weights, use_point_probs, offset=0)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor assumes that target sentence lengths are distributed according to a negative binomial distribution with parameters r,p. r is linear in features, p is the logistic of a linear function over the features. Weights can be trained using the Matlab script
estimate_length_model.m
Let w be the model_weights. All features are extracted from the src sentence:
r = w0 * #char + w1 * #words + w2 * #punctuation + w3 * #char/#words + w4 * #punct/#words + w10
p = logistic(w5 * #char + w6 * #words + w7 * #punctuation + w8 * #char/#words + w9 * #punct/#words + w11)
target_length ~ NB(r,p)
The biases w10 and w11 are optional.
The predictor predicts EOS with NB(#consumed_words,r,p)
Creates a new target sentence length model predictor.
Parameters: - text_file (string) – Path to the text file with the unindexed source sentences, i.e. not using word ids
- model_weights (list) – Weights w0 to w11 of the length model. See class docstring for more information
- use_point_probs (bool) – Use point estimates for EOS token, 0.0 otherwise
- offset (int) – Subtract this from hypothesis length before applying the NB model
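As a rough sketch of the length model above, the following computes r and p from a raw source sentence and evaluates a negative binomial log probability. Several details are assumptions made for illustration and are not taken from the SGNMT source: the helper names, the punctuation set, counting characters including blanks, and the NB convention pmf(k) = C(k+r-1, k) * p^k * (1-p)^r.

```python
import math

def nb_length_params(src_sentence, w, punctuation=",.;:!?"):
    """Computes (r, p) from weights w0..w9 plus optional biases w10, w11."""
    words = src_sentence.split()          # must be non-empty
    n_char = float(len(src_sentence))     # assumption: blanks are counted
    n_words = float(len(words))
    n_punct = float(sum(1 for c in src_sentence if c in punctuation))
    feats = [n_char, n_words, n_punct, n_char / n_words, n_punct / n_words]
    r = sum(wi * f for wi, f in zip(w[0:5], feats))
    z = sum(wi * f for wi, f in zip(w[5:10], feats))
    if len(w) > 10:                       # optional bias for r
        r += w[10]
    if len(w) > 11:                       # optional bias for p
        z += w[11]
    p = 1.0 / (1.0 + math.exp(-z))        # logistic function
    return r, p

def nb_log_pmf(k, r, p):
    """log NB(k; r, p) under the convention stated in the lead-in."""
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + k * math.log(p) + r * math.log(1.0 - p))
```

With these helpers, the EOS score after `n` consumed words would be `nb_log_pmf(n - offset, r, p)` if point estimates are used.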
-
get_state
()[source]¶ State consists of the number of consumed words, and the accumulator for previous EOS probability estimates if we don’t use point estimates.
-
get_unk_probability
(posterior)[source]¶ If we use point estimates, return 0 (= log 1). Otherwise, return 1-p(EOS), with p(EOS) fetched from
posterior
-
class
cam.sgnmt.predictors.length.
NgramCountPredictor
(path, order=0, discount_factor=-1.0)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor counts the number of n-grams in hypotheses. n-gram posteriors are loaded from a file. The predictor score is the sum of all n-gram posteriors in a hypothesis.
Creates a new ngram count predictor instance.
Parameters: - path (string) – Path to the n-gram posteriors. File format: <ngram> : <score> (one ngram per line). Use placeholder %d for sentence id.
- order (int) – If positive, count n-grams of the specified order. Otherwise, count all n-grams
- discount_factor (float) – If non-negative, discount n-gram posteriors by this factor each time they are consumed
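The scoring rule (sum of the posteriors of all n-grams in a hypothesis) can be sketched as follows. The function name and the `max_order` cap are hypothetical, the posteriors are assumed to be a dict keyed by n-gram tuples, and discounting is not modelled.

```python
def ngram_count_score(hypo, posteriors, order=0, max_order=4):
    """Sums the posteriors of all n-grams found in hypo.

    If order is positive, only n-grams of that order are counted.
    Otherwise all orders from 1 to max_order are counted.
    """
    orders = [order] if order > 0 else range(1, max_order + 1)
    total = 0.0
    for n in orders:
        for i in range(len(hypo) - n + 1):
            # n-grams without a posterior contribute nothing.
            total += posteriors.get(tuple(hypo[i:i + n]), 0.0)
    return total
```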
-
consume
(word)[source]¶ Adds
word
to the current history. Shorten if the extended history exceeds max_history_len.
Parameters: word (int) – Word to add to the history.
-
initialize
(src_sentence)[source]¶ Loads n-gram posteriors and resets history.
Parameters: src_sentence (list) – not used
-
is_equal
(state1, state2)[source]¶ Hypothesis recombination is not supported if discounting is enabled.
-
class
cam.sgnmt.predictors.length.
NgramizePredictor
(min_order, max_order, max_len_factor, slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This wrapper extracts n-gram posteriors from a predictor which does not depend on the particular argument of consume(). In that case, we can build a lookup mechanism for all possible n-grams in a single forward pass through the predictor search space: We record all posteriors (predict_next() return values) of the slave predictor during a greedy pass in initialize(). The wrapper predictor state is the current n-gram history. We use the (semiring) sum over all possible positions of the current n-gram history in the recorded slave predictor posteriors to form the n-gram scores returned by this predictor.
Note that this wrapper does not work correctly if the slave predictor feeds back the selected token in the history, i.e. depends on the particular token which is provided via consume().
TODO: Make this wrapper work with slaves which return dicts.
Creates a new ngramize wrapper predictor.
Parameters: - min_order (int) – Minimum n-gram order
- max_order (int) – Maximum n-gram order
- max_len_factor (int) – Stop the forward pass through the slave predictor after src_length times this factor
- slave_predictor (Predictor) – Instance of the predictor which
uses the source sentences in
src_test
Raises: AttributeError if order is not positive.
-
initialize
(src_sentence)[source]¶ Runs greedy decoding on the slave predictor to populate self.scores and self.unk_scores, resets the history.
-
class
cam.sgnmt.predictors.length.
UnkCountPredictor
(src_vocab_size, lambdas)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor regulates the number of UNKs in the output. We assume that the number of UNKs in the target sentence is Poisson distributed. This predictor is configured with n lambdas for 0,1,...,>=n-1 UNKs in the source sentence.
Initializes the UNK count predictor.
Parameters: - src_vocab_size (int) – Size of source language vocabulary. Indices greater than this are considered as UNK.
- lambdas (list) – List of floats. The first entry is the lambda parameter given that the number of unks in the source sentence is 0 etc. The last float is lambda given that the source sentence has more than n-1 unks.
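A minimal sketch of the Poisson model described above: the lambda is selected by the number of UNKs in the source sentence, and the Poisson log probability of a given target UNK count follows from it. Both helper names are hypothetical, and how the predictor turns this distribution into per-token scores is not shown.

```python
import math

def select_lambda(src_sentence, src_vocab_size, lambdas):
    """Picks lambda by the number of source UNKs (IDs > src_vocab_size)."""
    n_src_unk = sum(1 for w in src_sentence if w > src_vocab_size)
    # The last entry covers all source UNK counts >= n-1.
    return lambdas[min(n_src_unk, len(lambdas) - 1)]

def poisson_log_prob(k, lam):
    """log of the Poisson pmf: lam^k * exp(-lam) / k!"""
    return k * math.log(lam) - lam - math.lgamma(k + 1)
```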
-
consume
(word)[source]¶ Increases unk counter by one if
word
is unk.
Parameters: word (int) – Increase counter if word
is UNK
-
class
cam.sgnmt.predictors.length.
WeightNonTerminalPredictor
(slave_predictor, penalty_factor=1.0, nonterminal_ids=None, min_terminal_id=0, max_terminal_id=30003, vocab_size=30003)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This wrapper multiplies the weight of given tokens (those outside the min/max terminal range) by a factor.
Creates a new id-weighting wrapper for a predictor
Parameters: - slave_predictor – predictor to apply penalty to.
- penalty_factor (float) – factor by which to multiply tokens in range
- min_terminal_id – lower bound of tokens not to penalize, if nonterminal_penalty selected
- max_terminal_id – upper bound of tokens not to penalize, if nonterminal_penalty selected
- vocab_size – upper bound of tokens, used to find nonterminal range
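The effect of the wrapper on a posterior can be sketched as below. This assumes the posterior is a dict from token ID to score and that both range bounds are inclusive. Neither assumption is confirmed by the documentation, and the function name is hypothetical.

```python
def weight_nonterminals(posterior, penalty_factor,
                        min_terminal_id, max_terminal_id):
    """Multiplies scores of tokens outside the terminal range by a factor."""
    return {
        tok: score if min_terminal_id <= tok <= max_terminal_id
        else score * penalty_factor
        for tok, score in posterior.items()
    }
```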
-
class
cam.sgnmt.predictors.length.
WordCountPredictor
(word=-1, nonterminal_penalty=False, nonterminal_ids=None, min_terminal_id=0, max_terminal_id=30003, negative_wc=True, vocab_size=30003)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor adds the (negative) number of words as a feature. This means that this predictor encourages shorter hypotheses when used with a positive weight.
Creates a new word count predictor instance.
Parameters: - word (int) – If this is non-negative we count only the number of the specified word. If it is negative, count all words
- nonterminal_penalty (bool) – If true, apply penalty only to tokens in a range (the range outside min/max terminal id)
- nonterminal_ids – file containing ids of nonterminal tokens
- min_terminal_id – lower bound of tokens not to penalize, if nonterminal_penalty selected
- max_terminal_id – upper bound of tokens not to penalize, if nonterminal_penalty selected
- negative_wc – If true, the score of this predictor is the negative word count.
- vocab_size – upper bound of tokens, used to find nonterminal range
-
cam.sgnmt.predictors.length.
load_external_lengths
(path)[source]¶ Loads a length distribution from a plain text file. The file must contain blank separated <length>:<score> pairs in each line.
Parameters: path (string) – Path to the length file. Returns: list of dicts mapping a length to its scores, one dict for each sentence.
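The per-line format can be parsed as in the following sketch. `parse_length_line` is a hypothetical helper and is not part of the module. Applying it to every line of the file would give the list of dicts described above.

```python
def parse_length_line(line):
    """Parses one line of blank separated <length>:<score> pairs."""
    scores = {}
    for pair in line.split():
        length, score = pair.split(":")
        scores[int(length)] = float(score)
    return scores
```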
cam.sgnmt.predictors.misc module¶
This module provides helper predictors and predictor wrappers which are not directly used for scoring. An example is the altsrc predictor wrapper which loads source sentences from a different file.
-
class
cam.sgnmt.predictors.misc.
AltsrcPredictor
(src_test, slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This wrapper loads the source sentences from an alternative source file. The
src_sentence
arguments of initialize
and initialize_heuristic
are overridden with sentences loaded from the file specified via the argument --altsrc_test
. All other methods are pass through calls to the slave predictor.
Creates a new altsrc wrapper predictor.
Parameters: - src_test (string) – Path to the text file with source sentences
- slave_predictor (Predictor) – Instance of the predictor which
uses the source sentences in
src_test
-
initialize
(src_sentence)[source]¶ Pass through to slave predictor but replace
src_sentence
with a sentence from self.altsens
-
initialize_heuristic
(src_sentence)[source]¶ Pass through to slave predictor but replace
src_sentence
with a sentence from self.altsens
-
class
cam.sgnmt.predictors.misc.
GluePredictor
(max_len_factor, slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This wrapper masks sentence-level predictors when SGNMT runs on the document level. The SGNMT hypotheses consist of multiple sentences, glued together with <s>, but the wrapped predictor is trained on the sentence level. This predictor splits input sequences at <s> and feeds them to the predictor one by one. The wrapped predictor is initialized with a new source sentence when the sentence boundary symbol <s> is emitted. Note that using the predictor heuristic of the wrapped predictor estimates the future cost for the current sentence, not for the entire document.
Creates a new glue wrapper predictor.
Parameters: - max_len_factor (int) – Target sentences cannot be longer than this times source sentence length
- slave_predictor (Predictor) – Instance of the sentence-level predictor.
-
consume
(word)[source]¶ If
word
is <s>, initialize the slave predictor with the next source sentence. Otherwise, pass through word
to the consume()
method of the slave.
-
initialize
(src_sentence)[source]¶ Splits
src_sentence
at utils.GO_ID
, stores all segments for later use, and calls initialize()
of the slave predictor with the first segment.
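The splitting step can be sketched as follows. The value of `GO_ID` below is a placeholder assumed for illustration and stands in for `utils.GO_ID`. The function name is hypothetical.

```python
GO_ID = 1  # assumed placeholder for utils.GO_ID, the ID of <s>

def split_at_go(src_sentence, go_id=GO_ID):
    """Splits a document-level source sequence into sentence segments."""
    segments, cur = [], []
    for tok in src_sentence:
        if tok == go_id:
            # Sentence boundary: close the current segment.
            segments.append(cur)
            cur = []
        else:
            cur.append(tok)
    segments.append(cur)
    return segments
```

The first segment would be passed to the slave's `initialize()`, and each later segment when the boundary symbol is consumed.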
-
is_last_sentence
()[source]¶ Returns True if the current sentence is the last sentence in this document - i.e. we have already consumed n-1 <s> symbols since the last call of
initialize()
.
-
predict_next
()[source]¶ Calls predict_next() of the wrapped predictor. Replaces BOS scores with EOS score if we still have source sentences left.
-
class
cam.sgnmt.predictors.misc.
RankPredictor
(slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This wrapper converts predictor scores to (negative) ranks, i.e. the best word gets a score of -1, the second best of -2 and so on.
Note: Using this predictor with UNK matching or predictor heuristics is not recommended.
Creates a new rank wrapper predictor.
Parameters: slave_predictor (Predictor) – Use score of this predictor to compute ranks.
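The score-to-rank conversion can be illustrated with a small sketch. It assumes a dict posterior and breaks ties by insertion order, which may differ from the actual implementation. The function name is hypothetical.

```python
def scores_to_ranks(posterior):
    """Maps the best token to -1, the second best to -2, and so on."""
    ordered = sorted(posterior, key=posterior.get, reverse=True)
    return {tok: -(rank + 1.0) for rank, tok in enumerate(ordered)}
```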
-
class
cam.sgnmt.predictors.misc.
UnboundedAltsrcPredictor
(src_test, slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.misc.AltsrcPredictor
, cam.sgnmt.predictors.core.UnboundedVocabularyPredictor
This class is a version of
AltsrcPredictor
for unbounded vocabulary predictors. This needs an adjusted predict_next
method to pass through the set of target words to score correctly.
Pass through to
AltsrcPredictor.__init__
-
class
cam.sgnmt.predictors.misc.
UnboundedGluePredictor
(max_len_factor, slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.misc.GluePredictor
, cam.sgnmt.predictors.core.UnboundedVocabularyPredictor
This class is a version of
GluePredictor
for unbounded vocabulary predictors.
Creates a new glue wrapper predictor.
Parameters: - max_len_factor (int) – Target sentences cannot be longer than this times source sentence length
- slave_predictor (Predictor) – Instance of the sentence-level predictor.
-
class
cam.sgnmt.predictors.misc.
UnboundedRankPredictor
(slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.misc.RankPredictor
, cam.sgnmt.predictors.core.UnboundedVocabularyPredictor
This class is a version of
RankPredictor
for unbounded vocabulary predictors. This needs an adjusted predict_next
method to pass through the set of target words to score correctly.
Creates a new rank wrapper predictor.
Parameters: slave_predictor (Predictor) – Use score of this predictor to compute ranks.
cam.sgnmt.predictors.ngram module¶
This module contains predictors for n-gram (Kneser-Ney) language
modeling. This is an UnboundedVocabularyPredictor
as the vocabulary
size of n-gram models normally does not permit complete enumeration of the
posterior.
-
class
cam.sgnmt.predictors.ngram.
KenLMPredictor
(path)[source]¶ Bases:
cam.sgnmt.predictors.core.UnboundedVocabularyPredictor
KenLM predictor based on https://github.com/kpu/kenlm
The predictor state is described by the n-gram history.
Creates a new n-gram language model predictor.
Parameters: path (string) – Path to the ARPA language model file
Raises: NameError if KenLM is not installed
cam.sgnmt.predictors.parse module¶
-
class
cam.sgnmt.predictors.parse.
BpeParsePredictor
(grammar_path, bpe_rule_path, slave_predictor, word_out=True, normalize_scores=True, norm_alpha=1.0, beam_size=1, max_internal_len=35, allow_early_eos=False, consume_out_of_class=False, eow_ids=None, terminal_restrict=True, terminal_ids=None, internal_only_restrict=False)[source]¶ Bases:
cam.sgnmt.predictors.parse.TokParsePredictor
Predict over a BPE-based grammar with two possible grammar constraints: one between non-terminals and bpe start-of-word tokens, one over bpe tokens in a word
Creates a new parse predictor wrapper which can be constrained to 2 grammars: one over non-terminals / terminals, one internally to constrain BPE units within a single word
Parameters: - grammar_path (string) – Path to the grammar file
- bpe_rule_path (string) – Path to file defining rules between BPEs
- slave_predictor – predictor to wrap
- word_out (bool) – since this wrapper can be used for grammar constraint, this bool determines whether we also do internal beam search over non-terminals
- normalize_scores (bool) – true if normalizing scores, e.g. if some are removed from the posterior
- norm_alpha (float) – may be used for path weight normalization
- beam_size (int) – beam size for internal beam search
- max_internal_len (int) – max number of consecutive nonterminals before path is ignored by internal search
- allow_early_eos (bool) – true if permitting EOS consumed even if it is not permitted by the grammar at that point
- consume_out_of_class (bool) – true if permitting any tokens to be consumed even if not allowed by the grammar at that point
- eow_ids (string) – path to file containing ids of BPEs that mark the end of a word
- terminal_restrict (bool) – true if applying grammar constraint over nonterminals and terminals
- terminal_ids (string) – path to file containing all terminal ids
- internal_only_restrict (bool) – true if applying grammar constraint over BPE units inside words
-
class
cam.sgnmt.predictors.parse.
InternalHypo
(score, token_score, predictor_state, word_to_consume)[source]¶ Bases:
object
Helper class for internal parse predictor beam search over nonterminals
-
class
cam.sgnmt.predictors.parse.
ParsePredictor
(slave_predictor, normalize_scores=True, beam_size=4, max_internal_len=35, nonterminal_ids=None)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
Predictor wrapper allowing internal beam search over a representation which contains some pre-defined ‘non-terminal’ ids, which should not appear in the output.
Create a new parse wrapper for a predictor
Parameters: - slave_predictor – predictor to wrap with parse wrapper
- normalize_scores (bool) – whether to normalize posterior scores, e.g. after some tokens have been removed
- beam_size (int) – beam size for internal beam search over non-terminals
- max_internal_len (int) – number of consecutive non-terminal tokens allowed in internal search before path is ignored
- nonterminal_ids – file containing non-terminal ids, one per line
-
are_best_terminal
(posterior)[source]¶ Return true if most probable tokens in posterior are all terminals (including EOS)
-
find_word_beam
(posterior)[source]¶ Internal beam search over posterior until a beam of terminals is found
-
get_unk_probability
(posterior)[source]¶ Return unk probability as determined by slave predictor
Returns: float, unk prob
-
initialize
(src_sentence)[source]¶ Initializes slave predictor with source sentence
Parameters: src_sentence (list) –
-
class
cam.sgnmt.predictors.parse.
TokParsePredictor
(grammar_path, slave_predictor, word_out=True, normalize_scores=True, norm_alpha=1.0, beam_size=1, max_internal_len=35, allow_early_eos=False, consume_out_of_class=False)[source]¶ Bases:
cam.sgnmt.predictors.parse.ParsePredictor
Unlike ParsePredictor, this predictor predicts tokens according to a grammar. Use BpeParsePredictor if including rules to connect BPE units inside words.
Creates a new parse predictor wrapper.
Parameters: - grammar_path (string) – Path to the grammar file
- slave_predictor – predictor to wrap
- word_out (bool) – since this wrapper can be used for grammar constraint, this bool determines whether we also do internal beam search over non-terminals
- normalize_scores (bool) – true if normalizing scores, e.g. if some are removed from the posterior
- norm_alpha (float) – may be used for path weight normalization
- beam_size (int) – beam size for internal beam search
- max_internal_len (int) – max number of consecutive nonterminals before path is ignored by internal search
- allow_early_eos (bool) – true if permitting EOS consumed even if it is not permitted by the grammar at that point
- consume_out_of_class (bool) – true if permitting any tokens to be consumed even if not allowed by the grammar at that point
-
find_word
(posterior)[source]¶ Check whether the rhs of the best option in posterior is a terminal. If it is, return the posterior for decoding. If not, take the best result and follow that path until a word is found. This follows a greedy 1-best or a beam path through non-terminals.
-
find_word_beam
(posterior)[source]¶ Do an internal beam search over non-terminal functions to find the next best n terminal tokens, as ranked by normalized path score
Returns: posterior containing up to n terminal tokens and their normalized path score
cam.sgnmt.predictors.pytorch_fairseq module¶
This is the interface to the fairseq library.
https://github.com/pytorch/fairseq
The fairseq predictor can read any model trained with fairseq.
-
cam.sgnmt.predictors.pytorch_fairseq.
FAIRSEQ_INITIALIZED
= False¶ Set to true by _initialize_fairseq() after first constructor call.
-
class
cam.sgnmt.predictors.pytorch_fairseq.
FairseqPredictor
(model_path, user_dir, lang_pair, n_cpu_threads=-1)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
Predictor for using fairseq models.
Initializes a fairseq predictor.
Parameters: - model_path (string) – Path to the fairseq model (*.pt). Like --path in fairseq-interactive.
- lang_pair (string) – Language pair string (e.g. ‘en-fr’).
- user_dir (string) – Path to fairseq user directory.
- n_cpu_threads (int) – Number of CPU threads. If negative, use GPU.
cam.sgnmt.predictors.structure module¶
This module implements constraints which assure that highly structured output is well-formatted. For example, the bracket predictor checks for balanced bracket expressions, and the OSM predictor prevents any sequence of operations which cannot be compiled to a string.
-
class
cam.sgnmt.predictors.structure.
BracketPredictor
(max_terminal_id, closing_bracket_id, max_depth=-1, extlength_path='')[source]¶ Bases:
cam.sgnmt.predictors.core.UnboundedVocabularyPredictor
This predictor constrains the output to well-formed bracket expressions. It also allows specifying the number of terminals with an external length distribution file.
Creates a new bracket predictor.
Parameters: - max_terminal_id (int) – All IDs greater than this are brackets
- closing_bracket_id (string) – All brackets except these ones are opening. Comma-separated list of integers.
- max_depth (int) – If positive, restrict the maximum depth
- extlength_path (string) – If this is set, restrict the number of terminals to the distribution specified in the referenced file. Terminals can be implicit: We count a single terminal between each adjacent opening and closing bracket.
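The well-formedness constraint can be sketched as a depth counter over the output sequence. This is a simplified check on complete sequences, not the predictor's incremental logic. The function name is hypothetical, closing bracket IDs are assumed to be given as a set of integers, and the external length distribution is ignored.

```python
def is_well_formed(seq, max_terminal_id, closing_bracket_ids, max_depth=-1):
    """Returns True if seq is a balanced bracket expression."""
    depth = 0
    for tok in seq:
        if tok <= max_terminal_id:
            continue                      # terminal: depth unchanged
        if tok in closing_bracket_ids:
            depth -= 1
            if depth < 0:                 # closing without matching opening
                return False
        else:
            depth += 1                    # every other bracket is opening
            if max_depth > 0 and depth > max_depth:
                return False
    return depth == 0
```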
-
initialize
(src_sentence)[source]¶ Sets the current depth to 0.
Parameters: src_sentence (list) – Not used
-
class
cam.sgnmt.predictors.structure.
ForcedOSMPredictor
(trg_wmap, trg_test_file)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor allows forced decoding with an OSM output, which essentially means running the OSM in alignment mode. This predictor assumes well-formed operation sequences. Please combine this predictor with the osm constraint predictor to satisfy this requirement. The state of this predictor is the compiled version of the current history. It allows terminal symbols which are consistent with the reference. The end-of-sentence symbol is suppressed until all words in the reference have been consumed.
Creates a new forcedosm predictor.
Parameters: - trg_wmap (string) – Path to the target wmap file. Used to grab OSM operation IDs.
- trg_test_file (string) – Path to the plain text file with the target sentences. Must have the same number of lines as the number of source sentences to decode
-
class
cam.sgnmt.predictors.structure.
OSMPredictor
(src_wmap, trg_wmap, use_jumps=True, use_auto_pop=False, use_unpop=False, use_pop2=False, use_src_eop=False, use_copy=False)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor applies the following constraints to an OSM output:
- The number of POP tokens must be equal to the number of source tokens
- JUMP_FWD and JUMP_BWD tokens are constrained to avoid jumping out of bounds.
The predictor supports the original OSNMT operation set (default) plus a number of variations that are set by the use_* arguments in the constructor.
Creates a new osm predictor.
Parameters: - src_wmap (string) – Path to the source wmap. Used to grab EOP id.
- trg_wmap (string) – Path to the target wmap. Used to update IDs of operations.
- use_jumps (bool) – If true, use SET_MARKER, JUMP_FWD and JUMP_BWD operations
- use_auto_pop (bool) – If true, each word insertion automatically moves read head
- use_unpop (bool) – If true, use SRC_UNPOP to move read head to the left.
- use_pop2 (bool) – If true, use two read heads to align phrases
- use_src_eop (bool) – If true, expect EOP tokens in the src sentence
- use_copy (bool) – If true, move read head at COPY operations
-
cam.sgnmt.predictors.structure.
load_external_lengths
(path)[source]¶ Loads a length distribution from a plain text file. The file must contain blank separated <length>:<score> pairs in each line.
Parameters: path (string) – Path to the length file. Returns: list of dicts mapping a length to its scores, one dict for each sentence.
cam.sgnmt.predictors.tf_nizza module¶
This module integrates Nizza alignment models.
https://github.com/fstahlberg/nizza
-
class
cam.sgnmt.predictors.tf_nizza.
BaseNizzaPredictor
(src_vocab_size, trg_vocab_size, model_name, hparams_set_name, checkpoint_dir, single_cpu_thread, nizza_unk_id=None)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
Common functionality for Nizza based predictors. This includes loading checkpoints, creating sessions, and creating computation graphs.
Initializes a nizza predictor.
Parameters: - src_vocab_size (int) – Source vocabulary size (called inputs_vocab_size in nizza)
- trg_vocab_size (int) – Target vocabulary size (called targets_vocab_size in nizza)
- model_name (string) – Name of the nizza model
- hparams_set_name (string) – Name of the nizza hyper-parameter set
- checkpoint_dir (string) – Path to the Nizza checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
- single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
- nizza_unk_id (int) – If set, use this as UNK id. Otherwise, the nizza is assumed to have no UNKs
Raises: IOError if checkpoint file not found.
-
class
cam.sgnmt.predictors.tf_nizza.
LexNizzaPredictor
(src_vocab_size, trg_vocab_size, model_name, hparams_set_name, checkpoint_dir, single_cpu_thread, alpha, beta, shortlist_strategies, trg2src_model_name='', trg2src_hparams_set_name='', trg2src_checkpoint_dir='', max_shortlist_length=0, min_id=0, nizza_unk_id=None)[source]¶ Bases:
cam.sgnmt.predictors.tf_nizza.BaseNizzaPredictor
This predictor is only compatible with Model1-like Nizza models which return lexical translation probabilities in precompute(). The predictor keeps a list of the same length as the source sentence and initializes it with zeros. At each timestep it updates this list by the lexical scores Model1 assigned to the last consumed token. The predictor score aims to bring up all entries in the list, and thus serves as a coverage mechanism over the source sentence.
Initializes a nizza predictor.
Parameters: - src_vocab_size (int) – Source vocabulary size (called inputs_vocab_size in nizza)
- trg_vocab_size (int) – Target vocabulary size (called targets_vocab_size in nizza)
- model_name (string) – Name of the nizza model
- hparams_set_name (string) – Name of the nizza hyper-parameter set
- checkpoint_dir (string) – Path to the Nizza checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
- single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
- alpha (float) – Score for each matching word
- beta (float) – Penalty for each uncovered word at the end
- shortlist_strategies (string) – Comma-separated list of shortlist strategies.
- trg2src_model_name (string) – Name of the target2source nizza model
- trg2src_hparams_set_name (string) – Name of the nizza hyper-parameter set for the target2source model
- trg2src_checkpoint_dir (string) – Path to the Nizza checkpoint directory for the target2source model. The predictor will load the top most checkpoint in the checkpoints file.
- max_shortlist_length (int) – If a shortlist exceeds this limit, initialize the initial coverage with 1 at this position. If zero, do not apply any limit
- min_id (int) – Do not use IDs below this threshold (filters out most frequent words).
- nizza_unk_id (int) – If set, use this as UNK id. Otherwise, the nizza is assumed to have no UNKs
Raises: IOError if checkpoint file not found.
-
class
cam.sgnmt.predictors.tf_nizza.
NizzaPredictor
(src_vocab_size, trg_vocab_size, model_name, hparams_set_name, checkpoint_dir, single_cpu_thread, nizza_unk_id=None)[source]¶ Bases:
cam.sgnmt.predictors.tf_nizza.BaseNizzaPredictor
This predictor uses Nizza alignment models to derive a posterior over the target vocabulary for the next position. It mainly relies on the predict_next_word() implementation of Nizza models.
Initializes a nizza predictor.
Parameters: - src_vocab_size (int) – Source vocabulary size (called inputs_vocab_size in nizza)
- trg_vocab_size (int) – Target vocabulary size (called targets_vocab_size in nizza)
- model_name (string) – Name of the nizza model
- hparams_set_name (string) – Name of the nizza hyper-parameter set
- checkpoint_dir (string) – Path to the Nizza checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
- single_cpu_thread (bool) – If true, prevent tensorflow from doing multithreading.
- nizza_unk_id (int) – If set, use this as UNK id. Otherwise, the nizza is assumed to have no UNKs
Raises: IOError if checkpoint file not found.
cam.sgnmt.predictors.tf_t2t module¶
This is the interface to the tensor2tensor library.
https://github.com/tensorflow/tensor2tensor
The t2t predictor can read any model trained with tensor2tensor which includes the transformer model, convolutional models, and RNN-based sequence models.
-
class
cam.sgnmt.predictors.tf_t2t.
EditT2TPredictor
(src_vocab_size, trg_vocab_size, model_name, problem_name, hparams_set_name, trg_test_file, beam_size, t2t_usr_dir, checkpoint_dir, t2t_unk_id=None, n_cpu_threads=-1, max_terminal_id=-1, pop_id=-1)[source]¶ Bases:
cam.sgnmt.predictors.tf_t2t._BaseTensor2TensorPredictor
This predictor can be used for T2T models conditioning on the full target sentence. The predictor state is a full target sentence. The state can be changed by insertions, substitutions, and deletions of single tokens, where each operation is encoded as an SGNMT token in the following way:
1xxxyyyyy: Insert the token yyyyy at position xxx.
2xxxyyyyy: Replace the xxx-th word with the token yyyyy.
3xxx00000: Delete the xxx-th token.
Creates a new edit T2T predictor. This constructor is similar to the constructor of T2TPredictor but creates a different computation graph which retrieves scores at each target position, not only the last one.
Parameters: - src_vocab_size (int) – Source vocabulary size.
- trg_vocab_size (int) – Target vocabulary size.
- model_name (string) – T2T model name.
- problem_name (string) – T2T problem name.
- hparams_set_name (string) – T2T hparams set name.
- trg_test_file (string) – Path to a plain text file with initial target sentences. Can be empty.
- beam_size (int) – Determines how many substitutions and insertions are considered at each position.
- t2t_usr_dir (string) – See –t2t_usr_dir in tensor2tensor.
- checkpoint_dir (string) – Path to the T2T checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
- t2t_unk_id (int) – If set, use this ID to get UNK scores. If None, UNK is always scored with -inf.
- n_cpu_threads (int) – Number of TensorFlow CPU threads.
- max_terminal_id (int) – If positive, maximum terminal ID. Needs to be set for syntax-based T2T models.
- pop_id (int) – If positive, ID of the POP or closing bracket symbol. Needs to be set for syntax-based T2T models.
-
DEL_OFFSET
= 300000000¶
-
INS_OFFSET
= 100000000¶
-
MAX_SEQ_LEN
= 999¶
-
POS_FACTOR
= 100000¶
-
SUB_OFFSET
= 200000000¶
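The token encoding described above can be illustrated with plain integer arithmetic. The following is a minimal sketch, not the SGNMT implementation: the constants are copied from this class, while the helper names `encode_edit` and `decode_edit` are hypothetical and exist only for this example.

```python
# Constants as documented for EditT2TPredictor.
INS_OFFSET = 100000000
SUB_OFFSET = 200000000
DEL_OFFSET = 300000000
POS_FACTOR = 100000
MAX_SEQ_LEN = 999


def encode_edit(op_offset, pos, token=0):
    """Pack an edit operation into one integer (hypothetical helper)."""
    if not 0 <= pos <= MAX_SEQ_LEN:
        raise ValueError("position does not fit into three digits")
    if not 0 <= token < POS_FACTOR:
        raise ValueError("token ID does not fit into five digits")
    return op_offset + pos * POS_FACTOR + token


def decode_edit(code):
    """Split an encoded edit into (op_offset, pos, token)."""
    op_offset = (code // INS_OFFSET) * INS_OFFSET  # leading digit
    pos = (code % INS_OFFSET) // POS_FACTOR        # digits xxx
    token = code % POS_FACTOR                      # digits yyyyy
    return op_offset, pos, token


ins = encode_edit(INS_OFFSET, 7, 1234)   # insert token 1234 at position 7
sub = encode_edit(SUB_OFFSET, 12, 42)    # replace the 12th word with token 42
dele = encode_edit(DEL_OFFSET, 3)        # delete the 3rd token
print(ins, sub, dele)
print(decode_edit(sub))
```

Because the position occupies exactly three decimal digits, positions above MAX_SEQ_LEN cannot be represented, and token IDs must stay below POS_FACTOR.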
-
class
cam.sgnmt.predictors.tf_t2t.
FertilityT2TPredictor
(src_vocab_size, trg_vocab_size, model_name, problem_name, hparams_set_name, t2t_usr_dir, checkpoint_dir, t2t_unk_id=None, n_cpu_threads=-1, max_terminal_id=-1, pop_id=-1)[source]¶ Bases:
cam.sgnmt.predictors.tf_t2t.T2TPredictor
Use this predictor to integrate fertility models trained with T2T. Fertility models output the fertility for each source word instead of target words. We define the fertility of the i-th source word in a hypothesis as the number of tokens between the (i-1)-th and the i-th POP token.
TODO: This is not SOLID (violates substitution principle)
Creates a new T2T predictor. The constructor prepares the TensorFlow session for predict_next() calls. This includes:
- Load hyper parameters from the given set (hparams)
- Update registry, load T2T model
- Create TF placeholders for source sequence and target prefix
- Create computation graph for computing log probs.
- Create a MonitoredSession object, which also handles restoring checkpoints.
Parameters: - src_vocab_size (int) – Source vocabulary size.
- trg_vocab_size (int) – Target vocabulary size.
- model_name (string) – T2T model name.
- problem_name (string) – T2T problem name.
- hparams_set_name (string) – T2T hparams set name.
- t2t_usr_dir (string) – See –t2t_usr_dir in tensor2tensor.
- checkpoint_dir (string) – Path to the T2T checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
- t2t_unk_id (int) – If set, use this ID to get UNK scores. If None, UNK is always scored with -inf.
- n_cpu_threads (int) – Number of TensorFlow CPU threads.
- max_terminal_id (int) – If positive, maximum terminal ID. Needs to be set for syntax-based T2T models.
- pop_id (int) – If positive, ID of the POP or closing bracket symbol. Needs to be set for syntax-based T2T models.
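The fertility definition above (number of tokens between the (i-1)-th and the i-th POP token) can be sketched in a few lines. This is an illustration under stated assumptions, not code from SGNMT: the function name `fertilities` and the POP ID used in the example are hypothetical.

```python
def fertilities(hypothesis, pop_id):
    """Count the tokens between consecutive POP tokens.

    Entry i of the result is the fertility of the i-th source word.
    Tokens after the last POP are not yet assigned to a source word.
    """
    counts = []
    current = 0
    for token in hypothesis:
        if token == pop_id:
            counts.append(current)  # close the span of one source word
            current = 0
        else:
            current += 1
    return counts


POP_ID = 4  # hypothetical ID of the POP symbol
fert = fertilities([10, 11, POP_ID, POP_ID, 12, POP_ID, 13], POP_ID)
print(fert)
```

In this example the first source word has fertility 2, the second 0, and the third 1; the trailing token 13 belongs to a span that has not been closed yet.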
-
cam.sgnmt.predictors.tf_t2t.
POP
= '##POP##'¶ Textual representation of the POP symbol.
-
class
cam.sgnmt.predictors.tf_t2t.
SegT2TPredictor
(src_vocab_size, trg_vocab_size, model_name, problem_name, hparams_set_name, t2t_usr_dir, checkpoint_dir, t2t_unk_id=None, n_cpu_threads=-1, max_terminal_id=-1, pop_id=-1)[source]¶ Bases:
cam.sgnmt.predictors.tf_t2t._BaseTensor2TensorPredictor
This predictor is designed for document-level T2T models. It differs from the normal t2t predictor in the following ways:
- In addition to inputs and targets, it generates the features inputs_seg, targets_seg, inputs_pos, and targets_pos, which are used in glue models and the contextual Transformer.
- The history is pruned when it exceeds a maximum number of <s>
symbols. This can be used to reduce complexity for document-level
models on very long documents. When the maximum number is reached,
we start removing sentences from
self.consumed
, starting with the sentence which is begin_margin sentences away from the document start and end_margin sentences away from the current sentence.
Creates a new document-level T2T predictor. See T2TPredictor.__init__().
Parameters: - src_vocab_size (int) – Source vocabulary size.
- trg_vocab_size (int) – Target vocabulary size.
- model_name (string) – T2T model name.
- problem_name (string) – T2T problem name.
- hparams_set_name (string) – T2T hparams set name.
- t2t_usr_dir (string) – See –t2t_usr_dir in tensor2tensor.
- checkpoint_dir (string) – Path to the T2T checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
- t2t_unk_id (int) – If set, use this ID to get UNK scores. If None, UNK is always scored with -inf.
- n_cpu_threads (int) – Number of TensorFlow CPU threads.
- max_terminal_id (int) – If positive, maximum terminal ID. Needs to be set for syntax-based T2T models.
- pop_id (int) – If positive, ID of the POP or closing bracket symbol. Needs to be set for syntax-based T2T models.
-
class
cam.sgnmt.predictors.tf_t2t.
T2TPredictor
(src_vocab_size, trg_vocab_size, model_name, problem_name, hparams_set_name, t2t_usr_dir, checkpoint_dir, t2t_unk_id=None, n_cpu_threads=-1, max_terminal_id=-1, pop_id=-1)[source]¶ Bases:
cam.sgnmt.predictors.tf_t2t._BaseTensor2TensorPredictor
This predictor implements scoring with Tensor2Tensor models. We follow the decoder implementation in T2T and do not reuse network states in decoding. Instead, we compute the full forward pass along the current history. Therefore, the decoder state is simply the full history of consumed words.
Creates a new T2T predictor. The constructor prepares the TensorFlow session for predict_next() calls. This includes:
- Load hyper parameters from the given set (hparams)
- Update registry, load T2T model
- Create TF placeholders for source sequence and target prefix
- Create computation graph for computing log probs.
- Create a MonitoredSession object, which also handles restoring checkpoints.
Parameters: - src_vocab_size (int) – Source vocabulary size.
- trg_vocab_size (int) – Target vocabulary size.
- model_name (string) – T2T model name.
- problem_name (string) – T2T problem name.
- hparams_set_name (string) – T2T hparams set name.
- t2t_usr_dir (string) – See –t2t_usr_dir in tensor2tensor.
- checkpoint_dir (string) – Path to the T2T checkpoint directory. The predictor will load the top most checkpoint in the checkpoints file.
- t2t_unk_id (int) – If set, use this ID to get UNK scores. If None, UNK is always scored with -inf.
- n_cpu_threads (int) – Number of TensorFlow CPU threads.
- max_terminal_id (int) – If positive, maximum terminal ID. Needs to be set for syntax-based T2T models.
- pop_id (int) – If positive, ID of the POP or closing bracket symbol. Needs to be set for syntax-based T2T models.
-
cam.sgnmt.predictors.tf_t2t.
T2T_INITIALIZED
= False¶ Set to true by _initialize_t2t() after first constructor call.
-
cam.sgnmt.predictors.tf_t2t.
expand_input_dims_for_t2t
(t, batched=False)[source]¶ Expands a plain input tensor for using it in a T2T graph.
Parameters: - t – Tensor
- batched – Whether to expand on the left side
Returns: Tensor t expanded by 1 dimension on the left and two dimensions on the right.
-
cam.sgnmt.predictors.tf_t2t.
gather_2d
(params, indices)[source]¶ This is a batched version of tf.gather(), i.e. it applies tf.gather() to each batch entry separately.
Example
params = [[10, 11, 12, 13, 14],
          [20, 21, 22, 23, 24]]
indices = [[0, 0, 1, 1, 1, 2],
           [1, 3, 0, 0, 2, 2]]
result = [[10, 10, 11, 11, 11, 12],
          [21, 23, 20, 20, 22, 22]]
Parameters: - params – A [batch_size, n, ...] tensor with data
- indices – A [batch_size, num_indices] int32 tensor with indices into params. Entries must be smaller than n
Returns: The result of tf.gather() on each entry of the batch.
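The per-batch gather semantics can be reproduced with nested Python lists. The following reference sketch does not use TensorFlow and is not the SGNMT implementation; the name `gather_2d_reference` is hypothetical. It only serves to check the example from the docstring by hand.

```python
def gather_2d_reference(params, indices):
    """Apply a gather to each batch entry (row) separately."""
    return [[row[i] for i in row_indices]
            for row, row_indices in zip(params, indices)]


params = [[10, 11, 12, 13, 14],
          [20, 21, 22, 23, 24]]
indices = [[0, 0, 1, 1, 1, 2],
           [1, 3, 0, 0, 2, 2]]
result = gather_2d_reference(params, indices)
print(result)
```

Each row of `indices` only selects from the row of `params` with the same batch index, which is the difference to a plain gather over the first axis.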
cam.sgnmt.predictors.tokenization module¶
This module contains wrapper predictors which support decoding with
diverse tokenization. The Word2charPredictor
can be used if the
decoder operates on fine-grained tokens such as characters, but the
tokenization of a predictor is coarse-grained (e.g. words or subwords).
The word2char
predictor maintains an explicit list of word boundary
characters and applies consume and predict_next whenever a word boundary
character is consumed.
The fsttok
predictor also masks coarse grained predictors when SGNMT
uses fine-grained tokens such as characters. This wrapper loads an FST
which transduces character to predictor-unit sequences.
-
class
cam.sgnmt.predictors.tokenization.
CombinedState
(fst_node, pred_state, posterior, unconsumed=[], pending_score=0.0)[source]¶ Bases:
object
Combines an FST state with predictor state. Used by the fsttok predictor.
-
consume_all
(predictor)[source]¶ Consume all unconsumed tokens and update pred_state, pending_score, and posterior accordingly.
Parameters: predictor (Predictor) – Predictor instance
-
consume_single
(predictor)[source]¶ Consume a single token in
self.unconsumed
.
Parameters: predictor (Predictor) – Predictor instance
-
score
(token, predictor)[source]¶ Returns a score which can be added if
token
is consumed next. This is not necessarily the full score but an upper bound on it: Continuations will have a score lower than or equal to this. We only use the current posterior vector and do not consume tokens with the wrapped predictor.
-
traverse_fst
(trans_fst, char)[source]¶ Returns a list of
CombinedState
objects with the same predictor state and posterior, but an
fst_node
which is reachable via the input label
char
. If the output tape contains symbols, add them to
unconsumed
.
Parameters: - trans_fst (Fst) – FST to traverse
- char (int) – Index of character
Returns: list. List of combined states reachable via
char
-
-
cam.sgnmt.predictors.tokenization.
EPS_ID
= 0¶ OpenFST’s reserved ID for epsilon arcs.
-
class
cam.sgnmt.predictors.tokenization.
FSTTokPredictor
(path, fst_unk_id, max_pending_score, slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This wrapper can be used if the SGNMT decoder operates on the character level, but a predictor uses a more coarse grained tokenization. The mapping is defined by an FST which transduces character to predictor unit sequences. This wrapper maintains a list of
CombinedState
objects, which are tuples of an FST node and a predictor state for which the following holds:
- The input labels on the path to the node are consistent with the consumed characters
- The output labels on the path to the node are consistent with the predictor states
Constructor for the fsttok wrapper
Parameters: - path (string) – Path to an FST which transduces characters to predictor tokens
- fst_unk_id (int) – ID used to represent UNK in the FSTs (usually 999999998)
- max_pending_score (float) – Maximum pending score in a
CombinedState
instance.
- slave_predictor (Predictor) – Wrapped predictor
-
consume
(word)[source]¶ Update
self.states
to be consistent with
word
and consume all the predictor tokens.
-
get_unk_probability
(posterior)[source]¶ Always returns negative infinity. Handling UNKs needs to be realized by the FST.
-
initialize
(src_sentence)[source]¶ Pass through to slave predictor. The source sentence is not modified.
states
is updated to the initial FST node and predictor posterior and state.
-
initialize_heuristic
(src_sentence)[source]¶ Pass through to slave predictor. The source sentence is not modified
-
class
cam.sgnmt.predictors.tokenization.
Word2charPredictor
(map_path, slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.core.UnboundedVocabularyPredictor
This predictor wraps word level predictors when SGNMT is running on the character level. The mapping between word ID and character ID sequence is loaded from the file system. All characters which do not appear in that mapping are treated as word boundary markers. The wrapper blocks consume and predict_next calls until a word boundary marker is consumed, and updates the slave predictor according to the word between the last two word boundaries. The mapping is done only on the target side, and the source sentences are passed through as they are. To use alternative tokenization on the source side, see the altsrc predictor wrapper. The word2char wrapper is always an
UnboundedVocabularyPredictor
.
Creates a new word2char wrapper predictor. The map_path file has to be a plain text file, with each line containing the mapping from a word index to the character index sequence (format: word char1 char2 ... charn).
Parameters: - map_path (string) – Path to the mapping file
- slave_predictor (Predictor) – Instance of the predictor with a different wmap than SGNMT
-
consume
(word)[source]¶ If
word
is a word boundary marker, truncate
word_stub
and let the slave predictor consume word_stub. Otherwise, extend
word_stub
by the character.
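The word stub bookkeeping can be sketched with a toy class. This is an illustration under several assumptions and not the SGNMT class: the name `Word2charSketch`, the callback `consume_word` (standing in for the slave predictor's consume), the default UNK word ID, and the choice to ignore empty stubs are all hypothetical.

```python
class Word2charSketch:
    """Toy version of the word stub logic, not the SGNMT implementation."""

    def __init__(self, map_lines, consume_word, unk_id=0):
        # Each line has the format: word char1 char2 ... charn (integer IDs).
        self.chars_to_word = {}
        self.known_chars = set()
        for line in map_lines:
            fields = [int(field) for field in line.split()]
            if len(fields) < 2:
                continue  # skip empty or incomplete lines
            self.chars_to_word[tuple(fields[1:])] = fields[0]
            self.known_chars.update(fields[1:])
        self.consume_word = consume_word  # stands in for the slave predictor
        self.unk_id = unk_id              # assumed word ID for unmapped stubs
        self.word_stub = []

    def is_boundary(self, char):
        # Characters absent from the mapping act as word boundary markers.
        return char not in self.known_chars

    def consume(self, char):
        if self.is_boundary(char):
            if self.word_stub:
                key = tuple(self.word_stub)
                self.consume_word(self.chars_to_word.get(key, self.unk_id))
            self.word_stub = []           # start collecting the next word
        else:
            self.word_stub.append(char)


consumed = []
wrapper = Word2charSketch(["100 5 6 7", "101 6 5"], consumed.append)
for char in [5, 6, 7, 99, 6, 5, 99]:      # 99 is not in the map: a boundary
    wrapper.consume(char)
print(consumed)
```

In this run the slave side only sees two word-level consume calls, one per boundary marker, although seven characters were consumed.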
-
get_unk_probability
(posterior)[source]¶ This is about the unknown character, not word. Since the word level slave predictor has no notion of the unknown character, we return NEG_INF unconditionally.
-
initialize
(src_sentence)[source]¶ Pass through to slave predictor. The source sentence is not modified
-
initialize_heuristic
(src_sentence)[source]¶ Pass through to slave predictor. The source sentence is not modified
cam.sgnmt.predictors.vocabulary module¶
Predictor wrappers in this module work with the vocabulary of the wrapped predictor. An example is the idxmap wrapper which makes it possible to use an alternative word map.
-
class
cam.sgnmt.predictors.vocabulary.
IdxmapPredictor
(src_idxmap_path, trgt_idxmap_path, slave_predictor, slave_weight)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This wrapper predictor can be applied to slave predictors which use different wmaps than SGNMT. It translates between SGNMT word indices and predictor indices each time the predictor is called. This mapping is transparent to both the decoder and the wrapped slave predictor.
Creates a new idxmap wrapper predictor. The index maps have to be plain text files, each line containing the mapping from a SGNMT word index to the slave predictor word index.
Parameters: - src_idxmap_path (string) – Path to the source index map
- trgt_idxmap_path (string) – Path to the target index map
- slave_predictor (Predictor) – Instance of the predictor with a different wmap than SGNMT
- slave_weight (float) – Slave predictor weight
-
get_unk_probability
(posterior)[source]¶ ATTENTION: We should translate the posterior array back to slave predictor indices. However, the unk_id is translated to the identical index, and others normally do not matter when computing the UNK probability. Therefore, we refrain from a complete conversion and pass through
posterior
without changing its word indices.
-
load_map
(path, name)[source]¶ Load an index map file. Mappings should be bijections, but there is no sanity check in place to verify this.
Parameters: - path (string) – Path to the mapping file
- name (string) – ‘source’ or ‘target’ for error messages
Returns: dict. Mapping from SGNMT index to slave predictor index
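Since no sanity check is in place, it can be useful to validate an index map before decoding. The sketch below assumes that each line holds two whitespace-separated integers (SGNMT index, then slave predictor index); the exact file layout is an assumption, and both function names are hypothetical.

```python
def load_map_sketch(lines, name):
    """Parse 'sgnmt_index slave_index' pairs into a dict."""
    mapping = {}
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2:
            raise ValueError("%s index map, line %d: expected two columns"
                             % (name, line_no))
        mapping[int(fields[0])] = int(fields[1])
    return mapping


def is_bijection(mapping):
    """True if no two SGNMT indices share a slave predictor index."""
    return len(set(mapping.values())) == len(mapping)


trg_map = load_map_sketch(["0 0", "1 5", "2 3"], "target")
print(trg_map, is_bijection(trg_map))
```

With a real file, the lines could be read with `open(path)` and passed to the same function.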
-
class
cam.sgnmt.predictors.vocabulary.
MaskvocabPredictor
(vocab_spec, slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This wrapper predictor hides certain words in the SGNMT vocabulary from the predictor. Those words are scored by the masked predictor with zero. The wrapper passes through consume() only for other words.
Creates a new maskvocab wrapper predictor.
Parameters: - vocab_spec (string) – Vocabulary specification (see VocabSpec)
- slave_predictor (Predictor) – Instance of the predictor with a different wmap than SGNMT
-
class
cam.sgnmt.predictors.vocabulary.
SkipvocabInternalHypothesis
(score, predictor_state, word_to_consume)[source]¶ Bases:
object
Helper class for internal beam search in skipvocab.
-
class
cam.sgnmt.predictors.vocabulary.
SkipvocabPredictor
(vocab_spec, stop_size, beam, slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
This predictor wrapper masks predictors with a larger vocabulary than the SGNMT vocabulary. The SGNMT OOV words are not scored with UNK scores from the other predictors as usual, but are hidden by this wrapper. Therefore, this wrapper does not produce any word from the larger vocabulary, but searches internally until enough in-vocabulary word scores are collected from the wrapped predictor.
Creates a new skipvocab wrapper predictor.
Parameters: - vocab_spec (string) – Vocabulary specification (see VocabSpec)
- stop_size (int) – Stop internal beam search when the best stop_size words are in-vocabulary
- beam (int) – Beam size of internal beam search
- slave_predictor (Predictor) – Wrapped predictor.
-
predict_next
()[source]¶ This method first performs beam search internally to update the slave predictor state to a point where the best stop_size entries in the predict_next() return value are in-vocabulary (bounded by max_id). Then, it returns the slave posterior in that state.
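The stopping condition of the internal search can be sketched as a check on the best entries of a posterior. This is a simplified illustration, not the SGNMT code: the function name is hypothetical, the posterior is assumed to be a dict from word ID to log score, and the vocabulary test is passed in as a predicate instead of being derived from the vocabulary specification.

```python
import heapq


def best_entries_in_vocab(posterior, stop_size, is_in_vocab):
    """True if the stop_size best-scoring words are all in-vocabulary."""
    best = heapq.nlargest(stop_size, posterior.items(),
                          key=lambda item: item[1])
    return all(is_in_vocab(word) for word, _ in best)


MAX_ID = 100  # hypothetical bound of the SGNMT vocabulary


def in_vocab(word):
    return word <= MAX_ID


posterior = {1: -0.1, 2: -0.5, 900: -0.7, 3: -2.0}
print(best_entries_in_vocab(posterior, 2, in_vocab))
print(best_entries_in_vocab(posterior, 3, in_vocab))
```

With stop_size 2 the check succeeds because words 1 and 2 score best; with stop_size 3 the out-of-vocabulary word 900 is among the best entries, so the internal search would have to continue.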
-
class
cam.sgnmt.predictors.vocabulary.
UnboundedIdxmapPredictor
(src_idxmap_path, trgt_idxmap_path, slave_predictor, slave_weight)[source]¶ Bases:
cam.sgnmt.predictors.vocabulary.IdxmapPredictor
,cam.sgnmt.predictors.core.UnboundedVocabularyPredictor
This class is a version of
IdxmapPredictor
for unbounded vocabulary predictors. This needs an adjusted
predict_next
method to pass through the set of target words to score correctly.
Pass through to
IdxmapPredictor.__init__
-
class
cam.sgnmt.predictors.vocabulary.
UnboundedMaskvocabPredictor
(vocab_spec, slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.vocabulary.MaskvocabPredictor
,cam.sgnmt.predictors.core.UnboundedVocabularyPredictor
This class is a version of
MaskvocabPredictor
for unbounded vocabulary predictors. This needs an adjusted
predict_next
method to pass through the set of target words to score correctly.
Creates a new maskvocab wrapper predictor.
Parameters: - vocab_spec (string) – Vocabulary specification (see VocabSpec)
- slave_predictor (Predictor) – Instance of the predictor with a different wmap than SGNMT
-
class
cam.sgnmt.predictors.vocabulary.
UnkvocabPredictor
(trg_vocab_size, slave_predictor)[source]¶ Bases:
cam.sgnmt.predictors.core.Predictor
If the predictor wrapped by the unkvocab wrapper produces an UNK with predict_next, this wrapper adds explicit NEG_INF scores to all in-vocabulary words not in its posterior. This can control which words are matched by the UNK scores of other predictors.
Creates a new unkvocab wrapper predictor.
Parameters: trg_vocab_size (int) – Size of the target vocabulary -
predict_next
()[source]¶ Pass through to slave predictor. If the posterior from the slave predictor contains util.UNK_ID, add NEG_INF for all word ids lower than trg_vocab_size that are not already defined
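The effect on the posterior can be sketched with a plain dictionary. This is a minimal illustration, not the SGNMT code: the posterior is assumed to be a dict from word ID to log score, the value of `UNK_ID` is a placeholder for util.UNK_ID, and the function name is hypothetical.

```python
NEG_INF = float("-inf")
UNK_ID = 0  # placeholder value; SGNMT takes this from util.UNK_ID


def add_explicit_neg_inf(posterior, trg_vocab_size):
    """Add NEG_INF for undefined in-vocabulary words if UNK is present."""
    if UNK_ID not in posterior:
        return posterior                    # nothing to do without an UNK
    filled = dict(posterior)
    for word_id in range(trg_vocab_size):
        filled.setdefault(word_id, NEG_INF)  # keep scores already defined
    return filled


with_unk = add_explicit_neg_inf({UNK_ID: -2.0, 3: -0.5}, 5)
without_unk = add_explicit_neg_inf({3: -0.5}, 5)
print(with_unk)
print(without_unk)
```

After this step, words 1, 2, and 4 carry an explicit score, so they can no longer be matched by the UNK score when predictors are combined.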
-
-
class
cam.sgnmt.predictors.vocabulary.
VocabSpec
(spec_str)[source]¶ Bases:
object
Helper class for maskvocab and skipvocab predictors.
Takes a string that specifies a vocabulary. Examples:
- ‘10,11,12’: The tokens 10, 11, and 12
- ‘>55’: All token IDs larger than 55
- ‘<33,99’: All token IDs less than 33 and the token 99.
Parameters: spec_str (string) – String specification of the vocabulary
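The specification format can be illustrated with a small parser. This sketch only covers the three forms shown in the examples and is not the SGNMT class; the class name `VocabSpecSketch`, its attributes, and the `contains` method are hypothetical.

```python
class VocabSpecSketch:
    """Parses specs such as '10,11,12', '>55', or '<33,99'."""

    def __init__(self, spec_str):
        self.upper = None  # IDs strictly below this are included ('<N')
        self.lower = None  # IDs strictly above this are included ('>N')
        self.ids = set()   # explicitly listed token IDs
        for part in spec_str.split(","):
            part = part.strip()
            if not part:
                continue
            if part.startswith("<"):
                self.upper = int(part[1:])
            elif part.startswith(">"):
                self.lower = int(part[1:])
            else:
                self.ids.add(int(part))

    def contains(self, token):
        if self.upper is not None and token < self.upper:
            return True
        if self.lower is not None and token > self.lower:
            return True
        return token in self.ids


spec = VocabSpecSketch("<33,99")
print(spec.contains(32), spec.contains(33), spec.contains(99))
```

The comparison operators are strict, matching the wording "larger than" and "less than" in the examples.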
Module contents¶
Predictors are the scoring modules used in SGNMT. They can be used
together to form a combined search space and scores. Note that the
configuration of predictors is not decoupled from the central
configuration (yet). Therefore, new predictors need to be referenced
in blocks.decode
, and their configuration parameters need to be
added to blocks.ui
.