cam.sgnmt package¶
Subpackages¶
- cam.sgnmt.decoding package
- Submodules
- cam.sgnmt.decoding.astar module
- cam.sgnmt.decoding.beam module
- cam.sgnmt.decoding.bigramgreedy module
- cam.sgnmt.decoding.bucket module
- cam.sgnmt.decoding.combibeam module
- cam.sgnmt.decoding.combination module
- cam.sgnmt.decoding.core module
- cam.sgnmt.decoding.dfs module
- cam.sgnmt.decoding.flip module
- cam.sgnmt.decoding.fstbeam module
- cam.sgnmt.decoding.greedy module
- cam.sgnmt.decoding.heuristics module
- cam.sgnmt.decoding.interpolation module
- cam.sgnmt.decoding.lenbeam module
- cam.sgnmt.decoding.mbrbeam module
- cam.sgnmt.decoding.multisegbeam module
- cam.sgnmt.decoding.predlimitbeam module
- cam.sgnmt.decoding.restarting module
- cam.sgnmt.decoding.sepbeam module
- cam.sgnmt.decoding.syncbeam module
- cam.sgnmt.decoding.syntaxbeam module
- Module contents
 
- cam.sgnmt.misc package
- cam.sgnmt.predictors package
- Submodules
- cam.sgnmt.predictors.automata module
- cam.sgnmt.predictors.bow module
- cam.sgnmt.predictors.core module
- cam.sgnmt.predictors.forced module
- cam.sgnmt.predictors.grammar module
- cam.sgnmt.predictors.length module
- cam.sgnmt.predictors.misc module
- cam.sgnmt.predictors.ngram module
- cam.sgnmt.predictors.parse module
- cam.sgnmt.predictors.pytorch_fairseq module
- cam.sgnmt.predictors.structure module
- cam.sgnmt.predictors.tf_nizza module
- cam.sgnmt.predictors.tf_t2t module
- cam.sgnmt.predictors.tokenization module
- cam.sgnmt.predictors.vocabulary module
- Module contents
 
Submodules¶
cam.sgnmt.decode module¶
cam.sgnmt.decode_utils module¶
This module is the bridge between the command line configuration of the decode.py script and the SGNMT software architecture consisting of decoders, predictors, and output handlers. A common use case is to call create_decoder() first, which reads the SGNMT configuration and loads the right predictors and decoding strategy with the right arguments. The actual decoding is implemented in do_decode(). See decode.py to learn how to use this module.
- 
cam.sgnmt.decode_utils.add_heuristics(decoder)[source]¶
- Adds all enabled heuristics to the decoder. This is relevant for heuristic-based search strategies like A*. This method relies on the global args variable and reads out args.heuristics. - Parameters: - decoder (Decoder) – Decoding strategy, see create_decoder(). This method will add heuristics to this instance with add_heuristic()
- 
cam.sgnmt.decode_utils.add_predictors(decoder)[source]¶
- Adds all enabled predictors to the decoder. This function makes heavy use of the global args, which contains the SGNMT configuration. In particular, it reads out args.predictors and adds appropriate instances to decoder. TODO: Refactor this method, as it is far too long. - Parameters: - decoder (Decoder) – Decoding strategy, see create_decoder(). This method will add predictors to this instance with add_predictor()
- 
cam.sgnmt.decode_utils.args= None¶
- This variable is set to the global configuration when base_init() is called.
- 
cam.sgnmt.decode_utils.base_init(new_args)[source]¶
- This function should be called before accessing any other function in this module. It initializes the args variable, on which all the create_* factory functions rely as their configuration object, and it sets up global function pointers and variables for basic things like the indexing scheme, logging verbosity, etc. - Parameters: - new_args – Configuration object from the argument parser.
- 
cam.sgnmt.decode_utils.create_decoder()[source]¶
- Creates the Decoder instance. This specifies the search strategy used to traverse the space spanned by the predictors. This method relies on the global args variable. - TODO: Refactor to avoid long argument lists - Returns: - Decoder. Instance of the search strategy
- 
cam.sgnmt.decode_utils.create_output_handlers()[source]¶
- Creates the output handlers defined in the output module. These handlers create output files in different formats from the decoding results. - Parameters: - args – Global command line arguments. - Returns: - list. List of output handlers according to --outputs
- 
cam.sgnmt.decode_utils.do_decode(decoder, output_handlers, src_sentences)[source]¶
- This method contains the main decoding loop. It iterates through src_sentences and applies decoder.decode() to each of them. At the end, it calls the output handlers to create output files. - Parameters: - decoder (Decoder) – Current decoder instance
- output_handlers (list) – List of output handlers, see create_output_handlers()
- src_sentences (list) – A list of strings. The strings are the source sentences with word indices to translate (e.g. ‘1 123 432 2’)
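The loop described above can be sketched roughly as follows. This is a simplified stand-in and not the SGNMT implementation: `EchoDecoder`, `CollectingHandler`, and `decode_all` are hypothetical names, and the hypothesis format is reduced to plain token lists.

```python
class EchoDecoder:
    """Hypothetical stand-in for a Decoder: returns the input as its only hypothesis."""

    def decode(self, src_sentence):
        return [list(src_sentence)]


class CollectingHandler:
    """Hypothetical stand-in for an output handler; keeps results in memory."""

    def __init__(self):
        self.written = None

    def write_hypos(self, all_hypos, sen_indices=None):
        self.written = (all_hypos, sen_indices)


def decode_all(decoder, output_handlers, src_sentences):
    """Decode every sentence, then hand all n-best lists to the handlers."""
    all_hypos, sen_indices = [], []
    for idx, line in enumerate(src_sentences):
        # Source sentences are strings of word indices, e.g. '1 123 432 2'.
        src = [int(tok) for tok in line.split()]
        all_hypos.append(decoder.decode(src))
        sen_indices.append(idx)  # 0-indexed sentence positions
    for handler in output_handlers:
        handler.write_hypos(all_hypos, sen_indices)
    return all_hypos


handler = CollectingHandler()
hypos = decode_all(EchoDecoder(), [handler], ["1 123 432 2"])
```

The handlers are called once after all sentences have been decoded, which matches the "at the end" wording above.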
 
- 
cam.sgnmt.decode_utils.get_sentence_indices(range_param, src_sentences)[source]¶
- Helper method for do_decode which returns the indices of the sentences to decode. - Parameters: - range_param (string) – --range parameter from config
- src_sentences (list) – A list of strings. The strings are the source sentences with word indices to translate (e.g. ‘1 123 432 2’)
 
cam.sgnmt.extract_scores_along_reference module¶
cam.sgnmt.io module¶
This module is responsible for converting input text to integer representations (encode()), and integer translation hypotheses back to readable text (decode()). In the default configuration, this conversion is an identity mapping: Source sentences are provided in integer representations, and output files also contain indexed sentences.
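As an illustration of the default identity mapping, the following minimal sketch converts between an indexed sentence string and a list of token IDs. The function names `id_encode` and `id_decode` are hypothetical and only mirror the behaviour described above.

```python
def id_encode(src_sentence):
    """Turn a string of word indices into a list of integers."""
    return [int(tok) for tok in src_sentence.split()]


def id_decode(trg_sentence):
    """Turn a list of token IDs back into a space-separated string."""
    return " ".join(str(tok) for tok in trg_sentence)


ids = id_encode("1 123 432 2")
text = id_decode(ids)
```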
- 
class cam.sgnmt.io.BPE(codes_path, separator='@@', remove_eow=False)[source]¶
- Bases: - object- 
encode(orig)[source]¶
- Encode word based on list of BPE merge operations, which are applied consecutively 
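A minimal sketch of applying BPE merge operations consecutively to a single word is shown below. It is a simplification for illustration only: `apply_merges` is a hypothetical helper, and the handling of the separator and of end-of-word markers (`separator`, `remove_eow`) is left out.

```python
def apply_merges(word, merges):
    """Apply each merge operation in order to the symbols of `word`."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Join an adjacent (left, right) pair into one symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


merges = [("l", "o"), ("lo", "w"), ("e", "r")]
pieces = apply_merges("lower", merges)    # ['low', 'er']
other = apply_merges("lowest", merges)    # ['low', 'e', 's', 't']
```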
 
- 
- 
class cam.sgnmt.io.BPEAtAtDecoder[source]¶
- Bases: - cam.sgnmt.io.Decoder - Decoder for BPE mapping with @@ separator.
- 
class cam.sgnmt.io.BPEDecoder[source]¶
- Bases: - cam.sgnmt.io.Decoder - Decoder for BPE mapping SGNMT style.
- 
class cam.sgnmt.io.BPEEncoder(codes_path, separator='', remove_eow=False)[source]¶
- Bases: - cam.sgnmt.io.Encoder- Encoder for BPE mapping. 
- 
class cam.sgnmt.io.CharDecoder[source]¶
- Bases: - cam.sgnmt.io.Decoder - Decoder for char mapping.
- 
class cam.sgnmt.io.CharEncoder[source]¶
- Bases: - cam.sgnmt.io.Encoder- Encoder for char mapping. 
- 
class cam.sgnmt.io.Encoder[source]¶
- Bases: - object- Super class for IO encoders. - 
encode(src_sentence)[source]¶
- Converts the source sentence in string representation to a sequence of token IDs. Depending on the configuration of this module, it applies word maps and/or subword/character segmentation on the input. - Parameters: - src_sentence (string) – A single input sentence - Returns: - List of integers. 
 
- 
- 
class cam.sgnmt.io.IDDecoder[source]¶
- Bases: - cam.sgnmt.io.Decoder - Decoder for ID mapping.
- 
class cam.sgnmt.io.IDEncoder[source]¶
- Bases: - cam.sgnmt.io.Encoder- Encoder for ID mapping. 
- 
class cam.sgnmt.io.WordDecoder[source]¶
- Bases: - cam.sgnmt.io.Decoder - Decoder for word based mapping.
- 
class cam.sgnmt.io.WordEncoder[source]¶
- Bases: - cam.sgnmt.io.Encoder- Encoder for word based mapping. 
- 
cam.sgnmt.io.decode(trg_sentence)[source]¶
- Converts the target sentence represented as sequence of token IDs to a string representation. This method calls decoder.decode(). - Parameters: - trg_sentence (list) – A sequence of integers (token IDs) - Returns: - string.
- 
cam.sgnmt.io.decoder= None¶
- Decoder called in decode(). Initialized in initialize(). 
- 
cam.sgnmt.io.encode(src_sentence)[source]¶
- Converts the source sentence in string representation to a sequence of token IDs. Depending on the configuration of this module, it applies word maps and/or subword/character segmentation on the input. This method calls encoder.encode(). - Parameters: - src_sentence (string) – A single input sentence - Returns: - List of integers.
- 
cam.sgnmt.io.encoder= None¶
- Encoder called in encode(). Initialized in initialize(). 
- 
cam.sgnmt.io.initialize(args)[source]¶
- Initializes the io module, including loading word maps and other resources needed for encoding and decoding. Subsequent calls of encode() and decode() will process input as specified in args. - Parameters: - args (object) – SGNMT config
- 
cam.sgnmt.io.load_src_wmap(path)[source]¶
- Loads a source side word map from the file system. - Parameters: - path (string) – Path to the word map (Format: word id) - Returns: - dict. Source word map (key: word, value: id)
- 
cam.sgnmt.io.load_trg_wmap(path)[source]¶
- Loads a target side word map from the file system. - Parameters: - path (string) – Path to the word map (Format: word id) - Returns: - dict. Target word map (key: id, value: word)
- 
cam.sgnmt.io.src_wmap= {}¶
- Source language word map (word -> id) 
- 
cam.sgnmt.io.trg_wmap= {}¶
- Target language word map (id -> word) 
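The two map directions can be illustrated with a small sketch that parses lines in the "word id" format. `read_wmaps` and the sample entries are hypothetical; the actual loaders read from a file path and may treat malformed lines differently.

```python
def read_wmaps(lines):
    """Build a word -> id map and an id -> word map from 'word id' lines."""
    word_to_id, id_to_word = {}, {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that do not follow the 'word id' format
        word, idx = parts[0], int(parts[1])
        word_to_id[word] = idx
        id_to_word[idx] = word
    return word_to_id, id_to_word


# Hypothetical word map content; the labels for the reserved IDs are assumptions.
sample = ["<unk> 0", "<s> 1", "</s> 2", "hello 3"]
src_map, trg_map = read_wmaps(sample)
```

Here `src_map` plays the role of a source word map (word -> id) and `trg_map` that of a target word map (id -> word).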
cam.sgnmt.output module¶
This module contains the output handlers. These handlers create
output files from the n-best lists generated by the Decoder. They
can be activated via --outputs.
This module depends on OpenFST to write FST files in binary format. To
enable Python support in OpenFST, use a recent version (>=1.5.4) and
compile with --enable_python. Further information can be found here:
http://www.openfst.org/twiki/bin/view/FST/PythonExtension
- 
class cam.sgnmt.output.FSTOutputHandler(path, unk_id)[source]¶
- Bases: - cam.sgnmt.output.OutputHandler - This output handler creates FSTs with sparse tuple arcs from the n-best lists from the decoder. The predictor scores are kept separately in the sparse tuples. Note that this means that the effect of --combination_scheme might not be visible in the lattices because predictor scores are not combined. The order in the sparse tuples corresponds to the order of the predictors in the --predictors argument. - Note that the created FSTs use another ID for UNK to avoid confusion with the epsilon symbol used by OpenFST. - Creates a sparse tuple FST output handler. - Parameters: - path (string) – Path to the VECLAT directory to create
- unk_id (int) – Id which should be used in the FST for UNK
 - 
write_hypos(all_hypos, sen_indices)[source]¶
- Writes FST files with sparse tuples for each sentence in all_hypos. The created lattices are not optimized in any way: We create a distinct path for each entry in all_hypos. We advise you to determinize/minimize them if you are planning to use them for further processing. - Parameters: - all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
 - Raises: - OSError. If the directory could not be created
- IOError. If something goes wrong while writing to the disk
 
 
- 
class cam.sgnmt.output.NBestOutputHandler(path, predictor_names)[source]¶
- Bases: - cam.sgnmt.output.OutputHandler - Produces an n-best file in Moses format. The third part of each entry is used to store the separated unnormalized predictor scores. Note that the sentence IDs are shifted: Moses n-best files start with the index 0, but in SGNMT and HiFST we usually refer to the first sentence with 1 (e.g. in lattice directories or --range). - Creates a Moses n-best list output handler. - Parameters: - path (string) – Path to the n-best file to write
- predictor_names – Names of the predictors whose scores should be included in the score breakdown in the n-best list
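For orientation, a Moses-style n-best entry consists of four parts separated by `|||`: the sentence index, the hypothesis, the score breakdown, and the total score. The sketch below formats one such entry; `format_nbest_line` is a hypothetical helper, and the exact number formatting and score labels used by SGNMT are assumptions.

```python
def format_nbest_line(sen_idx, hypo_text, predictor_names, scores, total_score):
    """Format one n-best entry; the third part holds the per-predictor scores."""
    breakdown = " ".join(
        f"{name}= {score:f}" for name, score in zip(predictor_names, scores)
    )
    return f"{sen_idx} ||| {hypo_text} ||| {breakdown} ||| {total_score:f}"


# Sentence index 0 is the first sentence, which --range refers to as 1.
line = format_nbest_line(0, "1 123 432 2", ["nmt", "length"], [-2.5, -0.5], -3.0)
```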
 
- 
class cam.sgnmt.output.NgramOutputHandler(path, min_order, max_order)[source]¶
- Bases: - cam.sgnmt.output.OutputHandler - This output handler extracts MBR-style ngram posteriors from the hypotheses returned by the decoder. The hypothesis scores are assumed to be log-likelihoods, which we renormalize to make sure that we operate on a valid distribution. The scores produced by the output handler are probabilities of an ngram being in the translation. - Creates an ngram output handler. - Parameters: - path (string) – Path to the ngram directory to create
- min_order (int) – Minimum order of extracted ngrams
- max_order (int) – Maximum order of extracted ngrams
 - 
write_hypos(all_hypos, sen_indices)[source]¶
- Writes ngram files for each sentence in all_hypos. - Parameters: - all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
 - Raises: - OSError. If the directory could not be created
- IOError. If something goes wrong while writing to the disk
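The ngram posterior computation described for NgramOutputHandler can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the SGNMT code: `ngram_posteriors` is a hypothetical function, and hypotheses are represented as (token list, log score) pairs.

```python
import math
from collections import defaultdict


def ngram_posteriors(hypos, min_order, max_order):
    """Probability of each ngram occurring in the translation.

    `hypos` is a list of (tokens, log_score) pairs. Scores are renormalized
    so that the hypothesis probabilities sum to one.
    """
    max_score = max(score for _, score in hypos)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(score - max_score) for _, score in hypos]
    norm = sum(weights)
    posteriors = defaultdict(float)
    for (tokens, _), weight in zip(hypos, weights):
        seen = set()  # count each ngram at most once per hypothesis
        for order in range(min_order, max_order + 1):
            for start in range(len(tokens) - order + 1):
                seen.add(tuple(tokens[start:start + order]))
        for ngram in seen:
            posteriors[ngram] += weight / norm
    return dict(posteriors)


hypos = [([4, 5, 6], math.log(0.6)), ([4, 5, 7], math.log(0.2))]
post = ngram_posteriors(hypos, 1, 2)
```

After renormalization the two hypotheses carry probability 0.75 and 0.25, so an ngram shared by both, such as `(4, 5)`, receives a posterior of 1.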
 
 
- 
class cam.sgnmt.output.OutputHandler[source]¶
- Bases: - object- Interface for output handlers. - Empty constructor - 
write_hypos(all_hypos, sen_indices=None)[source]¶
- This method writes output files to the file system. The configuration parameters such as output paths should already have been provided via constructor arguments. - Parameters: - all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
 - Raises: - IOError. If something goes wrong while writing to the disk 
 
- 
- 
class cam.sgnmt.output.StandardFSTOutputHandler(path, unk_id)[source]¶
- Bases: - cam.sgnmt.output.OutputHandler - This output handler creates FSTs with standard arcs. In contrast to FSTOutputHandler, predictor scores are combined using --combination_scheme. - Note that the created FSTs use another ID for UNK to avoid confusion with the epsilon symbol used by OpenFST. - Creates a standard arc FST output handler. - Parameters: - path (string) – Path to the fst directory to create
- unk_id (int) – Id which should be used in the FST for UNK
 - 
write_hypos(all_hypos, sen_indices)[source]¶
- Writes FST files with standard arcs for each sentence in all_hypos. The created lattices are not optimized in any way: We create a distinct path for each entry in all_hypos. We advise you to determinize/minimize them if you are planning to use them for further processing. - Parameters: - all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
 - Raises: - OSError. If the directory could not be created
- IOError. If something goes wrong while writing to the disk
 
 
- 
class cam.sgnmt.output.TextOutputHandler(path)[source]¶
- Bases: - cam.sgnmt.output.OutputHandler - Writes the first best hypotheses to a plain text file. - Creates a plain text output handler to write to path
- 
class cam.sgnmt.output.TimeCSVOutputHandler(path, predictor_names)[source]¶
- Bases: - cam.sgnmt.output.OutputHandler - Produces one CSV file for each sentence. The CSV files contain the predictor score breakdown for each translation prefix length. - Creates a time CSV output handler. - Parameters: - path (string) – Path to the n-best file to write
- predictor_names – Names of the predictors whose scores should be included in the score breakdown in the n-best list
 - 
write_hypos(all_hypos, sen_indices)[source]¶
- Writes CSV files for each sentence in all_hypos. - Parameters: - all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
 - Raises: - OSError. If the directory could not be created
- IOError. If something goes wrong while writing to the disk
 
 
cam.sgnmt.tf_utils module¶
This file contains utility functions for TensorFlow such as session handling and checkpoint loading.
- 
cam.sgnmt.tf_utils.create_session(checkpoint_path, n_cpu_threads=-1)[source]¶
- Creates a MonitoredSession. - Parameters: - checkpoint_path (string) – Path either to checkpoint directory or directly to a checkpoint file.
- n_cpu_threads (int) – Number of CPU threads. If negative, we assume either GPU decoding or that all CPU cores can be used.
 - Returns: - A TensorFlow MonitoredSession. 
cam.sgnmt.ui module¶
This module handles the user interface and contains subroutines for parsing and verifying config files and command line arguments.
- 
cam.sgnmt.ui.get_args()[source]¶
- Get the arguments for the current SGNMT run from both command line arguments and configuration files. This method contains all available SGNMT options, i.e. configuration is not encapsulated e.g. by predictors. - Returns: - object. Arguments object as returned by ArgumentParser
- 
cam.sgnmt.ui.get_parser()[source]¶
- Get the parser object which is used to build the configuration argument args. This is a helper method for get_args(). TODO: Decentralize configuration - Returns: - ArgumentParser. The pre-filled parser object
- 
cam.sgnmt.ui.parse_args(parser)[source]¶
- http://codereview.stackexchange.com/questions/79008/parse-a-config-file-and-add-to-command-line-arguments-using-argparse-in-python
cam.sgnmt.utils module¶
This file contains common basic functionality which can be used from anywhere. This includes the definition of reserved word indices, some mathematical functions, and helper functions to deal with the small quirks Python sometimes has.
- 
cam.sgnmt.utils.EOS_ID= 2¶
- Reserved word ID for the end-of-sentence symbol. 
- 
cam.sgnmt.utils.GO_ID= 1¶
- Reserved word ID for the start-of-sentence symbol. 
- 
cam.sgnmt.utils.MESSAGE_TYPE_DEFAULT= 1¶
- Default message type for observer messages 
- 
cam.sgnmt.utils.MESSAGE_TYPE_FULL_HYPO= 3¶
- This message type is used by the decoder when a new complete hypothesis was found. Note that this is not necessarily the best hypo so far, it is just the latest hypo found which ends with EOS. 
- 
cam.sgnmt.utils.MESSAGE_TYPE_POSTERIOR= 2¶
- This message is sent by the decoder after apply_predictors was called. The message includes the new posterior distribution and the score breakdown.
- 
cam.sgnmt.utils.NOTAPPLICABLE_ID= 3¶
- Reserved word ID which is currently not used. 
- 
class cam.sgnmt.utils.Observable[source]¶
- Bases: - object - Implements the observable (subject) role of the GoF observer design pattern. - Initializes the list of observers with an empty list
- 
class cam.sgnmt.utils.Observer[source]¶
- Bases: - object - Super class for classes which observe (GoF design pattern) other classes.
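The interplay between observables, observers, and the message types listed above can be sketched as follows. The method names `add_observer`, `notify_observers`, and `notify`, as well as the `HypoCollector` class, are assumptions made for this illustration; only the message type constants are taken from this page.

```python
MESSAGE_TYPE_DEFAULT = 1
MESSAGE_TYPE_POSTERIOR = 2
MESSAGE_TYPE_FULL_HYPO = 3


class Observer:
    """Base class for observers; subclasses override notify()."""

    def notify(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        raise NotImplementedError


class Observable:
    """Keeps a list of observers and forwards messages to them."""

    def __init__(self):
        self.observers = []  # starts with an empty list

    def add_observer(self, observer):
        self.observers.append(observer)

    def notify_observers(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        for observer in self.observers:
            observer.notify(message, message_type)


class HypoCollector(Observer):
    """Hypothetical observer that records complete hypotheses only."""

    def __init__(self):
        self.hypos = []

    def notify(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        if message_type == MESSAGE_TYPE_FULL_HYPO:
            self.hypos.append(message)


subject = Observable()
collector = HypoCollector()
subject.add_observer(collector)
subject.notify_observers({"posterior": {}}, MESSAGE_TYPE_POSTERIOR)
subject.notify_observers([1, 123, 2], MESSAGE_TYPE_FULL_HYPO)
```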
- 
cam.sgnmt.utils.TMP_FILENAME= '/tmp/sgnmt.16734.fst'¶
- Temporary file name to use if an FST file is zipped. 
- 
cam.sgnmt.utils.UNK_ID= 0¶
- Reserved word ID for the unknown word (UNK). 
- 
cam.sgnmt.utils.argmax(arr)[source]¶
- Get the index of the maximum entry in arr. The parameter can be a dictionary. - Parameters: - arr (list,array,dict) – Set of numerical values - Returns: - Index or key of the maximum entry in arr
- 
cam.sgnmt.utils.argmax_n(arr, n)[source]¶
- Get indices of the n maximum entries in arr. The parameter arr can be a dictionary. The returned index set is not guaranteed to be sorted. - Parameters: - arr (list,array,dict) – Set of numerical values
- n (int) – Number of values to retrieve
 - Returns: - List of indices or keys of the n maximum entries in arr
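A minimal sketch of how argmax() and argmax_n() could behave for lists and dictionaries is given below. The `_sketch` functions are hypothetical and use only the standard library; the actual functions may rely on numpy and may return results in a different order.

```python
import heapq


def argmax_sketch(arr):
    """Index (list) or key (dict) of the largest value."""
    if isinstance(arr, dict):
        return max(arr, key=arr.get)
    return max(range(len(arr)), key=lambda i: arr[i])


def argmax_n_sketch(arr, n):
    """Indices or keys of the n largest values."""
    if isinstance(arr, dict):
        return heapq.nlargest(n, arr, key=arr.get)
    return heapq.nlargest(n, range(len(arr)), key=lambda i: arr[i])


best_index = argmax_sketch([0.1, 0.7, 0.2])
best_key = argmax_sketch({"a": -1.0, "b": -0.5})
top_two = argmax_n_sketch([0.1, 0.7, 0.2, 0.9], 2)
```

Since the documented result is not guaranteed to be sorted, callers should treat the output of argmax_n() as a set.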
- 
cam.sgnmt.utils.common_contains(obj, key)[source]¶
- Checks the existence of a key or index in a mapping. Works with numpy arrays, lists, and dicts. - Parameters: - obj (list,array,dict) – Mapping
- key (int) – Index or key of the element to retrieve
 - Returns: - True if key in obj, otherwise False
- 
cam.sgnmt.utils.common_get(obj, key, default)[source]¶
- Can be used to access an element via the index or key. Works with numpy arrays, lists, and dicts. - Parameters: - obj (list,array,dict) – Mapping
- key (int) – Index or key of the element to retrieve
- default (object) –
 - Returns: - obj[key] if key in obj, otherwise default
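The following sketch shows one way to give lists and dicts a uniform "contains" and "get with default" interface. The `_sketch` names are hypothetical, and treating negative list indices as absent is an assumption of this example.

```python
def common_contains_sketch(obj, key):
    """True if `key` is a valid dict key or an in-range list index."""
    if isinstance(obj, dict):
        return key in obj
    return 0 <= key < len(obj)


def common_get_sketch(obj, key, default):
    """Return obj[key] if present, otherwise `default`."""
    return obj[key] if common_contains_sketch(obj, key) else default


dense = [10, 20]
sparse = {7: 0.5}
```

For example, `common_get_sketch(dense, 5, 0)` falls back to the default `0`, while `common_get_sketch(sparse, 7, 0.0)` returns the stored value.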
- 
cam.sgnmt.utils.common_iterable(obj)[source]¶
- Can be used to iterate over the key-value pairs of a mapping. Works with numpy arrays, lists, and dicts. Code taken from http://stackoverflow.com/questions/12325608/iterate-over-a-dict-or-list-in-python 
- 
cam.sgnmt.utils.common_viewkeys(obj)[source]¶
- Can be used to iterate over the keys or indices of a mapping. Works with numpy arrays, lists, and dicts. Code taken from http://stackoverflow.com/questions/12325608/iterate-over-a-dict-or-list-in-python 
- 
cam.sgnmt.utils.get_path(tmpl, sub=1)[source]¶
- Replaces the %d placeholder in tmpl with sub. If tmpl does not contain %d, return tmpl unmodified. - Parameters: - tmpl (string) – Path, potentially with %d placeholder
- sub (int) – Substitution for %d
 - Returns: - string. tmpl with %d replaced with sub if present
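The placeholder substitution can be illustrated with a short sketch. `get_path_sketch` is a hypothetical re-implementation of the documented behaviour, and the example paths are made up.

```python
def get_path_sketch(tmpl, sub=1):
    """Replace a %d placeholder in `tmpl` with `sub`, if there is one."""
    if "%d" in tmpl:
        return tmpl.replace("%d", str(sub))
    return tmpl


per_sentence = get_path_sketch("lats/%d.fst", 3)   # 'lats/3.fst'
shared = get_path_sketch("lats/all.fst", 3)        # 'lats/all.fst'
```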
- 
cam.sgnmt.utils.load_fst(path)[source]¶
- Loads an FST from the file system using PyFST's read() method. GZipped format is also supported. The arc type must be standard or log, otherwise PyFST cannot load the FST. - Parameters: - path (string) – Path to the FST file to load - Returns: - fst. PyFST FST object or None if FST could not be read
- 
cam.sgnmt.utils.log_sum(vals)¶
- Defines which log summation function to use. 
- 
cam.sgnmt.utils.log_sum_log_semiring(vals)[source]¶
- Uses the logsumexp function in scipy to calculate the log of the sum of a set of log values. - Parameters: - vals (set) – List or set of numerical values
- 
cam.sgnmt.utils.log_sum_tropical_semiring(vals)[source]¶
- Approximates summation in log space with the max. - Parameters: - vals (set) – List or set of numerical values 
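The difference between the two summation variants can be sketched with the standard library alone. The `_sketch` functions are hypothetical stand-ins; the log semiring version here uses `math` instead of scipy's `logsumexp`.

```python
import math


def log_sum_log_semiring_sketch(vals):
    """Exact log of the sum of exponentiated values (log-sum-exp)."""
    vals = list(vals)
    m = max(vals)
    # Factor out the maximum to avoid overflow or underflow in exp().
    return m + math.log(sum(math.exp(v - m) for v in vals))


def log_sum_tropical_semiring_sketch(vals):
    """Approximate the log of the sum by the largest log value."""
    return max(vals)


log_probs = [math.log(0.5), math.log(0.25)]
exact = log_sum_log_semiring_sketch(log_probs)          # log(0.75)
approx = log_sum_tropical_semiring_sketch(log_probs)    # log(0.5)
```

The tropical approximation never exceeds the exact value, and the gap shrinks when one entry dominates the others.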
- 
cam.sgnmt.utils.switch_to_fairseq_indexing()[source]¶
- Calling this method overrides the global definitions of the reserved word ids GO_ID, EOS_ID, and UNK_ID with the fairseq indexing scheme.
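A sketch of such an override is shown below. The default values are the ones documented on this page; the fairseq values are an assumption based on fairseq's default dictionary layout (`<s>` = 0, `<pad>` = 1, `</s>` = 2, `<unk>` = 3) and should be checked against the fairseq version in use.

```python
# Default reserved IDs as documented above.
GO_ID = 1
EOS_ID = 2
UNK_ID = 0


def switch_to_fairseq_indexing_sketch():
    """Override the module-level reserved IDs (assumed fairseq defaults)."""
    global GO_ID, EOS_ID, UNK_ID
    GO_ID = 0   # assumed: fairseq begin-of-sentence symbol
    EOS_ID = 2  # assumed: fairseq end-of-sentence symbol
    UNK_ID = 3  # assumed: fairseq unknown-word symbol


switch_to_fairseq_indexing_sketch()
```

Because the IDs are module-level globals, code that copied the old values before the switch would keep using the previous scheme.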