cam.sgnmt package¶
Subpackages¶
- cam.sgnmt.decoding package
- Submodules
- cam.sgnmt.decoding.astar module
- cam.sgnmt.decoding.beam module
- cam.sgnmt.decoding.bigramgreedy module
- cam.sgnmt.decoding.bucket module
- cam.sgnmt.decoding.combibeam module
- cam.sgnmt.decoding.combination module
- cam.sgnmt.decoding.core module
- cam.sgnmt.decoding.dfs module
- cam.sgnmt.decoding.flip module
- cam.sgnmt.decoding.fstbeam module
- cam.sgnmt.decoding.greedy module
- cam.sgnmt.decoding.heuristics module
- cam.sgnmt.decoding.interpolation module
- cam.sgnmt.decoding.lenbeam module
- cam.sgnmt.decoding.mbrbeam module
- cam.sgnmt.decoding.multisegbeam module
- cam.sgnmt.decoding.predlimitbeam module
- cam.sgnmt.decoding.restarting module
- cam.sgnmt.decoding.sepbeam module
- cam.sgnmt.decoding.syncbeam module
- cam.sgnmt.decoding.syntaxbeam module
- Module contents
- cam.sgnmt.misc package
- cam.sgnmt.predictors package
- Submodules
- cam.sgnmt.predictors.automata module
- cam.sgnmt.predictors.bow module
- cam.sgnmt.predictors.core module
- cam.sgnmt.predictors.forced module
- cam.sgnmt.predictors.grammar module
- cam.sgnmt.predictors.length module
- cam.sgnmt.predictors.misc module
- cam.sgnmt.predictors.ngram module
- cam.sgnmt.predictors.parse module
- cam.sgnmt.predictors.pytorch_fairseq module
- cam.sgnmt.predictors.structure module
- cam.sgnmt.predictors.tf_nizza module
- cam.sgnmt.predictors.tf_t2t module
- cam.sgnmt.predictors.tokenization module
- cam.sgnmt.predictors.vocabulary module
- Module contents
Submodules¶
cam.sgnmt.decode module¶
cam.sgnmt.decode_utils module¶
This module is the bridge between the command line configuration of the decode.py script and the SGNMT software architecture consisting of decoders, predictors, and output handlers. A common use case is to call create_decoder() first, which reads the SGNMT configuration and loads the right predictors and decoding strategy with the right arguments. The actual decoding is implemented in do_decode(). See decode.py to learn how to use this module.
-
cam.sgnmt.decode_utils.
add_heuristics
(decoder)[source]¶ Adds all enabled heuristics to the decoder. This is relevant for heuristic-based search strategies like A*. This method relies on the global args variable and reads out args.heuristics.
Parameters: decoder (Decoder) – Decoding strategy, see create_decoder(). This method will add heuristics to this instance with add_heuristic().
-
cam.sgnmt.decode_utils.
add_predictors
(decoder)[source]¶ Adds all enabled predictors to the decoder. This function makes heavy use of the global args, which contains the SGNMT configuration. In particular, it reads out args.predictors and adds appropriate instances to decoder. TODO: Refactor this method, as it is far too long.
Parameters: decoder (Decoder) – Decoding strategy, see create_decoder(). This method will add predictors to this instance with add_predictor().
-
cam.sgnmt.decode_utils.
args
= None¶ This variable is set to the global configuration when base_init() is called.
-
cam.sgnmt.decode_utils.
base_init
(new_args)[source]¶ This function should be called before accessing any other function in this module. It initializes the args variable, which all the create_* factory functions rely on as their configuration object, and it sets up global function pointers and variables for basic things like the indexing scheme, logging verbosity, etc.
Parameters: new_args – Configuration object from the argument parser.
-
cam.sgnmt.decode_utils.
create_decoder
()[source]¶ Creates the Decoder instance. This specifies the search strategy used to traverse the space spanned by the predictors. This method relies on the global args variable.
TODO: Refactor to avoid long argument lists
Returns: Decoder. Instance of the search strategy
-
cam.sgnmt.decode_utils.
create_output_handlers
()[source]¶ Creates the output handlers defined in the io module. These handlers create output files in different formats from the decoding results.
Parameters: args – Global command line arguments.
Returns: list. List of output handlers according to --outputs
-
cam.sgnmt.decode_utils.
do_decode
(decoder, output_handlers, src_sentences)[source]¶ This method contains the main decoding loop. It iterates through src_sentences and applies decoder.decode() to each of them. At the end, it calls the output handlers to create output files.
Parameters: - decoder (Decoder) – Current decoder instance
- output_handlers (list) – List of output handlers, see create_output_handlers()
- src_sentences (list) – A list of strings. The strings are the source sentences with word indices to translate (e.g. ‘1 123 432 2’)
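The loop described above can be sketched roughly as follows. This is a minimal illustration, not SGNMT's implementation: StubDecoder, CollectingHandler, and decode_all are hypothetical names, and the stub decoder simply echoes the source IDs as its only hypothesis.

```python
class StubDecoder:
    """Hypothetical stand-in for a Decoder: returns an n-best list per sentence."""

    def decode(self, src_sentence):
        # A real decoder would search the space spanned by the predictors.
        # Here we just echo the source IDs as a single hypothesis.
        return [list(src_sentence)]


class CollectingHandler:
    """Hypothetical output handler that stores results instead of writing files."""

    def __init__(self):
        self.written = None

    def write_hypos(self, all_hypos, sen_indices=None):
        self.written = (all_hypos, sen_indices)


def decode_all(decoder, output_handlers, src_sentences):
    """Decode every sentence, then hand all results to the output handlers."""
    all_hypos, sen_indices = [], []
    for idx, line in enumerate(src_sentences):
        src = [int(tok) for tok in line.split()]  # e.g. '1 123 432 2'
        all_hypos.append(decoder.decode(src))
        sen_indices.append(idx)  # 0-indexed sentence IDs
    for handler in output_handlers:
        handler.write_hypos(all_hypos, sen_indices)
    return all_hypos


handler = CollectingHandler()
decode_all(StubDecoder(), [handler], ["1 123 432 2", "1 7 2"])
```

The output handlers are only called once, after all sentences have been decoded, which matches the description of do_decode() above.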
-
cam.sgnmt.decode_utils.
get_sentence_indices
(range_param, src_sentences)[source]¶ Helper method for do_decode which returns the indices of the sentences to decode.
Parameters: - range_param (string) – --range parameter from config
- src_sentences (list) – A list of strings. The strings are the source sentences with word indices to translate (e.g. ‘1 123 432 2’)
cam.sgnmt.extract_scores_along_reference module¶
cam.sgnmt.io module¶
This module is responsible for converting input text to integer representations (encode()), and integer translation hypotheses back to readable text (decode()). In the default configuration, this conversion is an identity mapping: Source sentences are provided in integer representations, and output files also contain indexed sentences.
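As a rough illustration of the default identity-style behaviour, the sketch below parses a string of token IDs into integers and joins them back. The function names encode_ids and decode_ids are hypothetical and are not part of cam.sgnmt.io.

```python
def encode_ids(src_sentence):
    """Identity-style encoding: parse a string of token IDs into integers."""
    return [int(tok) for tok in src_sentence.split()]


def decode_ids(trg_sentence):
    """Identity-style decoding: join token IDs back into a string."""
    return " ".join(str(tok) for tok in trg_sentence)


ids = encode_ids("1 123 432 2")
text = decode_ids(ids)
```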
-
class
cam.sgnmt.io.
BPE
(codes_path, separator='@@', remove_eow=False)[source]¶ Bases:
object
-
encode
(orig)[source]¶ Encode word based on list of BPE merge operations, which are applied consecutively
-
-
class
cam.sgnmt.io.
BPEAtAtDecoder
[source]¶ Bases:
cam.sgnmt.io.Decoder
Decoder for BPE mapping with @@ separator.
-
class
cam.sgnmt.io.
BPEDecoder
[source]¶ Bases:
cam.sgnmt.io.Decoder
Decoder for BPE mapping SGNMT style.
-
class
cam.sgnmt.io.
BPEEncoder
(codes_path, separator='', remove_eow=False)[source]¶ Bases:
cam.sgnmt.io.Encoder
Encoder for BPE mapping.
-
class
cam.sgnmt.io.
CharDecoder
[source]¶ Bases:
cam.sgnmt.io.Decoder
Decoder for char mapping.
-
class
cam.sgnmt.io.
CharEncoder
[source]¶ Bases:
cam.sgnmt.io.Encoder
Encoder for char mapping.
-
class
cam.sgnmt.io.
Encoder
[source]¶ Bases:
object
Super class for IO encoders.
-
encode
(src_sentence)[source]¶ Converts the source sentence in string representation to a sequence of token IDs. Depending on the configuration of this module, it applies word maps and/or subword/character segmentation on the input.
Parameters: src_sentence (string) – A single input sentence Returns: List of integers.
-
-
class
cam.sgnmt.io.
IDDecoder
[source]¶ Bases:
cam.sgnmt.io.Decoder
Decoder for ID mapping.
-
class
cam.sgnmt.io.
IDEncoder
[source]¶ Bases:
cam.sgnmt.io.Encoder
Encoder for ID mapping.
-
class
cam.sgnmt.io.
WordDecoder
[source]¶ Bases:
cam.sgnmt.io.Decoder
Decoder for word based mapping.
-
class
cam.sgnmt.io.
WordEncoder
[source]¶ Bases:
cam.sgnmt.io.Encoder
Encoder for word based mapping.
-
cam.sgnmt.io.
decode
(trg_sentence)[source]¶ Converts the target sentence represented as sequence of token IDs to a string representation. This method calls decoder.decode().
Parameters: trg_sentence (list) – A sequence of integers (token IDs)
Returns: string.
-
cam.sgnmt.io.
decoder
= None¶ Decoder called in decode(). Initialized in initialize().
-
cam.sgnmt.io.
encode
(src_sentence)[source]¶ Converts the source sentence in string representation to a sequence of token IDs. Depending on the configuration of this module, it applies word maps and/or subword/character segmentation on the input. This method calls encoder.encode().
Parameters: src_sentence (string) – A single input sentence
Returns: List of integers.
-
cam.sgnmt.io.
encoder
= None¶ Encoder called in encode(). Initialized in initialize().
-
cam.sgnmt.io.
initialize
(args)[source]¶ Initializes the io module, including loading word maps and other resources needed for encoding and decoding. Subsequent calls of encode() and decode() will process input as specified in args.
Parameters: args (object) – SGNMT config
-
cam.sgnmt.io.
load_src_wmap
(path)[source]¶ Loads a source side word map from the file system.
Parameters: path (string) – Path to the word map (Format: word id)
Returns: dict. Source word map (key: word, value: id)
-
cam.sgnmt.io.
load_trg_wmap
(path)[source]¶ Loads a target side word map from the file system.
Parameters: path (string) – Path to the word map (Format: word id)
Returns: dict. Target word map (key: id, value: word)
-
cam.sgnmt.io.
src_wmap
= {}¶ Source language word map (word -> id)
-
cam.sgnmt.io.
trg_wmap
= {}¶ Target language word map (id -> word)
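The "word id" file format and the two mapping directions can be illustrated with a short sketch. It assumes one whitespace-separated pair per line; read_wmap, the sample vocabulary, and the symbol strings such as <unk> are hypothetical and only serve the example.

```python
import io


def read_wmap(lines):
    """Parse 'word id' lines into (word -> id) and (id -> word) dicts."""
    word2id, id2word = {}, {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, idx = parts[0], int(parts[1])
        word2id[word] = idx
        id2word[idx] = word
    return word2id, id2word


# Hypothetical word map content; a real one would be read from a file.
wmap_file = io.StringIO("<unk> 0\n<s> 1\n</s> 2\nhello 3\nworld 4\n")
src_map, trg_map = read_wmap(wmap_file)

UNK = 0  # same value as the default UNK_ID documented in cam.sgnmt.utils
encoded = [src_map.get(w, UNK) for w in "hello brave world".split()]
decoded = " ".join(trg_map.get(i, "<unk>") for i in [3, 4])
```

Unknown words fall back to the UNK ID on the source side, which is one plausible policy; the actual encoders may handle out-of-vocabulary words differently.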
cam.sgnmt.output module¶
This module contains the output handlers. These handlers create output files from the n-best lists generated by the Decoder. They can be activated via --outputs.
This module depends on OpenFST to write FST files in binary format. To enable Python support in OpenFST, use a recent version (>=1.5.4) and compile with --enable-python. Further information can be found here: http://www.openfst.org/twiki/bin/view/FST/PythonExtension
-
class
cam.sgnmt.output.
FSTOutputHandler
(path, unk_id)[source]¶ Bases:
cam.sgnmt.output.OutputHandler
This output handler creates FSTs with sparse tuple arcs from the n-best lists from the decoder. The predictor scores are kept separately in the sparse tuples. Note that this means that the parameter --combination_scheme might not be visible in the lattices because predictor scores are not combined. The order in the sparse tuples corresponds to the order of the predictors in the --predictors argument.
Note that the created FSTs use another ID for UNK to avoid confusion with the epsilon symbol used by OpenFST.
Creates a sparse tuple FST output handler.
Parameters: - path (string) – Path to the VECLAT directory to create
- unk_id (int) – Id which should be used in the FST for UNK
-
write_hypos
(all_hypos, sen_indices)[source]¶ Writes FST files with sparse tuples for each sentence in all_hypos. The created lattices are not optimized in any way: We create a distinct path for each entry in all_hypos. We advise you to determinize/minimize them if you are planning to use them for further processing.
Parameters: - all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
Raises: - OSError. If the directory could not be created
- IOError. If something goes wrong while writing to the disk
-
class
cam.sgnmt.output.
NBestOutputHandler
(path, predictor_names)[source]¶ Bases:
cam.sgnmt.output.OutputHandler
Produces an n-best file in Moses format. The third part of each entry is used to store the separated unnormalized predictor scores. Note that the sentence IDs are shifted: Moses n-best files start with the index 0, but in SGNMT and HiFST we usually refer to the first sentence with 1 (e.g. in lattice directories or --range).
Creates a Moses n-best list output handler.
Parameters: - path (string) – Path to the n-best file to write
- predictor_names – Names of the predictors whose scores should be included in the score breakdown in the n-best list
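For orientation, one entry of a Moses-style n-best list has four parts separated by |||: sentence ID, hypothesis, score breakdown, and total score. The sketch below formats such a line. The helper nbest_line, the predictor names, and the "name= value" labelling are assumptions made for illustration; the exact layout written by NBestOutputHandler may differ.

```python
def nbest_line(sen_idx, tokens, predictor_names, scores, total):
    """Format one entry of a Moses-style n-best list.

    sen_idx is 0-indexed, as in Moses n-best files.
    """
    breakdown = " ".join(
        "%s= %f" % (name, score) for name, score in zip(predictor_names, scores)
    )
    hypo = " ".join(str(t) for t in tokens)
    return "%d ||| %s ||| %s ||| %f" % (sen_idx, hypo, breakdown, total)


# Hypothetical predictor names and scores.
line = nbest_line(0, [1, 123, 2], ["nmt", "length"], [-1.5, -0.25], -1.75)
```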
-
class
cam.sgnmt.output.
NgramOutputHandler
(path, min_order, max_order)[source]¶ Bases:
cam.sgnmt.output.OutputHandler
This output handler extracts MBR-style ngram posteriors from the hypotheses returned by the decoder. The hypothesis scores are assumed to be loglikelihoods, which we renormalize to make sure that we operate on a valid distribution. The scores produced by the output handler are probabilities of an ngram being in the translation.
Creates an ngram output handler.
Parameters: - path (string) – Path to the ngram directory to create
- min_order (int) – Minimum order of extracted ngrams
- max_order (int) – Maximum order of extracted ngrams
-
write_hypos
(all_hypos, sen_indices)[source]¶ Writes ngram files for each sentence in all_hypos.
Parameters: - all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
Raises: - OSError. If the directory could not be created
- IOError. If something goes wrong while writing to the disk
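The idea of MBR-style ngram posteriors can be sketched as follows: renormalize the hypothesis log scores into a distribution, then sum the probability of every hypothesis that contains a given ngram. This is a minimal sketch under those assumptions; ngram_posteriors and the (tokens, log_score) input layout are hypothetical and not the handler's actual interface.

```python
import math
from collections import defaultdict


def ngram_posteriors(nbest, min_order, max_order):
    """Estimate P(ngram occurs in translation) from an n-best list.

    nbest is a list of (tokens, log_score) pairs.
    """
    # Renormalize the log scores so that they form a valid distribution.
    max_log = max(score for _, score in nbest)
    norm = max_log + math.log(sum(math.exp(s - max_log) for _, s in nbest))
    posteriors = defaultdict(float)
    for tokens, score in nbest:
        prob = math.exp(score - norm)
        seen = set()
        for order in range(min_order, max_order + 1):
            for start in range(len(tokens) - order + 1):
                seen.add(tuple(tokens[start:start + order]))
        # Count each ngram once per hypothesis.
        for ngram in seen:
            posteriors[ngram] += prob
    return dict(posteriors)


# Two hypotheses with equal (unnormalized) scores.
nbest = [([5, 6, 7], math.log(0.2)), ([5, 6, 8], math.log(0.2))]
post = ngram_posteriors(nbest, 1, 2)
```

After renormalization both hypotheses carry probability 0.5, so an ngram shared by both, such as (5, 6), receives a posterior of 1.0, while (6, 7) receives 0.5.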
-
class
cam.sgnmt.output.
OutputHandler
[source]¶ Bases:
object
Interface for output handlers.
Empty constructor
-
write_hypos
(all_hypos, sen_indices=None)[source]¶ This method writes output files to the file system. The configuration parameters such as output paths should already have been provided via constructor arguments.
Parameters: - all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
Raises: IOError. If something goes wrong while writing to the disk
-
-
class
cam.sgnmt.output.
StandardFSTOutputHandler
(path, unk_id)[source]¶ Bases:
cam.sgnmt.output.OutputHandler
This output handler creates FSTs with standard arcs. In contrast to FSTOutputHandler, predictor scores are combined using --combination_scheme.
Note that the created FSTs use another ID for UNK to avoid confusion with the epsilon symbol used by OpenFST.
Creates a standard arc FST output handler.
Parameters: - path (string) – Path to the fst directory to create
- unk_id (int) – Id which should be used in the FST for UNK
-
write_hypos
(all_hypos, sen_indices)[source]¶ Writes FST files with standard arcs for each sentence in all_hypos. The created lattices are not optimized in any way: We create a distinct path for each entry in all_hypos. We advise you to determinize/minimize them if you are planning to use them for further processing.
Parameters: - all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
Raises: - OSError. If the directory could not be created
- IOError. If something goes wrong while writing to the disk
-
class
cam.sgnmt.output.
TextOutputHandler
(path)[source]¶ Bases:
cam.sgnmt.output.OutputHandler
Writes the first best hypotheses to a plain text file
Creates a plain text output handler to write to
path
-
class
cam.sgnmt.output.
TimeCSVOutputHandler
(path, predictor_names)[source]¶ Bases:
cam.sgnmt.output.OutputHandler
Produces one CSV file for each sentence. The CSV files contain the predictor score breakdown for each translation prefix length.
Creates a CSV output handler for per-sentence predictor score breakdowns.
Parameters: - path (string) – Output path for the CSV files to write
- predictor_names – Names of the predictors whose scores should be included in the score breakdown
-
write_hypos
(all_hypos, sen_indices)[source]¶ Writes one CSV file for each sentence in all_hypos.
Parameters: - all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
Raises: - OSError. If the directory could not be created
- IOError. If something goes wrong while writing to the disk
cam.sgnmt.tf_utils module¶
This file contains utility functions for TensorFlow such as session handling and checkpoint loading.
-
cam.sgnmt.tf_utils.
create_session
(checkpoint_path, n_cpu_threads=-1)[source]¶ Creates a MonitoredSession.
Parameters: - checkpoint_path (string) – Path either to checkpoint directory or directly to a checkpoint file.
- n_cpu_threads (int) – Number of CPU threads. If negative, we assume either GPU decoding or that all CPU cores can be used.
Returns: A TensorFlow MonitoredSession.
cam.sgnmt.ui module¶
This module handles the user interface and contains subroutines for parsing and verifying config files and command line arguments.
-
cam.sgnmt.ui.
get_args
()[source]¶ Get the arguments for the current SGNMT run from both command line arguments and configuration files. This method contains all available SGNMT options, i.e. configuration is not encapsulated e.g. by predictors.
Returns: object. Arguments object like for ArgumentParser
-
cam.sgnmt.ui.
get_parser
()[source]¶ Get the parser object which is used to build the configuration argument args. This is a helper method for get_args().
TODO: Decentralize configuration
Returns: ArgumentParser. The pre-filled parser object
-
cam.sgnmt.ui.
parse_args
(parser)[source]¶ http://codereview.stackexchange.com/questions/79008/parse-a-config-file-and-add-to-command-line-arguments-using-argparse-in-python
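One common way to combine a configuration file with command line arguments in argparse is to install the config values as parser defaults, so that explicit command line options still take precedence. The sketch below shows that pattern only; it is an assumption about the general approach, not a copy of parse_args(). The helper parse_with_config, the option values, and the config dict (standing in for a parsed config file) are hypothetical.

```python
import argparse


def parse_with_config(parser, argv, config):
    """Let config values act as defaults that the command line can override."""
    parser.set_defaults(**config)
    return parser.parse_args(argv)


parser = argparse.ArgumentParser()
parser.add_argument("--beam", type=int, default=12)
parser.add_argument("--predictors", default="nmt")

# Hypothetical values, as if read from a config file.
config = {"beam": 4, "predictors": "nmt,length"}
args = parse_with_config(parser, ["--beam", "8"], config)
```

Here the command line value for --beam wins over the config value, while --predictors falls back to the config entry.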
cam.sgnmt.utils module¶
This file contains common basic functionality which can be used from anywhere. This includes the definition of reserved word indices, some mathematical functions, and helper functions to deal with the small quirks Python sometimes has.
-
cam.sgnmt.utils.
EOS_ID
= 2¶ Reserved word ID for the end-of-sentence symbol.
-
cam.sgnmt.utils.
GO_ID
= 1¶ Reserved word ID for the start-of-sentence symbol.
-
cam.sgnmt.utils.
MESSAGE_TYPE_DEFAULT
= 1¶ Default message type for observer messages
-
cam.sgnmt.utils.
MESSAGE_TYPE_FULL_HYPO
= 3¶ This message type is used by the decoder when a new complete hypothesis was found. Note that this is not necessarily the best hypo so far, it is just the latest hypo found which ends with EOS.
-
cam.sgnmt.utils.
MESSAGE_TYPE_POSTERIOR
= 2¶ This message is sent by the decoder after
apply_predictors
was called. The message includes the new posterior distribution and the score breakdown.
-
cam.sgnmt.utils.
NOTAPPLICABLE_ID
= 3¶ Reserved word ID which is currently not used.
-
class
cam.sgnmt.utils.
Observable
[source]¶ Bases:
object
For the GoF design pattern observer
Initializes the list of observers with an empty list
-
class
cam.sgnmt.utils.
Observer
[source]¶ Bases:
object
Super class for classes which observe (GoF design pattern) other classes.
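A minimal sketch of how the observer pattern and the message types above could fit together is given below. The method names add_observer, notify_observers, and notify, as well as the HypoCollector class, are assumptions for illustration and may not match the actual Observable and Observer interfaces.

```python
MESSAGE_TYPE_DEFAULT = 1
MESSAGE_TYPE_POSTERIOR = 2
MESSAGE_TYPE_FULL_HYPO = 3


class Observable:
    """Subject side of the pattern: keeps a list of observers."""

    def __init__(self):
        self.observers = []  # starts out empty

    def add_observer(self, observer):
        self.observers.append(observer)

    def notify_observers(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        for observer in self.observers:
            observer.notify(message, message_type)


class HypoCollector:
    """Hypothetical observer that keeps only complete hypotheses."""

    def __init__(self):
        self.hypos = []

    def notify(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        if message_type == MESSAGE_TYPE_FULL_HYPO:
            self.hypos.append(message)


subject = Observable()
collector = HypoCollector()
subject.add_observer(collector)
subject.notify_observers({"posterior": {}}, MESSAGE_TYPE_POSTERIOR)
subject.notify_observers([1, 5, 2], MESSAGE_TYPE_FULL_HYPO)
```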
-
cam.sgnmt.utils.
TMP_FILENAME
= '/tmp/sgnmt.16734.fst'¶ Temporary file name to use if an FST file is zipped.
-
cam.sgnmt.utils.
UNK_ID
= 0¶ Reserved word ID for the unknown word (UNK).
-
cam.sgnmt.utils.
argmax
(arr)[source]¶ Get the index of the maximum entry in arr. The parameter can be a dictionary.
Parameters: arr (list,array,dict) – Set of numerical values
Returns: Index or key of the maximum entry in arr
-
cam.sgnmt.utils.
argmax_n
(arr, n)[source]¶ Get indices of the n maximum entries in arr. The parameter arr can be a dictionary. The returned index set is not guaranteed to be sorted.
Parameters: - arr (list,array,dict) – Set of numerical values
- n (int) – Number of values to retrieve
Returns: List of indices or keys of the n maximum entries in arr
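The behaviour described for argmax() and argmax_n() on both lists and dictionaries can be approximated with the standard library. The functions argmax_sketch and argmax_n_sketch below are hypothetical stand-ins, not the SGNMT implementations.

```python
import heapq


def argmax_sketch(arr):
    """Return the index (list) or key (dict) of the largest value."""
    if isinstance(arr, dict):
        return max(arr, key=arr.get)
    return max(range(len(arr)), key=lambda i: arr[i])


def argmax_n_sketch(arr, n):
    """Return indices or keys of the n largest values."""
    if isinstance(arr, dict):
        return heapq.nlargest(n, arr, key=arr.get)
    return heapq.nlargest(n, range(len(arr)), key=lambda i: arr[i])


scores = [-2.0, -0.5, -1.0]            # dense scores, indexed by position
by_word = {7: -2.0, 12: -0.5, 3: -1.0}  # sparse scores, keyed by word ID
```

Note that heapq.nlargest happens to return its result in descending order, whereas the documented function makes no ordering guarantee, so callers should not rely on the order.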
-
cam.sgnmt.utils.
common_contains
(obj, key)[source]¶ Checks the existence of a key or index in a mapping. Works with numpy arrays, lists, and dicts.
Parameters: - obj (list,array,dict) – Mapping
- key (int) – Index or key of the element to retrieve
Returns: True if key in obj, otherwise False
-
cam.sgnmt.utils.
common_get
(obj, key, default)[source]¶ Can be used to access an element via the index or key. Works with numpy arrays, lists, and dicts.
Parameters: - obj (list,array,dict) – Mapping
- key (int) – Index or key of the element to retrieve
- default (object) – Value returned if key is not in obj
Returns: obj[key] if key in obj, otherwise default
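The uniform access described for common_contains() and common_get() can be sketched like this. The names contains_sketch and get_sketch are hypothetical, and treating negative indices as absent is an assumption of this sketch rather than documented behaviour.

```python
def contains_sketch(obj, key):
    """True if key is a valid dict key or an in-range sequence index."""
    if isinstance(obj, dict):
        return key in obj
    # Assumption: only non-negative, in-range indices count as present.
    return 0 <= key < len(obj)


def get_sketch(obj, key, default):
    """Return obj[key] if present, otherwise default."""
    return obj[key] if contains_sketch(obj, key) else default


dense = [0.1, 0.7, 0.2]  # list-like posterior, indexed by word ID
sparse = {4: 0.9}        # dict posterior, keyed by word ID
```

This lets calling code treat dense and sparse score containers the same way.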
-
cam.sgnmt.utils.
common_iterable
(obj)[source]¶ Can be used to iterate over the key-value pairs of a mapping. Works with numpy arrays, lists, and dicts. Code taken from http://stackoverflow.com/questions/12325608/iterate-over-a-dict-or-list-in-python
-
cam.sgnmt.utils.
common_viewkeys
(obj)[source]¶ Can be used to iterate over the keys or indices of a mapping. Works with numpy arrays, lists, and dicts. Code taken from http://stackoverflow.com/questions/12325608/iterate-over-a-dict-or-list-in-python
-
cam.sgnmt.utils.
get_path
(tmpl, sub=1)[source]¶ Replaces the %d placeholder in tmpl with sub. If tmpl does not contain %d, return tmpl unmodified.
Parameters: - tmpl (string) – Path, potentially with %d placeholder
- sub (int) – Substitution for %d
Returns: string. tmpl with %d replaced with sub if present
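A small sketch of the placeholder substitution is shown below. It uses plain string replacement, which may differ from how get_path() is actually implemented; get_path_sketch and the example paths are hypothetical.

```python
def get_path_sketch(tmpl, sub=1):
    """Replace the %d placeholder in tmpl with sub, if there is one."""
    # str.replace leaves the string untouched when '%d' does not occur.
    return tmpl.replace("%d", str(sub))


per_sentence = get_path_sketch("lats/%d.fst", 3)
fixed = get_path_sketch("lats/all.fst", 3)
```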
-
cam.sgnmt.utils.
load_fst
(path)[source]¶ Loads an FST from the file system using PyFST's read() method. GZipped format is also supported. The arc type must be standard or log, otherwise PyFST cannot load them.
Parameters: path (string) – Path to the FST file to load
Returns: fst. PyFST FST object or None if FST could not be read
-
cam.sgnmt.utils.
log_sum
(vals)¶ Defines which log summation function to use.
-
cam.sgnmt.utils.
log_sum_log_semiring
(vals)[source]¶ Uses the logsumexp function in scipy to calculate the log of the sum of a set of log values.
Parameters: vals (set) – List or set of numerical values
-
cam.sgnmt.utils.
log_sum_tropical_semiring
(vals)[source]¶ Approximates summation in log space with the max.
Parameters: vals (set) – List or set of numerical values
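The difference between the two log summation variants can be seen in a short sketch. To stay dependency-free it computes the exact log-sum by hand instead of calling scipy's logsumexp; the function names log_sum_log and log_sum_tropical are hypothetical.

```python
import math


def log_sum_log(vals):
    """Exact log(sum(exp(v))) computed in a numerically stable way."""
    vals = list(vals)
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def log_sum_tropical(vals):
    """Tropical semiring: approximate the log-sum by the maximum."""
    return max(vals)


log_probs = [math.log(0.5), math.log(0.25)]
exact = log_sum_log(log_probs)        # equals log(0.75) up to rounding
approx = log_sum_tropical(log_probs)  # equals log(0.5)
```

The tropical variant never overestimates the true log-sum and is cheaper to compute, at the cost of ignoring the probability mass of all but the best entry.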
-
cam.sgnmt.utils.
switch_to_fairseq_indexing
()[source]¶ Calling this method overrides the global definitions of the reserved word ids GO_ID, EOS_ID, and UNK_ID with the fairseq indexing scheme.
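As an illustration of what such an override might look like, the sketch below swaps the default reserved IDs for the layout of fairseq's default dictionary, where <s> is 0, <pad> is 1, </s> is 2, and <unk> is 3. The ReservedIds class is a hypothetical stand-in for the module-level constants, and the target values are an assumption based on fairseq's defaults rather than taken from this documentation.

```python
class ReservedIds:
    """Stand-in for the module-level constants documented above."""
    GO_ID = 1
    EOS_ID = 2
    UNK_ID = 0


def switch_to_fairseq_sketch(ids=ReservedIds):
    """Overwrite the reserved IDs with fairseq's default dictionary layout."""
    ids.GO_ID = 0   # <s>
    ids.EOS_ID = 2  # </s>
    ids.UNK_ID = 3  # <unk>; index 1 is <pad> in fairseq


before = (ReservedIds.GO_ID, ReservedIds.EOS_ID, ReservedIds.UNK_ID)
switch_to_fairseq_sketch()
after = (ReservedIds.GO_ID, ReservedIds.EOS_ID, ReservedIds.UNK_ID)
```

Because the override is global, it needs to happen before any component reads the reserved IDs.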