cam.sgnmt package
Subpackages
- cam.sgnmt.blocks package
- Subpackages
- Submodules
- cam.sgnmt.blocks.align module
- cam.sgnmt.blocks.attention module
- cam.sgnmt.blocks.batch_decode module
- cam.sgnmt.blocks.checkpoint module
- cam.sgnmt.blocks.decoder module
- cam.sgnmt.blocks.encoder module
- cam.sgnmt.blocks.model module
- cam.sgnmt.blocks.nmt module
- cam.sgnmt.blocks.pruning module
- cam.sgnmt.blocks.sampling module
- cam.sgnmt.blocks.sparse_search module
- cam.sgnmt.blocks.stream module
- cam.sgnmt.blocks.train module
- cam.sgnmt.blocks.vanilla_decoder module
- Module contents
- cam.sgnmt.decoding package
- Submodules
- cam.sgnmt.decoding.astar module
- cam.sgnmt.decoding.beam module
- cam.sgnmt.decoding.bigramgreedy module
- cam.sgnmt.decoding.bow module
- cam.sgnmt.decoding.bucket module
- cam.sgnmt.decoding.combibeam module
- cam.sgnmt.decoding.combination module
- cam.sgnmt.decoding.core module
- cam.sgnmt.decoding.dfs module
- cam.sgnmt.decoding.flip module
- cam.sgnmt.decoding.greedy module
- cam.sgnmt.decoding.heuristics module
- cam.sgnmt.decoding.interpolation module
- cam.sgnmt.decoding.mbrbeam module
- cam.sgnmt.decoding.multisegbeam module
- cam.sgnmt.decoding.restarting module
- cam.sgnmt.decoding.sepbeam module
- cam.sgnmt.decoding.syncbeam module
- cam.sgnmt.decoding.syntaxbeam module
- Module contents
- cam.sgnmt.misc package
- cam.sgnmt.predictors package
- Submodules
- cam.sgnmt.predictors.automata module
- cam.sgnmt.predictors.blocks_nmt module
- cam.sgnmt.predictors.bow module
- cam.sgnmt.predictors.core module
- cam.sgnmt.predictors.ffnnlm module
- cam.sgnmt.predictors.forced module
- cam.sgnmt.predictors.grammar module
- cam.sgnmt.predictors.length module
- cam.sgnmt.predictors.misc module
- cam.sgnmt.predictors.ngram module
- cam.sgnmt.predictors.structure module
- cam.sgnmt.predictors.tf_nizza module
- cam.sgnmt.predictors.tf_nmt module
- cam.sgnmt.predictors.tf_rnnlm module
- cam.sgnmt.predictors.tf_t2t module
- cam.sgnmt.predictors.tokenization module
- cam.sgnmt.predictors.vocabulary module
- Module contents
- cam.sgnmt.tf package
Submodules
cam.sgnmt.decode module
cam.sgnmt.decode_utils module
This module is the bridge between the command line configuration of the decode.py script and the SGNMT software architecture consisting of decoders, predictors, and output handlers. A common use case is to call create_decoder() first, which reads the SGNMT configuration and loads the right predictors and decoding strategy with the right arguments. The actual decoding is implemented in do_decode(). See decode.py to learn how to use this module.
-
cam.sgnmt.decode_utils.add_heuristics(decoder)
Adds all enabled heuristics to the decoder. This is relevant for heuristic-based search strategies like A*. This method relies on the global args variable and reads out args.heuristics.
Parameters: decoder (Decoder) – Decoding strategy, see create_decoder(). This method will add heuristics to this instance with add_heuristic()
-
cam.sgnmt.decode_utils.add_predictors(decoder)
Adds all enabled predictors to the decoder. This function makes heavy use of the global args, which contains the SGNMT configuration. In particular, it reads out args.predictors and adds appropriate instances to decoder.
TODO: Refactor this method, as it is far too long.
Parameters: decoder (Decoder) – Decoding strategy, see create_decoder(). This method will add predictors to this instance with add_predictor()
-
cam.sgnmt.decode_utils.args = None
This variable is set to the global configuration when base_init() is called.
-
cam.sgnmt.decode_utils.base_init(new_args)
This function should be called before accessing any other function in this module. It initializes the args variable, on which all the create_* factory functions rely as their configuration object, and it sets up global function pointers and variables for basic things like the indexing scheme and logging verbosity.
Parameters: new_args – Configuration object from the argument parser.
-
cam.sgnmt.decode_utils.construct_nmt_vanilla_decoder()
Creates the vanilla NMT decoder, which bypasses the predictor framework. It uses the template methods get_nmt_vanilla_decoder for uniform access to the blocks or tensorflow frameworks.
Returns: NMT vanilla decoder using all specified NMT models, or None if an error occurred.
-
cam.sgnmt.decode_utils.create_decoder()
Creates the Decoder instance. This specifies the search strategy used to traverse the space spanned by the predictors. This method relies on the global args variable.
TODO: Refactor to avoid long argument lists
Returns: Decoder. Instance of the search strategy
-
cam.sgnmt.decode_utils.create_output_handlers()
Creates the output handlers defined in the io module. These handlers create output files in different formats from the decoding results.
Parameters: args – Global command line arguments.
Returns: list. List of output handlers according to --outputs
-
cam.sgnmt.decode_utils.do_decode(decoder, output_handlers, src_sentences)
This method contains the main decoding loop. It iterates through src_sentences and applies decoder.decode() to each of them. At the end, it calls the output handlers to create output files.
Parameters:
- decoder (Decoder) – Current decoder instance
- output_handlers (list) – List of output handlers, see create_output_handlers()
- src_sentences (list) – A list of strings. The strings are the source sentences with word indices to translate (e.g. ‘1 123 432 2’)
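The loop that do_decode() describes can be pictured with the minimal sketch below. It does not import SGNMT: EchoDecoder, CollectingHandler, and decode_all are hypothetical stand-ins, and the real decode() and write_hypos() implementations may take or return richer objects than the plain lists and tuples assumed here.

```python
class EchoDecoder:
    """Hypothetical stand-in for a Decoder (not part of SGNMT)."""

    def decode(self, src_sentence):
        # A real decoder would search the space spanned by the predictors.
        # This stub returns a single (tokens, score) hypothesis.
        return [(list(reversed(src_sentence)), 0.0)]


class CollectingHandler:
    """Hypothetical stand-in for an output handler."""

    def __init__(self):
        self.written = None

    def write_hypos(self, all_hypos, sen_indices=None):
        # A real handler would write files; this one just records the call.
        self.written = (all_hypos, sen_indices)


def decode_all(decoder, output_handlers, src_sentences):
    """Decode every sentence, then pass all results to the handlers."""
    all_hypos = []
    sen_indices = []
    for idx, line in enumerate(src_sentences):
        # Source sentences are strings of word indices, e.g. '1 123 432 2'.
        src = [int(tok) for tok in line.split()]
        all_hypos.append(decoder.decode(src))
        sen_indices.append(idx)
    # Output files are only created once decoding has finished.
    for handler in output_handlers:
        handler.write_hypos(all_hypos, sen_indices)
    return all_hypos


handler = CollectingHandler()
decode_all(EchoDecoder(), [handler], ["1 123 432 2"])
```

Collecting all hypotheses before calling the handlers mirrors the "at the end" wording above; whether SGNMT also writes partial results is not stated here.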
-
cam.sgnmt.decode_utils.get_sentence_indices(range_param, src_sentences)
Helper method for do_decode which returns the indices of the sentences to decode.
Parameters:
- range_param (string) – --range parameter from config
- src_sentences (list) – A list of strings. The strings are the source sentences with word indices to translate (e.g. ‘1 123 432 2’)
cam.sgnmt.extract_scores_along_reference module
cam.sgnmt.output module
This module contains the output handlers. These handlers create output files from the n-best lists generated by the Decoder. They can be activated via --outputs.
This module depends on OpenFST to write FST files in binary format. To enable Python support in OpenFST, use a recent version (>=1.5.4) and compile with --enable_python. Further information can be found here: http://www.openfst.org/twiki/bin/view/FST/PythonExtension
-
class cam.sgnmt.output.AlignmentOutputHandler
Bases: object
Interface for output handlers for alignments.
Empty constructor.
-
write_alignments(alignments)
This method writes output files to the file system. The configuration parameters such as output paths should already have been provided via constructor arguments.
Parameters: alignments (list) – list of alignment matrices
Raises: IOError. If something goes wrong while writing to the disk
-
-
class cam.sgnmt.output.CSVAlignmentOutputHandler(path)
Bases: cam.sgnmt.output.AlignmentOutputHandler
Creates a directory with CSV files which store the alignment matrices.
-
class cam.sgnmt.output.FSTOutputHandler(path, unk_id)
Bases: cam.sgnmt.output.OutputHandler
This output handler creates FSTs with sparse tuple arcs from the n-best lists from the decoder. The predictor scores are kept separately in the sparse tuples. Note that this means that the parameter --combination_scheme might not be visible in the lattices because predictor scores are not combined. The order in the sparse tuples corresponds to the order of the predictors in the --predictors argument.
Note that the created FSTs use another ID for UNK to avoid confusion with the epsilon symbol used by OpenFST.
Creates a sparse tuple FST output handler.
Parameters:
- path (string) – Path to the VECLAT directory to create
- unk_id (int) – Id which should be used in the FST for UNK
-
write_hypos(all_hypos, sen_indices)
Writes FST files with sparse tuples for each sentence in all_hypos. The created lattices are not optimized in any way: we create a distinct path for each entry in all_hypos. We advise you to determinize/minimize them if you are planning to use them for further processing.
Parameters:
- all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
Raises:
- OSError. If the directory could not be created
- IOError. If something goes wrong while writing to the disk
-
class cam.sgnmt.output.NBestOutputHandler(path, predictor_names, trg_wmap)
Bases: cam.sgnmt.output.OutputHandler
Produces an n-best file in Moses format. The third part of each entry is used to store the separated unnormalized predictor scores. Note that the sentence IDs are shifted: Moses n-best files start with the index 0, but in SGNMT and HiFST we usually refer to the first sentence with 1 (e.g. in lattice directories or --range).
Creates a Moses n-best list output handler.
Parameters:
- path (string) – Path to the n-best file to write
- predictor_names – Names of the predictors whose scores should be included in the score breakdown in the n-best list
- trg_wmap (dict) – (Inverse) word map for target language
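To make the ID shift and the score breakdown concrete, here is a rough sketch of how one entry could be laid out. The helper format_nbest_line is hypothetical, and the exact rendering of the third field (here "name= score" pairs with six decimals) is an assumption rather than something taken from the SGNMT source; only the four-part layout separated by "|||" follows the usual Moses convention.

```python
def format_nbest_line(sgnmt_sen_id, words, predictor_names,
                      predictor_scores, total_score):
    """Build one Moses-style n-best entry (illustrative layout only)."""
    # SGNMT counts sentences from 1, Moses n-best files from 0.
    moses_id = sgnmt_sen_id - 1
    # Third field: one unnormalized score per predictor, kept separate.
    breakdown = " ".join(
        "%s= %f" % (name, score)
        for name, score in zip(predictor_names, predictor_scores)
    )
    return "%d ||| %s ||| %s ||| %f" % (
        moses_id, " ".join(words), breakdown, total_score)


line = format_nbest_line(1, ["a", "b"], ["nmt", "lm"], [-1.5, -0.25], -1.75)
```

For the call above, the entry starts with sentence ID 0 even though SGNMT would refer to the same sentence as 1.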
-
class cam.sgnmt.output.NPYAlignmentOutputHandler(path)
Bases: cam.sgnmt.output.AlignmentOutputHandler
Creates a directory with alignment matrices in numpy format (npy).
-
class cam.sgnmt.output.NgramOutputHandler(path, min_order, max_order)
Bases: cam.sgnmt.output.OutputHandler
This output handler extracts MBR-style ngram posteriors from the hypotheses returned by the decoder. The hypothesis scores are assumed to be loglikelihoods, which we renormalize to make sure that we operate on a valid distribution. The scores produced by the output handler are probabilities of an ngram being in the translation.
Creates an ngram output handler.
Parameters:
- path (string) – Path to the ngram directory to create
- min_order (int) – Minimum order of extracted ngrams
- max_order (int) – Maximum order of extracted ngrams
-
write_hypos(all_hypos, sen_indices)
Writes ngram files for each sentence in all_hypos.
Parameters:
- all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
Raises:
- OSError. If the directory could not be created
- IOError. If something goes wrong while writing to the disk
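The renormalization step behind the ngram posteriors can be sketched as follows. This is an illustration of the idea described for NgramOutputHandler, not its actual code: ngram_posteriors is a hypothetical helper, and hypotheses are assumed to be plain (token list, log score) pairs.

```python
import math
from collections import defaultdict


def ngram_posteriors(hypos, min_order, max_order):
    """Map each ngram to the probability that it occurs in the translation.

    hypos is a list of (tokens, log_score) pairs for one sentence.
    """
    # Renormalize the log scores so that they form a valid distribution
    # over the n-best list (subtracting the max keeps exp() stable).
    max_score = max(score for _, score in hypos)
    weights = [math.exp(score - max_score) for _, score in hypos]
    norm = sum(weights)
    posteriors = defaultdict(float)
    for (tokens, _), weight in zip(hypos, weights):
        seen = set()
        for order in range(min_order, max_order + 1):
            for start in range(len(tokens) - order + 1):
                seen.add(tuple(tokens[start:start + order]))
        # A hypothesis contributes its probability once per distinct ngram,
        # even if the ngram occurs several times in it.
        for ngram in seen:
            posteriors[ngram] += weight / norm
    return dict(posteriors)


nbest = [([4, 5, 6], math.log(0.6)), ([4, 5, 7], math.log(0.2))]
post = ngram_posteriors(nbest, 1, 2)
```

In this example the two scores do not sum to one, so they are rescaled to 0.75 and 0.25 before the ngram probabilities are accumulated.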
-
class cam.sgnmt.output.OutputHandler
Bases: object
Interface for output handlers.
Empty constructor.
-
write_hypos(all_hypos, sen_indices=None)
This method writes output files to the file system. The configuration parameters such as output paths should already have been provided via constructor arguments.
Parameters:
- all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
Raises: IOError. If something goes wrong while writing to the disk
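A custom handler following this interface might look like the sketch below. The OutputHandler class here is a local stand-in rather than the SGNMT class, LengthOutputHandler is hypothetical, and hypotheses are simplified to plain token lists; the real hypothesis objects are likely richer.

```python
import os
import tempfile


class OutputHandler(object):
    """Local stand-in for the documented interface (not imported from SGNMT)."""

    def write_hypos(self, all_hypos, sen_indices=None):
        raise NotImplementedError


class LengthOutputHandler(OutputHandler):
    """Hypothetical handler: writes the length of each first-best hypothesis."""

    def __init__(self, path):
        super().__init__()
        # Output paths are fixed at construction time, as the interface expects.
        self.path = path

    def write_hypos(self, all_hypos, sen_indices=None):
        if sen_indices is None:
            sen_indices = range(len(all_hypos))
        with open(self.path, "w") as f:
            for idx, hypos in zip(sen_indices, all_hypos):
                first_best = hypos[0]  # n-best lists are assumed sorted
                f.write("%d %d\n" % (idx, len(first_best)))


out_path = os.path.join(tempfile.mkdtemp(), "lengths.txt")
LengthOutputHandler(out_path).write_hypos([[[5, 6, 2]], [[7, 2]]])
```

The handler receives everything it needs to know about the destination through its constructor, so write_hypos() only deals with the decoding results.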
-
-
class cam.sgnmt.output.StandardFSTOutputHandler(path, unk_id)
Bases: cam.sgnmt.output.OutputHandler
This output handler creates FSTs with standard arcs. In contrast to FSTOutputHandler, predictor scores are combined using --combination_scheme.
Note that the created FSTs use another ID for UNK to avoid confusion with the epsilon symbol used by OpenFST.
Creates a standard arc FST output handler.
Parameters:
- path (string) – Path to the fst directory to create
- unk_id (int) – Id which should be used in the FST for UNK
-
write_hypos(all_hypos, sen_indices)
Writes FST files with standard arcs for each sentence in all_hypos. The created lattices are not optimized in any way: we create a distinct path for each entry in all_hypos. We advise you to determinize/minimize them if you are planning to use them for further processing.
Parameters:
- all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
Raises:
- OSError. If the directory could not be created
- IOError. If something goes wrong while writing to the disk
-
class cam.sgnmt.output.TextAlignmentOutputHandler(path)
Bases: cam.sgnmt.output.AlignmentOutputHandler
Creates a single text alignment file (Pharaoh format).
-
class cam.sgnmt.output.TextOutputHandler(path, trg_wmap)
Bases: cam.sgnmt.output.OutputHandler
Writes the first best hypotheses to a plain text file.
Creates a plain text output handler to write to path.
-
class cam.sgnmt.output.TimeCSVOutputHandler(path, predictor_names)
Bases: cam.sgnmt.output.OutputHandler
Produces one CSV file for each sentence. The CSV files contain the predictor score breakdown for each translation prefix length.
Creates a time CSV output handler.
Parameters:
- path (string) – Path to the n-best file to write
- predictor_names – Names of the predictors whose scores should be included in the score breakdown in the n-best list
-
write_hypos(all_hypos, sen_indices)
Writes CSV files for each sentence in all_hypos.
Parameters:
- all_hypos (list) – list of nbest lists of hypotheses
- sen_indices (list) – List of sentence indices (0-indexed)
Raises:
- OSError. If the directory could not be created
- IOError. If something goes wrong while writing to the disk
cam.sgnmt.ui module
This module handles configuration and user interface when using blocks. yaml and ArgumentParser are used for parsing config files and command line arguments.
-
cam.sgnmt.ui.get_args()
Get the arguments for the current SGNMT run from both command line arguments and configuration files. This method contains all available SGNMT options, i.e. configuration is not encapsulated e.g. by predictors.
Returns: object. Arguments object like for ArgumentParser
-
cam.sgnmt.ui.get_parser()
Get the parser object which is used to build the configuration argument args. This is a helper method for get_args().
TODO: Decentralize configuration
Returns: ArgumentParser. The pre-filled parser object
-
cam.sgnmt.ui.parse_args(parser)
http://codereview.stackexchange.com/questions/79008/parse-a-config-file-and-add-to-command-line-arguments-using-argparse-in-python
cam.sgnmt.utils module
This file contains common basic functionality which can be used from anywhere. This includes the definition of reserved word indices, some mathematical functions, and helper functions to deal with the small quirks Python sometimes has.
-
cam.sgnmt.utils.EOS_ID = 2
Reserved word ID for the end-of-sentence symbol.
-
cam.sgnmt.utils.GO_ID = 1
Reserved word ID for the start-of-sentence symbol.
-
cam.sgnmt.utils.MESSAGE_TYPE_DEFAULT = 1
Default message type for observer messages.
-
cam.sgnmt.utils.MESSAGE_TYPE_FULL_HYPO = 3
This message type is used by the decoder when a new complete hypothesis was found. Note that this is not necessarily the best hypo so far; it is just the latest hypo found which ends with EOS.
-
cam.sgnmt.utils.MESSAGE_TYPE_POSTERIOR = 2
This message is sent by the decoder after apply_predictors was called. The message includes the new posterior distribution and the score breakdown.
-
cam.sgnmt.utils.NOTAPPLICABLE_ID = 3
Reserved word ID which is currently not used.
-
class cam.sgnmt.utils.Observable
Bases: object
For the GoF design pattern observer.
Initializes the list of observers with an empty list.
-
class cam.sgnmt.utils.Observer
Bases: object
Super class for classes which observe (GoF design pattern) other classes.
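For readers unfamiliar with the pattern, the sketch below shows how an observer could filter on the message types listed above. The classes are local stand-ins, and the method names add_observer, notify_observers, and notify are assumptions made for illustration; this page does not list the actual methods of Observable and Observer.

```python
MESSAGE_TYPE_DEFAULT = 1    # values as documented for cam.sgnmt.utils
MESSAGE_TYPE_POSTERIOR = 2
MESSAGE_TYPE_FULL_HYPO = 3


class Observer(object):
    """Stand-in base class; the method name 'notify' is an assumption."""

    def notify(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        raise NotImplementedError


class Observable(object):
    """Stand-in observable which starts with an empty observer list."""

    def __init__(self):
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def notify_observers(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        for observer in self.observers:
            observer.notify(message, message_type)


class FullHypoCollector(Observer):
    """Hypothetical observer which only keeps complete hypotheses."""

    def __init__(self):
        self.hypos = []

    def notify(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        if message_type == MESSAGE_TYPE_FULL_HYPO:
            self.hypos.append(message)


subject = Observable()
collector = FullHypoCollector()
subject.add_observer(collector)
subject.notify_observers({"posterior": {}}, MESSAGE_TYPE_POSTERIOR)
subject.notify_observers([5, 6, 2], MESSAGE_TYPE_FULL_HYPO)
```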
-
cam.sgnmt.utils.TMP_FILENAME = '/tmp/sgnmt.20117.fst'
Temporary file name to use if an FST file is zipped.
-
cam.sgnmt.utils.UNK_ID = 0
Reserved word ID for the unknown word (UNK).
-
cam.sgnmt.utils.apply_src_wmap(seq, wmap=None)
Converts a string to a sequence of integers by applying the mapping wmap. If wmap is empty, parse seq as string of blank separated integers.
Parameters:
- seq (list) – List of strings to convert
- wmap (dict) – word map to apply (key: word, value: ID). If empty use utils.src_wmap
Returns: list. List of integers
-
cam.sgnmt.utils.apply_trg_wmap(seq, inv_wmap=None)
Converts a sequence of integers to a string by applying the mapping inv_wmap. If inv_wmap is empty, output the integers directly.
Parameters:
- seq (list) – List of integers to convert
- inv_wmap (dict) – word map to apply (key: id, value: word). If empty use utils.trg_wmap
Returns: string. Mapped seq as single (blank separated) string
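The two directions of the word map can be illustrated with the sketch below. The helpers words_to_ids and ids_to_words are hypothetical simplifications: they take a blank-separated string on the source side, they do not fall back to the global utils.src_wmap or utils.trg_wmap, and the handling of unknown words (UNK_ID and the "<unk>" placeholder) is an assumption.

```python
UNK_ID = 0  # documented default reserved ID for the unknown word


def words_to_ids(sentence, wmap=None):
    """Map a blank-separated sentence to a list of integer IDs."""
    tokens = sentence.split()
    if not wmap:
        # Without a word map the input is already a string of integers.
        return [int(tok) for tok in tokens]
    return [wmap.get(tok, UNK_ID) for tok in tokens]


def ids_to_words(ids, inv_wmap=None):
    """Map a list of integer IDs to a blank-separated string."""
    if not inv_wmap:
        # Without an inverse word map the IDs are printed directly.
        return " ".join(str(i) for i in ids)
    return " ".join(inv_wmap.get(i, "<unk>") for i in ids)


src_ids = words_to_ids("the cat", {"the": 4, "cat": 9})
trg_text = ids_to_words([7, 8], {7: "die", 8: "Katze"})
```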
-
cam.sgnmt.utils.argmax(arr)
Get the index of the maximum entry in arr. The parameter can be a dictionary.
Parameters: arr (list,array,dict) – Set of numerical values
Returns: Index or key of the maximum entry in arr
-
cam.sgnmt.utils.argmax_n(arr, n)
Get indices of the n maximum entries in arr. The parameter arr can be a dictionary. The returned index set is not guaranteed to be sorted.
Parameters:
- arr (list,array,dict) – Set of numerical values
- n (int) – Number of values to retrieve
Returns: List of indices or keys of the n maximum entries in arr
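A possible way to get this behaviour for both sequences and dictionaries is sketched below. argmax_sketch and argmax_n_sketch are hypothetical names, and the use of heapq is a choice made for this illustration rather than a description of the SGNMT implementation.

```python
import heapq


def argmax_sketch(arr):
    """Return the index (sequence) or key (dict) of the largest value."""
    if isinstance(arr, dict):
        return max(arr, key=arr.get)
    return max(range(len(arr)), key=lambda i: arr[i])


def argmax_n_sketch(arr, n):
    """Return indices or keys of the n largest values."""
    if isinstance(arr, dict):
        return heapq.nlargest(n, arr, key=arr.get)
    return heapq.nlargest(n, range(len(arr)), key=lambda i: arr[i])


best = argmax_sketch({"a": -1.0, "b": -0.5})
top2 = argmax_n_sketch([0.1, 0.7, 0.2, 0.5], 2)
```

Callers should not rely on the order of the returned indices, in line with the note above that the index set is not guaranteed to be sorted.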
-
cam.sgnmt.utils.common_contains(obj, key)
Checks the existence of a key or index in a mapping. Works with numpy arrays, lists, and dicts.
Parameters:
- obj (list,array,dict) – Mapping
- key (int) – Index or key of the element to retrieve
Returns: True if key in obj, otherwise False
-
cam.sgnmt.utils.common_get(obj, key, default)
Can be used to access an element via the index or key. Works with numpy arrays, lists, and dicts.
Parameters:
- obj (list,array,dict) – Mapping
- key (int) – Index or key of the element to retrieve
- default (object) – Value to return if key is not in obj
Returns: obj[key] if key in obj, otherwise default
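The idea of treating lists and dicts uniformly can be sketched as follows. contains_sketch and get_sketch are hypothetical names, and treating negative or out-of-range indices as "not contained" is an assumption of this sketch.

```python
def contains_sketch(obj, key):
    """True if key is a dict key or a valid non-negative sequence index."""
    if isinstance(obj, dict):
        return key in obj
    return 0 <= key < len(obj)


def get_sketch(obj, key, default):
    """Return obj[key] if present, otherwise default."""
    return obj[key] if contains_sketch(obj, key) else default


posterior_list = [-0.1, -2.3]
posterior_dict = {7: -0.1, 9: -2.3}
```

With these helpers, a score lookup such as get_sketch(posterior, word_id, float("-inf")) works regardless of whether the posterior is dense or sparse.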
-
cam.sgnmt.utils.common_iterable(obj)
Can be used to iterate over the key-value pairs of a mapping. Works with numpy arrays, lists, and dicts. Code taken from http://stackoverflow.com/questions/12325608/iterate-over-a-dict-or-list-in-python
-
cam.sgnmt.utils.common_viewkeys(obj)
Can be used to iterate over the keys or indices of a mapping. Works with numpy arrays, lists, and dicts. Code taken from http://stackoverflow.com/questions/12325608/iterate-over-a-dict-or-list-in-python
-
cam.sgnmt.utils.get_path(tmpl, sub=1)
Replaces the %d placeholder in tmpl with sub. If tmpl does not contain %d, return tmpl unmodified.
Parameters:
- tmpl (string) – Path, potentially with %d placeholder
- sub (int) – Substitution for %d
Returns: string. tmpl with %d replaced with sub if present
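The substitution rule is small enough to sketch directly. get_path_sketch is a hypothetical re-implementation for illustration, and the example paths are made up; the real function may perform the replacement differently (for instance with the % operator).

```python
def get_path_sketch(tmpl, sub=1):
    """Replace the %d placeholder in tmpl with sub, if there is one."""
    if "%d" in tmpl:
        return tmpl.replace("%d", str(sub))
    # Templates without a placeholder are returned unchanged.
    return tmpl


lattice_path = get_path_sketch("lats/%d.fst", 12)
fixed_path = get_path_sketch("model.npz", 12)
```

This kind of template is convenient for resources that exist once per sentence, such as lattice directories, while a single shared file can be passed as a plain path.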
-
cam.sgnmt.utils.load_fst(path)
Loads an FST from the file system using PyFST's read() method. GZipped format is also supported. The arc type must be standard or log, otherwise PyFST cannot load them.
Parameters: path (string) – Path to the FST file to load
Returns: fst. PyFST FST object or None if FST could not be read
-
cam.sgnmt.utils.load_src_wmap(path)
Loads a source side word map from the file system.
Parameters: path (string) – Path to the word map (Format: word id)
Returns: dict. Source word map (key: word, value: id)
-
cam.sgnmt.utils.load_trg_cmap(path)
Loads a character map from path. Returns None if path is empty or does not point to a file. In this case, output files are generated on the word level.
Parameters: path (string) – Path to the character map
Returns: dict. Map char -> id or None if character level output is not activated.
-
cam.sgnmt.utils.load_trg_wmap(path)
Loads a target side word map from the file system.
Parameters: path (string) – Path to the word map (Format: word id)
Returns: dict. Target word map (key: id, value: word)
-
cam.sgnmt.utils.log_sum(vals)
Defines which log summation function to use.
-
cam.sgnmt.utils.log_sum_log_semiring(vals)
Uses the logsumexp function in scipy to calculate the log of the sum of a set of log values.
Parameters: vals (set) – List or set of numerical values
-
cam.sgnmt.utils.log_sum_tropical_semiring(vals)
Approximates summation in log space with the max.
Parameters: vals (set) – List or set of numerical values
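The difference between the two summation functions can be seen in the sketch below. It uses only the standard library so that it runs without scipy; log_sum_log_semiring_sketch is therefore a hand-written equivalent of a log-sum-exp, not the scipy call the module itself relies on, and both function names are hypothetical.

```python
import math


def log_sum_log_semiring_sketch(vals):
    """Exact log(sum(exp(v))) computed in a numerically stable way."""
    vals = list(vals)
    largest = max(vals)
    if largest == float("-inf"):
        return largest  # all inputs represent probability zero
    return largest + math.log(sum(math.exp(v - largest) for v in vals))


def log_sum_tropical_semiring_sketch(vals):
    """Approximate the log of the sum by the largest log value."""
    return max(vals)


log_probs = [math.log(0.25), math.log(0.25)]
exact = log_sum_log_semiring_sketch(log_probs)
approx = log_sum_tropical_semiring_sketch(log_probs)
```

Here the exact variant recovers log(0.5), whereas the tropical variant returns log(0.25): the max is cheaper but underestimates the sum whenever several values carry comparable mass.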
-
cam.sgnmt.utils.src_wmap = {}
Source language word map (word -> id)
-
cam.sgnmt.utils.switch_to_blocks_indexing()
Calling this method overrides the global definitions of the reserved word ids GO_ID, EOS_ID, and UNK_ID with the Blocks indexing scheme. This scheme is used in the Blocks NMT implementation and its SGNMT extensions.
-
cam.sgnmt.utils.switch_to_t2t_indexing()
Calling this method overrides the global definitions of the reserved word ids GO_ID, EOS_ID, and UNK_ID with the tensor2tensor indexing scheme. This scheme is used in all t2t models.
-
cam.sgnmt.utils.switch_to_tf_indexing()
Calling this method overrides the global definitions of the reserved word ids GO_ID, EOS_ID, and UNK_ID with the TensorFlow indexing scheme. This scheme is used in the TensorFlow NMT and RNNLM models.
-
cam.sgnmt.utils.trg_cmap = None
Target language character map (char -> id)
-
cam.sgnmt.utils.trg_wmap = {}
Target language word map (id -> word)