cam.sgnmt package

Subpackages

Submodules

cam.sgnmt.decode module

cam.sgnmt.decode_utils module

This module is the bridge between the command line configuration of the decode.py script and the SGNMT software architecture consisting of decoders, predictors, and output handlers. A common use case is to call create_decoder() first, which reads the SGNMT configuration and loads the right predictors and decoding strategy with the right arguments. The actual decoding is implemented in do_decode(). See decode.py to learn how to use this module.

cam.sgnmt.decode_utils.add_heuristics(decoder)[source]

Adds all enabled heuristics to the decoder. This is relevant for heuristic-based search strategies like A*. This method relies on the global args variable and reads out args.heuristics.

Parameters:decoder (Decoder) – Decoding strategy, see create_decoder(). This method will add heuristics to this instance with add_heuristic()
cam.sgnmt.decode_utils.add_predictors(decoder)[source]

Adds all enabled predictors to the decoder. This function makes heavy use of the global args, which contains the SGNMT configuration. In particular, it reads out args.predictors and adds appropriate instances to decoder. TODO: Refactor this method, as it is far too long.

Parameters:decoder (Decoder) – Decoding strategy, see create_decoder(). This method will add predictors to this instance with add_predictor()
cam.sgnmt.decode_utils.args = None

This variable is set to the global configuration when create_decoder() is called.

cam.sgnmt.decode_utils.construct_nmt_vanilla_decoder()[source]

Creates the vanilla NMT decoder which bypasses the predictor framework. It uses the template method get_nmt_vanilla_decoder for uniform access to the Blocks or TensorFlow frameworks.

Returns:NMT vanilla decoder using all specified NMT models, or None if an error occurred.
cam.sgnmt.decode_utils.create_decoder(new_args)[source]

Creates the Decoder instance. This specifies the search strategy used to traverse the space spanned by the predictors. This method relies on the global args variable.

TODO: Refactor to avoid long argument lists

Parameters:new_args – Command line arguments
Returns:Decoder. Instance of the search strategy
cam.sgnmt.decode_utils.create_output_handlers()[source]

Creates the output handlers defined in the io module. These handlers create output files in different formats from the decoding results.

Parameters:args – Global command line arguments.
Returns:list. List of output handlers according to --outputs
cam.sgnmt.decode_utils.do_decode(decoder, output_handlers, src_sentences)[source]

This method contains the main decoding loop. It iterates through src_sentences and applies decoder.decode() to each of them. At the end, it calls the output handlers to create output files.

Parameters:
  • decoder (Decoder) – Current decoder instance
  • output_handlers (list) – List of output handlers, see create_output_handlers()
  • src_sentences (list) – A list of strings. The strings are the source sentences with word indices to translate (e.g. ‘1 123 432 2’)
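
The loop described above can be sketched roughly as follows. This is a minimal illustration, not SGNMT's code: EchoDecoder, CollectingHandler, and decode_all are hypothetical stand-ins, and it is our assumption that decode() receives a list of integer word IDs and returns a list of hypotheses.

```python
class EchoDecoder(object):
    """Hypothetical stand-in for a Decoder."""

    def decode(self, src_sentence):
        # A real decoder would search the space spanned by the predictors.
        # Here the source sentence is returned as the only hypothesis.
        return [list(src_sentence)]


class CollectingHandler(object):
    """Hypothetical stand-in for an output handler."""

    def __init__(self):
        self.written = None

    def write_hypos(self, all_hypos):
        # A real handler would write files; this one only keeps a reference.
        self.written = all_hypos


def decode_all(decoder, output_handlers, src_sentences):
    """Decode every sentence, then pass all results to the handlers."""
    all_hypos = []
    for line in src_sentences:
        # Source sentences are strings of word indices, e.g. '1 123 432 2'.
        src = [int(tok) for tok in line.split()]
        all_hypos.append(decoder.decode(src))
    # Output handlers are called once, after the whole loop has finished.
    for handler in output_handlers:
        handler.write_hypos(all_hypos)
    return all_hypos


handler = CollectingHandler()
result = decode_all(EchoDecoder(), [handler], ["1 123 432 2", "1 7 2"])
```

Calling the handlers only after the loop matches the description above, which says the output files are created at the end.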

cam.sgnmt.output module

This module contains the output handlers. These handlers create output files from the n-best lists generated by the Decoder. They can be activated via --outputs.

This module depends on OpenFST to write FST files in binary format. To enable Python support in OpenFST, use a recent version (>=1.5.4) and compile with --enable_python. Further information can be found here:

http://www.openfst.org/twiki/bin/view/FST/PythonExtension

class cam.sgnmt.output.AlignmentOutputHandler[source]

Bases: object

Interface for output handlers for alignments.

write_alignments(alignments)[source]

This method writes output files to the file system. The configuration parameters such as output paths should already have been provided via constructor arguments.

Parameters:alignments (list) – list of alignment matrices
Raises:IOError. If something goes wrong while writing to the disk
class cam.sgnmt.output.CSVAlignmentOutputHandler(path)[source]

Bases: cam.sgnmt.output.AlignmentOutputHandler

Creates a directory with CSV files which store the alignment matrices.

write_alignments(alignments)[source]

Writes CSV files for each alignment.

Parameters:

alignments (list) – list of alignments

Raises:
  • OSError. If the directory could not be created
  • IOError. If something goes wrong while writing to the disk
class cam.sgnmt.output.FSTOutputHandler(path, start_sen_id, unk_id)[source]

Bases: cam.sgnmt.output.OutputHandler

This output handler creates FSTs with sparse tuple arcs from the n-best lists from the decoder. The predictor scores are kept separately in the sparse tuples. Note that this means that the effect of the parameter --combination_scheme might not be visible in the lattices because predictor scores are not combined. The order in the sparse tuples corresponds to the order of the predictors in the --predictors argument.

Note that the created FSTs use another ID for UNK to avoid confusion with the epsilon symbol used by OpenFST.

write_hypos(all_hypos)[source]

Writes FST files with sparse tuples for each sentence in all_hypos. The created lattices are not optimized in any way: We create a distinct path for each entry in all_hypos. We advise you to determinize/minimize them if you are planning to use them for further processing.

Parameters:

all_hypos (list) – list of nbest lists of hypotheses

Raises:
  • OSError. If the directory could not be created
  • IOError. If something goes wrong while writing to the disk
write_weight(score_breakdown)[source]

Helper method to create the weight string

class cam.sgnmt.output.NBestOutputHandler(path, predictor_names, start_sen_id, trg_wmap)[source]

Bases: cam.sgnmt.output.OutputHandler

Produces an n-best file in Moses format. The third part of each entry is used to store the separated unnormalized predictor scores. Note that the sentence IDs are shifted: Moses n-best files start with the index 0, but in SGNMT and HiFST we usually refer to the first sentence with 1 (e.g. in lattice directories or --range).

write_hypos(all_hypos)[source]

Writes the hypotheses in all_hypos to path
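
As a rough illustration of the entry layout and the ID shift described above, the following sketch formats a single n-best entry. The function name format_nbest_line, the "name= score" feature layout, and the number formatting are assumptions made for this example. Only the four "|||"-separated fields and the shift from 1-based to 0-based sentence IDs follow from the text.

```python
def format_nbest_line(sgnmt_sen_id, words, predictor_names,
                      predictor_scores, total_score):
    """Format one n-best entry (hypothetical helper, not part of SGNMT)."""
    # SGNMT counts sentences from 1, Moses n-best files count from 0.
    moses_id = sgnmt_sen_id - 1
    # Third field: one separate, unnormalized score per predictor.
    features = " ".join(
        "%s= %f" % (name, score)
        for name, score in zip(predictor_names, predictor_scores))
    return "%d ||| %s ||| %s ||| %f" % (
        moses_id, " ".join(words), features, total_score)


line = format_nbest_line(1, ["hello", "world"], ["nmt", "fst"],
                         [-1.5, -0.25], -1.75)
# line == "0 ||| hello world ||| nmt= -1.500000 fst= -0.250000 ||| -1.750000"
```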

class cam.sgnmt.output.NPYAlignmentOutputHandler(path)[source]

Bases: cam.sgnmt.output.AlignmentOutputHandler

Creates a directory with alignment matrices in NumPy format (npy).

write_alignments(alignments)[source]

Writes NPY files for each alignment.

Parameters:

alignments (list) – list of alignments

Raises:
  • OSError. If the directory could not be created
  • IOError. If something goes wrong while writing to the disk
class cam.sgnmt.output.OutputHandler[source]

Bases: object

Interface for output handlers.

write_hypos(all_hypos)[source]

This method writes output files to the file system. The configuration parameters such as output paths should already have been provided via constructor arguments.

Parameters:all_hypos (list) – list of nbest lists of hypotheses
Raises:IOError. If something goes wrong while writing to the disk
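
A custom handler following this interface might look like the sketch below. The OutputHandler class here is a local stand-in that only mirrors the documented method name. FirstBestIdsHandler is hypothetical, and we assume that each hypothesis is a list of integer word IDs and that each n-best list is sorted best-first.

```python
import os
import tempfile


class OutputHandler(object):
    """Local stand-in mirroring the documented interface."""

    def write_hypos(self, all_hypos):
        raise NotImplementedError


class FirstBestIdsHandler(OutputHandler):
    """Hypothetical handler: writes the first entry of each n-best list."""

    def __init__(self, path):
        # Output paths are supplied via the constructor, as described above.
        self.path = path

    def write_hypos(self, all_hypos):
        with open(self.path, "w") as f:
            for nbest in all_hypos:
                # Write an empty line if decoding produced no hypothesis.
                best = nbest[0] if nbest else []
                f.write(" ".join(str(w) for w in best) + "\n")


path = os.path.join(tempfile.mkdtemp(), "out.ids")
FirstBestIdsHandler(path).write_hypos([[[4, 5, 2], [4, 6, 2]], []])
with open(path) as f:
    written = f.read()
```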
class cam.sgnmt.output.StandardFSTOutputHandler(path, start_sen_id, unk_id)[source]

Bases: cam.sgnmt.output.OutputHandler

This output handler creates FSTs with standard arcs. In contrast to FSTOutputHandler, predictor scores are combined using --combination_scheme.

Note that the created FSTs use another ID for UNK to avoid confusion with the epsilon symbol used by OpenFST.

write_hypos(all_hypos)[source]

Writes FST files with standard arcs for each sentence in all_hypos. The created lattices are not optimized in any way: We create a distinct path for each entry in all_hypos. We advise you to determinize/minimize them if you are planning to use them for further processing.

Parameters:

all_hypos (list) – list of nbest lists of hypotheses

Raises:
  • OSError. If the directory could not be created
  • IOError. If something goes wrong while writing to the disk
class cam.sgnmt.output.TextAlignmentOutputHandler(path)[source]

Bases: cam.sgnmt.output.AlignmentOutputHandler

Creates a single text alignment file (Pharaoh format).

write_alignments(alignments)[source]

Writes an alignment file in standard text format.

Parameters:alignments (list) – list of alignments
Raises:IOError. If something goes wrong while writing to the disk
class cam.sgnmt.output.TextOutputHandler(path, trg_wmap)[source]

Bases: cam.sgnmt.output.OutputHandler

Writes the first best hypotheses to a plain text file

close_file()[source]
open_file()[source]
write_empty_line()[source]
write_hypos(all_hypos)[source]

Writes the hypotheses in all_hypos to path

cam.sgnmt.ui module

This module handles configuration and user interface when using Blocks. YAML and ArgumentParser are used for parsing config files and command line arguments.

TODO: Remove Blocks dependency

cam.sgnmt.ui.get_args()[source]

Get the arguments for the current SGNMT run from both command line arguments and configuration files. This method contains all available SGNMT options, i.e. configuration is not encapsulated in, for example, the predictors. Additionally, we add Blocks NMT model options as parameters to specify how the loaded NMT model was trained. These are defined in machine_translation.configurations.

Returns:object. Arguments object like for ArgumentParser
cam.sgnmt.ui.get_blocks_align_parser()[source]

Get the parser object for NMT alignment configuration.

cam.sgnmt.ui.get_blocks_batch_decode_parser()[source]

Get the parser object for NMT batch decoding.

cam.sgnmt.ui.get_blocks_train_parser()[source]

Get the parser object for NMT training configuration.

cam.sgnmt.ui.get_parser()[source]

Get the parser object which is used to build the configuration argument args. This is a helper method for get_args(). TODO: Decentralize configuration

Returns:ArgumentParser. The pre-filled parser object
cam.sgnmt.ui.parse_args(parser)[source]

http://codereview.stackexchange.com/questions/79008/parse-a-config-file-and-add-to-command-line-arguments-using-argparse-in-python

cam.sgnmt.ui.parse_param_string(param)[source]

Parses a parameter string such as ‘param1=x,param2=y’. Loads config files if specified in the string. If param points to a file, load this file with YAML.
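
The key=value part of this format can be sketched as follows. This is a simplified illustration under our own assumptions, not the SGNMT implementation: parse_param_pairs is a hypothetical name, values are kept as strings, and the YAML file loading mentioned above is deliberately left out.

```python
def parse_param_pairs(param):
    """Parse 'key1=x,key2=y' into a dict (hypothetical helper)."""
    result = {}
    for pair in param.strip().split(","):
        if not pair:
            # Tolerate empty segments such as a trailing comma.
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError("Expected key=value, got %r" % pair)
        result[key.strip()] = value.strip()
    return result


params = parse_param_pairs("param1=x,param2=y")
# params == {"param1": "x", "param2": "y"}
```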

cam.sgnmt.ui.str2bool(v)[source]

For making the ArgumentParser understand boolean values
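
A converter of this kind is needed because passing type=bool to ArgumentParser treats every non-empty string, including "false", as True. The sketch below shows one common way to write such a converter. The name to_bool and the accepted spellings are assumptions for this example and may differ from what str2bool accepts.

```python
import argparse


def to_bool(v):
    """Convert a command line string to a boolean (illustrative only)."""
    value = v.strip().lower()
    if value in ("yes", "true", "t", "1"):
        return True
    if value in ("no", "false", "f", "0"):
        return False
    raise argparse.ArgumentTypeError("Boolean value expected, got %r" % v)


parser = argparse.ArgumentParser()
# The converter is plugged in through the standard `type` argument.
parser.add_argument("--flag", type=to_bool, default=False)
on = parser.parse_args(["--flag", "true"]).flag
off = parser.parse_args(["--flag", "0"]).flag
```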

cam.sgnmt.ui.validate_args(args)[source]

Some rudimentary sanity checks for configuration options. This method directly prints help messages to the user. In case of fatal errors, it terminates using logging.fatal()

Parameters:args (object) – Configuration as returned by get_args

cam.sgnmt.utils module

This file contains common basic functionality which can be used from anywhere. This includes the definition of reserved word indices, some mathematical functions, and helper functions to deal with the small quirks Python sometimes has.

cam.sgnmt.utils.EOS_ID = 2

Reserved word ID for the end-of-sentence symbol.

cam.sgnmt.utils.GO_ID = 1

Reserved word ID for the start-of-sentence symbol.

cam.sgnmt.utils.MESSAGE_TYPE_DEFAULT = 1

Default message type for observer messages

cam.sgnmt.utils.MESSAGE_TYPE_FULL_HYPO = 3

This message type is used by the decoder when a new complete hypothesis was found. Note that this is not necessarily the best hypo so far; it is just the latest hypo found which ends with EOS.

cam.sgnmt.utils.MESSAGE_TYPE_POSTERIOR = 2

This message is sent by the decoder after apply_predictors was called. The message includes the new posterior distribution and the score breakdown.

cam.sgnmt.utils.NOTAPPLICABLE_ID = 3

Reserved word ID which is currently not used.

class cam.sgnmt.utils.Observable[source]

Bases: object

Super class for classes which can be observed (GoF observer design pattern).

add_observer(observer)[source]

Add a new observer which is notified when this class fires a notification

Parameters:observer (Observer) – the observer class to add
notify_observers(message, message_type=1)[source]

Sends the given message to all registered observers.

Parameters:
  • message (object) – The message to send
  • message_type (int) – The type of the message. One of the MESSAGE_TYPE_* variables
class cam.sgnmt.utils.Observer[source]

Bases: object

Super class for classes which observe (GoF design pattern) other classes.

notify(message, message_type=1)[source]

Get a notification from an observed object.

Parameters:
  • message (object) – the message sent by observed object
  • message_type (int) – The type of the message. One of the MESSAGE_TYPE_* variables
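
The interplay of Observable, Observer, and the MESSAGE_TYPE_* constants can be sketched as below. The two base classes are local stand-ins that only mirror the documented method signatures; their bodies, the observers attribute, and the FullHypoCounter class are assumptions for illustration.

```python
MESSAGE_TYPE_DEFAULT = 1
MESSAGE_TYPE_POSTERIOR = 2
MESSAGE_TYPE_FULL_HYPO = 3


class Observer(object):
    """Stand-in for the documented Observer interface."""

    def notify(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        raise NotImplementedError


class Observable(object):
    """Stand-in for the documented Observable class."""

    def __init__(self):
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def notify_observers(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        # Every registered observer receives every message.
        for observer in self.observers:
            observer.notify(message, message_type)


class FullHypoCounter(Observer):
    """Hypothetical observer that counts complete hypotheses."""

    def __init__(self):
        self.count = 0

    def notify(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        # Observers filter on the message type themselves.
        if message_type == MESSAGE_TYPE_FULL_HYPO:
            self.count += 1


subject = Observable()
counter = FullHypoCounter()
subject.add_observer(counter)
subject.notify_observers("posterior", MESSAGE_TYPE_POSTERIOR)
subject.notify_observers("hypo", MESSAGE_TYPE_FULL_HYPO)
```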
cam.sgnmt.utils.TMP_FILENAME = '/tmp/sgnmt.13099.fst'

Temporary file name to use if an FST file is zipped.

cam.sgnmt.utils.UNK_ID = 0

Reserved word ID for the unknown word (UNK).

cam.sgnmt.utils.apply_src_wmap(seq, wmap=None)[source]

Converts a string to a sequence of integers by applying the mapping wmap. If wmap is empty, parse seq as a string of blank-separated integers.

Parameters:
  • seq (list) – List of strings to convert
  • wmap (dict) – word map to apply (key: word, value: ID). If empty use utils.src_wmap
Returns:

list. List of integers

cam.sgnmt.utils.apply_trg_wmap(seq, inv_wmap=None)[source]

Converts a sequence of integers to a string by applying the mapping inv_wmap. If inv_wmap is empty, output the integers directly.

Parameters:
  • seq (list) – List of integers to convert
  • inv_wmap (dict) – word map to apply (key: id, value: word). If empty use utils.trg_wmap
Returns:

string. Mapped seq as single (blank separated) string
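
The two mapping directions can be sketched as follows. This is a simplified illustration with hypothetical function names. The fallback to the module-level maps is omitted, and the handling of unknown entries (UNK_ID for words, the placeholder "<unk>" for IDs) is our assumption, not documented behavior.

```python
UNK_ID = 0  # Reserved ID for unknown words, as defined in this module.


def words_to_ids(words, wmap):
    """Map source words to IDs (hypothetical helper)."""
    if not wmap:
        # Without a word map the tokens are already integers.
        return [int(w) for w in words]
    return [wmap.get(w, UNK_ID) for w in words]


def ids_to_string(ids, inv_wmap):
    """Map target IDs to a blank-separated string (hypothetical helper)."""
    if not inv_wmap:
        # Without a word map the integers are written out directly.
        return " ".join(str(i) for i in ids)
    return " ".join(inv_wmap.get(i, "<unk>") for i in ids)


src_ids = words_to_ids(["the", "cat", "sat"], {"the": 4, "cat": 5})
trg_str = ids_to_string([4, 5, 2], {4: "the", 5: "cat", 2: "</s>"})
```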

cam.sgnmt.utils.argmax(arr)[source]

Get the index of the maximum entry in arr. The parameter can be a dictionary.

Parameters:arr (list,array,dict) – Set of numerical values
Returns:Index or key of the maximum entry in arr
cam.sgnmt.utils.argmax_n(arr, n)[source]

Get indices of the n maximum entries in arr. The parameter arr can be a dictionary. The returned index set is not guaranteed to be sorted.

Parameters:
  • arr (list,array,dict) – Set of numerical values
  • n (int) – Number of values to retrieve
Returns:

List of indices or keys of the n maximum entries in arr
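
One way to support both sequences and dictionaries is sketched below. The helper names best_key and best_n_keys are hypothetical, and the code is only meant to illustrate the behavior described above, not to reproduce the SGNMT implementation.

```python
import heapq


def best_key(arr):
    """Index (for sequences) or key (for dicts) of the maximum entry."""
    if isinstance(arr, dict):
        return max(arr, key=arr.get)
    return max(range(len(arr)), key=lambda i: arr[i])


def best_n_keys(arr, n):
    """Indices or keys of the n largest entries."""
    if isinstance(arr, dict):
        return heapq.nlargest(n, arr, key=arr.get)
    return heapq.nlargest(n, range(len(arr)), key=lambda i: arr[i])


top = best_key([0.1, 0.7, 0.2])
top_word = best_key({"a": -1.0, "b": -0.5})
top_two = best_n_keys([5, 1, 9, 7], 2)
```

Since the documented function does not guarantee any ordering of the result, callers should compare the result as a set or sort it explicitly.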

cam.sgnmt.utils.common_contains(obj, key)[source]

Checks the existence of a key or index in a mapping. Works with numpy arrays, lists, and dicts.

Parameters:
  • obj (list,array,dict) – Mapping
  • key (int) – Index or key of the element to retrieve
Returns:

True if key in obj, otherwise False

cam.sgnmt.utils.common_get(obj, key, default)[source]

Can be used to access an element via the index or key. Works with numpy arrays, lists, and dicts.

Parameters:
  • obj (list,array,dict) – Mapping
  • key (int) – Index or key of the element to retrieve
  • default (object) – Value to return if key is not in obj
Returns:

obj[key] if key in obj, otherwise default
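
The idea behind common_contains and common_get is to treat list indices and dictionary keys uniformly. A minimal sketch with hypothetical names is shown below. Treating negative indices as absent is an assumption of this example.

```python
def contains(obj, key):
    """True if key is a valid key (dict) or index (sequence) of obj."""
    if isinstance(obj, dict):
        return key in obj
    # For lists and arrays, a key is an index within bounds.
    return 0 <= key < len(obj)


def get_or_default(obj, key, default):
    """Return obj[key] if present, otherwise default."""
    return obj[key] if contains(obj, key) else default


from_list = get_or_default([0.5, 0.3], 1, 0.0)
from_dict = get_or_default({7: 0.9}, 3, 0.0)
```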

cam.sgnmt.utils.common_iterable(obj)[source]

Can be used to iterate over the key-value pairs of a mapping. Works with numpy arrays, lists, and dicts. Code taken from http://stackoverflow.com/questions/12325608/iterate-over-a-dict-or-list-in-python

cam.sgnmt.utils.common_viewkeys(obj)[source]

Can be used to iterate over the keys or indices of a mapping. Works with numpy arrays, lists, and dicts. Code taken from http://stackoverflow.com/questions/12325608/iterate-over-a-dict-or-list-in-python

cam.sgnmt.utils.get_path(tmpl, sub=1)[source]

Replaces the %d placeholder in tmpl with sub. If tmpl does not contain %d, return tmpl unmodified.

Parameters:
  • tmpl (string) – Path, potentially with %d placeholder
  • sub (int) – Substitution for %d
Returns:

string. tmpl with %d replaced with sub if present
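
The substitution can be sketched as follows. The name fill_path is hypothetical, and using str.replace rather than the % operator is a choice made for this example so that other percent signs in the path are left alone.

```python
def fill_path(tmpl, sub=1):
    """Replace the %d placeholder in tmpl with sub, if present."""
    if "%d" in tmpl:
        return tmpl.replace("%d", str(sub))
    return tmpl


lattice_path = fill_path("lats/%d.fst", 3)   # "lats/3.fst"
fixed_path = fill_path("model.npz", 3)       # unchanged: "model.npz"
```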

cam.sgnmt.utils.load_fst(path)[source]

Loads an FST from the file system using PyFST's read() method. GZipped format is also supported. The arc type must be standard or log, otherwise PyFST cannot load them.

Parameters:path (string) – Path to the FST file to load
Returns:fst. PyFST FST object or None if FST could not be read
cam.sgnmt.utils.load_src_wmap(path)[source]

Loads a source side word map from the file system.

Parameters:path (string) – Path to the word map (Format: word id)
Returns:dict. Source word map (key: word, value: id)
cam.sgnmt.utils.load_trg_cmap(path)[source]

Loads a character map from path. Returns None if path is empty or does not point to a file. In this case, output files are generated on the word level.

Parameters:path (string) – Path to the character map
Returns:dict. Map char -> id or None if character level output is not activated.
cam.sgnmt.utils.load_trg_wmap(path)[source]

Loads a target side word map from the file system.

Parameters:path (string) – Path to the word map (Format: word id)
Returns:dict. Target word map (key: id, value: word)
cam.sgnmt.utils.log_sum(vals)

Defines which log summation function to use.

cam.sgnmt.utils.log_sum_log_semiring(vals)[source]

Uses the logsumexp function in scipy to calculate the log of the sum of a set of log values.

Parameters:vals (set) – List or set of numerical values
cam.sgnmt.utils.log_sum_tropical_semiring(vals)[source]

Approximates summation in log space with the max.

Parameters:vals (set) – List or set of numerical values
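
The difference between the two semirings can be sketched in plain Python. The documented log semiring function relies on scipy's logsumexp. The version below uses only the math module so that it is self-contained, and the function names are hypothetical.

```python
import math


def log_sum_log(vals):
    """Exact log(sum(exp(v))) computed in a numerically stable way."""
    vals = list(vals)
    m = max(vals)
    if m == float("-inf"):
        # All probabilities are zero.
        return m
    # Subtracting the maximum avoids overflow in exp().
    return m + math.log(sum(math.exp(v - m) for v in vals))


def log_sum_tropical(vals):
    """Tropical semiring: approximate the sum by its largest term."""
    return max(vals)


log_probs = [math.log(0.25), math.log(0.25)]
exact = log_sum_log(log_probs)        # approximately log(0.5)
approx = log_sum_tropical(log_probs)  # log(0.25)
```

The tropical variant never exceeds the exact log sum, so it underestimates the total probability mass when several terms are of similar size.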
cam.sgnmt.utils.oov_to_unk(seq, vocab_size, unk_idx=None)[source]
cam.sgnmt.utils.src_wmap = {}

Source language word map (word -> id)

cam.sgnmt.utils.switch_to_blocks_indexing()[source]

Calling this method overrides the global definitions of the reserved word ids GO_ID, EOS_ID, and UNK_ID with the Blocks indexing scheme. This scheme is used in the Blocks NMT implementation and its SGNMT extensions.

cam.sgnmt.utils.switch_to_t2t_indexing()[source]

Calling this method overrides the global definitions of the reserved word ids GO_ID, EOS_ID, and UNK_ID with the tensor2tensor indexing scheme. This scheme is used in all t2t models.

cam.sgnmt.utils.switch_to_tf_indexing()[source]

Calling this method overrides the global definitions of the reserved word ids GO_ID, EOS_ID, and UNK_ID with the TensorFlow indexing scheme. This scheme is used in the TensorFlow NMT and RNNLM models.

cam.sgnmt.utils.trg_cmap = None

Target language character map (char -> id)

cam.sgnmt.utils.trg_wmap = {}

Target language word map (id -> word)

cam.sgnmt.utils.w2f(fstweight)[source]

Converts an arc weight to float

Module contents