cam.sgnmt package

Subpackages

Submodules

cam.sgnmt.decode module

cam.sgnmt.decode_utils module

This module is the bridge between the command line configuration of the decode.py script and the SGNMT software architecture consisting of decoders, predictors, and output handlers. A common use case is to call create_decoder() first, which reads the SGNMT configuration and loads the right predictors and decoding strategy with the right arguments. The actual decoding is implemented in do_decode(). See decode.py to learn how to use this module.

cam.sgnmt.decode_utils.add_heuristics(decoder)[source]

Adds all enabled heuristics to the decoder. This is relevant for heuristic-based search strategies like A*. This method relies on the global args variable and reads out args.heuristics.

Parameters:decoder (Decoder) – Decoding strategy, see create_decoder(). This method will add heuristics to this instance with add_heuristic()
cam.sgnmt.decode_utils.add_predictors(decoder)[source]

Adds all enabled predictors to the decoder. This function makes heavy use of the global args, which contains the SGNMT configuration. In particular, it reads out args.predictors and adds appropriate instances to decoder. TODO: Refactor this method, as it is far too long.

Parameters:decoder (Decoder) – Decoding strategy, see create_decoder(). This method will add predictors to this instance with add_predictor()
cam.sgnmt.decode_utils.args = None

This variable is set to the global configuration when base_init() is called.

cam.sgnmt.decode_utils.base_init(new_args)[source]

This function should be called before accessing any other function in this module. It initializes the args variable, on which all the create_* factory functions rely as their configuration object, and it sets up global function pointers and variables for basic things like the indexing scheme, logging verbosity, etc.

Parameters:new_args – Configuration object from the argument parser.
cam.sgnmt.decode_utils.create_decoder()[source]

Creates the Decoder instance. This specifies the search strategy used to traverse the space spanned by the predictors. This method relies on the global args variable.

TODO: Refactor to avoid long argument lists

Returns:Decoder. Instance of the search strategy
cam.sgnmt.decode_utils.create_output_handlers()[source]

Creates the output handlers defined in the io module. These handlers create output files in different formats from the decoding results.

Parameters:args – Global command line arguments.
Returns:list. List of output handlers according to --outputs
cam.sgnmt.decode_utils.do_decode(decoder, output_handlers, src_sentences)[source]

This method contains the main decoding loop. It iterates through src_sentences and applies decoder.decode() to each of them. At the end, it calls the output handlers to create output files.

Parameters:
  • decoder (Decoder) – Current decoder instance
  • output_handlers (list) – List of output handlers, see create_output_handlers()
  • src_sentences (list) – A list of strings. The strings are the source sentences with word indices to translate (e.g. ‘1 123 432 2’)
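
The loop described for do_decode() can be sketched as follows. This is a minimal illustration under stated assumptions, not the SGNMT implementation: EchoDecoder, CollectingHandler, and decode_all are hypothetical stand-ins, and the direct conversion of the input strings to integers is assumed for brevity.

```python
class EchoDecoder:
    """Hypothetical stand-in for a Decoder: returns the input as its only hypothesis."""

    def decode(self, src_sentence):
        # A real decoding strategy would search the predictor space here.
        return [list(src_sentence)]


class CollectingHandler:
    """Hypothetical stand-in for an output handler: remembers what it was given."""

    def __init__(self):
        self.written = None

    def write_hypos(self, all_hypos, sen_indices=None):
        self.written = (all_hypos, sen_indices)


def decode_all(decoder, output_handlers, src_sentences):
    """Decode every sentence, then hand all results to the output handlers."""
    all_hypos, sen_indices = [], []
    for idx, line in enumerate(src_sentences):
        src = [int(tok) for tok in line.split()]  # assumed: indexed input
        all_hypos.append(decoder.decode(src))
        sen_indices.append(idx)
    # Output files are only created once decoding has finished.
    for handler in output_handlers:
        handler.write_hypos(all_hypos, sen_indices)
    return all_hypos


handler = CollectingHandler()
result = decode_all(EchoDecoder(), [handler], ["1 123 432 2", "1 7 2"])
```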
cam.sgnmt.decode_utils.get_sentence_indices(range_param, src_sentences)[source]

Helper method for do_decode() which returns the indices of the sentences to decode.

Parameters:
  • range_param (string) – --range parameter from config
  • src_sentences (list) – A list of strings. The strings are the source sentences with word indices to translate (e.g. ‘1 123 432 2’)
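
A possible shape for such a helper is sketched below. The "from:to" syntax with 1-based, inclusive bounds is an assumption made for this example (the documentation above only says that the value comes from --range); sentence_indices is a hypothetical name.

```python
def sentence_indices(range_param, src_sentences):
    """Return 0-based sentence indices for a 1-based 'from:to' range string.

    Assumed formats: '' (all sentences), 'n' (one sentence), 'from:to'.
    """
    if not range_param:
        return list(range(len(src_sentences)))
    if ":" in range_param:
        first, last = (int(x) for x in range_param.split(":"))
    else:
        first = last = int(range_param)
    # Never run past the end of the input.
    last = min(last, len(src_sentences))
    return list(range(first - 1, last))
```

With four source sentences, sentence_indices("2:3", ...) selects the second and third sentence, i.e. the 0-based indices 1 and 2.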

cam.sgnmt.extract_scores_along_reference module

cam.sgnmt.io module

This module is responsible for converting input text to integer representations (encode()), and integer translation hypotheses back to readable text (decode()). In the default configuration, this conversion is an identity mapping: Source sentences are provided in integer representations, and output files also contain indexed sentences.
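
The identity mapping of the default configuration can be illustrated with two small functions. This is a sketch only; id_encode and id_decode are hypothetical names and not part of the module.

```python
def id_encode(src_sentence):
    """Turn a string of space-separated word indices into a list of ints."""
    return [int(tok) for tok in src_sentence.split()]


def id_decode(trg_sentence):
    """Turn a list of token IDs back into a space-separated string."""
    return " ".join(str(tok) for tok in trg_sentence)


ids = id_encode("1 123 432 2")
text = id_decode(ids)
```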

class cam.sgnmt.io.BPE(codes_path, separator='@@', remove_eow=False)[source]

Bases: object

encode(orig)[source]

Encode word based on list of BPE merge operations, which are applied consecutively

get_pairs(word)[source]

Return set of symbol pairs in a word.

word is represented as tuple of symbols (symbols being variable-length strings)
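
A minimal sketch of the pair extraction, assuming the usual BPE formulation in which adjacent symbols form the candidate merge pairs. symbol_pairs is a hypothetical name used for illustration.

```python
def symbol_pairs(word):
    """Return the set of adjacent symbol pairs in a word.

    word is a tuple of symbols, each a variable-length string.
    """
    pairs = set()
    prev = word[0]
    for sym in word[1:]:
        pairs.add((prev, sym))
        prev = sym
    return pairs


pairs = symbol_pairs(("l", "o", "w", "er"))
```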

process_line(line)[source]

segment line, dealing with leading and trailing whitespace

segment(sentence)[source]

segment single sentence (whitespace-tokenized string) with BPE encoding

segment_tokens(tokens)[source]

segment a sequence of tokens with BPE encoding

class cam.sgnmt.io.BPEAtAtDecoder[source]

Bases: cam.sgnmt.io.Decoder

Decoder for BPE mapping with @@ separator.

decode(trg_sentence)[source]
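
How a decoder for the @@ convention might work is sketched below, assuming that a trailing @@ marks a subword that continues in the next token. The function name, the word map contents, and the "<unk>" fallback are hypothetical.

```python
def decode_at_at(trg_sentence, trg_wmap, separator="@@"):
    """Map token IDs to subwords and glue '@@'-marked pieces together."""
    tokens = [trg_wmap.get(idx, "<unk>") for idx in trg_sentence]
    text = " ".join(tokens)
    # "new@@ er" becomes "newer": drop the separator and the following blank.
    return text.replace(separator + " ", "")


wmap = {4: "new@@", 5: "er", 6: "house"}  # hypothetical id -> subword map
sentence = decode_at_at([4, 5, 6], wmap)
```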
class cam.sgnmt.io.BPEDecoder[source]

Bases: cam.sgnmt.io.Decoder

Decoder for BPE mapping SGNMT style.

decode(trg_sentence)[source]
class cam.sgnmt.io.BPEEncoder(codes_path, separator='', remove_eow=False)[source]

Bases: cam.sgnmt.io.Encoder

Encoder for BPE mapping.

encode(src_sentence)[source]
class cam.sgnmt.io.CharDecoder[source]

Bases: cam.sgnmt.io.Decoder

Decoder for char mapping.

decode(trg_sentence)[source]
class cam.sgnmt.io.CharEncoder[source]

Bases: cam.sgnmt.io.Encoder

Encoder for char mapping.

encode(src_sentence)[source]
class cam.sgnmt.io.Decoder[source]

Bases: object

Super class for IO decoders.

decode(trg_sentence)[source]

Converts the target sentence represented as sequence of token IDs to a string representation.

Parameters:trg_sentence (list) – A sequence of integers (token IDs)
Returns:string.
class cam.sgnmt.io.Encoder[source]

Bases: object

Super class for IO encoders.

encode(src_sentence)[source]

Converts the source sentence in string representation to a sequence of token IDs. Depending on the configuration of this module, it applies word maps and/or subword/character segmentation on the input.

Parameters:src_sentence (string) – A single input sentence
Returns:List of integers.
class cam.sgnmt.io.IDDecoder[source]

Bases: cam.sgnmt.io.Decoder

Decoder for ID mapping.

decode(trg_sentence)[source]
class cam.sgnmt.io.IDEncoder[source]

Bases: cam.sgnmt.io.Encoder

Encoder for ID mapping.

encode(src_sentence)[source]
class cam.sgnmt.io.WordDecoder[source]

Bases: cam.sgnmt.io.Decoder

Decoder for word based mapping.

decode(trg_sentence)[source]
class cam.sgnmt.io.WordEncoder[source]

Bases: cam.sgnmt.io.Encoder

Encoder for word based mapping.

encode(src_sentence)[source]
cam.sgnmt.io.decode(trg_sentence)[source]

Converts the target sentence represented as sequence of token IDs to a string representation. This method calls decoder.decode().

Parameters:trg_sentence (list) – A sequence of integers (token IDs)
Returns:string.
cam.sgnmt.io.decoder = None

Decoder called in decode(). Initialized in initialize().

cam.sgnmt.io.encode(src_sentence)[source]

Converts the source sentence in string representation to a sequence of token IDs. Depending on the configuration of this module, it applies word maps and/or subword/character segmentation on the input. This method calls encoder.encode().

Parameters:src_sentence (string) – A single input sentence
Returns:List of integers.
cam.sgnmt.io.encoder = None

Encoder called in encode(). Initialized in initialize().

cam.sgnmt.io.initialize(args)[source]

Initializes the io module, including loading word maps and other resources needed for encoding and decoding. Subsequent calls of encode() and decode() will process input as specified in args.

Parameters:args (object) – SGNMT config
cam.sgnmt.io.load_src_wmap(path)[source]

Loads a source side word map from the file system.

Parameters:path (string) – Path to the word map (Format: word id)
Returns:dict. Source word map (key: word, value: id)
cam.sgnmt.io.load_trg_wmap(path)[source]

Loads a target side word map from the file system.

Parameters:path (string) – Path to the word map (Format: word id)
Returns:dict. Target word map (key: id, value: word)
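
The two directions of the word maps can be sketched as follows, reading "word id" lines from an in-memory file object instead of a path. read_src_wmap and read_trg_wmap are hypothetical names, the symbols in the sample data are made up, and skipping malformed lines is an assumption.

```python
import io


def read_src_wmap(lines):
    """Build a word -> id map from 'word id' lines."""
    wmap = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # assumed: silently skip malformed lines
            wmap[parts[0]] = int(parts[1])
    return wmap


def read_trg_wmap(lines):
    """Build an id -> word map from 'word id' lines."""
    wmap = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            wmap[int(parts[1])] = parts[0]
    return wmap


wmap_file = io.StringIO("<unk> 0\n<s> 1\n</s> 2\nhello 3\n")
src = read_src_wmap(wmap_file)
wmap_file.seek(0)  # rewind so the same data can be read again
trg = read_trg_wmap(wmap_file)
```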
cam.sgnmt.io.src_wmap = {}

Source language word map (word -> id)

cam.sgnmt.io.trg_wmap = {}

Target language word map (id -> word)

cam.sgnmt.output module

This module contains the output handlers. These handlers create output files from the n-best lists generated by the Decoder. They can be activated via --outputs.

This module depends on OpenFST to write FST files in binary format. To enable Python support in OpenFST, use a recent version (>=1.5.4) and compile with --enable_python. Further information can be found here:

http://www.openfst.org/twiki/bin/view/FST/PythonExtension

class cam.sgnmt.output.FSTOutputHandler(path, unk_id)[source]

Bases: cam.sgnmt.output.OutputHandler

This output handler creates FSTs with sparse tuple arcs from the n-best lists from the decoder. The predictor scores are kept separately in the sparse tuples. Note that this means that the parameter --combination_scheme might not be visible in the lattices because predictor scores are not combined. The order in the sparse tuples corresponds to the order of the predictors in the --predictors argument.

Note that the created FSTs use another ID for UNK to avoid confusion with the epsilon symbol used by OpenFST.

Creates a sparse tuple FST output handler.

Parameters:
  • path (string) – Path to the VECLAT directory to create
  • unk_id (int) – Id which should be used in the FST for UNK
write_hypos(all_hypos, sen_indices)[source]

Writes FST files with sparse tuples for each sentence in all_hypos. The created lattices are not optimized in any way: We create a distinct path for each entry in all_hypos. We advise you to determinize/minimize them if you are planning to use them for further processing.

Parameters:
  • all_hypos (list) – list of nbest lists of hypotheses
  • sen_indices (list) – List of sentence indices (0-indexed)
Raises:
  • OSError. If the directory could not be created
  • IOError. If something goes wrong while writing to the disk
write_weight(score_breakdown)[source]

Helper method to create the weight string

class cam.sgnmt.output.NBestOutputHandler(path, predictor_names)[source]

Bases: cam.sgnmt.output.OutputHandler

Produces an n-best file in Moses format. The third part of each entry is used to store the separated unnormalized predictor scores. Note that the sentence IDs are shifted: Moses n-best files start with the index 0, but in SGNMT and HiFST we usually refer to the first sentence with 1 (e.g. in lattice directories or --range).

Creates a Moses n-best list output handler.

Parameters:
  • path (string) – Path to the n-best file to write
  • predictor_names – Names of the predictors whose scores should be included in the score breakdown in the n-best list
write_hypos(all_hypos, sen_indices)[source]

Writes the hypotheses in all_hypos to path
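
For orientation, a Moses-style n-best entry consists of four fields separated by " ||| ": sentence index, hypothesis, feature scores, and total score. The sketch below formats one such entry. nbest_line is a hypothetical helper, and the exact number formatting is an assumption.

```python
def nbest_line(sen_idx, words, predictor_names, score_breakdown, total_score):
    """Format one hypothesis as a Moses-style n-best entry.

    sen_idx is 0-based, as expected in Moses n-best files.
    """
    feats = " ".join(
        "%s= %f" % (name, score)
        for name, score in zip(predictor_names, score_breakdown)
    )
    return "%d ||| %s ||| %s ||| %f" % (sen_idx, " ".join(words), feats, total_score)


# The first sentence (ID 1 in SGNMT terms) gets index 0 in the n-best file.
line = nbest_line(0, ["hello", "world"], ["nmt", "lm"], [-1.5, -0.5], -2.0)
```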

class cam.sgnmt.output.NgramOutputHandler(path, min_order, max_order)[source]

Bases: cam.sgnmt.output.OutputHandler

This output handler extracts MBR-style ngram posteriors from the hypotheses returned by the decoder. The hypothesis scores are assumed to be loglikelihoods, which we renormalize to make sure that we operate on a valid distribution. The scores produced by the output handler are probabilities of an ngram being in the translation.

Creates an ngram output handler.

Parameters:
  • path (string) – Path to the ngram directory to create
  • min_order (int) – Minimum order of extracted ngrams
  • max_order (int) – Maximum order of extracted ngrams
write_hypos(all_hypos, sen_indices)[source]

Writes ngram files for each sentence in all_hypos.

Parameters:
  • all_hypos (list) – list of nbest lists of hypotheses
  • sen_indices (list) – List of sentence indices (0-indexed)
Raises:
  • OSError. If the directory could not be created
  • IOError. If something goes wrong while writing to the disk
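
The posterior computation described for NgramOutputHandler can be sketched as follows: the hypothesis log-likelihoods are renormalized to a distribution, and the posterior of an ngram is the total probability of the hypotheses that contain it. This is an illustrative sketch; ngram_posteriors and its input layout (a list of (tokens, score) pairs) are assumptions.

```python
import math


def ngram_posteriors(hypos, min_order, max_order):
    """Return {ngram: probability that the ngram occurs in the translation}.

    hypos is a list of (token_list, log_likelihood) pairs.
    """
    # Renormalize in a numerically stable way (subtract the max score).
    max_score = max(score for _, score in hypos)
    weights = [math.exp(score - max_score) for _, score in hypos]
    norm = sum(weights)
    posteriors = {}
    for (tokens, _), weight in zip(hypos, weights):
        # Collect each ngram once per hypothesis, even if it repeats.
        ngrams = set()
        for order in range(min_order, max_order + 1):
            for start in range(len(tokens) - order + 1):
                ngrams.add(tuple(tokens[start:start + order]))
        for ngram in ngrams:
            posteriors[ngram] = posteriors.get(ngram, 0.0) + weight / norm
    return posteriors


hypos = [([5, 6, 7], math.log(0.6)), ([5, 6, 8], math.log(0.2))]
post = ngram_posteriors(hypos, 1, 2)
```

In this example the two hypotheses receive probabilities 0.75 and 0.25 after renormalization, so an ngram shared by both has posterior 1.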
class cam.sgnmt.output.OutputHandler[source]

Bases: object

Interface for output handlers.

Empty constructor

write_hypos(all_hypos, sen_indices=None)[source]

This method writes output files to the file system. The configuration parameters such as output paths should already have been provided via constructor arguments.

Parameters:
  • all_hypos (list) – list of nbest lists of hypotheses
  • sen_indices (list) – List of sentence indices (0-indexed)
Raises:

IOError. If something goes wrong while writing to the disk

class cam.sgnmt.output.StandardFSTOutputHandler(path, unk_id)[source]

Bases: cam.sgnmt.output.OutputHandler

This output handler creates FSTs with standard arcs. In contrast to FSTOutputHandler, predictor scores are combined using --combination_scheme.

Note that the created FSTs use another ID for UNK to avoid confusion with the epsilon symbol used by OpenFST.

Creates a standard arc FST output handler.

Parameters:
  • path (string) – Path to the fst directory to create
  • unk_id (int) – Id which should be used in the FST for UNK
write_hypos(all_hypos, sen_indices)[source]

Writes FST files with standard arcs for each sentence in all_hypos. The created lattices are not optimized in any way: We create a distinct path for each entry in all_hypos. We advise you to determinize/minimize them if you are planning to use them for further processing.

Parameters:
  • all_hypos (list) – list of nbest lists of hypotheses
  • sen_indices (list) – List of sentence indices (0-indexed)
Raises:
  • OSError. If the directory could not be created
  • IOError. If something goes wrong while writing to the disk
class cam.sgnmt.output.TextOutputHandler(path)[source]

Bases: cam.sgnmt.output.OutputHandler

Writes the first best hypotheses to a plain text file

Creates a plain text output handler to write to path

close_file()[source]
open_file()[source]
write_hypos(all_hypos, sen_indices=None)[source]

Writes the hypotheses in all_hypos to path

class cam.sgnmt.output.TimeCSVOutputHandler(path, predictor_names)[source]

Bases: cam.sgnmt.output.OutputHandler

Produces one CSV file for each sentence. The CSV files contain the predictor score breakdown for each translation prefix length.

Creates a time CSV output handler.

Parameters:
  • path (string) – Path to the n-best file to write
  • predictor_names – Names of the predictors whose scores should be included in the score breakdown in the n-best list
write_hypos(all_hypos, sen_indices)[source]

Writes CSV files for each sentence in all_hypos.

Parameters:
  • all_hypos (list) – list of nbest lists of hypotheses
  • sen_indices (list) – List of sentence indices (0-indexed)
Raises:
  • OSError. If the directory could not be created
  • IOError. If something goes wrong while writing to the disk
cam.sgnmt.output.write_fst(f, path)[source]

Writes FST f to the file system after epsilon removal, determinization, and minimization.

cam.sgnmt.tf_utils module

This file contains utility functions for TensorFlow such as session handling and checkpoint loading.

cam.sgnmt.tf_utils.create_session(checkpoint_path, n_cpu_threads=-1)[source]

Creates a MonitoredSession.

Parameters:
  • checkpoint_path (string) – Path either to checkpoint directory or directly to a checkpoint file.
  • n_cpu_threads (int) – Number of CPU threads. If negative, we assume either GPU decoding or that all CPU cores can be used.
Returns:

A TensorFlow MonitoredSession.

cam.sgnmt.tf_utils.session_config(n_cpu_threads=-1)[source]

Creates the session config with default parameters.

Parameters:n_cpu_threads (int) – Number of CPU threads. If negative, we assume either GPU decoding or that all CPU cores can be used.
Returns:A TF session config object.

cam.sgnmt.ui module

This module handles the user interface and contains subroutines for parsing and verifying config files and command line arguments.

cam.sgnmt.ui.get_args()[source]

Get the arguments for the current SGNMT run from both command line arguments and configuration files. This method contains all available SGNMT options, i.e. configuration is not encapsulated e.g. by predictors.

Returns:object. Arguments object like for ArgumentParser
cam.sgnmt.ui.get_parser()[source]

Get the parser object which is used to build the configuration argument args. This is a helper method for get_args(). TODO: Decentralize configuration.

Returns:ArgumentParser. The pre-filled parser object
cam.sgnmt.ui.parse_args(parser)[source]

http://codereview.stackexchange.com/questions/79008/parse-a-config-file-and-add-to-command-line-arguments-using-argparse-in-python

cam.sgnmt.ui.parse_param_string(param)[source]

Parses a parameter string such as ‘param1=x,param2=y’. Loads config files if specified in the string. If param points to a file, load this file with YAML.
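
The key=value part of such a parameter string can be parsed as sketched below. Loading config files with YAML is left out, and parse_pairs is a hypothetical name; keeping all values as strings is an assumption.

```python
def parse_pairs(param):
    """Parse 'param1=x,param2=y' into a dict of strings."""
    result = {}
    for pair in param.split(","):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, _, value = pair.partition("=")
        result[key.strip()] = value.strip()
    return result


params = parse_pairs("param1=x,param2=y")
```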

cam.sgnmt.ui.run_diagnostics()[source]

Check availability of external libraries.

cam.sgnmt.ui.str2bool(v)[source]

For making the ArgumentParser understand boolean values
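
A common way to write such a converter is sketched below. The accepted spellings and the --verbose option are assumptions chosen for this example, and to_bool is a hypothetical name.

```python
import argparse


def to_bool(v):
    """Convert a command line string to a boolean for use as argparse type."""
    if isinstance(v, bool):
        return v
    value = v.lower()
    if value in ("yes", "true", "t", "y", "1"):
        return True
    if value in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("Boolean value expected, got %r" % v)


parser = argparse.ArgumentParser()
parser.add_argument("--verbose", type=to_bool, default=False)  # hypothetical option
parsed = parser.parse_args(["--verbose", "true"])
```

Without such a converter, type=bool would treat every non-empty string, including "false", as True.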

cam.sgnmt.ui.validate_args(args)[source]

Some rudimentary sanity checks for configuration options. This method directly prints help messages to the user. In case of fatal errors, it terminates using logging.fatal()

Parameters:args (object) – Configuration as returned by get_args

cam.sgnmt.utils module

This file contains common basic functionality which can be used from anywhere. This includes the definition of reserved word indices, some mathematical functions, and helper functions to deal with the small quirks Python sometimes has.

cam.sgnmt.utils.EOS_ID = 2

Reserved word ID for the end-of-sentence symbol.

cam.sgnmt.utils.GO_ID = 1

Reserved word ID for the start-of-sentence symbol.

cam.sgnmt.utils.MESSAGE_TYPE_DEFAULT = 1

Default message type for observer messages

cam.sgnmt.utils.MESSAGE_TYPE_FULL_HYPO = 3

This message type is used by the decoder when a new complete hypothesis was found. Note that this is not necessarily the best hypo so far, it is just the latest hypo found which ends with EOS.

cam.sgnmt.utils.MESSAGE_TYPE_POSTERIOR = 2

This message is sent by the decoder after apply_predictors was called. The message includes the new posterior distribution and the score breakdown.

cam.sgnmt.utils.NOTAPPLICABLE_ID = 3

Reserved word ID which is currently not used.

class cam.sgnmt.utils.Observable[source]

Bases: object

For the GoF design pattern observer

Initializes the list of observers with an empty list

add_observer(observer)[source]

Add a new observer which is notified when this class fires a notification

Parameters:observer (Observer) – the observer class to add
notify_observers(message, message_type=1)[source]

Sends the given message to all registered observers.

Parameters:
  • message (object) – The message to send
  • message_type (int) – The type of the message. One of the MESSAGE_TYPE_* variables
class cam.sgnmt.utils.Observer[source]

Bases: object

Super class for classes which observe (GoF design pattern) other classes.

notify(message, message_type=1)[source]

Get a notification from an observed object.

Parameters:
  • message (object) – the message sent by observed object
  • message_type (int) – The type of the message. One of the MESSAGE_TYPE_* variables
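
The interplay of Observable and Observer can be sketched as follows. The classes below are simplified re-implementations for illustration; HypoLogger is a hypothetical observer that only reacts to full hypotheses.

```python
MESSAGE_TYPE_DEFAULT = 1
MESSAGE_TYPE_FULL_HYPO = 3


class Observer:
    """Receives notifications from an observed object."""

    def notify(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        raise NotImplementedError


class Observable:
    """Keeps a list of observers and forwards messages to all of them."""

    def __init__(self):
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def notify_observers(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        for observer in self.observers:
            observer.notify(message, message_type)


class HypoLogger(Observer):
    """Hypothetical observer: collects complete hypotheses only."""

    def __init__(self):
        self.hypos = []

    def notify(self, message, message_type=MESSAGE_TYPE_DEFAULT):
        if message_type == MESSAGE_TYPE_FULL_HYPO:
            self.hypos.append(message)


search = Observable()
hypo_logger = HypoLogger()
search.add_observer(hypo_logger)
search.notify_observers("partial result")              # ignored by HypoLogger
search.notify_observers([1, 42, 2], MESSAGE_TYPE_FULL_HYPO)
```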
cam.sgnmt.utils.TMP_FILENAME = '/tmp/sgnmt.16734.fst'

Temporary file name to use if an FST file is zipped.

cam.sgnmt.utils.UNK_ID = 0

Reserved word ID for the unknown word (UNK).

cam.sgnmt.utils.argmax(arr)[source]

Get the index of the maximum entry in arr. The parameter can be a dictionary.

Parameters:arr (list,array,dict) – Set of numerical values
Returns:Index or key of the maximum entry in arr
cam.sgnmt.utils.argmax_n(arr, n)[source]

Get indices of the n maximum entries in arr. The parameter arr can be a dictionary. The returned index set is not guaranteed to be sorted.

Parameters:
  • arr (list,array,dict) – Set of numerical values
  • n (int) – Number of values to retrieve
Returns:

List of indices or keys of the n maximum entries in arr
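
One way to support both sequences and dictionaries is sketched below, using only the standard library. The function names are hypothetical, and numpy arrays are not exercised here.

```python
import heapq


def argmax_sketch(arr):
    """Return the index (sequence) or key (dict) of the maximum entry."""
    if isinstance(arr, dict):
        return max(arr, key=arr.get)
    return max(range(len(arr)), key=lambda i: arr[i])


def argmax_n_sketch(arr, n):
    """Return indices or keys of the n largest entries."""
    if isinstance(arr, dict):
        return heapq.nlargest(n, arr, key=arr.get)
    return heapq.nlargest(n, range(len(arr)), key=lambda i: arr[i])


best = argmax_sketch({"a": -1.0, "b": -0.5})
top2 = argmax_n_sketch([0.1, 0.7, 0.2, 0.5], 2)
```

Since callers should not rely on the order of the returned indices, comparing them as a set is the safer choice.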

cam.sgnmt.utils.common_contains(obj, key)[source]

Checks the existence of a key or index in a mapping. Works with numpy arrays, lists, and dicts.

Parameters:
  • obj (list,array,dict) – Mapping
  • key (int) – Index or key of the element to retrieve
Returns:

True if key in obj, otherwise False

cam.sgnmt.utils.common_get(obj, key, default)[source]

Can be used to access an element via the index or key. Works with numpy arrays, lists, and dicts.

Parameters:
  • obj (list,array,dict) – Mapping
  • key (int) – Index or key of the element to retrieve
  • default (object) –
Returns:

obj[key] if key in obj, otherwise default
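
The uniform access to lists and dicts can be sketched as follows. Treating negative indices as absent is an assumption of this sketch, and the function names are hypothetical.

```python
def contains_sketch(obj, key):
    """True if key is a dict key or a valid non-negative sequence index."""
    if isinstance(obj, dict):
        return key in obj
    return 0 <= key < len(obj)


def get_sketch(obj, key, default):
    """Return obj[key] if present, otherwise default."""
    if isinstance(obj, dict):
        return obj.get(key, default)
    return obj[key] if 0 <= key < len(obj) else default


scores_list = [0.5, 0.3]
scores_dict = {7: 0.5}
```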

cam.sgnmt.utils.common_iterable(obj)[source]

Can be used to iterate over the key-value pairs of a mapping. Works with numpy arrays, lists, and dicts. Code taken from http://stackoverflow.com/questions/12325608/iterate-over-a-dict-or-list-in-python

cam.sgnmt.utils.common_viewkeys(obj)[source]

Can be used to iterate over the keys or indices of a mapping. Works with numpy arrays, lists, and dicts. Code taken from http://stackoverflow.com/questions/12325608/iterate-over-a-dict-or-list-in-python

cam.sgnmt.utils.get_path(tmpl, sub=1)[source]

Replaces the %d placeholder in tmpl with sub. If tmpl does not contain %d, return tmpl unmodified.

Parameters:
  • tmpl (string) – Path, potentially with %d placeholder
  • sub (int) – Substitution for %d
Returns:

string. tmpl with %d replaced with sub if present
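
A minimal sketch of the placeholder substitution; path_from_template is a hypothetical name, and the paths are made up for illustration.

```python
def path_from_template(tmpl, sub=1):
    """Replace the %d placeholder in tmpl with sub, if there is one."""
    if "%d" in tmpl:
        return tmpl.replace("%d", str(sub))
    return tmpl


lattice_path = path_from_template("lats/%d.fst", 3)
fixed_path = path_from_template("lats/all.fst", 3)
```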

cam.sgnmt.utils.load_fst(path)[source]

Loads an FST from the file system using PyFST's read() method. GZipped format is also supported. The arc type must be standard or log, otherwise PyFST cannot load it.

Parameters:path (string) – Path to the FST file to load
Returns:fst. PyFST FST object or None if FST could not be read
cam.sgnmt.utils.log_sum(vals)

Defines which log summation function to use.

cam.sgnmt.utils.log_sum_log_semiring(vals)[source]

Uses the logsumexp function in scipy to calculate the log of the sum of a set of log values.

Parameters:vals (set) – List or set of numerical values
cam.sgnmt.utils.log_sum_tropical_semiring(vals)[source]

Approximates summation in log space with the max.

Parameters:vals (set) – List or set of numerical values
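
The difference between the two summation functions can be sketched as follows. The log semiring version below uses plain math instead of scipy to stay dependency-free; both function names are hypothetical.

```python
import math


def log_sum_log(vals):
    """Exact log(sum(exp(v))) computed in a numerically stable way."""
    vals = list(vals)
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def log_sum_tropical(vals):
    """Approximate the log of the sum with the largest log value."""
    return max(vals)


log_vals = [math.log(0.25), math.log(0.25)]
exact = log_sum_log(log_vals)        # log(0.5)
approx = log_sum_tropical(log_vals)  # log(0.25)
```

The tropical approximation is always a lower bound on the exact value and is close when one term dominates the sum.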
cam.sgnmt.utils.oov_to_unk(seq, vocab_size, unk_idx=None)[source]
cam.sgnmt.utils.split_comma(s, func=None)[source]

Splits a string at commas and removes blanks.
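
A sketch of such a helper, assuming that func (if given) is applied to each element after stripping blanks. That interpretation of func is not stated above, and split_comma_sketch is a hypothetical name.

```python
def split_comma_sketch(s, func=None):
    """Split s at commas, strip blanks, and optionally convert each part."""
    parts = [part.strip() for part in s.split(",")]
    if func is None:
        return parts
    return [func(part) for part in parts]


names = split_comma_sketch("nmt, fst ,lm")
weights = split_comma_sketch("0.5, 0.25", float)
```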

cam.sgnmt.utils.switch_to_fairseq_indexing()[source]

Calling this method overrides the global definitions of the reserved word ids GO_ID, EOS_ID, and UNK_ID with the fairseq indexing scheme.

cam.sgnmt.utils.switch_to_t2t_indexing()[source]

Calling this method overrides the global definitions of the reserved word ids GO_ID, EOS_ID, and UNK_ID with the tensor2tensor indexing scheme. This scheme is used in all t2t models.

cam.sgnmt.utils.w2f(fstweight)[source]

Converts an arc weight to float

Module contents