Tutorial: Basics (T2T)

This tutorial describes common use cases of SGNMT. We will use SGNMT with Tensor2Tensor backend in this tutorial. Please make sure that you have installed (at least) OpenFST and Tensor2Tensor as described on the Installation page. Verify your setup with:

$ python $SGNMT/decode.py --run_diagnostics
Checking Python3.... OK
Checking PyYAML.... OK
Checking TensorFlow.... OK (1.13.1)
Checking Tensor2Tensor.... OK
Checking OpenFST.... OK (openfst_python)
(...)

If you are not willing to install these dependencies, you are still encouraged to read through this tutorial since it illustrates a number of general concepts in SGNMT.

The tutorial data is available under the following DOI:

http://dx.doi.org/TODO (We are in the process of publishing the data under a DOI. In the meantime, please write fs439@cam.ac.uk to get access to this data)

Please download the archive and extract it:

$ tar xzf tutorial-ende-wmt19.tar.gz
$ cd tutorial-ende-wmt19

This tarball contains NMT model files and translation lattices for replicating some of our results in our WMT‘19 English-German submission. The directory structure is as follows:

  • ./data/: Preprocessed news-test2018 test set, word maps and BPE files.
  • ./ini/: Example SGNMT configuration files.
  • ./models/ende/: Tensor2Tensor English-German Transformer models.
  • ./supplementary/: SMT translation lattices and n-gram posteriors for MBR. The ucam-smt tutorial explains how to generate translation lattices for SGNMT.
  • ./scripts/: contains helper scripts for preprocessing.

Introduction

The two central concepts in SGNMT are predictors and decoders. Predictors are scoring modules which define scores over the target language vocabulary given the current internal predictor state, the history, the source sentence, and external side information. Predictors have a strict left-to-right semantic. They can represent translation models like NMT or language models. In a more general sense, translation lattices or n-best lists can also be represented in this framework. Predictors can be combined with other predictors to form complex decoding tasks.

Decoders are search strategies which traverse the space spanned by the predictors. SGNMT provides implementations of common search tree traversal algorithms like beam search. Since decoders differ in runtime complexity and the kind of search errors they make, different decoders are appropriate for different predictor constellations.

Preprocessing

The models in this tutorial are trained on preprocessed data (Moses tokenization + truecasing). Preprocessed files are provided in the tarball. Alternatively, the ./scripts/preprocess_??.sh scripts can be used for preprocessing:

echo 'A mental asylum, where today young people are said to meet.' | ./scripts/preprocess_en.sh
a mental asylum , where today young people are said to meet .

Note that in order to use the preprocessing scripts you need to change the paths to Moses in the scripts.

Pure NMT decoding (single)

All SGNMT options can be specified either as command line arguments or in configuration files. Options in configuration files are prioritized over command line arguments. Simple NMT decoding needs a single t2t predictor:

predictors: t2t

The t2t predictor itself requires additional options that set the vocabulary sizes, model location, and t2t problem and hparams set:

pred_src_vocab_size: 35627
pred_trg_vocab_size: 35627
t2t_model: transformer
t2t_checkpoint_dir: models/ende/base/
t2t_problem: translate_ende_wmt32k
t2t_hparams_set: transformer_base_v2

These options are summarized in ini/base.ini. Use the src_test and range options to translate the first two sentences of the test set:

$ python $SGNMT/decode.py --config_file ini/base.ini --src_test data/test18.ende.bpe.en --range 1:2
(...)
2019-06-24 11:32:47,344 INFO: Next sentence (ID: 1): 14828 5524 16148 3851 7721 22779 3739 3910 5314 4104 4905 3675 3646 5390
2019-06-24 11:32:51,181 INFO: Decoded (ID: 1): 15351 5524 16148 3851 7909 15056 3637 3674 6738 8516 3789 3674 6043 16507 3880
2019-06-24 11:32:51,181 INFO: Stats (ID: 1): score=-5.486438 num_expansions=51 time=3.84
2019-06-24 11:32:51,181 INFO: Next sentence (ID: 2): 3667 6203 13502 3637 4678 5149 7703 4499 3771 5598 3678 8087 3642
2019-06-24 11:32:54,017 INFO: Decoded (ID: 2): 3787 15786 4602 30449 3637 5478 4909 13833 5058 7373 3896 8681 3642
2019-06-24 11:32:54,018 INFO: Stats (ID: 2): score=-8.179755 num_expansions=50 time=2.84
2019-06-24 11:32:54,018 INFO: Decoding finished. Time: 6.67

Here, sentences are in indexed representation as sequence of numerical token IDs on the subword level. The apply_wmap.py script can be used to convert between indexed and plain text representations:

# Preserve subword segmentation
$ echo '15351 5524 16148 3851 7909 15056 3637 3674 6738 8516 3789 3674 6043 16507 3880' | python $SGNMT/scripts/apply_wmap.py -m data/wmap.bpe.ende -d i2s
München</w> 18 56</w> :</w> vier</w> Karten</w> ,</w> die</w> Ihren</w> Blick</w> auf</w> die</w> Stadt</w> verändern</w> werden</w>

# Generate original (word-level) segmentation
echo '15351 5524 16148 3851 7909 15056 3637 3674 6738 8516 3789 3674 6043 16507 3880' | python $SGNMT/scripts/apply_wmap.py -m data/wmap.bpe.ende -d i2s -t eow
München 1856 : vier Karten , die Ihren Blick auf die Stadt verändern werden

Processing indexed files often leads to a clearer setup, especially when many external supplementary files are involved to help decoding as in lattice or n-best rescoring or MBR-based NMT. However, for convenience, SGNMT can also operate on string representations by using ini/bpe.ini:

wmap: data/wmap.bpe.ende
bpe_codes: data/bpe.train.ende
preprocessing: bpe
postprocessing: bpe

Decode with both ini files:

$ python $SGNMT/decode.py --config_file ini/base.ini,ini/bpe.ini --src_test data/test18.ende.preprocess.en --range 1:2
(...)
2019-06-24 11:58:16,067 INFO: Next sentence (ID: 1): Munich 1856 : four maps that will change your view of the city
2019-06-24 11:58:19,923 INFO: Decoded (ID: 1): München 1856 : vier Karten , die Ihren Blick auf die Stadt verändern werden
2019-06-24 11:58:19,923 INFO: Stats (ID: 1): score=-5.486438 num_expansions=51 time=3.86
2019-06-24 11:58:19,923 INFO: Next sentence (ID: 2): a mental asylum , where today young people are said to meet .
2019-06-24 11:58:22,734 INFO: Decoded (ID: 2): ein geistiges Asyl , wo heute junge Menschen sollen sich treffen .
2019-06-24 11:58:22,734 INFO: Stats (ID: 2): score=-8.179755 num_expansions=50 time=2.81
2019-06-24 11:58:22,734 INFO: Decoding finished. Time: 6.67

SGNMT also supports a shell mode for interactive use:

$ python $SGNMT/decode.py --config_file ini/base.ini,ini/bpe.ini --input_method shell
(...)
Starting interactive mode...
PID: 9395
Display help with 'help'
Quit with ctrl-d or 'quit'
sgnmt> a mental asylum , where today young people are said to meet .
2019-06-24 12:02:01,986 INFO: Start time: 1561374121.986785
2019-06-24 12:02:01,987 INFO: Next sentence (ID: 1): a mental asylum , where today young people are said to meet .
2019-06-24 12:02:05,769 INFO: Decoded (ID: 1): ein geistiges Asyl , wo heute junge Menschen sollen sich treffen .
2019-06-24 12:02:05,769 INFO: Stats (ID: 1): score=-8.179755 num_expansions=50 time=3.78
2019-06-24 12:02:05,769 INFO: Decoding finished. Time: 3.78
sgnmt> config beam 10
Setting beam=10...
(...)
sgnmt> a mental asylum , where today young people are said to meet .
2019-06-24 12:03:01,519 INFO: Start time: 1561374181.5197287
2019-06-24 12:03:01,520 INFO: Next sentence (ID: 1): a mental asylum , where today young people are said to meet .
2019-06-24 12:03:09,498 INFO: Decoded (ID: 1): ein geistiges Asyl , wo sich heute junge Menschen treffen sollen .
2019-06-24 12:03:09,498 INFO: Stats (ID: 1): score=-7.389152 num_expansions=123 time=7.98
2019-06-24 12:03:09,498 INFO: Decoding finished. Time: 7.98

Pure NMT decoding (ensemble)

Ensemble decoding can be realized by using multiple t2t predictors and overriding some of the predictor-specific options:

predictors: t2t, t2t

t2t_checkpoint_dir: models/ende/base/
t2t_hparams_set: transformer_base_v2

t2t_checkpoint_dir2: models/ende/big/
t2t_hparams_set2: transformer_big

This config file ensembles a base and a big Transformer model:

$ python $SGNMT/decode.py --config_file ini/ensemble.ini,ini/bpe.ini --src_test data/test18.ende.preprocess.en --range 1:2
(...)
2019-06-24 12:27:56,752 INFO: Next sentence (ID: 1): Munich 1856 : four maps that will change your view of the city
2019-06-24 12:28:10,596 INFO: Decoded (ID: 1): München 1856 : vier Karten , die Ihren Blick auf die Stadt verändern werden
2019-06-24 12:28:10,596 INFO: Stats (ID: 1): score=-10.303408 num_expansions=52 time=13.84
2019-06-24 12:28:10,596 INFO: Next sentence (ID: 2): a mental asylum , where today young people are said to meet .
2019-06-24 12:28:21,765 INFO: Decoded (ID: 2): ein geistiges Asyl , wo sich heute junge Menschen treffen sollen .
2019-06-24 12:28:21,766 INFO: Stats (ID: 2): score=-14.571033 num_expansions=50 time=11.17
2019-06-24 12:28:21,766 INFO: Decoding finished. Time: 25.01

Generating output files

SGNMT supports a number of output formats to dump the explored search space to the file system:

  • text: First best translations in plain text
  • nbest: n-best list in Moses format
  • sfst: OpenFST translation lattices with standard arcs
  • fst: OpenFST translation lattices with sparse tuple arcs
  • ngram: MBR-style n-gram posteriors
  • timecsv: CSV with predictor scores over time

The outputs option activates the different output formats, and output_path sets the path to the output files:

$ python $SGNMT/decode.py --config_file ini/ensemble.ini,ini/bpe.ini \
     --outputs text,nbest,sfst,timecsv \
     --output_path ensemble-output.%s \
     --src_test data/test18.ende.preprocess.en --range 1:2

The above command generates the files ensemble-output.text that contains the translations in plain text, and ensemble-output.nbest that is an n-best list in Moses format that keeps the scores of both predictors (i.e. ensembled models) separate:

$ cat ensemble-output.nbest
0 ||| München 1856 : vier Karten , die Ihren Blick auf die Stadt verändern werden  ||| t2t= -5.486438 t2t2= -4.816970 ||| -10.303408
0 ||| München 1856 : vier Karten , die Ihre Sicht auf die Stadt verändern werden  ||| t2t= -6.245763 t2t2= -4.711694 ||| -10.957457
0 ||| München 1856 : vier Karten , die Ihren Blick auf die Stadt verändern .  ||| t2t= -6.356879 t2t2= -6.373269 ||| -12.730148
1 ||| ein geistiges Asyl , wo sich heute junge Menschen treffen sollen .  ||| t2t= -7.389152 t2t2= -7.181882 ||| -14.571033
1 ||| ein geistiges Asyl , wo sich heute junge Leute treffen sollen .  ||| t2t= -8.740399 t2t2= -7.823516 ||| -16.563915
1 ||| ein geistiges Asyl , wo heute junge Menschen sich treffen sollen .  ||| t2t= -8.870988 t2t2= -8.109677 ||| -16.980665

Additionally, two directories are created. ensemble-output.timecsv contains CSV files with predictor scores in each time step for each hypothesis. ensemble-output.sfst consists of output translation lattices in OpenFST format:

$ fstprint --isymbols=data/wmap.bpe.ende --acceptor ensemble-output.sfst/1.fst
0     1       <s>     10.3034077
1     2       München</w>
2     3       18
3     4       56</w>
4     5       :</w>
5     6       vier</w>
6     7       Karten</w>
7     8       ,</w>
8     9       die</w>
9     10      Ihre</w>        0.654048979
9     11      Ihren</w>
10    12      Sicht</w>
11    13      Blick</w>
12    14      auf</w>
13    15      auf</w>
14    16      die</w>
15    17      die</w>
16    18      Stadt</w>
17    19      Stadt</w>
18    20      verändern</w>
19    21      verändern</w>
20    22      werden</w>
21    22      werden</w>
21    22      .</w>   2.42674088
22    23      </s>
23

Lattice rescoring

For restricting NMT to an (SMT) translation lattice as proposed in our ACL‘16 paper, use the fst predictor:

predictors: t2t, fst
fst_path: supplementary/smt.lats_test18/%d.fst
(...)

Decode with:

$ python $SGNMT/decode.py --config_file ini/rescoring.ini --range 1:2
(...)
2019-06-24 13:48:18,317 INFO: Next sentence (ID: 1): Munich 1856 : four maps that will change your view of the city
2019-06-24 13:48:22,418 INFO: Decoded (ID: 1): München 1856 : vier Karten , die Ihre Sicht auf die Stadt verändern werden .
2019-06-24 13:48:22,418 INFO: Stats (ID: 1): score=-7.547943 num_expansions=53 time=4.10
2019-06-24 13:48:22,418 INFO: Next sentence (ID: 2): a mental asylum , where today young people are said to meet .
2019-06-24 13:48:25,212 INFO: Decoded (ID: 2): ein geistiges Asyl , wo heute junge Menschen sollen sich treffen .
2019-06-24 13:48:25,213 INFO: Stats (ID: 2): score=-8.179755 num_expansions=48 time=2.79

This command loads the determinised lattice ./supplementary/smt.lats_test18/?.fst from the file system and runs the NMT beam search decoder on it. For non-deterministic lattices use the nfst predictor instead. By default, SGNMT ignores the scores in the translation lattices and uses the FST as unweighted constrained. To also take the FST weights into account, set use_fst_weights to true and balance NMT vs. SMT score with predictor_weights:

python $SGNMT/decode.py --config_file ini/rescoring.ini --predictor_weights 2.0,1.0 --use_fst_weights true --range 1:2
(...)
2019-06-24 13:50:01,530 INFO: Initialized predictor t2t (weight: 2.0)
2019-06-24 13:50:01,530 INFO: Initialized predictor fst (weight: 1.0)
2019-06-24 13:50:01,531 INFO: Start time: 1561380601.5318987
2019-06-24 13:50:01,531 INFO: Next sentence (ID: 1): Munich 1856 : four maps that will change your view of the city
2019-06-24 13:50:05,488 INFO: Decoded (ID: 1): München 1856 : vier Karten , die Ihre Sicht auf die Stadt verändern werden .
2019-06-24 13:50:05,488 INFO: Stats (ID: 1): score=-30.180038 num_expansions=51 time=3.96
2019-06-24 13:50:05,488 INFO: Next sentence (ID: 2): a mental asylum , where today young people are said to meet .
2019-06-24 13:50:08,385 INFO: Decoded (ID: 2): ein geistiges Asyl , wo heute junge Menschen sollen zusammenkommen .
2019-06-24 13:50:08,385 INFO: Stats (ID: 2): score=-23.774222 num_expansions=50 time=2.90
2019-06-24 13:50:08,385 INFO: Decoding finished. Time: 6.85

Decoders

Different search strategies (decoders) can be used to explore the space spanned by the predictors. The default is beam decoding with beam size of 4 (change beam size with beam option). Alternative decoders can be selected with the decoder option:

$ python $SGNMT/decode.py --config_file ini/rescoring.ini --decoder greedy --range 1:2
(...)
2019-06-24 14:24:15,744 INFO: Start time: 1561382655.7441242
2019-06-24 14:24:15,744 INFO: Next sentence (ID: 1): Munich 1856 : four maps that will change your view of the city
2019-06-24 14:24:17,693 INFO: Decoded (ID: 1): München 1856 : vier Karten , die Ihre Sicht auf die Stadt verändern werden .
2019-06-24 14:24:17,693 INFO: Stats (ID: 1): score=-7.547943 num_expansions=17 time=1.95
2019-06-24 14:24:17,693 INFO: Next sentence (ID: 2): a mental asylum , where today young people are said to meet .
2019-06-24 14:24:18,511 INFO: Decoded (ID: 2): ein geistiges Asyl , wo heute junge Menschen sollen sich treffen .
2019-06-24 14:24:18,511 INFO: Stats (ID: 2): score=-8.179755 num_expansions=14 time=0.82
2019-06-24 14:24:18,511 INFO: Decoding finished. Time: 2.77

Greedy decoding requires fewer node expansions (compare num_expansions with the previous section), but is more prone to search errors. Depth-first search is an exact inference scheme that is guaranteed to find the global best score:

$ python $SGNMT/decode.py --config_file ini/rescoring.ini --decoder dfs --range 1:2
(...)
2019-06-24 14:28:10,696 INFO: Start time: 1561382890.6964235
2019-06-24 14:28:10,696 INFO: Next sentence (ID: 1): Munich 1856 : four maps that will change your view of the city
2019-06-24 14:28:15,261 INFO: Decoded (ID: 1): München 1856 : vier Karten , die Ihre Sicht auf die Stadt verändern werden .
2019-06-24 14:28:15,261 INFO: Stats (ID: 1): score=-7.547943 num_expansions=61 time=4.56
2019-06-24 14:28:15,261 INFO: Next sentence (ID: 2): a mental asylum , where today young people are said to meet .
2019-06-24 14:28:21,248 INFO: Decoded (ID: 2): ein geistiges Asyl , wo heute junge Menschen sollen sich treffen .
2019-06-24 14:28:21,248 INFO: Stats (ID: 2): score=-8.179755 num_expansions=104 time=5.99
2019-06-24 14:28:21,248 INFO: Decoding finished. Time: 10.55

In this case, even greedy decoding finds the global best hypotheses.

MBR-based NMT (UCAM at WMT19)

In our EACL 2017 paper we described how to use n-gram posteriors extracted from a Hiero lattice to improve NMT. External n-gram probabilities can be introduced to SGNMT via the ngramc predictor. Since ngramc can yield positive scores, early_stopping needs to be set to false for MBR-based decoding. The ini/ucam-wmt19-base.ini config file additionally ensembles with a Transformer language model to reproduce our base single system in our WMT19 submission:

$ python $SGNMT/decode.py --config_file ini/ucam-wmt19-base.ini --range 1:2
(...)
2019-06-25 11:03:42,207 INFO: Next sentence (ID: 1): 14828 5524 16148 3851 7721 22779 3739 3910 5314 4104 4905 3675 3646 5390
2019-06-25 11:03:42,207 DEBUG: Loading n-gram scores from supplementary/smt.ngramc_test18/1.txt...
2019-06-25 11:03:59,037 INFO: Decoded (ID: 1): 15351 5524 16148
2019-06-25 11:03:59,037 INFO: Stats (ID: 1): score=-11.173564 num_expansions=108 time=16.83
2019-06-25 11:03:59,037 INFO: Next sentence (ID: 2): 3667 6203 13502 3637 4678 5149 7703 4499 3771 5598 3678 8087 3642
2019-06-25 11:03:59,038 DEBUG: Loading n-gram scores from supplementary/smt.ngramc_test18/2.txt...
2019-06-25 11:04:13,880 INFO: Decoded (ID: 2): 3787 15786 4602 30449 3637 5478 3896 4909 22992 8681 7373 3642
2019-06-25 11:04:13,881 INFO: Stats (ID: 2): score=-25.113689 num_expansions=104 time=14.84

Distributed decoding using the Grid Engine

Large decoding jobs can be distributed over multiple nodes with queueing systems like the Sun Grid Engine. SGNMT comes with an example script for distributed decoding:

$ bash $SGNMT/scripts/sge/decode_cpu.sh 100 ini/rescoring.ini output-sge

This will submit an array of 100 jobs to the grid, and each worker calls $SGNMT/decode.py with the range option pointing to a file that assigns sentence IDs to workers. Worker jobs write their output files to output-sge/<worker-id>, and their logs to output-sge/logs. When all workers are finished, the combination job combines the output files, writes the requested output formats to output-sge/out.%s, and touches output-sge/DONE.

This script works on the air-stack of the Cambridge Engineering Department speech group, and can serve as template for setting up distributed decoding elsewhere. Consider changing the -l options of the qsub commands in decode_cpu.sh and pointing the source command in decode_cpu_worker.sh to your <ACTIVATE.SH> script.

BLEU scores

The tutorial tarball contains the ./scripts/eval_test18_ende.sh script that can be used to compute BLEU scores from indexed subword-level translations that are comparable to official WMT results (adjust path to Moses before use):

# e.g. out.text from ini/ucam-wmt19-big.ini
$ cat output-sge/out.text | ./scripts/eval_test18_ende.sh
(...)
BLEU = 47.95, 74.42/53.76/41.42/32.67 (BP=0.994164, ratio=0.994181342958491, hyp_len=63902, ref_len=64276)

The script generates a number of tmp.* files for closer inspection. Here are the expected BLEU scores for the ini files in this tutorial:

Config File BLEU
base.ini 43.8
ucam-wmt19-base.ini 45.1
big.ini 47.8
ucam-wmt19-big.ini 48.0

These results correspond to the single system results in rows 2 and 4 of Table 6 in our WMT19 system description paper. Note that our single NMT result (48.0 BLEU) is very close to the winning WMT18 system (Microsoft-Marian: 48.3 BLEU) that was an ensemble of four NMT systems.