.. _tutorial-blocks-label:

Tutorial: Blocks/Theano (outdated)
==================================

This is the previous version of the SGNMT tutorial, based on NPLM and
Blocks/Theano. Neither backend is supported in SGNMT anymore. This tutorial
is intended as a guide to reproducing previous research results on the
WMT'15 English-German test set reported in our `ACL 2016 paper `_. In order
to run this tutorial, please make sure that you use SGNMT 0.x, Python 2.7,
and that you have installed (at least) OpenFST and Blocks correctly.

The tutorial data is available under the following DOI:

http://dx.doi.org/10.17863/CAM.282

Please download the archive and extract it::

  $ tar xzf tutorial-ende-wmt15.tar.gz
  $ cd tutorial-ende-wmt15

The directory structure is as follows:

* *./data/* contains the source and target sentences for news-test2015 and word maps.
* *./train/* contains the NMT model file ``params.npz``.
* *./train2/* contains a second NMT model for ensembling.
* *./lm/* contains language model files.
* *./hiero/lats/* contains the Hiero translation lattices.
* *./hiero/ngramc/* contains n-gram posteriors for MBR extracted from the Hiero translation lattices.
* *./hiero/100best.txt* is an n-best list generated with Hiero.
* *./scripts/* contains helper scripts for creating lattice directories or applying word maps.

This structure is intended to be used as a starting point for your own
experiments. The `ucam-smt tutorial `_ explains how to generate translation
lattices for SGNMT in general.

For this tutorial, we assume that you have set the ``$SGNMT`` environment
variable to the location of your SGNMT installation::

  $ export SGNMT=/path/to/sgnmt

Introduction
----------------------------------------

The two central concepts in SGNMT are *predictors* and *decoders*.
*Predictors* are scoring modules which define scores over the target
language vocabulary given the current internal predictor state, the
history, the source sentence, and external side information. Predictors
have strict left-to-right semantics. They can represent translation models
like NMT or language models, but in a more general sense, translation
lattices or n-best lists can also be represented in this framework.
Predictors can be combined with other predictors to form complex decoding
tasks. *Decoders* are search strategies which traverse the space spanned by
the predictors. SGNMT provides implementations of common search tree
traversal algorithms like beam search. Since decoders differ in runtime
complexity and in the kind of search errors they make, different decoders
are appropriate for different predictor constellations.

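The left-to-right scoring contract described above can be pictured as a
small Python interface. This is a minimal sketch for illustration only, not
the actual SGNMT API; the class and method names are illustrative::

  class Predictor(object):
      """Illustrative predictor interface: scores target words strictly
      left to right, given the source sentence and the consumed history."""

      def initialize(self, src_sentence):
          """Reset the internal predictor state for a new source sentence."""
          raise NotImplementedError

      def predict_next(self):
          """Return a map from target word IDs to log scores, based on the
          current internal state (i.e. the history consumed so far)."""
          raise NotImplementedError

      def consume(self, word):
          """Advance the internal state by one target word."""
          raise NotImplementedError

A decoder only interacts with predictors through such an interface, which
is why search strategies and scoring modules can be combined freely.
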
Pure NMT decoding (single)
----------------------------------------

Start the NMT decoder with the following command::

  $ python $SGNMT/decode.py --predictors nmt --src_test data/test15.ids.en --range 1:1 --nmt_config src_vocab_size=50003,trg_vocab_size=50003
  2016-05-19 12:59:07,348 INFO: Creating theano variables
  2016-05-19 12:59:07,350 INFO: Building RNN encoder-decoder
  (...)
  2016-05-19 12:59:21,492 INFO: Loading the model from ./train/params.npz
  (...)
  2016-05-19 12:59:28,028 INFO: Start time: 1463659168.03
  2016-05-19 12:59:28,028 INFO: Next sentence (ID: 1): 1543 7 1491 1359 1532 692 9 6173
  2016-05-19 12:59:59,183 INFO: Decoded (ID: 1): 1511 7 1422 894 30 8 10453
  2016-05-19 12:59:59,183 INFO: Stats (ID: 1): score=-3.700894 num_expansions=85 time=31.15
  2016-05-19 12:59:59,183 INFO: Decoding finished. Time: 31.16

The ``--predictors nmt`` argument tells SGNMT to use the NMT scoring
module. The ``--src_test`` option defines the location of the source
sentences to translate (words are represented by IDs), and ``--range 1:1``
limits decoding to the first sentence. SGNMT searches for NMT model files
in the default location *./train/*. The arguments ``src_vocab_size`` and
``trg_vocab_size`` specify that the NMT model has been trained with
vocabulary sizes of 50003. Since we will use these options throughout this
tutorial, we load them from a configuration file instead of passing them on
the command line each time::

  $ cat tut.ini
  src_test: data/test15.ids.en
  range: '1:1'
  nmt_config: src_vocab_size=50003,trg_vocab_size=50003
  $ python $SGNMT/decode.py --predictors nmt --config_file tut.ini
  (...)
  2016-05-19 12:59:59,183 INFO: Decoded (ID: 1): 1511 7 1422 894 30 8 10453

You can look at our first translation by using the *apply_wmap.py* script::

  $ echo '1511 7 1422 894 30 8 10453' | python scripts/apply_wmap.py -m data/wmap.test15.de -d i2s
  Indien und Japan treffen sich in Tokio

For pure NMT decoding, you can use the *vanilla* decoder, an optimised
version of beam search::

  $ python $SGNMT/decode.py --decoder vanilla --config_file tut.ini
  (...)
  2016-05-19 13:14:27,040 INFO: Next sentence (ID: 1): 1543 7 1491 1359 1532 692 9 6173
  2016-05-19 13:14:34,009 INFO: Decoded (ID: 1): 1511 7 1422 894 30 8 10453
  2016-05-19 13:14:34,009 INFO: Stats (ID: 1): score=-3.700894 num_expansions=120 time=6.97
  2016-05-19 13:14:34,009 INFO: Decoding finished. Time: 6.97

SGNMT also offers a batch decoding script for pure NMT::

  $ python $SGNMT/batch_decode.py --src_test data/test15.ids.en --src_vocab_size 50003 --trg_vocab_size 50003
  Using gpu device 0: GeForce GTX TITAN X (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 5110)
  (...)
  2017-04-13 18:32:59,959 INFO: Decoding finished. Time: 53.550469

Batch decoding translates the entire test set in 53.55 seconds on a Titan X
GPU (831.6 words per second). However, the vanilla decoder and the
``batch_decode.py`` script bypass the predictor framework. Therefore, they
cannot be used in combination with other predictors, e.g. for lattice
rescoring. Furthermore, they are only available for Theano.

Ensemble NMT decoding
----------------------------------------

NMT ensembling can be done by simply adding a second NMT predictor.
However, we need to override the NMT configuration for the second NMT
predictor so that it loads a different NMT model. We can use
``--nmt_config2`` to change the second NMT configuration in general, or
``--nmt_path2`` to change only the model path::

  $ python $SGNMT/decode.py --predictors nmt,nmt --nmt_path2 train2 --config_file tut.ini
  (...)
  2016-05-19 13:24:36,060 INFO: Loading the model from ./train/params.npz
  (...)
  2016-05-19 13:24:56,942 INFO: Loading the model from train2/params.npz
  (...)
  2016-05-19 13:25:10,937 INFO: Next sentence (ID: 1): 1543 7 1491 1359 1532 692 9 6173
  2016-05-19 13:25:56,787 INFO: Decoded (ID: 1): 1511 7 1422 894 30 8 10453
  2016-05-19 13:25:56,787 INFO: Stats (ID: 1): score=-6.195214 num_expansions=83 time=45.85
  2016-05-19 13:25:56,787 INFO: Decoding finished. Time: 45.87

The first NMT predictor still uses the default NMT training directory
location *./train/*, but the second NMT instance loads the NMT model from
*train2/params.npz*. The faster *vanilla* search strategy can also be used
for ensembles.

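Conceptually, ensembling amounts to a weighted (log-linear) combination of
the predictor scores at each time step. The following minimal Python sketch
illustrates this; it is not SGNMT code, and the toy five-word vocabulary
and probability values are made up::

  import numpy as np

  def combine_predictor_scores(log_probs_list, weights):
      """Combine per-predictor log probabilities over the target
      vocabulary into a single score vector by a weighted sum."""
      combined = np.zeros_like(log_probs_list[0])
      for log_probs, weight in zip(log_probs_list, weights):
          combined += weight * log_probs
      return combined

  # Two hypothetical NMT distributions over a toy 5-word vocabulary.
  nmt1 = np.log([0.70, 0.10, 0.10, 0.05, 0.05])
  nmt2 = np.log([0.40, 0.35, 0.10, 0.10, 0.05])

  # With uniform weights this is the sum of log probabilities, i.e. the
  # product of the model probabilities -- the usual ensembling objective.
  scores = combine_predictor_scores([nmt1, nmt2], weights=[1.0, 1.0])
  print(int(np.argmax(scores)))  # index of the highest scoring word

The ``--predictor_weights`` option introduced in the next section controls
exactly these weights.
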
Lattice rescoring (SGNMT)
----------------------------------------

For restricting NMT to a translation lattice, we need the *fst* predictor::

  $ python $SGNMT/decode.py --predictors nmt,fst --fst_path hiero/lats/%d.fst --config_file tut.ini
  (...)
  2016-05-19 15:37:29,601 INFO: Next sentence (ID: 1): 1543 7 1491 1359 1532 692 9 6173
  2016-05-19 15:37:36,437 INFO: Decoded (ID: 1): 1511 7 1422 84829 894 30 8 10453
  2016-05-19 15:37:36,437 INFO: Stats (ID: 1): score=-4.779791 num_expansions=64 time=6.84
  2016-05-19 15:37:36,437 INFO: Decoding finished. Time: 6.84
  $ echo '1511 7 1422 84829 894 30 8 10453' | python scripts/apply_wmap.py -m data/wmap.test15.de -d i2s
  Indien und Japan Premierministern treffen sich in Tokio

This command loads the determinised lattice *./hiero/lats/1.fst* from the
file system and runs the NMT beam search decoder on it. For
non-deterministic lattices, use the *nfst* predictor instead. By default,
SGNMT ignores the scores in the translation lattices. To change this, use
``--use_fst_weights``::

  $ python $SGNMT/decode.py --predictors nmt,fst --fst_path hiero/lats/%d.fst --use_fst_weights true --predictor_weights 2.7,24.4 --config_file tut.ini
  (...)
  2016-05-19 15:41:19,878 INFO: Next sentence (ID: 1): 1543 7 1491 1359 1532 692 9 6173
  2016-05-19 15:41:28,228 INFO: Decoded (ID: 1): 1511 7 1422 3278 7 2830 894 30 8 10453
  2016-05-19 15:41:28,229 INFO: Stats (ID: 1): score=-50.385922 num_expansions=72 time=8.35
  2016-05-19 15:41:28,229 INFO: Decoding finished. Time: 8.35
  $ echo '1511 7 1422 3278 7 2830 894 30 8 10453' | python scripts/apply_wmap.py -m data/wmap.test15.de -d i2s
  Indien und Japan Staats- und Regierungschefs treffen sich in Tokio

This command uses ``--predictor_weights`` to weight the NMT scores against
the lattice scores (corresponding to the lambdas in the `ACL 2016 paper `_).

So far, we have used beam search as the search strategy. However, the beam
decoder introduces search errors. SGNMT supports a variety of decoding
strategies such as greedy, beam, depth-first, and A* search. To do an
exhaustive search over the lattice, use the depth-first search decoder::

  $ python $SGNMT/decode.py --decoder dfs --predictors nmt,fst --fst_path hiero/lats/%d.fst --config_file tut.ini
  (...)
  2016-05-19 15:43:33,713 INFO: Next sentence (ID: 1): 1543 7 1491 1359 1532 692 9 6173
  2016-05-19 15:43:35,926 INFO: Decoded (ID: 1): 1511 7 1422 84829 894 30 8 10453
  2016-05-19 15:43:35,926 INFO: Stats (ID: 1): score=-4.779791 num_expansions=17 time=2.21
  2016-05-19 15:43:35,926 INFO: Decoding finished. Time: 2.21

In this case, exact decoding was very fast because DFS automatically
enables admissible pruning: branches of the search tree whose accumulated
score is already worse than the current best complete hypothesis are
discarded. If we disable this feature with ``--early_stopping false``, we
see that SGNMT finds the same hypothesis, but with many more node
expansions::

  $ python $SGNMT/decode.py --decoder dfs --early_stopping false --predictors nmt,fst --fst_path hiero/lats/%d.fst --config_file tut.ini
  (...)
  2016-05-19 15:44:28,765 INFO: Next sentence (ID: 1): 1543 7 1491 1359 1532 692 9 6173
  2016-05-19 15:45:12,650 INFO: Decoded (ID: 1): 1511 7 1422 84829 894 30 8 10453
  2016-05-19 15:45:12,650 INFO: Stats (ID: 1): score=-4.779791 num_expansions=334 time=43.88
  2016-05-19 15:45:12,650 INFO: Decoding finished. Time: 43.89

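Why is this pruning admissible? Since all partial scores are log
probabilities (and thus non-positive), the accumulated score of a
hypothesis can only decrease as it grows. The following self-contained toy
sketch illustrates the idea in Python; it is an illustration of the
principle, not the SGNMT implementation, and the successor function is made
up::

  import math

  EOS = 2  # end-of-sentence ID, as in the word maps used above

  def dfs(expand, hypo, acc, best):
      """Depth-first search with admissible pruning: any branch whose
      accumulated log score is already worse than the best complete
      hypothesis can safely be cut (what --early_stopping enables)."""
      if hypo and hypo[-1] == EOS:
          return max(best, (acc, hypo))
      for word, log_p in expand(hypo):
          if acc + log_p <= best[0]:  # admissible pruning
              continue
          best = dfs(expand, hypo + (word,), acc + log_p, best)
      return best

  # Toy successor function standing in for the combined nmt,fst scores.
  def expand(hypo):
      if len(hypo) < 2:
          return [(5, math.log(0.6)), (7, math.log(0.4))]
      return [(EOS, math.log(0.9))]

  print(dfs(expand, (), 0.0, (float("-inf"), None)))
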
Informed search is implemented by the *astar* search strategy::

  $ python $SGNMT/decode.py --decoder astar --heuristics predictor --predictors nmt,fst --fst_path hiero/lats/%d.fst --use_fst_weights true --predictor_weights 2.7,24.4 --config_file tut.ini
  (...)
  2016-05-19 18:28:05,618 INFO: Next sentence (ID: 1): 1543 7 1491 1359 1532 692 9 6173
  2016-05-19 18:28:06,897 INFO: Decoded (ID: 1): 1511 7 1422 3278 7 2830 894 30 8 10453
  2016-05-19 18:28:06,898 INFO: Stats (ID: 1): score=-50.385922 num_expansions=11 time=1.28
  2016-05-19 18:28:06,898 INFO: Decoding finished. Time: 1.28

The option ``--heuristics predictor`` enables the predictor-specific
heuristics. In this example, the shortest distances in the lattice serve as
future cost estimates. This is a very weak heuristic here because the *fst*
predictor has a small weight, but it already speeds up decoding.
Alternatively, ``--heuristics greedy`` performs greedy decoding with all
predictors to estimate the future cost (expensive but more accurate).

N-best list rescoring
----------------------------------------

The *forced* predictor implements single best rescoring (i.e. forced
decoding). The ``--trg_test`` option needs to point to a plain text file
with the reference sentences::

  $ python $SGNMT/decode.py --predictors nmt,forced --trg_test data/test15.ids.de --config_file tut.ini
  (...)
  2016-05-20 10:21:07,916 INFO: Next sentence (ID: 1): 1543 7 1491 1359 1532 692 9 6173
  2016-05-20 10:21:09,200 INFO: Decoded (ID: 1): 5 3316 7930 7 7312 9864 30 8 10453 4
  2016-05-20 10:21:09,200 INFO: Stats (ID: 1): score=-23.736977 num_expansions=11 time=1.28
  2016-05-20 10:21:09,200 INFO: Decoding finished. Time: 1.28

For NMT n-best list rescoring, use the *forcedlst* predictor::

  $ python $SGNMT/decode.py --predictors nmt,forcedlst --decoder dfs --trg_test hiero/100best.txt --config_file tut.ini
  (...)
  2016-05-20 10:46:12,884 INFO: Next sentence (ID: 1): 1543 7 1491 1359 1532 692 9 6173
  2016-05-20 10:46:15,049 INFO: Decoded (ID: 1): 1511 7 1422 84829 894 30 8 10453
  2016-05-20 10:46:15,049 INFO: Stats (ID: 1): score=-4.779791 num_expansions=17 time=2.17
  2016-05-20 10:46:15,049 INFO: Decoding finished. Time: 2.17

The n-best list needs to be stored in `Moses format `_. The *dfs* decoder
efficiently traverses the search space spanned by the n-best list by
reusing predictor states for identical histories. If you are interested in
rescoring the full n-best list rather than only finding the single best
translation, use ``--early_stopping false``. To make use of the scores
provided in the n-best list, add ``--use_nbest_weights true``.

Working with language models
----------------------------------------

Language model scores can be added to the lattices before passing them to
SGNMT. This is the approach we took in the `ACL 2016 paper `_.
Alternatively, the *nplm* predictor can be used directly in SGNMT to
incorporate a feedforward neural language model trained with `NPLM `_. A
German NPLM model file can be found in *./lm/nplm*. However, this model has
been trained with a different word map. Therefore, we wrap the *nplm*
predictor in the *idxmap* wrapper predictor. This wrapper translates
between the word indices used by SGNMT and the indices used by the NPLM
predictor.

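Conceptually, the wrapper could look like the following Python sketch. This
is not the SGNMT implementation; the assumed file format (one pair of
SGNMT/NPLM indices per line) and all names are illustrative assumptions::

  def load_idxmap(path):
      """Load an index map; we assume one 'sgnmt_id predictor_id' pair
      per line (the actual idxmap file format may differ)."""
      idxmap = {}
      with open(path) as f:
          for line in f:
              sgnmt_id, predictor_id = line.split()
              idxmap[int(sgnmt_id)] = int(predictor_id)
      return idxmap

  class IdxmapWrapper(object):
      """Translate word indices between SGNMT and a wrapped predictor
      (here: NPLM), mirroring what the idxmap wrapper does conceptually."""

      def __init__(self, predictor, idxmap):
          self.predictor = predictor
          self.idxmap = idxmap                              # SGNMT -> NPLM
          self.inverse = {v: k for k, v in idxmap.items()}  # NPLM -> SGNMT

      def consume(self, word):
          # Feed the wrapped predictor its own index for this word.
          self.predictor.consume(self.idxmap[word])

      def predict_next(self):
          # Map the returned scores back to SGNMT indices.
          return {self.inverse[w]: score
                  for w, score in self.predictor.predict_next().items()}
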
The mapping between indices is defined with the ``--src_idxmap`` and
``--trg_idxmap`` arguments::

  $ python $SGNMT/decode.py --predictors nmt,fst,idxmap_nplm --fst_path hiero/lats/%d.fst --nplm_path lm/nplm --src_idxmap data/idxmap.nplm.en --trg_idxmap data/idxmap.nplm.de --config_file tut.ini
  2016-05-19 16:24:47,811 INFO: Start time: 1463671487.81
  2016-05-19 16:24:47,811 INFO: Next sentence (ID: 1): 1543 7 1491 1359 1532 692 9 6173
  2016-05-19 16:24:54,768 INFO: Decoded (ID: 1): 1511 7 1422 84829 8 10453
  2016-05-19 16:24:54,768 INFO: Stats (ID: 1): score=-43.008210 num_expansions=56 time=6.96
  2016-05-19 16:24:54,768 INFO: Decoding finished. Time: 6.96
  $ echo '1511 7 1422 84829 8 10453' | python scripts/apply_wmap.py -m data/wmap.test15.de -d i2s
  Indien und Japan Premierministern in Tokio

This results in a translation which is too short because the language model
prefers short hypotheses. To counteract this, adjust the weight between NMT
and LM with ``--predictor_weights``, or add a word count feature with the
*wc* predictor. The *srilm* predictor supports loading Kneser-Ney language
models in ARPA format.

MBR-based NMT
----------------------------------------

In our `EACL 2017 paper `_ we described how to use n-gram posteriors
extracted from a Hiero lattice to improve NMT. External n-gram
probabilities can be introduced to SGNMT via the *ngramc* predictor::

  $ python $SGNMT/decode.py --predictors nmt,ngramc,wc --ngramc_path hiero/ngramc/%d.txt --predictor_weights 0.625,0.375,0.375 --config_file tut.ini
  (...)
  2017-04-13 19:01:47,753 INFO: Next sentence (ID: 1): 1543 7 1491 1359 1532 692 9 6173
  2017-04-13 19:02:06,987 INFO: Decoded (ID: 1): 1511 7 1422 894 30 8 10453
  2017-04-13 19:02:06,987 INFO: Stats (ID: 1): score=1.414550 num_expansions=64 time=19.23
  2017-04-13 19:02:06,987 INFO: Decoding finished. Time: 19.23

The *wc* predictor is a simple word penalty, often denoted as Theta_0 in
the MBR literature. Note that the n-gram posterior files can be generated
with the ``--logger.verbose`` option of `HiFST's lmbr tool `_.

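To give an intuition, the following Python sketch scores a hypothesis with
matched n-gram posteriors plus a weighted word penalty. This is only a
plausible illustration of an MBR-style objective; the exact scoring used by
the *ngramc* predictor may differ, and all values below are made up::

  def ngramc_score(hypo, ngram_posteriors, theta):
      """Toy MBR-style score: weighted word penalty (the wc predictor,
      Theta_0) plus the posteriors of all matched n-grams up to order 4
      (the maximum order is an assumption for this sketch)."""
      score = theta["wc"] * len(hypo)
      for order in (1, 2, 3, 4):
          for i in range(len(hypo) - order + 1):
              ngram = tuple(hypo[i:i + order])
              if ngram in ngram_posteriors:
                  score += theta["ngramc"] * ngram_posteriors[ngram]
      return score

  hypo = [1511, 7, 1422, 894]
  posteriors = {(1511,): 0.9, (1511, 7): 0.8, (7, 1422): 0.75}
  print(ngramc_score(hypo, posteriors, {"wc": 0.375, "ngramc": 0.375}))
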
Creating output files
----------------------------------------

SGNMT supports the following output formats:

* *text*: plain text file with the translations
* *nbest*: n-best list in Moses format
* *sfst*: OpenFST translation lattices with standard arcs
* *fst*: OpenFST translation lattices with sparse tuple arcs
* *ngram*: MBR-style n-gram posteriors
* *timecsv*: CSV with predictor scores over time

They can be activated with ``--outputs``. For example, the following adds
NMT scores to the Hiero n-best list *hiero/100best.txt*::

  $ python $SGNMT/decode.py --outputs text,nbest --predictors nmt,forcedlst --use_nbest_weights true --trg_test hiero/100best.txt --decoder dfs --early_stopping false --config_file tut.ini
  (...)
  $ head sgnmt-out.*
  ==> sgnmt-out.nbest <==
  0 ||| 1511 7 1422 6284 894 30 8 10453 ||| nmt= -5.695894 forcedlst= 10.681800 ||| 4.985906
  0 ||| 1511 7 1422 3316 894 30 8 10453 ||| nmt= -5.125271 forcedlst= 7.244780 ||| 2.119509
  0 ||| 1511 7 1422 6284 894 8 10453 ||| nmt= -7.310516 forcedlst= 9.359490 ||| 2.048974
  0 ||| 1511 7 1422 84829 894 30 8 10453 ||| nmt= -4.779791 forcedlst= 6.022160 ||| 1.242369
  0 ||| 1511 7 1422 13997 2153 894 30 8 10453 ||| nmt= -9.046590 forcedlst= 9.643540 ||| 0.596950
  0 ||| 1511 7 1422 3278 7 2830 894 30 8 10453 ||| nmt= -11.577704 forcedlst= 11.586500 ||| 0.008796
  0 ||| 1511 7 1422 3316 894 8 10453 ||| nmt= -6.950797 forcedlst= 6.573870 ||| -0.376927
  0 ||| 1511 7 1422 84829 894 8 10453 ||| nmt= -6.364513 forcedlst= 5.350270 ||| -1.014243
  0 ||| 1511 7 1422 2830 894 30 8 10453 ||| nmt= -8.984533 forcedlst= 7.901030 ||| -1.083503
  0 ||| 1511 7 7312 6284 894 30 8 10453 ||| nmt= -7.913851 forcedlst= 5.823900 ||| -2.089951

  ==> sgnmt-out.text <==
  1511 7 1422 6284 894 30 8 10453

The default output path is *sgnmt-out.%s* (this can be changed with
``--output_path``). The generated n-best file *sgnmt-out.nbest* shows not
only the combined score but also the individual predictor scores. In this
case, *nmt=* contains the NMT log-likelihood, and *forcedlst=* corresponds
to the hypothesis score in the Hiero n-best list *hiero/100best.txt*.

Simple NMT translation lattices can be generated with the *sfst* output
format::

  $ python $SGNMT/decode.py --outputs sfst --predictors nmt --config_file tut.ini
  (...)
  $ fstprint sgnmt-out.sfst/1.fst | head
  0    1    1    1    3.70089412
  0    10   1    1    4.99342918
  0    18   1    1    5.3186779
  0    26   1    1    5.67907906
  1    2    1511    1511
  2    3    7    7
  3    4    1422    1422
  4    5    894    894
  5    6    30   30
  6    7    8    8

If you wish to keep the predictor scores separated in the generated
lattices, use the *fst* output format to create lattices with `sparse tuple
arcs `_. You will need to `install HiFST `_ to enable support for the
tropicalsparsetuple arc type::

  $ python $SGNMT/decode.py --outputs fst --predictors nmt,fst --fst_path hiero/lats/%d.fst --use_fst_weights true --predictor_weights 2.7,24.4 --config_file tut.ini
  (...)
  $ TUPLEARC_WEIGHT_VECTOR=2.7,24.4 fstshortestpath sgnmt-out.fst/1.fst | fsttopsort | fstprint
  0    1    1    1
  1    2    1511    1511    0,1,0.420832008,2,0.00846654177
  2    3    7    7    0,1,0.129989997
  3    4    1422    1422    0,1,0.0673521981,2,0.00468987226
  4    5    3278    3278    0,1,9.95738029,2,0.680079401
  5    6    7    7    0,1,0.0128448997
  6    7    2830    2830    0,1,0.0604516007
  7    8    894    894    0,1,0.287970006,2,0.0195894614
  8    9    30   30   0,1,0.173721001,2,0.0528452434
  9    10   8    8    0,1,0.354806006,2,0.000487923622
  10   11   10453   10453   0,1,0.0348698013,2,0.017698925
  11   12   2    2    0,1,0.0774876028
  12

The weights in the generated FST correspond to the unweighted predictor
scores, in the order in which the predictors are defined in
``--predictors``.

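Since the n-best output follows the Moses format shown above, it is easy to
post-process. The following small Python sketch extracts the per-predictor
scores from each line; the field layout is taken from the example output
above::

  def parse_nbest_line(line):
      """Parse one line of the Moses-style n-best output, e.g.
      '0 ||| 1511 7 ... ||| nmt= -5.695894 forcedlst= 10.681800 ||| 4.985906'."""
      sent_id, hypo, features, total = [f.strip() for f in line.split("|||")]
      tokens = features.split()
      # Features come as alternating 'name=' and value tokens.
      scores = {tokens[i].rstrip("="): float(tokens[i + 1])
                for i in range(0, len(tokens), 2)}
      return int(sent_id), hypo, scores, float(total)

  with open("sgnmt-out.nbest") as f:
      for line in f:
          print(parse_nbest_line(line))
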
Distributed decoding using the Grid Engine
-------------------------------------------

Large decoding jobs can be distributed over multiple nodes on the Grid
Engine using the ``--range`` argument. First, we create a configuration
file for SGNMT which specifies the decoding parameters. Here is an example
.ini file for distributing lattice rescoring on the WMT'15 English-German
test set::

  $ cat scripts/grid/example.ini
  src_test: data/test15.ids.en
  nmt_config: src_vocab_size=50003,trg_vocab_size=50003
  predictors: nmt,fst
  fst_path: hiero/lats/%d.fst
  use_fst_weights: true
  predictor_weights: 2.7,24.4

Make sure that the .ini file does not contain ``output_path`` or ``range``.

Next, open *scripts/grid/decode_on_grid_cpu_worker.sh* and, if necessary,
change the environment variables PATH, LD_LIBRARY_PATH, and PYTHONPATH as
described on the :ref:`setup-label` page. Start the distributed decoding
with the following command::

  $ bash scripts/grid/decode_on_grid_cpu.sh 40 1:2169 scripts/grid/example.ini grid-output

This submits an array of 40 jobs to the grid, and each worker calls
decode.py with a different ``--range``. Worker jobs write their output
files to *grid-output/*, and their logs to *grid-output/logs*. When all
workers have finished, a combination job merges the output files and writes
the requested output formats to *grid-output/out.%s*.

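To illustrate how ``--range`` partitions the work, the following Python
sketch computes contiguous per-worker ranges for the 2169 test sentences
and 40 workers used above. The actual split performed by
*decode_on_grid_cpu.sh* may differ; this is only to show the idea::

  def worker_ranges(num_sentences, num_workers):
      """Split sentences 1..num_sentences into (nearly) equal contiguous
      chunks and return one --range argument per worker."""
      chunk = -(-num_sentences // num_workers)  # ceiling division
      ranges = []
      for worker in range(num_workers):
          start = worker * chunk + 1
          end = min((worker + 1) * chunk, num_sentences)
          if start <= end:
              ranges.append("%d:%d" % (start, end))
      return ranges

  print(worker_ranges(2169, 40)[:3])  # ['1:55', '56:110', '111:165']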