Cambridge SMT System
Installation

This section contains instructions on downloading, compiling, and installing the UCAM-SMT tools.

There are three sets of tools that can be installed: the HiFST binaries, the Hadoop-based grammar extraction tools, and the tutorial files.

The HiFST Binary Installation is required to run the HiFST Data and Control Files tutorial. The tutorial does not require the grammar extraction tools: translation grammars, language models, etc., are provided with the tutorial. However, the Hadoop-based translation grammar extraction may be useful if you wish to extract your own translation grammars from aligned parallel text.

HiFST Binary Installation

Note: The following instructions are for the Bash shell.

The code can be cloned from the following GitHub address:

> git clone https://github.com/ucam-smt/ucam-smt.git

In the following, HiFSTROOT designates the cloned directory; it should be set to the complete path of that directory:

> export HiFSTROOT=complete_path_to_hifst_cloned_directory

As a check, the following command should find the README.md file downloaded from GitHub:

> ls $HiFSTROOT/README.md

Once downloaded, the tools and libraries are compiled in the directory $HiFSTROOT by running the script build-test.sh.

Note that the environment variable TGTBINMK, which selects specific compilation options (e.g. optimisation level, static vs. dynamic linking), can be set to a value of your choosing before running the script. See build-test.sh for the supported options.

> cd $HiFSTROOT
> export TGTBINMK=O2 # change as appropriate for your environment; O2 is the default
> ./build-test.sh

This should download and install necessary dependencies, compile the code and run tests. The README.md in the cloned directory also contains useful information for the installation.

HiFST Paths and Environment Variables

After HiFST is successfully built and tested, the file $HiFSTROOT/Makefile.inc will contain environment variable settings needed to run the HiFST binaries and the OpenFST tools using the HiFST libraries. To set these, simply run

> source $HiFSTROOT/Makefile.inc
> export PATH=$HiFSTROOT/bin:$OPENFST_BIN:$PATH
> export LD_LIBRARY_PATH=$HiFSTROOT/bin:$OPENFST_LIB:$BOOST_LIB:$LD_LIBRARY_PATH

Make sure that the HiFST bin directory ($HiFSTROOT/bin) is added first on both the path and the library path, so that it precedes the OpenFst directories. If the LD_LIBRARY_PATH variable is not set correctly, you will see messages such as

ERROR: GenericRegister::GetEntry : tropical_LT_tropical-arc.so: cannot open shared object file: No such file or directory
ERROR: ReadFst : unknown arc type "tropical_LT_tropical" : standard input
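As a quick sanity check, you can print the leading entries of each search path and confirm the HiFST directory comes first. The directory names below are illustrative placeholders, not paths from an actual installation; substitute your real $PATH and $LD_LIBRARY_PATH:

```shell
# Illustrative only: a PATH-like string with the HiFST bin directory first,
# as the setup above requires.
LIB_PATH="/opt/hifst/bin:/opt/openfst/lib:/usr/lib"
# Print the first entry; it should be the HiFST directory.
printf '%s\n' "$LIB_PATH" | cut -d: -f1
```

On a correctly configured system, running the same `cut` on `$LD_LIBRARY_PATH` should likewise print the HiFST directory before any OpenFst entry.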

Sourcing Makefile.inc also sets the environment variable TGTBINMK, which selects the HiFST binaries by build type; for example, the following should find the main HiFST binary:

> ls $HiFSTROOT/bin/hifst.${TGTBINMK}.bin

It is possible to use multiple builds of HiFST by changing the TGTBINMK variable.
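For example, switching TGTBINMK changes which binary the hifst.${TGTBINMK}.bin pattern above selects. The build names used here are placeholders; see build-test.sh for the values your checkout actually supports:

```shell
# Placeholder build names; the hifst.${TGTBINMK}.bin pattern is from above.
export TGTBINMK=O2
echo "hifst.${TGTBINMK}.bin"
export TGTBINMK=debug
echo "hifst.${TGTBINMK}.bin"
```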

Installation of the Hadoop-based Grammar Extraction Tools

Note: These are not needed to run the basic HiFST Data and Control Files tutorial.

Requirements for building the rule extraction software: a Java development kit and sbt (used below).

The rule extraction code is in the $HiFSTROOT/java/ruleXtract directory. Rule extraction is built as part of the full build. To build it separately, use the following commands:

> cd $HiFSTROOT/java/ruleXtract
> sbt assembly

You will obtain a jar file named ruleXtract.jar, located at $HiFSTROOT/java/ruleXtract/target/ruleXtract.jar. We use the variable $RULEXTRACTJAR to designate this jar from now on:

> RULEXTRACTJAR=$HiFSTROOT/java/ruleXtract/target/ruleXtract.jar
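Since ruleXtract.jar is produced by sbt assembly as a self-contained jar, it would typically be invoked with java -jar; the exact tools and options it accepts are not covered here. The sketch below uses an example path, and only checks that the variable ends in the expected file name:

```shell
# Example path; on your system this is $HiFSTROOT/java/ruleXtract/target/ruleXtract.jar
RULEXTRACTJAR=/path/to/ucam-smt/java/ruleXtract/target/ruleXtract.jar
# A typical invocation for an assembled jar (assumption; consult the
# ruleXtract documentation for the actual tools and options):
#   java -jar "$RULEXTRACTJAR" <tool> <options>
echo "${RULEXTRACTJAR##*/}"
```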

Tutorial Installation

Files for this tutorial can be downloaded from the following GitHub address:

> git clone https://github.com/ucam-smt/demo-files.git
> cd demo-files; gunzip wmaps/*.gz  ## Uncompress big wordmap files.

The tutorial files and sources will be in the demo-files directory. Set the variable $DEMO to point to this directory:

> export DEMO=complete_path_to_demo-files_cloned_directory

As a check, the following command should find the README file downloaded from GitHub:

> ls $DEMO/README

Language Models

The language models needed for this tutorial can be downloaded from http://mi.eng.cam.ac.uk/~wjb31/data/hifst.release.May14/ . These are not on GitHub, due to their size. There are two files: a smaller LM required to run the tutorial, and a larger LM needed only for language model rescoring.

To run this tutorial you must download the smaller of the two LMs:

> cd $DEMO/M/
> wget http://mi.eng.cam.ac.uk/~wjb31/data/hifst.release.May14/interp.4g.arpa.newstest2012.tune.corenlp.ru.idx.withoptions.mmap.gz
> gunzip interp.4g.arpa.newstest2012.tune.corenlp.ru.idx.withoptions.mmap.gz

The big LM is needed for some language model rescoring operations, but otherwise is not required for the tutorial.

> cd $DEMO/M/
> wget http://mi.eng.cam.ac.uk/~wjb31/data/hifst.release.May14/interp.4g.arpa.newstest2012.tune.corenlp.ru.idx.union.mmap.gz
> gunzip interp.4g.arpa.newstest2012.tune.corenlp.ru.idx.union.mmap.gz

At this point, the following command should find the smaller of the two LMs:

> ls $DEMO/M/interp.4g.arpa.newstest2012.tune.corenlp.ru.idx.withoptions.mmap

Word Maps

The Russian and English wordmap files should be uncompressed (see Integer-Mapped Text Files):

> cd $DEMO
> gunzip -k wmaps/wmt13.ru.wmap.gz wmaps/wmt13.en.wmap.gz

Documentation

As an optional step, if you wish to regenerate the HTML code for this tutorial, make sure you have doxygen (1.8+ for markdown support) and LaTeX (for formulas) installed, and run the following commands:

> cd $DEMO/Docs.dox
> doxygen