Cambridge SMT System
Notes:
$DEMO/M/
directory.
The following directories contain the data files, configuration files, and model files needed for this tutorial.
./
|-configs/   # Configuration files
|-train/     # Training data files
|-Docs.dox/  # Tutorial documentation files
|-EN/        # English reference text
|-G/         # Translation grammars
|-M/         # Language models
|-RU/        # Russian input text
|-scripts/   # Scripts for these demonstration exercises
|-wmaps/     # Word maps, to map English and Russian text to integers
The following directories will be created after running this tutorial.
./
|-log/     # Translation process log files
|-output/  # Translation output, as 1-best hypotheses and lattices
HiFST uses the Boost libraries which provide support for command line options and configuration files. For example, the following options could be provided on the command line:
--hifst.prune=9 --hifst.replacefstbyarc.nonterminals=X,V
Alternatively, they could be specified in a configuration file either as
hifst.prune=9
hifst.replacefstbyarc.nonterminals=X,V
or as
[hifst]
prune=9
replacefstbyarc.nonterminals=X,V
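As a rough illustration of why the two configuration-file layouts are equivalent, the following Python sketch flattens both into the same fully qualified option names. It is not the parser HiFST uses (that is provided by Boost); the function name `flatten_config` is hypothetical, and the sketch assumes only `[section]` headers, `key=value` lines and `#` comments.

```python
def flatten_config(text):
    """Map config-file text to a dict of fully qualified option names.

    Illustrative only: assumes [section] headers, key=value lines and
    '#' comments; this is not the parser HiFST itself uses.
    """
    options = {}
    section = ""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()  # remember the current section
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        name = f"{section}.{key}" if section else key
        options[name] = value.strip()
    return options


dotted = "hifst.prune=9\nhifst.replacefstbyarc.nonterminals=X,V\n"
sectioned = "[hifst]\nprune=9\nreplacefstbyarc.nonterminals=X,V\n"

# Both layouts yield the same option names and values.
print(flatten_config(dotted) == flatten_config(sectioned))
```

Options that appear before any section header, such as `range=1:2` in the example configuration file, keep their bare name under this scheme.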
As you work through the tutorial, please read the comments in the config files, which explain some of the processing options. For example, see configs/CF.baseline:
# Basic HiFST configuration for Russian-English translation
# Translation uses a 4gram language model with a shallow-1 grammar
# as described in
# J. Pino, et al. The University of Cambridge Russian-English System at
# WMT13. http://aclweb.org/anthology//W/W13/W13-2225.pdf

range=1:2
# range of line numbers in the source text file to be translated

[source]
load=RU/RU.tune.idx
# path and filename of the source text file

[target]
store=output/exp.baseline/hyps
# path and filename of the target text file; translations are written to this file

[hifst]
lattice.store=output/exp.baseline/LATS/?.fst.gz
# specifies the path and filename into which translation lattices are
# written. Note the use of the placeholder '?' in the argument
# .../LATS/?.fst.gz . The placeholder is replaced by the line
# number of sentence being translated, e.g. so that
# .../LATS/2.fst.gz is a WFST containing translations of the second
# line in the source text file. Note also the use of the '.gz'
# extension: when this is provided, lattices are written as gzipped
# files.
prune=9
# translation lattices are pruned using OpenFST pruning operations
# prior to saving to disk. The parameter provided is the pruning
# threshold. The default is 3.40282347e+38 which effectively means
# no pruning.
replacefstbyarc.nonterminals=X,V
# Specifies non-terminals in the grammar, apart from the top-level
# non-terminal S. HiFST will treat X and V as non-terminals;
# in addition, other symbols Xn and Vn (e.g. X1 and V2) will be treated as non-terminals.

[lm]
load=M/interp.4g.arpa.newstest2012.tune.corenlp.ru.idx.withoptions.mmap
# path and filename of the English n-gram language model in kenlm format
# this particular language model is a modified Kneser-Ney 4-gram language model
# trained for the Russian-English task at WMT13 (http://statmt.org/wmt13/)
# with the KenLM toolkit (http://kheafield.com/code/kenlm/)

[grammar]
load=G/rules.shallow.gz
# path and filename of the Hiero translation grammar to load and use in translation
# this particular grammar is a shallow grammar: when CYK parsing the source side,
# hierarchical rules can only be concatenated with the glue rule and not combined
# together with another hierarchical rules.
# See
# A. de Gispert et al. Hierarchical phrase-based translation with weighted finite
# state transducers and shallow-N grammars. Computational Linguistics, 2010.
# http://www.aclweb.org/anthology-new/J/J10/J10-3008.pdf)
HiFST uses symbol tables as provided by OpenFst to map between source and target language text and the integer representation used internally by the decoder. See the OpenFst Quick Tour for a discussion of the use of symbol tables.
Integer mappings for English and Russian are in the directory wmaps/:
wmaps/wmt13.en.wmap
wmaps/wmt13.ru.wmap
Note that HiFST reserves the integers 1 and 2 for the sentence-start and sentence-end symbols. 0 is the OpenFst epsilon symbol. 999999998 is the OOV symbol.
The format of the wordmap files is straightforward, e.g.
> head wmaps/wmt13.en.wmap
<epsilon> 0
<s> 1
</s> 2
the 3
, 4
. 5
of 6
to 7
and 8
in 9
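The wordmap format can be read with a few lines of Python. This is only a sketch: the helper names `load_wordmap` and `ids_to_text` are hypothetical, and the sample data is limited to the ten entries of wmaps/wmt13.en.wmap listed above.

```python
def load_wordmap(lines):
    """Return {integer id: word} from 'word id' lines, as in the .wmap files."""
    id_to_word = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        word, idx = fields
        id_to_word[int(idx)] = word
    return id_to_word


def ids_to_text(ids, id_to_word):
    """Replace each integer in a space-separated string by its word."""
    return " ".join(id_to_word[int(i)] for i in ids.split())


# The first ten entries of wmaps/wmt13.en.wmap, as listed above.
sample = ["<epsilon> 0", "<s> 1", "</s> 2", "the 3", ", 4",
          ". 5", "of 6", "to 7", "and 8", "in 9"]
wmap = load_wordmap(sample)
print(ids_to_text("1 3 6 3 5 2", wmap))  # <s> the of the . </s>
```

In practice the list of lines would come from reading the wordmap file itself, e.g. `load_wordmap(open("wmaps/wmt13.en.wmap", encoding="utf-8"))`.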
Source text files are provided in integer format:
> head -1 RU/RU.tune.idx
1 3526 10 1278 28847 3 64570 1857 7786 2
The OpenFst FAR tools can be used to generate Russian text from the integer mapped files (see Extracting the Best Translation from a Lattice).
> farcompilestrings --entry_type=line RU/RU.tune.idx | farprintstrings --symbols=wmaps/wmt13.ru.wmap | head -1
<s> парламент не поддерживает поправку , дающую свободу тимошенко </s>
English language models are provided in both KenLM and ARPA formats. See [Pino2013] for a description of how these LMs are built.
Language models are in integer mapped format, e.g. for the ARPA files:
> zcat M/lm.tc.gz | head -15
\data\
ngram 1=2794
ngram 2=181413
ngram 3=292841

\1-grams:
-2.153572 10 -0.8350325
-4.555545 100 -0.5700829
-4.174318 1000 -0.9515274
-3.82578 1001 -1.104666
-5.726784 10013 -0.2537366
-6.726784 100231 -0.2309253
-4.027669 1004 -0.9985322
-4.362545 1009 -0.7841078
The integers correspond to words in the English wordmap file wmaps/wmt13.en.wmap.
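Each n-gram entry in an ARPA file holds a base-10 log probability, the n-gram itself (here as integer word ids), and an optional back-off weight. The sketch below splits one such entry into its parts; the function name `parse_arpa_entry` is hypothetical and the code is an illustration, not part of HiFST or KenLM.

```python
def parse_arpa_entry(line, order):
    """Split one ARPA n-gram entry into (log10 prob, n-gram ids, backoff).

    The back-off weight is optional in ARPA files (it is absent for the
    highest-order n-grams), so it is returned as None when missing.
    """
    fields = line.split()
    logprob = float(fields[0])
    ngram = tuple(int(tok) for tok in fields[1:1 + order])
    backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
    return logprob, ngram, backoff


# First 1-gram entry from the listing above: word id 10 in the English wordmap.
print(parse_arpa_entry("-2.153572 10 -0.8350325", order=1))
```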
HiFST uses Synchronous Context-Free Grammars (SCFGs) for translation. A full Hiero and a Shallow-1 translation grammar are provided in the G/ directory:
G/rules.shallow.gz : Shallow-1 hiero grammar with scalar scores
We also provide a grammar with raw, unweighted feature vectors:
G/rules.shallow.vecfea.gz : Shallow-1 hiero grammar with feature vectors
There is also a grammar provided for the true-casing example (FST-based True Casing):
G/tc.unimap
In the grammar file, each line represents a rule. The rule format is:
LHS RHS_SOURCE RHS_TARGET FEA_1 [FEA_2 FEA_3 FEA_4 ...]
where
LHS        = the left hand side of the rule
RHS_SOURCE = the source-language part of the right hand side of the rule
RHS_TARGET = the target-language part of the right hand side of the rule
FEA_i      = the i-th component of the feature vector associated with the rule
The left hand side of a rule is a non-terminal symbol (in uppercase). The right hand side is a pair of terminal and non-terminal symbol sequences in the source and target languages.
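A minimal Python sketch of reading one line in this rule format follows. The `Rule` type and `parse_rule` function are hypothetical names introduced for illustration, and the sketch assumes that symbols within each right-hand side are joined by underscores, as in the grammar files shown in this tutorial.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    lhs: str               # left hand side non-terminal
    source: List[str]      # source-side symbols of the right hand side
    target: List[str]      # target-side symbols of the right hand side
    features: List[float]  # FEA_1, FEA_2, ...


def parse_rule(line):
    """Parse 'LHS RHS_SOURCE RHS_TARGET FEA_1 [FEA_2 ...]' into a Rule."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("a rule needs LHS, two right hand sides and a feature")
    lhs, rhs_source, rhs_target = fields[:3]
    features = [float(x) for x in fields[3:]]
    return Rule(lhs, rhs_source.split("_"), rhs_target.split("_"), features)


rule = parse_rule("V 3 4_3 3.333756 0.338107 -2 -1 0 0 0 0 -1 1.662178 3.363062")
print(rule.lhs, rule.source, rule.target, len(rule.features))
```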
Scores are assigned to rules as the dot product of a rule-specific feature vector and a weight vector ([Chiang2007]; see also the discussion in Lattice MERT). This computation can be done offline, in which case the feature for every rule in the grammar is a 1-dimensional scalar. Alternatively, the decoder can be provided with a weight vector which is applied to the feature vectors while loading the grammar. For example, for the grammar G/rules.shallow.gz provided in this tutorial, the following set of weights was found via LMERT tuning (Lattice MERT):
0.697263,0.396540,2.270819,-0.145200,0.038503,29.518480,-3.411896,-3.732196,0.217455,0.041551,0.060136
These can be applied to the Shallow-1 grammar, as follows:
> gzip -d -c G/rules.shallow.vecfea.gz | head -3
V 3 4 0.223527 0.116794 -1 -1 0 0 0 0 -1 1.268789 0.687159
V 3 4_3 3.333756 0.338107 -2 -1 0 0 0 0 -1 1.662178 3.363062
V 3 8 3.74095 3.279819 -1 -1 0 0 0 0 -1 3.741382 2.271445
> GW=0.697263,0.396540,2.270819,-0.145200,0.038503,29.518480,-3.411896,-3.732196,0.217455,0.041551,0.060136
> zcat G/rules.shallow.vecfea.gz | scripts/weightgrammar -w=$GW | head -3
V 3 4 -2.046860955276
V 3 4_3 -1.884009085882
V 3 8 1.857985226112
and this should agree with the scalar-valued version of the grammar:
> zcat G/rules.shallow.gz | head -3
V 3 4 -2.046860955276
V 3 4_3 -1.884009085882
V 3 8 1.857985226112
Note that the unweighted feature vector contains all features needed to compute the score of a translation, except for the contributions from the language model(s).
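The dot product behind these scores can be checked by hand. The following Python sketch recomputes the scalar score of the three rules shown above from their feature vectors and the tuned weights. It is an illustration of the arithmetic only, not the implementation of scripts/weightgrammar; `rule_score` and `weight_rule` are hypothetical names.

```python
def rule_score(features, weights):
    """Dot product of a rule's feature vector with the weight vector."""
    if len(features) != len(weights):
        raise ValueError("feature and weight vectors differ in length")
    return sum(f * w for f, w in zip(features, weights))


def weight_rule(line, weights):
    """Return (LHS, RHS_SOURCE, RHS_TARGET) and the scalar score of a rule."""
    fields = line.split()
    features = [float(x) for x in fields[3:]]
    return fields[:3], rule_score(features, weights)


# Weight vector quoted above (the value assigned to GW).
WEIGHTS = [0.697263, 0.396540, 2.270819, -0.145200, 0.038503, 29.518480,
           -3.411896, -3.732196, 0.217455, 0.041551, 0.060136]

# First three rules of G/rules.shallow.vecfea.gz, as printed above.
VECFEA_RULES = [
    "V 3 4 0.223527 0.116794 -1 -1 0 0 0 0 -1 1.268789 0.687159",
    "V 3 4_3 3.333756 0.338107 -2 -1 0 0 0 0 -1 1.662178 3.363062",
    "V 3 8 3.74095 3.279819 -1 -1 0 0 0 0 -1 3.741382 2.271445",
]

for line in VECFEA_RULES:
    head, score = weight_rule(line, WEIGHTS)
    print(" ".join(head), f"{score:.6f}")
```

Up to floating-point rounding, the printed scores should agree with the scalar-valued entries of G/rules.shallow.gz listed above.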
In translation, a non-terminal X on the right hand side can be rewritten by any rule whose left hand side is X. HiFST places no restrictions on the definition of terminal and non-terminal sequences in an SCFG rule; similarly, there are no constraints on how many non-terminal symbols can be used in a rule (i.e. there are no constraints on the order of the SCFG). However, using a full Hiero grammar can lead to slow translation, and this tutorial discusses several strategies for pruning in translation and for translation grammar pruning.
Here is a small sample translation grammar:
M 434_M 1462_8_M -1.81842
M 7_M 9_3_M -0.735445
M V V -0
S S_X S_X 0.05768
S X X -0
V 10806 1411 1.16623
V 164_M_60 78_M_8 -0.226464
V 164_M2_60_M1 78_M1_8_M2 -0.226464
V 21_591 39_258_8 -0.510102
V 24 3_54 -2.50252
V 274_M_4 709_9_3_M -0.589246
V 5 6 -1.81729
V 7_1689 9_741_8 0.438945
V 8 23 -1.46604
X 1 1 -2.5598
X 2 2 -2.5598
X V V -0
D 1775 <dr> 10.4327
S S_D_X S_D_X 0.11536
This grammar has five non-terminal symbols, S, X, M, V and D, and a 1-dimensional weight. The terminal symbols are integers, corresponding to words in the source and target language word maps (Word Maps). The symbol <dr> is a special symbol used by HiFST to represent an empty word, to indicate deletion.
As an example, for rule
V 164_M_60 78_M_8 -0.226464
we have
LHS        = V
RHS_SOURCE = 164_M_60
RHS_TARGET = 78_M_8
WEIGHT_1   = -0.226464
With this rule the decoder can rewrite the non-terminal V by replacing it with 164_M_60 in the source language and with 78_M_8 in the target language; the rule is applied with a score of -0.226464. Similarly, rule V 5 6 -1.81729 replaces V with the word "5" in the source language and with the word "6" in the target language, with a score of -1.81729.
Indices on non-terminals indicate alignment within rules with more than one non-terminal. For example, for rule
V 164_M2_60_M1 78_M1_8_M2 -0.226464
the M non-terminals on both language sides are indexed by 1 or 2; that is, M1 in the source language is linked with M1 in the target language, and M2 in the source language is linked with M2 in the target. Strictly speaking, M2 is not a separate non-terminal in the above example: it is the second instance of the non-terminal M in the rule. Because M1 and M2 appear in opposite orders on the source and target sides, this is a reordering rule.
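Whether a rule reorders its non-terminals can be read off mechanically by comparing the order in which the indexed non-terminals appear on each side. The Python sketch below does this for the two V rules with M non-terminals from the sample grammar. The helper names are hypothetical, and the sketch assumes that non-terminals are uppercase letters with an optional numeric index while terminals are integers.

```python
import re

# Assumption: a non-terminal is uppercase letters plus an optional index (M, M1, X2).
NONTERMINAL = re.compile(r"^[A-Z]+\d*$")


def nonterminal_order(rhs):
    """Return the non-terminal symbols of one rule side, in order of appearance."""
    return [sym for sym in rhs.split("_") if NONTERMINAL.match(sym)]


def is_reordering(rhs_source, rhs_target):
    """True if linked non-terminals appear in a different order on the two sides."""
    return nonterminal_order(rhs_source) != nonterminal_order(rhs_target)


print(nonterminal_order("164_M2_60_M1"))              # ['M2', 'M1']
print(nonterminal_order("78_M1_8_M2"))                # ['M1', 'M2']
print(is_reordering("164_M2_60_M1", "78_M1_8_M2"))    # True
print(is_reordering("164_M_60", "78_M_8"))            # False
```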
In general we use different types of rule to distinguish different translation cases and obtain a finer-grained model. In this example, the S rules are doing something very similar to the glue rules used in Hiero-style systems; the M rules can be regarded as monotonic translation rules; the V rules can be regarded as reordering translation rules and phrasal translation rules; and the D rule can be regarded as an explicit operation of word deletion.
Note that there is a straightforward correspondence between the Hiero-style rule notation introduced by [Chiang2007] and the HiFST rule file format (for rules with one-dimensional weights). For example, the HiFST grammar file entries
V 164_M2_60_M1 78_M1_8_M2 -0.226464
S S_X S_X 0.05768
can be written as
V -> < 164 M2 60 M1 , 78 M1 8 M2 > / -0.226464
S -> < S X , S X > / 0.05768
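The correspondence is mechanical, as the following Python sketch suggests. The function name `to_hiero_notation` is hypothetical, and the sketch only handles rules with a single scalar weight, written in the same textual layout as the examples above.

```python
def to_hiero_notation(line):
    """Rewrite 'LHS RHS_SOURCE RHS_TARGET WEIGHT' in Hiero-style notation."""
    lhs, rhs_source, rhs_target, weight = line.split()
    source = " ".join(rhs_source.split("_"))  # underscores separate symbols
    target = " ".join(rhs_target.split("_"))
    return f"{lhs} -> < {source} , {target} > / {weight}"


print(to_hiero_notation("S S_X S_X 0.05768"))
print(to_hiero_notation("V 164_M_60 78_M_8 -0.226464"))
```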
Shallow grammars [deGispert2010] can be used to control the degree of nesting allowed within a hierarchical grammar. For example, for a Shallow-1 grammar, variables in a rule can be substituted only by phrases. That is, hierarchical rules can be used only once to generate words; once a hierarchical rule is used, translation relies on glue rules to cover longer source spans.
Formally, we use W to denote the set of terminals, and S and X to denote two non-terminals. A simple Hiero-style grammar can be defined to be:
S -> X, X
S -> S X, S X
X -> a, b    where a, b \in ({X} union W)^+
We can transform this into a Shallow-1 grammar as
S -> X, X
S -> S X, S X
Y -> a, b    where a, b \in W^+
X -> u, v    where u, v \in ({Y} union W)^+
Here Y is introduced to handle phrasal translations. The variables in rule X -> u, v can only be substituted with the Y rules.
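One way to picture the transformation is as a rewrite over a list of rules. The Python sketch below is an illustration under stated assumptions, not the procedure used to build the grammars in this tutorial: rules are hypothetical `(lhs, source, target)` tuples, phrasal X rules are moved to Y, X symbols inside hierarchical rules are renamed to Y, and an explicit `X -> Y, Y` rule (one instance of the `X -> u, v` family) is added so that a bare phrase can still be derived from X.

```python
def is_x(symbol):
    """True for the non-terminal X and its indexed forms X1, X2, ..."""
    return symbol == "X" or (symbol.startswith("X") and symbol[1:].isdigit())


def x_to_y(symbols):
    """Rename X, X1, X2, ... to Y, Y1, Y2, ...; leave other symbols alone."""
    return ["Y" + s[1:] if is_x(s) else s for s in symbols]


def to_shallow1(hiero_rules):
    """Sketch of a Hiero to Shallow-1 rewrite over (lhs, source, target) rules."""
    # Assumption of this sketch: X -> Y, Y lets X derive a plain phrase.
    shallow = [("X", ["Y"], ["Y"])]
    for lhs, source, target in hiero_rules:
        if lhs != "X":
            shallow.append((lhs, source, target))   # glue rules are kept as-is
        elif not any(is_x(s) for s in source + target):
            shallow.append(("Y", source, target))   # phrasal rule: terminals only
        else:
            shallow.append(("X", x_to_y(source), x_to_y(target)))
    return shallow


hiero = [
    ("S", ["X"], ["X"]),
    ("S", ["S", "X"], ["S", "X"]),
    ("X", ["5"], ["6"]),
    ("X", ["164", "X", "60"], ["78", "X", "8"]),
]
for rule in to_shallow1(hiero):
    print(rule)
```

After the rewrite, no X rule contains X on its right hand side, so a hierarchical rule can no longer be nested inside another hierarchical rule.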
In some cases it is desirable to allow more complex movement in translation, such as complex structure movements in Chinese-English translation. For this we can use a generalisation of the simple Shallow-1 grammar, called Shallow-N grammars. These grammars allow hierarchical rules to be applied up to N times. For example, below is the form of a Shallow-2 grammar.
S -> X, X
S -> S X, S X
Y^0 -> a^0, b^0    where a^0, b^0 \in W^+
Y^1 -> a^1, b^1    where a^1, b^1 \in ({Y^0} union W)^+
X -> u, v          where u, v \in ({Y^1} union W)^+
The Shallow-2 grammar introduces Y^0 and Y^1 to handle hierarchical rule application at two levels within a derivation. For a more detailed description of Shallow-N grammars, please refer to the HiFST paper [deGispert2010].