Cambridge SMT System
|
The ucam-smt package contains a disambiguation algorithm that supports non-functional WFSTs, i.e. ambiguous WFSTs in which repeated translation hypotheses have different label sequences. For a detailed description, please check [Iglesias2015]. The gist of the disambiguation algorithm is as follows:
The implementation of our disambiguation algorithm relies on the Tropical Sparse Tuple Semiring, which is already implemented in the ucam-smt package. It allows to define zero-weighted sparse features on this semiring, which we call Topological Features. These identify arcs/output labels in the original topologically sorted WFST. The expansion algorithm in step 3 also ensures that the topological features are placed correctly, so that a 1-1 mapping back to WFST is possible.
In one step of the previous tutorial on Rescoring with Bilingual Neural Network Models, HiFST generates affiliation lattices in output/bilmexp/AFILATS.STD/. In these lattices, each translation hypothesis has one or more affiliation sequences, i.e. these lattices are non-functional.
We can count the number of hypotheses of the first lattice as so:
> zcat output/bilmexp/AFILATS.STD/1.fst.gz | countstrings.${TGTBINMK}.bin --input=- --output=- 2601
But the number of unique hypotheses is actually much smaller:
> zcat output/bilmexp/AFILATS.STD/1.fst.gz | fstproject --project_output | fstdeterminize | countstrings.${TGTBINMK}.bin --input=- --output=- 98
This happens because there are repeated input sequences for each translation hypothesis:
> printstrings.${TGTBINMK}.bin --input=output/bilmexp/AFILATS.STD/1.fst.gz --print-input-output-labels -n 10 | grep "1 3 511 342 1480 866 11 3 3286 5 717 35351 9967 2" 1 5 6 4 2 7 7 9 9 9 12 11 10 14 1 3 511 342 1480 866 11 3 3286 5 717 35351 9967 2 75.5229 1 5 6 4 2 7 7 9 9 9 13 11 10 14 1 3 511 342 1480 866 11 3 3286 5 717 35351 9967 2 76.2495
We can disambiguate these lattices as so:
> disambignffst.${TGTBINMK}.bin --range=1:10 --input=output/bilmexp/$it/AFILATS.STD/?.fst.gz --output=output/exp.lmert/$it/AFIDETLATS/?.fst.gz --nthreads=4 --determinize-output=yes
And for the first lattice, we now have the same number of unique hypotheses:
> zcat output/bilmexp/AFIDETLATS/1.fst.gz | countstrings.${TGTBINMK}.bin --input=- --output=- 98
And we now have only the best affiliation sequence per hypothesis:
> printstrings.${TGTBINMK}.bin --input=output/bilmexp/AFIDETLATS/1.fst.gz --print-input-output-labels -n 10 -w | grep "1 3 511 342 1480 866 11 3 3286 5 717 35351 9967 2" 1 5 6 4 2 7 7 9 9 9 12 11 10 14 1 3 511 342 1480 866 11 3 3286 5 717 35351 9967 2 75.5229