The ucam-smt package contains a disambiguation algorithm that supports non-functional WFSTs, i.e. ambiguous WFSTs in which repeated translation hypotheses have different label sequences. For a detailed description, please check [Iglesias2015]. The gist of the disambiguation algorithm is as follows:

Map the topologically sorted WFST into an equivalent WFSA using weights that contain both the WFST weight and the output symbols. This requires a special semiring.
Apply WFSA determinization under this semiring to ensure that only one unique path per input string survives.
Expand the result back to an WFST that preserves arc-level alignments.

The implementation of our disambiguation algorithm relies on the Tropical Sparse Tuple Semiring, which is already implemented in the ucam-smt package. It allows to define zero-weighted sparse features on this semiring, which we call Topological Features. These identify arcs/output labels in the original topologically sorted WFST. The expansion algorithm in step 3 also ensures that the topological features are placed correctly, so that a 1-1 mapping back to WFST is possible.

In one step of the previous tutorial on Rescoring with Bilingual Neural Network Models, HiFST generates affiliation lattices in output/bilmexp/AFILATS.STD/. In these lattices, each translation hypothesis has one or more affiliation sequences, i.e. these lattices are non-functional.

We can count the number of hypotheses of the first lattice as so:

 > zcat  output/bilmexp/AFILATS.STD/1.fst.gz | countstrings.${TGTBINMK}.bin --input=- --output=-
 2601

But the number of unique hypotheses is actually much smaller:

 > zcat  output/bilmexp/AFILATS.STD/1.fst.gz | fstproject --project_output | fstdeterminize | countstrings.${TGTBINMK}.bin --input=- --output=-
 98

This happens because there are repeated input sequences for each translation hypothesis:

 > printstrings.${TGTBINMK}.bin --input=output/bilmexp/AFILATS.STD/1.fst.gz  --print-input-output-labels -n 10 | grep "1 3 511 342 1480 866 11 3 3286 5 717 35351 9967 2"
 1 5 6 4 2 7 7 9 9 9 12 11 10 14         1 3 511 342 1480 866 11 3 3286 5 717 35351 9967 2       75.5229
 1 5 6 4 2 7 7 9 9 9 13 11 10 14         1 3 511 342 1480 866 11 3 3286 5 717 35351 9967 2       76.2495

We can disambiguate these lattices as so:

 > disambignffst.${TGTBINMK}.bin --range=1:10 --input=output/bilmexp/$it/AFILATS.STD/?.fst.gz --output=output/exp.lmert/$it/AFIDETLATS/?.fst.gz --nthreads=4 --determinize-output=yes

And for the first lattice, we now have the same number of unique hypotheses:

 > zcat  output/bilmexp/AFIDETLATS/1.fst.gz   | countstrings.${TGTBINMK}.bin --input=- --output=-
 98

And we now have only the best affiliation sequence per hypothesis:

 > printstrings.${TGTBINMK}.bin --input=output/bilmexp/AFIDETLATS/1.fst.gz  --print-input-output-labels -n 10 -w | grep "1 3 511 342 1480 866 11 3 3286 5 717 35351 9967 2"
 1 5 6 4 2 7 7 9 9 9 12 11 10 14         1 3 511 342 1480 866 11 3 3286 5 717 35351 9967 2       75.5229