XLE Translation Documentation

This document describes an experimental system for translating between two languages. The system has two parts: a translator, which translates from one language into another, and a rule extractor, which extracts transfer rules from pairs of phrases or aligned sentences. Both are described below.

NB: The system is very slow for sentences more than a few words in length.

Using a Translator

To translate from one language to another, use Tcl commands like the following:

set defaulttranslator [create-translator translator.txt]
load_translation_rules $defaulttranslator rules.pl
translate "der Terrorismus"
translate-testfile german-testfile.lfg 7
translate-testfile german-testfile.lfg english-translations.txt

translator.txt is a performance vars file that has statements like the following:

setx source_grammar german-mt.lfg
setx target_grammar english-mt.lfg
setx grammar_encoding utf-8

set timeout 60
set max_xle_scratch_storage 1500

rules.pl is a file of transfer rules. Transfer rules can be hand-written, or they can be extracted from phrase pairs. The transfer rules are applied in parallel -- there is no feeding and bleeding of transfer rules.

The translator works by enumerating the parses, transferring each parse, enumerating the transfers of each parse, and then generating from each transfer. This can be time-consuming if there are many parses. You can limit the number of parses, transfers, and generations enumerated using the following Tcl variables:

set enumerate_parses 50
set enumerate_transfers 10
set enumerate_generations 10

If the translator has property weights associated with it, then you can speed things up even further by only generating from the best transfers:

set max_transfers 10
set score_diff_cutoff 5

Setting max_transfers to N tells the translator to only generate from the N best transfers (if there are more transfers with the same score as the Nth transfer, then these will be included too). Setting score_diff_cutoff to M tells the translator to ignore transfers whose score is M less than the score of the transfer with the highest score.
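
The selection rule described above can be sketched in Python. This is only an illustration of the stated behavior, not XLE's implementation: the function name select_transfers is hypothetical, the scores are assumed to be plain numbers where higher is better, and a transfer is assumed to be ignored when its score is at least M below the best score.

```python
def select_transfers(scores, max_transfers, score_diff_cutoff):
    """Return the indices of the transfers to generate from, best first.

    Hypothetical helper: assumes higher scores are better and that both
    limits are positive numbers.
    """
    if max_transfers < 1:
        raise ValueError("this sketch assumes max_transfers >= 1")
    if not scores:
        return []
    # Rank transfer indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = scores[ranked[0]]
    # Score of the Nth-best transfer; transfers tied with it are kept too.
    nth = scores[ranked[min(max_transfers, len(ranked)) - 1]]
    kept = []
    for i in ranked:
        if scores[i] < nth:
            break  # worse than the Nth-best transfer
        if best - scores[i] >= score_diff_cutoff:
            break  # assumed reading: at least M below the best score
        kept.append(i)
    return kept


scores = [10.0, 9.0, 9.0, 7.0, 4.0]
print(select_transfers(scores, max_transfers=2, score_diff_cutoff=5))
print(select_transfers(scores, max_transfers=10, score_diff_cutoff=5))
```

In the first call the third transfer is kept because it ties with the second; in the second call only the score cutoff removes anything.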

You can view intermediate translation results using the following:

show-solutions $defaultparser
show-solutions $defaulttranslator
show-solutions $defaultgenerator

The latter two commands only show the results of transferring the last parse and of generating from the last transfer. If you want to look at earlier attempts, you can set enumerate_parses and enumerate_transfers to smaller values.

Extracting Transfer Rules

Transfer rules can be extracted automatically from phrase pairs or from aligned sentences. In the first case, a single transfer rule is extracted that represents the transfer from the source phrase to the target phrase, taking into account special words that represent arguments. In the second case, atomic transfer rules are first extracted based on word alignments. Pairs and triples of adjacent atomic transfer rules are then combined to make composite transfer rules. This process is similar to how transfer rules are extracted from aligned sentences in the Pharaoh system.

The first step in extracting transfer rules is to choose the best analyses on the source and target sides. XLE chooses the pair of analyses that align the best so that the transfer rules will be as simple as possible. Thus, the source and target sentences disambiguate each other.

After an f-structure is chosen for each side, XLE simplifies the SUBJs. SUBJs that are in a predicative noun or adjective are removed so that a single transfer rule covers both predicative and attributive uses. Predicative SUBJs are also removed from the source sentence when translating so that the simplified transfer rules will match. Since the target language is expecting the predicative SUBJs to be there, you must make SUBJ addable in the generator. If SUBJ is addable, the generator will add it even if it isn't governed.

XLE also simplifies SUBJs in verbs that are controlled in one way or another. If the SUBJ of a verb is functionally controlled, or is a null pronoun (representing anaphoric control; e.g. PRON-TYPE = null), or creates a cycle (as in (^ ADJUNCT $ SUBJ) = ^), then the SUBJ is replaced with an empty SUBJ (an f-structure with no content). This makes it easier to translate from constructions that use one form of control to constructions that use another. Controlled SUBJs are also simplified in the source sentence when translating so that the simplified transfer rules will match. If the input to the generator has an empty SUBJ, then the generator will add whatever form of control is required by the target grammar.

After the SUBJs have been simplified, XLE aligns the individual f-structures using user-specified words or using word alignments. Then XLE extracts transfer rules and prints them out.

Suppressing Features

Often there are features in the f-structures that you don't want in the transfer rules. For instance, you probably don't want the tense feature in your transfer rule unless the tense feature has different values on the two sides. Otherwise, you have to produce phrase pairs for every tense of every verb. There are two ways to suppress such features. One is to use set-gen-adds remove in the standard way in the translator:

set-gen-adds remove "CASE CHECK COORD-LEVEL FOCUS-INT \
 GEND LEFT_SISTER PASS-ASP PERF=- PROG PRON-INT RIGHT_SISTER TOPIC o::"

These features will be removed from the f-structures of the phrases on both sides no matter what their values are. They will also be removed from the input to the translator.

The second way is to specify features that are removed only if they have the same value on both sides:

setx remove_equal_features "ADJUNCT AQUANT ATYPE CLAUSE-TYPE DEG-DIM DEGREE DET \
 GEND-SEM MOOD NUM NUMBER OBJ OBL-AG PASSIVE PERF PERS POSS PSEM QUANT SPEC \
 STMT-TYPE SUBJ TENSE TNS-ASP TOPIC-REL"

Features that take f-structures instead of constant values (such as SUBJ) are only removed if their values are paraphrase variables that are aligned. This allows you to extract transfer rules that work for both active and passive forms.
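
The difference between the two kinds of suppression can be illustrated with a small Python sketch. It is a simplification under stated assumptions: each side is modeled as a flat dictionary of constant-valued features (f-structure-valued features such as SUBJ are not modeled), the function name suppress_features is hypothetical, and a feature that appears on only one side is treated as differing.

```python
def suppress_features(source, target, always_remove, remove_if_equal):
    """Return filtered copies of two flat feature dictionaries.

    Hypothetical helper: always_remove plays the role of the features
    removed unconditionally, remove_if_equal the role of the features
    removed only when both sides agree.
    """
    def keep(name):
        if name in always_remove:
            return False  # removed regardless of value
        if name in remove_if_equal and name in source and name in target:
            return source[name] != target[name]  # kept only if the sides differ
        return True

    return ({k: v for k, v in source.items() if keep(k)},
            {k: v for k, v in target.items() if keep(k)})


source = {"TENSE": "pres", "CASE": "nom", "NUM": "sg"}
target = {"TENSE": "pres", "NUM": "pl"}
print(suppress_features(source, target, {"CASE"}, {"TENSE", "NUM"}))
```

Here TENSE is dropped because it matches on both sides, while NUM survives because the two sides disagree.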

Extracting Transfer Rules From Phrase Pairs

Transfer rules can be extracted from a file consisting of pairs of phrases with an equal sign between them separated by blank lines:

wie sehr = how much

der Terrorismus = terrorism

syrische = Syrian

ADVP: zu Hause = PP: at home
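
If you want to check a phrase-pair file before running the extractor, the layout shown above is easy to read with a short Python script. This is a sketch, not part of XLE: the function names are hypothetical, and it assumes that an optional prefix such as ADVP: or PP: is a category label, which the examples suggest but which is not spelled out here.

```python
import re

# Assumed form of an optional category prefix, e.g. "ADVP: zu Hause".
CATEGORY = re.compile(r"^([A-Z][A-Za-z0-9_]*):\s+(.*)$")


def split_category(phrase):
    """Split 'ADVP: zu Hause' into ('ADVP', 'zu Hause'); category may be None."""
    match = CATEGORY.match(phrase)
    if match:
        return match.group(1), match.group(2)
    return None, phrase


def parse_phrase_pairs(text):
    """Parse blank-line-separated 'source = target' entries."""
    pairs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        line = " ".join(block.split())  # collapse internal whitespace
        if not line:
            continue
        source, sep, target = line.partition(" = ")
        if not sep:
            raise ValueError("no ' = ' separator in: " + line)
        pairs.append((split_category(source.strip()),
                      split_category(target.strip())))
    return pairs


example = "wie sehr = how much\n\nADVP: zu Hause = PP: at home\n"
for pair in parse_phrase_pairs(example):
    print(pair)
```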

To extract transfer rules, use the following Tcl commands:

create-translator translator.txt
extract-paraphrase-rules phrase-pairs.txt rules.pl

The first command, create-translator, creates a translator. It takes a performance vars file as an argument, as described in the section on using a translator.

The second command, extract-paraphrase-rules, extracts transfer rules from the phrase pairs (it can also be used to extract paraphrase rules if the source grammar and target grammar are the same). Its first argument is the file of phrase pairs described above. Its second argument is the name of the output file.

If either phrase in a pair is ambiguous, then extract-paraphrase-rules will choose the analyses that are most parallel to each other. It skips phrase pairs that have a fragment parse on one side or the other.

Normally, extract-paraphrase-rules extracts a single transfer rule for each phrase pair that represents all that is in the phrase pair. If it cannot find a single rule that covers both phrases, then it will print all of the sub-transfer rules that it found. If you also want very simple back-off transfer rules for each pair of aligned words in the phrase pair, do the following in Tcl before calling extract-paraphrase-rules:

setx extractParaphraseBackoffs 1

You can also see the results of extracting one transfer rule with the following:

extract-paraphrase-rules phrase-pairs.txt 7

The test-extract-paraphrase-rules command tests extract-paraphrase-rules by extracting rules one paraphrase at a time and using the rule for each paraphrase to translate the left-hand side of that paraphrase.

Paraphrase Variables

If a word to be translated takes arguments, you can specify those arguments using dummy lexical entries:

NP1 scheint zu Vinf . = NP1 apparently Vs.

These dummy lexical entries may need to be added to the grammars:

Vinf    Vx[v,inf] * (^PRED) = 'V<(^SUBJ)>'
                    (^CHECK _VMORPH _INF) = +
                    @(DEFAULT-INF-FORM)
                    @(PASSIVE -)
                    @(VTYPE main)
                    "prefer the MT categories in the same way
                    as MWEs, i.e. via a CSTRUCTURE preference mark"
                    @(OT-MARK MultiWord).

The dummy lexical entries must have the same f-structures in the source and target languages.

The semantic forms of the dummy lexical entries also need to be added to the translator's performance vars file:

setx paraphrase_variables "NP V"

Normally, features in a paraphrase variable that are equal on both sides are excluded from the resulting transfer rule. However, you can tell the system to preserve specific features by doing the following:

setx preserve_dummy_features "MEDICINE"

This is useful if there are selectional restrictions that you want to include in the paraphrase variables. If you want argument types to be included in the transfer rules, use the following:

setx include_argument_types 1

If all of the words in a phrase pair are paraphrase variables, then extract-paraphrase-rules will extract transfer rules for the paraphrase variables. This is useful when you want to translate the left hand side of a phrase pair file as a sanity check but you need transfer rules for the paraphrase variables.

Extracting Transfer Rules From Aligned Sentences

XLE has an experimental system for extracting transfer rules from sentence pairs whose words have been aligned statistically (e.g. "Improved Alignment Models for Statistical Machine Translation" by Och et al.).

The extract_transfer_rules command expects files in the following format:

# Sentence pair (1)
adoption of the minutes of the previous sitting
genehmigung ({ 1 }) des ({ 2 }) protokolls ({ 3 4 5 }) der ({ 6 }) vorangegangenen ({ 7 }) sitzung ({ 8 }) 
# Sentence pair (2)
i refer to item 11 on the order of business .
ich ({ 1 }) beziehe ({ 2 }) mich ({ 2 3 }) auf ({ 2 6 }) punkt ({ 4 }) 11 ({ 5 }) des ({ 7 9 }) arbeitsplans ({ 8 10 }) . ({ 11 })
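
To make the alignment notation concrete, here is a small Python sketch that reads one sentence pair. It is not XLE code: the function name is hypothetical, and it assumes, based on the examples above, that the numbers inside ({ ... }) are 1-based positions of words in the plain line that precedes the annotated line.

```python
import re

# One annotated word followed by its list of positions, e.g. "protokolls ({ 3 4 5 })".
ALIGNED_WORD = re.compile(r"(\S+) \(\{([\d\s]*)\}\)")


def parse_alignment(plain_line, aligned_line):
    """Return (annotated word, [aligned plain words]) for each annotated word."""
    plain_words = plain_line.split()
    result = []
    for word, positions in ALIGNED_WORD.findall(aligned_line):
        indices = [int(p) for p in positions.split()]
        # Positions are assumed to be 1-based indices into plain_words.
        result.append((word, [plain_words[i - 1] for i in indices]))
    return result


plain = "adoption of the minutes of the previous sitting"
aligned = ("genehmigung ({ 1 }) des ({ 2 }) protokolls ({ 3 4 5 }) "
           "der ({ 6 }) vorangegangenen ({ 7 }) sitzung ({ 8 })")
for word, partners in parse_alignment(plain, aligned):
    print(word, "->", " ".join(partners))
```

Under this reading, protokolls is aligned with the three words "the minutes of".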

You can then extract transfer rules using commands like the following:

set chart [create-translator /tilde/maxwell/mt/german/extractor.txt]
extract_transfer_rules \
	-chart $chart \
	-alignments alignedsentences.txt \
	-sourceDir german \
	-targetDir english \
	-from 1 -to 1000 \
	-outRules transfer-rules.pl

-sourceDir should be a directory that contains the f-structures for the source sentences, where S1.pl is the f-structure for the first sentence. The sentences can have mixed upper and lower case, even if the alignment file lower cases all of the words. -targetDir should contain the f-structures for the target sentences. extract_transfer_rules will print the transfer rules in -outRules (or stdout if -outRules is not specified).

extract_transfer_rules extracts transfer rules for each of the well-formed f-structures that are aligned by -alignments. It also extracts transfer rules for pairs and triples of adjacent f-structures that are aligned. This is similar to the phrase-based translation used by the Pharaoh system, only applied to f-structures instead of strings.
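
The idea of combining adjacent units into pairs and triples can be sketched as follows. This is a deliberately simplified illustration: it treats the aligned units as a flat, ordered Python list, whereas XLE works on f-structures, and the function name adjacent_groups is hypothetical.

```python
def adjacent_groups(units, max_size=3):
    """Return all runs of 2..max_size adjacent units, shortest runs first."""
    groups = []
    for size in range(2, max_size + 1):
        for start in range(len(units) - size + 1):
            groups.append(tuple(units[start:start + size]))
    return groups


print(adjacent_groups(["a", "b", "c", "d"]))
```

For four units this yields three pairs and two triples, which shows why the extracted rule file contains a fair amount of overlap.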

The output of extract_transfer_rules is not suitable as input to load_translation_rules, since there is a fair amount of repetition in the transfer rules. In order to eliminate repetition and make it easier to find rules, you must collate the rules into a rule directory using the following command:

collate_transfer_rules transfer-rules.pl rule_directory

You can collate as many rule files as you want into the same rule directory. Then you can use the rule directory as input to load_translation_rules:

load_translation_rules rule_directory

Translation Property Weights

The translation system uses property weights to choose the best translation from the set of all translations produced for a sentence. Each component of the translation system has its own property weights: the parser, the transfer system, and the generator.

Most likely, the parser for the source language already has property weights. These weights are used to choose the N best parses when enumerate_parses is set to some value other than zero. They are also used as input to the later components.

Transfer Property Weights

The transfer system has its own set of weights to pick the best transfers. These weights are used by enumerate_transfers, max_transfers, and score_diff_cutoff. XLE recognizes the following property weights:

1.0 fs_attr_val INPUT_SCORE %X
1.0 fs_attr_val DOMINANCE_SCORE %X
-1.0 fs_attr_val DOMINANCE_COUNT %X
-1.0 fs_attr_val o:: DEFAULT
-10.0 fs_attr_val o:: DEFAULTPRED
-1.0 rule_trace
1.0 rule_trace * * %X           # absolute frequency
1.0 rule_trace * * * %X         # relative frequency
1.0 rule_trace * * * * %X       # head count
1.0 rule_trace * * * * * %X     # head count difference

INPUT_SCORE is the score of the input f-structure based on the parser property weights. DOMINANCE_SCORE is the score given by the language model of dominance relations on the f-structure. It is only useful if dominance_db_file has been set in the performance vars file of the translator. DOMINANCE_COUNT is the count of dominance relations. It is necessary in order to normalize over f-structures with different numbers of dominance relations. DEFAULT counts the number of default rules applied (where a feature is translated as itself). Fewer is better here. DEFAULTPRED measures the number of PRED values that were translated as themselves. Fewer is better here. rule_trace counts the number of rule applications. In general, larger rules produce fewer rule applications, so fewer rule applications is better. The statistics for absolute frequency and relative frequency are only useful if the rules were collated from a corpus of aligned sentences.

You will need the following in the performance variables file for the translator:

setx property_weights_file property-weights.txt
setx dominance_db_file dominance

Dominance relations can be extracted from a directory of unambiguous f-structures using the following command:

extract_dominance_statistics -sourceDir english \
    -from 1 -to 10 -db dominance

Generator Property Weights

The generator also has its own set of property weights. The translation system needs a specialized set of property weights since each system that provides input to the generator has its own conventions about which features are left to be filled in by the generator. Any of the standard properties used to disambiguate parses can be used to disambiguate generations. There are also some special property weights that are useful in translation:

1.0 fs_attr_val INPUT_SCORE %X
1.0 fs_attr_val GEN_NGRAM_SCORE %X
1.0 fs_attr_val GEN_WORD_COUNT %X
-1.0 fs_attr_val CONSTITUENT_MOVES %X
-10.0 fs_attr_val GEN_STARRED %X

INPUT_SCORE is the score given to the input f-structure provided by the transfer system (this score includes the parser's score as well). GEN_NGRAM_SCORE is the score given to the output of the generator by the language model. GEN_WORD_COUNT is the number of space-delimited words in the output of the generator. It is needed to normalize GEN_NGRAM_SCORE, since longer sentences tend to have a lower language model score. GEN_CONSTITUENT_MOVES is the number of constituents that were moved in going from the source language sentence to the target language sentence. Finally, GEN_STARRED is the number of ungrammatical OT marks that were required in order to generate a particular string.
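
As a rough illustration of how such weights might interact, the Python sketch below combines them as a weighted sum of property values. This is an assumption made for the example: this document does not state the exact formula XLE uses. The function names are hypothetical, and only lines of the form "weight fs_attr_val NAME %X" are read.

```python
def parse_weights(text):
    """Read 'weight fs_attr_val NAME %X' lines into a {NAME: weight} dict."""
    weights = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[1] == "fs_attr_val" and fields[3] == "%X":
            weights[fields[2]] = float(fields[0])
    return weights


def weighted_score(weights, values):
    """Assumed scoring rule: sum of weight * value over the known properties."""
    return sum(w * values.get(name, 0.0) for name, w in weights.items())


weights = parse_weights("""\
1.0 fs_attr_val INPUT_SCORE %X
1.0 fs_attr_val GEN_NGRAM_SCORE %X
1.0 fs_attr_val GEN_WORD_COUNT %X
-1.0 fs_attr_val CONSTITUENT_MOVES %X
-10.0 fs_attr_val GEN_STARRED %X
""")
values = {"INPUT_SCORE": 2.0, "GEN_NGRAM_SCORE": -12.0,
          "GEN_WORD_COUNT": 4, "CONSTITUENT_MOVES": 1, "GEN_STARRED": 0}
print(weighted_score(weights, values))
```

Under this assumed rule, the positive weight on GEN_WORD_COUNT offsets the lower language model score that longer outputs tend to receive.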

You will need to add statements like the following to the performance variables file for the generator:

setx gen_selector lcngramProc
setx gen_property_weights_file gen-property-weights.txt
setx language_model_file lm-europarl-eng-train-all-lc.srilm

For language modeling, you will need a license to the SRILM language modeling software. Then you will need libxle-lm.so so that XLE can access the SRILM package.

Generating Training Data

If you don't want to set property weights by hand, then you will need to generate training data to train the property weights using cometc. You can generate training data by specifying that the output file is a directory using the slash character:

translate-testfile german-testfile.lfg trainingdata/

translate-testfile will write out the training data in a directory structure rooted in the specified directory. The top level has directories for each sentence. The next level down has directories for each parse for the current sentence. Each parse directory has directories for each transfer structure for the current parse, as well as the parse f-structure (parse.pl). Each transfer directory has the transfer f-structure (transfer.pl), the features of the transfer f-structure (transfer-features.txt), and three files for each generation: the generated string (genN.txt), the generation f-structure (genN.pl), and the features of the generated f-structure (genN-features.txt):

s1    p1    parse.pl
            t1      transfer.pl
                    transfer-features.txt
                    gen1.txt
                    gen1.pl
                    gen1-features.txt
                    gen2.txt
                    gen2.pl
                    gen2-features.txt
                    ...
            t2      ...
      p2    ...
s2    ...
...

The transfer features in transfer-features.txt are the features in the translator's property weights file. The generation features in genN-features.txt are the features in the generator's property weights file.
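
When preparing training input, it can help to gather the feature files from this directory tree with a script. The Python sketch below does only that. It assumes the directory names follow the sN, pN, tN pattern shown in the listing; the function name collect_feature_files is hypothetical.

```python
from pathlib import Path


def collect_feature_files(root):
    """Return (transfer feature files, generation feature files) under root.

    Assumes the sN/pN/tN directory naming shown in the listing above.
    """
    root = Path(root)
    transfer_files = sorted(root.glob("s*/p*/t*/transfer-features.txt"))
    generation_files = sorted(root.glob("s*/p*/t*/gen*-features.txt"))
    return transfer_files, generation_files


if __name__ == "__main__":
    transfers, generations = collect_feature_files("trainingdata")
    print(len(transfers), "transfer feature files")
    print(len(generations), "generation feature files")
```

Keeping the two lists separate matches the two kinds of feature file, which correspond to two different property weights files.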

To train with cometc, you need to create a set of files that has the unlabeled feature weights for each sentence, and a set of files that has the labeled feature weights for each sentence. The unlabeled feature weights for a sentence are just the disjunction of all of the feature weights for that sentence. The labeled feature weights are the disjunction of the feature weights whose generated strings are correct.
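
The split between unlabeled and labeled data can be sketched in Python as follows. The sketch covers only the grouping step and says nothing about the file format that cometc expects. It assumes that a generated string counts as correct when it matches a reference translation after whitespace normalization, and the function names are hypothetical.

```python
def normalize(text):
    """Collapse runs of whitespace so that string comparison is stable."""
    return " ".join(text.split())


def split_for_training(candidates, references):
    """Split one sentence's candidates into unlabeled and labeled feature sets.

    candidates: list of (generated string, features) pairs for the sentence.
    references: the strings regarded as correct translations (an assumption
    about how correctness is decided).
    """
    correct = {normalize(r) for r in references}
    unlabeled = [features for _, features in candidates]
    labeled = [features for string, features in candidates
               if normalize(string) in correct]
    return unlabeled, labeled


candidates = [("terrorism", {"GEN_WORD_COUNT": 1}),
              ("the terrorism", {"GEN_WORD_COUNT": 2})]
unlabeled, labeled = split_for_training(candidates, ["terrorism "])
print(len(unlabeled), len(labeled))
```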

You should first train the transfer feature weights (all of the transfer-features.txt), and then train the generation feature weights (all of the genN-features.txt) using the transfer feature weights that you just obtained.