Copyright © 1993-2001 by the Xerox Corporation and Copyright © 2002-2008 by the Palo Alto Research Center. All rights reserved.
This document describes an experimental system for translating between two languages. The system has two parts: one for translating from one language into another, and a second for extracting transfer rules from pairs of phrases, described below.
NB: The system is very slow for sentences more than a few words in length.
To translate from one language to another, use Tcl commands like the following:
    set defaulttranslator [create-translator translator.txt]
    load_translation_rules $defaulttranslator rules.pl
    translate "der Terrorismus"
    translate-testfile german-testfile.lfg 7
    translate-testfile german-testfile.lfg english-translations.txt
translator.txt is a performance vars file that has statements like the following:
    setx source_grammar german-mt.lfg
    setx target_grammar english-mt.lfg
    setx grammar_encoding utf-8
    set timeout 60
    set max_xle_scratch_storage 1500
rules.pl is a file of transfer rules. Transfer rules can be hand-written, or they can be extracted from phrase pairs. The transfer rules are applied in parallel; there is no feeding and bleeding of transfer rules.
The translator works by enumerating the parses, transferring each parse, enumerating the transfers of each parse, and then generating from each transfer. This can be time-consuming if there are many parses. You can limit the number of parses, transfers, and generations enumerated using the following Tcl variables:
    set enumerate_parses 50
    set enumerate_transfers 10
    set enumerate_generations 10
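The effect of the three limits can be pictured as a cap on each level of a nested enumeration. The following Python sketch is only an illustration of that idea, not XLE's implementation: `transfer` and `generate` are hypothetical callables standing in for the transfer system and the generator, and the limits are treated as plain positive caps.

```python
from itertools import islice


def translate_candidates(parses, transfer, generate,
                         enumerate_parses=50,
                         enumerate_transfers=10,
                         enumerate_generations=10):
    """Yield (parse, transfer, string) triples under per-stage caps.

    `parses` is any iterable of parses. `transfer(parse)` and
    `generate(t)` are hypothetical callables that return iterables of
    candidates; they are assumptions made for this sketch.
    """
    for parse in islice(parses, enumerate_parses):
        for t in islice(transfer(parse), enumerate_transfers):
            for string in islice(generate(t), enumerate_generations):
                yield parse, t, string
```

With the values shown above, at most 50 x 10 x 10 = 5000 generations are enumerated per sentence, which is why lowering any one limit shrinks the total work multiplicatively.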
If the translator has property weights associated with it, then you can speed things up even further by only generating from the best transfers:
    set max_transfers 10
    set score_diff_cutoff 5
Setting max_transfers to N tells the translator to generate only from the N best transfers (if there are more transfers with the same score as the Nth transfer, then these will be included too). Setting score_diff_cutoff to M tells the translator to ignore transfers whose score is M less than the score of the transfer with the highest score.
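The two filters can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not XLE's code: higher scores are taken to be better, and a gap of exactly M from the best score is treated as already outside the cutoff, which the text above does not pin down.

```python
def select_transfers(scored, max_transfers=None, score_diff_cutoff=None):
    """Filter (transfer, score) pairs the way the two variables describe.

    Assumptions for this sketch: higher scores are better, and a
    transfer whose score is M or more below the best one is dropped.
    """
    if not scored:
        return []
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    best = ranked[0][1]
    if score_diff_cutoff is not None:
        ranked = [(t, s) for t, s in ranked if best - s < score_diff_cutoff]
    if max_transfers is not None and 0 < max_transfers < len(ranked):
        nth_score = ranked[max_transfers - 1][1]
        # Transfers tied with the Nth one are kept as well.
        ranked = [(t, s) for t, s in ranked if s >= nth_score]
    return ranked
```

For example, with scores 10, 9, 9, 8 and 4, `max_transfers=2` keeps three transfers because the third ties with the second, while `score_diff_cutoff=5` drops only the transfer scored 4.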
You can view intermediate translation results using the following:
    show-solutions $defaultparser
    show-solutions $defaulttranslator
    show-solutions $defaultgenerator
The latter two commands only show the result of transferring the last parse and the results of generating from the last transfer. If you want to look at earlier attempts, you can set enumerate_parses and enumerate_transfers to smaller values.
Transfer rules can be extracted automatically from phrase pairs or from aligned sentences. In the first case, a single transfer rule is extracted that represents the transfer from the source phrase to the target phrase, taking into account special words that represent arguments. In the second case, atomic transfer rules are first extracted based on word alignments. Pairs and triples of adjacent atomic transfer rules are then combined to make composite transfer rules. This process is similar to how transfer rules are extracted from aligned sentences in the Pharaoh system.
The first step in extracting transfer rules is to choose the best analyses on the source and target sides. XLE chooses the pair of analyses that align the best so that the transfer rules will be as simple as possible. Thus, the source and target sentences disambiguate each other.
After an f-structure is chosen for each side, XLE simplifies the SUBJs. SUBJs that are in a predicative noun or adjective are removed so that a single transfer rule covers both predicative and attributive uses. Predicative SUBJs are also removed from the source sentence when translating so that the simplified transfer rules will match. Since the target language is expecting the predicative SUBJs to be there, you must make SUBJ addable in the generator. If SUBJ is addable, the generator will add it even if it isn't governed.
XLE also simplifies SUBJs in verbs that are controlled in one way or another. If the SUBJ of a verb is functionally controlled, or is a null pronoun (representing anaphoric control; e.g. PRON-TYPE = null), or creates a cycle (as in (^ ADJUNCT $ SUBJ) = ^), then the SUBJ is replaced with an empty SUBJ (an f-structure with no content). This makes it easier to translate from constructions that use one form of control to constructions that use another form of control. Controlled SUBJs are also simplified in the source sentence when translating so that the simplified transfer rules will match. If the input to the generator has an empty SUBJ, then the generator will add whatever form of control is required by the target grammar.
After the SUBJs have been simplified, XLE aligns the individual f-structures using user-specified words or using word alignments. Then XLE extracts transfer rules and prints them out.
Often there are features in the f-structures that you don't want in the transfer rules. For instance, you probably don't want the tense feature in your transfer rule unless the tense feature has different values on the two sides. Otherwise, you have to produce phrase pairs for every tense of every verb. There are two ways to suppress such features. One is to use set-gen-adds remove in the standard way in the translator:
    set-gen-adds remove "CASE CHECK COORD-LEVEL FOCUS-INT \
        GEND LEFT_SISTER PASS-ASP PERF=- PROG PRON-INT RIGHT_SISTER TOPIC o::"
These features will be removed from the f-structures of the phrases on both sides no matter what their values are. They will also be removed from the input to the translator.
The second way is to specify features that are removed only if they have the same value on both sides:
    setx remove_equal_features "ADJUNCT AQUANT ATYPE CLAUSE-TYPE DEG-DIM DEGREE DET \
        GEND-SEM MOOD NUM NUMBER OBJ OBL-AG PASSIVE PERF PERS POSS PSEM QUANT SPEC \
        STMT-TYPE SUBJ TENSE TNS-ASP TOPIC-REL"
Features that take f-structures instead of constant values (such as SUBJ) are only removed if their values are paraphrase variables that are aligned. This allows you to extract transfer rules that work for both active and passive forms.
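The difference between the two kinds of suppression can be sketched in Python. Going by the variable name remove_equal_features and the tense example above, this sketch treats the second list as "remove only when both sides agree". It models f-structures as flat dictionaries of atomic values, which is a simplification made for illustration: f-structure-valued features such as SUBJ, and the alignment check they require, are not modeled.

```python
def strip_features(source, target, always_remove, remove_if_equal):
    """Return copies of two flat feature dicts with suppressed features removed.

    `always_remove` plays the role of the set-gen-adds remove list and
    `remove_if_equal` the role of remove_equal_features. Flat dicts of
    atomic values are an assumption of this sketch.
    """
    def keep(feature):
        if feature in always_remove:
            return False
        if (feature in remove_if_equal
                and feature in source and feature in target
                and source[feature] == target[feature]):
            return False
        return True

    return ({f: v for f, v in source.items() if keep(f)},
            {f: v for f, v in target.items() if keep(f)})
```

With `TENSE` equal on both sides it disappears from the rule, so one phrase pair covers every tense, while a `NUM` mismatch is kept because the rule needs it.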
Transfer rules can be extracted from a file consisting of pairs of phrases with an equal sign between them separated by blank lines:
    wie sehr = how much

    der Terrorismus = terrorism

    syrische = Syrian

    ADVP: zu Hause = PP: at home
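If you want to check a phrase-pair file before handing it to XLE, a small reader along the following lines may help. It is a sketch based only on the example above: the optional category prefix (as in "ADVP:" and "PP:") is inferred from that example, and the function names are hypothetical.

```python
import re

# Assumption: a category prefix is an uppercase-initial label followed by a colon.
CATEGORY = re.compile(r"^([A-Z][A-Za-z0-9_]*):\s+(.*)$")


def split_category(phrase):
    """Split 'ADVP: zu Hause' into ('ADVP', 'zu Hause'); no prefix gives None."""
    m = CATEGORY.match(phrase)
    return (m.group(1), m.group(2)) if m else (None, phrase)


def read_phrase_pairs(text):
    """Parse blank-line-separated 'source = target' entries."""
    pairs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        line = " ".join(block.split())
        if not line:
            continue
        source, sep, target = line.partition(" = ")
        if not sep:
            raise ValueError("no ' = ' separator in: " + line)
        pairs.append((split_category(source.strip()),
                      split_category(target.strip())))
    return pairs
```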
To extract transfer rules, use the following Tcl commands:
    create-translator translator.txt
    extract-paraphrase-rules phrase-pairs.txt rules.pl
The first command, create-translator, creates a translator. It takes a performance vars file as an argument, such as described in the section on using a translator.
The second command, extract-paraphrase-rules, extracts transfer rules from the phrase pairs (it can also be used to extract paraphrase rules if the source grammar and target grammar are the same). Its first argument is the file of phrase pairs described above. Its second argument is the name of the output file.
If either phrase in a pair is ambiguous, then extract-paraphrase-rules will choose the analyses that are most parallel to each other. It skips phrase pairs that have a fragment parse on one side or the other.
Normally, extract-paraphrase-rules extracts a single transfer rule for each phrase pair that represents all that is in the phrase pair. If it cannot find a single rule that covers both phrases, then it will print all of the sub-transfer rules that it found. If you also want very simple back-off transfer rules for each pair of aligned words in the phrase pair, do the following in Tcl before calling extract-paraphrase-rules:
setx extractParaphraseBackoffs 1
You can also see the results of extracting one transfer rule with the following:
extract-paraphrase-rules phrase-pairs.txt 7
test-extract-paraphrase-rules tests extract-paraphrase-rules by extracting rules one paraphrase at a time and using the rule for each paraphrase to translate the left-hand side of the paraphrase.
If a word to be translated takes arguments, you can specify those arguments using dummy lexical entries:
NP1 scheint zu Vinf . = NP1 apparently Vs.
These dummy lexical entries may need to be added to the grammars:
    Vinf Vx[v,inf] * (^PRED) = 'V<(^SUBJ)>'
                     (^CHECK _VMORPH _INF) = +
                     @(DEFAULT-INF-FORM)
                     @(PASSIVE -)
                     @(VTYPE main)
                     "prefer the MT categories in the same way as MWEs,
                     i.e. via a CSTRUCTURE preference mark"
                     @(OT-MARK MultiWord).
The dummy lexical entries must have the same f-structures in the source and target languages.
The semantic forms of the dummy lexical entries also need to be added to the translator's performance vars file:
setx paraphrase_variables "NP V"
Normally, features in a paraphrase variable that are equal on both sides are excluded from the resulting transfer rule. However, you can tell the system to preserve specific features by doing the following:
setx preserve_dummy_features "MEDICINE"
This is useful if there are selectional restrictions that you want to include in the paraphrase variables. If you want argument types to be included in the transfer rules, use the following:
setx include_argument_types 1
If all of the words in a phrase pair are paraphrase variables, then extract-paraphrase-rules will extract transfer rules for the paraphrase variables. This is useful when you want to translate the left-hand side of a phrase pair file as a sanity check but you need transfer rules for the paraphrase variables.
XLE has an experimental system for extracting transfer rules from sentence pairs whose words have been aligned statistically (e.g. Improved Alignment Models for Statistical Machine Translation by Och et al.).
The extract_transfer_rules command expects files in the following format:
    # Sentence pair (1)
    adoption of the minutes of the previous sitting
    genehmigung ({ 1 }) des ({ 2 }) protokolls ({ 3 4 5 }) der ({ 6 }) vorangegangenen ({ 7 }) sitzung ({ 8 })
    # Sentence pair (2)
    i refer to item 11 on the order of business .
    ich ({ 1 }) beziehe ({ 2 }) mich ({ 2 3 }) auf ({ 2 6 }) punkt ({ 4 }) 11 ({ 5 }) des ({ 7 9 }) arbeitsplans ({ 8 10 }) . ({ 11 })
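To make the format concrete, here is a small Python reader for it. This is a sketch inferred from the two sentence pairs above and is not part of XLE: it assumes each entry is a header line, then a plain line of words, then a line in which every word is followed by the 1-based positions of the words it is aligned to in the plain line.

```python
import re

# One aligned token: a word followed by "({ i j ... })".
TOKEN = re.compile(r"(\S+) \(\{([\d ]*)\}\)")
NUMBER = re.compile(r"\((\d+)\)")


def read_alignments(lines):
    """Yield (pair_number, indexed_words, aligned_tokens) for each entry.

    `aligned_tokens` is a list of (word, [positions]) where positions
    are 1-based indexes into `indexed_words`.
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header.startswith("# Sentence pair"):
            continue
        m = NUMBER.search(header)
        number = int(m.group(1)) if m else None
        indexed_words = next(it, "").split()
        aligned = [(word, [int(i) for i in idx.split()])
                   for word, idx in TOKEN.findall(next(it, ""))]
        yield number, indexed_words, aligned
```

In the first sentence pair, for instance, "protokolls" is linked to positions 3, 4 and 5, that is, to "the minutes of".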
You can then extract transfer rules using commands like the following:
    set chart [create-translator /tilde/maxwell/mt/german/extractor.txt]
    extract_transfer_rules \
        -chart $chart \
        -alignments alignedsentences.txt \
        -sourceDir german \
        -targetDir english \
        -from 1 -to 1000 \
        -outRules transfer-rules.pl
-sourceDir should be a directory that contains the f-structures for the source sentences, where S1.pl is the f-structure for the first sentence. The sentences can have mixed upper and lower case, even if the alignment file lower cases all of the words.
-targetDir should contain the f-structures for the target sentences. extract_transfer_rules will print the transfer rules in -outRules (or stdout if -outRules is not specified).
extract_transfer_rules extracts transfer rules for each of the well-formed f-structures that are aligned by -alignments. It also extracts transfer rules for pairs and triples of adjacent f-structures that are aligned. This is similar to the phrase-based translation used by the Pharaoh system, only applied to f-structures instead of strings.
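The combination of atomic units into pairs and triples can be illustrated with a sliding window. The Python sketch below is only an analogy: it assumes the aligned units can be laid out in a linear order, whereas how XLE decides that two f-structures are adjacent is not described here.

```python
def adjacent_groups(units, max_size=3):
    """Return every run of 1..max_size adjacent units, in order of size.

    `units` is a sequence of aligned atomic units in linear order,
    which is an assumption made for this sketch.
    """
    groups = []
    for size in range(1, max_size + 1):
        for start in range(len(units) - size + 1):
            groups.append(tuple(units[start:start + size]))
    return groups
```

Four atomic units therefore give four singles, three pairs and two triples, which is one reason the raw output contains a fair amount of repetition across sentences.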
The output of extract_transfer_rules is not suitable as input to load_translation_rules, since there is a fair amount of repetition in the transfer rules. In order to eliminate repetition and make it easier to find rules, you must collate the rules into a rule directory using the following command:
collate_transfer_rules transfer-rules.pl rule_directory
You can collate as many rule files as you want into the same rule directory. Then you can use the rule directory as input to load_translation_rules:
load_translation_rules rule_directory
The translation system uses property weights to choose the best translation from the set of all translations produced for a sentence. Each component of the translation system has its own property weights: the parser, the transfer system, and the generator.
Most likely, the parser for the source language already has property weights. These weights are used to choose the N best parses when enumerate_parses is set to some value other than zero. They are also used as input to the later components.
The transfer system has its own set of weights to pick the best transfers. These weights are used by enumerate_transfers, max_transfers, and score_diff_cutoff. XLE recognizes the following property weights:
    1.0 fs_attr_val INPUT_SCORE %X
    1.0 fs_attr_val DOMINANCE_SCORE %X
    -1.0 fs_attr_val DOMINANCE_COUNT %X
    -1.0 fs_attr_val o:: DEFAULT
    -10.0 fs_attr_val o:: DEFAULTPRED
    -1.0 rule_trace
    1.0 rule_trace * * %X # absolute frequency
    1.0 rule_trace * * * %X # relative frequency
    1.0 rule_trace * * * * %X # head count
    1.0 rule_trace * * * * * %X # head count difference
INPUT_SCORE is the score of the input f-structure based on the parser property weights. DOMINANCE_SCORE is the score given by the language model of dominance relations on the f-structure. It is only useful if dominance_db_file has been set in the performance vars file of the translator. DOMINANCE_COUNT is the count of dominance relations. It is necessary in order to normalize over f-structures with different numbers of dominance relations.
DEFAULT counts the number of default rules applied (where a feature is translated as itself). Fewer is better here.
DEFAULTPRED measures the number of PRED values that were translated as themselves. Fewer is better here.
rule_trace counts the number of rule applications. In general, larger rules produce fewer rule applications, so fewer rule applications is better. The statistics for absolute frequency and relative frequency are only useful if the rules were collated from a corpus of aligned sentences.
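One way to read such a weights file is as a linear model: each property contributes its weight times its value, and the transfer with the highest total wins. The Python sketch below works under that assumption, which is not confirmed by this document, and the property values in the usage note are invented purely for illustration.

```python
def weighted_score(weights, values):
    """Sum weight * value over all weighted properties.

    Assumption: the score is a linear combination. Properties missing
    from `values` count as zero.
    """
    return sum(w * values.get(name, 0.0) for name, w in weights.items())


# Weights copied from a subset of the listing above.
TRANSFER_WEIGHTS = {
    "INPUT_SCORE": 1.0,
    "DEFAULT": -1.0,
    "DEFAULTPRED": -10.0,
    "rule_trace": -1.0,
}
```

Under these weights, a hypothetical transfer with one untranslated PRED (`DEFAULTPRED` = 1) loses 10 points, so it ranks below an otherwise similar transfer that merely applied two default feature rules.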
You will need the following in the performance variables file for the translator:
    setx property_weights_file property-weights.txt
    setx dominance_db_file dominance
Dominance relations can be extracted from a directory of unambiguous f-structures using the following command:
    extract_dominance_statistics -sourceDir english \
        -from 1 -to 10 -db dominance
The generator also has its own set of property weights. The translation system needs a specialized set of property weights since each system that provides input to the generator has its own conventions about which features are left to be filled in by the generator. Any of the standard properties used to disambiguate parses can be used to disambiguate generations. There are also some special property weights that are useful in translation:
    1.0 fs_attr_val INPUT_SCORE %X
    1.0 fs_attr_val GEN_NGRAM_SCORE %X
    1.0 fs_attr_val GEN_WORD_COUNT %X
    -1.0 fs_attr_val CONSTITUENT_MOVES %X
    -10.0 fs_attr_val GEN_STARRED %X
INPUT_SCORE is the score given to the input f-structure provided by the transfer system (this score includes the parser's score as well). GEN_NGRAM_SCORE is the score given to the output of the generator by the language model. GEN_WORD_COUNT is the number of space-delimited words in the output of the generator. It is needed to normalize GEN_NGRAM_SCORE, since longer sentences tend to have a lower language model score. GEN_CONSTITUENT_MOVES is the number of constituents that were moved in going from the source language sentence to the target language sentence. Finally, GEN_STARRED is the number of ungrammatical OT marks that were required in order to generate a particular string.
You will need to add statements like the following to the performance variables file for the generator:
    setx gen_selector lcngramProc
    setx gen_property_weights_file gen-property-weights.txt
    setx language_model_file lm-europarl-eng-train-all-lc.srilm
For language modeling, you will need a license to the SRILM language modeling software. Then you will need libxle-lm.so so that XLE can access the SRILM package.
If you don't want to set property weights by hand, then you will need to generate training data to train the property weights using cometc. You can generate training data by specifying that the output file is a directory using the slash character:
translate-testfile german-testfile.lfg trainingdata/
translate-testfile will write out the training data in a directory structure rooted in the specified directory. The top level has directories for each sentence. The next level down has directories for each parse for the current sentence. Each parse directory has directories for each transfer structure for the current parse, as well as the parse f-structure (parse.pl). Each transfer directory has the transfer f-structure (transfer.pl), the features of the transfer f-structure (transfer-features.txt), and three files for each generation: the generated string (genN.txt), the generation f-structure (genN.pl), and the features of the generated f-structure (genN-features.txt):
    s1
      p1
        parse.pl
        t1
          transfer.pl
          transfer-features.txt
          gen1.txt
          gen1.pl
          gen1-features.txt
          gen2.txt
          gen2.pl
          gen2-features.txt
          ...
        t2
          ...
      p2
        ...
    s2
      ...
    ...
The transfer features in transfer-features.txt are the features in the translator's property weights file. The generation features in genN-features.txt are the features in the generator's property weights file.
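When preparing the training data for a trainer, it can be handy to flatten this tree into one record per generated string. The Python sketch below walks the layout described above; the function name, the record fields and the UTF-8 encoding are assumptions of the sketch, not something XLE provides.

```python
import re
from pathlib import Path

# Matches gen1.txt, gen2.txt, ... but not gen1-features.txt.
GEN_STRING = re.compile(r"gen(\d+)\.txt$")


def collect_generations(root):
    """Return one record per genN.txt found under root/sX/pY/tZ/."""
    rows = []
    for string_file in sorted(Path(root).glob("s*/p*/t*/gen*.txt")):
        m = GEN_STRING.match(string_file.name)
        if m is None:
            continue
        t_dir = string_file.parent
        rows.append({
            "sentence": t_dir.parent.parent.name,
            "parse": t_dir.parent.name,
            "transfer": t_dir.name,
            # Assumption: generated strings are stored as UTF-8 text.
            "string": string_file.read_text(encoding="utf-8").strip(),
            "gen_features": t_dir / ("gen" + m.group(1) + "-features.txt"),
            "transfer_features": t_dir / "transfer-features.txt",
        })
    return rows
```

Paths are sorted as strings, so s10 sorts before s2; sort the records numerically afterwards if sentence order matters.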
To train with cometc, you need to create a set of files that has unlabeled feature weights for each sentence, and a set of files that has labeled feature weights for each sentence. The unlabeled feature weights for a sentence are just the disjunction of all of the feature weights for a sentence. The labeled feature weights are the disjunction of the feature weights whose generated strings are correct.
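The split into unlabeled and labeled sets can be sketched as follows in Python. How correctness is judged is not specified here, so the sketch assumes an exact match against a set of reference translations. The file format that cometc expects is not covered.

```python
def split_for_training(candidates, reference_strings):
    """Split one sentence's candidates into unlabeled and labeled sets.

    `candidates` is a list of (generated_string, features) pairs for a
    single sentence. Assumption: a candidate counts as correct when
    its string exactly matches one of `reference_strings`.
    """
    unlabeled = [features for _, features in candidates]
    labeled = [features for string, features in candidates
               if string in reference_strings]
    return unlabeled, labeled
```

The unlabeled set always contains every candidate, so the labeled set is a subset of it. A sentence with no correct candidate yields an empty labeled set.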
You should first train the transfer feature weights (all of the transfer-features.txt), and then train the generation feature weights (all of the genN-features.txt) using the transfer feature weights that you just obtained.