Copyright © 1993-2001 by the Xerox Corporation and Copyright © 2002-2007 by the Palo Alto Research Center. All rights reserved.
XLE is a combination of linguistic tools developed at PARC and Grenoble XRCE, plus a Tcl user interface to them. This document currently only describes the Tcl user interface. Documentation on the LFG formalism is in a separate file.
If you are viewing this document from an HTML browser, you can get to a table of contents by clicking on any of the underlined headers.
For Windows, you also need to download xle-windows-dependencies.zip and unzip it in your XLE directory. Set the environment variables by running "XLEDIR\xle-setup.bat XLEDIR". You have to supply the full path of the XLE directory as the argument to the script.
Each user of XLE should enable XLE by adding the following to their .login file (replace $xledir with the name of the XLE directory):
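A typical line for a csh-style .login file is the following (a sketch; the exact setting depends on your installation):
setenv PATH $xledir/bin:$PATH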
This can be put in a script for convenience. (NB: Do not create a link to $xledir/bin/xle directly instead of using PATH, since this will cause problems for XLE.)
If you use bash, you should use something like:
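export PATH=$xledir/bin:$PATH
(This is again a sketch; as before, replace $xledir with the name of the XLE directory.)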
The only thing that XLE depends on is an X server. All of the platforms come with their own X server except MacOS X. MacOS 10.3.5 (Panther) comes with an X server, but it is not installed by default. If it is not installed on your Mac, then run the install disks again after unchecking everything else and checking the X Server box. It can also be downloaded from the Apple web site.
Some X servers (such as Exceed) do not update window bar titles when they are changed programmatically. This means that you may need to make a window into an icon and open it again in order to have the window bar display the correct information.
If you are at PARC, the first thing that you will need to do is to enable XLE, either by typing "enable xle" in your shell or by putting "enable xle" in your .login file. If you are at another site, check with the person who installed XLE to find out how to enable it, or read about enabling XLE in the documentation on installing XLE.
Once you have enabled XLE, all that you need to do to start XLE is to type "xle" in your shell (note: do not fork via "xle &", because XLE uses the shell for command inputs). When XLE is loaded, the shell that XLE was started in will be converted into an interactive Tcl shell. The Tcl shell uses "%" as a prompt.
To load a grammar for parsing, type:
% create-parser "filename"
To parse a sentence, type
% parse "sentence"
or
% parse {sentence}
For example:
% create-parser english.lfg
% parse {This is a sentence.}
The results of the parse will be automatically displayed in a Tree Window and an F-Structure Window described below.
NB: XLE doesn't work well under Emacs on Windows. You might try Eclipse instead. (Update: for newer Windows versions, Notepad++ or similar text editors are recommended instead. (Nov. 2020))
There is a special emacs mode, lfg-mode, that makes some aspects of grammar
writing and interacting with XLE easier. To use lfg-mode, load the library
package lfg-mode.el with the command M-x load-library lfg-mode,
or put the command
(load-library "lfg-mode")
in your .emacs file. The file lfg-mode.el or the compiled version lfg-mode.elc should be placed in a directory on your Emacs load-path so that Emacs can find it. Alternatively, you can specify the complete file name when executing the load-library command.
For files with the extension .lfg, lfg-mode is invoked automatically when the library package lfg-mode is loaded. Otherwise, use the command M-x lfg-mode to invoke lfg-mode.
lfg-mode supports the coding systems introduced in emacs-20. (You need this version of Emacs to work with non-roman alphabets.) The default coding system used in xle is iso-latin-1 (aka 8859-1), which is suitable for most European languages, but this can be overridden by the user by placing a command like
(setq xle-buffer-process-coding-system alternative-coding-system)
in the .emacs file before the call to load lfg-mode. For example, to use the coding system named junet, place the command
(setq xle-buffer-process-coding-system 'junet)
This will cause Emacs to use the coding system junet instead of the default iso-latin-1. If you use emacs-20, a list of coding systems is available from the top-level Mule menu under the subheading "Describe coding systems."
If you have the package imenu.el, lfg-mode will give you a menu of lexical items, rules, and templates and options for starting an XLE process in another window (see documentation below). Also, buffers created with the command inferior-xle-mode or with the menu options in an lfg window will have a menu of options for starting and restarting xle and for creating a parser.
By default, lfg-mode uses font-lock, so that parts of expressions are displayed in different colors. To switch off font-lock mode, use the command M-x font-lock-mode. The variable lfg-color-level determines what parts of an expression appear in color. The default is for several parts of expressions to appear in color. If you want only comments to appear in color, set the variable lfg-color-level to 0 in your .emacs file. This must appear before you load lfg-mode. For example:
(setq lfg-color-level 0)
(load-library "lfg-mode")
The default color set uses bright colors. For more muted colors, using the default Emacs color values, set the variable lfg-more-colors to nil in your .emacs file before loading lfg-mode:
(setq lfg-more-colors nil)
(load-library "lfg-mode")
To customize colors for the mode, put something like the following in your .emacs file:
(add-hook 'lfg-mode-hook (function (lambda () (set-face-foreground font-lock-string-face "ForestGreen"))))
This will produce comments in ForestGreen. Use M-x list-colors-display to see what colors are available.
When the cursor is in a buffer that is in lfg-mode, an additional menu item "LFG" becomes available at the top of the screen. Clicking with the left mouse button on this menu item gives several choices:
You can search for rules, rule macros, templates, and lexical items in other buffers that have been loaded into Emacs by positioning your cursor on the word you are searching for and executing the command M-" (meta-quotation mark). To go back to the original buffer and position, execute the command C-" (control-quotation mark). Unless you have created a TAGS database for the files you wish to search (see below), this will only work for buffers for which menus have already been created. To create a menu for a buffer in lfg-mode, choose the command "Rules, templates, lexical items" in the LFG menu.
The extended searching option will also work for all files for which you have created a TAGS database. You can create a TAGS database for a file or files by executing the Unix command gtags with the filename(s) as the argument. For example, this command will create a TAGS database for all files with the extension .lfg in the current directory:
gtags *.lfg
The TAGS database will need to be recreated periodically as your files change. To update the database, execute the gtags command again. If you have created a TAGS database, then you can also use Emacs's tags commands; consult Emacs Info for more information on using tags.
There is another special emacs mode, xle-mode, that makes interactions
with the XLE buffer easier. As with lfg-mode, this mode is available by
loading the library package lfg-mode.el with the command M-x load-library
lfg-mode, or putting the command:
(load-library "lfg-mode")
in your .emacs file.
For XLE buffers created by using the menus in a buffer in lfg-mode, or with the commands M-x run-xle or M-x run-new-xle, xle-mode is invoked automatically.
When the cursor is in a buffer that is in xle-mode, an additional menu item "XLE" becomes available at the top of the screen. Clicking with the left mouse button on this menu item gives several choices:
You can specify the grammar that these commands use by default by setting the variable lfg-default-parser in your .emacs file before loading lfg-mode:
(setq lfg-default-parser "mylanguage.lfg")
(load-library "lfg-mode")
COMMENT{TYPE NAME TEXT}
If the same comment is associated with more than one type and name, the following abbreviation can be used:
COMMENT{{TYPE1 NAME1}{TYPE2 NAME2} TEXT}
COMMENT can be abbreviated COM. For comments between definitions, COMMENT-FILE or COM-FILE should be used rather than COMMENT; otherwise, an incorrect NAME for the comment may result.
Any value of TYPE may be used. The following values and their abbreviations are standard:
TYPE | ABBREVIATION
FEATURE | FEAT
TEMPLATE | TEMP
MACRO |
CATEGORY | CAT
LEX-ENTRY | LEX
OTMARK | OT
PARAMETER | PARAM
EXAMPLE | EX
MISC |
NAME is the name of the feature/template/rule being commented on for comments with COMMENT, or the name of the file for comments with COMMENT-FILE.
TEXT is the text of the comment, which may not contain a closing curly bracket (}).
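For example, a comment on a hypothetical CASE feature could be written as:
COM{FEAT CASE This feature records morphological case.}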
These comments are collected by the commands M-x lfg-display-comments (for the current file), M-x lfg-display-comments-all (for all .lfg files in the current directory), and M-x lfg-display-comments-config (for all .lfg files mentioned in the CONFIG). These commands are also accessible from the LFG menu. Executing these commands produces a new buffer containing comments organized by NAME and TYPE. If any EXAMPLE comments are found, these also appear in a separate buffer in test file format.
For example, to make an XLE test file always be displayed with Japanese characters, you could add the following to the first line of the file:
# -*- mode: lfg; coding: euc-jp -*-
or the following near the end of the file (the # character is used because this is the comment character for XLE test files):
# Local Variables: #
# mode: lfg #
# coding: euc-jp #
# End: #
Similarly, if you wanted to make the file japanese.lfg always be displayed with Japanese characters, you could add the following to the first line of the file:
" -*- mode: lfg; coding: euc-jp -*- "or the following near the end of the file:
" Local Variables: "The quote character is used here because this is the comment character for XLE grammar and lexicon files. For more information on the possible variables that can be set using this method, please see the Emacs documentation on File Variables.
" coding: euc-jp "
" End: "
Information on how to enter accented characters in Emacs is available in the Emacs documentation. See the section on Emacs File Variables for information on how to specify the character set of a file.
If you use Emacs 22 or later, then you can use Mule for Unicode character sets. Here is an example of how you would use Mule to work on a Russian grammar encoded in UTF-8:
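One way to do this (a sketch that relies on the file-variable convention described above) is to put a coding declaration on the first line of the grammar file:
" -*- mode: lfg; coding: utf-8 -*- "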
eXLEpse is an Eclipse-based grammar development environment for XLE, available for download here:
http://merkur61.inf.uni-konstanz.de/exlepse/downloads/eXLEpse-1.0.3-SDK-3.6.1-win32.zip
http://merkur61.inf.uni-konstanz.de/exlepse/downloads/eXLEpse-1.0.3-SDK-3.6.1-macosx-cocoa.zip
http://merkur61.inf.uni-konstanz.de/exlepse/downloads/eXLEpse-1.0.3-SDK-3.6.1-macosx-carbon.zip
http://merkur61.inf.uni-konstanz.de/exlepse/downloads/eXLEpse-1.0.3-SDK-3.6.1-linux-gtk.tar.gz
The whole thing comes with an EPL license (free software), just like Eclipse itself. The packages include the Eclipse SDK and the eXLEpse plugin ('perspective'), so it's quite a large download (~180 MB). You can unzip the package anywhere you like. Then start the Eclipse application in the main folder.
After you have created or chosen a workspace, you can create a new project for your grammar using 'File - New - Project...'. In the wizard, just select 'General - Project'. After choosing a name, you can add your .lfg files (and any other files belonging to your grammar) to the project.
Also integrated is the Subclipse plugin, which lets you inspect/check out/commit to Subversion repositories. If you have further questions on this, please contact me, as I have played with that a little bit already.
If you have comments and/or feature requests about eXLEpse, please send an email to Sebastian Sulger (sebastian.sulger@uni-konstanz.de).
The Tcl Shell Interface is the main interface to the XLE system. You can use it to load grammars, parse sentences, and view documentation. Also, process messages such as "loading grammar ... " get printed in the shell. Here are examples of commands that you can type:
% create-parser "demo-eng.lfg"
% parse Hello
% set parser [create-parser "demo-eng.lfg"]
% parse {This is a test.} $parser
% help
The Tcl Shell uses Tcl syntax for commands. This means that the double-quote ("), left brace ({), right brace (}), left bracket ([), right bracket (]), and dollar sign ($) characters are treated specially. All of the Tcl commands and syntactic conventions are available to you in the Tcl Shell.
The dollar sign is used to signal that the following token should be replaced by its value, so that
% parse {This is a test.} $parser
means take the value of the variable parser and pass it as the parser.
Braces are used to group things into a single argument. In the examples above, Hello and {This is a test} are both the first argument of parse. Putting braces around Hello is optional, but if the braces were missing from {This is a test}, then Tcl would have complained that there were too many arguments.
Double quotes are also used to group things into a single argument. The only difference between braces and double quotes is that dollar sign substitutions are not allowed inside of braces. So if you want to parse a sentence with dollar signs, braces, or double quotes in it, you need to use braces:
% parse {This cost $5.}
% parse {"Yikes!", he said.}
If you use braces around the input, the only special character that you will have to worry about is the backslash (\) character. In particular, a backslash followed by a brace (\}) is not treated as a close brace.
XLE automatically looks in the home directory and the current directory for .xlerc files (e.g. files named xlerc or .xlerc). The home directory .xlerc file allows you to customize your Tcl Shell however you want. For instance, you can customize the fonts that xle uses in the display. You can also write your own functions in Tcl for you to use and put their definitions in your home .xlerc file (see below). The current directory .xlerc file is useful for grammar-specific tasks. For instance, you can put the command create-parser in the .xlerc file of a grammar directory so that the grammar will be automatically loaded when you start xle from its directory.
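For example, the xlerc file of a grammar directory might contain just the following line (grammar file name illustrative):
create-parser english.lfg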
When XLE is loading the .xlerc files, it first looks in the home directory for a .xlerc file and loads whatever it finds. Then, if the current directory is not the home directory, it looks in the current directory for a .xlerc file and loads whatever it finds. If there is both a xlerc file and a .xlerc file in the same directory, then XLE loads both and prints a warning. You can override the .xlerc file loaded from the current directory by using the following syntax: xle yourfile.tcl or xle -file yourfile.tcl. You can also use these in combination, e.g. xle firstfile.tcl -file secondfile.tcl or xle -file firstfile.tcl -file secondfile.tcl.
Tcl makes it easy to write short procedures to do common tasks. For instance, suppose that you are always parsing sentences from a test file by typing something like parse-testfile my-testfile.lfg N, for different values of N. If you are tired of typing all this each time, you can add the following to your .xlerc file:
proc test {N} {
parse-testfile my-testfile.lfg $N
}
The $N means "substitute the value of N here". Once this is defined, typing test 7 will accomplish the same as typing parse-testfile my-testfile.lfg 7.
If you want to suppress loading of the default .xlerc files, you can use the command line argument -noxlerc. This option will not prevent the loading of explicitly specified files.
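For example, the following starts XLE without loading any default .xlerc files but still loads an explicitly specified script (file name illustrative):
xle -noxlerc -file myscript.tcl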
Another command line option is to specify a Tcl command that will run
immediately after all the files are loaded, and before the interactive session
begins. This is done using the -exec (or -e) option. You can use this facility
to start xle in a certain mode or perform any task that you wish. In particular,
you can include in your script files arbitrary procedures that can be invoked
this way. For example,
xle -e "create-parser grammarfile; parse {John sleeps.}"
will start xle, load a grammar and parse an initial sentence.
If you use the -noTk option, then XLE will suppress the initialization of Tk and X. This is useful for running XLE in a batch mode where there is no X server. If the script invokes Tk, then you will get a message about an application-specific error involving $DISPLAY, or Tcl will complain that "winfo" is undefined.
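For example, a batch run on a machine without an X server might look like the following (file names illustrative):
xle -noTk -e "create-parser english.lfg; parse-testfile my-testfile.lfg"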
If you use Perl or another scripting language to give Tcl commands to XLE, then you may run into problems where the output of Tcl commands implemented by XLE isn't synchronized with the output of standard Tcl commands. If this happens, add an xle_flush command after the XLE Tcl command.
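For example, a script driving XLE might send the following two commands so that the output of the first is flushed before the script reads it (a sketch):
parse {This is a test.}
xle_flush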
We use Tcl to display trees, f-structures, and charts. The parts of these displays that are mouse-sensitive change their color when the mouse moves over them. You can find out what a mouse-sensitive part does by clicking on it with the right mouse button (<Button-3>). The documentation uses the Tcl convention for describing mouse actions. Typical mouse actions are <Button-3> (click the right mouse button), <Control-Button-1> (click the left mouse button while holding down the Control key), and <Control-Shift-Button-1> (click the left mouse button while holding down the Control and Shift keys). The documentation for pull-down menu items can be obtained by typing 'h' with the cursor over the menu item.
You can control the appearance of the graphical interface by changing fonts and window sizes (useful when running XLE on a small screen). The window sizes of the four main windows can be controlled by setting the Tcl variables "window_width" and "window_height", which measure the window dimensions in pixels (the default values are 500 and 400 respectively).
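For example, to make the windows smaller, you could put something like the following in your .xlerc file (values illustrative):
set window_width 400
set window_height 300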
There are now four different variables to control the use of fonts in the various XLE displays, one for the user interface (i.e. buttons), one for displaying feature structures, one for displaying texts and one for displaying trees. The terminals (surface forms) in the tree displayer use the text font, which makes it possible to separate them visually from the rest of the nodes. To change the fonts, simply set the variables "xleuifont", "xletextfont", "xletreefont" and "xlefsfont". Font names can be given using the naming convention of either X windows (e.g. -adobe-courier-medium-r-normal--12-120*) or Tk (e.g. {Times 14 bold}). Note that it is now possible to use fonts with proportional spacing, although the default font is still Courier.
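For example, you could put something like the following in your .xlerc file (font choices illustrative):
set xleuifont {Helvetica 12}
set xletreefont {Times 14 bold}
set xletextfont {Courier 12}
set xlefsfont -adobe-courier-medium-r-normal--12-120*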
Another aspect of the graphical interface that can be modified is the vertical and horizontal spacing in the tree layout algorithm. Following a facility that exists in the Medley system, this is done by changing the variables "CSTRUCTUREMOTHERD" and "CSTRUCTUREPERSONALD". Note, however, that the spacing is also proportional to the font used for displaying trees and therefore modifying these variables should not be necessary in general.
The tree window is used to display c-structure trees. There can only be one tree display window active at a time. There is a row of buttons and menus at the top that provide some control, plus the nodes of the trees are sensitive to mouse commands. You can get more documentation for the buttons and the mouse-sensitive nodes by clicking on them with the right mouse button (<Button-3>).
The items in the Views menu control how the tree is displayed. Each item has a one-letter accelerator after it. Typing this letter in the tree window has the same effect as clicking on its menu item. The "node numbers" menu item determines whether node numbers are displayed on the nodes of the tree. The "partials" menu item determines whether or not the partial constituents used internally by XLE to deal with multiple daughters are displayed. The "tokens" menu item determines whether the terminals of the tree correspond to the tokens produced by the tokenizer or to the surface forms of the input sentence.
Trees are displayed in a standard inverted format, with the root at the top and the leaves (lexical items) at the bottom. Only one tree is displayed at a time in the tree window. If you want another tree, then click on the "next" button. This will give the next good tree unless there are none, in which case it will give you the next bad tree. The "prev" button will let you back up. Trees are numbered according to the order that they are displayed. This means that if you parse the same sentence twice, there is no guarantee that the trees will get the same numbers in both parses, unless you only use "prev" and "next" to visit the trees (using "most probable" or clicking on a choice in one of the fschart windows can change the display order). If there is a file loaded for choosing the most probable parse, then that parse will be displayed first.
You can look for a particular tree using the tree window or the bracketing window. To look for a particular tree using the tree window you construct the tree by starting from the top node and recursively choosing the correct sub-tree at every level. Only the nodes that have dotted lines under them have more than one sub-tree. You can get the next sub-tree of a node by clicking on it with the left button (<Button-1>). You can get the previous sub-tree via <Shift-Button-1>. If there are no more trees in the direction indicated, the button will flash. If you can't find the desired sub-tree, it may be that you are looking at the wrong constituent. For processing reasons, xle has a different node for each place that a rule might stop (i.e. for each final state in the underlying finite-state machine that corresponds to the rule). So if you can't find the sub-tree that you want, see if the node above it has another sub-tree with the same category name but a different node number.
If a node in a tree is boxed, then that is a node where the f-structure information went bad (i.e. there are no valid f-structures above this node). If you click on a boxed node with the middle button (<Button-2>) you should get to see the first bad f-structure. Sometimes, however, no f-structure shows up. This may be because the node that is actually bad is a partial node that is not shown on the display; this frequently occurs when the node dominates a sublexical rule. You should click on the "partials" menu item in order to see it. Or, you can click on each daughter node with <Button-2> to see which daughter's f-structure went bad.
If you click on a tree node with <Button-2>, the f-structure associated with that node is displayed in the f-structure window.
The bracketing window allows you to select contiguous material in the parsed sentence and insist that it be a constituent, possibly of some particular category. The XLE graphical interface constrains the trees it displays to be consistent with the choices you've made in the bracketing window. You may also insist that some material not be a constituent.
The bracketing window is displayed by clicking on the "Show Bracket Window" menu item in the Commands menu of the tree window. In the bracketing window, the sentence is displayed with alternate tokenizations shown above one another. Buttons (labelled with "#") appear between the tokens.
Clicking <Button-1> on two of these "#" buttons brackets the material in between. The brackets are shown in the window. Only trees and solutions in which the bracketed material forms a constituent are displayed in XLE's tree and f-structure windows. Alternatively, clicking <Shift-Button-1> on two of these buttons "debrackets" the enclosed material; only trees and solutions in which the debracketed material does not form a constituent are displayed. Debrackets are displayed using "![" and "!]". A pair of brackets or debrackets can be removed by clicking <Control-Shift-Button-1> on one element of the pair. Clicking <Button-1> on a bracket pops up a menu of the categories that span the bracketed material. You can specify that a category be excluded, in which case no tree containing that category spanning the bracketed material will be displayed. You can specify that a category be included, in which case only trees containing that category spanning the bracketed material will be displayed. If you click on the "show" button for a category, then XLE will display the category's edge in the tree display window.
The f-structure window is used to display feature structures. Feature structures are displayed as an attribute-value structure. Links between different parts of a feature structure (from, for instance, (^ SUBJ) = (^ XCOMP SUBJ)) are displayed by giving a path name. The f-structure is not mouse sensitive.
At the top of the f-structure window there is a row of buttons and menus that affect the f-structure display window. In general, you can get the documentation for each button by clicking on the button with the right-most mouse button (<Button-3>). Documentation for the menu items can be obtained by typing 'h' while holding the cursor over the menu item.
The "prev" and "next" buttons allow you to enumerate the feature structures (f-structures) in a manner similar to the "prev" and "next" buttons in the tree window. The valid feature structures are displayed first, and then any invalid feature structures. The reasons that a feature structure is invalid are highlighted in black. This may involve highlighting a relation that would not otherwise be visible, such as negated constraints or the arglist relation. When relations need to be displayed, they are put after the attributes. Their "value" is actually a set of values enclosed by parentheses. Sometimes the values are preceded by a "~" or a "c", which indicates that their is a negative or sub-c constraint on that value. Attributes which are equated to a constant value usually just display that value, but if there are additional constraints, then an attribute-value structure is displayed with a single equality relation with multiple values.
The items in the Views menu control how the f-structure is displayed. Each item has a one-letter accelerator after it. Typing this letter in the f-structure window has the same effect as clicking on its menu item. The "abbreviate attributes" menu item suppresses all of the attributes except those that appear in the abbrevAttributes Tcl variable. The "constraints" menu item determines whether or not negated and sub-c constraints are included in the display. The "node numbers" menu item determines whether node numbers for each f-structure are displayed in a column along the left side of the f-structure. The "subc constraints" menu item determines whether or not sub-c constraints are included in the display.
Above each f-structure there is a label and buttons for each projection from that f-structure. Clicking with the left mouse button on a projection button causes that projection to be displayed. Projections that have a * in them (like o::*) are actually projections off of the c-structure that are displayed in the f-structure window for convenience.
Whenever you click on a node in a tree with the middle button while the Control key is held down, the constraints associated with that node will be printed in the Constraints Window (if you also hold down the Shift key, then the constraints associated with the partial above the node will be printed). The constraints will come from the lexicon if the node is a pre-terminal node, and otherwise they will come from a rule. The constraints are the base constraints that are obtained when all of the templates have been expanded. Constraints that are filtered from the grammar before instantiation are printed with a comment after them. For instance, if an =c constraint is globally incomplete, it will be printed with a "GLOBALLY INCOMPLETE" comment following it. Similarly, if an optimality constraint has a NOGOOD mark, then it will be printed with a "NOGOOD OPTIMALITY MARK" after it. Since these constraints aren't instantiated, they won't appear in the f-structure window (even among the invalid f-structures), and so the only way to see them is to use the Constraints Window.
The f-structure chart windows are used to display two different views of a packed representation of all of the valid solutions. One window indexes the solutions by constraints. The result is an f-structure that is annotated with choices to show where alternatives are possible. The other window indexes the solutions by choices. The result is a tree of choices with their corresponding constraints. The choices in both windows are active. When you click on a choice, then a solution corresponding to that choice is displayed in the tree window and the f-structure window.
The f-structure chart window indexes the packed solutions by their constraints, so that each constraint appears once in an f-structure annotated by all of the choices where that constraint holds. By default, this window appears at the upper right of the display. There are three menu items under the Views menu that control how the f-structure is displayed. Each item has a one-letter accelerator after it. Typing this letter in the f-structure chart window has the same effect as clicking on its menu item. The "abbreviate attributes" menu item suppresses all of the attributes except those that appear in the abbrevAttributes Tcl variable. The "constraints" menu item determines whether or not negated and sub-c constraints are included in the display. Finally, the "linear" menu item changes the display into a line of tokens with corresponding f-structures.
The f-structure chart choices window indexes the packed solutions by the alternative choices. By default, this window appears at the lower right of the display. Choices are labeled a:1, a:2, a:3, ... b:1, b:2, b:3, etc. The choices that belong to the same disjunction have the same alphabetic string as a prefix. The disjunctions are laid out vertically in the window, with the "a" disjunction shown first, then the "b" disjunction shown second, and so on. At the left of each disjunction is its context. Top level disjunctions are given the True context. Embedded disjunctions are given the choice that they are embedded under. Sometimes disjunctions are embedded under more than one choice (because of the way that the chart works). When this happens, then the context is itself disjunctive. Alternatives within a disjunction are laid out vertically, with the constraints that they select displayed on their right. If a constraint has a conjunctive context, then the constraint will show up under both choices contextualized by the remaining choice. Thus a:1 & b:2 -> foo will appear under a:1 as b:2 -> foo and under b:2 as a:1 -> foo. When an f-structure only has one predicate name, then its name will be printed after the f-structure variable for easy identification (e.g. f12:WITH $ (f15:TELESCOPE ADJUNCT)). If there are no f-structure constraints, it is sometimes useful to display the subtree constraints (by using the "subtrees" option in the Views menu), for finding solutions that only differ in the c-structure.
The choices in the f-structure chart window and the f-structure chart choices window are active. When you click on a choice with the left button, then a solution corresponding to that choice is displayed in the tree window and the f-structure window. A selection is a fully specified choice of exactly one solution. One way to select more than one solution is to use the narrowing facility (which is somewhat related to underspecification). You can mark a choice as being nogood by clicking on a choice button in the f-structure chart choices window with the middle mouse button. This will effectively mask out all the solutions that include the choice. XLE will grey the button and report the number of remaining solutions in the title of the window. Clicking the choice button again will toggle the nogood property back to its previous setting. If you want to specify that only one choice in a disjunction is good, you can press the shift key and the middle button at the same time. This will turn every other choice in the disjunction to nogood.
To enumerate the solutions according to the choices they include, you can use the "next solution" and "prev solution" buttons in the chart choices window. This enumeration honors the nogoods so it only goes over the narrowed set of solutions. If you want to include the nogood solutions, press the shift key when clicking on "next solution" and "prev solution" buttons.
Also, if you enumerate the solutions using the next/prev buttons on the tree or fstructure window, the selections corresponding to the currently displayed solution will be highlighted in the fschart window. If you want to clear the selection, there is a "Clear Selection" command in the command menu of this window. This can be useful when you want to print the choices in the Prolog format and you don't want the selection to be recorded there as well.
The menu items in the Views menu control what is shown and how it is shown. Each item has a one-letter accelerator after it. Typing this letter in the f-structure chart choices window has the same effect as clicking on its menu item. The "abbreviate attributes" menu item suppresses all of the attributes except those that appear in the abbrevAttributes Tcl variable. The "constraints" item causes sub-c and negated constraints to be displayed when it is enabled. The "disjunctions only" item only shows the disjunction structure (no constraints). The "OT marks only" item only shows the optimality mark constraints. The "subtrees" item causes c-structure subtrees to be displayed as binary rules. The "unoptimal" button causes XLE to display unoptimal solutions by disabling the OPTIMALITYORDER defined in the grammar config. The next three items are mutually exclusive: only one can be enabled at a time. They have to do with how the disjunctions are laid out. The "flat choices" item is a flat list of disjunctions. The "nested choices" item causes each disjunction to be nested within the choice that defines its context, under the constraints that are particular to that context. If a choice has several disjunctions embedded underneath it, then the disjunctions will be separated by a black line for readability. The "re-entrant choices" item is like the "nested choices" item, except that disjunctions that are defined in a complex context (such as "{a:1 | b:1}") are displayed once at the level of the disjunctions in the TRUE context instead of being duplicated under each context (e.g. once under "a:1" and once under "b:1").
If the disjunctions are sufficiently complicated, then XLE will not be able to display the disjunctions nested within the window size limits allowed by Tcl. In this case, XLE will turn off nesting and print the following message:
A nested disjunction had to be truncated.
Nesting has been turned off so that
the 'more' button can do the right thing.
Just below the menu items is a list of optimality marks that are present in any of the solutions to the current input. The list is prefixed with "OTOrder:". The optimality marks are in the order given by the current OPTIMALITYORDER. If an optimality mark is prefixed with a "*", then it is an ungrammatical mark. If it is prefixed with a "+", then it is a preference mark. Clicking on an optimality mark in this list temporarily removes it from the ranking so that it has no effect on the relative ranking of analyses. Clicking again restores the optimality mark.
Both the f-structure chart window and the f-structure chart choices window can be very long. If either window exceeds a certain limit, then the displayer will add a "more" button at the end. Clicking on the "more" button will cause a new window to be created that displays another chunk of the data.
KNOWN PROBLEMS
The packed representation used for the display is the same one used by "print-chart-graph" and "print-prolog-chart-graph". The structure of the disjunctions that appears in these displays is an artifact of the computation, and won't always match one's intuitions about how the disjunctions should be factored. At some future time we may try to add code for re-factoring disjunctions. Also, the code for extracting a packed representation will produce an incorrect representation when you are extracting from the generator and there is a solution with discontinuous heads that have the same category. There isn't even code to detect this situation, so you should use this on the generator with caution.
The code for producing a packed representation attempts to normalize the packed representation by rewriting equalities and eliminating redundant constraints. Unfortunately, normalizing the packed representation can cause XLE to timeout. In this case, XLE will print out the message:
extract_chart_graph aborted because of timeout.
You might try setting normalize_chart_graphs to 0 and try again.
Setting "normalize_chart_graphs" to 0 will turn off normalization, which may allow XLE to produce the packed representation within the time alloted.
The chart window displays all of the edges in the chart. The edges are stacked according to depth: the lexical items are at the bottom, the edges that build on lexical items are immediately above them, those that build on pre-terminals are above them, and so on.
The morphology window displays all of the morphological edges. The edges are displayed with the tokens on the left, descending vertically in the order in which they appear in the sentence. To the right of each token come any preterminals for that token (e.g. matching lexical entries that have a * morph code). Below the token preterminals are lexical forms for the token, and to the right of each lexical form are preterminals for the lexical forms. If there is a ?? at the end of the edge name for a lexical edge, then the lexical edge wasn't found in the lexicon. If there is a ? at the end of the edge name for a pre-terminal edge, then the pre-terminal came from the -unknown entry. If there is a * at the end of an edge, then the edge didn't have any valid solutions or the solutions weren't computed.
Tcl has a built-in facility for traversing menu items within a window without using a mouse. It is invoked by typing the F10 key. After the F10 key is typed, the first menu of the window that currently has the input focus will be displayed. You can choose an element within this menu by using the up and down arrows. After the desired menu item is selected, type a carriage return to invoke it. You can cycle through the different menus associated with the current window by using the left and right arrow keys. Typing ESC aborts the menu traversal mode initiated by the F10 key.
In addition to Tcl's built-in facility, XLE provides a means for cycling the input focus through the XLE windows. Typing the F9 key will cause XLE to move the input focus from an XLE window that has the input focus to the next XLE window. This means that repeatedly typing the F9 key will cycle the input focus through all of the XLE windows including the Tcl shell window. However, typing F9 will not move the input focus from the Tcl shell window to an XLE window unless the Tcl shell window is in an Emacs buffer and lfg-mode is loaded.
A generator is the inverse of a parser. A parser takes a string as input and produces f-structures as output. A generator takes an f-structure as input and produces all of the strings that, when parsed, could have that f-structure as output. The generator can be useful as a component of translation, summarization, or a natural language interface. It can also be used to test whether a grammar overgenerates.
Sometimes it is not desirable for the generator to be the exact inverse of the parser. For instance, although the parser eliminates extra spaces between tokens, you may not want the generator to insert arbitrary spaces between tokens. To handle this, you can make the generation grammar be a little different from the parsing grammar by changing the set of optimality marks used (through the GENOPTIMALITYORDER configuration field) and by changing the set of transducers used (through the "G!" prefix in the morphology configuration file). These mechanisms allow you to vary the generation grammar as needed while still being able to share as much as possible with the parsing grammar.
The generator in XLE is associated with the following commands:
The sections below describe the use of these commands.
create-generator "grammarfile"
where "grammarfile" is the root file of a grammar. This will create a generator that uses the given grammar file, except that the GENOPTIMALITYORDER optimality marks will be used instead of the OPTIMALITYORDER optimality marks and the G! morphological transducers will be used instead of the P! morphological transducers.
Creating a generator may take a considerable amount of time, because it may require indexing the lexicons on their content, if previously-created indexes are no longer valid (because of changes to the lexicon, templates, or initial grammar file). This also has the side effect of checking the syntax of the content of any files that have to be re-indexed. This can be handy for debugging and catching typographical errors. If all of the files have to be re-indexed, then XLE will report the names of any features in the feature declaration section that are never used in the grammar. You can force all of the files to be re-indexed if you want by making a minor change to the root file of the grammar.
The generator takes f-structures as input. The only format that is currently supported is the Prolog format that the parser produces as output. To generate from a Prolog file, use the following command:
generate-from-file "filename" ("count")
The "filename" argument specifies the name of the prolog file that contains the f-structure to be used as input. The "count" argument specifies how many f-structures to generate from if the prolog file is a packed representation of multiple f-structures. The default value for "count" is 9999.
There is also a way to generate from a set of files in a single directory:
generate-from-directory "directoryname"
This command will enumerate the files in the given directory and call generate-from-file on each one. You can create a directory full of prolog files using
parse-testfile "testfile" -outputPrefix "dirname/"
(the trailing slash is required and the directory named "dirname" must already exist).
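For example, the following parses a testfile, writes one prolog file per sentence into the existing directory "out", and then generates from each of those files (names illustrative):
% parse-testfile my-testfile.lfg -outputPrefix out/
% generate-from-directory out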
You can also generate from the output of the parser using the "Generate from this FS" command in the f-structure window of the parser.
The generator produces a packed representation of all of its output strings using a regular expression notation. For instance:
You
{ make copies of books magazines or other bound {and|&} large documents or
can also detach the scanner if you want to scan.
|can also detach the scanner if you want to scan or make copies of books
magazines or other bound {and|&} large documents.}
The generator output is printed on $gen_out_strings, which defaults to stdout. Error messages are printed on $gen_out_msgs. These variables should be set using set-gen-outputs (e.g. "set-gen-outputs output.txt stderr").
If you want all of the generations listed separately instead of as part of a packed representation, set the Tcl variable gen_selector to allstrings, e.g.:
setx gen_selector allstrings
You can set this variable in the Tcl shell or in the performance variables file. This variable must be set before the generator starts generating unless you use the following notation in the Tcl shell:
setx gen_selector allstrings $myGenerator
You can revert to the standard regular expression output using:
setx gen_selector ""
You can force the generator to only produce one output by setting the Tcl variable gen_selector to one of shortest or longest.
In principle the generator should produce strings that, when parsed, have an f-structure that exactly matches the input f-structure. However, sometimes one wants to allow the generator to produce strings that have an f-structure that matches the input except for a few features. This is particularly true if the grammar has some features that are only used internally to the grammar. It may be difficult for the user of the generator to know how these features should be set. To allow underspecified input, XLE allows the user to specify a set of features that should be removed from the input and a set of features that can be added by the generator. These can be set in the following way:
set-gen-adds add "FEAT1 !FEAT2 FEAT3=foo @INTERNALATTRIBUTES"
set-gen-adds remove "FEAT4 FEAT5 @INTERNALATTRIBUTES"
The notation FEAT3=foo means that only this particular attribute-value pair is addable. To make more than one value addable, use set-gen-adds add "FEAT3=foo FEAT3=fum".
The notation !FEAT2 means that embedded features must be listed as addable in order to be added. The default is that if a feature is added, then any features embedded underneath can also be added whether or not they are listed as addable.
@INTERNALATTRIBUTES expands to all of the attributes that are not listed in EXTERNALATTRIBUTES in the grammar configuration.
The generator will freely add any attributes or attribute-value pairs that are declared to be addable if they are consistent with the input. However, all of the nodes in the c-structure must map to feature structures that exist in the input. This means that if you have a rule like NP --> (DET: (^ SPEC)=!) ..., then you cannot make SPEC be underspecified. However, if the rule is NP --> (DET: ^=!) ... and SPEC is defined in the lexicon, then you can allow SPEC to be underspecified.
If a governable function is declared addable using set-gen-adds, then the governable function can be added even if the resulting f-structure would be incoherent. This is useful for adding SUBJ when translating between languages that have different conventions about whether adjectives take subjects.
You can also associate an optimality mark with addable attributes. Whenever an addable attribute is added to the f-structure of a generation string, then its optimality mark will also be added. The effect of this depends on the optimality mark's position in GENOPTIMALITYORDER. Here is an example:
set-gen-adds add @ALL AddedFact
set-gen-adds add @INTERNALATTRIBUTES NEUTRAL
set-gen-adds add @GOVERNABLERELATIONS NOGOOD
set-gen-adds add @SEMANTICFUNCTIONS NOGOOD
In this example, all of the attributes are first assigned the user-specified AddedFact OT mark. Then the internal attributes are assigned the NEUTRAL OT mark, which makes them freely addable. Then the governable relations and semantic functions are assigned the NOGOOD OT mark, which means that they cannot be added. The net effect is that all of the attributes other than the internal attributes, the governable relations and the semantic functions are assigned the AddedFact OT mark. These attributes can be added to the f-structure of a generation string at a cost.
Note that it is possible to have more than one call to set-gen-adds, and that the calls are additive. For backward compatibility, calling set-gen-adds with no OT mark removes any existing addable attributes.
create-generator defaults to "set-gen-adds add @INTERNALATTRIBUTES".
Sometimes the input to the generator is underspecified in that the relative order of internal f-structures is not given. For instance, the adjuncts of an f-structure may not have been ordered using the scope relation. This can produce an exponential number of different outputs. You can control this by adding the BADSEMFORMIDORDER OT mark to GENOPTIMALITYORDER. This mark is added to the f-structure of a generation string whenever two semantic form ids are not in numeric order. This means that the generator will try to generate so that the generation string preserves the order given by the semantic form ids in the input f-structure. This is just a preference, though: if the grammar doesn't allow strings to preserve the semantic form id order, then the generator will pick the string that has the fewest semantic form ids out of order. Since the parser guarantees that the semantic form ids are ordered by string position, this means that the generator will tend to pick a generation string that preserves the order of the parse string. Since the transfer component tends to preserve semantic form ids, translations also tend to preserve the order of the source string.
In some circumstances you can use the generator to produce an exhaustive list of all possible variations of a grammatical phenomenon. For instance, suppose that you wanted to see all the ways that a verb could be inflected in English. You can accomplish this by removing the tense and aspect features from the input to the generator and also making them addable so that the generator can freely add them back in:
set-gen-adds remove "TNS-ASP @INTERNALATTRIBUTES"
set-gen-adds add "TNS-ASP MOOD PERF PROG TENSE @INTERNALATTRIBUTES"
Note that although you only need to remove TNS-ASP from the input since the tense and aspect features are embedded under it, you need to make all of the tense and aspect features addable. If you left one of the tense and aspect features out, then the generator would not generate anything.
Now, parse a simple sentence like John sleeps and use the "Generate from this FS" command to generate from its f-structure. The tense and aspect features will be stripped from the f-structure before it is given to the generator, and the generator will produce all possible forms:
John
{ { will be
|was
|is
|{has|had} been}
sleeping
|{{will have|has|had}|} slept
|sleeps
|will sleep}
This technique can only be used for grammatical phenomena that only vary in the values of a few features or feature complexes. This technique won't work if any of the underspecified features map to a c-structure node. For instance, if the grammar had (^ TNS-ASP)=!, where ! contained the values of the TNS-ASP feature cluster, then the generator would have refused to generate any analysis that used this constraint, since the generator only introduces c-structure nodes that map to an f-structure in the input. On the other hand, any phenomenon that obeys these restrictions can be enumerated using this technique. For instance, you might be able to generate all of the specifiers by ignoring the SPEC feature. So this technique can be used for things beyond inflectional paradigms.
You can also use set-gen-adds remove to remove particular values:
set-gen-adds remove "PERF=-"
Sometimes you want the generator to produce an output even if the input f-structure is ill-formed (e.g. it is not a possible output of the parser). For instance, you may want a translation system to produce an output even if some of the input f-structure didn't get translated. Also, for debugging purposes it is easier to see why an output was invalid than to see why the generator produced no output at all. XLE has two techniques to help with this. The first is to allow the generator to relax the relationship between the input f-structure and what is generated through some special optimality theory marks (OT marks). The second is to allow for a fragment grammar for generation, similar in spirit to the fragment grammar for parsing.
XLE defines some special OT marks that are useful for robust generation: MISSINGFACT, DUPLICATESEMFORMID, and BADSCOPE. They have the following interpretation:
The generator uses these OT marks to choose the generation string that minimizes the mismatch between the input and the generation string's f-structure. If one of these special OT marks is not listed in an OPTIMALITYORDER (or GENOPTIMALITYORDER), then it is implicitly NOGOOD (e.g. the OT mark and its behavior is disabled).
The second technique that XLE has for robust generation involves a fragment grammar for generation. This is similar to the fragment grammar for parsing, except that it is designed to match any input f-structure rather than being designed to match any input string. The idea is to stitch together well-formed generation strings by using a cover grammar that can match any input f-structure. Here is a sample fragment grammar for generation:
GENFRAGMENTS --> {"fragments"
NP | VP | S | ADV | ADJ
|"f-structures"
GENFRAGMENTS*: GenFragment $ o::*
{ (^ %ATTR1)=!
|! $ (^ %MOD1)}
PRED
GENFRAGMENTS*: GenFragment $ o::*
{ (^ %ATTR2)=!
|! $ (^ %MOD2)}
|"sets"
[GENFRAGMENTS: GenFragment $ o::*
! $ ^;
COMMA]+
(CONJ)
GENFRAGMENTS: GenFragment $ o::*
! $ ^
(^ COORD)=+_
}.
-token CONJ * GenFragment $ o::*
(^ COORD-FORM) = %stem;
PRED * GenFragment $ o::*
{ (^ PRED)='%stem'
|(^ PRED)='%stem<%ARG1>'
|(^ PRED)='%stem<%ARG1 %ARG2>'
|(^ PRED)='%stem<%ARG1 %ARG2 %ARG3>'
|(^ PRED)='%stem<%ARG1 %ARG2 %ARG3 %ARG4>'
}.
This GENFRAGMENTS rule produces a tree that has well-formed generation strings based on NP, VP, S, etc. at the bottom of the tree. These well-formed generation fragments are stitched together using the GENFRAGMENTS rule. The constraints (^ %ATTR1)=! and ! $ (^ %MOD1) are designed to match any attribute in the input that has the corresponding structure (NB: any variable name can be used for %ATTR1, %ATTR2, %MOD1, and %MOD2). This means that the generator can generate even if the input attribute is unknown to the grammar. The attribute name variables like %ATTR1 can be constrained just like any other variable. For instance, adding %ATTR2 ~= SUBJ to the rule near (^ %ATTR2)=! means that the SUBJ attribute cannot follow the head. Adding %ATTR1 ~$ {OBJ OBJ2 OBL} near (^ %ATTR1)=! means that the OBJ, OBJ2, and OBL attributes cannot precede the head. These sorts of constraints can be useful for reducing the number of different orders that the generation fragments can appear in.
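For instance, splicing the first of these constraints into the rule above would give a disjunct like the following (a sketch):
{ (^ %ATTR2)=!
  %ATTR2 ~= SUBJ
|! $ (^ %MOD2)}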
The PRED and CONJ categories in this -token entry are designed to match a PRED or COORD-FORM attribute and generate its value as a string. They work because, by convention, the -token head word is replaced by the value of %stem, which matches the value of the attribute in the input f-structure.
The GenFragment OT mark is a user-specified OT mark (e.g. XLE doesn't assign a special interpretation to GenFragment). It is used in this fragment grammar to minimize the number of generation fragments produced. It should be put in the NOGOOD section of the parser's OPTIMALITYORDER, so that the parser doesn't make use of the GenFragment constructions. The GENFRAGMENTS rule is added to the grammar when the generator fails if it is included in the grammar config as the value of the REGENCAT field. To keep generation tractable, the generator eliminates any REGENCAT edges that are not headed.
The OT marks described above can be added to the GENOPTIMALITYORDER field of the grammar config along with other optimality marks. Here is an example of how they might be used:
GENOPTIMALITYORDER
ParseOnly NOGOOD
*GenFragment
MISSINGFACT AddedFact BADSCOPE DUPLICATESEMFORMID
STOPPOINT
GenMisc.
In this example, MISSINGFACT is dispreferred more than AddedFact, which is dispreferred more than BADSCOPE, which is dispreferred more than DUPLICATESEMFORMID (if you want them all equally dispreferred, then you can put them in parentheses). AddedFact is an example of a user-specified OT mark that is associated with addable attributes using set-gen-adds:
set-gen-adds add "NUM PERS NTYPE" AddedFact
Given this GENOPTIMALITYORDER, the generator first tries to generate without relaxing any constraints (e.g. only using the hypothetical GenMisc OT mark). If that fails, it then tries to relax the relationship between the input f-structure and the f-structure of a generation string using the MISSINGFACT, BADSCOPE and DUPLICATESEMFORMID OT marks, picking the generation string(s) that most closely match the input f-structure. At this point, generation strings are grammatical, although they might not correspond to the input f-structure. Finally, the generator allows the GenFragment OT mark, which enables the GENFRAGMENTS rule. The GenFragment OT mark is marked with a * to indicate that the output strings will be ungrammatical. The generator will choose the string(s) that minimize the number of generation fragments.
Most of the time, when the generator fails to generate a string it is because of one of the following:
One approach to debugging with the generator is to first parse the string that you want to generate. If the string doesn't parse, then it won't generate, either. If the string parses, then pick the f-structure that is closest to the f-structure that you were generating with. If this f-structure doesn't generate, then the problem is probably with the grammar rather than the input. If the f-structure generates, then use debug-gen-input badfstructure.pl goodfstructure.pl to find out why the bad f-structure file doesn't generate but the good f-structure file does. Fix any problems that it finds until the bad f-structure generates.
Another approach to debugging with the generator is to look for the desired tree in the generation chart and see why it failed. You can look for a tree by typing show-solutions $defaultgenerator in the Tcl shell and then enumerating subtrees until you get to the desired tree. Unfortunately, this can be slow and error-prone, since there can be many edges in the generation chart with the same generation vertex and category, and it is hard to tell which is the desired one. (Edges in the chart are indexed by the f-structure in the input that maps to ^. There can be several edges with the same index and category because edges also encode positions in the grammatical rule, positions in the morphology, and some state about resources that have been consumed.) To make this process easier, you can filter edges from the chart using a facility similar to the parser's bracketing tool. This is described in the next paragraph.
If you click on "Show Bracket Window" in the tree window or type show-chart-nav in the Tcl shell, then XLE will display a window of vertices in the generation chart. Each vertex will have an index and a predicate name after it (or ? if there is no predicate). The index corresponds to the GOAL relation of the input f-structure, which can be found by typing show-input and then clicking on the "constraints" item under the Views menu. (The GOAL relation is an internal relation used by XLE to index the input to the generator.)
If you click on a "show" button for a vertex in the vertex window, then XLE will show a category menu for that vertex. This is a list of c-structure categories that the generator attempted to build for this vertex. You can exclude categories by clicking on "out" and then clicking on "Apply". This will filter these categories from the chart, reducing the number of trees that you have to look through to find the right tree. (It is not easy to make edges required as in the parser because a generation chart doesn't have some of the properties that a parse chart has.) If you click on "cancel", any changes that you have made on the menu will be discarded.
If you click on a "show" button for a category in a category menu for a particular vertex, then XLE will show a menu of edges that have that category and vertex. Edges can be excluded or not, just like in the category menu described above. If you click on a "show" button for an edge, then XLE will display the edge in the tree window using "show-subtree".
If you click on "restrictions" in the vertices window, then XLE will show a menu of all of the categories and edges that have been excluded so far. This is useful for turning the exclusions off if you find that the exclusions have eliminated the tree that you are interested in. If you click on "clear" in the vertices window, this will turn off all of the current exclusions.
The techniques described above work if the generator produced the desired tree but the feature structure was ill-formed for some reason. However, sometimes the generator prunes a lexical entry or subtree before it builds a tree. Usually this happens because the lexical entry has a feature that is not addable or conflicts with the input. The only way to debug this sort of problem is to try smaller and smaller generations until you identify the lexical item or grammatical rule that is causing the problem. An easy way to do this is to parse the sentence that you wanted the generator to produce, and then use "Generate from this FS" for complete f-structures that correspond to subtrees of the parse tree that you want. (You cannot generate from incomplete f-structures because they will be incomplete in the generator, too. This means that you cannot generate from a VP, for instance.)
The best way to test a generation grammar is to parse a string, pick one of the resulting f-structures, generate from it, and see whether any of the outputs match the input. We call this process regeneration. This section lists some useful commands for doing regeneration. Please see the online "help" command for more information on how to use these commands.
The regenerate command takes a string as input and parses it using $defaultparser (the parser created by create-parser). Then it picks the first f-structure and generates from it using a generator created with the same grammar file that $defaultparser was created with. The "regenerate" command automatically creates a generator if one hasn't already been created.
The regenerate-testfile command is just like the parse-testfile command except that it regenerates each test item in the testfile instead of just parsing it. It also produces a testfile.regen file, which is a file of just the regenerations. Finally, it checks the output of the generator against the original string, ignoring minor differences in white space. Errors are written in testfile.errors.
The regenerate-morphemes command is useful for checking the generation morphology. Sometimes the generator fails because the morphemes that it needs to produce aren't accepted by the generator's morphology. regenerate-morphemes applies the generator morphology to the morphemes that are in the valid trees of the parse chart. If this doesn't produce anything, then the generator's morphology fails to invert the parser's morphology in some non-trivial way. (Note: Multi-word preference tags that are added by BuildMultiwordsFromLexicon or BuildMultiwordsFromMorphology will usually prevent regenerate-morphemes from producing an output.)
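A minimal regeneration session might look like the following, where grammar.lfg and testfile.lfg are hypothetical file names (the first regenerate call creates the generator automatically, as described above):

% create-parser grammar.lfg
% regenerate {The dog barked.}
% regenerate-testfile testfile.lfg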
If the grammar is context-free equivalent and the input to the generator is fully-specified, then the generator should generate in time that is quadratic in the number of f-structure variables in the input in the worst case (and usually linear in the typical case). A grammar is context-free equivalent if XLE can parse with the grammar in time that is cubic in the length of the input sentence. The input is fully-specified if the output of the generator is a single string.
If the grammar is not context-free equivalent or the input to the generator is not fully-specified, then the generator can generate in time that is exponential in the number of f-structure variables in the input in the worst case. In particular, if the grammar doesn't distinguish between different orders of constituents within the tree, then the number of solutions produced and the amount of time taken to generate can be proportional to N!, where N is the number of free constituents and "!" represents the factorial function. One way to reduce non-determinism in the grammar is to record the order of adjuncts using $<h<s or $<h>s instead of just $. Unfortunately, this can slow down the parser, since it forces more information to be copied up within each adjunct. Another way to reduce non-determinism is to mark adjuncts with features such as FOCUS that indicate what function the position of the adjunct plays in the sentence.
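For instance, a schematic version of the MODS disjunct from the rule-conversion example below, with the order of adjuncts recorded via head precedence (whether $<h<s or $<h>s is the right choice depends on the analysis):

VP --> VP PP: ! $<h<s (^ MODS).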
The generator uses the PREDs in the input to build a generation chart. If a word is missing a PRED (such as "it" in "it is raining"), then the generator builds generation trees that have that word optionally in every possible position, leaving it to the unifier to eliminate duplicates. This is very inefficient, especially in a free-word order language that has a functional uncertainty associated with the word. It is better if you can avoid such analyses.
Whenever a grammar gets loaded by create-parser or create-generator, XLE prints out some statistics about the size of the grammar. These statistics look something like:
grammar has 286 rules with 699 states, 1528 arcs, and 2987 disjuncts

This says that the grammar has 286 finite-state rules, and that the finite-state rules have 699 states and 1528 arcs in them. The number of arcs is a good indication of the size of the grammar. If you were to convert the grammar into an equivalent grammar consisting of only unary and binary-branching rules, then it would have about this number of rules in it. The last number, the number of disjuncts, indicates roughly how many different rules you would have if you further required that the equivalent grammar couldn't have disjunctive constraints. For instance,
VP --> VP PP: { (^ OBL)=! | ! $ (^ MODS) }.

would have to be converted to:
VP --> VP PP: (^ OBL)=!.
VP --> VP PP: ! $ (^ MODS).
These numbers can be useful for giving someone a rough idea of how big your grammar is.
This section describes a number of tricks for debugging grammars in XLE in both parsing and generation (see the generation section for further hints on debugging the generator).
SEARCHING FOR TREES
It's quite common that there are a large number of trees for a given sentence. Locating trees of interest can be nontrivial. The bracketing window can help you restrict the display to trees containing (or not containing) constituents of interest.
DEBUGGING THE NOTATION
Whenever you load a grammar using create-parser, XLE parses the formalism and reports warnings and errors. Warnings are given for things that are acceptable but suspicious; errors are given for unacceptable notation. If XLE reports an error, then the loaded grammar is in a funny state and no guarantees can be made about what it will do if you try to use it. Usually XLE will give a line number in the file where the error occurred.
If you have separate lexicon files, then the lexicon entries that you need are parsed on demand when you use a word in the sentence for the first time. This means that you may get errors reported when you parse a sentence. If XLE reports an error, it won't give an absolute line number for the error, but rather a line number that is relative to the beginning of the entry. If you want to check all the lexical entries, you can load the generator ("create-generator filename") which will index all the lexicon files and in the process find any errors.
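For instance, with a hypothetical grammar file grammar.lfg, the following checks all the lexicon files in one pass:

% create-parser grammar.lfg
% create-generator grammar.lfg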
LOCKING SOLUTIONS
Whenever a solution is locally bad, it shows up in the f-structure display with a reason why it is bad and with the constraints highlighted that caused it to be bad, if possible. A solution on a subtree may be marked as "EVENTUALLY BAD" if it is eventually bad in all of the trees that incorporate the subtree. Similarly a solution on a subtree may be marked as "INCOMPLETE" if it is incomplete in all of the trees that incorporate the subtree. These solutions are marked as bad low down in the tree for performance reasons, so that the number of solutions that occur on intermediate subtrees remains manageable. However, it can make a grammar difficult to debug. To solve this problem, solutions that are marked as "EVENTUALLY BAD", "INCOMPLETE", or "UNOPTIMAL" can be locked using the "lock" button to the left of the f-structure label on the f-structure display window. When the "lock" button is clicked, then this solution becomes the only solution available from its subtree. This information gets propagated up the tree as if the solution had been made the only good solution of the subtree.
The "lock" button can also be used to lock a good solution. Locking a good solution will filter out all of the other good solutions in the subtree. This information will be propagated up the tree, reducing the number of solutions in the subtrees above the locked solution.
Locking only applies within a single tree; if you click "next" or "prev" in the tree displayer, then the locked solutions get unlocked automatically.
FINDING DUPLICATE SOLUTIONS
A common problem in grammar debugging is determining whether the reason that you are getting multiple solutions is that there are duplicates. One way to detect such a situation is to look in the choices window and see if there are any pairs of mutually exclusive choices that don't have any constraints in them. Another way is to use the "Print" commands in the tree window and the f-structure window to print out different structures and then use diff to see whether and how the structures differ.
One common way that you can get duplicate c-structures is if you have two entries under a word for the same category. XLE doesn't collapse these, and so you will get two identical trees with (possibly) different f-structures. A diagnostic for this case is that the lexical entry will have a dotted line between it and its duplicated category in the tree.
Another common way of getting duplicate f-structures is if you have disjunctions that are not mutually exclusive. For instance, y'all are ... will get two solutions if y'all is second person plural and are constrains its subject to be second person or plural. One way to eliminate the spurious ambiguity is to make the disjunction mutually exclusive. In this case, are could constrain its subject to be second person or (plural and not second person). If there is a possibility that a feature may be unspecified, then you will need to make the positive constraint (in this case, second person) a sub-c constraint.
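As a schematic lexical entry (not taken from any actual grammar), a mutually exclusive version of are might look like this, with the positive constraint stated as a sub-c constraint:

are  V * { (^ SUBJ PERS)=c 2
         | (^ SUBJ NUM)=PL
           (^ SUBJ PERS)~=2 }.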
This particular problem can show up even when there is not an obvious disjunction. For instance, a common way to deal with base form verbs is to say that when they are present tense, the subject is not third person singular (~[(^ SUBJ NUM)=3 (^ SUBJ PERS)=SG]). However, XLE uses DeMorgan's law to convert this into a disjunction ((^ SUBJ NUM)~=3 | (^ SUBJ PERS)~=SG). If the resulting disjunction is not mutually exclusive, you may get spurious ambiguity. The best solution in this case is to make the disjunction explicit and make it mutually exclusive.
A simple way to find disjunctions that are not mutually exclusive is to use the "Check Disjunctions" command on the tree window. This command will look through the current grammar for disjuncts that are not mutually exclusive and print them out, along with the file and line number where they occur. It also looks through the lexical items of the current chart, if there is one. You can increase the usefulness of the "Check Disjunctions" command by adding constraints to the non-exclusive disjuncts so that they become mutually exclusive. This will reduce the number of disjunctions printed by "Check Disjunctions", and so make it easier to see when new non-exclusive disjunctions appear.
Disjunctions that are not mutually exclusive do not always lead to a spurious ambiguity. For instance, the rule fragment
...PP*:{ (^ OBL)=! | ! $ (^ MODS) }; ...
is not likely to lead to a spurious ambiguity. To reduce the number of disjunctions that need to be checked, "Check Disjunctions" uses a heuristic to filter out disjunctions like these. If you want to see all of the disjunctions that are not mutually exclusive, you can use the "Check All Disjunctions" command.
DEBUGGING THE MORPHOLOGY
You can print out the results of the tokenizer or the morphology when applied to a particular string by using the tokens and morphemes commands. For example, the command:
tokens {John laughs.}
might produce:
{"^ " john|John} "TB"
{ laughs. "TB" [ "_," "TB" ]*.
|laughs "TB" { "_," "TB" [ "_," "TB" ]*.|.}
|laughs.}
"TB"
While the command:
morphemes {John laughs.}
might produce:
{ john {"+Token"|"+Noun" "+Sg"}
|John
{ "+Token"
|"+Prop" {"+Misc"|"+Giv"
"+Masc" "+Sg"}}}
{ { { {laughs|laughs.} "+Token"
|laugh {"+Noun"
"+Pl"|"+Verb" "+Pres" "+3sg"}}
"_," "+Token"[
"_," "+Token"]*
|{laughs|laughs.} "+Token"
|laugh {"+Noun" "+Pl"|"+Verb"
"+Pres" "+3sg"}}
. {"+Token"|"+Punct" "+Sent"}
|laughs. "+Token"}
These commands use the morphology of the default parser ($defaultparser).
This section describes a set of tools that have been developed to build and run regression test suites, which are useful for checking progress and detecting bugs during the development of grammars, semantic lexicons, rules mapping semantic representations to knowledge representations, or transfer rules. The form of regression testing supported by these tools does not just record whether one obtains an f-structure, semantic, KR, or transfer analysis for a sentence. It also matches the analyses against gold (benchmark) standards, and gives a detailed report of any points of difference.
You are strongly encouraged to store your grammar in a version control system (such as CVS or Subversion) to make it easier to find out when a particular change ended up breaking the grammar. For instance, suppose that after a week of working on your grammar, you discover that your test suite takes twice as long to process as it did the last time you ran your regression tests. Storing your grammar in a version control system allows you to back up to earlier versions to see when the slowdown began. If you are lucky, you will be able to use a divide-and-conquer strategy to narrow the problem down to a particular change that caused the problem.
A test suite can be created automatically using the command create-testsuite-dir, which takes a single optional argument:
create-testsuite-dir (<DIRECTORY>)
The optional argument specifies the path to the directory where the test suite directory ts will be created. If the command is called without this argument, then the test suite directory will be created in the current working directory (i.e., ./ts).
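For example, the following creates the test suite directory /home/smith/grammars/ts (the path is hypothetical):

% create-testsuite-dir /home/smith/grammars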
Note that the test suite directory and its subdirectories will also be created automatically if a test suite is run with one of the commands described in the next section, Running a Test Suite (e.g., run-syn-testsuite).
To construct a test suite by hand (e.g. for a gold standard test suite recording ground truths, rather than a regression suite recording current best analyses), you need to know something about the directory structure required by a test suite. The test suite directory must contain the following files and subdirectories.
sentences.lfg % file with example sentences
fs/       % directory with benchmark f-structures
sem/      % directory with benchmark semantic reps
kr/       % directory with benchmark KR
xfs/      % directory with benchmark transferred f-structures
xfr/      % directory with benchmark transfer structures
reports/  % directory with reports of previous test runs
tmp/      % directory with structures produced by most recent test run
    fs/
    sem/
    kr/
    xfs/
    xfr/
Note that many of these directories will remain empty if you are not using the relevant level (e.g., no sem or kr). The file sentences.lfg takes the form of a regular input file for the parse-testfile command, but with two crucial additions. First, each sentence must be numbered. This is necessary to keep track of which f-structures, semantics, KRs, and xfr files belong to which sentences, by means of a file numbering convention. To number a sentence, it must be preceded by a comment line containing just the number of the sentence, e.g.
# 23

This is sentence number 23.
It is recommended that the number be surrounded by blank lines to make the test suites compatible with the format used by parse-testfile.
Second, the first line of the file should be another comment giving the highest sentence number in the file, e.g.
# 253

# 1

Sentence 1.

# 2

Sentence 2.

...

# 253

Sentence 253.
It is recommended that sentences be numbered consecutively and without gaps, although this is not in fact enforced by XLE.
The fs, sem and kr directories contain files of gold standard structures fs<N>.pl, sem<N>.pl and kr<N>.pl, where N is the number of the sentence in sentences.lfg. It is possible for the semantics and kr benchmark directories to be empty if semantic and/or KR results are not being stored for the test suite.
The fs<N>.pl files contain normal prolog f-structures or f-structure charts. The sem<N>.pl files contain a prolog term of the form:
sem(N, 'Sentence N.', Choices, Equivs, ContextedFacts)
where N is the number of the sentence, 'Sentence N.' is the text string for the sentence, Choices is a choice structure (in the same format as for prolog fs-charts), Equivs is a list of variable definitions and/or selections (in the same format as for prolog fs-charts), and ContextedFacts is a list of contexted facts of the form cf(C, Fact).
The kr<N>.pl files are similar:
kr(N, 'Sentence N.', Choices, Equivs, ContextedFacts)
If transfer has been run before adding items to the testsuite, then the xfs<N>.pl files will contain transferred f-structures, and the xfr<N>.pl files will contain transfer structures of the form
xfr(Choices,Equivs,Equalities,ContextedFacts,Documentation)
The reports directory contains reports from previous test runs.
The tmp directory contains the structures obtained when running a test suite, and which are compared against the benchmark structures. The fs, sem and kr directories parallel those of the benchmark directories. In addition, the tmp directory itself will contain files final_N.pl for the final structures produced by the last test run.
Permissions: It is important that the tmp directory and its subdirectories and files be readable and writable by anyone who might run the test suite. The other directories should be readable by anyone who might run the test suite, and writable by anyone who might add further examples. The add-to-testsuite commands default to making these directories group-writable and readable by everyone.
Test suites can be run for any range of example numbers, across any of the sub-sequences of the following levels:
The following XLE commands are available:
run-syn-testsuite           % text => f-structure
run-sem-testsuite           % text => semantics
run-kr-testsuite            % text => KR
run-synsem-testsuite        % f-structure => semantics
run-semkr-testsuite         % semantics => KR
run-xfs-testsuite           % text => transferred f-structure
run-xfr-testsuite           % text => transfer-structure
run-synxfs-testsuite        % f-structure => transferred f-structure
run-synxfr-testsuite        % f-structure => transfer-structure
run-multi-xfr-testsuite     % text => transfer-structure
run-syn-multi-xfr-testsuite % f-structure => transfer-structure
These commands can also be called specifying a range of example numbers, as follows:
run-syn-testsuite <FROM> <TO>
run-sem-testsuite <FROM> <TO>
run-kr-testsuite <FROM> <TO>
run-synsem-testsuite <FROM> <TO>
run-semkr-testsuite <FROM> <TO>
run-xfs-testsuite <FROM> <TO>
run-xfr-testsuite <FROM> <TO>
run-synxfs-testsuite <FROM> <TO>
run-synxfr-testsuite <FROM> <TO>
run-multi-xfr-testsuite <FROM> <TO>
run-syn-multi-xfr-testsuite <FROM> <TO>
These commands will pick up whatever the current test suite directory is, which is either the directory last specified by the command set-testsuite or the default directory ./ts. The input to the test suite run is taken from sentences.lfg in the case of runs starting from text, or from the appropriate benchmark files otherwise.
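For example, a syntax-level run over the first fifty examples might look like this, assuming (as described above) that set-testsuite takes the test suite directory as its argument; the path is hypothetical:

% set-testsuite /home/smith/grammars/ts
% run-syn-testsuite 1 50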
When parsing text, the currently loaded parse grammar will be used. (Note: Sometimes running the test suite may reload the grammar, due to a technical irritation.) If an example number range is not specified, all of the test suite examples will be run. When running transfer (xfr) versions of the test suite, the current active transfer grammar will be used. For multi-xfr runs, where a sequence of transfer rules is applied, the sequence is whichever one was last used, e.g. by the transfer-seq command.
For each level of analysis, the analysis results will be compared to the benchmark structures, and the differences between the best matching analysis structure and the benchmark structure will be printed out. When running across multiple levels (e.g. text => KR, which passes through syntax and semantics), the analysis result most closely matching the benchmark will be selected for subsequent processing.
The command
set-testsuite-most-probable 1
will set an environment flag that causes subsequent test runs to only compare the most probable f-structure (i.e. the one that would initially be shown in the unpacked f-structure window) to the benchmark. The default, where all f-structures are compared to find the one best matching the benchmark, can be restored by running the command
set-testsuite-most-probable 0
The command
set-testsuite-partial-benchmark 1
will set an environment flag indicating that benchmark structures are only partial specifications of the desired results. When this is set, matching will maximize recall rather than the f-score of precision and recall. To restore the default, run the command
set-testsuite-partial-benchmark 0
Comparison of analysis and benchmark results uses the triples matching mechanism. Thus transfer rules convert f-structures, semantics, KR, and xfr structures to sets of triples (more accurately, tuples, since not all semantic and KR relations are 2-place), and these are compared to find the best match. A default set of rules for converting f-structures, semantics and KR to triples is automatically loaded. It is possible to redefine these mapping rules by loading your own set of structure=>triples transfer rules. These rules must have the identifiers
grammar = fs_triples.  % redefine fs=>triples
grammar = sem_triples. % redefine sem=>triples
grammar = kr_triples.  % redefine kr=>triples
in order for them to be recognized by the test suite comparison. When running testsuites on transfer structures, you can also load a set of transfer rules with the grammar name xfr_triples to determine the mapping of transfer structures to triples. If no such rule-set has been loaded, then fs_triples will be used instead.
Note on triples comparison of f-structures:
The command add-benchmark-menu will add additional commands to your fs-structure and fschart menus, allowing you to select and save transfer structures. This menu calls on the benchmark Tcl procedure defined in the file xfr_benchmark.tcl. To run the benchmark menu command, you must first have a current transfer rule sequence set up. For example,
set-current-xfr-sequence fs_triples

will ensure that f-structures are run through the transfer sequence containing the single ruleset fs_triples before being displayed for benchmarking. To set a sequence of more than one ruleset, enclose the rule names between braces as a space-separated list, e.g.
set-current-xfr-sequence {rules1 rules2 rules3}
The benchmark command will open a new window, displaying the choice space, the transfer facts that are fixed as being part of the current choice selection, and the open transfer facts that are not yet unambiguously chosen. Beside each open fact are two buttons. You can click on the left-hand button to say that you want to choose the fact, and select the part of the choice space it is in. The right-hand button allows you to exclude the fact, and eliminate the part of the choice space it is in. After marking various facts, click on the Select button at the bottom of the screen. This will recompute the open and fixed facts, and redisplay them in the window. If you have made selections that reduce the number of available choices to one, then the set of open facts will be empty. If you are unhappy with the selection you made, you can click on Go Back to return to the previous selection.
You can control the way that the facts are displayed by means of the Toggle Display command. This switches between: (1) a full display, which shows all of the transfer facts exactly as they are, and (2) an abbreviated display which filters out some of the facts, and displays the others in an abbreviated way. You can control the degree of abbreviation by means of the benchmark-abbreviation command line command, e.g.
benchmark-abbreviation "'SUBJ'(A, B)" "'SUBJ'(A, B)" # Print SUBJ facts in abbreviated mode, without change benchmark-abbreviation "pronoun_res_chained(A,B,_,_,_,_,_,_)" "pronoun_res(A,B)" # Reprint pronoun_res_chained/7 facts as pronoun_res/2, # preserving only first two arguments, A and BThe two arguments to this call, the full form of the fact and the abbreviated form, need to be valid prolog terms. Prolog variables are used to indicate arguments shared between the full and abbreviated forms.
When you are satisfied with your selection, the Save command button will save the full transfer structure, restricted to the selected choices. This will be saved to a file xfrN.pl in the current working directory, where the integer N is chosen to be the first number not used to identify an xfr file.
The Discriminants command button on the benchmark window will open a new window. This displays the currently available fixed and open transfer facts. The checkbox beside each fact allows you to select facts that you want to store as constraints or discriminants for the representation. Clicking on the Save command will append a new transfer rule to a file. (By default, this file will be discriminant_rules.pl in the present working directory; you can change this default by setting the Tcl global variable discriminantsOutFile to another file name.) The rule will check that all the selected facts are present in a subsequent transfer structure for that sentence.
For example, for the sentence Ed slept., picking the triples facts SUBJ(sleep:1, Ed:2) and VTYPE(sleep:1, main) as discriminants will create the following transfer rule:
"------------------------------- Rule for sentence: Ed slept. " @check_sentence(Ed` slept`.), {%F0 = []}, @discriminant(SUBJ(sleep:%%SK__1,Ed:%%SK__2), %F0, %F1), @discriminant(VTYPE(sleep:%%SK__1,main), %F1, %F2) ==> tested(Ed` slept`., %F2).The rule file can be included in a file like the following
"PRS (1.0)" grammar = discriminant_test. :- set_transfer_option(fixed_query_order, 1). "==================================== Macros ====================================" discriminant(%Test, %FailuresIn, %FailuresOut) := ( (+%Test, {%FailuresIn = %FailuresOut}) | (-%Test, {%FailuresOut = [%Test | %FailuresIn]})). check_sentence(%S) := (+meta_info(string, %S) | +fstr_property(string(%S))). "==================================== Include sentence rules ====================================" include(discriminant_rules). "==================================== Failure reporting rules ====================================" "We're not interested in pulling out the choices where the sentence succeeded or failed, so pull everything up to the top" tested(%S, %Failures) +==> <1> tested(%S, %Failures). "Did the tests succeed (empty list of failures)" +tested_sentence(%S, []) ==> succeeded(%S). "If unsuccessful, record the number of failures" -succeeded(%Sentence), +tested(%Sentence, %L), {length(%L, %N), %N > 0} +==> failed(%Sentence, %N). " Record the minimal number of failures: at the end of this, we will have just one failed(%Sentence, %M) record left" failed(%S, %N), +failed(%S, %M), {%N > %M} +==> 0. "Remove non-minimal test failures" +failed(%S, %Min), tested(%S, %L), {length(%L, %N), %N > %Min} +==> 0. "Aggregate failures" +failed(%S, %%Min) ==> failures(%S, []). -succeeded(%S), +tested(%S, %NewFails) ** [ failures(%S, %Failures0), {aggregate_failures(%NewFails, %Failures0, %Failures1)} +==> failures(%S, %Failures1) ]. +failures(%S, %Failures), {report_discriminant_failure(%S, %Failures)} ==> 0.This file defines the macros assumed by the discriminant_rules. The @discriminant macro tests for the presence of the fact. If it is present, it threads through the old value of the failure list, and otherwise it adds the test to the failure list. The @check-sentence macro checks the sentence fact in the input transfer structure, to make sure that rules are only applied to the appropriate sentence inputs.
The rules at the end of the file look at the results of any sentence-specific rules to detect if there were any errors: if there are, a warning will be printed out. The procedures aggregate_failures and report_discriminant_failure are pre-defined.
The xfr_benchmark.tcl file defines a command, discriminant-check, that makes use of the run-testsuite functionality to push a testsuite of sentences through a set of discriminant rules and report on the results. Assuming that you want to run f-structures through the rule sequence specified in the Tcl variable currentRuleSequence and that you have loaded the discriminant rules (with grammar name discriminant_test), here is an example of how you might run this command:
% lappend currentRuleSequence discriminant_test
% discriminant-check $currentRuleSequence "testdir/ts"

where testdir/ts is the path to a standard testsuite directory.
XLE also provides tools for grammar testing. The parse-testfile command can be used to test a suite of sentences against a grammar in batch mode. To use it, first load a grammar with the command create-parser and then call parse-testfile using the following syntax:
parse-testfile (<START> <STOP>) (-parser <GRAMMAR>)
The arguments <START> and <STOP> are sentence indices, which can be either numbers or strings. If no indices are given, then parse-testfile will parse the entire testfile. If indices are given and they are numbers, then the indices refer to the number of the sentence from the beginning of the file (1 for the first sentence, 2 for the second, etc.). If, however, the indices provided are strings, parse-testfile will parse the first sentence that matches the <START> string (i.e., has the string in it or follows a comment that does) and continue parsing sentences until it reaches the first sentence that matches the <STOP> string. For example, given a testfile sample-test.lfg such as the following
# 1

Philip K. Dick brought the anomic world of California to many of his works.

# 2

Dick spent most of his career as a writer in near-poverty.

# 3

Alternate universes and simulacra were common plot devices.

# 4

"There are no heroics in Dick's books, but there are heroes."
the command
parse-testfile sample-test.lfg anomic simulacra
will parse from the first sentence in sample-test.lfg that has anomic in it (or follows a comment with anomic in it) up to and including the sentence that has simulacra in it (or follows a comment with simulacra in it). In other words, it will parse sentences 1 through 3.
A testfile must consist of test sentences separated by blank lines. If any of the lines begin with #, they are treated as comments. Be sure to put a blank line between a comment and the following sentence. Otherwise, the sentence will be considered part of the comment.
There are a number of command-line options for parse-testfile that follow the indices arguments and change the command's normal behavior. By default, parse-testfile uses the value of the variable defaultparser for the grammar used in parsing. The command create-parser normally sets this variable. But if multiple grammars are being tested, each can be assigned to a different variable. The option -parser can then be used to instruct parse-testfile to use one of these alternative variables (rather than defaultparser). For example,
set foo [create-parser /home/foo/grammar.lfg]
parse-testfile sample-test.lfg -parser $foo
When called with the option -parseProc followed by a procedure name, parse-testfile will call the named procedure on every sentence it processes, passing four arguments:
This facility can be used to process testfiles in ways not anticipated by XLE. The default value for -parseProc is defaultParseProc.
The option -outputPrefix directs defaultParseProc to create a packed prolog file for every sentence parsed, which will be named according to the following convention
<PREFIX><INDEX>.pl
where PREFIX is the prefix given and INDEX is the sentence number for the sentence parsed. For instance, if the prefix is /tilde/smith/fs, the output of the first sentence will be stored in /tilde/smith/fs1.pl. If you want to store in a sub-directory, then you need an explicit slash at the end of the prefix (e.g. fs/). XLE will not create the directory for you. In this case, XLE will use <PREFIX>S<INDEX>.pl, so that you get fs/S1.pl.
The option -goldPrefix directs defaultParseProc to compare the results in -outputPrefix with the gold standard in -goldPrefix. In this case, defaultParseProc returns the f-score instead of the number of solutions that the parse got.
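For example, assuming the gold-standard prolog files are stored under a hypothetical gold/ directory, a run over the whole testfile might look like:

% parse-testfile testfile.lfg -outputPrefix fs/ -goldPrefix gold/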
If parse-testfile is called with different start and stop indices, then it will produce a set of files that give the results. If the filename of the testfile is testfile.lfg, then testfile.lfg.new will contain a copy of the testfile with performance information appended to the end of each sentence. This file should become your new testfile. After this has been done for a testfile, calling parse-testfile will also produce testfile.lfg.stats with performance information about the sentences. If there are any errors or mismatches in the number of solutions for a sentence, then the sentence will be printed to testfile.lfg.errors.
The format of the performance information is (solutions time subtrees), where "solutions" is the number of valid solutions for the sentence, "time" is the time it took to parse the sentence, and "subtrees" is the number of subtrees that XLE had to process in order to parse the sentence (in general, the time will be proportional to the number of subtrees).
If a grammar distinguishes between optimal and unoptimal solutions, then XLE will report the number of solutions as x+y (e.g. 7+3) where x is the number of optimal solutions and y is the number of unoptimal solutions. parse-testfile checks both numbers when deciding whether there is an error or a mismatch. This means that 7+2 won't match 7 or even 9. You can tell parse-testfile to ignore the unoptimal number by adding set ignoreUnoptimal 1 to your .xlerc file.
For instance, the following result for the sample testfile above
((1) (2+12 0.43 349) (14 words))
would indicate that the first sentence had 2 optimal solutions and 12 unoptimal solutions and took 0.43 CPU seconds to process 349 subtrees.
The value for the number of solutions will be a positive integer if a sentence successfully parses. Negative values indicate an unsuccessful parse:
If you want to specify how many solutions a sentence should have, then put a number at the beginning with an exclamation point:
(3! 5 1.03 200).
Whenever parse-testfile is run, it will compare the number of solutions a sentence should have with the number it actually got and report an error if they are different. It also reports a mismatch whenever the number of actual solutions changes.
Whenever the .new, .stats, or .errors files get remade, backup copies are made. The type of backup copy made depends on the Tcl variable "version-control". This variable is modelled after Emacs's version-control variable. If the value of the variable is t, then numbered backups of the form foo.~1~ will be made. If the value of the variable is never, then a single backup of the form foo~ will be made. If the value of the variable is nil, then numbered backups will be made if a file already has numbered backups, and otherwise a single backup file will be made. You can set version-control in your .xlerc file using something like set version-control t. The default is nil.
If the grammar has a BREAKTEXT transducer in the morph config file of a parser, then you can create a testfile from a text file with:
make-testfile <TEXT-FILE> (<TESTFILE>) (<PARSER>)
The command make-testfile breaks the text in <TEXT-FILE> up into text segments using the BREAKTEXT transducer and writes the results to <TESTFILE>. It inserts a comment at the beginning of the testfile to indicate that the testfile should be parsed literally (e.g. preserving whitespace and giving no special treatment to comments, performance data, or prefixed categories). It also inserts a blank line after every text segment in order to ensure that the result is in testfile format. Blank lines in each text segment have a vertical bar (|) appended at the end so that parse-testfile (and the user) can distinguish blank lines used as separators from blank lines used as text. This is designed so that
parse-testfile(make-testfile(x))
is equivalent to
parse-file(x)
The argument <TESTFILE> is optional. If it is not provided, then the results are written to <TEXT-FILE>.new. If the argument <PARSER> is not provided, the parser defaults to the variable defaultparser.
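For example, with a hypothetical text file corpus.txt:

% make-testfile corpus.txt corpus-testfile.lfg
% parse-testfile corpus-testfile.lfg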
You can use diff-testfiles to find mismatches in the number of solutions for sentences in two different versions of a testfile. diff-testfiles will work even when some sentences or comments have been added or deleted from the testfile, although it will miss some sentences if the testfile gets rearranged. diff-testfiles reports mismatches between sentences in the two testfiles and also errors in the second testfile (e.g. differences between the expected number of solutions (notated as 7!, for instance) and the actual number of solutions). It also reports the sentences that it skipped because it couldn't find a corresponding sentence in the other testfile.
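For example, to compare two versions of a testfile (the file names are hypothetical, with the older version given first):

% diff-testfiles testfile-old.lfg testfile-new.lfg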
If you call parse-testfile with just a start index or the same start and stop index, then it will parse just that sentence and display the results on the screen (otherwise, parse-testfile won't display results). This facility can be used to parse a single sentence from a testfile. To make things even more convenient, you can create your own Tcl procedure for parsing a sentence from a particular testfile:
proc sent {n} { parse-testfile verbmobil.testfile.lfg $n $n }
If you add this to your .xlerc file, then typing the following into the Tcl shell will cause the seventh sentence to be parsed from the testfile verbmobil.testfile.lfg:
sent 7
You can use parse-testfile to construct an annotated tree bank. First, load a grammar using create-parser. Then parse the first sentence in your testfile using parse-testfile testfile.lfg 1 1. This will automatically add the buttons "next sentence" and "prev sentence" to the fschart window. Look at the packed representation and choose the correct analysis. Then, click on the print button while holding the CONTROL key down on either the fschart, fstructure, or tree window. The fschart print button will produce a packed representation of all of the solutions along with the choices that you made as a prolog term. The fstructure window will produce the current fstructure as a .lfg file. The tree window will produce the current tree as a .tree file. If the grammar assigns a value to the SENTENCE_ID attribute on the top-level fstructure or if there is a comment that begins with # SENTENCE_ID: before the sentence, then the print files will include the SENTENCE_ID value in their names. Otherwise, the next available name will be used. Finally, click on "next sentence" to get the parse of the next sentence. Continue until you are done.
You can test how well your test files cover a rule by parsing a test file and then calling
print-unused-grammar-choices $defaultparser (<CAT>)
For example, the command
print-unused-grammar-choices $defaultparser {NP[std]}
will print the unused choices in NP[std]. print-unused-grammar-choices lists the constraint disjuncts in the given rule that were not part of any valid analysis in the test files. It will also list the constrained daughter categories in the rules that were never used. It will try to report the unconstrained daughter categories as well (e.g. the VP in S --> NP: (^ SUBJ); VP), but it won't know the line number for the unconstrained daughter. If no category (CAT) is given, then print-unused-grammar-choices will print unused grammar choices for all of the rules.
If you are having trouble figuring out why print-unused-grammar-choices is reporting that a constraint disjunct or daughter category is unused, it may be that the grammar choice in question is part of a macro or template that is expanded in more than one place with different epsilon constraints in front of it (e.g. e: (^ FOO)=+; @MACRO). The process of shifting the epsilon constraints to the following category can sometimes obscure the original source of some choices.
The command generate-test-sentences can be used to generate new test sentences for a given category. Its syntax is the following:
generate-test-sentences $defaultgenerator -lexemes <LEXEMEFILE> -length <N> -rootcat <ROOTCAT>
Each test sentence will include at least one disjunct or daughter category that hasn't been covered yet. LEXEMEFILE is a list of lexemes that generate-test-sentences can use. The list is a plain text file where each lexeme is on a new line. You can specify the category of the lexeme by putting it in front of the lexeme followed by a colon (e.g. N_BASE: dog). You do not need to quote spaces or other special characters.
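A LEXEMEFILE might look like the following sketch, mixing bare lexemes and categorized ones (the lexemes and category names are illustrative):

dog
bark
N_BASE: cat
V_BASE: sleep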
generate-test-sentences works by generating all possible sentences up to length N and then choosing sentences that have unused grammar choices in them. For this reason, you should choose N to be as small as possible. For instance, an N of 6 may take several minutes. In order to make generate-test-sentences practical at all, it places severe restrictions on the possible sentences:
To get around the first restriction, you should define some pseudo lexical entries like "PP" and "CPREL" that represent multi-word categories. If you put a dispreference mark in them, then generate-test-sentences will only use them when they are absolutely necessary.
There are several ways of increasing the robustness of your grammar without sacrificing performance. One way is to put a STOPPOINT mark in the OT field of the configuration, and mark rules used for robustness with marks that are stronger than the STOPPOINT mark (e.g. to the left of the STOPPOINT in OPTIMALITYORDER). These rules will only be used if the core grammar fails to find a valid analysis.
Another way to increase the robustness of the grammar is to create a special rule for collecting fragments and put the rule name in the REPARSECAT field of the configuration. This rule will only be invoked if XLE fails to find a valid analysis after all of the STOPPOINT rules have been tried. XLE will then retry with the new category and the first STOPPOINT. If written correctly, the fragments rule should always get a valid analysis (unless it runs out of resources). Here is an example of a rule for fragments:
FRAGMENTS --> { S
| CPint
| NP
| PARENP
| VP: (! SUBJ PRED)='DUMMY'
| PP
| TOKEN: Fragment $ o::*
} e: (^ FIRST)=! Fragment $ o::*;
(FRAGMENTS: (^ REST)=!).
This fragments rule produces a right-branching list of major categories. The TOKEN category represents any token. It uses the special -token lexical entry which matches any token (including those that already have a lexical entry). It appears in the lexicon something like this:
-token TOKEN * (^ TOKEN)=%stem.
In the worst case, XLE will only produce a list of tokens as the valid analysis.
Because of the epsilon constraints (e.g. e: (^ FIRST)=! Fragment $ o::*), each major category receives a dispreference mark named Fragment. This means that XLE will prefer analyses that have fewer major categories. The TOKEN category gets two Fragment marks, and so XLE will prefer a major category consisting of a single word over a TOKEN. Finally, categories that are missing information are completed. In this case, the VP is given a dummy subject.
The TOKEN category is necessary to guarantee that you can always get some analysis, but it doesn't have much useful information in it. In particular, it doesn't have any information about what part of speech the token might be. You can get this information by adding lexical categories to the FRAGMENTS rule, and filling them in with appropriate defaults. For instance:
FRAGMENTS --> { ...
| V: {(! SUBJ PRED)='DUMMY'}
{(! OBJ PRED)='DUMMY'}
{(! OBJ2 PRED)='DUMMY'}
{(! OBL PRED)='DUMMY'}
...
| ...
It is important for performance reasons that the reparse category build fragments using a list structure (e.g. with FIRST and REST attributes as above). XLE cannot efficiently handle the large sets that would be generated if you used a set instead of a list structure.
So, a sentence like "the the boy appeared" might have a fragmented f-structure like:
[ FIRST [ TOKEN the]
REST [ FIRST [ PRED 'appear<SUBJ>'
SUBJ
[ PRED 'boy'
SPEC the ]]]]
As the grammar progresses, your attention may shift from how many sentences can be parsed to how long it takes to parse a sentence. This section talks about how to find out where the time is going during parsing and tricks for making your grammar run faster.
XLE parses a sentence in a series of passes. First, the morphology analyzes the sentence, looks up each morpheme in the lexicon, and initializes a chart with the morphemes and their constraints. Then the chart parser builds all possible constituents out of the morphemes using the c-structure rules given in the grammar. The constraints are processed after all of the constituents have been built. The unifier processes the constraints bottom-up, only visiting those constituents that are part of a tree with the correct root category that covers the sentence. Each subtree processed during this pass will have a constraint graph associated with it that has a packed representation of the analyses that are relevant to this subtree, plus a set of nogoods that represent the contexts of bad analyses. After the unifier is through, these graphs are processed again to see which analyses are locally incomplete. This will produce some more nogoods. The result is the equivalent of a boolean satisfaction problem: finding all possible assignments of true or false to the contexts such that the assignments are not covered by the nogoods. The final pass solves the boolean satisfaction problem for each edge bottom-up, so that the solutions of the lower edges are used as part of solving an edge built upon them. (See the section on Lazy Contexted Unification in [Maxwell and Kaplan 96] for more details.)
The main way to make a sentence parse faster is to prune bad analyses early. For instance, if your grammar is constantly testing whether the f-structure associated with a VP has (^ FIN)=+, you might split the VP rule into VP[fin] and VP[nonfin] rules using parameterized rules, and use the VP[fin] category instead of VP where the f-structure is being tested. This will cause the non-finite feature structures associated with the VP category to be pruned in the chart-building phase instead of the unification phase. However, it will also increase the number of categories that are built in the chart, so you shouldn't split rules into too many categories.
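As a schematic sketch of such a split (the rule bodies are illustrative, not from any actual grammar):

VP[fin]    --> V[fin] (NP: (^ OBJ)=!).
VP[nonfin] --> V[nonfin] (NP: (^ OBJ)=!).
S          --> NP: (^ SUBJ)=!;
               VP[fin].

The finiteness distinction is now carried by the category parameter, so non-finite VP analyses are pruned when the chart is built rather than during unification.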
Another way to prune bad analyses early is to add constraints that allow analyses that will ultimately be incomplete to be pruned earlier. For instance, suppose that you have a noun constraint that requires that there be a determiner (something like (^ DET)=c +). Once the parser has gotten past the NP rule, this constraint will never be satisfied if there is no long-distance extraction of determiners. If you change the NP rule to add the constraint ~(^ DET) when there isn't a determiner, then the analyses that require a determiner will be eliminated early. You can also add a @(COMPLETE (^ DET)) template call to signal that (^ DET) can be completed early (see System-defined templates).
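One schematic way these pieces might fit together in an NP rule (the exact placement of the epsilon constraint and of the template call is illustrative):

NP --> { D: (^ DET)=+
       | e: ~(^ DET) }
       N: ^=! @(COMPLETE (^ DET)).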
Another way to make a sentence parse faster is to eliminate ambiguous analyses. The number of solutions that an edge can have can grow exponentially in the number of ambiguities below it, so eliminating ambiguities can have a big effect on performance. Sometimes there are local ambiguities with analyses that are ultimately bad. Rewriting the grammar to detect the ill-formed nature of the analyses earlier can reduce the number of solutions that intermediate edges have. Sometimes ambiguities can be eliminated by choosing one analysis as a canonical representation for all of the analyses. For instance, there are many different analyses for "A and B and C and D" because there are many ways to group the conjunctions. Choosing one analysis as a representative and rewriting the grammar to eliminate the others can have a big effect on performance as the number of conjunctions goes up.
If you type print-ambiguity-sources, then XLE will print all of the local sources of ambiguity in the current chart. This will look something like:
V_BASE:389:1 adds a 2-way ambiguity, perhaps from line 2807 in eng-oald-lex.lfg
VTNS_SFX_BASE:404:1 adds a 3-way ambiguity, perhaps from line 1600 in eng-core-lex.lfg
N_BASE:429:1 adds a 2-way ambiguity, perhaps from line 1465 in eng-core-lex.lfg
NPzeronc:2460 is ambiguous because of subtrees 1 & 2
NPzero:2458 is ambiguous because of subtrees 1 & 2
NPadjnc:1939 is ambiguous because of subtrees 3 & 4 & 5 & 6
The output represents both f-structure ambiguities and c-structure ambiguities. The f-structure ambiguities will give a subtree identifier plus the line number of the source of the constraints. The local f-structure ambiguity that was added could be from these constraints, or they could be the result of a functional uncertainty or other implicit disjunction. You can look at the subtree by using show-subtree. For instance, show-subtree 389:1 would show the first ambiguity listed above. The c-structure ambiguities will give an edge id plus a list of subtree ids. You can use show-subtree to show the subtrees one at a time. For instance, show-subtree 1939:3 will show the first subtree of the last ambiguity. Note that when you use show-subtree to display an analysis, some of the solutions may be marked as UNOPTIMAL even when they are not. That is because they are locally unoptimal because you are looking at the subtree in isolation. If you are having trouble seeing the difference between two subtrees, you can use the "Print SExp" command under the "Commands" menu to print textual representations of the trees. You can then use diff -w treeX.txt treeY.txt to see the differences between the trees (diff -w ignores white space). This works best if the node numbers have been turned off (you can use the "Views" menu for this).
If you type something like print-ambiguity-sources 1793, then XLE will print ambiguities that may be contributing to the number of local solutions that XLE needs to compute for the given edge. If you type something like print-ambiguity-sources 1793:7, then XLE will print ambiguities that may be contributing to the number of local solutions that XLE needs to compute for the given subtree. This can be useful with the print_subtrees capability described below. Whenever a subtree is found that needs a large number of local solutions, you can give the subtree identifier to print-ambiguity-sources to find out what ambiguities are being multiplied together to produce the solutions.
You can also use the "Check Disjunctions" menu item in the menu of the tree window to check for possible sources of spurious ambiguities. This only looks for spurious ambiguities in the current sentence.
If the language that you are writing a grammar for is head-final or head-medial,
you might consider using the RIGHTBRANCHING template. For instance, suppose that
you had a rule that looked like:
S --> NP*: (^ GF*)=!; V.
XLE converts all grammar rules into left-associative binary branching
rules, so this gets turned into something like:
S --> (S1) V.
S1 --> (S1) NP: (^ GF*)=!.
So when XLE parses the non-terminal string NP1 NP2 NP3 V, it brackets
it as ((((NP1) NP2) NP3) V). If the verb is intransitive, then XLE
will do a lot of unnecessary work to build the largest S1 before it combines
it with the V and discovers that it is incoherent. If the grammar were
written to bracket the non-terminal string as (NP1 (NP2 (NP3 (V))))
instead, then XLE would build the f-structure bottom-up starting with the
V. As soon as it got to the NP2, it would know that no analysis was possible.
You can get this effect in the grammar with the RIGHTBRANCHING template:
S --> NP*: (^ GF*)=!; V: @RIGHTBRANCHING.
XLE transforms this rule into:
S --> {V | NP: (^ GF*)=!; S}.
This sort of transformation is especially useful for things like adverbs that don't normally take arguments but sometimes allow arguments to precede them.
If you type set_xle_switch time_parser 1 into your Tcl shell or your .xlerc file then XLE will print timing information using the following format:
(morph = 0.500 chart = 2.454 unifier = 8.113 completer = 1.153 solver = 2.000)
These numbers indicate how long XLE spent in each pass, where "morph" is the morphology plus chart initialization, "chart" is chart parser, "unifier" is the contexted unifier, "completer" is the completion pass, and "solver" is the nogood database solving pass. XLE usually spends most of its time in the unifier. If XLE spends an unusually large amount of time in the unifier, then the problem is probably related to functional uncertainties.
Typing set print_subtrees 1 will give you more detailed information. (time_parser shouldn't be used with print_subtrees, since the timing information for "constraints" and "remainder" will be inflated by the time taken to print the more detailed information.) When this switch is set, XLE will print timing information for the morphology pass and the chart parser pass, plus it will print timing information for each subtree processed by the unifier, the completer and the nogood solver. Here is a sample output:
(initializing chart 0.483sec)
(building chart 10000 edges, 15000 subtrees, 2.167sec)
(format: mother/daughter edge#:subtree# [left right]
address(*=nogood) (#variables #attributes #contexts) #sec)
(unifying N-S_BASE/Hans 15:1 [0 2] 0x1bdb260 (1 5 0) 0.02sec)
(unifying NAME/N-S_BASE 5058:1 [0 2] 0x1bdb2a8 (2 5 0) 0.00sec)
(unifying N-T_BASE/+Noun 38:1 [2 3] 0x1bdb2f0 (1 2 0) 0.00sec)
.
.
.
(format: mother/daughter edge#:subtree# [left right]
(#attributes #contexts) #sec)
(completing *TOP* 5408 [0 6] (1 2) 0.00sec)
(completing *TOP*/ROOT 5408:1 [0 6] (21 15) 0.00sec)
(completing ROOT 4227 [0 66] (21 15) 0.00sec)
(completing ROOT 4368 [0 58] (20 15) 0.00sec)
.
.
.
(format: mother/daughter edge#:subtree# [left right]
(#partial x #complete x #local -> #valid solutions) #sec)
(solving N-S_BASE/ 15:1 [0 2] (1[-1]x1[14]x1 -> 1) 0.00sec)
(solving NAME/N-S_BASE 5058:1 [0 2] (1x1x1 -> 1) 0.00sec)
(solving N-T_BASE/ 38:1 [2 3] (1[-1]x1[37]x1 -> 1) 0.00sec)
(solving NAME/N-T_BASE 5148:1 [0 3] (1x1x1 -> 1) 0.00sec)
XLE first prints the total time taken to initialize the chart (including the morphology) and to build the chart. It then prints the time taken to unify the constraints associated with each subtree. The first two lines of this section give an explanation of the timing information:
(format: mother/daughter edge#:subtree# [left right]
address(*=nogood) (#variables #attributes #contexts) #sec)
After the unifier is done, the subtree graphs are processed again to determine which analyses are incomplete. Before the subtree timing information is printed for this, a header is printed:
(format: mother/daughter edge#:subtree# [left right]
(#attributes #contexts) #sec)
The information in this header is the same as for the unifier, except that the number of variables is not displayed. Sometimes no daughter or subtree information is given because XLE is processing a disjunctive graph that represents the disjunction of a number of subtree graphs.
After the completion pass is done, the subtree graphs are processed again to determine which analyses are completely valid given the nogoods that have been found in the previous two passes. Before this information is printed, a header is given:
(format: mother/daughter edge#:subtree# [left right]
(#partial[partialID] x #complete[completeID] x #local -> #valid solutions) #sec)
The information in the first line of this header is the same as for the other two headers. The second line gives an indication of how many analyses are being considered. "#partial" is the number of solutions that the partial constituent had. "#complete" is the number of solutions that the complete constituent had. partialID and completeID are the edge ids for the partial and the complete constituents; an id of -1 indicates the lack of an edge. "#local" is the number of local alternatives that were introduced by this subtree, either explicitly through disjunction or implicitly through things like functional uncertainty. "#valid solutions" is the number of solutions that resulted after the cross-product of the three sets just described was filtered by local nogoods.
During this pass, you may occasionally see entries of the form:
(reusing NP 4446 [0 7] 0.00sec)
These entries indicate that the solutions for a previously solved edge are being reused by a consumer that wants the solutions factored differently. For instance, if one consumer was interested in the CASE feature of an NP but another was not, then XLE would factor the solutions for the NP differently. The second consumer would "reuse" the solutions of the first to produce the new factorization.
If you want to look at a particular subtree mentioned in the print_subtrees output, you can use the command show-subtree edge#:subtree# using the edge#:subtree# combinations given in the output.
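For example, to inspect the NAME/N-S_BASE subtree from the sample unifier output above, you could type:
show-subtree 5058:1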
print_subtrees is probably the most useful performance debugging tool that XLE provides. When you use it, you can watch the output to see which subtrees XLE slows down on, or you can search for subtrees that take more than a second to process. Look for subtrees that take a long time to solve and that have a lot of valid solutions. Look for the sources of the ambiguities that produced the large number of solutions by tracing back through the partials and completes. See if you can prune any of the alternatives early or underspecify them.
If you want to sum over all of the categories, save the output to a file and then run sum-print-subtrees <filename>, where <filename> is the name of the file. (If you set print_subtrees_file to a file name before parsing, the output of print_subtrees will automatically be saved to that file.) This will produce an output that looks like:
category count total
total: 2054 46.76
VPargs/VPargs 82 22.19
VPargs 52 11.96
...
remaining categories have time = 0.0
sum-print-subtrees prints a list of triples of category name, count, and total time, where count is the number of lines in the print_subtrees output where the category name occurs. If a category name is of the form X/Y, then it represents a subtree in an edge, where X is the name of the edge and Y is the right-most daughter. If a category name is of the form X, then it represents the work done on an edge in addition to the work on the edge's subtrees (e.g. the work required to disjoin the subtrees together). The first line of the output is a header. The second line gives the total for all of the categories. After that, the categories appear sorted by their total time. If there is more than one edge with the same category name, they will all be summed together. If you want to know the total time for a particular category, then you need to sum the times given for all of the category names that have the category in the first part of the name. Categories that take up zero time in the print_subtrees output are skipped. sum-print-subtrees also prints the subtrees that took more than 1 second to process.
As a convenience, if you set sum-print-subtrees to 1 and parse or generate something, then XLE will print subtrees to $print_subtrees_file (or "subtrees.txt" if $print_subtrees_file doesn't have a value) and call sum-print-subtrees for you.
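For example, a session along the following lines (file name illustrative) collects the detailed output and then summarizes it:
set print_subtrees 1
set print_subtrees_file subtrees.txt
parse {a sentence that is slow to parse}
sum-print-subtrees subtrees.txt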
You can also gather statistics on the chart as a whole by using chart-statistics. This won't give you any timing information, but it will tell you how many times a category appears in the chart and how many times a daughter category appears under a mother category. chart-statistics all gives statistics for all of the edges in the chart, chart-statistics graph (the default) gives statistics for only the edges that have graphs, and chart-statistics nogoods gives statistics for all of the edges that are locally inconsistent. The nogood statistics are especially useful for looking for places to possibly split categories (for example, the VPfin/VPnonfin case discussed above).
Sometimes the sentences that you want to speed up are too complicated to analyze. In this case, you may be better off looking for shorter sentences that are slow for their length or complexity. If you type sort-stats testfile.lfg.stats, then XLE will sort a statistics file produced by parse-testfile according to the number of subtrees per second that each sentence takes to process. This may reveal sentences of moderate length that are slow for their complexity. This is because if the grammar is context-free equivalent, then the number of subtrees per second should be a constant. When the number of subtrees is small, the time it takes to process a sentence tends to be dominated by the morphology and lexicon. For this reason, sentences that have less than 200 subtrees are sorted to the end. You can also use sort-stats testfile.lfg.stats subtrees-per-word and sort-stats testfile.lfg.stats words-per-second to see a statistics file sorted by other criteria.
Sometimes the reason that XLE is slow is that there are other processes on your machine that are hogging the CPU. For instance, old emacs processes sometimes wake up and start running at full throttle, even though they don't have a window and the user is logged out. To find out if this is the case on your machine, use the Unix command "top" to list the current processes in order of how much CPU is being used. Look at the CPU column at the far right. Look for processes that are consuming a large percentage of the CPU. You can kill processes that have your USERNAME by using the Unix command kill -9 pid, where "pid" is the number that appears under the PID column on the left. You will need to contact a superuser to kill processes that don't belong to you.
You can limit the weight of a constituent from the grammar with a constraint like (* WEIGHT) < 5. See the section on the WEIGHT attribute for more details.
You can limit the weight of constituents in the middle of a sentence by setting the Tcl variables max_medial_constituent_weight and max_medial2_constituent_weight to non-zero values. Setting max_medial_constituent_weight to a moderately large number (say, 30) will cause XLE to prune constituents in the middle of a sentence that have more than that number of terminals at the bottom (ignoring "weightless" terminals such as punctuation and morphological tags). Setting max_medial2_constituent_weight to a moderately small number (say, 10) will cause XLE to prune constituents in the middle of a sentence of any size that themselves have a constituent in the middle of them with more than max_medial2_constituent_weight terminals. For example, an analysis of the form "XXX [YYY [ZZZ] WWW] VVV" will be pruned if ZZZ has more than max_medial2_constituent_weight terminals at its bottom. It could also be pruned if [YYY [ZZZ] WWW] had more than max_medial_constituent_weight terminals at its bottom, even if ZZZ did not have more than max_medial2_constituent_weight terminals at its bottom.
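For example, the following settings in your Tcl shell or .xlerc file reflect the illustrative values just mentioned:
set max_medial_constituent_weight 30
set max_medial2_constituent_weight 10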
The best way to determine good values for max_medial_constituent_weight and max_medial2_constituent_weight is to test your grammar against a gold standard and to see what the performance/coverage tradeoffs are as you try different values. In particular, you want values that reduce the parsing time significantly without much loss in precision and recall.
Internally, this feature is implemented by splitting edges into three classes: those that have an edge on their right side that has more than max_medial2_constituent_weight terminals (e.g. have a heavy right side), those that have an edge with a heavy right side on their left side (and thus have a heavy center-embedded edge), and those that have neither. This means that in the worst case this feature can multiply the number of edges constructed by 3. On the other hand, the number of subtrees should grow quadratically in the number of words instead of cubically once the number of words exceeds max_medial_constituent_weight. (The number of subtrees would only grow linearly in the number of words beyond max_medial_constituent_weight if XLE did early composition of partials, but this is a major architectural change. Also, doing early composition of partials adds a grammar constant that is the square of the size of the grammar, so it is not clear that doing early composition would necessarily be a win.) Although the number of subtrees grows quadratically with the number of words, the number of subtrees that are unified will only grow linearly with the number of words, which is probably a more important effect since chart construction typically only takes 10% of the parse time.
If any trees were pruned because they had heavy constituents, then XLE will print a tilde (~) in front of the number of solutions found. This indicates that the number of solutions is approximate because some possible solutions were discarded.
If you set heavy_only to 1 in the Tcl shell then XLE will only compute analyses that have heavy constituents. This is useful for seeing the analyses that would be pruned by max_medial_constituent_weight and max_medial2_constituent_weight. The offending tree node(s) will be boxed, although the whole tree has a valid f-structure. If you inspect the edge under a tree node by clicking on it with CONTROL and SHIFT held down and then clicking on the value of the "edge" field, you will find a "weight" field that gives the weight of the edge and a "heavytype" field that indicates what type of heaviness it has (e.g. HAS_HEAVY_RIGHT_EDGE, HAS_HEAVY_CENTER, and TOO_HEAVY).
The procedure for determining the weight of a node is experimental and is subject to change. The weight of a node is equal to the weight of the edge in the chart that corresponds to the node. The weight of an edge is the minimum weight of all of the trees that correspond to that edge. The weight of a tree is the number of terminal leaves that the tree has that are neither punctuation marks nor morphological tags. Currently, punctuation marks are determined using the C procedure "is_punct()". XLE assumes that all sub-lexical morphemes that have a left sister are morphological tags. Here are some examples:
(N (N_BASE "dog") (N_SUFF_BASE "+N") (N_SUFF_BASE "+PL"))
This tree has a weight of 1 because there is only one sub-lexical morpheme that doesn't have a left sister (e.g. "dog").
(S (V "stop")(PUNCT "!"))
This tree has a weight of 1 because there is only one leaf node that isn't punctuation (e.g. "stop").
(NP (DET "the") (N "United States of America"))
This tree has a weight of 2 because it has two leaf nodes and neither node is punctuation or a sub-lexical morpheme that has a left sister.
(NP (DET "the") (ADJ "United")(N "States") (PP (P "of") (N "America")))
This tree has a weight of 5 because it has 5 leaf nodes in it. However, if this tree were part of the same edge as the previous tree, then the edge would have a weight of 2 since the weight of an edge is the minimum of the weights of all of its trees.
As an alternative to aborting XLE when too much time has been used or too much storage has been used, one can specify that XLE "skim" the constituents that it has not finished processing. When XLE skims the remaining constituents, it does a bounded amount of work per subtree. This guarantees that XLE will finish processing the sentence in a polynomial amount of time. This only makes sense if your grammar has a fragment rule.
There are four variables involved in skimming:
start_skimming_when_scratch_storage_exceeds
start_skimming_when_total_events_exceed
max_new_events_per_graph_when_skimming
skimming_nogoods
Skimming is started when the scratch storage exceeds the given number.
For instance,
set start_skimming_when_scratch_storage_exceeds 700
will cause skimming to begin when the scratch storage exceeds 700 megabytes. This is similar to max_xle_scratch_storage. You should generally set start_skimming_when_scratch_storage_exceeds to 50-75% of the machine's real memory for optimal performance.
Skimming is also started when the total events exceed a given number. The "events" being counted are specific actions within the parser that act as a platform-independent measure of clock ticks. XLE measures events instead of CPU time because (1) it is difficult to figure out how much CPU time is devoted to each subtree, (2) the CPU time used per subtree can vary slightly each time a sentence is parsed, producing non-determinism, and (3) the amount of CPU time each subtree uses varies from machine to machine.
Once skimming is started, XLE limits the number of new events per graph to the value of max_new_events_per_graph_when_skimming. XLE flags the fact that a sentence was skimmed by putting a "~" before the number of solutions (e.g. ~4+572 or ~*12+166676). This indicates that the number of solutions is approximate (e.g. there might have been more solutions if XLE had not skimmed). It also reports the number of skimmed sentences in the output of parse-testfile.
When skimming starts, any OT marks listed in the skimming_nogoods variable become NOGOOD marks. This is useful for eliminating expensive and little-used constructions from the grammar when skimming.
You can determine good values for start_skimming_when_total_events_exceed and max_new_events_per_graph_when_skimming by running parse-testfile with your usual timeout. In the statistics, parse-testfile will report the maximum # of events per sentence and the average # of events per graph. If you set start_skimming_when_total_events_exceed to a number that is a little bigger than the maximum # of events per sentence, then XLE will only skim the sentences that timed out in your test suite. You should get reasonable skimming results by setting max_new_events_per_graph_when_skimming to a number that is 2-3 times the average # of events per graph.
Even when skimming, it is a good idea to set the timeout and max_xle_scratch_storage to something reasonable. This guards against running out of storage or other performance problems. For instance, you could set the timeout to 5-10 times the original timeout. This gives the parser time to finish skimming.
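Putting this together, a skimming configuration in a performance variables file might look like the following sketch (all values illustrative, and the OT mark name in skimming_nogoods is hypothetical):
setx start_skimming_when_scratch_storage_exceeds 700
setx start_skimming_when_total_events_exceed 50000000
setx max_new_events_per_graph_when_skimming 2000
setx skimming_nogoods {RareConstruction}
setx timeout 600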
To train the chart pruner, you need a corpus of bracketed sentences. It is not enough to have a gold standard of f-structures for training, since the chart pruner prunes the c-structure chart using information that is only available in the c-structure chart. However, it might be possible to create a corpus of bracketed sentences from a gold standard of f-structures.
Once you have a corpus of bracketed sentences, parse the sentences using parse-testfile with a -parseProc of count-subtree-features. This will cause XLE to gather statistics from the parses. XLE assumes that the brackets are named LSB (left square bracket) and RSB (right square bracket) and that there are rules of the form XP --> LSB XP RSB. When gathering statistics, XLE will ignore rules of this form.
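For instance (corpus file name illustrative):
create-parser english.lfg
parse-testfile bracketed-corpus.lfg -parseProc count-subtree-features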
When parse-testfile is done, you can print the statistics using "print_subtree_features $defaultparser <subtree-features-filename>". This will write the subtree features to the designated file.
You can parse your corpus in parallel and combine the results if you have multiple machines available. To combine the results, load the grammar, call "clear_subtree_features $defaultparser" to clear any existing subtree features, and then call "read_subtree_features $defaultparser <filename>" for each file of subtree features that you have produced. Finally, call "print_subtree_features $defaultparser <final-filename>" to print the combination.
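For example, to combine subtree features gathered on two machines (file names illustrative):
create-parser english.lfg
clear_subtree_features $defaultparser
read_subtree_features $defaultparser subtrees-part1.txt
read_subtree_features $defaultparser subtrees-part2.txt
print_subtree_features $defaultparser subtrees-all.txt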
WARNING: the subtree statistics become invalid whenever a rule is changed in the grammar. XLE will print a warning message if you try to use the chart pruner with invalid subtree statistics. In this case, you will need to retrain the statistics.
There are two performance variables that you need to set to use the chart pruner. They are prune_subtree_file and prune_subtree_cutoff. These should be put in your performance variables file.
prune_subtree_file should be set to the file produced by print_subtree_features. prune_subtree_cutoff should be set to the threshold value that you want for determining when subtrees should be pruned. This will be a number between 4 and 10. Low numbers cause a lot of pruning; high numbers cause a little pruning. You should use "triples match" with a gold standard to see how much precision and recall you lose at different values. Warning: the recall percentage in "triples match" doesn't take into account the sentences that didn't get a parse. So a value that causes more sentences to parse may receive a lower recall percentage. You should look at the actual number of facts recalled to compare recall between different values.
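For example, the corresponding entries in the performance variables file might be (file name illustrative; the cutoff of 5 matches the example discussed below):
setx prune_subtree_file subtrees-all.txt
setx prune_subtree_cutoff 5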
Sometimes there are morphemes in the chart which should not be considered when determining the probability of an analysis. For instance, if BuildMultiwordsFromLexicon adds +Prefer tags after the multiword, you probably don't want to gather statistics about how often these occur. You can suppress the statistics for these morphemes by adding commands to the performance variables file like:
suppress_pruning_statistics MWE_SFX_BASE
where MWE_SFX_BASE is the preterminal that covers the +Prefer tags.
Some tokenizers make named-entity markup in the input optional. That is, a sentence like "He saw <mwe>on the waterfront</mwe>" might be tokenized as both "He saw on_the_waterfront +mwe" and "He saw on the waterfront". The idea is to let the parser decide which is correct. However, the chart pruner may interfere with this, since it may erroneously prune one of the analyses based on the training data. To avoid this, you can add a command to the performance variables file like:
set_phrase_confidence +mwe .7
This says that you are 70% confident that the phrase that has +mwe is correct. The chart pruner will give this phrase a probability of .7. It will also give any alternative analysis a probability of .3. You can also set the phrase confidence of the preterminal of +mwe, with the same effect.
The chart pruner uses a simple stochastic CFG model. The probability of a tree is the product of the probabilities of each of the rules used to form the tree, including the rules that lead to lexical items (such as N --> dog). The probability of a rule is basically the number of times that particular form of the rule occurs in the training data divided by the number of times the rule's category occurs in the training data, plus a smoothing term.
The pruner prunes at the level of individual constituents in the chart. It calculates the probabilities of each of the subtrees of a constituent and compares them. The probability of each subtree is compared with the best subtree probability for that constituent. If a subtree's probability is lower than the best probability by a given factor, then the subtree is pruned. The value of prune_subtree_cutoff is the natural logarithm of the factor used. So a value of 5 means that a subtree will be pruned if its probability is roughly a factor of 150 (e^5 is about 148) less than the best probability.
If two different subtrees have different numbers of morphemes under them, then the probability model is biased towards the subtree that has fewer morphemes (since there are fewer probabilities multiplied together). XLE counteracts this by normalizing the probabilities based on the difference in length.
REPARSECAT categories are not pruned so that XLE is guaranteed to get a parse if REPARSECAT is used.
Here is an example of how this works. "Fruit flies like bananas" has two different analyses, shown here along with hypothetical probabilities for each rule:
Analysis 1 (flies as a noun):
S --> NP VP        0.5000
NP --> N N         0.1500
N --> Fruit        0.0010
N --> flies        0.0015
VP --> V NP        0.2000
V --> like         0.0050
NP --> N           0.5000
N --> bananas      0.0015
------------------------
8.4375E-14

Analysis 2 (flies as a verb):
S --> NP VP        0.5000
NP --> N           0.5000
N --> Fruit        0.0010
VP --> V PP        0.1000
V --> flies        0.0025
PP --> P NP        0.9000
P --> like         0.0500
NP --> bananas     0.0015
------------------------
4.21875E-12
These two analyses come together at the S constituent that spans the whole sentence. The probability of the first analysis is 8.4375E-14. The probability of the second analysis is 4.21875E-12. This means that the second analysis is 50 times more likely than the first. If prune_subtree_cutoff is less than the natural logarithm of 50 (about 3.9), then the subtree of the first analysis will be pruned from the S constituent.
Consider two different analyses of "Who wrote <mwe>on the waterfront</mwe>?":
Analysis 1 (multiword title):
S --> NP VP                        0.5000
NP --> Who                         0.0020
VP --> V NP                        0.2000
V --> wrote                        0.0050
NP --> N                           0.5000
N --> On_The_Waterfront +MWE       0.0001
---------------------------------------
5E-11

Analysis 2 (ordinary PP):
S --> NP VP                        0.5000
NP --> Who                         0.0020
VP --> V PP                        0.1000
V --> wrote                        0.0050
PP --> P NP                        0.9000
P --> on                           0.1000
NP --> DET N                       0.3000
DET --> the                        0.4000
N --> waterfront                   0.0010
---------------------------------------
5.4E-12
In this case, the two alternatives come together at the VP. On the surface, these probabilities for the two analyses seem close. However, the chart pruner adjusts the probabilities based on the number of rule applications, so that "wrote on the waterfront" gets a much higher probability than "wrote On_The_Waterfront". Thus, the "wrote On_The_Waterfront" analysis gets pruned.
If we set the phrase confidence of +MWE to .7, then the statistics become:
Analysis 1 (multiword title):
S --> NP VP                        0.5000
NP --> Who                         0.0020
VP --> V NP                        0.2000
V --> wrote                        0.0050
NP --> "On_The_Waterfront"         0.7000
-----------------------------------------
7E-07

Analysis 2 (ordinary PP):
S --> NP VP                        0.5000
NP --> Who                         0.0020
VP --> V PP                        0.1000
V --> wrote                        0.0050
PP --> "on the waterfront"         0.3000
-----------------------------------------
1.5E-07
Note that the alternative to On_The_Waterfront (e.g. "on the waterfront") gets a probability of one minus the confidence of On_The_Waterfront.
The pruning file has counts for each rule and for each rule category, where "S --> NP VP" and "N --> dog" are both considered rules. In XLE, the right hand side of a rule is a regular expression. XLE converts the regular expression into a finite state machine. Each arc in the machine can be thought of as representing a rule in Chomsky normal form. For instance, "S --> NP VP" is represented as the network (S1) NP (S2) VP (S3(final)). In Chomsky normal form, this is equivalent to S --> S2 VP; S2 --> NP. Lines in the pruning file are of the form:
COUNT CATEGORY STATE#1 STATE#2 CONSTRAINTID DAUGHTER
where COUNT is the counts for this rule, CATEGORY is the rule category, STATE#1 is the state number of the mother state, STATE#2 is the state number of the left constituent (which must have the same category as the mother), CONSTRAINTID is the ID of the daughter constraints (like (^ SUBJ)=!), and DAUGHTER is the category of the right constituent. If you had a Chomsky normal form rule like:
S2 --> S1 NP: (^ SUBJ)=!
then this would be encoded as:
17.0 S 2 1 24 NP
where 24 is the constraint id of (^ SUBJ)=!.
If STATE#1 is 0, then the state is final and the mother constituent is complete. Otherwise, the state is non-final and the mother constituent is partial. If STATE#2 is 0, CONSTRAINTID is 0, and DAUGHTER is empty, then the line represents the count for a rule category (if STATE#1 is 0, then the rule category is complete).
If you set max_medial_constituent_weight to a relatively small number (say, 10) and the grammar has a fragment rule then this will have the effect of turning your grammar into a shallow parser. If you also use skimming, then the parse time will tend to be linear in the length of the sentence for sentences that are longer than max_medial_constituent_weight. Thus, a grammar that was designed for deep parsing can also be used for shallow parsing when the need arises.
If you set max_raw_subtrees to a non-zero value (such as 50000), then XLE will limit the chart to that number of subtrees. When used in conjunction with skimming, this will bound the amount of time and storage that it takes to parse a sentence. This is better than using the timeout or max_xle_scratch_storage variables, since the parser will always produce something if there is a fragment rule in the grammar. Note that there is no easy way to predict the effect that this has on the accuracy of the grammar except to create a test suite of sentences with a gold standard of correct analyses and to look at the precision and recall that you get for various values of max_raw_subtrees.
When max_raw_subtrees has a non-zero value, XLE will give first priority to subtrees that could be used to span the input (e.g. that have the root category and begin at the beginning of the input or end at the end of the input). XLE then gives priority to all of the one word edges, then all of the two word edges, and so on until it runs out of subtrees. This is similar to setting max_medial_constituent_weight, except that the maximum constituent weight will vary from sentence to sentence. If a sentence is short, there will be no limitation. If a sentence is moderately long, then constituents may be limited to a moderate size. If a sentence is very long, then the constituents may be limited to a small size. In general, the maximum constituent weight will be proportional to 1/sqrt(N), where N is the number of words in the input.
If any constituents were not put in the chart because max_raw_subtrees was exceeded, then XLE will print a tilde (~) in front of the number of solutions found. This indicates that the number of solutions is approximate because some possible solutions were discarded.
If you set max_raw_subtrees and you also use skimming, then the parse time will tend to be cubic in the length of the input up to a certain point, and then it will be bounded by a constant beyond that point. You can also set max_medial_constituent_weight to a constant value. This will make the parse time cubic in the length of the input until it reaches max_medial_constituent_weight, and then it will be linear until it reaches max_raw_subtrees, and then it will be bounded.
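As a sketch, a performance variables file for bounded-time, bounded-storage parsing might combine these settings (all values illustrative):
setx max_medial_constituent_weight 20
setx max_raw_subtrees 50000
setx start_skimming_when_total_events_exceed 50000000
setx max_new_events_per_graph_when_skimming 2000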
If you are using max_raw_subtrees to bound the amount of time needed to parse a sentence, then you can get a sense of a good value for max_raw_subtrees by setting parse_testfile_resource2 to raw_subtrees and then parse a representative test file of sentences. This will cause XLE to print the number of raw subtrees that each sentence uses instead of the number of subtrees that have a graph associated with them. Then sort the .stats file using sort-stats testfile.lfg.stats seconds-per-sentence. This will sort the sentences according to how long they take to parse. Then pick a value for max_raw_subtrees that is smaller than the number of subtrees used by the sentences that exceed your time limit.
If you are using max_raw_subtrees to bound the amount of storage needed to parse a sentence, then you can get a sense of a good value for max_raw_subtrees by setting parse_testfile_resource1 to megabytes and parse_testfile_resource2 to raw_subtrees and then parsing a representative test file of sentences. This will produce a .stats file with storage statistics instead of time statistics. Then sort the .stats file using sort-stats testfile.lfg.stats storage-per-sentence. This will sort the sentences according to how much storage they take to parse. Then pick a value for max_raw_subtrees that is smaller than the number of subtrees used by the sentences that exceed your storage limit.
If you set the Tcl variable timeout to some number of seconds, then XLE will automatically abort the parse or generation of a sentence after that number of CPU seconds have elapsed and return -1 for the number of analyses. For instance, if you type set timeout 60 in the Tcl shell or your .xlerc file, XLE will abort sentences after 60 seconds or so. This is useful for running a test file overnight and guaranteeing that XLE will get through the whole test file. The default value for timeout is 100 seconds.
If you set the Tcl variable max_xle_events to a non-zero value, then XLE will automatically abort the parse or generation of a sentence after that number of events have elapsed and return -5 for the number of analyses. The "events" being counted are specific actions within the parser that act as a platform-independent measure of clock ticks. They are the same events used by start_skimming_when_total_events_exceed above. The main advantages of using max_xle_events instead of timeout is that the results do not depend on the platform and do not vary from run to run. You can determine a good value for max_xle_events by using parse-testfile and looking at the maximum number of events per sentence listed in the statistics at the end. The default value for max_xle_events is 0, which means that there is no limit.
If you set max_xle_scratch_storage to something other than 0, then XLE will abort if its scratch storage for all of the parsers and generators in use exceeds that number of megabytes. For instance, set max_xle_scratch_storage 50 will cause XLE to abort and return -2 if it needs more than 50 megabytes of storage to parse a sentence.
If you want to parse in bounded time or storage and still get a partial result, you should see the section on Bounded-time Parsing.
If XLE is taking too long to parse a sentence you can abort the parse by typing C-/ (CONTROL slash). (If you are running in the XLE buffer produced by the LFG menu you will need to type C-c C-/, or to use the QUIT menu item under the Signals menu.) This will cause XLE to stop parsing the current sentence and return -3 for the number of analyses. If XLE is not parsing a sentence, the signal will be ignored. If XLE is in parse-testfile mode, it will ask you whether you want to abort parse-testfile, too. You can still abort XLE using C-c (or C-c C-c in the XLE buffer produced by the LFG menu).
The following Tcl variables affect the performance of a chart (e.g. a parser or a generator) and are specific to the chart:
timeout
max_solutions
max_solutions_reranking
max_xle_events
max_xle_scratch_storage
normalize_chart_graphs
dnf_chart_graphs
max_medial_constituent_weight
max_medial2_constituent_weight
max_raw_subtrees
start_skimming_when_scratch_storage_exceeds
start_skimming_when_total_events_exceed
max_new_events_per_graph_when_skimming
skimming_nogoods
property_weights_file
property_weights_file_reranking
gen_property_weights_file
max_selection_additional_storage
ignoreProjections
delayProjections
prune_subtree_file
prune_subtree_cutoff
rank_cutoff_best
rank_cutoff_prev
gen_selector
input_position_type
left_markup
right_markup
replace_markup
Each chart has its own copy of these variables, so that they can be set differently for different charts. The notation for setting the performance variable of a chart is:
setx timeout 50 $chart
where "chart" is a variable that has a Chart as its value.
The notation setx timeout 60 sets the default value of timeout to 60.
In addition to these variables, XLE has the following special commands that affect the performance: set-OT-rank, add-OT-condition, set-gen-adds, suppress_pruning_statistics, and set_phrase_confidence.
You can print the current values of the performance variables using print-performance-vars.
It is possible to collect the chart-specific performance variables into a separate file for convenience. The notation is the same as used for setting these variables in a Tcl Shell:
setx timeout 50
setx start_skimming_when_scratch_storage_exceeds 10000
setx max_selection_additional_storage 1000
A file of performance variables can also contain calls to a few Tcl functions:
prepend-tokenizer [patterns onomasticon.fst]
pop-tokenizer
prepend-analyzer morph-add.txt
prepend-priority-analyzer morph-add.txt
set-OT-rank SVAgr NOGOOD
set-gen-adds remove HONORIFIC
set-gen-adds add @INTERNALATTRIBUTES
set-gen-adds add @GOVERNABLERELATIONS AddedFact
The file can be read using set-performance-vars filename $chart. Reading the file above has the same effect as executing the following commands:
setx timeout 50 $chart
setx start_skimming_when_scratch_storage_exceeds 10000 $chart
setx max_selection_additional_storage 1000 $chart
prepend-tokenizer [patterns onomasticon.fst] $chart
pop-tokenizer $chart
prepend-analyzer morph-add.txt $chart
prepend-priority-analyzer morph-add.txt $chart
set-OT-rank SVAgr NOGOOD $chart
set-gen-adds remove HONORIFIC "" $chart
set-gen-adds add @INTERNALATTRIBUTES "" $chart
set-gen-adds add @GOVERNABLERELATIONS AddedFact $chart
If filename doesn't exist, then set-performance-vars looks in the directory of the grammar for the file.
A grammar can specify a file of performance variables for any chart that uses the grammar using the PERFORMANCEVARSFILE config entry:
PERFORMANCEVARSFILE performance-vars.txt.
When a chart is first created, XLE calls set-performance-vars with this filename and the chart to initialize the performance variables of the chart. These values can be overridden later using set-performance-vars or setx:
set-performance-vars performance-vars2.txt $chart
setx timeout 30 $chart
If a chart's performance variable has not been set before the first use of the chart, then XLE initializes the performance variable to the default value as given in Tcl. For instance, if "timeout" has not been set for $chart, then XLE will do the equivalent of:
setx timeout $timeout $chart
where "timeout" has been set using something like:
setx timeout 60
There are several forms of online documentation. The help command gives you a short description of the commands that can be invoked in the Tcl shell. All of the buttons and mouse-sensitive nodes document themselves when clicked on with the right mouse button. You can get the documentation for menu items by typing 'h' while the cursor is over the menu item. Finally, the documentation command opens a browser on a Table of Contents overview of all of the written XLE documentation. The documentation command uses the environment variable WEB_BROWSER to determine which Web browser to use. If this variable is not set, then it uses netscape. If netscape is used, then XLE will attempt to use an existing netscape window first; otherwise it will start up a new browser. On a MacOS X computer, set WEB_BROWSER to "open" (e.g.,
setenv WEB_BROWSER open).
XLE implements a mechanism for ranking analyses that is an extension of the most common mechanism used in Optimality Theory. In XLE, optimality marks can be added to an analysis by adding a constraint of the following form in the appropriate place in the grammar:
... Mark1 $ o::* ...
This says that Mark1 is a member of the optimality projection. (Note: marks can also be added on the mother node (e.g. Mark1 $ o::M*).)
Marks are ranked in the config:
FOO ENGLISH CONFIG (1.0)
...
OPTIMALITYORDER Mark5 Mark4 Mark3 +Mark2 +Mark1.
...
The list given in OPTIMALITYORDER shows the relative importance of the marks. In this case Mark5 is the most important, and Mark1 is the least important. Marks that have a + in front of them are preference marks. The more preference marks that an analysis has, the better. All other marks are dispreference marks. The fewer dispreference marks that an analysis has, the better.
The most common metric used to compare analyses in Optimality Theory is to compare the marks on two analyses in order of importance until you get to a mark that has a different number of instances in the two analyses (e.g. one with 0 Mark5s vs. one with 2 Mark5s, or one with 3 Mark3s vs. one with 5 Mark3s). When you get to a mark where the two analyses differ, the one that has the fewest instances of that mark is chosen. XLE extends Optimality Theory by allowing preference marks. If the mark with the first difference is a preference mark (signalled by a +), then the analysis that has the greater number of that mark is chosen. For example, under the ranking given above, an analysis with no Mark5s beats one with two Mark5s no matter how the weaker marks compare; and if two analyses have the same counts for Mark5, Mark4, and Mark3, then the analysis with more +Mark2 marks wins.
Whenever a grammar produces unoptimal solutions, they are reported in the number of solutions using the form 7+3. This is interpreted as 7 optimal solutions plus 3 unoptimal solutions. By default, XLE only displays the optimal solutions.
You can see the unoptimal solutions by clicking on the "unoptimal" Views menu item on the choices window located in the lower right corner of the screen. When this button is clicked, then the OT interpretation of the optimality marks is turned off and the unoptimal solutions become regular solutions. The unoptimal menu item is automatically turned off before each parse. You can also selectively turn an optimality mark off by clicking on it in the choices window after "OTOrder:". Clicking on it again restores it to its original rank. All of the marks are turned on again before each parse.
If you want to modify the rank of an OT mark for the duration of a session
without changing the grammar, you can use the "set-OT-rank" command. For
instance,
set-OT-rank Fragment NOGOOD
will make the Fragment OT mark have the same rank as the NOGOOD OT mark. This command does not change the results for the current input. It only applies to future inputs.
To facilitate the modification of the ranking without having to edit the grammar and lexicon files, it is also possible to collect some marks into an equivalence class by enclosing them in parentheses. A declaration of the form
OPTIMALITYORDER Mark5 Mark4 (Mark3 Mark3a) +Mark2 +Mark1.
would be interpreted in such a way that Mark3 and Mark3a count as dispreference marks of identical strength. If Mark3a were a preference mark (e.g. with a + sign in front of it), then the two marks could cancel each other out. For instance, if there were an equal number of Mark3 and Mark3a marks, then the net count would be zero, and the analysis would be treated as identical to one with no marks.
If a NOGOOD mark appears in the list, then all of the marks to its left indicate that a solution is always bad, even if there is no other analysis. Its purpose is to allow fine-grained filtering of the grammar. For instance, a grammar might be annotated with TRACTOR and VERBMOBIL marks that indicate constructions that are only used in those domains. If these marks are listed to the left of NOGOOD, then these constructions are treated as being inconsistent. If such a mark occurs within a disjunction, then the disjunct that contains it is effectively deleted from the grammar.
If all of the alternatives on a category within a rule are inconsistent when the grammar is loaded, then XLE will act as if that category had been deleted from the rule. This means that an edge corresponding to the category won't appear in the chart. An alternative is inconsistent when the grammar is loaded if it is marked with a NOGOOD optimality mark or if it has inconsistent constants (e.g. a = b, a =c b, or a ~= a). Inconsistent constants are most likely to occur within parameterized rules.
If the ranking has one or more instances of the special optimality mark named STOPPOINT in it, then XLE will process the input in multiple passes, using larger and larger versions of the grammar. STOPPOINTs are useful for eliminating ungrammatical analyses when grammatical analyses are present and for speeding up the parser by only using expensive and rare constructions when no other analysis is available.
STOPPOINTs are processed from right to left, so that the first STOPPOINT considered is the rightmost. During the first pass, the rightmost STOPPOINT is treated as a NOGOOD mark. This means that any suboptimal constructions involving marks to the left of the STOPPOINT are not even tried. If the input has a solution, then XLE will stop. Otherwise, it will reprocess the input using the grammar up to the next STOPPOINT to the left (for this purpose, the NOGOOD mark is considered a STOPPOINT). For instance, if a grammar had:
OPTIMALITYORDER Mark4 NOGOOD Mark3 STOPPOINT Mark2 STOPPOINT Mark1.
then XLE would first try analyses with either no marks or only the Mark1 mark. It wouldn't try the suboptimal constructions involving Mark2, Mark3, or Mark4. If there were no valid analyses, then it would try including analyses with a Mark2 mark. If there were no valid analyses, then it would try including analyses with a Mark3 mark. It would never try including analyses with a Mark4 mark, since it is to the left of the NOGOOD.
If one of the marks to the left of a STOPPOINT or NOGOOD is a preference
mark, then the suboptimal constructions that are eliminated are the ones that
have fewer preference marks than a competing analysis. Since the lack
of preference marks cannot be easily detected when the grammar is loaded,
these analyses are pruned just before the constraints are solved. Putting
a preference mark to the left of a STOPPOINT makes sense for multi-word
expressions, for instance. If the preference mark for multi-word expressions
is to the left of a STOPPOINT, then XLE will only consider analyses that
involve the individual components of the multi-word expression if there is
no valid analysis involving the multi-word expression. For example:
OPTIMALITYORDER +Mwe STOPPOINT Mark1.
Reprocessing in multiple passes is expensive, so STOPPOINTs should be used sparingly. The ideal is to have a STOPPOINT at a place that allows 80-90% of the inputs to be processed successfully, and to put the optimality marks of computationally expensive and syntactically marginal grammar rules to the left of a STOPPOINT. Except for STOPPOINTs and the marks listed to the left of a NOGOOD, the optimality ranking doesn't affect how fast XLE parses or generates. Instead, it filters solutions after it is done and removes the ones that are not optimal.
Some marks indicate that an analysis is ungrammatical, even though it is complete, consistent, and coherent. These marks are used to distinguish dispreferences from true ungrammaticalities. For instance, a mark that indicated mismatched subject-verb agreement should be an ungrammatical mark. Marks are designated as ungrammatical if they are listed to the left of the special UNGRAMMATICAL mark or if they are prefixed by a "*". The "*" is used to allow grammatical and ungrammatical marks to be mixed together. For instance, "Lets go to the store" might be marked ungrammatical because of the missing apostrophe (should be "Let's"). However, "let" can be a noun, as in "The bottom-ranked tennis player hit let after let". Thus, you might want to disprefer "let" as a noun even more than the ungrammatical "lets" without an apostrophe.
If an ungrammatical mark is grouped with a grammatical mark using parentheses, then analyses with the ungrammatical mark will be slightly dispreferred over analyses with the "equivalent" grammatical mark. This is because XLE keeps a count of the number of ungrammatical marks for each analysis. If two analyses are otherwise equal, then XLE will prefer the analysis with the fewest ungrammatical marks.
All ungrammatical solutions have UNGRAMMATICAL printed in the f-structure display. Furthermore, if the optimal solution is ungrammatical, then the number of solutions will be prefixed by a * (e.g. dogs sleeps would have *1 solution). If you don't want ungrammatical solutions included when grammatical solutions are present, then you should put a STOPPOINT to the right of the last ungrammatical mark.
The INCONSISTENT, INCOMPLETE, and INCOHERENT optimality marks have special meanings for XLE. If the INCONSISTENT mark is active (e.g. weaker than the current STOPPOINT), then XLE will convert nogoods involving an inconsistency into an INCONSISTENT optimality mark on the * node of the current subtree. If the INCOMPLETE mark is active, then XLE will convert nogoods involving incompletes into an INCOMPLETE optimality mark on the * node of the current subtree. Finally, if the INCOHERENT mark is active, then XLE will convert nogoods involving incoherence into an INCOHERENT optimality mark on the * node of the current subtree.
Using these marks can produce an enormous number of unoptimal solutions. For instance, if all of these marks are in use then "Finding your way around the Homecentre" can have >200,000,000 unoptimal solutions when parsing with the English Homecentre grammar. For this reason, these marks are not practical as part of an industrial-strength grammar. However, they may be useful for experimental purposes such as finding the best f-structure given a particular c-structure tree or for OT-based generation.
The CSTRUCTURE mark is a special mark that indicates that all of the optimality marks that are stronger than it are applied before the f-structure constraints are processed. When the f-structure constraints are processed, these marks are treated as NEUTRAL.
The main use of the CSTRUCTURE mark is to improve performance by filtering dispreferred analyses early. For instance, suppose that the grammar had an optimality mark called NoCloseQuote that is used to analyze a quotation that is missing its close quote. When parsing a sentence that had two standard quotation marks, the grammar will produce two analyses: one that sees the quotation marks as surrounding a single quotation, and one that has two quotations both of which are missing their close quotes. The second analysis is suboptimal because it will have two NoCloseQuote marks in it. If NoCloseQuote is stronger than the CSTRUCTURE mark, then XLE will remove the latter analysis from the c-structure chart, and will not attempt to build f-structures for it.
Another use of the CSTRUCTURE mark is to control the output of the morphology. For instance, if MWE is used to mark multi-word expressions, then making +MWE stronger than the CSTRUCTURE mark will cause XLE to remove the c-structures that do not use a multi-word expression unless none of the analyses use the multi-word expression. Also, if Guessed is used to mark morphologically unknown analyses, then making Guessed stronger than the CSTRUCTURE mark will cause XLE to remove morphologically unknown analyses unless there are no morphologically known analyses. For example:
OPTIMALITYORDER Guessed +MWE CSTRUCTURE Mark1.
A CSTRUCTURE mark often does the wrong thing when a fragment grammar is needed. This is because the fragment grammar can parse anything, and thus an analysis with a CSTRUCTURE mark will always be filtered. It is possible to get around this by specifying that certain marks should be ignored by the CSTRUCTURE mark:
OPTIMALITYORDER Guessed IGNORING Fragment CSTRUCTURE.
This notation tells XLE that it should ignore Fragment analyses when deciding whether or not to filter Guessed analyses. The IGNORING keyword only applies to the immediately preceding and following marks. If you want it to apply to more than one mark, use a list:
OPTIMALITYORDER Mark1 (Guessed Foo) IGNORING (Fragment Fum)
Mark2 CSTRUCTURE.
In this example, Guessed and Foo ignore Fragment and Fum. Mark1 and Mark2 are irrelevant.
Both the parser and the generator allow the input to override the root category used. Sometimes this leads to situations where the parser or generator would have been successful if the standard root category had been used instead of the input root category. If ROOTCAT appears in OPTIMALITYORDER or GENOPTIMALITYORDER then the standard root category will be used as well as the input root category. The grammar writer must explicitly add the ROOTCAT OT mark to the rule for the root category to make the root category be dispreferred.
The optimality marks described so far are global in scope. That is, analyses are ranked via optimality marks without regard to where the optimality marks occur in the analyses and what their alternatives are. This makes sense for gradations of grammaticality (e.g. how grammatical a sentence is), but makes less sense for preferences. For instance, you might want to disprefer adjuncts over arguments. If you did this by adding a dispreference mark to adjuncts, then you would get the counter-intuitive result that an analysis with an adjunct PP is dispreferred even if the alternative doesn't use the PP as an argument. For instance, "Fruit flies like a banana" has two relevant readings, one where "flies" is a noun and one where "flies" is a verb. If adjuncts are dispreferred, then the second will be unoptimal because "like a banana" is an adjunct. This happens even though the first does not use "like a banana" as an argument.
To solve this problem, XLE supports a special class of optimality marks called local optimality marks. A local optimality mark only affects the ranking of analyses when there is a conditioning optimality mark on a constituent that covers the same span of input as the local optimality mark. Thus, it is possible to disprefer using a PP as an adjunct only when the PP is used as an argument in another analysis.
You can specify local optimality marks in OPTIMALITYORDER by giving the conditioning marks in angle brackets. For instance,
FOO ENGLISH CONFIG (1.0)
...
OPTIMALITYORDER Adjunct < Argument > X < Y Z >.
...
The local optimality marks are only considered when there is an analysis that covers the same span of the input that has one of the conditioning optimality marks. For instance, if you had the following rules:
NP --> N
PP*: ! $ (^ ADJUNCT)
Adjunct $ o::*.
VP --> V
(PP: (^ OBL)=!
Argument $ o::*)
PP*: ! $ (^ ADJUNCT)
Adjunct $ o::*.
Then an Adjunct PP would only be dispreferred if there was an Argument PP that covered the same span of the input.
If you want to add a conditioning mark to an existing OT mark without
changing the grammar, you can use the "add-OT-condition" command.
For instance,
add-OT-condition Adjunct Argument
will behave the same as
OPTIMALITYORDER Adjunct < Argument >
NB: Local OT marks don't make sense in the context of a STOPPOINT, NOGOOD, or CSTRUCTURE mark.
The optimality ranking may be set differently for generation than for parsing. This is done using a separate GENOPTIMALITYORDER clause in the grammar config with the same format and interpretation as the OPTIMALITYORDER clause.
Here is an example of how Optimality Theory could be used to allow a grammar to accept ill-formed input in the VerbMobil domain:
ALL ENGLISH CONFIG (1.0)
...
OPTIMALITYORDER Tractor NOGOOD Partial NoSVAgr +MWE.
...
This ranking says that multi-word expressions with the MWE mark are preferred over analyses that don't have them (e.g. analyses using their component parts). Analyses with no dispreference marks are preferred over analyses with subject-verb disagreements (marked with NoSVAgr), which in turn are preferred over Partial analyses. Finally, all of the tractor-specific constructions are ruled out entirely, since Tractor appears to the left of NOGOOD.
Here are how some of the marks might appear in the grammar:
ROOT --> { ...
| ZP+: !$^
Partial $ o::*}.
ZP --> {VP: (^ SUBJ PRED)='PRO'
|AP
|NP
|PP}.
This rule stitches together major constituents when there are no other valid analyses, and prefers the analysis that has the fewest daughters, since each ZP daughter contributes a Partial mark.
+V3SG V_SUFF XLE {(^ SUBJ PERS)=3 (^ SUBJ NUM)= SG
| ~[(^ SUBJ PERS)=3 (^ SUBJ NUM)= SG]
NoSVAgr $ o::* }.
This lexical entry prefers to get the agreement right, but will allow disagreement at a cost. This lexical entry could have also been written as:
+V3SG V_SUFF XLE {(^ SUBJ PERS)=3 (^ SUBJ NUM)= SG
| NoSVAgr $ o::* }.
which would produce the same result, but perhaps at a performance penalty.
Next consider an example of how OT marks can be used to control generation.
We might have the GENOPTIMALITYORDER:
GENOPTIMALITYORDER ExtraPunct NoSVAgr.
The NoSVAgr mark will disprefer generation of strings with mismatched subject-verb agreement, assuming the entry for +V3SG shown above. Thus, only "Boys appear" will be generated, and not "Boys appears".
The ExtraPunct mark could be used to block optional commas:
VP --> V
((COMMA: ExtraPunct $ o::*)
PP: ! $ (^ ADJUNCT)).
With the given GENOPTIMALITYORDER, They laugh at night will be generated and not They laugh, at night.
This facility enables the user to efficiently select the most probable parse after parsing. To use it, set property_weights_file to the name of a file of properties and weights:
setx property_weights_file my_property_weights_file
The code for determining the most probable parse stops after a certain amount of storage has been allocated and returns the most probable parse found so far. The default is 1000 KiloBytes. When it runs out of storage, it prints a message like "Used 1000 KiloBytes of structural memory. Returning current best selection". If you are seeing too many of these messages, you can increase the default cutoff by setting the following variable to the number of KiloBytes of storage you want to allow. For instance:
setx max_selection_additional_storage 2000
If you want to have different property weights files for different parsers, then use:
setx property_weights_file property_weights_file1 $parser1
setx property_weights_file property_weights_file2 $parser2
Once the property_weights_file has been set, you can click the "most probable" button in the fschart window (XLE will display the most probable parse as the first parse when there is a property weights file in use). This will cause XLE to choose the most probable structure given the properties and weights defined in your property weights file. XLE will select the appropriate choices and display the most probable structure. If you print a prolog file of the fschart, the selected choices will be recorded along with the packed structure. You can also print the most probable structure from a script using print-most-probable-structure $chart $filename. This will print just the most probable structure. You can print all of the structures with the most probable structure selected using print-most-probable-structure $chart $filename 1.
XLE displays the weight of each choice in the choices window using the format "weight = number". XLE prints the weight of each choice in the Prolog output using the format "weight(choice,number)". The weight of a solution is the sum of the weights of its choices.
You can use a property weights file (property_weights_file or property_weights_file_reranking) to determine the n-best analyses by setting max_solutions or max_solutions_reranking to the desired n:
set max_solutions 100
If the number of solutions is already less than max_solutions/max_solutions_reranking and dnf_chart_graphs is not set to 1, then XLE doesn't do anything special. If the number of solutions is greater than max_solutions/max_solutions_reranking, then XLE computes the n-best solutions and produces a packed result with an n-way disjunction. If you want the same behavior for fscharts with fewer than max_solutions/max_solutions_reranking solutions, set dnf_chart_graphs to 1.
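For example, to make XLE always produce a packed result with (at most) the 100 best solutions:
setx max_solutions 100
setx dnf_chart_graphs 1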
Finally, if you include the flag -mostProbable when using -outputPrefix in parse-testfile,
then parse-testfile will only print the most probable analysis
for each sentence.
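For example (output prefix illustrative):
parse-testfile testfile.lfg -outputPrefix /tmp/out -mostProbable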
Statistical Parameter Estimation
If you want to train your own property-weights vector, the following steps are necessary:
Data preparation consists first of preparing training data, i.e. constructing separate fscharts for all possible parses of the training sentences (the unlabeled data) and for the correct parse(s) of those sentences (the labeled data). Ideally, for each unlabeled example, one parse is uniquely labeled as the correct parse. Achieving this requires costly manual disambiguation. If a set of shallower annotations such as the UPenn WSJ Treebank annotation is available, manual disambiguation can be avoided by estimating from partially labeled data: here it is sufficient to identify a set of correct parses by selecting from the set of all possible parses those that are consistent with the given shallower annotation. Our code allows for discriminative estimation from such partially labeled training data, subsuming learning from fully labeled training data as a special case.
Helpful restrictions on the training data that will improve both efficiency and performance are to choose a training set that consists of
Moreover, care must be taken to ensure that the set of correct parses is a proper subset of the set of all possible parses; otherwise discriminative estimation will go awry.
Manual labeling of correct parses is unavoidable for the preparation of heldout and test data. Heldout data have to be prepared for adjusting parsing performance parameters (such as skimming and maxmedial) and estimation parameters (such as regularization constants). Test data should be used for final testing only.
Step 1: Feature Annotation
The following commands can be used to produce feature-annotated and/or-forests for each input chart:
print-feature-forest < features_file > < input_prolog_chart_file > < output_counts_file >
print-feature-forest < features_file > -x < extension > < input_prolog_chart_file >+
In the second usage, the output counts file name is formed by adding the extension to each input prolog chart file name.
The experimenter is required to specify the features using the feature templates given below. The features should be given in < features_file >, one feature per line. The experimenter can extract features automatically from a corpus using count_statistical_features, print_statistical_features, and read_statistical_features. See the help documentation for more details on these functions.
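For instance, the second usage above might be invoked as follows (the feature and chart file names are hypothetical); this writes chart1.pl.counts and chart2.pl.counts:
print-feature-forest my-features.txt -x .counts chart1.pl chart2.pl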
c-structure feature templates

| name | parameters | example | description | comments |
|---|---|---|---|---|
| cs_label | label | cs_label CPembed[that] | count of constituent labels label in the parse | |
| cs_adjacent_label | parent_label child_label | cs_adjacent_label VPcop[perf] PP | count of times parent_label is the parent of constituent child_label | immediate dominance only |
| cs_sub_label | ancestor_label descendant_label | cs_sub_label VPcop[perf] PP | count of times ancestor_label is an ancestor of descendant_label | includes immediate dominance |
| cs_sub_rule | parent_label child_labels | cs_sub_rule S NP VP | count of sub-rule in the parse | |
| cs_num_children | label | cs_num_children NP | total number of children of constituents label in the parse | |
| cs_embedded | label size | cs_embedded VPv[base] 4 | count of top-most nodes of chains of size constituents labeled label embedded into one another | does not include "partial" nodes in the count; cs_embedded XP 1 is equivalent to cs_sub_label XP XP |
| cs_right_branch | | | count of right branch pairs | |
| cs_heavy | (cat)(>weight (>embedding)) (score) | cs_heavy NP >10 >0 1 | count of heavy constituents; weight = number of preds in constituent; embedding = minimum number of preds to edge of sentence; score = count to give feature ("0" = 1, "1" = weight, "2" = weight*weight) | |
| cs_conj_nonpar | depth | cs_conj_nonpar 4 | count of non-parallel conjuncts within depth levels | |

f-structure feature templates

| name | parameters | example | description | comments |
|---|---|---|---|---|
| fs_attrs | attrs | fs_attrs OBL-COMPAR | count of f-structure attributes that are one of attrs | |
| fs_attr_val | attr val | fs_attr_val COMP-FORM if | count of f-structure attributes attr with value val | includes ruletrace features, e.g., fs_attr_val RULE-TRACE 1; counts the number of values! |
| fs_adj_attrs | parent child | fs_adj_attrs ADJUNCT SUBJ | count of attributes parent with a child child | counts the number of parents |
| fs_auntsubattrs | num_aunts aunts num_ancestors ancestors num_descendants descendants | fs_auntsubattrs 2 ADJUNCT NAME-MOD 6 OBJ OBJ-TH COMP-EX OBL COMP XCOMP 1 PRED | count of f-structure attributes ancestors having a sister of one of aunts and being ancestors of one of descendants | counts the number of ancestors |
| fs_subattr (DEPRECATED) | ancestor descendant | fs_subattr SUBJ SPEC | count of f-structure attributes ancestor having descendant attribute descendant | counts the number of ancestors |
| verb_arg | pred arg | verb_arg read OBJ | counts the number of occurrences of verb pred with argument arg | |
| lex_subcat | pred arg_sets | lex_subcat break SUBJ,VTYPE | count of verbs pred with one of arg_sets as the set of arguments | |

anaphora feature templates

| name | parameters | example | description | comments |
|---|---|---|---|---|
| anaphora_attr_value_path(anaphora) | path value | anaphora_attr_value_path(anaphora) NTYPE NSYN proper | counts the number of anaphora with path with last attribute having value value | |
| anaphora_attr_value_path(antecedent) | path value | anaphora_attr_value_path(antecedent) NTYPE NSYN proper | counts the number of antecedents with path with last attribute having value value | |
| anaphora_match_value | attr | anaphora_match_value NUM | counts the number of linked anaphora-antecedent pairs that have the same value for attribute attr | |
| anaphora_path_crosses | attr | anaphora_path_crosses ADJUNCT | counts the number of linked anaphora-antecedent pairs with a path between them crossing attr | |
| anaphora_least_common_ancestor | attr | anaphora_least_common_ancestor ADJUNCT | counts pairs where the least common ancestor of linked anaphora and antecedent is attr | |
| anaphora_spans | ancestor_anaphora_attr descendant_anaphora_attr | anaphora_spans REF COREF | counts the number of ancestor_anaphora_attr dominating descendant_anaphora_attr, where each attribute is one of REF or COREF | |
The output of print-feature-forest is a packed representation of feature counts. The feature counts are represented as F:N, where F is the feature number and N is the number of instances of that feature. The output always starts with an and-node; and-nodes are always delimited by parentheses. And-nodes look like:
(ID F1:N1 F2:N2 ... )
where ID is the unique identifier of the and-node. Or-nodes can be embedded inside of and-nodes after the feature counts. Or-nodes are always delimited by curly brackets. They look like:
{ID (...) (...) }
where ID is the unique identifier. After the ID is a set of and-nodes. When an and-node refers to an or-node that has already been printed, it uses the notation {ID } with no internal and-nodes.
Here is an example. Suppose that we have a choice space that looks like:
choice([A1,A2,A3], 1), choice([B1,B2], or(A2,A3))
and a set of contexted feature facts that looks like:
cf(1, #feature(7,1,_)), cf(1, #feature(8,1,_)), cf(1, #feature(8,1,_)),
cf(or(A2,A3), #feature(1,1,_)), cf(A2, #feature(1,1,_)),
cf(B1, #feature(2,1,_)), cf(B2, #feature(1,1,_))
Then the packed representation of the feature counts would look like:
(1 7:1 8:2 {1 (2 ) (3 1:2 {2 (4 2:1 ) (5 1:1 ) } ) (6 1:1 {2 } ) } )
Here and-node 1 represents the True context. Or-node 1 represents the A disjunction. And-node 2 represents A1, and-node 3 represents A2, and and-node 6 represents A3. Or-node 2 represents the B disjunction. And-node 4 represents B1 and and-node 5 represents B2. Note that the same id can be used for both an and-node and an or-node. However, the same id cannot be used for two different and-nodes. If two or-nodes have the same id, then one refers to the other (cf. the or-node in and-node 6). Note also that and-nodes can be empty of feature counts (cf. and-node 2).
The statistical estimation code, called cometc (for conditional maximum-entropy estimation from truncated data), has the following usage:
Usage: cometc
-n < number of sentences >
[-W < list of weights associated with and-or-forest file pairs; default = 1 for all >]
-u < list of and-or-forest files for unlabeled data >
-l < list of and-or-forest files for labeled data >
-o < output file for estimated parameters >
[-c < cutoff for frequency-based feature selection; discards features whose counts in the unlabeled data are below the given minimum; default = 1 >]
[-d < cutoff for frequency-based feature selection; discards features that are not discriminative between good and bad readings for the given minimum of sentences; can be combined with -r and -a or -s; default = 1 >]
[-r < regularization constant(s) for double-exponential prior >
 -a < added features in each grafting step >
 [-D < factor by which gradient of a given feature must be different from gradient of previously added feature for it to be added; default = 0 >]
 default = off]
[-s < standard deviation of Gaussian prior; default = off >]
[-p < minimal lookahead for partial gradient calculation; default = off >]
[-i < limit on number of conjugate gradient iterations; default = 1000 >]
Required arguments for the simplest case, that is, for unregularized estimation without feature selection, are -n, -u, -l, and -o.
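A minimal unregularized invocation might therefore look like this (all file names are hypothetical):
cometc -n 1000 -u unlabeled.fw -l labeled.fw -o weights.txt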
Furthermore, options for data weighting, feature selection and regularization are provided.
People may want to weight certain pairs of feature forests in the training data more than others, e.g. on the basis of sentence length. The -W option provides an easy and flexible way of doing this.
The simplest methods for feature selection are based on the frequency of features: -c discards features whose counts in the unlabeled data fall below a given minimum, and -d discards features that are not discriminative between good and bad readings for a given minimum number of sentences.
Alternatively, regularization can be done by specifying a zero-mean Gaussian prior on the model parameters with the -s flag. This regularizer, however, does not discard features (and so does not produce sparse feature vectors), except for very small values of the standard deviation parameter s.
Finally, for very large feature spaces it can be helpful to avoid the computation of gradients for the full set of candidate features in each selection step. The -p flag allows you to set a lookahead value for partial gradient computation.
This lookahead value specifies the number of top-n features ranked according to their gradient value under the previous model for which the gradient will be recomputed.
Another performance parameter, -i, lets you set a threshold on the number of conjugate gradient iterations, to speed up estimation and to avoid overtraining in very large applications.
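Putting these options together, here is a sketch of a run with frequency-based selection, grafting under a double-exponential prior, and an iteration limit (file names are hypothetical and parameter values purely illustrative):
cometc -n 1000 -u unlabeled.fw -l labeled.fw -o weights.txt -c 2 -d 2 -r 10 -a 100 -i 500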
On large data sets, determining the features that "survive" the discriminative frequency cutoff -d can take several minutes. If you want to use the same setting for -d over and over again while adjusting other parameters, such as -r, -a, or -D, you can use the standalone command identify_sparse_features. It is called as follows:
Usage: identify_sparse_features
-n < number of sentences >
-u < list of and-or-forest files for unlabeled data >
-l < list of and-or-forest files for labeled data >
-o < output file for estimated parameters >
[-d < cutoff for frequency-based feature selection; discards features that are not discriminative between good and bad readings for the given minimum of sentences; if not set, the number of sentences for which the corresponding feature is discriminative is output >]
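For example (hypothetical file names), this reuses a discriminative cutoff of 2:
identify_sparse_features -n 1000 -u unlabeled.fw -l labeled.fw -o sparse-features.txt -d 2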
The standalone command select-best-parse selects the best parse of a packed prolog chart according to a weighted features file and writes out the result:
Usage: select-best-parse <weighted features file> <input prolog chart file> [-m <max memory in KiloBytes>] <output prolog chart file>
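For example (hypothetical file names), allowing 2000 KiloBytes of memory:
select-best-parse weights.txt input.pl -m 2000 output.pl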
For an evaluation based on precision/recall/f-score for matching dependency triples against a gold standard such as the PARC-700 dependency bank, please refer to the triples match facility of the transfer system.
For an assessment of the statistical significance of result differences between (variants of) systems, there is a standalone program available that implements the "approximate randomization test". For more information, please see the descriptions of this test in the literature.
Usage: approx-rand-sigtest
-n <number of shuffles>
-a <data matrix of (matches, system-facts, gold-facts) for system A>
-b <data matrix of (matches, system-facts, gold-facts) for system B>
-s <matrix size>
For small test set matrices such as the PARC 700, high confidence levels can be achieved for statistical significance testing by setting the number of shuffles to 1M or more.
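For instance, a sketch of a run with 1M shuffles on PARC 700-sized matrices (the matrix file names are hypothetical):
approx-rand-sigtest -n 1000000 -a systemA.matrix -b systemB.matrix -s 700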
XLE has a user interface to the underlying data structures that was originally designed for the programmers but is now generally available. To inspect the underlying data structures of a tree node click on the node with <Control-Shift-Button-1>. This will give you an inspector on the DTree data structure that represents the display of a tree node. You can also get an inspector for the f-structure in the f-structure display window by clicking on the "Fstructure #N" button with <Control-Shift-Button-1>. This will give you an inspector on the Graph data structure that represents that particular f-structure, which was extracted from XLE's packed representation for display purposes.
The inspector represents the data structure as an attribute-value structure where the attributes are the fields of the data structure and the values are typed values. You can inspect a value just by clicking on it, and another inspector will show up for that value. Some data structures have specialized inspectors that display the information in ways that are particularly helpful for that data structure. For instance, the inspector for the Graph data structure displays the attribute-value tree to depth 3. Whenever you see a specialized inspector, you can revert to the standard inspector by clicking on the "raw" button.
As mentioned above, the Graph data structure has a specialized inspector that makes it easier to view the lazy contexted graphs that are used by XLE. Each graph has an attribute-value tree that is displayed to depth 3. You can toggle the display of an attribute by clicking on it. This means that you can hide the attributes (and their recursive structure) that you aren't interested in, or you can reveal parts of the attribute-value tree below depth 3 one attribute at a time. If you click <Control-Button-1> on an attribute, you will get a standard inspector for the AVPair data structure that represents the attribute. This is especially useful for following attributes that have lazy links down, which are otherwise not displayed.
At the bottom of a Graph inspector is a representation of the solution space. Solutions are grouped into restriction sets based on which combinations of contexted variables are of interest to a consuming constituent. Within each restriction set, there are a set of restricted solutions. Each restricted solution represents one way of setting the values of the variables in the restriction set. So if one restriction set has the contexted variables A1 and A2, there might be a restricted solution with just A1, just A2, or both A1 and A2. Finally, within each restricted solution is a set of local solutions each of which gives a different way of getting the contexted variables in the restricted solution. The local solutions relate to the restricted solution in the same way that subtrees relate to their constituents in a parse forest: they represent different ways of getting the same thing. Each local solution has 4 parts: p(artial), c(omplete), t(ask), and l(ocal). The partial and complete are the solutions used from the daughter constituents of this subtree to make up this solution. The task context is just an indicator of which subtree this is in the constituent. The local context is which combinations of local contexted variables (i.e. those introduced by instantiating disjunctions in the grammar) are used for this solution. Finally, there is a "show" button that will highlight those context variables that are enabled under this solution.
If you set morphOnly to 1 in the Tcl shell, then XLE will only process the morphology when parsing (it will also process the chart bottom-up instead of top-down to make sure that all of the morphology gets processed). You can then call get_lexical_edges $defaultchart to get a list of edges that are looked up in the lexicon. For all of these, get_field $edge lexical will be T. If get_field $edge is_surface is T, then the edge is a token (e.g. the output of the tokenizer). If get_field $edge lexentry_found is T, then a lexical entry was found for $edge. If get_field $edge unknown is T, then $edge didn't have an exact match in the lexicon. If both are T, then $edge matched -unknown. If an edge has a value for get_field $edge surface, then $edge is a morpheme and the value of get_field $edge surface is the token that it came from. Tokens are also considered lexical edges, since they can match a lexicon entry with * for a morph code. If there is a token that isn't the value of any surface field, then this token wasn't analyzable by the morphology.
As a side effect, calling get_lexical_edges inverts the chart. This means that get_field $edge mothers will return a list of SubTrees that have $edge as a daughter. get_field $subtree mother will return the edge that contains $subtree. If $edge is lexical, then the edge that contains $subtree will be a pre-terminal category that was found in the lexical entry for $edge. For all pre-terminals, get_field $edge preterm is T. If get_field $preterm unknown is T, then the preterminal came from -unknown. Calling get_field $preterm mothers will return the SubTrees that contain $preterm, and so on. This can be used to see if the grammar was able to apply sub-lexical rules to the pre-terminals.
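Here is a minimal Tcl sketch of this workflow, assuming a grammar is already loaded and that get_field returns the string T as described above; the sentence is arbitrary:
set morphOnly 1
parse {This is a sentence.}
foreach edge [get_lexical_edges $defaultchart] {
    # report edges that matched neither an exact lexical entry nor -unknown
    if {[get_field $edge lexentry_found] != "T"} {
        puts "no lexical entry found for edge $edge"
    }
}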
At some point, you will want to use the results of XLE for some other purpose. There are several ways to get results out of XLE:
You can read how to use these commands by using the help command provided by XLE (e.g. help print-tree). Whenever an explicit filename is not given to one of these printing functions, then XLE will construct a default filename based on the type of the object printed and the format. If the top-level f-structure associated with the object to be printed either (a) has a SENTENCE_ID attribute with a constant value, or (b) comes from parsing a sentence in a testfile preceded by # SENTENCE_ID:, then the SENTENCE_ID value will be used in the name. The outputs are mostly self-explanatory, unless charts are involved (as is the case with print-chart-graph and print-prolog-chart-graph).
The print-chart-graph command prints out a contextualized feature structure that represents all valid f-structures in the chart. For instance, the sentence John saw the girl with the telescope might produce the following output:
^=%2
(%1 PRED)='John'
(%2 PRED)='saw<%1, %3>'
(%2 MODS)=%8 <<a>>
%4$%8 <<a>>
(%3 PRED)='girl'
(%3 DET)=+
(%3 MODS)=%11 <<~a>>
%4$%11 <<~a>>
(%4 PRED)='with<%13, %5>'
(%4 SUBJ)=%13
%13=%2 <<a>>
%13=%3 <<~a>>
(%5 PRED)='telescope'
(%5 DET)=+.
This is like the output for print-fs-as-lex, except that certain constraints are annotated with <<a>> and <<~a>> tags. These tags indicate which contexts the constraints are defined in. If we look just at the <<a>> constraints, we find the reading where with attaches to saw. If we look at the <<~a>> constraints, we get the reading where with attaches to girl.
When we try the more complicated John saw the girl with the telescope on the hill, we get some new information at the bottom:
^=%2
(%1 PRED)='John'
(%2 PRED)='saw<%1, %3>'
(%2 MODS)=%10 <<a>>
%6$%10 <<~b>>
%4$%10 <<{b | c}>>
(%3 PRED)='girl'
(%3 DET)=+
(%3 MODS)=%13 <<{~a | ~c}>>
%6$%13 <<~d>>
%4$%13 <<{~a | ~c}>>
(%4 PRED)='with<%15, %5>'
(%4 SUBJ)=%15
%15=%2 <<{b | c}>>
%15=%3 <<{~a | ~c}>>
(%5 PRED)='telescope'
(%5 DET)=+
(%5 MODS)=%18 <<{b | d}>>
%6$%18 <<{b | d}>>
(%6 PRED)='on<%20, %7>'
(%6 SUBJ)=%20
%20=%2 <<~b>>
%20=%3 <<~d>>
%20=%5 <<{b | d}>>
(%7 PRED)='hill'
(%7 DET)=+
<<CONTEXT-FACTS>>
<<{b | ~b} iff a>>
<<{c | ~c} iff ~b>>
<<{d | ~d} iff ~a>>.
This is like the previous output only now we have more complicated contexts (e.g. <<{b | d}>>) and we have some information at the end that says how the contexts are related. For instance, the <<{b | ~b} iff a>> line indicates that b and ~b are only defined when a is true. The contexts have the following approximate interpretations:
a = saw has modifiers
~a = saw has no modifiers
b = with modifies saw (a must be true)
~b = on modifies saw (a must be true)
c = with modifies saw (~b must be true)
~c = with modifies girl (~b must be true)
d = on modifies telescope (~a must be true)
~d = on modifies girl (~a must be true)
Taken together, these disjunctions give a five-way attachment ambiguity, although the particular way that the choices are broken out may be different from what you expected.
Although the above examples both only show PP attachment ambiguity, this mechanism is in fact very general. As a final example, here is a possible output for fruit flies like a banana:
^=%7
(%1 PRED)='fruit'
(%1 NUM)=SG <<~a>>
(%2 PRED)='fly' <<a>>
(%2 MODS)=%11 <<a>>
(%2 NUM)=PL <<a>>
%1$%11 <<a>>
(%3 PRED)='fly<%1>' <<~a>>
(%3 MODS)=%14 <<~a>>
%5$%14 <<~a>>
(%4 PRED)='like<%2, %6>' <<a>>
(%5 PRED)='like<%3, %6>' <<~a>>
(%6 PRED)='banana'
(%6 DET)=a
%7=%3 <<~a>>
%7=%4 <<a>>.
Here, the fact that the head is ambiguous is represented by having a disjunction at variable %7 between %7=%3 (fly as a verb) and %7=%4 (like as a verb).
print-fs-as-prolog and print-prolog-chart-graph are variants of print-fs-as-lex and print-chart-graph that print out prolog terms instead of LFG constraints. They also print out a packed-forest representation of the chart, and the phi mapping between the chart and the feature constraints. Here is the result of print-prolog-chart-graph on Parking brake warning light:
fstructure('Parking brake warning light',
% Properties:
[
'xle_version'('XLE release of Feb 26, 2004 11:29.'),
'grammar'('/pargram/english/standard/english.lfg'),
'grammar_date'('Mar 19, 2004 08:57'),
'statistics'('1 solutions, 0.15 CPU seconds, 36 subtrees unified'),
'rootcategory'('ROOT'),
'outputStructures'('c f::')
],
% Choices:
[
choice([A1,A2,A3,A4], 1),
choice([B1,B2], A3)
],
% Equivalences:
[
define(CV_004, or(B1,or(A1,A2))),
define(CV_003, or(A2,or(A3,A4))),
define(CV_002, or(B1,A4,or(A1,A2))),
define(CV_001, or(A4,or(A1,A2))),
select(A1, 1),
select(or(A1,A2), 1)
],
% Constraints:
[
cf(A1,eq(var(0),var(4))),
cf(CV_003,eq(var(0),var(7)),CV_003),
cf(A3,eq(attr(var(19),'PRED'),semform('BRAKE',1,[],[]))),
cf(A3,eq(attr(var(19),'PERS'),3)),
cf(A3,eq(attr(var(19),'ANIM'),'-')),
cf(A3,eq(attr(var(19),'NTYPE'),'COUNT')),
cf(A3,eq(attr(var(19),'NUM'),'SG')),
cf(A3,eq(attr(var(19),'SPEC'),var(20))),
cf(B2,eq(attr(var(19),'COMPOUND'),var(21))),
cf(B1,eq(attr(var(19),'ADJUNCT'),var(23))),
cf(1,eq(attr(var(20),'TYPE'),'DEF')),
cf(B2,eq(attr(var(21),'PRED'),semform('PARKING',2,[],[]))),
cf(B2,eq(attr(var(21),'PERS'),3)),
cf(B2,eq(attr(var(21),'ANIM'),'-')),
cf(B2,eq(attr(var(21),'NTYPE'),'MASS')),
cf(B2,eq(attr(var(21),'NUM'),'SG')),
cf(B2,eq(attr(var(21),'SPEC'),var(22))),
cf(B2,eq(attr(var(22),'TYPE'),'DEF')),
cf(B1,in_set(var(24),var(23))),
.
.
.
],
% C-Structure:
[
cf(A1,subtree(735,'*TOP*',null,362)),
cf(A1,phi(735,var(4))),
cf(A1,subtree(362,'ROOT',711,573)),
cf(A1,phi(362,var(4))),
cf(A1,subtree(711,'ROOT',null,707)),
cf(A1,phi(711,var(4))),
cf(A1,subtree(707,'ADVPsent',null,653)),
cf(A1,phi(707,var(24))),
cf(A1,subtree(653,'VPverb',645,332)),
cf(A1,phi(653,var(24))),
.
.
.
cf(CV_003,subtree(203,'N',202,41)),
cf(CV_003,phi(203,var(17))),
cf(CV_003,subtree(202,'N',null,37)),
cf(CV_003,phi(202,var(17))),
cf(CV_003,subtree(37,'N_BASE',null,36)),
cf(CV_003,phi(37,var(17))),
cf(CV_003,terminal(36,'light',[34])),
cf(CV_003,phi(36,var(17))),
cf(CV_003,subtree(41,'N_SFX_BASE',null,42)),
cf(CV_003,phi(41,var(17))),
cf(CV_003,terminal(42,'-Nsg',[34])),
cf(CV_003,phi(42,var(17))),
cf(1,surfaceform(1,'parking',1,8)),
cf(1,surfaceform(13,'brake',9,14)),
cf(1,surfaceform(22,'warning',15,22)),
cf(1,surfaceform(34,'light',23,28))
]).
The fstructure term is a 6-tuple of sentence, properties, choices, equivalences, contexted feature structure constraints, and contexted c-structure constraints. Alternatives are represented by the "choice" predicate. The definition of the choice predicate is:
choice([B1,B2],A1) =def ((B1 | B2) <-> A1) & ~(B1 & B2)
Context variables are only used to save space in the representation. They are represented by the "define" predicate. The right-hand side of a define predicate can contain an arbitrary boolean expression of and's and or's which is interpreted using standard predicate logic. The definition of the define predicate is:
define(X, Y) =def X <-> Y.
The "select" predicate is used to indicate which solution, if any, was selected when the structure was printed. There can also be "nogood" predicates, which indicate that certain choices are nogood.
Each contexted constraint has a context and a constraint. In the feature structure constraints, equality is represented by eq and set membership by in_set. Attributes are represented by attr, with the attribute name always quoted. Projections are represented by proj. Local variables are represented by var with an integer value.
Semantic forms are represented by the semform term, where the first argument is the name of the semantic form (always quoted), the second is an identifier for the semantic form, the third is the arguments as a list, and the fourth the non-arguments as a list. If two semantic forms have different identifiers, then they correspond to different instantiations of the semantic form. In a sentence like John gave a book to Mark and a record to Bill, there will be two different semantic forms for give, but they will have the same identifier because they came from the same instantiation. The identifiers are ordered based on the string position of the places where the semantic forms were instantiated.
Instantiated values are followed by an underscore in the Prolog representation. For instance, the constraint (^ PERF)= +_ is represented as eq(attr(var(0),'PERF'),'+_'). In order to distinguish between an instantiated symbol and a lexical entry that happens to have a trailing underscore, the trailing underscores of a lexical entry are duplicated. For instance, the constraint from the lexical entry a_ TOKEN * (^ TOKEN)=%stem is represented as eq(attr(var(0),'TOKEN'),'a__'). XLE removes the duplicate underscores when the value is read in. If a value has an odd number of trailing underscores, then it is treated as an instantiated value.
After the contexted f-structure constraints is a contexted packed-forest representation of the c-structure constraints. In the packed-forest representation, trees are always binary branching (otherwise, the packed-forest representation would be O(n^(k+1)) in size, where k is the largest number of daughters). Trees are represented by subtree(mother,label,partial,complete), where 'mother' is the mother constituent, 'label' is the label of the mother, 'complete' is the right daughter, and 'partial' is a new constituent that represents all of the daughters to the left of the right daughter. For consistency, we always have a partial edge even if there is only one left daughter. Terminal nodes are represented as having a node id, a label, and a list of token ids that they map to. The phi projection is represented using a separate predicate. All of the other c-structure projections are grouped under one "cproj" predicate, whose value is a variable that has the c-structure projections. For technical reasons, the phi projection is a projection of the mother constituent, whereas the c-structure projections are projections of the right daughter.
Surface forms (e.g. tokens) are represented by the surfaceform predicate. The surfaceform predicate has a node id, a label, and a left and right input position in the input string (see Input Positions for more details).
Newer prolog files have semform_data predicates. A semform_data predicate has four arguments: the lexical id for a semantic form, a node id for the node that the semantic form was instantiated in, the left input position of the surfaceform corresponding to the node, and the right input position of the surfaceform corresponding to the node. If the node does not have a corresponding surfaceform (e.g. a null_pro), both input positions are the position of the left edge of the node. Note that the node that a semantic form is instantiated in is often a morpheme rather than a surface form, so the input positions may be a subset of the input positions of the surface form that the lexical id came from. The input positions are useful for mapping stand-aside markup information into lexical ids and dependency relations.
Newer prolog files have fspan predicates. An fspan predicate represents the span of the input string that an f-structure covers. An f-structure can have more than one fspan if the f-structure is discontinuous. An fspan predicate has three arguments: a var, the left input position of the var, and the right input position of the var.
As an example of the c-structure system, the tree "NP:1 -> John:2" might be represented in the following (uncontexted) manner:
subtree(1,'NP',null,2),
phi(1,var(1)),
cproj(1,var(2)),
terminal(2,'John',[3]),
phi(2,var(1)),
surfaceform(3,'John',1,5)
You can control which structures are included in the prolog output by setting the Tcl variable outputStructures. For instance,
set outputStructures "c f::"
will restrict the prolog output to the c-structures and the f-structures. The variable outputStructures can contain any projection defined in the grammar (such as f::, m::, or o::) or c for the c-structure or s for the surface forms (NB: the latter two do not have colons following them and are not to be confused with the similarly named c:: or s:: projections). If you include "EXTERNAL", then XLE will only output the attributes listed in the EXTERNALATTRIBUTES field in the grammar's config. If you include "PREDSONLY", then XLE will only output the attributes that relate to PREDs (e.g. PRED and the semantic and governable attributes).
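For instance, here is a sketch that restricts a prolog dump to the c-structure, the f-structure, and the surface forms (the output file name is hypothetical; see help print-prolog-chart-graph for the exact argument order):
set outputStructures "c f:: s"
print-prolog-chart-graph $defaultchart out.pl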
The surfaceform, semform_data, and fspan predicates give the left and right positions of these predicates in the input string. If the performance variable input_position_type is set to chars, then input positions are measured in characters. Otherwise, input positions are measured in bytes. The first character in the string is at input position 1. The right input position is always the input position of the character that immediately follows the surface form. This means that the length of the token is typically the difference between the right input position and the left input position. Thus, a token that has no realization in the input (such as a haplology comma before a period) would have the same left and right input position. Note that the input positions of a surface form can occasionally be off by one or two from what you would expect because of subtle consequences of how the tokenization rules for the tokenizer are compiled.
If there are grammar libraries in the tokenizer, then the input positions will be relative to the string produced by the last grammar library rather than the input string. This is because XLE can't keep track of the relation between the characters in the input string and the characters in the output of the grammar libraries. When this happens, input positions are relative to a sentence that is stored under the property markup_free_sentence.
Named entity taggers and part-of-speech taggers often add markup to the input string that is used by the parser to parse the string better. This markup can mess up the input positions, since the markup is not part of the original string and the input positions are relative to the marked-up string. If you want the input positions to be relative to the original string, you can use the performance variables left_markup, right_markup, and replace_markup to describe the markup so that XLE can skip over it when assigning input positions. In addition, XLE will remove and/or replace the markup from the markup_free_sentence. Thus, markup_free_sentence can be used as a presentation string for highlighting purposes.
Markup can be specified using the xfst regular expression notation or as a fsmfile that ends with ".fsmfile" or ".fst". For instance, here is something that could be added to a performance variables file to tell XLE to skip over XML markup:
setx left_markup {"<" \["/"|"<"|">"] \["<"|">"]* ">"}
setx right_markup {"<" "/" \["<"|">"]* ">"}
setx replace_markup {{&lt;}:{<} | {&gt;}:{>} | {&amp;}:{&}}
The first regular expression covers markup like <abc>. The second regular expression covers markup like </abc>. The last regular expression replaces markup like &lt; with its non-markup form (<).
Left markup appears to the left of the word that it marks up. It has the same input position as the character that follows it. Right markup appears to the right of the word that it marks up. It has the same input position as the character that precedes it. Thus, left markup and right markup don't have a position in the original string. Replace markup replaces one string with another. The character positions of the remainder of the string are adjusted accordingly.
A packed f-structure is normalized when the number of equality and subsumption links in it have been minimized. If the number of solutions in a packed f-structure is very large, normalizing the f-structure can cause XLE to time out. So normalization is only performed if normalize_chart_graphs is set to 1, or if max_solutions is set to something other than 0, or if normalize_graph is called explicitly.
Normalization minimizes the number of equality links by substitution. If there is an equality link from A to B in context C, then A is replaced with B in context C wherever A occurs and the equality link is deleted. For instance, the constraints:
cf(1,eq(attr(var(0),'SUBJ'),var(9))), cf(A1,eq(var(9),var(7)))
would become:
cf(A1,eq(attr(var(0),'SUBJ'),var(7))), cf(A2,eq(attr(var(0),'SUBJ'),var(9)))
When normalization is done, there should be no equality links left. Reentrant structures are indicated by shared variables:
cf(1,eq(attr(var(0),'SUBJ'),var(9))), cf(1,eq(attr(var(0),'TOPIC'),var(9)))
Normalization minimizes the number of subsumption links by replacing the subsumee with the subsumer when possible. Subsumption links are introduced by XLE when attributes are distributed across sets. For instance parsing "John tossed and turned" causes XLE to introduce subsumption links when the SUBJ "John" gets distributed across the set consisting of f-structures for "tossed" and "turned":
cf(1,eq(attr(var(0),'COORD-FORM'),'and')),
cf(1,in_set(var(1),var(0))),
cf(1,in_set(var(2),var(0))),
cf(1,eq(attr(var(3),'PRED'),semform('John',0,[],[]))),
cf(1,subsume(var(3),var(4))),
cf(1,subsume(var(3),var(5))),
cf(1,eq(attr(var(1),'PRED'),semform('toss',1,[var(4)],[]))),
cf(1,eq(attr(var(1),'SUBJ'),var(4))),
cf(1,eq(attr(var(4),'PRED'),semform('John',0,[],[]))),
cf(1,eq(attr(var(2),'PRED'),semform('turn',2,[var(5)],[]))),
cf(1,eq(attr(var(2),'SUBJ'),var(5))),
cf(1,eq(attr(var(5),'PRED'),semform('John',0,[],[])))
This is done in case the verbs assert inconsistent constraints on their subjects (as occurs with quirky case in Icelandic). During normalization, XLE replaces the subsumees with their subsumers when they are the same:
cf(1,eq(attr(var(0),'COORD-FORM'),'and')),
cf(1,in_set(var(1),var(0))),
cf(1,in_set(var(2),var(0))),
cf(1,eq(attr(var(3),'PRED'),semform('John',0,[],[]))),
cf(1,eq(attr(var(1),'PRED'),semform('toss',1,[var(3)],[]))),
cf(1,eq(attr(var(1),'SUBJ'),var(3))),
cf(1,eq(attr(var(2),'PRED'),semform('turn',2,[var(3)],[]))),
cf(1,eq(attr(var(2),'SUBJ'),var(3)))
parse-testfile will parse a text that has already been broken up into separate sentences. However, it will not work on arbitrary texts. If there is a BREAKTEXT transducer defined in the morph config file, then you can create a testfile using make-testfile. Using make-testfile allows you to take full advantage of the flexibility of parse-testfile. However, you can also parse a text directly using parse-file. parse-file takes two arguments: a file name and an output directory:
parse-file mytext.txt outputfiledir/
parse-file will parse the segments in the file that are provided by the BREAKTEXT transducer using the given parser and print the results as Prolog files in the output directory. Each Prolog file name contains a number that indicates which segment of the text the Prolog file represents. If the parser is omitted, then parse-file uses the value of defaultparser. Note that parse-file does not produce any of the auxiliary files that parse-testfile produces.
parse-file also supports the notation:
parse-file mytext.txt -parseProc myParseProc -parseData myParseData
where the -parseProc and -parseData parameters are the same as for parse-testfile. parse-file mytext.txt outputfiledir/ is equivalent to parse-file mytext.txt -parseProc defaultParseFileProc -parseData outputfiledir/.
If you type print-rule rulename in the Tcl Shell, then XLE will print an expanded version of the rule with that name. The rule expansion will expand the macros and templates, but it won't shift the epsilon constraints or distribute any constraints on meta categories over their internal categories. You can print the rule expansion for complex categories if you quote the category name with curly brackets so that Tcl won't interpret the square brackets as a command (e.g. print-rule {S[fin]}).
If you type print-lex-entry headword in the Tcl Shell, then XLE will print the effective entry for headword. The effective entry for headword is the result of combining different actual entries for headword together based on edit entries like +CAT, !CAT, ETC and ONLY. If the result has more than one entry for a category, then each category will be printed separately with a comment indicating its source. Also, morphological categories (e.g. those with something other than * as the morph code) will have _BASE appended to their name.
You can get a list of all of the templates defined in the currently loaded grammar by typing get_templates in the Tcl Shell. For each one of these, you can call template_calls_of_template to get the list of templates that the given template calls. These can be used to produce a call graph for the templates.
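As a sketch (assuming template_calls_of_template takes a template name, as described above), the following Tcl loop prints the template call graph as one edge per line:
foreach template [get_templates] {
    foreach callee [template_calls_of_template $template] {
        # one edge of the template call graph
        puts "$template -> $callee"
    }
}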
If you type print-prolog-grammar filename in the Tcl Shell, then XLE will print a Prolog version of the currently loaded grammar on filename. The grammar is represented using the following predicates:
template_def(NAME, PARAMS, DEF).
template_expansion(NAME, PARAMS, BODY).
macro_def(NAME, PARAMS, DEF).
rule(NAME, DEF).
Constraints are represented using a colon infix operator followed by a list (e.g. eq:[p:[^, 'SUBJ', 'CASE'],'NOM'] for (^ SUBJ CASE)=NOM). The atom before the colon is the name of the operator. Most of the operator names should be self-evident. Here is a list of those that might not be:
p path
tc template call
term rule terminal category with constraints
In addition, we use the predicate sf(FN,-,ARGS,NOTARGS) for semantic forms.
If you include -lexentries as an argument to print-prolog-grammar, then XLE will also print the lexical entries for the grammar using the following format:
lex(NAME,[cat(CAT,MORPHCODE,CONSTRAINTS,-),
cat(CAT,MORPHCODE,CONSTRAINTS,-),
...]).
If you include -cfg as an argument to print-prolog-grammar, then XLE will print the rules as if they were context-free grammar rules instead of regular expressions. That is, the rules are all either binary branching or unary branching and there can be more than one rule for each category. In order to do this, XLE introduces a rule category for each state in the finite-state machine that it uses internally to represent the regular expression that defines the rule. The new rule categories are named state(RULENAME,STATEID). For instance, S --> NP VP PP* might be converted to the following rules:
rule('S',seq:[state('S',3),term:['PP',eq:[^,!]]]).
rule('S',seq:[state('S',1),term:['VP',eq:[^,!]]]).
rule(state('S',3),seq:[state('S',3),term:['PP',eq:[^,!]]]).
rule(state('S',3),seq:[state('S',1),term:['VP',eq:[^,!]]]).
rule(state('S',1),term:['NP',eq:[^,!]]).
In order to reduce the size of the output of print-prolog-grammar, we replace template bodies with ? whenever the template body can be obtained by a simple substitution of the arguments in the template_expansion for that template. We do not replace template bodies with ? if there are designator rewrites like (^ SUBJ) --> (^ OBJ) that might apply to an argument, or if one of the arguments is a path designator such as (^ SUBJ) which might be used as a prefix of a path (e.g. (_ARG CASE)=NOM) or which might be coerced into an existential constraint.
Note that XLE deletes inconsistent constant constraints (such as a=b, a~=a, a${b c}, and a~${a b}) when the grammar is loaded. Thus, these constraints and the disjuncts that they belong to will not appear in the output of any of the functions described above.
XLE uses a version of Tcl that supports a wide range of character encodings including iso8859-1 (iso-latin-1), euc-jp (for Japanese), and Unicode (utf-8). By default, XLE assumes that the character encoding of the grammar and the sentences being parsed or generated is iso8859-1 (XLE interprets iso8859-1 as an extended version known as Windows Latin 1 (cp1252), which includes some useful characters such as the euro symbol). This works for English and most Western European languages. Other languages have to declare the character encoding that is being used by each file or module. Currently, you can declare the character encoding of the grammar, the morphology, the stdio, the prolog files, and the test files. The character encodings of these files and modules can all be different as long as they are mutually intelligible.
To declare the character encoding of a grammar, add a line like the following to the grammar configuration:
CHARACTERENCODING euc-jp.
This will tell XLE that the grammar files are encoded using this character encoding. XLE also uses this character encoding internally for the data structures associated with this grammar (this is only relevant for clients of the C API for XLE).
The finite state transducers are assumed to have the same character encoding as the grammar files unless the character encoding is explicitly specified by setting the CHARENCODING property of each transducer. All of the transducers must use the same character encoding. If two transducers have different character encodings specified, then XLE will print a message.
To parse or generate with such a grammar, you will need to let XLE know what character encoding the input and output is in. This can be specified with a command like:
set-character-encoding stdio euc-jp
If you are using parse-testfile, then you can tell XLE the character encoding of the test file using a command like:
set-character-encoding japanese.lfg euc-jp
You can also specify the character encoding in the file itself using Emacs file variables. This is the preferred technique for specifying the character encoding of a file, since it is carried with the file and since Emacs will know how to display the characters correctly.
XLE uses the same system for naming character encodings that Tcl uses. You can get a list of the character encodings that Tcl supports by typing encoding names into the Tcl shell (indicated by %):
% encoding names
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345
cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine jis0201
gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland
iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 X11ControlChars cp737
iso8859-16 big5 euc-kr macRomania ucs-2be macTurkish gb1988 iso2022-kr
macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 macCroatian
koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic iso8859-6 cp1251
macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 iso8859-9 cp1254
cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan cp1258 shiftjis
utf-8 cp855 cp936 symbol cp775 unicode cp857
You can also define new character encodings for Tcl. For more information, see the Tcl documentation.
The only way to use grammars with different character encodings in the same window is to make Tcl use UTF-8. For instance, suppose that you wanted to demonstrate the Japanese grammar and the German grammar in the same window. You can load and parse with these grammars in the same window, but the strings in the window may look funny, since the window can only have one character encoding at a time. For instance, if the window's character encoding is euc-jp, then some of the German characters may appear as Japanese. You can get around this problem by setting the character encoding of the window to UTF-8 and by telling XLE that the stdio is encoded in UTF-8:
% set-character-encoding stdio utf-8
% set germanParser [create-parser german.lfg]
% set japaneseParser [create-parser japanese.lfg]
You also need to specify the character encoding of the test files that you are using:
% set-character-encoding german.test.lfg iso8859-1
% set-character-encoding japanese.test.lfg euc-jp
Now you should be able to parse using parse-testfile, and XLE will automatically convert the sentences in the testfile from their encoding into UTF-8 for display in Tcl, and then convert from UTF-8 into the grammar's character encoding for parsing:
% parse-testfile german.test.lfg 1 -parser $germanParser
% parse-testfile japanese.test.lfg 1 -parser $japaneseParser
Unfortunately, Emacs does not fully support UTF-8 yet. However, Emacs 22 is supposed to fully support UTF-8 when it comes out. In the meantime, you can use the Mule-UCS package to handle Chinese, Japanese, and Korean characters. The Macintosh terminal window fully supports UTF-8.
You can set the character encoding of the Prolog files that are read and written by XLE using set-character-encoding prolog plus the name of a character encoding. If you want to set the Prolog character encoding of a specific chart to a specific encoding, (say, utf-8), use the following:
set-character-encoding prolog utf-8 $chart
XLE automatically records the character encoding used when writing out a prolog file.
When you release a grammar to others, you may want to package it up a little differently from the development version of the grammar. make-release-grammar copies the files of the currently loaded grammar into another directory, flattening any directory structure and changing the configs to point to the new files. This is similar to make-bug-grammar, except that it doesn't preserve the original config files. If the grammar config has a value for ENCRYPTFILES, then any files listed there will be encrypted using encrypt-lexicon-file unless -noencrypt is given as an argument. This protects your lexicons from unauthorized use. When make-release-grammar is done, you should have a self-contained grammar that can be easily given to others and that runs without further modification.
make-release-grammar can take the following arguments:
make-release-grammar (<chart>) (-targetdir <targetdir>) (-noencrypt)
where <chart> defaults to $defaultchart and <targetdir> defaults to release-$currentdate at the same level as the grammar's directory.
If the grammar config contains the line "GRAMMARVERSION.", then make-release-grammar will store the release directory in its place (for example, "GRAMMARVERSION release-2006-09-07.").
Let's assume that the XLE tar file was unpacked into /usr/local/xle. Then the main C interface to XLE will be in /usr/local/xle/include/xle.h. Look at this first to see if it will meet your needs. If you want to examine the XLE data structures, you will also need some of the following files: chart-typedefs.h, DU-typedefs.h, chart.h, graph.h, attributes.h, relations.h, values.h, semantic-forms.h, and clause.h. The typedefs files should go first in the list of included files, with chart-typedefs.h coming before DU-typedefs.h. There are also some useful functions in the files that end in "func" (e.g. clause-func.h).
The XLE library is named libxle.so on Solaris and Linux and libxle.dylib on MacOS X. It will be stored at /usr/local/xle/lib. If you want to load the XLE library dynamically (e.g. at run time), then your Makefile will need to look something like:
myprogram: myprogram.o /usr/local/xle/lib/libxle.so
${CC} myprogram.o -o myprogram -L/usr/local/xle/lib -lxle
You will also need to put libxle.so in /usr/lib or add /usr/local/xle/lib to LD_LIBRARY_PATH and/or DYLD_LIBRARY_PATH so that the dynamic loader can find libxle.so at run time.
If you want to load the XLE library statically, then treat it as if it were an object file instead of a library:
myprogram: myprogram.o /usr/local/xle/lib/libxle.so
${CC} myprogram.o -o myprogram /usr/local/xle/lib/libxle.so
In either case, you need to make sure that the following environment variables are set correctly at run time so that XLE can access Tcl source files and character encoding maps as needed:
setenv XLEPATH /usr/local/xle
setenv TCLLIBPATH /usr/local/xle/lib/tcl8.4
setenv TCL_LIBRARY /usr/local/xle/lib/tcl8.4
setenv TK_LIBRARY /usr/local/xle/lib/tk8.4
to_xml input-path input-filename output-path output-filename analysis-type
where all arguments are strings. input-filename should be the name of the prolog file which is to be converted. Normally, this will be a file with a .pl extension output by XLE via a command such as fs {sentence} filename. output-filename should be the name of the xml file (with a .xml extension) that is to be written as the result. If this file already exists, its contents will be overwritten. The input-path and output-path strings should indicate the absolute paths for the input prolog file and the output xml file and should end with the character '/'. analysis-type specifies the type of analysis for which the conversion is being made. Valid string values for this parameter are: FS, TRIPLES, SEM, KR, and XFR. If any other string is passed in as the type of analysis, the conversion module will attempt to treat the input file as a prolog readable file and use a general conversion procedure to write an xml file that corresponds to the prolog file.
Here is an example usage of the command:
to_xml /tilde/anon/input/ fs1.pl /tilde/anon/output/ fs1.xml FS
C programmers intending to build xle from source should take a look at the file "xml_conversion.h" that includes the interface for the underlying C function.
If one of the valid analysis types is given to the command as the last parameter, then the xml file produced will contain the root element analysis with the type attribute specifying the type of analysis. For example:
<analysis type="structure">
...
</analysis>
will indicate that the xml specifies the f- and c-structures for the given sentence.
The rest of the xml document can be divided into 4 parts. These are: sentence, properties, packing information and analysis constraints or facts.
The sentence that produced the output xml is given in the sentence element. When the analysis type is sem or kr, this element will also include the no (number) attribute, whose value specifies the order of the sentence in the suite it belongs to. If the sentence is standalone, this number will typically be 1. For example:
<sentence no="1">All men are mortal.</sentence>
In addition, the triples and transfer analysis types will include a num_solutions element after the sentence element, specifying the number of different solutions (or choices) the analysis encapsulates for the given sentence.
The properties for the analysis are given under the properties element for analysis types structure, triples, and transfer. Each element within the properties element specifies one property. The property's name is given as the element and its value with respect to the particular instance of analysis is given as the element's value. For example:
<properties>
  <xle_version>XLE release of Jul 19, 2005 10:02.</xle_version>
  <grammar>/project/pargram/english/content_analysis/grammar/english.lfg</grammar>
  ...
</properties>
The packing information encapsulates how different choices for the interpretation of the given sentence are represented under a packed-forest representation. This information will be given under the packing element. If the user has asked for a singleton analysis (i.e., only one interpretation of the sentence is represented in the analysis), the packing element will be empty. Otherwise it will include 0 or more of each of the following elements (alternatives_for, weight_of, definition_of, selected_for, and equality_in), all of which have the attribute context.
For example:
<packing>
  <alternatives_for context="1">[A1, A2]</alternatives_for>
  ...
  <weight_of context="A1">0.540843</weight_of>
  ...
  <definition_of context="CV_001">or(A1, A2)</definition_of>
  ...
  <selected_for context="1">A1</selected_for>
  ...
  <equality_in context="A2">
    <arg no="1">var:3</arg>
    <arg no="2">gerund</arg>
  </equality_in>
  ...
</packing>
This is the main, informative part of the analysis. For analysis type structure this part will contain the elements fstructure and cstructure. For analysis type triples it will contain the element triples_facts; for sem, sem_facts; for kr, kr_facts; for transfer, xfr_facts. In an effort to unify the different types of analysis in their representation, each of these elements is a list of nested elements named constraint (in structure) or fact (in triples, transfer, sem, and kr).
If the analysis encapsulates more than 1 solution, then each constraint or fact element will also have a context attribute, whose value indicates the choice under which the constraint given is true.
The structure of the constraint or fact element is identical across types of analysis, with the exception of triples. Each of these elements has a label element, which names the predicate, and one or more numbered arg elements, which may themselves contain nested label and arg elements.
Here's an example that says "under choice 1, the subject of variable 0 is variable 1:"
<constraint context="1">
  <label>eq</label>
  <arg no="1">
    <label>attr</label>
    <arg no="1">var:0</arg>
    <arg no="2">SUBJ</arg>
  </arg>
  <arg no="2">var:1</arg>
</constraint>
For the triples analysis type, each fact element will have a similar label element, which will hold an attribute name. However, instead of arguments it will have two elements named of and is. These elements hold surface-level values and can be interpreted as arguments 1 and 2 to the predicate indicated by label. For example:
<fact context="1">
  <label>NUM</label>
  <of>telescope:8</of>
  <is>sg</is>
</fact>
For more information on what different label and arg values mean, please refer to the XLE documentation.