Starting a ParGram Grammar

Tracy Holloway King  

  1. Walkthrough
  2. Common features, grammatical functions, and templates
  3. Sample grammar
  4. List of useful xle tricks
  5. Background reading

This document is intended for people who are using XLE to write LFG grammars. Almost all of the information here is in the xle documentation. However, it is arranged so that things that are of immediate use to beginning grammar writers, or that are different from theoretical LFG, are given prominence.

Walkthrough

Do the walkthrough provided with xle before starting on anything. It is also useful to skim the xle documentation to get some idea of what is there. However, the documentation is now very extensive, and so it is hard to absorb until you have worked a bit with the system.

Pargram features and grammatical functions

This section is intended for people who are working on pargram grammars that are supposed to conform to the existing pargram feature committee standards. Since these standards are not well documented, this section provides a starting place.

The other thing to do is to take the large English grammar, which is available to anyone with a pargram license, and parse constructions that are similar to the ones you are interested in. If the analyses seem feasible for your language, then go ahead and use them. Note that reading the grammar itself will be difficult at first because it is so large. However, just looking at the f-structure output may be useful.

There is a naming convention for features.  Features are in all uppercase letters while (atomic) values are in all lower case letters.  For example, the feature NUM can have the value pl.  Features whose values reflect surface forms in a language are named X-FORM where X can be any number of upper case letters.  For example, PFORM is used for the form of prepositions that do not have a PRED (otherwise the PRED encodes the form information redundantly).  If there is more than one letter before the FORM, then a hyphen is inserted for legibility: for example, PRON-FORM for the surface form of pronominals.
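In defining equations, these conventions look as follows (the particular values here are illustrative, not drawn from any specific grammar):

 (^ NUM) = pl             "plural number"
 (^ PFORM) = with         "form of a PRED-less preposition"
 (^ PRON-FORM) = they     "surface form of a pronominal"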

Grammatical Functions

There are some standard naming conventions for grammatical functions across the pargram grammars. Other grammatical functions may need to be added for new languages.
The following grammatical functions are non-subcategorized and set-valued. Note that sets can have scoped elements, which can be very useful for noun-noun compounds and for coordination.
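For instance, ADJUNCT is a typical set-valued, non-subcategorized function; a rule annotation adds each modifier to the set with the membership operator $ rather than a defining equation. A sketch (the rule itself is hypothetical):

 VP --> V
        (NP: (^ OBJ)=!)
        PP*: ! $ (^ ADJUNCT) "each PP is a member of the ADJUNCT set".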

Common feature table

There is a common feature table common.features.lfg defined for the pargram grammars. Each grammar should use these features when possible. As detailed in the feature table, each language can:
The common feature table is included with this document. Note that it is periodically updated, and the new version is sent around to all the grammar writers (it should also be on the pargram common workspace at http://ling.uib.no/bscw/).

The features are discussed here in more detail.

CHECK: The CHECK feature is one that each grammar can use for grammar-internal features that serve largely as well-formedness checks. These CHECK features are generally assumed to be ignored by applications.

The notations are read as follows:
The values listed are those that are permitted for that feature; they are not required.  Currently there is no way in the feature table to require a feature to have a particular set of values.  For example, there is no way to state that TNS-ASP must contain TENSE.

Nominal/Specifier features:
Adjectival/Adverbial features:
Preposition features:
Clausal/Verbal features:

Common Templates

There is also a common templates file (common.templates.lfg) which can be used with the pargram grammars. It includes a set of templates to assign the features in the common feature declaration. For example, instead of having:
(^ TNS-ASP TENSE)=pres
you can have:
@(TENSE pres)
where the template TENSE is already defined in the common templates.  This can help with grammar maintenance and compliance to feature committee standards.
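Feature templates of this sort are typically just parameterized equations; a definition along these lines (a sketch, not necessarily the exact common definition) would be:

 TENSE(_T) = "assigns _T as the tense value"
    (^ TNS-ASP TENSE) = _T.

The call @(TENSE pres) then expands to (^ TNS-ASP TENSE)=pres, and a change to the feature geometry only has to be made in one place.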

In addition, they include templates to assign:
There are comments in this file to document what these are and the general naming schema that was used.

For example, you could define a template for intransitive verbs called V-SUBJ which in turn could call the common template to construct the predicate.  V-SUBJ might have additional features in it:
V-SUBJ(_P) = "basic intransitive verb template"
             @(SUBJ_core _P) "call to common template to construct the PRED"
             ~(^ PRT-FORM) "no particle allowed".
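A lexical entry for an intransitive verb could then simply call this template (the entry is hypothetical; %stem is the stem supplied by the morphology, as in the "bake" entry below):

 sleep  V XLE @(V-SUBJ %stem).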

Starter English Grammar

There is a larger version of demo-eng.lfg called eng-pargram.lfg which demonstrates many of the mechanisms that you might want to use in your grammar.

File Management

The first thing to notice is that this grammar is split into several files:
It is advisable to divide your grammar into separate files and so this grammar is set up to demonstrate how to do that. Any .lfg file that you want to include must be listed in the CONFIG under FILES. Fsts (finite state machines: tokenizers and morphologies) are called from the morphconfig which can itself be a separate file or can be part of the main grammar file, as is the case here.
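For example, the FILES entry in the CONFIG might look like this (the second file name is hypothetical):

 FILES  eng-pargram-lex.lfg
        my-other-rules.lfg.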

Integrating FSTs

xle makes it very simple to integrate fst tokenizers and morphologies into your grammar. Tokenizers insert token boundaries between words and do things like split punctuation off of words and lowercase initial capitals.  The tokenizer used here is based on that in Beesley and Karttunen. Morphologies associate surface forms with a stem form and a set of tags that encodes the relevant morphological information.

Here we will focus on the morphologies. If you load up eng-pargram.lfg and type:
morphemes baking
xle will return:
{ bake "+Verb" "+Prog" | baking "+Token" }
What this means is that the surface form "baking" is associated with a stem "bake" and the tags +Verb and +Prog. It is alternatively associated with the form "baking" and a built-in tag called +Token.

If you look in the lexicon in eng-pargram-lex.lfg, there is only an entry for "bake"; the various possible surface forms ("bake", "bakes", "baked", "baking") are not listed. Listing them is not necessary because once the string (sentence) is passed through the tokenizer and the morphology, the output string only contains the stem form plus the tags. Each of these stems and tags is listed in the lexicon and can be looked up, as in:
bake V XLE { @(V-SUBJ %stem)
            |@(V-SUBJ-OBJ %stem)
            |@(V-SUBJ-OBJ-OBJTH %stem)}.
+Verb V_SFX XLE @(VTYPE main).
+Prog V_SFX XLE (^ TNS-ASP PROG)=c +.

where V_SFX is an arbitrarily chosen c-structure category.  The grammar in eng-pargram.lfg then assembles these into a V category via the sublexical rule:
 V --> V_BASE "stem form"
      V_SFX_BASE+ "as many tags as the morphology provides".

These rules are quite easy to write except that you need to remember to add "_BASE" to whatever category is listed in the lexicon. For example, V_SFX in the lexicon corresponds to V_SFX_BASE in the sublexical rule. The reasoning behind this has to do with how xle handles the display of these categories, since the sublexical structure is not shown by default (which is why you have to use "show morphemes" to view it).

If in xle you do:
set-OT-rank Fragment NOGOOD
and then parse
parse {V: baking}
you will see the resulting tree if you choose the menu item "show morphemes".

The morphology in eng-pargram-morph.fst also provides analyses for nouns and prepositions. This can be seen by typing in xle:
morphemes girls
morphemes with

which have the output:
{girls "+Token"|girl "+Noun"  "+Pl"}
with {"+Prep"|"+Token"}

However, there are no specific lexical entries for "girl" or "with" or any other noun, pronoun, or preposition in eng-pargram-lex.lfg. Even so, xle is able to provide an analysis for these words by falling back on the entry for "-unknown", a special form that matches any stem in the morphology that does not have an overt lexical entry.  ("unknown" means unknown to the xle lexicon but known to the morphology.  xle first looks for an overt lexical entry; if it cannot find one, it tries to match against "-unknown".)
 -unknown N XLE @(NOUN %stem);
                 P XLE @(PREP %stem).

In this grammar, we have allowed "-unknown" to match nouns (N) and prepositions (P). So, whenever a noun or preposition is parsed and passed through the morphological analyzer, xle will build a lexical entry for it based on -unknown. The information provided by the morphological tags will constrain whether it is treated as a noun or a preposition. In xle type:
parse {N: girls}
to see how this works; the morphological tags can be made visible by choosing the "show morphemes" option.

There is a corresponding -token lexical entry which can be used to match tokens, i.e., items which do not get a morphological analysis. In eng-pargram.lfg and eng-pargram-lex.lfg, this is used to match words in the FRAGMENTS grammar (see below) which are not part of a constituent. Note that the morphcode for -token is always * and not XLE:
-token TOKEN * (^ TOKEN)=%stem.
So, to connect a FST morphology to your grammar, you need to:
Once this is done, you can write your grammar as you would have normally. However, you will have much less lexicon work to do since the morphology and the -unknown entry conspire to provide lexical entries for most words. Of course, if you do not have an fst morphology available, the lexical work shifts to building the fst morphology.

You might ask where the fst morphologies come from. For many languages, such morphologies already exist. However, you can write your own using the xfst tools provided with the Beesley and Karttunen book. The input fst script for eng-pargram-morph.fst is in eng-pargram-morph.infile. This is an extremely unsophisticated script and would not be very efficient for entering large numbers of lexical items. You can use the lexc tools described in the book to create a much more sophisticated morphology in a more succinct and more linguistically satisfying format. Note that the fst files are binary files and hence cannot be looked at in emacs; the infiles which produce these are readable though.
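As a rough illustration of the lexc format (a sketch only; the actual eng-pargram morphology is built differently), a fragment covering "bake" might look like the following. The orthographic e-deletion needed to turn "bakeing" into "baking" would be handled by composing replace rules onto the result:

 Multichar_Symbols +Verb +Base +Prog

 LEXICON Root
   bake   V ;

 LEXICON V
   +Verb+Base:0    # ;
   +Verb+Prog:ing  # ;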

Robustness: FRAGMENTS

When doing initial grammar development, you want ungrammatical sentences to get no parses. However, for later applications, it is often useful to get some type of output for any input. One way to do this is to write a fragment grammar. A fragment grammar builds up well-formed chunks, such as NPs, and then puts all of the chunks together in a FIRST-REST structure. To see these, in xle type:
parse {girls sleep bananas.}
The result is a fragment parse (if you get 0 parses, restart the grammar; the set-OT-rank Fragment NOGOOD command used above removed the fragments so that ungrammatical strings would get no parses). Each piece of the f-structure is well-formed, but the top-level f-structure has no PRED.

In the CONFIG section in eng-pargram.lfg, in addition to defining a ROOTCAT, we have also defined a REPARSECAT. XLE will first try to build a well-formed structure using the ROOTCAT (here S). If it fails, then it will build a structure using the REPARSECAT (here FRAGMENTS).

The rule for this category is in the RULES section. It consists of two main parts.

The first is a disjunction of all the categories we want to build chunks out of (NP, PP, VP, S, and TOKEN). TOKEN is a special category that is used when one of the other chunks cannot be built. The lexical entry for this is in eng-pargram-lex.lfg under "-token" which matches anything that gets a +Token tag in the morphology (it is similar to -unknown which matches anything that goes through the morphology and hence gets any tag other than +Token). All this lexical entry does is provide a feature that records the lexical item.

Each of these is associated with an OT mark "Fragment" which is a dispreference mark (any mark without a prefixed + is a dispreference mark). The reason for this is to make sure that the fragment rule uses the fewest chunks possible. That is, if there is one analysis with an NP chunk and a VP chunk, and another analysis with an S chunk, the analysis with the S chunk will be chosen because it has fewer instances of the Fragment OT mark.

The second part of the FRAGMENTS rule is a recursive call to the rule to build up any additional chunks. Most sentences that go through FRAGMENTS will consist of more than one chunk.
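Putting the two parts together, the shape of such a rule is roughly as follows (a sketch, not the exact rule in eng-pargram.lfg):

 FRAGMENTS --> { NP: (^ FIRST)=! @(OT-MARK Fragment)
                |PP: (^ FIRST)=! @(OT-MARK Fragment)
                |VP: (^ FIRST)=! @(OT-MARK Fragment)
                |S:  (^ FIRST)=! @(OT-MARK Fragment)
                |TOKEN: (^ FIRST)=! @(OT-MARK Fragment)}
               (FRAGMENTS: (^ REST)=!).

Each chunk lands in FIRST, any remaining material is parsed by the recursive call under REST, and every chunk costs one Fragment mark.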

In sum, what you need to build a fragment grammar of your own is:
One final note on the fragments: it is extremely hard to debug a grammar with the fragments on. To turn them off, in xle type:
set-OT-rank Fragment NOGOOD
You can put this command in your xlerc file if you always want the fragments off.

Coordination and METARULEMACRO

There is a special macro that you can define in the RULES section of the grammar called METARULEMACRO. If this macro is present, it is applied to all of the rules in the grammar, including the sublexical rules. The reason to use this is so that when you add in new rules, you don't have to remember to add in calls to other rules or macros that should apply to them, such as coordination. Instead, XLE will do this automatically for you.

There are two main ways in which this macro is used in most pargram grammars. The first is for coordination: by using METARULEMACRO, each rule does not have to contain a disjunct that calls the coordination rules. This is discussed in detail below. The second is to allow certain types of punctuation or markup to apply to any constituent.

Look at the METARULEMACRO definition in eng-pargram.lfg. This macro has three variables. The first is the category name, such as NP. The second is the base category for complex categories; since there are no complex categories in this grammar, _CAT will be the same as _BASECAT. The last is the righthand side, i.e., the expansion, of the rule.

The first disjunct in METARULEMACRO should always be _RHS. Otherwise, the simple, unmarked-up expansion of your rules will not occur; this is almost never the desired effect.

The second and third disjuncts allow coordination to apply.

The final disjunct allows any category to appear surrounded by a left bracket and a right bracket. This can be very useful for determining if a particular parse is available and for cutting down on ambiguity. First parse:
parse {the boys devour the bananas in the cake.}
This sentence has two parses. Next parse:
parse {the boys devour [the bananas in the cake].}
This sentence has only one parse because "in the cake" is forced to be a constituent of the object NP.

Note the call to @PUSHUP in the bracketing disjunct. This template is defined in common.templates.lfg. It is used to make sure that the brackets occur around the highest constituent that they can, instead of occurring at all levels. This might happen when bracketing something like "cats" which is a constituent at both the N and the NP levels. (To see this, comment out the ": @PUSHUP;" and see what happens if you parse something like "[cats] sleep.")

Let's now look at the coordination rules in more detail.

There are two rules for coordination. SCCOORD is used for everything but nominal coordination. It is a simple rule that just takes the f-structures of the two constituent categories and puts them in a set with the conjunction between them. The conjunction will provide a COORD-FORM to the set, as specified in the lexical entry for "and". In order for this feature to appear as a value of the set and not in the f-structures of the conjuncts, you have to define it as non-distributive in the CONFIG:
 NONDISTRIBUTIVES NUM PERS COORD-FORM.
To restrict this rule to only apply to non-nominals, the annotation:
e: _CAT ~$ { NP N };
occurs which states that _CAT cannot be either NP or N.
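The shape of SCCOORD as a whole is then roughly (a sketch):

 SCCOORD(_CAT) = "coordination of non-nominal categories"
    e: _CAT ~$ { NP N }; "restrict to non-nominals"
    _CAT: ! $ ^; "first conjunct goes into the set"
    CONJ "the conjunction; contributes COORD-FORM"
    _CAT: ! $ ^. "second conjunct goes into the set"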

NPCOORD is used for coordinating nominals because the person and number values of a coordinated nominal are not necessarily the same as those of its conjuncts ("the cat and dog jump."). So, NPCOORD provides the correct person and number features, with the NUM feature coming from the lexical entry for "and" and the PERS feature coming from the template NP-CONJUNCT. Since these are features of the set itself, NUM and PERS must also be listed as non-distributive features in the CONFIG. When a verb checks a coordinate subject for number and person, it will see the values of the feature in the set. To see this, parse:
parse {the boy and the girl bake the cake.}
Even though "boy" and "girl" are both singular, the coordinated NP is plural, and the verb "bake" can occur with them. To restrict NPCOORD to only apply to nominals, the annotation:
e: _CAT $c { NP N };
appears in the call to NPCOORD in METARULEMACRO.

In sum, METARULEMACRO applies to every rule in the grammar. The only difficult part in using it is to remember to include a disjunct that just says _RHS to make sure that the rules apply as you intended.

Lexical Rules

Theoretical LFG often uses lexical rules to manipulate predicates in things like passives. You can use lexical rules in xle. It is possible to delete arguments of the predicate and to rename them. However, under the current implementation it is not possible to add arguments.

The lexical rules are defined in the TEMPLATES. An example in this rule is PASS. (The COM comments are used with the emacs lfg-mode tools.)
PASS(_SCHEMATA) = "passive lexical rule"
                  "COM{EX TEMPLATES S: the girl devours a banana.}"
                  "COM{EX TEMPLATES S: a banana is devoured.}"

   { "active version" _SCHEMATA (^ PASSIVE)=-
     |"passive version" _SCHEMATA
      (^ PASSIVE)=c +
     { (^ SUBJ) --> NULL "wipe out the subject"
       |(^ SUBJ) --> (^ OBL) "make into an oblique 'by' phrase"
        @(OT-MARK OblAg)} "COM{EX TEMPLATES S: a banana is devoured by the girls.}"
      (^ OBJ) --> (^ SUBJ) "make the object the subject"}.

 The PASS template takes a predicate such as:
(^ PRED)='bake<(^ SUBJ)(^ OBJ)>'
and rewrites the SUBJ as NULL, which effectively deletes it, or rewrites it as an oblique. There is an OT mark in the disjunct that creates the OBL; given the OT ranking in the CONFIG, this will result in "by" phrases in passives being preferred over adjunct readings (OT marks are discussed more below). The lexical rule then rewrites the object as the subject.

Note that the first disjunct of the PASS template does nothing to the predicate and occurs in an active environment. The second disjunct is the one that performs the passive lexical rule and is constrained to occur in passive environments.

The PASS template is called by the two other templates: V-SUBJ-OBJ and V-SUBJ-OBJ-OBJTH. Thus, both transitive and ditransitive verbs can be passivized in this grammar.
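For example, V-SUBJ-OBJ can pass its predicate through PASS rather than asserting it directly; a sketch (the actual template in eng-pargram.lfg may differ in its details):

 V-SUBJ-OBJ(_P) = "basic transitive verb template"
    @(PASS (^ PRED)='_P<(^ SUBJ)(^ OBJ)>').

Here the PRED equation is the _SCHEMATA argument to PASS, so it surfaces unchanged in the active disjunct and with its SUBJ and OBJ rewritten in the passive one.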

Defaults

It is sometimes useful to provide a default value for a feature. This can be done with the template DEFAULT:
 DEFAULT(_FEAT _VAL) = "provides a default value for a feature"
    { _FEAT "feature exists but with a different value"
       _FEAT ~= _VAL
      |_FEAT = _VAL "assign the default value"
                                 "it will unify if it already exists"}.

This template either requires that the feature have a value other than the default one or assigns the default value. Note that it is important to have the equation in the first disjunct stating that the feature's value is not the default; otherwise you can end up with vacuous ambiguity, that is, multiple parses with no difference in the resulting c-structure or f-structure.

The issue is then where to call the DEFAULT template. In eng-pargram.lfg, default present ("pres") tense is assigned in the S rule. Default third ("3") person is assigned to nouns in the NOUN template.
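The NOUN template call might look like this (a sketch; the actual template in eng-pargram.lfg may contain more):

 NOUN(_P) = "common noun template"
    (^ PRED) = '_P'
    @(DEFAULT (^ PERS) 3) "third person unless otherwise specified".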

Epsilon

In the CONFIG section, you can define a category for epsilon. This will allow you to hang equations in rules where there is no convenient constituent on which to put them.

An example of this is seen in the S rule. Epsilon is "e" in this grammar (and standardly in the pargram grammars):
S --> "COM{EX RULE S: the girl pushes the boys.}"

         e: @(DEFAULT (^ TNS-ASP TENSE) pres)
           "provide pres as a default value to TENSE"
           @(DEFAULT (^ STMT-TYPE) decl)
           "provide decl as default value to STMT-TYPE";

It would have been possible to put these equations on both the VP and the VPaux categories, but by putting them on the "e", they only have to be mentioned once.

Another use for "e" would be in a language in which the copula is sometimes present (e.g., in past tenses) and sometimes not (e.g., in the present tense). The VP copular rule might look like:
VP --> { Vcop "overt copula in past tense"
        |e: "non-overt copula in the present tense"
           (^ PRED)='null-be<(^SUBJ)(^PREDLINK)>'
           (^ TENSE)=present }

       { NP: (^ PREDLINK)=!
        |AP: (^ PREDLINK)=!}.

Note that it is important to avoid using down (!) in the annotations on the "e". You can do this, but the behaviour is not likely to be what you want.

OT Marks

As grammars get bigger, the ambiguity rate becomes very high. One way you can control this is by using OT (optimality theory) marks. These are marks that you put in the grammar rules, templates, and lexical entries. The marks are then ranked in the CONFIG. The orders in eng-pargram.lfg are:
OPTIMALITYORDER NOGOOD *Fragment "disprefer fragments and mark with *"
                        +OblAg. "prefer 'by' obliques in passives"
GENOPTIMALITYORDER GenBadPunct NOGOOD "do not generate these"
                  +GenGoodPunct. "prefer these"

There are two orders: one for parsing and one for generation. (In xle, the same grammar is used for parsing and generation, with the only differences being the tokenizer and the OT order; it is also possible to use slightly different morphologies for parsing and generation.)

If there are two f-structures for a sentence and one has a dispreference mark and one has no mark, then the f-structure with no mark is chosen. This was seen in the case of the fragment grammar where each chunk introduced a Fragment OT mark. So, if there is a choice between an f-structure with one mark (one chunk) and one with two marks (two chunks), the one with one mark is chosen. This results in a fewest-chunks approach to fragmenting.

If there are two f-structures for a sentence and one has a preference mark, which is indicated by a preceding + in the OT order, and one has no mark, then the f-structure with the preference mark is chosen. This is seen in the passive lexical rule PASS in the templates. The OBL reading introduces an OT mark OblAg, which is ranked +OblAg. If you parse:
parse {bananas are devoured by boys.}
there will be 1+1 solutions. The optimal solution is the one shown and it has the OBL-AG reading. The "+1" in the "1+1" is the suboptimal solution and corresponds to an ADJUNCT reading of the "by" phrase. You can see the unoptimal solutions by choosing the "unoptimal" command in the f-structure window.

The generation OT marks work the same way. In this grammar there are two generation OT marks. The preference mark GenGoodPunct (Generate Good Punctuation) requires a period to be generated at the end of sentences. This grammar can parse both of the following:
parse {boys sleep}
parse {boys sleep.}

However, it will only generate:
boys sleep.
To see this, in the f-structure window choose the command "generate from this f-structure".

The dispreference mark GenBadPunct is a NOGOOD mark and hence occurs to the left of NOGOOD in the ranking (NOGOOD does not affect things that occur to its right). This means that any rule part in the grammar with which it is associated has been removed from the grammar. Here, the mark appears in METARULEMACRO in the bracketing markup. This means that bracketing can be parsed but not generated. So, if you parse:
parse {[the girls] sleep.}
the result when generating will be:
the girls sleep.
The same mark also appears on the comma in the coordination rule.

So, OT marks give you as a grammar writer control over some of the ambiguity in the grammar. There are many additional types of OT marks that are described in detail in the xle documentation, but what is described here will give you enough to start with.

Useful XLE tricks

There are a number of extra-grammatical facilities available in XLE that will make grammar writing much easier.

XLE documentation

To access the xle documentation, in xle type:
documentation
and a web browser will be launched with the documentation; this documentation is also found in xle/doc/xle-toc.html.  You can also type:
help
which will list all of the commands that you can use in xle.

xlerc file

Every time you make a change to your grammar, you have to restart xle and reload the grammar. To make this easier, create a file called:
xlerc (important: the file has no extension!)
in the directory that you are going to work in. In it put the line:
create-parser mygrammar.lfg
where "mygrammar.lfg" is the name of your top level grammar file.

Whenever you (re)start xle in that directory, it will automatically create the parser for you.

You can put any commands normally used in xle in the xlerc file and they will automatically be invoked. You can also define procedures and create aliases for commands; these are defined in tcl, and the easiest way to learn about them may be to look at previously defined ones and modify them. For example, you can redefine the "analyze-string" command as "as" via:
proc as {P} {
    analyze-string $P
}

Emacs library

XLE comes with a special emacs library lfg-mode.el. You should load this library when using emacs to edit grammar files and run xle. It will format rules, lexical entries, and templates for you. It also has commands to launch and restart xle and to automatically parse sentences in testfiles.   To get emacs to automatically load this library whenever you are editing a file ending in .lfg, add the following lines to your .emacs file (if you do not have a .emacs file, you can create one in your home directory):
; to load the LFG-mode for XLE
(load-library "/usr/local/xle/emacs/lfg-mode")

Note that the path may be different depending on where xle is installed on your machine.  

If you have never used emacs before, you can access an emacs tutorial by typing:
C-h t
when you are in emacs, where C-h means: hold down the control key while typing "h", and then type "t" without holding down the control key.

There are a number of keyboard short cuts that can be used when you have lfg-mode loaded.
For more details read the xle documentation on emacs support for xle.

Testfiles and comment examples

Grammar development, even at the early stages, involves reparsing examples many times to determine whether they work and, after you make changes, whether they still work. To facilitate this, you can put your example sentences in a testfile. It is best to name your testfile with a ".lfg" suffix since then you can use the emacs library to automatically parse whichever sentence you are interested in. The test file should look like:
# Comment lines begin with hash marks

ROOT: This is a sentence.

NP: a noun phrase

PP: with a noun phrase

NP: an ungrammatical noun phrases (0! 0 0 0)

where each new sentence has a blank line on either side of it. It is useful to put in the parse category (e.g., ROOT, NP, PP) in case you change the default parse (root) category in your grammar. You can indicate sentences which are supposed to get no parses by putting (0! 0 0 0) after them. If these do get a parse, xle will complain.  You can also mark if a sentence is supposed to have a particular number of parses:
ROOT: I see the girl with the telescope (2! 0 0 0)
You can run the entire testsuite at once by doing:
parse-testfile my-testsuite.lfg
where "my-testsuite.lfg" is the name of the testsuite. Note that you can use path names if you don't want to store the testsuites with the grammar files:
parse-testfile testfiles/questions/my-testsuite.lfg
It is possible to automatically create testsuites from comments in the grammar if the comments are of the form:
"COM{EX section example}"
"section" indicates what section it comes from (RULES, TEMPLATES, LEXICON). "example" is the example itself ("NP: a monkey"). In lfg-mode, there is an option to extract the comments under the LFG window bar. Doing this will create an emacs buffer of all of the examples as a testsuite file; this buffer can then be saved as a testsuite file. Some examples of this:
NP --> "rule for common noun phrases"
             "COM{EX RULES NP: boxes}"

            (D: (^ SPEC)=!) "COM{EX RULES NP: the box}"
                            "COM{EX RULES NP: a box}"
                            "COM{EX RULES NP: a boxes (0! 0 0 0)}"
                N "head noun" "COM{EX RULES ROOT: Foxes push the boxes.}".

There are a number of comments of this type in eng-pargram.lfg. You can see what the resulting testsuite files look like by running the extract comments command on this file in emacs.

It is highly recommended to do this because it makes it easier for someone else to read the grammar and makes it easy to figure out which parts of the grammar are working.

Interpreting Error Messages

It takes some time to get used to the xle error messages, just as with any new system.  By doing the walkthrough and playing with the starter grammar provided here, you should get some practice with the types of errors you are likely to run into when doing grammar writing.

Background Reading

This is a list of papers, divided by topic, that might be of direct use to you when writing grammars. Many of them are available electronically.

ParGram project as a whole:

Grammar engineering:

Features and templates:

OT marks:

FST morphology integration:

Grammar porting and adaptation:



2004 09 22
Tracy.King@microsoft.com