Starting a ParGram Grammar
Tracy Holloway King
- Walkthrough
- Common features,
grammatical functions, and templates
- Sample grammar
- List of useful xle tricks
- Background reading
This document is intended for people who are using XLE to write LFG
grammars. Almost all of the information here is in the xle documentation.
However, it is arranged so that things that are of immediate use to beginning
grammar writers, or that differ from theoretical LFG, are given
prominence.
Walkthrough
Do the walkthrough provided with xle before starting on anything.
It is also useful to skim over the xle documentation to get some idea
of what all is there. However, the documentation is now very extensive
and so it is hard to absorb until after you have worked a bit with the system.
Pargram features and grammatical
functions
This section is intended for people who are working on pargram grammars
that are supposed to conform to the existing pargram feature committee
standards. Since these standards are not well documented, this section
provides a starting place.
The other thing to do is to take the large English grammar, which
is available to anyone with a pargram license, and parse constructions
that are similar to the ones you are interested in. If the analyses seem
feasible for your language, then go ahead and use them. Note that reading
the grammar itself will be difficult at first because it is so large.
However, just looking at the f-structure output may be useful.
There is a naming convention for features. Features are in all uppercase
letters while (atomic) values are in all lower case letters. For example,
the feature NUM can have the value pl. Features whose
values reflect surface forms in a language are named X-FORM where
X can be any number of upper case letters. For example, PFORM
is used for the form of prepositions that do not have a PRED (otherwise the
PRED encodes the form information redundantly). If there is more than
one letter before the FORM, then a hyphen is inserted for legibility:
for example, PRON-FORM for the surface form of pronominals.
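For example, in a feature declaration these conventions yield entries such
as the following (the values shown for PFORM and PRON-FORM are illustrative,
not the actual common declarations):
NUM: -> $ {pl sg}.
PFORM: -> $ {by of with}.
PRON-FORM: -> $ {he it they}.
Note that the values of the X-FORM features are simply the surface words
themselves, in lower case.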
Grammatical Functions
There are some standard naming conventions for grammatical functions
across the pargram grammars. Other grammatical functions may need to be
added for new languages.
- SUBJ: subjects ("Mary left")
- OBJ: direct objects ("push the box")
- OBJ-TH: thematically restricted objects for languages
which allow two objects to occur at once ("we gave him a book");
these are the same as secondary objects (OBJ2), but in pargram the more
generic and more LFG-compliant OBJ-TH is used instead.
- OBL: oblique argument; these are usually prepositional
phrases that are subcategorized for by a verb (it can be difficult
to tell whether something is subcategorized or just an ADJUNCT); adjectives may also take OBL complements.
("we talked about them" "proud of him")
- OBL-AG: oblique agent in passives ("it was eaten by
them")
- OBL-COMPAR: the comparison phrase in comparatives and
equatives ("prettier than them" "as pretty as they
are")
- COMP: closed complement clause ("I know that they left"
"I wondered whether they had left")
- XCOMP: open complement; these may
be verbal or small clauses; since they are open, their subject is
provided from outside the predicate ("they want to leave"
"they consider him an idiot")
- XCOMP-PRED: open complement in predicative position;
the -PRED is solely for implementational reasons since the rules
that apply to predicatives/copular clauses are often very different
from those that apply to other XCOMPs. ("he is a teacher"
"he is happy")
- PREDLINK: a closed complement in predicative position;
this is used for languages or constructions within languages
where a closed complement is more appropriate than XCOMP(-PRED).
("he is a teacher" "he is happy") See the Grammar
Writers' Cookbook and Dalrymple, Dyvik, and King's paper
in the LFG 2004 proceedings for more on PREDLINK and XCOMP for
predicatives.
The following grammatical functions are non-subcategorized and are
set valued. Note that sets can have scoped elements which can be very useful
for noun-noun compounds and for coordination.
- ADJUNCT: adjuncts of various types;
this should be used as the default grammatical function for non-subcategorized
arguments; the canonical example of adjuncts are various adverbials
("they ran quickly" "the very red box"); however,
other modifiers can be adjuncts ("when I left, he left"
"having left, he closed the door")
- MOD: the modifying noun in a noun-noun compound
("tractor" in "the tractor trailer") when there
are multiple modifying nouns ("oil filter signal"), it is best
to have these scoped. This can be done by using an equation such
as: ! $<h>s (^ MOD). The $ creates a set; this allows multiple
elements to appear as the MOD, e.g., both "oil" and "filter"
will be in the MOD set modifying "signal". The <h>s
after the $ is used to mark the scope; the "h" guarantees that these
have heads which are in a precedence relation in the c-structure; the
"s" marks the scope. In our example, the f-structure for "filter"
will be marked as scoping over that of "oil". This is extensively
documented in the xle documentation in the section on scope relations in the
functional description language. This can be invoked by the MOD template
in common.templates.lfg
- NAME-MOD: the modifying names in a proper name ("Mary"
and "Jane" in "Mary Jane Smith")
- APP: appositions ("Mr. Smith, the president,")
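As an example of how the scoped MOD equation from above is used, a
noun-noun compound rule might be sketched as follows (the rule itself is
hypothetical; see the English grammar for the real one):
NP --> (D: ^=!)
       N*: ! $<h>s (^ MOD); "modifying nouns, scoped"
       N: ^=!. "head noun"
For "the oil filter signal", both "oil" and "filter" end up in the MOD
set of "signal", with the f-structure of "filter" marked as scoping over
that of "oil".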
Common feature table
There is a common feature table common.features.lfg
defined for the pargram grammars. Each grammar should use these features
when possible. As detailed in the feature
table, each language can:
- add features not in the common feature list
- add feature values present in the language but not in the common
feature list
- delete features not present in the language
- delete feature values not present in the language
The common feature table is included with this. Note that it is periodically
updated and the new version is sent around to all the grammar writers (it
should also be on the pargram common workspace at http://ling.uib.no/bscw/).
The features are discussed here in more detail.
CHECK: The CHECK feature is one that each grammar
can use for grammar-internal features that largely serve as well-formedness
checks. These CHECK features are generally assumed to be ignored by applications.
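For instance, a grammar might require that a perfect participle occur only
under the appropriate auxiliary by having the auxiliary define a CHECK
feature which the participle's entry then constrains (the feature name
_PART-FORM is made up for this example):
(^ CHECK _PART-FORM)=c perf
Since applications ignore CHECK, such features can be added freely for
grammar-internal bookkeeping.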
The notations are read as follows:
- The name of the feature, followed by a colon and an ->
- One of the following:
- << [ ] : feature whose values are f-structures
- << { } : set-valued feature (none shown here, but there is
an example for PSEM in the big English grammar)
- $ { } : atomic-valued feature
- The values that the feature can have are listed in the brackets
The values listed are the ones permitted for that feature; they
are not required. Currently there is no way in the feature table to
require a feature to have a particular set of values. For example,
there is no way to state that TNS-ASP must contain TENSE.
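Putting the notation together, declarations of each kind look like the
following (values abbreviated from the full declarations discussed below):
CASE: -> $ {acc nom}. "atomic valued"
TNS-ASP: -> << [MOOD PERF PROG TENSE]. "f-structure valued"
PSEM: -> << {dir loc}. "set valued"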
Nominal/Specifier features:
- PERS: -> $ {1 2 3}. First, second, or third person
for pronouns and nouns. In general verbs are not given a PERS feature;
instead, verbs contain information about the PERS of their arguments. For
example, the verb "eats" in English states that its SUBJ PERS is
3, but has no PERS feature of its own. Nominalizations of verbs,
such as English gerunds ("eating cake all day made him sick"), may
have a PERS feature.
- NUM: -> $ {pl sg}. Singular or plural for nouns
and pronouns
- GEND: { -> $ {fem masc neut} | -> << [
FEM MASC NEUT ] }. The syntactic gender of nouns; used for languages like
French and German where nouns inherently belong to a gender class.
Most languages use the atomic values; the FEM MASC NEUT values are
for languages which need each of these to have a + or - value (see
the Norwegian grammar). This notation is unusual in that GEND
can either have the atomic values fem, masc, and neut or can
have the f-structure features FEM, MASC, and NEUT as values. In
general, it is not advisable to have a feature that can either have atomic
or f-structure values; it is assumed that a given language will have one
or the other for the GEND feature.
- GEND-SEM: -> $ {female male nonhuman}. Semantic gender;
used for languages like English where the relevant gender depends
on the gender of the referent. Languages may use both syntactic and
semantic gender
- ANIM: -> $ {+ -}. Animacy of the noun (+ for animate,
- for inanimate); languages which make a distinction between humans
and non-humans should use HUMAN instead
- HUMAN: -> $ {+ -}. Whether a noun is human or not;
languages which make a distinction between animates and inanimates should
use ANIM instead
- CASE: -> $ {acc dat erg gen inst loc nom obl}. Case
of the noun.
- acc: accusative
- dat: dative
- erg: ergative
- gen: genitive
- inst: instrumental
- loc: locative
- nom: nominative
- obl: oblique; the obl value is used for languages which have
a single oblique case (as opposed to, for example, accusative and
dative)
- PRON-TYPE: -> $ {demon expl_ free inh-refl_ int locative
null pers quant poss recip refl rel}. Type of pronouns,
including:
- demon: demonstratives ("I want those.")
- expl_: expletive ("It is raining.") ; the underscore
indicates that the value is instantiated which means that it cannot unify
with another PRON-TYPE expl_; this helps to prevent two copies of an expletive
pronoun occurring as one argument
- free: free relatives ("whoever I see")
- inh-refl_: inherent reflexive ("Il se suicide.")
- int: interrogative ("Who left?")
- locative: locative ("John is there.")
- null: null, including pro-dropped ("_ to leave is imperative.")
- pers: personal ("They left.")
- quant: quantificational ("Many left at noon.")
- poss: possessive ("His mother left.")
- recip: reciprocal ("We saw each other.")
- refl: reflexive ("She saw herself.")
- rel: relative ("the boy who left")
- NTYPE: -> << [ NSEM NSYN ]. Type of noun; pronouns
have both an NTYPE and a PRON-TYPE. NTYPE is divided, somewhat arbitrarily,
into two parts: syntactic (NSYN) and semantic (NSEM).
- NSYN: -> $ { common pronoun proper }. The basic syntactic
type of the noun:
- common ("books" "sugar" "bewilderment")
- proper ("Mary" "Detroit")
- pronoun ("it" "herself")
- NSEM: -> << [ COMMON NUMBER-TYPE PROPER TIME
]. Semantic features of the nouns; these are usually features that are
useful in constraining syntactic constructions, but they may also just
pass information on to applications. There are no "unspecified" values
for these features; for example, if there is a common noun and you
do not know what type it is, just use "NSYN common" without any value
for "NSEM COMMON"
- COMMON: -> $ { count gerund mass measure partitive
}. Subtypes of common (non-proper) nouns:
- count ("a box" "the boxes")
- gerund ("his pushing the box") includes deverbal
nouns with arguments in general
- mass ("sugar")
- measure ("two meters")
- partitive ("all of the boxes")
- PROPER: -> << [ PROPER-TYPE LOCATION-TYPE
NAME-TYPE ]. Proper nouns; these are subdivided because these details
tend to be important for applications
- PROPER-TYPE: -> $ { addr_form location name organization
title }. The specific subtype of a proper noun:
- location ("Paris")
- name: person's name ("Mary" "Smith")
- organization: name of company or organization ("Senate")
- title: title for people ("Mr. Smith")
- addr_form: form of address for people; these
are for address forms that can be used in addition to
the titles ("Herr Dr. Schmitt": Herr=addr_form Dr = title). What
can be a title and an addr_form varies from language to language.
- LOCATION-TYPE: -> $ { city country }. Subtype
of location; more values can be added as needed
- NAME-TYPE: -> $ {first_name last_name }. Subtype
of name
- TIME: -> $ { date day hour minute month season
second week year}. Subtype of time expression; some of these are proper
nouns and some common. This division still needs work since many
time expressions are not covered here; in addition, some phrases only get
the TIME feature in time expressions (e.g. numbers in digital representations
of time) while others get them whenever they occur (e.g. months of the year).
- date ("24/2/2004")
- day ("Tuesday")
- hour ("3:00")
- minute ("3:30")
- month ("January")
- season ("winter")
- second: used for the actual word ("second")
- week: used for the actual word ("week")
- year: used for the actual word ("year")
- SPEC: -> << [ADJUNCT AQUANT
DET NUMBER POSS QUANT SPEC-TYPE]. Specifiers of noun phrases; includes
determiners, possessives, quantifiers and numbers
- DET: -> << [ DEIXIS DET-TYPE PRED ]. Determiners,
including demonstratives ("the box" "this box"
"a box")
- DET-TYPE: -> $ {article def demon indef int rel}.
Type of determiner:
- def: definite ("the box")
- demon: demonstrative ("this box")
- indef: indefinite ("a box")
- int: interrogative ("which box")
- rel: relative ("the girl whose box broke")
- DEIXIS: -> $ { distal proximal post-distal }. For
determiners and demonstratives that encode deixis:
- distal ("that girl")
- proximal ("this girl")
- post-distal: this is used for deixis systems with a three
way distinction where this value is for the furthest away set of deictics
- NUMBER: -> << [NUMBER-TYPE PRED ADJUNCT CLASSIFIER-FORM
MOD]. Numbers modifying nouns ("six boxes")
- NUMBER-TYPE: -> $ {card fract ord percent}. Type
of the number; card and ord are the most important; NUMBER-TYPE can
be used for non-specifier numbers as well ("I bought six.")
- card: cardinal ("six")
- fract: fraction ("1/2")
- ord: ordinal ("6th" "sixth")
- percent ("6%")
- CLASSIFIER-FORM: No forms listed in the common feature
declaration; these are used for languages like Japanese to encode which
classifier is used with the noun-number combination.
- QUANT: -> << [ADJUNCT QUANT-TYPE POL PRED DEGREE
DEG-DIM ]. Quantifiers ("all boys")
- QUANT-TYPE: -> $ {comparative equative existential gen
negative superlative universal}. Type of quantifier;
most grammars do not use this much
- comparative ("more boxes than foxes")
- equative ("as many boxes as foxes")
- existential ("some man")
- gen: generalized quantifier; used when not making
distinctions such as universal and existential
- negative ("no boys")
- superlative ("the most sugar")
- universal ("every man")
- POL: Used for negative/positive polarity quantifiers
("no boxes"); many languages do not use this feature syntactically
- DEGREE and DEG-DIM:
similar to these features for adjectives and adverbs
- AQUANT: -> << [ ADJUNCT PRED QUANT-TYPE DEGREE
DEG-DIM ]. Adjectival quantifers ("many boxes"); used mainly because
some things can have both a quantifier and an adjectival quantifier
in languages like Norwegian and English ("all my many boxes")
- POSS: Encodes the possessor NP ("Mary's box").
This has no declared features/values in the common feature table because
it generally has all of the features that an NP does.
- SPEC-TYPE: This is used in specific constructions to
provide a null specifier for count nouns in languages like English in
noun-noun compounds ("tractor" in "the tractor trailer")
Adjectival/Adverbial features:
- ATYPE: -> $ {attributive predicative}. Basic type
of adjective:
- attributive: used when modifying nouns ("the flimsy box")
- predicative: used as the argument of a copular verb and
in small clauses ("the box is flimsy")
- DEGREE: -> $ {comparative positive
superlative}. Degree of an adjective or adverb:
- comparative ("redder" "more beautiful")
- superlative ("reddest" "(the) most beautiful")
- positive ("red" "as red as the sun")
- DEG-DIM: -> $ {equative neg pos}.
Whether the DEGREE is in a positive or negative dimension; non-equative
positive adjectives ("red") have no DEG-DIM and as such there is
no "non-equative" value
- pos: positive direction ("more beautiful" "most beautiful")
- neg: negative direction ("less beautiful" "least beautiful")
- equative: no direction ("as beautiful as her")
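For example, a comparative adjective would carry these features; the
lexical entry below is a hypothetical sketch using direct equations rather
than templates:
redder A XLE (^ PRED)='red'
             (^ DEGREE)=comparative
             (^ DEG-DIM)=pos.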
Preposition features:
- PSEM: -> $ {ag ben comit compar dir inst loc manner
num part poss purp temp }. Semantic value of the preposition. Most
prepositions have more than one possible meaning and so there may
be many values; to avoid ambiguity, these can be used as a set valued
feature (PSEM: -> << { loc dir }).
- ag: agentive ("it was eaten by them")
- ben: benefactive ("I baked a cake for Mary")
- comit: comitative ("I went with Mary")
- compar: comparative ("this is redder than that")
- dir: direction ("he went to the store")
- inst: instrumental ("he wrote with a pen")
- loc: locative ("it is on the table")
- manner: manner ("how did they do it?")
- part: partitive ("all of them"); these might be non-semantic
under some analyses in which case there is no PSEM
- poss: possessive ("of Mary")
- purp: purpose ("in order to leave")
- temp: temporal ("after dark")
- PTYPE: -> $ {nosem sem}. Whether a preposition is
semantic or non-semantic. This usually correlates with whether
the preposition has a PRED of its own and also a PSEM. However,
some prepositions may have PREDs but no longer be truly semantic prepositions
- sem: semantic ("They are under the roof.")
- nosem: non-semantic ("They rely on it/It is relied on.")
Clausal/Verbal features:
- STMT-TYPE: -> $ { decl header imp int }. Used by main
(root) clauses to indicate the type of the statement; not necessary
for grammars where the CLAUSE-TYPE and STMT-TYPE are identical.
However, header cannot be a CLAUSE-TYPE; so, even if STMT-TYPE
is generally omitted for the other values, it may still be necessary for
headers.
- decl: declarative ("they appear")
- header: header for newspapers, manuals, etc ("Tractor Repair")
- imp: imperative ("push it")
- int: interrogative ("did they appear?")
- CLAUSE-TYPE: -> $ { adv cond decl imp int nom pol-int
rel wh-int }. Used by all clauses, embedded and main, to indicate the
type of clause:
- adv: adverbial
- cond: conditional ("Were they to leave, we would also leave.")
- decl: declarative ("They left.")
- imp: imperative ("Leave now.")
- int: interrogative ("Did you leave?", "Who did
you see?"); this is used when a difference between pol-int and wh-int
is not needed
- nom: nominalized
- pol-int: polar (yes-no) interrogative ("Did you leave?")
- rel: relative ("the vase that he broke")
- wh-int: interrogative with wh-word ("Who did you see?")
- VTYPE: -> $ {aux copular main modal noncopular predicative
raising}. Type of the verb that heads the clause. For languages where
the auxiliaries just provide tense and aspect features, like English,
the VTYPE will be that of the main verb, not the auxiliary:
- aux: auxiliary, only for languages where auxiliaries head
an f-structure of their own, taking the main verb as an XCOMP; see the Norwegian grammar for examples
- copular: copula ("it is red")
- main: standard main verb ("they appear")
- modal: modal verb ("they should appear")
- noncopular: languages which do not distinguish between main
and modal can use this instead of main
- predicative: non-copular verbs taking predicatives, as in German
("nennen" in "Sie nennt ihn einen Idioten." 'She calls him
an idiot.')
- raising: raising verbs in languages where marking a distinction
between main and raising is needed, as in German ("beginnen" in "Es
beginnt zu regnen." 'It starts raining.')
- PASSIVE: -> $ {+ -}. Whether a verb is passive or not.
When possible, the negative value should be present for non-passivized
forms
- TNS-ASP: -> << [MOOD PERF PROG TENSE]. Feature
under which the tense, mood, and aspect information should go. This
is likely to undergo some changes with the addition of new languages.
This is generally just a recording of the syntactic information;
the features can be passed to applications to figure out the actual
tense and aspect values.
- TENSE: -> $ {fut past pres}. Tense of the verb; this
is generally the syntactic tense of the top level auxiliary
- fut: future ("they will arrive")
- past ("they arrived" "they had arrived" "they were arriving")
- pres: present ("they arrive" "they have arrived"
"they are arriving")
- MOOD: -> $ {imperative indicative subjunctive successive}.
Mood of the verb/clause
- imperative ("push it")
- indicative ("they push it")
- subjunctive ("were I to go, I would do it")
- successive
- PERF: -> $ {+ - +_ -_}. Perfective; value may be
instantiated or not; the _ indicates an instantiated feature; instantiated
features cannot unify and this can be helpful in constraining cascades
of auxiliary forms. ("they have left" "they will have left" "they
had left")
- PROG: -> $ {+ - +_ -_}. Progressive verb; value may
be instantiated or not ("they are leaving" "they have been leaving")
Common Templates
There is also a common templates file (common.templates.lfg) which can be used
with the pargram grammars. It includes a set of templates to assign the
features in the common feature declaration. For example, instead of having:
(^ TNS-ASP TENSE)=pres
you can have:
@(TENSE pres)
where the template TENSE is already defined in the common templates. This
can help with grammar maintenance and compliance to feature committee standards.
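The template itself is defined in common.templates.lfg along roughly these
lines (a sketch of the general pattern, not the exact definition):
TENSE(_T) = (^ TNS-ASP TENSE)=_T.
The parameter _T is substituted by the value supplied in the call, so
@(TENSE pres) expands to (^ TNS-ASP TENSE)=pres.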
In addition, it includes templates to assign:
- grammatical function subcategorization frames (for verbs and
other lexical items)
- some generally useful templates (including the old notationtemplates.lfg
file which contains some more general templates for assigning defaults,
creating constraining equations from defining equations, etc.)
There are comments in this file to document what these are and the general
naming schema that was used.
For example, you could define a template for intransitive verbs called
V-SUBJ which in turn could call the common template to construct the predicate.
V-SUBJ might have additional features in it:
V-SUBJ(_P) = "basic intransitive verb template"
   @(SUBJ_core _P) "call to common template to construct the PRED"
   ~(^ PRT-FORM) "no particle allowed".
Starter English Grammar
There is a larger version of demo-eng.lfg called eng-pargram.lfg which demonstrates many of the
mechanisms that you might want to use in your grammar.
File Management
The first thing to notice is that this grammar is split into several
files:
It is advisable to divide your grammar into separate files and so this
grammar is set up to demonstrate how to do that. Any .lfg file that you
want to include must be listed in the CONFIG under FILES. Fsts (finite-state
transducers: tokenizers and morphologies) are called from the morphconfig,
which can itself be a separate file or can be part of the main grammar
file, as is the case here.
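A morphconfig section has roughly the following shape (the section header
and file names here are illustrative; see eng-pargram.lfg for the actual
section):
STANDARD ENGLISH MORPHOLOGY (1.0)
TOKENIZE:
eng-pargram-tok.fst
ANALYZE:
eng-pargram-morph.fst
----
The fsts listed under TOKENIZE apply to the input string first, and their
output is fed to the fsts listed under ANALYZE.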
Integrating FSTs
xle makes it very simple to integrate fst tokenizers and morphologies
into your grammar. Tokenizers insert token boundaries between words and
do things like split punctuation off of words and lowercase initial capitals.
The tokenizer used here is based on that in Beesley
and Karttunen. Morphologies associate surface forms with a stem form
and a set of tags that encodes the relevant morphological information.
Here we will focus on the morphologies. If you load up eng-pargram.lfg
and type:
morphemes baking
xle will return:
{ bake "+Verb" "+Prog" | baking "+Token" }
What this means is that the surface form "baking" is associated with
a stem "bake" and the tags +Verb and +Prog. It is
alternatively associated with the form "baking" and a built-in tag
called +Token.
If you look in the lexicon in eng-pargram-lex.lfg,
there is only an entry for "bake"; the various possible surface
forms ("bake", "bakes", "baked", "baking") are
not listed. This is not necessary because once the string (sentence) is
passed through the tokenizer and the morphology, the output string only
contains the stem form plus the tags. Each of these stems and tags is listed
in the lexicon and can be looked up by:
bake V XLE { @(V-SUBJ %stem)
|@(V-SUBJ-OBJ %stem)
|@(V-SUBJ-OBJ-OBJTH %stem)}.
+Verb V_SFX XLE @(VTYPE main).
+Prog V_SFX XLE (^ TNS-ASP PROG)=c +.
where V_SFX is an arbitrarily chosen c-structure category. The grammar
in eng-pargram.lfg then assembles these into a V category via the sublexical
rule:
V --> V_BASE "stem form"
V_SFX_BASE+ "as many tags as the morphology provides".
These rules are quite easy to write except that you need to remember to
add "_BASE" to whatever category is listed in the lexicon. For
example, V_SFX in the lexicon corresponds to V_SFX_BASE in the sublexical
rule. The reasoning behind this has to do with how xle handles the display
of these categories, since the sublexical structure is not shown by default
(hence you must use "show morphemes" to view it).
If in xle you do:
set-OT-rank Fragment NOGOOD
and then parse
parse {V: baking}
you will see the resulting tree if you choose the menu item "show morphemes".
The morphology in eng-pargram-morph.fst also provides analyses for nouns
and prepositions. This can be seen by typing in xle:
morphemes girls
morphemes with
which have the output:
{girls "+Token"|girl "+Noun" "+Pl"}
with {"+Prep"|"+Token"}
However, there are no specific lexical entries for "girl" or
"with" or any other noun, pronoun, or preposition in eng-pargram-lex.lfg. Even so, xle is able
to provide an analysis for these words by falling back on the entry for
"-unknown" which is a special form that matches any stem in the
morphology that does not have an overt lexical entry. ("unknown"
means unknown to the xle lexicon but known to the morphology. xle
first looks for an overt lexical entry; if it cannot find one, it tries
to match against "-unknown".)
-unknown N XLE @(NOUN %stem);
P XLE
@(PREP %stem).
In this grammar, we have allowed "-unknown" to match nouns (N)
and prepositions (P). So, whenever a noun or preposition is parsed and passed
through the morphological analyzer, xle will build a lexical entry for it
based on -unknown. The information provided by the morphological
tags will constrain whether it is treated as a noun or a preposition. In xle
type:
parse {N: girls}
to see how this works; the morphological tags can be made visible by
choosing the "show morphemes" option.
There is a corresponding -token lexical entry which can be used
to match tokens, i.e., items which do not get a morphological analysis. In
eng-pargram.lfg and eng-pargram-lex.lfg, this is used to match
words in the FRAGMENTS grammar (see below) which are not part of a constituent.
Note that the morphcode for -token is always * and not XLE:
-token TOKEN * (^ TOKEN)=%stem.
So, to connect a FST morphology to your grammar, you need to:
- list the morphology files in the morphconfig
- write lexical entries for the morphological tags, -unknown,
and any stems with unpredictable subcat frames (e.g. verbs)
- write sublexical rules to connect the stems and tags
Once this is done, you can write your grammar as you would have normally.
However, you will have much less lexicon work to do since the morphology
and -unknown entry conspire to provide lexical entries for most words.
Of course, if you do not have an fst morphology available, the lexical
work will be done in building the fst morphology.
You might ask where the fst morphologies come from. For many languages,
such morphologies already exist. However, you can write your own using
the xfst tools provided with the Beesley
and Karttunen book. The input fst script for eng-pargram-morph.fst
is in eng-pargram-morph.infile. This
is an extremely unsophisticated script and would not be very efficient for
entering large numbers of lexical items. You can use the lexc tools described
in the book to create a much more sophisticated morphology in a more succinct
and more linguistically satisfying format. Note that the fst files
are binary files and hence cannot be looked at in emacs; the infiles which
produce these are readable though.
Robustness: FRAGMENTS
When doing initial grammar development, you want ungrammatical sentences
to get no parses. However, for later applications, it is often useful to
get some type of output for any input. One way to do this is to write a
fragment grammar. A fragment grammar builds up well-formed chunks, such
as NPs, and then puts all of the chunks together in a FIRST-REST structure.
To see these, in xle type:
parse {girls sleep bananas.}
The result is a fragment parse (if you get 0 parses, restart the grammar;
the set-OT-rank Fragment NOGOOD command used above removed the fragments
so that you could see ungrammatical structures). Each piece of the f-structure
is well-formed, but the top level f-structure has no PRED.
In the CONFIG section in eng-pargram.lfg,
in addition to defining a ROOTCAT, we have also defined a REPARSECAT.
XLE will first try to build a well-formed structure using the ROOTCAT (here
S). If it fails, then it will build a structure using the REPARSECAT (here
FRAGMENTS).
The rule for this category is in the RULES section. It consists of
two main parts.
The first is a disjunction of all the categories we want to build chunks
out of (NP, PP, VP, S, and TOKEN). TOKEN is a special category that is
used when one of the other chunks cannot be built. The lexical entry for
this is in eng-pargram-lex.lfg under "-token"
which matches anything that gets a +Token tag in the morphology (it
is similar to -unknown which matches anything that goes through the morphology
and hence gets any tag other than +Token). All this lexical entry
does is provide a feature that records the lexical item.
Each of these is associated with an OT mark "Fragment" which is a dispreference
mark (any mark without a prefixed + is a dispreference mark). The reason
for this is to make sure that the fragment rule uses the fewest chunks
possible. That is, if there is one analysis with an NP chunk and a VP
chunk, and another analysis with an S chunk, the analysis with the S chunk
will be chosen because it has fewer instances of the Fragment OT mark.
The second part of the FRAGMENTS rule is a recursive call to the rule
to build up any additional chunks. Most sentences that go through FRAGMENTS
will consist of more than one chunk.
In sum, what you need to build a fragment grammar of your own is:
- a specification of REPARSECAT in the CONFIG
- an entry for -token in the lexicon
- a fragment rule
- an OT mark
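Put together, a minimal fragment rule has roughly this shape (a simplified
sketch modeled on the rule in eng-pargram.lfg, not the exact rule):
FRAGMENTS --> { NP: @(OT-MARK Fragment) (^ FIRST)=!;
              | VP: @(OT-MARK Fragment) (^ FIRST)=!;
              | TOKEN: @(OT-MARK Fragment) (^ FIRST)=!}
              (FRAGMENTS: (^ REST)=!).
Each chunk is dispreferred via the Fragment OT mark, and the optional
recursive call to FRAGMENTS chains any remaining chunks together under
REST.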
One final note on the fragments: it is extremely hard to debug a grammar
with the fragments on. To turn them off, in xle type:
set-OT-rank Fragment NOGOOD
You can put this command in your xlerc file
if you always want the fragments off.
Coordination and METARULEMACRO
There is a special macro that you can define in the RULES section of
the grammar called METARULEMACRO. If this macro is present, it is applied
to all of the rules in the grammar, including the sublexical rules. The
reason to use this is so that when you add in new rules, you don't have
to remember to add in calls to other rules or macros that should apply to
them, such as coordination. Instead, XLE will do this automatically for
you.
There are two main ways in which this macro is used in most pargram
grammars. This first is for coordination: by using METARULEMACRO, each
rule does not have to contain a disjunct that calls the coordination rules.
This is discussed in detail below. The second is to allow certain types
of punctuation or markup to apply to any constituent.
Look at the METARULEMACRO definition in eng-pargram.lfg.
This macro has three variables. The first is the category name, such as
NP. The second is the base category for complex categories; since there
are no complex categories in this grammar, _CAT will be the same as _BASECAT.
The last is the righthand side, i.e., the expansion, of the rule.
The first disjunct in METARULEMACRO should always be _RHS. Otherwise,
the simple, unmarked-up expansion of your rules will not occur. This is
almost never the desired effect.
The second and third disjuncts allow coordination to apply.
The final disjunct allows any category to appear surrounded by a left
bracket and a right bracket. This can be very useful for determining
if a particular parse is available and for cutting down on ambiguity. First
parse:
parse {the boys devour the bananas in the cake.}
This sentence has two parses. Next parse:
parse {the boys devour [the bananas in the cake].}
This sentence has only one parse because "in the cake" is forced
to be a constituent of the object NP.
Note the call to @PUSHUP in the bracketing disjunct. This template
is defined in common.templates.lfg.
It is used to make sure that the brackets occur around the highest constituent
that they can, instead of occurring at all levels. This might happen when
bracketing something like "cats" which is a constituent at both the N and
the NP levels. (To see this, comment out the ": @PUSHUP;" and see what happens
if you parse something like "[cats] sleep.")
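Putting the pieces just described together, a METARULEMACRO along the lines of the one in eng-pargram.lfg looks roughly as follows. This is a sketch consistent with the description above, not a verbatim copy; in particular the bracket categories LB and RB are illustrative names, so check the actual definition for the category names your tokenizer produces:
METARULEMACRO(_CAT _BASECAT _RHS) = "applies to every rule in the grammar"
   { _RHS "the rule's own expansion"
   | e: _CAT $c { NP N }; "nominal coordination"
     @(NPCOORD _CAT)
   | e: _CAT ~$ { NP N }; "non-nominal coordination"
     @(SCCOORD _CAT)
   | LB: @(OT-MARK GenBadPunct); "bracketed constituent"
     _CAT: @PUSHUP;
     RB }.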
Let's now look at the coordination rules in more detail.
There are two rules for coordination. SCCOORD is used for everything
but nominal coordination. It is a simple rule that just takes the f-structures
of the two constituent categories and puts them in a set with the conjunction
between them. The conjunction will provide a COORD-FORM to the set, as specified
in the lexical entry for "and". In order for this feature to appear as
a value of the set and not in the f-structures of the conjuncts, you have
to define it as non-distributive in the CONFIG:
NONDISTRIBUTIVES
NUM PERS COORD-FORM.
To restrict this rule to only apply to non-nominals, the annotation:
e: _CAT ~$ { NP N };
occurs, which states that _CAT cannot be either NP or N.
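A minimal SCCOORD along these lines (a sketch of the pattern, not necessarily the exact definition in eng-pargram.lfg) simply makes each conjunct a member of the coordination set, with the conjunction in between:
SCCOORD(_CAT) = "coordination of non-nominals"
   _CAT: ! $ ^; "first conjunct is a member of the set"
   CONJ "the conjunction supplies COORD-FORM"
   _CAT: ! $ ^. "second conjunct is a member of the set"
with a lexical entry for the conjunction such as:
and  CONJ * (^ COORD-FORM)=and.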
NPCOORD is used for coordinating nominals because the person and number
values of a coordinated nominal are not necessarily the same as those of
its conjuncts ("the cat and dog jump."). So, NPCOORD provides the
correct person and number features, with the NUM feature coming from the
lexical entry for "and" and the PERS feature coming from the template NP-CONJUNCT.
Since these are features of the set itself, NUM and PERS must also be
listed as non-distributive features in the CONFIG. When a verb checks
a coordinated subject for number and person, it will see the values of
these features on the set. To see this, parse:
parse {the boy and the girl bake the cake.}
Even though "boy" and "girl" are both singular, the coordinated
NP is plural, and so the plural verb "bake" can occur with it. To restrict
NPCOORD to only apply to nominals, the annotation:
e: _CAT $c { NP N };
appears in the call to NPCOORD in METARULEMACRO.
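Again schematically (a sketch; the real NP-CONJUNCT in common.templates.lfg resolves PERS across all of the conjuncts, and the category name CONJnp is an assumption here), NPCOORD and its supporting pieces might look like:
NPCOORD(_CAT) = "nominal coordination"
   _CAT: ! $ ^ @NP-CONJUNCT;
   CONJnp "its lexical entry supplies NUM pl to the set"
   _CAT: ! $ ^ @NP-CONJUNCT.

NP-CONJUNCT = "PERS resolution for coordinated nominals"
   { (! PERS)=c 1 "if a conjunct is 1st person,"
     (^ PERS)=1 "the set is 1st person"
   | ... "similar disjuncts for the other persons" }.
Here "and" would have a second lexical entry of category CONJnp contributing (^ NUM)=pl to the coordination set.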
In sum, METARULEMACRO applies to every rule in the grammar. The only
difficult part in using it is to remember to include a disjunct that just
says _RHS to make sure that the rules apply as you intended.
Lexical Rules
Theoretical LFG often uses lexical rules to manipulate predicates in
things like passives. You can use lexical rules in xle. It is possible
to delete arguments of the predicate and to rename them. However, under
the current implementation it is not possible to add arguments.
The lexical rules are defined in the TEMPLATES section. An example in this
grammar is PASS. (The COM comments are used with the emacs lfg-mode tools.)
PASS(_SCHEMATA) = "passive lexical rule"
   "COM{EX TEMPLATES S: the girl devours a banana.}"
   "COM{EX TEMPLATES S: a banana is devoured.}"
   { "active version" _SCHEMATA (^ PASSIVE)=-
   | "passive version" _SCHEMATA
     (^ PASSIVE)=c +
     { (^ SUBJ) --> NULL "wipe out the subject"
     | (^ SUBJ) --> (^ OBL) "make into an oblique 'by' phrase"
       @(OT-MARK OblAg)} "COM{EX TEMPLATES S: a banana is devoured by the girls.}"
     (^ OBJ) --> (^ SUBJ) "make the object the subject"}.
The PASS template takes a predicate such as:
(^ PRED)='bake<(^ SUBJ)(^ OBJ)>'
and rewrites the SUBJ as NULL which effectively deletes it or rewrites
it as an oblique. There is an OT mark in the disjunct that creates the
OBL; given the OT ranking in the CONFIG, this will result in "by"
phrases in passives being prefered over adjunct readings (OT marks are discussed
more below). The lexical rule then rewrites the object as the subject.
Note that the first disjunct of the PASS template does nothing to the
predicate and occurs in an active environment. The second disjunct is
the one that performs the passive lexical rule and is constrained to occur
in passive environments.
The PASS template is called by two other templates: V-SUBJ-OBJ and
V-SUBJ-OBJ-OBJTH. Thus, both transitive and ditransitive verbs can be
passivized in this grammar.
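For concreteness, the wiring is roughly as follows (a sketch of the usual pargram pattern; check eng-pargram.lfg for the actual definitions): the verb's lexical entry calls the subcategorization template, which builds the PRED and passes all of its schemata through PASS:
V-SUBJ-OBJ(_P) = "transitive verb"
   @(PASS (^ PRED)='_P<(^ SUBJ)(^ OBJ)>').

devour  V * @(V-SUBJ-OBJ devour).
As a result, "devour" parses in the active and, via the lexical rule, in the passive, without the passive being stated separately in each verb entry.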
Defaults
It is sometimes useful to provide a default value for a feature. This
can be done with the template DEFAULT:
DEFAULT(_FEAT _VAL) = "provides a default value for a feature"
   { _FEAT "feature exists but with a different value"
     _FEAT ~= _VAL
   | _FEAT = _VAL "assign the default value"
     "it will unify if it already exists"}.
This template either requires that the feature have a value other than
the default one or assigns the default value. Note that it is important
to have the equation in the first disjunct stating that the feature's value
is not the default; otherwise you can end up with vacuous ambiguity, that
is, multiple parses with no difference in the resulting c-structure or f-structure.
The issue is then where to call the DEFAULT template. In eng-pargram.lfg, default present ("pres") tense
is assigned in the S rule. Default third ("3") person is assigned to nouns
in the NOUN template.
Epsilon
In the CONFIG section, you can define a category for epsilon. This
will allow you to hang equations in rules where there is not a convenient
constituent on which to do so.
An example of this is seen in the S rule. Epsilon is "e" in this grammar
(and standardly in the pargram grammars):
S --> "COM{EX RULE S: the girl pushes the boys.}"
   e: @(DEFAULT (^ TNS-ASP TENSE) pres)
      "provide pres as a default value to TENSE"
      @(DEFAULT (^ STMT-TYPE) decl)
      "provide decl as default value to STMT-TYPE";
It would have been possible to put these equations on both the VP and
the VPaux categories, but by putting them on the "e", they only have to
be mentioned once.
Another use for "e" would be in a language in which the copula is sometimes
present (e.g., in past tenses) and sometimes not (e.g., in the present
tense). The VP copular rule might look like:
VP --> { Vcop "overt copula in past tense"
|e: "non-overt copula in the present tense"
(^ PRED)='null-be<(^SUBJ)(^PREDLINK)>'
(^ TENSE)=present }
{ NP: (^ PREDLINK)=!
|AP: (^ PREDLINK)=!}.
Note that it is important to avoid using down (!) in the annotations
on the "e". You can do this, but the behaviour is not likely to be what
you want.
OT Marks
As grammars get bigger, the ambiguity rate becomes very high. One way
you can control this is by using OT (optimality theory) marks. These are
marks that you put in the grammar rules, templates, and lexical entries.
The marks are then ranked in the CONFIG. The orders in eng-pargram.lfg are:
OPTIMALITYORDER NOGOOD *Fragment "disprefer fragments and mark with *"
   +OblAg. "prefer 'by' obliques in passives"
GENOPTIMALITYORDER GenBadPunct NOGOOD "do not generate these"
   +GenGoodPunct. "prefer these"
There are two orders: one for parsing and one for generation. (In xle,
the same grammar is used for parsing and generation, with the only differences
being the tokenizer and the OT order; it is also possible to use slightly
different morphologies for parsing and generation.)
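The marks themselves are introduced in rules, templates, and lexical entries via the OT-MARK template, which places the mark in the optimality projection o:: (this is the standard definition found in common.templates.lfg; confirm it there):
OT-MARK(_mark) = "add an OT mark to the o:: projection"
   _mark $ o::*.
So @(OT-MARK OblAg) in the PASS template simply adds OblAg to the set of marks that the rankings above evaluate.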
If there are two f-structures for a sentence and one has a dispreference
mark and the other has no mark, then the f-structure with no mark is
chosen. This was seen in the case of the fragment grammar where each chunk
introduced a Fragment OT mark. So, if there is a choice between an f-structure
with one mark (one chunk) and one with two marks (two chunks), the one
with one mark is chosen. This results in a fewest-chunks approach to
fragmenting.
If there are two f-structures for a sentence and one has a preference
mark, which is indicated by a preceding + in the OT order, and the other has
no mark, then the f-structure with the preference mark is chosen. This
is seen in the passive lexical rule PASS in the templates. The OBL reading
introduces an OT mark OblAg which is ranked +OblAg. If you parse:
parse {bananas are devoured by boys.}
there will be 1+1 solutions. The optimal solution is the one shown
and it has the OBL-AG reading. The "+1" in the "1+1" is the suboptimal
solution and corresponds to an ADJUNCT reading of the "by" phrase.
You can see the unoptimal solutions by choosing the "unoptimal" command
in the f-structure window.
The generation OT marks work the same way. In this grammar there are
two generation OT marks. The preference mark GenGoodPunct (Generate Good
Punctuation) requires a period to be generated at the end of sentences.
This grammar can parse both of the following:
parse {boys sleep}
parse {boys sleep.}
However, it will only generate:
boys sleep.
To see this, in the f-structure window choose the command "generate
from this f-structure".
The dispreference mark GenBadPunct is a NOGOOD mark and hence occurs
to the left of NOGOOD in the ranking (NOGOOD does not affect things that
occur to its right). This means that any rule part in the grammar with which
it is associated has been removed from the grammar. Here, the mark appears
in METARULEMACRO in the bracketing markup. This means that bracketing can
be parsed but not generated. So, if you parse:
parse {[the girls] sleep.}
the result when generating will be:
the girls sleep.
The same mark also appears on the comma in the coordination rule.
So, OT marks give you as a grammar writer control over some of the ambiguity
in the grammar. There are many additional types of OT marks that are described
in detail in the xle documentation, but what is described here will give
you enough to start with.
Useful XLE tricks
There are a number of extra-grammatical facilities available in XLE
that will make grammar writing much easier.
XLE documentation
To access the xle documentation, in xle type:
documentation
and a web browser will be launched with the documentation; this documentation
is also found in xle/doc/xle-toc.html. You can also type:
help
which will list all of the commands that you can use in xle.
xlerc file
Every time you make a change to your grammar, you have to restart xle
and reload the grammar. To make this easier, create a file called:
xlerc (important: the file has no extension!)
in the directory that you are going to work in. In it put the line:
create-parser mygrammar.lfg
where "mygrammar.lfg" is the name of your top level grammar file.
Whenever you (re)start xle in that directory, it will automatically
create the parser for you.
You can put any commands normally used in xle in the xlerc file and
they will automatically be invoked. You can also define procedures and
create aliases for commands; these are defined according to tcl and the
easiest way to learn about them may be to look at previously defined ones
and modify them. For example, you can redefine the "analyze-string" command
as "as" via:
proc as {P} {
analyze-string $P
}
Emacs library
XLE comes with a special emacs library lfg-mode.el. You should load
this library when using emacs to edit grammar files and run xle. It will
format rules, lexical entries, and templates for you. It also has commands
to launch and restart xle and to automatically parse sentences in testfiles.
To get emacs to automatically load this library whenever you are editing
a file ending in .lfg, add the following lines to your .emacs file (if you
do not have a .emacs file, you can create one in your home directory):
; to load the LFG-mode for XLE
(load-library "/usr/local/xle/emacs/lfg-mode")
Note that the path may be different depending on where xle is installed
on your machine.
If you have never used emacs before, you can access an emacs tutorial
by typing:
C-h t
when you are in emacs; where C-h means hold down the control key while
typing an "h" and then type a "t" without holding down the control key.
There are a number of keyboard short cuts that can be used when you
have lfg-mode loaded.
- ESC q will format a rule, lexical entry, or template if
the cursor is in that rule, lexical entry, or template; this is a good
way to see if you made a mistake entering the rule, although it will
not catch all errors; in particular, if the alignment of disjuncts is not
correct, there is probably an error
- C-c C-f will launch an xle process if you are in a .lfg
file; if you are in the xle shell, it will restart xle
- ESC C-x will parse a sentence if you have the cursor on
that sentence in the testfile
For more details read the xle documentation on emacs support for xle.
Testfiles and comment examples
Grammar development, even at the early stages, involves having to reparse
things many times to figure out if they are working yet and, after making
changes, still working. To facilitate this, you can put your example sentences
in a testfile. It is best to name your testfile with a ".lfg" suffix since
then you can use the emacs library to automatically parse whichever sentence
you are interested in. The test file should look like:
# Comment lines begin with hash marks

ROOT: This is a sentence.

NP: a noun phrase

PP: with a noun phrase

NP: an ungrammatical noun phrases (0! 0 0 0)
where each new sentence has a blank line on either side of it. It is
useful to put in the parse category (e.g., ROOT, NP, PP) in case you change
the default parse (root) category in your grammar. You can indicate sentences
which are supposed to get no parses by putting (0! 0 0 0) after them.
If these do get a parse, xle will complain. You can also mark if
a sentence is supposed to have a particular number of parses:
ROOT: I see the girl with the telescope (2! 0 0 0)
You can run the entire testsuite at once by doing:
parse-testfile my-testsuite.lfg
where "my-testsuite.lfg" is the name of the testsuite. Note that you
can use path names if you don't want to store the testsuites with the grammar
files:
parse-testfile testfiles/questions/my-testsuite.lfg
It is possible to automatically create testsuites from comments in the
grammar if the comments are of the form:
"COM{EX section example}"
"section" indicates what section it comes from (RULES, TEMPLATES, LEXICON).
"example" is the example itself ("NP: a monkey"). In lfg-mode, there is
an option to extract the comments under the LFG window bar. Doing this
will create an emacs buffer of all of the examples as a testsuite file;
this buffer can then be saved as a testsuite file. Some examples of this:
NP --> "rule for common noun phrases"
   "COM{EX RULES NP: boxes}"
   (D: (^ SPEC)=!) "COM{EX RULES NP: the box}"
      "COM{EX RULES NP: a box}"
      "COM{EX RULES NP: a boxes (0! 0 0 0)}"
   N "head noun"
   "COM{EX RULES ROOT: Foxes push the boxes.}".
There are a number of comments of this type in eng-pargram.lfg. You can see what the resulting
testsuite files look like by running the extract comments command on this
file in emacs.
It is highly recommended to do this because it makes it easier for someone
else to read the grammar and makes it easy to figure out which parts of
the grammar are working.
Interpreting Error Messages
It takes some time to get used to the xle error messages, just as with
any new system. By doing the walkthrough and playing with the starter
grammar provided here, you should get some practice with the types of errors
you are likely to run into when doing grammar writing.
Background Reading
This is a list of papers, divided by topic, that might be of direct use
to you when writing grammars. Many of them are available electronically.
ParGram project as a whole:
- Miriam Butt, Tracy Holloway King, María-Eugenia Niño, and Frédérique
Segond. 1999. A Grammar Writer's Cookbook. Stanford: CSLI Publications.
- Miriam Butt, Helge Dyvik, Tracy Holloway King, Hiroshi Masuichi,
and Christian Rohrer. 2002. The Parallel Grammar Project. Proceedings
of COLING-2002, Workshop on Grammar Engineering and Evaluation, pp. 1-7.
Grammar engineering:
- Miriam Butt and Tracy Holloway King. 2003. Grammar Writing, Testing,
and Evaluation. In Ali Farghaly (ed.), Handbook for Language Engineers.
CSLI Publications, pp. 129-179.
Features and templates:
OT marks:
- Tracy Holloway King, Stefanie Dipper, Anette Frank, Jonas Kuhn,
and John Maxwell. 2000. Ambiguity Management in Grammar Writing.
Linguistic Theory and Grammar Implementation Workshop at the European
Summer School in Logic, Language, and Information (ESSLLI-2000).
FST morphology integration:
Grammar porting and adaptation:
- Ron Kaplan, Tracy Holloway King, and John Maxwell. 2002. Adapting
Existing Grammars: The XLE Experience. Proceedings of COLING-2002,
Workshop on Grammar Engineering and Evaluation, pp. 29-35.
- Roger Kim, Mary Dalrymple, Ron Kaplan, Tracy Holloway King, Hiroshi
Masuichi, and Tomoko Ohkuma. 2003. Multilingual Grammar Development
via Grammar Porting. ESSLLI 2003 Workshop on Ideas and Strategies for
Multilingual Grammar Development.
2004 09 22
Tracy.King@microsoft.com