Starting a ParGram Grammar
Tracy Holloway King
- Walkthrough
- Common features,
grammatical functions, and templates
- Sample grammar
- List of useful xle tricks
- Background reading
This document is intended for people who are using XLE to write LFG
grammars. Almost all of the information here is in the xle documentation.
However, it is arranged so that things that are of immediate use to beginning
grammar writers, or that differ from theoretical LFG, are given
prominence.
Walkthrough
Do the walkthrough provided with xle before starting on anything.
It is also useful to skim over the xle documentation to get some idea
of what all is there. However, the documentation is now very extensive
and so it is hard to absorb until after you have worked a bit with the system.
Pargram features and grammatical
functions
This section is intended for people who are working on pargram grammars
that are supposed to conform to the existing pargram feature committee
standards. Since these standards are not well documented, this section
provides a starting place.
The other thing to do is to take the large English grammar, which
is available to anyone with a pargram license, and parse constructions
that are similar to the ones you are interested in. If the analyses seem
feasible for your language, then go ahead and use them. Note that reading
the grammar itself will be difficult at first because it is so large.
However, just looking at the f-structure output may be useful.
There is a naming convention for features. Features are in all uppercase
letters while (atomic) values are in all lower case letters. For example,
the feature NUM can have the value pl. Features whose
values reflect surface forms in a language are named X-FORM where
X can be any number of upper case letters. For example, PFORM
is used for the form of prepositions that do not have a PRED (otherwise the
PRED encodes the form information redundantly). If there is more than
one letter before the FORM, then a hyphen is inserted for legibility:
for example, PRON-FORM for the surface form of pronominals.
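For example, in a feature declaration these conventions yield entries such
as the following (the values shown for PFORM and PRON-FORM are illustrative,
not the actual common declarations):
NUM: -> $ {pl sg}.
PFORM: -> $ {by of with}.
PRON-FORM: -> $ {he it they}.
Note that the values of the X-FORM features are simply the surface words
themselves, in lower case.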
Grammatical Functions
There are some standard naming conventions for grammatical functions
across the pargram grammars. Other grammatical functions may need to be
added for new languages.
- SUBJ: subjects ("Mary left")
- OBJ: direct objects ("push the box")
- OBJ-TH: thematically restricted objects for languages
which allow two objects to occur at once ("we gave him a book");
these are the same as secondary objects (OBJ2), but in pargram the more
generic and more LFG-compliant OBJ-TH is used instead.
- OBL: oblique argument; these are usually prepositional
phrases that are subcategorized for by a verb (it can be difficult
to tell whether something is subcategorized or just an ADJUNCT); adjectives may also take OBL complements.
("we talked about them" "proud of him")
- OBL-AG: oblique agent in passives ("it was eaten by
them")
- OBL-COMPAR: the comparison phrase in comparatives and
equatives ("prettier than them" "as pretty as they
are")
- COMP: closed complement clause ("I know that they left"
"I wondered whether they had left")
- XCOMP: open complement; these may
be verbal or small clauses; since they are open, their subject is
provided from outside the predicate ("they want to leave"
"they consider him an idiot")
- XCOMP-PRED: open complement in predicative position;
the -PRED is solely for implementational reasons since the rules
that apply to predicatives/copular clauses are often very different
from those that apply to other XCOMPs. ("he is a teacher"
"he is happy")
- PREDLINK: a closed complement in predicative position;
this is used for languages or constructions within languages
where a closed complement is more appropriate than XCOMP(-PRED).
("he is a teacher" "he is happy") See the Grammar
Writers' Cookbook and Dalrymple, Dyvik, and King's paper
in the LFG 2004 proceedings for more on PREDLINK and XCOMP for
predicatives.
The following grammatical functions are non-subcategorized and are
set valued. Note that sets can have scoped elements which can be very useful
for noun-noun compounds and for coordination.
- ADJUNCT: adjuncts of various types;
this should be used as the default grammatical function for non-subcategorized
arguments; the canonical example of adjuncts are various adverbials
("they ran quickly" "the very red box"); however,
other modifiers can be adjuncts ("when I left, he left"
"having left, he closed the door")
- MOD: the modifying noun in a noun-noun compound
("tractor" in "the tractor trailer") when there
are multiple modifying nouns ("oil filter signal"), it is best
to have these scoped. This can be done by using an equation such
as: ! $<h>s (^ MOD). The $ creates a set; this allows multiple
elements to appear as the MOD, e.g., both "oil" and "filter"
will be in the MOD set modifying "signal". The <h>s
after the $ is used to mark the scope; the "h" guarantees that these
have heads which are in a precedence relation in the c-structure; the
"s" marks the scope. In our example, the f-structure for "filter"
will be marked as scoping over that of "oil". This is extensively
documented in the xle documentation in the section on scope relations in the
functional description language. This can be invoked by the MOD template
in common.templates.lfg
- NAME-MOD: the modifying names in a proper name ("Mary"
and "Jane" in "Mary Jane Smith")
- APP: appositions ("Mr. Smith, the president,")
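As an example of how the scoped MOD equation from above is used, a
noun-noun compound rule might be sketched as follows (the rule itself is
hypothetical; see the English grammar for the real one):
NP --> (D: ^=!)
       N*: ! $<h>s (^ MOD); "modifying nouns, scoped"
       N: ^=!. "head noun"
For "the oil filter signal", both "oil" and "filter" end up in the MOD
set of "signal", with the f-structure of "filter" marked as scoping over
that of "oil".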
Common feature table
There is a common feature table common.features.lfg
defined for the pargram grammars. Each grammar should use these features
when possible. As detailed in the feature
table, each language can:
- add features not in the common feature list
- add feature values present in the language but not in the common
feature list
- delete features not present in the language
- delete feature values not present in the language
The common feature table is included with this. Note that it is periodically
updated and the new version is sent around to all the grammar writers (it
should also be on the pargram common workspace at http://ling.uib.no/bscw/).
The features are discussed here in more detail.
CHECK: The CHECK feature is one that each grammar
can use for grammar-internal features that largely serve as well-formedness
checks. These CHECK features are generally assumed to be ignored by applications.
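For instance, a grammar might require that a perfect participle occur only
under the appropriate auxiliary by having the auxiliary define a CHECK
feature which the participle's entry then constrains (the feature name
_PART-FORM is made up for this example):
(^ CHECK _PART-FORM)=c perf
Since applications ignore CHECK, such features can be added freely for
grammar-internal bookkeeping.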
The notations are read as follows:
- The name of the feature, followed by a colon and an ->
- One of the following:
- << [ ] : feature whose values are f-structures
- << { } : set-valued feature (none shown here, but there is
an example for PSEM in the big English grammar)
- $ { } : atomic-valued feature
- The values that the feature can have are listed in the brackets
The values listed are the ones permitted for that feature; they
are not required. Currently there is no way in the feature table to
require a feature to have a particular set of values. For example,
there is no way to state that TNS-ASP must contain TENSE.
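Putting the notation together, declarations of each kind look like the
following (values abbreviated from the full declarations discussed below):
CASE: -> $ {acc nom}. "atomic valued"
TNS-ASP: -> << [MOOD PERF PROG TENSE]. "f-structure valued"
PSEM: -> << {dir loc}. "set valued"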
Nominal/Specifier features:
- PERS: -> $ {1 2 3}. First, second, or third person
for pronouns and nouns. In general verbs are not given a PERS feature;
instead, verbs contain information about the PERS of their arguments. For
example, the verb "eats" in English states that its SUBJ PERS is
3, but has no PERS feature of its own. Nominalizations of verbs,
such as English gerunds ("eating cake all day made him sick"), may
have a PERS feature.
- NUM: -> $ {pl sg}. Singular or plural for nouns
and pronouns
- GEND: { -> $ {fem masc neut} | -> << [
FEM MASC NEUT ] }. The syntactic gender of nouns; used for languages like
French and German where nouns inherently belong to a gender class.
Most languages use the atomic values; the FEM MASC NEUT values are
for languages which need each of these to have a + or - value (see
the Norwegian grammar). This notation is unusual in that GEND
can either have the atomic values fem, masc, and neut or can
have the f-structure features FEM, MASC, and NEUT as values. In
general, it is not advisable to have a feature that can either have atomic
or f-structure values; it is assumed that a given language will have one
or the other for the GEND feature.
- GEND-SEM: -> $ {female male nonhuman}. Semantic gender;
used for languages like English where the relevant gender depends
on the gender of the referent. Languages may use both syntactic and
semantic gender
- ANIM: -> $ {+ -}. Animacy of the noun (+ for animate,
- for inanimate); languages which make a distinction between humans
and non-humans should use HUMAN instead
- HUMAN: -> $ {+ -}. Whether a noun is human or not;
languages which make a distinction between animates and inanimates should
use ANIM instead
- CASE: -> $ {acc dat erg gen inst loc nom obl}. Case
of the noun.
- acc: accusative
- dat: dative
- erg: ergative
- gen: genitive
- inst: instrumental
- loc: locative
- nom: nominative
- obl: oblique; the obl value is used for languages which have
a single oblique case (as opposed to, for example, accusative and
dative)
- PRON-TYPE: -> $ {demon expl_ free inh-refl_ int locative
null pers quant poss recip refl rel}. Type of pronouns,
including:
- demon: demonstratives ("I want those.")
- expl_: expletive ("It is raining.") ; the underscore
indicates that the value is instantiated which means that it cannot unify
with another PRON-TYPE expl_; this helps to prevent two copies of an expletive
pronoun occurring as one argument
- free: free relatives ("whoever I see")
- inh-refl_: inherent reflexive ("Il se suicide.")
- int: interrogative ("Who left?")
- locative: locative ("John is there.")
- null: null, including pro-dropped ("_ to leave is imperative.")
- pers: personal ("They left.")
- quant: quantificational ("Many left at noon.")
- poss: possessive ("His mother left.")
- recip: reciprocal ("We saw each other.")
- refl: reflexive ("She saw herself.")
- rel: relative ("the boy who left")
- NTYPE: -> << [ NSEM NSYN ]. Type of noun; pronouns
have both an NTYPE and a PRON-TYPE. NTYPE is divided, somewhat arbitrarily,
into two parts: syntactic (NSYN) and semantic (NSEM).
- NSYN: -> $ { common pronoun proper }. The basic syntactic
type of the noun:
- common ("books" "sugar" "bewilderment")
- proper ("Mary" "Detroit")
- pronoun ("it" "herself")
- NSEM: -> << [ COMMON NUMBER-TYPE PROPER TIME
]. Semantic features of the nouns; these are usually features that are
useful in constraining syntactic constructions, but they may also just
pass information on to applications. There are no "unspecified" values
for these features; for example, if there is a common noun and you
do not know what type it is, just use "NSYN common" without any value
for "NSEM COMMON"
- COMMON: -> $ { count gerund mass measure partitive
}. Subtypes of common (non-proper) nouns:
- count ("a box" "the boxes")
- gerund ("his pushing the box") includes deverbal
nouns with arguments in general
- mass ("sugar")
- measure ("two meters")
- partitive ("all of the boxes")
- PROPER: -> << [ PROPER-TYPE LOCATION-TYPE
NAME-TYPE ]. Proper nouns; these are subdivided because these details
tend to be important for applications
- PROPER-TYPE: -> $ { addr_form location name organization
title }. The specific subtype of a proper noun:
- location ("Paris")
- name: person's name ("Mary" "Smith")
- organization: name of company or organization ("Senate")
- title: title for people ("Mr. Smith")
- addr_form: form of address for people; these
are for address forms that can be used in addition to
the titles ("Herr Dr. Schmitt": Herr=addr_form Dr = title). What
can be a title and an addr_form varies from language to language.
- LOCATION-TYPE: -> $ { city country }. Subtype
of location; more values can be added as needed
- NAME-TYPE: -> $ {first_name last_name }. Subtype
of name
- TIME: -> $ { date day hour minute month season
second week year}. Subtype of time expression; some of these are proper
nouns and some common. This division still needs work since many
time expressions are not covered here; in addition, some phrases only get
the TIME feature in time expressions (e.g. numbers in digital representations
of time) while others get them whenever they occur (e.g. months of the year).
- date ("24/2/2004")
- day ("Tuesday")
- hour ("3:00")
- minute ("3:30")
- month ("January")
- season ("winter")
- second: used for the actual word ("second")
- week: used for the actual word ("week")
- year: used for the actual word ("year")
- SPEC: -> << [ADJUNCT AQUANT
DET NUMBER POSS QUANT SPEC-TYPE]. Specifiers of noun phrases; includes
determiners, possessives, quantifiers and numbers
- DET: -> << [ DEIXIS DET-TYPE PRED ]. Determiners,
including demonstratives ("the box" "this box"
"a box")
- DET-TYPE: -> $ {article def demon indef int rel}.
Type of determiner:
- def: definite ("the box")
- demon: demonstrative ("this box")
- indef: indefinite ("a box")
- int: interrogative ("which box")
- rel: relative ("the girl whose box broke")
- DEIXIS: -> $ { distal proximal post-distal }. For
determiners and demonstratives that encode deixis:
- distal ("that girl")
- proximal ("this girl")
- post-distal: this is used for deixis systems with a three
way distinction where this value is for the furthest away set of deictics
- NUMBER: -> << [NUMBER-TYPE PRED ADJUNCT CLASSIFIER-FORM
MOD]. Numbers modifying nouns ("six boxes")
- NUMBER-TYPE: -> $ {card fract ord percent}. Type
of the number; card and ord are the most important; NUMBER-TYPE can
be used for non-specifier numbers as well ("I bought six.")
- card: cardinal ("six")
- fract: fraction ("1/2")
- ord: ordinal ("6th" "sixth")
- percent ("6%")
- CLASSIFIER-FORM: No forms listed in the common feature
declaration; these are used for languages like Japanese to encode which
classifier is used with the noun-number combination.
- QUANT: -> << [ADJUNCT QUANT-TYPE POL PRED DEGREE
DEG-DIM ]. Quantifiers ("all boys")
- QUANT-TYPE: -> $ {comparative equative existential gen
negative superlative universal}. Type of quantifier;
most grammars do not use this much
- comparative ("more boxes than foxes")
- equative ("as many boxes as foxes")
- existential ("some man")
- gen: generalized quantifier; used when not making
distinctions such as universal and existential
- negative ("no boys")
- superlative ("the most sugar")
- universal ("every man")
- POL: Used for negative/positive polarity quantifiers
("no boxes"); many languages do not use this feature syntactically
- DEGREE and DEG-DIM:
similar to these features for adjectives and adverbs
- AQUANT: -> << [ ADJUNCT PRED QUANT-TYPE DEGREE
DEG-DIM ]. Adjectival quantifers ("many boxes"); used mainly because
some things can have both a quantifier and an adjectival quantifier
in languages like Norwegian and English ("all my many boxes")
- POSS: Encodes the possessor NP ("Mary's box").
This has no declared features/values in the common feature table because
it generally has all of the features that an NP does.
- SPEC-TYPE: This is used in specific constructions to
provide a null specifier for count nouns in languages like English in
noun-noun compounds ("tractor" in "the tractor trailer")
Adjectival/Adverbial features:
- ATYPE: -> $ {attributive predicative}. Basic type
of adjective:
- attributive: used when modifying nouns ("the flimsy box")
- predicative: used as the argument of a copular verb and
in small clauses ("the box is flimsy")
- DEGREE: -> $ {comparative positive
superlative}. Degree of an adjective or adverb:
- comparative ("redder" "more beautiful")
- superlative ("reddest" "(the) most beautiful")
- positive ("red" "as red as the sun")
- DEG-DIM: -> $ {equative neg pos}.
Whether the DEGREE is in a positive or negative dimension; non-equative
positive adjectives ("red") have no DEG-DIM and as such there is
no "non-equative" value
- pos: positive direction ("more beautiful" "most beautiful")
- neg: negative direction ("less beautiful" "least beautiful")
- equative: no direction ("as beautiful as her")
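For example, a comparative adjective would carry these features; the
lexical entry below is a hypothetical sketch using direct equations rather
than templates:
redder A XLE (^ PRED)='red'
             (^ DEGREE)=comparative
             (^ DEG-DIM)=pos.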
Preposition features:
- PSEM: -> $ {ag ben comit compar dir inst loc manner
num part poss purp temp }. Semantic value of the preposition. Most
prepositions have more than one possible meaning and so there may
be many values; to avoid ambiguity, these can be used as a set valued
feature (PSEM: -> << { loc dir }).
- ag: agentive ("it was eaten by them")
- ben: benefactive ("I baked a cake for Mary")
- comit: comitative ("I went with Mary")
- compar: comparative ("this is redder than that")
- dir: direction ("he went to the store")
- inst: instrumental ("he wrote with a pen")
- loc: locative ("it is on the table")
- manner: manner ("how did they do it?")
- part: partitive ("all of them"); these might be non-semantic
under some analyses in which case there is no PSEM
- poss: possessive ("of Mary")
- purp: purpose ("in order to leave")
- temp: temporal ("after dark")
- PTYPE: -> $ {nosem sem}. Whether a preposition is
semantic or non-semantic. This usually correlates with whether
the preposition has a PRED of its own and also a PSEM. However,
some prepositions may have PREDs but no longer be truly semantic prepositions
- sem: semantic ("They are under the roof.")
- nosem: non-semantic ("They rely on it/It is relied on.")
Clausal/Verbal features:
- STMT-TYPE: -> $ { decl header imp int }. Used by main
(root) clauses to indicate the type of the statement; not necessary
for grammars where the CLAUSE-TYPE and STMT-TYPE are identical.
However, header cannot be a CLAUSE-TYPE; so, even if STMT-TYPE
is generally omitted for the other values, it may still be necessary for
headers.
- decl: declarative ("they appear")
- header: header for newspapers, manuals, etc ("Tractor Repair")
- imp: imperative ("push it")
- int: interrogative ("did they appear?")
- CLAUSE-TYPE: -> $ { adv cond decl imp int nom pol-int
rel wh-int }. Used by all clauses, embedded and main, to indicate the
type of clause:
- adv: adverbial
- cond: conditional ("Were they to leave, we would also leave.")
- decl: declarative ("They left.")
- imp: imperative ("Leave now.")
- int: interrogative ("Did you leave?", "Who did
you see?"); this is used when a difference between pol-int and wh-int
is not needed
- nom: nominalized
- pol-int: polar (yes-no) interrogative ("Did you leave?")
- rel: relative ("the vase that he broke")
- wh-int: interrogative with wh-word ("Who did you see?")
- VTYPE: -> $ {aux copular main modal noncopular predicative
raising}. Type of the verb that heads the clause. For languages where
the auxiliaries just provide tense and aspect features, like English,
the VTYPE will be that of the main verb, not the auxiliary:
- aux: auxiliary, only for languages where auxiliaries head
an f-structure of their own, taking the main verb as an XCOMP; see the Norwegian grammar for examples
- copular: copula ("it is red")
- main: standard main verb ("they appear")
- modal: modal verb ("they should appear")
- noncopular: languages which do not distinguish between main
and modal can use this instead of main
- predicative: non-copular verbs taking predicatives, as in German
("nennen" in "Sie nennt ihn einen Idioten." 'She calls him
an idiot.')
- raising: raising verbs in languages where marking a distinction
between main and raising is needed, as in German ("beginnen" in "Es
beginnt zu regnen." 'It starts raining.')
- PASSIVE: -> $ {+ -}. Whether a verb is passive or not.
When possible, the negative value should be present for non-passivized
forms
- TNS-ASP: -> << [MOOD PERF PROG TENSE]. Feature
under which the tense, mood, and aspect information should go. This
is likely to undergo some changes with the addition of new languages.
This is generally just a recording of the syntactic information;
the features can be passed to applications to figure out the actual
tense and aspect values.
- TENSE: -> $ {fut past pres}. Tense of the verb; this
is generally the syntactic tense of the top level auxiliary
- fut: future ("they will arrive")
- past ("they arrived" "they had arrived" "they were arriving")
- pres: present ("they arrive" "they have arrived"
"they are arriving")
- MOOD: -> $ {imperative indicative subjunctive successive}.
Mood of the verb/clause
- imperative ("push it")
- indicative ("they push it")
- subjunctive ("were I to go, I would do it")
- successive
- PERF: -> $ {+ - +_ -_}. Perfective; value may be
instantiated or not; the _ indicates an instantiated feature; instantiated
features cannot unify and this can be helpful in constraining cascades
of auxiliary forms. ("they have left" "they will have left" "they
had left")
- PROG: -> $ {+ - +_ -_}. Progressive verb; value may
be instantiated or not ("they are leaving" "they have been leaving")
Common Templates
There is also a common templates file (common.templates.lfg) which can be used
with the pargram grammars. It includes a set of templates to assign the
features in the common feature declaration. For example, instead of having:
(^ TNS-ASP TENSE)=pres
you can have:
@(TENSE pres)
where the template TENSE is already defined in the common templates. This
can help with grammar maintenance and compliance to feature committee standards.
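The template itself is defined in common.templates.lfg along roughly these
lines (a sketch of the general pattern, not the exact definition):
TENSE(_T) = (^ TNS-ASP TENSE)=_T.
The parameter _T is substituted by the value supplied in the call, so
@(TENSE pres) expands to (^ TNS-ASP TENSE)=pres.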
In addition, it includes templates to assign:
- grammatical function subcategorization frames (for verbs and
other lexical items)
- some generally useful templates (including the old notationtemplates.lfg
file which contains some more general templates for assigning defaults,
creating constraining equations from defining equations, etc.)
There are comments in this file to document what these are and the general
naming schema that was used.
For example, you could define a template for intransitive verbs called
V-SUBJ which in turn could call the common template to construct the predicate.
V-SUBJ might have additional features in it:
V-SUBJ(_P) = "basic intransitive verb template"
   @(SUBJ_core _P) "call to common template to construct the PRED"
   ~(^ PRT-FORM) "no particle allowed".
Starter English Grammar
There is a larger version of demo-eng.lfg called eng-pargram.lfg which demonstrates many of the
mechanisms that you might want to use in your grammar.
File Management
The first thing to notice is that this grammar is split into several
files:
It is advisable to divide your grammar into separate files and so this
grammar is set up to demonstrate how to do that. Any .lfg file that you
want to include must be listed in the CONFIG under FILES. Fsts (finite-state
transducers: tokenizers and morphologies) are called from the morphconfig,
which can itself be a separate file or can be part of the main grammar
file, as is the case here.
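A morphconfig section has roughly the following shape (the section header
and file names here are illustrative; see eng-pargram.lfg for the actual
section):
STANDARD ENGLISH MORPHOLOGY (1.0)
TOKENIZE:
eng-pargram-tok.fst
ANALYZE:
eng-pargram-morph.fst
----
The fsts listed under TOKENIZE apply to the input string first, and their
output is fed to the fsts listed under ANALYZE.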
Integrating FSTs
xle makes it very simple to integrate fst tokenizers and morphologies
into your grammar. Tokenizers insert token boundaries between words and
do things like split punctuation off of words and lowercase initial capitals.
The tokenizer used here is based on that in Beesley
and Karttunen. Morphologies associate surface forms with a stem form
and a set of tags that encodes the relevant morphological information.
Here we will focus on the morphologies. If you load up eng-pargram.lfg
and type:
morphemes baking
xle will return:
{ bake "+Verb" "+Prog" | baking "+Token" }
What this means is that the surface form "baking" is associated with
a stem "bake" and the tags +Verb and +Prog. It is
alternatively associated with the form "baking" and a built-in tag
called +Token.
If you look in the lexicon in eng-pargram-lex.lfg,
there is only an entry for "bake"; the various possible surface
forms ("bake", "bakes", "baked", "baking") are
not listed. This is not necessary because once the string (sentence) is
passed through the tokenizer and the morphology, the output string only
contains the stem form plus the tags. Each of these stems and tags is listed
in the lexicon and can be looked up by:
bake V XLE { @(V-SUBJ %stem)
|@(V-SUBJ-OBJ %stem)
|@(V-SUBJ-OBJ-OBJTH %stem)}.
+Verb V_SFX XLE @(VTYPE main).
+Prog V_SFX XLE (^ TNS-ASP PROG)=c +.
where V_SFX is an arbitrarily chosen c-structure category. The grammar
in eng-pargram.lfg then assembles these into a V category via the sublexical
rule:
V --> V_BASE "stem form"
V_SFX_BASE+ "as many tags as the morphology provides".
These rules are quite easy to write except that you need to remember to
add "_BASE" to whatever category is listed in the lexicon. For
example, V_SFX in the lexicon corresponds to V_SFX_BASE in the sublexical
rule. The reasoning behind this has to do with how xle handles the display
of these categories, since the sublexical structure is not shown by default
(hence you must use "show morphemes" to view it).
If in xle you do:
set-OT-rank Fragment NOGOOD
and then parse
parse {V: baking}
you will see the resulting tree if you choose the menu item "show morphemes".
The morphology in eng-pargram-morph.fst also provides analyses for nouns
and prepositions. This can be seen by typing in xle:
morphemes girls
morphemes with
which have the output:
{girls "+Token"|girl "+Noun" "+Pl"}
with {"+Prep"|"+Token"}
However, there are no specific lexical entries for "girl" or
"with" or any other noun, pronoun, or preposition in eng-pargram-lex.lfg. Even so, xle is able
to provide an analysis for these words by falling back on the entry for
"-unknown" which is a special form that matches any stem in the
morphology that does not have an overt lexical entry. ("unknown"
means unknown to the xle lexicon but known to the morphology. xle
first looks for an overt lexical entry; if it cannot find one, it tries
to match against "-unknown".)
-unknown N XLE @(NOUN %stem);
P XLE
@(PREP %stem).
In this grammar, we have allowed "-unknown" to match nouns (N)
and prepositions (P). So, whenever a noun or preposition is parsed and passed
through the morphological analyzer, xle will build a lexical entry for it
based on -unknown. The information provided by the morphological
tags will constrain whether it is treated as a noun or a preposition. In xle
type:
parse {N: girls}
to see how this works; the morphological tags can be made visible by
choosing the "show morphemes" option.
There is a corresponding -token lexical entry which can be used
to match tokens, i.e., items which do not get a morphological analysis. In
eng-pargram.lfg and eng-pargram-lex.lfg, this is used to match
words in the FRAGMENTS grammar (see below) which are not part of a constituent.
Note that the morphcode for -token is always * and not XLE:
-token TOKEN * (^ TOKEN)=%stem.
So, to connect a FST morphology to your grammar, you need to:
- list the morphology files in the morphconfig
- write lexical entries for the morphological tags, -unknown,
and any stems with unpredictable subcat frames (e.g. verbs)
- write sublexical rules to connect the stems and tags
Once this is done, you can write your grammar as you would have normally.
However, you will have much less lexicon work to do since the morphology
and -unknown entry conspire to provide lexical entries for most words.
Of course, if you do not have an fst morphology available, the lexical
work will be done in building the fst morphology.
You might ask where the fst morphologies come from. For many languages,
such morphologies already exist. However, you can write your own using
the xfst tools provided with the Beesley
and Karttunen book. The input fst script for eng-pargram-morph.fst
is in eng-pargram-morph.infile. This
is an extremely unsophisticated script and would not be very efficient for
entering large numbers of lexical items. You can use the lexc tools described
in the book to create a much more sophisticated morphology in a more succinct
and more linguistically satisfying format. Note that the fst files
are binary files and hence cannot be looked at in emacs; the infiles which
produce these are readable though.
Robustness: FRAGMENTS
When doing initial grammar development, you want ungrammatical sentences
to get no parses. However, for later applications, it is often useful to
get some type of output for any input. One way to do this is to write a
fragment grammar. A fragment grammar builds up well-formed chunks, such
as NPs, and then puts all of the chunks together in a FIRST-REST structure.
To see these, in xle type:
parse {girls sleep bananas.}
The result is a fragment parse (if you get 0 parses, restart the grammar;
the set-OT-rank Fragment NOGOOD command used above removed the fragments
so that you could see ungrammatical structures). Each piece of the f-structure
is well-formed, but the top level f-structure has no PRED.
In the CONFIG section in eng-pargram.lfg,
in addition to defining a ROOTCAT, we have also defined a REPARSECAT.
XLE will first try to build a well-formed structure using the ROOTCAT (here
S). If it fails, then it will build a structure using the REPARSECAT (here
FRAGMENTS).
The rule for this category is in the RULES section. It consists of
two main parts.
The first is a disjunction of all the categories we want to build chunks
out of (NP, PP, VP, S, and TOKEN). TOKEN is a special category that is
used when one of the other chunks cannot be built. The lexical entry for
this is in eng-pargram-lex.lfg under "-token"
which matches anything that gets a +Token tag in the morphology (it
is similar to -unknown which matches anything that goes through the morphology
and hence gets any tag other than +Token). All this lexical entry
does is provide a feature that records the lexical item.
Each of these is associated with an OT mark "Fragment" which is a dispreference
mark (any mark without a prefixed + is a dispreference mark). The reason
for this is to make sure that the fragment rule uses the fewest chunks
possible. That is, if there is one analysis with an NP chunk and a VP
chunk, and another analysis with an S chunk, the analysis with the S chunk
will be chosen because it has fewer instances of the Fragment OT mark.
The second part of the FRAGMENTS rule is a recursive call to the rule
to build up any additional chunks. Most sentences that go through FRAGMENTS
will consist of more than one chunk.
In sum, what you need to build a fragment grammar of your own is:
- a specification of REPARSECAT in the CONFIG
- an entry for -token in the lexicon
- a fragment rule
- an OT mark
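Put together, a minimal fragment rule has roughly this shape (a simplified
sketch modeled on the rule in eng-pargram.lfg, not the exact rule):
FRAGMENTS --> { NP: @(OT-MARK Fragment) (^ FIRST)=!;
              | VP: @(OT-MARK Fragment) (^ FIRST)=!;
              | TOKEN: @(OT-MARK Fragment) (^ FIRST)=!}
              (FRAGMENTS: (^ REST)=!).
Each chunk is dispreferred via the Fragment OT mark, and the optional
recursive call to FRAGMENTS chains any remaining chunks together under
REST.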
One final note on the fragments: it is extremely hard to debug a grammar
with the fragments on. To turn them off, in xle type:
set-OT-rank Fragment NOGOOD
You can put this command in your xlerc file
if you always want the fragments off.
Coordination and METARULEMACRO
There is a special macro that you can define in the RULES section of
the grammar called METARULEMACRO. If this macro is present, it is applied
to all of the rules in the grammar, including the sublexical rules. The
reason to use this is so that when you add in new rules, you don't have
to remember to add in calls to other rules or macros that should apply to
them, such as coordination. Instead, XLE will do this automatically for
you.
There are two main ways in which this macro is used in most pargram
grammars. This first is for coordination: by using METARULEMACRO, each
rule does not have to contain a disjunct that calls the coordination rules.
This is discussed in detail below. The second is to allow certain types
of punctuation or markup to apply to any constituent.
Look at the METARULEMACRO definition in eng-pargram.lfg.
This macro has three variables. The first is the category name, such as
NP. The second is the base category for complex categories; since there
are no complex categories in this grammar, _CAT will be the same as _BASECAT.
The last is the righthand side, i.e., the expansion, of the rule.
The first disjunct in METARULEMACRO should always be _RHS. Otherwise,
the simple, unmarked-up expansion of your rules will not occur. This is
almost never the desired effect.
The second and third disjuncts allow coordination to apply.
The final disjunct allows any category to appear surrounded by a left
bracket and a right bracket. This can be very useful for determining
if a particular parse is available and for cutting down on ambiguity. First
parse:
parse {the boys devour the bananas in the cake.}
This sentence has two parses. Next parse:
parse {the boys devour [the bananas in the cake].}
This sentence has only one parse because "in the cake" is forced
to be a constituent of the object NP.
Note the call to @PUSHUP in the bracketing disjunct. This template
is defined in common.templates.lfg.
It is used to make sure that the brackets occur around the highest constituent
that they can, instead of occurring at all levels. This might happen when
bracketing something like "cats" which is a constituent at both the N and
the NP levels. (To see this, comment out the ": @PUSHUP;" and see what happens
if you parse something like "[cats] sleep.")
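Putting the pieces just described together, a METARULEMACRO along the lines of the one in eng-pargram.lfg looks roughly as follows. This is a sketch consistent with the description above, not a verbatim copy; in particular the bracket categories LB and RB are illustrative names, so check the actual definition for the category names your tokenizer produces:
METARULEMACRO(_CAT _BASECAT _RHS) = "applies to every rule in the grammar"
   { _RHS "the rule's own expansion"
   | e: _CAT $c { NP N }; "nominal coordination"
     @(NPCOORD _CAT)
   | e: _CAT ~$ { NP N }; "non-nominal coordination"
     @(SCCOORD _CAT)
   | LB: @(OT-MARK GenBadPunct); "bracketed constituent"
     _CAT: @PUSHUP;
     RB }.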
Let's now look at the coordination rules in more detail.
There are two rules for coordination. SCCOORD is used for everything
but nominal coordination. It is a simple rule that just takes the f-structures
of the two constituent categories and puts them in a set with the conjunction
between them. The conjunction will provide a COORD-FORM to the set, as specified
in the lexical entry for "and". In order for this feature to appear as
a value of the set and not in the f-structures of the conjuncts, you have
to define it as non-distributive in the CONFIG:
NONDISTRIBUTIVES
NUM PERS COORD-FORM.
To restrict this rule to only apply to non-nominals, the annotation:
e: _CAT ~$ { NP N };
occurs, which states that _CAT cannot be either NP or N.
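A minimal SCCOORD along these lines (a sketch of the pattern, not necessarily the exact definition in eng-pargram.lfg) simply makes each conjunct a member of the coordination set, with the conjunction in between:
SCCOORD(_CAT) = "coordination of non-nominals"
   _CAT: ! $ ^; "first conjunct is a member of the set"
   CONJ "the conjunction supplies COORD-FORM"
   _CAT: ! $ ^. "second conjunct is a member of the set"
with a lexical entry for the conjunction such as:
and  CONJ * (^ COORD-FORM)=and.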
NPCOORD is used for coordinating nominals because the person and number
values of a coordinated nominal are not necessarily the same as those of
its conjuncts ("the cat and dog jump."). So, NPCOORD provides the
correct person and number features, with the NUM feature coming from the
lexical entry for "and" and the PERS feature coming from the template NP-CONJUNCT.
Since these are features of the set itself, NUM and PERS must also be
listed as non-distributive features in the CONFIG. When a verb checks
a coordinated subject for number and person, it will see the values of
these features on the set. To see this, parse:
parse {the boy and the girl bake the cake.}
Even though "boy" and "girl" are both singular, the coordinated
NP is plural, and so the plural verb "bake" can occur with it. To restrict
NPCOORD to only apply to nominals, the annotation:
e: _CAT $c { NP N };
appears in the call to NPCOORD in METARULEMACRO.
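Again schematically (a sketch; the real NP-CONJUNCT in common.templates.lfg resolves PERS across all of the conjuncts, and the category name CONJnp is an assumption here), NPCOORD and its supporting pieces might look like:
NPCOORD(_CAT) = "nominal coordination"
   _CAT: ! $ ^ @NP-CONJUNCT;
   CONJnp "its lexical entry supplies NUM pl to the set"
   _CAT: ! $ ^ @NP-CONJUNCT.

NP-CONJUNCT = "PERS resolution for coordinated nominals"
   { (! PERS)=c 1 "if a conjunct is 1st person,"
     (^ PERS)=1 "the set is 1st person"
   | ... "similar disjuncts for the other persons" }.
Here "and" would have a second lexical entry of category CONJnp contributing (^ NUM)=pl to the coordination set.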
In sum, METARULEMACRO applies to every rule in the grammar. The only
difficult part in using it is to remember to include a disjunct that just
says _RHS to make sure that the rules apply as you intended.
Lexical Rules
Theoretical LFG often uses lexical rules to manipulate predicates in
things like passives. You can use lexical rules in xle. It is possible
to delete arguments of the predicate and to rename them. However, under
the current implementation it is not possible to add arguments.
The lexical rules are defined in the TEMPLATES section. An example in this
grammar is PASS. (The COM comments are used with the emacs lfg-mode tools.)
PASS(_SCHEMATA) = "passive lexical rule"
   "COM{EX TEMPLATES S: the girl devours a banana.}"
   "COM{EX TEMPLATES S: a banana is devoured.}"
   { "active version" _SCHEMATA (^ PASSIVE)=-
   | "passive version" _SCHEMATA
     (^ PASSIVE)=c +
     { (^ SUBJ) --> NULL "wipe out the subject"
     | (^ SUBJ) --> (^ OBL) "make into an oblique 'by' phrase"
       @(OT-MARK OblAg)} "COM{EX TEMPLATES S: a banana is devoured by the girls.}"
     (^ OBJ) --> (^ SUBJ) "make the object the subject"}.
The PASS template takes a predicate such as:
(^ PRED)='bake<(^ SUBJ)(^ OBJ)>'
and rewrites the SUBJ as NULL which effectively deletes it or rewrites
it as an oblique. There is an OT mark in the disjunct that creates the
OBL; given the OT ranking in the CONFIG, this will result in "by"
phrases in passives being prefered over adjunct readings (OT marks are discussed
more below). The lexical rule then rewrites the object as the subject.
Note that the first disjunct of the PASS template does nothing to the
predicate and occurs in an active environment. The second disjunct is
the one that performs the passive lexical rule and is constrained to occur
in passive environments.
The PASS template is called by two other templates: V-SUBJ-OBJ and
V-SUBJ-OBJ-OBJTH. Thus, both transitive and ditransitive verbs can be
passivized in this grammar.
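For concreteness, the wiring is roughly as follows (a sketch of the usual pargram pattern; check eng-pargram.lfg for the actual definitions): the verb's lexical entry calls the subcategorization template, which builds the PRED and passes all of its schemata through PASS:
V-SUBJ-OBJ(_P) = "transitive verb"
   @(PASS (^ PRED)='_P<(^ SUBJ)(^ OBJ)>').

devour  V * @(V-SUBJ-OBJ devour).
As a result, "devour" parses in the active and, via the lexical rule, in the passive, without the passive being stated separately in each verb entry.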
Defaults
It is sometimes useful to provide a default value for a feature. This
can be done with the template DEFAULT:
DEFAULT(_FEAT _VAL) = "provides a default value for a feature"
   { _FEAT "feature exists but with a different value"
     _FEAT ~= _VAL
   | _FEAT = _VAL "assign the default value"
     "it will unify if it already exists"}.
This template either requires that the feature have a value other than
the default one or assigns the default value. Note that it is important
to have the equation in the first disjunct stating that the feature's value
is not the default; otherwise you can end up with vacuous ambiguity, that
is, multiple parses with no difference in the resulting c-structure or f-structure.
The issue is then where to call the DEFAULT template. In eng-pargram.lfg, default present ("pres") tense
is assigned in the S rule. Default third ("3") person is assigned to nouns
in the NOUN template.
Epsilon
In the CONFIG section, you can define a category for epsilon. This
will allow you to hang equations in rules where there is not a convenient
constituent on which to do so.
An example of this is seen in the S rule. Epsilon is "e" in this grammar
(and standardly in the pargram grammars):
S --> "COM{EX RULE S: the girl pushes the boys.}"
   e: @(DEFAULT (^ TNS-ASP TENSE) pres)
      "provide pres as a default value to TENSE"
      @(DEFAULT (^ STMT-TYPE) decl)
      "provide decl as default value to STMT-TYPE";
It would have been possible to put these equations on both the VP and
the VPaux categories, but by putting them on the "e", they only have to
be mentioned once.
Another use for "e" would be in a language in which the copula is sometimes
present (e.g., in past tenses) and sometimes not (e.g., in the present
tense). The VP copular rule might look like:
VP --> { Vcop "overt copula in past tense"
|e: "non-overt copula in the present tense"
(^ PRED)='null-be<(^SUBJ)(^PREDLINK)>'
(^ TENSE)=present }
{ NP: (^ PREDLINK)=!
|AP: (^ PREDLINK)=!}.
Note that it is important to avoid using down (!) in the annotations
on the "e". You can do this, but the behaviour is not likely to be what
you want.
OT Marks
As grammars get bigger, the ambiguity rate becomes very high. One way
you can control this is by using OT (optimality theory) marks. These are
marks that you put in the grammar rules, templates, and lexical entries.
The marks are then ranked in the CONFIG. The orders in eng-pargram.lfg are:
OPTIMALITYORDER NOGOOD *Fragment "disprefer fragments and mark with *"
   +OblAg. "prefer 'by' obliques in passives"
GENOPTIMALITYORDER GenBadPunct NOGOOD "do not generate these"
   +GenGoodPunct. "prefer these"
There are two orders: one for parsing and one for generation. (In xle,
the same grammar is used for parsing and generation, with the only differences
being the tokenizer and the OT order; it is also possible to use slightly
different morphologies for parsing and generation.)
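The marks themselves are introduced in rules, templates, and lexical entries via the OT-MARK template, which places the mark in the optimality projection o:: (this is the standard definition found in common.templates.lfg; confirm it there):
OT-MARK(_mark) = "add an OT mark to the o:: projection"
   _mark $ o::*.
So @(OT-MARK OblAg) in the PASS template simply adds OblAg to the set of marks that the rankings above evaluate.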
If there are two f-structures for a sentence and one has a dispreference
mark and the other has no mark, then the f-structure with no mark is
chosen. This was seen in the case of the fragment grammar where each chunk
introduced a Fragment OT mark. So, if there is a choice between an f-structure
with one mark (one chunk) and one with two marks (two chunks), the one
with one mark is chosen. This results in a fewest-chunks approach to
fragmenting.
If there are two f-structures for a sentence and one has a preference
mark, which is indicated by a preceding + in the OT order, and the other has
no mark, then the f-structure with the preference mark is chosen. This
is seen in the passive lexical rule PASS in the templates. The OBL reading
introduces an OT mark OblAg which is ranked +OblAg. If you parse:
parse {bananas are devoured by boys.}
there will be 1+1 solutions. The optimal solution is the one shown
and it has the OBL-AG reading. The "+1" in the "1+1" is the suboptimal
solution and corresponds to an ADJUNCT reading of the "by" phrase.
You can see the unoptimal solutions by choosing the "unoptimal" command
in the f-structure window.
The generation OT marks work the same way. In this grammar there are
two generation OT marks. The preference mark GenGoodPunct (Generate Good
Punctuation) requires a period to be generated at the end of sentences.
This grammar can parse both of the following:
parse {boys sleep}
parse {boys sleep.}
However, it will only generate:
boys sleep.
To see this, in the f-structure window choose the command "generate
from this f-structure".
The dispreference mark GenBadPunct is a NOGOOD mark and hence occurs
to the left of NOGOOD in the ranking (NOGOOD does not affect things that
occur to its right). This means that any rule part in the grammar with which
it is associated has been removed from the grammar. Here, the mark appears
in METARULEMACRO in the bracketing markup. This means that bracketing can
be parsed but not generated. So, if you parse:
parse {[the girls] sleep.}
the result when generating will be:
the girls sleep.
The same mark also appears on the comma in the coordination rule.
So, OT marks give you as a grammar writer control over some of the ambiguity
in the grammar. There are many additional types of OT marks that are described
in detail in the xle documentation, but what is described here will give
you enough to start with.
Useful XLE tricks
There are a number of extra-grammatical facilities available in XLE
that will make grammar writing much easier.
XLE documentation
To access the xle documentation, in xle type:
documentation
and a web browser will be launched with the documentation; this documentation
is also found in xle/doc/xle-toc.html. You can also type:
help
which will list all of the commands that you can use in xle.
xlerc file
Every time you make a change to your grammar, you have to restart xle
and reload the grammar. To make this easier, create a file called:
xlerc (important: the file has no extension!)
in the directory that you are going to work in. In it put the line:
create-parser mygrammar.lfg
where "mygrammar.lfg" is the name of your top level grammar file.
Whenever you (re)start xle in that directory, it will automatically
create the parser for you.
You can put any commands normally used in xle in the xlerc file and
they will automatically be invoked. You can also define procedures and
create aliases for commands; these are defined according to tcl and the
easiest way to learn about them may be to look at previously defined ones
and modify them. For example, you can redefine the "analyze-string" command
as "as" via:
proc as {P} {
analyze-string $P
}
Emacs library
XLE comes with a special emacs library lfg-mode.el. You should load
this library when using emacs to edit grammar files and run xle. It will
format rules, lexical entries, and templates for you. It also has commands
to launch and restart xle and to automatically parse sentences in testfiles.
To get emacs to automatically load this library whenever you are editing
a file ending in .lfg, add the following lines to your .emacs file (if you
do not have a .emacs file, you can create one in your home directory):
; to load the LFG-mode for XLE
(load-library "/usr/local/xle/emacs/lfg-mode")
Note that the path may be different depending on where xle is installed
on your machine.
If you have never used emacs before, you can access an emacs tutorial
by typing:
C-h t
when you are in emacs; where C-h means hold down the control key while
typing an "h" and then type a "t" without holding down the control key.
There are a number of keyboard short cuts that can be used when you
have lfg-mode loaded.
- ESC q will format a rule, lexical entry, or template if
the cursor is in that rule, lexical entry, or template; this is a good
way to see if you made a mistake entering the rule, although it will
not catch all errors; in particular, if the alignment of disjuncts is not
correct, there is probably an error
- C-c C-f will launch an xle process if you are in a .lfg
file; if you are in the xle shell, it will restart xle
- ESC C-x will parse a sentence if you have the cursor on
that sentence in the testfile
For more details read the xle documentation on emacs support for xle.
Testfiles and comment examples
Grammar development, even at the early stages, involves having to reparse
things many times to figure out if they are working yet and, after making
changes, still working. To facilitate this, you can put your example sentences
in a testfile. It is best to name your testfile with a ".lfg" suffix since
then you can use the emacs library to automatically parse whichever sentence
you are interested in. The test file should look like:
# Comment lines begin with hash marks

ROOT: This is a sentence.

NP: a noun phrase

PP: with a noun phrase

NP: an ungrammatical noun phrases (0! 0 0 0)
where each new sentence has a blank line on either side of it. It is
useful to put in the parse category (e.g., ROOT, NP, PP) in case you change
the default parse (root) category in your grammar. You can indicate sentences
which are supposed to get no parses by putting (0! 0 0 0) after them.
If these do get a parse, xle will complain. You can also mark if
a sentence is supposed to have a particular number of parses:
ROOT: I see the girl with the telescope (2! 0 0 0)
You can run the entire testsuite at once by doing:
parse-testfile my-testsuite.lfg
where "my-testsuite.lfg" is the name of the testsuite. Note that you
can use path names if you don't want to store the testsuites with the grammar
files:
parse-testfile testfiles/questions/my-testsuite.lfg
It is possible to automatically create testsuites from comments in the
grammar if the comments are of the form:
"COM{EX section example}"
"section" indicates what section it comes from (RULES, TEMPLATES, LEXICON).
"example" is the example itself ("NP: a monkey"). In lfg-mode, there is
an option to extract the comments under the LFG window bar. Doing this
will create an emacs buffer of all of the examples as a testsuite file;
this buffer can then be saved as a testsuite file. Some examples of this:
NP --> "rule for common noun phrases"
   "COM{EX RULES NP: boxes}"
   (D: (^ SPEC)=!) "COM{EX RULES NP: the box}"
      "COM{EX RULES NP: a box}"
      "COM{EX RULES NP: a boxes (0! 0 0 0)}"
   N "head noun"
   "COM{EX RULES ROOT: Foxes push the boxes.}".
There are a number of comments of this type in eng-pargram.lfg. You can see what the resulting
testsuite files look like by running the extract comments command on this
file in emacs.
It is highly recommended to do this because it makes it easier for someone
else to read the grammar and makes it easy to figure out which parts of
the grammar are working.
Interpreting Error Messages
It takes some time to get used to the xle error messages, just as with
any new system. By doing the walkthrough and playing with the starter
grammar provided here, you should get some practice with the types of errors
you are likely to run into when doing grammar writing.
Background Reading
This is a list of papers, divided by topic, that might be of direct use
to you when writing grammars. Many of them are available electronically.
ParGram project as a whole:
- Miriam Butt, Tracy Holloway King, María-Eugenia Niño, and Frédérique
Segond. 1999. A Grammar Writer's Cookbook. Stanford: CSLI Publications.
- Miriam Butt, Helge Dyvik, Tracy Holloway King, Hiroshi Masuichi,
and Christian Rohrer. 2002. The Parallel Grammar Project. Proceedings
of COLING-2002, Workshop on Grammar Engineering and Evaluation, pp. 1-7.
Grammar engineering:
- Miriam Butt and Tracy Holloway King. 2003. Grammar Writing, Testing,
and Evaluation. In Ali Farghaly (ed.), Handbook for Language Engineers.
CSLI Publications, pp. 129-179.
Features and templates:
OT marks:
- Tracy Holloway King, Stefanie Dipper, Anette Frank, Jonas Kuhn,
and John Maxwell. 2000. Ambiguity Management in Grammar Writing.
Linguistic Theory and Grammar Implementation Workshop at the European
Summer School in Logic, Language, and Information (ESSLLI-2000).
FST morphology integration:
Grammar porting and adaptation:
- Ron Kaplan, Tracy Holloway King, and John Maxwell. 2002. Adapting
Existing Grammars: The XLE Experience. Proceedings of COLING-2002,
Workshop on Grammar Engineering and Evaluation, pp. 29-35.
- Roger Kim, Mary Dalrymple, Ron Kaplan, Tracy Holloway King, Hiroshi
Masuichi, and Tomoko Ohkuma. 2003. Multilingual Grammar Development
via Grammar Porting. ESSLLI 2003 Workshop on Ideas and Strategies for
Multilingual Grammar Development.
2004 09 22
Tracy.King@microsoft.com