Integrating Finite-state Morphologies with Deep LFG Grammars
Tracy Holloway King
FST and deep grammars
Finite-state tokenizers and morphologies can be integrated into deep processing systems.
Integrated tokenizers:
– eliminate the need for preprocessing
– allow the grammar writer more control over the input
Morphologies:
– eliminate the need to list (multiple) surface forms in the lexicon
– eliminate the need for lexical entries for words with predictable subcategorization frames
Talk outline
– Basic integrated system
– Integrating morphology FSTs
– Interaction of tokenization and morphology
Basic Architecture
Input string
→ (Shallow markup)
→ Tokenizing FSTs
→ Morphology FSTs
→ LFG grammar and lexicons
→ Constituent structure (tree) and Functional structure (AVM)
Example steps through the system
Input string: Boys appeared.
Tokenizing: boys TB appeared TB . TB
Morphology:
  boy +Noun +Pl
  appear +Verb +PastBoth +123SP
  . +Punct
C-structure/F-structure: next slides
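The steps above can be sketched in a few lines of code. This is a toy simulation of the tokenize-then-analyze pipeline, not XLE's actual API; the function names and the tiny hard-coded morphology are illustrative only. "TB" stands for the token-boundary symbol the tokenizing FST inserts.

```python
# Toy sketch of the tokenize-then-analyze steps above (not XLE's API).

def tokenize(sentence):
    """Break off final punctuation and lowercase the initial word
    (a simplification: the real tokenizer lowercases optionally)."""
    words = sentence.rstrip(".").split()
    words[0] = words[0].lower()
    return words + ["."]

# Toy morphology: surface form -> list of (lemma, tags) analyses
MORPHOLOGY = {
    "boys":     [("boy", ["+Noun", "+Pl"])],
    "appeared": [("appear", ["+Verb", "+PastBoth", "+123SP"])],
    ".":        [(".", ["+Punct"])],
}

def analyze(tokens):
    """Map each token to all of its morphological analyses."""
    return {t: MORPHOLOGY.get(t, []) for t in tokens}

tokens = tokenize("Boys appeared.")
print(" TB ".join(tokens) + " TB")   # boys TB appeared TB . TB
for tok, analyses in analyze(tokens).items():
    for lemma, tags in analyses:
        print(tok, "=>", lemma, " ".join(tags))
```

The c-structure and f-structure building that follows in the real system is, of course, much richer than a dictionary lookup.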
The wider system: XLE
Hand-written grammars for various languages
– Substantial grammars for English, German, Japanese, Norwegian
– Also: Arabic, Chinese, Urdu, Korean, Welsh, Malagasy, Turkish
Robustness mechanisms
– Fragment grammar rules
– Morphological guessers
– Skimming when resource limits are approached
Ambiguity management (packing)
– Compute all analyses (no "aggressive pruning")
– Propagate packed ambiguities across processing modules
Stochastic disambiguation
– MaxEnt models to select from packed (f-)structures
Other processing available
– generation, semantics, transfer/rewriting
Comparisons to other systems/tasks
– Parsing WSJ (Riezler et al., ACL 2002)
– Comparison to Collins model 3 (Riezler et al., NAACL 2004)
FST Morphologies
Associate a surface form with:
– a lemma (stem/canonical form)
– a set of tags
The process is non-deterministic:
– one surface form can have many analyses
– the grammar has to be able to deal with multiple analyses (morphological ambiguity)
– Issue: can the grammar control rampant morphological ambiguity? (e.g., Arabic vowelless representations)
Example Morphology Output
turnips  <=> turnip +Noun +Pl
Mary     <=> Mary +Prop +Giv +Fem +Sg
falls    <=> fall +Noun +Pl
             fall +Verb +Pres +3sg
broken   <=> break +Verb +PastPerf +123SP
             broken +Verb +PastPart
             broken +Adj
New York <=> New York +Prop +Place +USAState +Prefer
             New York +Prop +Place +City +Prefer
             [plus analyses of New and York]
Morphologies and lexicons
Without a morphology, all surface forms must be listed in the lexicon:
– bad for English
– horrible for languages like Finnish and Arabic
With a morphology, one entry covers the stem form:
  go V XLE @(V-INTRANS go).
  (covers: go, goes, going, gone, went)
With additional integration, words with predictable subcategorization frames need no entry at all.
Basic idea
– Run surface forms of words through the morphology to produce stems and tags (the MorphConfig file specifies which morphologies the grammar uses)
– Look up stems and tags in the lexicon
– Sublexical phrase structure rules build syntactic nodes covering the stems and tags
– Standard grammar rules build larger phrases
Lexical entries for tags
boys ==> boy +Noun +Pl

  boy    N        XLE @(NOUN boy).
  +Noun  N_SFX    XLE @(PERS 3)
                      @(EXISTS NTYPE).
  +Pl    NNUM_SFX XLE @(NUM pl).
Sublexical rules for tags
Build up lexical nodes from the stem plus its tags.
Rules are identical to standard phrase structure rules, except that the display can hide the sublexical information.

  N --> N_BASE
        N_SFX_BASE
        NNUM_SFX_BASE.

  N
    N_BASE         boy
    N_SFX_BASE     +Noun
    NNUM_SFX_BASE  +Pl
Resulting structures
C-structure:
  N
    N_BASE         boy
    N_SFX_BASE     +Noun
    NNUM_SFX_BASE  +Pl
F-structure:
  [ PRED  'boy'
    PERS  3
    NUM   pl
    NTYPE common ]
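As a rough sketch of how the f-structure above could arise, each sublexical item contributes features that are unified into one attribute-value matrix. The tag-to-feature table below mirrors the lexical entries shown earlier (+Noun contributing PERS 3, +Pl contributing NUM pl); the function names and the NTYPE value are illustrative assumptions, not XLE internals.

```python
# Sketch: each tag contributes features, mirroring the entries above
# (+Noun adds PERS 3, +Pl adds NUM pl); names here are illustrative.

TAG_FEATURES = {
    "+Noun": {"PERS": "3", "NTYPE": "common"},
    "+Pl":   {"NUM": "pl"},
}

def build_fstructure(stem, tags):
    """Unify the stem's PRED with the features its tags contribute."""
    fstruct = {"PRED": f"'{stem}'"}
    for tag in tags:
        for feat, val in TAG_FEATURES.get(tag, {}).items():
            if feat in fstruct and fstruct[feat] != val:
                return None              # unification clash: analysis fails
            fstruct[feat] = val
    return fstruct

print(build_fstructure("boy", ["+Noun", "+Pl"]))
# {'PRED': "'boy'", 'PERS': '3', 'NTYPE': 'common', 'NUM': 'pl'}
```

The clash check is what lets incompatible tag combinations fail sublexically, as the -unknown example later in the talk relies on.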
Lexical entries
Stems with unpredictable subcategorization frames need entries:
– verbs
– adjectives with obliques (proud of her)
– nouns with that-complements (the idea that he laughed)
Most lexical items have predictable frames determined by part of speech:
– common and proper nouns
– adjectives
– adverbs
– numbers
-unknown lexical entry
– Matches any stem
– Provides the desired functional information (%stem passes in the appropriate surface form, i.e., the lemma/stem)
– Application is constrained via the morphological tag possibilities

  -unknown N   XLE @(NOUN %stem);
           A   XLE @(ADJ %stem);
           ADV XLE @(ADVERB %stem).
-unknown example
The box boxes.
Lexicon entries:
  box      V XLE @(V-INTRANS %stem).
  -unknown N XLE @(NOUN %stem); ADV …; A ….
Morphology output:
  box   ==> box +Noun +Sg | +Verb +Non3Sg
  boxes ==> box +Noun +Pl | +Verb +3Sg
Four effective lexical entries are built:
– 1 noun, 1 verb, 1 adverb, 1 adjective
– the adverb and adjective fail sublexically
– the noun and verb are relevant for the sentence
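The lookup logic can be sketched as follows: a (stem, category) pair found in the lexicon uses its own entry, and otherwise the -unknown entry for that category applies, with %stem instantiated to the lemma. This is a toy simulation under assumed names, not XLE's lexicon machinery.

```python
# Sketch of -unknown lookup: a stem with an explicit lexicon entry for a
# category uses it; otherwise the -unknown entry for that category applies.

LEXICON = {("box", "V"): "@(V-INTRANS %stem)"}     # explicit entry
UNKNOWN = {"N": "@(NOUN %stem)",
           "A": "@(ADJ %stem)",
           "ADV": "@(ADVERB %stem)"}

def lexical_entry(stem, category):
    """Return the effective lexical template for stem as category."""
    template = LEXICON.get((stem, category)) or UNKNOWN.get(category)
    if template is None:
        return None
    return template.replace("%stem", stem)   # %stem passes in the lemma

print(lexical_entry("box", "V"))   # @(V-INTRANS box)  (explicit entry)
print(lexical_entry("box", "N"))   # @(NOUN box)       (via -unknown)
```

In the real system the adverb and adjective candidates are then filtered out sublexically, since the morphology produced no +Adv or +Adj tags for "box".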
Inflectional morphology summary
– Integrating FST morphologies significantly decreases lexicon development effort
– Verbs and other unpredictable items are listed only under their stem form
– Predictable items such as nouns are processed via -unknown and never listed in the lexicon
Guessers
Even large industrial FST morphologies are not complete.
Novel words usually have regular morphology; build an FST guesser based on this:
– Words with capital letters are proper nouns (Saakashvili)
– Words ending in -ed are past tense verbs or deverbal adjectives
Guessed words will go through -unknown:
– no difference from standard morphological output
– can add a +Guessed tag for further control
Guessers: controlling application
Apply the guesser only if there is no form in the regular morphology (don't guess unless you have to).
Control this with the MorphConfig:
– use multiple FST morphologies
– stop looking once an analysis is found
Sample MorphConfig
  STANDARD ENGLISH MORPHOLOGY (1.0)
  TOKENIZE:
    english.tok.parse.fst
  ANALYZE USEFIRST:
    english.infl.fst      (try regular morphology first)
    english.guesser.fst   (if that fails, guess)
  MULTIWORD:
    english.standard.mwe.fst
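The USEFIRST behaviour amounts to a priority cascade: consult each morphology in order and stop as soon as one returns analyses, so the guesser only fires when the regular morphology fails. Here is a minimal sketch of that control flow with toy analyzers (the function names and example data are assumptions, not part of XLE):

```python
# Sketch of ANALYZE USEFIRST: consult each morphology in order and stop
# as soon as one returns analyses, so the guesser only fires on failure.

def regular_morphology(form):
    known = {"walks": [("walk", ["+Verb", "+Pres", "+3sg"])]}
    return known.get(form, [])

def guesser(form):
    # e.g. capitalized unknown words are guessed to be proper nouns,
    # marked +Guessed for further control in the grammar
    if form[0].isupper():
        return [(form, ["+Prop", "+Guessed"])]
    return []

ANALYZERS = [regular_morphology, guesser]   # order = USEFIRST priority

def analyze_usefirst(form):
    for analyzer in ANALYZERS:
        analyses = analyzer(form)
        if analyses:
            return analyses
    return []

print(analyze_usefirst("walks"))        # regular morphology wins
print(analyze_usefirst("Saakashvili"))  # falls through to the guesser
```

USEALL (next slide) would instead concatenate the results of all analyzers rather than stopping at the first hit.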
Multiple morphology FSTs
In addition to the regular morphology and the guesser, other morphologies can be used:
– e.g., morphologies for technical terms, part numbers, etc.
These can be applied in sequence or in parallel (cascaded or unioned):
  ANALYZE USEALL:
    english.infl.fst          (regular morphology)
    english.eureka.parts.fst  (and also part names)
Morphology vs. surface form
The system always allows the surface form through.
The lexicon can match this form for:
– multiword expressions
– overriding/supplementing the morphological analysis
Example: or as an adverb (Or you could leave now.)
  or ADV  *   @(ADVERB or);
     CONJ XLE @(CONJ or).
Tokenizers
Tokenizers break strings (sentences) into tokens (words).
For English, they need to:
– break off punctuation
  Mary laughs. ==> Mary TB laughs TB . TB
– lowercase certain letters
  The dog ==> the TB dog
Tokenization and morphology
Linguistic analysis may govern tokenization. Are English contracted auxiliaries affixes or clitics?
– affixes: John'll ==> no tokenization
  John +Noun +Proper +Fut
– clitics: John'll ==> John TB 'll TB
  John +Noun +Proper  will +Fut
Arabic determiners and conjunctions are both written adjacent to their host words, but:
– the determiner is treated as an affix giving +Def (Albint "the-girl")
– the conjunction is tokenized separately (wakutub "and-books")
Non-deterministic tokenizers: Punctuation
You cannot just break off punctuation and insert a TB.
Comma haplology:
  Find the dog, a poodle. ==>
  find TB the TB dog TB , TB a TB poodle TB , TB . TB
Period haplology:
  Go to Palm Dr. ==>
  go TB to TB Palm TB Dr. TB . TB
The resulting tokenizer is non-deterministic; the system must be able to handle multiple inputs.
Capitalization
Initial capitals are optionally lowercased:
  The boy left. ==> the boy left.
  Mary left.   ==> Mary left.
Example of both types of non-determinism:
  Bush saw them. ==>
  { Bush | bush } TB saw TB them TB [, TB]* . TB
Tokenization rules vary from language to language and with the choice of linguistic analysis.
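The capitalization half of that non-determinism can be sketched directly: a tokenizer that optionally lowercases an initial capital returns several token sequences for one string, and the grammar must consider them all. This toy version handles only that one ambiguity (no haplology) and its names are illustrative, not the actual FST.

```python
# Sketch of a non-deterministic tokenizer: an initial capital is
# optionally lowercased, so one string yields several token sequences.

def tokenizations(sentence):
    """Return every tokenization of the sentence (capitalization only)."""
    words = sentence.rstrip(".").split() + ["."]
    results = [words]
    first = words[0]
    if first[0].isupper():
        # add the variant with the initial capital lowercased
        results.append([first[0].lower() + first[1:]] + words[1:])
    return results

for toks in tokenizations("Bush saw them."):
    print(" TB ".join(toks) + " TB")
# Bush TB saw TB them TB . TB
# bush TB saw TB them TB . TB
```

A real FST tokenizer composes this with the haplology alternatives, multiplying out the candidate token strings that the morphology and grammar then filter.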
Conclusions
The system architecture integrates FST techniques with deep LFG parsing:
– tokenizers
– morphologies and guessers
It allows generalizations to be factored out:
– properties of words
– properties of strings
It allows use of existing large-scale lexical resources:
– avoids redundant specification
The system is actively in use in the ParGram grammars.
Shallow Markup
Preprocessing with shallow markup can reduce ambiguity and speed up processing.
The tokenizer must be able to process the markup.
Part-of-speech tagging:
– I/PRP_ saw/VBD_ her/PRP_ duck/VB_.
Named entities:
– <person>General Mills</person> bought it.
POS tagging
POS tags are not relevant for tokenizing, but the tokenizer must skip them:
– She walks/VBZ_. should be treated like She walks.
The morphology must only insert compatible tags:
– A mapping table states the allowable combinations:
    /VBZ_  +Verb +3sg
    /NN_   +Noun +Sg
– These are encoded into a filtering FST
– Only compatible tags are passed to the grammar
POS tagging example
I saw her duck
  duck +Noun +Sg
  duck +Verb +Pres +Non3sg
– both possibilities are passed to the grammar
I saw her duck/VB_.
– only the +Verb +Pres +Non3sg possibility is compatible with the /VB_ POS tag
– only this possibility is passed to the grammar
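The filtering described above can be sketched as a compatibility check: the mapping table lists the morphological tags each POS tag requires, and an analysis survives only if it contains all of them. This is a toy stand-in for the filtering FST; the table contents and function names are illustrative assumptions.

```python
# Sketch of the POS-tag filter: a mapping table says which morphological
# tags each markup tag requires; incompatible analyses are discarded.

COMPATIBLE = {
    "/VB_":  {"+Verb"},
    "/VBZ_": {"+Verb", "+3sg"},
    "/NN_":  {"+Noun", "+Sg"},
}

ANALYSES = {
    "duck": [("duck", ["+Noun", "+Sg"]),
             ("duck", ["+Verb", "+Pres", "+Non3sg"])],
}

def filter_by_pos(form, pos_tag=None):
    """Keep only analyses whose tags include all tags the POS tag requires."""
    analyses = ANALYSES.get(form, [])
    if pos_tag is None:
        return analyses                  # no markup: pass everything through
    required = COMPATIBLE[pos_tag]
    return [(lemma, tags) for lemma, tags in analyses
            if required <= set(tags)]

print(filter_by_pos("duck"))           # both analyses pass
print(filter_by_pos("duck", "/VB_"))   # only the +Verb analysis passes
```

In the real system this check is compiled into a filtering FST and composed with the morphology output rather than applied as a post-hoc list filter.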
Named Entities
Named entities appear in text as XML markup:
  <person>General Mills</person> bought it.
The tokenizer:
– creates a special tag for these
– puts literal spaces instead of TBs
– allows a version without markup for fallback
  General Mills TB +NamedEntity TB
  General TB +Title TB Mills +Proper TB
A lexical entry is added for +NamedEntity; sublexical N and NAME rules allow the tag to be used in parsing.