Morphology
11-711 Algorithms for NLP, 21 November 2017 – Part I
(Some slides from Lori Levin, David Mortensen)
Types of Lexical and Morphological Processing
• Tokenization
  • Input: raw text
  • Output: sequence of tokens normalized for further processing
• Recognition
  • Input: a string of characters
  • Output: is it a legal word? (yes or no)
• Morphological Parsing
  • Input: a word
  • Output: an analysis of the structure of the word
• Morphological Generation
  • Input: an analysis of the structure of the word
  • Output: a word
But first: What is a word?
• The things that are in the dictionary?
  • But how did the lexicographers decide what to put in the dictionary?
• The things between spaces and punctuation?
• The smallest unit that can be uttered in isolation?
  • You could say this word in isolation: unimpressively
  • This one too: impress
  • But you probably wouldn’t say these in isolation, unless you were talking about morphology:
    • un
    • ive
    • ly
So what is a word?
• Can get pretty tricky:
  • didn’t
  • would’ve
  • gonna
  • shoulda woulda coulda
  • Ima
  • blackboard (vs. school board)
  • baseball (vs. golf ball)
  • the person who left’s hat; Jim and Gregg’s apartment
  • acct.
  • LTI
About 1000 pages. $139.99
You don’t have to read it.
The point is that it takes 1000 pages just to survey the issues related to what words are.
So what is a word?
• It is up to you or the software you use for processing words.
• Take linguistics classes.
• Make good decisions in software design and engineering.
Tokenization
Input: raw text
Output: sequence of tokens normalized for easier processing.
• Some Asian languages have obvious issues, since text is written without spaces between words. [Example text not recoverable.]
• But German too: noun-noun compounds such as Gesundheitsversicherungsgesellschaften = Gesundheits-versicherungs-gesellschaften (health insurance companies)
• Spanish clitics: dármelo = dar-me-lo (‘to give me it’)
• Even English has issues, to a smaller degree: Gregg and Bob’s house
Tokenization
Input: raw text
Dr. Smith said tokenization of English is “harder than you’ve thought.” When in New York, he paid $12.00 a day for lunch and wondered what it would be like to work for AT&T or Google, Inc.
Output from Stanford Parser (http://nlp.stanford.edu:8080/parser/index.jsp) with part-of-speech tags:
Dr./NNP Smith/NNP said/VBD tokenization/NN of/IN English/NNP is/VBZ ``/`` harder/JJR than/IN you/PRP 've/VBP thought/VBN ./. ''/'' When/WRB in/IN New/NNP York/NNP ,/, he/PRP paid/VBD $/$ 12.00/CD a/DT day/NN for/IN lunch/NN and/CC wondered/VBD what/WP it/PRP would/MD be/VB like/JJ to/TO work/VB for/IN AT&T/NNP or/CC Google/NNP ,/, Inc./NNP ./.
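The kind of decisions visible in the output above can be sketched with a small rule-based tokenizer. This is not the Stanford tokenizer: the abbreviation list and the regular expression are illustrative assumptions, and they cover only a few of the phenomena in the example (abbreviations, currency amounts, clitics such as 've, and names like AT&T).

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption).
ABBREVIATIONS = {"Dr.", "Inc."}

TOKEN_RE = re.compile(r"""
    \d+(?:\.\d+)?     # numbers such as 12.00
  | \w+(?:&\w+)+      # names joined by an ampersand, such as AT&T
  | '\w+              # clitics such as 've
  | \w+               # ordinary words
  | \S                # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    """Split text on whitespace, then split each chunk with TOKEN_RE."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)      # keep the period attached
        else:
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens

print(tokenize("Dr. Smith paid $12.00 for lunch, and you've heard of AT&T."))
```

A sentence-final abbreviation (as in "Google, Inc." above) would need an extra rule to emit both the abbreviation and the sentence period; the sketch does not attempt that.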
Morphological Phenomena
What is Linguistic Morphology?
• Morphology is the study of the internal structure of words.
• Derivational morphology. How new words are created from existing words.
  • [grace]
  • [[grace]ful]
  • [un[[grace]ful]]
• Inflectional morphology. How features relevant to the syntactic context of a word are marked on that word.
  • This example illustrates number (singular and plural) and tense (present and past).
  • In the original slides, green indicates irregular forms, blue indicates zero marking of inflection, and red indicates regular inflection.
  • This student walks.
  • These students walk.
  • These students walked.
• Compounding. Creating new words by combining existing words.
  • With or without spaces: surfboard, golf ball, blackboard
Morphemes
• Morphemes. Minimal pairings of form and meaning.
• Roots. The “core” of a word that carries its basic meaning.
  • apple: ‘apple’
  • walk: ‘walk’
• Affixes (prefixes, suffixes, infixes, and circumfixes). Morphemes that are added to a base (a root or stem) to perform either derivational or inflectional functions.
  • un-: ‘NEG’
  • -s: ‘PLURAL’
Language Typology
Types of Languages:
• In order of morphological complexity:
  • Isolating (or Analytic)
  • Fusional (or Inflecting)
  • Agglutinative
  • Polysynthetic
  • Others
Isolating Languages: Chinese
Little morphology other than compounding
• Chinese inflection
  • few affixes (prefixes and suffixes):
    • mén (plural): wǒmén ‘we’, nǐmén ‘you (pl.)’, tāmén ‘they’, tóngzhìmén ‘comrades; LGBT people’
  • “suffixes” that mark aspect: -zhě ‘continuous aspect’
• Chinese derivation
  • yìshùjiā ‘artist’
• Chinese is a champion in the realm of compounding: up to 80% of Chinese words are actually compounds.
  • dú ‘poison, drug’ + fàn ‘vendor’ → dúfàn ‘drug trafficker’
Agglutinative Languages: Swahili
Verbs in Swahili have an average of 4-5 morphemes (http://wals.info/valuesets/22A-swa).

Swahili             English
m-tu a-li-lala      ‘The person slept’
m-tu a-ta-lala      ‘The person will sleep’
wa-tu wa-li-lala    ‘The people slept’
wa-tu wa-ta-lala    ‘The people will sleep’

• Words are written without hyphens or spaces between morphemes.
• The orange prefixes in the original slides (here m-, wa-, a-) mark noun class (like gender, except Swahili has nine instead of two or three).
  • Verbs agree with nouns in noun class.
  • Adjectives also agree with nouns.
  • Very helpful in parsing.
• The black prefixes (here li-, ta-) indicate tense.
Turkish
Example of extreme agglutination
But most Turkish words have around three morphemes

uygarlaştıramadıklarımızdanmışsınızcasına
‘(behaving) as if you are among those whom we were not able to civilize’

uygar ‘civilized’
+ laş ‘become’
+ tır ‘cause to’
+ ama ‘not able’
+ dık past participle
+ lar plural
+ ımız first person plural possessive (‘our’)
+ dan ablative case (‘from/among’)
+ mış past
+ sınız second person plural (‘y’all’)
+ casına finite verb → adverb (‘as if’)
Operationalization
• operate (opus/opera + ate)
• ion
• al
• ize
• ate
• ion
Fusional Languages: Spanish

             Singular                                        Plural
             1st             2nd             3rd/formal 2nd  1st             2nd             3rd
Present      am-o            am-as           am-a            am-a-mos        am-áis          am-an
Imperfect    am-ab-a         am-ab-as        am-ab-a         am-áb-a-mos     am-ab-ais       am-ab-an
Preterit     am-é            am-aste         am-ó            am-a-mos        am-asteis       am-aron
Future       am-aré          am-arás         am-ará          am-are-mos      am-aréis        am-arán
Conditional  am-aría         am-arías        am-aría         am-aría-mos     am-aríais       am-arían
Polysynthetic Languages: Yupik
• Polysynthetic morphologies allow the creation of full “sentences” by morphological means.
• They often allow the incorporation of nouns into verbs.
• They may also have affixes that attach to verbs and take the place of nouns.
• Yupik Eskimo:
  tuntu-ssur-qatar-ni-ksaite-ngqiggte-uq
  reindeer-hunt-FUT-say-NEG-again-3SG.INDIC
  ‘He had not yet said again that he was going to hunt reindeer.’
Root-and-Pattern Morphology: Arabic
• Root-and-pattern. A special kind of fusional morphology found in Arabic, Hebrew, and their cousins.
• Root usually consists of a sequence of consonants.
• Words are derived and, to some extent, inflected by patterns of vowels intercalated among the root consonants.
  • kitaab ‘book’
  • kaatib ‘writer; writing’
  • maktab ‘office; desk’
  • maktaba ‘library’
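The intercalation of root consonants into a vowel pattern can be sketched in a few lines. The template notation below (an uppercase "C" marks a slot for the next root consonant) and the function name are assumptions made for illustration; the four words are the ones listed above, all built on the root k-t-b.

```python
def interdigitate(root, pattern):
    """Fill each 'C' slot in pattern with the next consonant of root."""
    consonants = list(root)
    out = []
    for ch in pattern:
        # 'C' is a consonant slot; every other character is copied as is.
        out.append(consonants.pop(0) if ch == "C" else ch)
    return "".join(out)

# Hypothetical templates chosen to reproduce the forms on the slide.
for template in ["CiCaaC", "CaaCiC", "maCCaC", "maCCaCa"]:
    print(template, "->", interdigitate("ktb", template))
```

The point of the sketch is that the root and the pattern are two separate morphemes that are interleaved rather than concatenated, which is what makes this type hard for purely concatenative models.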
Other Non-Concatenative Morphological Processes
Non-concatenative morphology involves operations other than the concatenation of affixes with bases.
• Infixation. A morpheme is inserted inside another morpheme instead of before or after it.
• Reduplication. Can be prefixing, suffixing, and even infixing.
• Tagalog:
  • sulat (write, imperative)
  • susulat (reduplication) (write, future)
  • sumulat (infixing) (write, past)
  • sumusulat (infixing and reduplication) (write, present)
• Apophony, including the umlaut in English tooth → teeth; subtractive morphology, including the truncation in English nickname formation (David → Dave); and so on.
• Tone change; stress shift. And more...
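The Tagalog forms above can be reproduced with two small string operations. This is a simplified sketch, not a full account of Tagalog morphology: it assumes the reduplicant is everything up to and including the first vowel, and that the infix -um- goes immediately before the first vowel. The function names are hypothetical.

```python
VOWELS = "aeiou"

def first_vowel_index(word):
    """Index of the first vowel in word (assumes there is one)."""
    return next(i for i, ch in enumerate(word) if ch in VOWELS)

def reduplicate(word):
    """Prefix a copy of the initial consonant(s) plus the first vowel."""
    i = first_vowel_index(word)
    return word[: i + 1] + word

def infix_um(word):
    """Insert -um- immediately before the first vowel."""
    i = first_vowel_index(word)
    return word[:i] + "um" + word[i:]

root = "sulat"
print(reduplicate(root))            # reduplication only
print(infix_um(root))               # infixation only
print(infix_um(reduplicate(root)))  # reduplication, then infixation
```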
Type-Token Curves
[Figure: type-token curves for English, Arabic, Hocąk, Iñupiaq, and Finnish. x-axis: tokens (0 to 10000); y-axis: types (0 to 6000).]
Finnish is agglutinative.
Iñupiaq is polysynthetic.
Types and Tokens: “I like to walk. I am walking now. I took a long walk earlier too.”
The type walk occurs twice. So there are two tokens of the type walk.
Walking is a different type that occurs once.
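Counting types and tokens for the example sentence takes only a few lines. The sketch assumes a very crude tokenization (lowercase everything and keep runs of letters), which is enough for this sentence but not for real text.

```python
import re
from collections import Counter

text = "I like to walk. I am walking now. I took a long walk earlier too."

# Crude tokenization (an assumption): lowercase, keep runs of letters only.
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(tokens)

print("tokens:", len(tokens))       # total number of word occurrences
print("types:", len(counts))        # number of distinct word forms
print("walk:", counts["walk"])      # tokens of the type "walk"
print("walking:", counts["walking"])
```

A type-token curve is obtained by recording the number of types seen so far after each token; morphologically rich languages accumulate new types faster.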
Morphological Processing
Recognizing the words of a language
• Input: a string (from some alphabet)
• Output: is it a legal word? (yes or no)
FSA for English Nouns
Lexicon: (shown as a table on the original slide)
Note: “fox” becomes plural by adding “es”, not “s”. We will get to that later.
Finite-State Automaton
• Q: a finite set of states
• q0 ∈ Q: a special start state
• F ⊆ Q: a set of final states
• Σ: a finite alphabet
• Transitions: an arc from state qi to state qj, labeled with a string s ∈ Σ*
• Encodes a set of strings that can be recognized by following paths from q0 to some state in F.
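A recognizer built directly on this definition can be sketched as follows. The lexicon figure from the slides did not survive, so the word lists, state names, and arcs below are assumptions chosen for illustration; each arc is labeled with a set of strings (whole morphemes), which the definition permits since labels are drawn from Σ*.

```python
# Hypothetical toy lexicon (not the one on the original slide).
REG_NOUNS = {"cat", "dog", "aardvark"}
IRREG_SG = {"goose", "mouse"}
IRREG_PL = {"geese", "mice"}

# state -> list of (set of arc labels, next state)
TRANSITIONS = {
    "q0": [(REG_NOUNS, "q1"), (IRREG_SG, "q2"), (IRREG_PL, "q2")],
    "q1": [({"s"}, "q2")],          # regular plural suffix
}
START = "q0"
FINAL = {"q1", "q2"}

def accepts(word, state=START):
    """True if some path from state consumes all of word and ends in FINAL."""
    if word == "":
        return state in FINAL
    for labels, nxt in TRANSITIONS.get(state, []):
        for label in labels:
            if word.startswith(label) and accepts(word[len(label):], nxt):
                return True
    return False

for w in ["cat", "cats", "geese", "gooses", "catss"]:
    print(w, accepts(w))
```

Words like "fox" are left out on purpose: their plural needs the spelling rule discussed later, which a plain concatenative automaton like this one does not capture.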
FSA for English Adjectives
But note that this accepts words like “unbig”.
• Big, bigger, biggest
• Happy, happier, happiest, happily
• Unhappy, unhappier, unhappiest, unhappily
• Clear, clearer, clearest, clearly
• Unclear, unclearly
• Cool, cooler, coolest, coolly
• Red, redder, reddest
• Real, unreal, really
FSA for English Derivational Morphology
How big do these automata get? Reasonable coverage of a language takes an expert about two to four months.
What does it take to be an expert? Study linguistics to get used to all the common and not-so-common things that happen, and then practice.
Morphological Parsing
Input: a word
Output: the word’s stem(s) and features expressed by other morphemes.
Example:
• geese → goose +N +Pl
• gooses → goose +V +3P +Sg
• dog → {dog +N +Sg, dog +V}
• leaves → {leaf +N +Pl, leave +V +3P +Sg}
Upper Side / Lower Side
• Upper side or underlying form: talk+Past
• The FST maps between the two sides.
• Lower side or surface form: talked
Finite State Transducers
• Q: a finite set of states
• q0 ∈ Q: a special start state
• F ⊆ Q: a set of final states
• Σ and Δ: two finite alphabets
• Transitions: an arc from state qi to state qj, labeled s:t, with s ∈ Σ* and t ∈ Δ*
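A tiny transducer for the talk+Past / talked example shows how the s:t labels work and why one transition table serves both directions. The arcs, the extra stem "walk", and the "+Prog" tag are assumptions added for illustration; only talk+Past ↔ talked comes from the slides.

```python
# Each arc: (source state, upper string s, lower string t, target state).
ARCS = [
    (0, "talk", "talk", 1),
    (0, "walk", "walk", 1),
    (1, "+Past", "ed", 2),
    (1, "+Prog", "ing", 2),
    (1, "", "", 2),                 # bare stem: empty string on both sides
]
START, FINALS = 0, {2}

def transduce(text, invert=False, state=START):
    """Return every output string for text.

    By default reads the upper side and writes the lower side (generation);
    with invert=True reads the lower side and writes the upper side (parsing).
    """
    results = []
    if text == "" and state in FINALS:
        results.append("")
    for src, upper, lower, dst in ARCS:
        if src != state:
            continue
        inp, out = (lower, upper) if invert else (upper, lower)
        if text.startswith(inp):
            for rest in transduce(text[len(inp):], invert, dst):
                results.append(out + rest)
    return results

print(transduce("talk+Past"))             # generation
print(transduce("talked", invert=True))   # parsing
```

The result is a list because a transducer may relate one input to several outputs, as in the ambiguous parses of "leaves".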
Morphological Parsing with FSTs
Note the “same symbol” shorthand.
^ denotes a morpheme boundary.
# denotes a word boundary.
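English Spelling
Getting back to fox + s = foxes
The E Insertion Rule as an FST
ε → e / {s, x, z} ^ __ s #
(Insert e after a morpheme-final s, x, or z when the next morpheme is the suffix s at the end of the word.)
Generate a normally spelled word from an abstract representation of the morphemes:
Input: fox^s# (fox^εs#)
Output: foxes# (foxεes#)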
The E Insertion Rule as an FST (parsing direction)
Parse a normally spelled word into an abstract representation of the morphemes:
Input: foxes# (foxεes#)
Output: fox^s# (fox^εs#)
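The two directions of the E insertion rule can be approximated with regular expressions. This is a sketch, not a compiled transducer: the function names are hypothetical, and the parsing direction simply proposes candidate boundary placements and keeps those that regenerate the surface form.

```python
import re

def generate(intermediate):
    """fox^s# -> foxes#: apply e-insertion, then erase remaining boundaries."""
    # Insert e where a morpheme-final s, x, or z is followed by the suffix s#.
    surface = re.sub(r"([sxz])\^(?=s#)", r"\1e", intermediate)
    return surface.replace("^", "")

def parse(surface):
    """foxes# -> set of intermediate forms that generate this surface form."""
    candidates = {surface}                       # no morpheme boundary at all
    m = re.fullmatch(r"(.*[sxz])es#", surface)   # undo e-insertion
    if m:
        candidates.add(m.group(1) + "^s#")
    m = re.fullmatch(r"(.+)s#", surface)         # plain -s suffix
    if m:
        candidates.add(m.group(1) + "^s#")
    return {c for c in candidates if generate(c) == surface}

print(generate("fox^s#"))
print(sorted(parse("foxes#")))
```

Parsing "foxes#" yields several candidates, including fox^s# and the spurious foxe^s#; the spelling rule alone cannot choose among them, which is why it is combined with a lexicon.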
Combining FSTs
[Figure: cascaded FSTs; one direction through the cascade is labeled “parse”, the other “generate”.]
FST Operations
Input: fox +N +pl
Output: foxes#
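The fox +N +pl → foxes# mapping can be sketched as a cascade of two steps: a lexicon step that turns lexical tags into morphemes with boundaries, followed by the spelling rule. Plain function composition stands in for FST composition here; the stem list, the tag format without spaces, and all function names are assumptions.

```python
import re

# Hypothetical toy lexicon of regular noun stems.
STEMS = {"fox", "cat", "dog"}

def lexical_to_intermediate(lexical):
    """fox+N+pl -> fox^s#  (lexicon step)."""
    m = re.fullmatch(r"(\w+)\+N\+(sg|pl)", lexical)
    if not m or m.group(1) not in STEMS:
        raise ValueError(f"not in toy lexicon: {lexical}")
    stem, number = m.groups()
    return stem + ("^s#" if number == "pl" else "#")

def intermediate_to_surface(intermediate):
    """fox^s# -> foxes#  (e-insertion, then erase remaining boundaries)."""
    surface = re.sub(r"([sxz])\^(?=s#)", r"\1e", intermediate)
    return surface.replace("^", "")

def generate(lexical):
    """Compose the two steps, as a cascade of transducers would."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))

for form in ["fox+N+pl", "cat+N+pl", "dog+N+sg"]:
    print(form, "->", generate(form))
```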
Language Type Comparison wrt FSTs
• Morphologies of all types can be analyzed using finite state methods.
• Some present more challenges than others:
  • Analytic languages. Trivial, since there is little or no morphology (other than compounding).
  • Agglutinating languages. Straightforward: finite state morphology was “made” for languages like this.
  • Polysynthetic languages. Similar to agglutinating languages, but with blurred lines between morphology and syntax.
  • Fusional languages. Easy enough to analyze using finite state methods as long as one allows “morphemes” to have lots of simultaneous meanings and one is willing to employ some additional tricks.
  • Root-and-pattern languages. Require some very clever tricks.
Stemming (“Poor Man’s Morphology”)
Input: a word
Output: the word’s stem (approximately)
Examples from the Porter stemmer:
• -sses → -ss
• -ies → -i
• -ss → -ss
Word        Stem
no          no
noah        noah
nob         nob
nobility    nobil
nobis       nobi
noble       nobl
nobleman    nobleman
noblemen    noblemen
nobleness   nobl
nobler      nobler
nobles      nobl
noblesse    nobless
noblest     noblest
nobly       nobli
nobody      nobodi
noces       noce
nod         nod
nodded      nod
nodding     nod
noddle      noddl
noddles     noddl
noddy       noddi
nods        nod
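The suffix rules listed above belong to the first step of the Porter stemmer (step 1a), which also strips a plain final -s. The sketch below implements only that step, so its output differs from the full stemmer's output in the table for words that later steps would shorten further.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter stemmer (plural-like suffixes)."""
    if word.endswith("sses"):
        return word[:-2]        # -sses -> -ss
    if word.endswith("ies"):
        return word[:-2]        # -ies  -> -i
    if word.endswith("ss"):
        return word             # -ss   -> -ss (unchanged)
    if word.endswith("s"):
        return word[:-1]        # -s    -> (removed)
    return word

for w in ["caresses", "ponies", "caress", "nods", "nobles"]:
    print(w, "->", porter_step_1a(w))
```

Rule order matters: the longest matching suffix is tried first, so "caress" is protected by the -ss rule before the plain -s rule can fire.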
The Good News
• More than almost any other problem in computational linguistics, morphology is a solved problem (as long as you can afford to write rules by hand).
• Finite state methods provide a simple and powerful means of generating and analyzing words (as well as the phonological alternations that accompany word formation/inflection).
• Finite state morphology is one of the great successes of natural language processing.
• One brilliant aspect of using FSTs for morphology: the same code can handle both analysis and generation.