CS388: Natural Language Processing
Lecture 4: Sequence Models I
Greg Durrett
Parts of this lecture adapted from Dan Klein, UC Berkeley and Vivek Srikumar, University of Utah
Administrivia
‣ Project 1 out today, due September 27
‣ This class will cover what you need to get started on it; the next class will cover everything you need to complete it
‣ Viterbi algorithm, CRF NER system, extension
‣ Extension should be substantial: don't just try one additional feature (see syllabus/spec for discussion, samples on website)
‣ Mini 1 due today
Recall: Multiclass Classification
‣ Logistic regression: $P(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$
Gradient (unregularized): $\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \mathbb{E}_y[f_i(x_j, y)]$
‣ SVM: defined by a quadratic program (minimization, so gradients are flipped). Loss-augmented decode: $\xi_j = \max_{y \in \mathcal{Y}} w^\top f(x_j, y) + \ell(y, y_j^*) - w^\top f(x_j, y_j^*)$
Subgradient (unregularized) on the jth example: $f_i(x_j, y_{\max}) - f_i(x_j, y_j^*)$
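To make the gradient concrete: a minimal sketch (not from the lecture) of the unregularized multiclass logistic regression gradient for one example. The names `feat` (a feature function returning a numpy vector) and `label_space` are illustrative assumptions.

```python
import numpy as np

def logreg_gradient(w, feat, x, y_star, label_space):
    """Gradient of log P(y*|x): observed features minus expected features under the model."""
    scores = np.array([w @ feat(x, y) for y in label_space])
    scores -= scores.max()                        # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    expected = sum(p * feat(x, y) for p, y in zip(probs, label_space))
    return feat(x, y_star) - expected

# Gradient ascent step on example j:
# w = w + alpha * logreg_gradient(w, feat, x_j, y_star_j, label_space)
```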
Recall: Optimization
‣ Stochastic gradient *ascent*: $w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w} \mathcal{L}$
‣ Adagrad: $w_i \leftarrow w_i + \alpha \frac{1}{\sqrt{\epsilon + \sum_{\tau=1}^{t} g_{\tau,i}^2}} g_{t,i}$
‣ SGD/AdaGrad have a batch size parameter
‣ Large batches (>50 examples): can parallelize within batch
‣ …but bigger batches often mean more epochs required because you make fewer parameter updates
‣ Shuffling: online methods are sensitive to dataset order, shuffling helps!
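As a worked illustration (assumed, not the course code), the AdaGrad ascent update above can be written in a few lines of numpy; `grad_fn` is a hypothetical function returning the gradient on one example.

```python
import numpy as np

def adagrad_ascent(w, grad_fn, examples, alpha=0.1, eps=1e-8):
    """Stochastic gradient *ascent* with per-coordinate AdaGrad step sizes."""
    sum_sq = np.zeros_like(w)              # running sum of squared gradients per coordinate
    for x, y in examples:                  # shuffle examples before calling this
        g = grad_fn(w, x, y)
        sum_sq += g ** 2
        w = w + alpha * g / np.sqrt(eps + sum_sq)
    return w
```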
This Lecture
‣ Sequence modeling
‣ HMMs for POS tagging
‣ Viterbi, forward-backward
‣ HMM parameter estimation
Linguistic Structures
‣ Language is tree-structured
I ate the spaghetti with chopsticks    I ate the spaghetti with meatballs
‣ Understanding syntax fundamentally requires trees: the sentences have the same shallow analysis
I ate the spaghetti with chopsticks    I ate the spaghetti with meatballs
PRP VBZ DT NN IN NNS                   PRP VBZ DT NN IN NNS
Linguistic Structures
‣ Language is sequentially structured: interpreted in an online way
Tanenhaus et al. (1995)
POS Tagging
Ghana's ambassador should have set up the big meeting in DC yesterday.
‣ What tags are out there?
NNP POS NN MD VB VBN RP DT JJ NN IN NNP NN .
POS Tagging
Slide credit: Dan Klein
POS Tagging
Fed raises interest rates 0.5 percent
Possible tags per word: Fed {VBD, VBN, NNP}, raises {VBZ, NNS}, interest {VB, VBP, NN}, rates {VBZ, NNS}, 0.5 {CD}, percent {NN}
I'm 0.5% interested in the Fed's raises!
I hereby increase interest rates 0.5%
‣ Other paths are also plausible but even more semantically weird…
‣ What governs the correct choice? Word + context
‣ Word identity: most words have <= 2 tags, many have one (percent, the)
‣ Context: nouns start sentences, nouns follow verbs, etc.
What is this good for?
‣ Text-to-speech: record, lead
‣ Preprocessing step for syntactic parsers
‣ Domain-independent disambiguation for other tasks
‣ (Very) shallow information extraction
Sequence Models
‣ Input $x = (x_1, \ldots, x_n)$    Output $y = (y_1, \ldots, y_n)$
‣ POS tagging: x is a sequence of words, y is a sequence of tags
‣ Today: generative models P(x, y); discriminative models next time
Hidden Markov Models
‣ Input $x = (x_1, \ldots, x_n)$    Output $y = (y_1, \ldots, y_n)$
‣ Model the sequence of y as a Markov process (dynamics model)
‣ Markov property: the future is conditionally independent of the past given the present: $P(y_3 | y_1, y_2) = P(y_3 | y_2)$
‣ If y are tags, this roughly corresponds to assuming that the next tag only depends on the current tag, not anything before
‣ Lots of mathematical theory about how Markov chains behave
Hidden Markov Models
‣ Input $x = (x_1, \ldots, x_n)$    Output $y = (y_1, \ldots, y_n)$
[HMM diagram: chain $y_1 \rightarrow y_2 \rightarrow \dots \rightarrow y_n$, with each $y_i$ emitting $x_i$]
$P(\mathbf{y}, \mathbf{x}) = \underbrace{P(y_1)}_{\text{initial distribution}} \underbrace{\prod_{i=2}^{n} P(y_i | y_{i-1})}_{\text{transition probabilities}} \underbrace{\prod_{i=1}^{n} P(x_i | y_i)}_{\text{emission probabilities}}$
‣ P(x|y) is a distribution over all words in the vocabulary, not a distribution over features (but could be!)
‣ Multinomials: tag × tag transitions, tag × word emissions
‣ Observation (x) depends only on current state (y)
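A minimal sketch (illustrative, not the lecture's code) of evaluating this factorization, assuming dictionaries `init`, `trans`, and `emit` holding the initial, transition, and emission multinomials.

```python
import math

def log_joint(words, tags, init, trans, emit):
    """log P(y, x) = log P(y1) + sum_i log P(y_i | y_{i-1}) + sum_i log P(x_i | y_i)."""
    lp = math.log(init[tags[0]])
    for prev, cur in zip(tags, tags[1:]):      # transition probabilities
        lp += math.log(trans[(prev, cur)])
    for word, tag in zip(words, tags):         # emission probabilities
        lp += math.log(emit[(tag, word)])
    return lp
```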
Transitions in POS Tagging
‣ Dynamics model: $P(y_1) \prod_{i=2}^{n} P(y_i | y_{i-1})$
Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN
‣ $P(y_1 = \text{NNP})$: likely because start of sentence
‣ $P(y_2 = \text{VBZ} \mid y_1 = \text{NNP})$: likely because a verb often follows a noun
‣ $P(y_3 = \text{NN} \mid y_2 = \text{VBZ})$: direct object follows verb; another verb rarely follows a past-tense verb (main verbs can follow modals though!)
Estimating Transitions
‣ Similar to Naive Bayes estimation: maximum likelihood solution = normalized counts (with smoothing) read off supervised data
Fed raises interest rates 0.5 percent .
NNP VBZ NN NNS CD NN .
‣ How to smooth?
‣ One method: smooth with the unigram distribution over tags: $P(\text{tag} \mid \text{tag}_{-1}) = (1 - \lambda) \hat{P}(\text{tag} \mid \text{tag}_{-1}) + \lambda \hat{P}(\text{tag})$, where $\hat{P}$ is the empirical distribution (read off from data)
‣ e.g., $\hat{P}(\text{tag} \mid \text{NN}) = (0.5\ \text{.}, 0.5\ \text{NNS})$ from this sentence: NN is followed once by NNS and once by .
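A sketch of the count-and-normalize estimate with the unigram interpolation above; `tagged_sents` (a list of (word, tag) sequences) and the smoothing weight `lam` are assumed names.

```python
from collections import Counter

def estimate_transitions(tagged_sents, lam=0.1):
    """P(tag | prev) = (1 - lam) * Phat(tag | prev) + lam * Phat(tag), from normalized counts."""
    unigram, bigram, prev_totals = Counter(), Counter(), Counter()
    for sent in tagged_sents:
        tags = [tag for _, tag in sent]
        unigram.update(tags)
        for prev, cur in zip(tags, tags[1:]):
            bigram[(prev, cur)] += 1
            prev_totals[prev] += 1
    total = sum(unigram.values())

    def trans_prob(prev, tag):
        mle = bigram[(prev, tag)] / prev_totals[prev] if prev_totals[prev] else 0.0
        return (1 - lam) * mle + lam * unigram[tag] / total

    return trans_prob
```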
Emissions in POS Tagging
‣ Emissions P(x|y) capture the distribution of words occurring with a given tag
‣ P(word|NN) = (0.05 person, 0.04 official, 0.03 interest, 0.03 percent, …)
Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN
‣ When you compute the posterior for a given word's tags, the distribution favors tags that are more likely to generate that word
‣ How should we smooth this?
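For comparison, a sketch of estimating emissions from counts; the add-α smoothing over the vocabulary shown here is just one illustrative option, not necessarily the answer the question above is after.

```python
from collections import Counter

def estimate_emissions(tagged_sents, alpha=0.1):
    """Phat(word | tag) from counts, with (hypothetical) add-alpha smoothing over the vocabulary."""
    emit, tag_counts, vocab = Counter(), Counter(), set()
    for sent in tagged_sents:
        for word, tag in sent:
            emit[(tag, word)] += 1
            tag_counts[tag] += 1
            vocab.add(word)
    V = len(vocab)

    def emit_prob(tag, word):
        return (emit[(tag, word)] + alpha) / (tag_counts[tag] + alpha * V)

    return emit_prob
```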
Inference in HMMs
‣ Input $x = (x_1, \ldots, x_n)$    Output $y = (y_1, \ldots, y_n)$
$P(\mathbf{y}, \mathbf{x}) = P(y_1) \prod_{i=2}^{n} P(y_i | y_{i-1}) \prod_{i=1}^{n} P(x_i | y_i)$
‣ Inference problem: $\mathrm{argmax}_{\mathbf{y}} P(\mathbf{y} | \mathbf{x}) = \mathrm{argmax}_{\mathbf{y}} \frac{P(\mathbf{y}, \mathbf{x})}{P(\mathbf{x})}$
‣ Exponentially many possible y here!
‣ Solution: dynamic programming (possible because of the Markov structure!)
‣ Many neural sequence models depend on the entire previous tag sequence and need to use approximations like beam search
Viterbi Algorithm
Slide credit: Vivek Srikumar / Dan Klein
‣ Best (partial) score for a sequence ending in state s
‣ "Think about" all possible immediate prior state values. Everything before that has already been accounted for by earlier stages.
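Since the Viterbi slides are figures, here is a minimal log-space sketch of the dynamic program they walk through (an assumed implementation, using numpy tables): each chart cell holds the best partial score for a sequence ending in state s at time t, computed by maxing over all possible immediate prior states, and back-pointers recover the best path.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """log_init[s], log_trans[s_prev, s], log_emit[s, x]: log-probability tables.
    obs: list of observation indices. Returns the highest-scoring state sequence."""
    n, S = len(obs), len(log_init)
    score = np.full((n, S), -np.inf)       # best log score of any path ending in state s at time t
    back = np.zeros((n, S), dtype=int)     # argmax predecessor for each cell
    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, n):
        for s in range(S):
            cand = score[t - 1] + log_trans[:, s] + log_emit[s, obs[t]]
            back[t, s] = np.argmax(cand)
            score[t, s] = cand[back[t, s]]
    path = [int(np.argmax(score[n - 1]))]  # best final state
    for t in range(n - 1, 0, -1):          # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))
```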
Forward-Backward Algorithm
‣ In addition to finding the best path, we may want to compute marginal probabilities of paths: $P(y_i = s \mid \mathbf{x}) = \sum_{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n} P(\mathbf{y} \mid \mathbf{x})$
‣ What did Viterbi compute? $P(\mathbf{y}^{\max} \mid \mathbf{x}) = \max_{y_1, \ldots, y_n} P(\mathbf{y} \mid \mathbf{x})$
‣ Can compute marginals with dynamic programming as well, using an algorithm called forward-backward
Forward-Backward Algorithm
Slide credit: Dan Klein
$P(y_3 = 2 \mid \mathbf{x}) = \frac{\text{sum of all paths through state 2 at time 3}}{\text{sum of all paths}}$
‣ Easiest and most flexible to do one pass to compute $\alpha$ and one to compute $\beta$
Forward-Backward Algorithm
‣ Initial: $\alpha_1(s) = P(s) P(x_1 \mid s)$
‣ Recurrence: $\alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1}) P(s_t \mid s_{t-1}) P(x_t \mid s_t)$
‣ Same as Viterbi but summing instead of maxing!
‣ These quantities get very small! Store everything as log probabilities
Forward-Backward Algorithm
‣ Initial: $\beta_n(s) = 1$
‣ Recurrence: $\beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1}) P(s_{t+1} \mid s_t) P(x_{t+1} \mid s_{t+1})$
‣ Big differences: count the emission for the next timestep (not the current one)
Forward-Backward Algorithm
$\alpha_1(s) = P(s) P(x_1 \mid s)$    $\alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1}) P(s_t \mid s_{t-1}) P(x_t \mid s_t)$
$\beta_n(s) = 1$    $\beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1}) P(s_{t+1} \mid s_t) P(x_{t+1} \mid s_{t+1})$
$P(s_3 = 2 \mid \mathbf{x}) = \frac{\alpha_3(2) \beta_3(2)}{\sum_i \alpha_3(i) \beta_3(i)}$
‣ What is the denominator here? $P(\mathbf{x})$
‣ Does this explain why beta is what it is?
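Putting the two recurrences together: a sketch of forward-backward for tag marginals, written in probability space for readability (in practice you would store log probabilities or rescale, as noted above). The table layout mirrors the hypothetical one in the Viterbi sketch.

```python
import numpy as np

def forward_backward(init, trans, emit, obs):
    """init[s], trans[s_prev, s], emit[s, x]: HMM probability tables.
    Returns marginals[t, s] = P(y_t = s | x)."""
    n, S = len(obs), len(init)
    alpha = np.zeros((n, S))
    beta = np.ones((n, S))                     # beta_n(s) = 1
    alpha[0] = init * emit[:, obs[0]]          # alpha_1(s) = P(s) P(x_1 | s)
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
    for t in range(n - 2, -1, -1):
        # beta_t(s) = sum_{s'} beta_{t+1}(s') P(s' | s) P(x_{t+1} | s')
        beta[t] = trans @ (beta[t + 1] * emit[:, obs[t + 1]])
    marginals = alpha * beta
    return marginals / marginals.sum(axis=1, keepdims=True)   # each row's denominator is P(x)
```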
HMM POS Tagging
‣ Baseline: assign each word its most frequent tag: ~90% accuracy
‣ Trigram HMM: ~95% accuracy / 55% on unknown words
Slide credit: Dan Klein
Trigram Taggers
Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN
‣ Trigram model: $y_1 = (\text{<S>}, \text{NNP})$, $y_2 = (\text{NNP}, \text{VBZ})$, …
‣ $P((\text{VBZ}, \text{NN}) \mid (\text{NNP}, \text{VBZ}))$: more context! Noun-verb-noun, S-V-O
‣ Tradeoff between model capacity and data size: trigrams are a "sweet spot" for POS tagging
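One way to see the state construction: a small sketch (an assumed helper, not from the slides) that maps a tag sequence to tag-pair states.

```python
def trigram_states(tags, start="<S>"):
    """Map a tag sequence to tag-pair states: y_i = (previous tag, current tag)."""
    prev, states = start, []
    for tag in tags:
        states.append((prev, tag))
        prev = tag
    return states

# trigram_states(["NNP", "VBZ", "NN"]) -> [("<S>", "NNP"), ("NNP", "VBZ"), ("VBZ", "NN")]
```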
HMM POS Tagging
‣ Baseline: assign each word its most frequent tag: ~90% accuracy
‣ Trigram HMM: ~95% accuracy / 55% on unknown words
‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unknown words
‣ State-of-the-art (BiLSTM-CRFs): 97.5% / 89%+
Slide credit: Dan Klein
Errors
official knowledge        made up the story        recently sold shares
JJ/NN NN                  VBD RP/IN DT NN          RB VBD/VBN NNS
(NN NN: tax cut, art gallery, …)
Slide credit: Dan Klein / Toutanova + Manning (2000)
Remaining Errors
‣ Underspecified/unclear, gold standard inconsistent/wrong: 58%
‣ Lexicon gap (word not seen with that tag in training): 4.5%
‣ Unknown word: 4.5%
‣ Could get right: 16% (many of these involve parsing!)
‣ Difficult linguistics: 20%
They set up absurd situations, detached from reality    VBD/VBP? (past or present?)
a $10 million fourth-quarter charge against discontinued operations    JJ/VBN? (adjective or verbal participle?)
Manning 2011, "Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?"
Other Languages
‣ Universal POS tagset (~12 tags); a cross-lingual model works as well as a tuned CRF using external resources
Gillick et al. 2016
Next Time
‣ CRFs: feature-based discriminative models
‣ Structured SVM for sequences
‣ Named entity recognition