CS388: Natural Language Processing
Lecture 4: Sequence Models I
Greg Durrett
Parts of this lecture adapted from Dan Klein, UC Berkeley and Vivek Srikumar, University of Utah
Administrivia
‣ Project 1 out today, due September 27
‣ This class will cover what you need to get started on it; the next class will cover everything you need to complete it
‣ Viterbi algorithm, CRF NER system, extension
‣ Extension should be substantial: don't just try one additional feature (see syllabus/spec for discussion, samples on website)
‣ Mini 1 due today
Recall: Multiclass Classification
‣ Logistic regression: $P(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$
Gradient (unregularized): $\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \mathbb{E}_y[f_i(x_j, y)]$
‣ SVM: defined by a quadratic program (minimization, so gradients are flipped). Loss-augmented decode: $\xi_j = \max_{y \in \mathcal{Y}} w^\top f(x_j, y) + \ell(y, y_j^*) - w^\top f(x_j, y_j^*)$
Subgradient (unregularized) on the jth example: $f_i(x_j, y_{\max}) - f_i(x_j, y_j^*)$
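To make the gradient concrete: a minimal sketch (not from the lecture) of the unregularized multiclass logistic regression gradient for one example. The names `feat` (a feature function returning a numpy vector) and `label_space` are illustrative assumptions.

```python
import numpy as np

def logreg_gradient(w, feat, x, y_star, label_space):
    """Gradient of log P(y*|x): observed features minus expected features under the model."""
    scores = np.array([w @ feat(x, y) for y in label_space])
    scores -= scores.max()                        # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    expected = sum(p * feat(x, y) for p, y in zip(probs, label_space))
    return feat(x, y_star) - expected

# Gradient ascent step on example j:
# w = w + alpha * logreg_gradient(w, feat, x_j, y_star_j, label_space)
```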
Recall: Optimization
‣ Stochastic gradient *ascent*: $w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w} \mathcal{L}$
‣ Adagrad: $w_i \leftarrow w_i + \alpha \frac{1}{\sqrt{\epsilon + \sum_{\tau=1}^{t} g_{\tau,i}^2}} g_{t,i}$
‣ SGD/AdaGrad have a batch size parameter
‣ Large batches (>50 examples): can parallelize within batch
‣ …but bigger batches often mean more epochs required because you make fewer parameter updates
‣ Shuffling: online methods are sensitive to dataset order, shuffling helps!
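As a worked illustration (assumed, not the course code), the AdaGrad ascent update above can be written in a few lines of numpy; `grad_fn` is a hypothetical function returning the gradient on one example.

```python
import numpy as np

def adagrad_ascent(w, grad_fn, examples, alpha=0.1, eps=1e-8):
    """Stochastic gradient *ascent* with per-coordinate AdaGrad step sizes."""
    sum_sq = np.zeros_like(w)              # running sum of squared gradients per coordinate
    for x, y in examples:                  # shuffle examples before calling this
        g = grad_fn(w, x, y)
        sum_sq += g ** 2
        w = w + alpha * g / np.sqrt(eps + sum_sq)
    return w
```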
This Lecture
‣ Sequence modeling
‣ HMMs for POS tagging
‣ Viterbi, forward-backward
‣ HMM parameter estimation
Linguistic Structures
‣ Language is tree-structured
I ate the spaghetti with chopsticks    I ate the spaghetti with meatballs
‣ Understanding syntax fundamentally requires trees: the sentences have the same shallow analysis
I ate the spaghetti with chopsticks    I ate the spaghetti with meatballs
PRP VBZ DT NN IN NNS                   PRP VBZ DT NN IN NNS
Linguistic Structures
‣ Language is sequentially structured: interpreted in an online way
Tanenhaus et al. (1995)
POS Tagging
Ghana's ambassador should have set up the big meeting in DC yesterday.
‣ What tags are out there?
NNP POS NN MD VB VBN RP DT JJ NN IN NNP NN .
POS Tagging
Slide credit: Dan Klein
POS Tagging
Fed raises interest rates 0.5 percent
Possible tags per word: Fed {VBD, VBN, NNP}, raises {VBZ, NNS}, interest {VB, VBP, NN}, rates {VBZ, NNS}, 0.5 {CD}, percent {NN}
I'm 0.5% interested in the Fed's raises!
I hereby increase interest rates 0.5%
‣ Other paths are also plausible but even more semantically weird…
‣ What governs the correct choice? Word + context
‣ Word identity: most words have <= 2 tags, many have one (percent, the)
‣ Context: nouns start sentences, nouns follow verbs, etc.
What is this good for?
‣ Text-to-speech: record, lead
‣ Preprocessing step for syntactic parsers
‣ Domain-independent disambiguation for other tasks
‣ (Very) shallow information extraction
Sequence Models
‣ Input $x = (x_1, \ldots, x_n)$    Output $y = (y_1, \ldots, y_n)$
‣ POS tagging: x is a sequence of words, y is a sequence of tags
‣ Today: generative models P(x, y); discriminative models next time
Hidden Markov Models
‣ Input $x = (x_1, \ldots, x_n)$    Output $y = (y_1, \ldots, y_n)$
‣ Model the sequence of y as a Markov process (dynamics model)
‣ Markov property: the future is conditionally independent of the past given the present: $P(y_3 | y_1, y_2) = P(y_3 | y_2)$
‣ If y are tags, this roughly corresponds to assuming that the next tag only depends on the current tag, not anything before
‣ Lots of mathematical theory about how Markov chains behave
Hidden Markov Models
‣ Input $x = (x_1, \ldots, x_n)$    Output $y = (y_1, \ldots, y_n)$
[HMM diagram: chain $y_1 \rightarrow y_2 \rightarrow \dots \rightarrow y_n$, with each $y_i$ emitting $x_i$]
$P(\mathbf{y}, \mathbf{x}) = \underbrace{P(y_1)}_{\text{initial distribution}} \underbrace{\prod_{i=2}^{n} P(y_i | y_{i-1})}_{\text{transition probabilities}} \underbrace{\prod_{i=1}^{n} P(x_i | y_i)}_{\text{emission probabilities}}$
‣ P(x|y) is a distribution over all words in the vocabulary, not a distribution over features (but could be!)
‣ Multinomials: tag × tag transitions, tag × word emissions
‣ Observation (x) depends only on current state (y)
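A minimal sketch (illustrative, not the lecture's code) of evaluating this factorization, assuming dictionaries `init`, `trans`, and `emit` holding the initial, transition, and emission multinomials.

```python
import math

def log_joint(words, tags, init, trans, emit):
    """log P(y, x) = log P(y1) + sum_i log P(y_i | y_{i-1}) + sum_i log P(x_i | y_i)."""
    lp = math.log(init[tags[0]])
    for prev, cur in zip(tags, tags[1:]):      # transition probabilities
        lp += math.log(trans[(prev, cur)])
    for word, tag in zip(words, tags):         # emission probabilities
        lp += math.log(emit[(tag, word)])
    return lp
```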
Transitions in POS Tagging
‣ Dynamics model: $P(y_1) \prod_{i=2}^{n} P(y_i | y_{i-1})$
Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN
‣ $P(y_1 = \text{NNP})$: likely because start of sentence
‣ $P(y_2 = \text{VBZ} \mid y_1 = \text{NNP})$: likely because a verb often follows a noun
‣ $P(y_3 = \text{NN} \mid y_2 = \text{VBZ})$: direct object follows verb; another verb rarely follows a past-tense verb (main verbs can follow modals though!)
Estimating Transitions
‣ Similar to Naive Bayes estimation: maximum likelihood solution = normalized counts (with smoothing) read off supervised data
Fed raises interest rates 0.5 percent .
NNP VBZ NN NNS CD NN .
‣ How to smooth?
‣ One method: smooth with the unigram distribution over tags: $P(\text{tag} \mid \text{tag}_{-1}) = (1 - \lambda) \hat{P}(\text{tag} \mid \text{tag}_{-1}) + \lambda \hat{P}(\text{tag})$, where $\hat{P}$ is the empirical distribution (read off from data)
‣ e.g., $\hat{P}(\text{tag} \mid \text{NN}) = (0.5\ \text{.}, 0.5\ \text{NNS})$ from this sentence: NN is followed once by NNS and once by .
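A sketch of the count-and-normalize estimate with the unigram interpolation above; `tagged_sents` (a list of (word, tag) sequences) and the smoothing weight `lam` are assumed names.

```python
from collections import Counter

def estimate_transitions(tagged_sents, lam=0.1):
    """P(tag | prev) = (1 - lam) * Phat(tag | prev) + lam * Phat(tag), from normalized counts."""
    unigram, bigram, prev_totals = Counter(), Counter(), Counter()
    for sent in tagged_sents:
        tags = [tag for _, tag in sent]
        unigram.update(tags)
        for prev, cur in zip(tags, tags[1:]):
            bigram[(prev, cur)] += 1
            prev_totals[prev] += 1
    total = sum(unigram.values())

    def trans_prob(prev, tag):
        mle = bigram[(prev, tag)] / prev_totals[prev] if prev_totals[prev] else 0.0
        return (1 - lam) * mle + lam * unigram[tag] / total

    return trans_prob
```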
Emissions in POS Tagging
‣ Emissions P(x|y) capture the distribution of words occurring with a given tag
‣ P(word|NN) = (0.05 person, 0.04 official, 0.03 interest, 0.03 percent, …)
Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN
‣ When you compute the posterior for a given word's tags, the distribution favors tags that are more likely to generate that word
‣ How should we smooth this?
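For comparison, a sketch of estimating emissions from counts; the add-α smoothing over the vocabulary shown here is just one illustrative option, not necessarily the answer the question above is after.

```python
from collections import Counter

def estimate_emissions(tagged_sents, alpha=0.1):
    """Phat(word | tag) from counts, with (hypothetical) add-alpha smoothing over the vocabulary."""
    emit, tag_counts, vocab = Counter(), Counter(), set()
    for sent in tagged_sents:
        for word, tag in sent:
            emit[(tag, word)] += 1
            tag_counts[tag] += 1
            vocab.add(word)
    V = len(vocab)

    def emit_prob(tag, word):
        return (emit[(tag, word)] + alpha) / (tag_counts[tag] + alpha * V)

    return emit_prob
```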
Inference in HMMs
‣ Input $x = (x_1, \ldots, x_n)$    Output $y = (y_1, \ldots, y_n)$
$P(\mathbf{y}, \mathbf{x}) = P(y_1) \prod_{i=2}^{n} P(y_i | y_{i-1}) \prod_{i=1}^{n} P(x_i | y_i)$
‣ Inference problem: $\mathrm{argmax}_{\mathbf{y}} P(\mathbf{y} | \mathbf{x}) = \mathrm{argmax}_{\mathbf{y}} \frac{P(\mathbf{y}, \mathbf{x})}{P(\mathbf{x})}$
‣ Exponentially many possible y here!
‣ Solution: dynamic programming (possible because of the Markov structure!)
‣ Many neural sequence models depend on the entire previous tag sequence and need to use approximations like beam search
Viterbi Algorithm
Slide credit: Vivek Srikumar / Dan Klein
‣ Best (partial) score for a sequence ending in state s
‣ "Think about" all possible immediate prior state values. Everything before that has already been accounted for by earlier stages.
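Since the Viterbi slides are figures, here is a minimal log-space sketch of the dynamic program they walk through (an assumed implementation, using numpy tables): each chart cell holds the best partial score for a sequence ending in state s at time t, computed by maxing over all possible immediate prior states, and back-pointers recover the best path.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """log_init[s], log_trans[s_prev, s], log_emit[s, x]: log-probability tables.
    obs: list of observation indices. Returns the highest-scoring state sequence."""
    n, S = len(obs), len(log_init)
    score = np.full((n, S), -np.inf)       # best log score of any path ending in state s at time t
    back = np.zeros((n, S), dtype=int)     # argmax predecessor for each cell
    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, n):
        for s in range(S):
            cand = score[t - 1] + log_trans[:, s] + log_emit[s, obs[t]]
            back[t, s] = np.argmax(cand)
            score[t, s] = cand[back[t, s]]
    path = [int(np.argmax(score[n - 1]))]  # best final state
    for t in range(n - 1, 0, -1):          # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))
```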
Forward-Backward Algorithm
‣ In addition to finding the best path, we may want to compute marginal probabilities of paths: $P(y_i = s \mid \mathbf{x}) = \sum_{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n} P(\mathbf{y} \mid \mathbf{x})$
‣ What did Viterbi compute? $P(\mathbf{y}^{\max} \mid \mathbf{x}) = \max_{y_1, \ldots, y_n} P(\mathbf{y} \mid \mathbf{x})$
‣ Can compute marginals with dynamic programming as well, using an algorithm called forward-backward
Forward-Backward Algorithm
Slide credit: Dan Klein
$P(y_3 = 2 \mid \mathbf{x}) = \frac{\text{sum of all paths through state 2 at time 3}}{\text{sum of all paths}}$
‣ Easiest and most flexible to do one pass to compute $\alpha$ and one to compute $\beta$
Forward-Backward Algorithm
‣ Initial: $\alpha_1(s) = P(s) P(x_1 \mid s)$
‣ Recurrence: $\alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1}) P(s_t \mid s_{t-1}) P(x_t \mid s_t)$
‣ Same as Viterbi but summing instead of maxing!
‣ These quantities get very small! Store everything as log probabilities
Forward-Backward Algorithm
‣ Initial: $\beta_n(s) = 1$
‣ Recurrence: $\beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1}) P(s_{t+1} \mid s_t) P(x_{t+1} \mid s_{t+1})$
‣ Big differences: count the emission for the next timestep (not the current one)
Forward-Backward Algorithm
$\alpha_1(s) = P(s) P(x_1 \mid s)$    $\alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1}) P(s_t \mid s_{t-1}) P(x_t \mid s_t)$
$\beta_n(s) = 1$    $\beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1}) P(s_{t+1} \mid s_t) P(x_{t+1} \mid s_{t+1})$
$P(s_3 = 2 \mid \mathbf{x}) = \frac{\alpha_3(2) \beta_3(2)}{\sum_i \alpha_3(i) \beta_3(i)}$
‣ What is the denominator here? $P(\mathbf{x})$
‣ Does this explain why beta is what it is?
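Putting the two recurrences together: a sketch of forward-backward for tag marginals, written in probability space for readability (in practice you would store log probabilities or rescale, as noted above). The table layout mirrors the hypothetical one in the Viterbi sketch.

```python
import numpy as np

def forward_backward(init, trans, emit, obs):
    """init[s], trans[s_prev, s], emit[s, x]: HMM probability tables.
    Returns marginals[t, s] = P(y_t = s | x)."""
    n, S = len(obs), len(init)
    alpha = np.zeros((n, S))
    beta = np.ones((n, S))                     # beta_n(s) = 1
    alpha[0] = init * emit[:, obs[0]]          # alpha_1(s) = P(s) P(x_1 | s)
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
    for t in range(n - 2, -1, -1):
        # beta_t(s) = sum_{s'} beta_{t+1}(s') P(s' | s) P(x_{t+1} | s')
        beta[t] = trans @ (beta[t + 1] * emit[:, obs[t + 1]])
    marginals = alpha * beta
    return marginals / marginals.sum(axis=1, keepdims=True)   # each row's denominator is P(x)
```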
HMM POS Tagging
‣ Baseline: assign each word its most frequent tag: ~90% accuracy
‣ Trigram HMM: ~95% accuracy / 55% on unknown words
Slide credit: Dan Klein
Trigram Taggers
Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN
‣ Trigram model: $y_1 = (\text{<S>}, \text{NNP})$, $y_2 = (\text{NNP}, \text{VBZ})$, …
‣ $P((\text{VBZ}, \text{NN}) \mid (\text{NNP}, \text{VBZ}))$: more context! Noun-verb-noun, S-V-O
‣ Tradeoff between model capacity and data size: trigrams are a "sweet spot" for POS tagging
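One way to see the state construction: a small sketch (an assumed helper, not from the slides) that maps a tag sequence to tag-pair states.

```python
def trigram_states(tags, start="<S>"):
    """Map a tag sequence to tag-pair states: y_i = (previous tag, current tag)."""
    prev, states = start, []
    for tag in tags:
        states.append((prev, tag))
        prev = tag
    return states

# trigram_states(["NNP", "VBZ", "NN"]) -> [("<S>", "NNP"), ("NNP", "VBZ"), ("VBZ", "NN")]
```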
HMM POS Tagging
‣ Baseline: assign each word its most frequent tag: ~90% accuracy
‣ Trigram HMM: ~95% accuracy / 55% on unknown words
‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unknown words
‣ State-of-the-art (BiLSTM-CRFs): 97.5% / 89%+
Slide credit: Dan Klein
Errors
official knowledge        made up the story        recently sold shares
JJ/NN NN                  VBD RP/IN DT NN          RB VBD/VBN NNS
(NN NN: tax cut, art gallery, …)
Slide credit: Dan Klein / Toutanova + Manning (2000)
Remaining Errors
‣ Underspecified/unclear, gold standard inconsistent/wrong: 58%
‣ Lexicon gap (word not seen with that tag in training): 4.5%
‣ Unknown word: 4.5%
‣ Could get right: 16% (many of these involve parsing!)
‣ Difficult linguistics: 20%
They set up absurd situations, detached from reality    VBD/VBP? (past or present?)
a $10 million fourth-quarter charge against discontinued operations    JJ/VBN? (adjective or verbal participle?)
Manning 2011, "Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?"
Other Languages
‣ Universal POS tagset (~12 tags); a cross-lingual model works as well as a tuned CRF using external resources
Gillick et al. 2016
Next Time
‣ CRFs: feature-based discriminative models
‣ Structured SVM for sequences
‣ Named entity recognition