Lecture 3: Structures and Decoding
Outline
1. Structures in NLP
2. HMMs as BNs
   – Viterbi algorithm as variable elimination
3. Linear models
4. Five views of decoding
Two Meanings of “Structure”
• Yesterday: structure of a graph for modeling a collection of random variables together.
• Today: linguistic structure.
  – Sequence labelings (POS, IOB chunkings, …)
  – Parse trees (phrase-structure, dependency, …)
  – Alignments (word, phrase, tree, …)
  – Predicate-argument structures
  – Text-to-text (translation, paraphrase, answers, …)
A Useful Abstraction?
• We think so.
• Brings out commonalities:
  – Modeling formalisms (e.g., linear models with features)
  – Learning algorithms (lectures 4–6)
  – Generic inference algorithms
• Permits sharing across a wider space of problems.
• Disadvantage: hides engineering details.
Familiar Example: Hidden Markov Models
Hidden Markov Model
• X and Y are both sequences of symbols
  – X is a sequence from the vocabulary Σ
  – Y is a sequence from the state space Λ
• Parameters:
  – Transitions p(y′ | y)
    • including p(stop | y), p(y | start)
  – Emissions p(x | y)
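A minimal sketch (in Python, with made-up dictionary-valued parameters) of how these parameters define the joint probability of a state sequence y and a symbol sequence x:

    def hmm_joint_prob(x, y, trans, emit):
        """p(x, y) = p(y1 | start) * prod_i p(x_i | y_i) p(y_{i+1} | y_i) * p(stop | y_n)."""
        prob = trans[("start", y[0])]
        for i in range(len(x)):
            prob *= emit[(y[i], x[i])]           # emission p(x_i | y_i)
            if i + 1 < len(x):
                prob *= trans[(y[i], y[i + 1])]  # transition p(y_{i+1} | y_i)
        prob *= trans[(y[-1], "stop")]           # stop transition
        return prob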
Hidden Markov Model
• The joint model’s independence assumptions are easy to capture with a Bayesian network.
[Figure: Bayesian network Y0 → Y1 → Y2 → … → Yn → stop, with each Yi emitting Xi.]
Hidden Markov Model
• The joint model instantiates dynamic Bayesian networks.
[Figure: template Yi−1 → Yi → Xi, starting from Y0, that gets copied as many times as needed.]
Hidden Markov Model
• Given X’s value as evidence, the dynamic part becomes unnecessary, since we know n.
[Figure: the same network with each Xi clamped to its observed value xi.]
Hidden Markov Model
• The usual inference problem is to find the most probable value of Y given X = x.
[Figure: the same network with X1 = x1, …, Xn = xn observed.]
Hidden Markov Model
• The usual inference problem is to find the most probable value of Y given X = x.
• Factor graph:
[Figure: the corresponding factor graph over Y0, Y1, …, Yn, stop and the observed Xi.]
Hidden Markov Model
• The usual inference problem is to find the most probable value of Y given X = x.
• Factor graph after reducing factors to respect the evidence:
[Figure: the reduced factor graph, a chain over Y1, Y2, Y3, …, Yn.]
Hidden Markov Model
• The usual inference problem is to find the most probable value of Y given X = x.
• A clever elimination ordering should be apparent!
Hidden Markov Model
• When we eliminate Y1, we take a product of three relevant factors:
  • p(Y1 | start)
  • η(Y1) = reduced p(x1 | Y1)
  • p(Y2 | Y1)
Hidden Markov Model
• When we eliminate Y1, we first take a product of the two factors that only involve Y1: p(Y1 | start) and η(Y1) = reduced p(x1 | Y1).
[Figure: the two factors shown as vectors indexed by y1, y2, …, y|Λ|.]
Hidden Markov Model
• When we eliminate Y1, we first take a product of the two factors that only involve Y1.
• This is the Viterbi probability vector for Y1: φ1(Y1).
Hidden Markov Model
• When we eliminate Y1, we first take a product of the two factors that only involve Y1.
• This is the Viterbi probability vector for Y1: φ1(Y1).
• Eliminating Y1 equates to solving the Viterbi probabilities for Y2.
[Figure: the vector φ1(Y1) combined with the |Λ| × |Λ| table p(Y2 | Y1).]
Hidden Markov Model
• Product of all factors involving Y1, then reduce.
• φ2(Y2) = max_{y ∈ Val(Y1)} ( φ1(y) × p(Y2 | y) )
• This factor holds the Viterbi probabilities for Y2.
Hidden Markov Model
• When we eliminate Y2, we take a product of the analogous two relevant factors.
• Then reduce.
• φ3(Y3) = max_{y ∈ Val(Y2)} ( φ2(y) × p(Y3 | y) )
Hidden Markov Model
• At the end, we have one final factor with one row, φn+1.
• This is the score of the best sequence.
• Use backtrace to recover the values.
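Putting the elimination steps together gives the Viterbi algorithm. A minimal Python sketch (parameters as in the earlier dictionary-based sketch; the reduced emission factors η are folded into each φ):

    def viterbi(x, states, trans, emit):
        """Most probable state sequence y for observations x under a first-order HMM."""
        # phi[i][y] = best score of any state sequence ending in state y at position i
        phi = [{y: trans[("start", y)] * emit[(y, x[0])] for y in states}]
        back = []
        for i in range(1, len(x)):
            phi_i, back_i = {}, {}
            for y in states:
                best_prev = max(states, key=lambda yp: phi[i - 1][yp] * trans[(yp, y)])
                phi_i[y] = phi[i - 1][best_prev] * trans[(best_prev, y)] * emit[(y, x[i])]
                back_i[y] = best_prev
            phi.append(phi_i)
            back.append(back_i)
        # fold in the stop transition, then trace back
        last = max(states, key=lambda y: phi[-1][y] * trans[(y, "stop")])
        y = [last]
        for back_i in reversed(back):
            y.append(back_i[y[-1]])
        return list(reversed(y))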
Why Think This Way?
• Easy to see how to generalize HMMs.
  – More evidence
  – More factors
  – More hidden structure
  – More dependencies
• The probabilistic interpretation of the factors is not central to finding the “best” Y…
  – Many factors are not conditional probability tables.
Generalization Example 1
• Each word also depends on the previous state.
[Figure: Bayesian network over Y1, …, Y5 and X1, …, X5 for this model.]
Generalization Example 2
• “Trigram” HMM
[Figure: Bayesian network over Y1, …, Y5 and X1, …, X5 for this model.]
Generalization Example 3
• Aggregate bigram model (Saul and Pereira, 1997)
[Figure: Bayesian network over Y1, …, Y5 and X1, …, X5 for this model.]
General Decoding Problem
• Two structured random variables, X and Y.
  – Sometimes described as collections of random variables.
• “Decode” the observed value X = x into some value of Y.
• Usually, we seek to maximize some score.
  – E.g., MAP inference from yesterday.
Linear Models
• Define a feature vector function g that maps (x, y) pairs into d-dimensional real space.
• Score is linear in g(x, y): score(x, y) = w · g(x, y).
• Results:
  – Decoding seeks y to maximize the score.
  – Learning seeks w to … do something we’ll talk about later.
• Extremely general!
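A minimal sketch of the decoding problem this sets up, by brute-force enumeration over a candidate set (the candidate set and feature function here are placeholders; real decoders exploit the structure of g instead of enumerating):

    import numpy as np

    def decode(x, candidates, g, w):
        """Return the candidate y with the highest linear score w . g(x, y)."""
        return max(candidates, key=lambda y: float(np.dot(w, g(x, y))))

The remaining views of decoding are, in effect, ways of computing this argmax without listing all candidates.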
Generic Noisy Channel as Linear Model
• (The noisy-channel score is log p(y) + log p(x | y), so it is linear in features that fire for the events in those two models.)
• Of course, the two probability terms are typically composed of “smaller” factors; each can be understood as an exponentiated weight.
MaxEnt Models as Linear Models
HMMs as Linear Models
Running Example
• IOB sequence labeling, here applied to NER.
• Often solved with HMMs, CRFs, M3Ns, …
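For concreteness, a made-up example (not from the slides) of IOB-tagged NER output, where B begins an entity, I continues it, and O marks words outside any entity:

    Phil   Simmons   visited   New     York    .
    B-PER  I-PER     O         B-LOC   I-LOC   O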
(What is Not a Linear Model?)
• Models with hidden variables
• Models based on non-linear kernels
Decoding
• For HMMs, the decoding algorithm we usually think of first is the Viterbi algorithm.
  – This is just one example.
• We will view decoding in five different ways.
  – Sequence models as a running example.
  – These views are not just for HMMs.
  – Sometimes they will lead us back to Viterbi!
Five Views of Decoding
1. Probabilistic Graphical Models
• View the linguistic structure as a collection of random variables that are interdependent.
• Represent interdependencies as a directed or undirected graphical model.
• Conditional probability tables (BNs) or factors (MNs) encode the probability distribution.
Inference in Graphical Models
• General algorithm for exact MAP inference: variable elimination.
  – Iteratively solve for the best values of each variable conditioned on values of “preceding” neighbors.
  – Then trace back.
• The Viterbi algorithm is an instance of max-product variable elimination!
MAP is Linear Decoding
• Bayesian network:
• Markov network:
• This only works if every variable is in X or Y.
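The formulas on this slide are not reproduced above; a sketch of the standard identities they stand for, in the notation of the linear-model slides:

    \log p(\mathbf{x}, \mathbf{y}) = \sum_i \log p(v_i \mid \mathrm{pa}(v_i))                       (Bayesian network)
    \log p(\mathbf{x}, \mathbf{y}) = \sum_c \log \phi_c(\mathbf{x}_c, \mathbf{y}_c) - \log Z        (Markov network)

Each sum is linear in indicator features of local configurations, so it can be written as w · g(x, y); the constant log Z does not affect the argmax, which is why MAP inference is linear decoding.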
Inference in Graphical Models
• Remember: more edges make inference more expensive.
  – Fewer edges means stronger independence.
• Really pleasant: [figure not reproduced]
Inference in Graphical Models
• Remember: more edges make inference more expensive.
  – Fewer edges means stronger independence.
• Really unpleasant: [figure not reproduced]
2. Polytopes
“Parts”
• Assume that the feature function g breaks down into local parts.
• Each part has an alphabet of possible values.
  – Decoding is choosing values for all parts, with consistency constraints.
  – (In the graphical models view, a part is a clique.)
Example
• One part per word, each in {B, I, O}
• No features look at multiple parts
  – Fast inference
  – Not very expressive
Example
• One part per bigram, each in {BB, BI, BO, IB, II, IO, OB, OO}
• Features and constraints can look at pairs
  – Slower inference
  – A bit more expressive
Geometric View
• Let z_{i,π} be 1 if part i takes value π and 0 otherwise.
• z is a vector in {0, 1}^N
  – N = total number of localized part values
  – Each z is a vertex of the unit cube
Score is Linear in z
• With parts, the score can be written as θ · z, where θ_{i,π} is the weighted feature score of part i taking value π.
  (not really equal; we need to transform z back to get y)
Polyhedra
• Not all vertices of the N-dimensional unit cube satisfy the constraints.
  – E.g., can’t have z_{1,BI} = 1 and z_{2,BI} = 1
• Sometimes we can write down a small (polynomial) number of linear constraints on z.
• Result: linear objective, linear constraints, integer constraints…
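Written out for the bigram-parts example, the resulting integer linear program has roughly this shape (the agreement constraints are one standard encoding, shown as an illustration rather than taken from the slides):

    \max_{\mathbf{z}} \; \boldsymbol{\theta}^\top \mathbf{z}
    \text{s.t.} \quad \sum_{\pi} z_{i,\pi} = 1 \;\; \forall i                                              (each part takes exactly one value)
                \quad \sum_{\pi \text{ ending in } \lambda} z_{i,\pi} = \sum_{\pi \text{ starting with } \lambda} z_{i+1,\pi} \;\; \forall i, \lambda   (adjacent bigrams agree on the shared label)
                \quad \mathbf{z} \in \{0, 1\}^N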
Integer Linear Programming
• Very easy to add new constraints and non-local features.
• Many decoding problems have been mapped to ILP (sequence labeling, parsing, …), but it’s not always trivial.
• NP-hard in general.
  – But there are packages that often work well in practice (e.g., CPLEX)
  – Specialized algorithms in some cases
  – LP relaxation for approximate solutions
Remark
• Graphical models assumed a probabilistic interpretation.
  – Though they are not always learned using a probabilistic interpretation!
• The polytope view is agnostic about how you interpret the weights.
  – It only says that the decoding problem is an ILP.
3. Weighted Parsing
Grammars
• Grammars are often associated with natural language parsing, but they are extremely powerful for imposing constraints.
• We can add weights to them.
  – HMMs are a kind of weighted regular grammar (closely connected to WFSAs)
  – PCFGs are a kind of weighted CFG
  – Many, many more.
• Weighted parsing: find the maximum-weighted derivation for a string x.
Decoding as Weighted Parsing
• Every valid y is a grammatical derivation (parse) for x.
  – HMM: a sequence of “grammatical” states is one allowed by the transition table.
• Augment parsing algorithms with weights and find the best parse.
• The Viterbi algorithm is an instance of recognition by a weighted grammar!
BIO Tagging as a CFG
• Weighted (or probabilistic) CKY is a dynamic programming algorithm very similar in structure to classical CKY.
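A minimal sketch of weighted CKY for a grammar in Chomsky normal form (the rule and weight representation is an illustrative assumption, not from the slides); it returns only the best score, but keeping backpointers alongside each max recovers the best tree:

    from collections import defaultdict

    def weighted_cky(words, unary, binary, root="S"):
        """Best-derivation score of `words`.
        unary:  dict (A, word) -> weight of rule A -> word
        binary: dict (A, B, C) -> weight of rule A -> B C
        """
        n = len(words)
        # chart[i][j][A] = best score of a derivation of words[i:j] rooted at A
        chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):
            for (A, word), score in unary.items():
                if word == w:
                    chart[i][i + 1][A] = max(chart[i][i + 1][A], score)
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                j = i + width
                for k in range(i + 1, j):
                    for (A, B, C), score in binary.items():
                        if chart[i][k][B] and chart[k][j][C]:
                            cand = score * chart[i][k][B] * chart[k][j][C]
                            chart[i][j][A] = max(chart[i][j][A], cand)
        return chart[0][n][root]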
4. Paths and Hyperpaths
Best Path
• General idea: take x and build a graph.
• The score of a path factors into the edges.
• Decoding is finding the best path.
• The Viterbi algorithm is an instance of finding a best path!
“Lattice” View of Viterbi
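The lattice figure is not reproduced here. A minimal sketch of the idea it illustrates: build a DAG whose vertices are (position, label) pairs plus a source and a sink, put the (log) transition-plus-emission scores on the edges, and take the best path (the edge representation below is an illustrative assumption):

    def best_path(vertices, edges, source, sink):
        """Best (max-score) path in a DAG.
        vertices: listed in topological order; edges: dict (u, v) -> log-score."""
        best = {source: (0.0, [source])}
        for u in vertices:
            if u not in best:
                continue
            score_u, path_u = best[u]
            for (a, b), w in edges.items():
                if a == u and (b not in best or best[b][0] < score_u + w):
                    best[b] = (score_u + w, path_u + [b])
        return best.get(sink)  # (score, path), or None if the sink is unreachable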
Minimum Cost Hyperpath
• General idea: take x and build a hypergraph.
• The score of a hyperpath factors into the hyperedges.
• Decoding is finding the best hyperpath.
• This connection was elucidated by Klein and Manning (2002).
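A minimal sketch of best-hyperpath scoring over an acyclic hypergraph; representing each hyperedge as a (head, tails, weight) triple is an illustrative assumption:

    def best_hyperpath_score(hyperedges, goal, topo_order):
        """hyperedges: list of (head, tails, weight); tails == () marks an axiom.
        topo_order: vertices listed so that every tail precedes its head."""
        best = {}
        for v in topo_order:
            for head, tails, w in hyperedges:
                if head == v and all(t in best for t in tails):
                    score = w
                    for t in tails:
                        score *= best[t]  # combine the sub-hyperpaths
                    best[head] = max(best.get(head, 0.0), score)
        return best.get(goal, 0.0)

In the parsing case, vertices are chart items and each hyperedge corresponds to one rule application, so this is weighted CKY again, stated abstractly.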
Parsing as a Hypergraph
[Figures not reproduced: a sequence of slides building up a parse hypergraph. Captions on these slides include:
  – cf. “Dean for democracy”
  – “Forced to work on his thesis, sunshine streaming in the window, Mike experienced a …”
  – “Forced to work on his thesis, sunshine streaming in the window, Mike began to …”]
Why Hypergraphs?
• Useful, compact encoding of the hypothesis space.
  – Build the hypothesis space using local features, maybe do some filtering.
  – Pass it off to another module for more fine-grained scoring with richer or more expensive features.
5. Weighted Logic Programming
Logic Programming
• Start with a set of axioms and a set of inference rules.
• The goal is to prove a specific theorem, goal.
• Many approaches, but we assume a deductive approach.
  – Start with axioms, iteratively produce more theorems.
Weighted Logic Programming
• Twist: axioms have weights.
• We want the proof of goal with the best score.
• Note that axioms can be used more than once in a proof (y).
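A minimal sketch of a forward-chaining weighted deduction with a max-times semiring (the encoding of items and rules is illustrative; it conveys the flavor of such systems rather than their exact formulation):

    def weighted_deduction(axioms, rules, goal):
        """axioms: dict item -> weight.
        rules: list of (antecedents, consequent, rule_weight).
        Returns the best score of any proof of `goal` (0.0 if unprovable).
        Assumes an acyclic program, so the agenda loop terminates."""
        best = dict(axioms)
        agenda = list(axioms)
        while agenda:
            agenda.pop()  # something was (re)proven; re-check every rule
            for antecedents, consequent, w in rules:
                if all(a in best for a in antecedents):
                    score = w
                    for a in antecedents:
                        score *= best[a]
                    if score > best.get(consequent, 0.0):
                        best[consequent] = score
                        agenda.append(consequent)
        return best.get(goal, 0.0)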
Whence WLP?
• Shieber, Schabes, and Pereira (1995): many parsing algorithms can be understood in the same deductive logic framework.
• Goodman (1999): add weights, get many useful NLP algorithms.
• Eisner, Goldlust, and Smith (2004, 2005): semiring-generic algorithms, Dyna.
Dynamic Programming
• Most views (the exception is polytopes) can be understood as DP algorithms.
  – The low-level procedures we use are often DP.
  – Even DP is too high-level to know the best way to implement.
• DP does not imply polynomial time and space!
  – Most common approximations when the desired state space is too big: beam search (see the sketch after this list), cube pruning, agendas with early stopping, …
  – Other views suggest others.
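A minimal sketch of beam search for sequence labeling (the incremental scoring function and the beam width are illustrative placeholders):

    def beam_search(x, labels, score_step, beam_width=4):
        """Approximate best label sequence for x.
        score_step(prev_label, label, i, x) -> incremental (log) score of labeling
        position i with `label` after `prev_label`."""
        beam = [(0.0, [])]  # (score, labels so far); only the top beam_width survive
        for i in range(len(x)):
            candidates = []
            for score, ys in beam:
                prev = ys[-1] if ys else None
                for y in labels:
                    candidates.append((score + score_step(prev, y, i, x), ys + [y]))
            beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        return beam[0][1]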
Summary
• Decoding is the general problem of choosing a complex structure.
  – Linguistic analysis, machine translation, speech recognition, …
  – Statistical models are usually involved (not necessarily probabilistic).
• There is no perfect general view, but much can be gained through a combination of views.