Lecture 2: Inference
Inference: A Ubiquitous Obstacle
• Decoding is inference.
• Subroutines for learning are inference.
• Learning is inference.
• Exact inference is #P-complete.
  – Even approximations within a given absolute or relative error are hard.
Probabilistic Inference Problems
Given values for some random variables (X ⊂ V)…
• Most Probable Explanation: what are the most probable values of the rest of the r.v.s, V \ X?
(More generally…)
• Maximum A Posteriori (MAP): what are the most probable values of some other r.v.s, Y ⊂ (V \ X)?
• Random sampling from the posterior over values of Y
• Full posterior over values of Y
• Marginal probabilities from the posterior over Y
• Minimum Bayes risk: what is the Y with the lowest expected cost?
• Cost-augmented decoding: what is the most "dangerous" Y?
Approaches to Inference
• Exact:
  – variable elimination (dynamic programming) [today]
  – ILP
• Approximate:
  – Randomized: MCMC (Gibbs sampling), importance sampling, randomized search (simulated annealing)
  – Deterministic: variational (mean field), loopy belief propagation, LP relaxations / dual decomposition [lecture 6], local search, beam search
Exact Marginal for Y
• This will be a generalization of algorithms you already know: the forward and backward algorithms.
• The general name is variable elimination.
• After we see it for the marginal, we'll see how to use it for the MAP.
Simple Inference Example
• Goal: P(D)
[Figure: Markov chain A → B → C → D over binary variables, with CPTs P(B|A), P(C|B), P(D|C); table values not reproduced.]
Simple Inference Example
• Let's calculate P(B) from things we have.
• Note that C and D do not matter.
[Chain A → B → C → D with its CPTs.]
Simple Inference Example
• Let's calculate P(B) from things we have: P(B = b) = Σ_a P(A = a) · P(B = b | A = a).
[Tables: P(A) and P(B|A) combine to give P(B); summing out A.]
Simple Inference Example
• We now have a Bayesian network for the marginal distribution P(B, C, D).
[Chain B → C → D with CPTs P(B), P(C|B), P(D|C).]
Simple Inference Example
• We can repeat the same process to calculate P(C).
• We already have P(B)!
[Chain B → C → D.]
Simple Inference Example
• We can repeat the same process to calculate P(C): P(C = c) = Σ_b P(B = b) · P(C = c | B = b).
[Tables: P(B) and P(C|B) combine to give P(C); summing out B.]
Simple Inference Example
• We now have P(C, D).
• Marginalizing out A and B happened in two steps, and we are exploiting the Bayesian network structure.
[Chain C → D with CPTs P(C), P(D|C).]
Simple Inference Example
• Last step to get P(D): P(D = d) = Σ_c P(C = c) · P(D = d | C = c).
[Tables: P(C) and P(D|C) combine to give P(D); summing out C.]
Simple Inference Example
• Notice that the same step happened for each random variable:
  – We created a new CPD over the variable and its "successor."
  – We summed out (marginalized) the variable.
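For concreteness, the whole chain computation fits in a few lines of numpy. This is a minimal sketch with hypothetical CPT values (the slides' actual numbers are not reproduced above); each step sums out one variable, exactly as on the slides.

```python
import numpy as np

# Hypothetical CPTs for the binary chain A -> B -> C -> D.
p_A = np.array([0.6, 0.4])              # p_A[a] = P(A = a)
p_B_given_A = np.array([[0.7, 0.3],     # row = a, column = b
                        [0.2, 0.8]])
p_C_given_B = np.array([[0.9, 0.1],
                        [0.5, 0.5]])
p_D_given_C = np.array([[0.4, 0.6],
                        [0.1, 0.9]])

p_B = p_A @ p_B_given_A   # sum out A: P(b) = sum_a P(a) P(b|a)
p_C = p_B @ p_C_given_B   # sum out B
p_D = p_C @ p_D_given_C   # sum out C
print(p_D)                # the marginal P(D)
```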
That Was Variable Elimination
• We reused computation from previous steps and avoided doing the same work more than once.
  – Dynamic programming à la the forward algorithm!
• We exploited the Bayesian network structure (each subexpression only depends on a small number of variables).
• Exponential blowup avoided!
What Remains
• Some machinery
• Variable elimination in general
• The maximization version (for MAP inference)
• A bit about approximate inference
Factor Graphs
• Variable nodes (circles)
• Factor nodes (squares)
  – Can be MN factors or BN conditional probability distributions!
• Edge between a variable and a factor if the factor depends on that variable.
• The graph is bipartite.
[Figure: factor graph over variables X, Y, Z with factors φ1, φ2, φ3, φ4.]
Products of Factors
• Given two factors with different scopes, we can calculate a new factor equal to their product: here, ϕ3(A, B, C) = ϕ1(A, B) · ϕ2(B, C).

A B ϕ1(A,B)
0 0 30
0 1 5
1 0 1
1 1 10

B C ϕ2(B,C)
0 0 100
0 1 1
1 0 1
1 1 100

ϕ1 · ϕ2:

A B C ϕ3(A,B,C)
0 0 0 3000
0 0 1 30
0 1 0 5
0 1 1 500
1 0 0 100
1 0 1 1
1 1 0 10
1 1 1 1000
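A factor product is easy to code directly. Below is a minimal sketch that represents a factor as a (scope, table) pair, where scope is a tuple of variable names and table maps each tuple of values to a number; it assumes binary variables for brevity. The ϕ1 and ϕ2 values are the ones from the tables above.

```python
from itertools import product as assignments

def factor_product(f1, f2):
    """Multiply two factors; the result's scope is the union of scopes."""
    (s1, t1), (s2, t2) = f1, f2
    scope = s1 + tuple(v for v in s2 if v not in s1)
    table = {}
    for vals in assignments([0, 1], repeat=len(scope)):
        a = dict(zip(scope, vals))                    # assignment: var -> value
        table[vals] = (t1[tuple(a[v] for v in s1)] *
                       t2[tuple(a[v] for v in s2)])
    return scope, table

phi1 = (("A", "B"), {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10})
phi2 = (("B", "C"), {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100})
phi3 = factor_product(phi1, phi2)   # e.g., phi3[1][(0, 0, 0)] == 3000
```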
Factor Marginalization
• Given X and Y (Y ∉ X), we can turn a factor ϕ(X, Y) into a factor ψ(X) via marginalization: ψ(X) = Σ_Y ϕ(X, Y).

"Summing out" B from the CPD P(C|A,B):

P(C|A,B)  (A,B)=(0,0)  (0,1)  (1,0)  (1,1)
C=0       0.5          0.4    0.2    0.1
C=1       0.5          0.6    0.8    0.9

A C ψ(A,C)
0 0 0.9
0 1 1.1
1 0 0.3
1 1 1.7
Factor Marginalization
• "Summing out" C from the same CPD P(C|A,B) instead:

P(C|A,B)  (A,B)=(0,0)  (0,1)  (1,0)  (1,1)
C=0       0.5          0.4    0.2    0.1
C=1       0.5          0.6    0.8    0.9

A B ψ(A,B)
0 0 1
0 1 1
1 0 1
1 1 1
Factor Marginalization
• We can refer to this new factor by Σ_Y ϕ.
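Marginalization is a one-pass sum over the factor's table. A sketch, using the same (scope, table) representation as the product sketch above:

```python
def marginalize(factor, var):
    """Sum out `var`, returning a factor over the remaining scope."""
    scope, table = factor
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    new_table = {}
    for vals, p in table.items():
        key = vals[:i] + vals[i + 1:]                 # drop var's coordinate
        new_table[key] = new_table.get(key, 0.0) + p
    return new_scope, new_table

psi = marginalize(phi3, "B")   # "summing out" B gives a factor over (A, C)
```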
Marginalizing Everything?
• Take a Markov network's "product factor" by multiplying all of its factors.
• Sum out all the variables (one by one).
• What do you get? (A single number: the partition function Z.)
Factors Are Like Numbers
• Products are commutative: ϕ1 · ϕ2 = ϕ2 · ϕ1
• Products are associative: (ϕ1 · ϕ2) · ϕ3 = ϕ1 · (ϕ2 · ϕ3)
• Sums are commutative: Σ_X Σ_Y ϕ = Σ_Y Σ_X ϕ
• Distributivity of multiplication over summation: if X ∉ Scope(ϕ1), then ϕ1 · (Σ_X ϕ2) = Σ_X (ϕ1 · ϕ2)
Eliminating One Variable
Input: set of factors Φ, variable Z to eliminate
Output: new set of factors Ψ
1. Let Φ′ = {ϕ ∈ Φ | Z ∈ Scope(ϕ)}
2. Let Ψ = {ϕ ∈ Φ | Z ∉ Scope(ϕ)}
3. Let ψ be Σ_Z Π_{ϕ ∈ Φ′} ϕ
4. Return Ψ ∪ {ψ}
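The four steps translate almost line by line into code. A sketch, reusing factor_product and marginalize from the earlier sketches:

```python
def eliminate_one(factors, z):
    relevant = [f for f in factors if z in f[0]]       # 1. Φ' (scope mentions Z)
    psi_set  = [f for f in factors if z not in f[0]]   # 2. Ψ
    if not relevant:                                   # nothing mentions Z
        return psi_set
    prod = relevant[0]
    for f in relevant[1:]:
        prod = factor_product(prod, f)                 # Π over Φ'
    return psi_set + [marginalize(prod, z)]            # 3-4. Ψ ∪ {Σ_Z Π ϕ}
```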
Example
• Query: P(Flu | runny nose)
• Let's eliminate H.
[Figure: Bayesian network with Flu and All. (allergy) as parents of S.I. (sinus infection), which is the parent of R.N. (runny nose) and H. (headache). Factors: ϕF, ϕA, ϕFAS, ϕSR, ϕSH.]
Example
• Query: P(Flu | runny nose)
• Let's eliminate H:
  1. Φ′ = {ϕSH}
  2. Ψ = {ϕF, ϕA, ϕFAS, ϕSR}
  3. ψ = Σ_H Π_{ϕ ∈ Φ′} ϕ = Σ_H ϕSH
  4. Return Ψ ∪ {ψ}
[Network and factors as above.]
Example
• Concretely, ψ(S) = Σ_H P(H|S):

P(H|S)  S=0  S=1
H=0     0.8  0.1
H=1     0.2  0.9

S ψ(S)
0 1.0
1 1.0
Example
• After eliminating H, the new factor ψ(S) replaces ϕSH.
[Network: Flu, All. → S.I. → R.N.; factors ϕF, ϕA, ϕFAS, ϕSR, ψ.]
Example
• Query: P(Flu | runny nose)
• We can actually ignore the new factor, equivalently just deleting H!
  – Why? ψ(S) is identically 1.0: summing a CPD over its child variable always gives 1.
  – In some cases eliminating a variable is really easy!
[Network: Flu, All. → S.I. → R.N.; factors ϕF, ϕA, ϕFAS, ϕSR.]
Variable Elimination
Input: set of factors Φ, ordered list of variables Z to eliminate
Output: new factor
1. For each Zi ∈ Z (in order):
  – Let Φ = Eliminate-One(Φ, Zi)
2. Return Π_{ϕ ∈ Φ} ϕ
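The full algorithm is then a loop over the elimination order followed by a final product. A sketch on top of eliminate_one:

```python
def variable_elimination(factors, order):
    for z in order:                          # 1. eliminate each Zi in order
        factors = eliminate_one(factors, z)
    result = factors[0]                      # 2. product of remaining factors
    for f in factors[1:]:
        result = factor_product(result, f)
    return result
```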
Example
• Query: P(Flu | runny nose)
• H is already eliminated.
• Let's now eliminate S.
[Network: Flu, All. → S.I. → R.N.; factors ϕF, ϕA, ϕFAS, ϕSR.]
Example
• Query: P(Flu | runny nose)
• Eliminating S:
  1. Φ′ = {ϕSR, ϕFAS}
  2. Ψ = {ϕF, ϕA}
  3. ψFAR = Σ_S ϕSR · ϕFAS
  4. Return Ψ ∪ {ψFAR}
[Network and factors as above.]
Example
• After eliminating S, ψFAR replaces ϕSR and ϕFAS.
[Network: Flu, All., R.N.; factors ϕF, ϕA, ψFAR.]
Example
• Query: P(Flu | runny nose)
• Finally, eliminate A.
[Factors: ϕF, ϕA, ψFAR.]
Example
• Eliminating A:
  1. Φ′ = {ϕA, ψFAR}
  2. Ψ = {ϕF}
  3. ψFR = Σ_A ϕA · ψFAR
  4. Return Ψ ∪ {ψFR}
• We are left with factors ϕF and ψFR over Flu and R.N.
Markov Chain, Again
• Earlier, we eliminated A, then B, then C.
[Chain A → B → C → D with CPTs P(B|A), P(C|B), P(D|C).]
Markov Chain, Again
• Now let's start by eliminating C.
[Same chain and CPTs.]
Markov Chain, Again
• Eliminating C first means taking the product ϕ′(B, C, D) = P(C|B) · P(D|C), a factor with 2³ = 8 entries.
[Tables: P(C|B) · P(D|C) = ϕ′(B, C, D); values not shown on the slide.]
Markov Chain, Again
• Then sum out C: ψ(B, D) = Σ_C ϕ′(B, C, D), a factor with 4 entries.
[Tables: ϕ′(B, C, D) and ψ(B, D); values not shown.]
Markov Chain, Again
• Eliminating B will be similarly complex.
[Remaining network: A, B, D with factor ψ(B, D).]
Variable Elimination: Comments
• Can prune away all non-ancestors of the query variables.
• Ordering makes a difference!
• Works for Markov networks and Bayesian networks.
  – Factors need not be CPDs and, in general, new factors won't be.
What about Evidence?
• So far, we've just considered the posterior/marginal P(Y).
• Next: the conditional distribution P(Y | X = x).
• It's almost the same: the additional step is to reduce factors to respect the evidence.
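Reduction just drops the table rows that disagree with the evidence (and drops the evidence variable from the scope). A sketch in the same factor representation as before:

```python
def reduce_factor(factor, var, value):
    """Condition a factor on var = value; no-op if var is out of scope."""
    scope, table = factor
    if var not in scope:
        return factor
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    new_table = {vals[:i] + vals[i + 1:]: p
                 for vals, p in table.items() if vals[i] == value}
    return new_scope, new_table
```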
Example
• Query: P(Flu | runny nose)
• Let's reduce to R = true (runny nose).
[Network: Flu, All. → S.I. → R.N., H.; factors ϕF, ϕA, ϕFAS, ϕSR, ϕSH; CPT P(R|S) shown on the slide.]
Example
• Reducing ϕSR(S, R) to R = 1 keeps only the rows consistent with the evidence, leaving a factor ϕ′S over S alone.
[Tables: P(R|S), ϕSR(S, R), and the reduced factor ϕ′S(S); values not reproduced.]
Example
• After the reduction, ϕ′S replaces ϕSR, and R.N. leaves the network.
[Network: Flu, All. → S.I. → H.; factors ϕF, ϕA, ϕFAS, ϕ′S, ϕSH.]
Example
• Query: P(Flu | runny nose)
• Now run variable elimination all the way down to one factor (for F):
  – H can be pruned for the same reasons as before.
  – Eliminate S, giving ψFA.
  – Eliminate A, giving ψF.
  – Take the final product: ϕF · ψF.
Variable Elimination for Conditional Probabilities
Input: graphical model on V, set of query variables Y, evidence X = x
Output: factor ϕ and scalar α
1. Φ = factors in the model
2. Reduce factors in Φ by X = x
3. Choose a variable ordering on Z = V \ Y \ X
4. ϕ = Variable-Elimination(Φ, Z)
5. α = Σ_{y ∈ Val(Y)} ϕ(y)
6. Return ϕ, α
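Putting the pieces together, steps 1–6 become a short driver. A sketch reusing reduce_factor and variable_elimination from above; `evidence` is a hypothetical dict mapping evidence variables to their observed values:

```python
def conditional_query(factors, evidence, order):
    for var, val in evidence.items():                     # 2. reduce by X = x
        factors = [reduce_factor(f, var, val) for f in factors]
    scope, table = variable_elimination(factors, order)   # 4. eliminate Z
    alpha = sum(table.values())                           # 5. normalizer
    posterior = {y: p / alpha for y, p in table.items()}  # P(Y | X = x)
    return scope, posterior, alpha
```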
Note
• For Bayesian networks, the final factor will be P(Y, X = x) and the sum α = P(X = x).
• This equates to a Gibbs distribution with partition function = α.
Variable Elimination
• In general, the cost is exponential in the induced width corresponding to the ordering you choose.
• It's NP-hard to find the best elimination ordering.
• If you can avoid "big" intermediate factors, you can make inference linear in the size of the original factors.
  – Chordal graphs
  – Polytrees
Additional Comments
• Runtime depends on the size of the intermediate factors.
• Hence, the variable elimination ordering matters a lot.
  – But it's NP-hard to find the best one.
  – For MNs, chordal graphs permit inference in time linear in the size of the original factors.
  – For BNs, polytree structures do the same.
Getting Back to NLP
• Traditional structured NLP models were sometimes subconsciously chosen for these properties.
  – HMMs, PCFGs (with a little work)
  – But not: IBM Model 3
• Need MAP inference for decoding!
• Need approximate inference for complex models!
From Marginals to MAP
• Replace factor marginalization steps with maximization.
  – Add bookkeeping to keep track of the maximizing values.
• Add a traceback at the end to recover the solution.
• This is analogous to the connection between the forward algorithm and the Viterbi algorithm.
  – The ordering challenge is the same.
Factor Maximization
• Given X and Y (Y ∉ X), we can turn a factor ϕ(X, Y) into a factor ψ(X) via maximization: ψ(X) = max_Y ϕ(X, Y).
• We can refer to this new factor by max_Y ϕ.
Factor Maximization
• "Maximizing out" B from ϕ(A, B, C):

A B C ϕ(A,B,C)
0 0 0 0.9
0 0 1 0.3
0 1 0 1.1
0 1 1 1.7
1 0 0 0.4
1 0 1 0.7
1 1 0 1.1
1 1 1 0.2

A C ψ(A,C)
0 0 1.1 (B=1)
0 1 1.7 (B=1)
1 0 1.1 (B=1)
1 1 0.7 (B=0)
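Maximizing out a variable looks just like summing it out, except we keep a max and record the argmax for the later traceback. A sketch in the same factor representation:

```python
def max_marginalize(factor, var):
    """'Max out' var; also return the maximizing value of var per row."""
    scope, table = factor
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    best, argmax = {}, {}
    for vals, p in table.items():
        key = vals[:i] + vals[i + 1:]
        if key not in best or p > best[key]:
            best[key] = p
            argmax[key] = vals[i]        # bookkeeping for the traceback
    return (new_scope, best), argmax
```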
Distributive Property
• A useful property we exploited in variable elimination: if X ∉ Scope(ϕ1), then ϕ1 · (Σ_X ϕ2) = Σ_X (ϕ1 · ϕ2).
• Under the same conditions, factor multiplication distributes over max, too: ϕ1 · (max_X ϕ2) = max_X (ϕ1 · ϕ2).
Traceback
Input: sequence of factors with associated variables: (ψ_Z1, …, ψ_Zk)
Output: z*
• Each ψ_Zi is a factor whose scope includes Zi and the variables eliminated after Zi.
• Work backwards from i = k to 1:
  – Let z*_i = argmax_z ψ_Zi(z, z*_{i+1}, z*_{i+2}, …, z*_k)
• Return z*
About the Traceback
• No extra (asymptotic) expense.
  – Linear traversal over the intermediate factors.
• The factor operations for both sum-product VE and max-product VE can be generalized.
  – Example: get the K most likely assignments.
Eliminating One Variable (Max-Product Version with Bookkeeping)
Input: set of factors Φ, variable Z to eliminate
Output: new set of factors Ψ, bookkeeping factor ψ
1. Let Φ′ = {ϕ ∈ Φ | Z ∈ Scope(ϕ)}
2. Let Ψ = {ϕ ∈ Φ | Z ∉ Scope(ϕ)}
3. Let τ be max_Z Π_{ϕ ∈ Φ′} ϕ
  – Let ψ be Π_{ϕ ∈ Φ′} ϕ (bookkeeping)
4. Return Ψ ∪ {τ}, ψ
Variable Elimination (Max-Product Version with Decoding)
Input: set of factors Φ, ordered list of variables Z to eliminate
Output: new factor, decoded assignment
1. For each Zi ∈ Z (in order):
  – Let (Φ, ψ_Zi) = Eliminate-One(Φ, Zi)
2. Return Π_{ϕ ∈ Φ} ϕ, Traceback({ψ_Zi})
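Here is how the bookkeeping and traceback fit together in code: a sketch that reuses factor_product and max_marginalize from above, and assumes the elimination order covers every variable mentioned by the factors (pure MAP decoding).

```python
def map_inference(factors, order):
    records = []
    for z in order:
        relevant = [f for f in factors if z in f[0]]      # Φ'
        factors  = [f for f in factors if z not in f[0]]  # Ψ
        if not relevant:                                  # skip unmentioned vars
            continue
        prod = relevant[0]
        for f in relevant[1:]:
            prod = factor_product(prod, f)                # ψ_Zi (bookkeeping)
        tau, argmax = max_marginalize(prod, z)
        factors.append(tau)
        records.append((prod[0], z, argmax))
    # Traceback: work backwards, plugging in the values already decided.
    assignment = {}
    for scope, z, argmax in reversed(records):
        key = tuple(assignment[v] for v in scope if v != z)
        assignment[z] = argmax[key]
    return assignment
```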
Variable Elimination Tips
• Any ordering will be correct.
• Most orderings will be too expensive.
• There are heuristics for choosing an ordering (you are welcome to find them and test them out).
(Rocket Science: True MAP)
• Evidence: X = x
• Query: Y
• Other variables: Z = V \ X \ Y
• First marginalize out Z, then do MAP inference over Y given X = x.
• This is not usually attempted in NLP, with some exceptions.
Sketch of Gibbs Sampling
• MCMC: design (on paper) a graph where each configuration from Val(V) is a node.
  – Transitions in the graph are designed to give a Markov chain whose stationary distribution is the posterior.
• Simulate a random walk in the graph.
• If you walk long enough, your position is distributed according to P(V).
Transitions in Gibbs Sampling
• A transition in the Markov chain equates to changing a subset of the random variables.
• Gibbs: resample Vi's value according to P(Vi | V \ {Vi}).
  – Only need the local factors that affect Vi: take their product, marginalize, and randomly choose a new value (see the sketch below).
• Simply lock evidence variables X.
• The maximizing version gradually shifts the sampler in favor of the most probable value for Vi.
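A single Gibbs transition, sketched in the same factor representation used throughout; `assignment` is a dict from every variable to its current value, and only the factors touching `var` enter the computation.

```python
import random

def gibbs_resample(assignment, var, factors, values=(0, 1)):
    local = [f for f in factors if var in f[0]]   # factors in var's Markov blanket
    weights = []
    for v in values:                              # unnormalized P(var = v | rest)
        assignment[var] = v
        w = 1.0
        for scope, table in local:
            w *= table[tuple(assignment[u] for u in scope)]
        weights.append(w)
    r = random.random() * sum(weights)            # sample proportionally to weights
    for v, w in zip(values, weights):
        r -= w
        if r <= 0:
            assignment[var] = v
            break
    return assignment
```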
Sketch of Mean Field Variational Inference
• Inference with our distribution P is hard.
• Choose an "easier" distribution family Q. Then find: Q* = argmin_{Q ∈ Q} KL(Q ‖ P).
• Usually iterative methods are required to "fit" Q to P.
  – These often resemble familiar learning algorithms like EM!
Energy Functional
• The energy functional F[P̃, Q] = E_Q[log P̃(V)] + H(Q) is a lower bound on log P(X = x).
• Expectations are under the simpler distribution family, Q.
  – Every element of Q is an approximate solution.
  – We try to find the best one.
Tangent: Variational Methods
• This is a simple example. For any λ > 0 and any x > 0:
  −ln(x) ≥ −λx + ln(λ) + 1
  (a family of lower bounds g_λ(x))
• Further, for any x, there is some λ where the bound is tight (here, λ = 1/x).
  – λ is called a variational parameter.
• For us, log P(X = x) is like −ln(x), and Q is like λ.
Structured Variational Approach
• Maximize the energy functional over a family Q that is well-defined.
  – A graphical model!
  – Probably not an I-map for P. (The bound isn't tight.)
• Simpler structures lead to easier inference.
  – Mean field is the simplest: Q(V) = Π_i Q_i(V_i).
Parting Shots
• You will probably never implement the general variable elimination algorithm.
• You will rarely use exact inference.
• There is value in understanding the problem that approximation methods are trying to solve, and what an exact (if intractable) solution would look like!