Lecture 2: Inference
Inference: A Ubiquitous Obstacle
• Decoding is inference.
• Subroutines for learning are inference.
• Learning is inference.
• Exact inference is #P-complete.
  – Even approximations within a given absolute or relative error are hard.
Probabilistic Inference Problems
Given values for some random variables (X ⊂ V)…
• Most Probable Explanation: what are the most probable values of the rest of the r.v.s, V \ X?
(More generally…)
• Maximum A Posteriori (MAP): what are the most probable values of some other r.v.s, Y ⊂ (V \ X)?
• Random sampling from the posterior over values of Y
• Full posterior over values of Y
• Marginal probabilities from the posterior over Y
• Minimum Bayes risk: what is the Y with the lowest expected cost?
• Cost-augmented decoding: what is the most "dangerous" Y?
Approaches to Inference
• Exact:
  – variable elimination (dynamic programming) [today]
  – ILP
• Approximate:
  – Randomized: MCMC (Gibbs sampling), importance sampling, randomized search (simulated annealing)
  – Deterministic: variational (mean field), loopy belief propagation, LP relaxations / dual decomposition [lecture 6], local search, beam search
Exact Marginal for Y
• This will be a generalization of algorithms you already know: the forward and backward algorithms.
• The general name is variable elimination.
• After we see it for the marginal, we'll see how to use it for the MAP.
Simple Inference Example
• Goal: P(D)
[Figure: Markov chain A → B → C → D over binary variables, with CPTs P(B|A), P(C|B), P(D|C); table values not reproduced.]
Simple Inference Example
• Let's calculate P(B) from things we have.
• Note that C and D do not matter.
[Chain A → B → C → D with its CPTs.]
Simple Inference Example
• Let's calculate P(B) from things we have: P(B = b) = Σ_a P(A = a) · P(B = b | A = a).
[Tables: P(A) and P(B|A) combine to give P(B); summing out A.]
Simple Inference Example
• We now have a Bayesian network for the marginal distribution P(B, C, D).
[Chain B → C → D with CPTs P(B), P(C|B), P(D|C).]
Simple Inference Example
• We can repeat the same process to calculate P(C).
• We already have P(B)!
[Chain B → C → D.]
Simple Inference Example
• We can repeat the same process to calculate P(C): P(C = c) = Σ_b P(B = b) · P(C = c | B = b).
[Tables: P(B) and P(C|B) combine to give P(C); summing out B.]
Simple Inference Example
• We now have P(C, D).
• Marginalizing out A and B happened in two steps, and we are exploiting the Bayesian network structure.
[Chain C → D with CPTs P(C), P(D|C).]
Simple Inference Example
• Last step to get P(D): P(D = d) = Σ_c P(C = c) · P(D = d | C = c).
[Tables: P(C) and P(D|C) combine to give P(D); summing out C.]
Simple Inference Example
• Notice that the same step happened for each random variable:
  – We created a new CPD over the variable and its "successor."
  – We summed out (marginalized) the variable.
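For concreteness, the whole chain computation fits in a few lines of numpy. This is a minimal sketch with hypothetical CPT values (the slides' actual numbers are not reproduced above); each step sums out one variable, exactly as on the slides.

```python
import numpy as np

# Hypothetical CPTs for the binary chain A -> B -> C -> D.
p_A = np.array([0.6, 0.4])              # p_A[a] = P(A = a)
p_B_given_A = np.array([[0.7, 0.3],     # row = a, column = b
                        [0.2, 0.8]])
p_C_given_B = np.array([[0.9, 0.1],
                        [0.5, 0.5]])
p_D_given_C = np.array([[0.4, 0.6],
                        [0.1, 0.9]])

p_B = p_A @ p_B_given_A   # sum out A: P(b) = sum_a P(a) P(b|a)
p_C = p_B @ p_C_given_B   # sum out B
p_D = p_C @ p_D_given_C   # sum out C
print(p_D)                # the marginal P(D)
```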
That Was Variable Elimination
• We reused computation from previous steps and avoided doing the same work more than once.
  – Dynamic programming à la the forward algorithm!
• We exploited the Bayesian network structure (each subexpression only depends on a small number of variables).
• Exponential blowup avoided!
What Remains
• Some machinery
• Variable elimination in general
• The maximization version (for MAP inference)
• A bit about approximate inference
Factor Graphs
• Variable nodes (circles)
• Factor nodes (squares)
  – Can be MN factors or BN conditional probability distributions!
• Edge between a variable and a factor if the factor depends on that variable.
• The graph is bipartite.
[Figure: factor graph over variables X, Y, Z with factors φ1, φ2, φ3, φ4.]
Products of Factors
• Given two factors with different scopes, we can calculate a new factor equal to their product: here, ϕ3(A, B, C) = ϕ1(A, B) · ϕ2(B, C).

A B ϕ1(A,B)
0 0 30
0 1 5
1 0 1
1 1 10

B C ϕ2(B,C)
0 0 100
0 1 1
1 0 1
1 1 100

ϕ1 · ϕ2:

A B C ϕ3(A,B,C)
0 0 0 3000
0 0 1 30
0 1 0 5
0 1 1 500
1 0 0 100
1 0 1 1
1 1 0 10
1 1 1 1000
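A factor product is easy to code directly. Below is a minimal sketch that represents a factor as a (scope, table) pair, where scope is a tuple of variable names and table maps each tuple of values to a number; it assumes binary variables for brevity. The ϕ1 and ϕ2 values are the ones from the tables above.

```python
from itertools import product as assignments

def factor_product(f1, f2):
    """Multiply two factors; the result's scope is the union of scopes."""
    (s1, t1), (s2, t2) = f1, f2
    scope = s1 + tuple(v for v in s2 if v not in s1)
    table = {}
    for vals in assignments([0, 1], repeat=len(scope)):
        a = dict(zip(scope, vals))                    # assignment: var -> value
        table[vals] = (t1[tuple(a[v] for v in s1)] *
                       t2[tuple(a[v] for v in s2)])
    return scope, table

phi1 = (("A", "B"), {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10})
phi2 = (("B", "C"), {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100})
phi3 = factor_product(phi1, phi2)   # e.g., phi3[1][(0, 0, 0)] == 3000
```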
Factor Marginalization
• Given X and Y (Y ∉ X), we can turn a factor ϕ(X, Y) into a factor ψ(X) via marginalization: ψ(X) = Σ_Y ϕ(X, Y).

"Summing out" B from the CPD P(C|A,B):

P(C|A,B)  (A,B)=(0,0)  (0,1)  (1,0)  (1,1)
C=0       0.5          0.4    0.2    0.1
C=1       0.5          0.6    0.8    0.9

A C ψ(A,C)
0 0 0.9
0 1 1.1
1 0 0.3
1 1 1.7
Factor Marginalization
• "Summing out" C from the same CPD P(C|A,B) instead:

P(C|A,B)  (A,B)=(0,0)  (0,1)  (1,0)  (1,1)
C=0       0.5          0.4    0.2    0.1
C=1       0.5          0.6    0.8    0.9

A B ψ(A,B)
0 0 1
0 1 1
1 0 1
1 1 1
Factor Marginalization
• We can refer to this new factor by Σ_Y ϕ.
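Marginalization is a one-pass sum over the factor's table. A sketch, using the same (scope, table) representation as the product sketch above:

```python
def marginalize(factor, var):
    """Sum out `var`, returning a factor over the remaining scope."""
    scope, table = factor
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    new_table = {}
    for vals, p in table.items():
        key = vals[:i] + vals[i + 1:]                 # drop var's coordinate
        new_table[key] = new_table.get(key, 0.0) + p
    return new_scope, new_table

psi = marginalize(phi3, "B")   # "summing out" B gives a factor over (A, C)
```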
Marginalizing Everything?
• Take a Markov network's "product factor" by multiplying all of its factors.
• Sum out all the variables (one by one).
• What do you get? (A single number: the partition function Z.)
Factors Are Like Numbers
• Products are commutative: ϕ1 · ϕ2 = ϕ2 · ϕ1
• Products are associative: (ϕ1 · ϕ2) · ϕ3 = ϕ1 · (ϕ2 · ϕ3)
• Sums are commutative: Σ_X Σ_Y ϕ = Σ_Y Σ_X ϕ
• Distributivity of multiplication over summation: if X ∉ Scope(ϕ1), then ϕ1 · (Σ_X ϕ2) = Σ_X (ϕ1 · ϕ2)
Eliminating One Variable
Input: set of factors Φ, variable Z to eliminate
Output: new set of factors Ψ
1. Let Φ′ = {ϕ ∈ Φ | Z ∈ Scope(ϕ)}
2. Let Ψ = {ϕ ∈ Φ | Z ∉ Scope(ϕ)}
3. Let ψ be Σ_Z Π_{ϕ ∈ Φ′} ϕ
4. Return Ψ ∪ {ψ}
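The four steps translate almost line by line into code. A sketch, reusing factor_product and marginalize from the earlier sketches:

```python
def eliminate_one(factors, z):
    relevant = [f for f in factors if z in f[0]]       # 1. Φ' (scope mentions Z)
    psi_set  = [f for f in factors if z not in f[0]]   # 2. Ψ
    if not relevant:                                   # nothing mentions Z
        return psi_set
    prod = relevant[0]
    for f in relevant[1:]:
        prod = factor_product(prod, f)                 # Π over Φ'
    return psi_set + [marginalize(prod, z)]            # 3-4. Ψ ∪ {Σ_Z Π ϕ}
```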
Example
• Query: P(Flu | runny nose)
• Let's eliminate H.
[Figure: Bayesian network with Flu and All. (allergy) as parents of S.I. (sinus infection), which is the parent of R.N. (runny nose) and H. (headache). Factors: ϕF, ϕA, ϕFAS, ϕSR, ϕSH.]
Example
• Query: P(Flu | runny nose)
• Let's eliminate H:
  1. Φ′ = {ϕSH}
  2. Ψ = {ϕF, ϕA, ϕFAS, ϕSR}
  3. ψ = Σ_H Π_{ϕ ∈ Φ′} ϕ = Σ_H ϕSH
  4. Return Ψ ∪ {ψ}
[Network and factors as above.]
Example
• Concretely, ψ(S) = Σ_H P(H|S):

P(H|S)  S=0  S=1
H=0     0.8  0.1
H=1     0.2  0.9

S ψ(S)
0 1.0
1 1.0
Example
• After eliminating H, the new factor ψ(S) replaces ϕSH.
[Network: Flu, All. → S.I. → R.N.; factors ϕF, ϕA, ϕFAS, ϕSR, ψ.]
Example
• Query: P(Flu | runny nose)
• We can actually ignore the new factor, equivalently just deleting H!
  – Why? ψ(S) is identically 1.0: summing a CPD over its child variable always gives 1.
  – In some cases eliminating a variable is really easy!
[Network: Flu, All. → S.I. → R.N.; factors ϕF, ϕA, ϕFAS, ϕSR.]
Variable Elimination
Input: set of factors Φ, ordered list of variables Z to eliminate
Output: new factor
1. For each Zi ∈ Z (in order):
  – Let Φ = Eliminate-One(Φ, Zi)
2. Return Π_{ϕ ∈ Φ} ϕ
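The full algorithm is then a loop over the elimination order followed by a final product. A sketch on top of eliminate_one:

```python
def variable_elimination(factors, order):
    for z in order:                          # 1. eliminate each Zi in order
        factors = eliminate_one(factors, z)
    result = factors[0]                      # 2. product of remaining factors
    for f in factors[1:]:
        result = factor_product(result, f)
    return result
```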
Example
• Query: P(Flu | runny nose)
• H is already eliminated.
• Let's now eliminate S.
[Network: Flu, All. → S.I. → R.N.; factors ϕF, ϕA, ϕFAS, ϕSR.]
Example
• Query: P(Flu | runny nose)
• Eliminating S:
  1. Φ′ = {ϕSR, ϕFAS}
  2. Ψ = {ϕF, ϕA}
  3. ψFAR = Σ_S ϕSR · ϕFAS
  4. Return Ψ ∪ {ψFAR}
[Network and factors as above.]
Example
• After eliminating S, ψFAR replaces ϕSR and ϕFAS.
[Network: Flu, All., R.N.; factors ϕF, ϕA, ψFAR.]
Example
• Query: P(Flu | runny nose)
• Finally, eliminate A.
[Factors: ϕF, ϕA, ψFAR.]
Example
• Eliminating A:
  1. Φ′ = {ϕA, ψFAR}
  2. Ψ = {ϕF}
  3. ψFR = Σ_A ϕA · ψFAR
  4. Return Ψ ∪ {ψFR}
• We are left with factors ϕF and ψFR over Flu and R.N.
Markov Chain, Again
• Earlier, we eliminated A, then B, then C.
[Chain A → B → C → D with CPTs P(B|A), P(C|B), P(D|C).]
Markov Chain, Again
• Now let's start by eliminating C.
[Same chain and CPTs.]
Markov Chain, Again
• Eliminating C first means taking the product ϕ′(B, C, D) = P(C|B) · P(D|C), a factor with 2³ = 8 entries.
[Tables: P(C|B) · P(D|C) = ϕ′(B, C, D); values not shown on the slide.]
Markov Chain, Again
• Then sum out C: ψ(B, D) = Σ_C ϕ′(B, C, D), a factor with 4 entries.
[Tables: ϕ′(B, C, D) and ψ(B, D); values not shown.]
Markov Chain, Again
• Eliminating B will be similarly complex.
[Remaining network: A, B, D with factor ψ(B, D).]
Variable Elimination: Comments
• Can prune away all non-ancestors of the query variables.
• Ordering makes a difference!
• Works for Markov networks and Bayesian networks.
  – Factors need not be CPDs and, in general, new factors won't be.
What about Evidence?
• So far, we've just considered the posterior/marginal P(Y).
• Next: the conditional distribution P(Y | X = x).
• It's almost the same: the additional step is to reduce factors to respect the evidence.
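Reduction just drops the table rows that disagree with the evidence (and drops the evidence variable from the scope). A sketch in the same factor representation as before:

```python
def reduce_factor(factor, var, value):
    """Condition a factor on var = value; no-op if var is out of scope."""
    scope, table = factor
    if var not in scope:
        return factor
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    new_table = {vals[:i] + vals[i + 1:]: p
                 for vals, p in table.items() if vals[i] == value}
    return new_scope, new_table
```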
Example
• Query: P(Flu | runny nose)
• Let's reduce to R = true (runny nose).
[Network: Flu, All. → S.I. → R.N., H.; factors ϕF, ϕA, ϕFAS, ϕSR, ϕSH; CPT P(R|S) shown on the slide.]
Example
• Reducing ϕSR(S, R) to R = 1 keeps only the rows consistent with the evidence, leaving a factor ϕ′S over S alone.
[Tables: P(R|S), ϕSR(S, R), and the reduced factor ϕ′S(S); values not reproduced.]
Example
• After the reduction, ϕ′S replaces ϕSR, and R.N. leaves the network.
[Network: Flu, All. → S.I. → H.; factors ϕF, ϕA, ϕFAS, ϕ′S, ϕSH.]
Example
• Query: P(Flu | runny nose)
• Now run variable elimination all the way down to one factor (for F):
  – H can be pruned for the same reasons as before.
  – Eliminate S, giving ψFA.
  – Eliminate A, giving ψF.
  – Take the final product: ϕF · ψF.
Variable Elimination for Conditional Probabilities
Input: graphical model on V, set of query variables Y, evidence X = x
Output: factor ϕ and scalar α
1. Φ = factors in the model
2. Reduce factors in Φ by X = x
3. Choose a variable ordering on Z = V \ Y \ X
4. ϕ = Variable-Elimination(Φ, Z)
5. α = Σ_{y ∈ Val(Y)} ϕ(y)
6. Return ϕ, α
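Putting the pieces together, steps 1–6 become a short driver. A sketch reusing reduce_factor and variable_elimination from above; `evidence` is a hypothetical dict mapping evidence variables to their observed values:

```python
def conditional_query(factors, evidence, order):
    for var, val in evidence.items():                     # 2. reduce by X = x
        factors = [reduce_factor(f, var, val) for f in factors]
    scope, table = variable_elimination(factors, order)   # 4. eliminate Z
    alpha = sum(table.values())                           # 5. normalizer
    posterior = {y: p / alpha for y, p in table.items()}  # P(Y | X = x)
    return scope, posterior, alpha
```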
Note
• For Bayesian networks, the final factor will be P(Y, X = x) and the sum α = P(X = x).
• This equates to a Gibbs distribution with partition function = α.
Variable Elimination
• In general, the cost is exponential in the induced width corresponding to the ordering you choose.
• It's NP-hard to find the best elimination ordering.
• If you can avoid "big" intermediate factors, you can make inference linear in the size of the original factors.
  – Chordal graphs
  – Polytrees
Additional Comments
• Runtime depends on the size of the intermediate factors.
• Hence, the variable elimination ordering matters a lot.
  – But it's NP-hard to find the best one.
  – For MNs, chordal graphs permit inference in time linear in the size of the original factors.
  – For BNs, polytree structures do the same.
Getting Back to NLP
• Traditional structured NLP models were sometimes subconsciously chosen for these properties.
  – HMMs, PCFGs (with a little work)
  – But not: IBM Model 3
• Need MAP inference for decoding!
• Need approximate inference for complex models!
From Marginals to MAP
• Replace factor marginalization steps with maximization.
  – Add bookkeeping to keep track of the maximizing values.
• Add a traceback at the end to recover the solution.
• This is analogous to the connection between the forward algorithm and the Viterbi algorithm.
  – The ordering challenge is the same.
Factor Maximization
• Given X and Y (Y ∉ X), we can turn a factor ϕ(X, Y) into a factor ψ(X) via maximization: ψ(X) = max_Y ϕ(X, Y).
• We can refer to this new factor by max_Y ϕ.
Factor Maximization
• "Maximizing out" B from ϕ(A, B, C):

A B C ϕ(A,B,C)
0 0 0 0.9
0 0 1 0.3
0 1 0 1.1
0 1 1 1.7
1 0 0 0.4
1 0 1 0.7
1 1 0 1.1
1 1 1 0.2

A C ψ(A,C)
0 0 1.1 (B=1)
0 1 1.7 (B=1)
1 0 1.1 (B=1)
1 1 0.7 (B=0)
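Maximizing out a variable looks just like summing it out, except we keep a max and record the argmax for the later traceback. A sketch in the same factor representation:

```python
def max_marginalize(factor, var):
    """'Max out' var; also return the maximizing value of var per row."""
    scope, table = factor
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    best, argmax = {}, {}
    for vals, p in table.items():
        key = vals[:i] + vals[i + 1:]
        if key not in best or p > best[key]:
            best[key] = p
            argmax[key] = vals[i]        # bookkeeping for the traceback
    return (new_scope, best), argmax
```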
Distributive Property
• A useful property we exploited in variable elimination: if X ∉ Scope(ϕ1), then ϕ1 · (Σ_X ϕ2) = Σ_X (ϕ1 · ϕ2).
• Under the same conditions, factor multiplication distributes over max, too: ϕ1 · (max_X ϕ2) = max_X (ϕ1 · ϕ2).
Traceback
Input: sequence of factors with associated variables: (ψ_Z1, …, ψ_Zk)
Output: z*
• Each ψ_Zi is a factor whose scope includes Zi and the variables eliminated after Zi.
• Work backwards from i = k to 1:
  – Let z*_i = argmax_z ψ_Zi(z, z*_{i+1}, z*_{i+2}, …, z*_k)
• Return z*
About the Traceback
• No extra (asymptotic) expense.
  – Linear traversal over the intermediate factors.
• The factor operations for both sum-product VE and max-product VE can be generalized.
  – Example: get the K most likely assignments.
Eliminating One Variable (Max-Product Version with Bookkeeping)
Input: set of factors Φ, variable Z to eliminate
Output: new set of factors Ψ, bookkeeping factor ψ
1. Let Φ′ = {ϕ ∈ Φ | Z ∈ Scope(ϕ)}
2. Let Ψ = {ϕ ∈ Φ | Z ∉ Scope(ϕ)}
3. Let τ be max_Z Π_{ϕ ∈ Φ′} ϕ
  – Let ψ be Π_{ϕ ∈ Φ′} ϕ (bookkeeping)
4. Return Ψ ∪ {τ}, ψ
Variable Elimination (Max-Product Version with Decoding)
Input: set of factors Φ, ordered list of variables Z to eliminate
Output: new factor, decoded assignment
1. For each Zi ∈ Z (in order):
  – Let (Φ, ψ_Zi) = Eliminate-One(Φ, Zi)
2. Return Π_{ϕ ∈ Φ} ϕ, Traceback({ψ_Zi})
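Here is how the bookkeeping and traceback fit together in code: a sketch that reuses factor_product and max_marginalize from above, and assumes the elimination order covers every variable mentioned by the factors (pure MAP decoding).

```python
def map_inference(factors, order):
    records = []
    for z in order:
        relevant = [f for f in factors if z in f[0]]      # Φ'
        factors  = [f for f in factors if z not in f[0]]  # Ψ
        if not relevant:                                  # skip unmentioned vars
            continue
        prod = relevant[0]
        for f in relevant[1:]:
            prod = factor_product(prod, f)                # ψ_Zi (bookkeeping)
        tau, argmax = max_marginalize(prod, z)
        factors.append(tau)
        records.append((prod[0], z, argmax))
    # Traceback: work backwards, plugging in the values already decided.
    assignment = {}
    for scope, z, argmax in reversed(records):
        key = tuple(assignment[v] for v in scope if v != z)
        assignment[z] = argmax[key]
    return assignment
```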
Variable Elimination Tips
• Any ordering will be correct.
• Most orderings will be too expensive.
• There are heuristics for choosing an ordering (you are welcome to find them and test them out).
(Rocket Science: True MAP)
• Evidence: X = x
• Query: Y
• Other variables: Z = V \ X \ Y
• First marginalize out Z, then do MAP inference over Y given X = x.
• This is not usually attempted in NLP, with some exceptions.
Sketch of Gibbs Sampling
• MCMC: design (on paper) a graph where each configuration from Val(V) is a node.
  – Transitions in the graph are designed to give a Markov chain whose stationary distribution is the posterior.
• Simulate a random walk in the graph.
• If you walk long enough, your position is distributed according to P(V).
Transitions in Gibbs Sampling
• A transition in the Markov chain equates to changing a subset of the random variables.
• Gibbs: resample Vi's value according to P(Vi | V \ {Vi}).
  – Only need the local factors that affect Vi: take their product, marginalize, and randomly choose a new value (see the sketch below).
• Simply lock evidence variables X.
• The maximizing version gradually shifts the sampler in favor of the most probable value for Vi.
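A single Gibbs transition, sketched in the same factor representation used throughout; `assignment` is a dict from every variable to its current value, and only the factors touching `var` enter the computation.

```python
import random

def gibbs_resample(assignment, var, factors, values=(0, 1)):
    local = [f for f in factors if var in f[0]]   # factors in var's Markov blanket
    weights = []
    for v in values:                              # unnormalized P(var = v | rest)
        assignment[var] = v
        w = 1.0
        for scope, table in local:
            w *= table[tuple(assignment[u] for u in scope)]
        weights.append(w)
    r = random.random() * sum(weights)            # sample proportionally to weights
    for v, w in zip(values, weights):
        r -= w
        if r <= 0:
            assignment[var] = v
            break
    return assignment
```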
Sketch of Mean Field Variational Inference
• Inference with our distribution P is hard.
• Choose an "easier" distribution family Q. Then find: Q* = argmin_{Q ∈ Q} KL(Q ‖ P).
• Usually iterative methods are required to "fit" Q to P.
  – These often resemble familiar learning algorithms like EM!
Energy Functional
• The energy functional F[P̃, Q] = E_Q[log P̃(V)] + H(Q) is a lower bound on log P(X = x).
• Expectations are under the simpler distribution family, Q.
  – Every element of Q is an approximate solution.
  – We try to find the best one.
Tangent: Variational Methods
• This is a simple example. For any λ > 0 and any x > 0:
  −ln(x) ≥ −λx + ln(λ) + 1
  (a family of lower bounds g_λ(x))
• Further, for any x, there is some λ where the bound is tight (here, λ = 1/x).
  – λ is called a variational parameter.
• For us, log P(X = x) is like −ln(x), and Q is like λ.
Structured Variational Approach
• Maximize the energy functional over a family Q that is well-defined.
  – A graphical model!
  – Probably not an I-map for P. (The bound isn't tight.)
• Simpler structures lead to easier inference.
  – Mean field is the simplest: Q(V) = Π_i Q_i(V_i).
Parting Shots
• You will probably never implement the general variable elimination algorithm.
• You will rarely use exact inference.
• There is value in understanding the problem that approximation methods are trying to solve, and what an exact (if intractable) solution would look like!