
Probabilistic Graphical Models: Structure learning in Bayesian networks
Siamak Ravanbakhsh, Fall 2019

Learning objectives

why structure learning is hard
two approaches to structure learning: constraint-based methods and score-based methods
MLE vs. Bayesian score

Structure learning in BayesNets

family of methods:
constraint-based methods: estimate conditional independencies from the data, then find compatible BayesNets
score-based methods: search over the combinatorial space of 2^{O(n²)} structures, maximizing a score
Bayesian model averaging: integrate over all possible structures

Structure learning in BayesNets

the structure is identifiable only up to I-equivalence: a DAG with the same set of conditional independencies (CI) as the data, I(G) = I(p_D), is a perfect map (P-map)
constraint-based methods estimate conditional independencies from the data by hypothesis testing (X ⊥ Y ∣ Z?) and find compatible BayesNets
first attempt: a DAG that is an I-map for p_D, i.e., I(G) ⊆ I(p_D)

Minimal I-map from CI tests

input: CI test oracle; an ordering X_1, …, X_n
output: a minimal I-map G

for i = 1 … n:
    find a minimal U ⊆ {X_1, …, X_{i-1}} such that X_i ⊥ ({X_1, …, X_{i-1}} \ U) ∣ U
    set Pa_{X_i} ← U

this enforces X_i ⊥ NonDesc_{X_i} ∣ Pa_{X_i}
a minimal I-map is a DAG in which removing any edge violates the I-map property (a sketch of this procedure follows)
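To make the procedure concrete, here is a minimal Python sketch against a hypothetical oracle ci(x, ys, zs) that returns True when x is independent of the set ys given zs on the data; the oracle, the function name minimal_imap, and the size-ordered search are illustrative assumptions, not part of the slides.

from itertools import combinations

def minimal_imap(variables, ci):
    """variables gives the ordering X_1, ..., X_n; returns {X_i: set of parents}."""
    parents = {}
    for i, x in enumerate(variables):
        predecessors = variables[:i]
        # try candidate parent sets U in order of increasing size; the first U that
        # makes x independent of the remaining predecessors is kept (smallest such U)
        for k in range(len(predecessors) + 1):
            found = None
            for u in combinations(predecessors, k):
                rest = [v for v in predecessors if v not in u]
                if not rest or ci(x, rest, list(u)):
                    found = set(u)
                    break
            if found is not None:
                parents[x] = found
                break
    return parents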

Problems:
CI tests involve many variables
the number of CI tests is exponential
a minimal I-map may be far from a P-map
different orderings give different graphs

Example: the orderings D,I,S,G,L (a topological ordering), L,S,G,I,D, and L,D,S,I,G produce different minimal I-maps

Structure learning in BayesNets

recall: identifiable only up to I-equivalence, I(G) = I(p_D)
first attempt: a DAG that is an I-map for p_D, I(G) ⊆ I(p_D)
can we find a perfect map with fewer CI tests, each involving fewer variables?
second attempt: a DAG that is a P-map for p_D

Perfect map from CI tests

recoverable only up to I-equivalence: the same set of CIs, i.e., the same skeleton and the same immoralities

procedure:
1. find the undirected skeleton using CI tests
2. identify the immoralities in the undirected graph
3. propagate the orientation constraints

Perfect map from CI tests: 1. finding the undirected skeleton

observation: if X and Y are not adjacent, then X ⊥ Y ∣ Pa_X OR X ⊥ Y ∣ Pa_Y
assumption: the max number of parents is d
idea: search over all conditioning subsets of size ≤ d and check the CI above

input: CI oracle; bound d on the number of parents
output: undirected skeleton
initialize H as the complete undirected graph
for all pairs X_i, X_j:
    for all subsets U of size ≤ d (within the current neighbors of X_i, X_j):
        if X_i ⊥ X_j ∣ U, remove X_i − X_j from H
return H

cost: O(n²) pairs × O((n−2)^d) subsets per pair = O(n^{d+2}) CI tests (a sketch of this phase follows)
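A minimal sketch of this skeleton-recovery phase, assuming a hypothetical pairwise oracle ci(x, y, zs) that returns True iff x ⊥ y ∣ zs on the data; the names skeleton, adj, and sepset are illustrative, and the separating sets are saved for step 2.

from itertools import combinations

def skeleton(variables, ci, d):
    """Return adjacency sets and the separating set found for each removed edge."""
    adj = {x: set(variables) - {x} for x in variables}   # start from the complete graph
    sepset = {}                                          # remembers U for the immorality step
    for x in variables:
        for y in list(adj[x]):
            if y <= x:                                   # test each unordered pair once
                continue
            # candidate conditioning sets: subsets of current neighbors, size <= d
            candidates = (adj[x] | adj[y]) - {x, y}
            removed = False
            for k in range(d + 1):
                for u in combinations(sorted(candidates), k):
                    if ci(x, y, list(u)):
                        adj[x].discard(y); adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(u)
                        removed = True
                        break
                if removed:
                    break
    return adj, sepset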

Perfect map from CI tests: 2. finding the immoralities

potential immorality: X − Z, Y − Z ∈ H and X − Y ∉ H
it is not an immorality only if every separating set U with X ⊥ Y ∣ U satisfies Z ∈ U
in practice: save the set U used when removing X − Y in step 1 and check whether Z ∈ U; if not, we have the immorality X → Z ← Y (a sketch follows)
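A minimal sketch of this step, reusing the adj and sepset structures returned by the skeleton sketch above; both data structures are assumptions of these sketches, not the slides' notation.

def orient_immoralities(adj, sepset):
    """Return the set of directed edges (parent, child) for every detected v-structure."""
    directed = set()
    for z in adj:
        neighbors = sorted(adj[z])
        for i, x in enumerate(neighbors):
            for y in neighbors[i + 1:]:
                if y in adj[x]:                  # X and Y adjacent: not a potential immorality
                    continue
                u = sepset.get(frozenset((x, y)), set())
                if z not in u:                   # Z missing from the separating set => X -> Z <- Y
                    directed.add((x, z))
                    directed.add((y, z))
    return directed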

Perfect map from CI tests: 3. propagate the constraints

at this point we have a mix of directed and undirected edges
add directions using the rules R1, R2, R3 (needed to preserve the immoralities and keep the graph acyclic), repeating until convergence
with exact CI tests, this recovers exactly the I-equivalence family

Example: ground-truth DAG → undirected skeleton + immoralities → remaining edges oriented using rules R1, R2, R3 (a sketch of the propagation loop follows)
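The rules R1, R2, R3 referenced on the slides are not reproduced in this transcript, so the sketch below substitutes two of the standard Meek orientation rules as a stand-in; adj is the (symmetric) skeleton adjacency and directed the set of orientations from the previous sketches, both assumptions of these sketches.

def propagate(adj, directed):
    """Apply R1/R2 repeatedly until no undirected edge can be oriented; returns `directed`."""
    def undirected(a, b):
        return b in adj[a] and (a, b) not in directed and (b, a) not in directed
    changed = True
    while changed:
        changed = False
        for b in adj:
            for c in list(adj[b]):
                if not undirected(b, c):
                    continue
                # R1: a -> b, b - c, and a, c not adjacent  =>  orient b -> c (no new immorality)
                r1 = any((a, b) in directed and c not in adj[a] for a in adj)
                # R2: b -> a -> c and b - c  =>  orient b -> c (otherwise a directed cycle)
                r2 = any((b, a) in directed and (a, c) in directed for a in adj)
                if r1 or r2:
                    directed.add((b, c))
                    changed = True
    return directed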

Conditional independence (CI) test

how to decide X ⊥ Y ∣ Z from the dataset D?

measure the deviance of p_D(X, Y ∣ Z) from p_D(X ∣ Z) p_D(Y ∣ Z):
conditional mutual information: d_I(D) = E_{p_D(Z)}[ D( p_D(X, Y ∣ Z) ∥ p_D(X ∣ Z) p_D(Y ∣ Z) ) ]
χ² statistic: d_{χ²}(D) = ∣D∣ Σ_{x,y,z} ( p_D(x, y, z) − p_D(z) p_D(x ∣ z) p_D(y ∣ z) )² / ( p_D(z) p_D(x ∣ z) p_D(y ∣ z) )
both are computed using frequencies in the dataset

a large deviance d(D) > t rejects the null hypothesis (of conditional independence); pick a threshold t
the p-value is the probability of false rejection, taken over all possible datasets:
pvalue(t) = P( { D : d(D) > t } ∣ X ⊥ Y ∣ Z )
the distribution of the deviance measure can be derived (e.g., a χ² distribution); reject the hypothesis (CI) for small p-values, e.g., below .05 (a sketch of such a test follows)
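A minimal sketch of the χ² CI test for discrete data, implementing the deviance d_{χ²}(D) above; the pandas DataFrame layout, the column arguments, and the .05 level are assumptions for illustration.

import pandas as pd
from scipy.stats import chi2

def ci_test_chi2(data, x, y, z, alpha=0.05):
    """Return True if X ⊥ Y | Z cannot be rejected at level alpha (z: list of column names)."""
    stat, dof = 0.0, 0
    groups = data.groupby(z) if z else [(None, data)]
    for _, block in groups:
        counts = pd.crosstab(block[x], block[y])               # N_{x,y} within this z-configuration
        n_z = counts.values.sum()
        expected = counts.sum(1).values[:, None] * counts.sum(0).values[None, :] / n_z
        mask = expected > 0
        stat += (((counts.values - expected) ** 2)[mask] / expected[mask]).sum()
        dof += (counts.shape[0] - 1) * (counts.shape[1] - 1)
    p_value = chi2.sf(stat, max(dof, 1))                       # tail probability of the deviance under CI
    return p_value >= alpha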

Structure learning in BayesNets

constraint-based methods (above): estimate conditional independencies from the data, find compatible BayesNets
score-based methods (next): search over the combinatorial space of structures, maximizing a score
Bayesian model averaging: integrate over all possible structures

Mutual information

how much information does X encode about Y? the reduction in the uncertainty of X after observing Y

I(X, Y) = H(X) − H(X ∣ Y) = H(Y) − H(Y ∣ X)   (symmetric: I(X, Y) = I(Y, X))
conditional entropy: H(Y ∣ X) = Σ_x p(x) H( p(Y ∣ X = x) )
I(X, Y) = Σ_{x,y} p(x, y) log( p(x, y) / ( p(x) p(y) ) ) = D_KL( p(x, y) ∥ p(x) p(y) ), hence nonnegative
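A small numeric companion to these definitions: mutual information computed from an empirical joint table (the 2×2 counts below are made-up illustrative data).

import numpy as np

def mutual_information(joint):
    """I(X,Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ), with 0 log 0 treated as 0."""
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())

counts = np.array([[40., 10.],
                   [10., 40.]])        # hypothetical co-occurrence counts for binary X, Y
print(mutual_information(counts))      # positive: X and Y are dependent in this table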

MLE in Bayes-nets: mutual information form

log-likelihood:
ℓ(D; θ) = Σ_{x ∈ D} Σ_i log p(x_i ∣ pa_{x_i}; θ_{x_i ∣ pa_i})
        = Σ_i Σ_{(x_i, pa_{x_i}) ∈ D} log p(x_i ∣ pa_{x_i}; θ_{x_i ∣ pa_i})
        = N Σ_i Σ_{x_i, pa_{x_i}} p_D(x_i, pa_{x_i}) log p(x_i ∣ pa_{x_i}; θ_{x_i ∣ pa_i})   (using the empirical distribution p_D)

plug in the MLE estimate θ*:
ℓ(D; θ*) = N Σ_i Σ_{x_i, pa_{x_i}} p_D(x_i, pa_{x_i}) log p_D(x_i ∣ pa_{x_i})
         = N Σ_i Σ_{x_i, pa_{x_i}} p_D(x_i, pa_{x_i}) [ log( p_D(x_i, pa_{x_i}) / ( p_D(x_i) p_D(pa_{x_i}) ) ) + log p_D(x_i) ]
         = N Σ_i ( I_D(X_i, Pa_{X_i}) − H_D(X_i) )   (by the definitions of mutual information and entropy, under the empirical distribution)

(a numeric check of this decomposition follows)
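A quick numeric sanity check, on made-up data, that the log-likelihood of a Bayes-net with MLE parameters equals N Σ_i ( I_D(X_i, Pa_{X_i}) − H_D(X_i) ), for the tiny structure A → B; all names and the count table are illustrative assumptions.

import numpy as np

counts = np.array([[30., 5.],
                   [10., 55.]])                     # hypothetical joint counts over (A, B)
N = counts.sum()
p_ab = counts / N
p_a = p_ab.sum(axis=1, keepdims=True)
p_b = p_ab.sum(axis=0, keepdims=True)

# direct log-likelihood under MLE parameters: sum_a N_a log p(a) + sum_{a,b} N_{a,b} log p(b|a)
ll_direct = (counts.sum(1) * np.log(p_a.ravel())).sum() + (counts * np.log(p_ab / p_a)).sum()

# mutual-information form: N * [ (I(A, Pa_A=∅) - H(A)) + (I(B, A) - H(B)) ], with I(A, ∅) = 0
H = lambda p: -(p[p > 0] * np.log(p[p > 0])).sum()
I_ab = (p_ab * np.log(p_ab / (p_a @ p_b))).sum()
ll_mi_form = N * ((0.0 - H(p_a.ravel())) + (I_ab - H(p_b.ravel())))

print(np.isclose(ll_direct, ll_mi_form))            # True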

Optimal solution for trees

likelihood score: ℓ(D, θ*) = N Σ_i I_D(X_i, Pa_{X_i}) − N Σ_i H_D(X_i)
the entropy term does not depend on the structure
in a tree each node has at most one parent, so each term is I_D(X_i, X_j), and I_D(X_j, X_i) = I_D(X_i, X_j)

structure-learning algorithms use mutual information in the structure search
Chow-Liu algorithm: find the maximum spanning tree with edge weights equal to the pairwise mutual information
add directions to the edges later; make sure each node has at most one parent (i.e., no v-structures) (a sketch follows)
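A minimal sketch of the Chow-Liu procedure: pairwise empirical mutual information as edge weights, a maximum spanning tree, then edges oriented away from an arbitrary root so every node has at most one parent. The pandas DataFrame of discrete columns and the use of networkx are assumptions of this sketch.

import numpy as np
import pandas as pd
import networkx as nx

def empirical_mi(data, a, b):
    """Empirical mutual information between two discrete columns."""
    p_ab = pd.crosstab(data[a], data[b], normalize=True).values
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    nz = p_ab > 0
    return float((p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])).sum())

def chow_liu(data):
    g = nx.Graph()
    cols = list(data.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            g.add_edge(a, b, weight=empirical_mi(data, a, b))
    tree = nx.maximum_spanning_tree(g)                      # undirected tree skeleton
    return list(nx.bfs_tree(tree, cols[0]).edges())         # orient edges away from the first column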

Bayesian score for BayesNets

be Bayesian about both the structure G and the parameters θ:
P(G ∣ D) ∝ P(D ∣ G) P(G)
score_B(G, D) = log P(D ∣ G) + log P(G)

marginal likelihood of a structure: P(D ∣ G) = ∫_{θ ∈ Θ_G} P(D ∣ θ, G) P(θ ∣ G) dθ
assuming local and global parameter independence, it factorizes into the marginal likelihood of each node
for the Dirichlet-multinomial model it has a closed form

for large sample size (for any exponential-family member):
score_B(G, D) ≈ ℓ(D, θ*_G) − (log ∣D∣ / 2) K, with K the number of parameters: the Bayesian Information Criterion (BIC)
the Akaike Information Criterion (AIC) instead uses the penalty K: ℓ(D, θ*_G) − K

Bayesian score for BayesNets

Example: the Bayesian score is biased towards simpler structures
(figure: Bayesian score of two structures G_1 and G_2 as a function of the sample size ∣D∣)
(figure: data sampled from the ICU-Alarm BayesNet; Bayesian score of the true model (509 params.) and of simplified models (359 and 214 params.) as a function of ∣D∣)
(a sketch of the BIC computation follows)
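A minimal sketch of the BIC score for a candidate structure over discrete data; data is assumed to be a pandas DataFrame and parents a dict {node: list of parent columns}. This is the large-sample BIC approximation above, not the closed-form Dirichlet-multinomial score.

import numpy as np
import pandas as pd

def family_loglik(data, node, pars):
    """N * sum p_D(x, pa) log p_D(x | pa) for one family, from empirical counts."""
    if pars:
        counts = data.groupby(pars)[node].value_counts().unstack(fill_value=0).values
    else:
        counts = data[node].value_counts().values[None, :]
    row_tot = counts.sum(axis=1, keepdims=True)
    nz = counts > 0
    return float((counts[nz] * np.log((counts / row_tot)[nz])).sum())

def bic_score(data, parents):
    score = 0.0
    for node, pars in parents.items():
        k_node = data[node].nunique() - 1                                     # free params per parent config
        q_node = int(np.prod([data[p].nunique() for p in pars])) if pars else 1
        score += family_loglik(data, node, pars) - 0.5 * np.log(len(data)) * k_node * q_node
    return score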

Structure search

arg max_G Score(D, G) is NP-hard
use heuristic search algorithms (as discussed for MAP inference)

local search using: edge addition, edge deletion, edge reversal
O(n²) possible moves at each step
for each move: collect the sufficient statistics (frequencies) and estimate the score
use the decomposability of the score: only the families affected by a move need to be re-scored

example: the ICU-Alarm network (a local-search sketch follows)
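A minimal greedy hill-climbing sketch over DAG structures using the bic_score sketch above and its parents-dict representation; for brevity the move set is only edge addition and deletion (no reversal), and the whole network is re-scored per candidate rather than only the affected families.

import itertools

def is_acyclic(parents):
    """Detect directed cycles by DFS over the parent pointers."""
    seen, stack = set(), set()
    def visit(v):
        if v in stack: return False
        if v in seen: return True
        stack.add(v)
        ok = all(visit(p) for p in parents[v])
        stack.discard(v); seen.add(v)
        return ok
    return all(visit(v) for v in parents)

def hill_climb(data, score=bic_score):
    parents = {c: [] for c in data.columns}              # start from the empty graph
    best = score(data, parents)
    improved = True
    while improved:
        improved = False
        for x, y in itertools.permutations(data.columns, 2):
            cand = {k: list(v) for k, v in parents.items()}
            if x in cand[y]:
                cand[y].remove(x)                        # candidate move: delete edge x -> y
            else:
                cand[y].append(x)                        # candidate move: add edge x -> y
            if is_acyclic(cand):
                s = score(data, cand)
                if s > best:                             # greedy first-improvement acceptance
                    best, parents, improved = s, cand, True
    return parents, best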

Summary

structure learning is NP-hard; make assumptions to simplify:
constraint-based methods: limit the max number of parents, rely on CI tests, identify the I-equivalence class
score-based methods: restrict to tree structures (solved exactly by Chow-Liu), or use a Bayesian score with heuristic search, which finds a locally optimal structure

