Probabilistic Graphical Models
Structure learning in Bayesian networks
Siamak Ravanbakhsh, Fall 2019
Learning objectives

- why is structure learning hard?
- two approaches to structure learning:
  - constraint-based methods
  - score-based methods
- MLE vs. Bayesian score
Structure learning in BayesNets

family of methods:
- constraint-based methods: estimate conditional independencies from the data, then find compatible BayesNets
- score-based methods: search over the combinatorial space of 2^{O(n²)} structures, maximizing a score
- Bayesian model averaging: integrate over all possible structures
Structure learning in BayesNets

constraint-based methods estimate conditional independencies from the data (via hypothesis testing: X ⊥ Y ∣ Z?) and find compatible BayesNets.

the structure is identifiable only up to I-equivalence: any DAG with the same set of conditional independencies (CI), I(G) = I(p_D), is a perfect map (P-map) for the data distribution.

first attempt: a DAG that is an I-map for p_D, i.e., I(G) ⊆ I(p_D)
Minimal I-map from CI tests

input: CI test oracle; an ordering X_1, …, X_n
output: a minimal I-map G

for i = 1 … n:
    find a minimal U ⊆ {X_1, …, X_{i−1}} such that (X_i ⊥ {X_1, …, X_{i−1}} − U ∣ U)
    set Pa_{X_i} ← U

this enforces X_i ⊥ NonDesc_{X_i} ∣ Pa_{X_i}; a minimal I-map is a DAG where removing any edge violates the I-map property.
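The loop above can be sketched in Python. Here `ci_oracle` is a hypothetical stand-in for the CI test oracle; in the usage example it is answered by hard-coded d-separation facts for a small chain A → B → C.

```python
import itertools

def minimal_imap(variables, ci_oracle):
    """Build a minimal I-map for the given variable ordering.
    ci_oracle(x, rest, cond) answers whether x is conditionally
    independent of the set `rest` given the set `cond`."""
    parents = {}
    for i, x in enumerate(variables):
        prev = variables[:i]                       # X_1 ... X_{i-1}
        pa = None
        # try candidate parent sets from smallest to largest,
        # so the first set that passes the CI test is minimal
        for k in range(len(prev) + 1):
            for u in itertools.combinations(prev, k):
                rest = frozenset(prev) - frozenset(u)
                if ci_oracle(x, rest, frozenset(u)):
                    pa = frozenset(u)
                    break
            if pa is not None:
                break
        parents[x] = pa
    return parents

def chain_oracle(x, rest, cond):
    """Hard-coded d-separation facts of the chain A -> B -> C."""
    if not rest:
        return True                 # anything is independent of nothing
    return x == "C" and rest == frozenset({"A"}) and "B" in cond

print(minimal_imap(["A", "B", "C"], chain_oracle))
# with this topological ordering: A has no parents, Pa_B = {A}, Pa_C = {B}
```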
Minimal I-map from CI tests: problems

- CI tests involve many variables
- the number of CI tests is exponential
- a minimal I-map may be far from a P-map
- different orderings give different graphs

Example: the orderings D,I,S,G,L (a topological ordering), L,S,G,I,D, and L,D,S,I,G yield different minimal I-maps.
Structure learning in BayesNets

recall: the structure is identifiable up to I-equivalence, a DAG with the same set of conditional independencies (CI), I(G) = I(p_D).
first attempt: a DAG that is an I-map for p_D, I(G) ⊆ I(p_D).
can we find a perfect map with fewer CI tests, each involving fewer variables?
second attempt: a DAG that is a P-map for p_D
Perfect map from CI tests

a P-map is identifiable only up to I-equivalence: the same set of CIs, i.e., the same skeleton and the same immoralities.

procedure:
1. find the undirected skeleton using CI tests
2. identify immoralities in the undirected graph
Perfect map from CI tests: 1. finding the undirected skeleton

observation: if X and Y are not adjacent, then X ⊥ Y ∣ Pa_X or X ⊥ Y ∣ Pa_Y
assumption: the max number of parents is d
idea: for each pair, search over all conditioning subsets of size ≤ d and run the CI test above

input: CI oracle; bound d on the number of parents
output: undirected skeleton
initialize H as a complete undirected graph
for all pairs X_i, X_j:
    for all subsets U of size ≤ d (within the current neighbors of X_i, X_j):
        if X_i ⊥ X_j ∣ U, then remove X_i − X_j from H
return H

cost: O(n²) pairs × O((n−2)^d) subsets per pair, i.e., O(n^{d+2}) CI tests
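A simplified sketch of this skeleton search (it conditions on subsets of all other variables rather than only the current neighbors, which is less efficient but easier to read); the witness set U is saved for each removed edge, to be used in the immorality step:

```python
import itertools

def find_skeleton(variables, ci_oracle, d):
    """Start from the complete graph; remove X_i - X_j whenever some
    conditioning set U with |U| <= d renders them independent.
    ci_oracle(x, y, cond) answers whether X ⊥ Y | cond.
    Returns the skeleton and the witness set U for each removed edge."""
    H = {frozenset(e) for e in itertools.combinations(variables, 2)}
    witness = {}
    for xi, xj in itertools.combinations(variables, 2):
        others = [v for v in variables if v not in (xi, xj)]
        found = False
        for k in range(d + 1):
            for u in itertools.combinations(others, k):
                if ci_oracle(xi, xj, frozenset(u)):
                    H.discard(frozenset({xi, xj}))
                    witness[frozenset({xi, xj})] = frozenset(u)
                    found = True
                    break
            if found:
                break
    return H, witness

def collider_oracle(x, y, cond):
    """d-separation facts of the v-structure A -> C <- B:
    A and B are independent unless the collider C is conditioned on."""
    return {x, y} == {"A", "B"} and "C" not in cond

H, witness = find_skeleton(["A", "B", "C"], collider_oracle, d=1)
print(H)   # edges A - C and B - C survive; A - B is removed
```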
Perfect map from CI tests: 2. finding the immoralities

potential immorality: X − Z, Y − Z ∈ H and X − Y ∉ H

X − Z − Y is not an immorality only if X ⊥ Y ∣ U ⇒ Z ∈ U

practical check: save the witness set U used when removing X − Y from the skeleton and see whether Z ∈ U; if not, we have an immorality X → Z ← Y
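Continuing the sketch, the immorality check scans each non-adjacent pair X, Y with a common neighbor Z and orients X → Z ← Y whenever Z is missing from the saved witness set:

```python
import itertools

def find_immoralities(variables, skeleton, witness):
    """skeleton: set of frozenset edges; witness: the conditioning set
    that removed each absent edge.  Returns the directed edges of all
    detected immoralities X -> Z <- Y."""
    directed = set()
    for x, y in itertools.combinations(variables, 2):
        if frozenset({x, y}) in skeleton:
            continue                          # adjacent pairs: skip
        for z in variables:
            if z in (x, y):
                continue
            if (frozenset({x, z}) in skeleton
                    and frozenset({y, z}) in skeleton
                    and z not in witness.get(frozenset({x, y}), frozenset())):
                directed.add((x, z))          # orient X -> Z
                directed.add((y, z))          # orient Y -> Z
    return directed

# skeleton and witness as produced for the v-structure A -> C <- B
skeleton = {frozenset({"A", "C"}), frozenset({"B", "C"})}
witness = {frozenset({"A", "B"}): frozenset()}  # A ⊥ B given the empty set
print(find_immoralities(["A", "B", "C"], skeleton, witness))
# C is not in the witness set, so A -> C <- B is an immorality
```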
Perfect map from CI tests: 3. propagate the constraints

at this point we have a mix of directed and undirected edges; add directions using a set of rules (R1, R2, R3), needed to preserve the immoralities and the DAG structure, applied until convergence.

Example: ground truth DAG → undirected skeleton + immoralities → orientations using rules R1, R2, R3.

for exact CI tests, this procedure recovers the exact I-equivalence family.
Conditional independence (CI) test

how to decide X ⊥ Y ∣ Z from the dataset D?

measure the deviance of p_D(X, Y ∣ Z) from p_D(X ∣ Z) p_D(Y ∣ Z):

conditional mutual information:
d_I(D) = E_{Z∼p_D}[ D( p_D(X, Y ∣ Z) ∥ p_D(X ∣ Z) p_D(Y ∣ Z) ) ]

χ² statistic:
d_{χ²}(D) = ∣D∣ Σ_{x,y,z} (p_D(x, y, z) − p_D(z) p_D(x ∣ z) p_D(y ∣ z))² / ( p_D(z) p_D(x ∣ z) p_D(y ∣ z) )

using frequencies in the dataset.

a large deviance rejects the null hypothesis (of conditional independence): pick a threshold t and reject when d(D) > t.

the p-value is the probability of false rejection, over all possible datasets:
pvalue(t) = P( {D : d(D) > t} ∣ X ⊥ Y ∣ Z )

it is possible to derive the distribution of such deviance measures, e.g., the χ² distribution; reject the CI hypothesis for small p-values (e.g., below .05).
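The χ² deviance can be computed directly from dataset frequencies. A minimal sketch (comparing the statistic against a χ² quantile to obtain the p-value is left to a statistics library):

```python
from collections import Counter
from itertools import product

def chi2_deviance(data, X, Y, Z):
    """d_chi2(D) = |D| * sum_{x,y,z} (p(x,y,z) - p(z)p(x|z)p(y|z))^2
                                     / (p(z) p(x|z) p(y|z)),
    with every probability replaced by a frequency in the dataset.
    `data` is a list of dicts mapping variable names to values."""
    N = len(data)
    cxyz = Counter((r[X], r[Y], r[Z]) for r in data)
    cxz = Counter((r[X], r[Z]) for r in data)
    cyz = Counter((r[Y], r[Z]) for r in data)
    cz = Counter(r[Z] for r in data)
    xs, ys, zs = ({r[V] for r in data} for V in (X, Y, Z))
    dev = 0.0
    for x, y, z in product(xs, ys, zs):
        # p(z) p(x|z) p(y|z) in frequency form
        expected = cxz[(x, z)] * cyz[(y, z)] / (N * cz[z])
        if expected == 0:
            continue
        observed = cxyz[(x, y, z)] / N
        dev += (observed - expected) ** 2 / expected
    return N * dev

# X ⊥ Y | Z holds exactly in this dataset: the deviance is 0
indep = [{"X": x, "Y": y, "Z": z} for z in (0, 1) for x in (0, 1) for y in (0, 1)]
print(chi2_deviance(indep, "X", "Y", "Z"))   # 0.0

# X = Y deterministically: large deviance, reject the null hypothesis
dep = [{"X": 0, "Y": 0, "Z": 0}, {"X": 1, "Y": 1, "Z": 0}]
print(chi2_deviance(dep, "X", "Y", "Z") > 0)
```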
Structure learning in BayesNets

family of methods:
- constraint-based methods: estimate conditional independencies from the data, then find compatible BayesNets
- score-based methods: search over the combinatorial space, maximizing a score
- Bayesian model averaging: integrate over all possible structures
Mutual information

how much information does X encode about Y? the reduction in the uncertainty of X after observing Y:

I(X, Y) = H(X) − H(X∣Y) = H(Y) − H(Y∣X)

symmetric: I(X, Y) = I(Y, X)
conditional entropy: H(Y∣X) = Σ_x p(x) H(p(y∣x))

I(X, Y) = Σ_{x,y} p(x, y) log( p(x, y) / (p(x) p(y)) ) = D_KL( p(x, y) ∥ p(x) p(y) ), which is nonnegative
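The last identity gives a direct way to estimate I(X, Y) from samples; a minimal sketch:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X, Y) = sum_{x,y} p(x, y) log( p(x, y) / (p(x) p(y)) ),
    with the probabilities estimated from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

print(mutual_information([(0, 0), (1, 1)]))                  # X = Y: I = log 2
print(mutual_information([(0, 0), (0, 1), (1, 0), (1, 1)]))  # independent: I = 0
```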
MLE in Bayes-nets: mutual information form

log-likelihood:
ℓ(D; θ) = Σ_{x∈D} Σ_i log p(x_i ∣ pa_{x_i}; θ_{i∣Pa_i})
        = Σ_i Σ_{(x_i, pa_{x_i})∈D} log p(x_i ∣ pa_{x_i}; θ_{i∣Pa_i})
        = N Σ_i Σ_{x_i, pa_{x_i}} p_D(x_i, pa_{x_i}) log p(x_i ∣ pa_{x_i}; θ_{i∣Pa_i})

using the empirical distribution p_D.

plug in the MLE estimate θ*:
ℓ(D; θ*) = N Σ_i Σ_{x_i, pa_{x_i}} p_D(x_i, pa_{x_i}) log p_D(x_i ∣ pa_{x_i})
         = N Σ_i Σ_{x_i, pa_{x_i}} p_D(x_i, pa_{x_i}) [ log ( p_D(x_i, pa_{x_i}) / ( p_D(x_i) p_D(pa_{x_i}) ) ) + log p_D(x_i) ]
         = N Σ_i [ I_D(X_i, Pa_{X_i}) − H_D(X_i) ]

using the definition of mutual information.
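This decomposition can be checked numerically on a toy network; the code below assumes each node has at most one parent only to keep the sketch short:

```python
import math
from collections import Counter

def loglik_and_decomposition(data, parents):
    """Compute the MLE log-likelihood ℓ(D; θ*) directly, and also via
    N Σ_i [ I_D(X_i, Pa_i) − H_D(X_i) ], for a network where each node
    has at most one parent.  data: list of dicts; parents: {node: parent}."""
    n = len(data)

    def H(var):                    # empirical entropy H_D
        c = Counter(r[var] for r in data)
        return -sum((k / n) * math.log(k / n) for k in c.values())

    def I(a, b):                   # empirical mutual information I_D
        cab = Counter((r[a], r[b]) for r in data)
        ca = Counter(r[a] for r in data)
        cb = Counter(r[b] for r in data)
        return sum((c / n) * math.log(c * n / (ca[x] * cb[y]))
                   for (x, y), c in cab.items())

    # direct MLE log-likelihood: Σ_{x∈D} Σ_i log p_D(x_i | pa_i)
    ll = 0.0
    for i, pa in parents.items():
        if pa is None:
            c = Counter(r[i] for r in data)
            ll += sum(k * math.log(k / n) for k in c.values())
        else:
            cj = Counter((r[i], r[pa]) for r in data)
            cp = Counter(r[pa] for r in data)
            ll += sum(k * math.log(k / cp[p]) for (_, p), k in cj.items())

    decomposed = n * sum((I(i, pa) if pa is not None else 0.0) - H(i)
                         for i, pa in parents.items())
    return ll, decomposed

data = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}]
ll, dec = loglik_and_decomposition(data, {"A": None, "B": "A"})
print(abs(ll - dec) < 1e-9)   # the two expressions agree
```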
Optimal solution for trees

likelihood score: ℓ(D; θ*) = N Σ_i [ I_D(X_i, Pa_{X_i}) − H_D(X_i) ]

the entropy term does not depend on the structure, so structure learning algorithms use mutual information in the structure search.

Chow-Liu algorithm: find the maximum-weight spanning tree with edge weights given by the mutual information I_D(X_i, X_j); since I_D(X_j, X_i) = I_D(X_i, X_j), the weights are well defined on undirected edges.
add directions to the edges afterwards, making sure each node has at most one parent (i.e., no v-structures).
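A compact sketch of Chow-Liu, using Kruskal's algorithm for the maximum-weight spanning tree and rooting the tree at the first variable to assign directions:

```python
import math
from collections import Counter
from itertools import combinations

def chow_liu(data, variables):
    """Chow-Liu: maximum-weight spanning tree with edge weight
    I_D(X_i, X_j); the tree is then rooted at the first variable so
    every node gets at most one parent.  Returns {node: parent}."""
    n = len(data)

    def mi(a, b):
        cab = Counter((r[a], r[b]) for r in data)
        ca = Counter(r[a] for r in data)
        cb = Counter(r[b] for r in data)
        return sum((c / n) * math.log(c * n / (ca[x] * cb[y]))
                   for (x, y), c in cab.items())

    edges = sorted(((mi(a, b), a, b) for a, b in combinations(variables, 2)),
                   reverse=True)
    # Kruskal: greedily add the heaviest edge that joins two components
    comp = {v: v for v in variables}
    def find(v):
        while comp[v] != v:
            comp[v] = comp[comp[v]]
            v = comp[v]
        return v
    tree = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            comp[ra] = rb
            tree.append((a, b))
    # root the tree at the first variable and orient edges away from it
    adj = {v: [] for v in variables}
    for a, b in tree:
        adj[a].append(b)
        adj[b].append(a)
    parent = {variables[0]: None}
    stack = [variables[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                stack.append(v)
    return parent

# B copies A, and C is a noisy copy of B: the chain A - B - C is recovered
data = ([{"A": 0, "B": 0, "C": 0}] * 4 + [{"A": 1, "B": 1, "C": 1}] * 4
        + [{"A": 0, "B": 0, "C": 1}, {"A": 1, "B": 1, "C": 0}])
print(chow_liu(data, ["A", "B", "C"]))
```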
Bayesian Score for BayesNets

be Bayesian about both the structure G and the parameters θ:
P(G ∣ D) ∝ P(D ∣ G) P(G)

taking logs: score_B(G, D) = log P(D ∣ G) + log P(G)

marginal likelihood for a structure:
P(D ∣ G) = ∫_{θ∈Θ_G} P(D ∣ θ, G) P(θ ∣ G) dθ

assuming local and global parameter independence, it factorizes into the marginal likelihood of each node; for the Dirichlet-multinomial it has a closed form.

for large sample size (and any exponential-family member):
score_B(G, D) ≈ ℓ(D; θ*_G) − (log ∣D∣ / 2) K        (Bayesian Information Criterion, BIC)
where K is the number of parameters.

Akaike Information Criterion (AIC): ℓ(D; θ*_G) − K
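A sketch of the BIC score, again simplified to networks where each node has at most one parent; `card` holds the number of states of each variable:

```python
import math
from collections import Counter

def bic_score(data, parents, card):
    """BIC: score_B(G, D) ≈ ℓ(D; θ*_G) − (log |D| / 2) · K,
    where K is the number of free parameters.  Each node has at most
    one parent here; `card` gives the number of states per variable."""
    n = len(data)
    ll = 0.0
    K = 0
    for i, pa in parents.items():
        if pa is None:
            c = Counter(r[i] for r in data)
            ll += sum(k * math.log(k / n) for k in c.values())
            K += card[i] - 1
        else:
            cj = Counter((r[i], r[pa]) for r in data)
            cp = Counter(r[pa] for r in data)
            ll += sum(k * math.log(k / cp[p]) for (_, p), k in cj.items())
            K += (card[i] - 1) * card[pa]
    return ll - math.log(n) / 2 * K

# on data where A and B are independent, the edge adds parameters
# without improving the likelihood, so BIC prefers the simpler structure
data = [{"A": a, "B": b} for a in (0, 1) for b in (0, 1)]
card = {"A": 2, "B": 2}
empty = bic_score(data, {"A": None, "B": None}, card)
with_edge = bic_score(data, {"A": None, "B": "A"}, card)
print(empty > with_edge)
```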
Bayesian Score for BayesNets: Example

the Bayesian score is biased towards simpler structures (compare candidate structures G_1 and G_2 as a function of the sample size ∣D∣).

data sampled from the ICU-Alarm BayesNet: as ∣D∣ grows, compare the Bayesian score of the true model (509 params.) with a simplified model (359 params.) and a further simplified model (214 params.).
Structure search

argmax_G Score(D, G) is NP-hard; use heuristic search algorithms (as discussed for MAP inference).

local search using: edge addition, edge deletion, edge reversal: O(n²) possible moves.

for each candidate move: collect the sufficient statistics (frequencies) and estimate the score; use the decomposition of the score, so only the families affected by the move need to be rescored.

example: ICU-Alarm network
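The local search can be sketched as greedy hill climbing; `family_score` is the BIC family term, and decomposability means each move rescores only the one or two affected families:

```python
import math
from collections import Counter
from itertools import permutations

def family_score(data, node, pa):
    """Decomposable BIC term for one family: node given parent tuple pa."""
    n = len(data)
    cj = Counter((r[node], tuple(r[p] for p in pa)) for r in data)
    cp = Counter(tuple(r[p] for p in pa) for r in data)
    ll = sum(k * math.log(k / cp[u]) for (_, u), k in cj.items())
    states = lambda v: len({r[v] for r in data})
    n_params = (states(node) - 1) * math.prod(states(p) for p in pa)
    return ll - math.log(n) / 2 * n_params

def hill_climb(data, variables):
    """Greedy local search over DAGs with edge addition, deletion, reversal."""
    pa = {v: () for v in variables}
    fam = {v: family_score(data, v, ()) for v in variables}

    def acyclic(p):
        seen = set()
        def visit(v, path):
            if v in path:
                return False
            if v in seen:
                return True
            seen.add(v)
            return all(visit(u, path | {v}) for u in p[v])
        return all(visit(v, frozenset()) for v in variables)

    improved = True
    while improved:
        improved = False
        for x, y in permutations(variables, 2):
            if x in pa[y]:      # existing edge x -> y: try delete / reverse
                without = tuple(u for u in pa[y] if u != x)
                moves = [{y: without}, {y: without, x: pa[x] + (y,)}]
            else:               # absent edge: try adding x -> y
                moves = [{y: pa[y] + (x,)}]
            for m in moves:
                new_pa = {**pa, **m}
                if not acyclic(new_pa):
                    continue
                delta = sum(family_score(data, v, new_pa[v]) - fam[v]
                            for v in m)     # only affected families rescored
                if delta > 1e-9:
                    pa = new_pa
                    for v in m:
                        fam[v] = family_score(data, v, pa[v])
                    improved = True
                    break
    return pa

# toy data where B copies A: the search adds the single edge A -> B
data = [{"A": v, "B": v} for v in (0, 1)] * 10
print(hill_climb(data, ["A", "B"]))
```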
Summary

structure learning is NP-hard; make assumptions to simplify.

constraint-based methods:
- limit the max number of parents
- rely on CI tests
- identify the I-equivalence class