Probabilistic Graphical Models: Bayesian Networks
Vasant Honavar
Artificial Intelligence Research Laboratory
Department of Computer Science
Bioinformatics and Computational Biology Program
Center for Computational Intelligence, Learning, & Discovery
Iowa State University
[email protected]
www.cs.iastate.edu/~honavar/
www.cild.iastate.edu | www.bcb.iastate.edu | www.igert.iastate.edu
Inference by enumeration
• Start with the joint probability distribution, e.g., over Toothache, Catch, Cavity:

           toothache,catch  toothache,¬catch  ¬toothache,catch  ¬toothache,¬catch
  cavity        0.108             0.012             0.072              0.008
  ¬cavity       0.016             0.064             0.144              0.576

• For any proposition φ, sum the atomic events where it is true: P(φ) = Σ_{ω : ω ⊨ φ} P(ω)
• P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
• Can also compute conditional probabilities:
  P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
                         = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4
Normalization
The denominator can be viewed as a normalization constant α:
  P(Cavity | toothache) = α P(Cavity, toothache)
    = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
    = α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩] = α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩
General idea: compute distribution on query variable by fixing evidence variables and summing over unobserved variables
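A minimal Python sketch of inference by enumeration and normalization on the toothache example above (the dictionary encoding and helper names are my own):

```python
# Full joint P(Cavity, Toothache, Catch); values from the table above.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(phi):
    """P(phi): sum the atomic events (cavity, toothache, catch) where phi holds."""
    return sum(p for event, p in joint.items() if phi(*event))

print(prob(lambda cavity, toothache, catch: toothache))  # 0.2

# P(Cavity | toothache): fix the evidence, sum out Catch, then normalize.
unnormalized = {cv: prob(lambda c, t, k: c == cv and t) for cv in (True, False)}
alpha = 1.0 / sum(unnormalized.values())
print({cv: alpha * p for cv, p in unnormalized.items()})  # {True: 0.6, False: 0.4}
```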
Inference by enumeration, continued
Typically, we are interested in the posterior joint distribution of the query variables Y given specific values e for the evidence variables E
Let the hidden (unobserved) variables be H = X − Y − E
Then the required summation of joint entries is done by summing out the hidden variables:
  P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)
• The terms in the summation are joint entries because Y, E, and H together exhaust the set of random variables
Inference by enumeration, continued
• Obvious problems:
  1. Worst-case time complexity O(dⁿ), where d is the largest arity
  2. Space complexity O(dⁿ) to store the joint distribution
  3. How to find the numbers for O(dⁿ) entries?
Independence
• A and B are independent iff
  P(A | B) = P(A) or P(B | A) = P(B) or P(A, B) = P(A) P(B)
• P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
• 32 entries reduced to 12; for n independent biased coins, O(2ⁿ) → O(n)
• Absolute independence is powerful but rare
• How can we manage a large number of variables?
Conditional independence
• P(Toothache, Cavity, Catch) has 2³ − 1 = 7 independent entries
• If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
  – P(catch | toothache, cavity) = P(catch | cavity)
• The same independence holds if I haven't got a cavity:
  – P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
• Catch is conditionally independent of Toothache given Cavity:
  – P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Conditional independence
• Catch is conditionally independent of Toothache given Cavity:
  – P(Catch | Toothache, Cavity) = P(Catch | Cavity)
• Equivalent statements:
  – P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
  – P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Conditional independence
• Write out the full joint distribution using the chain rule:
  P(Toothache, Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
  = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
  i.e., 2 + 2 + 1 = 5 independent numbers
• Conditional independence
  – often reduces the size of the representation of the joint distribution from exponential in n to linear in n
  – is one of the most basic and robust forms of knowledge about uncertain environments
Conditional Independence
X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z:
  P(X | Y, Z) = P(X | Z), that is,
  ∀(xᵢ, yⱼ, z_k)  P(X = xᵢ | Y = yⱼ, Z = z_k) = P(X = xᵢ | Z = z_k)
Independence and Conditional Independence
Let Z₁, …, Zₙ and W be pairwise disjoint sets of random variables on a given event space.
Z₁ ∪ … ∪ Zₙ are mutually independent given W if
  P(Z₁, …, Zₙ | W) = ∏_{i=1}^{n} P(Zᵢ | W)
Z₁ and Z₂ are independent given W if
  P(Z₁, Z₂ | W) = P(Z₁ | W) P(Z₂ | W)
Note that these represent sets of equations, for all possible value assignments to the random variables.
Independence Properties of Random Variables
Let W, X, Y, Z be pairwise disjoint sets of random variables on a given event space.
Let I(X, Z, Y) denote that X and Y are independent given Z. That is,
  P(X | Y, Z) = P(X | Z), or P(X, Y | Z) = P(X | Z) P(Y | Z)
Then:
  a. I(X, Z, Y) ⇒ I(Y, Z, X)                        (symmetry)
  b. I(X, Z, Y ∪ W) ⇒ I(X, Z, Y)                    (decomposition)
  c. I(X, Z, Y ∪ W) ⇒ I(X, Z ∪ Y, W)                (weak union)
  d. I(X, Z, Y) ∧ I(X, Z ∪ Y, W) ⇒ I(X, Z, Y ∪ W)   (contraction)
Proof: Follows from the definition of independence.
Bayes Rule
Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population has this cancer.
  P(cancer) =             P(¬cancer) =
  P(+ | cancer) =         P(− | cancer) =
  P(+ | ¬cancer) =        P(− | ¬cancer) =
Bayes Rule
Does the patient have cancer or not?
  P(cancer) = 0.008          P(¬cancer) = 0.992
  P(+ | cancer) = 0.98       P(− | cancer) = 0.02
  P(+ | ¬cancer) = 0.03      P(− | ¬cancer) = 0.97

  P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
  P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)
  P(+ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
  P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
  P(+) = 0.0078 + 0.0298 = 0.0376
  P(cancer | +) ≈ 0.21;  P(¬cancer | +) ≈ 0.79
The patient, more likely than not, does not have cancer.
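A quick numeric check of the computation above (a sketch; the priors and likelihoods are the slide's numbers):

```python
# Bayes rule on the cancer test example.
p_cancer = 0.008
p_pos_given_cancer = 0.98       # P(+ | cancer)
p_pos_given_no_cancer = 0.03    # P(+ | ¬cancer)

# Contribution of each hypothesis to the positive result.
joint_cancer = p_pos_given_cancer * p_cancer              # 0.0078
joint_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)  # 0.0298

p_pos = joint_cancer + joint_no_cancer                    # normalizer P(+)
print(joint_cancer / p_pos, joint_no_cancer / p_pos)      # ~0.21, ~0.79
```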
Bayes Rule
• Product rule:
  – P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
  – Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)
• In distribution form:
  P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)
Bayes' Rule and conditional independence
P(Cavity | toothache ∧ catch)
  = α P(toothache ∧ catch | Cavity) P(Cavity)
  = α P(toothache | Cavity) P(catch | Cavity) P(Cavity)
• This is an example of a naïve Bayes (idiot Bayes) model:
  – P(Cause, Effect₁, …, Effectₙ) = P(Cause) ∏ᵢ P(Effectᵢ | Cause)
• Total number of parameters is linear in n
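A minimal sketch of this naïve Bayes computation for the cavity example (the CPT numbers below are illustrative placeholders; the slides do not give them):

```python
# P(Cause, e1..en) = P(Cause) * prod_i P(ei | Cause); numbers are made up.
p_cavity = {True: 0.2, False: 0.8}
p_toothache = {True: 0.6, False: 0.1}  # P(toothache = T | Cavity)
p_catch = {True: 0.9, False: 0.2}      # P(catch = T | Cavity)

def posterior_cavity(toothache, catch):
    """P(Cavity | toothache, catch) via the naive Bayes factorization."""
    score = {}
    for cavity in (True, False):
        lt = p_toothache[cavity] if toothache else 1 - p_toothache[cavity]
        lc = p_catch[cavity] if catch else 1 - p_catch[cavity]
        score[cavity] = p_cavity[cavity] * lt * lc
    alpha = 1.0 / sum(score.values())
    return {cv: alpha * s for cv, s in score.items()}

print(posterior_cavity(True, True))
```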
Bayesian Networks
Exploiting conditional independence and graphical representation for reasoning under uncertainty
• Review of graphs
• Review of independence and conditional independence
• Directed graphical models and probability distributions
• Querying a probability distribution – inference
Review of basic concepts of graphs
Undirected graph G1 = (V, E1); directed graph G2 = (V, E2)
Vertex set V = { A, B, C, D, E }
Edge set E1 = { A–B, B–D, D–E, A–C, C–E }
Edge set E2 = { A→B, B→D, D→E, A→C, C→E }
(Figure: G1 and G2 drawn on the vertices A, B, C, D, E)
Review of basic concepts of graphs
Adjacency set of a node – immediate neighbors reachable through undirected (directed) links
In G1, Adj(A) = {C, D}; Adj(B) = {D}; Adj(D) = {A, B, E}; Adj(E) = ∅
In G2, Adj(A) = {C, D}; Adj(B) = {D}; Adj(D) = {E}; Adj(E) = {C, D}
Path between two nodes – an ordered list of nodes starting with the first node and ending with the second, in which each successive node is in the adjacency list of the preceding node
(Figure: G1 and G2 on the vertices A, B, C, D, E)
Properties of undirected graphs
• Complete graph – there is a link between every pair of nodes
• Complete set – a subset of nodes in a graph is said to be complete if there is a link between every pair of nodes in the subset
• Clique – a complete set of nodes is said to be a clique if it is maximal, i.e., it is not a proper subset of another complete set
The vertex set {A, B, C, D, E} of the complete graph G1 shown is a complete set and also the only clique in G1
(Figure: the complete graph G1)
Properties of undirected graphs
Identify the cliques in the graph shown.
Two cliques: {A, C, D} and {B, C, D, E}
(Figure: a graph on A, B, C, D, E)
Properties of undirected graphs
Neighbors of a node: in G1, Neighbors(A) = {C, D}
Boundary of a set of nodes S – the union of the neighbors of nodes in S, excluding the nodes in S:
  Boundary({C, D}) = {A, E, D} ∪ {A, B, C, E} − {C, D}
                   = {A, B, C, D, E} − {C, D} = {A, B, E}
(Figure: G1 on A, B, C, D, E)
Properties of undirected graphs
A graph is said to be connected if there exists at least one path between any pair of nodes
G1 is connected; G2 is not connected
(Figure: G1 connected and G2 disconnected, each on A, C, D, E)
Properties of undirected graphs
A connected undirected graph is a tree if for every pair of nodes there is a unique path
G1 is a tree
An undirected graph is said to be multiply connected if at least one pair of nodes is connected by more than one path – i.e., there is at least one loop
G2 is not a tree (it is multiply connected)
(Figure: G1 a tree and G2 multiply connected, each on A, C, D, E)
Properties of undirected graphs
Chord of a loop – a link between two nodes in a loop that is not itself part of the loop
C–D is a chord of the loop A–C–E–D–A
A–E is a chord of the loop A–C–E–D–A
A chord decomposes a loop into two smaller loops
The loop A–E–D–A does not have a chord
(Figure: G1 on A, C, D, E)
Properties of undirected graphs
Triangulated graph – an undirected graph is said to be triangulated if every loop of length 4 or greater has at least one chord
C–D is a chord of the loop A–C–E–D–A
G1 is not triangulated; G2 is triangulated
Triangulation does not mean dividing the graph into triangles!
(Figure: G1 without the chord C–D and G2 with it, each on A, C, D, E)
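To check triangulation programmatically, networkx's `is_chordal` tests exactly this property; a small sketch on the 4-cycle discussed above (the edge lists are my encoding of it):

```python
import networkx as nx

# G1: the loop A-C-E-D-A with no chord -- not triangulated.
g1 = nx.Graph([("A", "C"), ("C", "E"), ("E", "D"), ("D", "A")])

# G2: the same loop plus the chord C-D -- triangulated.
g2 = nx.Graph(list(g1.edges) + [("C", "D")])

print(nx.is_chordal(g1))  # False
print(nx.is_chordal(g2))  # True
```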
Properties of undirected graphs
Triangulation – the process of adding chords to make the graph triangulated
There may be multiple ways to triangulate a graph
A triangulation is said to be minimal if it contains the minimum number of chords
G2 is a minimal triangulation; G3 is not a minimal triangulation
Finding a minimal triangulation is NP-hard
There is a greedy algorithm for triangulating a graph (Tarjan and Yannakakis, 1984)
(Figure: G1 untriangulated; G2 and G3 triangulations of G1, each on A, C, D, E)
Properties of undirected graphs
Triangulated graphs have the running intersection property:
there exists an ordering of cliques C1 … Cn such that Ci ∩ (C1 ∪ C2 ∪ … ∪ Ci−1) is contained in at least one of the cliques C1, …, Ci−1, for all i = 1..n
An ordering of cliques satisfying the running intersection property is called a chain of cliques
An undirected graph has an associated chain of cliques iff it is triangulated
Ordering: {A, C, E, D}, {C, D, E, F}
(Figure: G1 on A, C, D, E, F)
Properties of undirected graphs
Cluster – a subset of nodes of a graph
Cluster graph:
• nodes are clusters
• there is an edge between two nodes if and only if the clusters contain common nodes
Clique graph of an undirected graph – a cluster graph in which the clusters correspond to the cliques of the original graph
A clique graph is called a join (or junction) graph if it contains all the possible links between cliques with a common node
The join graph of an undirected graph is unique
Properties of undirected graphs
A clique graph is a join tree (junction tree) if it is a tree and every node that belongs to two clusters also belongs to every cluster on the path between the two clusters
An undirected graph has a join tree if and only if it is triangulated
(Figure: a graph on A, B, C, D, E, F, G, H, I; its join graph; and a join tree with clusters A,B,C; B,C,E; B,D,E; C,F; D,G; D,H; E,I)
Properties of undirected graphs
An undirected graph has a join tree if and only if it is triangulated
There is no join tree for this graph
(Figure: a non-triangulated 4-cycle on A, B, C, D and its join graph with clusters A,B; A,C; B,D; C,D)
Properties of Directed Graphs
Parents(D) = {A, B}
Children(A) = {C, D}
Family(D) = {A, B, D} (a node and its parents)
Ancestors(E) = {A, B, C, D}
Ancestors(A) = ∅
Ancestral numbering – a numbering of nodes such that the number of any node is less than that of its children
(Figure: a DAG on A, B, C, D, E)
Properties of Directed Graphs
Undirected graph associated with a directed graph – drop the directionality of the links
Moral graph – obtained by linking every pair of nodes that share a common child and then dropping the directions on the links
(Figure: a directed graph, its associated undirected graph, and its moral graph, each on A, B, C, D, E)
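A sketch of moral graph construction with networkx (the helper name `moralize` is mine, and the example DAG is illustrative rather than the slide's figure):

```python
import networkx as nx
from itertools import combinations

def moralize(dag: nx.DiGraph) -> nx.Graph:
    """Marry all co-parents of each node, then drop edge directions."""
    moral = dag.to_undirected()
    for node in dag.nodes:
        for p1, p2 in combinations(dag.predecessors(node), 2):
            moral.add_edge(p1, p2)  # link co-parents sharing the child `node`
    return moral

# Illustrative DAG in which A and B are co-parents of D.
dag = nx.DiGraph([("A", "C"), ("A", "D"), ("B", "D"), ("D", "E")])
print(sorted(moralize(dag).edges))  # includes the added moral edge ('A', 'B')
```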
Properties of Directed Graphs
A cycle in a directed graph – a closed directed path
A directed graph is acyclic (a DAG) if it has no directed cycles
A directed graph is connected if the associated undirected graph is connected
A connected directed graph is a tree if the associated undirected graph is a tree; otherwise it is multiply connected
Simple directed tree – every node has at most one parent; otherwise it is a polytree
(Figure: a directed acyclic graph, a simple tree, and a polytree)
Representation of graphs
Graphical representation vs. numerical representation:
• Adjacency matrix – entry (i, j) is 1 if there is an edge from node i to node j
• Successive powers A¹, A², … of the adjacency matrix give the number of paths of length 1, 2, …
• Attainability matrix – entry (i, j) is 1 if there is a path from node i to node j
• If there is a path between two nodes, there is a path of length less than N, where N is the number of nodes in the graph
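A small numeric illustration of these facts (a sketch with numpy; the 4-node digraph is an arbitrary example):

```python
import numpy as np

# Adjacency matrix of a small directed graph: 0->1, 0->2, 1->2, 2->3.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

# (A^k)[i, j] counts the directed paths of length k from i to j.
print(np.linalg.matrix_power(A, 2))  # entry (0, 2) is 1: the path 0->1->2

# Attainability: i reaches j iff some A^k (1 <= k < N) is nonzero at (i, j),
# since any path can be shortened to one of length less than N.
N = A.shape[0]
reach = sum(np.linalg.matrix_power(A, k) for k in range(1, N)) > 0
print(reach.astype(int))
```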
Building Probabilistic Models – Conditional Independence
• Random variable X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z:
• P(X | Y, Z) = P(X | Z), that is,
  ∀(xᵢ, yⱼ, z_k)  P(X = xᵢ | Y = yⱼ, Z = z_k) = P(X = xᵢ | Z = z_k)
Conditional Independence
P(Thunder = 1 | Rain = 1, Lightning = 1) = P(Thunder = 1 | Lightning = 1) = P(Thunder = 1 | Rain = 0, Lightning = 1)
P(Thunder = 1 | Rain = 1, Lightning = 0) = P(Thunder = 1 | Lightning = 0) = P(Thunder = 1 | Rain = 0, Lightning = 0)
P(Thunder = 0 | Rain = 1, Lightning = 1) = P(Thunder = 0 | Lightning = 1) = P(Thunder = 0 | Rain = 0, Lightning = 1)
P(Thunder = 0 | Rain = 1, Lightning = 0) = P(Thunder = 0 | Lightning = 0) = P(Thunder = 0 | Rain = 0, Lightning = 0)
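A sketch verifying these equations numerically on a joint distribution constructed so that Thunder is conditionally independent of Rain given Lightning (the CPT numbers are made up for illustration):

```python
from itertools import product

p_l = {1: 0.3, 0: 0.7}    # P(Lightning)
p_r = {1: 0.9, 0: 0.2}    # P(Rain = 1 | Lightning)
p_t = {1: 0.8, 0: 0.05}   # P(Thunder = 1 | Lightning)

joint = {}
for l, r, t in product([0, 1], repeat=3):
    pr = p_r[l] if r else 1 - p_r[l]
    pt = p_t[l] if t else 1 - p_t[l]
    joint[(l, r, t)] = p_l[l] * pr * pt

def p_thunder(t, l, r=None):
    """P(Thunder = t | Lightning = l [, Rain = r])."""
    match = [e for e in joint if e[0] == l and (r is None or e[1] == r)]
    den = sum(joint[e] for e in match)
    num = sum(joint[e] for e in match if e[2] == t)
    return num / den

# P(T=1 | R=1, L=1) = P(T=1 | L=1) = P(T=1 | R=0, L=1)
print(p_thunder(1, l=1, r=1), p_thunder(1, l=1), p_thunder(1, l=1, r=0))
```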
Conditional Independence
Let Z₁, …, Zₙ and W be random variables on a given event space.
Z₁, …, Zₙ are mutually independent given W if
  P(Z₁, Z₂, …, Zₙ | W) = ∏_{i=1}^{n} P(Zᵢ | W)
Z₁ and Z₂ are independent given W if
  P(Z₁, Z₂ | W) = P(Z₁ | W) P(Z₂ | W)
Note that these represent sets of equations, for all possible value assignments to the random variables.
Independence and Conditional Independence
Let Z₁, …, Zₙ and W be pairwise disjoint sets of random variables on a given event space.
Z₁ ∪ … ∪ Zₙ are mutually independent given W if
  P(Z₁, …, Zₙ | W) = ∏_{i=1}^{n} P(Zᵢ | W)
Z₁ and Z₂ are independent given W if
  P(Z₁, Z₂ | W) = P(Z₁ | W) P(Z₂ | W)
Note that these represent sets of equations, for all possible value assignments to the random variables.
Independence Properties of Random Variables
Let W, X, Y, Z be pairwise disjoint sets of random variables on a given event space.
Let I(X, Z, Y) denote that X and Y are independent given Z. That is,
  P(X | Y, Z) = P(X | Z), or P(X, Y | Z) = P(X | Z) P(Y | Z)
Then:
  a. I(X, Z, Y) ⇒ I(Y, Z, X)                        (symmetry)
  b. I(X, Z, Y ∪ W) ⇒ I(X, Z, Y)                    (decomposition)
  c. I(X, Z, Y ∪ W) ⇒ I(X, Z ∪ Y, W)                (weak union)
  d. I(X, Z, Y) ∧ I(X, Z ∪ Y, W) ⇒ I(X, Z, Y ∪ W)   (contraction)
Proof: Follows from the definition of independence.
Implications of Independence
• Suppose we have 5 binary features and a binary class label
• Without independence, to specify the joint distribution we need a probability for each possible assignment of values to the variables – a table of size 2⁶ = 64
• If the features are independent given the class label, we only need 5 × (2 × 2) = 20 entries
Bayesian Networks
Smoking → Cancer
S ∈ {no, light, heavy};  C ∈ {none, benign, malignant}
P(S = no)    0.80
P(S = light) 0.15
P(S = heavy) 0.05

              Smoking =  no     light   heavy
P(C = none)              0.96   0.88    0.60
P(C = benign)            0.03   0.08    0.25
P(C = malig)             0.01   0.04    0.15
Product Rule
• P(C,S) = P(C|S) P(S)
S⇓ C⇒    none    benign   malignant
no       0.768   0.024    0.008
light    0.132   0.012    0.006
heavy    0.035   0.010    0.005
Marginalization
S⇓ C⇒    none    benign   malig   total
no       0.768   0.024    0.008   0.80
light    0.132   0.012    0.006   0.15
heavy    0.035   0.010    0.005   0.05
total    0.935   0.046    0.019

Column totals give P(Cancer); row totals give P(Smoking).
Bayes Rule Revisited
P(S | C) = P(C | S) P(S) / P(C) = P(C, S) / P(C)

S⇓ C⇒    none          benign        malig
no       0.768/0.935   0.024/0.046   0.008/0.019
light    0.132/0.935   0.012/0.046   0.006/0.019
heavy    0.035/0.935   0.010/0.046   0.005/0.019

              Cancer =  none    benign   malignant
P(S = no)               0.821   0.522    0.421
P(S = light)            0.141   0.261    0.316
P(S = heavy)            0.037   0.217    0.263
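A sketch recomputing the product rule, marginalization, and Bayes rule tables directly from the CPTs given above (computed this way, the heavy-smoker row comes out slightly different from the printed joint table):

```python
# P(S) and P(C | S) from the slides.
p_s = {"no": 0.80, "light": 0.15, "heavy": 0.05}
p_c_given_s = {
    "no":    {"none": 0.96, "benign": 0.03, "malig": 0.01},
    "light": {"none": 0.88, "benign": 0.08, "malig": 0.04},
    "heavy": {"none": 0.60, "benign": 0.25, "malig": 0.15},
}
cancers = ("none", "benign", "malig")

# Product rule: P(C, S) = P(C | S) P(S).
joint = {(s, c): p_s[s] * p_c_given_s[s][c] for s in p_s for c in cancers}

# Marginalization: P(C) = sum_s P(C, S = s).
p_c = {c: sum(joint[(s, c)] for s in p_s) for c in cancers}

# Bayes rule: P(S | C) = P(C, S) / P(C).
p_s_given_c = {(s, c): joint[(s, c)] / p_c[c] for (s, c) in joint}
print(p_s_given_c[("no", "none")])  # ~0.826 with these CPTs
```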
A Bayesian Network
(Figure: BN with edges Age → Exposure to Toxics, Age → Smoking, Gender → Smoking, Exposure to Toxics → Cancer, Smoking → Cancer, Cancer → Serum Calcium, Cancer → Lung Tumor)
Independence
Age and Gender are independent.
  P(A | G) = P(A)   A ⊥ G
  P(G | A) = P(G)   G ⊥ A
  P(A, G) = P(G | A) P(A) = P(G) P(A)
  P(A, G) = P(A | G) P(G) = P(A) P(G)
(Figure: nodes Age and Gender with no connecting edge)
Conditional Independence
Cancer is independent of Age and Gender given Smoking:
  P(C | A, G, S) = P(C | S)   C ⊥ A, G | S
(Figure: Age → Smoking ← Gender; Smoking → Cancer)
More Conditional Independence: Naïve Bayes
Serum Calcium is independent of Lung Tumor, given Cancer:
  P(L | SC, C) = P(L | C)
Serum Calcium and Lung Tumor are (marginally) dependent
(Figure: Cancer → Serum Calcium, Cancer → Lung Tumor)
Naïve Bayes in general
(Figure: H → E₁, E₂, E₃, …, Eₙ)
2n + 1 parameters:
  P(h);  P(eᵢ | h), P(eᵢ | ¬h),  i = 1, …, n
More Conditional Independence: Explaining Away
Exposure to Toxics and Smoking are (marginally) independent:  E ⊥ S
Exposure to Toxics is dependent on Smoking, given Cancer:
  P(E = heavy | C = malignant) > P(E = heavy | C = malignant, S = heavy)
(Figure: Exposure to Toxics → Cancer ← Smoking)
Put it all together
P(A, G, E, S, C, L, SC) =
  P(A) · P(G) · P(E | A) · P(S | A, G) · P(C | E, S) · P(L | C) · P(SC | C)
(Figure: the Age / Gender / Smoking / Exposure to Toxics / Cancer / Serum Calcium / Lung Tumor network)
General Product (Chain) Rule for Bayesian Networks
P(X₁, X₂, …, Xₙ) = ∏_{i=1}^{n} P(Xᵢ | Paᵢ),  where Paᵢ = parents(Xᵢ)
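A minimal sketch of this chain rule with explicit CPT tables (a toy three-variable network; structure and numbers are illustrative, not from the slides):

```python
# Network: A and G are roots; S has parents (A, G).
parents = {"A": (), "G": (), "S": ("A", "G")}

# CPTs: cpt[var][(value, parent_values)] = probability; numbers are made up.
cpt = {
    "A": {(True, ()): 0.4, (False, ()): 0.6},
    "G": {(True, ()): 0.5, (False, ()): 0.5},
    "S": {(True, (True, True)): 0.3,   (False, (True, True)): 0.7,
          (True, (True, False)): 0.2,  (False, (True, False)): 0.8,
          (True, (False, True)): 0.15, (False, (False, True)): 0.85,
          (True, (False, False)): 0.1, (False, (False, False)): 0.9},
}

def joint(assign):
    """P(assign) = prod_i P(x_i | parents(x_i))."""
    p = 1.0
    for var, pa in parents.items():
        p *= cpt[var][(assign[var], tuple(assign[q] for q in pa))]
    return p

print(joint({"A": True, "G": False, "S": True}))  # 0.4 * 0.5 * 0.2 = 0.04
```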
Bayesian Networks
• The naive assumption that the variables are independent (e.g., the naïve Bayes assumption that the variables are independent given the class) can be too restrictive
• But representing joint distributions is intractable without some independence assumptions
• Bayesian networks explicitly model conditional independence among subsets of variables, yielding a graphical representation of probability distributions that admit such independence
Bayesian network
• A Bayesian network is a directed acyclic graph (DAG) in which the nodes represent random variables
• Each node is annotated with a conditional probability distribution P(Xᵢ | Parents(Xᵢ)) representing the dependency of that node on its parents in the DAG
• Each node is asserted to be conditionally independent of its non-descendants, given its immediate predecessors (parents)
• Arcs represent direct dependencies
Bayesian Networks
Efficient factorized representation of probability distributions via conditional independence
(Figure: Earthquake → Alarm ← Burglary; Earthquake → Radio; Alarm → Call)
P(A | E, B):
  E   B    P(a)   P(¬a)
  e   b    0.9    0.1
  e   ¬b   0.2    0.8
  ¬e  b    0.9    0.1
  ¬e  ¬b   0.01   0.99
Bayesian Networks
• Qualitative part: statistical independence statements, represented in the form of a directed acyclic graph (DAG)
  • Nodes – random variables
  • Edges – direct influence
• Quantitative part: conditional probability distributions – one for each random variable, conditioned on its parents
(Figure: the Earthquake/Burglary/Alarm/Radio/Call network annotated with the P(A | E, B) table above)
Qualitative part
• Nodes are independent of non-descendants given their parents
d-separation:
• a graph-theoretic criterion for reading independence statements
• can be computed in linear time (in the number of edges)
(Figure: the Earthquake/Burglary/Alarm/Radio/Call network)
Directed graphs and joint probabilities
• Let {X₁, X₂, …, Xₙ} be a set of random variables
• Let X_{πᵢ} denote the set of parents of Xᵢ
• Associate with each vertex of the directed acyclic graph a random variable Xᵢ and a function of the form fᵢ(xᵢ, x_{πᵢ})
• Then p(x₁, …, xₙ) = ∏_{i=1}^{n} fᵢ(xᵢ, x_{πᵢ})
What independences does a Bayes net model?
• In order for a Bayesian network to model a probability distribution, the following must be true by definition: each variable is conditionally independent of all its non-descendants in the graph, given the value of all its parents.
• This implies
  P(X₁, …, Xₙ) = ∏_{i=1}^{n} P(Xᵢ | parents(Xᵢ))
  e.g., P(E, B, R, A, C) = P(E) P(B) P(R | E) P(A | E, B) P(C | A)
• But what else does it imply?
(Figure: the Earthquake/Burglary/Alarm/Radio/Call network)
What Independences does a Bayes Network model?
Example: the chain Z → Y → X
Given Y, does learning the value of Z tell us nothing new about X?
i.e., is P(X | Y, Z) equal to P(X | Y)?
Yes. Since we know the value of all of X's parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z.
Also, since independence is symmetric, P(Z | Y, X) = P(Z | Y).
Quick proof that independence is symmetric
• Assume: P(X | Y, Z) = P(X | Y), i.e., X and Z are independent given Y
  P(Z | X, Y) = P(X, Y | Z) P(Z) / P(X, Y)                 (Bayes' rule)
              = P(X | Y, Z) P(Y | Z) P(Z) / P(X, Y)        (chain rule)
              = P(X | Y) P(Y | Z) P(Z) / P(X, Y)           (by assumption)
              = P(X | Y) P(Y | Z) P(Z) / (P(X | Y) P(Y))   (chain rule)
              = P(Y | Z) P(Z) / P(Y) = P(Z | Y)            (Bayes' rule)
What Independences does a Bayes Network model?
• Let I(X, Y, Z) represent X and Z being conditionally independent given Y.
(Figure: Y is the parent of both X and Z)
• I(X, Y, Z)? Yes, just as in the previous example: all of X's parents are given, and Z is not a descendant of X.
What Independences does a Bayes Network model?
• I(X, {U}, Z)? No.
• I(X, {U, V}, Z)? Yes.
(Figure: U and V are the parents of X; Z is a non-descendant of X connected to X only through its parents)
Things get a little more confusing
• X has no parents, so we know all its parents' values trivially
• Z is not a descendant of X
• So I(X, {}, Z), even though there is an undirected path from X to Z through an unobserved variable Y
• What if we do know the value of Y? Or one of its descendants?
(Figure: X → Y ← Z)
The Burglar Alarm example
• Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes.
• The Earth arguably doesn't care whether your house is currently being burgled.
• While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing.
(Figure: Burglar → Alarm ← Earthquake; Alarm → Phone Call)
The Burglar Alarm example (contd.)
• But now suppose you learn that there was a medium-sized earthquake in your neighborhood. …Probably not a burglar after all.
• The earthquake "explains away" the hypothetical burglar.
• But then it must NOT be the case that I(Burglar, {Phone Call}, Earthquake), even though I(Burglar, {}, Earthquake)!
(Figure: Burglar → Alarm ← Earthquake; Alarm → Phone Call)
d-separation to the rescue
• Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation.
d-separation
Two variables are independent if all paths between them are blocked by evidence.
Three cases:
• Common cause
• Intermediate cause
• Common effect
d-separation: common cause (diverging connection)
(Figure: E with children R and A; the path is blocked when E is instantiated, unblocked otherwise)
If we do not know whether an earthquake occurred, then the radio announcement can influence our belief about the alarm having gone off. If we know that an earthquake occurred, then the radio announcement gives no information about the alarm.
Evidence may be transmitted through a diverging connection unless the common cause is instantiated.
d-separation: intermediate cause (serial connection)
(Figure: a serial chain such as Earthquake → Alarm → Call; the path is blocked when the intermediate variable is instantiated, unblocked otherwise)
Evidence may be transmitted through a serial connection unless the intermediate node is instantiated (blocking the path).
d-separation: common effect (converging connection)
(Figure: E and B with a common child A, and A → C; the path is blocked when neither A nor any of its descendants has received evidence, unblocked otherwise)
Evidence may be transmitted through a converging connection only if either the variable or one of its descendants has received evidence.
Example
I(X, Y, Z) denotes that X and Z are independent given Y
– Surely I(R, {}, B)
– Possibly ¬I(R, {A}, B)
– Surely I(R, {E, A}, B)
– Possibly ¬I(R, {B}, C)
(Figure: E → R; E → A ← B; A → C)
d-separation
Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is “blocked” by evidence E
d-separation
• Theorem [Verma & Pearl, 1988]: If a set of evidence variables E d-separates X and Z in a Bayesian network's graph, then I(X, E, Z).
• d-separation can be computed in linear time using a depth-first-search-like algorithm.
• We now have a fast algorithm for automatically inferring whether learning the value of one variable might give us any additional hints about some other variable, given what we already know.
• Variables may actually be independent even when they are not d-separated, depending on the actual probabilities involved.
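A sketch of checking d-separation programmatically, assuming a networkx version that provides `d_separated` (newer releases rename it `is_d_separator`); the graph is the example network above:

```python
import networkx as nx

# E -> R, E -> A <- B, A -> C (the example network above).
g = nx.DiGraph([("E", "R"), ("E", "A"), ("B", "A"), ("A", "C")])

# I(R, {}, B): the converging connection at A blocks the only path.
print(nx.d_separated(g, {"R"}, {"B"}, set()))       # True

# Observing A (or its descendant C) unblocks the converging connection.
print(nx.d_separated(g, {"R"}, {"B"}, {"A"}))       # False

# I(R, {E, A}, B): observing E blocks the diverging connection at E.
print(nx.d_separated(g, {"R"}, {"B"}, {"E", "A"}))  # True
```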
d-separation example
(Figure: a DAG on the nodes A, B, C, D, E, F, G, H, I, J)
I(C, {}, D)?
I(C, {A}, D)?
I(C, {A, B}, D)?
I(C, {A, B, J}, D)?
I(C, {A, B, E, J}, D)?
Markov Blanket
• A node is conditionally independent of all other nodes in the network given its parents, children, and children's parents – its Markov blanket
(Figure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls)
Burglary is independent of JohnCalls and MaryCalls given Alarm and Earthquake
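A small sketch that reads a node's Markov blanket off a DAG with networkx (the helper name `markov_blanket` is mine):

```python
import networkx as nx

def markov_blanket(dag: nx.DiGraph, node) -> set:
    """Parents, children, and children's other parents of `node`."""
    parents = set(dag.predecessors(node))
    children = set(dag.successors(node))
    spouses = {p for c in children for p in dag.predecessors(c)} - {node}
    return parents | children | spouses

g = nx.DiGraph([("Burglary", "Alarm"), ("Earthquake", "Alarm"),
                ("Alarm", "JohnCalls"), ("Alarm", "MaryCalls")])
print(markov_blanket(g, "Burglary"))  # {'Alarm', 'Earthquake'}
```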
Bayesian Networks: Summary
• Bayesian networks offer an efficient representation of probability distributions
• Efficient:
  • local models
  • independence (d-separation)
• Effective: algorithms take advantage of structure to
  • compute posterior probabilities
  • compute the most probable instantiation
  • support decision making
Inference in Bayesian networks
• A BN compactly models the full joint distribution by taking advantage of the independences that exist between variables
• Inference tasks:
  – Diagnostic inference (from effect to cause): P(Burglary | JohnCalls = T)
  – Predictive inference (from cause to effect): P(JohnCalls | Burglary = T)
  – Other probabilistic queries (queries on joint distributions)
• Can we take advantage of independences to construct special algorithms that speed up inference?
Bayesian network inference
(Figure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls)
P(B) = 0.001    P(E) = 0.002
P(A | B, E) = 0.95     P(A | B, ¬E) = 0.94
P(A | ¬B, E) = 0.29    P(A | ¬B, ¬E) = 0.001
P(J | A) = 0.90    P(J | ¬A) = 0.05
P(M | A) = 0.70    P(M | ¬A) = 0.01
Example
• A device is operating normally or malfunctioning
• A sensor indirectly monitors the operation of the device
• The sensor reading is either high or low
Diagnostic inference: example
Compute the probability that the device is operating normally given that the sensor reading is high (S).
Inference in Bayesian networks
Bad news:
– Exact inference in BNs is NP-hard (Cooper)
– Approximate inference is NP-hard (Dagum and Luby)
In practice, things are not so bad:
• Exact inference
  – inference in simple chains
  – variable elimination
  – clustering / join tree algorithms
• Approximate inference
  – stochastic simulation / sampling methods
  – Markov chain Monte Carlo methods
  – mean field theory
Computing joint probability distributions using a Bayesian network
Any entry in the joint probability distribution can be calculated from the Bayesian network:
  P(J, M, A, ¬B, ¬E) = P(J | M, A, ¬B, ¬E) P(M, A, ¬B, ¬E)
                     = P(J | A) P(M | A, ¬B, ¬E) P(A, ¬B, ¬E)
                     = P(J | A) P(M | A) P(A | ¬B, ¬E) P(¬B, ¬E)
                     = P(J | A) P(M | A) P(A | ¬B, ¬E) P(¬B) P(¬E)
(We're just using the chain rule and conditional independence.)
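A numeric check of this factorization, using the CPTs from the burglary network slide above (helper names are mine):

```python
# CPTs from the burglary network slide.
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,   # P(A | B, E)
       (False, True): 0.29, (False, False): 0.001}
p_j = {True: 0.90, False: 0.05}  # P(J | A)
p_m = {True: 0.70, False: 0.01}  # P(M | A)

# P(J, M, A, ¬B, ¬E) = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
p = p_j[True] * p_m[True] * p_a[(False, False)] * (1 - p_b) * (1 - p_e)
print(p)  # ~0.000628
```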
Computing joint probabilities
• The joint distribution can be used to answer any query about the domain
• A Bayesian network represents the joint distribution, so any query about the domain can be answered using a BN
• Tradeoff: a BN can be much more concise, but you need to calculate probabilities from the joint distribution rather than look them up in a table
General formula:
  P(X₁ = x₁, …, Xₙ = xₙ) = ∏_{i=1}^{n} P(Xᵢ = xᵢ | Parents(Xᵢ))
Inference in Bayesian networks
Blind approach:
• sum out all un-instantiated variables from the full joint
• express the joint distribution as a product of conditionals
Computational cost:
• number of additions: 15
• number of products: 16 × 4 = 64
Inference in Bayesian networks
Interleave sums and products
• Combine sums and products in a smart way (multiplicative constants can be moved outside a sum)
Computational cost:
• number of additions: 1 + 2·[1 + 1 + 2] = 9
• number of products: 2·[2 + 2·(1 + 2)] = 16
Inference in Bayesian networks
• Smart interleaving of sums and products can speed up the computation of joint probability queries
• What if we want to compute P(B = T, J = T)?
• Smart caching of results that would otherwise be recomputed can save time
Inference in Bayesian networks
• When does caching of results become handy?
• There are other queries for which results can be shared
• General technique: variable elimination
Inference in Bayesian networks
• When does caching of results become handy?
• What if we want to compute a diagnostic query?
• Exactly the probabilities we have just computed!
• There are other queries for which caching and ordering of sums and products can be shared to save computation
• General technique: variable elimination
Inference in Bayesian networks
General idea of variable elimination:
results are cached in a tree structure
Inference in Bayesian Networks
Find P(Q = q | E = e)
– Q: the query variable(s)
– E: the set of evidence variables
P(q | e) = P(q, e) / P(e)
With X₁, …, Xₙ the network variables other than Q and E:
  P(q, e) = Σ_{x₁, x₂, …, xₙ} P(q, e, X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ)
Basic Inference
A → B:  P(b) = ?
  P(b) = Σₐ P(a, b) = Σₐ P(b | a) P(a)
Basic Inference
A → B → C
  P(b) = Σₐ P(a, b) = Σₐ P(b | a) P(a)
  P(c) = Σ_b P(c | b) P(b)
Equivalently, starting from the full joint:
  P(c) = Σ_{a,b} P(a, b, c) = Σ_{a,b} P(c | a, b) P(a, b)
       = Σ_{a,b} P(c | b) P(b | a) P(a)
       = Σ_b P(c | b) Σₐ P(b | a) P(a)
       = Σ_b P(c | b) P(b)
Inference in trees
(Figure: Y₁ → X ← Y₂)
  P(X) = Σ_{y₁,y₂} P(X, y₁, y₂) = Σ_{y₁,y₂} P(X | y₁, y₂) P(y₁, y₂)
       = Σ_{y₁,y₂} P(X | y₁, y₂) P(y₁) P(y₂)
Polytrees
A network is singly connected (a polytree) if it contains no undirected loops.
(Figure: a multiply connected network, which is not a polytree, and a polytree)
Inference in polytrees
• Theorem: Inference in polytrees can be performed in time that is polynomial in the number of variables.
• Main idea: in variable elimination, need only maintain distributions over single nodes.
Inference with Bayesian Networks
• Inference in polytrees can be performed efficiently
• Inference with general DAGs is NP-hard – proof by reduction of SAT to Bayesian network inference
Approaches to inference
• Exact inference
  – inference in simple chains
  – variable elimination
  – clustering / join tree algorithms
• Approximate inference
  – stochastic simulation / sampling methods
  – Markov chain Monte Carlo methods
  – mean field theory
Inference – A more complicated example
(Figure: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass ← Rain)
  P(w) = Σ_{c,s,r} P(w | r, s) P(r | c) P(s | c) P(c)
       = Σ_{s,r} P(w | r, s) Σ_c P(r | c) P(s | c) P(c)
       = Σ_{s,r} P(w | r, s) f₁(r, s)
Because of the structure of the BN, some sub-expressions in the joint depend only on a small number of variables.
By computing them once and caching the result, we can avoid generating them exponentially many times.
Variable Elimination
• General idea:
  • Write the query in the form
    P(Xₙ, e) = Σ_{x_k} ⋯ Σ_{x₃} Σ_{x₂} ∏ᵢ P(xᵢ | paᵢ)
  • Iteratively
    – move all irrelevant terms outside of the innermost sum
    – perform the innermost sum, getting a new term
    – insert the new term into the product
Variable Elimination
• A factor over X is a function from Domain(X) to numbers in the interval [0, 1]
• A conditional probability table is a factor
• A joint distribution is a factor
• In Bayesian network inference:
  • factors are multiplied to generate new factors
  • variables in factors are summed out (marginalization)
• A variable can be summed out as soon as all the factors in which the variable appears have been multiplied
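A minimal sketch of these two factor operations, with a factor represented as a tuple of variable names plus a table mapping boolean value tuples to numbers (the representation and names are my own):

```python
from itertools import product

def factor_product(f, g):
    """Pointwise product: multiply entries that agree on shared variables."""
    fvars, ftab = f
    gvars, gtab = g
    out_vars = fvars + tuple(v for v in gvars if v not in fvars)
    table = {}
    for vals in product([False, True], repeat=len(out_vars)):
        a = dict(zip(out_vars, vals))
        table[vals] = (ftab[tuple(a[v] for v in fvars)]
                       * gtab[tuple(a[v] for v in gvars)])
    return out_vars, table

def sum_out(var, f):
    """Marginalize `var` out of the factor f."""
    fvars, ftab = f
    i = fvars.index(var)
    out_vars = fvars[:i] + fvars[i + 1:]
    table = {}
    for vals, p in ftab.items():
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return out_vars, table

# P(B) = sum_a P(B | a) P(a) on a tiny chain A -> B (made-up numbers).
pa = (("A",), {(True,): 0.6, (False,): 0.4})
pba = (("B", "A"), {(True, True): 0.9, (False, True): 0.1,
                    (True, False): 0.2, (False, False): 0.8})
print(sum_out("A", factor_product(pba, pa)))  # P(B=T) = 0.62
```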
A More Complex Example
(Figure: the Asia network – Visit to Asia → Tuberculosis; Smoking → Lung Cancer, Smoking → Bronchitis; Tuberculosis → Abnormality in Chest ← Lung Cancer; Abnormality in Chest → X-Ray; Abnormality in Chest → Dyspnea ← Bronchitis)
(Figure: the Asia network with nodes V, S, T, L, A, B, X, D)
  P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
• We want to compute P(d)
• Need to eliminate: v, s, x, t, l, a, b
• We want to compute P(d)
• Need to eliminate: v, s, x, t, l, a, b
• Initial factors:
  P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
Eliminate v. Compute f_v(t) = Σ_v P(v) P(t | v):
  ⇒ f_v(t) P(s) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
Note: f_v(t) = P(t). In general, the result of elimination is not necessarily a probability term.
• We want to compute P(d)
• Need to eliminate: s, x, t, l, a, b
• Current factors:
  f_v(t) P(s) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
Eliminate s. Compute f_s(b, l) = Σ_s P(s) P(b | s) P(l | s):
  ⇒ f_v(t) f_s(b, l) P(a | t, l) P(x | a) P(d | a, b)
Summing over s results in a factor with two arguments, f_s(b, l). In general, the result of elimination may be a function of several variables.
• We want to compute P(d)
• Need to eliminate: x, t, l, a, b
• Current factors:
  f_v(t) f_s(b, l) P(a | t, l) P(x | a) P(d | a, b)
Eliminate x. Compute f_x(a) = Σ_x P(x | a):
  ⇒ f_v(t) f_s(b, l) f_x(a) P(a | t, l) P(d | a, b)
Note: f_x(a) = 1 for all values of a.
• We want to compute P(d)
• Need to eliminate: t, l, a, b
• Current factors:
  f_v(t) f_s(b, l) f_x(a) P(a | t, l) P(d | a, b)
Eliminate t. Compute f_t(a, l) = Σ_t f_v(t) P(a | t, l):
  ⇒ f_s(b, l) f_x(a) f_t(a, l) P(d | a, b)
• We want to compute P(d)
• Need to eliminate: l, a, b
• Current factors:
  f_s(b, l) f_x(a) f_t(a, l) P(d | a, b)
Eliminate l. Compute f_l(a, b) = Σ_l f_s(b, l) f_t(a, l):
  ⇒ f_x(a) f_l(a, b) P(d | a, b)
• We want to compute P(d)
• Need to eliminate: a, b
• Current factors:
  f_x(a) f_l(a, b) P(d | a, b)
Eliminate a, then b. Compute
  f_a(b, d) = Σₐ f_x(a) f_l(a, b) P(d | a, b), then f_b(d) = Σ_b f_a(b, d):
  ⇒ f_a(b, d) ⇒ f_b(d) = P(d)
Basic operations
• Multiplying two factors
• Summing out a variable from a product of factors – marginalization
Example: multiplying factors (pointwise product)
• The pointwise product is NOT
  – matrix multiplication
  – element-by-element multiplication
Dealing with evidence
• How do we deal with evidence?
• Suppose we get evidence V = 1, S = 0, D = 1
• We want to compute P(L, V = 1, S = 0, D = 1)
(Figure: the Asia network)
Dealing with Evidence
• We start by writing the factors:
  P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
• Since we know that V = 1, we don't need to eliminate V
• Instead, we can replace the factors P(V) and P(T | V) with
  f_{P(V)} = P(V = 1)   and   f_{P(T|V)}(T) = P(T | V = 1)
• These "select" the appropriate parts of the original factors given the evidence
• Note that f_{P(V)} is a constant, and thus does not appear in the elimination of other variables
(Figure: the Asia network)
Variable Elimination
• We can now understand variable elimination as a sequence of rewriting operations
• The actual computation is done in the elimination steps
• Computation depends on the order of elimination
Dealing with Evidence
• Given evidence V = 1, S = 0, D = 1
• Compute P(L, V = 1, S = 0, D = 1)
• Initial factors, after setting evidence:
  f_{P(v)} f_{P(s)} f_{P(t|v)}(t) f_{P(l|s)}(l) f_{P(b|s)}(b) P(a | t, l) P(x | a) f_{P(d|a,b)}(a, b)
(Figure: the Asia network)
Dealing with Evidence
• Given evidence V = 1, S = 0, D = 1
• Compute P(L, V = 1, S = 0, D = 1)
• Initial factors, after setting evidence:
  f_{P(v)} f_{P(s)} f_{P(t|v)}(t) f_{P(l|s)}(l) f_{P(b|s)}(b) P(a | t, l) P(x | a) f_{P(d|a,b)}(a, b)
• Eliminating x, we get
  f_{P(v)} f_{P(s)} f_{P(t|v)}(t) f_{P(l|s)}(l) f_{P(b|s)}(b) P(a | t, l) f_x(a) f_{P(d|a,b)}(a, b)
(Figure: the Asia network)
Dealing with Evidence
• Given evidence V = 1, S = 0, D = 1
• Compute P(L, V = 1, S = 0, D = 1)
• Eliminating x, we get
  f_{P(v)} f_{P(s)} f_{P(t|v)}(t) f_{P(l|s)}(l) f_{P(b|s)}(b) P(a | t, l) f_x(a) f_{P(d|a,b)}(a, b)
• Eliminating t, we get
  f_{P(v)} f_{P(s)} f_{P(l|s)}(l) f_{P(b|s)}(b) f_t(a, l) f_x(a) f_{P(d|a,b)}(a, b)
(Figure: the Asia network)
Dealing with Evidence
• Given evidence V = 1, S = 0, D = 1
• Compute P(L, V = 1, S = 0, D = 1)
• Eliminating t, we get
  f_{P(v)} f_{P(s)} f_{P(l|s)}(l) f_{P(b|s)}(b) f_t(a, l) f_x(a) f_{P(d|a,b)}(a, b)
• Eliminating a, we get
  f_{P(v)} f_{P(s)} f_{P(l|s)}(l) f_{P(b|s)}(b) f_a(b, l)
(Figure: the Asia network)
Variable Elimination Algorithm
• Let X₁, …, Xₘ be an ordering on the non-query variables:
  Σ_{X₁} Σ_{X₂} … Σ_{Xₘ} ∏ⱼ P(Xⱼ | Parents(Xⱼ))
• For i = m, …, 1:
  • leave in the summation for Xᵢ only the factors mentioning Xᵢ
  • multiply those factors, getting a factor that contains a number for each value of the variables mentioned, including Xᵢ
  • sum out Xᵢ, getting a factor f that contains a number for each value of the variables mentioned, not including Xᵢ
  • replace the multiplied factors in the summation with f
Complexity of variable elimination
• Suppose in one elimination step we compute
  f_x(y₁, …, y_k) = Σ_x f′_x(x, y₁, …, y_k),  where
  f′_x(x, y₁, …, y_k) = ∏_{i=1}^{m} fᵢ(x, y_{i,1}, …, y_{i,lᵢ})
• This requires m · |Domain(X)| · ∏ᵢ |Domain(Yᵢ)| multiplications
  – for each value of x, y₁, …, y_k, we do m multiplications
• and |Domain(X)| · ∏ᵢ |Domain(Yᵢ)| additions
  – for each value of y₁, …, y_k, we do |Domain(X)| additions
• Complexity is (not surprisingly) exponential in the number of variables in the intermediate factor!
Understanding Variable Elimination
• We want to select "good" elimination orderings that reduce complexity
• This can be done by examining a graph-theoretic property of the "induced" graph; we will not cover this in class
• This reduces the problem of finding a good ordering to a graph-theoretic operation that is well understood – unfortunately, computing it is NP-hard!
Exercise: Variable elimination
(Figure: BN with edges Smart → Prepared, Study → Prepared, Smart → Pass, Prepared → Pass, Fair → Pass)
p(smart) = 0.8,  p(study) = 0.6,  p(fair) = 0.9
Query: What is the probability that a student is smart, given that he/she passes the exam?

P(Prepared | Smart, Study):
  Sm  St   P(Pr = T)
  T   T    0.9
  T   F    0.5
  F   T    0.7
  F   F    0.1

P(Pass | Smart, Prepared, Fair):
  Sm  Pr  Fair   P(Pa = T)
  T   T   T      0.9
  T   T   F      0.1
  T   F   T      0.7
  T   F   F      0.1
  F   T   T      0.7
  F   T   F      0.1
  F   F   T      0.2
  F   F   F      0.1
Bayesian Network Inference in polytrees – Message Passing algorithm
Decomposing the probabilities
• Suppose we want P(Xᵢ | E), where E is some set of evidence variables.
• Let's split E into two parts:
  – Eᵢ⁻ is the part consisting of assignments to variables in the subtree rooted at Xᵢ
  – Eᵢ⁺ is the rest of the variables in E
(Figure: a tree with the node Xᵢ marked)
Decomposing the probabilities
  P(Xᵢ | E) = P(Xᵢ | Eᵢ⁻, Eᵢ⁺)
            = P(Eᵢ⁻ | Xᵢ, Eᵢ⁺) P(Xᵢ | Eᵢ⁺) / P(Eᵢ⁻ | Eᵢ⁺)
            = P(Eᵢ⁻ | Xᵢ) P(Xᵢ | Eᵢ⁺) / P(Eᵢ⁻ | Eᵢ⁺)
            = α π(Xᵢ) λ(Xᵢ)
where:
• α is a constant independent of Xᵢ
• π(Xᵢ) = P(Xᵢ | Eᵢ⁺)
• λ(Xᵢ) = P(Eᵢ⁻ | Xᵢ)
Using the decomposition for inference
• We can use this decomposition to do inference as follows. First, compute λ(Xᵢ) = P(Eᵢ⁻ | Xᵢ) for all Xᵢ recursively, using the leaves of the tree as the base case.
Quick aside: “Virtual evidence”
• For theoretical simplicity, but without loss of generality, let us assume that all variables in E (the evidence set) are leaves in the tree.
• Observing Xᵢ is equivalent to observing a child Xᵢ′ of it, where P(Xᵢ′ | Xᵢ) = 1 if Xᵢ′ = Xᵢ, and 0 otherwise.
(Figure: observe Xᵢ directly, or equivalently add Xᵢ → Xᵢ′ and observe Xᵢ′)
Calculating λ(Xi) for non-leaves
• Suppose Xᵢ has one child, X_j.
• Then:
  λ(Xᵢ) = P(Eᵢ⁻ | Xᵢ) = Σ_{x_j} P(x_j, Eᵢ⁻ | Xᵢ)
        = Σ_{x_j} P(x_j | Xᵢ) P(Eᵢ⁻ | Xᵢ, x_j)
        = Σ_{x_j} P(x_j | Xᵢ) P(Eᵢ⁻ | x_j)
        = Σ_{x_j} P(x_j | Xᵢ) λ(x_j)
(Figure: Xᵢ with its single child)
Calculating λ(Xi) for non-leaves
• Now suppose Xᵢ has a set of children, C.
• Since Xᵢ d-separates each of its subtrees, the contribution of each subtree to λ(Xᵢ) is independent:
  λ(Xᵢ) = P(Eᵢ⁻ | Xᵢ) = ∏_{X_j ∈ C} λ_j(Xᵢ) = ∏_{X_j ∈ C} [ Σ_{x_j} P(x_j | Xᵢ) λ(x_j) ]
• where λ_j(Xᵢ) is the contribution to P(Eᵢ⁻ | Xᵢ) of the part of the evidence lying in the subtree rooted at Xᵢ's child X_j.
We are now λ-happy
• We have a way to recursively compute all the λ(Xi)’s, starting from the root and using the leaves as the base case.
• We can think of each node in the network as an autonomous processor that passes a little “λ message” to its parent.
(Figure: λ messages passed upward through the tree)
Computing π(Xᵢ)
(Figure: X_p is the parent of Xᵢ)
  π(Xᵢ) = P(Xᵢ | Eᵢ⁺) = Σ_{x_p} P(Xᵢ, x_p | Eᵢ⁺)
        = Σ_{x_p} P(Xᵢ | x_p, Eᵢ⁺) P(x_p | Eᵢ⁺)
        = Σ_{x_p} P(Xᵢ | x_p) P(x_p | Eᵢ⁺)
        = Σ_{x_p} P(Xᵢ | x_p) πᵢ(x_p)
• where πᵢ(X_p) is defined as P(X_p | Eᵢ⁺), the "π message" that Xᵢ receives from its parent X_p
Bayesian network inference in trees
• Thus we can compute all the π(Xᵢ)'s and, in turn, all the P(Xᵢ | E)'s.
• We can think of the nodes as autonomous processors passing λ and π messages to their neighbors.
(Figure: λ messages flowing up the tree, π messages flowing down)
Conjunctive queries
• What if we want, e.g., P(A, B | C) instead of just the marginal distributions P(A | C) and P(B | C)?
• Just use the chain rule:
  – P(A, B | C) = P(A | C) P(B | A, C)
  – Each of the latter probabilities can be computed using the technique just discussed.
Polytrees
• The previous technique can be generalized to polytrees: the undirected version of the graph is still a tree, but nodes can have more than one parent
Dealing with cycles
• Can deal with undirected cycles in a graph by
  • clustering variables together
  • conditioning
(Figure: the network A → B, A → C, B → D, C → D; conditioning instantiates A (set to 0, set to 1); clustering merges B and C into a compound node BC)
Dealing with cycles
• Can deal with undirected cycles in a graph by clustering variables together
(Figure: A → B, A → C, B → D, C → D becomes the chain A → BC → D)
Join trees or junction trees
An arbitrary Bayesian network can be transformed, via a graph-theoretic trick, into a join tree (a structure also used in databases) in which a similar message-passing method can be employed.
(Figure: a network on A, B, C, D, E, F, G clustered into a join tree)
In the worst case, the join tree nodes must take on values whose number grows exponentially with the number of nodes clustered together, but this often works well in practice when the number of nodes per cluster is small.
Junction Tree
• Why junction tree?
– Variable elimination is inefficient if the undirected graph underlying the Bayesian network contains cycles
– We can avoid cycles if we turn highly interconnected subsets of the nodes into “supernodes” (clusters)
• Objective
– Compute P(V = v | E = e), where v is a value of a variable V and e is the evidence for a set of variables E
Potentials
• Potentials:
– A potential φ_X maps each instantiation of a set of variables X to a nonnegative real number: φ_X : X → R⁺ ∪ {0}
• Marginalization
– φ_Y = Σ_{X\Y} φ_X, the marginalization of φ_X into Y, where Y ⊆ X
• Multiplication
– φ_Z = φ_X φ_Y, the multiplication of φ_X and φ_Y, where Z = X ∪ Y
Properties of Junction Tree
• An undirected tree
• Each node is a cluster (nonempty set) of variables
• Running intersection property:
– Given two clusters X and Y, all clusters on the path between X and Y contain X ∩ Y
• Separator sets (sepsets):
– The intersection of adjacent clusters

[Figure: clusters ABD, ADE, DEF in a chain, with sepset AD between ABD and ADE and sepset DE between ADE and DEF]
Properties of Junction Tree
• Belief potentials:
– Map each instantiation of clusters or sepsets into a real number
• Constraints:
– Consistency: for each cluster X and neighboring sepset S,  Σ_{X\S} φ_X = φ_S
– The joint distribution:  P(U) = Π_i φ_{Xi} / Π_j φ_{Sj}
Properties of Junction Tree
• If a junction tree satisfies these properties, it follows that:
– For each cluster (or sepset) X,  φ_X = P(X)
– The probability distribution of any variable V can be computed using any cluster (or sepset) X that contains V:  P(V) = Σ_{X\{V}} φ_X
Building Junction Trees
DAG → Moral Graph → Triangulated Graph → Identifying Cliques → Junction Tree
Constructing the Moral Graph
• Add undirected edges between all co-parents that are not currently joined (“marrying parents”)
• Drop the directions of the arcs

[Figure: the example network over nodes A–H with its co-parents joined and arc directions dropped]
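A short sketch of the two moralization steps; the DAG’s edge list below is invented, since the slide’s arrows are not recoverable from the figure:

```python
import itertools
import networkx as nx

def moralize(dag: nx.DiGraph) -> nx.Graph:
    """Moral graph: marry co-parents, then drop edge directions."""
    moral = dag.to_undirected()
    for node in dag.nodes:
        # Join every pair of parents of `node` not already joined.
        for u, v in itertools.combinations(dag.predecessors(node), 2):
            moral.add_edge(u, v)
    return moral

# Hypothetical edges loosely following the slide's A..H network.
dag = nx.DiGraph([("A","C"), ("B","C"), ("C","E"), ("D","E"),
                  ("E","G"), ("F","G"), ("G","H")])
print(sorted(moralize(dag).edges))
```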
Triangulating
• An undirected graph is triangulated iff every cycle of length > 3 contains an edge that connects two nonadjacent nodes of the cycle

[Figure: the moral graph over nodes A–H with fill-in edges added so that it is triangulated]
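One common way to triangulate is to simulate variable elimination and record the fill-in edges. A sketch (the elimination order here is arbitrary; in practice a good order matters a great deal):

```python
import networkx as nx

def triangulate(graph: nx.Graph, order) -> nx.Graph:
    """Triangulate by eliminating nodes in `order`, adding fill-in edges
    between each eliminated node's remaining neighbors."""
    work = graph.copy()     # graph we eliminate from
    filled = graph.copy()   # graph we return, with fill-ins added
    for node in order:
        neighbors = list(work.neighbors(node))
        for i, u in enumerate(neighbors):
            for v in neighbors[i + 1:]:
                work.add_edge(u, v)     # fill-in edge
                filled.add_edge(u, v)
        work.remove_node(node)
    return filled

# e.g. triangulated = triangulate(moralize(dag), order=list(dag.nodes))
```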
Identifying Cliques
• A clique is a subgraph of an undirected graph that is complete (has an edge between each pair of vertices) and maximal

[Figure: the triangulated graph over A–H with its maximal cliques ABD, ADE, ACE, DEF, CEG, EGH highlighted]
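networkx can enumerate the maximal cliques of the triangulated graph directly. The edge list below is illustrative, chosen so that the maximal cliques match the slide’s list:

```python
import networkx as nx

g = nx.Graph([("A","B"), ("A","D"), ("B","D"), ("A","E"), ("D","E"),
              ("A","C"), ("C","E"), ("D","F"), ("E","F"),
              ("C","G"), ("E","G"), ("E","H"), ("G","H")])
print([sorted(c) for c in nx.find_cliques(g)])
# Expected maximal cliques: ABD, ADE, ACE, DEF, CEG, EGH
```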
Junction Tree
• A junction tree is a subgraph of the clique graph that
– is a tree
– contains all the cliques
– satisfies the junction tree property
• Junction tree property: for each pair U, V of cliques with intersection S, all cliques on the path between U and V contain S.

[Figure: the clique graph over {ABD, ADE, ACE, DEF, CEG, EGH} and a junction tree for it: ABD –AD– ADE –AE– ACE –CE– CEG –EG– EGH, with DEF attached to ADE through sepset DE]
Inference
• Choose a root
• For each distribution (CPT) in the original Bayesian network, put the distribution into one of the clique nodes that contains all the variables referenced by the CPT (at least one such node must exist because of the moralization step)
• For each clique node, take the product of the distributions (as in variable elimination)
Example: Create Join Tree
[Figure: BN with X1 → X2, Y1 a child of X1, and Y2 a child of X2]

Junction tree: (X1,Y1) –[X1]– (X1,X2) –[X2]– (X2,Y2)
Example: Initialization
Variable | Associated cluster | Potential function
X1       | X1,Y1              | φ_{X1,Y1} = P(X1)
Y1       | X1,Y1              | φ_{X1,Y1} = P(X1) P(Y1|X1)
X2       | X1,X2              | φ_{X1,X2} = P(X2|X1)
Y2       | X2,Y2              | φ_{X2,Y2} = P(Y2|X2)

Junction tree: (X1,Y1) –[X1]– (X1,X2) –[X2]– (X2,Y2), with the sepset potentials φ_{X1} and φ_{X2} initialized to 1.
Example: Collect Evidence
• Choose an arbitrary clique, e.g. (X1,X2), where all potential functions will be collected.
• Recursively call the neighboring cliques for messages:
• 1. Call (X1,Y1):
– 1. Projection:  φ_{X1} = Σ_{{X1,Y1}\{X1}} φ_{X1,Y1} = Σ_{Y1} P(X1,Y1) = P(X1)
– 2. Absorption:  φ_{X1,X2} ← φ_{X1,X2} · φ_{X1}/φ_{X1}^old = P(X2|X1) P(X1) = P(X1,X2)
Example: Collect Evidence (cont.)
• 2. Call (X2,Y2):
– 1. Projection:  φ_{X2} = Σ_{{X2,Y2}\{X2}} φ_{X2,Y2} = Σ_{Y2} P(Y2|X2) = 1
– 2. Absorption:  φ_{X1,X2} ← φ_{X1,X2} · φ_{X2}/φ_{X2}^old = P(X1,X2)
Example: Distribute Evidence
• Pass messages recursively to the neighboring nodes
• Pass message from (X1,X2) to (X1,Y1):
– 1. Projection:  φ_{X1} = Σ_{{X1,X2}\{X1}} φ_{X1,X2} = Σ_{X2} P(X1,X2) = P(X1)
– 2. Absorption:  φ_{X1,Y1} ← φ_{X1,Y1} · φ_{X1}/φ_{X1}^old = P(X1,Y1) P(X1)/P(X1) = P(X1,Y1)
Example: Distribute Evidence (cont.)
• Pass message from (X1,X2) to (X2,Y2):
– 1. Projection:  φ_{X2} = Σ_{{X1,X2}\{X2}} φ_{X1,X2} = Σ_{X1} P(X1,X2) = P(X2)
– 2. Absorption:  φ_{X2,Y2} ← φ_{X2,Y2} · φ_{X2}/φ_{X2}^old = P(Y2|X2) P(X2)/1 = P(Y2,X2)
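Putting the whole worked example together numerically, here is a sketch with invented CPT values (binary variables, numpy tables); it reproduces the projection/absorption steps above:

```python
import numpy as np

# Made-up CPTs for the chain X1 -> X2 with observations Y1, Y2 unobserved.
P_X1 = np.array([0.6, 0.4])
P_X2_given_X1 = np.array([[0.7, 0.3], [0.2, 0.8]])   # rows: x1
P_Y1_given_X1 = np.array([[0.9, 0.1], [0.4, 0.6]])
P_Y2_given_X2 = np.array([[0.8, 0.2], [0.5, 0.5]])

# Initialization: cluster and sepset potentials.
phi_X1Y1 = P_X1[:, None] * P_Y1_given_X1   # phi(X1,Y1) = P(X1) P(Y1|X1)
phi_X1X2 = P_X2_given_X1.copy()            # phi(X1,X2) = P(X2|X1)
phi_X2Y2 = P_Y2_given_X2.copy()            # phi(X2,Y2) = P(Y2|X2)
sep_X1, sep_X2 = np.ones(2), np.ones(2)

# Collect evidence into (X1,X2): project, then absorb (divide by old sepset).
new_sep = phi_X1Y1.sum(axis=1)                        # = P(X1)
phi_X1X2 *= (new_sep / sep_X1)[:, None]; sep_X1 = new_sep
new_sep = phi_X2Y2.sum(axis=1)                        # = 1
phi_X1X2 *= (new_sep / sep_X2)[None, :]; sep_X2 = new_sep

# Distribute evidence from (X1,X2) back out.
new_sep = phi_X1X2.sum(axis=1)                        # = P(X1)
phi_X1Y1 *= (new_sep / sep_X1)[:, None]; sep_X1 = new_sep
new_sep = phi_X1X2.sum(axis=0)                        # = P(X2)
phi_X2Y2 *= (new_sep / sep_X2)[:, None]; sep_X2 = new_sep

print(phi_X1X2)   # now P(X1,X2)
print(phi_X2Y2)   # now P(X2,Y2); its rows sum to the marginal P(X2)
```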
Approximate Inference
• With large and highly connected graphical models, the cliques of the junction tree (or the intermediate factors in the variable elimination algorithm) grow in size, causing an exponential blowup in the number of computations performed
Inference in Bayesian network
Exact inference algorithms:
• Variable elimination
• Symbolic inference (D’Ambrosio)
• Message passing algorithm (Pearl)
• Clustering and join tree approach (Lauritzen, Spiegelhalter)
Approximate inference algorithms:
• Monte Carlo methods: forward sampling, likelihood sampling
• Variational methods
Stochastic simulation
• Suppose you are given values for some subset of the variables, G, and want to infer values for unknown variables, U
• Randomly generate a very large number of instantiations from the BN
– Generate instantiations for all variables: start at the root variables and work your way “forward”
• Only keep those instantiations that are consistent with the values for G
• Use the frequency of values for U to get estimated probabilities
• Accuracy of the results depends on the size of the sample (asymptotically approaches the exact results)
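A minimal sketch of this procedure (forward sampling plus rejection). The cpts dictionary format and all names are illustrative, not a library API; variables are assumed binary:

```python
import random

def forward_sample(cpts, order):
    """Draw one instantiation of every variable, parents before children.

    `cpts`: var -> (parents, table), where table maps a tuple of parent
    values to P(var = 1 | parents).
    """
    sample = {}
    for var in order:                        # topological order: roots first
        parents, table = cpts[var]
        p_true = table[tuple(sample[p] for p in parents)]
        sample[var] = 1 if random.random() < p_true else 0
    return sample

def rejection_estimate(cpts, order, query, given, n=100_000):
    """Estimate P(query=1 | given) by keeping only the samples consistent
    with `given` (a dict var -> value) and counting the query among them."""
    kept = hits = 0
    for _ in range(n):
        s = forward_sample(cpts, order)
        if all(s[v] == val for v, val in given.items()):
            kept += 1
            hits += s[query]
    return hits / kept if kept else float("nan")
```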
Stochastic Simulation
[Figure: BN with Cloudy → Sprinkler, Cloudy → Rain, and Sprinkler, Rain → WetGrass]

Query: P(WetGrass | Cloudy)?
1. Draw N samples from the BN by repeating 1.1 and 1.2:
 1.1. Guess Cloudy at random according to P(Cloudy)
 1.2. For each guess of Cloudy, guess Sprinkler and Rain, then WetGrass
2. Compute the ratio of the number of runs where WetGrass and Cloudy are true to the number of runs where Cloudy is true

P(WetGrass | Cloudy) = P(WetGrass, Cloudy) / P(Cloudy)
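Continuing the sampler sketched above on this network; all CPT numbers below are invented for illustration only:

```python
# Reuses forward_sample / rejection_estimate from the previous sketch.
cpts = {
    "Cloudy":    ((), {(): 0.5}),
    "Sprinkler": (("Cloudy",), {(0,): 0.5, (1,): 0.1}),
    "Rain":      (("Cloudy",), {(0,): 0.2, (1,): 0.8}),
    "WetGrass":  (("Sprinkler", "Rain"),
                  {(0,0): 0.0, (0,1): 0.9, (1,0): 0.9, (1,1): 0.99}),
}
order = ["Cloudy", "Sprinkler", "Rain", "WetGrass"]
print(rejection_estimate(cpts, order, query="WetGrass", given={"Cloudy": 1}))
```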
Stochastic simulation
• The probability is approximated using sample frequencies
• BN sampling:
– Generate samples in a top-down manner, following the links in the BN
– A sample is an assignment of values to all variables
BN Sampling Example

[Figure sequence: a step-by-step trace of drawing one sample from the network, assigning each variable in turn in topological order]
Rejection Sampling
Rejection sampling:
• Generate samples from the full joint by sampling the BN
• Use only the samples that agree with the condition; the remaining samples are rejected
• Problem: many samples can be rejected
Likelihood weighting
• Avoids the inefficiency of rejection sampling
• Idea: generate only samples consistent with the evidence (or conditioning event)
• If a variable’s value is set by the evidence, it is not sampled
• Problem: using simple counts is not enough, since the samples may occur with different probabilities
• Likelihood weighting: with every sample, keep a weight with which it should count towards the estimate
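A sketch of likelihood weighting for binary variables, reusing the cpts format of the earlier sampler (again an illustration, not a library API): evidence variables are clamped rather than sampled, and each sample is weighted by the likelihood of the evidence it was forced to match.

```python
import random

def likelihood_weighting(cpts, order, query, evidence, n=100_000):
    """Estimate P(query=1 | evidence) with weighted samples."""
    total = hit = 0.0
    for _ in range(n):
        sample, weight = {}, 1.0
        for var in order:                 # topological order: roots first
            parents, table = cpts[var]
            p_true = table[tuple(sample[p] for p in parents)]
            if var in evidence:
                sample[var] = evidence[var]            # clamp, don't sample
                weight *= p_true if evidence[var] == 1 else 1.0 - p_true
            else:
                sample[var] = 1 if random.random() < p_true else 0
        total += weight
        hit += weight * sample[query]
    return hit / total

# e.g. likelihood_weighting(cpts, order, "Rain", {"WetGrass": 1})
```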
Likelihood weighting Example

[Figure sequence: a step-by-step trace of likelihood weighting on the example network, clamping the evidence variables and accumulating the sample weight]
Likelihood Sampling
Likelihood Weighting