CIS664 KD&DM
G.F. Cooper, E. Herskovits
A Bayesian Method for the Induction of Probabilistic Networks
from Data
Presented by Uroš Midić
Mar 21, 2007
Introduction
Bayesian Belief Network
Learning Network Parameters
Learning Network Structure
Probability of Network Structure Given a Dataset
Finding the Most Probable Network Structure
K2 algorithm
Experimental result
Pros and cons
Introduction
Events A and B are independent if
P(A∩B)=P(A)P(B)
or in the case that P(A)>0 and P(B)>0,
P(A|B) = P(A) and P(B|A) = P(B)
Introduction
Discrete-valued random variables X and Y are independent if
for any a, b, the events X=a and Y=b are independent, i.e.
P(X=a∩Y=b)=P(X=a)P(Y=b)
or, equivalently,
P(X=a|Y=b) = P(X=a)
Introduction
Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y
given Z if for any a, b, c,
P(X=a | Y=b,Z=c)=P(X=a|Z=c)
This can be extended to sets of variables, e.g. we can say that X1…Xl is conditionally independent of
Y1…Ym given Z1…Zn, if
P(X1…Xl | Y1…Ym, Z1…Zn) = P(X1…Xl | Z1…Zn)
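To make the definition concrete, here is a small check in Python on a hypothetical joint distribution built so that X is conditionally independent of Y given Z; all variable names and numbers are invented for illustration only.

```python
# Hypothetical joint distribution over binary X, Y, Z, built as P(Z)P(X|Z)P(Y|Z),
# so X is conditionally independent of Y given Z by construction.
from itertools import product

p_z = {0: 0.5, 1: 0.5}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}   # p_y_given_z[z][y]

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product([0, 1], repeat=3)}

def prob(x=None, y=None, z=None):
    """Marginal probability of the (partial) assignment x, y, z."""
    return sum(p for (xv, yv, zv), p in joint.items()
               if (x is None or xv == x) and (y is None or yv == y) and (z is None or zv == z))

# P(X=1 | Y=1, Z=0) equals P(X=1 | Z=0): both evaluate to 0.1 for this distribution.
print(round(prob(x=1, y=1, z=0) / prob(y=1, z=0), 6),
      round(prob(x=1, z=0) / prob(z=0), 6))
```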
Introduction
Let X = {X1, …, Xn} be a set of discrete-valued variables, and each variable Xi has a defined set
of possible values V(Xi).
The joint space of the set of variables X is defined as V(X1) × V(X2) × … × V(Xn).
The probability distribution over the joint space specifies the probability for each of the possible variable bindings for the tuple (X1, …, Xn), and is
called the joint probability distribution.
Bayesian Belief Networks
A Bayesian Belief Network represents the joint probability distribution for a set of variables.
It specifies a set of conditional independence assumptions (in the form of a directed acyclic graph) and sets of local conditional probabilities (in the form of tables).
Each variable is represented by a node in the graph. The network arcs represent the assertion that each variable is conditionally independent of its non-descendants, given its immediate predecessors (parents).
Bayesian Belief Networks
Example:

[Figure: example network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire; Storm and BusTourGroup are the immediate parents of Campfire.]

Conditional probability table for Campfire:

      S,B   S,¬B   ¬S,B   ¬S,¬B
 C    0.4   0.1    0.8    0.2
¬C    0.6   0.9    0.2    0.8
Bayesian Belief Networks
[Same network and Campfire table as on the previous slide.]

Example:
P(Campfire=True | Storm=True, BusTourGroup=True) = 0.4
Bayesian Belief Networks
For any assignment of values (x1,…,xn) to the variables (X1,…,Xn), we can compute the joint probability

P(x1,…,xn) = ∏i=1..n P(xi | Parents(Xi))

The values P(xi | Parents(Xi)) come directly from the tables associated with the respective Xi.
We can therefore recover any probability of the form P(A | B), where A and B are subsets of X = {X1,…,Xn}.
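A minimal sketch of this computation in Python; the three-node structure and all CPT numbers below are hypothetical and serve only to illustrate the product formula.

```python
# Minimal sketch: evaluate the factored joint probability of a belief network,
# P(x1,...,xn) = prod_i P(xi | Parents(Xi)). Structure and numbers are hypothetical.

parents = {"A": [], "B": ["A"], "C": ["A", "B"]}   # small DAG: A -> B, {A, B} -> C

# cpt[X][(parent values...)][x] = P(X = x | Parents(X) = parent values)
cpt = {
    "A": {(): {True: 0.3, False: 0.7}},
    "B": {(True,): {True: 0.9, False: 0.1},
          (False,): {True: 0.2, False: 0.8}},
    "C": {(True, True): {True: 0.4, False: 0.6},
          (True, False): {True: 0.1, False: 0.9},
          (False, True): {True: 0.8, False: 0.2},
          (False, False): {True: 0.2, False: 0.8}},
}

def joint_probability(assignment):
    """Product over nodes of P(xi | parents(xi)), read from the local tables."""
    prob = 1.0
    for node, pars in parents.items():
        parent_values = tuple(assignment[p] for p in pars)
        prob *= cpt[node][parent_values][assignment[node]]
    return prob

print(joint_probability({"A": True, "B": True, "C": False}))  # 0.3 * 0.9 * 0.6 = 0.162
```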
Learning Network Parameters
We are given a network structure.
We are given a dataset, such that for each training example all variables have assigned
values, i.e. we always get a full assignment (x1,…,xn) to the variables (X1, …, Xn).
Then learning the conditional probability tables is straightforward:
1. Initialize the counters in all the tables to 0.
2. For each training example, increase all the appropriate counters in the tables.
3. Convert the counts into probabilities.
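A minimal sketch of these three steps, for a hypothetical two-variable structure and a tiny fully observed dataset:

```python
# Minimal sketch: learn CPTs by counting when the structure is known and every
# case is fully observed. The structure (A -> B) and the data are hypothetical.
from collections import defaultdict

parents = {"A": [], "B": ["A"]}
data = [{"A": 1, "B": 1}, {"A": 1, "B": 0}, {"A": 0, "B": 0}, {"A": 1, "B": 1}]

# 1. all counters implicitly start at 0
counts = {node: defaultdict(lambda: defaultdict(int)) for node in parents}

# 2. for each training example, increase the appropriate counters
for case in data:
    for node, pars in parents.items():
        parent_values = tuple(case[p] for p in pars)
        counts[node][parent_values][case[node]] += 1

# 3. convert the counts into conditional probabilities
cpt = {}
for node, table in counts.items():
    cpt[node] = {}
    for parent_values, value_counts in table.items():
        total = sum(value_counts.values())
        cpt[node][parent_values] = {v: c / total for v, c in value_counts.items()}

print(cpt["B"][(1,)])   # {1: 0.666..., 0: 0.333...} from the three cases with A=1
```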
Learning Network Parameters
We are given a network structure but the dataset has missing values.
In this case, learning the conditional probability tables is not straightforward, and usually involves a gradient-ascent procedure to
maximize P(D|h) – probability of the dataset given the model h.
Note that D (dataset) is fixed and we search for the best h that maximizes P(D|h).
Learning Network Structure
The assumption for learning the network parameters is that we know (or assume) a
network structure.
In simple cases, the network structure is constructed by a domain expert.
In most cases a domain expert is not available, or the domain is so complex that even an expert cannot specify a reliable structure.
Example: Gene-expression microarray datasets have > 10K variables.
Probability of N.S. Given a Dataset
We are given a dataset and two candidate networks, BS1 and BS2. Which of the two is a better fit for this dataset?

Case  x1  x2  x3
 1    +   −   −
 2    +   +   +
 3    −   −   +
 4    +   +   +
 5    −   −   −
 6    −   +   +
 7    +   +   +
 8    −   −   −
 9    +   +   +
10    −   −   −

[Figure: the two candidate structures, S1 and S2, over x1, x2, x3.]

P(BS1|D) is much greater than P(BS2|D); S1 was the structure used to generate the dataset.
Probability of N.S. Given a Dataset
We are given a dataset and two networks.
How can we compare them? By Bayes' rule,

P(BS1|D) / P(BS2|D) = [P(BS1,D)/P(D)] / [P(BS2,D)/P(D)] = P(BS1,D) / P(BS2,D)

so to rank structures it is enough to be able to calculate P(BS,D).
Probability of N.S. Given a Dataset
Assumptions made in the paper:
1. The database variables are discrete
2. Cases occur independently, given a belief-network model.
3. There are no cases that have variables with missing values.
4. Prior probabilities (before observing the data) for conditional probability assignments are uniform.
Probability of N.S. Given a Dataset
How to calculate P(BS,D)?
xi has a set of parents πi.
xi has ri possible value assignments: (vi1, …, vi,ri).
There are qi unique instantiations (wi1, wi2, …, wi,qi) of πi in D.
Nijk is the number of cases in D in which variable xi has the value vik and πi is instantiated as wij.
Let Nij = ∑k=1..ri Nijk.
With the four assumptions we get a formula:
P(BS,D) = P(BS) · ∏i=1..n ∏j=1..qi [ (ri − 1)! / (Nij + ri − 1)! ] · ∏k=1..ri Nijk!
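As a sketch, the formula can be evaluated directly for a small, fully observed dataset. The structure_score helper, the variable names, and the data below are all hypothetical, and the value is computed only up to the structure prior P(BS).

```python
# Minimal sketch: evaluate prod_i prod_j (ri-1)!/(Nij+ri-1)! * prod_k Nijk!
# for a fixed structure over a tiny, fully observed, hypothetical dataset.
from collections import Counter
from itertools import product
from math import factorial

def structure_score(data, parents, values):
    """Product term of the Cooper-Herskovits formula, i.e. P(BS,D) / P(BS)."""
    score = 1.0
    for node, pars in parents.items():
        r_i = len(values[node])
        # Loop over all possible parent instantiations; those that never occur
        # in D contribute a factor of 1, so only the qi observed ones matter.
        for w_ij in product(*(values[p] for p in pars)):
            n_ijk = Counter(case[node] for case in data
                            if tuple(case[p] for p in pars) == w_ij)
            n_ij = sum(n_ijk.values())
            term = factorial(r_i - 1) / factorial(n_ij + r_i - 1)
            for k in values[node]:
                term *= factorial(n_ijk.get(k, 0))
            score *= term
    return score

# Hypothetical 4-case database over binary x1, x2 with structure x1 -> x2.
values = {"x1": [0, 1], "x2": [0, 1]}
data = [{"x1": 1, "x2": 1}, {"x1": 1, "x2": 1}, {"x1": 0, "x2": 0}, {"x1": 1, "x2": 0}]
print(structure_score(data, {"x1": [], "x2": ["x1"]}, values))
```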
Probability of N.S. Given a Dataset
How to calculate P(BS,D)?
Surprisingly, with suitable indexing and bounding of constants, the time complexity of evaluating this formula is O(mn) – linear in the number of cases and the number of variables.

P(BS,D) = P(BS) · ∏i=1..n ∏j=1..qi [ (ri − 1)! / (Nij + ri − 1)! ] · ∏k=1..ri Nijk!
Probability of N.S. Given a Dataset
We now have an efficient way to calculate P(BS,D).
We could calculate all
P(BSi|D) = P(BSi,D) / ∑j P(BSj,D)
and find the optimal one, but the set of possible structures BSi grows exponentially with n.
However, if we had a set Y of structures such that ∑BS∈Y P(BS,D) ≈ P(D), and Y is small enough, then we could efficiently approximate P(BS|D) for every BS ∈ Y.
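As a small illustration, if the joint values P(BS,D) are available for every structure in such a candidate set Y, the posteriors within Y could be approximated by simple normalization; the structure labels and numbers below are made up.

```python
# Sketch: approximate P(BS | D) for structures in a small candidate set Y
# by normalizing P(BS, D) within Y (valid when Y captures most of the mass).
def approximate_posteriors(joint_scores):
    """joint_scores: dict mapping a structure label to its P(BS, D) value."""
    total = sum(joint_scores.values())          # stands in for P(D)
    return {name: s / total for name, s in joint_scores.items()}

# Hypothetical joint scores for two candidate structures.
print(approximate_posteriors({"S1": 2.0e-6, "S2": 5.0e-8}))
```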
K2 algorithm
We start with:

P(BS,D) = P(BS) · ∏i=1..n ∏j=1..qi [ (ri − 1)! / (Nij + ri − 1)! ] · ∏k=1..ri Nijk!

Assuming the prior P(BS) is the same constant c for every structure, the maximization decomposes over the nodes:

maxBS P(BS,D) = c · ∏i=1..n maxπi [ ∏j=1..qi ( (ri − 1)! / (Nij + ri − 1)! ) ∏k=1..ri Nijk! ]

The time complexity of this maximization is O(m·n²·r·2ⁿ). However, if we assume that a node can have at most u parents, then the complexity is O(m·u·n·r·T(n,u)), where T(n,u) = ∑k=0..u C(n,k).
K2 algorithm
Let πi be the set of parents of xi in BS; we denote this event by πi → xi. Then we can rewrite P(BS) as P(BS) = ∏i=1..n P(πi → xi), and we get:

maxBS P(BS,D) = ∏i=1..n maxπi [ P(πi → xi) ∏j=1..qi ( (ri − 1)! / (Nij + ri − 1)! ) ∏k=1..ri Nijk! ]

If we assume an ordering of the variables such that xj cannot be a parent of xi whenever i < j (i.e. a node's parents must precede it in the ordering), then the number of possible parent sets πi for each xi is smaller, but the overall complexity is still exponential.
K2 algorithm
K2 is a heuristic algorithm. It takes as input a set of n nodes, an ordering on the nodes, an upper bound on the number of parents a node may have, and a database D containing m cases.
It starts with the assumption that a node has no parents, and then tries to incrementally add parents whose addition most increases the probability of the resulting structure.
When the addition of no single parent can increase the probability, the algorithm continues with another node.
This procedure is applied to the nodes in the given ordering, starting with x1, which under the ordering cannot have any parents in the DAG.
The time complexity is polynomial, O(m·u²·n²·r), which in the worst case, when u = n, is O(m·n⁴·r).
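A compact sketch of this greedy search is shown below. It reuses the hypothetical structure_score helper and the tiny dataset from the earlier sketch; applying that helper to a single node plays the role of the per-node score g(i, πi) in the paper.

```python
# Sketch of the K2 greedy search, under a node ordering and parent bound u.
# structure_score, data, and values are the hypothetical objects defined earlier.
def k2(data, ordering, values, u):
    parents = {node: [] for node in ordering}
    for idx, node in enumerate(ordering):
        candidates = set(ordering[:idx])            # parents must precede the node
        best = structure_score(data, {node: parents[node]}, values)
        improved = True
        while improved and len(parents[node]) < u and candidates:
            improved = False
            # try each remaining candidate and keep the best single addition
            scored = {c: structure_score(data, {node: parents[node] + [c]}, values)
                      for c in candidates}
            c_best = max(scored, key=scored.get)
            if scored[c_best] > best:
                best = scored[c_best]
                parents[node].append(c_best)
                candidates.remove(c_best)
                improved = True
    return parents

# Hypothetical usage with the tiny dataset from the earlier sketch:
print(k2(data, ["x1", "x2"], values, u=1))   # expected to recover x1 as a parent of x2
```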
Experimental result
Using a predefined network structure with 37 nodes and the associated conditional probabilities – provided by an expert in the domain of medicine – that describes potential problems with anaesthesia in the operating room, the authors generated a database with 10000 cases.
They ran the algorithm on the generated database, with an ordering of the nodes that was consistent with the original structure.
The algorithm almost completely reconstructed the original network. It missed one original arc, and added one arc that was not present in the original network.
Pros and cons
+ All known exact algorithms have exponential complexity; this heuristic algorithm runs in polynomial time.
+ The preliminary results are promising.
+ The algorithm can be extended to handle databases with missing values; however, this extension is exponential in the number of missing values.
– The algorithm still requires an ordering of the nodes/variables.
References
G. F. Cooper and E. Herskovits, "A Bayesian Method for the Induction of Probabilistic Networks from Data", Machine Learning 9 (1992), pp. 309–347.
T. Mitchell, Machine Learning, McGraw-Hill, 1997.