CIS664 KD&DM
G.F. Cooper, E. Herskovits
A Bayesian Method for the Induction of Probabilistic Networks
from Data
Presented by Uroš Midić
Mar 21, 2007
Introduction
Bayesian Belief Network
Learning Network Parameters
Learning Network Structure
Probability of Network Structure Given a Dataset
Finding the Most Probable Network Structure
K2 algorithm
Experimental result
Pros and cons
Introduction
Events A and B are independent if
P(A∩B)=P(A)P(B)
or in the case that P(A)>0 and P(B)>0,
P(A|B) = P(A) and P(B|A) = P(B)
Introduction
Discrete-valued random variables X and Y are independent if
for any a, b, the events X=a and Y=b are independent, i.e.
P(X=a∩Y=b)=P(X=a)P(Y=b)
or, equivalently,
P(X=a|Y=b) = P(X=a)
Introduction
Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y
given Z if for any a, b, c,
P(X=a | Y=b,Z=c)=P(X=a|Z=c)
This can be extended to sets of variables, e.g. we can say that X1…Xl is conditionally independent of
Y1…Ym given Z1…Zn, if
P(X1…Xl | Y1…Ym, Z1…Zn) = P(X1…Xl | Z1…Zn)
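To make the definition concrete, here is a small check in Python on a hypothetical joint distribution built so that X is conditionally independent of Y given Z; all variable names and numbers are invented for illustration only.

```python
# Hypothetical joint distribution over binary X, Y, Z, built as P(Z)P(X|Z)P(Y|Z),
# so X is conditionally independent of Y given Z by construction.
from itertools import product

p_z = {0: 0.5, 1: 0.5}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}   # p_y_given_z[z][y]

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product([0, 1], repeat=3)}

def prob(x=None, y=None, z=None):
    """Marginal probability of the (partial) assignment x, y, z."""
    return sum(p for (xv, yv, zv), p in joint.items()
               if (x is None or xv == x) and (y is None or yv == y) and (z is None or zv == z))

# P(X=1 | Y=1, Z=0) equals P(X=1 | Z=0): both evaluate to 0.1 for this distribution.
print(round(prob(x=1, y=1, z=0) / prob(y=1, z=0), 6),
      round(prob(x=1, z=0) / prob(z=0), 6))
```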
Introduction
Let X = {X1, …, Xn} be a set of discrete-valued variables, and each variable Xi has a defined set
of possible values V(Xi).
The joint space of the set of variables X is defined as V(X1) × V(X2) × … × V(Xn).
The probability distribution over the joint space specifies the probability for each of the possible variable bindings for the tuple (X1, …, Xn), and is
called the joint probability distribution.
Bayesian Belief Networks
A Bayesian Belief Network represents the joint probability distribution for a set of variables.
It specifies a set of conditional independence assumptions (in the form of a directed acyclic graph) and sets of local conditional probabilities (in the form of tables).
Each variable is represented by a node in the graph. The network arcs represent the assertion that each variable is conditionally independent of its non-descendants, given its immediate predecessors (parents).
Bayesian Belief Networks
Example:

[Figure: example network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire; Storm and BusTourGroup are the immediate parents of Campfire.]

Conditional probability table for Campfire:

      S,B   S,¬B   ¬S,B   ¬S,¬B
 C    0.4   0.1    0.8    0.2
¬C    0.6   0.9    0.2    0.8
Bayesian Belief Networks
[Same network and Campfire table as on the previous slide.]

Example:
P(Campfire=True | Storm=True, BusTourGroup=True) = 0.4
Bayesian Belief Networks
For any assignment of values (x1,…,xn) to the variables (X1,…,Xn), we can compute the joint probability

P(x1,…,xn) = ∏i=1..n P(xi | Parents(Xi))

The values P(xi | Parents(Xi)) come directly from the tables associated with the respective Xi.
We can therefore recover any probability of the form P(A | B), where A and B are subsets of X = {X1,…,Xn}.
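A minimal sketch of this computation in Python; the three-node structure and all CPT numbers below are hypothetical and serve only to illustrate the product formula.

```python
# Minimal sketch: evaluate the factored joint probability of a belief network,
# P(x1,...,xn) = prod_i P(xi | Parents(Xi)). Structure and numbers are hypothetical.

parents = {"A": [], "B": ["A"], "C": ["A", "B"]}   # small DAG: A -> B, {A, B} -> C

# cpt[X][(parent values...)][x] = P(X = x | Parents(X) = parent values)
cpt = {
    "A": {(): {True: 0.3, False: 0.7}},
    "B": {(True,): {True: 0.9, False: 0.1},
          (False,): {True: 0.2, False: 0.8}},
    "C": {(True, True): {True: 0.4, False: 0.6},
          (True, False): {True: 0.1, False: 0.9},
          (False, True): {True: 0.8, False: 0.2},
          (False, False): {True: 0.2, False: 0.8}},
}

def joint_probability(assignment):
    """Product over nodes of P(xi | parents(xi)), read from the local tables."""
    prob = 1.0
    for node, pars in parents.items():
        parent_values = tuple(assignment[p] for p in pars)
        prob *= cpt[node][parent_values][assignment[node]]
    return prob

print(joint_probability({"A": True, "B": True, "C": False}))  # 0.3 * 0.9 * 0.6 = 0.162
```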
Learning Network Parameters
We are given a network structure.
We are given a dataset, such that for each training example all variables have assigned
values, i.e. we always get a full assignment (x1,…,xn) to the variables (X1, …, Xn).
Then learning the conditional probability tables is straightforward:
1. Initialize the counters in all the tables to 0.
2. For each training example, increase all the appropriate counters in the tables.
3. Convert the counts into probabilities.
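A minimal sketch of these three steps, for a hypothetical two-variable structure and a tiny fully observed dataset:

```python
# Minimal sketch: learn CPTs by counting when the structure is known and every
# case is fully observed. The structure (A -> B) and the data are hypothetical.
from collections import defaultdict

parents = {"A": [], "B": ["A"]}
data = [{"A": 1, "B": 1}, {"A": 1, "B": 0}, {"A": 0, "B": 0}, {"A": 1, "B": 1}]

# 1. all counters implicitly start at 0
counts = {node: defaultdict(lambda: defaultdict(int)) for node in parents}

# 2. for each training example, increase the appropriate counters
for case in data:
    for node, pars in parents.items():
        parent_values = tuple(case[p] for p in pars)
        counts[node][parent_values][case[node]] += 1

# 3. convert the counts into conditional probabilities
cpt = {}
for node, table in counts.items():
    cpt[node] = {}
    for parent_values, value_counts in table.items():
        total = sum(value_counts.values())
        cpt[node][parent_values] = {v: c / total for v, c in value_counts.items()}

print(cpt["B"][(1,)])   # {1: 0.666..., 0: 0.333...} from the three cases with A=1
```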
Learning Network Parameters
We are given a network structure but the dataset has missing values.
In this case, learning the conditional probability tables is not straightforward, and usually involves a gradient-ascent procedure to
maximize P(D|h) – probability of the dataset given the model h.
Note that D (dataset) is fixed and we search for the best h that maximizes P(D|h).
Learning Network Structure
The assumption for learning the network parameters is that we know (or assume) a
network structure.
In simple cases, the network structure is constructed by a domain expert.
In most cases a domain expert is not available, or the domain is so complex that even an expert cannot specify a reliable structure.
Example: Gene-expression microarray datasets have > 10K variables.
Probability of N.S. Given a Dataset
We are given a dataset and two candidate networks, BS1 and BS2. Which of the two is a better fit for this dataset?

Case  x1  x2  x3
 1    +   −   −
 2    +   +   +
 3    −   −   +
 4    +   +   +
 5    −   −   −
 6    −   +   +
 7    +   +   +
 8    −   −   −
 9    +   +   +
10    −   −   −

[Figure: the two candidate structures, S1 and S2, over x1, x2, x3.]

P(BS1|D) is much greater than P(BS2|D); S1 was the structure used to generate the dataset.
Probability of N.S. Given a Dataset
We are given a dataset and two networks.
How can we compare them? By Bayes' rule,

P(BS1|D) / P(BS2|D) = [P(BS1,D)/P(D)] / [P(BS2,D)/P(D)] = P(BS1,D) / P(BS2,D)

so to rank structures it is enough to be able to calculate P(BS,D).
Probability of N.S. Given a Dataset
Assumptions made in the paper:
1. The database variables are discrete
2. Cases occur independently, given a belief-network model.
3. There are no cases that have variables with missing values.
4. Prior probabilities (before observing the data) for conditional probability assignments are uniform.
Probability of N.S. Given a Dataset
How to calculate P(BS,D)?
xi has a set of parents πi.
xi has ri possible value assignments: (vi1, …, vi,ri).
There are qi unique instantiations (wi1, wi2, …, wi,qi) of πi in D.
Nijk is the number of cases in D in which variable xi has the value vik and πi is instantiated as wij.
Let Nij = ∑k=1..ri Nijk.
With the four assumptions we get a formula:
P(BS,D) = P(BS) · ∏i=1..n ∏j=1..qi [ (ri − 1)! / (Nij + ri − 1)! ] · ∏k=1..ri Nijk!
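As a sketch, the formula can be evaluated directly for a small, fully observed dataset. The structure_score helper, the variable names, and the data below are all hypothetical, and the value is computed only up to the structure prior P(BS).

```python
# Minimal sketch: evaluate prod_i prod_j (ri-1)!/(Nij+ri-1)! * prod_k Nijk!
# for a fixed structure over a tiny, fully observed, hypothetical dataset.
from collections import Counter
from itertools import product
from math import factorial

def structure_score(data, parents, values):
    """Product term of the Cooper-Herskovits formula, i.e. P(BS,D) / P(BS)."""
    score = 1.0
    for node, pars in parents.items():
        r_i = len(values[node])
        # Loop over all possible parent instantiations; those that never occur
        # in D contribute a factor of 1, so only the qi observed ones matter.
        for w_ij in product(*(values[p] for p in pars)):
            n_ijk = Counter(case[node] for case in data
                            if tuple(case[p] for p in pars) == w_ij)
            n_ij = sum(n_ijk.values())
            term = factorial(r_i - 1) / factorial(n_ij + r_i - 1)
            for k in values[node]:
                term *= factorial(n_ijk.get(k, 0))
            score *= term
    return score

# Hypothetical 4-case database over binary x1, x2 with structure x1 -> x2.
values = {"x1": [0, 1], "x2": [0, 1]}
data = [{"x1": 1, "x2": 1}, {"x1": 1, "x2": 1}, {"x1": 0, "x2": 0}, {"x1": 1, "x2": 0}]
print(structure_score(data, {"x1": [], "x2": ["x1"]}, values))
```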
Probability of N.S. Given a Dataset
How to calculate P(BS,D)?
Surprisingly, with suitable indexing and bounding of constants, the time complexity of evaluating this formula is O(mn) – linear in the number of cases and the number of variables.

P(BS,D) = P(BS) · ∏i=1..n ∏j=1..qi [ (ri − 1)! / (Nij + ri − 1)! ] · ∏k=1..ri Nijk!
Probability of N.S. Given a Dataset
We now have an efficient way to calculate P(BS,D).
We could calculate all
P(BSi|D) = P(BSi,D) / ∑j P(BSj,D)
and find the optimal one, but the set of possible structures BSi grows exponentially with n.
However, if we had a set Y of structures such that ∑BS∈Y P(BS,D) ≈ P(D), and Y is small enough, then we could efficiently approximate P(BS|D) for every BS ∈ Y.
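As a small illustration, if the joint values P(BS,D) are available for every structure in such a candidate set Y, the posteriors within Y could be approximated by simple normalization; the structure labels and numbers below are made up.

```python
# Sketch: approximate P(BS | D) for structures in a small candidate set Y
# by normalizing P(BS, D) within Y (valid when Y captures most of the mass).
def approximate_posteriors(joint_scores):
    """joint_scores: dict mapping a structure label to its P(BS, D) value."""
    total = sum(joint_scores.values())          # stands in for P(D)
    return {name: s / total for name, s in joint_scores.items()}

# Hypothetical joint scores for two candidate structures.
print(approximate_posteriors({"S1": 2.0e-6, "S2": 5.0e-8}))
```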
K2 algorithm
We start with:

P(BS,D) = P(BS) · ∏i=1..n ∏j=1..qi [ (ri − 1)! / (Nij + ri − 1)! ] · ∏k=1..ri Nijk!

Assuming the prior P(BS) is the same constant c for every structure, the maximization decomposes over the nodes:

maxBS P(BS,D) = c · ∏i=1..n maxπi [ ∏j=1..qi ( (ri − 1)! / (Nij + ri − 1)! ) ∏k=1..ri Nijk! ]

The time complexity of this maximization is O(m·n²·r·2ⁿ). However, if we assume that a node can have at most u parents, then the complexity is O(m·u·n·r·T(n,u)), where T(n,u) = ∑k=0..u C(n,k).
K2 algorithm
Let πi be the set of parents of xi in BS; we denote this event by πi → xi. Then we can rewrite P(BS) as P(BS) = ∏i=1..n P(πi → xi), and we get:

maxBS P(BS,D) = ∏i=1..n maxπi [ P(πi → xi) ∏j=1..qi ( (ri − 1)! / (Nij + ri − 1)! ) ∏k=1..ri Nijk! ]

If we assume an ordering of the variables such that xj cannot be a parent of xi whenever i < j (i.e. a node's parents must precede it in the ordering), then the number of possible parent sets πi for each xi is smaller, but the overall complexity is still exponential.
K2 algorithm
K2 is a heuristic algorithm. It takes as input a set of n nodes, an ordering on the nodes, an upper bound on the number of parents a node may have, and a database D containing m cases.
It starts with the assumption that a node has no parents, and then tries to incrementally add parents whose addition most increases the probability of the resulting structure.
When the addition of no single parent can increase the probability, the algorithm continues with another node.
This procedure is applied to the nodes in the given ordering, starting with x1, which under the ordering cannot have any parents in the DAG.
The time complexity is polynomial, O(m·u²·n²·r), which in the worst case, when u = n, is O(m·n⁴·r).
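A compact sketch of this greedy search is shown below. It reuses the hypothetical structure_score helper and the tiny dataset from the earlier sketch; applying that helper to a single node plays the role of the per-node score g(i, πi) in the paper.

```python
# Sketch of the K2 greedy search, under a node ordering and parent bound u.
# structure_score, data, and values are the hypothetical objects defined earlier.
def k2(data, ordering, values, u):
    parents = {node: [] for node in ordering}
    for idx, node in enumerate(ordering):
        candidates = set(ordering[:idx])            # parents must precede the node
        best = structure_score(data, {node: parents[node]}, values)
        improved = True
        while improved and len(parents[node]) < u and candidates:
            improved = False
            # try each remaining candidate and keep the best single addition
            scored = {c: structure_score(data, {node: parents[node] + [c]}, values)
                      for c in candidates}
            c_best = max(scored, key=scored.get)
            if scored[c_best] > best:
                best = scored[c_best]
                parents[node].append(c_best)
                candidates.remove(c_best)
                improved = True
    return parents

# Hypothetical usage with the tiny dataset from the earlier sketch:
print(k2(data, ["x1", "x2"], values, u=1))   # expected to recover x1 as a parent of x2
```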
Experimental result
Using a predefined network structure with 37 nodes and the associated conditional probabilities – provided by an expert in the domain of medicine – that describes potential problems with anaesthesia in the operating room, the authors generated a database with 10000 cases.
They ran the algorithm on the generated database, with an ordering of the nodes that was consistent with the original structure.
The algorithm almost completely reconstructed the original network. It missed one original arc, and added one arc that was not present in the original network.
Pros and cons
+ All known exact algorithms have exponential complexity; this heuristic algorithm runs in polynomial time.
+ The preliminary results are promising.
+ The algorithm can be extended to handle databases with missing values; however, this extension is exponential in the number of missing values.
– The algorithm still requires an ordering of the nodes/variables.
References
G. F. Cooper and E. Herskovits, "A Bayesian Method for the Induction of Probabilistic Networks from Data", Machine Learning 9 (1992), pp. 309–347.
T. Mitchell, Machine Learning, McGraw-Hill, 1997.