MTAT.05.113 Bayesian Networks
Parameter Estimation in Bayesian Networks
Siim Orasmaa, Krista Liin
Introduction
We have:
- a Bayesian network structure S with parameters θ
- access to a database of cases D
We want to:
- estimate the parameters of the model (conditional probabilities) from the given cases
Two approaches:
- estimate the parameters once and for all
- adapt the model as each new case arrives
Outline
Parameter estimation
- Maximum likelihood estimation
- Bayesian estimation
- Incomplete data and the EM algorithm
Adaptation
- Type variables
- Fractional updating
- Fading
Tuning
Complete data
Complete case:
- a configuration over all the variables in the model
We assume the parameters can be learned independently:
- Global independence: the parameters for different variables are independent
- Local independence: the uncertainties of the parameters for different parent configurations are independent
Maximum likelihood estimation

Likelihood of the model M given data D:

    L(M | D) = ∏_{d ∈ D} P(d | M)

Choose the parameter set θ* that maximizes the likelihood:

    θ* = arg max_θ L(M | D)
Maximum likelihood estimation II

MLE for probability matrices:
- the intuition of using frequencies as estimates
Example, estimate for P(A=a | B=b, C=c):

    P(A=a | B=b, C=c) = N(A=a, B=b, C=c) / N(B=b, C=c)

i.e. positive counts divided by total counts.
Drawback: outcomes with zero counts get zero probabilities.
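To make the counting concrete, here is a minimal sketch; the toy dataset and the states of A, B, C are illustrative assumptions, not part of the slides:

```python
cases = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a2", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b2", "C": "c1"},
]

def mle(cases, child, value, parents):
    """Estimate P(child=value | parents) as N(child, parents) / N(parents)."""
    matching = [c for c in cases
                if all(c[k] == v for k, v in parents.items())]
    n_total = len(matching)                          # total counts N(B=b, C=c)
    n_positive = sum(c[child] == value for c in matching)  # positive counts
    return n_positive / n_total

print(mle(cases, "A", "a1", {"B": "b1", "C": "c1"}))  # 2/3
print(mle(cases, "A", "a3", {"B": "b1", "C": "c1"}))  # 0.0 -- the zero-count drawback
```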
Bayesian estimation
Alternative principle to MLE:
- start with a prior distribution, and use experience (the database) to update the distribution
- collapse the posterior distribution to its mean value and use this as the final value of the parameter
Even prior distribution:
- e.g. add 1 virtual count to all variable state occurrences
Bayesian estimation II
In the case of binary variables:
Let X be a binary variable (yes, no), and suppose we have performed a number of independent experiments of which n turned up yes and m turned up no. Then, starting with an even prior distribution for θ, the Bayesian estimate for P(X = yes) is

    P(X = yes) = (n + 1) / (n + m + 2)
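A minimal sketch of this estimate, assuming the even prior above (one virtual count per state):

```python
def bayes_estimate(n_yes, n_no):
    """Bayesian estimate of P(X=yes) under an even prior:
    one virtual count added to each of the two states."""
    return (n_yes + 1) / (n_yes + n_no + 2)

print(bayes_estimate(0, 0))  # 0.5   -- no data, so the prior mean
print(bayes_estimate(7, 3))  # 0.667 -- vs. the MLE 7/10 = 0.7
```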
Incomplete data
Some of the cases in the database may contain missing values.
How the data can be missing:
- Missing at random (MAR): the probability that a value is missing depends only on the observed (existing) values
- Missing completely at random (MCAR): the probability is independent of the observed values
- Neither MAR nor MCAR
EM Algorithm
Expectation-Maximization algorithm:
- finds maximum likelihood estimates for θ when the given dataset is incomplete (assuming MAR)
- starts with even or random probability distributions
- alternates between two steps:
Expectation step
- "complete" the data set by using the current parameter estimates (calculate expectations for the missing values)
Maximization step
- use the "completed" data set to find a new maximum likelihood estimate for the parameters
EM Algorithm II
Calculating expected counts for a configuration:
- if a case is inconsistent with the configuration, it counts as 0
- if a case contains the entire configuration, it counts as 1
- if the value of a variable is missing in a case, the case contributes a fractional count corresponding to the probability of seeing the configuration
- if more than one value is missing, we need to calculate a joint probability → use the junction tree structure
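A minimal sketch of such an expected count for a single configuration (A=a1, B=b1); the toy cases, the use of None for a missing value, and the placeholder estimate P(A=a1 | B=b1) = 0.6 are assumptions for illustration:

```python
p_a1_given_b1 = 0.6  # current parameter estimate (assumed placeholder)

cases = [
    {"A": "a1", "B": "b1"},  # contains the configuration -> counts 1
    {"A": "a2", "B": "b1"},  # inconsistent               -> counts 0
    {"A": None, "B": "b1"},  # A missing -> fractional count P(a1 | b1)
    {"A": "a1", "B": "b2"},  # inconsistent (wrong B)     -> counts 0
]

expected_count = 0.0
for c in cases:
    if c["B"] != "b1":
        continue                         # inconsistent with B=b1
    if c["A"] == "a1":
        expected_count += 1.0            # full configuration observed
    elif c["A"] is None:
        expected_count += p_a1_given_b1  # expectation over the missing value

print(expected_count)  # 1.6
```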
EM Algorithm III
The expected counts are then used in the M-step as if they were "real" counts – the maximum likelihood estimate for a conditional probability is calculated as

    estimated positive counts / estimated total counts

The alternating procedure continues until the probabilities no longer change, or until another termination criterion is met.
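Putting the two steps together, here is a self-contained toy EM loop for a single binary variable with missing values; the data and termination threshold are assumed, and a real network would compute the expectations with inference (e.g. a junction tree):

```python
cases = ["yes", "yes", None, "no", None]  # None marks a missing value

p = 0.5  # even starting distribution for P(X=yes)
for _ in range(50):
    # E-step: expected count of "yes"; each missing case contributes p
    expected_yes = sum(1.0 if c == "yes" else p if c is None else 0.0
                       for c in cases)
    # M-step: estimated positive counts / estimated total counts
    p_new = expected_yes / len(cases)
    if abs(p_new - p) < 1e-9:  # termination criterion
        break
    p = p_new

print(p)  # converges to 2/3, the MLE over the observed values
```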
EM Algorithm IV
The EM algorithm can also be generalized to estimate the maximum a posteriori (MAP) parameters:
- virtual counts are added to both the numerator and the denominator in the M-step
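A sketch of this MAP variant of the M-step, assuming one virtual count per state of a binary variable:

```python
def map_mstep(expected_positive, n_total, virtual=1):
    """M-step with virtual counts added to numerator and denominator."""
    return (expected_positive + virtual) / (n_total + 2 * virtual)

print(map_mstep(3.0, 5))  # 4/7, instead of the ML estimate 3/5
```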
* EM Algorithm more formally
What we have:
- y_1 ... y_n – observable variables
- z_1 ... z_n – latent variables
- θ – all possible parameters in the model
Main goal is to find

    θ* = arg max_θ P(θ | y_1 ... y_n) ∝ P(y_1 ... y_n | θ) P(θ) ∝ P(y_1 ... y_n | θ)

(the last step assumes an even prior over θ). As

    P(y_1 ... y_n | θ) = ∫ P(y_1 ... y_n, z_1 ... z_n | θ) dz

is difficult to calculate and optimize, we use the auxiliary function

    Q(θ | θ^t) = ∫ P(z_1 ... z_n | θ^t, y_1 ... y_n) log P(θ, z | y_1 ... y_n) dz
* EM Algorithm more formally II
What the EM algorithm does, starting from a random starting point θ^0:

E-step
- find the probabilities for z_1 ... z_n when all parameters are fixed to θ^t:

    P(z_1 ... z_n | θ^t, y_1 ... y_n)

M-step
- now that this distribution is fixed, find the θ that maximizes the integral:

    θ^{t+1} = arg max_θ Q(θ | θ^t)
Adaptation
Adapt to observed data – change the probability tables to better suit newly observed values.
Why do we need adaptation:
- a new situation for an existing network
- to have more accurate probabilities
Second-order uncertainty: when we are not sure how probable something is
- general solution: allow each probability to range within an interval
- the second-order uncertainty then needs to be reduced as new cases arrive
Adaptation: Type variables
Main idea: add a modifiable parent node (a type variable) to the node whose probability table you are uncertain about.
Example: the milk test.
When adding type variables to the network, keep the new nodes d-separated.
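A minimal sketch of the idea, with assumed toy numbers: the type variable T holds two candidate tables for P(A | T), each observed case updates the distribution over T, and this in turn adapts the effective P(A):

```python
p_t = {"t1": 0.5, "t2": 0.5}  # prior over the type variable T
p_a_given_t = {"t1": {"yes": 0.9, "no": 0.1},
               "t2": {"yes": 0.3, "no": 0.7}}

def observe(p_t, a):
    """Return the updated P(T) after observing A=a (Bayes' rule)."""
    joint = {t: p_t[t] * p_a_given_t[t][a] for t in p_t}
    norm = sum(joint.values())
    return {t: v / norm for t, v in joint.items()}

p_t = observe(p_t, "yes")  # one new case with A=yes
# Effective (adapted) P(A=yes) = sum_t P(t) * P(A=yes | t)
print(sum(p_t[t] * p_a_given_t[t]["yes"] for t in p_t))  # 0.75
```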
Adaptation: Fractional updating
Main idea: iteratively update the probability distributions with each new case.
Example for P(A | b_i, c_j), where A has three states.

Prior distribution (counts n_1, n_2, n_3 with sample size s):

    P(A | b_i, c_j) = (n_1/s, n_2/s, n_3/s)

1) new case e = (a_1, b_i, c_j):

    P(A | b_i, c_j) = ((n_1 + 1)/(s + 1), n_2/(s + 1), n_3/(s + 1))

2) new case e = (?, b_i, c_j), with P(A | b_i, c_j, e) = (y_1, y_2, y_3):

    P(A | b_i, c_j) = ((n_1 + y_1)/(s + 1), (n_2 + y_2)/(s + 1), (n_3 + y_3)/(s + 1))

3) new case e = (a_1, ?, ?), with P(b_i, c_j | e) = z:

    P(A | b_i, c_j) = ((n_1 + z)/(s + z), n_2/(s + z), n_3/(s + z))
Adaptation: Fractional updating II

In general:

    x_k = (n_k + z · y_k) / (s + z) = (n_k + P(a_k, b_i, c_j | e)) / (s + P(b_i, c_j | e))

Drawback: the sample size is overestimated; even a case that carries no new information about A still increases the counts, making the network increasingly resistant to change.
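A sketch of one fractional update, with assumed toy counts; the fully observed case shown corresponds to case 1 above:

```python
n = [4.0, 3.0, 3.0]  # counts for a_1, a_2, a_3 given (b_i, c_j), assumed
s = 10.0             # sample size

def fractional_update(n, s, p_joint, z):
    """n_k := n_k + P(a_k, b_i, c_j | e);  s := s + P(b_i, c_j | e)."""
    n = [nk + pj for nk, pj in zip(n, p_joint)]
    return n, s + z

# Fully observed case e = (a_1, b_i, c_j): z = 1, all joint mass on a_1.
n, s = fractional_update(n, s, [1.0, 0.0, 0.0], 1.0)
print([nk / s for nk in n])  # updated P(A | b_i, c_j) = (5/11, 3/11, 3/11)
```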
Adaptation: Fading
Main idea: older evidence is given less weight by multiplying the counts with a fading factor q ∈ (0, 1).
If a new case selects state a_1 (case 1 above):

    s := s · q + 1;  n_1 := n_1 · q + 1;  n_2 := n_2 · q;  n_3 := n_3 · q

Effective sample size (the fixed point of s := s · q + 1):

    s' = 1 / (1 − q)
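A sketch of fading with assumed toy counts; the fading factor is derived from a desired effective sample size via the relation q = (s' − 1)/s' given below:

```python
s_target = 10.0
q = (s_target - 1.0) / s_target  # fading factor for effective sample size 10
n = [4.0, 3.0, 3.0]              # counts for a_1, a_2, a_3, assumed
s = 10.0                         # current sample size

def fade_and_update(n, s, observed):
    """Fade all counts by q, then count the new case for the observed state."""
    n = [ni * q for ni in n]
    n[observed] += 1.0
    return n, s * q + 1.0

n, s = fade_and_update(n, s, 0)  # new case selects a_1
print(s)                         # stays at s' = 1/(1 - q) = 10.0
print([ni / s for ni in n])      # adapted P(A | b_i, c_j)
```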
Adaptation: Fading II

The desired effective sample size s' can be used to determine the fading factor:

    q' = (s' − 1) / s'

The bigger the sample size, the more resistant the network is to change.
The sample size can be different for each node in the network.
Adaptation: alternative models
Main idea: if we are not sure whether our network's structure is correct, use several structures.
Adjusting the probabilities helps, but might not be enough to solve structural problems.
Solutions:
- gather data and learn a new structure
- use probabilities calculated over several weighted structures, recalculating the weights for each new piece of evidence:

    P(A | e) = ∑_k w_k · P(A | M_k, e),    w_k := P(M_k | e) = P(e | M_k) · w_k / ∑_j P(e | M_j) · w_j
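A sketch of this weighting scheme, with assumed numbers for two candidate structures M1 and M2:

```python
weights = {"M1": 0.5, "M2": 0.5}      # current weights w_k
likelihood_e = {"M1": 0.8, "M2": 0.2}  # P(e | M_k), assumed numbers
p_a_given_e = {"M1": 0.7, "M2": 0.4}   # P(A=a | M_k, e), assumed numbers

# Recalculate the weights: w_k := P(e | M_k) w_k / sum_j P(e | M_j) w_j
norm = sum(likelihood_e[m] * weights[m] for m in weights)
weights = {m: likelihood_e[m] * weights[m] / norm for m in weights}

# Mixture prediction: P(A=a | e) = sum_k w_k * P(A=a | M_k, e)
print(sum(weights[m] * p_a_given_e[m] for m in weights))  # 0.64
```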
Tuning

We know what we want the probabilities of certain variables to be, and we want to tune an existing network to match them.
We have:
- a Bayesian network BN
- evidence e
- the existing probabilities x = P(A | e) = (x_1, ..., x_n)
- the probabilities we want to have, y = (y_1, ..., y_n)
- parameters we can adjust, t = (t_1, ..., t_m), with an initial set of values t_0
To do: change the values of t until x is close enough to y.
Tuning II

To measure the difference, use the (squared) Euclidean distance:

    dist(x, y) = ∑_{i=1}^{n} (x_i − y_i)²

Method of gradient descent:
- calculate grad dist(x, y) with respect to the parameters t
- give t_0 a displacement Δt in the direction opposite to the gradient grad dist(x, y)(t_0); that is, choose a step size α > 0 and let Δt = −α · grad dist(x, y)(t_0)
- iterate this procedure until the gradient is close to 0
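A minimal sketch of the tuning loop; the quadratic stand-in for x(t) and the numerical gradient are assumptions for illustration, since computing the real gradient requires propagation in the network:

```python
def x_of_t(t):
    """Placeholder for computing x = P(A | e) from the parameters t."""
    return [t[0], 1.0 - t[0]]

def dist(x, y):
    """Squared Euclidean distance between x and the target y."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def grad(f, t, h=1e-6):
    """Numerical gradient of f at t (forward differences)."""
    g = []
    for i in range(len(t)):
        tp = list(t); tp[i] += h
        g.append((f(tp) - f(t)) / h)
    return g

y = [0.8, 0.2]   # target probabilities
t = [0.5]        # initial parameter values t_0
alpha = 0.1      # step size
for _ in range(1000):
    g = grad(lambda tt: dist(x_of_t(tt), y), t)
    if all(abs(gi) < 1e-8 for gi in g):
        break                                      # gradient close to 0
    t = [ti - alpha * gi for ti, gi in zip(t, g)]  # Δt = -α · grad

print(t, x_of_t(t))  # t ≈ [0.8], so x ≈ y
```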
Thank you for listening