School of Computer Science
Probabilistic Graphical Models
Learning Partially Observed GM: the Expectation-Maximization
algorithm
Eric Xing, Lecture 8, February 9, 2015
Reading: MJ Chap 9 and 11
© Eric Xing @ CMU, 2005-2015
Recall: Learning Graphical Models
Scenarios:
- completely observed GMs (directed, undirected)
- partially or unobserved GMs (directed, undirected; an open research topic)
Estimation principles:
- Maximum likelihood estimation (MLE)
- Bayesian estimation
- Maximal conditional likelihood
- Maximal "margin"
- Maximum entropy
We use learning as a name for the process of estimating the parameters, and in some cases, the topology of the network, from data.
Recall: Approaches to Inference
Exact inference algorithms:
- The elimination algorithm
- Message-passing algorithms (sum-product, belief propagation)
- The junction tree algorithms
Approximate inference techniques:
- Stochastic simulation / sampling methods
- Markov chain Monte Carlo methods
- Variational algorithms
Partially observed GMs: Speech recognition
[Figure: hidden Markov model for speech, a chain of hidden states X1, X2, X3, ..., XT (with transition matrix A) emitting observations Y1, Y2, Y3, ..., YT]
Partially observed GMs: Biological evolution
[Figure: evolutionary tree with observed nucleotides (A, G, A, G, A, C) at the leaves and unobserved ancestral states at the internal nodes]
Mixture Models
Mixture Models, cont'd
A density model p(x) may be multi-modal. We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians). Each mode may correspond to a different sub-population (e.g., male and female).
Unobserved Variables
A variable can be unobserved (latent) because:
- it is an imaginary quantity meant to provide a simplified and abstract view of the data-generation process (e.g., speech recognition models, mixture models, ...)
- it is a real-world object and/or phenomenon, but is difficult or impossible to measure (e.g., the temperature of a star, causes of a disease, evolutionary ancestors, ...)
- it is a real-world object and/or phenomenon, but sometimes was not measured, because of faulty sensors, etc.
Discrete latent variables can be used to partition/cluster data into sub-groups.
Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc.).
Gaussian Mixture Models (GMMs)
Consider a mixture of K Gaussian components:
$$p(x_n \mid \mu, \Sigma) = \sum_k \pi_k\, N(x_n \mid \mu_k, \Sigma_k)$$
where $\pi_k$ is the mixture proportion and $N(x_n \mid \mu_k, \Sigma_k)$ is the mixture component.
This model can be used for unsupervised clustering. This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.
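As an illustration (not from the slides), here is a minimal NumPy/SciPy sketch of evaluating the mixture density $p(x_n) = \sum_k \pi_k N(x_n \mid \mu_k, \Sigma_k)$; the function name and the parameter values are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, sigmas):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) at a single point x."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=sigma)
               for pi, mu, sigma in zip(pis, mus, sigmas))

# Hypothetical 2-component, 2-D example
pis    = [0.6, 0.4]
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), pis, mus, sigmas))
```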
Gaussian Mixture Models (GMMs)
Consider a mixture of K Gaussian components.
Z is a latent class-indicator vector:
$$p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}$$
X is a conditional Gaussian variable with a class-specific mean/covariance:
$$p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\Big\{ -\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \Big\}$$
The likelihood of a sample:
$$p(x_n \mid \mu, \Sigma) = \sum_k p(z^k = 1 \mid \pi)\, p(x_n \mid z^k = 1, \mu, \Sigma)
= \sum_{z_n} \prod_k \big(\pi_k\, N(x_n : \mu_k, \Sigma_k)\big)^{z_n^k}
= \sum_k \pi_k\, N(x_n \mid \mu_k, \Sigma_k)$$
where $\pi_k$ is the mixture proportion and $N(x_n \mid \mu_k, \Sigma_k)$ is the mixture component.
[Figure: graphical model with latent Z pointing to observed X]
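To make the generative story concrete, here is a small sketch (my own illustration, with made-up parameters) that first draws $z_n$ from the multinomial prior and then draws $x_n$ from the corresponding class-conditional Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
pis    = np.array([0.6, 0.4])                          # mixture proportions pi_k
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # class means mu_k
sigmas = [np.eye(2), 0.5 * np.eye(2)]                  # class covariances Sigma_k

def sample_gmm(n):
    """Draw n samples: z ~ multi(pi), then x | z ~ N(mu_z, Sigma_z)."""
    zs = rng.choice(len(pis), size=n, p=pis)
    xs = np.stack([rng.multivariate_normal(mus[z], sigmas[z]) for z in zs])
    return zs, xs

zs, xs = sample_gmm(5)
```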
Why is Learning Harder?
In fully observed i.i.d. settings, the log-likelihood decomposes into a sum of local terms (at least for directed models):
$$\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$$
With latent variables, all the parameters become coupled together via marginalization:
$$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$$
Toward the EM algorithm
Recall MLE for completely observed data.
[Figure: plate model, z_i pointing to x_i, repeated N times]
Data log-likelihood:
$$\begin{aligned}
\ell(\theta; D) &= \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma) \\
&= \sum_n \log \prod_k \pi_k^{z_n^k} + \sum_n \log \prod_k N(x_n; \mu_k, \sigma)^{z_n^k} \\
&= \sum_n \sum_k z_n^k \log \pi_k - \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C
\end{aligned}$$
MLE:
$$\hat\pi_{k,\mathrm{MLE}} = \arg\max_{\pi} \ell(\theta; D), \qquad
\hat\mu_{k,\mathrm{MLE}} = \arg\max_{\mu} \ell(\theta; D), \qquad
\hat\sigma_{k,\mathrm{MLE}} = \arg\max_{\sigma} \ell(\theta; D)$$
For example,
$$\hat\mu_{k,\mathrm{MLE}} = \frac{\sum_n z_n^k\, x_n}{\sum_n z_n^k}$$
What if we do not know $z_n$?
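As a quick aside (not part of the slides), when the labels $z_n$ are observed the complete-data MLE above reduces to per-class proportions and per-class averages; a minimal sketch with hypothetical data:

```python
import numpy as np

def complete_data_mle(X, z, K):
    """MLE for a mixture when the class labels z_n are observed:
    pi_k = fraction of points in class k; mu_k = mean of points in class k."""
    N = len(X)
    pis = np.array([(z == k).sum() / N for k in range(K)])
    mus = np.array([X[z == k].mean(axis=0) for k in range(K)])
    return pis, mus

X = np.array([[0.1], [0.2], [2.9], [3.1]])
z = np.array([0, 0, 1, 1])
print(complete_data_mle(X, z, K=2))
```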
Question “ … We solve problem X using Expectation-Maximization …”
What does it mean?
E: What do we take the expectation with? What do we take the expectation over?
M: What do we maximize? What do we maximize with respect to?
Recall: K-means
Cluster assignment (hard "E-step"):
$$z_n^{(t)} = \arg\min_k \,(x_n - \mu_k^{(t)})^T \Sigma_k^{-1(t)} (x_n - \mu_k^{(t)})$$
Centroid update ("M-step"):
$$\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}$$
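A minimal K-means sketch (my own illustration) of the two updates above, using plain Euclidean distance (i.e., taking $\Sigma_k = I$); the function name is hypothetical:

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Alternate hard assignments z_n and centroid updates mu_k."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), K, replace=False)]            # initialize centroids
    for _ in range(iters):
        d = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)  # squared distances
        z = d.argmin(axis=1)                                   # assignment step
        mus = np.stack([X[z == k].mean(axis=0) if (z == k).any() else mus[k]
                        for k in range(K)])                    # update step
    return z, mus
```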
Expectation-Maximization
Start: "guess" the centroid $\mu_k$ and covariance $\Sigma_k$ of each of the K clusters.
Loop: alternate the E-step and M-step described below until convergence.
Example: Gaussian mixture model
A mixture of K Gaussians:
[Figure: plate model, latent Z_n pointing to observed X_n, repeated N times]
Z is a latent class-indicator vector:
$$p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}$$
X is a conditional Gaussian variable with a class-specific mean/covariance:
$$p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\Big\{ -\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1}(x_n - \mu_k) \Big\}$$
The likelihood of a sample:
$$p(x_n \mid \mu, \Sigma) = \sum_k p(z^k = 1 \mid \pi)\, p(x_n \mid z^k = 1, \mu, \Sigma)
= \sum_{z_n} \prod_k \big(\pi_k\, N(x_n : \mu_k, \Sigma_k)\big)^{z_n^k}
= \sum_k \pi_k\, N(x_n \mid \mu_k, \Sigma_k)$$
The expected complete log-likelihood:
$$\begin{aligned}
\langle \ell_c(\theta; x, z) \rangle &= \sum_n \big\langle \log p(z_n \mid \pi) \big\rangle_{p(z \mid x)} + \sum_n \big\langle \log p(x_n \mid z_n, \mu, \Sigma) \big\rangle_{p(z \mid x)} \\
&= \sum_n \sum_k \langle z_n^k \rangle \log \pi_k
 - \tfrac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \Big( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| + C \Big)
\end{aligned}$$
E-step
We maximize $\langle \ell_c(\theta) \rangle$ iteratively using the following procedure.
Expectation step: compute the expected value of the sufficient statistics of the hidden variables (i.e., z), given the current estimate of the parameters (i.e., $\pi$ and $\mu$):
$$\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)})
= \frac{\pi_k^{(t)}\, N(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)}\, N(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}$$
Here we are essentially doing inference.
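A small NumPy sketch (illustrative only) of this E-step: computing the responsibilities $\tau_n^{k}$ for every data point under the current parameters, assuming the same GMM parameterization as above; the function name is my own:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, sigmas):
    """Return tau[n, k] = p(z_n^k = 1 | x_n, theta^(t))."""
    N, K = len(X), len(pis)
    tau = np.zeros((N, K))
    for k in range(K):
        tau[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
    tau /= tau.sum(axis=1, keepdims=True)   # normalize over components
    return tau
```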
M-step
We maximize $\langle \ell_c(\theta) \rangle$ iteratively using the following procedure.
Maximization step: compute the parameters under the current results for the expected values of the hidden variables:
$$\pi_k^* = \arg\max_{\pi} \langle \ell_c(\theta) \rangle \;\; \text{s.t. } \sum_k \pi_k = 1
\quad\Rightarrow\quad \pi_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)}}{N} = \frac{\langle n_k \rangle}{N}$$
$$\mu_k^* = \arg\max_{\mu} \langle \ell_c(\theta) \rangle
\quad\Rightarrow\quad \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)}\, x_n}{\sum_n \tau_n^{k(t)}}$$
$$\Sigma_k^* = \arg\max_{\Sigma} \langle \ell_c(\theta) \rangle
\quad\Rightarrow\quad \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)}\, (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}$$
(Fact used in the derivation: $\partial \log|A| / \partial A = A^{-T}$ and $\partial\,(x^T A x) / \partial A = x x^T$.)
This is isomorphic to MLE, except that the variables that are hidden are replaced by their expectations (in general they will be replaced by their corresponding "sufficient statistics").
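A matching M-step sketch (again my own illustration, not the course code), consuming the responsibilities $\tau$ produced by the E-step sketch above:

```python
import numpy as np

def m_step(X, tau):
    """Re-estimate (pi, mu, Sigma) from responsibilities tau[n, k]."""
    N, K = tau.shape
    Nk = tau.sum(axis=0)                         # expected counts <n_k>
    pis = Nk / N
    mus = (tau.T @ X) / Nk[:, None]
    sigmas = []
    for k in range(K):
        d = X - mus[k]
        sigmas.append((tau[:, k, None] * d).T @ d / Nk[k])
    return pis, mus, sigmas
```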
Compare: K-means and EM
K-means:
In the K-means "E-step" we do hard assignment:
$$z_n^{(t)} = \arg\min_k \,(x_n - \mu_k^{(t)})^T \Sigma_k^{-1(t)} (x_n - \mu_k^{(t)})$$
In the K-means "M-step" we update the means as the weighted sum of the data, but now the weights are 0 or 1:
$$\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}$$
EM:
E-step:
$$\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)})
= \frac{\pi_k^{(t)}\, N(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)}\, N(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}$$
M-step:
$$\mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)}\, x_n}{\sum_n \tau_n^{k(t)}}$$
The EM algorithm for mixtures of Gaussians is like a "soft version" of the K-means algorithm.
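Putting the two sketches together (illustrative only), a full EM loop for a GMM that alternates the soft E-step and the M-step sketched above; the initialization choices here are arbitrary:

```python
import numpy as np

def em_gmm(X, K, iters=50, seed=0):
    """Fit a K-component GMM by alternating the e_step and m_step sketches above."""
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)                       # uniform initial mixing proportions
    mus = X[rng.choice(len(X), K, replace=False)]   # initialize means at random data points
    sigmas = [np.cov(X.T) + 1e-6 * np.eye(X.shape[1]) for _ in range(K)]
    for _ in range(iters):
        tau = e_step(X, pis, mus, sigmas)           # E-step: soft responsibilities
        pis, mus, sigmas = m_step(X, tau)           # M-step: re-estimate parameters
    return pis, mus, sigmas
```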
Theory underlying EM
What are we doing?
Recall that according to MLE, we intend to learn the model parameters that would maximize the likelihood of the data. But we do not observe z, so computing
$$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$$
is difficult!
What shall we do?
Complete & Incomplete Log-Likelihoods
Complete log-likelihood:
Let X denote the observable variable(s), and Z denote the latent variable(s). If Z could be observed, then
$$\ell_c(\theta; x, z) \;\stackrel{\text{def}}{=}\; \log p(x, z \mid \theta)$$
Usually, optimizing $\ell_c(\theta)$ given both z and x is straightforward (c.f. MLE for fully observed models). Recall that in this case the objective for, e.g., MLE, decomposes into a sum of factors, so the parameters for each factor can be estimated separately. But given that Z is not observed, $\ell_c(\theta)$ is a random quantity and cannot be maximized directly.
Incomplete log-likelihood:
With z unobserved, our objective becomes the log of a marginal probability:
$$\ell(\theta; x) \;\stackrel{\text{def}}{=}\; \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$$
This objective won't decouple.
Expected Complete Log-Likelihood
For any distribution q(z), define the expected complete log-likelihood:
$$\langle \ell_c(\theta; x, z) \rangle_q \;\stackrel{\text{def}}{=}\; \sum_z q(z \mid x, \theta)\, \log p(x, z \mid \theta)$$
This is a deterministic function of $\theta$; it is linear in $\ell_c(\theta)$ and so inherits its factorizability. Does maximizing this surrogate yield a maximizer of the likelihood?
Jensen's inequality:
$$\ell(\theta; x) = \log p(x \mid \theta)
= \log \sum_z p(x, z \mid \theta)
= \log \sum_z q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)}
\;\ge\; \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)}$$
That is,
$$\ell(\theta; x) \;\ge\; \langle \ell_c(\theta; x, z) \rangle_q + H_q$$
Lower Bounds and Free Energy
For fixed data x, define a functional called the free energy:
$$F(q, \theta) \;\stackrel{\text{def}}{=}\; \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\le\; \ell(\theta; x)$$
The EM algorithm is coordinate ascent on F:
E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
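As a sanity-check sketch (illustrative, reusing the GMM setting above), the free energy can be evaluated as $\sum_n \sum_k q_{nk} \log \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{q_{nk}}$ and, for any q, should not exceed the incomplete log-likelihood; the function names are my own:

```python
import numpy as np
from scipy.stats import multivariate_normal

def free_energy(X, q, pis, mus, sigmas):
    """F(q, theta) = sum_n sum_k q_nk * log( pi_k N(x_n|mu_k,Sigma_k) / q_nk )."""
    K = len(pis)
    logjoint = np.stack([np.log(pis[k]) +
                         multivariate_normal.logpdf(X, mean=mus[k], cov=sigmas[k])
                         for k in range(K)], axis=1)
    return np.sum(q * (logjoint - np.log(q + 1e-300)))

def log_likelihood(X, pis, mus, sigmas):
    """Incomplete log-likelihood: sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    K = len(pis)
    dens = sum(pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
               for k in range(K))
    return np.log(dens).sum()

# For any q: free_energy(...) <= log_likelihood(...); equality holds at the E-step optimum.
```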
E-step: maximization of the expected $\ell_c$ w.r.t. q
Claim:
$$q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)$$
This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g., to perform classification).
Proof (easy): this setting attains the bound $\ell(\theta; x) \ge F(q, \theta)$:
$$F\big(p(z \mid x, \theta^t), \theta^t\big)
= \sum_z p(z \mid x, \theta^t)\, \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)}
= \sum_z p(z \mid x, \theta^t)\, \log p(x \mid \theta^t)
= \log p(x \mid \theta^t) = \ell(\theta^t; x)$$
One can also show this result using variational calculus, or via the fact that
$$\ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big(q \,\|\, p(z \mid x, \theta)\big)$$
E-step: plug in the posterior expectation of the latent variables
Without loss of generality, assume that $p(x, z \mid \theta)$ is a generalized exponential family distribution:
$$p(x, z \mid \theta) = \frac{1}{Z(\theta)}\, h(x, z)\, \exp\Big\{ \sum_i \theta_i f_i(x, z) \Big\}$$
Special case: if $p(X \mid Z)$ are GLIMs, then $f_i(x, z) = \eta_i^T(z)\, \xi_i(x)$.
The expected complete log-likelihood under $q^{t+1} = p(z \mid x, \theta^t)$ is
$$\langle \ell_c(\theta; x, z) \rangle_{q^{t+1}}
= \sum_z q(z \mid x, \theta^t)\, \log p(x, z \mid \theta)
= \sum_i \theta_i \big\langle f_i(x, z) \big\rangle_{q(z \mid x, \theta^t)} - A(\theta)$$
and in the GLIM case,
$$\overset{\mathrm{GLIM}}{=} \sum_i \theta_i \big\langle \eta_i(z) \big\rangle_{q(z \mid x, \theta^t)}\, \xi_i(x) - A(\theta)$$
M-step: maximization of the expected $\ell_c$ w.r.t. $\theta$
Note that the free energy breaks into two terms:
$$F(q, \theta)
= \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
= \sum_z q(z \mid x)\, \log p(x, z \mid \theta) - \sum_z q(z \mid x)\, \log q(z \mid x)
= \langle \ell_c(\theta; x, z) \rangle_q + H_q$$
The first term is the expected complete log-likelihood (energy), and the second term, which does not depend on $\theta$, is the entropy.
Thus, in the M-step, maximizing with respect to $\theta$ for fixed q, we only need to consider the first term:
$$\theta^{t+1} = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}}
= \arg\max_\theta \sum_z q(z \mid x, \theta^t)\, \log p(x, z \mid \theta)$$
Under the optimal $q^{t+1}$, this is equivalent to solving a standard MLE problem for the fully observed model $p(x, z \mid \theta)$, with the sufficient statistics involving z replaced by their expectations w.r.t. $p(z \mid x, \theta)$.
Example: HMM
Supervised learning: estimation when the "right answer" is known. Examples:
- GIVEN: a genomic region x = x1...x1,000,000 where we have good (experimental) annotations of the CpG islands
- GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
Unsupervised learning: estimation when the "right answer" is unknown. Examples:
- GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
- GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: Update the parameters $\theta$ of the model to maximize $P(x \mid \theta)$: maximum likelihood (ML) estimation
Hidden Markov Model: from static to dynamic mixture models
Static mixture: [Figure: single hidden variable X1 emitting observation Y1, plate over N samples]
Dynamic mixture: [Figure: chain of hidden states X1, X2, X3, ..., XT (transition matrix A), each emitting an observation Y1, Y2, Y3, ..., YT]
The sequence: speech signal, sequence of rolls, ...
The underlying source: phonemes, dice, ...
The Baum-Welch algorithm
The complete log-likelihood:
$$\ell_c(\theta; x, y) = \log p(x, y) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)$$
The expected complete log-likelihood:
$$\langle \ell_c(\theta; x, y) \rangle
= \sum_n \Big( \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid x_n)} \log \pi_i \Big)
+ \sum_n \sum_{t=2}^{T} \Big( \langle y_{n,t-1}^i\, y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid x_n)} \log a_{i,j} \Big)
+ \sum_n \sum_{t=1}^{T} \Big( x_{n,t}^k\, \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid x_n)} \log b_{i,k} \Big)$$
EM:
The E-step:
$$\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid x_n), \qquad
\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i\, y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1,\, y_{n,t}^j = 1 \mid x_n)$$
The M-step ("symbolically" identical to MLE):
$$\pi_i^{\mathrm{ML}} = \frac{\sum_n \gamma_{n,1}^i}{N}, \qquad
a_{ij}^{\mathrm{ML}} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}, \qquad
b_{ik}^{\mathrm{ML}} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$$
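For concreteness, a small sketch (my own illustration, not the course code) of the Baum-Welch M-step above for a discrete-emission HMM, assuming the posteriors gamma[n][t, i] and xi[n][t, i, j] have already been computed by the forward-backward algorithm in the E-step:

```python
import numpy as np

def baum_welch_m_step(X_onehot, gammas, xis):
    """M-step for a discrete-emission HMM.
    X_onehot[n]: (T, K) one-hot observations; gammas[n]: (T, I); xis[n]: (T-1, I, I)."""
    N = len(gammas)
    pi = sum(g[0] for g in gammas) / N                          # initial-state probabilities
    A_num = sum(x.sum(axis=0) for x in xis)                     # expected transition counts
    A = A_num / sum(g[:-1].sum(axis=0) for g in gammas)[:, None]
    B_num = sum(g.T @ x for g, x in zip(gammas, X_onehot))      # expected emission counts
    B = B_num / sum(g.sum(axis=0) for g in gammas)[:, None]
    return pi, A, B
```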
Unsupervised ML estimation
Given x = x1...xN for which the true state path y = y1...yN is unknown:
EXPECTATION-MAXIMIZATION
0. Start with our best guess of a model M and parameters $\theta$.
1. Estimate $A_{ij}$, $B_{ik}$ in the training data. How? $A_{ij} = \sum_{n,t} \langle y_{n,t-1}^i\, y_{n,t}^j \rangle$, $B_{ik} = \sum_{n,t} \langle y_{n,t}^i \rangle\, x_{n,t}^k$.
2. Update $\theta$ according to $A_{ij}$, $B_{ik}$; now a "supervised learning" problem.
3. Repeat 1 & 2 until convergence.
This is called the Baum-Welch algorithm. We can get to a provably more (or equally) likely parameter set with each iteration.
EM for general BNs
while not converged
  % E-step
  for each node i
    ESSi = 0  % reset expected sufficient statistics
  for each data sample n
    do inference with Xn,H
    for each node i
      ESSi += < SSi(Xn,i, Xn,pa(i)) > evaluated under p(Xn,H | Xn,-H)
  % M-step
  for each node i
    theta_i := MLE(ESSi)
Summary: EM Algorithm
EM is a way of maximizing the likelihood function for latent variable models. It finds the MLE of the parameters when the original (hard) problem can be broken up into two (easy) pieces:
1. Estimate some "missing" or "unobserved" data from the observed data and current parameters.
2. Using this "complete" data, find the maximum likelihood parameter estimates.
Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:
E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.
Conditional mixture model: Mixture of experts
We will model P(Y | X) using different experts, each responsible for a different region of the input space. A latent variable Z chooses the expert using a softmax gating function:
$$P(z^k = 1 \mid x) = \mathrm{softmax}(\xi_k^T x)$$
Each expert can be a linear regression model:
$$P(y \mid x, z^k = 1) = N(y;\, \theta_k^T x,\, \sigma_k^2)$$
The posterior expert responsibilities are
$$P(z^k = 1 \mid x, y) = \frac{p(z^k = 1 \mid x)\, p_{\theta_k}(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1 \mid x)\, p_{\theta_j}(y \mid x, \theta_j, \sigma_j^2)}$$
EM for conditional mixture model
Model:
$$P(y \mid x) = \sum_k p(z^k = 1 \mid x, \xi)\, p(y \mid z^k = 1, x, \theta_k, \sigma_k)$$
The objective function (expected complete log-likelihood):
$$\begin{aligned}
\langle \ell_c(\theta; x, y, z) \rangle
&= \sum_n \big\langle \log p(z_n \mid x_n, \xi) \big\rangle_{p(z \mid x, y)}
 + \sum_n \big\langle \log p(y_n \mid x_n, z_n, \theta, \sigma) \big\rangle_{p(z \mid x, y)} \\
&= \sum_n \sum_k \langle z_n^k \rangle \log \mathrm{softmax}(\xi_k^T x_n)
 - \sum_n \sum_k \langle z_n^k \rangle \frac{1}{2\sigma_k^2}\,(y_n - \theta_k^T x_n)^2 + C
\end{aligned}$$
EM:
E-step:
$$\tau_n^{k(t)} = P(z_n^k = 1 \mid x_n, y_n)
= \frac{p(z^k = 1 \mid x_n)\, p_{\theta_k}(y_n \mid x_n, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1 \mid x_n)\, p_{\theta_j}(y_n \mid x_n, \theta_j, \sigma_j^2)}$$
M-step: use the normal equation for standard linear regression, $\theta = (X^T X)^{-1} X^T Y$, but with the data re-weighted by $\tau$ (homework); use IRLS and/or weighted IRLS to update $\{\xi_k, \theta_k, \sigma_k\}$ based on the data pairs $(x_n, y_n)$ with weights $\tau_n^{k(t)}$ (homework?).
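As an illustration (not from the slides), the re-weighted normal equation mentioned in the M-step can be written as a weighted least-squares solve for one expert; the ridge term is my own addition for numerical stability:

```python
import numpy as np

def weighted_ls(X, y, tau_k, ridge=1e-8):
    """Solve theta_k = argmin sum_n tau_n^k (y_n - theta^T x_n)^2 via the
    weighted normal equation (X^T W X)^{-1} X^T W y with W = diag(tau^k)."""
    W = np.diag(tau_k)
    A = X.T @ W @ X + ridge * np.eye(X.shape[1])   # small ridge for stability (my addition)
    return np.linalg.solve(A, X.T @ W @ y)
```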
Hierarchical mixture of experts
This is like a soft version of a depth-2 classification/regression tree. P(Y |X,G1,G2) can be modeled as a GLIM, with parameters
dependent on the values of G1 and G2 (which specify a "conditional path" to a given leaf in the tree).
Mixture of overlapping experts
By removing the X to Z arc, we can make the partitions independent of the input, thus allowing overlap. This is a mixture of linear regressors; each subpopulation has a different conditional mean:
$$P(z^k = 1 \mid x, y) = \frac{p(z^k = 1)\, p_{\theta_k}(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1)\, p_{\theta_j}(y \mid x, \theta_j, \sigma_j^2)}$$
Partially Hidden Data
Of course, we can also learn when there are missing (hidden) variables on some cases and not on others. In this case the cost function is
$$\ell_c(\theta; D) = \sum_{n \in \text{Complete}} \log p(x_n, y_n \mid \theta) + \sum_{m \in \text{Missing}} \log \sum_{y_m} p(x_m, y_m \mid \theta)$$
Note that the $y_m$ do not have to be the same in each case: the data can have different missing values in each different sample.
Now you can think of this in a new way: in the E-step we estimate the hidden variables on the incomplete cases only. The M-step optimizes the log-likelihood on the complete data plus the expected likelihood on the incomplete data using the E-step.
EM Variants
Sparse EM: do not re-compute exactly the posterior probability on each data point under all models, because it is almost zero. Instead, keep an "active list" which you update every once in a while.
Generalized (Incomplete) EM: it might be hard to find the ML parameters in the M-step, even given the completed data. We can still make progress by doing an M-step that improves the likelihood a bit (e.g., a gradient step). Recall the IRLS step in the mixture of experts model.
A Report Card for EM
Some good things about EM:
- no learning-rate (step-size) parameter
- automatically enforces parameter constraints
- very fast for low dimensions
- each iteration guaranteed to improve the likelihood
Some bad things about EM:
- can get stuck in local optima
- can be slower than conjugate gradient (especially near convergence)
- requires an expensive inference step
- is a maximum likelihood/MAP method