Finite Mixture Models and Expectation Maximization
Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin
Recall: The Supervised Learning Problem
• Given a set of n samples X = {(x_i, y_i)}, i = 1,…,n (Chapter 3 of DHS)
• Assume examples in each class come from a parameterized Gaussian density
• Estimate the parameters (mean, variance) of the Gaussian density for each class, and use them for classification
• Estimation uses the Maximum Likelihood approach.
Review of Maximum Likelihood
• Given n i.i.d. examples from a density p(x; θ), with known form p and unknown parameter θ.
• Goal: estimate θ, denoted by θ̂, such that the observed data is most likely to be from the distribution with that θ.
• Steps involved:
  • Write the likelihood of the observed data.
  • Maximize the likelihood with respect to the parameter.
Example: 1D Gaussian Distribution
[Figure: Maximum likelihood estimation of the mean of a Gaussian distribution — true density vs. MLE from 10 samples and MLE from 100 samples.]
Example: 2D Gaussian Distribution
[Figure: 2D parameter estimation for a Gaussian distribution — blue: true density; red: density estimated from 50 examples.]
Multimodal Class Distributions
• A single Gaussian may not accurately model the classes.
• Find subclasses in handwritten “online” characters (122,000 characters written by 100 writers)
• Performance improves by modeling subclasses
Connell and Jain, “Writer Adaptation for Online Handwriting Recognition”, IEEE PAMI, Mar 2002
Multimodal Classes
A single Gaussian distribution may not model the classes accurately.
Handwritten ‘f’ vs ‘y’ classification task.
An extreme example of multimodal classes
Red vs. Blue classification. The classes are well separated. However, incorrect model assumptions result in high classification error.
[Figure: scatter plot of the red and blue classes in the (x, y) plane.]
Limitations of Unimodal Class Modeling
The ‘red’ class is a mixture of two Gaussian distributions.
There is no class label information when modeling the density of just the red class.
Finite mixtures
k random sources, with probability density functions f_i(x), i = 1,…,k
X: random variable
[Diagram: one of the k sources f_1(x), f_2(x), …, f_i(x), …, f_k(x) is chosen at random, and X is drawn from the chosen density.]
Example: 3 species (Iris)
Finite mixtures
Choose at random, with Prob(source i) = α_i:
f(x) = Σ_{i=1}^{k} α_i f_i(x)
Finite mixtures
X: random variable generated by one of the sources f_1(x), f_2(x), …, f_k(x)
Conditional: f(x | source i) = f_i(x)
Joint: f(x and source i) = f_i(x) α_i
Unconditional: f(x) = Σ_{all sources i} f(x and source i)
Finite mixtures
f(x) = Σ_{i=1}^{k} α_i f_i(x), where the f_i(x) are the component densities.
Parameterized components (e.g., Gaussian): f_i(x) = f(x | θ_i), so that
f(x | Θ) = Σ_{i=1}^{k} α_i f(x | θ_i)
Θ = {θ_1, θ_2, …, θ_k, α_1, α_2, …, α_k}
Mixing probabilities: Σ_{i=1}^{k} α_i = 1 and α_i ≥ 0
Gaussian mixtures
f(x | θ_i) is Gaussian:
Arbitrary covariances: f(x | θ_i) = N(x | µ_i, C_i), with Θ = {µ_1, …, µ_k, C_1, …, C_k, α_1, …, α_{k-1}}
Common covariance: f(x | θ_i) = N(x | µ_i, C), with Θ = {µ_1, …, µ_k, C, α_1, …, α_{k-1}}
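As an aside (not on the original slide), a quick way to see the difference between the two parameterizations is to count free parameters; a minimal sketch in Python:

# A quick count (illustration, not from the slides) of free parameters in a
# k-component, d-dimensional Gaussian mixture.
def n_params(k, d, common_cov=False):
    means = k * d                      # k mean vectors
    cov_one = d * (d + 1) // 2         # one symmetric d x d covariance matrix
    covs = cov_one if common_cov else k * cov_one
    weights = k - 1                    # mixing probabilities sum to 1
    return means + covs + weights

# e.g., k = 3, d = 2: arbitrary covariances -> 17 parameters, common covariance -> 11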
Mixture fitting / estimation
Data: n independent observations, x = {x^(1), x^(2), …, x^(n)}
Goals: estimate the parameter set Θ; maybe also “classify the observations”
Example (Iris species):
- How many species? Mean of each species?
- Which points belong to each species?
[Figures: observed data; classified data (classes unknown to the algorithm).]
Gaussian mixtures (d = 1), an example
µ_1 = -2, σ_1 = 3, α_1 = 0.6
µ_2 = 4, σ_2 = 2, α_2 = 0.3
µ_3 = 7, σ_3 = 0.1, α_3 = 0.1
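The two-stage generative process above (pick a source with probability α_i, then draw from it) is easy to simulate; a minimal sketch using this example's parameters (assuming the σ_i are standard deviations):

# A minimal sketch of the two-stage generative process for the 1-D example above
# (assuming the sigma values are standard deviations): pick a component with
# probabilities alpha, then draw x from the chosen Gaussian.
import numpy as np

rng = np.random.default_rng(0)
alphas = np.array([0.6, 0.3, 0.1])
mus = np.array([-2.0, 4.0, 7.0])
sigmas = np.array([3.0, 2.0, 0.1])

def sample_mixture(n):
    # choose a component index for each sample, then sample from that Gaussian
    comps = rng.choice(len(alphas), size=n, p=alphas)
    return rng.normal(mus[comps], sigmas[comps])

x = sample_mixture(1000)   # 1000 draws from the mixture density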
Gaussian mixtures, an R² example
k = 3
µ_1 = (-4, 4)ᵀ, C_1 = [[2, 0], [0, 8]]
µ_2 = (3, 3)ᵀ, C_2 = [[2, -1], [-1, 2]]
µ_3 = (0, -4)ᵀ, C_3 = [[1, 0], [0, 1]]
(1500 points)
Uses of mixtures in pattern recognition
Unsupervised learning (model-based clustering):
- each component models one cluster
- clustering = mixture fitting
Observations: unclassified points
Goals: find the classes, classify the points
Uses of mixtures in pattern recognition
Mixtures can approximate arbitrary densities, so they are good for representing class-conditional densities in supervised learning.
Example:
- two strongly non-Gaussian classes
- use a mixture to model each class-conditional density
Uses of mixtures in pattern recognition
• Find subclasses (lexemes), e.g., in “online” handwritten characters (122,000 characters written by 100 writers)
• Performance improves by modeling subclasses
Connell and Jain, “Writer Adaptation for Online Handwriting Recognition”, IEEE PAMI, 2002
Fitting mixtures
n independent observations: x = {x^(1), x^(2), …, x^(n)}
Maximum (log-)likelihood (ML) estimate of Θ:
Θ̂ = arg max_Θ L(x, Θ)
L(x, Θ) = log Π_{j=1}^{n} f(x^(j) | Θ) = Σ_{j=1}^{n} log Σ_{i=1}^{k} α_i f(x^(j) | θ_i)
The ML estimate has no closed-form solution.
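As an illustration (not from the slides), the log-likelihood above can be evaluated numerically for any candidate Θ; a minimal sketch assuming Gaussian components:

# A minimal sketch: numerically evaluating the mixture log-likelihood
# L(x, Theta) = sum_j log sum_i alpha_i N(x_j | mu_i, C_i).
import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, alphas, mus, covs):
    # X: (n, d) data; alphas: (k,); mus: list of k mean vectors; covs: list of k covariance matrices
    n, k = X.shape[0], len(alphas)
    dens = np.zeros((n, k))
    for i in range(k):
        dens[:, i] = alphas[i] * multivariate_normal.pdf(X, mean=mus[i], cov=covs[i])
    return np.sum(np.log(dens.sum(axis=1)))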
Gaussian mixtures: A peculiar type of ML
Θ = {µ_1, …, µ_k, C_1, …, C_k, α_1, …, α_{k-1}}
Maximum (log-)likelihood (ML) estimate of Θ:
Θ̂ = arg max_Θ L(x, Θ)
subject to: C_i positive definite, Σ_{i=1}^{k} α_i = 1 and α_i ≥ 0
Problem: the likelihood function is unbounded as det(C_i) → 0.
There is no global maximum. Unusual goal: a “good” local maximum.
A peculiar type of ML problem
Example: a 2-component Gaussian mixture
f(x | µ_1, µ_2, σ_1², α) = α (2πσ_1²)^(-1/2) exp(−(x − µ_1)²/(2σ_1²)) + (1 − α)(2π)^(-1/2) exp(−(x − µ_2)²/2)
Some data points: {x_1, x_2, …, x_n}. Set µ_1 = x_1. Then
L(x, Θ) = log[ α (2πσ_1²)^(-1/2) + (1 − α)(2π)^(-1/2) exp(−(x_1 − µ_2)²/2) ] + Σ_{j=2}^{n} log(…) → ∞ as σ_1² → 0
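A small numeric illustration of this degeneracy (not from the slides): place µ_1 on one data point and shrink σ_1; the log-likelihood grows without bound:

# A small numeric illustration of the likelihood singularity: fix mu1 at one
# data point and shrink sigma1 -> the log-likelihood keeps increasing.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50)     # any data set works for the illustration
alpha, mu2 = 0.5, 0.0
mu1 = x[0]                            # place the first component on one point

def log_lik(sigma1):
    comp1 = alpha * np.exp(-(x - mu1) ** 2 / (2 * sigma1 ** 2)) / np.sqrt(2 * np.pi * sigma1 ** 2)
    comp2 = (1 - alpha) * np.exp(-(x - mu2) ** 2 / 2) / np.sqrt(2 * np.pi)
    return np.sum(np.log(comp1 + comp2))

for s in [1.0, 0.1, 0.01, 0.001]:
    print(s, log_lik(s))              # the value grows as sigma1 shrinks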
Fitting mixtures: a missing data problem
The ML estimate has no closed-form solution. Standard alternative: the expectation-maximization (EM) algorithm.
Missing data problem:
Observed data: x = {x^(1), x^(2), …, x^(n)}
Missing data (the labels/“colors”): z = {z^(1), z^(2), …, z^(n)}
z^(j) = [z_1^(j), z_2^(j), …, z_k^(j)]ᵀ = [0 … 0 1 0 … 0]ᵀ
“1” at position i ⇔ x^(j) was generated by component i
Fitting mixtures: a missing data problem
Observed data: x = {x^(1), x^(2), …, x^(n)}
Missing data: z = {z^(1), z^(2), …, z^(n)}, where each z^(j) = [z_1^(j), …, z_k^(j)] has k−1 zeros and one “1”.
Complete log-likelihood function:
L_c(x, z, Θ) = Σ_{j=1}^{n} Σ_{i=1}^{k} z_i^(j) log[ α_i f_i(x^(j) | θ_i) ]
(each inner sum over i is log f(x^(j), z^(j) | Θ))
In the presence of both x and z, Θ would be easy to estimate, but z is missing.
The EM algorithm
Iterative procedure: Θ̂^(0), Θ̂^(1), …, Θ̂^(t), Θ̂^(t+1), …
The E-step: compute the expected value of L_c(x, z, Θ),
Q(Θ, Θ̂^(t)) ≡ E[ L_c(x, z, Θ) | x, Θ̂^(t) ]
The M-step: update the parameter estimates,
Θ̂^(t+1) = arg max_Θ Q(Θ, Θ̂^(t))
Under mild conditions: Θ̂^(t) → local maximum of L(x, Θ) as t → ∞
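A minimal sketch (illustration only) of the overall iteration, assuming hypothetical e_step and m_step helpers like the ones sketched after the next two slides:

# A minimal sketch of the EM iteration for a Gaussian mixture, assuming
# hypothetical e_step / m_step helpers (sketched after the next two slides).
import numpy as np

def fit_gmm(X, init_params, n_iter=100, tol=1e-6):
    params = init_params                     # Theta_hat^(0): (alphas, mus, covs)
    prev_ll = -np.inf
    for t in range(n_iter):
        W, ll = e_step(X, params)            # E-step: responsibilities + log-likelihood
        params = m_step(X, W)                # M-step: re-estimate alphas, mus, covs
        if ll - prev_ll < tol:               # likelihood never decreases; stop when it stalls
            break
        prev_ll = ll
    return params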
The EM algorithm: the Gaussian case
The E-step:
Q(Θ, Θ̂^(t)) ≡ E_z[ L_c(x, z, Θ) | x, Θ̂^(t) ] = L_c(x, E[z | x, Θ̂^(t)], Θ)
because L_c(x, z, Θ) is linear in z.
Since z_i^(j) is a binary variable,
E[ z_i^(j) | x, Θ̂^(t) ] = Pr{ z_i^(j) = 1 | x, Θ̂^(t) } ≡ w_i^(j,t) = α̂_i^(t) f(x^(j) | θ̂_i^(t)) / Σ_{m=1}^{k} α̂_m^(t) f(x^(j) | θ̂_m^(t))   (Bayes law)
w_i^(j,t) → estimate, at iteration t, of the probability that x^(j) was produced by component i: a “soft” probabilistic assignment.
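A minimal sketch of the E-step above for Gaussian components (an illustration, not the original authors' code):

# A minimal sketch of the E-step: responsibilities
# W[j, i] = alpha_i * N(x_j | mu_i, C_i) / sum_m alpha_m * N(x_j | mu_m, C_m).
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, params):
    alphas, mus, covs = params                            # current estimates Theta_hat^(t)
    n, k = X.shape[0], len(alphas)
    weighted = np.zeros((n, k))
    for i in range(k):
        weighted[:, i] = alphas[i] * multivariate_normal.pdf(X, mean=mus[i], cov=covs[i])
    W = weighted / weighted.sum(axis=1, keepdims=True)    # soft assignments (Bayes law)
    log_lik = np.sum(np.log(weighted.sum(axis=1)))        # current log-likelihood L(x, Theta)
    return W, log_lik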
The EM algorithm: the Gaussian case
Result of the E-step: w_i^(j,t), the estimate, at iteration t, of the probability that x^(j) was produced by component i.
The M-step:
α̂_i^(t+1) = (1/n) Σ_{j=1}^{n} w_i^(j,t)
µ̂_i^(t+1) = Σ_{j=1}^{n} w_i^(j,t) x^(j) / Σ_{j=1}^{n} w_i^(j,t)
Ĉ_i^(t+1) = Σ_{j=1}^{n} w_i^(j,t) (x^(j) − µ̂_i^(t+1))(x^(j) − µ̂_i^(t+1))ᵀ / Σ_{j=1}^{n} w_i^(j,t)
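A minimal sketch of the M-step updates above, using the responsibilities W from the E-step sketch (illustration only):

# A minimal sketch of the M-step: weighted re-estimation of the mixing
# probabilities, means, and covariances from the responsibilities W.
import numpy as np

def m_step(X, W):
    n, d = X.shape
    k = W.shape[1]
    Nk = W.sum(axis=0)                               # effective number of points per component
    alphas = Nk / n                                  # alpha_i^(t+1)
    mus = (W.T @ X) / Nk[:, None]                    # mu_i^(t+1): weighted means
    covs = []
    for i in range(k):
        diff = X - mus[i]                            # (n, d) data centered on component i
        covs.append((W[:, i, None] * diff).T @ diff / Nk[i])   # C_i^(t+1): weighted covariance
    return alphas, mus, covs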
Difficulties with EM
It is a local (greedy) algorithm: the likelihood never decreases.
It is initialization dependent.
[Figures: fitted mixtures after 74 iterations and after 270 iterations.]
Automatically deciding the number of components
Start with a large number of clusters.
Add a penalty term to the objective function, which increases with the number of clusters.
Modify the M-step to include a “killer criterion” which removes components satisfying a certain criterion.
Finally, choose the number of components resulting in the largest objective function value (likelihood − penalty).
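One common penalized-likelihood criterion of this kind is BIC; the sketch below is only illustrative and is not necessarily the exact criterion used on these slides (fit_gmm_for_k is a hypothetical helper):

# A minimal sketch of choosing k by a penalized likelihood (BIC-style):
# fit a mixture for each candidate k and keep the best penalized score.
import numpy as np

def select_k(X, fit_gmm_for_k, k_max=10):
    # fit_gmm_for_k(X, k) is a hypothetical helper returning (params, log_lik)
    n, d = X.shape
    best = None
    for k in range(1, k_max + 1):
        params, log_lik = fit_gmm_for_k(X, k)
        n_free = k * d + k * d * (d + 1) // 2 + (k - 1)   # means + covariances + weights
        score = log_lik - 0.5 * n_free * np.log(n)        # penalized objective (higher is better)
        if best is None or score > best[0]:
            best = (score, k, params)
    return best[1], best[2]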
Example
Same as in [Ueda and Nakano, 1998].
Example
k = 4, n = 1200, α_m = 1/4, C = I
Means at the four grid points (0, 0), (0, 4), (4, 0) and (4, 4)
k_max = 10
Example
Same as in [Ueda, Nakano, Ghahramani and Hinton, 2000]: an example with overlapping components.
Example
The Iris (4-dim.) data set: 3 components correctly identified.
Another supervised learning example
Problem: learn to classify textures from 19 Gabor features.
- Four classes (textures).
- Fit Gaussian mixtures to 800 randomly located feature vectors from each class/texture.
- Test on the remaining data.
Classifier               Error rate
Mixture-based            0.0074
Quadratic discriminant   0.0155
Linear discriminant      0.0185
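A minimal sketch (illustration only) of a mixture-based classifier of this kind: one Gaussian mixture per class, combined with Bayes' rule:

# A minimal sketch of a mixture-based classifier: fit one Gaussian mixture per
# class, then assign a test point to the class with the largest prior-weighted
# class-conditional mixture density.
import numpy as np
from scipy.stats import multivariate_normal

def class_log_density(x, gmm):
    # gmm = (alphas, mus, covs) for one class; returns log sum_i alpha_i N(x | mu_i, C_i)
    alphas, mus, covs = gmm
    dens = sum(a * multivariate_normal.pdf(x, mean=m, cov=c)
               for a, m, c in zip(alphas, mus, covs))
    return np.log(dens)

def classify(x, class_gmms, priors):
    # Bayes rule: pick argmax_c [ log p(c) + log f(x | class c) ]
    scores = [np.log(p) + class_log_density(x, g) for g, p in zip(class_gmms, priors)]
    return int(np.argmax(scores))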
[Figures: 2-D projection of the texture data and the fitted mixtures; resulting decision regions.]
Properties of EM
EM is extremely popular because of the following properties:
• Easy to implement
• Guarantees that the likelihood increases monotonically (why?)
• Guarantees convergence of the solution to a stationary point, i.e., a local maximum (why?)
Limitations of EM
• The resulting solution depends highly on the initialization
• Could be slow in several cases compared to direct optimization methods (e.g., iterative scaling)
EM as lower bound optimization
• Start with an initial guess {θ1^0, θ2^0}.
• Come up with a lower bound: l(θ1, θ2) ≥ l(θ1^0, θ2^0) + Q(θ1, θ2), where Q(θ1, θ2) is a concave function with touch point Q(θ1 = θ1^0, θ2 = θ2^0) = 0.
• Search for the solution {θ1^1, θ2^1} that maximizes Q(θ1, θ2).
• Repeat the procedure: construct a new bound at {θ1^1, θ2^1}, maximize it to obtain {θ1^2, θ2^2}, and so on.
• Converge to a local optimal point of l(θ1, θ2).
[Figure: successive concave lower bounds, each touching the log-likelihood l(θ1, θ2) at the current estimate, with maximizers {θ1^0, θ2^0}, {θ1^1, θ2^1}, {θ1^2, θ2^2}, … approaching a local optimal point.]
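For completeness (a standard derivation, not spelled out on these slides), the lower bound follows from Jensen's inequality:
For any weights w_i^(j) ≥ 0 with Σ_i w_i^(j) = 1,
log f(x^(j) | Θ) = log Σ_i α_i f(x^(j) | θ_i) = log Σ_i w_i^(j) [ α_i f(x^(j) | θ_i) / w_i^(j) ] ≥ Σ_i w_i^(j) log[ α_i f(x^(j) | θ_i) / w_i^(j) ]
since log is concave (Jensen's inequality). Choosing w_i^(j) = w_i^(j,t), the E-step responsibilities, makes the bound touch the log-likelihood at Θ = Θ̂^(t); summing over j and maximizing the bound over Θ is exactly the M-step.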
Summary
• Expectation-Maximization algorithm
  • E-step: compute the expected complete-data likelihood
  • M-step: maximize that expectation to find the parameters
• Can be used with any model with hidden (latent) variables
  • Hidden variables can be natural to the model or can be artificially introduced.
  • Makes the parameter estimation simpler and efficient
• The EM algorithm can be explained from many perspectives
  • Bound optimization
  • Proximal point optimization, etc.
• Several generalizations/specializations exist
• Easy to implement, and is widely used!