Accelerating the EM Algorithm
for Mixture Density Estimation
Homer Walker
Mathematical Sciences Department
Worcester Polytechnic Institute
Joint work with Josh Plasse (WPI/Imperial College).
Research supported in part by DOE Grant DE-SC0004880 and NSF Grant DMS-1337943.
Accelerating the EM Algorithm ICERM Workshop September 4, 2015 Slide 1/18
Mixture Densities
Consider a (finite) mixture density
p(x|Φ) = ∑_{i=1}^m α_i p_i(x|φ_i).

Problem: Estimate Φ = (α_1, …, α_m, φ_1, …, φ_m) using an “unlabeled” sample {x_k}_{k=1}^N from the mixture.

Maximum-Likelihood Estimate (MLE): Determine Φ = arg max L(Φ), where

L(Φ) ≡ ∑_{k=1}^N log p(x_k|Φ).
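For concreteness, p(x|Φ) and L(Φ) can be evaluated directly. A Python sketch for a univariate normal mixture with hypothetical helper names (the talk's computing was in MATLAB; any parameter values used with it are illustrative):

```python
import numpy as np

# Evaluate p(x|Phi) and L(Phi) for a univariate normal mixture.
# A sketch with hypothetical helper names; parameters are illustrative.
def mixture_pdf(x, alphas, mus, sigma2s):
    # p(x|Phi) = sum_i alpha_i p_i(x|phi_i), with normal components
    x = np.asarray(x, dtype=float)
    comps = [a * np.exp(-(x - m) ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
             for a, m, s2 in zip(alphas, mus, sigma2s)]
    return np.sum(comps, axis=0)

def log_likelihood(sample, alphas, mus, sigma2s):
    # L(Phi) = sum_k log p(x_k|Phi)
    return float(np.sum(np.log(mixture_pdf(sample, alphas, mus, sigma2s))))
```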
The EM (Expectation-Maximization) Algorithm
The general formulation and name were given in . . .
A. P. Dempster, N. M. Laird, and D. B. Rubin (1977), Maximum likelihood from incomplete data via the EM algorithm, J. Royal Statist. Soc. Ser. B (Methodological), 39, 1–38.
General idea: Determine the next approximate MLE to maximize the expectation of the complete-data log-likelihood function, given the observed incomplete data and the current approximate MLE.
Marvelous property: The log-likelihood function increases at each iteration.
The EM Algorithm for Mixture Densities
For a mixture density, an EM iteration is . . .
α_i^+ = (1/N) ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c),

φ_i^+ = arg max_{φ_i} ∑_{k=1}^N [α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c)] log p_i(x_k|φ_i).
For a derivation, convergence analysis, history, etc., see . . .
R. A. Redner and HW (1984), Mixture densities, maximum likelihood and the EM algorithm, SIAM Review, 26, 195–239.
Particular Example: Normal (Gaussian) Mixtures
Assume (multivariate) normal densities. For each i, φ_i = (μ_i, Σ_i) and

p_i(x|φ_i) = (2π)^{−n/2} (det Σ_i)^{−1/2} e^{−(x−μ_i)^T Σ_i^{−1} (x−μ_i)/2}.
EM iteration: For i = 1, …, m,

α_i^+ = (1/N) ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c),

μ_i^+ = [ ∑_{k=1}^N x_k α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ] / [ ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ],

Σ_i^+ = [ ∑_{k=1}^N (x_k − μ_i^+)(x_k − μ_i^+)^T α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ] / [ ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ].
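The univariate special case of these updates is easy to state in code. A minimal Python sketch (the talk's experiments used MATLAB; `em_step` is a hypothetical name), iterating α_i, μ_i, and σ_i² together:

```python
import numpy as np

# One EM update for a univariate normal mixture: the scalar case of the
# alpha/mu/Sigma updates above. An illustrative sketch only.
def em_step(x, alphas, mus, sigma2s):
    x = np.asarray(x, dtype=float)
    # numerators alpha_i^c p_i(x_k|phi_i^c), one row per component
    num = np.array([a * np.exp(-(x - m) ** 2 / (2.0 * s2))
                    / np.sqrt(2.0 * np.pi * s2)
                    for a, m, s2 in zip(alphas, mus, sigma2s)])
    w = num / num.sum(axis=0)               # w[i,k] = alpha_i p_i(x_k)/p(x_k)
    alphas_new = w.mean(axis=1)             # alpha_i^+ = (1/N) sum_k w[i,k]
    mus_new = (w @ x) / w.sum(axis=1)       # weighted sample means
    sigma2s_new = ((w * (x - mus_new[:, None]) ** 2).sum(axis=1)
                   / w.sum(axis=1))         # uses the updated means mu_i^+
    return alphas_new, mus_new, sigma2s_new
```

Each call increases the sample log-likelihood, per the property noted on the Dempster–Laird–Rubin slide.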
EM Iterations Demo
A Univariate Normal Mixture.
• p_i(x|φ_i) = (1/√(2πσ_i²)) e^{−(x−μ_i)²/(2σ_i²)} for i = 1, …, 5.
• Sample of 100,000 observations:
— [α_1, …, α_5] = [.2, .3, .3, .1, .1],
— [μ_1, …, μ_5] = [0, 1, 2, 3, 4],
— [σ_1², …, σ_5²] = [.2, 2, .5, .1, .1].
• EM iterations on the means:

μ_i^+ = [ ∑_{k=1}^N x_k α_i p_i(x_k|φ_i) / p(x_k|Φ) ] / [ ∑_{k=1}^N α_i p_i(x_k|φ_i) / p(x_k|Φ) ].

[Figures: left, “5-Population Mixture, Sample Size 10000, EM with No Acceleration, Iteration 0”; right, log residual norm vs. iteration number.]
Anderson Acceleration
Derived from a method of D. G. Anderson, Iterative procedures for nonlinear integral equations, J. Assoc. Comput. Machinery, 12 (1965), 547–560.
Consider a fixed-point iteration x+ = g(x), g : Rn → Rn.
Anderson Acceleration: Given x_0 and mMax ≥ 1.

Set x_1 = g(x_0).
Iterate: For k = 1, 2, …
 Set m_k = min{mMax, k}.
 Set F_k = (f_{k−m_k}, …, f_k), where f_i = g(x_i) − x_i.
 Solve min_{α ∈ R^{m_k+1}} ‖F_k α‖_2 s.t. ∑_{i=0}^{m_k} α_i = 1.
 Set x_{k+1} = ∑_{i=0}^{m_k} α_i g(x_{k−m_k+i}).
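A minimal Python sketch of this iteration. It solves the constrained least-squares problem via the standard unconstrained reformulation in difference variables; that reformulation and the name `anderson` are implementation choices not stated on the slide:

```python
import numpy as np

# Sketch of Anderson acceleration for x+ = g(x), following the slide.
# min ||F_k a|| s.t. sum(a) = 1 is solved in difference variables.
def anderson(g, x0, m_max=5, iters=50, tol=1e-10):
    xs = [np.asarray(x0, dtype=float)]
    gs = [g(xs[0])]
    fs = [gs[0] - xs[0]]                    # f_i = g(x_i) - x_i
    xs.append(gs[0].copy())                 # x_1 = g(x_0)
    for k in range(1, iters):
        gs.append(g(xs[k]))
        fs.append(gs[k] - xs[k])
        if np.linalg.norm(fs[k]) < tol:
            return xs[k]
        mk = min(m_max, k)
        F = np.column_stack(fs[k - mk:k + 1])   # f_{k-mk}, ..., f_k
        dF = F[:, 1:] - F[:, :-1]
        # min over gamma of ||f_k - dF gamma|| is equivalent to the
        # constrained problem; recover the weights alpha from gamma.
        gamma, *_ = np.linalg.lstsq(dF, F[:, -1], rcond=None)
        alpha = np.empty(mk + 1)
        alpha[0] = gamma[0]
        if mk > 1:
            alpha[1:-1] = np.diff(gamma)
        alpha[-1] = 1.0 - gamma[-1]
        G = np.column_stack(gs[k - mk:k + 1])
        xs.append(G @ alpha)                # x_{k+1} = sum_i alpha_i g(...)
    return xs[-1]
```

With g taken to be the EM map, this is the EM+AA combination used in the experiments below.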
EM Iterations Demo (cont.)
• p_i(x|φ_i) = (1/√(2πσ_i²)) e^{−(x−μ_i)²/(2σ_i²)} for i = 1, …, 5.
• Sample of 100,000 observations:
— [α_1, …, α_5] = [.2, .3, .3, .1, .1],
— [μ_1, …, μ_5] = [0, 1, 2, 3, 4],
— [σ_1², …, σ_5²] = [.2, 2, .5, .1, .1].
• EM iterations on the means:

μ_i^+ = [ ∑_{k=1}^N x_k α_i p_i(x_k|φ_i) / p(x_k|Φ) ] / [ ∑_{k=1}^N α_i p_i(x_k|φ_i) / p(x_k|Φ) ].

[Figures: left, “5-Population Mixture, Sample Size 10000, EM with No Acceleration, Iteration 0”; right, log residual norm vs. iteration number.]
EM Convergence and “Separation”
Redner–W (1984): For mixture densities, the convergence is linear and depends on the “separation” of the component populations:

“well-separated” (fast convergence) if, whenever i ≠ j,

p_i(x|φ_i*) / p(x|Φ*) · p_j(x|φ_j*) / p(x|Φ*) ≈ 0 for all x ∈ R^n;

“poorly separated” (slow convergence) if, for some i ≠ j,

p_i(x|φ_i*) / p(x|Φ*) ≈ p_j(x|φ_j*) / p(x|Φ*) for all x ∈ R^n.
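The criterion can be probed numerically. The proxy below is not from the talk: it integrates the product of the posterior component probabilities α_i p_i/p and α_j p_j/p against the mixture density, which is small exactly when the ratios above are negligible wherever the mixture carries mass. `overlap` is a hypothetical name:

```python
import numpy as np

# Hypothetical numeric proxy for the separation criterion above, for a
# univariate normal mixture on a uniform grid: values near 0 indicate
# components i and j are well separated where the mixture has mass.
def overlap(grid, alphas, mus, sigma2s, i, j):
    comps = np.array([a * np.exp(-(grid - m) ** 2 / (2.0 * s2))
                      / np.sqrt(2.0 * np.pi * s2)
                      for a, m, s2 in zip(alphas, mus, sigma2s)])
    p = comps.sum(axis=0)
    post = comps / p                        # posterior probabilities
    return float(np.sum(post[i] * post[j] * p) * (grid[1] - grid[0]))
```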
Example: EM Convergence and “Separation”
A Univariate Normal Mixture.
• p_i(x|φ_i) = (1/√(2πσ_i²)) e^{−(x−μ_i)²/(2σ_i²)} for i = 1, …, 3.
• EM iterations on the means:

μ_i^+ = [ ∑_{k=1}^N x_k α_i p_i(x_k|φ_i) / p(x_k|Φ) ] / [ ∑_{k=1}^N α_i p_i(x_k|φ_i) / p(x_k|Φ) ].

• Sample of 100,000 observations:
— [α_1, α_2, α_3] = [.3, .3, .4], [σ_1², σ_2², σ_3²] = [1, 1, 1],
— [μ_1, μ_2, μ_3] = [0, 2, 4], [0, 1, 2], [0, .5, 1].

[Figure: log residual norm vs. iteration number.]
Experiments with Multivariate Normal Mixtures
Experiment with Anderson acceleration applied to . . .
EM iteration: For i = 1, …, m,

α_i^+ = (1/N) ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c),

μ_i^+ = [ ∑_{k=1}^N x_k α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ] / [ ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ],

Σ_i^+ = [ ∑_{k=1}^N (x_k − μ_i^+)(x_k − μ_i^+)^T α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ] / [ ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ].
Assume m is known.
Ultimate interest: very large N.
Experiments with Multivariate Normal Mixtures (cont.)
Two issues:
• Good initial guess? Use K-means.
— Fast clustering algorithm. Usually gives good results.
— Apply several times to random subsets of the sample. Choose the clustering with minimal sum of within-class distances.
— Use the proportions, means, and covariance matrices of the clusters as the initial guess.
• Preserving constraints? Iterate on …
— √α_i, i = 1, …, m;
— Cholesky factors of each Σ_i.
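A sketch of what iterating on √α_i and Cholesky factors might look like. The `pack`/`unpack` helpers are hypothetical names, and renormalizing the α_i to sum to 1 after mapping back is an added assumption:

```python
import numpy as np

# Hypothetical pack/unpack helpers illustrating the slide's idea: run the
# accelerated iteration on sqrt(alpha_i) and on the Cholesky factors of
# the Sigma_i, so mapping back always gives alpha_i >= 0 and symmetric
# positive semidefinite Sigma_i.
def pack(alphas, Ls):
    # alphas -> sqrt(alpha_i); each Sigma_i -> lower triangle of its
    # Cholesky factor L_i (Sigma_i = L_i L_i^T)
    tri = [L[np.tril_indices(L.shape[0])] for L in Ls]
    return np.concatenate([np.sqrt(alphas)] + tri)

def unpack(v, m, d):
    alphas = v[:m] ** 2
    alphas = alphas / alphas.sum()          # restore sum alpha_i = 1
    ntri = d * (d + 1) // 2
    Sigmas, ofs = [], m
    for _ in range(m):
        L = np.zeros((d, d))
        L[np.tril_indices(d)] = v[ofs:ofs + ntri]
        Sigmas.append(L @ L.T)              # symmetric PSD by construction
        ofs += ntri
    return alphas, Sigmas
```

The accelerated iterates then live in the unconstrained packed vector, and every unpacked (α, Σ) pair satisfies the mixture constraints.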
Experiments with Generated Data
• All computing in MATLAB.
• Mixtures with m = 5 subpopulations.
• Generated data in R^d for d = 2, 5, 10, 15, 20:
— For each d, randomly generated 100 “true” parameter sets {α_i, μ_i, Σ_i}_{i=1}^5.
— For each {α_i, μ_i, Σ_i}_{i=1}^5, randomly generated a sample of size N = 1,000,000.
• Compared (unaccelerated) EM with EM+AA with mMax = 5, 10, 15, 20, 25, 30.
Experiments with Generated Data (cont.)
A look at failures.
mMax     0    5   10   15   20   25   30
*       75   66   52   52   51   51   51
**       0    4   19   23   28   29   29
Totals  75   70   71   75   79   80   80

* ⇒ failure to converge within 300 iterations. ** ⇒ ∑_{k=1}^N α_i p_i(x_k)/p(x_k) = 0 for some i.

There were …
— 49 trials in which all methods failed,
— 26 trials in which EM failed and EM+AA succeeded for at least one mMax,
— 15 trials in which EM failed and EM+AA succeeded for all mMax,
— 20 trials in which EM succeeded and EM+AA failed for all mMax,
— 21 trials in which EM succeeded and EM+AA failed for at least one mMax.
Experiments with Generated Data (cont.)
Performance profiles (Dolan–Moré, 2002) for (unaccelerated) EM and EM+AA with mMax = 5 over all trials:
[Figures: performance profiles, mMax = 0 vs. mMax = 5, for iteration numbers (left) and run times (right).]
An Experiment with Real Data
• Remotely sensed data from near Tollhouse, CA. (Thanks to Brett Bader, Digital Globe.)
• N = 3285 × 959 = 3,150,315 observations of 16-dimensional multispectral data.
• Modeled with a mixture of m = 3 multivariate normals.
• Applied (unaccelerated) EM and EM+AA with mMax = 5, 10, 15, 20, 25, 30.
An Experiment with Real Data (cont.)
Left: log residual norms vs. iteration numbers. Right: Bayes classification of the data based on the MLE.
In Conclusion . . .
• Anderson acceleration is a promising tool for accelerating the EM algorithm that may improve both robustness and efficiency.
I Future work:
— Expand the generated-data experiments to include more trials, larger data sets, well-controlled “separation” experiments, “partially labeled” samples, and other parametric PDF forms.
— Look for more data from real applications.