Accelerating the EM Algorithm
for Mixture Density Estimation
Homer Walker
Mathematical Sciences Department
Worcester Polytechnic Institute
Joint work with Josh Plasse (WPI/Imperial College).
Research supported in part by DOE Grant DE-SC0004880 and NSF Grant DMS-1337943.
Accelerating the EM Algorithm ICERM Workshop September 4, 2015 Slide 1/18
Mixture Densities
Consider a (finite) mixture density
p(x|Φ) = ∑_{i=1}^m α_i p_i(x|φ_i).

Problem: Estimate Φ = (α_1, …, α_m, φ_1, …, φ_m) using an “unlabeled” sample {x_k}_{k=1}^N from the mixture.

Maximum-Likelihood Estimate (MLE): Determine Φ = arg max L(Φ), where

L(Φ) ≡ ∑_{k=1}^N log p(x_k|Φ).
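For concreteness, p(x|Φ) and L(Φ) can be evaluated directly. A Python sketch for a univariate normal mixture with hypothetical helper names (the talk's computing was in MATLAB; any parameter values used with it are illustrative):

```python
import numpy as np

# Evaluate p(x|Phi) and L(Phi) for a univariate normal mixture.
# A sketch with hypothetical helper names; parameters are illustrative.
def mixture_pdf(x, alphas, mus, sigma2s):
    # p(x|Phi) = sum_i alpha_i p_i(x|phi_i), with normal components
    x = np.asarray(x, dtype=float)
    comps = [a * np.exp(-(x - m) ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
             for a, m, s2 in zip(alphas, mus, sigma2s)]
    return np.sum(comps, axis=0)

def log_likelihood(sample, alphas, mus, sigma2s):
    # L(Phi) = sum_k log p(x_k|Phi)
    return float(np.sum(np.log(mixture_pdf(sample, alphas, mus, sigma2s))))
```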
The EM (Expectation-Maximization) Algorithm
The general formulation and name were given in . . .
A. P. Dempster, N. M. Laird, and D. B. Rubin (1977), Maximum likelihood from incomplete data via the EM algorithm, J. Royal Statist. Soc. Ser. B (Methodological), 39, 1–38.
General idea: Determine the next approximate MLE to maximize the expectation of the complete-data log-likelihood function, given the observed incomplete data and the current approximate MLE.
Marvelous property: The log-likelihood function increases at each iteration.
The EM Algorithm for Mixture Densities
For a mixture density, an EM iteration is . . .
α_i^+ = (1/N) ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c),

φ_i^+ = arg max_{φ_i} ∑_{k=1}^N [α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c)] log p_i(x_k|φ_i).
For a derivation, convergence analysis, history, etc., see . . .
R. A. Redner and HW (1984), Mixture densities, maximum likelihood and the EM algorithm, SIAM Review, 26, 195–239.
Particular Example: Normal (Gaussian) Mixtures
Assume (multivariate) normal densities. For each i, φ_i = (μ_i, Σ_i) and

p_i(x|φ_i) = (2π)^{−n/2} (det Σ_i)^{−1/2} e^{−(x−μ_i)^T Σ_i^{−1} (x−μ_i)/2}.
EM iteration: For i = 1, …, m,

α_i^+ = (1/N) ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c),

μ_i^+ = [ ∑_{k=1}^N x_k α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ] / [ ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ],

Σ_i^+ = [ ∑_{k=1}^N (x_k − μ_i^+)(x_k − μ_i^+)^T α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ] / [ ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ].
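The univariate special case of these updates is easy to state in code. A minimal Python sketch (the talk's experiments used MATLAB; `em_step` is a hypothetical name), iterating α_i, μ_i, and σ_i² together:

```python
import numpy as np

# One EM update for a univariate normal mixture: the scalar case of the
# alpha/mu/Sigma updates above. An illustrative sketch only.
def em_step(x, alphas, mus, sigma2s):
    x = np.asarray(x, dtype=float)
    # numerators alpha_i^c p_i(x_k|phi_i^c), one row per component
    num = np.array([a * np.exp(-(x - m) ** 2 / (2.0 * s2))
                    / np.sqrt(2.0 * np.pi * s2)
                    for a, m, s2 in zip(alphas, mus, sigma2s)])
    w = num / num.sum(axis=0)               # w[i,k] = alpha_i p_i(x_k)/p(x_k)
    alphas_new = w.mean(axis=1)             # alpha_i^+ = (1/N) sum_k w[i,k]
    mus_new = (w @ x) / w.sum(axis=1)       # weighted sample means
    sigma2s_new = ((w * (x - mus_new[:, None]) ** 2).sum(axis=1)
                   / w.sum(axis=1))         # uses the updated means mu_i^+
    return alphas_new, mus_new, sigma2s_new
```

Each call increases the sample log-likelihood, per the property noted on the Dempster–Laird–Rubin slide.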
EM Iterations Demo
A Univariate Normal Mixture.
• p_i(x|φ_i) = (1/√(2πσ_i²)) e^{−(x−μ_i)²/(2σ_i²)} for i = 1, …, 5.
• Sample of 100,000 observations:
— [α_1, …, α_5] = [.2, .3, .3, .1, .1],
— [μ_1, …, μ_5] = [0, 1, 2, 3, 4],
— [σ_1², …, σ_5²] = [.2, 2, .5, .1, .1].
• EM iterations on the means:

μ_i^+ = [ ∑_{k=1}^N x_k α_i p_i(x_k|φ_i) / p(x_k|Φ) ] / [ ∑_{k=1}^N α_i p_i(x_k|φ_i) / p(x_k|Φ) ].

[Figures: left, “5-Population Mixture, Sample Size 10000, EM with No Acceleration, Iteration 0”; right, log residual norm vs. iteration number.]
Anderson Acceleration
Derived from a method of D. G. Anderson, Iterative procedures for nonlinear integral equations, J. Assoc. Comput. Machinery, 12 (1965), 547–560.
Consider a fixed-point iteration x+ = g(x), g : Rn → Rn.
Anderson Acceleration: Given x_0 and mMax ≥ 1.

Set x_1 = g(x_0).
Iterate: For k = 1, 2, …
 Set m_k = min{mMax, k}.
 Set F_k = (f_{k−m_k}, …, f_k), where f_i = g(x_i) − x_i.
 Solve min_{α ∈ R^{m_k+1}} ‖F_k α‖_2 s.t. ∑_{i=0}^{m_k} α_i = 1.
 Set x_{k+1} = ∑_{i=0}^{m_k} α_i g(x_{k−m_k+i}).
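A minimal Python sketch of this iteration. It solves the constrained least-squares problem via the standard unconstrained reformulation in difference variables; that reformulation and the name `anderson` are implementation choices not stated on the slide:

```python
import numpy as np

# Sketch of Anderson acceleration for x+ = g(x), following the slide.
# min ||F_k a|| s.t. sum(a) = 1 is solved in difference variables.
def anderson(g, x0, m_max=5, iters=50, tol=1e-10):
    xs = [np.asarray(x0, dtype=float)]
    gs = [g(xs[0])]
    fs = [gs[0] - xs[0]]                    # f_i = g(x_i) - x_i
    xs.append(gs[0].copy())                 # x_1 = g(x_0)
    for k in range(1, iters):
        gs.append(g(xs[k]))
        fs.append(gs[k] - xs[k])
        if np.linalg.norm(fs[k]) < tol:
            return xs[k]
        mk = min(m_max, k)
        F = np.column_stack(fs[k - mk:k + 1])   # f_{k-mk}, ..., f_k
        dF = F[:, 1:] - F[:, :-1]
        # min over gamma of ||f_k - dF gamma|| is equivalent to the
        # constrained problem; recover the weights alpha from gamma.
        gamma, *_ = np.linalg.lstsq(dF, F[:, -1], rcond=None)
        alpha = np.empty(mk + 1)
        alpha[0] = gamma[0]
        if mk > 1:
            alpha[1:-1] = np.diff(gamma)
        alpha[-1] = 1.0 - gamma[-1]
        G = np.column_stack(gs[k - mk:k + 1])
        xs.append(G @ alpha)                # x_{k+1} = sum_i alpha_i g(...)
    return xs[-1]
```

With g taken to be the EM map, this is the EM+AA combination used in the experiments below.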
EM Iterations Demo (cont.)
• p_i(x|φ_i) = (1/√(2πσ_i²)) e^{−(x−μ_i)²/(2σ_i²)} for i = 1, …, 5.
• Sample of 100,000 observations:
— [α_1, …, α_5] = [.2, .3, .3, .1, .1],
— [μ_1, …, μ_5] = [0, 1, 2, 3, 4],
— [σ_1², …, σ_5²] = [.2, 2, .5, .1, .1].
• EM iterations on the means:

μ_i^+ = [ ∑_{k=1}^N x_k α_i p_i(x_k|φ_i) / p(x_k|Φ) ] / [ ∑_{k=1}^N α_i p_i(x_k|φ_i) / p(x_k|Φ) ].

[Figures: left, “5-Population Mixture, Sample Size 10000, EM with No Acceleration, Iteration 0”; right, log residual norm vs. iteration number.]
EM Convergence and “Separation”
Redner–W (1984): For mixture densities, the convergence is linear and depends on the “separation” of the component populations:

“well-separated” (fast convergence) if, whenever i ≠ j,

p_i(x|φ_i*) / p(x|Φ*) · p_j(x|φ_j*) / p(x|Φ*) ≈ 0 for all x ∈ R^n;

“poorly separated” (slow convergence) if, for some i ≠ j,

p_i(x|φ_i*) / p(x|Φ*) ≈ p_j(x|φ_j*) / p(x|Φ*) for all x ∈ R^n.
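The criterion can be probed numerically. The proxy below is not from the talk: it integrates the product of the posterior component probabilities α_i p_i/p and α_j p_j/p against the mixture density, which is small exactly when the ratios above are negligible wherever the mixture carries mass. `overlap` is a hypothetical name:

```python
import numpy as np

# Hypothetical numeric proxy for the separation criterion above, for a
# univariate normal mixture on a uniform grid: values near 0 indicate
# components i and j are well separated where the mixture has mass.
def overlap(grid, alphas, mus, sigma2s, i, j):
    comps = np.array([a * np.exp(-(grid - m) ** 2 / (2.0 * s2))
                      / np.sqrt(2.0 * np.pi * s2)
                      for a, m, s2 in zip(alphas, mus, sigma2s)])
    p = comps.sum(axis=0)
    post = comps / p                        # posterior probabilities
    return float(np.sum(post[i] * post[j] * p) * (grid[1] - grid[0]))
```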
Example: EM Convergence and “Separation”
A Univariate Normal Mixture.
• p_i(x|φ_i) = (1/√(2πσ_i²)) e^{−(x−μ_i)²/(2σ_i²)} for i = 1, …, 3.
• EM iterations on the means:

μ_i^+ = [ ∑_{k=1}^N x_k α_i p_i(x_k|φ_i) / p(x_k|Φ) ] / [ ∑_{k=1}^N α_i p_i(x_k|φ_i) / p(x_k|Φ) ].

• Sample of 100,000 observations:
— [α_1, α_2, α_3] = [.3, .3, .4], [σ_1², σ_2², σ_3²] = [1, 1, 1],
— [μ_1, μ_2, μ_3] = [0, 2, 4], [0, 1, 2], [0, .5, 1].

[Figure: log residual norm vs. iteration number.]
Experiments with Multivariate Normal Mixtures
Experiment with Anderson acceleration applied to . . .
EM iteration: For i = 1, …, m,

α_i^+ = (1/N) ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c),

μ_i^+ = [ ∑_{k=1}^N x_k α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ] / [ ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ],

Σ_i^+ = [ ∑_{k=1}^N (x_k − μ_i^+)(x_k − μ_i^+)^T α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ] / [ ∑_{k=1}^N α_i^c p_i(x_k|φ_i^c) / p(x_k|Φ^c) ].
Assume m is known.
Ultimate interest: very large N.
Experiments with Multivariate Normal Mixtures (cont.)
Two issues:
• Good initial guess? Use K-means.
— Fast clustering algorithm. Usually gives good results.
— Apply several times to random subsets of the sample. Choose the clustering with minimal sum of within-class distances.
— Use the proportions, means, and covariance matrices of the clusters as the initial guess.
• Preserving constraints? Iterate on …
— √α_i, i = 1, …, m;
— Cholesky factors of each Σ_i.
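A sketch of what iterating on √α_i and Cholesky factors might look like. The `pack`/`unpack` helpers are hypothetical names, and renormalizing the α_i to sum to 1 after mapping back is an added assumption:

```python
import numpy as np

# Hypothetical pack/unpack helpers illustrating the slide's idea: run the
# accelerated iteration on sqrt(alpha_i) and on the Cholesky factors of
# the Sigma_i, so mapping back always gives alpha_i >= 0 and symmetric
# positive semidefinite Sigma_i.
def pack(alphas, Ls):
    # alphas -> sqrt(alpha_i); each Sigma_i -> lower triangle of its
    # Cholesky factor L_i (Sigma_i = L_i L_i^T)
    tri = [L[np.tril_indices(L.shape[0])] for L in Ls]
    return np.concatenate([np.sqrt(alphas)] + tri)

def unpack(v, m, d):
    alphas = v[:m] ** 2
    alphas = alphas / alphas.sum()          # restore sum alpha_i = 1
    ntri = d * (d + 1) // 2
    Sigmas, ofs = [], m
    for _ in range(m):
        L = np.zeros((d, d))
        L[np.tril_indices(d)] = v[ofs:ofs + ntri]
        Sigmas.append(L @ L.T)              # symmetric PSD by construction
        ofs += ntri
    return alphas, Sigmas
```

The accelerated iterates then live in the unconstrained packed vector, and every unpacked (α, Σ) pair satisfies the mixture constraints.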
Experiments with Generated Data
• All computing in MATLAB.
• Mixtures with m = 5 subpopulations.
• Generated data in R^d for d = 2, 5, 10, 15, 20:
— For each d, randomly generated 100 “true” parameter sets {α_i, μ_i, Σ_i}_{i=1}^5.
— For each {α_i, μ_i, Σ_i}_{i=1}^5, randomly generated a sample of size N = 1,000,000.
• Compared (unaccelerated) EM with EM+AA with mMax = 5, 10, 15, 20, 25, 30.
Experiments with Generated Data (cont.)
A look at failures.
mMax     0    5   10   15   20   25   30
*       75   66   52   52   51   51   51
**       0    4   19   23   28   29   29
Totals  75   70   71   75   79   80   80

* ⇒ failure to converge within 300 iterations. ** ⇒ ∑_{k=1}^N α_i p_i(x_k)/p(x_k) = 0 for some i.

There were …
— 49 trials in which all methods failed,
— 26 trials in which EM failed and EM+AA succeeded for at least one mMax,
— 15 trials in which EM failed and EM+AA succeeded for all mMax,
— 20 trials in which EM succeeded and EM+AA failed for all mMax,
— 21 trials in which EM succeeded and EM+AA failed for at least one mMax.
Experiments with Generated Data (cont.)
Performance profiles (Dolan–Moré, 2002) for (unaccelerated) EM and EM+AA with mMax = 5 over all trials:
[Figures: performance profiles, mMax = 0 vs. mMax = 5, for iteration numbers (left) and run times (right).]
An Experiment with Real Data
• Remotely sensed data from near Tollhouse, CA. (Thanks to Brett Bader, Digital Globe.)
• N = 3285 × 959 = 3,150,315 observations of 16-dimensional multispectral data.
• Modeled with a mixture of m = 3 multivariate normals.
• Applied (unaccelerated) EM and EM+AA with mMax = 5, 10, 15, 20, 25, 30.
An Experiment with Real Data (cont.)
Left: log residual norms vs. iteration numbers. Right: Bayes classification of the data based on the MLE.
In Conclusion . . .
• Anderson acceleration is a promising tool for accelerating the EM algorithm that may improve both robustness and efficiency.
I Future work:
— Expand the generated-data experiments to include more trials, larger data sets, well-controlled “separation” experiments, “partially labeled” samples, and other parametric PDF forms.
— Look for more data from real applications.