
k-MLE: A fast algorithm for learning statistical mixture models

http://arxiv.org/abs/1203.5181 (preliminary version presented at IEEE ICASSP 2012)
Transcript
Page 1: k-MLE: A fast algorithm for learning statistical mixture models

k-MLE: A fast algorithm for learning statistical mixture models

(arXiv:1203.5181)

Frank NIELSEN

Sony Computer Science Laboratories, Inc.

28th March 2012
International Conference on Acoustics, Speech, and Signal Processing

ICASSP, Kyoto ICC


Page 2: k-MLE: A fast algorithm for learning statistical mixture models

Outline

- Background
  - Statistical mixtures of exponential families (EFMMs)
  - Legendre transform and mixture dual parameterizations
- Contributions
  - k-MLE and its variants
  - k-MLE initialization (k-MLE++)
- Summary


Page 3: k-MLE: A fast algorithm for learning statistical mixture models

Exponential Family Mixture Models (EFMMs)

Generalize Gaussian & Rayleigh MMs to many common distributions.

m(x) = ∑_{i=1}^k w_i p_F(x; λ_i), with w_i > 0 for all i and ∑_{i=1}^k w_i = 1

p_F(x; λ) = exp(〈t(x), θ〉 − F(θ) + k(x))

F: log-Laplace transform (partition, cumulant function):

∫_{x∈X} p_F(x; θ) dx = 1 ⇒ F(θ) = log ∫_{x∈X} exp(〈t(x), θ〉 + k(x)) dx,

θ ∈ Θ = {θ | ∫_{x∈X} exp(〈t(x), θ〉 + k(x)) dx < ∞}, the natural parameter space.

- d: dimension of the support X.
- D: order of the family (= dim Θ). Statistic: t(x) : R^d → R^D.


Page 4: k-MLE: A fast algorithm for learning statistical mixture models

Statistical mixtures: Rayleigh MMs [7, 5]

IntraVascular UltraSound (IVUS) imaging.

Rayleigh distribution:

p(x; λ) = (x/λ^2) exp(−x^2/(2λ^2)), x ∈ R₊ = X

d = 1 (univariate), D = 1 (order 1)
θ = −1/(2λ^2), Θ = (−∞, 0)
F(θ) = −log(−2θ)
t(x) = x^2, k(x) = log x
(a Weibull distribution with shape k = 2)

Coronary plaques: fibrotic tissues, calcified tissues, lipidic tissues.
Rayleigh Mixture Models (RMMs): for segmentation and classification tasks.


Page 5: k-MLE: A fast algorithm for learning statistical mixture models

Statistical mixtures: Gaussian MMs [3, 5]

Gaussian mixture models (GMMs). Color image interpreted as a 5D xyRGB point set.

Gaussian distribution:

p(x; µ, Σ) = 1/((2π)^{d/2} √|Σ|) exp(−(1/2) D_{Σ^{-1}}(x, µ))

Squared Mahalanobis distance: D_Q(x, y) = (x − y)^T Q (x − y)

x ∈ R^d = X (multivariate), D = d(d+3)/2 (order)
θ = (Σ^{-1}µ, (1/2)Σ^{-1}) = (θ_v, θ_M), Θ = R^d × S^d_{++}
F(θ) = (1/4) θ_v^T θ_M^{-1} θ_v − (1/2) log |θ_M| + (d/2) log π
t(x) = (x, −x x^T), k(x) = 0


Page 6: k-MLE: A fast algorithm for learning statistical mixture models

Sampling from a Gaussian Mixture Model (GMM)

To sample a variate x from a GMM:

- Choose a component l according to the weight distribution w_1, ..., w_k,
- Draw a variate x according to N(µ_l, Σ_l).

Doubly stochastic process:

1. Throw a (biased) die with k faces to choose the component: l ∼ Multinomial(w_1, ..., w_k). (The multinomial distribution also belongs to the exponential families.)
2. Then draw at random a variate x from the l-th component: x ∼ Normal(µ_l, Σ_l), via x = µ_l + Cz with the Cholesky factorization Σ_l = CC^T and z = [z_1 ... z_d]^T a standard normal random vector: z_i = √(−2 log U_1) cos(2π U_2) (Box-Muller).
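As a concrete illustration of this doubly stochastic process, here is a minimal NumPy sketch (the helper name sample_gmm and its interface are ours, not from the talk):

```python
import numpy as np

def sample_gmm(weights, mus, sigmas, n, rng=None):
    """Draw n variates from a GMM: a sketch of the doubly stochastic process."""
    rng = np.random.default_rng() if rng is None else rng
    k, d = len(weights), len(mus[0])
    # 1. Throw the biased k-faced die: l ~ Multinomial(w_1, ..., w_k)
    labels = rng.choice(k, size=n, p=weights)
    x = np.empty((n, d))
    for l in range(k):
        idx = labels == l
        # 2. x = mu_l + C z, with Sigma_l = C C^T (Cholesky) and z standard normal
        C = np.linalg.cholesky(sigmas[l])
        z = rng.standard_normal((idx.sum(), d))
        x[idx] = mus[l] + z @ C.T
    return x, labels
```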


Page 7: k-MLE: A fast algorithm for learning statistical mixture models

Statistical mixtures: Generative models of data sets

GMM = feature descriptor for information retrieval (IR) → classification, matching, etc. Increase dimension using color image patches. Low-frequency information encoded into a compact statistical model.

Generative model → statistical image by GMM sampling.

(Figure: source image, fitted GMM, and a sample drawn from it.)


Page 8: k-MLE: A fast algorithm for learning statistical mixture models

Distance between exponential families: Relative entropy

- Distance between features (e.g., GMMs)
- Kullback-Leibler divergence (cross-entropy minus entropy):

KL(P : Q) = ∫ p(x) log (p(x)/q(x)) dx ≥ 0
          = ∫ p(x) log (1/q(x)) dx − ∫ p(x) log (1/p(x)) dx
            [cross-entropy H×(P : Q)]  [entropy H(P) = H×(P : P)]
          = F(θ_Q) − F(θ_P) − 〈θ_Q − θ_P, ∇F(θ_P)〉
          = B_F(θ_Q : θ_P)

The Bregman divergence B_F is defined for a strictly convex and differentiable function (up to some affine terms).

- The proof of KL(P : Q) = B_F(θ_Q : θ_P) follows from

X ∼ EF(θ) ⇒ E[t(X)] = ∇F(θ)


Page 9: k-MLE: A fast algorithm for learning statistical mixture models

Bregman divergence: Geometric interpretation

Potential function F, with graph plot F : (x, F(x)).

D_F(p : q) = F(p) − F(q) − 〈p − q, ∇F(q)〉
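A tiny sketch of this definition (names ours); choosing F(x) = 〈x, x〉 recovers the squared Euclidean distance:

```python
import numpy as np

def bregman(F, grad_F, p, q):
    """D_F(p : q) = F(p) - F(q) - <p - q, grad F(q)>."""
    return F(p) - F(q) - np.dot(p - q, grad_F(q))

# Example: F(x) = ||x||^2 gives D_F(p : q) = ||p - q||^2.
F = lambda x: np.dot(x, x)
grad_F = lambda x: 2.0 * x
p, q = np.array([1.0, 2.0]), np.array([0.0, 0.0])
assert np.isclose(bregman(F, grad_F, p, q), 5.0)
```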


Page 10: k-MLE: A fast algorithm for learning statistical mixture models

Convex duality: Legendre transformation

- For a strictly convex and differentiable function F : X → R:

F*(y) = sup_{x∈X} {〈y, x〉 − F(x)}, writing l_F(y; x) = 〈y, x〉 − F(x)

- Maximum obtained for y = ∇F(x):

∇_x l_F(y; x) = y − ∇F(x) = 0 ⇒ y = ∇F(x)

- Maximum unique from the convexity of F (∇²F ≻ 0):

∇²_x l_F(y; x) = −∇²F(x) ≺ 0

- Convex conjugates:

(F, X) ⇔ (F*, Y), Y = {∇F(x) | x ∈ X}


Page 11: k-MLE: A fast algorithm for learning statistical mixture models

Legendre duality & Canonical divergence

- Convex conjugates have functional inverse gradients:

(∇F)^{-1} = ∇F*

∇F* may require numerical approximation (not always available in analytical closed form).

- Involution: (F*)* = F.
- Convex conjugate F* expressed using (∇F)^{-1}:

F*(y) = 〈(∇F)^{-1}(y), y〉 − F((∇F)^{-1}(y))

- The Fenchel-Young inequality is at the heart of the canonical divergence:

F(x) + F*(y) ≥ 〈x, y〉

A_F(x : y) = A_{F*}(y : x) = F(x) + F*(y) − 〈x, y〉 ≥ 0
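When ∇F* is not available in closed form, F*(y) = sup_x {〈y, x〉 − F(x)} can be approximated by convex optimization; a minimal SciPy-based sketch (helper name ours):

```python
import numpy as np
from scipy.optimize import minimize

def conjugate(F, y, x0):
    """Approximate F*(y) = sup_x <y, x> - F(x) by minimizing F(x) - <y, x>."""
    res = minimize(lambda x: F(x) - np.dot(y, x), x0)
    return -res.fun, res.x   # (F*(y), arg sup = (grad F)^{-1}(y))

# Check against a closed form: F(x) = e^x has F*(y) = y log y - y.
Fstar, _ = conjugate(lambda x: np.exp(x[0]), np.array([2.0]), np.array([0.0]))
assert np.isclose(Fstar, 2 * np.log(2) - 2, atol=1e-5)
```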


Page 12: k-MLE: A fast algorithm for learning statistical mixture models

Dual Bregman divergences & canonical divergence [6]

KL(P : Q) = E_P[log (p(x)/q(x))] ≥ 0
          = B_F(θ_Q : θ_P) = B_{F*}(η_P : η_Q)
          = F(θ_Q) + F*(η_P) − 〈θ_Q, η_P〉
          = A_F(θ_Q : η_P) = A_{F*}(η_P : θ_Q)

with θ_Q the natural parameterization and η_P = E_P[t(X)] = ∇F(θ_P) the moment parameterization.


Page 13: k-MLE: A fast algorithm for learning statistical mixture models

Exponential family mixtures: Dual parameterizations

A finite weighted point set {(w_i, θ_i)}_{i=1}^k on a statistical manifold. Many coordinate systems for computing (two canonical):

- the usual λ-parameterization (original parameters λ ∈ Λ),
- the natural θ-parameterization (natural parameters θ ∈ Θ) and the dual η-parameterization (expectation parameters η ∈ H).

The Legendre transform (Θ, F) ↔ (H, F*) links the two canonical coordinate systems: η = ∇_θ F(θ) and θ = ∇_η F*(η).


Page 14: k-MLE: A fast algorithm for learning statistical mixture models

Maximum Likelihood Estimator (MLE)

Given n independent and identically distributed observations X = {x_1, ..., x_n}, the maximum likelihood estimator

θ̂ = argmax_{θ∈Θ} ∏_{i=1}^n p_F(x_i; θ) = argmax_{θ∈Θ} exp(∑_{i=1}^n (〈t(x_i), θ〉 − F(θ) + k(x_i)))

is the unique maximum since ∇²F ≻ 0 (Hessian), obtained by solving:

∇F(θ̂) = (1/n) ∑_{i=1}^n t(x_i)

The MLE is consistent and efficient, with asymptotic normal distribution:

θ̂ ∼ N(θ, (1/n) I^{-1}(θ))

Fisher information matrix: I(θ) = var[t(X)] = ∇²F(θ)

The MLE may be biased (e.g., normal distributions).
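As an example, for the Rayleigh family of slide 4 (t(x) = x^2, so η = E[X^2] = 2λ^2), solving ∇F(θ̂) = (1/n) ∑ t(x_i) is a plain average of sufficient statistics; a one-function sketch (name ours):

```python
import numpy as np

def rayleigh_mle(x):
    """MLE by moment matching: grad F(theta_hat) = mean of t(x) = x^2."""
    eta_hat = np.mean(np.asarray(x) ** 2)   # eta = E[X^2] = 2 lambda^2
    return np.sqrt(eta_hat / 2.0)           # lambda_hat
```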

Page 15: k-MLE: A fast algorithm for learning statistical mixture models

Duality Bregman ↔ Exponential families [2]

Legendre duality, with η = ∇F(θ), links the cumulant function F(θ) of the exponential family p_F(x|θ) to the Bregman generator F*(η) of the Bregman divergence B_{F*}(x : η).

An exponential family density

p_F(x; θ) = exp(〈t(x), θ〉 − F(θ) + k(x))

has its log-density interpreted as a Bregman divergence:

log p_F(x; θ) = −B_{F*}(t(x) : η) + F*(t(x)) + k(x)


Page 16: k-MLE: A fast algorithm for learning statistical mixture models

Exponential families ⇔ Bregman divergences: Examples

Generator F(x)   Exponential family p_F(x|θ)  ⇔  Dual Bregman divergence B_{F*}
x^2              Spherical Gaussian           ⇔  Squared loss
x log x          Multinomial                  ⇔  Kullback-Leibler divergence
x log x − x      Poisson                      ⇔  I-divergence
−log(−2x)        Rayleigh                     ⇔  Itakura-Saito divergence
−log x           Geometric                    ⇔  Itakura-Saito divergence
log |X|          Wishart                      ⇔  log-det/Burg matrix divergence [8]


Page 17: k-MLE: A fast algorithm for learning statistical mixture models

Maximum likelihood estimator revisited

θ̂ = argmax_θ ∏_{i=1}^n p_F(x_i; θ) = argmax_θ ∑_{i=1}^n log p_F(x_i; θ)

= argmax_θ ∑_{i=1}^n (〈t(x_i), θ〉 − F(θ) + k(x_i))

= argmax_θ ∑_{i=1}^n (−B_{F*}(t(x_i) : η) + F*(t(x_i)) + k(x_i))   [the last two terms are constant in θ]

≡ argmin_θ ∑_{i=1}^n B_{F*}(t(x_i) : η)

Right-sided Bregman centroid = center of mass: η̂ = (1/n) ∑_{i=1}^n t(x_i).


Page 18: k-MLE: A fast algorithm for learning statistical mixture models

Bregman batched Lloyd's k-means [2]

Extends Lloyd's k-means heuristic to Bregman divergences.

- Initialize distinct seeds: C_1 = P_1, ..., C_k = P_k.
- Repeat until convergence:
  - Assign each point P to its "closest" centroid (w.r.t. B_F(P : C)): C_i = {P ∈ P | B_F(P : C_i) ≤ B_F(P : C_j) ∀ j ≠ i}
  - Update the cluster centroids by taking their centers of mass: C_i = (1/|C_i|) ∑_{P∈C_i} P.

The loss function

L_F(P : C) = ∑_{P∈P} B_F(P : C), with B_F(P : C) = min_{i∈{1,...,k}} B_F(P : C_i),

monotonically decreases and converges to a local optimum. (Extend to weighted point sets using barycenters.)
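A compact NumPy sketch of this heuristic (names and the vectorized pairwise-divergence layout are ours; F and grad_F are assumed to act row-wise on arrays, and empty clusters are not guarded against):

```python
import numpy as np

def bregman_kmeans(P, k, F, grad_F, iters=100, rng=None):
    """Lloyd's k-means under a Bregman divergence B_F (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    C = P[rng.choice(len(P), k, replace=False)]          # distinct seeds
    for _ in range(iters):
        # Pairwise B_F(P_i : C_j) = F(P_i) - F(C_j) - <P_i - C_j, grad F(C_j)>
        G = grad_F(C)
        D = (F(P)[:, None] - F(C)[None, :]
             - P @ G.T + np.einsum('kd,kd->k', C, G)[None, :])
        labels = D.argmin(axis=1)                        # assignment step
        newC = np.array([P[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(newC, C):                         # local convergence
            break
        C = newC
    return C, labels
```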


Page 19: k-MLE: A fast algorithm for learning statistical mixture models

k-MLE for EFMM ≡ Bregman Hard Clustering [4]

Bijection between exponential families (distributions) and Bregman distances:

log p_F(x; θ) = −B_{F*}(t(x) : η) + F*(t(x)) + k(x), with η = ∇F(θ)

Bregman k-MLE for EFMMs (F) = additively weighted Bregman hard k-means for F* in the space {y_i = t(x_i)}_i.

Maximizing the complete log-likelihood log ∏_{i=1}^n ∏_{j=1}^k (w_j p_F(x_i|θ_j))^{δ_j(z_i)}:

max_{θ,w} ∑_{i=1}^n ∑_{j=1}^k δ_j(z_i) (log p_F(x_i|θ_j) + log w_j)

= min_{η,w} ∑_{i=1}^n ∑_{j=1}^k δ_j(z_i) ((B_{F*}(t(x_i) : η_j) − log w_j) − k(x_i) − F*(t(x_i)))   [the last two terms are constant]

≡ min_{η,w} ∑_{i=1}^n min_{j=1}^k (B_{F*}(t(x_i) : η_j) − log w_j)

(The inner argmin gives the z_i's.)

Page 20: k-MLE: A fast algorithm for learning statistical mixture models

Complete average log-likelihood optimization

Minimize monotonically the complete average log-likelihood:

(1/n) min_{η,w} ∑_{i=1}^n min_{j=1}^k (B_{F*}(t(x_i) : η_j) − log w_j)

1. Weights w held constant → dual additive Bregman k-means:

(1/n) min_η ∑_{i=1}^n min_{j=1}^k (B_{F*}(t(x_i) : η_j) − log w_j)

2. Component moment parameters η fixed:

min_w ∑_{i=1}^n ∑_{j=1}^k −δ_j(z_i) log w_j = min_w ∑_{j=1}^k −α_j log w_j, where α_j = |C_j|/n.

That is, minimize the cross-entropy: min_w H×(α : w) ⇒ w = α.

3. Go to 1 until (local) convergence is met.


Page 21: k-MLE: A fast algorithm for learning statistical mixture models

k-MLE-EFMM algorithm [4]

- 0. Initialization: ∀i ∈ {1, ..., k}, let w_i = 1/k and η_i = t(x_i) (initialization is further discussed later on).
- 1. Assignment: ∀i ∈ {1, ..., n}, z_i = argmin_{j=1}^k B_{F*}(t(x_i) : η_j) − log w_j. Let C_i = {x_j | z_j = i}, ∀i ∈ {1, ..., k}, be the cluster partition: X = ∪_{i=1}^k C_i.
- 2. Update the η-parameters: ∀i ∈ {1, ..., k}, η_i = (1/|C_i|) ∑_{x∈C_i} t(x). Go to step 1 unless local convergence of the complete likelihood is reached.
- 3. Update the mixture weights: ∀i ∈ {1, ..., k}, w_i = |C_i|/n. Go to step 1 unless local convergence of the complete likelihood is reached. (A code sketch follows below.)
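Putting the steps together, a minimal NumPy sketch (interface ours; F_star and grad_F_star are assumed to act row-wise; for simplicity the η and w updates are interleaved in a single loop, i.e., the hard-EM-style variant discussed two slides below, and empty clusters are not guarded against):

```python
import numpy as np

def k_mle(x, k, t, F_star, grad_F_star, iters=100, rng=None):
    """k-MLE for an EFMM, run in the sufficient-statistic space y = t(x)."""
    rng = np.random.default_rng() if rng is None else rng
    y = t(x)
    n = len(y)
    eta = y[rng.choice(n, k, replace=False)]            # 0. eta_i = t(x_i)
    w = np.full(k, 1.0 / k)                             # 0. w_i = 1/k
    for _ in range(iters):
        # 1. Assignment: z_i = argmin_j B_{F*}(y_i : eta_j) - log w_j
        G = grad_F_star(eta)
        cost = (F_star(y)[:, None] - F_star(eta)[None, :]
                - y @ G.T + np.einsum('kd,kd->k', eta, G)[None, :]
                - np.log(w)[None, :])
        z = cost.argmin(axis=1)
        # 2. eta_j = centroid of the sufficient statistics of cluster j
        new_eta = np.array([y[z == j].mean(axis=0) for j in range(k)])
        # 3. w_j = |C_j| / n
        w = np.bincount(z, minlength=k) / n
        if np.allclose(new_eta, eta):                   # local convergence
            break
        eta = new_eta
    return eta, w, z
```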


Page 22: k-MLE: A fast algorithm for learning statistical mixture models

k-MLE initialization

I Forgy’s random seed (d = D),

I Bregman k-means (for F ∗ on Y, and MLE on each cluster).

Usually D > d (eg., multivariate Gaussians D = d(d+3)2 )

I Compute global MLE η = 1n

∑ni=1 t(xi)

(well-defined for n ≥ D → θ ∈ Θ)

I Consider restricted exponential family for Fθ(d+1...D)(θ(1...d)),

then set η(1...d)i = t(1...d)(xi) and η

(d+1...D)i = η(d+1...D).

(e.g., we fix global covariance matrix, and let µi = xi forGaussians)

I Improve initialization by applying Bregman k-means++ [1] forthe convex conjugate of F

θ(d+1...D)(θ(1...d))

k-MLE++ based on Bregman k-means++
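A minimal sketch of Bregman k-means++ seeding, on which k-MLE++ builds (interface ours; div(Y, c) is assumed to return the divergence from every row of Y to a candidate seed c):

```python
import numpy as np

def bregman_kmeanspp_seeds(y, k, div, rng=None):
    """Pick k seeds; each new seed is drawn with probability proportional
    to its Bregman divergence to the closest seed chosen so far."""
    rng = np.random.default_rng() if rng is None else rng
    seeds = [y[rng.integers(len(y))]]
    for _ in range(k - 1):
        d = np.min([div(y, c) for c in seeds], axis=0)
        seeds.append(y[rng.choice(len(y), p=d / d.sum())])
    return np.array(seeds)
```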


Page 23: k-MLE: A fast algorithm for learning statistical mixture models

k-MLE variants using any Bregman k-means heuristic

- Any k-means optimization heuristic allows one to update the mixture η-parameters:
  - Hartigan & Wang's greedy swap (after Lloyd convergence),
  - Kanungo et al.'s swap ((9 + ε)-approximation).
- Updating the mixture η and w parameters successively yields a hard EM variant (easily implemented by winner-take-all EM weight membership).


Page 24: k-MLE: A fast algorithm for learning statistical mixture models

k-MLE for MVNs with the (µ,Σ) parameters

- 0. Initialization:
  - Calculate the global mean µ̂ and global covariance matrix Σ̂: µ̂ = (1/n) ∑_{i=1}^n x_i, Σ̂ = (1/n) ∑_{i=1}^n x_i x_i^T − µ̂ µ̂^T.
  - ∀i ∈ {1, ..., k}, initialize the i-th seed as (µ_i = x_i, Σ_i = Σ̂).
- 1. Assignment: ∀i ∈ {1, ..., n}, z_i = argmin_{j=1}^k M_{Σ_j^{-1}}(x_i, µ_j) + log |Σ_j| − 2 log w_j, with the squared Mahalanobis distance M_Q(x, y) = (x − y)^T Q (x − y). Let C_i = {x_j | z_j = i}, ∀i ∈ {1, ..., k}, be the cluster partition: X = ∪_{i=1}^k C_i.
- 2. Update the parameters: ∀i ∈ {1, ..., k}, µ_i = (1/|C_i|) ∑_{x∈C_i} x, Σ_i = (1/|C_i|) ∑_{x∈C_i} x x^T − µ_i µ_i^T. Go to step 1 unless local convergence.
- 3. Update the mixture weights: ∀i ∈ {1, ..., k}, w_i = |C_i|/n. Go to step 1 unless local convergence.
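A sketch of the step-1 assignment cost in NumPy (helper name ours; for numerical robustness one would use slogdet and Cholesky solves instead of inv/det):

```python
import numpy as np

def mvn_kmle_cost(x, mus, sigmas, w):
    """cost[i, j] = M_{Sigma_j^{-1}}(x_i, mu_j) + log|Sigma_j| - 2 log w_j."""
    n, k = len(x), len(w)
    cost = np.empty((n, k))
    for j in range(k):
        d = x - mus[j]
        maha = np.einsum('nd,nd->n', d @ np.linalg.inv(sigmas[j]), d)
        cost[:, j] = maha + np.log(np.linalg.det(sigmas[j])) - 2.0 * np.log(w[j])
    return cost   # assignments: z = cost.argmin(axis=1)
```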


Page 25: k-MLE: A fast algorithm for learning statistical mixture models

Summary of contributions

- Hard k-MLE versus soft EM:
  - k-MLE locally maximizes the complete likelihood,
  - EM locally maximizes the incomplete likelihood.
- The component parameter η update can be implemented using any Bregman k-means heuristic on the conjugate F*.
- Initialization can be performed using k-MLE++.
- Indivisibility: robustness when identifying statistical mixture models? Which k? ∀k ∈ N, N(µ, σ^2) = ∑_{i=1}^k N(µ/k, σ^2/k).

Simplifying mixtures from kernel density estimators is one fine-to-coarse solution. See: "Model centroids for the simplification of kernel density estimators", ICASSP 2012, March 29th.


Page 26: k-MLE: A fast algorithm for learning statistical mixture models

References

[1] Marcel R. Ackermann and Johannes Blömer. Bregman clustering for separable instances. In Scandinavian Workshop on Algorithm Theory (SWAT), pages 212-223, 2010.

[2] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705-1749, 2005.

[3] Vincent Garcia and Frank Nielsen. Simplification and hierarchical representations of mixtures of exponential families. Signal Processing (Elsevier), 90(12):3197-3212, 2010.

[4] Frank Nielsen. k-MLE: A fast algorithm for learning statistical mixture models. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2012. Preliminary technical report on arXiv.

[5] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards, 2009. arXiv:0911.4863.

[6] Frank Nielsen and Richard Nock. Entropies and cross-entropies of exponential families. In International Conference on Image Processing (ICIP), pages 3621-3624, 2010.

[7] Jose Seabra, Francesco Ciompi, Oriol Pujol, Josepa Mauri, Petia Radeva, and Joao Sanchez. Rayleigh mixture model for plaque characterization in intravascular ultrasound. IEEE Transactions on Biomedical Engineering, 58(5):1314-1324, 2011.

[8] Shijun Wang and Rong Jin. An information geometry approach for distance metric learning. Journal of Machine Learning Research, 5:591-598, 2009.

Page 29: k-MLE: A fast algorithm for learning statistical mixture models

Anisotropic Voronoi diagram (for MVN MMs)

From the source color image (a), we build a 5D GMM with k = 32 components, and color each pixel with the mean color of the anisotropic Voronoi cell it belongs to (b).

Speed up the assignment step using Bregman ball trees or Bregman vantage point trees.


Page 30: k-MLE: A fast algorithm for learning statistical mixture models

Expectation-maximization (EM) for EFMMs [2]

EM monotonically increases the expected complete likelihood (marginalizing over the hidden labels):

∑_{i=1}^n ∑_{j=1}^k p(z_j | x_i, θ) log p(x_i, z_j | θ)

Banerjee et al. [2] proved it amounts to a Bregman soft clustering.


Page 31: k-MLE: A fast algorithm for learning statistical mixture models

Comparisons: k-MLE vs. EM for EFMMs

                k-MLE/Hard EM                  Soft EM (1977)
                = Bregman hard clustering      = Bregman soft clustering
Memory          lighter                        heavier (W matrix)
Speed           lighter (VP-tree)              heavier (all weights w_ij)
Convergence     always, in finitely many steps ∞, needs a stopping criterion
Initialization  k-MLE++                        k-means(++)


