A new implementation of k-MLE for mixture modelling of Wishart distributions
Page 1: A new implementation of k-MLE for mixture modelling of Wishart distributions

A new implementation of k-MLE for mixture modelling of Wishart distributions

Christophe Saint-Jean Frank Nielsen

Geometric Science of Information 2013

August 28, 2013 - Mines ParisTech

Page 2: A new implementation of k-MLE for mixture modelling of Wishart distributions

Application Context (1)


We are interested in clustering varying-length sets of multivariate observations of the same dimension p.

$$X_1 = \begin{pmatrix} 3.6 & 0.05 & -4. \\ 3.6 & 0.05 & -4. \\ 3.6 & 0.05 & -4. \end{pmatrix}, \quad \dots, \quad X_N = \begin{pmatrix} 5.3 & -0.5 & 2.5 \\ 3.6 & 0.5 & 3.5 \\ 1.6 & -0.5 & 4.6 \\ -1.6 & 0.5 & 5.1 \\ -2.9 & -0.5 & 6.1 \end{pmatrix}$$

The sample mean is a good feature but not discriminative enough.

Second-order cross-product matrices $^tX_i X_i$ may capture some relations between the (column) variables.

Page 3: A new implementation of k-MLE for mixture modelling of Wishart distributions

Application Context (2)


The problem is now the clustering of a set of $p \times p$ PSD matrices:

$$\chi = \left\{ x_1 = {}^tX_1X_1,\; x_2 = {}^tX_2X_2,\; \dots,\; x_N = {}^tX_NX_N \right\}$$

Examples of applications: multispectral/DTI/radar imaging, motion retrieval systems, ...


Page 5: A new implementation of k-MLE for mixture modelling of Wishart distributions

Outline of this talk


1. MLE and Wishart Distribution
   - Exponential Family and Maximum Likelihood Estimate
   - Wishart Distribution
   - Two sub-families of the Wishart Distribution

2. Mixture modeling with k-MLE
   - Original k-MLE
   - k-MLE for Wishart distributions
   - Heuristics for the initialization

3. Application to motion retrieval

Page 6: A new implementation of k-MLE for mixture modelling of Wishart distributions

Reminder : Exponential Family (EF)


An exponential family is a set of parametric probability distributions

$$EF = \left\{ p(x; \lambda) = p_F(x; \theta) = \exp\left\{ \langle t(x), \theta \rangle + k(x) - F(\theta) \right\} \;\middle|\; \theta \in \Theta \right\}$$

Terminology:

- $\lambda$: source parameters.
- $\theta$: natural parameters.
- $t(x)$: sufficient statistic.
- $k(x)$: auxiliary carrier measure.
- $F(\theta)$: the log-normalizer, differentiable and strictly convex.
- $\Theta = \{\theta \in \mathbb{R}^D \mid F(\theta) < \infty\}$ is an open convex set.

Almost all commonly used distributions are EF members, except the uniform and Cauchy distributions.

Page 7: A new implementation of k-MLE for mixture modelling of Wishart distributions

Reminder: Maximum Likelihood Estimate (MLE)


The Maximum Likelihood Estimate principle is a very common approach for fitting the parameters of a distribution:

$$\hat\theta = \operatorname*{argmax}_\theta L(\theta; \chi) = \operatorname*{argmax}_\theta \prod_{i=1}^N p(x_i; \theta) = \operatorname*{argmin}_\theta -\frac{1}{N} \sum_{i=1}^N \log p(x_i; \theta)$$

assuming a sample $\chi = \{x_1, x_2, \dots, x_N\}$ of i.i.d. observations.

The log-density has a convenient expression for EF members:

$$\log p_F(x; \theta) = \langle t(x), \theta \rangle + k(x) - F(\theta)$$

It follows that

$$\hat\theta = \operatorname*{argmax}_\theta \sum_{i=1}^N \log p_F(x_i; \theta) = \operatorname*{argmax}_\theta \left( \left\langle \sum_{i=1}^N t(x_i), \theta \right\rangle - N F(\theta) \right)$$

Page 8: A new implementation of k-MLE for mixture modelling of Wishart distributions

MLE with EF


Since F is a strictly convex, differentiable function, the MLE exists and is unique:

$$\nabla F(\hat\theta) = \frac{1}{N} \sum_{i=1}^N t(x_i)$$

Ideally, we have a closed form:

$$\hat\theta = (\nabla F)^{-1}\left( \frac{1}{N} \sum_{i=1}^N t(x_i) \right)$$

Numerical methods, including Newton-Raphson, can be successfully applied.
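To make the moment-matching step concrete, here is a minimal sketch for a one-dimensional EF (the exponential distribution, whose $\nabla F$ is known in closed form). The helper names are illustrative, not part of the original talk.

```python
import numpy as np
from scipy.optimize import brentq

# Illustrative 1-D EF: exponential distribution p(x; theta) = exp(theta*x - F(theta))
# with natural parameter theta < 0, F(theta) = -log(-theta), t(x) = x,
# hence grad F(theta) = -1/theta.

def grad_F(theta):
    return -1.0 / theta

def mle_natural_param(samples):
    """Moment matching: solve grad F(theta) = (1/N) sum t(x_i) by root finding."""
    target = np.mean(samples)  # mean of the sufficient statistic t(x) = x
    return brentq(lambda th: grad_F(th) - target, -1e6, -1e-9)

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)  # true theta = -1/scale = -0.5
print(mle_natural_param(x))                  # close to -0.5
```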

Page 9: A new implementation of k-MLE for mixture modelling of Wishart distributions

Wishart Distribution


Definition (Central Wishart distribution)

The Wishart distribution characterizes empirical covariance matrices of zero-mean Gaussian samples:

$$W_d(X; n, S) = \frac{|X|^{\frac{n-d-1}{2}} \exp\left\{ -\frac{1}{2} \mathrm{tr}(S^{-1}X) \right\}}{2^{\frac{nd}{2}} \, |S|^{\frac{n}{2}} \, \Gamma_d\left(\frac{n}{2}\right)}$$

where, for $x > 0$, $\Gamma_d(x) = \pi^{\frac{d(d-1)}{4}} \prod_{j=1}^d \Gamma\left(x - \frac{j-1}{2}\right)$ is the multivariate gamma function.

Remarks: $n > d - 1$, $E[X] = nS$.

It is the multivariate generalization of the chi-square distribution.

Page 10: A new implementation of k-MLE for mixture modelling of Wishart distributions

Wishart Distribution as an EF


It’s an exponential family:

$$\log W_d(X; \theta_n, \theta_S) = \langle \theta_n, \log|X| \rangle_{\mathbb{R}} + \left\langle \theta_S, -\frac{1}{2}X \right\rangle_{HS} + k(X) - F(\theta_n, \theta_S)$$

with $k(X) = 0$ and

$$(\theta_n, \theta_S) = \left( \frac{n-d-1}{2},\; S^{-1} \right), \qquad t(X) = \left( \log|X|,\; -\frac{1}{2}X \right),$$

$$F(\theta_n, \theta_S) = \left( \theta_n + \frac{d+1}{2} \right) \left( d \log 2 - \log|\theta_S| \right) + \log \Gamma_d\left( \theta_n + \frac{d+1}{2} \right)$$
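As a quick numerical check of this log-normalizer, here is a small sketch assuming SciPy's `multigammaln` for $\log \Gamma_d$; it is an illustration of the formula above, not the authors' code.

```python
import numpy as np
from scipy.special import multigammaln

def wishart_log_normalizer(theta_n, theta_S, d):
    """F(theta_n, theta_S) for the Wishart EF, as defined above.

    theta_n : scalar, equal to (n - d - 1) / 2
    theta_S : (d, d) symmetric positive definite matrix, equal to S^{-1}
    """
    a = theta_n + (d + 1) / 2.0                    # equals n / 2
    sign, logdet = np.linalg.slogdet(theta_S)
    assert sign > 0, "theta_S must be positive definite"
    return a * (d * np.log(2.0) - logdet) + multigammaln(a, d)
```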

Page 11: A new implementation of k-MLE for mixture modelling of Wishart distributions

MLE for Wishart Distribution


In the case of the Wishart distribution, a closed form would be obtained by solving the following system:

$$\hat\theta = (\nabla F)^{-1}\left( \frac{1}{N} \sum_{i=1}^N t(x_i) \right) \equiv \begin{cases} d \log 2 - \log|\theta_S| + \Psi_d\left( \theta_n + \frac{d+1}{2} \right) = \eta_n \\[4pt] -\left( \theta_n + \frac{d+1}{2} \right) \theta_S^{-1} = \eta_S \end{cases} \tag{1}$$

with $\eta_n$ and $\eta_S$ the expectation parameters and $\Psi_d$ the derivative of $\log \Gamma_d$. Unfortunately, no closed-form solution is known.

Page 12: A new implementation of k-MLE for mixture modelling of Wishart distributions

Two sub-families of the Wishart Distribution (1)


Case $n$ fixed ($n = 2\theta_n + d + 1$):

$$F_n(\theta_S) = \frac{nd}{2} \log 2 - \frac{n}{2} \log|\theta_S| + \log \Gamma_d\left( \frac{n}{2} \right), \qquad k_n(X) = \frac{n-d-1}{2} \log|X|$$

Case $S$ fixed ($S = \theta_S^{-1}$):

$$F_S(\theta_n) = \left( \theta_n + \frac{d+1}{2} \right) \log|2S| + \log \Gamma_d\left( \theta_n + \frac{d+1}{2} \right), \qquad k_S(X) = -\frac{1}{2} \mathrm{tr}(S^{-1} X)$$

Page 13: A new implementation of k-MLE for mixture modelling of Wishart distributions

Two sub-families of the Wishart Distribution (2)


Both are exponential families, and their MLE equations are solvable!

Case $n$ fixed:

$$-\frac{n}{2} \theta_S^{-1} = \frac{1}{N} \sum_{i=1}^N -\frac{1}{2} X_i \implies \hat\theta_S = N n \left( \sum_{i=1}^N X_i \right)^{-1} \tag{2}$$

Case $S$ fixed:

$$\hat\theta_n = \Psi_d^{-1}\left( \frac{1}{N} \sum_{i=1}^N \log|X_i| - \log|2S| \right) - \frac{d+1}{2}, \qquad \hat\theta_n > 0 \tag{3}$$

with $\Psi_d^{-1}$ the functional reciprocal of $\Psi_d$.

Page 14: A new implementation of k-MLE for mixture modelling of Wishart distributions

An iterative estimator for the Wishart Distribution


Algorithm 1: An estimator for the parameters of the Wishart

Input: A sample $X_1, X_2, \dots, X_N$ of $S_d^{++}$
Output: Final values of $\hat\theta_n$ and $\hat\theta_S$

Initialize $\hat\theta_n$ with some value $> 0$;
repeat
    Update $\hat\theta_S$ using Eq. (2) with $n = 2\hat\theta_n + d + 1$;
    Update $\hat\theta_n$ using Eq. (3) with $S$ the inverse matrix of $\hat\theta_S$;
until convergence of the likelihood;
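A minimal Python sketch of Algorithm 1 under stated assumptions: $\Psi_d$ is the derivative of $\log \Gamma_d$, and its inverse is obtained by root finding. Stopping on the change of $\theta_n$ (rather than on the likelihood, as in the algorithm above) and all names are simplifications of our own.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def psi_d(x, d):
    """Derivative of log Gamma_d: Psi_d(x) = sum_j digamma(x - (j-1)/2)."""
    j = np.arange(1, d + 1)
    return digamma(x - (j - 1) / 2.0).sum()

def estimate_wishart(X, n_iter=100, tol=1e-8):
    """Alternate the two sub-family MLEs (Eqs. 2 and 3), as in Algorithm 1.

    X : array of shape (N, d, d), a sample of SPD matrices.
    Returns (theta_n, theta_S).
    """
    N, d, _ = X.shape
    sum_X = X.sum(axis=0)
    mean_logdet = np.mean([np.linalg.slogdet(Xi)[1] for Xi in X])
    theta_n, prev = 1.0, np.inf                     # any initial value > 0
    for _ in range(n_iter):
        n = 2 * theta_n + d + 1
        theta_S = N * n * np.linalg.inv(sum_X)      # Eq. (2), n fixed
        S = np.linalg.inv(theta_S)
        target = mean_logdet - np.linalg.slogdet(2 * S)[1]
        # Eq. (3): invert Psi_d by root finding on its domain a > (d-1)/2
        a = brentq(lambda t: psi_d(t, d) - target, (d - 1) / 2.0 + 1e-6, 1e6)
        theta_n = max(a - (d + 1) / 2.0, 1e-6)      # enforce theta_n > 0
        if abs(theta_n - prev) < tol:
            break
        prev = theta_n
    return theta_n, theta_S
```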

Page 15: A new implementation of k-MLE for mixture modelling of Wishart distributions

Questions and open problems


From a sample of Wishart matrices, the distribution parameters are recovered in a few iterations.

Major question: do we have an MLE? Probably...

Minor question: what about sample size N = 1?

- Under-determined system
- Regularization by sampling around $X_1$

Page 16: A new implementation of k-MLE for mixture modelling of Wishart distributions

Mixture Models (MM)


An additive (finite) mixture is a flexible tool to model a more complex distribution m:

$$m(x) = \sum_{j=1}^k w_j p_j(x), \qquad 0 \le w_j \le 1, \qquad \sum_{j=1}^k w_j = 1$$

where the $p_j$ are the component distributions of the mixture and the $w_j$ the mixing proportions.

In our case, we consider each $p_j$ as a member of some parametric family (EF):

$$m(x; \Psi) = \sum_{j=1}^k w_j p_{F_j}(x; \theta_j)$$

with $\Psi = (w_1, w_2, \dots, w_{k-1}, \theta_1, \theta_2, \dots, \theta_k)$.

Expectation-Maximization is not fast enough [5] ...

Page 17: A new implementation of k-MLE for mixture modelling of Wishart distributions

Original k-MLE (primal form.) in one slide


Algorithm 2: k-MLE

Input: A sample $\chi = \{x_1, x_2, \dots, x_N\}$; Bregman generators $F_1, F_2, \dots, F_k$
Output: Estimate $\hat\Psi$ of the mixture parameters

A good initialization for $\hat\Psi$ (see later);
repeat
    repeat
        foreach $x_i \in \chi$ do $z_i = \operatorname{argmax}_j \log w_j p_{F_j}(x_i; \theta_j)$;
        foreach $C_j := \{x_i \in \chi \mid z_i = j\}$ do $\hat\theta_j = MLE_{F_j}(C_j)$;
    until convergence of the complete likelihood;
    Update mixing proportions: $w_j = |C_j| / N$;
until further convergence of the complete likelihood;
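The loop structure might look as follows in Python. This is a hedged sketch for a single shared EF: `log_pdf` and `mle` are user-supplied callables (e.g., the Wishart log-density and the estimator of Algorithm 1), not functions from the talk.

```python
import numpy as np

def k_mle(X, thetas, log_pdf, mle, max_outer=50, max_inner=100):
    """Lloyd-style k-MLE (Algorithm 2): alternate assignment and per-cluster MLE.

    X       : (N, ...) array of observations
    thetas  : list of k initial component parameters
    log_pdf : callable, log_pdf(x, theta) -> log p_F(x; theta)
    mle     : callable, mle(cluster) -> theta estimate for that cluster
    """
    k = len(thetas)
    w = np.full(k, 1.0 / k)
    z = np.zeros(len(X), dtype=int)
    for _ in range(max_outer):
        for _ in range(max_inner):
            # Assignment: maximize log w_j + log p_{F_j}(x_i; theta_j) per point
            scores = np.array([[np.log(w[j]) + log_pdf(x, thetas[j])
                                for j in range(k)] for x in X])
            new_z = scores.argmax(axis=1)
            if np.array_equal(new_z, z):
                break
            z = new_z
            # Per-cluster MLE (keep the old theta if a cluster went empty)
            thetas = [mle(X[z == j]) if np.any(z == j) else thetas[j]
                      for j in range(k)]
        new_w = np.array([(z == j).mean() for j in range(k)])
        if np.allclose(new_w, w):
            break
        w = new_w
    return w, thetas, z
```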

Page 18: A new implementation of k-MLE for mixture modelling of Wishart distributions

k-MLE’s properties


Another formulation comes from the connection between EF and Bregman divergences [3]:

$$\log p_F(x; \theta) = -B_{F^*}(t(x) : \eta) + F^*(t(x)) + k(x)$$

The Bregman divergence $B_F(\cdot : \cdot)$ associated with a strictly convex and differentiable function $F$ is $B_F(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q) \rangle$.

Page 19: A new implementation of k-MLE for mixture modelling of Wishart distributions

Original k-MLE (dual form.) in one slide


Algorithm 3: k-MLE (dual formulation)

Input: A sample $\chi = \{y_1 = t(x_1), y_2 = t(x_2), \dots, y_N = t(x_N)\}$; Bregman generators $F_1^*, F_2^*, \dots, F_k^*$
Output: $\hat\Psi = (w_1, w_2, \dots, w_{k-1}, \theta_1 = \nabla F^*(\eta_1), \dots, \theta_k = \nabla F^*(\eta_k))$

A good initialization for $\hat\Psi$ (see later);
repeat
    repeat
        foreach $x_i \in \chi$ do $z_i = \operatorname{argmin}_j \left[ B_{F_j^*}(y_i : \eta_j) - \log w_j \right]$;
        foreach $C_j := \{x_i \in \chi \mid z_i = j\}$ do $\hat\eta_j = \sum_{x_i \in C_j} y_i / |C_j|$;
    until convergence of the complete likelihood;
    Update mixing proportions: $w_j = |C_j| / N$;
until further convergence of the complete likelihood;

Page 20: A new implementation of k-MLE for mixture modelling of Wishart distributions

k-MLE for Wishart distributions


Practical considerations impose modifications of the algorithm:

During the assignment step, empty clusters may appear (high-dimensional data makes this worse).

A possible solution is to consider Hartigan and Wong's strategy [6] instead of Lloyd's strategy:

- Optimally transfer one observation at a time.
- Update the parameters of the involved clusters.
- Stop when no transfer is possible.

This should guarantee non-empty clusters [7], but it does not work when considering weighted clusters...

Fall back to an "old school" criterion: $|C_{z_i}| > 1$.

Experimentally shown to perform better in high dimension than Lloyd's strategy.

Page 21: A new implementation of k-MLE for mixture modelling of Wishart distributions

k-MLE - Hartigan and Wong


Criterion for potential transfer (max formulation):

$$\frac{\log w_{z_i} p_{F_{z_i}}(x_i; \theta_{z_i})}{\log w_{z_i^*} p_{F_{z_i^*}}(x_i; \theta_{z_i^*})} < 1$$

with $z_i^* = \operatorname{argmax}_j \log w_j p_{F_j}(x_i; \theta_j)$.

Update rules:

$$\hat\theta_{z_i} = MLE_{F_j}(C_{z_i} \setminus \{x_i\}), \qquad \hat\theta_{z_i^*} = MLE_{F_j}(C_{z_i^*} \cup \{x_i\})$$

OR

Criterion for potential transfer (min formulation):

$$\frac{B_{F^*}(y_i : \eta_{z_i^*}) - \log w_{z_i^*}}{B_{F^*}(y_i : \eta_{z_i}) - \log w_{z_i}} < 1$$

with $z_i^* = \operatorname{argmin}_j \left( B_{F^*}(y_i : \eta_j) - \log w_j \right)$.

Update rules:

$$\eta_{z_i} = \frac{|C_{z_i}| \, \eta_{z_i} - y_i}{|C_{z_i}| - 1}, \qquad \eta_{z_i^*} = \frac{|C_{z_i^*}| \, \eta_{z_i^*} + y_i}{|C_{z_i^*}| + 1}$$
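In the dual form, a transfer only touches the two clusters involved, since the expectation parameters are cluster means of the $y_i$. A small illustrative sketch of these incremental updates (array names are ours):

```python
import numpy as np

def transfer_point(y_i, eta, counts, src, dst):
    """Hartigan-style transfer (dual update rules above):
    move the point with sufficient statistic y_i from cluster src to cluster dst.

    eta    : (k, D) array of per-cluster expectation parameters (means of t(x))
    counts : (k,) integer array of cluster sizes
    """
    n_src, n_dst = counts[src], counts[dst]
    eta[src] = (n_src * eta[src] - y_i) / (n_src - 1)   # remove y_i from src
    eta[dst] = (n_dst * eta[dst] + y_i) / (n_dst + 1)   # add y_i to dst
    counts[src] -= 1
    counts[dst] += 1
```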

Page 22: A new implementation of k-MLE for mixture modelling of Wishart distributions

Towards a good initialization...

Classical initialization methods: random centers, random partition, furthest point (2-approximation), ...

A better approach is k-means++ [8]: "sampling proportionally to the squared distance to the nearest center."


Fast and greedy approximation: $\Theta(kN)$.

Probabilistic guarantee of a good initialization:

$$OPT_F \le \text{k-means}_F \le O(\log k) \, OPT_F$$

The dual Bregman divergence $B_{F^*}$ may replace the squared distance.

Page 30: A new implementation of k-MLE for mixture modelling of Wishart distributions

Heuristic to avoid fixing k


k-means requires fixing k, the number of clusters, in advance.

We propose on-the-fly cluster creation together with k-MLE++ (inspired by DP-k-means [9]):

"Create a cluster when there exist observations contributing too much to the loss function with the already selected centers."


It may overestimate the number of clusters...

Page 33: A new implementation of k-MLE for mixture modelling of Wishart distributions

Initialization with DP-k-MLE++


Algorithm 4: DP-k-MLE++

Input: A sample $y_1 = t(X_1), \dots, y_N = t(X_N)$; $F$; $\lambda > 0$
Output: $\mathcal{C}$, a subset of $\{y_1, \dots, y_N\}$; $k$, the number of clusters

Choose the first seed $\mathcal{C} = \{y_j\}$, for $j$ uniformly random in $\{1, 2, \dots, N\}$;
repeat
    foreach $y_i$ do compute $p_i = B_{F^*}(y_i : \mathcal{C}) / \sum_{i'=1}^N B_{F^*}(y_{i'} : \mathcal{C})$, where $B_{F^*}(y_i : \mathcal{C}) = \min_{c \in \mathcal{C}} B_{F^*}(y_i : c)$;
    if $\exists\, p_i > \lambda$ then
        Choose the next seed $s$ among $y_1, y_2, \dots, y_N$ with probability $p_i$;
        Add the selected seed to $\mathcal{C}$: $\mathcal{C} = \mathcal{C} \cup \{s\}$;
until all $p_i \le \lambda$;
$k = |\mathcal{C}|$;
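A compact sketch of this seeding in Python, with the dual Bregman divergence passed in as a callable; the parameter names and the degenerate-case guard are our own additions.

```python
import numpy as np

def dp_k_mle_pp(Y, bregman_div, lam, rng=None):
    """DP-k-MLE++ seeding (Algorithm 4): pick seeds until no point's
    normalized contribution p_i exceeds lambda.

    Y           : (N, D) array of sufficient statistics y_i = t(X_i)
    bregman_div : callable, bregman_div(y, c) -> B_{F*}(y : c)
    lam         : threshold lambda > 0
    Returns the list of seed indices (k = len(result)).
    """
    rng = rng or np.random.default_rng()
    N = len(Y)
    seeds = [int(rng.integers(N))]                 # first seed, uniform
    while True:
        # Divergence of each point to its nearest seed
        d = np.array([min(bregman_div(y, Y[s]) for s in seeds) for y in Y])
        if d.sum() == 0.0:                         # every point sits on a seed
            break
        p = d / d.sum()                            # normalized contributions p_i
        if np.all(p <= lam):
            break
        seeds.append(int(rng.choice(N, p=p)))      # sample next seed prop. to p_i
    return seeds
```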

Page 34: A new implementation of k-MLE for mixture modelling of Wishart distributions

Motion capture


Real dataset: motion capture of contemporary dancers (15 sensors in 3D).

Page 35: A new implementation of k-MLE for mixture modelling of Wishart distributions

Application to motion retrieval (1)

Motion capture data can be viewed as matrices $X_i$ with different row sizes but the same column size d.

The idea is to describe each $X_i$ through the parameters $\Psi_i$ of one mixture model.

Remark: the size of each sub-motion is known (and so is its $\theta_n$).

Mixture parameters can be viewed as a sparse representation of the local dynamics in $X_i$.


Page 41: A new implementation of k-MLE for mixture modelling of Wishart distributions

Application to motion retrieval (2)


Comparing two movements amounts to computing a dissimilarity measure between $\Psi_i$ and $\Psi_j$.

Remark 1: with DP-k-MLE++, the two mixtures will probably not have the same number of components.

Remark 2: when both mixtures have a single component, a natural choice is

$$KL(W_d(\cdot; \theta) \,\|\, W_d(\cdot; \theta')) = B_{F^*}(\eta : \eta') = B_F(\theta' : \theta)$$

A closed form is always available!

No closed form exists for the KL divergence between general mixtures.
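Since the single-component case reduces to a Bregman divergence on the natural parameters, the closed form is one line given $F$ and $\nabla F$. A generic sketch (`F` and `grad_F` are supplied by the user, e.g., the Wishart log-normalizer above with parameters flattened into a vector):

```python
import numpy as np

def bregman_divergence(F, grad_F, theta_p, theta_q):
    """B_F(theta_p : theta_q) = F(p) - F(q) - <p - q, grad F(q)>.

    For two single-component Wisharts, KL(W(.; theta) || W(.; theta'))
    equals bregman_divergence(F, grad_F, theta_prime, theta), as stated above.
    """
    theta_p, theta_q = np.asarray(theta_p), np.asarray(theta_q)
    return F(theta_p) - F(theta_q) - np.dot(theta_p - theta_q, grad_F(theta_q))
```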

Page 42: A new implementation of k-MLE for mixture modelling of Wishart distributions

Application to motion retrieval (3)


A possible solution is to use the CS divergence [10]:

$$CS(m : m') = -\log \frac{\int m(x)\, m'(x)\, dx}{\sqrt{\int m(x)^2\, dx \int m'(x)^2\, dx}}$$

It has an analytic formula, since

$$\int m(x)\, m'(x)\, dx = \sum_{j=1}^k \sum_{j'=1}^{k'} w_j w'_{j'} \exp\left( F(\theta_j + \theta'_{j'}) - \left( F(\theta_j) + F(\theta'_{j'}) \right) \right)$$

Note that this expression is well defined since the natural parameter space $\Theta = \mathbb{R}_*^+ \times S_p^{++}$ is a convex cone.
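Putting the two formulas together, here is a sketch of the CS divergence between two mixtures of the same EF. It assumes $k(x) = 0$ and a parameter representation on which `+` acts componentwise (e.g., $(\theta_n, \mathrm{vec}(\theta_S))$ concatenated into one vector); `F` is the shared log-normalizer.

```python
import numpy as np

def ef_product_integral(F, thetas, thetas_p, w, w_p):
    """Integral of m(x) * m'(x) for two EF mixtures sharing log-normalizer F (k(x) = 0)."""
    total = 0.0
    for wj, tj in zip(w, thetas):
        for wq, tq in zip(w_p, thetas_p):
            total += wj * wq * np.exp(F(tj + tq) - (F(tj) + F(tq)))
    return total

def cs_divergence(F, w1, th1, w2, th2):
    """Cauchy-Schwarz divergence between mixtures m and m' (formula above)."""
    cross = ef_product_integral(F, th1, th2, w1, w2)
    n1 = ef_product_integral(F, th1, th1, w1, w1)
    n2 = ef_product_integral(F, th2, th2, w2, w2)
    return -np.log(cross / np.sqrt(n1 * n2))
```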

Page 43: A new implementation of k-MLE for mixture modelling of Wishart distributions

Implementation


Early specific code in Matlab™.

Current implementation in Python (based on pyMEF [2]).

Ongoing proof of concept (with Herranz F., Beurive A.)

Page 44: A new implementation of k-MLE for mixture modelling of Wishart distributions

Conclusions - Future works


Still some mathematical work to be done:

- Solve the MLE equations to get $\nabla F^* = (\nabla F)^{-1}$, then $F^*$.
- Characterize our estimator for the full Wishart distribution.

Complete and validate the prototype system for motion retrieval.

Speed up the algorithm: computational/numerical/algorithmic tricks.

A library for Bregman divergence learning?

Possible extensions:

Reintroduce the mean vector in the model: Gaussian-Wishart.
Online k-means -> online k-MLE ...

Page 45: A new implementation of k-MLE for mixture modelling of Wishart distributions

References I


Nielsen, F.: k-MLE: A fast algorithm for learning statistical mixture models. In: International Conference on Acoustics, Speech and Signal Processing (2012) pp. 869–872

Schwander, O., Nielsen, F.: pyMEF - A framework for Exponential Families in Python. In: Proceedings of the 2011 IEEE Workshop on Statistical Signal Processing

Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6 (2005) 1705–1749

Nielsen, F., Garcia, V.: Statistical exponential families: A digest with flash cards. http://arxiv.org/abs/0911.4863 (November 2009)

Hidot, S., Saint-Jean, C.: An Expectation-Maximization algorithm for the Wishart mixture model: Application to movement clustering. Pattern Recognition Letters 31(14) (2010) 2318–2324

Page 46: A new implementation of k-MLE for mixture modelling of Wishart distributions

References II


Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1) (1979) 100–108

Telgarsky, M., Vattani, A.: Hartigan's method: k-means clustering without Voronoi. In: Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS) (2010) pp. 820–827

Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2007) pp. 1027–1035

Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics. In: International Conference on Machine Learning (ICML) (2012)

Nielsen, F.: Closed-form information-theoretic divergences for statistical mixtures. In: International Conference on Pattern Recognition (ICPR) (2012) pp. 1723–1726

