A new implementation of k-MLE for mixture modelling of Wishart distributions
Christophe Saint-Jean Frank Nielsen
Geometric Science of Information 2013
August 28, 2013 - Mines ParisTech
Application Context (1)
2/31
We are interested in clustering varying-length sets of multivariate observations of the same dimension p.
X1 =
  [ 3.6  0.05  −4. ]
  [ 3.6  0.05  −4. ]
  [ 3.6  0.05  −4. ]
, . . . , XN =
  [  5.3  −0.5  2.5 ]
  [  3.6   0.5  3.5 ]
  [  1.6  −0.5  4.6 ]
  [ −1.6   0.5  5.1 ]
  [ −2.9  −0.5  6.1 ]
The sample mean is a good feature, but not discriminative enough.
Second-order cross-product matrices tXiXi may capture some relations between the (column) variables.
Application Context (2)
3/31
The problem is now the clustering of a set of p × p PSD matrices:

χ = { x1 = tX1X1, x2 = tX2X2, . . . , xN = tXNXN }
Examples of applications: multispectral/DTI/radar imaging, motion retrieval systems, ...
Outline of this talk
4/31
1 MLE and Wishart Distribution
  Exponential Family and Maximum Likelihood Estimate
  Wishart Distribution
  Two sub-families of the Wishart Distribution
2 Mixture modeling with k-MLE
  Original k-MLE
  k-MLE for Wishart distributions
  Heuristics for the initialization
3 Application to motion retrieval
Reminder : Exponential Family (EF)
5/31
An exponential family is a set of parametric probability distributions
EF = { p(x; λ) = pF(x; θ) = exp{⟨t(x), θ⟩ + k(x) − F(θ)} | θ ∈ Θ }
Terminology:
λ source parameters.
θ natural parameters.
t(x) sufficient statistic.
k(x) auxiliary carrier measure.
F(θ) the log-normalizer: differentiable, strictly convex.
Θ = {θ ∈ R^D | F(θ) < ∞} is an open convex set.
Almost all commonly used distributions are EF members; exceptions include the uniform and Cauchy distributions.
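As a concrete illustration (our addition, not from the talk): the exponential distribution fits the EF template with θ = −λ, t(x) = x, k(x) = 0 and F(θ) = −log(−θ), which a few lines of Python can check numerically:

```python
import math

# Exponential distribution Exp(lam) written as an exponential family member:
#   theta = -lam (natural parameter), t(x) = x, k(x) = 0,
#   F(theta) = -log(-theta) (log-normalizer), Theta = (-inf, 0).
def pdf_source(x, lam):
    return lam * math.exp(-lam * x)

def pdf_ef(x, theta):
    F = -math.log(-theta)           # log-normalizer
    return math.exp(x * theta - F)  # exp(<t(x), theta> + k(x) - F(theta))

lam = 2.0
for x in (0.1, 0.5, 1.7):
    assert abs(pdf_source(x, lam) - pdf_ef(x, -lam)) < 1e-12
```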
Reminder: Maximum Likelihood Estimate (MLE)
6/31
The Maximum Likelihood Estimate principle is a very common approach for fitting the parameters of a distribution:
θ̂ = argmax_θ L(θ; χ) = argmax_θ ∏_{i=1}^{N} p(xᵢ; θ) = argmin_θ −(1/N) ∑_{i=1}^{N} log p(xᵢ; θ)
assuming a sample χ = {x1, x2, ..., xN} of i.i.d. observations.
The log-density has a convenient expression for EF members:
log pF (x ; θ) = 〈t(x), θ〉+ k(x)− F (θ)
It follows that

θ̂ = argmax_θ ∑_{i=1}^{N} log pF(xᵢ; θ) = argmax_θ ( ⟨ ∑_{i=1}^{N} t(xᵢ), θ ⟩ − N F(θ) )
MLE with EF
7/31
Since F is a strictly convex, differentiable function, the MLE exists and is unique:

∇F(θ̂) = (1/N) ∑_{i=1}^{N} t(xᵢ)
Ideally, we have a closed form:

θ̂ = ∇F⁻¹( (1/N) ∑_{i=1}^{N} t(xᵢ) )
Otherwise, numerical methods such as Newton-Raphson can be successfully applied.
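A minimal numerical illustration (ours), again on the exponential distribution where ∇F(θ) = −1/θ: the MLE is just moment matching on the average sufficient statistic, here with a closed-form ∇F⁻¹:

```python
import random

# MLE in an exponential family reduces to moment matching:
#   grad F(theta_hat) = (1/N) * sum_i t(x_i).
# For the exponential distribution F(theta) = -log(-theta), so
# grad F(theta) = -1/theta and the closed form is theta_hat = -1/mean(x).
random.seed(0)
lam_true = 3.0
xs = [random.expovariate(lam_true) for _ in range(100_000)]
eta = sum(xs) / len(xs)    # average sufficient statistic (t(x) = x)
theta_hat = -1.0 / eta     # grad-F inverse applied to eta
lam_hat = -theta_hat
assert abs(lam_hat - lam_true) < 0.1
```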
Wishart Distribution
8/31
Definition (Central Wishart distribution)
The Wishart distribution characterizes empirical covariance matrices of zero-mean Gaussian samples:

Wd(X; n, S) = |X|^{(n−d−1)/2} exp{ −(1/2) tr(S⁻¹X) } / ( 2^{nd/2} |S|^{n/2} Γd(n/2) )

where, for x > 0, Γd(x) = π^{d(d−1)/4} ∏_{j=1}^{d} Γ( x − (j−1)/2 ) is the multivariate gamma function.
Remarks: n > d − 1, E[X] = nS.
It is the multivariate generalization of the chi-square distribution.
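A quick sanity check (our addition, using SciPy's `scipy.stats.wishart`, which is not referenced in the talk): the empirical mean of Wishart samples approaches E[X] = nS:

```python
import numpy as np
from scipy.stats import wishart

d, n = 3, 7.0
S = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])
rng = np.random.default_rng(0)
# Draw many samples and compare their average with the expectation n * S.
samples = wishart(df=n, scale=S).rvs(size=20_000, random_state=rng)
assert np.allclose(samples.mean(axis=0), n * S, rtol=0.05, atol=0.1)
```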
Wishart Distribution as an EF
9/31
It is an exponential family:

log Wd(X; θn, θS) = ⟨θn, log|X|⟩_R + ⟨θS, −X/2⟩_HS + k(X) − F(θn, θS)

with k(X) = 0 and

(θn, θS) = ( (n − d − 1)/2, S⁻¹ ),   t(X) = ( log|X|, −X/2 ),

F(θn, θS) = ( θn + (d+1)/2 ) ( d log 2 − log|θS| ) + log Γd( θn + (d+1)/2 )
MLE for Wishart Distribution
10/31
In the case of the Wishart distribution, a closed form would be obtained by solving the following system:

θ̂ = ∇F⁻¹( (1/N) ∑_{i=1}^{N} t(xᵢ) )  ≡  { d log 2 − log|θS| + Ψd( θn + (d+1)/2 ) = ηn
                                            −( θn + (d+1)/2 ) θS⁻¹ = ηS                (1)

with ηn and ηS the expectation parameters and Ψd the derivative of log Γd.
Unfortunately, no closed-form solution is known.
Two sub-families of the Wishart Distribution (1)
11/31
Case n fixed (n = 2θn + d + 1):

Fn(θS) = (nd/2) log 2 − (n/2) log|θS| + log Γd(n/2)
kn(X) = ( (n − d − 1)/2 ) log|X|

Case S fixed (S = θS⁻¹):

FS(θn) = ( θn + (d+1)/2 ) log|2S| + log Γd( θn + (d+1)/2 )
kS(X) = −(1/2) tr(S⁻¹X)
Two sub-families of the Wishart Distribution (2)
12/31
Both are exponential families and the MLE equations are solvable!

Case n fixed:

−(n/2) θS⁻¹ = (1/N) ∑_{i=1}^{N} (−Xᵢ/2)   ⟹   θS = N n ( ∑_{i=1}^{N} Xᵢ )⁻¹   (2)

Case S fixed:

θn = Ψd⁻¹( (1/N) ∑_{i=1}^{N} log|Xᵢ| − log|2S| ) − (d+1)/2,   θn > 0   (3)

with Ψd⁻¹ the functional reciprocal of Ψd.
An iterative estimator for the Wishart Distribution
13/31
Algorithm 1: An estimator for the parameters of the Wishart
Input: A sample X1, X2, . . . , XN in S_d^{++}
Output: Final values of θn and θS
Initialize θn with some value > 0;
repeat
    Update θS using Eq. 2 with n = 2θn + d + 1;
    Update θn using Eq. 3 with S the inverse matrix of θS;
until convergence of the likelihood;
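A minimal Python sketch of Algorithm 1 (our reading of it, not the authors' code); Ψd(x) = Σⱼ ψ(x − (j−1)/2) is inverted by bracketed root-finding with `scipy.optimize.brentq`:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq
from scipy.stats import wishart

def psi_d(x, d):
    # Derivative of the log multivariate gamma: a sum of digammas.
    return sum(digamma(x - j / 2.0) for j in range(d))

def fit_wishart(Xs, d, n_iter=50):
    """Alternating estimator: update theta_S for fixed n (Eq. 2),
    then theta_n for fixed S (Eq. 3)."""
    N = len(Xs)
    mean_logdet = np.mean([np.linalg.slogdet(X)[1] for X in Xs])
    sum_X = np.sum(Xs, axis=0)
    theta_n = 1.0  # any positive initial value
    for _ in range(n_iter):
        n = 2 * theta_n + d + 1
        theta_S = N * n * np.linalg.inv(sum_X)           # Eq. (2)
        S = np.linalg.inv(theta_S)
        rhs = mean_logdet - np.linalg.slogdet(2 * S)[1]
        # Invert Psi_d numerically: find a > (d-1)/2 with psi_d(a) = rhs.
        a = brentq(lambda a: psi_d(a, d) - rhs, (d - 1) / 2.0 + 1e-6, 1e6)
        theta_n = a - (d + 1) / 2.0                      # Eq. (3)
    return 2 * theta_n + d + 1, np.linalg.inv(theta_S)   # (n_hat, S_hat)

d, n_true = 2, 6.0
S_true = np.array([[1.0, 0.3], [0.3, 2.0]])
rng = np.random.default_rng(0)
Xs = wishart(df=n_true, scale=S_true).rvs(size=5000, random_state=rng)
n_hat, S_hat = fit_wishart(list(Xs), d)
assert abs(n_hat - n_true) < 0.5
assert np.allclose(S_hat, S_true, rtol=0.1, atol=0.05)
```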
Questions and open problems
14/31
From a sample of Wishart matrices, the distribution parameters are recovered in a few iterations.
Major question: is this an MLE? Probably...
Minor question: what about sample size N = 1?
  Under-determined system
  Regularization by sampling around X1
Mixture Models (MM)
15/31
An additive (finite) mixture is a flexible tool to model a more complex distribution m:

m(x) = ∑_{j=1}^{k} wⱼ pⱼ(x),   0 ≤ wⱼ ≤ 1,   ∑_{j=1}^{k} wⱼ = 1

where the pⱼ are the component distributions of the mixture and the wⱼ the mixing proportions.
In our case, we consider each pⱼ as a member of some parametric family (EF):

m(x; Ψ) = ∑_{j=1}^{k} wⱼ pFⱼ(x; θⱼ)

with Ψ = (w1, w2, ..., w_{k−1}, θ1, θ2, ..., θk).
Expectation-Maximization is not fast enough [5] ...
Original k-MLE (primal form.) in one slide
16/31
Algorithm 2: k-MLE
Input: A sample χ = {x1, x2, ..., xN}; F1, F2, ..., Fk Bregman generators
Output: Estimate Ψ̂ of the mixture parameters
A good initialization for Ψ̂ (see later);
repeat
    repeat
        foreach xᵢ ∈ χ do zᵢ = argmaxⱼ log wⱼ pFⱼ(xᵢ; θⱼ);
        foreach Cⱼ := {xᵢ ∈ χ | zᵢ = j} do θⱼ = MLE_Fⱼ(Cⱼ);
    until convergence of the complete likelihood;
    Update mixing proportions: wⱼ = |Cⱼ|/N
until further convergence of the complete likelihood;
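An illustrative sketch of the k-MLE loop (ours), with unit-variance scalar Gaussian components standing in for the Wishart components of the talk; the structure — assignment by maximum log wⱼ pFⱼ, per-cluster MLE, then weight update — follows Algorithm 2:

```python
import math, random

def k_mle(xs, k, n_outer=20, n_inner=20):
    """Lloyd-style k-MLE sketch with unit-variance Gaussian components
    (a simple stand-in for the Wishart components)."""
    random.seed(1)
    mus = random.sample(xs, k)   # crude initialization
    ws = [1.0 / k] * k
    for _ in range(n_outer):
        for _ in range(n_inner):
            # Assignment step: maximize log w_j + log p(x; theta_j).
            z = [max(range(k),
                     key=lambda j: math.log(ws[j]) - 0.5 * (x - mus[j]) ** 2)
                 for x in xs]
            # Per-cluster MLE (Gaussian mean = cluster average).
            for j in range(k):
                cj = [x for x, zi in zip(xs, z) if zi == j]
                if cj:
                    mus[j] = sum(cj) / len(cj)
        # Update mixing proportions (smoothed to keep weights positive).
        ws = [(z.count(j) + 1) / (len(xs) + k) for j in range(k)]
    return mus, ws

random.seed(0)
xs = ([random.gauss(-4, 1) for _ in range(300)]
      + [random.gauss(4, 1) for _ in range(300)])
mus, ws = k_mle(xs, 2)
mus = sorted(mus)
assert abs(mus[0] + 4) < 0.5 and abs(mus[1] - 4) < 0.5
```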
k-MLE’s properties
17/31
Another formulation comes from the connection between EF and Bregman divergences [3]:
log pF (x ; θ) = −BF∗(t(x) : η) + F ∗(t(x)) + k(x)
The Bregman divergence BF(· : ·) associated with a strictly convex and differentiable function F:

BF(p : q) = F(p) − F(q) − ⟨p − q, ∇F(q)⟩
Original k-MLE (dual form.) in one slide
18/31
Algorithm 3: k-MLE (dual formulation)
Input: A sample χ = {y1 = t(x1), y2 = t(x2), ..., yN = t(xN)}; F*1, F*2, ..., F*k Bregman generators
Output: Ψ̂ = (w1, w2, ..., w_{k−1}, θ1 = ∇F*(η1), ..., θk = ∇F*(ηk))
A good initialization for Ψ̂ (see later);
repeat
    repeat
        foreach xᵢ ∈ χ do zᵢ = argminⱼ [ B_{F*ⱼ}(yᵢ : ηⱼ) − log wⱼ ];
        foreach Cⱼ := {xᵢ ∈ χ | zᵢ = j} do ηⱼ = ∑_{xᵢ∈Cⱼ} yᵢ / |Cⱼ|;
    until convergence of the complete likelihood;
    Update mixing proportions: wⱼ = |Cⱼ|/N
until further convergence of the complete likelihood;
k-MLE for Wishart distributions
19/31
Practical considerations impose modifications of the algorithm:

During the assignment step, empty clusters may appear (high-dimensional data makes this worse).

A possible solution is to consider Hartigan and Wong's strategy [6] instead of Lloyd's strategy:
  Optimally transfer one observation at a time.
  Update the parameters of the clusters involved.
  Stop when no transfer is possible.

This should guarantee non-empty clusters [7], but it does not work when considering weighted clusters...
  Fall back to an "old school" criterion: |Czi| > 1

Experimentally shown to perform better in high dimension than Lloyd's strategy.
k-MLE - Hartigan and Wong
20/31
Criterion for potential transfer (max form):

( log wzi pFzi(xᵢ; θzi) ) / ( log wz*i pFz*i(xᵢ; θz*i) ) < 1

with z*ᵢ = argmaxⱼ log wⱼ pFⱼ(xᵢ; θⱼ)

Update rules:
θzi = MLE_{Fzi}( Czi \ {xᵢ} )
θz*i = MLE_{Fz*i}( Cz*i ∪ {xᵢ} )

OR

Criterion for potential transfer (min form):

( B_{F*}(yᵢ : ηz*i) − log wz*i ) / ( B_{F*}(yᵢ : ηzi) − log wzi ) < 1

with z*ᵢ = argminⱼ ( B_{F*}(yᵢ : ηⱼ) − log wⱼ )

Update rules:
ηzi = ( |Czi| ηzi − yᵢ ) / ( |Czi| − 1 )
ηz*i = ( |Cz*i| ηz*i + yᵢ ) / ( |Cz*i| + 1 )
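The expectation-parameter update rules are O(1) running-average updates, since each ηⱼ is the average of sufficient statistics over its cluster; a quick numerical check (ours):

```python
import numpy as np

# Removing / adding a point updates a cluster's expectation parameter
# (the average of sufficient-statistic vectors) in O(1):
#   eta_without = (|C| * eta - y) / (|C| - 1)
#   eta_with    = (|C| * eta + y) / (|C| + 1)
rng = np.random.default_rng(0)
C = rng.normal(size=(10, 4))   # 10 sufficient-statistic vectors
eta = C.mean(axis=0)

y = C[3]                       # transfer this point out of the cluster
eta_without = (len(C) * eta - y) / (len(C) - 1)
assert np.allclose(eta_without, np.delete(C, 3, axis=0).mean(axis=0))

y_new = rng.normal(size=4)     # transfer a new point into the cluster
eta_with = (len(C) * eta + y_new) / (len(C) + 1)
assert np.allclose(eta_with, np.vstack([C, y_new]).mean(axis=0))
```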
Towards a good initialization...
21/31
Classical initialization methods: random centers, random partition, furthest point (2-approximation), ...
A better approach is k-means++ [8]: "sampling proportionally to the squared distance to the nearest center."
Fast and greedy approximation: Θ(kN).
Probabilistic guarantee of a good initialization:

OPT_F ≤ k-means_F ≤ O(log k) OPT_F

The dual Bregman divergence B_{F*} may replace the squared distance.
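A compact k-means++ seeding sketch (ours): `dist` defaults to the squared distance and may be swapped for a dual Bregman divergence B_{F*}, as noted above:

```python
import random

def kmeans_pp_seeds(xs, k, dist=lambda a, b: (a - b) ** 2):
    """k-means++ seeding: first seed uniform at random, each further seed
    sampled with probability proportional to the (squared) distance to
    the nearest already-chosen seed."""
    seeds = [random.choice(xs)]
    while len(seeds) < k:
        d2 = [min(dist(x, s) for s in seeds) for x in xs]
        if sum(d2) > 0:
            seeds.append(random.choices(xs, weights=d2)[0])
        else:
            seeds.append(random.choice(xs))
    return seeds

random.seed(0)
# Three well-separated 1-D clusters around -5, 0 and 5.
xs = [random.gauss(c, 0.2) for c in (-5, 0, 5) for _ in range(100)]
seeds = kmeans_pp_seeds(xs, 3)
# With high probability the three seeds land in three distinct clusters.
assert len({round(s / 5) for s in seeds}) == 3
```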
Heuristic to avoid to fix k
22/31
k-means requires fixing k, the number of clusters, in advance.
We propose on-the-fly cluster creation together with k-MLE++ (inspired by DP-k-means [9]):
"Create a cluster when there exist observations contributing too much to the loss function with the already selected centers."
It may overestimate the number of clusters...
Initialization with DP-k-MLE++
23/31
Algorithm 4: DP-k-MLE++
Input: A sample y1 = t(X1), . . . , yN = t(XN); F; λ > 0
Output: C, a subset of {y1, . . . , yN}; k, the number of clusters
Choose first seed C = {yⱼ}, for j uniformly random in {1, 2, . . . , N};
repeat
    foreach yᵢ do compute pᵢ = B_{F*}(yᵢ : C) / ∑_{i'=1}^{N} B_{F*}(y_{i'} : C),
        where B_{F*}(yᵢ : C) = min_{c∈C} B_{F*}(yᵢ : c);
    if ∃ pᵢ > λ then
        Choose next seed s among y1, y2, . . . , yN with probability pᵢ;
        Add the selected seed to C: C = C ∪ {s};
until all pᵢ ≤ λ;
k = |C|;
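A rough one-dimensional sketch of DP-k-MLE++ (ours; the generic `dist` stands in for B_{F*}). As the talk notes, the normalized criterion may overestimate the number of clusters, and λ controls that trade-off:

```python
import random

def dp_seeds(ys, lam, dist=lambda a, b: (a - b) ** 2):
    """DP-k-MLE++ sketch: keep adding seeds while some observation still
    carries more than a fraction `lam` of the total loss."""
    seeds = [random.choice(ys)]
    while True:
        d = [min(dist(y, s) for s in seeds) for y in ys]
        total = sum(d)
        if total == 0:           # every point is already a seed
            break
        p = [di / total for di in d]
        if max(p) <= lam:        # no observation contributes too much
            break
        seeds.append(random.choices(ys, weights=p)[0])
    return seeds

random.seed(0)
# Two well-separated clusters around -5 and 5.
ys = [random.gauss(c, 0.1) for c in (-5, 5) for _ in range(10)]
seeds = dp_seeds(ys, lam=0.05)
# The far cluster carries most of the loss, so at least two seeds emerge.
assert 2 <= len(seeds) <= len(ys)
```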
Motion capture
24/31
Real dataset: motion capture of contemporary dancers (15 sensors in 3D).
Application to motion retrieval (1)
25/31
Motion capture data can be viewed as matrices Xi with different row sizes but the same column size d.
The idea is to describe each Xi through the parameters Ψi of one mixture model.
Remark: the size of each sub-motion is known (and so is its θn).
Mixture parameters can be viewed as a sparse representation of the local dynamics in Xi.
Application to motion retrieval(2)
26/31
Comparing two movements amounts to computing a dissimilarity measure between Ψi and Ψj.
Remark 1: with DP-k-MLE++, the two mixtures will probably not have the same number of components.
Remark 2: when both mixtures have one component, a natural choice is

KL( Wd(·; θ) || Wd(·; θ′) ) = B_{F*}(η : η′) = B_F(θ′ : θ)

A closed form is always available!
No closed form exists for the KL divergence between general mixtures.
Application to motion retrieval(3)
27/31
A possible solution is to use the CS divergence [10]:

CS(m : m′) = − log [ ∫ m(x) m′(x) dx / √( ∫ m(x)² dx · ∫ m′(x)² dx ) ]

It has an analytic formula for

∫ m(x) m′(x) dx = ∑_{j=1}^{k} ∑_{j′=1}^{k′} wⱼ w′_{j′} exp( F(θⱼ + θ′_{j′}) − ( F(θⱼ) + F(θ′_{j′}) ) )

Note that this expression is well defined since the natural parameter space Θ = R*₊ × S^p₊₊ is a convex cone.
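The product-integral formula holds for any EF with carrier k(x) = 0; a numerical check (ours) on the exponential distribution, whose natural parameters also form a convex cone:

```python
import math
from scipy.integrate import quad

# For an exponential family with k(x) = 0:
#   integral of p_theta * p_theta' dx
#     = exp(F(theta + theta') - F(theta) - F(theta')).
# Checked here on the exponential distribution, F(theta) = -log(-theta).
F = lambda th: -math.log(-th)
p = lambda x, th: math.exp(x * th - F(th))

th1, th2 = -2.0, -3.0                 # rates lam = 2 and lam' = 3
closed = math.exp(F(th1 + th2) - F(th1) - F(th2))
numeric, _ = quad(lambda x: p(x, th1) * p(x, th2), 0, math.inf)
# By hand: integral of 6 * exp(-5x) over [0, inf) = 6/5 = 1.2.
assert abs(closed - 1.2) < 1e-12 and abs(numeric - closed) < 1e-8
```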
Implementation
28/31
Early specific code in MATLAB.
Current implementation in Python (based on pyMEF [2]).
Ongoing proof of concept (with Herranz F., Beurive A.)
Conclusions - Future works
29/31
Still some mathematical work to be done:
Solve MLE equations to get ∇F ∗ = (∇F )−1 then F ∗
Characterize our estimator for the full Wishart distribution.
Complete and validate the prototype system for motion retrieval.
Speed up the algorithm: computational/numerical/algorithmic tricks.
A library for learning with Bregman divergences?
Possible extensions:
Reintroduce the mean vector in the model: Gaussian-Wishart
Online k-means -> online k-MLE ...
References I
30/31
Nielsen, F.: k-MLE: A fast algorithm for learning statistical mixture models. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012) pp. 869–872

Schwander, O., Nielsen, F.: pyMEF - A framework for Exponential Families in Python. In: Proceedings of the 2011 IEEE Workshop on Statistical Signal Processing

Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6 (2005) 1705–1749

Nielsen, F., Garcia, V.: Statistical exponential families: A digest with flash cards. http://arxiv.org/abs/0911.4863 (11 2009)

Hidot, S., Saint-Jean, C.: An Expectation-Maximization algorithm for the Wishart mixture model: Application to movement clustering. Pattern Recognition Letters 31(14) (2010) 2318–2324
References II
31/31
Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1) (1979) 100–108

Telgarsky, M., Vattani, A.: Hartigan's method: k-means clustering without Voronoi. In: Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS) (2010) pp. 820–827

Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms (2007) pp. 1027–1035

Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics. In: International Conference on Machine Learning (ICML) (2012)

Nielsen, F.: Closed-form information-theoretic divergences for statistical mixtures. In: International Conference on Pattern Recognition (ICPR) (2012) pp. 1723–1726