So Hirai, The University of Tokyo
(Currently NTT DATA Corp.)
Kenji Yamanishi, The University of Tokyo
WITMSE 2012, Amsterdam, The Netherlands
Presented at KDD 2012 on Aug.13.
Contents
・ Problem Setting
・ Significance
・ Proposed Algorithm : Sequential Dynamic Model Selection with NML (normalized maximum likelihood) coding
・ How to compute the NML coding for Gaussian mixtures
・ Experimental Results
・ Marketing Applications
・ Conclusion
Problem Setting (1/2)
Clustering change detection: tracking changes of clustering structures in a sequential setting to detect novelty in data
[Figure: clustering structures along a time axis, with two change points marked]
Ex. Market analysis
The structure of customer groups changes over time
Detect changes in the number of clusters as well as in the cluster assignments
Problem Setting (2/2)
[Figure: customers A to F grouped into clusters at successive times; new customers α and β appear]
Examples of clustering structure changes
Existing customers change their patterns
New customers emerge to form a new group
There exist various types of clustering structures
Related works
・ Evolutionary clustering [Chakrabarti et al., 2006]
・ Hypothesis testing approach [Song and Wang, 2005]
・ Kalman filter approach [Krempl et al., 2011]
・ GraphScope [Sun et al., 2007]
・ Variational Bayes approach [Sato, 2001]
Clustering change detection issue
Significance
A novel clustering change detection algorithm
Key idea:
・ Sequential dynamic model selection (sequential DMS)
・ NML (normalized maximum likelihood) code-length as the criterion
   First formulae for NML for Gaussian mixture models
Empirical demonstration of its superiority over existing methods
Shown using artificial data sets
Demonstration of its validity in market analysis
Shown using real beer consumption data sets
Proposed Alg. – background of DMS –
Dynamic Model Selection (DMS) [Yamanishi and Maruyama, 2007]
Batch DMS criterion :
Total code-length = code-length of data seq. + code-length of model seq.
Minimum w.r.t. the model sequence
~ Extension of the MDL (Minimum Description Length) principle [Rissanen, 1978] to model “sequence” selection
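The two-part structure of the batch criterion can be written compactly. The notation below is assumed here for illustration (x^n for the data sequence, M^n = M_1 ... M_n for the model sequence) and may differ from the symbols on the original slide:

```latex
% Assumed notation: x^n = data sequence, M^n = M_1 \dots M_n = model sequence
\ell(x^n : M^n) \;=\; \ell(x^n \mid M^n) \;+\; \ell(M^n),
\qquad
\hat{M}^n \;=\; \operatorname*{arg\,min}_{M^n} \; \ell(x^n : M^n)
```

The first term is the code-length of the data sequence given the model sequence, and the second is the code-length of the model sequence itself.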
Proposed Alg. – Sequential DMS –
Sequential dynamic model selection (SDMS) Alg.
At each time t, given the data, sequentially select the number of clusters Kt and the cluster assignment Zt
Minimum w.r.t. Kt, Zt of:
  Code-length for data clustering ~ NML (normalized maximum likelihood) coding
  + Code-length for transition of clustering structure
Sequential variant of DMS criterion
[Yamanishi and Maruyama, 2007]
s.t.
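The selection step can be sketched in a few lines. This is only an illustration under stated assumptions: `select_model`, the two callables, and the toy cost tables are hypothetical names introduced here, the candidate set is taken to be the neighbors of the previous K, and the search over the assignment Zt (done by EM in the actual method) is hidden inside the data code-length callable.

```python
from typing import Callable, Iterable, Tuple


def select_model(
    candidates: Iterable[int],
    data_code_length: Callable[[int], float],
    transition_code_length: Callable[[int], float],
) -> Tuple[int, float]:
    """Return the candidate K with the smallest total code-length.

    total(K) = data_code_length(K) + transition_code_length(K)
    Both callables are placeholders (assumptions of this sketch).
    """
    best_k, best_total = None, float("inf")
    for k in candidates:
        total = data_code_length(k) + transition_code_length(k)
        if total < best_total:
            best_k, best_total = k, total
    if best_k is None:
        raise ValueError("no candidate models were given")
    return best_k, best_total


# Toy usage with made-up code-lengths (not real results).
prev_k = 3
toy_data_cost = {2: 120.0, 3: 101.5, 4: 99.0}
toy_transition_cost = {2: 4.0, 3: 0.5, 4: 4.0}
chosen = select_model(
    [prev_k - 1, prev_k, prev_k + 1],
    toy_data_cost.__getitem__,
    toy_transition_cost.__getitem__,
)
```

In the toy numbers, K = 4 fits the data slightly better, but the transition cost makes staying at K = 3 the shorter total description.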
Proposed Alg. – model transition –
Consider three patterns of clustering changes
Run the EM alg. with the initial values below:
Case 1 : # of clusters does not change
  Initial parameter values remain the same
Case 2 : # of clusters decreases (e.g., merging)
  Assign data in a certain cluster to the other ones randomly
Case 3 : # of clusters increases (e.g., splitting)
  Assign data to a new cluster randomly
[Figure: illustrations of Case 2 and Case 3]
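The three cases can be sketched as an initializer for the EM run. This is a minimal sketch, not the authors' implementation: `init_assignments` is a hypothetical helper, it works on hard cluster labels rather than parameter values, and the choice of which cluster to dissolve and the fraction of points moved to the new cluster are assumptions.

```python
import random
from typing import List, Optional


def init_assignments(
    labels: List[int],
    k_old: int,
    k_new: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Build initial cluster labels for EM when K moves from k_old to k_new.

    labels holds integers in range(k_old). Only neighbor moves are handled.
    """
    rng = rng or random.Random(0)
    if k_new < 1:
        raise ValueError("at least one cluster is required")
    if k_new == k_old:
        # Case 1: keep the previous solution as the starting point.
        return list(labels)
    if k_new == k_old - 1:
        # Case 2: dissolve one cluster (picked at random, an assumption)
        # and scatter its points over the remaining clusters.
        removed = rng.randrange(k_old)
        others = [c for c in range(k_old) if c != removed]
        out = []
        for z in labels:
            if z == removed:
                z = rng.choice(others)
            # Shift labels down so they stay in range(k_new).
            out.append(z - 1 if z > removed else z)
        return out
    if k_new == k_old + 1:
        # Case 3: move a random subset of points (about 1/k_new of them,
        # an assumption) to a new cluster with label k_old.
        return [k_old if rng.random() < 1.0 / k_new else z for z in labels]
    raise ValueError("K may only move to a neighboring value")
```

EM would then be run from each of the three initializations, and the resulting code-lengths compared.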
Proposed Alg. – code-length for transition –
Model transition probability distribution
Suppose K transits to neighbors only
Employ the Krichevsky-Trofimov (KT) estimate [Krichevsky and Trofimov, 1981]
Code-length of the model transition
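A rough sketch of how such a transition code-length could be computed follows. The KT estimate for a binary event, (count + 1/2) / (trials + 1), is standard; the exact form of the neighbor-only transition model (probability alpha/2 for each valid neighbor, the remainder for staying) and all function names are assumptions of this sketch, since the slide's formula did not survive extraction.

```python
import math


def kt_change_rate(n_changes: int, n_transitions: int) -> float:
    """Krichevsky-Trofimov estimate of the per-step change probability."""
    return (n_changes + 0.5) / (n_transitions + 1.0)


def transition_prob(k_prev: int, k_new: int, alpha: float, k_max: int) -> float:
    """Assumed neighbor-only transition model on {1, ..., k_max}.

    Each valid neighbor of k_prev gets alpha / 2; staying gets the rest.
    """
    neighbors = [k for k in (k_prev - 1, k_prev + 1) if 1 <= k <= k_max]
    if k_new in neighbors:
        return alpha / 2.0
    if k_new == k_prev:
        return 1.0 - (alpha / 2.0) * len(neighbors)
    return 0.0


def transition_code_length(
    k_prev: int, k_new: int, n_changes: int, n_transitions: int, k_max: int
) -> float:
    """Code-length in bits: -log2 of the estimated transition probability."""
    alpha = kt_change_rate(n_changes, n_transitions)
    p = transition_prob(k_prev, k_new, alpha, k_max)
    return math.inf if p == 0.0 else -math.log2(p)
```

With no history, the KT estimate is 1/2, so staying at an interior K costs 1 bit and moving to a neighbor costs 2 bits; as more steps pass without a change, staying becomes cheaper and changing becomes more expensive.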
Criteria – NML code-length –
Model (Gaussian mixture model) :
NML (normalized maximum likelihood) code-length (includes a normalization term) :
Shortest code-length in the sense of the minimax criterion [Shtarkov 1987]
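For reference, the standard general definition of the NML code-length is given below. The symbols (f for the model density, \hat{\theta} for the maximum likelihood estimator, C_n for the normalization term) are the usual textbook notation and are assumed here rather than taken from the slide:

```latex
% Standard NML code-length; notation assumed for illustration
L_{\mathrm{NML}}(x^n)
  \;=\; -\log f\bigl(x^n ; \hat{\theta}(x^n)\bigr) \;+\; \log C_n,
\qquad
C_n \;=\; \int f\bigl(y^n ; \hat{\theta}(y^n)\bigr)\, dy^n
```

The integral runs over all data sequences of length n (a sum in the discrete case), which is why the normalization term is the hard part to compute.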
For Continuous Data
Normalization term : for continuous data, the data ranges over the whole domain
Problem:
・ NML for Gaussian distribution : the normalization term diverges
・ NML for mixture distribution : the normalization term is computationally intractable, owing to combinatorial difficulties
For Continuous Data (Example)
For the one-dimensional Gaussian distribution (σ^2 is given)
Normalization term
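A short worked derivation shows why this normalization term diverges. It uses the factorization through the sufficient statistic \hat{\mu}; the bound R is a hypothetical parameter introduced here only to illustrate the effect of restricting the data range:

```latex
% One-dimensional Gaussian with known \sigma^2; R is an assumed bound
\hat{\mu}(x^n) = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
f(x^n ; \mu) = f(x^n \mid \hat{\mu}) \, g(\hat{\mu} ; \mu),
\qquad
g(\hat{\mu} ; \mu) = \sqrt{\frac{n}{2\pi\sigma^2}}
  \exp\!\Bigl(-\frac{n(\hat{\mu}-\mu)^2}{2\sigma^2}\Bigr)

C_n = \int f\bigl(x^n ; \hat{\mu}(x^n)\bigr)\, dx^n
    = \int_{-\infty}^{\infty} g(\hat{\mu} ; \hat{\mu})\, d\hat{\mu}
    = \int_{-\infty}^{\infty} \sqrt{\frac{n}{2\pi\sigma^2}}\, d\hat{\mu}
    = \infty

% Restricting the data so that |\hat{\mu}| \le R gives a finite value:
C_n(R) = 2R \sqrt{\frac{n}{2\pi\sigma^2}}
```

The restricted value is finite but grows linearly in R, so the resulting code-length depends on how the bound is chosen.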
Approximate computation (1/2)
Use sufficient statistics
g1 : Gaussian distribution
g2 : Wishart distribution
Criteria – NML for GMM –
Efficiently computing an approximate variant of the NML code-length for a GMM [Hirai and Yamanishi, 2011]
Restrict the range of data so that the MLE lies in a bounded range specified by a parameter
The normalization term does not diverge, but it still depends heavily on the parameters
NML : the normalization term is calculated as follows :
where the remaining symbols denote the number of data and the dimension of the data
Criteria – RNML code-length –
Modify NML to develop the re-normalized maximum likelihood coding (RNML) [Rissanen, Roos, Myllymaki 2010] [Hirai and Yamanishi, 2012]
Re-normalize around the MLE of the parameter by restricting the range of data
Less dependent on the hyper-parameter
Criteria – RNML code-length –
RNML code-length
Theorem [Hirai and Yamanishi 2012]
RNML code-length for GMM is calculated as follows :
Definition
Problem : computing the normalization term is costly
Criteria – efficient computing of RNML –
Straightforward computation of RNML requires a large amount of time
⇒ But we can compute it efficiently
Theorem [Kontkanen and Myllymaki, 07]
Can compute the normalization term efficiently for “mixture” models
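To illustrate the kind of saving a recurrence gives, the sketch below uses the multinomial model, for which the Kontkanen-Myllymaki recurrence C(K+2, n) = C(K+1, n) + (n / K) C(K, n) is well known. It does not reproduce the slide's formula for Gaussian mixtures; the function names are hypothetical, and the brute-force routine is included only as a check.

```python
import math
from itertools import product


def c2(n: int) -> float:
    """C(2, n) for the multinomial model by direct summation."""
    total = 0.0
    for h in range(n + 1):
        # Python evaluates 0.0 ** 0 as 1.0, matching the 0^0 = 1 convention.
        total += math.comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
    return total


def multinomial_normalizer(k: int, n: int) -> float:
    """C(k, n) via the recurrence C(j+2, n) = C(j+1, n) + (n / j) * C(j, n)."""
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    prev, curr = 1.0, c2(n)  # C(1, n) and C(2, n)
    if k == 1:
        return prev
    for j in range(1, k - 1):
        prev, curr = curr, curr + (n / j) * prev
    return curr


def brute_force_normalizer(k: int, n: int) -> float:
    """Same quantity by summing over all count vectors (exponential cost)."""
    total = 0.0
    for counts in product(range(n + 1), repeat=k):
        if sum(counts) != n:
            continue
        coef = math.factorial(n)
        for h in counts:
            coef //= math.factorial(h)
        lik = 1.0
        for h in counts:
            lik *= (h / n) ** h
        total += coef * lik
    return total
```

The recurrence needs one O(n) pass for C(2, n) plus O(K) updates, whereas the direct sum enumerates every way of splitting n points over K components.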
Criteria – efficient computing of RNML –
Straightforward computation of RNML requires a large amount of time
⇒ But we can compute it efficiently
Theorem [Hirai and Yamanishi, 2012]
The normalization term satisfies a recursive formula
Experimental Results – data generation –
Generate artificial data sets according to a GMM
Experimental Results – comparison criteria –
Employ three comparison metrics
AR (accuracy rate) : Average rate of correctly estimating the true number of clusters over all time
IR (identification rate) : Probability of correctly identifying change-points and the changes themselves
FAR (false alarm rate) : Ratio of the number of false alarms to the number of all detected change-points
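One possible reading of these three metrics, for a single run, is sketched below. The matching rules are assumptions: a change-point counts as identified only if it is detected at exactly the true time step and the estimated number of clusters after the change is correct, and `evaluate` is a hypothetical helper name.

```python
from typing import List, Tuple


def evaluate(true_k: List[int], est_k: List[int]) -> Tuple[float, float, float]:
    """Return (AR, IR, FAR) for true and estimated cluster-count sequences."""
    if len(true_k) != len(est_k) or not true_k:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(true_k)

    # AR: fraction of time steps where the estimated K equals the true K.
    ar = sum(t == e for t, e in zip(true_k, est_k)) / n

    # Change-points are the time steps where K differs from the previous step.
    true_cp = {t for t in range(1, n) if true_k[t] != true_k[t - 1]}
    est_cp = {t for t in range(1, n) if est_k[t] != est_k[t - 1]}

    # IR: true change-points detected at the right time with the right new K.
    identified = {t for t in true_cp & est_cp if est_k[t] == true_k[t]}
    ir = len(identified) / len(true_cp) if true_cp else float("nan")

    # FAR: detected change-points that are not true change-points.
    false_alarms = est_cp - true_cp
    far = len(false_alarms) / len(est_cp) if est_cp else 0.0
    return ar, ir, far


ar, ir, far = evaluate([2, 2, 3, 3, 3, 2], [2, 2, 3, 3, 4, 2])
```

In the toy call, both true change-points are identified, while the spurious jump to K = 4 counts as the single false alarm among three detections.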
Experimental Results – artificial data –
Our alg. with NML was able to detect true change-points and identify the true # of clusters with higher probability than AIC and BIC
[Figure: Average number of clusters over time]
AIC : Akaike’s information criterion [Akaike 1974]
BIC : Bayesian information criterion [Schwarz 1978]
       RNML    AIC     BIC
AR     0.903   0.103   0.135
IR     0.380   0.005   0.020
FAR    0.260   0.020   0.718
Comparison w.r.t. KL-divergence
Evaluated change detection accuracies by varying the Kullback-Leibler divergence (KLD) between the distributions before and after the change points
The larger the KLD between GMMs before and after the change-point was, the more accurately it was detected in terms of IR (identification rate).
Experimental Results – vs SW Alg. –
SW algorithm : Hypothesis testing of whether clusters are identical or not, followed by splitting, merging, etc. [Song and Wang, 2005]
The sequential DMS with RNML significantly outperformed SW-alg.
Data : size/time = 512
           AR      IR      FAR
Proposed   0.988   0.950   0.050
SW-RNML    0.369   0.300   0.503
SW-BIC     0.019   0.000   0.841
Experimental Results – market analysis –
Data set provided by MACROMILL, Inc.
Clustering customers to detect their structure changes
Our alg. detected clustering changes that corresponded to the year-end demand
Data : user-by-beer consumption tables, repeated over time
         Beer 1   Beer 2   . . .
User 1   350      700      . . .
User 2   1050     350      . . .
. . .    . . .    . . .    . . .
14 kinds of beer
78 days
The cluster change at the change-point : 1/1,2
Many customers changed their purchasing patterns toward Beer-A and Third-Beer at the year’s end
Conclusion
Why NML?
The shortest code-length in the sense of Shtarkov’s minimax criterion [Shtarkov, 1987]
Minimum is attained by Q = NML distribution
(defined with the Maximum Likelihood Estimator)
For a given class :
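Shtarkov's minimax criterion and its minimizer can be stated as follows. This is the standard form of the result; the symbols (f for the model class density, \hat{\theta} for the maximum likelihood estimator, Q for an arbitrary coding distribution) are assumed notation rather than the slide's own:

```latex
% Shtarkov's minimax regret; notation assumed for illustration
\min_{Q} \; \max_{x^n} \;
  \log \frac{f\bigl(x^n ; \hat{\theta}(x^n)\bigr)}{Q(x^n)}

% The minimum is attained by the NML distribution:
Q^{*}(x^n) \;=\;
  \frac{f\bigl(x^n ; \hat{\theta}(x^n)\bigr)}
       {\int f\bigl(y^n ; \hat{\theta}(y^n)\bigr)\, dy^n}
```

With Q = Q*, the regret is the same for every sequence x^n and equals the logarithm of the normalization term.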
Restrict the range of data
Restrict the range of data for Shtarkov’s minimax criterion [Shtarkov, 1987]
For a given class :
Restrict the range of data.
We change Shtarkov’s minimax criterion itself
Comparison with non-parametric Bayes
Sequential Dynamic Model Selection works better than non-parametric Bayes (Infinite HMM, etc.)
[Sakurai and Yamanishi, “Comparison of Dynamic Model Selection with Infinite HMM for Statistical Model Change Detection”, to appear in ITW 2012]