COMP24111 Machine Learning
Cluster Validation
Ke Chen
Reading: [25.1.2, KPM], [Wang et al., 2009], [Yang & Chen, 2011]
Outline
• Motivation and Background
• Internal index
  – Motivation and general ideas
  – Variance-based internal indexes
  – Application: finding the "proper" cluster number
• External index
  – Motivation and general ideas
  – Rand Index
  – Application: weighted clustering ensemble
• Summary
Motivation and Background
• Motivation
  Supervised classification
  – Class labels known: the ground truth
  – Accuracy: based on the given labels
    (e.g., classifying oranges vs. apples with 8 of 10 points correct gives Accuracy = 8/10 = 80%)
  Clustering analysis
  – No class labels
  – Evaluation is still demanded
• Validation is needed to
  – Compare clustering algorithms
  – Determine the number of clusters
  – Avoid finding patterns in noise
  – Find the "best" clusters from the data
Motivation and Background
• Illustrative example: which one is the "best"?
[Figure: four scatter plots of the same 2-D data (x and y in [0, 1]): the data set (random points), two K-means partitions (K = 3), and an agglomerative (complete-link) partition.]
Motivation and Background
• Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion.
  – How to be "quantitative": employ the measures.
  – How to be "objective": validate the measures!
[Diagram: the data set X is fed to one or more clustering algorithms under different settings/configurations; the resulting partitions P are scored by a validity index, which selects the best model m*.]
Motivation and Background
• Internal criteria (indexes)
  – Validate without external information
  – Compare partitions with different numbers of clusters
  – Determine the number of clusters
• External criteria (indexes)
  – Validate against the "ground truth"
  – Compare two partitions: how similar are they?
Internal Index
• Ground truth is unavailable, so unsupervised validation must be done with "common sense" or "a priori knowledge".
• There are a variety of internal indexes:
  – Variance-based methods
  – Rate-distortion methods
  – Davies-Bouldin index (DBI)
  – Bayesian Information Criterion (BIC)
  – Silhouette Coefficient
  – Minimum description length (MDL)
  – Stochastic complexity (SC)
  – Modified Huber's Γ (MHГ) index
Internal Index
• Variance-based methods
  – Minimise the within-cluster variance (SSW): intra-cluster variance is minimised
  – Maximise the between-cluster variance (SSB): inter-cluster variance is maximised
Internal Index
• Variance-based methods (cont.)
  Assume an algorithm leads to a partition of K clusters, where cluster $i$ has $n_i$ data points and $\mathbf{c}_i$ as its centroid; $d(\cdot,\cdot)$ is the distance used in this algorithm.
  – Within-cluster variance (SSW):

    $SSW(K) = \sum_{i=1}^{K} \sum_{j=1}^{n_i} d^2(\mathbf{x}_{ij}, \mathbf{c}_i)$

  – Between-cluster variance (SSB):

    $SSB(K) = \sum_{i=1}^{K} n_i\, d^2(\mathbf{c}_i, \mathbf{c})$

    where $\mathbf{c}$ is the global mean (centroid) of the whole data set.
Internal Index
• Variance-based F-ratio index
  – Measures the ratio of the within-cluster variance to the between-cluster variance (cf. the original F-test)
  – F-ratio index (W-B index) for a partition of K clusters:

    $F(K) = \frac{K \cdot SSW(K)}{SSB(K)} = \frac{K \sum_{i=1}^{K}\sum_{j=1}^{n_i} d^2(\mathbf{x}_{ij}, \mathbf{c}_i)}{\sum_{i=1}^{K} n_i\, d^2(\mathbf{c}_i, \mathbf{c})}$

    where $\mathbf{x}_{ij}$ is the $j$th data point in cluster $i$ and $n_i$ is the number of data points in cluster $i$.
[Figure: F-ratio (×10⁵) plotted against the number of clusters (25 down to 5) for two algorithms, IS and PNN; each curve's minimum indicates the selected cluster number.]
Internal Index
• Application: finding the "proper" cluster number (data set S1)
[Figure: partitions of S1 found by Algorithm 1 and Algorithm 2.]
[Figure: on data set S4, F-ratio versus the number of clusters (25 down to 5) for IS and PNN; one curve reaches its minimum at 15 clusters and the other at 16.]
Internal Index
• Application: finding the "proper" cluster number (cont.)
[Figure: the corresponding partitions found by Algorithm 1 and Algorithm 2.]
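The model-selection recipe above can be sketched in code. This is an assumed workflow, not the lecture's implementation (the helpers `f_ratio`, `naive_kmeans` and `best_k` are illustrative names): run a clusterer for each candidate K and keep the K whose F-ratio is smallest.

```python
# Sketch: choose the "proper" K by minimising F(K) = K * SSW(K) / SSB(K)
# over a range of candidate cluster numbers.
import numpy as np

def f_ratio(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    k = len(np.unique(labels))
    c_global = X.mean(axis=0)
    ssw = ssb = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        cen = pts.mean(axis=0)
        ssw += ((pts - cen) ** 2).sum()
        ssb += len(pts) * ((cen - c_global) ** 2).sum()
    return k * ssw / ssb

def naive_kmeans(X, k, iters=50, seed=0):
    # Minimal K-means for illustration only (no restarts, no tolerance check).
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centres) ** 2).sum(-1), axis=1)
        centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels

def best_k(X, k_range):
    # Evaluate F(K) for each candidate K and return the minimiser.
    scores = {k: f_ratio(X, naive_kmeans(X, k)) for k in k_range}
    return min(scores, key=scores.get), scores
```

In practice one would use a proper K-means implementation with multiple restarts; the point here is only the selection loop driven by the internal index.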
External Index
• "Ground truth" is available, but a clustering algorithm does not use such information during unsupervised learning.
• There are a variety of external indexes:
  – Rand Index
  – Adjusted Rand Index
  – Pair-counting indexes
  – Information-theoretic indexes
  – Set-matching indexes
  – DVI index
  – Normalised mutual information (NMI) index
External Index
• Main issues
  – If the "ground truth" is known, the validity of a clustering can be verified by comparing the cluster labels with the class labels.
  – However, this is much more complicated than in supervised classification (where labels are used in training):
    • The cluster IDs in a partition resulting from clustering are assigned arbitrarily due to unsupervised learning – permutation.
    • The number of clusters may differ from the number of classes in the "ground truth" – inconsistency.
  – The central problem for external indexes is how to find all possible correspondences between the "ground truth" and a partition (or between two candidate partitions in the case of comparison).
External Index
• Rand Index
  – The first external index, proposed by Rand (1971) to address the "correspondence" problem.
  – Basic idea: consider all pairs of points in the data set, counting both agreements and disagreements with the "ground truth".
  – The index is defined as RI(X, Y) = (a + d) / (a + b + c + d), where

                                     Y: pairs in the  Y: pairs in
                                     same class       different classes
    X: pairs in the same cluster     a                b
    X: pairs in different clusters   c                d
External Index
• Rand Index (cont.)
  – Example: for a 5-point data set, we have

    Point  1   2   3   4   5
    X      i   ii  ii  i   i
    Y      q   p   q   p   q

  – To calculate a, b, c and d, we list all possible pairs (excluding pairing any data point with itself):
    [1, 2], [1, 3], [1, 4], [1, 5], [2, 3], [2, 4], [2, 5], [3, 4], [3, 5], [4, 5]
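The pair-enumeration procedure walked through on the following slides can be written directly as code. This is a sketch (not the lecture's implementation); it also lets you check the in-class exercise at the end of the walkthrough:

```python
# Sketch: compute a, b, c, d by enumerating all point pairs, exactly as in
# the walkthrough, then RI = (a + d) / (a + b + c + d).
from itertools import combinations

def rand_counts(X, Y):
    a = b = c = d = 0
    for p, q in combinations(range(len(X)), 2):
        same_cluster = X[p] == X[q]   # agreement in the partition X
        same_class = Y[p] == Y[q]     # agreement in the ground truth Y
        if same_cluster and same_class:
            a += 1
        elif same_cluster and not same_class:
            b += 1
        elif not same_cluster and same_class:
            c += 1
        else:
            d += 1
    return a, b, c, d

# The slide's 5-point example: X = (i, ii, ii, i, i), Y = (q, p, q, p, q)
X = ["i", "ii", "ii", "i", "i"]
Y = ["q", "p", "q", "p", "q"]
a, b, c, d = rand_counts(X, Y)
ri = (a + d) / (a + b + c + d)
```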
External Index
• Rand Index (cont.)
  – Example (cont., same 5-point data set):
    Initialisation: a, b, c, d ← 0
    For data point pair [1, 2]:
      X: [1, 2] assigned to (i, ii) → in different clusters
      Y: [1, 2] labelled as (q, p) → in different classes
    Thus, d ← d + 1 = 1
External Index
• Rand Index (cont.)
  – Current status: a = 0, b = 0, c = 0, d = 1
    For data point pair [1, 3]:
      X: [1, 3] assigned to (i, ii) → in different clusters
      Y: [1, 3] labelled as (q, q) → in the same class
    Thus, c ← c + 1 = 1
External Index
• Rand Index (cont.)
  – Current status: a = 0, b = 0, c = 1, d = 1
    For data point pair [1, 4]:
      X: [1, 4] assigned to (i, i) → in the same cluster
      Y: [1, 4] labelled as (q, p) → in different classes
    Thus, b ← b + 1 = 1
External Index
• Rand Index (cont.)
  – Current status: a = 0, b = 1, c = 1, d = 1
    For data point pair [1, 5]:
      X: [1, 5] assigned to (i, i) → in the same cluster
      Y: [1, 5] labelled as (q, q) → in the same class
    Thus, a ← a + 1 = 1
External Index
• Rand Index (cont.)
  – Current status: a = 1, b = 1, c = 1, d = 1
    In-class exercise: continue until you have the final values of a, b, c and d.
External Index
• Rand Index: contingency table
  – In general, a, b, c and d are calculated from a contingency table.
    Assume there are $k$ clusters in X and $l$ classes in Y; $n_{ij}$ is the number of points in both cluster $i$ and class $j$.
External Index
• Rand Index: contingency table (cont.)
  – Example: for the same 5-point data set (X = (i, ii, ii, i, i), Y = (q, p, q, p, q)), the contingency table is

             p    q  | total
    i        1    2  |   3
    ii       1    1  |   2
    ---------------------
    total    2    3  |   5
  – From the contingency table (with row sums $n_{i.}$, column sums $n_{.j}$ and total $N$):

    Same cluster in X and same class in Y:
    $a = \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{l} n_{ij}(n_{ij}-1)$

    Same cluster in X, different classes in Y:
    $b = \frac{1}{2}\Big(\sum_{i=1}^{k} n_{i.}^2 - \sum_{i=1}^{k}\sum_{j=1}^{l} n_{ij}^2\Big)$

    Different clusters in X, same class in Y:
    $c = \frac{1}{2}\Big(\sum_{j=1}^{l} n_{.j}^2 - \sum_{i=1}^{k}\sum_{j=1}^{l} n_{ij}^2\Big)$

    Different clusters in X and different classes in Y:
    $d = \frac{1}{2}\Big(N^2 + \sum_{i=1}^{k}\sum_{j=1}^{l} n_{ij}^2 - \sum_{i=1}^{k} n_{i.}^2 - \sum_{j=1}^{l} n_{.j}^2\Big)$
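These closed-form counts can be checked against the pairwise walkthrough. The sketch below (function name `rand_from_contingency` is illustrative, not from the slides) applies the formulas to the 5-point example's contingency table:

```python
# Sketch: a, b, c, d from the contingency-table formulas, then the Rand index.
import numpy as np

def rand_from_contingency(n):
    n = np.asarray(n, dtype=int)
    N = n.sum()
    sum_sq = (n ** 2).sum()                 # sum over i, j of n_ij^2
    rows_sq = (n.sum(axis=1) ** 2).sum()    # sum over clusters of n_i.^2
    cols_sq = (n.sum(axis=0) ** 2).sum()    # sum over classes of n_.j^2
    a = (n * (n - 1)).sum() / 2
    b = (rows_sq - sum_sq) / 2
    c = (cols_sq - sum_sq) / 2
    d = (N ** 2 + sum_sq - rows_sq - cols_sq) / 2
    return a, b, c, d

# Contingency table from the slide: rows are clusters i, ii; columns classes p, q
table = [[1, 2],
         [1, 1]]
a, b, c, d = rand_from_contingency(table)
ri = (a + d) / (a + b + c + d)
```

This avoids the O(N²) pair enumeration: everything is computed from the k×l table.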
External Index
• Rand Index: measures the number of pairs in each of the four categories (Ex. 4)
Weighted Clustering Ensemble
• Motivation
  – Evidence-accumulation-based clustering ensemble (Fred & Jain, 2005) is a simple clustering-ensemble algorithm that uses the evidence accumulated from multiple yet diversified partitions, generated with different algorithms, initial conditions, distance metrics, and so on.
  – However, this clustering-ensemble algorithm has limitations:
    • It does not distinguish between "non-trivial" and "trivial" partitions; i.e., all partitions used in the ensemble are treated as equally important.
    • It is sensitive to the cluster distance used in the hierarchical clustering that reaches a consensus.
  – Cluster-validity indexes provide an effective way to measure the "non-trivialness" or "importance" of a partition:
    • Internal index: directly measures the importance of a partition, quantitatively and objectively.
    • External index: indirectly measures the importance of a partition by comparing it with the other partitions used in the ensemble, in another round of "evidence accumulation".
Chen & Yang: Temporal Data Clustering with Different Representations
Weighted Clustering Ensemble
Yun Yang and Ke Chen, “Temporal data clustering via weighted clustering ensemble with different representations,” IEEE Transactions on Knowledge and Data Engineering 23(2), pp. 307-320, 2011.
Use validity index values as weights
Weighted Clustering Ensemble
• Example: convert clustering results into a binary "distance" matrix
  – Partition 1 over points A, B, C, D: Cluster 1 (C1) = {A, B}, Cluster 2 (C2) = {C, D} (as implied by the matrix)
  – Entry (p, q) is 0 if points p and q fall in the same cluster and 1 otherwise (rows/columns ordered A, B, C, D):

    D1 = [ 0 0 1 1
           0 0 1 1
           1 1 0 0
           1 1 0 0 ]
Weighted Clustering Ensemble
• Example (cont.): a second partition with three clusters
  – Partition 2: Cluster 1 (C1) = {A}, Cluster 3 (C3) = {B}, Cluster 2 (C2) = {C, D} (as implied by the matrix; rows/columns ordered A, B, C, D):

    D2 = [ 0 1 1 1
           1 0 1 1
           1 1 0 0
           1 1 0 0 ]
Weighted Clustering Ensemble
• Evidence accumulation: form the collective weighted "distance" matrix from D1 and D2 above:

    $D_E = \frac{1}{3}\left[(w_1^M + w_1^N + w_1^D)\,D_1 + (w_2^M + w_2^N + w_2^D)\,D_2\right]$

  where $w_t^M$, $w_t^N$ and $w_t^D$ are the validity-index-based weights attached to partition $t$, one per validity criterion, averaged over the three criteria.
Weighted Clustering Ensemble
• Clustering analysis with the weighted clustering ensemble on the CAVIAR database
  – Annotated video sequences of pedestrians: a set of 222 high-quality moving trajectories
  – Clustering analysis of trajectories is useful for many applications
• Experimental setting
  – Representations: PCF, DCF, PLS and PDWT
  – Initial clustering analysis: K-means algorithm (4 < K < 20), 6 initial settings
  – Ensemble: 320 partitions in total (80 partitions per representation)
Weighted Clustering Ensemble
• Clustering results on the CAVIAR database
Weighted Clustering Ensemble • Application: UCR time series benchmarks
Weighted Clustering Ensemble • Application: Rand index (%) values of clustering ensembles (Yang & Chen 2011)
Summary
• Cluster validation is a process that evaluates clustering results against a pre-defined criterion.
• Two different types of cluster-validation methods:
  – Internal indexes
    • No "ground truth" available
    • Defined based on "common sense" or "a priori knowledge"
    • Application: finding the "proper" number of clusters, …
  – External indexes
    • "Ground truth" known, or a reference given ("relative index")
    • Application: performance evaluation of clustering with reference information
• Apart from direct evaluation, both kinds of index may be applied in a weighted clustering ensemble, leading to better and more robust results by down-weighting trivial partitions.
K. Wang et al., "CVAP: Validation for cluster analysis," Data Science Journal, vol. 8, May 2009. [Code available online: http://www.mathworks.com/matlabcentral/fileexchange/authors/24811]