Measures of Clustering Quality: A Working Set of Axioms for Clustering
Margareta Ackerman
Joint work with Shai Ben-David
Slide 2
The Theory-Practice Gap
Clustering is one of the most widely used tools for exploratory data analysis. Social sciences, biology, astronomy, computer science, and other fields all apply clustering to gain a first understanding of the structure of large data sets.
Yet there is distressingly little theoretical understanding of clustering.
Slide 3
Questions that research on the fundamentals of clustering should address
Can clustering be given a formal and general definition?
What is a good clustering?
Can we distinguish clusterable from structureless data?
Slide 4
Inherent Obstacles
Clustering is not well defined. There is a wide variety of different clustering tasks, with different (often implicit) measures of quality.
In most practical clustering tasks there is no clear ground truth against which to evaluate a solution (in contrast with classification tasks, where a held-out labeled set can be used to evaluate the classifier).
A clustering may have different value to different users, e.g. clustering paintings by painter vs. by topic.
Slide 5
Common Solutions
Objective utility functions: Sum of In-Cluster Distances, Average Distances to Center Points, Cut Weight, Spectral Clustering, etc. (Shmoys, Charikar, Meyerson, Luxburg, ...). Analyze the computational complexity of discrete optimization problems.
Consider a restricted set of distributions (generative models): e.g. mixtures of Gaussians [Dasgupta 99], [Vempala 03], [Kannan et al. 04], [Achlioptas, McSherry 05]. Recover the parameters of the model generating the data. Many more.
Add structure (relevant information): e.g. the information bottleneck approach [Tishby, Pereira, Bialek 99]. Factor out user-irrelevant information.
Slide 6
Quest for a General Theory
What can we say independently of any specific algorithm, specific objective function, or specific generative data model?
Clustering Axioms: postulate axioms that, ideally, every clustering approach should satisfy, e.g. [Hartigan 1975], [Puzicha, Hofmann, Buhmann 00], [Kleinberg 02]. These attempts usually conclude with negative results.
Slide 7
Our Formal Setup
For a finite domain set $S$, a distance function $d$ is the distance defined between the domain points.
A Clustering Function maps
Input: a distance function $d$ over $S$
to
Output: a partition (clustering) of $S$
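To make the setup concrete, here is a minimal Python sketch (not from the talk) of the two objects involved: a distance function over a finite set and a partition of that set. All helper names are hypothetical. The exact conditions checked on `d` (symmetric, non-negative, zero only between identical points, no triangle inequality) are an assumption, following the convention usually attributed to Kleinberg's setting.

```python
from itertools import combinations

def is_distance_function(S, d):
    """Assumed convention: d(x, x) = 0, and d is symmetric and
    strictly positive on distinct points (S holds distinct points)."""
    return all(d(x, x) == 0 for x in S) and all(
        d(x, y) == d(y, x) and d(x, y) > 0 for x, y in combinations(S, 2)
    )

def is_partition(S, clustering):
    """Clusters must be non-empty, pairwise disjoint, and cover S."""
    clusters = [set(c) for c in clustering]
    covered = set().union(*clusters) if clusters else set()
    return (
        all(clusters)
        and covered == set(S)
        and sum(len(c) for c in clusters) == len(set(S))
    )

def trivial_clustering_function(S, d):
    """A deliberately uninteresting clustering function:
    it ignores d and puts every point in its own cluster."""
    return [{x} for x in S]

S = ["a", "b", "c"]
d = lambda x, y: 0 if x == y else 1

print(is_distance_function(S, d))
print(is_partition(S, trivial_clustering_function(S, d)))
print(is_partition(S, [{"a", "b"}, {"b", "c"}]))  # overlapping clusters
```

Any function with this input and output shape counts as a clustering function in the formal sense; the axioms discussed next are what separate sensible ones from arbitrary partitioners.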
Slide 8
Kleinberg's Work on Clustering Functions
Kleinberg proposes natural-looking axioms that distinguish clustering functions from other functions that output domain partitions.
Slide 9
Kleinberg's Axioms
Scale Invariance: $F(\lambda d) = F(d)$ for all $d$ and all strictly positive $\lambda$.
Consistency: If $d'$ equals $d$, except for shrinking distances within clusters of $F(d)$ or stretching between-cluster distances, then $F(d') = F(d)$.
Richness: For any partition $P$ of $S$, there exists a distance function $d$ over $S$ so that $F(d) = P$.
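As an illustration of how a natural-looking clustering function can violate one of these axioms, the sketch below uses a hypothetical function `linkage_components` that merges clusters whenever two of their points are at distance at most a fixed threshold `r`. The function, the threshold, and the toy data are assumptions chosen for the example. Because `r` does not scale with the distances, the output changes when all distances are multiplied by a constant, so scale invariance fails.

```python
from itertools import combinations

def linkage_components(points, d, r):
    """Hypothetical clustering function F: repeatedly merge two clusters
    that contain a pair of points at distance <= r."""
    clusters = [{p} for p in points]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(clusters, 2):
            if any(d(x, y) <= r for x in a for y in b):
                clusters.remove(a)
                clusters.remove(b)
                clusters.append(a | b)
                merged = True
                break  # restart the scan on the updated cluster list
    return {frozenset(c) for c in clusters}

def scaled(d, lam):
    """The distance function lam * d."""
    return lambda x, y: lam * d(x, y)

S = [0, 1, 2, 10, 11]
d = lambda x, y: abs(x - y)
F = lambda dist: linkage_components(S, dist, r=1.5)

print(F(d))              # two clusters: {0, 1, 2} and {10, 11}
print(F(scaled(d, 10)))  # every point alone: F(10 d) != F(d)
```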
Slide 10
Theorem [Kleinberg, 2002]: These axioms are inconsistent. Namely, no function can satisfy these three axioms.
How come axioms that seem to capture our intuition about clustering are inconsistent?
Our answer: The formalization of these axioms is stronger than the intuition they intend to capture. We express that same intuition in an alternative framework, and achieve consistency.
Slide 11
Clustering-Quality Measures
Clustering-quality measures quantify the quality of clusterings: how good is this clustering?
Slide 12
Defining Clustering-Quality Measures
A clustering-quality measure is a function $m(\text{dataset}, \text{clustering}) \in \mathbb{R}$ satisfying some properties that make this function a meaningful clustering-quality measure.
What properties should it satisfy?
Slide 13
Rephrasing Kleinberg's axioms as clustering-quality measure axioms
Scale Invariance: $m(C, d) = m(C, \lambda d)$ for all $d$, all strictly positive $\lambda$, and all clusterings $C$ over $d$.
Richness: For any clustering $C$ of $S$, there exists a distance function $d$ over $S$ so that $C = \operatorname{argmax}_{C^{*}} m(C^{*}, d)$.
Slide 14
Rephrasing Kleinberg's axioms as clustering-quality measure axioms
Consistency: If $d'$ equals $d$, except for shrinking distances within clusters of $C$ or stretching between-cluster distances, then $m(C, d) \le m(C, d')$.
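The scale-invariance and consistency axioms for quality measures can be spot-checked numerically on a toy example. The sketch below is an illustration rather than a proof: it uses Dunn's index (minimum between-cluster distance divided by maximum within-cluster distance) as the measure, and the helper names, the toy data, and the particular shrink and stretch factors are assumptions.

```python
from itertools import combinations
import math

def dunn_index(C, d):
    """Min between-cluster distance / max within-cluster distance.
    Assumes at least two clusters and one cluster with two or more points."""
    within = [d(x, y) for c in C for x, y in combinations(c, 2)]
    between = [d(x, y) for a, b in combinations(C, 2) for x in a for y in b]
    return min(between) / max(within)

def scaled(d, lam):
    """The distance function lam * d."""
    return lambda x, y: lam * d(x, y)

def consistent_variant(C, d, shrink=0.5, stretch=2.0):
    """A d' that shrinks within-cluster distances of C and
    stretches between-cluster distances (factors are illustrative)."""
    cluster_of = {x: i for i, c in enumerate(C) for x in c}

    def d_prime(x, y):
        same = cluster_of[x] == cluster_of[y]
        return (shrink if same else stretch) * d(x, y)

    return d_prime

C = [frozenset({0, 1, 2}), frozenset({10, 11})]
d = lambda x, y: abs(x - y)

base = dunn_index(C, d)
print(base)                                    # 8 / 2 = 4.0
print(dunn_index(C, scaled(d, 3.5)))           # unchanged under scaling
print(dunn_index(C, consistent_variant(C, d))) # does not decrease
```

A check like this can only refute an axiom for a candidate measure on specific data; establishing that a measure satisfies the axioms still requires an argument over all distance functions.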
Slide 15
An Additional Axiom
Clusterings $C$ over $(X, d)$ and $C'$ over $(X, d)$ are isomorphic if there exists a distance-preserving automorphism $f: X \to X$ such that $x, y$ share the same $C$-cluster iff $f(x)$ and $f(y)$ share the same $C'$-cluster.
Isomorphism Invariance: If $C$ and $C'$ are isomorphic, then $m(C, d) = m(C', d)$.
Slide 16
Major Gain: Consistency of New Axioms
Theorem: Consistency, scale invariance, richness, and isomorphism invariance for clustering-quality measures form a consistent set of requirements.
We prove this result by demonstrating measures that satisfy these axioms.
Moreover, every reasonable clustering-quality measure (CQM) satisfies our axioms.
Slide 17
An example of a CQM for center-based clustering: Relative Margin
The Relative Margin of a point $x$ in $C$ is (distance from $x$ to its closest center) / (distance from $x$ to its second-closest center).
The Relative Margin of $C$ is the average relative margin over all non-center points (over all possible center settings).
Relative Margin satisfies scale invariance, consistency, richness, and isomorphism invariance.
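The per-point ratio and its average can be sketched in a few lines of Python. This is a minimal sketch for one fixed set of centers. How the measure aggregates "over all possible center settings" is not spelled out in the slide text, so that step is deliberately left out. The function name and toy data are assumptions.

```python
import math

def relative_margin(points, centers, d):
    """Average, over non-center points, of
    (distance to closest center) / (distance to second-closest center).
    Assumes at least two centers and a positive second-closest distance."""
    ratios = []
    for x in points:
        if x in centers:
            continue  # centers themselves are excluded from the average
        closest, second = sorted(d(x, c) for c in centers)[:2]
        ratios.append(closest / second)
    return sum(ratios) / len(ratios)

points = [0, 1, 2, 10, 11]
centers = [1, 10]
d = lambda x, y: abs(x - y)

rm = relative_margin(points, centers, d)
print(rm)  # per-point ratios are 1/10, 1/8 and 1/10

# Multiplying every distance by a constant leaves each ratio unchanged.
rm_scaled = relative_margin(points, centers, lambda x, y: 7 * d(x, y))
print(math.isclose(rm, rm_scaled))
```

Each ratio lies between 0 and 1, and a small ratio means the point is much closer to its own center than to any other.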
Slide 18
Additional CQMs Satisfying Our Axioms
C-index (Dalrymple-Alford, 1970)
Gamma (Baker & Hubert, 1975)
Adjusted ratio of clustering (Roenker et al., 1971)
D-index (Dalrymple-Alford, 1970)
Modified ratio of repetition (Bower, Lesgold, and Tieman, 1969)
Dunn's index (Dunn, 1973)
Variations of Dunn's index (Bezdek and Pal, 1998)
Strict separation (based on Balcan, Blum, and Vempala, 2008)
And many more...
Slide 19
Why is the CQM formalism more faithful to intuition?
In the setting of clustering functions, the consistency axiom requires that consistent changes to the underlying distance should not create any new contenders for the best clustering of the data.
[Figure: distance functions $d$ and $d'$, with clusterings $C$ and $C'$]
If $F(d) = C$, a clustering function that satisfies Kleinberg's Consistency cannot output a different clustering $C'$ on $d'$.
Slide 20
Why is the CQM formalism more faithful to intuition?
In the setting of clustering-quality measures, the consistency axiom requires only that the quality of a given clustering $C$ does not get worse.
[Figure: distance functions $d$ and $d'$, with clusterings $C$ and $C'$]
While the quality of $C$ improves, a different clustering, $C'$, can still have better quality.
Slide 21
Summary
The intuition behind Kleinberg's axioms is consistent (in spite of his impossibility result).
The impossibility result can be overcome by a change of formalism. We do this by focusing on clustering-quality measures.
Every reasonable clustering-quality measure satisfies our axioms.
Slide 22
Future Work
How can the completeness of a set of axioms be argued?
Are the axioms useful for gaining interesting new insights about clusterings?
Can we find properties that distinguish different clustering paradigms?
Slide 23
Appendix: Another Clustering-Quality Measure: Gamma (Baker & Hubert, 1975)
Gamma is the best-performing measure in Milligan's study of 30 internal criteria (Milligan, 1981).
Let d(+) denote the number of times that points which were clustered together in $C$ had distance greater than two points which were not in the same cluster.
Let d(-) denote the opposite result.
Gamma satisfies scale invariance, consistency, richness, and isomorphism invariance.
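The two counts can be computed by comparing every within-cluster pair distance against every between-cluster pair distance. The formula that combines them did not survive in the slide text. The sketch below therefore assumes the usual Baker-Hubert form, a normalized difference of concordant and discordant comparisons, with the sign chosen so that larger values indicate better clusterings. That sign convention, the function names, and the toy data are assumptions, and ties are ignored.

```python
from itertools import combinations

def gamma_counts(C, d):
    """Return (d_plus, d_minus) as described above:
    d_plus  = # comparisons where a within-cluster pair is farther apart
              than a between-cluster pair,
    d_minus = # comparisons where it is closer."""
    cluster_of = {x: i for i, c in enumerate(C) for x in c}
    within, between = [], []
    for x, y in combinations(list(cluster_of), 2):
        same = cluster_of[x] == cluster_of[y]
        (within if same else between).append(d(x, y))
    d_plus = sum(w > b for w in within for b in between)
    d_minus = sum(w < b for w in within for b in between)
    return d_plus, d_minus

def gamma(C, d):
    """Assumed combination: (d_minus - d_plus) / (d_minus + d_plus),
    so that +1 means every within-cluster pair is closer than
    every between-cluster pair."""
    d_plus, d_minus = gamma_counts(C, d)
    return (d_minus - d_plus) / (d_minus + d_plus)

d = lambda x, y: abs(x - y)
good = [frozenset({0, 1, 2}), frozenset({10, 11})]
bad = [frozenset({0, 10}), frozenset({1, 2, 11})]

print(gamma_counts(good, d), gamma(good, d))
print(gamma_counts(bad, d), gamma(bad, d))
```

Only the ordering of distances enters the counts, which is why multiplying all distances by a positive constant cannot change the value.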
Slide 24
Variants of Quality Measures
Given a clustering-quality measure $m$, we can create new ones by applying it to a subset of the clusters.
$m_{\min}(C, d) = \min_{S \subseteq C} m(S, d)$, where $S$ is a subset of at least 2 clusters in $C$.
Similarly, we can define $m_{\max}$ and $m_{\text{average}}$.
If $m$ satisfies the axioms of clustering-quality measures, then so do $m_{\min}$, $m_{\max}$, and $m_{\text{average}}$.
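The construction can be sketched by enumerating every sub-collection of at least two clusters and aggregating the base measure over them. In the sketch below the base measure is Dunn's index, chosen only for illustration. Reading "average" as the plain mean over all such sub-collections is an assumption, as are the helper names. The enumeration is exponential in the number of clusters, so this is suitable for small examples only.

```python
from itertools import combinations
from statistics import mean

def dunn_index(C, d):
    """Illustrative base measure: min between-cluster distance /
    max within-cluster distance."""
    within = [d(x, y) for c in C for x, y in combinations(c, 2)]
    between = [d(x, y) for a, b in combinations(C, 2) for x in a for y in b]
    return min(between) / max(within)

def sub_clusterings(C):
    """All sub-collections S of C containing at least 2 clusters."""
    C = list(C)
    for k in range(2, len(C) + 1):
        yield from combinations(C, k)

def m_min(m, C, d):
    return min(m(S, d) for S in sub_clusterings(C))

def m_max(m, C, d):
    return max(m(S, d) for S in sub_clusterings(C))

def m_average(m, C, d):
    return mean(m(S, d) for S in sub_clusterings(C))

C = [frozenset({0, 1}), frozenset({5, 6}), frozenset({20, 22})]
d = lambda x, y: abs(x - y)

print(m_min(dunn_index, C, d))
print(m_max(dunn_index, C, d))
print(m_average(dunn_index, C, d))
```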