transcript
- Slide 1
- Measures of Clustering Quality: A Working Set of Axioms for
Clustering. Margareta Ackerman. Joint work with Shai Ben-David.
- Slide 2
- Clustering is one of the most widely used tools for exploratory
data analysis. Social sciences, biology, astronomy, computer
science, ... all apply clustering to gain a first understanding of
the structure of large data sets. The Theory-Practice Gap: Yet
there is distressingly little theoretical understanding of
clustering.
- Slide 3
- Questions that research on the fundamentals of clustering should
address: Can clustering be given a formal and general definition?
What is a good clustering? Can we distinguish clusterable from
structureless data?
- Slide 4
- Inherent Obstacles. Clustering is not well defined: there is a
wide variety of different clustering tasks, with different (often
implicit) measures of quality. In most practical clustering tasks
there is no clear ground truth to evaluate your solution by (in
contrast with classification tasks, in which you can have a
held-out labeled set to evaluate the classifier against). A
clustering may have different value to different users, e.g.
clustering paintings by painter vs. by topic.
- Slide 5
- Common Solutions. Objective utility functions: Sum of In-Cluster
Distances, Average Distances to Center Points, Cut Weight, Spectral
Clustering, etc. (Shmoys, Charikar, Meyerson, Luxburg, ...).
Analyze the computational complexity of discrete optimization
problems. Consider a restricted set of distributions (generative
models), e.g. mixtures of Gaussians [Dasgupta 99], [Vempala 03],
[Kannan et al. 04], [Achlioptas, McSherry 05], and many more.
Recover the parameters of the model generating the data. Add
structure (relevant information), e.g. the information bottleneck
approach [Tishby, Pereira, Bialek 99]. Factor out user-irrelevant
information.
- Slide 6
- Quest for a General Theory. What can we say independently of any
specific algorithm, specific objective function, or specific
generative data model? Clustering Axioms: Postulate axioms that,
ideally, every clustering approach should satisfy, e.g. [Hartigan
1975], [Puzicha, Hofmann, Buhmann 00], [Kleinberg 02]. These
usually conclude with negative results.
- Slide 7
- Our Formal Setup. For a finite domain set $S$, a distance
function $d$ is the distance defined between the domain points. A
Clustering Function maps an input, a distance function $d$ over
$S$, to an output, a partition (clustering) of $S$.
- Slide 8
- Kleinberg's Work on Clustering Functions. Kleinberg proposes
natural-looking axioms that distinguish clustering functions from
other functions that output domain partitions.
- Slide 9
- Kleinberg's Axioms. Scale Invariance: $F(\lambda d) = F(d)$ for
all $d$ and all strictly positive $\lambda$. Consistency: If $d'$
equals $d$, except for shrinking distances within clusters of
$F(d)$ or stretching between-cluster distances, then
$F(d') = F(d)$. Richness: For any partition $P$ of $S$, there
exists a distance function $d$ over $S$ so that $F(d) = P$.
- Slide 10
- Theorem [Kleinberg, 2002]: These axioms are inconsistent.
Namely, no function can satisfy these three axioms. How come axioms
that seem to capture our intuition about clustering are
inconsistent? Our answer: The formalization of these axioms is
stronger than the intuition they intend to capture. We express that
same intuition in an alternative framework, and achieve
consistency.
- Slide 11
- Clustering-Quality Measures. Clustering-quality measures
quantify the quality of clusterings: how good is this clustering?
- Slide 12
- Defining Clustering-Quality Measures. A clustering-quality
measure is a function
$m(\text{dataset}, \text{clustering}) \in \mathbb{R}$ satisfying
some properties that make this function a meaningful
clustering-quality measure. What properties should it satisfy?
- Slide 13
- Rephrasing Kleinberg's axioms as clustering-quality measure
axioms. Scale Invariance: $m(C, d) = m(C, \lambda d)$ for all $d$,
all strictly positive $\lambda$, and all $C$ over $d$. Richness:
For any clustering $C$ of $S$, there exists a distance function $d$
over $S$ so that $C = \arg\max_{C'} m(C', d)$.
- Slide 14
- Rephrasing Kleinberg's axioms as clustering-quality measure
axioms. Consistency: If $d'$ equals $d$, except for shrinking
distances within clusters of $C$ or stretching between-cluster
distances, then $m(C, d') \ge m(C, d)$.
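The premise of the consistency axiom can be made concrete with a small check. The following is only a sketch under stated assumptions: the function name `is_consistent_variant`, the toy points on the real line, and the distance `d_stretched` are all hypothetical illustrations, not part of the talk.

```python
from itertools import combinations

def is_consistent_variant(points, clustering, d, d_new):
    """Return True if d_new equals d except for shrinking within-cluster
    distances and stretching between-cluster distances."""
    # Map each point to the index of the cluster that contains it.
    label = {x: i for i, cluster in enumerate(clustering) for x in cluster}
    for x, y in combinations(points, 2):
        same_cluster = label[x] == label[y]
        if same_cluster and d_new(x, y) > d(x, y):
            return False  # a within-cluster distance grew
        if not same_cluster and d_new(x, y) < d(x, y):
            return False  # a between-cluster distance shrank
    return True

# Hypothetical toy data: four points on the real line, two clusters.
points = [0.0, 1.0, 10.0, 11.0]
clustering = [{0.0, 1.0}, {10.0, 11.0}]
d = lambda x, y: abs(x - y)

def d_stretched(x, y):
    # Halves within-cluster distances and doubles between-cluster ones.
    same = (x < 5) == (y < 5)
    return abs(x - y) * (0.5 if same else 2.0)

print(is_consistent_variant(points, clustering, d, d_stretched))
print(is_consistent_variant(points, clustering, d_stretched, d))
```

A quality measure $m$ that satisfies consistency must then give `m(C, d_stretched) >= m(C, d)` for this clustering; the reverse direction is not a consistent change, so the axiom says nothing about it.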
- Slide 15
- An Additional Axiom. Clusterings $C$ over $(X, d)$ and $C'$ over
$(X, d)$ are isomorphic if there exists a distance-preserving
automorphism $f: X \to X$ such that $x, y$ share the same
$C$-cluster iff $f(x)$ and $f(y)$ share the same $C'$-cluster.
Isomorphism Invariance: If $C$ and $C'$ are isomorphic, then
$m(C, d) = m(C', d)$.
- Slide 16
- Major Gain: Consistency of New Axioms. Theorem: Consistency,
scale invariance, richness, and isomorphism invariance for
clustering-quality measures form a consistent set of requirements.
We prove this result by demonstrating measures that satisfy these
axioms. Moreover, every reasonable CQM satisfies our axioms.
- Slide 17
- An example of a CQM for center-based clustering: Relative
Margin. The Relative Margin of a point $x$ in $C$ is (distance to
the closest center to $x$) / (distance to the 2nd closest center to
$x$). The Relative Margin of $C$ is the average relative margin
over all non-center points (over all possible center settings).
Relative Margin satisfies scale invariance, consistency, richness,
and isomorphism invariance.
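The per-point ratio and its average can be sketched as follows. This is a simplified illustration, not the talk's full definition: it assumes a single fixed center setting (it does not range over all possible center settings), Euclidean distance, and hypothetical toy data. The name `relative_margin` is chosen here for illustration.

```python
import math

def relative_margin(points, centers, dist=math.dist):
    """Average, over non-center points, of
    (distance to closest center) / (distance to 2nd closest center).
    Assumes one fixed center setting with at least two centers."""
    ratios = []
    for x in points:
        if x in centers:
            continue  # center points are excluded from the average
        nearest, second = sorted(dist(x, c) for c in centers)[:2]
        ratios.append(nearest / second)
    return sum(ratios) / len(ratios)

# Hypothetical toy data: two well-separated groups on a line.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (9.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]

rm = relative_margin(points, centers)
# Scaling every coordinate by 3 scales every distance by 3.
scaled = relative_margin(
    [(3 * x, 3 * y) for x, y in points],
    [(3 * x, 3 * y) for x, y in centers],
)
print(rm, scaled)
```

Because each term is a ratio of two distances, multiplying all distances by the same positive constant leaves the value unchanged, which is the scale-invariance property claimed above. A ratio near 0 means a point is much closer to its nearest center than to any other.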
- Slide 18
- Additional CQMs Satisfying Our Axioms: C-index
(Dalrymple-Alford, 1970); Gamma (Baker & Hubert, 1975); Adjusted
ratio of clustering (Roenker et al., 1971); D-index
(Dalrymple-Alford, 1970); Modified ratio of repetition (Bower,
Lesgold, and Tieman, 1969); Dunn's index (Dunn, 1973); Variations
of Dunn's index (Bezdek and Pal, 1998); Strict separation (based on
Balcan, Blum, and Vempala, 2008); and many more.
- Slide 19
- Why is the CQM formalism more faithful to intuition? In the
setting of clustering functions, the consistency axiom requires
that consistent changes to the underlying distance should not
create any new contenders for the best clustering of the data.
[Figure: clusterings $C$ and $C'$ under distance functions $d$ and
$d'$.] A clustering function that satisfies Kleinberg's Consistency
cannot output $C'$.
- Slide 20
- Why is the CQM formalism more faithful to intuition? In the
setting of clustering-quality measures, the consistency axiom
requires only that the quality of a given clustering $C$ does not
get worse. [Figure: clusterings $C$ and $C'$ under distance
functions $d$ and $d'$.] While the quality of $C$ improves, a
different clustering, $C'$, can still have better quality.
- Slide 21
- Summary. The intuition behind Kleinberg's axioms is consistent
(in spite of his impossibility result). The impossibility result
can be overcome by a change of formalism. We do this by focusing on
clustering-quality measures. Every reasonable clustering-quality
measure satisfies our axioms.
- Slide 22
- Future Work. How can the completeness of a set of axioms be
argued? Are the axioms useful for gaining interesting new insights
about clusterings? Can we find properties that distinguish
different clustering paradigms?
- Slide 23
- Appendix: Another Clustering-Quality Measure: Gamma (Baker &
Hubert, 1975). Gamma is the best-performing measure in Milligan's
study of 30 internal criteria (Milligan, 1981). Let $d(+)$ denote
the number of times that points which were clustered together in
$C$ had distance greater than two points which were not in the same
cluster. Let $d(-)$ denote the opposite result. Gamma satisfies
scale invariance, consistency, richness, and isomorphism
invariance.
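The two counts $d(+)$ and $d(-)$ can be computed by comparing every within-cluster pair against every between-cluster pair. The sketch below follows the wording of the definitions above; the function name `gamma_counts`, the handling of ties (counted in neither total), and the toy data are assumptions made for illustration. It computes only the counts and does not reproduce how the transcript's slide combines them into Gamma.

```python
from itertools import combinations

def gamma_counts(points, clustering, d):
    """Return (d_plus, d_minus) as defined in the text:
    d_plus  = comparisons where a within-cluster pair is farther apart
              than a between-cluster pair,
    d_minus = comparisons where it is closer. Ties count in neither."""
    label = {x: i for i, cluster in enumerate(clustering) for x in cluster}
    within, between = [], []
    for x, y in combinations(points, 2):
        (within if label[x] == label[y] else between).append(d(x, y))
    d_plus = sum(1 for w in within for b in between if w > b)
    d_minus = sum(1 for w in within for b in between if w < b)
    return d_plus, d_minus

# Hypothetical toy data: four points on the integer line.
points = [0, 1, 10, 11]
d = lambda x, y: abs(x - y)
good = [{0, 1}, {10, 11}]   # groups nearby points together
bad = [{0, 10}, {1, 11}]    # groups distant points together

print(gamma_counts(points, good, d))
print(gamma_counts(points, bad, d))
```

On this data the clustering that groups nearby points has no comparison in which a within-cluster pair is farther apart than a between-cluster pair, while the other clustering has several.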
- Slide 24
- Variants of Quality Measures. Given a clustering-quality measure
$m$, we can create new ones by applying it to a subset of the
clusters: $m_{\min}(C, d) = \min_{S \subseteq C} m(S, d)$, where
$S$ is a subset of at least 2 clusters in $C$. Similarly, we can
define $m_{\max}$ and $m_{\text{average}}$. If $m$ satisfies the
axioms of clustering-quality measures, then so do $m_{\min}$,
$m_{\max}$, and $m_{\text{average}}$.
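The construction above can be sketched as a set of higher-order functions. This is an illustration under assumptions: `dunn_like` is a hypothetical toy measure (smallest between-cluster distance divided by largest within-cluster distance) standing in for $m$, and $m(S, d)$ is interpreted as applying $m$ to the chosen clusters only.

```python
from itertools import combinations
from statistics import mean

def sub_clusterings(clustering):
    """Yield every sub-collection S of at least 2 clusters."""
    for k in range(2, len(clustering) + 1):
        for subset in combinations(clustering, k):
            yield list(subset)

def m_min(m, clustering, d):
    return min(m(s, d) for s in sub_clusterings(clustering))

def m_max(m, clustering, d):
    return max(m(s, d) for s in sub_clusterings(clustering))

def m_average(m, clustering, d):
    return mean(m(s, d) for s in sub_clusterings(clustering))

def dunn_like(clustering, d):
    # Hypothetical toy measure; every cluster needs at least 2 points
    # at distinct positions so the denominator is nonzero.
    within = max(d(x, y) for c in clustering for x, y in combinations(c, 2))
    between = min(
        d(x, y) for a, b in combinations(clustering, 2) for x in a for y in b
    )
    return between / within

# Hypothetical toy data: three clusters on the integer line.
clusters = [(0, 1), (10, 11), (13, 14)]
d = lambda x, y: abs(x - y)

print(m_min(dunn_like, clusters, d))
print(m_max(dunn_like, clusters, d))
print(m_average(dunn_like, clusters, d))
```

Enumerating all sub-collections is exponential in the number of clusters, so this direct form is only practical for small clusterings.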