The Stability of a Good Clustering


Marina Meila, University of Washington

mmp@stat.washington.edu

Data (similarities), Objective, Algorithm (spectral clustering, K-means)

Optimizing these criteria is NP-hard (the worst case)

...but "spectral clustering, K-means work well when a good clustering exists" (the interesting case)

This talk:
- If a "good" clustering exists, it is "unique"
- If a "good" clustering is found, it is provably good

Results summary

Given: an objective (NCut or the K-means distortion), the data, and a clustering Y with K clusters.

- A spectral lower bound on the distortion.
- If the gap between the distortion of Y and this lower bound is small, then d(Y, Y_opt) is small, where Y_opt = the best clustering with K clusters.

A graphical view (figure: the clusterings and the lower bound on the distortion)

Overview
- Introduction
- Matrix representations for clusterings
- Quadratic representation for the clustering cost
- The misclassification error distance
- Results for NCut (easier)
- Results for the K-means distortion (harder)
- Discussion

Clusterings as matrices

A clustering of {1, 2, ..., n} with K clusters (C_1, C_2, ..., C_K) is represented by an n x K matrix:

- unnormalized (call it X̃): X̃_{ik} = 1 if i is in C_k, 0 otherwise
- normalized (call it X): the columns of X̃ rescaled to unit length, i.e. X_{ik} = 1/sqrt(|C_k|) for i in C_k

All these matrices have orthogonal columns.
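A minimal sketch of these two representations in NumPy (the function and variable names are mine, not from the slides):

```python
import numpy as np

def indicator_matrices(labels, K):
    """Build the unnormalized and normalized n x K cluster indicator matrices.

    labels : length-n integer array with values in {0, ..., K-1}.
    """
    n = len(labels)
    X_unnorm = np.zeros((n, K))
    X_unnorm[np.arange(n), labels] = 1.0       # 1 if point i belongs to cluster k
    sizes = X_unnorm.sum(axis=0)               # |C_k| for each cluster
    X_norm = X_unnorm / np.sqrt(sizes)         # columns rescaled to unit length
    return X_unnorm, X_norm

labels = np.array([0, 0, 1, 1, 1, 2])
Xu, Xn = indicator_matrices(labels, K=3)
print(Xn.T @ Xn)   # identity matrix: the normalized columns are orthonormal
```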

The distortion is quadratic in X:

- NCut (similarities S, A = D^{-1/2} S D^{-1/2} with D = diag of the row sums of S): NCut(C) = K - trace(X^T A X), with X the degree-weighted normalized indicator matrix
- K-means (A = Gram matrix of the data): distortion(C) = trace(A) - trace(X^T A X), with X the normalized indicator matrix
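A small numerical check of both quadratic forms, under the conventions assumed above (A = D^{-1/2} S D^{-1/2} for NCut, the Gram matrix for K-means); the toy setup is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 12, 3
labels = np.repeat(np.arange(K), n // K)     # toy clustering: 4 points per cluster

# K-means: distortion(C) = trace(A) - trace(X^T A X), A = Gram matrix, X = normalized indicator
pts = rng.normal(size=(n, 5))
Xu = np.eye(K)[labels]                       # unnormalized indicator
Xn = Xu / np.sqrt(Xu.sum(axis=0))            # unit-length columns
A = pts @ pts.T
quad = np.trace(A) - np.trace(Xn.T @ A @ Xn)
means = Xu.T @ pts / Xu.sum(axis=0)[:, None]
direct = ((pts - means[labels]) ** 2).sum()  # sum of squared distances to cluster means
print(np.isclose(quad, direct))              # True

# NCut: NCut(C) = K - trace(X^T A X), A = D^{-1/2} S D^{-1/2}, X = degree-weighted indicator
S = rng.random((n, n)); S = (S + S.T) / 2    # symmetric similarities
d = S.sum(axis=1)
A = S / np.sqrt(np.outer(d, d))
Xw = (Xu * np.sqrt(d)[:, None]) / np.sqrt((Xu * d[:, None]).sum(axis=0))
quad = K - np.trace(Xw.T @ A @ Xw)
direct = sum(1 - S[np.ix_(labels == k, labels == k)].sum() / d[labels == k].sum()
             for k in range(K))
print(np.isclose(quad, direct))              # True
```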

The confusion matrix

Two clusterings: (C_1, C_2, ..., C_K) with K clusters and (C'_1, C'_2, ..., C'_{K'}) with K' clusters.

Confusion matrix M (K x K'): m_{kk'} = |C_k intersect C'_{k'}|; in matrix form, M = X̃^T X̃' for the unnormalized indicator matrices.

The misclassification error distance

d(C, C') = 1 - (1/n) max over one-to-one matchings of clusters k to clusters k' of the matched confusion mass, computed by the maximal bipartite matching algorithm on the confusion matrix.
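A sketch of this distance in Python, using SciPy's assignment routine for the maximal bipartite matching (the helper names and the toy clusterings are mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def confusion_matrix(labels, labels_prime, K, K_prime):
    """m[k, k'] = number of points in C_k and C'_k' simultaneously."""
    M = np.zeros((K, K_prime), dtype=int)
    for a, b in zip(labels, labels_prime):
        M[a, b] += 1
    return M

def misclassification_error(labels, labels_prime, K, K_prime):
    """d(C, C') = 1 - (1/n) * (weight of the maximal bipartite matching)."""
    M = confusion_matrix(labels, labels_prime, K, K_prime)
    row, col = linear_sum_assignment(-M)      # maximize the matched confusion mass
    return 1.0 - M[row, col].sum() / len(labels)

C  = np.array([0, 0, 0, 1, 1, 2, 2, 2])
Cp = np.array([1, 1, 0, 0, 0, 2, 2, 2])
print(misclassification_error(C, Cp, 3, 3))   # 0.125: one point in eight disagrees under the best matching
```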

Results for NCut

Given: the data as an n x n matrix A (the normalized similarities, as above), and a clustering X (n x K).

Lower bound for NCut (M02, YS03, BJ03): NCut(X) >= K - (λ_1 + ... + λ_K), where λ_1 >= λ_2 >= ... are the largest eigenvalues of A.

Upper bound for d(X, X_opt) (MSX'05): whenever the gap δ(X) = NCut(X) - (K - λ_1 - ... - λ_K) is small with respect to the eigengap λ_K - λ_{K+1}, X is close to X*.

From there:
- two clusterings X, X' both close to X* => trace(X^T X') large
- trace(X^T X') large => d(X, X') small (by a convexity argument)
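A toy check of these two steps, using unweighted normalized indicators for simplicity (my own illustration, not from the slides): trace(X^T X') drops below K as the misclassification distance grows.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def normalized_indicator(labels, K):
    Xu = np.eye(K)[labels]
    return Xu / np.sqrt(Xu.sum(axis=0))

def misclassification_error(labels, labels_prime, K):
    M = np.zeros((K, K))
    np.add.at(M, (labels, labels_prime), 1)          # confusion matrix
    r, c = linear_sum_assignment(-M)                 # maximal bipartite matching
    return 1 - M[r, c].sum() / len(labels)

base = np.repeat(np.arange(3), 20)                   # reference clustering: K = 3, n = 60
for flips in (0, 3, 10):                             # reassign `flips` points to another cluster
    other = base.copy()
    other[:flips] = (other[:flips] + 1) % 3
    X, Xp = normalized_indicator(base, 3), normalized_indicator(other, 3)
    print(flips, np.trace(X.T @ Xp), misclassification_error(base, other, 3))
# trace(X^T X') shrinks from K = 3 as the misclassification error distance grows
```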

Relaxed minimization: minimize K - trace(X^T A X) over X, subject to X being an n x K matrix with orthonormal columns. Solution: X* = the K principal eigenvectors of A.
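A sketch of the relaxed solution, the resulting lower bound, and the gap δ in NumPy, under the A = D^{-1/2} S D^{-1/2} convention assumed earlier (the helper names are mine):

```python
import numpy as np

def spectral_lower_bound(S, K):
    """Return (bound, eigengap, X_star) for the NCut relaxation.

    bound    = K - sum of the K largest eigenvalues of A = D^{-1/2} S D^{-1/2}
    eigengap = lambda_K - lambda_{K+1}
    X_star   = the K principal eigenvectors of A (solution of the relaxed problem)
    """
    d = S.sum(axis=1)
    A = S / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(A)          # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]
    return K - evals[:K].sum(), evals[K - 1] - evals[K], evecs[:, :K]

def ncut(S, labels, K):
    """Multiway normalized cut of the clustering given by `labels`."""
    d = S.sum(axis=1)
    return sum(1 - S[np.ix_(labels == k, labels == k)].sum() / d[labels == k].sum()
               for k in range(K))

# gap delta = NCut(X) - lower bound; the theorem needs delta small relative to the eigengap
rng = np.random.default_rng(1)
S = rng.random((30, 30)); S = (S + S.T) / 2
labels = np.repeat(np.arange(3), 10)
bound, eigengap, _ = spectral_lower_bound(S, K=3)
delta = ncut(S, labels, 3) - bound
print(delta, eigengap)
```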

Why the eigengap matters

Example: A has 3 diagonal blocks but K = 2. Merging any two of the blocks gives a clustering with zero gap, so gap(C) = gap(C') = 0 for two such clusterings C and C', yet C and C' are not close. Here the eigengap λ_K - λ_{K+1} is zero, so the theorem does not apply.
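A small numerical illustration of this example (the block sizes and the two specific clusterings are my own choices):

```python
import numpy as np

# Three diagonal blocks of 5 points each: similarity 1 within a block, 0 across.
m = 5
S = np.kron(np.eye(3), np.ones((m, m)))
d = S.sum(axis=1)
A = S / np.sqrt(np.outer(d, d))

evals = np.sort(np.linalg.eigvalsh(A))[::-1]
print(evals[:4])                  # 1, 1, 1, 0  ->  eigengap lambda_2 - lambda_3 = 0 for K = 2

def ncut(S, labels, K):
    d = S.sum(axis=1)
    return sum(1 - S[np.ix_(labels == k, labels == k)].sum() / d[labels == k].sum()
               for k in range(K))

# Two different 2-clusterings, each merging a different pair of blocks.
C  = np.array([0]*m + [0]*m + [1]*m)   # blocks 1+2 vs block 3
Cp = np.array([0]*m + [1]*m + [1]*m)   # block 1 vs blocks 2+3
bound = 2 - evals[:2].sum()            # spectral lower bound for K = 2
print(ncut(S, C, 2) - bound, ncut(S, Cp, 2) - bound)   # both gaps are 0
# Yet C and Cp disagree on a third of the points: zero eigengap, no stability.
```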

Remarks on the stability results
- No explicit conditions on S.
- Different flavor from other stability results, e.g. Kannan et al. '00, Ng et al. '01, which assume S is "almost" block diagonal.
- But... the results apply only if a good clustering is found; there are S matrices for which no clustering satisfies the theorem.
- The bound depends on aggregate quantities like K and the cluster sizes (= probabilities).
- Points are weighted by their volumes (degrees): good in some applications; bounds for unweighted distances can be obtained.

Is the bound ever informative? An experiment: S perfect + additive noise
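A sketch of one such experiment, interpreting "perfect" as block diagonal (my assumption) and checking whether the gap δ is small relative to the eigengap:

```python
import numpy as np

rng = np.random.default_rng(2)
m, K, sigma = 20, 3, 0.1
n = m * K

# "Perfect" S: taken here to be block diagonal (my assumption), plus symmetric additive noise.
S = np.kron(np.eye(K), np.ones((m, m))) + sigma * rng.random((n, n))
S = (S + S.T) / 2
labels = np.repeat(np.arange(K), m)            # the planted clustering

d = S.sum(axis=1)
A = S / np.sqrt(np.outer(d, d))
evals = np.sort(np.linalg.eigvalsh(A))[::-1]

ncut_val = sum(1 - S[np.ix_(labels == k, labels == k)].sum() / d[labels == k].sum()
               for k in range(K))
delta = ncut_val - (K - evals[:K].sum())       # gap to the spectral lower bound
eigengap = evals[K - 1] - evals[K]
print(delta, eigengap)                         # delta small relative to the eigengap -> bound informative
```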

We can do the same for the K-means distortion... but the K-th principal subspace is typically not stable.

K-means distortion (figure: an example with K = 4, dim = 30)

New approach: use K-1 vectors, a non-redundant representation Y.

A new expression for the distortion in terms of Y, ...and a new (relaxed) optimization problem.

Solution of the new problem

Relaxed optimization problem: given A, optimize over relaxed Y.

Solution: expressed in terms of U = the K-1 principal eigenvectors of A and W = a K x K orthogonal matrix with a fixed vector on its first row.

The same chain of arguments as for NCut:
- solve the relaxed minimization; gap small => Y close to Y*
- Y, Y' both close to Y* => ||Y^T Y'||_F large
- ||Y^T Y'||_F large => d(Y, Y') small

Theorem: for any two clusterings Y, Y' (satisfying a positivity condition), d(Y, Y') is bounded whenever the gaps are small enough relative to the eigengap.

Corollary: a bound for d(Y, Y_opt).

Experiments

20 replicates, K = 4, dim = 30. (Figure: the true error and the bound, plotted against p_min.)

Conclusions
- First (?) distribution-independent bounds on the clustering error: data dependent, and they hold when the data is well clustered (this is the case of interest).
- Tight? Not yet...
- In addition: an improved variational bound for the K-means cost; showed a local equivalence between the "misclassification error" distance and the "Frobenius norm distance" (also known as the χ² distance).

Related work
- Bounds for mixtures of Gaussians (Dasgupta, Vempala)
- Nearest K-flat to n points (Tseng)
- Variational bounds for sparse PCA (Moghaddam)