Topic 1 Clustering Basics · Topic 1 Clustering Basics CS898. Overview Basics (K-means) •...

transcript

Topic 1

Clustering Basics

Overview

Basics (K-means)

• variance clustering

• generalizations (parametric & non-parametric)

Kernel K-means

Probabilistic K-means,

• entropy clustering

Normalized Cut

Density biases

Spectral methods, bound optimization2

In the beginning there was…

Basic K-means:

squared L2 norm

features

K subsets of

objective function

output

extra parameters (K means)

In this talk: K-means refers mostly to this or related objectives(not to iterative Lloyd’s algorithm, 1957)

Basic K-means examples:

RGB features

color quantization

RGBXY features

superpixels

XY features only

Voronoi cells

pixelsfeatures

compared to RGB onlyXY adds spatial “compactness”

(quazi regularization)

Apply K-means to RGBXY features

Basic K-means examples:

Superpixels

[SLIC superpixels, Achanta et al., PAMI 2011]

K-means as non-parametric clustering

||||)(

equivalent (easy to check)

two standard formulas for sample variance

just plug-in

qSk f||

no parametersµk

K-means as variance clustering criteria

)var(||1

K SSE ==

both objectives can be written as

=> K-means is good for “compact blobs”

K-means – common extensions

k Spdkp

• Parametric methods with arbitrary likelihoods P( ˑ|θ)

(probabilistic K-means) [Kearns, Mansour & Ng, UAI’97]=

)|(log

Examples of P (ˑ|θ) : Gaussian, gamma, exponential, Gibbs, etc.

• Parametric methods with arbitrary distortion measure

(distortion clustering)

• Non-parametric (pairwise) methods with any kernel or affinity measure

(kernel K-means, average association, average distortion, normalized cut)

),( yxk

Examples of : quadratic absolute truncated(K-means) (K-medians) (K-modes)

Could be juxtaposed with GMM/EM as hard clustering via ML parameter fitting

~ kpf −

e.g. Gaussian 2

||||exp~

−− fP

replace dot-products by arbitrary kernel k

Probabilistic K-means Example:

Elliptic K-means

for Normal (Gaussian)distribution

(squared)Mahalanobis distance

Examples: a) Z - normal random vector with

m – meanΣ - covariance

b) X = AZ + m for arbitrary vector m and matrix A

distribution of XX = AZ + m

Elliptic K-means

Basic K-means

Elliptic K-means

Entropy Clustering

)|(log

Monte-Carlo estimation formula

Using “optimal” distributions θk that minimizes cross entropy we get

entropy clustering criteria:

cross entropy

requires sufficiently descriptive (complex) class of probability models that can fit data well.

Probabilistic K-means:

summary

- model fitting (to data)- log-likelihood (model) parameter estimation

- complex data requires complex models

basic K-means works only for compact clusters (blobs)

that are linearly separable

from complex models

towards complex embeddings

From basic K-means to kernel K-means

(high-dimensional embedding story)

Example:

data can become linearly separableafter some non-linear embedding

(typically in high dimensional space)

for some (non-linear) embedding function

(explicit)K-means procedure:

(update at time t+1)

equivalent formulation:

dim(H)x|Ω| embedding matrix

- cluster k at iteration t

|Ω| indicatorvector for cluster k

Assume for now that such embedding is given

equivalent formulation:

Gram matrix

dot products

(implicit)kernel K-means procedure:

Requires only kernel matrix K , no need to know explicit embedding Φ

Gram matrix

equivalent

Kernel trick: start with (any ?) kernel K

If we start from given pairwise affinities (kernel matrix K), sometimes it may still be useful to

think about embedding implicitly defined by the kernel (via decomposition )

(Mercer theorem : any p.s.d. kernels can be decomposed that way)

Q: why even worry about embedding Φ when using kernel K-means procedure?

A: (HINT) Think about convergence. What do we minimize via kernel K-means procedure?

Kernel Trick: p.s.d. kernels K are a standard way to (implicitly) define

some high-dimensional embedding Φ (corresponding to decomposition )

Q: what is dimension of each Φp ?

Example: Gaussian kernel

Kernel-induced embedding:

- isometry

and the corresponding kernel-induced metric:

kernel defines an inner product:(in the original feature space)

Kerned-defined Euclidean embedding is isometric to

the original features with kernel-induced metric

Kernel-induced embedding:

- non-linear separation of original features

high-dimensional isometric embedding induced by kernel Kcan make clusters linearly separable

original feature space with kernel-induced metric

kernel-induced Euclidean embedding

Intuition for such “magic” behind commonly used kernels (e.g. Gaussian)?

(robust metric story)

kernel K-means objective:

remember

robust metric focuses on local distortion (deemphasizes larger distances)

basic (linear) kernel

Examples:

squared Euclidean distance

distance in standard K-means

Gaussian kernel

distance in Gaussian kernel K-means2||||

2|||| k

kernel K-means objective:

and kernel-induced metric:

remember

robust metric focuses on local distortion (deemphasizes larger distances)

2|||| k

On importance of

positive-semi-definite (p.s.d.) kernels K

- Given any (e.g. non-p.s.d.) kernel, “diagonal shift” allows to formulate an equivalent kernel clustering objective with p.s.d. kernel (for sufficiently large scalar δ)

easy to verify equivalence of kernel K-means objectives for any scalar δwhile kernel K-means procedure is modified by the “shift” above

- (Mercer theorem) p.s.d. guarantees existence of explicit Euclidean embedding s.t.

that is

This allows to prove that implicit kernel K-means procedure convergesdue to its equivalence to convergent explicit K-means procedure for some

Weak kernel K-means

versus

Weak kernel K-means

versus

Weak kernel K-means

versus

for(due to isometry)

=Each corresponds to some .(These two give the same solution S where two objectives are equal since embedding is isometric.)

The opposite is not true.

Weak kernel K-means

Implicit search space for(higher-dimensional embedding space H)

is larger than search space for(original feature space)

for(due to isometry)

Weighted K-means and

Weighted kernel K-means

(unary) distortion between a point and a model

(pair-wise) distortion between two points

unary and pair-wise distortion clustering(general weighted case)

probabilistic K-means(ML model fitting)

kernel K-means(pairwise clustering)

basic K-means

p.d. kernel distance

weakkernel clustering

(unary) Hilbertian distortion

normalized cuts

K-modes(mean-shift)

GMM fitting

entropy clustering

elliptic K-means

gamma fitting

Gaussian kernel K-means

spectralratio cuts

distorton

Gibbs fitting

average cut

average association

averagedistortion

complex models complex embeddings(model parameter fitting)

(non-parametric)

Kernel Clustering

• kernel K-means, average association, Normalized Cuts, …• density biases: isolation of modes or sparse subsets• bound optimization

Topic 1 Clustering Basics · Topic 1 Clustering Basics CS898. Overview Basics (K-means) •...

Documents