Clustering IV

Clustering IV

Outline

• Impossibility theorem for clustering

• Density-based clustering and subspace clustering

• Bi-clustering or co-clustering

• Validating clustering results

• Randomization tests

General form of impossibility results

• Define a set of simple axioms (properties) that a computational task should satisfy

• Prove that there does not exist an algorithm that can simultaneously satisfy all the axioms impossibility

Computational task: clustering

• A clustering function operates on a set X of n points. X = {1,2,…,n}

• Distance function d: X x X R with d(i,j)≥0, d(i,j)=d(j,i), and d(i,j)=0 only if i=j

• Clustering function f: f(X,d) = Γ, where Γ is a partition of X

Axiom 1: Scale invariance

• For a>0, distance function ad has values (ad)(i,j)=ad(i,j)

• For any d and for any a>0 we have f(d) = f(ad)

• The clustering function should not be sensitive to the changes in the units of distance measurement – should not have a built-in “length scale”

Axiom 2: Richness

• The range of f is equal to the set of partitions of X

• For any X and any partition Γ of X, there is a distance function on X such that f(X,d) = Γ.

Axiom 3: Consistency

• Let Γ be a partition of X• d, d’ two distance functions on X• d’ is a Γ-transformation of d, if

– For all i,jє X in the same cluster of Γ, we have d’(i,j)≤d(i,j)

– For all i,jє X in different clusters of Γ, we have d’(i,j)≥d(i,j)

• Consistency: if f(X,d)= Γ and d’ is a Γ-transformation of d, then f(X,d’)= Γ.

Axiom 3: Consistency

• Intuition: Shrinking distances between points inside a cluster and expanding distances between points in different clusters does not change the result

Examples

• Single-link agglomerative clustering• Repeatedly merge clusters whose closest points are

at minimum distance • Continue until a stopping criterion is met

– k-cluster stopping criterion: continue until there are k clusters

– distance-r stopping criterion: continue until all distances between clusters are larger than r

– scale-a stopping criterion: let d* be the maximum pairwise distance; continue until all distances are larger than ad*

Examples (cont.)

• Single-link agglomerative clustering with k-cluster stopping criterion does not satisfy richness axiom

• Single-link agglomerative clustering with distance-r stopping criterion does not satisfy scale-invariance property

• Single-link agglomerative clustering with scale-a stopping criterion does not satisfy consistency property

Centroid-based clustering and consistency

• k-centroid clustering: – S subset of X for which ∑iєXminjєS{d(i,j)} is

minimized– Partition of X is defined by assigning each element

of X to the centroid that is the closest to it

• Theorem: for every k≥2 and for n sufficiently large relative to k, the k-centroid clustering function does not satisfy the consistency property

k-centroid clustering and the consistency axiom

• Intuition of the proof• Let k=2 and X be partitioned into parts Y and Z• d(i,j) ≤ r for every i,j є Y• d(i,j) ≤ ε, with ε<r for every i,j є Z• d(i,j) > r for every i є Y and j є Z

• Split part Y into subparts Y1 and Y2

• Shrink distances in Y1 appropriately• What is the result of this shrinking?

Impossibility theorem

• For n≥2, there is no clustering function that satisfies all three axioms of scale-invariance, richness and consistency

Impossibility theorem (proof sketch)

• A partition Γ’ is a refinement of partition Γ, if each cluster C’є Γ’ is included in some set Cє Γ

• There is a partial order between partitions: Γ’≤ Γ

• Antichain of partitions: a collection of partitions such that no one is a refinement of others

• Theorem: If a clustering function f satisfies scale-invariance and consistency, then, the range of f is an anti-chain

What does an impossibility result really mean

• Suggests a technical underpinning for the difficulty in unifying the initial, informal concept of clustering

• Highlights basic trade-offs that are inherent to the clustering problem

• Distinguishes how clustering methods resolve these tradeoffs (by looking at the methods not only at an operational level)

Outline






Density-Based Clustering Methods• Clustering based on density (local cluster criterion), such as

density-connected points• Major features:

– Discover clusters of arbitrary shape– Handle noise– One scan– Need density parameters as termination condition

• Several interesting studies:– DBSCAN: Ester, et al. (KDD’96)– OPTICS: Ankerst, et al (SIGMOD’99).– DENCLUE: Hinneburg & D. Keim (KDD’98)– CLIQUE: Agrawal, et al. (SIGMOD’98)

Classification of points in density-based clustering

• Core points: Interior points of a density-based cluster. A point p is a core point if for distance Eps :– |NEps(p)={q | dist(p,q) <= e }| ≥ MinPts

• Border points: Not a core point but within the neighborhood of a core point (it can be in the neighborhoods of many core points)

• Noise points: Not a core or a border point

Core, border and noise points

EpsEps Eps

DBSCAN: The Algorithm

– Label all points as core, border, or noise points

– Eliminate noise points

– Put an edge between all core points that are within Eps of each other

– Make each group of connected core points into a separate cluster

– Assign each border point to one of the cluster of its associated core points

Time and space complexity of DBSCAN

• For a dataset X consisting of n points, the time complexity of DBSCAN is O(n x time to find points in the Eps-neighborhood)

• Worst case O(n2)

• In low-dimensional spaces O(nlogn); efficient data structures (e.g., kd-trees) allow for efficient retrieval of all points within a given distance of a specified point

Strengths and weaknesses of DBSCAN

• Resistant to noise

• Finds clusters of arbitrary shapes and sizes

• Difficulty in identifying clusters with varying densities

• Problems in high-dimensional spaces; notion of density unclear

• Can be computationally expensive when the computation of nearest neighbors is expensive

Generic density-based clustering on a grid

• Define a set of grid cells• Assign objects to appropriate cells and

compute the density of each cell• Eliminate cells that have density below a given

threshold τ• Form clusters from “contiguous” (adjacent)

groups of dense cells

Questions

• How do we define the grid?

• How do we measure the density of a grid cell?

• How do we deal with multidimensional data?

Clustering High-Dimensional Data• Clustering high-dimensional data

– Many applications: text documents, DNA micro-array data

– Major challenges:

• Many irrelevant dimensions may mask clusters

• Distance measure becomes meaningless—due to equi-distance

• Clusters may exist only in some subspaces

• Methods

– Feature transformation: only effective if most dimensions are relevant

• PCA & SVD useful only when features are highly correlated/redundant

– Feature selection: wrapper or filter approaches

• useful to find a subspace where the data have nice clusters

– Subspace-clustering: find clusters in all the possible subspaces

• CLIQUE

The Curse of Dimensionality

• Data in only one dimension is relatively packed

• Adding a dimension “stretches” the points across that dimension, making them further apart

• Adding more dimensions will make the points further apart—high dimensional data is extremely sparse

• Distance measure becomes meaningless

(graphs from Parsons et al. KDD Explorations 2004)

Why Subspace Clustering?(Parsons et al. SIGKDD Explorations 2004)

• Clusters may exist only in some subspaces

• Subspace-clustering: find clusters in some of the subspaces

CLIQUE (Clustering In QUEst) • Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)

• Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space

• CLIQUE can be considered as both density-based and grid-based– It partitions each dimension into the same number of equal length interval

– It partitions an m-dimensional data space into non-overlapping rectangular units

– A unit is dense if the fraction of total data points contained in the unit exceeds an input threshold τ

– A cluster is a maximal set of connected dense units within a subspace

The CLIQUE algorithm• Find all dense areas in the 1-dimensional spaces (single

attributes)• k 2• repeat

– Generate all candidate dense k-dimensional cells from dense (k-1)-dimensional cells

– Eliminate cells that have fewer than τ points– k k+1

• until there are no candidate dense k-dimensional cells• Find clusters by taking the union of all adjacent, high-density

cells• Summarize each cluster using a small set of inequalities that

describe the attribute ranges of the cells in the cluster

CLIQUE: Monotonicity property• “If a set of points forms a density-based cluster in k-

dimensions (attributes), then the same set of points is also part of a density-based cluster in all possible subsets of those dimensions”

Strengths and weakness of CLIQUE

• automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces

• insensitive to the order of records in input and does not presume some canonical data distribution

• scales linearly with the size of input and has good scalability as the number of dimensions in the data increases

• The accuracy of the clustering result may be degraded at the expense of simplicity of the method

Outline






A

Clustering

3 0 6 8 9 7

2 3 4 12 8 10

1 2 3 10 9 8

0 8 4 8 7 9

2 4 3 11 9 10

16 10 13 6 7 5

10 8 9 2 3 7

• m points in Rn

• Group them to k clusters• Represent them by a matrix ARm×n

– A point corresponds to a row of A• Cluster: Partition the rows to k groups

m

nRn

Co-Clustering

3 0 6 8 9 7

2 3 4 12 8 10

1 2 3 10 9 8

0 8 4 8 9 7

2 4 3 11 9 10

16 10 13 6 7 5

10 8 9 2 3 7

A

• Co-Clustering: Cluster rows and columns of A simultaneously:

k = 2

ℓ = 2Co-cluster

Motivation: Sponsored Search

Main revenue for search engines

• Advertisers bid on keywords• A user makes a query• Show ads of advertisers that are relevant and have high bids• User clicks or not an ad

Ads

Motivation: Sponsored Search

• For every(advertiser, keyword) pairwe have:

– Bid amount– Impressions– # clicks

• Mine information at query time – Maximize # clicks / revenue

Ski boots

Co-Clusters in Sponsored Search

Advertiser

Keyw

ords

Vancouver

Air France

Skis.com

Bids of skis.com for “ski boots”

Markets = co-clusters

All these keywords are relevantto a set of advertisers

Co-Clustering in Sponsored Search

Applications:

• Keyword suggestion– Recommend to advertisers other relevant keywords

• Broad matching / market expansion– Include more advertisers to a query

• Isolate submarkets– Important for economists– Apply different advertising approaches

• Build taxonomies of advertisers / keywords

Motivation: Biology

• Gene-expression data in computational biology– need simultaneous

characterizations of genes

ATTCGT

Co-Clusters in Gene Expression Data

Conditions

Gene

s

GCATD

O2 present,T = 23ºC

O2 absent,T = 10ºC

Expression level of the gene under the conditions

Co-cluster

All these genes are activatedfor some set of conditions

A

Clustering

3 0 6 8 9 7

2 3 4 12 8 10

1 2 3 10 9 8

0 8 4 8 7 9

2 4 3 11 9 10

16 10 13 6 7 5

10 8 9 2 3 7

• m points in Rn

• Group them to k clusters• Represent them by a matrix ARm×n

– A point corresponds to a row of A• Cluster: Partition the rows to k groups

m

nRn

Clustering

3 0 6 8 9 7

2 3 4 12 8 10

1 2 3 10 9 8

0 8 4 8 7 9

2 4 3 11 9 10

16 10 13 6 7 5

10 8 9 2 3 7

1

1

1

1

1

1

1

1.6 3.4 4 9.8 8.4 8.8

13 9 11 4 5 6

1.6 3.4 4 9.8 8.4 8.81.6 3.4 4 9.8 8.4 8.81.6 3.4 4 9.8 8.4 8.81.6 3.4 4 9.8 8.4 8.81.6 3.4 4 9.8 8.4 8.813 9 11 4 5 613 9 11 4 5 6

A MR R M

• MRk×n: Cluster centroids• RRm×k: Assignment matrix• Approximate A by RM

• Clustering: Find M, R such that:

Rn

R MA

Distance Metric || · ||

3 0 6 8 9 72 3 4 1

28 10

1 2 3 10

9 8

0 8 4 8 7 92 4 3 11 9 10

16 10 13 6 7 510 8 9 2 3 7

1.6 3.4

4 9.8 8.4

8.8

1.6 3.4

4 9.8 8.4

8.8

1.6 3.4

4 9.8 8.4

8.8

1.6 3.4

4 9.8 8.4

8.8

1.6 3.4

4 9.8 8.4

8.8

13 9 11 4 5 613 9 11 4 5 6

AI

Measuring the quality of a clustering

Theorem. For any k > 0, p ≥ 1 there is a 24-approximation.

Proof. Similar to general k-median, relaxed triangle inequality.

Column Clustering

3 0 6 8 9 7

2 3 4 12 8 10

1 2 3 10 9 8

0 8 4 8 7 9

2 4 3 11 9 10

16 10 13 6 7 5

10 8 9 2 3 7

1 1 1

1 1 1

A MC MC C

• ℓ column clusters • MCRm×ℓ: Cluster centroids• CRℓ×m: Column assignments

• Approximate A by MCC :

3 83 102 94 83 1013 69 4

3 3 3 8 8 8

3 3 3 10 10 10

2 2 2 9 9 9

4 4 4 8 8 8

3 3 3 10 10 10

13 13 13 6 6 6

9 9 9 4 4 4

C

Co-Clustering

3 0 6 8 9 7

2 3 4 12 8 10

1 2 3 10 9 8

0 8 4 8 9 7

2 4 3 11 9 10

16 10 13 6 7 5

10 8 9 2 3 7

1

1

1

1

1

1

1

1 1 1

1 1 1

A MR R M C

• Co-Clustering: Cluster rows and columns of ARm×n simultaneously• k row clusters, ℓ column clusters• MRk×ℓ: Co-cluster means• RRm×k: Row assignments• CRℓ×m: Column assignments

• Co-Clustering: Find M, R, C such that:

3 9

11 5

3 3 3 9 9 93 3 3 9 9 93 3 3 9 9 93 3 3 9 9 93 3 3 9 9 911 11 11 5 5 511 11 11 5 5 5

C

Co-Clustering Objective Function3 0 6 8 9 7

2 3 4 12 8 10

1 2 3 10 9 8

0 8 4 8 7 9

2 4 3 11 9 10

16 10 13 6 7 5

10 8 9 2 3 7

A R M C

3 3 3 9 9 93 3 3 9 9 93 3 3 9 9 93 3 3 9 9 93 3 3 9 9 911 11 11 5 5 511 11 11 5 5 5

AIJ

Cost of co-cluster AIJ

Some Background• A.k.a.: biclustering, block clustering, …

• Many objective functions in co-clustering– This is one of the easier– Others factor out row-column average (priors)– Others based on information theoretic ideas (e.g. KL divergence)

• A lot of existing work, but mostly heuristic– k-means style, alternate between rows/columns– Spectral techniques

Algorithm

1. Cluster rows of A

2. Cluster columns of A

3. Combine

Properties of the algorithmTheorem 1. Algorithm with optimal row/column clusterings is 3-approximation to co-clustering optimum.

Theorem 2. For ℓ2 the algorithm with optimal row/column clusterings is a 2-approximation.

Outline

• Impossibility theorem for clustering [Jon Kleinberg, An impossibility theorem for clustering, NIPS 2002]


• Subspace clustering and density-based clustering



Cluster Validity • For supervised classification we have a variety of measures to

evaluate how good our model is– Accuracy, precision, recall

• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters?

• But “clusters are in the eye of the beholder”!

• Then why do we want to evaluate them?– To avoid finding patterns in noise– To compare clustering algorithms– To compare two sets of clusters– To compare two clusters

Clusters found in Random Data

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y

Random Points

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y

K-means

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y

DBSCAN

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y

Complete Link

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.

2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.

3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.

- Use only the data

4. Comparing the results of two different sets of cluster analyses to determine which is better.

5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Different Aspects of Cluster Validation

• Numerical measures that are applied to judge various aspects of cluster validity, are classified into the following three types.– External Index: Used to measure the extent to which cluster labels match

externally supplied class labels.• Entropy

– Internal Index: Used to measure the goodness of a clustering structure without respect to external information.

• Sum of Squared Error (SSE)

– Relative Index: Used to compare two different clusterings or clusters. • Often an external or internal index is used for this function, e.g., SSE or entropy

• Sometimes these are referred to as criteria instead of indices– However, sometimes criterion is the general strategy and index is the numerical

measure that implements the criterion.

Measures of Cluster Validity

• Two matrices – Proximity Matrix– “Incidence” Matrix

• One row and one column for each data point• An entry is 1 if the associated pair of points belong to the same cluster• An entry is 0 if the associated pair of points belongs to different clusters

• Compute the correlation between the two matrices– Since the matrices are symmetric, only the correlation between

n(n-1) / 2 entries needs to be calculated.

• High correlation indicates that points that belong to the same cluster are close to each other.

• Not a good measure for some density or contiguity based clusters.

Measuring Cluster Validity Via Correlation

Measuring Cluster Validity Via Correlation

• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y

Corr = 0.9235 Corr = 0.5810

• Order the similarity matrix with respect to cluster labels and inspect visually.

Using Similarity Matrix for Cluster Validation

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y

Points

Poi

nts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1


• Clusters in random data are not so crisp

Points

Poi

nts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

DBSCAN

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y

Points

Poi

nts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1



K-means

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y



0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y

Points

Poi

nts

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Complete Link


1 2

3

5

6

4

7

DBSCAN

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

500 1000 1500 2000 2500 3000

500

1000

1500

2000

2500

3000

• Clusters in more complicated figures aren’t well separated• Internal Index: Used to measure the goodness of a clustering structure

without respect to external information– SSE

• SSE is good for comparing two clusterings or two clusters (average SSE).• Can also be used to estimate the number of clusters

Internal Measures: SSE

2 5 10 15 20 25 300

1

2

3

4

5

6

7

8

9

10

K

SS

E

5 10 15

-6

-4

-2

0

2

4

6

Internal Measures: SSE

• SSE curve for a more complicated data set

1 2

3

5

6

4

7

SSE of clusters found using K-means

• Need a framework to interpret any measure. – For example, if our measure of evaluation has the value, 10, is that good, fair,

or poor?

• Statistics provide a framework for cluster validity– The more “atypical” a clustering result is, the more likely it represents valid

structure in the data– Can compare the values of an index that result from random data or

clusterings to those of a clustering result.• If the value of the index is unlikely, then the cluster results are valid

– These approaches are more complicated and harder to understand.

• For comparing the results of two different sets of cluster analyses, a framework is less necessary.

– However, there is the question of whether the difference between two index values is significant

Framework for Cluster Validity

• Cluster Cohesion: Measures how closely related are objects in a cluster

– Example: SSE• Cluster Separation: Measure how distinct or well-separated a

cluster is from other clusters• Example: Squared Error

– Cohesion is measured by the within cluster sum of squares (SSE)

– Separation is measured by the between cluster sum of squares

– Where |Ci| is the size of cluster i

Internal Measures: Cohesion and Separation

i Cx

ii

mxWSS 2)(

i

ii mmCBSS 2)(

• A proximity graph based approach can also be used for cohesion and separation.– Cluster cohesion is the sum of the weight of all links within a cluster.– Cluster separation is the sum of the weights between nodes in the cluster and

nodes outside the cluster.

Internal Measures: Cohesion and Separation

cohesion separation

• Silhouette Coefficient combine ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings

• For an individual point, i– Calculate a = average distance of i to the points in its cluster– Calculate b = min (average distance of i to points in another cluster)– The silhouette coefficient for a point is then given by

s = 1 – a/b if a < b, (or s = b/a - 1 if a b, not the usual case)

– Typically between 0 and 1. – The closer to 1 the better.

• Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures: Silhouette Coefficient

ab

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis.

Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”

Algorithms for Clustering Data, Jain and Dubes

Final Comment on Cluster Validity

Assessing the significance of clustering (and other data mining)

results

• Data X and algorithm A• Beautiful result A(D)• But: what does it mean?• How to determine whether the result is really

interesting or just due to chance?

Examples

• Pattern discovery: frequent itemsets or association rules

• From data X we can find a collection of nice patterns

• Significance of individual patterns is sometimes straightforward to test

• What about the whole collection of patterns? Is it surprising to see such a collection?

Examples

• In clustering or mixture modeling: we always get a result

• How to test if the whole idea of components/clusters in the data is good?

Classical methods

Classical methods

Randomization methods

• Goal in assessing the significance of results: could the result have occurred by chance?

• Randomization methods: create datasets that somehow reflect the characteristics of the true data

Randomization

• Create randomized versions from the data X• X1, X2,…,Xk

• Run algorithm A on these, producing results A(X1), A(X2),…,A(Xk)

• Check if the result A(X) on the real data is somehow different from these

• Empirical p-value: the fraction of cases for which the result on real data is (say) larger than A(X)

• If the empirical p-value is small, then there is something interesting in the data

Questions

• How is the data randomized?

• Can the sample X1, X2, …, Xk be computed efficiently?

• Can the values A(X1), A(X2), …, A(Xk) be computed efficiently?

How to randomize?

• How are datasets Xi generated?

• Randomly from a “null model”/ ”null hypothesis”

Date post:	23-Feb-2016
Category:	Documents
Upload:	kael
View:	98 times
Download:	0 times

Clustering IV

Documents