Page 1: Clustering  IV

Clustering IV

Page 2: Clustering  IV

Outline

• Impossibility theorem for clustering

• Density-based clustering and subspace clustering

• Bi-clustering or co-clustering

• Validating clustering results

• Randomization tests

Page 3: Clustering  IV

General form of impossibility results

• Define a set of simple axioms (properties) that a computational task should satisfy

• Prove that no algorithm can simultaneously satisfy all of the axioms ⇒ impossibility

Page 4: Clustering  IV

Computational task: clustering

• A clustering function operates on a set X of n points. X = {1,2,…,n}

• Distance function d: X × X → R with d(i,j) ≥ 0, d(i,j) = d(j,i), and d(i,j) = 0 only if i = j

• Clustering function f: f(X,d) = Γ, where Γ is a partition of X

Page 5: Clustering  IV

Axiom 1: Scale invariance

• For a>0, distance function ad has values (ad)(i,j)=ad(i,j)

• For any d and for any a>0 we have f(d) = f(ad)

• The clustering function should not be sensitive to the changes in the units of distance measurement – should not have a built-in “length scale”

Page 6: Clustering  IV

Axiom 2: Richness

• The range of f is equal to the set of partitions of X

• For any X and any partition Γ of X, there is a distance function on X such that f(X,d) = Γ.

Page 7: Clustering  IV

Axiom 3: Consistency

• Let Γ be a partition of X
• Let d, d’ be two distance functions on X
• d’ is a Γ-transformation of d if:

– For all i, j ∈ X in the same cluster of Γ, we have d’(i,j) ≤ d(i,j)

– For all i, j ∈ X in different clusters of Γ, we have d’(i,j) ≥ d(i,j)

• Consistency: if f(X,d) = Γ and d’ is a Γ-transformation of d, then f(X,d’) = Γ.

Page 8: Clustering  IV

Axiom 3: Consistency

• Intuition: Shrinking distances between points inside a cluster and expanding distances between points in different clusters does not change the result

Page 9: Clustering  IV

Examples

• Single-link agglomerative clustering: repeatedly merge the two clusters whose closest points are at minimum distance
• Continue until a stopping criterion is met:

– k-cluster stopping criterion: continue until there are k clusters

– distance-r stopping criterion: continue until all distances between clusters are larger than r

– scale-a stopping criterion: let d* be the maximum pairwise distance; continue until all distances between clusters are larger than a·d*

Page 10: Clustering  IV

Examples (cont.)

• Single-link agglomerative clustering with k-cluster stopping criterion does not satisfy richness axiom

• Single-link agglomerative clustering with distance-r stopping criterion does not satisfy scale-invariance property

• Single-link agglomerative clustering with scale-a stopping criterion does not satisfy consistency property
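To make the second bullet concrete, here is a small sketch (the toy points, the radius r, and the scale factor are made up for illustration) that implements the distance-r criterion by cutting SciPy's single-linkage dendrogram at height r; scaling the data changes the resulting partition, which is exactly the failure of scale invariance:

```python
# Sketch: the distance-r stopping criterion is not scale invariant.
# Stopping single-link merges at inter-cluster distance r corresponds to
# cutting the single-linkage dendrogram at height r.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[0.0], [1.0], [1.5], [10.0], [11.0]])  # toy 1-d points (illustrative)
r = 2.0

def single_link_distance_r(points, r):
    d = pdist(points)                              # pairwise distances
    Z = linkage(d, method='single')                # single-link dendrogram
    return fcluster(Z, t=r, criterion='distance')  # stop merging above distance r

print(single_link_distance_r(X, r))      # e.g. [1 1 1 2 2] -> 2 clusters
print(single_link_distance_r(5 * X, r))  # scaling every distance by a = 5 yields a different partition
```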

Page 11: Clustering  IV

Centroid-based clustering and consistency

• k-centroid clustering:

– S: the subset of X (of size k) for which ∑_{i∈X} min_{j∈S} d(i,j) is minimized

– The partition of X is defined by assigning each element of X to the centroid that is closest to it

• Theorem: for every k≥2 and for n sufficiently large relative to k, the k-centroid clustering function does not satisfy the consistency property

Page 12: Clustering  IV

k-centroid clustering and the consistency axiom

• Intuition of the proof:
• Let k = 2 and X be partitioned into parts Y and Z
• d(i,j) ≤ r for every i, j ∈ Y
• d(i,j) ≤ ε, with ε < r, for every i, j ∈ Z
• d(i,j) > r for every i ∈ Y and j ∈ Z

• Split part Y into subparts Y1 and Y2

• Shrink the distances within Y1 appropriately
• What is the result of this shrinking?

Page 13: Clustering  IV

Impossibility theorem

• For n≥2, there is no clustering function that satisfies all three axioms of scale-invariance, richness and consistency

Page 14: Clustering  IV

Impossibility theorem (proof sketch)

• A partition Γ’ is a refinement of a partition Γ if each cluster C’ ∈ Γ’ is included in some cluster C ∈ Γ

• There is a partial order between partitions: Γ’ ≤ Γ

• Antichain of partitions: a collection of partitions such that no one of them is a refinement of another

• Theorem: If a clustering function f satisfies scale-invariance and consistency, then the range of f is an antichain
• Since richness requires the range of f to be the set of all partitions of X, which is not an antichain when n ≥ 2, no function can satisfy all three axioms

Page 15: Clustering  IV

What does an impossibility result really mean?

• Suggests a technical underpinning for the difficulty in unifying the initial, informal concept of clustering

• Highlights basic trade-offs that are inherent to the clustering problem

• Distinguishes how clustering methods resolve these trade-offs (by looking at the methods at more than just an operational level)

Page 16: Clustering  IV

Outline

• Impossibility theorem for clustering

• Density-based clustering and subspace clustering

• Bi-clustering or co-clustering

• Validating clustering results

• Randomization tests

Page 17: Clustering  IV

Density-Based Clustering Methods

• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:

– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as a termination condition

• Several interesting studies:
– DBSCAN: Ester et al. (KDD’96)
– OPTICS: Ankerst et al. (SIGMOD’99)
– DENCLUE: Hinneburg & Keim (KDD’98)
– CLIQUE: Agrawal et al. (SIGMOD’98)

Page 18: Clustering  IV

Classification of points in density-based clustering

• Core points: interior points of a density-based cluster. A point p is a core point if, for a given radius Eps:
– |N_Eps(p)| = |{q : dist(p,q) ≤ Eps}| ≥ MinPts

• Border points: Not a core point but within the neighborhood of a core point (it can be in the neighborhoods of many core points)

• Noise points: Not a core or a border point
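A minimal sketch of this classification, assuming a small NumPy array of points; the helper name classify_points and the values of Eps and MinPts are illustrative, and the neighborhoods are computed by brute force:

```python
# Sketch: classify points as core, border, or noise for given Eps and MinPts.
import numpy as np

def classify_points(X, eps, min_pts):
    n = len(X)
    # Eps-neighborhoods (each point is counted in its own neighborhood)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, 'noise', dtype=object)
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels[core] = 'core'
    for i in range(n):
        if not core[i] and core[neighbors[i]].any():   # within Eps of some core point
            labels[i] = 'border'
    return labels, neighbors, core

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5, [[20.0, 20.0]]])
labels, _, _ = classify_points(X, eps=1.0, min_pts=4)   # the isolated point ends up as noise
```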

Page 19: Clustering  IV

Core, border and noise points


Page 20: Clustering  IV

DBSCAN: The Algorithm

– Label all points as core, border, or noise points

– Eliminate noise points

– Put an edge between all core points that are within Eps of each other

– Make each group of connected core points into a separate cluster

– Assign each border point to one of the clusters of its associated core points
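As a rough illustration of these steps (not the original DBSCAN implementation), the sketch below builds the core-point graph explicitly, reuses the hypothetical classify_points() helper from the earlier sketch, and uses SciPy's connected_components for the grouping:

```python
# Sketch of the listed steps: label points, drop noise, connect core points
# within Eps, take connected components, then attach border points.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def dbscan(X, eps, min_pts):
    labels, neighbors, core = classify_points(X, eps, min_pts)
    n = len(X)
    cluster = np.full(n, -1)                 # -1 = noise
    core_idx = np.where(core)[0]
    # Edge between every pair of core points within Eps of each other
    rows, cols = [], []
    for i in core_idx:
        for j in neighbors[i]:
            if core[j]:
                rows.append(i); cols.append(j)
    graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    _, comp = connected_components(graph, directed=False)
    cluster[core_idx] = comp[core_idx]       # each component of core points = one cluster
    # Assign each border point to the cluster of one of its core neighbors
    for i in range(n):
        if labels[i] == 'border':
            j = next(q for q in neighbors[i] if core[q])
            cluster[i] = comp[j]
    return cluster                           # cluster ids need not be consecutive
```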

Page 21: Clustering  IV

Time and space complexity of DBSCAN

• For a dataset X consisting of n points, the time complexity of DBSCAN is O(n × time to find the points in an Eps-neighborhood)

• Worst case: O(n²)

• In low-dimensional spaces, O(n log n): efficient data structures (e.g., kd-trees) allow efficient retrieval of all points within a given distance of a specified point
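A short sketch of the low-dimensional case, using SciPy's cKDTree to fetch all Eps-neighborhoods at once; the data and parameter values are illustrative:

```python
# Sketch: in low dimensions a k-d tree answers an Eps-neighborhood query in
# roughly O(log n), giving about O(n log n) overall for the neighborhood phase.
import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(10_000, 2)
tree = cKDTree(X)
eps, min_pts = 0.05, 5
neighborhoods = tree.query_ball_point(X, r=eps)    # list of index lists, one per point
is_core = np.array([len(nb) >= min_pts for nb in neighborhoods])
```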

Page 22: Clustering  IV

Strengths and weaknesses of DBSCAN

• Resistant to noise

• Finds clusters of arbitrary shapes and sizes

• Difficulty in identifying clusters with varying densities

• Problems in high-dimensional spaces; notion of density unclear

• Can be computationally expensive when the computation of nearest neighbors is expensive

Page 23: Clustering  IV

Generic density-based clustering on a grid

• Define a set of grid cells
• Assign objects to the appropriate cells and compute the density of each cell
• Eliminate cells whose density is below a given threshold τ
• Form clusters from “contiguous” (adjacent) groups of dense cells
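A minimal sketch of this grid-based procedure, assuming numeric data in a NumPy array; the grid resolution and the threshold τ are illustrative choices, and scipy.ndimage.label is used to join adjacent dense cells:

```python
# Sketch: bin the points, keep dense cells, join adjacent dense cells into clusters.
import numpy as np
from scipy.ndimage import label

def grid_density_clusters(X, bins=20, tau=5):
    counts, edges = np.histogramdd(X, bins=bins)           # cell densities
    dense = counts >= tau                                   # keep cells above threshold
    labeled, n_clusters = label(dense)                      # adjacent dense cells -> one cluster
    # Map each point back to the cluster of its cell (0 = not in any dense cell)
    cell = tuple(np.clip(np.digitize(X[:, d], edges[d]) - 1, 0, bins - 1)
                 for d in range(X.shape[1]))
    return labeled[cell], n_clusters

X = np.vstack([np.random.randn(300, 2), np.random.randn(300, 2) + 6])
point_labels, k = grid_density_clusters(X, bins=25, tau=3)
```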

Page 24: Clustering  IV

Questions

• How do we define the grid?

• How do we measure the density of a grid cell?

• How do we deal with multidimensional data?

Page 25: Clustering  IV

Clustering High-Dimensional Data

• Clustering high-dimensional data

– Many applications: text documents, DNA micro-array data

– Major challenges:

• Many irrelevant dimensions may mask clusters

• Distance measure becomes meaningless—due to equi-distance

• Clusters may exist only in some subspaces

• Methods

– Feature transformation: only effective if most dimensions are relevant

• PCA & SVD useful only when features are highly correlated/redundant

– Feature selection: wrapper or filter approaches

• useful to find a subspace where the data have nice clusters

– Subspace-clustering: find clusters in all the possible subspaces

• CLIQUE

Page 26: Clustering  IV

The Curse of Dimensionality

• Data in only one dimension is relatively packed

• Adding a dimension “stretches” the points across that dimension, making them further apart

• Adding more dimensions will make the points further apart—high dimensional data is extremely sparse

• Distance measure becomes meaningless

(graphs from Parsons et al. KDD Explorations 2004)

Page 27: Clustering  IV

Why Subspace Clustering? (Parsons et al., SIGKDD Explorations 2004)

• Clusters may exist only in some subspaces

• Subspace-clustering: find clusters in some of the subspaces

Page 28: Clustering  IV

CLIQUE (Clustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)

• Automatically identifies the subspaces of a high-dimensional data space that allow better clustering than the original space

• CLIQUE can be considered both density-based and grid-based:
– It partitions each dimension into the same number of equal-length intervals

– It partitions an m-dimensional data space into non-overlapping rectangular units

– A unit is dense if the fraction of total data points contained in the unit exceeds an input threshold τ

– A cluster is a maximal set of connected dense units within a subspace

Page 29: Clustering  IV

The CLIQUE algorithm

• Find all dense areas in the 1-dimensional spaces (single attributes)
• k ← 2
• repeat

– Generate all candidate dense k-dimensional cells from the dense (k−1)-dimensional cells
– Eliminate cells that have fewer than τ points
– k ← k+1

• until there are no candidate dense k-dimensional cells
• Find clusters by taking the union of all adjacent, high-density cells
• Summarize each cluster using a small set of inequalities that describe the attribute ranges of the cells in the cluster
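The sketch below illustrates only the bottom-up dense-unit generation from the listing above (Apriori-style joins); the number of intervals per dimension (xi), the threshold τ, and the helper names are illustrative, and the final steps of joining adjacent dense units and summarizing clusters are omitted:

```python
# Sketch: bottom-up generation of dense units. A "unit" is a set of
# (dimension, interval index) pairs describing an axis-aligned cell.
import numpy as np
from itertools import combinations

def dense_units(X, xi=10, tau=20):
    n, d = X.shape
    span = (X.max(0) - X.min(0)) + 1e-12
    idx = np.floor((X - X.min(0)) / span * xi).astype(int).clip(0, xi - 1)

    def support(unit):                      # number of points falling inside a unit
        mask = np.ones(n, dtype=bool)
        for dim, interval in unit:
            mask &= idx[:, dim] == interval
        return mask.sum()

    # 1-dimensional dense units
    dense = {1: {frozenset({(dim, i)}) for dim in range(d) for i in range(xi)
                 if support([(dim, i)]) >= tau}}
    k = 2
    while dense[k - 1]:
        candidates = set()
        for u, v in combinations(dense[k - 1], 2):
            c = u | v                                        # Apriori-style join
            if len(c) == k and len({dim for dim, _ in c}) == k:
                candidates.add(c)
        dense[k] = {c for c in candidates if support(c) >= tau}
        k += 1
    return dense
```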

Page 30: Clustering  IV

CLIQUE: Monotonicity property

• “If a set of points forms a density-based cluster in k dimensions (attributes), then the same set of points is also part of a density-based cluster in all possible subsets of those dimensions”

Page 31: Clustering  IV

Strengths and weaknesses of CLIQUE

• automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces

• insensitive to the order of records in input and does not presume some canonical data distribution

• scales linearly with the size of input and has good scalability as the number of dimensions in the data increases

• The accuracy of the clustering result may be degraded for the sake of the simplicity of the method

Page 32: Clustering  IV

Outline

• Impossibility theorem for clustering

• Density-based clustering and subspace clustering

• Bi-clustering or co-clustering

• Validating clustering results

• Randomization tests

Page 33: Clustering  IV

Clustering

Example data matrix A (7 × 6):
 3  0  6  8  9  7
 2  3  4 12  8 10
 1  2  3 10  9  8
 0  8  4  8  7  9
 2  4  3 11  9 10
16 10 13  6  7  5
10  8  9  2  3  7

• m points in R^n
• Group them into k clusters
• Represent them by a matrix A ∈ R^{m×n}
– A point corresponds to a row of A
• Clustering: partition the rows into k groups

Page 34: Clustering  IV

Co-Clustering

[The example 7 × 6 matrix A, with a block of it highlighted as a co-cluster]

• Co-Clustering: cluster the rows and columns of A simultaneously
• In the example, k = 2 row clusters and ℓ = 2 column clusters define the co-clusters

Page 35: Clustering  IV

Motivation: Sponsored Search

The main source of revenue for search engines:

• Advertisers bid on keywords
• A user issues a query
• The engine shows ads of advertisers that are relevant and have high bids
• The user clicks on an ad, or not

Page 36: Clustering  IV

Motivation: Sponsored Search

• For every (advertiser, keyword) pair we have:

– Bid amount
– Impressions
– Number of clicks

• Mine this information at query time
– Maximize the number of clicks / revenue

Page 37: Clustering  IV

Co-Clusters in Sponsored Search

[Figure: keywords × advertisers matrix; rows are keywords (e.g., “ski boots”, “Vancouver”, “Air France”), columns are advertisers (e.g., Skis.com); an entry holds, e.g., the bid of Skis.com for “ski boots”]

• Markets = co-clusters
• All the keywords in a co-cluster are relevant to a set of advertisers

Page 38: Clustering  IV

Co-Clustering in Sponsored Search

Applications:

• Keyword suggestion
– Recommend other relevant keywords to advertisers

• Broad matching / market expansion
– Include more advertisers for a query

• Isolate submarkets
– Important for economists
– Apply different advertising approaches

• Build taxonomies of advertisers / keywords

Page 39: Clustering  IV

Motivation: Biology

• Gene-expression data in computational biology
– need simultaneous characterizations of genes

Page 40: Clustering  IV

Co-Clusters in Gene Expression Data

[Figure: genes × conditions matrix; rows are genes (e.g., ATTCGT, GCATD), columns are experimental conditions (e.g., O2 present, T = 23ºC; O2 absent, T = 10ºC); an entry is the expression level of the gene under the condition]

• Co-cluster: all of these genes are activated for some set of conditions

Page 41: Clustering  IV

Clustering

[The example 7 × 6 data matrix A again]

• m points in R^n
• Group them into k clusters
• Represent them by a matrix A ∈ R^{m×n}
– A point corresponds to a row of A
• Clustering: partition the rows into k groups

Page 42: Clustering  IV

Clustering

Row-clustering the example matrix A with k = 2 groups rows 1–5 and rows 6–7; the two centroid rows are

1.6  3.4   4  9.8  8.4  8.8
 13    9  11    4    5    6

and RM repeats the first centroid row five times and the second centroid row twice.

• M ∈ R^{k×n}: cluster centroids
• R ∈ R^{m×k}: assignment matrix (each row of R has a single 1, marking the cluster of the corresponding row of A)
• Approximate A by RM

• Clustering: find M, R that minimize the approximation error ‖A − RM‖
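A small sketch of this view of clustering as the factorization A ≈ RM, using a plain k-means loop on the example matrix; kmeans_rows is a made-up helper and k-means is an illustrative choice of row-clustering method:

```python
# Sketch: obtain the assignment matrix R and centroid matrix M by k-means on the rows.
import numpy as np

A = np.array([[ 3,  0,  6,  8,  9,  7],
              [ 2,  3,  4, 12,  8, 10],
              [ 1,  2,  3, 10,  9,  8],
              [ 0,  8,  4,  8,  7,  9],
              [ 2,  4,  3, 11,  9, 10],
              [16, 10, 13,  6,  7,  5],
              [10,  8,  9,  2,  3,  7]], dtype=float)

def kmeans_rows(A, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    M = A[rng.choice(len(A), k, replace=False)]            # initial centroids = random rows
    for _ in range(iters):
        assign = np.argmin(((A[:, None, :] - M[None]) ** 2).sum(-1), axis=1)
        M = np.vstack([A[assign == j].mean(0) if (assign == j).any() else M[j]
                       for j in range(k)])
    R = np.eye(k)[assign]                                   # m x k, one 1 per row
    return R, M

R, M = kmeans_rows(A, k=2)
print(np.round(M, 1))              # should recover the two centroid rows shown above
print(np.linalg.norm(A - R @ M))   # approximation error ||A - RM||
```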

Page 43: Clustering  IV

Distance Metric ‖·‖

[Figure: the data matrix A next to its approximation RM, in which each row of A is replaced by the centroid of its cluster (1.6 3.4 4 9.8 8.4 8.8 for rows 1–5 and 13 9 11 4 5 6 for rows 6–7); the approximation error is measured by a distance metric ‖A − RM‖]

Page 44: Clustering  IV

Measuring the quality of a clustering

Theorem. For any k > 0, p ≥ 1 there is a 24-approximation.

Proof. Similar to general k-median, relaxed triangle inequality.

Page 45: Clustering  IV

Column Clustering

• ℓ column clusters
• M_C ∈ R^{m×ℓ}: cluster centroids
• C ∈ R^{ℓ×n}: column assignments

• Approximate A by M_C·C

In the example matrix A, columns 1–3 and columns 4–6 form ℓ = 2 column clusters; the per-row column-cluster means M_C are

 3  8
 3 10
 2  9
 4  8
 3 10
13  6
 9  4

and M_C·C replaces every entry of A by the mean of its row over its column cluster:

 3  3  3  8  8  8
 3  3  3 10 10 10
 2  2  2  9  9  9
 4  4  4  8  8  8
 3  3  3 10 10 10
13 13 13  6  6  6
 9  9  9  4  4  4

Page 46: Clustering  IV

Co-Clustering

• Co-Clustering: cluster the rows and columns of A ∈ R^{m×n} simultaneously
• k row clusters, ℓ column clusters
• M ∈ R^{k×ℓ}: co-cluster means
• R ∈ R^{m×k}: row assignments
• C ∈ R^{ℓ×n}: column assignments

• Co-Clustering: find M, R, C that minimize the approximation error ‖A − RMC‖

In the example (k = ℓ = 2) the co-cluster means are

 3  9
11  5

and RMC replaces every entry of A by its co-cluster mean:

 3  3  3  9  9  9
 3  3  3  9  9  9
 3  3  3  9  9  9
 3  3  3  9  9  9
 3  3  3  9  9  9
11 11 11  5  5  5
11 11 11  5  5  5

Page 47: Clustering  IV

Co-Clustering Objective Function

[Figure: the data matrix A, its co-clustered approximation RMC, and one highlighted co-cluster A_IJ]

The objective sums, over all co-clusters, the cost of each co-cluster A_IJ — how far the entries of the block with row set I and column set J are from the corresponding co-cluster mean.
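Assuming the residual is measured entry-wise by a p-norm (the slides only name the metric abstractly; p ≥ 1 appears in the approximation theorem earlier), the objective and the per-co-cluster cost can be written roughly as follows; ρ(i) and γ(j) are hypothetical notation for the row- and column-cluster assignments:

```latex
% Sketch of one common form of the co-clustering objective (assumption: entry-wise p-norm)
\min_{M,\,R,\,C} \lVert A - RMC \rVert
  \;=\; \min_{\rho,\,\gamma,\,M} \Big( \sum_{i,j} \big| A_{ij} - M_{\rho(i)\gamma(j)} \big|^{p} \Big)^{1/p},
\qquad
\text{cost of co-cluster } A_{IJ} \;=\; \sum_{i \in I,\; j \in J} \big| A_{ij} - M_{IJ} \big|^{p}.
```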

Page 48: Clustering  IV

Some Background

• A.k.a.: biclustering, block clustering, …

• Many objective functions exist for co-clustering
– This is one of the easier ones
– Others factor out row/column averages (priors)
– Others are based on information-theoretic ideas (e.g., KL divergence)

• A lot of existing work, but mostly heuristic
– k-means style, alternating between rows and columns
– Spectral techniques

Page 49: Clustering  IV

Algorithm

1. Cluster rows of A

2. Cluster columns of A

3. Combine
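A sketch of these three steps on the running example, reusing the illustrative kmeans_rows() helper from the earlier sketch; taking each (row cluster, column cluster) block mean is assumed here to be the "combine" step:

```python
# Sketch: cluster rows, cluster columns, then combine via block means.
import numpy as np

def co_cluster(A, k, l):
    R, _ = kmeans_rows(A, k)          # step 1: row clustering   (m x k)
    C, _ = kmeans_rows(A.T, l)        # step 2: column clustering on A^T  (n x l)
    row_of = R.argmax(1)              # cluster index of each row
    col_of = C.argmax(1)              # cluster index of each column
    # Step 3: combine -- co-cluster means M (k x l); assumes no empty cluster
    M = np.array([[A[np.ix_(row_of == a, col_of == b)].mean()
                   for b in range(l)] for a in range(k)])
    approx = M[row_of][:, col_of]     # every entry replaced by its co-cluster mean
    return M, approx

M, approx = co_cluster(A, k=2, l=2)   # on the example: M ~ [[3, 9], [11, 5]] up to relabeling
```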

Page 50: Clustering  IV

Properties of the algorithm

Theorem 1. The algorithm with optimal row/column clusterings is a 3-approximation to the co-clustering optimum.

Theorem 2. For the ℓ2 norm, the algorithm with optimal row/column clusterings is a 2-approximation.

Page 51: Clustering  IV

Outline

• Impossibility theorem for clustering [Jon Kleinberg, An impossibility theorem for clustering, NIPS 2002]

• Bi-clustering or co-clustering

• Subspace clustering and density-based clustering

• Validating clustering results

• Randomization tests

Page 52: Clustering  IV

Cluster Validity

• For supervised classification we have a variety of measures to evaluate how good our model is
– Accuracy, precision, recall

• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters

• But “clusters are in the eye of the beholder”!

• Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters

Page 53: Clustering  IV

Clusters found in Random Data

[Figure: the same set of random points (x, y in [0, 1]) shown four times: the raw random points and the “clusters” found in them by K-means, DBSCAN, and complete link]

Page 54: Clustering  IV

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.

2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.

3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.

- Use only the data

4. Comparing the results of two different sets of cluster analyses to determine which is better.

5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Different Aspects of Cluster Validation

Page 55: Clustering  IV

Measures of Cluster Validity

• Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
– External index: used to measure the extent to which cluster labels match externally supplied class labels
• Entropy
– Internal index: used to measure the goodness of a clustering structure without respect to external information
• Sum of Squared Error (SSE)
– Relative index: used to compare two different clusterings or clusters
• Often an external or internal index is used for this purpose, e.g., SSE or entropy

• Sometimes these are referred to as criteria instead of indices
– However, sometimes the criterion is the general strategy and the index is the numerical measure that implements the criterion

Page 56: Clustering  IV

Measuring Cluster Validity Via Correlation

• Two matrices
– Proximity matrix
– “Incidence” matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belongs to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters

• Compute the correlation between the two matrices
– Since the matrices are symmetric, only the correlation between the n(n−1)/2 entries above the diagonal needs to be calculated

• High correlation indicates that points that belong to the same cluster are close to each other

• Not a good measure for some density- or contiguity-based clusters
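A sketch of this correlation computation; since the proximity matrix here holds distances, the code correlates the incidence matrix with a similarity (the negated distance), which is one reasonable convention for getting a "high is good" score:

```python
# Sketch: correlation between the proximity and incidence matrices over the
# n(n-1)/2 upper-triangular entries. Labels come from any clustering.
import numpy as np

def cluster_validity_correlation(X, labels):
    n = len(X)
    proximity = np.linalg.norm(X[:, None] - X[None], axis=-1)    # pairwise distances
    incidence = (labels[:, None] == labels[None]).astype(float)  # 1 if same cluster
    iu = np.triu_indices(n, k=1)                                  # n(n-1)/2 entries
    # Use -distance as a similarity so that a good clustering gives a high positive value
    return np.corrcoef(-proximity[iu], incidence[iu])[0, 1]
```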

Page 57: Clustering  IV

Measuring Cluster Validity Via Correlation

• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

[Figure: two data sets (x, y in [0, 1]) with their K-means clusterings; the correlations between the incidence and proximity matrices are Corr = 0.9235 and Corr = 0.5810, respectively]

Page 58: Clustering  IV

• Order the similarity matrix with respect to cluster labels and inspect visually.

Using Similarity Matrix for Cluster Validation

[Figure: a data set with three well-separated clusters and its point-by-point similarity matrix, with rows and columns ordered by cluster label; similarity scale 0–1]
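A sketch of the reordering-and-plotting step, assuming matplotlib is available; converting distances to similarities with a Gaussian kernel is an illustrative choice:

```python
# Sketch: reorder a similarity matrix by cluster label for visual validation.
import numpy as np
import matplotlib.pyplot as plt

def plot_sorted_similarity(X, labels):
    order = np.argsort(labels)                        # group points cluster by cluster
    d = np.linalg.norm(X[:, None] - X[None], axis=-1)
    sim = np.exp(-d ** 2 / (2 * d.std() ** 2))        # distances -> similarities in (0, 1]
    plt.imshow(sim[np.ix_(order, order)], vmin=0, vmax=1)
    plt.colorbar(label='Similarity')
    plt.xlabel('Points'); plt.ylabel('Points')
    plt.show()                                        # good clusterings show bright diagonal blocks
```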

Page 59: Clustering  IV

Using Similarity Matrix for Cluster Validation

• Clusters in random data are not so crisp

[Figure: similarity matrix ordered by cluster label and the corresponding scatter plot for the clusters DBSCAN finds in random data]

Page 60: Clustering  IV

Using Similarity Matrix for Cluster Validation

• Clusters in random data are not so crisp

[Figure: similarity matrix ordered by cluster label and the corresponding scatter plot for the clusters K-means finds in random data]

Page 61: Clustering  IV

Using Similarity Matrix for Cluster Validation

• Clusters in random data are not so crisp

[Figure: similarity matrix ordered by cluster label and the corresponding scatter plot for the clusters complete link finds in random data]

Page 62: Clustering  IV

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clustering of a more complicated data set (clusters labeled 1–7) and the corresponding similarity matrix ordered by cluster label]

Page 63: Clustering  IV

• Clusters in more complicated figures aren’t well separated
• Internal index: used to measure the goodness of a clustering structure without respect to external information
– SSE

• SSE is good for comparing two clusterings or two clusters (average SSE)
• SSE can also be used to estimate the number of clusters

Internal Measures: SSE

[Figure: SSE versus K for K-means on a two-dimensional data set; the knees in the curve suggest candidate numbers of clusters]
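A sketch of using SSE to pick the number of clusters, with scikit-learn's KMeans (whose inertia_ attribute is the within-cluster SSE) on a made-up three-cluster data set:

```python
# Sketch: SSE as a function of K, to look for a knee/elbow.
import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(100, 2) + c for c in [(0, 0), (6, 0), (3, 6)]])
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)          # within-cluster sum of squared errors
print(list(zip(range(1, 11), np.round(sse, 1))))   # SSE drops sharply up to the true K
```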

Page 64: Clustering  IV

Internal Measures: SSE

• SSE curve for a more complicated data set

[Figure: a more complicated data set with clusters labeled 1–7, and the SSE of the clusters found using K-means]

Page 65: Clustering  IV

Framework for Cluster Validity

• Need a framework to interpret any measure
– For example, if our measure of evaluation has the value 10, is that good, fair, or poor?

• Statistics provide a framework for cluster validity
– The more “atypical” a clustering result is, the more likely it is to represent valid structure in the data
– Can compare the values of an index that result from random data or random clusterings to those of the actual clustering result
• If the value of the index is unlikely under randomness, then the cluster results are valid
– These approaches are more complicated and harder to understand

• For comparing the results of two different sets of cluster analyses, a framework is less necessary
– However, there is still the question of whether the difference between two index values is significant

Page 66: Clustering  IV

Internal Measures: Cohesion and Separation

• Cluster cohesion: measures how closely related the objects in a cluster are
– Example: SSE
• Cluster separation: measures how distinct or well-separated a cluster is from the other clusters
– Example: squared error

– Cohesion is measured by the within-cluster sum of squares (SSE):
WSS = ∑_i ∑_{x∈C_i} (x − m_i)²

– Separation is measured by the between-cluster sum of squares:
BSS = ∑_i |C_i| (m − m_i)²

– where |C_i| is the size of cluster i, m_i is its centroid, and m is the overall mean of the data
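A direct transcription of the two formulas above (the helper name is illustrative):

```python
# Sketch: within-cluster (WSS) and between-cluster (BSS) sums of squares.
# For squared Euclidean distance, WSS + BSS equals the total sum of squares.
import numpy as np

def wss_bss(X, labels):
    m = X.mean(axis=0)                               # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)                         # cluster centroid m_i
        wss += ((Xc - mc) ** 2).sum()                # cohesion
        bss += len(Xc) * ((m - mc) ** 2).sum()       # separation, weighted by |C_i|
    return wss, bss
```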

Page 67: Clustering  IV

Internal Measures: Cohesion and Separation

• A proximity-graph-based approach can also be used for cohesion and separation
– Cluster cohesion is the sum of the weights of all links within a cluster
– Cluster separation is the sum of the weights of the links between nodes in the cluster and nodes outside the cluster

[Figure: cohesion as edges within a cluster vs. separation as edges between clusters]

Page 68: Clustering  IV

Internal Measures: Silhouette Coefficient

• The silhouette coefficient combines the ideas of cohesion and separation, but for individual points, as well as for clusters and clusterings

• For an individual point i:
– Calculate a = the average distance of i to the points in its own cluster
– Calculate b = min (average distance of i to the points in another cluster), taken over the other clusters
– The silhouette coefficient for the point is then given by
s = 1 − a/b if a < b (or s = b/a − 1 if a ≥ b, which is not the usual case)
– Typically between 0 and 1; the closer to 1, the better

• The average silhouette width can be calculated for a cluster or for an entire clustering
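A sketch of the per-point computation just described (equivalent to the usual s = (b − a) / max(a, b) form); it assumes at least two clusters and NumPy-array inputs:

```python
# Sketch: silhouette coefficient of every point.
import numpy as np

def silhouette(X, labels):
    labels = np.asarray(labels)
    d = np.linalg.norm(X[:, None] - X[None], axis=-1)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = (labels == labels[i]); same[i] = False
        a = d[i, same].mean() if same.any() else 0.0           # cohesion of point i
        b = min(d[i, labels == c].mean()                        # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = 1 - a / b if a < b else b / a - 1
    return s     # average over a cluster or over all points for a clustering-level score
```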

Page 69: Clustering  IV

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis.

Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”

Algorithms for Clustering Data, Jain and Dubes

Final Comment on Cluster Validity

Page 70: Clustering  IV

Assessing the significance of clustering (and other data mining) results

• Data X and algorithm A
• Beautiful result A(X)
• But: what does it mean?
• How do we determine whether the result is really interesting or just due to chance?

Page 71: Clustering  IV

Examples

• Pattern discovery: frequent itemsets or association rules

• From data X we can find a collection of nice patterns

• Significance of individual patterns is sometimes straightforward to test

• What about the whole collection of patterns? Is it surprising to see such a collection?

Page 72: Clustering  IV

Examples

• In clustering or mixture modeling: we always get a result

• How to test if the whole idea of components/clusters in the data is good?

Page 73: Clustering  IV

Classical methods

Page 74: Clustering  IV

Classical methods

Page 75: Clustering  IV

Randomization methods

• Goal in assessing the significance of results: could the result have occurred by chance?

• Randomization methods: create datasets that somehow reflect the characteristics of the true data

Page 76: Clustering  IV

Randomization

• Create randomized versions X1, X2, …, Xk of the data X

• Run algorithm A on these, producing results A(X1), A(X2),…,A(Xk)

• Check if the result A(X) on the real data is somehow different from these

• Empirical p-value: the fraction of randomized datasets Xi for which A(Xi) is at least as extreme as (say, larger than) the result A(X) on the real data

• If the empirical p-value is small, then there is something interesting in the data
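A sketch of this procedure; algorithm_score stands for any single-number summary of A(X) and randomize for whatever null model is chosen — both are placeholders, and the direction of "extreme" depends on the score:

```python
# Sketch: empirical p-value from randomized versions of the data.
import numpy as np

def empirical_p_value(X, algorithm_score, randomize, k=1000, seed=0):
    rng = np.random.default_rng(seed)
    real = algorithm_score(X)                          # A(X) summarized as a single score
    random_scores = np.array([algorithm_score(randomize(X, rng)) for _ in range(k)])
    # Fraction of randomized datasets whose score is at least as extreme as the real one
    # (here "extreme" means at least as large; the +1 avoids a p-value of exactly 0)
    return (1 + (random_scores >= real).sum()) / (k + 1)
```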

Page 77: Clustering  IV

Questions

• How is the data randomized?

• Can the sample X1, X2, …, Xk be computed efficiently?

• Can the values A(X1), A(X2), …, A(Xk) be computed efficiently?

Page 78: Clustering  IV

How to randomize?

• How are datasets Xi generated?

• Randomly, from a “null model” / “null hypothesis”
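One simple null model, sketched below, permutes each column of X independently; this preserves the value distribution of every attribute while destroying the joint structure between attributes (other null models preserve more characteristics of the data). It can be passed as the randomize argument of the earlier empirical_p_value sketch.

```python
# Sketch: column-permutation null model.
import numpy as np

def permute_columns(X, rng):
    # Each column gets its own independent random permutation of its values
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
```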

