Page 1: Clustering

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Clustering

COMP 790-90 Research Seminar, BCB 713 Module

Spring 2011, Wei Wang

Page 2: Clustering


Outline

What is clustering

Partitioning methods

Hierarchical methods

Density-based methods

Grid-based methods

Model-based clustering methods

Outlier analysis

Page 3: Clustering


What Is Clustering?

Group data into clusters

Similar to one another within the same cluster

Dissimilar to the objects in other clusters

Unsupervised learning: no predefined classes

[Figure: scatter plot of points forming two groups labeled Cluster 1 and Cluster 2, with a few scattered points labeled Outliers]

Page 4: Clustering


Application Examples

A stand-alone tool: explore data distribution

A preprocessing step for other algorithms

Pattern recognition, spatial data analysis, image processing, market research, WWW, …

Cluster documents

Cluster web log data to discover groups of similar access patterns

Page 5: Clustering


What Is A Good Clustering?

High intra-class similarity and low inter-class similarity

Depending on the similarity measure

The ability to discover some or all of the hidden patterns

Page 6: Clustering


Requirements of Clustering

Scalability

Ability to deal with various types of attributes

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to determine input parameters

Page 7: Clustering


Requirements of Clustering

Able to deal with noise and outliers

Insensitive to order of input records

High dimensionality

Incorporation of user-specified constraints

Interpretability and usability

Page 8: Clustering


Data Matrix

For memory-based clustering

Also called object-by-variable structure

Represents n objects with p variables (attributes, measures)

A relational table

$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$
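As an illustration (not from the slides), the object-by-variable structure maps directly onto a two-dimensional array: one row per object, one column per variable. A minimal Python sketch with made-up values:

```python
import numpy as np

# Hypothetical data matrix: n = 4 objects described by p = 2 variables
# (object-by-variable structure: rows are objects, columns are attributes).
X = np.array([
    [1.0, 2.0],   # object 1
    [1.5, 1.8],   # object 2
    [8.0, 8.0],   # object 3
    [9.0, 9.5],   # object 4
])
n, p = X.shape    # n objects, p variables
```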

Page 9: Clustering


Dissimilarity Matrix

For memory-based clustering

Also called object-by-object structure

Proximities of pairs of objects

d(i,j): dissimilarity between objects i and j

Nonnegative

Close to 0: similar

$\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$
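A minimal sketch of building the object-by-object structure from a data matrix, assuming numeric data and Euclidean distance as the dissimilarity (the slides do not fix a particular d). Only the lower triangle needs computing, since d(i,i) = 0 and d(i,j) = d(j,i):

```python
import numpy as np

def dissimilarity_matrix(X):
    """Object-by-object structure: D[i, j] = d(i, j), Euclidean here."""
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):            # lower triangle; mirror for symmetry
            D[i, j] = D[j, i] = np.linalg.norm(X[i] - X[j])
    return D
```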

Page 10: Clustering


How Good Is A Clustering?

Dissimilarity/similarity depends on the distance function used

Different applications have different functions

Judgment of clustering quality is typically highly subjective

Page 11: Clustering


Types of Data in Clustering

Interval-scaled variables

Binary variables

Nominal, ordinal, and ratio variables

Variables of mixed types

Page 12: Clustering


Similarity and Dissimilarity Between Objects

Distances are the most commonly used measures

Minkowski distance: a generalization

$d(i,j) = \sqrt[q]{|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q} \qquad (q > 0)$

If q = 2, d is Euclidean distance

If q = 1, d is Manhattan distance

Weighted distance

$d(i,j) = \sqrt[q]{w_1|x_{i1}-x_{j1}|^q + w_2|x_{i2}-x_{j2}|^q + \cdots + w_p|x_{ip}-x_{jp}|^q} \qquad (q > 0)$
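A sketch of the weighted Minkowski distance as written above; with unit weights, q = 2 reduces to the Euclidean distance and q = 1 to the Manhattan distance (the function name is mine):

```python
import numpy as np

def minkowski(x, y, q=2, w=None):
    """Weighted Minkowski distance d(i, j), q > 0; w defaults to unit weights."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return float((w * np.abs(x - y) ** q).sum() ** (1.0 / q))

print(minkowski([0, 0], [3, 4], q=2))  # 5.0 (Euclidean)
print(minkowski([0, 0], [3, 4], q=1))  # 7.0 (Manhattan)
```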

Page 13: Clustering


Properties of Minkowski Distance

Nonnegative: d(i,j) ≥ 0

The distance of an object to itself is 0: d(i,i) = 0

Symmetric: d(i,j) = d(j,i)

Triangular inequality: d(i,j) ≤ d(i,k) + d(k,j)
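A quick numeric spot check of these properties, reusing the minkowski sketch above on three arbitrary points (illustrative only, not a proof):

```python
import itertools

pts = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
for i, j, k in itertools.permutations(range(3)):
    d_ij = minkowski(pts[i], pts[j])
    assert d_ij >= 0                              # nonnegative
    assert minkowski(pts[i], pts[i]) == 0         # d(i, i) = 0
    assert d_ij == minkowski(pts[j], pts[i])      # symmetric
    assert d_ij <= minkowski(pts[i], pts[k]) + minkowski(pts[k], pts[j])
print("nonnegativity, identity, symmetry, and triangle inequality all hold")
```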

Page 14: Clustering


Categories of Clustering Approaches (1)

Partitioning algorithms

Partition the objects into k clusters

Iteratively reallocate objects to improve the clustering

Hierarchy algorithms

Agglomerative: each object starts as its own cluster; merge clusters to form larger ones

Divisive: all objects start in one cluster; split it up into smaller clusters

Page 15: Clustering


Categories of Clustering Approaches (2)

Density-based methods

Based on connectivity and density functions

Filter out noise, find clusters of arbitrary shape

Grid-based methods

Quantize the object space into a grid structure

Model-based methods

Use a model to find the best fit of the data

Page 16: Clustering


Partitioning Algorithms: Basic Concepts

Partition n objects into k clusters

Optimize the chosen partitioning criterion

Global optimal: examine all partitions; there are $S(n,k) \approx k^n/k!$ of them (a Stirling number of the second kind; see the sketch below), far too expensive!

Heuristic methods: k-means and k-medoids

K-means: a cluster is represented by its center

K-medoids or PAM (partitioning around medoids): each cluster is represented by one of the objects in the cluster
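To make "far too expensive" concrete, the partition count obeys the standard recurrence S(n, k) = k·S(n-1, k) + S(n-1, k-1); a short sketch (the helper name is mine):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n objects into k nonempty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if n == 0:
        return 0
    # The last object either joins one of the k clusters of a partition of
    # the remaining n-1 objects, or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(10, 4))  # 34105 partitions for only 10 objects
print(stirling2(25, 4))  # roughly 5e13: exhaustive search is hopeless
```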

Page 17: Clustering


K-means

Arbitrarily choose k objects as the initial cluster centers

Until no change, do:

(Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster

Update the cluster means, i.e., calculate the mean value of the objects for each cluster
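A minimal sketch of this loop, assuming numeric data in a NumPy array and Euclidean distance (an empty cluster simply keeps its previous center here):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update the cluster means.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # no change: done
            break
        centers = new_centers
    return labels, centers
```

On a 2-D point set like the example on the next slide, k_means(X, 2) walks through exactly the pictured assign/update/reassign iterations.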

Page 18: Clustering


K-Means: Example

[Figure: K-means example on a 2-D point set with K = 2, shown over four panels: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and update again, looping until no change.]

Page 19: Clustering


Pros and Cons of K-means

Relatively efficient: O(tkn), where n = # objects, k = # clusters, t = # iterations; normally k, t << n

Often terminates at a local optimum

Applicable only when the mean is defined; what about categorical data?

Need to specify the number of clusters in advance

Unable to handle noisy data and outliers

Unsuitable for discovering non-convex clusters

Page 20: Clustering


Variations of the K-means

Aspects of variation

Selection of the initial k means

Dissimilarity calculations

Strategies to calculate cluster means

Handling categorical data: k-modes

Use the mode instead of the mean (see the sketch below)

Mode: the most frequent item(s)

A mixture of categorical and numerical data: the k-prototype method
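A sketch of the k-modes center update only: the cluster "center" becomes the per-attribute mode. The rows here are hypothetical categorical records:

```python
from collections import Counter

def mode_center(rows):
    """k-modes center update: per-attribute mode of a categorical cluster."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*rows)]

cluster = [["red", "small"], ["red", "large"], ["blue", "small"]]
print(mode_center(cluster))  # ['red', 'small']
```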

Page 21: Clustering


A Problem of K-means

Sensitive to outliers

Outlier: objects with extremely large values

May substantially distort the distribution of the data

K-medoids: use the most centrally located object in a cluster instead of the mean

[Figure: two 2-D panels with + marking the cluster centers; an outlier drags the mean-based center away from the cluster, while the most centrally located object stays put.]

Page 22: Clustering


PAM: A K-medoids Method

PAM: Partitioning Around Medoids

Arbitrarily choose k objects as the initial medoids

Until no change, do:

(Re)assign each object to the cluster of its nearest medoid

Randomly select a non-medoid object o’, compute the total cost, S, of swapping medoid o with o’

If S < 0 then swap o with o’ to form the new set of k medoids
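A naive sketch of this loop using the squared-error criterion E defined on the next slide (assumptions: numeric NumPy data; every non-medoid is tried as o', so one pass examines O(k(n-k)) candidate swaps):

```python
import numpy as np

def total_cost(X, medoids):
    """Squared-error E: each object contributes d(p, nearest medoid)^2."""
    d2 = ((X[:, None, :] - X[medoids][None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial medoids.
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:                      # until no change
        improved = False
        for m in range(k):
            for o in range(len(X)):      # candidate non-medoid o'
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[m] = o
                # S = E_{o'} - E_o; S < 0 means the swap brings benefit.
                if total_cost(X, trial) < total_cost(X, medoids):
                    medoids, improved = trial, True
    return medoids
```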

Page 23: Clustering


Swapping Cost

Measure whether o’ is better than o as a medoid

Use the squared-error criterion

Compute $E_{o'} - E_o$

Negative: swapping brings benefit

$E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, o_i)^2$, where $C_i$ is the i-th cluster and $o_i$ is its medoid

Page 24: Clustering


PAM: Example

[Figure: PAM example on a 2-D point set with K = 2: arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid (total cost = 20); randomly select a non-medoid object O_random and compute the total cost of swapping (total cost = 26); swap O and O_random if quality is improved; loop until no change.]

Page 25: Clustering


Pros and Cons of PAM

PAM is more robust than k-means in the presence of noise and outliers

Medoids are less influenced by outliers

PAM is efficient for small data sets but does not scale well to large data sets

O(k(n-k)²) per iteration

Sampling based method: CLARA

Page 26: Clustering


CLARA (Clustering LARge Applications)

CLARA (Kaufmann and Rousseeuw, 1990)

Built into statistical analysis packages, such as S+

Draw multiple samples of the data set, apply PAM to each sample, and return the best clustering

Performs better than PAM on larger data sets

Efficiency depends on the sample size

A good clustering of a sample may not be a good clustering of the whole data set
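A sketch of the CLARA idea on top of the pam and total_cost sketches above: run PAM on a few random samples and keep the medoid set that scores best on the whole data set (the sample count and size are arbitrary choices, not from the slides):

```python
import numpy as np

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sample_medoids = pam(X[idx], k)                  # PAM on the sample only
        medoids = [int(idx[m]) for m in sample_medoids]  # map back to X
        cost = total_cost(X, medoids)                    # judge on ALL objects
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best
```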

Page 27: Clustering


CLARANS (Clustering Large Applications based upon RANdomized Search)

The problem space: a graph of clusterings

A vertex is a set of k medoids chosen from the n objects, $\binom{n}{k}$ vertices in total

PAM searches the whole graph

CLARA searches some random subgraphs

CLARANS climbs mountains

Randomly sample a set and select k medoids

Consider neighbors of the medoids as candidates for new medoids

Use the sample set to verify

Repeat multiple times to avoid bad samples
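A sketch of the CLARANS search, again reusing total_cost: a neighbor of the current vertex differs in exactly one medoid, and after max_neighbor unsuccessful random neighbors the current vertex is accepted as a local optimum; num_local restarts guard against bad samples (both parameter names are mine):

```python
import numpy as np

def clarans(X, k, num_local=3, max_neighbor=50, seed=0):
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(num_local):           # restarts to avoid bad samples
        current = [int(i) for i in rng.choice(len(X), size=k, replace=False)]
        cost, tried = total_cost(X, current), 0
        while tried < max_neighbor:
            # Random neighbor: swap one medoid for one non-medoid.
            neighbor = current.copy()
            non_medoids = [i for i in range(len(X)) if i not in current]
            neighbor[int(rng.integers(k))] = int(rng.choice(non_medoids))
            new_cost = total_cost(X, neighbor)
            if new_cost < cost:          # downhill move: reset the counter
                current, cost, tried = neighbor, new_cost, 0
            else:
                tried += 1
        if cost < best_cost:
            best, best_cost = current, cost
    return best
```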