CS570: Introduction to Data Mining
Cluster Analysis
Reading: Chapter 10.4, 10.6, 11.1.3 Han, Chapter
8.4,8.5,9.2.2, 9.3 Tan
Anca Doloc-Mihu, Ph.D.
Slides courtesy of Li Xiong, Ph.D.,
©2011 Han, Kamber & Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann,
and
©2006 Tan, Steinbach & Kumar, Introduction to Data Mining, Pearson Addison Wesley
Cluster Analysis
Overview
Partitioning methods
Hierarchical methods
Graph-based methods (CHAMELEON)
Self-organizing maps (SOM)
Density-based methods
EM method
Cluster evaluation
Outlier analysis
Spatial Data
• A cluster is regarded as a contiguous region of high density
• A cluster can have an arbitrary shape
• Streaks and noise may exist
Density-Based Clustering Methods
Clustering based on density
Major features:
Clusters of arbitrary shape
Handle noise
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
DBSCAN: Basic Concepts
Density = number of points within a specified radius
core point: has high density
border point: has less density, but in the neighborhood of a core point
noise point: not a core point or a border point.
[Figure: example of a core point, a border point, and a noise point]
DBSCAN: Definitions
Two parameters:
Eps: radius of the neighborhood
MinPts: Minimum number of points in an Eps-neighborhood of that point
Eps-neighborhood of a point p:
NEps(p) = {q belongs to D | dist(p, q) ≤ Eps}
Core point: |NEps(q)| ≥ MinPts
[Figure: a core point q and a border point p, with MinPts = 5 and Eps = 1 cm]
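These definitions translate directly into code; a minimal sketch (function names are mine, distances are Euclidean):

```python
import numpy as np

def eps_neighborhood(X, i, eps):
    """Indices of all points within distance eps of point i (including i itself)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.where(dists <= eps)[0]

def is_core_point(X, i, eps, min_pts):
    """A point is a core point if its Eps-neighborhood holds at least MinPts points."""
    return len(eps_neighborhood(X, i, eps)) >= min_pts
```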
DBSCAN: Definitions
Directly density-reachable: p is directly density-reachable from q if p belongs to NEps(q) and q is a core point
Density-reachable: p is density-reachable from q if there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi
Density-connected: p and q are density-connected if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: examples of directly density-reachable (p, q), density-reachable (p, q via p1), and density-connected (p, q via o) points, with MinPts = 5 and Eps = 1 cm]
DBSCAN: Cluster Definition
A cluster is defined as a maximal set of density-connected points
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN: The Algorithm
Arbitrarily select an unvisited point p and mark it as visited
If p is a core point
Retrieve all points density-reachable from p w.r.t. Eps and MinPts; together with p, they form a cluster
Otherwise
mark the point as noise and
visit the next unvisited point in the database
Continue the process until all of the points have been processed (see the code sketch below)
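A minimal sketch of this procedure (illustrative, not the original pseudocode; it uses brute-force neighborhood queries, i.e., the O(n²) variant mentioned under Features):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: returns one label per point (-1 = noise, 0..k-1 = cluster id)."""
    n = len(X)
    labels = np.full(n, -1)            # -1 marks noise until proven otherwise
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def neighbors(i):
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        seeds = list(neighbors(p))
        if len(seeds) < min_pts:       # p is not a core point; leave it as noise for now
            continue
        labels[p] = cluster_id         # start a new cluster from core point p
        while seeds:                   # expand the cluster via density-reachability
            q = seeds.pop()
            if not visited[q]:
                visited[q] = True
                q_neighbors = neighbors(q)
                if len(q_neighbors) >= min_pts:   # q is also a core point: grow frontier
                    seeds.extend(q_neighbors)
            if labels[q] == -1:        # unclaimed point becomes part of this cluster
                labels[q] = cluster_id
        cluster_id += 1
    return labels
```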
DBSCAN: Sensitive to Parameters
DBSCAN: Determining EPS and MinPts
Basic idea:
Suppose the neighborhood size is k
For points within a cluster, their kth nearest neighbors are at roughly the same distance
Noise points have their kth nearest neighbor at a farther distance
Plot the sorted distance of every point to its kth nearest neighbor (see the sketch below)
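A sketch of this k-distance heuristic, assuming a point array X (the function name is mine); Eps is then chosen near the knee of the curve, with MinPts = k:

```python
import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(X, k=4):
    """Plot each point's distance to its k-th nearest neighbor, sorted ascending."""
    # Pairwise distance matrix; sorting each row puts the k-th NN in column k
    # (column 0 is the distance of a point to itself, which is 0).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists.sort(axis=1)
    plt.plot(np.sort(dists[:, k]))
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()
```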
DBSCAN: Core, Border and Noise Points
[Figure: original points and the point types (core, border, noise) found with Eps = 10, MinPts = 4]
When DBSCAN Does NOT Work Well
[Figure: original points, and DBSCAN results with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]
• Varying densities
• High-dimensional data
DBSCAN: Features
Complexity:
O(n²), can be reduced to O(n log n) if using index structures
Advantages
does not require the number of clusters (vs. k-means)
can find arbitrarily shaped clusters
can identify noise
mostly insensitive to the ordering of the points
Disadvantages
sensitive to parameters
does not respond well to data sets with varying densities
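In practice a library implementation is usually used; for example, scikit-learn's DBSCAN (parameter values here are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: arbitrarily shaped clusters k-means cannot find.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_            # cluster ids; -1 marks noise points
print("clusters:", len(set(labels) - {-1}), "noise points:", (labels == -1).sum())
```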
OPTICS: A Cluster-Ordering Method (1999)
OPTICS: Ordering Points To Identify the Clustering Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
Produces a special order of the database wrt its density-based clustering structure
This cluster-ordering contains info equiv to the density-based clusterings corresponding to a broad range of parameter settings
Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
Can be represented graphically or using visualization techniques
OPTICS: Some Extensions from DBSCAN
Core distance of p:
the smallest distance that makes p a core point
Reachability distance of p w.r.t. o:
max {core-distance(o), dist(o, p)}
e.g., r(p1, o) = 2.8 cm, r(p2, o) = 4 cm
[Figure: core distance and the reachability distances of p1 and p2 w.r.t. o, with MinPts = 5 and ε = 3 cm]
OPTICS: The Algorithm
Arbitrarily select an unvisited point p and mark it as visited
If p is a core point
Retrieve all points density-reachable from p w.r.t. Eps and MinPts
Update the reachability distances of the neighboring points and output the points in ascending order of reachability distance
Otherwise, visit the next unvisited point in the database
Continue the process until all of the points have been
processed.
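scikit-learn's OPTICS exposes the resulting ordering and reachability distances, which is enough to draw the reachability plot shown on the next slide (a usage sketch; X is assumed to be an (n, d) point array):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

opt = OPTICS(min_samples=5).fit(X)
# Reachability distances in cluster order: valleys correspond to clusters.
plt.plot(opt.reachability_[opt.ordering_])
plt.xlabel("points in cluster order")
plt.ylabel("reachability distance")
plt.show()
```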
OPTICS: Example
[Figure: reachability plot; reachability distance (y-axis, with undefined values at the top) vs. cluster order of the objects (x-axis); valleys correspond to clusters]
Cluster Analysis
Overview
Partitioning methods
Hierarchical methods
Graph-based methods (CHAMELEON)
Self-organizing maps (SOM)
Density-based methods
Other: EM method, COBWEB
Cluster evaluation
Outlier analysis
Probabilistic Model-Based Clustering
Assume the data are generated by a mathematical model
Attempt to optimize the fit between the given data and some mathematical model
Typical methods
Statistical approach
EM (Expectation maximization)
Machine learning approach
COBWEB
Neural network approach
SOM (Self-Organizing Feature Map)
Clustering by Mixture Model
Assume data are generated by a mixture of probabilistic models
Generalization of k-means
Each cluster can be represented by a probabilistic model, like a Gaussian (continuous) or a Poisson (discrete) distribution.
Expectation Maximization (EM)
Starts with an initial estimate of the parameters of the mixture model
Iteratively refine the parameters using EM method
Expectation step: computes the expected likelihood that each data point Xi belongs to each cluster Ci
Maximization step: computes maximum likelihood estimates of the parameters
Until the parameters do not change, or the change falls below a threshold
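For a Gaussian mixture, scikit-learn's GaussianMixture implements exactly this E/M loop (a usage sketch; the data and the number of components are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.randn(300, 2)                 # placeholder data
gm = GaussianMixture(n_components=3, max_iter=100, tol=1e-3).fit(X)
resp = gm.predict_proba(X)   # E-step output: P(cluster Ci | data point Xi)
hard = gm.predict(X)         # hard assignment to the most likely component
print(gm.means_, gm.converged_)
```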
Conceptual Clustering
Conceptual clustering
Generates a concept description for each concept (class)
Produces a hierarchical category or classification scheme
Related to decision tree learning and mixture model learning
COBWEB (Fisher’87)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that concept
COBWEB Classification Tree
COBWEB: Learning the Classification Tree
Incrementally builds the classification tree
Given a new object
Search for the best node at which to incorporate the object or add a new node for the object
Update the probabilistic description at each node
Merging and splitting
Use a heuristic measure, Category Utility, to guide construction of the tree
COBWEB: Comments
Limitations
The assumption that the attributes are independent of each other is often too strong, because correlations may exist
Not suitable for clustering large databases: the tree may become skewed, and the probability distributions are expensive to maintain
Cluster Analysis
Overview
Partitioning methods
Hierarchical methods
Graph-based methods (CHAMELEON)
Self-organizing maps (SOM)
Density-based methods
Other: EM method, COBWEB
Cluster evaluation
Outlier analysis
Cluster Evaluation
Determine the clustering tendency of the data, i.e., distinguish whether non-random structure exists
Determine the correct number of clusters
Evaluate how well the cluster results fit the data without external information
Evaluate how well the cluster results compare to externally known results
Compare different clustering algorithms/results
Clusters found in Random Data
[Figure: randomly distributed points in the unit square (x, y), and the clusters found in them by K-means, DBSCAN, and complete-link clustering]
Measures of Cluster Validity
Unsupervised (internal indices): used to measure the goodness of a clustering structure without respect to external information
e.g., Sum of Squared Error (SSE)
Supervised (external indices): used to measure the extent to which cluster labels match externally supplied class labels
e.g., Entropy
Relative: used to compare two different clustering results
Often an external or internal index is used for this function, e.g., SSE or entropy
Internal Measures: Cohesion and Separation
Cluster cohesion: how closely related the objects in a cluster are
Cluster separation: how distinct or well-separated a cluster is from other clusters
Example: squared error
Cohesion: within-cluster sum of squares (SSE)
$WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$
Separation: between-cluster sum of squares
$BSS = \sum_{i} |C_i| (m - m_i)^2$
where m is the overall mean and m_i is the centroid of cluster C_i (see the sketch below)
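Both quantities are straightforward to compute; a minimal sketch (function and variable names are mine):

```python
import numpy as np

def wss_bss(X, labels):
    """Within-cluster (cohesion) and between-cluster (separation) sums of squares."""
    m = X.mean(axis=0)                        # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        m_i = members.mean(axis=0)            # cluster centroid
        wss += ((members - m_i) ** 2).sum()
        bss += len(members) * ((m - m_i) ** 2).sum()
    return wss, bss
```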
Internal Measures: Cohesion and Separation
Cluster cohesion is the sum of the weights of all links within a cluster.
Cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.
[Figure: graph-based view of cohesion (within-cluster links) vs. separation (between-cluster links)]
Internal Measures: Cohesion and Separation
Example: SSE
BSS + WSS = constant
Example: points {1, 2, 4, 5} on a line, overall mean m = 3; for K = 2, centroids m1 = 1.5 and m2 = 4.5

K = 1 cluster:
$WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$
$BSS = 4 \times (3-3)^2 = 0$
$Total = 10 + 0 = 10$

K = 2 clusters:
$WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$
$BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$
$Total = 1 + 9 = 10$
Internal Measures: SSE
SSE is good for comparing two clusterings
Can also be used to estimate the number of clusters
[Figure: a sample data set and the corresponding SSE-versus-K curve; the knee of the curve suggests the natural number of clusters]
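A sketch of this use, assuming a point array X and scikit-learn's KMeans (whose inertia_ attribute is the within-cluster SSE):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# SSE for a range of K; look for the knee ("elbow") in the resulting curve.
ks = range(2, 31)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()
```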
Internal Measures: SSE
Another example of a more complicated data set
[Figure: a more complicated data set with sub-clusters labeled 1-7, and the SSE of the clusters found using K-means]
Statistical Framework for SSE
Use the values resulting from random data as a baseline: the more "atypical" the observed value, the more likely there is valid structure in the data
Example: a clustering of the points below with SSE = 0.005
Compare against the SSE of three clusters in 500 sets of random data points
[Figure: histogram of the SSE of three clusters in 500 random data sets (values roughly 0.016 to 0.034), alongside the example points in the unit square; the observed SSE of 0.005 falls well outside this range]
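A sketch of this baseline idea (illustrative only; the exact SSE scale depends on the data and on how SSE is normalized, so the histogram above is not reproduced exactly):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# SSE (KMeans inertia_) of 3 clusters in 500 random data sets of 100 points each.
baseline = [KMeans(n_clusters=3, n_init=10).fit(rng.random((100, 2))).inertia_
            for _ in range(500)]
# An observed SSE far below this distribution suggests non-random structure.
print("baseline SSE range:", min(baseline), "to", max(baseline))
```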
External Measures
Compare cluster results with "ground truth" or a manual clustering
Classification-oriented measures: entropy, purity, precision, recall, F-measures
Similarity-oriented measures: Jaccard scores
External Measures: Classification-Oriented Measures
Entropy: the degree to which each cluster consists of objects of a single class
Precision: the fraction of a cluster that consists of objects of a specified class
Recall: the extent to which a cluster contains all objects of a specified class (see the sketch below)
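These measures follow directly from the cluster-by-class contingency counts; a minimal sketch (names are mine; labels holds cluster ids and classes the true class labels):

```python
import numpy as np

def cluster_class_measures(labels, classes):
    """Per-cluster entropy, plus precision and recall w.r.t. each class."""
    class_ids = np.unique(classes)
    class_sizes = np.array([(classes == k).sum() for k in class_ids])
    for c in np.unique(labels):
        members = classes[labels == c]
        counts = np.array([(members == k).sum() for k in class_ids])
        p = counts / counts.sum()                        # class fractions in cluster c
        entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()  # 0 if cluster is pure
        precision = p                                    # fraction of cluster in class k
        recall = counts / class_sizes                    # fraction of class k in cluster
        print(c, entropy, precision, recall)
```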
External Measure: Similarity-Oriented Measures
Given a reference clustering T and clustering S
f00: number of pairs of points belonging to different clusters in both T and S
f01: number of pairs of points belonging to different clusters in T but the same cluster in S
f10: number of pairs of points belonging to the same cluster in T but different clusters in S
f11: number of pairs of points belonging to the same cluster in both T and S
$Rand = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}}$
$Jaccard = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$
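Both indices follow from the four pair counts; an O(n²) sketch over all point pairs (the function name is mine):

```python
from itertools import combinations

def rand_jaccard(T, S):
    """Rand and Jaccard indices from two label vectors of equal length."""
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(T)), 2):
        same_t, same_s = T[i] == T[j], S[i] == S[j]
        if same_t and same_s:
            f11 += 1
        elif same_t:
            f10 += 1
        elif same_s:
            f01 += 1
        else:
            f00 += 1
    rand = (f00 + f11) / (f00 + f01 + f10 + f11)
    jaccard = f11 / (f01 + f10 + f11)
    return rand, jaccard
```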
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: well-separated clusters in the unit square and the corresponding point-by-point similarity matrix sorted by cluster label; crisp blocks appear along the diagonal]
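The reordering itself is simple; a sketch assuming an n×n similarity matrix sim and a label vector labels (both names are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

order = np.argsort(labels)                  # group points by cluster label
sorted_sim = sim[np.ix_(order, order)]      # reorder both rows and columns
plt.imshow(sorted_sim)                      # good clusterings show diagonal blocks
plt.colorbar(label="similarity")
plt.show()
```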
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: DBSCAN clusters found in random data and the corresponding sorted similarity matrix; the diagonal blocks are much weaker]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: K-means clusters found in random data and the corresponding sorted similarity matrix]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: complete-link clusters found in random data and the corresponding sorted similarity matrix]
Using Similarity Matrix for Cluster Validation
[Figure: the more complicated data set with DBSCAN clusters labeled 1-7, and the corresponding sorted similarity matrix]
Cluster Analysis
Overview
Partitioning methods
Hierarchical methods
Density-based methods
Other methods
Cluster evaluation
Outlier analysis
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high
dimensional data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the
clustering structure, SIGMOD’99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996
F. Beil, M. Ester, and X. Xu. Frequent Term-Based Text Clustering. KDD'02
M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers.
SIGMOD 2000.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in
large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing
techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-
172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on
dynamic systems. VLDB’98.
References (2)
V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering Categorical Data Using Summaries. KDD'99.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on
dynamic systems. In Proc. VLDB’98.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases.
SIGMOD'98.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In
ICDE'99, pp. 512-521, Sydney, Australia, March 1999.
A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with
Noise. KDD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John
Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John
Wiley and Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
References (3)
L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A Review ,
SIGKDD Explorations, 6(1), June 2004
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets.
Proc. 1996 Int. Conf. on Pattern Recognition.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering
approach for very large spatial databases. VLDB’98.
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering in Large
Databases, ICDT'01.
A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles , ICDE'01
H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data
sets, SIGMOD’ 02.
W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining.
VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for very
large databases. SIGMOD'96.
Clustering: Rich Applications and Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters and use them in other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document clustering
Cluster Weblog data to discover groups of similar access patterns