CS570: Introduction to Data Mining
Cluster Analysis
Reading: Chapter 10.4, 10.6, 11.1.3 Han, Chapter
8.4,8.5,9.2.2, 9.3 Tan
Anca Doloc-Mihu, Ph.D.
Slides courtesy of Li Xiong, Ph.D.,
©2011 Han, Kamber & Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann,
and
©2006 Tan, Steinbach & Kumar, Introduction to Data Mining, Pearson Addison Wesley
Cluster Analysis
Overview
Partitioning methods
Hierarchical methods
Graph-based methods (CHAMELEON)
Self-organizing maps (SOM)
Density-based methods
EM method
Cluster evaluation
Outlier analysis
Spatial Data
• A cluster is regarded as a contiguous region of high density
• A cluster can have an arbitrary shape
• Streaks and noise may exist
Density-Based Clustering Methods
Clustering based on density
Major features:
Clusters of arbitrary shape
Handle noise
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
DBSCAN: Basic Concepts
Density = number of points within a specified radius
core point: has high density
border point: has less density, but in the neighborhood of a core point
noise point: not a core point or a border point.
[Figure: example of a core point, a border point, and a noise point]
DBSCAN: Definitions
Two parameters:
Eps: radius of the neighborhood
MinPts: Minimum number of points in an Eps-neighborhood of that point
Eps-neighborhood of a point p:
NEps(p) = {q belongs to D | dist(p, q) ≤ Eps}
Core point: |NEps(q)| ≥ MinPts
[Figure: a core point q and a border point p, with MinPts = 5 and Eps = 1 cm]
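These definitions translate directly into code; a minimal sketch (function names are mine, distances are Euclidean):

```python
import numpy as np

def eps_neighborhood(X, i, eps):
    """Indices of all points within distance eps of point i (including i itself)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.where(dists <= eps)[0]

def is_core_point(X, i, eps, min_pts):
    """A point is a core point if its Eps-neighborhood holds at least MinPts points."""
    return len(eps_neighborhood(X, i, eps)) >= min_pts
```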
DBSCAN: Definitions
Directly density-reachable: p is directly density-reachable from q if p belongs to NEps(q) and q is a core point
Density-reachable: p is density-reachable from q if there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi
Density-connected: p and q are density-connected if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: examples of directly density-reachable (p, q), density-reachable (p, q via p1), and density-connected (p, q via o) points, with MinPts = 5 and Eps = 1 cm]
DBSCAN: Cluster Definition
A cluster is defined as a maximal set of density-connected points
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN: The Algorithm
Arbitrarily select an unvisited point p and mark it as visited
If p is a core point
Retrieve all points density-reachable from p w.r.t. Eps and MinPts; together with p, they form a cluster
Otherwise
mark the point as noise and
visit the next unvisited point in the database
Continue the process until all of the points have been processed (see the code sketch below)
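A minimal sketch of this procedure (illustrative, not the original pseudocode; it uses brute-force neighborhood queries, i.e., the O(n²) variant mentioned under Features):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: returns one label per point (-1 = noise, 0..k-1 = cluster id)."""
    n = len(X)
    labels = np.full(n, -1)            # -1 marks noise until proven otherwise
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def neighbors(i):
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        seeds = list(neighbors(p))
        if len(seeds) < min_pts:       # p is not a core point; leave it as noise for now
            continue
        labels[p] = cluster_id         # start a new cluster from core point p
        while seeds:                   # expand the cluster via density-reachability
            q = seeds.pop()
            if not visited[q]:
                visited[q] = True
                q_neighbors = neighbors(q)
                if len(q_neighbors) >= min_pts:   # q is also a core point: grow frontier
                    seeds.extend(q_neighbors)
            if labels[q] == -1:        # unclaimed point becomes part of this cluster
                labels[q] = cluster_id
        cluster_id += 1
    return labels
```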
DBSCAN: Sensitive to Parameters
DBSCAN: Determining EPS and MinPts
Basic idea:
Suppose the neighborhood size is k
For points within a cluster, their kth nearest neighbors are at roughly the same distance
Noise points have their kth nearest neighbor at a farther distance
Plot the sorted distance of every point to its kth nearest neighbor (see the sketch below)
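A sketch of this k-distance heuristic, assuming a point array X (the function name is mine); Eps is then chosen near the knee of the curve, with MinPts = k:

```python
import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(X, k=4):
    """Plot each point's distance to its k-th nearest neighbor, sorted ascending."""
    # Pairwise distance matrix; sorting each row puts the k-th NN in column k
    # (column 0 is the distance of a point to itself, which is 0).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists.sort(axis=1)
    plt.plot(np.sort(dists[:, k]))
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()
```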
DBSCAN: Core, Border and Noise Points
[Figure: original points and the point types (core, border, noise) found with Eps = 10, MinPts = 4]
When DBSCAN Does NOT Work Well
[Figure: original points, and DBSCAN results with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]
• Varying densities
• High-dimensional data
DBSCAN: Features
Complexity:
O(n²), can be reduced to O(n log n) if using index structures
Advantages
does not require the number of clusters (vs. k-means)
can find arbitrarily shaped clusters
can identify noise
mostly insensitive to the ordering of the points
Disadvantages
sensitive to parameters
does not respond well to data sets with varying densities
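In practice a library implementation is usually used; for example, scikit-learn's DBSCAN (parameter values here are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: arbitrarily shaped clusters k-means cannot find.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_            # cluster ids; -1 marks noise points
print("clusters:", len(set(labels) - {-1}), "noise points:", (labels == -1).sum())
```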
OPTICS: A Cluster-Ordering Method (1999)
OPTICS: Ordering Points To Identify the Clustering Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
Produces a special order of the database wrt its density-based clustering structure
This cluster-ordering contains info equiv to the density-based clusterings corresponding to a broad range of parameter settings
Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
Can be represented graphically or using visualization techniques
OPTICS: Some Extensions from DBSCAN
Core distance of p:
the smallest distance that makes p a core point
Reachability distance of p w.r.t. o:
max {core-distance(o), dist(o, p)}
e.g., r(p1, o) = 2.8 cm, r(p2, o) = 4 cm
[Figure: core distance and the reachability distances of p1 and p2 w.r.t. o, with MinPts = 5 and ε = 3 cm]
OPTICS: The Algorithm
Arbitrarily select an unvisited point p and mark it as visited
If p is a core point
Retrieve all points density-reachable from p w.r.t. Eps and MinPts
Update the reachability distances of the neighboring points and output the points in ascending order of reachability distance
Otherwise, visit the next unvisited point in the database
Continue the process until all of the points have been
processed.
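scikit-learn's OPTICS exposes the resulting ordering and reachability distances, which is enough to draw the reachability plot shown on the next slide (a usage sketch; X is assumed to be an (n, d) point array):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

opt = OPTICS(min_samples=5).fit(X)
# Reachability distances in cluster order: valleys correspond to clusters.
plt.plot(opt.reachability_[opt.ordering_])
plt.xlabel("points in cluster order")
plt.ylabel("reachability distance")
plt.show()
```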
OPTICS: Example
[Figure: reachability plot; reachability distance (y-axis, with undefined values at the top) vs. cluster order of the objects (x-axis); valleys correspond to clusters]
Cluster Analysis
Overview
Partitioning methods
Hierarchical methods
Graph-based methods (CHAMELEON)
Self-organizing maps (SOM)
Density-based methods
Other: EM method, COBWEB
Cluster evaluation
Outlier analysis
Probabilistic Model-Based Clustering
Assume the data are generated by a mathematical model
Attempt to optimize the fit between the given data and some mathematical model
Typical methods
Statistical approach
EM (Expectation maximization)
Machine learning approach
COBWEB
Neural network approach
SOM (Self-Organizing Feature Map)
Clustering by Mixture Model
Assume data are generated by a mixture of probabilistic models
Generalization of k-means
Each cluster can be represented by a probabilistic model, like a Gaussian (continuous) or a Poisson (discrete) distribution.
Expectation Maximization (EM)
Starts with an initial estimate of the parameters of the mixture model
Iteratively refine the parameters using EM method
Expectation step: computes the expected likelihood that each data point Xi belongs to each cluster Ci
Maximization step: computes maximum likelihood estimates of the parameters
Until the parameters do not change, or the change falls below a threshold
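For a Gaussian mixture, scikit-learn's GaussianMixture implements exactly this E/M loop (a usage sketch; the data and the number of components are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.randn(300, 2)                 # placeholder data
gm = GaussianMixture(n_components=3, max_iter=100, tol=1e-3).fit(X)
resp = gm.predict_proba(X)   # E-step output: P(cluster Ci | data point Xi)
hard = gm.predict(X)         # hard assignment to the most likely component
print(gm.means_, gm.converged_)
```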
Conceptual Clustering
Conceptual clustering
Generates a concept description for each concept (class)
Produces a hierarchical category or classification scheme
Related to decision tree learning and mixture model learning
COBWEB (Fisher’87)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that concept
COBWEB Classification Tree
COBWEB: Learning the Classification Tree
Incrementally builds the classification tree
Given a new object
Search for the best node at which to incorporate the object or add a new node for the object
Update the probabilistic description at each node
Merging and splitting
Use a heuristic measure, Category Utility, to guide construction of the tree
COBWEB: Comments
Limitations
The assumption that the attributes are independent of each other is often too strong, because correlations may exist
Not suitable for clustering large databases: the tree may become skewed, and the probability distributions are expensive to maintain
Cluster Analysis
Overview
Partitioning methods
Hierarchical methods
Graph-based methods (CHAMELEON)
Self-organizing maps (SOM)
Density-based methods
Other: EM method, COBWEB
Cluster evaluation
Outlier analysis
Cluster Evaluation
Determine the clustering tendency of the data, i.e., distinguish whether non-random structure exists
Determine the correct number of clusters
Evaluate how well the cluster results fit the data without external information
Evaluate how well the cluster results compare to externally known results
Compare different clustering algorithms/results
Clusters found in Random Data
[Figure: randomly distributed points in the unit square (x, y), and the clusters found in them by K-means, DBSCAN, and complete-link clustering]
Measures of Cluster Validity
Unsupervised (internal indices): used to measure the goodness of a clustering structure without respect to external information
e.g., Sum of Squared Error (SSE)
Supervised (external indices): used to measure the extent to which cluster labels match externally supplied class labels
e.g., Entropy
Relative: used to compare two different clustering results
Often an external or internal index is used for this function, e.g., SSE or entropy
Internal Measures: Cohesion and Separation
Cluster cohesion: how closely related the objects in a cluster are
Cluster separation: how distinct or well-separated a cluster is from other clusters
Example: squared error
Cohesion: within-cluster sum of squares (SSE)
$WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$
Separation: between-cluster sum of squares
$BSS = \sum_{i} |C_i| (m - m_i)^2$
where m is the overall mean and m_i is the centroid of cluster C_i (see the sketch below)
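Both quantities are straightforward to compute; a minimal sketch (function and variable names are mine):

```python
import numpy as np

def wss_bss(X, labels):
    """Within-cluster (cohesion) and between-cluster (separation) sums of squares."""
    m = X.mean(axis=0)                        # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        m_i = members.mean(axis=0)            # cluster centroid
        wss += ((members - m_i) ** 2).sum()
        bss += len(members) * ((m - m_i) ** 2).sum()
    return wss, bss
```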
Internal Measures: Cohesion and Separation
Cluster cohesion is the sum of the weights of all links within a cluster.
Cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.
[Figure: graph-based view of cohesion (within-cluster links) vs. separation (between-cluster links)]
Internal Measures: Cohesion and Separation
Example: SSE
BSS + WSS = constant
Example: points {1, 2, 4, 5} on a line, overall mean m = 3; for K = 2, centroids m1 = 1.5 and m2 = 4.5

K = 1 cluster:
$WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$
$BSS = 4 \times (3-3)^2 = 0$
$Total = 10 + 0 = 10$

K = 2 clusters:
$WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$
$BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$
$Total = 1 + 9 = 10$
Internal Measures: SSE
SSE is good for comparing two clusterings
Can also be used to estimate the number of clusters
[Figure: a sample data set and the corresponding SSE-versus-K curve; the knee of the curve suggests the natural number of clusters]
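A sketch of this use, assuming a point array X and scikit-learn's KMeans (whose inertia_ attribute is the within-cluster SSE):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# SSE for a range of K; look for the knee ("elbow") in the resulting curve.
ks = range(2, 31)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()
```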
Internal Measures: SSE
Another example of a more complicated data set
[Figure: a more complicated data set with sub-clusters labeled 1-7, and the SSE of the clusters found using K-means]
Statistical Framework for SSE
Use the values resulting from random data as a baseline: the more "atypical" the observed value, the more likely there is valid structure in the data
Example: a clustering of the points below with SSE = 0.005
Compare against the SSE of three clusters in 500 sets of random data points
[Figure: histogram of the SSE of three clusters in 500 random data sets (values roughly 0.016 to 0.034), alongside the example points in the unit square; the observed SSE of 0.005 falls well outside this range]
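A sketch of this baseline idea (illustrative only; the exact SSE scale depends on the data and on how SSE is normalized, so the histogram above is not reproduced exactly):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# SSE (KMeans inertia_) of 3 clusters in 500 random data sets of 100 points each.
baseline = [KMeans(n_clusters=3, n_init=10).fit(rng.random((100, 2))).inertia_
            for _ in range(500)]
# An observed SSE far below this distribution suggests non-random structure.
print("baseline SSE range:", min(baseline), "to", max(baseline))
```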
External Measures
Compare cluster results with "ground truth" or a manual clustering
Classification-oriented measures: entropy, purity, precision, recall, F-measures
Similarity-oriented measures: Jaccard scores
External Measures: Classification-Oriented Measures
Entropy: the degree to which each cluster consists of objects of a single class
Precision: the fraction of a cluster that consists of objects of a specified class
Recall: the extent to which a cluster contains all objects of a specified class (see the sketch below)
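These measures follow directly from the cluster-by-class contingency counts; a minimal sketch (names are mine; labels holds cluster ids and classes the true class labels):

```python
import numpy as np

def cluster_class_measures(labels, classes):
    """Per-cluster entropy, plus precision and recall w.r.t. each class."""
    class_ids = np.unique(classes)
    class_sizes = np.array([(classes == k).sum() for k in class_ids])
    for c in np.unique(labels):
        members = classes[labels == c]
        counts = np.array([(members == k).sum() for k in class_ids])
        p = counts / counts.sum()                        # class fractions in cluster c
        entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()  # 0 if cluster is pure
        precision = p                                    # fraction of cluster in class k
        recall = counts / class_sizes                    # fraction of class k in cluster
        print(c, entropy, precision, recall)
```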
External Measure: Similarity-Oriented Measures
Given a reference clustering T and clustering S
f00: number of pairs of points belonging to different clusters in both T and S
f01: number of pairs of points belonging to different clusters in T but the same cluster in S
f10: number of pairs of points belonging to the same cluster in T but different clusters in S
f11: number of pairs of points belonging to the same cluster in both T and S
$Rand = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}}$
$Jaccard = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$
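Both indices follow from the four pair counts; an O(n²) sketch over all point pairs (the function name is mine):

```python
from itertools import combinations

def rand_jaccard(T, S):
    """Rand and Jaccard indices from two label vectors of equal length."""
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(T)), 2):
        same_t, same_s = T[i] == T[j], S[i] == S[j]
        if same_t and same_s:
            f11 += 1
        elif same_t:
            f10 += 1
        elif same_s:
            f01 += 1
        else:
            f00 += 1
    rand = (f00 + f11) / (f00 + f01 + f10 + f11)
    jaccard = f11 / (f01 + f10 + f11)
    return rand, jaccard
```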
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: well-separated clusters in the unit square and the corresponding point-by-point similarity matrix sorted by cluster label; crisp blocks appear along the diagonal]
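The reordering itself is simple; a sketch assuming an n×n similarity matrix sim and a label vector labels (both names are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

order = np.argsort(labels)                  # group points by cluster label
sorted_sim = sim[np.ix_(order, order)]      # reorder both rows and columns
plt.imshow(sorted_sim)                      # good clusterings show diagonal blocks
plt.colorbar(label="similarity")
plt.show()
```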
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: DBSCAN clusters found in random data and the corresponding sorted similarity matrix; the diagonal blocks are much weaker]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: K-means clusters found in random data and the corresponding sorted similarity matrix]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: complete-link clusters found in random data and the corresponding sorted similarity matrix]
Using Similarity Matrix for Cluster Validation
[Figure: the more complicated data set with DBSCAN clusters labeled 1-7, and the corresponding sorted similarity matrix]
Cluster Analysis
Overview
Partitioning methods
Hierarchical methods
Density-based methods
Other methods
Cluster evaluation
Outlier analysis
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high
dimensional data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the
clustering structure, SIGMOD’99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996
F. Beil, M. Ester, and X. Xu. Frequent Term-Based Text Clustering. KDD'02
M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers.
SIGMOD 2000.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in
large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing
techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-
172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on
dynamic systems. VLDB’98.
References (2)
V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering Categorical Data Using Summaries. KDD'99.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on
dynamic systems. In Proc. VLDB’98.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases.
SIGMOD'98.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In
ICDE'99, pp. 512-521, Sydney, Australia, March 1999.
A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with
Noise. KDD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John
Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John
Wiley and Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
References (3)
L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A Review ,
SIGKDD Explorations, 6(1), June 2004
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets.
Proc. 1996 Int. Conf. on Pattern Recognition.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering
approach for very large spatial databases. VLDB’98.
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering in Large
Databases, ICDT'01.
A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles , ICDE'01
H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data
sets, SIGMOD’ 02.
W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining.
VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for very
large databases. SIGMOD'96.
Clustering: Rich Applications and Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters and use them in other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document clustering
Cluster Weblog data to discover groups of similar access patterns