Marcus Sampaio, DSC/UFCG
Lecture Notes for Chapter 8
Introduction to Data Mining by Tan, Steinbach, Kumar
Cluster Analysis: Basic Concepts and Algorithms
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Inter-cluster distances are maximized; intra-cluster distances are minimized
What Is Cluster Analysis?
• Description
– Group related stocks with similar price fluctuations
• Summarization
– Reduce the size of large data sets
Discovered Clusters → Industry Group
1: Applied-Matl-DOWN, Bay-Network-DOWN, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-DOWN, Tellabs-Inc-DOWN, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN → Technology1-DOWN
2: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
3: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
4: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
Clustering precipitation in Australia
Applications of Cluster Analysis
• Supervised classification
– Have class label information
• Simple segmentation
– Dividing students into different registration groups alphabetically, by last name
• Results of a query
– Groupings are a result of an external specification
• Graph partitioning
– Some mutual relevance and synergy, but the areas are not identical
What Is Not Cluster Analysis?
How many clusters? [Figure: the same set of points interpreted as two, four, or six clusters]
Notion of a Cluster Can Be Ambiguous
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
Note: hierarchical clustering is outside the scope of this course
Types of Clusterings
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or Conceptual
• Described by an Objective Function
Types of Clusters
• Well-Separated Clusters – A cluster is a set of points such that any point in a cluster
is closer (or more similar) to every other point in the cluster than to any point not in the cluster
3 well-separated clusters
Types of Clusters: Well-Separated
• Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster
4 center-based clusters
Types of Clusters: Center-Based
• Contiguous Cluster (Nearest neighbor or Transitive)– A cluster is a set of points such that a point in a cluster is
transitively closer (or more similar) to one or more other points in the cluster than to any point not in the cluster
8 contiguous clusters
Types of Clusters: Contiguity-Based
• Density-based
– A cluster is a dense region of points that is separated from other regions of high density by low-density regions
– Used when the clusters are irregular or intertwined, and when noise and outliers are present
6 density-based clusters
Types of Clusters: Density-Based
• Shared Property or Conceptual Clusters
– Finds clusters that share some common property or represent a particular concept
2 Overlapping Circles
Types of Clusters: Conceptual Clusters
Points as Representation of Instances
• How to represent an instance as a geometric point?
– Points P1 and P2 are closer to each other than either is to P3

Name      Dept  Course
Marcus    DSC   Computer Science        (P1)
Cláudio   DSC   Computer Science        (P2)
Péricles  DEE   Electrical Engineering  (P3)
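One concrete (and deliberately simple) way to turn such nominal records into geometric points, not taken from the slides, is one-hot encoding; the records below mirror the table above, while the encoding and distance function are illustrative choices only.

import numpy as np

# Hypothetical encoding chosen only to illustrate instances as points;
# the records mirror the table above.
records = [
    {"Name": "Marcus",   "Dept": "DSC", "Course": "Computer Science"},
    {"Name": "Cláudio",  "Dept": "DSC", "Course": "Computer Science"},
    {"Name": "Péricles", "Dept": "DEE", "Course": "Electrical Engineering"},
]
attrs = ["Dept", "Course"]                      # Name is an identifier, not a feature
values = {a: sorted({r[a] for r in records}) for a in attrs}

def to_point(record):
    """One-hot encode the nominal attributes into a 0/1 vector."""
    vec = []
    for a in attrs:
        vec.extend(1.0 if record[a] == v else 0.0 for v in values[a])
    return np.array(vec)

p1, p2, p3 = (to_point(r) for r in records)
print(np.linalg.norm(p1 - p2))   # 0.0 -> P1 and P2 coincide in this encoding
print(np.linalg.norm(p1 - p3))   # 2.0 -> P3 is farther from P1 (and from P2)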
• WEKA
– Represents each instance as a point, with one dimension per attribute
– Cluster centroids report the most frequent attribute values in the cluster (see the SimpleKMeans output later)
• Clusters Defined by an Objective Function
– Finds clusters that minimize or maximize an objective function
– Enumerate all possible ways of dividing the points into clusters and evaluate the “goodness” of each potential set of clusters by using the given objective function (NP-hard)
– Can have global or local objectives
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
Types of Clusters: Objective Function
• Map the clustering problem to a different domain and solve a related problem in that domain
– The proximity matrix defines a weighted graph, where the nodes are the points being clustered and the weighted edges represent the proximities between points
– Clustering is then equivalent to breaking the graph into connected components, one for each cluster
– We want to minimize the edge weight between clusters and maximize the edge weight within clusters
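As a sketch of this graph view (the threshold, data, and function name below are mine, not from the slides), one can treat pairs of points whose distance falls below a cutoff as edges and read clusters off as the connected components of that graph.

import numpy as np

def connected_component_clusters(points, max_dist):
    """Cluster points by treating pairs closer than max_dist as graph edges
    and returning the connected components (a sketch of the graph view)."""
    n = len(points)
    # Pairwise Euclidean distances play the role of the proximity matrix.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adjacent = dist < max_dist
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Simple graph traversal over the thresholded graph.
        stack = [start]
        labels[start] = current
        while stack:
            i = stack.pop()
            for j in np.nonzero(adjacent[i])[0]:
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(connected_component_clusters(pts, max_dist=1.0))   # e.g. [0, 0, 1, 1]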
Proximity Function        Centroid   Objective Function
Manhattan (L1)            median     Minimize the sum of the L1 distances of each object to its cluster centroid
Squared Euclidean (L2²)   mean       Minimize the sum of the squared L2 distances of each object to its cluster centroid
Cosine                    mean       Maximize the sum of the cosine similarities of each object to its cluster centroid
Bregman divergence        mean       Minimize the sum of the Bregman divergences of each object to its cluster centroid
• Type of proximity or density measure
– This is a derived measure, but central to clustering
• Sparseness
– Dictates type of similarity
– Adds to efficiency
• Attribute type
– Dictates type of similarity
• Type of data
– Dictates type of similarity
– Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and outliers
• Type of distribution
Characteristics of the Input Data Are Important
Measuring the Clustering Performance
• Notice that clustering is a descriptive model, not a predictive one (see Introduction)
– Performance metrics are different from those of supervised classification
• We will see the SSE metric later
• Other metrics:
– Precision
– Recall
– Entropy
– Purity
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid (a centroid-based objective function)
• Number of clusters, K, must be specified
• The basic algorithm is very simple (a sketch is given below)
K-means Clustering
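The slides state that the basic algorithm is very simple but do not reproduce its pseudocode, so here is a minimal sketch in Python (function and variable names are mine); the comments marked Step 3 and Step 4 follow the step numbering used later in these notes.

import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Basic K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Choose K initial centroids (here simply K distinct random data points).
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # Step 3: assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                      # converged: no point changed cluster
        labels = new_labels
        # Step 4: recompute each centroid as the mean of the points assigned to it.
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = points[labels == i].mean(axis=0)
    return centroids, labels

data = np.array([[1.0, 1.0], [2.0, 3.0], [6.0, 2.0], [8.0, 8.0], [9.0, 7.0]])
centroids, labels = kmeans(data, k=2)
print(centroids, labels)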
K-means Clustering – Details
• Proximity function
– Squared Euclidean (L2²)
• Type of centroid: mean
– Example of a centroid (mean)
• A cluster containing the three points (1,1), (2,3) and (6,2)
– Centroid = ((1+2+6)/3, (1+3+2)/3) = (3,2)
• A problem of minimization
• K-means always converges to a solution
– It reaches a state in which no points shift from one cluster to another, and hence the centroids don't change
• Initial centroids are often chosen randomly
– Clusters produced vary from one run to another
• Complexity is O(n * K * I * d)
– n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
• The actions of K-means in Steps 3 and 4 are only guaranteed to find a local minimum with respect to the sum of squared error (SSE)
– They optimize the SSE for specific choices of the centroids and clusters, rather than for all possible choices
Two Different K-means Clusterings
[Figure: the original points, an optimal clustering, and a sub-optimal clustering found by two runs of K-means]
[Figure: K-means iterations 1–6 for one choice of initial centroids]
• The most common measure is the Sum of Squared Error (SSE)
– For each point, the error is the distance to its nearest cluster centroid
– To get the SSE, we square these errors and sum them:

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2

– x is a data point in cluster C_i and m_i is the representative point for cluster C_i
• It can be shown that m_i corresponds to the center (mean) of the cluster
Evaluating K-means Clusters
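A small sketch of computing this SSE directly from the formula above (the function name and array layout are my own assumptions: points, integer labels, and centroids as NumPy arrays); it reuses the centroid example (3,2) given earlier.

import numpy as np

def sse(points, labels, centroids):
    """Sum of squared Euclidean distances of each point to its cluster centroid."""
    diffs = points - centroids[labels]        # x - m_i for each point x in C_i
    return float((diffs ** 2).sum())

pts = np.array([[1.0, 1.0], [2.0, 3.0], [6.0, 2.0]])
labels = np.array([0, 0, 0])
centroids = np.array([[3.0, 2.0]])            # the centroid example given earlier
print(sse(pts, labels, centroids))            # (4+1) + (1+1) + (9+0) = 16.0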
• Step 3: forms clusters by assigning points to their nearest centroid, which minimizes the SSE for the given set of centroids
• Step 4: recomputes the centroids so as to further minimize the SSE
• Given two different sets of clusters, we can choose the one with the smallest error
Step 1: Choosing Initial Centroids
• When random initialization of centroids is used, different runs of K-means typically produce different total SSEs
– The resulting clusters are often poor
• In the next slides, we provide another example of initial centroids, using the same data as in the former example
– Now the solution is suboptimal, i.e., the minimum-SSE clustering is not found
• In other words, the solution is only locally optimal
Step 1: Choosing Initial Centroids
[Figure: K-means iterations 1–5 for a poor choice of initial centroids]
• How to overcome the problem of choosing good initial centroids? Three approaches:
– Multiple runs
– A variant of K-means that is less susceptible to initialization problems
• Bisecting K-means
– Using postprocessing to “fix up” the set of clusters produced
Multiple Runs of K-means
• Given two different sets of clusters produced by two different runs of K-means, we prefer the one with the smallest SSE
– The centroids of that clustering are a better representation of the points in their clusters
• The technique
– Perform multiple runs, each with a different set of randomly chosen initial centroids
– Select the clusters with the minimum SSE
• May not work very well
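A sketch of the multiple-runs technique just described, here using scikit-learn's KMeans purely for convenience (the library choice and helper name are mine); each run starts from a different random initialization, and the run with the smallest SSE, exposed by scikit-learn as inertia_, is kept.

import numpy as np
from sklearn.cluster import KMeans

def best_of_runs(points, k, runs=10):
    """Run K-means several times with different random initial centroids and
    keep the clustering with the smallest SSE."""
    best = None
    for seed in range(runs):
        model = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(points)
        if best is None or model.inertia_ < best.inertia_:   # inertia_ is the SSE
            best = model
    return best

points = np.random.default_rng(0).random((100, 2))
model = best_of_runs(points, k=3)
print(model.inertia_, model.labels_[:10])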
Reducing the SSE with Postprocessing
• An obvious way to reduce the SSE is to find more clusters, i.e., to use a larger K
– The local SSEs become smaller, and so the global SSE also becomes smaller
• However, in many cases, we would like to improve the SSE but don't want to increase the number of clusters
• One strategy decreases the total SSE by increasing the number of clusters
– Split a cluster: the cluster with the largest SSE is usually chosen
• One strategy decreases the number of clusters while trying to minimize the increase in total SSE
– Merge two clusters: the clusters with the closest centroids are typically chosen (both moves are sketched below)
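Both moves can be sketched as follows, assuming a clustering given by points, labels, and centroids (helper names, and the use of scikit-learn's KMeans for the inner 2-means, are my own choices): splitting re-clusters the highest-SSE cluster into two, merging relabels the two clusters whose centroids are closest.

import numpy as np
from sklearn.cluster import KMeans

def split_worst_cluster(points, labels, centroids):
    """Split the cluster with the largest SSE by running 2-means inside it
    (assumes that cluster has at least two points)."""
    sse_per_cluster = [((points[labels == i] - c) ** 2).sum()
                       for i, c in enumerate(centroids)]
    worst = int(np.argmax(sse_per_cluster))
    members = np.flatnonzero(labels == worst)
    sub = KMeans(n_clusters=2, n_init=5, random_state=0).fit(points[members])
    new_labels = labels.copy()
    new_labels[members[sub.labels_ == 1]] = len(centroids)   # second half gets a new id
    return new_labels

def merge_closest_clusters(labels, centroids):
    """Merge the two clusters whose centroids are closest (the lower id is kept)."""
    d = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # ignore each centroid's distance to itself
    a, b = np.unravel_index(np.argmin(d), d.shape)
    new_labels = labels.copy()
    new_labels[new_labels == max(a, b)] = min(a, b)
    return new_labels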
• Bisecting K-means algorithm
– A variant of K-means that can produce a partitional or a hierarchical clustering

Bisecting K-means

Algorithm: Bisecting K-means
1: Initialize the list of clusters to contain the cluster consisting of all points
2: repeat
3:   Remove a cluster from the list of clusters
4:   {Perform several “trial” bisections of the chosen cluster}
5:   for i = 1 to number of trials do
6:     Bisect the selected cluster using basic 2-means
7:   end for
8:   Select the two clusters from the bisection with the lowest total SSE
9:   Add these two clusters to the list of clusters
10: until the list of clusters contains K clusters
• There are a number of different ways to choose which cluster to split at each step
– The largest cluster
– The cluster with the largest SSE
– A criterion based on both size and SSE
• One can refine the resulting clusters by using their centroids as the initial centroids for the basic K-means algorithm
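A sketch of bisecting K-means following the pseudocode above, with scikit-learn's KMeans standing in for "basic 2-means" and the largest-SSE cluster chosen for splitting (one of the criteria just listed); names and parameters are mine.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(points, k, trials=5):
    """Repeatedly bisect the cluster with the largest SSE until K clusters remain."""
    clusters = [points]                  # start with one cluster holding all the points
    while len(clusters) < k:
        # Choose the cluster with the largest SSE to split.
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        # Perform several trial bisections and keep the one with the lowest total SSE.
        best = min(
            (KMeans(n_clusters=2, n_init=1, random_state=t).fit(target) for t in range(trials)),
            key=lambda m: m.inertia_,
        )
        clusters.append(target[best.labels_ == 0])
        clusters.append(target[best.labels_ == 1])
    return clusters

parts = bisecting_kmeans(np.random.default_rng(1).random((200, 2)), k=4)
print([len(p) for p in parts])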
• K-means has problems when clusters have differing
– Sizes
– Densities
– Non-globular shapes
• K-means has problems when the data contains outliers
Limitations of K-means
Original Points K-means (3 Clusters)
Limitations of K-means: Differing Density
Original Points K-means (2 Clusters)
Limitations of K-means: Non-globular Shapes
Original Points; K-means Clusters
One solution is to use many clusters: find parts of clusters, then put them together.
Overcoming K-means Limitations
Running WEKA SimpleKMeans
=== Run information ===

Scheme:     weka.clusterers.SimpleKMeans -N 2 -S 10
Relation:   weather.symbolic
Instances:  14
Attributes: 5
            outlook
            temperature
            humidity
            windy
            play
Test mode:  evaluate on training data

=== Model and evaluation on training set ===

kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 26.0

Cluster centroids:
Cluster 0
  Mean/Mode: sunny mild high FALSE yes
  Std Devs:  N/A N/A N/A N/A N/A
Cluster 1
  Mean/Mode: overcast cool normal TRUE yes
  Std Devs:  N/A N/A N/A N/A N/A

Clustered Instances
0   10 ( 71%)
1    4 ( 29%)
0   sunny,hot,high,FALSE,no
1   sunny,hot,high,TRUE,no
2   overcast,hot,high,FALSE,yes
3   rainy,mild,high,FALSE,yes
4   rainy,cool,normal,FALSE,yes
5   rainy,cool,normal,TRUE,no
6   overcast,cool,normal,TRUE,yes
7   sunny,mild,high,FALSE,no
8   sunny,cool,normal,FALSE,yes
9   rainy,mild,normal,FALSE,yes
10  sunny,mild,normal,TRUE,yes
11  overcast,mild,high,TRUE,yes
12  overcast,hot,normal,FALSE,yes
13  rainy,mild,high,TRUE,no
DBSCAN
• DBSCAN is a density-based algorithm
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has more than a specified number of points (MinPts) within Eps
• These are the points in the interior of a cluster
– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
– A noise point is any point that is not a core point or a border point.
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
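A compact sketch of these two steps (the implementation details and names are mine): it first labels core, border, and noise points essentially as defined above, then grows clusters from core points that lie within Eps of each other, leaving noise labeled -1.

import numpy as np

def dbscan(points, eps, min_pts):
    """Simplified DBSCAN sketch: classify core/border/noise points, then grow
    clusters from core points. Returns labels with -1 for noise."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]   # includes i itself
    core = np.array([len(nb) >= min_pts for nb in neighbors])        # simplification of "more than MinPts"

    labels = np.full(n, -1)          # -1 = noise (or not yet assigned)
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from this unassigned core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster        # border or core point joins the cluster
                    if core[q]:
                        frontier.append(q)     # only core points extend the cluster
        cluster += 1
    return labels

pts = np.vstack([np.random.default_rng(2).normal(c, 0.2, (30, 2)) for c in ([0, 0], [3, 3])])
print(dbscan(pts, eps=0.5, min_pts=4))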
DBSCAN: Core, Border and Noise Points
Original Points Point types: core, border and noise
Eps = 10, MinPts = 4
When DBSCAN Works Well
Original Points → Clusters
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
[Figure: original points and two DBSCAN clusterings with (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]
• Varying densities
• High-dimensional data
DBSCAN: Determining Eps and MinPts
• The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
• Noise points have their kth nearest neighbor at a farther distance
• So, plot sorted distance of every point to its kth nearest neighbor
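A sketch of producing that plot (the value of k and the plotting choices are mine): compute every point's distance to its kth nearest neighbor, sort the distances, and look for the "knee" as a candidate Eps.

import numpy as np
import matplotlib.pyplot as plt

def kth_nn_distances(points, k):
    """Distance of every point to its kth nearest neighbor (excluding itself)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    dist_sorted = np.sort(dist, axis=1)
    return dist_sorted[:, k]           # column 0 is each point's distance to itself

pts = np.random.default_rng(3).random((300, 2))
d = np.sort(kth_nn_distances(pts, k=4))
plt.plot(d)
plt.xlabel("points sorted by distance")
plt.ylabel("4th nearest neighbor distance")   # the knee suggests a value for Eps
plt.show()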
• For supervised classification we have a variety of measures to evaluate how good our model is
– Accuracy, precision, recall
• For cluster analysis, the analogous question is: how do we evaluate the “goodness” of the resulting clusters?
• But “clusters are in the eye of the beholder”!
• Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Cluster Validity
Clusters Found in Random Data
[Figure: random points and the clusters found in them by K-means, DBSCAN, and complete link]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the ‘correct’ number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.
[Figure: the two data sets; Corr = -0.9235 and Corr = -0.5810]
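A sketch of the measure behind these numbers (my own implementation, not the one used to produce the figures): build a proximity matrix of pairwise distances and an incidence matrix that is 1 when two points share a cluster, then correlate their off-diagonal entries. Because proximity is measured as distance here, a good clustering yields a strongly negative correlation.

import numpy as np
from sklearn.cluster import KMeans

def clustering_correlation(points, labels):
    """Correlation between the (distance-based) proximity matrix and the
    cluster incidence matrix, taken over the upper-triangular entries."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(points), k=1)     # skip the diagonal and duplicate pairs
    return float(np.corrcoef(dist[iu], incidence[iu])[0, 1])

pts = np.vstack([np.random.default_rng(4).normal(c, 0.1, (50, 2)) for c in ([0, 0], [1, 1])])
labels = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(pts)
print(clustering_correlation(pts, labels))     # close to -1 for well-separated clusters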