
Marcus Sampaio
DSC/UFCG

Lecture Notes for Chapter 8

Introduction to Data Mining
by Tan, Steinbach, Kumar

Cluster Analysis: Basic Concepts and Algorithms

Marcus SampaioDSC/UFCG

What Is Cluster Analysis?

• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
  – Intra-cluster distances are minimized
  – Inter-cluster distances are maximized

Marcus SampaioDSC/UFCG

Applications of Cluster Analysis

• Description
  – Group related stocks with similar price fluctuations
• Summarization
  – Reduce the size of large data sets (e.g., clustering precipitation in Australia)

Discovered clusters and their industry groups:
1. Applied-Matl-DOWN, Bay-Network-DOWN, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-DOWN, Tellabs-Inc-DOWN, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN → Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP

Marcus SampaioDSC/UFCG

What Is Not Cluster Analysis?

• Supervised classification
  – Has class label information
• Simple segmentation
  – Dividing students into different registration groups alphabetically, by last name
• Results of a query
  – Groupings are the result of an external specification
• Graph partitioning
  – Some mutual relevance and synergy, but the areas are not identical

Marcus SampaioDSC/UFCG

Notion of a Cluster Can Be Ambiguous

• How many clusters? The same set of points can reasonably be grouped into two, four, or six clusters (figure).

Marcus SampaioDSC/UFCG

Types of Clusterings

• A clustering is a set of clusters
• There is an important distinction between hierarchical and partitional sets of clusters
• Partitional clustering
  – A division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree
  – Note: hierarchical clustering is out of the scope of this course

Marcus SampaioDSC/UFCG

Partitional Clustering

(Figure: original points and a partitional clustering of them.)

Marcus SampaioDSC/UFCG

Types of Clusters

• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or conceptual clusters
• Clusters described by an objective function

Marcus SampaioDSC/UFCG

Types of Clusters: Well-Separated

• Well-separated clusters
  – A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster

(Figure: 3 well-separated clusters.)

Marcus SampaioDSC/UFCG

Types of Clusters: Center-Based

• Center-based clusters
  – A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster
  – The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of the cluster

(Figure: 4 center-based clusters.)

Marcus SampaioDSC/UFCG

Types of Clusters: Contiguity-Based

• Contiguous clusters (nearest neighbor or transitive)
  – A cluster is a set of points such that a point in a cluster is transitively closer (or more similar) to one or more other points in the cluster than to any point not in the cluster

(Figure: 8 contiguous clusters.)

Marcus SampaioDSC/UFCG

Types of Clusters: Density-Based

• Density-based clusters
  – A cluster is a dense region of points separated from other regions of high density by low-density regions
  – Used when the clusters are irregular or intertwined, and when noise and outliers are present

(Figure: 6 density-based clusters.)

Marcus SampaioDSC/UFCG

Types of Clusters: Conceptual Clusters

• Shared-property or conceptual clusters
  – Find clusters that share some common property or represent a particular concept

(Figure: 2 overlapping circles.)

Marcus SampaioDSC/UFCG

Points as Representation of Instances

• How can an instance be represented as a geometric point?
  – Points P1 and P2 should end up closer to each other than either is to P3

  Name     | Dept | Course
  Marcus   | DSC  | Computer Science       (P1)
  Cláudio  | DSC  | Computer Science       (P2)
  Péricles | DEE  | Electrical Engineering (P3)

Marcus SampaioDSC/UFCG

• WEKA
  – Builds points by instance and by attribute
  – Describes a cluster by the most frequent attribute values in the cluster

Marcus SampaioDSC/UFCG

Types of Clusters: Objective Function

• Clusters defined by an objective function
  – Find clusters that minimize or maximize an objective function
  – Enumerate all possible ways of dividing the points into clusters and evaluate the ‘goodness’ of each potential set of clusters using the given objective function (NP-hard)
  – Objectives can be global or local
    • Hierarchical clustering algorithms typically have local objectives
    • Partitional algorithms typically have global objectives

Marcus SampaioDSC/UFCG

• Map the clustering problem to a different domain and solve a related problem in that domain
  – The proximity matrix defines a weighted graph, where the nodes are the points being clustered and the weighted edges represent the proximities between points
  – Clustering is then equivalent to breaking the graph into connected components, one for each cluster
  – We want to minimize the edge weight between clusters and maximize the edge weight within clusters

Marcus SampaioDSC/UFCG

Proximity Function       | Centroid | Objective Function
Manhattan (L1)           | median   | Minimize the sum of the L1 distances of each object to its cluster centroid
Squared Euclidean (L2²)  | mean     | Minimize the sum of the squared L2 distances of each object to its cluster centroid
Cosine                   | mean     | Maximize the sum of the cosine similarities of each object to its cluster centroid
Bregman divergence       | mean     | Minimize the sum of the Bregman divergences of each object to its cluster centroid

Marcus SampaioDSC/UFCG

Characteristics of the Input Data Are Important

• Type of proximity or density measure
  – This is a derived measure, but central to clustering
• Sparseness
  – Dictates the type of similarity
  – Adds to efficiency
• Attribute type
  – Dictates the type of similarity
• Type of data
  – Dictates the type of similarity
  – Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and outliers
• Type of distribution

Marcus SampaioDSC/UFCG

Measuring the Clustering Performance

• Note that clustering is a descriptive model, not a predictive one (see Introduction)
  – Its performance metrics differ from those of supervised classification
    • We will see the SSE metric later
    • Other metrics: precision, recall, entropy, purity

Marcus SampaioDSC/UFCG

Clustering Algorithms

• K-means and one variant

Marcus SampaioDSC/UFCG

K-means Clustering

• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid (a centroid-based objective function)
• The number of clusters, K, must be specified
• The basic algorithm is very simple (a sketch is given below)
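Below is a minimal sketch of this basic algorithm in Python/NumPy. It is illustrative only: the function name kmeans, the data array X, the iteration cap max_iters, and the random seed are assumptions and not part of the slides, and empty clusters are not handled.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic K-means sketch: random initial centroids, assign points, recompute centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: select K points as the initial centroids (randomly, as the slides note)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: form K clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the centroid of each cluster as the mean of its points
        #         (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer change (the convergence condition on the slides)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids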

Marcus SampaioDSC/UFCG

K-means Clustering – Details

• Proximity function
  – Squared Euclidean distance (L2²)
• Type of centroid: the mean
  – Example of a centroid (mean): a cluster containing the three points (1,1), (2,3), and (6,2) has
    centroid = ((1+2+6)/3, (1+3+2)/3) = (3, 2)  (see the snippet below)
• K-means is a minimization problem
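As a quick check of the centroid computation above (NumPy used only for illustration):

import numpy as np

points = np.array([[1, 1], [2, 3], [6, 2]])
centroid = points.mean(axis=0)   # component-wise mean of the three points
print(centroid)                  # [3. 2.]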

Marcus SampaioDSC/UFCG

• K-means always converges to a solution
  – It reaches a state in which no points shift from one cluster to another, and hence the centroids no longer change
• Initial centroids are often chosen randomly
  – The clusters produced vary from one run to another
• Complexity is O(n * K * I * d)
  – n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

Marcus SampaioDSC/UFCG

• The actions of K-means in Steps 3 and 4 are only guaranteed to find a local minimum with respect to the sum of squared errors (SSE)
  – They optimize the SSE for specific choices of the centroids and clusters, rather than for all possible choices

Marcus SampaioDSC/UFCG

Two Different K-means Clusterings

(Figure: the same original points clustered in two ways: an optimal clustering and a sub-optimal clustering.)

Marcus SampaioDSC/UFCG

(Figure: K-means iterations 1 through 6 for one choice of initial centroids, showing the cluster assignments and centroid positions at each iteration.)

Marcus SampaioDSC/UFCG

(Figure: the same six iterations shown individually, one iteration per frame.)

Marcus SampaioDSC/UFCG

Evaluating K-means Clusters

• The most common measure is the Sum of Squared Errors (SSE)
  – For each point, the error is the distance to the nearest cluster
  – To get the SSE, we square these errors and sum them:

    SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)

  – x is a data point in cluster C_i and m_i is the representative point for cluster C_i
    • It can be shown that m_i corresponds to the center (mean) of the cluster
  – A small computation sketch is given below
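A small sketch of the SSE defined above, given the points, their cluster labels, and the centroids; the names X, labels, and centroids are illustrative assumptions.

import numpy as np

def sse(X, labels, centroids):
    """SSE = sum over clusters i of sum over x in C_i of dist(m_i, x)^2 (squared Euclidean)."""
    total = 0.0
    for i, m in enumerate(centroids):
        diff = X[labels == i] - m          # offsets of the points of cluster i from its centroid
        total += np.sum(diff ** 2)
    return total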

Marcus SampaioDSC/UFCG

• Step 3 forms clusters by assigning points to their nearest centroid, which minimizes the SSE for the given set of centroids
• Step 4 recomputes the centroids so as to further minimize the SSE
• Given two different sets of clusters, we can choose the one with the smaller error

Marcus SampaioDSC/UFCG

Step 1: Choosing Initial Centroids

• When random initialization of centroids is used, different runs of K-means typically produce different total SSEs
  – The resulting clusters are often poor
• The next slides give another example of initial centroids, using the same data as in the previous example
  – Now the solution is suboptimal, i.e., the minimum-SSE clustering is not found
    • The solution is only locally optimal

Marcus SampaioDSC/UFCG

(Figure: K-means iterations 1 through 5 for a poor choice of initial centroids, converging to a suboptimal clustering.)

Marcus SampaioDSC/UFCG

(Figure: the same five iterations shown individually, one iteration per frame.)

Marcus SampaioDSC/UFCG

• How can the problem of choosing good initial centroids be overcome? Three approaches:
  – Multiple runs
  – A variant of K-means that is less susceptible to initialization problems
    • Bisecting K-means
  – Using postprocessing to “fix up” the set of clusters produced

Marcus SampaioDSC/UFCG

Multiple Runs of K-means

• Given two different sets of clusters produced by two different runs of K-means, we prefer the one with the smaller SSE
  – The centroids of that clustering are a better representation of the points in their clusters
• The technique (sketched below)
  – Perform multiple runs
    • Each with a different set of randomly chosen initial centroids
  – Select the clustering with the minimum SSE
• May not work very well
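The multiple-runs technique could be sketched as below, reusing the kmeans and sse helpers sketched earlier; the number of runs (10) is an arbitrary choice.

def best_of_n_runs(X, k, n_runs=10):
    """Run K-means several times with different randomly chosen initial centroids
    and keep the clustering with the minimum SSE."""
    best = None
    for seed in range(n_runs):
        labels, centroids = kmeans(X, k, seed=seed)   # kmeans/sse: sketches given earlier
        error = sse(X, labels, centroids)
        if best is None or error < best[0]:
            best = (error, labels, centroids)
    return best   # (smallest SSE, its labels, its centroids)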

Marcus SampaioDSC/UFCG

Reducing the SSE with Postprocessing

• An obvious way to reduce the SSE is to find more clusters, i.e., to use a larger K
  – The individual cluster SSEs become smaller, and so the global SSE also becomes smaller
• However, in many cases we would like to improve the SSE without increasing the number of clusters

Marcus SampaioDSC/UFCG

• One strategy that decreases the total SSE by increasing the number of clusters
  – Split a cluster: the cluster with the largest SSE is usually chosen
• One strategy that decreases the number of clusters while trying to minimize the increase in total SSE
  – Merge two clusters: the clusters with the closest centroids are typically chosen

Marcus SampaioDSC/UFCG

Bisecting K-means

• Bisecting K-means algorithm
  – A variant of K-means that can produce either a partitional or a hierarchical clustering (a sketch follows the pseudocode)

Algorithm: Bisecting K-means
1:  Initialize the list of clusters to contain the cluster consisting of all points
2:  repeat
3:    Remove a cluster from the list of clusters
4:    {Perform several “trial” bisections of the chosen cluster}
5:    for i = 1 to number of trials do
6:      Bisect the selected cluster using basic 2-means
7:    end for
8:    Select the two clusters from the bisection with the lowest total SSE
9:    Add these two clusters to the list of clusters
10: until the list of clusters contains K clusters
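A rough sketch of this pseudocode, reusing the kmeans and sse helpers sketched earlier. Splitting the cluster with the largest SSE is only one of the selection criteria discussed on the next slide, and degenerate bisections (empty halves) are not handled here.

import numpy as np

def bisecting_kmeans(X, K, n_trials=5):
    """Repeatedly bisect one cluster with basic 2-means until K clusters are obtained."""
    clusters = [X]                                   # line 1: start with one cluster of all points
    while len(clusters) < K:                         # lines 2/10: repeat until K clusters
        # line 3: remove a cluster (here, the one with the largest SSE)
        idx = max(range(len(clusters)),
                  key=lambda i: sse(clusters[i],
                                    np.zeros(len(clusters[i]), dtype=int),
                                    [clusters[i].mean(axis=0)]))
        target = clusters.pop(idx)
        # lines 4-7: perform several trial bisections with basic 2-means
        best_pair, best_err = None, np.inf
        for seed in range(n_trials):
            labels, centroids = kmeans(target, 2, seed=seed)
            err = sse(target, labels, centroids)
            if err < best_err:
                best_err, best_pair = err, (target[labels == 0], target[labels == 1])
        # lines 8-9: keep the bisection with the lowest total SSE
        clusters.extend(best_pair)
    return clusters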

Marcus SampaioDSC/UFCG

• There are a number of different ways to choose which cluster to split at each step
  – The largest cluster
  – The cluster with the largest SSE
  – A criterion based on both size and SSE
• One can refine the resulting clusters by using their centroids as the initial centroids for the basic K-means algorithm

Marcus SampaioDSC/UFCG

Bisecting K-means Example

Marcus SampaioDSC/UFCG

Limitations of K-means

• K-means has problems when clusters have
  – Differing sizes
  – Differing densities
  – Non-globular shapes
• K-means has problems when the data contains outliers

Marcus SampaioDSC/UFCG

Limitations of K-means: Differing Sizes

(Figure: original points vs. K-means with 3 clusters.)

Marcus SampaioDSC/UFCG

Limitations of K-means: Differing Density

(Figure: original points vs. K-means with 3 clusters.)

Marcus SampaioDSC/UFCG

Limitations of K-means: Non-globular Shapes

(Figure: original points vs. K-means with 2 clusters.)

Marcus SampaioDSC/UFCG

Overcoming K-means Limitations

• One solution is to use many clusters: K-means then finds parts of the true clusters, which must afterwards be put back together

(Figure: original points vs. K-means clusters.)

Marcus SampaioDSC/UFCG

Original Points K-means Clusters

Marcus SampaioDSC/UFCG

Original Points K-means Clusters

Marcus SampaioDSC/UFCG

Running WEKA SimpleKMeans

=== Run information ===

Scheme:     weka.clusterers.SimpleKMeans -N 2 -S 10
Relation:   weather.symbolic
Instances:  14
Attributes: 5
            outlook
            temperature
            humidity
            windy
            play
Test mode:  evaluate on training data

=== Model and evaluation on training set ===

Marcus SampaioDSC/UFCG

Marcus SampaioDSC/UFCG

kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 26.0

Cluster centroids:

Cluster 0
  Mean/Mode: sunny mild high FALSE yes
  Std Devs:  N/A N/A N/A N/A N/A
Cluster 1
  Mean/Mode: overcast cool normal TRUE yes
  Std Devs:  N/A N/A N/A N/A N/A

Clustered Instances
0   10 ( 71%)
1    4 ( 29%)

Marcus SampaioDSC/UFCG

Marcus SampaioDSC/UFCG

 0  sunny,hot,high,FALSE,no
 1  sunny,hot,high,TRUE,no
 2  overcast,hot,high,FALSE,yes
 3  rainy,mild,high,FALSE,yes
 4  rainy,cool,normal,FALSE,yes
 5  rainy,cool,normal,TRUE,no
 6  overcast,cool,normal,TRUE,yes
 7  sunny,mild,high,FALSE,no
 8  sunny,cool,normal,FALSE,yes
 9  rainy,mild,normal,FALSE,yes
10  sunny,mild,normal,TRUE,yes
11  overcast,mild,high,TRUE,yes
12  overcast,hot,normal,FALSE,yes
13  rainy,mild,high,TRUE,no

Marcus SampaioDSC/UFCG

DBSCAN

• DBSCAN is a density-based algorithm
  – Density = number of points within a specified radius (Eps)
  – A point is a core point if it has more than a specified number of points (MinPts) within Eps
    • These are points in the interior of a cluster
  – A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point
  – A noise point is any point that is neither a core point nor a border point (a classification sketch follows below)
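The core/border/noise classification above could be sketched as follows; eps and min_pts stand in for the slide's Eps and MinPts, and whether the count includes the point itself (and uses > or >=) varies between presentations.

import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' following the DBSCAN definitions."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    in_eps = dists <= eps                       # Eps-neighborhood membership (includes the point itself)
    core = in_eps.sum(axis=1) >= min_pts        # core: at least MinPts points within Eps
    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif core[in_eps[i]].any():             # border: within Eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels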

Marcus SampaioDSC/UFCG

DBSCAN: Core, Border, and Noise Points

Marcus SampaioDSC/UFCG

DBSCAN Algorithm

• Eliminate the noise points
• Perform clustering on the remaining points (a usage sketch follows below)
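For running DBSCAN itself, a hedged usage sketch with scikit-learn (assuming scikit-learn is installed; the data is synthetic and the parameter values are arbitrary):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(200, 2)           # illustrative 2-D data
db = DBSCAN(eps=0.1, min_samples=4).fit(X)           # eps plays the role of Eps, min_samples of MinPts
labels = db.labels_                                   # cluster index per point; -1 marks noise
print("clusters:", len(set(labels) - {-1}), "noise points:", int((labels == -1).sum()))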

Marcus SampaioDSC/UFCG

DBSCAN: Core, Border, and Noise Points

(Figure: original points vs. point types, core, border, and noise, with Eps = 10 and MinPts = 4.)

Marcus SampaioDSC/UFCG

When DBSCAN Works Well

(Figure: original points and the clusters found by DBSCAN.)

• Resistant to Noise

• Can handle clusters of different shapes and sizes

Marcus SampaioDSC/UFCG

When DBSCAN Does NOT Work Well

(Figure: original points, plus DBSCAN results with MinPts = 4, Eps = 9.75 and with MinPts = 4, Eps = 9.92.)

• Varying densities

• High-dimensional data

Marcus SampaioDSC/UFCG

DBSCAN: Determining Eps and MinPts

• The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance
• Noise points have their k-th nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its k-th nearest neighbor (a plotting sketch follows below)
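A sketch of that k-distance plot using scikit-learn's NearestNeighbors and matplotlib (both assumed available); choosing k = MinPts is the usual convention.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k=4):
    """Plot the sorted distance of every point to its k-th nearest neighbor;
    a 'knee' in the curve suggests a value of Eps for MinPts = k."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own 0-th neighbor
    dists, _ = nn.kneighbors(X)
    kth = np.sort(dists[:, k])                        # k-th nearest neighbor distance, sorted
    plt.plot(kth)
    plt.xlabel("points sorted by k-th nearest neighbor distance")
    plt.ylabel("k-th nearest neighbor distance")
    plt.show()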

Marcus SampaioDSC/UFCG

Cluster Validity

• For supervised classification we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall
• For cluster analysis, the analogous question is: how do we evaluate the “goodness” of the resulting clusters?
• But “clusters are in the eye of the beholder”!
• Then why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To compare two sets of clusters
  – To compare two clusters

Marcus SampaioDSC/UFCG

Clusters Found in Random Data

(Figure: the same random points as clustered by K-means, DBSCAN, and complete link; each algorithm reports clusters even in random data.)

Marcus SampaioDSC/UFCG

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information
   – Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better
5. Determining the ‘correct’ number of clusters

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Marcus SampaioDSC/UFCG

• Correlation of incidence and proximity matrices for the K-means clusterings of two data sets

(Figure: two scatter plots, with Corr = -0.9235 and Corr = -0.5810, respectively.)

Marcus SampaioDSC/UFCG

WEKA

• Clustering algorithms
  – SimpleKMeans
  – MakeDensityBasedClusterer
  – Cobweb
  – EM
  – FarthestFirst

