
Cluster Analysis Potyó László. Cluster: a collection of data objects Similar to one another...

Date post: 13-Dec-2015
Upload: tiffany-lyons
Page 1: Cluster Analysis Potyó László. Cluster: a collection of data objects Similar to one another within the same cluster Similar to one another within the.

Cluster Analysis

Potyó László

What is Cluster Analysis?

Cluster: a collection of data objects
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters

Cluster analysis: grouping a set of data objects into clusters

Number of possible clusterings of n objects: the Bell number

Clustering is unsupervised classification: no predefined classes
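
To give a feel for how quickly the number of possible clusterings grows, here is a minimal sketch that computes Bell numbers with the Bell triangle. The function name `bell_number` is chosen for illustration only and does not come from the slides.

```python
def bell_number(n: int) -> int:
    """Count the ways to split n labelled objects into non-empty clusters."""
    # Bell triangle: every new row starts with the last entry of the
    # previous row; each further entry adds the entry to its left and
    # the entry above that left neighbour.
    row = [1]
    for _ in range(n):
        next_row = [row[-1]]
        for value in row:
            next_row.append(next_row[-1] + value)
        row = next_row
    # The first entry of row n is the n-th Bell number.
    return row[0]


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 10):
        print(n, bell_number(n))
```

Already for 10 objects there are more than a hundred thousand possible clusterings, which is why clustering methods search heuristically instead of enumerating every partition.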

General Applications of Clustering

• Pattern Recognition
• Spatial Data Analysis
• Image Processing
• Economic Science
• WWW

Examples of Clustering Applications

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and geographical location

What Is Good Clustering?

• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Example: [figure]

Requirements of Clustering

• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects.

Some popular ones include the Minkowski distance:

$$d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

If q = 1, d is the Manhattan distance

If q = 2, d is the Euclidean distance

Properties:
• $d(i,j) \ge 0$
• $d(i,i) = 0$
• $d(i,j) = d(j,i)$
• $d(i,j) \le d(i,k) + d(k,j)$
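
The Minkowski distance and its two special cases can be sketched in a few lines. The function name `minkowski` and the example points are illustrative assumptions, not part of the slides.

```python
def minkowski(x, y, q=2):
    """Minkowski distance of order q between two equal-length points."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical example objects in p = 2 dimensions.
i = (0.0, 0.0)
j = (3.0, 4.0)
k = (1.0, 1.0)

manhattan = minkowski(i, j, q=1)  # |3| + |4|
euclidean = minkowski(i, j, q=2)  # sqrt(3^2 + 4^2)

if __name__ == "__main__":
    print("Manhattan:", manhattan)
    print("Euclidean:", euclidean)
    # Triangle inequality check for one triple of points.
    print(minkowski(i, j) <= minkowski(i, k) + minkowski(k, j))
```

Changing q changes which points count as "close", so the choice of q is part of the similarity measure and can change the resulting clusters.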

Categorization of Clustering Methods

• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods

K-means

1. Ask user how many clusters they’d like. (e.g. k=5)

2. Randomly guess k cluster Center locations

3. Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)

4. Each Center finds the centroid of the points it owns…

5. …and jumps there

6. …Repeat until terminated!
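
The six steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the initial Centers are drawn from the datapoints themselves, "terminated" is taken to mean that no datapoint changes owner, and the names `kmeans` and `squared_distance` are made up for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, owners) after running the steps described above."""
    rng = random.Random(seed)
    # Step 2: randomly guess k Center locations (assumption: pick k datapoints).
    centers = rng.sample(points, k)
    owners = None
    for _ in range(max_iter):
        # Step 3: each datapoint finds the Center it is closest to.
        new_owners = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 6: stop once ownership no longer changes (assumed criterion).
        if new_owners == owners:
            break
        owners = new_owners
        # Steps 4 and 5: each Center jumps to the centroid of its points.
        for c in range(k):
            owned = [p for p, o in zip(points, owners) if o == c]
            if owned:  # a Center that owns nothing stays where it is
                centers[c] = tuple(sum(col) / len(owned) for col in zip(*owned))
    return centers, owners


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, owners = kmeans(data, k=2)
    print(centers)
    print(owners)
```

The result depends on the random initial guess, so in practice the procedure is often run several times with different seeds and the best result is kept.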

The GMM assumption

• There are k components. The i’th component is called $\omega_i$

• Component $\omega_i$ has an associated mean vector $\mu_i$

• Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$

Assume that each datapoint is generated according to the following recipe:

1. Pick a component at random. Choose component i with probability $P(\omega_i)$.

2. Datapoint ~ $N(\mu_i, \sigma^2 I)$

[Figure: the component means $\mu_1$, $\mu_2$, $\mu_3$ and a generated datapoint x]

The General GMM assumption

• There are k components. The i’th component is called $\omega_i$

• Component $\omega_i$ has an associated mean vector $\mu_i$

• Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$

Assume that each datapoint is generated according to the following recipe:

1. Pick a component at random. Choose component i with probability $P(\omega_i)$.

2. Datapoint ~ $N(\mu_i, \Sigma_i)$

[Figure: the component means $\mu_1$, $\mu_2$, $\mu_3$]
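
The two-step recipe can be sketched as a small sampler. Everything named here (`cholesky`, `sample_gmm`, the example parameters) is a hypothetical illustration. It assumes each covariance matrix is symmetric positive definite, so that a Cholesky factor L with L Lᵀ = Σ exists and a standard normal vector z can be mapped to μ + L z.

```python
import math
import random


def cholesky(cov):
    """Lower-triangular L with L * L^T == cov (cov must be positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = cov[r][c] - sum(L[r][m] * L[c][m] for m in range(c))
            L[r][c] = math.sqrt(s) if r == c else s / L[c][c]
    return L


def sample_gmm(priors, means, covs, n, seed=0):
    """Draw n (component index, datapoint) pairs from a Gaussian mixture."""
    rng = random.Random(seed)
    factors = [cholesky(cov) for cov in covs]
    data = []
    for _ in range(n):
        # Step 1: choose component i with probability P(omega_i).
        i = rng.choices(range(len(priors)), weights=priors)[0]
        # Step 2: datapoint ~ N(mu_i, Sigma_i), built as mu_i + L_i z.
        z = [rng.gauss(0.0, 1.0) for _ in means[i]]
        x = [
            m + sum(factors[i][r][c] * z[c] for c in range(r + 1))
            for r, m in enumerate(means[i])
        ]
        data.append((i, x))
    return data


if __name__ == "__main__":
    # Made-up parameters for two components in 2 dimensions.
    priors = [0.4, 0.6]
    means = [[0.0, 0.0], [5.0, 5.0]]
    covs = [[[1.0, 0.0], [0.0, 1.0]], [[4.0, 2.0], [2.0, 3.0]]]
    for label, point in sample_gmm(priors, means, covs, n=5):
        print(label, point)
```

Setting every covariance matrix to σ² times the identity gives the simpler spherical case, where each coordinate is drawn independently with standard deviation σ.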

Expectation-Maximization (EM)

• Solves estimation with incomplete data.
• Obtain initial estimates for parameters.
• Iteratively use estimates for missing data and continue until convergence.

EM algorithm

Iterative algorithm maximizing the log-likelihood function

• E-step
• M-step
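
The slide's own E-step and M-step formulas did not survive in this transcript, so the following is only a sketch of the standard EM updates for a mixture of spherical Gaussians, each with its own variance. The names `gaussian_pdf` and `em_spherical_gmm`, the initialisation (means drawn from the datapoints, unit variances, equal priors), the fixed iteration count, and the small variance floor are all assumptions made for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var * I) at point x."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return math.exp(-sq / (2.0 * var)) / (2.0 * math.pi * var) ** (d / 2.0)


def em_spherical_gmm(points, k, n_iter=50, seed=0):
    """Return (priors, means, variances, log_likelihoods)."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    priors = [1.0 / k] * k
    means = [list(p) for p in rng.sample(points, k)]
    variances = [1.0] * k
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each datapoint.
        resp, ll = [], 0.0
        for x in points:
            weighted = [
                priors[i] * gaussian_pdf(x, means[i], variances[i])
                for i in range(k)
            ]
            total = sum(weighted)
            ll += math.log(total)
            resp.append([w / total for w in weighted])
        log_likelihoods.append(ll)
        # M-step: re-estimate priors, means and variances from the
        # responsibility-weighted datapoints.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            priors[i] = n_i / n
            means[i] = [
                sum(r[i] * x[j] for r, x in zip(resp, points)) / n_i
                for j in range(d)
            ]
            sq = sum(
                r[i] * sum((a - b) ** 2 for a, b in zip(x, means[i]))
                for r, x in zip(resp, points)
            )
            variances[i] = max(sq / (d * n_i), 1e-6)  # floor avoids collapse
    return priors, means, variances, log_likelihoods
```

Each iteration should not decrease the log-likelihood, which makes the returned list a convenient convergence check. This sketch does not guard against a datapoint whose density underflows to zero under every component.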

Sample 1

Clustering data generated by a mixture of three Gaussians in 2 dimensions

• number of points: 500
• priors are: 0.3, 0.5 and 0.2
• centers are: (2, 3.5), (0, 0), (0, 2)
• variances: 0.2, 0.5 and 1.0
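
A data set with these parameters could be generated along the following lines. This is a sketch under assumptions: each variance is read as a spherical variance σ² applied to both coordinates, and each component receives exactly prior × 500 points. The constant and function names are hypothetical.

```python
import math
import random

PRIORS = [0.3, 0.5, 0.2]
CENTERS = [(2.0, 3.5), (0.0, 0.0), (0.0, 2.0)]
VARIANCES = [0.2, 0.5, 1.0]


def generate_sample(n=500, seed=0):
    """Return (points, labels, counts) for the three-Gaussian mixture."""
    rng = random.Random(seed)
    # Assumption: fixed per-component counts of prior * n points.
    counts = [round(p * n) for p in PRIORS]
    points, labels = [], []
    for i, count in enumerate(counts):
        sd = math.sqrt(VARIANCES[i])  # standard deviation from variance
        for _ in range(count):
            points.append(tuple(rng.gauss(c, sd) for c in CENTERS[i]))
            labels.append(i)
    return points, labels, counts


if __name__ == "__main__":
    points, labels, counts = generate_sample()
    print(counts)
    print(points[:3])
```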

Sample 1

Raw data (number of points, center):
• 150 (2, 3.5)
• 250 (0, 0)
• 100 (0, 2)

After Clustering (number of points, center):
• 149 (1.9941, 3.4742)
• 265 (0.0306, 0.0026)
• 86 (0.1395, 1.9759)

Sample 2

Clustering three-dimensional data

• Number of points: 1000
• Unknown source
• Optimal number of components = ?
• Estimated parameters = ?

Sample 2

[Figure: Raw data; After Clustering]

Assumed number of clusters: 5

Sample 2 – table of estimated parameters

References

[1] http://www.autonlab.org/tutorials/gmm14.pdf

[2] http://www.autonlab.org/tutorials/kmeans11.pdf

[3] http://info.ilab.sztaki.hu/~lukacs/AdatbanyaEA2005/klaszterezes.pdf

[4] http://www.stat.auckland.ac.nz/~balemi/Data%20Mining%20in%20Market%20Research.ppt

[5] http://www.ncrg.aston.ac.uk/netlab

