The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Clustering
COMP 790-90 Research Seminar
BCB 713 Module
Spring 2011
Wei Wang
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Outline
What is clustering
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based clustering methods
Outlier analysis
What Is Clustering?
Group data into clusters
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Unsupervised learning: no predefined classes
[Figure: scatter plot showing Cluster 1, Cluster 2, and a few outliers]
Application Examples
A stand-alone tool: explore the data distribution
A preprocessing step for other algorithms
Pattern recognition, spatial data analysis, image processing, market research, WWW, …
Cluster documents
Cluster web log data to discover groups of similar access patterns
What Is A Good Clustering?
High intra-class similarity and low inter-class similarity
Depends on the similarity measure
The ability to discover some or all of the hidden patterns
Requirements of Clustering
Scalability
Ability to deal with various types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Requirements of Clustering
Able to deal with noise and outliers
Insensitive to order of input records
Ability to handle high dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Data Matrix
For memory-based clustering
Also called object-by-variable structure
Represents n objects with p variables (attributes, measures)
A relational table:

  x_11 … x_1f … x_1p
   …      …      …
  x_i1 … x_if … x_ip
   …      …      …
  x_n1 … x_nf … x_np
Dissimilarity Matrix
For memory-based clustering
Also called object-by-object structure
Proximities of pairs of objects
d(i,j): dissimilarity between objects i and j
Nonnegative
Close to 0: similar

     0
   d(2,1)    0
   d(3,1)  d(3,2)    0
     …       …       …
   d(n,1)  d(n,2)    …    0
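The object-by-object structure above is straightforward to build; `dissimilarity_matrix` below is an illustrative helper (the name and the 1-D example are not from the slides), storing only the lower triangle since d(i,j) = d(j,i) and d(i,i) = 0.

```python
def dissimilarity_matrix(objects, d):
    """Lower-triangular object-by-object structure: row i holds d(i, j)
    for j < i; the zero diagonal d(i, i) = 0 is left implicit."""
    return [[d(objects[i], objects[j]) for j in range(i)]
            for i in range(len(objects))]

# Example with 1-D objects and absolute difference as the dissimilarity
points = [1.0, 2.0, 6.0]
print(dissimilarity_matrix(points, lambda a, b: abs(a - b)))
# [[], [1.0], [5.0, 4.0]]
```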
How Good Is A Clustering?
Dissimilarity/similarity depends on distance function
Different applications have different functions
Judgment of clustering quality is typically highly subjective
Types of Data in Clustering
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Similarity and Dissimilarity Between Objects
Distance functions are the most commonly used measures
Minkowski distance: a generalization

  d(i,j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q),  q > 0

If q = 2, d is Euclidean distance
If q = 1, d is Manhattan distance
Weighted distance:

  d(i,j) = (w_1 |x_i1 − x_j1|^q + w_2 |x_i2 − x_j2|^q + … + w_p |x_ip − x_jp|^q)^(1/q),  q > 0
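The definitions above can be sketched directly; `minkowski` below is a hypothetical helper (not from the slides) showing how q = 2 and q = 1 recover the Euclidean and Manhattan cases, and how the weighted form fits in.

```python
def minkowski(x, y, q=2, w=None):
    """Minkowski distance between equal-length vectors x and y.

    q=2 gives Euclidean distance, q=1 gives Manhattan distance.
    w is an optional per-attribute weight vector for the weighted form."""
    if w is None:
        w = [1.0] * len(x)
    return sum(wi * abs(xi - yi) ** q
               for wi, xi, yi in zip(w, x, y)) ** (1.0 / q)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, q=2))  # Euclidean: 5.0
print(minkowski(a, b, q=1))  # Manhattan: 7.0
```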
Properties of Minkowski Distance
Nonnegative: d(i,j) ≥ 0
The distance of an object to itself is 0: d(i,i) = 0
Symmetric: d(i,j) = d(j,i)
Triangle inequality: d(i,j) ≤ d(i,k) + d(k,j)
Categories of Clustering Approaches (1)
Partitioning algorithms
Partition the objects into k clusters
Iteratively reallocate objects to improve the clustering
Hierarchical algorithms
Agglomerative: each object starts as its own cluster; merge clusters to form larger ones
Divisive: all objects start in one cluster; split it into smaller clusters
Categories of Clustering Approaches (2)
Density-based methods
Based on connectivity and density functions
Filter out noise, find clusters of arbitrary shape
Grid-based methods
Quantize the object space into a grid structure
Model-based methods
Use a model to find the best fit of the data
Partitioning Algorithms: Basic Concepts
Partition n objects into k clusters
Optimize the chosen partitioning criterion
Global optimum: examine all partitions; the number of ways to partition n objects into k non-empty clusters (the Stirling number of the second kind) grows exponentially, too expensive!
Heuristic methods: k-means and k-medoids
K-means: each cluster is represented by its center
K-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster
K-means
Arbitrarily choose k objects as the initial cluster centers
Until no change, do:
(Re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster
Update the cluster means, i.e., calculate the mean value of the objects in each cluster
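The loop above can be sketched as follows; this is a minimal illustration (function and variable names are not from the slides), using the first k objects as the "arbitrary" initial centers so the run is deterministic.

```python
def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of 2-D points; returns (means, assignment)."""
    # "Arbitrarily" take the first k objects as initial centers
    # (deterministic here so the example is reproducible).
    means = list(points[:k])
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest mean
        new_assignment = [
            min(range(k), key=lambda c: (p[0] - means[c][0]) ** 2
                                        + (p[1] - means[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:  # no change: converged
            break
        assignment = new_assignment
        # Update the cluster means
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                means[c] = (sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members))
    return means, assignment

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
means, labels = kmeans(pts, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```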
K-Means: Example
[Figure: four scatter plots of the same 2-D data set (axes 0-10) tracing one run with K=2: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the means again, repeating until assignments stop changing.]
Pros and Cons of K-means
Relatively efficient: O(tkn), where n = # objects, k = # clusters, t = # iterations; normally k, t << n
Often terminates at a local optimum
Applicable only when the mean is defined; what about categorical data?
Need to specify the number of clusters
Unable to handle noisy data and outliers
Unsuitable for discovering non-convex clusters
Variations of the K-means
Aspects of variation:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes
Use the mode instead of the mean
Mode: the most frequent item(s)
A mixture of categorical and numerical data: k-prototype method
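The mode-based center update that distinguishes k-modes from k-means can be illustrated in a few lines; `mode_center` is a hypothetical helper, not from the slides.

```python
from collections import Counter

def mode_center(records):
    """Cluster 'center' for categorical records: the per-attribute mode
    (most frequent value), used by k-modes in place of the mean."""
    n_attrs = len(records[0])
    return tuple(Counter(r[i] for r in records).most_common(1)[0][0]
                 for i in range(n_attrs))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_center(cluster))  # ('red', 'small')
```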
A Problem of K-means
Sensitive to outliers
Outlier: objects with extremely large values
May substantially distort the distribution of the data
K-medoids: use the most centrally located object in a cluster instead of the mean
[Figure: two scatter plots (axes 0-10) of the same data set with '+' marking the cluster centers, illustrating how an outlier drags the mean away from the bulk of its cluster while the medoid remains a centrally located object.]
PAM: A K-medoids Method
PAM: Partitioning Around Medoids
Arbitrarily choose k objects as the initial medoids
Until no change, do:
(Re)assign each object to the cluster of its nearest medoid
Randomly select a non-medoid object o', and compute the total cost S of swapping a medoid o with o'
If S < 0, swap o with o' to form the new set of k medoids
Swapping Cost
Measure whether o’ is better than o as a medoid
Use the squared-error criterion:

  E = Σ_{i=1}^{k} Σ_{p ∈ C_i} d(p, o_i)²

Compute E_{o'} − E_o
Negative: swapping brings benefit
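The swapping cost follows directly from the squared-error criterion; the sketch below (names and the 1-D data are illustrative, not from the slides) evaluates S = E_{o'} − E_o for one candidate swap.

```python
def squared_error(points, medoids, d):
    """E = sum over clusters C_i of sum_{p in C_i} d(p, o_i)^2, where each
    point p belongs to the cluster of its nearest medoid o_i."""
    return sum(min(d(p, o) for o in medoids) ** 2 for p in points)

def swap_gain(points, medoids, o, o_new, d):
    """S = E_{o'} - E_o for swapping medoid o with non-medoid o_new;
    a negative S means the swap improves the clustering."""
    swapped = [o_new if m == o else m for m in medoids]
    return squared_error(points, swapped, d) - squared_error(points, medoids, d)

dist = lambda a, b: abs(a - b)
pts = [1, 2, 3, 10, 11, 12]
S = swap_gain(pts, [3, 11], o=3, o_new=2, d=dist)
print(S)  # -3: swapping medoid 3 for the more central 2 brings benefit
```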
PAM: Example
[Figure: scatter plots (axes 0-10) tracing one PAM run with K=2. Total cost = 20: arbitrarily choose k objects as the initial medoids; assign each remaining object to its nearest medoid. Randomly select a non-medoid object O_random and compute the total cost of swapping (here total cost = 26); swap O and O_random if quality is improved. Loop until no change.]
Pros and Cons of PAM
PAM is more robust than k-means in the presence of noise and outliers
Medoids are less influenced by outliers
PAM works efficiently for small data sets but does not scale well to large data sets
O(k(n − k)²) per iteration
Sampling-based method: CLARA
CLARA (Clustering LARge Applications)
CLARA (Kaufmann and Rousseeuw, 1990)
Built into statistical analysis packages, such as S+
Draw multiple samples of the data set, apply PAM to each sample, and return the best clustering
Performs better than PAM on larger data sets
Efficiency depends on the sample size
A good clustering of a sample may not be a good clustering of the whole data set
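A CLARA-style loop can be sketched as follows; this is a toy illustration, with the inner PAM replaced by brute-force medoid selection over the sample (an assumption that is only feasible for tiny samples), and all names are hypothetical.

```python
import itertools
import random

def total_cost(points, medoids, d):
    """Cost of a medoid set over the whole data set: sum of distances
    from each point to its nearest medoid."""
    return sum(min(d(p, m) for m in medoids) for p in points)

def clara(points, k, d, n_samples=5, sample_size=10, seed=0):
    """CLARA sketch: run a PAM-like search on several random samples and
    keep the medoid set that scores best on the FULL data set. The inner
    search here is brute force over the sample, standing in for PAM."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = min(itertools.combinations(sample, k),
                      key=lambda ms: total_cost(sample, ms, d))
        cost = total_cost(points, medoids, d)  # evaluate on all data
        if cost < best_cost:
            best, best_cost = list(medoids), cost
    return best, best_cost

pts = [1, 2, 3, 10, 11, 12]
dist = lambda a, b: abs(a - b)
medoids, cost = clara(pts, k=2, d=dist, sample_size=6)
print(sorted(medoids), cost)  # [2, 11] 4
```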
CLARANS (Clustering Large Applications based upon RANdomized Search)
The problem space: a graph of clusterings
A vertex is a set of k medoids chosen from the n objects; C(n, k) vertices in total
PAM searches the whole graph
CLARA searches some random sub-graphs
CLARANS climbs mountains (hill climbing):
Randomly sample a set and select k medoids
Consider neighbors of the medoids as candidates for new medoids
Use the sample set to verify
Repeat multiple times to avoid bad samples
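The randomized hill climbing described above might be sketched like this; the names, parameter defaults, and the tiny 1-D data set are illustrative, not from CLARANS' original formulation.

```python
import random

def total_cost(points, medoids, d):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(d(p, m) for m in medoids) for p in points)

def clarans(points, k, d, num_local=3, max_neighbor=50, seed=0):
    """CLARANS sketch: randomized search on the graph whose vertices are
    k-medoid sets and whose edges swap one medoid. From each of num_local
    random starts, examine up to max_neighbor random neighbors, moving
    whenever a neighbor is cheaper; keep the best local minimum found."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(num_local):
        current = rng.sample(points, k)          # random starting vertex
        cost = total_cost(points, current, d)
        tries = 0
        while tries < max_neighbor:
            # A neighbor differs from the current vertex in one medoid
            i = rng.randrange(k)
            o_new = rng.choice([p for p in points if p not in current])
            neighbor = current[:i] + [o_new] + current[i + 1:]
            n_cost = total_cost(points, neighbor, d)
            if n_cost < cost:                    # downhill move: reset counter
                current, cost, tries = neighbor, n_cost, 0
            else:
                tries += 1
        if cost < best_cost:
            best, best_cost = current, cost
    return best, best_cost

pts = [1, 2, 3, 10, 11, 12]
medoids, cost = clarans(pts, k=2, d=lambda a, b: abs(a - b))
print(sorted(medoids), cost)
```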