Presented By: Ashwin Shenoy M (4CB13SCS02)
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods (conceptual clustering, neural networks)
A partitioning method: construct a partition of a database D of n objects into a set of k clusters such that
• each cluster contains at least one object
• each object belongs to exactly one cluster
Methods:
• k-means: each cluster is represented by the center of the cluster (the centroid).
• k-medoids: each cluster is represented by one of the objects in the cluster (the medoid).
Input to the algorithm: the number of clusters k, and a database of n objects
The algorithm consists of four steps:
1. partition the objects into k nonempty subsets/clusters
2. compute a seed point as the centroid (the mean of the objects in the cluster) for each cluster in the current partition
3. assign each object to the cluster with the nearest centroid
4. go back to Step 2; stop when there are no more new assignments
An alternative algorithm also consists of four steps:
1. arbitrarily choose k objects as the initial cluster centers (centroids)
2. (re)assign each object to the cluster with the nearest centroid
3. update the centroids
4. go back to Step 2; stop when there are no more new assignments
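The four steps can be sketched in Python. This is a minimal illustration rather than the presentation's own code; for simplicity, the first k objects serve as the initial centroids instead of an arbitrary choice:

```python
def kmeans(points, k, max_iter=100):
    """Plain k-means on d-dimensional points (tuples of floats)."""
    # step 1: choose initial cluster centers (here simply the first k objects)
    centroids = [list(p) for p in points[:k]]
    assignment = [-1] * len(points)
    for _ in range(max_iter):
        changed = False
        # step 2: (re)assign each object to the cluster with the nearest centroid
        for i, p in enumerate(points):
            c = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            if c != assignment[i]:
                assignment[i], changed = c, True
        # step 3: update each centroid to the mean of the objects in its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
        # step 4: stop when there are no more new assignments
        if not changed:
            break
    return centroids, assignment
```

On a small, well-separated 2-D dataset, a few passes are enough for the assignments to stabilize.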
[Figure: k-means example — four scatter plots (axes 0–10) showing the clustering over successive assignment and centroid-update iterations]
Input to the algorithm: the number of clusters k, and a database of n objects
The algorithm consists of four steps:
1. arbitrarily choose k objects as the initial medoids (representative objects)
2. assign each remaining object to the cluster with the nearest medoid
3. select a nonmedoid and replace one of the medoids with it if this improves the clustering
4. go back to Step 2; stop when there are no more new assignments
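A compact sketch of this swap-based (PAM-style) procedure — again illustrative only. The initial medoids are simply the first k objects, and `dist` is any pairwise distance function supplied by the caller:

```python
def kmedoids(points, k, dist):
    """PAM-style k-medoids: medoids are actual objects; greedily swap a medoid
    with a nonmedoid while the total distance to the nearest medoid improves."""
    medoids = list(range(k))  # indices of the initial medoids (first k objects)

    def cost(meds):
        # total distance of every object to its nearest medoid
        return sum(min(dist(p, points[m]) for m in meds) for p in points)

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for h in range(len(points)):
                if h in medoids:
                    continue
                # try replacing the mi-th medoid with nonmedoid h
                trial = medoids[:mi] + [h] + medoids[mi + 1:]
                c = cost(trial)
                if c < best:  # keep the swap only if it improves the clustering
                    medoids, best, improved = trial, c, True
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return [points[m] for m in medoids], labels
```

Because medoids are actual objects rather than means, this variant is less sensitive to outliers than k-means, at the cost of evaluating many candidate swaps.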
A hierarchical method: construct a hierarchy of clusterings, not just a single partition of the objects.
• The number of clusters k is not required as an input.
• Uses a distance matrix as the clustering criterion.
• A termination condition can be used (e.g., a desired number of clusters).
1. Agglomerative (bottom-up):
• place each object in its own cluster
• in each step, merge the two most similar clusters until there is only one cluster left or the termination condition is satisfied
2. Divisive (top-down):
• start with one big cluster containing all the objects
• divide the most distinctive cluster into smaller clusters and proceed until there are n clusters or the termination condition is satisfied
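The agglomerative (bottom-up) variant can be sketched directly from the steps above. This illustrative version measures cluster similarity by single-link distance (the closest pair of members) and uses a target number of clusters k as the termination condition:

```python
def agglomerative(points, k, dist):
    """Bottom-up single-link clustering: start with singletons and merge the
    two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # each object starts in its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]  # merge the two most similar clusters
        del clusters[j]
    return clusters
```

Swapping the `min` for a `max` over member pairs would give complete-link clustering instead; the overall merge loop is unchanged.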
[Figure: agglomerative vs. divisive clustering of objects a, b, c, d, e — agglomerative (Steps 0–4) merges {a, b} and {d, e}, then {c, d, e}, and finally {a, b, c, d, e}; divisive (Steps 4–0) performs the same splits in reverse]
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:
• Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure).
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree.
Clustering feature:
• a summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
• registers the crucial measurements for computing clusters and uses storage efficiently
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering:
• a nonleaf node in the tree has descendants or "children"
• the nonleaf nodes store the sums of the CFs of their children
A CF tree has two parameters:
• branching factor: the maximum number of children per nonleaf node
• threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
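Concretely, the clustering feature is the triple CF = (N, LS, SS) — the count, linear sum, and square sum of the points — and it is additive, which is what lets nonleaf nodes store the sums of their children's CFs. A small sketch (illustrative, not BIRCH's actual implementation):

```python
def cf(points):
    """Clustering Feature of a set of d-dimensional points: (N, LS, SS)."""
    n = len(points)
    d = len(points[0])
    ls = [sum(p[i] for p in points) for i in range(d)]  # linear sum (1st moment)
    ss = sum(x * x for p in points for x in p)          # square sum (2nd moment)
    return n, ls, ss

def merge(cf1, cf2):
    """CF additivity: the CF of a union of subclusters is the component-wise sum."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2

def centroid(c):
    """Cluster centroid LS / N, computable from the CF alone."""
    n, ls, _ = c
    return [x / n for x in ls]

def radius(c):
    """Root-mean-square distance of members from the centroid, from the CF alone."""
    n, ls, ss = c
    return (ss / n - sum((x / n) ** 2 for x in ls)) ** 0.5
```

The threshold parameter is enforced with exactly such a radius/diameter check before a new point is absorbed into a leaf entry.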
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6 — the root and nonleaf nodes hold CF entries (CF1, CF2, …) with child pointers, and the leaf nodes hold CF entries chained by prev/next pointers]
ROCK: RObust Clustering using linKs
Major idea:
• use links to measure similarity/proximity
Links: the number of common neighbors
• C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
• C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
• link(T1, T2) = 4, since they have 4 common neighbors: {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
• link(T1, T3) = 3, since they have 3 common neighbors: {a, b, d}, {a, b, e}, {a, b, g}
Thus link is a better measure than the Jaccard coefficient: Jaccard rates T1 and T3 as more similar (2 of 4 items shared) than T1 and T2 (1 of 5), even though T1 and T2 come from the same cluster C1.
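The link counts above can be reproduced with a short sketch. It is illustrative only: it takes two transactions to be neighbors when their Jaccard coefficient is at least θ = 0.5, and, to match the slide's counts, does not treat T1 or T2 as its own neighbor:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def link(t1, t2, universe, theta=0.5):
    """link(T1, T2) = number of common neighbors: points in the universe whose
    Jaccard similarity to both T1 and T2 is >= theta (T1, T2 themselves excluded)."""
    return sum(1 for x in universe
               if x != t1 and x != t2
               and jaccard(x, t1) >= theta and jaccard(x, t2) >= theta)

# the example's points: all 3-subsets of C1 = {a, b, c, d, e} and C2 = {a, b, f, g}
pts = [frozenset(c) for c in combinations("abcde", 3)]
pts += [frozenset(c) for c in combinations("abfg", 3)]
```

With these points, `link(T1, T2)` counts the four common neighbors containing c plus one of {a, b} and one of {d, e}, while `link(T1, T3)` counts the three remaining sets containing both a and b.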
Measures similarity based on a dynamic model:
• two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters
A two-phase algorithm:
1. use a graph-partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters
2. use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters
[Figure: the two-phase framework — Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters]
THANK YOU