Presented By: Ashwin Shenoy M (4CB13SCS02)
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods (conceptual clustering, neural networks)
A partitioning method: construct a partition of a database D of n objects into a set of k clusters such that
• each cluster contains at least one object
• each object belongs to exactly one cluster
Methods:
• k-means: each cluster is represented by the center of the cluster (the centroid).
• k-medoids: each cluster is represented by one of the objects in the cluster (the medoid).
Input to the algorithm: the number of clusters k, and a database of n objects
The algorithm consists of four steps:
1. partition the objects into k nonempty subsets/clusters
2. compute a seed point as the centroid (the mean of the objects in the cluster) for each cluster in the current partition
3. assign each object to the cluster with the nearest centroid
4. go back to Step 2; stop when there are no more new assignments
An alternative algorithm also consists of four steps:
1. arbitrarily choose k objects as the initial cluster centers (centroids)
2. (re)assign each object to the cluster with the nearest centroid
3. update the centroids
4. go back to Step 2; stop when there are no more new assignments
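The four steps can be sketched in Python. This is a minimal illustration rather than the presentation's own code; for simplicity, the first k objects serve as the initial centroids instead of an arbitrary choice:

```python
def kmeans(points, k, max_iter=100):
    """Plain k-means on d-dimensional points (tuples of floats)."""
    # step 1: choose initial cluster centers (here simply the first k objects)
    centroids = [list(p) for p in points[:k]]
    assignment = [-1] * len(points)
    for _ in range(max_iter):
        changed = False
        # step 2: (re)assign each object to the cluster with the nearest centroid
        for i, p in enumerate(points):
            c = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            if c != assignment[i]:
                assignment[i], changed = c, True
        # step 3: update each centroid to the mean of the objects in its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
        # step 4: stop when there are no more new assignments
        if not changed:
            break
    return centroids, assignment
```

On a small, well-separated 2-D dataset, a few passes are enough for the assignments to stabilize.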
[Figure: k-means example — four scatter plots (axes 0–10) showing the clustering over successive assignment and centroid-update iterations]
Input to the algorithm: the number of clusters k, and a database of n objects
The algorithm consists of four steps:
1. arbitrarily choose k objects as the initial medoids (representative objects)
2. assign each remaining object to the cluster with the nearest medoid
3. select a nonmedoid and replace one of the medoids with it if this improves the clustering
4. go back to Step 2; stop when there are no more new assignments
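A compact sketch of this swap-based (PAM-style) procedure — again illustrative only. The initial medoids are simply the first k objects, and `dist` is any pairwise distance function supplied by the caller:

```python
def kmedoids(points, k, dist):
    """PAM-style k-medoids: medoids are actual objects; greedily swap a medoid
    with a nonmedoid while the total distance to the nearest medoid improves."""
    medoids = list(range(k))  # indices of the initial medoids (first k objects)

    def cost(meds):
        # total distance of every object to its nearest medoid
        return sum(min(dist(p, points[m]) for m in meds) for p in points)

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for h in range(len(points)):
                if h in medoids:
                    continue
                # try replacing the mi-th medoid with nonmedoid h
                trial = medoids[:mi] + [h] + medoids[mi + 1:]
                c = cost(trial)
                if c < best:  # keep the swap only if it improves the clustering
                    medoids, best, improved = trial, c, True
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return [points[m] for m in medoids], labels
```

Because medoids are actual objects rather than means, this variant is less sensitive to outliers than k-means, at the cost of evaluating many candidate swaps.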
A hierarchical method: construct a hierarchy of clusterings, not just a single partition of the objects.
• The number of clusters k is not required as an input.
• Uses a distance matrix as the clustering criterion.
• A termination condition can be used (e.g., a desired number of clusters).
1. Agglomerative (bottom-up):
• place each object in its own cluster
• in each step, merge the two most similar clusters until there is only one cluster left or the termination condition is satisfied
2. Divisive (top-down):
• start with one big cluster containing all the objects
• divide the most distinctive cluster into smaller clusters and proceed until there are n clusters or the termination condition is satisfied
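The agglomerative (bottom-up) variant can be sketched directly from the steps above. This illustrative version measures cluster similarity by single-link distance (the closest pair of members) and uses a target number of clusters k as the termination condition:

```python
def agglomerative(points, k, dist):
    """Bottom-up single-link clustering: start with singletons and merge the
    two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # each object starts in its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]  # merge the two most similar clusters
        del clusters[j]
    return clusters
```

Swapping the `min` for a `max` over member pairs would give complete-link clustering instead; the overall merge loop is unchanged.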
[Figure: agglomerative vs. divisive clustering of objects a, b, c, d, e — agglomerative (Steps 0–4) merges {a, b} and {d, e}, then {c, d, e}, and finally {a, b, c, d, e}; divisive (Steps 4–0) performs the same splits in reverse]
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:
• Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure).
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree.
Clustering feature:
• a summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
• registers the crucial measurements for computing clusters and uses storage efficiently
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering:
• a nonleaf node in the tree has descendants or "children"
• the nonleaf nodes store the sums of the CFs of their children
A CF tree has two parameters:
• branching factor: the maximum number of children per nonleaf node
• threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
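Concretely, the clustering feature is the triple CF = (N, LS, SS) — the count, linear sum, and square sum of the points — and it is additive, which is what lets nonleaf nodes store the sums of their children's CFs. A small sketch (illustrative, not BIRCH's actual implementation):

```python
def cf(points):
    """Clustering Feature of a set of d-dimensional points: (N, LS, SS)."""
    n = len(points)
    d = len(points[0])
    ls = [sum(p[i] for p in points) for i in range(d)]  # linear sum (1st moment)
    ss = sum(x * x for p in points for x in p)          # square sum (2nd moment)
    return n, ls, ss

def merge(cf1, cf2):
    """CF additivity: the CF of a union of subclusters is the component-wise sum."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2

def centroid(c):
    """Cluster centroid LS / N, computable from the CF alone."""
    n, ls, _ = c
    return [x / n for x in ls]

def radius(c):
    """Root-mean-square distance of members from the centroid, from the CF alone."""
    n, ls, ss = c
    return (ss / n - sum((x / n) ** 2 for x in ls)) ** 0.5
```

The threshold parameter is enforced with exactly such a radius/diameter check before a new point is absorbed into a leaf entry.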
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6 — the root and nonleaf nodes hold CF entries (CF1, CF2, …) with child pointers, and the leaf nodes hold CF entries chained by prev/next pointers]
ROCK: RObust Clustering using linKs
Major idea:
• use links to measure similarity/proximity
Links: the number of common neighbors
• C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
• C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
• link(T1, T2) = 4, since they have 4 common neighbors: {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
• link(T1, T3) = 3, since they have 3 common neighbors: {a, b, d}, {a, b, e}, {a, b, g}
Thus link is a better measure than the Jaccard coefficient: Jaccard rates T1 and T3 as more similar (2 of 4 items shared) than T1 and T2 (1 of 5), even though T1 and T2 come from the same cluster C1.
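The link counts above can be reproduced with a short sketch. It is illustrative only: it takes two transactions to be neighbors when their Jaccard coefficient is at least θ = 0.5, and, to match the slide's counts, does not treat T1 or T2 as its own neighbor:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def link(t1, t2, universe, theta=0.5):
    """link(T1, T2) = number of common neighbors: points in the universe whose
    Jaccard similarity to both T1 and T2 is >= theta (T1, T2 themselves excluded)."""
    return sum(1 for x in universe
               if x != t1 and x != t2
               and jaccard(x, t1) >= theta and jaccard(x, t2) >= theta)

# the example's points: all 3-subsets of C1 = {a, b, c, d, e} and C2 = {a, b, f, g}
pts = [frozenset(c) for c in combinations("abcde", 3)]
pts += [frozenset(c) for c in combinations("abfg", 3)]
```

With these points, `link(T1, T2)` counts the four common neighbors containing c plus one of {a, b} and one of {d, e}, while `link(T1, T3)` counts the three remaining sets containing both a and b.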
Measures similarity based on a dynamic model:
• two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters
A two-phase algorithm:
1. use a graph-partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters
2. use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters
[Figure: the two-phase framework — Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters]
THANK YOU