Source: web.iitd.ac.in/~bspanda/graphclustering.pdf

Date post: 18-Aug-2018

Introduction to Graph Cluster Analysis

Outline

• Introduction to Clustering

• Introduction to Graph Clustering

• Algorithms for Graph Clustering

– k-Spanning Tree

– Shared Nearest Neighbor

– Betweenness Centrality Based

– Highly Connected Components

– Maximal Clique Enumeration

– Kernel k-means

• Application

What is Cluster Analysis?

The process of dividing a set of input data into (possibly overlapping) subsets, where the elements in each subset are considered related by some similarity measure

[Figure: example partitions into 2 clusters and 3 clusters]

What is Graph Clustering?

• Types

– Between-graph: clustering a set of graphs

– Within-graph: clustering the nodes/edges of a single graph

Between-graph Clustering

Between-graph clustering methods divide a set of graphs into different clusters

E.g., A set of graphs representing chemical compounds can be grouped into clusters based on their structural similarity

Within-graph Clustering

Within-graph clustering methods divide the nodes of a graph into clusters

E.g., In a social networking graph, these clusters could represent people with the same or similar hobbies

Note: In this lecture we will look at different algorithms to perform within-graph clustering

Graph-Based Clustering

• Graph-Based clustering uses the proximity graph

– Start with the proximity matrix

– Consider each point as a node in a graph

– Each edge between two nodes has a weight which is the proximity between the two points

– Initially the proximity graph is fully connected

– MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph

• In the simplest case, clusters are connected components in the graph.

Graph-Based Clustering: Sparsification

• The amount of data that needs to be processed is drastically reduced

– Sparsification can eliminate more than 99% of the entries in a proximity matrix

– The amount of time required to cluster the data is drastically reduced

– The size of the problems that can be handled is increased

Graph-Based Clustering: Sparsification …

• Clustering may work better

– Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.

– The nearest neighbors of a point tend to belong to the same class as the point itself.

– This reduces the impact of noise and outliers and sharpens the distinction between clusters.

• Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning algorithms)

– Chameleon and Hypergraph-based Clustering
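The sparsification idea above can be sketched in a few lines. This is only an illustration, assuming a dense similarity matrix given as a list of lists; the function name `sparsify_knn` and the sample matrix are hypothetical and not part of the slides.

```python
def sparsify_knn(sim, k):
    """Keep, for each point, only the links to its k most similar other points."""
    n = len(sim)
    kept = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Rank the other points by similarity to point i, most similar first.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        for j in others[:k]:
            kept[i][j] = sim[i][j]
    return kept


# Hypothetical similarity matrix: points 0-2 are similar, point 3 is far away.
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
sparse = sparsify_knn(sim, k=2)
```

Note that the result need not be symmetric: point 3 keeps a link to point 2, but point 2 does not keep one back to point 3.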

[Figure: Sparsification in the Clustering Process]

Minimum Spanning Tree based Clustering

[Figure: a minimum spanning tree on vertices 1–5 and the value k are passed to the k-Spanning Tree step, which outputs k groups of non-overlapping vertices]

STEPS:

• Obtain the Minimum Spanning Tree (MST) of the input graph G

• Remove the k-1 heaviest edges from the MST

• The result is k clusters
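A minimal sketch of these steps, assuming edge weights are distances, vertices are numbered 0..n-1, and the graph is connected. Kruskal's algorithm is used for the MST here as one possible choice; the slides do not prescribe one. The helper names and the sample edges are hypothetical.

```python
class DisjointSet:
    """Union-find over vertices 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def minimum_spanning_tree(n, edges):
    """Kruskal: scan edges (weight, u, v) by increasing weight, skip cycles."""
    ds = DisjointSet(n)
    return [(w, u, v) for w, u, v in sorted(edges) if ds.union(u, v)]


def k_spanning_tree(n, edges, k):
    mst = minimum_spanning_tree(n, edges)
    # Remove the k-1 heaviest MST edges, keep the lighter ones.
    kept = sorted(mst)[: max(0, len(mst) - (k - 1))]
    ds = DisjointSet(n)
    for _, u, v in kept:
        ds.union(u, v)
    groups = {}
    for v in range(n):
        groups.setdefault(ds.find(v), []).append(v)
    return sorted(groups.values())


# Hypothetical weighted graph on 5 vertices: (weight, u, v).
edges = [(1, 0, 1), (2, 1, 2), (7, 2, 3), (1, 3, 4), (8, 0, 4), (9, 1, 3)]
clusters = k_spanning_tree(5, edges, k=3)
```

With k=3 the two heaviest MST edges (weights 7 and 2) are removed, leaving three groups.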

What is a Spanning Tree?

A connected subgraph with no cycles that includes all vertices in the graph

[Figure: graph G on vertices 1–5 and one of its spanning trees, with total weight 17]

Note: An edge weight can represent either the distance or the similarity between its two vertices

What is a Minimum Spanning Tree (MST)?

The spanning tree of a graph with the minimum possible sum of edge weights, if the edge weights represent distance

Note: maximum possible sum of edge weights, if the edge weights represent similarity

[Figure: graph G on vertices 1–5 and three of its spanning trees, with weights 11, 13 and 17]

k-Spanning Tree

Remove the k-1 edges with the highest weight from the Minimum Spanning Tree

Note: k is the number of clusters

[Figure: e.g., with k=3, removing the 2 highest-weight edges from the minimum spanning tree on vertices 1–5 leaves 3 clusters]

Shared Nearest Neighbor Clustering

[Figure: the Shared Nearest Neighbor (SNN) graph on vertices 0–4 and a threshold τ are passed to Shared Nearest Neighbor Clustering, which outputs groups of non-overlapping vertices]

STEPS:

• Obtain the Shared Nearest Neighbor (SNN) graph of the input graph G

• Remove edges from the SNN graph with weight less than τ

What is Shared Nearest Neighbor?

Shared Nearest Neighbor is a proximity measure and denotes the number of neighbor nodes common between any given pair of nodes (u, v)

Shared Nearest Neighbor (SNN) Graph

Given input graph G, weight each edge (u,v) with the number of shared nearest neighbors between u and v

E.g., Node 0 and Node 1 have 2 neighbors in common: Node 2 and Node 3

[Figure: input graph G on vertices 0–4 and its SNN graph with edge weights]
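The weighting rule can be sketched as follows, assuming the input graph is given as a dictionary that maps each vertex to its set of neighbors. The adjacency below is a hypothetical example (the edge list of the slide's figure is not recoverable), chosen so that Node 0 and Node 1 share Node 2 and Node 3.

```python
def snn_graph(adj):
    """Weight each edge (u, v), u < v, with the number of neighbors u and v share."""
    weights = {}
    for u in adj:
        for v in adj[u]:
            if u < v:
                weights[(u, v)] = len(adj[u] & adj[v])
    return weights


# Hypothetical input graph G: vertex -> set of neighbors.
adj = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}
weights = snn_graph(adj)
```

In this example edge (3, 4) gets weight 0 because vertex 4 has no neighbor other than 3.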

Shared Nearest Neighbor Clustering: Jarvis-Patrick Algorithm

If u and v share more than τ neighbors, place them in the same cluster

E.g., τ = 3

[Figure: SNN graph of input graph G on vertices 0–4 with edge weights, and the clusters obtained from it]
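As a rough sketch of this rule, the function below takes SNN edge weights, keeps only edges whose weight exceeds τ, and reports the connected components as clusters. The weights are hypothetical, and the strict "more than τ" comparison follows this slide; the earlier STEPS slide instead keeps edges with weight of at least τ.

```python
def threshold_clusters(vertices, weights, tau):
    """Group vertices that are joined by edges whose SNN weight exceeds tau."""
    neighbors = {v: set() for v in vertices}
    for (u, v), w in weights.items():
        if w > tau:
            neighbors[u].add(v)
            neighbors[v].add(u)
    seen, clusters = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        seen.add(start)
        stack, component = [start], []
        while stack:
            x = stack.pop()
            component.append(x)
            for y in neighbors[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        clusters.append(sorted(component))
    return clusters


# Hypothetical SNN weights on vertices 0-4.
snn_weights = {(0, 1): 3, (0, 2): 1, (1, 2): 2, (2, 3): 1, (3, 4): 3}
clusters = threshold_clusters(range(5), snn_weights, tau=2)
```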

What is Betweenness Centrality?

Betweenness centrality quantifies the degree to which a vertex (or edge) occurs on the shortest paths between all the other pairs of nodes

Two types:

– Vertex Betweenness

– Edge Betweenness

Vertex Betweenness

The number of shortest paths in the graph G that pass through a given node S

E.g., Sharon is likely a liaison between NCSU and DUKE and hence many connections between DUKE and NCSU pass through Sharon

Edge Betweenness

The number of shortest paths in the graph G that pass through given edge (S, B)

E.g., Sharon and Bob both study at NCSU and they are the only link between the NY DANCE and CISCO groups

Vertices and edges with high betweenness form good starting points to identify clusters

Vertex Betweenness Clustering

Given input graph G:

1. Compute the betweenness for each vertex

2. Select the vertex v with the highest betweenness (e.g., vertex 3 with value 0.67)

3. Disconnect the graph at the selected vertex (e.g., vertex 3) and copy the vertex to both components

4. Repeat until the highest vertex betweenness ≤ μ

Edge-Betweenness Clustering: Girvan and Newman Algorithm

Given input graph G:

1. Compute the betweenness for each edge

2. Select the edge with the highest betweenness (e.g., edge (3,4) with value 0.571)

3. Disconnect the graph at the selected edge (e.g., (3,4))

4. Repeat until the highest edge betweenness ≤ μ
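The loop above can be sketched for an unweighted, undirected graph. The betweenness routine below uses Brandes-style accumulation over breadth-first searches, which is one standard way to count shortest paths; it returns unnormalized counts, so μ here is on a raw scale rather than the normalized values shown on the slide. The function names and the sample graph are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Unnormalized edge betweenness of an undirected, unweighted graph."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {v: 0 for v in adj}          # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):            # accumulate from the farthest vertex back
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    # Each unordered pair was counted once from each endpoint.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    seen, comps = set(), []
    for s in sorted(adj):
        if s in seen:
            continue
        seen.add(s)
        stack, comp = [s], []
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        comps.append(sorted(comp))
    return comps


def girvan_newman(adj, mu):
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    while True:
        bet = edge_betweenness(adj)
        if not bet:
            break
        edge = max(bet, key=bet.get)
        if bet[edge] <= mu:
            break
        u, v = tuple(edge)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical graph: two triangles joined by the bridge (2, 3).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
clusters = girvan_newman(graph, mu=5)
```

In this example the bridge carries all nine shortest paths between the two triangles, so it is removed first; afterwards no edge exceeds μ and the loop stops.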

What is a Highly Connected Subgraph?

• Requires the following definitions

– Cut

– Minimum Edge Cut (MinCut)

– Edge Connectivity (EC)

Cut

• A set of edges whose removal disconnects a graph

[Figure: a graph on vertices 0–8 shown with two example cuts]

Cut = {(0,1),(1,2),(1,3)}

Cut = {(3,5),(4,2)}

Minimum Cut

A smallest set of edges whose removal disconnects a graph

[Figure: the graph on vertices 0–8 before and after removing its minimum cut]

MinCut = {(3,5),(4,2)}

Edge Connectivity (EC)

• Minimum NUMBER of edges whose removal will disconnect a graph

[Figure: the graph on vertices 0–8 with its minimum cut marked]

MinCut = {(3,5),(4,2)}

EC = |MinCut| = |{(3,5),(4,2)}| = 2

Highly Connected Subgraph (HCS)

A graph G = (V,E) is highly connected if EC(G) > |V|/2

E.g., for the graph G on vertices 0–8, EC(G) = 2 and |V|/2 = 9/2, so EC(G) > |V|/2 does not hold

G is NOT a highly connected subgraph

HCS Clustering

Given input graph G:

1. Find the minimum cut MinCut(G) (e.g., {(3,5),(4,2)})

2. Is EC(G) > |V|/2?

– YES: return G

– NO: divide G into G1 and G2 using MinCut, then process graph G1 and graph G2 in the same way
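The recursion can be illustrated with a deliberately naive sketch. The minimum cut is found by brute force over all two-way splits of the vertices, which is exponential and only suitable for tiny examples; a practical implementation would use a proper minimum-cut algorithm. The function names and the sample graph are hypothetical.

```python
from itertools import combinations


def min_cut(vertices, edges):
    """Brute-force minimum edge cut: try every split of the vertices in two."""
    vertices = sorted(vertices)
    first, rest = vertices[0], vertices[1:]
    best_cut, best_side = None, None
    # The side holding `first` takes r extra vertices; the other side stays non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            side = {first, *extra}
            cut = [(u, v) for u, v in edges if (u in side) != (v in side)]
            if best_cut is None or len(cut) < len(best_cut):
                best_cut, best_side = cut, side
    return best_cut, best_side


def hcs(vertices, edges):
    """Return highly connected subgraphs as sorted vertex lists."""
    vertices = set(vertices)
    if len(vertices) < 2:
        return [sorted(vertices)]
    cut, side = min_cut(vertices, edges)
    if len(cut) > len(vertices) / 2:      # EC(G) > |V|/2: G is highly connected
        return [sorted(vertices)]
    other = vertices - side
    edges_1 = [(u, v) for u, v in edges if u in side and v in side]
    edges_2 = [(u, v) for u, v in edges if u in other and v in other]
    return hcs(side, edges_1) + hcs(other, edges_2)


# Hypothetical graph: two 4-vertex cliques joined by the single edge (3, 4).
clique_a = list(combinations(range(0, 4), 2))
clique_b = list(combinations(range(4, 8), 2))
edges = clique_a + clique_b + [(3, 4)]
clusters = hcs(range(8), edges)
```

Here the whole graph has edge connectivity 1, which is not greater than 8/2, so it is split at the bridge; each 4-clique has edge connectivity 3 > 4/2 and is returned as a cluster.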

What is a Clique?

A subgraph C of graph G with edges between all pairs of nodes

[Figure: graph G on vertices 4–8 and a clique C on vertices 5, 6 and 7]

What is a Maximal Clique?

A maximal clique is a clique that is not part of a larger clique.

[Figure: in the graph on vertices 4–8, vertices 5, 6 and 7 form a clique, and vertices 5, 6, 7 and 8 form a maximal clique]

Maximal Clique Enumeration: Bron and Kerbosch Algorithm

Input: graph G

BK(C,P,N)

C – vertices in current clique

P – vertices that can be added to C

N – vertices that cannot be added to C

Condition: if both P and N are empty, output C as a maximal clique
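A compact sketch of the BK(C,P,N) recursion, in its basic form without pivoting. The graph is assumed to be a dictionary from each vertex to its set of neighbors; the sample graph is hypothetical and only loosely modeled on the vertices 4–8 of the clique slides.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def bk(C, P, N):
        if not P and not N:
            cliques.append(sorted(C))     # nothing left to add: C is maximal
            return
        for v in list(P):
            # Extend C by v; only common neighbors of v remain candidates.
            bk(C | {v}, P & adj[v], N & adj[v])
            P = P - {v}                   # v has been fully explored
            N = N | {v}                   # so it may not be added again

    bk(set(), set(adj), set())
    return sorted(cliques)


# Hypothetical graph: 5, 6, 7, 8 are pairwise adjacent; 4 is attached to 5 only.
adj = {
    4: {5},
    5: {4, 6, 7, 8},
    6: {5, 7, 8},
    7: {5, 6, 8},
    8: {5, 6, 7},
}
cliques = maximal_cliques(adj)
```

The clique {5, 6, 7} is not reported on its own because it is contained in the larger clique {5, 6, 7, 8}.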

What is k-means?

• k-means is a clustering algorithm applied to vector data points

• k-means recap:

– Select k data points from input as centroids

1. Assign other data points to the nearest centroid

2. Recompute centroid for each cluster

3. Repeat Steps 1 and 2 until centroids don’t change

k-means on Graphs: Kernel k-means

• Basic algorithm is the same as k-means on Vector data

• We utilize the “kernel trick”

• “kernel trick” recap

– We know that we can use within-graph kernel functions to calculate the inner product of a pair of vertices in a user-defined feature space.

– We replace the standard distance/proximity measures used in k-means with this within-graph kernel function
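To make the replacement concrete: with a kernel matrix K, the squared feature-space distance from vertex i to the centroid of cluster c can be written using kernel values only, as K[i][i] - (2/|c|) Σ_j K[i][j] + (1/|c|²) Σ_j Σ_l K[j][l], with j and l ranging over the members of c. The sketch below applies that formula. It assumes the kernel matrix is already computed and starts from an initial labeling rather than from k selected data points; both are simplifications, and the kernel values are hypothetical.

```python
def kernel_kmeans(K, labels, k, max_iter=100):
    """Kernel k-means on a precomputed kernel matrix K (list of lists)."""
    n = len(K)
    labels = list(labels)
    for _ in range(max_iter):
        members = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        # Centroid self-similarity term, one value per cluster.
        within = [
            sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
            for m in members
        ]
        new_labels = []
        for i in range(n):
            best_c, best_d = labels[i], float("inf")
            for c, m in enumerate(members):
                if not m:
                    continue              # skip empty clusters
                d = K[i][i] - 2.0 * sum(K[i][j] for j in m) / len(m) + within[c]
                if d < best_d:
                    best_c, best_d = c, d
            new_labels.append(best_c)
        if new_labels == labels:          # assignments are stable: stop
            break
        labels = new_labels
    return labels


# Hypothetical within-graph kernel: vertices 0,1 are similar, as are 2,3.
K = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
result = kernel_kmeans(K, labels=[0, 0, 0, 1], k=2)
```

Like ordinary k-means, this can stop in a local optimum, so the outcome depends on the initial labeling.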

Application

• Functional modules in protein-protein interaction networks

• Subgraphs with pair-wise interacting nodes => Maximal cliques

Chameleon: Clustering Using Dynamic Modeling

• Adapt to the characteristics of the data set to find the natural clusters

• Use a dynamic model to measure the similarity between clusters

– Main property is the relative closeness and relative inter-connectivity of the clusters

– Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters

– The merging scheme preserves self-similarity

• One of the areas of application is spatial data

Characteristics of Spatial Data Sets

• Clusters are defined as densely populated regions of the space

• Clusters have arbitrary shapes, orientation, and non-uniform sizes

• Difference in densities across clusters and variation in density within clusters

• Existence of special artifacts (streaks) and noise

The clustering algorithm must address the above characteristics and also require minimal supervision.

Chameleon: Steps

• Preprocessing Step: Represent the Data by a Graph

– Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors

– Concept of neighborhood is captured dynamically (even if region is sparse)

• Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices

– Each cluster should contain mostly points from one “true” cluster, i.e., is a sub-cluster of a “real” cluster

Chameleon: Steps …

• Phase 2: Use Hierarchical Agglomerative Clustering to merge sub-clusters

– Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters

– Two key properties used to model cluster similarity:

• Relative Interconnectivity: Absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters

• Relative Closeness: Absolute closeness of two clusters normalized by the internal closeness of the clusters
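The two properties are often written as follows. The notation is an assumption, since the slides give only the verbal definitions: EC(C_i, C_j) is the set of edges connecting clusters C_i and C_j, EC(C_i) is the set of edges in a minimum-cut bisection of C_i, |·| on an edge set denotes its total weight, and \bar{S} denotes the average weight of the edges in a set.

```latex
RI(C_i, C_j) = \frac{\left| EC(C_i, C_j) \right|}
                    {\tfrac{1}{2}\left( \left| EC(C_i) \right| + \left| EC(C_j) \right| \right)}

RC(C_i, C_j) = \frac{\bar{S}_{EC(C_i, C_j)}}
                    {\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC(C_i)}
                     + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC(C_j)}}
```

In both ratios the numerator is the absolute quantity between the two clusters and the denominator is the corresponding internal quantity, which matches the "normalized by" wording above.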

Experimental Results

[Figures: CHAMELEON (two slides), CURE (10 clusters), CURE (15 clusters), CHAMELEON]

Shared Near Neighbor Approach

SNN graph: the weight of an edge is the number of shared neighbors between vertices given that the vertices are connected

[Figure: two connected vertices i and j whose edge receives SNN weight 4]

Creating the SNN Graph

Sparse Graph: link weights are similarities between neighboring points

Shared Near Neighbor Graph: link weights are the number of shared nearest neighbors

ROCK (RObust Clustering using linKs)

• Clustering algorithm for data with categorical and Boolean attributes

– A pair of points is defined to be neighbors if their similarity is greater than some threshold

– Use a hierarchical clustering scheme to cluster the data.

1. Obtain a sample of points from the data set

2. Compute the link value for each set of points, i.e., transform the original similarities (computed by Jaccard coefficient) into similarities that reflect the number of shared neighbors between points

3. Perform an agglomerative hierarchical clustering on the data using the “number of shared neighbors” as similarity measure and maximizing “the shared neighbors” objective function

4. Assign the remaining points to the clusters that have been found
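Step 2 can be sketched as follows: compute Jaccard similarities, call two points neighbors when the similarity exceeds a threshold, and count shared neighbors as the link value. The sketch excludes a point from its own neighbor set, which is a simplifying assumption; the name `theta` and the sample records are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets of categorical items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rock_links(points, theta):
    """Return neighbor sets and link counts (number of shared neighbors) per pair."""
    n = len(points)
    neighbors = [
        {j for j in range(n) if j != i and jaccard(points[i], points[j]) > theta}
        for i in range(n)
    ]
    links = {
        (i, j): len(neighbors[i] & neighbors[j])
        for i in range(n)
        for j in range(i + 1, n)
    }
    return neighbors, links


# Hypothetical categorical records (e.g., market baskets).
records = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y"}]
neighbors, links = rock_links(records, theta=0.4)
```

The link values would then drive the agglomerative merging in step 3, which is not shown here.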

Jarvis-Patrick Clustering

• First, the k-nearest neighbors of all points are found

– In graph terms this can be regarded as breaking all but the k strongest links from a point to other points in the proximity graph

• A pair of points is put in the same cluster if

– the two points share more than T neighbors and

– the two points are in each other's k nearest neighbor lists

• For instance, we might choose a nearest neighbor list of size 20 and put points in the same cluster if they share more than 10 near neighbors

• Jarvis-Patrick clustering is too brittle: a small change in the threshold T can change whether clusters are merged
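A small sketch of these rules for points in the plane, assuming Euclidean distance and neighbor lists that exclude the point itself. The function name and the sample points are hypothetical.

```python
def jarvis_patrick(points, k, t):
    """Merge points that are mutual k-nearest neighbors and share more than t neighbors."""
    n = len(points)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # k-nearest-neighbor list of every point.
    knn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist2(points[i], points[j]))
        knn.append(set(order[:k]))

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = i in knn[j] and j in knn[i]
            if mutual and len(knn[i] & knn[j]) > t:
                parent[find(i)] = find(j)

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


# Hypothetical data: two well-separated squares of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
clusters = jarvis_patrick(points, k=3, t=1)
```

Raising t to 2 in this example leaves every point in its own cluster, since mutual neighbors share exactly two neighbors here; this is a small instance of the sensitivity to T noted above.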

When Jarvis-Patrick Works Reasonably Well

[Figure: original points, and their Jarvis-Patrick clustering using 6 shared neighbors out of 20]

When Jarvis-Patrick Does NOT Work Well

[Figure: clustering with the smallest threshold, T, that does not merge clusters, and clustering with a threshold of T - 1]

SNN Clustering Algorithm

1. Compute the similarity matrix

This corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points

2. Sparsify the similarity matrix by keeping only the k most similar neighbors

This corresponds to only keeping the k strongest links of the similarity graph

3. Construct the shared nearest neighbor graph from the sparsified similarity matrix

At this point, we could apply a similarity threshold and find the connected components to obtain the clusters (Jarvis-Patrick algorithm)

4. Find the SNN density of each point

Using a user-specified parameter, Eps, find the number of points that have an SNN similarity of Eps or greater to each point. This is the SNN density of the point

SNN Clustering Algorithm …

5. Find the core points

Using a user-specified parameter, MinPts, find the core points, i.e., all points that have an SNN density greater than MinPts

6. Form clusters from the core points

If two core points are within a radius, Eps, of each other, they are placed in the same cluster

7. Discard all noise points

All non-core points that are not within a radius of Eps of a core point are discarded

8. Assign all non-noise, non-core points to clusters

This can be done by assigning such points to the nearest core point

(Note that steps 4-8 are DBSCAN)
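Steps 3 to 8 can be sketched as below, assuming steps 1 and 2 have already produced a k-nearest-neighbor set for every point. "Within a radius Eps" is read here as "SNN similarity of at least Eps", and SNN similarity is taken to be zero unless the two points are in each other's lists; both readings are assumptions. The input data is hypothetical.

```python
def snn_density_clusters(knn, eps, min_pts):
    """knn[i] is the set of k nearest neighbors of point i. Returns labels; None = noise."""
    n = len(knn)

    def snn(i, j):
        # Step 3: shared neighbors, counted only for mutually listed points.
        return len(knn[i] & knn[j]) if i in knn[j] and j in knn[i] else 0

    # Step 4: SNN density = number of points with SNN similarity >= eps.
    close = [{j for j in range(n) if j != i and snn(i, j) >= eps} for i in range(n)]
    # Step 5: core points have density greater than min_pts.
    core = {i for i in range(n) if len(close[i]) > min_pts}

    # Step 6: core points that are close to each other share a cluster.
    label = [None] * n
    next_label = 0
    for c in sorted(core):
        if label[c] is not None:
            continue
        label[c] = next_label
        stack = [c]
        while stack:
            x = stack.pop()
            for y in close[x]:
                if y in core and label[y] is None:
                    label[y] = next_label
                    stack.append(y)
        next_label += 1

    # Steps 7-8: non-core points join their most similar core point, else stay noise.
    for i in range(n):
        if label[i] is None:
            candidates = [c for c in close[i] if c in core]
            if candidates:
                label[i] = label[max(candidates, key=lambda c: snn(i, c))]
    return label


# Hypothetical neighbor lists: two tight groups of four, plus an outlier (point 8)
# that lists others as neighbors but appears in nobody's list.
knn = [
    {1, 2, 3}, {0, 2, 3}, {0, 1, 3}, {0, 1, 2},
    {5, 6, 7}, {4, 6, 7}, {4, 5, 7}, {4, 5, 6},
    {0, 1, 4},
]
labels = snn_density_clusters(knn, eps=2, min_pts=2)
```

Point 8 ends up with label None, which reflects the fact that this method does not cluster all the points.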

SNN Density

a) All Points b) High SNN Density

c) Medium SNN Density d) Low SNN Density

SNN Clustering Can Handle Differing Densities

Original Points SNN Clustering

SNN Clustering Can Handle Other Difficult Situations

Finding Clusters of Time Series In Spatio-Temporal Data

[Figure: 26 SLP Clusters via Shared Nearest Neighbor Clustering (100 NN, 1982-1994); axes: longitude, latitude. Caption: SNN Clusters of SLP.]

[Figure: SNN Density of SLP Time Series Data; axes: longitude, latitude. Caption: SNN Density of Points on the Globe.]

Features and Limitations of SNN Clustering

• Does not cluster all the points

• Complexity of SNN Clustering is high

– O(n * time to find the number of neighbors within Eps)

– In the worst case, this is O(n^2)

– For lower dimensions, there are more efficient ways to find the nearest neighbors

• R* Tree

• k-d Trees

