MOSAIC: A Proximity Graph Approach for Agglomerative Clustering

Jiyeon Choo, Rachsuda Jiamthapthaksin, Chun-shen Chen, Ulvi Celepcikay, Christian Guisti, and Christoph F. Eick

Department of Computer Science, University of Houston

Organization

1. Motivation
   • Scope of the research
      – Region Discovery
      – Traditional Clustering
   • Clustering with Plug-In Fitness Functions
   • Shape-aware Clustering Algorithms
   • Ideas of MOSAIC
2. Background
3. The MOSAIC Algorithm
4. Experimental Evaluation
5. Related Work
6. Conclusion and Future Work

Ch. Eick et al.: MOSAIC…, DaWaK, Regensburg 2007

1.1 Motivation: Examples of Region Discovery

RD-Algorithm

• Application 1: Hot-spot Discovery [EVDW06]
• Application 2: Find Interesting Regions with respect to a Continuous Variable
• Application 3: Find “representative” regions (Sampling)
• Application 4: Regional Co-location Mining
• Application 5: Regional Association Rule Mining [DEWY06]
• Application 6: Regional Association Rule Scoping [EDYKN07]

[Figure: wells in Texas. Green: safe well with respect to arsenic. Red: unsafe well. Panels shown for β = 1.01 and β = 1.04.]

Region Discovery Framework

The algorithms we currently investigate solve the following problem:

Given:
• A dataset $O$ with a schema $R$
• A distance function $d$ defined on instances of $R$
• A fitness function $q(X)$ that evaluates a clustering $X = \{c_1, \ldots, c_k\}$ as follows:

$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} \mathrm{interestingness}(c) \cdot \mathrm{size}(c)^{\beta}, \quad \beta > 1$$

Objective: Find $c_1, \ldots, c_k \subseteq O$ such that:
1. $c_i \cap c_j = \emptyset$ if $i \neq j$
2. $X = \{c_1, \ldots, c_k\}$ maximizes $q(X)$
3. All clusters $c_i \in X$ are contiguous
4. $c_1 \cup \ldots \cup c_k \subseteq O$
5. $c_1, \ldots, c_k$ are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
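The reward-based fitness function defined on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exponent on the cluster size is named `beta` in the code, clusters are represented as lists of object indices, and the `unsafe` labels and the `unsafe_excess` interestingness measure are hypothetical toy choices, not the measures used by the authors.

```python
from typing import Callable, Sequence

Cluster = Sequence[int]  # assumption: a cluster is a sequence of object indices


def reward(cluster: Cluster,
           interestingness: Callable[[Cluster], float],
           beta: float) -> float:
    """reward(c) = interestingness(c) * size(c) ** beta, with beta > 1."""
    if beta <= 1:
        raise ValueError("beta must be greater than 1")
    return interestingness(cluster) * len(cluster) ** beta


def fitness(clustering: Sequence[Cluster],
            interestingness: Callable[[Cluster], float],
            beta: float) -> float:
    """q(X): the sum of the rewards of all clusters in X."""
    return sum(reward(c, interestingness, beta) for c in clustering)


# Hypothetical data: 1 marks an unsafe well, 0 a safe one.
unsafe = [1, 1, 1, 0, 0, 1, 0, 0]


def unsafe_excess(cluster: Cluster) -> float:
    """Toy interestingness: share of unsafe wells above 50%, otherwise 0."""
    share = sum(unsafe[i] for i in cluster) / len(cluster)
    return max(0.0, share - 0.5)


clustering = [[0, 1, 2], [3, 4], [5, 6, 7]]
print(fitness(clustering, unsafe_excess, beta=1.01))
```

Because the exponent is greater than 1, one cluster of size 4 earns more reward than two equally interesting clusters of size 2, so the fitness function favors larger regions when interestingness is comparable.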

1.2 Clustering with Plug-In Fitness Functions

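[Figure: taxonomy of clustering algorithms by their use of a fitness function. Categories: No Fitness Function, Fixed Fitness Function, Implicit Fitness Function, Provides Plug-in Fitness Function. Algorithms placed in the diagram: DBSCAN, Hierarchical Clustering, K-Means, PAM, CHAMELEON, MOSAIC.]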

1.3 Shape-aware Clustering

• Shape is a significant characteristic in traditional clustering and region discovery

• Examples

Fig. 1: Some chain-like patterns in the Volcano dataset

Fig. 2: Arbitrarily shaped regions of high (low) arsenic concentration in Texas wells

1.4 Ideas Underlying MOSAIC

• MOSAIC provides a generic framework that integrates representative-based clustering, agglomerative clustering, and proximity graphs. It approximates arbitrary-shape clusters using unions of small convex polygons.

Fig. 6: An illustration of MOSAIC’s approach

(a) input (b) output

Talk Organization

1. Motivation
2. Background
   • Representative-based clustering
   • Agglomerative clustering
   • Proximity Graphs
3. The MOSAIC Algorithm
4. Experimental Evaluation
5. Related Work
6. Conclusion and Future Work

2.1 Representative-based Clustering

[Figure: clusters 1 to 4 in the Attribute1/Attribute2 plane.]

Objective: Find a set of objects $O_R$ such that the clustering $X$ obtained by using the objects in $O_R$ as representatives minimizes $q(X)$.

Properties: Cluster shapes are convex polygons

Popular Algorithms: K-means, K-medoids, SCEC
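How a set of representatives induces a clustering can be illustrated with a minimal sketch: each object joins the cluster of its nearest representative, so every cluster is the set of objects inside one Voronoi cell, which is why the cluster shapes are convex polygons. The function name `assign_to_representatives` and the sample points are hypothetical, and Euclidean distance is assumed.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def assign_to_representatives(objects: Sequence[Point],
                              representatives: Sequence[Point]) -> List[List[int]]:
    """Return one list of object indices per representative.

    Each object is assigned to its nearest representative (Euclidean distance).
    """
    clusters: List[List[int]] = [[] for _ in representatives]
    for idx, obj in enumerate(objects):
        nearest = min(range(len(representatives)),
                      key=lambda r: math.dist(obj, representatives[r]))
        clusters[nearest].append(idx)
    return clusters


objects = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
representatives = [(0.5, 0.0), (5.5, 5.0)]
print(assign_to_representatives(objects, representatives))
```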

2.2 MOSAIC and Agglomerative Clustering

Advantages of MOSAIC over traditional agglomerative clustering:

• Wider search: considers all neighboring clusters
• Plug-in fitness function
• Clusters are always contiguous
• The expensive algorithm is run for only 20-1000 iterations
• Highly generic algorithm

2.3 Proximity Graphs

• How to identify neighboring clusters for representative-based clustering algorithms?

• Proximity graphs provide various definitions of “neighbour”

NNG ⊆ MST ⊆ RNG ⊆ GG ⊆ DT

NNG = Nearest Neighbour Graph

MST = Minimum Spanning Tree

RNG = Relative Neighbourhood Graph

GG = Gabriel Graph

DT = Delaunay Triangulation (neighbours of a 1NN-classifier)

Proximity Graphs: Delaunay

• The Delaunay Triangulation is the dual of the Voronoi diagram

• Three points are each other's neighbours if their tangent (circumscribed) sphere contains no other points

• Complete: captures all neighbouring clusters

• Expensive to compute in high dimensions

Proximity Graphs: Gabriel

• The Gabriel graph is a subset of the Delaunay Triangulation (some decision boundary might be missed)

• Points are neighbours only if their (diametral) sphere of influence is empty

• Can be computed more efficiently: $O(k^3)$
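The cubic bound corresponds to the straightforward construction: for each of the pairs of points, check every other point against the diametral sphere of the pair. The sketch below is a minimal brute-force version under that assumption, not the authors' implementation; the names `sq_dist` and `gabriel_edges` are hypothetical. A point r lies strictly inside the sphere with diameter (p, q) exactly when d(p,r)² + d(q,r)² < d(p,q)².

```python
from itertools import combinations
from typing import Sequence, Set, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between p and q."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def gabriel_edges(points: Sequence[Point]) -> Set[Tuple[int, int]]:
    """Brute-force Gabriel graph: all pairs times all other points."""
    edges = set()
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        # (i, j) is an edge only if no third point is strictly inside
        # the sphere whose diameter is the segment from i to j.
        if all(sq_dist(points[i], points[r]) + sq_dist(points[j], points[r]) >= d_ij
               for r in range(len(points)) if r not in (i, j)):
            edges.add((i, j))
    return edges


# The third point sits almost on the segment between the first two,
# so it blocks the edge (0, 1).
representatives = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1)]
print(sorted(gabriel_edges(representatives)))
```

In MOSAIC the points would be the cluster representatives, so the number of points is k, the number of initial clusters.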

3. MOSAIC

Fig. 10: Gabriel graph for clusters generated by a representative-based clustering algorithm

Pseudo Code MOSAIC

1. Run a representative-based clustering algorithm to create a large number of clusters.
2. Read the representatives of the obtained clusters.
3. Create a merge-candidate relation using proximity graphs.
4. WHILE there are merge-candidates (Ci, Cj) left
   BEGIN
      Merge the pair of merge-candidates (Ci, Cj) that enhances fitness function q the most into a new cluster C'
      Update merge-candidates: ∀C: Merge-Candidate(C', C) ⇔ Merge-Candidate(Ci, C) ∨ Merge-Candidate(Cj, C)
   END
   RETURN the best clustering X found.
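Step 4 of the pseudo code can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the authors' implementation: clusters are frozensets of objects, the merge-candidate relation is given as pairs of cluster indices (for example the edges of a Gabriel graph), the fitness function `q` is to be maximized and is recomputed from scratch for every candidate, and the names `mosaic_merge` and `toy_q` are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Clustering = List[FrozenSet]


def mosaic_merge(initial_clusters: Iterable[Iterable],
                 merge_candidates: Iterable[Tuple[int, int]],
                 q: Callable[[Clustering], float]) -> Tuple[Clustering, float]:
    """Greedy merging along a merge-candidate relation; returns the best clustering seen."""
    clusters = {i: frozenset(c) for i, c in enumerate(initial_clusters)}
    candidates = {frozenset(pair) for pair in merge_candidates}
    best_clustering = list(clusters.values())
    best_fitness = q(best_clustering)
    next_id = len(clusters)

    while candidates:
        def merged_clustering(pair):
            i, j = tuple(pair)
            rest = [c for key, c in clusters.items() if key not in pair]
            return rest + [clusters[i] | clusters[j]]

        # Merge the candidate pair that yields the highest fitness.
        pair = max(candidates, key=lambda p: q(merged_clustering(p)))
        i, j = tuple(pair)
        clusters[next_id] = clusters.pop(i) | clusters.pop(j)

        # The new cluster inherits the merge candidates of both merged clusters.
        candidates = {
            frozenset((cand - pair) | {next_id}) if cand & pair else cand
            for cand in candidates
            if cand != pair
        }
        next_id += 1

        current = list(clusters.values())
        current_fitness = q(current)
        if current_fitness > best_fitness:
            best_clustering, best_fitness = current, current_fitness

    return best_clustering, best_fitness


def toy_q(clustering: Clustering) -> float:
    """Hypothetical fitness on 1-D data: penalize wide clusters and many clusters."""
    return -sum(max(c) - min(c) for c in clustering) - 3 * len(clustering)


initial = [{0, 1}, {2, 3}, {10, 11}, {13, 14}]
neighbours = [(0, 1), (1, 2), (2, 3)]  # chain of adjacent initial clusters
best, score = mosaic_merge(initial, neighbours, toy_q)
print(best, score)
```

As in the pseudo code, merging continues until no merge candidates are left, even when a merge lowers the fitness; the best clustering encountered along the way is what gets returned.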

Complexity MOSAIC

Let
• $n$ be the number of objects in the dataset
• $k$ be the number of clusters returned by the representative-based algorithm

Complexity of MOSAIC: $O(k^3 + k^2 \cdot O(q(X)))$

Remarks:
• The above formula assumes that fitness is computed from scratch when a new clustering is obtained
• Lower complexities can be obtained by incrementally reusing the results of previous fitness computations
• Our current implementation assumes that only additive fitness functions are used
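For an additive fitness function, where q(X) is a sum of per-cluster rewards, the effect of a merge depends only on the two merged clusters, so a candidate can be scored without re-evaluating the whole clustering. The sketch below illustrates that idea under the additivity assumption; `merge_gain` and `toy_reward` are hypothetical names, and the reward used is a toy choice.

```python
from typing import Callable, FrozenSet


def merge_gain(reward: Callable[[FrozenSet], float],
               ci: FrozenSet, cj: FrozenSet) -> float:
    """Change in an additive fitness q when ci and cj are merged.

    q(X') = q(X) - reward(ci) - reward(cj) + reward(ci | cj)
    """
    return reward(ci | cj) - reward(ci) - reward(cj)


def toy_reward(cluster: FrozenSet) -> float:
    """Hypothetical reward: constant interestingness 1, size exponent 1.5."""
    return len(cluster) ** 1.5


clustering = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4})]
q_before = sum(toy_reward(c) for c in clustering)

# Incremental update: only three reward evaluations are needed.
q_after = q_before + merge_gain(toy_reward, clustering[0], clustering[1])

# Same value as recomputing the fitness of the merged clustering from scratch.
q_recomputed = sum(toy_reward(c) for c in [clustering[0] | clustering[1], clustering[2]])
print(q_after, q_recomputed)
```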

4. Experimental Evaluation for Traditional Clustering

• Compared MOSAIC with DBSCAN and K-means
• Used silhouette as q(X) when running MOSAIC; silhouette considers cohesion and separation (measured as the distance to the nearest cluster).
• Used the 9-Diamonds, Volcano, Diabetes, Ionosphere, and Vehicle datasets in the experimental evaluation
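For readers unfamiliar with the measure, the following sketch computes the standard mean silhouette coefficient of a clustering: for each object, a is the mean distance to the other members of its own cluster (cohesion), b is the smallest mean distance to the members of any other cluster (separation), and the score is (b - a) / max(a, b). This is the textbook definition with Euclidean distance; the exact variant used in the experiments may differ, and the function name and sample points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def silhouette(points: Sequence[Point], clusters: List[List[int]]) -> float:
    """Mean silhouette over all points; requires at least two clusters.

    clusters holds one list of point indices per cluster.
    """
    def mean_dist(i: int, members: List[int]) -> float:
        others = [m for m in members if m != i]
        return sum(math.dist(points[i], points[m]) for m in others) / len(others)

    scores = []
    for ci, members in enumerate(clusters):
        for i in members:
            if len(members) == 1:
                scores.append(0.0)  # common convention for singleton clusters
                continue
            a = mean_dist(i, members)
            b = min(mean_dist(i, other)
                    for cj, other in enumerate(clusters) if cj != ci)
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
compact = [[0, 1], [2, 3]]    # groups nearby points together
stretched = [[0, 2], [1, 3]]  # groups distant points together
print(silhouette(points, compact), silhouette(points, stretched))
```

Used as a plug-in fitness function, a higher value indicates a better clustering, so the compact grouping scores above the stretched one.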

Experimental Results

• Finding good parameter settings for DBSCAN turned out to be problematic for the 9-Diamonds and Volcano spatial datasets.

• Neither DBSCAN nor MOSAIC was able to identify all chain-like patterns in the Volcano dataset.

• We compared MOSAIC and K-means for the Ionosphere, Diabetes, and Vehicle high-dimensional datasets. Cluster quality was measured using Silhouette. MOSAIC outperformed K-means on these datasets.

Volcano Dataset Result: MOSAIC

Volcano Dataset Result: DBSCAN

Open Issues: What is a Good Fitness Function for Traditional Clustering?

• The use of plug-in fitness functions within traditional clustering algorithms is not very common.

• Using existing cluster evaluation measures, such as cohesion, separation, and silhouette, as fitness functions does not lead to very good clusterings when confronted with arbitrary-shape clusters [Choo07].

Question: Can we find better cluster evaluation measures or is finding good evaluation measures for traditional clustering a hopeless project?

5. Related Work

• CURE integrates a partitioning algorithm with an agglomerative hierarchical algorithm [GRS98].

• CHAMELEON [KHK99] provides a sophisticated two-phase clustering algorithm: a multilevel graph partitioning algorithm followed by an agglomerative clustering algorithm, both operating on a sparse k-nearest-neighbour graph.

Related Work Continued

• Lin and Zhong [LC02 and ZG03] propose hybrid clustering algorithms that combine representative-based clustering and agglomerative clustering methods.

• Surdeanu [STA05] proposes a hybrid clustering approach that combines an agglomerative clustering algorithm with the Expectation Maximization (EM) algorithm.

6. Conclusion

• A new clustering algorithm was introduced that approximates arbitrary shape clusters through unions of convex polygons

• The algorithm performs a wider search by considering “all” neighboring clusters as merge candidates. Gabriel graphs are used to determine neighboring clusters.

• The algorithm is generic in that it can be used with any initial merge-candidate relation, any fitness function, and any representative-based algorithm.

• MOSAIC can also be seen as a generalization of agglomerative grid-based clustering algorithms.

• We mainly use MOSAIC in the region discovery project mentioned earlier.

Future Work: Learn fitness function based on feedback

Idea: employ machine learning techniques to learn a fitness function from the feedback of a domain expert.

– Pros:
   – It provides a more adaptive approach, in which the fitness function can be tailored to the domain expert’s requirements.
   – The process of finding an appropriate fitness function is automatic.
– Cons:
   – Feature selection is non-trivial.
   – Learning the function is a difficult machine learning task.

