Clustering Attributed Multi-graphs with Information Ranking

http://linc.ucy.ac.cy Andreas Papadopoulos - [email protected] [DEXA 2015]

26th International Conference on Database and Expert Systems Applications

Sep. 1-4, 2015 Valencia, Spain

Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos


The Real World: Information Networks

[Figure: an information network whose edges are labeled Friendship or Coauthor]




Challenges

• Identify the importance of each edge-type/attribute property
  • For instance, when clustering a bibliography network:
  • Attribute ‘area of interest’ is important
  • Attributes ‘name’ and ‘gender’ may introduce noise and reduce the clustering accuracy

• Combine the attribute and structural vertex properties
  • Edges and attributes are of different types


Related Work

• Limited attention to the different importance of attributes/edge-types
  • Weights are mainly updated at each iteration

• Ignore the existence of multiple edge-types
  • Increases computational cost and complexity

• Spectral clustering is not used for clustering attributed graphs
  • Used to identify dense clusters in attribute subspaces

Model-Based
• BAGC [SIGMOD ‘12, TKDD ‘14]
• CESNA [ICDM ‘13]

Distance-Based
• SACluster [VLDB ‘09, TKDD ‘11]
• PICS [SDM ‘12]
• HASCOP [WI ‘13]


Proposed Approach: CAMIR

• Clustering Attributed Multi-graphs with Information Ranking: CAMIR

1. Rank edge-type and attribute properties

2. Construct a unified similarity matrix

3. Adopt a spectral clustering technique to generate the final clusters


Presentation Outline

• Motivation
• Problem Definition
• Related Work
• Background
• Proposed Approach: CAMIR
• Evaluation
• Summary




Background: Graph Partitioning

• An edge represents the similarity of the two connected vertices

• Find the minimum cut of a graph
  • Minimizes inter-cluster similarities
  • Identifies an optimal partitioning of the graph

• Identifying a minimum cut is computationally difficult
  • Efficient approximations using linear algebra
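
To make the cut objective concrete, here is a minimal Python sketch that sums the similarity crossing a partition. The toy matrix `W` and the function name `cut_weight` are illustrative assumptions and do not come from the slides.

```python
def cut_weight(W, labels):
    """Sum of similarities on edges whose endpoints lie in different clusters."""
    n = len(W)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # count each undirected edge once
            if labels[i] != labels[j]:
                total += W[i][j]
    return total

# Toy symmetric similarity matrix: two tight pairs joined by weak edges.
W = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.8],
    [0.0, 0.1, 0.8, 0.0],
]

good = cut_weight(W, [0, 0, 1, 1])  # severs only the two weak edges
bad = cut_weight(W, [0, 1, 0, 1])   # severs both strong edges
```

A partition with a smaller cut keeps similar vertices together, which is why `good` is preferred over `bad` here.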


Background: Spectral Clustering

• Based on the graph Laplacian, or Laplacian matrix

• Given a similarity matrix $S$ with diagonal degree matrix $D$, the normalized symmetric Laplacian $L$ is defined as $L = I - D^{-1/2} S D^{-1/2}$

• The eigenvectors corresponding to the top $k$ eigenvalues are the projection of the graph into $\mathbb{R}^{|V| \times k}$
  • Data is then easily separable into clusters, e.g. using k-means
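
The projection step can be sketched in Python with NumPy. This is a sketch under stated assumptions, not the authors' implementation: it assumes the common form L = I - D^{-1/2} S D^{-1/2} (consistent with the unit diagonal of the example Laplacian in this deck), reads "top k" as the k smallest eigenvalues of that L, and rescales each row to unit length. The function name `spectral_embedding` and the toy matrix are hypothetical.

```python
import numpy as np

def spectral_embedding(S, k):
    """Return a |V| x k matrix whose rows are the embedded vertices."""
    d = S.sum(axis=1)                       # vertex degrees
    d_inv_sqrt = 1.0 / np.sqrt(d)           # assumes no isolated vertices
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)    # L is symmetric; eigenvalues ascend
    U = eigvecs[:, :k]                      # eigenvectors of the k smallest eigenvalues
    return U / np.linalg.norm(U, axis=1, keepdims=True)  # unit-length rows (assumption)

# Toy graph: two triangles joined by one very weak edge (2 -- 3).
S = np.array([
    [0, 1, 1, 0,    0, 0],
    [1, 0, 1, 0,    0, 0],
    [1, 1, 0, 0.01, 0, 0],
    [0, 0, 0.01, 0, 1, 1],
    [0, 0, 0, 1,    0, 1],
    [0, 0, 0, 1,    1, 0],
], dtype=float)

E = spectral_embedding(S, 2)
```

Rows of `E` that belong to the same triangle end up nearly identical, while rows from different triangles are nearly orthogonal, which is what makes a simple method such as k-means sufficient afterwards.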


Background: Spectral Clustering

[Figure: example weighted graph with 10 vertices, shown with its adjacency matrix and its normalized Laplacian matrix; the matrix cells are not recoverable from the extraction]

Top 3 eigenvectors

Vertex   U1       U2       U3
1        -0.659   -0.705    0.263
2        -0.620    0.747    0.241
3        -0.595   -0.486   -0.640
4        -0.668    0.711   -0.221
5        -0.723    0.395    0.566
6        -0.669    0.414   -0.617
7        -0.332   -0.486   -0.808
8        -0.668    0.711   -0.221
9        -0.379   -0.491    0.784
10       -0.659   -0.705    0.263


How do we define the similarity matrix for an attributed multi-graph?


Background: Similarity Matrices

[Figure: example graph with vertex attribute values (IR, DM, AI) and weighted edges; panels labeled "Gaussian Kernel" and "Edges", each producing a similarity matrix in [0,1]^(N x N)]

• Result: (#Edge types + #Attributes) symmetric, non-negative similarity matrices

How do we efficiently combine the similarity matrices?
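
As an illustration of how one such similarity matrix might be built, the sketch below applies a Gaussian kernel to a list of numeric values. The slide names the kernel but shows neither its formula nor its bandwidth, so the standard form exp(-(x_i - x_j)^2 / (2σ²)), the parameter `sigma`, and the input values are all assumptions.

```python
import math

def gaussian_similarity(values, sigma=1.0):
    """Pairwise Gaussian-kernel similarity; entries lie in (0, 1], diagonal is 1."""
    n = len(values)
    return [
        [math.exp(-((values[i] - values[j]) ** 2) / (2.0 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]

# Hypothetical numeric property values for three vertices.
values = [5.0, 1.0, 7.0]
S = gaussian_similarity(values, sigma=2.0)
```

The result is symmetric and non-negative by construction, matching the properties the slide requires of every per-property similarity matrix.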




CAMIR Overview

1. Rank vertex properties and calculate their weights accordingly
  • By considering the agreement among vertex properties

2. Compute a unified similarity matrix
  • By combining all vertex properties based on their ranking

3. Generate the final clusters
  • By adopting a spectral clustering approach


Presentation Outline

• Motivation
• Problem Definition
• Related Work
• Background
• Proposed Approach: CAMIR
  1. Information Ranking
  2. Unified Similarity Matrix
  3. Generate the final clusters
• Evaluation
• Summary


Information Ranking

Rank attribute and edge-type properties: iteratively select the most informative property from the set of unranked properties

• Most informative property [NIPS ’11]:
  • Has the highest ‘agreement’ with the other properties
  • Properties ‘agree’ when they assign vertices the same cluster labels when used individually


Information Ranking

From the set of properties ($P$), the most informative property is $p$ [NIPS ‘11]

• The highest rank ($|P|$) is assigned to the most informative property
  • i.e. the one that best separates the vertices

• The lowest rank (1.0) is assigned to the property that is selected last
  • i.e. the one that does not ‘agree’ with the rest of the properties
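
The iterative selection can be sketched as a greedy loop. The agreement measure of [NIPS ‘11] is not reproduced in the slides, so this sketch substitutes a simple stand-in: the fraction of vertex pairs that two labelings treat alike (a Rand-index-style score). The function names, the input format (one cluster labeling per property) and the toy labelings are all hypothetical.

```python
from itertools import combinations

def pair_agreement(a, b):
    """Fraction of vertex pairs that two labelings treat alike (stand-in measure)."""
    pairs = list(combinations(range(len(a)), 2))
    same = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return same / len(pairs)

def rank_properties(labelings):
    """labelings: property name -> cluster labels obtained from that property alone."""
    unranked = set(labelings)
    ranks = {}
    rank = float(len(labelings))  # the first pick receives the highest rank |P|
    while unranked:
        def score(p):
            # Total agreement of p with the other still-unranked properties.
            others = [q for q in sorted(unranked) if q != p]
            return sum(pair_agreement(labelings[p], labelings[q]) for q in others)
        best = max(sorted(unranked), key=score)  # sorted() makes ties deterministic
        ranks[best] = rank
        unranked.remove(best)
        rank -= 1.0                              # the last pick ends with rank 1.0
    return ranks

labelings = {
    "area":     [0, 0, 0, 1, 1, 1],
    "coauthor": [0, 0, 0, 1, 1, 1],
    "gender":   [0, 1, 0, 1, 0, 1],  # disagrees with the other two
}
ranks = rank_properties(labelings)
```

In this toy input the noisy `gender` property is selected last and therefore receives the lowest rank, mirroring the bibliography example from the Challenges slide.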






Unified Similarity Matrix

• Combines the multiple edge-type and attribute properties with respect to the identified ranking

• Defined as the weighted sum of the individual similarity matrices

• Weights are defined by normalizing the rankings

• Contains all the similarity information about the network under study
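
A minimal sketch of the weighted sum follows. The slides do not say how the rankings are normalized, so dividing each rank by the sum of all ranks is an assumption, as are the function name `unified_similarity` and the toy 2 x 2 matrices.

```python
def unified_similarity(matrices, ranks):
    """Weighted sum of per-property similarity matrices.

    matrices: property name -> n x n similarity matrix (list of lists)
    ranks:    property name -> rank (higher means more informative)
    """
    total = sum(ranks.values())
    weights = {p: r / total for p, r in ranks.items()}  # assumed normalization; sums to 1
    n = len(next(iter(matrices.values())))
    unified = [[0.0] * n for _ in range(n)]
    for p, S in matrices.items():
        for i in range(n):
            for j in range(n):
                unified[i][j] += weights[p] * S[i][j]
    return unified, weights

ranks = {"area": 3.0, "coauthor": 2.0, "gender": 1.0}
matrices = {
    "area":     [[1.0, 0.8], [0.8, 1.0]],
    "coauthor": [[1.0, 0.5], [0.5, 1.0]],
    "gender":   [[1.0, 0.2], [0.2, 1.0]],
}
unified, weights = unified_similarity(matrices, ranks)
```

With these ranks the weights are 1/2, 1/3 and 1/6, so the off-diagonal entry becomes 0.5·0.8 + (1/3)·0.5 + (1/6)·0.2 = 0.6. Because the weights sum to 1, the unified matrix stays in [0,1] whenever its inputs do.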






Generating the Final Clusters

• Calculate the normalized Laplacian of the Unified Similarity Matrix

• Perform eigendecomposition

• Apply k-means to the eigenspace of the top k eigenvectors

• Generate the final clusters
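
The last two steps can be sketched with a small k-means (Lloyd's algorithm) over the rows of the eigenvector matrix. This is an illustrative sketch only: the hand-made embedding `E` stands in for the eigenvectors of the unified similarity matrix's Laplacian, and the deterministic farthest-point initialization is an assumption chosen for reproducibility.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    # Farthest-point initialization: start at row 0, then repeatedly add
    # the row that is farthest from all centers chosen so far.
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(dist))])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Assign every row to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Hypothetical 2-D spectral embedding with two well-separated groups of rows.
E = np.array([
    [0.71,  0.70], [0.70,  0.71], [0.72,  0.69],
    [0.70, -0.71], [0.71, -0.70], [0.69, -0.72],
])
labels = kmeans(E, 2)
```

Each distinct value in `labels` corresponds to one final cluster.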


CAMIR Clustering Process Diagram

Step 1. Identify importance of vertex properties
  • Properties ranking: iteratively select the most informative property

Step 2. Efficiently combine vertex properties
  • Unified Similarity Matrix: normalize rankings and compute the unified similarity matrix

Step 3. Cluster the attributed multi-graph
  • Generate the final clusters: apply spectral clustering to obtain Cluster 1, Cluster 2, …, Cluster k




Evaluation - Datasets

• Real-World Datasets
  • DBLP: Bibliography Networks
  • GoogleSP-23: Google Software Packages

Dataset                   DBLP-1K   DBLP-10K   GoogleSP-23
Nodes                     1 000     10 000     1 297
Edges                     17 128    65 734     268 956
Attributes                2         2          5
Edge Types                1         1          2
Total Vertex Properties   3         3          7

• Synthetic Datasets (two configurations)

                          Varying size                       Varying attributes
Nodes                     {100, 500, 1 000, 5 000, 10 000}   1 000
Edges                     {1 000 – 1 230 000}                ~ 40 000
Attributes                4                                  {2, 4, 8, 16, 32}
Edge Types                1                                  1
Total Vertex Properties   5                                  {3, 5, 9, 17, 33}


Evaluation Measures

• Entropy
  • Low entropy corresponds to high attribute homogeneity

• Normalized Mutual Information (NMI)
  • High NMI corresponds to high similarity between the resulting clustering and the ground truth
  • An NMI value of 1 indicates a perfect match

• Runtime
  • Quad-core i7 2.8 GHz, 8 GB RAM
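
Both quality measures can be sketched in a few lines of Python. The slides do not give the exact definitions used in the evaluation, so the choices here are assumptions: entropy is the size-weighted average of the attribute entropy inside each cluster, and NMI is mutual information divided by the geometric mean of the two label entropies. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def cluster_entropy(clusters, attribute):
    """Size-weighted average entropy of one attribute inside each cluster."""
    n = len(clusters)
    total = 0.0
    for c in set(clusters):
        members = [attribute[i] for i in range(n) if clusters[i] == c]
        total += len(members) / n * entropy(members)
    return total

def nmi(a, b):
    """Mutual information normalized by sqrt(H(a) * H(b)) (assumed variant)."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((nab / n) * math.log2(nab * n / (ca[x] * cb[y]))
             for (x, y), nab in cab.items())
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:
        return 1.0 if ha == hb else 0.0  # degenerate single-cluster case
    return mi / math.sqrt(ha * hb)

truth = [0, 0, 1, 1]
pred = [1, 1, 0, 0]                      # same grouping, different label names
area = ["AI", "AI", "DM", "DM"]

perfect = nmi(truth, pred)
homogeneous = cluster_entropy(pred, area)
mixed = cluster_entropy([0, 0, 0, 0], area)
```

NMI ignores how clusters are named, so `pred` matches `truth` perfectly. Putting all four vertices into one cluster mixes the two areas evenly and yields one bit of entropy.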


State-of-the-Art Competitors

• SACluster [VLDB 2009]
  • Similarity is defined as the random walk distance in the augmented graph

• BAGC [SIGMOD 2012]
  • Uses Bayesian inference to update the parameters of the cluster distributions

• PICS [SDM 2012]
  • Compresses adjacency and attribute matrices

• HASCOP [WI 2013]
  • Heuristic, distance-based
  • Applies to attributed multi-graphs


Evaluation - Synthetic Datasets

• CAMIR entropy is always less than 0.5
  • High attribute homogeneity

• CAMIR NMI is at least 0.8 on all experiments
  • High quality results

• Similar behavior as we increase the number of attributes


Evaluation - Synthetic Datasets

• CAMIR is the 2nd fastest algorithm
  • Less than 10 secs for up to 5000 vertices

• CAMIR on average outperforms almost all its competitors


Evaluation - Real-world Datasets

[Figures: results on DBLP-1K and DBLP-10K]

• CAMIR achieves the lowest entropy among its competitors
  • Efficiently ranks and combines vertex properties

• Identifies clusters of arbitrary shapes and sizes (spectral clustering)


Evaluation - Real-world Datasets

[Figures: two result panels for GoogleSP-23]

• CAMIR achieves low entropy

• CAMIR achieves high NMI
  • Identifies a high percentage of software packages


Evaluation – Runtime and Entropy

            DBLP-1K                  DBLP-10K                 GoogleSP-23
Algorithm   Runtime (secs)  Entropy  Runtime (secs)  Entropy  Runtime (secs)  Entropy
CAMIR       1.20            0.299    520.48          0.255    5.98            0.387
BAGC        0.15            1.448    0.35            1.649    0.81            1.573
SACluster   3.22            0.729    433.228         1.066    30.57           1.513
PICS        4.87            1.280    495.17          1.877    476.49          2.178
HASCOP      882.17          0.838    32957           1.306    4675            0.061

• CAMIR requires:
  • Less than 6 secs for ~1000 vertices
  • About 8 minutes for 10000 vertices

• CAMIR achieves on average 55% time and 60% entropy improvement

• BAGC is the fastest method, but achieved limited clustering quality
• HASCOP achieved slightly better results than CAMIR, but it is the slowest method




Summary

• A new approach for Clustering Attributed Multi-graphs with Information Ranking: CAMIR

• A new mechanism to rank and weigh vertex properties
  • Identifies the importance of each attribute and edge-type property

• A unified similarity matrix for attributed multi-graphs
  • Efficiently combines vertex properties

• Identifies clusters of arbitrary sizes and shapes
  • Effective in terms of clustering accuracy and computational time


Clustering Attributed Multi-graphs with Information Ranking

Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos

Department of Computer Science, University of Cyprus

Thank You!
