http://linc.ucy.ac.cy Andreas Papadopoulos - [email protected] [DEXA 2015]
Clustering Attributed Multi-graphs with Information Ranking
26th International Conference on Database and Expert Systems Applications
Sep. 1-4, 2015 Valencia, Spain
Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos
Slide 2 of 35
The Real World: Information Networks
[Figure: an information network whose edges are labeled Friendship or Coauthor]
Slide 4 of 35
Challenges
• Identify the importance of each edge-type/attribute property
  • For instance, when clustering a bibliography network:
  • Attribute ‘area of interest’ is important
  • Attributes ‘name’ and ‘gender’ may introduce noise and reduce the clustering accuracy
• Combine the attribute and structural vertex properties
  • Edges and attributes are of different types
Slide 5 of 35
Related Work
• Limited attention to the different importance of attributes/edge-types
  • Weights are mainly updated at each iteration
• Ignore the existence of multiple edge-types
  • Increases computational cost and complexity
• Spectral clustering is not used for clustering attributed graphs
  • Used to identify dense clusters in attribute subspaces
Model-Based
• BAGC [SIGMOD ’12, TKDD ’14]
• CESNA [ICDM ’13]
Distance-Based
• SACluster [VLDB ’09, TKDD ’11]
• PICS [SDM ’12]
• HASCOP [WI ’13]
Slide 6 of 35
Proposed Approach: CAMIR
• Clustering Attributed Multi-graphs with Information Ranking: CAMIR
1. Rank edge-type and attribute properties
2. Construct a unified similarity matrix
3. Adopt a spectral clustering technique to generate the final clusters
Presentation Outline
• Motivation
• Problem Definition
• Related Work
• Background
• Proposed Approach: CAMIR
• Evaluation
• Summary
Slide 9 of 35
Background: Graph Partitioning
• An edge represents the similarity of the two connected vertices
• Find the minimum cut of a graph
  • Minimizes inter-cluster similarities
  • Identifies an optimal partitioning of the graph
• Identifying a minimum cut is computationally difficult
  • Efficient approximations using linear algebra
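To make the notion of a cut concrete, the following minimal sketch sums the similarities of the edges that cross a given two-way partition. The function name `cut_weight` and the toy edge weights are hypothetical and chosen only for illustration; they do not come from the slides.

```python
def cut_weight(similarity, part_a):
    """Sum the similarities of edges with exactly one endpoint in part_a."""
    part_a = set(part_a)
    total = 0.0
    for (u, v), w in similarity.items():
        # An edge is cut when its endpoints fall on different sides.
        if (u in part_a) != (v in part_a):
            total += w
    return total

# Hypothetical toy graph: two tight pairs joined by one weak edge.
edges = {(0, 1): 0.9, (2, 3): 0.8, (1, 2): 0.1}
print(cut_weight(edges, {0, 1}))  # only the weak (1, 2) edge is cut
print(cut_weight(edges, {0}))     # cutting the strong (0, 1) edge costs more
```

A partition with a small cut separates vertices that are only weakly similar, which is the sense in which a minimum cut minimizes inter-cluster similarity.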
Slide 10 of 35
Background: Spectral Clustering
• Based on the graph Laplacian, or Laplacian matrix
• Given a similarity matrix, the normalized symmetric Laplacian L is defined from the similarity matrix and its diagonal degree matrix
• The eigenvectors corresponding to the top k eigenvalues are the projection of the graph into R^{|V| × k}
  • Data is easily separable into clusters, e.g. using k-means
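As a rough illustration of the Laplacian step, the sketch below uses one common form of the normalized symmetric Laplacian, L = I - D^{-1/2} S D^{-1/2}, where S is the similarity matrix and D its diagonal degree matrix. The symbols S and D and the function name are assumptions, not taken from the slides. With this sign convention, the eigenvectors used for clustering belong to the k smallest eigenvalues of L, which are the k largest eigenvalues of D^{-1/2} S D^{-1/2}.

```python
import math

def normalized_laplacian(S):
    """Return L = I - D^{-1/2} S D^{-1/2} for a symmetric similarity matrix S.

    Assumed convention (not confirmed by the slides); isolated vertices
    keep a plain identity row.
    """
    n = len(S)
    degree = [sum(row) for row in S]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            identity = 1.0 if i == j else 0.0
            if degree[i] > 0 and degree[j] > 0:
                L[i][j] = identity - S[i][j] / math.sqrt(degree[i] * degree[j])
            else:
                L[i][j] = identity
    return L

# Hypothetical toy input: the path graph 0 - 1 - 2 with unit similarities.
S = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
L = normalized_laplacian(S)
for row in L:
    print([round(x, 3) for x in row])
```

An off-diagonal entry is the negated similarity divided by the square root of the product of the two vertex degrees, so an unweighted edge between vertices of degree 2 and 4 gives about -0.354.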
Slide 11 of 35
Background: Spectral Clustering
[Figure: an example graph with 10 vertices, with its 10 × 10 adjacency matrix and normalized Laplacian matrix]
Top 3 eigenvectors
Vertex | U1 | U2 | U3
1 | -0.659 | -0.705 | 0.263
2 | -0.620 | 0.747 | 0.241
3 | -0.595 | -0.486 | -0.640
4 | -0.668 | 0.711 | -0.221
5 | -0.723 | 0.395 | 0.566
6 | -0.669 | 0.414 | -0.617
7 | -0.332 | -0.486 | -0.808
8 | -0.668 | 0.711 | -0.221
9 | -0.379 | -0.491 | 0.784
10 | -0.659 | -0.705 | 0.263
Slide 12 of 35
How do we define the similarity matrix for an attributed multi-graph?
Slide 13 of 35
Background: Similarity Matrices
[Figure: the example graph with vertex labels (IR, DM, AI) and edge weights; a Gaussian kernel and the edges each map to a similarity matrix in [0,1]^{N × N}]
• #Edge types + #Attributes symmetric non-negative similarity matrices
How do we efficiently combine the similarity matrices?
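One way to picture the per-property similarity matrices is the sketch below. It assumes an exact-match similarity for a categorical attribute and a Gaussian kernel for a numeric one. The slide names a Gaussian kernel, but which property it applies to and the bandwidth `sigma` are assumptions here, as are the function names and toy values.

```python
import math

def categorical_similarity(values):
    """1.0 where two vertices share the attribute value, else 0.0 (assumed rule)."""
    return [[1.0 if a == b else 0.0 for b in values] for a in values]

def gaussian_similarity(values, sigma=1.0):
    """Gaussian kernel exp(-(a - b)^2 / (2 sigma^2)); sigma is a free parameter."""
    return [
        [math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values]
        for a in values
    ]

# Hypothetical vertex attributes for three vertices.
areas = ["AI", "AI", "DM"]
scores = [0.0, 1.0, 3.0]

S_area = categorical_similarity(areas)
S_score = gaussian_similarity(scores, sigma=1.0)
print(S_area)
print([[round(x, 3) for x in row] for row in S_score])
```

Both functions return symmetric, non-negative matrices with entries in [0, 1], matching the shape the slide asks for, so the results can be combined entry by entry.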
Slide 15 of 35
CAMIR Overview
1. Rank vertex properties and calculate their weights accordingly
  • By considering the agreement among vertex properties
2. Compute a unified similarity matrix
  • By combining all vertex properties based on their ranking
3. Generate the final clusters
  • By adopting a spectral clustering approach
Slide 16 of 35
Presentation Outline
• Motivation
• Problem Definition
• Related Work
• Background
• Proposed Approach: CAMIR
  1. Information Ranking
  2. Unified Similarity Matrix
  3. Generate the final clusters
• Evaluation
• Summary
Slide 17 of 35
Information Ranking
Rank attribute and edge-type properties: iteratively select the most informative property from the set of unranked properties
• Most informative property [NIPS ’11]:
  • Has the highest ‘agreement’ with other properties
  • Properties ‘agree’ when they assign vertices the same cluster labels when used individually
Slide 18 of 35
Information Ranking
• From the set of properties, the most informative property p is selected [NIPS ’11]
• The highest rank (the number of properties) is assigned to the most informative property
  • i.e. the one that best separates the vertices
• The lowest rank (1.0) is assigned to the property that is selected last
  • i.e. the one that does not ‘agree’ with the rest of the properties
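The iterative selection described above can be sketched as follows. The slides do not say how ‘agreement’ is measured, so this sketch assumes a pairwise Rand-style score between the cluster labels each property produces on its own. It also assumes agreement is averaged over the properties still unranked. The names `rand_agreement` and `rank_properties` and the toy labelings are hypothetical.

```python
from itertools import combinations

def rand_agreement(labels_a, labels_b):
    """Fraction of vertex pairs that both labelings treat the same way."""
    pairs = list(combinations(range(len(labels_a)), 2))
    same = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return same / len(pairs)

def rank_properties(clusterings):
    """Map each property to a rank; the first property selected gets the highest."""
    unranked = sorted(clusterings)  # sorted so that ties break deterministically
    ranks = {}
    next_rank = len(unranked)

    def mean_agreement(p):
        others = [q for q in unranked if q != p]
        if not others:
            return 0.0
        scores = [rand_agreement(clusterings[p], clusterings[q]) for q in others]
        return sum(scores) / len(scores)

    while unranked:
        best = max(unranked, key=mean_agreement)
        ranks[best] = float(next_rank)
        unranked.remove(best)
        next_rank -= 1
    return ranks

# Hypothetical per-property cluster labels for four vertices.
clusterings = {
    "area": [0, 0, 1, 1],
    "coauthor": [0, 0, 1, 1],
    "name": [0, 1, 0, 1],
}
ranks = rank_properties(clusterings)
print(ranks)
```

In this toy input the ‘name’ labeling disagrees with the other two, so it is selected last and receives the lowest rank of 1.0.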
Slide 20 of 35
Unified Similarity Matrix
• Combines the multiple edge-type and attribute properties with respect to the identified ranking
• Defined as the weighted sum of the individual similarity matrices
  • Weights are defined by normalizing the rankings
• Contains all the similarity information about the network under study
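The weighted sum can be sketched as below. The slides state only that weights come from normalizing the rankings; dividing each rank by the sum of all ranks is an assumption made here, and the function name and toy matrices are hypothetical.

```python
def unified_similarity(matrices, ranks):
    """Weighted sum of per-property similarity matrices.

    Weights are rank / sum(ranks), an assumed normalization.
    """
    total = sum(ranks.values())
    weights = {p: r / total for p, r in ranks.items()}
    n = len(next(iter(matrices.values())))
    U = [[0.0] * n for _ in range(n)]
    for p, M in matrices.items():
        for i in range(n):
            for j in range(n):
                U[i][j] += weights[p] * M[i][j]
    return U

# Hypothetical two-vertex example with two properties.
matrices = {
    "area": [[1.0, 1.0], [1.0, 1.0]],
    "name": [[1.0, 0.0], [0.0, 1.0]],
}
ranks = {"area": 3.0, "name": 1.0}
U = unified_similarity(matrices, ranks)
print(U)
```

Because the weights sum to one and every input matrix has entries in [0, 1], the unified matrix also stays within [0, 1] and remains symmetric.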
Slide 22 of 35
Generating the Final Clusters
• Calculate the normalized Laplacian of the Unified Similarity Matrix
• Perform eigendecomposition
• Apply k-means to the eigenspace of the top k eigenvectors
• Generate the final clusters
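The steps above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. It takes the top k eigenvectors of D^{-1/2} S D^{-1/2}, which are the eigenvectors of the k smallest eigenvalues of I - D^{-1/2} S D^{-1/2}. It then normalizes each row to unit length and runs a small k-means with farthest-point initialization. The row normalization, the initialization and the function name are all assumptions.

```python
import numpy as np

def spectral_clusters(S, k, n_iter=20):
    """Cluster the vertices of a symmetric similarity matrix S into k groups."""
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)
    inv_sqrt_d = 1.0 / np.sqrt(d)  # assumes every vertex has positive degree
    M = S * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]

    # eigh returns eigenvalues in ascending order, so the last k columns
    # are the eigenvectors of the k largest eigenvalues of M.
    _, vecs = np.linalg.eigh(M)
    U = vecs[:, -k:]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # assumed row normalization

    # Farthest-point initialization keeps this sketch deterministic.
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(dist))])
    centers = np.array(centers)

    # Plain Lloyd iterations on the embedded vertices.
    for _ in range(n_iter):
        d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = U[labels == c].mean(axis=0)
    return labels

# Hypothetical unified similarity matrix: two groups of three vertices,
# strongly similar inside a group and weakly similar across groups.
S = np.full((6, 6), 0.01)
S[:3, :3] = 1.0
S[3:, 3:] = 1.0
np.fill_diagonal(S, 0.0)
labels = spectral_clusters(S, k=2)
print(labels)
```

On this toy matrix the two groups map to two distinct points in the embedding, so k-means separates them regardless of eigenvector sign.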
Slide 23 of 35
CAMIR Clustering Process Diagram
Step 1. Identify importance of vertex properties
  • Properties ranking: iteratively select the most informative property
Step 2. Efficiently combine vertex properties
  • Unified Similarity Matrix: normalize rankings and compute the Unified Similarity Matrix
Step 3. Cluster the attributed multi-graph
  • Generate the final clusters: apply spectral clustering to obtain Cluster 1, Cluster 2, …, Cluster k
Slide 25 of 35
Evaluation - Datasets
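• Real-World Datasets
  • DBLP: Bibliography Networks
  • GoogleSP23: Google Software Packages
Dataset | DBLP-1K | DBLP-10K | GoogleSP-23
Nodes | 1 000 | 10 000 | 1 297
Edges | 17 128 | 65 734 | 268 956
Attributes | 2 | 2 | 5
Edge Types | 1 | 1 | 2
Total Vertex Properties | 3 | 3 | 7
• Synthetic Datasets
Dataset | Synthetic (varying nodes) | Synthetic (varying attributes)
Nodes | {100, 500, 1 000, 5 000, 10 000} | 1 000
Edges | {1 000 – 1 230 000} | ~ 40 000
Attributes | 4 | {2, 4, 8, 16, 32}
Edge Types | 1 | 1
Total Vertex Properties | 5 | {3, 5, 9, 17, 33}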
Slide 26 of 35
Evaluation Measures
• Entropy
  • Low entropy corresponds to high attribute homogeneity
• Normalized Mutual Information (NMI)
  • High NMI corresponds to high similarity between the resulting clustering and the ground truth
  • An NMI value of 1 indicates a perfect match
• Runtime
  • Quad-core i7 2.8 GHz, 8 GB RAM
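The two quality measures can be sketched as below. The slides do not give the exact formulas, so three choices here are assumptions: base-2 logarithms, a size-weighted average of per-cluster attribute entropy, and normalization of the mutual information by the geometric mean of the two label entropies. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def cluster_attribute_entropy(labels, attr):
    """Size-weighted average entropy of one attribute inside each cluster."""
    n = len(labels)
    total = 0.0
    for c in set(labels):
        members = [a for l, a in zip(labels, attr) if l == c]
        total += len(members) / n * entropy(members)
    return total

def nmi(labels_a, labels_b):
    """Mutual information normalized by sqrt(H(a) * H(b)) (assumed variant)."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    mi = sum(
        (nab / n) * math.log2(nab * n / (ca[a] * cb[b]))
        for (a, b), nab in cab.items()
    )
    ha, hb = entropy(labels_a), entropy(labels_b)
    if ha == 0 or hb == 0:
        return 1.0 if ha == hb else 0.0
    return mi / math.sqrt(ha * hb)

# Hypothetical clustering of four vertices and one categorical attribute.
labels = [0, 0, 1, 1]
areas = ["AI", "AI", "DM", "DM"]
print(cluster_attribute_entropy(labels, areas))  # homogeneous clusters
print(nmi(labels, [1, 1, 0, 0]))                 # same partition, renamed labels
```

Relabeling the clusters does not change NMI, which is why a clustering can be compared with ground truth without matching cluster identifiers first.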
Slide 27 of 35
State-of-the-Art Competitors
• SACluster [VLDB 2009]
  • Similarity is defined as the Random Walk distance in the augmented graph
• BAGC [SIGMOD 2012]
  • Uses Bayesian inference to update the parameters of the cluster distributions
• PICS [SDM 2012]
  • Compresses adjacency and attribute matrices
• HASCOP [WI 2013]
  • Heuristic distance-based
  • Applies to attributed multi-graphs
Slide 28 of 35
Evaluation - Synthetic Datasets
• CAMIR Entropy is always less than 0.5
  • High attribute homogeneity
• CAMIR NMI is at least 0.8 on all experiments
  • High quality results
• Similar behavior as we increase the number of attributes
Slide 29 of 35
Evaluation - Synthetic Datasets
• CAMIR is the 2nd fastest algorithm
  • Less than 10 secs for up to 5000 vertices
• CAMIR on average outperforms almost all its competitors
Slide 30 of 35
Evaluation - Real-world Datasets
[Figures: results on DBLP-1K and DBLP-10K]
• CAMIR achieves the lowest entropy among its competitors
  • Efficiently ranks and combines vertex properties
• Identifies clusters of arbitrary shapes and sizes (spectral clustering)
Slide 31 of 35
Evaluation - Real-world Datasets
[Figures: two result plots for GoogleSP-23]
• CAMIR achieves low entropy
• CAMIR achieves high NMI
  • Identifies a high percentage of software packages
Slide 32 of 35
Evaluation - Runtime and Entropy
Algorithm | DBLP-1K Runtime (secs) | DBLP-1K Entropy | DBLP-10K Runtime (secs) | DBLP-10K Entropy | GoogleSP23 Runtime (secs) | GoogleSP23 Entropy
CAMIR | 1.20 | 0.299 | 520.48 | 0.255 | 5.98 | 0.387
BAGC | 0.15 | 1.448 | 0.35 | 1.649 | 0.81 | 1.573
SACluster | 3.22 | 0.729 | 433.228 | 1.066 | 30.57 | 1.513
PICS | 4.87 | 1.280 | 495.17 | 1.877 | 476.49 | 2.178
HASCOP | 882.17 | 0.838 | 32957 | 1.306 | 4675 | 0.061
• CAMIR requires:
  • Less than 6 secs for ~1000 vertices
  • About 8.7 minutes (520.48 secs) for 10000 vertices
• CAMIR achieves on average 55% time and 60% entropy improvement
• BAGC is the fastest method, but achieved limited clustering quality
• HASCOP achieved slightly better results than CAMIR, but it is the slowest method
Slide 34 of 35
Summary
• A new approach for Clustering Attributed Multi-graphs with Information Ranking: CAMIR
• A new mechanism to rank and weigh vertex properties
  • Identifies the importance of each attribute and edge-type property
• A unified similarity matrix for attributed multi-graphs
  • Efficiently combines vertex properties
• Identifies clusters of arbitrary sizes and shapes
  • Effective in terms of clustering accuracy and computational time
Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos
Department of Computer Science, University of Cyprus
Thank You!