Date posted: 23-Jun-2015
Category: Education
Uploaded by: waqas-nawaz
The 17th International Conference on Database Systems for Advanced Applications, Busan, South Korea.
The 3rd International Workshop on Social Networks and Social Web Mining*
Collaborative Similarity Measure for Intra-Graph Clustering*
Waqas Nawaz, Young-Koo Lee, Sungyoung Lee
Department of Computer Engineering, Kyung Hee University, Korea
Presenter
Waqas Nawaz
Thursday, April 13, 2023
Data and Knowledge Engineering (DKE) Lab, Kyung Hee University Korea
Data & Knowledge Engineering Lab
Agenda
Motivation
Related Work
Proposed Method (CSM-IGC)
Experiments
Conclusion & Future Directions
Graphs with Multiple Attributes
Coauthor Network of Top 200 Authors on TEL from DBLP from manyeyes.alphaworks.ibm.com
Attribute of Authors
Related Work
Structure-based clustering
• Normalized cuts [Shi and Malik, TPAMI 2000]
• Modularity [Newman and Girvan, Phys. Rev. 2004]
• SCAN [Xu et al., KDD'07]
• The clusters generated have a rather random distribution of vertex properties within clusters
OLAP-style graph aggregation
• K-SNAP [Tian et al., SIGMOD'08]
• Attribute-compatible grouping
• The clusters generated have a rather loose intra-cluster structure
Example: A Coauthor Network
Authors and their research topics (the same network is shown in every panel):
r1. XML
r2. XML
r3. XML, Skyline
r4. XML
r5. XML
r6. XML
r7. XML
r8. XML
r9. Skyline
r10. Skyline
r11. Skyline
Figure panels: Traditional Coauthor Graph; Structure-based Cluster; Attribute-based Cluster; Structural/Attribute Cluster
*https://wiki.engr.illinois.edu/download/attachments/186384385/VLDB09_notes.ppt
Related Work (cont…)
Structure/attribute-based clustering: SA-Cluster [Yang Zhou et al., VLDB 2009]
• Modifies the structure of the original graph
– Adds a dummy vertex for each attribute instance
– Sparse matrix, space inefficient
• Neighborhood random walk: matrix multiplication is performed iteratively
• Fixed edge weights; attribute weights are updated automatically
Scalability issue for medium and large graphs (time complexity)
Two-Fold Objective
A desirable clustering of an attributed graph should achieve a good balance between the following:
Structural cohesiveness: Vertices within one cluster are close to each other in terms of structure, while vertices in different clusters are distant from each other
Attribute homogeneity: Vertices within one cluster have similar attribute values, while vertices in different clusters have quite different attribute values
It should also be scalable to medium-scale graphs
Different Graph Clustering Approaches
Structure-based clustering: vertices with heterogeneous attribute values end up in the same cluster
Attribute-based clustering: loses much of the structure information
Structural/attribute clustering: homogeneous vertices along with structure information, at the expense of time complexity
Intra-graph clustering: scalable while considering both aspects
Proposed Solution
System Architecture Diagram: INPUT, Processing Phase, OUTPUT
Phase 1
Similarity Estimation (inspired by the Jaccard index¹)
Interaction of vertices (topology or structure)
• Weighted fraction of shared neighbors
• It will be zero for disconnected vertices
• Example: structural similarity among vertices
– SIM(V1, V2) = (1/3)*5 = 1.667
– SIM(V1, V3) = (1/4)*4 = 1.0
– SIM(V2, V3) = (1/4)*3 = 0.75
– V1 & V4 = (1/4)*0 = 0.0
• Transitive property
– SIM(V1, V4) = SIM(V1, V3) * SIM(V3, V4)
¹ P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Société Vaudoise des Sciences Naturelles, Vol. 37 (1901)
Transitive Property
Lemma 1 (Transitivity): Let $p = \{v_{p_1}, v_{p_2}, \dots, v_{p_{q+1}}\}$ be a path from source vertex $v_a$ to target vertex $v_b$. Then
$$sim(v_a, v_b) = \prod_{i=1}^{q} sim(v_{p_i}, v_{p_{i+1}}) \le sim(v_{p_i}, v_{p_{i+1}})$$
for all the intermediate vertices $i = 1, 2, \dots, q$.
Proof: It is based on the fact that the similarity value lies in the interval [0, 1].
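The lemma can be checked numerically with a minimal Python sketch. The function name `path_similarity` and the three per-edge similarity values are hypothetical, chosen only for illustration; the sketch assumes, as the proof does, that every similarity lies in [0, 1].

```python
from math import prod


def path_similarity(edge_sims):
    """Similarity between the two endpoints of a path, taken as the
    product of the similarities of consecutive vertices on the path."""
    if not edge_sims:
        raise ValueError("a path needs at least one edge")
    if any(not 0.0 <= s <= 1.0 for s in edge_sims):
        raise ValueError("Lemma 1 assumes every similarity lies in [0, 1]")
    return prod(edge_sims)


# Hypothetical similarities of the three consecutive vertex pairs on a path.
sims = [0.8, 0.5, 0.9]
total = path_similarity(sims)

# The product never exceeds any single factor, which is the lemma's claim.
assert all(total <= s for s in sims)
```

Because each factor is at most 1, multiplying by it can only keep the product the same or shrink it, so longer paths never look more similar than their weakest link.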
Phase 1 (cont…)
Similarity Estimation (inspired by the Jaccard index¹)
Context of vertices (attribute regularity)
• Weighted fraction of shared attribute instances
• It will be zero for contextually disjoint vertices
• Example: contextual similarity among vertices
– Let W_a1 = 1 and W_a2 = 2; then
– SIM(V1, V3) = (2/2) = 1.0
– SIM(V3, V4) = (1/2) = 0.5
– V1 & V4 = 0.0
$$SIM(v_a, v_b)_{context}^{weighted} = \frac{\prod_{i=1,\; v_a \wedge v_b \leftarrow a_i}^{M} w_{a_i}}{\prod_{j=1,\; v_a \vee v_b \leftarrow a_j}^{M} w_{a_j}}, \qquad v_a \leftrightarrow v_b \;\vee\; v_a \cdots v_b$$
Here the numerator runs over the attribute instances held by both $v_a$ and $v_b$, and the denominator over those held by either vertex.
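The contextual measure can be sketched in a few lines of Python. The function name `context_similarity` and the attribute assignment below (V1 holds a2, V3 holds a1 and a2, V4 holds a1) are assumptions: the slide's figure is not available, and this assignment was picked only because it reproduces the three example values with W_a1 = 1 and W_a2 = 2.

```python
from math import prod


def context_similarity(attrs_a, attrs_b, weights):
    """Product of the weights of shared attribute instances divided by the
    product of the weights of all attribute instances held by either vertex."""
    shared = attrs_a & attrs_b
    if not shared:
        return 0.0  # contextually disjoint vertices
    union = attrs_a | attrs_b
    return prod(weights[a] for a in shared) / prod(weights[a] for a in union)


weights = {"a1": 1, "a2": 2}

# Hypothetical attribute assignment (assumption, see the note above).
vertex_attrs = {"V1": {"a2"}, "V3": {"a1", "a2"}, "V4": {"a1"}}

sim_13 = context_similarity(vertex_attrs["V1"], vertex_attrs["V3"], weights)  # 2/2
sim_34 = context_similarity(vertex_attrs["V3"], vertex_attrs["V4"], weights)  # 1/2
sim_14 = context_similarity(vertex_attrs["V1"], vertex_attrs["V4"], weights)  # disjoint
```

The explicit zero for an empty intersection follows the slide's statement about contextually disjoint vertices; an empty product would otherwise evaluate to 1.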
Collaborative Similarity Measure
Structural
Contextual
Collaborative Measure
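Contextual similarity:
$$SIM(v_a, v_b)_{context}^{weighted} = \frac{\prod_{i=1,\; v_a \wedge v_b \leftarrow a_i}^{M} w_{a_i}}{\prod_{j=1,\; v_a \vee v_b \leftarrow a_j}^{M} w_{a_j}}, \qquad v_a \leftrightarrow v_b \;\vee\; v_a \cdots v_b$$
Collaborative similarity:
$$CSim(v_a, v_b) = \begin{cases} \alpha \cdot SIM(v_a, v_b)_{struct} + (1-\alpha) \cdot SIM(v_a, v_b)_{context}, & v_a \leftrightarrow v_b \\ \prod_{i=1}^{q} CSim(v_{p_i}, v_{p_{i+1}}), & v_a \cdots v_b,\ v_p \text{ is on the path between } v_a \text{ and } v_b \\ (1-\alpha) \cdot SIM(v_a, v_b)_{context}, & \text{otherwise} \end{cases}$$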
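The three cases of the collaborative measure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `csim` and `shortest_path`, the toy graph, the precomputed similarity values, and alpha = 0.5 are all hypothetical, and the indirect case uses a breadth-first shortest path, following the "shortest path" wording in the Fig. 3 caption.

```python
from collections import deque


def shortest_path(adj, src, dst):
    """Breadth-first search; returns the vertices of a shortest path or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def csim(adj, struct, context, a, b, alpha=0.5):
    """Piecewise collaborative similarity.

    struct and context map frozenset({u, v}) to precomputed similarities;
    missing pairs count as 0.0.
    """
    pair = frozenset((a, b))
    if b in adj.get(a, ()):  # case 1: directly connected
        return alpha * struct.get(pair, 0.0) + (1 - alpha) * context.get(pair, 0.0)
    path = shortest_path(adj, a, b)
    if path is not None:  # case 2: indirectly connected, multiply along the path
        result = 1.0
        for u, v in zip(path, path[1:]):
            result *= csim(adj, struct, context, u, v, alpha)
        return result
    return (1 - alpha) * context.get(pair, 0.0)  # case 3: no path at all


# Hypothetical toy graph: V1 - V3 - V4, with V5 isolated.
adj = {"V1": ["V3"], "V3": ["V1", "V4"], "V4": ["V3"], "V5": []}
struct = {frozenset(("V1", "V3")): 1.0, frozenset(("V3", "V4")): 0.5}
context = {
    frozenset(("V1", "V3")): 1.0,
    frozenset(("V3", "V4")): 0.5,
    frozenset(("V1", "V5")): 0.4,
}

direct = csim(adj, struct, context, "V1", "V3")        # case 1
indirect = csim(adj, struct, context, "V1", "V4")      # case 2
disconnected = csim(adj, struct, context, "V1", "V5")  # case 3
```

Passing the structural and contextual similarities in as lookup tables keeps the sketch independent of how each one is computed.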
Phase 2
Clustering (K-Medoid Approach)
1. Randomly choose centroids for K clusters
2. Assign vertices to their nearest centroids
3. Evaluate the quality of each cluster
4. Update the centroids by maximizing SIM distances
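The four steps can be sketched as a small similarity-based K-Medoid loop. This is a minimal illustration, not the authors' code: the function name `k_medoids`, the optional `medoids` and `seed` parameters, the rule that a medoid always stays in its own cluster, and the choice of "total similarity to the other members" as the quality score are assumptions. The matrix holds the CSim values from Table 2(a) of this deck.

```python
import random


def k_medoids(sim, k, medoids=None, max_iter=100, seed=0):
    """Cluster items 0..n-1 given a symmetric similarity matrix."""
    n = len(sim)
    if medoids is None:
        # Step 1: randomly choose K initial medoids.
        medoids = random.Random(seed).sample(range(n), k)
    medoids = list(medoids)
    for _ in range(max_iter):
        # Step 2: assign every vertex to its most similar medoid.
        clusters = {m: [m] for m in medoids}
        for v in range(n):
            if v in clusters:
                continue  # a medoid always stays in its own cluster
            best = max(medoids, key=lambda m: sim[v][m])
            clusters[best].append(v)
        # Steps 3-4: score each member by its total similarity to the rest
        # of its cluster and promote the best-scoring member to medoid.
        new_medoids = [
            max(members, key=lambda c: sum(sim[c][o] for o in members if o != c))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # medoids are stable
        medoids = new_medoids
    return [sorted(members) for members in clusters.values()]


# CSim values from Table 2(a); index 0 is V1, index 5 is V6.
CSIM = [
    [1.00, 2.67, 1.17, 0.20, 0.18, 0.18],
    [2.67, 1.00, 0.92, 0.15, 0.14, 0.14],
    [1.17, 0.92, 1.00, 0.17, 0.15, 0.15],
    [0.20, 0.15, 0.17, 1.00, 0.92, 0.92],
    [0.18, 0.14, 0.15, 0.92, 1.00, 2.50],
    [0.18, 0.14, 0.15, 0.92, 2.50, 1.00],
]

# Fixed initial medoids V1 and V4 make the run deterministic.
result = k_medoids(CSIM, 2, medoids=[0, 3])
# result is [[0, 1, 2], [3, 4, 5]], i.e. {V1,V2,V3},{V4,V5,V6}
```

With these starting medoids the grouping agrees with the K = 2 row of Table 2(b); a random start may need more iterations or settle on a different grouping.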
Algorithm Details
Similarity Calculation: Single Pass
Node Clustering: Iterative
Example
Fig. 3. Scenarios for similarity between source (green) and destination (red) nodes via intermediate nodes (yellow): (a) no direct path exists, (b) directly connected, (c) indirectly connected, shortest path
Table 2. (a) Collaborative similarity among the vertices given in Fig. 3-c, computed using the Collaborative Similarity Measure; (b) clustering results for varying numbers of clusters (K), with the quality of each result calculated using Density and Entropy

(a) CSim(v_a, v_b)
vertex  V1    V2    V3    V4    V5    V6
V1      1     2.67  1.17  0.20  0.18  0.18
V2      2.67  1     0.92  0.15  0.14  0.14
V3      1.17  0.92  1     0.17  0.15  0.15
V4      0.20  0.15  0.17  1     0.92  0.92
V5      0.18  0.14  0.15  0.92  1     2.50
V6      0.18  0.14  0.15  0.92  2.50  1

(b)
K  Clustered Vertices           Density  Entropy
2  {V1,V2,V3},{V4,V5,V6}        0.42     0.133
3  {V1,V3},{V2},{V4,V5,V6}      0.28     0.084
4  {V5},{V6},{V4},{V1,V2,V3}    0.21     0.084
Experiments
Real Dataset
• Political Blogs Dataset: 1490 vertices, 19090 edges, one attribute (political leaning)
– Liberal
– Conservative
Methods
• K-SNAP: attributes only
• S-Cluster: structure-based clustering
• W-Cluster: weighted random walk strategy
• SA-Cluster: considers both factors (matrix manipulation)
• IGC-CSM: our proposed method
Evaluation Metrics
Density*: intra-cluster structural cohesiveness
Entropy*: intra-cluster attribute homogeneity
*Yang Zhou et al., Graph Clustering Based on Structural/Attribute Similarities, Proceedings of the VLDB Endowment, France (2009)
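The slide's formulas for the two metrics did not survive extraction, so the sketch below is an assumption rather than a transcription: it takes Density to be the fraction of all edges that fall inside a cluster, and Entropy to be the cluster-size-weighted Shannon entropy of a single attribute, which is how these measures are commonly defined for attributed-graph clustering. The function names and the four-vertex toy graph are hypothetical.

```python
from collections import Counter
from math import log2


def density(edges, clusters):
    """Fraction of all edges whose two endpoints lie in the same cluster."""
    label = {v: i for i, members in enumerate(clusters) for v in members}
    intra = sum(1 for u, v in edges if label[u] == label[v])
    return intra / len(edges)


def entropy(attr, clusters):
    """Size-weighted average of the per-cluster Shannon entropy (in bits)
    of one categorical attribute; lower means more homogeneous clusters."""
    n = sum(len(members) for members in clusters)
    total = 0.0
    for members in clusters:
        counts = Counter(attr[v] for v in members)
        h = -sum((c / len(members)) * log2(c / len(members)) for c in counts.values())
        total += len(members) / n * h
    return total


# Hypothetical toy graph: a path 0-1-2-3 with a political-leaning attribute.
edges = [(0, 1), (1, 2), (2, 3)]
leaning = {0: "liberal", 1: "liberal", 2: "liberal", 3: "conservative"}
clusters = [[0, 1], [2, 3]]

d = density(edges, clusters)    # two of the three edges are intra-cluster
e = entropy(leaning, clusters)  # the first cluster is pure, the second is split evenly
```

Higher density indicates tighter intra-cluster structure, while lower entropy indicates more homogeneous attribute values, so a good clustering pushes the two in opposite directions.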
Evaluation Metrics (cont…)
F-Measure*: has the ability to evaluate the collective qualitative nature of the formed cluster
*Tijn Witsenburg et al., Improving the Accuracy of Similarity Measures by Using Link Information, International Symposium on Methodologies for Intelligent Systems, Edition 9, Poland (2011)
Results (Time Complexity)
Synthetic Dataset: varying number of nodes (graph size vs. time)
Real Dataset: Political Blog* (number of clusters vs. time)
*http://www-personal.umich.edu/mejn/netdata
Results (Quality)
Density Evaluation: clusters vs. density value
Entropy Evaluation: clusters vs. entropy value
F-Measure Estimation: clusters vs. F-measure value
Conclusion
We study the problem of graph node clustering based on homogeneous characteristics in terms of context and topology
A collaborative similarity measure reflects the relational model among pairs of vertices
A k-Medoid clustering framework is adopted for grouping similar nodes
The resulting solution is evaluated using state-of-the-art evaluation measures: Density, Entropy, and F-measure
Comparatively scalable to medium-scale graphs without compromising the quality of results
Thanks
Any questions?