Complex Networks
Tina Eliassi-Rad
Lawrence Livermore National Laboratory & Rutgers University
7/22/2010 Mathcamp'10
http://eliassi.org
[Figure: example networks. Panels: Internet, Terrorism, Reality Mining, Food Web, Enron Emails, Map of Science, HP Emails, Contagion of TB, NY State Power Grid, Protein Interactions, Friendship Network]
Common Patterns
• Scale-free
▫ Eigen exponent
• Small-world
• Triangle law
▫ k friends → ~k^1.6 triangles
• Small diameter
• Social influence
▫ Pr(A.class == B.class | A → B) > Pr(A.class == B.class)
• Social selection
▫ Pr(A → B | A.class == B.class) > Pr(A → B)
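The two inequalities above can be estimated directly from a labeled directed graph. The sketch below is only an illustration: the function name `homophily_estimates`, the edge-list input format, and the toy data are assumptions, not part of the slides. It assumes no self-loops, at least one edge, and at least one same-class pair. On a single snapshot the two inequalities are equivalent by Bayes' rule, so both hold or fail together in the estimates.

```python
from itertools import permutations

def homophily_estimates(edges, node_class):
    """Estimate the four probabilities from the slide on one directed graph.

    edges: iterable of (a, b) pairs meaning a -> b (hypothetical input format).
    node_class: dict mapping each node to its class label.
    """
    edge_set = set(edges)
    # All ordered pairs (a, b) with a != b.
    pairs = list(permutations(node_class, 2))
    same_pairs = [(a, b) for a, b in pairs if node_class[a] == node_class[b]]
    same_edges = [(a, b) for a, b in edge_set if node_class[a] == node_class[b]]
    return {
        "p_same_given_edge": len(same_edges) / len(edge_set),
        "p_same": len(same_pairs) / len(pairs),
        "p_edge_given_same": len(same_edges) / len(same_pairs),
        "p_edge": len(edge_set) / len(pairs),
    }

# Toy data (made up for illustration): two classes, mostly within-class links.
toy_classes = {"a": "x", "b": "x", "c": "y", "d": "y"}
toy_edges = [("a", "b"), ("b", "a"), ("c", "d"), ("a", "c")]
toy_estimates = homophily_estimates(toy_edges, toy_classes)
```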
Problem #1
• Triangles are expensive to compute
▫ Friends of friends are friends
▫ 3-way join
• Q: Can we do this quickly?
• A: Yes!
▫ #triangles = 1/6 Σi (λi^3)
▫ because of skewness, we only need the top few eigenvalues
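The eigenvalue formula holds because trace(A^3) counts closed walks of length 3 (each triangle is counted six times), and the trace of A^3 equals the sum of the cubed eigenvalues of A. The sketch below checks the trace form against a brute-force count using only the standard library. It assumes an undirected simple graph given as a 0/1 adjacency matrix, and it does not compute eigenvalues, so the top-few-eigenvalues shortcut itself is not shown. Function names are illustrative.

```python
from itertools import combinations

def count_triangles_brute_force(adj):
    """Check every node triple; adj is a symmetric 0/1 matrix (list of lists)."""
    n = len(adj)
    return sum(1 for i, j, k in combinations(range(n), 3)
               if adj[i][j] and adj[j][k] and adj[i][k])

def count_triangles_trace(adj):
    """Count triangles as trace(A^3) / 6, i.e. (1/6) * sum of cubed eigenvalues."""
    n = len(adj)
    # A^2 via a plain triple loop.
    a2 = [[sum(adj[i][t] * adj[t][j] for t in range(n)) for j in range(n)]
          for i in range(n)]
    # trace(A^3) = sum over i, j of A^2[i][j] * A[j][i].
    trace_a3 = sum(a2[i][j] * adj[j][i] for i in range(n) for j in range(n))
    return trace_a3 // 6

# Complete graph on 4 nodes: every triple of nodes forms a triangle.
K4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
```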
Communities
• Clusters, groups, modules
• Need to
▫ Formalize the notion of a community
Hard vs. soft
▫ Design an algorithm that will find sets of nodes that are “good” communities
▫ Formalize the evaluation of community structure
Clustering Objective Function: Conductance
• A good cluster S has
▫ Many edges internally
▫ Few edges pointing outside
• Simplest objective function: Conductance
Φ(S) = #edges outside S / #edges inside S
▫ Small conductance corresponds to good clusters
• The score of best cluster of size k
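A minimal sketch of the score as it is defined on this slide: edges crossing the boundary of S divided by edges with both endpoints in S. The function name, the edge-list format, and the toy graph are assumptions for illustration. Returning infinity when S has no internal edges is also a choice made here, not something the slide specifies.

```python
def conductance(edges, cluster):
    """Phi(S) = #edges leaving S / #edges inside S, for an undirected edge list."""
    members = set(cluster)
    inside = 0
    outside = 0
    for u, v in edges:
        if u in members and v in members:
            inside += 1
        elif u in members or v in members:
            outside += 1
    if inside == 0:
        return float("inf")  # assumption: a cluster with no internal edges scores worst
    return outside / inside

# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge (2, 3).
TWO_TRIANGLES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
```

On the toy graph, a whole triangle has three internal edges and one outgoing edge, so it scores lower (better) than any pair of nodes.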
Network Community Profile (NCP) [Leskovec et al. WWW'08]
• The score of best cluster of size k
[Figure: NCP plot. x-axis: community size, log k; y-axis: log Φ(k); example clusters marked at k=5, k=7, k=10]
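For a tiny graph the profile can be computed exhaustively: for each size k, take the minimum score over all node subsets of that size. The sketch below is a brute-force illustration only (its cost grows exponentially with the number of nodes) and is not the method used in the cited paper. It uses the ratio of outgoing to internal edges as the score, and the function name is hypothetical.

```python
from itertools import combinations

def ncp_brute_force(nodes, edges):
    """Map each cluster size k to the best (smallest) score over all size-k subsets."""
    def phi(members):
        inside = sum(1 for u, v in edges if u in members and v in members)
        outside = sum(1 for u, v in edges if (u in members) != (v in members))
        return outside / inside if inside else float("inf")

    return {k: min(phi(set(s)) for s in combinations(nodes, k))
            for k in range(1, len(nodes))}

# Two triangles joined by one bridge edge; the best size-3 cluster is a triangle.
bridge_graph = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
profile = ncp_brute_force(range(6), bridge_graph)
```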
Clustering Objective Function: Modularity
• m = number of edges in the graph
• A_vw = 1 if v→w; 0 otherwise
• k_v = degree of vertex v
• δ(i, j) = 1 if i ≡ j; 0 otherwise
• Maximizes modularity, Q
▫ Fraction of all edges within communities minus the expected value of the same quantity in a network where the vertices have the same degrees but edges are placed at random
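The formula for Q does not survive in the text of this slide. The sketch below uses the standard form from Clauset, Newman, and Moore (cited in the summary), Q = (1/2m) Σ_vw [A_vw − k_v k_w / 2m] δ(c_v, c_w), where c_v is the community of vertex v. It assumes an undirected simple graph (no self-loops or repeated edges); the function name and input format are illustrative.

```python
def modularity(edges, community):
    """Q = (1/2m) * sum over v, w of [A_vw - k_v * k_w / (2m)] * delta(c_v, c_w).

    edges: list of undirected (v, w) pairs; community: dict node -> community id.
    """
    m = len(edges)
    degree = {}
    adjacent = set()
    for v, w in edges:
        degree[v] = degree.get(v, 0) + 1
        degree[w] = degree.get(w, 0) + 1
        adjacent.add((v, w))
        adjacent.add((w, v))
    q = 0.0
    for v in degree:
        for w in degree:
            if community[v] == community[w]:  # delta(c_v, c_w)
                a_vw = 1.0 if (v, w) in adjacent else 0.0
                q += a_vw - degree[v] * degree[w] / (2.0 * m)
    return q / (2.0 * m)

# Two triangles joined by a bridge, split into their natural communities.
tri_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
tri_split = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}
```

Putting every node in one community gives Q = 0 under this formula, which is a quick sanity check for any implementation.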
Problem #2
• Q: Is maximizing modularity similar to local spectral partitioning?
• Notes:
▫ In spectral methods, eigenvectors are based on the unnormalized Laplacian of the graph: L = D – A
D = degree matrix of the graph
A = adjacency matrix of the graph
▫ Look into Fiedler vector
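A small sketch of the Laplacian from the notes, using only the standard library. It builds L = D – A for an undirected edge list and evaluates the quadratic form x^T L x, which equals the sum of (x_u − x_v)^2 over edges; for a 0/1 indicator vector of a node set this is the number of edges cut. The Fiedler vector is the eigenvector for the second-smallest eigenvalue of L; computing it needs an eigen-solver and is not shown here. Function names are illustrative.

```python
def laplacian(n, edges):
    """Unnormalized Laplacian L = D - A for an undirected graph on nodes 0..n-1."""
    lap = [[0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1  # degree matrix D on the diagonal
        lap[v][v] += 1
        lap[u][v] -= 1  # minus adjacency matrix A off the diagonal
        lap[v][u] -= 1
    return lap

def quadratic_form(lap, x):
    """x^T L x, which equals the sum over edges (u, v) of (x_u - x_v)^2."""
    n = len(x)
    return sum(x[i] * lap[i][j] * x[j] for i in range(n) for j in range(n))

# Two triangles joined by one bridge edge; the indicator of one triangle cuts 1 edge.
lap_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
L_example = laplacian(6, lap_edges)
```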
Clustering Objective Function: Compression
• Organize into few, homogeneous communities
[Figure: two groupings of the same adjacency matrix, shown side by side ("versus"); axes: row groups, column groups]
Good Clustering
1. Similar nodes are grouped together
2. As few groups as necessary
A few, homogeneous blocks
implies
Good Compression
Total Encoding Cost Objective Function
[Figure: n × m adj. matrix split into k = 3 row groups (sizes n1, n2, n3) and ℓ = 3 col. groups (sizes m1, m2, m3); block (i, j) is labeled with its density p_{i,j}]
$p_{i,j} = \frac{e_{i,j}}{n_i m_j}$: density of ones (edges)
code cost: $\sum_{i,j} n_i m_j H(p_{i,j})$ bits total (block size × entropy)
+
description cost: $\sum_i$ transmit row-partition$_i$ description + $\sum_j$ transmit col-partition$_j$ description + $\sum_{i,j}$ transmit #edges $e_{i,j}$ + transmit #partitions
Total Encoding Cost Objective Function
[Figure: n × m adj. matrix split into k = 3 row groups (sizes n1, n2, n3) and ℓ = 3 col. groups (sizes m1, m2, m3); block (i, j) is labeled with its density p_{i,j}]
$p_{i,j} = \frac{e_{i,j}}{n_i m_j}$: density of ones (edges)
code cost: $\sum_{i,j} n_i m_j H(p_{i,j})$ bits total (block size × entropy)
+
$n H\left(\frac{n_1}{n}, \ldots, \frac{n_k}{n}\right) + m H\left(\frac{m_1}{m}, \ldots, \frac{m_\ell}{m}\right) + \log k + \log \ell + \sum_{i,j} \lceil \log n_i m_j \rceil$
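The code-cost term can be sketched as follows: count the ones in each block, turn the count into a density, and charge block size times the binary entropy of that density. Only the code cost is computed; the description-cost terms are left out. The function names, the list-of-lists matrix format, and the toy matrix are assumptions for illustration.

```python
import math

def binary_entropy(p):
    """H(p) in bits; a block that is all zeros or all ones costs nothing."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(matrix, row_group, col_group, k, l):
    """Sum over blocks (i, j) of n_i * m_j * H(p_ij), with p_ij = e_ij / (n_i * m_j)."""
    n_i = [row_group.count(i) for i in range(k)]
    m_j = [col_group.count(j) for j in range(l)]
    e = [[0] * l for _ in range(k)]  # e[i][j] = number of ones in block (i, j)
    for r, row in enumerate(matrix):
        for c, bit in enumerate(row):
            e[row_group[r]][col_group[c]] += bit
    total = 0.0
    for i in range(k):
        for j in range(l):
            size = n_i[i] * m_j[j]
            if size > 0:
                total += size * binary_entropy(e[i][j] / size)
    return total

# Toy block-diagonal matrix: grouping rows and columns as [0, 0, 1, 1] yields pure blocks.
block_matrix = [[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 1, 1],
                [0, 0, 1, 1]]
```

On the toy matrix, the grouping that matches the blocks costs 0 bits, while a grouping that mixes the blocks makes every block half ones and costs one bit per cell.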
Total Encoding Cost: cost vs. # of clusters
[Figure: total bit cost versus the number of row groups k and col groups ℓ. Panels: one row group and one col group; n row groups and m col groups; k = 3 row groups and ℓ = 3 col groups]
Problem #3
• Q: How do you find the best partitioning of a graph based on compression?
• Notes:
▫ Requires a couple of lemmas that rely on
Concavity of entropy
Non-negativity of the KL-divergence
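The two properties named in the notes can be checked numerically on small distributions. This is a sanity check under stated assumptions, not a proof and not the lemmas themselves: distributions are lists of probabilities, logs are base 2, and `kl_divergence` assumes q is positive wherever p is.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution given as a list."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

p_dist = [0.9, 0.1]
q_dist = [0.1, 0.9]
mixture = [0.5 * a + 0.5 * b for a, b in zip(p_dist, q_dist)]

# Concavity: the entropy of a mixture is at least the mixture of the entropies.
concavity_gap = entropy(mixture) - 0.5 * (entropy(p_dist) + entropy(q_dist))
# Non-negativity: KL(p || q) >= 0, with equality when p == q.
kl_value = kl_divergence(p_dist, q_dist)
```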
Evaluation based on Link Prediction
• A good factorization of a graph's connectivity structure should accurately predict links between nodes based on their respective communities
• P(s→t | s, t, c_s, c_t)
• Evaluate effectiveness by
1. randomly holding out a number of links
2. building a model
3. using learnt model to predict held-out links
4. measuring performance with area under ROC curve (AUC)
[Figure: example adjacency matrix shown with some entries held out]
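Step 4 can be sketched with the pairwise reading of AUC: the probability that a held-out link receives a higher score than a non-link, with ties counting one half. The function name and the score values are made up for illustration; any model producing P(s→t | s, t, c_s, c_t) could supply the scores.

```python
def auc(held_out_scores, non_link_scores):
    """Fraction of (held-out link, non-link) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for s_pos in held_out_scores:
        for s_neg in non_link_scores:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(held_out_scores) * len(non_link_scores))

# Hypothetical model scores for two held-out links and two non-links.
example_auc = auc([0.9, 0.8], [0.8, 0.1])
```

A model that scores every held-out link above every non-link reaches 1.0, and random scoring is expected to land near 0.5.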
Evaluation based on Variation of Information [Karrer, Levina, and Newman, Phys. Rev. E. 2008]
• Perturb graph by randomly reassigning a number of its links
• Rewiring parameter c ∈ [0,1] determines fraction of links rewired
• Links are rewired in a way that preserves the expected degree of each node in graph
• C = communities discovered on original graph, where c = 0
• C’ = communities discovered on perturbed graphs, where c ≠ 0
• Variation of Information, Δ(C, C’) = H(C|C’) + H(C’|C)
• H(C’|C) measures the information needed to describe C’ given C
• Δ(C, C’) ∈ [0, log(N)] treats each assignment as a message
• Is a symmetric entropy-based measure of the distance between these messages
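A minimal sketch of Δ(C, C’) = H(C|C’) + H(C’|C), computed from the joint distribution of the two community labels over the N nodes. It assumes each partition is given as a list of labels in the same node order and uses base-2 logs; the function name is illustrative.

```python
import math
from collections import Counter

def variation_of_information(c1, c2):
    """H(C|C') + H(C'|C) in bits for two label lists over the same nodes."""
    n = len(c1)
    joint = Counter(zip(c1, c2))
    count1 = Counter(c1)
    count2 = Counter(c2)
    vi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = count1[a] / n
        p_b = count2[b] / n
        # -p_ab * log(p_ab / p_b) is the H(C|C') term; -p_ab * log(p_ab / p_a) is H(C'|C).
        vi -= p_ab * (math.log2(p_ab / p_b) + math.log2(p_ab / p_a))
    return vi

# Identical partitions are at distance 0; these two are as far apart as 4 nodes allow.
vi_same = variation_of_information([0, 0, 1, 1], [0, 0, 1, 1])
vi_far = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

In the second example each label tells nothing about the other, so the distance is 2 bits, which equals log2(N) for N = 4.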
Summary: Clustering Objective Functions for Complex Networks
• Conductance
▫ J. Leskovec, K. J. Lang, M. W. Mahoney: Empirical comparison of algorithms for network community detection. WWW 2010: 631-640
• Modularity
▫ A. Clauset, M. E. J. Newman, C. Moore: Finding community structure in very large networks. Phys. Rev. E 70: 066111 (2004)
• Total encoding cost
▫ D. Chakrabarti, S. Papadimitriou, D. S. Modha, C. Faloutsos: Fully automatic cross-associations. KDD 2004: 79-88
• Maximum likelihood
▫ D. M. Blei, A. Y. Ng, M. I. Jordan: Latent Dirichlet Allocation. JMLR 3: 993-1022 (2003)
▫ E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. P. Xing: Mixed Membership Stochastic Blockmodels. JMLR 9: 1981-2014 (2008)
▫ K. Henderson, T. Eliassi-Rad, S. Papadimitriou, C. Faloutsos: HCDF: A Hybrid Community Discovery Framework. SDM 2010: 754-765
• Clique percolation
▫ G. Palla, I. Derenyi, I. Farkas, T. Vicsek: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435:814 (2005)
• Many more…