Large networks, clusters and Kronecker products
Jure Leskovec ([email protected])
Computer Science Department
Cornell University / Stanford University
Joint work with: Jon Kleinberg (Cornell), Christos Faloutsos (CMU), Michael Mahoney (Stanford), Kevin Lang (Yahoo), Anirban Dasgupta (Yahoo)
Rich data: Networks
Large on-line computing applications have detailed records of human activity:
  On-line communities: Facebook (120 million)
  Communication: Instant Messenger (~1 billion)
  News and social media: Blogging (250 million)
We model the data as a network (an interaction graph)
Can observe and study phenomena at scales not possible before
[Figure: communication network]
Small vs. Large networks
Community (cluster) structure of networks
[Figure: collaborations in NetSci (N=380)]
[Figure: tiny part of a large social network]
What is the structure of the network? How can we model that?
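Conductance (normalized cut):
\[ \Phi(S) = \frac{|\{(u,v) \in E : u \in S,\ v \in S'\}|}{\min(\mathrm{vol}(S), \mathrm{vol}(S'))} \]
where S’ is the complement of S and vol(S) is the sum of the degrees of the nodes in S.
How pronounced are communities? How community-like is a set of nodes?
Idea: Use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure.
Small Φ(S) == more community-like sets of nodes
[Figure: node set S and its complement S’]
[w/ Mahoney, Lang, Dasgupta, WWW ’08]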
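A minimal Python sketch of the conductance score, assuming the standard definition (edges cut between S and its complement, divided by the smaller of the two volumes). The toy edge list and the function name `conductance` are illustrative and not taken from the talk.

```python
def conductance(edges, S):
    """Conductance of node set S in an undirected graph given as an edge list."""
    S = set(S)
    cut = 0        # edges with exactly one endpoint in S
    vol_S = 0      # sum of degrees of nodes inside S
    vol_rest = 0   # sum of degrees of nodes outside S
    for u, v in edges:
        for x in (u, v):
            if x in S:
                vol_S += 1
            else:
                vol_rest += 1
        if (u in S) != (v in S):
            cut += 1
    denom = min(vol_S, vol_rest)
    return cut / denom if denom > 0 else float("inf")


# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
phi_triangle = conductance(edges, {0, 1, 2})  # one cut edge, volume 7
phi_single = conductance(edges, {0})          # every edge of node 0 is cut
```

On this toy graph a whole triangle scores much lower (more community-like) than a single node, which is the behaviour the score is meant to capture.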
Network Community Profile Plot
We define the network community profile (NCP) plot:
  Plot the score of the best community of size k
[Figure: example NCP plot; x-axis: community size, log k; y-axis: log Φ(k); example sets with k=5, Φ(5)=0.25 and k=7, Φ(7)=0.18]
[w/ Mahoney, Lang, Dasgupta, WWW ’08]
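The definition above can be sketched by brute force on a tiny graph: for each size k, take the minimum conductance over all node sets of that size. This enumeration is exponential and only meant as an illustration; the talk relies on approximation algorithms instead. The graph, the adjacency-dict representation, and the names `conductance` and `ncp` are assumptions made for this sketch.

```python
from itertools import combinations


def conductance(adj, S):
    """Conductance of node set S; adj maps each node to its set of neighbors."""
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_total = sum(len(nbrs) for nbrs in adj.values())
    denom = min(vol_S, vol_total - vol_S)
    return cut / denom if denom > 0 else float("inf")


def ncp(adj, max_k):
    """Brute-force NCP: best (lowest) conductance over all sets of each size k."""
    nodes = sorted(adj)
    return {
        k: min(conductance(adj, set(c)) for c in combinations(nodes, k))
        for k in range(1, max_k + 1)
    }


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
profile = ncp(adj, max_k=3)
```

Here the profile decreases with k until k=3, where a full triangle is the best community.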
NCP plot: Network Science
Collaborations between scientists in Networks [Newman, 2005]
[Figure: NCP plot; x-axis: community size, log k; y-axis: conductance, log Φ(k)]
[w/ Mahoney, Lang, Dasgupta, WWW ’08]
NCP plot: Large network
Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)
[w/ Mahoney, Lang, Dasgupta, WWW ’08]
More NCP plots of networks
[w/ Mahoney, Lang, Dasgupta, WWW ’08]
NCP: LiveJournal (n=5m, e=42m)
[Figure: NCP plot; x-axis: k (community size); y-axis: Φ(k) (conductance)]
  Up to ~100 nodes: better and better communities
  Best community has ~100 nodes
  Beyond ~100 nodes: communities get worse and worse
[w/ Mahoney, Lang, Dasgupta, WWW ’08]
Community size is bounded!
[Figure: each dot is a different network]
Practically constant!
[w/ Mahoney, Lang, Dasgupta, WWW ’08]
Structure of large networks
Core-periphery (jellyfish, octopus):
  Small good communities
  Denser and denser core of the network
  Core contains ~60% of the nodes and ~80% of the edges
So, what’s a good model?
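Kronecker product: Definition
Kronecker product of matrices A (N x M) and B (K x L) is the N*K x M*L matrix given by
\[ C = A \otimes B = \begin{pmatrix} a_{1,1}B & a_{1,2}B & \cdots & a_{1,M}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{N,1}B & a_{N,2}B & \cdots & a_{N,M}B \end{pmatrix} \]
We define a Kronecker product of two graphs as a Kronecker product of their adjacency matrices
[w/ Chakrabarti-Kleinberg-Faloutsos, PKDD ’05]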
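A small sketch of the definition using NumPy's `np.kron`; the two matrices are arbitrary examples chosen here only to show the block structure and the resulting dimensions.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])        # N x M = 2 x 2
B = np.array([[0, 1, 0],
              [1, 0, 1]])     # K x L = 2 x 3

C = np.kron(A, B)             # (N*K) x (M*L) = 4 x 6

# Block (i, j) of C is A[i, j] * B; check two of the blocks.
top_right = C[0:2, 3:6]       # corresponds to A[0, 1] = 2
bottom_left = C[2:4, 0:3]     # corresponds to A[1, 0] = 3
```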
Kronecker graphs
Kronecker graph: a growing sequence of graphs obtained by iterating the Kronecker product
Each Kronecker multiplication multiplies the number of nodes by the size of the initiator, so the graph grows exponentially with the number of iterations
One can easily use multiple initiator matrices (G1’, G1’’, G1’’’) that can be of different sizes
[w/ Chakrabarti-Kleinberg-Faloutsos, PKDD ’05]
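A short sketch of the growth, assuming a hypothetical 3-node initiator with self-loops; the helper name `kronecker_power` is illustrative. With N1 nodes and E1 nonzero entries in the initiator, the k-th power has N1^k nodes and E1^k nonzero entries.

```python
import numpy as np


def kronecker_power(G1, k):
    """Return the k-th Kronecker power of the initiator adjacency matrix G1."""
    G = G1.copy()
    for _ in range(k - 1):
        G = np.kron(G, G1)
    return G


# Hypothetical initiator: a 3-node path with self-loops (7 nonzero entries).
G1 = np.array([[1, 1, 0],
               [1, 1, 1],
               [0, 1, 1]])

# (number of nodes, number of nonzero adjacency entries) for k = 1, 2, 3
sizes = []
for k in (1, 2, 3):
    Gk = kronecker_power(G1, k)
    sizes.append((Gk.shape[0], int(Gk.sum())))
```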
Kronecker graphs
Kronecker graphs mimic real networks:
  Theorem: Power-law degree distribution, Densification, Shrinking/stabilizing diameter, Spectral properties
[Figure: initiator (3x3) and its Kronecker powers (9x9) and (27x27); entries pij are edge probabilities]
Starting intuition: Recursion & self-similarity
[w/ Chakrabarti, Kleinberg, Faloutsos, PKDD ’05]
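A naive sketch of how a graph could be sampled from edge probabilities pij: take the Kronecker power of a probability initiator and flip one independent coin per entry. The initiator values, the function name, and the seed are assumptions for illustration; this quadratic-time version yields a directed graph that may contain self-loops and is not meant as an efficient generator.

```python
import numpy as np


def sample_stochastic_kronecker(P1, k, seed=0):
    """Return (P, A): the k-th Kronecker power of P1 and one sampled 0/1 adjacency matrix."""
    P = P1.copy()
    for _ in range(k - 1):
        P = np.kron(P, P1)
    rng = np.random.default_rng(seed)
    # Entry (u, v) becomes an edge independently with probability P[u, v].
    A = (rng.random(P.shape) < P).astype(int)
    return P, A


# Hypothetical 3x3 probability initiator.
P1 = np.array([[0.9, 0.5, 0.2],
               [0.5, 0.7, 0.3],
               [0.2, 0.3, 0.4]])
P, A = sample_stochastic_kronecker(P1, k=3)
expected_edges = P.sum()   # equals P1.sum() ** 3
```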
Various Kronecker initiator matrices
Kronecker graphs: Interpretation
Initiator matrix G1 is a similarity matrix:
\[ G_1 = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \]
Node u is described with k binary attributes: u1, u2, …, uk
Probability of a link between nodes u, v:
\[ P(u,v) = \prod_{i=1}^{k} G_1[u_i, v_i] \]
Example: u = (0,1,1,0), v = (1,1,0,1)
  P(u,v) = b∙d∙c∙b
Given a real graph, how to estimate the initiator G1?
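The worked example can be checked numerically. The sketch below, with hypothetical values for a, b, c, d, computes P(u,v) from the attribute vectors and confirms that it equals the corresponding entry of the k-th Kronecker power of G1 when attribute vectors are read as binary node indices (first attribute as the most significant bit).

```python
import numpy as np


def link_probability(G1, u, v):
    """Product of G1[u_i, v_i] over the attribute positions i."""
    p = 1.0
    for ui, vi in zip(u, v):
        p *= G1[ui, vi]
    return p


def to_index(bits):
    """Read an attribute vector as a binary number, first attribute most significant."""
    return int("".join(str(b) for b in bits), 2)


a, b, c, d = 0.9, 0.5, 0.4, 0.1   # hypothetical initiator values
G1 = np.array([[a, b],
               [c, d]])
u = (0, 1, 1, 0)
v = (1, 1, 0, 1)

p_uv = link_probability(G1, u, v)   # b * d * c * b

# The same number appears as an entry of the 4th Kronecker power of G1.
Gk = G1.copy()
for _ in range(len(u) - 1):
    Gk = np.kron(Gk, G1)
p_from_power = Gk[to_index(u), to_index(v)]
```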
Estimating Kronecker graphs
Want to generate realistic networks:
  Given a real network G, generate a synthetic network
  Compare graph properties, e.g., degree distribution
How to estimate the initiator matrix
\[ G_1 = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \]
  Method of moments [Owen ’09]: compare counts of subgraphs and solve
  Maximum likelihood [Leskovec & Faloutsos, ’07]: \( \arg\max_{G_1} P(G \mid G_1) \)
  SVD [Van Loan & Pitsianis ’93]: \( \min_{G_1} \| G - G_1 \otimes G_1 \|_F^2 \), can solve using SVD
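A sketch of the SVD route, assuming the usual rearrangement trick for the nearest Kronecker product problem: reshape the matrix so that each block becomes one row, then take the best rank-1 approximation. This version finds two factors B and C minimizing the Frobenius norm of A − B ⊗ C without forcing B = C, so it illustrates the idea rather than the exact estimator on the slide. The function name and test matrix are illustrative.

```python
import numpy as np


def nearest_kronecker(A, shape_b, shape_c):
    """Find B, C minimizing ||A - kron(B, C)||_F via a rank-1 SVD of rearranged A."""
    m1, n1 = shape_b
    m2, n2 = shape_c
    # R[i*n1 + j, :] is the flattened (m2 x n2) block (i, j) of A.
    R = A.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    scale = np.sqrt(s[0])
    B = scale * U[:, 0].reshape(m1, n1)
    C = scale * Vt[0].reshape(m2, n2)
    return B, C


# Hypothetical check: recover the factors of an exact Kronecker square.
G1 = np.array([[0.9, 0.5],
               [0.5, 0.1]])
A = np.kron(G1, G1)
B, C = nearest_kronecker(A, (2, 2), (2, 2))
reconstruction = np.kron(B, C)
```

The singular vectors are only determined up to a common sign, so B and C may both come out negated; their Kronecker product is unaffected.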
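Kronecker & Network structure
What do estimated parameters tell us about the network structure?
\[ G_1 = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \]
[Figure: two groups of nodes; a edges within the first group, d edges within the second group, b and c edges between the groups]
[w/ Dasgupta-Lang-Mahoney, WWW ’08]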
Kronecker & Network structure
What do estimated parameters tell us about the network structure?
\[ G_1 = \begin{pmatrix} 0.9 & 0.5 \\ 0.5 & 0.1 \end{pmatrix} \]
[Figure: Core: 0.9 edges; Periphery: 0.1 edges; between core and periphery: 0.5 edges in each direction]
Core-periphery (jellyfish, octopus)
[w/ Dasgupta-Lang-Mahoney, WWW ’08]
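One way to see the core-periphery reading of this initiator is to look at expected degrees in its Kronecker power. The sketch below is an illustration added here, not a computation from the talk: it uses the fact that row sums of a Kronecker product are products of row sums, so a node with j attribute bits equal to 1 has expected out-degree 1.4^(k−j) · 0.6^j. The choice k = 4 is arbitrary.

```python
import numpy as np

G1 = np.array([[0.9, 0.5],
               [0.5, 0.1]])
k = 4   # arbitrary number of Kronecker iterations for this illustration

P = G1.copy()
for _ in range(k - 1):
    P = np.kron(P, G1)

# Expected out-degree of each node is the sum of its row of edge probabilities.
expected_degree = P.sum(axis=1)

# Group nodes by how many of their k attribute bits are 1.
ones = np.array([bin(u).count("1") for u in range(2 ** k)])
by_level = {j: float(expected_degree[ones == j].mean()) for j in range(k + 1)}
```

Nodes with mostly 0 bits act as the dense core, and the expected degree falls off steadily toward the all-ones periphery.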
Small vs. Large networks
Small and large networks are very different:
Collaboration network (N=4,158, E=13,422):
\[ G_1 = \begin{pmatrix} 0.99 & 0.54 \\ 0.49 & 0.13 \end{pmatrix} \]
Scientific collaborations (N=397, E=914):
\[ G_1 = \begin{pmatrix} 0.99 & 0.17 \\ 0.17 & 0.82 \end{pmatrix} \]
Conclusion
Computational tools as probes into the structure of large networks
Community structure of large networks:
  Core-periphery structure
  Scale to natural community size: Dunbar number
Model: Kronecker graphs
  Analytically tractable: provable properties
  Can efficiently estimate parameters from data
Implications:
  No large clusters: no/little hierarchical structure
  Can’t be well embedded: no underlying geometry
Reflections
Why are networks the way they are?
Only recently have basic properties been observed on a large scale
  Confirms social science intuitions; calls others into question
What are good tractable network models?
  Builds intuition and understanding
Benefits of working with large data
  Observe structures not visible at smaller scales
[email protected]
http://cs.stanford.edu/~jure
References
Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations, by J. Leskovec, J. Kleinberg, C. Faloutsos, KDD 2005
Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication, by J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, PKDD 2005
Scalable Modeling of Real Graphs using Kronecker Multiplication, by J. Leskovec, C. Faloutsos, ICML 2007
Statistical Properties of Community Structure in Large Social and Information Networks, by J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney, WWW 2008
Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters, by J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney, arXiv 2008