Estimating Sizes of Social Networks via Biased Sampling
Liran Katzir, Edo Liberty, and Oren Somekh
Yahoo! Labs, Haifa, Israel
International World Wide Web Conference, 28th March - 1st April 2011, Hyderabad, India
Yahoo! Labs: WWW’2011 1 / 20
Social Network size estimation
Goal:
Obtaining estimates for the sizes of (sub)populations in a social network.
Why:
Advertisement - estimate of market share.
Business development - merger/acquisition or asset valuation.
The Problem
Difficulties:
Social networks have become pretty big:
Facebook (650,000,000)
Qzone (200,000,000)
Twitter (175,000,000)
...
No public API for population size queries:
What is the total number of registered users?
What is the number of registered (self-declared) 20–30 year olds living in New York?
Even if a public API is provided, an independent estimate is needed.
An exhaustive crawl is time/space/communication intensive and violates “politeness”.
Population size estimation
Population sizes can be estimated efficiently using the “birthday paradox”.
The “birthday paradox”:
Given r uniform samples from a set of n elements, the expected number of collisions is $\frac{r(r-1)}{2n}$.
A collision is a pair of identical samples.
Example:
Samples: X = (d, b, b, a, b, e).
Total: 3 collisions, (x2, x3), (x2, x5), and (x3, x5).
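The collision count in this example can be checked with a short Python sketch. The helper name `count_collisions` is an assumption made for illustration; it counts unordered pairs of equal samples by summing k-choose-2 over the multiplicity k of each distinct value.

```python
from collections import Counter
from math import comb


def count_collisions(samples):
    """Count unordered pairs (i, j), i < j, with samples[i] == samples[j]."""
    # A value that appears k times contributes k * (k - 1) / 2 colliding pairs.
    return sum(comb(k, 2) for k in Counter(samples).values())


X = ["d", "b", "b", "a", "b", "e"]
print(count_collisions(X))  # 3: the three pairs among the occurrences of "b"
```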
Population size estimation
Using the birthday paradox inversely:
When C collisions are observed, the population size can be estimated by
$n \approx \frac{r^2}{2C}$
If r = const · √n, this gives a rather good estimator.
Similar to mark-and-recapture, which counts collisions between two sample sets (the two approaches are essentially equivalent).
Newer versions of mark-and-recapture also handle non-uniform but a priori known distributions [Chao, 1987].
Social network size estimation [Ye and Wu, 2010]
Alas, we cannot sample users uniformly from most social networks...
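For the uniform case, the inverse birthday-paradox estimate n ≈ r²/(2C) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `estimate_size_uniform`, the simulated population of 10,000 elements, and the choice to return `None` when no collision is observed are all assumptions.

```python
import random
from collections import Counter
from math import comb


def estimate_size_uniform(samples):
    """Estimate the population size as r^2 / (2C) from uniform samples."""
    r = len(samples)
    collisions = sum(comb(k, 2) for k in Counter(samples).values())
    if collisions == 0:
        return None  # no collisions yet: the estimate is undefined
    return r * r / (2 * collisions)


# Simulated uniform sampling from a hypothetical population of n = 10,000.
rng = random.Random(0)
n = 10_000
samples = [rng.randrange(n) for _ in range(1_000)]
print(estimate_size_uniform(samples))
```

With r on the order of √n the expected number of collisions is a constant, so in practice r is taken a constant factor larger to obtain enough collisions for a stable estimate.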
Uniform distribution on graphs
Social networks can be viewed as undirected graphs, which we can traverse using their public APIs.
Special random walks can generate close to uniform sampling:
1 Bipartite Query-Web page graph [Bharat and Broder, 1998], [Bar-Yossef and Gurevich, 2007].
2 Social network [Gjoka et al., 2010].
Uses only r = const · √n samples, but obtaining each sample might be hard.
Graph size estimation
It is possible to estimate the size of some graphs directly.
1 Estimate the size of a tree [Knuth, 1974].
2 Estimate the size of a directed acyclic graph [Pitt, 1987].
We give an estimator for the size of undirected graphs (and subgraphs) which:
1 Counts collisions but uses the graph’s stationary distribution (does not require a uniform sample).
2 Requires asymptotically fewer than √n samples to converge.
3 Obtains samples efficiently (provably small number of random walk steps).
Assumptions
The graph can be traversed from nodes to neighboring nodes.
We can perform a random walk on the graph:
Start at any node.
In each step, proceed to one of the neighbors uniformly at random.
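A minimal sketch of such a walk, assuming the graph is available as an adjacency dictionary (in a real setting each neighbor lookup would be a call to the network's public API). The function name `random_walk` and the toy graph are hypothetical.

```python
import random


def random_walk(adj, start, steps, rng=None):
    """Simple random walk: at each step move to a uniformly chosen neighbor."""
    if rng is None:
        rng = random.Random()
    node = start
    path = [node]
    for _ in range(steps):
        node = rng.choice(adj[node])  # uniform over the current node's neighbors
        path.append(node)
    return path


# Hypothetical undirected toy graph, stored as symmetric adjacency lists.
adj = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
walk = random_walk(adj, "a", 10, random.Random(1))
print(walk)
```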
Facts about random walks
This random walk yields the stationary distribution.
1 The probability of sampling the i-th node is $d_i / D$.
2 $d_i$ – the degree of the i-th node.
3 $D = \sum_{i=1}^{n} d_i$.
Taking a few steps between samples (or running several walks) ensures independence between two consecutive samples.
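The degree-proportional form of the stationary distribution can be verified in one line. The symbols $\pi$, $P$, and $E$ are introduced here for illustration: $\pi_i = d_i / D$ is the claimed distribution, $P_{ij} = 1/d_i$ is the transition probability when $\{i, j\}$ is an edge (and 0 otherwise), and $E$ is the edge set of a simple undirected graph. Convergence of the walk to $\pi$ additionally assumes the graph is connected and not bipartite.

```latex
\begin{align*}
\sum_{i} \pi_i P_{ij}
  &= \sum_{i \,:\, \{i,j\} \in E} \frac{d_i}{D} \cdot \frac{1}{d_i}
   = \sum_{i \,:\, \{i,j\} \in E} \frac{1}{D}
   = \frac{d_j}{D}
   = \pi_j .
\end{align*}
```

The last sum has exactly $d_j$ terms, one per neighbor of $j$, which is why the degree reappears.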
Algorithm Outline
1 Sample r users using random walk.
2 C – the number of collisions.
3 $\Psi_1$ – the sum of the sampled nodes’ degrees.
4 $\Psi_{-1}$ – the sum of the inverses of the sampled nodes’ degrees.
The estimated number of nodes:
$\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C}$.
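The outline above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `estimate_graph_size`, the node identifiers, and the degree table are hypothetical, and the samples are assumed to have already been drawn by a random walk.

```python
from collections import Counter
from math import comb


def estimate_graph_size(sampled_nodes, degree):
    """Estimate n as Psi_1 * Psi_{-1} / (2C) from degree-biased samples."""
    psi_1 = sum(degree[v] for v in sampled_nodes)            # sum of degrees
    psi_minus_1 = sum(1 / degree[v] for v in sampled_nodes)  # sum of inverse degrees
    collisions = sum(comb(k, 2) for k in Counter(sampled_nodes).values())
    if collisions == 0:
        return None  # the estimator is undefined until a collision occurs
    return psi_1 * psi_minus_1 / (2 * collisions)


# Hypothetical samples and degrees: "u1" is sampled twice, giving one collision.
degree = {"u1": 4, "u2": 2, "u3": 4}
samples = ["u1", "u2", "u1", "u3"]
print(estimate_graph_size(samples, degree))  # 14 * 1.25 / 2 = 8.75
```

Only the identity and degree of each sampled node are needed, both of which a neighbor-list query typically exposes.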
Example
Sampling process:
Sampled nodes:         d     f     f      c      c      d
Sampled node degree:   3     2     2      4      4      3
C:                     0     0     1      1      2      3
$\Psi_1$:              3     5     7      11     15     18
$\Psi_{-1}$:           1/3   5/6   16/12  19/12  22/12  26/12
$\hat{n}$:             –     –     4      8      6      6
[Figure: input social network graph]
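The rows of this example can be reproduced step by step with exact fractions. The sketch below is an assumption-laden illustration (the function name `running_estimates` is hypothetical): each new sample adds to C the number of earlier occurrences of the same node. The estimate row shown in the slide appears to be the integer part of the exact value.

```python
from collections import Counter
from fractions import Fraction
from math import floor


def running_estimates(nodes, degrees):
    """Return (C, Psi_1, Psi_{-1}, estimate) after each sample."""
    seen = Counter()
    collisions = 0
    psi_1 = 0
    psi_minus_1 = Fraction(0)
    rows = []
    for node, deg in zip(nodes, degrees):
        collisions += seen[node]  # the new sample collides with every earlier copy
        seen[node] += 1
        psi_1 += deg
        psi_minus_1 += Fraction(1, deg)
        estimate = psi_1 * psi_minus_1 / (2 * collisions) if collisions else None
        rows.append((collisions, psi_1, psi_minus_1, estimate))
    return rows


rows = running_estimates("dffccd", [3, 2, 2, 4, 4, 3])
for c, p1, pm1, est in rows:
    # Print the integer part of the estimate, or "-" while it is undefined.
    print(c, p1, pm1, "-" if est is None else floor(est))
```

The exact estimates after the third to sixth samples are 14/3, 209/24, 55/8, and 13/2.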
Proof Intuition
Notations:
n – the graph size, r – number of samples
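$d_i$ – the degree of node $i$, $D = \sum_{i=1}^{n} d_i$
Expectations:
$E[\Psi_1] = rD \sum_{i=1}^{n} \left(\frac{d_i}{D}\right)^2$, $E[\Psi_{-1}] = \frac{rn}{D}$
$E[C] = \binom{r}{2} \sum_{i=1}^{n} \left(\frac{d_i}{D}\right)^2$.
$\frac{E[\Psi_1] E[\Psi_{-1}]}{2E[C]} = n \frac{r}{r-1} \approx n$.
$\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C} \approx \frac{E[\Psi_1] E[\Psi_{-1}]}{2E[C]} \approx n$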
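The identity between the three expectations can be checked numerically with exact arithmetic. The sketch below plugs a hypothetical degree sequence into the expectation formulas and confirms that the ratio equals n · r / (r − 1); the function name `expected_ratio` and the degree values are assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def expected_ratio(degrees, r):
    """Evaluate E[Psi_1] * E[Psi_{-1}] / (2 * E[C]) for r stationary samples."""
    n = len(degrees)
    D = sum(degrees)
    s = sum(Fraction(d, D) ** 2 for d in degrees)  # sum of squared probabilities
    e_psi_1 = r * D * s
    e_psi_minus_1 = Fraction(r * n, D)
    e_c = comb(r, 2) * s
    return e_psi_1 * e_psi_minus_1 / (2 * e_c)


degrees = [3, 2, 4, 1, 2]  # hypothetical degree sequence, n = 5
print(expected_ratio(degrees, r=10))  # 50/9, i.e. n * r / (r - 1)
```

The degree-dependent sum cancels between the numerator and the denominator, which is why the ratio depends only on n and r.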
Analytic Results
Main statement:
Using r(n, ε, δ) samples: Pr[n(1− ε) ≤ n̂ ≤ n(1 + ε)] ≥ 1− δ
Uniform vs Biased:
Sampling method                                  Number of samples
Any graph, uniform                               $O(\sqrt{n})$
Synthetic graph, Zipfian degree distribution     $O(\sqrt[4]{n} \log n)$
($\alpha = 2$, $d_m = \sqrt{n}$), random walk
Example – $n = 10^9$:
$\sqrt{n} \approx 30{,}000$.
$\sqrt[4]{n} \log n \approx 6{,}000$.
Setup
Networks of known sizes:
Network     Size        Edges
Synthetic   1,000,000   Zipfian α = 2, dm = 1000
DBLP        845,211     co-authorship
IMDB        1,955,508   co-casting
A Synthetic Network, Degree Zipfian α = 2, dm = 1000
[Figure: Synthetic network, confidence interval. x-axis: number of samples (percentage of network size); y-axis: size estimation (relative to network size). Curves: Unif. dist., non-unique, 95%; Deg. dist., non-unique, 95%; Deg. dist., non-unique, 5%; Unif. dist., non-unique, 5%.]
DBLP - The Digital Bibliography and Library Project
[Figure: DBLP network, confidence interval. x-axis: number of samples (percentage of network size); y-axis: size estimation (relative to network size). Curves: Unif. dist., non-unique, 95%; Deg. dist., non-unique, 95%; Deg. dist., non-unique, 5%; Unif. dist., non-unique, 5%.]
IMDB - The Internet Movie Database
[Figure: IMDB, confidence interval. x-axis: number of samples (percentage of network size); y-axis: size estimation (relative to network size). Curves: Unif. dist., non-unique, 95%; Deg. dist., non-unique, 95%; Deg. dist., non-unique, 5%; Unif. dist., non-unique, 5%.]
Date                  April 2009          October 2010
Sampling method       uniform             random walk
Number of samples     0.98 · 10^6         1 · 10^6
Collision estimator   237 · 10^6          475 · 10^6
Facebook report       200–250 · 10^6      500 · 10^6
Thanks to Minas Gjoka for the Facebook crawls.
Conclusions
An efficient algorithm for estimating the size of a social network using its public API was presented.
Its effectiveness was demonstrated on synthetic and real-world networks.
The algorithm outperforms prior-art methods by using biased sampling.
The algorithm also applies to sub-populations.
Thanks!