Estimating Sizes of Social Networks via Biased Sampling

Liran Katzir, Edo Liberty, and Oren Somekh

Yahoo! Labs, Haifa, Israel

International World Wide Web Conference, 28th March - 1st April 2011, Hyderabad, India

Yahoo! Labs: WWW’2011 1 / 20

Social Network size estimation

Goal:

Obtain estimates of the sizes of (sub)populations in a social network.

Why:

Advertisement - estimate of market share.

Business development - merger/acquisition or asset valuation.

The Problem

Difficulties:

Social networks have become very large:

Facebook (650,000,000)
Qzone (200,000,000)
Twitter (175,000,000)
...

No public API for population size queries:

What is the total number of registered users?
What is the number of registered (self-declared) 20–30 year olds living in New York?

Even if a public API is provided, an independent estimate is needed.

An exhaustive crawl is time/space/communication intensive and violates "politeness".

Population size estimation

Population sizes can be estimated efficiently using the “birthday paradox”.

The “birthday paradox”:

Given $r$ uniform samples from a set of $n$ elements, the expected number of collisions is $\frac{r(r-1)}{2n}$.

A collision is a pair of identical samples.

Example:

Samples: $X = (d, b, b, a, b, e)$. Total of 3 collisions: $(x_2, x_3)$, $(x_2, x_5)$, and $(x_3, x_5)$.
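
The collision count in the example can be reproduced with a few lines of Python. This is only a minimal sketch; the helper names `count_collisions` and `expected_collisions` are illustrative and do not come from the slides.

```python
from collections import Counter
from math import comb


def count_collisions(samples):
    """Count unordered pairs (i, j), i < j, with samples[i] == samples[j]."""
    # A value seen k times contributes k-choose-2 colliding pairs.
    return sum(comb(k, 2) for k in Counter(samples).values())


def expected_collisions(r, n):
    """Expected collisions among r uniform samples from n elements."""
    return r * (r - 1) / (2 * n)


samples = ["d", "b", "b", "a", "b", "e"]
print(count_collisions(samples))  # 3: the three pairs among the copies of "b"
```

Counting per-value multiplicities avoids comparing all pairs of samples, which matters once $r$ is in the thousands.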

Population size estimation

Using the birthday paradox inversely:

When observing $C$ collisions, the population can be estimated by

$n \approx \frac{r^2}{2C}$

If $r = \text{const} \cdot \sqrt{n}$, this gives a rather good estimator.

Similar to mark-and-recapture, which counts collisions between two sample sets (but is essentially equivalent).

Newer versions of mark-and-recapture also handle non-uniform but a priori known distributions [Chao, 1987].

Social network size estimation [Ye and Wu, 2010]

Alas, we cannot sample users uniformly from most social networks...
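
For the uniform case, the inverse birthday-paradox estimator described on this slide might be sketched as follows. The function name `estimate_size_uniform`, the population size, the sample count, and the seed are all assumptions chosen for illustration; the simulated population is simply the integers below `n`.

```python
import random
from collections import Counter
from math import comb


def estimate_size_uniform(samples):
    """Estimate n as r^2 / (2C); return None when no collision was seen."""
    r = len(samples)
    c = sum(comb(k, 2) for k in Counter(samples).values())
    return None if c == 0 else r * r / (2 * c)


# Illustrative simulation: uniform samples from a population of known size.
rng = random.Random(0)      # fixed seed so the run is repeatable
n = 10_000                  # hypothetical true population size
r = 1_000                   # 10 * sqrt(n) samples
samples = [rng.randrange(n) for _ in range(r)]
print(estimate_size_uniform(samples))
```

With no collisions the estimate is undefined, so the sketch returns `None` rather than dividing by zero.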

Uniform distribution on graphs

Social networks can be viewed as undirected graphs, which we can traverse using their public APIs.

Special random walks can generate close to uniform sampling:

1 Bipartite query-web page graph [Bharat and Broder, 1998], [Bar-Yossef and Gurevich, 2007].

2 Social network [Gjoka et al., 2010].

Uses only $r = \text{const} \cdot \sqrt{n}$ samples, but obtaining each sample might be hard.

Graph size estimation

It is possible to estimate the size of some graphs directly.

1 Estimate the size of a tree [Knuth, 1974].

2 Estimate the size of a directed acyclic graph [Pitt, 1987].

We give an estimator for the size of undirected graphs (and subgraphs) which:

1 Counts collisions but uses the graph's stationary distribution (does not require a uniform sample).

2 Requires asymptotically fewer than $\sqrt{n}$ samples to converge.

3 Obtains samples efficiently (provably small number of random walk steps).

Assumptions

The graph can be traversed from nodes to neighboring nodes.

We can perform a random walk on the graph:

Start at any node.

In each step, proceed to one of the neighbors uniformly at random.
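
The walk described above can be sketched as below. The adjacency-list dictionary `graph` is a hypothetical toy graph, not one from the slides; against a real social network the neighbor lookup would be a call to the network's public API, which is not shown here.

```python
import random


def random_walk(neighbors, start, steps, rng):
    """Walk `steps` steps from `start`, moving to a uniformly random neighbor."""
    node = start
    path = [node]
    for _ in range(steps):
        node = rng.choice(neighbors[node])  # uniform choice among neighbors
        path.append(node)
    return path


# Hypothetical undirected toy graph given as adjacency lists.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
path = random_walk(graph, "a", 10, random.Random(0))
print(path)
```

Passing the random generator in explicitly keeps runs repeatable, which is convenient when comparing estimators on the same walk.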

Facts about random walks

This random walk yields the stationary distribution.

1 The probability of getting the $i$'th node is $\frac{d_i}{D}$.

2 $d_i$ – the $i$'th node's degree.

3 $D = \sum_{i=1}^{n} d_i$.

Taking a few steps between samples, or running several walks, ensures independence between two consecutive samples.
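
That the degree-proportional distribution $d_i / D$ is stationary can be checked exactly on a small graph: one step of the walk must leave it unchanged. The sketch below does this with exact fractions on a hypothetical toy graph; the function names are illustrative. It only verifies stationarity. Convergence of the walk to this distribution additionally needs a connected, non-bipartite graph.

```python
from fractions import Fraction


def stationary_by_degree(neighbors):
    """Return the distribution pi(v) = d_v / D, with D the sum of all degrees."""
    total_degree = sum(len(nbrs) for nbrs in neighbors.values())
    return {v: Fraction(len(nbrs), total_degree) for v, nbrs in neighbors.items()}


def one_step(dist, neighbors):
    """Push a distribution through one step of the simple random walk."""
    out = {v: Fraction(0) for v in neighbors}
    for v, p in dist.items():
        for u in neighbors[v]:
            out[u] += p / len(neighbors[v])  # v moves to each neighbor w.p. 1/d_v
    return out


# Hypothetical undirected toy graph given as adjacency lists.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
pi = stationary_by_degree(graph)
print(one_step(pi, graph) == pi)  # True: pi is unchanged by one step
```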

Algorithm Outline

1 Sample $r$ users using a random walk.

2 $C$ – the number of collisions.

3 $\Psi_1$ – the sum of the sampled nodes' degrees.

4 $\Psi_{-1}$ – the sum of the inverses of the sampled nodes' degrees.

The estimated number of nodes:

$\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C}$.
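
The outline above translates fairly directly into code. The following is a minimal sketch, assuming the samples have already been drawn and that a `degree` lookup (node id to degree) is available; both the function name `estimate_size` and that lookup are illustrative, not taken from the slides.

```python
from collections import Counter
from math import comb, isclose


def estimate_size(sampled_nodes, degree):
    """Return Psi_1 * Psi_{-1} / (2C), or None when no collision was seen."""
    c = sum(comb(k, 2) for k in Counter(sampled_nodes).values())
    if c == 0:
        return None
    psi_1 = sum(degree[v] for v in sampled_nodes)            # sum of degrees
    psi_minus_1 = sum(1 / degree[v] for v in sampled_nodes)  # sum of 1/degree
    return psi_1 * psi_minus_1 / (2 * c)


# Sanity check on hypothetical data: when every degree is equal,
# Psi_1 * Psi_{-1} = r^2, so the estimate reduces to r^2 / (2C).
nodes = [1, 2, 2, 3]
degree = {1: 3, 2: 3, 3: 3}
print(isclose(estimate_size(nodes, degree), len(nodes) ** 2 / 2))  # True
```

The regular-graph check shows how this estimator generalizes the uniform one: the two degree sums correct for the bias of sampling nodes in proportion to their degree.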

Example

Input social network graph: [figure not reproduced]

Sampling process:

Sample                 1     2     3      4      5      6
Sampled nodes          d     f     f      c      c      d
Sampled node degree    3     2     2      4      4      3
C                      0     0     1      1      2      3
Ψ1                     3     5     7      11     15     18
Ψ−1                    1/3   5/6   16/12  19/12  22/12  26/12
n̂                      –     –     4      8      6      6
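
The running values in this example can be recomputed step by step with exact fractions. This is a sketch for checking the arithmetic; the function name `running_estimates` is illustrative. The exact estimates come out as 14/3, 209/24, 55/8, and 13/2, which agree with the slide's last row if that row is truncated to whole numbers.

```python
from fractions import Fraction
from math import floor


def running_estimates(nodes, degrees):
    """Return (C, Psi_1, Psi_{-1}, n_hat) after each sample; n_hat is None if C == 0."""
    seen = {}
    c, psi_1, psi_m1 = 0, 0, Fraction(0)
    rows = []
    for v, d in zip(nodes, degrees):
        c += seen.get(v, 0)            # new sample collides with every earlier copy
        seen[v] = seen.get(v, 0) + 1
        psi_1 += d
        psi_m1 += Fraction(1, d)
        n_hat = psi_1 * psi_m1 / (2 * c) if c else None
        rows.append((c, psi_1, psi_m1, n_hat))
    return rows


rows = running_estimates(["d", "f", "f", "c", "c", "d"], [3, 2, 2, 4, 4, 3])
for c, psi_1, psi_m1, n_hat in rows:
    print(c, psi_1, psi_m1, None if n_hat is None else floor(n_hat))
```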

Proof Intuition

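Notation:

$n$ – the graph size, $r$ – the number of samples

$d_i$ – the degree of node $i$, $D = \sum_{i=1}^{n} d_i$

Expectations:

$E[\Psi_1] = r D \sum_{i=1}^{n} \left(\frac{d_i}{D}\right)^2$, $E[\Psi_{-1}] = \frac{r n}{D}$

$E[C] = \binom{r}{2} \sum_{i=1}^{n} \left(\frac{d_i}{D}\right)^2$

$\frac{E[\Psi_1]\, E[\Psi_{-1}]}{2 E[C]} = n \frac{r}{r-1} \approx n$

$\hat{n} = \frac{\Psi_1 \Psi_{-1}}{2C} \approx \frac{E[\Psi_1]\, E[\Psi_{-1}]}{2 E[C]} \approx n$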
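
The ratio of expectations can be verified exactly for any degree sequence, because the sum $\sum_i (d_i / D)^2$ cancels between numerator and denominator. The sketch below plugs a hypothetical degree sequence into the three expectation formulas and checks that the ratio equals $n \frac{r}{r-1}$; the function name and the example values are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb


def expectation_ratio(degrees, r):
    """Return E[Psi_1] * E[Psi_{-1}] / (2 E[C]) computed with exact fractions."""
    n = len(degrees)
    total = sum(degrees)                                  # D
    s2 = sum(Fraction(d, total) ** 2 for d in degrees)    # sum of (d_i / D)^2
    e_psi_1 = r * total * s2
    e_psi_minus_1 = Fraction(r * n, total)
    e_c = comb(r, 2) * s2
    return e_psi_1 * e_psi_minus_1 / (2 * e_c)


degrees = [3, 2, 4, 1, 2]   # hypothetical degree sequence, n = 5
r = 10
print(expectation_ratio(degrees, r))  # 50/9, i.e. n * r / (r - 1)
```

The factor $\frac{r}{r-1}$ tends to 1 as the number of samples grows, which is why the slide treats the ratio as approximately $n$.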

Analytic Results

Main statement:

Using $r(n, \varepsilon, \delta)$ samples: $\Pr[n(1-\varepsilon) \le \hat{n} \le n(1+\varepsilon)] \ge 1-\delta$

Uniform vs. biased:

Sampling method                                   Number of samples
Any graph, uniform                                $O(\sqrt{n})$
Synthetic graph, Zipfian degree distribution      $O(\sqrt[4]{n} \log n)$
  ($\alpha = 2$, $d_m = \sqrt{n}$), random walk

Example – $n = 10^9$:

$\sqrt{n} \approx 30{,}000$.

$\sqrt[4]{n} \log n \approx 6{,}000$.

Setup

Networks of known sizes:

Network      Size         Edges
Synthetic    1,000,000    Zipfian, $\alpha = 2$, $d_m = 1000$
DBLP         845,211      co-authorship
IMDB         1,955,508    co-casting

A Synthetic Network, Degree Zipfian $\alpha = 2$, $d_m = 1000$

[Figure: Synthetic network, confidence interval. X-axis: number of samples (percentage of network size). Y-axis: size estimation (relative to network size). Legend: Unif. dist. - non-unique 95%; Deg. dist. - non-unique 95%; Deg. dist. - non-unique 5%; Unif. dist. - non-unique 5%.]

DBLP - The Digital Bibliography and Library Project

[Figure: DBLP network, confidence interval. X-axis: number of samples (percentage of network size). Y-axis: size estimation (relative to network size). Legend: Unif. dist. - non-unique 95%; Deg. dist. - non-unique 95%; Deg. dist. - non-unique 5%; Unif. dist. - non-unique 5%.]

IMDB - The Internet Movie Database

[Figure: IMDB, confidence interval. X-axis: number of samples (percentage of network size). Y-axis: size estimation (relative to network size). Legend: Unif. dist. - non-unique 95%; Deg. dist. - non-unique 95%; Deg. dist. - non-unique 5%; Unif. dist. - non-unique 5%.]

Facebook

Date                   April 2009                   October 2010
Sampling method        uniform                      random walk
Number of samples      $0.98 \cdot 10^6$            $1 \cdot 10^6$
Collision estimator    $237 \cdot 10^6$             $475 \cdot 10^6$
Facebook report        $(200\text{–}250) \cdot 10^6$    $500 \cdot 10^6$

Thanks to Minas Gjoka for the Facebook crawls.

Conclusions

An efficient algorithm for estimating the size of a social network using its public API was presented.

Its effectiveness was demonstrated on synthetic and real-world networks.

The algorithm outperforms prior-art methods by using biased sampling.

The algorithm also applies to sub-populations.

Thanks!
