Source: DIMACSdmac.rutgers.edu/.../Slides/12.Chaintreau_TamingLongTail.pdf · 2014-04-11

Taming the long tail Identify Filtering in Social Media

Avner May, Nitish Korula, Silvio Lattanzi, A. Chaintreau

Columbia University, Google


Are social media sustainable?


From the trenches: no!

• Content producers - May I be missing my audience?

• Users' dilemma - May I be missing something?

From the faculty lounge: of course!

• Socializing is essential for information
  - To find out about jobs [Gr74], innovation [CKM57]
  “It pays to know / It hurts to be unaware.”

• When looking for good content, most of the time is wasted, but some gems are priceless
  - This process is more efficient collectively
  - And curating is at least informally rewarded

• In this talk, we focus on news dissemination


What is the role of intermediaries?


Understanding these intermediaries


2013: two interesting works

Twitter “precision”

40.5% average

- Encouraging!


[...] are necessary for a social and information network to ensure users have high precision and recall, and dissemination time is small? Can we empirically validate these conditions as well as the conclusion on existing networks?

We motivate this question with a preliminary empirical user study that attempts to directly measure relevance without resorting to a definition of user interests: we ask 10 active Twitter users to rate a set of 30 tweets as Relevant/Not Relevant. The users are students at Stanford University who log in at least once a week on average, follow at least 30 people, and receive at least 20 new tweets a week in their timeline. The set of 30 tweets is put together by choosing 15 tweets from the user’s timeline in the past 7 days, and 15 unique randomly selected tweets out of the set of all tweet impressions over the same 7 days.¹ The set of 30 tweets is then rendered in a random order as per usual tweet rendering guidelines [11]. The precision of each set of 15 tweets is then the fraction of tweets that the user thought were relevant.

The results of the experiment for each of the 10 users are shown in Figure 1. The average precision of users for tweets drawn from their timeline is 70%. On the other hand, the precision drops to around 7% for the set of random tweets shown to the users! Even though this is too small a user study to draw a definitive conclusion about the actual value of precision on Twitter, the results lend some credence to the hypothesis that social networks such as Twitter are much more precise than one would expect if users were seeing content at random. Note that since we showed (as control) each user 15 random tweets chosen from tweet impressions, and got a low relevance score for this control set, it does not appear that the inspection paradox² alone could be an adequate explanation of the high precision we see in this trial.

[Figure 1: Comparison of self-reported precision between tweets from a user’s timeline and tweets chosen at random. Plot title: User-rated precision of tweets; y-axis: Precision (0 to 1); x-axis: User number (1 to 10); series: Timeline, Random.]
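The precision measure used in the user study can be sketched in a few lines. This is only an illustration: the `precision` helper and the rating lists below are hypothetical and are not the study's data. It assumes precision for one user is the fraction of tweets in a set that the user rated Relevant, computed separately for the timeline set and the random control set.

```python
def precision(ratings):
    """Fraction of tweets rated Relevant; `ratings` is a list of booleans."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)


# Hypothetical ratings for a single user (NOT data from the study):
# 15 tweets drawn from the timeline and 15 random control tweets.
timeline_ratings = [True] * 10 + [False] * 5
random_ratings = [True] * 1 + [False] * 14

timeline_precision = precision(timeline_ratings)
random_precision = precision(random_ratings)
print(f"timeline: {timeline_precision:.2f}, random: {random_precision:.2f}")
```

Averaging these per-user values over all participants would give the kind of summary figures quoted in the excerpt.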

1.1 Necessary Conditions for Precision

In this paper, we first outline some necessary conditions for obtaining high precision. For each of these conditions, we state the hypothesis, validate it with data, and argue via modeling and analysis, why the hypothesis is necessary for obtaining high precision.

¹ We imposed two restrictions on the randomly selected tweets: the tweets must be in English (all the survey takers were English speakers), and the tweet must not be a reply (since a reply may not make sense outside of the full conversation, thus yielding artificially low precision).

² The inspection paradox is an analogue to the well-known friendship paradox [6]: high quality users have more followers and hence a random tweet impression is of higher quality than a random tweet.

Interest-based Networks.

Our first hypothesis is a natural one: Users on social and information networks have interests, and link to other users who share some or all of these interests. This assumption is folklore in how these networks are generated; several commonly used generative models of social networks indeed use this assumption [18, 17, 7]. We define (in Section 2) an analytic model capturing the essence of these generative models: There are a set of users V and a set of interests I. Each user u ∈ V has a set of interests C(u) that (s)he is interested in. For an interest i ∈ C(u), we term such users consumers for interest i. Each user connects to other users based on their interests, and this yields a graph G(V,E) on the users, which is the observed social network. This network could be directed (e.g., Twitter), where some users follow others and information flows along directed edges, or undirected (e.g., Facebook), where friendship is mutual, and information can flow in both directions along an edge.

In order to analyze precision in this model, we need to define which users sharing an interest i ∈ I produce content related to the interest. Let P(i) denote the set of users who act as producers. We show (in Section 3) that if, for all interests i, P(i) = C(i) (with C(i) the set of consumers for interest i), which means any consumer can be a potential producer, then it is only possible to construct networks with good precision in the trivial scenario where all users have the same interests.
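A toy instance can make the model concrete. The sketch below is an illustration under stated assumptions, not the paper's construction or proof: the user names, the linking rule (follow anyone who shares at least one interest), and the rule that each producer posts one item per interest it produces are all hypothetical. A user's precision is taken as the fraction of received items whose interest lies in C(u).

```python
def follow_by_shared_interest(C):
    """Toy linking rule (assumption): u follows v if they share an interest."""
    return {u: {v for v in C if v != u and C[u] & C[v]} for u in C}


def feed_precision(u, C, produces, follows):
    """Fraction of items reaching u whose interest is in C(u).

    Assumes each followed user posts exactly one item per interest
    listed in produces[v]. Returns None if u receives nothing.
    """
    received = [i for v in sorted(follows[u]) for i in sorted(produces[v])]
    if not received:
        return None
    relevant = sum(1 for i in received if i in C[u])
    return relevant / len(received)


# Hypothetical consumption interests C(u) for three users.
C = {
    "alice": {"jazz", "chess"},
    "bob": {"jazz", "cooking"},
    "carol": {"chess", "cycling"},
}
follows = follow_by_shared_interest(C)

# Case P(i) = C(i): every user produces on all of their interests.
same_precision = feed_precision("alice", C, C, follows)

# Narrower production interests (hypothetical choice).
narrow = {"alice": {"jazz"}, "bob": {"jazz"}, "carol": {"chess"}}
narrow_precision = feed_precision("alice", C, narrow, follows)

print(same_precision, narrow_precision)
```

In this small example, alice receives off-interest items from both neighbours when production equals consumption, and only on-interest items when production is narrower. It shows the intuition on one hand-picked network only.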

Production vs. Consumption.

This leads us to our second hypothesis: the production interests of a user are narrower than the consumption interests. In other words, P(i) ⊂ C(i). We validate this assumption on Twitter (described in Section 2). We define production as either tweeting or retweeting a tweet, and consumption as tweets containing a URL that a user clicks on. For simplicity, we refer to this as a click on a tweet. We show that the set of interests captured by clicks has larger entropy (per user) than the set capturing tweets or retweets. We note that both restricting attention only to tweets containing URLs, and requiring clicks as a measure of consumption interests are strict notions, which makes the empirical results stronger.
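The per-user entropy comparison can be sketched as follows. This is a minimal illustration, assuming each click or (re)tweet has already been mapped to a single interest label; how that mapping is done is not described here, and the labels and the `interest_entropy` helper are hypothetical.

```python
import math
from collections import Counter


def interest_entropy(interest_labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(interest_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical interest labels for one user (NOT data from the paper).
clicked = ["politics", "sports", "tech", "music"]  # consumption
posted = ["tech", "tech", "tech", "music"]         # production

click_entropy = interest_entropy(clicked)
post_entropy = interest_entropy(posted)
print(f"clicks: {click_entropy:.3f} bits, posts: {post_entropy:.3f} bits")
```

A user whose clicks spread evenly over more interests than their posts gets a higher click entropy, which is the direction of the effect the excerpt reports.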

We also show via analysis (in Section 3) that separation of production from consumption is still insufficient to explain high precision. In particular, we show that if users choose their production and consumption interests at random from any distribution over interests (subject to mild restrictions), it is not possible to achieve even constant precision. Our result is fairly robust to the empirically observed variability in the number of user interests, and the cardinality of the interests. In Appendix A, we show the same result when users themselves have varying number of interests, as in the affiliation network models [17, 7].

Structured Interests.

The above result makes a case for interests with structure: Users do not choose interests randomly, but rather, choose them in a correlated fashion. In other words, interests have a correlation structure, and users are more likely to choose from among correlated interests than from among uncorrelated interests. We verify this assumption by measuring [...]

Homogeneous or structured interests lead to efficient networks

Can we find evidence of filtering?


Data set             Source     Intermediaries (users)   URLs posted
NY Times Links       Twitter    330k                     33k
Bin Laden Death      Twitter    700k                     545k
Occupy Wall Street   Twitter    354k                     316k
Steve Jobs Death     Twitter    719k                     251k
iPhone 5 Launch      Twitter    81k                      37k
iPhone 5 Launch      Facebook   330k                     193k
All Spinn3r blogs    Spinn3r    68k                      441k
Obama                Spinn3r    13k                      85k
Facebook             Spinn3r    12k                      70k
Euro                 Spinn3r    10k                      53k
Mubarak              Spinn3r    7k                       43k

Looking for filtering


Evidence of information filtering

“Filtering law”

Not an artefact of
  - replacement
  - exposure

MORE ACTIVITY → LESS POPULAR CONTENT

Many open questions

• Can we find more evidence of precision?
  - Using clicks (Twitter data grant, more partners)
  - Does selectivity correlate with success?

• Current models somewhat at odds
  - Discrete topics + continuous popularity range
  - Are there more general models?

• Can crowd-curation be improved?
  - In principle (no friction etc.), already efficient.
  - With incentive? With new mechanism?


Thank you!


Back-Up Slides


Theoretical Results

Audience Strategy                Pure Strategy Equilibrium?   Price of Anarchy
Greedy                           No                           --
Satisficing                      Yes                          2
Satisficing w/blogger ability    Yes                          2


Filtering Law Consistent Across Data Sets

INACTIVE       < 2 / month    5%
ACTIVE         < 2 / day      35%
VERY ACTIVE    >= 2 / day     60%

MORE ACTIVE → LESS POPULAR CONTENT
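The activity classes on this slide amount to a simple thresholding rule on posting rate. The sketch below is one possible reading, not the authors' code: the `activity_level` function is hypothetical, and it assumes a month is 30 days and that the rate is measured as total posts over an observation window in days.

```python
def activity_level(posts, days):
    """Bucket an intermediary by posting rate using the slide's thresholds.

    Assumption: '2 / month' means 2 posts per 30 days.
    Integer arithmetic avoids floating-point trouble at the boundaries.
    """
    if days <= 0:
        raise ValueError("observation window must be positive")
    if posts >= 2 * days:        # at least 2 posts per day
        return "VERY ACTIVE"
    if posts * 30 < 2 * days:    # fewer than 2 posts per 30 days
        return "INACTIVE"
    return "ACTIVE"


# Hypothetical intermediaries observed over a 30-day window.
for posts in (1, 30, 90):
    print(posts, activity_level(posts, 30))
```

Grouping intermediaries this way and comparing the popularity of the content each group posts is the kind of analysis the slide's summary line refers to.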


INACTIVE       < 2 / month    5%
ACTIVE         < 2 / day      35%
VERY ACTIVE    >= 2 / day     60%

MORE ACTIVE → LESS POPULAR CONTENT

Simply explained by replacement effect?

NO!


In Summary…

• Previous work: Intermediaries play key role in information dissemination.

• We provided theoretical and empirical justification for intermediaries as information filters.

• Come see my poster!
  - Results not shown: Role of filtering on success of intermediary

MORE ACTIVE → LESS POPULAR CONTENT
