Taming the long tail Identify Filtering in Social Media
Avner May, Nitish Korula, Silvio Lattanzi, A. Chaintreau
Columbia University, Google
From the trenches: no!
• Content producers
  - May I be missing my audience?
• Users' dilemma
  - May I be missing something?
From the faculty lounge: of course!
• Socializing is essential for information
  - To find out about jobs [Gr74], innovation [CKM57]
  “It pays to know / It hurts to be unaware.”
• When looking for good content, most of the time is wasted, but some gems are priceless
  - This process is more efficient collectively
  - And curating is at least informally rewarded
• In this talk, we focus on news dissemination
2013: two interesting works
Twitter “precision”
40.5% average
- Encouraging!
are necessary for a social and information network to ensure users have high precision and recall, and dissemination time is small? Can we empirically validate these conditions as well as the conclusion on existing networks?
We motivate this question with a preliminary empirical user study that attempts to directly measure relevance without resorting to a definition of user interests: we ask 10 active Twitter users to rate a set of 30 tweets as Relevant/Not Relevant. The users are students at Stanford University who log in at least once a week on average, follow at least 30 people, and receive at least 20 new tweets a week in their timeline. The set of 30 tweets is put together by choosing 15 tweets from the user’s timeline in the past 7 days, and 15 unique randomly selected tweets out of the set of all tweet impressions over the same 7 days¹. The set of 30 tweets is then rendered in a random order as per usual tweet rendering guidelines [11]. The precision of each set of 15 tweets is then the fraction of tweets in that set that the user thought were relevant.
The results of the experiment for each of the 10 users are shown in Figure 1. The average precision of users for tweets drawn from their timeline is 70%. On the other hand, the precision drops to around 7% for the set of random tweets shown to the users! Even though this is too small a user study to draw a definitive conclusion about the actual value of precision on Twitter, the results lend some credence to the hypothesis that social networks such as Twitter are much more precise than one would expect if users were seeing content at random. Note that since we showed (as control) each user 15 random tweets chosen from tweet impressions, and got a low relevance score for this control set, it does not appear that the inspection paradox² alone could be an adequate explanation of the high precision we see in this trial.
[Figure 1: Comparison of self-reported precision between tweets from a user’s timeline and tweets chosen at random. Plot title: user-rated precision of tweets; x-axis: user number (1 to 10); y-axis: precision (0 to 1); series: Timeline, Random.]
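The precision measure used in the user study can be sketched in a few lines. This is a minimal illustration, assuming each rating is stored as a boolean (True for Relevant); the function name `precision` and the rating lists are hypothetical and are not data from the study.

```python
from typing import Sequence


def precision(ratings: Sequence[bool]) -> float:
    """Fraction of shown tweets that the user marked Relevant."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)


# Hypothetical ratings for a single user (True = Relevant).
# These values are made up for illustration only.
timeline_ratings = [True] * 11 + [False] * 4   # 15 tweets from the timeline
random_ratings = [True] * 1 + [False] * 14     # 15 random tweet impressions

timeline_precision = precision(timeline_ratings)
random_precision = precision(random_ratings)
print(f"timeline: {timeline_precision:.2f}, random: {random_precision:.2f}")
```

Averaging `timeline_precision` and `random_precision` over the 10 participants would give the two per-condition averages discussed above.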
1.1 Necessary Conditions for Precision
In this paper, we first outline some necessary conditions for obtaining high precision. For each of these conditions, we state the hypothesis, validate it with data, and argue via modeling and analysis why the hypothesis is necessary for obtaining high precision.

¹ We imposed two restrictions on the randomly selected tweets: the tweets must be in English (all the survey takers were English speakers), and the tweet must not be a reply (since a reply may not make sense outside of the full conversation, thus yielding artificially low precision).
² The inspection paradox is an analogue to the well-known friendship paradox [6]: high quality users have more followers and hence a random tweet impression is of higher quality than a random tweet.
Interest-based Networks.
Our first hypothesis is a natural one: users on social and information networks have interests, and link to other users who share some or all of these interests. This assumption is folklore in how these networks are generated; several commonly used generative models of social networks indeed use this assumption [18, 17, 7]. We define (in Section 2) an analytic model capturing the essence of these generative models: there is a set of users V and a set of interests I. Each user u ∈ V has a set of interests C(u) that (s)he is interested in. We term the users interested in an interest i the consumers for interest i, written C(i). Each user connects to other users based on their interests, and this yields a graph G(V,E) on the users, which is the observed social network. This network could be directed (e.g., Twitter), where some users follow others and information flows along directed edges, or undirected (e.g., Facebook), where friendship is mutual, and information can flow in both directions along an edge.
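The objects in this model can be written down concretely. The sketch below is a toy illustration only: the user names, interests, and the linking rule (connect two users when they share at least one interest) are assumptions made for the example, not the generative model defined in Section 2.

```python
from collections import defaultdict

# Hypothetical toy instance: user u -> set of interests C(u).
C_user = {
    "alice": {"politics", "tech"},
    "bob": {"tech"},
    "carol": {"sports"},
}


def consumers_by_interest(c_user):
    """Invert C(u): for each interest i, the set of users interested in i."""
    c_interest = defaultdict(set)
    for user, interests in c_user.items():
        for interest in interests:
            c_interest[interest].add(user)
    return dict(c_interest)


def shared_interest_edges(c_user):
    """Undirected edges between users sharing at least one interest.

    This linking rule is an assumption chosen for illustration.
    """
    users = sorted(c_user)
    return {
        (u, v)
        for idx, u in enumerate(users)
        for v in users[idx + 1:]
        if c_user[u] & c_user[v]
    }


C_interest = consumers_by_interest(C_user)  # consumers for each interest
E = shared_interest_edges(C_user)           # edge set of G(V, E)
print(C_interest["tech"], E)
```

A directed network such as Twitter would store ordered (follower, followee) pairs instead of the unordered pairs used here.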
In order to analyze precision in this model, we need to define which users sharing an interest i ∈ I produce content related to the interest. Let P(i) denote the set of users who act as producers. We show (in Section 3) that if for all interests i, P(i) = C(i), which means any consumer can be a potential producer, then it is only possible to construct networks with good precision in the trivial scenario where all users have the same interests.
Production vs. Consumption.
This leads us to our second hypothesis: the production interests of a user are narrower than the consumption interests. In other words, P(i) ⊂ C(i). We validate this assumption on Twitter (described in Section 2). We define production as either tweeting or retweeting a tweet, and consumption as clicking on a URL contained in a tweet. For simplicity, we refer to this as a click on a tweet. We show that the set of interests captured by clicks has larger entropy (per user) than the set capturing tweets or retweets. We note that both restricting attention only to tweets containing URLs, and requiring clicks as a measure of consumption interests, are strict notions, which makes the empirical results stronger.
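The per-user entropy comparison can be illustrated as follows. This is a sketch under stated assumptions: it uses Shannon entropy of the empirical distribution of interest labels (the excerpt does not specify the exact estimator), and the label lists are hypothetical.

```python
import math
from collections import Counter


def interest_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Sum of p * log2(1/p) over the observed interest labels.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical interest labels for one user; not data from the paper.
produced = ["tech", "tech", "tech", "tech"]        # tweets and retweets
clicked = ["tech", "politics", "sports", "tech"]   # clicked URLs

produced_entropy = interest_entropy(produced)
clicked_entropy = interest_entropy(clicked)
print(produced_entropy, clicked_entropy)
```

In this toy case the clicked interests are spread over more topics than the produced ones, so their entropy is higher, which is the pattern the hypothesis predicts.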
We also show via analysis (in Section 3) that separation of production from consumption is still insufficient to explain high precision. In particular, we show that if users choose their production and consumption interests at random from any distribution over interests (subject to mild restrictions), it is not possible to achieve even constant precision. Our result is fairly robust to the empirically observed variability in the number of user interests, and the cardinality of the interests. In Appendix A, we show the same result when users themselves have a varying number of interests, as in the affiliation network models [17, 7].
Structured Interests.
The above result makes a case for interests with structure: users do not choose interests randomly, but rather choose them in a correlated fashion. In other words, interests have a correlation structure, and users are more likely to choose from among correlated interests than from among uncorrelated interests. We verify this assumption by measuring
Homogeneous or structured interests lead to efficient networks
Data Set              Source     Users (intermediaries)   URLs posted
NY Times Links        Twitter    330k                     33k
Bin Laden Death       Twitter    700k                     545k
Occupy Wall Street    Twitter    354k                     316k
Steve Jobs Death      Twitter    719k                     251k
iPhone 5 Launch       Twitter    81k                      37k
iPhone 5 Launch       Facebook   330k                     193k
All Spinn3r blogs     Spinn3r    68k                      441k
Obama                 Spinn3r    13k                      85k
Facebook              Spinn3r    12k                      70k
Euro                  Spinn3r    10k                      53k
Mubarak               Spinn3r    7k                       43k
Looking for filtering
Evidence of information filtering
“Filtering law”
Not an artefact of
  - replacement
  - exposure
MORE ACTIVITY → LESS POPULAR CONTENT
Many open questions
• Can we find more evidence of precision?
  - Using clicks (Twitter data grant, more partners)
  - Does selectivity correlate with success?
• Current models somewhat at odds
  - Discrete topics + continuous popularity range
  - Are there more general models?
• Can crowd-curation be improved?
  - In principle (no friction etc.), already efficient.
  - With incentive? With new mechanism?
Theoretical Results
Audience Strategy                Pure Strategy Equilibrium?   Price of Anarchy
Greedy                           No                           --
Satisficing                      Yes                          2
Satisficing w/blogger ability    Yes                          2
INACTIVE (< 2 / month): 5%
ACTIVE (< 2 / day): 35%
VERY ACTIVE (>= 2 / day): 60%
MORE ACTIVE → LESS POPULAR CONTENT
Simply explained by replacement effect?
NO!
In Summary…
• Previous work: Intermediaries play key role in information dissemination.
• We provided theoretical and empirical justification for intermediaries as information filters.
• Come see my poster!
  - Results not shown: Role of filtering on success of intermediary
MORE ACTIVE → LESS POPULAR CONTENT