usage mining techniques with applications to web search and content recommendation
Aristides Gionis
Yahoo! Research, Barcelona
yandex aug 31, 2012
yahoo! research, barcelona
web mining
social media and multimedia
large-scale distributed systems
user engagement
semantic web
web mining in yahoo! research
themes
usage mining and query-log mining
social network analysis and graph mining
influence propagation
other data mining problems
data sources
- query logs (search) and toolbar (browsing)
- social networks (flickr, messenger, email, ...)
- question-answering (answers)
- micro-blogging (twitter)
overview of the talk
query-log mining
query graphs
query recommendations
yahoo! tips
news recommendations using real-time web
query-log mining
search engines collect a large amount of query logs
lots of interesting information
- analyzing users’ behavior
- creating user profiles and personalization
- creating knowledge bases and folksonomies
- finding similar concepts
- building systems for query recommendations
- using statistics for improving systems’ performance
- . . .
the click graph
[Craswell and Szummer, 2007]
applications of the click graph
[Craswell and Szummer, 2007]
query-to-document search
query-to-query suggestion
document-to-query annotation
document-to-document relevance feedback
the query-flow graph
[Boldi et al., 2008]
take into account temporal information
captures the “flow” of how users submit queries
definition:
- nodes V = Q ∪ {s, t}: the distinct set of queries Q, plus a starting state s and a terminal state t
- edges E ⊆ V × V
- weights w(q, q′) representing the probability that q and q′ are part of the same chain
building the query-flow graph
an edge (q, q′) if q and q′ are consecutive in at least one session
weights w(q, q′) learned by machine learning
features used
- textual features: cosine similarity, Jaccard coefficient, size of intersection, etc.
- session features: the number of sessions, the average session length, the average number of clicks in the sessions, the average position of the queries in the sessions, etc.
- time-related features: average time difference, etc.
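The construction can be sketched in a few lines, assuming sessions are available as lists of query strings. The talk learns w(q, q′) with a classifier over the features listed; this sketch does not do that and uses the relative transition frequency as a stand-in weight. The names `build_query_flow_graph`, `START`, and `END` are hypothetical.

```python
from collections import defaultdict

START, END = "<s>", "<t>"  # hypothetical labels for the states s and t


def build_query_flow_graph(sessions):
    """Return {q: {q_next: weight}} with one edge per pair of consecutive queries."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        path = [START, *session, END]
        for q, q_next in zip(path, path[1:]):
            counts[q][q_next] += 1
    graph = {}
    for q, successors in counts.items():
        total = sum(successors.values())
        # Stand-in weight: relative transition frequency, not a learned chain probability.
        graph[q] = {q2: c / total for q2, c in successors.items()}
    return graph


sessions = [
    ["barcelona", "barcelona hotels", "cheap barcelona hotels"],
    ["barcelona", "barcelona weather"],
    ["barcelona hotels", "luxury barcelona hotels"],
]
qfg = build_query_flow_graph(sessions)
```

With these toy sessions, "barcelona" has two outgoing edges of equal weight, and every node's outgoing weights sum to 1, so the result can be used directly as a transition matrix for a random walk.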
query-flow graph
[Figure: example query-flow graph around the queries barcelona, barcelona fc, barcelona fc website, barcelona fc fixtures, real madrid, barcelona weather, barcelona weather online, barcelona hotels, cheap barcelona hotels, luxury barcelona hotels, and the terminal node <T>, with edge weights between 0.011 and 0.523.]
query-flow graph
[Figure: example query-flow graph over the queries dog, cat, funny cat, funny dog, picture of a cat, picture of a dog, picture of a funny, cat and dog, breed of dog, dog for sale, with start node ^ and end node $.]
query recommendations
the general theme:
given an input query q
identify similar queries q′
rank them and present them to the user
most query graphs can be used for both tasks: similarity and ranking
recommendations using the query-flow graph
[Boldi et al., 2008]
perform a random walk on the query-flow graph
teleportation to the submitted query
teleportation to previous queries to take into account the user history
normalize the PageRank scores to remove the bias towards very popular queries
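A minimal sketch of such a random walk with teleportation, assuming the graph is stored as `{node: {successor: probability}}` with rows that sum to 1. The function name, the damping value 0.85, and the toy graph are illustrative assumptions, not values from the talk; the normalization step for popular queries is not shown.

```python
def personalized_pagerank(graph, restart, damping=0.85, iterations=50):
    """Power iteration for a random walk that teleports to the `restart` nodes.

    graph:   {node: {successor: transition probability}}, rows summing to 1
    restart: {node: probability}, summing to 1 (submitted query, optionally history)
    """
    nodes = set(graph) | {v for succ in graph.values() for v in succ} | set(restart)
    score = {v: restart.get(v, 0.0) for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) * restart.get(v, 0.0) for v in nodes}
        for u in nodes:
            successors = graph.get(u)
            if successors:
                for v, p in successors.items():
                    nxt[v] += damping * score[u] * p
            else:
                # Dangling node: send its mass back to the restart distribution.
                for v, r in restart.items():
                    nxt[v] += damping * score[u] * r
        score = nxt
    return score


toy_graph = {
    "apple": {"apple ipod": 0.6, "apple fruit": 0.4},
    "apple ipod": {"apple store": 1.0},
    "apple fruit": {"apple": 1.0},
}
scores = personalized_pagerank(toy_graph, restart={"apple": 1.0})
ranked = sorted(scores, key=scores.get, reverse=True)
```

To account for user history, the `restart` distribution can put some mass on earlier queries of the session, for example `{"apple": 0.7, "beatles": 0.3}`.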
example : apple
Max. weight      sq               sq              sq
t                t                apple           apple
apple ipod       apple            apple fruit     apple ipod
apple store      apple ipod       apple ipod      apple trailers
apple trailers   apple store      apple belgium   apple store
amazon           apple trailers   eating apple    apple mac
apple mac        google           apple.nl        apple fruit
itunes           amazon           apple monitor   apple usa
pc world         argos            apple usa       apple ipod nano
argos            itunes           apple jobs      apple.com/ipod...
example : banana → apple
banana → apple              banana
banana                      banana
apple                       eating bugs
usb no                      banana holiday
banana cs                   opening a banana
giant chocolate bar         banana shoe
where is the seed in anut   fruit banana
banana shoe                 recipe 22 feb 08
fruit banana                banana jules oliver
banana cloths               banana cs
eating bugs                 banana cloths
example : beatles → apple
beatles → apple          beatles
beatles                  beatles
apple                    scarring
apple ipod               paul mcartney
scarring                 yarns from ireland
srg peppers artwork      statutory instrument A55
ill get you              silver beatles tribute band
bashles                  beatles mp3
dundee folk songs        GHOST’S
the beatles love album   ill get you
place lyrics beatles     fugees triger finger remix
recommendations as shortcuts to qfg
[Anagnostopoulos et al., 2010]
the query-recommendation problem
the recommendation problem
model user behavior as a random walk on qfg
a user starts at query q0 and follows a path p of reformulations on qfg before terminating
consider a reward function w(q) on the nodes of qfg
goal: “nudge” users in order to maximize their reward
objectives:
1. collect a large reward along the way
2. end the session at a high-reward node
applications: a general problem formulation for suggesting shortcuts (web graph, social networks, etc.)
probabilistic model
we can only suggest, not order the user
we do not know how the user will act
random walk on qfg is modeled by stochastic matrix P
recommendations R modify P to P ′ = P + R
utility functions
reward function w(q) on queries
- quality of search results, user satisfaction, dwell time, monetization, etc.
utility function U(p) on paths p = 〈q0 . . . qk−1 T〉
$U(p) = \sum_{q \in p} w(q)$ (Cavafy, “road to Ithaca”)
$U(p) = w(q_{k-1})$ (Machiavelli, “the end justifies the means”)
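The two utility functions can be written out directly. This is a small sketch under the assumption that a path is a list of queries (without the terminal state T) and that the rewards are given as a dictionary; the function names and reward values are made up for illustration.

```python
def utility_sum(path, w):
    """'Road to Ithaca': reward collected along the whole path."""
    return sum(w[q] for q in path)


def utility_last(path, w):
    """'The end justifies the means': reward of the last query before terminating."""
    return w[path[-1]]


# Hypothetical rewards, e.g. a normalized measure of user satisfaction.
w = {"hotels": 0.2, "barcelona hotels": 0.5, "cheap barcelona hotels": 0.9}
path = ["hotels", "barcelona hotels", "cheap barcelona hotels"]

total_reward = utility_sum(path, w)
final_reward = utility_last(path, w)
```

The first utility favors recommendations that keep the user on rewarding queries throughout the session; the second only cares where the session ends.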
utility
[Figure: bar chart of the sum of expected values (axis from 0.0 to 1.2) for the strategies labeled w, ρ, ρw, 1−step heuristic.]
qfg projections for diverse recommendations
[Bordino et al., 2010]
diverse recommendations
[Bordino et al., 2010]
we want not only relevant and high-quality recommendations, but also a diverse set
we want recommendations that lead in different “directions” in the qfg
need notions of distance between queries in the qfg
use spectral embeddings
project a graph into a low-dimensional space, so that the embedding minimizes total edge distortion
finding diverse recommendations reduces to a geometric problem
example: time
Spectral projection on 2-hop neighborhood
query: time
                        (1)      (2)      (3)      (4)      (5)      (6)      (7)
(1) time magazine        -      0.9953   0.0162   0.1422   0.1049  -0.6071  -0.6056
(2) new york times     0.9953     -     -0.0051   0.1248   0.0893  -0.6478  -0.6462
(3) time zone          0.0162  -0.0051     -      0.9903   0.9891  -0.5234  -0.5254
(4) world time         0.1422   0.1248   0.9903     -      0.9970  -0.6263  -0.6282
(5) what time is it    0.1049   0.0893   0.9891   0.9970     -     -0.6244  -0.6263
(6) time warner       -0.6071  -0.6478  -0.5234  -0.6263  -0.6244     -      0.9999
(7) time warner cable -0.6056  -0.6462  -0.5254  -0.6282  -0.6263   0.9999     -
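One simple way to turn such pairwise values into a diverse recommendation set is a greedy selection that always adds the candidate least similar to those already chosen. The sketch below uses the values from the slide and assumes they are similarities between the embedded queries (with 1.0 on the diagonal) and that the candidates are listed in relevance order; both are assumptions, and the greedy rule is an illustration rather than the method of [Bordino et al., 2010].

```python
QUERIES = [
    "time magazine", "new york times", "time zone", "world time",
    "what time is it", "time warner", "time warner cable",
]
SIM = [
    [1.0, 0.9953, 0.0162, 0.1422, 0.1049, -0.6071, -0.6056],
    [0.9953, 1.0, -0.0051, 0.1248, 0.0893, -0.6478, -0.6462],
    [0.0162, -0.0051, 1.0, 0.9903, 0.9891, -0.5234, -0.5254],
    [0.1422, 0.1248, 0.9903, 1.0, 0.9970, -0.6263, -0.6282],
    [0.1049, 0.0893, 0.9891, 0.9970, 1.0, -0.6244, -0.6263],
    [-0.6071, -0.6478, -0.5234, -0.6263, -0.6244, 1.0, 0.9999],
    [-0.6056, -0.6462, -0.5254, -0.6282, -0.6263, 0.9999, 1.0],
]


def pick_diverse(sim, k):
    """Greedy selection: start from the first candidate, then repeatedly add
    the candidate whose largest similarity to the chosen set is smallest."""
    chosen = [0]
    while len(chosen) < k:
        remaining = [c for c in range(len(sim)) if c not in chosen]
        best = min(remaining, key=lambda c: max(sim[c][s] for s in chosen))
        chosen.append(best)
    return chosen


diverse = [QUERIES[i] for i in pick_diverse(SIM, k=3)]
```

On this matrix the selection picks one query from each of the three visible groups (the publications, the clock-related queries, and the cable company).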
improving recommendation for long-tail queries via templates
[Szpektor et al., 2011]
motivation
goal: improve coverage of query-recommendation systems
observation: in a typical query log, 50 % of the query volume consists of unique queries [Baeza-Yates et al., 2007]
most query-recommendation systems are based on finding queries that co-occur frequently
inherent limitation of using co-occurrences
need methods that can reason about rare, and even previously unseen, queries
overview of the approach
1 generate candidate query-templates for each query
Paris hotels → <city> hotels
Paris hotels → <district> hotels
Moscow hotels → <city> hotels
2 infer transitions between templates
<city> hotels → <city> restaurants
3 infer recommendations for rare queries
Yancheng hotels → Yancheng restaurants
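The three steps can be illustrated with a toy example. The sketch assumes a tiny hand-written entity dictionary and a single already-inferred template transition; `ENTITY_TYPES`, `TEMPLATE_TRANSITIONS`, and both function names are hypothetical, and queries are lowercased for simplicity.

```python
# Hypothetical mini-taxonomy: token -> entity types it can be generalized to.
ENTITY_TYPES = {"paris": ["city", "district"], "moscow": ["city"], "yancheng": ["city"]}
# Step 2 is assumed to have been done already on frequent queries.
TEMPLATE_TRANSITIONS = {"<city> hotels": ["<city> restaurants"]}


def candidate_templates(query):
    """Step 1: replace each known token by each of its entity types."""
    tokens = query.lower().split()
    result = []
    for i, token in enumerate(tokens):
        for etype in ENTITY_TYPES.get(token, []):
            template = " ".join(tokens[:i] + [f"<{etype}>"] + tokens[i + 1:])
            result.append((template, etype, token))
    return result


def recommend(query):
    """Step 3: follow template transitions and re-insert the original entity."""
    recommendations = []
    for template, etype, entity in candidate_templates(query):
        for target in TEMPLATE_TRANSITIONS.get(template, []):
            recommendations.append(target.replace(f"<{etype}>", entity))
    return recommendations


paris_templates = [t for t, _, _ in candidate_templates("Paris hotels")]
rare_query_recs = recommend("Yancheng hotels")
```

The rare query never has to co-occur with its recommendation in the log; it only has to match a template whose transitions were learned from frequent queries such as the Paris and Moscow ones.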
query templates
defined over a hierarchy of entity types
define a global set of templates over the whole query log
do not restrict to specific domains (such as travel, weather, or movies)
examples:
jaguar spare parts → <car> spare parts
name for salt → name for <compound>
a thousand miles notes → <song> notes
candidate templates – example
[Figure: the n-grams of the query mapped to entity types in the hierarchy; labels: chocolate, cookie, chocolate cookie, food, dessert, drink, recipe, instruction, substance.]
query: chocolate cookie recipe
candidate templates:
- <food> cookie recipe
- <drink> cookie recipe
- <food> recipe
- <substance> recipe
- chocolate cookie <instruction>
- . . .
ranking candidate templates
ambiguity
Jaguar spare parts → <car> spare parts
Jaguar spare parts → <animal> spare parts
focus
name for salt → name for <compound>
name for salt → <description> for salt
right generalization level
Paris hotels → <capital> hotels
Paris hotels → <city> hotels
Paris hotels → <location> hotels
construction of query templates – details
hierarchy used: WordNet 3.0 hierarchy and Wikipedia category hierarchy, connected via yago mapping
queries are tokenized, and n-grams are looked up and mapped to entities in the hierarchy
enriched with heuristic generalizations for <email>, <url>, numbers, and noun-phrases not in the taxonomy
query-to-template edges
mapping from a query q to its set of templates T(q), viewed as query-to-template edges
associated edge scores
$s_{qt}(q, t) = \alpha^{d}$
when t is obtained by generalizing q at distance d in the hierarchy H
parameter α set experimentally to 0.9
set $s_{qt}(q, q') = 1$ if (q, q′) is an edge in the query-flow graph
normalize so that all $s_{qt}(q, \cdot)$ sum to 1
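As a worked example of the scoring rule, the sketch below computes α^d for a few templates of one query and normalizes the scores. The generalization distances are made-up values for illustration, and only template edges are normalized here (the query-to-query edges with score 1 are left out).

```python
ALPHA = 0.9  # value reported on the slide


def query_to_template_scores(template_distance):
    """template_distance: {template: generalization distance d in the hierarchy}.
    Returns normalized scores proportional to ALPHA ** d."""
    raw = {t: ALPHA ** d for t, d in template_distance.items()}
    total = sum(raw.values())
    return {t: s / total for t, s in raw.items()}


# Hypothetical distances for the query "Paris hotels".
s_qt = query_to_template_scores({
    "<capital> hotels": 1,
    "<city> hotels": 2,
    "<location> hotels": 3,
})
```

Because α < 1, templates obtained by climbing further up the hierarchy receive smaller scores, which penalizes over-generalization.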
template-to-templates edges
reasoning about transitions between templates
<food> recipe → healthy <food> recipe
for templates (t1, t2) define the support set Sup(t1, t2) of query pairs {(q1, q2)}, s.t.
- t1 ∈ T(q1) and t2 ∈ T(q2)
- t1 and t2 substitute the same token in q1 and q2
(e.g., dosa recipe and healthy dosa recipe)
define template-to-template edge score as
$s_{tt}(t_1, t_2) = \sum_{(q_1, q_2) \in Sup(t_1, t_2)} s_{qq}(q_1, q_2)$
normalize so that all $s_{tt}(t, \cdot)$ sum to 1
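The aggregation over the support set can be sketched as follows, assuming the supporting query pairs have already been matched to their templates. The input triples and their scores are invented for illustration; `template_to_template_scores` is a hypothetical name.

```python
from collections import defaultdict


def template_to_template_scores(support):
    """support: iterable of (t1, t2, s_qq), one triple per supporting query pair.
    Sums s_qq over the support set of each (t1, t2), then normalizes per t1."""
    raw = defaultdict(lambda: defaultdict(float))
    for t1, t2, s_qq in support:
        raw[t1][t2] += s_qq
    return {
        t1: {t2: s / sum(out.values()) for t2, s in out.items()}
        for t1, out in raw.items()
    }


support = [
    # e.g. (dosa recipe, healthy dosa recipe) and another food with the same transition
    ("<food> recipe", "healthy <food> recipe", 0.2),
    ("<food> recipe", "healthy <food> recipe", 0.1),
    ("<food> recipe", "<food> calories", 0.1),
]
s_tt = template_to_template_scores(support)
```

A template transition supported by many query pairs accumulates a large score, which is what separates <car> transmission → <car> spare parts from its unsupported <animal> counterpart.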
example – ambiguity
consider query transition:
jaguar transmission → jaguar spare parts
template transition
<car> transmission → <car> spare parts
supported by
bmw transmission → bmw spare parts
audi transmission → audi spare parts
. . .
template transition
<animal> transmission → <animal> spare parts
will not be supported by
lion transmission → lion spare parts
tiger transmission → tiger spare parts
. . .
the query-template flow graph
extension of the query-flow graph
superposition of all the concepts we have seen so far:
set of nodes consists of queries and templates
set of edges consists of
- query to query edges
- query to template edges
- template to template edges
associated weights
generating recommendations
[Figure: graph connecting query q to query q′ through templates t1, t2, t3, t4, with edge scores s1 to s7.]
r(q, q′) = s1s4 + s2s5 + s3s6 + s3s7
interpretation: probability of a feasible path
dashed lines do not really exist, but are discovered on-the-fly
queries q and q′ may not have been seen before
transitions in the query-flow graph ranked first
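The score r(q, q′) is a sum of products along two-step paths through templates. A minimal sketch, assuming normalized query-to-template and template-to-template scores are given as dictionaries and that a target template is instantiated by re-inserting the entity of the input query; all names and numbers are illustrative.

```python
import re
from collections import defaultdict


def recommendation_scores(s_qt, s_tt, entity):
    """Sum, over all paths q -> t1 -> t2, the product of the two edge scores,
    grouped by the query obtained from instantiating t2 with `entity`."""
    r = defaultdict(float)
    for t1, a in s_qt.items():
        for t2, b in s_tt.get(t1, {}).items():
            target_query = re.sub(r"<[^>]+>", entity, t2)
            r[target_query] += a * b
    return dict(r)


# Hypothetical scores for the rare query "yancheng hotels".
s_qt = {"<city> hotels": 0.6, "<location> hotels": 0.4}
s_tt = {
    "<city> hotels": {"<city> restaurants": 0.5, "<city> map": 0.5},
    "<location> hotels": {"<location> restaurants": 1.0},
}
r = recommendation_scores(s_qt, s_tt, entity="yancheng")
```

Here "yancheng restaurants" is reached through two different templates, so its score is the sum of two path products, mirroring the structure of the formula on the slide.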
methodology
methods:
query-template flow graph
query-flow graph
evaluation:
inspection of a sample of the results
editorial evaluation
automated evaluation
training dataset
                  queries         templates
# nodes           95 279 132      5 382 051 983
# edges           83 513 590      4 345 497 267
avg degree        0.88            0.81
max out-degree    14 145          34 249
                  (craigslist)    (<album>)
max in-degree     14 317          133 874
                  (youtube)       (<institution>)
anecdotal evidence
{“guangzhou flights”, “guangzhou map”}
<capital> flights → <capital> map
{“a thousand miles notes”, “a thousand miles piano notes”}
<single> notes → <single> piano notes
{“8 week old weimaraner”, “8 week old weimaraner puppy”}
8 week old <breed> → 8 week old <breed> puppy
{“aaa office twin falls idaho”, “aaa twin falls idaho”}
aaa office <city> → aaa <city>
{“air force titles”, “air force ranks”}
<military service> titles → <military service> ranks
{“name for salt”, “chemical name for salt”}
name for <compound> → chemical name for <compound>
editorial evaluation
set-A: 300 pairs from each configuration, recommendation in the top-10
set-B: 100 pairs, same queries in each configuration, same position
set-C: 100 pairs for which query-flow graph has no recommendation
editors labeled query-recommendation pairs as: relevant, not relevant, cannot tell
two editors, 100 common queries, kappa-statistic 0.37
         qfg       qtfg
set-A    98.48%    97.84%
set-B    97.65%    98.86%
set-C    —         94.38%
automated evaluation – guiding principle
extract query pairs {qi , qi+1} from a testing dataset, such that user submitted qi+1 after qi in the same session
measure if qi+1 is predicted by our methods, and in which position
assumption: qi+1 should be relevant and useful for qi
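A minimal sketch of this evaluation loop, assuming a recommender is available as a function that returns a ranked list of queries. Coverage is taken here to mean that qi+1 appears anywhere in the list, which is an assumption about the metric; `evaluate` and the toy recommender are hypothetical.

```python
def evaluate(test_pairs, recommend, ks=(1, 10)):
    """For each (q_i, q_next) pair, look up the 1-based position of q_next
    in the recommendations for q_i and aggregate simple statistics."""
    positions = []
    for q, q_next in test_pairs:
        ranked = recommend(q)
        if q_next in ranked:
            positions.append(ranked.index(q_next) + 1)
    n = len(test_pairs)
    report = {"coverage": len(positions) / n}
    for k in ks:
        report[f"top-{k}"] = sum(p <= k for p in positions) / n
    report["avg_position"] = sum(positions) / len(positions) if positions else None
    return report


toy_recs = {"a": ["b", "c"], "x": ["y"]}
test_pairs = [("a", "c"), ("a", "b"), ("x", "z"), ("u", "v")]
report = evaluate(test_pairs, lambda q: toy_recs.get(q, []))
```

Running the same loop once with a query-flow-graph recommender and once with a query-template-flow-graph recommender gives the kind of side-by-side comparison reported in the results.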
results
                   qfg        qtfg       relative increase
pair occurrences
total pairs        3134388    3134388
coverage           22.65 %    28.17 %    24.37 %
# in top-100       16.97 %    25.49 %    50.23 %
# in top-10        9.49 %     20.74 %    118.49 %
# in top-1         2.86 %     10.01 %    249.5 %
MAP                0.050      0.137
avg. position      18.35      8.3
unique pairs
total pairs        2755922    2755922
coverage           13.28 %    19.38 %    45.87 %
# in top-100       12.06 %    17.25 %    42.96 %
# in top-10        8.41 %     13.52 %    60.68 %
# in top-1         2.86 %     6.5 %      127.32 %
MAP                0.047      0.089
avg. position      12.33      9.43
results
[Figure: # test-pairs at top-10 (%) (y-axis, 0 to 20) versus query length in words (x-axis, 2 to 16), for QFG and QTFG.]
conclusions
improve coverage of query recommendation systems
recommendations for rare or previously unseen queries
well suited for tail queries
complements rather than replaces existing methods
future work: improve quality of extracted templates
yahoo! tips
[Weber et al., 2011]
motivation
provide answers, not links
identify “how to” queries and provide tips
tip: piece of advice that is
1 short
2 concrete
3 self-contained
4 non-obvious
extract tips from yahoo! answers
tip: To tell if your eggs are fresh : place eggs in a bowl/glass of water.....if it floats it’s bad. if it sinks it’s good.
system diagram
[Diagram, offline part: rule-based extraction produces 250k candidate tips; quality labels are obtained for 20k candidate tips using CrowdFlower; machine learning selects 22k high quality tips.]
[Diagram, online part, for the example query “zest lime without zester”: machine learning decides whether the query has how-to intent; if no, show normal search results; if yes, check whether there are relevant high quality tips; if no, show normal search results; if yes, rank the matching tips and display the highest ranking one, e.g., “TIP: To zest a lime if you don‘t have a zester : use a cheese grater”.]
mining tips from yahoo! answers
consider tips of a specific structure: “X : Y ”
X : goal of the tip
Y : action of the tip
examples
- To get the mildew smell out of your towels : try soaking it in a salt water solution, then washing with soap and cold water, that tends to get rid of smells
- To style your hair without heat, gel or straighteners : try coconut oil mark k
mining tips from yahoo! answers
english
only literal “how to” queries
answer should start with a verb
consider only best answers
replace I, my, me, myself, etc. with you, your, you, yourself, etc.
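These rules can be sketched as a small extraction function. The verb check is replaced here by a tiny hand-written verb list, and the pronoun table covers only the four pronouns named on the slide; `ACTION_VERBS`, `PRONOUNS`, and `extract_tip` are hypothetical names, and a real system would need a proper part-of-speech tagger and language detection.

```python
import re

PRONOUNS = {"i": "you", "my": "your", "me": "you", "myself": "yourself"}
ACTION_VERBS = {"try", "use", "place", "soak"}  # stand-in for a real verb detector


def extract_tip(question, best_answer):
    """Return a tip of the form 'To <goal> : <action>' or None if a rule fails."""
    match = re.match(r"how to (.+?)\??$", question.strip(), flags=re.IGNORECASE)
    if not match:
        return None  # only literal "how to" questions
    words = best_answer.strip().split()
    if not words or words[0].lower() not in ACTION_VERBS:
        return None  # the answer should start with a verb
    goal = re.sub(
        r"\b(i|my|me|myself)\b",
        lambda m: PRONOUNS[m.group(0).lower()],
        match.group(1),
        flags=re.IGNORECASE,
    )
    return f"To {goal} : {best_answer.strip()}"


tip = extract_tip(
    "How to get the mildew smell out of my towels?",
    "try soaking them in a salt water solution",
)
```

The result follows the "X : Y" structure, with the question supplying the goal and the best answer supplying the action.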
quality filtering
generated 249 675 tips
manually label 20 000 using CrowdFlower
classes: very good (25%), ok (48%), bad (27%)
algorithms
- svm (rbf)
- decision trees
- k-nn (Euclidean, k = 21 . . . 50)
feature families:
- 18 handcrafted features: e.g., style (Flesch-Kincaid reading level), sentiment, # urls, emoticons, etc.
- content: SVD on the tip×term matrix
quality filtering — machine learning results
       Method           handcrafted features   content features   both
Hard   SVM              0.63/0.13              0.60/0.09          0.63/0.16
Hard   Decision Tree    0.67/0.07              0.61/0.06          0.66/0.13
Hard   k-NN             0.62/0.23              0.56/0.11          0.63/0.11
Soft   SVM              0.95/0.11              0.93/0.05          0.95/0.08
Soft   Decision Tree    0.95/0.03              0.92/0.03          0.94/0.06
Soft   k-NN             0.94/0.11              0.91/0.05          0.94/0.05
quality filtering — machine learning results
Category                  P,R          VG     size
Beauty & Style            0.53,0.08    0.16   0.08
Business & Finance        0.57,0.20    0.20   0.03
Cars & Transportation     0.64,0.12    0.23   0.03
Computers & Internet      0.69,0.33    0.45   0.15
Consumer Electronics      0.70,0.23    0.38   0.06
Entertainment & Music     0.60,0.39    0.15   0.05
Family & Relationships    0.35,0.05    0.06   0.14
Games & Recreation        0.61,0.31    0.24   0.04
Health                    0.62,0.07    0.15   0.09
Home & Garden             0.43,0.06    0.27   0.04
Society & Culture         0.50,0.19    0.09   0.03
Sports                    0.68,0.24    0.19   0.03
Yahoo! Products           0.73,0.43    0.45   0.07
detecting “how to” queries
how many? 2-3% of volume, 3-4% of distinct queries
start with “how to”, “how do i”, or “how can i”
- how do you fix keys on a laptop
- P: 96-99%, cover: 1.0%
queries start with an action verb
- play my music on tool bar raido
- P: 7-14%, cover: 3.2%
if exists “how to X” then “X”
- craft ideas for boys
- P: 87-94%, cover: 1.1%
incoming queries to “how to” web sites
- fixing a wet cell phone
- P: 61-75%, cover: 0.08%
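Two of these heuristics are simple enough to sketch directly: the literal prefix rule and the rule that treats "X" as a how-to query when "how to X" also occurs in the log. The prefix list is the one on the slide; `seen_queries` stands for a set of queries from the log and is an assumed input, as are the function names.

```python
HOW_TO_PREFIXES = ("how to ", "how do i ", "how can i ")


def has_how_to_prefix(query):
    """Rule 1: the query literally starts with a how-to phrase."""
    return query.lower().strip().startswith(HOW_TO_PREFIXES)


def implied_how_to(query, seen_queries):
    """Rule 3: 'X' counts as how-to if 'how to X' was also seen in the log."""
    return "how to " + query.lower().strip() in seen_queries


def is_how_to(query, seen_queries=frozenset()):
    return has_how_to_prefix(query) or implied_how_to(query, seen_queries)


log = {"how to craft ideas for boys", "how to zest a lime"}
```

The two rules trade precision for coverage differently, which is why the slide reports P and cover separately for each heuristic.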
matching queries to tips
precision–recall trade-off
- index only the “goal” or also “action”
- use AND or OR mode for query
- require minimum “span” for the goal
ranking
rank by number of query tokens in goal, then tf·idf
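A rough sketch of the matching step, indexing only the goal of each tip. "Span" is interpreted here as the fraction of goal tokens covered by the query, which is an assumption about the slide's terminology; the tf·idf tie-break is omitted, and `match_tips` and the example tips are hypothetical.

```python
def match_tips(query, tips, mode="AND", min_span=0.5):
    """tips: list of (goal, action). Returns matching tips, ranked by the
    number of query tokens found in the goal (tf-idf tie-break omitted)."""
    q_tokens = set(query.lower().split())
    ranked = []
    for goal, action in tips:
        g_tokens = set(goal.lower().split())
        if not g_tokens:
            continue
        overlap = q_tokens & g_tokens
        if mode == "AND" and overlap != q_tokens:
            continue  # AND: every query token must occur in the goal
        if mode == "OR" and not overlap:
            continue  # OR: at least one query token must occur in the goal
        if len(overlap) / len(g_tokens) < min_span:
            continue  # assumed meaning of "span": share of the goal that is covered
        ranked.append((len(overlap), goal, action))
    ranked.sort(key=lambda item: -item[0])
    return [(goal, action) for _, goal, action in ranked]


tips = [
    ("zest a lime without a zester", "use a cheese grater"),
    ("juice a lime", "roll it on the counter first"),
]
strict = match_tips("zest lime without zester", tips, mode="AND", min_span=0.5)
loose = match_tips("zest lime without zester", tips, mode="OR", min_span=0.3)
```

AND mode with a high minimum span favors precision, while OR mode with a low span favors coverage, matching the trade-off named on the slide.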
matching queries to tips — evaluation
mode   min span   vol.     dist.    P@1          median
AND    .50        8.7%     2.7%     .428/.680    1
AND    .66        6.8%     1.8%     .557/.770    1
AND    1.0        4.4%     0.8%     .625/.835    1
OR     .50        87.4%    88.4%    .048/.110    18
OR     .66        36.8%    36.3%    .092/.200    2
OR     1.0        13.5%    10.3%    .160/.300    1
future work
mine tips from other resources
- twitter
- wikitravel
improve quality of existing system
- incorporating more features
- improving rule extraction
- classification
information dissemination in social networks
the information dissemination spectrum
news sites, content-provider sites
- editorially curated
- users browse
- no specific info need
web search
- url, images, music, ...
- clear intent
social media (twitter, facebook)
recommendations (content- or context- or geo-aware)
user-generated content (blogs, images, q/a)
social media
the information overload problem
social media and user-generated content
paradigm shift from a broadcast one-to-many mechanism to a many-to-many model
users in the role of information producers
benefits and opportunities
wealth of information of extreme volume and diversity
wisdom of crowd phenomena
accurate profiling and personalization (toolbar, search, clicks)
content- and context- information available
social and geo information available
challenges
heterogeneous sources
high variability in quality
needle-in-the-haystack problems
we want to:
support users to seek, filter, and disseminate information
build efficient platforms that support social-media functionalities
personalized news recommendations by harnessing the real-time web
[De Francisci Morales et al., 2012]
overview
a news recommendation system based on real-time web, e.g., twitter
suggest news articles to twitter users
infer user preferences from twitter activity
yahoo! news
sources characteristics
news stream
+ high coverage
− sparse and noisy data for user profiling
− latency on collecting user feedback
twitter stream
+ much more accurate personalization
+ news spread very fast
From Chatter to Headlines: Harnessing the Real-Time Web for Personalized News Recommendation
Overview Motivation Problem Model Method Results
[Diagram: tweets of the user, tweets of the user's followees, tweets from twitter, and news articles (labels: Entities, News, Tweets) feed the T.Rex user model, which outputs a personalized ranked list of news articles.]
Table 5.2: MRR, precision and coverage.
Algorithm     MRR     P@1     P@5     P@10    Coverage
RECENCY       0.020   0.002   0.018   0.036   1.000
CLICKCOUNT    0.059   0.024   0.086   0.135   1.000
SOCIAL        0.017   0.002   0.018   0.036   0.606
CONTENT       0.107   0.029   0.171   0.286   0.158
POPULARITY    0.008   0.003   0.005   0.012   1.000
T.REX         0.107   0.073   0.130   0.168   1.000
T.REX+        0.109   0.062   0.146   0.189   1.000
RECENCY: it ranks news articles by time of publication (most recent first);
CLICKCOUNT: it ranks news articles by click count (highest count first);
SOCIAL: it ranks news articles by using T.REX with β = γ = 0;
CONTENT: it ranks news articles by using T.REX with α = γ = 0;
POPULARITY: it ranks news articles by using T.REX with α = β = 0.
5.6.5 Results
We report MRR, precision and coverage results in Table 5.2. The two variants of our system, T.REX and T.REX+, have the best results overall.
T.REX+ has the highest MRR of all the alternatives. This result means that our model has a good overall performance across the dataset. CONTENT also has a very high MRR. Unfortunately, the coverage level achieved by the CONTENT strategy is very low. This issue is mainly caused by the sparsity of the user profiles. It is well known that most twitter users belong to the “silent majority,” and do not tweet very much.
The SOCIAL strategy is affected by the same problem, albeit to a much lesser extent. The reason for this difference is that SOCIAL draws from a large social neighborhood of user profiles, instead of just one. So it has more chances to provide a recommendation. The quality of the recommendation is however quite low, probably because the social-based profile alone is not able to capture the specific user interests.
It is worth noting that in almost 20% of the cases T.REX+ was able to rank the clicked news in the top 10 results. Ranking by the CLICKCOUNT
Mean Reciprocal Rank, Precision and Coverage.
[Figure: Average Discounted Cumulative Gain. Average DCG (0 to 14) versus rank (1 to 20) for T.Rex+, T.Rex, Popularity, Content, Social, Recency, and Click count.]
T.Rex
T.Rex builds user profiles from twitter.
Parameters learned from click data in the Yahoo! toolbar log.
Uses support vector machines and learns a ranking function.
Heuristically identified a group of [illegible] twitter users in the toolbar and used their clicks to train and test the system.
What
T.Rex is a new methodology for recommending interesting news to users by exploiting the information in their twitter persona.
Content Model Γ = [illegible], where Γ(u, n) is the content relevance of news n for user u.
Social Model Σ = [illegible], where Σ(u, n) is the social relevance of news n for user u.
Popularity Model Π = [illegible], where Π(n) is the popularity of news article n.
in updating the popularity counts is to take into account recency: new entities of interest should dominate the popularity counts of older entities. In this work, we choose to update the popularity counts using an exponential decay rule. We discuss the details in Section 5.3.1. However, note that the popularity update is independent of our recommendation model, and any other decaying function can be used.
Finally, we propose a ranking function for recommending news articles to users. The ranking function is a linear combination of the scoring components described above. We plan to investigate the effect of non-linear combinations in the future.
Definition 10 (Recommendation ranking Rτ(u, n)). Given the components Στ, Γτ and Πτ, resulting from a stream of news N and a stream of tweets T authored by users U up to time τ, the recommendation score of a news article n ∈ N for a user u ∈ U at time τ is defined as
Rτ(u, n) = α · Στ(u, n) + β · Γτ(u, n) + γ · Πτ(n),
where α, β, γ are coefficients that specify the relative weight of the components.
At any given time, the recommender system produces a set of news recommendations by ranking a set of candidate news, e.g., the most recent ones, according to the ranking function R. To motivate the proposed ranking function we note similarities with popular recommendation techniques. When β = γ = 0, the ranking function R resembles collaborative filtering, where user similarity is computed on the basis of their social circles. When α = γ = 0, the function R implements a content-based recommender system, where a user is profiled by the bag-of-entities occurring in the tweets of the user. Finally, when α = β = 0, the most popular items are recommended, regardless of the user profile.
Note that Σ, Γ, Π and R are all time dependent. At any given time τ the social network and the set of authored tweets vary, thus affecting Σ and Γ. More importantly, some entities may abruptly become popular, hence of interest to many users. This dependency is captured by Π. While the changes in Σ and Γ derive directly from the tweet stream T and the social network S, the update of Π is non-trivial, and plays a fundamental role in the recommendation system that we describe in the next section.
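The linear ranking function of Definition 10 can be illustrated with a minimal sketch. It assumes the three component scores for one user are already available as plain dictionaries keyed by news id; the function name `rank_news`, the news ids, and all numeric scores are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def rank_news(sigma_u: Dict[str, float],
              gamma_u: Dict[str, float],
              pi: Dict[str, float],
              alpha: float, beta: float, gamma: float,
              k: int) -> List[str]:
    """Return the top-k news ids for one user, scored as
    R(u, n) = alpha * Sigma(u, n) + beta * Gamma(u, n) + gamma * Pi(n)."""
    def score(n: str) -> float:
        return (alpha * sigma_u.get(n, 0.0)
                + beta * gamma_u.get(n, 0.0)
                + gamma * pi.get(n, 0.0))

    # Candidates are the news with a popularity entry; ties break by id.
    return sorted(pi, key=lambda n: (-score(n), n))[:k]


# Hypothetical scores of three candidate news articles for one user.
sigma_u = {"n1": 0.2, "n2": 0.7, "n3": 0.0}   # social relevance
gamma_u = {"n1": 0.9, "n2": 0.1, "n3": 0.0}   # content relevance
pi = {"n1": 0.1, "n2": 0.3, "n3": 0.8}        # popularity (user independent)

social_only = rank_news(sigma_u, gamma_u, pi, 1.0, 0.0, 0.0, 3)
content_only = rank_news(sigma_u, gamma_u, pi, 0.0, 1.0, 0.0, 3)
popularity_only = rank_news(sigma_u, gamma_u, pi, 0.0, 0.0, 1.0, 3)
mixed_top2 = rank_news(sigma_u, gamma_u, pi, 0.4, 0.4, 0.2, 2)
```

Setting two of the three weights to zero reproduces the special cases discussed in the text: a social (collaborative-filtering-like) ranking, a content-based ranking, and a pure popularity ranking.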
Recommendation Model R
T.Rex+: system trained with additional features:
- entity hotness (raw number of mentions in news and twitter)
- news click count
- news article age
Given: N = news stream, T = tweet stream, U = set of users.
Find the top-k most relevant news for user u at time τ.
Why Twitter? Timeliness and personalization. News become stale very fast and spread faster on twitter. Twitter is a good predictor of interest.
How: T.Rex uses a mix of signals to model the relevance of news articles for users: the profile of the social neighborhood of the users, the content of their tweet stream, and topic popularity in the news and across twitter.
Results: T.Rex is able to predict with good accuracy the news articles clicked by the users and rank them higher than other news articles.
Data: News: […]k articles from Yahoo! news. Twitter: 1 month of crawled tweets. Clicks: users of twitter in Yahoo! toolbar logs.
Evaluation: We evaluate T.Rex as a click prediction system. We train our model using a learning-to-rank approach and support vector machines. The training and test sets are drawn from click logs.
Claudio
Gianmarco De Francisci
Aristides Gionis
Overwhelmed by information overload! Find interesting stories in an ocean of online news articles.
[Figure: News-click delay distribution. x-axis: minutes (log scale, 1 to 10000); y-axis: number of occurrences.]
[Figure: Osama Bin Laden trends. Normalized number of occurrences in news, twitter, and clicks, from May-01 h20 to May-03 h08.]
[Figure: Joplin Tornado trends. Normalized number of occurrences in news, twitter, and clicks, from May-22 h00 to May-26 h00.]
A = authorship matrix (U × T): A(i, j) = 1 if u_i is the author of tweet t_j.
S = social matrix (U × U): S(i, j) = 1 if u_i is interested in the content produced by u_j.
in N according to a user-dependent relevance criterion. We also aim at incorporating time recency into our model, so that our recommendations favor the most recently published news articles.
We now proceed to model the factors that affect the relevance of news for a given user. We first model the social-network aspect. In our case, the social component is induced by the twitter following relationship. We define S to be the social network adjacency matrix, where S(i, j) is equal to 1 divided by the number of users followed by user ui if ui follows uj, and 0 otherwise. We also adopt a functional ranking (Baeza-Yates et al., 2006) that spreads the interests of a user among its neighbors recursively. By limiting the maximum hop distance d, we define the social influence in a network as follows.
Definition 4 (Social influence S∗). Given a set of users U = {u0, u1, . . .}, organized in a social network where each user may express an interest in the content published by another user, we define the social influence model S∗ as the |U| × |U| matrix where S∗(i, j) measures the interest of user ui in the content generated by user uj, and it is computed as
$$ S^{*} = \left( \sum_{i=1}^{d} \sigma^{i} S^{i} \right), $$
where S is the row-normalized adjacency matrix of the social network, d is the maximum hop-distance up to which users may influence their neighbors, and σ is a damping factor.
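As a rough illustration of Definition 4, the sketch below computes the truncated damped sum of powers of the row-normalized adjacency matrix using plain Python lists. The helper names (`row_normalize`, `matmul`, `social_influence`), the three-user toy network, and the values d = 2 and σ = 0.5 are assumptions made for the example, not values taken from the talk.

```python
from typing import List

Matrix = List[List[float]]


def row_normalize(adj: Matrix) -> Matrix:
    """Divide each row by its sum, so S(i, j) = 1 / (number of users followed by u_i)."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product a . b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def social_influence(adj: Matrix, d: int, sigma: float) -> Matrix:
    """Compute S* = sum over i = 1..d of sigma^i * S^i."""
    s = row_normalize(adj)
    n = len(s)
    result = [[0.0] * n for _ in range(n)]
    power, weight = s, sigma          # start from S^1 with weight sigma^1
    for _ in range(d):
        for i in range(n):
            for j in range(n):
                result[i][j] += weight * power[i][j]
        power = matmul(power, s)      # next power of S
        weight *= sigma               # next power of sigma
    return result


# Toy follow graph: u0 follows u1, u1 follows u2, u2 follows nobody.
follows = [[0, 1, 0],
           [0, 0, 1],
           [0, 0, 0]]
s_star_1hop = social_influence(follows, d=1, sigma=0.5)
s_star_2hop = social_influence(follows, d=2, sigma=0.5)
```

With d = 1 user u0 is influenced only by u1; with d = 2 the interest also reaches u2, two hops away, damped by σ².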
Next we model the profile of a user based on the content that the user has generated. We first define a binary authorship matrix A to capture the relationship between users and the tweets they produce.
Definition 5 (Tweet authorship A). Let A be a |U| × |T| matrix where A(i, j) is 1 if ui is the author of tj, and 0 otherwise.
The matrix A can be extended to deal with different types of relationships between users and posts, e.g., to weigh re-tweets or likes differently. In this work, we limit the concept of authorship to the posts actually written by the user.
S∗ = social interest: S∗(i, j) = level of interest of u_i in the content produced by u_j.
Z = entity space onto which T and N are mapped. We use Wikipedia pages as our entity space.
Z = popularity vector (over the entity space): Z(i) = popularity of entity z_i. Updated by tracking mentions in news and twitter with exponential decay.
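The exact update rule is not given here, so the following is only a sketch of one common form of exponential decay: at each time step every old count is multiplied by a decay factor and the new mention counts are added. The function name `update_popularity`, the decay value 0.5, and the entity names are hypothetical.

```python
from typing import Dict


def update_popularity(z: Dict[str, float],
                      mentions: Dict[str, int],
                      decay: float) -> Dict[str, float]:
    """One time step of an assumed exponential-decay rule:
    new count = decay * old count + mentions observed in this step."""
    entities = set(z) | set(mentions)
    return {e: decay * z.get(e, 0.0) + mentions.get(e, 0) for e in entities}


# Hypothetical entities: an older topic and one that suddenly becomes popular.
z0 = {"OldTopic": 8.0}
z1 = update_popularity(z0, {"BreakingTopic": 10}, decay=0.5)
z2 = update_popularity(z1, {}, decay=0.5)   # a step with no new mentions
```

Under this assumed rule a newly mentioned entity overtakes an older one quickly, and every count halves at each step in which the entity is not mentioned.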
M = tweet-to-news matrix (T × N): M(i, j) = relatedness of tweet t_i to news n_j.
M = T · N
T = tweet matrix (T × Z): T(i, j) = relatedness of tweet t_i to entity z_j.
N = news matrix (Z × N): N(i, j) = relatedness of entity z_i to news n_j.
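To make the role of the entity space concrete, the sketch below chains the matrices on a toy example, assuming the compositions M = T · N, Γ = A · M, Σ = S∗ · Γ and Π = Z · N as read from the poster. All matrix values, and the helper `matmul`, are made up for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product a . b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Toy data: 2 users, 2 tweets, 2 entities, 2 news articles.
T = [[1.0, 0.0],      # tweet t0 mentions only entity z0
     [0.5, 0.5]]      # tweet t1 mentions z0 and z1
N = [[1.0, 0.0],      # entity z0 is related to news n0
     [0.0, 2.0]]      # entity z1 is related to news n1
A = [[1.0, 1.0],      # user u0 wrote both tweets
     [0.0, 0.0]]      # user u1 never tweets
S_star = [[0.0, 0.0],
          [0.5, 0.0]]  # u1 is interested in the content of u0
Z = [3.0, 1.0]         # popularity of entities z0, z1

M = matmul(T, N)                    # tweet-to-news relatedness
Gamma = matmul(A, M)                # content relevance, users x news
Sigma = matmul(S_star, Gamma)       # social relevance, users x news
Pi = [sum(Z[i] * N[i][j] for i in range(len(Z)))
      for j in range(len(N[0]))]    # popularity of each news article
```

In this toy case the silent user u1 gets an all-zero content row in Γ but still receives non-zero social scores in Σ through the followed user u0, while Π is the same for every user.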
challenges
scale to large volumes of news and tweets
high dynamicity of news and tweets
news have short life-cycle
twitter users use jargon language
find the right degree of personalization
cope with inactive twitter users
relate users, tweets, and news articles
T.Rex architecture
[Figure: T.Rex architecture. Tweets from the user, from the user's followees, and from twitter at large, together with news articles, feed the T.Rex user model, which produces a personalized ranked list of news articles.]
recommendation model
Rτ(u, n) = α · Στ(u, n) + β · Γτ(u, n) + γ · Πτ(n)
social model Σ(i, j): social relevance of news j to user i
content model Γ(i, j): content relevance of news j to user i
popularity model Π(j): popularity of news article j
popularity update rule
Entities
News
Tweets
From Chatter to Headlines:Harnessing the Real-Time Web
for Personalized News Recommendation
Overview Motivation Problem
Model Method Results
tweetsUser
tweetsFollowee
tweetsFollowee
tweetsFollowee
tweetstwitter
articlesnews
T.Rex
User Model
!
"
#
Personalized ranked list of news articles
Table 5.2: MRR, precision and coverage.
Algorithm MRR P@1 P@5 P@10 CoverageRECENCY 0.020 0.002 0.018 0.036 1.000CLICKCOUNT 0.059 0.024 0.086 0.135 1.000SOCIAL 0.017 0.002 0.018 0.036 0.606CONTENT 0.107 0.029 0.171 0.286 0.158POPULARITY 0.008 0.003 0.005 0.012 1.000T.REX 0.107 0.073 0.130 0.168 1.000T.REX+ 0.109 0.062 0.146 0.189 1.000
RECENCY: it ranks news articles by time of publication (most recent first);CLICKCOUNT: it ranks news articles by click count (highest count first);SOCIAL: it ranks news articles by using T.REX with β = γ = 0;CONTENT: it ranks news articles by using T.REX with α = γ = 0;POPULARITY: it ranks news articles by using T.REX with α = β = 0.
5.6.5 Results
We report MRR, precision and coverage results in Table 5.6.3. The twovariants of our system, T.REX and T.REX+, have the best results overall.
T.REX+ has the highest MRR of all the alternatives. This result meansthat our model has a good overall performance across the dataset. CON-TENT has also a very high MRR. Unfortunately, the coverage level achievedby the CONTENT strategy is very low. This issue is mainly caused by thesparsity of the user profiles. It is well know that most of twitter usersbelong to the “silent majority,” and do not tweet very much.
The SOCIAL strategy is affected by the same problem, albeit to a muchlesser extent. The reason for this difference is that SOCIAL draws froma large social neighborhood of user profiles, instead of just one. So ithas more chances to provide a recommendation. The quality of the rec-ommendation is however quite low, probably because the social-basedprofile only is not able to catch the specific user interests.
It is worth noting that in almost 20% of the cases T.REX+ was able torank the clicked news in the top 10 results. Ranking by the CLICKCOUNT
124
!"#$%&"'()*+'#,%&#$-.%/*"'(0(+$%#$1%2+3"*#4"5
0
2
4
6
8
10
12
14
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Ave
rag
e D
CG
Rank
T.Rex+T.Rex
PopularityContent
SocialRecency
Click count
63"*#4"%7(0'+8$9"1%28:8,#9(3"%;#($5
T.Rex!"#$%%<8(,10%80"*%)*+=,"0%>*+:%9?(99"*5/#*#:"9"*0%,"#*$"1%>*+:%',('-%1#9#%($%9@"%A#@++B%9++,<#*%,+45C0"0%08))+*9%3"'9+*%:#'@($"0%#$1%,"#*$0%#%*#$-($4%>8$'9(+$5D"8*(09('#,,E%(1"$9(="1%#%4*+8)%+>%FGHI%9?(99"*%80"*0%($%9@"%9++,<#*%#$1%80"1%9@"(*%',('-0%9+%9*#($%#$1%9"09%9@"%0E09":5
What!"#$%%(0%#%$"?%:"9@+1+,+4E%>+*%*"'+::"$1($4%($9"*"09($4%$"?0%9+%80"*0%<E%"J),+(9($4%9@"%($>+*:#9(+$%($%9@"(*%9?(99"*%)"*0+$#5
Content Model Γ&'(')'*'+%?@"*"%&,-./0%(0%9@"%'+$9"$9%*","3#$'"%+>%$"?0%1/'>+*%80"*%2-5
Social Model Σ!3'('45'*')'*'+%?@"*"%3,-./0%(0%9@"%0+'(#,%*","3#$'"%+>%$"?0%1/'>+*%80"*%2-5
Popularity Model Π6'('7'*'8%?@"*"'6,/0%(0%9@"%)+)8,#*(9E%+>%$"?0%#*9(',"%1/5
in updating the popularity counts is to take into account recency: newentities of interest should dominate the popularity counts of older enti-ties. In this work, we choose to update the popularity counts using anexponential decay rule. We discuss the details in Section 5.3.1. However,note that the popularity update is independent of our recommendationmodel, and any other decaying function can be used.
Finally, we propose a ranking function for recommending news arti-cles to users. The ranking function is linear combination of the scoringcomponents described above. We plan to investigate the effect of non-linear combinations in the future.
Definition 10 (Recommendation ranking Rτ (u, n)). Given the componentsΣτ , Γτ and Πτ , resulting form a stream of news N and a stream of tweets Tauthored by users U up to time τ , the recommendation score of a news articlen ∈ N for a user u ∈ U at time τ is defined as
Rτ (u, n) = α · Στ (u, n) + β · Γτ (u, n) + γ · Πτ (n),
where α, β, γ are coefficients that specify the relative weight of the components.
At any given time, the recommender system produces a set of newsrecommendation by ranking a set of candidate news, e.g., the most re-cent ones, according to the ranking function R. To motivate the pro-posed ranking function we note similarities with popular recommenda-tion techniques. When β = γ = 0, the ranking function R resemblescollaborative filtering, where user similarity is computed on the basisof their social circles. When α = γ = 0, the function R implements acontent-based recommender system, where a user is profiled by the bag-of-entities occurring in the tweets of the user. Finally, when α = β = 0,the most popular items recommended, regardless of the user profile.
Note that Σ, Γ, Π and R are all time dependent. At any given time τ
the social network and the set of authored tweets vary, thus affecting Σ
and Γ. More importantly, some entities may abruptly become popular,hence of interest to many user. This dependency is captured by Π. Whilethe changes in Σ and Γ derive directly from the tweet stream T and thesocial network S, the update of Π is non-trivial, and plays a fundamentalrole in the recommendation system that we describe in the next section.
108
Recommendation Model R
T.Rex+KE09":%9*#($"1%?(9@%#11(9(+$#,%>"#98*"0LM "$9(9E%@+9$"00%N*#?%$8:<"*%+>%:"$9(+$0%($%$"?0%#$1%9?(99"*OM $"?0%',('-%'+8$9M $"?0%#*9(',"%#4"
;(3"$L N = $"?0%09*"#: T = 9?""9%09*"#: U = 0"9%+>%80"*0
"#$%!&'(!&)*+,!-).&!/(0(12$&!$(3.!4)/!5.(/!&!2&!&#-(τ6Why Twitter?%%P(:",($"00%#$1%)"*0+$#,(Q#9(+$5%R"?0%<"'+:"%09#,"%3"*E%>#09%#$1%0)*"#1%>#09"*%+$%9?(99"*5%P?(99"*%(0%#%4++1%)*"1('9+*%+>%($9"*"095
How!"#$%%80"0%#%:(J%+>%0(4$#,0%9+%:+1",%*","3#$'"%+>%$"?0%#*9(',"0%>+*%80"*0L%9@"%)*+=,"%+>%9@"%0+'(#,%$"(4@<+*@++1%+>%9@"%80"*0.%9@"%'+$9"$9%9@"(*%9?""9%09*"#:.%#$1%9+)('%)+)8,#*(9E%($%9@"%$"?0%#$1%#'*+00%9?(99"*5
Results !"#$%%(0%#<,"%9+%)*"1('9%?(9@%4++1%#''8*#'E%9@"%$"?0%#*9(',"0%',('-"1%<E%9@"%80"*0%#$1%*#$-%9@":%@(4@"*%9@#$%+9@"*%$"?0%#*9(',"05
DataR"?0L%SIT-%#*9(',"0%>*+:%A#@++B%$"?0P?(99"*L%H%:+$9@%+>%'*#?,"1%9?""9052,('-0L%80"*0%+>%9?(99"*%($%A#@++B%9++,<#*%,+405
EvaluationU"%"3#,8#9"%!"#$%%#0%#%',('-%)*"1('9(+$%0E09":5%U"%9*#($%+8*%:+1",%80($4%#%,"#*$($4V9+V*#$-%#))*+#'@%#$1%08))+*9%3"'9+*%:#'@($"05P@"%9*#($%#$1%9"09%0"9%#*"%1*#?$%>*+:%',('-%,+405
Claudio Lucchese
Gianmarco De Francisci Morales
Aristides Gionis
Overwhelmed by information overload! Find interesting stories in an ocean of online news articles.
Figure: News-click delay distribution (x-axis: news-click delay in minutes; y-axis: number of occurrences).
Figure: Osama Bin Laden trends, May-01 to May-03 (y-axis: normalized number of occurrences; series: news, twitter, clicks).
Figure: Joplin Tornado trends, May-22 to May-26 (y-axis: normalized number of occurrences; series: news, twitter, clicks).
A = authorship matrix (U × T): $A(i, j) = 1$ if $u_i$ is the author of tweet $t_j$.
S = social matrix (U × U): $S(i, j) = 1$ if $u_i$ is interested in the content produced by $u_j$.
in N according to a user-dependent relevance criterion. We also aim at incorporating time recency into our model, so that our recommendations favor the most recently published news articles.
We now proceed to model the factors that affect the relevance of news for a given user. We first model the social-network aspect. In our case, the social component is induced by the twitter following relationship. We define S to be the social network adjacency matrix, where $S(i, j)$ is equal to 1 divided by the number of users followed by user $u_i$ if $u_i$ follows $u_j$, and 0 otherwise. We also adopt a functional ranking (Baeza-Yates et al., 2006) that spreads the interests of a user among its neighbors recursively. By limiting the maximum hop distance d, we define the social influence in a network as follows.
Definition 4 (Social influence S∗). Given a set of users $U = \{u_0, u_1, \ldots\}$, organized in a social network where each user may express an interest in the content published by another user, we define the social influence model $S^*$ as the $|U| \times |U|$ matrix where $S^*(i, j)$ measures the interest of user $u_i$ in the content generated by user $u_j$, and it is computed as
$$S^* = \sum_{i=1}^{d} \sigma^i S^i,$$
where S is the row-normalized adjacency matrix of the social network, d is the maximum hop-distance up to which users may influence their neighbors, and σ is a damping factor.
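A minimal sketch of Definition 4 in plain Python, assuming a small dense 0/1 follow matrix. The helper names (`row_normalize`, `matmul`, `social_influence`) and the values d = 2 and σ = 0.5 are illustrative choices, not taken from the source.

```python
def row_normalize(adj):
    """Divide each row by its sum, so each followed user gets weight 1/out-degree."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def matmul(a, b):
    """Plain-Python dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def social_influence(adj, d=2, sigma=0.5):
    """Return S* = sum_{i=1..d} sigma^i * S^i for a 0/1 follow matrix adj."""
    s = row_normalize(adj)
    n = len(s)
    power = s                                  # holds S^i, starting with S^1
    result = [[0.0] * n for _ in range(n)]
    for i in range(1, d + 1):
        weight = sigma ** i
        for r in range(n):
            for c in range(n):
                result[r][c] += weight * power[r][c]
        power = matmul(power, s)               # move on to S^(i+1)
    return result


# Toy network (hypothetical): u0 follows u1, u1 follows u2, u2 follows nobody.
follows = [[0, 1, 0],
           [0, 0, 1],
           [0, 0, 0]]
s_star = social_influence(follows, d=2, sigma=0.5)
```

In this toy network u0 inherits a damped interest in u2 through u1, which is the recursive spreading that the definition describes; with d = 1 that two-hop entry would stay at zero.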
Next we model the profile of a user based on the content that the user has generated. We first define a binary authorship matrix A to capture the relationship between users and the tweets they produce.
Definition 5 (Tweet authorship A). Let A be a $|U| \times |T|$ matrix where $A(i, j)$ is 1 if $u_i$ is the author of $t_j$, and 0 otherwise.
The matrix A can be extended to deal with different types of relationships between users and posts, e.g., to weigh re-tweets or likes differently. In this work, we limit the concept of authorship to the posts actually written by the user.
Social interest: $S^*(i, j)$ is the level of interest of $u_i$ in the content produced by $u_j$.
Z = entity space onto which T and N are mapped. We use Wikipedia pages as our entity space.
Z = popularity vector: $Z(i)$ is the popularity of entity $z_i$. Updated by tracking mentions in news and twitter with exponential decay.
T = tweet matrix (T × Z): $T(i, j)$ is the relatedness of tweet $t_i$ to entity $z_j$.
N = news matrix (Z × N): $N(i, j)$ is the relatedness of entity $z_i$ to news $n_j$.
M = tweet-to-news matrix (T × N): $M(i, j)$ is the relatedness of tweet $t_i$ to news $n_j$, with $M = T \cdot N$.
news becomes stale after two days
track mentions in news and tweets with exponential decay
$Z_\tau = \lambda Z_{\tau-1} + w_T H_T + w_N H_N$
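The decay rule can be sketched as follows. This assumes that $H_T$ and $H_N$ hold the number of mentions of each entity in tweets and in news during the latest interval, which the slide does not spell out. The function name `update_popularity` and the values of λ, $w_T$ and $w_N$ are made up for illustration.

```python
def update_popularity(z_prev, h_tweets, h_news, lam=0.5, w_t=1.0, w_n=1.0):
    """One step of Z_tau = lam * Z_{tau-1} + w_T * H_T + w_N * H_N.

    All three inputs are dicts mapping an entity name to a number;
    entities missing from a dict count as zero.
    """
    entities = set(z_prev) | set(h_tweets) | set(h_news)
    return {
        e: lam * z_prev.get(e, 0.0) + w_t * h_tweets.get(e, 0) + w_n * h_news.get(e, 0)
        for e in entities
    }


z = {}
z = update_popularity(z, {"Joplin": 4}, {"Joplin": 2})   # burst: 4 tweet + 2 news mentions
z = update_popularity(z, {}, {})                          # quiet interval: score decays
```

With λ = 0.5 the score of an entity halves in every interval in which it is not mentioned, so newly popular entities quickly overtake older ones.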
model learning and evaluation
$R_\tau(u, n) = \alpha \cdot \Sigma_\tau(u, n) + \beta \cdot \Gamma_\tau(u, n) + \gamma \cdot \Pi_\tau(n)$
Yahoo! toolbar data
the recommendation model should rank highly the news articles that users click
learn the model using SVM
use clicks and twitter profiles of 3K users to train and test the system
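To illustrate how the weights α, β, γ could be learned from clicks, here is a sketch of a linear ranking SVM trained on (clicked, non-clicked) pairs with a simple subgradient method. The slide only says that an SVM is used; the pairwise formulation, the optimizer, the hyperparameters and the feature values below are all assumptions.

```python
def train_pairwise(pairs, epochs=50, lr=0.1, reg=0.01):
    """Learn weights w such that w . clicked > w . skipped for each pair.

    Uses L2 regularization and hinge loss on the feature differences.
    """
    dim = len(pairs[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for clicked, skipped in pairs:
            diff = [c - s for c, s in zip(clicked, skipped)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            w = [wi * (1 - lr * reg) for wi in w]           # regularization shrinkage
            if margin < 1:                                  # pair misranked or too close
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def score(w, features):
    """Linear score, i.e. R(u, n) for the feature vector (Sigma, Gamma, Pi)."""
    return sum(wi * fi for wi, fi in zip(w, features))


# Each pair: (features of the clicked article, features of a non-clicked article
# shown to the same user). Feature order is (Sigma, Gamma, Pi); values are invented.
pairs = [
    ([0.2, 0.9, 0.1], [0.3, 0.1, 0.2]),
    ([0.1, 0.7, 0.5], [0.1, 0.2, 0.4]),
    ([0.6, 0.8, 0.3], [0.5, 0.3, 0.3]),
]
alpha, beta, gamma = train_pairwise(pairs)
```

Training on differences between clicked and non-clicked articles matches the stated goal of ranking clicked articles high, rather than predicting an absolute relevance value.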
systems evaluated
T.Rex: basic model using only user profiles
$R_\tau(u, n) = \alpha \cdot \Sigma_\tau(u, n) + \beta \cdot \Gamma_\tau(u, n) + \gamma \cdot \Pi_\tau(n)$
T.Rex+: additional features
- entity hotness
- news click count
- news article age
results
Poster: From Chatter to Headlines: Harnessing the Real-Time Web for Personalized News Recommendation
Poster sections: Overview, Motivation, Problem, Model, Method, Results
Figure: T.Rex overview. The user's tweets, the followees' tweets, the twitter stream, and news articles feed the T.Rex user model, which outputs a personalized ranked list of news articles.
Table 5.2: MRR (mean reciprocal rank), precision and coverage.
Algorithm    MRR    P@1    P@5    P@10   Coverage
RECENCY      0.020  0.002  0.018  0.036  1.000
CLICKCOUNT   0.059  0.024  0.086  0.135  1.000
SOCIAL       0.017  0.002  0.018  0.036  0.606
CONTENT      0.107  0.029  0.171  0.286  0.158
POPULARITY   0.008  0.003  0.005  0.012  1.000
T.REX        0.107  0.073  0.130  0.168  1.000
T.REX+       0.109  0.062  0.146  0.189  1.000
- RECENCY: ranks news articles by time of publication (most recent first);
- CLICKCOUNT: ranks news articles by click count (highest count first);
- SOCIAL: ranks news articles by using T.REX with β = γ = 0;
- CONTENT: ranks news articles by using T.REX with α = γ = 0;
- POPULARITY: ranks news articles by using T.REX with α = β = 0.
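As a rough illustration of the three measures in the table, the sketch below computes them from the rank that each system assigns to the clicked article. It assumes that MRR and P@k are averaged over covered test cases only; the source does not say so, but this would be consistent with CONTENT having a P@10 larger than its coverage. The function name `evaluate` and the sample ranks are hypothetical.

```python
def evaluate(ranks, k_values=(1, 5, 10)):
    """Compute MRR, P@k and coverage.

    ranks[i] is the 1-based rank of the clicked article in test case i,
    or None when the system could not produce a ranking for that case.
    """
    covered = [r for r in ranks if r is not None]
    n_cov = len(covered)
    report = {
        "coverage": n_cov / len(ranks) if ranks else 0.0,
        "MRR": sum(1.0 / r for r in covered) / n_cov if n_cov else 0.0,
    }
    for k in k_values:
        hits = sum(1 for r in covered if r <= k)
        report["P@%d" % k] = hits / n_cov if n_cov else 0.0
    return report


# Four test cases: clicked article ranked 1st, 4th, not ranked at all, and 12th.
report = evaluate([1, 4, None, 12])
```

Here P@k is read as the fraction of cases in which the clicked article appears in the top k positions.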
5.6.5 Results
We report MRR, precision and coverage results in Table 5.2. The two variants of our system, T.REX and T.REX+, have the best results overall.
T.REX+ has the highest MRR of all the alternatives. This result means that our model has a good overall performance across the dataset. CONTENT also has a very high MRR. Unfortunately, the coverage level achieved by the CONTENT strategy is very low. This issue is mainly caused by the sparsity of the user profiles. It is well known that most twitter users belong to the “silent majority,” and do not tweet very much.
The SOCIAL strategy is affected by the same problem, albeit to a much lesser extent. The reason for this difference is that SOCIAL draws from a large social neighborhood of user profiles, instead of just one, so it has more chances to provide a recommendation. The quality of the recommendation is however quite low, probably because the social-based profile alone is not able to capture the specific user interests.
It is worth noting that in almost 20% of the cases T.REX+ was able to rank the clicked news in the top 10 results. Ranking by the CLICKCOUNT
Figure: Average Discounted Cumulative Gain (x-axis: rank, 1 to 20; y-axis: average DCG; series: T.Rex+, T.Rex, Popularity, Content, Social, Recency, Click count).
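For reference, a small sketch of discounted cumulative gain in its common binary-gain form. How the plotted values were aggregated or normalized is not stated in the source, so this shows only the textbook per-list quantity; the function name `dcg_at` and the sample gains are hypothetical.

```python
import math


def dcg_at(k, gains):
    """DCG@k = sum over positions p = 1..k of gains[p-1] / log2(p + 1)."""
    return sum(g / math.log2(p + 1) for p, g in enumerate(gains[:k], start=1))


# gains[p-1] is 1 if the article at rank p was clicked, else 0 (binary gain assumed).
gains = [0, 1, 0, 1]
curve = [dcg_at(k, gains) for k in range(1, len(gains) + 1)]
```

The curve is non-decreasing in k and grows only at ranks that hold a clicked article, with smaller increments further down the list.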
T.Rex: T.Rex builds user profiles from twitter. Parameters are learned from click data in the Yahoo! toolbar log. It uses support vector machines and learns a ranking function. We heuristically identified a group of about 3K twitter users in the toolbar and used their clicks to train and test the system.
What: T.Rex is a new methodology for recommending interesting news to users by exploiting the information in their twitter persona.
Content Model Γ: $\Gamma = A \cdot M$, where $\Gamma(i, j)$ is the content relevance of news $n_j$ for user $u_i$.
Social Model Σ: $\Sigma = S^* \cdot A \cdot M$, where $\Sigma(i, j)$ is the social relevance of news $n_j$ for user $u_i$.
Popularity Model Π: $\Pi = Z \cdot N$, where $\Pi(j)$ is the popularity of news article $n_j$.
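The three component models can be sketched with plain matrix products, reading the poster formulas as Γ = A · M, Σ = S∗ · A · M and Π = Z · N with M = T · N. The toy matrices below (2 users, 2 tweets, 2 entities, 2 news articles) are invented, and `matmul` is a hypothetical helper.

```python
def matmul(a, b):
    """Plain-Python dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


A = [[1, 0], [0, 1]]                 # authorship: users x tweets
T = [[1, 0], [0, 1]]                 # tweet-to-entity relatedness: tweets x entities
N = [[1, 0.5], [0, 1]]               # entity-to-news relatedness: entities x news
S_star = [[0.0, 0.5], [0.0, 0.0]]    # social influence: users x users
Z = [[3.0, 1.0]]                     # entity popularity as a 1 x entities row vector

M = matmul(T, N)                     # tweet-to-news relatedness
Gamma = matmul(A, M)                 # content relevance: users x news
Sigma = matmul(S_star, Gamma)        # social relevance, i.e. S* . (A . M)
Pi = matmul(Z, N)[0]                 # popularity: one score per news article
```

In this example user 0 has no content interest in the second article beyond 0.5, but receives a social score of 0.5 for it through the user it follows.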
in updating the popularity counts is to take into account recency: new entities of interest should dominate the popularity counts of older entities. In this work, we choose to update the popularity counts using an exponential decay rule. We discuss the details in Section 5.3.1. However, note that the popularity update is independent of our recommendation model, and any other decaying function can be used.
Finally, we propose a ranking function for recommending news articles to users. The ranking function is a linear combination of the scoring components described above. We plan to investigate the effect of non-linear combinations in the future.
Definition 10 (Recommendation ranking $R_\tau(u, n)$). Given the components $\Sigma_\tau$, $\Gamma_\tau$ and $\Pi_\tau$, resulting from a stream of news N and a stream of tweets T authored by users U up to time τ, the recommendation score of a news article $n \in N$ for a user $u \in U$ at time τ is defined as
$$R_\tau(u, n) = \alpha \cdot \Sigma_\tau(u, n) + \beta \cdot \Gamma_\tau(u, n) + \gamma \cdot \Pi_\tau(n),$$
where α, β, γ are coefficients that specify the relative weight of the components.
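A small sketch of how Definition 10 could be used to produce a top-k list. The dictionaries stand in for the components Σ, Γ and Π at a fixed time τ; the function name `recommend`, the weights and all scores are invented for illustration.

```python
def recommend(user, candidates, social, content, popularity,
              alpha=0.3, beta=0.5, gamma=0.2, k=2):
    """Return the k candidate news ids with the highest score R(u, n)."""
    def r(n):
        return (alpha * social[user][n]
                + beta * content[user][n]
                + gamma * popularity[n])
    return sorted(candidates, key=r, reverse=True)[:k]


social = {"u0": {"n0": 0.0, "n1": 0.5, "n2": 0.1}}     # Sigma(u, n)
content = {"u0": {"n0": 1.0, "n1": 0.5, "n2": 0.0}}    # Gamma(u, n)
popularity = {"n0": 0.2, "n1": 0.1, "n2": 0.9}         # Pi(n)

candidates = ["n0", "n1", "n2"]
top = recommend("u0", candidates, social, content, popularity)

# Setting alpha = beta = 0 ignores the user profile and ranks by popularity only.
popular_only = recommend("u0", candidates, social, content, popularity,
                         alpha=0.0, beta=0.0, gamma=1.0, k=1)
```

With the mixed weights the content match puts n0 first, while the popularity-only setting returns n2 regardless of the user.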
conclusions
real-time web information can be leveraged to deliver relevant information
future directions
LSI analysis on entities
models for different user clusters
geographic information
summary
review concepts on query-log mining
answering queries directly with useful tips
challenges and opportunities in information dissemination
news recommendations using real-time web
many nice problems and research opportunities
thank you!
references
Anagnostopoulos, A., Becchetti, L., Castillo, C., and Gionis, A. (2010).
An optimization framework for query recommendation.
In WSDM.
Baeza-Yates, R. A., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., and Silvestri, F. (2007).
The impact of caching on search engines.
In SIGIR.
Boldi, P., Bonchi, F., Castillo, C., Donato, D., Gionis, A., and Vigna, S. (2008).
The query-flow graph: model and applications.
In Proceedings of the 17th ACM conference on Information and knowledge management (CIKM).
Bordino, I., Castillo, C., Donato, D., and Gionis, A. (2010).
Query similarity by projecting the query-flow graph.
In SIGIR.
Craswell, N. and Szummer, M. (2007).
Random walks on the click graph.
In Proceedings of the 30th annual international ACM conference on Research and development in information retrieval (SIGIR).
De Francisci Morales, G., Gionis, A., and Lucchese, C. (2012).
From chatter to headlines: Harnessing the real-time web for personalized news recommendation.
In WSDM.
Szpektor, I., Gionis, A., and Maarek, Y. (2011).
Improving recommendation for long-tail queries via templates.
In WWW.
Weber, I., Ukkonen, A., and Gionis, A. (2011).
Answers, not links: Extracting tips from Yahoo! Answers to address how-to web queries.
In CIKM.