Real-Time Processing of Data Streams
(Verarbeitung von Datenströmen in Echtzeit)
Tobias Heintz (1), Benjamin Kille (2)
(1) plista GmbH
(2) Technische Universität Berlin
September 26, 2014
Table of Contents
Introduction
Recommender Systems
  Unpersonalised Recommendation
  Collaborative Filtering
  Content-based Filtering
  Evaluation
News Recommendation
Big Data Issues
Who are we?
- Tobias Heintz, plista GmbH
- Benjamin Kille, Technische Universität Berlin
plista GmbH
Pioneers for targeted advertisement and content distribution.
- founded 31 July, 2008
- incorporated into the WPP Group as of 1 January, 2014
- headquarters in Berlin, Germany
- 120 employees (30 % R&D)
Technische Universität Berlin
- >30 000 enrolled students
- 331 professors
- >2600 researchers
What problems do we address?
Recommender Systems
We will introduce recommender systems, discuss a variety of algorithms, and explore how to evaluate recommender systems.
News
We will talk about specific challenges when recommending news; we will illustrate issues that arise when systems fail to build comprehensive user profiles; and we will depict how news evolving over time affect recommender systems.
Big Data
We will exemplify in what way news represent a source of big data; we will introduce a system which grants researchers access to big data; and we will show how you can compete with your own approaches.
Why are these problems important?
Users increasingly face information overload as they interact with item collections. For instance:
- >43 000 000 songs on Apple’s iTunes
- 100 h of video are uploaded to YouTube every minute
- >3 000 000 movies on IMDb
- ...
Collections continue to grow, causing even more severe information overload. The same holds for news articles.
Problem definition
Users have insufficient time and cognitive capacity to iterate over the full collection. Recommender systems support users in filtering collections, and they differ with respect to the method they use to filter. More formally, a general-purpose recommender system is a triple (U, I, φ):

U → set of users {u1, u2, . . . , uM}
I → set of items {i1, i2, . . . , iN}
φ → a filter function

The performance of different recommendation algorithms typically depends on φ.
Filter Functions
Filter functions take a user u, the entire item collection I, and a model M. They return a subset of items to be recommended, I*.

φ(u, I, M) = I*

A recommender system’s success or failure strongly depends on the model M, in particular on how accurately it reflects actual user preferences. M may take various kinds of input, as we will discuss for a selection of recommendation algorithms.
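The triple above can be sketched as a tiny interface. Everything here is an illustrative assumption: `model` is any callable that scores a (user, item) pair, and the toy model simply prefers items whose id is close to the user’s.

```python
# Minimal sketch of a recommender as a triple (U, I, phi).
# The callable `model` plays the role of M; names are illustrative.

def phi(user, items, model, top_k=3):
    """Filter function: score every item in I for `user` and return the subset I*."""
    ranked = sorted(items, key=lambda item: model(user, item), reverse=True)
    return ranked[:top_k]

# Toy model M: prefer items whose id is numerically close to the user's id.
model = lambda user, item: -abs(user - item)

print(phi(user=5, items=[1, 4, 6, 9], model=model, top_k=2))  # [4, 6]
```

The point is only the shape of φ: it consumes u, I, and M, and emits the recommended subset I*.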
Random Recommendation
M takes the item collection and selects items randomly.
Most-Popular Recommendation
M orders the item collection according to the number of interactions each item received, K ≥ L ≥ M ≥ N.
[Figure: items ranked by interaction count, from K interactions at the top down to N interactions at the bottom.]
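Most-popular recommendation can be sketched in a few lines; the interaction log format, a list of (user, item) pairs, is an assumption for illustration.

```python
from collections import Counter

# Most-popular recommendation on a toy interaction log.
interactions = [
    ("anna", "cars"), ("bob", "cars"), ("clara", "cars"),
    ("anna", "elektra"), ("bob", "elektra"),
    ("dan", "aviator"),
]

def most_popular(interactions, n=2):
    """M is simply a counter: rank items by their number of interactions."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(n)]

print(most_popular(interactions))  # ['cars', 'elektra']
```

Updating M amounts to incrementing a counter, which is why unpersonalised recommenders are so cheap to run.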
Summary: Unpersonalised Recommenders
Advantages
- low computational complexity
- easy to update M
- domain independent
Disadvantages
- disregard personal taste
- disregard context
- high chance of recommending already-known or unpopular items
Collaborative Filtering
Basic Assumptions
- systems have access to users’ preferences
- users with similar tastes in the past will continue to like similar items
- systems have means to compare users’ tastes
Distinctions
- model-based vs memory-based
- item-based vs user-based
Example
[Figure: users Anna, Bob, Clara, and Dan connected to the movies they liked (Aviator, Bad Boys, Cars, District 9, Elektra). Reading off the edges yields the user profiles, e.g. Anna: [Bad Boys, District 9, Elektra], Bob: [Aviator, Bad Boys, District 9, Elektra], Clara: [Cars, District 9, Elektra], Dan: [Aviator].]
Preference Elicitation
Explicit Preferences
- Likes
- Thumbs Up/Down
- Ratings
- Comments
- Purchases
Implicit Preferences
- Clicks
- Dwell Time
- Returns
How can we measure whether users like items, and how much they do?
Collaborative Filtering Algorithms with Ratings
Memory-based
The algorithm uses the complete set of data in the recommendation process; M contains the full rating matrix.
- user-based k-nearest neighbours
- item-based k-nearest neighbours
Model-based
The algorithm learns a model M and uses it to recommend items.
- matrix factorisation with ALS (alternating least squares)
- matrix factorisation with SGD (stochastic gradient descent)
User-based k-nearest Neighbour
Input: M × N rating matrix R, similarity measure σ(u, v)
[Figure: the rating matrix R as a table of users (Anna, Bob, Clara, Dan) × movies (Aviator, Bad Boys, Cars, District 9, Elektra), with 1 marking a liked movie and 0 an unliked one.]
Similarity Measures
Number of items in common:
σ(u, v) = Σ_{i∈I} I(i),  where I(i) = 1 if both u and v liked i, and 0 otherwise
Cosine similarity:
σ(u, v) = (u · v) / (||u|| ||v||)
Pearson’s correlation coefficient:
σ(u, v) = cov(u, v) / (std(u) std(v))
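The three similarity measures can be sketched directly on binary preference vectors. The vectors for Anna and Bob below are illustrative, ordered as (Aviator, Bad Boys, Cars, District 9, Elektra).

```python
import math

def overlap(u, v):
    """Number of items both users liked."""
    return sum(1 for a, b in zip(u, v) if a and b)

def cosine(u, v):
    """sigma(u, v) = (u . v) / (||u|| ||v||)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(u, v):
    """sigma(u, v) = cov(u, v) / (std(u) std(v))"""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u) / n)
    sv = math.sqrt(sum((b - mv) ** 2 for b in v) / n)
    return cov / (su * sv) if su and sv else 0.0

anna = [0, 1, 0, 1, 1]  # Bad Boys, District 9, Elektra
bob  = [1, 1, 0, 1, 1]  # Aviator, Bad Boys, District 9, Elektra
print(overlap(anna, bob), round(cosine(anna, bob), 3))  # 3 0.866
```

With binary vectors, cosine similarity is the overlap normalised by the profile sizes, which is why Anna and Bob come out highly similar here.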
User-based k-nearest Neighbour
Input: M × N rating matrix R, similarity measure σ(u, v)
[Figure: the user–user similarity matrix (Anna, Bob, Clara, Dan on both axes) with ones on the diagonal; off-diagonal cells such as sim(Anna, Bob) = sim(Bob, Anna) are filled in symmetrically, yielding Anna’s similarity vector [1, sBob, sClara, sDan].]
User-based k-nearest Neighbour
Input: M × N rating matrix R, similarity measure σ(u, v)
[Figure: the rating matrix (Anna, Bob, Clara, Dan × Aviator, Bad Boys, Cars, District 9, Elektra) with a ? marking an unknown preference to be predicted.]
User-based k-nearest Neighbour
Recommendation procedure
user profile:
u = (r(i1), r(i2), . . . , r(iN))
similarity vector:
σ(u, ·) = (σ(u, v1), σ(u, v2), . . . , σ(u, u), . . . , σ(u, vM))
preference prediction (similarity-weighted sum of the other users’ ratings):
r̂(u, j) = Σ_v σ(u, v) · r(v, j)
Result
We obtain a prediction for each item’s preference and can rank items accordingly. The algorithm returns as many items as requested, starting from the top rank.
Item-based k-nearest Neighbour
Input: M × N rating matrix R, similarity measure σ(i , j)
[Figure: the same rating matrix, now read column-wise: each movie’s column of 0/1 entries forms its item profile.]
Similarity Measures
Number of users in common:
σ(i, j) = Σ_{u∈U} I(u),  where I(u) = 1 if both i and j are liked by u, and 0 otherwise
Cosine similarity:
σ(i, j) = (i · j) / (||i|| ||j||)
Pearson’s correlation coefficient:
σ(i, j) = cov(i, j) / (std(i) std(j))
Item-based k-nearest Neighbour
Input: M × N rating matrix R, similarity measure σ(i , j)
[Figure: the item–item similarity matrix (Aviator, Bad Boys, Cars, District 9, Elektra on both axes) with ones on the diagonal and symmetric entries such as sim(Aviator, Bad Boys) = sim(Bad Boys, Aviator); followed by the rating matrix with a ? marking the preference to be predicted.]
Item-based k-nearest Neighbour
Recommendation procedure
item profile:
i = (r(u1), r(u2), . . . , r(uM))
similarity vector:
σ(i, ·) = (σ(i, j1), σ(i, j2), . . . , σ(i, i), . . . , σ(i, jN))
preference prediction (similarity-weighted sum over the items the user rated):
r̂(u, i) = Σ_j σ(i, j) · r(u, j)
Result
We obtain a prediction for each item’s preference and can rank items accordingly. The algorithm returns as many items as requested, starting from the top rank.
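The item-based variant can be sketched the same way: score an item by summing its similarity to the items the user already liked. The item columns and the overlap similarity below are illustrative assumptions.

```python
# Item profiles read column-wise from the toy rating matrix: set of users who liked each movie.
cols = {
    "aviator":    {"bob", "dan"},
    "bad boys":   {"anna", "bob"},
    "cars":       {"clara"},
    "district 9": {"anna", "bob", "clara"},
    "elektra":    {"anna", "bob", "clara"},
}

def overlap(i, j):
    """Number of users who liked both items."""
    return len(cols[i] & cols[j])

def score(user_items, i):
    """r_hat(u, i) = sum over j in the user's profile of sigma(i, j)."""
    return sum(overlap(i, j) for j in user_items if j != i)

anna = {"bad boys", "district 9", "elektra"}
print(score(anna, "aviator"), score(anna, "cars"))  # 3 2
```

For Anna, Aviator outscores Cars because Aviator co-occurs with more of the movies she already liked.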
Matrix Factorisation
Input: M × N rating matrix R
R = [sparse binary rating matrix: 1 where a user expressed a preference, blank where the preference is unknown]

Goal
Fill the gaps of missing preferences.
Matrix Factorisation
Idea
Project preferences into a low-dimensional space to detect latent structures.

[R]_{M×N} ≈ [P]_{M×K} [Q]ᵀ_{N×K},  with K ≪ M, N

Problem
How do we determine P and Q?
Matrix Factorisation
Learning P and Q
Input: an error metric, e.g.

E(P, Q, R) = Σ_{(u,i)∈R} (r(u, i) − P_u Q_iᵀ)²   (quadratic error)

E(P, Q, R) = Σ_{(u,i)∈R} |r(u, i) − P_u Q_iᵀ|   (absolute error)
Matrix Factorisation
Stochastic Gradient Descent
Optimise the error metric by selecting data points at random:
- initialise P, Q with small random values
- pick a preference (u, i) at random
- determine the gradient at that point
- adjust P, Q accordingly
- continue
Alternating Least Squares
Optimise either P or Q while keeping the other fixed:
- initialise P, Q with small random values
- optimise the error metric with respect to P
- optimise the error metric with respect to Q
- continue
Summary: Collaborative Filtering
Advantages
- takes personal taste into account
- successful in the Netflix Prize competition
- domain-independent
Disadvantages
- cold-start problem
- sparsity
- grey sheep
Cold-Start Problem
- users without known preferences
- items without preferences
- similarity measures fail
- inconclusive latent factors
Grey Sheep
- users who rate all their items as average
- user profile: [3, 3, 3, 3, . . . , 3]
- collaborative systems cannot distinguish good from bad items
Content-based Filtering
Idea
Suggest items which are similar to items users have liked.
Similarity
- based on content → features
- depends on the domain
Content-based Filtering
Input: user profile, item collection, item features, and similaritymeasure
Features
- Name/ID
- Meta data
- Content
  - audio stream → songs
  - video stream → movies
  - text → books, news articles
[Figure: the CBF component computes sim(i, j) between item features.]
Content-based Filtering
Similarity: Examples
- keyword overlap → text
- average colour match → images/video
- maximum amplitude → audio/sound
- common actors → movies
- common interests → friends/partnerships
Summary: Content-based Filtering
Advantages
- considers personal taste
- high transparency: users can anticipate why items are suggested
Disadvantages
- computationally expensive for high-volume content, e.g., video
- low serendipity
- user cold-start
Evaluation
Important aspects
- how well does the system predict preferences?
- how often do users receive useful suggestions?
- how long does it take for the system to provide suggestions?
- how many requests cannot be answered?
- how often do users return to the site?
- how often do users purchase/rent/consume items which the system had recommended?
- how well did users perceive the system?
Evaluation: Rating Prediction
Goal
The evaluation ought to show how well the system estimates preferences.
Assumptions
- the system can access recorded explicit numerical preferences
- tastes remain stable over time
- the more accurately the system estimates preferences, the better suited the suggestions
Metrics
- root mean squared error: RMSE = √( (1/|R|) Σ_{(u,i)∈R} (r̂(u, i) − r(u, i))² )
- mean absolute error: MAE = (1/|R|) Σ_{(u,i)∈R} |r̂(u, i) − r(u, i)|
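Both metrics can be sketched directly on a toy list of (predicted, actual) rating pairs; the numbers are illustrative.

```python
import math

# (predicted, actual) rating pairs.
pairs = [(4.2, 4.0), (3.0, 2.0), (5.0, 4.5)]

# RMSE penalises large errors more strongly than MAE.
rmse = math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))
mae = sum(abs(p - a) for p, a in pairs) / len(pairs)
print(round(rmse, 4), round(mae, 4))  # 0.6557 0.5667
```

The middle pair (error 1.0) dominates the RMSE because of the squaring, which is why RMSE exceeds MAE here.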
Evaluation: Ranking
Goal
The evaluation ought to show how well the system ranks items according to users’ preferences.
Assumptions
- the system can access preference relations between items
- tastes remain stable over time
- the better the system ranks items, the better suited the suggestions
Metrics
- normalised discounted cumulative gain: nDCG = DCG / IDCG
- mean reciprocal rank: MRR = (1/|U|) Σ_{u∈U} 1/rank_u
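Both ranking metrics can be sketched on toy data. The relevance grades and ranks below are illustrative; the logarithmic discount base 2 is a common convention, assumed here.

```python
import math

def dcg(rels):
    """Discounted cumulative gain for relevances in ranked order (position 0 = top)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(rels))

def ndcg(rels):
    """DCG normalised by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

def mrr(first_hit_ranks):
    """Mean reciprocal rank over users; ranks are 1-based."""
    return sum(1.0 / r for r in first_hit_ranks) / len(first_hit_ranks)

# A ranking that puts a relevance-0 item above a relevance-2 item loses some gain.
print(round(ndcg([3, 0, 2]), 4), round(mrr([1, 2, 4]), 4))
```

nDCG rewards placing highly relevant items early; MRR only looks at where each user’s first relevant item appears.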
Evaluation: Top-N
Goal
The evaluation ought to show how well the system selects the top suggestions.
Assumptions
- the system can access preference relations between items
- tastes remain stable over time
- the better the system selects the top suggestions, the better suited they are
Metrics
- precision@N = TP / (TP + FP)
- recall@N = TP / (TP + FN)
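Precision@N and recall@N can be sketched on a toy top-N list; the recommended and relevant sets below are illustrative.

```python
# Top-3 suggestions produced by a hypothetical system.
recommended = ["cars", "elektra", "aviator"]
# Items the user actually likes (the relevant set).
relevant = {"cars", "aviator", "bad boys", "district 9"}

hits = [i for i in recommended if i in relevant]  # true positives
precision = len(hits) / len(recommended)          # TP / (TP + FP)
recall = len(hits) / len(relevant)                # TP / (TP + FN)
print(round(precision, 4), recall)  # 0.6667 0.5
```

Precision asks how much of the list was useful; recall asks how much of the user’s taste the list covered.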
Evaluation: Problems
- explicit preferences may not be available
- tastes change over time
- recorded data do not fully reflect the current situation
Solution
Access real systems with current user interactions to see whether a method performs better than the existing one → second part of the tutorial
Summary: Recommender Systems
- support users by suggesting interesting items
- counteract information overload
- unpersonalised recommenders
- collaborative filtering
  - user-based k-nearest neighbours
  - item-based k-nearest neighbours
  - matrix factorisation
- content-based filtering
- evaluation is still difficult
News Recommendation: Special Characteristics
Collection Dynamics
- thousands of new articles published daily
- older articles’ relevance decays
Contextual Differences
- users perceive recommendations differently
- devices render recommendations differently
- dependence on time of day and day of week
Popularity Bias
- few items receive a lot of attention
- most items receive hardly any attention
News Recommendation: Collection Dynamics
[Figure: daily counts of articles entering and exiting the collection between October and January, on the order of 500 to 2000 per day.]
News Recommendation: Contextual Differences
[Figure: heatmaps of interaction intensity by hour of day (0 to 18+) and weekday (Mon to Sun), split by device type: desktop, phone, and tablet; intensity scale 0.000 to 0.014.]
News Recommendation: Popularity Bias
[Figure: log-log histograms of interactions per item for news (frequencies up to 10^4, interaction counts up to 10^6) and for movies (frequencies up to 10^2, interaction counts up to 10^4); in both domains, few items accumulate many interactions while most receive very few.]
Big Data
Goal
Intelligent real-time processing of huge amounts of data. Recommender systems → personalisation
- volume → the amount of data to be stored increases
- variety → heterogeneous data
- velocity → data streams in (near) real-time
- veracity → noisy data
Big Data
Do news recommendations fulfil the requirements of big data?
Volume
hundreds of GB every day ✓
Variety
news entail textual data and images, inducing some variety
Velocity
news arise continuously → second part of the tutorial ✓
Veracity
news have some consistent attributes (headline, text), but also comprise features which are missing or wrong (date, location, image)
Questions?
Thank you for your attention! We hope you enjoyed the first part of the tutorial! There is more (practical content) to come in the second part!