Toward a New Protocol to Evaluate Recommender Systems
Frank Meyer, Françoise Fessant, Fabrice Clerot, Eric Gaussier
University Joseph Fourier & Orange
RecSys 2012 – Workshop on Recommendation Utility Evaluation
2012 – v1.18
Summary
Introduction
1. Industrial tasks for recommender systems
2. Industrial (offline) protocol
3. Main results
Conclusion and future work
Introduction
Recommender systems
For industrial applications
Amazon, Google News, YouTube (Google), ContentWise, BeeHive (IBM),...
as well as for well-known academic systems
Fab, More, Twittomender,...
recommendation is multi-faceted
pushing items, sorting items, linking items...
and cannot be reduced to predicting a rating score of interest of a user u for an item i.
What is a good recommender system?
Just a system accurate at rating prediction for the top N blockbusters and the top M big users?
... or something else?
1. Industrial tasks for recommender systems
Industrial point of view
Main goals of automatic recommendation:
to increase sales
to increase the audience (click rates...)
to increase customer satisfaction and loyalty
Main needs (analysis at Orange: TV, Video on Demand, shows, web radios,...)
1. helping all the users: big users and small users
2. recommending all the items: frequently purchased/viewed items, rarely purchased/viewed items
3. helping users with different identified problems:
1. should I take this item?
2. should I take this item or that one?
3. what should interest me in this catalog?
4. what is similar to this item?
We propose 4 key functions
Help to Explore (navigate)
Given an item i used as a context, give N items similar to i.
Help to Decide
Given a user u and an item i, give a predicted score of interest of u for i (a rating).
Help to Compare
Given a user u and a list of items i1,…,in, sort the items in decreasing order of the score of interest for u.
Help to Discover
Given a user u, give N interesting items for u.
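As an illustration, these four functions could be exposed behind a single interface; a minimal Python sketch (the class and method names are ours, not from the deck):

```python
from abc import ABC, abstractmethod

class Recommender(ABC):
    """Illustrative interface for the four key functions (names are assumptions)."""

    @abstractmethod
    def decide(self, user_id: int, item_id: int) -> float:
        """Help to Decide: predicted score of interest (a rating) of user u for item i."""

    def compare(self, user_id: int, item_ids: list[int]) -> list[int]:
        """Help to Compare: sort items by decreasing predicted interest for user u."""
        return sorted(item_ids, key=lambda i: self.decide(user_id, i), reverse=True)

    @abstractmethod
    def discover(self, user_id: int, n: int) -> list[int]:
        """Help to Discover: N interesting items for user u."""

    @abstractmethod
    def explore(self, item_id: int, n: int) -> list[int]:
        """Help to Explore: N items similar to item i."""
```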
Decide / Compare / Discover / Explore: quality criteria and measures

Decide: the rating prediction must be precise, and extreme errors must be penalized because they more often lead to a wrong decision.
Measure: RMSE (existing).

Compare: the ranking prediction must be good for any pair of items of the catalog (not only for a Top N).
Measure: NDPM (existing), or the number of compatible orders.

Discover: the recommendation must be useful. Problem: recommending only well-known blockbusters (e.g. Star Wars, Titanic...) is precise but not useful!
Measure: Precision (existing); we introduce the Impact measure.

Explore: problem: semantic relevance cannot be evaluated without user feedback.
We introduce a validation method for a similarity measure.
2. Industrial (offline) protocol
Known vs. Unknown, Risky vs. Safe
Recommending an item to a user can be placed along two axes: the probability that the user already knows the item, and the probability that the user likes the item.
Unknown and disliked: very bad recommendation. The user does not know the item: if he trusts the system, he will be misled.
Known and disliked: bad recommendation, but the item is generally known by name by the user, which limits the damage.
Known and liked: trivial recommendation, correct but not often useful.
Unknown and liked: very good recommendation: this is Help to Discover.
Measuring the Help to Discover: the Average Measure of Impact (AMI)
Recommendation impact:
recommending a popular item: slightly negative if the user dislikes it, slightly positive if the user likes it;
recommending a rare, unknown item: strongly negative if the user dislikes it, strongly positive if the user likes it.
The impact of a recommended item combines the rarity of the item (which stands for the probability that the user does not already know it) with the relative rating of the user u (the rating compared to her mean of ratings, which stands for the probability that the user likes it), normalized by the size of the catalog. The AMI averages this impact over the list Z of recommended items that actually appear in the list H of (u, i, r) logs of the Test Set. A hedged sketch follows below.
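The slide gives the ingredients of the measure but not the exact published formula, so the rarity term below is our assumption; a minimal sketch:

```python
import numpy as np

def average_measure_of_impact(recommended, test_logs, item_counts, user_means):
    """Hedged sketch of the Average Measure of Impact (AMI).

    recommended : dict userID -> set of recommended itemIDs (list Z)
    test_logs   : iterable of (userID, itemID, rating) triples (list H)
    item_counts : dict itemID -> number of ratings in the Learn set
    user_means  : dict userID -> mean rating of that user
    """
    total_ratings = sum(item_counts.values())
    impacts = []
    for user, item, rating in test_logs:
        if item in recommended.get(user, ()):
            # Rarity ~ probability the user does not already know the item
            # (assumption: one minus the item's share of all ratings).
            rarity = 1.0 - item_counts.get(item, 0) / total_ratings
            # Relative rating ~ probability the user likes the item.
            relative_rating = rating - user_means[user]
            impacts.append(rarity * relative_rating)
    return float(np.mean(impacts)) if impacts else 0.0
```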
Principle of the protocol
Logs of (userID, itemID, rating) triples are split into a Learn set and a Test set; a model is built on the Learn set. Three measures are then computed:
RMSE: for each (userID, itemID) in Test, generate a rating prediction and compare it with the true rating.
%COMP (% compatible): for each list of itemIDs of each userID in Test, sort the list according to the ratings and compare the strict orders of the ratings with the order given by the model.
AMI: for each userID in Test, generate a list of recommended items; for each of these items actually rated by the userID in Test, evaluate the relevance.
Datasets used: MovieLens 1M and Netflix. No long-tail distribution was detected in the Netflix nor in the MovieLens dataset, so we use the simplest segmentation, according to the mean number of ratings: light/heavy users, popular/unpopular items (simple mean-based item/user segmentation). A sketch of the protocol follows below.
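A condensed sketch of this protocol (pandas-based; the userID/itemID/rating column names follow the slide, the function names are ours):

```python
import numpy as np
import pandas as pd

def split_logs(logs: pd.DataFrame, test_frac: float = 0.2, seed: int = 0):
    """Split the (userID, itemID, rating) logs into a Learn set and a Test set."""
    test = logs.sample(frac=test_frac, random_state=seed)
    learn = logs.drop(test.index)
    return learn, test

def mean_based_segments(learn: pd.DataFrame):
    """Heavy/light users and popular/unpopular items, split at the mean number of ratings."""
    user_counts = learn.groupby("userID").size()
    item_counts = learn.groupby("itemID").size()
    heavy_users = set(user_counts[user_counts >= user_counts.mean()].index)
    popular_items = set(item_counts[item_counts >= item_counts.mean()].index)
    return heavy_users, popular_items

def rmse_per_segment(test: pd.DataFrame, predict, heavy_users, popular_items):
    """Global RMSE and RMSE on the four user/item segments, for a predict(u, i) model."""
    test = test.assign(pred=[predict(u, i) for u, i in zip(test.userID, test.itemID)])
    rmse = lambda df: float(np.sqrt(((df.pred - df.rating) ** 2).mean()))
    hu = test.userID.isin(heavy_users)
    pi = test.itemID.isin(popular_items)
    return {
        "global": rmse(test),
        "Huser Pitem": rmse(test[hu & pi]),
        "Luser Pitem": rmse(test[~hu & pi]),
        "Huser Uitem": rmse(test[hu & ~pi]),
        "Luser Uitem": rmse(test[~hu & ~pi]),
    }
```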
We will use 4 algorithms to validate the protocol
Uniform Random Predictor
Returns a rating between 1 and 5 (the min and max) drawn from a uniform distribution.
Default Predictor: (mean of item + mean of user) / 2
Robust mean of the item: requires at least 10 ratings on the item, otherwise only the user's mean is used (see the sketch after this list).
K-Nearest-Neighbor item method
Uses K nearest neighbors per item, a scoring method detailed below, and a similarity measure called Weighted Pearson. Falls back on the Default Predictor when an item cannot be predicted.
• Ref: Candillier, L., Meyer, F., Fessant, F. (2008). Designing Specific Weighted Similarity Measures to Improve Collaborative Filtering Systems. ICDM 2008: 242-255.
Fast factorization method
Fast matrix factorization algorithm with F factors, known as Gravity ("BRISMF" implementation).
• Ref: Takács, G., Pilászy, I., Németh, B., Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656.
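For instance, the Default Predictor fits in a few lines; a sketch following the slide's description (the 10-rating threshold is from the slide, the class name is ours):

```python
import pandas as pd

class DefaultPredictor:
    """Sketch of the Default Predictor: (item mean + user mean) / 2,
    with a robustness threshold of 10 ratings on the item."""

    def __init__(self, learn: pd.DataFrame, min_item_ratings: int = 10):
        self.user_mean = learn.groupby("userID").rating.mean()
        stats = learn.groupby("itemID").rating.agg(["mean", "count"])
        self.item_mean = stats["mean"][stats["count"] >= min_item_ratings]
        self.global_mean = learn.rating.mean()

    def predict(self, user_id, item_id) -> float:
        u = self.user_mean.get(user_id, self.global_mean)
        if item_id in self.item_mean.index:
            return (self.item_mean[item_id] + u) / 2.0
        return u  # fewer than 10 ratings on the item: use only the user's mean
```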
What about "Help to Explore"?
How can we compare the "semantic quality" of the link between two items?
Principle
Define a similarity measure that can be extracted from the model;
use the similarity measure to build an item-item similarity matrix;
use the similarity matrix as the model of a recommender system based on a KNN item-item model;
if this system obtains good performances for RMSE, %COMP and AMI, then the semantic quality of the similarity measure must be good.
Application (a sketch of the factor-based case follows below)
For a KNN-item model this is immediate (there is an intrinsic similarity);
for a matrix factorization model, we can use a similarity measure (such as Pearson) computed on the items' factors;
for a random rating predictor, this is not applicable...
for a mean-based rating predictor, this is not applicable...
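A sketch of the factor-based case: Pearson similarity between items' factor vectors, followed by a top-K neighbor search (the function name and return format are ours):

```python
import numpy as np

def item_similarity_from_factors(item_factors: np.ndarray, k: int = 100):
    """Build a top-K item-item similarity model from a factorized model.

    item_factors: (num_items, num_factors) matrix of items' factors.
    Returns the full Pearson similarity matrix and, per item, the indices
    of its K most similar items (one possible choice of similarity).
    """
    centered = item_factors - item_factors.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # guard against constant rows
    unit = centered / norms
    sim = unit @ unit.T                          # Pearson correlation matrix
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity
    neighbors = np.argsort(-sim, axis=1)[:, :k]  # K nearest neighbors per item
    return sim, neighbors
```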
Evaluating "Help to Explore" for Gravity
1. Gravity (fast matrix factorization) factorizes the items × users matrix of ratings (rows of items, columns of users) into a matrix of items' factors and a matrix of users' factors (the latter is not used here).
2. Item similarities are computed on the matrix of items' factors, followed by a K-nearest-neighbor search, yielding an item-item similarity matrix (KNN) that serves as the model of a KNN-based recommender system.
3. The quality of this similarity matrix can then be evaluated via RMSE, %COMP, AMI...
3. Main results
Finding 1: different performances according to the segments
[Plots: RMSE for KNN on Netflix (x-axis: number of nearest neighbors) and RMSE for Gravity on Netflix (x-axis: number of factors), each showing the global RMSE, the average RMSE of the Default Predictor, and the RMSE of the 4 analyzed segments: heavy users / popular items (Huser Pitem), light users / popular items (Luser Pitem), heavy users / unpopular items (Huser Uitem), light users / unpopular items (Luser Uitem).]
We observe a decrease in performance of more than 25% between the heavy-user / popular-item segment and the light-user / unpopular-item segment.
Finding 2: RMSE not strictly linked to the other performances
Example on 2 segments...
[Plots: ranking compatibility (%COMP) for Gravity on Netflix and RMSE for Gravity on Netflix, as a function of the number of factors, for the global measure, the Default Predictor baseline, and the 4 segments.]
The light-user / popular-item segment is easier to optimize than the light-user / unpopular-item segment for RMSE, but the two segments are equally difficult to optimize for ranking.
Finding 2 (continued): RMSE not strictly linked to the other performances
[Plots: Average Measure of Impact on Netflix for Random Pred, Default Pred, KNN (K=100) and Gravity (F=32); RMSE for KNN on Netflix, globally and per segment.]
Globally, Gravity is better than KNN for RMSE, but worse than KNN for the Average Measure of Impact.
Global results
Help to Decide / Compare / Discover
Gravity dominates for the RMSE measure.
KNN dominates on the heavy-user segments.
The Default Predictor is very useful for the unpopular (i.e. infrequent) item segments.
Comparing native similarities with Gravity-based similarities
Similarities are measured by applying a Pearson similarity on the items' factors given by Gravity (16 factors):
1. KNN item-item can be performed on a factorized matrix with little performance loss (and faster!).
2. Gravity can be used for the "Help to Explore" function.

                                 Native KNN        KNN computed on Gravity's item factors
                                 (K=100)           (K=100, 16 factors)
RMSE                             0.8440            0.8691
Ranking: % compatible            77.03%            75.67%
Precision                        91.90%            86.39%
AMI                              2.043             2.025
Global time of the modeling task 5290 seconds      3758 seconds
Conclusion and future work
Conclusion: contributions
As industrial recommendation is multi-faceted, we proposed to list the key functions of recommendation:
• Help to Decide, Help to Compare, Help to Discover, Help to Explore
• note on Help to Explore: the similarity feature is mandatory for a recommender system
We proposed a dual segmentation of items and users:
• just being very accurate on big users and blockbuster items is not very useful
For a new offline protocol to evaluate recommender systems, we proposed to combine the recommender's key functions with the dual segmentation:
• mapping key functions to measures
• adding the Impact measure to evaluate the "Help to Discover" function
• adding a method to evaluate the "Help to Explore" function
We demonstrated its utility:
• RMSE (Decide) is not strictly linked to the quality of the other functions (Compare, Discover, Explore), so it is very dangerous to evaluate a recommender system with RMSE only (no guarantee on the other measures!)
• the mapping of the best algorithm for each (function, segment) pair could be exploited to improve the global performances
• we also saw empirically that the KNN approach can be virtualized, computing the similarities between items on a factorized space built, for instance, by Gravity
Future work: 3 main axes
1. Evaluation of the quality of the 4 core functions using an online A/B testing protocol
2. Hybrid switch system: the best algorithm for the task at hand according to the user/item segment
3. KNN virtualization via matrix factorization
Annexes
About this work...
Frank Meyer: Recommender Systems in Industrial Contexts. CoRR abs/1203.4487 (2012).
Frank Meyer, Françoise Fessant, Fabrice Clérot, Eric Gaussier: Toward a New Protocol to Evaluate Recommender Systems. Workshop on Recommendation Utility Evaluation, RecSys 2012, Dublin.
Frank Meyer, Françoise Fessant: Reperio: A Generic and Flexible Industrial Recommender System. Web Intelligence 2011: 502-505, Lyon.
Classic mathematical representation of the recommendation problem
[Ratings matrix: thousands of items i1, ..., ik, ..., im in rows; thousands of users u1, u2, ..., ul, ..., un in columns. Cells contain known ratings of interest (e.g. 1-5) and unknown ratings of interest ("?") to predict.]
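In practice only a small fraction of this matrix is filled, so it is stored sparsely; a toy illustration with SciPy (the values are made up):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy items-by-users rating matrix: rows are items, columns are users;
# stored values are known ratings (1-5), absent entries are the "?" to predict.
rows = np.array([0, 0, 1, 1, 2])   # item indices
cols = np.array([0, 3, 1, 2, 0])   # user indices
vals = np.array([4, 5, 4, 5, 3])   # known ratings of interest
R = csr_matrix((vals, (rows, cols)), shape=(3, 4))

print(R.toarray())  # zeros stand for the missing ratings a recommender must predict
```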
Well-known industrial example: Item-to-Items recommendation (Amazon™)
Multi-faceted analysis: measures
RMSE: root mean squared error between the predicted ratings and the real ratings, computed over the number of logs in the Test Set: RMSE = sqrt( (1/|Test|) * sum over logs of (predicted rating - real rating)^2 ).
NDPM: based, on a same dataset and a same user, on the number of contradictory orders, the number of compatible orders, and the number of strict orders given by the user; the % of compatible orders (%COMP) is directly usable.
Precision: proportion of relevant items among the recommended items actually evaluable in the Test Set.
AMI: the Average Measure of Impact (defined earlier). An illustrative computation of the % of compatible orders follows below.
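A sketch of the % of compatible orders, enumerating per user all item pairs with strictly different true ratings (the function name and argument layout are ours):

```python
from itertools import combinations

def percent_compatible(test_logs, predict):
    """Sketch of the %COMP measure: among all item pairs strictly ordered by a
    user's true ratings, the share whose predictions preserve that order."""
    by_user = {}
    for user, item, rating in test_logs:
        by_user.setdefault(user, []).append((item, rating))
    compatible = strict = 0
    for user, rated in by_user.items():
        for (i1, r1), (i2, r2) in combinations(rated, 2):
            if r1 == r2:
                continue                    # only strict orders given by the user count
            strict += 1
            p1, p2 = predict(user, i1), predict(user, i2)
            if (p1 - p2) * (r1 - r2) > 0:   # the model preserves the user's order
                compatible += 1
    return 100.0 * compatible / strict if strict else 0.0
```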
Reperio C-V5, centralized mode: example of a movie recommender
Reperio E-V2, embedded mode: example of a TV program recommender