It's all about the User...
User-driven Approaches to the Recommendation Problem
Xavier Amatriain, Telefonica Research
But first...
About me
Up until 2005
About me
2005–2007
About me
2007–...
But first...
About Telefonica and Telefonica R&D
Staff / Services / Clients / Finances / Geographies, by year:

1989: about 71,000 professionals; basic telephone and data services; about 12 million subscribers; Rev: 4,273 M€, EPS(1): 0.45 €; operations in Spain
2000: about 149,000 professionals; wireline and mobile voice, data and Internet services; about 68 million customers; Rev: 28,485 M€, EPS: 0.67 €; operations in 16 countries
2008: about 257,000 professionals; integrated ICT solutions for all customers; about 260 million customers; Rev: 57,946 M€, EPS: 1.63 €; operations in 25 countries

(1) EPS: Earnings per share
Telefonica is a fast-growing Telecom
Telco sector worldwide ranking by market cap (US$ bn)
Currently among the largest in the world
Source: Bloomberg, 06/12/09
Telefonica R&D (TID) is the Research and Development Unit of the Telefónica Group
MISSION: “To contribute to the improvement of the Telefónica Group’s competitiveness through technological innovation”
- Founded in 1988
- Largest private R&D center in Spain
- More than 1100 professionals
- Five centers in Spain and two in Latin America
Telefónica was in 2008 the first Spanish company by R&D Investment and the third in the EU
Technological Innovation: 4,384 M€
R&D (Products / Services / Processes development): 594 M€
Applied research: 61 M€
Internet Scientific Areas
Content Distribution and P2P
Next generation Managed P2P-TV
Future Internet: Content Networking
Delay Tolerant Bulk Distribution
Network Transparency
Social Networks
Information Propagation
Social Search Engines
Infrastructure for Social based cloud computing
Wireless and Mobile Systems
Wireless bundling
Device2Device Content Distribution
Large Scale mobile data analysis
Multimedia Scientific Areas
Multimedia Core
Multimedia Data Analysis, Search & Retrieval
Video, Audio, Image, Music, Text, Sensor Data
Understanding, Summarization, Visualization
Mobile and Ubicomp
Context Awareness
Urban Computing
Mobile Multimedia & Search
Wearable Physiological Monitoring
HCC
Multimodal User Interfaces
Expression, Gesture, Emotion Recognition
Personalization & Recommendation Systems
Super Telepresence
Data Mining & User Modeling Areas
DATA MINING
- Integration of statistical & knowledge-based techniques
- Stream mining
- Large-scale & distributed machine learning

USER MODELING
- Application to new services (technology for development)
- Cognitive, socio-cultural, and contextual modeling
- Behavioral user modeling (service-use patterns)

SOCIAL NETWORK ANALYSIS & BUSINESS INTELLIGENCE
- Analytical CRM
- Trend-spotting, service propagation & churn
- Social graph analysis (construction, dynamics)
Index
Now seriously, this is where the index should go!
Introduction: What areRecommender Systems?
The Age of Search has come to an end
... long live the Age of Recommendation!
Chris Anderson in “The Long Tail”“We are leaving the age of information and entering the age of recommendation”
CNN Money, “The race to create a 'smart' Google”:
“The Web, they say, is leaving the era of search and entering one of discovery. What's the difference? Search is what you do when you're looking for something. Discovery is when something wonderful that you didn't know existed, or didn't know how to ask for, finds you.”
Information overload
“People read around 10 MB worth of material a day, hear 400 MB a day, and see one MB of information every second”
The Economist, November 2006
The value of recommendations
Netflix: 2/3 of the movies rented are recommended
Google News: recommendations generate 38% more clickthrough
Amazon: 35% sales from recommendations
Choicestream: 28% of the people would buy more music if they found what they liked.
The “Recommender problem”
Estimate a utility function that can automatically predict how much a user will like an item that is unknown to them. Based on:
Past behavior
Relations to other users
Item similarity
Context
...
The “Recommender problem”
Let C be a large set of all users and let S be a large set of all possible items that can be recommended (e.g., books, movies, or restaurants).
Let u be a utility function that measures the usefulness of item s to user c, i.e., u : C × S → R, where R is a totally ordered set. Then, for each user c ∈ C, we want to choose the item s′ ∈ S that maximizes u.
The utility of an item is usually represented by a rating, but it can also be an arbitrary function, including a profit function.
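The maximization above is straightforward to state in code. A minimal sketch, with toy data and hypothetical predicted-utility values (not any particular deployed system):

```python
# A minimal sketch of the formal recommendation problem: for each user c,
# pick the item s' that maximizes an (already estimated) utility function u.

def recommend(users, items, u):
    """Return, for each user c, the item s' in S that maximizes u(c, s)."""
    return {c: max(items, key=lambda s: u(c, s)) for c in users}

# Toy utility: predicted 1-5 star ratings (hypothetical numbers).
predicted = {
    ("alice", "book"): 4.5, ("alice", "movie"): 3.0,
    ("bob", "book"): 2.0, ("bob", "movie"): 4.8,
}
u = lambda c, s: predicted[(c, s)]

print(recommend(["alice", "bob"], ["book", "movie"], u))
# {'alice': 'book', 'bob': 'movie'}
```

In practice the hard part is of course estimating u for unseen (c, s) pairs, which is what the approaches below address.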
Approaches to Recommendation
Collaborative Filtering: recommend items based only on the users' past behavior
- User-based: find users similar to me and recommend what they liked
- Item-based: find items similar to those that I have previously liked
Content-based: recommend based on features inherent to the items
Social recommendations (trust-based)
Recommendation Techniques
The Netflix Prize
500K users x 17K movie titles = 100M ratings = $1M (if you “only” improve existing system by 10%! From 0.95 to 0.85 RMSE)
49K contestants on 40K teams from 184 countries.
41K valid submissions from 5K teams; 64 submissions per day
Winning approach uses hundreds of predictors from several teams
Is this general? Why did it take so long?
What works
It depends on the domain and particular problem
However, in the general case, the best isolated approach has (currently) been shown to be CF.
Item-based CF is in general more efficient and accurate, but mixing CF approaches can improve results.
Other approaches can be hybridized to improve results in specific cases (e.g., the cold-start problem).
What matters:
Data preprocessing: outlier removal, denoising, removal of global effects (e.g. individual user's average)
“Smart” dimensionality reduction using MF such as SVD
Combining classifiers
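The first two points can be sketched in a few lines of numpy: remove one global effect (each user's average), then keep only the strongest SVD factors. The toy matrix and the choice k=1 are purely illustrative; real CF data is sparse, so production systems use iterative factorization rather than a dense SVD.

```python
import numpy as np

# Sketch of "remove global effects, then low-rank SVD" on a dense toy matrix.
R = np.array([[5, 4, 1, 1],
              [4, 5, 1, 2],
              [1, 1, 5, 4],
              [2, 1, 4, 5]], dtype=float)

user_means = R.mean(axis=1, keepdims=True)   # global effect: per-user average
U, s, Vt = np.linalg.svd(R - user_means, full_matrices=False)

k = 1                                        # keep only the strongest factor
R_hat = user_means + U[:, :k] * s[:k] @ Vt[:k, :]

print(np.round(R_hat, 1))                    # smoothed rating predictions
```

Keeping all factors would reproduce R exactly; truncating to k factors is the "smart" dimensionality reduction that filters out idiosyncratic noise.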
I like it... I like it not
Evaluating User Ratings Noise in Recommender Systems
Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver (Telefonica Research)
The Recommender Problem
Two ways to address it
1. Improve the Algorithm
The Recommender Problem
Two ways to address it
2. Improve the Input Data
Time for Data Cleaning!
User Feedback is Noisy
Natural Noise Limits our User Model
DID YOU HEAR WHAT I LIKE??!!
...and Our Prediction Accuracy
The Magic Barrier
Magic Barrier = limit on prediction accuracy due to noise in the original data
Natural Noise = involuntary noise introduced by users when giving feedback, due to (a) mistakes, and (b) lack of resolution in the personal rating scale (e.g., on a 1-to-5 scale, a 2 may mean the same as a 3 for some users and some items)
Magic Barrier >= Natural Noise Threshold: we cannot predict with less error than the resolution in the original data
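The barrier is easy to see in a quick simulation: if each observed rating is the user's true opinion plus involuntary noise, even a predictor that knows the true opinion exactly cannot beat the noise floor. The noise model below (±1 star half the time) is an assumption for illustration only:

```python
import random, math

# Toy illustration of the "magic barrier": a perfect predictor of the true
# opinion still scores RMSE equal to the std of the natural noise.
random.seed(42)

true_opinions = [random.uniform(1, 5) for _ in range(10_000)]
observed = [t + random.choice([-1, 0, 0, 1]) for t in true_opinions]  # rating noise

perfect_rmse = math.sqrt(
    sum((t - o) ** 2 for t, o in zip(true_opinions, observed)) / len(observed))
noise_std = math.sqrt(0.5)  # std of the -1/0/0/+1 noise: E[n^2] = 0.5

print(f"RMSE of a perfect predictor: {perfect_rmse:.3f} (noise floor ~{noise_std:.3f})")
```

No algorithmic improvement can push the RMSE below that floor; only cleaner input data can.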
Our related research questions
Q1. Are users inconsistent when providing explicit feedback to Recommender Systems via the common Rating procedure?
Q2. How large is the prediction error due to these inconsistencies?
Q3. What factors affect user inconsistencies?
Experimental Setup (I)
Test-retest procedure: you need at least 3 trials to separate:
Reliability: how much you can trust the instrument you are using (i.e. ratings)
  r = (r12 · r23) / r13
Stability: drift in user opinion
  s12 = r13 / r23;  s23 = r13 / r12;  s13 = r13² / (r12 · r23)
Users rated movies in 3 trials: Trial 1 <-> 24 h <-> Trial 2 <-> 15 days <-> Trial 3
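The reliability and stability formulas above can be computed directly from the three pairwise correlations between trials. A small sketch, with hypothetical ratings of the same six movies in three trials:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def reliability_and_stability(t1, t2, t3):
    """Test-retest model from the slide: three trials in, r and s out."""
    r12, r23, r13 = pearson(t1, t2), pearson(t2, t3), pearson(t1, t3)
    reliability = r12 * r23 / r13
    s12, s23 = r13 / r23, r13 / r12
    s13 = r13 ** 2 / (r12 * r23)
    return reliability, (s12, s23, s13)

# Hypothetical ratings of the same 6 movies in three trials.
t1, t2, t3 = [5, 4, 2, 1, 3, 4], [5, 4, 3, 1, 3, 4], [4, 4, 2, 1, 3, 5]
rel, stab = reliability_and_stability(t1, t2, t3)
print(f"reliability={rel:.3f}, stabilities={[round(s, 3) for s in stab]}")
```

Note the identity r · s13 = r13, which follows directly from the definitions and is a useful sanity check.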
Experimental Setup (II)
100 movies selected from the Netflix dataset by stratified random sampling on popularity
Ratings on a 1 to 5 star scale, plus a special “not seen” symbol
Trials 1 and 3 = random order; trial 2 = ordered by popularity
118 participants
Results
Comparison to Netflix Data
Distribution of the number of ratings per movie is very similar to Netflix, but the average rating is lower (users are not voluntarily choosing what to rate)
Test-retest Reliability and Stability
Overall reliability = 0.924 (good reliabilities are expected to be > 0.9)
Removing mild ratings yields higher reliabilities, while removing extreme ratings yields lower ones
Stabilities: s12 = 0.973, s23 = 0.977, and s13 = 0.951
Stabilities might also be accounting for a “learning effect” (note s12 < s23)
Users are Inconsistent
● What is the probability of making an inconsistency given an original rating
Users are Inconsistent
● What is the percentage of inconsistencies given an original rating?
Mild ratings are noisier
Negative ratings are noisier
Prediction Accuracy

● Pairwise RMSE between trials, considering the intersection (∩) and the union (∪) of both rating sets:

Trials   #Ti    #Tj    #∩     #∪     RMSE(∩)  RMSE(∪)
T1, T2   2185   1961   1838   2308   0.573    0.707
T1, T3   2185   1909   1774   2320   0.637    0.765
T2, T3   1969   1909   1730   2140   0.557    0.694

Max error is in the trials that are most distant in time
Significantly less error when the 2nd trial is involved
Significant less error when 2nd trial is involved
Algorithm Robustness to NN

● RMSE for different recommendation algorithms when predicting each of the trials:

Alg./Trial      T1      T2      T3      Tworst/Tbest
User Average    1.2011  1.1469  1.1945  4.7%
Item Average    1.0555  1.0361  1.0776  4%
User-based kNN  0.9990  0.9640  1.0171  5.5%
Item-based kNN  1.0429  1.0031  1.0417  4%
SVD             1.0244  0.9861  1.0285  4.3%

Trial 2 is consistently the least noisy
Algorithm Robustness to NN (2)

● RMSE for different recommendation algorithms when predicting ratings in one trial (testing) from ratings in another (training):

Training-Testing  T1-T2   T1-T3   T2-T3
User Average      1.1585  1.2095  1.2036
Movie Average     1.0305  1.0648  1.0637
User-based kNN    0.9693  1.0143  1.0184
Item-based kNN    1.0009  1.0406  1.0590
SVD               0.9741  1.0491  1.0118

Noise is minimized when we predict Trial 2
Let's recap
Users are inconsistent
Inconsistencies can depend on many things, including how the items are presented
Inconsistencies produce natural noise
Natural noise reduces our prediction accuracy, independently of the algorithm
Item order effect
R1 is the trial with the most inconsistencies
R3 has fewer, but not when excluding “not seen” (a learning effect improves “not seen” discrimination)
R2 minimizes inconsistencies because of its order (reducing the “contrast effect”)
User Rating Speed Effect
Evaluation time decreases as the survey progresses in R1 and R3 (users losing attention, but also learning)
In R2, evaluation time decreases until users reach the segment of “popular” movies
Rating speed is not correlated with inconsistencies
So...
What can we do?
Different proposals
To deal with noise in user feedback, we have so far proposed three different approaches:
1. Denoise user feedback using a re-rating approach (RecSys09)
2. Instead of regular users, take feedback from experts, whom we expect to be less noisy (SIGIR09)
3. Combine ensembles of datasets to identify which works best for each user (IJCAI09)
Rate it Again
Rate it Again: Increasing Recommendation Accuracy by User Re-Rating
Xavier Amatriain (with J.M. Pujol, N. Tintarev, N. Oliver)
Telefonica Research
Rate it again
By asking users to rate items again we can remove noise in the dataset. Improvements of up to 14% in accuracy!
Because we don't want all users to re-rate all items, we design ways to do partial denoising:
- Data-dependent: only denoise extreme ratings
- User-dependent: detect “noisy” users
Algorithm
Given a rating dataset where (some) items have been re-rated, we impose two fairness conditions:
1. The algorithm should remove as few ratings as possible (i.e. only when there is some certainty that the rating is only adding noise)
2. The algorithm should not make up new ratings, but decide which of the existing ones are valid
Algorithm
One source re-rating case:
Given the following milding function:
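The milding function itself appeared as a slide graphic that did not survive extraction. As a hypothetical stand-in that merely illustrates the two fairness conditions, the sketch below never invents a value and discards a rating only when the two observations clearly disagree; the `max_gap` threshold and the "keep the milder of the two" rule are assumptions, not the paper's actual function.

```python
def denoise_one_source(rating, re_rating, scale_mid=3, max_gap=1):
    """Hypothetical one-source denoising consistent with the fairness rules:
    never invent a new value; drop a rating only on clear disagreement."""
    if rating == re_rating:
        return rating                      # consistent: keep as-is
    if abs(rating - re_rating) <= max_gap:
        # small disagreement: keep the milder (closer to the scale midpoint)
        return min(rating, re_rating, key=lambda r: abs(r - scale_mid))
    return None                            # clear disagreement: discard

print(denoise_one_source(4, 4))   # 4
print(denoise_one_source(5, 4))   # 4 (milder)
print(denoise_one_source(1, 4))   # None (discarded)
```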
Results

One-source re-rating (Denoised ⊚ Denoising):

               T1⊚T2   ΔT1     T1⊚T3   ΔT1     T2⊚T3   ΔT2
User-based kNN 0.8861  11.3%   0.8960  10.3%   0.8984  6.8%
SVD            0.9121  11.0%   0.9274  9.5%    0.9159  7.1%

Two-source re-rating (denoising T1 with the other 2):

               T1 ⊚ (T2, T3)  ΔT1
User-based kNN 0.8647         13.4%
SVD            0.8800         14.1%
Denoise outliers
● Improvement in RMSE when doing one-source denoising, as a function of the percentage of denoised ratings and users: selecting only noisy users and extreme ratings
The Wisdom of the Few
A Collaborative Filtering Approach Based on Expert Opinions from the Web
Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver (Telefonica Research, Barcelona)
Neal Lathia (UCL, London)
Crowds are not always wise
Collaborative filtering is the preferred approach for Recommender Systems
Recommendations are drawn from your past behavior and that of similar users in the system
Standard CF approach:
- Find your neighbors from the set of other users
- Recommend things that your neighbors liked and you have not “seen”
Problem: predictions are based on a large dataset that is sparse and noisy
Overview of the Approach
expert = an individual that we can trust to have produced thoughtful, consistent and reliable evaluations (ratings) of items in a given domain
Expert-based Collaborative Filtering: find neighbors from a reduced set of experts instead of regular users.
1. Identify domain experts with reliable ratings
2. For each user, compute “expert neighbors”
3. Compute recommendations similar to standard kNN CF
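Steps 2 and 3 above can be sketched as a small kNN predictor over expert ratings. The similarity measure (cosine over co-rated items), the data, and all names are illustrative, not the paper's exact setup:

```python
import math

def cosine(u, e):
    """Cosine similarity over the items rated by both user and expert."""
    common = set(u) & set(e)
    if not common:
        return 0.0
    num = sum(u[i] * e[i] for i in common)
    den = (math.sqrt(sum(u[i] ** 2 for i in common))
           * math.sqrt(sum(e[i] ** 2 for i in common)))
    return num / den if den else 0.0

def predict(user_ratings, experts, item, k=2):
    """Similarity-weighted average of the k most similar experts' ratings."""
    scored = [(cosine(user_ratings, e), e) for e in experts if item in e]
    top = sorted(scored, reverse=True, key=lambda t: t[0])[:k]
    sims = sum(s for s, _ in top)
    return sum(s * e[item] for s, e in top) / sims if sims else None

experts = [{"A": 5, "B": 4, "C": 2}, {"A": 1, "B": 2, "C": 5}, {"A": 4, "B": 5, "C": 1}]
user = {"A": 5, "B": 4}            # user has not yet seen item "C"
print(predict(user, experts, "C"))
```

Structurally this is identical to standard user-based kNN CF; only the neighbor pool changes, which is what makes the approach cheap and robust.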
Advantages of the Approach
Noise: experts introduce less natural noise
Malicious ratings: the dataset can be monitored to avoid shilling
Data sparsity: a reduced set of domain experts can be motivated to rate items
Cold-start problem: experts rate items as soon as they are available
Scalability: the dataset is several orders of magnitude smaller
Privacy: recommendations can be computed locally
Mining the Web for Expert Ratings
Collections of expert ratings can be obtained almost directly on the web: we crawled the Rotten Tomatoes movie critics mash-up
Only those critics (169) with more than 250 ratings in the Netflix dataset were used
Dataset Analysis. Summary
Experts...
- are much less sparse
- rate movies all over the rating scale instead of being biased towards rating only “good” movies (different incentives)
- but seem to consistently agree on the good movies
- have a lower overall standard deviation per movie: they tend to agree more than regular users
- tend to deviate less from their personal average rating
Evaluation Procedure
Use the 169 experts to predict ratings from 10,000 users sampled from the Netflix dataset
Prediction MAE using an 80-20 holdout procedure (5-fold cross-validation)
Top-N precision by classifying items as “recommendable” given a threshold
Results show Expert CF behaves similarly to standard CF
But... we have a user study backing up the approach
User Study
57 participants, only 14.5 ratings/participant
50% of the users consider Expert-based CF to be good or very good
Expert-based CF: only algorithm with an average rating over 3 (on a 0-4 scale)
Current Work
Music recommendations (using metacritics.com), mobile geo-located recommendations...
Adaptive Data Sources
Collaborative Filtering With Adaptive Information Sources
(ITWP @ IJCAI) With Neal Lathia (UCL, London)
User modeling: experts? friends? like-minded? (similarity, trust, reputation)
Adaptive data sources
Adaptive Data sources
Given a simple, un-tuned kNN predictor and multiple information sources:
A problem: users are subjective; accuracy varies with the source
A promise: optimally assigning each user to their best source produces incredibly accurate predictions
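The routing idea can be sketched as: score every source on a user's held-out ratings and assign the user to the source with the lowest error. The predictors and data below are stand-ins for illustration only:

```python
def pick_best_source(user, sources, held_out):
    """Assign a user to the source whose predictions minimize MAE on
    that user's held-out ratings. sources: {name: predict_fn(user, item)}."""
    def mae(predict):
        errs = [abs(predict(user, item) - r) for item, r in held_out.items()]
        return sum(errs) / len(errs)
    return min(sources, key=lambda name: mae(sources[name]))

sources = {
    "experts":     lambda u, i: 4.0,   # hypothetical stand-in predictors
    "like-minded": lambda u, i: 2.5,
}
held_out = {"m1": 4, "m2": 5}          # this user's held-out ratings
print(pick_best_source("alice", sources, held_out))   # experts
```

The "promise" above corresponds to an oracle version of this selection; in practice the assignment must be learned from past behavior rather than read off held-out error.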
Conclusions
Conclusions
For many applications such as Recommender Systems (but also Search, Advertising, and even Networks) understanding data and users is vital
Algorithms can only be as good as the data they use as input
Importance of User/Data Mining is going to be a growing trend in many areas in the coming years
Thanks!
Questions?
Xavier [email protected]
xavier.amatriain.net
technocalifornia.blogspot.com
twitter.com/xamat