Post on 15-Apr-2017
transcript
Recommender Systems (RS) and Active Learning (AL)
Neil Rubens & Dain Kaplan
June 2016
Outline
• Value of RS
• RS Methods
• RS Objectives for Startups
• AL for RS
Value of Recommender Systems
Why Recommender Systems?
[Diagram: the system mediates between a user and items, like a store clerk advising a customer]
RS Objectives
• User Objectives
  • finding needed items
  • value
  • utility
  • enjoyment
  • novelty
  • serendipity
  • etc.
• System Objectives
  • revenue
  • profit
  • promoting partners
  • # of users
  • # of visits
  • time spent
  • etc.
Objectives may overlap!
Value of RS
• Amazon: 35% of sales come from recommendations
• Netflix: 2/3 of the movies watched are recommended
• ChoiceStream: 28% of people would buy more music if they found what they liked
• Google News: recommendations generate 38% more click-throughs
www.slideshare.net/kerveros99/machine-learning-for-recommender-systems-mlss-2015-sydney
RS Methods
What's an RS?
[Diagram: the system takes Sarah's ratings of items (Like/Hate) and predicts her ratings for the unrated items]
Common Approach
• Assumption: preferences of "similar" items/users stay similar
• Similarity: can be defined in a variety of ways
• Use ratings to estimate "similarity"
Collaborative Filtering (CF)
Users × Items → Ratings (Love / Like / Okay / Dislike / Hate)
• User-based CF: users with similar dis/likes are similar, e.g. if Sarah and you have similar tastes, then anything that Sarah likes, you will too (and vice versa)
• Item-based CF: items that users rate similarly are similar, e.g., if you liked book A, you will also like a book B that received similar ratings
https://buildingrecommenders.wordpress.com/2015/11/23/overview-of-recommender-algorithms-part-5/
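As a concrete sketch of user-based CF (the matrix, user names, and numbers below are invented illustrations, not from any real system): ratings sit in a user × item matrix, similarity is computed over co-rated items, and a missing rating is predicted as a similarity-weighted average.

```python
import numpy as np

# Toy user x item rating matrix (1-5; 0 = not rated). Illustrative only.
R = np.array([
    [5, 4, 0, 1],   # Sarah
    [4, 5, 1, 0],   # you
    [1, 0, 5, 4],   # a user with opposite tastes
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity restricted to items both users rated."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    return float(a[mask] @ b[mask] /
                 (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask])))

def predict(R, user, item):
    """Predict a missing rating as a similarity-weighted average of
    other users' ratings for that item (user-based CF)."""
    weights = np.array([
        cosine_sim(R[user], R[v]) if v != user and R[v, item] > 0 else 0.0
        for v in range(R.shape[0])
    ])
    return float(weights @ R[:, item] / weights.sum()) if weights.sum() else 0.0

# "You" haven't rated item 3; Sarah (very similar to you) rated it 1,
# so the prediction is pulled toward a low rating.
print(round(predict(R, user=1, item=3), 2))
```

Item-based CF follows the same pattern with the matrix transposed: compute similarity between item columns instead of user rows.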
MODEL
MODEL (USER) BASED ESTIMATION
MODEL (ITEM) BASED ESTIMATION
ACTIVE LEARNING
Gediminas Adomavicius, Alexander Tuzhilin, "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions", IEEE Transactions on Knowledge & Data Engineering, vol. 17, no. 6, pp. 734-749, June 2005, doi:10.1109/TKDE.2005.99
VARIETY OF RS APPROACHES
Tailored to:
• domains
• item types
• data types
• objectives
• etc.
RS Objectives for Startups
Established Companies: "cruise mode"
• Many existing loyal users
• RS used to increase per-user metrics, e.g. revenue, profit, etc.
Startups: "launch mode"
• Still building a user base
• RS used to attract/retain new users
Startups = Growth
"The only essential thing is growth. Everything else we associate with startups follows from growth."
(Paul Graham, Y Combinator)
Expecting many:
• new users
• new items

"Cold Start" Problem
• With CF, the RS needs user/item rating data to make recommendations
• For new users/new items, no data is available yet:
  • new item problem
  • new user problem
New Item Problem
• Problem: the item has no reviews yet (to base recommendations on)
• Solution: use content-based item similarity (to bootstrap recommendations)
E.g., similar sneakers:
• Jordan Jumpman Team II
• Air Jordan 1 Retro High Nouveau
• Hurley One And Only Printed
• Air Jordan 1 Retro High OG
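Content-based bootstrapping can be sketched as follows: describe each item by attributes and rank the catalogue by attribute similarity to the brand-new item. The attribute vectors below are invented for illustration; a real system would use richer features such as TF-IDF over item descriptions.

```python
import numpy as np

# Hypothetical attribute vectors (e.g. brand=Jordan, high-top, retro, printed);
# the values are invented for illustration.
items = {
    "Air Jordan 1 Retro High OG":      [1, 1, 1, 0],
    "Air Jordan 1 Retro High Nouveau": [1, 1, 1, 0],
    "Jordan Jumpman Team II":          [1, 1, 0, 0],
    "Hurley One And Only Printed":     [0, 0, 0, 1],
}

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(new_item_vec, items, k=2):
    """Rank existing (already-rated) items by content similarity to a
    new item that has no ratings yet."""
    scores = {name: cosine(new_item_vec, v) for name, v in items.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# A new retro high-top Jordan lands closest to the two Air Jordan 1s:
print(most_similar([1, 1, 1, 0], items))
```

Ratings of the nearest neighbours can then stand in for the new item's missing ratings until real feedback arrives.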
New User Problem
• Very important to make a good first impression: a bad first impression may lose a potential user
• Problem: can't make personalised recommendations (no data on the user yet)

Importance of Good Recommendations
[Image: an absurd recommendation — "Seriously?"]
Learning New User Preferences
• Talking: learn about the user implicitly/explicitly
• Stalking: obtain data indirectly

Indirect Data
• Contacts: friends may already be users of the app (likely to have similar interests)
• Location
• Device type
• Social profile
NOTE: should not be intrusive

System Interaction Data
• How: learn about the user through implicit/explicit interaction
  • clicks (or their absence)
  • duration
  • navigation paths
  • etc.
• What: make interaction more informative via item selection
  • position
  • attributes
  • grouping
Active Learning (AL) for Recommender Systems

Item Selection
• RS presents items for two primary purposes:
  • Recommend an item that a user will like: popular items, i.e., ones everyone likes (but these provide little info about the user's preferences)
  • Present an item to learn about the user's preferences (Active Learning, AL): contentious items, i.e., ones many people like / dislike (informative about the user's preferences)
• In practice, multiple items are shown for different objectives
AL Categories
• Item-based AL: analyse items and select those that seem most informative
• Model-based AL: analyse the model and select items that seem most informative

Item Categories
• Popular: rated by many users [Rashid 2002]
• High Variance in Ratings: items that people either like or hate [Rashid 2002]
• Best/Worst: ask the user which items s/he likes most/least [Leino & Raiha 2007]
• Influential: items on which ratings of many other items depend (representative + not represented) [Rubens & Sugiyama 2007]
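The High-Variance heuristic can be sketched in a few lines (the rating lists below are invented toy data): score each candidate by the variance of its existing ratings, discarding items with too few ratings to trust the estimate.

```python
import numpy as np

# Toy rating histories (1-5); values are invented for illustration.
ratings = {
    "blockbuster": [5, 5, 4, 5, 4, 5],   # popular and uniformly liked
    "contentious": [5, 1, 5, 1, 5, 1],   # people either love or hate it
    "obscure":     [3],                  # barely rated
}

def informativeness(rs, min_ratings=3):
    """Variance of existing ratings; items with fewer than min_ratings
    score 0, since their variance estimate is unreliable."""
    return float(np.var(rs)) if len(rs) >= min_ratings else 0.0

best = max(ratings, key=lambda name: informativeness(ratings[name]))
print(best)  # -> contentious: love-it-or-hate-it items probe preferences best
```

The uniformly liked blockbuster scores low here, matching the point above: everyone-likes-it items reveal little about an individual user.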
Item-based AL
[Figure: two scatter plots over inputs 1 and 2 showing candidate points (a)-(d), test points (unrated), training points, and a 1-5 ratings colour map; actual ratings unknown]
• 3R Properties:
  • Represented: is the point already represented by the existing training set? E.g., (b) is already represented
  • Representative: is it representative of others? E.g., (a) is not
  • Results: does it result in achieving the objective? E.g., (d) → max coverage
[Rubens & Kaplan, 2010]
Figure 1: Active Learning: illustrative example (see Section 1.2). [Two panels over inputs 1 and 2 showing test points (unrated), training points (colour may differ), and a 1-5 ratings colour map; actual ratings unknown.]
already possible from the training point in the same area (refer to the chart on the left). If training point (c) is selected, we are able to make new predictions, but only for the other three points in this area, which happen to be Zombie movies. By selecting training point (d), we are able to make predictions for a large number of test points that are in the same area, which belong to Comedy movies. Thus selecting (d) is the ideal choice, because it allows us to improve the accuracy of predictions the most (for the highest number of test points).
1.3 Types of Active Learning
AL methods presented in this chapter have been categorized based on our interpretation of their primary motivation/goal. It is important to note, however, that various ways of classification may exist for a given method, e.g. sampling close to a decision boundary may be considered Output Uncertainty-based since the outputs are unknown, Parameter-based because the point will alter the model, or even Decision boundary-based because the boundary lines will shift as a result. However, since the sampling is performed with regard to decision boundaries, we would consider this the primary motivation of this method and classify it as such.
In addition to our categorization by primary motivation (Section 1), we further subclassify a method's algorithms into two commonly classified types for easier comprehension: instance-based and model-based.
Instance-based Methods: A method of this type selects points based on their properties in an attempt to predict the user's ratings by finding the closest match to other users in the system, without explicit knowledge of the underlying model. Other common names for this type include memory-based, lazy learning, case-based, and non-parametric (Adomavicius & Tuzhilin, 2005). We assume that any existing data is accessible, as well as rating predictions from the underlying model.
Model-based Methods: A method of this type selects points in an attempt to best construct a model that explains the data supplied by the user in order to predict user ratings (Adomavicius & Tuzhilin, 2005). These points are also selected to maximize the reduction of expected error of the model. We assume that in addition to any data available to instance-based methods, the model and its parameters are also available.
Modes of Active Learning: Batch and Sequential. Because users typically want to see the system output something interesting immediately, a common approach is to recompute a user's predicted ratings after they have rated a single item, in a sequential manner. It is also possible, however, to allow a user to rate several items, or several features of an item, before readjusting the model. Selecting training points sequentially has the advantage of allowing the system to react to the data provided by users and make the necessary adjustments immediately, though this comes at the cost of an interaction with the user at each step. Thus a trade-off exists between batch and sequential AL: the usefulness of the data vs. the number of interactions with the user.
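The sequential mode described above can be sketched as a loop that, after each interaction, picks the next most informative unrated item. Here a simple variance heuristic stands in for the informativeness measure, and the data, the user's hidden tastes, and the chance that the user declines to rate are all simulated assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data: per-item rating histories and the user's true (hidden) tastes.
item_ratings = {i: rng.integers(1, 6, size=12).tolist() for i in range(6)}
user_truth = {i: int(rng.integers(1, 6)) for i in range(6)}

def ask_user(item):
    """One interaction; the user may not know the item and decline to rate."""
    return user_truth[item] if rng.random() > 0.2 else None

elicited, candidates = {}, set(item_ratings)
for _ in range(3):                                # one interaction per step
    item = max(candidates, key=lambda i: np.var(item_ratings[i]))
    candidates.remove(item)
    rating = ask_user(item)
    if rating is not None:                        # react to new data immediately
        elicited[item] = rating                   # (a real RS would also update its model here)

print(elicited)
```

A batch variant would instead collect several ratings before the update step, trading immediacy of adaptation for fewer interruptions.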
2 Properties of Data Points
When considering any Active Learning method, the following three factors should always be considered in order to maximize the effectiveness of a given point. Supplementary explanations are given below for the first two. Examples refer to the Illustrative Example (Figure 1).
(R1) Represented: Is it already represented by the existing training set? E.g. point (b).
Footnote 4: This may be dependent on the specific prediction method used in the RS.
Illustrative Example: movies are clustered by genre
Item Selection: Learning User Preferences
[Scatter plots over features X1 and X2; ratings: positive/negative]

Simply Not Useful: limited information due to few items

User Satisfaction Drawback: the system gains limited knowledge; the user sees not much variety and may get bored

Coverage Drawback: the user is exposed to disliked items
Prediction Accuracy
[Figure: decision boundary of the actual model vs. the boundaries learned via random sampling and active learning]

[Figure: starting from an initial model, selected points can improve the margin/confidence or improve the orientation of the decision boundary]
AL Model Error
Existing Approaches: Parameter Uncertainty AL

Let $g$ be the optimal function (in the solution space), $\hat{f}$ the learned function, and $\hat{f}_i$ the functions learned from slightly different training sets. The generalization error decomposes as

$E_G = B + V + C$, where
$B = \left(\mathbb{E}\hat{f}(x) - g(x)\right)^2$,
$V = \mathbb{E}\left(\hat{f}(x) - \mathbb{E}\hat{f}(x)\right)^2$,
$C = \left(g(x) - f(x)\right)^2$.

• Model error C: constant, and is ignored.
• Bias B: hard to estimate, but is assumed to vanish (asymptotically).
• Variance V: estimate and minimize.
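The variance term V can be estimated empirically. As a sketch under invented toy assumptions (a linear model fit to noisy samples of a known optimal function g), we retrain on many perturbed training sets and measure the spread of predictions at a query point:

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: 2.0 * x + 1.0          # "optimal" function (known only in this toy setup)
x_query = 0.5

def train_once():
    """Fit a line to a fresh noisy training set; return its prediction at x_query."""
    x = rng.uniform(-1, 1, size=20)
    y = g(x) + rng.normal(scale=0.5, size=20)
    slope, intercept = np.polyfit(x, y, 1)
    return slope * x_query + intercept

preds = np.array([train_once() for _ in range(500)])
V = float(np.var(preds))                        # variance term: what AL tries to shrink
B = float((preds.mean() - g(x_query)) ** 2)     # squared bias: small for this unbiased fit
print(V, B)
```

This mirrors the decomposition above: the spread of the $\hat{f}_i$ across resampled training sets is V, while the squared offset of their mean from $g$ is B.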
It is clearly shown in the table that different strategies can improve different aspects of the recommendation quality. In terms of rating prediction accuracy (MAE/RMSE), various strategies have shown excellent performance. While some of these strategies are easy to implement (e.g., Entropy0 and Log(popularity)*Entropy), others are more complex and use more sophisticated machine learning algorithms (e.g., Decision Tree, and Personality-based FM). Strategies that have shown excellent performance in terms of ranking quality (NDCG/MAP) are the Representative-based and Voting strategies. In terms of precision, prediction-based strategies (Highest-predicted and Binary-predicted) have shown excellent performance. In terms of the number of ratings acquired (# Ratings), as expected, strategies that consider the popularity of items (Popularity and Entropy0) acquire the largest number of ratings. But other strategies that maximize the chance that the selected items are familiar to the user (Item-item and Personality-based) can also elicit a considerable number of ratings. For these strategies the success ratio (#acquired_ratings/#requested_items) is the largest. This is an important factor, since strategies that focus only on the informativeness of the items may fail to actually acquire ratings, by selecting obscure items that users do not know and cannot rate.
Table 1: Performance comparison of active learning strategies ("XX" Very Good, "X" Good, " " Poor, "-" Not Available)
ML: MovieLens, NF: Netflix, EM: EachMovie, AWM: Active Web Museum, MP: MyPersonality, STS: South Tyrol Suggests, LF: Last.fm
Columns: Type | Strategy | MAE/RMSE | NDCG/MAP | Precision | # Ratings | Online Eval. | Offline Eval. | Compar. Strategies | Datasets

Non-Personalized, Single:
uncertainty based: 1. variance [59, 61] X - - - - y 2, 4, 6, 9, 24 AWM, EM
2. entropy [20, 67] - - - - y 3, 6, 8, 9, 11, 13, 22 EM
3. entropy0 [67] XX - - XX y y 2, 6, 8, 11, 13, 22 ML
error reduction 4. greedy extend [68] X - - - - y 2, 3, 6, 7, 10, 11 NF
5. representative [69] - XX XX - - y 6 NF, ML, LF
attention based 6. popularity [20, 67] X - - XX y y 2, 8, 9, 11, 13, 22 ML
7. co-coverage [68] - - - - y 2, 3, 4, 6, 10, 11 NF
Non-Personalized, Combined:
static combin.
8. rand-pop [20, 67] - - y y 2, 3, 6, 11, 13, 22 ML
9. log(pop)*entropy [20] XX - - X y y 3, 6, 8, 13 ML
10. sqrt(pop)*var [68] X - - - - y 2, 3, 4, 6, 7, 11 NF
11. HELF [67] XX - - y y 2, 3, 6, 8, 13, 22 ML
12. non-pers-part rand. [11] X XX X - y 1, 6, 9, 12, 14, 20, 21, 28, 29 ML, NF
Personalized, Single:
acquisition prob.: 13. item-item [20, 67] - - XX y y 2, 3, 6, 8, 9, 11, 22 ML
14. binary-pred [11, 12] X XX X - y 1, 6, 9, 12, 20, 21, 28, 29 ML, NF
15. personality-based [70, 97] XX XX - XX y y 3, 9, 14 STS, MP
16. impact analysis [71] XX - - - - y 9 ML
prediction based
17. aspect model [72, 73] X - - - - y 2 EM, ML
18. min rating [74] X - - - - y 19,25 ML
19. min norm [74] - - - - y 18,25 ML
20. highest-pred [11, 12] X XX X - y 1, 6, 9, 12, 14, 21, 28, 29 ML, NF
21. lowest-pred [11, 12] X X - y 1, 6, 9, 12, 14, 20, 28, 29 ML, NF
user partitioning 22. IGCN [67] XX - - X y y 2, 3, 6, 8, 11, 13 ML
23. decision tree [64] XX - - - - y 3, 4, 10, 11 NF
Personalized, Combined:
static combin.
24. influence based [61] XX - - - - y 1, 4, 6, 9 ML
25. non-myopic [74] X - - - - y 18, 19 ML
26. treeU [75] X - - - - y 23, 27 ML, EM, NF
27. fMF [75] XX - - - - y 23, 26 ML, EM, NF
28. pers-partially rand. [11] X XX X - y 1, 6, 9, 12, 14, 20, 21, 28, 29 ML, NF
29. voting [11, 12] XX XX - y 1, 6, 9, 12, 14, 20, 21, 28 ML, NF
adaptive combin. 30. switching [76] XX XX - XX - y 9, 20, 29 ML
Mehdi Elahi, Francesco Ricci, Neil Rubens, A survey of active learning in collaborative filtering recommender systems, Computer Science Review, Elsevier, 2016.
Active Learning Strategies
Figure 4: Classification of active learning strategies in collaborative filtering
• personalized
  • combined-heuristic
    • adaptive combination: switching [76]
    • static combination: voting [11], partially rand [11], fMF [75], treeU [75], non-myopic [74], influence-uncertainty [61]
  • single-heuristic
    • user partitioning: decision tree [64], IGCN [67]
    • prediction based: lowest pred [11, 12], highest pred [11, 12], min norm [74], min rating [74], FMM [72], aspect model [72, 73]
    • impact based: impact analysis [71], influence based [61]
    • acquisition prob.: personality-based [70], binary-pred [11, 12], item-item [20]
• non-personalized
  • combined-heuristic
    • static combination: partially-rand [11, 12], HELF [67], sqrt(pop)*variance [68], log(pop)*entropy [20], rand-popularity [20]
  • single-heuristic
    • attention-based: co-coverage [68], popularity [20, 60]
    • error reduction: representative-based [69], greedy extend [68]
    • uncertainty reduction: entropy0 [67], entropy [59, 20], variance [59]
MANY AL-RS APPROACHES
Tailored to:
• different objectives
• different data & settings
http://www.win.tue.nl/~eknutov/gaf.html
RS Complexity
• RS is composed of many modules that need tuning to achieve high performance
Take-home Messages
• RS shows users items they want
• RS accounts for a large portion of purchases
• RS methods: user/item-based
• RS is crucial for user growth, addressing new items/users ("cold start") with:
  • indirect data acquisition
  • content-based item similarity
  • informative item selection with AL
• Many RS components can be tuned to achieve high performance