Transcript
Slide 1
Data Mining Algorithms for Recommendation Systems Zhenglu Yang
University of Tokyo Some slides are from online materials.
Slide 2
Applications 2
Slide 3
3
Slide 4
4 Corporate Intranets
Slide 5
System Inputs
Interaction data (users × items): explicit feedback (ratings, comments); implicit feedback (purchases, browsing).
User/item individual data. User side: structural attribute information, personal description, social network. Item side: structural attribute information, textual description/content information, taxonomy of item (category). 5
Slide 6
Interaction between Users and Items 6 Observed preferences
(Purchases, Ratings, page views, bookmarks, etc)
Slide 7
Profiles of Users and Items 7 User Profile: (1) Attribute: Nationality, Sex, Age, Hobby, etc. (2) Text: personal description. (3) Link: social network. Item Profile: (1) Attribute: Price, Weight, Color, Brand, etc. (2) Text: product description. (3) Link: taxonomy of item (category)
Slide 8
All Information about Users and Items 8 Observed preferences (purchases, ratings, page views, bookmarks, etc.) User Profile: (1) Attribute: Nationality, Sex, Age, Hobby, etc. (2) Text: personal description. (3) Link: social network. Item Profile: (1) Attribute: Price, Weight, Color, Brand, etc. (2) Text: product description. (3) Link: taxonomy of item (category)
Slide 9
KDD and Data Mining 9 Data mining is a multi-disciplinary field, drawing on artificial intelligence, statistics, machine learning, databases, and natural language processing.
Slide 10
Recommendation Approaches Collaborative filtering Using
interaction data (user-item matrix) Process: Identify similar
users, extrapolate from their ratings Content based strategies
Using profiles of users/items (features) Process: Generate
rules/classifiers that are used to classify new items Hybrid
approaches 10
Slide 11
A Brief Introduction Collaborative filtering Nearest neighbor
based Model based 11
Slide 12
Recommendation Approaches Collaborative filtering Nearest
neighbor based User based Item based Model based 12
Slide 13
User-based Collaborative Filtering Idea: people who agreed in the past are likely to agree again. To predict a user's opinion for an item, use the opinions of similar users. Similarity between users is decided by looking at their overlap in opinions for other items.
Slide 14
User-based CF (Ratings) 14
        Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
User 1     8       1       7       2       9       8
User 2     9       8       7       ?       1       2
User 3     8       9       8       9       3       1
User 4     2       1       1       2       3       1
User 5     3       1       2       3       2       2
User 6     1       2       2       1       1       1
Rating scale: 10 (good) down to 1 (bad).
Slide 15
Similarity between Users
        Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
User 2     9       8       7       ?       1       2
User 3     8       9       8       9       3       1
Only consider items both users have rated. Common similarity measures: cosine similarity, Pearson correlation. 15
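A minimal sketch, not from the slides, of these two measures computed over co-rated items only, using the User 2 and User 3 rows above (the helper names corated, cosine, and pearson are illustrative):

```python
# A rough sketch of user-user similarity restricted to co-rated items,
# using the User 2 / User 3 rows from the slide; None marks the missing rating.
import math

user2 = [9, 8, 7, None, 1, 2]
user3 = [8, 9, 8, 9, 3, 1]

def corated(u, v):
    """Keep only the items that both users have rated."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    return [a for a, _ in pairs], [b for _, b in pairs]

def cosine(u, v):
    u, v = corated(u, v)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(u, v):
    u, v = corated(u, v)
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (math.sqrt(sum((a - mu) ** 2 for a in u)) *
           math.sqrt(sum((b - mv) ** 2 for b in v)))
    return num / den

print(cosine(user2, user3))   # ~0.98
print(pearson(user2, user3))  # ~0.93
```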
Slide 16
Recommendation Approaches Collaborative filtering Nearest
neighbor based User based Item based Model based Content based
strategies Hybrid approaches 16
Slide 17
Item-based Collaborative Filtering Idea: a user is likely to
have the same opinion for similar items Similarity between items is
decided by looking at how other users have rated them 17
Slide 18
Example: Item-based CF
        Item 1  Item 2  Item 3  Item 4  Item 5
User 1     8       1       ?       2       7
User 2     2       2       5       7       5
User 3     5       4       7       4       7
User 4     7       1       7       3       8
User 5     1       7       4       6       5
User 6     8       3       8       3       7
Slide 19
Similarity between Items
        Item 3  Item 4
User 1     ?       2
User 2     5       7
User 3     7       4
User 4     7       3
User 5     4       6
User 6     8       3
Only consider users who have rated both items. Common similarity measures: cosine similarity, Pearson correlation.
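A minimal sketch, not from the slides, of the item-based idea on this table: similarity between Item 3 and Item 4 over users who rated both, followed by a simple similarity-weighted prediction of User 1's missing rating (the function names item_cosine and predict are illustrative):

```python
# A rough sketch of item-based CF on the table above; None marks the missing rating.
import math

R = [  # rows = Users 1..6, columns = Items 1..5
    [8, 1, None, 2, 7],
    [2, 2, 5, 7, 5],
    [5, 4, 7, 4, 7],
    [7, 1, 7, 3, 8],
    [1, 7, 4, 6, 5],
    [8, 3, 8, 3, 7],
]

def item_cosine(i, j):
    """Cosine similarity over users who rated both items i and j (0-based indices)."""
    pairs = [(row[i], row[j]) for row in R if row[i] is not None and row[j] is not None]
    dot = sum(a * b for a, b in pairs)
    ni = math.sqrt(sum(a * a for a, _ in pairs))
    nj = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (ni * nj)

def predict(user, item):
    """Similarity-weighted average of the user's ratings on the other items."""
    sims = [(item_cosine(item, j), R[user][j]) for j in range(len(R[0]))
            if j != item and R[user][j] is not None]
    return sum(s * r for s, r in sims) / sum(s for s, _ in sims)

print(item_cosine(2, 3))  # similarity between Item 3 and Item 4, ~0.85
print(predict(0, 2))      # predicted rating of User 1 for Item 3
```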
Slide 20
Recommendation Approaches Collaborative filtering Nearest
neighbor based Model based Matrix factorization (i.e., SVD) Content
based strategies Hybrid approaches 20
Slide 21
Singular Value Decomposition (SVD) A mathematical method applied to many problems. Given any m×n matrix R, find matrices U, Σ, and V such that R = U Σ V^T, where U is m×r and orthonormal, Σ is r×r and diagonal, and V is n×r and orthonormal. Remove the smallest singular values to obtain the rank-k approximation R_k, with k < r.
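A minimal sketch, not from the slides, of a rank-k approximation via SVD; it assumes numpy is available and, purely for illustration, imputes the unknown rating from the earlier table with 5:

```python
# A rough sketch of a rank-k approximation of a rating matrix via SVD.
import numpy as np

R = np.array([
    [8, 1, 7, 2, 9, 8],
    [9, 8, 7, 5, 1, 2],   # the "?" of User 2 / Item 4 imputed with 5
    [8, 9, 8, 9, 3, 1],
    [2, 1, 1, 2, 3, 1],
    [3, 1, 2, 3, 2, 2],
    [1, 2, 2, 1, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)   # R = U @ diag(s) @ Vt

k = 2                                              # keep the k largest singular values
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation R_k

print(np.round(R_k, 1))   # R_k[1, 3] can serve as a prediction for the missing rating
```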
Slide 90
Hierarchical Agglomerative Clustering Put every point in a cluster by itself. For i = 1 to N-1 do { let C1 and C2 be the most mergeable pair of clusters; create C1,2 as the parent of C1 and C2 }. Example: for simplicity, we use 1-dimensional objects. Numerical objects: 1, 2, 5, 6, 7. Agglomerative clustering: find the two closest objects and merge; => {1,2}, so we now have {1.5, 5, 6, 7}; => {1,2}, {5,6}, so {1.5, 5.5, 7}; => {1,2}, {{5,6},7}. 90
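A minimal sketch, not from the slides, of this agglomerative procedure on the 1-dimensional objects 1, 2, 5, 6, 7, merging the pair of clusters with the closest centroids; stopping at three clusters is an arbitrary choice for the example:

```python
# A rough sketch of agglomerative clustering on the 1-D objects 1, 2, 5, 6, 7:
# repeatedly merge the two clusters with the closest centroids.
clusters = [[1], [2], [5], [6], [7]]

def centroid(c):
    return sum(c) / len(c)

while len(clusters) > 3:   # stop at three clusters, an arbitrary choice here
    # find the pair of clusters whose centroids are closest
    i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
               key=lambda p: abs(centroid(clusters[p[0]]) - centroid(clusters[p[1]])))
    merged = clusters[i] + clusters[j]             # the new parent cluster
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    print(clusters)
# first merge: {1, 2}; second merge: {5, 6} -- matching the slide's steps
```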
Slide 91
Recommendation Approaches Collaborative filtering Content based
strategies Association Rule Mining Text similarity based Clustering
Classification Hybrid approaches 91
Slide 92
Illustrating Classification Task
Slide 93
Classification k-Nearest Neighbor (kNN) Decision Tree Naïve
Bayesian Artificial Neural Network Support Vector Machine Ensemble
methods 93
Slide 94
k-Nearest Neighbor Classification (kNN) kNN does not build a model from the training data. Approach: to classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d; count the number n of training instances in P that belong to class c_j; estimate Pr(c_j | d) as n/k (majority vote). No training is needed, but classification time is linear in the training set size for each test case. k is usually chosen empirically via a validation set or cross-validation by trying a range of k values. The distance function is crucial, but depends on the application. 94
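A minimal sketch, not from the slides, of kNN with majority voting; the feature vectors and the Car/Book/Clothes labels are invented for illustration:

```python
# A rough sketch of kNN classification with majority voting.
import math
from collections import Counter

train = [  # (feature vector, class label) -- invented training instances
    ([1.0, 0.9], "Book"), ([0.9, 1.1], "Book"),
    ([5.0, 4.8], "Car"), ([5.2, 5.1], "Car"),
    ([3.0, 3.2], "Clothes"),
]

def knn_classify(x, k=3):
    """Majority vote among the k nearest training instances (Euclidean distance)."""
    nearest = sorted(train, key=lambda t: math.dist(x, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify([1.2, 1.0], k=1))  # -> Book
print(knn_classify([4.0, 4.0], k=3))  # -> Car
```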
Slide 95
Example: k = 1 (1NN) Classes: Car, Book, Clothes. Which class for the new point? With k = 1 its nearest neighbor is a Book, so it is classified as Book. 95
Slide 96
Example: k = 3 (3NN) Classes: Car, Book, Clothes. Which class for the new point? With k = 3 the majority of its three nearest neighbors are Cars, so it is classified as Car. 96
Slide 97
Discussion Advantages: nonparametric architecture, simple, powerful, requires no training time. Disadvantages: memory intensive, classification/estimation is slow, sensitive to the choice of k. 97
Slide 98
Classification k-Nearest Neighbor (kNN) Decision Tree Naïve
Bayesian Artificial Neural Network Support Vector Machine Ensemble
methods 98
Slide 99
Example of a Decision Tree Training data with categorical attributes, a continuous attribute, and a class label. Task: judge the cheat possibility (Yes/No).
Slide 100
Example of a Decision Tree Model: decision tree built from the training data. Splitting attributes: Refund (Yes → NO; No → test MarSt), MarSt (Married → NO; Single, Divorced → test TaxInc), TaxInc (< 80K → NO; > 80K → YES). Task: judge the cheat possibility (Yes/No).
Slide 101
Another Example of Decision Tree MarSt at the root (Married → NO; Single, Divorced → test Refund), Refund (Yes → NO; No → test TaxInc), TaxInc (< 80K → NO; > 80K → YES). There could be more than one tree that fits the same data! Task: judge the cheat possibility (Yes/No).
Slide 102
Decision Tree - Construction Creating Decision Trees Manual -
Based on expert knowledge Automated - Based on training data (DM)
Two main issues: Issue #1: Which attribute to take for a split?
Issue #2: When to stop splitting?
Slide 103
Classification k-Nearest Neighbor (kNN) Decision Tree CART C4.5
Naïve Bayesian Artificial Neural Network Support Vector Machine
Ensemble methods 103
Slide 104
The CART Algorithm Classification And Regression Trees. Developed by Breiman et al. in the early 1980s. Introduced tree-based modeling into the statistical mainstream. A rigorous approach involving cross-validation to select the optimal tree. 104
Slide 105
Key Idea Recursive Partitioning Take all of your data. Consider all possible values of all variables. Select the variable/value (X = t1) that produces the greatest separation in the target; (X = t1) is called a split. If X < t1, send the data point to the left; otherwise, send it to the right. Now repeat the same process on these two nodes, and you get a tree. Note: CART only uses binary splits. 105
Slide 106
Key Idea Let Φ(s|t) be a measure of the goodness of a candidate split s at node t:
Φ(s|t) = 2 P_L P_R Σ_j |P(j|t_L) − P(j|t_R)|
where P_L and P_R are the proportions of records at t sent to the left and right child nodes, and P(j|t_L), P(j|t_R) are the proportions of class j in the left and right child. The optimal split maximizes this Φ(s|t) measure over all possible splits at node t. 106
Slide 107
Key Idea Φ(s|t) is large when both of its main components are large:
1. 2 P_L P_R: maximum value when the child nodes are of equal size (same support), e.g., 0.5 × 0.5 = 0.25 versus 0.9 × 0.1 = 0.09.
2. Q(s|t) = Σ_j |P(j|t_L) − P(j|t_R)|: maximum value when, for each class, the child nodes are completely uniform (pure). The theoretical maximum value for Q(s|t) is k, where k is the number of classes of the target variable. 107
Slide 108
CART Example 108 Training Set of Records for Classifying Credit
Risk
Slide 109
CART Example Candidate Splits 109 Candidate splits for t = root node; each split sends records satisfying the condition to the left child node t_L and the remaining records to the right child node t_R:
1: Savings = low
2: Savings = medium
3: Savings = high
4: Assets = low
5: Assets = medium
6: Assets = high
7: Income ≤ $25,000
8: Income ≤ $50,000
9: Income ≤ $75,000
CART is restricted to binary splits.
Slide 110
CART Primer Split 1 → Savings = low (records satisfying the condition go left; the rest go right). Left: records 2, 5, 7; Right: records 1, 3, 4, 6, 8. P_L = 3/8 = 0.375, P_R = 5/8 = 0.625, so 2 P_L P_R = 15/32 = 0.46875. P(j = Bad | t): P(Bad | t_L) = 2/3 ≈ 0.67, P(Bad | t_R) = 1/5 = 0.2. P(j = Good | t): P(Good | t_L) = 1/3 ≈ 0.33, P(Good | t_R) = 4/5 = 0.8. Q(s|t) = |0.67 − 0.2| + |0.8 − 0.33| = 0.934, so Φ(s|t) = 0.46875 × 0.934 ≈ 0.438. 110
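A minimal sketch, not from the slides, of the goodness measure Φ(s|t) = 2·P_L·P_R·Q(s|t), applied to candidate split 1 with the left/right record assignments shown above (the function name phi is illustrative):

```python
# A rough sketch of the CART goodness measure
# Phi(s|t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)|,
# applied to candidate split 1 (Savings = low) from the slide above.
def phi(left_labels, right_labels):
    n_l, n_r = len(left_labels), len(right_labels)
    p_l, p_r = n_l / (n_l + n_r), n_r / (n_l + n_r)
    classes = set(left_labels) | set(right_labels)
    q = sum(abs(left_labels.count(c) / n_l - right_labels.count(c) / n_r)
            for c in classes)
    return 2 * p_l * p_r * q

left = ["Bad", "Good", "Bad"]                    # records 2, 5, 7 (Savings = low)
right = ["Good", "Bad", "Good", "Good", "Good"]  # records 1, 3, 4, 6, 8
print(phi(left, right))   # ~0.4375 (the slide gets 0.4378 from rounded probabilities)
```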
Slide 111
CART Example 111 Values of the components of the measure Φ(s|t) for each candidate split on the root node:

Split  P_L    P_R    P(j|t_L)             P(j|t_R)             2 P_L P_R  Q(s|t)  Φ(s|t)
1      0.375  0.625  G: 0.333  B: 0.667   G: 0.8    B: 0.2     0.46875    0.934   0.4378
2      0.375  0.625  G: 1      B: 0       G: 0.4    B: 0.6     0.46875    1.2     0.5625
3      0.25   0.75   G: 0.5    B: 0.5     G: 0.667  B: 0.333   0.375      0.334   0.1253
4      0.25   0.75   G: 0      B: 1       G: 0.833  B: 0.167   0.375      1.667   0.6248
5      0.5    0.5    G: 0.75   B: 0.25    G: 0.5    B: 0.5     0.5        0.5     0.25
6      0.25   0.75   G: 1      B: 0       G: 0.5    B: 0.5     0.375      1       0.375
7      0.375  0.625  G: 0.333  B: 0.667   G: 0.8    B: 0.2     0.46875    0.934   0.4378
8      0.625  0.375  G: 0.4    B: 0.6     G: 1      B: 0       0.46875    1.2     0.5625
9      0.875  0.125  G: 0.571  B: 0.429   G: 1      B: 0       0.21875    0.858   0.1877

For each candidate split, examine the values of the various components of the measure Φ(s|t). Split 4 (Assets = low) has the largest Φ(s|t) and is selected for the initial split.
Slide 112
CART Example 112 CART decision tree after the initial split: Root Node (all records) splits on Assets = Low vs. Assets ∈ {Medium, High}. Assets = Low → Bad Risk (records 2, 7). Assets ∈ {Medium, High} → Decision Node A (records 1, 3, 4, 5, 6, 8).
Slide 113
CART Example 113 Values of the components of the measure Φ(s|t) for each candidate split on decision node A (records 1, 3, 4, 5, 6, 8):

Split  P_L    P_R    P(j|t_L)             P(j|t_R)             2 P_L P_R  Q(s|t)  Φ(s|t)
1      0.167  0.833  G: 1      B: 0       G: 0.8    B: 0.2     0.2782     0.4     0.1112
2      0.5    0.5    G: 1      B: 0       G: 0.667  B: 0.333   0.5        0.6666  0.3333
3      0.333  0.667  G: 0.5    B: 0.5     G: 1      B: 0       0.4444     1       0.4444
5      0.667  0.333  G: 0.75   B: 0.25    G: 1      B: 0       0.4444     0.5     0.2222
6      0.333  0.667  G: 1      B: 0       G: 0.75   B: 0.25    0.4444     0.5     0.2222
7      0.333  0.667  G: 0.5    B: 0.5     G: 1      B: 0       0.4444     1       0.4444
8      0.5    0.5    G: 0.667  B: 0.333   G: 1      B: 0       0.5        0.6666  0.3333
9      0.833  0.167  G: 0.8    B: 0.2     G: 1      B: 0       0.2782     0.4     0.1112

Splits 3 (Savings = high) and 7 tie for the largest Φ(s|t) = 0.4444; the split Savings = high is used for decision node A.
Slide 114
CART Example 114 CART decision tree after the decision node A split: Root Node (all records): Assets = Low → Bad Risk (records 2, 7); Assets ∈ {Medium, High} → Decision Node A (records 1, 3, 4, 5, 6, 8). Decision Node A: Savings = High → Decision Node B (records 3, 6); Savings ∈ {Low, Medium} → Good Risk (records 1, 4, 5, 8).
Slide 115
CART Example 115 CART decision tree, fully grown form: Root Node (all records): Assets = Low → Bad Risk (records 2, 7); Assets ∈ {Medium, High} → Decision Node A (records 1, 3, 4, 5, 6, 8). Decision Node A: Savings = High → Decision Node B (records 3, 6); Savings ∈ {Low, Medium} → Good Risk (records 1, 4, 5, 8). Decision Node B: Assets = Medium → Bad Risk (record 3); Assets = High → Good Risk (record 6).
Slide 116
Classification k-Nearest Neighbor (kNN) Decision Tree CART C4.5
Naïve Bayesian Artificial Neural Network Support Vector Machine
Ensemble methods 116
Slide 117
The C4.5 Algorithm Proposed by Quinlan in 1993 An internal node
represents a test on an attribute. A branch represents an outcome
of the test, e.g., Color=red. A leaf node represents a class label
or class label distribution. At each node, one attribute is chosen
to split training examples into distinct classes as much as
possible A new case is classified by following a matching path to a
leaf node. 117
Slide 118
The C4.5 Algorithm Differences between CART and C4.5: unlike CART, the C4.5 algorithm is not restricted to binary splits; it produces a separate branch for each value of a categorical attribute. C4.5's method for measuring node homogeneity is also different from CART's. 118
Slide 119
The C4.5 Algorithm - Measure We have a candidate split S, which partitions the training data set T into several subsets, T_1, T_2, ..., T_k. C4.5 uses the concept of entropy reduction (information gain) to select the optimal split:
entropy_reduction(S) = H(T) − H_S(T)
where the entropy of a set X is H(X) = − Σ_i p_i log2(p_i), with p_i the proportion of records in X belonging to class i, and
H_S(T) = Σ_{i=1..k} (|T_i| / |T|) H(T_i)
is the weighted sum of the entropies of the individual subsets T_1, T_2, ..., T_k. C4.5 chooses the optimal split as the one with the greatest entropy reduction. 119
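A minimal sketch, not from the slides, of entropy reduction for a candidate split; the parent labels and subsets are invented for illustration:

```python
# A rough sketch of C4.5-style entropy reduction (information gain) for a split.
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_i p_i log2(p_i), p_i = proportion of records in class i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def entropy_reduction(parent, subsets):
    """gain(S) = H(T) - sum_i (|T_i|/|T|) H(T_i)."""
    n = len(parent)
    weighted = sum(len(t) / n * entropy(t) for t in subsets)
    return entropy(parent) - weighted

# Invented example: 8 records split on a 3-valued attribute
parent = ["Good"] * 5 + ["Bad"] * 3
subsets = [["Good", "Good"], ["Good", "Good", "Good", "Bad"], ["Bad", "Bad"]]
print(entropy_reduction(parent, subsets))   # ~0.55 bits
```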
Slide 120
Classification k-Nearest Neighbor (kNN) Decision Tree Naïve
Bayesian Artificial Neural Network Support Vector Machine Ensemble
methods 120
Slide 121
Bayes' Rule The recommender-system question: L_i is the event that the user likes item i; A is the set of features associated with item i. Estimate p(L_i | A):
p(L_i | A) = p(A | L_i) p(L_i) / p(A)
We can always restate a conditional probability in terms of the reverse condition p(A | L_i) and two prior probabilities, p(L_i) and p(A). Often the reverse condition is easier to know: we can count how often a feature appears in items the user liked (a frequentist assumption). 121
Slide 122
Naïve Bayes Independence (the Naïve Bayes assumption): the features a_1, a_2, ..., a_k are independent. For the joint probability: p(a_1, ..., a_k) = Π_i p(a_i). For the conditional probability: p(a_1, ..., a_k | L) = Π_i p(a_i | L). Combined with Bayes' rule, this gives p(L | a_1, ..., a_k) ∝ p(L) Π_i p(a_i | L). 122
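A minimal sketch, not from the slides, of the resulting classification rule p(L | a_1, ..., a_k) ∝ p(L) Π_i p(a_i | L) on invented binary features; the Laplace smoothing is an addition, not something the slides mention:

```python
# A rough sketch of Naive Bayes: prior times product of per-feature likelihoods.
train = [  # (binary item features, class: t = user likes it, f = user does not)
    ({"comedy": 1, "long": 0}, "t"),
    ({"comedy": 1, "long": 1}, "t"),
    ({"comedy": 0, "long": 1}, "f"),
    ({"comedy": 0, "long": 0}, "f"),
    ({"comedy": 1, "long": 0}, "t"),
]

def naive_bayes(features):
    scores = {}
    for c in {label for _, label in train}:
        items = [f for f, label in train if label == c]
        score = len(items) / len(train)                   # prior p(L)
        for name, value in features.items():             # likelihoods p(a_i | L)
            count = sum(1 for f in items if f[name] == value)
            score *= (count + 1) / (len(items) + 2)       # Laplace smoothing
        scores[c] = score
    return max(scores, key=scores.get)

print(naive_bayes({"comedy": 1, "long": 0}))   # -> "t"
```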
Slide 123
An Example Compute all probabilities required for
classification 123
Slide 124
An Example For class C = t we compute p(C = t) Π_i p(a_i | C = t), and likewise for class C = f. The value for C = t is larger, so C = t is more probable and t is the final class. 124
Slide 125
Naïve Bayesian Classifier Advantages: easy to implement; very efficient; good results obtained in many applications. Disadvantages: it assumes class-conditional independence, so accuracy is lost when the assumption is seriously violated (highly correlated features). 125
Slide 126
Classification K-Nearest Neighbor (kNN) Decision Tree Naïve
Bayesian Artificial Neural Network Support Vector Machine Ensemble
methods 126
Slide 127
References for Machine Learning T. Mitchell, Machine Learning,
McGraw Hill, 1997 C. M. Bishop, Pattern Recognition and Machine
Learning, Springer, 2006 T. Hastie, R. Tibshirani and J. Friedman,
The Elements of Statistical Learning, Springer, 2001. V. Vapnik,
Statistical Learning Theory, Wiley-Interscience, 1998. Y.
Kodratoff, R. S. Michalski, Machine Learning: An Artificial
Intelligence Approach, Volume III, Morgan Kaufmann, 1990 127
Slide 128
Recommendation Approaches Collaborative filtering Nearest
neighbor based Model based Content based strategies Association
Rule Mining Text similarity based Clustering Classification Hybrid
approaches 128
Slide 129
The Netflix Prize Slides here are from Yehuda Koren.
Slide 130
Netflix Movie rentals by DVD (mail) and online (streaming). 100k movies, 10 million customers. Ships 1.9 million discs to customers each day from 50 warehouses in the US: a complex logistics problem. Employees: 2,000, but relatively few in engineering/software, and only a few people working on recommender systems. Moving towards online delivery of content. Significant interaction of customers with the Web site. 130
Slide 131
The $1 Million Question 131
Slide 132
Million Dollars Awarded Sept 21st, 2009 132
Slide 133
133
Slide 134
Lessons Learned Scale is important, e.g., stochastic gradient descent on sparse matrices. Latent factor models work well on this problem; previously they had not been explored for recommender systems. Understanding your data is important, e.g., time effects. Combining models works surprisingly well, but the final 10% improvement can probably be achieved by judiciously combining about 10 models rather than 1000s; this is likely what Netflix will do in practice. 134
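A minimal sketch, not Netflix's actual system, of a latent factor model trained by stochastic gradient descent on sparse ratings; the toy ratings, learning rate, and regularization constant are invented values:

```python
# A rough sketch of SGD-trained matrix factorization on a few (user, item, rating) triples.
import random

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]  # (user, item, rating)
n_users, n_items, k = 3, 2, 2
lr, reg, epochs = 0.02, 0.02, 500

random.seed(0)
P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]  # user factors
Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]  # item factors

for _ in range(epochs):
    for u, i, r in ratings:
        pred = sum(P[u][f] * Q[i][f] for f in range(k))
        err = r - pred
        for f in range(k):  # gradient step on both factor vectors, with regularization
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)

print(sum(P[0][f] * Q[0][f] for f in range(k)))  # should approach the observed 5.0
```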
Slide 135
Useful References Y. Koren, Collaborative filtering with temporal dynamics, ACM SIGKDD Conference, 2009. Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, IEEE Computer, 2009. Y. Koren, Factor in the neighbors: scalable and accurate collaborative filtering, ACM Transactions on Knowledge Discovery from Data, 2010. 135