Matrix Factorizations for Recommender Systems


Dmitriy Selivanov
selivanov.dmitriy@gmail.com

2017-11-16

Recommender systems are everywhere

[Figures 1-4: examples of recommender systems]

Goals

Propose "relevant" items to customers:

- Retention
- Exploration
- Up-sale
- Personalized offers

- Recommend items for a customer given a history of activities (transactions, browsing history, favourites)
- Similar items
  - substitutions
  - bundles - frequently bought together
  - ...

Live demo

Dataset - LastFM-360K:

- 360k users
- 160k artists
- 17M observations
- sparsity - 0.9999999


Explicit feedback

Ratings, likes/dislikes, purchases:

- cleaner data
- smaller
- hard to collect

$$\mathrm{RMSE}^2 = \frac{1}{|D|} \sum_{u,i \in D} (r_{ui} - \hat{r}_{ui})^2$$

Netflix prize

- ~480k users, 18k movies, 100M ratings
- sparsity ~ 90%
- goal is to reduce RMSE by 10% - from 0.9514 to 0.8563

Implicit feedback

- noisy feedback (clicks, likes, purchases, searches, ...)
- much easier to collect
- wider user/item coverage
- usually sparsity > 99.9%

One-Class Collaborative Filtering

- Observed entries are positive preferences
  - should have high confidence
- Missing entries in the matrix are a mix of negative and positive preferences
  - consider them as negative with low confidence
  - we cannot really distinguish whether a user did not click a banner because of a lack of interest or a lack of awareness

Evaluation

Recap: we only care about producing a small set of highly relevant items. RMSE is a bad metric - it has a very weak connection to business goals.

We only care about the relevance (precision) of retrieved items:

- space on the screen is limited
- only the order matters - the most relevant items should be at the top

Ranking - Mean average precision

$$\mathrm{AveragePrecision} = \frac{\sum_{k=1}^{n} P(k) \times \mathrm{rel}(k)}{\text{number of relevant documents}}$$

##    index relevant precision_at_k
## 1:     1        0      0.0000000
## 2:     2        0      0.0000000
## 3:     3        1      0.3333333
## 4:     4        0      0.2500000
## 5:     5        0      0.2000000

map@5 = 0.1566667
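A minimal sketch of the average precision formula above in R, assuming `relevant` is a 0/1 vector over a user's ranked recommendations and `n_relevant` is the total number of relevant items for that user:

```r
# Sketch: precision@k and average precision for one user's ranked recommendations.
ap_at_k = function(relevant, k = length(relevant), n_relevant = sum(relevant)) {
  relevant = relevant[seq_len(k)]
  precision_at_k = cumsum(relevant) / seq_len(k)  # P(k), as in the table above
  sum(precision_at_k * relevant) / n_relevant
}

# toy example from the table: a single relevant item at position 3
ap_at_k(c(0, 0, 1, 0, 0), k = 5)
# MAP@K is then the mean over all users
```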

Ranking - Normalized Discounted Cumulative Gain

The intuition is the same as for MAP@K, but it also takes into account the value (grade) of relevance:

$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

$$\mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}$$

$$\mathrm{IDCG}_p = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$
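A minimal sketch of these formulas in R, assuming `rel` is a vector of graded relevance values in ranked order:

```r
# Sketch: DCG and nDCG for a single ranked list of graded relevances.
dcg_at_p = function(rel, p = length(rel)) {
  rel = rel[seq_len(p)]
  sum((2^rel - 1) / log2(seq_len(p) + 1))
}

ndcg_at_p = function(rel, p = length(rel)) {
  ideal = sort(rel, decreasing = TRUE)  # best possible ordering, used for IDCG
  dcg_at_p(rel, p) / dcg_at_p(ideal, p)
}

ndcg_at_p(c(3, 2, 0, 1), p = 4)
```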

Approaches

- Content based
  - good for cold start
  - not personalized
- Collaborative filtering
  - vanilla collaborative filtering
  - matrix factorizations
  - ...
- Hybrid and context-aware recommender systems
  - best of both worlds

Focus today

- WRMF (Weighted Regularized Matrix Factorization) - "Collaborative Filtering for Implicit Feedback Datasets" (2008)
  - efficient learning with accelerated approximate Alternating Least Squares
  - inference time
- Linear-Flow - "Practical Linear Models for Large-Scale One-Class Collaborative Filtering" (2016)
  - efficient truncated SVD
  - cheap cross-validation with a full regularization path

Matrix Factorizations

- Users can be described by a small number of latent factors $p_{uk}$
- Items can be described by a small number of latent factors $q_{ki}$


Sparse data

[Figure: sparse users × items matrix]

Low rank matrix factorization

$$R = P \times Q$$

[Figure: users × factors matrix P multiplied by the factors × items matrix Q]

Reconstruction

[Figure: reconstructed users × items matrix]

Truncated SVD

Take the k largest singular values:

$$X \approx U_k D_k V_k^T$$

- $X_k \in \mathbb{R}^{m \times n}$
- $U_k$, $V_k$ - columns are orthonormal bases (dot product of any 2 columns is zero, unit norm)
- $D_k$ - matrix with singular values on the diagonal

Truncated SVD is the best rank-k approximation of the matrix X in terms of the Frobenius norm:

$$\|X - U_k D_k V_k^T\|_F$$

$$P = U_k \sqrt{D_k}, \qquad Q = \sqrt{D_k} V_k^T$$
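As a rough illustration (not the author's demo code), a truncated SVD of a sparse matrix can be computed in R with the irlba package; the random toy matrix below is just a stand-in for real data:

```r
# Sketch: truncated SVD factorization of a sparse "ratings" matrix.
library(Matrix)
library(irlba)

set.seed(1)
X = rsparsematrix(1000, 500, density = 0.01)  # toy sparse users x items matrix

k = 10
svd_k = irlba(X, nv = k)                      # X ~ U_k D_k V_k^T

P = svd_k$u %*% diag(sqrt(svd_k$d))           # user embeddings, m x k
Q = diag(sqrt(svd_k$d)) %*% t(svd_k$v)        # item embeddings, k x n

# predicted score for user u and item i: P[u, ] %*% Q[, i]
```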

Issue with truncated SVD for "explicit" feedback

- Optimal in terms of the Frobenius norm - takes into account zeros in ratings:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\text{users}| \times |\text{items}|} \sum_{u \in \text{users},\, i \in \text{items}} (r_{ui} - \hat{r}_{ui})^2}$$

- Overfits the data

Objective = error only on "observed" ratings:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\text{Observed}|} \sum_{u,i \in \text{Observed}} (r_{ui} - \hat{r}_{ui})^2}$$

SVD-like matrix factorization with ALS

$$J = \sum_{u,i \in \text{Observed}} (r_{ui} - p_u \times q_i)^2 + \lambda(\|Q\|^2 + \|P\|^2)$$

Given Q fixed, solve for each $p_u$:

$$\min \sum_{i \in \text{Observed}} (r_{ui} - q_i \times p_u)^2 + \lambda \sum_{j=1}^{k} p_{uj}^2$$

Given P fixed, solve for each $q_i$:

$$\min \sum_{u \in \text{Observed}} (r_{ui} - p_u \times q_i)^2 + \lambda \sum_{j=1}^{k} q_{ij}^2$$

Ridge regression: $\; p_u = (Q^T Q + \lambda I)^{-1} Q^T r_u, \qquad q_i = (P^T P + \lambda I)^{-1} P^T r_i$
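A minimal sketch of one ALS half-step under this objective, assuming `R` is a dense user-item matrix with NA for unobserved ratings and `Q` is the k × n_items item-factor matrix (names are illustrative):

```r
# Sketch: ridge-regression update of all user factors with item factors Q fixed.
als_update_users = function(R, Q, lambda) {
  k = nrow(Q)
  P = matrix(0, nrow = nrow(R), ncol = k)
  for (u in seq_len(nrow(R))) {
    observed = which(!is.na(R[u, ]))       # items rated by user u
    Q_u = Q[, observed, drop = FALSE]      # k x |observed|
    r_u = R[u, observed]
    # p_u = (Q_u Q_u^T + lambda * I)^-1 Q_u r_u
    P[u, ] = solve(Q_u %*% t(Q_u) + lambda * diag(k), Q_u %*% r_u)
  }
  P
}
# the item update is symmetric: fix P and solve a ridge regression per item
```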

"Collaborative Filtering for Implicit Feedback Datasets"

WRMF - Weighted Regularized Matrix Factorization

- "Default" approach
- Proposed in 2008, but still widely used in industry (even at YouTube)
- Several high-quality open-source implementations

$$J = \sum_{u,i} C_{ui}(P_{ui} - X_u Y_i)^2 + \lambda(\|X\|_F + \|Y\|_F)$$

- Preferences - binary:

$$P_{ui} = \begin{cases} 1 & \text{if } R_{ui} > 0 \\ 0 & \text{otherwise} \end{cases}$$

- Confidence: $C_{ui} = 1 + f(R_{ui})$

Alternating Least Squares for implicit feedback

For fixed Y:

$$\frac{dL}{dx_u} = -2\sum_{i} c_{ui}(p_{ui} - x_u^T y_i) y_i + 2\lambda x_u = -2\sum_{i} c_{ui}(p_{ui} - y_i^T x_u) y_i + 2\lambda x_u = -2 Y^T C^u p(u) + 2 Y^T C^u Y x_u + 2\lambda x_u$$

- Setting $dL/dx_u = 0$ for the optimal solution gives $(Y^T C^u Y + \lambda I) x_u = Y^T C^u p(u)$
- $x_u$ can be obtained by solving a system of linear equations:

$$x_u = \mathrm{solve}(Y^T C^u Y + \lambda I,\; Y^T C^u p(u))$$

Alternating Least Squares for implicit feedback

Similarly, for fixed X:

- $dL/dy_i = -2 X^T C^i p(i) + 2 X^T C^i X y_i + 2\lambda y_i$
- $y_i = \mathrm{solve}(X^T C^i X + \lambda I,\; X^T C^i p(i))$

Another optimization:

- $X^T C^i X = X^T X + X^T(C^i - I)X$
- $Y^T C^u Y = Y^T Y + Y^T(C^u - I)Y$

$X^T X$ and $Y^T Y$ can be precomputed.
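A minimal sketch of this update in R, assuming `Y` is the k × n_items item-factor matrix, `r_u` the user's vector of implicit counts, confidence defined as 1 + alpha * r, and `YtY` the precomputed $Y^T Y$ (all names are illustrative, not from a specific library):

```r
# Sketch: WRMF update of a single user's factors with item factors Y fixed.
update_user = function(Y, r_u, lambda, alpha, YtY) {
  k = nrow(Y)
  idx = which(r_u > 0)                      # items the user interacted with
  Y_obs = Y[, idx, drop = FALSE]            # k x |observed|
  c_obs = 1 + alpha * r_u[idx]              # confidences of observed interactions
  # Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y; only observed items contribute to the correction
  A = YtY + Y_obs %*% (t(Y_obs) * (c_obs - 1)) + lambda * diag(k)
  b = Y_obs %*% c_obs                       # Y^T C^u p(u), since p_ui = 1 on observed items
  solve(A, b)                               # x_u, length k
}
```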

Accelerated Approximate Alternating Least Squares

$$y_i = \mathrm{solve}(X^T C^i X + \lambda I,\; X^T C^i p(i))$$

Instead of solving the system exactly, use a few steps of an iterative method:

- Conjugate Gradient
- Coordinate Descent

A fixed number of steps is enough in practice (usually 3-4), as sketched below.
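A minimal sketch of conjugate gradient with a fixed number of steps, assuming `A` is the symmetric positive definite k × k matrix (e.g. $Y^T C^u Y + \lambda I$) and `b` the corresponding right-hand side:

```r
# Sketch: approximate solve(A, b) with a fixed number of conjugate gradient steps.
cg_solve = function(A, b, n_steps = 3, x0 = rep(0, length(b))) {
  x = x0
  r = b - A %*% x                            # residual
  p = r                                      # search direction
  for (step in seq_len(n_steps)) {
    Ap = A %*% p
    alpha = drop(crossprod(r) / crossprod(p, Ap))
    x = x + alpha * p
    r_new = r - alpha * Ap
    beta = drop(crossprod(r_new) / crossprod(r))
    r = r_new
    p = r + beta * p
  }
  drop(x)
}
```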

Inference time

How to make recommendations for new users? There are no user embeddings, since these users are not in the original matrix!

Inference time

Make one step of ALS with the item embedding matrix fixed => get the new user embeddings:

- Given Y fixed, $C^{u_{new}}$ - confidence of the new user-item interactions
- $x_{u_{new}} = \mathrm{solve}(Y^T C^{u_{new}} Y + \lambda I,\; Y^T C^{u_{new}} p(u_{new}))$
- $\mathrm{scores} = X_{new} Y^T$
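Continuing the hypothetical `update_user()` sketch above (with `Y` and `YtY` from training), folding in and scoring a new user could look like this:

```r
# Sketch: fold in a new user and rank items by predicted score.
x_new = update_user(Y, r_new, lambda = 0.01, alpha = 40, YtY = YtY)  # r_new: new user's interaction counts
scores = drop(crossprod(x_new, Y))                # x_new^T Y, one score per item
top_items = order(scores, decreasing = TRUE)[1:10]
```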

WRMF Implementations

- python implicit - implements Conjugate Gradient, with GPU support recently!
- R reco - implements Conjugate Gradient
- Spark ALS
- Quora qmf
- Google tensorflow


Linear-Flow

The idea is to learn an item-item similarity matrix W from the data:

$$\min J = \|X - X W_k\|_F + \lambda \|W_k\|_F$$

with the constraint:

$$\mathrm{rank}(W) \leq k$$

Linear-Flow observations

1. Without L2 regularization the optimal solution is $W_k = Q_k Q_k^T$, where $\mathrm{SVD}_k(X) = P_k \Sigma_k Q_k^T$.
2. Without the constraint $\mathrm{rank}(W) \leq k$ the optimal solution is just the ridge regression solution $W = (X^T X + \lambda I)^{-1} X^T X$ - infeasible.

Linear-Flow reparametrization

$$\mathrm{SVD}_k(X) = P_k \Sigma_k Q_k^T$$

Let $W = Q_k Y$:

$$\underset{Y}{\mathrm{argmin}}\; \|X - X Q_k Y\|_F + \lambda \|Q_k Y\|_F$$

Motivation: with $\lambda = 0$, $W = Q_k Q_k^T$ is also the solution of the current problem with $Y = Q_k^T$.

Linear-Flow closed-form solution

- Notice that if $Q_k$ is orthogonal then $\|Q_k Y\|_F = \|Y\|_F$
- Solve $\|X - X Q_k Y\|_F + \lambda \|Y\|_F$
- Simple ridge regression with a closed-form solution:

$$Y = (Q_k^T X^T X Q_k + \lambda I)^{-1} Q_k^T X^T X$$

Very cheap inversion of a k × k matrix!
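A minimal sketch of this closed-form fit in R under the formulas above, assuming `X` is a sparse user-item matrix (the function and variable names are illustrative):

```r
# Sketch: Linear-Flow via truncated SVD plus a small ridge regression.
library(Matrix)
library(irlba)

linear_flow = function(X, k, lambda) {
  Q = irlba(X, nv = k)$v                     # n_items x k right singular vectors
  XQ = X %*% Q                               # m x k
  Z = as.matrix(t(XQ) %*% X)                 # Q^T X^T X, k x n_items
  Y = solve(Z %*% Q + lambda * diag(k), Z)   # (Q^T X^T X Q + lambda I)^-1 Q^T X^T X
  list(Q = Q, Y = Y)                         # item-item similarity W = Q %*% Y
}
# predictions for known users: X %*% Q %*% Y gives scores for every item
```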

Linear-Flow hassle-free cross-validation

$$Y = (Q_k^T X^T X Q_k + \lambda I)^{-1} Q_k^T X^T X$$

How to find lambda with cross-validation?

- pre-compute $Z = Q_k^T X^T X$, so $Y = (Z Q_k + \lambda I)^{-1} Z$
- pre-compute $Z Q_k$
- notice that the value of lambda affects only the diagonal of $Z Q_k + \lambda I$
- generate a sequence of lambdas (say of length 50) based on the min/max diagonal values
- solving 50 ridge regressions of a small rank is super-fast (see the sketch below)
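A rough sketch of that lambda search, reusing `Q` and `Z` from the previous sketch; `map_at_k()`, `X_train`, and `X_validation` are hypothetical stand-ins for whatever evaluation metric and data split you use:

```r
# Sketch: evaluate a whole regularization path with the pre-computed Z and ZQ.
ZQ = Z %*% Q                                             # k x k, computed once
lambdas = exp(seq(log(min(diag(ZQ)) + 1e-9), log(max(diag(ZQ))), length.out = 50))

cv_scores = sapply(lambdas, function(lambda) {
  Y = solve(ZQ + lambda * diag(nrow(ZQ)), Z)             # only the diagonal changes
  W = Q %*% Y                                            # item-item similarity
  map_at_k(X_train %*% W, X_validation, k = 10)          # hypothetical evaluation helper
})
best_lambda = lambdas[which.max(cv_scores)]
```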

Linear-Flow hassle-free cross-validation

[Figure 7]

Suggestions

- start simple - SVD, WRMF
- design proper cross-validation - both the objective and the data split
- think about how to incorporate business logic (for example, how to exclude something)
- use single-machine implementations
- think about inference time
- don't waste time on libraries/articles/blog posts which demonstrate MF with dense matrices

Questions?

- http://dsnotes.com/tags/recommender-systems/
- https://github.com/dselivanov/reco

Contacts:

- selivanov.dmitriy@gmail.com
- https://github.com/dselivanov
- https://www.linkedin.com/in/dselivanov1