DataEngConf: Building the Next New York Times Recommendation Engine

Post on 12-Apr-2017

Building the Next New York Times Recommendation Engine

By Alexander Spangher

Problem Statement:

1. The New York Times publishes over 300 articles, blog posts and interactive stories a day.

Corpus:

n articles that are still relevant over the past x days

For each user:

30-day reading history (articles 1, 2, 3, 4, ...)

Machine Learning

“All of machine learning can be broken down into regression and matrix factorization.”

-A drunk PhD student at a bar

1. Regression: f(input) = output

2. Factorization: f(output) = output

-Yann LeCun, 2015

Problem Statement (Refined)

1. Define pool of articles.

Not all articles expire at the same rate

2. Rank order articles based on reading history of user.

Assume that reader’s future preferences will match past preferences

Defining the Pool of Articles

Defining Relevancy

Exponential Distribution

Evergreen Model

Section, Desk, Word Count...

clicks per day

1. Learn training metric

2. Learn relationship between features and metric

3. Convert to interpretable expiration date
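One way to read step 3, sketched under an assumption: if the training metric is the rate of an exponential decay in clicks (as the "Exponential Distribution" slide suggests), then an "expiration date" can be taken as the age by which a chosen fraction of clicks is expected to have arrived. The function name `expiration_days`, the 95% default, and the example rates are hypothetical and not from the slides.

```python
import math

def expiration_days(rate_per_day, fraction=0.95):
    """Age in days by which `fraction` of an article's clicks are expected,
    assuming click ages follow an exponential distribution with this rate."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    # Invert the exponential CDF: 1 - exp(-rate * t) = fraction.
    return -math.log(1.0 - fraction) / rate_per_day

# Hypothetical rates: a fast-decaying news story and a slow-decaying evergreen piece.
for name, rate in [("breaking news", 1.5), ("evergreen recipe", 0.05)]:
    print(name, round(expiration_days(rate), 1))
```

A smaller rate gives a later expiration, which matches the idea that not all articles expire at the same rate.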

Fit a rate parameter to each item i in the training set

Likelihood function:

Maximum Likelihood Estimate (MLE)

likelihood of data and parameters

joint pdf of data given parameter

product of independent pdf’s

Maximum Likelihood Estimate

Given timestamp of every click:

Maximum Likelihood Estimate

???

Maximum Log Likelihood Estimate
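The equations on these slides did not survive extraction, so the following is a worked sketch under stated assumptions rather than the slides' own derivation. Assume each click on article i arrives at age t_{ij} (time since publication) and that the ages are independent draws from an exponential distribution with rate λ_i. The symbols t_{ij}, n_i and λ_i are introduced here for illustration.

```latex
\begin{align*}
L(\lambda_i) &= \prod_{j=1}^{n_i} \lambda_i \, e^{-\lambda_i t_{ij}} \\
\ell(\lambda_i) &= \log L(\lambda_i) = n_i \log \lambda_i - \lambda_i \sum_{j=1}^{n_i} t_{ij} \\
\frac{d\ell}{d\lambda_i} &= \frac{n_i}{\lambda_i} - \sum_{j=1}^{n_i} t_{ij} = 0
\quad\Longrightarrow\quad
\hat{\lambda}_i = \frac{n_i}{\sum_{j=1}^{n_i} t_{ij}}
\end{align*}
```

Under these assumptions the estimate is the reciprocal of the mean click age, so an article whose clicks keep arriving late gets a small rate.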

Or, use optimization package:

Python: http://cvxopt.org/

Convex Optimization by Stephen Boyd
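As a rough illustration of the numeric route, the sketch below minimizes the negative log-likelihood of an exponential model and compares the result with the closed-form estimate (number of clicks divided by total click age). It uses a hand-written ternary search from the standard library instead of cvxopt, so it is a stand-in for an optimization package, not an example of one. The click ages and function names are hypothetical.

```python
import math

def neg_log_likelihood(rate, click_ages):
    """Negative exponential log-likelihood: -(n*log(rate) - rate*sum(ages))."""
    n = len(click_ages)
    return -(n * math.log(rate) - rate * sum(click_ages))

def closed_form_rate(click_ages):
    """Closed-form MLE: number of clicks divided by total click age."""
    return len(click_ages) / sum(click_ages)

def numeric_rate(click_ages, lo=1e-6, hi=100.0, iters=200):
    """Ternary search for the minimizer of the convex negative log-likelihood.
    Assumes the true minimizer lies inside [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if neg_log_likelihood(m1, click_ages) < neg_log_likelihood(m2, click_ages):
            hi = m2  # the minimum cannot lie to the right of m2
        else:
            lo = m1  # the minimum cannot lie to the left of m1
    return (lo + hi) / 2

# Hypothetical click ages, in days since publication.
click_ages = [0.1, 0.3, 0.5, 1.2, 2.0, 4.5]
print(closed_form_rate(click_ages), numeric_rate(click_ages))
```

The two numbers should agree to several decimal places. A numeric optimizer only becomes necessary when the model has no closed-form solution.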

Learn relationship between article features and

x = [desk, word count, section, ...]

y =

General Linear model:
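The model equation on this slide was lost, so the following is only a minimal stand-in. It assumes the target y is a fitted per-article decay rate and fits ordinary least squares on log(y) against a single numeric feature (word count). Categorical features such as desk and section would need to be encoded first, which is omitted here. All names and numbers are hypothetical.

```python
import math

def fit_log_linear(xs, rates):
    """Ordinary least squares of log(rate) on one numeric feature.
    Returns (intercept, slope)."""
    ys = [math.log(r) for r in rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict_rate(intercept, slope, x):
    """Exponentiate the linear predictor to undo the log transform,
    which keeps the predicted rate positive."""
    return math.exp(intercept + slope * x)

# Hypothetical training data: word count in thousands, fitted decay rate per day.
word_counts = [0.4, 0.8, 1.5, 3.0, 5.0]
rates = [1.6, 1.2, 0.7, 0.3, 0.1]
intercept, slope = fit_log_linear(word_counts, rates)
print(predict_rate(intercept, slope, 2.0))
```

Once fitted, a model of this kind can assign a rate to a new article from its features alone, before any clicks are observed.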

Performance

Building the Recommender

(http://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/)

First Iteration

Keyword-Based model: TF-IDF Vector

N = number of times word appears in document

D = number of documents that word appears in

[ 0.02, 0.5, 0, 0, … , .01 ]

[ 0.9, 0.01, 0.2, … , .05 ]

fun cat dog scholar nice
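The TF-IDF formula itself is missing from the transcript. The sketch below assumes one common form, N * log(total documents / D), using the N and D defined above. Ranking by cosine similarity is also an assumption, since the slides do not say how the vectors are compared. The toy documents reuse the example words.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return the sorted vocabulary and one TF-IDF vector per tokenized document."""
    vocab = sorted({word for doc in documents for word in doc})
    total_docs = len(documents)
    # D: number of documents each word appears in.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)  # N: times each word appears in this document
        vectors.append(
            [counts[w] * math.log(total_docs / doc_freq[w]) for w in vocab]
        )
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [
    "fun cat cat dog".split(),
    "scholar nice scholar".split(),
    "cat dog nice".split(),
]
vocab, vectors = tf_idf_vectors(docs)
print(vocab)
print(cosine(vectors[0], vectors[2]), cosine(vectors[0], vectors[1]))
```

Documents that share no words score zero against each other, which hints at the weakness of a purely keyword-based model.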

Feedback:

“Recommendations work for me

I have been following the Oscar Pistorius case for over a year now and every time there has been a relevant story about the case, I have been recommended that story.

Recommendations seem to be working very well for me.”

Feedback:

“No More Brooks recommendations, please

Your constant pushing of David Brooks onto me is like an annoying grandmother who won't believe you are really allergic to peanuts even though you regularly go into anaphylactic shock at her dinner table and need to be rushed to the hospital. What can I say… you're killing me. Please stop it.

...

Thanks for your attention to this matter.”

Feedback:

“Dear NY Times,

You seem to have missed the fact that, while I do read the Weddings section, I only (or almost only) read about the weddings of same sex couples.

Please stop recommending heterosexual weddings articles to me!!”

Second Iteration

LDA-Based model: Topic Vector

[ 0.02, 0.5, 0, 0, … , .01 ]

[ 0.9, 0.01, 0.2, … , .05 ]

Topics: 1 2 3 4 ... k

[Figures: two example topics, each shown as probability weights over the words cat, yarn, tree, building, car, money, bank, paw, toy, newspaper, Spotify]

LDA

David Blei (2003)

Topic Space

How do we learn these parameters?

LDA Definition:

Choose θ ~ Dirichlet(α)

For each word in document:

Choose word topic ~ Mult(θ)

Choose word from that topic's word distribution
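The generative process can be simulated directly, as a minimal sketch. The topic-word probabilities, vocabulary, and function names below are hypothetical, and the Dirichlet draw is built from normalized Gamma samples because the standard library has no Dirichlet sampler.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, topic_word_probs, vocab, n_words, rng):
    """Follow the LDA generative process for a single document."""
    theta = sample_dirichlet(alpha, rng)  # the document's topic mixture
    words = []
    for _ in range(n_words):
        # Choose the word's topic from Mult(theta).
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Choose the word from the chosen topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word_probs[z])[0])
    return theta, words

vocab = ["cat", "yarn", "paw", "money", "bank", "car"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # hypothetical "pets" topic
    [0.0, 0.0, 0.0, 0.4, 0.4, 0.2],  # hypothetical "finance" topic
]
rng = random.Random(0)
theta, words = generate_document([0.5, 0.5], topic_word_probs, vocab, 8, rng)
print(theta, words)
```

Inference runs this story in reverse: given only the words, it estimates the topic mixtures and topic-word distributions that most plausibly produced them.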

Variational Inference

Image borrowed from David Blei (2003)

Variational Inference (cont.)

Variational Inference (cont.)

1. (E-Step):

2. (M-Step):

tractable!!!

Collaborative Topic Modeling (CTM)

Image borrowed from David Blei (2011)

The graphical model for the CTM model we use.

Scaling the algorithm

The training procedure is batch. Do we have time to scale to all our users in real time?

Strategy:

1. Iterate until some variables don’t change (article-topics).

2. Scale out, fixing non-changing variables. Update equation for one variable becomes a closed-form equation.
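For reference, here is what such a closed-form update looks like in Wang and Blei's 2011 collaborative topic regression paper. Whether the production system uses exactly this form is an assumption, and the symbols follow that paper, not the slides: V is the K x J matrix of (now fixed) article vectors, C_i is the diagonal matrix of confidence weights for user i, R_i is user i's vector of read/unread indicators, and λ_u is a regularization strength.

```latex
u_i \leftarrow \left( V C_i V^{\top} + \lambda_u I_K \right)^{-1} V C_i R_i
```

With V fixed, each user's vector depends only on that user's own reading history, so the updates are independent of one another and can be distributed across users.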

Algorithm

1. Batch train on training set of users

2. Fix the article-topic variables and scale out to all users

Derive scores for users

As seen in:

http://benanne.github.io/2014/08/05/spotify-cnns.html

C parameter: the back-off average

Any vector-based algorithm.

1) Deep Network (Spotify’s audio-CNN)

2) Shallow Network (Doc2Vec)

3) Topic Model

4) pLSA

In conclusion

Modeling is fun!

All models are bad, but some can be useful!

Improve by recognizing shortfalls.

Evaluate on KPIs, on customer feedback, on design decisions.

not functional

sub-optimal

flat-lining/degrading