
DataEngConf: Building the Next New York Times Recommendation Engine

Date posted: 12-Apr-2017
By Alexander Spangher

Transcript
Page 1:

Building the Next New York Times Recommendation Engine

By Alexander Spangher

Page 2:

Problem Statement:

1. The New York Times publishes over 300 articles, blog posts and interactive stories a day.

Corpus:

n articles that are still relevant over the past x days

Page 3:

For each user:

30-day reading history (articles 1, 2, 3, 4, …)

Page 4:
Page 5:

Machine Learning

“All of machine learning can be broken down into regression and matrix factorization.”

-A drunk PhD student at a bar

1. Regression: f(input) = output

2. Factorization: f(output) = output

-Yann LeCun, 2015

Page 6:

Problem Statement (Refined)

1. Define pool of articles.

Not all articles expire at the same rate

2. Rank order articles based on reading history of user.

Assume that reader’s future preferences will match past preferences

Page 7:

Defining the Pool of Articles

Page 8:

Defining Relevancy

Page 9:

Exponential Distribution

Page 10:

Evergreen Model

Features: Section, Desk, Word Count, ...

1. Learn training metric (clicks per day)
2. Learn relationship between features and metric
3. Convert to interpretable expiration date

Page 11:

Fit a decay rate λ_i to each item i in the training set.

Page 12:

Likelihood function:

L(λ; t_1, …, t_n) = p(t_1, …, t_n | λ) = ∏_i p(t_i | λ)

Maximum Likelihood Estimate (MLE): the likelihood of the data and parameters is the joint pdf of the data given the parameter which, for independent observations, is the product of the individual pdfs.

Page 13:

Maximum Likelihood Estimate

Given timestamp of every click:

Page 14:

Maximum Likelihood Estimate

???

Page 15:

Maximum Log-Likelihood Estimate

For the exponential model, ℓ(λ) = log L(λ) = n log λ − λ ∑_i t_i. Setting the derivative to zero, dℓ/dλ = n/λ − ∑_i t_i = 0, gives λ̂ = n / ∑_i t_i.
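The closed-form estimate can be sanity-checked numerically. A minimal sketch (synthetic click timestamps, not Times data) comparing the analytic MLE against the log-likelihood at nearby rates:

```python
import math
import random

random.seed(0)

# Synthetic inter-click times drawn from an exponential with true rate 2.0.
true_rate = 2.0
times = [random.expovariate(true_rate) for _ in range(10_000)]

def log_likelihood(rate, ts):
    """Exponential log-likelihood: n*log(rate) - rate*sum(t)."""
    return len(ts) * math.log(rate) - rate * sum(ts)

# Analytic MLE: setting n/rate - sum(t) to zero gives rate = n / sum(t).
mle = len(times) / sum(times)

# The analytic MLE should beat nearby candidate rates.
assert log_likelihood(mle, times) >= log_likelihood(mle * 1.1, times)
assert log_likelihood(mle, times) >= log_likelihood(mle * 0.9, times)
print(f"true rate = {true_rate}, MLE = {mle:.3f}")
```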
Page 16:

Or, use optimization package:

Python: http://cvxopt.org/

Convex Optimization by Stephen Boyd
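When no convenient closed form exists, a general-purpose optimizer does the same job. A sketch using scipy.optimize's bounded scalar minimizer (a stand-in for the cvxopt package linked above, same idea) on the negative log-likelihood:

```python
import math
import random

from scipy.optimize import minimize_scalar

random.seed(1)
# Synthetic click timestamps from an exponential with rate 0.5.
times = [random.expovariate(0.5) for _ in range(5_000)]

def neg_log_likelihood(rate):
    # Negative exponential log-likelihood (minimize this to maximize likelihood).
    return -(len(times) * math.log(rate) - rate * sum(times))

# Bounded scalar minimization; any positive bracket containing the optimum works.
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
analytic = len(times) / sum(times)
print(f"optimizer: {result.x:.4f}, analytic: {analytic:.4f}")
```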

Page 17:

Learn the relationship between article features and the fitted decay rate λ:

x = [desk, word count, section, ...]

y = λ

General Linear Model:
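One simple way to learn such a mapping, keeping λ positive, is ordinary least squares on log(λ). A sketch with entirely synthetic features and decay rates (the feature encoding and weights are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical article features: one-hot section (3 sections) + scaled word count.
n = 500
section = rng.integers(0, 3, size=n)
word_count = rng.uniform(200, 3000, size=n)
X = np.column_stack([
    section == 0, section == 1, section == 2,
    word_count / 1000.0,
]).astype(float)

# Synthetic ground truth: each feature shifts the log decay rate.
true_w = np.array([0.5, -0.3, 0.1, -0.2])
log_rate = X @ true_w + rng.normal(0, 0.05, size=n)

# Fit log(lambda) with least squares; exponentiate so predictions are > 0.
w, *_ = np.linalg.lstsq(X, log_rate, rcond=None)
pred_rate = np.exp(X @ w)
```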

Page 18:
Page 19:

Performance

Page 20:

Building the Recommender

(http://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/)

Page 21:

First Iteration

Keyword-Based model: TF-IDF Vector

N = number of times the word appears in the document
D = number of documents the word appears in

Page 22:

First Iteration

Keyword-Based model: TF-IDF Vector

Example TF-IDF vectors, with dimensions indexed by vocabulary words (fun, cat, dog, scholar, nice, …):

[ 0.02, 0.5, 0, 0, … , 0.01 ]
[ 0.9, 0.01, 0.2, … , 0.05 ]
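The TF-IDF weighting above can be computed directly from the N and D counts the slide defines. A toy sketch (corpus invented; the vocabulary just mirrors the example dimensions):

```python
import math
from collections import Counter

# Toy corpus; each document is a list of tokens.
docs = [
    ["fun", "cat", "cat", "dog"],
    ["dog", "scholar", "nice"],
    ["cat", "scholar", "fun", "fun"],
]

def tf_idf(word, doc, corpus):
    """tf-idf per the slide: N * log(|corpus| / D), where
    N = times the word appears in the document,
    D = number of documents containing the word."""
    n = Counter(doc)[word]
    d = sum(1 for other in corpus if word in other)
    return n * math.log(len(corpus) / d) if d else 0.0

# One vector per document, dimensions indexed by vocabulary words.
vocab = sorted({w for d in docs for w in d})
vectors = [[tf_idf(w, d, docs) for w in vocab] for d in docs]
```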

Page 23:

Feedback:

“Recommendations work for me

I have been following the Oscar Pistorius case for over a year now and every time there has been a relevant story about the case, I have been recommended that story.

Recommendations seem to be working very well for me.”

Page 24:

Feedback:

“No More Brooks recommendations, please

Your constant pushing of David Brooks onto me is like an annoying grandmother who won't believe you are really allergic to peanuts even though you regularly go into anaphylactic shock at her dinner table and need to be rushed to the hospital. What can I say… you're killing me. Please stop it.

...

Thanks for your attention to this matter.”

Page 25:

Feedback:

“Dear NY Times,

You seem to have missed the fact that, while I do read the Weddings section, I only (or almost only) read about the weddings of same sex couples.

Please stop recommending heterosexual weddings articles to me!!”

Page 26:

Second Iteration:

LDA-Based model: Topic Vector, with dimensions indexed by topics 1, 2, 3, 4, …, k:

[ 0.02, 0.5, 0, 0, … , 0.01 ]
[ 0.9, 0.01, 0.2, … , 0.05 ]

Page 27:

Example topic: a probability weight over words (cat, yarn, tree, building, car, money, bank, paw, toy, newspaper, Spotify).

Page 28:


Page 29:

LDA

Page 30:

David Blei (2003)

Page 31:

Topic Space

Page 32:

How do we learn these parameters?

LDA Definition:

Choose topic proportions θ ~ Dirichlet(α). For each word in the document:

1. Choose the word's topic z ~ Mult(θ)
2. Choose the word from topic z's word distribution
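The generative process above can be sketched directly in numpy (the topics, vocabulary, and probabilities here are made up for illustration; they echo the example-topic slide):

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["cat", "yarn", "tree", "building", "car", "money", "bank", "paw"]
k = 2  # number of topics

# Hypothetical topic-word distributions (each row sums to 1).
beta = np.array([
    [0.4, 0.3, 0.05, 0.0, 0.0, 0.0, 0.05, 0.2],  # a "pets" topic
    [0.0, 0.0, 0.05, 0.3, 0.2, 0.25, 0.2, 0.0],  # a "city/finance" topic
])

def generate_document(n_words, alpha=(0.5, 0.5)):
    # Choose topic proportions theta ~ Dirichlet(alpha).
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # Choose the word's topic z ~ Mult(theta).
        z = rng.choice(k, p=theta)
        # Choose the word from topic z's word distribution beta[z].
        words.append(vocab[rng.choice(len(vocab), p=beta[z])])
    return words

doc = generate_document(10)
```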

Page 33:

Variational Inference

Image borrowed from David Blei (2003)

Page 34:

Variational Inference (cont.)

Page 35:

Variational Inference (cont.)

1. (E-Step):

2. (M-Step):

tractable!!!
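scikit-learn ships an LDA implementation based on this variational EM scheme; a toy usage sketch (the corpus and settings are invented, not the production pipeline):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cat yarn paw toy cat",
    "money bank building car",
    "cat paw yarn toy newspaper",
    "bank money car building money",
]

# Bag-of-words counts, then batch variational EM over 2 topics.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                random_state=0)
doc_topics = lda.fit_transform(counts)  # rows: per-document topic proportions
```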

Page 36:

Collaborative Topic Modeling (CTM)

Image borrowed from David Blei (2011)

The graphical model for the CTM we use.

Page 37:

Scaling the algorithm

The training procedure is batch. Can we scale to all of our users in real time?

Page 38:

Strategy:

1. Iterate until some variables stop changing (the article-topic vectors).
2. Fix the non-changing variables and scale out. The update equation for each remaining variable becomes closed-form.

Page 39:

Algorithm

1. Batch train on training set of users

2. Fix and scale out to all users
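With the article-topic vectors fixed, each user's vector reduces to a ridge-regression-style closed form, as in collaborative topic models. A numpy sketch with invented dimensions and regularization (not the production update):

```python
import numpy as np

rng = np.random.default_rng(7)

k, n_items = 5, 40
V = rng.normal(size=(k, n_items))  # fixed item-topic vectors from batch training
lam = 0.1                          # user regularization strength (assumed)

def user_vector(clicks, ratings):
    """Closed-form update for one user, holding item vectors V fixed:
    u = (V_c V_c^T + lam*I)^{-1} V_c r  -- ridge regression on clicked items."""
    Vc = V[:, clicks]
    A = Vc @ Vc.T + lam * np.eye(k)
    return np.linalg.solve(A, Vc @ ratings)

clicks = np.array([0, 3, 7, 12])
ratings = np.ones(len(clicks))     # implicit feedback: clicked = 1
u = user_vector(clicks, ratings)
scores = u @ V                     # rank every item in the pool for this user
```

Because the update touches only one user at a time, it parallelizes trivially across the full user base.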

Page 40:

Derive scores for users

As seen in:

http://benanne.github.io/2014/08/05/spotify-cnns.html

Page 41:

C parameter: the back-off average

Page 42:

Any vector-based algorithm works:

1) Deep Network (Spotify's audio-CNN)
2) Shallow Network (Doc2Vec)
3) Topic Model
4) pLSA

Page 43:

In conclusion

Modeling is fun!

All models are bad, but some can be useful!

Improve by recognizing shortfalls.

Evaluate on KPIs, on customer feedback, on design decisions.

Page 44:

not functional · sub-optimal · flat-lining/degrading

