Date posted: 12-Apr-2017
Category: Software
Uploaded by: hakka-labs
Building the Next New York Times Recommendation Engine
By Alexander Spangher
Problem Statement:
1. The New York Times publishes over 300 articles, blog posts and interactive stories a day.
Corpus:
n articles that are still relevant over the past x days
For each user:
30-day reading history
Machine Learning
“All of machine learning can be broken down into regression and matrix factorization.”
-A drunk PhD student at a bar
1. Regression: f(input) = output
2. Factorization: f(output) = output
-Yann LeCun, 2015
Problem Statement (Refined)
1. Define pool of articles.
Not all articles expire at the same rate
2. Rank order articles based on reading history of user.
Assume that reader’s future preferences will match past preferences
Defining the Pool of Articles
Defining Relevancy
Exponential Distribution
Evergreen Model
1. Learn training metric (clicks per day)
2. Learn relationship between features (Section, Desk, Word Count, ...) and metric
3. Convert to interpretable expiration date
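Fit a rate parameter λ_i to each item i in the training set
Maximum Likelihood Estimate (MLE)
Likelihood function, for data x_1, ..., x_n and parameter λ:
L(λ; x_1, ..., x_n)        likelihood of data and parameters
  = f(x_1, ..., x_n | λ)   joint pdf of data given parameter
  = ∏_j f(x_j | λ)         product of independent pdf’s
Given timestamp of every click: maximize the likelihood over λ
Maximum Log Likelihood Estimate: maximize the log of the likelihood instead, which turns the product into a sum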
Or, use optimization package:
Python: http://cvxopt.org/
Convex Optimization by Stephen Boyd
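A minimal sketch of the fit described above, assuming each click on an article is an independent draw from an exponential distribution whose observation is the time since publication. The function names and the sample data are hypothetical. Under that assumption the log-likelihood is n·log(rate) − rate·sum(times), which has a closed-form maximum; the ternary search stands in for a general optimization package.

```python
import math

def exponential_log_likelihood(rate, waits):
    """Log-likelihood of i.i.d. exponential observations at a given rate."""
    return len(waits) * math.log(rate) - rate * sum(waits)

def fit_rate_closed_form(waits):
    """Closed-form MLE: number of observations divided by their sum."""
    return len(waits) / sum(waits)

def fit_rate_numeric(waits, lo=1e-6, hi=100.0, iters=200):
    """Ternary search on the concave log-likelihood (stand-in for an optimizer)."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if exponential_log_likelihood(m1, waits) < exponential_log_likelihood(m2, waits):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Hypothetical data: days between publication and each click on one article.
click_ages_days = [0.1, 0.3, 0.4, 0.9, 1.5, 2.2, 4.0]
rate_closed = fit_rate_closed_form(click_ages_days)
rate_numeric = fit_rate_numeric(click_ages_days)
```

Both estimates should agree up to numerical tolerance; a numerical optimizer only becomes necessary once the model no longer has a closed-form solution.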
Learn relationship between article features and the fitted metric
x = [desk, word count, section, ...]
y = fitted metric for each article
General Linear model:
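One way to picture steps 2 and 3 of the Evergreen Model, as a sketch only. It assumes the fitted metric is an exponential decay rate per day, uses a single numeric feature (word count) instead of the full feature vector, regresses log(rate) on that feature, and defines the expiration date as the time by which 95% of clicks have arrived. All of these choices, the data, and the names are assumptions, not the talk's stated method.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def expiration_days(rate, fraction=0.95):
    """Days until `fraction` of an exponential article's clicks have arrived."""
    return -math.log(1.0 - fraction) / rate

# Hypothetical training set: word count and fitted decay rate (per day).
word_counts = [300, 800, 1500, 4000]
fitted_rates = [2.0, 1.0, 0.5, 0.1]

# Regress log(rate) on the feature so predicted rates stay positive.
intercept, slope = fit_line(word_counts, [math.log(r) for r in fitted_rates])

def predict_rate(word_count):
    """Predicted decay rate for an unseen article."""
    return math.exp(intercept + slope * word_count)

new_article_rate = predict_rate(2000)
new_article_expiry = expiration_days(new_article_rate)
```

In this toy data longer articles decay more slowly, so the predicted expiration date grows with word count.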
Performance
Building the Recommender
(http://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/)
First Iteration
Keyword-Based model: TF-IDF Vector
N = number of times word appears in document
D = number of documents that word appears in
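A small sketch of how N and D combine into a TF-IDF vector. The weighting N · log(total documents / D) is one common variant and is an assumption here, since the slide's exact formula did not survive; the function name and sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return one {word: weight} dict per tokenized document."""
    total_docs = len(documents)
    # D: number of documents each word appears in.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)  # N: times each word appears in this document
        vectors.append({
            word: n * math.log(total_docs / doc_freq[word])
            for word, n in counts.items()
        })
    return vectors

docs = [
    "the cat chased the dog".split(),
    "the scholar wrote a nice paper".split(),
    "the dog was nice".split(),
]
vectors = tf_idf_vectors(docs)
```

A word that appears in every document ("the") gets weight zero, while a word unique to one document ("cat") gets the largest weight.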
Example TF-IDF vectors, one entry per vocabulary word (fun, cat, dog, scholar, ..., nice):
[ 0.02, 0.5, 0, 0, … , .01 ]
[ 0.9, 0.01, 0.2, … , .05 ]
Feedback:
“Recommendations work for me
I have been following the Oscar Pistorius case for over a year now and every time there has been a relevant story about the case, I have been recommended that story.
Recommendations seem to be working very well for me.”
Feedback:
“No More Brooks recommendations, please
Your constant pushing of David Brooks onto me is like an annoying grandmother who won't believe you are really allergic to peanuts even though you regularly go into anaphylactic shock at her dinner table and need to be rushed to the hospital. What can I say… you're killing me. Please stop it.
...
Thanks for your attention to this matter.”
Feedback:
“Dear NY Times,
You seem to have missed the fact that, while I do read the Weddings section, I only (or almost only) read about the weddings of same sex couples.
Please stop recommending heterosexual weddings articles to me!!”
Second Iteration:
LDA-Based model: Topic Vector
Example topic vectors, one entry per topic (1, 2, 3, 4, ..., k):
[ 0.02, 0.5, 0, 0, … , .01 ]
[ 0.9, 0.01, 0.2, … , .05 ]
[Figure: example topic, shown as probability weight over the words cat, yarn, tree, building, car, money, bank, paw, toy, newspaper, Spotify]
LDA
David Blei (2003)
Topic Space
How do we learn these parameters?
LDA Definition:
Choose 𝜃 ~ Dirichlet(ɑ)
For each word in document:
  Choose word topic ~ Mult(𝜃)
  Choose word from that topic's word distribution
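The generative story above can be sketched as a sampler. This only generates documents from known parameters; it does not learn them. The vocabulary, the two topic-word distributions, and the function names are made up for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, topic_word_probs, vocabulary, length, rng):
    """Follow the LDA generative story for a single document."""
    theta = sample_dirichlet(alpha, rng)  # the document's topic mixture
    words = []
    for _ in range(length):
        # Choose a topic for this word, then a word from that topic.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

vocabulary = ["cat", "yarn", "paw", "money", "bank", "car"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a hypothetical "pets" topic
    [0.0, 0.0, 0.0, 0.4, 0.4, 0.2],  # a hypothetical "finance" topic
]
rng = random.Random(0)
theta, words = generate_document([0.5, 0.5], topic_word_probs, vocabulary, 8, rng)
```

Inference runs this story in reverse: given only the words, recover the topic mixture and the topic-word distributions.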
Variational Inference
Image borrowed from David Blei (2003)
Variational Inference (cont.)
1. (E-Step):
2. (M-Step):
tractable!!!
Collaborative Topic Modeling (CTM)
Image borrowed from David Blei (2011)
The graphical model for the CTM model we use.
Scaling the algorithm
Training procedure is batch. Can it scale to all our users in real time?
Strategy:
1. Iterate until some variables don’t change (article-topics).
2. Scale out, fixing non-changing variables. Update equation for one variable becomes a closed-form equation.
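To illustrate why fixing the article-topic variables helps, here is a sketch under a simplifying assumption: with article vectors held fixed and a squared-error objective with an L2 penalty, the best vector for one user is the ridge-regression solution of (VᵀV + ridge·I) u = Vᵀr, which needs no iteration and can be computed per user independently. The objective, the `ridge` value, the data, and the function names are assumptions; the model in the talk may weight observations differently.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def update_user_vector(article_vectors, clicks, ridge=0.1):
    """Closed-form ridge solution for one user, with article vectors held fixed."""
    k = len(article_vectors[0])
    # Normal equations: (V^T V + ridge * I) u = V^T r
    gram = [[sum(v[i] * v[j] for v in article_vectors) + (ridge if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(v[i] * r for v, r in zip(article_vectors, clicks)) for i in range(k)]
    return solve_linear_system(gram, rhs)

# Hypothetical fixed article-topic vectors (3 topics) and one user's clicks (1 = read).
article_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.1, 0.9],
    [0.1, 0.8, 0.1],
]
clicks = [1, 1, 0, 0]
user_vector = update_user_vector(article_vectors, clicks)
```

Because each user's update depends only on the fixed article vectors and that user's own history, the work can be spread across users without coordination.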
Algorithm
1. Batch train on training set of users
2. Fix the article-topic variables and scale out to all users
Derive scores for users
As seen in:
http://benanne.github.io/2014/08/05/spotify-cnns.html
C parameter: the back-off average
Any vector-based algorithm.
1) Deep Network (Spotify’s audio-CNN)
2) Shallow Network (Doc2Vec)
3) Topic Model
4) pLSA
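Whichever of the models above produces the vectors, the ranking step can look the same. The sketch below assumes a dot-product score between a user vector and each article vector; the slides do not state the scoring function, so treat it, the ids, and the numbers as placeholders.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_articles(user_vector, article_vectors, already_read=()):
    """Return unread article ids sorted by descending dot-product score."""
    scores = {
        article_id: dot(user_vector, vec)
        for article_id, vec in article_vectors.items()
        if article_id not in already_read
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical two-dimensional vectors from any vector-based model.
article_vectors = {
    "pets-1": [0.9, 0.1],
    "pets-2": [0.7, 0.3],
    "finance-1": [0.1, 0.9],
}
user_vector = [0.8, 0.2]
ranking = rank_articles(user_vector, article_vectors, already_read={"pets-1"})
```

Swapping the topic model for a deep or shallow network only changes how `article_vectors` and `user_vector` are produced, not this step.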
In conclusion
Modeling is fun!
All models are bad, but some can be useful!
Improve by recognizing shortfalls.
Evaluate on KPIs, on customer feedback, on design decisions.
not functional
sub-optimal
flat-lining/degrading