August 5, 2013
ML ♡ Hadoop @ Spotify
If it’s slow, buy more racks
I’m Erik Bernhardsson
Master’s in Physics from KTH in Stockholm. Started at Spotify in 2008, managed the Analytics team for two years. Moved to NYC in 2011, now the Engineering Manager of the Discovery team at Spotify in NYC.
What’s Spotify? What are the challenges?
Started in 2006. Currently has 24 million users, 6 million of them paying. Available in 20 countries. About 300 engineers, of which 70 are in NYC.
And adding 20K every day...
Big challenge: Spotify has over 20 million tracks
Good and bad news: we also have 100B streams
Let’s use collaborative filtering!
Hey,I like tracks P, Q, R, S!
Well,I like tracks Q, R, S, T!
Then you should check out track P!
Nice! Btw try track T!
Hadoop at Spotify
Back in 2009
Matrix factorization causing cluster to overheat? Don’t worry, put up curtain
Hadoop today: 700 nodes at our data center in London
The Discover page
Here’s a secret behind the Discover page
It’s precomputed every night

[Diagram: log streams → HADOOP → music recs → hdfs2cass → Cassandra → Bartender]

https://github.com/spotify/luigi
https://github.com/spotify/hdfs2cass
OK so how do we come up with recommendations?
Let’s do collaborative filtering! In particular, implicit collaborative filtering. In particular, matrix factorization (aka latent factor methods).
Stop!!!
Break it down!!
Step 1: Collect data

[Diagram: access points (AP) stream play events (“play track x”, “play track y”, “play track z”) into Hadoop (>100B streams) at about 5k tracks/s]
Step 2: Put everything into a big sparse matrix
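A minimal sketch of this step, assuming the play logs have already been aggregated into (user, item, count) tuples (toy IDs, not real data); scipy’s COO format is a natural fit for a matrix this sparse:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical aggregated play counts: (user_id, track_id, count) tuples.
plays = [(0, 1, 7), (0, 2, 21), (1, 0, 5), (1, 3, 1), (2, 0, 4), (2, 2, 13)]

users, tracks, counts = zip(*plays)
# COO format stores only the nonzero entries, which is what makes a
# 10^7 x 10^7 matrix with billions of nonzeros representable at all.
N = coo_matrix((counts, (users, tracks)), shape=(3, 4))

print(N.toarray())
```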
Using some definition of correlation, e.g. for Pearson:

c_{ij} = \frac{\sum_u N_{ui} N_{uj}}{\sqrt{\sum_u N_{ui}^2}\,\sqrt{\sum_u N_{uj}^2}}

but it’s very slow, because:

N = \begin{pmatrix} 0 & 7 & 21 & 0 \\ 5 & 0 & 0 & 1 \\ 4 & 0 & 13 & 9 \\ 0 & 0 & 0 & 7 \\ 19 & 1 & 0 & 13 \\ 0 & 3 & 0 & 0 \end{pmatrix}, \qquad N^T N = \begin{pmatrix} 402 & 19 & 52 & 288 \\ 19 & 59 & 147 & 13 \\ 52 & 147 & 610 & 117 \\ 288 & 13 & 117 & 300 \end{pmatrix}

which is O(U \cdot (N/I)^2), where U = number of users, I = number of items, N = number of nonzero entries:

\approx 10^7 \cdot (10^{10}/10^7)^2 = 10^{13} mapper outputs
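The N^T N co-occurrence product above can be checked directly; a sketch with the same toy matrix (plain numpy, not the Hadoop job):

```python
import numpy as np

# The 6-user x 4-track play-count matrix from the slide.
N = np.array([
    [ 0, 7, 21,  0],
    [ 5, 0,  0,  1],
    [ 4, 0, 13,  9],
    [ 0, 0,  0,  7],
    [19, 1,  0, 13],
    [ 0, 3,  0,  0],
])

# Item-item co-occurrence: entry (i, j) sums N_ui * N_uj over all users.
# A user with k nonzero entries contributes O(k^2) such products, which
# is exactly why the naive mapper output count blows up.
C = N.T @ N
print(C)  # C[0, 0] == 402, C[0, 3] == 288, and C is symmetric
```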
It’s an extremely sparse matrix:

M = \begin{pmatrix} \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & 53 & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & 12 & \cdot \\ \cdot & 7 & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \end{pmatrix}

(almost every entry is zero; only scattered play counts like 53, 12, 7 are nonzero)
It’s a very big matrix too:

M = \begin{pmatrix} c_{11} & c_{12} & \dots & c_{1n} \\ c_{21} & c_{22} & \dots & c_{2n} \\ \vdots & & & \vdots \\ c_{m1} & c_{m2} & \dots & c_{mn} \end{pmatrix}

(10^7 items across, 10^7 users down)
Matrix example
Roughly 25 billion nonzero entries. Total size is roughly 25 billion × 12 bytes = 300 GB (“medium data”).
[Matrix cell example: row “Erik”, column “Never gonna give you up” — Erik listened to “Never gonna give you up” 1 time]
Step 3: Matrix factorization
The idea is to find vectors for each user and each item. Here’s how it looks algebraically:
Turns out people have been doing this in NLP for a while:

M = \begin{pmatrix} c_{11} & c_{12} & \dots & c_{1n} \\ c_{21} & c_{22} & \dots & c_{2n} \\ \vdots & & & \vdots \\ c_{m1} & c_{m2} & \dots & c_{mn} \end{pmatrix}

(lots of words across, lots of documents down)
Or more generally:

P = \begin{pmatrix} p_{11} & p_{12} & \dots & p_{1n} \\ p_{21} & p_{22} & \dots & p_{2n} \\ \vdots & & & \vdots \\ p_{m1} & p_{m2} & \dots & p_{mn} \end{pmatrix}

We can look at it as a probability distribution:

\begin{pmatrix} 0 & 0.07 & 0.21 & 0 \\ 0.05 & 0 & 0 & 0.01 \\ 0.04 & 0 & 0.13 & 0.09 \\ 0 & 0 & 0 & 0.07 \\ 0.19 & 0.01 & 0 & 0.13 \\ 0 & 0.03 & 0 & 0 \end{pmatrix}

The idea with matrix factorization is to represent this probability distribution like this:

p_{ui} = a_u^T b_i, \qquad M' = A^T B

\underbrace{\begin{pmatrix} \cdot & \cdot & \cdots & \cdot \\ \vdots & & & \vdots \\ \cdot & \cdot & \cdots & \cdot \end{pmatrix}}_{\text{probabilities for next event}} \approx \underbrace{\begin{pmatrix} \cdot & \cdot \\ \vdots & \vdots \\ \cdot & \cdot \end{pmatrix}}_{\text{user vectors } (m \times f)} \underbrace{\begin{pmatrix} \cdot & \cdots & \cdot \\ \cdot & \cdots & \cdot \end{pmatrix}}_{\text{item vectors } (f \times n)}
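A sketch of what M' = A^T B means concretely, with hypothetical toy dimensions (f latent factors, far fewer than users or items):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, f = 6, 4, 2  # toy sizes; real f is more like 40-200

A = rng.normal(size=(f, n_users))  # column u is user vector a_u
B = rng.normal(size=(f, n_items))  # column i is item vector b_i

# The full predicted matrix is the product of two skinny factors...
M_prime = A.T @ B

# ...so any single entry is just a dot product of two small vectors:
u, i = 3, 2
assert np.isclose(M_prime[u, i], A[:, u] @ B[:, i])
```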
For instance, for PLSA
Probabilistic Latent Semantic Indexing (Hofmann, 1999). Invented as a method intended for text classification.
PLSA:

\underbrace{\begin{pmatrix} \cdot & \cdot & \cdots & \cdot \\ \vdots & & & \vdots \\ \cdot & \cdot & \cdots & \cdot \end{pmatrix}}_{P(u,i) = \sum_z P(u|z) P(i,z)} \approx \underbrace{\begin{pmatrix} \cdot & \cdot \\ \vdots & \vdots \\ \cdot & \cdot \end{pmatrix}}_{P(u|z)} \underbrace{\begin{pmatrix} \cdot & \cdots & \cdot \\ \cdot & \cdots & \cdot \end{pmatrix}}_{P(i,z)}

with the constraints

\sum_u P(u|z) = 1, \qquad \sum_{i,z} P(i,z) = 1
So in general we want to optimize

\log \prod_{u,i} P(u,i)^{N_{ui}} = \sum_{u,i} N_{ui} \log P(u,i) = \sum_{u,i} N_{ui} \log \sum_z P(u|z) P(i,z)

i.e., schematically, the play counts weight the log of the factorized matrix:

N \log P = \begin{pmatrix} 0 & 7 & 21 & 0 \\ 5 & 0 & 0 & 1 \\ 4 & 0 & 13 & 9 \\ 0 & 0 & 0 & 7 \\ 19 & 1 & 0 & 13 \\ 0 & 3 & 0 & 0 \end{pmatrix} \log\left( \begin{pmatrix} \cdot & \cdot \\ \vdots & \vdots \\ \cdot & \cdot \end{pmatrix} \begin{pmatrix} \cdot & \cdots & \cdot \\ \cdot & \cdots & \cdot \end{pmatrix} \right)
Koren:

N = \begin{pmatrix} 0 & 7 & 21 & 0 \\ 5 & 0 & 0 & 1 \\ 4 & 0 & 13 & 9 \\ 0 & 0 & 0 & 7 \\ 19 & 1 & 0 & 13 \\ 0 & 3 & 0 & 0 \end{pmatrix}
Why are vectors nice?
Super small fingerprints of the musical style or the user’s taste. Usually something like 40-200 elements. Hard to illustrate 40 dimensions in a 2-dimensional slide, but here’s an attempt:

Track X: (0.87, 1.17, -0.26, 0.56, 2.21, 0.77, -0.03)
[Plot: track x’s vector drawn against latent factor 1 and latent factor 2]
Another example of tracks in two dimensions
Implementing matrix factorization is a little tricky
Iterative algorithms that take many steps to converge. 40 parameters for each item and user, so something like 1.2 billion parameters.
“Google News Personalization: Scalable Online Collaborative Filtering”
One iteration, one map/reduce job

[Diagram: the map step is a K × L grid of tasks, one per (u % K, i % L) pair, each fed all log entries with matching u % K and i % L; item vectors are partitioned by i % L and user vectors by u % K in the distributed cache; the reduce step aggregates per u % K to produce new user vectors]
Here’s what happens in one map shard
Input is a bunch of (user, item, count) tuples. user is the same modulo K for all users; item is the same modulo L for all items.

[Diagram: one map task reads tuples (u, i, count) with u % K = x and i % L = y; the distributed cache provides all user vectors with u % K = x and all item vectors with i % L = y; the mapper emits contributions, and the reducer produces a new vector]
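A rough simulation of the sharding scheme, with hypothetical K, L and toy tuples; the point is that every tuple in a shard shares u % K and i % L, so each map task only needs a small slice of the vectors in memory:

```python
from collections import defaultdict

K, L = 2, 3  # hypothetical shard counts

def shard_of(u, i):
    # Which of the K*L map tasks handles this (user, item, count) tuple.
    return (u % K, i % L)

plays = [(0, 0, 7), (1, 4, 2), (2, 3, 5), (5, 1, 1), (3, 2, 9)]

shards = defaultdict(list)
for u, i, count in plays:
    shards[shard_of(u, i)].append((u, i, count))

# Each map task only needs 1/K of the user vectors and 1/L of the item
# vectors from the distributed cache, since all of its tuples agree on
# u % K and i % L.
for (x, y), tuples in shards.items():
    assert all(u % K == x and i % L == y for u, i, _ in tuples)
```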
Might take a while to converge
Start with random vectors around the origin
Hadoop?
Yeah, we could probably do it in Spark 10x or 100x faster. Still, Hadoop is a great way to scale things horizontally.
Nice compact vectors and it’s super fast to compute similarity
[Plot: track x and track y close together in latent-factor space; cos(x, y) = HIGH]
IPMF item-item:

P(i \to j) = \exp(b_j^T b_i)/Z_i = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}

VECTORS:

p_{ui} = a_u^T b_i, \qquad sim_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i|\,|b_j|} \quad O(f)

i                        j                        sim_{i,j}
2pac                     2pac                     1.0
2pac                     Notorious B.I.G.         0.91
2pac                     Dr. Dre                  0.87
2pac                     Florence + the Machine   0.26
Florence + the Machine   Lana Del Rey             0.81

IPMF item-item MDS:

P(i \to j) = \exp(-|b_j - b_i|^2)/Z_i = \frac{\exp(-|b_j - b_i|^2)}{\sum_k \exp(-|b_k - b_i|^2)}, \qquad sim_{ij} = -|b_j - b_i|^2
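The cosine similarity above costs O(f) per pair; a minimal sketch with hypothetical item vectors (real ones are 40-200 dimensions, and these numbers are made up, not real 2pac vectors):

```python
import numpy as np

def cosine_sim(b_i, b_j):
    # sim_ij = b_i . b_j / (|b_i| |b_j|) -- O(f) in the vector length f.
    return float(b_i @ b_j / (np.linalg.norm(b_i) * np.linalg.norm(b_j)))

# Hypothetical 4-dimensional item vectors for two artists.
b_2pac = np.array([0.9, 1.1, -0.2, 0.4])
b_big  = np.array([0.8, 1.0, -0.1, 0.5])

sim = cosine_sim(b_2pac, b_big)
# An item is maximally similar to itself, matching the 2pac/2pac = 1.0 row.
assert cosine_sim(b_2pac, b_2pac) > 0.999
```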
Music recommendations are now just dot products

[Plot: user u’s vector alongside track x and track y in latent-factor space]
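Scoring every track for a user is then a single matrix-vector product; a sketch with hypothetical toy vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
f, n_items = 3, 5
a_u = rng.normal(size=f)            # user u's vector
B = rng.normal(size=(f, n_items))   # item vectors as columns

scores = a_u @ B                    # one dot product per track
top = np.argsort(scores)[::-1][:3]  # top-3 recommendations for user u
assert scores[top[0]] >= scores[top[1]] >= scores[top[2]]
```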
It’s still tricky to search for similar tracks though
We have many millions of tracks, and you don’t want to compute cosine similarity for all pairs.
Approximate nearest neighbors to the rescue!
Cut the space recursively by random planes. If two points are close, they are more likely to end up on the same side of each plane.
https://github.com/spotify/annoy
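The core idea can be sketched in a few lines: draw a random hyperplane and check which side each point falls on. Nearby points almost always agree, distant points often don’t. This is a toy illustration of the principle, not annoy’s actual implementation (annoy splits between sampled points and builds a forest of trees):

```python
import numpy as np

rng = np.random.default_rng(2)

def side(point, normal):
    # Which side of the hyperplane through the origin with this normal?
    return point @ normal > 0

x = np.array([1.0, 2.0])
y = x + 0.01                  # a point very close to x
far = np.array([-3.0, -1.0])  # a point far from x

same, far_same = 0, 0
for _ in range(1000):
    normal = rng.normal(size=2)
    same += side(x, normal) == side(y, normal)
    far_same += side(x, normal) == side(far, normal)

# Close points land on the same side of nearly every random plane,
# so recursive splitting tends to keep neighbors in the same leaf.
assert same > far_same
```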
How do you retrain the model?
It takes a long time to train a full factorization model. We want to update user vectors much more frequently (at least daily!). However, item vectors are fairly stable. So: throw away the user vectors and recreate them from scratch!
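With the item vectors held fixed, each user vector can be recomputed independently from that user’s plays. A sketch using a plain least-squares solve (a simplification; a real implicit-feedback model would add confidence weighting and regularization), with hypothetical toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
f, n_items = 2, 6
B = rng.normal(size=(f, n_items))   # stable item vectors from the full run

def recompute_user_vector(item_ids, targets):
    # Solve min_a || B_s^T a - targets ||^2 over the user's observed items.
    # Each user is independent, so this parallelizes trivially.
    B_s = B[:, item_ids].T           # (n_observed, f)
    a_u, *_ = np.linalg.lstsq(B_s, targets, rcond=None)
    return a_u

# Hypothetical user: interacted with items 0, 2, 5 with these targets.
a_u = recompute_user_vector([0, 2, 5], np.array([1.0, 3.0, 0.5]))
assert a_u.shape == (f,)
```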
The pipeline
“Hack” to recalculate user vectors more frequently.
Is this a little complicated? Yeah, probably.

[Diagram: May 2013 logs → matrix factorization → item vectors + user vectors; June 2013 logs (+ more logs) → matrix factorization → item vectors + user vectors. The item vectors seed incremental updates: user vectors (1) from logs, then user vectors (2), (3), (4), (5) as more logs arrive over time]
Ideal case
Put all vectors in Cassandra/Memcached, use Storm to update in real time
But Hadoop is pretty nice at parallelizing recommendations
24 cores but not a lot of RAM? mmap is your friend.
One map/reduce job

[Diagram: many map tasks (M) share a distributed cache (DC) holding the user vectors and an ANN index of all vectors, and emit recs]
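The mmap trick can be sketched with numpy: write the vectors to disk once, then every map task on the box maps the same file read-only, so the OS shares one physical copy of the pages across all the cores (the file name is hypothetical):

```python
import os
import tempfile

import numpy as np

f, n_items = 4, 1000
vectors = np.random.default_rng(4).normal(size=(n_items, f)).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "item_vectors.npy")  # hypothetical path
np.save(path, vectors)

# mmap_mode="r" maps the file read-only: pages are loaded lazily and
# shared between processes, so RAM usage stays roughly constant no
# matter how many mappers open the same file.
shared = np.load(path, mmap_mode="r")
assert np.allclose(shared[42], vectors[42])
```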
Music recommendations!
Our latest baby, the Discover page. Featuring lots of different types of recommendations. Expect this to change quite a lot in the next few months!
More music recommendations!
Radio!
More music recommendations!
Related artists
Thanks!
Btw, we’re hiring Machine Learning Engineers and Data Engineers! Email me at [email protected]!