Date posted: 11-Apr-2017
Uploaded by: mlconf
I’m Erik
• @fulhack
• Author of Annoy, Luigi
• Currently CTO of Better
• Previously 5 years at Spotify
…But what’s the point?
• vector models are everywhere
• lots of applications (language processing, recommender systems, computer vision)
…Much better approach
1. Start with high-dimensional data
2. Run dimensionality reduction down to 10-1000 dims
3. Do stuff in the low-dimensional space
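A minimal sketch of these three steps, using a truncated SVD as the dimensionality reduction; the data and sizes here are made up purely for illustration:

```python
import numpy as np

# Step 1: start with high-dimensional data (hypothetical: 1000 items, 500 dims)
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 500))

# Step 2: dimensionality reduction via truncated SVD (the idea behind PCA/LSA)
k = 32
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :k] * s[:k]   # (items, k) item vectors

# Step 3: do stuff in the small space, e.g. nearest neighbors on X_reduced
```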
Deep learning for food
• Deep model trained on a GPU on 6M random pics downloaded from Yelp

[Architecture diagram: stacked 3x3 convolutions (156x156x32 → 154x154x32 → 152x152x32 → 76x76x64 → 74x74x64 → 72x72x64 → 36x36x128 → 34x34x128 → 32x32x128 → 16x16x256 → 14x14x256 → 12x12x256 → 6x6x512 → 4x4x512 → 2x2x512) with 2x2 maxpooling between blocks, then fully connected layers with dropout (2048 → 2048), a 128-dim bottleneck layer, and a 1244-way output]
Distance in smaller space
1. Run image through the network
2. Use the 128-dimensional bottleneck layer as an item vector
3. Use cosine distance in the reduced space
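Step 3 is just cosine distance on the 128-dim item vectors; a minimal numpy sketch, where the vectors are random stand-ins for real bottleneck activations:

```python
import numpy as np

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical 128-dim bottleneck activations for two images
rng = np.random.default_rng(1)
v1, v2 = rng.standard_normal(128), rng.standard_normal(128)

d_same = cosine_distance(v1, v1)   # identical images -> distance ~0
d_diff = cosine_distance(v1, v2)   # unrelated vectors -> distance near 1
```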
Vector methods for text
• TF-IDF (old) – no dimensionality reduction
• Latent Semantic Analysis (1988)
• Probabilistic Latent Semantic Analysis (2000)
• Semantic Hashing (2007)
• word2vec (2013), RNN, LSTM, …
Represent documents and/or words as f-dimensional vector
[Scatter plot: words as points in a 2D latent space (axes: latent factor 1, latent factor 2); "banana" and "apple" lie close together, "boat" lies far away]
Vector methods for collaborative filtering
• Supervised methods: See everything from the Netflix Prize
• Unsupervised: Use NLP methods
CF vectors – examples

IPMF item-item:

P(i → j) = exp(b_j^T b_i) / Z_i = exp(b_j^T b_i) / Σ_k exp(b_k^T b_i)

Vectors: p_ui = a_u^T b_i

sim_ij = cos(b_i, b_j) = (b_i^T b_j) / (|b_i| |b_j|), computed in O(f)

i                        j                        sim_ij
2pac                     2pac                     1.0
2pac                     Notorious B.I.G.         0.91
2pac                     Dr. Dre                  0.87
2pac                     Florence + the Machine   0.26
Florence + the Machine   Lana Del Rey             0.81

IPMF item-item MDS:

P(i → j) = exp(−|b_j − b_i|²) / Z_i = exp(−|b_j − b_i|²) / Σ_k exp(−|b_k − b_i|²)

sim_ij = −|b_j − b_i|²

Training data: (u, i, count) triples, fitted by gradient descent (∂L/∂a_u, …)
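The softmax form of P(i → j) can be sketched directly in numpy; the item vectors here are random placeholders, not trained CF vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 16))   # hypothetical item vectors b_k, f = 16

def p_transition(i, B):
    # P(i -> j) = exp(b_j^T b_i) / sum_k exp(b_k^T b_i)
    scores = B @ B[i]
    scores -= scores.max()          # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

p = p_transition(0, B)   # a probability distribution over all items j
```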
Geospatial indexing
• Ping the world: https://github.com/erikbern/ping
• k-NN regression using Annoy

Low-dimensional embedding
• “Visualizing Large-scale and High-dimensional Data” (LargeVis)
• https://github.com/elbamos/largeVis (R implementation)
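k-NN regression as used here can be sketched in a few lines; this is a generic numpy version rather than the Annoy-backed one, and the data is hypothetical:

```python
import numpy as np

def knn_regress(query, points, values, k=3):
    # predict by averaging the values of the k nearest points (Euclidean)
    d = np.linalg.norm(points - query, axis=1)
    idx = np.argsort(d)[:k]
    return values[idx].mean()

# hypothetical data: 2-d coordinates -> measured latency
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
latency = np.array([10.0, 12.0, 11.0, 200.0])
pred = knn_regress(np.array([0.05, 0.05]), points, latency, k=3)
# averages the three nearby points, ignoring the far outlier
```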
Nearest neighbors the brute force way
• we can always do an exhaustive search to find the nearest neighbors
• imagine MySQL doing a linear scan for every query…
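A brute-force scan is easy to write; this numpy sketch ranks every vector by cosine similarity (the corpus is random stand-in data, not real word vectors):

```python
import numpy as np

def brute_force_nns(query, vectors, n=10):
    # exhaustive scan: cosine similarity against every single vector
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:n]

rng = np.random.default_rng(3)
V = rng.standard_normal((10000, 300))   # e.g. 10k "word vectors", 300-dim
q = V[42]                               # query with a known vector
top = brute_force_nns(q, V, n=5)        # top[0] is the query item itself
```

Fine at this size, but at GoogleNews scale (3M words) every single query pays the full linear scan, which is what the timings below show.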
Using word2vec’s brute force search
$ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin
Qiantang_River   0.597229
Yangtse          0.587990
Yangtze_River    0.576738
lake             0.567611
rivers           0.567264
creek            0.567135
Mekong_river     0.550916
Xiangjiang_River 0.550451
Beas_river       0.549198
Minjiang_River   0.548721

real 2m34.346s
user 1m36.235s
sys  0m16.362s
Introducing Annoy
• https://github.com/spotify/annoy
• mmap-based ANN library
• Written in C++, with Python/R/Go/Lua bindings
• 1227 stars on Github (up from 585)
Using Annoy’s search

$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000
Yangtse          0.907756
Yangtze_River    0.920067
rivers           0.930308
creek            0.930447
Mekong_river     0.947718
Huangpu_River    0.951850
Ganges           0.959261
Thu_Bon          0.960545
Yangtze          0.966199
Yangtze_river    0.978978

real 0m0.470s
user 0m0.285s
sys  0m0.162s

Using Annoy’s search

$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000
Qiantang_River   0.897519
Yangtse          0.907756
Yangtze_River    0.920067
lake             0.929934
rivers           0.930308
creek            0.930447
Mekong_river     0.947718
Xiangjiang_River 0.948208
Beas_river       0.949528
Minjiang_River   0.950031

real 0m2.013s
user 0m1.386s
sys  0m0.614s
Side note: making trees small
• Split until at most K items remain in each leaf (K ≈ 100)
• Takes O(n/K) memory for the tree structure instead of O(n)
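A simplified sketch of this splitting scheme: recurse until each leaf holds at most K items. Note this splits on a random normal vector with a median threshold, whereas Annoy's real splits use the hyperplane between two sampled points, so it is only illustrative:

```python
import numpy as np

def build_tree(idx, X, K=100, rng=None):
    # split by a random hyperplane until each leaf holds at most K items
    rng = rng if rng is not None else np.random.default_rng(0)
    if len(idx) <= K:
        return {"leaf": idx}
    normal = rng.standard_normal(X.shape[1])
    proj = X[idx] @ normal
    t = np.median(proj)
    left, right = idx[proj <= t], idx[proj > t]
    if len(left) == 0 or len(right) == 0:   # degenerate split, stop here
        return {"leaf": idx}
    return {"normal": normal, "threshold": t,
            "left": build_tree(left, X, K, rng),
            "right": build_tree(right, X, K, rng)}

rng = np.random.default_rng(4)
X = rng.standard_normal((1000, 20))
tree = build_tree(np.arange(1000), X, K=100)
```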
Problemo
• The point that’s the closest isn’t necessarily in the same leaf of the binary tree
• Two points that are really close may end up on different sides of a split
• Solution: go to both sides of a split if it’s close
Trick 1: Priority queue
• Traverse the tree using a priority queue
• Sort by min(margin) along the path from the root
Trick 2: many trees
• Construct trees randomly many times
• Use the same priority queue to search all of them at the same time
heap + forest = best
• Since we use a priority queue, we dive down the best splits (the ones with the largest margins) first
• More trees always help!
• The only constraint is that more trees require more RAM
Annoy query structure
1. Use priority queue to search all trees until we’ve found k items
2. Take union and remove duplicates (a lot)
3. Compute distance for remaining items
4. Return the nearest n items
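The four steps above can be sketched end to end. This is a simplified, self-contained model of the algorithm (random-hyperplane trees, Euclidean distance, one shared priority queue keyed by the worst margin on the path), not Annoy's actual implementation:

```python
import heapq
import numpy as np

def build_tree(idx, X, K, rng):
    # simplified random-hyperplane tree; leaves hold at most K items
    if len(idx) <= K:
        return {"leaf": idx}
    normal = rng.standard_normal(X.shape[1])
    proj = X[idx] @ normal
    t = np.median(proj)
    left, right = idx[proj <= t], idx[proj > t]
    if len(left) == 0 or len(right) == 0:
        return {"leaf": idx}
    return {"normal": normal, "threshold": t,
            "left": build_tree(left, X, K, rng),
            "right": build_tree(right, X, K, rng)}

def query_forest(q, trees, X, search_k, n):
    # 1. one priority queue across all trees, ordered by worst margin so far
    heap = [(0.0, i, t) for i, t in enumerate(trees)]
    heapq.heapify(heap)
    counter = len(trees)            # unique tiebreaker so dicts never compare
    candidates = set()
    while heap and len(candidates) < search_k:
        worst_margin, _, node = heapq.heappop(heap)
        if "leaf" in node:
            # 2. union of leaves; the set removes duplicates across trees
            candidates.update(node["leaf"].tolist())
        else:
            m = float(q @ node["normal"] - node["threshold"])
            near, far = (node["left"], node["right"]) if m <= 0 else (node["right"], node["left"])
            heapq.heappush(heap, (worst_margin, counter, near)); counter += 1
            # the far side is penalized by how far the query is from the split
            heapq.heappush(heap, (max(worst_margin, abs(m)), counter, far)); counter += 1
    # 3. compute true distances only for the surviving candidates
    cand = np.fromiter(candidates, dtype=int)
    d = np.linalg.norm(X[cand] - q, axis=1)
    # 4. return the nearest n items
    return cand[np.argsort(d)[:n]]

rng = np.random.default_rng(5)
X = rng.standard_normal((2000, 16))
trees = [build_tree(np.arange(2000), X, K=50, rng=rng) for _ in range(10)]
nns = query_forest(X[7], trees, X, search_k=400, n=5)   # nns[0] is item 7 itself
```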
Current ANN trends
• “small world” graph algorithms: SW-graph, HNSW, KGraph
• locality-sensitive hashing: FALCONN
Thanks!
• https://github.com/spotify/annoy
• https://github.com/erikbern/ann-benchmarks
• https://github.com/erikbern/ann-presentation
• erikbern.com
• @fulhack