Date posted: 11-Apr-2017
Uploaded by: mlconf
I’m Erik
• @fulhack
• Author of Annoy, Luigi
• Currently CTO of Better
• Previously 5 years at Spotify
…But what’s the point?
• vector models are everywhere
• lots of applications (language processing, recommender systems, computer vision)
…Much better approach
1. Start with high-dimensional data
2. Run dimensionality reduction down to 10-1000 dims
3. Do stuff in the low-dimensional space
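A minimal sketch of these three steps, using a truncated SVD as the dimensionality reduction; the data and sizes here are made up purely for illustration:

```python
import numpy as np

# Step 1: start with high-dimensional data (hypothetical: 1000 items, 500 dims)
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 500))

# Step 2: dimensionality reduction via truncated SVD (the idea behind PCA/LSA)
k = 32
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :k] * s[:k]   # (items, k) item vectors

# Step 3: do stuff in the small space, e.g. nearest neighbors on X_reduced
```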
Deep learning for food
• Deep model trained on a GPU on 6M random pics downloaded from Yelp

[Architecture diagram: stacked 3x3 convolutions (156x156x32 → 154x154x32 → 152x152x32 → 76x76x64 → 74x74x64 → 72x72x64 → 36x36x128 → 34x34x128 → 32x32x128 → 16x16x256 → 14x14x256 → 12x12x256 → 6x6x512 → 4x4x512 → 2x2x512) with 2x2 maxpooling between blocks, then fully connected layers with dropout (2048 → 2048), a 128-dim bottleneck layer, and a 1244-way output]
Distance in smaller space
1. Run image through the network
2. Use the 128-dimensional bottleneck layer as an item vector
3. Use cosine distance in the reduced space
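Step 3 is just cosine distance on the 128-dim item vectors; a minimal numpy sketch, where the vectors are random stand-ins for real bottleneck activations:

```python
import numpy as np

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical 128-dim bottleneck activations for two images
rng = np.random.default_rng(1)
v1, v2 = rng.standard_normal(128), rng.standard_normal(128)

d_same = cosine_distance(v1, v1)   # identical images -> distance ~0
d_diff = cosine_distance(v1, v2)   # unrelated vectors -> distance near 1
```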
Vector methods for text
• TF-IDF (old) – no dimensionality reduction
• Latent Semantic Analysis (1988)
• Probabilistic Latent Semantic Analysis (2000)
• Semantic Hashing (2007)
• word2vec (2013), RNN, LSTM, …
Represent documents and/or words as f-dimensional vector
[Scatter plot: words as points in a 2D latent space (axes: latent factor 1, latent factor 2); "banana" and "apple" lie close together, "boat" lies far away]
Vector methods for collaborative filtering
• Supervised methods: See everything from the Netflix Prize
• Unsupervised: Use NLP methods
CF vectors – examples

IPMF item-item:

P(i → j) = exp(b_j^T b_i) / Z_i = exp(b_j^T b_i) / Σ_k exp(b_k^T b_i)

Vectors: p_ui = a_u^T b_i

sim_ij = cos(b_i, b_j) = (b_i^T b_j) / (|b_i| |b_j|), computed in O(f)

i                        j                        sim_ij
2pac                     2pac                     1.0
2pac                     Notorious B.I.G.         0.91
2pac                     Dr. Dre                  0.87
2pac                     Florence + the Machine   0.26
Florence + the Machine   Lana Del Rey             0.81

IPMF item-item MDS:

P(i → j) = exp(−|b_j − b_i|²) / Z_i = exp(−|b_j − b_i|²) / Σ_k exp(−|b_k − b_i|²)

sim_ij = −|b_j − b_i|²

Training data: (u, i, count) triples, fitted by gradient descent (∂L/∂a_u, …)
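The softmax form of P(i → j) can be sketched directly in numpy; the item vectors here are random placeholders, not trained CF vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 16))   # hypothetical item vectors b_k, f = 16

def p_transition(i, B):
    # P(i -> j) = exp(b_j^T b_i) / sum_k exp(b_k^T b_i)
    scores = B @ B[i]
    scores -= scores.max()          # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

p = p_transition(0, B)   # a probability distribution over all items j
```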
Geospatial indexing
• Ping the world: https://github.com/erikbern/ping
• k-NN regression using Annoy

Low-dimensional embedding
• “Visualizing Large-scale and High-dimensional Data” (LargeVis)
• https://github.com/elbamos/largeVis (R implementation)
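k-NN regression as used here can be sketched in a few lines; this is a generic numpy version rather than the Annoy-backed one, and the data is hypothetical:

```python
import numpy as np

def knn_regress(query, points, values, k=3):
    # predict by averaging the values of the k nearest points (Euclidean)
    d = np.linalg.norm(points - query, axis=1)
    idx = np.argsort(d)[:k]
    return values[idx].mean()

# hypothetical data: 2-d coordinates -> measured latency
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
latency = np.array([10.0, 12.0, 11.0, 200.0])
pred = knn_regress(np.array([0.05, 0.05]), points, latency, k=3)
# averages the three nearby points, ignoring the far outlier
```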
Nearest neighbors the brute force way
• we can always do an exhaustive search to find the nearest neighbors
• imagine MySQL doing a linear scan for every query…
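A brute-force scan is easy to write; this numpy sketch ranks every vector by cosine similarity (the corpus is random stand-in data, not real word vectors):

```python
import numpy as np

def brute_force_nns(query, vectors, n=10):
    # exhaustive scan: cosine similarity against every single vector
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:n]

rng = np.random.default_rng(3)
V = rng.standard_normal((10000, 300))   # e.g. 10k "word vectors", 300-dim
q = V[42]                               # query with a known vector
top = brute_force_nns(q, V, n=5)        # top[0] is the query item itself
```

Fine at this size, but at GoogleNews scale (3M words) every single query pays the full linear scan, which is what the timings below show.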
Using word2vec’s brute force search
$ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin
Qiantang_River   0.597229
Yangtse          0.587990
Yangtze_River    0.576738
lake             0.567611
rivers           0.567264
creek            0.567135
Mekong_river     0.550916
Xiangjiang_River 0.550451
Beas_river       0.549198
Minjiang_River   0.548721

real 2m34.346s
user 1m36.235s
sys  0m16.362s
Introducing Annoy
• https://github.com/spotify/annoy
• mmap-based ANN library
• Written in C++, with Python/R/Go/Lua bindings
• 1227 stars on Github (up from 585)
Using Annoy’s search

$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000
Yangtse          0.907756
Yangtze_River    0.920067
rivers           0.930308
creek            0.930447
Mekong_river     0.947718
Huangpu_River    0.951850
Ganges           0.959261
Thu_Bon          0.960545
Yangtze          0.966199
Yangtze_river    0.978978

real 0m0.470s
user 0m0.285s
sys  0m0.162s

Using Annoy’s search

$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000
Qiantang_River   0.897519
Yangtse          0.907756
Yangtze_River    0.920067
lake             0.929934
rivers           0.930308
creek            0.930447
Mekong_river     0.947718
Xiangjiang_River 0.948208
Beas_river       0.949528
Minjiang_River   0.950031

real 0m2.013s
user 0m1.386s
sys  0m0.614s
Side note: making trees small
• Split until at most K items remain in each leaf (K ≈ 100)
• Takes O(n/K) memory for the tree structure instead of O(n)
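A simplified sketch of this splitting scheme: recurse until each leaf holds at most K items. Note this splits on a random normal vector with a median threshold, whereas Annoy's real splits use the hyperplane between two sampled points, so it is only illustrative:

```python
import numpy as np

def build_tree(idx, X, K=100, rng=None):
    # split by a random hyperplane until each leaf holds at most K items
    rng = rng if rng is not None else np.random.default_rng(0)
    if len(idx) <= K:
        return {"leaf": idx}
    normal = rng.standard_normal(X.shape[1])
    proj = X[idx] @ normal
    t = np.median(proj)
    left, right = idx[proj <= t], idx[proj > t]
    if len(left) == 0 or len(right) == 0:   # degenerate split, stop here
        return {"leaf": idx}
    return {"normal": normal, "threshold": t,
            "left": build_tree(left, X, K, rng),
            "right": build_tree(right, X, K, rng)}

rng = np.random.default_rng(4)
X = rng.standard_normal((1000, 20))
tree = build_tree(np.arange(1000), X, K=100)
```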
Problemo
• The point that’s the closest isn’t necessarily in the same leaf of the binary tree
• Two points that are really close may end up on different sides of a split
• Solution: go to both sides of a split if it’s close
Trick 1: Priority queue
• Traverse the tree using a priority queue
• Sort by min(margin) along the path from the root
Trick 2: many trees
• Construct trees randomly many times
• Use the same priority queue to search all of them at the same time
heap + forest = best
• Since we use a priority queue, we dive down the best splits (the ones with the largest margins) first
• More trees always help!
• The only constraint is that more trees require more RAM
Annoy query structure
1. Use priority queue to search all trees until we’ve found k items
2. Take union and remove duplicates (a lot)
3. Compute distance for remaining items
4. Return the nearest n items
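The four steps above can be sketched end to end. This is a simplified, self-contained model of the algorithm (random-hyperplane trees, Euclidean distance, one shared priority queue keyed by the worst margin on the path), not Annoy's actual implementation:

```python
import heapq
import numpy as np

def build_tree(idx, X, K, rng):
    # simplified random-hyperplane tree; leaves hold at most K items
    if len(idx) <= K:
        return {"leaf": idx}
    normal = rng.standard_normal(X.shape[1])
    proj = X[idx] @ normal
    t = np.median(proj)
    left, right = idx[proj <= t], idx[proj > t]
    if len(left) == 0 or len(right) == 0:
        return {"leaf": idx}
    return {"normal": normal, "threshold": t,
            "left": build_tree(left, X, K, rng),
            "right": build_tree(right, X, K, rng)}

def query_forest(q, trees, X, search_k, n):
    # 1. one priority queue across all trees, ordered by worst margin so far
    heap = [(0.0, i, t) for i, t in enumerate(trees)]
    heapq.heapify(heap)
    counter = len(trees)            # unique tiebreaker so dicts never compare
    candidates = set()
    while heap and len(candidates) < search_k:
        worst_margin, _, node = heapq.heappop(heap)
        if "leaf" in node:
            # 2. union of leaves; the set removes duplicates across trees
            candidates.update(node["leaf"].tolist())
        else:
            m = float(q @ node["normal"] - node["threshold"])
            near, far = (node["left"], node["right"]) if m <= 0 else (node["right"], node["left"])
            heapq.heappush(heap, (worst_margin, counter, near)); counter += 1
            # the far side is penalized by how far the query is from the split
            heapq.heappush(heap, (max(worst_margin, abs(m)), counter, far)); counter += 1
    # 3. compute true distances only for the surviving candidates
    cand = np.fromiter(candidates, dtype=int)
    d = np.linalg.norm(X[cand] - q, axis=1)
    # 4. return the nearest n items
    return cand[np.argsort(d)[:n]]

rng = np.random.default_rng(5)
X = rng.standard_normal((2000, 16))
trees = [build_tree(np.arange(2000), X, K=50, rng=rng) for _ in range(10)]
nns = query_forest(X[7], trees, X, search_k=400, n=5)   # nns[0] is item 7 itself
```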
Current ANN trends
• “small world” graph algorithms: SW-graph, HNSW, KGraph
• locality-sensitive hashing: FALCONN
Thanks!
• https://github.com/spotify/annoy
• https://github.com/erikbern/ann-benchmarks
• https://github.com/erikbern/ann-presentation
• erikbern.com
• @fulhack