Locality-Sensitive Hashing: Theory and Applications

Transcript

Page 1:

Locality-Sensitive Hashing: Theory and Applications

Ludwig Schmidt MIT → Google Brain → UC Berkeley

Based on joint works with Alex Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Kunal Talwar.

Page 2:

Nearest Neighbor Search

1. Build a data structure for a given set of points in Rd.

2. (Repeated) For a given query, find the closest point in the dataset with “good” probability.

Main goal: fast queries with high accuracy.
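For reference, the exact baseline is a linear scan over the dataset, O(n·d) per query; everything below tries to beat this. A minimal sketch (NumPy assumed; names are illustrative, not from the talk):

```python
import numpy as np

def brute_force_nn(data, query):
    """Exact nearest neighbor by linear scan: O(n * d) per query."""
    # data: (n, d) array of points, query: (d,) vector
    dists = np.linalg.norm(data - query, axis=1)
    return int(np.argmin(dists))
```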

Page 4:

Motivation: large number of applications

Data retrieval • Images (SIFT, …) • Text (tf-idf, …) • Audio (i-vectors, …)

Sub-routine in other algorithms • Optimization • Cryptanalysis • Classification

(Figure: retrieval example, input → output.)

Page 6:

A Simple Problem?

In 2D: Voronoi diagram • O(n log n) setup time • O(log n) query time

Almost ideal data structure!

Problem? Many applications require high-dimensional spaces.

The "curse of dimensionality" in Rd.

Page 7:

Rescue from High Dimensionality

Real data often has structure.

Example: significant gap between the distance from the query to • the nearest neighbor • an average point.

Page 8:

Common Methods

Tree data structures

Locality-sensitive hashing

Vector quantization

Nearest-neighbor graphs

Page 9:

1. Locality-Sensitive Hashing

2. Cross-Polytope Hash

3. LSH in Neural Networks

Page 12:

Formal Problem

Spherical case: • Points are on the unit sphere. • Angles between most points are around 90°.

Why? • Theory: the general case reduces to this case. • Practice: a good model for many (pre-processed) instances.

Page 13:

Formal Problem

(Figure: a query point and its nearest neighbor on the sphere.)

Similarity Measures (all equivalent here):

• Cosine Similarity • Angular Distance • Euclidean Distance

Why? Often (approximately) the goal in practice.

Page 15:

Locality-Sensitive Hashing. Introduced in [Indyk, Motwani 1998].

Main idea: partition Rd randomly such that nearby points are more likely to appear in the same cell of the partition.

What about hash functions?

Random hash function = random space partitioning

Page 16:

LSH: Formal Definition. A family of hash functions is (r, cr, p1, p2)-locality sensitive if for every p, q in Rd the following holds:

• if ||p − q|| < r, then P[h(p) = h(q)] > p1
• if ||p − q|| > cr, then P[h(p) = h(q)] < p2

(Figure: collision probability as a function of distance, dropping from above p1 at distance r to below p2 at distance cr.)

Page 17:

LSH: Data Structure. Multiple hash tables with independent hash functions.

Query: 1. Find candidates in the hash buckets h(q). 2. Compute exact distances for the candidates.
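A minimal sketch of this data structure with the hash family left abstract (class and parameter names are illustrative, not from the talk or from FALCONN):

```python
import numpy as np
from collections import defaultdict

class LSHTables:
    """L hash tables, each built with its own independently drawn hash function."""
    def __init__(self, data, make_hash_fn, num_tables):
        self.data = np.asarray(data)                      # (n, d) array of points
        self.hash_fns = [make_hash_fn() for _ in range(num_tables)]
        self.tables = []
        for h in self.hash_fns:
            table = defaultdict(list)
            for idx, point in enumerate(self.data):
                table[h(point)].append(idx)               # h(point) must be hashable
            self.tables.append(table)

    def query(self, q):
        # 1. Find candidates in the hash buckets h(q) of every table.
        candidates = set()
        for h, table in zip(self.hash_fns, self.tables):
            candidates.update(table.get(h(q), []))
        if not candidates:
            return None
        # 2. Compute exact distances only for the candidates.
        cand = list(candidates)
        dists = np.linalg.norm(self.data[cand] - q, axis=1)
        return cand[int(np.argmin(dists))]
```

With the hyperplane hash from Page 21, `make_hash_fn` would draw a few random hyperplanes and return a short bit string as the bucket id.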

Page 19:

LSH: Theory. Query time is quantified as a function of the sensitivity ρ.

Intuitively: the gap between the "nearby" collision probability and the "far away" collision probability.

Formally: ρ = log(1/p1) / log(1/p2)

Query time is O(n^ρ), data structure size O(n^(1+ρ)).

For example: often ρ ≈ 1/c, where "nearby" means distance r and "far away" means distance cr.
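As a sanity check of the formula (using the hyperplane-hash collision probability 1 − α/π from Page 21): for a "near" angle of 45° and a "far" angle of 90°, p1 = 1 − 45/180 = 0.75 and p2 = 1 − 90/180 = 0.5, so ρ = log(1/0.75) / log(1/0.5) ≈ 0.42, matching the hyperplane exponent quoted on Page 30.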

Page 21:

Most Common LSH Family

Hyperplane LSH

Introduced in [Charikar 2002], inspired by [Goemans, Williamson 1995].

Hash function: partition the sphere with a random hyperplane.

Locality-sensitive: let α be the angle between points p and q. Then

P[h(p) = h(q)] = 1 − α/π
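A minimal sketch of the hyperplane hash together with a Monte Carlo check of this collision probability (illustrative code, not the talk's implementation):

```python
import numpy as np

def make_hyperplane_hash(d, k, rng):
    """k random hyperplanes -> a k-bit bucket id (one inner-product sign per bit)."""
    G = rng.standard_normal((k, d))
    return lambda x: tuple((G @ x > 0).astype(int))

# Monte Carlo check of P[h(p) = h(q)] = 1 - alpha/pi for a single hyperplane (k = 1).
rng = np.random.default_rng(0)
d, alpha = 100, np.pi / 4                      # alpha = 45 degrees
p = rng.standard_normal(d); p /= np.linalg.norm(p)
r = rng.standard_normal(d); r -= (r @ p) * p; r /= np.linalg.norm(r)
q = np.cos(alpha) * p + np.sin(alpha) * r      # unit vector at angle alpha from p
hashes = [make_hyperplane_hash(d, 1, rng) for _ in range(20000)]
print(np.mean([h(p) == h(q) for h in hashes]), 1 - alpha / np.pi)  # both about 0.75
```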

Page 22:

Empirical State of the Art, 2015

GloVe dataset (word embeddings)

100-dimensional, 1.2 million vectors

Source: ann-benchmarks (Erik Bernhardsson)

Page 25:

Empirical State of the Art, 2015

SIFT dataset (image descriptors)

128-dimensional, 1 million vectors

Source: ann-benchmarks (Erik Bernhardsson)

Page 29:

Empirical State of the Art, 2015

Annoy (Approximate Nearest Neighbors Oh Yeah)

"Annoy was built by Erik Bernhardsson in a couple of afternoons during Hack Week." (at Spotify in 2013)

Algorithm: a hybrid between the hyperplane hash and a kd-tree.

Page 30:

Progress Since Hyperplane

Algorithm and sensitivity ρ:

• Hyperplane hash [Charikar 2002]: ρ = 1/c
• [Andoni, Indyk 2006]: ρ = 1/c²
• [Andoni, Indyk, Nguyen, Razenshteyn 2014]: ρ = 7/(8c²)
• Voronoi hash [Andoni, Razenshteyn 2015]: ρ = 1/(2c²)

Query time is O(n^ρ); c is the gap between "near" and "far".

For near = 45°: exponent 0.42 (hyperplane) vs 0.18 [AR’15].

Page 31:

Voronoi Hash

For each hash function, sample T random unit vectors g1, g2, … gT.

To hash a given point p, find the closest gi.

(Figure: the hash function illustrated as the Voronoi partition of the sphere induced by g1, …, gT.)
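A minimal sketch of this hash function (illustrative only):

```python
import numpy as np

def make_voronoi_hash(d, T, rng):
    """Sample T random unit vectors g_1, ..., g_T; a point hashes to the closest one."""
    G = rng.standard_normal((T, d))
    G /= np.linalg.norm(G, axis=1, keepdims=True)
    # For unit-norm inputs, the closest g_i is the one with the largest inner product,
    # so one hash evaluation costs T inner products: O(d * T) time.
    return lambda x: int(np.argmax(G @ x))
```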

Page 34:

Cost of Hash Computation

The Voronoi hash requires many inner products for good sensitivity ρ.

(Figure: sensitivity ρ as a function of the number of parts T, for T from 10^0 to 10^16; ρ ranges from about 0.4 down to 0.15.)

Time complexity: O(d T)

Hyperplane hash works with a single inner product. O(d) time

Can we get fast hash functions with good sensitivity?

Page 35:

1. Locality-Sensitive Hashing

2. Cross-Polytope Hash

3. LSH in Neural Networks

Page 38:

Cross-Polytope Hash

Cross-polytope = l1 unit ball

Hash function:

1. Apply a random rotation to the input point.

2. Map to the closest vertex of the cross-polytope.
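A minimal sketch of these two steps, using a dense Gaussian random matrix as the "rotation" for clarity (the fast pseudo-random rotations on Page 48 replace this step; names are illustrative):

```python
import numpy as np

def make_cross_polytope_hash(d, rng):
    """Randomly rotate, then map to the closest of the 2d cross-polytope vertices +/- e_i."""
    A = rng.standard_normal((d, d))              # Gaussian random "rotation"
    def h(x):
        y = A @ x                                # 1. apply the random rotation
        i = int(np.argmax(np.abs(y)))            # 2. closest vertex: largest |coordinate| ...
        return (i, int(np.sign(y[i])))           #    ... together with its sign
    return h
```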

Page 39:

Our Contributions

1. Analyze the CP hash

2. Multiprobe for the CP hash

3. Fast Implementation

Page 41:

Analysis. With a Gaussian random rotation, the CP hash has sensitivity ρ ≈ 1/(2c²).

(Caveats: points on the sphere, far distance r2 = √2.)

Establish lower bound for ρ vs #parts trade-off.

(Figure: sensitivity ρ vs. number of parts T, from 10^0 to 10^16, comparing the cross-polytope LSH curve against the lower bound.)

(Annotations from the analysis: x = ±X1, αx + βy = ±(αX1 + βY1).)

Page 43:

Multiprobe. Problem with LSH in some regimes: it requires many tables.

Example: for 10^6 points and queries at 45°, hyperplane LSH needs 725 tables for a 90% success probability.

More memory than the dataset itself.

Page 45:

Multiprobe

Problem with LSH in some regimes: it requires many tables.

Idea: use multiple hash locations in the same few tables. [Lv, Josephson, Wang, Charikar, Li 2007]

We develop a multiprobe scheme for the CP hash.

How to score hash buckets?
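The transcript does not spell out the CP scoring rule; as a hedged illustration of the general multiprobe idea of [Lv, Josephson, Wang, Charikar, Li 2007], here is how one might probe extra buckets of a k-bit hyperplane hash by flipping the bits whose margins are smallest:

```python
import numpy as np
from itertools import combinations

def multiprobe_buckets(G, q, max_flips=2):
    """Primary bucket of q plus nearby buckets, obtained by flipping the
    hyperplane bits with the smallest margins |<g_i, q>|."""
    proj = G @ q
    base = (proj > 0).astype(int)
    low_margin = np.argsort(np.abs(proj))[:max_flips]   # least confident bits
    buckets = [tuple(base)]
    for r in range(1, max_flips + 1):
        for flips in combinations(low_margin, r):
            b = base.copy()
            b[list(flips)] ^= 1
            buckets.append(tuple(b))
    return buckets
```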

Page 48:

Fast Implementation

Algorithmic side: use fast pseudo-random rotations.

Similar to fast JL [Ailon, Chazelle 2009]. Overall O(d log d) time for one hash function.

Get 2d hash cells in O(d log d) time.

Analysis: [Kennedy, Ward 2016] [Choromanski, Fagan, Gouy-Pailler, Morvan, Sarlós, Atif 2016]

Implementation side: C++, vectorized code (AVX), etc.
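A minimal sketch of one standard pseudo-random rotation of this flavor, iterating "random sign flips + a fast Hadamard transform" (assumes d is a power of two; FALCONN's exact construction may differ):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform in O(d log d); len(x) must be a power of two."""
    y = np.array(x, dtype=float)
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(len(y))                   # orthonormal scaling

def make_pseudo_rotation(d, rng, rounds=3):
    """Compose a few (random sign flip, Hadamard transform) blocks."""
    signs = rng.choice([-1.0, 1.0], size=(rounds, d))
    def rotate(x):
        y = np.asarray(x, dtype=float)
        for s in signs:
            y = fwht(s * y)                      # each block costs O(d log d)
        return y
    return rotate
```

Feeding the rotated vector into the cross-polytope step from Page 38 then yields one of 2d hash cells in O(d log d) time, as on the slide.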

Page 49:

Experiments vs Hyperplane

Page 50:

Experiments on GloVe

Page 51:

Experiments vs Annoy

Page 52:

Library: FALCONN (Fast Approximate Look-up of COsine Nearest Neighbors)

Page 54:

Since then: Graph Search

But: 1000x longer setup time

Page 55:

1. Locality-Sensitive Hashing

2. Cross-Polytope Hash

3. LSH in Neural Networks

Page 57:

Why Nearest Neighbors and Neural Networks?

Neural networks are often used with many output classes.

Examples: language models, recommender systems, image annotation.

Feature vectors for images, audio, text, etc. are now often generated by deep neural networks.

Page 59:

Neural Networks with Many Output Classes

(Figure: network diagram; input x → lower layers → embedding f(x) → fully connected softmax layer with C = #classes.)

Embedding computed by the lower layers.

Top layer (fully connected): one vector ci per class.

Softmax function: p(class i | x) = exp(f(x)^T ci) / Σ_{j=1}^{C} exp(f(x)^T cj)

Page 61:

Prediction Problem. Input: the class vectors ci.

Goal: given a new embedding f(x), quickly find the class vector with maximum inner product.

Nearest neighbor under maximum inner product "similarity" [Vijayanarasimhan, Shlens, Monga, Yagnik 2015], [Spring, Shrivastava 2016].

Page 63:

LSH in Neural Networks. Crucial property: angle to the nearest neighbor.

10x faster softmax
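One way to read this speedup, sketched under the assumption of an LSH index over the class vectors with a hypothetical `nn_index.query` interface (not necessarily the exact pipeline from the talk):

```python
import numpy as np

def approx_softmax(f_x, class_vectors, nn_index, num_candidates=100):
    """Score only LSH-retrieved candidate classes instead of all C classes."""
    # nn_index.query(vector, k) is assumed to return indices of classes likely to
    # have a large inner product with f_x (e.g. via the cross-polytope hash).
    cand = nn_index.query(f_x, num_candidates)
    logits = class_vectors[cand] @ f_x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return cand, probs        # candidate class ids and their renormalized probabilities
```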

Page 67:

Training For Larger Angles

Goal of training the network: minimize the cross-entropy loss.

loss(x, y) ≈ −f(x)^T cy = −||f(x)|| · ||cy|| · cos∠(f(x), cy)

Constrain ||f(x)|| via batch normalization; constrain ||cy|| via projected gradient descent.

Result: larger angles.

Page 70:

Experiments for Softmax Normalization

We control the norm of the class vectors ci via projected gradient descent.

Angular distance instead of maximum inner product.
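A minimal sketch of that projection step (hypothetical names and norm bound): after each gradient update on the top layer, rescale any class vector whose norm exceeds the bound.

```python
import numpy as np

def project_class_vectors(C, max_norm=1.0):
    """Projected gradient descent step: clip each row norm ||c_i|| back to max_norm."""
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return C * scale

# Sketch of the training loop:
#   C -= lr * grad_C                         # ordinary gradient step on the class vectors
#   C = project_class_vectors(C, max_norm)   # keep ||c_i|| bounded -> larger angles
```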

Page 72:

Conclusions:
• Locality-sensitive hashing for nearest neighbor search.
• The cross-polytope hash has good practical performance.
• LSH can be used for fast inference in deep networks.

Thank You!

