
Circulant Binary Embedding (CBE)

Felix X. Yu, Columbia University

Sanjiv Kumar, Google Research

Yunchao Gong, Facebook AI Research

Shih-Fu Chang, Columbia University

ICML, Beijing, June 24, 2014

Binary Embedding

I Transform the input data into binary code

I Given $x \in \mathbb{R}^d$, $h(x) \in \{1, -1\}^k$

Why binary embedding?

I Learning and retrieval can happen in the binary space

I Save storage and running time

I Widely used to speed up retrieval and classification [LSMK11, RL09]

Method

$h(x) = \mathrm{sign}(Rx)$, $R \in \mathbb{R}^{k \times d}$

I Randomized R: LSH [Cha02]

I Optimized R: reconstruction error [KD09], quantization error [GLGP13], pairwise similarity [WKC10], etc.

Binary Embedding (cont’d)

Difficulties for High-dimensional data

I High-dimensional data requires long codes to accurately preserve the discriminative power [LSMK11] [GKRL13] [SP11]

k ∼ Θ(d)

I Computing the full projection $Rx$, $R \in \mathbb{R}^{\Theta(d) \times d}$, has time and space complexity $O(d^2)$

I d ∼ 1 Million: TBs of memory!
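The memory figure can be checked with simple arithmetic. The sketch below assumes k = d and 4-byte single-precision entries; the slides do not state the precision, so `bytes_per_entry` is an assumption.

```python
# Back-of-the-envelope storage for a dense d x d projection matrix.
# Assumptions (not from the slides): k = d, float32 entries (4 bytes each).
d = 10**6
bytes_per_entry = 4

dense_bytes = d * d * bytes_per_entry    # the full matrix R
vector_bytes = d * bytes_per_entry       # a single d-dimensional vector, for contrast

dense_tb = dense_bytes / 10**12          # decimal terabytes
vector_mb = vector_bytes / 10**6         # decimal megabytes
print(f"dense R: {dense_tb:.0f} TB, one d-dim vector: {vector_mb:.0f} MB")
```

Under these assumptions the dense matrix takes about 4 TB, while anything that needs only O(d) numbers fits in a few megabytes.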

How to efficiently perform binary embedding for high-dimensional data?

Binary Embedding (cont’d)

Method               Time         Space    Time (Learning)
Full projection      O(d^2)       O(d^2)   O(nd^3)
Bilinear projection  O(d^1.5)     O(d)     O(nd^1.5)
CBE                  O(d log d)   O(d)     O(nd log d)

Related Work: Bilinear projection [GKRL13]

I Reshape $x \in \mathbb{R}^d$ into a matrix $Z \in \mathbb{R}^{\sqrt{d} \times \sqrt{d}}$

I Apply a bilinear projection to get the binary code

$h(x) = \mathrm{sign}(R_1^T Z R_2)$
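A minimal sketch of the bilinear code, assuming square Gaussian random matrices for R1 and R2 and mapping sign(0) to +1. The function name `bilinear_code` and these choices are illustrative assumptions, not details taken from [GKRL13].

```python
import numpy as np

def bilinear_code(x, R1, R2):
    """Return sign(R1^T Z R2) as a flat +/-1 vector, with Z = x reshaped to a square matrix."""
    side = int(round(np.sqrt(x.size)))
    if side * side != x.size:
        raise ValueError("d must be a perfect square")
    Z = x.reshape(side, side)
    P = R1.T @ Z @ R2                      # two small matrix products, O(d^1.5) in total
    return np.where(P >= 0, 1, -1).ravel()

rng = np.random.default_rng(0)
side = 32
d = side * side                            # d = 1024, sqrt(d) = 32
R1 = rng.standard_normal((side, side))     # assumption: random Gaussian projections
R2 = rng.standard_normal((side, side))
x = rng.standard_normal(d)
code = bilinear_code(x, R1, R2)
```

The two factors hold 2d numbers in total, which is where the O(d) space entry for the bilinear projection comes from.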

Our Approach: Circulant Binary Embedding (CBE)

I Much better retrieval performance for fixed coding time, by allowing more bits to be generated

I Much faster computation with no performance degradation for a fixed number of bits

Table of Contents

Circulant Binary Embedding (CBE)
  FFT-based Computation

How to choose R?
  Randomized CBE (CBE-rand)
  Learning data-dependent CBE (CBE-opt)

Experiments
  Coding Time based on Fixed Hardware
  Large-scale Nearest-Neighbor Search
  Large-scale Classification

Circulant Binary Embedding (CBE)

$h(x) = \mathrm{sign}(RDx)$, $R \in \mathbb{R}^{d \times d}$

I R is a circulant matrix

I R is defined by a vector $r = (r_0, r_1, \cdots, r_{d-1})^T$

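$$
R = \mathrm{circ}(r) :=
\begin{bmatrix}
r_0 & r_{d-1} & \cdots & r_2 & r_1 \\
r_1 & r_0 & r_{d-1} & & r_2 \\
\vdots & r_1 & r_0 & \ddots & \vdots \\
r_{d-2} & & \ddots & \ddots & r_{d-1} \\
r_{d-1} & r_{d-2} & \cdots & r_1 & r_0
\end{bmatrix}
$$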

I D is a diagonal matrix, each entry ±1 with probability 1/2 (random sign flipping, dropped to simplify notation)

I k-bit (k < d) code: first k elements of h(x)
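The definition above can be written out directly with a dense matrix. This is a sketch for small d only; the helper names `circ` and `cbe_dense` are made up for illustration, and sign(0) is mapped to +1 by assumption.

```python
import numpy as np

def circ(r):
    """Dense circulant matrix with R[i, j] = r[(i - j) mod d]; its first column is r."""
    d = r.size
    idx = (np.arange(d)[:, None] - np.arange(d)[None, :]) % d
    return r[idx]

def cbe_dense(x, r, signs, k):
    """First k bits of sign(R D x), using the explicit O(d^2) matrix product."""
    proj = circ(r) @ (signs * x)           # signs holds the diagonal of D
    return np.where(proj[:k] >= 0, 1, -1)

rng = np.random.default_rng(0)
d, k = 8, 5
r = rng.standard_normal(d)
signs = rng.choice([-1.0, 1.0], size=d)    # each entry is +1 or -1 with probability 1/2
x = rng.standard_normal(d)

R = circ(r)
bits = cbe_dense(x, r, signs, k)
```

Each column of `R` is the previous column shifted down by one position with wrap-around, so the whole matrix is determined by the d entries of `r`.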

Circulant Binary Embedding (CBE)

h(x) = sign(Rx), R = circ(r)

Why CBE?

FFT-based Computation

h(x) = sign(Rx), R = circ(r)

1. Circulant projection is identical to circular convolution

$Rx = \mathrm{circ}(r)\,x = r \circledast x$

2. Circular convolution can be computed with FFT

$r \circledast x = \mathcal{F}^{-1}(\mathcal{F}(r) \circ \mathcal{F}(x))$

3. Time complexity O(d log d). Space complexity O(d)

Related: Johnson-Lindenstrauss results with circulant and other structured matrices [AC06] [Vyb11]
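The three steps above can be sketched with NumPy's FFT routines. The dense matrix is built only to check the result; `circulant_project_fft` is an illustrative name, not taken from the authors' code.

```python
import numpy as np

def circulant_project_fft(r, x):
    """Compute circ(r) @ x as a circular convolution via FFT, in O(d log d) time."""
    return np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)).real

rng = np.random.default_rng(0)
d = 256
r = rng.standard_normal(d)
x = rng.standard_normal(d)

# Dense reference, R[i, j] = r[(i - j) mod d]; needed only for the comparison.
idx = (np.arange(d)[:, None] - np.arange(d)[None, :]) % d
dense = r[idx] @ x

fast = circulant_project_fft(r, x)
max_err = np.max(np.abs(dense - fast))     # floating-point round-off only
bits = np.where(fast >= 0, 1, -1)          # h(x) = sign(Rx)
```

For real inputs, `np.fft.rfft` and `np.fft.irfft` would roughly halve the work; the complex transform is used here to mirror the formula on the slide.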

Circulant Binary Embedding (CBE)

How to choose R?

I Randomized CBE (CBE-rand)

I Learning data-dependent CBE (CBE-opt)

Randomized CBE

h(x) = sign(Rx), R = circ(r)

Each element of r is generated i.i.d. from $\mathcal{N}(0, 1)$

Distance Preserving Properties

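For any $x_1, x_2 \in \mathbb{R}^d$, let $\theta$ be the angle between $x_1$ and $x_2$

I $P(h_i(x_1) \neq h_i(x_2)) = \theta/\pi$

I $E\left[\frac{1}{k}\,\mathrm{hamming}(h_{1 \ldots k}(x_1), h_{1 \ldots k}(x_2))\right] = \theta/\pi$

I $\mathrm{Var}\left(\frac{1}{k}\,\mathrm{hamming}(h_{1 \ldots k}(x_1), h_{1 \ldots k}(x_2))\right) = \;?$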

[Figure: variance of the normalized Hamming distance (y-axis, 0 to 0.25) versus log k (x-axis, 0 to 8), with curves for θ = π/2 and θ = π/6]
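The expectation result can be checked empirically. The following is a small Monte Carlo sketch, assuming k = d, random sign flipping D, and illustrative values for d, the number of trials, and θ (none of which come from the slides).

```python
import numpy as np

def cbe_rand_bits(x, r, signs):
    """Boolean bits of sign(circ(r) D x), computed via FFT; signs is the diagonal of D."""
    proj = np.fft.ifft(np.fft.fft(r) * np.fft.fft(signs * x)).real
    return proj >= 0

rng = np.random.default_rng(0)
d, trials = 64, 2000
theta = np.pi / 6

# Two unit vectors separated by exactly the angle theta.
x1 = rng.standard_normal(d)
x1 /= np.linalg.norm(x1)
u = rng.standard_normal(d)
u -= (u @ x1) * x1                  # make u orthogonal to x1
u /= np.linalg.norm(u)
x2 = np.cos(theta) * x1 + np.sin(theta) * u

dists = np.empty(trials)
for t in range(trials):
    r = rng.standard_normal(d)                   # fresh CBE-rand parameters per trial
    signs = rng.choice([-1.0, 1.0], size=d)
    dists[t] = np.mean(cbe_rand_bits(x1, r, signs) != cbe_rand_bits(x2, r, signs))

mean_dist = dists.mean()            # estimate of E[hamming / k]
var_dist = dists.var()              # empirical variance for k = d
```

`mean_dist` should land close to θ/π (about 0.167 for θ = π/6), while `var_dist` gives an empirical handle on the variance that the slide leaves as an open question.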

Learning Data-dependent CBE

I Consider learning a d-bit code first

I $X \in \mathbb{R}^{n \times d}$: $X = [x_0, \cdots, x_{n-1}]^T$

I $B \in \{-1, 1\}^{n \times d}$: the binary code matrix

$$
\arg\min_{B,\, r} \; \underbrace{\|B - XR^T\|_F^2}_{\text{binary distortion}} + \lambda \underbrace{\|RR^T - I\|_F^2}_{\text{non-redundancy in the bits}} \quad \text{s.t. } R = \mathrm{circ}(r)
$$

I A challenging combinatorial optimization problem

Time-Frequency Alternating Minimization

Optimize B with fixed r, in original “time” domain

$$
\arg\min_{B} \; \|B - XR^T\|_F^2, \qquad B = \mathrm{sign}(XR^T)
$$

Optimize r with fixed B

$$
\arg\min_{r} \; \|B - XR^T\|_F^2 + \lambda \|RR^T - I\|_F^2 \quad \text{s.t. } R = \mathrm{circ}(r)
$$

I Can be solved efficiently in the frequency domain

Time-Frequency Alternating Minimization (cont’d)

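I We optimize $\tilde{r} := \mathcal{F}(r)$

I Key tool: Parseval’s theorem (DFT preserves distance)

$$
\begin{aligned}
\arg\min_{\tilde{r}} \quad & \Re(\tilde{r})^T M \Re(\tilde{r}) + \Im(\tilde{r})^T M \Im(\tilde{r}) + \Re(\tilde{r})^T h + \Im(\tilde{r})^T g + \lambda d \, \|\Re(\tilde{r})^2 + \Im(\tilde{r})^2 - 1\|_2^2 \\
\text{s.t.} \quad & \Im(\tilde{r}_0) = 0 \\
& \Re(\tilde{r}_i) = \Re(\tilde{r}_{d-i}), \quad i = 1, \cdots, \lfloor d/2 \rfloor \\
& \Im(\tilde{r}_i) = -\Im(\tilde{r}_{d-i}), \quad i = 1, \cdots, \lfloor d/2 \rfloor
\end{aligned}
$$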

I Non-convex. Can be decomposed into d independent small optimization problems (4th order polynomials with only 2 variables!)
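The move to the frequency domain rests on two identities that can be checked numerically: Parseval's theorem for the distortion term, and the fact that the non-redundancy term depends only on the magnitudes of the DFT of r. The sketch below uses NumPy's unnormalized DFT convention, which is an assumption about the normalization; M, h and g from the slide are not reconstructed here.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
r = rng.standard_normal(d)
x = rng.standard_normal(d)
b = rng.choice([-1.0, 1.0], size=d)      # one row of the code matrix B

# Dense circulant matrix, R[i, j] = r[(i - j) mod d].
idx = (np.arange(d)[:, None] - np.arange(d)[None, :]) % d
R = r[idx]
r_f = np.fft.fft(r)                      # frequency-domain variable F(r)

# Parseval: ||b - R x||^2 = (1/d) ||F(b) - F(r) * F(x)||^2 for the unnormalized DFT.
time_err = np.sum((b - R @ x) ** 2)
freq_err = np.sum(np.abs(np.fft.fft(b) - r_f * np.fft.fft(x)) ** 2) / d

# Non-redundancy: ||R R^T - I||_F^2 = sum_i (Re(r_f[i])^2 + Im(r_f[i])^2 - 1)^2.
time_pen = np.sum((R @ R.T - np.eye(d)) ** 2)
freq_pen = np.sum((r_f.real ** 2 + r_f.imag ** 2 - 1.0) ** 2)
```

Because the distortion term picks up a factor 1/d and the penalty does not, multiplying the whole objective by d would leave a λd weight on the penalty, which appears consistent with the formula on the slide. Both terms are sums over frequency indices, and the conjugate-symmetry constraints reflect that r is real.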

Time-Frequency Alternating Minimization (cont’d)

Remarks on the algorithm

I The objective is guaranteed to be non-increasing

I Good solution with just 5-10 iterations

I Running time O(nd log d)

I O(d) storage and parallel nature, suitable for GPU

I Not sensitive to λ

Learning k < d bits

I A simple approach: setting the last (d − k) bits to zero

Computational Time

Computational time on fixed hardware

d             Full projection   Bilinear projection   CBE
2^15          5.44 × 10^2       2.85                  1.11
2^17          -                 1.91 × 10^1           4.23
2^20 (1M)     -                 3.76 × 10^2           3.77 × 10^1
2^27 (100M)   -                 2.68 × 10^5           8.15 × 10^3

I Dramatic speedup for high-dim data

I Moderate speedup for low-dim data (FFT overhead)

Large-Scale Nearest-Neighbor Search

Methods

I CBE (CBE-rand, CBE-opt)

I LSH

I Bilinear code (Bilinear-rand, Bilinear-opt)

Experimental Setting

I 100k images, 25,600-dimensional features

I Use an image as query to retrieve NN. Repeat 500 times

I Ground-truth: 10 nearest neighbors based on ℓ2 distance

Large-Scale Nearest-Neighbor Search (cont’d)

Recall (fixed coding time): Much higher recall than LSH and bilinear code

[Figure: recall (0 to 1) versus number of retrieved points (10 to 100) for LSH, Bilinear-rand, Bilinear-opt, CBE-rand and CBE-opt. Panels: (a) # bits (CBE) = 6,400; (b) # bits (CBE) = 25,600]

Large-Scale Nearest-Neighbor Search (cont’d)

Recall (fixed number of bits): Comparable or even better performance with faster computation

[Figure: recall (0 to 1) versus number of retrieved points (10 to 100) for LSH, Bilinear-rand, Bilinear-opt, CBE-rand and CBE-opt. Panels: (c) # bits (all) = 6,400; (d) # bits (all) = 25,600]

Large-Scale Classification

Learning on binary code:

I Advantage: save storage

I ImageNet data: 1k categories, 100 images per category for training, 50 for validation and 50 for testing

I d = k = 25,600 (32 times more space efficient)
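The factor of 32 follows if each original feature is stored as a 32-bit float and each code bit as a single bit. A small sketch of that accounting, assuming float32 storage (the slides do not state the precision) and using random data as a stand-in for a real feature vector:

```python
import numpy as np

d = 25_600
rng = np.random.default_rng(0)
features = rng.standard_normal(d).astype(np.float32)   # stand-in feature vector

code = features >= 0                  # placeholder d-bit code, one bit per dimension
packed = np.packbits(code)            # pack 8 bits into each byte

float_bytes = features.nbytes         # 25,600 * 4 bytes
code_bytes = packed.nbytes            # 25,600 / 8 bytes
ratio = float_bytes // code_bytes
print(float_bytes, code_bytes, ratio)
```

Here the thresholding is only a placeholder for a binary embedding; the point is the storage ratio between 102,400 bytes and 3,200 bytes per item.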

Multiclass classification accuracy (%)

Original       LSH            Bilinear-opt   CBE-opt
25.59 ± 0.33   23.49 ± 0.24   24.02 ± 0.35   24.55 ± 0.30

I CBE: faster computation, no performance degradation

Conclusion and More

Conclusion

I An O(d log d) method for high-dimensional binary embedding

I Much better retrieval performance for fixed coding time

I Much faster computation for fixed number of bits

I CBE can be applied to data with ∼100M dimensions!

More

I CBE can be easily extended to semi-supervised case

I Implementation of CBE and baselines available at https://github.com/felixyu/cbe

The Requirement of D

h(x) = sign(Rx), R = circ(r)

Two Types of Distance Distortions

1. Distortion from the circulant projection

I Johnson-Lindenstrauss type results with structured matrices [Vyb11]

I The random sign flipping is required

I If x is an all-1 vector, all the bits will be the same, and close to 0

2. Distortion from sign(·)
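The all-ones example is easy to reproduce. In the sketch below (with an illustrative d and random seed), every entry of circ(r)x equals the sum of r when x is the all-ones vector, so all bits agree; applying random sign flips first removes that degeneracy.

```python
import numpy as np

def circ_project(r, v):
    """circ(r) @ v computed as a circular convolution via FFT."""
    return np.fft.ifft(np.fft.fft(r) * np.fft.fft(v)).real

rng = np.random.default_rng(0)
d = 32
r = rng.standard_normal(d)
x = np.ones(d)

# Without D: each output entry is sum(r), so every bit is identical.
no_flip = circ_project(r, x)
bits_no_flip = no_flip >= 0

# With D: random signs break the symmetry and the bits differ across positions.
signs = rng.choice([-1.0, 1.0], size=d)
with_flip = circ_project(r, signs * x)
bits_with_flip = with_flip >= 0
```

Comparing `bits_no_flip` (all equal) with `bits_with_flip` (a mix of both values) shows why the slide lists the random sign flipping as required.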

References I

[AC06] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In ACM Symposium on Theory of Computing, 2006.

[Cha02] Moses S Charikar. Similarity estimation techniques from rounding algorithms. In ACM Symposium on Theory of Computing, 2002.

[GKRL13] Yunchao Gong, Sanjiv Kumar, Henry A Rowley, and Svetlana Lazebnik. Learning binary codes for high-dimensional data using bilinear projections. In Computer Vision and Pattern Recognition, 2013.

[GLGP13] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

[KD09] Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In Advances in Neural Information Processing Systems, 2009.

[LSMK11] Ping Li, Anshumali Shrivastava, Joshua Moore, and Arnd Konig. Hashing algorithms for large-scale learning. In Advances in Neural Information Processing Systems, 2011.

References II

[RL09] Maxim Raginsky and Svetlana Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In Advances in Neural Information Processing Systems, 2009.

[SP11] Jorge Sanchez and Florent Perronnin. High-dimensional signature compression for large-scale image classification. In Computer Vision and Pattern Recognition, 2011.

[Vyb11] Jan Vybíral. A variant of the Johnson–Lindenstrauss lemma for circulant matrices. Journal of Functional Analysis, 260(4):1096–1105, 2011.

[WKC10] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Sequential projection learning for hashing with compact codes. In International Conference on Machine Learning, 2010.
