
Dimensionality reduction:

Johnson-Lindenstrauss lemma

for circulant matrices

Jan Vybíral

Austrian Academy of Sciences

RICAM, Linz, Austria

April 2010

Helmholtz Zentrum

Munich, Germany

partially joint work with Aicke Hinrichs (University of Jena, Germany)


Outline

◮ Johnson-Lindenstrauss lemma

◮ Classical proof

◮ Variants and improvements

◮ Applications - Approximate nearest neighbours

◮ Circulant matrices

◮ Decoupling vs. Fourier transform


Johnson-Lindenstrauss lemma

Let

◮ $\varepsilon \in (0, \tfrac{1}{2})$,

◮ $x_1, \dots, x_n \in \mathbb{R}^d$ be arbitrary points,

◮ $k = O(\varepsilon^{-2} \log n)$, i.e. $k \ge C \varepsilon^{-2} \log n$.

Then there exists a (linear) mapping $f : \mathbb{R}^d \to \mathbb{R}^k$ such that

$$(1 - \varepsilon)\|x_i - x_j\|_2^2 \le \|f(x_i) - f(x_j)\|_2^2 \le (1 + \varepsilon)\|x_i - x_j\|_2^2$$

for all $i, j \in \{1, \dots, n\}$. Here $\|\cdot\|_2$ stands for the Euclidean norm in $\mathbb{R}^d$ or $\mathbb{R}^k$, respectively.

For example: $n = 10^9$, $\varepsilon = 0.2$, $k = 4200$, $d$ arbitrary!
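As a quick sanity check of these numbers, the sketch below back-solves the constant $C$ that $k = 4200$ would imply. It assumes the natural logarithm; the slide states neither the base nor the constant, and `implied_constant` is a hypothetical helper name.

```python
import math

def implied_constant(n: float, eps: float, k: int) -> float:
    """Return C such that k = C * eps**-2 * ln(n) (natural log is an assumption)."""
    return k / (eps ** -2 * math.log(n))

# Numbers from the example: n = 10^9 points, eps = 0.2, k = 4200.
C = implied_constant(n=1e9, eps=0.2, k=4200)
print(C)
```

Under this assumption the implied constant is roughly 8, independently of the original dimension $d$.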


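Typical proof

$\mathcal{A} \subset \mathbb{R}^{k \times d}$ - a set of $k \times d$ matrices,

$P$ - a probability measure on $\mathcal{A}$.

For each $y$ with $\|y\|_2 = 1$: concentration of measure

$$P(A \in \mathcal{A} : \|Ay\|_2^2 > 1 + \varepsilon) \le \exp(-ck\varepsilon^2),$$

$$P(A \in \mathcal{A} : \|Ay\|_2^2 < 1 - \varepsilon) \le \exp(-ck\varepsilon^2).$$

Choosing $k$ such that

$$\exp(-ck\varepsilon^2) \le \frac{1}{n^2},$$

the probability of failure (union bound over the $\binom{n}{2}$ pairs of points) is at most

$$2 \cdot \binom{n}{2} \cdot \frac{1}{n^2} = 1 - \frac{1}{n} < 1.$$

Hence, the probability of success is positive!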


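The condition

$$\exp(-ck\varepsilon^2) \le \frac{1}{n^2}$$

leads to $ck\varepsilon^2 \ge 2 \log n$ and

$$k \ge \frac{2}{c} \cdot \varepsilon^{-2} \log n, \quad \text{i.e. } C = \frac{2}{c}.$$

By increasing $C$ (e.g. $C = 3/c$), we may achieve that such a mapping becomes "typical", i.e. occurs with probability at least $1 - 1/n$.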
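The effect of the larger constant can be spelled out with the same union bound; the following short check uses nothing beyond the tail estimate for a single unit vector:

$$k \ge \frac{3}{c}\,\varepsilon^{-2}\log n \;\Longrightarrow\; \exp(-ck\varepsilon^2) \le n^{-3} \;\Longrightarrow\; 2\binom{n}{2}\, n^{-3} = \frac{n-1}{n^2} \le \frac{1}{n}.$$

So the failure probability drops from $1 - 1/n$ to at most $1/n$.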


Classical proof

W. B. Johnson and J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math., 26:189-206, 1984.

Projection onto a "random" $k$-dimensional subspace satisfies the desired property with positive probability.

advantages: geometrical proof

disadvantages:

◮ requires a measure on the set of all $k$-dimensional subspaces,

◮ evaluating $f(x)$ involves orthonormalisation,

◮ time consuming.


Variants and improvements

Elementary proof: S. Dasgupta and A. Gupta, An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22:60-65, 2003.

Improvements motivated by applications:

◮ Fast evaluation of $f(x)$

◮ Little randomness used

◮ Little memory used

◮ ...others...


D. Achlioptas, Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671-687, 2003.

$f$ is realised by a $k \times d$ matrix whose entries are generated independently at random: Gaussian or Bernoulli (or similar) variables.

◮ Running time: $k \times d$

◮ Randomness: $k \times d$

◮ Memory space: $k \times d$
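A minimal sketch of such a dense projection with $\pm 1$ entries, using only the standard library. The scaling $1/\sqrt{k}$, the helper names and the test sizes are assumptions of this sketch, not taken from the cited paper; the three costs listed above are visible directly, since every one of the $k \times d$ signs is drawn and used once per point.

```python
import math
import random

def sign_projection(points, k, rng):
    """Map each x in R^d to (1/sqrt(k)) * A x, where A is a k x d matrix of +-1 signs."""
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)
    images = [[0.0] * k for _ in points]
    for row in range(k):
        # One row of A: d independent fair coin flips in {-1, +1}.
        signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
        for idx, x in enumerate(points):
            images[idx][row] = scale * sum(s * xi for s, xi in zip(signs, x))
    return images

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def max_distortion(points, images):
    """Largest relative error of the squared pairwise distances."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratio = sq_dist(images[i], images[j]) / sq_dist(points[i], points[j])
            worst = max(worst, abs(ratio - 1.0))
    return worst

rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(600)] for _ in range(8)]
images = sign_projection(points, k=300, rng=rng)
print(max_distortion(points, images))
```

The printed value is the empirical $\varepsilon$ for this particular draw; it varies with the seed.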


N. Ailon and B. Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proc. 38th Annual ACM Symposium on Theory of Computing, 2006.

$f(x) = PHDx$, where

◮ $P$ is a $k \times d$ matrix whose components are generated independently at random: $P_{i,j} \sim N(0, 1)$ with probability

$$q = \min\left\{\Theta\left(\frac{\log^2 n}{d}\right), 1\right\}$$

and $P_{i,j} = 0$ with probability $1 - q$,

◮ $H$ is the $d \times d$ normalised Hadamard matrix,

◮ $D$ is a random $d \times d$ diagonal matrix, with each $D_{i,i}$ drawn independently from $\{-1, 1\}$ with probability $1/2$.
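The preconditioning step $HD$ can be sketched as follows, assuming $d$ is a power of two and using the standard in-place fast Walsh-Hadamard recursion. The sparse matrix $P$ is deliberately left out, and the function names are hypothetical. The example shows the purpose of $HD$: a vector with all its mass in one coordinate is spread evenly, while the Euclidean norm is preserved.

```python
import math
import random

def hadamard_transform(x):
    """Normalised Walsh-Hadamard transform H x in O(d log d); len(x) must be a power of 2."""
    y = list(x)
    d = len(y)
    assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):
            for i in range(start, start + h):
                u, v = y[i], y[i + h]
                y[i], y[i + h] = u + v, u - v   # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(d)
    return [scale * t for t in y]

def precondition(x, signs):
    """Compute H D x, where D = diag(signs) has entries in {-1, +1}."""
    return hadamard_transform([s * xi for s, xi in zip(signs, x)])

rng = random.Random(1)
d = 16
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
spike = [1.0] + [0.0] * (d - 1)       # all mass in one coordinate
flat = precondition(spike, signs)     # every entry has modulus 1/sqrt(d)
```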


Running time: with high probability, $f(x)$ may be calculated in time $O(d \log d + qdk)$

Randomness: $k \times d$

Memory space: with high probability $O(d + kdq) = O(d + k \log^2 n)$

Not easy to implement

Other variants and improvements. . .


Approximate nearest neighbors

The nearest neighbor problem: Given $P = \{x_1, \dots, x_n\}$ in a metric space $X$, preprocess $P$ so as to efficiently find, for a query $q \in X$, the minimiser of

$$\min_{i=1,\dots,n} d(x_i, q).$$

Naive algorithm: compare all the distances - no preprocessing.

The approximate nearest neighbor problem: Given $P = \{x_1, \dots, x_n\}$ in a metric space $X$ and $\varepsilon > 0$, preprocess $P$ so as to efficiently find, for a query $q \in X$, a point $p \in P$ such that

$$d(p, q) \le (1 + \varepsilon)\, d(p', q) \quad \text{for all } p' \in P.$$


$X = \mathbb{R}^d$, hashing functions:

Choose $v \in \mathbb{R}^d$ at random and pre-compute $h_v(i) := v^T x_i$, $i = 1, \dots, n$.

Find

$$\operatorname{argmin}_{p \in P} |v^T (p - q)|.$$

Iterate over different $v_1, v_2, \dots$
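The hashing scheme can be sketched as below. Two points are assumptions of this sketch rather than part of the slide: the directions $v$ are drawn as standard Gaussian vectors, and the candidates found for the different $v$ are re-ranked by their true distance to $q$. The inner search is a plain linear scan over the pre-computed values; a sorted table with binary search would be the natural refinement.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_hashes(points, num_directions, rng):
    """Pre-compute h_v(i) = v^T x_i for several random directions v."""
    d = len(points[0])
    directions = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(num_directions)]
    tables = [[dot(v, x) for x in points] for v in directions]
    return directions, tables

def query(points, directions, tables, q):
    """Return the index of the best candidate found over all directions."""
    candidates = set()
    for v, table in zip(directions, tables):
        hq = dot(v, q)
        # argmin_i |v^T (x_i - q)| using the pre-computed values h_v(i).
        candidates.add(min(range(len(points)), key=lambda i: abs(table[i] - hq)))
    # Assumption of this sketch: re-rank the few candidates by true distance.
    return min(candidates, key=lambda i: dist(points[i], q))

rng = random.Random(2)
points = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(50)]
directions, tables = build_hashes(points, num_directions=5, rng=rng)
best = query(points, directions, tables, q=points[3])
```

Querying with a point of the data set returns that point itself, since its hash difference is exactly zero for every direction. For a general query the result is only a candidate; the sketch makes no claim about the $(1 + \varepsilon)$ guarantee.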


Connection to compressed sensing - RIP

We say that $A \in \mathbb{R}^{n \times N}$ satisfies the Restricted Isometry Property of order $k$ if there exists $\delta_k \in (0, 1)$ such that

$$(1 - \delta_k)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta_k)\|x\|_2^2$$

holds for all $x \in \mathbb{R}^N$ with $\|x\|_0 := \#\{j = 1, \dots, N : x_j \ne 0\} \le k$.

The aim is to find matrices with small $\delta_k$ for large $k$.


R. Baraniuk, M. Davenport, R. DeVore and M. Wakin, A simple proof of the Restricted Isometry Property for Random Matrices, Constructive Approximation, 2008.

If

$$P_\omega\left(\|A(\omega)x\|_2^2 \ge (1 + \varepsilon)\|x\|_2^2\right) \le \exp(-n c(\varepsilon))$$

(and the analogous bound holds for $\|A(\omega)x\|_2^2 \le (1 - \varepsilon)\|x\|_2^2$) and $\delta > 0$, then $A(\omega)$ satisfies the RIP of order

$$k \le \frac{c'(\delta)\, n}{\log(N/n) + 1}$$

with constant $\delta$, with exponentially high probability.

Every distribution that yields J-L transforms also yields RIP matrices.


Circulant matrices

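Let $a = (a_0, \dots, a_{d-1})$ be i.i.d. random variables and

$$M_{a,k} = \begin{pmatrix}
a_0 & a_1 & a_2 & \dots & a_{d-1} \\
a_{d-1} & a_0 & a_1 & \dots & a_{d-2} \\
a_{d-2} & a_{d-1} & a_0 & \dots & a_{d-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{d-k+1} & a_{d-k+2} & a_{d-k+3} & \dots & a_{d-k}
\end{pmatrix} \in \mathbb{R}^{k \times d}.$$

Is it possible to take $f(x) = \frac{1}{\sqrt{k}} M_{a,k} x$? Or $f(x) = \frac{1}{\sqrt{k}} M_{a,k} D_\kappa x$?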
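A small sketch of both objects. It assumes that $D_\kappa$ is the diagonal matrix with independent random signs $\kappa_0, \dots, \kappa_{d-1}$ on the diagonal (the slides do not define it at this point), and the function names are hypothetical. Entry $(j, m)$ of $M_{a,k}$ is $a_{(m - j) \bmod d}$, so the product never needs the matrix in memory.

```python
import math

def partial_circulant(a, k):
    """First k rows of the circulant matrix of a: entry (j, m) is a[(m - j) mod d]."""
    d = len(a)
    return [[a[(m - j) % d] for m in range(d)] for j in range(k)]

def circulant_embed(a, kappa, x, k):
    """f(x) = k**-0.5 * M_{a,k} D_kappa x, computed without forming the matrix."""
    d = len(a)
    z = [s * xi for s, xi in zip(kappa, x)]   # D_kappa x
    scale = 1.0 / math.sqrt(k)
    # Row j of M_{a,k} applied to z equals sum_i a_i * z_{(i + j) mod d}.
    return [scale * sum(a[i] * z[(i + j) % d] for i in range(d)) for j in range(k)]

# Small deterministic illustration with d = 5 and k = 3.
rows = partial_circulant([0, 1, 2, 3, 4], 3)
for row in rows:
    print(row)
```

This direct evaluation costs $O(kd)$ operations and only serves to fix the indexing convention.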


Decoupling vs. Fourier transform

Yes! With $k = O(\varepsilon^{-2} \log^3 n)$ - decoupling techniques

Yes! With $k = O(\varepsilon^{-2} \log^2 n)$ - Fourier-analytic methods

The improvement to $O(\varepsilon^{-2} \log n)$ is still open . . . promising numerical experiments

advantages:

◮ running time $O(d \log d)$ - using FFT,

◮ randomness used: $2d$ instead of $(k + 1)d$,

◮ easy to implement: FFT is a part of every software package.

disadvantage: up to now - bigger $k$


$\log^3 n$:

Let

◮ $x_1, \dots, x_n$ be arbitrary points in $\mathbb{R}^d$,

◮ $\varepsilon \in (0, \tfrac{1}{2})$,

◮ $k = O(\varepsilon^{-2} \log^3 n)$,

◮ $a = (a_0, \dots, a_{d-1})$ be independent Bernoulli variables or independent normally distributed variables,

◮ $M_{a,k}$ and $D_\kappa$ be as above and

◮ $f(x) = \frac{1}{\sqrt{k}} M_{a,k} D_\kappa x$.

Then with probability at least $2/3$ the following holds:

$$(1 - \varepsilon)\|x_i - x_j\|_2^2 \le \|f(x_i) - f(x_j)\|_2^2 \le (1 + \varepsilon)\|x_i - x_j\|_2^2, \quad i, j = 1, \dots, n.$$


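Strategy of the proof of $\log^3 n$ - decoupling the dependence

Concentration inequalities for every fixed $x$ with $\|x\|_2 = 1$:

$$P_{a,\kappa}\left(\|M_{a,k} D_\kappa x\|_2^2 \ge (1 + \varepsilon)k\right) \le \exp(-c(k\varepsilon^2)^{1/3})$$

and

$$P_{a,\kappa}\left(\|M_{a,k} D_\kappa x\|_2^2 \le (1 - \varepsilon)k\right) \le \exp(-c(k\varepsilon^2)^{1/3}).$$

Then union bound over all $n(n - 1)/2$ pairs of points.

The bound on $k$ is given by

$$2 \cdot \frac{n(n - 1)}{2} \cdot \exp(-c(k\varepsilon^2)^{1/3}) < 1.$$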
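Solving this condition for $k$ shows where the third power of the logarithm comes from. A sufficient condition, with the same constant $c$ as in the tail bound, is

$$c\,(k\varepsilon^2)^{1/3} \ge 2\log n \;\Longleftrightarrow\; k \ge \frac{8}{c^3}\,\varepsilon^{-2}\log^3 n,$$

since then $n(n - 1)\exp(-c(k\varepsilon^2)^{1/3}) \le n(n - 1)\, n^{-2} < 1$.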


Separation of the diagonal and the off-diagonal term

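$$\|M_{a,k} D_\kappa x\|_2^2 = \sum_{j=0}^{k-1} \left(\sum_{i=0}^{d-1} a_i \kappa_{j+i} x_{j+i}\right)^2 = I + II,$$

$$I = \underbrace{\sum_{i=0}^{d-1} a_i^2 \cdot \sum_{j=0}^{k-1} x_{j+i}^2}_{\text{diagonal}}, \qquad II = \underbrace{\sum_{j=0}^{k-1} \sum_{i \ne i'} a_i a_{i'} \kappa_{j+i} \kappa_{j+i'} x_{j+i} x_{j+i'}}_{\text{off-diagonal}}$$

. . . summation in the index is modulo $d$ . . .

$$P_{a,\kappa}\left(\|M_{a,k} D_\kappa x\|_2^2 \ge (1 + \varepsilon)k\right) \le P_a(I \ge (1 + \varepsilon/2)k) + P_{a,\kappa}(II \ge \varepsilon k/2)$$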
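The splitting can be checked numerically. The sketch below evaluates the full sum, the diagonal part $I$ and the off-diagonal part $II$ separately for random data; the sizes and the helper name are arbitrary choices of this sketch. The diagonal part contains no $\kappa$ because $\kappa_{j+i}^2 = 1$.

```python
import random

def split_diag_offdiag(a, kappa, x, k):
    """Return (total, I, II) for ||M_{a,k} D_kappa x||_2^2, indices taken modulo d."""
    d = len(a)
    total = 0.0
    for j in range(k):
        row = sum(a[i] * kappa[(j + i) % d] * x[(j + i) % d] for i in range(d))
        total += row * row
    # Diagonal part: kappa^2 = 1, so only a_i^2 and x_{j+i}^2 remain.
    diag = sum(a[i] ** 2 * sum(x[(j + i) % d] ** 2 for j in range(k)) for i in range(d))
    # Off-diagonal part: all ordered pairs (i, i2) with i != i2.
    off = 0.0
    for j in range(k):
        for i in range(d):
            for i2 in range(d):
                if i != i2:
                    off += (a[i] * a[i2]
                            * kappa[(j + i) % d] * kappa[(j + i2) % d]
                            * x[(j + i) % d] * x[(j + i2) % d])
    return total, diag, off

rng = random.Random(6)
d, k = 12, 5
a = [rng.gauss(0.0, 1.0) for _ in range(d)]
kappa = [rng.choice((-1.0, 1.0)) for _ in range(d)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
total, diag, off = split_diag_offdiag(a, kappa, x, k)
print(total, diag + off)
```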


Estimates of $I$: $P_a(I \ge (1 + \varepsilon/2)k)$

Lemma of B. Laurent and P. Massart

. . . or any other variant of Bernstein's inequality

Exponential concentration of

$$Z = \sum_{i=1}^{D} \alpha_i (a_i^2 - 1),$$

where $a_i$ are i.i.d. normal variables and $\alpha_i$ are nonnegative real numbers. Then for any $t > 0$

$$P(Z \ge 2\|\alpha\|_2 \sqrt{t} + 2\|\alpha\|_\infty t) \le \exp(-t),$$

$$P(Z \le -2\|\alpha\|_2 \sqrt{t}) \le \exp(-t).$$

Here $\alpha_i := \sum_{j=0}^{k-1} x_{j+i}^2$, so that $\|\alpha\|_1 = k$, $\|\alpha\|_\infty \le 1$ and $\|\alpha\|_2 \le \sqrt{k}$.
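The three norm bounds follow from $\|x\|_2 = 1$; the second one assumes $k \le d$, so that the indices $j + i$, $j = 0, \dots, k - 1$, are distinct modulo $d$:

$$\|\alpha\|_1 = \sum_{i=0}^{d-1} \sum_{j=0}^{k-1} x_{j+i}^2 = \sum_{j=0}^{k-1} \|x\|_2^2 = k, \qquad \|\alpha\|_\infty \le \|x\|_2^2 = 1, \qquad \|\alpha\|_2 \le \sqrt{\|\alpha\|_1 \|\alpha\|_\infty} \le \sqrt{k}.$$

For Gaussian $a_i$ the diagonal term satisfies $I - k = \sum_i \alpha_i (a_i^2 - 1) = Z$. One admissible (not optimised) choice is $t = k\varepsilon^2/64$, for which

$$2\|\alpha\|_2 \sqrt{t} + 2\|\alpha\|_\infty t \le \frac{k\varepsilon}{4} + \frac{k\varepsilon^2}{32} \le \frac{\varepsilon k}{2}, \qquad \text{hence} \qquad P_a(I \ge (1 + \varepsilon/2)k) \le \exp(-k\varepsilon^2/64).$$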


Estimates of $II$:

Decoupling lemma of Bourgain and Tzafriri:

Let $\xi_0, \dots, \xi_{d-1}$ be independent random variables with $E\xi_0 = \dots = E\xi_{d-1} = 0$ and let $\{x_{i,j}\}_{i,j=0}^{d-1}$ be a double sequence of real numbers. Then for $1 \le p < \infty$

$$E\left|\sum_{i \ne j} x_{i,j} \xi_i \xi_j\right|^p \le 4^p\, E\left|\sum_{i \ne j} x_{i,j} \xi_i \xi'_j\right|^p,$$

where $(\xi'_0, \dots, \xi'_{d-1})$ denotes an independent copy of $(\xi_0, \dots, \xi_{d-1})$.


Further tools:

Khintchine's inequality (applied twice) and

$$\left(E_{a,a'}\left|\sum_{j=0}^{k-1} a_j a'_j\right|^p\right)^{1/p} \le \sqrt{p(k + p)},$$

for $a$ consisting of either Bernoulli or Gaussian variables.


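The role of $D_\kappa$

$k \le d$, $a_0, \dots, a_{d-1}$ independent normal variables

$$x = \frac{1}{\sqrt{d}}(1, \dots, 1), \qquad \|M_{a,k} x\|_2^2 = k\left(\sum_{j=0}^{d-1} \frac{a_j}{\sqrt{d}}\right)^2$$

2-stability:

$$b := \sum_{j=0}^{d-1} \frac{a_j}{\sqrt{d}} \sim N(0, 1)$$

$$P_a\left(\|M_{a,k} x\|_2^2 > (1 + \varepsilon)k\right) = P_b\left(b^2 > 1 + \varepsilon\right)$$

depends neither on $k$ nor on $d$, so without $D_\kappa$ increasing $k$ does not improve the concentration.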
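This counterexample is easy to reproduce. In the sketch below (sizes, seed and names are arbitrary choices), every coordinate of $M_{a,k} x$ equals the same number $b$ for the flat vector $x$, whereas inserting random signs (assumed here to be what $D_\kappa$ does) produces $k$ different coordinates.

```python
import math
import random

def circulant_apply(a, z, k):
    """(M_{a,k} z)_j = sum_i a_i z_{(i + j) mod d} for j = 0, ..., k - 1."""
    d = len(a)
    return [sum(a[i] * z[(i + j) % d] for i in range(d)) for j in range(k)]

rng = random.Random(3)
d, k = 64, 16
a = [rng.gauss(0.0, 1.0) for _ in range(d)]
x = [1.0 / math.sqrt(d)] * d                      # the flat unit vector
b = sum(a) / math.sqrt(d)                         # distributed as N(0, 1)

plain = circulant_apply(a, x, k)                  # without D_kappa: every entry equals b
kappa = [rng.choice((-1.0, 1.0)) for _ in range(d)]
signed = circulant_apply(a, [s * xi for s, xi in zip(kappa, x)], k)

ratio_plain = sum(t * t for t in plain) / k       # equals b**2, whatever k and d are
ratio_signed = sum(t * t for t in signed) / k     # an average of k different squares
print(ratio_plain, ratio_signed)
```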


$\log^2 n$

Let

◮ $x_1, \dots, x_n$ be arbitrary points in $\mathbb{R}^d$,

◮ $\varepsilon \in (0, \tfrac{1}{2})$,

◮ $k = O(\varepsilon^{-2} \log^2 n)$,

◮ $a = (a_0, \dots, a_{d-1})$ be independent normally distributed variables,

◮ $M_{a,k}$ and $D_\kappa$ be as above and

◮ $f(x) = \frac{1}{\sqrt{k}} M_{a,k} D_\kappa x$.

Then with probability at least $2/3$ the following holds:

$$(1 - \varepsilon)\|x_i - x_j\|_2^2 \le \|f(x_i) - f(x_j)\|_2^2 \le (1 + \varepsilon)\|x_i - x_j\|_2^2, \quad i, j = 1, \dots, n.$$


Fourier methods

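$F$ - unitary discrete Fourier transform, $F : \mathbb{C}^d \to \mathbb{C}^d$

Every circulant matrix may be diagonalised by $F$ and $F^{-1}$:

$$M_{a,d}\, x = F \operatorname{diag}(\sqrt{d} F a) F^{-1} x.$$

The singular values are the square roots of the eigenvalues of

$$M_{a,d} M_{a,d}^* = F \operatorname{diag}(\sqrt{d} F a) \operatorname{diag}(\overline{\sqrt{d} F a}) F^{-1} = F \operatorname{diag}(d |Fa|^2) F^{-1},$$

i.e. $\sqrt{d}\,|Fa|$.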
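The diagonalisation can be verified numerically. The sketch below uses a naive $O(d^2)$ unitary DFT from the standard library so that it stays self-contained; any FFT routine would replace `unitary_dft` and give the $O(d \log d)$ running time. The sign convention $e^{-2\pi i pq/d}$ for $F$ is an assumption of this sketch (the identity holds for either sign), and the function names are hypothetical.

```python
import cmath
import math
import random

def unitary_dft(z, inverse=False):
    """Naive O(d^2) unitary DFT F z (or F^{-1} z if inverse=True)."""
    d = len(z)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / math.sqrt(d)
    return [scale * sum(z[q] * cmath.exp(sign * 2j * math.pi * p * q / d) for q in range(d))
            for p in range(d)]

def circulant_via_dft(a, x):
    """Return (M_{a,d} x, singular values) using F diag(sqrt(d) F a) F^{-1} x."""
    d = len(a)
    symbol = [math.sqrt(d) * t for t in unitary_dft(a)]        # sqrt(d) * F a
    coeffs = unitary_dft(x, inverse=True)                      # F^{-1} x
    y = unitary_dft([s * c for s, c in zip(symbol, coeffs)])   # F applied to the product
    return [t.real for t in y], [abs(s) for s in symbol]

def circulant_direct(a, x):
    """(M_{a,d} x)_j = sum_m a_{(m - j) mod d} x_m, the row convention of M_{a,k}."""
    d = len(a)
    return [sum(a[(m - j) % d] * x[m] for m in range(d)) for j in range(d)]

rng = random.Random(5)
a = [rng.gauss(0.0, 1.0) for _ in range(8)]
x = [rng.gauss(0.0, 1.0) for _ in range(8)]
via_dft, singular_values = circulant_via_dft(a, x)
direct = circulant_direct(a, x)
print(max(abs(u - v) for u, v in zip(via_dft, direct)))
```

For the partial matrix $M_{a,k}$ one keeps the first $k$ coordinates of the result.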


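Strategy of the proof of $\log^2 n$

Concentration inequalities for all $\tilde{x} = \frac{x_i - x_j}{\|x_i - x_j\|_2}$:

$$P_a\left(\|M_{a,k} D_\kappa \tilde{x}\|_2^2 \ge (1 + \varepsilon)k\right) \le \exp\left(-\frac{ck\varepsilon^2}{\log n}\right),$$

$$P_a\left(\|M_{a,k} D_\kappa \tilde{x}\|_2^2 \le (1 - \varepsilon)k\right) \le \exp\left(-\frac{ck\varepsilon^2}{\log n}\right).$$

From this, the result follows again by a union bound.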


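Let $\|x\|_2 = 1$ and

$$y^j := S^j(D_\kappa x) \in \mathbb{C}^d, \quad j = 0, \dots, k - 1,$$

where $S$ is the shift operator

$$S : \mathbb{C}^d \to \mathbb{C}^d, \quad S(z_0, \dots, z_{d-1}) = (z_1, \dots, z_{d-1}, z_0).$$

$Y$ . . . $k \times d$ matrix with rows $y^0, \dots, y^{k-1}$.

Note that $\|M_{a,k} D_\kappa x\|_2^2 = \|Ya\|_2^2$. Hence,

$$P\left(\|M_{a,k} D_\kappa x\|_2^2 \ge (1 + \varepsilon)k\right) = P\left(\|Ya\|_2^2 \ge (1 + \varepsilon)k\right).$$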
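The change of viewpoint from "random matrix times fixed vector" to "fixed matrix $Y$ times random vector $a$" can be checked directly. The sketch below builds $Y$ from the shifts and compares $Ya$ with the circulant product coordinate by coordinate; the sizes, the seed, the random signs used for $D_\kappa$ and the helper names are assumptions of this sketch.

```python
import math
import random

def shift_rows(z, k):
    """Matrix Y whose row j is S^j z, with (S z)_i = z_{(i + 1) mod d}."""
    d = len(z)
    return [[z[(i + j) % d] for i in range(d)] for j in range(k)]

def partial_circulant_apply(a, z, k):
    """(M_{a,k} z)_j = sum_m a_{(m - j) mod d} z_m."""
    d = len(a)
    return [sum(a[(m - j) % d] * z[m] for m in range(d)) for j in range(k)]

rng = random.Random(4)
d, k = 32, 8
a = [rng.gauss(0.0, 1.0) for _ in range(d)]
kappa = [rng.choice((-1.0, 1.0)) for _ in range(d)]
raw = [rng.gauss(0.0, 1.0) for _ in range(d)]
norm = math.sqrt(sum(t * t for t in raw))
x = [t / norm for t in raw]                          # unit vector
z = [s * t for s, t in zip(kappa, x)]                # D_kappa x

Y = shift_rows(z, k)
Ya = [sum(y_i * a_i for y_i, a_i in zip(row, a)) for row in Y]
Mz = partial_circulant_apply(a, z, k)
frobenius_sq = sum(t * t for row in Y for t in row)  # equals k for a unit vector x
print(max(abs(u - v) for u, v in zip(Ya, Mz)), frobenius_sq)
```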


Let $Y = U \Sigma V$ be the singular value decomposition of $Y$. Then

$$\|Ya\|_2^2 = \|U \Sigma V a\|_2^2 = \|\Sigma V a\|_2^2 = \|\Sigma b\|_2^2,$$

where $b := Va$ is a $k$-dimensional vector of independent normal variables. Hence,

$$P\left(\|M_{a,k} D_\kappa x\|_2^2 \ge (1 + \varepsilon)k\right) = P\left(\sum_{j=0}^{k-1} \lambda_j^2 b_j^2 \ge (1 + \varepsilon)k\right),$$

where $\lambda_j$ are the singular values of $Y$.

Lemma of B. Laurent and P. Massart: Estimate $\|\lambda\|_4$ and $\|\lambda\|_\infty$!

$\|\lambda\|_2^2 = \|Y\|_F^2 = k$ and $\|\lambda\|_\infty^2 \le c \log n$ imply

$$\|\lambda\|_4^4 \le c\, k \log n.$$
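The last implication is a one-line interpolation between the two norms:

$$\|\lambda\|_4^4 = \sum_{j=0}^{k-1} \lambda_j^4 \le \|\lambda\|_\infty^2 \sum_{j=0}^{k-1} \lambda_j^2 = \|\lambda\|_\infty^2\, \|\lambda\|_2^2 \le c\, k \log n.$$

With the weights $\alpha_j := \lambda_j^2$ this gives $\|\alpha\|_2 = \|\lambda\|_4^2 \le \sqrt{c\, k \log n}$ and $\|\alpha\|_\infty \le c \log n$, which are the two quantities the lemma of Laurent and Massart requires; taking $t$ proportional to $k\varepsilon^2 / \log n$ is then consistent with a tail bound of the form $\exp(-c' k\varepsilon^2 / \log n)$.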


References:

W. B. Johnson and J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math., 26:189-206, 1984.

S. Dasgupta and A. Gupta, An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22:60-65, 2003.

N. Ailon and B. Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proc. 38th Annual ACM Symposium on Theory of Computing, 2006.

A. Hinrichs and J. Vybíral, Johnson-Lindenstrauss lemma for circulant matrices, http://arxiv.org/abs/1001.4919

J. Vybíral, A variant of the Johnson-Lindenstrauss lemma for circulant matrices, http://arxiv.org/abs/1002.2847
