
Learning sums of ridge functions in high dimension: a nonlinear compressed sensing model

Massimo Fornasier

Fakultät für Mathematik, Technische Universität München
massimo.fornasier@ma.tum.de
http://www-m15.ma.tum.de/

Winter School on Compressed Sensing, Technical University of Berlin
December 3-5, 2015

A collection of joint results with Ingrid Daubechies, Karin Schnass, and Jan Vybíral

Introduction on ridge functions

- A ridge function, in its simplest form, is a function $f : \mathbb{R}^d \to \mathbb{R}$ of the type
  $$f(x) = g(a^T x) = g(a \cdot x),$$
  where $g : \mathbb{R} \to \mathbb{R}$ is a scalar univariate function and $a \in \mathbb{R}^d$ is the direction of the ridge function;

- Ridge functions are constant along the hyperplanes $a \cdot x = \lambda$ for any given level $\lambda \in \mathbb{R}$ and are among the simplest forms of multivariate functions;

- They have been extensively studied in the past couple of decades as approximation building blocks for more complicated high-dimensional functions.

Some origins of ridge functions

- In multivariate Fourier series the basis functions are of the form $e^{i n \cdot x}$ for $n \in \mathbb{Z}^d$; in the Radon transform one encounters $e^{i a \cdot x}$ for arbitrary directions $a \in \mathbb{R}^d$;

- The term "ridge function" was actually coined by Logan and Shepp in 1975 in their work on computerized tomography, where they show how ridge functions solve the corresponding $L_2$-minimum-norm approximation problem.

Projection pursuit of the ’80s

- Ridge function approximation was also extensively studied during the '80s in mathematical statistics under the name of projection pursuit (Huber, 1985; Donoho-Johnstone, 1989);

- Projection pursuit algorithms approximate a function of $d$ variables by functions of the form
  $$\sum_{i=1}^m g_i(a_i \cdot x), \qquad x \in \mathbb{R}^d,$$
  for some functions $g_i : \mathbb{R} \to \mathbb{R}$ and some non-zero vectors $a_i \in \mathbb{R}^d$.

Some relevant applications of the ’90s

- In the early '90s there was an explosion of interest in the field of neural networks. One very popular model is the multilayer feed-forward neural network with input, hidden (internal), and output layers;

- the simplest case of such a network is described mathematically by a function of the form
  $$\sum_{i=1}^m \alpha_i\,\sigma\Big(\sum_{j=1}^d w_{ij} x_j + \theta_i\Big),$$
  where $\sigma : \mathbb{R} \to \mathbb{R}$ is a given function, called the activation function, and the $w_{ij}$ are suitable weights (a minimal numerical illustration follows below).
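To make the connection explicit, here is a minimal numerical sketch (my own illustration, not part of the original slides): a one-hidden-layer network with weight rows $w_i$ is exactly a sum of ridge functions with directions $w_i$ and profiles $g_i(t) = \alpha_i\sigma(t + \theta_i)$. The choice $\sigma = \tanh$ and all parameters below are arbitrary.

```python
# Minimal sketch (illustration only): a one-hidden-layer network written as a
# sum of ridge functions g_i(w_i . x) with g_i(t) = alpha_i * sigma(t + theta_i).
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 3                                    # input dimension, hidden units
W = rng.standard_normal((m, d))                # rows w_i: ridge directions
alpha, theta = rng.standard_normal(m), rng.standard_normal(m)
sigma = np.tanh                                # activation function

def network(x):
    return sum(alpha[i] * sigma(W[i] @ x + theta[i]) for i in range(m))

x = rng.standard_normal(d)
print(network(x))
```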

Ridge functions and approximation theory

- In the early '90s, the question of whether one can use sums of ridge functions to approximate arbitrary functions well was at the center of attention of the approximation theory community (overviews by Li 2002 and Pinkus 1997);

- the efficiency of such approximations compared to, e.g., spline-type approximation for smoothness classes of functions has been extensively considered (DeVore et al. 1997; Petrushev, 1999);

- The identification of a ridge function has also been thoroughly considered; in particular we mention the work of Pinkus and, concerning multilayer neural networks, we refer to the work by Fefferman, 1994.

- Except for the work of Candès on ridgelets, there has been less attention after 2000 on the problem of approximating functions by means of ridge functions.

Capturing ridge functions from point queries

- The above results on the identification of such functions are based on having access to arbitrary outputs or even derivatives;

- in certain practical situations this might be very expensive, hazardous, or impossible;

- In a paper of 2012, Cohen, Daubechies, DeVore, Kerkyacharian, and Picard address the approximation of ridge functions from a minimal number of sampling queries:

For $g \in C^s([0,1])$, $s > 1$, $\|g\|_{C^s} \le M_0$, $\|a\|_{\ell_q^d} \le M_1$, $0 < q \le 1$,
$$\|f - \hat f\|_{C(\Omega)} \le C M_0\left(L^{-s} + M_1\Big(\frac{1 + \log(d/L)}{L}\Big)^{1/q - 1}\right),$$
using $3L + 2$ sampling points, deterministically and adaptively chosen.

Capturing ridge functions from point queries: a nonlinear compressed sensing model

Compressed sensing: given a suitable sensing matrix $X \in \mathbb{R}^{m \times d}$, with $m \ll d$, we wish to identify a nearly sparse vector $a \in \mathbb{R}^d$ from its measurements
$$y \approx X a,$$
by means of suitable algorithms ($\ell_1$-minimization, greedy algorithms) that use only $y$ and $X$.

The data
$$y_i \approx x_i \cdot a = x_i^T a, \qquad i = 1, \dots, m,$$
are linear measurements of $a$. If we now assume the $y_i$ to be the values of a ridge function at the points $x_i$,
$$y_i \approx g(a \cdot x_i), \qquad i = 1, \dots, m,$$
for some unknown or only roughly known nonlinear function $g$, the problem of identifying the ridge direction can be understood as a nonlinear compressed sensing model ...
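The following toy sketch (an illustration under assumed data, not from the talk) generates such nonlinear measurements for an exactly sparse direction $a$ and an arbitrarily chosen profile $g$; both choices are hypothetical.

```python
# Sketch of the nonlinear sampling model (illustration; g and the sparsity
# pattern are made up): y_i = g(a . x_i) at random sample points x_i.
import numpy as np

rng = np.random.default_rng(1)
d, m, k = 1000, 60, 5                      # ambient dim., samples, sparsity
a = np.zeros(d)
a[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)
a /= np.linalg.norm(a)                     # compressible (here: exactly sparse) direction

g = lambda t: np.sin(3 * t)                # unknown univariate profile
X = rng.standard_normal((m, d)) / np.sqrt(d)   # random sample points x_i (rows)
y = g(X @ a)                               # nonlinear "measurements" of a
print(y[:5])
```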

Ridge functions and functions of data clustered around manifolds

Figure: Functions on data clustered around a manifold can be locally approximated by k-ridge functions

Universal random sampling for a more general ridge model

M. Fornasier, K. Schnass, J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, FoCM, 2012.

$$f(x) = g(Ax), \qquad A \text{ a } k \times d \text{ matrix}.$$

- Rows of $A$ are compressible: $\max_i \|a_i\|_q \le C_1$, $0 < q \le 1$;
- $AA^T$ is the identity operator on $\mathbb{R}^k$;
- The regularity condition: $\sup_{|\alpha| \le 2}\|D^\alpha g\|_\infty \le C_2$;
- The matrix $H^f := \int_{S^{d-1}}\nabla f(x)\nabla f(x)^T\, d\mu_{S^{d-1}}(x)$ is a positive semi-definite matrix of rank $k$;
- We assume that the singular values of $H^f$ satisfy
  $$\sigma_1(H^f) \ge \dots \ge \sigma_k(H^f) \ge \alpha > 0.$$

How can we learn k-ridge functions from point queries?

M.D. House's differential diagnosis (or simply called "sensitivity analysis")

We rely on a numerical approximation of the directional derivative $\frac{\partial f}{\partial\varphi}$:
$$\nabla g(Ax)^T A\varphi = \frac{\partial f}{\partial\varphi}(x) = \frac{f(x + \epsilon\varphi) - f(x)}{\epsilon} - \frac{\epsilon}{2}\big[\varphi^T\nabla^2 f(\zeta)\varphi\big], \qquad \epsilon \le \bar\epsilon. \quad (*)$$

$\mathcal{X} = \{x^j \in \Omega : j = 1, \dots, m_{\mathcal{X}}\}$ drawn uniformly at random in $\Omega \subset \mathbb{R}^d$;

$\Phi = \{\varphi^j \in \mathbb{R}^d,\ j = 1, \dots, m_\Phi\}$, where
$$\varphi^j_\ell = \begin{cases} 1/\sqrt{m_\Phi} & \text{with prob. } 1/2,\\ -1/\sqrt{m_\Phi} & \text{with prob. } 1/2, \end{cases}$$
for every $j \in \{1, \dots, m_\Phi\}$ and every $\ell \in \{1, \dots, d\}$.
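A minimal sketch of this sampling scheme (my own illustration, assuming a synthetic ridge function with profile $\tanh$ as a stand-in for the unknown $f$; all sizes are arbitrary choices):

```python
# Sketch (own illustration) of the randomized sensitivity analysis: Bernoulli
# directions phi^j with entries +-1/sqrt(m_Phi) and first-order divided
# differences approximating the directional derivatives of f.
import numpy as np

rng = np.random.default_rng(2)
d, m_Phi, m_X, eps = 200, 40, 30, 1e-4

# toy ridge function f(x) = g(a . x) with g = tanh (stand-in for the unknown f)
a = rng.standard_normal(d); a /= np.linalg.norm(a)
f = lambda x: np.tanh(x @ a)

Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)   # rows phi^i
Xpts = rng.standard_normal((m_X, d))
Xpts /= np.linalg.norm(Xpts, axis=1, keepdims=True)               # x^j on S^{d-1}

# y_{ij} = (f(x^j + eps*phi^i) - f(x^j)) / eps  ~  df/dphi^i (x^j)
Y = np.array([[(f(x + eps * phi) - f(x)) / eps for x in Xpts] for phi in Phi])
print(Y.shape)   # (m_Phi, m_X), one divided difference per (direction, point)
```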

Sensitivity analysis

Figure: Randomized sensitivity analysis: at a point $x$ drawn at random on $S^{d-1}$ we query $f$ at $x$ and at the perturbed point $x + \epsilon\varphi$.

Collecting together the differential analysis

$\Phi$ ... $m_\Phi \times d$ matrix whose rows are the $\varphi^i$;  $X$ ... $d \times m_{\mathcal{X}}$ matrix
$$X = \big(A^T\nabla g(Ax^1)\,|\,\dots\,|\,A^T\nabla g(Ax^{m_{\mathcal{X}}})\big).$$
The $m_{\mathcal{X}} \times m_\Phi$ instances of $(*)$ in matrix notation:
$$\Phi X = Y + E. \qquad (**)$$
$Y$ and $E$ are $m_\Phi \times m_{\mathcal{X}}$ matrices defined by
$$y_{ij} = \frac{f(x^j + \epsilon\varphi^i) - f(x^j)}{\epsilon}, \qquad \epsilon_{ij} = -\frac{\epsilon}{2}\big[(\varphi^i)^T\nabla^2 f(\zeta_{ij})\varphi^i\big].$$

Example of active coordinates: which factors play a role?

We assume that
$$A = \begin{pmatrix} e_{i_1}^T \\ \vdots \\ e_{i_k}^T \end{pmatrix},$$
i.e.,
$$f(x) = f(x_1, \dots, x_d) = g(x_{i_1}, \dots, x_{i_k}),$$
where $f : \Omega = [0,1]^d \to \mathbb{R}$ and $g : [0,1]^k \to \mathbb{R}$.

We first want to identify the active coordinates $i_1, \dots, i_k$. Then one can apply any usual $k$-dimensional approximation method...

A possible algorithm chooses the sampling points at random; thanks to concentration of measure effects, we obtain the right result with overwhelming probability.

A simple algorithm based on concentration of measure

The algorithm to identify the active coordinates $I$ is based on the identity
$$\Phi^T\Phi X = \Phi^T Y + \Phi^T E,$$
where now $X$ has $i$-th row
$$X_i = \Big(\frac{\partial g}{\partial z_i}(Ax^1), \dots, \frac{\partial g}{\partial z_i}(Ax^{m_{\mathcal{X}}})\Big)$$
for $i \in I$, and all other rows equal to zero.

In expectation $\Phi^T\Phi \approx \mathrm{Id} : \mathbb{R}^d \to \mathbb{R}^d$, hence
$$\Phi^T\Phi X \approx X \quad\text{and}\quad \Phi^T E \text{ small} \;\Longrightarrow\; \Phi^T Y \approx X.$$
We select the $k$ largest rows of $\Phi^T Y$ and estimate the probability that their indices coincide with the indices of the non-zero rows of $X$.
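A toy end-to-end sketch of this selection rule (my own illustration under assumed data; the profile $g$ and all dimensions are invented):

```python
# Toy sketch (own illustration) of active-coordinate detection: select the k
# rows of Phi^T Y with the largest Euclidean norm as the estimated support.
import numpy as np

rng = np.random.default_rng(3)
d, k, m_Phi, m_X, eps = 500, 4, 60, 40, 1e-4
active = np.sort(rng.choice(d, size=k, replace=False))           # unknown i_1 < ... < i_k
f = lambda x: np.sin(np.sum(x[..., active], axis=-1))            # f(x) = g(x_{i_1},...,x_{i_k})

Xpts = rng.uniform(0.0, 1.0, size=(m_X, d))                      # points in Omega = [0,1]^d
Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)

Y = np.array([[(f(x + eps * phi) - f(x)) / eps for x in Xpts] for phi in Phi])
row_norms = np.linalg.norm(Phi.T @ Y, axis=1)                    # Phi^T Y ~ X row-wise
estimate = np.sort(np.argsort(row_norms)[-k:])
print(active, estimate)                                          # they coincide w.h.p.
```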

A first recovery result

Theorem (Schnass and Vybíral 2011). Let $f : \mathbb{R}^d \to \mathbb{R}$ be a function of $k$ active coordinates that is defined and twice continuously differentiable on a small neighbourhood of $[0,1]^d$. For $L \le d$ a positive real number, the randomized algorithm described above recovers the $k$ unknown active coordinates of $f$ with probability at least $1 - 6\exp(-L)$, using only
$$\mathcal{O}\big(k(L + \log k)(L + \log d)\big)$$
samples of $f$.

The constants involved in the $\mathcal{O}$ notation depend on smoothness properties of $g$, namely on
$$\frac{\max_{j=1,\dots,k}\|\partial_{i_j} g\|_\infty}{\min_{j=1,\dots,k}\|\partial_{i_j} g\|_1}.$$

Examples of active coordinate detection in dimension d = 1000

Figure: Detection of the active coordinates for the test functions $\max\big(1 - 5\sqrt{(x_3 - 1/2)^2 + (x_4 - 1/2)^2},\,0\big)^3$ and $\sin\big(6\pi\sum_{i=21}^{40} x_i\big) + \sum_{i=21}^{40}\big(\sin(6\pi x_i) + 5(x_i - 1/2)^2\big)$.

Learning ridge functions k = 1

Let $f(x) = g(a \cdot x)$, $f : B_{\mathbb{R}^d} \to \mathbb{R}$, where $a \in \mathbb{R}^d$,
$$\|a\|_2 = 1, \quad \|a\|_q \le C_1,\ 0 < q \le 1, \quad \max_{0 \le |\alpha| \le 2}\|D^\alpha g\|_\infty \le C_2,$$
$$\alpha = \int_{S^{d-1}}\|\nabla f(x)\|_{\ell_2^d}^2\, d\mu_{S^{d-1}}(x) = \int_{S^{d-1}}|g'(a \cdot x)|^2\, d\mu_{S^{d-1}}(x) > 0.$$

We consider again the Taylor expansion $(*)$ with $\Omega = S^{d-1}$.

We choose the points $\mathcal{X} = \{x^j \in S^{d-1} : j = 1, \dots, m_{\mathcal{X}}\}$ generated at random on $S^{d-1}$ with respect to $\mu_{S^{d-1}}$.

The matrix $\Phi$ is generated as before and we obtain $(**)$ again, in the form
$$\Phi\big[g'(a \cdot x^j)\,a\big] = y_j + \epsilon_j, \qquad j = 1, \dots, m_{\mathcal{X}}.$$

Algorithm 1:

- Given $m_\Phi$, $m_{\mathcal{X}}$, draw at random the sets $\Phi$ and $\mathcal{X}$, and construct $Y$ according to $(*)$.
- Set $\hat x_j = \Delta(y_j) := \arg\min_{y_j = \Phi z}\|z\|_{\ell_1^d}$.
- Find $j_0 = \arg\max_{j=1,\dots,m_{\mathcal{X}}}\|\hat x_j\|_{\ell_2^d}$.
- Set $\hat a = \hat x_{j_0}/\|\hat x_{j_0}\|_{\ell_2^d}$.
- Define $\hat g(y) := f(\hat a^T y)$ and $\hat f(x) := \hat g(\hat a \cdot x)$.
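A sketch of Algorithm 1 (my own code, not the authors' implementation), where the decoder $\Delta$ is realized as basis pursuit, i.e. the linear program $\min\|z\|_1$ s.t. $\Phi z = y$, solved with `scipy.optimize.linprog`; the test function and all parameters are invented for illustration.

```python
# Sketch (own code, not the authors') of Algorithm 1. The decoder Delta is
# basis pursuit min ||z||_1 s.t. Phi z = y, written as a linear program over
# the splitting z = z_plus - z_minus with z_plus, z_minus >= 0.
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, y):
    m, d = Phi.shape
    c = np.ones(2 * d)                              # minimize sum(z+ + z-) = ||z||_1
    A_eq = np.hstack([Phi, -Phi])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:d] - res.x[d:]

def algorithm1(f, d, m_Phi, m_X, eps=1e-4, rng=np.random.default_rng(0)):
    Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)
    Xpts = rng.standard_normal((m_X, d))
    Xpts /= np.linalg.norm(Xpts, axis=1, keepdims=True)           # x^j on S^{d-1}
    hats = []
    for x in Xpts:                                  # columns y_j, then hat_x_j = Delta(y_j)
        y = np.array([(f(x + eps * phi) - f(x)) / eps for phi in Phi])
        hats.append(basis_pursuit(Phi, y))
    hats = np.array(hats)
    j0 = np.argmax(np.linalg.norm(hats, axis=1))    # point with largest gradient estimate
    return hats[j0] / np.linalg.norm(hats[j0])      # hat_a

# Toy test: exactly sparse direction, profile g = tanh
rng = np.random.default_rng(4)
d, k = 120, 4
a = np.zeros(d); a[:k] = rng.standard_normal(k); a /= np.linalg.norm(a)
a_hat = algorithm1(lambda x: np.tanh(x @ a), d, m_Phi=40, m_X=10)
print(min(np.linalg.norm(a_hat - a), np.linalg.norm(a_hat + a)))  # small, up to sign
```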

Recovery result

Theorem (F., Schnass, and Vybíral 2012). Let $0 < s < 1$ and $\log d \le m_\Phi \le [\log 6]^2 d$. Then there is a constant $c_1'$ such that, using $m_{\mathcal{X}} \cdot (m_\Phi + 1)$ function evaluations of $f$, Algorithm 1 defines a function $\hat f : B_{\mathbb{R}^d}(1 + \epsilon) \to \mathbb{R}$ that, with probability at least
$$1 - \Big(e^{-c_1' m_\Phi} + e^{-\sqrt{m_\Phi d}} + 2e^{-\frac{2 m_{\mathcal{X}} s^2\alpha^2}{C_2^4}}\Big),$$
will satisfy
$$\|f - \hat f\|_\infty \le \frac{2 C_2(1 + \epsilon)\,\nu_1}{\sqrt{\alpha(1 - s)} - \nu_1},$$
where
$$\nu_1 = C'\Big(\Big[\frac{m_\Phi}{\log(d/m_\Phi)}\Big]^{1/2 - 1/q} + \frac{\epsilon}{\sqrt{m_\Phi}}\Big)$$
and $C'$ depends only on $C_1$ and $C_2$.

Ingredients of the proof

- compressed sensing;
- stability of one-dimensional subspaces;
- concentration inequalities (Hoeffding's inequality).

Compressed sensing

Theorem (Wojtaszczyk, 2011). Assume that $\Phi$ is an $m \times d$ random matrix with all entries being independent Bernoulli variables scaled by $1/\sqrt{m}$, and suppose that $d \ge [\log 6]^2 m$. Then there are positive constants $C, c_1', c_2' > 0$ such that, with probability at least
$$1 - e^{-c_1' m} - e^{-\sqrt{m d}},$$
the matrix $\Phi$ has the following property. For every $x \in \mathbb{R}^d$, $\epsilon \in \mathbb{R}^m$ and every natural number $K \le c_2'\, m/\log(d/m)$ we have
$$\|\Delta(\Phi x + \epsilon) - x\|_{\ell_2^d} \le C\Big(K^{-1/2}\sigma_K(x)_{\ell_1^d} + \max\big\{\|\epsilon\|_{\ell_2^m}, \sqrt{\log d}\,\|\epsilon\|_{\ell_\infty^m}\big\}\Big),$$
where
$$\sigma_K(x)_{\ell_1^d} := \inf\{\|x - z\|_{\ell_1^d} : \#\operatorname{supp} z \le K\}$$
is the error of the best $K$-term approximation of $x$.

How does compressed sensing play a role?

For the $d \times m_{\mathcal{X}}$ matrix $X$, i.e.,
$$X = \big(g'(a \cdot x^1)\,a\,|\,\dots\,|\,g'(a \cdot x^{m_{\mathcal{X}}})\,a\big),$$
$$\Phi\big[\underbrace{g'(a \cdot x^j)\,a}_{:= x_j}\big] = y_j + \epsilon_j, \qquad j = 1, \dots, m_{\mathcal{X}},$$
and
$$\hat x_j = \Delta(y_j) := \arg\min_{y_j = \Phi z}\|z\|_{\ell_1^d},$$
the previous result gives, with the probability provided there,
$$\hat x_j = g'(a \cdot x^j)\,a + n_j,$$
with $n_j$ properly estimated by
$$\|n_j\|_{\ell_2^d} \le C\Big(K^{-1/2}\sigma_K\big(g'(a \cdot x^j)\,a\big)_{\ell_1^d} + \max\big\{\|\epsilon_j\|_{\ell_2^m}, \sqrt{\log d}\,\|\epsilon_j\|_{\ell_\infty^m}\big\}\Big).$$

Some computations

Let us estimate these quantities. By Stechkin's inequality,
$$\sigma_K(x)_{\ell_1^d} \le \|x\|_{\ell_q^d}\,K^{1 - 1/q} \qquad \text{for all } x \in \mathbb{R}^d,$$
one obtains, for $x_j = g'(a \cdot x^j)\,a$,
$$K^{-1/2}\sigma_K(x_j)_{\ell_1^d} \le |g'(a \cdot x^j)|\cdot\|a\|_{\ell_q^d}\cdot K^{1/2 - 1/q} \le C_1 C_2\Big[\frac{m_\Phi}{\log(d/m_\Phi)}\Big]^{1/2 - 1/q}.$$
Moreover,
$$\|\epsilon_j\|_{\ell_\infty^{m_\Phi}} = \frac{\epsilon}{2}\max_{i=1,\dots,m_\Phi}\big|(\varphi^i)^T\nabla^2 f(\zeta_{ij})\varphi^i\big| \le \frac{\epsilon}{2 m_\Phi}\max_{i=1,\dots,m_\Phi}\Big|\sum_{k,l=1}^d a_k a_l\, g''(a \cdot \zeta_{ij})\Big| \le \frac{\epsilon\|g''\|_\infty}{2 m_\Phi}\Big(\sum_{k=1}^d |a_k|\Big)^2 \le \frac{\epsilon\|g''\|_\infty}{2 m_\Phi}\Big(\sum_{k=1}^d |a_k|^q\Big)^{2/q} \le \frac{C_1^2 C_2}{2 m_\Phi}\,\epsilon,$$
and $\|\epsilon_j\|_{\ell_2^{m_\Phi}} \le \sqrt{m_\Phi}\,\|\epsilon_j\|_{\ell_\infty^{m_\Phi}} \le \frac{C_1^2 C_2}{2\sqrt{m_\Phi}}\,\epsilon$, leading to (recall $m_\Phi \ge \log d$)
$$\max\Big\{\|\epsilon_j\|_{\ell_2^{m_\Phi}}, \sqrt{\log d}\,\|\epsilon_j\|_{\ell_\infty^{m_\Phi}}\Big\} \le \frac{C_1^2 C_2}{2\sqrt{m_\Phi}}\,\epsilon\cdot\max\Big\{1, \sqrt{\frac{\log d}{m_\Phi}}\Big\} = \frac{C_1^2 C_2}{2\sqrt{m_\Phi}}\,\epsilon.$$

Summarizing ...

With high probability,
$$\hat x_j = g'(a \cdot x^j)\,a + n_j,$$
where
$$\|n_j\|_{\ell_2^d} \le C\Big(K^{-1/2}\sigma_K\big(g'(a \cdot x^j)\,a\big)_{\ell_1^d} + \max\big\{\|\epsilon_j\|_{\ell_2^m}, \sqrt{\log d}\,\|\epsilon_j\|_{\ell_\infty^m}\big\}\Big) \le C'\Big(\Big[\frac{m_\Phi}{\log(d/m_\Phi)}\Big]^{1/2 - 1/q} + \frac{\epsilon}{\sqrt{m_\Phi}}\Big) := \nu_1.$$

Stability of one-dimensional subspaces

Lemma. Let us fix $\hat x \in \mathbb{R}^d$, $a \in S^{d-1}$, $0 \ne \gamma \in \mathbb{R}$, and $n \in \mathbb{R}^d$ with norm $\|n\|_{\ell_2^d} \le \nu_1 < |\gamma|$. If we assume $\hat x = \gamma a + n$, then
$$\Big\|\operatorname{sign}(\gamma)\,\frac{\hat x}{\|\hat x\|_{\ell_2^d}} - a\Big\|_{\ell_2^d} \le \frac{2\nu_1}{\|\hat x\|_{\ell_2^d}}.$$

We recall that
$$\hat x_j = g'(a \cdot x^j)\,a + n_j,$$
and
$$\max_j\|\hat x_j\|_{\ell_2^d} \ge \max_j|g'(a \cdot x^j)| - \max_j\|\hat x_j - x_j\|_{\ell_2^d} \ge \underbrace{\max_j|g'(a \cdot x^j)|}_{\text{we need to estimate it}} - \nu_1.$$

Concentration inequalities I

Lemma (Hoeffding's inequality). Let $X_1, \dots, X_m$ be independent random variables. Assume that the $X_j$ are almost surely bounded, i.e., there exist finite scalars $a_j, b_j$ such that
$$\mathbb{P}\{X_j - \mathbb{E}X_j \in [a_j, b_j]\} = 1,$$
for $j = 1, \dots, m$. Then we have
$$\mathbb{P}\Big\{\Big|\sum_{j=1}^m X_j - \mathbb{E}\sum_{j=1}^m X_j\Big| \ge t\Big\} \le 2\,e^{-\frac{2 t^2}{\sum_{j=1}^m (b_j - a_j)^2}}.$$

Let us now apply Hoeffding's inequality to the random variables $X_j = |g'(a \cdot x^j)|^2$.

Probabilistic estimates from below

By applying Hoeffding's inequality to the random variables $X_j = |g'(a \cdot x^j)|^2$, we have:

Lemma. Let us fix $0 < s < 1$. Then with probability at least $1 - 2e^{-\frac{2 m_{\mathcal{X}} s^2\alpha^2}{C_2^4}}$ we have
$$\max_{j=1,\dots,m_{\mathcal{X}}}|g'(a \cdot x^j)| \ge \sqrt{\alpha(1 - s)},$$
where $\alpha := \mathbb{E}_x\big(|g'(a \cdot x)|^2\big) = \int_{S^{d-1}}|g'(a \cdot x)|^2\, d\mu_{S^{d-1}}(x) = \int_{S^{d-1}}\|\nabla f(x)\|_{\ell_2^d}^2\, d\mu_{S^{d-1}}(x) > 0$.

Concentration of measure phenomenon and risk of intractability

A key role is played by
$$\alpha = \int_{S^{d-1}}|g'(a \cdot x)|^2\, d\mu_{S^{d-1}}(x).$$
Due to symmetry it is independent of $a$.

With the push-forward measure $\mu_1$ on $[-1, 1]$,
$$\alpha = \int_{-1}^1 |g'(y)|^2\, d\mu_1(y) = \frac{\Gamma(d/2)}{\pi^{1/2}\,\Gamma((d-1)/2)}\int_{-1}^1 |g'(y)|^2\,(1 - y^2)^{\frac{d-3}{2}}\, dy.$$
The measure $\mu_1$ concentrates around zero exponentially fast as $d \to \infty$.
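A quick numerical illustration (my own, with hypothetical profiles $g$) of this concentration effect: $\alpha(d)$ stays bounded away from zero when $g'(0) \ne 0$, but decays like $1/d$ when $g'$ vanishes at the origin, anticipating the proposition on the next slide.

```python
# Numerical illustration (not from the talk) of alpha(d) for two hypothetical
# profiles g, using the push-forward density of mu_1 on [-1, 1].
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def alpha(d, g_prime):
    """alpha(d) = c_d * int_{-1}^{1} |g'(y)|^2 (1 - y^2)^{(d-3)/2} dy."""
    log_cd = gammaln(d / 2) - 0.5 * np.log(np.pi) - gammaln((d - 1) / 2)
    integrand = lambda y: g_prime(y) ** 2 * (1 - y ** 2) ** ((d - 3) / 2)
    val, _ = quad(integrand, -1.0, 1.0, points=[0.0])
    return np.exp(log_cd) * val

for d in (10, 100, 1000):
    a_sin = alpha(d, np.cos)                 # g(y) = sin(y): g'(0) = 1, alpha stays ~1
    a_sq = alpha(d, lambda y: 2 * y)         # g(y) = y^2: g'(0) = 0, alpha ~ 4/d
    print(f"d={d:5d}  alpha_sin={a_sin:.4e}  alpha_quad={a_sq:.4e}")
```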

Dependence on the dimension d

Proposition. Let us fix $M \in \mathbb{N}$ and assume that $g : [-1,1] \to \mathbb{R}$ is $C^{M+2}$-differentiable in an open neighbourhood $U$ of $0$ and $\frac{d^\ell}{dx^\ell}g(0) = 0$ for $\ell = 1, \dots, M$. Then
$$\alpha(d) = \mathcal{O}(d^{-M}), \quad \text{for } d \to \infty.$$

Tractability classes

(1) For $0 < q \le 1$, $C_1 \ge 1$ and $C_2 \ge \alpha_0 > 0$, we define
$$\mathcal{F}_d^1 := \mathcal{F}_d^1(\alpha_0, q, C_1, C_2) := \big\{f : B_{\mathbb{R}^d} \to \mathbb{R} : \exists a \in \mathbb{R}^d,\ \|a\|_{\ell_2^d} = 1,\ \|a\|_{\ell_q^d} \le C_1, \text{ and } \exists g \in C^2(B_{\mathbb{R}}),\ |g'(0)| \ge \alpha_0 > 0 : f(x) = g(a \cdot x)\big\}.$$

(2) For a neighborhood $U$ of $0$, $0 < q \le 1$, $C_1 \ge 1$, $C_2 \ge \alpha_0 > 0$ and $N \ge 2$, we define
$$\mathcal{F}_d^2 := \mathcal{F}_d^2(U, \alpha_0, q, C_1, C_2, N) := \big\{f : B_{\mathbb{R}^d} \to \mathbb{R} : \exists a \in \mathbb{R}^d,\ \|a\|_{\ell_2^d} = 1,\ \|a\|_{\ell_q^d} \le C_1, \text{ and } \exists g \in C^2(B_{\mathbb{R}}) \cap C^N(U),\ \exists\, 0 \le M \le N - 1,\ |g^{(M)}(0)| \ge \alpha_0 > 0 : f(x) = g(a \cdot x)\big\}.$$

(3) For a neighborhood $U$ of $0$, $0 < q \le 1$, $C_1 \ge 1$ and $C_2 \ge \alpha_0 > 0$, we define
$$\mathcal{F}_d^3 := \mathcal{F}_d^3(U, \alpha_0, q, C_1, C_2) := \big\{f : B_{\mathbb{R}^d} \to \mathbb{R} : \exists a \in \mathbb{R}^d,\ \|a\|_{\ell_2^d} = 1,\ \|a\|_{\ell_q^d} \le C_1, \text{ and } \exists g \in C^2(B_{\mathbb{R}}) \cap C^\infty(U),\ |g^{(M)}(0)| = 0 \text{ for all } M \in \mathbb{N} : f(x) = g(a \cdot x)\big\}.$$

Tractability result

Corollary. The problem of learning functions $f$ in the classes $\mathcal{F}_d^1$ and $\mathcal{F}_d^2$ from point evaluations is strongly polynomially tractable (no polynomial dependence on $d$) and polynomially tractable (with polynomial dependence on $d$), respectively.

Intractability

On the one hand, let us notice that if in the class $\mathcal{F}_d^3$ we remove the condition $\|a\|_{\ell_q^d} \le C_1$, then the problem actually becomes intractable. Let $g \in C^2([-1 - \epsilon, 1 + \epsilon])$ be given by $g(y) = 8(y - 1/2)^3$ for $y \in [1/2, 1 + \epsilon]$ and zero otherwise. Notice that, for every $a \in \mathbb{R}^d$ with $\|a\|_{\ell_2^d} = 1$, the function $f(x) = g(a \cdot x)$ vanishes everywhere on $S^{d-1}$ outside of the cap $U(a, 1/2) := \{x \in S^{d-1} : a \cdot x \ge 1/2\}$.

Figure: The function g and the spherical cap U(a, 1/2).

Intractability

The $\mu_{S^{d-1}}$ measure of $U(a, 1/2)$ obviously does not depend on $a$ and is known to be exponentially small in $d$. Furthermore, it is known that there exist a constant $c > 0$ and unit vectors $a_1, \dots, a_K$ such that the sets $U(a_1, 1/2), \dots, U(a_K, 1/2)$ are mutually disjoint and $K \ge e^{cd}$. Finally, we observe that $\max_{x \in S^{d-1}}|f(x)| = f(a) = g(1) = 1$.

We conclude that any algorithm making use only of the structure $f(x) = g(a \cdot x)$ and of the conditions above needs exponentially many sampling points in order to distinguish between $f(x) \equiv 0$ and $f(x) = g(a_i \cdot x)$ for some of the $a_i$'s constructed above.

Truly k-ridge functions for k > 1

$$f(x) = g(Ax), \qquad A \text{ a } k \times d \text{ matrix}.$$

- Rows of $A$ are compressible: $\max_i\|a_i\|_q \le C_1$;
- $AA^T$ is the identity operator on $\mathbb{R}^k$;
- The regularity condition: $\sup_{|\alpha| \le 2}\|D^\alpha g\|_\infty \le C_2$;
- The matrix $H^f := \int_{S^{d-1}}\nabla f(x)\nabla f(x)^T\, d\mu_{S^{d-1}}(x)$ is a positive semi-definite matrix of rank $k$;
- We assume that the singular values of $H^f$ satisfy
  $$\sigma_1(H^f) \ge \dots \ge \sigma_k(H^f) \ge \alpha > 0.$$

Algorithm 2:

- Given $m_\Phi$, $m_{\mathcal{X}}$, draw at random the sets $\Phi$ and $\mathcal{X}$, and construct $Y$ according to $(*)$.
- Set $\hat x_j = \Delta(y_j) := \arg\min_{y_j = \Phi z}\|z\|_{\ell_1^d}$, for $j = 1, \dots, m_{\mathcal{X}}$; $\hat X = (\hat x_1|\dots|\hat x_{m_{\mathcal{X}}})$ is again a $d \times m_{\mathcal{X}}$ matrix.
- Compute the singular value decomposition
  $$\hat X^T = \big(\hat U_1\ \hat U_2\big)\begin{pmatrix}\hat\Sigma_1 & 0\\ 0 & \hat\Sigma_2\end{pmatrix}\begin{pmatrix}\hat V_1^T\\ \hat V_2^T\end{pmatrix},$$
  where $\hat\Sigma_1$ contains the $k$ largest singular values.
- Set $\hat A = \hat V_1^T$.
- Define $\hat g(y) := f(\hat A^T y)$ and $\hat f(x) := \hat g(\hat A x)$.
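A minimal sketch of the subspace-recovery step of Algorithm 2 (my own illustration; the $\ell_1$ decoding is replaced by synthetic noisy gradient estimates so that only the SVD step is exercised).

```python
# A minimal sketch (not the authors' code) of the SVD step in Algorithm 2:
# given noisy gradient estimates hat_X ~ A^T grad g(A x^j), recover the span
# of the rows of A from the top-k right singular vectors of hat_X^T.
import numpy as np

rng = np.random.default_rng(0)
d, k, mX = 200, 3, 50

# Ground-truth row-orthogonal k x d matrix A (A A^T = Id_k)
A = np.linalg.qr(rng.standard_normal((d, k)))[0].T

# Synthetic gradient estimates: columns A^T grad g(Ax^j) plus noise,
# standing in for the l1-minimization outputs hat_x_j of Algorithm 2
G = rng.standard_normal((k, mX))               # plays the role of grad g(Ax^j)
hat_X = A.T @ G + 1e-3 * rng.standard_normal((d, mX))

# SVD of hat_X^T; the top-k right singular vectors span approx. the row space of A
_, _, Vt = np.linalg.svd(hat_X.T, full_matrices=False)
hat_A = Vt[:k, :]                              # hat_A = hat_V_1^T

# Compare the recovered row space with the true one via the projections
P, hat_P = A.T @ A, hat_A.T @ hat_A
print("subspace error ||P - hat_P||_F =", np.linalg.norm(P - hat_P))
```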

The control of the error

The quality of the final approximation of $f$ by means of $\hat f$ depends on two kinds of accuracies:

1. The error between $X$ and $\hat X$, which can be controlled through the number of compressed sensing measurements $m_\Phi$;
2. The stability of the span of $V^T$, simply characterized by how well the singular values of $X$, or equivalently of $G$, are separated from $0$, which is related to the number of random samples $m_{\mathcal{X}}$.

To be precise, we have:

Recovery result

Theorem (F., Schnass, and Vybíral). Let $\log d \le m_\Phi \le [\log 6]^2 d$. Then there is a constant $c_1'$ such that, using $m_{\mathcal{X}} \cdot (m_\Phi + 1)$ function evaluations of $f$, Algorithm 2 defines a function $\hat f : B_{\mathbb{R}^d}(1 + \epsilon) \to \mathbb{R}$ that, with probability at least
$$1 - \Big(e^{-c_1' m_\Phi} + e^{-\sqrt{m_\Phi d}} + k\,e^{-\frac{m_{\mathcal{X}}\alpha s^2}{2 k C_2^2}}\Big),$$
will satisfy
$$\|f - \hat f\|_\infty \le 2 C_2\sqrt{k}\,(1 + \epsilon)\,\frac{\nu_2}{\sqrt{\alpha(1 - s)} - \nu_2},$$
where
$$\nu_2 = C\Big(k^{1/q}\Big[\frac{m_\Phi}{\log(d/m_\Phi)}\Big]^{1/2 - 1/q} + \frac{\epsilon\, k^2}{\sqrt{m_\Phi}}\Big),$$
and $C$ depends only on $C_1$ and $C_2$.

Ingredients of the proof

- compressed sensing;
- stability of the SVD;
- concentration inequalities (Chernoff bounds for sums of positive semi-definite matrices).

Compressed sensing

Corollary (after Wojtaszczyk, 2011). Let $\log d \le m_\Phi < [\log 6]^2 d$. Then with probability at least
$$1 - \big(e^{-c_1' m_\Phi} + e^{-\sqrt{m_\Phi d}}\big)$$
the matrix $\hat X$ as computed in Algorithm 2 satisfies
$$\|X - \hat X\|_F \le C\sqrt{m_{\mathcal{X}}}\Big(k^{1/q}\Big[\frac{m_\Phi}{\log(d/m_\Phi)}\Big]^{1/2 - 1/q} + \frac{\epsilon\, k^2}{\sqrt{m_\Phi}}\Big),$$
where $C$ depends only on $C_1$ and $C_2$.

Stability of SVD

Given two matrices $B$ and $\hat B$ with corresponding singular value decompositions
$$B = \big(U_1\ U_2\big)\begin{pmatrix}\Sigma_1 & 0\\ 0 & \Sigma_2\end{pmatrix}\begin{pmatrix}V_1^T\\ V_2^T\end{pmatrix} \quad\text{and}\quad \hat B = \big(\hat U_1\ \hat U_2\big)\begin{pmatrix}\hat\Sigma_1 & 0\\ 0 & \hat\Sigma_2\end{pmatrix}\begin{pmatrix}\hat V_1^T\\ \hat V_2^T\end{pmatrix},$$
we have:

Wedin's bound

Theorem (Stability of subspaces). If there is an $\alpha > 0$ such that
$$\min_{\ell,\hat\ell}\big|\sigma_{\hat\ell}(\hat\Sigma_1) - \sigma_\ell(\Sigma_2)\big| \ge \alpha \quad\text{and}\quad \min_{\hat\ell}\big|\sigma_{\hat\ell}(\hat\Sigma_1)\big| \ge \alpha,$$
then
$$\|V_1 V_1^T - \hat V_1\hat V_1^T\|_F \le \frac{2}{\alpha}\,\|B - \hat B\|_F.$$

Wedin's bound

Applied to our situation, where $X$ has rank $k$ and thus $\Sigma_2 = 0$, we get
$$\|V_1 V_1^T - \hat V_1\hat V_1^T\|_F \le \frac{2\sqrt{m_{\mathcal{X}}}\,\nu_2}{\sigma_k(\hat X^T)},$$
and further, since $\sigma_k(\hat X^T) \ge \sigma_k(X^T) - \|X - \hat X\|_F$, that
$$\|V_1 V_1^T - \hat V_1\hat V_1^T\|_F \le \frac{2\sqrt{m_{\mathcal{X}}}\,\nu_2}{\sigma_k(X^T) - \sqrt{m_{\mathcal{X}}}\,\nu_2}.$$
Note that
$$X^T = G A = U_G\,\Sigma_G\,[V_G^T A],$$
for $G = \big(\nabla g(Ax^1)|\dots|\nabla g(Ax^{m_{\mathcal{X}}})\big)^T$; hence $\Sigma_{X^T} = \Sigma_G$. Moreover $\sigma_i(G) = \sqrt{\sigma_i(G^T G)}$, for all $i = 1, \dots, k$.

Concentration inequalities II

Theorem (Matrix Chernoff bounds). Consider $X_1, \dots, X_m$ independent random, positive-semidefinite matrices of dimension $k \times k$, and suppose $\sigma_1(X_j) \le C$ almost surely. Compute the singular values of the sum of the expectations, $\mu_{\max} = \sigma_1\big(\sum_{j=1}^m\mathbb{E}X_j\big)$ and $\mu_{\min} = \sigma_k\big(\sum_{j=1}^m\mathbb{E}X_j\big)$. Then
$$\mathbb{P}\Big\{\sigma_1\Big(\sum_{j=1}^m X_j\Big) - \mu_{\max} \ge s\,\mu_{\max}\Big\} \le k\,\Big(\frac{1+s}{e}\Big)^{-\frac{\mu_{\max}(1+s)}{C}}, \qquad \text{for all } s \ge e - 1,$$
and
$$\mathbb{P}\Big\{\sigma_k\Big(\sum_{j=1}^m X_j\Big) - \mu_{\min} \le -s\,\mu_{\min}\Big\} \le k\,e^{-\frac{\mu_{\min} s^2}{2C}}, \qquad \text{for all } s \in (0, 1).$$

Note that
$$G^T G = \sum_{j=1}^{m_{\mathcal{X}}}\nabla g(Ax^j)\nabla g(Ax^j)^T,$$
and by applying the previous result to $X_j = \nabla g(Ax^j)\nabla g(Ax^j)^T$ we have:

Lemma. For any $s \in (0, 1)$ we have
$$\sigma_k(X^T) \ge \sqrt{m_{\mathcal{X}}\,\alpha(1 - s)}$$
with probability at least $1 - k\,e^{-\frac{m_{\mathcal{X}}\alpha s^2}{2 k C_2^2}}$.

Proof of Theorem

With probability at least
$$1 - \Big(e^{-c_1' m_\Phi} + e^{-\sqrt{m_\Phi d}} + k\,e^{-\frac{m_{\mathcal{X}}\alpha s^2}{2 k C_2^2}}\Big),$$
we have
$$\|V_1 V_1^T - \hat V_1\hat V_1^T\|_F \le \frac{2\nu_2}{\sqrt{\alpha(1 - s)} - \nu_2},$$
and, for $\hat A = \hat V_1^T$ and $V_G^T A = V_1^T$,
$$\|A^T A - \hat A^T\hat A\|_F = \|A^T V_G V_G^T A - \hat V_1\hat V_1^T\|_F \le \frac{2\nu_2}{\sqrt{\alpha(1 - s)} - \nu_2}.$$

Proof of Theorem ... continued

Since $A$ is row-orthogonal we have $A = AA^T A$ and
$$|f(x) - \hat f(x)| = |g(Ax) - \hat g(\hat A x)| = |g(Ax) - g(A\hat A^T\hat A x)| \le C_2\sqrt{k}\,\|Ax - A\hat A^T\hat A x\|_{\ell_2^k} = C_2\sqrt{k}\,\|A(A^T A - \hat A^T\hat A)x\|_{\ell_2^k} \le C_2\sqrt{k}\,\|A^T A - \hat A^T\hat A\|_F\,\|x\|_{\ell_2^d} \le 2 C_2\sqrt{k}\,(1 + \epsilon)\,\frac{\nu_2}{\sqrt{\alpha(1 - s)} - \nu_2},$$
where we used
$$\|A^T A - \hat A^T\hat A\|_F = \|A^T V_G V_G^T A - \hat V_1\hat V_1^T\|_F \le \frac{2\nu_2}{\sqrt{\alpha(1 - s)} - \nu_2}.$$

k-ridge functions may be too simple!

Figure: Functions on data clustered around a manifold with multiple directions can be locally approximated by sums of k-ridge functions

Sums of ridge functions

Can we still learn functions of the type
$$f(x) = \sum_{i=1}^m g_i(a_i \cdot x), \qquad x \in [-1, 1]^d?$$
Our approach (Daubechies, F., Vybíral) is essentially based on the formula
$$D_{c_1}^{\alpha_1}\dots D_{c_k}^{\alpha_k} f(x) = \sum_{i=1}^m g_i^{(\alpha_1 + \dots + \alpha_k)}(a_i \cdot x)\,(a_i \cdot c_1)^{\alpha_1}\dots(a_i \cdot c_k)^{\alpha_k},$$
where $k \in \mathbb{N}$, $c_i \in \mathbb{R}^d$, $\alpha_i \in \mathbb{N}$ for all $i = 1, \dots, k$, and $D_{c_i}^{\alpha_i}$ is the $\alpha_i$-th derivative in the direction $c_i$.

The recovery strategy: nearly orthonormal systems

We assume that the vectors $a_1, \dots, a_m \in \mathbb{R}^m$ are nearly orthonormal, meaning that
$$S(a_1, \dots, a_m) = \inf\Big\{\Big(\sum_{i=1}^m \|a_i - w_i\|_2^2\Big)^{1/2} : w_1, \dots, w_m \text{ an orthonormal basis of } \mathbb{R}^m\Big\}$$
is small!

Furthermore, we denote by
$$\mathcal{L} = \operatorname{span}\{a_i \otimes a_i,\ i = 1, \dots, m\} \subset \mathbb{R}^{m \times m}$$
the subspace of symmetric matrices generated by the tensor products $a_i \otimes a_i = a_i a_i^T$.

We first recover an approximation of $\mathcal{L}$, i.e., instead of $\mathcal{L}$ we then have at our disposal a subspace $\hat{\mathcal{L}}$ of symmetric matrices which is (in some sense) close to $\mathcal{L}$. Finally, we propose the program
$$\arg\max\|M\|_\infty, \quad \text{s.t. } M \in \hat{\mathcal{L}},\ \|M\|_F \le 1,$$
to recover the $a_i$'s, or good approximations $\hat a_i$ (which is of course possible only up to sign).

Nonlinear programming to recover the $a_i \otimes a_i$'s

Figure: The $a_i \otimes a_i$ are the extremal points of the matrix operator norm!

On the ambiguity of learning for nonorthogonal profiles

Let $a_1 = (1, 0)^T$, $a_2 = (\sqrt{2}/2, \sqrt{2}/2)^T$ and $b = (a_1 + a_2)/\|a_1 + a_2\|_2$. We assume that $\mathcal{L} = \operatorname{span}\{a_1 a_1^T, a_2 a_2^T\}$ and that
$$\hat{\mathcal{L}} = \operatorname{span}\Big\{\begin{pmatrix}1 & \epsilon\\ \epsilon & -\epsilon\end{pmatrix}, \begin{pmatrix}0.5 & 0.5 + \epsilon\\ 0.5 + \epsilon & 0.5 - \epsilon\end{pmatrix}\Big\}.$$
When choosing $\epsilon = 0.05$, we find that
$$\big\{\operatorname{dist}(a_1 a_1^T, \hat{\mathcal{L}}),\ \operatorname{dist}(a_2 a_2^T, \hat{\mathcal{L}}),\ \operatorname{dist}(b b^T, \hat{\mathcal{L}})\big\} \subset [0.07, 0.08].$$
Hence, looking at $\hat{\mathcal{L}}$ alone, every algorithm will have difficulties deciding which two of the three rank-1 matrices above are the generators of the true $\mathcal{L}$. Nevertheless, $\|b - a_1\|_2 = \|b - a_2\|_2 > 0.39$. We see that although the level of noise was rather mild, we have difficulties distinguishing between well-separated vectors.

The approximation to $\mathcal{L}$

Define
$$\hat{\mathcal{L}} = \operatorname{span}\{\Delta f(x_j),\ j = 1, \dots, m_{\mathcal{X}}\},$$
where
$$\big(\Delta f(x)\big)_{j,k} = \frac{f(x + \epsilon(e_j + e_k)) - f(x + \epsilon e_j) - f(x + \epsilon e_k) + f(x)}{\epsilon^2},$$
for $j, k = 1, \dots, m$, is an approximation to the Hessian of $f$ at $x$. For $x$ drawn at random, and by applying the matrix Chernoff bounds in a suitable way, one derives a probabilistic error estimate of the form
$$\|P_{\mathcal{L}} - P_{\hat{\mathcal{L}}}\|_{F \to F} \le C\,m^{3/2}\epsilon,$$
with high probability.
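A small sketch (my own illustration) of the second-order differences $\Delta f(x)$ for a hypothetical sum of two ridge functions; the exact Hessian of such an $f$ is $\sum_i g_i''(a_i \cdot x)\,a_i a_i^T$, so $\Delta f(x)$ lies, up to $\mathcal{O}(\epsilon)$, in $\mathcal{L}$.

```python
# Sketch (own illustration) of the second-order differences Delta f(x) whose
# span approximates L = span{a_i a_i^T}: (Delta f(x))_{jk} is a finite-
# difference approximation of the Hessian entry d^2 f / dx_j dx_k.
import numpy as np

def hessian_fd(f, x, eps=1e-3):
    m = len(x)
    E = np.eye(m)
    H = np.empty((m, m))
    for j in range(m):
        for k in range(m):
            H[j, k] = (f(x + eps * (E[j] + E[k])) - f(x + eps * E[j])
                       - f(x + eps * E[k]) + f(x)) / eps ** 2
    return H

# For a sum of ridge functions f(x) = sum_i g_i(a_i . x), the exact Hessian is
# sum_i g_i''(a_i . x) a_i a_i^T, hence it belongs to L = span{a_i a_i^T}.
a1, a2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 1.0]) / np.sqrt(2)
f = lambda x: np.sin(a1 @ x) + np.cos(a2 @ x)
H = hessian_fd(f, np.array([0.1, 0.2, -0.3]))
print(np.round(H, 4))
```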

A nonlinear operator towards a gradient ascent

Let us first introduce, for a given parameter $\gamma > 1$, an operator acting on the singular values of a matrix $X = U\Sigma V^T$ as follows:
$$\Pi_\gamma(X) = U\,\frac{\operatorname{diag}(\gamma, 1, \dots, 1)\times\Sigma}{\|\operatorname{diag}(\gamma, 1, \dots, 1)\times\Sigma\|_F}\,V^T,$$
where
$$\operatorname{diag}(\gamma, 1, \dots, 1)\times\Sigma = \begin{pmatrix}\gamma\sigma_1 & 0 & \dots & 0\\ 0 & \sigma_2 & 0 & \dots\\ \dots & \dots & \dots & \dots\\ 0 & \dots & 0 & \sigma_m\end{pmatrix}.$$
Notice that $\Pi_\gamma$ maps any matrix $X$ onto a matrix of unit Frobenius norm, simply exalting the first singular value and damping the others. It is not a linear operator.

The nonlinear programming

We propose a projected gradient method for solving
$$\arg\max\|M\|_\infty, \quad \text{s.t. } M \in \hat{\mathcal{L}},\ \|M\|_F \le 1.$$

Algorithm 3:

- Fix a suitable parameter $\gamma > 1$;
- Assume we have identified a basis of positive semi-definite matrices for $\hat{\mathcal{L}}$; for instance, one can use the second-order finite differences $\Delta f(x_j)$, $j = 1, \dots, m_{\mathcal{X}}$, to form such a basis;
- Generate an initial guess $X^0 = \sum_{j=1}^{m_{\mathcal{X}}}\zeta_j\Delta f(x_j)$ by choosing $\zeta_j \ge 0$ at random, so that $X^0 \in \hat{\mathcal{L}}$ and $\|X^0\|_F = 1$;
- For $\ell \ge 0$:
  $$X^{\ell+1} := P_{\hat{\mathcal{L}}}\,\Pi_\gamma(X^\ell).$$
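A compact sketch (my own code, toy dimensions) of the map $\Pi_\gamma$ and the projected iteration of Algorithm 3, run here on an exactly known subspace $\mathcal{L}$ spanned by orthonormal rank-one matrices.

```python
# Sketch (own code) of Algorithm 3: the rank-one-promoting map Pi_gamma and
# the projected iteration X^{l+1} = P_L Pi_gamma(X^l) on a toy subspace
# L spanned by rank-one matrices a_i a_i^T with orthonormal a_i.
import numpy as np

def pi_gamma(X, gamma):
    U, s, Vt = np.linalg.svd(X)
    s = s.copy(); s[0] *= gamma                 # exalt the largest singular value
    s /= np.linalg.norm(s)                      # renormalize to unit Frobenius norm
    return (U * s) @ Vt

def project(X, basis):
    """Orthogonal projection onto span(basis) (orthonormal in the Frobenius sense)."""
    return sum(np.sum(B * X) * B for B in basis)

rng = np.random.default_rng(5)
m = 4
A = np.linalg.qr(rng.standard_normal((m, m)))[0]       # orthonormal a_1,...,a_m (columns)
basis = [np.outer(A[:, i], A[:, i]) for i in range(m)] # a_i a_i^T: orthonormal in L

X = project(rng.standard_normal((m, m)), basis)
X /= np.linalg.norm(X)
for _ in range(50):
    X = project(pi_gamma(X, gamma=2.0), basis)
print(np.round(X, 3))   # converges (up to sign) to one of the a_i a_i^T
```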

Analysis of the algorithm for $\hat{\mathcal{L}} = \mathcal{L}$

Proposition (Daubechies, F., Vybíral). Assume that $\hat{\mathcal{L}} = \mathcal{L}$ and that $a_1, \dots, a_m$ are orthonormal. Let $\gamma > \sqrt{2}$ and let $\|X^0\|_\infty \ge 1/\sqrt{\gamma^2 - 1}$. Then there exists $\mu_0 < 1$ such that
$$\big|1 - \|X^{\ell+1}\|_\infty\big| \le \mu_0\,\big|1 - \|X^\ell\|_\infty\big|, \qquad \text{for all } \ell \ge 0.$$
Since the sequence $(X^\ell)_\ell$ consists of matrices with Frobenius norm bounded by $1$, we conclude that any accumulation point has both unit Frobenius and unit spectral norm, and therefore it has to coincide with one maximizer.

The proof is based on the following observation:
$$\|X^{\ell+1}\|_\infty = \sigma_1(X^{\ell+1}) = \frac{\gamma\sigma_1(X^\ell)}{\sqrt{\gamma^2\sigma_1(X^\ell)^2 + \sigma_2(X^\ell)^2 + \dots + \sigma_m(X^\ell)^2}} \ge \frac{\gamma\|X^\ell\|_\infty}{\sqrt{(\gamma^2 - 1)\|X^\ell\|_\infty^2 + 1}}.$$

Analysis of the algorithm for $\hat{\mathcal{L}} \approx \mathcal{L}$

Theorem (Daubechies, F., Vybíral). Assume that $\|P_{\mathcal{L}} - P_{\hat{\mathcal{L}}}\|_{F\to F} < \epsilon < 1$ and that $a_1, \dots, a_m$ are orthonormal. Let $\|X^0\|_\infty \ge \max\big\{\frac{1}{\sqrt{\gamma^2 - 1}}, \frac{1}{\sqrt{2}} + \epsilon + \xi\big\}$ and $\gamma > \sqrt{2}$. Then, for the iterations $(X^\ell)_\ell$ produced by Algorithm 3, there exists $\mu_0 < 1$ such that
$$\limsup_\ell\,\big|1 - \|X^\ell\|_\infty\big| \le \frac{\mu_1(\gamma, t_0, \epsilon) + 2\epsilon}{1 - \mu_0} + \epsilon,$$
where $\mu_1(\gamma, \xi, \epsilon) \approx \epsilon$. The sequence $(X^\ell)_\ell$ is bounded and its accumulation points $\bar X$ satisfy simultaneously the following properties:
$$\|\bar X\|_F \le 1 \quad\text{and}\quad \|\bar X\|_\infty \ge 1 - \Big(\frac{\mu_1(\gamma, t_0, \epsilon) + 2\epsilon}{1 - \mu_0} + \epsilon\Big),$$
and
$$\|P_{\mathcal{L}}\bar X\|_F \le 1 \quad\text{and}\quad \|P_{\mathcal{L}}\bar X\|_\infty \ge 1 - \frac{\mu_1(\gamma, t_0, \epsilon) + 2\epsilon}{1 - \mu_0}.$$

A graphical explanation of the algorithm

Figure: Objective function $\|\cdot\|_\infty$ to be maximized and iterations of Algorithm 3 converging to one of the extremal points $a_i \otimes a_i$

Nonlinear programming

Theorem (Daubechies, F., Vybíral). Let $M$ be any local maximizer of
$$\arg\max\|M\|_\infty, \quad \text{s.t. } M \in \hat{\mathcal{L}},\ \|M\|_F \le 1.$$
Then
$$u_j^T X u_j = 0 \quad\text{for all } X \in S_{\hat{\mathcal{L}}} \text{ with } X \perp M,$$
and for all $j \in \{1, \dots, m\}$ with $|\lambda_j(0)| = \|M\|_\infty$. If furthermore the $a_i$'s are nearly orthonormal, $S(a_1, \dots, a_m) \le \epsilon$, and
$$3\cdot m\cdot\|P_{\mathcal{L}} - P_{\hat{\mathcal{L}}}\| < (1 - \epsilon)^2,$$
then $\lambda_1 = \|M\|_\infty > \max\{|\lambda_2|, \dots, |\lambda_m|\}$ and
$$2\sum_{k=2}^m\frac{(u_1^T X u_k)^2}{\lambda_1 - \lambda_k} \le \lambda_1.$$

Nonlinear programming

Algorithm 4:

- Let $M$ be a local maximizer of the nonlinear program;
- Take its singular value decomposition $M = \sum_{j=1}^m \lambda_j\, u_j \otimes u_j$;
- Put $\hat a := u_1$.

Theorem (Daubechies, F., Vybíral). Let $\hat{\mathcal{L}} = \mathcal{L}$ and $S(a_1, \dots, a_m) \le \epsilon$. Then there is $j_0 \in \{1, \dots, m\}$ such that $\hat a$ found by Algorithm 4 satisfies $\|\hat a - a_{j_0}\|_2 \le C\sqrt{\epsilon}$.

The proof is based on testing the optimality conditions for $X = X_j = a_j \otimes a_j$ and showing that $\lambda_1(M) \approx 1$.

Learning sums of ridge functions

Algorithm 5:

- Let $\hat a_j$ be normalized approximations of $a_j$, $j = 1, \dots, m$;
- Let $(b_j)_{j=1}^m$ be the dual basis to $(\hat a_j)_{j=1}^m$;
- Assume that $f(0) = g_1(0) = \dots = g_m(0)$;
- Put $\hat g_j(t) := f(t\, b_j)$, $t \in (-1/\|b_j\|_2,\ 1/\|b_j\|_2)$;
- Put $\hat f(x) := \sum_{j=1}^m \hat g_j(\hat a_j \cdot x)$, $\|x\|_2 \le 1$.

Theorem (Daubechies, F., Vybíral). Let
- $S(a_1, \dots, a_m) \le \epsilon$ and $S(\hat a_1, \dots, \hat a_m) \le \epsilon'$;
- $\|a_j - \hat a_j\|_2 \le \eta$, $j = 1, \dots, m$.

Then
$$\|f - \hat f\|_\infty \le c(\epsilon, \epsilon')\, m\,\eta.$$
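A minimal sketch of Algorithm 5 (my own illustration with exactly known orthonormal directions and invented profiles), showing the dual-basis construction and the reassembly of $\hat f$:

```python
# A minimal sketch (own illustration, not from the paper) of Algorithm 5:
# given normalized approximate directions hat_a_j, build the dual basis
# (b_j) and set hat_g_j(t) = f(t b_j), hat_f(x) = sum_j hat_g_j(hat_a_j . x).
import numpy as np

def reconstruct(f, hat_A):
    """hat_A: (m, m) array whose rows are the directions hat_a_j."""
    B = np.linalg.inv(hat_A)                # columns of B: dual basis, hat_a_i . b_j = delta_ij

    def hat_f(x):
        # Under the slide's assumption f(0) = g_1(0) = ... = g_m(0) (hence all zero),
        # f(t b_j) isolates the j-th profile: f(t b_j) ~ g_j(t).
        return sum(f((hat_A[j] @ x) * B[:, j]) for j in range(hat_A.shape[0]))

    return hat_f

# Toy example: two orthonormal directions, profiles g_1 = sin, g_2 = (.)^2
a1, a2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
f = lambda x: np.sin(a1 @ x) + (a2 @ x) ** 2
hat_f = reconstruct(f, np.vstack([a1, a2]))
x = np.array([0.3, -0.2])
print(f(x), hat_f(x))                       # the values agree for exact directions
```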

Our literature

- I. Daubechies, M. Fornasier, and J. Vybíral, Approximation of sums of ridge functions, in preparation.
- M. Fornasier, K. Schnass, and J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, Foundations of Computational Mathematics, Vol. 12, No. 2, 2012, pp. 229-262.
- K. Schnass and J. Vybíral, Compressed learning of high-dimensional sparse functions, ICASSP 2011.
- A. Kolleck and J. Vybíral, On some aspects of approximation of ridge functions, J. Approx. Theory 194 (2015), 35-61.
- S. Mayer, T. Ullrich, and J. Vybíral, Entropy and sampling numbers of classes of ridge functions, to appear in Constructive Approximation.