Learning sums of ridge functions in high dimension: a nonlinear compressed sensing model
Massimo Fornasier
Fakultät für Mathematik, Technische Universität München
[email protected], http://www-m15.ma.tum.de/
Winter School on Compressed Sensing, Technical University of Berlin
December 3-5, 2015
Collection of joint results with Ingrid Daubechies, Karin Schnass, and Jan Vybíral
Introduction to ridge functions

- A ridge function - in its simplest form - is a function f : R^d → R of the type

  f(x) = g(aᵀx) = g(a · x),

  where g : R → R is a scalar univariate function and a ∈ R^d is the direction of the ridge function;
- Ridge functions are constant along the hyperplanes a · x = λ for any given level λ ∈ R and are among the simplest forms of multivariate functions;
- They have been extensively studied in the past couple of decades as approximation building blocks for more complicated high-dimensional functions.
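The defining property is easy to check numerically. A minimal sketch (the profile g = tanh and the direction a are arbitrary choices): a ridge function does not change when x moves inside a hyperplane a · x = const.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
a = rng.standard_normal(d)
a /= np.linalg.norm(a)            # ridge direction
g = np.tanh                       # an arbitrary univariate profile

def f(x):
    """Ridge function f(x) = g(a . x)."""
    return g(a @ x)

x = rng.standard_normal(d)
v = rng.standard_normal(d)
v -= (a @ v) * a                  # project v into the hyperplane a . x = const
# f is constant along that hyperplane:
assert abs(f(x) - f(x + 3.0 * v)) < 1e-12
```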
Some origins of ridge functions

- In multivariate Fourier series the basis functions are of the form e^{in·x} for n ∈ Z^d, and in the Radon transform one encounters e^{ia·x} for arbitrary directions a ∈ R^d;
- The term "ridge function" was actually coined by Logan and Shepp in 1975 in their work on computerized tomography, where they show how ridge functions solve the corresponding L₂-minimum norm approximation problem.
Projection pursuit of the '80s

- Ridge function approximation has also been extensively studied during the '80s in mathematical statistics under the name of projection pursuit (Huber, 1985; Donoho-Johnstone, 1989);
- Projection pursuit algorithms approximate a function of d variables by functions of the form

  ∑_{i=1}^{m} g_i(a_i · x),  x ∈ R^d,

  for some functions g_i : R → R and some non-zero vectors a_i ∈ R^d.
Some relevant applications of the '90s

- In the early '90s there was an explosion of interest in the field of neural networks. One very popular model is the multilayer feed-forward neural network with input, hidden (internal), and output layers;
- the simplest case of such a network is described mathematically by a function of the form

  ∑_{i=1}^{m} α_i σ( ∑_{j=1}^{d} w_{ij} x_j + θ_i ),

  where σ : R → R is a given function, called the activation function, and the w_{ij} are suitable weights;
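In code, such a network is exactly a sum of m ridge functions whose directions are the rows of the weight matrix. A minimal sketch (σ, the weights, and the sizes below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 10, 4                       # input dimension, number of hidden units
W = rng.standard_normal((m, d))    # row w_i = direction of the i-th ridge function
theta = rng.standard_normal(m)     # biases
alpha = rng.standard_normal(m)     # output weights
sigma = np.tanh                    # activation function

def network(x):
    """sum_i alpha_i * sigma(w_i . x + theta_i): a sum of m ridge functions."""
    return alpha @ sigma(W @ x + theta)

x = rng.standard_normal(d)
by_ridges = sum(alpha[i] * sigma(W[i] @ x + theta[i]) for i in range(m))
assert abs(network(x) - by_ridges) < 1e-12
```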
Ridge functions and approximation theory

- In the early '90s the question of whether one can use sums of ridge functions to approximate arbitrary functions well was at the center of attention of the approximation theory community (overviews by Li, 2002, and Pinkus, 1997);
- The efficiency of such an approximation compared to, e.g., spline-type approximation for smoothness classes of functions has been extensively considered (DeVore et al., 1997; Petrushev, 1999);
- The identification of a ridge function has also been thoroughly considered; in particular we mention the work of Pinkus and, for what concerns multilayer neural networks, we refer to the work by Fefferman, 1994;
- Except for the work of Candès on ridgelets, there has been less attention after 2000 on the problem of approximating functions by means of ridge functions.
Capturing ridge functions from point queries

- The above results on the identification of such functions rely on access to arbitrarily many function values, or even to derivatives;
- in certain practical situations this might be very expensive, hazardous, or impossible;
- In a paper of 2012, Cohen, Daubechies, DeVore, Kerkyacharian, and Picard address the approximation of ridge functions from a minimal number of sampling queries:

For g ∈ C^s([0, 1]), 1 < s, ‖g‖_{C^s} ≤ M₀, ‖a‖_{ℓ_q^d} ≤ M₁, 0 < q ≤ 1,

  ‖f − f̂‖_{C(Ω)} ≤ C M₀ [ L^{−s} + M₁ ( (1 + log(d/L)) / L )^{1/q − 1} ],

using 3L + 2 sampling points, deterministically and adaptively chosen.
Capturing ridge functions from point queries: a nonlinear compressed sensing model

Compressed sensing: given a suitable sensing matrix X ∈ R^{m×d}, with m ≪ d, we wish to identify a nearly sparse vector a ∈ R^d from its measurements

  y ≈ Xa,

by means of suitable algorithms (ℓ₁-minimization, greedy algorithms) aware of y and X.

The data

  y_i ≈ x_i · a = x_iᵀ a,  i = 1, . . . , m,

are linear measurements of a. If we now assume the y_i to be the values of a ridge function at the points x_i,

  y_i ≈ g(a · x_i),  i = 1, . . . , m,

for some unknown or only roughly known nonlinear function g, the problem of identifying the ridge direction can be understood as a nonlinear compressed sensing model ...
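The two models can be put side by side in a few lines. A minimal sketch (the sparse direction a, the profile g, and the sizes are arbitrary choices): the same sampling points x_i yield linear measurements Xa in classical compressed sensing, and the nonlinear values g(a · x_i) in the ridge model.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 200, 30                           # ambient dimension, samples (m << d)
a = np.zeros(d)
a[[3, 17, 42]] = [0.6, -0.8, 0.5]        # sparse ridge direction
a /= np.linalg.norm(a)
g = lambda t: np.sin(2 * t)              # unknown nonlinear profile

X = rng.standard_normal((m, d))          # sampling points as rows
y_linear = X @ a                         # classical CS: linear measurements of a
y_ridge = g(X @ a)                       # ridge model: nonlinear in a through g
assert np.allclose(y_ridge, g(y_linear))
```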
Ridge functions and functions of data clustered around manifolds

Figure: Functions on data clustered around a manifold can be locally approximated by k-ridge functions
Universal random sampling for a more general ridge model

M. Fornasier, K. Schnass, J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, FoCM, 2012

  f(x) = g(Ax),  A is a k × d matrix

Rows of A are compressible: max_i ‖a_i‖_q ≤ C₁, 0 < q ≤ 1; AAᵀ is the identity operator on R^k.

The regularity condition: sup_{|α|≤2} ‖D^α g‖_∞ ≤ C₂.

The matrix H^f := ∫_{S^{d−1}} ∇f(x)∇f(x)ᵀ dµ_{S^{d−1}}(x) is a positive semi-definite matrix of rank k.

We assume that the singular values of the matrix H^f satisfy

  σ₁(H^f) ≥ · · · ≥ σ_k(H^f) ≥ α > 0.
How can we learn k-ridge functions from point queries?
M.D. House's differential diagnosis (or simply "sensitivity analysis")

We rely on numerical approximation of ∂f/∂ϕ:

  ∇g(Ax)ᵀAϕ = ∂f/∂ϕ(x)  (∗)
             = (f(x + εϕ) − f(x))/ε − (ε/2)[ϕᵀ∇²f(ζ)ϕ],

for a small step size ε > 0 and some intermediate point ζ on the segment [x, x + εϕ].

  X = {x^j ∈ Ω : j = 1, . . . , m_X} drawn uniformly at random in Ω ⊂ R^d
  Φ = {ϕ^j ∈ R^d : j = 1, . . . , m_Φ}, where

  ϕ^j_ℓ = { +1/√m_Φ with prob. 1/2,
            −1/√m_Φ with prob. 1/2,

for every j ∈ {1, . . . , m_Φ} and every ℓ ∈ {1, . . . , d}.
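The sampling scheme above can be sketched directly (the toy ridge function f, the sizes, and the step ε are arbitrary choices): Bernoulli directions ϕ^j with entries ±1/√m_Φ, and first-order differences approximating the directional derivatives ∂f/∂ϕ(x).

```python
import numpy as np

rng = np.random.default_rng(3)
d, m_Phi, m_X, eps = 20, 15, 10, 1e-5

a = np.zeros(d)
a[0], a[1] = 0.8, 0.6                    # toy ridge direction (arbitrary)

def f(x):
    return np.tanh(a @ x)                # toy ridge function

# Bernoulli matrix: entries +-1/sqrt(m_Phi), each with probability 1/2
Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)
# sampling points drawn at random on the sphere S^{d-1}
Xs = rng.standard_normal((m_X, d))
Xs /= np.linalg.norm(Xs, axis=1, keepdims=True)

# y_ij = (f(x^j + eps*phi^i) - f(x^j)) / eps  ~  directional derivative
Y = np.array([[(f(x + eps * phi) - f(x)) / eps for x in Xs] for phi in Phi])

# compare with the exact directional derivative g'(a.x) * (a.phi)
D = np.array([[(1 - np.tanh(a @ x) ** 2) * (a @ phi) for x in Xs] for phi in Phi])
assert np.max(np.abs(Y - D)) < 1e-4
```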
Sensitivity analysis
Figure: We perform randomized sensitivity analysis at points x drawn at random on S^{d−1}, perturbed to x + εϕ
Collecting together the differential analysis
Φ ... the m_Φ × d matrix whose rows are the ϕ^i; X ... the d × m_X matrix

  X = ( Aᵀ∇g(Ax¹) | . . . | Aᵀ∇g(Ax^{m_X}) ).

The m_X × m_Φ instances of (∗) read in matrix notation as

  ΦX = Y + E,  (∗∗)

where Y and E are m_Φ × m_X matrices defined by

  y_{ij} = (f(x^j + εϕ^i) − f(x^j))/ε,
  ε_{ij} = −(ε/2)[(ϕ^i)ᵀ∇²f(ζ_{ij})ϕ^i].
Example of active coordinates: which factors play a role?

We assume that

  A = ( e_{i₁}ᵀ ; . . . ; e_{i_k}ᵀ ),

i.e.,

  f(x) = f(x₁, . . . , x_d) = g(x_{i₁}, . . . , x_{i_k}),

where f : Ω = [0, 1]^d → R and g : [0, 1]^k → R.

We first want to identify the active coordinates i₁, . . . , i_k. Then one can apply any usual k-dimensional approximation method ...

A possible algorithm chooses the sampling points at random; due to concentration of measure effects, we get the right result with overwhelming probability.
A simple algorithm based on concentration of measure
The algorithm to identify the set I of active coordinates is based on the identity

  ΦᵀΦX = ΦᵀY + ΦᵀE,

where now X has i-th row

  X_i = ( ∂g/∂z_i(Ax¹), . . . , ∂g/∂z_i(Ax^{m_X}) ),

for i ∈ I, and all other rows equal to zero.

In expectation: ΦᵀΦ ≈ Id : R^d → R^d, so

  ΦᵀΦX ≈ X and ΦᵀE is small  ⟹  ΦᵀY ≈ X.

We select the k largest rows of ΦᵀY and estimate the probability that their indices coincide with the indices of the non-zero rows of X.
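A minimal sketch of this detection step (the toy function g, the sizes, and the step ε are arbitrary choices; the function below has active coordinates 7 and 23): form ΦᵀY from finite differences and keep the k rows of largest norm.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, m_Phi, m_X, eps = 100, 2, 200, 40, 1e-5
active = [7, 23]                                  # ground-truth active coordinates

def f(x):
    # f(x) = g(x_7, x_23), a toy choice with two active coordinates
    return np.sin(2 * np.pi * x[7]) + np.sin(2 * np.pi * x[23])

Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)
Xs = rng.random((m_X, d))                         # points uniform in [0,1]^d
Y = np.array([[(f(x + eps * phi) - f(x)) / eps for x in Xs] for phi in Phi])

G = Phi.T @ Y                                     # Phi^T Y ~ X (rows ~ partials of g)
row_norms = np.linalg.norm(G, axis=1)
detected = sorted(int(i) for i in np.argsort(row_norms)[-k:])
assert detected == active
```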
A first recovery result
Theorem (Schnass and Vybíral 2011)
Let f : R^d → R be a function of k active coordinates that is defined and twice continuously differentiable on a small neighbourhood of [0, 1]^d. For L ≤ d a positive real number, the randomized algorithm described above recovers the k unknown active coordinates of f with probability at least 1 − 6 exp(−L), using only

  O(k(L + log k)(L + log d))

samples of f.

The constants involved in the O notation depend on smoothness properties of g, namely on the ratio

  max_{j=1,...,k} ‖∂_{i_j} g‖_∞ / min_{j=1,...,k} ‖∂_{i_j} g‖₁.
Examples of active coordinate detection in dimension d = 1000

Figure: max(1 − 5√((x₃ − 1/2)² + (x₄ − 1/2)²), 0)³ and sin(6π ∑_{i=21}^{40} x_i) + ∑_{i=21}^{40} (sin(6πx_i) + 5(x_i − 1/2)²)
Learning ridge functions: k = 1

Let f(x) = g(a · x), f : B_{R^d} → R, where a ∈ R^d,

  ‖a‖₂ = 1 and ‖a‖_q ≤ C₁, 0 < q ≤ 1,  max_{0≤|α|≤2} ‖D^α g‖_∞ ≤ C₂,

  α = ∫_{S^{d−1}} ‖∇f(x)‖²_{ℓ₂^d} dµ_{S^{d−1}}(x) = ∫_{S^{d−1}} |g′(a · x)|² dµ_{S^{d−1}}(x) > 0.

We consider again the Taylor expansion (∗) with Ω = S^{d−1}.

We choose the points X = {x^j ∈ S^{d−1} : j = 1, . . . , m_X} generated at random on S^{d−1} with respect to µ_{S^{d−1}}.

The matrix Φ is generated as before and we obtain (∗∗) again in the form

  Φ[g′(a · x^j)a] = y_j + ε_j,  j = 1, . . . , m_X.
Algorithm 1:

- Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (∗).
- Set x̂_j = ∆(y_j) := arg min_{y_j = Φz} ‖z‖_{ℓ₁^d}.
- Find j₀ = arg max_{j=1,...,m_X} ‖x̂_j‖_{ℓ₂^d}.
- Set â = x̂_{j₀}/‖x̂_{j₀}‖_{ℓ₂^d}.
- Define ĝ(y) := f(ây) and f̂(x) := ĝ(â · x).
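Algorithm 1 can be sketched with an off-the-shelf LP solver standing in for the ℓ₁-decoder ∆ (the toy profile g and the 2-sparse direction a are arbitrary choices; min ‖z‖₁ s.t. Φz = y is rewritten as a linear program in the splitting z = z⁺ − z⁻).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
d, m_Phi, m_X, eps = 60, 25, 8, 1e-5

a = np.zeros(d)
a[[2, 11]] = [0.6, 0.8]                          # sparse ridge direction, |a|_2 = 1
f = lambda x: np.tanh(3 * (a @ x))               # toy ridge function

Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)
Xs = rng.standard_normal((m_X, d))
Xs /= np.linalg.norm(Xs, axis=1, keepdims=True)  # points on S^{d-1}

def l1_decoder(y):
    """Delta(y) = argmin |z|_1 s.t. Phi z = y, as an LP in (z+, z-) >= 0."""
    res = linprog(np.ones(2 * d), A_eq=np.hstack([Phi, -Phi]), b_eq=y,
                  bounds=(0, None))
    return res.x[:d] - res.x[d:]

hats = []
for x in Xs:
    y = np.array([(f(x + eps * phi) - f(x)) / eps for phi in Phi])
    hats.append(l1_decoder(y))                   # ~ g'(a . x) * a
j0 = int(np.argmax([np.linalg.norm(h) for h in hats]))
a_hat = hats[j0] / np.linalg.norm(hats[j0])      # recovered direction (up to sign)
assert min(np.linalg.norm(a_hat - a), np.linalg.norm(a_hat + a)) < 1e-2
```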
Recovery result
Theorem (F., Schnass, and Vybíral 2012)
Let 0 < s < 1 and log d ≤ m_Φ ≤ [log 6]²d. Then there is a constant c′₁ such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 1 defines a function f̂ : B_{R^d}(1 + ε) → R that, with probability

  1 − ( e^{−c′₁ m_Φ} + e^{−√(m_Φ d)} + 2e^{−2 m_X s² α² / C₂⁴} ),

will satisfy

  ‖f − f̂‖_∞ ≤ 2C₂(1 + ε) ν₁ / ( √(α(1 − s)) − ν₁ ),

where

  ν₁ = C′ ( [m_Φ / log(d/m_Φ)]^{1/2 − 1/q} + ε/√m_Φ )

and C′ depends only on C₁ and C₂.
Ingredients of the proof
- compressed sensing;
- stability of one-dimensional subspaces;
- concentration inequalities (Hoeffding's inequality).
Compressed sensing
Theorem (Wojtaszczyk, 2011)
Assume that Φ is an m × d random matrix with all entries being independent Bernoulli variables scaled by 1/√m. Let us suppose that d ≥ [log 6]²m. Then there are positive constants C, c′₁, c′₂ > 0 such that, with probability at least

  1 − e^{−c′₁ m} − e^{−√(md)},

the matrix Φ has the following property. For every x ∈ R^d, ε ∈ R^m and every natural number K ≤ c′₂ m/log(d/m) we have

  ‖∆(Φx + ε) − x‖_{ℓ₂^d} ≤ C ( K^{−1/2} σ_K(x)_{ℓ₁^d} + max{‖ε‖_{ℓ₂^m}, √(log d) ‖ε‖_{ℓ∞^m}} ),

where

  σ_K(x)_{ℓ₁^d} := inf{ ‖x − z‖_{ℓ₁^d} : # supp z ≤ K }

is the error of the best K-term approximation of x.
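For the ℓ₁ error of best K-term approximation, the infimum is attained by keeping the K largest-magnitude entries of x, so σ_K(x)_{ℓ₁} is just the ℓ₁ norm of the remaining entries (a minimal sketch):

```python
import numpy as np

def sigma_K_l1(x, K):
    """sigma_K(x)_{l1}: l1-norm of x with its K largest-magnitude entries removed."""
    mags = np.sort(np.abs(np.asarray(x, dtype=float)))   # ascending order
    return mags[:-K].sum() if K > 0 else mags.sum()

x = np.array([5.0, -0.1, 3.0, 0.2, -4.0, 0.05])
assert abs(sigma_K_l1(x, 3) - 0.35) < 1e-12   # drop 5, -4, 3; keep 0.1, 0.2, 0.05
assert sigma_K_l1(x, 6) == 0.0
```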
How does compressed sensing play a role?
For the d × m_X matrix X, i.e.,

  X = ( g′(a · x¹)a | . . . | g′(a · x^{m_X})a ),

with columns x_j := g′(a · x^j)a,

  Φ x_j = y_j + ε_j,  j = 1, . . . , m_X,

and

  x̂_j = ∆(y_j) := arg min_{y_j = Φz} ‖z‖_{ℓ₁^d},

the previous result gives - with the probability provided there -

  x̂_j = g′(a · x^j)a + n_j,

with n_j properly estimated by

  ‖n_j‖_{ℓ₂^d} ≤ C ( K^{−1/2} σ_K(g′(a · x^j)a)_{ℓ₁^d} + max{‖ε_j‖_{ℓ₂^m}, √(log d)‖ε_j‖_{ℓ∞^m}} ).
Some computations

Let us estimate these quantities. By Stechkin's inequality,

  σ_K(x)_{ℓ₁^d} ≤ ‖x‖_{ℓ_q^d} K^{1−1/q},  for all x ∈ R^d,

one obtains - for x_j = g′(a · x^j)a -

  K^{−1/2} σ_K(x_j)_{ℓ₁^d} ≤ |g′(a · x^j)| · ‖a‖_{ℓ_q^d} · K^{1/2−1/q} ≤ C₁C₂ [m_Φ / log(d/m_Φ)]^{1/2−1/q}.

Moreover,

  ‖ε_j‖_{ℓ∞^{m_Φ}} = (ε/2) · max_{i=1,...,m_Φ} |(ϕ^i)ᵀ∇²f(ζ_{ij})ϕ^i|
                   = (ε/(2m_Φ)) · max_{i=1,...,m_Φ} | ∑_{k,l=1}^{d} a_k a_l g″(a · ζ_{ij}) |
                   ≤ (ε‖g″‖_∞/(2m_Φ)) ( ∑_{k=1}^{d} |a_k| )²
                   ≤ (ε‖g″‖_∞/(2m_Φ)) ( ∑_{k=1}^{d} |a_k|^q )^{2/q}
                   ≤ (C₁²C₂/(2m_Φ)) ε,

and

  ‖ε_j‖_{ℓ₂^{m_Φ}} ≤ √m_Φ ‖ε_j‖_{ℓ∞^{m_Φ}} ≤ (C₁²C₂/(2√m_Φ)) ε,

leading to

  max{‖ε_j‖_{ℓ₂^{m_Φ}}, √(log d) ‖ε_j‖_{ℓ∞^{m_Φ}}} ≤ (C₁²C₂/(2√m_Φ)) ε · max{1, √(log d/m_Φ)} = (C₁²C₂/(2√m_Φ)) ε.
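Stechkin's inequality itself is easy to sanity-check numerically (a sketch; the random compressible vectors and the choice q = 1/2 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)

def sigma_K_l1(x, K):
    """Error of the best K-term approximation of x in l1."""
    mags = np.sort(np.abs(x))
    return mags[:-K].sum() if K > 0 else mags.sum()

d, q = 500, 0.5
for _ in range(100):
    x = rng.standard_normal(d) * rng.random(d) ** 8   # a compressible-ish vector
    lq = (np.abs(x) ** q).sum() ** (1 / q)            # |x|_{l_q} quasi-norm
    for K in (1, 5, 20):
        # Stechkin: sigma_K(x)_{l1} <= |x|_{l_q} K^{1 - 1/q}
        assert sigma_K_l1(x, K) <= lq * K ** (1 - 1 / q)
```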
Summarizing ...
With high probability,

  x̂_j = g′(a · x^j)a + n_j,

where

  ‖n_j‖_{ℓ₂^d} ≤ C ( K^{−1/2} σ_K(g′(a · x^j)a)_{ℓ₁^d} + max{‖ε_j‖_{ℓ₂^m}, √(log d)‖ε_j‖_{ℓ∞^m}} )
              ≤ C′ ( [m_Φ / log(d/m_Φ)]^{1/2−1/q} + ε/√m_Φ ) =: ν₁.
Stability of one dimensional subspaces
Lemma
Let us fix x ∈ R^d, a ∈ S^{d−1}, 0 ≠ γ ∈ R, and n ∈ R^d with norm ‖n‖_{ℓ₂^d} ≤ ν₁ < |γ|. If we assume x = γa + n, then

  ‖ sign(γ) x/‖x‖_{ℓ₂^d} − a ‖_{ℓ₂^d} ≤ 2ν₁/‖x‖_{ℓ₂^d}.

We recall that

  x̂_j = g′(a · x^j)a + n_j,

and

  max_j ‖x̂_j‖_{ℓ₂^d} ≥ max_j |g′(a · x^j)| − max_j ‖x̂_j − x_j‖_{ℓ₂^d} ≥ max_j |g′(a · x^j)| − ν₁,

where max_j |g′(a · x^j)| is the quantity we still need to estimate.
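The stability lemma can be verified numerically for random instances (a sketch; γ, a, and n are drawn arbitrarily, subject to ‖n‖ ≤ ν₁ < |γ|):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 30
for _ in range(200):
    a = rng.standard_normal(d)
    a /= np.linalg.norm(a)                            # a in S^{d-1}
    gamma = rng.uniform(0.5, 3.0) * rng.choice([-1.0, 1.0])
    nu1 = 0.9 * abs(gamma)                            # need |n|_2 <= nu1 < |gamma|
    n = rng.standard_normal(d)
    n *= rng.uniform(0.0, nu1) / np.linalg.norm(n)
    x = gamma * a + n
    lhs = np.linalg.norm(np.sign(gamma) * x / np.linalg.norm(x) - a)
    assert lhs <= 2 * nu1 / np.linalg.norm(x) + 1e-12
```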
Concentration inequalities I
Lemma (Hoeffding's inequality)
Let X₁, . . . , X_m be independent random variables. Assume that the X_j are almost surely bounded, i.e., there exist finite scalars a_j, b_j such that

  P{ X_j − EX_j ∈ [a_j, b_j] } = 1,

for j = 1, . . . , m. Then we have

  P{ | ∑_{j=1}^{m} X_j − E ∑_{j=1}^{m} X_j | ≥ t } ≤ 2 e^{−2t² / ∑_{j=1}^{m} (b_j − a_j)²}.

Let us now apply Hoeffding's inequality to the random variables X_j = |g′(a · x^j)|².
Probabilistic estimates from below

By applying Hoeffding's inequality to the random variables X_j = |g'(a·x_j)|², we have:

Lemma
Let us fix 0 < s < 1. Then with probability 1 − 2e^{−2 m_X s² α²/C₂⁴} we have

max_{j=1,…,m_X} |g'(a·x_j)| ≥ √(α(1−s)),

where α := E_x(|g'(a·x)|²) = ∫_{S^{d−1}} |g'(a·x)|² dμ_{S^{d−1}}(x) = ∫_{S^{d−1}} ‖∇f(x)‖²_{ℓ₂^d} dμ_{S^{d−1}}(x) > 0.
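A quick Monte Carlo illustration of this lemma (our own toy example: g(t) = t²/2, so g'(t) = t and α = E|a·x|² = 1/d exactly for x uniform on the sphere):

```python
import numpy as np

# For g'(t) = t the lemma asserts max_j |a.x_j| >= sqrt(alpha(1-s))
# with high probability, where alpha = E|a.x|^2 = 1/d on S^{d-1}.
rng = np.random.default_rng(1)
d, m_X, s, trials = 10, 200, 0.5, 50

a = np.zeros(d); a[0] = 1.0
alpha = 1.0 / d                        # exact value of E|a.x|^2
threshold = np.sqrt(alpha * (1 - s))

hits = 0
for _ in range(trials):
    x = rng.standard_normal((m_X, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # uniform points on the sphere
    if np.max(np.abs(x @ a)) >= threshold:
        hits += 1

print(hits, "of", trials, "trials satisfy the lower bound")
```

With m_X = 200 samples the failure probability per trial is astronomically small, in line with the exponential tail of the lemma.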
Algorithm 1:

I Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (∗).
I Set x̂_j = Δ(y_j) := arg min_{y_j = Φz} ‖z‖_{ℓ₁^d}.
I Find j₀ = arg max_{j=1,…,m_X} ‖x̂_j‖_{ℓ₂^d}.
I Set â = x̂_{j₀}/‖x̂_{j₀}‖_{ℓ₂^d}.
I Define ĝ(y) := f(â^T y) and f̂(x) := ĝ(â·x).
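The steps of Algorithm 1 can be sketched end to end. The sketch below is our own minimal instance (g = exp, a 3-sparse direction, and SciPy's `linprog` as the ℓ₁-minimization decoder are illustrative choices, not part of the slides):

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(Phi, y):
    """Solve min ||z||_1 s.t. Phi z = y as a linear program (z = u - v, u, v >= 0)."""
    m, d = Phi.shape
    res = linprog(c=np.ones(2 * d),
                  A_eq=np.hstack([Phi, -Phi]), b_eq=y,
                  bounds=[(0, None)] * (2 * d), method="highs")
    return res.x[:d] - res.x[d:]

rng = np.random.default_rng(0)
d, m_phi, m_X, eps = 40, 30, 10, 1e-4

a = np.zeros(d); a[[3, 11, 27]] = [2.0, -1.0, 0.5]
a /= np.linalg.norm(a)                          # sparse (hence compressible) ridge direction
f = lambda x: np.exp(a @ x)                     # f(x) = g(a.x) with g = exp

Phi = rng.choice([-1.0, 1.0], (m_phi, d)) / np.sqrt(m_phi)   # Bernoulli sensing directions
X = rng.standard_normal((m_X, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # sampling points on the sphere

# y_ij = (f(x_j + eps*phi_i) - f(x_j))/eps, the measurements of g'(a.x_j) * a
xhat = []
for xj in X:
    y = np.array([(f(xj + eps * phi) - f(xj)) / eps for phi in Phi])
    xhat.append(l1_min(Phi, y))                 # compressed sensing recovery
xhat = np.array(xhat)

j0 = np.argmax(np.linalg.norm(xhat, axis=1))    # point with the largest gradient
ahat = xhat[j0] / np.linalg.norm(xhat[j0])      # recovered direction (up to sign)
print(abs(ahat @ a))
```

With m_Φ = 30 Bernoulli measurements of a 3-sparse vector in d = 40, the recovered direction aligns with a up to sign and up to the finite-difference noise.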
Recovery result

Theorem (F., Schnass, and Vybíral 2012)
Let 0 < s < 1 and log d ≤ m_Φ ≤ [log 6]² d. Then there is a constant c₁' such that, using m_X·(m_Φ + 1) function evaluations of f, Algorithm 1 defines a function f̂ : B_{R^d}(1 + ε) → R that, with probability

1 − ( e^{−c₁' m_Φ} + e^{−√(m_Φ d)} + 2e^{−2 m_X s² α²/C₂⁴} ),

will satisfy

‖f − f̂‖_∞ ≤ 2C₂(1 + ε) ν₁/(√(α(1−s)) − ν₁),

where

ν₁ = C' ( [m_Φ/log(d/m_Φ)]^{1/2−1/q} + ε/√m_Φ )

and C' depends only on C₁ and C₂.
Concentration of measure phenomenon and risk of intractability

A key role is played by

α = ∫_{S^{d−1}} |g'(a·x)|² dμ_{S^{d−1}}(x).

Due to symmetry, α is independent of a. In terms of the push-forward measure μ₁ on [−1, 1],

α = ∫_{−1}^{1} |g'(y)|² dμ₁(y) = Γ(d/2)/(π^{1/2} Γ((d−1)/2)) ∫_{−1}^{1} |g'(y)|² (1 − y²)^{(d−3)/2} dy.

μ₁ concentrates around zero exponentially fast as d → ∞.
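The push-forward formula can be evaluated directly. As a sanity check (our own example), take g'(y) = y, for which α(d) = E(a·x)² = 1/d exactly, exhibiting the O(d^{−M}) decay of the proposition below with M = 1:

```python
import math
from scipy.integrate import quad

# alpha(d) = Gamma(d/2)/(sqrt(pi)*Gamma((d-1)/2)) * int_{-1}^1 |g'(y)|^2 (1-y^2)^{(d-3)/2} dy
# For g'(y) = y this equals the second moment of one coordinate on S^{d-1}, i.e. 1/d.
def alpha(d):
    const = math.gamma(d / 2) / (math.sqrt(math.pi) * math.gamma((d - 1) / 2))
    integral, _ = quad(lambda y: y**2 * (1 - y**2) ** ((d - 3) / 2), -1, 1)
    return const * integral

for d in (3, 10, 100):
    print(d, alpha(d), 1 / d)
```

The numerical values match 1/d to quadrature precision, quantifying how the concentration of μ₁ near zero kills α for large d.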
Dependence on the dimension d

Proposition
Let us fix M ∈ N and assume that g : [−1, 1] → R is C^{M+2}-differentiable in an open neighbourhood U of 0 and (d^ℓ/dx^ℓ) g(0) = 0 for ℓ = 1, …, M. Then

α(d) = O(d^{−M}), for d → ∞.
Tractability classes

(1) For 0 < q ≤ 1, C₁ ≥ 1 and C₂ ≥ α₀ > 0, we define

F¹_d := F¹_d(α₀, q, C₁, C₂) := { f : B_{R^d} → R : ∃a ∈ R^d, ‖a‖_{ℓ₂^d} = 1, ‖a‖_{ℓ_q^d} ≤ C₁, and ∃g ∈ C²(B_R), |g'(0)| ≥ α₀ > 0 : f(x) = g(a·x) }.

(2) For a neighborhood U of 0, 0 < q ≤ 1, C₁ ≥ 1, C₂ ≥ α₀ > 0 and N ≥ 2, we define

F²_d := F²_d(U, α₀, q, C₁, C₂, N) := { f : B_{R^d} → R : ∃a ∈ R^d, ‖a‖_{ℓ₂^d} = 1, ‖a‖_{ℓ_q^d} ≤ C₁, and ∃g ∈ C²(B_R) ∩ C^N(U), ∃ 0 ≤ M ≤ N − 1, |g^{(M)}(0)| ≥ α₀ > 0 : f(x) = g(a·x) }.

(3) For a neighborhood U of 0, 0 < q ≤ 1, C₁ ≥ 1 and C₂ ≥ α₀ > 0, we define

F³_d := F³_d(U, α₀, q, C₁, C₂) := { f : B_{R^d} → R : ∃a ∈ R^d, ‖a‖_{ℓ₂^d} = 1, ‖a‖_{ℓ_q^d} ≤ C₁, and ∃g ∈ C²(B_R) ∩ C^∞(U), |g^{(M)}(0)| = 0 for all M ∈ N : f(x) = g(a·x) }.
Tractability result

Corollary
The problem of learning functions f in the classes F¹_d and F²_d from point evaluations is strongly polynomially tractable (no polynomial dependence on d) and polynomially tractable (with polynomial dependence on d), respectively.
Intractability

On the other hand, let us notice that if in the class F³_d we remove the condition ‖a‖_{ℓ_q^d} ≤ C₁, then the problem actually becomes intractable. Let g ∈ C²([−1 − ε, 1 + ε]) be given by g(y) = 8(y − 1/2)³ for y ∈ [1/2, 1 + ε] and zero otherwise. Notice that, for every a ∈ R^d with ‖a‖_{ℓ₂^d} = 1, the function f(x) = g(a·x) vanishes everywhere on S^{d−1} outside of the cap U(a, 1/2) := {x ∈ S^{d−1} : a·x ≥ 1/2}.

Figure: The function g and the spherical cap U(a, 1/2).
Intractability

The μ_{S^{d−1}}-measure of U(a, 1/2) obviously does not depend on a and is known to be exponentially small in d. Furthermore, it is known that there is a constant c > 0 and unit vectors a₁, …, a_K such that the sets U(a₁, 1/2), …, U(a_K, 1/2) are mutually disjoint and K ≥ e^{cd}. Finally, we observe that max_{x∈S^{d−1}} |f(x)| = f(a) = g(1) = 1.
We conclude that any algorithm making use only of the structure f(x) = g(a·x) needs exponentially many sampling points in order to distinguish between f(x) ≡ 0 and f(x) = g(a_i·x) for some of the a_i's as constructed above.
Truly k-ridge functions for k ≫ 1

f(x) = g(Ax), where A is a k × d matrix

The rows of A are compressible: max_i ‖a_i‖_q ≤ C₁

AA^T is the identity operator on R^k

The regularity condition: sup_{|α|≤2} ‖D^α g‖_∞ ≤ C₂

The matrix H^f := ∫_{S^{d−1}} ∇f(x)∇f(x)^T dμ_{S^{d−1}}(x) is a positive semi-definite matrix of rank k

We assume that the singular values of the matrix H^f satisfy

σ₁(H^f) ≥ … ≥ σ_k(H^f) ≥ α > 0.
M.D. House's differential diagnosis (or simply called "sensitivity analysis")

We rely on the numerical approximation of ∂f/∂ϕ:

∇g(Ax)^T Aϕ = ∂f/∂ϕ(x) = [f(x + εϕ) − f(x)]/ε − (ε/2)[ϕ^T ∇²f(ζ)ϕ],  ε ≤ ε̄  (∗)

X = {x_j ∈ Ω : j = 1, …, m_X} drawn uniformly at random in Ω ⊂ R^d

Φ = {ϕ_j ∈ R^d : j = 1, …, m_Φ}, where

ϕ_{jℓ} = 1/√m_Φ with prob. 1/2,  −1/√m_Φ with prob. 1/2,

for every j ∈ {1, …, m_Φ} and every ℓ ∈ {1, …, d}.
Sensitivity analysis

Figure: We perform randomized sensitivity analysis, comparing f(x) with f(x + εϕ) at points x drawn at random on S^{d−1}.
Collecting together the differential analysis

Let Φ be the m_Φ × d matrix whose rows are the ϕ_i, and X the d × m_X matrix

X = ( A^T∇g(Ax₁) | … | A^T∇g(Ax_{m_X}) ).

The m_X · m_Φ instances of (∗) read in matrix notation as

ΦX = Y + E,  (∗∗)

where Y and E are m_Φ × m_X matrices defined by

y_{ij} = [f(x_j + εϕ_i) − f(x_j)]/ε,
ε_{ij} = −(ε/2)[(ϕ_i)^T ∇²f(ζ_ij) ϕ_i].
Algorithm 2:

I Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (∗).
I Set x̂_j = Δ(y_j) := arg min_{y_j = Φz} ‖z‖_{ℓ₁^d}, for j = 1, …, m_X, so that X̂ = (x̂₁ | … | x̂_{m_X}) is again a d × m_X matrix.
I Compute the singular value decomposition

X̂^T = ( Û₁ Û₂ ) ( Σ̂₁ 0 ; 0 Σ̂₂ ) ( V̂₁^T ; V̂₂^T ),

where Σ̂₁ contains the k largest singular values.
I Set Â = V̂₁^T.
I Define ĝ(y) := f(Â^T y) and f̂(x) := ĝ(Âx).
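The SVD step of Algorithm 2 can be illustrated in isolation (our own toy example; we take the gradient approximations x̂_j as already computed, with the compressed-sensing error replaced by additive noise, and a particular g with k = 2):

```python
import numpy as np

# Given noisy approximations xhat_j ~ A^T grad g(A x_j), the span of the
# top-k right singular vectors of Xhat^T recovers the row space of A.
rng = np.random.default_rng(2)
d, k, m_X, noise = 60, 2, 200, 1e-4

# random row-orthogonal k x d matrix A (so that A A^T = I_k)
A = np.linalg.qr(rng.standard_normal((d, k)))[0].T
grad_g = lambda y: np.array([np.cos(y[0]), 2 * y[1]])   # g(y1,y2) = sin(y1) + y2^2

X = rng.standard_normal((m_X, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)           # sampling points on the sphere
Xhat = np.array([A.T @ grad_g(A @ x) for x in X])       # the columns xhat_j, stored as rows
Xhat += noise * rng.standard_normal(Xhat.shape)         # stand-in for the CS recovery error

# SVD of Xhat^T in the slides' orientation = SVD of our (m_X x d) array
_, _, Vt = np.linalg.svd(Xhat, full_matrices=False)
Ahat = Vt[:k]                                           # top-k right singular vectors

err = np.linalg.norm(A.T @ A - Ahat.T @ Ahat)           # ||A^T A - Ahat^T Ahat||_F
print(err)
```

The projector error ‖A^TA − Â^TÂ‖_F is the quantity that controls ‖f − f̂‖_∞ in the proof sketched later.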
The control of the error

The quality of the final approximation of f by means of f̂ depends on two kinds of accuracies:

1. The error between X and X̂, which can be controlled through the number of compressed sensing measurements m_Φ;
2. The stability of the span of V̂₁, characterized by how well the singular values of X, or equivalently of G, are separated from 0, which is related to the number of random samples m_X.

To be precise, we have:
Recovery result

Theorem (F., Schnass, and Vybíral)
Let log d ≤ m_Φ ≤ [log 6]² d. Then there is a constant c₁' such that, using m_X·(m_Φ + 1) function evaluations of f, Algorithm 2 defines a function f̂ : B_{R^d}(1 + ε) → R that, with probability

1 − ( e^{−c₁' m_Φ} + e^{−√(m_Φ d)} + k e^{−m_X α s²/(2kC₂²)} ),

will satisfy

‖f − f̂‖_∞ ≤ 2C₂√k (1 + ε) ν₂/(√(α(1−s)) − ν₂),

where

ν₂ = C ( k^{1/q} [m_Φ/log(d/m_Φ)]^{1/2−1/q} + ε k²/√m_Φ ),

and C depends only on C₁ and C₂.
Ingredients of the proof
I compressed sensing;
I stability of the SVD;
I concentration inequalities (Chernoff bounds for sums ofpositive-semidefinite matrices).
Compressed sensing

Corollary (after Wojtaszczyk, 2011)
Let log d ≤ m_Φ < [log 6]² d. Then with probability

1 − ( e^{−c₁' m_Φ} + e^{−√(m_Φ d)} )

the matrix X̂ as calculated in Algorithm 2 satisfies

‖X − X̂‖_F ≤ C √m_X ( k^{1/q} [m_Φ/log(d/m_Φ)]^{1/2−1/q} + ε k²/√m_Φ ),

where C depends only on C₁ and C₂.
Stability of SVD

Given two matrices B and B̂ with corresponding singular value decompositions

B = ( U₁ U₂ ) ( Σ₁ 0 ; 0 Σ₂ ) ( V₁^T ; V₂^T )

and

B̂ = ( Û₁ Û₂ ) ( Σ̂₁ 0 ; 0 Σ̂₂ ) ( V̂₁^T ; V̂₂^T ),

we have:
Wedin's bound

Theorem (Stability of subspaces)
If there is an α > 0 such that

min_{ℓ,ℓ̂} |σ_{ℓ̂}(Σ̂₁) − σ_ℓ(Σ₂)| ≥ α  and  min_{ℓ̂} |σ_{ℓ̂}(Σ̂₁)| ≥ α,

then

‖V₁V₁^T − V̂₁V̂₁^T‖_F ≤ (2/α)‖B − B̂‖_F.
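A numerical illustration of Wedin's bound (our own example; the specific spectrum and perturbation size are arbitrary choices):

```python
import numpy as np

# Perturbing a rank-k matrix B by E moves the top-k right singular subspace
# by at most (2/alpha)||E||_F in the projector distance.
rng = np.random.default_rng(3)
m, d, k = 30, 50, 3

U = np.linalg.qr(rng.standard_normal((m, k)))[0]
V = np.linalg.qr(rng.standard_normal((d, k)))[0]
B = U @ np.diag([5.0, 4.0, 3.0]) @ V.T                # rank-k, sigma_k = 3
E = rng.standard_normal((m, d))
E *= 0.1 / np.linalg.norm(E)                          # ||E||_F = 0.1
Bhat = B + E

V1 = np.linalg.svd(B)[2][:k].T                        # top-k right singular vectors of B
V1h = np.linalg.svd(Bhat)[2][:k].T                    # ... and of Bhat

lhs = np.linalg.norm(V1 @ V1.T - V1h @ V1h.T)
alpha = np.linalg.svd(Bhat)[1][k - 1]                 # smallest entry of Sigma1-hat; Sigma2 = 0
rhs = 2 * np.linalg.norm(E) / alpha
print(lhs, rhs)
```

Since B has exact rank k, Σ₂ = 0 and both gap conditions reduce to σ_k(Σ̂₁) ≥ α, exactly the situation exploited on the next slide.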
Wedin's bound

Applied to our situation, where X has rank k and thus Σ₂ = 0, we get

‖V₁V₁^T − V̂₁V̂₁^T‖_F ≤ 2√m_X ν₂/σ_k(X̂^T),

and further, since σ_k(X̂^T) ≥ σ_k(X^T) − ‖X − X̂‖_F, that

‖V₁V₁^T − V̂₁V̂₁^T‖_F ≤ 2√m_X ν₂/(σ_k(X^T) − √m_X ν₂).

Note that

X^T = GA = U_G Σ_G [V_G^T A],

for G = ( ∇g(Ax₁) | … | ∇g(Ax_{m_X}) )^T, hence Σ_{X^T} = Σ_G. Moreover

σ_i(G) = √(σ_i(G^T G)), for all i = 1, …, k.
Concentration inequalities II

Theorem (Matrix Chernoff bounds)
Consider X₁, …, X_m independent random, positive-semidefinite matrices of dimension k × k. Moreover, suppose σ₁(X_j) ≤ C almost surely. Compute the singular values of the sum of the expectations,

μ_max = σ₁( ∑_{j=1}^m EX_j )  and  μ_min = σ_k( ∑_{j=1}^m EX_j ).

Then

P( σ₁(∑_{j=1}^m X_j) − μ_max ≥ s μ_max ) ≤ k [ (1+s)/e ]^{−μ_max(1+s)/C},

for all s ≥ e − 1, and

P( σ_k(∑_{j=1}^m X_j) − μ_min ≤ −s μ_min ) ≤ k e^{−μ_min s²/(2C)},

for all s ∈ (0, 1).
Note that

G^T G = ∑_{j=1}^{m_X} ∇g(Ax_j) ∇g(Ax_j)^T,

and by applying the previous result to X_j = ∇g(Ax_j)∇g(Ax_j)^T, we have:

Lemma
For any s ∈ (0, 1) we have that

σ_k(X^T) ≥ √(m_X α(1−s))

with probability 1 − k e^{−m_X α s²/(2kC₂²)}.
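The mechanism behind this lemma can be simulated (our own toy model, not from the slides: unit gradients uniform on the sphere in R^k, so that E vv^T = I/k and μ_min = m/k, with C = 1):

```python
import numpy as np

# The sum of rank-one matrices v_j v_j^T has sigma_k >= (1-s) * mu_min
# except with probability <= k * exp(-mu_min * s^2 / 2).
rng = np.random.default_rng(4)
k, m, s = 5, 500, 0.5

V = rng.standard_normal((m, k))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # v_j uniform on S^{k-1}
S = V.T @ V                                       # sum_j v_j v_j^T
sigma_k = np.linalg.svd(S, compute_uv=False)[-1]  # smallest singular value of the sum

mu_min = m / k                                    # sigma_k of the expected sum
print(sigma_k, (1 - s) * mu_min)                  # empirical value vs. Chernoff threshold
```

Here μ_min = 100, so the Chernoff failure probability k·e^{−μ_min s²/2} is on the order of 10^{−5}, and the lower bound holds comfortably.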
Proof of Theorem

With probability at least

1 − ( e^{−c₁' m_Φ} + e^{−√(m_Φ d)} + k e^{−m_X α s²/(2kC₂²)} ),

we have

‖V₁V₁^T − V̂₁V̂₁^T‖_F ≤ 2ν₂/(√(α(1−s)) − ν₂),

and, for Â = V̂₁^T and V_G^T A = V₁^T,

‖A^T A − Â^T Â‖_F = ‖A^T V_G V_G^T A − V̂₁V̂₁^T‖_F ≤ 2ν₂/(√(α(1−s)) − ν₂).
Proof of Theorem ... continued

Since A is row-orthogonal we have A = AA^TA, and

|f(x) − f̂(x)| = |g(Ax) − ĝ(Âx)|
             = |g(Ax) − g(AÂ^T Âx)|
             ≤ C₂√k ‖Ax − AÂ^T Âx‖_{ℓ₂^k}
             = C₂√k ‖A(A^T A − Â^T Â)x‖_{ℓ₂^k}
             ≤ C₂√k ‖A^T A − Â^T Â‖_F ‖x‖_{ℓ₂^d}
             ≤ 2C₂√k (1 + ε) ν₂/(√(α(1−s)) − ν₂),

where we used

‖A^T A − Â^T Â‖_F = ‖A^T V_G V_G^T A − V̂₁V̂₁^T‖_F ≤ 2ν₂/(√(α(1−s)) − ν₂).
k-ridge functions may be too simple!
Figure: Functions on data clustered around a manifold with multiple directions can be locally approximated by sums of k-ridge functions.
Sums of ridge functions

Can we still learn functions of the type

f(x) = ∑_{i=1}^m g_i(a_i·x),  x ∈ [−1, 1]^d ?

Our approach (Daubechies, F., Vybíral) is essentially based on the formula

D_{c₁}^{α₁} … D_{c_k}^{α_k} f(x) = ∑_{i=1}^m g_i^{(α₁+…+α_k)}(a_i·x) (a_i·c₁)^{α₁} … (a_i·c_k)^{α_k},

where k ∈ N, c_i ∈ R^d, α_i ∈ N for all i = 1, …, k, and D_{c_i}^{α_i} is the α_i-th derivative in the direction c_i.
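The first-order case of this formula, D_c f(x) = ∑_i g_i'(a_i·x)(a_i·c), is easy to check against a finite difference (our own small example with m = 2 ridge summands):

```python
import numpy as np

# Check D_c f(x) = sum_i g_i'(a_i.x)(a_i.c) for f(x) = sin(a1.x) + tanh(a2.x).
rng = np.random.default_rng(5)
d = 8
a1, a2 = np.eye(d)[0], np.eye(d)[1]

f = lambda x: np.sin(a1 @ x) + np.tanh(a2 @ x)   # m = 2 ridge summands

x = rng.standard_normal(d)
c = rng.standard_normal(d); c /= np.linalg.norm(c)
h = 1e-6

fd = (f(x + h * c) - f(x - h * c)) / (2 * h)     # central difference approximation of D_c f(x)
exact = np.cos(a1 @ x) * (a1 @ c) + (1 / np.cosh(a2 @ x) ** 2) * (a2 @ c)
print(fd, exact)
```

Higher-order mixed directional derivatives bring in the products (a_i·c₁)^{α₁}…(a_i·c_k)^{α_k}, which is what makes the rank-one matrices a_i ⊗ a_i accessible from second-order differences.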
The recovery strategy: nearly orthonormal systems

We assume that the vectors a₁, …, a_m ∈ R^m are nearly orthonormal, meaning that

S(a₁, …, a_m) = inf{ ( ∑_{i=1}^m ‖a_i − w_i‖₂² )^{1/2} : w₁, …, w_m orthonormal basis in R^m }

is small! Furthermore, we denote by

L = span{ a_i ⊗ a_i : i = 1, …, m } ⊂ R^{m×m}

the subspace of symmetric matrices generated by the tensor products a_i ⊗ a_i = a_i a_i^T.
We first recover an approximation of L, i.e., instead of L we then have at our disposal a subspace L̂ of symmetric matrices which is (in some sense) close to L. Finally, we propose the program

arg max ‖M‖_∞,  s.t. M ∈ L̂, ‖M‖_F ≤ 1,

to recover the a_i's - or rather good approximations â_i (which is of course possible only up to the sign).
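The quantity S(a₁, …, a_m) is computable in closed form: it is an orthogonal Procrustes problem, and with A = (a₁|…|a_m) the infimum over orthonormal bases is attained at the polar factor of A, giving S = (∑_i (σ_i(A) − 1)²)^{1/2}. This reformulation is our own remark, not stated on the slides:

```python
import numpy as np

# S(a_1,...,a_m) = inf over orthogonal W of ||A - W||_F
#               = sqrt(sum_i (sigma_i(A) - 1)^2)   (orthogonal Procrustes)
def S(A):
    sigma = np.linalg.svd(A, compute_uv=False)
    return np.sqrt(np.sum((sigma - 1.0) ** 2))

# exactly orthonormal vectors => S = 0
print(S(np.eye(3)))

# a mild perturbation of an orthonormal system => small S
rng = np.random.default_rng(6)
A = np.eye(3) + 0.01 * rng.standard_normal((3, 3))
print(S(A))
```

The identity follows from ‖A − W‖_F² = ‖A‖_F² − 2 tr(W^T A) + m and max_W tr(W^T A) = ∑_i σ_i(A), attained at W = UV^T from the SVD A = UΣV^T.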
Nonlinear programming to recover the ai ⊗ ai ’s
Figure: The a_i ⊗ a_i are the extremal points of the matrix operator norm!
On the ambiguity of learning for nonorthogonal profiles

Let a₁ = (1, 0)^T, a₂ = (√2/2, √2/2)^T and b = (a₁ + a₂)/‖a₁ + a₂‖₂. We assume that L = span{a₁a₁^T, a₂a₂^T} and that

L̂ = span{ ( 1 ε ; ε −ε ), ( 0.5 0.5+ε ; 0.5+ε 0.5−ε ) }.

When choosing ε = 0.05, we find out that

{ dist(a₁a₁^T, L̂), dist(a₂a₂^T, L̂), dist(bb^T, L̂) } ⊂ [0.07, 0.08].

Hence, looking at L̂ alone, every algorithm will have difficulties deciding which two of the three rank-1 matrices above are the generators of the true L. Nevertheless, ‖b − a₁‖₂ = ‖b − a₂‖₂ > 0.39. We see that, although the level of noise was rather mild, we have difficulties distinguishing between well-separated vectors.
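The distances in this 2 × 2 example can be reproduced directly, computing the Frobenius distance to L̂ by orthogonal projection:

```python
import numpy as np

# Reproduce the ambiguity example: three rank-one candidates all lie at
# Frobenius distance ~0.08 from the perturbed span Lhat.
eps = 0.05
a1 = np.array([1.0, 0.0])
a2 = np.array([np.sqrt(2) / 2, np.sqrt(2) / 2])
b = (a1 + a2) / np.linalg.norm(a1 + a2)

M1 = np.array([[1.0, eps], [eps, -eps]])
M2 = np.array([[0.5, 0.5 + eps], [0.5 + eps, 0.5 - eps]])
Q, _ = np.linalg.qr(np.stack([M1.ravel(), M2.ravel()], axis=1))  # ONB of Lhat (vectorized)

def dist(M):
    v = M.ravel()
    return np.linalg.norm(v - Q @ (Q.T @ v))    # Frobenius distance to Lhat

dists = [dist(np.outer(v, v)) for v in (a1, a2, b)]
print(dists)                                    # all three lie in [0.07, 0.08]
print(np.linalg.norm(b - a1), np.linalg.norm(b - a2))
```

All three candidate generators are equally compatible with L̂, even though b is far from both a₁ and a₂.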
The approximation to L

Define

L̂ = span{ Δf(x_j) : j = 1, …, m_X },

where

(Δf(x))_{j,k} = [f(x + ε(e_j + e_k)) − f(x + εe_j) − f(x + εe_k) + f(x)]/ε²,

for j, k = 1, …, m, is an approximation to the Hessian of f at x. For x drawn at random, and by applying the matrix Chernoff bounds in a suitable way, one derives the probabilistic error estimate

‖P_L − P_{L̂}‖_{F→F} ≤ C m^{3/2} ε,

which holds with high probability.
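The second-order finite difference above is straightforward to implement. In the following sketch (the profiles g_i(t) = sin((i+1)t) with a_i = e_i are our own toy choices), the Hessian ∑_i g_i''(a_i·x) a_i a_i^T is recovered with O(ε) entrywise error:

```python
import numpy as np

# Second-order finite differences approximate the Hessian of
# f(x) = sum_i g_i(a_i.x); here a_i = e_i, g_i(t) = sin((i+1)t),
# so the exact Hessian is diagonal with entries -(i+1)^2 sin((i+1) x_i).
m, eps = 4, 1e-3
rng = np.random.default_rng(7)
f = lambda x: sum(np.sin((i + 1) * x[i]) for i in range(m))

def hessian_fd(f, x, eps):
    e = np.eye(len(x))
    H = np.empty((len(x), len(x)))
    for j in range(len(x)):
        for k in range(len(x)):
            H[j, k] = (f(x + eps * (e[j] + e[k])) - f(x + eps * e[j])
                       - f(x + eps * e[k]) + f(x)) / eps**2
    return H

x = rng.standard_normal(m)
H = hessian_fd(f, x, eps)
exact = np.diag([-(i + 1) ** 2 * np.sin((i + 1) * x[i]) for i in range(m)])
err = np.abs(H - exact).max()
print(err)          # O(eps) entrywise error
```

Each such matrix Δf(x_j) is a noisy element of L, which is how the approximating subspace L̂ is assembled in practice.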
A nonlinear operator towards a gradient ascent

Let us first introduce, for a given parameter γ > 1, an operator acting on the singular values of a matrix X = UΣV^T as follows:

Π_γ(X) = U [ (diag(γ, 1, …, 1) × Σ) / ‖diag(γ, 1, …, 1) × Σ‖_F ] V^T,

where

diag(γ, 1, …, 1) × Σ = diag(γσ₁, σ₂, …, σ_m).

Notice that Π_γ maps any matrix X onto a matrix of unit Frobenius norm, simply amplifying the first singular value and damping the others. It is not a linear operator.
The nonlinear programming

We propose a projected gradient method for solving

arg max ‖M‖_∞,  s.t. M ∈ L̂, ‖M‖_F ≤ 1.

Algorithm 3:

I Fix a suitable parameter γ > 1;
I Assume to have identified a basis for L̂ of positive-semidefinite matrices; for instance, one can use the second order finite differences Δf(x_j), j = 1, …, m_X, to form such a basis;
I Generate an initial guess X⁰ = ∑_{j=1}^{m_X} ζ_j Δf(x_j) by choosing the ζ_j ≥ 0 at random, so that X⁰ ∈ L̂ and ‖X⁰‖_F = 1;
I For ℓ ≥ 0:

X^{ℓ+1} := P_{L̂} Π_γ(X^ℓ).
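A minimal sketch of Algorithm 3 (our own implementation, in the exact case L̂ = L with orthonormal a_i = e_i, where the iteration can be followed transparently):

```python
import numpy as np

# Iterating X -> P_L Pi_gamma(X) drives X to one of the extremal points a_i (x) a_i.
def pi_gamma(X, gamma):
    U, s, Vt = np.linalg.svd(X)
    s[0] *= gamma                                   # amplify the top singular value
    Y = U @ np.diag(s) @ Vt
    return Y / np.linalg.norm(Y)                    # renormalize in Frobenius norm

def make_projector(mats):
    """Orthogonal projector onto span(mats) w.r.t. the Frobenius inner product."""
    B = np.stack([M.ravel() for M in mats], axis=1)
    Q, _ = np.linalg.qr(B)
    return lambda X: (Q @ (Q.T @ X.ravel())).reshape(X.shape)

m, gamma = 4, 2.0
A = np.eye(m)                                       # a_i = e_i, exactly orthonormal
basis = [np.outer(A[i], A[i]) for i in range(m)]    # spans L
P_L = make_projector(basis)

rng = np.random.default_rng(8)
X = sum(z * B for z, B in zip(rng.uniform(0.1, 1, m), basis))
X /= np.linalg.norm(X)                              # X0 in L, unit Frobenius norm

for _ in range(60):
    X = P_L(pi_gamma(X, gamma))

print(np.linalg.svd(X, compute_uv=False))           # ~ (1, 0, ..., 0): rank-one extremal point
```

Each step multiplies the ratio σ₂/σ₁ by 1/γ, so the iterates converge geometrically to a rank-one matrix a_i ⊗ a_i, matching the contraction estimate of the proposition below.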
Analysis of the algorithm for L̂ = L

Proposition (Daubechies, F., Vybíral)
Assume that L̂ = L and that a₁, …, a_m are orthonormal. Let γ > √2 and let ‖X⁰‖_∞ > 1/√(γ² − 1). Then there exists μ₀ < 1 such that

|1 − ‖X^{ℓ+1}‖_∞| ≤ μ₀ |1 − ‖X^ℓ‖_∞|, for all ℓ ≥ 0.

Since the sequence (X^ℓ)_ℓ is made of matrices with Frobenius norm bounded by 1, we conclude that any accumulation point has both unit Frobenius and spectral norm, and therefore has to coincide with one of the maximizers.

The proof is based on the following observation:

‖X^{ℓ+1}‖_∞ = σ₁(X^{ℓ+1}) = γσ₁(X^ℓ)/√(γ²σ₁(X^ℓ)² + σ₂(X^ℓ)² + … + σ_m(X^ℓ)²) ≥ γ‖X^ℓ‖_∞/√((γ² − 1)‖X^ℓ‖²_∞ + 1).
Analysis of the algorithm for L̂ ≈ L

Theorem (Daubechies, F., Vybíral)
Assume that ‖P_L − P_{L̂}‖_{F→F} < ε < 1 and that a₁, …, a_m are orthonormal. Let ‖X⁰‖_∞ > max{ 1/√(γ² − 1), 1/√2 + ε + ξ } and γ > √2. Then, for the iterations (X^ℓ)_ℓ produced by Algorithm 3, there exists μ₀ < 1 such that

lim sup_ℓ |1 − ‖X^ℓ‖_∞| ≤ (μ₁(γ, ξ, ε) + 2ε)/(1 − μ₀) + ε,

where μ₁(γ, ξ, ε) ≈ ε. The sequence (X^ℓ)_ℓ is bounded, and its accumulation points X̄ satisfy simultaneously the following properties:

‖X̄‖_F ≤ 1 and ‖X̄‖_∞ ≥ 1 − (μ₁(γ, ξ, ε) + 2ε)/(1 − μ₀) − ε,

and

‖P_L X̄‖_F ≤ 1 and ‖P_L X̄‖_∞ ≥ 1 − (μ₁(γ, ξ, ε) + 2ε)/(1 − μ₀).
A graphical explanation of the algorithm
Figure: Objective function ‖·‖_∞ to be maximized, and iterations of Algorithm 3 converging to one of the extremal points a_i ⊗ a_i.
Nonlinear programming

Theorem (Daubechies, F., Vybíral)
Let M be any local maximizer of

arg max ‖M‖_∞,  s.t. M ∈ L̂, ‖M‖_F ≤ 1.

Then

u_j^T X u_j = 0 for all X ∈ L̂ with X ⊥ M,

and all j ∈ {1, …, m} with |λ_j(M)| = ‖M‖_∞.
If furthermore the a_i's are nearly orthonormal, S(a₁, …, a_m) ≤ ε, and

3·m·‖P_L − P_{L̂}‖ < (1 − ε)²,

then λ₁ = ‖M‖_∞ > max{|λ₂|, …, |λ_m|} and

2 ∑_{k=2}^m (u₁^T X u_k)²/(λ₁ − λ_k) ≤ λ₁.
Nonlinear programming

Algorithm 4:

I Let M be a local maximizer of the nonlinear program;
I Take its spectral decomposition M = ∑_{j=1}^m λ_j u_j ⊗ u_j;
I Put â := u₁.

Theorem (Daubechies, F., Vybíral)
Let L̂ = L and S(a₁, …, a_m) ≤ ε. Then there is j₀ ∈ {1, …, m} such that the â found by Algorithm 4 satisfies ‖â − a_{j₀}‖₂ ≤ C√ε.

The proof is based on testing the optimality conditions for X = X_j = a_j ⊗ a_j and showing that λ₁(M) ≈ 1.
Learning sums of ridge functions

Algorithm 5:

I Let â_j be normalized approximations of a_j, j = 1, …, m;
I Let (b̂_j)_{j=1}^m be the dual basis to (â_j)_{j=1}^m;
I Assume that f(0) = g₁(0) = … = g_m(0);
I Put ĝ_j(t) := f(t b̂_j), t ∈ (−1/‖b̂_j‖₂, 1/‖b̂_j‖₂);
I Put f̂(x) := ∑_{j=1}^m ĝ_j(â_j·x), ‖x‖₂ ≤ 1.

Theorem (Daubechies, F., Vybíral)
Let

I S(a₁, …, a_m) ≤ ε and S(â₁, …, â_m) ≤ ε';
I ‖â_j − a_j‖₂ ≤ η, j = 1, …, m.

Then

‖f − f̂‖_∞ ≤ c(ε, ε') m η.
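The decoupling trick behind Algorithm 5 is worth seeing once in code. In the sketch below (our own m = 2 example, with the exact profiles a_j standing in for â_j, i.e. η = 0), the dual basis b_j with a_i·b_j = δ_ij isolates each summand, since f(t b_j) = ∑_i g_i(t a_i·b_j) = g_j(t) when g_i(0) = 0:

```python
import numpy as np

# Dual-basis decoupling for f(x) = g_1(a1.x) + g_2(a2.x).
theta = 0.1
a1 = np.array([1.0, 0.0])
a2 = np.array([np.sin(theta), np.cos(theta)])      # nearly orthonormal to a1
A = np.stack([a1, a2])                              # rows a_j
B = np.linalg.inv(A)                                # columns b_j satisfy a_i.b_j = delta_ij

g = [np.sin, np.tanh]                               # g_1, g_2 with g_i(0) = 0
f = lambda x: g[0](a1 @ x) + g[1](a2 @ x)

ghat = [lambda t, j=j: f(t * B[:, j]) for j in range(2)]   # recovered one-dimensional profiles
fhat = lambda x: ghat[0](a1 @ x) + ghat[1](a2 @ x)

xs = np.random.default_rng(9).standard_normal((100, 2))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
err = max(abs(f(x) - fhat(x)) for x in xs)
print(err)
```

With exact directions the reconstruction is exact up to machine precision; the theorem quantifies how the error degrades linearly in the perturbation η of the directions.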
Our literature

I I. Daubechies, M. Fornasier, and J. Vybíral, Approximation of sums of ridge functions, in preparation.
I M. Fornasier, K. Schnass, and J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, Foundations of Computational Mathematics 12 (2012), no. 2, 229-262.
I K. Schnass and J. Vybíral, Compressed learning of high-dimensional sparse functions, ICASSP 2011.
I A. Kolleck and J. Vybíral, On some aspects of approximation of ridge functions, J. Approx. Theory 194 (2015), 35-61.
I S. Mayer, T. Ullrich, and J. Vybíral, Entropy and sampling numbers of classes of ridge functions, to appear in Constructive Approximation.