Page 1:

Learning sums of ridge functions in high dimension:
a nonlinear compressed sensing model

Massimo Fornasier
Fakultät für Mathematik, Technische Universität München
[email protected]
http://www-m15.ma.tum.de/

Winter School on Compressed Sensing
Technical University of Berlin, December 3-5, 2015

Collection of joint results with Ingrid Daubechies, Karin Schnass, and Jan Vybíral

Page 2:

Introduction on ridge functions

- A ridge function, in its simplest form, is a function f : R^d → R of the type

  f(x) = g(aᵀx) = g(a · x),

  where g : R → R is a scalar univariate function and a ∈ R^d is the direction of the ridge function;

- Ridge functions are constant along the hyperplanes a · x = λ for any given level λ ∈ R, and are among the simplest forms of multivariate functions;

- They have been extensively studied in the past couple of decades as approximation building blocks for more complicated high-dimensional functions.
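A minimal numerical check of the defining property above (the direction a and the profile g are arbitrary illustrative choices): a ridge function takes the same value at any two points of a hyperplane a · x = λ.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
a = rng.standard_normal(d)
a /= np.linalg.norm(a)            # ridge direction, ‖a‖₂ = 1
g = np.tanh                       # any scalar univariate function

def f(x):
    return g(a @ x)               # ridge function f(x) = g(a·x)

# Two points on the same hyperplane a·x = λ give the same value:
lam = 0.3
x1 = lam * a                      # a·x1 = λ since ‖a‖₂ = 1
v = rng.standard_normal(d)
v -= (a @ v) * a                  # project v onto the orthogonal complement of a
x2 = x1 + v                       # still satisfies a·x2 = λ
assert np.isclose(f(x1), f(x2))
```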


Page 5:

Some origins of ridge functions

- In multivariate Fourier series, the basis functions are of the form e^{in·x} for n ∈ Z^d, and of the form e^{ia·x} for arbitrary directions a ∈ R^d in the Radon transform;

- The term "ridge function" was actually coined by Logan and Shepp in 1975 in their work on computerized tomography, where they show how ridge functions solve the corresponding L2 minimum-norm approximation problem.


Page 7:

Projection pursuit of the '80s

- Ridge function approximation was also extensively studied during the '80s in mathematical statistics under the name of projection pursuit (Huber, 1985; Donoho-Johnstone, 1989);

- Projection pursuit algorithms approximate a function of d variables by functions of the form

  ∑_{i=1}^{m} g_i(a_i · x),  x ∈ R^d,

  for some functions g_i : R → R and some non-zero vectors a_i ∈ R^d.


Page 9:

Some relevant applications of the '90s

- In the early '90s there was an explosion of interest in the field of neural networks. One very popular model is the multilayer feed-forward neural network with input, hidden (internal), and output layers;

- The simplest case of such a network is described mathematically by a function of the form

  ∑_{i=1}^{m} α_i σ( ∑_{j=1}^{d} w_{ij} x_j + θ_i ),

  where σ : R → R is a given function, called the activation function, and the w_{ij} are suitable weights;
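Such a one-hidden-layer network is exactly a sum of m ridge functions, with directions w_i (the rows of the weight matrix) and profiles g_i(t) = α_i σ(t + θ_i). A short sketch, with all parameter values chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 4, 6
alpha = rng.standard_normal(m)    # output weights α_i
W = rng.standard_normal((m, d))   # rows w_i are the ridge directions
theta = rng.standard_normal(m)    # biases θ_i
sigma = np.tanh                   # activation function σ

def network(x):
    # the slide's formula, vectorized: ∑_i α_i σ(∑_j w_ij x_j + θ_i)
    return alpha @ sigma(W @ x + theta)

def ridge_sum(x):
    # the same value, written explicitly as a sum of ridge functions g_i(w_i·x)
    return sum(alpha[i] * sigma(W[i] @ x + theta[i]) for i in range(m))

x = rng.standard_normal(d)
assert np.isclose(network(x), ridge_sum(x))
```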


Page 11:

Ridge functions and approximation theory

- In the early '90s, the question of whether one can use sums of ridge functions to approximate arbitrary functions well was at the center of attention of the approximation theory community (overviews by Li, 2002, and Pinkus, 1997);

- The efficiency of such an approximation compared with, e.g., spline-type approximation for smoothness classes of functions has been extensively considered (DeVore et al., 1997; Petrushev, 1999);

- The identification of a ridge function has also been thoroughly considered; in particular we mention the work of Pinkus and, concerning multilayer neural networks, the work by Fefferman, 1994;

- Except for the work of Candès on ridgelets, there has been less attention after 2000 to the problem of approximating functions by means of ridge functions.


Page 16:

Capturing ridge functions from point queries

- The above results on the identification of such functions rely on having access to any possible output, or even to derivatives;

- In certain practical situations this might be very expensive, hazardous, or impossible;

- In a paper of 2012, Cohen, Daubechies, DeVore, Kerkyacharian, and Picard address the approximation of ridge functions with a minimal number of sampling queries:

For g ∈ C^s([0, 1]), s > 1, ‖g‖_{C^s} ≤ M0, ‖a‖_{ℓ_q^d} ≤ M1, 0 < q ≤ 1,

  ‖f − f̂‖_{C(Ω)} ≤ C M0 [ L^{−s} + M1 ( (1 + log(d/L)) / L )^{1/q − 1} ],

using 3L + 2 sampling points, deterministically and adaptively chosen.


Page 20:

Capturing ridge functions from point queries: a nonlinear compressed sensing model

Compressed sensing: given a suitable sensing matrix X ∈ R^{m×d}, with m ≪ d, we wish to identify a nearly sparse vector a ∈ R^d from its measurements

  y ≈ Xa,

by means of suitable algorithms (ℓ1-minimization, greedy algorithms) aware of y and X.

The data

  y_i ≈ x_i · a = x_iᵀ a,  i = 1, …, m,

are linear measurements of a. If we now assume the y_i to be the values of a ridge function at the points x_i,

  y_i ≈ g(a · x_i),  i = 1, …, m,

for some unknown or roughly given nonlinear function g, the problem of identifying the ridge direction can be understood as a nonlinear compressed sensing model ...
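In the linear case, a standard way to recover the nearly sparse a is basis pursuit, i.e. ℓ1-minimization subject to Xz = y, which becomes a linear program via the split z = u − v with u, v ≥ 0. A sketch with illustrative sizes (the slides do not prescribe a particular solver; scipy's LP routine is used here as one possible choice):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    # min ‖z‖₁ s.t. Xz = y, as an LP over (u, v) with z = u − v, u, v ≥ 0
    m, d = X.shape
    c = np.ones(2 * d)                           # objective: ∑(u + v) = ‖z‖₁
    A_eq = np.hstack([X, -X])                    # X(u − v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:d] - res.x[d:]

rng = np.random.default_rng(2)
m, d, k = 30, 80, 3
X = rng.standard_normal((m, d)) / np.sqrt(m)     # random sensing matrix
a = np.zeros(d)
a[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)  # k-sparse
a_hat = basis_pursuit(X, X @ a)                  # recover a from y = Xa
```

With m of order k log(d/k), such random measurements recover the sparse vector with high probability.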


Page 25:

Ridge functions and functions of data clustered around manifolds

Figure: Functions on data clustered around a manifold can be locally approximated by k-ridge functions.

Page 26:

Universal random sampling for a more general ridge model

M. Fornasier, K. Schnass, J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, FoCM, 2012.

  f(x) = g(Ax),  A is a k × d matrix

- Rows of A are compressible: max_i ‖a_i‖_q ≤ C1, 0 < q ≤ 1;
- AAᵀ is the identity operator on R^k;
- The regularity condition: sup_{|α|≤2} ‖D^α g‖_∞ ≤ C2;
- The matrix H^f := ∫_{S^{d−1}} ∇f(x) ∇f(x)ᵀ dμ_{S^{d−1}}(x) is a positive semi-definite matrix of rank k.

We assume that the singular values of the matrix H^f satisfy

  σ1(H^f) ≥ · · · ≥ σk(H^f) ≥ α > 0.


Page 29:

How can we learn k-ridge functions from point queries?

Page 30:

MD House's differential diagnosis (or simply called "sensitivity analysis")

We rely on the numerical approximation of ∂f/∂φ:

  ∇g(Ax)ᵀ Aφ = ∂f/∂φ (x)                                     (∗)
             = [f(x + εφ) − f(x)]/ε − (ε/2)[φᵀ ∇²f(ζ) φ],  ε ≤ ε̄

X = {x^j ∈ Ω : j = 1, …, m_X} drawn uniformly at random in Ω ⊂ R^d;

Φ = {φ^j ∈ R^d : j = 1, …, m_Φ}, where

  φ^j_ℓ = 1/√m_Φ with prob. 1/2,  −1/√m_Φ with prob. 1/2,

for every j ∈ {1, …, m_Φ} and every ℓ ∈ {1, …, d}.
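The identity (∗) can be checked numerically: a single pair of function queries approximates the directional derivative ∇f(x) · φ up to the O(ε) curvature term. The test function f below is an arbitrary smooth example, not one from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m_phi, eps = 20, 50, 1e-6

def f(x):
    return np.sin(x[0] + 2 * x[1]) + x[2] ** 2   # smooth illustrative function

def grad_f(x):
    # exact gradient of the example above, for comparison only
    g = np.zeros_like(x)
    g[0] = np.cos(x[0] + 2 * x[1])
    g[1] = 2 * np.cos(x[0] + 2 * x[1])
    g[2] = 2 * x[2]
    return g

x = rng.standard_normal(d)
phi = rng.choice([-1.0, 1.0], size=d) / np.sqrt(m_phi)  # Bernoulli direction
fd = (f(x + eps * phi) - f(x)) / eps                    # one pair of queries
assert abs(fd - grad_f(x) @ phi) < 1e-5                 # ≈ ∇f(x)·φ up to O(ε)
```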


Page 32:

Sensitivity analysis

Figure: We perform randomized sensitivity analysis at randomly drawn points: f is queried at x and at x + εφ, with x ∈ S^{d−1}.

Page 33:

Collecting together the differential analysis

Φ ... the m_Φ × d matrix whose rows are the φ^i;  X ... the d × m_X matrix

  X = ( Aᵀ∇g(Ax^1) | … | Aᵀ∇g(Ax^{m_X}) ).

The m_X · m_Φ instances of (∗) read, in matrix notation,

  ΦX = Y + E,                                   (∗∗)

where Y and E are m_Φ × m_X matrices defined by

  y_ij = [f(x^j + εφ^i) − f(x^j)]/ε,
  ε_ij = −(ε/2)[(φ^i)ᵀ ∇²f(ζ_ij) φ^i].

Page 34:

Example of active coordinates: which factors play a role?

We assume that

  A = ( e_{i1}ᵀ ; … ; e_{ik}ᵀ ),

i.e.,

  f(x) = f(x1, …, xd) = g(x_{i1}, …, x_{ik}),

where f : Ω = [0, 1]^d → R and g : [0, 1]^k → R.

We first want to identify the active coordinates i1, …, ik. Then one can apply any usual k-dimensional approximation method...

A possible algorithm chooses the sampling points at random; due to concentration of measure effects, we get the right result with overwhelming probability.


Page 37:

A simple algorithm based on concentration of measure

The algorithm to identify the active coordinates I is based on the identity

  ΦᵀΦX = ΦᵀY + ΦᵀE,

where now X has i-th row

  X_i = ( ∂g/∂z_i (Ax^1), …, ∂g/∂z_i (Ax^{m_X}) )

for i ∈ I, and all other rows equal to zero.

In expectation, ΦᵀΦ ≈ Id : R^d → R^d, so that

  ΦᵀΦX ≈ X and ΦᵀE is small  ⟹  ΦᵀY ≈ X.

We select the k largest rows of ΦᵀY and estimate the probability that their indices coincide with the indices of the non-zero rows of X.
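A sketch of this detection scheme, in which all sizes and the function f are illustrative assumptions: build the Bernoulli matrix Φ, collect the finite differences into Y, and flag the k rows of ΦᵀY with the largest norm as the active coordinates.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, m_phi, m_x, eps = 30, 2, 200, 20, 1e-5
active = [4, 17]                                  # unknown to the algorithm

def f(x):
    return np.sin(3 * x[4]) + x[17] ** 2          # depends on x_4, x_17 only

Phi = rng.choice([-1.0, 1.0], size=(m_phi, d)) / np.sqrt(m_phi)
Xpts = rng.random((m_x, d))                       # uniform samples in [0,1]^d

# y_ij = [f(x^j + εφ^i) − f(x^j)]/ε, an m_Φ × m_X matrix of finite differences
Y = np.array([[(f(xj + eps * phi) - f(xj)) / eps for xj in Xpts]
              for phi in Phi])

G = Phi.T @ Y                                     # ≈ X; non-zero rows = active set
row_norms = np.linalg.norm(G, axis=1)
detected = np.sort(np.argsort(row_norms)[-k:])    # indices of k largest rows
assert list(detected) == active
```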


Page 40:

A first recovery result

Theorem (Schnass and Vybíral 2011). Let f : R^d → R be a function of k active coordinates that is defined and twice continuously differentiable on a small neighbourhood of [0, 1]^d. For a positive real number L ≤ d, the randomized algorithm described above recovers the k unknown active coordinates of f with probability at least 1 − 6 exp(−L), using only

  O(k(L + log k)(L + log d))

samples of f.

The constants involved in the O notation depend on smoothness properties of g, namely on

  max_{j=1,…,k} ‖∂_{i_j} g‖_∞ / min_{j=1,…,k} ‖∂_{i_j} g‖_1.

Page 41:

Examples of active coordinate detection in dimension d = 1000

Figure: Detection results for max(1 − 5√((x_3 − 1/2)² + (x_4 − 1/2)²), 0)³ and for sin(6π ∑_{i=21}^{40} x_i) + ∑_{i=21}^{40} [sin(6πx_i) + 5(x_i − 1/2)²].

Page 42:

Learning ridge functions: k = 1

Let f(x) = g(a · x), f : B_{R^d} → R, where a ∈ R^d with

  ‖a‖2 = 1 and ‖a‖q ≤ C1, 0 < q ≤ 1,  max_{0≤|α|≤2} ‖D^α g‖_∞ ≤ C2,

  α = ∫_{S^{d−1}} ‖∇f(x)‖²_{ℓ_2^d} dμ_{S^{d−1}}(x) = ∫_{S^{d−1}} |g′(a · x)|² dμ_{S^{d−1}}(x) > 0.

We consider again the Taylor expansion (∗), with Ω = S^{d−1}.

We choose the points X = {x^j ∈ S^{d−1} : j = 1, …, m_X} generated at random on S^{d−1} with respect to μ_{S^{d−1}}.

The matrix Φ is generated as before, and we obtain (∗∗) again, now in the form

  Φ[g′(a · x^j) a] = y_j + ε_j,  j = 1, …, m_X.


Page 44: TU Berlin · Introduction on ridge functions I A ridge function - in its simplest form - is a function f : Rd !R of the type f (x) = g(aTx) = g(a x), where g : R !R is a scalar univariate

Algorithm 1:

I Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (*).

I Set x̂_j = ∆(y_j) := arg min_{y_j = Φz} ‖z‖_{ℓ_1^d}.

I Find j_0 = arg max_{j=1,…,m_X} ‖x̂_j‖_{ℓ_2^d}.

I Set â = x̂_{j_0}/‖x̂_{j_0}‖_{ℓ_2^d}.

I Define ĝ(y) := f(â^T y) and f̂(x) := ĝ(â · x).
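A minimal numerical sketch of Algorithm 1 may help fix ideas. All names here are mine, the ℓ¹ decoder ∆ is replaced by a simple iteratively reweighted least squares (IRLS) surrogate for basis pursuit, and the toy profile g(t) = 2t + sin t is chosen only for the test; this is an illustration under those assumptions, not the authors' implementation:

```python
import numpy as np

def l1_decoder(Phi, y, iters=80):
    """IRLS surrogate for basis pursuit: approx. argmin ||z||_1 s.t. Phi z = y."""
    z = np.linalg.lstsq(Phi, y, rcond=None)[0]   # start from the min-norm solution
    eps_w = 1.0
    for _ in range(iters):
        W = np.sqrt(z * z + eps_w**2)            # smoothed weights, diag(W)
        WPt = W[:, None] * Phi.T                 # W Phi^T
        z = WPt @ np.linalg.solve(Phi @ WPt, y)  # weighted least squares step
        eps_w = max(eps_w / 2, 1e-9)
    return z

def algorithm1(f, d, m_phi, m_X, eps, rng):
    """Sketch of Algorithm 1: estimate the ridge direction a of f(x) = g(a . x)."""
    Phi = rng.choice([-1.0, 1.0], size=(m_phi, d)) / np.sqrt(m_phi)     # Bernoulli matrix
    best = None
    for _ in range(m_X):
        x = rng.standard_normal(d); x /= np.linalg.norm(x)              # x uniform on S^{d-1}
        y = np.array([(f(x + eps * phi) - f(x)) / eps for phi in Phi])  # divided differences
        xhat = l1_decoder(Phi, y)                                       # approx. g'(a . x) a
        if best is None or np.linalg.norm(xhat) > np.linalg.norm(best):
            best = xhat
    return best / np.linalg.norm(best)

rng = np.random.default_rng(0)
d = 40
a = np.zeros(d); a[3] = a[7] = 1 / np.sqrt(2)        # compressible (2-sparse) direction
f = lambda x: 2 * (a @ x) + np.sin(a @ x)            # g(t) = 2t + sin t, so g' >= 1 > 0
a_hat = algorithm1(f, d, m_phi=20, m_X=5, eps=1e-5, rng=rng)
```

Since g′ ≥ 1 here, the largest decoded vector is a positive multiple of a, and the normalized output agrees with a up to small error.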


Recovery result

Theorem (F., Schnass, and Vybíral 2012)
Let 0 < s < 1 and log d ≤ m_Φ ≤ [log 6]² d. Then there is a constant c′_1 such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 1 defines a function f̂ : B_{R^d}(1 + ε) → R that, with probability

1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} + 2 e^{−2 m_X s² α² / C_2⁴} ),

will satisfy

‖f − f̂‖_∞ ≤ 2C_2(1 + ε) ν_1 / ( √(α(1 − s)) − ν_1 ),

where

ν_1 = C′ ( [m_Φ / log(d/m_Φ)]^{1/2 − 1/q} + ε/√m_Φ )

and C′ depends only on C_1 and C_2.


Ingredients of the proof

I compressed sensing;

I stability of one dimensional subspaces;

I concentration inequalities (Hoeffding’s inequality).


Compressed sensing

Theorem (Wojtaszczyk, 2011)
Assume that Φ is an m × d random matrix with all entries being independent Bernoulli variables scaled by 1/√m. Let us suppose that d ≥ [log 6]² m. Then there are positive constants C, c′_1, c′_2 > 0 such that, with probability at least

1 − e^{−c′_1 m} − e^{−√(md)},

the matrix Φ has the following property. For every x ∈ R^d, ε ∈ R^m and every natural number K ≤ c′_2 m / log(d/m) we have

‖∆(Φx + ε) − x‖_{ℓ_2^d} ≤ C ( K^{−1/2} σ_K(x)_{ℓ_1^d} + max{‖ε‖_{ℓ_2^m}, √(log d) ‖ε‖_{ℓ_∞^m}} ),

where
σ_K(x)_{ℓ_1^d} := inf{ ‖x − z‖_{ℓ_1^d} : # supp z ≤ K }

is the best K-term approximation error of x.
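Here σ_K(x)_{ℓ_1^d} is simply the ℓ¹ mass of the d − K smallest entries of x. A tiny helper (the function name is mine) makes this concrete:

```python
import numpy as np

def sigma_K_l1(x, K):
    """Best K-term approximation error in l1: keep the K largest entries, sum the rest."""
    mags = np.sort(np.abs(x))        # magnitudes in ascending order
    return mags[:len(x) - K].sum()   # l1 mass of the d - K smallest entries

x = np.array([3.0, -0.5, 0.1, 2.0, -0.05])
err = sigma_K_l1(x, 2)               # keeps 3.0 and 2.0; 0.5 + 0.1 + 0.05 = 0.65
```

In particular σ_K(x)_{ℓ_1^d} = 0 for K-sparse x, so the theorem reproduces exact recovery of sparse vectors from noiseless measurements.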


How does compressed sensing play a role?

For the d × m_X matrix X, i.e.,

X = ( g′(a · x^1) a | … | g′(a · x^{m_X}) a ),

we have

Φ[ g′(a · x^j) a ] = y_j + ε_j,  j = 1, …, m_X,  with x_j := g′(a · x^j) a,

and
x̂_j = ∆(y_j) := arg min_{y_j = Φz} ‖z‖_{ℓ_1^d}.

The previous result gives - with the probability provided there -

x̂_j = g′(a · x^j) a + n_j,

with n_j properly estimated by

‖n_j‖_{ℓ_2^d} ≤ C ( K^{−1/2} σ_K(g′(a · x^j) a)_{ℓ_1^d} + max{‖ε_j‖_{ℓ_2^m}, √(log d) ‖ε_j‖_{ℓ_∞^m}} ).


Some computations

Let us estimate the quantities. By Stechkin's inequality, for which

σ_K(x)_{ℓ_1^d} ≤ ‖x‖_{ℓ_q^d} K^{1−1/q},  for all x ∈ R^d,

one obtains, for x_j = g′(a · x^j) a,

K^{−1/2} σ_K(x_j)_{ℓ_1^d} ≤ |g′(a · x^j)| · ‖a‖_{ℓ_q^d} · K^{1/2−1/q} ≤ C_1 C_2 [m_Φ / log(d/m_Φ)]^{1/2−1/q}.

Moreover

‖ε_j‖_{ℓ_∞^{m_Φ}} = (ε/2) · max_{i=1,…,m_Φ} |(φ^i)^T ∇²f(ζ_{ij}) φ^i|
 ≤ (ε/(2m_Φ)) · max_{i=1,…,m_Φ} ∑_{k,l=1}^d |a_k a_l g″(a · ζ_{ij})|
 ≤ (ε ‖g″‖_∞/(2m_Φ)) ( ∑_{k=1}^d |a_k| )²
 ≤ (ε ‖g″‖_∞/(2m_Φ)) ( ∑_{k=1}^d |a_k|^q )^{2/q}
 ≤ (C_1² C_2/(2m_Φ)) ε,

and ‖ε_j‖_{ℓ_2^{m_Φ}} ≤ √m_Φ ‖ε_j‖_{ℓ_∞^{m_Φ}} ≤ (C_1² C_2/(2√m_Φ)) ε, leading to

max{ ‖ε_j‖_{ℓ_2^{m_Φ}}, √(log d) ‖ε_j‖_{ℓ_∞^{m_Φ}} } ≤ (C_1² C_2/(2√m_Φ)) ε · max{ 1, √(log d / m_Φ) } = (C_1² C_2/(2√m_Φ)) ε,

since m_Φ ≥ log d.
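Stechkin's inequality used above is easy to sanity-check numerically; a sketch over random test vectors (helper names mine):

```python
import numpy as np

def sigma_K_l1(x, K):
    """l1 mass of the d - K smallest entries of x."""
    return np.sort(np.abs(x))[:len(x) - K].sum()

def stechkin_rhs(x, K, q):
    """||x||_{l_q} * K^{1 - 1/q}, the Stechkin bound for 0 < q <= 1."""
    return np.sum(np.abs(x) ** q) ** (1.0 / q) * K ** (1.0 - 1.0 / q)

rng = np.random.default_rng(1)
holds = all(
    sigma_K_l1(x, K) <= stechkin_rhs(x, K, q) + 1e-12
    for x in rng.standard_normal((20, 50))
    for K in (1, 5, 10, 25)
    for q in (0.3, 0.5, 1.0)
)
```

The smaller q is (the more "compressible" a is), the faster the right-hand side decays in K, which is exactly where the exponent 1/2 − 1/q in ν₁ comes from.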


Summarizing ...

With high probability

x̂_j = g′(a · x^j) a + n_j,

where

‖n_j‖_{ℓ_2^d} ≤ C ( K^{−1/2} σ_K(g′(a · x^j) a)_{ℓ_1^d} + max{‖ε_j‖_{ℓ_2^m}, √(log d) ‖ε_j‖_{ℓ_∞^m}} )
 ≤ C′ ( [m_Φ / log(d/m_Φ)]^{1/2−1/q} + ε/√m_Φ ) =: ν_1.


Stability of one dimensional subspaces

Lemma
Let us fix x ∈ R^d, a ∈ S^{d−1}, 0 ≠ γ ∈ R, and n ∈ R^d with norm ‖n‖_{ℓ_2^d} ≤ ν_1 < |γ|. If we assume x = γa + n, then

‖ sign(γ) x/‖x‖_{ℓ_2^d} − a ‖_{ℓ_2^d} ≤ 2ν_1 / ‖x‖_{ℓ_2^d}.

We recall that
x̂_j = g′(a · x^j) a + n_j,

and

max_j ‖x̂_j‖_{ℓ_2^d} ≥ max_j |g′(a · x^j)| − max_j ‖x̂_j − x_j‖_{ℓ_2^d} ≥ max_j |g′(a · x^j)| − ν_1,

where max_j |g′(a · x^j)| is the quantity we still need to estimate.
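The lemma can be verified on a random instance in a few lines (the numbers are an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
d, nu1, gamma = 200, 0.1, -0.8                         # ||n|| = nu1 = 0.1 < |gamma|
a = rng.standard_normal(d); a /= np.linalg.norm(a)     # a on S^{d-1}
n = rng.standard_normal(d); n *= nu1 / np.linalg.norm(n)
x = gamma * a + n
lhs = np.linalg.norm(np.sign(gamma) * x / np.linalg.norm(x) - a)
rhs = 2 * nu1 / np.linalg.norm(x)                      # the bound of the lemma
```

In the proof this is applied with x = x̂_{j₀} and γ = g′(a · x^{j₀}): normalizing the largest decoded vector yields a good approximation of a as soon as ν₁ < |γ|.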


Concentration inequalities I

Lemma (Hoeffding's inequality)
Let X_1, …, X_m be independent random variables. Assume that the X_j are almost surely bounded, i.e., there exist finite scalars a_j, b_j such that

P{ X_j − E X_j ∈ [a_j, b_j] } = 1,  for j = 1, …, m.

Then we have

P{ | ∑_{j=1}^m X_j − E ∑_{j=1}^m X_j | ≥ t } ≤ 2 e^{ −2t² / ∑_{j=1}^m (b_j − a_j)² }.

Let us now apply Hoeffding's inequality to the random variables X_j = |g′(a · x^j)|².
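A quick simulation of the bound for sums of uniform variables on [0, 1] (so b_j − a_j = 1; the parameters are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
m, trials, t = 100, 2000, 15.0
S = rng.random((trials, m)).sum(axis=1)       # trials sums of m iid Uniform[0,1]
freq = np.mean(np.abs(S - m / 2) >= t)        # empirical deviation probability
bound = 2 * np.exp(-2 * t**2 / m)             # Hoeffding: sum_j (b_j - a_j)^2 = m
```

The empirical deviation frequency stays below the Hoeffding bound, as it must; in the proof the same mechanism keeps max_j |g′(a · x^j)|² close to its mean α.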


Probabilistic estimates from below

By applying Hoeffding's inequality to the random variables X_j = |g′(a · x^j)|², we have

Lemma
Let us fix 0 < s < 1. Then with probability 1 − 2e^{−2 m_X s² α² / C_2⁴} we have

max_{j=1,…,m_X} |g′(a · x^j)| ≥ √(α(1 − s)),

where α := E_x(|g′(a · x)|²) = ∫_{S^{d−1}} |g′(a · x)|² dμ_{S^{d−1}}(x) = ∫_{S^{d−1}} ‖∇f(x)‖²_{ℓ_2^d} dμ_{S^{d−1}}(x) > 0.


Concentration of measure phenomenon and risk of intractability

Key role is played by

α = ∫_{S^{d−1}} |g′(a · x)|² dμ_{S^{d−1}}(x)

Due to symmetry ... independent of a.

Push-forward measure μ_1 on [−1, 1]:

α = ∫_{−1}^{1} |g′(y)|² dμ_1(y) = ( Γ(d/2) / (π^{1/2} Γ((d−1)/2)) ) ∫_{−1}^{1} |g′(y)|² (1 − y²)^{(d−3)/2} dy

μ_1 concentrates around zero exponentially fast as d → ∞.
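This concentration is easy to see numerically: integrating the push-forward density above (with log-Gamma to avoid overflow for large d; the integration routine and parameters are my own choices) shows the tail mass P(|a · x| > r) collapsing with d:

```python
import numpy as np
from math import lgamma, log, pi

def tail_prob(d, r, n=200001):
    """P(|a . x| > r) for x uniform on S^{d-1}, integrating the push-forward
    density c_d (1 - y^2)^{(d-3)/2}, c_d = Gamma(d/2)/(sqrt(pi) Gamma((d-1)/2))."""
    log_cd = lgamma(d / 2) - 0.5 * log(pi) - lgamma((d - 1) / 2)
    y = np.linspace(r, 1.0 - 1e-9, n)
    dens = np.exp(log_cd + 0.5 * (d - 3) * np.log1p(-y * y))
    return 2.0 * float(np.dot((dens[:-1] + dens[1:]) / 2, np.diff(y)))  # trapezoid rule

p10, p1000 = tail_prob(10, 0.2), tail_prob(1000, 0.2)
```

For r = 0.2 the tail mass is of order one at d = 10 but already negligible at d = 1000, which is why α is driven by the behaviour of g′ near the origin.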


Dependence on the dimension d

Proposition
Let us fix M ∈ N and assume that g : [−1, 1] → R is C^{M+2}-differentiable in an open neighbourhood U of 0 and (d^ℓ/dx^ℓ) g(0) = 0 for ℓ = 1, …, M. Then

α(d) = O(d^{−M}),  for d → ∞.


Tractability classes

(1) For 0 < q ≤ 1, C_1 ≥ 1 and C_2 ≥ α_0 > 0, we define

F^1_d := F^1_d(α_0, q, C_1, C_2) := { f : B_{R^d} → R : ∃ a ∈ R^d, ‖a‖_{ℓ_2^d} = 1, ‖a‖_{ℓ_q^d} ≤ C_1, and ∃ g ∈ C²(B_R), |g′(0)| ≥ α_0 > 0 : f(x) = g(a · x) }.

(2) For a neighborhood U of 0, 0 < q ≤ 1, C_1 ≥ 1, C_2 ≥ α_0 > 0 and N ≥ 2, we define

F^2_d := F^2_d(U, α_0, q, C_1, C_2, N) := { f : B_{R^d} → R : ∃ a ∈ R^d, ‖a‖_{ℓ_2^d} = 1, ‖a‖_{ℓ_q^d} ≤ C_1, and ∃ g ∈ C²(B_R) ∩ C^N(U), ∃ 0 ≤ M ≤ N − 1, |g^{(M)}(0)| ≥ α_0 > 0 : f(x) = g(a · x) }.

(3) For a neighborhood U of 0, 0 < q ≤ 1, C_1 ≥ 1 and C_2 ≥ α_0 > 0, we define

F^3_d := F^3_d(U, α_0, q, C_1, C_2) := { f : B_{R^d} → R : ∃ a ∈ R^d, ‖a‖_{ℓ_2^d} = 1, ‖a‖_{ℓ_q^d} ≤ C_1, and ∃ g ∈ C²(B_R) ∩ C^∞(U), |g^{(M)}(0)| = 0 for all M ∈ N : f(x) = g(a · x) }.


Tractability result

Corollary
The problem of learning functions f in the classes F^1_d and F^2_d from point evaluations is strongly polynomially tractable (no polynomial dependence on d) and polynomially tractable (with polynomial dependence on d), respectively.


Intractability

On the one hand, let us notice that if in the class F^3_d we remove the condition ‖a‖_{ℓ_q^d} ≤ C_1, then the problem actually becomes intractable. Let g ∈ C²([−1 − ε, 1 + ε]) be given by g(y) = 8(y − 1/2)³ for y ∈ [1/2, 1 + ε] and zero otherwise. Notice that, for every a ∈ R^d with ‖a‖_{ℓ_2^d} = 1, the function f(x) = g(a · x) vanishes everywhere on S^{d−1} outside of the cap U(a, 1/2) := {x ∈ S^{d−1} : a · x ≥ 1/2}.

Figure : The function g and the spherical cap U(a, 1/2).


Intractability

The μ_{S^{d−1}} measure of U(a, 1/2) obviously does not depend on a and is known to be exponentially small in d. Furthermore, it is known that there is a constant c > 0 and unit vectors a^1, …, a^K such that the sets U(a^1, 1/2), …, U(a^K, 1/2) are mutually disjoint and K ≥ e^{cd}. Finally, we observe that max_{x∈S^{d−1}} |f(x)| = f(a) = g(1) = 1.

We conclude that any algorithm making use only of the structure f(x) = g(a · x) and the condition ‖a‖_{ℓ_2^d} = 1 needs exponentially many sampling points in order to distinguish between f(x) ≡ 0 and f(x) = g(a^i · x) for some of the a^i's as constructed above.


Truly k-ridge functions for k ≫ 1

f(x) = g(Ax), where A is a k × d matrix.

Rows of A are compressible: max_i ‖a_i‖_q ≤ C_1.

AA^T is the identity operator on R^k.

The regularity condition: sup_{|α|≤2} ‖D^α g‖_∞ ≤ C_2.

The matrix H^f := ∫_{S^{d−1}} ∇f(x) ∇f(x)^T dμ_{S^{d−1}}(x) is a positive semi-definite rank-k matrix.

We assume that the singular values of the matrix H^f satisfy

σ_1(H^f) ≥ · · · ≥ σ_k(H^f) ≥ α > 0.


MD House's differential diagnosis (or simply called "sensitivity analysis")

We rely on numerical approximation of ∂f/∂φ:

∇g(Ax)^T Aφ = ∂f/∂φ(x)  (*)
 = (f(x + εφ) − f(x))/ε − (ε/2)[φ^T ∇²f(ζ) φ],  ε ≤ ε̂.

X = {x^j ∈ Ω : j = 1, …, m_X} drawn uniformly at random in Ω ⊂ R^d.

Φ = {φ^j ∈ R^d, j = 1, …, m_Φ}, where

φ^j_ℓ = 1/√m_Φ with prob. 1/2,  −1/√m_Φ with prob. 1/2,

for every j ∈ {1, …, m_Φ} and every ℓ ∈ {1, …, d}.
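The divided-difference approximation in (*) is straightforward to check on a toy k-ridge function (the profile g and all names below are my own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, eps = 50, 2, 1e-5
A = np.linalg.qr(rng.standard_normal((d, k)))[0].T        # k x d with A A^T = I_k
g = lambda y: np.sin(y[0]) + y[1] ** 2                     # toy profile g
grad_g = lambda y: np.array([np.cos(y[0]), 2.0 * y[1]])    # its gradient
f = lambda x: g(A @ x)

x = rng.standard_normal(d); x /= np.linalg.norm(x)         # x on S^{d-1}
phi = rng.choice([-1.0, 1.0], size=d) / np.sqrt(d)         # Bernoulli direction, ||phi|| = 1
dd = (f(x + eps * phi) - f(x)) / eps                       # divided difference
exact = grad_g(A @ x) @ (A @ phi)                          # grad g(Ax)^T A phi
```

The discrepancy is of order (ε/2)|φ^T ∇²f(ζ) φ|, i.e. O(ε), matching the remainder term above.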


Sensitivity analysis

Figure : We perform randomized sensitivity analysis at random points x ∈ S^{d−1}, sampling f at x and at the perturbed point x + εφ.


Collecting together the differential analysis

Φ ... the m_Φ × d matrix whose rows are the φ^i; X ... the d × m_X matrix

X = ( A^T ∇g(Ax^1) | … | A^T ∇g(Ax^{m_X}) ).

The m_X × m_Φ instances of (*) in matrix notation:

Φ X = Y + E  (**)

Y and E are m_Φ × m_X matrices defined by

y_{ij} = ( f(x^j + εφ^i) − f(x^j) ) / ε,
ε_{ij} = −(ε/2) [ (φ^i)^T ∇²f(ζ_{ij}) φ^i ].


Algorithm 2:

I Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (*).

I Set x̂_j = ∆(y_j) := arg min_{y_j = Φz} ‖z‖_{ℓ_1^d}, for j = 1, …, m_X, so that X̂ = (x̂_1 | … | x̂_{m_X}) is again a d × m_X matrix.

I Compute the singular value decomposition of

X̂^T = ( Û_1 Û_2 ) ( Σ̂_1 0 ; 0 Σ̂_2 ) ( V̂_1^T ; V̂_2^T ),

where Σ̂_1 contains the k largest singular values.

I Set Â = V̂_1^T.

I Define ĝ(y) := f(Â^T y) and f̂(x) := ĝ(Âx).
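The SVD step can be sketched directly: with exact (noiseless) gradients, the columns of X span the row space of A, so the top-k right singular vectors of X^T recover it. A minimal sketch with a toy profile (all names mine):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, m_X = 30, 2, 40
A = np.linalg.qr(rng.standard_normal((d, k)))[0].T       # k x d with A A^T = I_k
grad_g = lambda y: np.array([np.cos(y[0]), 2.0 * y[1]])  # gradient of g(y) = sin(y1) + y2^2

cols = []
for _ in range(m_X):
    x = rng.standard_normal(d); x /= np.linalg.norm(x)
    cols.append(A.T @ grad_g(A @ x))                     # exact gradient A^T grad g(A x)
X = np.column_stack(cols)                                # d x m_X

U, S, Vt = np.linalg.svd(X.T, full_matrices=False)       # SVD step of Algorithm 2
A_hat = Vt[:k]                                           # k x d, spans row space of A
P, P_hat = A.T @ A, A_hat.T @ A_hat                      # projectors onto the two row spans
```

In the noisy case the same step is applied to X̂, and Wedin's bound (below) quantifies how far the computed singular subspace can rotate.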


The control of the error

The quality of the final approximation of f by means of f̂ depends on two kinds of accuracies:

1. The error between X and X̂, which can be controlled through the number of compressed sensing measurements m_Φ;

2. The stability of the span of V̂_1^T, simply characterized by how well the singular values of X (or, equivalently, G) are separated from 0, which is related to the number of random samples m_X.

To be precise, we have


Recovery result

Theorem (F., Schnass, and Vybíral)
Let log d ≤ m_Φ ≤ [log 6]² d. Then there is a constant c′_1 such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 2 defines a function f̂ : B_{R^d}(1 + ε) → R that, with probability

1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} + k e^{−m_X α s² / (2k C_2²)} ),

will satisfy

‖f − f̂‖_∞ ≤ 2C_2 √k (1 + ε) ν_2 / ( √(α(1 − s)) − ν_2 ),

where

ν_2 = C ( k^{1/q} [m_Φ / log(d/m_Φ)]^{1/2−1/q} + ε k² / √m_Φ ),

and C depends only on C_1 and C_2.


Ingredients of the proof

I compressed sensing;

I stability of the SVD;

I concentration inequalities (Chernoff bounds for sums of positive-semidefinite matrices).



Compressed sensing

Corollary (after Wojtaszczyk, 2011)
Let log d ≤ mΦ < [log 6]² d. Then with probability

1 − ( e^{−c′₁ mΦ} + e^{−√(mΦ d)} )

the matrix X̂ as calculated in Algorithm 2 satisfies

‖X − X̂‖_F ≤ C √mX ( k^{1/q} [ mΦ / log(d/mΦ) ]^{1/2 − 1/q} + ε k² √mΦ ),

where C depends only on C₁ and C₂.


Stability of SVD

Given two matrices B and B̂ with corresponding singular value decompositions

B = ( U₁ U₂ ) ( Σ₁ 0 ; 0 Σ₂ ) ( V₁ᵀ ; V₂ᵀ )

and

B̂ = ( Û₁ Û₂ ) ( Σ̂₁ 0 ; 0 Σ̂₂ ) ( V̂₁ᵀ ; V̂₂ᵀ ),

we have:


Wedin's bound

Theorem (Stability of subspaces)
If there is an α > 0 such that

min_{ℓ,ℓ̂} |σ_ℓ̂(Σ̂₁) − σ_ℓ(Σ₂)| ≥ α  and  min_ℓ̂ |σ_ℓ̂(Σ̂₁)| ≥ α,

then

‖V₁V₁ᵀ − V̂₁V̂₁ᵀ‖_F ≤ (2/α) ‖B − B̂‖_F.
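The bound can be checked numerically on a random instance (an illustration of mine, not part of the slides): we take a rank-k matrix B, a small perturbation B̂ = B + E, and compare the Frobenius distance of the two projectors onto the top-k right singular subspaces against (2/α)‖E‖_F with α = σ_k(B̂), which is admissible here since Σ₂ = 0.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 20, 3
B = rng.standard_normal((d, k)) @ rng.standard_normal((k, d))  # exactly rank k
E = 1e-3 * rng.standard_normal((d, d))                         # small perturbation
Bh = B + E

def top_right_projector(M, k):
    # projector V1 V1^T onto the span of the top-k right singular vectors of M
    _, _, Vt = np.linalg.svd(M)
    V1 = Vt[:k].T
    return V1 @ V1.T

P, Ph = top_right_projector(B, k), top_right_projector(Bh, k)
# since Sigma_2(B) = 0, the separation alpha can be taken as sigma_k(Bh)
alpha = np.linalg.svd(Bh, compute_uv=False)[k - 1]
lhs = np.linalg.norm(P - Ph, "fro")
rhs = 2.0 / alpha * np.linalg.norm(B - Bh, "fro")
```

On such generic instances lhs stays well below rhs, which is what makes the perturbation analysis of the next slides work.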


Wedin's bound

Applied to our situation, where X has rank k and thus Σ₂ = 0, we get

‖V₁V₁ᵀ − V̂₁V̂₁ᵀ‖_F ≤ 2 √mX ν₂ / σ_k(X̂ᵀ),

and further, since σ_k(X̂ᵀ) ≥ σ_k(Xᵀ) − ‖X − X̂‖_F, that

‖V₁V₁ᵀ − V̂₁V̂₁ᵀ‖_F ≤ 2 √mX ν₂ / ( σ_k(Xᵀ) − √mX ν₂ ).

Note that

Xᵀ = GA = U_G Σ_G [V_Gᵀ A]

for G = ( ∇g(Ax₁) | … | ∇g(Ax_{mX}) )ᵀ, hence Σ_{Xᵀ} = Σ_G. Moreover,

σ_i(G) = √(σ_i(GᵀG)),  for all i = 1, …, k.



Concentration inequalities II

Theorem (Matrix Chernoff bounds)
Consider X₁, …, X_m independent random, positive-semidefinite matrices of dimension k × k. Moreover, suppose σ₁(X_j) ≤ C almost surely. Compute the singular values of the sum of the expectations,

μ_max = σ₁( Σ_{j=1}^m E X_j )  and  μ_min = σ_k( Σ_{j=1}^m E X_j );

then

P( σ₁( Σ_{j=1}^m X_j ) − μ_max ≥ s μ_max ) ≤ k ( (1 + s)/e )^{−μ_max (1 + s)/C},

for all s ≥ e − 1, and

P( σ_k( Σ_{j=1}^m X_j ) − μ_min ≤ −s μ_min ) ≤ k e^{−μ_min s² / (2C)},

for all s ∈ (0, 1).
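A quick Monte Carlo illustration of the lower tail (with a toy distribution of my own choosing, not the gradient matrices of the slides): rank-one summands X_j = x_j x_jᵀ with x_j uniform on the sphere of radius √k, so that σ₁(X_j) = k =: C almost surely and E X_j = I_k, hence μ_min = m.

```python
import numpy as np

rng = np.random.default_rng(1)

k, m, s = 3, 2000, 0.5
x = rng.standard_normal((m, k))
x *= np.sqrt(k) / np.linalg.norm(x, axis=1, keepdims=True)  # rows on sphere of radius sqrt(k)
S = x.T @ x                       # sum_j x_j x_j^T
mu_min = m                        # sigma_k of the sum of expectations, sum_j E X_j = m I_k
sigma_k = np.linalg.eigvalsh(S)[0]
# the theorem predicts sigma_k >= (1 - s) mu_min except with probability
# at most k * exp(-mu_min s^2 / (2 k)), which is astronomically small here
```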


Note that

GᵀG = Σ_{j=1}^{mX} ∇g(Ax_j) ∇g(Ax_j)ᵀ,

and by applying the previous result to X_j = ∇g(Ax_j)∇g(Ax_j)ᵀ, we have:

Lemma
For any s ∈ (0, 1) we have that

σ_k(Xᵀ) ≥ √( mX α (1 − s) )

with probability 1 − k e^{−mX α s² / (2 k C₂²)}.


Proof of Theorem

With probability at least

1 − ( e^{−c′₁ mΦ} + e^{−√(mΦ d)} + k e^{−mX α s² / (2 k C₂²)} ),

we have

‖V₁V₁ᵀ − V̂₁V̂₁ᵀ‖_F ≤ 2ν₂ / ( √(α(1 − s)) − ν₂ ),

and, for Â = V̂₁ᵀ and V_Gᵀ A = V₁ᵀ,

‖AᵀA − ÂᵀÂ‖_F = ‖Aᵀ V_G V_Gᵀ A − V̂₁V̂₁ᵀ‖_F ≤ 2ν₂ / ( √(α(1 − s)) − ν₂ ).



Proof of Theorem ... continue

Since A is row-orthogonal we have A = AAᵀA, and

|f(x) − f̂(x)| = |g(Ax) − ĝ(Âx)|
  = |g(Ax) − g(AÂᵀÂx)|
  ≤ C₂ √k ‖Ax − AÂᵀÂx‖_{ℓ₂ᵏ}
  = C₂ √k ‖A(AᵀA − ÂᵀÂ)x‖_{ℓ₂ᵏ}
  ≤ C₂ √k ‖AᵀA − ÂᵀÂ‖_F ‖x‖_{ℓ₂ᵈ}
  ≤ 2 C₂ √k (1 + ε) ν₂ / ( √(α(1 − s)) − ν₂ ),

where we used

‖AᵀA − ÂᵀÂ‖_F = ‖Aᵀ V_G V_Gᵀ A − V̂₁V̂₁ᵀ‖_F ≤ 2ν₂ / ( √(α(1 − s)) − ν₂ ).


k-ridge functions may be too simple!

Figure : Functions on data clustered around a manifold with multiple directions can be locally approximated by sums of k-ridge functions


Sums of ridge functions

Can we still learn functions of the type

f(x) = Σ_{i=1}^m g_i(a_i · x),  x ∈ [−1, 1]ᵈ?

Our approach (Daubechies, F., Vybíral) is essentially based on the formula

D_{c₁}^{α₁} ⋯ D_{c_k}^{α_k} f(x) = Σ_{i=1}^m g_i^{(α₁+⋯+α_k)}(a_i · x) (a_i · c₁)^{α₁} ⋯ (a_i · c_k)^{α_k},

where k ∈ ℕ, c_i ∈ ℝᵈ, α_i ∈ ℕ for all i = 1, …, k, and D_{c_i}^{α_i} is the α_i-th derivative in the direction c_i.
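The case k = 1, α₁ = 1 of the formula can be sanity-checked numerically on a toy sum of ridge functions (profiles g_i and directions a_i of my own choosing), comparing a central finite difference of D_c f against Σᵢ g_i′(a_i · x)(a_i · c):

```python
import numpy as np

d = 4
a = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])          # directions a_1, a_2
g  = [np.sin, np.cos]                          # profiles g_1, g_2
dg = [np.cos, lambda t: -np.sin(t)]            # their first derivatives

def f(x):
    return sum(gi(ai @ x) for gi, ai in zip(g, a))

x = np.array([0.3, -0.2, 0.1, 0.5])
c = np.array([0.5, 0.5, 0.0, 0.0])
h = 1e-6
fd = (f(x + h * c) - f(x - h * c)) / (2 * h)   # D_c f(x), central difference
exact = sum(dgi(ai @ x) * (ai @ c) for dgi, ai in zip(dg, a))
```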



The recovery strategy: nearly orthonormal systems

We assume that the vectors a₁, …, a_m ∈ ℝᵐ are nearly orthonormal, meaning that

S(a₁, …, a_m) = inf{ ( Σ_{i=1}^m ‖a_i − w_i‖₂² )^{1/2} : w₁, …, w_m orthonormal basis in ℝᵐ }

is small!

Furthermore, we denote by

L = span{ a_i ⊗ a_i, i = 1, …, m } ⊂ ℝ^{m×m}

the subspace of symmetric matrices generated by the tensor products a_i ⊗ a_i = a_i a_iᵀ.

We first recover an approximation of L, i.e., instead of L we then have at our disposal a subspace L̂ of symmetric matrices, which is (in some sense) close to L. Finally, we propose the program

arg max ‖M‖∞,  s.t.  M ∈ L̂, ‖M‖_F ≤ 1,

to recover the a_i's - or good approximations â_i of them (which is of course possible only up to the sign).
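Collecting the a_i as the rows of a square matrix A, the infimum defining S is an orthogonal Procrustes problem: the closest orthonormal system is W = UVᵀ for A = UΣVᵀ, so S(a₁, …, a_m) = √(Σᵢ (σ_i(A) − 1)²). This reformulation is mine (it follows from the Procrustes solution), and it gives a one-line way to compute S:

```python
import numpy as np

def near_orthonormality(A):
    # S(a_1,...,a_m) for the rows of A, via the singular values of A:
    # min over orthogonal W of ||A - W||_F^2 equals sum_i (sigma_i(A) - 1)^2
    sv = np.linalg.svd(A, compute_uv=False)
    return np.sqrt(np.sum((sv - 1.0) ** 2))

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # exactly orthonormal rows
A_pert = Q + 0.01 * rng.standard_normal((3, 3))    # mildly perturbed system
```

For an exactly orthonormal system S vanishes, while a mild perturbation yields a correspondingly small value.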



Nonlinear programming to recover the ai ⊗ ai ’s

Figure : The ai ⊗ ai are the extremal points of the matrix operator norm!


On the ambiguity of learning for nonorthogonal profiles

Let a₁ = (1, 0)ᵀ, a₂ = (√2/2, √2/2)ᵀ, and b = (a₁ + a₂)/‖a₁ + a₂‖₂. We assume that L = span{a₁a₁ᵀ, a₂a₂ᵀ} and that

L̂ = span{ [ 1  ε ; ε  −ε ], [ 0.5  0.5+ε ; 0.5+ε  0.5−ε ] }.

When choosing ε = 0.05, we find out that

{ dist(a₁a₁ᵀ, L̂), dist(a₂a₂ᵀ, L̂), dist(bbᵀ, L̂) } ⊂ [0.07, 0.08].

Hence, looking at L̂ alone, every algorithm will have difficulties deciding which two of the three rank-1 matrices above are the generators of the true L. Nevertheless, ‖b − a₁‖₂ = ‖b − a₂‖₂ > 0.39. We see that although the level of noise was rather mild, we have difficulties distinguishing between well-separated vectors.
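The distances in this example are easy to reproduce: working in the symmetric 2×2 matrices with the Frobenius inner product, the distance to L̂ is the norm of the least-squares residual after projecting onto span{M₁, M₂}.

```python
import numpy as np

eps = 0.05
M1 = np.array([[1.0, eps], [eps, -eps]])
M2 = np.array([[0.5, 0.5 + eps], [0.5 + eps, 0.5 - eps]])

def dist_to_span(T, basis):
    # Frobenius distance of T to span(basis); raveling matrices turns the
    # Frobenius inner product into the ordinary dot product
    B = np.stack([M.ravel() for M in basis])
    t = T.ravel()
    coeffs, *_ = np.linalg.lstsq(B.T, t, rcond=None)
    return np.linalg.norm(t - B.T @ coeffs)

a1 = np.array([1.0, 0.0])
a2 = np.array([np.sqrt(2) / 2, np.sqrt(2) / 2])
b = (a1 + a2) / np.linalg.norm(a1 + a2)
dists = [dist_to_span(np.outer(v, v), [M1, M2]) for v in (a1, a2, b)]
```

All three distances indeed land in [0.07, 0.08], while b stays far from both a₁ and a₂.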



The approximation to L

Define

L̂ = span{ Δf(x_j), j = 1, …, mX },

where

(Δf(x))_{j,k} = [ f(x + ε(e_j + e_k)) − f(x + ε e_j) − f(x + ε e_k) + f(x) ] / ε²,  j, k = 1, …, m,

is an approximation to the Hessian of f at x. For x drawn at random, and by applying the matrix Chernoff bounds in a suitable way, one derives a probabilistic error estimate, in the sense that

‖P_L − P_L̂‖_{F→F} ≤ C m^{3/2} ε,

with high probability.
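The finite-difference matrix Δf(x) can be checked against the exact Hessian Σᵢ g_i″(a_i · x) a_i a_iᵀ of a sum of ridge functions, here on a toy f of my own choosing:

```python
import numpy as np

def hessian_fd(f, x, eps):
    # second-order finite differences (Delta f(x))_{j,k} as defined above
    d = len(x)
    E = np.eye(d) * eps
    H = np.empty((d, d))
    for j in range(d):
        for k in range(d):
            H[j, k] = (f(x + E[j] + E[k]) - f(x + E[j])
                       - f(x + E[k]) + f(x)) / eps ** 2
    return H

a = np.array([[0.8, 0.6], [-0.6, 0.8]])        # two ridge directions
f = lambda x: np.sin(a[0] @ x) + np.cos(a[1] @ x)
x = np.array([0.2, -0.1])
# exact Hessian: g_1'' = -sin, g_2'' = -cos
H_exact = (-np.sin(a[0] @ x) * np.outer(a[0], a[0])
           - np.cos(a[1] @ x) * np.outer(a[1], a[1]))
H_fd = hessian_fd(f, x, 1e-4)
err = np.max(np.abs(H_fd - H_exact))
```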


A nonlinear operator towards a gradient ascent

Let us first introduce, for a given parameter γ > 1, an operator acting on the singular values of a matrix X = UΣVᵀ as follows:

Π_γ(X) = U [ diag(γ, 1, …, 1) Σ / ‖diag(γ, 1, …, 1) Σ‖_F ] Vᵀ,

where

diag(γ, 1, …, 1) Σ = diag(γσ₁, σ₂, …, σ_m).

Notice that Π_γ maps any matrix X onto a matrix of unit Frobenius norm, simply exalting the first singular value and damping the others. It is not a linear operator.
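A direct implementation of Π_γ via the SVD (a sketch following the definition above, for square matrices):

```python
import numpy as np

def Pi(X, gamma):
    # multiply the first singular value by gamma, then renormalize
    # the whole spectrum to unit Frobenius norm
    U, s, Vt = np.linalg.svd(X)
    s[0] *= gamma
    return U @ np.diag(s / np.linalg.norm(s)) @ Vt

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 4))
Y = Pi(X, gamma=2.0)
sx = np.linalg.svd(X, compute_uv=False)
sy = np.linalg.svd(Y, compute_uv=False)
```

As expected, Y has unit Frobenius norm and a strictly larger ratio σ₁/σ₂ than X: the first singular value is exalted, the others damped.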


The nonlinear programming

We propose a projected gradient method for solving

arg max ‖M‖∞,  s.t.  M ∈ L̂, ‖M‖_F ≤ 1.

Algorithm 3:

I Fix a suitable parameter γ > 1;

I Assume that a basis of L̂ of positive-semidefinite matrices has been identified; for instance, one can use the second order finite differences Δf(x_j), j = 1, …, mX, to form such a basis;

I Generate an initial guess X⁰ = Σ_{j=1}^{mX} ζ_j Δf(x_j) by choosing ζ_j ≥ 0 at random, so that X⁰ ∈ L̂ and ‖X⁰‖_F = 1;

I For ℓ ≥ 0:

  X^{ℓ+1} := P_L̂ Π_γ(X^ℓ).
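A toy run of the iteration in the simplest setting, a_i = e_i (so L̂ = L and the projection P_L̂ just keeps the diagonal - this explicit projection is my simplification of the general case): starting from a random positive combination of the e_i e_iᵀ, the iterates converge to one of the extremal points e_j e_jᵀ.

```python
import numpy as np

def Pi(X, gamma):
    U, s, Vt = np.linalg.svd(X)
    s[0] *= gamma
    return U @ np.diag(s / np.linalg.norm(s)) @ Vt

def P_L(X):
    # projection onto L = span{e_i e_i^T}: keep the diagonal
    return np.diag(np.diag(X))

rng = np.random.default_rng(4)
m, gamma = 5, 2.0
X = np.diag(rng.uniform(0.5, 1.0, m))   # X^0 in L, positive coefficients
X /= np.linalg.norm(X, "fro")
for _ in range(60):
    X = P_L(Pi(X, gamma))
spec = np.linalg.norm(X, 2)             # spectral norm of the final iterate
```

After a few dozen iterations the iterate has unit Frobenius norm and spectral norm essentially 1, i.e. it is (numerically) some rank-one e_j e_jᵀ.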


Analysis of the algorithm for L̂ = L

Proposition (Daubechies, F., and Vybíral)
Assume that L̂ = L and that a₁, …, a_m are orthonormal. Let γ > √2 and let ‖X⁰‖∞ > 1/√(γ² − 1). Then there exists μ₀ < 1 such that

|1 − ‖X^{ℓ+1}‖∞| ≤ μ₀ |1 − ‖X^ℓ‖∞|,  for all ℓ ≥ 0.

The sequence (X^ℓ)_ℓ being made of matrices with Frobenius norm bounded by 1, we conclude that any of its accumulation points has both unit Frobenius and unit spectral norm, and therefore has to coincide with one of the maximizers.

The proof is based on the following observation:

‖X^{ℓ+1}‖∞ = σ₁(X^{ℓ+1}) = γ σ₁(X^ℓ) / √( γ² σ₁(X^ℓ)² + σ₂(X^ℓ)² + ⋯ + σ_m(X^ℓ)² ) ≥ γ ‖X^ℓ‖∞ / √( (γ² − 1) ‖X^ℓ‖∞² + 1 ).



Analysis of the algorithm for L̂ ≈ L

Theorem (Daubechies, F., and Vybíral)
Assume that ‖P_L − P_L̂‖_{F→F} < ε < 1 and that a₁, …, a_m are orthonormal. Let ‖X⁰‖∞ > max{ 1/√(γ² − 1), 1/√2 + ε + ξ } and γ > √2. Then, for the iterations (X^ℓ)_ℓ produced by Algorithm 3, there exists μ₀ < 1 such that

lim sup_ℓ |1 − ‖X^ℓ‖∞| ≤ ( μ₁(γ, ξ, ε) + 2ε ) / (1 − μ₀) + ε,

where μ₁(γ, ξ, ε) ≈ ε. The sequence (X^ℓ)_ℓ is bounded, and its accumulation points X̄ satisfy simultaneously the following properties:

‖X̄‖_F ≤ 1  and  ‖X̄‖∞ ≥ 1 − ( μ₁(γ, ξ, ε) + 2ε ) / (1 − μ₀) − ε,

and

‖P_L X̄‖_F ≤ 1  and  ‖P_L X̄‖∞ ≥ 1 − ( μ₁(γ, ξ, ε) + 2ε ) / (1 − μ₀).


A graphical explanation of the algorithm

Figure : Objective function ‖ · ‖∞ to be maximized and iterations of Algorithm 3 converging to one of the extremal points a_i ⊗ a_i


Nonlinear programming

Theorem (Daubechies, F., and Vybíral)
Let M = Σ_{j=1}^m λ_j u_j ⊗ u_j be any local maximizer of

arg max ‖M‖∞,  s.t.  M ∈ L̂, ‖M‖_F ≤ 1.

Then

u_jᵀ X u_j = 0  for all X ∈ S_L̂ with X ⊥ M

and all j ∈ {1, …, m} with |λ_j| = ‖M‖∞.

If furthermore the a_i's are nearly orthonormal, S(a₁, …, a_m) ≤ ε, and

3 · m · ‖P_L − P_L̂‖ < (1 − ε)²,

then λ₁ = ‖M‖∞ > max{ |λ₂|, …, |λ_m| } and

2 Σ_{k=2}^m (u₁ᵀ X u_k)² / (λ₁ − λ_k) ≤ λ₁.


Nonlinear programming

Algorithm 4:

I Let M be a local maximizer of the nonlinear program;

I Take its spectral decomposition M = Σ_{j=1}^m λ_j u_j ⊗ u_j;

I Put â := u₁.

Theorem (Daubechies, F., and Vybíral)
Let L̂ = L and S(a₁, …, a_m) ≤ ε. Then there is j₀ ∈ {1, …, m} such that the vector â found by Algorithm 4 satisfies ‖â − a_{j₀}‖₂ ≤ C √ε.

The proof is based on testing the optimality conditions for X = X_j = a_j ⊗ a_j and showing that λ₁(M) ≈ 1.



Learning sums of ridge functions

Algorithm 5:

I Let â_j be normalized approximations of a_j, j = 1, …, m;

I Let (b_j)_{j=1}^m be the dual basis to (â_j)_{j=1}^m;

I Assume that f(0) = g₁(0) = ⋯ = g_m(0);

I Put ĝ_j(t) := f(t b_j), t ∈ (−1/‖b_j‖₂, 1/‖b_j‖₂);

I Put f̂(x) := Σ_{j=1}^m ĝ_j(â_j · x), ‖x‖₂ ≤ 1.

Theorem (Daubechies, F., and Vybíral)
Let

I S(a₁, …, a_m) ≤ ε and S(â₁, …, â_m) ≤ ε′;

I ‖â_j − a_j‖₂ ≤ η, j = 1, …, m.

Then

‖f − f̂‖∞ ≤ c(ε, ε′) m η.
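Algorithm 5 can be illustrated in the idealized case where the directions are recovered exactly (â_j = a_j) and the profiles vanish at 0, a toy instance of mine: probing f along the dual basis isolates each profile, since a_i · b_j = δ_ij kills every other ridge term, and the reconstruction is then exact.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [np.sqrt(2) / 2, np.sqrt(2) / 2]])   # rows a_1, a_2
g = [np.sin, lambda t: t ** 2]                      # profiles with g_i(0) = 0
f = lambda x: g[0](A[0] @ x) + g[1](A[1] @ x)

B = np.linalg.inv(A)            # columns b_j of the dual basis: a_i . b_j = delta_ij
# probe along the dual basis: f(t b_j) = g_j(t) since the other ridges sit at 0
g_hat = [lambda t, j=j: f(t * B[:, j]) for j in range(2)]
f_hat = lambda x: sum(g_hat[j](A[j] @ x) for j in range(2))

x = np.array([0.3, -0.4])
```

With exact directions and g_i(0) = 0 the recovered f̂ coincides with f up to floating-point error; the theorem above quantifies how this degrades under the perturbations ε, ε′, η.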


Our literature

I I. Daubechies, M. Fornasier, and J. Vybíral, Approximation of sums of ridge functions, in preparation.

I M. Fornasier, K. Schnass, and J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, Foundations of Computational Mathematics, Vol. 12, No. 2, 2012, pp. 229-262.

I K. Schnass and J. Vybíral, Compressed learning of high-dimensional sparse functions, ICASSP 2011.

I A. Kolleck and J. Vybíral, On some aspects of approximation of ridge functions, J. Approx. Theory 194 (2015), 35-61.

I S. Mayer, T. Ullrich, and J. Vybíral, Entropy and sampling numbers of classes of ridge functions, to appear in Constructive Approximation.

