

Exploiting sparsity in high-dimensional statistical inference

CIMPA-UNESCO-MESR-MICINN Research School 2012 : New trends in Mathematical Statistics

Punta del Este, URUGUAY

Arnak S. Dalalyan, ENSAE / CREST / GENES

Paris, FRANCE


Outline of my lectures

Lecture I : Introduction to sparsity

• Examples and motivation
• Orthogonal covariates : hard and soft thresholding
• Risk bounds
• Stein's lemma and SureShrink

Lecture II : Non-orthogonal design : Lasso and Dantzig selector

• Compressed sensing
• Risk bounds under restricted isometry (RI)
• Risk bounds under the restricted eigenvalues (RE) property

Lecture III : Exponential weights and sparsity favoring priors

• Bayesian approach
• Sharp Oracle Inequalities for the EW-aggregate
• Risk bounds for arbitrary design

Lecture IV : Adaptation to the noise magnitude : scaled Lasso, square root Lasso and SDS.


Lecture I. Introduction to sparsity


Multiple linear regression

Throughout these lectures, we will be dealing with the following model only : we observe a vector Y ∈ Rn and a matrix X ∈ Rn×p such that

Y = Xβ∗ + σ∗ξ, (1)

where
• β∗ ∈ Rp is the unknown regression vector,
• ξ ∈ Rn is a random vector referred to as noise; we will assume it is Gaussian Nn(0, In),
• σ∗ is the noise level, which may be known or not depending on the application.

Sparsity scenario : p is not small compared to n (they are comparable or p is larger than n), but we know that only a small number of the βj's is significantly different from 0.


Example of sparsity : wavelet transform of an image

[Figure : wavelet transform in 2D (WT in 2D) of an image; the coefficients, plotted against their index (0 to 15 000), are nearly all close to zero.]


Example of sparsity : robust estimation

[Figure : scatter plot of y against x on [0, 1] (robust estimation example).]


Example of sparsity : background subtraction


Orthonormal dictionaries

We first consider the case of orthonormal dictionaries. That is, n = p and the matrix X satisfies

XX⊤ = X⊤X = n In.

• Typical examples are the Fourier or wavelet transform, the basis provided by PCA, ...

• Equation (1) can be rewritten as

Y = Xβ∗ + σ∗ξ ⇐⇒ Z = β∗ + (σ∗/√n) ε, (2)

where
• Z = (1/n) X⊤Y is the transformed response,
• ε ∈ Rn is a random vector drawn from the Gaussian Nn(0, In) distribution.

• The right-hand side of (2) is often referred to as the Gaussian sequence model.


The oracle and its risk : Definitions

• Thus, we assume that we observe Z = (Z1, . . . , Zn)⊤ such that

Zi = β∗i + (σ∗/√n) εi,  where the εi are iid N(0, 1).

• Furthermore, we believe that the true vector β∗ is sparse, that is,

‖β∗‖0 := Σ_{j=1}^n 1l{β∗j ≠ 0}  is much smaller than n.

• If we knew exactly which coordinates of β∗ are nonzero, that is, if we knew the sparsity pattern J∗ = {j : β∗j ≠ 0}, we would estimate β∗ by βᵒ defined by

βᵒj = Zj 1l{j ∈ J∗}.

The vector βᵒ will be called the oracle.

• One easily checks that its risk is given by

R[βᵒ, β∗] = E[‖βᵒ − β∗‖₂²] = Σ_{j∈J∗} E[(Zj − β∗j)²] = s σ∗²/n,

where s = |J∗| = ‖β∗‖0.
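A quick way to see this formula in action is a small Monte Carlo simulation; the sketch below (hypothetical sizes, σ∗ = 1) compares the empirical risk of βᵒ to s σ∗²/n.

```python
import numpy as np

# Monte Carlo check of the oracle risk s * sigma^2 / n in the Gaussian
# sequence model Z_j = beta*_j + (sigma / sqrt(n)) * eps_j.
rng = np.random.default_rng(0)
n, s, sigma = 1000, 20, 1.0
beta = np.zeros(n)
beta[:s] = 3.0                       # s nonzero coordinates
J = beta != 0                        # the oracle knows the sparsity pattern J*

risks = []
for _ in range(500):
    Z = beta + sigma / np.sqrt(n) * rng.standard_normal(n)
    beta_oracle = np.where(J, Z, 0.0)          # beta^o_j = Z_j * 1{j in J*}
    risks.append(np.sum((beta_oracle - beta) ** 2))

print(np.mean(risks), s * sigma**2 / n)        # both values are close to 0.02
```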


The oracle and its risk : Discussion

The risk of the oracle is given by

R[βᵒ, β∗] = s σ∗²/n.

• If we use the “naive” least-squares estimator β̂LS = Z, which coincides here with the maximum likelihood estimator, we get the risk

R[β̂LS, β∗] = σ∗²,

which is much worse than that of the oracle when s ≪ n.

• Is there an estimator of β∗ with a risk comparable to that of the oracle but which does not rely on the knowledge of J∗ ?

• The answer is “Yes” (up to logarithmic factors).


Sparsity aware estimation : Hard and Soft Thresholding

Two observations :

• If for a given j it is very likely that β∗j = 0, then the estimator β̂j = 0 is preferable to the ML-estimator β̂j = Zj.

• If β∗j = 0, then the corresponding observation Zj is small.

Donoho and Johnstone proposed (in the early 1990s) to estimate β∗ by

Hard thresholding : for a given threshold t > 0, set

β̂HTj = Zj 1l{|Zj| > tσ∗/√n},  j = 1, . . . , n.

Soft thresholding : for a given threshold t > 0, set

β̂STj = (Zj − (tσ∗/√n) sign(Zj)) 1l{|Zj| > tσ∗/√n},  j = 1, . . . , n.
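Both rules are one-line operations; a minimal Python sketch of the two estimators as defined above (the function names are illustrative):

```python
import numpy as np

def hard_threshold(Z, t, sigma, n):
    # Hard thresholding: keep Z_j if |Z_j| > t*sigma/sqrt(n), else set it to 0.
    thr = t * sigma / np.sqrt(n)
    return np.where(np.abs(Z) > thr, Z, 0.0)

def soft_threshold(Z, t, sigma, n):
    # Soft thresholding: shrink Z_j towards 0 by t*sigma/sqrt(n), zero below the threshold.
    thr = t * sigma / np.sqrt(n)
    return np.sign(Z) * np.maximum(np.abs(Z) - thr, 0.0)
```

The soft-thresholding code uses the familiar shrinkage form sign(Zj)(|Zj| − tσ∗/√n)₊, which is identical to the definition on this slide.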


Risk bound for Hard Thresholding

Let us consider the slightly more general setting Z = β∗ + (σ∗/√n) ε with an s-sparse signal β∗ ∈ Rp and a Gaussian Np(0, Ip) vector ε (now p is not necessarily equal to n).

Theorem 1. If β̂ is the hard thresholding estimator

β̂j = Zj 1l{|Zj| > tσ∗/√n},  j = 1, . . . , p,

then, for every t ≥ √2, it holds that

R[β̂, β∗] ≤ (2tp e^{−t²/2} + 3t²s) σ∗²/n.

Proof. On the whiteboard.

Remark. A similar bound holds true for the soft-thresholding estimator.


Risk bound for Hard Thresholding : Consequences

• Assuming s ≥ 1 and p ≥ 3, we can choose t = √(2 log p), which leads to

R[β̂, β∗] ≤ 7.5 log(p) s σ∗²/n.

• Up to the factor 7.5 log(p) this bound coincides with that of the oracle.

• The choice t = √(2 log p) is commonly known as the universal choice of the threshold. It can be proved that the factor log(p) is an unavoidable price to pay for not knowing the locations of the nonzero entries of β∗.

• The universal choice of t is of the correct order of magnitude, but it often turns out to be too large in applications. Smaller values of t, chosen in a data-dependent manner, are often preferable.


Stein’s lemma

Let Z = β∗ + (σ∗/√n) ε with ε ∼ Np(0, Ip).

Stein's lemma. Let Tn(Z) = (Tn,1(Z), . . . , Tn,p(Z))⊤ be an estimator of β∗ such that the mapping z ↦ Tn(z) is continuous and weakly differentiable. Then, the quantity

r̂n(Z) = ‖Z − Tn(Z)‖₂² + (2σ∗²/n) Σ_{i=1}^p ∂zi Tn,i(Z) − σ∗²p/n

is an unbiased estimator of the risk R[Tn, β∗].

Proof. On the whiteboard.


SURE-Shrink

• In view of this result, one easily checks that

SURE(t, Z) = Σ_{i=1}^p (Zi² ∧ t²σ∗²/n) + (2σ∗²/n) Σ_{i=1}^p 1l{|Zi| > tσ∗/√n} − σ∗²p/n

is an unbiased estimator of the risk of the soft-thresholding estimator.

• Therefore, Donoho and Johnstone proposed to choose the threshold t as the minimizer of the (random) function t ↦ SURE(t, Z).
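A minimal sketch of this SURE-based choice (the grid of candidate thresholds is a hypothetical choice; the formula is the one displayed above):

```python
import numpy as np

def sure(t, Z, sigma, n):
    # Stein unbiased risk estimate for the soft-thresholding estimator.
    thr = t * sigma / np.sqrt(n)
    p = len(Z)
    return (np.sum(np.minimum(Z**2, thr**2))
            + 2 * sigma**2 / n * np.sum(np.abs(Z) > thr)
            - sigma**2 * p / n)

def sure_threshold(Z, sigma, n, n_grid=200):
    # Minimize t -> SURE(t, Z) over a grid between 0 and the universal threshold.
    grid = np.linspace(0.0, np.sqrt(2 * np.log(len(Z))), n_grid)
    values = [sure(t, Z, sigma, n) for t in grid]
    return grid[int(np.argmin(values))]
```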


SURE-Shrink

FIGURE : Illustration of the choice of the threshold using SURE. Left : the function t ↦ SURE(t). Right : the true risk as a function of t.


SURE-Shrink vs universal choice

FIGURE : Root MSEs for simulated data at varying levels of sparsity when the threshold is chosen (a) by SURE and (b) by the universal choice.


Conclusion of the first lecture

• The hypothesis of sparsity of the unknown object appears naturally in various applications.

• In the case of an orthogonal dictionary, the soft and hard thresholding estimators are well suited to efficiently recover sparse structures.

• These estimators have a risk almost of the same order as the one of the oracle knowing the locations of the nonzero entries : the price for not knowing these positions is only logarithmic in p.

• What happens if the dictionary X is not orthonormal ?


Lecture II. Nonorthogonal dictionaries : Lasso and Dantzig selector


Summaries of yesterday’s and today’s talks

Yesterday : Introduction to sparsity

• Examples and motivation
• Orthogonal covariates : hard and soft thresholding
• Risk bounds
• Stein's lemma and SureShrink

Today : Non-orthogonal design : Lasso and Dantzig selector

• Compressed sensing
• Risk bounds under restricted isometry (RI)
• Risk bounds under the restricted eigenvalues (RE) property


How to get the slides

The webpage is at : http://arnak-dalalyan.fr


Examples of nonorthogonal dictionaries

• An orthogonal dictionary is (unfortunately) the exception rather than the rule.

• Examples of nonorthogonal dictionaries include :

  • the case when X is composed of a union of two or more orthonormal bases,
  • robust estimation (cf. previous lecture),
  • nonparametric regression with irregular design,
  • compressed sensing,
  • . . .


Compressed sensing : What is it about ?

• Assume you need to acquire a large vector µ∗ ∈ Rp.

• Assume that µ∗ ∈ Rp admits a sparse representation, i.e., for a given orthogonal matrix W we have µ∗ = Wβ∗ with a sparse vector β∗.

• We are able to measure (up to a measurement error) any linear combination of the entries of µ∗. Thus, we can acquire the vector Aµ∗ + noise for any n × p sensing matrix A.

• One option is to acquire (noisy versions of) all the entries of µ∗ and then to apply the SureShrink procedure.

• Bad choice : this requires too many measurements.

• Questions :
  • What is the “minimal” number of measurements sufficient for a good reconstruction of µ∗ ?
  • How should the sensing matrix A be chosen ?

• Compressed sensing is the theory studying these questions.


Compressed sensing : Some remarks

• The minimal number n is obviously ≤ p, and it is indeed < p if β∗ is sparse.

• If n < p, there is no way for the matrix X = AW to be orthogonal (why ?).

• If n < p, the linear system Y = Xβ is under-determined and, therefore, the set of its solutions is infinite (continuum cardinality). So, to apply thresholding, one has first to find an estimator to be thresholded.

• The orthogonality of X is equivalent to

((1/n) X⊤X)^{1/2} β = β  for all β ∈ Rp.

But why do we care about all β ?

• We are only interested in a particular type of vectors β : sparse vectors !


Restricted Isometry Property (RIP)

Definition. Let X be an n × p matrix and let s ∈ {1, . . . , p}. We say that X satisfies the restricted isometry property RIP(s) with constant δ if

(1 − δ)‖u‖₂² ≤ (1/n)‖Xu‖₂² ≤ (1 + δ)‖u‖₂²

for every vector u ∈ Rp with ‖u‖0 ≤ s. We denote by δs(X) the smallest δ such that X satisfies RIP(s) with constant δ.

Remarks

1. Any orthogonal matrix satisfies RIP(s) with the constant δ = 0 for any s ∈ {1, . . . , p}.

2. If X = AW with an orthogonal matrix W, then X satisfies RIP(s, δ) if and only if √n A satisfies RIP(s, δ). (Exercise)
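For toy dimensions, δs(X) can be computed exactly by scanning all supports of size s; a hedged sketch (the sizes are deliberately tiny, since the number of supports grows combinatorially):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p, s = 30, 40, 2
X = rng.standard_normal((n, p))          # random matrix with iid N(0, 1) entries

delta_s = 0.0
for S in combinations(range(p), s):
    G = X[:, S].T @ X[:, S] / n          # Gram matrix (1/n) X_S^T X_S of the submatrix
    eigs = np.linalg.eigvalsh(G)
    # smallest delta covering this support: max(1 - lambda_min, lambda_max - 1)
    delta_s = max(delta_s, 1 - eigs[0], eigs[-1] - 1)

print(delta_s)                            # = delta_s(X) for this tiny example
```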


Lasso and Dantzig selector

From now on, we assume that the columns of X are normalized in such a way that

(1/n) Σ_{i=1}^n x_ij² = 1,  ∀ j = 1, . . . , p.

This does not cause any loss of generality (why ?).

We fix a λ > 0 and define the two estimators

β̂Lasso ∈ arg min_{β∈Rp} ( (1/(2σ∗²)) ‖Y − Xβ‖₂² + (λ/σ∗) ‖β‖₁ ),

β̂DS ∈ arg min_{β∈Rp : ‖X⊤(Y−Xβ)‖∞ ≤ λσ∗} ‖β‖₁.

Remark. Both the Lasso and the Dantzig selector can be efficiently computed even for very large datasets using convex programming. In fact, it is a second-order cone program for the Lasso and a linear program for the Dantzig selector.
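To illustrate the remark, here is a hedged sketch that solves both programs exactly as written above with the generic convex solver cvxpy (hypothetical data and λ; not the implementation behind the lecture's experiments):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, s, sigma = 100, 200, 5, 1.0
X = rng.standard_normal((n, p))
X /= np.sqrt((X**2).mean(axis=0))            # normalize: (1/n) sum_i x_ij^2 = 1
beta_star = np.zeros(p); beta_star[:s] = 2.0
Y = X @ beta_star + sigma * rng.standard_normal(n)
lam = np.sqrt(2 * n * np.log(p))             # the order of magnitude used in the theory

# Lasso
b = cp.Variable(p)
cp.Problem(cp.Minimize(cp.sum_squares(Y - X @ b) / (2 * sigma**2)
                       + lam / sigma * cp.norm1(b))).solve()
beta_lasso = b.value

# Dantzig selector
b2 = cp.Variable(p)
cp.Problem(cp.Minimize(cp.norm1(b2)),
           [cp.norm(X.T @ (Y - X @ b2), 'inf') <= lam * sigma]).solve()
beta_ds = b2.value
```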


Theoretical guarantees

Exercise. If X is orthogonal, then β̂Lasso = β̂DS = β̂ST.

Theorem 2 (Candès & Tao, AoS 2007). Let α ∈ (0, 1) be a tolerance level and let β∗ be s-sparse. If X satisfies the RIP(2s) with a constant δ < √2 − 1, then the Dantzig selector based on λ = √(2n log(p/α)) satisfies the inequality

P( ‖β̂DS − β∗‖₂ ≤ C σ∗ √(s log(2p/α)/n) ) ≥ 1 − α,

with C = 8(√2 − 1)/(√2 − 1 − δ).

This result implies that the “rate of convergence” to zero of the estimation error is of the order

σ∗ √(s log(2p/α)/n),

which coincides with the one we obtained for hard thresholding in the case of orthogonal dictionaries.


Proof of Theorem 2 : Some auxiliary results

Lemma. Assume that Y = Xβ∗ + σ∗ξ with ‖Xj‖₂² = n for every j.

1. If ξ ∼ Nn(0, In), then the probability of the event

B = { ‖X⊤(Y − Xβ∗)‖∞ ≤ σ∗ √(2n log(2p/α)) }

is larger than or equal to 1 − α.

2. If u, v ∈ Rp are s-sparse and u ⊥ v, then (1/n)|⟨Xu, Xv⟩| ≤ δ2s ‖u‖₂‖v‖₂.

3. If h = β̂DS − β∗, then ‖h_{Jᶜ}‖₁ ≤ ‖h_J‖₁ on the event B.

4. Let J = J0 = {j : β∗j ≠ 0} and let Jk be recursively defined as the set of indices j ∉ J ∪ J1 ∪ . . . ∪ Jk−1 of the s largest entries of the vector (|h1|, . . . , |hp|). Then, on the event B,

‖h_{(J∪J1)ᶜ}‖₂ ≤ Σ_{k≥2} ‖h_{Jk}‖₂ ≤ (1/√s) ‖h_{Jᶜ}‖₁ ≤ ‖h_J‖₂.

Proof on the whiteboard.


Main consequences

• Theorem 2 is of interest in the noiseless setting as well, that is when σ∗ = 0. It tells us that if X satisfies the RIP, then we are able to perfectly reconstruct β∗ from a small number of measurements.

• If we are looking for an estimate which is γ-accurate, then it is sufficient to take

n = C σ∗² s log(2p/α) / γ².

This number is typically much smaller than the dimensionality p !

• If the ratio s/p is not too close to 1, then matrices satisfying the RIP(s) exist. The most famous example is a random matrix with i.i.d. N(0, 1) entries.

• So, this answers the two questions raised a few slides earlier.
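The noiseless claim is easy to check numerically: with a random Gaussian sensing matrix and n well below p, ℓ1-minimization recovers an s-sparse vector exactly. A hedged sketch with toy sizes and the generic solver cvxpy:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, s = 60, 200, 5
X = rng.standard_normal((n, p)) / np.sqrt(n)        # random Gaussian sensing matrix
beta_star = np.zeros(p)
beta_star[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
Y = X @ beta_star                                   # sigma* = 0 : noiseless measurements

b = cp.Variable(p)
cp.Problem(cp.Minimize(cp.norm1(b)), [X @ b == Y]).solve()
print(np.max(np.abs(b.value - beta_star)))          # essentially 0 : exact recovery
```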


Lecture III. Nonorthogonal dictionaries : Lasso, Dantzig selector and the Bayesian approach


Curse of dimensionality and sparsity

High dimensionality has significantly challenged traditional statistical theory.

• In linear regression, the accuracy of the LSE is of order p/n, where p is the number of covariates.

• In many applications, p is much larger than n.

The sparsity assumption provides a compelling theoretical framework for dealing with high dimension.

• Even if the number of parameters describing the model in a general setup is large, only a few of them participate in the process of data generation.

• No a priori information on the set of relevant parameters is available.

• It is only known (or postulated) that the cardinality of the set of relevant parameters is small.


Risk bound for the DS : Slow rates

The previous result is of the form “all or nothing”. If the RIP is not satisfied, then we do not know anything about the risk of the DS.

• If we do not want to impose any assumption on X, then the consistent estimation of β∗ is hopeless. Indeed, in the under-determined case, which is of interest for us, β∗ is not identifiable.

• But we can be more optimistic if the goal is to estimate the vector Xβ∗ only (as in Yannick Baraud's talk yesterday).


Slow rates for the DS

Theorem 3. If λ = √(2n log(2p/α)), then with probability at least 1 − α it holds that

(1/n)‖X(β̂DS − β∗)‖₂² ≤ 4σ∗ √(2 log(2p/α)/n) ( ‖β∗‖₁ ∧ ‖(β̂DS − β∗)_J‖₁ ).

Remark. A similar result holds true for the Lasso estimator as well.

Consequence : If the Euclidean norm of β∗ is finite (not too large) and s log(p)/n is small, then the Dantzig selector is consistent w.r.t. the prediction loss, and the latter is O(√(s log(p)/n)).


Restricted eigenvalues property

The risk of the Lasso estimator will be analyzed under a different condition on X.

Definition. Let X be an n × p matrix and let s ∈ {1, . . . , p}. For a positive constant c0, we say that X satisfies the restricted eigenvalues property RE(s, c0) (respectively R̃E(s, c0)) if

κ(s, c0)² := min_{J⊂{1,...,p}, |J|≤s}  min_{u≠0 : ‖u_{Jᶜ}‖₁ ≤ c0‖u_J‖₁}  ‖Xu‖₂² / (n‖u_J‖₂²)  > 0

(respectively,

κ̃(s, c0)² := min_{J⊂{1,...,p}, |J|≤s}  min_{u≠0 : ‖u_{Jᶜ}‖₁ ≤ c0‖u_J‖₁}  ‖Xu‖₂² / (n‖u‖₂²)  > 0).


Restricted eigenvalues property

Remarks

1. Any orthogonal matrix satisfies RE(s, c0) with κ(s, c0) = 1.

2. According to claim 3 of the last lemma, ‖(β̂DS − β∗)_{Jᶜ}‖₁ ≤ ‖(β̂DS − β∗)_J‖₁ on the event B and, therefore, if RE(s, 1) holds true, then

(1/n)‖X(β̂DS − β∗)‖₂² ≥ κ²(s, 1) ‖(β̂DS − β∗)_J‖₂².

3. The RE property looks simpler than the RIP, but they are not comparable (neither implies the other).
  • The advantage of the RE is that there is no upper bound.
  • The weakness of the RE is that the minimum is taken over a larger set than the one in the RIP.


Risk bound for the Lasso : Fast rates

Theorem 4. Let p ≥ 2 and let the assumption RE(s, 3) be fulfilled. If λ = √(8n log(2p/α)), then with probability at least 1 − α it holds that

‖β̂Lasso − β∗‖₁ ≤ (2⁵/κ²(s, 3)) σ∗ s √(2 log(2p/α)/n),

‖X(β̂Lasso − β∗)‖₂² ≤ (2⁷/κ²(s, 3)) σ∗² s log(2p/α).

Furthermore, if R̃E(s, 3) is fulfilled, then

‖β̂Lasso − β∗‖₂² ≤ (2⁹/κ̃⁴(s, 3)) σ∗² s log(2p/α)/n.


Two examples (Mairal et al. (2011))


Bayes setting

• To tackle the high-dimensionality issue with a Bayesian approach, we introduce a prior on Rp. We will assume it is absolutely continuous with a density π(·).

• Let us think of π(·) as the way we would model the distribution of β∗ prior to looking at the data.

• Instead of considering the posterior mean, the maximum a posteriori (MAP) estimate or other standard estimators used in the Bayesian framework, we adopt a variational approach.


Variational approach and exponential weights

• Let P be the set of all probability measures p on Rp such that ∫_{Rp} β p(dβ) exists.

• Let λ > 0 and let π ∈ P be a prior on Rp.

• Define the pseudo-posterior by

π̂n = arg min_{p∈P} ( ∫_{Rp} ‖Y − Xβ‖₂² p(dβ) + λ K(p, π) ),

with K(p, π) being the Kullback–Leibler divergence

K(p, π) = ∫_{Rp} log( p(β)/π(β) ) p(β) dβ  if p ≪ π,  and +∞ otherwise.

• Using π̂n, we define

β̂EWA = ∫_{Rp} β π̂n(β) dβ.


Exponential weights and a risk bound

Theorem 5.

1. The pseudo-posterior π̂n is given by the explicit formula

π̂n(β) ∝ exp{ −λ⁻¹ ‖Y − Xβ‖₂² } π(β),  ∀β ∈ Rp.

2. If λ ≥ 4σ∗², then the pseudo-posterior mean β̂EWA satisfies

E[‖X(β̂EWA − β∗)‖₂²] ≤ min_{p∈P} ( ∫_{Rp} ‖X(β − β∗)‖₂² p(dβ) + λ K(p, π) ).

The second claim of the theorem is due to Leung and Barron (2006) and is a nice consequence of Stein's lemma.


Comments

• If we choose λ = 2σ∗², we get the classical Bayes posterior mean estimator. But we have no nice risk bound for it.

• If the prior π is a discrete measure

π = Σ_{j=1}^N πj δ_{βj},

then the risk bound simplifies to

E[‖X(β̂EWA − β∗)‖₂²] ≤ min_{j=1,...,N} ( ‖X(βj − β∗)‖₂² + 4σ∗² log(1/πj) ),

and β̂EWA can be computed exactly (see the sketch below).

Inequalities of this type are usually called oracle inequalities.

• The precise knowledge of σ∗ is not necessary; one only needs a (not very rough) upper bound on σ∗.

• This result is very general, and is not specifically designed to deal with the sparsity scenario.
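In the discrete-prior case, the pseudo-posterior puts explicit weights π̂n(βj) ∝ exp(−λ⁻¹‖Y − Xβj‖₂²) πj on the candidates, so β̂EWA is a finite weighted average. A minimal sketch (candidate vectors, weights and the function name are hypothetical):

```python
import numpy as np

def ewa_discrete(Y, X, betas, log_pi, lam):
    # betas : (N, p) array whose rows are the candidates beta_1, ..., beta_N
    # log_pi: (N,) log prior weights log(pi_j)
    resid = Y[None, :] - betas @ X.T                     # residual Y - X beta_j, row by row
    log_w = -np.sum(resid**2, axis=1) / lam + log_pi     # log of unnormalized weights
    log_w -= log_w.max()                                 # numerical stabilization
    w = np.exp(log_w)
    w /= w.sum()                                         # pseudo-posterior probabilities
    return w @ betas                                     # exponentially weighted average
```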


Back to the sparsity scenario

• To deal with sparsity, we need to introduce a prior π which promotes sparsity.

• The Ridge (L2-penalty) and Lasso (L1-penalty) estimators can be seen as MAP estimators with a Gaussian and a Laplace prior, respectively.

• We will use a more heavy-tailed one : the product of (scaled) Student t(3) densities,

π(β) ∝ Π_{j=1}^p (τ² + βj²)⁻²,  ∀β ∈ Rp,

for some tuning parameter τ > 0.


Does π favor sparsity ?

The boxplots of a sample of size 10⁴ drawn from the scaled Gaussian, Laplace and Student t(3) distributions. In all three cases the location parameter is 0 and the scale parameter is 10⁻².


Compare with the wavelet coefficients

The boxplot of a sample of size 10⁴ of the wavelet coefficients of the image of Lena.


Sparsity oracle inequality for the EWA

Theorem 6 (D. and Tsybakov 2007). Let π be the sparsity-favoring prior and let β̂EWA be the exponentially weighted aggregate with λ ≥ 4σ∗². Then, for every τ > 0,

E[‖X(β̂EWA − β∗)‖₂²] ≤ min_{β∈Rp} { ‖X(β − β∗)‖₂² + λ Σ_{j=1}^p log(1 + τ⁻¹|βj|) } + 11τ²pn.

In particular, if β∗ is s-sparse, then

E[ (1/n)‖X(β̂EWA − β∗)‖₂² ] ≤ (sλ/n) log(1 + τ⁻¹‖β∗‖∞) + 11τ²p.

This result provides fast rates under no assumption on the dictionary.


Langevin Monte-Carlo

• Although the EWA can be written in an explicit form, its computation is not trivial because of the p-fold integral.

• Naive Monte Carlo methods fail in moderately large dimensions (p = 50, say).

• A specific type of Markov Chain Monte Carlo technique, called Langevin Monte Carlo, turns out to be very efficient.


Langevin Monte-Carlo

• Let us consider the linear regression setup. We wish to compute the EWA, which can be written as

β̂EWA = C ∫ β e^{−λ⁻¹‖Y−Xβ‖₂²} π(dβ),

where C is the normalization constant.

• We can rewrite this as

β̂EWA = ∫_{Rp} β pV(β) dβ,

where pV(β) ∝ e^{V(β)} is a density function.

• pV is the invariant density of the Langevin diffusion

dLt = ∇V(Lt) dt + √2 dWt,  L0 = β0,  t ≥ 0.
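A hedged sketch of the resulting algorithm: an Euler discretization of this diffusion whose ergodic average approximates β̂EWA. Here V corresponds to the sparsity-favoring prior π(β) ∝ Πj (τ² + βj²)⁻², so that ∇V(β) = (2/λ) X⊤(Y − Xβ) − 4β/(τ² + β²) componentwise; the step size, number of iterations and burn-in are hypothetical choices.

```python
import numpy as np

def grad_V(beta, X, Y, lam, tau):
    # Gradient of V(beta) = -||Y - X beta||^2 / lam + log pi(beta)
    return 2.0 / lam * X.T @ (Y - X @ beta) - 4.0 * beta / (tau**2 + beta**2)

def ewa_langevin(X, Y, lam, tau, h, n_iter=20000, burn_in=5000, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    beta = np.zeros(p)
    avg = np.zeros(p)
    for k in range(n_iter):
        # Euler step of dL_t = grad V(L_t) dt + sqrt(2) dW_t
        beta = beta + h * grad_V(beta, X, Y, lam, tau) \
               + np.sqrt(2 * h) * rng.standard_normal(p)
        if k >= burn_in:
            avg += beta
    return avg / (n_iter - burn_in)       # ergodic average approximating the EWA
```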


Numerical experiments. Example 1 : Compressed sensing

• Input : n, p and s, all positive integers.
• Covariates : we generate an n × p matrix X with iid Rademacher entries.
• Errors : we generate a standard Gaussian vector ξ.
• Noise magnitude : σ∗ = √s/9.
• Response : Y = Xβ∗ + σ∗ξ where β∗ = [1l(j ≤ s); j ≤ p].
• Tuning parameters : λ = 4σ², τ = 4σ/‖X‖₂, h = 4σ²/‖X‖₂².
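A sketch of this data-generating protocol (assumptions: ‖X‖₂ is read as the spectral norm of X and σ∗ = √s/9 as (√s)/9); the resulting (X, Y, λ, τ, h) can be fed to the Langevin sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 200, 500, 20
X = rng.choice([-1.0, 1.0], size=(n, p))               # iid Rademacher covariates
beta_star = (np.arange(1, p + 1) <= s).astype(float)   # beta*_j = 1{j <= s}
sigma = np.sqrt(s) / 9.0                               # assumed reading of "sqrt(s)/9"
Y = X @ beta_star + sigma * rng.standard_normal(n)

spec = np.linalg.norm(X, 2)                            # spectral norm of X (assumption)
lam, tau, h = 4 * sigma**2, 4 * sigma / spec, 4 * sigma**2 / spec**2
```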


Numerical experiments. Example 1 : Compressive sensing

[Figure : estimated coefficient vectors, Exp.-Weighted Aggregate (left) and Lasso (right).]

Typical outcome for n = 200, p = 500 and s = 20.


Numerical experiments. Example 1 : Compressive sensing

                      p = 200                    p = 500
                      EWA         Lasso          EWA         Lasso
n = 100, s = 5        0.064       1.442          0.087       1.616
                      (0.043)     (0.461)        (0.054)     (0.491)
                      T = 1                      T = 1
n = 100, s = 10       1.153       5.712          1.891       6.508
                      (1.091)     (1.157)        (1.522)     (1.196)
                      T = 2                      T = 5
n = 100, s = 15       6.839       11.149         8.917       11.82
                      (1.896)     (1.303)        (2.186)     (1.256)
                      T = 5                      T = 10


Image denoising and inpainting : A simple example

• Input : n, k positive integers and σ > 0.
• We generate n vectors Ui of R², uniformly distributed in [0, 1]².
• Covariates : φj(u) = 1l_{[0, j1/k]×[0, j2/k]}(u).
• Errors : we generate a centered Gaussian vector ξ with covariance matrix σ²I.
• Response : Yi = (φ1(Ui), . . . , φ_{k²}(Ui))⊤ β∗ + ξi, where β∗ = [1l(j ∈ {10, 100, 200})]′.
• Tuning parameters : the same rule as previously.


Image denoising and inpainting

The original image, its sampled noisy version and the denoised image.


Image denoising

σ      n = 100                    n = 200
       EWA         Ideal LG        EWA         Ideal LG
2      0.210       0.330           0.187       0.203
       (0.072)     (0.145)         (0.048)     (0.086)
       T = 1                       T = 1
4      0.420       0.938           0.278       0.571
       (0.222)     (0.631)         (0.132)     (0.324)
       T = 1                       T = 1


Chapter IV. The case of unknown noise level


Multiple linear regression

We are still dealing with the following model : we observe a vector Y ∈ Rn and a matrix X ∈ Rn×p such that

Y = Xβ∗ + σ∗ξ, (3)

where
• β∗ ∈ Rp is the unknown regression vector,
• ξ ∈ Rn is a random vector referred to as noise; we will assume it is Gaussian Nn(0, In),
• σ∗ is the noise level (known or unknown depending on the application).

Under the sparsity scenario with unknown σ∗, is it possible to get the same guarantees as in the case of known σ∗ ?


Dependence on the noise level

All three methods we have seen so far depend on the noise level through the tuning parameter.

• The Lasso and the Dantzig selector :

β̂Lasso ∈ arg min_{β∈Rp} ( ½‖Y − Xβ‖₂² + λ‖β‖₁ ),

β̂DS ∈ arg min_{β∈Rp : ‖X⊤(Y−Xβ)‖∞ ≤ λ} ‖β‖₁.

The choice λ = Cσ∗√(n log p) leads to estimators with nice theoretical guarantees.

• The EWA :

β̂EWA = ∫_{Rp} β exp(−λ⁻¹‖Y − Xβ‖₂²) π(β) dβ / ∫_{Rp} exp(−λ⁻¹‖Y − Xβ‖₂²) π(β) dβ.

The choice λ ≥ 4σ∗² leads to an estimator having nice theoretical guarantees.


Method of substitution

In some applications, it is reasonable to assume that an unbiased and consistent estimator σ̂² of σ∗², independent of Y, is available.

Examples :

1. One can observe two independent copies of Y, denoted by Y′ and Y′′. Then σ̂² = ‖Y′ − Y′′‖₂²/(2n) is a consistent and unbiased estimator of σ∗², independent of Y = ½(Y′ + Y′′).

2. The recording device can be used in an environment without signal : one can record Z = σ∗η. Then σ̂² = (1/n)‖Z‖₂² is a consistent and unbiased estimator of σ∗².

In such a context, one can substitute σ̂ for σ∗ in the choice of λ for the Lasso, the DS and the EWA and get nearly the same guarantees.
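Both pilot estimators are one-line computations; a minimal sketch (function names are illustrative):

```python
import numpy as np

def sigma2_from_two_copies(Y1, Y2):
    # Y1, Y2: two independent copies of Y; Y1 - Y2 has covariance 2*sigma^2*I_n
    return np.sum((Y1 - Y2) ** 2) / (2 * len(Y1))

def sigma2_from_pure_noise(Z):
    # Z = sigma * eta recorded without any signal
    return np.sum(Z ** 2) / len(Z)
```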


Scaled Lasso

In many applications, only (Y, X) is observed and it is impossible to get an estimator of σ∗ independent of Y (biology, econometrics, ...). In such a setting, it is natural to try to jointly estimate β∗ and σ∗.

• The negative log-likelihood is then given by

L(Y; β, σ) = n log(σ) + (1/(2σ²)) ‖Y − Xβ‖₂².

• Following the same ideas as in the case of known σ∗, one can add a penalty term of the form γ‖β‖₁.

• To make the cost function (penalized log-likelihood) homogeneous, γ is chosen equal to λ/σ.

• The scaled Lasso (Städler et al. 2010) is defined as

(β̂ScL, σ̂ScL) ∈ arg min_{(β,σ)∈Rp×R₊} PL(Y; β, σ),  where  PL(Y; β, σ) = n log(σ) + (1/(2σ²)) ‖Y − Xβ‖₂² + (λ/σ) ‖β‖₁.


Properties of the Scaled Lasso

Computational aspects :

• The function (β, σ) ↦ PL(Y; β, σ) is not convex !

• Transformation of the parameters : the function (φ, ρ) ↦ PL(Y; φ/ρ, 1/ρ) is convex !

• Strategy : minimize w.r.t. (φ, ρ) the cost function

−n log(ρ) + ½‖ρY − Xφ‖₂² + λ‖φ‖₁

and set σ̂ = 1/ρ̂ and β̂ = φ̂/ρ̂.
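A hedged sketch of this strategy with the generic solver cvxpy (X, Y and λ assumed given; not the authors' implementation):

```python
import numpy as np
import cvxpy as cp

def scaled_lasso(X, Y, lam):
    # Minimize the convex reparametrized cost in (phi, rho) = (beta/sigma, 1/sigma).
    n, p = X.shape
    phi = cp.Variable(p)
    rho = cp.Variable(pos=True)
    cost = (-n * cp.log(rho)
            + 0.5 * cp.sum_squares(rho * Y - X @ phi)
            + lam * cp.norm1(phi))
    cp.Problem(cp.Minimize(cost)).solve()
    sigma_hat = 1.0 / rho.value
    beta_hat = phi.value / rho.value        # undo the change of variables
    return beta_hat, sigma_hat
```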

Theoretical aspects :

• In the more general model of Gaussian mixtures, it has been proved by Städler et al. (2010) that, under the RE property, with large probability the error of estimation of (β∗, σ∗) is of the order s log⁴(n)/n.


Square-root Lasso

• One weakness of the scaled Lasso is that the mapping ρ ↦ log(ρ) is not Lipschitz continuous on (0, +∞). Lipschitz continuity is often required for the convergence of convex optimization algorithms.

• To circumvent this drawback, it has been proposed by Antoniadis (2010) to replace the cost function PL(Y; β, σ) by

PL1(Y; β, σ) = nσ + (1/(2σ)) ‖Y − Xβ‖₂² + λ‖β‖₁

and to define the square root Lasso estimator

(β̂SqRL, σ̂SqRL) ∈ arg min_{β,σ} PL1(Y; β, σ).

• This estimator can be computed by solving a second-order cone program. It can be done very efficiently even for very large values of p.
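As an illustration of this computational claim, a hedged cvxpy sketch minimizing PL1 jointly in (β, σ); the term ‖Y − Xβ‖₂²/(2σ) is written with the quad-over-lin atom, which keeps the problem SOCP-representable (X, Y and λ assumed given):

```python
import cvxpy as cp

def square_root_lasso(X, Y, lam):
    n, p = X.shape
    beta = cp.Variable(p)
    sigma = cp.Variable(pos=True)
    cost = (n * sigma
            + 0.5 * cp.quad_over_lin(Y - X @ beta, sigma)   # ||Y - X beta||^2 / (2 sigma)
            + lam * cp.norm1(beta))
    cp.Problem(cp.Minimize(cost)).solve()
    return beta.value, sigma.value
```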


Scaled Dantzig selector

• One weakness of the scaled Lasso is that the mapping ρ ↦ log(ρ) is not Lipschitz continuous on (0, +∞).

• One can show that the first-order conditions for (β̂ScL, σ̂ScL) can be written as

X⊤(Y − Xβ) ∈ λσ sign(β);  ‖Y‖₂² − Y⊤Xβ = nσ².

• The set defined by these conditions is not convex, but it is included in the convex set

D = { (β, σ) : ‖X⊤(Y − Xβ)‖∞ ≤ λσ;  ‖Y‖₂² − Y⊤Xβ ≥ nσ² }.

• Therefore, we proposed to estimate (β∗, σ∗) by the estimator

(β̂SDS, σ̂SDS) ∈ arg min_{(β,σ)∈D} ‖β‖₁.

• This estimator can be computed by solving a second-order cone program. It can be done very efficiently even for very large values of p.
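Again a hedged cvxpy sketch of the corresponding cone program (X, Y and λ assumed given; both constraints defining D are convex):

```python
import cvxpy as cp

def scaled_dantzig_selector(X, Y, lam):
    n, p = X.shape
    beta = cp.Variable(p)
    sigma = cp.Variable(nonneg=True)
    constraints = [
        cp.norm(X.T @ (Y - X @ beta), 'inf') <= lam * sigma,   # sup-norm constraint of D
        n * cp.square(sigma) <= Y @ Y - Y @ (X @ beta),        # ||Y||^2 - Y'X beta >= n sigma^2
    ]
    cp.Problem(cp.Minimize(cp.norm1(beta)), constraints).solve()
    return beta.value, sigma.value
```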


Risk bounds

Theorem (D. and Chen (2012))

Let us choose a significance level α ∈ (0, 1) and set λ = 2√(n log(p/α)). Assume that β∗ is such that ‖β∗‖0 ≤ s and

‖β∗‖₁/σ∗ ≤ √(n/(2 log(1/α))) ( ½ − 2√(n⁻¹ log(1/α)) ). (4)

If X satisfies the condition R̃E(s, 1), then, with probability at least 1 − 3α, it holds that

‖X(β̂ − β∗)‖₂² ≤ 8 ((σ∗ + σ̂)/κ)² s log(p/α), (5)

‖β̂ − β∗‖₂² ≤ 32 ((σ∗ + σ̂)/κ²)² s log(p/α)/n. (6)

Moreover, with probability at least 1 − 4α,

σ̂ ≤ σ∗ (3 + √(2n⁻¹ log(1/α))).


References

• Donoho, David and Johnstone, Iain, Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 (1995), no. 432, 1200-1224.

• Candès, Emmanuel and Tao, Terence, The Dantzig selector : statistical estimation when p is much larger than n. Ann. Statist. 35 (2007), no. 6, 2313-2351.

• Bickel, Peter; Ritov, Ya'acov and Tsybakov, Alexandre, Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 37 (2009), no. 4, 1705-1732.

• Dalalyan, Arnak and Tsybakov, Alexandre, Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning 72 (2008), 39-61.

• Chen, Yin and Dalalyan, Arnak, Fused sparsity and robust estimation for linear models with unknown variance. Neural Information Processing Systems (NIPS 2012), 1-16.

• Mairal, Julien, Sparse coding for machine learning, image processing and computer vision. PhD thesis (2010).
