Exploiting sparsity in high-dimensional statistical inference

CIMPA-UNESCO-MESR-MICINN research School 2012: New trends in Mathematical Statistics

Punta del Este, URUGUAY

Arnak S. Dalalyan
ENSAE / CREST / GENES

Paris, FRANCE


Outline of my lectures

Lecture I: Introduction to sparsity
- Examples and motivation
- Orthogonal covariates: hard and soft thresholding
- Risk bounds
- Stein's lemma and SureShrink

Lecture II: Non-orthogonal design: Lasso and Dantzig selector
- Compressed sensing
- Risk bounds under restricted isometry (RI)
- Risk bounds under the restricted eigenvalues (RE) property

Lecture III: Exponential weights and sparsity favoring priors
- Bayesian approach
- Sharp oracle inequalities for the EW-aggregate
- Risk bounds for arbitrary design

Lecture IV: Adaptation to the noise magnitude: scaled Lasso, square root Lasso and SDS.


Lecture I. Introduction to sparsity


Multiple linear regression

Throughout these lectures, we will be dealing with the following model only: we observe a vector Y ∈ Rn and a matrix X ∈ Rn×p such that

  Y = Xβ∗ + σ∗ξ,    (1)

where
- β∗ ∈ Rp is the unknown regression vector,
- ξ ∈ Rn is a random vector referred to as noise; we will assume it is Gaussian Nn(0, In),
- σ∗ is the noise level, which may be known or not depending on the application.

Sparsity scenario: p is not small compared to n (they are comparable or p is larger than n), but we know that only a small number of the βj's is significantly different from 0.


Example of sparsity : wavelet transform of an image

[Figure: the roughly 15,000 wavelet coefficients of an image ("WT in 2D"); all but a few of them are close to zero.]


Example of sparsity : robust estimation

[Figure: scatter plot of (x, y) data used to illustrate robust estimation; a few observations lie far from the trend followed by the rest.]


Example of sparsity : background subtraction


Orthonormal dictionaries

We first consider the case of orthonormal dictionaries. That is, n = p and the matrix X satisfies

  XX⊤ = X⊤X = n In.

- Typical examples are the Fourier or wavelet transform, the basis provided by PCA, ...

- Equation (1) can be written as

  Y = Xβ∗ + σ∗ξ  ⇐⇒  Z = β∗ + (σ∗/√n) ε,    (2)

  where
  - Z = (1/n) X⊤Y is the transformed response,
  - ε ∈ Rn is a random vector drawn from the Gaussian Nn(0, In) distribution.

- The right-hand side of (2) is often referred to as the Gaussian sequence model.
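To make the reduction above concrete, here is a minimal numerical sketch (not part of the original slides; all sizes are illustrative) that builds an orthonormal dictionary with XX⊤ = X⊤X = n In and checks that Z = (1/n) X⊤Y behaves as β∗ plus Gaussian noise of standard deviation σ∗/√n.

```python
import numpy as np

# Minimal sketch (illustrative, not from the slides): reduction of the regression
# model with an orthonormal dictionary to the Gaussian sequence model.
rng = np.random.default_rng(0)
n = p = 256

# X with X X^T = X^T X = n I_n: random orthogonal matrix scaled by sqrt(n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = np.sqrt(n) * Q

s, sigma_star = 10, 1.0
beta_star = np.zeros(p)
beta_star[:s] = 5.0
Y = X @ beta_star + sigma_star * rng.standard_normal(n)

Z = X.T @ Y / n        # transformed response: Z = beta* + (sigma*/sqrt(n)) * eps
print("empirical sd of Z - beta*:", (Z - beta_star).std())
print("theoretical sd sigma*/sqrt(n):", sigma_star / np.sqrt(n))
```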


The oracle and its risk: definitions

- Thus, we assume that we observe Z = (Z1, . . . , Zn)⊤ such that

  Zi = β∗i + (σ∗/√n) εi,   εi iid∼ N(0, 1).

- Furthermore, we believe that the true vector β∗ is sparse, that is,

  ‖β∗‖0 := Σ_{j=1}^n 1l{β∗j ≠ 0}  is much smaller than n.

- If we knew exactly which coordinates of β∗ are nonzero, that is, if we knew the sparsity pattern J∗ = {j : β∗j ≠ 0}, we would estimate β∗ by βo defined by

  βoj = Zj 1l{j ∈ J∗}.

  The vector βo will be called the oracle.

- One easily checks that its risk is given by

  R[βo, β∗] = E[‖βo − β∗‖₂²] = Σ_{j∈J∗} E[(Zj − β∗j)²] = s σ∗²/n,

  where s = |J∗| = ‖β∗‖0.


The oracle and its risk: discussion

The risk of the oracle is given by

  R[βo, β∗] = s σ∗²/n.

- If we use the “naive” least-squares estimator β̂LS = Z, which coincides here with the maximum likelihood estimator, we get the risk

  R[β̂LS, β∗] = σ∗²,

  which is much worse than that of the oracle when s ≪ n.

- Is there an estimator of β∗ with a risk comparable to that of the oracle but which does not rely on the knowledge of J∗?

- The answer is “Yes” (up to logarithmic factors).
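A small Monte Carlo sketch (illustrative, not part of the slides) comparing the risk of the oracle with that of the least-squares estimator β̂LS = Z in the sequence model; the values of n, s and σ∗ are arbitrary.

```python
import numpy as np

# Minimal sketch (illustrative): oracle risk s*sigma^2/n vs. LS risk sigma^2.
rng = np.random.default_rng(1)
n, s, sigma_star, n_rep = 1000, 10, 1.0, 2000

beta_star = np.zeros(n)
beta_star[:s] = 3.0
J_star = beta_star != 0

risk_oracle, risk_ls = 0.0, 0.0
for _ in range(n_rep):
    Z = beta_star + sigma_star / np.sqrt(n) * rng.standard_normal(n)
    beta_oracle = np.where(J_star, Z, 0.0)      # keeps only the true support
    risk_oracle += np.sum((beta_oracle - beta_star) ** 2) / n_rep
    risk_ls += np.sum((Z - beta_star) ** 2) / n_rep   # beta_LS = Z

print("oracle risk:", risk_oracle, " theory:", s * sigma_star**2 / n)
print("LS risk    :", risk_ls, " theory:", sigma_star**2)
```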


Sparsity aware estimation: hard and soft thresholding

Two observations:

- If for a given j it is very likely that β∗j = 0, then the estimator β̂j = 0 is preferable to the ML estimator β̂j = Zj.
- If β∗j = 0, then the corresponding observation Zj is small.

Donoho and Johnstone proposed (in the early 1990s) to estimate β∗ by

Hard thresholding: for a given threshold t > 0, set

  β̂HTj = Zj 1l{|Zj| > tσ∗/√n},   j = 1, . . . , n.

Soft thresholding: for a given threshold t > 0, set

  β̂STj = (Zj − (tσ∗/√n) sign(Zj)) 1l{|Zj| > tσ∗/√n},   j = 1, . . . , n.
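Both estimators take a few lines to implement; the following sketch (illustrative, not from the slides) applies them with the universal threshold t = √(2 log p) used later in the lecture.

```python
import numpy as np

# Minimal sketch (illustrative): hard and soft thresholding in the sequence model.
def hard_threshold(Z, t, sigma, n):
    """beta_HT_j = Z_j * 1{|Z_j| > t*sigma/sqrt(n)}."""
    thr = t * sigma / np.sqrt(n)
    return Z * (np.abs(Z) > thr)

def soft_threshold(Z, t, sigma, n):
    """beta_ST_j = sign(Z_j) * max(|Z_j| - t*sigma/sqrt(n), 0)."""
    thr = t * sigma / np.sqrt(n)
    return np.sign(Z) * np.maximum(np.abs(Z) - thr, 0.0)

rng = np.random.default_rng(2)
n = p = 512
beta_star = np.zeros(p); beta_star[:15] = 4.0
sigma_star = 1.0
Z = beta_star + sigma_star / np.sqrt(n) * rng.standard_normal(p)

t_univ = np.sqrt(2 * np.log(p))          # universal threshold
for name, est in [("hard", hard_threshold(Z, t_univ, sigma_star, n)),
                  ("soft", soft_threshold(Z, t_univ, sigma_star, n))]:
    print(name, "squared error:", np.sum((est - beta_star) ** 2))
```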


Risk bound for Hard Thresholding

Let us consider the slightly more general setting Z = β∗ + (σ∗/√n) ε with an s-sparse signal β∗ ∈ Rp and a Gaussian Np(0, Ip) vector ε (now p is not necessarily equal to n).

Theorem 1. If β̂ is the hard thresholding estimator

  β̂j = Zj 1l{|Zj| > tσ∗/√n},   j = 1, . . . , p,

then, for every t ≥ √2, it holds that

  R[β̂, β∗] ≤ (2tp e^{−t²/2} + 3t²s) σ∗²/n.

Proof. On the whiteboard.

Remark. A similar bound holds true for the soft-thresholding estimator.


Risk bound for Hard Thresholding: consequences

- Assuming s ≥ 1 and p ≥ 3, we can choose t = √(2 log p), which leads to

  R[β̂, β∗] ≤ 7.5 log(p) s σ∗²/n.

- Up to the factor 7.5 log(p), this bound coincides with that of the oracle.

- The choice t = √(2 log p) is commonly known as the universal choice of the threshold. It can be proved that the factor log(p) is an unavoidable price to pay for not knowing the locations of the nonzero entries of β∗.

- The universal choice of t is of the correct order of magnitude, but it turns out to be too strong for applications. Smaller values of t, chosen in a data-dependent manner, are often preferable.


Stein’s lemma

Let Z = β∗ + (σ∗/√n) ε with ε ∼ Np(0, Ip).

Stein's lemma. Let Tn(Z) = (Tn,1(Z), . . . , Tn,p(Z))⊤ be an estimator of β∗ such that the mapping z ↦ Tn(z) is continuous and weakly differentiable. Then, the quantity

  r̂n(Z) = ‖Z − Tn(Z)‖₂² + (2σ∗²/n) Σ_{i=1}^p ∂zi Tn,i(Z) − σ∗² p/n

is an unbiased estimator of the risk R[Tn, β∗].

Proof. On the whiteboard.


SURE-Shrink

- In view of this result, one easily checks that

  SURE(t, Z) = Σ_{i=1}^p (Zi² ∧ t²σ∗²/n) + (2σ∗²/n) Σ_{i=1}^p 1l{|Zi| > tσ∗/√n} − σ∗² p/n

  is an unbiased estimator of the risk of the soft thresholding estimator.

- Therefore, Donoho and Johnstone proposed to choose the threshold t as the minimizer of the (random) function t ↦ SURE(t, Z).
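A minimal sketch of this procedure (illustrative, not from the slides): compute SURE(t, Z) on a grid of thresholds, pick the minimizer, and compare with the realized error of soft thresholding at the chosen t.

```python
import numpy as np

# Minimal sketch (illustrative): SURE-based choice of the soft-thresholding level.
def soft(Z, lam):
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)

def sure(t, Z, sigma, n):
    p = Z.size
    lam = t * sigma / np.sqrt(n)
    return (np.sum(np.minimum(Z**2, lam**2))
            + 2 * sigma**2 / n * np.sum(np.abs(Z) > lam)
            - sigma**2 * p / n)

rng = np.random.default_rng(3)
n = p = 512
beta_star = np.zeros(p); beta_star[:20] = 2.0
sigma_star = 1.0
Z = beta_star + sigma_star / np.sqrt(n) * rng.standard_normal(p)

t_grid = np.linspace(0.1, np.sqrt(2 * np.log(p)), 200)
sure_vals = np.array([sure(t, Z, sigma_star, n) for t in t_grid])
err_vals = np.array([np.sum((soft(Z, t * sigma_star / np.sqrt(n)) - beta_star)**2)
                     for t in t_grid])

i_sure = np.argmin(sure_vals)
print("t chosen by SURE:", t_grid[i_sure], " universal t:", np.sqrt(2 * np.log(p)))
print("squared error at t_SURE:", err_vals[i_sure],
      " best possible on the grid:", err_vals.min())
```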


SURE-Shrink

FIGURE: Illustration of the choice of the threshold using SURE. Left: the function t ↦ SURE(t). Right: the true risk as a function of t.


SURE-Shrink vs universal choice

FIGURE: Root MSEs for simulated data at varying levels of sparsity when the threshold is chosen (a) by SURE and (b) by the universal choice.


Conclusion of the first lecture

- The hypothesis of sparsity of the unknown object appears naturally in various applications.

- In the case of an orthogonal dictionary, the soft and hard thresholding estimators are well suited to efficiently recover sparse structures.

- These estimators have a risk of almost the same order as that of the oracle knowing the locations of the nonzero entries: the price for not knowing these positions is only logarithmic in p.

- What happens if the dictionary X is not orthonormal?


Lecture II. Nonorthogonal dictionaries: Lasso and Dantzig selector


Summaries of yesterday’s and today’s talks

Yesterday: Introduction to sparsity
- Examples and motivation
- Orthogonal covariates: hard and soft thresholding
- Risk bounds
- Stein's lemma and SureShrink

Today: Non-orthogonal design: Lasso and Dantzig selector
- Compressed sensing
- Risk bounds under restricted isometry (RI)
- Risk bounds under the restricted eigenvalues (RE) property


How to get the slides

The webpage is at : http://arnak-dalalyan.fr


Examples of nonorthogonal dictionaries

- An orthogonal dictionary is (unfortunately) the exception rather than the rule.

- Examples of nonorthogonal dictionaries include:
  - the case when X is composed of a union of two or more orthonormal bases,
  - robust estimation (cf. the previous lecture),
  - nonparametric regression with irregular design,
  - compressed sensing,
  - . . .


Compressed sensing: what is it about?

- Assume you need to acquire a large vector µ∗ ∈ Rp.

- Assume that µ∗ admits a sparse representation, i.e., for a given orthogonal matrix W we have µ∗ = Wβ∗ with a sparse vector β∗.

- We are able to measure (up to a measurement error) any linear combination of the entries of µ∗. Thus, we can acquire the vector Aµ∗ + noise for any n × p sensing matrix A.

- One option is to acquire (noisy versions of) all the entries of µ∗ and then apply the SureShrink procedure.

- Bad choice: this requires too many measurements.

- Questions:
  - What is the “minimal” number of measurements sufficient for a good reconstruction of µ∗?
  - How should the sensing matrix A be chosen?

- Compressed sensing is the theory studying these questions.


Compressed sensing: some remarks

- The minimal number n of measurements is obviously ≤ p, and it is indeed < p if β∗ is sparse.

- If n < p, there is no way for the matrix X = AW to be orthogonal (why?).

- If n < p, the linear system Y = Xβ is under-determined and, therefore, the set of its solutions is infinite (of continuum cardinality). So, to apply thresholding, one has to first find an estimator to be thresholded.

- The orthogonality of X is equivalent to

  ((1/n) X⊤X)^{1/2} β = β   for all β ∈ Rp.

  But why do we care about all β?

- We are only interested in a particular type of vectors β: sparse vectors!


Restricted Isometry Property (RIP)

Definition. Let X be an n × p matrix and let s ∈ {1, . . . , p}. We say that X satisfies the restricted isometry property RIP(s) with constant δ if

  (1 − δ)‖u‖₂² ≤ (1/n)‖Xu‖₂² ≤ (1 + δ)‖u‖₂²

for every vector u ∈ Rp with ‖u‖0 ≤ s. We denote by δs(X) the smallest δ such that X satisfies RIP(s) with constant δ.

Remarks

1. Any orthogonal matrix satisfies RIP(s) with the constant δ = 0 for any s ∈ {1, . . . , p}.

2. If X = AW with an orthogonal matrix W, then X satisfies RIP(s, δ) if and only if √n A satisfies RIP(s, δ). (Exercise)
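The following sketch (illustrative, not from the slides) probes the RIP numerically for a matrix with i.i.d. N(0, 1) entries, the classical example mentioned later in this lecture: it samples random s-sparse directions u and records how far (1/n)‖Xu‖₂²/‖u‖₂² deviates from 1.

```python
import numpy as np

# Minimal sketch (illustrative): empirical restricted-isometry behaviour of a
# random Gaussian matrix over randomly sampled s-sparse directions.
rng = np.random.default_rng(4)
n, p, s, n_trials = 200, 1000, 10, 5000

X = rng.standard_normal((n, p))
ratios = np.empty(n_trials)
for k in range(n_trials):
    support = rng.choice(p, size=s, replace=False)
    u = np.zeros(p)
    u[support] = rng.standard_normal(s)
    ratios[k] = np.sum((X @ u) ** 2) / (n * np.sum(u ** 2))

# The spread of the ratios gives a lower bound on delta_s(X) over the sampled supports.
print("min ratio:", ratios.min(), " max ratio:", ratios.max())
print("empirical delta over sampled supports:",
      max(1 - ratios.min(), ratios.max() - 1))
```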


Lasso and Dantzig selector

From now on, we assume that the columns of X are normalized in such a way that

  (1/n) Σ_{i=1}^n xij² = 1,   ∀ j = 1, . . . , p.

This does not cause any loss of generality (why?).

We fix a λ > 0 and define the two estimators

  β̂Lasso ∈ arg min_{β∈Rp} ( (1/(2σ∗²)) ‖Y − Xβ‖₂² + (λ/σ∗) ‖β‖1 ),

  β̂DS ∈ arg min_{β∈Rp : ‖X⊤(Y−Xβ)‖∞ ≤ λσ∗} ‖β‖1.

Remarks. Both the Lasso and the Dantzig selector can be efficiently computed even for very large datasets using convex programming: the Lasso is a second-order cone program and the Dantzig selector a linear program.
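The Lasso above can be solved with generic convex-programming software, but a simple proximal-gradient (ISTA) loop already works; below is a minimal sketch (illustrative, not the lectures' code), written for the equivalent objective ½‖Y − Xβ‖₂² + λσ∗‖β‖1, with an arbitrary Rademacher design and a tuning constant of the order used later in the lectures.

```python
import numpy as np

def lasso_ista(X, Y, penalty, n_iter=2000):
    """Proximal gradient (ISTA) for min_beta 0.5*||Y - X beta||_2^2 + penalty*||beta||_1."""
    n, p = X.shape
    step = 1.0 / np.linalg.norm(X, 2) ** 2           # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - Y))     # gradient step on the quadratic part
        beta = np.sign(z) * np.maximum(np.abs(z) - step * penalty, 0.0)  # soft thresholding
    return beta

# Illustrative data: Rademacher design, so that (1/n) sum_i x_ij^2 = 1 for every column
rng = np.random.default_rng(5)
n, p, s, sigma_star = 200, 500, 10, 1.0
X = rng.choice([-1.0, 1.0], size=(n, p))
beta_star = np.zeros(p); beta_star[:s] = 2.0
Y = X @ beta_star + sigma_star * rng.standard_normal(n)

penalty = sigma_star * np.sqrt(2 * n * np.log(p))    # lambda*sigma*, lambda of order sqrt(n log p)
beta_hat = lasso_ista(X, Y, penalty)
print("number of nonzero coefficients:", np.sum(np.abs(beta_hat) > 1e-6))
print("||beta_hat - beta*||_2 =", np.linalg.norm(beta_hat - beta_star))
```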


Theoretical guarantees

Exercise. If X is orthogonal, then β̂Lasso = β̂DS = β̂ST.

Theorem 2 (Candès & Tao, AoS 2007). Let α ∈ (0, 1) be a tolerance level and let β∗ be s-sparse. If X satisfies RIP(2s) with a constant δ < √2 − 1, then the Dantzig selector based on λ = √(2n log(p/α)) satisfies the inequality

  P( ‖β̂DS − β∗‖2 ≤ C σ∗ √(s log(2p/α)/n) ) ≥ 1 − α,

with C = 8(√2 − 1)/(√2 − 1 − δ).

This result implies that the “rate of convergence” to zero of the estimation error is of the order

  σ∗ √(s log(2p/α)/n),

which coincides with the one we obtained for hard thresholding in the case of orthogonal dictionaries.


Proof of Theorem 2: some auxiliary results

Lemma. Assume that Y = Xβ∗ + σ∗ξ with ‖Xj‖₂² = n for every j.

1. If ξ ∼ Nn(0, In), then the probability of the event

  B = { ‖X⊤(Y − Xβ∗)‖∞ ≤ σ∗ √(2n log(2p/α)) }

  is larger than or equal to 1 − α.

2. If u, v ∈ Rp are s-sparse and u ⊥ v, then

  (1/n) |⟨Xu, Xv⟩| ≤ δ2s ‖u‖2 ‖v‖2.

3. If h = β̂DS − β∗, then ‖hJc‖1 ≤ ‖hJ‖1 on the event B.

4. Let J = J0 = {j : β∗j ≠ 0} and let Jk be recursively defined as the set of indices j ∉ J ∪ J1 ∪ . . . ∪ Jk−1 of the s largest entries of the vector (|h1|, . . . , |hp|). Then, on the event B,

  ‖h(J∪J1)c‖2 ≤ Σ_{k≥2} ‖hJk‖2 ≤ (1/√s) ‖hJc‖1 ≤ ‖hJ‖2.

Proof on the whiteboard.


Main consequences

- Theorem 2 is of interest in the noiseless setting as well, that is, when σ∗ = 0. It tells us that if X satisfies the RIP, then we are able to perfectly reconstruct β∗ from a small number of measurements.

- If we are looking for an estimate which is γ-accurate, then it is sufficient to take

  n = C σ∗² s log(2p/α) / γ².

  This number is typically much smaller than the dimensionality p!

- If the ratio s/p is not too close to 1, then matrices satisfying RIP(s) exist. The most famous example is a random matrix with i.i.d. N(0, 1) entries.

- So, this answers the two questions raised several slides earlier.


Lecture III. Nonorthogonal dictionaries: Lasso, Dantzig selector and the Bayesian approach


Curse of dimensionality and sparsity

High-dimensionality has significantly challenged traditional statistical theory.

- In linear regression, the accuracy of the LSE is of order p/n, where p is the number of covariates.
- In many applications, p is much larger than n.

The sparsity assumption provides a compelling theoretical framework for dealing with high dimension.

- Even if the number of parameters describing the model in a general setup is large, only a few of them participate in the process of data generation.
- No a priori information on the set of relevant parameters is available.
- It is only known (or postulated) that the cardinality of the set of relevant parameters is small.


Risk bound for the DS: slow rates

The previous result is of the form “all or nothing”: if the RIP is not satisfied, then we do not know anything about the risk of the DS.

- If we do not want to impose any assumption on X, then consistent estimation of β∗ is hopeless. Indeed, in the under-determined case, which is the one of interest for us, β∗ is not identifiable.

- But we can be more optimistic if the goal is to estimate the vector Xβ∗ only (as in Yannick Baraud's talk yesterday).


Slow rates for the DS

Theorem 3. If λ = √(2n log(2p/α)), then with probability at least 1 − α it holds that

  (1/n) ‖X(β̂DS − β∗)‖₂² ≤ 4σ∗ √(2 log(2p/α)/n) · ( ‖β∗‖1 ∧ ‖(β̂DS − β∗)J‖1 ).

Remark. A similar result holds true for the Lasso estimator as well.

Consequence: if the Euclidean norm of β∗ is finite (not too large) and s log(p)/n is small, then the Dantzig selector is consistent w.r.t. the prediction loss, and the latter is O(√(s log(p)/n)).


Restricted eigenvalues property

The risk of the Lasso estimator will be analyzed under a different condition on X.

Definition. Let X be an n × p matrix and let s ∈ {1, . . . , p}. For a positive constant c0, we say that X satisfies the restricted eigenvalues property RE(s, c0) (respectively R̃E(s, c0)) if

  κ(s, c0)² := min_{J⊂{1,...,p}, |J|≤s}   min_{u≠0, ‖uJc‖1 ≤ c0‖uJ‖1}   ‖Xu‖₂² / (n ‖uJ‖₂²)  > 0

(respectively,

  κ̃(s, c0)² := min_{J⊂{1,...,p}, |J|≤s}   min_{u≠0, ‖uJc‖1 ≤ c0‖uJ‖1}   ‖Xu‖₂² / (n ‖u‖₂²)  > 0 ).


Restricted eigenvalues property

Remarks

1. Any orthogonal matrix satisfies RE(s, c0) with κ(s, c0) = 1.

2. According to claim 3 of the last lemma, ‖(β̂DS − β∗)Jc‖1 ≤ ‖(β̂DS − β∗)J‖1 on the event B and, therefore, if RE(s, 1) holds true, then

  (1/n) ‖X(β̂DS − β∗)‖₂² ≥ κ²(s, 1) ‖β̂DS − β∗‖₂².

3. The RE property looks simpler than the RIP, but the two are not comparable (neither implies the other).
  - The advantage of the RE is that there is no upper bound.
  - The weakness of the RE is that the minimum is taken over a larger set than the one in the RIP.


Risk bound for the Lasso: fast rates

Theorem 4. Let p ≥ 2 and let the assumption RE(s, 3) be fulfilled. If λ = √(8n log(2p/α)), then with probability at least 1 − α it holds that

  ‖β̂Lasso − β∗‖1 ≤ (2⁵/κ²(s, 3)) σ∗ s √(2 log(2p/α)/n),

  ‖X(β̂Lasso − β∗)‖₂² ≤ (2⁷/κ²(s, 3)) σ∗² s log(2p/α).

Furthermore, if R̃E(s, 3) is fulfilled, then

  ‖β̂Lasso − β∗‖₂² ≤ (2⁹/κ̃⁴(s, 3)) σ∗² s log(2p/α)/n.


Two examples (Mairal et al. (2011))


Bayes setting

- To tackle the high-dimensionality issue with a Bayesian approach, we introduce a prior on Rp. We will assume it is absolutely continuous with a density π(·).
- Let us think of π(·) as the way we would model the distribution of β∗ prior to looking at the data.
- Instead of considering the posterior mean, the maximum a posteriori (MAP) estimate or other standard estimators used in the Bayesian framework, we adopt a variational approach.


Variational approach and exponential weights

- Let P be the set of all probability measures p on Rp such that ∫Rp β p(dβ) exists.

- Let λ > 0 and let π ∈ P be a prior on Rp.

- Define the pseudo-posterior by

  π̂n = arg min_{p∈P} ( ∫Rp ‖Y − Xβ‖₂² p(dβ) + λ K(p, π) ),

  with K(p, π) being the Kullback–Leibler divergence

  K(p, π) = ∫Rp log( p(β)/π(β) ) p(β) dβ  if p ≪ π,  and +∞ otherwise.

- Using π̂n, we define

  β̂EWA = ∫Rp β π̂n(β) dβ.


Exponential weights and a risk bound

Theorem 5.

1. The pseudo-posterior π̂n is given by the explicit formula

  π̂n(β) ∝ exp{ −λ⁻¹ ‖Y − Xβ‖₂² } π(β),   ∀β ∈ Rp.

2. If λ ≥ 4σ∗², then the pseudo-posterior mean β̂EWA satisfies

  E[‖X(β̂EWA − β∗)‖₂²] ≤ min_{p∈P} ( ∫Rp ‖X(β − β∗)‖₂² p(dβ) + λ K(p, π) ).

The second claim of the theorem is due to Leung and Barron (2006) and is a nice consequence of Stein's lemma.


Comments

- If we choose λ = 2σ∗², we get the classical Bayes posterior mean estimator. But we have no nice risk bound for it.

- If the prior π is a discrete measure

  π(β) = Σ_{j=1}^N πj δβj(β),

  then the risk bound simplifies to

  E[‖X(β̂EWA − β∗)‖₂²] ≤ min_{j=1,...,N} ( ‖X(βj − β∗)‖₂² + 4σ∗² log(1/πj) ).

  Inequalities of this type are usually called oracle inequalities.

- The precise knowledge of σ∗ is not necessary; one only needs a (not very rough) upper bound on σ∗.

- This result is very general and is not specifically designed to deal with the sparsity scenario.
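For the discrete prior above, the EWA is a simple weighted average and can be computed exactly; here is a minimal sketch (illustrative data and candidate set, not from the slides).

```python
import numpy as np

# Minimal sketch (illustrative): EWA over a finite set of candidate vectors
# beta_1, ..., beta_N with a uniform discrete prior.
rng = np.random.default_rng(6)
n, p, N, sigma_star = 100, 20, 50, 1.0

X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:3] = 1.0
Y = X @ beta_star + sigma_star * rng.standard_normal(n)

# Candidate vectors: random sparse guesses; one of them equals beta*
candidates = [rng.permutation(np.r_[np.ones(3), np.zeros(p - 3)]) for _ in range(N - 1)]
candidates.append(beta_star.copy())
prior = np.full(N, 1.0 / N)

lam = 4 * sigma_star**2                      # temperature used in Theorem 5
losses = np.array([np.sum((Y - X @ b) ** 2) for b in candidates])
log_w = np.log(prior) - losses / lam
w = np.exp(log_w - log_w.max())              # numerically stable normalization
w /= w.sum()

beta_ewa = sum(wj * bj for wj, bj in zip(w, candidates))
print("weight on the candidate equal to beta*:", w[-1])
print("prediction error ||X(beta_EWA - beta*)||^2 / n:",
      np.sum((X @ (beta_ewa - beta_star)) ** 2) / n)
```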


Back to the sparsity scenario

- To deal with sparsity, we need to introduce a prior π which promotes sparsity.

- The Ridge (L2-penalty) and Lasso (L1-penalty) estimators can be seen as MAP estimators with, respectively, a Gaussian and a Laplace prior.

- We will use a more heavy-tailed one: the product of (scaled) Student t(3) densities

  π(β) ∝ Π_{j=1}^p (τ² + βj²)⁻²,   ∀β ∈ Rp,

  for some tuning parameter τ > 0.


Does π favor sparsity ?

The boxplots of a sample of size 10⁴ drawn from the scaled Gaussian, Laplace and Student t(3) distributions. In all three cases the location parameter is 0 and the scale parameter is 10⁻².


Compare with the wavelet coefficients

The boxplot of a sample of size 10⁴ of the wavelet coefficients of the image of Lena.


Sparsity oracle inequality for the EWA

Theorem 6 (D. and Tsybakov, 2007). Let π be the sparsity favoring prior and let β̂EWA be the exponentially weighted aggregate with λ ≥ 4σ∗². Then, for every τ > 0,

  E[‖X(β̂EWA − β∗)‖₂²] ≤ min_{β∈Rp} { ‖X(β − β∗)‖₂² + λ Σ_{j=1}^p log(1 + τ⁻¹|βj|) } + 11 τ² p n.

In particular, if β∗ is s-sparse, then

  E[ (1/n) ‖X(β̂EWA − β∗)‖₂² ] ≤ (sλ/n) log(1 + τ⁻¹‖β∗‖∞) + 11 τ² p.

This result provides fast rates under no assumption on the dictionary.


Langevin Monte-Carlo

- Although the EWA can be written in an explicit form, its computation is not trivial because of the p-fold integral.
- Naive Monte-Carlo methods fail in moderately large dimensions (p = 50).
- A specific type of Markov chain Monte-Carlo technique, called Langevin Monte-Carlo, turns out to be very efficient.


Langevin Monte-Carlo

- Let us consider the linear regression setup. We wish to compute the EWA, which can be written as

  β̂EWA = C ∫ β e^{−λ⁻¹‖Y−Xβ‖₂²} π(dβ),

  where C is the normalization constant.

- We can rewrite

  β̂EWA = ∫Rp β pV(β) dβ,

  where pV(β) ∝ e^{V(β)} is a density function.

- pV is the invariant density of the Langevin diffusion

  dLt = ∇V(Lt) dt + √2 dWt,   L0 = β0,   t ≥ 0.
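A minimal sketch (illustrative, not the authors' implementation) of the resulting algorithm: the diffusion is discretized with a small step h (unadjusted Langevin algorithm), using V(β) = −λ⁻¹‖Y − Xβ‖₂² + log π(β) with the Student t(3)-type prior introduced earlier; the step size, burn-in and scale parameters below are ad hoc choices.

```python
import numpy as np

# Minimal sketch (illustrative): unadjusted Langevin algorithm
# L_{k+1} = L_k + h * grad V(L_k) + sqrt(2h) * N(0, I); iterates are averaged
# to approximate beta_EWA.
rng = np.random.default_rng(7)
n, p, s, sigma_star = 100, 200, 5, 1.0
X = rng.choice([-1.0, 1.0], size=(n, p))
beta_star = np.zeros(p); beta_star[:s] = 2.0
Y = X @ beta_star + sigma_star * rng.standard_normal(n)

lam = 4 * sigma_star**2                          # temperature, as in Theorem 5
tau = 4 * sigma_star / np.linalg.norm(X, 2)      # prior scale (illustrative)
h = sigma_star**2 / np.linalg.norm(X, 2)**2      # step size (illustrative, conservative)

def grad_V(beta):
    # V(beta) = -||Y - X beta||^2 / lam - 2 * sum_j log(tau^2 + beta_j^2) + const
    return 2.0 / lam * (X.T @ (Y - X @ beta)) - 4.0 * beta / (tau**2 + beta**2)

n_iter, burn_in = 10000, 2000
beta = np.zeros(p)
avg = np.zeros(p)
for k in range(n_iter):
    beta = beta + h * grad_V(beta) + np.sqrt(2 * h) * rng.standard_normal(p)
    if k >= burn_in:
        avg += beta / (n_iter - burn_in)
print("prediction error ||X(beta_EWA - beta*)||^2 / n:",
      np.sum((X @ (avg - beta_star)) ** 2) / n)
```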


Numerical experiments, Example 1: compressed sensing

- Input: n, p and s, all positive integers.
- Covariates: we generate an n × p matrix X with i.i.d. Rademacher entries.
- Errors: we generate a standard Gaussian vector ξ.
- Noise magnitude: σ∗ = √(s/9).
- Response: Y = Xβ∗ + σ∗ξ, where β∗ = [1l(j ≤ s); j ≤ p].
- Tuning parameters: λ = 4σ², τ = 4σ/‖X‖2, h = 4σ²/‖X‖₂².


Numerical experiments, Example 1: compressed sensing

[Figure: typical outcome for n = 200, p = 500 and s = 20; left panel: exponentially weighted aggregate, right panel: Lasso.]


Numerical experiments, Example 1: compressed sensing

                     p = 200                  p = 500
                     EWA        Lasso         EWA        Lasso
n = 100, s = 5       0.064      1.442         0.087      1.616
                    (0.043)    (0.461)       (0.054)    (0.491)
                     T = 1                    T = 1
n = 100, s = 10      1.153      5.712         1.891      6.508
                    (1.091)    (1.157)       (1.522)    (1.196)
                     T = 2                    T = 5
n = 100, s = 15      6.839     11.149         8.917     11.82
                    (1.896)    (1.303)       (2.186)    (1.256)
                     T = 5                    T = 10


Image denoising and inpainting: a simple example

- Input: n, k positive integers and σ > 0.
- We generate n vectors Ui of R² uniformly distributed in [0, 1]².
- Covariates: φj(u) = 1l[0, j1/k]×[0, j2/k](u).
- Errors: we generate a centered Gaussian vector ξ with covariance matrix σ²I.
- Response: Yi = (φ1(Ui), . . . , φk²(Ui))⊤ β∗ + ξi, where β∗ = [1l(j ∈ {10, 100, 200})]′.
- Tuning parameters: the same rule as previously.


Image denoising and inpainting

The original image, its sampled noisy version and the denoised image.


Image denoising

          n = 100                   n = 200
σ         EWA        Ideal LG       EWA        Ideal LG
2         0.210      0.330          0.187      0.203
         (0.072)    (0.145)        (0.048)    (0.086)
          T = 1                     T = 1
4         0.420      0.938          0.278      0.571
         (0.222)    (0.631)        (0.132)    (0.324)
          T = 1                     T = 1


Chapter IV. The case of unknown noise level


Multiple linear regression

We are still dealing with the following model: we observe a vector Y ∈ Rn and a matrix X ∈ Rn×p such that

  Y = Xβ∗ + σ∗ξ,    (3)

where
- β∗ ∈ Rp is the unknown regression vector,
- ξ ∈ Rn is a random vector referred to as noise; we will assume it is Gaussian Nn(0, In),
- σ∗ is the noise level (known or unknown depending on the application).

Under the sparsity scenario with unknown σ∗, is it possible to get the same guarantees as in the case of known σ∗?


Dependence on the noise level

All three methods we have seen so far depend on the noise level through the tuning parameter.

- The Lasso and the Dantzig selector:

  β̂Lasso ∈ arg min_{β∈Rp} ( (1/2)‖Y − Xβ‖₂² + λ‖β‖1 ),

  β̂DS ∈ arg min_{β∈Rp : ‖X⊤(Y−Xβ)‖∞ ≤ λ} ‖β‖1.

  The choice λ = C σ∗ √(n log p) leads to estimators with nice theoretical guarantees.

- The EWA:

  β̂EWA = ∫Rp β exp( −(1/λ)‖Y − Xβ‖₂² ) π(β) dβ  /  ∫Rp exp( −(1/λ)‖Y − Xβ‖₂² ) π(β) dβ.

  The choice λ ≥ 4σ∗² leads to an estimator with nice theoretical guarantees.


Method of substitution

In some applications, it is reasonable to assume that an unbiased and consistent estimator σ̂² of σ∗² is available which is independent of Y.

Examples:

1. One can observe two independent copies of Y, denoted by Y′ and Y″. Then σ̂² = ‖Y′ − Y″‖₂²/(2n) is a consistent and unbiased estimator of σ∗², independent of Y = (Y′ + Y″)/2.

2. The recording device can be used in an environment without signal: one can record Z = σ∗η. Then σ̂² = (1/n)‖Z‖₂² is a consistent and unbiased estimator of σ∗².

In such a context, one can substitute σ̂ for σ∗ (i.e., σ̂² for σ∗²) in the choice of λ for the Lasso, the DS and the EWA, and get nearly the same guarantees.
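A quick numerical check of Example 1 (illustrative, not from the slides): the difference of two independent copies has distribution N(0, 2σ∗²In), so ‖Y′ − Y″‖₂²/(2n) is unbiased for σ∗² and independent of their average.

```python
import numpy as np

# Minimal sketch (illustrative): variance estimation from two independent copies of Y.
rng = np.random.default_rng(8)
n, p, s, sigma_star = 500, 100, 5, 2.0
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:s] = 1.0

Y1 = X @ beta_star + sigma_star * rng.standard_normal(n)
Y2 = X @ beta_star + sigma_star * rng.standard_normal(n)

sigma2_hat = np.sum((Y1 - Y2) ** 2) / (2 * n)   # unbiased for sigma*^2
Y = 0.5 * (Y1 + Y2)                             # kept for the estimation of beta*
print("sigma*^2 =", sigma_star**2, "  sigma2_hat =", sigma2_hat)
```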


Scaled Lasso

In many applications, only (Y, X) is observed and it is impossible to get an estimator of σ∗ independent of Y (biology, econometrics, ...). In such a setting, it is natural to try to jointly estimate β∗ and σ∗.

- The negative log-likelihood is then given by

  L(Y; β, σ) = n log(σ) + (1/(2σ²)) ‖Y − Xβ‖₂².

- Following the same ideas as in the case of known σ∗, one can add a penalty term of the form γ‖β‖1.

- To make the cost function (penalized log-likelihood) homogeneous, γ is chosen equal to λ/σ.

- The scaled Lasso (Städler et al., 2010) is defined as

  (β̂ScL, σ̂ScL) ∈ arg min_{(β,σ)∈Rp×R+} PL(Y; β, σ),   where PL(Y; β, σ) = n log(σ) + (1/(2σ²))‖Y − Xβ‖₂² + (λ/σ)‖β‖1.


Properties of the Scaled Lasso

Computational aspects:

- The function (β, σ) ↦ PL(Y; β, σ) is not convex!
- Transformation of the parameters: the function (φ, ρ) ↦ PL(Y; φ/ρ, 1/ρ) is convex!
- Strategy: minimize w.r.t. (φ, ρ) the cost function

  −n log(ρ) + (1/2)‖ρY − Xφ‖₂² + λ‖φ‖1

  and set σ̂ = 1/ρ̂ and β̂ = φ̂/ρ̂ (see the sketch after this slide).

Theoretical aspects:

- In the more general model of Gaussian mixtures, it has been proved by Städler et al. (2010) that, under the RE property, with large probability the error of estimation of (β∗, σ∗) is of the order s log⁴(n)/n.
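A minimal sketch of the reparametrized convex program (assuming the cvxpy package is available; the data, λ and solver call below are illustrative, this is not the authors' code): minimize −n log ρ + ½‖ρY − Xφ‖₂² + λ‖φ‖1 and set σ̂ = 1/ρ̂, β̂ = φ̂/ρ̂.

```python
import numpy as np
import cvxpy as cp

# Minimal sketch (assumes cvxpy; illustrative data): scaled Lasso via the
# convex reparametrization (phi, rho) = (beta/sigma, 1/sigma).
rng = np.random.default_rng(9)
n, p, s, sigma_star = 100, 200, 5, 1.0
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:s] = 2.0
Y = X @ beta_star + sigma_star * rng.standard_normal(n)

lam = np.sqrt(2 * n * np.log(p))     # illustrative tuning of the order sqrt(n log p)

phi = cp.Variable(p)
rho = cp.Variable(pos=True)
objective = (-n * cp.log(rho)
             + 0.5 * cp.sum_squares(rho * Y - X @ phi)
             + lam * cp.norm1(phi))
cp.Problem(cp.Minimize(objective)).solve()

sigma_hat = 1.0 / rho.value
beta_hat = phi.value / rho.value
print("sigma_hat =", sigma_hat, " (sigma* =", sigma_star, ")")
print("||beta_hat - beta*||_2 =", np.linalg.norm(beta_hat - beta_star))
```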


Square-root Lasso

- One weakness of the scaled Lasso is that the map ρ ↦ log(ρ) is not Lipschitz continuous on (0, +∞). Lipschitz continuity is often required for the convergence of convex optimization algorithms.

- To circumvent this drawback, it has been proposed by Antoniadis (2010) to replace the cost function PL(Y; β, σ) by

  PL1(Y; β, σ) = nσ + (1/(2σ)) ‖Y − Xβ‖₂² + λ‖β‖1

  and to define the square-root Lasso estimator

  (β̂SqRL, σ̂SqRL) ∈ arg min_{β,σ} PL1(Y; β, σ).

- This estimator can be computed by solving a second-order cone program. This can be done very efficiently even for very large values of p.


Scaled Dantzig selector

- One weakness of the scaled Lasso is that the map ρ ↦ log(ρ) is not Lipschitz continuous on (0, +∞).

- One can show that the first-order conditions for (β̂ScL, σ̂ScL) can be written as

  X⊤(Y − Xβ) ∈ λσ sign(β);   ‖Y‖₂² − Y⊤Xβ = nσ².

- The set defined by these conditions is not convex, but it is included in the convex set

  D = { (β, σ) : ‖X⊤(Y − Xβ)‖∞ ≤ λσ;  ‖Y‖₂² − Y⊤Xβ ≥ nσ² }.

- Therefore, we proposed to estimate (β∗, σ∗) by the estimator

  (β̂SDS, σ̂SDS) ∈ arg min_{(β,σ)∈D} ‖β‖1.

- This estimator can be computed by solving a second-order cone program. This can be done very efficiently even for very large values of p.
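A minimal sketch of the resulting convex program (again assuming cvxpy; illustrative data, not the authors' code): minimize ‖β‖1 over the set D; the value of λ matches the one used in the risk bound on the next slide.

```python
import numpy as np
import cvxpy as cp

# Minimal sketch (assumes cvxpy; illustrative data): scaled Dantzig selector,
# i.e. minimize ||beta||_1 over the convex set D defined on the slide above.
rng = np.random.default_rng(10)
n, p, s, sigma_star = 100, 200, 5, 1.0
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:s] = 2.0
Y = X @ beta_star + sigma_star * rng.standard_normal(n)

alpha = 0.05
lam = 2 * np.sqrt(n * np.log(p / alpha))     # same order as in the risk-bound theorem

beta = cp.Variable(p)
sigma = cp.Variable(pos=True)
constraints = [
    cp.norm_inf(X.T @ (Y - X @ beta)) <= lam * sigma,      # correlation constraint
    Y @ Y - Y @ (X @ beta) >= n * cp.square(sigma),        # variance constraint
]
cp.Problem(cp.Minimize(cp.norm1(beta)), constraints).solve()

print("sigma_hat =", sigma.value, " (sigma* =", sigma_star, ")")
print("||beta_hat - beta*||_2 =", np.linalg.norm(beta.value - beta_star))
```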


Risk bounds

Theorem (D. and Chen (2012)). Let us choose a significance level α ∈ (0, 1) and set λ = 2√(n log(p/α)). Assume that β∗ is such that ‖β∗‖0 ≤ s and

  ‖β∗‖1/σ∗ ≤ √( n/(2 log(1/α)) ) · ( 1/2 − 2√(n⁻¹ log(1/α)) ).    (4)

If X satisfies the condition R̃E(s, 1), then, with probability at least 1 − 3α, it holds that

  ‖X(β̂ − β∗)‖₂² ≤ 8 ((σ∗ + σ̂)/κ)² s log(p/α),    (5)

  ‖β̂ − β∗‖₂² ≤ 32 ((σ∗ + σ̂)/κ²)² s log(p/α)/n.    (6)

Moreover, with probability at least 1 − 4α,

  σ̂ ≤ σ∗ (3 + √(2 n⁻¹ log(1/α))).


References

- Donoho, David and Johnstone, Iain, Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 (1995), no. 432, 1200-1224.

- Candès, Emmanuel and Tao, Terence, The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist. 35 (2007), no. 6, 2313-2351.

- Bickel, Peter; Ritov, Ya'acov and Tsybakov, Alexandre, Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 37 (2009), no. 4, 1705-1732.

- Dalalyan, Arnak and Tsybakov, Alexandre, Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning 72 (2008), 39-61.

- Chen, Yin and Dalalyan, Arnak, Fused sparsity and robust estimation for linear models with unknown variance. Advances in Neural Information Processing Systems (NIPS 2012), 1-16.

- Mairal, Julien, Sparse coding for machine learning, image processing and computer vision. PhD thesis (2010).
