Dalalyan, A.S., Lecture I
Exploiting sparsity in high-dimensional statistical inference
CIMPA-UNESCO-MESR-MICINN Research School 2012 : New trends in Mathematical Statistics
Punta del Este, URUGUAY
Arnak S. Dalalyan
ENSAE / CREST / GENES
Paris, FRANCE
Outline of my lectures

Lecture I : Introduction to sparsity
• Examples and motivation
• Orthogonal covariates : hard and soft thresholding
• Risk bounds
• Stein's lemma and SureShrink

Lecture II : Non-orthogonal design : Lasso and Dantzig selector
• Compressed sensing
• Risk bounds under the restricted isometry (RI) property
• Risk bounds under the restricted eigenvalues (RE) property

Lecture III : Exponential weights and sparsity favoring priors
• Bayesian approach
• Sharp oracle inequalities for the EW-aggregate
• Risk bounds for arbitrary design

Lecture IV : Adaptation to the noise magnitude : scaled Lasso, square-root Lasso and SDS.
Lecture I. Introduction to sparsity
Multiple linear regression
Throughout these lectures, we will be dealing with the following model only : we observe a vector Y ∈ R^n and a matrix X ∈ R^{n×p} such that

Y = Xβ∗ + σ∗ξ,    (1)

where
• β∗ ∈ R^p is the unknown regression vector,
• ξ ∈ R^n is a random vector referred to as the noise ; we will assume it is Gaussian N_n(0, I_n),
• σ∗ is the noise level, which may be known or not depending on the application.

Sparsity scenario : p is not small compared to n (they are comparable, or p is larger than n), but we know that only a small number of the β_j's are significantly different from 0.
Example of sparsity : wavelet transform of an image
[Figure : 2D wavelet transform ("WT in 2D") of an image ; among the roughly 15000 coefficients, only a few are large, the rest are close to zero.]
Example of sparsity : robust estimation
[Figure : scatter plot of (x, y) data on [0, 1], with a few outlying points highlighted.]
Example of sparsity : background subtraction
Orthonormal dictionaries
We first consider the case of orthonormal dictionaries, that is, n = p and the matrix X satisfies

X X^T = X^T X = n I_n.

• Typical examples are the Fourier or wavelet transform, the basis provided by PCA, ...
• Equation (1) can be rewritten as

Y = Xβ∗ + σ∗ξ  ⇐⇒  Z = β∗ + (σ∗/√n) ε,    (2)

where
  - Z = (1/n) X^T Y is the transformed response,
  - ε ∈ R^n is a random vector drawn from the Gaussian N_n(0, I_n) distribution.
• The right-hand side of (2) is often referred to as the Gaussian sequence model.
The oracle and its risk : Definitions
• Thus, we assume that we observe Z = (Z_1, . . . , Z_n)^T such that

Z_i = β∗_i + (σ∗/√n) ε_i,   ε_i iid ~ N(0, 1).

• Furthermore, we believe that the true vector β∗ is sparse, that is,

‖β∗‖_0 := Σ_{j=1}^n 1{β∗_j ≠ 0}  is much smaller than n.

• If we knew exactly which coordinates of β∗ are nonzero, that is, if we knew the sparsity pattern J∗ = {j : β∗_j ≠ 0}, we would estimate β∗ by β^o defined by

β^o_j = Z_j 1{j ∈ J∗}.

The vector β^o will be called the oracle.
• One easily checks that its risk is given by

R[β^o, β∗] = E[‖β^o − β∗‖_2^2] = Σ_{j∈J∗} E[(Z_j − β∗_j)^2] = s σ∗²/n,

where s = |J∗| = ‖β∗‖_0.
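The oracle-risk identity above is easy to check by simulation. A minimal Monte-Carlo sketch in pure Python, with assumed values for n, s and σ∗ (not from the slides):

```python
# Monte-Carlo sanity check of the oracle risk R[beta_o, beta*] = s * sigma^2 / n.
# The sizes and the signal below are illustrative assumptions.
import math, random

random.seed(0)
n, s, sigma = 100, 5, 1.0
beta_star = [1.0] * s + [0.0] * (n - s)      # s-sparse signal
J = set(range(s))                            # known sparsity pattern J*

def one_draw():
    # Z_i = beta*_i + (sigma/sqrt(n)) * eps_i,  eps_i ~ N(0, 1)
    z = [b + sigma / math.sqrt(n) * random.gauss(0.0, 1.0) for b in beta_star]
    beta_o = [z[j] if j in J else 0.0 for j in range(n)]   # oracle keeps only J*
    return sum((bo - b) ** 2 for bo, b in zip(beta_o, beta_star))

m = 10000
risk = sum(one_draw() for _ in range(m)) / m
print(risk, s * sigma ** 2 / n)   # both close to 0.05
```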
The oracle and its risk : Discussion

The risk of the oracle is given by

R[β^o, β∗] = s σ∗²/n.

• If we use the "naive" least-squares estimator β̂^LS = Z, which coincides here with the maximum likelihood estimator, we get the risk

R[β̂^LS, β∗] = σ∗²,

which is much worse than that of the oracle when s ≪ n.
• Is there an estimator of β∗ with a risk comparable to that of the oracle but which does not rely on the knowledge of J∗ ?
• The answer is "Yes" (up to logarithmic factors).
Sparsity aware estimation : Hard and Soft Thresholding
Two observations :
• If for a given j it is very likely that β∗_j = 0, then the estimator β̂_j = 0 is preferable to the ML estimator β̂_j = Z_j.
• If β∗_j = 0, then the corresponding observation Z_j is small.

Donoho and Johnstone proposed (in the early 1990s) to estimate β∗ by

Hard thresholding : for a given threshold t > 0, set

β̂^HT_j = Z_j 1{|Z_j| > tσ∗/√n},   j = 1, . . . , n.

Soft thresholding : for a given threshold t > 0, set

β̂^ST_j = (Z_j − (tσ∗/√n) sign(Z_j)) 1{|Z_j| > tσ∗/√n},   j = 1, . . . , n.
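The two rules above can be sketched in a few lines. The function names and the test values below are mine, not from the slides:

```python
# Hard and soft thresholding at level t*sigma/sqrt(n); illustrative sketch.
import math

def hard_threshold(z, t, sigma, n):
    """beta_j = Z_j * 1{|Z_j| > t*sigma/sqrt(n)}: keep or kill each coordinate."""
    lvl = t * sigma / math.sqrt(n)
    return [zj if abs(zj) > lvl else 0.0 for zj in z]

def soft_threshold(z, t, sigma, n):
    """beta_j = (Z_j - lvl*sign(Z_j)) * 1{|Z_j| > lvl}: shrink toward 0 by lvl."""
    lvl = t * sigma / math.sqrt(n)
    return [math.copysign(abs(zj) - lvl, zj) if abs(zj) > lvl else 0.0 for zj in z]

z = [2.0, 0.1, -1.5, 0.05]
print(hard_threshold(z, t=2.0, sigma=1.0, n=100))  # level is 0.2
print(soft_threshold(z, t=2.0, sigma=1.0, n=100))
```

Hard thresholding keeps the surviving coordinates untouched, while soft thresholding additionally pulls them toward zero by the threshold level.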
Risk bound for Hard Thresholding
Let us consider the slightly more general setting Z = β∗ + (σ∗/√n) ε with an s-sparse signal β∗ ∈ R^p and a Gaussian N_p(0, I_p) vector ε (now p is not necessarily equal to n).

Theorem 1. If β̂ is the hard thresholding estimator

β̂_j = Z_j 1{|Z_j| > tσ∗/√n},   j = 1, . . . , p,

then, for every t ≥ √2, it holds that

R[β̂, β∗] ≤ (2tp e^{−t²/2} + 3t²s) σ∗²/n.

Proof. On the whiteboard.

Remark. A similar bound holds true for the soft thresholding estimator.
Risk bound for Hard Thresholding : Consequences
• Assuming s ≥ 1 and p ≥ 3, we can choose t = √(2 log p), which leads to

R[β̂, β∗] ≤ 7.5 log(p) s σ∗²/n.

• Up to the factor 7.5 log(p), this bound coincides with that of the oracle.
• The choice t = √(2 log p) is commonly known as the universal choice of the threshold. It can be proved that the factor log(p) is an unavoidable price to pay for not knowing the locations of the nonzero entries of β∗.
• The universal choice of t is of the correct order of magnitude, but it turns out to be too large for applications. Smaller values of t, chosen in a data-dependent manner, are often preferable.
Stein’s lemma
Let Z = β∗ + (σ∗/√n) ε with ε ∼ N_p(0, I_p).

Stein's lemma. Let T_n(Z) = (T_{n,1}(Z), . . . , T_{n,p}(Z))^T be an estimator of β∗ such that the mapping z ↦ T_n(z) is continuous and weakly differentiable. Then, the quantity

r̂_n(Z) = ‖Z − T_n(Z)‖_2^2 + (2σ∗²/n) Σ_{i=1}^p ∂_{z_i} T_{n,i}(Z) − σ∗² p/n

is an unbiased estimator of the risk R[T_n, β∗].

Proof. On the whiteboard.
SURE-Shrink
• In view of this result, one easily checks that

SURE(t, Z) = Σ_{i=1}^p (Z_i² ∧ t²σ∗²/n) + (2σ∗²/n) Σ_{i=1}^p 1{|Z_i| > tσ∗/√n} − σ∗² p/n

is an unbiased estimator of the risk of the soft thresholding estimator.
• Therefore, Donoho and Johnstone proposed to choose the threshold t as the minimizer of the (random) function

t ↦ SURE(t, Z).
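This threshold selection can be sketched as a grid search minimizing SURE(t, Z). The grid, the simulated data and the function names below are assumptions of this example:

```python
# SureShrink-style threshold selection: minimize SURE(t, Z) over a grid of t.
# The data-generating values below are illustrative assumptions.
import math, random

def sure(t, z, sigma, n):
    """SURE(t, Z) = sum_i min(Z_i^2, t^2 s^2/n) + (2 s^2/n)#{|Z_i|>t s/sqrt(n)} - s^2 p/n."""
    lvl2 = (t * sigma) ** 2 / n
    term = sum(min(zi * zi, lvl2) for zi in z)
    k = sum(1 for zi in z if zi * zi > lvl2)
    return term + 2 * sigma ** 2 / n * k - sigma ** 2 * len(z) / n

random.seed(1)
n = p = 400
sigma, s = 1.0, 10
beta = [0.5] * s + [0.0] * (p - s)
z = [b + sigma / math.sqrt(n) * random.gauss(0, 1) for b in beta]

grid = [0.1 * k for k in range(1, 40)]                 # candidate thresholds
t_hat = min(grid, key=lambda t: sure(t, z, sigma, n))  # data-driven choice
print(t_hat, sure(t_hat, z, sigma, n))
```

The data-driven t_hat is typically smaller than the universal choice √(2 log p), in line with the remark above.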
SURE-Shrink
FIGURE : Illustration of the choice of the threshold using SURE. Left : the function t ↦ SURE(t). Right : the true risk as a function of t.
SURE-Shrink vs universal choice
FIGURE : Root MSEs for simulated data at varying levels of sparsity when the threshold is chosen (a) by SURE and (b) by the universal choice.
Conclusion of the first lecture
• The hypothesis of sparsity of the unknown object appears naturally in various applications.
• In the case of an orthogonal dictionary, the soft and hard thresholding estimators are well suited to efficiently recover sparse structures.
• These estimators have a risk of almost the same order as the one of the oracle knowing the locations of the nonzero entries : the price for not knowing these positions is only logarithmic in p.
• What happens if the dictionary X is not orthonormal ?
Lecture II. Nonorthogonal dictionaries : Lasso and Dantzig selector
Summaries of yesterday’s and today’s talks
Yesterday : Introduction to sparsity
• Examples and motivation
• Orthogonal covariates : hard and soft thresholding
• Risk bounds
• Stein's lemma and SureShrink

Today : Non-orthogonal design : Lasso and Dantzig selector
• Compressed sensing
• Risk bounds under the restricted isometry (RI) property
• Risk bounds under the restricted eigenvalues (RE) property
How to get the slides
The webpage is at : http://arnak-dalalyan.fr
Examples of nonorthogonal dictionaries
• An orthogonal dictionary is (unfortunately) the exception rather than the rule.
• Examples of nonorthogonal dictionaries include :
  - the case when X is composed of a union of two or more orthonormal bases,
  - robust estimation (cf. the previous lecture),
  - nonparametric regression with irregular design,
  - compressed sensing,
  - ...
Compressed sensing : What is it about ?
• Assume you need to acquire a large vector µ∗ ∈ R^p.
• Assume that µ∗ admits a sparse representation, i.e., for a given orthogonal matrix W we have µ∗ = Wβ∗ with a sparse vector β∗.
• We are able to measure (up to a measurement error) any linear combination of the entries of µ∗. Thus, we can acquire the vector Aµ∗ + noise for any n × p sensing matrix A.
• One option is to acquire (noisy versions of) all the entries of µ∗ and then to apply the SureShrink procedure.
• Bad choice : this requires too many measurements.
• Questions :
  - What is the "minimal" number of measurements sufficient for a good reconstruction of µ∗ ?
  - How should the sensing matrix A be chosen ?
• Compressed sensing is the theory studying these questions.
Compressed sensing : Some remarks
• The minimal number n is obviously ≤ p, and it is indeed < p if β∗ is sparse.
• If n < p, there is no way for the matrix X = AW to be orthogonal (why ?).
• If n < p, the linear system Y = Xβ is under-determined and, therefore, the set of its solutions is infinite (of continuum cardinality). So, to apply thresholding, one has to first find an estimator to be thresholded.
• The orthogonality of X is equivalent to

((1/n) X^T X)^{1/2} β = β   for all β ∈ R^p.

But why do we care about all β ?
• We are only interested in a particular type of vectors β : sparse vectors !
Restricted Isometry Property (RIP)
Definition. Let X be an n × p matrix and let s ∈ {1, . . . , p}. We say that X satisfies the restricted isometry property RIP(s) with constant δ if

(1 − δ)‖u‖_2^2 ≤ (1/n)‖Xu‖_2^2 ≤ (1 + δ)‖u‖_2^2

for every vector u ∈ R^p with ‖u‖_0 ≤ s. We denote by δ_s(X) the smallest δ such that X satisfies RIP(s) with constant δ.

Remarks
1. Any orthogonal matrix satisfies RIP(s) with the constant δ = 0 for any s ∈ {1, . . . , p}.
2. If X = AW with an orthogonal matrix W, then X satisfies RIP(s, δ) if and only if √n A satisfies RIP(s, δ). (Exercise)
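For tiny matrices, δ_2(X) can be computed by brute force: every 2-sparse direction lives in a 2×2 principal block of (1/n) X^T X, whose extreme eigenvalues give the worst-case distortion. An illustrative sketch (exact enumeration is exponential in s, so this is feasible only for toy sizes; the function name and test matrix are mine):

```python
# Brute-force computation of the RIP constant delta_2(X) for a tiny matrix:
# scan all column pairs and take the worst eigenvalue deviation from 1 in the
# corresponding 2x2 block of the normalized Gram matrix (1/n) X^T X.
import itertools, math

def delta_2(X):
    n, p = len(X), len(X[0])
    def gram(j, k):                       # (1/n) <X_j, X_k>
        return sum(X[i][j] * X[i][k] for i in range(n)) / n
    worst = 0.0
    for j, k in itertools.combinations(range(p), 2):
        a, d, b = gram(j, j), gram(k, k), gram(j, k)
        # eigenvalues of [[a, b], [b, d]] are mean +/- rad
        mean, rad = (a + d) / 2, math.hypot((a - d) / 2, b)
        worst = max(worst, abs(mean + rad - 1), abs(mean - rad - 1))
    return worst

# Sanity check matching Remark 1: an orthogonal design X = sqrt(n) * I
# (here n = p = 4, so sqrt(n) = 2) has delta_2 = 0.
X = [[2.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(delta_2(X))   # 0.0
```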
Lasso and Dantzig selector
From now on, we assume that the columns of X are normalized in such a way that

(1/n) Σ_{i=1}^n x_{ij}² = 1,   ∀ j = 1, . . . , p.

This does not cause any loss of generality (why ?).

We fix a λ > 0 and define the two estimators

β̂^Lasso ∈ argmin_{β∈R^p} ( (1/(2σ∗²))‖Y − Xβ‖_2^2 + (λ/σ∗)‖β‖_1 ),

β̂^DS ∈ argmin_{β∈R^p : ‖X^T(Y−Xβ)‖_∞ ≤ λσ∗} ‖β‖_1.

Remarks. Both the Lasso and the Dantzig selector can be efficiently computed even for very large datasets using convex programming. In fact, one solves a second-order cone program for the Lasso and a linear program for the Dantzig selector.
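In practice the Lasso is computed with dedicated solvers; as a minimal pure-Python illustration, here is a proximal-gradient (ISTA) sketch for the equivalent objective ½‖Y − Xβ‖₂² + γ‖β‖₁ (multiplying the objective above by σ∗² gives γ = λσ∗). The step size, data and iteration count are assumptions of this example, not the slides' method:

```python
# Minimal ISTA (proximal gradient) sketch for the Lasso objective
# 0.5 * ||Y - X beta||^2 + gamma * ||beta||_1.
import math

def ista(X, Y, gamma, iters=500):
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    # conservative step: 1/||X||_F^2 <= 1/lambda_max(X^T X)
    step = 1.0 / sum(x * x for row in X for x in row)
    for _ in range(iters):
        resid = [Y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        grad = [-sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        for j in range(p):
            v = beta[j] - step * grad[j]
            # soft thresholding = proximal operator of gamma * ||.||_1
            beta[j] = math.copysign(max(abs(v) - step * gamma, 0.0), v)
    return beta

# Tiny well-conditioned example whose exact Lasso solution is (0.95, 0).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [1.0, 0.0, 1.0]
b = ista(X, Y, gamma=0.1)
print(b)   # close to [0.95, 0.0]
```

Note the reappearance of soft thresholding as the proximal step: for an orthogonal design, ISTA converges in one step to the soft thresholding estimator, consistent with the exercise on the next slide.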
Theoretical guarantees
Exercise. If X is orthogonal, then β̂^Lasso = β̂^DS = β̂^ST.

Theorem 2 (Candès & Tao, AoS 2007). Let α ∈ (0, 1) be a tolerance level and let β∗ be s-sparse. If X satisfies RIP(2s) with a constant δ < √2 − 1, then the Dantzig selector based on λ = √(2n log(p/α)) satisfies the inequality

P( ‖β̂^DS − β∗‖_2 ≤ C σ∗ √(s log(2p/α)/n) ) ≥ 1 − α,

with C = 8(√2 − 1)/(√2 − 1 − δ).

This result implies that the "rate of convergence" to zero of the estimation error is of the order

σ∗ √(s log(2p/α)/n),

which coincides with the one we obtained for hard thresholding in the case of orthogonal dictionaries.
Proof of Theorem 2 : Some auxiliary results
Lemma. Assume that Y = Xβ∗ + σ∗ξ with ‖X_j‖_2^2 = n for every j.
1. If ξ ∼ N_n(0, I_n), then the probability of the event

B = { ‖X^T(Y − Xβ∗)‖_∞ ≤ σ∗ √(2n log(2p/α)) }

is larger than or equal to 1 − α.
2. If u, v ∈ R^p are s-sparse and u ⊥ v, then (1/n)|⟨Xu, Xv⟩| ≤ δ_{2s}‖u‖_2‖v‖_2.
3. If h = β̂^DS − β∗, then ‖h_{J^c}‖_1 ≤ ‖h_J‖_1 on the event B.
4. Let J = J_0 = {j : β∗_j ≠ 0} and let J_k be recursively defined as the set of indices j ∉ J ∪ J_1 ∪ . . . ∪ J_{k−1} of the s largest entries of the vector (|h_1|, . . . , |h_p|). Then, on the event B,

‖h_{(J∪J_1)^c}‖_2 ≤ Σ_{k≥2} ‖h_{J_k}‖_2 ≤ (1/√s)‖h_{J^c}‖_1 ≤ ‖h_J‖_2.

Proof on the whiteboard.
Main consequences
• Theorem 2 is of interest in the noiseless setting as well, that is, when σ∗ = 0. It tells us that if X satisfies the RIP, then we are able to perfectly reconstruct β∗ from a small number of measurements.
• If we are looking for an estimate which is γ-accurate, then it is sufficient to take

n = C σ∗² s log(2p/α) / γ².

This number is typically much smaller than the dimensionality p !
• If the ratio s/p is not too close to 1, then matrices satisfying RIP(s) exist. The most famous example is a random matrix with i.i.d. N(0, 1) entries.
• So, this answers the two questions we have seen several slides before.
Lecture III. Nonorthogonal dictionaries : Lasso, Dantzig selector and the Bayesian approach
Curse of dimensionality and sparsity
High dimensionality has significantly challenged traditional statistical theory.
• In linear regression, the accuracy of the LSE is of order p/n, where p is the number of covariates.
• In many applications, p is much larger than n.

The sparsity assumption provides a compelling theoretical framework for dealing with high dimension.
• Even if the number of parameters describing the model in a general setup is large, only a few of them participate in the process of data generation.
• No a priori information on the set of relevant parameters is available.
• It is only known (or postulated) that the cardinality of the set of relevant parameters is small.
Risk bound for the DS : Slow rates
The previous result is of the form "all or nothing" : if the RIP is not satisfied, then we do not know anything about the risk of the DS.
• If we do not want to impose any assumption on X, then consistent estimation of β∗ is hopeless. Indeed, in the under-determined case, which is the one of interest for us, β∗ is not identifiable.
• But we can be more optimistic if the goal is to estimate the vector Xβ∗ only (as in Yannick Baraud's talk yesterday).
Slow rates for the DS
Theorem 3. If λ = √(2n log(2p/α)), then with probability at least 1 − α it holds that

(1/n)‖X(β̂^DS − β∗)‖_2^2 ≤ 4σ∗ √(2 log(2p/α)/n) (‖β∗‖_1 ∧ ‖(β̂^DS − β∗)_J‖_1).

Remark. A similar result holds true for the Lasso estimator as well.

Consequence : if the ℓ_1-norm of β∗ is finite (not too large) and s log(p)/n is small, then the Dantzig selector is consistent w.r.t. the prediction loss, and the latter is O(√(s log(p)/n)).
Restricted eigenvalues property
The risk of the Lasso estimator will be analyzed under a different condition on X.

Definition. Let X be an n × p matrix and let s ∈ {1, . . . , p}. For a positive constant c_0, we say that X satisfies the restricted eigenvalues property RE(s, c_0) (respectively R̃E(s, c_0)) if

κ(s, c_0)² := min_{J⊂{1,...,p}, |J|≤s}  min_{u≠0 : ‖u_{J^c}‖_1 ≤ c_0‖u_J‖_1}  ‖Xu‖_2² / (n‖u_J‖_2²)  > 0

(respectively,

κ̃(s, c_0)² := min_{J⊂{1,...,p}, |J|≤s}  min_{u≠0 : ‖u_{J^c}‖_1 ≤ c_0‖u_J‖_1}  ‖Xu‖_2² / (n‖u‖_2²)  > 0).
Restricted eigenvalues property
Remarks
1. Any orthogonal matrix satisfies RE(s, c_0) with κ(s, c_0) = 1.
2. According to claim 3 of the last lemma, ‖(β̂^DS − β∗)_{J^c}‖_1 ≤ ‖(β̂^DS − β∗)_J‖_1 on the event B and, therefore, if R̃E(s, 1) holds true, then

(1/n)‖X(β̂^DS − β∗)‖_2^2 ≥ κ̃²(s, 1) ‖β̂^DS − β∗‖_2^2.

3. The RE property looks simpler than the RIP, but the two are not comparable (neither one implies the other).
• The advantage of the RE is that there is no upper bound.
• The weakness of the RE is that the minimum is taken over a larger set than the one in the RIP.
Risk bound for the Lasso : Fast rates
Theorem 4. Let p ≥ 2 and let assumption RE(s, 3) be fulfilled. If λ = √(8n log(2p/α)), then with probability at least 1 − α it holds that

‖β̂^Lasso − β∗‖_1 ≤ (2^5 / κ²(s, 3)) σ∗ s √(2 log(2p/α)/n),

‖X(β̂^Lasso − β∗)‖_2^2 ≤ (2^7 / κ²(s, 3)) σ∗² s log(2p/α).

Furthermore, if R̃E(s, 3) is fulfilled, then

‖β̂^Lasso − β∗‖_2^2 ≤ (2^9 / κ̃⁴(s, 3)) σ∗² s log(2p/α) / n.
Two examples (Mairal et al. (2011))
Bayes setting
• To tackle the high-dimensionality issue with a Bayesian approach, we introduce a prior on R^p. We will assume it is absolutely continuous with a density π(·).
• Let us think of π(·) as the way we would model the distribution of β∗ prior to looking at the data.
• Instead of considering the posterior mean, the maximum a posteriori (MAP) estimate or other standard estimators used in the Bayesian framework, we adopt a variational approach.
Variational approach and exponential weights
• Let P be the set of all probability measures p on R^p such that ∫_{R^p} β p(dβ) exists.
• Let λ > 0 and let π ∈ P be a prior on R^p.
• Define the pseudo-posterior by

π̂_n = argmin_{p∈P} ( ∫_{R^p} ‖Y − Xβ‖_2^2 p(dβ) + λ K(p, π) ),

with K(p, π) being the Kullback-Leibler divergence

K(p, π) = ∫_{R^p} log( p(β)/π(β) ) p(β) dβ  if p ≪ π,  and +∞ otherwise.

• Using π̂_n, we define

β̂^EWA = ∫_{R^p} β π̂_n(β) dβ.
Exponential weights and a risk bound
Theorem 5.
1. The pseudo-posterior π̂_n is given by the explicit formula

π̂_n(β) ∝ exp{ −λ^{−1} ‖Y − Xβ‖_2^2 } π(β),   ∀ β ∈ R^p.

2. If λ ≥ 4σ∗², then the pseudo-posterior mean β̂^EWA satisfies

E[‖X(β̂^EWA − β∗)‖_2^2] ≤ min_{p∈P} ( ∫_{R^p} ‖X(β − β∗)‖_2^2 p(dβ) + λ K(p, π) ).

The second claim of the theorem is due to Leung and Barron (2006) and is a nice consequence of Stein's lemma.
Comments
• If we choose λ = 2σ∗², we get the classical Bayes posterior mean estimator. But we have no nice risk bound for it.
• If the prior π is a discrete measure

π(β) = Σ_{j=1}^N π_j δ_{β_j}(β),

then the risk bound simplifies to

E[‖X(β̂^EWA − β∗)‖_2^2] ≤ min_{j=1,...,N} ( ‖X(β_j − β∗)‖_2^2 + 4σ∗² log(1/π_j) ).

Inequalities of this type are usually called oracle inequalities.
• The precise knowledge of σ∗ is not necessary ; one only needs a (not very rough) upper bound on σ∗.
• This result is very general, and is not specifically designed to deal with the sparsity scenario.
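For a discrete prior, the EWA reduces to a finite weighted average with weights w_j ∝ π_j exp(−λ⁻¹‖Y − Xβ_j‖₂²), which makes the oracle inequality concrete. A sketch with assumed toy data and candidate vectors (all names and numbers below are mine):

```python
# EWA for a discrete prior pi = sum_j pi_j * delta_{beta_j}:
# the pseudo-posterior puts weight w_j on candidate beta_j proportionally to
# pi_j * exp(-||Y - X beta_j||^2 / lambda), and the EWA is the weighted mean.
import math

def ewa_discrete(X, Y, betas, prior, lam):
    n, p = len(X), len(X[0])
    def rss(b):
        return sum((Y[i] - sum(X[i][j] * b[j] for j in range(p))) ** 2
                   for i in range(n))
    logw = [math.log(pj) - rss(b) / lam for b, pj in zip(betas, prior)]
    m = max(logw)                                   # log-sum-exp for stability
    w = [math.exp(l - m) for l in logw]
    z = sum(w)
    w = [wi / z for wi in w]
    return [sum(w[j] * betas[j][k] for j in range(len(betas))) for k in range(p)]

X = [[1.0, 0.0], [0.0, 1.0]]
Y = [1.0, 0.0]
betas = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]        # candidate vectors
b_hat = ewa_discrete(X, Y, betas, prior=[1/3, 1/3, 1/3], lam=0.5)
print(b_hat)
```

Here the first candidate fits Y perfectly, so it receives most of the weight and the aggregate lands close to it, as the oracle inequality suggests.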
Back to the sparsity scenario
• To deal with sparsity, we need to introduce a prior π which promotes sparsity.
• The Ridge (L2-penalty) and Lasso (L1-penalty) estimators can be seen as MAP estimators with, respectively, a Gaussian and a Laplace prior.
• We will use a more heavy-tailed one : the product of (scaled) Student t(3) densities

π(β) ∝ Π_{j=1}^p (τ² + β_j²)^{−2},   ∀ β ∈ R^p,

for some tuning parameter τ > 0.
Does π favor sparsity ?
The boxplots of a sample of size 10^4 drawn from the scaled Gaussian, Laplace and Student t(3) distributions. In all three cases the location parameter is 0 and the scale parameter is 10^{−2}.
Compare with the wavelet coefficients
The boxplot of a sample of size 10^4 of the wavelet coefficients of the image of Lena.
Sparsity oracle inequality for the EWA
Theorem 6 (D. and Tsybakov 2007). Let π be the sparsity favoring prior and let β̂^EWA be the exponentially weighted aggregate with λ ≥ 4σ∗². Then, for every τ > 0,

E[‖X(β̂^EWA − β∗)‖_2^2] ≤ min_{β∈R^p} { ‖X(β − β∗)‖_2^2 + λ Σ_{j=1}^p log(1 + τ^{−1}|β_j|) } + 11τ²pn.

In particular, if β∗ is s-sparse, then

E[(1/n)‖X(β̂^EWA − β∗)‖_2^2] ≤ (sλ/n) log(1 + τ^{−1}‖β∗‖_∞) + 11τ²p.

This result provides fast rates under no assumption on the dictionary.
Langevin Monte-Carlo
• Although the EWA can be written in an explicit form, its computation is not trivial because of the M-fold integral.
• Naive Monte-Carlo methods fail in moderately large dimensions (M = 50).
• A specific type of Markov Chain Monte-Carlo technique, called Langevin Monte-Carlo, turns out to be very efficient.
Langevin Monte-Carlo
• Let us consider the linear regression setup. We wish to compute the EWA, which can be written as

β̂^EWA = C ∫ β e^{−λ^{−1}‖Y−Xβ‖_2^2} π(dβ),

where C is the normalization constant.
• We can rewrite

β̂^EWA = ∫_{R^M} β p_V(β) dβ,

where p_V(β) ∝ e^{V(β)} is a density function.
• p_V is the invariant density of the Langevin diffusion

dL_t = ∇V(L_t) dt + √2 dW_t,   L_0 = β_0,   t ≥ 0.
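In practice the diffusion is discretised by an Euler scheme (the unadjusted Langevin algorithm), L_{k+1} = L_k + h ∇V(L_k) + √(2h) ξ_k with ξ_k ∼ N(0, I). A one-dimensional sketch with an assumed toy target p_V = N(0, 1), so that ∇V(x) = −x (step size and run length are also assumptions):

```python
# Unadjusted Langevin algorithm: Euler discretisation of the Langevin diffusion.
# Toy 1-d target p_V proportional to exp(V) with V(x) = -x^2/2, i.e. N(0, 1).
import math, random

random.seed(2)

def langevin_samples(grad_v, x0, h, n_steps):
    x, out = x0, []
    for _ in range(n_steps):
        # L_{k+1} = L_k + h * gradV(L_k) + sqrt(2h) * xi_k
        x = x + h * grad_v(x) + math.sqrt(2 * h) * random.gauss(0.0, 1.0)
        out.append(x)
    return out

xs = langevin_samples(lambda x: -x, x0=0.0, h=0.05, n_steps=200000)
mean = sum(xs) / len(xs)
var = sum(x * x for x in xs) / len(xs)
print(mean, var)   # close to 0 and 1, up to an O(h) discretisation bias
```

Averaging β over such a trajectory (in dimension M, with V(β) = −λ⁻¹‖Y − Xβ‖₂² + log π(β)) approximates the EWA.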
Numerical experiments. Example 1 : Compressed sensing
• Input : n, p and s, all positive integers.
• Covariates : we generate an n × p matrix X with i.i.d. Rademacher entries.
• Errors : we generate a standard Gaussian vector ξ.
• Noise magnitude : σ∗ = √(s/9).
• Response : Y = Xβ∗ + σ∗ξ, where β∗ = [1(j ≤ s); j ≤ p].
• Tuning parameters : λ = 4σ², τ = 4σ/‖X‖_2, h = 4σ²/‖X‖_2².
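The data-generating process above can be sketched directly (with small assumed sizes so that the pure-Python version runs quickly):

```python
# Example-1 data generation: Rademacher design, s-sparse beta*, sigma* = sqrt(s)/3.
# The sizes n, p, s below are small illustrative assumptions.
import math, random

random.seed(4)
n, p, s = 20, 50, 4
X = [[random.choice([-1.0, 1.0]) for _ in range(p)] for _ in range(n)]
beta_star = [1.0 if j < s else 0.0 for j in range(p)]     # beta* = [1(j <= s)]
sigma = math.sqrt(s) / 3                                   # sigma* = sqrt(s/9)
Y = [sum(X[i][j] * beta_star[j] for j in range(p)) + sigma * random.gauss(0, 1)
     for i in range(n)]
print(len(Y), sigma)
```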
Numerical experiments. Example 1 : Compressed sensing
[Figure : Exp.-Weighted Aggregate vs. Lasso. Typical outcome for n = 200, p = 500 and s = 20.]
Numerical experiments. Example 1 : Compressed sensing
                      p = 200                  p = 500
                  EWA        Lasso         EWA        Lasso
n = 100, s = 5    0.064      1.442         0.087      1.616
                  (0.043)    (0.461)       (0.054)    (0.491)
                  T = 1                    T = 1
n = 100, s = 10   1.153      5.712         1.891      6.508
                  (1.091)    (1.157)       (1.522)    (1.196)
                  T = 2                    T = 5
n = 100, s = 15   6.839      11.149        8.917      11.82
                  (1.896)    (1.303)       (2.186)    (1.256)
                  T = 5                    T = 10
Image denoising and inpainting : A simple example
• Input : n, k positive integers and σ > 0.
• We generate n vectors U_i of R² uniformly distributed in [0, 1]².
• Covariates : φ_j(u) = 1_{[0, j_1/k]×[0, j_2/k]}(u).
• Errors : we generate a centered Gaussian vector ξ with covariance matrix σ²I.
• Response : Y_i = (φ_1(U_i), . . . , φ_{k²}(U_i))^T β∗ + ξ_i, where β∗ = [1(j ∈ {10, 100, 200})]′.
• Tuning parameters : the same rule as previously.
Image denoising and inpainting
The original image, its sampled noisy version and the denoised image.
Image denoising
          n = 100                  n = 200
       EWA        Ideal LG      EWA        Ideal LG
σ = 2  0.210      0.330         0.187      0.203
       (0.072)    (0.145)       (0.048)    (0.086)
       T = 1                    T = 1
σ = 4  0.420      0.938         0.278      0.571
       (0.222)    (0.631)       (0.132)    (0.324)
       T = 1                    T = 1
Lecture IV. The case of unknown noise level
Multiple linear regression
We are still dealing with the following model : we observe a vector Y ∈ R^n and a matrix X ∈ R^{n×p} such that

Y = Xβ∗ + σ∗ξ,    (3)

where
• β∗ ∈ R^p is the unknown regression vector,
• ξ ∈ R^n is a random vector referred to as the noise ; we will assume it is Gaussian N_n(0, I_n),
• σ∗ is the noise level (known or unknown depending on the application).

Under the sparsity scenario with unknown σ∗, is it possible to get the same guarantees as in the case of known σ∗ ?
Dependence on the noise level
All three methods we have seen so far depend on the noise level through the tuning parameter.
• The Lasso and the Dantzig selector :

β̂^Lasso ∈ argmin_{β∈R^p} ( (1/2)‖Y − Xβ‖_2^2 + λ‖β‖_1 ),

β̂^DS ∈ argmin_{β∈R^p : ‖X^T(Y−Xβ)‖_∞ ≤ λ} ‖β‖_1.

The choice λ = Cσ∗√(n log(p)) leads to estimators with nice theoretical guarantees.
• The EWA :

β̂^EWA = ∫_{R^p} β exp( −(1/λ)‖Y − Xβ‖_2^2 ) π(β) dβ  /  ∫_{R^p} exp( −(1/λ)‖Y − Xβ‖_2^2 ) π(β) dβ.

The choice λ ≥ 4σ∗² leads to an estimator having nice theoretical guarantees.
Method of substitution
In some applications, it is reasonable to assume that an unbiased and consistent estimator σ̂² of σ∗² is available, which is independent of Y.

Examples :
1. One can observe two independent copies of Y, denoted by Y′ and Y″. Then σ̂² = ‖Y′ − Y″‖_2²/(2n) is a consistent and unbiased estimator of σ∗², independent of Y = (Y′ + Y″)/2.
2. The recording device can be used in an environment without signal : one can record Z = σ∗η. Then σ̂² = (1/n)‖Z‖_2² is a consistent and unbiased estimator of σ∗².

In such a context, one can substitute σ̂² for σ∗² in the choice of λ for the Lasso, the DS and the EWA, and get nearly the same guarantees.
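The first example can be checked by a short simulation (the signal and noise level below are assumed values for illustration):

```python
# Check of the "two independent copies" variance estimator:
# sigma_hat^2 = ||Y' - Y''||^2 / (2n) has expectation sigma^2, since
# Y' - Y'' = sigma * (xi' - xi'') with xi' - xi'' ~ N(0, 2 I_n).
import math, random

random.seed(3)
n, sigma = 50, 1.5
mu = [math.sin(i) for i in range(n)]          # arbitrary mean vector X beta*

def two_copies():
    y1 = [m + sigma * random.gauss(0, 1) for m in mu]
    y2 = [m + sigma * random.gauss(0, 1) for m in mu]
    return sum((a - b) ** 2 for a, b in zip(y1, y2)) / (2 * n)

m = 5000
est = sum(two_copies() for _ in range(m)) / m
print(est, sigma ** 2)   # average of the estimator is close to 2.25
```

Note that the unknown mean Xβ∗ cancels in the difference Y′ − Y″, which is why no knowledge of β∗ is needed.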
Scaled Lasso
In many applications, only (Y, X) is observed and it is impossible to get an estimator of σ∗ independent of Y (biology, econometrics, ...). In such a setting, it is natural to try to jointly estimate β∗ and σ∗.
• The negative log-likelihood is then given by

L(Y; β, σ) = n log(σ) + (1/(2σ²))‖Y − Xβ‖_2^2.

• Following the same ideas as in the case of known σ∗, one can add a penalty term of the form γ‖β‖_1.
• To make the cost function (penalized log-likelihood) homogeneous, γ is chosen equal to λ/σ.
• The scaled Lasso (Städler et al. 2010) is defined as

(β̂^ScL, σ̂^ScL) ∈ argmin_{(β,σ)∈R^p×R_+} ( n log(σ) + (1/(2σ²))‖Y − Xβ‖_2^2 + (λ/σ)‖β‖_1 ),

where the minimized expression is denoted PL(Y; β, σ).
Properties of the Scaled Lasso
Computational aspects :
• The function (β, σ) ↦ PL(Y; β, σ) is not convex !
• Transformation of the parameters : the function (φ, ρ) ↦ PL(Y; φ/ρ, 1/ρ) is convex !
• Strategy : minimize w.r.t. (φ, ρ) the cost function

−n log(ρ) + (1/2)‖ρY − Xφ‖_2^2 + λ‖φ‖_1

and set σ̂ = 1/ρ̂ and β̂ = φ̂/ρ̂.
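The change of variables can be verified numerically: with φ = β/σ and ρ = 1/σ, PL(Y; β, σ) coincides with the convex cost above. The toy numbers below are assumptions of this example:

```python
# Numerical check of the scaled-Lasso reparametrisation:
# PL(Y; beta, sigma) = n*log(sigma) + ||Y - X beta||^2/(2 sigma^2) + (lam/sigma)*||beta||_1
# equals -n*log(rho) + 0.5*||rho*Y - X phi||^2 + lam*||phi||_1
# when phi = beta/sigma and rho = 1/sigma.
import math

def pl(X, Y, beta, sigma, lam):
    n, p = len(X), len(X[0])
    rss = sum((Y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2
              for i in range(n))
    return n * math.log(sigma) + rss / (2 * sigma ** 2) + lam / sigma * sum(map(abs, beta))

def pl_convex(X, Y, phi, rho, lam):
    n, p = len(X), len(X[0])
    rss = sum((rho * Y[i] - sum(X[i][j] * phi[j] for j in range(p))) ** 2
              for i in range(n))
    return -n * math.log(rho) + 0.5 * rss + lam * sum(map(abs, phi))

X = [[1.0, 2.0], [3.0, 4.0]]
Y = [1.0, -1.0]
beta, sigma, lam = [0.5, -0.25], 2.0, 0.3
phi, rho = [b / sigma for b in beta], 1.0 / sigma
print(pl(X, Y, beta, sigma, lam), pl_convex(X, Y, phi, rho, lam))  # equal
```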
Theoretical aspects :
• In the more general model of Gaussian mixtures, it has been proved by Städler et al. (2010) that, under the RE property, with large probability the error of estimation of (β∗, σ∗) is of the order s log⁴(n)/n.
Square-root Lasso
• One weakness of the scaled Lasso is that the mapping

ρ ↦ log(ρ)

is not Lipschitz continuous on (0, +∞). Lipschitz continuity is often required for the convergence of convex optimization algorithms.
• To circumvent this drawback, it has been proposed by Antoniadis (2010) to replace the cost function PL(Y; β, σ) by

PL_1(Y; β, σ) = nσ + (1/(2σ))‖Y − Xβ‖_2^2 + λ‖β‖_1

and to define the square-root Lasso estimator

(β̂^SqRL, σ̂^SqRL) ∈ argmin_{β,σ} PL_1(Y; β, σ).

• This estimator can be computed by solving a second-order cone program. This can be done very efficiently even for very large values of p.
Scaled Dantzig selector
• One weakness of the scaled Lasso is that the mapping ρ ↦ log(ρ) is not Lipschitz continuous on (0, +∞).
• One can show that the first-order conditions for (β̂^ScL, σ̂^ScL) can be written as

X^T(Y − Xβ) ∈ λσ sign(β) ;   ‖Y‖_2^2 − Y^T Xβ = nσ².

• The set defined by these conditions is not convex, but it is included in the convex set

D = { (β, σ) : ‖X^T(Y − Xβ)‖_∞ ≤ λσ ; ‖Y‖_2^2 − Y^T Xβ ≥ nσ² }.

• Therefore, we proposed to estimate (β∗, σ∗) by the estimator

(β̂^SDS, σ̂^SDS) ∈ argmin_{(β,σ)∈D} ‖β‖_1.

• This estimator can be computed by solving a second-order cone program. This can be done very efficiently even for very large values of p.
Risk bounds
Theorem (D. and Chen (2012)). Let us choose a significance level α ∈ (0, 1) and set λ = 2√(n log(p/α)). Assume that β∗ is such that ‖β∗‖_0 ≤ s and

‖β∗‖_1 / σ∗ ≤ √(n / (2 log(1/α))) ( 1/2 − 2√(n^{−1} log(1/α)) ).    (4)

If X satisfies the condition R̃E(s, 1), then, with probability at least 1 − 3α, it holds that

‖X(β̂ − β∗)‖_2^2 ≤ 8 ((σ∗ + σ̂)/κ)² s log(p/α),    (5)

‖β̂ − β∗‖_2^2 ≤ 32 ((σ∗ + σ̂)/κ²)² s log(p/α) / n.    (6)

Moreover, with probability at least 1 − 4α,

σ̂ ≤ σ∗ (3 + √(2n^{−1} log(1/α))).
References
• Donoho, David and Johnstone, Iain, Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 (1995), no. 432, 1200-1224.
• Candès, Emmanuel and Tao, Terence, The Dantzig selector : statistical estimation when p is much larger than n. Ann. Statist. 35 (2007), no. 6, 2313-2351.
• Bickel, Peter, Ritov, Ya'acov and Tsybakov, Alexandre, Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 37 (2009), no. 4, 1705-1732.
• Dalalyan, Arnak and Tsybakov, Alexandre, Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning 72 (2008), 39-61.
• Chen, Yin and Dalalyan, Arnak, Fused sparsity and robust estimation for linear models with unknown variance. Neural Information Processing Systems (NIPS 2012), 1-16.
• Mairal, Julien, Sparse coding for machine learning, image processing and computer vision. PhD thesis (2010).