Submitted to the Annals of Statistics

SUPPLEMENT TO “ON CROSS-VALIDATED LASSO IN HIGH DIMENSIONS”∗

By Denis Chetverikov†, Zhipeng Liao‡, and Victor Chernozhukov§

UCLA†, UCLA‡, MIT§

∗Date: May, 2016. Revised June 30, 2020.
Keywords and phrases: cross-validation, Lasso, high-dimensional models, sparsity, non-asymptotic bounds.

This supplemental material contains results of a small-scale Monte Carlo simulation study as well as proofs for Section 5, proofs of lemmas in Section 6, and several technical lemmas.

7. Simulations. In this section, we present results of our simulation experiments. The purpose of the experiments is to investigate finite-sample properties of the cross-validated Lasso estimator. In particular, we are interested in (i) comparing the estimation error of the cross-validated Lasso estimator in different norms to that of the Lasso estimator based on other choices of $\lambda$ and (ii) studying sparsity properties of the cross-validated Lasso estimator.

We consider two data generating processes (DGPs). In both DGPs, we simulate the vector of covariates $X = (X_1, \dots, X_p)'$ from the Gaussian distribution with mean zero and variance-covariance matrix given by $E[X_jX_k] = \rho^{|j-k|}$ for all $j, k = 1, \dots, p$, with $\rho = 0.5$ and $0.75$. Also, we set $\beta = (1, -1, 2, -2, 0_{1\times(p-4)})'$. We simulate $\varepsilon$ from the standard Gaussian distribution in DGP1 and from the uniform distribution on $[-3, 3]$ in DGP2. For both DGPs, we take $\varepsilon$ to be independent of $X$. Further, for each DGP, we consider samples of size $n = 100$ and $400$. For each DGP and each sample size, we consider $p = 40$, $100$, and $400$. To construct the candidate set $\Lambda_n$ of values of the penalty parameter $\lambda$, we use Assumption 4 with $a = 0.9$, $c_1 = 0.005$, and $C_1 = 500$. Thus, the set $\Lambda_n$ contains values of $\lambda$ ranging from $0.0309$ to $500$ when $n = 100$ and from $0.0071$ to $500$ when $n = 400$; that is, the set $\Lambda_n$ is rather large in both cases. In all experiments, we use 5-fold cross-validation ($K = 5$). We repeat each experiment 5000 times.
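For readers who want to reproduce this design, the following minimal sketch implements one replication of DGP1 in Python (an illustration only, not the authors' code). It uses scikit-learn's `Lasso` as the solver and a geometric grid with ratio $a = 0.9$ as a stand-in for the candidate set $\Lambda_n$; the exact construction of $\Lambda_n$ in Assumption 4, as well as the penalty normalization in (2), may differ from scikit-learn's `alpha` parametrization.

```python
# A minimal sketch of one replication of DGP1, assuming the design above:
# X ~ N(0, Sigma) with Sigma[j, k] = rho**|j - k|,
# beta = (1, -1, 2, -2, 0, ..., 0)', eps ~ N(0, 1), and 5-fold cross-validation.
# The geometric grid below is only an illustrative stand-in for Lambda_n, and
# scikit-learn's `alpha` need not match the normalization of lambda in (2).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p, rho = 100, 40, 0.75
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
beta = np.zeros(p)
beta[:4] = [1.0, -1.0, 2.0, -2.0]

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = X @ beta + rng.standard_normal(n)  # DGP1: standard Gaussian noise

lambdas = 500.0 * 0.9 ** np.arange(93)  # geometric grid, ratio a = 0.9 (assumed)

cv_err = np.zeros(len(lambdas))
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    for j, lam in enumerate(lambdas):
        fit = Lasso(alpha=lam, max_iter=50_000).fit(X[train], y[train])
        cv_err[j] += np.sum((y[test] - fit.predict(X[test])) ** 2)

lam_cv = lambdas[np.argmin(cv_err)]
beta_cv = Lasso(alpha=lam_cv, max_iter=50_000).fit(X, y).coef_
print(f"CV-selected penalty: {lam_cv:.4f}; selected covariates: {np.sum(beta_cv != 0)}")
```

Averaging the estimation errors over many such replications, as in the 5000 repetitions used here, produces the kind of curves shown in Figure 1 below.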

As a comparison to the cross-validated Lasso estimator, we consider the Lasso estimators with $\lambda$ chosen according to [12] and [2], i.e.,

$$\lambda = n^{-1/2}\sigma\sqrt{2\log p} \quad\text{and}\quad \lambda = n^{-1/2}\sigma\sqrt{2\log(p/s)}.$$

These Lasso estimators achieve the optimal convergence rate under the prediction norm (see, e.g., [12] and [2]). The noise level $\sigma$ and the true sparsity $s$ typically have to be estimated from the data, but for simplicity we assume that both $\sigma$ and $s$ are known, so we set $\sigma = 1$ and $s = 4$ in DGP1, and $\sigma = \sqrt{3}$ and $s = 4$ in DGP2. In what follows, these Lasso estimators are denoted as P1-Lasso and P2-Lasso estimators, respectively, and the cross-validated Lasso estimator is denoted as CV-Lasso.
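As a quick arithmetic check (a sketch assuming only the two formulas above with known $\sigma = 1$ and $s = 4$), the plug-in penalty levels for $n = 100$ and $p = 40$ evaluate to the values 0.2716 and 0.2146 that appear in the caption of Figure 1:

```python
# Plug-in penalty levels for P1-Lasso and P2-Lasso (sigma and s known).
import math

def lambda_p1(n: int, p: int, sigma: float) -> float:
    return sigma * math.sqrt(2 * math.log(p)) / math.sqrt(n)

def lambda_p2(n: int, p: int, sigma: float, s: int) -> float:
    return sigma * math.sqrt(2 * math.log(p / s)) / math.sqrt(n)

# DGP1 with n = 100, p = 40, sigma = 1, s = 4:
print(round(lambda_p1(100, 40, 1.0), 4))     # 0.2716, as in the caption of Figure 1
print(round(lambda_p2(100, 40, 1.0, 4), 4))  # 0.2146
```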

Figure 1 contains simulation results for DGP1 with $n = 100$, $p = 40$, and $\rho = 0.75$. The first four (that is, the top-left, top-right, middle-left, and middle-right) panels of Figure 1 present the mean of the estimation error of the Lasso estimators in the prediction, $L^2$, $L^1$, and out-of-sample prediction norms, respectively. The out-of-sample prediction norm is defined as $\|b\|_{p,2,n} = (E[(X'b)^2])^{1/2}$ for all $b \in \mathbb{R}^p$. In these panels, the dashed line represents the mean of the estimation error of the Lasso estimator as a function of $\lambda$ (we compute the Lasso estimator for each value of $\lambda$ in the candidate set $\Lambda_n$; we sort the values in $\Lambda_n$ from the smallest to the largest and put the order of $\lambda$ on the horizontal axis; we only show the results for values of $\lambda$ up to order 25, as these give the most meaningful comparisons). This estimator is denoted as $\lambda$-Lasso. The solid, dotted, and dashed-dotted horizontal lines represent the mean of the estimation error of CV-Lasso, P1-Lasso, and P2-Lasso, respectively.

From the top four panels of Figure 1, we see that the estimation error of CV-Lasso is only slightly above the minimum of the estimation error over all possible values of $\lambda$, not only in the prediction and $L^2$ norms but also in the $L^1$ norm. In comparison, P1-Lasso and P2-Lasso tend to have larger estimation error in all four norms.

The bottom-left and bottom-right panels of Figure 1 depict the histograms for the numbers of non-zero coefficients of the CV-Lasso estimator and the P2-Lasso estimator, respectively. Overall, these panels suggest that the CV-Lasso estimator tends to select too many covariates: with large probability, the number of selected covariates varies between 5 and 30, even though there are only 4 non-zero coefficients in the true model. The P2-Lasso estimator is sparser than the CV-Lasso estimator: it selects around 5 to 15 covariates with large probability.

Figure 2 presents the simulation results for DGP1 with $n = 100$, $p = 400$, and $\rho = 0.75$. The estimation errors of the Lasso estimators are inflated when $p$ is much bigger than the sample size. The estimation error of CV-Lasso under the prediction norm increases from 0.4481 to 0.7616 when $p$ increases from 40 to 400, although it remains the smallest in comparison with the P1-Lasso and P2-Lasso estimators. Similar phenomena are observed for the estimation error under the $L^2$ norm and the out-of-sample prediction norm. On the other hand, the estimation error of CV-Lasso is slightly larger than those of P1-Lasso and P2-Lasso under the $L^1$ norm. As for the sparsity of the Lasso estimators, CV-Lasso is much less sparse than P2-Lasso: it selects around 5 to 50 covariates with large probability, while P2-Lasso only selects 8 to 22 covariates with large probability.

For all other experiments, the simulation results on the mean of the estimation error of the Lasso estimators can be found in Table 1. For simplicity, we only report the minimum over $\lambda \in \Lambda_n$ of the mean of the estimation error of $\lambda$-Lasso and the mean of the estimation error of P2-Lasso in Table 1. The results in Table 1 confirm the findings in Figures 1 and 2: the mean of the estimation error of CV-Lasso is close to the minimum mean of the estimation errors of the $\lambda$-Lasso estimators under both DGPs for all combinations of $n$, $p$, and $\rho$ considered, in all three norms. Their difference becomes smaller as the sample size $n$ increases. The mean of the estimation error of P2-Lasso is larger than that of CV-Lasso in cases where $p$ is relatively small or the regressors $X$ are strongly correlated, while P2-Lasso has smaller estimation error when $p$ is much larger than $n$ and the regressors $X$ are weakly correlated. When the correlations of the regressors $X$ become stronger and the largest eigenvalue of $E[XX']$ becomes bigger, the mean of the estimation error of the CV-Lasso estimator increases only slightly and is much less affected than that of the P2-Lasso estimator. For example, in DGP1 with $n = 100$ and $p = 40$, the mean of the estimation error of the CV-Lasso estimator increases by 5.39% when $\rho$ is changed from 0.5 to 0.75 (and the largest eigenvalue of $E[XX']$ increases from 2.97 to 6.64), while the P2-Lasso estimator has a 28% increase.
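The percentage changes quoted here follow directly from the Table 1 entries; the snippet below (an illustrative check, not part of the study) reproduces the arithmetic:

```python
# Verifying the quoted increases from Table 1 (DGP1, n = 100, p = 40):
# prediction-norm error as rho moves from 0.5 to 0.75.
cv_rho_05, cv_rho_075 = 0.4252, 0.4481  # CV-Lasso entries from Table 1
p2_rho_05, p2_rho_075 = 0.4435, 0.5677  # P2-Lasso entries from Table 1
print(f"CV-Lasso: +{100 * (cv_rho_075 / cv_rho_05 - 1):.2f}%")  # +5.39%
print(f"P2-Lasso: +{100 * (p2_rho_075 / p2_rho_05 - 1):.2f}%")  # ~+28%
```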

Table 2 reports model selection results for the cross-validated Lasso estimator. More precisely, the table shows probabilities for the number of non-zero coefficients of the cross-validated Lasso estimator hitting different brackets. Overall, the results in Table 2 confirm the findings in Figures 1 and 2: the cross-validated Lasso estimator tends to select too many covariates. The probability of selecting larger models tends to increase with $p$ but decreases with $n$.

8. Proofs for Section 5. In this section, we prove Theorems 5.1 and 5.2. Since the proofs are long, we start with a sequence of preliminary lemmas.

8.1. Preliminary Lemmas.

Lemma 8.1. For all $\lambda > 0$, the Lasso estimator $\hat\beta(\lambda)$ given in (2) based on the data $(X_i, Y_i)_{i=1}^n = (X_i, X_i'\beta + \varepsilon_i)_{i=1}^n$ has the following property: the function $(\varepsilon_i)_{i=1}^n \mapsto (X_i'\hat\beta(\lambda))_{i=1}^n$ mapping $\mathbb{R}^n$ to $\mathbb{R}^n$ for any fixed value of $X_1^n = (X_1, \dots, X_n)$ is well-defined and is Lipschitz-continuous with Lipschitz constant one with respect to the Euclidean norm. Moreover, there always exists a Lasso estimator $\hat\beta(\lambda)$ such that $\|\hat\beta(\lambda)\|_0 \le n$ almost surely. Finally, $\hat\beta(\lambda)$ is unique almost surely whenever the distribution of $X$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^p$.

Proof. All the asserted claims in this lemma can be found in the literature; here we give specific references for completeness. The fact that the function $(\varepsilon_i)_{i=1}^n \mapsto (X_i'\hat\beta(\lambda))_{i=1}^n$ is well-defined follows from Lemma 1 in [16], which shows that even if the solution $\hat\beta(\lambda)$ of the optimization problem (2) is not unique, $(X_i'\hat\beta(\lambda))_{i=1}^n$ is the same across all solutions. The Lipschitz property then follows from Proposition 2 in [3]. Moreover, by the discussion in Section 2.1 in [16], there always exists a Lasso solution, say $\hat\beta(\lambda)$, taking the form in (10) of [16], and such a solution satisfies $\|\hat\beta(\lambda)\|_0 \le n$. Finally, the last claim follows from Lemma 4 in [16]. □
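The contraction property in Lemma 8.1 is easy to observe numerically. The sketch below is an illustration under assumed settings; scikit-learn's `Lasso` is used as a stand-in for the estimator in (2), whose penalty normalization may differ, which does not affect the contraction property. It checks that the fitted values move by no more than the noise vector does in the Euclidean norm:

```python
# Numerical illustration of Lemma 8.1: for fixed X and fixed penalty level,
# the map (eps_i) -> (X_i' beta_hat) is 1-Lipschitz in the Euclidean norm.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 40
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:4] = [1.0, -1.0, 2.0, -2.0]

def fitted(eps):
    # Lasso fit without intercept at an illustrative penalty level.
    y = X @ beta + eps
    coef = Lasso(alpha=0.3, fit_intercept=False, max_iter=50_000).fit(X, y).coef_
    return X @ coef

eps1 = rng.standard_normal(n)
eps2 = eps1 + 0.1 * rng.standard_normal(n)
lhs = np.linalg.norm(fitted(eps1) - fitted(eps2))
rhs = np.linalg.norm(eps1 - eps2)
print(lhs <= rhs + 1e-8, lhs, rhs)  # contraction: lhs <= rhs
```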

Lemma 8.2. Suppose that Assumption 3 holds. Then for all $\kappa \ge 1$, $n \ge e^\kappa$, $\lambda > 0$, and $t > 0$, we have

$$(29)\qquad P\Big(\big|\|\hat\beta(\lambda) - \beta\|_{2,n} - E[\|\hat\beta(\lambda) - \beta\|_{2,n} \mid X_1^n]\big| > t\Big) \le \Big(\frac{C\kappa\log^r n}{t^2n}\Big)^{\kappa/2}$$

for some constant $C > 0$ depending only on $C_1$ and $r$.

Proof. Fix $\kappa \ge 1$, $n \ge e^\kappa$, and $\lambda > 0$. Also, let $\xi$ be a $N(0,1)$ random variable that is independent of the data and let $C$ be a positive constant that depends only on $C_1$ and $r$ but whose value can change from place to place. Then by Lemma 8.1, the function $(\varepsilon_i)_{i=1}^n \mapsto (X_i'\hat\beta(\lambda))_{i=1}^n$ is Lipschitz-continuous with Lipschitz constant one, and so is

$$(\varepsilon_i)_{i=1}^n \mapsto \Big(\sum_{i=1}^n\big(X_i'(\hat\beta(\lambda) - \beta)\big)^2\Big)^{1/2} = \sqrt{n}\,\|\hat\beta(\lambda) - \beta\|_{2,n}.$$

Therefore, applying Lemma 10.5 with $u(x) = (x \vee 0)^\kappa$ and using Markov's inequality and Assumption 3 shows that for any $t > 0$,

$$P\big(\|\hat\beta(\lambda) - \beta\|_{2,n} - E[\|\hat\beta(\lambda) - \beta\|_{2,n} \mid X_1^n] > t \mid X_1^n\big) \le \Big(\frac{C_1\pi}{2t\sqrt{n}}\Big)^\kappa E\Big[\max_{1\le i\le n}(1 + |e_i|^r)^\kappa\Big]E[|\xi|^\kappa] \le \Big(\frac{C}{t\sqrt{n}}\Big)^\kappa E\Big[\max_{1\le i\le n}|e_i|^{r\kappa}\Big]E[|\xi|^\kappa]$$

$$\le \Big(\frac{C}{t\sqrt{n}}\Big)^\kappa\Big(E\Big[\max_{1\le i\le n}|e_i|^{r\log n}\Big]\Big)^{\kappa/\log n}E[|\xi|^\kappa] \le \Big(\frac{C(r\log n)^{r/2}\sqrt{\kappa}}{t\sqrt{n}}\Big)^\kappa = \Big(\frac{C\sqrt{\kappa\log^r n}}{t\sqrt{n}}\Big)^\kappa = \Big(\frac{C\kappa\log^r n}{t^2n}\Big)^{\kappa/2}.$$

This gives one side of the bound (29). Since the other side follows similarly, the proof is complete. □

Lemma 8.3. Suppose that Assumption 3 holds and let $Q^{-1}\colon \mathbb{R}^p\times\mathbb{R}\to\mathbb{R}$ be the inverse of $Q\colon \mathbb{R}^p\times\mathbb{R}\to\mathbb{R}$ with respect to the second argument. Then for all $\lambda > 0$,

$$(30)\qquad E[\|\hat\beta(\lambda)\|_0 \mid X_1^n] = \sum_{i=1}^n E[\psi_iX_i'(\hat\beta(\lambda) - \beta) \mid X_1^n],$$

where

$$\psi_i = \frac{e_i}{Q_2(X_i, e_i)} + \frac{Q_{22}(X_i, e_i)}{Q_2(X_i, e_i)^2} \quad\text{and}\quad e_i = Q^{-1}(X_i, \varepsilon_i)$$

for all $i = 1, \dots, n$. In addition,

$$E\Big[\Big(\|\hat\beta(\lambda)\|_0 - \sum_{i=1}^n\psi_iX_i'(\hat\beta(\lambda) - \beta)\Big)^2 \mid X_1^n\Big] = \sum_{i=1}^n E\big[\gamma_i(X_i'(\hat\beta(\lambda) - \beta))^2 \mid X_1^n\big] + E[\|\hat\beta(\lambda)\|_0 \mid X_1^n],$$

where

$$\gamma_i = \frac{1}{Q_2(X_i, e_i)^2} - \frac{e_iQ_{22}(X_i, e_i)}{Q_2(X_i, e_i)^3} + \frac{Q_{222}(X_i, e_i)}{Q_2(X_i, e_i)^3} - \frac{2Q_{22}(X_i, e_i)^2}{Q_2(X_i, e_i)^4}$$

for all $i = 1, \dots, n$. Moreover,

$$\mathrm{Var}\Big(\sum_{i=1}^n\psi_iX_i'(\hat\beta(\lambda) - \beta) \mid X_1^n\Big) \le 2\sum_{i=1}^n E\Big[\big(\gamma_iQ_2(X_i, e_i)X_i'(\hat\beta(\lambda) - \beta)\big)^2 \mid X_1^n\Big] + CE\big[\|\hat\beta(\lambda)\|_0 + 1 \mid X_1^n\big](\log p)(\log^r n),$$

where $C > 0$ is a constant depending only on $c_1$, $C_1$, and $r$.

Remark 8.1. Here, the inverse $Q^{-1}$ exists because, by Assumption 3, $Q$ is strictly increasing and continuous with respect to its second argument. □

Proof. This lemma extends some of the results in [15] and [4] to the non-Gaussian case. All arguments in the proof are conditional on $X_1, \dots, X_n$, but we drop the conditioning sign for brevity of notation. Also, we use $C$ to denote a positive constant that depends only on $c_1$, $C_1$, and $r$ but whose value can change from place to place.


Fix $\lambda > 0$ and denote $\hat\beta = \hat\beta(\lambda)$ and $\hat T = \{j \in \{1, \dots, p\}\colon \hat\beta_j \ne 0\}$. For all $i = 1, \dots, n$, we will use $X_{i\hat T}$ to denote the sub-vector of $X_i$ in $\mathbb{R}^{|\hat T|}$ corresponding to indices in $\hat T$. By results in [15], we then have

$$(31)\qquad \frac{\partial X_i'(\hat\beta - \beta)}{\partial\varepsilon_j} = X_{i\hat T}'\Big(\sum_{l=1}^n X_{l\hat T}X_{l\hat T}'\Big)^{-1}X_{j\hat T}, \qquad i, j = 1, \dots, n;$$

see, in particular, the proof of Theorem 1 there. Taking the sum over $i = j = 1, \dots, n$ and applying the trace operator on the right-hand side of this identity gives

$$(32)\qquad \|\hat\beta\|_0 = |\hat T| = \sum_{i=1}^n\frac{\partial(X_i'(\hat\beta - \beta))}{\partial\varepsilon_i}.$$

Also, for all $i = 1, \dots, n$, under Assumption 3 (and conditional on $X_1^n$), the random variable $\varepsilon_i$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}$ with continuously differentiable pdf $\chi_i$ defined by

$$\chi_i(Q(X_i, e)) = \frac{\phi(e)}{Q_2(X_i, e)}, \qquad e \in \mathbb{R},$$

where $\phi$ is the pdf of the $N(0,1)$ distribution. Taking the derivative over $e$ here gives

$$\chi_i'(Q(X_i, e))\,Q_2(X_i, e) = -\frac{e\phi(e)}{Q_2(X_i, e)} - \frac{\phi(e)Q_{22}(X_i, e)}{Q_2(X_i, e)^2}, \qquad e \in \mathbb{R},$$

and so

$$\frac{\chi_i'(\varepsilon_i)}{\chi_i(\varepsilon_i)} = \frac{\chi_i'(Q(X_i, e_i))}{\chi_i(Q(X_i, e_i))} = -\frac{e_i}{Q_2(X_i, e_i)} - \frac{Q_{22}(X_i, e_i)}{Q_2(X_i, e_i)^2} = -\psi_i.$$

Therefore, by Lemma 10.4, whose application is justified by Assumption 3 and Lemma 8.1,

$$(33)\qquad E\Big[\frac{\partial(X_i'(\hat\beta - \beta))}{\partial\varepsilon_i}\Big] = E[\psi_iX_i'(\hat\beta - \beta)], \qquad i = 1, \dots, n.$$

Combining (32) and (33) gives the first asserted claim.

To prove the second asserted claim, we proceed along the lines of the proof of Theorem 1.1 in [4]. Specifically, let $f_1, \dots, f_n$ be twice continuously differentiable functions mapping $\mathbb{R}^n$ to $\mathbb{R}$ with bounded first and second derivatives. Also, let $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'$. Then it follows from Lemma 10.4 that

$$E\Big[\psi_if_i(\varepsilon)\Big(\sum_{j=1}^n\psi_jf_j(\varepsilon) - \sum_{l=1}^n\frac{\partial f_l(\varepsilon)}{\partial\varepsilon_l}\Big)\Big] = E\Big[\frac{\partial f_i(\varepsilon)}{\partial\varepsilon_i}\Big(\sum_{j=1}^n\psi_jf_j(\varepsilon) - \sum_{l=1}^n\frac{\partial f_l(\varepsilon)}{\partial\varepsilon_l}\Big)\Big] + E\Big[f_i(\varepsilon)\Big(\gamma_if_i(\varepsilon) + \sum_{j=1}^n\psi_j\frac{\partial f_j(\varepsilon)}{\partial\varepsilon_i} - \sum_{l=1}^n\frac{\partial^2f_l(\varepsilon)}{\partial\varepsilon_l\partial\varepsilon_i}\Big)\Big]$$

for all $i = 1, \dots, n$ and, in addition,

$$E\Big[\psi_jf_i(\varepsilon)\frac{\partial f_j(\varepsilon)}{\partial\varepsilon_i}\Big] = E\Big[\frac{\partial f_i(\varepsilon)}{\partial\varepsilon_j}\frac{\partial f_j(\varepsilon)}{\partial\varepsilon_i} + f_i(\varepsilon)\frac{\partial^2f_j(\varepsilon)}{\partial\varepsilon_i\partial\varepsilon_j}\Big]$$

for all $j = 1, \dots, n$. Combining these results, rearranging the terms, and taking the sum over $i = 1, \dots, n$, we obtain

$$E\Big[\Big(\sum_{i=1}^n\psi_if_i(\varepsilon) - \sum_{i=1}^n\frac{\partial f_i(\varepsilon)}{\partial\varepsilon_i}\Big)^2\Big] = \sum_{i=1}^n E[\gamma_if_i(\varepsilon)^2] + \sum_{i,j=1}^n E\Big[\frac{\partial f_i(\varepsilon)}{\partial\varepsilon_j}\frac{\partial f_j(\varepsilon)}{\partial\varepsilon_i}\Big],$$

and since all second-order derivatives cancel out, it follows from a convolution argument that the same identity holds for any Lipschitz functions $f_1, \dots, f_n$; see Appendix A of [4] for details. We now substitute $f_i(\varepsilon) = X_i'(\hat\beta - \beta)$ for all $i = 1, \dots, n$ in this identity and note that

$$\sum_{i,j=1}^n\frac{\partial f_i(\varepsilon)}{\partial\varepsilon_j}\frac{\partial f_j(\varepsilon)}{\partial\varepsilon_i} = \|\hat\beta\|_0$$

by (31) in this case. This gives the second asserted claim.

To prove the third asserted claim, we have by the Gaussian Poincaré inequality (Theorem 3.20 in [7]) that

$$\mathrm{Var}\Big(\sum_{i=1}^n\psi_iX_i'(\hat\beta - \beta)\Big) \le \sum_{j=1}^n E\Big[\Big(\frac{\partial}{\partial e_j}\sum_{i=1}^n\psi_iX_i'(\hat\beta - \beta)\Big)^2\Big] \le 2\sum_{j=1}^n E\Big[\Big(\frac{\partial\psi_j}{\partial e_j}X_j'(\hat\beta - \beta)\Big)^2\Big] + 2\sum_{j=1}^n E\Big[\Big(\sum_{i=1}^n\psi_i\frac{\partial X_i'(\hat\beta - \beta)}{\partial e_j}\Big)^2\Big].$$

Here, the first term on the right-hand side is equal to

$$2\sum_{j=1}^n E\Big[\big(\gamma_jQ_2(X_j, e_j)X_j'(\hat\beta - \beta)\big)^2\Big].$$

Also, by (31), the second term is equal to

$$2\sum_{j=1}^n E\Big[\Big(\sum_{i=1}^n\psi_iQ_2(X_j, e_j)X_{i\hat T}'\Big(\sum_{l=1}^n X_{l\hat T}X_{l\hat T}'\Big)^{-1}X_{j\hat T}\Big)^2\Big] \le 2E\Big[\max_{1\le j\le n}Q_2(X_j, e_j)^2\sum_{j=1}^n\Big(\sum_{i=1}^n\psi_iX_{i\hat T}'\Big(\sum_{l=1}^n X_{l\hat T}X_{l\hat T}'\Big)^{-1}X_{j\hat T}\Big)^2\Big]$$

$$= 2E\Big[\max_{1\le j\le n}Q_2(X_j, e_j)^2\sum_{i=1}^n\psi_iX_{i\hat T}'\Big(\sum_{l=1}^n X_{l\hat T}X_{l\hat T}'\Big)^{-1}\sum_{j=1}^n X_{j\hat T}\psi_j\Big].$$

Next, observe that

$$\sum_{i=1}^n\psi_iX_{i\hat T}'\Big(\sum_{l=1}^n X_{l\hat T}X_{l\hat T}'\Big)^{-1}\sum_{j=1}^n X_{j\hat T}\psi_j$$

is equal to $\|P_{\hat T}\psi\|_2^2$, where $\psi = (\psi_1, \dots, \psi_n)'$ and $P_{\hat T}$ is the matrix projecting on $(X_{1\hat T}, \dots, X_{n\hat T})'$. In turn, we can bound $E[\|P_{\hat T}\psi\|_2^2]$ using arguments from the proof of Theorem 4.3 in [4]. In particular, for any $M \subset \{1, \dots, p\}$, letting $P_M$ denote the matrix projecting on $(X_{1M}, \dots, X_{nM})'$, we have $E[\|P_M\psi\|_2^2] \le C|M|$, and so, by Assumption 3 and the Hanson-Wright inequality (Theorem 1.1 in [11]),

$$P\big(\|P_M\psi\|_2^2 > C(|M| + x)\big) \le e^{-x}$$

for all $x > 0$. Thus, applying the union bound twice,

$$P\Big(\max_{M\subset\{1,\dots,p\}}\Big(\|P_M\psi\|_2^2 - C\Big(|M| + \log\binom{p}{|M|} + \log p + x\Big)\Big) > 0\Big) \le e^{-x},$$

and so

$$P\Big(\max_{M\subset\{1,\dots,p\}}\big(\|P_M\psi\|_2^2 - C\big((|M| + 1)\log p + x\big)\big) > 0\Big) \le e^{-x}.$$

By Fubini's theorem and simple calculations, we then have

$$(34)\qquad E\big[\|P_{\hat T}\psi\|_2^2\big] \le CE\big[\|\hat\beta\|_0 + 1\big]\log p.$$

Also,

$$(35)\qquad E\big[\|P_{\hat T}\psi\|_2^4\big] \le E\big[\|\psi\|_2^4\big] \le Cn^2.$$

Hence, for a sufficiently large constant $A$ that can be chosen to depend on $c_1$, $C_1$, and $r$ only,

$$E\Big[\max_{1\le j\le n}Q_2(X_j, e_j)^2\sum_{i=1}^n\psi_iX_{i\hat T}'\Big(\sum_{l=1}^n X_{l\hat T}X_{l\hat T}'\Big)^{-1}\sum_{j=1}^n X_{j\hat T}\psi_j\Big] = E\Big[\max_{1\le j\le n}Q_2(X_j, e_j)^2\|P_{\hat T}\psi\|_2^2\,1\Big\{\max_{1\le j\le n}Q_2(X_j, e_j)^2 \le A\log^r n\Big\}\Big]$$

$$+\; E\Big[\max_{1\le j\le n}Q_2(X_j, e_j)^2\|P_{\hat T}\psi\|_2^2\,1\Big\{\max_{1\le j\le n}Q_2(X_j, e_j)^2 > A\log^r n\Big\}\Big].$$

Here, by (34), the first term on the right-hand side is bounded from above by $CE[\|\hat\beta\|_0 + 1](\log p)(\log^r n)$ and, by (35), Assumption 3, and Hölder's inequality, the second term is bounded from above by $C$ since $A$ is large enough. Combining all presented inequalities together gives the third asserted claim and completes the proof of the lemma. □

8.2. Proof of Theorem 5.1. All arguments in this proof are conditional on $X_1, \dots, X_n$, but we drop the conditioning sign for brevity of notation. Throughout the proof, we will assume that

$$(36)\qquad \sup_{\delta\in S^p(s)}\|\delta\|_{2,n} \le \bar C.$$

Also, we use $C$ to denote a positive constant that depends only on $c_1$, $C_1$, $\bar C$, and $r$ but whose value can change from place to place.

Fix $\lambda > 0$ and denote $\hat\beta = \hat\beta(\lambda)$, $\hat s = \|\hat\beta\|_0$, and $R_n = E[\|\hat\beta - \beta\|_{2,n}]$. We start with some preliminary inequalities. First, by Hölder's inequality and Assumption 3,

$$(37)\qquad \sum_{i=1}^n E\big[\gamma_i(X_i'(\hat\beta - \beta))^2\big] \le nE\Big[\|\hat\beta - \beta\|_{2,n}^2\times\max_{1\le i\le n}|\gamma_i|\Big] \le C(n\log n)\sqrt{E[\|\hat\beta - \beta\|_{2,n}^4]}$$

and, similarly,

$$(38)\qquad \sum_{i=1}^n E\Big[\big(\gamma_iQ_2(X_i, e_i)X_i'(\hat\beta - \beta)\big)^2\Big] \le C(n\log^2 n)\sqrt{E[\|\hat\beta - \beta\|_{2,n}^4]}.$$

Second, by the triangle inequality and Fubini's theorem,

$$(39)\qquad E[\|\hat\beta - \beta\|_{2,n}^4] \le C\Big(R_n^4 + E\big[(\|\hat\beta - \beta\|_{2,n} - R_n)^4\big]\Big) = C\Big(R_n^4 + \int_0^\infty P\big(|\|\hat\beta - \beta\|_{2,n} - R_n| > t^{1/4}\big)dt\Big) \le C\Big(R_n^4 + \Big(\frac{\log^r n}{n}\Big)^2\Big),$$

where the last line follows from Lemma 8.2 applied with $\kappa = 5$ (for example). Third,

$$(40)\qquad P\Big(R_n > \|\hat\beta - \beta\|_{2,n} + \sqrt{\frac{C\log^{r+1}n}{n}}\Big) \le \frac{1}{n}$$

by Lemma 8.2 applied with $\kappa = \log n$. Fourth, by Lemma 9 of [5] and (36),

$$(41)\qquad \|\hat\beta - \beta\|_{2,n}^2 \le \|\hat\beta - \beta\|_2^2\times\sup_{\delta\in S^p(\hat s+s)}\|\delta\|_{2,n}^2 \le \|\hat\beta - \beta\|_2^2\times\frac{2(\hat s+s)}{s}\sup_{\delta\in S^p(s)}\|\delta\|_{2,n}^2 \le \frac{C(\hat s+s)}{s}\|\hat\beta - \beta\|_2^2.$$

We now prove the theorem with the help of these bounds. Denote

$$V_1 = \mathrm{Var}\Big(\hat s - \sum_{i=1}^n\psi_iX_i'(\hat\beta - \beta)\Big) \quad\text{and}\quad V_2 = \mathrm{Var}\Big(\sum_{i=1}^n\psi_iX_i'(\hat\beta - \beta)\Big).$$

Then for any $t > 0$, with probability at least $1 - 2/t^2$, by Chebyshev's inequality and Lemma 8.3,

$$(42)\qquad \hat s \le \sum_{i=1}^n\psi_iX_i'(\hat\beta - \beta) + t\sqrt{V_1} = \sum_{i=1}^n\psi_iX_i'(\hat\beta - \beta) + E\Big[1 + \sum_{i=1}^n\psi_iX_i'(\hat\beta - \beta)\Big] - E[1 + \hat s] + t\sqrt{V_1} \le 1 + 2\sum_{i=1}^n\psi_iX_i'(\hat\beta - \beta) - E[1 + \hat s] + t(\sqrt{V_1} + \sqrt{V_2}).$$

Here, $t(\sqrt{V_1} + \sqrt{V_2})$ is bounded from above by

$$Ct\big(E[1 + \hat s](\log p)(\log^r n)\big)^{1/2} + Ct\sqrt{n}\log n\Big(R_n + \sqrt{\frac{\log^r n}{n}}\Big) \le E[1 + \hat s] + Ct^2(\log p)(\log^r n) + Ct\sqrt{n}\log n\Big(R_n + \sqrt{\frac{\log^r n}{n}}\Big)$$

by Lemma 8.3 and inequalities (37), (38), and (39). Also, with probability at least $1 - 1/n$,

$$R_n \le C\Big(\sqrt{\frac{\hat s+s}{s}}\|\hat\beta - \beta\|_2 + \sqrt{\frac{\log^{r+1}n}{n}}\Big)$$

by (40) and (41). In addition,

$$\sum_{i=1}^n\psi_iX_i'(\hat\beta - \beta) \le \Big\|\sum_{i=1}^n\psi_iX_i\Big\|_\infty\times\|\hat\beta - \beta\|_1,$$

where $\|\hat\beta - \beta\|_1 \le (\hat s+s)^{1/2}\times\|\hat\beta - \beta\|_2$ and, with probability at least $1 - 1/n$,

$$\Big\|\sum_{i=1}^n\psi_iX_i\Big\|_\infty \le C\sqrt{\log(pn)}\times\max_{1\le j\le p}\Big(\sum_{i=1}^n X_{ij}^2\Big)^{1/2} \le C\sqrt{n\log(pn)},$$

with the first inequality following from Assumption 3 and the union bound and the second from (36). Substituting all these bounds into (42) and using $t = (\bar t s\log(pn))^{1/2}$ with $\bar t \ge 1$ gives

$$\hat s \le C\bar t s(\log^r n)\log^2(pn) + C\sqrt{\bar t(\hat s+s)n(\log^2 n)\log(pn)}\,\|\hat\beta - \beta\|_2$$

with probability at least $1 - 2/(\bar t s\log(pn)) - 2/n$. Solving this inequality for $\hat s$ gives the asserted claim and completes the proof of the theorem. □

8.3. Proof of Theorem 5.2. All arguments in this proof are conditional on $X_1, \dots, X_n$, but we drop the conditioning sign for brevity of notation. Throughout the proof, we will assume that (7) holds. Also, we use $C$ to denote a positive constant that depends only on $c_1$, $C_1$, $\bar c$, $\bar C$, and $r$ but whose value can change from place to place.

Fix $\lambda > 0$ and denote $\hat\beta = \hat\beta(\lambda)$, $\hat s = \|\hat\beta\|_0$, $J_n = J_n(\lambda)$, and $R_n = R_n(\lambda)$. Then by Lemma 8.3,

$$(43)\qquad E[\hat s] = \sum_{i=1}^n E[\psi_iX_i'(\hat\beta - \beta)] = I_1 + I_2,$$

where

$$(44)\qquad I_1 = \sum_{i=1}^n E\big[\psi_iX_i'(\hat\beta - \beta)1\{\bar c\|\hat\beta - \beta\|_2 \le \|\hat\beta - \beta\|_{2,n}\}\big],$$

$$(45)\qquad I_2 = \sum_{i=1}^n E\big[\psi_iX_i'(\hat\beta - \beta)1\{\bar c\|\hat\beta - \beta\|_2 > \|\hat\beta - \beta\|_{2,n}\}\big].$$

We bound $I_1$ and $I_2$ in turn. To bound $I_1$, note that, as in (39) in the proof of Theorem 5.1,

$$(46)\qquad E[\|\hat\beta - \beta\|_{2,n}^4] \le C\Big(R_n^4 + \Big(\frac{\log^r n}{n}\Big)^2\Big).$$

Also, by Assumption 3 and (7),

$$E\Big[\Big\|\sum_{i=1}^n\psi_iX_i\Big\|_\infty^4\Big] \le Cn^2\log^2 p.$$

Therefore,

$$I_1 \le E\Big[\Big\|\sum_{i=1}^n\psi_iX_i\Big\|_\infty\|\hat\beta - \beta\|_1\,1\{\bar c\|\hat\beta - \beta\|_2 \le \|\hat\beta - \beta\|_{2,n}\}\Big] \le E\Big[\Big\|\sum_{i=1}^n\psi_iX_i\Big\|_\infty\|\hat\beta - \beta\|_2(\hat s+s)^{1/2}\,1\{\bar c\|\hat\beta - \beta\|_2 \le \|\hat\beta - \beta\|_{2,n}\}\Big]$$

$$\le CE\Big[\Big\|\sum_{i=1}^n\psi_iX_i\Big\|_\infty\|\hat\beta - \beta\|_{2,n}(\hat s+s)^{1/2}\,1\{\bar c\|\hat\beta - \beta\|_2 \le \|\hat\beta - \beta\|_{2,n}\}\Big] \le C\Big(E\Big[\Big\|\sum_{i=1}^n\psi_iX_i\Big\|_\infty^2\|\hat\beta - \beta\|_{2,n}^2\Big]E[\hat s+s]\Big)^{1/2},$$

where the last line follows from Hölder's inequality. In turn,

$$\Big(E\Big[\Big\|\sum_{i=1}^n\psi_iX_i\Big\|_\infty^2\|\hat\beta - \beta\|_{2,n}^2\Big]\Big)^{1/2} \le \Big(E\Big[\Big\|\sum_{i=1}^n\psi_iX_i\Big\|_\infty^4\Big]E\big[\|\hat\beta - \beta\|_{2,n}^4\big]\Big)^{1/4} \le C\sqrt{n\log p}\,\big(E[\|\hat\beta - \beta\|_{2,n}^4]\big)^{1/4} \le C\sqrt{n\log p}\Big(R_n + \sqrt{\frac{\log^r n}{n}}\Big).$$

Thus,

$$(47)\qquad I_1 \le C\sqrt{n\log p}\Big(R_n + \sqrt{\frac{\log^r n}{n}}\Big)(E[\hat s+s])^{1/2}.$$

To bound $I_2$, denote

$$A_1 = \sqrt{\sum_{i=1}^n\psi_i^2} \quad\text{and}\quad A_2 = \sqrt{\sum_{i=1}^n(X_i'(\hat\beta - \beta))^2} = \sqrt{n}\,\|\hat\beta - \beta\|_{2,n},$$

and observe that by Hölder's inequality,

$$I_2 \le E\big[A_1A_2\,1\{\bar c\|\hat\beta - \beta\|_2 > \|\hat\beta - \beta\|_{2,n}\}\big] \le I_{2,1} + I_{2,2},$$

where

$$I_{2,1} = E\Big[A_1A_2\,1\Big\{A_1A_2 > C\Big(nR_n + \sqrt{n\log^{r+1}n}\Big)\Big\}\Big], \qquad I_{2,2} = C\Big(nR_n + \sqrt{n\log^{r+1}n}\Big)P\big(\bar c\|\hat\beta - \beta\|_2 > \|\hat\beta - \beta\|_{2,n}\big),$$

for some constant $C$ to be chosen later. To bound $I_{2,1}$, note that

$$P(A_1 > \sqrt{Cn}) \le 1/n$$

by Chebyshev's inequality and Assumption 3 if $C$ is large enough. Also, by Lemma 8.2 applied with $\kappa = \log n$,

$$P\Big(A_2/\sqrt{n} > R_n + \sqrt{C\log^{r+1}n/n}\Big) \le 1/n$$

if $C$ is large enough. Hence, if we set $C$ in the definition of $I_{2,1}$ and $I_{2,2}$ large enough (note that $C$ can be chosen to depend only on $c_1$, $C_1$, and $r$), it follows that

$$P\Big(A_1A_2 > C\Big(nR_n + \sqrt{n\log^{r+1}n}\Big)\Big) \le P(A_1 > \sqrt{Cn}) + P\Big(A_2/\sqrt{n} > R_n + \sqrt{C\log^{r+1}n/n}\Big) \le 2/n,$$

and so $I_{2,1}$ is bounded from above by

$$\big(E[A_1^2A_2^2]\big)^{1/2}\Big(P\Big(A_1A_2 > C\Big(nR_n + \sqrt{n\log^{r+1}n}\Big)\Big)\Big)^{1/2} \le Cn\Big(R_n + \sqrt{\log^r n/n}\Big)\big/\sqrt{n} \le C\big(\sqrt{n}R_n + \sqrt{\log^r n}\big),$$

where the first inequality follows from Hölder's inequality, Assumption 3, and (46). Also, by (7) and Markov's inequality,

$$P\big(\bar c\|\hat\beta - \beta\|_2 > \|\hat\beta - \beta\|_{2,n}\big) \le P(\hat s + s > J_n) \le \frac{E[\hat s+s]}{J_n},$$

so that

$$I_{2,2} \le \frac{C\big(nR_n + \sqrt{n\log^{r+1}n}\big)}{J_n}E[\hat s+s],$$

and so

$$I_{2,2} \le 3^{-1}E[\hat s+s]$$

for all $n \ge n_0$ depending only on $c_1$, $C_1$, $\bar c$, $\bar C$, and $r$ by the definition of $J_n$. Combining all inequalities, it follows that for all $n \ge n_0$,

$$(48)\qquad E[\hat s] \le C\sqrt{n\log p}\Big(R_n + \sqrt{\frac{\log^r n}{n}}\Big)(E[\hat s+s])^{1/2} + C\big(\sqrt{n}R_n + \sqrt{\log^r n}\big) + 3^{-1}E[\hat s+s],$$

and so

$$E[\|\hat\beta\|_0] = E[\hat s] \le s + C(\log p)(nR_n^2 + \log^r n).$$

This gives the asserted claim for all $n \ge n_0$, and since the asserted claim for $n < n_0$ is trivial, the proof is complete. □

9. Proofs of lemmas in Section 6.

Proof of Lemma 6.1. In this proof, $c$ and $C$ are strictly positive constants that depend only on $c_1$, $C_1$, and $q$ but whose values can change from place to place. By Jensen's inequality and the definition of $M_n$ in (5),

$$(49)\qquad K_n := \Big(E\Big[\max_{1\le i\le n}\max_{1\le j\le p}|X_{ij}|^2\Big]\Big)^{1/2} \le \Big(E\Big[\max_{1\le i\le n}\max_{1\le j\le p}|X_{ij}|^q\Big]\Big)^{1/q} \le n^{1/q}M_n.$$

Therefore, given that $\ell_n \le Cn^{1-c}$ by Assumptions 1(a) and 2, which implies $\log\ell_n \le C\log n$, it follows that

$$\delta_n := K_n\sqrt{\ell_n\log p/n}\,\big(1 + (\log\ell_n)(\log n)^{1/2}\big) \le Cn^{-c}$$

by Assumption 2; here, Assumption 1(a) is used only to verify that $M_n \ge c$. Also, denoting $\ell_{n,0} = n^{2/q+c_1}M_n^2s\log^3(pn)$,

$$\sup_{\theta\in S^p(\ell_n)}E[(X'\theta)^2] \le \frac{2(\ell_n + \ell_{n,0})}{\ell_{n,0}}\sup_{\theta\in S^p(\ell_{n,0})}E[(X'\theta)^2] \le \frac{C(\ell_n + \ell_{n,0})}{\ell_{n,0}}$$

by Lemma 9 in [5] and Assumption 1(b). Thus,

$$\delta_n\sup_{\theta\in S^p(\ell_n)}\big(E[(X'\theta)^2]\big)^{1/2} \le Cn^{-c}$$

by Assumption 2. Therefore, it follows from Lemma 10.3 that

$$E\Big[\sup_{\theta\in S^p(\ell_n)}\Big|\frac{1}{n}\sum_{i=1}^n(X_i'\theta)^2 - E[(X'\theta)^2]\Big|\Big] \le Cn^{-c}.$$

The asserted claim follows from combining this bound and Markov's inequality. □

Proof of Lemma 6.2. Let $T = \{j \in \{1, \dots, p\}\colon \beta_j \ne 0\}$ and $T^c = \{1, \dots, p\}\setminus T$. Fix $k = 1, \dots, K$ and denote

$$Z_k = \frac{1}{n - n_k}\sum_{i\notin I_k}X_i\varepsilon_i$$

and

$$\kappa_k = \inf\Big\{\frac{\sqrt{s}\,\|\delta\|_{2,n,-k}}{\|\delta_T\|_1}\colon \delta\in\mathbb{R}^p,\ \|\delta_{T^c}\|_1 < 3\|\delta_T\|_1\Big\}.$$

To prove the asserted claims, we will apply Theorem 1 in [5], which shows that for any $\lambda\in\Lambda_n$, on the event $\lambda \ge 4\|Z_k\|_\infty$, we have

$$(50)\qquad \|\hat\beta_{-k}(\lambda) - \beta\|_{2,n,-k} \le \frac{3\lambda\sqrt{s}}{2\kappa_k}.$$

To use this bound, we show that there exist $c > 0$, $C > 0$, and $\lambda_0 = \lambda_{n,0}\in\Lambda_n$, possibly depending on $n$, such that

$$(51)\qquad P(\kappa_k < c) \le Cn^{-c}, \qquad P\big(\lambda_0 < 4\|Z_k\|_\infty\big) \le Cn^{-c}, \qquad \lambda_0 \lesssim \Big(\frac{\log(pn)}{n}\Big)^{1/2}.$$

To prove the first claim in (51), note that

$$(52)\qquad 1 \lesssim \|\delta\|_{2,n,-k} \lesssim 1$$

with probability at least $1 - Cn^{-c}$ uniformly over all $\delta\in\mathbb{R}^p$ such that $\|\delta\|_2 = 1$ and $\|\delta_{T^c}\|_0 \le s\log n$, by Lemma 6.1 and Assumptions 1, 2, and 5. Hence, the first claim in (51) follows from Lemma 10 in [5] applied with $m$ there equal to $s\log n$ here.

To prove the second and third claims in (51), note that we have $\max_{1\le j\le p}\sum_{i\notin I_k}E[|X_{ij}\varepsilon_i|^2] \lesssim n$ by Assumptions 1(b) and 3. Also,

$$\Big(E\Big[\max_{1\le i\le n}\max_{1\le j\le p}|X_{ij}\varepsilon_i|^2\Big]\Big)^{1/2} \le \Big(E\Big[\max_{1\le i\le n}\max_{1\le j\le p}|X_{ij}\varepsilon_i|^q\Big]\Big)^{1/q} \lesssim n^{1/q}M_n.$$

Thus, by Lemma 10.1 and Assumption 2,

$$E\big[(n - n_k)\|Z_k\|_\infty\big] \lesssim \sqrt{n\log p} + n^{1/q}M_n\log p \lesssim \sqrt{n\log p}.$$

Hence, applying Lemma 10.2 with $t = (n\log n)^{1/2}$ and $Z$ there replaced by $(n - n_k)\|Z_k\|_\infty$ here, and noting that $nM_n^q/(n\log n)^{q/2} \le Cn^{-c}$ by Assumption 2, implies that

$$\|Z_k\|_\infty \lesssim \Big(\frac{\log(pn)}{n}\Big)^{1/2}$$

with probability at least $1 - Cn^{-c}$. Hence, noting that $\log^4(pn) \le Cn$ by Assumptions 1(a) and 2, it follows from Assumption 4 that there exists $\lambda_0\in\Lambda_n$ such that the second and third claims in (51) hold. By (50), this $\lambda_0$ satisfies the following bound:

$$(53)\qquad P\Big(\|\hat\beta_{-k}(\lambda_0) - \beta\|_{2,n,-k}^2 > \frac{Cs\log(pn)}{n}\Big) \le Cn^{-c}.$$

Now, to prove the asserted claims, note that using (51) and (52) and applying Theorem 2 in [5] with $m = s\log n$ there shows that $\|\hat\beta_{-k}(\lambda_0)\|_0 \lesssim s$ with probability at least $1 - Cn^{-c}$. Hence,

$$\|\hat\beta_{-k}(\lambda_0) - \beta\|_1^2 \lesssim s\|\hat\beta_{-k}(\lambda_0) - \beta\|_2^2 \lesssim s\|\hat\beta_{-k}(\lambda_0) - \beta\|_{2,n,-k}^2 \lesssim \frac{s^2\log(pn)}{n},$$

again with probability at least $1 - Cn^{-c}$, where the second inequality follows from (52), and the third from (53). This gives all asserted claims and completes the proof of the lemma. □

Proof of Lemma 6.3. Fix $k = 1, \dots, K$ and denote $\hat\beta = \hat\beta_{-k}(\lambda_0)$. By Lemma 6.2, $\|\hat\beta\|_0 \lesssim s$ with probability at least $1 - Cn^{-c}$. Hence,

$$\|\hat\beta - \beta\|_{2,n,k}^2 \lesssim (\hat\beta - \beta)'E[XX'](\hat\beta - \beta) + \|\hat\beta - \beta\|_2^2 \lesssim \|\hat\beta - \beta\|_2^2 \lesssim \frac{s\log(pn)}{n}$$

with probability at least $1 - Cn^{-c}$, where the first inequality follows from Lemma 6.1 and Assumption 5, the second from Assumption 1(b), and the third from Lemma 6.2. The asserted claim follows. □

Proof of Lemma 6.4. By the definition of $\hat\lambda$ in (4),

$$\sum_{k=1}^K\sum_{i\in I_k}(Y_i - X_i'\hat\beta_{-k}(\hat\lambda))^2 \le \sum_{k=1}^K\sum_{i\in I_k}(Y_i - X_i'\hat\beta_{-k}(\lambda_0))^2$$

for $\lambda_0$ defined in Lemma 6.2. Therefore,

$$\sum_{k=1}^K n_k\|\hat\beta_{-k}(\hat\lambda) - \beta\|_{2,n,k}^2 \le \sum_{k=1}^K n_k\|\hat\beta_{-k}(\lambda_0) - \beta\|_{2,n,k}^2 + 2\sum_{k=1}^K\sum_{i\in I_k}\varepsilon_iX_i'(\hat\beta_{-k}(\hat\lambda) - \hat\beta_{-k}(\lambda_0)).$$

Further, for all $k = 1, \dots, K$, denote $D_k = \{(X_i, Y_i)_{i\notin I_k}, (X_i)_{i\in I_k}\}$ and

$$Z_k = \max_{\lambda\in\Lambda_n}\Bigg|\frac{\sum_{i\in I_k}\varepsilon_iX_i'(\hat\beta_{-k}(\lambda) - \hat\beta_{-k}(\lambda_0))}{\sqrt{n_k}\,\|\hat\beta_{-k}(\lambda) - \hat\beta_{-k}(\lambda_0)\|_{2,n,k}}\Bigg|.$$

Then by Lemma 10.1 and Assumptions 3 and 4, we have that $E[Z_k \mid D_k] \lesssim \sqrt{\log\log n} + M_{n,k}\log\log n$, where

$$M_{n,k} = \Bigg(E\Bigg[\max_{\lambda\in\Lambda_n}\max_{i\in I_k}\frac{(\varepsilon_iX_i'(\hat\beta_{-k}(\lambda) - \hat\beta_{-k}(\lambda_0)))^2}{n_k\|\hat\beta_{-k}(\lambda) - \hat\beta_{-k}(\lambda_0)\|_{2,n,k}^2} \mid D_k\Bigg]\Bigg)^{1/2} \le \Bigg(E\Bigg[\max_{\lambda\in\Lambda_n}\max_{i\in I_k}\frac{|\varepsilon_iX_i'(\hat\beta_{-k}(\lambda) - \hat\beta_{-k}(\lambda_0))|^u}{n_k^{u/2}\|\hat\beta_{-k}(\lambda) - \hat\beta_{-k}(\lambda_0)\|_{2,n,k}^u} \mid D_k\Bigg]\Bigg)^{1/u}$$

$$\le \Bigg(E\Bigg[\sum_{\lambda\in\Lambda_n}\sum_{i\in I_k}\frac{|\varepsilon_iX_i'(\hat\beta_{-k}(\lambda) - \hat\beta_{-k}(\lambda_0))|^u}{n_k^{u/2}\|\hat\beta_{-k}(\lambda) - \hat\beta_{-k}(\lambda_0)\|_{2,n,k}^u} \mid D_k\Bigg]\Bigg)^{1/u}$$

for any $u \ge 2$. In turn, the last expression is bounded from above by

$$C\big((u(1+r))^{u(1+r)/2}\log n\big)^{1/u}$$

since (i) $E[|\varepsilon_i|^u \mid D_k] \le C^uE[(1 + |e| + |e|^{r+1})^u] \le C^u(u(1+r))^{u(1+r)/2}$ under Assumption 3 and (ii) for any sequence $(a_i)_{i\in I_k}$ in $\mathbb{R}$, we have $\sum_{i\in I_k}|a_i|^u \le (\sum_{i\in I_k}|a_i|^2)^{u/2}$. Using this bound with $u = 3$ (for example) gives $E[Z_k \mid D_k] \lesssim \sqrt{\log n}$. In addition, using this bound with $u = \log n$, it follows from Lemma 10.2 that

$$P\Big(Z_k > 2E[Z_k \mid D_k] + C\sqrt{\log^{r+1}n} \mid D_k\Big) \le Cn^{-c}.$$

Hence, $Z_k \lesssim \sqrt{\log^{r+1}n}$ with probability at least $1 - Cn^{-c}$, and so, with the same probability,

$$\Big|\sum_{i\in I_k}\varepsilon_iX_i'(\hat\beta_{-k}(\hat\lambda) - \hat\beta_{-k}(\lambda_0))\Big| \lesssim \sqrt{n\log^{r+1}n}\,\|\hat\beta_{-k}(\hat\lambda) - \hat\beta_{-k}(\lambda_0)\|_{2,n,k}.$$

Therefore, since $n_k/n \ge c_1$ by Assumption 5, we have with the same probability that

$$\sum_{k=1}^K\|\hat\beta_{-k}(\hat\lambda) - \beta\|_{2,n,k}^2 \lesssim \sum_{k=1}^K\|\hat\beta_{-k}(\lambda_0) - \beta\|_{2,n,k}^2 + \sqrt{\frac{\log^{r+1}n}{n}}\sum_{k=1}^K\|\hat\beta_{-k}(\hat\lambda) - \hat\beta_{-k}(\lambda_0)\|_{2,n,k},$$

and thus, by the triangle inequality,

$$\|\hat\beta_{-\bar k}(\hat\lambda) - \hat\beta_{-\bar k}(\lambda_0)\|_{2,n,\bar k}^2 \lesssim \sum_{k=1}^K\|\hat\beta_{-k}(\lambda_0) - \beta\|_{2,n,k}^2 + \sqrt{\frac{\log^{r+1}n}{n}}\,\|\hat\beta_{-\bar k}(\hat\lambda) - \hat\beta_{-\bar k}(\lambda_0)\|_{2,n,\bar k},$$

where $\bar k$ is a value of $k = 1, \dots, K$ that maximizes $\|\hat\beta_{-k}(\hat\lambda) - \hat\beta_{-k}(\lambda_0)\|_{2,n,k}$. Therefore, by Lemma 6.3,

$$\|\hat\beta_{-\bar k}(\hat\lambda) - \hat\beta_{-\bar k}(\lambda_0)\|_{2,n,\bar k}^2 \lesssim \frac{s\log(pn)}{n} + \sqrt{\frac{\log^{r+1}n}{n}}\,\|\hat\beta_{-\bar k}(\hat\lambda) - \hat\beta_{-\bar k}(\lambda_0)\|_{2,n,\bar k},$$

and thus, for all $k = 1, \dots, K$,

$$\|\hat\beta_{-k}(\hat\lambda) - \hat\beta_{-k}(\lambda_0)\|_{2,n,k}^2 \le \|\hat\beta_{-\bar k}(\hat\lambda) - \hat\beta_{-\bar k}(\lambda_0)\|_{2,n,\bar k}^2 \lesssim \frac{s\log(pn)}{n} + \frac{\log^{r+1}n}{n},$$

again with probability at least $1 - Cn^{-c}$. The asserted claim now follows from combining this bound with the triangle inequality and Lemma 6.3. This completes the proof of the lemma. □

Proof of Lemma 6.5. Fix $k = 1, \dots, K$. For $\lambda\in\Lambda_n$, let $\delta_\lambda = (\hat\beta_{-k}(\lambda) - \beta)/\|\hat\beta_{-k}(\lambda) - \beta\|_2$ and observe that conditional on $(X_i, Y_i)_{i\notin I_k}$, $(\delta_\lambda)_{\lambda\in\Lambda_n}$ is non-stochastic. Hence, by Lemma 8.1, Assumptions 1(a) and 5, and Chebyshev's inequality applied conditional on $(X_i, Y_i)_{i\notin I_k}$, we have for any $\lambda\in\Lambda_n$ that $n_k^{-1}\sum_{i\in I_k}(X_i'\delta_\lambda)^2 \ge c$ with probability at least $1 - Cn^{-c}$, and so $\|\hat\beta_{-k}(\lambda) - \beta\|_2^2 \le C\|\hat\beta_{-k}(\lambda) - \beta\|_{2,n,k}^2$ with the same probability. Therefore, by Assumption 4 and the union bound, $\|\hat\beta_{-k}(\hat\lambda) - \beta\|_2^2 \le C\|\hat\beta_{-k}(\hat\lambda) - \beta\|_{2,n,k}^2$ with probability at least $1 - Cn^{-c}$. The asserted claim follows from combining this inequality and Lemma 6.4. □

Proof of Lemma 6.6. Fix $k = 1, \dots, K$ and note that by Assumption 1(b) and Lemma 6.1, $\sup_{\delta\in S^p(s)}\|\delta\|_{2,n} \le C$ with probability at least $1 - Cn^{-c}$. Hence, by Lemma 6.5, Theorem 5.1, Assumption 4, and the union bound, $\|\hat\beta_{-k}(\hat\lambda)\|_0 \le n^{c_1/4}s\log^2(pn)$ with probability at least $1 - Cn^{-c}$. Further, by Assumption 2, $n^{c_1/4}s\log^2(pn) \le \sqrt{sn^{1+c_1/2}}\log(pn)$ for all $n \ge n_0$ with $n_0$ depending only on $c_1$ and $C_1$, and so, by Assumption 1(b) and Lemma 6.1, $\|\hat\beta_{-k}(\hat\lambda) - \beta\|_{2,n,-k}^2 \lesssim \|\hat\beta_{-k}(\hat\lambda) - \beta\|_2^2$ with probability at least $1 - Cn^{-c}$. Combining this bound with Lemma 6.5 now gives

$$(54)\qquad P\Big(\|\hat\beta_{-k}(\hat\lambda) - \beta\|_{2,n,-k}^2 > C\Big(\frac{s\log(pn)}{n} + \frac{\log^{r+1}n}{n}\Big)\Big) \le Cn^{-c}.$$

Now,

$$P\big(\hat\lambda\notin\Lambda_{n,k}(X_1^n, T_n)\big) \le P\big(\|\hat\beta_{-k}(\hat\lambda) - \beta\|_{2,n,-k} > T_n/2\big) + P\Big(\max_{\lambda\in\Lambda_n}\Big|\|\hat\beta_{-k}(\lambda) - \beta\|_{2,n,-k} - E\big[\|\hat\beta_{-k}(\lambda) - \beta\|_{2,n,-k} \mid X_1^n\big]\Big| > T_n/2\Big),$$

where the first and second terms on the right-hand side are at most $Cn^{-c}$ by (54) and by Lemma 8.2 applied with $\kappa = \log n$, respectively, as long as the constant $C$ in the definition of $T_n$ is large enough. The asserted claim follows. □

Proof of Lemma 6.7. The result in this lemma is sometimes referred to as the two point inequality; see Section 2.4 in [17], where the proof is also provided. □

10. Technical Lemmas.

Lemma 10.1. Let $X_1, \dots, X_n$ be independent centered random vectors in $\mathbb{R}^p$ with $p \ge 2$. Define $Z = \|\sum_{i=1}^nX_i\|_\infty$, $M = \max_{1\le i\le n}\|X_i\|_\infty$, and $\sigma^2 = \max_{1\le j\le p}\sum_{i=1}^nE[X_{ij}^2]$. Then

$$E[Z] \le K\big(\sigma\sqrt{\log p} + \sqrt{E[M^2]}\,\log p\big),$$

where $K > 0$ is a universal constant.

Proof. See Lemma E.1 in [9]. □

Lemma 10.2. Consider the setting of Lemma 10.1. For every $\eta > 0$, $t > 0$, and $q \ge 1$, we have

$$P\big(Z \ge (1 + \eta)E[Z] + t\big) \le \exp(-t^2/(3\sigma^2)) + KE[M^q]/t^q,$$

where the constant $K > 0$ depends only on $\eta$ and $q$.

Proof. See Lemma E.2 in [9]. □

Remark 10.1. In Lemmas 10.1 and 10.2, if, in addition, we assume that $X_1, \dots, X_n$ are Gaussian, then $E[Z] \le \sigma\sqrt{2\log p}$ by Lemma A.3.1 in [13] and, for every $t > 0$, $P(Z > E[Z] + t) \le \exp(-t^2/(2\sigma^2))$ by Theorem 2.1.1 in [1]. □
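As a numerical illustration of the Gaussian case (a sketch with assumed simulation sizes, not part of the argument), one can compare the simulated mean of $Z$ with the bound $\sigma\sqrt{2\log p}$:

```python
# Quick numerical look at the Gaussian bound E[Z] <= sigma * sqrt(2 log p)
# from Remark 10.1, with Z = ||sum_i X_i||_inf for i.i.d. N(0, I_p) vectors.
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 20, 500, 500
S = rng.standard_normal((reps, n, p)).sum(axis=1)  # each row of S is N(0, n I_p)
Z = np.abs(S).max(axis=1)
sigma = np.sqrt(n)  # sigma = max_j (sum_i E[X_ij^2])^(1/2) = sqrt(n) here
print(Z.mean(), sigma * np.sqrt(2 * np.log(p)))  # simulated mean vs. the bound
```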

Lemma 10.3. Let $X_1, \dots, X_n$ be i.i.d. random vectors in $\mathbb{R}^p$ with $p \ge 2$. Also, let $K = (E[\max_{1\le i\le n}\max_{1\le j\le p}|X_{ij}|^2])^{1/2}$ and, for $\ell \ge 1$, let

$$\delta_n = \frac{K\sqrt{\ell\log p}}{\sqrt{n}}\big(1 + (\log\ell)(\log^{1/2}n)\big).$$

Moreover, let $S^p(\ell) = \{\theta\in\mathbb{R}^p\colon \|\theta\| = 1 \text{ and } \|\theta\|_0 \le \ell\}$. Then

$$E\Big[\sup_{\theta\in S^p(\ell)}\Big|\frac{1}{n}\sum_{i=1}^n(X_i'\theta)^2 - E[(X_1'\theta)^2]\Big|\Big] \le C\Big(\delta_n^2 + \delta_n\sup_{\theta\in S^p(\ell)}\big(E[(X_1'\theta)^2]\big)^{1/2}\Big),$$

where $C > 0$ is a universal constant.

Proof. See Lemma B.1 in [6]. See also [10] for the original result. □

Remark 10.2. If $X_1, \dots, X_n$ are centered Gaussian random vectors in $\mathbb{R}^p$ with $p \ge 2$, then for any $\varepsilon_1, \varepsilon_2, \ell > 0$ such that $\varepsilon_1 + \varepsilon_2 < 1$ and $\ell \le \min(p, \varepsilon_1^2n)$,

$$\sup_{\theta\in S^p(\ell)}\Big|\frac{1}{n}\sum_{i=1}^n(X_i'\theta)^2 - E[(X_1'\theta)^2]\Big| \le 3(\varepsilon_1 + \varepsilon_2)\sup_{\theta\in S^p(\ell)}E[(X_1'\theta)^2]$$

with probability at least $1 - 2e^{-n\varepsilon_2^2/2}p^\ell$ by the proof of Proposition 2 in [18]. □

Page 21: ON CROSS-VALIDATED LASSO IN HIGH DIMENSIONSdenoted as P1-Lasso and P2-Lasso estimators respectively, and the cross-validated Lasso estimator is denoted as CV-Lasso. Figure 1 contains

ON CROSS-VALIDATED LASSO 21

Lemma 10.4. Let ε be a random variable that is absolutely continuouswith respect to Lebesgue measure on R with continuously differentiable pdfχ and suppose that f : R → R is either Lipschitz-continuous or continu-ously differentiable with finite E[|f ′(ε)|]. Suppose also that both E[|f(ε)|] andE[|f(ε)χ′(ε)|/χ(ε)] are finite. Then

(55) E[f ′(ε)] = −E[f(ε)χ′(ε)/χ(ε)].

Remark 10.3. When ε has a N(0, σ2) distribution, the formula (55)reduces to the well-known Stein identity, E[f ′(ε)] = E[εf(ε)]/σ2. �

Proof. The proof follows immediately from integration by parts and theLebesgue dominated convergence theorem; for example, see Section 13.1.1in [8] for similar results. �
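The Gaussian special case in Remark 10.3 is easy to verify by simulation. The sketch below (illustrative only) checks the Stein identity for $f(x) = \sin(x)$, where both sides equal $e^{-\sigma^2/2}$:

```python
# Numerical check of the Stein identity E[f'(eps)] = E[eps f(eps)] / sigma^2
# for eps ~ N(0, sigma^2), using f(x) = sin(x) so that f'(x) = cos(x).
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5
eps = sigma * rng.standard_normal(10_000_000)

lhs = np.cos(eps).mean()                     # E[f'(eps)]
rhs = (eps * np.sin(eps)).mean() / sigma**2  # E[eps f(eps)] / sigma^2
print(lhs, rhs)  # both close to exp(-sigma**2 / 2) ~ 0.3247
```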

Lemma 10.5. Let $e = (e_1, \dots, e_n)$ be a standard Gaussian random vector and let $Q_i\colon\mathbb{R}\to\mathbb{R}$, $i = 1, \dots, n$, be some strictly increasing continuously differentiable functions. Denote $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$, where $\varepsilon_i = Q_i(e_i)$, $i = 1, \dots, n$, and let $f\colon\mathbb{R}^n\to\mathbb{R}$ be Lipschitz-continuous with Lipschitz constant $L > 0$. Then for any convex $u\colon\mathbb{R}\to\mathbb{R}_+$, the random variable

$$V = f(\varepsilon) = f(\varepsilon_1, \dots, \varepsilon_n)$$

satisfies the following inequality:

$$E[u(V - E[V])] \le E\Big[u\Big(\frac{\pi L}{2}\max_{1\le i\le n}Q_i'(e_i)\,\xi\Big)\Big],$$

where $\xi$ is a standard Gaussian random variable that is independent of $e$.

Remark 10.4. The proof of this lemma given below mimics the well-known interpolation proof of the Gaussian concentration inequality for Lipschitz functions; see Theorem 2.1.12 in [14], for example. □

Proof. To prove the asserted claim, let $\tilde e = (\tilde e_1, \dots, \tilde e_n)$ be another standard Gaussian random vector that is independent of $e$. Also, define

$$F(x) = F(x_1, \dots, x_n) = f(Q_1(x_1), \dots, Q_n(x_n)), \qquad x = (x_1, \dots, x_n)\in\mathbb{R}^n.$$

Then

$$E[u(V - E[V])] = E\big[u(F(e) - E[F(e)])\big] = E\big[u(F(e) - E[F(\tilde e)])\big] = E\big[u(E[F(e) - F(\tilde e) \mid e])\big] \le E\big[E[u(F(e) - F(\tilde e)) \mid e]\big] = E\big[u(F(e) - F(\tilde e))\big].$$

Further, define

$$h(\theta) = F\big(\tilde e\cos(\pi\theta/2) + e\sin(\pi\theta/2)\big), \qquad \theta\in[0, 1],$$

so that $h(1) = F(e)$, $h(0) = F(\tilde e)$, and for all $\theta\in(0, 1)$,

$$h'(\theta) = \frac{\pi}{2}\sum_{i=1}^nF_i\big(\tilde e\cos(\pi\theta/2) + e\sin(\pi\theta/2)\big)\big(e_i\cos(\pi\theta/2) - \tilde e_i\sin(\pi\theta/2)\big) = \frac{\pi}{2}\big(\nabla F(\tilde W_\theta), W_\theta\big),$$

where we denoted

$$W_\theta = e\cos(\pi\theta/2) - \tilde e\sin(\pi\theta/2) \quad\text{and}\quad \tilde W_\theta = \tilde e\cos(\pi\theta/2) + e\sin(\pi\theta/2).$$

Note that for each $\theta\in(0, 1)$, the random vectors $W_\theta$ and $\tilde W_\theta$ are independent standard Gaussian. Hence,

$$E\big[u(F(e) - F(\tilde e))\big] = E\big[u(h(1) - h(0))\big] = E\Big[u\Big(\int_0^1h'(\theta)d\theta\Big)\Big] \le E\Big[\int_0^1u(h'(\theta))d\theta\Big] = E\Big[\int_0^1u\Big(\frac{\pi}{2}\big(\nabla F(\tilde W_\theta), W_\theta\big)\Big)d\theta\Big]$$

$$= \int_0^1E\Big[u\Big(\frac{\pi}{2}\big(\nabla F(\tilde W_\theta), W_\theta\big)\Big)\Big]d\theta = \int_0^1E\Big[u\Big(\frac{\pi}{2}\big(\nabla F(e), \tilde e\big)\Big)\Big]d\theta = E\Big[u\Big(\frac{\pi}{2}\big(\nabla F(e), \tilde e\big)\Big)\Big].$$

Next, note that since $e$ and $\tilde e$ are independent standard Gaussian random vectors, conditional on $e$, the random variable $(\nabla F(e), \tilde e)$ is zero-mean Gaussian with variance

$$\sum_{i=1}^n\Big(\frac{\partial F}{\partial e_i}(e)\Big)^2 = \sum_{i=1}^n\Big(\frac{\partial f}{\partial\varepsilon_i}(\varepsilon)\Big)^2(Q_i'(e_i))^2 \le \max_{1\le i\le n}(Q_i'(e_i))^2\sum_{i=1}^n\Big(\frac{\partial f}{\partial\varepsilon_i}(\varepsilon)\Big)^2 \le L^2\max_{1\le i\le n}(Q_i'(e_i))^2.$$

Therefore, using the fact that $u$ is convex, we conclude that

$$E\Big[u\Big(\frac{\pi}{2}\big(\nabla F(e), \tilde e\big)\Big)\Big] = E\Big[E\Big[u\Big(\frac{\pi}{2}\big(\nabla F(e), \tilde e\big)\Big) \mid e\Big]\Big] = E\Big[E\Big[u\Big(\frac{\pi}{2}\Big(\sum_{i=1}^n\Big(\frac{\partial F}{\partial e_i}(e)\Big)^2\Big)^{1/2}\xi\Big) \mid e\Big]\Big] \le E\Big[E\Big[u\Big(\frac{\pi L}{2}\max_{1\le i\le n}Q_i'(e_i)\,\xi\Big) \mid e\Big]\Big] = E\Big[u\Big(\frac{\pi L}{2}\max_{1\le i\le n}Q_i'(e_i)\,\xi\Big)\Big],$$

where $\xi$ is a standard Gaussian random variable that is independent of the vector $e$. Combining the presented inequalities gives the asserted claim. □

Lemma 10.6. Let $X_1, \dots, X_m$ be random variables (not necessarily independent). Then for all $\alpha\in(0, 1)$,

$$Q_{1-\alpha}(X_1 + \dots + X_m) \le Q_{1-\alpha/(2m)}(X_1) + \dots + Q_{1-\alpha/(2m)}(X_m),$$

where for any random variable $Z$ and any number $\alpha\in(0, 1)$, $Q_\alpha(Z)$ denotes the $\alpha$th quantile of the distribution of $Z$, i.e. $Q_\alpha(Z) = \inf\{z\in\mathbb{R}\colon \alpha \le P(Z \le z)\}$.

Proof. To prove the asserted claim, suppose to the contrary that

$$Q_{1-\alpha}(X_1 + \dots + X_m) > Q_{1-\alpha/(2m)}(X_1) + \dots + Q_{1-\alpha/(2m)}(X_m).$$

Then by the union bound,

$$\alpha \le P\big(X_1 + \dots + X_m \ge Q_{1-\alpha}(X_1 + \dots + X_m)\big) \le P\big(X_1 + \dots + X_m > Q_{1-\alpha/(2m)}(X_1) + \dots + Q_{1-\alpha/(2m)}(X_m)\big)$$

$$\le P\big(X_1 > Q_{1-\alpha/(2m)}(X_1)\big) + \dots + P\big(X_m > Q_{1-\alpha/(2m)}(X_m)\big) \le \alpha/(2m) + \dots + \alpha/(2m) = \alpha/2,$$

which is a contradiction. Thus, the asserted claim follows. □
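Since the lemma places no independence restrictions on $X_1, \dots, X_m$, a simulation with correlated variables illustrates it well. The sketch below (illustrative only, with assumed parameters) checks the inequality for equicorrelated Gaussians:

```python
# Simulation check of Lemma 10.6 for dependent (equicorrelated Gaussian)
# variables: the (1 - alpha) quantile of the sum is bounded by the sum of
# the (1 - alpha/(2m)) marginal quantiles.
import numpy as np

rng = np.random.default_rng(0)
m, alpha = 5, 0.1
cov = 0.5 + 0.5 * np.eye(m)  # variance 1, pairwise correlation 0.5
X = rng.multivariate_normal(np.zeros(m), cov, size=1_000_000)

lhs = np.quantile(X.sum(axis=1), 1 - alpha)
rhs = sum(np.quantile(X[:, j], 1 - alpha / (2 * m)) for j in range(m))
print(lhs, rhs)  # lhs <= rhs
```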

[Figure 1 appears here; only its panel titles survive extraction: "Prediction norm", "L2 norm", "L1 norm", "Out-of-sample prediction", "Sparsity of CV-Lasso", and "Sparsity of B-Lasso".]

Fig 1. DGP1, $n = 100$, $p = 40$, and $\rho = 0.75$. The top-left, top-right, middle-left, and middle-right panels show the mean of the estimation error of Lasso estimators in the prediction, $L^2$, $L^1$, and out-of-sample prediction norms. The dashed line represents the mean of the estimation error of the Lasso estimator as a function of $\lambda$ (we compute the Lasso estimator for each value of $\lambda$ in the candidate set $\Lambda_n$; we sort the values in $\Lambda_n$ from the smallest to the largest and put the order of $\lambda$ on the horizontal axis; we only show the results for values of $\lambda$ up to order 25, as these give the most meaningful comparisons). The solid, dotted, and dashed-dotted horizontal lines represent the mean of the estimation error of the CV-Lasso, P1-Lasso ($\lambda = 0.2716$), and P2-Lasso ($\lambda = 0.2146$) estimators, respectively.

[Figure 2 appears here; only its panel titles survive extraction: "Prediction norm", "L2 norm", "L1 norm", "Out-of-sample prediction", "Sparsity of CV-Lasso", and "Sparsity of B-Lasso".]

Fig 2. DGP1, $n = 100$, $p = 400$, and $\rho = 0.75$. The top-left, top-right, middle-left, and middle-right panels show the mean of the estimation error of Lasso estimators in the prediction, $L^2$, $L^1$, and out-of-sample prediction norms. The dashed line represents the mean of the estimation error of the Lasso estimator as a function of $\lambda$ (we compute the Lasso estimator for each value of $\lambda$ in the candidate set $\Lambda_n$; we sort the values in $\Lambda_n$ from the smallest to the largest and put the order of $\lambda$ on the horizontal axis; we only show the results for values of $\lambda$ up to order 25, as these give the most meaningful comparisons). The solid, dotted, and dashed-dotted horizontal lines represent the mean of the estimation error of the CV-Lasso, P1-Lasso ($\lambda = 0.3462$), and P2-Lasso ($\lambda = 0.3035$) estimators, respectively.

Table 1. The mean of estimation error of Lasso estimators. Within each norm, columns are CV-Lasso, λ-Lasso, and P2-Lasso.

DGP1 (ρ = 0.5)
                     Prediction norm           L2 norm                   Out-of-sample prediction norm
(n, p) = (100, 40)   0.4252  0.4097  0.4435    0.6164  0.5700  0.7013    0.4701  0.4530  0.4883
(n, p) = (100, 100)  0.5243  0.5040  0.5303    0.8206  0.7598  0.8897    0.6091  0.5885  0.6139
(n, p) = (100, 400)  0.7023  0.6448  0.6595    1.2629  1.1624  1.2548    0.8852  0.8474  0.8565
(n, p) = (400, 40)   0.2116  0.2047  0.2174    0.2875  0.2634  0.3186    0.2164  0.2095  0.2224
(n, p) = (400, 100)  0.2581  0.2501  0.2561    0.3674  0.3301  0.3790    0.2667  0.2588  0.2648
(n, p) = (400, 400)  0.3300  0.3206  0.3206    0.5018  0.4546  0.4807    0.3473  0.3391  0.3391

DGP2 (ρ = 0.5)
                     Prediction norm           L2 norm                   Out-of-sample prediction norm
(n, p) = (100, 40)   0.7532  0.7123  0.7672    1.1041  0.9907  1.2107    0.8293  0.7857  0.8419
(n, p) = (100, 100)  0.9237  0.8641  0.8917    1.4644  1.3044  1.4792    1.0551  1.0048  1.0264
(n, p) = (100, 400)  1.1497  1.0465  1.0493    1.9868  1.8541  1.8962    1.3631  1.3103  1.3118
(n, p) = (400, 40)   0.3647  0.3521  0.3746    0.4961  0.4529  0.5485    0.3731  0.3603  0.3831
(n, p) = (400, 100)  0.4470  0.4325  0.4431    0.6351  0.5717  0.6550    0.4616  0.4473  0.4577
(n, p) = (400, 400)  0.5739  0.5564  0.5561    0.8714  0.7882  0.8333    0.6037  0.5885  0.5882

DGP1 (ρ = 0.75)
                     Prediction norm           L2 norm                   Out-of-sample prediction norm
(n, p) = (100, 40)   0.4481  0.4292  0.5677    0.9133  0.8213  1.3963    0.5005  0.4791  0.6238
(n, p) = (100, 100)  0.5817  0.5486  0.6496    1.3110  1.1144  1.6547    0.6907  0.6611  0.7514
(n, p) = (100, 400)  0.7616  0.6957  0.7288    2.0360  1.8350  2.0207    0.9836  0.9525  0.9543
(n, p) = (400, 40)   0.2206  0.2141  0.2829    0.4143  0.3745  0.6556    0.2263  0.2196  0.2894
(n, p) = (400, 100)  0.2782  0.2717  0.3322    0.5381  0.4688  0.7766    0.2897  0.2830  0.3436
(n, p) = (400, 400)  0.3847  0.3771  0.4112    0.8217  0.6751  0.9774    0.4151  0.4081  0.4402

DGP2 (ρ = 0.75)
                     Prediction norm           L2 norm                   Out-of-sample prediction norm
(n, p) = (100, 40)   0.7730  0.7285  0.8393    1.6151  1.3895  1.9690    0.8520  0.8072  0.9105
(n, p) = (100, 100)  0.9619  0.8843  0.9407    2.1316  1.8093  2.2295    1.0938  1.0293  1.0631
(n, p) = (100, 400)  1.2454  1.0586  1.0740    2.8271  2.4914  2.6602    1.3966  1.3298  1.3298
(n, p) = (400, 40)   0.3811  0.3696  0.4876    0.7141  0.6427  1.1292    0.3907  0.3788  0.4984
(n, p) = (400, 100)  0.4859  0.4719  0.5710    0.9443  0.8132  1.3320    0.5061  0.4920  0.5910
(n, p) = (400, 400)  0.6790  0.6499  0.6834    1.5102  1.1683  1.6067    0.7229  0.7028  0.7291

Table 2. Probabilities for the number of non-zero coefficients of the CV-Lasso estimator hitting different brackets.

DGP1 (ρ = 0.5)
                     [0,5]   [6,10]  [11,15] [16,20] [21,25] [26,30] [31,35] [36,p]
(n, p) = (100, 40)   0.0008  0.0766  0.3598  0.3548  0.1582  0.0390  0.0088  0.0020
(n, p) = (100, 100)  0.0006  0.0120  0.0822  0.2146  0.2606  0.1994  0.1186  0.1120
(n, p) = (100, 400)  0.0010  0.0190  0.0480  0.0760  0.0978  0.1196  0.1288  0.5098
(n, p) = (400, 40)   0.0006  0.0964  0.3926  0.3460  0.1292  0.0316  0.0034  0.0002
(n, p) = (400, 100)  0.0006  0.0176  0.1404  0.2624  0.2596  0.1780  0.0828  0.0586
(n, p) = (400, 400)  0.0000  0.0016  0.0212  0.0728  0.1372  0.1618  0.1664  0.4390

DGP2 (ρ = 0.5)
                     [0,5]   [6,10]  [11,15] [16,20] [21,25] [26,30] [31,35] [36,p]
(n, p) = (100, 40)   0.0142  0.1436  0.3418  0.3070  0.1402  0.0432  0.0094  0.0006
(n, p) = (100, 100)  0.0158  0.1096  0.1866  0.2186  0.1828  0.1338  0.0754  0.0774
(n, p) = (100, 400)  0.0310  0.0988  0.1586  0.1752  0.1446  0.1042  0.0830  0.2046
(n, p) = (400, 40)   0.0008  0.1030  0.4032  0.3334  0.1258  0.0268  0.0060  0.0010
(n, p) = (400, 100)  0.0002  0.0202  0.1358  0.2530  0.2684  0.1704  0.0814  0.0706
(n, p) = (400, 400)  0.0002  0.0020  0.0274  0.0798  0.1280  0.1590  0.1592  0.4444

DGP1 (ρ = 0.75)
                     [0,5]   [6,10]  [11,15] [16,20] [21,25] [26,30] [31,35] [36,p]
(n, p) = (100, 40)   0.0028  0.0448  0.2658  0.3920  0.2050  0.0716  0.0178  0.0002
(n, p) = (100, 100)  0.0006  0.0316  0.1080  0.1604  0.1948  0.2000  0.1470  0.1576
(n, p) = (100, 400)  0.0206  0.0194  0.0506  0.1110  0.1534  0.1660  0.1398  0.3392
(n, p) = (400, 40)   0.0000  0.0278  0.2926  0.4222  0.1966  0.0510  0.0090  0.0008
(n, p) = (400, 100)  0.0000  0.0002  0.0136  0.1156  0.2480  0.2920  0.1836  0.1470
(n, p) = (400, 400)  0.0000  0.0000  0.0002  0.0004  0.0060  0.0192  0.0530  0.9212

DGP2 (ρ = 0.75)
                     [0,5]   [6,10]  [11,15] [16,20] [21,25] [26,30] [31,35] [36,p]
(n, p) = (100, 40)   0.0254  0.2152  0.3326  0.2546  0.1206  0.0392  0.0116  0.0008
(n, p) = (100, 100)  0.0904  0.1024  0.2192  0.2262  0.1606  0.0958  0.0502  0.0552
(n, p) = (100, 400)  0.3916  0.1022  0.0988  0.0906  0.0826  0.0650  0.0558  0.1134
(n, p) = (400, 40)   0.0002  0.0290  0.2976  0.4314  0.1862  0.0468  0.0082  0.0006
(n, p) = (400, 100)  0.0000  0.0050  0.0282  0.1264  0.2370  0.2820  0.1804  0.1410
(n, p) = (400, 400)  0.0002  0.0134  0.0582  0.0974  0.1156  0.1020  0.0860  0.5272

Table 3. Probabilities for $\max_{1\le j\le p}n^{-1}|\sum_{i=1}^nX_{ij}\varepsilon_i|/\hat\lambda$ hitting different brackets.

DGP1 (ρ = 0.50)
                     [0,0.5) [0.5,1) [1,1.5) [1.5,2) [2,2.5) [2.5,3) [3,∞)
(n, p) = (100, 40)   0.0002  0.0910  0.3458  0.2842  0.1460  0.0670  0.0648
(n, p) = (100, 100)  0.0000  0.1560  0.4376  0.2470  0.0910  0.0322  0.0338
(n, p) = (100, 400)  0.0116  0.3262  0.3374  0.1396  0.0592  0.0282  0.0566
(n, p) = (400, 40)   0.0000  0.1118  0.4292  0.3042  0.0988  0.0364  0.0182
(n, p) = (400, 100)  0.0000  0.2648  0.5784  0.1362  0.0158  0.0032  0.0004
(n, p) = (400, 400)  0.0000  0.5828  0.3972  0.0162  0.0004  0.0000  0.0000

DGP2 (ρ = 0.50)
                     [0,0.5) [0.5,1) [1,1.5) [1.5,2) [2,2.5) [2.5,3) [3,∞)
(n, p) = (100, 40)   0.0020  0.1522  0.3296  0.2502  0.1322  0.0596  0.0674
(n, p) = (100, 100)  0.0084  0.3096  0.3772  0.1650  0.0624  0.0254  0.0208
(n, p) = (100, 400)  0.0394  0.5254  0.2252  0.0616  0.0210  0.0090  0.0252
(n, p) = (400, 40)   0.0002  0.1170  0.4452  0.2860  0.1006  0.0302  0.0204
(n, p) = (400, 100)  0.0000  0.2676  0.5656  0.1422  0.0198  0.0022  0.0014
(n, p) = (400, 400)  0.0000  0.5908  0.3894  0.0156  0.0008  0.0002  0.0000

DGP1 (ρ = 0.75)
                     [0,0.5) [0.5,1) [1,1.5) [1.5,2) [2,2.5) [2.5,3) [3,∞)
(n, p) = (100, 40)   0.0000  0.0224  0.1220  0.2250  0.2012  0.1488  0.2796
(n, p) = (100, 100)  0.0008  0.1144  0.2546  0.2306  0.1698  0.0944  0.1312
(n, p) = (100, 400)  0.0316  0.4068  0.3408  0.1072  0.0346  0.0164  0.0284
(n, p) = (400, 40)   0.0000  0.0098  0.1384  0.2800  0.2620  0.1526  0.1572
(n, p) = (400, 100)  0.0000  0.0144  0.2918  0.4250  0.1868  0.0592  0.0228
(n, p) = (400, 400)  0.0000  0.0684  0.6724  0.2304  0.0242  0.0040  0.0006

DGP2 (ρ = 0.75)
                     [0,0.5) [0.5,1) [1,1.5) [1.5,2) [2,2.5) [2.5,3) [3,∞)
(n, p) = (100, 40)   0.0062  0.1090  0.2424  0.2142  0.1508  0.1040  0.1674
(n, p) = (100, 100)  0.0686  0.2298  0.3256  0.1842  0.0798  0.0382  0.0518
(n, p) = (100, 400)  0.3616  0.3000  0.1594  0.0508  0.0186  0.0080  0.0118
(n, p) = (400, 40)   0.0000  0.0102  0.1306  0.2918  0.2750  0.1482  0.1442
(n, p) = (400, 100)  0.0000  0.0292  0.2984  0.4072  0.1864  0.0560  0.0226
(n, p) = (400, 400)  0.0004  0.3798  0.4626  0.1344  0.0134  0.0016  0.0002


References.

[1] Adler, R. and Taylor, J. (2007). Random Fields and Geometry. Springer.
[2] Bellec, P. C. (2018). The noise barrier and the large signal bias of the Lasso and other convex estimators. arXiv preprint arXiv:1804.01230.
[3] Bellec, P. and Tsybakov, A. (2017). Bounds on the prediction error of penalized least squares estimators with convex penalty. Modern Problems of Stochastic Analysis and Statistics: Selected Contributions in Honor of Valentin Konakov, 315-333.
[4] Bellec, P. and Zhang, C.-H. (2018). Second order Stein: SURE for SURE and other applications in high-dimensional inference. arXiv preprint arXiv:1811.04121.
[5] Belloni, A. and Chernozhukov, V. (2011). High dimensional sparse econometric models: an introduction. Chapter 3 in Inverse Problems and High-Dimensional Estimation, 203, 121-156.
[6] Belloni, A., Chernozhukov, V., Chetverikov, D., and Wei, Y. (2015). Uniformly valid post-regularization confidence regions for many functional parameters in Z-estimation framework. arXiv preprint arXiv:1512.07619.
[7] Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
[8] Chen, L., Goldstein, L., and Shao, Q.-M. (2011). Normal Approximation by Stein's Method. Springer, Probability and Its Applications.
[9] Chernozhukov, V., Chetverikov, D., and Kato, K. (2014). Central limit theorems and bootstrap in high dimensions. arXiv preprint arXiv:1412.3661.
[10] Rudelson, M. and Vershynin, R. (2008). On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61, 1025-1045.
[11] Rudelson, M. and Vershynin, R. (2013). Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18, no. 82, 1-9.
[12] Sun, T. and Zhang, C.-H. (2013). Sparse matrix inversion with scaled Lasso. Journal of Machine Learning Research, 14, 3385-3418.
[13] Talagrand, M. (2011). Mean Field Models for Spin Glasses. Springer.
[14] Tao, T. (2012). Topics in Random Matrix Theory. American Mathematical Society.
[15] Tibshirani, R. and Taylor, J. (2012). Degrees of freedom in Lasso problems. The Annals of Statistics, 40, 1198-1232.
[16] Tibshirani, R. (2013). The Lasso problem and uniqueness. Electronic Journal of Statistics, 7, 1456-1490.
[17] van de Geer, S. (2016). Estimation and Testing under Sparsity. Springer, Lecture Notes in Mathematics.
[18] Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the Lasso selection in high-dimensional linear regression. The Annals of Statistics, 36, 1567-1594.

Department of Economics, UCLA
Bunche Hall, 8369
315 Portola Plaza
Los Angeles, CA 90095, USA.
E-mail: [email protected]

Department of Economics, UCLA
Bunche Hall, 8379
315 Portola Plaza
Los Angeles, CA 90095, USA.
E-mail: [email protected]

Department of Economics and
Operations Research Center, MIT
50 Memorial Drive
Cambridge, MA 02142, USA.
E-mail: [email protected]

