
    The Annals of Statistics

2009, Vol. 37, No. 4, 1705–1732. DOI: 10.1214/08-AOS620. © Institute of Mathematical Statistics, 2009.

SIMULTANEOUS ANALYSIS OF LASSO AND DANTZIG SELECTOR

    BY PETER J. BICKEL, YAACOV RITOV AND ALEXANDRE B. TSYBAKOV

University of California at Berkeley, The Hebrew University, and Université Paris VI and CREST

We show that, under a sparsity scenario, the Lasso estimator and the Dantzig selector exhibit similar behavior. For both methods, we derive, in parallel, oracle inequalities for the prediction risk in the general nonparametric regression model, as well as bounds on the $\ell_p$ estimation loss for $1 \le p \le 2$ in the linear model when the number of variables can be much larger than the sample size.

1. Introduction. During the last few years, a great deal of attention has been focused on the $\ell_1$-penalized least squares (Lasso) estimator of parameters in high-dimensional linear regression when the number of variables can be much larger than the sample size [8, 9, 11, 17, 18, 20–22, 26] and [27]. Quite recently, Candes and Tao [7] have proposed a new estimate for such linear models, the Dantzig selector, for which they establish optimal $\ell_2$ rate properties under a sparsity scenario; that is, when the number of nonzero components of the true vector of parameters is small.

Lasso estimators have also been studied in the nonparametric regression setup [2–4, 12, 13, 19] and [5]. In particular, Bunea, Tsybakov and Wegkamp [2–5] obtain sparsity oracle inequalities for the prediction loss in this context and point out the implications for minimax estimation in classical nonparametric regression settings, as well as for the problem of aggregation of estimators. An analog of Lasso for density estimation with similar properties (SPADES) is proposed in [6]. Modified versions of Lasso estimators (nonquadratic terms and/or penalties slightly different from $\ell_1$) for nonparametric regression with random design are suggested and studied under prediction loss in [14] and [25]. Sparsity oracle inequalities for the Dantzig selector with random design are obtained in [15]. In linear fixed design regression, Meinshausen and Yu [18] establish a bound on the $\ell_2$ loss for the coefficients of Lasso that is quite different from the bound on the same loss for the Dantzig selector proven in [7].

The main message of this paper is that, under a sparsity scenario, the Lasso and the Dantzig selector exhibit similar behavior, both for linear regression and for nonparametric regression models, for $\ell_2$ prediction loss and for $\ell_p$ loss in the coefficients for $1 \le p \le 2$. All the results of the paper are nonasymptotic.

Received August 2007; revised April 2008.
Supported in part by NSF Grant DMS-06-05236, ISF grant, France-Berkeley Fund, the Grant ANR-06-BLAN-0194 and the European Network of Excellence PASCAL.
AMS 2000 subject classifications. Primary 60K35, 62G08; secondary 62C20, 62G05, 62G20.
Key words and phrases. Linear models, model selection, nonparametric statistics.

Let us specialize to the case of linear regression with many covariates, $y = X\beta + w$, where $X$ is the $n \times M$ deterministic design matrix, with $M$ possibly much larger than $n$, and $w$ is a vector of i.i.d. standard normal random variables. This is the situation considered most recently by Candes and Tao [7] and Meinshausen and Yu [18]. Here, sparsity specifies that the high-dimensional vector $\beta$ has coefficients that are mostly 0.

We develop general tools to study these two estimators in parallel. For the fixed design Gaussian regression model, we recover, as particular cases, sparsity oracle inequalities for the Lasso, as in Bunea, Tsybakov and Wegkamp [4], and $\ell_2$ bounds for the coefficients of the Dantzig selector, as in Candes and Tao [7]. This is obtained as a consequence of our more general results, which are the following:

In the nonparametric regression model, we prove sparsity oracle inequalities for the Dantzig selector; that is, bounds on the prediction loss in terms of the best possible (oracle) approximation under the sparsity constraint.

Similar sparsity oracle inequalities are proved for the Lasso in the nonparametric regression model, and this is done under more general assumptions on the design matrix than in [4].

We prove that, for nonparametric regression, the Lasso and the Dantzig selector are approximately equivalent in terms of the prediction loss.

We develop geometrical assumptions that are considerably weaker than those of Candes and Tao [7] for the Dantzig selector and Bunea, Tsybakov and Wegkamp [4] for the Lasso. In the context of linear regression where the number of variables is possibly much larger than the sample size, these assumptions imply the result of [7] for the $\ell_2$ loss and generalize it to $\ell_p$ loss, $1 \le p \le 2$, and to prediction loss. Our bounds for the Lasso differ from those for the Dantzig selector only in numerical constants.

We begin, in the next section, by defining the Lasso and Dantzig procedures and the notation. In Section 3, we present our key geometric assumptions. Some sufficient conditions for these assumptions are given in Section 4, where they are also compared to those of [7] and [18], as well as to ones appearing in [4] and [5]. We note a weakness of our assumptions, and, hence, of those in the papers we cited, and we discuss a way of slightly remedying them. Sections 5 and 6 give some equivalence results and sparsity oracle inequalities for the Lasso and Dantzig estimators in the general nonparametric regression model. Section 7 focuses on the linear regression model and includes a final discussion. Two important technical lemmas are given in Appendix B, as well as most of the proofs.

2. Definitions and notation. Let $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ be a sample of independent random pairs with
$$Y_i = f(Z_i) + W_i, \qquad i = 1, \ldots, n,$$


where $f: \mathcal{Z} \to \mathbb{R}$ is an unknown regression function to be estimated, $\mathcal{Z}$ is a Borel subset of $\mathbb{R}^d$, the $Z_i$'s are fixed elements in $\mathcal{Z}$ and the regression errors $W_i$ are Gaussian. Let $\mathcal{F}_M = \{f_1, \ldots, f_M\}$ be a finite dictionary of functions $f_j: \mathcal{Z} \to \mathbb{R}$, $j = 1, \ldots, M$. We assume throughout that $M \ge 2$.

Depending on the statistical targets, the dictionary $\mathcal{F}_M$ can contain qualitatively different parts. For instance, it can be a collection of basis functions used to approximate $f$ in the nonparametric regression model (e.g., wavelets, splines with fixed knots, step functions). Another example is related to the aggregation problem, where the $f_j$ are estimators arising from $M$ different methods. They can also correspond to $M$ different values of the tuning parameter of the same method. Without much loss of generality, these estimators $f_j$ are treated as fixed functions. The results are viewed as being conditioned on the sample that the $f_j$ are based on.

The selection of the dictionary can be very important for making the estimation of $f$ possible. We assume implicitly that $f$ can be well approximated by a member of the span of $\mathcal{F}_M$. However, this is not enough. In this paper, we have in mind the situation where $M \gg n$, and $f$ can be estimated reasonably only because it can be approximated by a linear combination of a small number of members of $\mathcal{F}_M$ or, in other words, it has a sparse approximation in the span of $\mathcal{F}_M$. But, when sparsity is an issue, equivalent bases can have different properties. A function that has a sparse representation in one basis may not have it in another, even if both of them span the same linear space.

Consider the matrix $X = (f_j(Z_i))_{i,j}$, $i = 1, \ldots, n$, $j = 1, \ldots, M$, and the vectors $y = (Y_1, \ldots, Y_n)^T$, $\mathbf{f} = (f(Z_1), \ldots, f(Z_n))^T$, $w = (W_1, \ldots, W_n)^T$. With this notation,
$$y = \mathbf{f} + w.$$
We will write $|x|_p$ for the $\ell_p$ norm of $x \in \mathbb{R}^M$, $1 \le p \le \infty$. The notation $\|\cdot\|_n$ stands for the empirical norm
$$\|g\|_n = \sqrt{\frac{1}{n}\sum_{i=1}^n g^2(Z_i)}$$
for any $g: \mathcal{Z} \to \mathbb{R}$. We suppose that $\|f_j\|_n \ne 0$, $j = 1, \ldots, M$. Set
$$f_{\max} = \max_{1 \le j \le M}\|f_j\|_n, \qquad f_{\min} = \min_{1 \le j \le M}\|f_j\|_n.$$
For any $\beta = (\beta_1, \ldots, \beta_M) \in \mathbb{R}^M$, define $f_\beta = \sum_{j=1}^M \beta_j f_j$ or, explicitly, $f_\beta(z) = \sum_{j=1}^M \beta_j f_j(z)$, and $\mathbf{f}_\beta = X\beta$. The estimates we consider are all of the form $f_{\hat\beta}(\cdot)$, where $\hat\beta$ is data determined. Since we consider mainly sparse vectors $\hat\beta$, it will be convenient to define the following. Let
$$M(\beta) = \sum_{j=1}^M I\{\beta_j \ne 0\} = |J(\beta)|$$


denote the number of nonzero coordinates of $\beta$, where $I\{\cdot\}$ denotes the indicator function, $J(\beta) = \{j \in \{1, \ldots, M\}: \beta_j \ne 0\}$, and $|J|$ denotes the cardinality of $J$. The value $M(\beta)$ characterizes the sparsity of the vector $\beta$: the smaller $M(\beta)$, the sparser $\beta$. For a vector $\Delta \in \mathbb{R}^M$ and a subset $J \subset \{1, \ldots, M\}$, we denote by $\Delta_J$ the vector in $\mathbb{R}^M$ that has the same coordinates as $\Delta$ on $J$ and zero coordinates on the complement $J^c$ of $J$.

Introduce the residual sum of squares
$$\widehat{S}(\beta) = \frac{1}{n}\sum_{i=1}^n\{Y_i - f_\beta(Z_i)\}^2$$
for all $\beta \in \mathbb{R}^M$. Define the Lasso solution $\hat\beta_L = (\hat\beta_{1,L}, \ldots, \hat\beta_{M,L})$ by
$$\hat\beta_L = \arg\min_{\beta \in \mathbb{R}^M}\Big\{\widehat{S}(\beta) + 2r\sum_{j=1}^M \|f_j\|_n|\beta_j|\Big\}, \tag{2.1}$$
where $r > 0$ is some tuning constant, and introduce the corresponding Lasso estimator

$$\hat f_L(z) = f_{\hat\beta_L}(z) = \sum_{j=1}^M \hat\beta_{j,L}f_j(z). \tag{2.2}$$
The criterion in (2.1) is convex in $\beta$, so that standard convex optimization procedures can be used to compute $\hat\beta_L$. We refer to [9, 10, 20, 21, 24] and [16] for detailed discussion of these optimization problems and fast algorithms.

A necessary and sufficient condition of the minimizer in (2.1) is that 0 belongs to the subdifferential of the convex function $n^{-1}|y - X\beta|_2^2 + 2r|D^{1/2}\beta|_1$. This implies that the Lasso selector $\hat\beta_L$ satisfies the constraint
$$\Big|\frac{1}{n}D^{-1/2}X^T(y - X\hat\beta_L)\Big|_\infty \le r, \tag{2.3}$$
where $D$ is the diagonal matrix
$$D = \mathrm{diag}\{\|f_1\|_n^2, \ldots, \|f_M\|_n^2\}.$$
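As noted above, (2.1) is a convex program. The following sketch is not part of the paper; it is a minimal illustration of ours, assuming NumPy and scikit-learn are available, of how the weighted Lasso (2.1) can be computed by rescaling the columns of $X$ by their empirical norms $\|f_j\|_n$ so that an off-the-shelf solver for the uniformly penalized Lasso applies. The helper name `weighted_lasso` is ours, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_lasso(X, y, r):
    """Minimize (1/n)|y - X beta|_2^2 + 2 r sum_j ||f_j||_n |beta_j|  (cf. (2.1)).

    Rescaling the j-th column by ||f_j||_n turns the weighted penalty into a plain
    l1 penalty; sklearn's Lasso, whose objective is
    (1/(2n))|y - X b|_2^2 + alpha |b|_1, then solves it with alpha = r.
    """
    col_norms = np.sqrt((X ** 2).mean(axis=0))   # ||f_j||_n, assumed nonzero
    X_tilde = X / col_norms                      # column j divided by ||f_j||_n
    model = Lasso(alpha=r, fit_intercept=False, max_iter=100000)
    model.fit(X_tilde, y)
    return model.coef_ / col_norms               # undo the rescaling
```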

More generally, we will say that $\beta \in \mathbb{R}^M$ satisfies the Dantzig constraint if $\beta$ belongs to the set
$$\Big\{\beta \in \mathbb{R}^M : \Big|\frac{1}{n}D^{-1/2}X^T(y - X\beta)\Big|_\infty \le r\Big\}.$$
The Dantzig estimator of the regression function $f$ is based on a particular solution of (2.3), the Dantzig selector $\hat\beta_D$, which is defined as a vector having the smallest $\ell_1$ norm among all $\beta$ satisfying the Dantzig constraint:
$$\hat\beta_D = \arg\min\Big\{|\beta|_1 : \Big|\frac{1}{n}D^{-1/2}X^T(y - X\beta)\Big|_\infty \le r\Big\}. \tag{2.4}$$


The Dantzig estimator is defined by
$$\hat f_D(z) = f_{\hat\beta_D}(z) = \sum_{j=1}^M \hat\beta_{j,D}f_j(z), \tag{2.5}$$
where $\hat\beta_D = (\hat\beta_{1,D}, \ldots, \hat\beta_{M,D})$ is the Dantzig selector. By the definition of the Dantzig selector, we have $|\hat\beta_D|_1 \le |\hat\beta_L|_1$.

The Dantzig selector is computationally feasible, since it reduces to a linear programming problem [7].
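To make the linear-programming reduction concrete, here is a minimal sketch of ours (not from the paper), assuming NumPy and SciPy: split $\beta = u - v$ with $u, v \ge 0$, minimize $\sum_j(u_j + v_j)$, and write the constraint $|n^{-1}D^{-1/2}X^T(y - X\beta)|_\infty \le r$ as two blocks of linear inequalities. The function name `dantzig_selector` is ours.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, r):
    """Solve min |beta|_1 s.t. |(1/n) D^{-1/2} X^T (y - X beta)|_inf <= r  (cf. (2.4))."""
    n, M = X.shape
    d = np.sqrt((X ** 2).mean(axis=0))           # ||f_j||_n, diagonal of D^{1/2}
    A = (X.T @ X) / (n * d[:, None])             # rows scaled by 1/||f_j||_n
    b = (X.T @ y) / (n * d)
    # beta = u - v with u, v >= 0; |b - A beta| <= r becomes two inequality blocks.
    c = np.ones(2 * M)
    A_ub = np.block([[A, -A], [-A, A]])
    b_ub = np.concatenate([r + b, r - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:M] - res.x[M:]
```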

Finally, for any $n \ge 1$, $M \ge 2$, we consider the Gram matrix
$$\Psi_n = \frac{1}{n}X^TX = \Big(\frac{1}{n}\sum_{i=1}^n f_j(Z_i)f_{j'}(Z_i)\Big)_{1 \le j, j' \le M},$$
and let $\phi_{\max}$ denote the maximal eigenvalue of $\Psi_n$.

3. Restricted eigenvalue assumptions. We now introduce the key assumptions on the Gram matrix that are needed to guarantee nice statistical properties of the Lasso and Dantzig selectors. Under the sparsity scenario, we are typically interested in the case where $M > n$, and even $M \gg n$. Then, the matrix $\Psi_n$ is degenerate, which can be written as
$$\min_{\Delta \in \mathbb{R}^M:\,\Delta \ne 0}\frac{(\Delta^T\Psi_n\Delta)^{1/2}}{|\Delta|_2} = \min_{\Delta \in \mathbb{R}^M:\,\Delta \ne 0}\frac{|X\Delta|_2}{\sqrt{n}\,|\Delta|_2} = 0.$$
Clearly, ordinary least squares does not work in this case, since it requires positive definiteness of $\Psi_n$; that is,
$$\min_{\Delta \in \mathbb{R}^M:\,\Delta \ne 0}\frac{|X\Delta|_2}{\sqrt{n}\,|\Delta|_2} > 0. \tag{3.1}$$
It turns out that the Lasso and Dantzig selector require much weaker assumptions. The minimum in (3.1) can be replaced by the minimum over a restricted set of vectors, and the norm $|\Delta|_2$ in the denominator of the condition can be replaced by the $\ell_2$ norm of only a part of $\Delta$.
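As a quick numerical illustration of the degeneracy discussed above (a sketch of ours, not part of the paper, assuming NumPy): the smallest eigenvalue of $\Psi_n$ is exactly zero whenever $M > n$, which is why only restricted quantities over sparse directions can be bounded away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 50, 200                      # M >> n: the Gram matrix must be singular
X = rng.standard_normal((n, M))
Psi_n = X.T @ X / n                 # M x M Gram matrix of rank at most n

eigvals = np.linalg.eigvalsh(Psi_n)
print("smallest eigenvalue:", eigvals[0])            # ~0 up to rounding error
print("largest eigenvalue (phi_max):", eigvals[-1])
print("rank of Psi_n:", np.linalg.matrix_rank(Psi_n), "<= n =", n)
```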

One of the properties of both the Lasso and the Dantzig selectors is that, for the linear regression model, the residuals $\Delta = \hat\beta_L - \beta$ and $\Delta = \hat\beta_D - \beta$ satisfy, with probability close to 1,
$$|\Delta_{J_0^c}|_1 \le c_0|\Delta_{J_0}|_1, \tag{3.2}$$
where $J_0 = J(\beta)$ is the set of nonzero coefficients of the true parameter $\beta$ of the model. For the linear regression model, the vector of Dantzig residuals $\Delta$ satisfies (3.2) with probability close to 1 if $c_0 = 1$ and $M$ is large [cf. (B.9) and the fact that $\beta$ of the model satisfies the Dantzig constraint with probability close to 1 if $M$ is large]. A similar inequality holds for the vector of Lasso residuals $\Delta = \hat\beta_L - \beta$, but this time with $c_0 = 3$ [cf. Corollary B.2].

Now, for example, consider the case where the elements of the Gram matrix $\Psi_n$ are close to those of a positive definite $(M \times M)$-matrix $\Psi$. Denote by $\varepsilon_n = \max_{i,j}|(\Psi_n - \Psi)_{i,j}|$ the maximal difference between the elements of the two matrices. Then, for any $\Delta$ satisfying (3.2), we get
$$\frac{\Delta^T\Psi_n\Delta}{|\Delta|_2^2} = \frac{\Delta^T\Psi\Delta + \Delta^T(\Psi_n - \Psi)\Delta}{|\Delta|_2^2} \ge \frac{\Delta^T\Psi\Delta}{|\Delta|_2^2} - \frac{\varepsilon_n|\Delta|_1^2}{|\Delta|_2^2} \ge \frac{\Delta^T\Psi\Delta}{|\Delta|_2^2} - \varepsilon_n(1 + c_0)^2\frac{|\Delta_{J_0}|_1^2}{|\Delta_{J_0}|_2^2} \ge \frac{\Delta^T\Psi\Delta}{|\Delta|_2^2} - \varepsilon_n(1 + c_0)^2|J_0|. \tag{3.3}$$
Thus, for $\Delta$ satisfying (3.2), which are the vectors that we have in mind, and for $\varepsilon_n|J_0|$ small enough, the LHS of (3.3) is bounded away from 0. This means that we have a kind of restricted positive definiteness, which is valid only for the vectors satisfying (3.2). This suggests the following conditions, which will suffice for the main argument of the paper. We refer to these conditions as restricted eigenvalue (RE) assumptions.

ASSUMPTION RE(s, c0). For some integer $s$ such that $1 \le s \le M$ and a positive number $c_0$, the following condition holds:
$$\kappa(s, c_0) \;=\; \min_{\substack{J_0 \subseteq \{1,\ldots,M\},\; |J_0| \le s}}\;\; \min_{\substack{\Delta \ne 0,\; |\Delta_{J_0^c}|_1 \le c_0|\Delta_{J_0}|_1}} \frac{|X\Delta|_2}{\sqrt{n}\,|\Delta_{J_0}|_2} \;>\; 0.$$

The integer $s$ here plays the role of an upper bound on the sparsity $M(\beta)$ of a vector of coefficients $\beta$.
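For small problems one can probe Assumption RE(s, c0) numerically. The sketch below (ours, not from the paper; NumPy assumed, feasible only for small $M$ and $s$) enumerates the candidate supports $J_0$ of size $s$ and samples random directions from the cone $\{|\Delta_{J_0^c}|_1 \le c_0|\Delta_{J_0}|_1\}$; because it only samples, the value it returns is an upper bound on $\kappa(s, c_0)$, useful as a sanity check rather than a certificate.

```python
import itertools
import numpy as np

def re_constant_upper_bound(X, s, c0, n_draws=2000, seed=0):
    """Monte Carlo upper bound on kappa(s, c0) = min |X d|_2 / (sqrt(n) |d_J0|_2)."""
    rng = np.random.default_rng(seed)
    n, M = X.shape
    best = np.inf
    for J0 in itertools.combinations(range(M), s):    # all supports of size s
        J0 = list(J0)
        J0c = [j for j in range(M) if j not in J0]
        for _ in range(n_draws):
            d = np.zeros(M)
            d[J0] = rng.standard_normal(s)
            # random direction outside J0, scaled into the cone |d_J0c|_1 <= c0 |d_J0|_1
            u = rng.standard_normal(M - s)
            budget = c0 * np.abs(d[J0]).sum() * rng.uniform()
            d[J0c] = u / np.abs(u).sum() * budget
            ratio = np.linalg.norm(X @ d) / (np.sqrt(n) * np.linalg.norm(d[J0]))
            best = min(best, ratio)
    return best
```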

Note that, if Assumption RE(s, c0) is satisfied with $c_0 \ge 1$, then
$$\min\{|X\Delta|_2 : M(\Delta) \le 2s,\ \Delta \ne 0\} > 0.$$
In other words, the square submatrices of size $2s$ of the Gram matrix are necessarily positive definite. Indeed, suppose that, for some $\Delta \ne 0$, we have simultaneously $M(\Delta) \le 2s$ and $X\Delta = 0$. Partition $J(\Delta)$ in two sets, $J(\Delta) = I_0 \cup I_1$, such that $|I_i| \le s$, $i = 0, 1$. Without loss of generality, suppose that $|\Delta_{I_1}|_1 \le |\Delta_{I_0}|_1$. Since, clearly, $|\Delta_{I_1}|_1 = |\Delta_{I_0^c}|_1$ and $c_0 \ge 1$, we have $|\Delta_{I_0^c}|_1 \le c_0|\Delta_{I_0}|_1$. Hence, $\kappa(s, c_0) = 0$, a contradiction.


To introduce the second assumption, we need some notation. For integers $s, m$ such that $1 \le s \le M/2$, $m \ge s$ and $s + m \le M$, a vector $\Delta \in \mathbb{R}^M$ and a set of indices $J_0 \subseteq \{1, \ldots, M\}$ with $|J_0| \le s$, denote by $J_1$ the subset of $\{1, \ldots, M\}$ corresponding to the $m$ largest in absolute value coordinates of $\Delta$ outside of $J_0$, and define $J_{01} = J_0 \cup J_1$. Clearly, $J_1$ and $J_{01}$ depend on $m$, but we do not indicate this in our notation for the sake of brevity.

ASSUMPTION RE(s, m, c0).
$$\kappa(s, m, c_0) \;=\; \min_{\substack{J_0 \subseteq \{1,\ldots,M\},\; |J_0| \le s}}\;\; \min_{\substack{\Delta \ne 0,\; |\Delta_{J_0^c}|_1 \le c_0|\Delta_{J_0}|_1}} \frac{|X\Delta|_2}{\sqrt{n}\,|\Delta_{J_{01}}|_2} \;>\; 0.$$

Note that the only difference between the two assumptions is in the denominators, and $\kappa(s, m, c_0) \le \kappa(s, c_0)$. As written, for fixed $n$, the two assumptions are equivalent. However, asymptotically for large $n$, Assumption RE(s, c0) is less restrictive than RE(s, m, c0), since the ratio $\kappa(s, m, c_0)/\kappa(s, c_0)$ may tend to 0 if $s$ and $m$ depend on $n$. For our bounds on the prediction loss and on the $\ell_1$ loss of the Lasso and Dantzig estimators, we will only need Assumption RE(s, c0). Assumption RE(s, m, c0) will be required exclusively for the bounds on the $\ell_p$ loss with $1 < p \le 2$.

Note also that Assumptions RE(s', c0) and RE(s', m, c0) imply Assumptions RE(s, c0) and RE(s, m, c0), respectively, if $s' > s$.

4. Discussion of the RE assumptions. There exist several simple sufficient conditions for Assumptions RE(s, c0) and RE(s, m, c0) to hold. Here, we discuss some of them.

For a real number $1 \le u \le M$, we introduce the following quantities that we will call restricted eigenvalues:
$$\phi_{\min}(u) = \min_{x \in \mathbb{R}^M:\,1 \le M(x) \le u}\frac{x^T\Psi_n x}{|x|_2^2}, \qquad \phi_{\max}(u) = \max_{x \in \mathbb{R}^M:\,1 \le M(x) \le u}\frac{x^T\Psi_n x}{|x|_2^2}.$$
Denote by $X_J$ the $n \times |J|$ submatrix of $X$ obtained by removing from $X$ the columns that do not correspond to the indices in $J$, and, for $1 \le m_1, m_2 \le M$, introduce the following quantities called restricted correlations:
$$\theta_{m_1,m_2} = \max\Big\{\frac{c_1^T X_{I_1}^T X_{I_2} c_2}{n|c_1|_2|c_2|_2} : I_1 \cap I_2 = \varnothing,\ |I_i| \le m_i,\ c_i \in \mathbb{R}^{I_i}\setminus\{0\},\ i = 1, 2\Big\}.$$
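The restricted eigenvalues and restricted correlations are combinatorial quantities, but for a toy design they can be computed exactly by enumerating supports. The following sketch (ours, assuming NumPy; feasible only for small $M$, $u$, $m_1$, $m_2$) makes the definitions concrete.

```python
import itertools
import numpy as np

def restricted_eigs(X, u):
    """phi_min(u), phi_max(u): extreme eigenvalues of Psi_n over supports of size <= u."""
    n, M = X.shape
    Psi = X.T @ X / n
    lo, hi = np.inf, -np.inf
    for size in range(1, u + 1):
        for J in itertools.combinations(range(M), size):
            eig = np.linalg.eigvalsh(Psi[np.ix_(J, J)])
            lo, hi = min(lo, eig[0]), max(hi, eig[-1])
    return lo, hi

def restricted_correlation(X, m1, m2):
    """theta_{m1,m2}: max of c1^T X_I1^T X_I2 c2 / (n |c1|_2 |c2|_2) over disjoint I1, I2."""
    n, M = X.shape
    best = 0.0
    for I1 in itertools.combinations(range(M), m1):
        I1 = list(I1)
        rest = [j for j in range(M) if j not in I1]
        for I2 in itertools.combinations(rest, m2):
            I2 = list(I2)
            B = X[:, I1].T @ X[:, I2] / n                 # m1 x m2 cross-Gram block
            best = max(best, np.linalg.svd(B, compute_uv=False)[0])  # operator norm
    return best
```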

In Lemma 4.1, below, we show that a sufficient condition for RE(s, c0) and RE(s, s, c0) to hold is given, for example, by the following assumption on the Gram matrix.


ASSUMPTION 1. Assume that
$$\phi_{\min}(2s) > c_0\,\theta_{s,2s}$$
for some integer $1 \le s \le M/2$ and a constant $c_0 > 0$.

This condition with $c_0 = 1$ appeared in [7], in connection with the Dantzig selector. Assumption 1 is more general, in that we can have an arbitrary constant $c_0 > 0$ that will allow us to cover not only the Dantzig selector but also the Lasso estimators and to prove oracle inequalities for the prediction loss when the model is nonparametric.

Our second sufficient condition for RE(s, c0) and RE(s, m, c0) does not need bounds on correlations. Only bounds on the minimal and maximal eigenvalues of small submatrices of the Gram matrix $\Psi_n$ are involved.

ASSUMPTION 2. Assume that
$$m\,\phi_{\min}(s + m) > c_0^2\,s\,\phi_{\max}(m)$$
for some integers $s, m$ such that $1 \le s \le M/2$, $m \ge s$ and $s + m \le M$, and a constant $c_0 > 0$.

Assumption 2 can be viewed as a weakening of the condition on $\phi_{\min}$ in [18]. Indeed, taking $s + m = s\log n$ (we assume, without loss of generality, that $s\log n$ is an integer and $n > 3$) and assuming that $\phi_{\max}(\cdot)$ is uniformly bounded by a constant, we get that Assumption 2 is equivalent to
$$\phi_{\min}(s\log n) > c/\log n,$$
where $c > 0$ is a constant. The corresponding, slightly stronger, assumption in [18] is stated in asymptotic form, for $s = s_n$, as
$$\liminf_{n\to\infty}\phi_{\min}(s_n\log n) > 0.$$

The following two constants are useful when Assumptions 1 and 2 are considered:
$$\kappa_1(s, c_0) = \sqrt{\phi_{\min}(2s)}\Big(1 - \frac{c_0\,\theta_{s,2s}}{\phi_{\min}(2s)}\Big)$$
and
$$\kappa_2(s, m, c_0) = \sqrt{\phi_{\min}(s + m)}\Big(1 - c_0\sqrt{\frac{s\,\phi_{\max}(m)}{m\,\phi_{\min}(s + m)}}\Big).$$
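Since $\kappa_1$ and $\kappa_2$ are just arithmetic combinations of the restricted eigenvalues and correlations defined above, they are straightforward to evaluate once those quantities have been computed (for example, with the enumeration sketch given earlier). A small sketch of ours, taking the precomputed numbers as inputs and returning None when the corresponding assumption fails:

```python
import math

def kappa_1(phi_min_2s, theta_s_2s, c0):
    """kappa_1(s, c0); requires Assumption 1: phi_min(2s) > c0 * theta_{s,2s}."""
    if phi_min_2s <= c0 * theta_s_2s:
        return None                                   # Assumption 1 violated
    return math.sqrt(phi_min_2s) * (1.0 - c0 * theta_s_2s / phi_min_2s)

def kappa_2(phi_min_s_plus_m, phi_max_m, s, m, c0):
    """kappa_2(s, m, c0); requires Assumption 2: m*phi_min(s+m) > c0^2 * s * phi_max(m)."""
    if m * phi_min_s_plus_m <= c0 ** 2 * s * phi_max_m:
        return None                                   # Assumption 2 violated
    return math.sqrt(phi_min_s_plus_m) * (
        1.0 - c0 * math.sqrt(s * phi_max_m / (m * phi_min_s_plus_m))
    )
```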

The next lemma shows that if Assumptions 1 or 2 are satisfied, then the quadratic form $x^T\Psi_n x$ is positive definite on some restricted sets of vectors $x$. The construction of the lemma is inspired by Candes and Tao [7] and covers, in particular, the corresponding result in [7].


LEMMA 4.1. Fix an integer $1 \le s \le M/2$ and a constant $c_0 > 0$.

(i) Let Assumption 1 be satisfied. Then, Assumptions RE(s, c0) and RE(s, s, c0) hold with $\kappa(s, c_0) = \kappa(s, s, c_0) = \kappa_1(s, c_0)$. Moreover, for any subset $J_0$ of $\{1, \ldots, M\}$ with cardinality $|J_0| \le s$, and any $\Delta \in \mathbb{R}^M$ such that
$$|\Delta_{J_0^c}|_1 \le c_0|\Delta_{J_0}|_1, \tag{4.1}$$
we have
$$\frac{1}{\sqrt{n}}|P_{01}X\Delta|_2 \ge \kappa_1(s, c_0)|\Delta_{J_{01}}|_2,$$
where $P_{01}$ is the projector in $\mathbb{R}^n$ on the linear span of the columns of $X_{J_{01}}$.

(ii) Let Assumption 2 be satisfied. Then, Assumptions RE(s, c0) and RE(s, m, c0) hold with $\kappa(s, c_0) = \kappa(s, m, c_0) = \kappa_2(s, m, c_0)$. Moreover, for any subset $J_0$ of $\{1, \ldots, M\}$ with cardinality $|J_0| \le s$, and any $\Delta \in \mathbb{R}^M$ such that (4.1) holds, we have
$$\frac{1}{\sqrt{n}}|P_{01}X\Delta|_2 \ge \kappa_2(s, m, c_0)|\Delta_{J_{01}}|_2.$$

The proof of the lemma is given in Appendix A. There exist other sufficient conditions for Assumptions RE(s, c0) and RE(s, m, c0) to hold. We mention here three of them implying Assumption RE(s, c0). The first one is the following [1].

ASSUMPTION 3. For an integer $s$ such that $1 \le s \le M$, we have
$$\phi_{\min}(s) > 2c_0\,\theta_{s,1}\sqrt{s},$$
where $c_0 > 0$ is a constant.

To argue that Assumption 3 implies RE(s, c0), it suffices to remark that
$$\frac{1}{n}|X\Delta|_2^2 \ge \frac{1}{n}\Delta_{J_0}^T X^T X\Delta_{J_0} - \frac{2}{n}\big|\Delta_{J_0}^T X^T X\Delta_{J_0^c}\big| \ge \phi_{\min}(s)|\Delta_{J_0}|_2^2 - \frac{2}{n}\big|\Delta_{J_0}^T X^T X\Delta_{J_0^c}\big|$$
and, if (4.1) holds,
$$\big|\Delta_{J_0}^T X^T X\Delta_{J_0^c}\big|/n \le |\Delta_{J_0^c}|_1\max_{j\in J_0^c}\big|\Delta_{J_0}^T X^T x_{(j)}\big|/n \le \theta_{s,1}|\Delta_{J_0^c}|_1|\Delta_{J_0}|_2 \le c_0\theta_{s,1}\sqrt{s}\,|\Delta_{J_0}|_2^2.$$

Another type of assumption related to mutual coherence [8] is discussed in connection to Lasso in [4, 5]. We state it in two different forms, which are given below.


ASSUMPTION 4. For an integer $s$ such that $1 \le s \le M$, we have
$$\phi_{\min}(s) > 2c_0\,\theta_{1,1}\,s,$$
where $c_0 > 0$ is a constant.

It is easy to see that Assumption 4 implies RE(s, c0). Indeed, if (4.1) holds,
$$\frac{1}{n}|X\Delta|_2^2 \ge \frac{1}{n}\Delta_{J_0}^T X^T X\Delta_{J_0} - 2\theta_{1,1}|\Delta_{J_0^c}|_1|\Delta_{J_0}|_1 \ge \phi_{\min}(s)|\Delta_{J_0}|_2^2 - 2c_0\theta_{1,1}|\Delta_{J_0}|_1^2 \ge \big(\phi_{\min}(s) - 2c_0\theta_{1,1}s\big)|\Delta_{J_0}|_2^2. \tag{4.2}$$

If all the diagonal elements of the matrix $X^TX/n$ are equal to 1 (and thus $\theta_{1,1}$ coincides with the mutual coherence [8]), then a simple sufficient condition for Assumption RE(s, c0) to hold is stated as follows.

ASSUMPTION 5. All the diagonal elements of the Gram matrix $\Psi_n$ are equal to 1, and for an integer $s$ such that $1 \le s \le M$, we have
$$\theta_{1,1} < \frac{1}{(1 + 2c_0)s}, \tag{4.3}$$
where $c_0 > 0$ is a constant.

In fact, separating the diagonal and off-diagonal terms of the quadratic form, we get
$$\Delta_{J_0}^T X^T X\Delta_{J_0}/n \ge |\Delta_{J_0}|_2^2 - \theta_{1,1}|\Delta_{J_0}|_1^2 \ge |\Delta_{J_0}|_2^2(1 - \theta_{1,1}s).$$
Combining this inequality with (4.2), we see that Assumption RE(s, c0) is satisfied whenever (4.3) holds.
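When the columns are normalized so that $\mathrm{diag}(\Psi_n) = 1$, the quantity $\theta_{1,1}$ is simply the largest off-diagonal entry of $\Psi_n$ in absolute value, so Assumption 5 is cheap to check numerically. A small sketch of ours (NumPy assumed):

```python
import numpy as np

def mutual_coherence(X):
    """theta_{1,1} for a design with unit empirical column norms: max |(Psi_n)_{jk}|, j != k."""
    n = X.shape[0]
    Psi = X.T @ X / n
    off_diag = Psi - np.diag(np.diag(Psi))
    return np.abs(off_diag).max()

def assumption_5_holds(X, s, c0):
    """Check theta_{1,1} < 1 / ((1 + 2*c0) * s)  (cf. (4.3))."""
    return mutual_coherence(X) < 1.0 / ((1.0 + 2.0 * c0) * s)
```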

Unfortunately, Assumption RE(s, c0) has some weakness. Let, for example, $f_j$, $j = 1, \ldots, 2^m - 1$, be the Haar wavelet basis on $[0, 1]$ ($M = 2^m$), and consider $Z_i = i/n$, $i = 1, \ldots, n$. If $M \gg n$, then it is clear that $\phi_{\min}(1) = 0$, since there are functions $f_j$ on the highest resolution level whose supports (of length $M^{-1}$) contain no points $Z_i$. So, none of Assumptions 1–4 hold. A less severe, although similar, situation arises when we consider step functions $f_j(t) = I\{t < j/M\}$.


5. Approximate equivalence. In this section, we prove a type of approximate equivalence between the Lasso and the Dantzig selector. It is expressed as closeness of the prediction losses $\|\hat f_D - f\|_n^2$ and $\|\hat f_L - f\|_n^2$ when the number of nonzero components of the Lasso or the Dantzig selector is small as compared to the sample size.

THEOREM 5.1. Let $W_i$ be independent $N(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Fix $n \ge 1$, $M \ge 2$. Let Assumption RE(s, 1) be satisfied with $1 \le s \le M$. Consider the Dantzig estimator $\hat f_D$ defined by (2.5)–(2.4) with
$$r = A\sigma\sqrt{\frac{\log M}{n}},$$
where $A > 2\sqrt{2}$, and consider the Lasso estimator $\hat f_L$ defined by (2.1)–(2.2) with the same $r$.

If $M(\hat\beta_L) \le s$, then, with probability at least $1 - M^{1 - A^2/8}$, we have
$$\big|\,\|\hat f_D - f\|_n^2 - \|\hat f_L - f\|_n^2\big| \le \frac{16A^2 M(\hat\beta_L)\sigma^2}{n}\,\frac{f_{\max}^2}{\kappa^2(s, 1)}\,\log M. \tag{5.1}$$

Note that the RHS of (5.1) is bounded by a product of three factors (and a numerical constant which, unfortunately, equals at least 128). The first factor, $M(\hat\beta_L)\sigma^2/n \le s\sigma^2/n$, corresponds to the error rate for prediction in regression with $s$ parameters. The two other factors, $\log M$ and $f_{\max}^2/\kappa^2(s, 1)$, can be regarded as a price to pay for the large number of regressors. If the Gram matrix $\Psi_n$ equals the identity matrix (the white noise model), then there is only the $\log M$ factor. In the general case, there is another factor, $f_{\max}^2/\kappa^2(s, 1)$, representing the extent to which the Gram matrix is ill-posed for estimation of sparse vectors.

We also have the following result that we state, for simplicity, under the assumption that $\|f_j\|_n = 1$, $j = 1, \ldots, M$. It gives a bound in the spirit of Theorem 5.1, but with $M(\hat\beta_D)$ rather than $M(\hat\beta_L)$ on the right-hand side.

THEOREM 5.2. Let the assumptions of Theorem 5.1 hold, but with RE(s, 5) in place of RE(s, 1), and let $\|f_j\|_n = 1$, $j = 1, \ldots, M$. If $M(\hat\beta_D) \le s$, then, with probability at least $1 - M^{1 - A^2/8}$, we have
$$\|\hat f_L - f\|_n^2 \le 10\|\hat f_D - f\|_n^2 + \frac{81 A^2 M(\hat\beta_D)\sigma^2}{n}\,\frac{\log M}{\kappa^2(s, 5)}. \tag{5.2}$$

REMARK. The approximate equivalence is essentially that of the rates, as Theorem 5.1 exhibits. A statement free of $M(\hat\beta)$ holds for linear regression; see the discussion after Theorems 7.2 and 7.3 below.


6. Oracle inequalities for prediction loss. Here, we prove sparsity oracle inequalities for the prediction loss of the Lasso and Dantzig estimators. These inequalities allow us to bound the difference between the prediction errors of the estimators and the best sparse approximation of the regression function (by an oracle that knows the truth but is constrained by sparsity). The results of this section, together with those of Section 5, show that the distance between the prediction losses of the Dantzig and Lasso estimators is of the same order as the distances between them and their oracle approximations.

A general discussion of sparsity oracle inequalities can be found in [23]. Such inequalities have been recently obtained for the Lasso type estimators in a number of settings [26, 14] and [25]. In particular, the regression model with fixed design that we study here is considered in [2–4]. The assumptions on the Gram matrix $\Psi_n$ in [2–4] are more restrictive than ours. In those papers, either $\Psi_n$ is positive definite, or a mutual coherence condition similar to (4.3) is imposed.

THEOREM 6.1. Let $W_i$ be independent $N(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Fix some $\varepsilon > 0$ and integers $n \ge 1$, $M \ge 2$, $1 \le s \le M$. Let Assumption RE$(s, 3 + 4/\varepsilon)$ be satisfied. Consider the Lasso estimator $\hat f_L$ defined by (2.1)–(2.2) with
$$r = A\sigma\sqrt{\frac{\log M}{n}}$$
for some $A > 2\sqrt{2}$. Then, with probability at least $1 - M^{1 - A^2/8}$, we have
$$\|\hat f_L - f\|_n^2 \le (1 + \varepsilon)\inf_{\beta\in\mathbb{R}^M:\,M(\beta)\le s}\Big\{\|f_\beta - f\|_n^2 + \frac{C(\varepsilon)f_{\max}^2 A^2\sigma^2}{\kappa^2(s, 3 + 4/\varepsilon)}\,\frac{M(\beta)\log M}{n}\Big\}, \tag{6.1}$$
where $C(\varepsilon) > 0$ is a constant depending only on $\varepsilon$.

We now state, as a corollary, a softer version of Theorem 6.1 that can be used to eliminate the pathologies mentioned at the end of Section 4. For this purpose, we define
$$\mathcal{J}_{s,\gamma,c_0} = \Big\{J_0 \subseteq \{1, \ldots, M\} : |J_0| \le s \text{ and } \min_{\substack{\Delta\ne 0,\; |\Delta_{J_0^c}|_1\le c_0|\Delta_{J_0}|_1}}\frac{|X\Delta|_2}{\sqrt{n}\,|\Delta_{J_0}|_2} \ge \gamma\Big\},$$
where $\gamma > 0$ is a constant, and set
$$\Gamma_{s,\gamma,c_0} = \{\beta : J(\beta) \in \mathcal{J}_{s,\gamma,c_0}\}.$$
In a similar way, we define $\mathcal{J}_{s,\gamma,m,c_0}$ and $\Gamma_{s,\gamma,m,c_0}$ corresponding to Assumption RE(s, m, c0).


COROLLARY 6.2. Let $W_i$, $s$ and the Lasso estimator $\hat f_L$ be the same as in Theorem 6.1. Then, for all $n \ge 1$, $\varepsilon > 0$ and $\gamma > 0$, with probability at least $1 - M^{1 - A^2/8}$, we have
$$\|\hat f_L - f\|_n^2 \le (1 + \varepsilon)\inf_{\beta\in\Gamma_{s,\gamma,\varepsilon}}\Big\{\|f_\beta - f\|_n^2 + \frac{C(\varepsilon)f_{\max}^2 A^2\sigma^2}{\gamma^2}\,\frac{M(\beta)\log M}{n}\Big\},$$
where $\Gamma_{s,\gamma,\varepsilon} = \{\beta \in \Gamma_{s,\gamma,3+4/\varepsilon} : M(\beta) \le s\}$.

To obtain this corollary, it suffices to observe that the proof of Theorem 6.1 goes through if we drop Assumption RE$(s, 3 + 4/\varepsilon)$, but we assume instead that $\beta \in \Gamma_{s,\gamma,3+4/\varepsilon}$, and we replace $\kappa(s, 3 + 4/\varepsilon)$ by $\gamma$.

We would like now to get a sparsity oracle inequality similar to that of Theorem 6.1 for the Dantzig estimator $\hat f_D$. We will need a mild additional assumption on $f$. This is due to the fact that not every $\beta \in \mathbb{R}^M$ obeys the Dantzig constraint; thus, we cannot assure the key relation (B.9) for all $\beta \in \mathbb{R}^M$. One possibility would be to prove an inequality as (6.1), where the infimum on the right-hand side is taken over $\beta$ satisfying not only $M(\beta) \le s$ but also the Dantzig constraint. However, this seems not to be very intuitive, since we cannot guarantee that the corresponding $f_\beta$ gives a good approximation of the unknown function $f$. Therefore, we choose another approach (cf. [5]), in which we consider $f$ satisfying the weak sparsity property relative to the dictionary $f_1, \ldots, f_M$. That is, we assume that there exist an integer $s$ and a constant $C_0 < \infty$ such that the set
$$\big\{\beta \in \mathbb{R}^M : M(\beta) \le s,\ \|f_\beta - f\|_n^2 \le C_0 f_{\max}^2\,r^2\,\kappa^{-2}(s, 3 + 4/\varepsilon)\,M(\beta)\big\} \tag{6.2}$$
is nonempty. The second inequality in (6.2) says that the bias term $\|f_\beta - f\|_n^2$ cannot be much larger than the variance term $f_{\max}^2 r^2\kappa^{-2}M(\beta)$ [cf. (6.1)]. Weak sparsity is milder than the sparsity property in the usual sense. The latter means that $f$ admits the exact representation $f = f_\beta$ for some $\beta \in \mathbb{R}^M$, with hopefully small $M(\beta) = s$.

PROPOSITION 6.3. Let $W_i$ be independent $N(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Fix some $\varepsilon > 0$ and integers $n \ge 1$, $M \ge 2$. Let $f$ obey the weak sparsity assumption for some $C_0 < \infty$ and some $s$ such that $1 \le s\max\{C_1(\varepsilon), 1\} \le M$, where
$$C_1(\varepsilon) = \frac{4[(1 + \varepsilon)C_0 + C(\varepsilon)]\,\phi_{\max}f_{\max}^2}{\kappa^2 f_{\min}^2}$$
and $C(\varepsilon)$ is the constant in Theorem 6.1. Suppose, further, that Assumption RE$(s\max\{C_1(\varepsilon), 1\}, 3 + 4/\varepsilon)$ is satisfied. Consider the Dantzig estimator $\hat f_D$ defined by (2.5)–(2.4) with
$$r = A\sigma\sqrt{\frac{\log M}{n}}$$


and $A > 2\sqrt{2}$. Then, with probability at least $1 - M^{1 - A^2/8}$, we have
$$\|\hat f_D - f\|_n^2 \le (1 + \varepsilon)\inf_{\beta\in\mathbb{R}^M:\,M(\beta)=s}\|f_\beta - f\|_n^2 + \frac{C_2(\varepsilon)f_{\max}^2 A^2\sigma^2}{\kappa_0^2}\,\frac{s\log M}{n}. \tag{6.3}$$
Here, $C_2(\varepsilon) = 16C_1(\varepsilon) + C(\varepsilon)$ and $\kappa_0 = \kappa(\max(C_1(\varepsilon), 1)s, 3 + 4/\varepsilon)$.

Note that the sparsity oracle inequality (6.3) is slightly weaker than the analogous inequality (6.1) for the Lasso. Here, we have $\inf_{\beta\in\mathbb{R}^M:\,M(\beta)=s}$ instead of $\inf_{\beta\in\mathbb{R}^M:\,M(\beta)\le s}$ in (6.1).

7. Special case. Parametric estimation in linear regression. In this section, we assume that the vector of observations $y = (Y_1, \ldots, Y_n)^T$ is of the form
$$y = X\beta + w, \tag{7.1}$$
where $X$ is an $n \times M$ deterministic matrix, $\beta \in \mathbb{R}^M$ and $w = (W_1, \ldots, W_n)^T$. We consider dimension $M$ that can be of order $n$ and even much larger. Then, $\beta$ is, in general, not uniquely defined. For $M > n$, if (7.1) is satisfied for $\beta = \beta_0$, then there exists an affine space $\mathcal{U} = \{\beta : X\beta = X\beta_0\}$ of vectors satisfying (7.1). The results of this section are valid for any $\beta$ such that (7.1) holds. However, we will suppose that Assumption RE(s, c0) holds with $c_0 \ge 1$ and that $M(\beta) \le s$. Then, the set $\mathcal{U} \cap \{\beta : M(\beta) \le s\}$ reduces to a single element (cf. Remark 2 at the end of this section). In this sense, there is a unique sparse solution of (7.1).

Our goal in this section, unlike that of the previous ones, is to estimate both $X\beta$ for the purpose of prediction and $\beta$ itself for the purpose of model selection. We will see that meaningful results are obtained when the sparsity index $M(\beta)$ is small.

It will be assumed throughout this section that the diagonal elements of the Gram matrix $\Psi_n = X^TX/n$ are all equal to 1 (this is equivalent to the condition $\|f_j\|_n = 1$, $j = 1, \ldots, M$, in the notation of the previous sections). Then, the Lasso estimator of $\beta$ in (7.1) is defined by
$$\hat\beta_L = \arg\min_{\beta\in\mathbb{R}^M}\Big\{\frac{1}{n}|y - X\beta|_2^2 + 2r|\beta|_1\Big\}. \tag{7.2}$$
The correspondence between the notation here and that of the previous sections is
$$\|f_{\beta'}\|_n^2 = |X\beta'|_2^2/n, \qquad \|f_{\beta'} - f\|_n^2 = |X(\beta' - \beta)|_2^2/n, \qquad \|\hat f_L - f\|_n^2 = |X(\hat\beta_L - \beta)|_2^2/n.$$
The Dantzig selector for the linear model (7.1) is defined by
$$\hat\beta_D = \arg\min_{\beta\in\Lambda}|\beta|_1, \tag{7.3}$$


where
$$\Lambda = \Big\{\beta \in \mathbb{R}^M : \Big|\frac{1}{n}X^T(y - X\beta)\Big|_\infty \le r\Big\}$$
is the set of all $\beta$ satisfying the Dantzig constraint.

We first get bounds on the rate of convergence of the Dantzig selector.
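Before stating the results, here is a small end-to-end simulation sketch (ours, not from the paper; NumPy and scikit-learn assumed) for the setting of this section: columns of $X$ normalized so that $\mathrm{diag}(X^TX/n) = 1$, a sparse $\beta$ with $M(\beta) = s$, noise level $\sigma$, and the tuning value $r = A\sigma\sqrt{\log M/n}$. It computes the Lasso estimator (7.2) and reports the quantities bounded in Theorem 7.2; the Dantzig selector (7.3) could be computed with the linear-programming sketch given after (2.5).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, M, s, sigma, A = 100, 500, 5, 1.0, 3.0

# Design with unit empirical column norms, i.e. diag(X^T X / n) = 1.
X = rng.standard_normal((n, M))
X /= np.sqrt((X ** 2).mean(axis=0))

beta = np.zeros(M)
beta[rng.choice(M, size=s, replace=False)] = rng.uniform(1.0, 2.0, size=s)
y = X @ beta + sigma * rng.standard_normal(n)

r = A * sigma * np.sqrt(np.log(M) / n)          # tuning value used in Theorems 7.1-7.2

# Lasso (7.2): (1/n)|y - X b|_2^2 + 2 r |b|_1, i.e. sklearn's objective with alpha = r.
beta_L = Lasso(alpha=r, fit_intercept=False, max_iter=100000).fit(X, y).coef_

delta = beta_L - beta
print("l1 loss |beta_L - beta|_1      :", np.abs(delta).sum())        # cf. (7.7)
print("|X(beta_L - beta)|_2^2         :", np.sum((X @ delta) ** 2))   # cf. (7.8)
print("sparsity M(beta_L)             :", np.count_nonzero(beta_L))   # cf. (7.9)
```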

THEOREM 7.1. Let $W_i$ be independent $N(0, \sigma^2)$ random variables with $\sigma^2 > 0$, let all the diagonal elements of the matrix $X^TX/n$ be equal to 1, and let $M(\beta) \le s$, where $1 \le s \le M$, $n \ge 1$, $M \ge 2$. Let Assumption RE(s, 1) be satisfied. Consider the Dantzig selector $\hat\beta_D$ defined by (7.3) with
$$r = A\sigma\sqrt{\frac{\log M}{n}}$$
and $A > \sqrt{2}$. Then, with probability at least $1 - M^{1 - A^2/2}$, we have
$$|\hat\beta_D - \beta|_1 \le \frac{8A}{\kappa^2(s, 1)}\,\sigma s\sqrt{\frac{\log M}{n}}, \tag{7.4}$$
$$|X(\hat\beta_D - \beta)|_2^2 \le \frac{16A^2}{\kappa^2(s, 1)}\,\sigma^2 s\log M. \tag{7.5}$$
If Assumption RE(s, m, 1) is satisfied, then, with the same probability as above, simultaneously for all $1 < p \le 2$, we have
$$|\hat\beta_D - \beta|_p^p \le 2^{p-1}\,8\Big\{1 + \sqrt{\frac{s}{m}}\Big\}^{2(p-1)} s\left(\frac{A\sigma}{\kappa^2(s, m, 1)}\sqrt{\frac{\log M}{n}}\right)^p. \tag{7.6}$$

Note that, since $s \le m$, the factor in curly brackets in (7.6) is bounded by a constant independent of $s$ and $m$. Under Assumption 1 in Section 4, with $c_0 = 1$ [which is less general than RE(s, s, 1), cf. Lemma 4.1(i)], a bound of the form (7.6) for the case $p = 2$ is established by Candes and Tao [7].

Bounds on the rate of convergence of the Lasso selector are quite similar to those obtained in Theorem 7.1. They are given by the following result.

THEOREM 7.2. Let $W_i$ be independent $N(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Let all the diagonal elements of the matrix $X^TX/n$ be equal to 1, and let $M(\beta) \le s$, where $1 \le s \le M$, $n \ge 1$, $M \ge 2$. Let Assumption RE(s, 3) be satisfied. Consider the Lasso estimator $\hat\beta_L$ defined by (7.2) with
$$r = A\sigma\sqrt{\frac{\log M}{n}}$$


and $A > 2\sqrt{2}$. Then, with probability at least $1 - M^{1 - A^2/8}$, we have
$$|\hat\beta_L - \beta|_1 \le \frac{16A}{\kappa^2(s, 3)}\,\sigma s\sqrt{\frac{\log M}{n}}, \tag{7.7}$$
$$|X(\hat\beta_L - \beta)|_2^2 \le \frac{16A^2}{\kappa^2(s, 3)}\,\sigma^2 s\log M, \tag{7.8}$$
$$M(\hat\beta_L) \le \frac{64\phi_{\max}}{\kappa^2(s, 3)}\,s. \tag{7.9}$$
If Assumption RE(s, m, 3) is satisfied, then, with the same probability as above, simultaneously for all $1 < p \le 2$, we have
$$|\hat\beta_L - \beta|_p^p \le 16\Big\{1 + 3\sqrt{\frac{s}{m}}\Big\}^{2(p-1)} s\left(\frac{A\sigma}{\kappa^2(s, m, 3)}\sqrt{\frac{\log M}{n}}\right)^p. \tag{7.10}$$

Inequalities of a form similar to (7.7) and (7.8) can be deduced from the results of [3] under more restrictive conditions on the Gram matrix (the mutual coherence assumption, cf. Assumption 5 of Section 4).

Assumptions RE(s, 1) and RE(s, 3), respectively, can be dropped in Theorems 7.1 and 7.2 if we assume $\beta \in \Gamma_{s,\gamma,c_0}$ with $c_0 = 1$ or $c_0 = 3$ as appropriate. Then, (7.4) and (7.5) or, respectively, (7.7) and (7.8) hold with $\kappa = \gamma$. This is analogous to Corollary 6.2. Similarly, (7.6) and (7.10) hold with $\kappa = \gamma$ if $\beta \in \Gamma_{s,\gamma,m,c_0}$ with $c_0 = 1$ or $c_0 = 3$ as appropriate.

Observe that, combining Theorems 7.1 and 7.2, we can immediately get bounds for the differences between the Lasso and Dantzig selector, $|\hat\beta_L - \hat\beta_D|_p^p$ and $|X(\hat\beta_L - \hat\beta_D)|_2^2$. Such bounds have the same form as those of Theorems 7.1 and 7.2, up to numerical constants. Another way of estimating these differences follows directly from the proof of Theorem 7.1. It suffices to observe that the only property of $\beta$ used in that proof is the fact that $\beta$ satisfies the Dantzig constraint on the event of given probability, which is also true for the Lasso solution $\hat\beta_L$. So, we can replace $\beta$ by $\hat\beta_L$ and $s$ by $M(\hat\beta_L)$ everywhere in Theorem 7.1. Generalizing a bit more, we easily derive the following fact.

THEOREM 7.3. The result of Theorem 7.1 remains valid if we replace $|\hat\beta_D - \beta|_p^p$ by $\sup\{|\hat\beta_D - \beta|_p^p : \beta \in \Lambda,\ M(\beta) \le s\}$ for $1 \le p \le 2$ and $|X(\hat\beta_D - \beta)|_2^2$ by $\sup\{|X(\hat\beta_D - \beta)|_2^2 : \beta \in \Lambda,\ M(\beta) \le s\}$, respectively. Here, $\Lambda$ is the set of all vectors satisfying the Dantzig constraint.

REMARKS.

1. Theorems 7.1 and 7.2 only give nonasymptotic upper bounds on the loss, with some probability and under some conditions. The probability depends on $M$ and the conditions depend on $n$ and $M$. Recall that Assumptions RE(s, c0) and RE(s, m, c0) are imposed on the $n \times M$ matrix $X$. To deduce asymptotic convergence (as $n \to \infty$ and/or as $M \to \infty$) from Theorems 7.1 and 7.2, we would need some very strong additional properties, such as simultaneous validity of Assumption RE(s, c0) or RE(s, m, c0) (with one and the same constant $\kappa$) for infinitely many $n$ and $M$.

2. Note that neither Assumption RE(s, c0) nor RE(s, m, c0) implies identifiability of $\beta$ in the linear model (7.1). However, the vector $\beta$ appearing in the statements of Theorems 7.1 and 7.2 is uniquely defined, because we additionally suppose that $M(\beta) \le s$ and $c_0 \ge 1$. Indeed, if there exists a $\beta'$ such that $X\beta' = X\beta$ and $M(\beta') \le s$, then, in view of Assumption RE(s, c0) with $c_0 \ge 1$, we necessarily have $\beta' = \beta$ [cf. the discussion following the definition of RE(s, c0)]. On the other hand, Theorem 7.3 applies to certain values of $\beta$ that do not come from the model (7.1) at all.

3. For the smallest value of $A$ (which is $A = 2\sqrt{2}$), the constants in the bound of Theorem 7.2 for the Lasso are larger than the corresponding numerical constants for the Dantzig selector given in Theorem 7.1, again, for the smallest admissible value $A = \sqrt{2}$. On the contrary, the Dantzig selector has certain defects as compared to the Lasso when the model is nonparametric, as discussed in Section 6. In particular, to obtain sparsity oracle inequalities for the Dantzig selector, we need some restrictions on $f$, for example, the weak sparsity property. On the other hand, the sparsity oracle inequality (6.1) for the Lasso is valid with no restriction on $f$.

4. The proofs of Theorems 7.1 and 7.2 differ mainly in the value of the tuning constant, which is $c_0 = 1$ in Theorem 7.1 and $c_0 = 3$ in Theorem 7.2. Note that, since the Lasso solution satisfies the Dantzig constraint, we could have obtained a result similar to Theorem 7.2, but with less accurate numerical constants, by simply conducting the proof of Theorem 7.1 with $c_0 = 3$. However, we act differently, and we deduce (B.30) directly from (B.1) and not from (B.25). This is done only for the sake of improving the constants. In fact, using (B.25) with $c_0 = 3$ would yield (B.30) with the doubled constant on the right-hand side.

5. For the Dantzig selector in the linear regression model and under Assumptions 1 or 2, some further improvement of the constants in the $\ell_p$ bounds for the coefficients can be achieved by applying the general version of Lemma 4.1 with the projector $P_{01}$ inside. We do not pursue this issue here.

6. All of our results are stated with probabilities at least $1 - M^{1 - A^2/2}$ or $1 - M^{1 - A^2/8}$. These are reasonable (but not the most accurate) lower bounds on the probabilities $\mathbb{P}(\mathcal{B})$ and $\mathbb{P}(\mathcal{A})$, respectively. We have chosen them for readability. Inspection of (B.4) shows that they can be refined to $1 - 2M\Phi(-A\sqrt{\log M})$ and $1 - 2M\Phi(-A\sqrt{\log M}/2)$, respectively, where $\Phi(\cdot)$ is the standard normal c.d.f.

APPENDIX A

PROOF OF LEMMA 4.1. Consider a partition of $J_0^c$ into subsets of size $m$, with the last subset of size $\le m$: $J_0^c = \bigcup_{k=1}^K J_k$, where $K \ge 1$, $|J_k| = m$ for $k = 1, \ldots, K - 1$ and $|J_K| \le m$, such that $J_k$ is the set of indices corresponding to the $m$ largest in absolute value coordinates of $\Delta$ outside $\bigcup_{j=1}^{k-1}J_j$ (for $k < K$) and $J_K$ is the remaining subset. We have

$$|P_{01}X\Delta|_2 \ge |P_{01}X\Delta_{J_{01}}|_2 - \Big|\sum_{k=2}^{K}P_{01}X\Delta_{J_k}\Big|_2 = |X\Delta_{J_{01}}|_2 - \Big|\sum_{k=2}^{K}P_{01}X\Delta_{J_k}\Big|_2 \ge |X\Delta_{J_{01}}|_2 - \sum_{k=2}^{K}|P_{01}X\Delta_{J_k}|_2. \tag{A.1}$$
We will prove first part (ii) of the lemma. Since for $k \ge 1$ the vector $\Delta_{J_k}$ has only $m$ nonzero components, we obtain
$$\frac{1}{\sqrt{n}}|P_{01}X\Delta_{J_k}|_2 \le \frac{1}{\sqrt{n}}|X\Delta_{J_k}|_2 \le \sqrt{\phi_{\max}(m)}\,|\Delta_{J_k}|_2. \tag{A.2}$$
Next, as in [7], we observe that $|\Delta_{J_{k+1}}|_2 \le |\Delta_{J_k}|_1/\sqrt{m}$, $k = 1, \ldots, K - 1$. Therefore,
$$\sum_{k=2}^{K}|\Delta_{J_k}|_2 \le \frac{|\Delta_{J_0^c}|_1}{\sqrt{m}} \le \frac{c_0|\Delta_{J_0}|_1}{\sqrt{m}} \le c_0\sqrt{\frac{s}{m}}\,|\Delta_{J_0}|_2 \le c_0\sqrt{\frac{s}{m}}\,|\Delta_{J_{01}}|_2, \tag{A.3}$$
where we used (4.1). From (A.1)–(A.3), we find
$$\frac{1}{\sqrt{n}}|P_{01}X\Delta|_2 \ge \frac{1}{\sqrt{n}}|X\Delta_{J_{01}}|_2 - c_0\sqrt{\phi_{\max}(m)}\sqrt{\frac{s}{m}}\,|\Delta_{J_{01}}|_2 \ge \Big\{\sqrt{\phi_{\min}(s + m)} - c_0\sqrt{\phi_{\max}(m)}\sqrt{\frac{s}{m}}\Big\}|\Delta_{J_{01}}|_2,$$
which proves part (ii) of the lemma.

The proof of part (i) is analogous. The only difference is that we replace, in the above argument, $m$ by $s$, and instead of (A.2), we use the bound (cf. [7])
$$\frac{1}{\sqrt{n}}|P_{01}X\Delta_{J_k}|_2 \le \frac{\theta_{s,2s}}{\sqrt{\phi_{\min}(2s)}}\,|\Delta_{J_k}|_2.$$

APPENDIX B: TWO LEMMAS AND THE PROOFS OF THE RESULTS

LEMMA B.1. Fix $M \ge 2$ and $n \ge 1$. Let $W_i$ be independent $N(0, \sigma^2)$ random variables with $\sigma^2 > 0$, and let $\hat f_L$ be the Lasso estimator defined by (2.2) with
$$r = A\sigma\sqrt{\frac{\log M}{n}}$$


for some $A > 2\sqrt{2}$. Then, with probability at least $1 - M^{1 - A^2/8}$, we have, simultaneously for all $\beta \in \mathbb{R}^M$,
$$\|\hat f_L - f\|_n^2 + r\sum_{j=1}^M\|f_j\|_n|\hat\beta_{j,L} - \beta_j| \le \|f_\beta - f\|_n^2 + 4r\sum_{j\in J(\beta)}\|f_j\|_n|\hat\beta_{j,L} - \beta_j| \le \|f_\beta - f\|_n^2 + 4r\sqrt{M(\beta)}\sqrt{\sum_{j\in J(\beta)}\|f_j\|_n^2|\hat\beta_{j,L} - \beta_j|^2}, \tag{B.1}$$
and
$$\Big|\frac{1}{n}X^T(\mathbf{f} - X\hat\beta_L)\Big|_\infty \le 3rf_{\max}/2. \tag{B.2}$$
Furthermore, with the same probability,
$$M(\hat\beta_L) \le 4\phi_{\max}f_{\min}^{-2}\big(\|\hat f_L - f\|_n^2/r^2\big), \tag{B.3}$$
where $\phi_{\max}$ denotes the maximal eigenvalue of the matrix $X^TX/n$.

PROOF OF LEMMA B.1. The result (B.1) is essentially Lemma 1 from [5]. For completeness, we give its proof. Set $r_{n,j} = r\|f_j\|_n$. By definition,
$$\widehat{S}(\hat\beta_L) + 2\sum_{j=1}^M r_{n,j}|\hat\beta_{j,L}| \le \widehat{S}(\beta) + 2\sum_{j=1}^M r_{n,j}|\beta_j|$$
for all $\beta \in \mathbb{R}^M$, which is equivalent to
$$\|\hat f_L - f\|_n^2 + 2\sum_{j=1}^M r_{n,j}|\hat\beta_{j,L}| \le \|f_\beta - f\|_n^2 + 2\sum_{j=1}^M r_{n,j}|\beta_j| + \frac{2}{n}\sum_{i=1}^n W_i(\hat f_L - f_\beta)(Z_i).$$
Define the random variables $V_j = n^{-1}\sum_{i=1}^n f_j(Z_i)W_i$, $1 \le j \le M$, and the event
$$\mathcal{A} = \bigcap_{j=1}^M\{2|V_j| \le r_{n,j}\}.$$
Using an elementary bound on the tails of the Gaussian distribution, we find that the probability of the complementary event $\mathcal{A}^c$ satisfies
$$\mathbb{P}\{\mathcal{A}^c\} \le \sum_{j=1}^M\mathbb{P}\big\{\sqrt{n}|V_j| > \sqrt{n}\,r_{n,j}/2\big\} \le M\,\mathbb{P}\big\{|\eta| \ge \sqrt{n}\,r/(2\sigma)\big\} \le M\exp\Big(-\frac{nr^2}{8\sigma^2}\Big) = M\exp\Big(-\frac{A^2\log M}{8}\Big) = M^{1 - A^2/8}, \tag{B.4}$$


where $\eta \sim N(0, 1)$. On the event $\mathcal{A}$, we have
$$\|\hat f_L - f\|_n^2 \le \|f_\beta - f\|_n^2 + \sum_{j=1}^M r_{n,j}|\hat\beta_{j,L} - \beta_j| + \sum_{j=1}^M 2r_{n,j}|\beta_j| - \sum_{j=1}^M 2r_{n,j}|\hat\beta_{j,L}|.$$
Adding the term $\sum_{j=1}^M r_{n,j}|\hat\beta_{j,L} - \beta_j|$ to both sides of this inequality yields, on $\mathcal{A}$,
$$\|\hat f_L - f\|_n^2 + \sum_{j=1}^M r_{n,j}|\hat\beta_{j,L} - \beta_j| \le \|f_\beta - f\|_n^2 + 2\sum_{j=1}^M r_{n,j}\big(|\hat\beta_{j,L} - \beta_j| + |\beta_j| - |\hat\beta_{j,L}|\big).$$
Now, $|\hat\beta_{j,L} - \beta_j| + |\beta_j| - |\hat\beta_{j,L}| = 0$ for $j \notin J(\beta)$, so that, on $\mathcal{A}$, we get (B.1).

To prove (B.2), it suffices to note that, on $\mathcal{A}$, we have
$$\Big|\frac{1}{n}D^{-1/2}X^Tw\Big|_\infty \le r/2. \tag{B.5}$$
Now, $y = \mathbf{f} + w$, and (B.2) follows from (2.3) and (B.5).

We finally prove (B.3). The necessary and sufficient condition for $\hat\beta_L$ to be the Lasso solution can be written in the form
$$\frac{1}{n}x_{(j)}^T(y - X\hat\beta_L) = r\|f_j\|_n\,\mathrm{sign}(\hat\beta_{j,L}) \quad \text{if } \hat\beta_{j,L} \ne 0, \qquad \Big|\frac{1}{n}x_{(j)}^T(y - X\hat\beta_L)\Big| \le r\|f_j\|_n \quad \text{if } \hat\beta_{j,L} = 0, \tag{B.6}$$
where $x_{(j)}$ denotes the $j$th column of $X$, $j = 1, \ldots, M$. Next, (B.5) yields that, on $\mathcal{A}$, we have
$$\Big|\frac{1}{n}x_{(j)}^T w\Big| \le r\|f_j\|_n/2, \qquad j = 1, \ldots, M. \tag{B.7}$$
Combining (B.6) and (B.7), we get
$$\Big|\frac{1}{n}x_{(j)}^T(\mathbf{f} - X\hat\beta_L)\Big| \ge r\|f_j\|_n/2 \quad \text{if } \hat\beta_{j,L} \ne 0. \tag{B.8}$$
Therefore,
$$\frac{1}{n^2}(\mathbf{f} - X\hat\beta_L)^T XX^T(\mathbf{f} - X\hat\beta_L) = \frac{1}{n^2}\sum_{j=1}^M\big(x_{(j)}^T(\mathbf{f} - X\hat\beta_L)\big)^2 \ge \frac{1}{n^2}\sum_{j:\hat\beta_{j,L}\ne 0}\big(x_{(j)}^T(\mathbf{f} - X\hat\beta_L)\big)^2 \ge \sum_{j:\hat\beta_{j,L}\ne 0}r^2\|f_j\|_n^2/4 \ge f_{\min}^2 M(\hat\beta_L)r^2/4.$$


Since the matrices $X^TX/n$ and $XX^T/n$ have the same maximal eigenvalues,
$$\frac{1}{n^2}(\mathbf{f} - X\hat\beta_L)^T XX^T(\mathbf{f} - X\hat\beta_L) \le \frac{\phi_{\max}}{n}|\mathbf{f} - X\hat\beta_L|_2^2 = \phi_{\max}\|f - \hat f_L\|_n^2,$$
and we deduce (B.3) from the last two displays.

COROLLARY B.2. Let the assumptions of Lemma B.1 be satisfied and $\|f_j\|_n = 1$, $j = 1, \ldots, M$. Consider the linear regression model $y = X\beta + w$. Then, with probability at least $1 - M^{1 - A^2/8}$, we have
$$|\Delta_{J_0^c}|_1 \le 3|\Delta_{J_0}|_1,$$
where $J_0 = J(\beta)$ is the set of nonzero coefficients of $\beta$ and $\Delta = \hat\beta_L - \beta$.

PROOF. Use the first inequality in (B.1) and the fact that $f_\beta = f$ for the linear regression model.

LEMMA B.3. Let $\beta \in \mathbb{R}^M$ satisfy the Dantzig constraint
$$\Big|\frac{1}{n}D^{-1/2}X^T(y - X\beta)\Big|_\infty \le r$$
and set $\Delta = \hat\beta_D - \beta$, $J_0 = J(\beta)$. Then,
$$|\Delta_{J_0^c}|_1 \le |\Delta_{J_0}|_1. \tag{B.9}$$
Further, let the assumptions of Lemma B.1 be satisfied with $A > \sqrt{2}$. Then, with probability of at least $1 - M^{1 - A^2/2}$, we have
$$\Big|\frac{1}{n}X^T(\mathbf{f} - X\hat\beta_D)\Big|_\infty \le 2rf_{\max}. \tag{B.10}$$

PROOF OF LEMMA B.3. Inequality (B.9) follows immediately from the definition of the Dantzig selector (cf. [7]). To prove (B.10), consider the event
$$\mathcal{B} = \Big\{\Big|\frac{1}{n}D^{-1/2}X^Tw\Big|_\infty \le r\Big\} = \bigcap_{j=1}^M\{|V_j| \le r_{n,j}\}.$$
Analogously to (B.4), $\mathbb{P}\{\mathcal{B}^c\} \le M^{1 - A^2/2}$. On the other hand, $y = \mathbf{f} + w$, and, using the definition of the Dantzig selector, it is easy to see that (B.10) is satisfied on $\mathcal{B}$.

PROOF OF THEOREM 5.1. Set $\Delta = \hat\beta_L - \hat\beta_D$. We have
$$\frac{1}{n}|\mathbf{f} - X\hat\beta_L|_2^2 = \frac{1}{n}|\mathbf{f} - X\hat\beta_D|_2^2 - \frac{2}{n}\Delta^TX^T(\mathbf{f} - X\hat\beta_D) + \frac{1}{n}|X\Delta|_2^2.$$


This and (B.10) yield
$$\|\hat f_D - f\|_n^2 \le \|\hat f_L - f\|_n^2 + 2|\Delta|_1\Big|\frac{1}{n}X^T(\mathbf{f} - X\hat\beta_D)\Big|_\infty - \frac{1}{n}|X\Delta|_2^2 \le \|\hat f_L - f\|_n^2 + 4f_{\max}r|\Delta|_1 - \frac{1}{n}|X\Delta|_2^2, \tag{B.11}$$
where the last inequality holds with probability at least $1 - M^{1 - A^2/2}$. Since the Lasso solution $\hat\beta_L$ satisfies the Dantzig constraint, we can apply Lemma B.3 with $\beta = \hat\beta_L$, which yields
$$|\Delta_{J_0^c}|_1 \le |\Delta_{J_0}|_1, \tag{B.12}$$
with $J_0 = J(\hat\beta_L)$. By Assumption RE(s, 1), we get
$$\frac{1}{\sqrt{n}}|X\Delta|_2 \ge \kappa|\Delta_{J_0}|_2, \tag{B.13}$$
where $\kappa = \kappa(s, 1)$. Using (B.12) and (B.13), we obtain
$$|\Delta|_1 \le 2|\Delta_{J_0}|_1 \le 2M^{1/2}(\hat\beta_L)|\Delta_{J_0}|_2 \le \frac{2M^{1/2}(\hat\beta_L)}{\kappa\sqrt{n}}|X\Delta|_2. \tag{B.14}$$
Finally, from (B.11) and (B.14), we get that, with probability at least $1 - M^{1 - A^2/2}$,
$$\|\hat f_D - f\|_n^2 \le \|\hat f_L - f\|_n^2 + \frac{8f_{\max}rM^{1/2}(\hat\beta_L)}{\kappa\sqrt{n}}|X\Delta|_2 - \frac{1}{n}|X\Delta|_2^2 \le \|\hat f_L - f\|_n^2 + \frac{16f_{\max}^2r^2M(\hat\beta_L)}{\kappa^2}, \tag{B.15}$$
where the RHS follows from (B.2), (B.10) and another application of (B.14). This proves one side of the inequality.

To show the other side of the bound on the difference, we act as in (B.11), up to the inversion of the roles of $\hat\beta_L$ and $\hat\beta_D$, and we use (B.2). This yields that, with probability at least $1 - M^{1 - A^2/8}$,
$$\|\hat f_L - f\|_n^2 \le \|\hat f_D - f\|_n^2 + 2|\Delta|_1\Big|\frac{1}{n}X^T(\mathbf{f} - X\hat\beta_L)\Big|_\infty - \frac{1}{n}|X\Delta|_2^2 \le \|\hat f_D - f\|_n^2 + 3f_{\max}r|\Delta|_1 - \frac{1}{n}|X\Delta|_2^2. \tag{B.16}$$
This is analogous to (B.11). Now, paralleling the proof leading to (B.15), we obtain
$$\|\hat f_L - f\|_n^2 \le \|\hat f_D - f\|_n^2 + \frac{9f_{\max}^2r^2M(\hat\beta_L)}{\kappa^2}. \tag{B.17}$$
The theorem now follows from (B.15) and (B.17).


PROOF OF THEOREM 5.2. Set, again, $\Delta = \hat\beta_L - \hat\beta_D$. We apply (B.1) with $\beta = \hat\beta_D$, which yields that, with probability at least $1 - M^{1 - A^2/8}$,
$$|\Delta|_1 \le 4|\Delta_{J_0}|_1 + \|\hat f_D - f\|_n^2/r, \tag{B.18}$$
where, now, $J_0 = J(\hat\beta_D)$. Consider the following two cases: (i) $\|\hat f_D - f\|_n^2 > 2r|\Delta_{J_0}|_1$ and (ii) $\|\hat f_D - f\|_n^2 \le 2r|\Delta_{J_0}|_1$. In case (i), inequality (B.16) with $f_{\max} = 1$ immediately implies
$$\|\hat f_L - f\|_n^2 \le 10\|\hat f_D - f\|_n^2,$$
and the theorem follows. In case (ii), we get, from (B.18), that
$$|\Delta|_1 \le 6|\Delta_{J_0}|_1,$$
and thus $|\Delta_{J_0^c}|_1 \le 5|\Delta_{J_0}|_1$. We can therefore apply Assumption RE(s, 5), which yields, similarly to (B.14),
$$|\Delta|_1 \le 6M^{1/2}(\hat\beta_D)|\Delta_{J_0}|_2 \le \frac{6M^{1/2}(\hat\beta_D)}{\kappa\sqrt{n}}|X\Delta|_2, \tag{B.19}$$
where $\kappa = \kappa(s, 5)$. Plugging (B.19) into (B.16), we finally get that, in case (ii),
$$\|\hat f_L - f\|_n^2 \le \|\hat f_D - f\|_n^2 + \frac{18rM^{1/2}(\hat\beta_D)}{\kappa\sqrt{n}}|X\Delta|_2 - \frac{1}{n}|X\Delta|_2^2 \le \|\hat f_D - f\|_n^2 + \frac{81r^2M(\hat\beta_D)}{\kappa^2}. \tag{B.20}$$

PROOF OF THEOREM 6.1. Fix an arbitrary $\beta \in \mathbb{R}^M$ with $M(\beta) \le s$. Set $\Delta = D^{1/2}(\hat\beta_L - \beta)$, $J_0 = J(\beta)$. On the event $\mathcal{A}$, we get, from the first line in (B.1), that
$$\|\hat f_L - f\|_n^2 + r|\Delta|_1 \le \|f_\beta - f\|_n^2 + 4r\sum_{j\in J_0}\|f_j\|_n|\hat\beta_{j,L} - \beta_j| = \|f_\beta - f\|_n^2 + 4r|\Delta_{J_0}|_1, \tag{B.21}$$
and from the second line in (B.1) that
$$\|\hat f_L - f\|_n^2 \le \|f_\beta - f\|_n^2 + 4r\sqrt{M(\beta)}\,|\Delta_{J_0}|_2. \tag{B.22}$$
Consider, separately, the cases where
$$4r|\Delta_{J_0}|_1 \le \varepsilon\|f_\beta - f\|_n^2 \tag{B.23}$$
and
$$\varepsilon\|f_\beta - f\|_n^2 < 4r|\Delta_{J_0}|_1. \tag{B.24}$$
In case (B.23), the result of the theorem trivially follows from (B.21). So, we will only consider the case (B.24). All of the subsequent inequalities are valid on the


event $\mathcal{A}\cap\mathcal{A}_1$, where $\mathcal{A}_1$ is defined by (B.24). On this event, we get, from (B.21), that
$$|\Delta|_1 \le 4(1 + 1/\varepsilon)|\Delta_{J_0}|_1,$$
which implies $|\Delta_{J_0^c}|_1 \le (3 + 4/\varepsilon)|\Delta_{J_0}|_1$. We now use Assumption RE$(s, 3 + 4/\varepsilon)$. This yields
$$\kappa^2|\Delta_{J_0}|_2^2 \le \frac{1}{n}|X\Delta|_2^2 = \frac{1}{n}(\hat\beta_L - \beta)^TD^{1/2}X^TXD^{1/2}(\hat\beta_L - \beta) \le \frac{f_{\max}^2}{n}(\hat\beta_L - \beta)^TX^TX(\hat\beta_L - \beta) = f_{\max}^2\|\hat f_L - f_\beta\|_n^2,$$
where $\kappa = \kappa(s, 3 + 4/\varepsilon)$. Combining this with (B.22), we find
$$\|\hat f_L - f\|_n^2 \le \|f_\beta - f\|_n^2 + 4rf_{\max}\kappa^{-1}\sqrt{M(\beta)}\,\|\hat f_L - f_\beta\|_n \le \|f_\beta - f\|_n^2 + 4rf_{\max}\kappa^{-1}\sqrt{M(\beta)}\big(\|\hat f_L - f\|_n + \|f_\beta - f\|_n\big).$$
This inequality is of the same form as (A.4) in [4]. A standard decoupling argument as in [4], using the inequality $2xy \le x^2/b + by^2$ with $b > 1$, $x = r\kappa^{-1}\sqrt{M(\beta)}$ and $y$ being either $\|\hat f_L - f\|_n$ or $\|f_\beta - f\|_n$, yields that
$$\|\hat f_L - f\|_n^2 \le \frac{b + 1}{b - 1}\|f_\beta - f\|_n^2 + \frac{8b^2f_{\max}^2}{(b - 1)\kappa^2}\,r^2M(\beta), \qquad \forall\, b > 1.$$
Taking $b = 1 + 2/\varepsilon$ in the last display finishes the proof of the theorem.

PROOF OF PROPOSITION 6.3. Due to the weak sparsity assumption, there exists $\beta \in \mathbb{R}^M$ with $M(\beta) \le s$ such that $\|f_\beta - f\|_n^2 \le C_0f_{\max}^2r^2\kappa^{-2}M(\beta)$, where $\kappa = \kappa(s, 3 + 4/\varepsilon)$ is the same as in Theorem 6.1. Using this together with Theorem 6.1 and (B.3), we obtain that, with probability at least $1 - M^{1 - A^2/8}$,
$$M(\hat\beta_L) \le C_1(\varepsilon)M(\beta) \le C_1(\varepsilon)s.$$
This and Theorem 5.1 imply
$$\|\hat f_D - f\|_n^2 \le \|\hat f_L - f\|_n^2 + \frac{16C_1(\varepsilon)f_{\max}^2A^2\sigma^2}{\kappa_0^2}\,\frac{s\log M}{n},$$
where $\kappa_0 = \kappa(\max(C_1(\varepsilon), 1)s, 3 + 4/\varepsilon)$. Once again, applying Theorem 6.1, we get the result.

PROOF OF THEOREM 7.1. Set $\Delta = \hat\beta_D - \beta$ and $J_0 = J(\beta)$. Using Lemma B.3 with this $\beta$, we get that, on the event $\mathcal{B}$ (i.e., with probability at least


$1 - M^{1 - A^2/2}$), the following are true: (i) $\big|\frac{1}{n}X^TX\Delta\big|_\infty \le 2r$, and (ii) inequality (4.1) holds with $c_0 = 1$. Therefore, on $\mathcal{B}$ we have
$$\frac{1}{n}|X\Delta|_2^2 = \frac{1}{n}\Delta^TX^TX\Delta \le \Big|\frac{1}{n}X^TX\Delta\Big|_\infty|\Delta|_1 \le 2r\big(|\Delta_{J_0}|_1 + |\Delta_{J_0^c}|_1\big) \le 2(1 + c_0)r|\Delta_{J_0}|_1 \le 2(1 + c_0)r\sqrt{s}\,|\Delta_{J_0}|_2 = 4r\sqrt{s}\,|\Delta_{J_0}|_2, \tag{B.25}$$
since $c_0 = 1$. From Assumption RE(s, 1), we get that
$$\frac{1}{n}|X\Delta|_2^2 \ge \kappa^2|\Delta_{J_0}|_2^2,$$
where $\kappa = \kappa(s, 1)$. This and (B.25) yield that, on $\mathcal{B}$,
$$\frac{1}{n}|X\Delta|_2^2 \le 16r^2s/\kappa^2, \qquad |\Delta_{J_0}|_2 \le 4r\sqrt{s}/\kappa^2. \tag{B.26}$$
The first inequality in (B.26) implies (7.5). Next, (7.4) is straightforward in view of the second inequality in (B.26) and of the relations (with $c_0 = 1$)
$$|\Delta|_1 = |\Delta_{J_0}|_1 + |\Delta_{J_0^c}|_1 \le (1 + c_0)|\Delta_{J_0}|_1 \le (1 + c_0)\sqrt{s}\,|\Delta_{J_0}|_2 \tag{B.27}$$
that hold on $\mathcal{B}$. It remains to prove (7.6). It is easy to see that the $k$th largest in absolute value element of $\Delta_{J_0^c}$ satisfies $|\Delta_{J_0^c}|_{(k)} \le |\Delta_{J_0^c}|_1/k$. Thus,
$$|\Delta_{J_{01}^c}|_2^2 \le |\Delta_{J_0^c}|_1^2\sum_{k\ge m+1}\frac{1}{k^2} \le \frac{1}{m}|\Delta_{J_0^c}|_1^2,$$
and, since (4.1) holds on $\mathcal{B}$ (with $c_0 = 1$), we find
$$|\Delta_{J_{01}^c}|_2 \le \frac{c_0|\Delta_{J_0}|_1}{\sqrt{m}} \le c_0\sqrt{\frac{s}{m}}\,|\Delta_{J_0}|_2 \le c_0\sqrt{\frac{s}{m}}\,|\Delta_{J_{01}}|_2.$$
Therefore, on $\mathcal{B}$,
$$|\Delta|_2 \le \Big(1 + c_0\sqrt{\frac{s}{m}}\Big)|\Delta_{J_{01}}|_2. \tag{B.28}$$
On the other hand, it follows from (B.25) that
$$\frac{1}{n}|X\Delta|_2^2 \le 4r\sqrt{s}\,|\Delta_{J_{01}}|_2.$$
Combining this inequality with Assumption RE(s, m, 1), we obtain that, on $\mathcal{B}$,
$$|\Delta_{J_{01}}|_2 \le 4r\sqrt{s}/\kappa^2.$$


Recalling that $c_0 = 1$ and applying the last inequality together with (B.28), we get
$$|\Delta|_2^2 \le 16\Big(1 + c_0\sqrt{\frac{s}{m}}\Big)^2\big(r\sqrt{s}/\kappa^2\big)^2. \tag{B.29}$$
It remains to note that (7.6) is a direct consequence of (7.4) and (B.29). This follows from the fact that the inequalities $\sum_{j=1}^M a_j \le b_1$ and $\sum_{j=1}^M a_j^2 \le b_2$ with $a_j \ge 0$ imply
$$\sum_{j=1}^M a_j^p = \sum_{j=1}^M a_j^{2-p}a_j^{2p-2} \le \Big(\sum_{j=1}^M a_j\Big)^{2-p}\Big(\sum_{j=1}^M a_j^2\Big)^{p-1} \le b_1^{2-p}b_2^{p-1}, \qquad 1 < p \le 2.$$

PROOF OF THEOREM 7.2. Set $\Delta = \hat\beta_L - \beta$ and $J_0 = J(\beta)$. Using (B.1) with this $\beta$, $r_{n,j} \equiv r$ and $\|f_\beta - f\|_n = 0$, we get that, on the event $\mathcal{A}$,
$$\frac{1}{n}|X\Delta|_2^2 \le 4r\sqrt{s}\,|\Delta_{J_0}|_2 \tag{B.30}$$
and (4.1) holds with $c_0 = 3$ on the same event. Thus, by Assumption RE(s, 3) and the last inequality, we obtain that, on $\mathcal{A}$,
$$\frac{1}{n}|X\Delta|_2^2 \le 16r^2s/\kappa^2, \qquad |\Delta_{J_0}|_2 \le 4r\sqrt{s}/\kappa^2, \tag{B.31}$$
where $\kappa = \kappa(s, 3)$. The first inequality here coincides with (7.8). Next, (7.9) follows immediately from (B.3) and (7.8). To show (7.7), it suffices to note that, on the event $\mathcal{A}$, the relations (B.27) hold with $c_0 = 3$, to apply the second inequality in (B.31) and to use (B.4).

Finally, the proof of (7.10) follows exactly the same lines as that of (7.6). The only difference is that one should set $c_0 = 3$ in (B.28) and (B.29), as well as in the display preceding (B.28).

REFERENCES

[1] BICKEL, P. J. (2007). Discussion of "The Dantzig selector: Statistical estimation when p is much larger than n," by E. Candes and T. Tao. Ann. Statist. 35 2352–2357. MR2382645
[2] BUNEA, F., TSYBAKOV, A. B. and WEGKAMP, M. H. (2004). Aggregation for regression learning. Preprint LPMA, Univ. Paris 6–Paris 7, no. 948. Available at arXiv:math.ST/0410214 and at https://hal.ccsd.cnrs.fr/ccsd-00003205.
[3] BUNEA, F., TSYBAKOV, A. B. and WEGKAMP, M. H. (2006). Aggregation and sparsity via ℓ1 penalized least squares. In Proceedings of 19th Annual Conference on Learning Theory (COLT 2006) (G. Lugosi and H. U. Simon, eds.). Lecture Notes in Artificial Intelligence 4005 379–391. Springer, Berlin. MR2280619


    [4] BUNEA, F., TSYBAKOV, A. B. and WEGKAMP, M. H. (2007). Aggregation for Gaussian re-gression. Ann. Statist. 35 16741697. MR2351101

    [5] BUNEA, F., TSYBAKOV, A. B. and WEGKAMP, M. H. (2007). Sparsity oracle inequalities forthe Lasso. Electron. J. Statist. 1 169194. MR2312149

    [6] BUNEA, F., TSYBAKOV, A. B. and WEGKAMP, M. H. (2007). Sparse density estimation with1 penalties. In Proceedings of 20th Annual Conference on Learning Theory (COLT 2007)(N. H. Bshouty and C. Gentile, eds.). Lecture Notes in Artificial Intelligence 4539 530543. Springer, Berlin. MR2397610

    [7] CANDES, E. and TAO, T. (2007). The Dantzig selector: Statistical estimation when p is muchlarger than n. Ann. Statist. 35 23132351. MR2382644

    [8] DONOHO, D. L., ELAD, M. and TEMLYAKOV, V. (2006). Stable recovery of sparse over-complete representations in the presence of noise. IEEE Trans. Inform. Theory 52 618.MR2237332

    [9] EFRON, B., HASTIE, T., JOHNSTONE, I. and TIBSHIRANI, R. (2004). Least angle regression.Ann. Statist. 32 407451. MR2060166

    [10] FRIEDMAN, J., HASTIE, T., HFLING, H. and TIBSHIRANI, R. (2007). Pathwise coordinateoptimization. Ann. Appl. Statist. 1 302332. MR2415737

    [11] FU, W. and KNIGHT, K. (2000). Asymptotics for Lasso-type estimators. Ann. Statist. 28 13561378. MR1805787

    [12] GREENSHTEIN, E. and RITOV, Y. (2004). Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli 10 971988. MR2108039

    [13] JUDITSKY, A. and NEMIROVSKI, A. (2000). Functional aggregation for nonparametric esti-mation. Ann. Statist. 28 681712. MR1792783

    [14] KOLTCHINSKII , V. (2006). Sparsity in penalized empirical risk minimization. Ann. Inst.H. Poincar Probab. Statist. To appear.

    [15] KOLTCHINSKII , V. (2007). Dantzig selector and sparsity oracle inequalities. Unpublished man-

    uscript.[16] MEIER, L., VAN DE GEER, S. and BHLMANN, P. (2008). The Group Lasso for logisticregression. J. Roy. Statist. Soc. Ser. B 70 5371.

    [17] MEINSHAUSEN, N. and BHLMANN, P. (2006). High-dimensional graphs and variable selec-tion with the Lasso. Ann. Statist. 34 14361462. MR2278363

    [18] MEINSHAUSEN, N. and YU, B. (2006). Lasso type recovery of sparse representations for highdimensional data. Ann. Statist. To appear.

    [19] NEMIROVSKI, A. (2000). Topics in nonparametric statistics. In Ecole dEt de Probabil-its de Saint-Flour XXVIII1998. Lecture Notes in Math. 1738. Springer, New York.MR1775640

    [20] OSBORNE, M. R., PRESNELL, B. and TURLACH, B. A (2000a). On the Lasso and its dual.J. Comput. Graph. Statist. 9 319337. MR1822089

    [21] OSBORNE, M. R., PRESNELL, B. and TURLACH, B. A (2000b). A new approach to variableselection in least squares problems. IMA J. Numer. Anal. 20 389404.

    [22] TIBSHIRANI, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc.Ser. B 58 267288. MR1379242

    [23] TSYBAKOV, A. B. (2006). Discussion of Regularization in Statistics, by P. Bickel and B. Li.TEST 15 303310. MR2273731

    [24] TURLACH, B. A. (2005). On algorithms for solving least squares problems under an L1 penaltyor an L1 constraint. In 2004 Proceedings of the American Statistical Association, Statis-tical Computing Section [CD-ROM] 25722577. Amer. Statist. Assoc., Alexandria, VA.

    [25] VAN DE GEER, S. A. (2008). High dimensional generalized linear models and the Lasso. Ann.Statist. 36 614645. MR2396809

    [26] ZHANG, C.-H. and HUANG, J. (2008). Model-selection consistency of the Lasso in high-dimensional regression. Ann. Statist. 36 15671594. MR2435448


[27] ZHAO, P. and YU, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563. MR2274449

P. J. BICKEL
DEPARTMENT OF STATISTICS
UNIVERSITY OF CALIFORNIA AT BERKELEY
CALIFORNIA
USA
E-MAIL: [email protected]

Y. RITOV
DEPARTMENT OF STATISTICS
FACULTY OF SOCIAL SCIENCES
THE HEBREW UNIVERSITY
JERUSALEM 91904
ISRAEL
E-MAIL: [email protected]

A. B. TSYBAKOV
LABORATOIRE DE STATISTIQUE, CREST
3, AVENUE PIERRE LAROUSSE
92240 MALAKOFF
AND
LPMA (UMR CNRS 1599)
UNIVERSITÉ PARIS VI
4, PLACE JUSSIEU
75252 PARIS CEDEX 05
FRANCE
E-MAIL: [email protected]
