
No penalty no tears: Least squares in high-dimensional linear models

Xiangyu Wang XW56@STAT.DUKE.EDU

Department of Statistical Science, Duke University

David Dunson DUNSON@STAT.DUKE.EDU

Department of Statistical Science, Duke University

Chenlei Leng C.LENG@WARWICK.AC.UK

Department of Statistics, University of Warwick

Abstract

Ordinary least squares (OLS) is the default method for fitting linear models, but is not applicable for problems with dimensionality larger than the sample size. For these problems, we advocate the use of a generalized version of OLS motivated by ridge regression, and propose two novel three-step algorithms involving least squares fitting and hard thresholding. The algorithms are methodologically simple to understand intuitively, computationally easy to implement efficiently, and theoretically appealing for choosing models consistently. Numerical exercises comparing our methods with penalization-based approaches in simulations and data analyses illustrate the great potential of the proposed algorithms.

1. INTRODUCTION

Long known for its consistency, simplicity and optimality under mild conditions, ordinary least squares (OLS) is the most widely used technique for fitting linear models. Developed originally for fitting fixed-dimensional linear models, classical OLS unfortunately fails in high-dimensional linear models where the number of predictors p far exceeds the number of observations n. To deal with this problem, Tibshirani (1996) proposed ℓ1-penalized regression, a.k.a. the lasso, which triggered the recent overwhelming exploration in both theory and methodology of penalization-based methods. These methods usually assume that only a small number of coefficients are nonzero (known as the sparsity assumption), and minimize the same least squares loss function as OLS while including an additional penalty on the coefficients, with the typical choice being the ℓ1 norm. Such “penalization” constrains the solution space to certain directions favoring sparsity of the solution, and thus overcomes the non-uniqueness issue with OLS. It yields a sparse solution and achieves model selection consistency and estimation consistency under certain conditions. See Zhao and Yu (2006); Fan and Li (2001); Zhang (2010); Zou and Hastie (2005).

Despite the success of the methods based on regularization, there are important issues that cannot be easily neglected. On the one hand, methods using convex penalties, such as the lasso, usually require strong conditions for model selection consistency (Zhao and Yu, 2006; Lounici, 2008). On the other hand, methods using non-convex penalties (Fan and Li, 2001; Zhang, 2010) that can achieve model selection consistency under mild conditions often require huge computational expense. These concerns have limited the practical use of regularized methods, motivating alternative strategies such as direct hard thresholding (Jain et al., 2014).

In this article, we aim to solve the problem of fitting high-dimensional sparse linear models by reconsidering OLS and answering the following simple question: Can ordinary least squares consistently fit these models with some suitable algorithms? Our result provides an affirmative answer to this question under fairly general settings. In particular, we give a generalized form of OLS in high-dimensional linear regression, and develop two algorithms that can consistently estimate the coefficients and recover the support. These algorithms involve least squares type fitting and hard thresholding, and are non-iterative in nature. Extensive empirical experiments are provided in Section 4 to compare the proposed estimators to many existing penalization methods. The performance of the new estimators is very competitive under various setups in terms of model selection, parameter estimation and computational time.


1.1. Related Works

The work that is most closely related to ours is Yang et al. (2014), in which the authors proposed an algorithm based on OLS and ridge regression. However, both their methodology and theory are still within the ℓ1 regularization framework, and their conditions (especially their C-Ridge and C-OLS conditions) are overly strong and can be easily violated in practice. Jain et al. (2014) proposed an iterative hard thresholding algorithm for sparse regression, which shares a similar spirit of hard thresholding as our algorithms. Nevertheless, their motivation is completely different, their algorithm lacks theoretical guarantees for consistent support recovery, and they require an iterative estimation procedure.

1.2. Our Contributions

We provide a generalized form of OLS for fitting high-dimensional data motivated by ridge regression, and develop two algorithms that can consistently fit linear models on weakly sparse coefficients. We summarize the advantages of our new algorithms in three points.

Our algorithms work for highly correlated features under random designs. The consistency of the algorithms relies on a moderately growing condition number, as opposed to the strong irrepresentable condition (Zhao and Yu, 2006; Wainwright, 2009) required by the lasso. Our algorithms can consistently identify strong signals for ultra-high dimensional data (log p = o(n)) with only a bounded variance assumption on the noise ε, i.e., var(ε) < ∞. This is remarkable, as most methods (c.f. Zhang (2010); Yang et al. (2014); Cai and Wang (2011); Wainwright (2009); Zhang and Huang (2008); Wang and Leng (2015)) that work for the log p = o(n) case rely on a sub-Gaussian tail/bounded error assumption, which might fail to hold for real data. Lounici (2008) proved that the lasso also achieves consistent model selection with a second-order condition similar to ours, but requires two additional assumptions. The algorithms are simple, efficient and scale well for large p. In particular, the matrix operations are fully parallelizable with very few communications for very large p, while regularization methods are either hard to compute in parallel in the feature space, or the parallelization requires a large amount of machine communication.

The remainder of this article is organized as follows. In Section 2 we generalize the ordinary least squares estimator for high dimensional problems where p > n, and propose two three-step algorithms consisting only of least squares fitting and hard thresholding in a loose sense. Section 3 provides consistency theory for the algorithms. Section 4 evaluates the empirical performance. We conclude and discuss further implications of our algorithms in the last section. All the proofs are provided in the supplementary materials.

2. HIGH DIMENSIONAL ORDINARY LEAST SQUARES

Consider the usual linear model

Y = Xβ + ε,

where X is the n × p design matrix, Y is the n × 1 response vector and β is the coefficient vector. In the high-dimensional literature, the βi's are routinely assumed to be zero except for a small subset S* = supp(β). In this paper, we consider a slightly more general setting, where β is not exactly sparse, but consists of both strong and weak signals. In particular, we define

S* = {k : |βk| ≥ τ*}   and   S_* = {k : |βk| ≤ τ_*}

as the strong and weak signal sets, with S* ∪ S_* = {1, 2, · · · , p}. The algorithms developed in this paper aim to recover the strong signal set S*. The specific relationship between τ* and τ_* will be detailed later.

To carefully tailor the low-dimensional OLS estimator to a high-dimensional scenario, one needs to answer the following two questions: i) What is the correct form of OLS in the high-dimensional setting? ii) How to correctly use this estimator? To answer these, we reconsider OLS from a different perspective by viewing the OLS as the limit of the ridge estimator with the ridge parameter going to zero, i.e.,

(X^T X)^{-1} X^T Y = lim_{r→0} (X^T X + r I_p)^{-1} X^T Y.

One nice property of the ridge estimator is that it exists regardless of the relationship between p and n. A keen observation reveals the following relationship immediately.

Lemma 1. For any p, n, r > 0, we have

(X^T X + r I_p)^{-1} X^T Y = X^T (X X^T + r I_n)^{-1} Y.    (1)
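As a quick numerical sanity check (ours, not from the paper), the identity in Lemma 1 can be verified directly; note that the right-hand side only involves an n × n system, which is what makes it usable when p > n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 50, 200, 1e-3                     # p > n, small ridge parameter
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# Left-hand side of (1): solve a p x p system.
lhs = np.linalg.solve(X.T @ X + r * np.eye(p), X.T @ Y)
# Right-hand side of (1): solve an n x n system, much cheaper when p >> n.
rhs = X.T @ np.linalg.solve(X @ X.T + r * np.eye(n), Y)

print(np.max(np.abs(lhs - rhs)))            # agreement up to rounding error
```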

Notice that the right-hand side of (1) exists when p > n and r = 0. Consequently, we can naturally extend the classical OLS to the high-dimensional scenario by letting r tend to zero in (1). Denote this high-dimensional version of the OLS as

β(HD) = lim_{r→0} X^T (X X^T + r I_n)^{-1} Y = X^T (X X^T)^{-1} Y.

Unfortunately, β(HD) does not have good general performance in estimating sparse vectors in high-dimensional cases. Instead of directly estimating β by β(HD), however, this new estimator may be used for dimension reduction by observing that

β(HD) = X^T (X X^T)^{-1} X β + X^T (X X^T)^{-1} ε = Φ β + η.

Since η is stochastically small, if Φ is close to a diagonally dominant matrix and β is sparse, then the strong and weak signals can be separated by simply thresholding the small entries of β(HD). The exact meaning of this statement will be discussed in the next section. Some simple examples demonstrating the diagonal dominance of X^T (X X^T)^{-1} X are illustrated in Figure 1, where the rows of X in the left two plots are drawn from N(0, Σ) with σij = 0.6 or σij = 0.99^{|i−j|}. The sample size and data dimension are chosen as (n, p) = (50, 1000). The right plot takes the standardized design matrix directly from the real data in Section 4. A clear diagonal dominance pattern is visible in each plot.
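The diagonal dominance of Φ = X^T (X X^T)^{-1} X can also be checked numerically. The sketch below (our own illustration, not the paper's code) uses the compound-symmetry design of the left panel of Figure 1 and compares the average diagonal entry of Φ with the average absolute off-diagonal entry.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, rho = 50, 1000, 0.6
# Compound symmetry: x_ij = sqrt(rho)*u_i + sqrt(1-rho)*e_ij gives
# cov(x_ij, x_ik) = rho for j != k and var(x_ij) = 1.
u = rng.standard_normal((n, 1))
X = np.sqrt(rho) * u + np.sqrt(1 - rho) * rng.standard_normal((n, p))

Phi = X.T @ np.linalg.solve(X @ X.T, X)     # Phi = X^T (X X^T)^{-1} X
diag = np.diag(Phi)
off_mean = (np.abs(Phi).sum() - np.abs(diag).sum()) / (p * (p - 1))

print("average diagonal entry:", diag.mean())   # equals n/p on average
print("average |off-diagonal|:", off_mean)      # much smaller
```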

This ability to separate strong and weak signals allows us to first obtain a smaller model with size d such that |S*| < d < n containing S*. Since d is below n, one can directly apply the usual OLS to obtain an estimator, which will be thresholded further to obtain a more refined model. The final estimator will then be obtained by an OLS fit on the refined model. This three-stage non-iterative algorithm is termed Least-squares Adaptive Thresholding (LAT) and the concrete procedure is described in Algorithm 1.

Algorithm 1 The Least-squares Adaptive Thresholding (LAT) Algorithm

Initialization:
1: Input (Y, X), d, δ.
Stage 1: Pre-selection
2: Standardize Y and X to have mean 0 and variance 1.
3: Compute β(HD) = X^T (X X^T + 0.1 · I_n)^{-1} Y. Rank the importance of the variables by |β(HD)_i|.
4: Denote the model corresponding to the d largest |β(HD)_i| as M_d. Alternatively, use extended BIC (Chen and Chen, 2008) in conjunction with the obtained variable importance to select the best submodel.
Stage 2: Hard thresholding
5: β(OLS) = (X_{M_d}^T X_{M_d})^{-1} X_{M_d}^T Y.
6: σ² = Σ_{i=1}^n (y_i − ŷ_i)² / (n − d).
7: C = (X_{M_d}^T X_{M_d})^{-1}.
8: Threshold β(OLS) by MEAN(√(2 σ² C_ii log(4d/δ))) or use BIC to select the best submodel. Denote the chosen model as M.
Stage 3: Refinement
9: β_M = (X_M^T X_M)^{-1} X_M^T Y.
10: β_i = 0, ∀ i ∉ M.
11: return β.
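The authors implement LAT in Matlab (Section 4); the NumPy sketch below is our own rendering of Algorithm 1 and follows its steps directly, omitting the eBIC/BIC alternatives mentioned in steps 4 and 8.

```python
import numpy as np

def lat(Y, X, d, delta=0.5, ridge=0.1):
    """Sketch of Algorithm 1 (LAT): pre-selection, hard thresholding, refit."""
    n, p = X.shape

    # Stage 1: standardize, compute beta_hd = X^T (X X^T + 0.1 I_n)^{-1} Y,
    # and keep the d variables with the largest |beta_hd|.
    Xs = (X - X.mean(0)) / X.std(0)
    Ys = (Y - Y.mean()) / Y.std()
    beta_hd = Xs.T @ np.linalg.solve(Xs @ Xs.T + ridge * np.eye(n), Ys)
    Md = np.argsort(-np.abs(beta_hd))[:d]

    # Stage 2: OLS on the submodel, then threshold at
    # MEAN(sqrt(2 * sigma2 * C_ii * log(4d/delta))).
    XM = X[:, Md]
    beta_ols = np.linalg.solve(XM.T @ XM, XM.T @ Y)
    sigma2 = np.sum((Y - XM @ beta_ols) ** 2) / (n - d)
    C = np.linalg.inv(XM.T @ XM)
    gamma = np.mean(np.sqrt(2 * sigma2 * np.diag(C) * np.log(4 * d / delta)))
    M = Md[np.abs(beta_ols) > gamma]

    # Stage 3: refit OLS on the thresholded model.
    beta = np.zeros(p)
    if M.size > 0:
        beta[M] = np.linalg.solve(X[:, M].T @ X[:, M], X[:, M].T @ Y)
    return beta
```

With d = int(0.3 * n) and delta = 0.5, the values used in Section 4, lat(Y, X, d=int(0.3 * n)) returns a sparse coefficient vector whose nonzero positions form the selected model.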

The input parameter d is the submodel size selected in Stage 1 and δ is the tuning parameter determining the threshold in Stage 2. In Stage 1, we use β(HD) = X^T (X X^T + 0.1 · I_n)^{-1} Y instead of β(HD) = X^T (X X^T)^{-1} Y because X X^T is rank deficient (the rank is n − 1) after standardization. The number 0.1 can be replaced by any arbitrarily small number. As noted in Wang and Leng (2015), this additional ridge term is also essential when p and n get closer. Our results in Section 3 mainly focus on β(HD) = X^T (X X^T)^{-1} Y where X is assumed to be drawn from a distribution with mean zero, so no standardization or ridge adjustment is required. However, the result is easy to generalize to the case where a ridge term is included. See Wang and Leng (2015).

Stage 1 of Algorithm 1 is very similar to variable screening methods (Fan and Lv, 2008; Wang and Leng, 2015). However, most screening methods require a sub-Gaussian condition on the noise to handle ultra-high dimensional data where log(p) = o(n). In contrast to the existing theory, we prove in the next section a stronger result: Stage 1 of Algorithm 1 can produce a satisfactory submodel even for heavy-tailed noise.

The estimator β(OLS) in Stage 2 can be substituted by its ridge counterpart β(Ridge) = (X_{M_d}^T X_{M_d} + r I_d)^{-1} X_{M_d}^T Y, and C by (X_{M_d}^T X_{M_d} + r I_d)^{-1}, to stabilize numerical computation. A similar modification can be applied to Stage 3 as well. The resulting variant of the algorithm is referred to as the Ridge Adaptive Thresholding (RAT) algorithm and is described in Algorithm 2.

Algorithm 2 The Ridge Adaptive Thresholding (RAT) Algorithm

Initialization:
1: Input (Y, X), d, δ, r.
Stage 1: Pre-selection
2: Standardize Y and X to have mean 0 and variance 1.
3: Compute β(HD) = X^T (X X^T + 0.1 · I_n)^{-1} Y. Rank the importance of the variables by |β(HD)_i|.
4: Denote the model corresponding to the d largest |β(HD)_i| as M_d. Alternatively, use eBIC (Chen and Chen, 2008) in conjunction with the obtained variable importance to select the best submodel.
Stage 2: Hard thresholding
5: β(Ridge) = (X_{M_d}^T X_{M_d} + r I_d)^{-1} X_{M_d}^T Y.
6: σ² = Σ_{i=1}^n (y_i − ŷ_i)² / (n − d).
7: C = (X_{M_d}^T X_{M_d} + r I_d)^{-1}.
8: Threshold β(Ridge) by MEAN(√(2 σ² C_ii log(4d/δ))) or use BIC to select the best submodel. Denote the chosen model as M.
Stage 3: Refinement
9: β_M = (X_M^T X_M + r I)^{-1} X_M^T Y.
10: β_i = 0, ∀ i ∉ M.
11: return β.
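The ridge modification only changes the linear systems in Stages 2 and 3. Reusing the Stage 1 pre-selection from the LAT sketch above, Stages 2 and 3 of Algorithm 2 can be sketched as follows (our code; r = 1.0 is only an illustrative default, and in practice r is tuned as described below).

```python
import numpy as np

def rat_stages_2_3(Y, X, Md, delta=0.5, r=1.0):
    """Sketch of Stages 2-3 of Algorithm 2: ridge fit, threshold, ridge refit."""
    n, p = X.shape
    d = len(Md)
    XM = X[:, Md]
    beta_ridge = np.linalg.solve(XM.T @ XM + r * np.eye(d), XM.T @ Y)
    sigma2 = np.sum((Y - XM @ beta_ridge) ** 2) / (n - d)
    C = np.linalg.inv(XM.T @ XM + r * np.eye(d))
    gamma = np.mean(np.sqrt(2 * sigma2 * np.diag(C) * np.log(4 * d / delta)))
    M = Md[np.abs(beta_ridge) > gamma]

    beta = np.zeros(p)
    if M.size > 0:
        XMf = X[:, M]
        beta[M] = np.linalg.solve(XMf.T @ XMf + r * np.eye(M.size), XMf.T @ Y)
    return beta
```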


Figure 1. Examples for X^T (X X^T)^{-1} X. Left: X ∼ N(0, Σ) with σij = 0.6 and σii = 1; Middle: X ∼ N(0, Σ) with σij = 0.99^{|i−j|}; Right: Real data from Section 4.

We suggest using 10-fold cross-validation to tune the ridge parameter r. Notice that the model is already small after Stage 1, so using cross-validation will not significantly increase the computational burden. The computational performance is illustrated in Section 4.
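One way to implement this step (a sketch under our own conventions, scoring each candidate r by out-of-fold squared prediction error on the d pre-selected columns) is:

```python
import numpy as np

def cv_ridge_parameter(Y, XM, r_grid, n_folds=10, seed=0):
    """10-fold cross-validation for the ridge parameter r on the submodel XM."""
    n, d = XM.shape
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    errors = np.zeros(len(r_grid))
    for k in range(n_folds):
        tr, te = folds != k, folds == k
        A = XM[tr].T @ XM[tr]
        b = XM[tr].T @ Y[tr]
        for j, r in enumerate(r_grid):
            beta = np.linalg.solve(A + r * np.eye(d), b)
            errors[j] += np.sum((Y[te] - XM[te] @ beta) ** 2)
    return r_grid[int(np.argmin(errors))]

# Example call on the pre-selected submodel X[:, Md]:
# r = cv_ridge_parameter(Y, X[:, Md], r_grid=np.logspace(-2, 2, 20))
```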

3. THEORY

In this section, we prove the consistency of Algorithm 1 in recovering S* and provide concrete forms for all the values needed for the algorithm to work. Recall the linear model Y = Xβ + ε. We consider the random design where the rows of X are drawn from an elliptical distribution with covariance Σ. It is easy to show that x_i admits an equivalent representation as

x_i =_d L_i (√p z_i / ‖z_i‖_2) Σ^{1/2} = (√p L_i / ‖z_i‖_2) z_i Σ^{1/2},    (2)

where z_i is a p-variate standard Gaussian random variable, L_i is a nonnegative random variable that is independent of z_i, and =_d denotes equality in distribution. We denote this distribution by EN(L, Σ). This random design allows for various correlation structures and contains many distribution families that are widely used (Bickel et al., 2009; Raskutti et al., 2010). The noise ε, as mentioned earlier, is only assumed to have a finite second moment, i.e., var(ε) = σ² < ∞, in contrast to the sub-Gaussian/bounded error assumption seen in most of the high-dimensional literature; see Zhang (2010); Yang et al. (2014); Cai and Wang (2011); Wainwright (2009); Zhang and Huang (2008). This relaxation is similar to Lounici (2008); however, we do not require any further assumptions needed by Lounici (2008). In Algorithm 1, we also propose to use extended BIC and BIC for parameter tuning. However, the corresponding details will not be pursued here, as their consistency is straightforwardly implied by the results from this section and the existing literature (Chen and Chen, 2008).
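The representation (2) also gives a direct recipe for simulating rows from EN(L, Σ): draw a standard Gaussian z_i, normalize it to the sphere, scale by √p L_i, and rotate by Σ^{1/2}. A sketch (ours, with an illustrative choice of L bounded away from zero) is:

```python
import numpy as np

def sample_elliptical(n, Sigma, L_sampler, rng):
    """Draw n rows x_i = (sqrt(p) * L_i / ||z_i||_2) * z_i * Sigma^{1/2}, as in (2)."""
    p = Sigma.shape[0]
    w, V = np.linalg.eigh(Sigma)                         # symmetric square root
    Sigma_half = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T
    Z = rng.standard_normal((n, p))
    L = L_sampler(n, rng)                                # nonnegative, independent of Z
    scale = np.sqrt(p) * L / np.linalg.norm(Z, axis=1)
    return (scale[:, None] * Z) @ Sigma_half

rng = np.random.default_rng(2)
Sigma = 0.6 + 0.4 * np.eye(100)                          # compound symmetry, p = 100
X = sample_elliptical(50, Sigma, lambda m, g: g.uniform(0.5, 1.5, m), rng)
```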

As shown in (2), the variable L controls the signal strength of x_i; we thus need a lower bound on L_i to guarantee a good signal strength. Define κ = cond(Σ). We state our results in three theorems.

Theorem 1. Assume x_i ∼ EN(L_i, Σ) with E[L_i^{-2}] < M_1 and ε_i is a random variable with a bounded variance σ². We also assume p > c_0 n for some c_0 > 1 and var(Y) ≤ M_0. If |S*| log p = o(n), n > 4c_0/(c_0 − 1)², and τ*/τ_* ≥ 4κ², then we can choose γ to be (2 c_1 κ^{-1} τ* / 3) · (n / p), where c_1 is some absolute constant specified in Lemma 2, and for any α ∈ (0, 1) we have

P( max_{i ∈ S_*} |β(HD)_i| ≤ γ ≤ min_{i ∈ S*} |β(HD)_i| ) = 1 − O( σ² κ⁴ log p / (τ*² n^α) ).

Theorem 1 guarantees the model selection consistency of the first stage of Algorithm 1. It only requires a second-moment condition on the noise tail, relaxing the sub-Gaussian assumption seen in other literature. The probability term shows that the algorithm requires the strong signals to be lower bounded by a signal strength of σ √(log p / n^α). In addition, a gap of τ*/τ_* ≥ 4κ² is needed between the strong and the weak signals in order for a successful support recovery.

As γ is not easily computable based on the data, we propose to rank the |β(HD)_i|'s and select the d largest coefficients. Alternatively, we can construct a series of nested models formed by ranking the largest n coefficients and adopt the extended BIC (Chen and Chen, 2008). Once the submodel M_d is obtained, we proceed to the second stage by obtaining an estimate via ordinary least squares, β(OLS), corresponding to M_d. The theory for β(OLS) requires more stringent conditions, as we now need to estimate β_{M_d} instead of just obtaining a correct ranking. In particular, we have to impose conditions on the magnitude of β_{S_*} and the moments of L, i.e., for β(OLS) we have the following result.

Theorem 2. Assume the same conditions for X and ε as in Theorem 1. We also assume n ≥ 64 κ d log p and d − |S*| ≤ c for some c > 0. If E[L^{-12}] ≤ M_1, E[L^{12}] ≤ M_2, τ_* ≤ σ κ √(log p / n) and there exists some ι ∈ (0, 1) such that Σ_{i ∈ S_*} |β_i|^ι ≤ R, then for any α > 0, we have

P( max_{|M| ≤ d, S* ⊂ M} ‖β(OLS) − β‖_∞ ≤ 2σ √(log p / n^α) )
  = 1 − O( λ_*^{-2} d log d / n^{(1−α)/3} + (M_1 + M_2) / n^{(1−4α)/3} + (M_1 + M_2) R³ / ((log p)^{2ι} n^{3−4α−2ι}) ),

i.e., if τ* ≥ 5σ √(log p / n^α), then we can choose γ′ = 3σ √(log p / n^α) and

max_{i ∉ S*} |β(OLS)_i| ≤ γ′ ≤ min_{i ∈ S*} |β(OLS)_i|

with probability tending to 1.

The moment condition on L is not tight; we use this number just for simplicity. As shown in Theorem 2, the ℓ_ι norm of β_{S_*} is allowed to grow at a rate of (log p)^{2ι/3} n^{1−4α/3−2ι/3}, i.e., our algorithms work for weakly sparse coefficients. However, different from Theorem 1, Theorem 2 imposes an upper bound on α. This is mainly due to the different structures of β(HD) and β(OLS), i.e., β(OLS) relies on L for diminishing the weak signals while β(HD) does not. For ridge regression, we have the following result.

Theorem 3 (Ridge regression). Assume all the conditions in Theorem 2. If we choose the ridge parameter satisfying

r ≤ σ n^{7/9 − 5α/18} √(log p) / (162 κ M_0),

then we have

P( max_{|M| ≤ d, S* ⊂ M} ‖β(Ridge) − β‖_∞ ≤ 3σ √(log p / n^α) )
  = 1 − O( λ_*^{-2} d log d / n^{(1−α)/3} + (2M_1 + M_2) / n^{(1−4α)/3} + (M_1 + M_2) R³ / ((log p)^{2ι} n^{3−4α−2ι}) ),

i.e., if τ* ≥ 7σ √(log p / n^α), then we can choose γ′ = 4σ √(log p / n^α) and

max_{i ∉ S*} |β(Ridge)_i(r)| ≤ γ′ ≤ min_{i ∈ S*} |β(Ridge)_i(r)|

with probability tending to 1.

When both the noise ε and X follow a Gaussian distribution and τ_* = 0, we can obtain a more explicit form of the threshold γ′, as the following corollary shows.

Corollary 1 (Gaussian noise). Assume ε ∼ N(0, σ²), X ∼ N(0, Σ) and τ_* = 0. For any δ ∈ (0, 1), define γ′ = 8√2 σ̂ √(2κ log(4d/δ) / n), where σ̂ is the estimated standard error with σ̂² = Σ_{i=1}^n (y_i − ŷ_i)² / (n − d). For sufficiently large n, if d ≤ n − 4K² log(2/δ)/c for some absolute constants c, K and τ* ≥ 24σ √(2κ log(4d/δ) / n), then with probability at least 1 − 2δ, we have

|β(OLS)_i| ≥ γ′  ∀ i ∈ S*   and   |β(OLS)_i| ≤ γ′  ∀ i ∉ S*.

Write C = (X_{M_d}^T X_{M_d})^{-1} as in Algorithm 1. In practice, we propose to use γ′ = mean(√(2 σ̂² C_ii log(4d/δ))) as the threshold (see Algorithm 1), because the estimation error takes the form √(σ̂² C_ii log(4d/δ)). Once the final model is obtained, as in Stage 3 of Algorithm 1, we refit it again using ordinary least squares. The final output will be the same as if we knew S* a priori, with probability tending to 1. As implied by Theorems 1–3, LAT and RAT can consistently identify strong signals in the ultra-high dimensional (log p = o(n)) setting with only the bounded moment assumption var(ε) < ∞, in contrast to most existing methods that require ε ∼ N(0, σ²) or ‖ε‖_∞ < ∞.

4. EXPERIMENTS

In this section, we provide extensive numerical experiments for assessing the performance of LAT and RAT. In particular, we compare the two methods to existing penalized methods including the lasso, the elastic net (enet; Zou and Hastie, 2005), the adaptive lasso (Zou, 2006), scad (Fan and Li, 2001) and mc+ (Zhang, 2010). As it is well known that the lasso estimator is biased, we also consider two variations of it by combining the lasso with Stages 2 and 3 of our LAT and RAT algorithms, denoted as lasLAT (las1 in figures) and lasRAT (las2 in figures) respectively. We note that the lasLAT algorithm is very similar to the thresholded lasso (Zhou, 2010) with an additional thresholding step. We code LAT, RAT and the adaptive lasso in Matlab, use glmnet (Friedman et al., 2010) for enet and lasso, and SparseReg (Zhou et al., 2012; Zhou and Lange, 2013) for scad and mc+. Since the adaptive lasso achieves a similar performance as lasLAT on synthetic datasets, we only report its performance for the real data.

4.1. Synthetic Datasets

The model used in this section for comparison is the linear model Y = Xβ + ε, where ε ∼ N(0, σ²) and X ∼ N(0, Σ). To control the signal-to-noise ratio, we define r = ‖β‖₂/σ, which is chosen to be 2.3 for all experiments. The sample size and the data dimension are chosen to be (n, p) = (200, 1000) or (n, p) = (500, 10000) for all experiments. For evaluation purposes, we consider four different structures of Σ below (a data-generating sketch for several of them follows the list).

(i) Independent predictors. The support is set as S = {1, 2, 3, 4, 5}. We generate X_i from a standard multivariate normal distribution with independent components. The coefficients are specified as

β_i = (−1)^{u_i} (|N(0, 1)| + 1) with u_i ∼ Ber(0.5) for i ∈ S, and β_i = 0 for i ∉ S.

(ii) Compound symmetry. All predictors are equally correlated with correlation ρ = 0.6. The coefficients are set to be β_i = 3 for i = 1, . . . , 5 and β_i = 0 otherwise.

(iii) Group structure. This example is Example 4 in Zou and Hastie (2005), for which we allocate the 15 true variables into three groups. Specifically, the predictors are generated as

x_{1+3m} = z_1 + N(0, 0.01),  x_{2+3m} = z_2 + N(0, 0.01),  x_{3+3m} = z_3 + N(0, 0.01),

where m = 0, 1, 2, 3, 4 and z_i ∼ N(0, 1) are independent. The coefficients are set as

β_i = 3 for i = 1, 2, · · · , 15, and β_i = 0 for i = 16, · · · , p.

(iv) Factor models. This model is also considered in Meinshausen and Buhlmann (2010) and Cho and Fryzlewicz (2012). Let φ_j, j = 1, 2, · · · , k be independent standard normal variables. We set the predictors as x_i = Σ_{j=1}^k φ_j f_{ij} + η_i, where the f_{ij} and η_i are generated from independent standard normal distributions. The number of factors is chosen as k = 5 in the simulation while the coefficients are specified the same as in Example (ii).
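For concreteness, here is a sketch (our code) generating the designs of Examples (i), (ii) and (iv), with the noise scaled to the stated signal-to-noise ratio r = ‖β‖₂/σ = 2.3; Example (iii) is omitted for brevity.

```python
import numpy as np

def make_example(name, n=200, p=1000, snr=2.3, rng=None):
    """Generate (Y, X, beta) for Examples (i), (ii) and (iv) of Section 4.1."""
    if rng is None:
        rng = np.random.default_rng(0)
    beta = np.zeros(p)
    if name == "independent":                  # Example (i)
        X = rng.standard_normal((n, p))
        s = 5
        signs = rng.choice([-1.0, 1.0], s)     # (-1)^{u_i}, u_i ~ Ber(0.5)
        beta[:s] = signs * (np.abs(rng.standard_normal(s)) + 1)
    elif name == "compound":                   # Example (ii): rho = 0.6
        u = rng.standard_normal((n, 1))
        X = np.sqrt(0.6) * u + np.sqrt(0.4) * rng.standard_normal((n, p))
        beta[:5] = 3.0
    elif name == "factor":                     # Example (iv): k = 5 factors
        k = 5
        factors = rng.standard_normal((n, k))
        loadings = rng.standard_normal((k, p))
        X = factors @ loadings + rng.standard_normal((n, p))
        beta[:5] = 3.0
    sigma = np.linalg.norm(beta) / snr         # r = ||beta||_2 / sigma = 2.3
    Y = X @ beta + sigma * rng.standard_normal(n)
    return Y, X, beta
```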

To compare the performance of all methods, we simulate 200 synthetic datasets for (n, p) = (200, 1000) and 100 for (n, p) = (500, 10000) for each example, and record i) the root mean squared error (RMSE) ‖β̂ − β‖₂, ii) the number of false negatives (# FN), iii) the number of false positives (# FP) and iv) the actual runtime (in milliseconds). We use the extended BIC (Chen and Chen, 2008) to choose the parameters for any regularized algorithm. Due to the huge computational expense of scad and mc+, we only find the first ⌈√p⌉ predictors on the solution path (because we know s ≪ √p). For RAT and LAT, d is set to 0.3 × n. For RAT and lasRAT, we adopt a 10-fold cross-validation procedure to tune the ridge parameter r for a better finite-sample performance, although the theory allows r to be fixed as a constant. For all hard-thresholding steps, we fix δ = 0.5. The results for (n, p) = (200, 1000) are plotted in Figures 2, 3, 4 and 5, and a more comprehensive summary (average values for RMSE, # FPs, # FNs and runtime) for (n, p) = (500, 10000) is given in Table 1.
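The extended BIC used above for tuning can be computed from a residual sum of squares. The sketch below is our implementation of one common form for Gaussian linear models, n log(RSS/n) + |M| log n + 2γ log C(p, |M|), which we believe matches Chen and Chen (2008); the parameter ebic_gamma here is the eBIC parameter, not the threshold γ of Section 3.

```python
import numpy as np
from math import lgamma, log

def ebic(Y, X, support, ebic_gamma=1.0):
    """Extended BIC: n*log(RSS/n) + |M|*log(n) + 2*gamma*log C(p, |M|)."""
    n, p = X.shape
    m = len(support)
    if m == 0:
        rss = np.sum((Y - Y.mean()) ** 2)
    else:
        XM = X[:, list(support)]
        beta = np.linalg.lstsq(XM, Y, rcond=None)[0]
        rss = np.sum((Y - XM @ beta) ** 2)
    log_binom = lgamma(p + 1) - lgamma(m + 1) - lgamma(p - m + 1)
    return n * log(rss / n) + m * log(n) + 2 * ebic_gamma * log_binom

# Scan nested models along the Stage 1 ranking and keep the best one:
# ranking = np.argsort(-np.abs(beta_hd))
# best = min((ranking[:k] for k in range(1, d + 1)), key=lambda M: ebic(Y, X, M))
```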

As can be seen from both the plots and the table, LAT and RAT achieve the smallest RMSE for Examples (ii), (iii) and (iv) and are on par with lasLAT for Example (i). For Example (iii), RAT and enet achieve the best performance while all the other methods fail to work. In addition, the runtimes of LAT and RAT are also competitive compared to those of lasso and enet. We thus conclude that LAT and RAT achieve similar or even better performance compared to the usual regularized methods.

4.2. A Student Performance Dataset

We look at one dataset used for evaluating student achievement in Portuguese schools (Cortez and Silva, 2008). The data attributes include student grades and school-related features that were collected by using school reports and questionnaires. The particular dataset used here provides the students' performance in mathematics. The goal of the research is to predict the final grade based on all the attributes.

The original dataset contains 395 students and 32 raw attributes. The raw attributes are recoded as 40 attributes and form 780 features after interaction terms are added. We then remove features that are constant for all students. This gives 767 features for us to work with. To compare the performance of all methods, we first randomly split the dataset into 10 parts. We use one of the 10 parts as a test set, fit all the methods on the other 9 parts, and then record their prediction error (root mean square error, RMSE), model size and runtime on the test set. We repeat this procedure until each of the 10 parts has been used for testing. The averaged prediction error, model size and runtime are summarized in Table 2. We also report the performance of the null model, which predicts the final grade on the test set using the mean final grade in the training set.
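The evaluation protocol just described is a standard 10-fold train/test loop; a sketch (our code, with fit_method standing in for any of the compared methods and the intercept handled by centering) is:

```python
import numpy as np

def ten_fold_rmse(Y, X, fit_method, n_folds=10, seed=0):
    """Average test RMSE over a 10-fold split; fit_method(Y_tr, X_tr) -> beta."""
    n = len(Y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    errors = []
    for k in range(n_folds):
        tr, te = folds != k, folds == k
        mu_y, mu_x = Y[tr].mean(), X[tr].mean(0)
        beta = fit_method(Y[tr] - mu_y, X[tr] - mu_x)
        pred = mu_y + (X[te] - mu_x) @ beta
        errors.append(np.sqrt(np.mean((Y[te] - pred) ** 2)))
    return float(np.mean(errors))
```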

It can be seen that RAT achieves the smallest cross-validation error, followed by scad and mc+. In the post-feature-selection analysis, we found that two features, the 1st and 2nd period grades of a student, were selected by all the methods. This result coincides with the common perception that these two grades usually have a high impact on the final grade.

In addition, we may also be interested in what happens when no strong signals are present. One way to investigate this is to remove all the features that are related to the 1st and 2nd period grades before applying the aforementioned procedures. The new results without the strong signals are summarized in Table 3.

Table 3 shows a few interesting findings. First, under this artificial weak-signal scenario, the adaptive lasso achieves the smallest cross-validation error and RAT is the first runner-up. Second, in Stage 1, the lasso seems to provide slightly more robust screening than OLS in that the selected features are less correlated. This might be the reason that LAT is outperformed by lasLAT. However, in both the strong and weak signal cases, RAT is consistently competitive in terms of performance.



Figure 2. The boxplots for Example (i). Left: Estimation Error; Middle: False Positives; Right: False Negatives.

Figure 3. The boxplots for Example (ii). Left: Estimation Error; Middle: False Positives; Right: False Negatives.

Figure 4. The boxplots for Example (iii). Left: Estimation Error; Middle: False Positives; Right: False Negatives.

Figure 5. The boxplots for Example (iv). Left: Estimation Error; Middle: False Positives; Right: False Negatives.

(Each panel compares LAT, RAT, lasso, las1, las2, enet, scad and mc+.)


Table 1. Results for (n, p) = (500, 10000). Runtime is in milliseconds.

Example    Metric   LAT      RAT      lasso    lasLAT   lasRAT   enet     scad      mc+
Ex. (i)    RMSE     0.263    0.264    0.781    0.214    0.214    1.039    0.762     0.755
           # FPs    0.550    0.580    0.190    0.190    0.190    0.470    0.280     0.280
           # FNs    0.010    0.010    0.000    0.000    0.000    0.000    0.000     0.000
           Time     36.1     41.8     72.7     72.7     74.1     71.8     1107.5    1003.2
Ex. (ii)   RMSE     0.204    0.204    0.979    0.260    0.260    1.363    0.967     0.959
           # FPs    0.480    0.480    1.500    0.350    0.350    10.820   2.470     2.400
           # FNs    0.000    0.000    0.040    0.040    0.040    0.040    0.020     0.020
           Time     34.8     40.8     76.1     76.1     77.5     82.0     1557.6    1456.1
Ex. (iii)  RMSE     9.738    1.347    7.326    17.621   3.837    1.843    7.285     8.462
           # FPs    0.000    0.000    0.060    0.000    0.000    0.120    0.120     0.090
           # FNs    4.640    0.000    1.440    13.360   1.450    0.000    1.800     2.780
           Time     35.0     41.6     75.6     75.6     77.5     74.4     6304.4    4613.8
Ex. (iv)   RMSE     0.168    0.168    1.175    0.256    0.256    1.780    0.389     0.368
           # FPs    0.920    0.920    21.710   0.260    0.260    37.210   6.360     6.270
           # FNs    0.010    0.010    0.140    0.140    0.140    0.450    0.000     0.000
           Time     34.5     41.1     78.7     78.7     80.8     81.4     1895.6    1937.1

Table 2. Prediction Error of the Final Grades by Different Methods

methods          mean error   standard error   average model size   runtime (millisec)
LAT              1.93         0.118            6.8                  22.3
RAT              1.90         0.131            3.5                  74.3
lasso            1.94         0.138            3.7                  60.7
lasLAT           2.02         0.119            3.6                  55.5
lasRAT           2.04         0.124            3.6                  71.3
enet             1.99         0.127            4.7                  58.5
scad             1.92         0.142            3.5                  260.6
mc+              1.92         0.143            3.4                  246.0
adaptive lasso   2.01         0.140            3.6                  65.5
null             4.54         0.151            0                    -

Table 3. Prediction Error of the Final Grades Excluding Strong Signals

methods          mean error   standard error   average model size   runtime (millisec)
LAT              4.50         0.141            5.3                  22.4
RAT              4.26         0.130            4.0                  74.0
lasso            4.27         0.151            5.0                  318.9
lasLAT           4.25         0.131            2.9                  316.5
lasRAT           4.28         0.127            2.8                  331.9
enet             4.37         0.171            6.0                  265.6
scad             4.30         0.156            4.8                  387.5
mc+              4.29         0.156            4.7                  340.2
adaptive lasso   4.24         0.180            4.8                  298.0
null             4.54         0.151            0                    -

5. CONCLUSION

We have proposed two novel algorithms, LAT and RAT, that only rely on least-squares type fitting and hard thresholding, based on a high-dimensional generalization of OLS. The two methods are simple, easily implementable, and can consistently fit a high-dimensional linear model and recover its support. The performance of the two methods is competitive compared to existing regularization methods. It is of great interest to further extend this framework to other models such as generalized linear models and models for survival analysis.


References

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.

Cai, T. T. and Wang, L. (2011). Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7):4680–4688.

Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771.

Cho, H. and Fryzlewicz, P. (2012). High dimensional variable selection via tilting. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):593–622.

Cortez, P. and Silva, A. M. G. (2008). Using data mining to predict secondary school student performance.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.

Jain, P., Tewari, A., and Kar, P. (2014). On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems, pages 685–693.

Lounici, K. (2008). Sup-norm convergence rate and sign concentration property of lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90–102.

Meinshausen, N. and Buhlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473.

Raskutti, G., Wainwright, M. J., and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241–2259.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages 267–288.

Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202.

Wang, X. and Leng, C. (2015). High dimensional ordinary least squares projection for screening variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Yang, E., Lozano, A., and Ravikumar, P. (2014). Elementary estimators for high-dimensional linear regression. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 388–396.

Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942.

Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–1594.

Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563.

Zhou, H., Armagan, A., and Dunson, D. B. (2012). Path following and empirical Bayes model selection for sparse regression. arXiv preprint arXiv:1201.3528.

Zhou, H. and Lange, K. (2013). A path algorithm for constrained estimation. Journal of Computational and Graphical Statistics, 22(2):261–283.

Zhou, S. (2010). Thresholded lasso for high dimensional variable selection and statistical estimation. arXiv preprint arXiv:1002.1583.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.

