STAT 340: Applied Regression Methods

Lecture Notes 6: Subset Selection and Shrinkage Methods

Outline

Linear Model Selection and Regularization

Subset Selection Methods

Shrinkage Methods

Linear Regression Models

• There are many advantages to the linear model

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε,

including simplicity, interpretability, and good predictive performance.

• We typically fit this model using least squares.

• Alternative fitting procedures can sometimes yield better prediction accuracy and model interpretability.

• We elaborate on these alternatives in what follows.

Prediction Accuracy

• Bias: If the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias.

• Variance:

• n ≫ p: the least squares estimates tend to also have low variance, and hence will perform well on test observations.

• n not much larger than p: there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training.

• p > n: there is no longer a unique least squares coefficient estimate; the variance is infinite, so the method cannot be used at all.

• What can we do then?

Prediction Accuracy ...

• We can constrain or shrink the estimated coefficients, thereby substantially reducing the variance at the cost of a negligible increase in bias.

• This can lead to substantial improvements in the accuracy with which we can predict the response for observations not used in model training.

Model Interpretability

• Often, some or many of the variables used in a multiple regression model are in fact not associated with the response.

• Including such irrelevant variables leads to unnecessary complexity in the resulting model.

• By removing these variables, we can obtain a model that is more easily interpreted.

• We will see some approaches for automatically performing feature selection or variable selection, that is, for excluding irrelevant variables from a multiple regression model.

Three Classes of Methods

• Subset selection: Identify a subset of the p features that appear to be associated with the response. Then fit a model on those features using least squares.

• Shrinkage: Use all p features to fit a model using a technique that shrinks the coefficient estimates towards zero relative to least squares. This regularization results in reduced variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection.

• Dimension reduction: Project the p predictors onto an M-dimensional subspace (M < p). This is achieved by computing M different linear combinations, or projections, of the variables. Then use these M projections as predictors in a model fit using least squares.

Subset Selection Methods

• Simple to understand/use/implement.

• Discussed here in the context of a linear regression model.

• Can be generalized to other types of models, like logistic regression for classification.

Three Subset Selection Methods

• Best Subset Selection

• Forward Stepwise Selection

• Backward Stepwise Selection

Best Subset Selection

Consider every possible model, and choose the best one.

Best Subset Selection

Algorithm for Best Subset selection

1. Let M0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.

2. For k = 1, 2, ..., p:

• Fit all (p choose k) models that contain exactly k predictors.

• Pick the best among these (p choose k) models, and call it Mk. Here best is defined as having the smallest RSS, or equivalently the largest R2.

3. Select a single best model from among M0, ..., Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2.
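As an illustration, here is a minimal Python sketch of this procedure (an assumption of these notes, not code from the course): it enumerates all 2^p subsets, scores each candidate by the RSS of an ordinary least squares fit, and records the best model of each size. It is only practical for small p.

import itertools
import numpy as np

def rss(X, y):
    # Residual sum of squares of a least squares fit with an intercept.
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return float(resid @ resid)

def best_subset(X, y):
    # Return (columns, RSS) of the best model Mk for each size k = 0, ..., p.
    n, p = X.shape
    best = {0: ((), float(((y - y.mean()) ** 2).sum()))}   # M0: null model, RSS = TSS
    for k in range(1, p + 1):
        candidates = ((cols, rss(X[:, list(cols)], y))
                      for cols in itertools.combinations(range(p), k))
        best[k] = min(candidates, key=lambda c: c[1])       # smallest RSS wins
    return best   # then choose among M0, ..., Mp by Cp, BIC, adjusted R2, or CV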

Credit Data Example

• outcome variable: balance (credit card debt).

• quantitative predictors: age, cards (number of credit cards), education (years of education), income (in thousands of dollars), limit (credit limit), and rating (credit rating).

• qualitative variables: gender, student (student status), and ethnicity (Caucasian, African American or Asian).

Credit Data Example

[Figure: scatterplot matrix of the Credit data variables Balance, Age, Cards, Education, Income, Limit, and Rating.]

Credit Data Example

[Figure: RSS (left panel) and R2 (right panel) plotted against the number of predictors.]

For each possible model containing a subset of the predictors in the Credit data, the RSS and R2 are displayed. The red frontier tracks the best model of each size, according to RSS and R2.

Best Subset Selection

• Great idea in principle!

• In practice, it might not be feasible: it requires consideration of 2^p models, which is prohibitive when p is even moderate in size.

• Furthermore, consideration of so many models results in a high risk of overfitting.

• For these reasons, stepwise methods can be a good alternative!

Forward Stepwise Selection

• Start with a model containing no predictors, and add predictors one at a time.

• At each step, add the predictor that leads to the greatest improvement in the fit.

Forward Stepwise Selection

Algorithm for Forward Stepwise selection

1. Let M0 denote the null model, which contains no predictors.

2. For k = 0, 1, ..., p − 1:

• Fit all p − k models that augment the predictors in Mk with one additional predictor.

• Choose the best among these p − k models, and call it Mk+1. Here best is defined as having the smallest RSS, or highest R2.

3. Select a single best model from among M0, ..., Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2.
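A minimal Python sketch of this greedy search, reusing the rss() helper from the best subset sketch above (again an illustration under that assumption, not course code):

def forward_stepwise(X, y):
    # Greedily add, at each step, the predictor that most reduces the RSS.
    p = X.shape[1]
    selected, remaining, path = [], list(range(p)), [[]]     # path[0] is M0
    for _ in range(p):
        scores = [(rss(X[:, selected + [j]], y), j) for j in remaining]
        _, best_j = min(scores)                              # smallest RSS
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(list(selected))
    return path   # M0, M1, ..., Mp; pick one by Cp, BIC, adjusted R2, or CV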

Forward Stepwise Selection

• Huge computational advantage over best subset selection: roughly 1 + p(p + 1)/2 model fits versus 2^p.

• Not guaranteed to find the best model out of all 2^p possible models involving p predictors.

• Can be applied even in the high-dimensional setting where n < p.

• However, in this case it is possible to construct submodels M0, ..., Mn−1 only. (Why?)

Credit Data Example

# Variables   Best subset                      Forward stepwise
One           rating                           rating
Two           rating, income                   rating, income
Three         rating, income, student          rating, income, student
Four          cards, income, student, limit    rating, income, student, limit

The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth models differ.

Backward Stepwise Selection

• Start with a model containing all of the predictors, and remove predictors one at a time.

• At each step, remove the predictor that is least useful in predicting the response.

Backward Stepwise Selection

Algorithm for Backward Stepwise selection

1. Let Mp denote the full model, which contains all p predictors.

2. For k = p, p − 1, ..., 1 :

• Consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors.

• Choose the best among these k models, and call it Mk−1. Here best is defined as having the smallest RSS, or highest R2.

3. Select a single best model from among M0, ..., Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2.
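The same kind of sketch for the backward search (it assumes n > p so that the full model can be fit, and reuses the rss() helper defined earlier; illustrative only):

def backward_stepwise(X, y):
    # Greedily drop, at each step, the predictor whose removal increases the RSS least.
    selected = list(range(X.shape[1]))
    path = [list(selected)]                                   # Mp: full model
    while selected:
        scores = [(rss(X[:, [c for c in selected if c != j]], y), j)
                  for j in selected]
        _, drop_j = min(scores)                               # cheapest predictor to drop
        selected.remove(drop_j)
        path.append(list(selected))
    return path[::-1]   # reordered as M0, ..., Mp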

Backward Stepwise Selection

• Like forward stepwise, backward stepwise has a huge computational advantage over best subset selection: 1 + p(p + 1)/2 versus 2^p.

• Like forward stepwise, not guaranteed to find the best model out of all 2^p possible models involving p predictors.

• Unlike forward stepwise, can be applied only when n > p: we must have more observations than features in order to fit the initial model containing all predictors.

Hybrid Approaches

• The best subset, forward stepwise, and backward stepwise selection approaches generally give similar but not identical models.

• As another alternative, hybrid versions of forward and backward stepwise selection are available:

• variables are added to the model sequentially, in analogy to forward selection.

• after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit.

• Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.

Choosing The Best Model

• For any of the subset selection methods, we need to choose among models of different sizes: M0, ..., Mp.

• If we choose the model with the smallest RSS or the largest R2, then we’ll always end up with the largest model.

• This is because RSS and R2 are related to training error. What we really care about is test error.

• How can we select the model with smallest test error?

Estimating Test Error

We have two options

• Indirectly estimate the test error, by making an adjustment to the training error to account for the bias due to overfitting.

• Directly estimate the test error using cross-validation or the validation set approach.

Cp, AIC, and BIC

• Adjust training error for model size.

• Can be used to select among models with different numbers of predictors.

• Formula for Mallows' Cp:

Cp = (1/n)(RSS + 2dσ̂²),

where σ̂² is an estimate of the error variance in the linear regression model, and d is the number of predictors in the model.

• Essentially, the Cp statistic adds a penalty of 2dσ̂² to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error.

• One can show that if σ̂² is an unbiased estimate of σ², then Cp is an unbiased estimate of the test MSE.

AIC

• Akaike information criterion (AIC) is defined for a large class of models fit by maximum likelihood.

• In general, for any likelihood-based model,

AIC = −2 × (maximized log-likelihood) + 2 × (number of parameters).

• In the case of a model with normally distributed errors, maximum likelihood and least squares are the same thing. In this case AIC is given by

AIC = (1/(nσ̂²))(RSS + 2dσ̂²),

where for simplicity we have omitted an additive constant. AIC is proportional to Cp for least squares models.

BIC

• Bayesian information criterion (BIC) is derived from a Bayesian point of view, but ends up looking similar to Cp (and AIC) as well.

• Formula for BIC (up to irrelevant constants):

BIC = (1/n)(RSS + log(n)dσ̂²).

• Like Cp, the BIC will tend to take on a small value for a model with a low test error, so generally we select the model that has the lowest BIC value.

• The BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than Cp. (Why?)

Adjusted R2

• AIC, BIC and Cp can be hard to apply because they require an estimate of the error variance, σ².

• Adjusted R2 applies an adjustment to the usual R2 in order to account for the number of features in the model:

Adjusted R2 = 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)],

where TSS is the total sum of squares, Σ_{i=1}^n (yi − ȳ)².

• Recall that the usual R2 is defined as

R2 = 1 − RSS/TSS.

• So adjusted R2 adjusts (decreases) the usual R2 in order to account for the number of parameters in the model.
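A minimal sketch of these four criteria as Python functions, assuming rss_k is the training RSS of a model with d predictors, tss the total sum of squares, n the sample size, and sigma2_hat an estimate of the error variance (for example, from the full least squares fit):

import numpy as np

def cp(rss_k, n, d, sigma2_hat):
    return (rss_k + 2 * d * sigma2_hat) / n

def aic(rss_k, n, d, sigma2_hat):
    return (rss_k + 2 * d * sigma2_hat) / (n * sigma2_hat)   # proportional to Cp

def bic(rss_k, n, d, sigma2_hat):
    return (rss_k + np.log(n) * d * sigma2_hat) / n

def adjusted_r2(rss_k, tss, n, d):
    return 1.0 - (rss_k / (n - d - 1)) / (tss / (n - 1))

For Cp, AIC, and BIC the preferred model has the smallest value; for adjusted R2 it has the largest.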

Cross-Validation or Validation Set Approach

• Instead of adjusting the training error to obtain an estimate of the test error, we can instead directly estimate the test error using cross-validation or the validation set approach.

• Cross-validation and the validation set approach can always be applied, regardless of the type of model.

Credit Data Example: Cp, BIC, adjusted R2

[Figure: Cp (left), BIC (center), and adjusted R2 (right) plotted against the number of predictors.]

Cp, BIC, and adjusted R2 are shown for the best model of each size for the Credit data set. No need for > 4 variables.

Credit Data Example: BIC, Cross-validation, Validation Set

[Figure: square root of BIC (left), validation set error (center), and cross-validation error (right) plotted against the number of predictors.]

BIC, validation set error, and cross-validation error are shown for the best model of each size for the Credit data set. No need for > 4 variables.

Shrinkage Methods

• Subset selection methods use least squares to fit a model involving a subset of the predictors.

• Shrinkage methods instead fit a model using all p predictors, using a technique that shrinks or regularizes the coefficient estimates towards zero.

• Shrinking the coefficients leads to estimates with much lower variance, often leading to better results!

• Two of the most important shrinkage approaches: ridge regression and the lasso.

Ridge Regression

• Least squares seeks β0, β1, ..., βp that minimize

RSS = Σ_{i=1}^n (yi − β0 − Σ_{j=1}^p βj xij)² = (Y − Xβ)^T (Y − Xβ).

• Ridge regression seeks β0, β1, ..., βp that minimize

(Y − Xβ)^T (Y − Xβ) + λ Σ_{j=1}^p βj² = RSS + λ Σ_{j=1}^p βj².

• λ ≥ 0 is a tuning parameter that controls amount of shrinkage.
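For a fixed λ, the ridge criterion has a closed-form minimizer, β̂ = (XᵀX + λI)⁻¹Xᵀy, once the intercept is handled separately. A minimal Python sketch, under the assumption that the predictors and response are centered (and, in practice, the predictors are also standardized; see the standardization slide below):

import numpy as np

def ridge_coefficients(X, y, lam):
    # Center X and y so the unpenalized intercept can be recovered afterwards.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta   # lam = 0 recovers the least squares fit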

Ridge Regression ...

• λ Σ_{j=1}^p βj² is the shrinkage penalty.

• The shrinkage penalty is applied to β1, β2, ..., βp, but not to the intercept β0.

• λ ≥ 0 is a tuning parameter that controls the amount of shrinkage.

• When λ = 0, we get the least squares estimates.

• When λ > 0, we get estimates that are shrunken towards zero.

• Select a good value of λ via cross-validation.

Credit Data Example

Look at the plot in the next slide:

• Left: each curve displays ridge estimates for a variable, as λ varies.

• Right: each curve displays ridge estimates, as ‖βR_λ‖2/‖β‖2 varies.

• Here β is the vector of least squares estimates, and ‖·‖2 is the ℓ2 norm: ‖β‖2 = √(Σ_{j=1}^p βj²).

• The x-axis of the right-hand panel shows the amount by which the ridge regression coefficient estimates have been shrunken towards zero.

Credit Data Example ...

Standardized ridge regression coefficients:

[Figure: standardized ridge regression coefficients for Income, Limit, Rating, and Student, plotted against λ (left) and ‖βR_λ‖2/‖β‖2 (right).]

Standardize the Variables

• For least squares linear regression, scaling a variable by a constant has no effect: if you multiply Xj by c, then βj will get multiplied by 1/c, and so Xjβj will not be affected.

• For ridge regression, it does matter whether you scale the variables, since large coefficients are heavily shrunken towards zero.

• Before performing ridge regression, you should standardize the variables:

x̃ij = xij / √((1/n) Σ_{i=1}^n (xij − x̄j)²).
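A minimal Python sketch of this standardization: each column is divided by its standard deviation computed with the 1/n convention, matching the formula above.

import numpy as np

def standardize(X):
    # Scale each predictor to have standard deviation 1 (numpy's std uses 1/n by default).
    sd = X.std(axis=0)
    return X / sd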

Why Does Ridge Regression Improve Over Least Squares?

• Ridge regression's advantage over least squares is rooted in the bias-variance trade-off.

• As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.

• Look at the plot in the next slide.

• A simulated data set containing p = 45 predictors and n = 50 observations.

• The horizontal dashed lines indicate the minimum possible MSE.

Ridge Regression Predictions on a Simulated Data Set

Squared bias (black), variance (green), and test MSE (purple) for the ridge regression predictions, plotted against λ (left panel) and ‖βR_λ‖2/‖β‖2 (right panel).

Advantage of Ridge Regression

• When the number of variables p is almost as large as the number of observations n, as in the example above, the least squares estimates will be extremely variable.

• If p > n, then the least squares estimates do not even have a unique solution.

• Ridge regression works best in situations where the least squares estimates have high variance.

• Ridge regression also has substantial computational advantages over best subset selection, which requires searching through 2^p models.

Disadvantage of Ridge Regression

• Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.

• In short, ridge regression fails to shrink any coefficients exactly to zero.

• Increasing the value of λ will tend to reduce the magnitudes of the coefficients, but will not result in the exclusion of any of the variables.

• This may not be a problem for prediction accuracy, but it creates a challenge for model interpretation, especially in settings where p is quite large.

What can we do then?

The Lasso

• Lasso: Tibshirani (1996), Least Absolute Shrinkage and Selection Operator.

• Sometimes we want a sparse model in which some coefficients are exactly zero.

• To accomplish this, we can use the lasso.

• The lasso seeks β0, β1, ..., βp that minimize

(Y − Xβ)^T (Y − Xβ) + λ Σ_{j=1}^p |βj| = RSS + λ Σ_{j=1}^p |βj|.

• λ ≥ 0 is a tuning parameter that controls the amount of shrinkage.

• The lasso uses an absolute value (ℓ1) penalty on β instead of the ℓ2 penalty used in ridge regression.
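A minimal sketch of fitting the lasso with scikit-learn (assuming that library is available). Note that sklearn's Lasso minimizes (1/(2n))·RSS + α·Σ|βj|, so its α plays the role of λ/(2n) in the notation used here:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def lasso_fit(X, y, lam):
    n = X.shape[0]
    Xs = StandardScaler().fit_transform(X)          # standardize predictors first
    model = Lasso(alpha=lam / (2 * n)).fit(Xs, y)
    return model.intercept_, model.coef_            # some coefficients may be exactly 0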

The Lasso ...

• As with ridge regression, the lasso shrinks the coefficient estimates towards zero.

• The ℓ1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when λ is sufficiently large (sparse models).

• Hence, much like best subset selection, the lasso performs variable selection (easier to interpret!).

• The lasso retains the good features of both subset selection and ridge regression.

• Select a good value of λ via cross-validation.

• Standardization of predictors is preferred as well.

Credit Data Example

Look at the plot in the next slide:

• Left: each curve displays lasso estimates for a variable, as λ varies.

• Right: each curve displays lasso estimates, as ‖βL_λ‖1/‖β‖1 varies.

• Here β is the vector of least squares estimates, and ‖·‖1 is the ℓ1 norm: ‖β‖1 = Σ_{j=1}^p |βj|.

• Notice the difference between the ridge regression and lasso models.

The standardized lasso coefficients on the Credit data set

[Figure: standardized lasso coefficients for Income, Limit, Rating, and Student, plotted against λ (left) and ‖βL_λ‖1/‖β‖1 (right).]

Alternative Formulation for Ridge and Lasso

One can show that the lasso and ridge regression coefficient estimates solve the problems

minimize_β (Y − Xβ)^T (Y − Xβ)  subject to  Σ_{j=1}^p |βj| ≤ s,

and

minimize_β (Y − Xβ)^T (Y − Xβ)  subject to  Σ_{j=1}^p βj² ≤ s,

respectively.

Understanding the Formulation

• There is a one-to-one correspondence between the parameters λ and s, for the lasso (or for ridge regression).

• The lasso formulation suggests: find the coefficient estimates leading to the smallest RSS, subject to the constraint that there is a budget s for Σ_{j=1}^p |βj|.

• When s is extremely large...

• When s is small...

• Similar conclusion for ridge regression formulation.

Understanding the Formulation ...

• The formulations also reveal a close connection between the lasso, ridge regression, and best subset selection.

• Consider the problem

minimize_β (Y − Xβ)^T (Y − Xβ)  subject to  Σ_{j=1}^p I(βj ≠ 0) ≤ s.

• Here I(βj ≠ 0) is an indicator variable: it equals 1 if βj ≠ 0, and 0 otherwise.

• What does this formulation suggest?

• Computational burden?

The Variable Selection Property of the Lasso

• Why does lasso, unlike ridge, yield sparse coefficient estimates?

• The constrained formulations of the lasso and ridge regression can be used to shed light on this issue.

• When p = 2, the formulations indicate that

• the lasso coefficient estimates have the smallest RSS out of all points that lie within the diamond defined by |β1| + |β2| ≤ s.

• the ridge regression estimates have the smallest RSS out of all points that lie within the circle defined by β1² + β2² ≤ s.

Take a look at the figure in the following ...

The Lasso/Ridge Picture

β is the least squares solution.

The Lasso/Ridge Picture

• The ellipses that are centered around β represent regions of constant RSS (contours of the RSS).

• As the ellipses expand away from β, the RSS increases.

• The lasso and ridge regression coefficient estimates: the first point at which an ellipse contacts the constraint region.

• Ridge: the intersection will not generally occur on an axis.

• Lasso: the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero.

• In higher dimensions, many of the coefficient estimates may equal zero simultaneously.

• When p = 3, the constraint region for ridge regression becomes a sphere, and the constraint region for the lasso becomes a polyhedron. The key ideas depicted in the figure still hold.

Lasso Versus Ridge

• It is clear that the lasso has a major advantage over ridge regression, in that it produces simpler and more interpretable models that involve only a subset of the predictors.

• However, which method leads to better prediction accuracy?

• Consider the following plots, obtained by applying the lasso and ridge regression to the previously simulated data set (n = 50, p = 45).

• Left: squared bias (black), variance (green), and test mean squared error (purple) for the lasso.

• Right: both the lasso and ridge regression plotted against R2 on the training data.

The Lasso/Ridge Picture

The dotted lines represent the ridge regression fits. All 45 predictors were related to the response.

[Figure: mean squared error plotted against λ (left) and against R2 on the training data (right).]

The Lasso/Ridge Picture ...

The response is a function of only 2 out of the 45 predictors.

[Figure: mean squared error plotted against λ (left) and against R2 on the training data (right).]

The Lasso/Ridge Comparison: Conclusions

• Neither ridge regression nor the lasso will universally dominate the other.

• The lasso performs better where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero.

• Ridge regression performs better where the response is a function of many predictors, all with coefficients of roughly equal size.

• However, the number of predictors that is related to the response is never known a priori for real data sets.

• Use cross-validation to determine which approach is better on a particular data set.

Brief Summary of the Lasso

• As with ridge regression, when the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions.

• Unlike ridge regression, the lasso performs variable selection, and hence results in models that are easier to interpret.

A Simple Case Study for Ridge and Lasso

• Consider a simple special case with n = p, and X a diagonal matrix with 1's on the diagonal and 0's in all off-diagonal elements. Also assume we are performing regression without an intercept.

• With these assumptions, the usual least squares problem simplifies to finding β1, ..., βp that minimize

Σ_{j=1}^p (yj − βj)².

In this case, the least squares solution is given by

βj = yj.

• The ridge regression: expression?

• The lasso: expression?

A Simple Case Study for Ridge and Lasso

• The ridge regression estimates take the form

βR_j = yj/(1 + λ).

• The lasso estimates take the form

βL_j = yj − λ/2   if yj > λ/2;
       yj + λ/2   if yj < −λ/2;
       0          if |yj| ≤ λ/2.

• Look at the figure in the following ...
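A minimal numeric sketch of these two special-case estimates in Python (the lasso rule is exactly soft-thresholding at λ/2):

import numpy as np

def ridge_special_case(y, lam):
    # Every coefficient is shrunk by the same proportion.
    return y / (1.0 + lam)

def lasso_special_case(y, lam):
    # Soft-thresholding: shrink by lambda/2 and set small values exactly to zero.
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)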

The Lasso/Ridge Picture

[Figure: ridge regression (left) and lasso (right) coefficient estimates plotted against yj, together with the least squares estimates.]

Shrinkage by Ridge and Lasso

• Ridge regression and the lasso perform two very different types of shrinkage.

• Ridge regression: more or less shrinks every dimension of the data by the same proportion.

• Lasso: more or less shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.

• The fact that some lasso coefficients are shrunken entirely to zero explains why the lasso performs feature selection.

• Soft-thresholding: the type of shrinkage performed by the lasso.

Selecting the Tuning Parameter

• Implementing ridge regression and the lasso requires a method for selecting a value for the tuning parameter λ, or equivalently, the value of the constraint s.

• Cross-validation provides a simple way to tackle this problem.

• Here are the general steps:

1. We choose a grid of λ values, and compute the cross-validation error for each value of λ.

2. Select the tuning parameter value for which the cross-validation error is smallest.

3. The model is re-fit using all of the available observations and the selected value of the tuning parameter.
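A minimal sketch of these steps using scikit-learn's LassoCV, which builds the grid of λ values, cross-validates over it, and refits on all observations at the selected value (an assumption for illustration; RidgeCV plays the analogous role for ridge regression):

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def lasso_with_cv(X, y, n_lambdas=100, folds=10):
    Xs = StandardScaler().fit_transform(X)              # standardize predictors
    model = LassoCV(n_alphas=n_lambdas, cv=folds).fit(Xs, y)
    return model.alpha_, model.coef_                    # selected penalty and final fit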

Leave-one-out CV: Ridge Regression on Credit data

Left: the cross-validation error as a function of λ. Right: the coefficient estimates as a function of λ. Vertical dashed lines: the selected value of λ.

Lasso fit on the sparse simulated data

• Ten-fold cross-validation is applied to the lasso fits on the sparse simulated data (the response is a function of 2 out of 45 predictors).

• The figure in the following provides an illustration of lasso fitting.

• Left: the cross-validation error

• Right: the coefficient estimates.

• The two colored lines: signal variables.

• Grey lines: noise variables.

ten-fold cross-validation: lasso

[Figure: cross-validation error (left) and standardized coefficients (right), each plotted against ‖βL_λ‖1/‖β‖1.]

ten-fold cross-validation: lasso

• The lasso correctly gives much larger coefficient estimates to the two signal predictors.

• The minimum cross-validation error corresponds to a set of coefficient estimates for which only the signal variables are non-zero.

• Hence cross-validation together with the lasso has correctly identified the two signal variables in the model.

• In contrast, the least squares solution assigns a large coefficient estimate to only one of the two signal variables.

Summary

• Neither ridge nor lasso will universally dominate the other.

• When there are many variables relative to the number of observations, both can far outperform least squares!

• In general, we expect the lasso to perform better when the response is a function of relatively few predictors.

• However, the number of predictors that is related to the response is never known a priori for real data sets.

• Cross-validation can be used to determine which approach is better on a particular data set.

