4. Linear Model Selection and Regularization
Jesper Armouti-Hansen
University of Cologne
January 14, 2019
Course website
jeshan49.github.io/eemp2/
Today
Lecture:
- Subset Selection
- Shrinkage/Regularization
- Dimension Reduction

Tutorial: Reproducing results from the lecture using:
- Forward/Backward Subset Selection
- Ridge and Lasso Regression
- Principal Component and Partial Least Squares Regression

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Introduction
In the regression setting, the standard linear model

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon \quad (1)$$

is commonly used to describe the relationship between the response and the input.

This linear model has an obvious advantage over non-linear methods in terms of model interpretability.

In addition, it is surprisingly competitive with non-linear methods in many settings in terms of prediction accuracy.

Today, we will discuss alternative fitting strategies to least squares that may improve the fit.
Why use alternative fitting strategies over least squares?
Let us first consider prediction accuracy:
Recall the bias-variance trade-off: in general, too simple (complex) models will have high (low) bias and low (high) variance.

Suppose now that the true relationship between the input and output is approximately linear.

Then, our linear model in (1) will have low bias. In addition, if N >> p, it will have low variance as well.

However, if N is not much larger than p, there is a lot of variability in the fit, and hence high variance. In addition, if N < p, there is no longer a unique least squares solution and the variance is infinite.

By constraining or shrinking the coefficients, we can reduce the variance substantially at the cost of a small increase in bias.
Let us now consider model interpretability:
Often some or many of the p predictors are not associated with the response.

Including these predictors leads to unnecessary complexity in the resulting model.

This is because the least squares fit is extremely unlikely to yield coefficients that are exactly zero.

By setting some of the coefficients to zero, we obtain a more easily interpretable model.

We will thus consider methods that automatically perform feature selection.
Three alternative methods
Today we’ll discuss three alternative methods:
1 Subset selection:
Identify a subset of the p predictors that we believe are related to Y. Then fit using least squares on this subset.
2 Shrinkage:
Fit on all p predictors using least squares, subject to a constraint on the size of the coefficients. This shrinkage/regularization reduces the variance.
3 Dimension reduction:
Project the p predictors onto an M-dimensional subspace, where M < p. Then fit the model with the M new predictors using least squares.
Best subset selection
Algorithm 1 Best subset selection
1: Let M0 denote the null model, which contains no predictors.
2: for k = 1 to p do
3:   (a) Fit all $\binom{p}{k}$ models that contain exactly k predictors.
4:   (b) Pick the best among these $\binom{p}{k}$ models, and call it Mk.
         Here the best is defined as having the smallest RSS or highest R².
5: end for
6: Select a single best model from M0, ..., Mp using CV, Cp, AIC, BIC or Adj. R².
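The tutorial reproduces this procedure; as a rough illustration, the following Python/NumPy sketch (illustrative names and data layout, not course code) fits OLS on every k-subset and keeps the lowest-RSS model Mk for each size, leaving step 6 to a separate criterion or CV:

```python
# Sketch of Algorithm 1 with plain NumPy (assumes a numeric design matrix X
# of shape (n, p) and a response vector y of length n).
import itertools
import numpy as np

def best_subset(X, y):
    """Return, for each size k = 1..p, the lowest-RSS subset and its RSS (model M_k)."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for subset in itertools.combinations(range(p), k):
            Xk = np.column_stack([np.ones(n), X[:, list(subset)]])  # add intercept column
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)            # least squares fit
            rss = np.sum((y - Xk @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, subset
        best[k] = (best_vars, best_rss)
    return best  # step 6 (choosing among M_1..M_p) is done separately, e.g. by CV
```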
Example
Figure: For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R² are displayed. The red frontier tracks the best model for a given number of predictors. (See ISLR p. 206)
Some notes on best subset selection
The same idea of best subset selection can be applied to a wide array of models, e.g. logistic regression.

While the method is simple, it suffers from computational limitations:

If p = 10, we must fit $2^{10} = 1,024$ models.

If p = 20, we must fit $2^{20} = 1,048,576$ models.

Thus, best subset selection becomes infeasible for p > 40.

In addition, the method may suffer from overfitting and high variance of the coefficient estimates for large p.

We will now consider computationally efficient compromises with a smaller search space: stepwise selection.
Forward stepwise selection
Algorithm 2 Forward stepwise selection
1: Let M0 denote the null model, which contains no predictors.
2: for k = 0 to p − 1 do
3:   (a) Consider all p − k models that augment the predictors in Mk with one additional predictor.
4:   (b) Pick the best among these p − k models, and call it Mk+1.
         Here the best is defined as having the smallest RSS or highest R².
5: end for
6: Select a single best model from M0, ..., Mp using CV, Cp, AIC, BIC or Adj. R².
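A minimal greedy implementation of Algorithm 2 might look as follows (a Python sketch with illustrative names; scikit-learn's SequentialFeatureSelector offers a cross-validated variant of the same idea):

```python
# Sketch of forward stepwise selection (Algorithm 2): at each step add the
# predictor that most reduces the RSS.
import numpy as np

def rss_of(X, y, cols):
    n = X.shape[0]
    Xk = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    return np.sum((y - Xk @ beta) ** 2)

def forward_stepwise(X, y):
    p = X.shape[1]
    selected, path = [], []
    for _ in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # steps (a)/(b): among the p - k one-variable augmentations, keep the best
        best_j = min(remaining, key=lambda j: rss_of(X, y, selected + [j]))
        selected = selected + [best_j]
        path.append(list(selected))          # this is M_{k+1}
    return path  # choose among M_1..M_p afterwards via CV, Cp, AIC, BIC or Adj. R^2
```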
Some notes on forward stepwise selection
Forward stepwise selection has a clear computational advantage over best subset selection:

For p = 20, the latter fits $2^{20} = 1,048,576$ models, whereas the former only fits $1 + \sum_{k=0}^{p-1}(p - k) = 1 + p(p+1)/2 = 211$ models.

Furthermore, it may perform better due to its smaller search space.

However, it may fail to select the best model:

Suppose that p = 3 and the best model is the two-variable model with X2, X3.

If the best one-variable model uses X1, then forward stepwise selection will fail to find the best model.
Backward stepwise selection
Algorithm 3 Backward stepwise selection
1: Let Mp denote the full model, which contains all p predictors.
2: for k = p to 1 do
3:   (a) Consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors.
4:   (b) Pick the best among these k models, and call it Mk−1.
         Here the best is defined as having the smallest RSS or highest R².
5: end for
6: Select a single best model from M0, ..., Mp using CV, Cp, AIC, BIC or Adj. R².
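Backward elimination can be coded analogously to the forward sketch above; alternatively, scikit-learn ships a cross-validated variant. The sketch below (synthetic data, illustrative target size) uses SequentialFeatureSelector with direction="backward", which scores candidate deletions by CV rather than by RSS as in Algorithm 3:

```python
# Backward stepwise-style selection via scikit-learn (CV-scored variant).
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,    # illustrative target size
    direction="backward",      # start from the full model and drop predictors
    cv=10,
)
sfs.fit(X, y)
print(sfs.get_support())       # boolean mask of the retained predictors
```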
Some notes on backward stepwise selection
Like forward stepwise selection, backward stepwise selection only fits 1 + p(p + 1)/2 models.

Thus, it can be used with large p.

However, once again, it is not guaranteed to select the best model containing a subset of the p predictors.

Furthermore, backward stepwise selection can only be used in settings in which N > p.

In contrast, we can always use forward stepwise selection, at least up to a particular number of predictors.
Choosing the optimal model
Each of the preceding methods yields a best model for each number of predictors 0, 1, ..., p.

Thus, at the end, we need to select one best model from these candidates.

We do not want to use RSS or R², as they are directly related to the training error.

As we know, the training error can be a poor estimate of the test error.

Thus, we consider two alternative approaches:

1 Indirectly estimate the test error by making an adjustment to the training error.

2 Directly estimate the test error by using the validation set or the cross-validation approach.
Cp, AIC, BIC, and Adj. R²

Let d be the number of predictors in the model under consideration and $\hat{\sigma}^2 = RSS/(N - p - 1)$ an estimate of $\mathrm{Var}[\varepsilon]$ from the full model.

Mallow's Cp:

$$C_p = \frac{1}{N}\left(RSS + 2d\hat{\sigma}^2\right) \quad (2)$$

Akaike information criterion:

$$AIC = \frac{1}{N\hat{\sigma}^2}\left(RSS + 2d\hat{\sigma}^2\right) = \frac{1}{\hat{\sigma}^2} \cdot C_p \quad (3)$$

Bayesian information criterion:

$$BIC = \frac{1}{N\hat{\sigma}^2}\left(RSS + \log(N)\, d\hat{\sigma}^2\right) \quad (4)$$

Adjusted R²:

$$\text{Adj. } R^2 = 1 - \frac{RSS/(N - d - 1)}{TSS/(N - 1)} \quad (5)$$
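A small helper that evaluates the four criteria for a candidate model with d predictors could look like this (a sketch assuming RSS and TSS are precomputed and sigma2_hat is the full-model estimate defined above):

```python
# Compute Cp, AIC, BIC and Adj. R^2 as defined in (2)-(5); lower is better for
# the first three, higher is better for Adj. R^2.
import numpy as np

def selection_criteria(rss, tss, N, d, sigma2_hat):
    cp     = (rss + 2 * d * sigma2_hat) / N
    aic    = (rss + 2 * d * sigma2_hat) / (N * sigma2_hat)         # = Cp / sigma2_hat
    bic    = (rss + np.log(N) * d * sigma2_hat) / (N * sigma2_hat)
    adj_r2 = 1 - (rss / (N - d - 1)) / (tss / (N - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "Adj. R2": adj_r2}
```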
Example (Best subset selection)
Figure: Cp, BIC, and Adj. R² for the best model of each size on the Credit data set. (See ISLR p. 211)
Cp: income, limit, rating, cards, age, student
BIC : income, limit, cards, student
Adj. R2: income, limit, rating, cards, age, student, gender
Validation and Cross-Validation
As we already know, we can also use the validation set approach or k-fold CV for the task of model selection.
Figure: Credit data set. Left: square root of BIC. Center: validation set errors. Right: 10-fold CV errors. (See ISLR p. 214)
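As a sketch of the direct approach, one can cross-validate the best model of each size returned by, say, the forward_stepwise sketch above and pick the size with the smallest estimated test MSE (illustrative Python, assuming `path` is that list of selected-column lists):

```python
# Estimate test error by 10-fold CV for the best model of each size.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_error_by_size(X, y, path, cv=10):
    errors = []
    for cols in path:                       # one candidate model per size
        mse = -cross_val_score(LinearRegression(), X[:, cols], y,
                               scoring="neg_mean_squared_error", cv=cv).mean()
        errors.append(mse)
    return np.array(errors)                 # pick the size with the smallest CV error
```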
Shrinkage methods - Ridge Regression
As an alternative to subset selection, we can fit a model on all p predictors, whilst shrinking the coefficients toward zero.

We will see that this can reduce the variance of our model.

Ridge regression solves:

$$\min_{\beta} \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2 \quad (6)$$

where λ ≥ 0 is a tuning/hyper-parameter; it controls the magnitude of regularization vs. fit.

We find an estimate $\hat{\beta}^R_\lambda$ for many values of λ, and then choose the optimal λ by CV. This is not computationally expensive.
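In practice this takes only a few lines, e.g. with scikit-learn (a sketch with illustrative data and grid; λ is called alpha there, and the predictors are standardized first, as discussed on the next slide):

```python
# Ridge regression with standardized predictors and lambda chosen by CV.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
alphas = np.logspace(-2, 4, 100)                 # candidate lambda values
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=10))
ridge.fit(X, y)
print(ridge[-1].alpha_)                          # lambda selected by CV
print(ridge[-1].coef_)                           # shrunken (standardized) coefficients
```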
Notes on Ridge regression
Shrinkage is applied to β1, ..., βp but not to β0. This is because we want to shrink the estimated association of each variable with the response, not the intercept.

Standard least squares coefficient estimates are scale invariant:

If we rescale Xj, the fitted contribution $X_j\hat{\beta}_j$ will remain the same.

Ridge coefficient estimates, however, can change substantially:

$X_j\hat{\beta}^R_{j,\lambda}$ may depend not only on λ and its own predictor's scale, but also on the scale of the other predictors.

Thus, it is best to apply ridge regression after standardizing the predictors.
Example
Figure: The standardized ridge regression coefficients (Income, Limit, Rating, Student) for the Credit data set, as a function of λ and $\|\hat{\beta}^R_\lambda\|_2/\|\hat{\beta}\|_2$. (See ISLR p. 216)
Why does Ridge regression improve over Least squares?
As λ increases, the flexibility of the fit decreases, leading to decreased variance but increased bias.

Figure: Squared bias (black), variance (green), and test MSE (purple) for the ridge regression predictions on a simulated data set, as a function of λ and $\|\hat{\beta}^R_\lambda\|_2/\|\hat{\beta}\|_2$. The dashed line indicates the minimum MSE. (See ISLR p. 218)
Shrinkage methods - Lasso
Ridge regression has one obvious limitation: none of the coefficient estimates will be exactly zero, unless λ = ∞.

The lasso solves:

$$\min_{\beta} \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j| \quad (7)$$

The penalty term of the lasso has the effect of setting some coefficient estimates exactly to zero for a sufficiently large, finite λ.

We say that the lasso yields sparse models – models that involve only a subset of the variables.

We find an estimate $\hat{\beta}^L_\lambda$ for many values of λ, and then choose the optimal λ by CV.
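The corresponding scikit-learn sketch uses LassoCV, which fits the model along a path of λ values and selects one by k-fold CV (illustrative data; note the exact zeros among the fitted coefficients):

```python
# Lasso with standardized predictors and lambda chosen by 10-fold CV.
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
lasso = make_pipeline(StandardScaler(), LassoCV(cv=10))
lasso.fit(X, y)
print(lasso[-1].alpha_)      # lambda chosen by CV
print(lasso[-1].coef_)       # several coefficients are exactly zero (sparse model)
```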
Example
Figure: The standardized lasso coefficients (Income, Limit, Rating, Student) for the Credit data set, as a function of λ and $\|\hat{\beta}^L_\lambda\|_1/\|\hat{\beta}\|_1$. (See ISLR p. 220)
Another representation for Ridge regression and Lasso
One can show that (i) best subset selection, (ii) ridge regression, and (iii) the lasso can be formulated as follows.

Best subset selection:

$$\min_{\beta} \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} I(\beta_j \neq 0) \leq s \quad (8)$$

Ridge regression:

$$\min_{\beta} \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 \leq s \quad (9)$$

Lasso:

$$\min_{\beta} \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \leq s \quad (10)$$
The variable selection property of Lasso
Figure: Contours of the error and constraint functions for the lasso (left) and ridge regression (right). The solid blue areas are the constraint regions, $|\beta_1| + |\beta_2| \leq s$ and $\beta_1^2 + \beta_2^2 \leq s$. (See ISLR p. 222)
Comparing Ridge regression and Lasso
The lasso has an advantage over ridge regression in terms of model interpretability.

But what about prediction accuracy?

The lasso implicitly assumes that some of the predictors are unrelated to the response.

If this assumption holds, then the lasso can perform better. If not, then ridge regression will in general perform better.

It is possible to combine both approaches in one method:

Elastic net regression: its penalty is a convex combination of the ridge and lasso penalties.
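In scikit-learn this compromise looks roughly as follows (a sketch; l1_ratio interpolates between an almost-pure ridge penalty and the lasso penalty, and both it and λ can be selected by CV):

```python
# Elastic net: penalty mixes the ridge (L2) and lasso (L1) terms.
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

enet = make_pipeline(StandardScaler(),
                     ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=10))
# After enet.fit(X, y): enet[-1].l1_ratio_ and enet[-1].alpha_ give the chosen
# mixing parameter and lambda; coefficients can still be exactly zero.
```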
Dimension reduction methods
Methods that project the p predictors onto M dimensions and then fit a least squares model on the transformed variables.

Let Z1, ..., ZM represent M < p linear combinations of the p predictors:

$$Z_m = \sum_{j=1}^{p} \phi_{jm} X_j \quad \forall m \quad (11)$$

for some constants $\phi_{1m}, \dots, \phi_{pm}$.

We then fit the linear regression model by least squares:

$$y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \varepsilon_i, \quad i = 1, \dots, n \quad (12)$$

With this approach, we are reducing the dimension of the problem from p + 1 to M + 1.
Principal Component Regression (PCR)
To perform PCR, we apply principal component analysis (PCA).

PCA is a technique for reducing the dimension of our data.

Our goal: ending up with M < p principal components which summarize the majority of the variation in our data.

Assumption: the directions in which X1, ..., Xp show the most variation are the directions that are associated with the response.

1st PC (Z1): the linear combination of the predictors with the largest variance.

Or: the line that is as close as possible to the data.

(i+1)th PC (Zi+1): the linear combination of the predictors with the largest variance, subject to being uncorrelated with the ith PC.

Or: the line that is as close as possible to the data, subject to being orthogonal to the ith PC.
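A PCR sketch in Python (illustrative data): standardize, keep the first M principal components, regress the response on the component scores, and choose M by 10-fold CV:

```python
# Principal component regression with M chosen by cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
cv_mse = []
for m in range(1, X.shape[1] + 1):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=m), LinearRegression())
    mse = -cross_val_score(pcr, X, y, scoring="neg_mean_squared_error", cv=10).mean()
    cv_mse.append(mse)
best_m = int(np.argmin(cv_mse)) + 1   # number of components with the lowest CV error
```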
Example
Figure: The population size and ad spending for 100 different cities are shown as purple circles. First PC (green), second PC (blue, dashed). (See ISLR p. 230)
Example (cont’d)
Figure: Plots of the first (top) and second (bottom) PC scores vs. population (left) and ad spending (right). (See ISLR pp. 233-234)
Partial Least Squares (PLS)
PCR identifies Z1, ..., ZM in an unsupervised way – i.e. without considering the response.

Thus, it may be that the directions/components are not the best predictors of the response.

Unlike PCR, PLS defines Z1, ..., ZM in a supervised way:

PLS computes Z1 by setting each φj1 in (11) to the coefficient from the simple linear regression of Y onto Xj.

This coefficient is proportional to the correlation between Y and Xj. Thus, PLS places the most weight on variables that are strongly correlated with the response.

Subsequent directions are found by taking residuals and repeating the process.
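A PLS sketch, analogous to the PCR one (illustrative data; scikit-learn's PLSRegression standardizes internally by default, and the number of directions is again chosen by CV):

```python
# Partial least squares with the number of directions chosen by 10-fold CV.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
cv_mse = [-cross_val_score(PLSRegression(n_components=m), X, y,
                           scoring="neg_mean_squared_error", cv=10).mean()
          for m in range(1, 11)]
best_m = int(np.argmin(cv_mse)) + 1   # number of PLS directions with the lowest CV error
```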
Notes on PCR and PLS
With both PCR and PLS, we standardize the predictors before applying the methods.

With both PCR and PLS, we generally locate the optimal number of directions by cross-validation.

In general, the supervised dimension reduction of PLS can reduce bias.

However, this can come at the cost of an increase in variance.

In practice, PCR, PLS and ridge regression often perform similarly in terms of prediction accuracy.
References
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). Springer. Chapter 6.