Statistical Methods for Data Mining
Kuangnan Fang
Xiamen University Email: [email protected]
February 14, 2016
Linear Model Selection and Regularization
• Recall the linear model
Y = β0 + β1X1 + · · · + βpXp + ε.
• In the lectures that follow, we consider some approaches for extending the linear model framework. In the lectures covering Chapter 7 of the text, we generalize the linear model in order to accommodate non-linear, but still additive, relationships.
• In the lectures covering Chapter 8 we consider even more general non-linear models.
1 / 57
In praise of linear models!
• Despite its simplicity, the linear model has distinct advantages in terms of its interpretability and often shows good predictive performance.
• Hence we discuss in this lecture some ways in which the simple linear model can be improved, by replacing ordinary least squares fitting with some alternative fitting procedures.
2 / 57
Why consider alternatives to least squares?
• Prediction Accuracy: especially when p > n, to control the variance.
• Model Interpretability: By removing irrelevant features — that is, by setting the corresponding coefficient estimates to zero — we can obtain a model that is more easily interpreted. We will present some approaches for automatically performing feature selection.
3 / 57
Three classes of methods
• Subset Selection. We identify a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.
• Shrinkage. We fit a model involving all p predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection.
• Dimension Reduction. We project the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.
4 / 57
Subset Selection
Best subset and stepwise model selection procedures
Best Subset Selection
1. Let M0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For k = 1, 2, . . . , p:
   (a) Fit all (p choose k) models that contain exactly k predictors.
   (b) Pick the best among these (p choose k) models, and call it Mk. Here best is defined as having the smallest RSS, or equivalently the largest R².
3. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
5 / 57
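As a concrete sketch of the three steps above (illustrative only; the helper names `rss` and `best_subset` are ours, not from the text, and NumPy is assumed):

```python
import itertools
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X (with intercept)."""
    Xi = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    r = y - Xi @ beta
    return float(r @ r)

def best_subset(X, y):
    """For each size k = 1..p, return the predictor set with smallest RSS (model M_k)."""
    p = X.shape[1]
    best = {}
    for k in range(1, p + 1):
        # Fit all (p choose k) models with exactly k predictors; keep the best.
        best[k] = min(itertools.combinations(range(p), k),
                      key=lambda S: rss(X[:, S], y))
    return best

# Toy data: the response depends only on columns 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=100)
print(best_subset(X, y)[2])  # the best 2-predictor model should be (0, 2)
```

Step 3 (choosing among M0, . . . , Mp) would then use cross-validation or one of the criteria discussed later, not RSS.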
Example- Credit data set
[Figure: RSS (left) and R² (right) plotted against the number of predictors, 1 to 11.]

For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R² are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R². Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
6 / 57
Extensions to other models
• Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression.
• The deviance — negative two times the maximized log-likelihood — plays the role of RSS for a broader class of models.
7 / 57
Stepwise Selection
• For computational reasons, best subset selection cannot be applied with very large p. Why not?
• Best subset selection may also suffer from statistical problems when p is large: the larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data.
• Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates.
• For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.
8 / 57
Forward Stepwise Selection
• Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.
• In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.
9 / 57
In Detail
Forward Stepwise Selection
1. Let M0 denote the null model, which contains no predictors.
2. For k = 0, . . . , p − 1:
   2.1 Consider all p − k models that augment the predictors in Mk with one additional predictor.
   2.2 Choose the best among these p − k models, and call it Mk+1. Here best is defined as having smallest RSS or highest R².
3. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
10 / 57
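A minimal NumPy sketch of the greedy loop above (illustrative; `rss` and `forward_stepwise` are names invented here):

```python
import numpy as np

def rss(X, y):
    """RSS of the least squares fit of y on X (with intercept)."""
    Xi = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    r = y - Xi @ beta
    return float(r @ r)

def forward_stepwise(X, y):
    """Return the nested sequence of selected index sets M_0 ⊂ M_1 ⊂ ... ⊂ M_p."""
    p = X.shape[1]
    selected, path = [], [()]
    remaining = list(range(p))
    while remaining:
        # Add the single predictor giving the greatest RSS reduction.
        j = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(j)
        remaining.remove(j)
        path.append(tuple(selected))
    return path

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = 4 * X[:, 1] + 2 * X[:, 3] + rng.normal(scale=0.1, size=80)
path = forward_stepwise(X, y)
print(path[1], path[2])  # the strongest predictor enters first
```

Only p + (p−1) + · · · + 1 fits are needed here, versus 2^p for best subset selection.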
More on Forward Stepwise Selection
• Computational advantage over best subset selection is clear.
• It is not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors. Why not? Give an example.
11 / 57
Credit data example
# Variables   Best subset                      Forward stepwise
One           rating                           rating
Two           rating, income                   rating, income
Three         rating, income, student          rating, income, student
Four          cards, income, student, limit    rating, income, student, limit

The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth models differ.
12 / 57
Backward Stepwise Selection
• Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection.
• However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.
13 / 57
Backward Stepwise Selection: details
Backward Stepwise Selection
1. Let Mp denote the full model, which contains all p predictors.
2. For k = p, p − 1, . . . , 1:
   2.1 Consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors.
   2.2 Choose the best among these k models, and call it Mk−1. Here best is defined as having smallest RSS or highest R².
3. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
14 / 57
More on Backward Stepwise Selection
• Like forward stepwise selection, the backward selection approach searches through only 1 + p(p + 1)/2 models, and so can be applied in settings where p is too large to apply best subset selection.
• Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors.
• Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.
15 / 57
Choosing the Optimal Model
• The model containing all of the predictors will always have the smallest RSS and the largest R², since these quantities are related to the training error.
• We wish to choose a model with low test error, not a model with low training error. Recall that training error is usually a poor estimate of test error.
• Therefore, RSS and R² are not suitable for selecting the best model among a collection of models with different numbers of predictors.
16 / 57
Estimating test error: two approaches
• We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
• We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in previous lectures.
• We illustrate both approaches next.
17 / 57
Cp, AIC, BIC, and Adjusted R²

• These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables.
• The next figure displays Cp, BIC, and adjusted R² for the best model of each size produced by best subset selection on the Credit data set.
18 / 57
Credit data example
[Figure: Cp (left), BIC (center), and adjusted R² (right) for the best model of each size on the Credit data set, plotted against the number of predictors.]
19 / 57
Now for some details
• Mallow's Cp:

  Cp = (1/n)( RSS + 2 d σ̂² ),

  where d is the total # of parameters used and σ̂² is an estimate of the variance of the error ε associated with each response measurement.
• The AIC criterion is defined for a large class of models fit by maximum likelihood:

  AIC = −2 log L + 2 d,

  where L is the maximized value of the likelihood function for the estimated model.
• In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and Cp and AIC are equivalent. Prove this.
20 / 57
Details on BIC
  BIC = (1/n)( RSS + log(n) d σ̂² ).

• Like Cp, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.
• Notice that BIC replaces the 2dσ̂² used by Cp with a log(n)dσ̂² term, where n is the number of observations.
• Since log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than Cp. See Figure on slide 19.
21 / 57
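The Cp and BIC formulas above are simple to compute once σ̂² is available; in this sketch (illustrative names; not from the text) we estimate σ̂² from the full model, a common but not mandated choice:

```python
import numpy as np

def fit_rss(X, y):
    """RSS of the least squares fit of y on X (with intercept)."""
    Xi = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    r = y - Xi @ beta
    return float(r @ r)

def cp_bic(rss, n, d, sigma2):
    """Mallow's Cp and BIC exactly as defined on the slides (d = # parameters)."""
    cp = (rss + 2 * d * sigma2) / n
    bic = (rss + np.log(n) * d * sigma2) / n
    return cp, bic

rng = np.random.default_rng(2)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] + rng.normal(size=n)          # only the first predictor matters
sigma2 = fit_rss(X, y) / (n - p - 1)          # error-variance estimate from full model

# Compare the true 1-variable model (d = 2: intercept + slope) with the full model.
cp1, bic1 = cp_bic(fit_rss(X[:, :1], y), n, d=2, sigma2=sigma2)
cpp, bicp = cp_bic(fit_rss(X, y), n, d=p + 1, sigma2=sigma2)
print(bic1 < bicp)  # BIC's heavier log(n) penalty should favor the smaller model
```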
Adjusted R2
• For a least squares model with d variables, the adjusted R² statistic is calculated as

  Adjusted R² = 1 − [ RSS/(n − d − 1) ] / [ TSS/(n − 1) ],

  where TSS is the total sum of squares.
• Unlike Cp, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted R² indicates a model with a small test error.
• Maximizing the adjusted R² is equivalent to minimizing RSS/(n − d − 1). While RSS always decreases as the number of variables in the model increases, RSS/(n − d − 1) may increase or decrease, due to the presence of d in the denominator.
• Unlike the R² statistic, the adjusted R² statistic pays a price for the inclusion of unnecessary variables in the model. See Figure on slide 19.
22 / 57
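The formula above in a few lines (illustrative sketch; the function name is ours):

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R² = 1 - [RSS/(n-d-1)] / [TSS/(n-1)] for the least squares fit."""
    n, d = X.shape
    Xi = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    rss = float(np.sum((y - Xi @ beta) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

rng = np.random.default_rng(3)
n = 150
X = rng.normal(size=(n, 10))
y = 5 * X[:, 0] + rng.normal(size=n)   # nine of the ten predictors are pure noise
# Compare the true 1-variable model with the full 10-variable model; the nine
# noise predictors give little or no improvement in adjusted R².
print(adjusted_r2(X[:, :1], y), adjusted_r2(X, y))
```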
Validation and Cross-Validation
• Each of the procedures returns a sequence of models Mk indexed by model size k = 0, 1, 2, . . .. Our job here is to select k̂. Once selected, we will return model Mk̂.
• We compute the validation set error or the cross-validation error for each model Mk under consideration, and then select the k for which the resulting estimated test error is smallest.
• This procedure has an advantage relative to AIC, BIC, Cp, and adjusted R², in that it provides a direct estimate of the test error, and doesn't require an estimate of the error variance σ².
• It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance σ².
23 / 57
Credit data example
[Figure: Square root of BIC (left), validation set error (center), and cross-validation error (right) for the Credit data set, plotted against the number of predictors.]
24 / 57
Details of Previous Figure
• The validation errors were calculated by randomly selecting three-quarters of the observations as the training set, and the remainder as the validation set.
• The cross-validation errors were computed using k = 10 folds. In this case, the validation and cross-validation methods both result in a six-variable model.
• However, all three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors.
• In this setting, we can select a model using the one-standard-error rule. We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve. What is the rationale for this?
25 / 57
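The one-standard-error rule is easy to state in code; the CV curve below is made-up illustrative data (sizes 4-6 are statistically indistinguishable, as on the slide):

```python
import numpy as np

def one_se_rule(model_sizes, cv_mean, cv_se):
    """Smallest model whose estimated test error is within one SE of the minimum."""
    cv_mean, cv_se = np.asarray(cv_mean), np.asarray(cv_se)
    best = int(np.argmin(cv_mean))
    threshold = cv_mean[best] + cv_se[best]
    # Among all model sizes meeting the threshold, take the smallest.
    eligible = [s for s, m in zip(model_sizes, cv_mean) if m <= threshold]
    return min(eligible)

sizes = [1, 2, 3, 4, 5, 6, 7]
mean  = [210, 180, 150, 122, 121, 120, 123]   # minimum at size 6
se    = [5, 5, 5, 4, 4, 4, 4]
print(one_se_rule(sizes, mean, se))  # → 4: the smallest model within one SE
```

The rationale: if several models are statistically tied, prefer the simplest one.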
Shrinkage Methods
Ridge regression and Lasso
• The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
• As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
• It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
26 / 57
Ridge regression
• Recall that the least squares fitting procedure estimates β0, β1, . . . , βp using the values that minimize

  RSS = Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )².

• In contrast, the ridge regression coefficient estimates β̂R are the values that minimize

  Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )² + λ Σ_{j=1}^p βj² = RSS + λ Σ_{j=1}^p βj²,

  where λ ≥ 0 is a tuning parameter, to be determined separately.
27 / 57
Ridge regression
• Writing the criterion in matrix form,

  RSS(λ) = (y − Xβ)ᵀ(y − Xβ) + λ βᵀβ,

• the ridge regression solutions are easily seen to be

  β̂_ridge = (XᵀX + λI)⁻¹ Xᵀ y,

  where I is the p × p identity matrix.
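The closed-form solution above can be checked numerically; a sketch assuming centered/standardized X and centered y, so the intercept can be ignored (an assumption of this example, not a requirement of the formula):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution (XᵀX + λI)⁻¹ Xᵀy.
    Assumes X is standardized and y centered, so no intercept is needed."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 5))
X = (X - X.mean(0)) / X.std(0)
y = X @ np.array([3.0, 0.0, -1.0, 0.0, 2.0]) + rng.normal(size=60)
y = y - y.mean()

b0 = ridge(X, y, lam=0.0)      # λ = 0 recovers ordinary least squares
b_big = ridge(X, y, lam=1e6)   # a huge λ shrinks every coefficient toward zero
print(np.abs(b_big).max() < np.abs(b0).max())
```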
Ridge regression: continued
• As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.
• However, the second term, λ Σ_j βj², called a shrinkage penalty, is small when β1, . . . , βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero.
• The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates.
• Selecting a good value for λ is critical; cross-validation is used for this.
28 / 57
Credit data example
[Figure: Standardized ridge coefficient paths for the Credit data (Income, Limit, Rating, Student), plotted against λ (left) and against ‖β̂R_λ‖₂/‖β̂‖₂ (right).]
29 / 57
Details of Previous Figure
• In the left-hand panel, each curve corresponds to the ridge regression coefficient estimate for one of the ten variables, plotted as a function of λ.
• The right-hand panel displays the same ridge coefficient estimates as the left-hand panel, but instead of displaying λ on the x-axis, we now display ‖β̂R_λ‖₂/‖β̂‖₂, where β̂ denotes the vector of least squares coefficient estimates.
• The notation ‖β‖₂ denotes the ℓ2 norm (pronounced “ell 2”) of a vector, and is defined as ‖β‖₂ = sqrt( Σ_{j=1}^p βj² ).
30 / 57
Ridge regression: scaling of predictors
• The standard least squares coefficient estimates are scale equivariant: multiplying Xj by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of 1/c. In other words, regardless of how the jth predictor is scaled, Xj β̂j will remain the same.
• In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
• Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula

  x̃ij = xij / sqrt( (1/n) Σ_{i=1}^n (xij − x̄j)² ).
31 / 57
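A one-line NumPy version of the standardization formula above (note that `np.std` uses the 1/n convention by default, matching the slide's formula):

```python
import numpy as np

def standardize(X):
    """x̃_ij = x_ij / sqrt((1/n) Σ_i (x_ij − x̄_j)²): divide each column by its
    (population) standard deviation, as in the slide's formula."""
    return X / X.std(axis=0)

rng = np.random.default_rng(5)
# Columns on wildly different scales, as in raw data.
X = rng.normal(size=(50, 3)) * np.array([1.0, 100.0, 0.01])
Xs = standardize(X)
print(Xs.std(axis=0))  # every column now has standard deviation 1
```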
Why Does Ridge Regression Improve Over Least Squares?

The Bias-Variance tradeoff

[Figure: Squared bias, variance, and test MSE for ridge regression, plotted against λ (left) and against ‖β̂R_λ‖₂/‖β̂‖₂ (right).]

Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients. Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of λ and ‖β̂R_λ‖₂/‖β̂‖₂. The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.

32 / 57
The Lasso
• Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
• The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, β̂L_λ, minimize the quantity

  Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )² + λ Σ_{j=1}^p |βj| = RSS + λ Σ_{j=1}^p |βj|.

• In statistical parlance, the lasso uses an ℓ1 (pronounced “ell 1”) penalty instead of an ℓ2 penalty. The ℓ1 norm of a coefficient vector β is given by ‖β‖₁ = Σ |βj|.
33 / 57
The Lasso: continued
• As with ridge regression, the lasso shrinks the coefficient estimates towards zero.
• However, in the case of the lasso, the ℓ1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large.
• Hence, much like best subset selection, the lasso performs variable selection.
• We say that the lasso yields sparse models — that is, models that involve only a subset of the variables.
• As in ridge regression, selecting a good value of λ for the lasso is critical; cross-validation is again the method of choice.
34 / 57
Example: Credit dataset
[Figure: Standardized lasso coefficient paths for the Credit data (Income, Limit, Rating, Student), plotted against λ (left) and against ‖β̂L_λ‖₁/‖β̂‖₁ (right).]
35 / 57
The Variable Selection Property of the Lasso
Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?

One can show that the lasso and ridge regression coefficient estimates solve the problems

  minimize_β Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )²   subject to Σ_{j=1}^p |βj| ≤ s

and

  minimize_β Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )²   subject to Σ_{j=1}^p βj² ≤ s,

respectively.
36 / 57
The Lasso Picture

[Figure: Contours of the RSS together with the constraint regions |β1| + |β2| ≤ s (lasso, a diamond) and β1² + β2² ≤ s (ridge, a circle); the corners of the ℓ1 region make solutions with some coefficients exactly zero likely.]
37 / 57
Coordinate descent algorithm for Lasso
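The algorithm itself did not survive extraction here. Below is a standard coordinate descent sketch with soft-thresholding, assuming standardized columns, centered y, and a (1/2n) scaling of the RSS (one common convention, not necessarily the lecture's):

```python
import numpy as np

def soft_threshold(z, g):
    """S(z, γ) = sign(z)·max(|z| − γ, 0), the soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/(2n))‖y − Xβ‖² + λ‖β‖₁.
    Assumes the columns of X are standardized and y is centered."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding predictor j.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return beta

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 8))
X = (X - X.mean(0)) / X.std(0)
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(scale=0.5, size=100)
y = y - y.mean()
beta = lasso_cd(X, y, lam=0.3)
print(np.nonzero(np.abs(beta) > 1e-8)[0])  # only a few coefficients survive
```

Each coordinate update is exact for a single βj holding the others fixed, which is why the ℓ1 penalty zeroes coefficients outright.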
Comparing the Lasso and Ridge Regression
[Figure: Lasso squared bias, variance, and test MSE against λ (left); lasso vs. ridge plotted against training R² (right).]

Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on the simulated data set of Slide 32. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their R² on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.
38 / 57
Comparing the Lasso and Ridge Regression: continued
[Figure: Lasso squared bias, variance, and test MSE against λ (left); lasso vs. ridge plotted against training R² (right), for the sparse setting.]

Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso. The simulated data is similar to that in Slide 38, except that now only two predictors are related to the response. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their R² on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.
39 / 57
Conclusions
• These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other.
• In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.
• However, the number of predictors that is related to the response is never known a priori for real data sets.
• A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.
40 / 57
Adaptive Lasso
Generalization of shrinkage methods

• We can generalize ridge regression and the lasso. Consider the criterion

  RSS + λ Σ_{j=1}^p |βj|^q.

• For q > 1, |βj|^q is differentiable at 0, and so the penalty does not share the lasso's ability to set coefficients exactly to zero.
• Zou and Hastie (2005) introduced the elastic-net penalty

  λ Σ_{j=1}^p ( α βj² + (1 − α) |βj| ),

  a different compromise between ridge and lasso.

MCP

• Zhang (2010) proposed the MCP penalty.

SCAD

• Fan and Li (2001) proposed the SCAD penalty.

Group LASSO

• Yuan and Lin (2006) proposed the group LASSO penalty.

Composite MCP

• Composite MCP:

  P_outer( Σ_{k=1}^{p_j} P_inner( |β_k^{(j)}| ) )

• Composite MCP for logistic regression

adSGL

Fang, Wang and Ma (2015) propose the adaptive sparse group lasso:

  min { (1/2) ‖ y − Σ_{j=1}^J Xj β^{(j)} ‖₂² + λ(1 − α) Σ_{j=1}^J wj ‖β^{(j)}‖₂ + λα Σ_{j=1}^J ξ^{(j)ᵀ} |β^{(j)}| }   (1)

where W = (w1, . . . , wJ)ᵀ ∈ ℝ^J₊ is the group weight vector, ξᵀ = (ξ^{(1)ᵀ}, . . . , ξ^{(J)ᵀ}) = (ξ^{(1)}_1, . . . , ξ^{(1)}_{p1}, . . . , ξ^{(J)}_1, . . . , ξ^{(J)}_{pJ}) ∈ ℝ^p₊ denotes the individual weights, and λ ∈ ℝ₊ is the tuning parameter. For different groups, the penalty level can be different. By adopting a lower penalty for large coefficients and a higher penalty for small ones, we expect this to improve variable selection accuracy and reduce estimation bias.

adSGL

We use the group bridge estimator to construct these two types of weights:

  wj = ( ‖ β̂^{(j)}(GB) ‖₁ + 1/n )⁻¹,

  ξ^{(j)}_i = ( | β̂^{(j)}_i(GB) | + 1/n )⁻¹.
Selecting the Tuning Parameter for Ridge Regression and Lasso

• As for subset selection, for ridge regression and lasso we require a method to determine which of the models under consideration is best.
• That is, we require a method for selecting a value for the tuning parameter λ or, equivalently, the value of the constraint s.
• Cross-validation provides a simple way to tackle this problem. We choose a grid of λ values, and compute the cross-validation error rate for each value of λ.
• We then select the tuning parameter value for which the cross-validation error is smallest.
• Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
41 / 57
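The grid-search procedure above can be sketched by hand (illustrative code using the closed-form ridge fit; in practice a library routine would be used):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge fit; assumes standardized X and centered y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_ridge(X, y, lams, k=10, seed=0):
    """K-fold cross-validation MSE for each λ on a grid; returns the best λ."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    errs = []
    for lam in lams:
        mse = 0.0
        for fold in folds:
            train = np.setdiff1d(np.arange(n), fold)
            b = ridge(X[train], y[train], lam)
            mse += np.mean((y[fold] - X[fold] @ b) ** 2)
        errs.append(mse / k)
    return lams[int(np.argmin(errs))], errs

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 20))
y = X @ rng.normal(size=20) * 0.5 + rng.normal(size=100)
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam, errs = cv_ridge(X, y, grid)
print(best_lam)
```

The final step on the slide (re-fitting on all observations with the chosen λ) would be `ridge(X, y, best_lam)`.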
Credit data example
[Figure: Ridge regression on the Credit data: cross-validation error (left) and standardized coefficient paths (right), both plotted against λ.]

Left: Cross-validation errors that result from applying ridge regression to the Credit data set with various values of λ. Right: The coefficient estimates as a function of λ. The vertical dashed lines indicate the value of λ selected by cross-validation.
42 / 57
Simulated data example
[Figure: Ten-fold cross-validation MSE for the lasso (left) and the corresponding standardized coefficient paths (right), both plotted against ‖β̂L_λ‖₁/‖β̂‖₁.]

Left: Ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set from Slide 39. Right: The corresponding lasso coefficient estimates are displayed. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.
43 / 57
Dimension Reduction Methods
• The methods that we have discussed so far in this chapter have involved fitting linear regression models, via least squares or a shrunken approach, using the original predictors, X1, X2, . . . , Xp.
• We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.
44 / 57
Dimension Reduction Methods: details
• Let Z1, Z2, . . . , ZM represent M < p linear combinations of our original p predictors. That is,

  Zm = Σ_{j=1}^p φmj Xj   (1)

  for some constants φm1, . . . , φmp.
• We can then fit the linear regression model

  yi = θ0 + Σ_{m=1}^M θm zim + εi,   i = 1, . . . , n,   (2)

  using ordinary least squares.
• Note that in model (2), the regression coefficients are given by θ0, θ1, . . . , θM. If the constants φm1, . . . , φmp are chosen wisely, then such dimension reduction approaches can often outperform OLS regression.
45 / 57
• Notice that from definition (1),

  Σ_{m=1}^M θm zim = Σ_{m=1}^M θm Σ_{j=1}^p φmj xij = Σ_{j=1}^p ( Σ_{m=1}^M θm φmj ) xij = Σ_{j=1}^p βj xij,

  where

  βj = Σ_{m=1}^M θm φmj.   (3)

• Hence model (2) can be thought of as a special case of the original linear regression model.
• Dimension reduction serves to constrain the estimated βj coefficients, since now they must take the form (3).
• Can win in the bias-variance tradeoff.
46 / 57
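The identity (3) can be verified numerically: fitting y on the constructed Z and mapping θ back through the φ's gives exactly the same fitted values as a linear model in the original X (all data here are synthetic, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, M = 60, 5, 2
X = rng.normal(size=(n, p))
phi = rng.normal(size=(M, p))          # constants φ_m1, ..., φ_mp
Z = X @ phi.T                          # Z_m = Σ_j φ_mj X_j
y = rng.normal(size=n)

# Fit y on Z by least squares: intercept θ_0 plus θ_1, ..., θ_M.
Zi = np.column_stack([np.ones(n), Z])
theta, *_ = np.linalg.lstsq(Zi, y, rcond=None)

# Implied coefficients on the original predictors: β_j = Σ_m θ_m φ_mj (equation (3)).
beta = phi.T @ theta[1:]
fitted_from_X = theta[0] + X @ beta
fitted_from_Z = Zi @ theta
print(np.allclose(fitted_from_X, fitted_from_Z))  # → True: model (2) is linear in X
```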
Principal Components Regression
• Here we apply principal components analysis (PCA) (discussed in Chapter 10 of the text) to define the linear combinations of the predictors, for use in our regression.
• The first principal component is that (normalized) linear combination of the variables with the largest variance.
• The second principal component has largest variance, subject to being uncorrelated with the first.
• And so on.
• Hence with many correlated original variables, we replace them with a small set of principal components that capture their joint variation.
47 / 57
Principal Components Analysis
• PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated.
• Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization.
5 / 52
Principal Components Analysis: details
• The first principal component of a set of features X1, X2, . . . , Xp is the normalized linear combination of the features

  Z1 = φ11 X1 + φ21 X2 + . . . + φp1 Xp

  that has the largest variance. By normalized, we mean that Σ_{j=1}^p φj1² = 1.
• We refer to the elements φ11, . . . , φp1 as the loadings of the first principal component; together, the loadings make up the principal component loading vector, φ1 = (φ11 φ21 . . . φp1)ᵀ.
• We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.
6 / 52
PCA: example
[Figure: Scatter of pop versus ad with the two principal component directions.]

The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component direction, and the blue dashed line indicates the second principal component direction.
7 / 52
Pictures of PCA: continued
[Figure: First principal component scores plotted against pop (left) and ad (right).]

Plots of the first principal component scores zi1 versus pop and ad. The relationships are strong.
50 / 57
Pictures of PCA: continued
[Figure: Second principal component scores plotted against pop (left) and ad (right).]

Plots of the second principal component scores zi2 versus pop and ad. The relationships are weak.
51 / 57
Computation of Principal Components
• Suppose we have an n × p data set X. Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero (that is, the column means of X are zero).
• We then look for the linear combination of the sample feature values of the form

  zi1 = φ11 xi1 + φ21 xi2 + . . . + φp1 xip   (1)

  for i = 1, . . . , n that has largest sample variance, subject to the constraint that Σ_{j=1}^p φj1² = 1.
• Since each of the xij has mean zero, then so does zi1 (for any values of φj1). Hence the sample variance of the zi1 can be written as (1/n) Σ_{i=1}^n zi1².
8 / 52
Computation: continued
• Plugging in (1), the first principal component loading vector solves the optimization problem

  maximize_{φ11, . . . , φp1} (1/n) Σ_{i=1}^n ( Σ_{j=1}^p φj1 xij )²   subject to Σ_{j=1}^p φj1² = 1.

• This problem can be solved via a singular-value decomposition of the matrix X, a standard technique in linear algebra.
• We refer to Z1 as the first principal component, with realized values z11, . . . , zn1.
9 / 52
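A NumPy sketch of the SVD computation above: the first right singular vector is the first loading vector, and the component's sample variance is (1/n) times the squared first singular value (synthetic data, for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)
# Correlated synthetic data with one dominant direction of variation.
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])
X = X - X.mean(axis=0)                 # center each column, as the slides assume

# The right singular vectors of X are the principal component directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]                           # first loading vector, ‖φ1‖ = 1
z1 = X @ phi1                          # first principal component scores

# Its sample variance equals (1/n)·(first singular value)².
print(np.isclose(z1.var(), s[0] ** 2 / len(X)))  # → True
```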
Geometry of PCA
• The loading vector φ1 with elements φ11, φ21, . . . , φp1 defines a direction in feature space along which the data vary the most.
• If we project the n data points x1, . . . , xn onto this direction, the projected values are the principal component scores z11, . . . , zn1 themselves.
10 / 52
Further principal components
• The second principal component is the linear combination of X1, . . . , Xp that has maximal variance among all linear combinations that are uncorrelated with Z1.
• The second principal component scores z12, z22, . . . , zn2 take the form

  zi2 = φ12 xi1 + φ22 xi2 + . . . + φp2 xip,

  where φ2 is the second principal component loading vector, with elements φ12, φ22, . . . , φp2.
11 / 52
Further principal components: continued
• It turns out that constraining Z2 to be uncorrelated with Z1 is equivalent to constraining the direction φ2 to be orthogonal (perpendicular) to the direction φ1. And so on.
• The principal component directions φ1, φ2, φ3, . . . are the ordered sequence of right singular vectors of the matrix X, and the variances of the components are 1/n times the squares of the singular values. There are at most min(n − 1, p) principal components.
12 / 52
Illustration
• USArrests data: For each of the fifty states in the United States, the data set contains the number of arrests per 100,000 residents for each of three crimes: Assault, Murder, and Rape. We also record UrbanPop (the percent of the population in each state living in urban areas).
• The principal component score vectors have length n = 50, and the principal component loading vectors have length p = 4.
• PCA was performed after standardizing each variable to have mean zero and standard deviation one.
13 / 52
USAarrests data: PCA plot
[Figure: Biplot of the first two principal components of the USArrests data, showing the state scores and the loading vectors for Murder, Assault, UrbanPop, and Rape.]
14 / 52
Figure details
The first two principal components for the USArrests data.
• The blue state names represent the scores for the first two principal components.
• The orange arrows indicate the first two principal component loading vectors (with axes on the top and right). For example, the loading for Rape on the first component is 0.54, and its loading on the second principal component is 0.17 [the word Rape is centered at the point (0.54, 0.17)].
• This figure is known as a biplot, because it displays both the principal component scores and the principal component loadings.
15 / 52
PCA loadings
            PC1        PC2
Murder      0.5358995  -0.4181809
Assault     0.5831836  -0.1879856
UrbanPop    0.2781909   0.8728062
Rape        0.5434321   0.1673186
16 / 52
Pictures of PCA: continued
A subset of the advertising data. Left: The first principal
component, chosen to minimize the sum of the squared
perpendicular distances to each point, is shown in green. These
distances are represented using the black dashed line segments.
Right: The left-hand panel has been rotated so that the first
principal component lies on the x-axis.
49 / 57
Another Interpretation of Principal Components
[Scatterplot of a simulated data set plotted in the coordinates of the first two principal components.]
17 / 52
PCA finds the hyperplane closest to the observations
• The first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness).
• The notion of principal components as the dimensions that are closest to the n observations extends beyond just the first principal component.
• For instance, the first two principal components of a data set span the plane that is closest to the n observations, in terms of average squared Euclidean distance.
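The closest-line property can be verified directly. A sketch on simulated 2-D data, comparing the first loading vector against a grid of other unit directions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=200)
X = X - X.mean(axis=0)

# First principal component loading vector
_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]

def avg_sq_perp_dist(X, v):
    """Average squared perpendicular distance from rows of X to the line span{v}."""
    v = v / np.linalg.norm(v)
    proj = np.outer(X @ v, v)          # projections onto the line
    return np.mean(np.sum((X - proj) ** 2, axis=1))

# No unit direction beats the first loading vector
best = avg_sq_perp_dist(X, phi1)
for theta in np.linspace(0, np.pi, 100):
    v = np.array([np.cos(theta), np.sin(theta)])
    assert best <= avg_sq_perp_dist(X, v) + 1e-12
```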
18 / 52
Scaling of the variables matters
• If the variables are in different units, scaling each to have standard deviation equal to one is recommended.
• If they are in the same units, you might or might not scale the variables.
[Biplots of the USArrests data with scaled variables (left, "Scaled") and unscaled variables (right, "Unscaled").]
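The effect of scaling can be reproduced numerically. A sketch with synthetic data in which one column plays the role of a high-variance variable like Assault (the scale factors are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
# Four independent variables; the second has a much larger variance
X = rng.normal(size=(50, 4)) * [3.0, 80.0, 14.0, 9.0]
Xc = X - X.mean(axis=0)

# Unscaled PCA: the high-variance variable dominates the first loading vector
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.round(np.abs(Vt[0]), 2))   # weight on the second variable is near 1

# Scaled PCA: the large-variance column no longer automatically dominates
Z = Xc / Xc.std(axis=0)
_, _, Vt_s = np.linalg.svd(Z, full_matrices=False)
print(np.round(np.abs(Vt_s[0]), 2))
```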
19 / 52
Proportion Variance Explained
• To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one.
• The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as
$$\sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2,$$
and the variance explained by the mth principal component is
$$\mathrm{Var}(Z_m) = \frac{1}{n} \sum_{i=1}^{n} z_{im}^2.$$
• It can be shown that $\sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{m=1}^{M} \mathrm{Var}(Z_m)$, with M = min(n − 1, p).
20 / 52
Proportion Variance Explained: continued
• Therefore, the PVE of the mth principal component is given by the positive quantity between 0 and 1
$$\frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}.$$
• The PVEs sum to one. We sometimes display the cumulative PVEs.
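These formulas translate directly into code. A minimal NumPy sketch on centered synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 4)) @ np.diag([4.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)           # the PVE formula assumes centered variables

_, s, _ = np.linalg.svd(X, full_matrices=False)

total_var = np.sum(X**2)         # sum_j sum_i x_ij^2
pve = s**2 / total_var           # s_m^2 = sum_i z_im^2, so this is the PVE of each component
print(np.round(pve, 3), np.round(np.cumsum(pve), 3))
```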
[Scree plot (left): proportion of variance explained by each principal component. Right: cumulative proportion of variance explained.]
21 / 52
How many principal components should we use?
If we use principal components as a summary of our data, how many components are sufficient?
• No simple answer to this question, as cross-validation is not available for this purpose.
• Why not?
• When could we use cross-validation to select the number of components?
• the "scree plot" on the previous slide can be used as a guide: we look for an "elbow".
22 / 52
Application to Principal Components Regression
PCR was applied to two simulated data sets. The black, green,
and purple lines correspond to squared bias, variance, and test
mean squared error, respectively. Left: Simulated data from
slide 32. Right: Simulated data from slide 39.
52 / 57
Principal Component Regression
In each panel, the irreducible error Var(ε) is shown as a horizontal dashed line. Left: Results for PCR. Right: Results for lasso (solid) and ridge regression (dotted). The x-axis displays the shrinkage factor of the coefficient estimates, defined as the ℓ2 norm of the shrunken coefficient estimates divided by the ℓ2 norm of the least squares estimate.
Choosing the number of directions M
[Left panel: standardized coefficient paths for Income, Limit, Rating, and Student. Right panel: cross-validation MSE. Both are plotted against the number of components.]
Left: PCR standardized coefficient estimates on the Credit data set for different values of M. Right: The 10-fold cross-validation MSE obtained using PCR, as a function of M.
53 / 57
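PCR itself is straightforward to sketch: compute the first M score vectors, then run OLS on them. A minimal NumPy implementation on synthetic data (`pcr_fit` is an illustrative helper, not a library function):

```python
import numpy as np

def pcr_fit(X, y, M):
    """Principal components regression: OLS of y on the first M PC score vectors."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:M].T                # n x M score matrix
    theta, *_ = np.linalg.lstsq(
        np.column_stack([np.ones(len(y)), Z]), y, rcond=None)
    # Map back to coefficients on the original (centered) predictors
    beta = Vt[:M].T @ theta[1:]
    return theta[0], beta

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)
intercept, beta = pcr_fit(X, y, M=3)
```

With M = p, PCR reproduces the ordinary least squares fit, since the M score vectors then span the full column space of the centered predictors.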
Partial Least Squares
• PCR identifies linear combinations, or directions, that best represent the predictors X1, . . . , Xp.
• These directions are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions.
• That is, the response does not supervise the identification of the principal components.
• Consequently, PCR suffers from a potentially serious drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
54 / 57
Partial Least Squares: continued
• Like PCR, PLS is a dimension reduction method, which first identifies a new set of features Z1, . . . , ZM that are linear combinations of the original features, and then fits a linear model via OLS using these M new features.
• But unlike PCR, PLS identifies these new features in a supervised way – that is, it makes use of the response Y in order to identify new features that not only approximate the old features well, but also that are related to the response.
• Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.
55 / 57
Details of Partial Least Squares
• After standardizing the p predictors, PLS computes the first direction Z1 by setting each φ1j in (1) equal to the coefficient from the simple linear regression of Y onto Xj.
• One can show that this coefficient is proportional to the correlation between Y and Xj.
• Hence, in computing $Z_1 = \sum_{j=1}^{p} \phi_{1j} X_j$, PLS places the highest weight on the variables that are most strongly related to the response.
• Subsequent directions are found by taking residuals and then repeating the above prescription.
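The first-direction recipe above can be sketched directly. A minimal NumPy version on synthetic data (`pls_first_direction` is an illustrative helper):

```python
import numpy as np

def pls_first_direction(X, y):
    """First PLS direction: phi_1j equals the simple-regression coefficient of y on X_j."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the predictors
    yc = y - y.mean()
    # Simple linear regression of y onto each standardized X_j
    phi1 = Xs.T @ yc / np.sum(Xs**2, axis=0)
    return phi1, Xs @ phi1                       # direction and Z_1 scores

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 5))
y = 3.0 * X[:, 0] + 0.5 * rng.normal(size=80)   # y is driven mostly by the first variable
phi1, z1 = pls_first_direction(X, y)
# The weight on the first variable is largest in absolute value,
# since it is the variable most strongly related to the response.
```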
56 / 57
Summary
• Model selection methods are an essential tool for data analysis, especially for big datasets involving many predictors.
• Research into methods that give sparsity, such as the lasso, is an especially hot area.
• Later, we will return to sparsity in more detail, and will describe related approaches such as the elastic net.
57 / 57