8/13/2019 Modern Regression - Ridge Regression
1/21
Modern regression 1: Ridge regression
Ryan Tibshirani
Data Mining: 36-462/36-662
March 19 2013
Optional reading: ISL 6.2.1, ESL 3.4.1
Reminder: shortcomings of linear regression
Last time we talked about:
1. Predictive ability: recall that we can decompose prediction error into squared bias and variance. Linear regression has low bias (zero bias) but suffers from high variance. So it may be worth sacrificing some bias to achieve a lower variance
2. Interpretative ability: with a large number of predictors, it can be helpful to identify a smaller subset of important variables. Linear regression doesn't do this
Also: linear regression is not defined when p > n (Homework 4)
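Setup: given fixed covariates $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, we observe
$$y_i = f(x_i) + \epsilon_i, \quad i = 1, \ldots, n,$$
where $f : \mathbb{R}^p \to \mathbb{R}$ is unknown (think $f(x_i) = x_i^T \beta$ for a linear model) and $\epsilon_i \in \mathbb{R}$ with $\mathrm{E}[\epsilon_i] = 0$, $\mathrm{Var}(\epsilon_i) = \sigma^2$, $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$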
Example: subset of small coefficients
Recall our example: we have $n = 50$, $p = 30$, and $\sigma^2 = 1$. The true model is linear with 10 large coefficients (between 0.5 and 1) and 20 small ones (between 0 and 0.3). Histogram:
[Figure: histogram of the true coefficients; x-axis: true coefficients (0.0 to 1.0), y-axis: frequency]
The linear regression fit:
Squared bias ≈ 0.006
Variance ≈ 0.627
Pred. error ≈ 1 + 0.006 + 0.627 ≈ 1.633
We reasoned that we can do better by shrinking the coefficients, to reduce variance
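The decomposition above can be checked numerically. The following is a minimal sketch, assuming that bias and variance refer to the fitted values at the fixed $x_i$, averaged over $i$, and that the true mean is exactly linear. The covariates, the seed, and the function name `linear_regression_bias_variance` are hypothetical and are not taken from the lecture.

```python
import numpy as np

def linear_regression_bias_variance(X, beta, sigma2):
    """Exact in-sample squared bias and variance of least squares fitted values.

    Assumes fixed X with full column rank and a truly linear mean, E[y] = X @ beta.
    """
    n = X.shape[0]
    # Hat matrix H projects y onto the column space of X: fitted = H @ y.
    H = X @ np.linalg.solve(X.T @ X, X.T)
    mean_true = X @ beta
    bias = H @ mean_true - mean_true          # zero up to rounding error
    sq_bias = float(np.mean(bias ** 2))
    # Var(fitted_i) = sigma2 * (H @ H.T)_ii; average over the n points.
    variance = float(sigma2 * np.trace(H @ H.T) / n)
    return sq_bias, variance

rng = np.random.default_rng(0)                # arbitrary seed
n, p, sigma2 = 50, 30, 1.0
X = rng.normal(size=(n, p))                   # hypothetical covariates
beta = np.concatenate([rng.uniform(0.5, 1.0, 10), rng.uniform(0.0, 0.3, 20)])

sq_bias, variance = linear_regression_bias_variance(X, beta, sigma2)
pred_error = sigma2 + sq_bias + variance
```

Under these assumptions the variance equals $\sigma^2 p / n = 0.6$ whatever the coefficients are, which is close to the figures quoted above (those presumably come from simulation).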
[Figure: prediction error (y-axis) versus amount of shrinkage (x-axis, low to high), for linear regression and ridge regression]
Linear regression:
Squared bias ≈ 0.006
Variance ≈ 0.627
Pred. error ≈ 1 + 0.006 + 0.627 ≈ 1.633
Ridge regression, at its best:
Squared bias ≈ 0.077
Variance ≈ 0.403
Pred. error ≈ 1 + 0.077 + 0.403 ≈ 1.48
Ridge regression
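Ridge regression is like least squares but shrinks the estimated coefficients towards zero. Given a response vector $y \in \mathbb{R}^n$ and a predictor matrix $X \in \mathbb{R}^{n \times p}$, the ridge regression coefficients are defined as
$$\hat\beta^{\text{ridge}} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \; \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \; \underbrace{\|y - X\beta\|_2^2}_{\text{Loss}} + \lambda \underbrace{\|\beta\|_2^2}_{\text{Penalty}}$$
Here $\lambda \geq 0$ is a tuning parameter, which controls the strength of the penalty term. Note that:
• When $\lambda = 0$, we get the linear regression estimate
• When $\lambda = \infty$, we get $\hat\beta^{\text{ridge}} = 0$
• For $\lambda$ in between, we are balancing two ideas: fitting a linear model of $y$ on $X$, and shrinking the coefficients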
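The ridge criterion has a closed-form minimizer: setting its gradient to zero gives $(X^T X + \lambda I)\beta = X^T y$. Here is a minimal sketch of that computation, assuming no intercept. The simulated data and the name `ridge_coefficients` are hypothetical and serve only to illustrate the limiting cases listed above.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Minimize ||y - X b||_2^2 + lam * ||b||_2^2 via the normal equations."""
    p = X.shape[1]
    # Setting the gradient to zero gives (X^T X + lam * I) b = X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)                # arbitrary seed
X = rng.normal(size=(50, 30))                 # hypothetical predictors
y = X @ rng.uniform(0.0, 1.0, 30) + rng.normal(size=50)

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary least squares
beta_0 = ridge_coefficients(X, y, 0.0)           # lam = 0: matches least squares
beta_25 = ridge_coefficients(X, y, 25.0)         # shrunken towards zero
beta_huge = ridge_coefficients(X, y, 1e12)       # very large lam: nearly zero
```

For $\lambda > 0$ the matrix $X^T X + \lambda I$ is always invertible, so this estimate exists even when $p > n$.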
Example: visual representation of ridge coefficients
Recall our last example ($n = 50$, $p = 30$, and $\sigma^2 = 1$; 10 large true coefficients, 20 small). Here is a visual representation of the ridge regression coefficients for $\lambda = 25$:
[Figure: the true, linear regression, and ridge regression coefficients shown side by side; y-axis: coefficients]
Important details
When including an intercept term in the regression, we usually leave this coefficient unpenalized. Otherwise we could add some constant amount $c$ to the vector $y$, and this would not result in the same solution. Hence ridge regression with intercept solves
$$(\hat\beta_0, \hat\beta^{\text{ridge}}) = \operatorname*{argmin}_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} \; \|y - \beta_0 \mathbf{1} - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
If we center the columns of $X$, then the intercept estimate ends up just being $\hat\beta_0 = \bar{y}$, so we usually just assume that $y$, $X$ have been centered and don't include an intercept
Also, the penalty term $\|\beta\|_2^2 = \sum_{j=1}^p \beta_j^2$ is unfair if the predictor variables are not on the same scale. (Why?) Therefore, if we know that the variables are not measured in the same units, we typically scale the columns of $X$ (to have sample variance 1), and then we perform ridge regression
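The centering and scaling steps can be sketched as follows. This is an illustration under stated assumptions, not the lecture's code: the function name `ridge_with_intercept`, the simulated data, and the choice to report coefficients on the standardized scale are all hypothetical.

```python
import numpy as np

def ridge_with_intercept(X, y, lam):
    """Ridge fit with an unpenalized intercept, via centering and scaling.

    Returns (intercept, coefficients on the standardized predictors).
    """
    x_mean = X.mean(axis=0)
    x_sd = X.std(axis=0, ddof=1)              # sample standard deviation
    Z = (X - x_mean) / x_sd                   # centered columns, sample variance 1
    y_mean = y.mean()                         # intercept estimate once Z is centered
    p = Z.shape[1]
    coef = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ (y - y_mean))
    return y_mean, coef

rng = np.random.default_rng(2)                # arbitrary seed
# Hypothetical predictors measured in very different units.
X = rng.normal(size=(40, 5)) * np.array([1.0, 10.0, 0.1, 5.0, 2.0])
y = 3.0 + X @ np.array([1.0, 0.1, 10.0, 0.0, -0.5]) + rng.normal(size=40)

b0, b = ridge_with_intercept(X, y, lam=5.0)
# Adding a constant to y moves only the intercept, not the coefficients.
b0_shift, b_shift = ridge_with_intercept(X, y + 7.0, lam=5.0)
```

Because the columns are standardized before the penalty is applied, changing the units of a predictor (multiplying a column of `X` by a constant) leaves `coef` unchanged.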
Bias and variance of ridge regression
The bias and variance are not quite as simple to write down for ridge regression as they were for linear regression, but closed-form expressions are still possible (Homework 4). Recall that
$$\hat\beta^{\text{ridge}} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
The general trend is:
• The bias increases as $\lambda$ (amount of shrinkage) increases
• The variance decreases as $\lambda$ (amount of shrinkage) increases
What is the bias at $\lambda = 0$? The variance at $\lambda = \infty$?
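The trend can be seen numerically. The sketch below assumes that bias and variance refer to the ridge fitted values at the fixed $x_i$, averaged over $i$, and that the true mean is $X\beta$. The data, the grid of $\lambda$ values, and the name `ridge_bias_variance` are hypothetical.

```python
import numpy as np

def ridge_bias_variance(X, beta, sigma2, lam):
    """In-sample squared bias and variance of ridge fitted values (fixed X)."""
    n, p = X.shape
    # Ridge fitted values are linear in y: fitted = S @ y.
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    mean_true = X @ beta
    sq_bias = float(np.mean((S @ mean_true - mean_true) ** 2))
    variance = float(sigma2 * np.trace(S @ S.T) / n)
    return sq_bias, variance

rng = np.random.default_rng(3)                # arbitrary seed
X = rng.normal(size=(50, 30))                 # hypothetical covariates
beta = np.concatenate([rng.uniform(0.5, 1.0, 10), rng.uniform(0.0, 0.3, 20)])

lambdas = [0.0, 1.0, 5.0, 25.0]
# Each entry is a (squared bias, variance) pair for the matching lambda.
results = [ridge_bias_variance(X, beta, 1.0, lam) for lam in lambdas]
```

Reading `results` from left to right, the squared bias grows and the variance falls as $\lambda$ increases.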
Example: bias and variance of ridge regression
Bias and variance for our last example ($n = 50$, $p = 30$, $\sigma^2 = 1$; 10 large true coefficients, 20 small):
[Figure: squared bias (Bias^2) and variance (Var) of ridge regression as functions of $\lambda$, for $\lambda$ from 0 to 25]
Mean squared error for our last example:
[Figure: linear regression MSE, ridge MSE, ridge Bias^2, and ridge Var as functions of $\lambda$, for $\lambda$ from 0 to 25]
Ridge regression in R: see the function lm.ridge in the package MASS, or the glmnet function and package
What you may (should) be thinking now
Thought 1:
“Yeah, OK, but this only works for some values of $\lambda$. So how would we choose $\lambda$ in practice?”
This is actually quite a hard question. We'll talk about this in detail later
Thought 2:
“What happens when none of the coefficients are small?”
In other words, if all the true coefficients are moderately large, is it still helpful to shrink the coefficient estimates? The answer is (perhaps surprisingly) still “yes”. But the advantage of ridge regression here is less dramatic, and the corresponding range for good values of $\lambda$ is smaller
Example: moderate regression coefficients
Same setup as our last example: $n = 50$, $p = 30$, and $\sigma^2 = 1$. Except now the true coefficients are all moderately large (between 0.5 and 1). Histogram:
[Figure: histogram of the true coefficients; x-axis: true coefficients (0.0 to 1.0), y-axis: frequency]
The linear regression fit:
Squared bias ≈ 0.006
Variance ≈ 0.628
Pred. error ≈ 1 + 0.006 + 0.628 ≈ 1.634
Why are these numbers essentially the same as those from the last example, even though the true coefficients changed?
Ridge regression can still outperform linear regression in terms of mean squared error:
[Figure: linear regression MSE, ridge MSE, ridge Bias^2, and ridge Var as functions of $\lambda$, for $\lambda$ from 0 to 30]
Only works for $\lambda$ less than 5, otherwise it is very biased. (Why?)
Variable selection
To the other extreme (of a subset of small coefficients), suppose that there is a group of true coefficients that are identically zero. This means that the mean response doesn't depend on these predictors at all; they are completely extraneous.
The problem of picking out the relevant variables from a larger set is called variable selection. In the linear model setting, this means estimating some coefficients to be exactly zero. Aside from predictive accuracy, this can be very important for the purposes of model interpretation
Thought 3:
“How does ridge regression perform if a group of the true coefficients were exactly zero?”
The answer depends on whether we are interested in prediction or interpretation. We'll consider the former first
Example: subset of zero coefficients
Same general setup as our running example: $n = 50$, $p = 30$, and $\sigma^2 = 1$. Now, the true coefficients: 10 are large (between 0.5 and 1) and 20 are exactly 0. Histogram:
[Figure: histogram of the true coefficients; x-axis: true coefficients (0.0 to 1.0), y-axis: frequency]
The linear regression fit:
Squared bias ≈ 0.006
Variance ≈ 0.627
Pred. error ≈ 1 + 0.006 + 0.627 ≈ 1.633
Note again that these numbers haven't changed
Ridge regression performs well in terms of mean-squared error:
[Figure: linear regression MSE, ridge MSE, ridge Bias^2, and ridge Var as functions of $\lambda$, for $\lambda$ from 0 to 25]
Why is the bias not as large here for large $\lambda$?
Remember that as we vary $\lambda$ we get different ridge regression coefficients; the larger the $\lambda$, the more shrunken. Here we plot them again
[Figure: ridge coefficient paths as functions of $\lambda$, for $\lambda$ from 0 to 25; legend: true nonzero, true zero]
The red paths correspond to the true nonzero coefficients; the gray paths correspond to true zeros. The vertical dashed line at $\lambda = 15$ marks the point above which ridge regression's MSE starts losing to that of linear regression
An important thing to notice is that the gray coefficient paths are not exactly zero; they are shrunken, but still nonzero
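A coefficient path like the one described can be computed by solving the ridge problem on a grid of $\lambda$ values. The sketch below uses freshly simulated data, not the lecture's, and the variable names are hypothetical; the last 20 true coefficients are set to zero to mimic this example.

```python
import numpy as np

rng = np.random.default_rng(4)                # arbitrary seed
n, p = 50, 30
X = rng.normal(size=(n, p))                   # hypothetical predictors
beta_true = np.concatenate([rng.uniform(0.5, 1.0, 10), np.zeros(20)])
y = X @ beta_true + rng.normal(size=n)

lambdas = np.linspace(0.0, 25.0, 26)
# Row k holds the ridge coefficients at lambdas[k].
path = np.array([
    np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y) for lam in lambdas
])

norms = np.linalg.norm(path, axis=1)              # overall size at each lambda
smallest_zero_coef = np.abs(path[:, 10:]).min()   # estimates for the true zeros
```

The norm of the coefficient vector decreases along the grid, yet the estimates for the truly zero coefficients stay nonzero at every $\lambda$ on it.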
Ridge regression doesn't perform variable selection
We can show that ridge regression doesn't set coefficients exactly to zero unless $\lambda = \infty$, in which case they're all zero. Hence ridge regression cannot perform variable selection, and even though it performs well in terms of prediction accuracy, it does poorly in terms of offering a clear interpretation
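One special case makes this easy to see. Assume, purely for illustration and not as part of the lecture, that the columns of $X$ are orthonormal, so $X^T X = I$, and write $\hat\beta^{\text{LS}} = X^T y$ for the least squares estimate. Then:

```latex
\hat\beta^{\text{ridge}}
  = (X^T X + \lambda I)^{-1} X^T y
  = \frac{1}{1 + \lambda} \, X^T y
  = \frac{\hat\beta^{\text{LS}}}{1 + \lambda}
```

Every coefficient is scaled by the same factor $1/(1+\lambda)$, which is positive for every finite $\lambda$. So in this special case a ridge coefficient is exactly zero only if the corresponding least squares coefficient already was.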
E.g., suppose that we are studying the level of prostate-specific antigen (PSA), which is often elevated in men who have prostate cancer. We look at $n = 97$ men with prostate cancer, and $p = 8$ clinical measurements.¹ We are interested in identifying a small number of predictors, say 2 or 3, that drive PSA
[Figure]²
¹ Data from Stamey et al. (1989), “Prostate specific antigen in the diag...”
² Figure from http://www.mens-hormonal-health.com/psa-score.html
Example: ridge regression coefficients for prostate data
We perform ridge regression over a wide range of $\lambda$ values (after centering and scaling). The resulting coefficient profiles:
[Figure, two panels: coefficients versus $\lambda$ (0 to 1000), and coefficients versus df($\lambda$) (0 to 8); paths labeled lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45]
This doesn't give us a clear answer to our question ...
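The quantity df($\lambda$) on the axis of the second panel is not defined on this slide. Assuming it is the effective degrees of freedom as defined in ESL 3.4.1, $\mathrm{df}(\lambda) = \mathrm{tr}[X(X^T X + \lambda I)^{-1} X^T] = \sum_j d_j^2 / (d_j^2 + \lambda)$ with $d_j$ the singular values of $X$, it can be sketched as below. The matrix here is a random placeholder with the same dimensions as the prostate data, not the data itself, and `ridge_df` is a hypothetical name.

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom: sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)    # singular values of X
    return float(np.sum(d ** 2 / (d ** 2 + lam)))

rng = np.random.default_rng(5)                # arbitrary seed
X = rng.normal(size=(97, 8))                  # placeholder, NOT the prostate data
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # center and scale columns

df_values = [ridge_df(X, lam) for lam in (0.0, 10.0, 100.0, 1000.0)]
```

With this definition df(0) $= p$ and df($\lambda$) decreases towards 0 as $\lambda$ grows, which would explain why that panel's axis runs from 0 to 8.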
Recap: ridge regression
We learned ridge regression, which minimizes the usual regression criterion plus a penalty term on the squared $\ell_2$ norm of the coefficient vector. As such, it shrinks the coefficients towards zero. This introduces some bias, but can greatly reduce the variance, resulting in a better mean-squared error
The amount of shrinkage is controlled by $\lambda$, the tuning parameter that multiplies the ridge penalty. Large $\lambda$ means more shrinkage, and so we get different coefficient estimates for different values of $\lambda$. Choosing an appropriate value of $\lambda$ is important, and also difficult. We'll return to this later
Ridge regression performs particularly well when there is a subset of true coefficients that are small or even zero. It doesn't do as well when all of the true coefficients are moderately large; however, in this case it can still outperform linear regression over a pretty narrow range of (small) $\lambda$ values
Next time: the lasso
The lasso combines some of the shrinking advantages of ridge with variable selection
(From ESL page 71)