Lecture 11: Cross validation
Reading: Chapter 5
STATS 202: Data mining and analysis
October 16, 2019
Validation set approach
Goal: Estimate the test error for a supervised learning method.
Strategy:
- Split the data in two parts.
- Train the method on the first part.
- Compute the error on the second part.
[Figure: schematic of the observations split at random into a training part and a validation part.]
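The three steps above can be sketched with NumPy. The data below are a synthetic stand-in for the Auto variables; all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Auto data: mpg as a noisy,
# mildly nonlinear function of horsepower (made-up coefficients).
n = 392
horsepower = rng.uniform(50, 230, size=n)
mpg = 40 - 0.15 * horsepower + 0.00025 * horsepower**2 + rng.normal(0, 2, size=n)

# 1. Split the data in two parts.
perm = rng.permutation(n)
train, valid = perm[: n // 2], perm[n // 2 :]

# 2. Train the method (degree-2 polynomial regression) on the first part.
coefs = np.polyfit(horsepower[train], mpg[train], deg=2)

# 3. Compute the error on the second part.
pred = np.polyval(coefs, horsepower[valid])
val_mse = float(np.mean((mpg[valid] - pred) ** 2))
print(f"validation MSE: {val_mse:.2f}")
```

Rerunning with a different `rng` seed changes the split, and therefore the error estimate, which is exactly the first problem noted below.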
Validation set approach
Polynomial regression to estimate mpg from horsepower in the Auto data.
[Figure: two panels of validation-set Mean Squared Error (16–28) against Degree of Polynomial (2–10) for the Auto data.]
Problems:
1. Every split yields a different estimate of the error.
2. Only a subset of points is used to evaluate the model.
Leave one out cross-validation
- For every i = 1, . . . , n:
  - train the model on every point except i,
  - compute the test error on the held-out point.
- Average the test errors.
[Figure: schematic of leave-one-out splits: each row trains on all observations except one and tests on the single held-out observation.]
Leave one out cross-validation
- For every i = 1, . . . , n:
  - train the model on every point except i,
  - compute the test error on the held-out point.
- Average the test errors.
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i^{(-i)} \right)^2

where \hat{y}_i^{(-i)} is the prediction for the ith sample, fit without using the ith sample.
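A minimal sketch of this loop for polynomial regression, on toy data invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data: y is quadratic in x plus noise (made-up coefficients).
n = 50
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.3, size=n)

def loocv_mse(x, y, deg):
    """CV_(n): average squared error over the n leave-one-out fits."""
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i           # train on every point except i
        coefs = np.polyfit(x[mask], y[mask], deg)
        pred_i = np.polyval(coefs, x[i])        # predict the held-out point
        errs.append((y[i] - pred_i) ** 2)
    return float(np.mean(errs))

results = {deg: loocv_mse(x, y, deg) for deg in (1, 2, 3)}
print(results)
```

Since the data are quadratic, the degree-2 model should attain a lower CV_{(n)} than the linear one.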
Leave one out cross-validation
- For every i = 1, . . . , n:
  - train the model on every point except i,
  - compute the test error on the held-out point.
- Average the test errors.
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} 1\left( y_i \neq \hat{y}_i^{(-i)} \right)

... for a classification problem.
Leave one out cross-validation
Computing CV_{(n)} can be computationally expensive, since it involves fitting the model n times.
For linear regression, there is a shortcut:
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_{ii}} \right)^2

where h_{ii} is the leverage statistic.
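The shortcut can be checked numerically: for linear regression, the formula above agrees exactly with the explicit n refits. A sketch on invented data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simple linear model (invented numbers, for illustration only).
n = 40
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])            # design matrix with intercept

# Full least-squares fit, residuals, and leverages h_ii = diag(X (X'X)^{-1} X').
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Shortcut: CV_(n) = (1/n) * sum_i ((y_i - yhat_i) / (1 - h_ii))^2
cv_shortcut = float(np.mean((resid / (1 - h)) ** 2))

# Explicit LOOCV: refit the model n times, predicting the held-out point.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ b) ** 2)
cv_explicit = float(np.mean(errs))

print(cv_shortcut, cv_explicit)                 # the two estimates coincide
```

The single full fit replaces n refits, which is why this identity matters in practice.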
k-fold cross-validation
- Split the data into k subsets or folds.
- For every i = 1, . . . , k:
  - train the model on every fold except the ith fold,
  - compute the test error on the ith fold.
- Average the test errors.
[Figure: schematic of k-fold splits: each row trains on all folds except one and tests on the held-out fold.]
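The procedure above can be sketched directly, again on toy data invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: a clearly nonlinear signal (made up for illustration).
n = 120
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + rng.normal(0, 0.2, size=n)

def kfold_mse(x, y, deg, k=10, seed=0):
    """k-fold CV estimate of test MSE for degree-`deg` polynomial regression."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)              # split the data into k folds
    errs = []
    for fold in folds:
        mask = np.ones(len(x), dtype=bool)
        mask[fold] = False                      # train on every fold except this one
        coefs = np.polyfit(x[mask], y[mask], deg)
        pred = np.polyval(coefs, x[fold])       # test error on the held-out fold
        errs.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errs))                 # average the k test errors

cv = {d: kfold_mse(x, y, d) for d in (1, 3, 5)}
print(cv)
```

With k = n this reduces to LOOCV; smaller k means fewer fits at the cost of training on less data per fit.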
LOOCV vs. k-fold cross-validation
[Figure: Mean Squared Error (16–28) against Degree of Polynomial (2–10); left panel: LOOCV, right panel: 10-fold CV.]
- k-fold CV depends on the chosen split.
- In k-fold CV, we train the model on less data than is available. This introduces bias into the estimates of the test error.
- In LOOCV, the training samples highly resemble each other. This increases the variance of the test error estimate.
Choosing an optimal model
[Figure: three panels of Mean Squared Error against Flexibility (2–20).]
Even if the CV error estimates differ significantly from the true test error, the model with the minimum cross-validation error often has relatively small test error.
Choosing an optimal model
In a classification problem, things look similar.
[Figure: four panels (Degree=1 through Degree=4) of a two-class scatter plot.]
- - - Bayes boundary
—— Logistic regression with polynomial predictors of increasing degree.
Choosing an optimal model
In a classification problem, things look similar.
— Training error, — 10-fold CV, — Test error.
[Figure: left panel: Error Rate (0.12–0.20) against Order of Polynomials Used (2–10); right panel: Error Rate against 1/K (0.01–1.00).]
Note: The training error rate sometimes increases because logistic regression does not directly minimize the 0-1 error rate; it maximizes the likelihood instead.
The one standard error rule
Forward stepwise selection
[Figure 7.9 of The Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman: prediction error (orange) and tenfold cross-validation curve (blue), estimated from a single training set, from the scenario in the bottom right panel of Figure 7.3; Misclassification Error (0.0–0.6) against Subset Size p (5–20).]
Blue: 10-fold cross validation. Yellow: true test error.
- The curves are minimized at p = 10.
- Models with 9 ≤ p ≤ 15 have very similar CV error.
- The vertical bars represent one standard error in the test error across the 10 folds.
- Rule of thumb: choose the simplest model whose CV error is no more than one standard error above that of the model with the lowest CV error.
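The rule of thumb is easy to implement once the CV curve and its standard errors are in hand. The numbers below are hypothetical, invented to mimic a curve like the one in the figure:

```python
import numpy as np

# Hypothetical CV results for models indexed by subset size p = 1..20
# (the numbers are made up for illustration).
p = np.arange(1, 21)
cv_err = np.array([0.450, 0.380, 0.320, 0.280, 0.250, 0.240, 0.230,
                   0.215, 0.212, 0.210, 0.211, 0.213, 0.214, 0.216,
                   0.218, 0.221, 0.223, 0.225, 0.230, 0.235])
cv_se = np.full(cv_err.shape, 0.01)            # one standard error per model

best = int(np.argmin(cv_err))                  # model with the lowest CV error
threshold = cv_err[best] + cv_se[best]         # one standard error above it

# Simplest (smallest p) model whose CV error does not exceed the threshold:
chosen = int(p[np.flatnonzero(cv_err <= threshold)[0]])
print(f"minimum-CV model: p = {p[best]}, one-SE rule picks: p = {chosen}")
```

Here the minimum sits at p = 10, but the one-standard-error rule settles on a noticeably simpler model.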
The wrong way to do cross validation
Reading: Section 7.10.2 of The Elements of Statistical Learning.
We want to classify 200 individuals according to whether they have cancer or not. We use logistic regression on 1000 measurements of gene expression.
Proposed strategy:
- Using all the data, select the 20 most significant genes using z-tests.
- Estimate the test error of logistic regression with these 20 predictors via 10-fold cross validation.
The wrong way to do cross validation
To see how that works, let’s use the following simulated data:
- Each gene expression is standard normal and independent of all others.
- The response (cancer or not) is sampled from a coin flip, with no correlation to any of the "genes".

What should the misclassification rate be for any classification method using these predictors?

Roughly 50%.
The wrong way to do cross validation
We run this simulation, and obtain a CV error rate of 3%!
Why is this?
- Since we only have 200 individuals in total, at least some of the 1000 variables will be correlated with the response by chance.
- We do variable selection using all the data, so the variables we select have some correlation with the response in every subset or fold of the cross validation.
The right way to do cross validation
- Divide the data into 10 folds.
- For i = 1, . . . , 10:
  - Using every fold except i, perform the variable selection and fit the model with the selected variables.
  - Compute the error on fold i.
- Average the 10 test errors obtained.
In our simulation, this produces an error estimate of close to 50%.
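Both versions of the simulation can be sketched as follows. To keep the sketch self-contained, correlation screening stands in for the z-tests and a linear-probability classifier (least squares on 0/1 labels, thresholded at 1/2) stands in for logistic regression; the qualitative gap is the point, not the exact numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k_sel, n_folds = 200, 1000, 20, 10

# Simulated data as above: independent standard-normal "gene expressions"
# and a coin-flip response unrelated to every predictor.
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n).astype(float)

def top_features(X, y, k):
    """Indices of the k features most correlated with y -- a simple
    stand-in for the z-test screening described above."""
    score = np.abs((X - X.mean(axis=0)).T @ (y - y.mean()))
    return np.argsort(score)[-k:]

def error_rate(Xtr, ytr, Xte, yte):
    """Linear-probability classifier: least squares on 0/1 labels,
    predict 1 when the fitted value exceeds 1/2."""
    A = np.column_stack([np.ones(len(Xtr)), Xtr])
    b = np.linalg.lstsq(A, ytr, rcond=None)[0]
    pred = (np.column_stack([np.ones(len(Xte)), Xte]) @ b >= 0.5).astype(float)
    return float(np.mean(pred != yte))

folds = np.array_split(rng.permutation(n), n_folds)
all_idx = np.arange(n)

# Wrong way: screen on ALL the data, cross-validate only the model fit.
sel = top_features(X, y, k_sel)
wrong_errs = []
for f in folds:
    tr = np.setdiff1d(all_idx, f)
    wrong_errs.append(error_rate(X[tr][:, sel], y[tr], X[f][:, sel], y[f]))
wrong = float(np.mean(wrong_errs))

# Right way: redo the screening inside every training fold.
right_errs = []
for f in folds:
    tr = np.setdiff1d(all_idx, f)
    s = top_features(X[tr], y[tr], k_sel)
    right_errs.append(error_rate(X[tr][:, s], y[tr], X[f][:, s], y[f]))
right = float(np.mean(right_errs))

print(f"wrong-way CV error: {wrong:.2f}, right-way CV error: {right:.2f}")
```

The wrong-way estimate is far too optimistic, while the right-way estimate hovers near the 50% that pure chance dictates.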
Moral of the story: Every aspect of the learning method that involves using the data (variable selection, for example) must be cross-validated.