201ab Quantitative Methods
Resampling: Cross Validation
Ed Vul
Resampling
Using our existing data to generate possible samples and thus obtain a sampling distribution:
I Bootstrap: of a statistic, for confidence intervals.
I Randomization: under the null, for NHST.
I Cross Validation: for prediction.
The problem: overfitting
[Scatter plot: y vs. x for 10 data points.]
The problem: overfitting
[Plot: 9th-order polynomial fit to the 10 data points, y vs. x.]
I Complex models can fit weird patterns.
I They will fit noise, not just signal.
I Fitting noise yields terrible prediction performance, even though "fit" to observed data looks very good.
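A minimal sketch of this (with made-up data; the seed and the linear true signal are assumptions, not the slide's actual values): a 9th-order polynomial drives training error to essentially zero, but its predictions between the observed points go wild.

```r
# sketch with made-up data: true signal is linear, the rest is noise
set.seed(1)
x <- seq(0, 1, length.out = 10)
y <- 2 * x + rnorm(10)

M.overfit <- lm(y ~ poly(x, 9))   # 10 parameters for 10 data points
sum(residuals(M.overfit)^2)       # training SSE is (numerically) ~0

# predictions between the observed x values swing wildly
predict(M.overfit, data.frame(x = c(0.05, 0.55, 0.95)))
```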
Overfitting yields worse prediction error
[Plot: the overfit polynomial against new data points, y vs. x.]
The problem: overfitting
We want to...
I know how well our model will predict new data, not just how well it fits observed data/noise.
I pick models that will predict new data well, and not overfit.
But we obviously have not yet seen future data.
Solution: Hold out / validation data
I Use part of the existing data as though we have not seen it: split the data into two sets:
training used to fit the model
test ("holdout") used to evaluate the model
I Doing this once is ok if we have a lot of data, so both training and test sets can be big even with the split.
I With little data we will have lots of variability in evaluation.
Cross-validation
We will do the hold-out process a bunch of times on the same data to try to reduce noise in our test-set performance.
This gives us a better estimate of prediction accuracy for the model class (but not for any one particular set of parameter values!).
Hold-out: example
dat <- dat %>%
  mutate(use_as = ifelse((1:n()) %% 2 == 1, 'train', 'test'))
training_data = dat %>% filter(use_as == 'train')
test_data = dat %>% filter(use_as == 'test')
x     y     use_as
0.00  -0.21 train
0.11  -0.93 test
0.22  -0.93 train
0.33   0.65 test
0.44  -1.06 train
0.56   0.11 test
0.67   2.40 train
0.78   1.29 test
0.89   0.99 train
1.00   0.94 test
Hold-out: example
I fit model on training data
M = lm(data = training_data, y ~ poly(x, 3))
I generate predictions on test data
prediction = predict(M, test_data)
I Measure prediction error. Here: as sum of squared errors.
sum((test_data$y - prediction)^2)
## [1] 7.616142
Train vs test performance as function of complexity
poly.order train.SSE test.SSE
1          2.52      5.86
2          1.88      23.67
3          0.94      587.04
4          0.00      15690.46
Note: 10 total data points, splitting into 5 train, 5 test. Over and over again. (More on this later.)
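That repetition could be sketched like this (assuming `dat` has the x and y columns above; the seed and number of repetitions are arbitrary choices): average train and test SSE over many random 5/5 splits, for each polynomial order.

```r
# sketch: repeated random 5/5 splits, averaging SSE per polynomial order
set.seed(1)
reps <- 200
for (p in 1:4) {
  train.sse <- test.sse <- rep(NA, reps)
  for (r in 1:reps) {
    train_idx <- sample(nrow(dat), 5)
    train_dat <- dat[train_idx, ]
    test_dat  <- dat[-train_idx, ]
    M <- lm(data = train_dat, y ~ poly(x, p))
    train.sse[r] <- sum(residuals(M)^2)
    test.sse[r]  <- sum((test_dat$y - predict(M, test_dat))^2)
  }
  cat(p, mean(train.sse), mean(test.sse), "\n")
}
```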
Leave-one-out cross-validation
Run hold-out n times for n data points. Each time, use one data point as the test data and the remaining n-1 data points as training.
Leave-one-out cross-validation
n = nrow(dat)
loo_error = rep(NA, n)
for(i in 1:n){
  training_data = dat[(1:n)[-i], ]
  test_data = dat[i, ]
  M = lm(data = training_data, y ~ poly(x, 3))
  prediction = predict(M, test_data)
  loo_error[i] = (test_data$y - prediction)^2
}
Leave-one-out cross-validation
[Histogram: count of leave-one-out squared errors.]
mean(loo_error)
## [1] 0.9759489
Varieties of cross-validation
I Repeated random sub-sampling (suitable for larger sample sizes and replicates)
I Leave-k-out (LOO: k=1): exhaustive, for small sample sizes
I K-fold (LOO: k=n)
For both fitting and evaluation: nested cross-validation.
There are lots of varieties of error/fit measures depending on what you are after.
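For instance, K-fold could be sketched as follows (assuming the same small `dat` with x and y columns; k = 5 is an arbitrary choice here): each data point lands in exactly one test fold.

```r
# sketch of k-fold cross-validation (assumes dat has columns x and y)
k <- 5
n <- nrow(dat)
fold <- sample(rep(1:k, length.out = n))   # randomly assign rows to folds
fold_error <- rep(NA, k)
for (f in 1:k) {
  train_dat <- dat[fold != f, ]
  test_dat  <- dat[fold == f, ]
  M <- lm(data = train_dat, y ~ poly(x, 3))
  fold_error[f] <- mean((test_dat$y - predict(M, test_dat))^2)
}
mean(fold_error)   # cross-validated mean squared error
```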
Larger-scale example: data
## Rows: 251
## Columns: 14
## $ bf.percent <dbl> 12.3, 6.1, 25.3, 10.4, 28.7, 20.9, 19.2, 12.4, 4.1, 11.7...
## $ age        <dbl> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32, 30, ...
## $ weight     <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 181.00, ...
## $ height     <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, ...
## $ neck       <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38.1, 42...
## $ chest      <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6, 100.9...
## $ abdomen    <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5, 8...
## $ hip        <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1, 99.9...
## $ thigh      <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62.9, 63...
## $ knee       <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38.3, 41...
## $ ankle      <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23.8, 25...
## $ bicep      <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35.9, 35...
## $ forearm    <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31.1, 30...
## $ wrist      <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18.2, 19...
Large-scale example: Models
lm.model = lm(data = dat, bf.percent ~ .)
svr.model = e1071::svm(data = dat, bf.percent ~ ., cross = 0)
lm2.model = lm(data = dat,
               bf.percent ~ polym(age, weight, height, neck, chest,
                                  abdomen, hip, thigh, knee, ankle,
                                  bicep, forearm, wrist,
                                  raw = T, degree = 2))
Leave-50-out random sub-sampling
RMSE = function(true_y, predicted_y){
  sqrt(mean((predicted_y - true_y)^2))
}
n = nrow(dat)
k = 50
repetitions = 100
errors = rep(NA, repetitions)
for(i in 1:repetitions){
  test_idx = sort(sample(n, k, replace = F))
  train_idx = (1:n)[-test_idx]
  test_dat = dat[test_idx, ]
  train_dat = dat[train_idx, ]
  M = lm(data = train_dat, bf.percent ~ .)
  pred_y = predict(M, test_dat)
  errors[i] = RMSE(test_dat$bf.percent, pred_y)
}
Leave-50-out random sub-sampling: Results
[Plot: log10(MSE) by model (lm, lm2, svr), separately for train.err and test.err.]
Resampling: Cross-validation
Goal: estimate prediction accuracy/error on future data without actually having data from the future.
Strategy: Repeat many times:
I Split existing data into training and test sets.
I Fit model to training set; evaluate error on test set.
Resampling: Bootstrap
Goal: quantify sampling error in some statistic to get confidenceintervals.
Strategy:
I Generate new hypothetical samples of the same size as the existing sample by resampling from it (with replacement!).
I Calculate the statistic on each sample to obtain many samples from the sampling distribution of the statistic.
I Use that to get confidence intervals via quantile function.
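As a sketch, a percentile bootstrap CI for a mean (the vector `v` here is made up for illustration):

```r
# sketch: percentile bootstrap confidence interval for the mean
set.seed(1)
v <- rnorm(50, mean = 1)   # made-up sample
boot_means <- replicate(10000, mean(sample(v, length(v), replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))   # 95% CI for the mean
```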
Resampling: Randomization
Goal: test a null hypothesis that some structure/regularity does notexist in the data.
Strategy:
I Define a statistic to measure the structure.
I Define a shuffling (sampling without replacement) process to destroy only that structure.
I Repeat many times: statistic(shuffle(data)) to obtain many samples of the distribution of the statistic under the null.
I Calculate the p value from those samples.
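As a sketch, a randomization test for a difference in group means (the data here are made up):

```r
# sketch: permutation test for a two-group mean difference (made-up data)
set.seed(1)
g <- rep(c("a", "b"), each = 25)
y <- rnorm(50) + (g == "b") * 0.8
observed <- mean(y[g == "b"]) - mean(y[g == "a"])
null_stats <- replicate(10000, {
  g.shuf <- sample(g)   # shuffling destroys the group/outcome pairing
  mean(y[g.shuf == "b"]) - mean(y[g.shuf == "a"])
})
mean(abs(null_stats) >= abs(observed))   # two-tailed p value
```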
Resampling recap
Randomization Shuffle data to obtain the sampling distribution of the statistic under the null, and thus test a null hypothesis.
Bootstrapping Resample current data to obtain the sampling distribution of the statistic, and thus get a confidence interval.
Cross-validation Subsample existing data into training and test sets to estimate prediction performance on unseen data.
Resampling: beware
You have lots of responsibility here. There is lots of room to make a mistake, and you will tend to check for (and catch) a mistake only if it is unfavorable to you.
Questions?