Lecture 27: Selection

Pratheepa Jeganathan

11/22/2019


Recap

- What is a regression model?
- Descriptive statistics – graphical
- Descriptive statistics – numerical
- Inference about a population mean
- Difference between two population means
- Some tips on R
- Simple linear regression (covariance, correlation, estimation, geometry of least squares)
- Inference on the simple linear regression model
- Goodness of fit of regression: analysis of variance
- F-statistics
- Residuals
- Diagnostic plots for simple linear regression (graphical methods)


Recap

- Multiple linear regression
  - Specifying the model
  - Fitting the model: least squares
  - Interpretation of the coefficients
- Matrix formulation of multiple linear regression
- Inference for multiple linear regression
  - T-statistics revisited
  - More F-statistics
  - Tests involving more than one β
- Diagnostics – more on graphical and numerical methods
  - Different types of residuals
  - Influence
  - Outlier detection
  - Multiple comparison (Bonferroni correction)
  - Residual plots:
    - partial regression (added variable) plot
    - partial residual (residual plus component) plot


Recap

- Adding qualitative predictors
  - Qualitative variables as predictors in the regression model
  - Adding interactions to the linear regression model
  - Testing for equality of the regression relationship in various subsets of a population
- ANOVA
  - All qualitative predictors
  - One-way layout
  - Two-way layout
- Transformations
  - Achieving linearity
  - Stabilizing variance
  - Weighted least squares
- Correlated errors
  - Generalized least squares
- Bootstrapping linear regression


Selection


Outline (Model selection)

- In a given regression situation, there are often many choices to be made.
- Recall our usual setup:

  $$Y_{n\times 1} = X_{n\times p}\beta_{p\times 1} + \varepsilon_{n\times 1}.$$

- Any subset $A \subset \{1, \ldots, p\}$ yields a new regression model

  $$\mathcal{M}(A): \quad Y_{n\times 1} = X[, A]\,\beta[A] + \varepsilon_{n\times 1}$$

  by setting $\beta[A^c] = 0$.
- Model selection is, roughly speaking, how to choose $A$ among the $2^p$ possible choices (a small sketch of this column-subsetting follows).


Election data

Here is a dataset from the book that we will use to explore different model selection approaches.

Variable  Description
V         votes for a presidential candidate
I         are they incumbent?
D         Democrat or Republican incumbent?
W         wartime election?
G         GDP growth rate in election year
P         (absolute) GDP deflator growth rate
N         number of quarters in which GDP growth rate > 3.2%


Election data

url = 'http://stats191.stanford.edu/data/election.table'
election.table = read.table(url, header=T)
pairs(election.table[,2:ncol(election.table)],
      cex.labels=3, pch=23,
      bg='orange', cex=2)


Election data

[Figure: scatterplot matrix (pairs plot) of the election data variables V, I, D, W, G, P, N.]


Problem & Goals

- When we have many predictors (with many possible interactions), it can be difficult to find a good model.
- Which main effects do we include?
- Which interactions do we include?
- Model selection procedures try to simplify / automate this task.
- The election data has $2^6 = 64$ different models with just main effects! (See the enumeration sketch below.)
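To make the count of $2^6 = 64$ concrete, here is a small sketch (not from the lecture) that enumerates every main-effects model for the six predictors; building the formulas via reformulate is just one of several ways to do this.

predictors = c("I", "D", "W", "G", "P", "N")
# All 2^6 subsets of the six main effects (each row: include / exclude each predictor)
subsets = expand.grid(rep(list(c(FALSE, TRUE)), length(predictors)))
names(subsets) = predictors
nrow(subsets)                     # 64 candidate main-effects models
# Build the corresponding model formulas, using V ~ 1 for the empty subset
formulas = lapply(seq_len(nrow(subsets)), function(i) {
  keep = unlist(subsets[i, ])
  if (any(keep)) reformulate(predictors[keep], response = "V") else V ~ 1
})
formulas[[nrow(subsets)]]         # the full main-effects model V ~ I + D + W + G + P + N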


General comments

- This is generally an “unsolved” problem in statistics: there are no magic procedures that get you the “best model.”
- Many machine learning methods look for good “sparse” models: models that use only a small subset of the predictors.
- “Machine learning” applications often work with very many predictors.
- Our model selection problem is generally at a much smaller scale than “data mining” problems.
- Still, it is a hard problem.


Hypothetical example

- Suppose we fit a model $F: Y_{n\times 1} = X_{n\times p}\beta_{p\times 1} + \varepsilon_{n\times 1}$ with predictors $X_1, \ldots, X_p$.
- In reality, some of the $\beta$'s may be zero. Let's suppose that $\beta_{j+1} = \cdots = \beta_p = 0$.
- Then, any model that includes $\beta_0, \ldots, \beta_j$ is correct: which model gives the best estimates of $\beta_0, \ldots, \beta_j$?
- The principle of parsimony (i.e. Occam's razor) says that the model with only $X_1, \ldots, X_j$ is “best”.


Justifying parsimony

- For simplicity, let's assume that $j = 1$, so there is only one coefficient to estimate.
- Then, because each correct model gives an unbiased estimate of $\beta_1$, we can compare models based on $\text{Var}(\hat{\beta}_1)$.
- The best model, in terms of this variance, is the one containing only $X_1$ (the simulation sketch below illustrates this).
- What if we didn't know that only $\beta_1$ was non-zero (which we don't know in general)?
- In this situation, we must choose a set of variables.
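A minimal simulation sketch (not from the lecture) of the variance comparison above: when an extra predictor is irrelevant but correlated with $X_1$, the estimate of $\beta_1$ from the larger model is still unbiased but more variable than the one from the model containing only $X_1$. The data-generating numbers here are arbitrary.

set.seed(2)
n = 50; reps = 2000
beta1 = 1
est_small = est_full = numeric(reps)
for (r in 1:reps) {
  x1 = rnorm(n)
  x2 = 0.7 * x1 + rnorm(n)          # irrelevant predictor, correlated with x1
  y  = beta1 * x1 + rnorm(n)        # true model uses x1 only
  est_small[r] = coef(lm(y ~ x1))["x1"]
  est_full[r]  = coef(lm(y ~ x1 + x2))["x1"]
}
# Both estimators are (approximately) unbiased, but the larger model has larger variance
c(var_small = var(est_small), var_full = var(est_full))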


Model selection: choosing a subset of variables

- To “implement” a model selection procedure, we first need a criterion or benchmark to compare two models.
- Given a criterion, we also need a search strategy.
- With a limited number of predictors, it is possible to search over all possible models (leaps in R).


Candidate criteria


Candidate criteria

Possible criteria:

- $R^2$: not a good criterion. It always increases with model size, so the “optimum” is to take the biggest model.
- Adjusted $R^2$: better. It penalizes bigger models and follows the principle of parsimony / Occam's razor.
- Mallow's $C_p$: attempts to estimate a model's predictive power, i.e. the power to predict a new observation.


Best subsets, R2

- leaps takes a design matrix as its argument: throw away the intercept column or leaps will complain.

election.lm = lm(V ~ I + D + W + G:I + P + N, election.table)
# election.lm


Best subsets, R2

X = model.matrix(election.lm)[,-1]
library(leaps)
# Since the algorithm returns a best model of each size,
# the results do not depend on a penalty for model size.
# nbest: number of subsets of each size to report
election.leaps = leaps(x = X, y = election.table$V,
                       nbest=3, method='r2')

- Find the predictors in the model with the largest $R^2$:

# election.leaps$which: matrix, each row can be
# used to select the columns of x in the respective model
ind = which(election.leaps$r2 == max(election.leaps$r2))
best.model.r2 = election.leaps$which[ind, ]
best.model.r2

##    1    2    3    4    5    6
## TRUE TRUE TRUE TRUE TRUE TRUE


Best subsets, R2

- Let's plot $R^2$ as a function of the model size.

plot(election.leaps$size, election.leaps$r2,
     pch=23, bg='orange', cex=2,
     xlab = "Size of the model",
     ylab = bquote(R^2))


Best subsets, R2

- For example, there are three models with 2 predictors, each with a different $R^2$ (nbest=3 was requested for each size).
- We see that the full model includes all variables and has the largest $R^2$.

[Figure: $R^2$ versus size of the model for the best subsets of each size.]


Best subsets, adjusted R2

- As we add more and more variables to the model – even random ones – $R^2$ will increase to 1.
- Adjusted $R^2$ tries to take this into account by replacing sums of squares with mean squares (checked numerically below):

  $$R^2_a = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{MSE}{MST}.$$
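As a quick check of the formula above, here is a hedged sketch (not from the lecture) that computes the adjusted $R^2$ of the full election model by hand and compares it with the value reported by summary(); it assumes election.lm and election.table are defined as earlier.

y = election.table$V
n = length(y)
p = length(coef(election.lm)) - 1            # number of predictors, excluding the intercept
SSE = sum(resid(election.lm)^2)
SST = sum((y - mean(y))^2)
R2a_by_hand = 1 - (SSE / (n - p - 1)) / (SST / (n - 1))
c(R2a_by_hand, summary(election.lm)$adj.r.squared)   # should agree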


Best subsets, adjusted R2

election.leaps = leaps(X, election.table$V, nbest=3,
                       method='adjr2')
ind2 = which(election.leaps$adjr2 == max(election.leaps$adjr2))
best.model.adjr2 = election.leaps$which[ind2,]
best.model.adjr2

##    1    2     3     4    5    6
## TRUE TRUE FALSE FALSE TRUE TRUE

- The best model based on adjusted $R^2$ has four predictor variables (the sketch below shows how to recover their names from the logical vector).
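The which entries returned by leaps are logical indicators over the columns of the design matrix, so one way (not shown in the lecture) to see which predictors were actually selected is:

# Map the TRUE/FALSE indicators back to the column names of the design matrix X
colnames(X)[best.model.adjr2]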


Best subsets, adjusted R2

plot(election.leaps$size, election.leaps$adjr2,
     pch=23, bg='orange', cex=2)

[Figure: adjusted $R^2$ (election.leaps$adjr2) versus model size (election.leaps$size).]


Mallow’s Cp

- Mallow's $C_p$:

  $$C_p(\mathcal{M}) = \frac{SSE(\mathcal{M})}{\hat{\sigma}^2} + 2\,p(\mathcal{M}) - n.$$

- $\hat{\sigma}^2 = SSE(F)/df_F$ is the “best” estimate of $\sigma^2$ we have (it uses the fullest model), i.e. in the election data it uses all 6 main effects.
- $SSE(\mathcal{M})$ is the SSE of the model $\mathcal{M}$; $p(\mathcal{M})$ is the number of predictors in $\mathcal{M}$.
- $C_p$ is an estimate of the expected mean squared error of $\hat{Y}(\mathcal{M})$: it takes both the bias and the variance of the fit into account.
- It accounts for the sample size, the effect sizes of the predictors, and collinearity between the predictors (a hand computation is sketched below).
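Here is a hedged sketch (not from the lecture) of the $C_p$ computation, using $\hat{\sigma}^2$ from the full election model and an arbitrary submodel for illustration; it assumes election.lm and election.table are available.

sigma2_hat = sum(resid(election.lm)^2) / election.lm$df.residual   # SSE(F) / df_F
n = nrow(election.table)
# An arbitrary submodel, chosen only to illustrate the formula
sub = lm(V ~ D + P, data = election.table)
SSE_sub = sum(resid(sub)^2)
p_sub = length(coef(sub)) - 1        # predictors in the submodel, excluding the intercept
# Conventions differ on whether the intercept is counted in p(M); this follows the slide.
Cp_sub = SSE_sub / sigma2_hat + 2 * p_sub - n
Cp_sub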


Best subsets, Mallow’s Cp

election.leaps = leaps(X, election.table$V, nbest=3,
                       method='Cp')
indcp = which(election.leaps$Cp == min(election.leaps$Cp))
best.model.Cp = election.leaps$which[indcp,]
best.model.Cp

##     1    2     3     4    5    6
## FALSE TRUE FALSE FALSE TRUE TRUE


Best subsets, Mallow’s Cp

plot(election.leaps$size, election.leaps$Cp,
     pch=23, bg='orange', cex=2)

[Figure: Mallow's $C_p$ (election.leaps$Cp) versus model size (election.leaps$size).]


Search strategies


Search strategies

- Given a criterion, we now have to decide how we are going to search through the possible models.
- “Best subset”: search all possible models and take the one with the highest $R^2_a$ or lowest $C_p$ (leaps). Such searches are typically feasible only up to $p = 30$ or 40 at the very most.
- Stepwise (forward, backward or both): useful when the number of predictors is large. Choose an initial model and be “greedy”.
- “Greedy” means always take the biggest jump (up or down) in your selected criterion.


Implementations in R

- “Best subset”: use the function leaps. It works only for multiple linear regression models.
- Stepwise: use the function step. It works for any model for which an Akaike Information Criterion (AIC) can be computed. In multiple linear regression, AIC is (almost) a linear function of $C_p$ (see the derivation sketched after the AIC-for-regression slide).


Akaike / Bayes Information Criterion

- The Akaike information criterion (AIC) is defined as

  $$AIC(\mathcal{M}) = -2\log L(\mathcal{M}) + 2\,p(\mathcal{M}),$$

  where $L(\mathcal{M})$ is the maximized likelihood of the model.
- The Bayesian information criterion (BIC) is defined as

  $$BIC(\mathcal{M}) = -2\log L(\mathcal{M}) + \log n \cdot p(\mathcal{M}).$$

- This strategy can be used whenever we have a likelihood, so it generalizes to many statistical models (a quick numerical check in R follows).
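As a hedged numerical check of these definitions (not from the lecture), R's AIC() and BIC() for a fitted lm can be reproduced from logLik(); here $p(\mathcal{M})$ is taken to be the number of estimated parameters reported by logLik (the coefficients plus $\sigma^2$), which is the convention R uses. It assumes election.lm is available.

ll  = logLik(election.lm)
p_M = attr(ll, "df")            # number of estimated parameters: coefficients + sigma^2
n   = nobs(election.lm)
c(by_hand = -2 * as.numeric(ll) + 2 * p_M,       AIC = AIC(election.lm))
c(by_hand = -2 * as.numeric(ll) + log(n) * p_M,  BIC = BIC(election.lm))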


AIC for regression

- In linear regression with unknown $\sigma^2$,

  $$-2\log L(\mathcal{M}) = n\log(2\pi\hat{\sigma}^2_{MLE}) + n,$$

  where $\hat{\sigma}^2_{MLE} = \frac{1}{n}SSE(\hat{\beta})$.
- In linear regression with known $\sigma^2$,

  $$-2\log L(\mathcal{M}) = n\log(2\pi\sigma^2) + \frac{1}{\sigma^2}SSE(\hat{\beta}),$$

  so AIC is very much like Mallow's $C_p$ in this case (see the derivation below).
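To make the connection explicit (this step is not spelled out in the slides), plug the known-$\sigma^2$ expression into the definition of AIC and compare with $C_p(\mathcal{M}) = SSE(\mathcal{M})/\sigma^2 + 2\,p(\mathcal{M}) - n$:

$$AIC(\mathcal{M}) = n\log(2\pi\sigma^2) + \frac{SSE(\mathcal{M})}{\sigma^2} + 2\,p(\mathcal{M}) = C_p(\mathcal{M}) + n + n\log(2\pi\sigma^2).$$

The last two terms do not depend on the model, so ranking models by AIC and by $C_p$ gives the same ordering in this case.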


AIC for regression

- For the election data, the linear regression with all predictors has

n = nrow(X)
p = 7 + 1   # sigma^2 is unknown
AIC_calculated = n * log(2*pi*sum(resid(election.lm)^2)/n) + n + 2*p
c(AIC_calculated, AIC(election.lm))

## [1] -66.94026 -66.94026


Properties of AIC / BIC

- BIC will typically choose a model as small as or smaller than AIC (if using the same search direction).
- As our sample size grows, under some assumptions, it can be shown that:
  - AIC will (asymptotically) always choose a model that contains the true model, i.e. it won't leave any variables out.
  - BIC will (asymptotically) choose exactly the right model.


Election example

- Let's take a look at step in action.
- Probably the simplest strategy is forward stepwise, which tries to add one variable at a time, as long as it can find a resulting model whose AIC is better than that of its current position.
- When it can make no further additions, it terminates.


Election example (forward stepwise)

# k = 2 gives AIC; k = log(n) gives BIC
election.step.forward = step(lm(V ~ 1, election.table),
                             list(upper = ~ I + D + W + G + G:I + P + N),
                             direction='forward', k=2, trace=FALSE)
election.step.forward

##
## Call:
## lm(formula = V ~ D + P, data = election.table)
##
## Coefficients:
## (Intercept)            D            P
##    0.514022     0.043134    -0.006017


- Summary of the model chosen based on forward stepwise and AIC:

# summary(election.step.forward)


Interactions and hierarchy

- We notice that although the full model we gave it had the interaction I:G, the function step never tried to use it.
- This is due to rules implemented in step that do not include an interaction unless both main effects are already in the model.
- In this case, because neither I nor G was added, the interaction was never considered.
- In the leaps example, we gave the function the design matrix, so it did not have to consider interactions: they were already encoded in the design matrix.


BIC example

- The only difference between AIC and BIC is the price paid per variable. This is the argument k to step.
- By default k=2; for BIC we set k=log(n).
- If we set k=0 it will always add variables.

election.step.forward.BIC = step(lm(V ~ 1, election.table),
                                 list(upper = ~ I + D + W + G:I + P + N),
                                 direction='forward', k=log(nrow(X)))

## Start:  AIC=-106.73
## V ~ 1
##
##        Df  Sum of Sq      RSS     AIC
## + D     1  0.0280805 0.084616 -109.71
## <none>              0.112696 -106.73
## + I     1  0.0135288 0.099167 -106.38
## + P     1  0.0124463 0.100250 -106.15
## + N     1  0.0024246 0.110271 -104.15
## + W     1  0.0009518 0.111744 -103.87
##
## Step:  AIC=-109.71
## V ~ D
##
##        Df  Sum of Sq      RSS     AIC
## <none>              0.084616 -109.71
## + P     1  0.0099223 0.074693 -109.28
## + W     1  0.0068141 0.077801 -108.43
## + I     1  0.0012874 0.083328 -106.99
## + N     1  0.0000033 0.084612 -106.67


BIC example

#summary(election.step.forward.BIC)


Backward selection

- Let's consider backward stepwise. This starts at the full model and tries to delete variables.
- There is also a direction="both" option.

election.step.backward = step(election.lm,
                              direction='backward')

## Start:  AIC=-128.54
## V ~ I + D + W + G:I + P + N
##
##        Df Sum of Sq      RSS     AIC
## - P     1  0.000055 0.023741 -130.49
## - W     1  0.000170 0.023855 -130.39
## <none>             0.023686 -128.54
## - N     1  0.003133 0.026818 -127.93
## - D     1  0.011926 0.035612 -121.97
## - I:G   1  0.050640 0.074325 -106.52
##
## Step:  AIC=-130.49
## V ~ I + D + W + N + I:G
##
##        Df Sum of Sq      RSS     AIC
## - W     1  0.000120 0.023860 -132.38
## <none>             0.023741 -130.49
## - N     1  0.003281 0.027021 -129.77
## - D     1  0.013983 0.037724 -122.76
## - I:G   1  0.053507 0.077248 -107.71
##
## Step:  AIC=-132.38
## V ~ I + D + N + I:G
##
##        Df Sum of Sq      RSS     AIC
## <none>             0.023860 -132.38
## - N     1  0.003199 0.027059 -131.74
## - D     1  0.013867 0.037727 -124.76
## - I:G   1  0.059452 0.083312 -108.12


Backward selection

# summary(election.step.backward)


Cross-validation

- Yet another model selection criterion is K-fold cross-validation.
- Fix a model $\mathcal{M}$. Break the data set into $K$ approximately equal-sized groups $(G_1, \ldots, G_K)$.
- For i in 1:K, use all groups except $G_i$ to fit the model, then predict the outcomes in group $G_i$ based on this model: $\hat{Y}_{j,\mathcal{M},G_i},\ j \in G_i$.
- This is similar to what we saw in Cook's distance / DFFITS.
- Estimate

  $$CV(\mathcal{M}) = \frac{1}{n}\sum_{i=1}^{K}\sum_{j\in G_i}\left(Y_j - \hat{Y}_{j,\mathcal{M},G_i}\right)^2.$$

  (A manual implementation is sketched below.)
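A hedged sketch (not from the lecture) of the $CV(\mathcal{M})$ formula above, implemented by hand for the model V ~ D + P on the election data; cv.glm from the boot package, used on the following slides, automates the same computation.

set.seed(1)
K = 5
n = nrow(election.table)
fold = sample(rep(1:K, length.out = n))     # random assignment of observations to K groups
sq_err = numeric(n)
for (i in 1:K) {
  train = election.table[fold != i, ]
  test  = election.table[fold == i, ]
  fit_i = lm(V ~ D + P, data = train)       # fit using all groups except G_i
  sq_err[fold == i] = (test$V - predict(fit_i, newdata = test))^2
}
mean(sq_err)                                # CV(M)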


Comments about cross-validation.

- Cross-validation is a general principle that can be used in other situations to “choose parameters.”
- Pros (partial list): an “objective” measure of a model's predictive power.
- Cons (partial list): all we know about inference is usually “out the window” (this is also true for other model selection procedures).
- If the goal is not really inference about certain specific parameters, it is a reasonable way to compare models.


Example (Cross-validation)

library(boot)
# Fit a generalized linear model
election.glm = glm(V ~ ., data=election.table)
# 5-fold cross-validation
# The first component of delta is the raw cross-validation
# estimate of prediction error.
# The second component is the adjusted cross-validation
# estimate; the adjustment is designed to compensate for
# the bias introduced by not using leave-one-out
# cross-validation.
cv.glm(model.frame(election.glm),
       election.glm, K=5)$delta

## [1] 0.01411831 0.01242346


Cp versus 5-fold cross-validation

- Let's plot our $C_p$ values against the CV scores.
- Keep in mind that there is additional randomness in the CV score due to the random assignments to groups.


Cp versus 5-fold cross-validation

election.leaps = leaps(X, election.table$V,
                       nbest=3, method='Cp')
V = election.table$V
election.leaps$CV = 0 * election.leaps$Cp
for (i in 1:nrow(election.leaps$which)) {
  subset = c(1:ncol(X))[election.leaps$which[i,]]
  if (length(subset) > 1) {
    Xw = X[,subset]
    wlm = glm(V ~ Xw)
    election.leaps$CV[i] = cv.glm(model.frame(wlm),
                                  wlm, K=5)$delta[1]
  }
  else {
    Xw = X[,subset[1]]
    wlm = glm(V ~ Xw)
    election.leaps$CV[i] = cv.glm(model.frame(wlm),
                                  wlm, K=5)$delta[1]
  }
}


Cp versus 5-fold cross-validation

plot(election.leaps$Cp, election.leaps$CV,
     pch=23, bg='orange', cex=2)

[Figure: 5-fold CV estimate (election.leaps$CV) versus Mallow's $C_p$ (election.leaps$Cp).]


Cp versus 5-fold cross-validation

plot(election.leaps$size, election.leaps$CV,
     pch=23, bg='orange', cex=2)

[Figure: 5-fold CV estimate (election.leaps$CV) versus model size (election.leaps$size).]


Cp versus 5-fold cross-validation

indcp_5fold = which(election.leaps$CV == min(election.leaps$CV))
best.model.Cv = election.leaps$which[indcp_5fold,]
best.model.Cv

##    1    2     3     4    5    6
## TRUE TRUE FALSE FALSE TRUE TRUE


Summarizing results

- The model selected depends on the criterion used.

Criterion      Model
R^2            ~ I + D + W + G:I + P + N
R^2_a          ~ I + D + P + N
C_p            ~ D + P + N
AIC forward    ~ D + P
BIC forward    ~ D
AIC backward   ~ I + D + N + I:G
5-fold CV      ~ I + W

- The selected model is random and depends on which method we use!


Where we are so far

- Many other “criteria” have been proposed.
- Some work well for some types of data, others for different data.
- Check diagnostics!
- These criteria (except cross-validation) are not “direct measures” of predictive power, though Mallow's $C_p$ is a step in this direction.
- $C_p$ measures the quality of a model based on both the bias and the variance of the model. Why is this important?
- The bias–variance tradeoff is ubiquitous in statistics. More soon.


A larger example

- Resistance of $n = 633$ different HIV+ viruses to the drug 3TC.
- The $p = 91$ features are mutations in a part of the HIV virus; the response is the log fold change in vitro.


Example (HIV and mutations)

X_HIV = read.table('http://stats191.stanford.edu/data/NRTI_X.csv',
                   header=FALSE, sep=',')
Y_HIV = read.table('http://stats191.stanford.edu/data/NRTI_Y.txt',
                   header=FALSE, sep=',')
set.seed(0)
Y_HIV = as.matrix(Y_HIV)[,1]
X_HIV = as.matrix(X_HIV)
nrow(X_HIV)

## [1] 633


Forward stepwise

D = data.frame(X_HIV, Y_HIV)
M = lm(Y_HIV ~ ., data=D)
M_forward = step(lm(Y_HIV ~ 1, data=D), list(upper=M),
                 trace=FALSE, direction='forward')
# M_forward


Backward stepwise

M_backward = step(M, list(lower= ~ 1),
                  trace=FALSE, direction='backward')
# M_backward


Both directions

M_both1 = step(M, list(lower= ~ 1, upper=M),
               trace=FALSE, direction='both')
# M_both1


Both directions

M_both2 = step(lm(Y_HIV ~ 1, data=D),
               list(lower= ~ 1, upper=M),
               trace=FALSE, direction='both')
# M_both2


Compare selected models

sort(names(coef(M_forward)))
sort(names(coef(M_backward)))
sort(names(coef(M_both1)))
sort(names(coef(M_both2)))
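Printing four long sorted name vectors is hard to compare by eye, so here is a hedged sketch (not from the lecture) that tabulates which mutations appear in each selected model; the names models, selected and membership are made up for this illustration.

models = list(forward = M_forward, backward = M_backward,
              both1 = M_both1, both2 = M_both2)
selected = lapply(models, function(m) setdiff(names(coef(m)), "(Intercept)"))
all_vars = sort(unique(unlist(selected)))
# Indicator table: one row per selected mutation, one column per search strategy
membership = sapply(selected, function(s) all_vars %in% s)
rownames(membership) = all_vars
head(membership)
colSums(membership)     # number of variables selected by each strategy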


BIC vs AIC

M_backward_BIC = step(M, list(lower= ~ 1), trace=FALSE,
                      direction='backward', k=log(633))
M_forward_BIC = step(lm(Y_HIV ~ 1, data=D), list(upper=M),
                     trace=FALSE, direction='forward', k=log(633))
M_both1_BIC = step(M, list(upper=M, lower= ~ 1),
                   trace=FALSE, direction='both', k=log(633))
M_both2_BIC = step(lm(Y_HIV ~ 1, data=D), list(upper=M, lower= ~ 1),
                   trace=FALSE, direction='both', k=log(633))


BIC vs AIC

sort(names(coef(M_backward_BIC)))
sort(names(coef(M_forward_BIC)))
sort(names(coef(M_both1_BIC)))
sort(names(coef(M_both2_BIC)))


Inference after selection


Inference after selection: data snooping and splitting

- Each of the above criteria returns a model. The summary provides p-values.

summary(election.step.forward)

##
## Call:
## lm(formula = V ~ D + P, data = election.table)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.101121 -0.036838 -0.006987  0.019029  0.163250
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.514022   0.022793  22.552  1.2e-14 ***
## D            0.043134   0.017381   2.482   0.0232 *
## P           -0.006017   0.003891  -1.546   0.1394
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06442 on 18 degrees of freedom
## Multiple R-squared:  0.3372, Adjusted R-squared:  0.2636
## F-statistic: 4.579 on 2 and 18 DF,  p-value: 0.02468


Inference after selection

- We can also form confidence intervals. But can we trust these intervals or tests? No! The same data were used to choose the model, so classical inference after selection is no longer valid.
- Recommended reading: work by Jonathan Taylor on selective inference, e.g. the selectiveInference R package.

library(selectiveInference)


Reference

- CH Chapter 11 (Variable selection procedures)
- Lecture notes of Jonathan Taylor

