
Model Selection: Topics in Data Mining, Fall 2015, Bruno Ribeiro

Transcript

Model Selection

Topics in Data Mining, Fall 2015

Bruno Ribeiro

© 2015 Bruno Ribeiro

Goal

- Model Selection

- Model Assessment

© 2015 Bruno Ribeiro

In Training Phase There is Often a Better Model

(From The Elements of Statistical Learning, Ch. 7: Model Assessment and Selection, p. 222)

The "−2" in the definition makes the log-likelihood loss for the Gaussian distribution match squared-error loss.

For ease of exposition, for the remainder of this chapter we will use Y and f(X) to represent all of the above situations, since we focus mainly on the quantitative response (squared-error loss) setting. For the other situations, the appropriate translations are obvious.

In this chapter we describe a number of methods for estimating the expected test error for a model. Typically our model will have a tuning parameter or parameters α and so we can write our predictions as f̂α(x). The tuning parameter varies the complexity of our model, and we wish to find the value of α that minimizes error, that is, produces the minimum of the average test error curve in Figure 7.1. Having said this, for brevity we will often suppress the dependence of f̂(x) on α.

It is important to note that there are in fact two separate goals that we might have in mind:

Model selection: estimating the performance of different models in order to choose the best one.

Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a "vault," and be brought out only at the end of the data analysis. Suppose instead that we use the test set repeatedly, choosing the model with smallest test-set error. Then the test-set error of the final chosen model will underestimate the true test error, sometimes substantially.

It is difficult to give a general rule on how to choose the number of observations in each of the three parts, as this depends on the signal-to-noise ratio in the data and the training sample size. A typical split might be 50% for training, and 25% each for validation and testing:

[Diagram: the data divided into consecutive blocks, Train | Validation | Test]

The methods in this chapter are designed for situations where there is insufficient data to split it into three parts. Again it is too difficult to give a general rule on how much training data is enough; among other things, this depends on the signal-to-noise ratio of the underlying function, and the complexity of the models being fit to the data.


Real-world Error
Model Assessment

© 2015 Bruno Ribeiro

Measuring Error

- Input: X

- Output: Y

- Estimator: f̂(X), with estimated parameters

- Examples of loss functions (next slide)

© 2015 Bruno Ribeiro


Measuring Errors: Loss Functions

- Typical classification loss functions:
  - 0-1 loss
  - Log-likelihood (log loss)

© 2015 Bruno Ribeiro
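A concrete rendering of these two losses may help; the sketch below assumes binary labels in {0, 1} and predicted class-1 probabilities (all names are illustrative, not from the slides):

```python
import numpy as np

def zero_one_loss(y_true, y_pred_label):
    # 0-1 loss: 1 for each misclassification, 0 for each correct prediction (averaged here).
    return np.mean(y_true != y_pred_label)

def log_likelihood_loss(y_true, p):
    # Negative log-likelihood (log loss) for binary labels, where p = Pr(Y = 1 | X).
    eps = 1e-12                      # guard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([0, 1, 1, 0])
p = np.array([0.1, 0.8, 0.4, 0.3])
print(zero_one_loss(y, (p >= 0.5).astype(int)))   # 0.25
print(log_likelihood_loss(y, p))
```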

The Problem of Overfitting

[Figure 7.1 plot: Prediction Error versus Model Complexity (df); training and test error curves, running from high bias / low variance on the left to low bias / high variance on the right]

FIGURE 7.1. Behavior of test sample and training sample error as the model complexity is varied. The light blue curves show the training error err, while the light red curves show the conditional test error ErrT for 100 training sets of size 50 each, as the model complexity is increased. The solid curves show the expected test error Err and the expected training error E[err].

Test error, also referred to as generalization error, is the prediction error over an independent test sample

ErrT = E[L(Y, f̂(X)) | T]    (7.2)

where both X and Y are drawn randomly from their joint distribution (population). Here the training set T is fixed, and test error refers to the error for this specific training set. A related quantity is the expected prediction error (or expected test error)

Err = E[L(Y, f̂(X))] = E[ErrT].    (7.3)

Note that this expectation averages over everything that is random, including the randomness in the training set that produced f̂.

Figure 7.1 shows the prediction error (light red curves) ErrT for 100 simulated training sets each of size 50. The lasso (Section 3.4.2) was used to produce the sequence of fits. The solid red curve is the average, and hence an estimate of Err.

Estimation of ErrT will be our goal, although we will see that Err is more amenable to statistical analysis, and most methods effectively estimate the expected error. It does not seem possible to estimate conditional…

Training Error

Real-World Error

The Elements of Statistical Learning Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie

© 2015 Bruno Ribeiro

Linear Regression

[Plot: y vs. x, n observations, with a linear fit: "OK"]
(The Elements of Statistical Learning, Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie)

Cubic Regression

Figure 1.1: A simple, a complex and a trade-off (3rd degree) polynomial.

a relatively simple polynomial with small but nonzero error, as in the rightmost picture. This intuition is confirmed by numerous experiments on real-world data from a broad variety of sources [Rissanen 1989; Vapnik 1998; Ripley 1996]: if one naively fits a high-degree polynomial to a small sample (set of data points), then one obtains a very good fit to the data. Yet if one tests the inferred polynomial on a second set of data coming from the same source, it typically fits this test data very badly in the sense that there is a large distance between the polynomial and the new data points. We say that the polynomial overfits the data. Indeed, all model selection methods that are used in practice either implicitly or explicitly choose a trade-off between goodness-of-fit and complexity of the models involved. In practice, such trade-offs lead to much better predictions of test data than one would get by adopting the 'simplest' (one degree) or most 'complex' (n − 1 degree) polynomial. MDL provides one particular means of achieving such a trade-off.

It will be useful to make a precise distinction between ‘model’ and ‘hypothesis’:

Models vs. Hypotheses

We use the phrase point hypothesis to refer to a single probability distribution or function. An example is the polynomial 5x² + 4x + 3. A point hypothesis is also known as a 'simple hypothesis' in the statistical literature.

We use the word model to refer to a family (set) of probability distributions or functions with the same functional form. An example is the set of all second-degree polynomials. A model is also known as a 'composite hypothesis' in the statistical literature.

We use hypothesis as a generic term, referring to both point hypotheses and models.

In our terminology, the problem described in Example 1.2 is a 'hypothesis selection problem' if we are interested in selecting both the degree of a polynomial and the cor…
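The overfitting behavior described in this excerpt is easy to reproduce numerically. The following sketch (illustrative settings, not from the text) fits polynomials of degree 1, 3, and n − 1 to a small noisy sample and compares training error with error on fresh data from the same source:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
f = lambda x: np.sin(2 * np.pi * x)          # the unknown "source"
x_train = np.sort(rng.uniform(0, 1, n))
y_train = f(x_train) + rng.normal(0, 0.2, n)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 3, n - 1):
    # NumPy may warn that the degree n-1 fit is poorly conditioned; it still runs.
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# The degree n-1 fit interpolates the training points (train MSE near 0)
# but typically has by far the largest test MSE.
```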


[Plot: y vs. x, n observations: "Better"]
(The Elements of Statistical Learning, Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie)

Degree n-1 Polynomial


[Plot: y vs. x, n observations: "Even better?"]
(The Elements of Statistical Learning, Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie)

Model Selection and Assessment

- Model Selection: estimating the performance of different models in order to select the best one

- Model Assessment: having chosen a model, estimating its prediction error on new data

© 2015 Bruno Ribeiro

Model Selection and Assessment

- In data-poor scenarios: approximate the validation step
  - analytically: AIC, BIC, MDL
  - via sample re-use: cross-validation (leave-one-out, K-fold), bootstrap


Model Selection and Model Assessment

- In data-rich scenarios, split the data into training, validation, and test sets:

© 2015 Bruno Ribeiro
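A minimal sketch of the 50%/25%/25% split quoted from ESL above, using scikit-learn's train_test_split; X, y, and the random seed are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data; replace with the real design matrix and targets.
X, y = np.random.randn(1000, 10), np.random.randn(1000)

# First carve off 50% for training, then split the remainder 50/50
# into validation (25% of the total) and test (25% of the total).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit candidate models on X_train, pick the one with the lowest validation
# error, and touch X_test only once, at the very end of the analysis.
```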

Model vs. Hypotheses

- Simple (point) hypothesis: a single probability distribution (or function)
  - i.e., a model with specific parameter values

- Composite hypothesis: a family of probability distributions (or functions)
  - i.e., a model

© 2015 Bruno Ribeiro

Why Look for More Restricted Models?

(ESL Section 7.3, The Bias-Variance Decomposition, p. 225)

[Figure 7.2 schematic: model space and restricted model space, showing the truth, the closest fit, the closest fit in population, a shrunken fit, model bias, estimation bias, and estimation variance]

FIGURE 7.2. Schematic of the behavior of bias and variance. The model space is the set of all possible predictions from the model, with the "closest fit" labeled with a black dot. The model bias from the truth is shown, along with the variance, indicated by the large yellow circle centered at the black dot labeled "closest fit in population." A shrunken or regularized fit is also shown, having additional estimation bias, but smaller prediction error due to its decreased variance.

Bias x Variance

© 2015 Bruno Ribeiro
The Elements of Statistical Learning, Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie

For squared-error loss and an additive-noise model Y = f(X) + ε, with E[ε] = 0 and Var(ε) = σ_ε², the expected prediction error of f̂ at an input point x0 decomposes as

Err(x0) = σ_ε² + [E f̂(x0) − f(x0)]² + E[f̂(x0) − E f̂(x0)]²

that is, the irreducible error of the target Y, plus the squared bias (the deviation of the average estimate from the true function's mean), plus the variance (the expected squared deviation of the estimate around its own mean).

© 2015 Bruno Ribeiro
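A Monte-Carlo check of this decomposition (a sketch under assumed data-generating settings, not part of the slides): repeatedly draw training sets, fit a deliberately misspecified linear model, and compare irreducible error + bias² + variance at a fixed point x0 against the simulated expected prediction error.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x ** 2                 # assumed true function
sigma = 0.5                          # noise std; irreducible error = sigma**2
x0, n, reps = 0.8, 30, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    coefs = np.polyfit(x, y, 1)      # deliberately misspecified linear fit
    preds[r] = np.polyval(coefs, x0)

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
expected_err = np.mean((f(x0) + rng.normal(0, sigma, reps) - preds) ** 2)
print(f"sigma^2 + bias^2 + variance = {sigma**2 + bias2 + variance:.3f}")
print(f"simulated expected error    = {expected_err:.3f}")
```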

- For linear models (e.g., ridge regression), the bias can be further decomposed into an average model bias and an average estimation bias, where β* is the best-fitting linear approximation to f:

  β* = argmin_β E[(f(X) − β^T X)²]

  - Average model bias: the error between the best linear approximation β*^T X and the true function f(X)
  - Average estimation bias: the error between the average estimate E[β̂^T X] and the best linear approximation β*^T X

- For standard (unconstrained) linear regression, the estimation bias is 0; shrinkage methods such as ridge trade some estimation bias for reduced variance.

© 2015 Bruno Ribeiro
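A quick numerical illustration of the zero vs. nonzero estimation bias claim (a sketch with an assumed fixed design and an assumed nonlinear truth): compute β* on the design, then compare the average OLS and ridge estimates against it.

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam, reps, sigma = 40, 5.0, 4000, 0.5
X = rng.normal(size=(n, 2))                      # fixed design
f_x = np.sin(3 * X[:, 0]) + X[:, 1]              # nonlinear truth at the design points

# Best-fitting linear approximation on this design: beta* = argmin ||f(X) - X beta||^2
beta_star = np.linalg.lstsq(X, f_x, rcond=None)[0]

ols_avg, ridge_avg = np.zeros(2), np.zeros(2)
for _ in range(reps):
    y = f_x + rng.normal(0, sigma, n)
    ols_avg += np.linalg.lstsq(X, y, rcond=None)[0] / reps
    ridge_avg += np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y) / reps

print("beta*        :", beta_star)
print("average OLS  :", ols_avg)     # matches beta*     -> estimation bias ~ 0
print("average ridge:", ridge_avg)   # shrunk toward 0   -> nonzero estimation bias
```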

Optimism of the Training Error Rate

- Typically, the training error rate is smaller than the true error, because the same data is being used both to fit the method and to assess its error: the training error is overly optimistic.

© 2015 Bruno Ribeiro

Estimating Test Error

- Can we estimate the discrepancy between Err(x0) and Err(X)?

- Suppose we measured new values of y at the same inputs xi

  Err_in (in-sample error): the expectation over N new responses, one at each input xi

  Adjustment for the optimism of the training error: the optimism is the amount by which the training error underestimates Err_in

© 2015 Bruno Ribeiro

- If the model is y = f_θ(x) + ε with |θ| = d parameters, then approximately

  optimism ≈ 2 · (d/N) · σ_ε²

  - Optimism grows linearly with the model dimension d
  - Optimism decreases as the training sample size N increases

© 2015 Bruno Ribeiro
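A small simulation (assumed linear model with Gaussian noise) comparing the average gap between in-sample error and training error with the 2 · (d/N) · σ_ε² formula above:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma, reps = 50, 5, 1.0, 2000
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)

gaps = []
for _ in range(reps):
    y = X @ beta + rng.normal(0, sigma, N)          # training responses
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ beta_hat
    train_err = np.mean((y - y_hat) ** 2)
    # In-sample error: new responses at the same inputs x_i.
    y_new = X @ beta + rng.normal(0, sigma, N)
    in_sample_err = np.mean((y_new - y_hat) ** 2)
    gaps.append(in_sample_err - train_err)

print(f"simulated optimism: {np.mean(gaps):.3f}")
print(f"2*d*sigma^2/N     : {2 * d * sigma**2 / N:.3f}")
```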

Ways to Estimate Prediction Error

- In-sample error estimates:
  - AIC
  - BIC
  - MDL

- Extra-sample error estimates:
  - Cross-validation (leave-one-out, K-fold)
  - Bootstrap

© 2015 Bruno Ribeiro

Estimates of In-Sample Prediction Error

- General form of the in-sample estimate: training error plus an estimate of the optimism,

  Êrr_in = err + ω̂

- For a linear fit with d parameters:

  Cp = err + 2 · (d/N) · σ̂_ε²

  known as Mallows' Cp statistic, where σ̂_ε² is an estimate of the noise variance

© 2015 Bruno Ribeiro
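A minimal Cp computation for least-squares fits (a sketch; σ̂_ε² is estimated from the low-bias full model, and all names and data are illustrative):

```python
import numpy as np

def mallows_cp(y, y_hat, d, sigma2_hat):
    """Cp = training error + 2 * (d / N) * sigma^2_hat."""
    N = len(y)
    train_err = np.mean((y - y_hat) ** 2)
    return train_err + 2.0 * d / N * sigma2_hat

# Example: compare two nested least-squares fits on the same (synthetic) data.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1.0, 100)

full = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = np.sum((y - X @ full) ** 2) / (100 - 10)   # noise estimate from the full model

for d in (3, 10):
    bhat = np.linalg.lstsq(X[:, :d], y, rcond=None)[0]
    print(d, mallows_cp(y, X[:, :d] @ bhat, d, sigma2_hat))
```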

Akaike Information Criterion (AIC)

AIC = −(2/N) · loglik + 2 · (d/N)

Bayesian Information Criterion (BIC)

BIC = −2 · loglik + (log N) · d

© 2015 Bruno Ribeiro

AIC = loglik(Data | MLE params) − (# of parameters)

BIC = loglik(Data | MLE params) − (log N / 2) · (# of parameters)

- AIC: does not assume the model is correct; better for small numbers of samples; does not converge to the correct model

- BIC: assumes the model is correct; converges to the correct parametrization of the model

© 2015 Bruno Ribeiro
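A sketch of both criteria for Gaussian linear regression, using the −(2/N) · loglik + 2 · d/N and −2 · loglik + (log N) · d forms above; the data-generating setup is assumed for illustration, and d here counts only the regression coefficients:

```python
import numpy as np

def gaussian_aic_bic(y, y_hat, d):
    """AIC and BIC for a Gaussian model with d parameters, evaluated at the MLE."""
    N = len(y)
    rss = np.sum((y - y_hat) ** 2)
    sigma2_mle = rss / N
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2_mle) + 1)
    aic = -2.0 / N * loglik + 2.0 * d / N
    bic = -2.0 * loglik + np.log(N) * d
    return aic, bic

# Hypothetical comparison: the true signal uses 3 of 10 candidate predictors.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1.0, 200)

for d in (1, 3, 10):
    beta_hat = np.linalg.lstsq(X[:, :d], y, rcond=None)[0]
    aic, bic = gaussian_aic_bic(y, X[:, :d] @ beta_hat, d)
    print(f"d = {d:2d}: AIC = {aic:.3f}, BIC = {bic:.3f}")
# Both criteria should be minimized near the true model size d = 3.
```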

MDL (Minimum Description Length)

- Find the hypothesis (model) that minimizes H(M) + H(D|M), where
  - H(M) is the length, in bits, of the description of the model
  - H(D|M) is the length, in bits, of the description of the data as encoded with the help of the model

- If the model is probabilistic:

  length = −log Pr(y | θ, M, X) − log Pr(θ | M)

  - First term: length of transmitting the discrepancy given the model, under optimal coding for that model
  - Second term: description of the model (its parameters) under optimal coding

- MDL principle: choose the model with the minimum description length

- Equivalent to maximizing the posterior: Pr(y | θ, M, X) · Pr(θ | M)

© 2015 Bruno Ribeiro
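A toy two-part code-length computation following length = −log Pr(y | θ, M, X) − log Pr(θ | M); the Gaussian noise model and the Gaussian coding distribution for the parameters are assumptions made for the sketch:

```python
import numpy as np
from scipy.stats import norm

def description_length_bits(y, y_hat, theta, noise_sd=1.0, prior_sd=10.0):
    # -log2 Pr(y | theta, M, X): bits to transmit the discrepancy under the model.
    data_bits = -np.sum(norm.logpdf(y, loc=y_hat, scale=noise_sd)) / np.log(2)
    # -log2 Pr(theta | M): bits to describe the parameters under an assumed Gaussian prior.
    model_bits = -np.sum(norm.logpdf(theta, loc=0.0, scale=prior_sd)) / np.log(2)
    return data_bits + model_bits

# Compare polynomial models of increasing degree on noisy quadratic data.
rng = np.random.default_rng(6)
x = np.linspace(-1, 1, 50)
y = 2 * x ** 2 - x + rng.normal(0, 0.3, 50)

for degree in (1, 2, 8):
    theta = np.polyfit(x, y, degree)
    bits = description_length_bits(y, np.polyval(theta, x), theta, noise_sd=0.3)
    print(f"degree {degree}: description length of about {bits:.1f} bits")
# MDL principle: pick the model with the smallest total description length.
```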

Estimation of Extra-Sample Err

- Cross-validation

- Bootstrap

© 2015 Bruno Ribeiro

K-fold

[Diagram: the data divided into K = 5 parts, labeled 1 2 3 4 5; one part is held out for validation while the model is trained on the remaining parts]

(From The Elements of Statistical Learning, Ch. 7, p. 242)

For the kth part (third above), we fit the model to the other K − 1 parts of the data, and calculate the prediction error of the fitted model when predicting the kth part of the data. We do this for k = 1, 2, . . . , K and combine the K estimates of prediction error.

Here are more details. Let κ : {1, . . . , N} ↦ {1, . . . , K} be an indexing function that indicates the partition to which observation i is allocated by the randomization. Denote by f̂−k(x) the fitted function, computed with the kth part of the data removed. Then the cross-validation estimate of prediction error is

CV(f̂) = (1/N) Σ_{i=1}^{N} L(yi, f̂^{−κ(i)}(xi)).    (7.48)

Typical choices of K are 5 or 10 (see below). The case K = N is known as leave-one-out cross-validation. In this case κ(i) = i, and for the ith observation the fit is computed using all the data except the ith.

Given a set of models f(x, α) indexed by a tuning parameter α, denote by f̂−k(x, α) the αth model fit with the kth part of the data removed. Then for this set of models we define

CV(f̂, α) = (1/N) Σ_{i=1}^{N} L(yi, f̂^{−κ(i)}(xi, α)).    (7.49)

The function CV(f̂, α) provides an estimate of the test error curve, and we find the tuning parameter α̂ that minimizes it. Our final chosen model is f(x, α̂), which we then fit to all the data.

It is interesting to wonder about what quantity K-fold cross-validation estimates. With K = 5 or 10, we might guess that it estimates the expected error Err, since the training sets in each fold are quite different from the original training set. On the other hand, if K = N we might guess that cross-validation estimates the conditional error ErrT. It turns out that cross-validation only estimates effectively the average error Err, as discussed in Section 7.12.

What value should we choose for K? With K = N, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the N "training sets" are so similar to one another. The computational burden is also considerable, requiring N applications of the learning method. In certain special problems, this computation can be done quickly; see Exercises 7.3 and 5.13.

© 2015 Bruno Ribeiro
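A from-scratch sketch of (7.48)-(7.49): κ assigns each observation to a fold, the model is refit with that fold removed, and the held-out losses are averaged; the ridge-style fit and all names are illustrative.

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, loss, K=5, seed=0):
    """Cross-validation estimate CV(f_hat) = (1/N) * sum_i L(y_i, f_hat^{-kappa(i)}(x_i))."""
    N = len(y)
    kappa = np.random.default_rng(seed).permutation(N) % K   # fold index for each observation
    losses = np.empty(N)
    for k in range(K):
        train, held_out = kappa != k, kappa == k
        model = fit(X[train], y[train])                      # fit with the kth part removed
        losses[held_out] = loss(y[held_out], predict(model, X[held_out]))
    return losses.mean()

# Example: choose a ridge penalty alpha by 10-fold CV with squared-error loss.
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1.0, 100)

def ridge_fit(alpha):
    return lambda Xtr, ytr: np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(Xtr.shape[1]), Xtr.T @ ytr)

predict = lambda beta, Xte: Xte @ beta
sq_loss = lambda y_true, y_pred: (y_true - y_pred) ** 2

for alpha in (0.01, 1.0, 100.0):
    cv = k_fold_cv(X, y, ridge_fit(alpha), predict, sq_loss, K=10)
    print(f"alpha = {alpha:7.2f}: CV estimate = {cv:.3f}")
```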

Cross-Validation: Choosing K

- Popular choices for K: 5, 10, or N (leave-one-out)

- Moving from K-fold toward leave-one-out (K increases): bias decreases, but variance and computation increase; moving back toward small K: variance decreases, but bias can become a problem

(ESL Section 7.10, Cross-Validation, p. 243)

[Figure 7.8 plot: 1 − Err versus the size of the training set, from 0 to 200]

FIGURE 7.8. Hypothetical learning curve for a classifier on a given task: a plot of 1 − Err versus the size of the training set N. With a dataset of 200 observations, 5-fold cross-validation would use training sets of size 160, which would behave much like the full set. However, with a dataset of 50 observations, fivefold cross-validation would use training sets of size 40, and this would result in a considerable overestimate of prediction error.

On the other hand, with K = 5 say, cross-validation has lower variance. But bias could be a problem, depending on how the performance of the learning method varies with the size of the training set. Figure 7.8 shows a hypothetical "learning curve" for a classifier on a given task, a plot of 1 − Err versus the size of the training set N. The performance of the classifier improves as the training set size increases to 100 observations; increasing the number further to 200 brings only a small benefit. If our training set had 200 observations, fivefold cross-validation would estimate the performance of our classifier over training sets of size 160, which from Figure 7.8 is virtually the same as the performance for training set size 200. Thus cross-validation would not suffer from much bias. However if the training set had 50 observations, fivefold cross-validation would estimate the performance of our classifier over training sets of size 40, and from the figure that would be an underestimate of 1 − Err. Hence as an estimate of Err, cross-validation would be biased upward.

To summarize, if the learning curve has a considerable slope at the given training set size, five- or tenfold cross-validation will overestimate the true prediction error. Whether this bias is a drawback in practice depends on the objective. On the other hand, leave-one-out cross-validation has low bias but can have high variance. Overall, five- or tenfold cross-validation are recommended as a good compromise: see Breiman and Spector (1992) and Kohavi (1995).

Figure 7.9 shows the prediction error and tenfold cross-validation curve estimated from a single training set, from the scenario in the bottom right panel of Figure 7.3. This is a two-class classification problem, using a lin…

© 2015 Bruno Ribeiro

Wrong Way to Do Cross-Validation

- Using cross-validation to estimate anything in the model or for feature selection
- Using cross-validation to hunt for new models

[Diagram: the human regression machine: proposes a new model, checks CV; if CV is bad, proposes another model, and repeats]

© 2015 Bruno Ribeiro
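The first pitfall can be demonstrated directly (this follows the "wrong and right way" discussion in ESL Section 7.10.2; the concrete sizes and estimators below are assumptions for the demo): on pure-noise data, screening features on the full dataset before cross-validating makes a worthless classifier look good, while doing the screening inside each fold does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(50, 1000)                 # pure noise: no real signal
y = rng.randint(0, 2, 50)

# Wrong way: select the 20 "best" features using ALL the data, then cross-validate.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
wrong = cross_val_score(LogisticRegression(max_iter=1000), X_selected, y, cv=5).mean()

# Right way: feature selection happens inside each CV fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
right = cross_val_score(pipe, X, y, cv=5).mean()

print(f"wrong-way CV accuracy: {wrong:.2f}   (optimistic; true accuracy is ~0.50)")
print(f"right-way CV accuracy: {right:.2f}")
```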

Bootstrap: Main Concept

Step 1: Draw samples with replacement

Step 2: Calculate the statistic

(ESL Ch. 7, p. 250)

[Figure 7.12 schematic: training sample Z = (z1, z2, . . . , zN) → bootstrap samples Z*1, Z*2, . . . , Z*B → bootstrap replications S(Z*1), S(Z*2), . . . , S(Z*B)]

FIGURE 7.12. Schematic of the bootstrap process. We wish to assess the statistical accuracy of a quantity S(Z) computed from our dataset. B training sets Z*b, b = 1, . . . , B, each of size N, are drawn with replacement from the original dataset. The quantity of interest S(Z) is computed from each bootstrap training set, and the values S(Z*1), . . . , S(Z*B) are used to assess the statistical accuracy of S(Z).

The bootstrap estimate of Var[S(Z)] is (1/(B − 1)) Σ_{b=1}^{B} (S(Z*b) − S̄*)², where S̄* = Σ_b S(Z*b)/B. Note that this can be thought of as a Monte-Carlo estimate of the variance of S(Z) under sampling from the empirical distribution function F̂ for the data (z1, z2, . . . , zN).

How can we apply the bootstrap to estimate prediction error? One approach would be to fit the model in question on a set of bootstrap samples, and then keep track of how well it predicts the original training set. If f̂*b(xi) is the predicted value at xi, from the model fitted to the bth bootstrap dataset, our estimate is

Êrr_boot = (1/B)(1/N) Σ_{b=1}^{B} Σ_{i=1}^{N} L(yi, f̂*b(xi)).    (7.54)

However, it is easy to see that Êrr_boot does not provide a good estimate in general. The reason is that the bootstrap datasets are acting as the training samples, while the original training set is acting as the test sample, and these two samples have observations in common. This overlap can make overfit predictions look unrealistically good, and is the reason that cross-validation explicitly uses non-overlapping data for the training and test samples. Consider for example a 1-nearest neighbor classifier applied to a two-class classification problem with the same number of observations in…

The Elements of Statistical Learning Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie
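A direct sketch of (7.54) with an assumed least-squares fit and squared-error loss: each bootstrap sample acts as a training set and the original sample as the test set, so the estimate tends to be optimistic because of the overlap the passage describes.

```python
import numpy as np

def err_boot(X, y, fit, predict, loss, B=100, seed=0):
    """Bootstrap error estimate (7.54): average loss on the ORIGINAL sample
    over models fit to B bootstrap resamples (optimistic due to overlap)."""
    rng = np.random.default_rng(seed)
    N = len(y)
    total = 0.0
    for _ in range(B):
        idx = rng.integers(0, N, N)                   # draw N indices with replacement
        model = fit(X[idx], y[idx])                   # fit on the bootstrap sample
        total += loss(y, predict(model, X)).mean()    # evaluate on the original sample
    return total / B

rng = np.random.default_rng(8)
X = rng.normal(size=(60, 5))
y = X[:, 0] + rng.normal(0, 1.0, 60)

fit = lambda Xtr, ytr: np.linalg.lstsq(Xtr, ytr, rcond=None)[0]
predict = lambda beta, Xte: Xte @ beta
sq_loss = lambda yt, yp: (yt - yp) ** 2

print(f"Err_boot estimate: {err_boot(X, y, fit, predict, sq_loss):.3f}")
```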

References

- Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

