Introduction to statistical learning with R
Anna Liu
January 20, 2015
Statistical learning
Data set: Advertising
> Ad=read.csv("Advertising.csv")
> head(Ad)
X TV Radio Newspaper Sales
1 1 230.1 37.8 69.2 22.1
2 2 44.5 39.3 45.1 10.4
3 3 17.2 45.9 69.3 9.3
4 4 151.5 41.3 58.5 18.5
5 5 180.8 10.8 58.4 12.9
6 6 8.7 48.9 75.0 7.2
Objective: Adjust advertising budgets to improve sales of a particular product.
Figure: The Advertising data set. The plot displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot we show the simple least squares fit of sales to that variable, as described in Chapter 3. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.
Data set: Income
Objective: Understand how one's education history and seniority affect income.
Figure: The plot displays income as a function of years of education and seniority in the Income data set. The blue surface represents the true underlying relationship between income and years of education and seniority, which is known since the data are simulated. The red dots indicate the observed values of these quantities for 30 individuals.
Input variables (predictors, features, independent variables): TV, radio, and newspaper budgets in the Advertising data; years of education and seniority in the Income data.
Output variables (response, dependent variables): sales in the Advertising data and income in the Income data.
In general, let X1, · · · , Xp denote the p independent variables, and Y denote the dependent variable. We assume that there is some relationship between Y and X = (X1, · · · , Xp), that is,
Y = f (X ) + ε,
where f denotes the systematic relationship and ε is the error term, which is independent of X and has mean zero. Statistical learning refers to a set of approaches for estimating f. The rest of the chapter introduces different statistical learning objectives, methods, and evaluations.
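As a concrete illustration, the model Y = f(X) + ε can be simulated in R; the choice of f, sample size, and error variance below are all hypothetical:

```r
# Simulate n observations from Y = f(X) + eps, with f known to us
# (but unknown to the learning method)
set.seed(1)
n   <- 100
x   <- runif(n, 0, 10)
f   <- function(x) 2 + 0.5 * x        # the systematic relationship
eps <- rnorm(n, mean = 0, sd = 1)     # error term: independent of X, mean zero
y   <- f(x) + eps                     # observed responses
```

Statistical learning methods see only the pairs (x, y) and try to recover f from them.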
Objective of Statistical learning
Prediction: Predict the value of Y for a set of values of X,

Ŷ = f̂(X),

where f̂ is our estimate of f and Ŷ is the resulting prediction. The goal is prediction accuracy, as opposed to understanding the relationship between X and Y; f̂ can be a black box.
Example: Predict the risk of a severe disease based on patients' blood samples.
Prediction accuracy: The average prediction error is
E(Y − Ŷ)² = E(f(X) − f̂(X) + ε)² = E(f̂(X) − f(X))² + Var(ε),
where the first term is the average reducible error and Var(ε) is the irreducible error. Statistical learning techniques aim to minimize the reducible error.
Inference: Understand how Y changes with X; here f cannot be treated as a black box.
1. Which predictors are associated with the response?
2. What is the relationship between the response and each predictor?
3. Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Example: the Advertising and Income data.
Prediction and inference: In real estate, predict the value of a home and at the same time understand the relationship between various features of a home and its value.
Depending on the purpose of the learning, different methods for estimating f may be appropriate. For example, linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. In contrast, some of the highly non-linear approaches that we discuss in the later chapters can potentially provide quite accurate predictions for Y, but this comes at the expense of a less interpretable model for which inference is more challenging.
Learning techniques
Let xi = (xi1, xi2, · · · , xip) be the vector of the ith observed independent variables and yi be the ith observed response. Then our training data consist of {(x1, y1), (x2, y2), · · · , (xn, yn)}. We use the training data to estimate f. Techniques are broadly classified into parametric and nonparametric methods.
Parametric methods:
1. Assume a form or shape for f with a set of unknown parameters; for example, the linear model assumes

f(X) = β0 + β1 X1 + · · · + βp Xp,

where β0, · · · , βp are called parameters.
2. Estimate the parameters, and therefore f, with some method, such as the least squares method.
Pros: Easy to fit and interpret.
Cons: The assumed model may not mimic the truth well and can thus lead to loss of prediction accuracy.
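A minimal sketch of the parametric approach in R, on simulated data (the coefficients and sample size are illustrative): assume a linear form for f and estimate its parameters by least squares with lm().

```r
# Parametric method: assume f(X) = b0 + b1*X1 + b2*X2, fit by least squares
set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.1)  # here the truth really is linear

fit <- lm(y ~ x1 + x2)   # least squares estimates of beta0, beta1, beta2
coef(fit)                # close to 1, 2, -3
```

Because the assumed form matches the truth in this toy example, the fitted parameters recover the true coefficients almost exactly.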
Parametric methods: a linear model example
Figure: A linear model fit by least squares to the Income data: Income = β0 + β1 × Education + β2 × Seniority. The observations are shown in red, and the yellow plane indicates the least squares fit to the data.
Nonparametric methods
Do not make explicit assumptions about the functional form of f. Instead they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly.
Pros: Can potentially model the true unknown f more accurately.
Cons: Need a larger number of observations, and lack interpretability.
Figure: A smooth thin-plate spline fit to the Income data is shown in yellow; the observations are displayed in red.
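A nonparametric fit can be sketched in R with smooth.spline(), which assumes no functional form for f (one predictor here for simplicity; the data are simulated):

```r
# Nonparametric method: let the data determine the shape of f
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)   # a nonlinear truth

fit  <- smooth.spline(x, y)          # smoothness chosen automatically
yhat <- predict(fit, x)$y            # fitted values trace sin(x) closely
```

No linear (or other) form was assumed, yet the fit follows the true curve; the price is the larger sample needed and the loss of a simple parametric summary.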
Overfitting
Both parametric and nonparametric models can be made more flexible by adding more variables/terms, allowing a rougher fit, but too much flexibility can lead to overfitting. Overfitting is undesirable because the fit follows the error/noise in the training data too closely and will not yield accurate estimates of the response on new observations that were not part of the original training data set.
Figure: A rough thin-plate spline fit to the Income data. This fit makes zero errors on the training data.
Tradeoff between flexibility and interpretability
[Figure axes: Flexibility (low to high) vs Interpretability (low to high). Methods shown, from least flexible/most interpretable to most flexible/least interpretable: Subset Selection and Lasso; Least Squares; Generalized Additive Models and Trees; Bagging and Boosting; Support Vector Machines.]
Figure: A representation of the tradeoff between flexibility and interpretability, using different statistical learning methods. In general, as the flexibility of a method increases, its interpretability decreases.
When inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. When prediction is the goal, we might expect that it will be best to use the most flexible model available. Surprisingly, this is not always the case! We will often obtain more accurate predictions using a less flexible method. This phenomenon, which may seem counterintuitive at first glance, has to do with the potential for overfitting in highly flexible methods.
Learning techniques classified according to types of responses
I Supervised learning vs unsupervised learning: existence vs lack of responses in the training data.
I Regression vs classification: continuous (quantitative) vs discrete (qualitative, categorical) responses. An example of a classification technique is logistic regression for binary responses. Some statistical methods, such as K-nearest neighbors and boosting, can be used in the case of either quantitative or qualitative responses.
Supervised and unsupervised learning
Supervised learning aims to predict the response, or to infer the relationship between the response and the independent variables. Unsupervised learning aims to understand the relationships between the variables or between the observations. One example is cluster analysis.
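A small sketch of unsupervised learning in R: k-means clustering on simulated data with no response variable (the two-group structure below is chosen purely for illustration).

```r
# Cluster analysis: group observations using only X, with no response Y
set.seed(1)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),   # 25 points near (0, 0)
           matrix(rnorm(50, mean = 3), ncol = 2))   # 25 points near (3, 3)

km <- kmeans(x, centers = 2, nstart = 20)   # nstart: multiple random restarts
table(km$cluster)                           # sizes of the two recovered clusters
```

Note that the algorithm never sees group labels; it recovers the two groups from the structure of the 50 observations alone.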
Evaluation of different techniques
No free lunch in statistics: no one method dominates all others over all possible data sets. On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set. Hence it is an important task to decide, for any given set of data, which method produces the best results.
Goodness of fit in the regression setting:
MSE = (1/n) ∑_{i=1}^{n} (yi − f̂(xi))²,

where MSE represents the Mean Squared Error.
Training MSE: MSE calculated based on the training data.
Test MSE: MSE calculated based on the test data.
We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE.
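The training/test MSE distinction can be illustrated in R by fitting polynomials of increasing degree on simulated data (the true f, noise level, and flexibility grid below are all illustrative):

```r
# Training MSE keeps falling as flexibility grows; test MSE is U-shaped
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
train <- sample(n, n / 2)                 # half for training, half for testing

mse <- sapply(1:10, function(d) {         # d = polynomial degree (flexibility)
  fit  <- lm(y ~ poly(x, d), subset = train)
  pred <- predict(fit, newdata = data.frame(x = x[-train]))
  c(train = mean((y[train] - fitted(fit))^2),
    test  = mean((y[-train] - pred)^2))
})
mse["train", ]   # decreases as d grows (the models are nested)
mse["test", ]    # large for d = 1, smallest at a moderate d
```

Training MSE always rewards more flexibility; only the test MSE reveals where the extra flexibility stops paying off.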
Figure: Left: Data simulated from f, shown in black. Three estimates of f are shown: the linear regression line (orange curve), and two smoothing spline fits (blue and green curves). Right: Training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line). Squares represent the training and test MSEs for the three fits shown in the left-hand panel.
Figure: Details are as in the previous figure, using a different true f that is much closer to linear. In this setting, linear regression provides a very good fit to the data.
Figure: Details are as in the previous figure, using a different f that is far from linear. In this setting, linear regression provides a very poor fit to the data.
Cross-validation: Often the test data are not available. We will discuss a variety of approaches to estimate the minimum point with no test data. One important method is cross-validation, which estimates the test MSE using only the training data.
Bias-variance tradeoff
The U-shape of the test MSE curve can be explained by the bias-variance tradeoff. Specifically, at a test point x0,

MSE = E(y0 − f̂(x0))² = E(f(x0) − f̂(x0) + ε)² = Bias(f̂(x0))² + Var(f̂(x0)) + Var(ε).

One can see that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias.
Variance refers to the amount by which f̂ would change if we estimated it using a different training data set.
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.
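These definitions can be made concrete by simulation in R: repeatedly draw fresh training sets, refit f̂, and examine the spread and the systematic error of f̂(x0) at a fixed test point (the true f, fit, and test point below are illustrative):

```r
# Estimate Var(f_hat(x0)) and Bias(f_hat(x0))^2 over many training sets
set.seed(1)
f  <- function(x) sin(2 * pi * x)
x0 <- 0.3                                  # fixed test point

preds <- replicate(500, {                  # 500 independent training sets
  x <- runif(50)
  y <- f(x) + rnorm(50, sd = 0.3)
  fit <- lm(y ~ poly(x, 3))                # a moderately flexible fit
  predict(fit, newdata = data.frame(x = x0))
})

variance <- var(preds)                     # spread of f_hat(x0) across training sets
bias_sq  <- (mean(preds) - f(x0))^2        # systematic error at x0
```

Repeating this with a more flexible fit typically raises the variance and lowers the squared bias, tracing out the tradeoff.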
Figure: Squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test MSE (red curve) for the three data sets in the previous three figures. The vertical dotted line indicates the flexibility level corresponding to the smallest test MSE.
The challenge lies in finding a method for which both the variance and the squared bias are low. This trade-off is one of the most important recurring themes in the textbook.
Classification error rate
In place of the MSE, the classification error rate is defined as

(1/n) ∑_{i=1}^{n} I(yi ≠ ŷi),

where ŷi is the predicted class label for predictor value xi.
Training error rate: the above error rate calculated on the training data.
Test error rate: the above error rate calculated on the test data.
Bayes classifier: Classify x to the category with the largest posterior probability P(Y = j | X = x). In the binary response case, classify x to the category whose posterior probability exceeds 50%.
I The Bayes classifier is optimal in the sense that it minimizes the test error rate.
I The Bayes error rate is analogous to the irreducible error in the regression setting.
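For a simple simulated binary problem, the Bayes classifier and its error rate can be written out explicitly in R (the class-conditional distributions below are chosen for illustration):

```r
# X | Y=1 ~ N(1, 1), X | Y=0 ~ N(-1, 1), P(Y = 1) = 0.5.
# By symmetry, P(Y = 1 | X = x) > 0.5 exactly when x > 0,
# so the Bayes classifier predicts class 1 iff x > 0.
set.seed(1)
n <- 10000
y <- rbinom(n, 1, 0.5)
x <- rnorm(n, mean = ifelse(y == 1, 1, -1))

yhat <- as.integer(x > 0)   # the Bayes classifier for this setup
mean(yhat != y)             # error rate, close to pnorm(-1), about 0.159
```

Even this optimal classifier makes errors, because the two class-conditional densities overlap; that residual rate is the Bayes error rate.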
Figure: A simulated data set consisting of 100 observations in each of two groups, indicated in blue and in orange. The purple dashed line represents the Bayes decision boundary. The orange background grid indicates the region in which a test observation will be assigned to the orange class, and the blue background grid indicates the region in which a test observation will be assigned to the blue class.
Optimality of Bayes classifier
Given a classifier f(x), for a future observation pair (X, Y), the expected test error rate is

E(I(Y ≠ f(X))) = E_X ( E_{Y|X} I(Y ≠ f(X)) ).

To find the optimal classifier, it suffices for the classifier to minimize

E_{Y|x} I(Y ≠ f(x))

for every x. Note that

E_{Y|x} I(Y ≠ f(x)) = P(Y ≠ f(x) | X = x) = 1 − P(Y = f(x) | X = x).

Therefore, we should choose f(x) such that P(Y = f(x) | X = x) is the largest; that is, choose f(x) to be class j if P(Y = j | X = x) is the largest among all the possible classes. This amounts to the Bayes classifier.
Bayes error rate
At x, the Bayes error rate is

1 − max_j P(Y = j | X = x).

The overall Bayes error rate is

1 − E[max_j P(Y = j | X)].

For our simulated data, the Bayes error rate is 0.1304. It is greater than zero because the classes overlap in the true population, so max_j P(Y = j | X = x0) < 1 for some values of x0. The Bayes error rate is analogous to the irreducible error in the regression setting.
In practice, the posterior probability P(Y = j | X = x0) has to be estimated, which will be discussed in detail in Chapter 4. One example is the K-nearest neighbor classifier.
K-nearest neighbor (KNN) classifier
The algorithm:
I Locate the K points closest to the point x0 we would like to classify. Denote the neighborhood by N0 and the corresponding responses by y1, · · · , yK.
I Assign x0 to the most frequent class among the K points. That is, estimate the probability of class j at x0 as

P̂(Y = j | X = x0) = (1/K) ∑_{i∈N0} I(yi = j).

KNN then applies the Bayes rule and classifies the test observation x0 to the class with the largest estimated probability.
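The algorithm above can be sketched directly in base R (a one-dimensional predictor for brevity; the data and the choice K = 3 are illustrative):

```r
# Minimal KNN classifier: majority vote among the K nearest training points
knn_predict <- function(x_train, y_train, x0, K) {
  nbrs <- order(abs(x_train - x0))[1:K]   # indices of the K closest points (N0)
  tab  <- table(y_train[nbrs])            # K times the estimated class probabilities
  names(tab)[which.max(tab)]              # class with the largest probability
}

set.seed(1)
x <- c(rnorm(20, mean = -2), rnorm(20, mean = 2))
y <- rep(c("blue", "orange"), each = 20)
knn_predict(x, y, x0 = 2.5, K = 3)        # "orange": all nearby points are orange
```

Chapter 4 discusses ready-made implementations; this sketch only makes the two steps of the algorithm explicit.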
Figure: The KNN approach, using K = 3, is illustrated in a simple situation with six blue observations and six orange observations. Left: a test observation at which a predicted class label is desired is shown as a black cross. The three closest points to the test observation are identified, and it is predicted that the test observation belongs to the most commonly-occurring class, in this case blue. Right: The KNN decision boundary for this example is shown in black. The blue grid indicates the region in which a test observation will be assigned to the blue class, and the orange grid indicates the region in which it will be assigned to the orange class.
Compare KNN with the optimal Bayes classifier
Figure: The black curve indicates the KNN decision boundary using K = 10. The Bayes decision boundary is shown as a purple dashed line. The KNN and Bayes decision boundaries are very similar.
Changing K
Figure: A comparison of the KNN decision boundaries (solid black curves) obtained using K = 1 and K = 100. With K = 1, the decision boundary is overly flexible, while with K = 100 it is not sufficiently flexible. The Bayes decision boundary is shown as a purple dashed line.
Optimal K
Figure: The KNN training error rate (blue, 200 observations) and test error rate (orange, 5,000 observations), as the level of flexibility (assessed using 1/K) increases, or equivalently as the number of neighbors K decreases. The black dashed line indicates the Bayes error rate. The jumpiness of the curves is due to the small size of the training data set.