Statistical Methods for Data Mining
Kuangnan Fang
Xiamen University Email: [email protected]
February 14, 2016
Linear Model Selection and Regularization
• Recall the linear model
Y = β0 + β1X1 + · · · + βpXp + ε.
• In the lectures that follow, we consider some approaches for extending the linear model framework. In the lectures covering Chapter 7 of the text, we generalize the linear model in order to accommodate non-linear, but still additive, relationships.
• In the lectures covering Chapter 8 we consider even more general non-linear models.
1 / 57
In praise of linear models!
• Despite its simplicity, the linear model has distinct advantages in terms of its interpretability and often shows good predictive performance.
• Hence we discuss in this lecture some ways in which the simple linear model can be improved, by replacing ordinary least squares fitting with some alternative fitting procedures.
2 / 57
Why consider alternatives to least squares?
• Prediction Accuracy: especially when p > n, to control the variance.
• Model Interpretability: By removing irrelevant features — that is, by setting the corresponding coefficient estimates to zero — we can obtain a model that is more easily interpreted. We will present some approaches for automatically performing feature selection.
3 / 57
Three classes of methods
• Subset Selection. We identify a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.
• Shrinkage. We fit a model involving all p predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection.
• Dimension Reduction. We project the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.
4 / 57
Subset Selection
Best subset and stepwise model selection procedures
Best Subset Selection
1. Let M0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For k = 1, 2, . . . , p:
   (a) Fit all (p choose k) models that contain exactly k predictors.
   (b) Pick the best among these (p choose k) models, and call it Mk. Here best is defined as having the smallest RSS, or equivalently the largest R².
3. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
5 / 57
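As a concrete sketch of the three steps above (illustrative only; the helper names `rss` and `best_subset` are ours, not from the text, and NumPy is assumed):

```python
import itertools
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X (with intercept)."""
    Xi = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    r = y - Xi @ beta
    return float(r @ r)

def best_subset(X, y):
    """For each size k = 1..p, return the predictor set with smallest RSS (model M_k)."""
    p = X.shape[1]
    best = {}
    for k in range(1, p + 1):
        # Fit all (p choose k) models with exactly k predictors; keep the best.
        best[k] = min(itertools.combinations(range(p), k),
                      key=lambda S: rss(X[:, S], y))
    return best

# Toy data: the response depends only on columns 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=100)
print(best_subset(X, y)[2])  # the best 2-predictor model should be (0, 2)
```

Step 3 (choosing among M0, . . . , Mp) would then use cross-validation or one of the criteria discussed later, not RSS.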
Example- Credit data set
[Figure: RSS (left) and R² (right) plotted against the number of predictors, 1 to 11.]

For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R² are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R². Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
6 / 57
Extensions to other models
• Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression.
• The deviance — negative two times the maximized log-likelihood — plays the role of RSS for a broader class of models.
7 / 57
Stepwise Selection
• For computational reasons, best subset selection cannot be applied with very large p. Why not?
• Best subset selection may also suffer from statistical problems when p is large: the larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data.
• Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates.
• For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.
8 / 57
Forward Stepwise Selection
• Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.
• In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.
9 / 57
In Detail
Forward Stepwise Selection
1. Let M0 denote the null model, which contains no predictors.
2. For k = 0, . . . , p − 1:
   2.1 Consider all p − k models that augment the predictors in Mk with one additional predictor.
   2.2 Choose the best among these p − k models, and call it Mk+1. Here best is defined as having smallest RSS or highest R².
3. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
10 / 57
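A minimal NumPy sketch of the greedy loop above (illustrative; `rss` and `forward_stepwise` are names invented here):

```python
import numpy as np

def rss(X, y):
    """RSS of the least squares fit of y on X (with intercept)."""
    Xi = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    r = y - Xi @ beta
    return float(r @ r)

def forward_stepwise(X, y):
    """Return the nested sequence of selected index sets M_0 ⊂ M_1 ⊂ ... ⊂ M_p."""
    p = X.shape[1]
    selected, path = [], [()]
    remaining = list(range(p))
    while remaining:
        # Add the single predictor giving the greatest RSS reduction.
        j = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(j)
        remaining.remove(j)
        path.append(tuple(selected))
    return path

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = 4 * X[:, 1] + 2 * X[:, 3] + rng.normal(scale=0.1, size=80)
path = forward_stepwise(X, y)
print(path[1], path[2])  # the strongest predictor enters first
```

Only p + (p−1) + · · · + 1 fits are needed here, versus 2^p for best subset selection.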
More on Forward Stepwise Selection
• Computational advantage over best subset selection is clear.
• It is not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors. Why not? Give an example.
11 / 57
Credit data example
# Variables   Best subset                      Forward stepwise
One           rating                           rating
Two           rating, income                   rating, income
Three         rating, income, student          rating, income, student
Four          cards, income, student, limit    rating, income, student, limit

The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth models differ.
12 / 57
Backward Stepwise Selection
• Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection.
• However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.
13 / 57
Backward Stepwise Selection: details
Backward Stepwise Selection
1. Let Mp denote the full model, which contains all p predictors.
2. For k = p, p − 1, . . . , 1:
   2.1 Consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors.
   2.2 Choose the best among these k models, and call it Mk−1. Here best is defined as having smallest RSS or highest R².
3. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
14 / 57
More on Backward Stepwise Selection
• Like forward stepwise selection, the backward selection approach searches through only 1 + p(p + 1)/2 models, and so can be applied in settings where p is too large to apply best subset selection.
• Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors.
• Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.
15 / 57
Choosing the Optimal Model
• The model containing all of the predictors will always have the smallest RSS and the largest R², since these quantities are related to the training error.
• We wish to choose a model with low test error, not a model with low training error. Recall that training error is usually a poor estimate of test error.
• Therefore, RSS and R² are not suitable for selecting the best model among a collection of models with different numbers of predictors.
16 / 57
Estimating test error: two approaches
• We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
• We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in previous lectures.
• We illustrate both approaches next.
17 / 57
Cp, AIC, BIC, and Adjusted R²

• These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables.
• The next figure displays Cp, BIC, and adjusted R² for the best model of each size produced by best subset selection on the Credit data set.
18 / 57
Credit data example
[Figure: Cp (left), BIC (center), and adjusted R² (right) for the best model of each size on the Credit data set, plotted against the number of predictors.]
19 / 57
Now for some details
• Mallow's Cp:

  Cp = (1/n)( RSS + 2 d σ̂² ),

  where d is the total # of parameters used and σ̂² is an estimate of the variance of the error ε associated with each response measurement.
• The AIC criterion is defined for a large class of models fit by maximum likelihood:

  AIC = −2 log L + 2 d,

  where L is the maximized value of the likelihood function for the estimated model.
• In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and Cp and AIC are equivalent. Prove this.
20 / 57
Details on BIC
  BIC = (1/n)( RSS + log(n) d σ̂² ).

• Like Cp, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.
• Notice that BIC replaces the 2dσ̂² used by Cp with a log(n)dσ̂² term, where n is the number of observations.
• Since log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than Cp. See Figure on slide 19.
21 / 57
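The Cp and BIC formulas above are simple to compute once σ̂² is available; in this sketch (illustrative names; not from the text) we estimate σ̂² from the full model, a common but not mandated choice:

```python
import numpy as np

def fit_rss(X, y):
    """RSS of the least squares fit of y on X (with intercept)."""
    Xi = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    r = y - Xi @ beta
    return float(r @ r)

def cp_bic(rss, n, d, sigma2):
    """Mallow's Cp and BIC exactly as defined on the slides (d = # parameters)."""
    cp = (rss + 2 * d * sigma2) / n
    bic = (rss + np.log(n) * d * sigma2) / n
    return cp, bic

rng = np.random.default_rng(2)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] + rng.normal(size=n)          # only the first predictor matters
sigma2 = fit_rss(X, y) / (n - p - 1)          # error-variance estimate from full model

# Compare the true 1-variable model (d = 2: intercept + slope) with the full model.
cp1, bic1 = cp_bic(fit_rss(X[:, :1], y), n, d=2, sigma2=sigma2)
cpp, bicp = cp_bic(fit_rss(X, y), n, d=p + 1, sigma2=sigma2)
print(bic1 < bicp)  # BIC's heavier log(n) penalty should favor the smaller model
```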
Adjusted R2
• For a least squares model with d variables, the adjusted R² statistic is calculated as

  Adjusted R² = 1 − [ RSS/(n − d − 1) ] / [ TSS/(n − 1) ],

  where TSS is the total sum of squares.
• Unlike Cp, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted R² indicates a model with a small test error.
• Maximizing the adjusted R² is equivalent to minimizing RSS/(n − d − 1). While RSS always decreases as the number of variables in the model increases, RSS/(n − d − 1) may increase or decrease, due to the presence of d in the denominator.
• Unlike the R² statistic, the adjusted R² statistic pays a price for the inclusion of unnecessary variables in the model. See Figure on slide 19.
22 / 57
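The formula above in a few lines (illustrative sketch; the function name is ours):

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R² = 1 - [RSS/(n-d-1)] / [TSS/(n-1)] for the least squares fit."""
    n, d = X.shape
    Xi = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    rss = float(np.sum((y - Xi @ beta) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

rng = np.random.default_rng(3)
n = 150
X = rng.normal(size=(n, 10))
y = 5 * X[:, 0] + rng.normal(size=n)   # nine of the ten predictors are pure noise
# Compare the true 1-variable model with the full 10-variable model; the nine
# noise predictors give little or no improvement in adjusted R².
print(adjusted_r2(X[:, :1], y), adjusted_r2(X, y))
```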
Validation and Cross-Validation
• Each of the procedures returns a sequence of models Mk indexed by model size k = 0, 1, 2, . . .. Our job here is to select k̂. Once selected, we will return model Mk̂.
• We compute the validation set error or the cross-validation error for each model Mk under consideration, and then select the k for which the resulting estimated test error is smallest.
• This procedure has an advantage relative to AIC, BIC, Cp, and adjusted R², in that it provides a direct estimate of the test error, and doesn't require an estimate of the error variance σ².
• It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance σ².
23 / 57
Credit data example
[Figure: Square root of BIC (left), validation set error (center), and cross-validation error (right) for the Credit data set, plotted against the number of predictors.]
24 / 57
Details of Previous Figure
• The validation errors were calculated by randomly selecting three-quarters of the observations as the training set, and the remainder as the validation set.
• The cross-validation errors were computed using k = 10 folds. In this case, the validation and cross-validation methods both result in a six-variable model.
• However, all three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors.
• In this setting, we can select a model using the one-standard-error rule. We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve. What is the rationale for this?
25 / 57
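The one-standard-error rule is easy to state in code; the CV curve below is made-up illustrative data (sizes 4-6 are statistically indistinguishable, as on the slide):

```python
import numpy as np

def one_se_rule(model_sizes, cv_mean, cv_se):
    """Smallest model whose estimated test error is within one SE of the minimum."""
    cv_mean, cv_se = np.asarray(cv_mean), np.asarray(cv_se)
    best = int(np.argmin(cv_mean))
    threshold = cv_mean[best] + cv_se[best]
    # Among all model sizes meeting the threshold, take the smallest.
    eligible = [s for s, m in zip(model_sizes, cv_mean) if m <= threshold]
    return min(eligible)

sizes = [1, 2, 3, 4, 5, 6, 7]
mean  = [210, 180, 150, 122, 121, 120, 123]   # minimum at size 6
se    = [5, 5, 5, 4, 4, 4, 4]
print(one_se_rule(sizes, mean, se))  # → 4: the smallest model within one SE
```

The rationale: if several models are statistically tied, prefer the simplest one.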
Shrinkage Methods
Ridge regression and Lasso
• The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
• As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
• It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
26 / 57
Ridge regression
• Recall that the least squares fitting procedure estimates β0, β1, . . . , βp using the values that minimize

  RSS = Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )².

• In contrast, the ridge regression coefficient estimates β̂R are the values that minimize

  Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )² + λ Σ_{j=1}^p βj² = RSS + λ Σ_{j=1}^p βj²,

  where λ ≥ 0 is a tuning parameter, to be determined separately.
27 / 57
Ridge regression
• Writing the criterion in matrix form,

  RSS(λ) = (y − Xβ)ᵀ(y − Xβ) + λ βᵀβ,

• the ridge regression solutions are easily seen to be

  β̂_ridge = (XᵀX + λI)⁻¹ Xᵀ y,

  where I is the p × p identity matrix.
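The closed-form solution above can be checked numerically; a sketch assuming centered/standardized X and centered y, so the intercept can be ignored (an assumption of this example, not a requirement of the formula):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution (XᵀX + λI)⁻¹ Xᵀy.
    Assumes X is standardized and y centered, so no intercept is needed."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 5))
X = (X - X.mean(0)) / X.std(0)
y = X @ np.array([3.0, 0.0, -1.0, 0.0, 2.0]) + rng.normal(size=60)
y = y - y.mean()

b0 = ridge(X, y, lam=0.0)      # λ = 0 recovers ordinary least squares
b_big = ridge(X, y, lam=1e6)   # a huge λ shrinks every coefficient toward zero
print(np.abs(b_big).max() < np.abs(b0).max())
```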
Ridge regression: continued
• As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.
• However, the second term, λ Σ_j βj², called a shrinkage penalty, is small when β1, . . . , βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero.
• The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates.
• Selecting a good value for λ is critical; cross-validation is used for this.
28 / 57
Credit data example
[Figure: Standardized ridge coefficient paths for the Credit data (Income, Limit, Rating, Student), plotted against λ (left) and against ‖β̂R_λ‖₂/‖β̂‖₂ (right).]
29 / 57
Details of Previous Figure
• In the left-hand panel, each curve corresponds to the ridge regression coefficient estimate for one of the ten variables, plotted as a function of λ.
• The right-hand panel displays the same ridge coefficient estimates as the left-hand panel, but instead of displaying λ on the x-axis, we now display ‖β̂R_λ‖₂/‖β̂‖₂, where β̂ denotes the vector of least squares coefficient estimates.
• The notation ‖β‖₂ denotes the ℓ2 norm (pronounced “ell 2”) of a vector, and is defined as ‖β‖₂ = sqrt( Σ_{j=1}^p βj² ).
30 / 57
Ridge regression: scaling of predictors
• The standard least squares coefficient estimates are scale equivariant: multiplying Xj by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of 1/c. In other words, regardless of how the jth predictor is scaled, Xj β̂j will remain the same.
• In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
• Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula

  x̃ij = xij / sqrt( (1/n) Σ_{i=1}^n (xij − x̄j)² ).
31 / 57
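A one-line NumPy version of the standardization formula above (note that `np.std` uses the 1/n convention by default, matching the slide's formula):

```python
import numpy as np

def standardize(X):
    """x̃_ij = x_ij / sqrt((1/n) Σ_i (x_ij − x̄_j)²): divide each column by its
    (population) standard deviation, as in the slide's formula."""
    return X / X.std(axis=0)

rng = np.random.default_rng(5)
# Columns on wildly different scales, as in raw data.
X = rng.normal(size=(50, 3)) * np.array([1.0, 100.0, 0.01])
Xs = standardize(X)
print(Xs.std(axis=0))  # every column now has standard deviation 1
```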
Why Does Ridge Regression Improve Over Least Squares?

The Bias-Variance tradeoff

[Figure: Squared bias, variance, and test MSE for ridge regression, plotted against λ (left) and against ‖β̂R_λ‖₂/‖β̂‖₂ (right).]

Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients. Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of λ and ‖β̂R_λ‖₂/‖β̂‖₂. The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.

32 / 57
The Lasso
• Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
• The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, β̂L_λ, minimize the quantity

  Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )² + λ Σ_{j=1}^p |βj| = RSS + λ Σ_{j=1}^p |βj|.

• In statistical parlance, the lasso uses an ℓ1 (pronounced “ell 1”) penalty instead of an ℓ2 penalty. The ℓ1 norm of a coefficient vector β is given by ‖β‖₁ = Σ |βj|.
33 / 57
The Lasso: continued
• As with ridge regression, the lasso shrinks the coefficient estimates towards zero.
• However, in the case of the lasso, the ℓ1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large.
• Hence, much like best subset selection, the lasso performs variable selection.
• We say that the lasso yields sparse models — that is, models that involve only a subset of the variables.
• As in ridge regression, selecting a good value of λ for the lasso is critical; cross-validation is again the method of choice.
34 / 57
Example: Credit dataset
[Figure: Standardized lasso coefficient paths for the Credit data (Income, Limit, Rating, Student), plotted against λ (left) and against ‖β̂L_λ‖₁/‖β̂‖₁ (right).]
35 / 57
The Variable Selection Property of the Lasso
Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?

One can show that the lasso and ridge regression coefficient estimates solve the problems

  minimize_β Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )²   subject to Σ_{j=1}^p |βj| ≤ s

and

  minimize_β Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )²   subject to Σ_{j=1}^p βj² ≤ s,

respectively.
36 / 57
The Lasso Picture

[Figure: Contours of the RSS together with the constraint regions |β1| + |β2| ≤ s (lasso, a diamond) and β1² + β2² ≤ s (ridge, a circle); the corners of the ℓ1 region make solutions with some coefficients exactly zero likely.]
37 / 57
Coordinate descent algorithm for Lasso
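The algorithm itself did not survive extraction here. Below is a standard coordinate descent sketch with soft-thresholding, assuming standardized columns, centered y, and a (1/2n) scaling of the RSS (one common convention, not necessarily the lecture's):

```python
import numpy as np

def soft_threshold(z, g):
    """S(z, γ) = sign(z)·max(|z| − γ, 0), the soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/(2n))‖y − Xβ‖² + λ‖β‖₁.
    Assumes the columns of X are standardized and y is centered."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding predictor j.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return beta

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 8))
X = (X - X.mean(0)) / X.std(0)
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(scale=0.5, size=100)
y = y - y.mean()
beta = lasso_cd(X, y, lam=0.3)
print(np.nonzero(np.abs(beta) > 1e-8)[0])  # only a few coefficients survive
```

Each coordinate update is exact for a single βj holding the others fixed, which is why the ℓ1 penalty zeroes coefficients outright.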
Comparing the Lasso and Ridge Regression
[Figure: Lasso squared bias, variance, and test MSE against λ (left); lasso vs. ridge plotted against training R² (right).]

Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on the simulated data set of Slide 32. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their R² on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.
38 / 57
Comparing the Lasso and Ridge Regression: continued
[Figure: Lasso squared bias, variance, and test MSE against λ (left); lasso vs. ridge plotted against training R² (right), for the sparse setting.]

Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso. The simulated data is similar to that in Slide 38, except that now only two predictors are related to the response. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their R² on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.
39 / 57
Conclusions
• These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other.
• In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.
• However, the number of predictors that is related to the response is never known a priori for real data sets.
• A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.
40 / 57
Adaptive Lasso
Generalization of shrinkage methods

• We can generalize ridge regression and the lasso. Consider the criterion

  RSS + λ Σ_{j=1}^p |βj|^q.

• For q > 1, |βj|^q is differentiable at 0, and so the penalty does not share the lasso's ability to set coefficients exactly to zero.
• Zou and Hastie (2005) introduced the elastic-net penalty

  λ Σ_{j=1}^p ( α βj² + (1 − α) |βj| ),

  a different compromise between ridge and lasso.

MCP

• Zhang (2010) proposed the MCP penalty.

SCAD

• Fan and Li (2001) proposed the SCAD penalty.

Group LASSO

• Yuan and Lin (2006) proposed the group LASSO penalty.

Composite MCP

• Composite MCP:

  P_outer( Σ_{k=1}^{p_j} P_inner( |β_k^{(j)}| ) )

• Composite MCP for logistic regression

adSGL

Fang, Wang and Ma (2015) propose the adaptive sparse group lasso:

  min { (1/2) ‖ y − Σ_{j=1}^J Xj β^{(j)} ‖₂² + λ(1 − α) Σ_{j=1}^J wj ‖β^{(j)}‖₂ + λα Σ_{j=1}^J ξ^{(j)ᵀ} |β^{(j)}| }   (1)

where W = (w1, . . . , wJ)ᵀ ∈ ℝ^J₊ is the group weight vector, ξᵀ = (ξ^{(1)ᵀ}, . . . , ξ^{(J)ᵀ}) = (ξ^{(1)}_1, . . . , ξ^{(1)}_{p1}, . . . , ξ^{(J)}_1, . . . , ξ^{(J)}_{pJ}) ∈ ℝ^p₊ denotes the individual weights, and λ ∈ ℝ₊ is the tuning parameter. For different groups, the penalty level can be different. By adopting a lower penalty for large coefficients and a higher penalty for small ones, we expect this to improve variable selection accuracy and reduce estimation bias.

adSGL

We use the group bridge estimator to construct these two types of weights:

  wj = ( ‖ β̂^{(j)}(GB) ‖₁ + 1/n )⁻¹,

  ξ^{(j)}_i = ( | β̂^{(j)}_i(GB) | + 1/n )⁻¹.
Selecting the Tuning Parameter for Ridge Regression and Lasso

• As for subset selection, for ridge regression and lasso we require a method to determine which of the models under consideration is best.
• That is, we require a method for selecting a value for the tuning parameter λ or, equivalently, the value of the constraint s.
• Cross-validation provides a simple way to tackle this problem. We choose a grid of λ values, and compute the cross-validation error rate for each value of λ.
• We then select the tuning parameter value for which the cross-validation error is smallest.
• Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
41 / 57
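The grid-search procedure above can be sketched by hand (illustrative code using the closed-form ridge fit; in practice a library routine would be used):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge fit; assumes standardized X and centered y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_ridge(X, y, lams, k=10, seed=0):
    """K-fold cross-validation MSE for each λ on a grid; returns the best λ."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    errs = []
    for lam in lams:
        mse = 0.0
        for fold in folds:
            train = np.setdiff1d(np.arange(n), fold)
            b = ridge(X[train], y[train], lam)
            mse += np.mean((y[fold] - X[fold] @ b) ** 2)
        errs.append(mse / k)
    return lams[int(np.argmin(errs))], errs

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 20))
y = X @ rng.normal(size=20) * 0.5 + rng.normal(size=100)
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam, errs = cv_ridge(X, y, grid)
print(best_lam)
```

The final step on the slide (re-fitting on all observations with the chosen λ) would be `ridge(X, y, best_lam)`.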
Credit data example
[Figure: Ridge regression on the Credit data: cross-validation error (left) and standardized coefficient paths (right), both plotted against λ.]

Left: Cross-validation errors that result from applying ridge regression to the Credit data set with various values of λ. Right: The coefficient estimates as a function of λ. The vertical dashed lines indicate the value of λ selected by cross-validation.
42 / 57
Simulated data example
[Figure: Ten-fold cross-validation MSE for the lasso (left) and the corresponding standardized coefficient paths (right), both plotted against ‖β̂L_λ‖₁/‖β̂‖₁.]

Left: Ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set from Slide 39. Right: The corresponding lasso coefficient estimates are displayed. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.
43 / 57
Dimension Reduction Methods
• The methods that we have discussed so far in this chapter have involved fitting linear regression models, via least squares or a shrunken approach, using the original predictors, X1, X2, . . . , Xp.
• We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.
44 / 57
Dimension Reduction Methods: details
• Let Z1, Z2, . . . , ZM represent M < p linear combinations of our original p predictors. That is,

  Zm = Σ_{j=1}^p φmj Xj   (1)

  for some constants φm1, . . . , φmp.
• We can then fit the linear regression model

  yi = θ0 + Σ_{m=1}^M θm zim + εi,   i = 1, . . . , n,   (2)

  using ordinary least squares.
• Note that in model (2), the regression coefficients are given by θ0, θ1, . . . , θM. If the constants φm1, . . . , φmp are chosen wisely, then such dimension reduction approaches can often outperform OLS regression.
45 / 57
• Notice that from definition (1),

  Σ_{m=1}^M θm zim = Σ_{m=1}^M θm Σ_{j=1}^p φmj xij = Σ_{j=1}^p ( Σ_{m=1}^M θm φmj ) xij = Σ_{j=1}^p βj xij,

  where

  βj = Σ_{m=1}^M θm φmj.   (3)

• Hence model (2) can be thought of as a special case of the original linear regression model.
• Dimension reduction serves to constrain the estimated βj coefficients, since now they must take the form (3).
• Can win in the bias-variance tradeoff.
46 / 57
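The identity (3) can be verified numerically: fitting y on the constructed Z and mapping θ back through the φ's gives exactly the same fitted values as a linear model in the original X (all data here are synthetic, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, M = 60, 5, 2
X = rng.normal(size=(n, p))
phi = rng.normal(size=(M, p))          # constants φ_m1, ..., φ_mp
Z = X @ phi.T                          # Z_m = Σ_j φ_mj X_j
y = rng.normal(size=n)

# Fit y on Z by least squares: intercept θ_0 plus θ_1, ..., θ_M.
Zi = np.column_stack([np.ones(n), Z])
theta, *_ = np.linalg.lstsq(Zi, y, rcond=None)

# Implied coefficients on the original predictors: β_j = Σ_m θ_m φ_mj (equation (3)).
beta = phi.T @ theta[1:]
fitted_from_X = theta[0] + X @ beta
fitted_from_Z = Zi @ theta
print(np.allclose(fitted_from_X, fitted_from_Z))  # → True: model (2) is linear in X
```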
Principal Components Regression
• Here we apply principal components analysis (PCA) (discussed in Chapter 10 of the text) to define the linear combinations of the predictors, for use in our regression.
• The first principal component is that (normalized) linear combination of the variables with the largest variance.
• The second principal component has largest variance, subject to being uncorrelated with the first.
• And so on.
• Hence with many correlated original variables, we replace them with a small set of principal components that capture their joint variation.
47 / 57
Principal Components Analysis
• PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated.
• Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization.
5 / 52
Principal Components Analysis: details
• The first principal component of a set of features X1, X2, . . . , Xp is the normalized linear combination of the features

  Z1 = φ11 X1 + φ21 X2 + . . . + φp1 Xp

  that has the largest variance. By normalized, we mean that Σ_{j=1}^p φj1² = 1.
• We refer to the elements φ11, . . . , φp1 as the loadings of the first principal component; together, the loadings make up the principal component loading vector, φ1 = (φ11 φ21 . . . φp1)ᵀ.
• We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.
6 / 52
PCA: example
[Figure: Scatter of pop versus ad with the two principal component directions.]

The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component direction, and the blue dashed line indicates the second principal component direction.
7 / 52
Pictures of PCA: continued
[Figure: First principal component scores plotted against pop (left) and ad (right).]

Plots of the first principal component scores zi1 versus pop and ad. The relationships are strong.
50 / 57
Pictures of PCA: continued
[Figure: Second principal component scores plotted against pop (left) and ad (right).]

Plots of the second principal component scores zi2 versus pop and ad. The relationships are weak.
51 / 57
Computation of Principal Components
• Suppose we have an n × p data set X. Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero (that is, the column means of X are zero).
• We then look for the linear combination of the sample feature values of the form

  zi1 = φ11 xi1 + φ21 xi2 + . . . + φp1 xip   (1)

  for i = 1, . . . , n that has largest sample variance, subject to the constraint that Σ_{j=1}^p φj1² = 1.
• Since each of the xij has mean zero, then so does zi1 (for any values of φj1). Hence the sample variance of the zi1 can be written as (1/n) Σ_{i=1}^n zi1².
8 / 52
Computation: continued
• Plugging in (1), the first principal component loading vector solves the optimization problem

  maximize_{φ11, . . . , φp1} (1/n) Σ_{i=1}^n ( Σ_{j=1}^p φj1 xij )²   subject to Σ_{j=1}^p φj1² = 1.

• This problem can be solved via a singular-value decomposition of the matrix X, a standard technique in linear algebra.
• We refer to Z1 as the first principal component, with realized values z11, . . . , zn1.
9 / 52
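A NumPy sketch of the SVD computation above: the first right singular vector is the first loading vector, and the component's sample variance is (1/n) times the squared first singular value (synthetic data, for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)
# Correlated synthetic data with one dominant direction of variation.
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])
X = X - X.mean(axis=0)                 # center each column, as the slides assume

# The right singular vectors of X are the principal component directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]                           # first loading vector, ‖φ1‖ = 1
z1 = X @ phi1                          # first principal component scores

# Its sample variance equals (1/n)·(first singular value)².
print(np.isclose(z1.var(), s[0] ** 2 / len(X)))  # → True
```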
Geometry of PCA
• The loading vector φ1 with elements φ11, φ21, . . . , φp1 defines a direction in feature space along which the data vary the most.
• If we project the n data points x1, . . . , xn onto this direction, the projected values are the principal component scores z11, . . . , zn1 themselves.
10 / 52
Further principal components
• The second principal component is the linear combination of X1, . . . , Xp that has maximal variance among all linear combinations that are uncorrelated with Z1.
• The second principal component scores z12, z22, . . . , zn2 take the form

  zi2 = φ12 xi1 + φ22 xi2 + . . . + φp2 xip,

  where φ2 is the second principal component loading vector, with elements φ12, φ22, . . . , φp2.
11 / 52
Further principal components: continued
• It turns out that constraining Z2 to be uncorrelated with Z1 is equivalent to constraining the direction φ2 to be orthogonal (perpendicular) to the direction φ1. And so on.
• The principal component directions φ1, φ2, φ3, . . . are the ordered sequence of right singular vectors of the matrix X, and the variances of the components are 1/n times the squares of the singular values. There are at most min(n − 1, p) principal components.
12 / 52
Illustration
• USArrests data: For each of the fifty states in the United States, the data set contains the number of arrests per 100,000 residents for each of three crimes: Assault, Murder, and Rape. We also record UrbanPop (the percent of the population in each state living in urban areas).
• The principal component score vectors have length n = 50, and the principal component loading vectors have length p = 4.
• PCA was performed after standardizing each variable to have mean zero and standard deviation one.
13 / 52
USAarrests data: PCA plot
[Figure: Biplot of the first two principal components of the USArrests data, showing the state scores and the loading vectors for Murder, Assault, UrbanPop, and Rape.]
14 / 52
Figure details
The first two principal components for the USArrests data.
• The blue state names represent the scores for the first two principal components.
• The orange arrows indicate the first two principal component loading vectors (with axes on the top and right). For example, the loading for Rape on the first component is 0.54, and its loading on the second principal component is 0.17 [the word Rape is centered at the point (0.54, 0.17)].
• This figure is known as a biplot, because it displays both the principal component scores and the principal component loadings.
15 / 52
PCA loadings
            PC1        PC2
Murder      0.5358995  -0.4181809
Assault     0.5831836  -0.1879856
UrbanPop    0.2781909   0.8728062
Rape        0.5434321   0.1673186
16 / 52
Pictures of PCA: continued
A subset of the advertising data. Left: The first principal
component, chosen to minimize the sum of the squared
perpendicular distances to each point, is shown in green. These
distances are represented using the black dashed line segments.
Right: The left-hand panel has been rotated so that the first
principal component lies on the x-axis.
49 / 57
Another Interpretation of Principal Components
[Scatterplot of a simulated data set plotted in the coordinates of the first two principal components.]
17 / 52
PCA finds the hyperplane closest to the observations
• The first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness).
• The notion of principal components as the dimensions that are closest to the n observations extends beyond just the first principal component.
• For instance, the first two principal components of a data set span the plane that is closest to the n observations, in terms of average squared Euclidean distance.
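The closest-line property can be verified directly. A sketch on simulated 2-D data, comparing the first loading vector against a grid of other unit directions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=200)
X = X - X.mean(axis=0)

# First principal component loading vector
_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]

def avg_sq_perp_dist(X, v):
    """Average squared perpendicular distance from rows of X to the line span{v}."""
    v = v / np.linalg.norm(v)
    proj = np.outer(X @ v, v)          # projections onto the line
    return np.mean(np.sum((X - proj) ** 2, axis=1))

# No unit direction beats the first loading vector
best = avg_sq_perp_dist(X, phi1)
for theta in np.linspace(0, np.pi, 100):
    v = np.array([np.cos(theta), np.sin(theta)])
    assert best <= avg_sq_perp_dist(X, v) + 1e-12
```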
18 / 52
Scaling of the variables matters
• If the variables are in different units, scaling each to have standard deviation equal to one is recommended.
• If they are in the same units, you might or might not scale the variables.
[Biplots of the USArrests data with scaled variables (left, "Scaled") and unscaled variables (right, "Unscaled").]
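The effect of scaling can be reproduced numerically. A sketch with synthetic data in which one column plays the role of a high-variance variable like Assault (the scale factors are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
# Four independent variables; the second has a much larger variance
X = rng.normal(size=(50, 4)) * [3.0, 80.0, 14.0, 9.0]
Xc = X - X.mean(axis=0)

# Unscaled PCA: the high-variance variable dominates the first loading vector
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.round(np.abs(Vt[0]), 2))   # weight on the second variable is near 1

# Scaled PCA: the large-variance column no longer automatically dominates
Z = Xc / Xc.std(axis=0)
_, _, Vt_s = np.linalg.svd(Z, full_matrices=False)
print(np.round(np.abs(Vt_s[0]), 2))
```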
19 / 52
Proportion Variance Explained
• To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one.
• The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as
$$\sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2,$$
and the variance explained by the mth principal component is
$$\mathrm{Var}(Z_m) = \frac{1}{n} \sum_{i=1}^{n} z_{im}^2.$$
• It can be shown that $\sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{m=1}^{M} \mathrm{Var}(Z_m)$, with M = min(n − 1, p).
20 / 52
Proportion Variance Explained: continued
• Therefore, the PVE of the mth principal component is given by the positive quantity between 0 and 1
$$\frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}.$$
• The PVEs sum to one. We sometimes display the cumulative PVEs.
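These formulas translate directly into code. A minimal NumPy sketch on centered synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 4)) @ np.diag([4.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)           # the PVE formula assumes centered variables

_, s, _ = np.linalg.svd(X, full_matrices=False)

total_var = np.sum(X**2)         # sum_j sum_i x_ij^2
pve = s**2 / total_var           # s_m^2 = sum_i z_im^2, so this is the PVE of each component
print(np.round(pve, 3), np.round(np.cumsum(pve), 3))
```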
[Scree plot (left): proportion of variance explained by each principal component. Right: cumulative proportion of variance explained.]
21 / 52
How many principal components should we use?
If we use principal components as a summary of our data, how many components are sufficient?
• No simple answer to this question, as cross-validation is not available for this purpose.
• Why not?
• When could we use cross-validation to select the number of components?
• the "scree plot" on the previous slide can be used as a guide: we look for an "elbow".
22 / 52
Application to Principal Components Regression
PCR was applied to two simulated data sets. The black, green,
and purple lines correspond to squared bias, variance, and test
mean squared error, respectively. Left: Simulated data from
slide 32. Right: Simulated data from slide 39.
52 / 57
Principal Component Regression
In each panel, the irreducible error Var(ε) is shown as a horizontal dashed line. Left: Results for PCR. Right: Results for lasso (solid) and ridge regression (dotted). The x-axis displays the shrinkage factor of the coefficient estimates, defined as the ℓ2 norm of the shrunken coefficient estimates divided by the ℓ2 norm of the least squares estimate.
Choosing the number of directions M
[Left panel: standardized coefficient paths for Income, Limit, Rating, and Student. Right panel: cross-validation MSE. Both are plotted against the number of components.]
Left: PCR standardized coefficient estimates on the Credit data set for different values of M. Right: The 10-fold cross-validation MSE obtained using PCR, as a function of M.
53 / 57
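PCR itself is straightforward to sketch: compute the first M score vectors, then run OLS on them. A minimal NumPy implementation on synthetic data (`pcr_fit` is an illustrative helper, not a library function):

```python
import numpy as np

def pcr_fit(X, y, M):
    """Principal components regression: OLS of y on the first M PC score vectors."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:M].T                # n x M score matrix
    theta, *_ = np.linalg.lstsq(
        np.column_stack([np.ones(len(y)), Z]), y, rcond=None)
    # Map back to coefficients on the original (centered) predictors
    beta = Vt[:M].T @ theta[1:]
    return theta[0], beta

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)
intercept, beta = pcr_fit(X, y, M=3)
```

With M = p, PCR reproduces the ordinary least squares fit, since the M score vectors then span the full column space of the centered predictors.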
Partial Least Squares
• PCR identifies linear combinations, or directions, that best represent the predictors X1, . . . , Xp.
• These directions are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions.
• That is, the response does not supervise the identification of the principal components.
• Consequently, PCR suffers from a potentially serious drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
54 / 57
Partial Least Squares: continued
• Like PCR, PLS is a dimension reduction method, which first identifies a new set of features Z1, . . . , ZM that are linear combinations of the original features, and then fits a linear model via OLS using these M new features.
• But unlike PCR, PLS identifies these new features in a supervised way – that is, it makes use of the response Y in order to identify new features that not only approximate the old features well, but also that are related to the response.
• Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.
55 / 57
Details of Partial Least Squares
• After standardizing the p predictors, PLS computes the first direction Z1 by setting each φ1j in (1) equal to the coefficient from the simple linear regression of Y onto Xj.
• One can show that this coefficient is proportional to the correlation between Y and Xj.
• Hence, in computing $Z_1 = \sum_{j=1}^{p} \phi_{1j} X_j$, PLS places the highest weight on the variables that are most strongly related to the response.
• Subsequent directions are found by taking residuals and then repeating the above prescription.
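The first-direction recipe above can be sketched directly. A minimal NumPy version on synthetic data (`pls_first_direction` is an illustrative helper):

```python
import numpy as np

def pls_first_direction(X, y):
    """First PLS direction: phi_1j equals the simple-regression coefficient of y on X_j."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the predictors
    yc = y - y.mean()
    # Simple linear regression of y onto each standardized X_j
    phi1 = Xs.T @ yc / np.sum(Xs**2, axis=0)
    return phi1, Xs @ phi1                       # direction and Z_1 scores

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 5))
y = 3.0 * X[:, 0] + 0.5 * rng.normal(size=80)   # y is driven mostly by the first variable
phi1, z1 = pls_first_direction(X, y)
# The weight on the first variable is largest in absolute value,
# since it is the variable most strongly related to the response.
```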
56 / 57
Summary
• Model selection methods are an essential tool for data analysis, especially for big datasets involving many predictors.
• Research into methods that give sparsity, such as the lasso, is an especially hot area.
• Later, we will return to sparsity in more detail, and will describe related approaches such as the elastic net.
57 / 57