Page 1: Outline

Data mining and statistical learning, lecture 3

Outline

Ordinary least squares regression

Ridge regression

Page 2: Outline


Ordinary least squares regression (OLS)

Inputs x1, x2, …, xp; response y

Model:

y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \mathrm{error}

or, in vector form,

y = \beta_0 + \boldsymbol{\beta}^T \mathbf{x} + \mathrm{error}

Terminology:

\beta_0: intercept (or bias)

\beta_1, …, \beta_p: regression coefficients (or weights)

The response variable responds directly and linearly to changes in the inputs
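As a concrete illustration of the model, the SAS data step below simulates observations from y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + error with i.i.d. normal errors. This is a minimal sketch; the data set name sim and the coefficient values are made up for illustration.

data sim;                                /* hypothetical simulated data set */
  call streaminit(123);                  /* fix the random number seed */
  do i = 1 to 100;
    x1 = rand('uniform');
    x2 = rand('uniform');
    /* arbitrary coefficients: beta0 = 2, beta1 = 3, beta2 = -1.5; error ~ N(0, 0.5^2) */
    y = 2 + 3*x1 - 1.5*x2 + rand('normal', 0, 0.5);
    output;
  end;
run;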

Page 3: Outline


Least squares regression

Assume that we have observed a training set of data

Estimate the coefficients by minimizing the residual sum of squares

RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2

Case   X_1     X_2     …   X_p     Y
1      x_{11}  x_{21}  …   x_{p1}  y_1
2      x_{12}  x_{22}  …   x_{p2}  y_2
3      x_{13}  x_{23}  …   x_{p3}  y_3
…
N      x_{1N}  x_{2N}  …   x_{pN}  y_N

Page 4: Outline


Matrix formulation of OLS regression

RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2

Differentiating the residual sum of squares and setting the first derivatives equal to zero, we obtain

\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = 0

where

\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{p1} \\ 1 & x_{12} & x_{22} & \cdots & x_{p2} \\ \vdots & & & & \vdots \\ 1 & x_{1N} & x_{2N} & \cdots & x_{pN} \end{pmatrix}

and

\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}

Page 5: Outline


Parameter estimates and predictions


Least squares estimates of the parameters:

\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

Predicted values:

\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}
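These matrix formulas can be evaluated directly in SAS/IML. The sketch below is only an illustration: it reuses the hypothetical sim data set from the simulation example, builds X with a leading column of ones, and computes the least squares estimates and fitted values.

proc iml;
  use sim;                                 /* hypothetical data set */
  read all var {x1 x2} into X;
  read all var {y} into y;
  close sim;
  n = nrow(X);
  X = j(n, 1, 1) || X;                     /* prepend a column of ones for the intercept */
  beta_hat = solve(X`*X, X`*y);            /* solves (X'X) b = X'y, i.e. b = (X'X)^{-1} X'y */
  y_hat = X * beta_hat;                    /* fitted values X*beta_hat = Hy */
  print beta_hat;
quit;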

Page 6: Outline


Different sources of inputs


Quantitative inputs

Transformations of quantitative inputs

Numeric or dummy coding of the levels of qualitative inputs

Interactions between variables (e.g. X3 = X1 X2)

Example of dummy coding:

X_1 = 1 if Jan, 0 otherwise
X_2 = 1 if Feb, 0 otherwise
…
X_11 = 1 if Nov, 0 otherwise
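A dummy coding like this can be set up with logical expressions in a SAS data step. The sketch below is illustrative only; the input data set sales and its character variable month (holding three-letter month abbreviations) are assumptions.

data sales_coded;
  set sales;                      /* hypothetical input data set */
  x1  = (month = 'Jan');          /* 1 if January, 0 otherwise */
  x2  = (month = 'Feb');          /* 1 if February, 0 otherwise */
  /* ... one indicator per month, up to ... */
  x11 = (month = 'Nov');          /* 1 if November, 0 otherwise; December becomes the reference level */
run;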

Page 7: Outline


An example of multiple linear regression


Response variable: Requested price of used Porsche cars (1000 SEK)

Inputs:
X1 = Manufacturing year
X2 = Milage (km)
X3 = Model (0 or 1)
X4 = Equipment (1, 2, 3)
X5 = Colour (Red, Black, Silver, Blue, Black, White, Green)

Page 8: Outline


Price of used Porsche cars


Response variable: Requested price of used Porsche cars (1000 SEK)

Inputs:
X1 = Manufacturing year
X2 = Milage (km)

Inputs        Estimated model                                  RSS
Year          Price = -76829 + 38.6·Year                       113030
Milage        Price = 430.7 - 0.001862·Milage                  230212
Year, Milage  Price = -63809 + 32.1·Year - 0.000789·Milage      92541
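The three fits in the table can be reproduced with one PROC REG call containing a MODEL statement per candidate model. The data set name porsche and the variable names price, year and milage are assumptions about how the data are stored.

proc reg data=porsche;
  model price = year;            /* price vs manufacturing year only */
  model price = milage;          /* price vs milage only */
  model price = year milage;     /* both inputs */
run;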

Page 9: Outline


Interpretation of multiple regression coefficients

Assume that

Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon

and that the regression coefficients are estimated by ordinary least squares regression.

Then the multiple regression coefficient \hat{\beta}_j represents the additional contribution of x_j to y, after x_j has been adjusted for x_0, x_1, …, x_{j-1}, x_{j+1}, …, x_p.

Page 10: Outline


Confidence intervals for regression parameters

Assume that

Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon

where the X-variables are fixed and the error terms are i.i.d. N(0, \sigma^2).

Then

\hat{\beta}_j \pm t_{0.05}(N - p - 1)\,\hat{\sigma}\sqrt{v_j}

is a 95% confidence interval for \beta_j, where v_j is the jth diagonal element of (\mathbf{X}^T\mathbf{X})^{-1}.
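In SAS, PROC REG prints these confidence limits when the CLB option is added to the MODEL statement. A minimal sketch, again using the hypothetical porsche data set:

proc reg data=porsche;
  model price = year milage / clb alpha=0.05;   /* 95% confidence limits for the coefficients */
run;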

Page 11: Outline


Interpretation of software outputs

Adding new independent variables to a regression model alters at least one of the old regression coefficients unless the columns of the X-matrix are orthogonal, i.e.

\sum_{i=1}^{N} x_{ij} x_{ik} = 0 \quad \text{for } j \neq k

Regression of the price of used Porsche cars vs. milage (km) and manufacturing year:

Milage only:

Predictor     Coef        SE Coef     T       P
Constant      430.69      17.42       24.72   0.000
Milage (km)   -0.0018621  0.0002959   -6.29   0.000

Milage and Year:

Predictor     Coef        SE Coef     T       P
Constant      -63809      6976        -9.15   0.000
Milage (km)   -0.0007894  0.0002222   -3.55   0.001
Year          32.103      3.486        9.21   0.000


Page 12: Outline


Stepwise Regression: Price (1000SEK) versus Year, Milage (km), ...

Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15

Step              1         2         3         4
Constant     -76829    -63809    -53285    -52099

Year           38.6      32.1      26.8      26.2
T-Value       11.87      9.21      7.00      6.88
P-Value       0.000     0.000     0.000     0.000

Milage (km)            -0.00079  -0.00066  -0.00062
T-Value                  -3.55     -3.08     -2.88
P-Value                  0.001     0.003     0.006

Model                               37        27
T-Value                           2.72      1.83
P-Value                          0.009     0.073

Equipment                                   11.0
T-Value                                     1.52
P-Value                                    0.135

S              44.1      40.3      38.2      37.8
R-Sq          70.82     76.11     78.89     79.74
R-Sq(adj)     70.32     75.27     77.76     78.27
Mallows Cp     23.8      11.3       5.7       5.4

The p-value refers to a t-test of the hypothesis that the regression coefficient of the last entered x-variable is zero

Classical statistical model selection techniques are model-based.

In data mining, model selection is data-driven.
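The stepwise output shown above comes from Minitab. A roughly corresponding analysis in SAS uses the SELECTION= option of PROC REG, with entry and removal significance levels matching Alpha-to-Enter and Alpha-to-Remove; the data set and variable names (porsche, price, year, milage, model_dummy, equipment) are assumptions.

proc reg data=porsche;
  model price = year milage model_dummy equipment
        / selection=stepwise slentry=0.15 slstay=0.15;
run;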

Page 13: Outline


Stepwise Regression: Price (1000SEK) versus Year, Milage (km), ... - model validation by visual inspection of residuals

[Residual plots: "Versus Fits" and "Residuals Versus Milage (km)" (response is Price (1000SEK)): residuals plotted against the fitted values and against Milage (km)]

Residual = Observed - Predicted
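Equivalent residual plots can be produced in SAS by saving the fitted values and residuals from PROC REG and plotting them with PROC SGPLOT. This is a sketch using the hypothetical data set and variable names from the earlier examples.

proc reg data=porsche;
  model price = year milage;
  output out=porsche_resid p=fitted r=resid;   /* save predicted values and residuals */
run;

proc sgplot data=porsche_resid;
  scatter x=fitted y=resid;                    /* residuals versus fitted values */
  refline 0 / axis=y;
run;

proc sgplot data=porsche_resid;
  scatter x=milage y=resid;                    /* residuals versus milage */
  refline 0 / axis=y;
run;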

Page 14: Outline


The Gram-Schmidt procedure for regression by successive orthogonalization and simple linear regression

1. Initialize z_0 = x_0 = 1

2. For j = 1, …, p, compute

\hat{\gamma}_{kj} = \frac{\langle z_k, x_j \rangle}{\langle z_k, z_k \rangle}, \quad k = 0, \ldots, j-1

z_j = x_j - \sum_{k=0}^{j-1} \hat{\gamma}_{kj} z_k

where \langle \cdot\,, \cdot \rangle denotes the inner product (the sum of coordinate-wise products)

3. Regress y on z_p to obtain the multiple regression coefficient \hat{\beta}_p
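A minimal SAS/IML sketch of the orthogonalization is given below. It reuses the hypothetical sim data set; the first column of X plays the role of x_0 = 1, and the coefficient of the last input is obtained by regressing y on z_p.

proc iml;
  use sim;
  read all var {x1 x2} into X;
  read all var {y} into y;
  close sim;
  X = j(nrow(X), 1, 1) || X;                          /* column 1 is x0 = 1 */
  p1 = ncol(X);                                       /* p + 1 columns */
  Z = X;                                              /* Z will hold z0, ..., zp */
  do j = 2 to p1;
    do k = 1 to j-1;
      gamma = (Z[,k]` * X[,j]) / (Z[,k]` * Z[,k]);    /* <z_k, x_j> / <z_k, z_k> */
      Z[,j] = Z[,j] - gamma * Z[,k];
    end;
  end;
  beta_p = (Z[,p1]` * y) / (Z[,p1]` * Z[,p1]);        /* regress y on z_p */
  print beta_p;
quit;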

Page 15: Outline


Prediction of a response variable using correlated explanatory variables - daily temperatures in Stockholm, Göteborg, and Malmö

[Scatter plots of daily temperatures (-20 to 30 °C): Göteborg vs Stockholm, Malmö vs Stockholm, and Göteborg vs Malmö]

Page 16: Outline


Absorbance records for ten samples of chopped meat

[Line plot: absorbance (0 to 5) versus channel (1 to 100) for Sample_1 through Sample_10]

1 response variable (protein)

100 predictors (absorbance at 100 wavelengths or channels)

The predictors are strongly correlated to each other

Page 17: Outline


Absorbance records for 240 samples of chopped meat

The target is poorly correlated to each predictor

[Scatter plot: protein (%) versus absorbance in channel 50]

Page 18: Outline


Ridge regression

The ridge regression coefficients minimize a penalized residual sum of squares:

\hat{\beta}^{\,\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \big( y_i - \beta_0 - x_{i1}\beta_1 - \ldots - x_{ip}\beta_p \big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}

or

\hat{\beta}^{\,\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \big( y_i - \beta_0 - x_{i1}\beta_1 - \ldots - x_{ip}\beta_p \big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \leq s

Normally, the inputs are centred prior to the estimation of the regression coefficients.

Page 19: Outline


Matrix formulation of ridge regression for centred inputs

\hat{\boldsymbol{\beta}}^{\,\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}, \quad \text{minimizing} \quad \text{RSS}(\lambda) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\,\boldsymbol{\beta}^T\boldsymbol{\beta}

If the inputs are orthogonal, the ridge estimates are just a scaled version of the least squares estimates:

\hat{\beta}^{\,\text{ridge}} = \gamma\,\hat{\beta}, \quad \text{where } 0 \leq \gamma \leq 1

Shrinking enables estimation of regression coefficients even if the number of parameters exceeds the number of cases (Figure 3.7).
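The closed-form ridge estimate can be evaluated directly in SAS/IML. The sketch below centres the inputs and the response before applying the formula; the data set name and the value of lambda are made up for illustration.

proc iml;
  use sim;
  read all var {x1 x2} into X;
  read all var {y} into y;
  close sim;
  X = X - j(nrow(X), 1, 1) * mean(X);                     /* centre each input column */
  y = y - mean(y);                                        /* centre the response */
  lambda = 2;                                             /* arbitrary penalty for illustration */
  beta_ridge = solve(X`*X + lambda*I(ncol(X)), X`*y);     /* (X'X + lambda*I)^{-1} X'y */
  print beta_ridge;
quit;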

Page 20: Outline


Ridge regression – pros and cons

Ridge regression is particularly useful if the explanatory variables are strongly correlated to each other.

The variance of the estimated regression coefficients is reduced at the expense of (slightly) biased estimates.

Page 21: Outline


The Gauss-Markov theorem

Consider a linear regression model in which:
– the inputs are regarded as fixed
– the error terms are i.i.d. with mean 0 and variance \sigma^2.

Then the least squares estimator of a parameter a^T\beta has variance no bigger than that of any other linear unbiased estimator of a^T\beta.

Biased estimators may have smaller variance and mean squared error!

Page 22: Outline


SAS code for an ordinary least squares regression

proc reg data=mining.dailytemperature outest=dtempbeta;   /* save the parameter estimates in the data set dtempbeta */
  model daily_consumption = stockholm g_teborg malm_;
run;

Page 23: Outline


SAS code for ridge regression

proc reg data=mining.dailytemperature outest=dtempbeta ridge=0 to 10 by 1;   /* ridge trace for ridge constants 0, 1, ..., 10 */
  model daily_consumption = stockholm g_teborg malm_;
proc print data=dtempbeta;
run;

_TYPE_  _DEPVAR_            _RIDGE_   _RMSE_    Intercept   STOCKHOLM  G_TEBORG   MALM_
PARMS   Daily_Consumption             30845.8   480268.9    -5364.6     -548.3    -3598.2
RIDGE   Daily_Consumption    0        30845.8   480268.9    -5364.6     -548.3    -3598.2
RIDGE   Daily_Consumption    1        36314.6   462824.0    -2327.8    -2357.6    -2512.6
RIDGE   Daily_Consumption    2        43008.7   450349.7    -1830.1    -1899.4    -2011.6
RIDGE   Daily_Consumption    3        48325.9   442054.5    -1514.3    -1584.8    -1674.9
RIDGE   Daily_Consumption    4        52401.2   436146.6    -1292.7    -1358.6    -1434.4
RIDGE   Daily_Consumption    5        55571.5   431726.2    -1128.0    -1188.6    -1254.1
RIDGE   Daily_Consumption    6        58092.1   428294.6    -1000.8    -1056.3    -1114.1
RIDGE   Daily_Consumption    7        60138.0   425553.4     -899.4     -950.4    -1002.1
RIDGE   Daily_Consumption    8        61829.0   423313.5     -816.7     -863.8     -910.6
RIDGE   Daily_Consumption    9        63248.9   421448.8     -747.9     -791.7     -834.4
RIDGE   Daily_Consumption   10        64457.3   419872.4     -689.8     -730.6     -770.0

