Date post: | 14-Mar-2016 |
Upload: | wyatt-witt |
Data mining and statistical learning, lecture 3
Outline
Ordinary least squares regression
Ridge regression
Ordinary least squares regression (OLS)
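Inputs: $x_1, x_2, \dots, x_p$; response: $y$
Model:
\[ y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \text{error} \]
or, in vector notation,
\[ y = \beta_0 + X^T \beta + \text{error} \]
Terminology:
$\beta_0$: intercept (or bias)
$\beta_1, \dots, \beta_p$: regression coefficients (or weights)
The response variable responds directly and linearly to changes in the inputs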
Least squares regression
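Assume that we have observed a training set of data, where $x_{ij}$ denotes the value of input $j$ for case $i$

Case   X1    X2    ...   Xp    Y
1      x11   x12   ...   x1p   y1
2      x21   x22   ...   x2p   y2
3      x31   x32   ...   x3p   y3
...
N      xN1   xN2   ...   xNp   yN

Estimate the coefficients by minimizing the residual sum of squares
\[ RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \]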
Matrix formulation of OLS regression
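\[ RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \]
Differentiating the residual sum of squares and setting the first derivatives equal to zero, we obtain
\[ X^T (y - X\beta) = 0 \]
where
\[ X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix} \]
and
\[ y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} \]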
Parameter estimates and predictions
Least squares estimates of the parameters
\[ \hat\beta = (X^T X)^{-1} X^T y \]
Predicted values
\[ \hat y = X \hat\beta = X (X^T X)^{-1} X^T y = H y \]
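The least squares estimate and the predicted values can be computed directly from the normal equations. The sketch below does this in plain Python for a small made-up data set; the data and the helper names `solve`, `ols_fit` and `predict` are illustrative assumptions, not part of the lecture.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols_fit(rows, y):
    """Return beta_hat solving (X^T X) beta = X^T y, with a column of ones prepended to X."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(xi[a] * xi[b] for xi in X) for b in range(k)] for a in range(k)]
    xty = [sum(xi[a] * yi for xi, yi in zip(X, y)) for a in range(k)]
    return solve(xtx, xty)


def predict(rows, beta):
    """Predicted values y_hat = X beta_hat."""
    return [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in rows]


# Made-up data generated exactly from y = 1 + 2*x1 - 3*x2 (no error term).
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [1.0, 3.0, -2.0, 0.0, 2.0]

beta_hat = ols_fit(rows, y)
y_hat = predict(rows, beta_hat)
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
print(beta_hat, rss)
```

Because the toy response contains no error, the estimates should reproduce the generating coefficients up to rounding and the residual sum of squares should be essentially zero. Solving the normal equations is used here instead of forming the inverse explicitly, which is a common numerical choice rather than something the slides prescribe.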
Different sources of inputs
Quantitative inputs
Transformations of quantitative inputs
Numeric or dummy coding of the levels of qualitative inputs
Interactions between variables (e.g. X3 = X1 X2)
Example of dummy coding:
$X_1 = 1$ if Jan, $0$ otherwise
$X_2 = 1$ if Feb, $0$ otherwise
...
$X_{11} = 1$ if Nov, $0$ otherwise
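A minimal sketch of this dummy coding in Python follows. It assumes that the twelfth level, December, is the baseline with all indicators equal to zero, which is implied by having eleven indicators for twelve months; the month abbreviations between Feb and Nov and the function name `dummy_code` are assumptions made for the illustration.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def dummy_code(month, levels=MONTHS):
    """Return the indicators X1..X11 for a month; the last level is the baseline."""
    if month not in levels:
        raise ValueError(f"unknown level: {month!r}")
    # One indicator per level except the last one.
    return [1 if month == level else 0 for level in levels[:-1]]


jan = dummy_code("Jan")
nov = dummy_code("Nov")
dec = dummy_code("Dec")
print(jan)
print(nov)
print(dec)
```

Dropping one level avoids an exact linear dependence between the indicators and the intercept column.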
An example of multiple linear regression
Response variable: Requested price of used Porsche cars (1000 SEK)
Inputs:
X1 = Manufacturing year
X2 = Mileage (km)
X3 = Model (0 or 1)
X4 = Equipment (1, 2, 3)
X5 = Colour (Red, Black, Silver, Blue, Black, White, Green)
Price of used Porsche cars
Response variable: Requested price of used Porsche cars (1000 SEK)
Inputs:
X1 = Manufacturing year
X2 = Mileage (km)

Inputs          Estimated model                                  RSS
Year            Price = -76829 + 38.6 Year                       113030
Mileage         Price = 430.7 - 0.001862 Mileage                 230212
Year, Mileage   Price = -63809 + 32.1 Year - 0.000789 Mileage    92541
Interpretation of multiple regression coefficients
Assume that
\[ Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon \]
and that the regression coefficients are estimated by ordinary least squares regression
Then the multiple regression coefficient $\hat\beta_j$ represents the additional contribution of $x_j$ on $y$, after $x_j$ has been adjusted for $x_0, x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_p$
Confidence intervals for regression parameters
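Assume that
\[ Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon \]
where the X-variables are fixed and the error terms are i.i.d. and $N(0, \sigma^2)$
Then a 95% confidence interval for $\beta_j$ is
\[ \hat\beta_j \pm t_{0.05}(N - p - 1) \, \sqrt{v_j} \, \hat\sigma \]
where $v_j$ is the jth diagonal element of $(X^T X)^{-1}$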
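As a small worked sketch, the interval can be computed once the estimate, its standard error and the critical value are known. The coefficient and standard error below are the values for Year in the lecture's regression output for the Porsche data. The critical value `t_crit = 2.0` is an assumption for illustration only, since the number of cases is not stated; the function name `confidence_interval` is likewise hypothetical.

```python
def confidence_interval(beta_hat, std_error, t_crit):
    """Return (lower, upper) = beta_hat -/+ t_crit * std_error.

    std_error stands for sqrt(v_j) * sigma_hat, and t_crit is the critical
    value of the t distribution with N - p - 1 degrees of freedom.
    """
    half_width = t_crit * std_error
    return beta_hat - half_width, beta_hat + half_width


# Coef and SE Coef for Year from the two-predictor Porsche regression output.
# t_crit = 2.0 is an assumed value, not taken from the lecture.
lower, upper = confidence_interval(32.103, 3.486, t_crit=2.0)
print(f"Approximate 95% interval for the Year coefficient: ({lower:.3f}, {upper:.3f})")
```

In practice the critical value would be looked up for the actual degrees of freedom of the fitted model.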
Interpretation of software outputs
Adding new independent variables to a regression model alters at least one of the old regression coefficients unless the columns of the X-matrix are orthogonal, i.e.
\[ \sum_{i=1}^{N} x_{ij} x_{ik} = 0 \quad \text{for } j \neq k \]
Regression of the price of used Porsche cars vs mileage (km) and manufacturing year

Predictor      Coef         SE Coef     T       P
Constant       430.69       17.42       24.72   0.000
Milage (km)    -0.0018621   0.0002959   -6.29   0.000

Predictor      Coef         SE Coef     T       P
Constant       -63809       6976        -9.15   0.000
Milage (km)    -0.0007894   0.0002222   -3.55   0.001
Year           32.103       3.486       9.21    0.000
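The statement about orthogonal columns can be checked numerically. The sketch below fits a made-up response first on one input, then with a second input that is either orthogonal to or correlated with the first. All data and the helper names `solve` and `fit` are assumptions for the illustration.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit(columns, y):
    """Least squares coefficients for the given columns of X (intercept column included)."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[a], columns[b])) for b in range(k)]
           for a in range(k)]
    xty = [sum(u * v for u, v in zip(columns[a], y)) for a in range(k)]
    return solve(xtx, xty)


ones = [1.0, 1.0, 1.0, 1.0]
x1 = [-1.0, -1.0, 1.0, 1.0]
x2_orth = [-1.0, 1.0, -1.0, 1.0]  # orthogonal to both ones and x1
x2_corr = [-1.0, 0.0, 0.0, 1.0]   # orthogonal to ones, but not to x1
y = [1.0, 2.0, 4.0, 7.0]

b_single = fit([ones, x1], y)
b_orth = fit([ones, x1, x2_orth], y)
b_corr = fit([ones, x1, x2_corr], y)
print(b_single, b_orth, b_corr)
```

With the orthogonal second input the coefficient of `x1` is unchanged, whereas the correlated second input alters it, which mirrors the change in the mileage coefficient when manufacturing year is added.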
Stepwise Regression: Price (1000SEK) versus Year, Milage (km), ...
Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15
Step                 1          2          3          4
Constant        -76829     -63809     -53285     -52099

Year              38.6       32.1       26.8       26.2
T-Value          11.87       9.21       7.00       6.88
P-Value          0.000      0.000      0.000      0.000

Milage (km)               -0.00079   -0.00066   -0.00062
T-Value                      -3.55      -3.08      -2.88
P-Value                      0.001      0.003      0.006

Model                                      37         27
T-Value                                  2.72       1.83
P-Value                                 0.009      0.073

Equipment                                           11.0
T-Value                                             1.52
P-Value                                            0.135

S                 44.1       40.3       38.2       37.8
R-Sq             70.82      76.11      78.89      79.74
R-Sq(adj)        70.32      75.27      77.76      78.27
Mallows Cp        23.8       11.3        5.7        5.4
The p-value refers to a t-test of the hypothesis that the regression coefficient of the last entered x-variable is zero
Classical statistical model selection techniques are model-based.
In data-mining the model selection is data-driven.
Stepwise Regression: Price (1000SEK) versus Year, Milage (km), ... (model validation by visual inspection of residuals)
[Figure: Residual versus Fitted Value (response is Price (1000SEK))]
[Figure: Residuals Versus Milage (km) (response is Price (1000SEK))]
Residual = Observed - Predicted
The Gram-Schmidt procedure for regression by successive orthogonalization and simple linear regression
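1. Initialize $z_0 = x_0 = 1$
2. For $j = 1, \dots, p$, compute
\[ z_j = x_j - \sum_{k=0}^{j-1} \hat\gamma_{kj} z_k, \qquad \hat\gamma_{kj} = \frac{\langle z_k, x_j \rangle}{\langle z_k, z_k \rangle} \]
where $\langle \cdot, \cdot \rangle$ denotes the inner product (the sum of coordinate-wise products)
3. Regress $y$ on $z_p$ to obtain the multiple regression coefficient $\hat\beta_p$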
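The three steps can be sketched in a few lines of Python. The data are made up, and the function names `inner` and `successive_orthogonalization` are assumptions for the illustration. The last step uses the simple regression formula without intercept, which is appropriate because $z_p$ is orthogonal to the constant vector.

```python
def inner(u, v):
    """Inner product: the sum of coordinate-wise products."""
    return sum(a * b for a, b in zip(u, v))


def successive_orthogonalization(columns, y):
    """Return (z, beta_p) for inputs x_1..x_p given as a list of columns.

    z holds the orthogonalized vectors z_0..z_p; beta_p is the multiple
    regression coefficient of the last input x_p.
    """
    n = len(y)
    z = [[1.0] * n]                                  # step 1: z_0 = x_0 = 1
    for x in columns:                                # step 2: j = 1, ..., p
        residual = list(x)
        for zk in z:
            gamma = inner(zk, x) / inner(zk, zk)     # gamma_kj = <z_k, x_j> / <z_k, z_k>
            residual = [r - gamma * c for r, c in zip(residual, zk)]
        z.append(residual)
    beta_p = inner(z[-1], y) / inner(z[-1], z[-1])   # step 3: regress y on z_p
    return z, beta_p


# Made-up data with two correlated inputs.
x1 = [-1.0, -1.0, 1.0, 1.0]
x2 = [-1.0, 0.0, 0.0, 1.0]
y = [1.0, 2.0, 4.0, 7.0]

z, beta_2 = successive_orthogonalization([x1, x2], y)
print(z[2], beta_2)
```

To obtain the coefficient of another input with this sketch, that input would have to be placed last in `columns`.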
Prediction of a response variable using correlated explanatory variables: daily temperatures in Stockholm, Göteborg, and Malmö
[Figure: three scatter plots of daily temperatures, each axis running from -20 to 30: Göteborg temperature versus Stockholm temperature, Malmö temperature versus Stockholm temperature, and Göteborg temperature versus Malmö temperature]
Absorbance records for ten samples of chopped meat
[Figure: Absorbance (0.0 to 5.0) versus Channel (1 to 100) for Sample_1 to Sample_10]
1 response variable (protein)
100 predictors (absorbance at 100 wavelengths or channels)
The predictors are strongly correlated with each other
Absorbance records for 240 samples of chopped meat
The target is poorly correlated to each predictor
[Figure: scatter plot of Protein (%) versus Absorbance in channel 50]
Ridge regression
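The ridge regression coefficients minimize a penalized residual sum of squares:
\[ \hat\beta^{ridge} = \operatorname*{argmin}_{\beta} \Big\{ \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\} \]
or
\[ \hat\beta^{ridge} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip})^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s \]
Normally, inputs are centred prior to the estimation of regression coefficients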
Matrix formulation of ridge regression for centred inputs
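\[ RSS(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta \]
\[ \hat\beta^{ridge} = (X^T X + \lambda I)^{-1} X^T y \]
If the inputs are orthogonal, the ridge estimates are just a scaled version of the least squares estimates:
\[ \hat\beta^{ridge} = \gamma \hat\beta, \quad \text{where } 0 \le \gamma \le 1 \]
Shrinking enables estimation of regression coefficients even if the number of parameters exceeds the number of cases
Figure 3.7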
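Both properties of the ridge solution for centred inputs can be illustrated with a short sketch. The data are made up and the helper names `inner`, `solve` and `ridge_fit` are assumptions. The first example uses orthonormal columns, for which the scaling factor is $1/(1+\lambda)$; the second has three inputs but only two cases.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ridge_fit(columns, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for centred inputs (no intercept column)."""
    k = len(columns)
    a = [[inner(columns[r], columns[c]) + (lam if r == c else 0.0) for c in range(k)]
         for r in range(k)]
    b = [inner(col, y) for col in columns]
    return solve(a, b)


# Orthonormal, centred inputs and a centred response (made-up values).
x1 = [0.5, 0.5, -0.5, -0.5]
x2 = [0.5, -0.5, 0.5, -0.5]
y = [3.0, 1.0, -1.0, -3.0]
ols = ridge_fit([x1, x2], y, 0.0)     # lam = 0 gives the least squares estimates
ridge = ridge_fit([x1, x2], y, 1.0)   # each coefficient is halved for lam = 1

# Three inputs but only two cases: X^T X is singular, X^T X + lam * I is not.
wide = [[1.0, -1.0], [2.0, -2.0], [-1.0, 1.0]]
beta_wide = ridge_fit(wide, [1.0, -1.0], 1.0)
print(ols, ridge, beta_wide)
```

For inputs that are orthogonal but not of unit length, the scaling factor would differ between coefficients, so the orthonormal case is the simplest one to verify by hand.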
Ridge regression – pros and cons
Ridge regression is particularly useful if the explanatory variables are strongly correlated with each other.
The variance of the estimated regression coefficients is reduced at the expense of (slightly) biased estimates
The Gauss-Markov theorem
Consider a linear regression model in which:
– the inputs are regarded as fixed
– the error terms are i.i.d. with mean 0 and variance $\sigma^2$.
Then, the least squares estimator of a parameter $a^T \beta$ has variance no bigger than that of any other linear unbiased estimator of $a^T \beta$
Biased estimators may have smaller variance and mean squared error!
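The last remark can be illustrated with the simplest possible case, estimating a mean rather than a full regression. For the estimator $c \bar y$ of $\mu$, the variance is $c^2 \sigma^2 / n$ and the bias is $(c - 1)\mu$. The sketch below evaluates the resulting mean squared error for made-up values of $\mu$, $\sigma^2$ and $n$; the function name `mse_of_shrunken_mean` is an assumption.

```python
def mse_of_shrunken_mean(c, mu, sigma2, n):
    """MSE of c * ybar as an estimator of mu, where ybar averages n i.i.d.
    observations with mean mu and variance sigma2."""
    variance = c ** 2 * sigma2 / n
    bias = (c - 1.0) * mu
    return variance + bias ** 2


# Made-up setting: mu = 1, sigma^2 = 4, n = 4.
unbiased = mse_of_shrunken_mean(1.0, mu=1.0, sigma2=4.0, n=4)
shrunken = mse_of_shrunken_mean(0.5, mu=1.0, sigma2=4.0, n=4)
print(unbiased, shrunken)
```

In this setting the shrunken estimator has the smaller mean squared error; for a large $\mu$ relative to $\sigma^2 / n$ the comparison would go the other way, so the gain from shrinkage depends on the situation.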
SAS code for an ordinary least squares regression
/* Fit the OLS model and store the estimates in dtempbeta */
proc reg data=mining.dailytemperature outest=dtempbeta;
  model daily_consumption = stockholm g_teborg malm_;
run;
SAS code for ridge regression
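/* Fit ridge regressions for ridge parameters 0, 1, ..., 10 */
proc reg data=mining.dailytemperature outest=dtempbeta ridge=0 to 10 by 1;
  model daily_consumption = stockholm g_teborg malm_;
run;

/* List the stored estimates */
proc print data=dtempbeta;
run;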
_TYPE_   _DEPVAR_            _RIDGE_   _RMSE_    Intercept   STOCKHOLM   G_TEBORG   MALM_
PARMS    Daily_Consumption             30845.8   480268.9    -5364.6     -548.3     -3598.2
RIDGE    Daily_Consumption    0        30845.8   480268.9    -5364.6     -548.3     -3598.2
RIDGE    Daily_Consumption    1        36314.6   462824.0    -2327.8     -2357.6    -2512.6
RIDGE    Daily_Consumption    2        43008.7   450349.7    -1830.1     -1899.4    -2011.6
RIDGE    Daily_Consumption    3        48325.9   442054.5    -1514.3     -1584.8    -1674.9
RIDGE    Daily_Consumption    4        52401.2   436146.6    -1292.7     -1358.6    -1434.4
RIDGE    Daily_Consumption    5        55571.5   431726.2    -1128.0     -1188.6    -1254.1
RIDGE    Daily_Consumption    6        58092.1   428294.6    -1000.8     -1056.3    -1114.1
RIDGE    Daily_Consumption    7        60138.0   425553.4    -899.4      -950.4     -1002.1
RIDGE    Daily_Consumption    8        61829.0   423313.5    -816.7      -863.8     -910.6
RIDGE    Daily_Consumption    9        63248.9   421448.8    -747.9      -791.7     -834.4
RIDGE    Daily_Consumption    10       64457.3   419872.4    -689.8      -730.6     -770.0