Chapter 10 Simple Linear Regression and Correlation
Linear Regression
Methods for studying the relationship of two or more quantitative variables
Examples:
• Predict salary from education and years of experience
• Predict sales from the amount of advertising expenditures
• Predict vocabulary size from the age and amount of education of parents
Variables:
• Response/outcome/dependent variable
• Predictor/explanatory/independent variable
Relationships between the response and predictor variables
• Functional or mathematical relation: $Y = f(x)$ (deterministic)
• Structural or statistical relation: $Y = f(x) + \text{error}$ (stochastic/probabilistic)
Goals:
1) What is a reasonable model? (a) for $f$; (b) for the errors
2) When $f$ has unknown parameters, estimate the parameters
3) Predict $Y$ at a new $x$
Simple Linear Regression (SLR)
Basic model: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, \ldots, n$
• $Y_i$: the response/dependent variable
• $x_i$: the predictor/explanatory/independent variable
• $y_i$: the observed value of $Y_i$
• $x_i$: treated as a fixed quantity (or conditioned upon)
• $\epsilon_i$: the random error, typically assumed to have $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$, and usually assumed normally distributed
Key assumptions (to be checked later):
• Linear relationship
• Independent (uncorrelated) errors
• Constant variance errors
• Normally distributed errors
The SLR model can also be written as
$Y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$
• The mean of $Y$ given $x$ (known as the conditional mean) is a linear function of $x$ given by $E(Y \mid x) = \beta_0 + \beta_1 x$
• $\beta_0$ is the conditional mean when $x = 0$
• If we replace $x$ by $x - \bar{x}$ then $\beta_0$ is interpreted as the conditional mean when $x = \bar{x}$
• $\beta_1$ is the slope, i.e. the change in the mean of $Y$ per unit change in $x$
• $\sigma^2$ is the variation of responses about the mean
• The relationship is described by the true regression line $E(Y \mid x) = \beta_0 + \beta_1 x$
• The model is called “linear” not because it is linear in $x$, but rather because it is linear in the parameters $\beta_0$ and $\beta_1$
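To make the roles of the intercept, slope, and error variance concrete, the following sketch simulates data from an SLR model. The parameter values (`beta0`, `beta1`, `sigma`), the sample size, and the x values are hypothetical choices made only for illustration; they are not taken from any data set in these notes.

```r
# Simulate from Y_i = beta0 + beta1 * x_i + eps_i with N(0, sigma^2) errors.
# All numeric values below are hypothetical, chosen only for illustration.
set.seed(1)
n     <- 200
beta0 <- 2      # conditional mean of Y when x = 0
beta1 <- 0.5    # change in the mean of Y per unit change in x
sigma <- 1      # standard deviation of the random error

x   <- seq(0, 10, length.out = n)      # predictor values, treated as fixed
eps <- rnorm(n, mean = 0, sd = sigma)  # independent normal errors
y   <- beta0 + beta1 * x + eps         # observed responses

# True regression line E(Y | x) evaluated at the observed x values
mean_line <- beta0 + beta1 * x

# Least squares fit; the estimates should land near beta0 and beta1
fit <- lm(y ~ x)
coef(fit)
```

Rerunning with a larger `sigma` scatters the points more widely about the same true line, which is what the constant-variance term in the model describes.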
Example: Crime Rate
A criminologist studying the relationship between level of education and crime rate in medium-sized U.S. counties collected the following data for a random sample of 84 counties; $x$ is the percentage of individuals in the county having at least a high-school diploma and $y$ is the crime rate (crimes reported per 100,000 residents) last year.
[Figure: Scatter Plot for the Crime Rate Data. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents).]
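Fitting the SLR model - least squares (LS) estimation
Choose $\hat\beta_0$, $\hat\beta_1$ to minimize the sum of squared deviations (vertical distances) of all data points to the fitted line:
$Q(\beta_0, \beta_1) = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$
$(\hat\beta_0, \hat\beta_1) \equiv \arg\min_{\beta_0, \beta_1} Q(\beta_0, \beta_1)$
Taking first partial derivatives and setting them equal to zero yields the normal equations:
$n\hat\beta_0 + \hat\beta_1 \sum x_i = \sum y_i$
$\hat\beta_0 \sum x_i + \hat\beta_1 \sum x_i^2 = \sum x_i y_i$
which are equivalent to
$\sum (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$
$\sum x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$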
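The differentiation step can be written out as follows. This is a sketch of the standard derivation, writing $Q(\beta_0, \beta_1)$ for the sum of squared vertical deviations $\sum [y_i - (\beta_0 + \beta_1 x_i)]^2$.

```latex
\begin{aligned}
\frac{\partial Q}{\partial \beta_0} &= -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) = 0
  &&\Longrightarrow\quad n\beta_0 + \beta_1 \sum x_i = \sum y_i, \\
\frac{\partial Q}{\partial \beta_1} &= -2 \sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = 0
  &&\Longrightarrow\quad \beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i .
\end{aligned}
```

The values of $\beta_0$ and $\beta_1$ that solve this pair of linear equations are the least squares estimates.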
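• Least squares estimators:
$\hat\beta_0 = \frac{(\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_i y_i)}{n \sum x_i^2 - (\sum x_i)^2}$
$\hat\beta_1 = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2}$
With the notation
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{1}{n} (\sum x_i)(\sum y_i)$
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{1}{n} (\sum x_i)^2$
$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{1}{n} (\sum y_i)^2$
the estimators simplify to
$\hat\beta_1 = S_{xy} / S_{xx}$, $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$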
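As a check on the closed-form estimators, the sketch below computes the slope as $S_{xy}/S_{xx}$ and the intercept as $\bar{y} - \hat\beta_1 \bar{x}$, then compares them with `lm()`. The data are simulated with hypothetical parameter values, because the crime rate file itself is not reproduced in these notes.

```r
# Hypothetical data, simulated only to illustrate the formulas
set.seed(42)
x <- runif(50, min = 60, max = 90)
y <- 20000 - 170 * x + rnorm(50, sd = 2000)
n <- length(x)

# Centered sums of cross-products and squares
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)

# Equivalent shortcut forms based on raw sums
Sxy_alt <- sum(x * y) - sum(x) * sum(y) / n
Sxx_alt <- sum(x^2) - sum(x)^2 / n

# Least squares estimates from the closed-form expressions
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)

# Residuals from the hand-computed line
e <- y - (b0 + b1 * x)

# Reference fit for comparison
fit <- lm(y ~ x)
rbind(by_hand = c(b0, b1), lm = unname(coef(fit)))
```

The residuals `e` sum to zero and are orthogonal to `x` up to rounding error, which is the content of the two normal equations.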
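• $\hat\beta_0$ and $\hat\beta_1$ are the best linear unbiased estimators of $\beta_0$ and $\beta_1$
• The fitted values: $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$
• Residuals: $e_i = y_i - \hat{y}_i$
• Least squares (LS) line: $\hat{y} = \hat\beta_0 + \hat\beta_1 x$
$(\bar{x}, \bar{y})$ is the “centroid” of the scatter plot, and the LS line passes through it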
[Figure: Scatter Plot for the Crime Rate Data with the fitted LS line $\hat{y} = 20517.6 - 170.58x$ and the centroid $(\bar{x}, \bar{y})$ marked. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents).]
[Figure: Residuals plotted against observation index. y-axis: Residuals.]
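Goodness of fit of the LS line
Residuals: $e_i = y_i - \hat{y}_i$
Error sum of squares (SSE): $SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$
Compare with the SSE for the simplest model: $Y_i = \beta_0 + \epsilon_i$, for which $\hat\beta_0 = \bar{y}$ and $SSE = \sum (y_i - \bar{y})^2$, referred to as the (corrected) total sum of squares (SST), which measures the variability of the $y_i$ around their mean
Then SST can be decomposed as
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
SST = SSR + SSE
SSR: the regression sum of squares, which measures the variation in $y$ that is accounted for by regression on $x$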
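The decomposition holds because the cross-product term vanishes. A sketch of the argument, using the fitted values $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$, the residuals $e_i = y_i - \hat{y}_i$, and the two normal equations $\sum e_i = 0$ and $\sum x_i e_i = 0$:

```latex
\begin{aligned}
\sum (y_i - \bar{y})^2
  &= \sum \left[ (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i) \right]^2 \\
  &= \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2
     + 2 \sum (\hat{y}_i - \bar{y})\, e_i , \\
\sum (\hat{y}_i - \bar{y})\, e_i
  &= (\hat\beta_0 - \bar{y}) \sum e_i + \hat\beta_1 \sum x_i e_i = 0 .
\end{aligned}
```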
The coefficient of determination:
$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$, $0 \le r^2 \le 1$
which represents the proportion of variation in $y$ that is accounted for by regression on $x$.
Relationship to the sample correlation coefficient $r$: $r = \pm\sqrt{r^2}$
The sign of $r$ is the same as the sign of $\hat\beta_1$.
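As a worked illustration, the coefficient of determination for the crime rate example can be recomputed from the sums of squares that `anova(g1)` reports later in these notes (SSR = 93462942, SSE = 455273165) together with the fitted slope from `summary(g1)`. The sketch below uses only those reported numbers, so it does not need the data file.

```r
# Sums of squares as reported by anova(g1) for the crime rate data
SSR <- 93462942
SSE <- 455273165
SST <- SSR + SSE

# Two equivalent expressions for the coefficient of determination
r2     <- SSR / SST
r2_alt <- 1 - SSE / SST

# The sample correlation takes the sign of the fitted slope
b1 <- -170.5752            # slope estimate reported by summary(g1)
r  <- sign(b1) * sqrt(r2)

round(c(r2 = r2, r = r), 4)
```

The value of `r2` agrees with the "Multiple R-squared: 0.1703" line in the `summary(g1)` output, so about 17% of the variation in crime rate is accounted for by the graduation percentage.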
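Estimation of $\sigma^2$
A common unbiased estimator of $\sigma^2$ is given by
$s^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2} = \frac{SSE}{n - 2} = MSE$
MSE: mean square error
• The d.f. for $s^2$ is $n - 2$ since 2 unknown parameters, $\beta_0$ and $\beta_1$, are estimated from the data of size $n$.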
Crime rate example continued:
Obtain the point estimates of the following:
(1) The difference in the mean crime rate for two counties whose high-school graduation rates differ by one percentage point;
(2) The mean crime rate last year in counties with high-school graduation percentage $x = 80$;
(3) The random error.
# read in the data set
> crime=read.table("crimerate.txt",header=FALSE)
> names(crime)=c("rate","percentage")
# scatter plot
> plot(crime$percentage,crime$rate,main="Scatter Plot for the Crime Rate Data",
+   xlab="Percentage of having at least high school diplomas",
+   ylab="Crime Rate (per 100K residents)",type="p",pch=16)
# fitting a SLR model using least squares
> g1=lm(rate~percentage,data=crime)
# adding the fitted regression line to the scatter plot
> abline(g1,col="red",lwd=2)
[Figure: Scatter Plot for the Crime Rate Data with the fitted regression line added in red. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents).]
# LS estimation results
> summary(g1)

Call:
lm(formula = rate ~ percentage, data = crime)

Residuals:
    Min      1Q  Median      3Q     Max
-5278.3 -1757.5  -210.5  1575.3  6803.3

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
percentage   -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared: 0.1703, Adjusted R-squared: 0.1602
F-statistic: 16.83 on 1 and 82 DF, p-value: 9.571e-05
> summary(g1)$coeff
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 20517.5999 3277.64269  6.259865 1.672906e-08
percentage   -170.5752   41.57433 -4.102897 9.571396e-05

> predict(g1,data.frame(percentage=80),se=TRUE)
$fit
       1
6871.585
$se.fit
[1] 263.6425
$df
[1] 82
$residual.scale
[1] 2356.292

> deviance(g1) # SSE
[1] 455273165
> df.residual(g1) # df for SSE
[1] 82
> sqrt(deviance(g1)/df.residual(g1)) # estimate for sigma
[1] 2356.292
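The point estimates asked for in the example can also be reproduced by hand from the numbers printed above. The sketch below is an arithmetic check only; it hard-codes the reported coefficients, SSE, and sample size rather than reading the data file.

```r
# Values copied from the output above
b0  <- 20517.5999   # intercept estimate
b1  <- -170.5752    # slope estimate
SSE <- 455273165    # error sum of squares, deviance(g1)
n   <- 84           # number of counties

# (1) Estimated change in mean crime rate per one-point increase in percentage
slope_estimate <- b1

# (2) Estimated mean crime rate at percentage = 80
fit80 <- b0 + b1 * 80

# Estimate of sigma^2 (the MSE) and of sigma
s2 <- SSE / (n - 2)
s  <- sqrt(s2)

c(slope = slope_estimate, fit80 = fit80, s2 = s2, s = s)
```

Because the coefficients are entered to four decimal places, `fit80` matches the `predict()` value of 6871.585 only up to rounding.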
> residuals(g1)
          1           2           3           4           5           6           7           8
  591.96401  1648.56552  1660.99033  1518.99033   568.44147  -159.63749 -2357.48712  -828.00967
          9          10          11          12          13          14          15          16
   97.96401  1401.56552 -1233.46080   285.56552  2426.26477 -1594.28410 -1493.43448 -2615.16004
…
         81          82          83          84
-1363.25778  2533.01666   621.14071    28.11439

> summary(g1)$residuals # same as residuals(g1)
> sum(residuals(g1)^2) # SSE
[1] 455273165

> plot(residuals(g1),pch=16,main="Scatter Plot of Residuals",ylab="Residuals",xlab="")
> abline(h=0,lty=2)
[Figure: Scatter Plot of Residuals. x-axis: observation index (unlabeled); y-axis: Residuals.]
> fitted.values(g1)
       1        2        3        4        5        6        7        8        9
7895.036 6530.434 6701.010 6701.010 5677.559 9259.637 8918.487 6701.010 7895.036
      10       11       12       13       14       15       16       17       18
6530.434 7724.461 6530.434 7212.735 6189.284 6530.434 7042.160 7212.735 8065.611
…
      82       83       84
5506.983 6359.859 7553.886
> plot(crime$percentage,fitted.values(g1),pch=16,xlab="Percentage",ylab="Fitted Values")
> abline(g1,lty=2)
> plot(fitted.values(g1),residuals(g1),main="",ylab="Residuals",xlab=expression(hat(y)),pch=16)
> plot(crime$percentage,residuals(g1),main="",ylab="Residuals",xlab="Percentage",pch=16)
[Figure: Fitted Values plotted against Percentage.]
[Figure: Residuals plotted against the fitted values $\hat{y}$.]
[Figure: Residuals plotted against Percentage.]
Statistical Inference for Simple Linear Regression
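Inference on $\beta_0$ and $\beta_1$
With $S_{xx} = \sum (x_i - \bar{x})^2$, the slope estimator is a linear combination of the $Y_i$:
$\hat\beta_1 = \frac{\sum (x_i - \bar{x})(Y_i - \bar{Y})}{\sum (x_i - \bar{x})^2} = \frac{\sum (x_i - \bar{x}) Y_i}{\sum (x_i - \bar{x})^2}$
$E(\hat\beta_1) = \beta_1$, $\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{S_{xx}}$
$E(\hat\beta_0) = \beta_0$, $\mathrm{Var}(\hat\beta_0) = \frac{\sigma^2 \sum x_i^2}{n S_{xx}}$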
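Under normally distributed errors, with $SD(\cdot)$ denoting the true standard deviation of an estimator:
$\frac{\hat\beta_0 - \beta_0}{SD(\hat\beta_0)} \sim N(0, 1)$ and $\frac{\hat\beta_1 - \beta_1}{SD(\hat\beta_1)} \sim N(0, 1)$
$\frac{(n - 2) s^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}$
$s^2$ is distributed independently of $\hat\beta_0$ and $\hat\beta_1$
Estimated standard errors, with $S_{xx} = \sum (x_i - \bar{x})^2$:
$SE(\hat\beta_0) = s \sqrt{\frac{\sum x_i^2}{n S_{xx}}}$ and $SE(\hat\beta_1) = \frac{s}{\sqrt{S_{xx}}}$
$\frac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} \sim t_{n-2}$ and $\frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2}$
$100(1 - \alpha)\%$ CI’s on $\beta_0$ and $\beta_1$ are given by
$\hat\beta_0 \pm t_{n-2, \alpha/2} \, SE(\hat\beta_0)$
$\hat\beta_1 \pm t_{n-2, \alpha/2} \, SE(\hat\beta_1)$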
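The interval formula can be checked by hand against the crime rate fit. The sketch below plugs in the estimates and standard errors that `summary(g1)$coeff` reports for that example, with 82 residual degrees of freedom; only base R's `qt()` is needed, not the data file.

```r
# Estimates and standard errors copied from summary(g1)$coeff
b0 <- 20517.5999; se_b0 <- 3277.64269
b1 <- -170.5752;  se_b1 <- 41.57433
df <- 82          # n - 2 with n = 84

# Upper alpha/2 critical point of the t distribution for a 95% interval
alpha <- 0.05
tcrit <- qt(1 - alpha / 2, df)

# estimate +/- critical value * standard error
ci_b0_95 <- b0 + c(-1, 1) * tcrit * se_b0
ci_b1_95 <- b1 + c(-1, 1) * tcrit * se_b1

# A 90% interval for the slope only changes the critical value
ci_b1_90 <- b1 + c(-1, 1) * qt(1 - 0.10 / 2, df) * se_b1

rbind(ci_b0_95, ci_b1_95, ci_b1_90)
```

These hand-computed limits agree, up to rounding of the inputs, with what `confint(g1)` and `confint(g1,"percentage",level=0.9)` print for the same model.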
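Hypothesis tests: $H_0: \beta_1 = \beta_1^0$ vs. $H_1: \beta_1 \ne \beta_1^0$
Use the t-test:
$t = \frac{\hat\beta_1 - \beta_1^0}{SE(\hat\beta_1)} \sim t_{n-2}$ when $H_0$ is true
Reject $H_0$ at level $\alpha$ if $|t| > t_{n-2, \alpha/2}$, or equivalently if the p-value $= 2 P(t_{n-2} > |t|)$ is less than $\alpha$
Particularly, for testing if there is a linear relationship,
$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$
Reject $H_0$ at level $\alpha$ if $|t| = \frac{|\hat\beta_1|}{SE(\hat\beta_1)} > t_{n-2, \alpha/2}$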
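For the crime rate data, the test of no linear relationship can be carried out by hand from the slope estimate and its standard error. The sketch below hard-codes the values reported by `summary(g1)$coeff` and uses base R's `pt()` and `qt()`; the level 0.05 is the one used in the example that follows.

```r
# Slope estimate and standard error copied from summary(g1)$coeff
b1    <- -170.5752
se_b1 <- 41.57433
df    <- 82            # n - 2

# t statistic for H0: beta1 = 0
beta1_null <- 0
tstat <- (b1 - beta1_null) / se_b1

# Two-sided p-value: 2 * P(t_df > |t|)
pval <- 2 * pt(abs(tstat), df, lower.tail = FALSE)

# Rejection rule at level alpha
alpha  <- 0.05
reject <- abs(tstat) > qt(1 - alpha / 2, df)

c(t = tstat, p.value = pval, reject = reject)
```

The statistic and p-value reproduce the `percentage` row of the coefficient table, and `reject` is `TRUE` at the 0.05 level.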
Crime Rate Example continued:
(1) Test linear relationship at $\alpha = 0.05$
> summary(g1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
percentage   -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared: 0.1703, Adjusted R-squared: 0.1602
F-statistic: 16.83 on 1 and 82 DF, p-value: 9.571e-05

Since the p-value for percentage (9.57e-05) is less than 0.05, $H_0: \beta_1 = 0$ is rejected.
(2) Calculate a 95% CI on the change in the mean crime rate for every one percentage point increase in high-school graduation rate.
> confint(g1)
                 2.5 %      97.5 %
(Intercept) 13997.3245 27037.87538
percentage   -253.2798   -87.87061

> # we can specify a particular parameter
> # as well as change confidence level
> confint(g1,"percentage",level=0.9)
                 5 %      95 %
percentage -239.7403 -101.4101
Analysis of Variance (ANOVA) for SLR
ANOVA is a statistical technique to decompose the total variability in the $y_i$’s into separate variance components associated with specific sources
Decomposition of the variability and degrees of freedom (d.f.):
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
SST = SSR + SSE
d.f.: $n - 1 = 1 + (n - 2)$
A mean square is defined as a sum of squares divided by its d.f.
Mean square regression: $MSR = SSR / 1$
Mean square error: $MSE = SSE / (n - 2)$
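Since $F = \frac{MSR}{MSE} \sim F_{1, n-2}$ when $H_0: \beta_1 = 0$ is true,
we can test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$ at level $\alpha$ by rejecting $H_0$ if $F > f_{1, n-2, \alpha}$ (equivalent to $|t| > t_{n-2, \alpha/2}$, because $F = t^2$)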
Analysis of variance (ANOVA) table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square          F statistic
(Source)              (SS)             (d.f.)               (MS)
Regression            SSR              1                    MSR = SSR/1          F = MSR/MSE
Error                 SSE              n-2                  MSE = SSE/(n-2)
Total                 SST              n-1
Crime Rate Example continued:
- Test the significance of the linear relationship between the crime rate and the high-school graduation rate at $\alpha = 0.05$
> anova(g1)
Analysis of Variance Table

Response: rate
           Df    Sum Sq  Mean Sq F value    Pr(>F)
percentage  1  93462942 93462942  16.834 9.571e-05 ***
Residuals  82 455273165  5552112
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
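The entries of this ANOVA table can be rebuilt from the two sums of squares alone, which also shows numerically that the F statistic equals the square of the slope's t statistic. The sketch below copies SSR and SSE from the table and the t value from the coefficient table; nothing is read from the data file.

```r
# Sums of squares copied from the anova(g1) table
SSR <- 93462942
SSE <- 455273165
n   <- 84

# Mean squares: each sum of squares divided by its degrees of freedom
MSR <- SSR / 1
MSE <- SSE / (n - 2)

# F statistic, its p-value, and the level 0.05 critical value
Fstat <- MSR / MSE
pval  <- pf(Fstat, df1 = 1, df2 = n - 2, lower.tail = FALSE)
Fcrit <- qf(1 - 0.05, df1 = 1, df2 = n - 2)

# t statistic for the slope, copied from summary(g1)$coeff
tstat <- -4.102897

c(F = Fstat, t.squared = tstat^2, p.value = pval, F.crit = Fcrit)
```

`Fstat` exceeds `Fcrit`, so the F test rejects $H_0: \beta_1 = 0$ at the 0.05 level, the same decision as the two-sided t-test.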
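Prediction of Future Observations
To predict the value of a future response $Y^*$ at a specified value $x^*$
Use a confidence interval to estimate the fixed unknown mean of $Y^*$, denoted by $\mu^* = E(Y^*) = \beta_0 + \beta_1 x^*$
$\hat\mu^* = \hat\beta_0 + \hat\beta_1 x^*$, with $\hat\mu^* \sim N\left(\mu^*, \sigma^2 \left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]\right)$
$\hat\mu^* \pm t_{n-2, \alpha/2} \, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$
Use a prediction interval to predict the value of the r.v. $Y^*$, with $\hat{Y}^* = \hat\beta_0 + \hat\beta_1 x^*$
$\hat{Y}^* - Y^* \sim N\left(0, \sigma^2 \left[1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]\right)$
$\hat{Y}^* \pm t_{n-2, \alpha/2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$
where $S_{xx} = \sum (x_i - \bar{x})^2$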
Crime rate example continued:
(a) Calculate 95% CI for the average crime rate in counties with 80% high-school graduation rate;
(b) Calculate 95% PI for the crime rate of a future selected county with 80% high-school graduation rate.

> predict(g1,data.frame(percentage=80), interval="confidence")
       fit      lwr      upr
1 6871.585 6347.116 7396.054

> predict(g1,data.frame(percentage=80), interval="prediction")
       fit     lwr      upr
1 6871.585 2154.92 11588.25
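Both intervals can be reproduced by hand from quantities that `predict(g1, ..., se=TRUE)` printed earlier for percentage = 80: the fitted value, its standard error `se.fit`, and the residual scale $s$. The sketch below relies on the identity that the standard error for predicting a new observation is $\sqrt{s^2 + \text{se.fit}^2}$, which is the prediction-interval formula with $s\sqrt{1/n + (x^* - \bar{x})^2 / S_{xx}}$ written as `se.fit`.

```r
# Quantities copied from predict(g1, data.frame(percentage=80), se=TRUE)
fit    <- 6871.585   # estimated mean crime rate at percentage = 80
se_fit <- 263.6425   # standard error of the estimated mean
s      <- 2356.292   # residual standard error (estimate of sigma)
df     <- 82

tcrit <- qt(0.975, df)

# 95% confidence interval for the mean response
conf_int <- fit + c(-1, 1) * tcrit * se_fit

# 95% prediction interval for a single future response
se_pred  <- sqrt(s^2 + se_fit^2)
pred_int <- fit + c(-1, 1) * tcrit * se_pred

rbind(confidence = conf_int, prediction = pred_int)
```

The prediction interval is much wider because it includes the variability $\sigma^2$ of an individual county around the mean, not just the uncertainty in the estimated mean.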
> grid=seq(60,90,1)
> conf=predict(g1,data.frame(percentage=grid),interval="confidence")
> pred=predict(g1,data.frame(percentage=grid),interval="prediction")
> matplot(grid,pred,lty=c(1,2,2),col=c("red","green","green"),type="l",lwd=2,main="CI vs PI",
+   xlab="Percentage of having at least high school diplomas",
+   ylab="Crime Rate (per 100K residents)")
> matplot(grid,conf[,2:3],lty=c(2,2),col=c("blue","blue"),type="l",add=TRUE,lwd=2)
[Figure: CI vs PI. Fitted line (red, solid), confidence band (blue, dashed), and prediction band (green, dashed). x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents).]
Both CI and PI have shortest widths when $x^* = \bar{x}$;
Predicting beyond the range of observed data (extrapolation) is risky and should generally be avoided
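The widening of both intervals away from $\bar{x}$ can be seen numerically. The sketch below uses simulated data with hypothetical parameter values (the crime rate file is not reproduced here) and a grid that deliberately extends beyond the observed range of the predictor to show how extrapolation inflates the widths.

```r
# Hypothetical data, simulated only to illustrate interval widths
set.seed(7)
x <- runif(60, min = 60, max = 90)
y <- 20000 - 170 * x + rnorm(60, sd = 2000)
fit <- lm(y ~ x)

# Grid that extends past the observed range of x (60 to 90)
grid <- seq(50, 100, by = 0.5)
new  <- data.frame(x = grid)

conf <- predict(fit, new, interval = "confidence")
pred <- predict(fit, new, interval = "prediction")

# Interval widths at each grid point
ci_width <- conf[, "upr"] - conf[, "lwr"]
pi_width <- pred[, "upr"] - pred[, "lwr"]

# The narrowest CI occurs at the grid point closest to the sample mean of x
c(narrowest_at = grid[which.min(ci_width)], xbar = mean(x))

# Widths at the edges of the grid versus near the centre
range(ci_width)
range(pi_width)
```

The prediction interval is wider than the confidence interval at every grid point, and both are widest at the extrapolated ends of the grid.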