Chapter 2: Simple Linear Regression
• 2.1 The Model
• 2.2-2.3 Parameter Estimation
• 2.4 Properties of Estimators
• 2.5 Inference
• 2.6 Prediction
• 2.7 Analysis of Variance
• 2.8 Regression Through the Origin
• 2.9 Related Models
2.1 The Model
• Measurement of y (response) changes in a linear fashion with a setting of the variable x (predictor):

y = β0 + β1x + ε

(linear relation + noise)
◦ The linear relation is deterministic (non-random).
◦ The noise or error is random.
• Noise accounts for the variability of the observations about the straight line. No noise ⇒ relation is deterministic. Increased noise ⇒ increased variability.
• Experiment with this simulation program:
simple.sim <-
function(intercept=0, slope=1, x=seq(1,10), sigma=1){
    noise <- rnorm(length(x), sd = sigma)   # random errors
    y <- intercept + slope*x + noise        # responses from the linear model
    title1 <- paste("sigma =", sigma)
    plot(x, y, pch=16, main=title1)         # scatterplot of simulated data
    abline(intercept, slope, col=4, lwd=2)  # overlay the true line
}
• > source("simple.sim.R")
> simple.sim(sigma=.01)
> simple.sim(sigma=.1)
> simple.sim(sigma=1)
> simple.sim(sigma=10)
Simulation Examples
[Figure: four scatterplots of simulated (x, y) data with the true line overlaid, for sigma = 0.01, 0.1, 1 and 10. The scatter about the line increases with sigma.]
The Setup
• Assumptions:
1. E[y|x] = β0 + β1x.
2. Var(y|x) = Var(β0 + β1x + ε|x) = σ².
• Data: Suppose data y1, y2, . . . , yn are obtained at settings x1, x2, . . . , xn, respectively. Then the model on the data is

yi = β0 + β1xi + εi

(εi i.i.d. N(0, σ²) and E[yi|xi] = β0 + β1xi.) Either
1. the x's are fixed values and measured without error (controlled experiment), OR
2. the analysis is conditional on the observed values of x (observational study).
2.2-2.3 Parameter Estimation, Fitted Values and Residuals
1. Maximum Likelihood Estimation
Distributional assumptions are required
2. Least Squares Estimation
Distributional assumptions are not required
2.2.1 Maximum Likelihood Estimation
◦ Normal assumption is required:

f(yi|xi) = (1/(√(2π)σ)) e^(−(yi − β0 − β1xi)²/(2σ²))

◦ Likelihood:

L(β0, β1, σ) = ∏ f(yi|xi) ∝ (1/σⁿ) e^(−(1/(2σ²)) ∑(yi − β0 − β1xi)²)

◦ Maximize with respect to β0, β1, and σ².
◦ (β̂0, β̂1): equivalent to minimizing ∑(yi − β0 − β1xi)² (i.e. least squares).
◦ σ̂²: SSE/n, which is biased.
2.2.2 Least Squares Estimation
◦ Assumptions:
1. E[εi] = 0
2. Var(εi) = σ²
3. εi's are independent.
• Note that normality is not required.
Method
• Minimize

S(β0, β1) = ∑(yi − β0 − β1xi)²

with respect to the parameters or regression coefficients β0 and β1; the minimizers are the estimates β̂0 and β̂1.
◦ Justification: We want the fitted line to pass as close to all of the points as possible.
◦ Aim: small residuals (observed − fitted response values):

ei = yi − β̂0 − β̂1xi
Look at the following plots:
> source("roller2.plot")
> roller2.plot(a=14,b=0)
> roller2.plot(a=2,b=2)
> roller2.plot(a=12, b=1)
> roller2.plot(a=-2,b=2.67)
[Figure: four plots of depression in lawn (mm) against roller weight (t), showing the data values, fitted values, and positive/negative residuals for the candidate lines a=14, b=0; a=2, b=2; a=12, b=1; a=−2, b=2.67.]
◦ The first three lines above do not pass as close to the plotted points as the fourth, even though the sum of the residuals is about the same in all four cases.
◦ Negative residuals cancel out positive residuals.
The Key: minimize squared residuals
> source("roller3.plot.R")
> roller3.plot(14,0); roller3.plot(2,2); roller3.plot(12,1)
> roller3.plot(a=-2,b=2.67) # small SS
[Figure: the same four candidate lines, redrawn with the squared residuals displayed; the line a=−2, b=2.67 gives the smallest sum of squares.]
Unbiased estimate of σ²

σ̂² = (1/(n − # parameters estimated)) ∑ei² = (1/(n − 2)) ∑ei²

* n observations ⇒ n degrees of freedom
* 2 degrees of freedom are required to estimate the parameters
* the residuals retain n − 2 degrees of freedom
Alternative Viewpoint
* y = (y1, y2, . . . , yn) is a vector in n-dimensional space. (n degrees of freedom)
* The fitted values ŷi = β̂0 + β̂1xi also form a vector in n-dimensional space: ŷ = (ŷ1, ŷ2, . . . , ŷn). (2 degrees of freedom)
* Least-squares seeks to minimize the distance between y and ŷ.
* The distance between n-dimensional vectors u and v is the square root of ∑(ui − vi)².
* Thus, the squared distance between y and ŷ is

∑(yi − ŷi)² = ∑ei²
Regression Coefficient Estimators
• The minimizers of

S(β0, β1) = ∑(yi − β0 − β1xi)²

are

β̂0 = ȳ − β̂1x̄  and  β̂1 = Sxy/Sxx

where

Sxy = ∑(xi − x̄)(yi − ȳ)  and  Sxx = ∑(xi − x̄)²
HomeMade R Estimators
> ls.est <-
function (data)
{
    x <- data[,1]              # predictor in column 1
    y <- data[,2]              # response in column 2
    xbar <- mean(x)
    ybar <- mean(y)
    Sxy <- S(x,y,xbar,ybar)
    Sxx <- S(x,x,xbar,xbar)
    b <- Sxy/Sxx               # slope estimate
    a <- ybar - xbar*b         # intercept estimate
    list(a = a, b = b, data=data)
}
> S <-
function (x,y,xbar,ybar)
{
    sum((x-xbar)*(y-ybar))     # sum of cross-products about the means
}
Calculator Formulas and Other Properties
•

Sxy = ∑xiyi − (∑xi)(∑yi)/n

Sxx = ∑xi² − (∑xi)²/n

• Linearity Property:

Sxy = ∑yi(xi − x̄)

so β̂1 is linear in the yi's.
• Gauss-Markov Theorem: Among linear unbiased estimators for β1 and β0, β̂1 and β̂0 are best (i.e. they have smallest variance).
• Exercise: Find the expected value and variance of (y1 − ȳ)/(x1 − x̄). Is it unbiased for β1? Is there a linear unbiased estimator with smaller variance?
Residuals
ε̂i = ei = yi − β̂0 − β̂1xi = yi − ŷi = yi − ȳ − β̂1(xi − x̄)
• In R:
> res
function (ls.object)
{
    a <- ls.object$a           # intercept estimate
    b <- ls.object$b           # slope estimate
    x <- ls.object$data[,1]
    y <- ls.object$data[,2]
    resids <- y - a - b*x      # observed minus fitted
    resids
}
2.3.1 Consequences of Least-Squares
1. ei = yi − ȳ − β̂1(xi − x̄) (follows from the intercept formula)
2. ∑ei = 0 (follows from 1.)
3. ∑ŷi = ∑yi (follows from 2.)
4. The regression line passes through the centroid (x̄, ȳ) (follows from the intercept formula)
5. ∑xiei = 0 (set the partial derivative of S(β0, β1) wrt β1 to 0)
6. ∑ŷiei = 0 (follows from 2. and 5.)
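◦ A quick numerical check of identities 2, 3, 5 and 6 (a minimal sketch, assuming the roller data and the roller.lm fit used later in these notes):
> e <- resid(roller.lm)
> sum(e)                                            # identity 2: essentially 0
> sum(fitted(roller.lm)) - sum(roller$depression)   # identity 3: essentially 0
> sum(roller$weight * e)                            # identity 5: essentially 0
> sum(fitted(roller.lm) * e)                        # identity 6: essentially 0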
2.3.2 Estimation of σ²

• The residual sum of squares or error sum of squares is given by

SSE = ∑ei² = Syy − β̂1Sxy

Note: SSE = SSRes and Syy = SST.
• An unbiased estimator for the error variance is

σ̂² = MSE = SSE/(n − 2) = [Syy − β̂1Sxy]/(n − 2)

Note: MSE = MSRes.
• σ̂ = Residual Standard Error = √MSE
Example
• roller data
weight depression
1 1.9 2
2 3.1 1
3 3.3 5
4 4.8 5
5 5.3 20
6 6.1 20
7 6.4 23
8 7.6 10
9 9.8 30
10 12.4 25
Hand Calculation
• (y = depression, x = weight)
◦ ∑xi = 1.9 + 3.1 + · · · + 12.4 = 60.7
◦ x̄ = 60.7/10 = 6.07
◦ ∑yi = 2 + 1 + · · · + 25 = 141
◦ ȳ = 141/10 = 14.1
◦ ∑xi² = 1.9² + 3.1² + · · · + 12.4² = 461
◦ ∑yi² = 4 + 1 + 25 + · · · + 625 = 3009
◦ ∑xiyi = (1.9)(2) + · · · + (12.4)(25) = 1103
◦ Sxx = 461 − (60.7)²/10 = 92.6
◦ Sxy = 1103 − (60.7)(141)/10 = 247
◦ Syy = 3009 − (141)²/10 = 1021
◦ β̂1 = Sxy/Sxx = 247/92.6 = 2.67
◦ β̂0 = ȳ − β̂1x̄ = 14.1 − 2.67(6.07) = −2.11
◦ σ̂² = (1/(n − 2))(Syy − β̂1Sxy) = (1/8)(1021 − 2.67(247)) = 45.2 = MSE
• Example Summary: the fitted regression line relating depression (y) to weight (x) is

ŷ = −2.11 + 2.67x

The error variance is estimated as MSE = 45.2.
R commands (home-made version)
> roller.obj <- ls.est(roller)
> roller.obj[-3] # intercept and slope estimate
$a
[1] -2.09
$b
[1] 2.67
> res(roller.obj) # residuals
[1] -0.98 -5.18 -1.71 -5.71 7.95
[6] 5.82 8.02 -8.18 5.95 -5.98
> sum(res(roller.obj)^2)/8 # error variance (MSE)
[1] 45.4
R commands (built-in version)
> attach(roller)
> roller.lm <- lm(depression ~ weight)
> summary(roller.lm)
Call:
lm(formula = depression ~ weight)
Residuals:
Min 1Q Median 3Q Max
-8.180 -5.580 -1.346 5.920 8.020
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.0871 4.7543 -0.439 0.67227
weight 2.6667 0.7002 3.808 0.00518
Residual standard error: 6.735 on 8 degrees of freedom
Multiple R-squared: 0.6445, Adjusted R-squared: 0.6001
F-statistic: 14.5 on 1 and 8 DF, p-value: 0.005175
> detach(roller)
• or
> roller.lm <- lm(depression ~ weight, data = roller)
> summary(roller.lm)
Using Extractor Functions
• Partial Output:
> coef(roller.lm)
(Intercept) weight
-2.09 2.67
> summary(roller.lm)$sigma
[1] 6.74
• From the output,
◦ the slope estimate is β̂1 = 2.67.
◦ the intercept estimate is β̂0 = −2.09.
◦ the Residual standard error is the square root of the MSE: 6.74² = 45.4.
Other R commands
◦ fitted values: predict(roller.lm)
◦ residuals: resid(roller.lm)
◦ diagnostic plots: plot(roller.lm)
(these include a plot of the residuals against the fitted values and a normal probability plot of the residuals)
◦ Also plot(roller); abline(roller.lm)
(this gives a plot of the data with the fitted line overlaid)
2.4: Properties of Least Squares Estimates
◦ E[β̂1] = β1
◦ E[β̂0] = β0
◦ Var(β̂1) = σ²/Sxx
◦ Var(β̂0) = σ²(1/n + x̄²/Sxx)
Standard Error Estimators
◦ V̂ar(β̂1) = MSE/Sxx, so the standard error (s.e.) of β̂1 is estimated by √(MSE/Sxx).
roller e.g.: MSE = 45.2, Sxx = 92.6 ⇒ s.e. of β̂1 is √(45.2/92.6) = .699
◦ V̂ar(β̂0) = MSE(1/n + x̄²/Sxx), so the standard error (s.e.) of β̂0 is estimated by √(MSE(1/n + x̄²/Sxx)).
roller e.g.: MSE = 45.2, Sxx = 92.6, x̄ = 6.07, n = 10 ⇒ s.e. of β̂0 is √(45.2(1/10 + 6.07²/92.6)) = 4.74
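◦ These agree with the Std. Error column of the summary output. A quick check, assuming roller.lm as fitted earlier (the standard errors are the square roots of the diagonal of the coefficient variance matrix):
> sqrt(diag(vcov(roller.lm)))
(Intercept)      weight
      4.754       0.700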
Distributions of β̂1 and β̂0

◦ yi is N(β0 + β1xi, σ²), and

β̂1 = ∑aiyi (ai = (xi − x̄)/Sxx)

⇒ β̂1 is N(β1, σ²/Sxx).
Also, SSE/σ² is χ²n−2 (independent of β̂1), so

(β̂1 − β1)/√(MSE/Sxx) ∼ tn−2

◦ β̂0 = ∑ciyi (ci = 1/n − x̄(xi − x̄)/Sxx)
⇒ β̂0 is N(β0, σ²(1/n + x̄²/Sxx)) and

(β̂0 − β0)/√(MSE(1/n + x̄²/Sxx)) ∼ tn−2
2.5 Inferences about the Regression Parameters
• Tests
• Confidence Intervals
for β1, β0, and the mean response β0 + β1x0
2.5.1 Inference for β1
H0 : β1 = β10 vs. H1 : β1 ≠ β10

Under H0,

t0 = (β̂1 − β10)/√(MSE/Sxx)

has a t-distribution on n − 2 degrees of freedom.

p-value = P(|tn−2| > |t0|)

e.g. testing significance of regression for roller:

H0 : β1 = 0 vs. H1 : β1 ≠ 0

t0 = (2.67 − 0)/.699 = 3.82

p-value = P(|t8| > 3.82) = 2(1 − P(t8 < 3.82)) = .00509

R command: 2*(1-pt(3.82,8))
(1− α) Confidence Intervals
◦ slope:

β̂1 ± tn−2,α/2 · s.e.  or  β̂1 ± tn−2,α/2 √(MSE/Sxx)

roller e.g. (95% confidence interval):

2.67 ± t8,.025(.699)  or  2.67 ± 2.31(.699) = 2.67 ± 1.61

R command: qt(.975, 8)
◦ intercept:

β̂0 ± tn−2,α/2 · s.e.

e.g. −2.11 ± 2.31(4.74) = −2.11 ± 10.9
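◦ The built-in confint() reproduces both intervals from the fitted model (output rounded; small differences from the hand calculations are due to rounding):
> confint(roller.lm)
               2.5 % 97.5 %
(Intercept)  -13.05    8.88
weight         1.05    4.28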
2.5.2 Confidence Interval for Mean Response
◦ E[y|x0] = β0 + β1x0, where x0 is a possible value that the predictor could take.
◦ Ê[y|x0] = β̂0 + β̂1x0 is a point estimate of the mean response at that value.
◦ To find a 1 − α confidence interval for E[y|x0], we need the variance of ŷ0 = Ê[y|x0]:

Var(ŷ0) = Var(β̂0 + β̂1x0)

where

β̂0 = ∑ciyi  and  β̂1x0 = ∑bix0yi
Therefore,

ŷ0 = ∑(ci + bix0)yi

so

Var(ŷ0) = ∑(ci + bix0)² Var(yi)
        = σ² ∑(ci + bix0)²
        = σ² ∑(1/n + (x0 − x̄)(xi − x̄)/Sxx)²
        = σ²(1/n + (x0 − x̄)²/Sxx)

◦ V̂ar(ŷ0) = MSE(1/n + (x0 − x̄)²/Sxx)
◦ The confidence interval is then

ŷ0 ± tn−2,α/2 √(MSE(1/n + (x0 − x̄)²/Sxx))

◦ e.g. Compute a 95% confidence interval for the expected depression when the weight is 5 tonnes.

x0 = 5, ŷ0 = −2.11 + 2.67(5) = 11.2
n = 10, x̄ = 6.07, Sxx = 92.6, MSE = 45.2

Interval:

11.2 ± t8,.025 √(45.2(1/10 + (5 − 6.07)²/92.6))
= 11.2 ± 2.31√5.08 = 11.2 ± 5.21 = (5.99, 16.41)
◦ R code:
> predict(roller.lm,
      newdata=data.frame(weight=5),
      interval="confidence")
      fit  lwr  upr
[1,] 11.2 6.04 16.5
◦ Ex. Write your own R function to compute this interval.
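◦ A minimal sketch of such a function (the name ci.mean and its arguments are my own; assumes a simple linear regression fit):
> ci.mean <- function(lm.obj, x0, level=0.95) {
      x <- model.frame(lm.obj)[[2]]       # predictor values
      n <- length(x)
      mse <- summary(lm.obj)$sigma^2      # MSE
      Sxx <- sum((x - mean(x))^2)
      y0 <- sum(coef(lm.obj)*c(1, x0))    # point estimate b0 + b1*x0
      half <- qt(1 - (1-level)/2, n-2) *
              sqrt(mse*(1/n + (x0 - mean(x))^2/Sxx))
      c(fit = y0, lwr = y0 - half, upr = y0 + half)
  }
> ci.mean(roller.lm, 5)   # should agree with predict() above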
2.6: Predicted Responses
◦ If a new observation were to be taken at x0, we would predict it to be y0 = β0 + β1x0 + ε, where ε is independent noise.
◦ ŷ0 = β̂0 + β̂1x0 is a point prediction of the response at that value.
◦ To find a 1 − α prediction interval for y0, we need the variance of y0 − ŷ0:

Var(y0 − ŷ0) = Var(β0 + β1x0 + ε − β̂0 − β̂1x0)
             = Var(−β̂0 − β̂1x0) + Var(ε)
             = σ²(1/n + (x0 − x̄)²/Sxx) + σ²
             = σ²(1 + 1/n + (x0 − x̄)²/Sxx)

◦ V̂ar(y0 − ŷ0) = MSE(1 + 1/n + (x0 − x̄)²/Sxx)
◦ The prediction interval is then

ŷ0 ± tn−2,α/2 √(MSE(1 + 1/n + (x0 − x̄)²/Sxx))

◦ e.g. Compute a 95% prediction interval for the depression for a single new observation where the weight is 5 tonnes.

x0 = 5, ŷ0 = −2.11 + 2.67(5) = 11.2
n = 10, x̄ = 6.07, Sxx = 92.6, MSE = 45.2

Interval:

11.2 ± t8,.025 √(45.2(1 + 1/10 + (5 − 6.07)²/92.6))
= 11.2 ± 2.31√50.3 = 11.2 ± 16.4 = (−5.2, 27.6)
R code for Prediction Intervals

> predict(roller.lm,
      newdata=data.frame(weight=5),
      interval="prediction")
      fit   lwr  upr
[1,] 11.2 -5.13 27.6
◦ Write your own R function to produce this interval.
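◦ The sketch from 2.5.2 carries over; the only change is the extra "1 +" under the square root (again, the name pi.pred is my own):
> pi.pred <- function(lm.obj, x0, level=0.95) {
      x <- model.frame(lm.obj)[[2]]
      n <- length(x)
      mse <- summary(lm.obj)$sigma^2
      Sxx <- sum((x - mean(x))^2)
      y0 <- sum(coef(lm.obj)*c(1, x0))
      half <- qt(1 - (1-level)/2, n-2) *
              sqrt(mse*(1 + 1/n + (x0 - mean(x))^2/Sxx))
      c(fit = y0, lwr = y0 - half, upr = y0 + half)
  }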
Degrees of Freedom
• A random sample of size n coming from a normal population with mean µ and variance σ² has n degrees of freedom: Y1, . . . , Yn.
• Each linearly independent restriction reduces the number of degrees of freedom by 1.
• ∑(Yi − µ)²/σ² has a χ²(n) distribution.
• ∑(Yi − Ȳ)²/σ² has a χ²(n−1) distribution. (Calculating Ȳ imposes one linear restriction.)
• ∑(Yi − Ŷi)²/σ² has a χ²(n−2) distribution. (Calculating β̂0 and β̂1 imposes two linearly independent restrictions.)
• Given the xi's, there are 2 degrees of freedom in the quantity β̂0 + β̂1xi.
• Given xi and Ȳ, there is one degree of freedom: ŷi − Ȳ = β̂1(xi − x̄).
2.7: Analysis of Variance: Breaking down (analyzing) Variation
◦ The variation in the data (responses) is summarized by

Syy = ∑(yi − ȳ)² = TSS (Total sum of squares)

◦ 2 sources of variation in the responses:
1. variation due to the straight line relationship with the predictor
2. deviation from the line (noise)
◦ yi − ȳ (deviation from data center) = yi − ŷi (residual) + ŷi − ȳ (difference: line and center), i.e.

yi − ȳ = ei + ŷi − ȳ

so

Syy = ∑(yi − ȳ)²
    = ∑(ei + ŷi − ȳ)²
    = ∑ei² + ∑(ŷi − ȳ)²

since ∑eiŷi = 0 and ȳ∑ei = 0. Thus

Syy = SSE + ∑(ŷi − ȳ)² = SSE + SSR

The last term is the regression sum of squares.
Relation between SSR and β̂1

◦ We saw earlier that

SSE = Syy − β̂1Sxy

◦ Therefore,

SSR = β̂1Sxy = β̂1²Sxx

Note that, for a given set of x's, SSR depends only on β̂1.

MSR = SSR/d.f. = SSR/1 (1 degree of freedom for the slope parameter)
Expected Sums of Squares
◦

E[SSR] = E[Sxxβ̂1²]
       = Sxx(Var(β̂1) + (E[β̂1])²)
       = Sxx(σ²/Sxx + β1²)
       = σ² + β1²Sxx = E[MSR]

Therefore, if β1 = 0, then SSR is an unbiased estimator for σ².
◦ Development of E[MSE]:

E[Syy] = E[∑(yi − ȳ)²] = E[∑yi² − nȳ²] = ∑E[yi²] − nE[ȳ²]
Development of E[MSE] (cont’d)
Consider the 2 terms on the RHS separately:

1st term:

E[yi²] = Var(yi) + (E[yi])² = σ² + (β0 + β1xi)²

∑E[yi²] = nσ² + nβ0² + 2nβ0β1x̄ + β1²∑xi²

2nd term:

E[ȳ²] = Var(ȳ) + (E[ȳ])² = σ²/n + (β0 + β1x̄)²

nE[ȳ²] = σ² + nβ0² + 2nβ0β1x̄ + nβ1²x̄²
Development of E[MSE] (cont’d)
⇒

E[Syy] = (n − 1)σ² + β1²∑(xi − x̄)² = E[SST]

◦

E[SSE] = E[Syy] − E[SSR]
       = (n − 1)σ² + β1²∑(xi − x̄)² − (σ² + β1²Sxx)
       = (n − 2)σ²

E[MSE] = E[SSE/(n − 2)] = σ²
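◦ A quick simulation check of E[MSE] = σ² and E[MSR] = σ² + β1²Sxx (a sketch, not part of the original notes; the parameter values here are arbitrary choices):
> set.seed(1)
> x <- seq(1, 10); beta1 <- 2; sigma <- 3
> Sxx <- sum((x - mean(x))^2)          # 82.5
> stats <- replicate(10000, {
      y <- 1 + beta1*x + rnorm(length(x), sd = sigma)
      fit <- lm(y ~ x)
      c(summary(fit)$sigma^2,          # MSE
        coef(fit)[[2]]^2 * Sxx)        # MSR = b1^2 * Sxx
  })
> rowMeans(stats)   # near c(9, 339) = c(sigma^2, sigma^2 + beta1^2*Sxx)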
Another approach to testing H0 : β1 = 0
◦ Under the null hypothesis, both MSE and MSR estimate σ².
◦ Under the alternative, only MSE estimates σ²:

E[MSR] = σ² + β1²Sxx > σ²

◦ A reasonable test statistic is

F0 = MSR/MSE ∼ F1,n−2 (under H0)

◦ Large F0 ⇒ evidence against H0.
◦ Note that tν² = F1,ν, so this is really the same test as

t0² = (β̂1/√(MSE/Sxx))² = β̂1²Sxx/MSE = MSR/MSE
The ANOVA table
Source   df    SS            MS          F
Reg.     1     β̂1²Sxx        β̂1²Sxx      MSR/MSE
Error    n−2   Syy − β̂1²Sxx  SSE/(n−2)
Total    n−1   Syy

roller data example:

> anova(roller.lm) # R code
Analysis of Variance Table

Response: depression
          Df Sum Sq Mean Sq F value Pr(>F)
weight     1    658     658    14.5 0.0052
Residuals  8    363      45

(recall that the t-statistic for testing β1 = 0 had been 3.81 = √14.5)
◦ Ex. Write an R function to compute these ANOVA quantities.
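◦ A minimal sketch of such a function (the name my.anova is my own):
> my.anova <- function(x, y) {
      n <- length(y)
      Sxy <- sum((x - mean(x))*(y - mean(y)))
      Sxx <- sum((x - mean(x))^2)
      Syy <- sum((y - mean(y))^2)
      SSR <- Sxy^2/Sxx                  # = b1^2 * Sxx
      SSE <- Syy - SSR
      F0 <- SSR/(SSE/(n - 2))
      c(SSR = SSR, SSE = SSE, MSE = SSE/(n - 2),
        F = F0, p.value = 1 - pf(F0, 1, n - 2))
  }
> my.anova(roller$weight, roller$depression)   # matches anova(roller.lm)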
Confidence Interval for σ²

◦ SSE/σ² ∼ χ²n−2, so

P(χ²n−2,1−α/2 ≤ SSE/σ² ≤ χ²n−2,α/2) = 1 − α

so

P(SSE/χ²n−2,α/2 ≤ σ² ≤ SSE/χ²n−2,1−α/2) = 1 − α

e.g. roller data:

SSE = 363
χ²8,.975 = 2.18 (R code: qchisq(.025, 8))
χ²8,.025 = 17.5 (R code: qchisq(.975, 8))

(363/17.5, 363/2.18) = (20.7, 166.5)
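◦ The same interval in one step, assuming roller.lm as fitted earlier:
> SSE <- sum(resid(roller.lm)^2)
> SSE/qchisq(c(.975, .025), df = 8)   # (20.7, 166.5)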
2.7.1 R² - Coefficient of Determination

◦ R² is the fraction of the response variability explained by the regression:

R² = SSR/Syy

◦ 0 ≤ R² ≤ 1. Values near 1 imply that most of the variability is explained by the regression.
◦ roller data: SSR = 658 and Syy = 1021, so

R² = 658/1021 = .644
R output
> summary(roller.lm)
...
Multiple R-Squared: 0.644, ...
◦ Ex. Write an R function which computes R2.
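◦ One possible version (a sketch; the name my.R2 is my own):
> my.R2 <- function(lm.obj) {
      y <- model.frame(lm.obj)[[1]]   # response values
      1 - sum(resid(lm.obj)^2)/sum((y - mean(y))^2)   # 1 - SSE/Syy = SSR/Syy
  }
> my.R2(roller.lm)   # 0.644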
◦ Another interpretation:

E[R²] ≈ E[SSR]/E[Syy] = (β1²Sxx + σ²)/((n − 1)σ² + β1²Sxx)

      = (β1²Sxx/(n−1) + σ²/(n−1)) / (σ² + β1²Sxx/(n−1))

      ≈ (β1²Sxx/(n−1)) / (σ² + β1²Sxx/(n−1))

for large n. (Note: this differs from the textbook.)
Properties of R²

• Thus, R² increases as
1. Sxx increases (the x's become more spread out)
2. σ² decreases
◦ Cautions
1. R² does not measure the magnitude of the regression slope.
2. R² does not measure the appropriateness of the linear model.
3. A large value of R² does not imply that the regression model will be an accurate predictor.
Hazards of Regression
• Extrapolation: predicting y values outside the range of observed x values. There is no guarantee that a future response would behave in the same linear manner outside the observed range. e.g. Consider an experiment with a spring. The spring is stretched to several different lengths x (in cm) and the restoring force F (in Newtons) is measured:
x F
3 5.1
4 6.2
5 7.9
6 9.5
> spring.lm <- lm(F ~ x - 1, data=spring)
> summary(spring.lm)
Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x   1.5884     0.0232    68.6  6.8e-06
The fitted model relating F to x is

F̂ = 1.58x

Can we predict the restoring force for the spring if it has been extended to a length of 15 cm?
• High leverage observations: x values at the extremes of the range have more influence on the slope of the regression than observations near the middle of the range.
• Outliers can distort the regression line. Outliers may be incorrectly recorded OR may be an indication that the linear relation or constant variance assumption is incorrect.
• A regression relationship does not mean that there is a cause-and-effect relationship. e.g. The following data give the number of lawyers and number of homicides in a given year for a number of towns:
no. lawyers no. homicides
1 0
2 0
7 2
10 5
12 6
14 6
15 7
18 8
Note that the number of homicides increases with the number of lawyers. Does this mean that in order to reduce the number of homicides, one should reduce the number of lawyers?
• Beware of nonsense relationships. e.g. It is possible to show that the area of some lakes in Manitoba is related to elevation. Do you think there is a real reason for this? Or is the apparent relation just a result of chance?
2.8 - Regression through the Origin
◦ intercept = 0:

yi = β1xi + εi

◦ Max. Likelihood and L-S ⇒ minimize

∑(yi − β1xi)²

giving

β̂1 = ∑xiyi / ∑xi²

ei = yi − β̂1xi,  SSE = ∑ei²

◦ Max. Likelihood ⇒

σ̂² = SSE/n
◦ Unbiased Estimator:

σ̂² = MSE = SSE/(n − 1)

◦ Properties of β̂1:

E[β̂1] = β1,  Var(β̂1) = σ²/∑xi²

◦ 1 − α C.I. for β1:

β̂1 ± tn−1,α/2 √(MSE/∑xi²)

◦ 1 − α C.I. for E[y|x0]:

ŷ0 ± tn−1,α/2 √(MSE x0²/∑xi²)

since

Var(β̂1x0) = (σ²/∑xi²) x0²

◦ 1 − α P.I. for y, given x0:

ŷ0 ± tn−1,α/2 √(MSE(1 + x0²/∑xi²))
◦ R code:
> roller.lm <- lm(depression ~ weight - 1,
      data=roller)
> summary(roller.lm)
Coefficients:
       Estimate Std. Error t value Pr(>|t|)
weight    2.392      0.299    7.99  2.2e-05

Residual standard error: 6.43 on 9 degrees of freedom
Multiple R-Squared: 0.876, Adjusted R-squared: 0.863
F-statistic: 63.9 on 1 and 9 DF, p-value: 2.23e-05

> predict(roller.lm,
      newdata=data.frame(weight=5),
      interval="prediction")
      fit   lwr  upr
[1,] 12.0 -2.97 26.9
Ch. 2.9.2 Correlation
• Bivariate Normal Distribution:

f(x, y) = (1/(2πσ1σ2√(1 − ρ²))) e^(−(X² − 2ρXY + Y²)/(2(1 − ρ²)))

where

X = (x − µ1)/σ1  and  Y = (y − µ2)/σ2

◦ correlation coefficient: ρ = σ12/(σ1σ2)
◦ µ1 and σ1² are the mean and variance of x
◦ µ2 and σ2² are the mean and variance of y
◦ σ12 = E[(x − µ1)(y − µ2)], the covariance
Conditional distribution of y given x
f(y|x) = (1/(√(2π)σ1.2)) e^(−(1/2)((y − β0 − β1x)/σ1.2)²)

where

β0 = µ2 − µ1ρσ2/σ1,  β1 = (σ2/σ1)ρ,  σ1.2² = σ2²(1 − ρ²)
An Explanation:

◦ Suppose

y = β0 + β1x + ε

where ε and x are independent N(0, σ²) and N(µ1, σ1²) random variables.
◦ Define µ2 = E[y] and σ2² = Var(y):

µ2 = β0 + β1µ1,  σ2² = β1²σ1² + σ²

◦ Define σ12 = Cov(x − µ1, y − µ2):

σ12 = E[(x − µ1)(y − µ2)]
    = E[(x − µ1)(β1x + ε − β1µ1)]
    = β1E[(x − µ1)²] + E[(x − µ1)ε]
    = β1σ1²
◦ Define ρ = σ12/(σ1σ2):

ρ = β1σ1²/(σ1σ2) = β1σ1/σ2

Therefore,

β1 = (σ2/σ1)ρ

and

β0 = µ2 − β1µ1 = µ2 − µ1(σ2/σ1)ρ

◦ What is the conditional distribution of y, given x = x?

y|x=x = β0 + β1x + ε

must be normal with mean

E[y|x = x] = β0 + β1x = µ2 + ρ(σ2/σ1)(x − µ1)

and variance

Var(y|x = x) = σ1.2² = σ² = σ2² − β1²σ1²
             = σ2² − ρ²(σ2²/σ1²)σ1²
             = σ2²(1 − ρ²)
Estimation
◦ Maximum Likelihood Estimation (using the bivariate normal) gives

β̂0 = ȳ − β̂1x̄

β̂1 = Sxy/Sxx

r = ρ̂ = Sxy/√(SxxSyy)

◦ Note that

r = β̂1√(Sxx/Syy)

so

r² = β̂1²(Sxx/Syy) = R²

i.e. the coefficient of determination is the square of the correlation coefficient.
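◦ A quick check on the roller data (cor() is built-in):
> r <- cor(roller$weight, roller$depression)
> r^2   # 0.644, matching the R-squared from summary(roller.lm)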
Testing H0 : ρ = 0 vs. H1 : ρ ≠ 0
◦ Cond'l approach (equivalent to testing β1 = 0):

t0 = β̂1/√(MSE/Sxx)
   = β̂1/√((Syy − β̂1Sxy)/((n − 2)Sxx))
   = √(β̂1²(n − 2)/(Syy/Sxx − β̂1²))
   = √((n − 2)r²/(1 − r²))

where we have used β̂1² = r²Syy/Sxx.
◦ The above statistic is a t statistic on n − 2 degrees of freedom ⇒ conclude ρ ≠ 0 if the p-value P(|tn−2| > |t0|) is small.
Testing H0 : ρ = ρ0

Z = (1/2) log((1 + r)/(1 − r))

has an approximate normal distribution with mean

µZ = (1/2) log((1 + ρ)/(1 − ρ))

and variance

σZ² = 1/(n − 3)

for large n.
◦ Thus,

Z0 = [(1/2) log((1 + r)/(1 − r)) − (1/2) log((1 + ρ0)/(1 − ρ0))] / √(1/(n − 3))

has an approximate standard normal distribution when the null hypothesis is true.
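◦ A sketch of this test as an R function (the name rho.ztest is my own):
> rho.ztest <- function(r, rho0, n) {
      z0 <- (0.5*log((1 + r)/(1 - r)) - 0.5*log((1 + rho0)/(1 - rho0))) /
            sqrt(1/(n - 3))
      c(z0 = z0, p.value = 2*(1 - pnorm(abs(z0))))   # two-sided p-value
  }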
Confidence Interval for ρ
• Confidence interval for (1/2) log((1 + ρ)/(1 − ρ)):

Z ± zα/2 √(1/(n − 3))

• Find the endpoints (l, u) of this confidence interval and solve for ρ:

((e^(2l) − 1)/(1 + e^(2l)), (e^(2u) − 1)/(1 + e^(2u)))
R code for fossum example
• Find a 95% confidence interval for the correlation between total length and head length:
> source("fossum.R")
> attach(fossum)
> n <- length(totlngth) # sample size
> r <- cor(totlngth,hdlngth)# correlation est.
> zci <- .5*log((1+r)/(1-r)) +
qnorm(c(.025,.975))*sqrt(1/(n-3))
> ci <- (exp(2*zci)-1)/(1+exp(2*zci))
> ci # transformed conf. interval
> detach(fossum)
[1] 0.62 0.83 # conf. interval for true
# correlation