Chapter 11: SIMPLE LINEAR REGRESSION (SLR) AND CORRELATION
Part 2: Properties, Hypothesis tests, Model adequacy and assumptions
Sections 11-3, 11-4.1, 11-7
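• Recall the SLR estimates for $\beta_0$, $\beta_1$ and $\sigma^2$:
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{S_{xy}}{S_{xx}}$$
$$\hat\sigma^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2} = MSE$$
• These estimators are unbiased estimators (a nice characteristic):
* $E[\hat\beta_0] = \beta_0$
* $E[\hat\beta_1] = \beta_1$
* $E[\hat\sigma^2] = \sigma^2$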
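Unbiasedness can be illustrated with a small simulation. The sketch below is not from the text: the true values (beta0 = 2, beta1 = 0.5, sigma = 1), the x-values 0, 1, ..., 19, and the helper names `fit_slr` and `simulate_means` are all assumptions chosen for illustration. It repeatedly generates data from a known line, refits the model, and averages the estimates, which should land close to the true values.

```python
import random

def fit_slr(xs, ys):
    """Least squares estimates (b0, b1, mse) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return b0, b1, sse / (n - 2)

def simulate_means(beta0=2.0, beta1=0.5, sigma=1.0, n=20, reps=2000, seed=1):
    """Average the three estimates over many simulated data sets."""
    rng = random.Random(seed)              # fixed seed so the run is repeatable
    xs = [float(i) for i in range(n)]      # hypothetical fixed x-values
    totals = [0.0, 0.0, 0.0]
    for _ in range(reps):
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        for k, estimate in enumerate(fit_slr(xs, ys)):
            totals[k] += estimate
    return tuple(t / reps for t in totals)

mean_b0, mean_b1, mean_mse = simulate_means()
print(f"average b0  = {mean_b0:.3f}  (true beta0 = 2.0)")
print(f"average b1  = {mean_b1:.3f}  (true beta1 = 0.5)")
print(f"average MSE = {mean_mse:.3f}  (true sigma^2 = 1.0)")
```

Any single fit will miss the true values; it is the average over repeated samples that matches them.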
What kind of variability do these least squares estimators have?
• Variance of $\hat\beta_1$ (a random variable):
$$Var(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\sigma^2}{S_{xx}}$$
The variance of the estimated slope depends on... the variability of the errors $\sigma^2$, the amount of data $n$, the spread of the x-values.
Since we don't know $\sigma^2$, we'll plug in the estimate $\hat\sigma^2$ to get a usable value of the...
• Estimated standard error for $\hat\beta_1$:
$$se(\hat\beta_1) = \sqrt{\frac{\hat\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}}$$
• Variance of $\hat\beta_0$ (a random variable):
$$Var(\hat\beta_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right)$$
The variance of the estimated intercept depends on... the error variability $\sigma^2$, the amount of data $n$, the spread of the x-values, AND how far the center of the x-values (i.e. $\bar x$) is from $x = 0$.
We have more precision (lower variability) for estimating $\beta_0$ when the data are near $x = 0$ (compared to being far from $x = 0$).
We'll plug in the estimate $\hat\sigma^2$ to get the...
• Estimated standard error for $\hat\beta_0$:
$$se(\hat\beta_0) = \sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right)}$$
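The two variance formulas can be explored numerically. The sketch below uses three made-up sets of x-values (not data from the text) and an assumed error variance of 1. The function names `slope_variance` and `intercept_variance` are hypothetical helpers that evaluate the formulas above directly.

```python
def _center_and_sxx(xs):
    """Return (x_bar, Sxx) for a list of x-values."""
    x_bar = sum(xs) / len(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return x_bar, sxx

def slope_variance(xs, sigma2):
    """Var(b1) = sigma^2 / Sxx."""
    _, sxx = _center_and_sxx(xs)
    return sigma2 / sxx

def intercept_variance(xs, sigma2):
    """Var(b0) = sigma^2 * (1/n + x_bar^2 / Sxx)."""
    x_bar, sxx = _center_and_sxx(xs)
    return sigma2 * (1 / len(xs) + x_bar ** 2 / sxx)

# Hypothetical designs, all with n = 5 and sigma^2 = 1.
near_zero = [0, 1, 2, 3, 4]          # centered at x = 2
far_from_zero = [10, 11, 12, 13, 14] # same spread, centered at x = 12
spread_out = [0, 2, 4, 6, 8]         # twice the spread

for name, xs in [("near_zero", near_zero),
                 ("far_from_zero", far_from_zero),
                 ("spread_out", spread_out)]:
    print(name, slope_variance(xs, 1.0), intercept_variance(xs, 1.0))
```

Shifting the x-values away from 0 leaves the slope variance unchanged but inflates the intercept variance, while spreading the x-values out lowers the slope variance.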
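Hypothesis tests for $\beta_0$ and $\beta_1$
• For SLR, a common hypothesis test is the test for a linear relationship between $X$ and $Y$.
$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \neq 0$
• Under the assumption $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, we have
$$\hat\beta_0 \sim N\left(\beta_0,\ \sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right)\right)$$
$$\hat\beta_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right)$$
• Test of interest for the slope:
$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \neq 0$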
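• Since we will be estimating $\sigma^2$, we will use a t-statistic:
$$T_0 = \frac{\hat\beta_1 - 0}{se(\hat\beta_1)} = \frac{\hat\beta_1}{\sqrt{\dfrac{\hat\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}}}$$
Under $H_0$ true, $T_0 \sim t_{n-2}$.
From our test statistic, we can compute a p-value for our hypothesis test on the slope.
• Test of interest for the intercept:
$H_0: \beta_0 = 0$ vs. $H_1: \beta_0 \neq 0$
The test statistic:
$$T_0 = \frac{\hat\beta_0 - 0}{se(\hat\beta_0)} = \frac{\hat\beta_0}{\sqrt{\hat\sigma^2 \left( \dfrac{1}{n} + \dfrac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right)}}$$
Under $H_0$ true, $T_0 \sim t_{n-2}$.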
• Example: Chloride concentration in streams vs. roadway area in watersheds (Problem 11-10 in book)
An article in the Journal of Environmental Engineering reported the results of a study on the occurrence of sodium and chloride in surface streams in central Rhode Island.
They found that watersheds with a larger percentage of the land in roadways tended to have higher chloride concentrations (mg/liter) in the streams.
[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%).]
The data:
obs PercRoadways ChlorConc
1 0.15 6.6
2 0.19 4.4
3 0.47 11.8
4 0.57 9.7
5 0.60 14.3
6 0.63 10.9
7 0.67 10.8
8 0.69 19.2
9 0.70 10.6
10 0.70 12.1
11 0.78 14.7
12 0.78 17.3
13 0.81 15.0
14 1.05 27.4
15 1.06 27.7
16 1.30 23.1
17 1.62 39.5
18 1.74 31.8
n = 18
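Summary statistics:
$$\sum_{i=1}^n x_i = 14.51 \qquad \bar x = 0.8061$$
$$\sum_{i=1}^n y_i = 306.9 \qquad \bar y = 17.05$$
$$\sum_{i=1}^n x_i^2 = 14.7073 \qquad \sum_{i=1}^n y_i^2 = 6727.13$$
$$\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x) = 61.9205 \qquad \sum_{i=1}^n (x_i - \bar x)^2 = 3.0106$$
The regression coefficient estimates:
$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{61.9205}{3.0106} = 20.5675$$
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 17.05 - 20.5675(0.8061) = 0.4705$$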
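The summary statistics and coefficient estimates can be reproduced from the 18 observations listed in the data table. The following is a minimal sketch in plain Python; the variable names are chosen here to mirror the column names and are not from the text.

```python
# Data from the chloride example (n = 18).
perc_roadways = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
                 0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
chlor_conc = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
              12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

n = len(perc_roadways)
x_bar = sum(perc_roadways) / n
y_bar = sum(chlor_conc) / n

# Sxx and Sxy are the centered sums of squares and cross products.
sxx = sum((x - x_bar) ** 2 for x in perc_roadways)
sxy = sum((x - x_bar) * (y - y_bar)
          for x, y in zip(perc_roadways, chlor_conc))

b1 = sxy / sxx            # estimated slope
b0 = y_bar - b1 * x_bar   # estimated intercept

print(f"n = {n}, x_bar = {x_bar:.4f}, y_bar = {y_bar:.2f}")
print(f"Sxx = {sxx:.4f}, Sxy = {sxy:.4f}")
print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
```

Working at full precision gives a slope that differs from the hand calculation in the fourth decimal place, because the hand calculation divides by the rounded value 3.0106.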
To estimate $\sigma^2$, we need the residuals, which are denoted as $e_i = y_i - \hat y_i$.
To get the residuals $e_i = y_i - \hat y_i$, we first need the fitted values (or predicted values) denoted as $\hat y_i$...
$$\hat y_i = 0.4705 + 20.5675\, x_i$$
Above is the fitted model or fitted line.
Below, we add the residuals (RESI1) and fitted values (FITS1) to our data set...
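The fitted values and residuals can be computed directly from the data. The sketch below is a plain-Python stand-in, not the original Minitab worksheet; it borrows the Minitab labels FITS1 and RESI1 for the two new columns and refits the line at full precision, so values may differ slightly from ones based on the rounded coefficients.

```python
# Data from the chloride example (n = 18).
perc_roadways = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
                 0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
chlor_conc = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
              12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

n = len(perc_roadways)
x_bar = sum(perc_roadways) / n
y_bar = sum(chlor_conc) / n
sxx = sum((x - x_bar) ** 2 for x in perc_roadways)
sxy = sum((x - x_bar) * (y - y_bar)
          for x, y in zip(perc_roadways, chlor_conc))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Fitted value: point on the line at each x. Residual: observed minus fitted.
fits1 = [b0 + b1 * x for x in perc_roadways]
resi1 = [y - f for y, f in zip(chlor_conc, fits1)]

print("obs PercRoadways ChlorConc    FITS1    RESI1")
rows = zip(perc_roadways, chlor_conc, fits1, resi1)
for i, (x, y, f, e) in enumerate(rows, start=1):
    print(f"{i:3d} {x:12.2f} {y:9.1f} {f:8.3f} {e:8.3f}")
```

The residuals from a least squares fit with an intercept sum to zero (up to rounding), which is a quick sanity check on the table.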
$$\hat\sigma^2 = MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2} = \frac{220.9472}{16} = 13.8092$$
and $\hat\sigma = \sqrt{13.8092} = 3.7161$
The fitted model: $\hat y_i = 0.4705 + 20.5675\, x_i$
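Interpretation of regression coefficients:
For any SLR analysis, $\hat\beta_1$ is the estimated slope. It represents the expected change in $Y$ for a 1 unit change in $X$.
$$\hat\beta_1 = \frac{\text{rise}}{\text{run}} = \frac{\Delta Y}{\Delta X} = \frac{\hat\beta_1 \text{ units of } Y}{1 \text{ unit of } X}$$
[Figure: fitted line in the (x, y) plane; a run of 1 unit in x corresponds to a rise of $\hat\beta_1$ in y.]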
Interpretation of regression coefficients:
slope: $\hat\beta_1 = 20.5675 = \dfrac{20.5675}{1} = \dfrac{\text{rise}}{\text{run}} = \dfrac{\Delta Y}{\Delta X}$
A 1 percentage point increase in the amount of land in roadways is associated with an increase of 20.5675 mg/liter in the mean chloride concentration.
[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%) with the fitted line; a run of 1 corresponds to a rise of 20.5675.]
Interpretation of regression coefficients:
Intercept: $\hat\beta_0 = 0.4705$
When 0% of the watershed is in roadways, the expected chloride concentration is 0.4705 mg/liter (see how this relates to the hypothesis test for $\beta_0$ in the next slides).
[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%), with the x-axis extended to 0%.]
• When providing a regression COEFFICIENT INTERPRETATION, YOU MUST include the relevant units for $X$ and $Y$, and put it in the context of the problem.
MINITAB output from this example:
Regression Analysis: ChlorConc vs PercRoadways
Regression Equation
ChlorConc = 0.47 + 20.6 PercRoadways
Coefficients
Term Coef SE Coef T-Value P-Value
Constant 0.470 1.94 0.24 0.811
PercRoadways 20.567 2.14 9.60 0.000
Model Summary
S = 3.71607 R-sq = 85.22%
• Testing for a linear relationship between chloride concentration ($Y$) and % of watershed in roadways ($X$).
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
Slope estimate and standard error:
$$\hat\beta_1 = 20.567 \qquad se(\hat\beta_1) = \sqrt{\frac{13.8092}{3.0106}} = 2.1417$$
Test statistic:
$$t_0 = \frac{\hat\beta_1 - 0}{se(\hat\beta_1)} = \frac{20.567}{2.1417} = 9.603$$
Under $H_0$ true, $T_0 \sim t_{16}$
P-value: $2 \times P(T_0 > 9.603) = 4.81 \times 10^{-8}$ (very small)
Reject $H_0$.
There IS statistically significant evidence that the slope is not 0, so there is evidence of a linear relationship between chloride concentration and % of watershed in roadways.
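The slope test can be reproduced from the summary values above. The sketch below uses only the Python standard library, which has no t-distribution, so the two-sided p-value is obtained by numerically integrating the $t_{16}$ density with Simpson's rule. The helper names `t_pdf` and `two_sided_p_value` and the number of integration steps are assumptions of this sketch; a statistics package would normally supply the tail probability directly.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p_value(t0, df, steps=20000):
    """2 * P(T > |t0|), using Simpson's rule on [0, |t0|] (steps must be even)."""
    b = abs(t0)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_area = total * h / 3          # P(0 < T < |t0|)
    return 1.0 - 2.0 * central_area       # area in both tails

# Summary values from the chloride example.
n = 18
mse = 13.8092      # estimate of sigma^2
sxx = 3.0106
b1 = 20.567

se_b1 = math.sqrt(mse / sxx)
t0 = (b1 - 0) / se_b1
p_value = two_sided_p_value(t0, n - 2)

print(f"se(b1) = {se_b1:.4f}, t0 = {t0:.3f}, p-value = {p_value:.3g}")
```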
• Similarly, we can run a hypothesis test that the intercept equals 0...
$H_0: \beta_0 = 0$
$H_1: \beta_0 \neq 0$
Estimates:
$$\hat\beta_0 = 0.4705$$
$$se(\hat\beta_0) = \sqrt{13.8092 \left( \frac{1}{18} + \frac{0.8061^2}{3.0106} \right)} = 1.9358$$
Test statistic:
$$t_0 = \frac{\hat\beta_0 - 0}{se(\hat\beta_0)} = \frac{0.4705}{1.9358} = 0.2431$$
Under $H_0$ true, $T_0 \sim t_{16}$
P-value: $2 \times P(T_0 > 0.2431) = 0.8110$
Fail to reject $H_0$.
The intercept $\beta_0$ is not significantly different from zero: when there are no roadways in a watershed, there's no real evidence against the chloride concentration in the streams being zero.
We do not have evidence to suggest the intercept is anything other than zero. (So a chloride concentration of 0 mg/liter is a plausible value for a watershed with no roadways; failing to reject $H_0$ does not prove that it is exactly 0.)
[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%), with the x-axis extended to 0%.]
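The intercept test can also be checked with a rejection region instead of a p-value. This is a sketch of an equivalent approach, not the one used in the slides: it compares $|t_0|$ with the two-sided 5% critical value $t_{0.025,16} = 2.120$, taken from a standard t table. The 5% level is an assumption of this sketch.

```python
import math

# Summary values from the chloride example.
n = 18
mse = 13.8092      # estimate of sigma^2
x_bar = 0.8061
sxx = 3.0106
b0 = 0.4705

# Estimated standard error of the intercept and the test statistic.
se_b0 = math.sqrt(mse * (1 / n + x_bar ** 2 / sxx))
t0 = (b0 - 0) / se_b0

# Critical value t_{0.025, 16} from a t table (alpha = 0.05 is assumed here).
T_CRIT = 2.120
reject_h0 = abs(t0) > T_CRIT

print(f"se(b0) = {se_b0:.4f}, t0 = {t0:.4f}, reject H0: {reject_h0}")
```

Since $|t_0|$ is far below 2.120, the rejection-region approach agrees with the large p-value: fail to reject $H_0$.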
Adequacy of the regression model and Checking assumptions
• Is a linear model the correct model? (Is simple linear regression complex enough to capture the relationship between $X$ & $Y$?)
• Are the assumptions we're making for our model reasonable, or are they violated?
• To answer these questions, we will use the residuals of the model.
The residual for observation $i$:
$$e_i = y_i - \hat y_i$$
Residuals are informative
Consider the Price vs. Age of clock data:
[Figure: scatterplot of price sold at auction vs. age of clock (yrs), with a legend for the number of bidders.]
Use a plot of residuals vs. fitted values $\hat y$ (below) to check adequacy of model AND constant variance assumption.
• If this plot is a random scatter of points above and below the horizontal reference line, then the linear model is reasonable and adequate.
• If not (i.e. if there is a non-random pattern in the residual plot), then there may be issues with our linearity assumption or perhaps other assumptions in our model, and the model may not be adequate.
• Example showing inadequacy: Kentucky Derby data set on year of race and speed of horse.
The form of the scatterplot looks a bit non-linear, but we'll go ahead and fit a straight line model first to get the following residual plot...
Residual Plot of 'residuals vs. fitted values'
• Residuals have a bit of a pattern (e.g. below the line, above the line, below the line), not randomly scattered above and below the horizontal line.
• Linear form may not be reasonable or adequate.
⇒ Quadratic may fit better.
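The "below, above, below" pattern is what a straight-line fit produces when the true relationship curves. The sketch below illustrates this with synthetic data, not the Kentucky Derby data: the x-values 0 to 8 and the curve $y = 20x - x^2$ are made up for illustration. It fits a line by least squares and prints the sign of each residual in order of x.

```python
# Synthetic curved data (hypothetical, for illustration only).
xs = list(range(9))
ys = [20 * x - x * x for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residuals from the straight-line fit, in order of increasing x.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"fitted line: y = {b0:.3f} + {b1:.3f} x")
print("residual signs by x:", signs)
```

The residuals are negative at both ends and positive in the middle, a systematic pattern rather than a random scatter, which signals that a quadratic term may be needed.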
Beyond Adequacy
• Besides checking that our model fits the general (linear) relationship between $X$ and $Y$, we also need to consider the assumptions we made in our model.
• The basic model
$$Y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{linear relationship}} + \underbrace{\epsilon_i}_{\text{random error term}}$$
with $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
– Constant variance of errors (only one $\sigma^2$ for all errors)
– Normality of errors
– Independence of errors
Constant Variance Assumption
• We'll check this assumption by plotting the residuals vs. the fitted values (or vs. the explanatory variable in SLR).
• Look for a constant 'spread' above and below the horizontal reference line.
• NOTE: This same residual plot was also used to check linearity.
• Constant Variance and Adequacy are both checked with the same residual plot in SLR.
• Plot residuals vs. $\hat y$ (or in SLR, against $x$).
Normality Assumption
• Use a normal probability plot of residuals to check normality of errors (see section 6-6 for non-normal patterns like those below).
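The coordinates of a normal probability plot can be computed without plotting software. The sketch below, for the chloride example, pairs the sorted residuals with standard normal quantiles at the plotting positions $(j - 0.5)/n$. That choice of plotting position is an assumption of this sketch, since other conventions exist. If the errors are roughly normal, the printed pairs should fall near a straight line.

```python
from statistics import NormalDist

# Data from the chloride example (n = 18).
perc_roadways = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
                 0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
chlor_conc = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
              12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

n = len(perc_roadways)
x_bar = sum(perc_roadways) / n
y_bar = sum(chlor_conc) / n
sxx = sum((x - x_bar) ** 2 for x in perc_roadways)
sxy = sum((x - x_bar) * (y - y_bar)
          for x, y in zip(perc_roadways, chlor_conc))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Sorted residuals go on one axis of the plot.
sorted_residuals = sorted(y - (b0 + b1 * x)
                          for x, y in zip(perc_roadways, chlor_conc))

# Standard normal quantiles at positions (j - 0.5)/n go on the other axis.
std_normal = NormalDist()
normal_scores = [std_normal.inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]

print("normal score   residual")
for z, e in zip(normal_scores, sorted_residuals):
    print(f"{z:12.3f} {e:10.3f}")
```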
Independence Assumption
• Verify that the observations are independent.
• Check how the data was collected (talk to the researcher or client).
• If data was collected over time, plot residuals against time to make sure there isn't a dependence (or trend) across time.
• Predictions and Extrapolation
– We can use our fitted model to make predictions.
– e.g. What is the expected longevity in days of a fruitfly with a thorax of length 0.80 mm?
[Figure: scatterplot of fruitfly longevity (days) vs. thorax length (mm), with the fitted line $\hat Y = -61.05 + 144.33\,x$.]
Prediction:
$$\hat Y_{x=0.80} = -61.05 + 144.33(0.80) = 54.414 \text{ days}$$
– If we try to predict $Y$ outside of the range of observed x-values, we are using the model to extrapolate (predict outside the range of the observed data).
– You should be very careful when using extrapolation. In general it should be avoided, as we don't have a feel for what is going on outside the observed range.
– Predicting $Y$ for $x = 1.50$ mm (which is not a value near the observed x-values) would be an extrapolation in this fruitfly example.
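A prediction helper can guard against accidental extrapolation. In the sketch below the coefficients come from the fitted fruitfly line, but the observed thorax range of 0.65 to 0.95 mm is an assumption read off the plot's axis ticks, not the actual minimum and maximum of the data. The function names are hypothetical.

```python
# Fitted line from the fruitfly example.
B0 = -61.05
B1 = 144.33

# Assumed observed range of thorax length (mm), taken from the plot's axis
# ticks; replace with the true min and max of the data when available.
X_MIN = 0.65
X_MAX = 0.95

def predict_longevity(thorax_mm):
    """Predicted mean longevity (days) at a given thorax length (mm)."""
    return B0 + B1 * thorax_mm

def is_extrapolation(thorax_mm):
    """True when the x-value lies outside the assumed observed range."""
    return not (X_MIN <= thorax_mm <= X_MAX)

for x in (0.80, 1.50):
    flag = "extrapolation, use with caution" if is_extrapolation(x) else "within observed range"
    print(f"x = {x:.2f} mm: predicted longevity = {predict_longevity(x):.3f} days ({flag})")
```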