1
Chapter 10: Inference for Regression
http://jonfwilkins.blogspot.com/2011_08_01_archive.html
2
10.1: Simple Linear Regression
10.2: More Detail about Simple Linear Regression
Goals
• Describe the simple linear regression model (review – Ch. 2).
• Be able to perform the method (with the output from software packages, Lab 8).
• Use diagnostic plots to check the assumptions.
• Be able to perform inference on the slope (confidence interval and hypothesis test).
• Be able to determine if there is an association between the response and explanatory variables.
• Be able to perform a hypothesis test using the correlation coefficient.
• Be able to state the similarities and differences between a confidence interval for a mean response and a prediction interval, and in which situations each would be used (if there is time).
3
Conditions for Linear Regression
• We have n (x, y) pairs.
• For any fixed x, y ~ N(μy, σ).
• Each yi is independent of the other yj’s.
• μy = β0 + β1x
4
Model for Linear Regression
yi = β0 + β1xi + εi
Data = Fit + Error
5
Linear Regression
ŷ = b0 + b1x
• ŷ is an unbiased estimator for μy
• b0 is an unbiased estimator for β0
• b1 is an unbiased estimator for β1
6
Linear Regression
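$$ b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}} = r \frac{s_y}{s_x} $$
$$ b_0 = \bar{y} - b_1 \bar{x} $$
$$ e_i = y_i - \hat{y}_i $$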
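The slope, intercept, and residual formulas can be sketched in a few lines of Python. This is only an illustration: the function name `least_squares` and the toy data are assumptions made for the example and are not part of the slides.

```python
def least_squares(x, y):
    """Return (b0, b1, residuals) for the least-squares line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # SSxy and SSxx, the numerator and denominator of the slope formula
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    # Residual e_i = observed y_i minus fitted y-hat_i
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return b0, b1, residuals


# Hypothetical toy data, chosen only to exercise the formulas
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, residuals = least_squares(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

With an intercept in the model, the least-squares residuals sum to zero (up to floating-point error), which is a quick sanity check on any fit.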
7
Other SS and df
• Total
  dft = n - 1
• Model
  dfm = 1
8
ANOVA table for Linear Regression
Source              df      SS             MS
Model (Regression)  1       Σ(ŷi - ȳ)²     SSM/dfm
Error               n - 2   Σ(yi - ŷi)²    SSE/dfe
Total               n - 1   Σ(yi - ȳ)²     SST/dft
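As a rough sketch of how the entries of this table are computed, the following Python function builds each sum of squares and mean square from raw data. The function name `anova_slr` and the toy data are hypothetical and serve only as an illustration.

```python
def anova_slr(x, y):
    """Sums of squares, degrees of freedom, and mean squares for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    ssm = sum((yh - y_bar) ** 2 for yh in y_hat)            # Model: fitted vs. mean
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # Error: observed vs. fitted
    sst = sum((yi - y_bar) ** 2 for yi in y)                # Total: observed vs. mean
    dfm, dfe, dft = 1, n - 2, n - 1
    return {
        "SSM": ssm, "SSE": sse, "SST": sst,
        "dfm": dfm, "dfe": dfe, "dft": dft,
        "MSM": ssm / dfm, "MSE": sse / dfe, "MST": sst / dft,
    }


# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

table = anova_slr(x, y)
for name in ("SSM", "SSE", "SST", "MSM", "MSE"):
    print(f"{name} = {table[name]:.4f}")
```

The sums of squares and the degrees of freedom both add up: SST = SSM + SSE and dft = dfm + dfe.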
9
Conditions for Linear Regression
• SRS
• Observations are independent of each other.
• The relationship is linear in the population.
• The response, y, is normally distributed around the population regression line.
• The standard deviation of the response is constant.
• Important plots:
  – Scatter plot
  – Residual plot
  – Histogram/Normal quantile plot of the residuals.
10
Residual Plots
11
Example: Linear Regression 1
The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine. Determination of this number for a biodiesel fuel is expensive and time-consuming. Therefore a way of predicting this number is wanted. The data on the next slide are x = iodine value (g) and y = cetane number for a sample of 14 biofuels. The iodine value is the amount of iodine necessary to saturate a sample of 100 g of oil.
a) Verify the assumptions required for linear regression.
b) Determine the equation of the fitted line.
c) What is a point estimate of the true average cetane number for biofuels whose iodine value is 100?
d) Estimate the value of σ.
e) What proportion of the observed variation in cetane number can be attributed to the iodine value?
12
Example: Linear Regression 1 (cont.)
x: 132.0 129.0 120.0 113.2 105.0  92.0  84.0
y:  46.0  48.0  51.0  52.1  54.0  52.0  59.0
x:  83.2  88.4  59.0  80.0  81.5  71.0  69.2
y:  58.7  61.6  64.0  61.4  54.6  58.8  58.0
13
Example: SLR 1 - Scatterplot
14
Example: SLR 1 – Residual Plot
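The residuals shown in a residual plot can be recomputed from the 14 data pairs of this example. The sketch below uses only the Python standard library and prints a text listing instead of drawing the plot; the variable names are illustrative assumptions.

```python
# Data from the cetane example: x = iodine value, y = cetane number
iodine = [132.0, 129.0, 120.0, 113.2, 105.0, 92.0, 84.0,
          83.2, 88.4, 59.0, 80.0, 81.5, 71.0, 69.2]
cetane = [46.0, 48.0, 51.0, 52.1, 54.0, 52.0, 59.0,
          58.7, 61.6, 64.0, 61.4, 54.6, 58.8, 58.0]

n = len(iodine)
x_bar = sum(iodine) / n
y_bar = sum(cetane) / n

# Least-squares slope and intercept
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(iodine, cetane))
      / sum((x - x_bar) ** 2 for x in iodine))
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in iodine]
residuals = [y - yh for y, yh in zip(cetane, fitted)]

# Text version of the residual plot: look for curvature or a funnel shape
print(f"{'x':>7} {'fitted':>8} {'residual':>9}")
for x, yh, e in sorted(zip(iodine, fitted, residuals)):
    print(f"{x:7.1f} {yh:8.2f} {e:9.2f}")
```

Residuals that scatter around zero with no pattern and roughly constant spread across x support the linearity and constant standard deviation conditions.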
15
Example: SLR 1 – Normality
17
Example: SLR 1 – Fitted Line
r = -0.88925, sx = 22.8755, sy = 5.3864, ȳ = 55.657, x̄ = 93.393
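From these summary statistics the fitted line follows directly from b1 = r·sy/sx and b0 = ȳ − b1·x̄. A minimal sketch of the arithmetic (variable names are illustrative):

```python
# Summary statistics quoted on the slide
r = -0.88925
s_x = 22.8755
s_y = 5.3864
y_bar = 55.657
x_bar = 93.393

b1 = r * s_y / s_x          # slope
b0 = y_bar - b1 * x_bar     # intercept

# Point estimate of the mean cetane number at iodine value 100
y_hat_100 = b0 + b1 * 100

print(f"fitted line: y-hat = {b0:.3f} + ({b1:.5f}) x")
print(f"estimate at x = 100: {y_hat_100:.3f}")
```

The slope and intercept agree with the software output shown later (-0.20939 and 75.21243) up to rounding. Later slides quote ŷ = 54.313 at x = 100; that value appears to come from rounding the slope to -0.209 before predicting, so a small difference from the full-precision estimate is expected.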
18
Example: SLR – fitted line
21
Example: SLR - s
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        298.25443     298.25443     45.35   <.0001
Error             12         78.91986       6.57665
Corrected Total   13        377.17429
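The estimate of σ is s = √MSE, where MSE = SSE/dfe is read from the Error row of the table. A short sketch of that calculation:

```python
import math

sse = 78.91986     # Sum of Squares, Error row
df_error = 12      # DF, Error row (n - 2 with n = 14)

mse = sse / df_error   # Mean Square Error
s = math.sqrt(mse)     # estimate of sigma

print(f"MSE = {mse:.5f}")
print(f"s = {s:.4f}")
```

The computed MSE should match the Mean Square entry in the Error row (6.57665), and s comes out to roughly 2.56 cetane units.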
23
Example: SLR 1
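The proportion of the observed variation in cetane number attributed to the iodine value is the coefficient of determination, the ratio of the Model sum of squares to the Corrected Total sum of squares in the ANOVA table. As a worked calculation with the table values:

```latex
r^2 = \frac{SSM}{SST} = \frac{298.25443}{377.17429} \approx 0.7908
```

This agrees with squaring the sample correlation, (-0.88925)² ≈ 0.7908, so about 79% of the variation in cetane number is explained by the linear relationship with iodine value.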
24
Confidence Interval
• Point estimates
  – b0 is an unbiased estimator for β0
  – b1 is an unbiased estimator for β1
• Assumptions
  – SRS
  – Linearity
  – Constant standard deviation of residuals
  – Normality
    • If y is normal, then both b0 and b1 are normal
    • If y is not normal, there is still the CLT
25
Standard deviation for b1
(Bonus on HW)
26
Confidence Interval for β1
$$ b_1 \pm t^{*}_{\text{column}}(df)\, SE_{b_1} = b_1 \pm t^{*}_{\text{column}}(df) \sqrt{\frac{MSE}{S_{xx}}} $$
27
Example: SLR 1 - Inference
e) What is the 95% confidence interval for the population slope?
f) Is the model useful (that is, is there a useful linear relationship between iodine value and cetane number)?
28
Example: SLR 1
b1 = -0.209, Sxx = 6802.77
29
Example: SLR 1 – CI
We are 95% confident that the population slope is between -0.277 and -0.141.
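The interval can be reproduced from b1, MSE, and Sxx using b1 ± t*·√(MSE/Sxx). In the sketch below the critical value t* = 2.179 is an assumption taken from a standard t table (95% confidence, df = n − 2 = 12); it is not printed on the slides.

```python
import math

b1 = -0.209       # slope, as rounded on the slide
mse = 6.57665     # Mean Square Error from the ANOVA table
s_xx = 6802.77    # Sxx = sum of (x_i - x-bar)^2
t_star = 2.179    # assumed t-table value: 95% confidence, df = 12

se_b1 = math.sqrt(mse / s_xx)   # standard error of the slope
margin = t_star * se_b1

lower = b1 - margin
upper = b1 + margin
print(f"SE(b1) = {se_b1:.5f}")
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

Because the whole interval lies below zero, it already suggests a negative linear relationship between iodine value and cetane number.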
30
Example: SLR – fitted line
31
LR Hypothesis Test: Summary
Null hypothesis: H0: β1 = Δ
Test statistic: t = (b1 - Δ) / SEb1, with df = n - 2
Note: A two-sided test with Δ = 0 is called a model utility test
Alternative Hypothesis       P-Value
Upper-tailed  Ha: β1 > Δ     P(T ≥ t)
Lower-tailed  Ha: β1 < Δ     P(T ≤ t)
Two-sided     Ha: β1 ≠ Δ     2P(T ≥ |t|)
33
Example: SLR 1 - HT
The data provide strong support (P = 2.13 × 10⁻⁵) for the claim that there is a linear relationship between iodine value and cetane number.
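For the model utility test (Δ = 0) the test statistic is t = b1/SEb1 with df = n − 2 = 12. The sketch below computes it from the software estimates (b1 = -0.20939, SE = 0.03109) and approximates the two-sided P-value by numerically integrating the t density with Simpson's rule. The helper functions are illustrative stand-ins for a statistics package, not a method taken from the slides.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0: 0.5 minus Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


b1 = -0.20939     # estimated slope
se_b1 = 0.03109   # standard error of the slope
df = 12           # n - 2

t_stat = (b1 - 0) / se_b1
p_value = 2 * t_upper_tail(abs(t_stat), df)

print(f"t = {t_stat:.3f}")
print(f"two-sided P-value = {p_value:.2e}")
```

The P-value should land near 2 × 10⁻⁵, in line with the value quoted in the conclusion; small differences come from rounding of the inputs.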
35
LR Hypothesis Test: Summary
Null hypothesis: H0: there is no association between x and y
Test statistic: F = MSM / MSE
P-value: P = P(F > Ftest), dfn = dfm, dfd = dfe
36
Example: LR - Inference
g) Perform the hypothesis test using the F test statistic.
h) Perform the hypothesis test using the population correlation coefficient.
37
Example: LR – Inference - ANOVA
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             75.21243          2.98363     25.21     <.0001
iodine       1             -0.20939          0.03109     -6.73     <.0001
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        298.25443     298.25443     45.35   <.0001
Error             12         78.91986       6.57665
Corrected Total   13        377.17429
38
Example: LR – Inference (cont)
The data provide strong support (P = 2.09 × 10⁻⁵) for the claim that there is a linear relationship between iodine value and cetane number.
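The F statistic is the ratio of the two mean squares in the ANOVA table. With a single explanatory variable it equals the square of the slope's t statistic, which is why the F test and the two-sided t test give essentially the same P-value. A small check with the table values (variable names are illustrative):

```python
msm = 298.25443   # Mean Square, Model row
mse = 6.57665     # Mean Square, Error row

f_stat = msm / mse   # F test statistic with (1, 12) degrees of freedom

# Slope t statistic from the Parameter Estimates table
b1 = -0.20939
se_b1 = 0.03109
t_stat = b1 / se_b1

print(f"F = {f_stat:.2f}")
print(f"t^2 = {t_stat ** 2:.2f}")
```

The two numbers agree up to the rounding of the tabulated estimate and standard error.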
39
Inference for Correlation: Assumptions
• The (x, y) pairs are independent of one another
• (x, y) is normal
• Linear relationship between x and y
• Constant variance for the residuals
40
LR Hypothesis Test: Summary
Null hypothesis: H0: ρ = 0
Test statistic: t = r√(n - 2) / √(1 - r²), with df = n - 2
Alternative Hypothesis      P-Value
Upper-tailed  Ha: ρ > 0     P(T ≥ t)
Lower-tailed  Ha: ρ < 0     P(T ≤ t)
Two-sided     Ha: ρ ≠ 0     2P(T ≥ |t|)
43
Example: LR - Inference
The data provide strong support (P = 2.12 × 10⁻⁵) for the claim that there is a linear relationship between iodine value and cetane number.
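The test statistic for H0: ρ = 0 is t = r√(n − 2)/√(1 − r²) with df = n − 2. A short sketch with the sample correlation from this example shows that it matches the slope's t statistic, which is why the conclusion is the same as for the slope test:

```python
import math

r = -0.88925   # sample correlation from the example
n = 14         # number of (x, y) pairs

t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
df = n - 2

print(f"t = {t_stat:.3f} with df = {df}")
```

The result is about -6.73, the same value reported for the iodine slope in the Parameter Estimates table.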
44
SE of μ̂h
$$ SE_{\hat{\mu}_h} = \sqrt{MSE \left[ \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right]} $$
45
Example: LR - Inference
i) What is the 95% confidence interval for the mean cetane number when the iodine value is 100?
j) Predict the cetane number for the next sample of biofuel that has an iodine value of 100.
46
Example: LR – Inference
ŷ = 54.313, Sxx = 6802.77, x̄ = 93.393
47
Example: SLR (cont)
We are 95% confident that the population mean cetane number is between 52.754 and 55.872 when the iodine value is 100.
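This interval can be checked with the standard error of the mean response, √(MSE[1/n + (xh − x̄)²/Sxx]). In the sketch below the critical value t* = 2.179 is an assumed t-table lookup (95% confidence, df = 12); the other inputs are the values quoted on the slides.

```python
import math

n = 14
mse = 6.57665     # Mean Square Error from the ANOVA table
s_xx = 6802.77    # Sxx
x_bar = 93.393    # mean iodine value
x_h = 100         # iodine value of interest
y_hat = 54.313    # point estimate at x_h, as quoted on the slide
t_star = 2.179    # assumed t-table value: 95% confidence, df = 12

# Standard error of the estimated mean response at x_h
se_mean = math.sqrt(mse * (1 / n + (x_h - x_bar) ** 2 / s_xx))
margin = t_star * se_mean

ci_lower = y_hat - margin
ci_upper = y_hat + margin
print(f"SE of mean response = {se_mean:.4f}")
print(f"95% CI for the mean response: ({ci_lower:.3f}, {ci_upper:.3f})")
```

The standard error grows as xh moves away from x̄, so the interval is narrowest at the mean iodine value.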
48
SEŷ
Variance components of the predicted value
1) Variance associated with the mean response
2) Variance associated with the observation
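Written out, the two components add on the variance scale. The following is the standard textbook form, stated in the same notation as the standard error of the mean response; it is included as a reference formula rather than copied from the slide.

```latex
SE_{\hat{y}}^{2}
  = \underbrace{MSE\left[\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]}_{\text{mean response}}
  + \underbrace{MSE}_{\text{observation}}
\qquad\Longrightarrow\qquad
SE_{\hat{y}} = \sqrt{MSE\left[1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]}
```

The extra MSE term does not shrink as n grows, so a prediction interval stays wider than the confidence interval for the mean response at the same x.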
51
Example: SLR (cont)
We are 95% confident that the next cetane number is between 48.512 and 60.114 when the iodine value is 100.
Mean response: (52.754, 55.872)
Prediction interval: (48.512, 60.114)
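The prediction interval uses the same ingredients as the interval for the mean response, with an additional MSE term for the variability of a single new observation. A sketch that computes both side by side (t* = 2.179 is an assumed t-table value for 95% confidence and df = 12; the other inputs are from the slides):

```python
import math

n = 14
mse = 6.57665
s_xx = 6802.77
x_bar = 93.393
x_h = 100
y_hat = 54.313    # point estimate at x_h, as quoted on the slide
t_star = 2.179    # assumed t-table value: 95% confidence, df = 12

leverage = 1 / n + (x_h - x_bar) ** 2 / s_xx

se_mean = math.sqrt(mse * leverage)         # mean response
se_pred = math.sqrt(mse * (1 + leverage))   # single new observation

ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
pi = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)

print(f"mean response:       ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"prediction interval: ({pi[0]:.3f}, {pi[1]:.3f})")
```

Both intervals are centered at the same ŷ; the prediction interval is wider because it must cover an individual biofuel sample rather than the average of all samples with iodine value 100.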
52
CI for mean response: ŷ ± t*(df) SEμ̂h
Prediction interval: ŷ ± t*(df) SEŷ
53
Example: Confidence/Prediction Band
54
Multiple Regression: Examples 1
1) A portrait studio operates in cities of medium size and specializes in portraits of children. They want to open a store in another similar community, but want to be able to predict sales.
2) So that only students who will succeed are accepted into college, the registrar’s office wants to be able to predict the GPA of entering high school students.
3) A researcher studied the effects of the charge rate and the temperature on the life of a new type of power cell in a preliminary small-scale experiment.
55
Multiple Regression: Examples 2
4) An experiment was run to investigate the yield of tomato plants as a function of water level. A series of plots were randomized to different water levels and, at the end of the season, the yield of the plants was determined.
5) Fernandez-Juricic et al. (2003) examined the effect of human disturbance on the nesting of house sparrows (Passer domesticus). They counted breeding sparrows per hectare in 18 parks in Madrid, Spain, and also counted the number of people per minute walking through each park (both measurement variables).