Date post: | 04-Jan-2016 |
Upload: | cedric-stout |
Chapter 16
Understanding Relationships – Numerical
Data Part 2
Created by Kathy Fritz
The Simple Linear Regression Model
You might convert x = temperature in degrees centigrade to y = temperature in degrees Fahrenheit using
$y = 32 + \frac{9}{5}x$
[Figure: line graph of temperature in Fahrenheit (y) against temperature in centigrade (x)]
This is a deterministic relationship. The value of the independent variable (centigrade temperature) is all that is needed to determine the value of the dependent variable (Fahrenheit temperature).
Suppose you want to convert 20˚C into Fahrenheit.
20˚C = 68˚F
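As a worked check, using the standard centigrade-to-Fahrenheit conversion (32 plus nine-fifths of the centigrade value), the arithmetic is:

```latex
y = 32 + \tfrac{9}{5}(20) = 32 + 36 = 68
```

The same x always gives the same y, which is what makes the relationship deterministic.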
The equation for a probabilistic model is:
y = (deterministic function of x) + e
where e is an “error” variable, that is, a random deviation.
Now suppose we were to investigate the relationship between y = the first-year college grade point average and x = high school grade point average.
Is the first-year college grade point average determined solely by the high school grade point average? Explain.
The first-year college grade point average and the high school grade
point average do NOT have a deterministic relationship.
A description of the relationship between two variables that are not deterministically related
can be given by a probabilistic model.
The simple linear regression model assumes that there is a line with y-intercept $\alpha$ and slope $\beta$, called the population regression line.
When a value of the independent variable x is fixed and an observation on the dependent variable y is made,
$y = \alpha + \beta x + e$
[Figure: the population regression line (y-intercept $\alpha$, slope $\beta$), with the random deviations e1 and e2 of the observations at x1 and x2 from the line]
Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population
regression line.
Basic Assumptions of the Simple Linear Regression Model
1. The distribution of e at any particular value of x is normal.
Before you actually observe a value of y for any particular value of x, you are uncertain about the value of e (random deviation from the regression line). It could be positive, negative, or even 0.
The linear regression model makes some assumptions about the distribution of e at any
particular x value in the population.
2. The distribution of e at any particular x value has mean value 0. That is, $\mu_e = 0$.
Because the values of e can be negative or positive, the positive and negative deviations at any particular x value balance out on average. Thus, $\mu_e = 0$.
3. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by $\sigma_e$.
4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.
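The four assumptions can be illustrated with a small simulation. This is only a sketch: the population values `ALPHA`, `BETA`, and `SIGMA_E` are hypothetical numbers chosen for illustration, and `simulate_y` is a helper name introduced here, not something from the slides.

```python
import random
import statistics

def simulate_y(x, alpha, beta, sigma_e, rng):
    """Draw one y from y = alpha + beta*x + e, with e ~ Normal(0, sigma_e)."""
    e = rng.gauss(0.0, sigma_e)  # random deviation from the population line
    return alpha + beta * x + e

# Hypothetical population values, chosen only for illustration.
ALPHA, BETA, SIGMA_E = 10.0, 2.0, 3.0
rng = random.Random(1)  # fixed seed so the run is repeatable

x_fixed = 5.0
# Independent draws of y at one fixed x value.
ys = [simulate_y(x_fixed, ALPHA, BETA, SIGMA_E, rng) for _ in range(20000)]
deviations = [y - (ALPHA + BETA * x_fixed) for y in ys]

mean_dev = statistics.fmean(deviations)  # should be close to 0
sd_dev = statistics.stdev(deviations)    # should be close to SIGMA_E
mean_y = statistics.fmean(ys)            # should be close to ALPHA + BETA * x_fixed
print(round(mean_dev, 2), round(sd_dev, 2), round(mean_y, 2))
```

Changing `x_fixed` shifts the mean of the simulated y values along the line but leaves the spread of the deviations unchanged, which is the constant standard deviation assumption.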
Just as there is variability in the values of e at any particular value of x, there is also variability in the y values.
The mean of the y values at a fixed value x* is
mean y value = $\alpha + \beta x^*$
The standard deviation of y for any fixed value x* is also $\sigma_e$.
[Figure: normal curves for y at x1, x2, and x3, centered on the population regression line at $\alpha + \beta x_1$, $\alpha + \beta x_2$, and $\alpha + \beta x_3$; $\sigma_e$ is the same for any particular x value]
The population regression line passes through the means of the y values.
Thus the slope $\beta$ is the mean or expected change in y associated with a 1 unit increase in x.
Another look at $\sigma_e$
The smaller $\sigma_e$, the closer the points are to the regression line.
The larger $\sigma_e$, the farther the points are from the regression line.
The estimates of the slope and the y intercept of the population regression line are the slope and y intercept, respectively, of the least squares line, $\hat{y} = a + bx$.
The values of a and b are usually obtained using statistical software or a graphing calculator.
Let x* denote a specified value of the independent variable x. Then a + bx* has two different interpretations:
1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be observed when x = x*.
Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females.
Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother’s age for babies born to young mothers. The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).
x    15    17    18    15    16    19    17    16    18    19
y  2289  3393  3271  2648  2897  3327  2970  2535  3138  3573
Sketch a scatterplot of these data.
[Figure: scatterplot of Baby’s Weight (g) versus Mother’s Age (yrs)]
The scatterplot shows a linear pattern and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.
Birth Weight Continued . . .
The least squares line for these data is
$\hat{y} = -1163.45 + 245.15x$
[Figure: scatterplot of Baby’s Weight (g) versus Mother’s Age (yrs) with the least squares line]
On average, birth weight increases by approximately 245.15 grams for each 1-year increase in the mother’s age.
What is the point estimate for the mean weight of babies born to 18-year-old mothers?
$\hat{y} = -1163.45 + 245.15(18) = 3249.25$ grams
This is the point estimate for the mean weight of all babies born to 18-year-old mothers.
This is also the prediction of the weight of a single baby born to a mother 18 years of age.
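The slides leave the fitting to software, but the estimates can be reproduced by hand. The sketch below assumes the usual least squares formulas, b = Sxy / Sxx and a = ȳ - b·x̄, which are not stated on the slides; `least_squares` is a helper name introduced here for illustration.

```python
# Birth weight data from the example: maternal age (years), birth weight (grams).
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

def least_squares(x, y):
    """Return (a, b), the intercept and slope of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope estimate
    a = y_bar - b * x_bar    # intercept estimate
    return a, b

a, b = least_squares(ages, weights)
y_hat_18 = a + b * 18  # point estimate / prediction at x* = 18
print(round(a, 2), round(b, 2), round(y_hat_18, 2))  # -1163.45 245.15 3249.25
```

The printed values match the fitted line and the 18-year-old prediction in the example.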
Beware of the danger of extrapolation. That is, be careful when trying to make an
estimate or prediction for any x value much outside the range of the observed x values
in the data.
The statistic for estimating the variance $\sigma_e^2$ is
$s_e^2 = \dfrac{\text{SSResid}}{n - 2}$
where
$\text{SSResid} = \sum (y - \hat{y})^2$
The subscript “e” is a reminder that you are estimating the variance of the “errors” or residuals.
The estimate of $\sigma_e$ is the estimated standard deviation
$s_e = \sqrt{s_e^2}$
Note that the degrees of freedom associated with estimating $\sigma_e^2$ or $\sigma_e$ in simple linear regression is
df = n - 2
The value of $s_e$, the estimated standard deviation about the population regression line, is interpreted as the typical amount by which an observation deviates from the population regression line.
How do we know if the estimated regression equation will be a useful model for predicting y values from x?
Recall, the coefficient of determination, $r^2$, is the proportion of variability in y that can be explained by the approximate linear relationship between x and y.
The residual plot and the values of $s_e$ and $r^2$ can be used to determine the estimated regression equation’s usefulness.
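As an illustration of these two summary measures, the sketch below computes SSResid, $s_e$, and $r^2$ for the birth weight data used earlier in the chapter. It takes the reported least squares estimates as given and assumes the usual identity $r^2 = 1 - \text{SSResid}/\text{SSTotal}$ from the earlier treatment of the coefficient of determination; the variable names are illustrative.

```python
import math

# Birth weight data: maternal age (years), birth weight (grams).
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]
A, B = -1163.45, 245.15  # least squares estimates reported in the example

n = len(ages)
residuals = [y - (A + B * x) for x, y in zip(ages, weights)]
ss_resid = sum(r ** 2 for r in residuals)  # SSResid = sum of (y - y_hat)^2
s_e = math.sqrt(ss_resid / (n - 2))        # estimated standard deviation, df = n - 2

y_bar = sum(weights) / n
ss_total = sum((y - y_bar) ** 2 for y in weights)
r_sq = 1 - ss_resid / ss_total             # coefficient of determination

print(round(ss_resid, 2), round(s_e, 1), round(r_sq, 3))
```

A typical deviation of roughly 205 grams, against birth weights of 2300 to 3600 grams, together with an $r^2$ near 0.78, suggests the line is reasonably useful for these data.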
Wildlife biologists monitor the ecological health of the Rocky Mountain elk. Measuring elk weights directly is difficult and expensive in equipment, manpower, and time.
Biologists found that they could reliably estimate the weight of an elk by measuring the chest girth and then using linear regression to estimate the weight. They measured the chest girth and weight of 19 Rocky Mountain elk.
There appears to be a strong positive linear relationship between the chest girth and weight of elk.
Elk Weight Problem Continued . . .
Partial Minitab regression output is shown below.

The regression equation is
Weight = -136 + 2.81 Girth

Predictor   Coef      SE Coef   T       P
Constant    -135.51   35.75     -3.79   0.001
Girth       2.8063    0.2686    10.45   0.000

S = 23.6626   R-Sq = 86.5%   R-Sq(adj) = 85.7%

Weight = -136 + 2.81 Girth is the estimated regression equation.
Approximately 86.5% of the observed variation in elk weight can be attributed to the linear relationship between weight and chest girth.
The magnitude of a typical deviation from the least-squares line is about 23.66 kg, which is relatively small in comparison to the y values (shown in the scatterplot).
Inferences Concerning the Slope of the Population Regression Line
Properties of the Sampling Distribution of b
Since $\beta$ is almost always unknown, it must be estimated from independently selected observations. The slope b of the least-squares line gives a point estimate for $\beta$.
When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:
1. The mean value of b is $\beta$. That is, $\mu_b = \beta$, so the sampling distribution of b is centered at the value of $\beta$.
2. The standard deviation of the statistic b is
$\sigma_b = \dfrac{\sigma_e}{\sqrt{S_{xx}}}$, where $S_{xx} = \sum (x - \bar{x})^2$
3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).
Since $\sigma_e$ (and therefore $\sigma_b$) is usually unknown, the estimated standard deviation of the statistic b is
$s_b = \dfrac{s_e}{\sqrt{S_{xx}}}$
When the four basic assumptions of the simple linear model are satisfied, the probability distribution of the standardized variable
$t = \dfrac{b - \beta}{s_b}$
is the t distribution with df = (n - 2).
Confidence Interval for $\beta$
When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for $\beta$, the slope of the population regression line, has the form
$b \pm (t \text{ critical value}) \cdot s_b$
where the t critical value is based on df = n – 2.
The dedicated work of conservationists for over 100 years has brought the bison in Yellowstone National Park from near extinction to a herd of over 3000 animals. It is important to monitor and manage the size of the bison population.
Researchers have studied a number of environmental factors to better understand the relationship between bison reproduction and the environment. One factor thought to influence reproduction is stress due to accumulated snow, which makes foraging more difficult for the pregnant bison.
Data from 1981-1997 on y = spring calf ratio (SCR) and x = previous fall snow-water equivalent (SWE) are shown on page 750. The researchers were interested in estimating the mean change in spring calf ratio associated with each additional cm in snow-water equivalent.
Bison Population Problem Continued . . .
Step 1 (Estimate):
The value of $\beta$, the mean change in spring calf ratio for each additional 1 cm of snow-water equivalent, will be estimated.
Step 2 (Method):
Because the answers to the four key questions are estimation, sample data, two numerical variables in a regression setting, and one sample, a confidence interval for $\beta$, the slope of the population regression line, will be considered. A 95% confidence level will be used.
Step 3 (Check):
• You will need to assume that these 17 years are representative of yearly circumstances at Yellowstone and that each year’s reproduction and snowfall is independent of previous years.
• A scatterplot of the data looks linear and the spread does not seem different for different values of x.
• Because the boxplot of the residuals is approximately symmetrical and there are no outliers, it is reasonable to think that the distribution of e is approximately normal.
Step 4 (Calculate):
JMP regression output is shown here:

Linear Fit
SCR = 0.2606561 – 0.0136639*SWE

Summary of Fit
RSquare                   0.257644
RSquare Adj               0.208153
Root Mean Square Error    0.033513
Mean of Response          0.209412
Observations              17

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   0.260656    0.023885    10.91     <.0001*
SWE         -0.013664   0.005989    -2.28     0.0375*

In the SWE row, Estimate is the slope b and Std Error is $s_b$.
df = 17 – 2 = 15
The t critical value for a 95% confidence level and df = 15 is 2.13.
$b \pm (t \text{ critical value}) \cdot s_b = -0.0137 \pm (2.13)(0.005989) = (-0.0265, -0.0009)$
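The interval arithmetic in Step 4 can be checked with a few lines of code. This sketch takes the slope, its standard error, and the t critical value of 2.13 directly from the slide; `slope_confidence_interval` is a helper name introduced here for illustration.

```python
def slope_confidence_interval(b, s_b, t_crit):
    """Return (lower, upper) for the interval b ± (t critical value) * s_b."""
    margin = t_crit * s_b
    return b - margin, b + margin

# Values as used on the slide: slope rounded to -0.0137, s_b from the JMP output,
# and the t critical value for 95% confidence with df = 15.
lower, upper = slope_confidence_interval(b=-0.0137, s_b=0.005989, t_crit=2.13)
print(round(lower, 4), round(upper, 4))  # -0.0265 -0.0009
```

Both endpoints are negative, so the interval does not include 0.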
Step 5 (Communicate Results):
Confidence Interval: You can be 95% confident that the true average change in spring calf ratio associated with an increase of 1 cm in the snow-water equivalent is between -0.0265 and -0.0009.
Confidence level: The method used to construct this interval estimate is successful in capturing the actual value of the slope of the population regression line about 95% of the time.
Summary of Hypothesis Tests Concerning $\beta$
Appropriate when the four basic assumptions of the simple regression model are reasonable:
1. The distribution of e at any particular x value has a mean of 0 ($\mu_e = 0$).
2. The standard deviation of e is $\sigma_e$, which does not depend on x.
3. The distribution of e at any particular x value is normal.
4. The random deviations e1, e2, …, en associated with different observations are independent of one another.
Form of the null hypothesis: $H_0: \beta = \beta_0$
When these conditions are met, the following test statistic can be used:
$t = \dfrac{b - \beta_0}{s_b}$
where $\beta_0$ is the hypothesized value from the null hypothesis.
When the assumptions of the simple linear model are reasonable and the null hypothesis is true, the t test statistic has a t distribution with df = n – 2.
Associated P-Value:
When the alternative hypothesis is . . .   The P-value is . . .
$H_a: \beta > \beta_0$                     area to the right of t under the appropriate t curve
$H_a: \beta < \beta_0$                     area to the left of t under the appropriate t curve
$H_a: \beta \neq \beta_0$                  2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative
Inference for a population slope generally focuses on two questions:
(1) What are plausible values for the population slope? This question can be addressed by calculating a confidence interval.
(2) Is the population slope different from zero? This question can be answered by using the hypothesis testing procedure with a null hypothesis $H_0: \beta = 0$.
When the null hypothesis $H_0: \beta = 0$ is true, the population regression line is a horizontal line. If $\beta$ is in fact equal to 0, knowledge of x will be of no use: it will have no “utility” for predicting y.
This test of $H_0: \beta = 0$ versus $H_a: \beta \neq 0$ is called the model utility test for simple linear regression.
The Model Utility Test for Simple Linear Regression
The model utility test for simple linear regression is the test of
$H_0: \beta = 0$ versus $H_a: \beta \neq 0$
The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y.
The test statistic is the t ratio:
$t = \dfrac{b}{s_b}$
If $H_0$ is rejected, you can conclude that the simple linear regression model is useful for predicting y.
When you hear a song on your car radio, you probably remember the title of the song, the artist, and even when the song was released. An investigator wants to study this phenomenon. He compiled a list of songs from Rolling Stone, Billboard, and Blender lists of songs plus some recent songs familiar to college students.
Twenty-three college students were then exposed to 56 clips of songs. Most of these students had had musical training, and they listened to popular music for an average of 21.7 hours per week.
After hearing three short clips from a song (only 400 ms in duration), the students were asked in what year each of the songs was released.
The accompanying data show the actual release year and the average of the release years given by the students. Is there a relationship between the judged and actual release year for these songs?
Let’s perform a model utility test to answer this question.
Song Recognition Problem Continued . . .
Step 1 (Hypotheses):
$H_0: \beta = 0$
$H_a: \beta \neq 0$
where $\beta$ is the slope of the population regression line relating judged release year to actual release year.
Step 2 (Method):
Because the answers to the four key questions are hypothesis testing, sample data, two numerical variables in a regression setting, and one sample, a hypothesis test for the slope of a population regression line will be considered. A significance level of 0.05 will be used.
Step 3 (Check):
For this example you can assume that the assumptions are reasonable and proceed with the model utility test. (We will see how to check if the four assumptions of the simple linear regression model are reasonable in the next section.)
Step 4 (Calculate):
JMP regression output is shown here:

Linear Fit
Judged Release = 1095.1525 + 0.449281*Actual Release

Summary of Fit
RSquare                   0.771
RSquare Adj               0.766759
Root Mean Square Error    3.59844
Mean of Response          1986.013
Observations              56

Parameter Estimates
Term             Estimate    Std Error   t Ratio   Prob>|t|
Intercept        1095.1525   66.07159    16.58     <.0001*
Actual Release   0.449281    0.033321    13.48     <.0001*

In the Actual Release row, Estimate is the slope b and Std Error is $s_b$.
P-value = 2P(t > 13.48) ≈ 0
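The t Ratio column in the JMP output is the estimate divided by its standard error. A minimal sketch of that calculation, using the values from the two JMP tables in this chapter; `t_ratio` is a helper name introduced here, with `beta_0` defaulting to 0 for the model utility test.

```python
def t_ratio(b, s_b, beta_0=0.0):
    """Test statistic t = (b - beta_0) / s_b for a hypothesis test about the slope."""
    return (b - beta_0) / s_b

# Song recognition example: slope and standard error from the JMP output.
t_song = t_ratio(0.449281, 0.033321)
df_song = 56 - 2  # df = n - 2

# Bison example: slope and standard error from the earlier JMP output.
t_bison = t_ratio(-0.013664, 0.005989)

print(round(t_song, 2), df_song, round(t_bison, 2))  # 13.48 54 -2.28
```

The P-value itself requires an area under the t curve with the stated degrees of freedom, which is normally read from software output or a t table rather than computed by hand.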
Step 5 (Communicate Results):
Because the P-value is less than the selected significance level, the null hypothesis is rejected.
Decision: Reject $H_0$
Conclusion: The sample data provide convincing evidence that there is a useful linear relationship between the actual release year and the judged release year.
Checking Model Adequacy
The simple linear regression model is
$y = \alpha + \beta x + e$
where e represents the random deviation of a y value from the population regression line $\alpha + \beta x$.
The methods, confidence interval for slope and the model utility test, require that some assumptions about the random deviations in the simple linear regression model be met in order for inference to be valid.
These assumptions include:
1. At any particular x value, the distribution of e is normal.
2. At any particular x value, the standard deviation of e is $\sigma_e$, which is constant over all values of x (that is, $\sigma_e$ does not depend on x).
Residual Analysis
If the deviations e1, e2, . . . , en from the population line were available, they could be examined for any inconsistencies with model assumptions. However, these deviations are
$e_1 = y_1 - (\alpha + \beta x_1)$
⋮
$e_n = y_n - (\alpha + \beta x_n)$
These values of e can ONLY be calculated if $\alpha$ and $\beta$ are known, which is almost never the case.
Instead, diagnostic checks MUST be based on the residuals
$y_1 - \hat{y}_1 = y_1 - (a + bx_1), \ldots, y_n - \hat{y}_n = y_n - (a + bx_n)$
which are the deviations from the estimated regression line.
Any observation that gives a large positive or negative residual should be examined carefully for any unusual circumstances, such as a recording error or nonstandard experimental condition.
Residual Analysis
Identifying residuals with unusually large magnitudes is made easier by inspecting standardized residuals.
$\text{standardized residual} = \dfrac{\text{residual}}{\text{estimated standard deviation of residual}}$
Recall, $\mu_e = 0$. So, the numerator is really residual – 0.
Because residuals at different x values have different standard deviations (depending on the value of x for that observation), computing the standardized residuals can be tedious. Most statistical software will perform this calculation.
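To show what the software is doing, here is a sketch that standardizes the residuals for the birth weight data from earlier in the chapter (the elk data are not reproduced in these slides). It assumes the standard result that the estimated standard deviation of the i-th residual is $s_e\sqrt{1 - 1/n - (x_i - \bar{x})^2 / S_{xx}}$, a formula the slides do not state; `standardized_residual` is a helper name introduced here.

```python
import math

# Birth weight data: maternal age (years), birth weight (grams).
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(weights) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)

# Least squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, weights)) / s_xx
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(ages, weights)]
s_e = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

def standardized_residual(i):
    """Residual i divided by its estimated standard deviation (assumed formula)."""
    leverage = 1 / n + (ages[i] - x_bar) ** 2 / s_xx
    return residuals[i] / (s_e * math.sqrt(1 - leverage))

std_residuals = [standardized_residual(i) for i in range(n)]
for x, r, z in zip(ages, residuals, std_residuals):
    print(x, round(r, 2), round(z, 2))
```

Observations with x far from the mean age get a smaller residual standard deviation, which is why equal raw residuals at different x values can give different standardized residuals.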
Revisiting the Elk
Example 16.3 introduced data on x = chest girth (in cm) and y = weight (in kg)
for a sample of 19 Rocky Mountain elk.
Inspection of the scatterplot suggests the data are consistent with the assumptions of the simple linear regression model.
Revisiting the Elk Continued . . .
Let’s examine the residuals more closely. The data, residuals, and the standardized residuals (computed using Minitab) are given on page 761.
The largest residual = 38.1397 and the associated standardized residual = 1.81294.
The smallest residual = -38.2661 and the associated standardized residual = -1.92313.
Neither one of these is surprisingly large.
The boxplots of the residuals and standardized residuals are approximately symmetric with no outliers, so the assumption of normally distributed errors seems reasonable.
Notice that the boxplots of the residuals and standardized residuals are nearly identical.
Another way to assess whether the error values are normally distributed is to look at normal probability plots of the residuals or the standardized residuals. (Only one plot is needed.)
The patterns in the normal probability plots are reasonably straight, supporting the assumption that the error distribution is normal.
The standardized plot is recommended, but it is acceptable to use the unstandardized residual plot if you do not have access to a computer package.
A Look at Residual Plots
[Figures: example residual plots; the annotations below describe each one]
In this plot, the standard deviation of the residuals increases as the x-values increase. While a straight-line model might still be appropriate, the best-fit line should be found using weighted least-squares. Consult your local statistician!
Both of these plots contain points far away from the others. These points can have substantial effects on estimates of $\alpha$ and $\beta$ as well as other quantities.
This plot exhibits a curved pattern which indicates that the fitted model should be changed to incorporate the curvature.
This is a desirable plot in that it exhibits no pattern and has no point that lies far away from the other points.
Newborns and infants have a small trachea, and there is little margin for error when inserting tracheal tubes. Using X-rays of a large number of children ages 2 months to 14 years, researchers examined the relationships between appropriate trachea tube insertion depth and other variables such as height, weight, and age.
Below are a scatterplot and a standardized residual plot constructed using data on the insertion depth and height of children (both measured in cm).
Residual plots like the one shown here are desirable.
There are no unusually large residuals since no point lies much outside the horizontal band between -2 and 2. There is no point far to the left or right of the others, and there is no pattern of curvature or difference in the variability of the residuals for different height values to indicate that the model assumptions are not reasonable.
Newborns and Infants Problem Continued . . .
But consider what happens when the relationship between insertion depth and weight is examined.
While some curvature is evident in the original scatterplot, it is even more clearly visible in the standardized residual plot.
A careful inspection of these plots suggests that along with curvature, the residuals may be more variable at larger weights.
The linear regression model is not appropriate.