Chapter 16

Understanding Relationships – Numerical Data Part 2

Created by Kathy Fritz

The Simple Linear Regression Model

You might convert x = temperature in degrees centigrade to y = temperature in degrees Fahrenheit using

$y = 32 + \frac{9}{5}x$

[Figure: line graph of temperature in Fahrenheit versus temperature in centigrade.]

This is a deterministic relationship. The value of the independent variable (centigrade temperature) is all that is needed to determine the value of the dependent variable (Fahrenheit temperature).

Suppose you want to convert 20˚C into Fahrenheit.

20˚C = 68˚F

The equation for a probabilistic model is:

$y = f(x) + e$

where f(x) is a deterministic function of x and e is a random “error” variable.

Now suppose we were to investigate the relationship between y = the first-year college grade point average and x = high school grade point average.

Is the first-year college grade point average determined solely by the high school grade point average? Explain.

The first-year college grade point average and the high school grade point average do NOT have a deterministic relationship.

A description of the relationship between two variables that are not deterministically related can be given by a probabilistic model.

The simple linear regression model assumes that there is a line with y-intercept α and slope β, called the population regression line.

When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

$y = \alpha + \beta x + e$

[Figure: population regression line (slope β, y-intercept α) with the random deviations e1 and e2 shown at x1 and x2.]

Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population regression line.
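As a minimal sketch of what the model says, the following Python snippet generates observations from y = α + βx + e. The values α = 2, β = 0.5, and σ_e = 1 (the standard deviation of e) are hypothetical, chosen only for illustration; they do not come from any data set in this chapter.

```python
import random

def simulate_y(x_values, alpha, beta, sigma_e, seed=0):
    """Return one simulated y per x under y = alpha + beta*x + e,
    where e is normal with mean 0 and standard deviation sigma_e."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [alpha + beta * x + rng.gauss(0.0, sigma_e) for x in x_values]

xs = [10, 20, 30, 40, 50]

# With random deviations, the points scatter around the population line.
noisy = simulate_y(xs, alpha=2.0, beta=0.5, sigma_e=1.0)

# With sigma_e = 0 there is no random deviation, so every point is on the line.
exact = simulate_y(xs, alpha=2.0, beta=0.5, sigma_e=0.0)
line = [2.0 + 0.5 * x for x in xs]
print(exact == line)  # True
```

Setting the standard deviation of e to zero turns the probabilistic model back into a deterministic one, which is the point of the sentence above.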

Basic Assumptions of the Simple Linear Regression Model

Before you actually observe a value of y for any particular value of x, you are uncertain about the value of e (the random deviation from the regression line). It could be positive, negative, or even 0.

The linear regression model makes some assumptions about the distribution of e at any particular x value in the population.

1. The distribution of e at any particular value of x is normal.

2. The distribution of e at any particular x value has mean value 0. That is, μ_e = 0.

Because the values of e can be negative or positive, the positive and negative deviations at any particular x value balance out on average. Thus, μ_e = 0.

3. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by σ_e.

4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.

Just as there is variability in the values of e at any particular value of x, there is also variability in the y values.

The mean of y values at a fixed value x* is

$\mu_y = \alpha + \beta x^*$

The population regression line passes through the means of the y values. Thus the slope β is the mean or expected change in y associated with a 1-unit increase in x.

The standard deviation of y for any fixed value x* is also σ_e, and σ_e is the same for any particular x value.

[Figure: distributions of y centered at α + βx1, α + βx2, and α + βx3 on the population regression line.]

Another look at σ_e

The smaller σ_e, the closer the points are to the regression line.

The larger σ_e, the farther the points are from the regression line.
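The two facts about y at a fixed x* (mean α + βx* and standard deviation σ_e) can be checked with a small simulation. This is only a sketch: the population values below (α = 2, β = 0.5, σ_e = 3) and the choice x* = 20 are hypothetical.

```python
import random
import statistics

ALPHA, BETA, SIGMA_E = 2.0, 0.5, 3.0   # hypothetical population values
X_STAR = 20                            # the fixed x value

rng = random.Random(1)
# Many observations of y, all taken at the same fixed x value.
ys = [ALPHA + BETA * X_STAR + rng.gauss(0.0, SIGMA_E) for _ in range(20000)]

mean_y = statistics.fmean(ys)   # should be close to ALPHA + BETA * X_STAR = 12
sd_y = statistics.stdev(ys)     # should be close to SIGMA_E = 3
print(round(mean_y, 2), round(sd_y, 2))
```

Rerunning with a larger or smaller SIGMA_E changes how widely the simulated y values spread around 12, but not where they are centered.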

The estimates of the slope and the y-intercept of the population regression line are the slope and y-intercept, respectively, of the least squares line, $\hat{y} = a + bx$.

The values of a and b are usually obtained using statistical software or a graphing calculator.

Let x* denote a specified value of the independent variable x. Then a + bx* has two different interpretations:

1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be observed when x = x*.

Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother’s age for babies born to young mothers. The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).

x   15    17    18    15    16    19    17    16    18    19
y   2289  3393  3271  2648  2897  3327  2970  2535  3138  3573

Sketch a scatterplot of these data.

[Figure: scatterplot of baby’s weight (g) versus mother’s age (yrs).]

The scatterplot shows a linear pattern, and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.

Birth Weight Continued . . .

For the data on x = maternal age (in years) and y = birth weight of baby (in grams), the least squares line is

$\hat{y} = -1163.45 + 245.15x$

[Figure: scatterplot of baby’s weight (g) versus mother’s age (yrs).]

On average, the weight of babies increases by approximately 245.15 grams for each 1-year increase in the mother’s age.

What is the point estimate for the mean weight of babies born to 18-year-old mothers?

$\hat{y} = -1163.45 + 245.15(18) = 3249.25$ grams

This is the point estimate for the mean weight of all babies born to 18-year-old mothers. It is also the prediction of the weight of a single baby born to a mother 18 years of age.
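The slides note that a and b usually come from software. As an illustrative sketch, the standard least squares formulas can be applied directly to the ten birth weight observations; the function name `least_squares` is just a label chosen here.

```python
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

def least_squares(x, y):
    """Return (a, b), the intercept and slope of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx            # slope
    a = y_bar - b * x_bar      # intercept
    return a, b

a, b = least_squares(ages, weights)
prediction_18 = a + b * 18     # point estimate / prediction at x* = 18
print(round(a, 2), round(b, 2), round(prediction_18, 2))
# -1163.45 245.15 3249.25
```

The same number, 3249.25 grams, serves both as the estimate of the mean weight at age 18 and as the prediction for a single baby.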

Beware of the danger of extrapolation. That is, be careful when trying to make an estimate or prediction for any x value much outside the range of the observed x values in the data.

The statistic for estimating the variance σ_e² is

$s_e^2 = \frac{\text{SSResid}}{n - 2}$

where

$\text{SSResid} = \sum (y - \hat{y})^2$

The subscript “e” is a reminder that you are estimating the variance of the “errors” or residuals.

The estimate of σ_e is the estimated standard deviation

$s_e = \sqrt{s_e^2}$

Note that the degrees of freedom associated with estimating σ_e² or σ_e in simple linear regression is

df = n - 2

The value of s_e, the estimated standard deviation about the population regression line, is interpreted as the typical amount by which an observation deviates from the population regression line.
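To make the formulas concrete, here is a short sketch that computes SSResid and the estimated standard deviation about the line for the birth weight data, using the least squares estimates a = -1163.45 and b = 245.15. The variable names are illustrative only.

```python
import math

ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]
a, b = -1163.45, 245.15   # least squares estimates for these data

# Residual = observed y minus the value predicted by the least squares line.
residuals = [y - (a + b * x) for x, y in zip(ages, weights)]

ss_resid = sum(r ** 2 for r in residuals)
n = len(ages)
s_e_squared = ss_resid / (n - 2)   # df = n - 2
s_e = math.sqrt(s_e_squared)
print(round(ss_resid, 2), round(s_e, 3))
# 337212.45 205.308
```

So a typical birth weight in this sample deviates from the line by roughly 205 grams.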

How do we know whether the estimated regression equation will be a useful model for predicting y values from x?

The residual plot and the values of s_e and r² can be used to determine the estimated regression equation’s usefulness.

Recall that the coefficient of determination, r², is the proportion of variability in y that can be explained by the approximate linear relationship between x and y.

Wildlife biologists monitor the ecological health of the Rocky Mountain elk. Making direct measurements of elk weights is difficult and expensive in equipment, manpower, and time.

Biologists found that they could reliably estimate the weight of an elk by measuring the chest girth and then using linear regression to estimate the weight. They measured the chest girth and weight of 19 Rocky Mountain elk.

There appears to be a strong positive linear relationship between the chest girth and weight of elk.

Elk Weight Problem Continued . . .

Partial Minitab regression output is shown below.

The regression equation is
Weight = -136 + 2.81 Girth

Predictor   Coef      SE Coef   T       P
Constant    -135.51   35.75     -3.79   0.001
Girth       2.8063    0.2686    10.45   0.000

S = 23.6626   R-Sq = 86.5%   R-Sq(adj) = 85.7%

The first line of the output is the estimated regression equation.

Approximately 86.5% of the observed variation in elk weight can be attributed to the linear relationship between weight and chest girth.

The magnitude of a typical deviation from the least-squares line is about 23.66 kg (S = 23.6626), which is relatively small in comparison to the y values (shown in the scatterplot).
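The columns of the Minitab output are related in simple ways, and checking those relationships is a useful habit when reading regression output. The sketch below uses only numbers printed in the output; the adjusted R-squared line uses the standard formula for a model with one predictor.

```python
# Values copied from the Minitab output for the elk data.
coef_girth, se_girth = 2.8063, 0.2686
coef_const, se_const = -135.51, 35.75
r_sq, n = 0.865, 19

t_girth = coef_girth / se_girth    # T column, Girth row
t_const = coef_const / se_const    # T column, Constant row

# Adjusted R-squared for simple linear regression (one predictor).
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - 2)

print(round(t_girth, 2), round(t_const, 2), round(r_sq_adj, 3))
# 10.45 -3.79 0.857
```

Each T value is the coefficient divided by its standard error, and the adjusted value reproduces the 85.7% shown as R-Sq(adj).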

Inferences Concerning the Slope of the Population Regression Line

Properties of the Sampling Distribution of b

When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:

1. The mean value of b is β. That is, μ_b = β, so the sampling distribution of b is centered at the value of β.

2. The standard deviation of the statistic b is

$\sigma_b = \frac{\sigma_e}{\sqrt{S_{xx}}}$, where $S_{xx} = \sum (x - \bar{x})^2$

3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).

Since β is almost always unknown, it must be estimated from independently selected observations. The slope b of the least-squares line gives a point estimate for β.

Since σ_b is usually unknown, the estimated standard deviation of the statistic b is

$s_b = \frac{s_e}{\sqrt{S_{xx}}}$

When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable

$t = \frac{b - \beta}{s_b}$

is the t distribution with df = n - 2.

Confidence Interval for β

When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form

b ± (t critical value) s_b

where the t critical value is based on df = n – 2.
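As an illustration of these formulas, the sketch below computes b, s_b, and a 95% confidence interval for the slope from the birth weight data used earlier in the chapter. The slides do not work this interval; it is shown only to make the steps concrete. The critical value 2.306 (95% confidence, df = 8) is taken from a t table, and the function name is a label chosen here.

```python
import math

ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

def slope_and_standard_error(x, y):
    """Return (b, s_b) for the least squares line through (x, y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))
    return b, s_e / math.sqrt(s_xx)    # s_b = s_e / sqrt(S_xx)

b, s_b = slope_and_standard_error(ages, weights)
t_crit = 2.306   # from a t table: 95% confidence, df = 10 - 2 = 8
interval = (b - t_crit * s_b, b + t_crit * s_b)
print(round(b, 2), round(s_b, 3), [round(v, 2) for v in interval])
```

Because the interval is built from s_b, more spread in the x values (larger S_xx) or less scatter about the line (smaller s_e) both give a narrower interval.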

The dedicated work of conservationists for over 100 years has brought the bison in Yellowstone National Park from near extinction to a herd of over 3000 animals. It is important to monitor and manage the size of the bison population.

Researchers have studied a number of environmental factors to better understand the relationship between bison reproduction and the environment. One factor thought to influence reproduction is stress due to accumulated snow, which makes foraging more difficult for the pregnant bison.

Data from 1981-1997 on y = spring calf ratio (SCR) and x = previous fall snow-water equivalent (SWE) are shown on page 750. The researchers were interested in estimating the mean change in spring calf ratio associated with each additional cm in snow-water equivalent.

Bison Population Problem Continued . . .

Step 1 (Estimate):
The value of β, the mean change in spring calf ratio for each additional 1 cm of snow-water equivalent, will be estimated.

Step 2 (Method):
Because the answers to the four key questions are estimation, sample data, two numerical variables in a regression setting, and one sample, a confidence interval for β, the slope of the population regression line, will be considered. A 95% confidence level will be used.


Step 3 (Check):

• You will need to assume that these 17 years are representative of yearly circumstances at Yellowstone and that each year’s reproduction and snowfall is independent of previous years.

• A scatterplot of the data looks linear and the spread does not seem different for different values of x.

• Because the boxplot of the residuals is approximately symmetrical and there are no outliers, it is reasonable to think that the distribution of e is approximately normal.


Step 4 (Calculate):
JMP regression output is shown here:

Linear Fit
SCR = 0.2606561 – 0.0136639*SWE

Summary of Fit
RSquare                  0.257644
RSquare Adj              0.208153
Root Mean Square Error   0.033513
Mean of Response         0.209412
Observations             17

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   0.260656    0.023885    10.91     <.0001*
SWE         -0.013664   0.005989    -2.28     0.0375*

In the SWE row, the estimate is the slope b and the standard error is s_b.

df = 17 – 2 = 15
The t critical value for a 95% confidence level and df = 15 is 2.13.

b ± (t critical value) s_b = -0.0137 ± (2.13)(0.005989) = (-0.0265, -0.0009)
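The interval arithmetic in Step 4 can be reproduced in a few lines. This sketch uses the rounded slope, standard error, and t critical value exactly as they appear in the calculation above; the function name is a label chosen here.

```python
def slope_confidence_interval(b, s_b, t_critical):
    """Return (lower, upper) for b ± (t critical value) * s_b."""
    margin = t_critical * s_b
    return (b - margin, b + margin)

lower, upper = slope_confidence_interval(b=-0.0137, s_b=0.005989, t_critical=2.13)
print(round(lower, 4), round(upper, 4))
# -0.0265 -0.0009
```

Both endpoints are negative, so every plausible value of the slope corresponds to a decrease in spring calf ratio as snow-water equivalent increases.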


Step 5 (Communicate Results):

Confidence interval:
You can be 95% confident that the true average change in spring calf ratio associated with an increase of 1 cm in the snow-water equivalent is between -0.0265 and -0.0009.

Confidence level:
The method used to construct this interval estimate is successful in capturing the actual value of the slope of the population regression line about 95% of the time.

Summary of Hypothesis Tests Concerning β

Appropriate when the four basic assumptions of the simple linear regression model are reasonable:

1. The distribution of e at any particular x value has a mean of 0 (μ_e = 0).

2. The standard deviation of e is σ_e, which does not depend on x.

3. The distribution of e at any particular x value is normal.

4. The random deviations e1, e2, …, en associated with different observations are independent of one another.

When these conditions are met, the following test statistic can be used:

$t = \frac{b - \beta_0}{s_b}$

where β0 is the hypothesized value from the null hypothesis.

Form of the null hypothesis: H0: β = β0

When the assumptions of the simple linear regression model are reasonable and the null hypothesis is true, the t test statistic has a t distribution with df = n – 2.

Associated P-value:

When the alternative hypothesis is . . .   The P-value is . . .
Ha: β > β0                                 area to the right of t under the appropriate t curve
Ha: β < β0                                 area to the left of t under the appropriate t curve
Ha: β ≠ β0                                 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative
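The P-value rules in the table can be sketched in code. Statistical software normally does this; the version below is only an illustration that approximates the area under the t curve numerically (Simpson's rule) so that it needs nothing beyond the Python standard library. The function names and the '>', '<', '!=' labels for the alternative hypothesis are choices made here. The example call uses the slope and standard error from the bison JMP output with β0 = 0.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def area_to_right(t, df, steps=2000):
    """Area under the t curve to the right of t (Simpson's rule on [0, |t|])."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    tail = max(0.0, 0.5 - total * h / 3)   # area beyond |t|
    return tail if t >= 0 else 1 - tail

def slope_test(b, s_b, beta_0, n, alternative):
    """Return (t, P-value); alternative is '>', '<', or '!='."""
    df = n - 2
    t = (b - beta_0) / s_b
    if alternative == ">":
        p = area_to_right(t, df)
    elif alternative == "<":
        p = 1 - area_to_right(t, df)
    else:
        p = 2 * area_to_right(abs(t), df)
    return t, p

t_stat, p = slope_test(b=-0.013664, s_b=0.005989, beta_0=0, n=17, alternative="!=")
print(round(t_stat, 2), round(p, 4))
```

For these inputs the result should agree, up to rounding, with the t Ratio (-2.28) and Prob>|t| (0.0375) in the SWE row of the JMP output.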

Inference for a population slope generally focuses on two questions:

(1) What are plausible values for the population slope? This question can be addressed by calculating a confidence interval.

(2) Is the population slope different from zero? This question can be answered by using the hypothesis testing procedure with a null hypothesis H0: β = 0.

When the null hypothesis H0: β = 0 is true, the population regression line is a horizontal line. If β is in fact equal to 0, knowledge of x will be of no use: it will have no “utility” for predicting y.

This test of H0: β = 0 versus Ha: β ≠ 0 is called the model utility test for simple linear regression.

The Model Utility Test for Simple Linear Regression

The model utility test for simple linear regression is the test of

H0: β = 0 versus Ha: β ≠ 0

The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y.

The test statistic is the t ratio:

$t = \frac{b}{s_b}$

If H0 is rejected, you can conclude that the simple linear regression model is useful for predicting y.

When you hear a song on your car radio, you probably remember the title of the song, the artist, and even when the song was released. An investigator wants to study this phenomenon. He compiled a list of songs from Rolling Stone, Billboard, and Blender lists of songs plus some recent songs familiar to college students.

Twenty-three college students were then exposed to 56 clips of songs. Most of these students had had musical training, and they listened to popular music for an average of 21.7 hours per week.

After hearing three short clips from a song (only 400 ms in duration), the students were asked in what year each of the songs was released.

The accompanying data show the actual release year and the average of the release years given by the students. Is there a relationship between the judged and actual release year for these songs?

Let’s perform a model utility test to answer this question.

Song Recognition Problem Continued . . .

Step 1 (Hypotheses):

H0: β = 0

Ha: β ≠ 0

where β is the slope of the population regression line relating judged release year to actual release year.

Step 2 (Method):

Because the answers to the four key questions are hypothesis testing, sample data, two numerical variables in a regression setting, and one sample, a hypothesis test for the slope of a population regression line will be considered. A significance level of 0.05 will be used.


Step 3 (Check):

For this example you can assume that the assumptions are reasonable and proceed with the model utility test. (We will see how to check if the four assumptions of the simple linear regression model are reasonable in the next section.)


Step 4 (Calculate):
JMP regression output is shown here:

Linear Fit
Judged Release = 1095.1525 + 0.449281*Actual Release

Summary of Fit
RSquare                  0.771
RSquare Adj              0.766759
Root Mean Square Error   3.59844
Mean of Response         1986.013
Observations             56

Parameter Estimates
Term             Estimate    Std Error   t Ratio   Prob>|t|
Intercept        1095.1525   66.07159    16.58     <.0001*
Actual Release   0.449281    0.033321    13.48     <.0001*

In the Actual Release row, the estimate is the slope b and the standard error is s_b.

t = 13.48 with df = 56 – 2 = 54

P-value = 2P(t > 13.48) ≈ 0


Step 5 (Communicate Results):

Because the P-value is less than the selected significance level, the null hypothesis is rejected.

Decision: Reject H0

Conclusion:
The sample data provide convincing evidence that there is a useful linear relationship between the actual release year and the judged release year.

Checking Model Adequacy

The simple linear regression model is

$y = \alpha + \beta x + e$

where e represents the random deviation of a y value from the population regression line α + βx.

The methods presented so far, the confidence interval for the slope and the model utility test, require that some assumptions about the random deviations in the simple linear regression model be met in order for inference to be valid.

These assumptions include:

1. At any particular x value, the distribution of e is normal.

2. At any particular x value, the standard deviation of e is σ_e, which is constant over all values of x (that is, σ_e does not depend on x).

Residual Analysis

If the deviations e1, e2, . . . , en from the population line were available, they could be examined for any inconsistencies with model assumptions. However, these deviations are

e1 = y1 – (α + βx1)
. . .
en = yn – (αn + βxn)

These values of e can ONLY be calculated if α and β are known, which is almost never the case.

Instead, diagnostic checks MUST be based on the residuals y – ŷ, which are the deviations from the estimated regression line.

Any observation that gives a large positive or negative residual should be examined carefully for any unusual circumstances, such as a recording error or nonstandard experimental condition.

Identifying residuals with unusually large magnitudes is made easier by inspecting standardized residuals:

$\text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}}$

Recall that μ_e = 0, so the numerator is really residual – 0.

Because residuals at different x values have different standard deviations (depending on the value of x for that observation), computing the standardized residuals can be tedious. Most statistical software will perform this calculation.
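For readers curious about what the software does, here is a minimal sketch. The slides do not give the formula for the estimated standard deviation of a residual; the code assumes the standard simple linear regression result, s_e·sqrt(1 − 1/n − (x − x̄)²/S_xx), and applies it to the birth weight data from earlier in the chapter. The function name is a label chosen here.

```python
import math

ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

def standardized_residuals(x, y):
    """Return (residuals, standardized residuals) for the least squares fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    standardized = []
    for xi, r in zip(x, residuals):
        # Assumed formula: the residual's standard deviation depends on x.
        sd_resid = s_e * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / s_xx)
        standardized.append(r / sd_resid)
    return residuals, standardized

resid, std_resid = standardized_residuals(ages, weights)
for xi, r, z in zip(ages, resid, std_resid):
    print(xi, round(r, 2), round(z, 2))
```

Observations whose x values lie far from the mean of x get a smaller residual standard deviation, which is why the divisor differs from one observation to the next.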

Revisiting the Elk

Example 16.3 introduced data on x = chest girth (in cm) and y = weight (in kg) for a sample of 19 Rocky Mountain elk.

Inspection of the scatterplot suggests the data are consistent with the assumptions of the simple linear regression model.

Revisiting the Elk Continued . . .

Let’s examine the residuals more closely. The data, residuals, and the standardized residuals (computed using Minitab) are given on page 761.

The largest residual = 38.1397 and the associated standardized residual = 1.81294.

The smallest residual = -38.2661 and the associated standardized residual = -1.92313.

Neither one of these is surprisingly large.

The boxplots of the residuals and standardized residuals are approximately symmetric with no outliers, so the assumption of normally distributed errors seems reasonable.

Notice that the boxplots of the residuals and standardized residuals are nearly identical.

Revisiting the Elk Continued . . .

Another way to assess whether the error values are normally distributed is to look at normal probability plots of the residuals or the standardized residuals. (Only one plot is needed.)

The pattern in the normal probability plots is reasonably straight, confirming that the assumption of normality of the error distribution is reasonable.

The standardized plot is recommended, but it is acceptable to use the unstandardized residual plot if you do not have access to a computer package.

A Look at Residual Plots

In a plot where the standard deviation of the residuals increases as the x values increase, a straight-line model might still be appropriate, but the best-fit line should be found using weighted least squares. Consult your local statistician!

Some plots contain points far away from the others. These points can have substantial effects on the estimates of α and β as well as other quantities.

A plot that exhibits a curved pattern indicates that the fitted model should be changed to incorporate the curvature.

A desirable plot exhibits no pattern and has no point that lies far away from the other points.

Newborns and infants have a small trachea, and there is little margin for error when inserting tracheal tubes. Using X-rays of a large number of children ages 2 months to 14 years, researchers examined the relationships between appropriate trachea tube insertion depth and other variables such as height, weight, and age.

Below are a scatterplot and a standardized residual plot constructed using data on the insertion depth and height of children (both measured in cm).

[Figures: scatterplot and standardized residual plot for insertion depth versus height.]

Residual plots like the one shown here are desirable.

There are no unusually large residuals, since no point lies much outside the horizontal band between -2 and 2. There is no point far to the left or right of the others, and there is no pattern of curvature or difference in the variability of the residuals for different height values to indicate that the model assumptions are not reasonable.

Newborns and Infants Problem Continued . . .

But consider what happens when the relationship between insertion depth and weight is examined.

While some curvature is evident in the original scatterplot, it is even more clearly visible in the standardized residual plot.

A careful inspection of these plots suggests that along with curvature, the residuals may be more variable at larger weights.

The linear regression model is not appropriate.
