
Chapter 16

Date post: 04-Jan-2016
Upload: cedric-stout
Description:
Chapter 16. Understanding Relationships – Numerical Data Part 2. Created by Kathy Fritz. The Simple Linear Regression Model. You might convert x = temperature in degrees centigrade to y = temperature in degrees Fahrenheit using. Suppose you want to convert 20˚C into Fahrenheit. - PowerPoint PPT Presentation
Page 1: Chapter 16

Chapter 16

Understanding Relationships – Numerical Data Part 2

Created by Kathy Fritz

Page 2: Chapter 16

The Simple Linear Regression Model

Page 3: Chapter 16

You might convert x = temperature in degrees centigrade to y = temperature in degrees Fahrenheit using

y = 32 + (9/5)x

[Figure: graph of the conversion line, with temperature in centigrade (10 to 50) on the horizontal axis and temperature in Fahrenheit (20 to 100) on the vertical axis.]

This is a deterministic relationship. The value of the independent variable (centigrade temperature) is all that is needed to determine the value of the dependent variable (Fahrenheit temperature).

Suppose you want to convert 20˚C into Fahrenheit.

20˚C = 68˚F
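As a minimal sketch of what "deterministic" means here, the conversion can be written as a function whose output depends on x alone. The function name is hypothetical and chosen only for illustration.

```python
def centigrade_to_fahrenheit(x):
    """Deterministic relationship: y = 32 + (9/5)x, with no random component."""
    return 32 + 9 * x / 5

# The same x always gives the same y.
print(centigrade_to_fahrenheit(20))  # 68.0
```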

Page 4: Chapter 16

Now suppose we were to investigate the relationship between y = the first-year college grade point average and x = high school grade point average.

Is the first-year college grade point average determined solely by the high school grade point average? Explain.

The first-year college grade point average and the high school grade point average do NOT have a deterministic relationship.

A description of the relationship between two variables that are not deterministically related can be given by a probabilistic model.

The equation for a probabilistic model is:

y = (deterministic function of x) + e

where e is an “error” variable.

Page 5: Chapter 16

The simple linear regression model assumes that there is a line with y-intercept α and slope β, called the population regression line.

When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

y = α + βx + e

[Figure: population regression line with y-intercept α and slope β; the random deviations e1 and e2 are the vertical distances between the line and the points observed at x1 and x2.]

Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population regression line.
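The role of the random deviation can be illustrated with a small simulation. This is only a sketch: the intercept 5, the slope 2, the standard deviation 4, and the function name are hypothetical values chosen for illustration, and the deviations are drawn from a normal distribution as an assumption.

```python
import random

def simulate_y(x_values, alpha, beta, sigma_e, rng):
    """Draw one y per x from y = alpha + beta*x + e, with e ~ Normal(0, sigma_e)."""
    return [alpha + beta * x + rng.gauss(0, sigma_e) for x in x_values]

rng = random.Random(1)            # fixed seed so the sketch is repeatable
x_values = [10, 20, 30, 40, 50]
alpha, beta = 5.0, 2.0            # hypothetical population intercept and slope

on_line = simulate_y(x_values, alpha, beta, 0.0, rng)    # no random deviation
scattered = simulate_y(x_values, alpha, beta, 4.0, rng)  # deviations with sd 4

for x, y0, y1 in zip(x_values, on_line, scattered):
    print(x, y0, round(y1, 2))
```

With the standard deviation set to 0, every simulated point lies exactly on the line; with a positive standard deviation, the points scatter above and below it.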

Page 6: Chapter 16

Basic Assumptions of the Simple Linear Regression Model

Before you actually observe a value of y for any particular value of x, you are uncertain about the value of e (the random deviation from the regression line). It could be positive, negative, or even 0.

The linear regression model makes some assumptions about the distribution of e at any particular x value in the population.

1. The distribution of e at any particular value of x is normal.

Page 7: Chapter 16

Basic Assumptions of the Simple Linear Regression Model

2. The distribution of e at any particular x value has mean value 0. That is, μe = 0.

The values of e can be negative or positive. At any particular x value, the positive and negative deviations balance out on average, so the mean deviation is zero. Thus, μe = 0.

Page 8: Chapter 16

Basic Assumptions of the Simple Linear Regression Model

3. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by σe.

Page 9: Chapter 16

Basic Assumptions of the Simple Linear Regression Model

4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.

Page 10: Chapter 16

Just as there is variability in the values of e at any particular value of x, there is also variability in the y values.

The mean of the y values at a fixed value x* is

(mean y value) = α + βx*

The population regression line passes through the means of the y values. Thus the slope β is the mean or expected change in y associated with a 1-unit increase in x.

The standard deviation of y for any fixed value x* is also σe. Because σe is the same for any particular x value, the spread of the y values is the same at every x.

[Figure: population regression line passing through the mean y values α + βx1, α + βx2, and α + βx3 at x1, x2, and x3.]

Page 11: Chapter 16

Another look at σe

The smaller σe is, the closer the points are to the regression line.

The larger σe is, the farther the points are from the regression line.

Page 12: Chapter 16

The estimates of the slope and the y-intercept of the population regression line are the slope and y-intercept, respectively, of the least squares line, ŷ = a + bx.

The values of a and b are usually obtained using statistical software or a graphing calculator.

Let x* denote a specified value of the independent variable x. Then a + bx* has two different interpretations:

1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be observed when x = x*.

Page 13: Chapter 16

Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother’s age for babies born to young mothers. The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).

x   15    17    18    15    16    19    17    16    18    19
y   2289  3393  3271  2648  2897  3327  2970  2535  3138  3573

Sketch a scatterplot of these data.

[Figure: scatterplot of baby’s weight (g) against mother’s age (yrs).]

The scatterplot shows a linear pattern, and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.

Page 14: Chapter 16

Birth Weight Continued . . .

For the data on x = maternal age (in years) and y = birth weight of baby (in grams), the least squares line is

ŷ = -1163.45 + 245.15x

[Figure: scatterplot of baby’s weight (g) against mother’s age (yrs) with the least squares line drawn in.]

What is the point estimate for the mean weight of babies born to 18-year-old mothers?

ŷ = -1163.45 + 245.15(18) = 3249.25 grams

This is the point estimate for the mean weight of all babies born to 18-year-old mothers. It is also the prediction of the weight of a single baby born to a mother 18 years of age.

The slope indicates that the mean birth weight increases by approximately 245.15 grams for each increase of 1 year in the mother’s age.

Beware of the danger of extrapolation. That is, be careful when trying to make an

estimate or prediction for any x value much outside the range of the observed x values

in the data.
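The slides obtain a and b from software. As a sketch of what that software computes, the usual least squares formulas b = Sxy / Sxx and a = ȳ − b·x̄ can be applied to the ten birth weight observations. The function and variable names are illustrative only.

```python
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

def least_squares(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

a, b = least_squares(ages, weights)
print(round(a, 2), round(b, 2))   # -1163.45 245.15
print(round(a + b * 18, 2))       # 3249.25

# Ages 15 to 19 are the only ones observed; using the line far outside
# this range would be extrapolation.
print(min(ages), max(ages))       # 15 19
```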

Page 15: Chapter 16

The statistic for estimating the variance σe² is

se² = SSResid / (n – 2)

where

SSResid = Σ(y – ŷ)²

The subscript “e” is a reminder that you are estimating the variance of the “errors” or residuals.

The estimate of σe is the estimated standard deviation

se = √(se²)

Note that the degrees of freedom associated with estimating σe² or σe in simple linear regression is

df = n – 2

The value of se, the estimated standard deviation about the population regression line, is interpreted as the typical amount by which an observation deviates from the population regression line.
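As a worked sketch of these formulas, SSResid and se can be computed for the birth weight data from the earlier example, using the least squares estimates reported there. The variable names are illustrative.

```python
import math

ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]
a, b = -1163.45, 245.15          # least squares estimates from the slides

# Residual = observed y minus predicted y-hat.
residuals = [y - (a + b * x) for x, y in zip(ages, weights)]

n = len(ages)
ss_resid = sum(r ** 2 for r in residuals)   # SSResid
s_e = math.sqrt(ss_resid / (n - 2))         # df = n - 2

print(round(ss_resid, 2))   # 337212.45
print(round(s_e, 1))        # 205.3
```

So a typical birth weight in this sample deviates from the least squares line by roughly 205 grams.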

Page 16: Chapter 16

How do we know if the estimated regression equation will be a useful model for predicting y values from x?

The residual plot and the values of se and r² can be used to determine the estimated regression equation’s usefulness.

Recall that the coefficient of determination, r², is the proportion of variability in y that can be explained by the approximate linear relationship between x and y.

Page 17: Chapter 16

Wildlife biologists monitor the ecological health of the Rocky Mountain elk. The equipment, manpower, and time needed to measure elk weights directly make such measurements difficult and expensive.

Biologists found that they could reliably estimate the weight of an elk by measuring the chest girth and then using linear regression to estimate the weight. They measured the chest girth and weight of 19 Rocky Mountain elk.

There appears to be a strong positive linear relationship between the chest girth and weight of elk.

Page 18: Chapter 16

Elk Weight Problem Continued . . .

Partial Minitab regression output is shown below.

The regression equation is
Weight = -136 + 2.81 Girth

Predictor   Coef      SE Coef   T       P
Constant    -135.51   35.75     -3.79   0.001
Girth       2.8063    0.2686    10.45   0.000

S = 23.6626   R-Sq = 86.5%   R-Sq(adj) = 85.7%

Weight = -136 + 2.81 Girth is the estimated regression equation.

Approximately 86.5% of the observed variation in elk weight can be attributed to the linear relationship between weight and chest girth.

The magnitude of a typical deviation from the least-squares line is about 23.6626 kg, which is relatively small in comparison to the y values (shown in the scatterplot).
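Some entries in the Minitab output can be reproduced from the others, which is a handy way to read the table. The sketch below assumes that the T column is Coef divided by SE Coef and that R-Sq(adj) uses the usual one-predictor adjustment 1 − (1 − R²)(n − 1)/(n − 2); neither formula is stated in the slides.

```python
n = 19  # number of elk in the sample

# (Coef, SE Coef) pairs copied from the Minitab output.
coefficients = {"Constant": (-135.51, 35.75), "Girth": (2.8063, 0.2686)}

t_ratios = {name: coef / se for name, (coef, se) in coefficients.items()}
for name, t in t_ratios.items():
    print(name, round(t, 2))       # Constant -3.79, then Girth 10.45

r_sq = 0.865
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - 2)
print(round(100 * r_sq_adj, 1))    # 85.7
```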

Page 19: Chapter 16

Inferences Concerning the Slope of the Population Regression Line

Page 20: Chapter 16

Properties of the Sampling Distribution of b

When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:

1. The mean value of b is β. That is, μb = β, so the sampling distribution of b is centered at the value of β.

2. The standard deviation of the statistic b is

σb = σe / √Sxx, where Sxx = Σ(x – x̄)²

3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).

Since β is almost always unknown, it must be estimated from independently selected observations. The slope b of the least-squares line gives a point estimate for β.

Since σb is usually unknown, the estimated standard deviation of the statistic b is

sb = se / √Sxx

When the four basic assumptions of the simple linear model are satisfied, the probability distribution of the standardized variable

t = (b – β) / sb

is the t distribution with df = n – 2.

Page 21: Chapter 16

Confidence Interval for β

When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form

b ± (t critical value) sb

where the t critical value is based on df = n – 2.

Page 22: Chapter 16

The dedicated work of conservationists for over 100 years has brought the bison in Yellowstone National Park from near extinction to a herd of over 3000 animals. It is important to monitor and manage the size of the bison population.

Researchers have studied a number of environmental factors to better understand the relationship between bison reproduction and the environment. One factor thought to influence reproduction is stress due to accumulated snow, which makes foraging more difficult for the pregnant bison.

Data from 1981-1997 on y = spring calf ratio (SCR) and x = previous fall snow-water equivalent (SWE) are shown on page 750. The researchers were interested in estimating the mean change in spring calf ratio associated with each additional cm in snow-water equivalent.

Page 23: Chapter 16

Bison Population Problem Continued . . .

Step 1 (Estimate): The value of β, the mean change in spring calf ratio for each additional 1 cm of snow-water equivalent, will be estimated.

Step 2 (Method): Because the answers to the four key questions are estimation, sample data, two numerical variables in a regression setting, and one sample, a confidence interval for β, the slope of the population regression line, will be considered. A 95% confidence level will be used.

Page 24: Chapter 16

Bison Population Problem Continued . . .

Step 3 (Check):

• You will need to assume that these 17 years are representative of yearly circumstances at Yellowstone and that each year’s reproduction and snowfall are independent of previous years.

• A scatterplot of the data looks linear and the spread does not seem different for different values of x.

• Because the boxplot of the residuals is approximately symmetrical and there are no outliers, it is reasonable to think that the distribution of e is approximately normal.

Page 25: Chapter 16

Bison Population Problem Continued . . .

Step 4 (Calculate): JMP regression output is shown here:

Linear Fit
SCR = 0.2606561 – 0.0136639*SWE

Summary of Fit
RSquare                  0.257644
RSquare Adj              0.208153
Root Mean Square Error   0.033513
Mean of Response         0.209412
Observations             17

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   0.260656    0.023885    10.91     <.0001*
SWE         -0.013664   0.005989    -2.28     0.0375*

In the SWE row, the Estimate is the slope b and the Std Error is sb.

df = 17 – 2 = 15

The t critical value for a 95% confidence level and df = 15 is 2.13.

b ± (t critical value) sb = -0.0137 ± (2.13)(0.005989) = (-0.0265, -0.0009)
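The interval arithmetic can be sketched in a few lines. The function name is hypothetical; the slope is rounded to -0.0137 as in the slide, and the t critical value 2.13 is passed in rather than computed.

```python
def slope_confidence_interval(b, s_b, t_crit):
    """Return (lower, upper) for b ± (t critical value) * s_b."""
    margin = t_crit * s_b
    return b - margin, b + margin

lower, upper = slope_confidence_interval(b=-0.0137, s_b=0.005989, t_crit=2.13)
print(round(lower, 4), round(upper, 4))   # -0.0265 -0.0009
```

Both endpoints are negative, so every plausible value of the population slope in this interval corresponds to a decrease in spring calf ratio as snow-water equivalent increases.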

Page 26: Chapter 16

Bison Population Problem Continued . . .

Step 5 (Communicate Results):

Confidence interval: You can be 95% confident that the true average change in spring calf ratio associated with an increase of 1 cm in the snow-water equivalent is between -0.0265 and -0.0009.

Confidence level: The method used to construct this interval estimate is successful in capturing the actual value of the slope of the population regression line about 95% of the time.

Page 27: Chapter 16

Summary of Hypothesis Tests Concerning β

Appropriate when the four basic assumptions of the simple regression model are reasonable:

1. The distribution of e at any particular x value has a mean of 0 (μe = 0).

2. The standard deviation of e is σe, which does not depend on x.

3. The distribution of e at any particular x value is normal.

4. The random deviations e1, e2, …, en associated with different observations are independent of one another.

Page 28: Chapter 16

Summary of Hypothesis Tests Concerning β Continued . . .

When these conditions are met, the following test statistic can be used:

t = (b – β0) / sb

where β0 is the hypothesized value from the null hypothesis.

Form of the null hypothesis: H0: β = β0

When the assumptions of the simple linear model are reasonable and the null hypothesis is true, the t test statistic has a t distribution with df = n – 2.

Page 29: Chapter 16

Summary of Hypothesis Tests Concerning β Continued . . .

Associated P-value:

When the alternative hypothesis is . . .   The P-value is . . .
Ha: β > β0    area to the right of t under the appropriate t curve
Ha: β < β0    area to the left of t under the appropriate t curve
Ha: β ≠ β0    2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative
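The three rules for the P-value can be sketched in code. Software normally supplies areas under the t curve; here they are approximated by integrating the t density with Simpson's rule so that only the standard library is needed. The function names and the labels "greater", "less", and "two-sided" are illustrative choices, not part of the slides.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + t * t / df))

def area_right_of(t, df, steps=2000):
    """Area under the t curve to the right of t (Simpson's rule on [0, |t|])."""
    h = abs(t) / steps
    total = t_pdf(0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3            # area between 0 and |t|
    return 0.5 - central if t >= 0 else 0.5 + central

def p_value(t, df, alternative):
    """P-value for Ha: beta > beta0, beta < beta0, or beta != beta0."""
    right = area_right_of(t, df)
    left = 1 - right
    if alternative == "greater":
        return right
    if alternative == "less":
        return left
    return 2 * right if t > 0 else 2 * left   # "two-sided"

# t ratio and df taken from the bison example's JMP output.
print(p_value(-2.28, 15, "two-sided"))
```

For the bison slope (t = -2.28, df = 15), the two-sided result should land close to the 0.0375 that JMP reports as Prob>|t|; a small difference is expected because the t ratio is rounded.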

Page 30: Chapter 16

Inference for a population slope generally focuses on two questions:

(1) What are plausible values for the population slope?
(2) Is the population slope different from zero?

The first question can be addressed by calculating a confidence interval.

The second question can be answered by using the hypothesis testing procedure with a null hypothesis H0: β = 0. When the null hypothesis H0: β = 0 is true, the population regression line is a horizontal line. If β is in fact equal to 0, knowledge of x will be of no use; it will have no “utility” for predicting y.

This test of H0: β = 0 versus Ha: β ≠ 0 is called the model utility test for simple linear regression.

Page 31: Chapter 16

The Model Utility Test for Simple Linear Regression

The model utility test for simple linear regression is the test of

H0: β = 0 versus Ha: β ≠ 0

The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y.

The test statistic is the t ratio:

t = b / sb

If H0 is rejected, you can conclude that the simple linear regression model is useful for predicting y.

Page 32: Chapter 16

When you hear a song on your car radio, you probably remember the title of the song, the artist, and even when the song was released. An investigator wants to study this phenomenon. He compiled a list of songs from the Rolling Stone, Billboard, and Blender lists, plus some recent songs familiar to college students.

Twenty-three college students were then exposed to 56 clips of songs. Most of these students had had musical training, and they listened to popular music for an average of 21.7 hours per week.

After hearing three short clips from a song (only 400 ms in duration), the students were asked in what year each of the songs was released.

The accompanying data show the actual release year and the average of the release years given by the students. Is there a relationship between the judged and actual release year for these songs?

Let’s perform a model utility test to answer this question.

Page 33: Chapter 16

Song Recognition Problem Continued . . .

Step 1 (Hypotheses):

H0: β = 0

Ha: β ≠ 0

where β is the slope of the population regression line relating the judged release year (y) to the actual release year (x)

Step 2 (Method):

Because the answers to the four key questions are hypothesis testing, sample data, two numerical variables in a regression setting, and one sample, a hypothesis test for the slope of a population regression line will be considered. A significance level of 0.05 will be used.

Page 34: Chapter 16

Song Recognition Problem Continued . . .

Step 3 (Check):

For this example you can assume that the assumptions are reasonable and proceed with the model utility test. (We will see how to check if the four assumptions of the simple linear regression model are reasonable in the next section.)

Page 35: Chapter 16

Song Recognition Problem Continued . . .

Step 4 (Calculate): JMP regression output is shown here:

Linear Fit
Judged Release = 1095.1525 + 0.449281*Actual Release

Summary of Fit
RSquare                  0.771
RSquare Adj              0.766759
Root Mean Square Error   3.59844
Mean of Response         1986.013
Observations             56

Parameter Estimates
Term             Estimate    Std Error   t Ratio   Prob>|t|
Intercept        1095.1525   66.07159    16.58     <.0001*
Actual Release   0.449281    0.033321    13.48     <.0001*

In the Actual Release row, the Estimate is the slope b and the Std Error is sb.

P-value = 2P(t > 13.48) ≈ 0
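As a quick check on the output, the t ratio for the model utility test is the slope estimate divided by its standard error, with the hypothesized value 0 written out explicitly. The variable names are illustrative.

```python
b = 0.449281      # slope estimate from the JMP output
s_b = 0.033321    # standard error of the slope
n = 56            # number of songs
beta_0 = 0        # hypothesized slope under H0

t = (b - beta_0) / s_b
df = n - 2
print(round(t, 2), df)   # 13.48 54
```

A t ratio this far from 0 with 54 degrees of freedom leaves almost no area in the tails, which is why the P-value is reported as approximately 0.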

Page 36: Chapter 16

Song Recognition Problem Continued . . .

Step 5 (Communicate Results):

Because the P-value is less than the selected significance level of 0.05, the null hypothesis is rejected.

Decision: Reject H0

Conclusion: The sample data provide convincing evidence that there is a useful linear relationship between the actual release year and the judged release year.

Page 37: Chapter 16

Checking Model Adequacy

Page 38: Chapter 16

Checking Model Adequacy

The simple linear regression model is

y = α + βx + e

where e represents the random deviation of a y value from the population regression line α + βx.

The two inference methods, the confidence interval for the slope and the model utility test, require that some assumptions about the random deviations in the simple linear regression model be met in order for inference to be valid.

These assumptions include:

1. At any particular x value, the distribution of e is normal.

2. At any particular x value, the standard deviation of e is σe, which is constant over all values of x (that is, σe does not depend on x).

Page 39: Chapter 16

Residual Analysis

If the deviations e1, e2, . . . , en from the population line were available, they could be examined for any inconsistencies with model assumptions. However, these deviations are

e1 = y1 – (α + βx1)
. . .
en = yn – (α + βxn)

These values of e can ONLY be calculated if α and β are known, which is almost never the case.

Instead, diagnostic checks MUST be based on the residuals, y – ŷ = y – (a + bx), which are the deviations from the estimated regression line.

Any observation that gives a large positive or negative residual should be examined carefully for any unusual circumstances, such as a recording error or nonstandard experimental condition.

Page 40: Chapter 16

Residual Analysis

Identifying residuals with unusually large magnitudes is made easier by inspecting standardized residuals.

standardized residual = residual / (estimated standard deviation of residual)

Recall, μe = 0. So, the numerator is really residual – 0.

Because residuals at different x values have different standard deviations (depending on the value of x for that observation), computing the standardized residuals can be tedious. Most statistical software will perform this calculation.
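To show what the software is doing, here is a sketch that standardizes the residuals for the birth weight data used earlier. The slides do not give the formula for the estimated standard deviation of a residual; the sketch assumes the standard result se·√(1 − 1/n − (x − x̄)²/Sxx), which is why residuals at different x values have different standard deviations.

```python
import math

ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(weights) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, weights)) / s_xx
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(ages, weights)]
s_e = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Assumed formula: sd of a residual shrinks as x moves away from x-bar.
standardized = [
    r / (s_e * math.sqrt(1 - 1 / n - (x - x_bar) ** 2 / s_xx))
    for x, r in zip(ages, residuals)
]

for x, r, z in zip(ages, residuals, standardized):
    print(x, round(r, 1), round(z, 2))
```

In this small data set the largest standardized residual is just under 2 in magnitude, so no observation stands out as unusual.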

Page 41: Chapter 16

Revisiting the Elk

Example 16.3 introduced data on x = chest girth (in cm) and y = weight (in kg) for a sample of 19 Rocky Mountain elk.

Inspection of the scatterplot suggests the data are consistent with the assumptions of the simple linear regression model.

Page 42: Chapter 16

Revisiting the Elk Continued . . .

Let’s examine the residuals more closely. The data, residuals, and the standardized residuals (computed using Minitab) are given on page 761.

The largest residual = 38.1397 and the associated standardized residual = 1.81294.

The smallest (most negative) residual = -38.2661 and the associated standardized residual = -1.92313.

Neither one of these is surprisingly large.

The boxplots of the residuals and standardized residuals are approximately symmetric with no outliers, so the assumption of normally distributed errors seems reasonable.

Notice that the boxplots of the residuals and standardized residuals are nearly identical.

Page 43: Chapter 16

Revisiting the Elk Continued . . .

Another way to assess whether the error values are normally distributed is to look at normal probability plots of the residuals or the standardized residuals. (Only one plot is needed.)

The pattern in the normal probability plots is reasonably straight, confirming that the assumption of normality of the error distribution is reasonable.

The standardized plot is recommended, but it is acceptable to use the unstandardized residual plot if you do not have access to a computer package.

Page 44: Chapter 16

A Look at Residual Plots

[Figure: several residual plots, described below.]

A desirable plot exhibits no pattern and has no point that lies far away from the other points.

In one plot, the standard deviation of the residuals increases as the x values increase. While a straight-line model might still be appropriate, the best-fit line should be found using weighted least squares. Consult your local statistician!

Two of the plots contain points far away from the others. These points can have substantial effects on estimates of α and β as well as other quantities.

One plot exhibits a curved pattern, which indicates that the fitted model should be changed to incorporate the curvature.
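The curved residual pattern is easy to reproduce. In the sketch below, a straight line is fitted to made-up data that follow y = x² exactly (hypothetical values, not from the slides). The residuals are positive at both ends and negative in the middle, which is the kind of curvature a residual plot reveals.

```python
xs = list(range(11))          # x = 0, 1, ..., 10
ys = [x ** 2 for x in xs]     # a curved, non-linear relationship

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals)
# [15.0, 6.0, -1.0, -6.0, -9.0, -10.0, -9.0, -6.0, -1.0, 6.0, 15.0]
```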

Page 45: Chapter 16

Newborns and infants have a small trachea, and there is little margin for error when inserting tracheal tubes. Using X-rays of a large number of children ages 2 months to 14 years, researchers examined the relationships between appropriate trachea tube insertion depth and other variables such as height, weight, and age.

Below are a scatterplot and a standardized residual plot constructed using data on the insertion depth and height of children (both measured in cm).

[Figure: scatterplot and standardized residual plot for insertion depth versus height.]

Residual plots like the one shown here are desirable.

There are no unusually large residuals, since no point lies much outside the horizontal band between -2 and 2. There is no point far to the left or right of the others, and there is no pattern of curvature or difference in the variability of the residuals for different height values to indicate that the model assumptions are not reasonable.

Page 46: Chapter 16

Newborns and Infants Problem Continued . . .

But consider what happens when the relationship between insertion depth and weight is examined.

While some curvature is evident in the original scatterplot, it is even more clearly visible in the standardized residual plot.

A careful inspection of these plots suggests that along with curvature, the residuals may be more variable at larger weights. The linear regression model is not appropriate.

