Source: minitabch14.pdf (hkj/Teaching/1500/Coverage/minitabch14.pdf)

CHAPTER 14 Section 14.1 Regression Models

1. Simple Regression

Example 5.6 gave the regression line that explained the response variable Left Handspan (LftSpan) using the explanatory variable Right Handspan (RtSpan). Since only one explanatory variable, or predictor (Right Handspan), is used, the method is called 'simple regression'. We will first revisit this regression equation and take a closer look at the output. Observations for the handspan variables are found in the file pennstate1.mtw, which also contains other variables. From the menu, select File>Open Worksheet to open the window shown below. You now want to indicate, in the Look in dialog box, the location of the file pennstate1.mtw. Recall that the MTW folder within DataSets of the CD contains the file pennstate1.mtw.

To obtain the regression equation, select Stat>Regression>Regression from the menu. Enter LftSpan in the Response dialog box and enter RtSpan in the Predictor dialog box. The regression window looks like the following:


After clicking OK, the output will appear in the session window. The output includes the regression equation, the value of R2, and other useful statistics. The regression equation is LftSpan = 1.46 + 0.938 RtSpan

(Note: Due to formatting issues, the equation of the regression line appears written in terms of LftSpan (Y) and not in terms of estimated LftSpan (Ŷ), as it should be.) The value of R² (90.2%) is labeled R-Sq in Minitab. R-Sq(adj) is an "adjustment" to R-Sq and is a useful statistic for evaluating multiple regression models. In the case of simple regression, the formula is R-Sq(adj) = 1 − (1 − R²)(n − 1)/(n − 2), and the two values are typically nearly equal: R² ≈ R²adj.

The regression equation is
LftSpan = 1.46 + 0.938 RtSpan

Predictor     Coef     StDev      T      P
Constant    1.4635    0.4792   3.05  0.003
RtSpan     0.93830   0.02252  41.67  0.000

S = 0.6386   R-Sq = 90.2%   R-Sq(adj) = 90.2%
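The t statistics in this output are simply each coefficient divided by its standard error (T = Coef/StDev). A quick sketch in Python, using the values printed above, confirms the arithmetic:

```python
# Each T in the Minitab output is Coef / StDev (the coefficient's
# standard error). Values below are copied from the output above.
coef_const, se_const = 1.4635, 0.4792
coef_rtspan, se_rtspan = 0.93830, 0.02252

t_const = coef_const / se_const     # about 3.05
t_rtspan = coef_rtspan / se_rtspan  # about 41.67
```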

2. Multiple Regression

If we want to explain LftSpan (Left Handspan) using the explanatory variable RtSpan (Right Handspan) and some other variable(s) such as Height, the regression is called 'multiple regression.' A multiple regression model consists of two or more explanatory variables. A brief explanation about multiple regression can be found in Section 14.1 of the book.


To do multiple regression in Minitab, the steps are the same as for simple regression, except that more explanatory variables are selected. We will continue to use the dataset pennstate1.mtw. Select Stat>Regression>Regression from the menu and, from the list of variables, select LftSpan for the Response dialog box and RtSpan, Height for the Predictors dialog box.

A portion of the regression output appears next. The regression equation is now in terms of the two explanatory variables (RtSpan and Height):

LftSpan = - 0.253 + 0.888 RtSpan + 0.0409 Height

Notice that the values of R² and R²adj are not equal here, as they were for simple regression.

Regression Analysis: LftSpan versus RtSpan, Height

The regression equation is
LftSpan = - 0.253 + 0.888 RtSpan + 0.0409 Height

Predictor      Coef   SE Coef      T      P
Constant    -0.2530    0.7701  -0.33  0.743
RtSpan      0.88759   0.02852  31.13  0.000
Height      0.04091   0.01453   2.82  0.005

S = 0.6272   R-Sq = 90.6%   R-Sq(adj) = 90.5%

Section 14.2 Estimating the Standard Deviation in Regression

1. Regression Output

The regression output includes several things in addition to the regression equation. Some of them were explained in Chapter 5. To review these ideas, open the file signdist.mtw, which contains observations for two variables. To open the file, from the menu select File>Open Worksheet.


Then select the name of the file signdist.mtw. This data set was studied in Chapter 5 (Example 5.2) and appears again in Example 14.3. The response variable is the maximum Distance at which a driver can read a sign, and the explanatory variable is the Age of the driver. To perform the regression analysis, from the menu select Stat>Regression>Regression. Select from the list the Response variable (Distance) and the Predictor (Age).

In the output below, in addition to the regression equation, you can identify:

• The slope is -3.0068. (Interpretation: For each additional year of age, the distance at which a sign becomes legible diminishes by 3 feet.)

• The intercept is 576.68. (We don't interpret the intercept in this case because there are no drivers who are 0 years of age.)

• R-Sq = 64.2%, the coefficient of determination or R². (Interpretation: Not all drivers require the same distance to be able to read the sign. There is variability in the required distances; 64.2% of that variability is explained by the fact that not all drivers are the same age.)

• Total sum of squares is SSTO = 193667. (The 'total variability': the sum of the squared differences between the observed distance values and the mean of the distances, Σ(yi − ȳ)² = 193667.)

• Sum of squared errors is SSE = 69334. (For a given age of the driver, the residual (error) is the difference between the observed distance and the predicted distance. SSE is the sum of squared residuals, SSE = Σ(yi − ŷi)², also called the 'unexplained variability'.)


Results for: signdist.MTW

Regression Analysis: Distance versus Age

The regression equation is
Distance = 577 - 3.01 Age

Predictor     Coef   SE Coef      T      P
Constant    576.68     23.47  24.57  0.000
Age        -3.0068    0.4243  -7.09  0.000

S = 49.76   R-Sq = 64.2%   R-Sq(adj) = 62.9%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  124333  124333  50.21  0.000
Residual Error  28   69334    2476
Total           29  193667

Unusual Observations
Obs   Age  Distance     Fit  SE Fit  Residual  St Resid
 27  75.0    460.00  351.17   13.65    108.83     2.27R

R denotes an observation with a large standardized residual

Notice that R² can be calculated from SSTO and SSE:

R² = Explained variability / Total variability = (SSTO − SSE)/SSTO = (193667 − 69334)/193667 = 0.642
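The same arithmetic can be checked directly; a minimal Python sketch using the sums of squares from the ANOVA table above:

```python
# R-squared as explained variability over total variability,
# using SSTO and SSE from the ANOVA table above.
SSTO = 193667.0  # Total sum of squares
SSE = 69334.0    # Residual Error sum of squares

r_sq = (SSTO - SSE) / SSTO  # about 0.642, i.e. R-Sq = 64.2%
```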

2. The Standard Deviation in Regression

Section 14.2 explains how to obtain the standard deviation in regression analysis. The formula is s = sqrt(SSE/(n − 2)). For the data of Example 14.3 (see regression output above), s = sqrt(69334/(30 − 2)) = 49.8.
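The same calculation in Python (n = 30 drivers, as shown by Total DF = 29 in the ANOVA table):

```python
import math

# Standard deviation in regression: s = sqrt(SSE / (n - 2)).
SSE = 69334.0
n = 30  # number of observations in signdist.mtw

s = math.sqrt(SSE / (n - 2))  # about 49.76, matching S in the output
```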

The value of s also appears in the regression output (S = 49.76).

Section 14.3 Inference about the Linear Regression Relationship

Several questions can be posed about the population values of the parameters in the model and the population correlation. The three main questions are:

• Can the population slope be considered different from 0?
• Between what values is the population slope likely to lie?
• Can the correlation in the population be considered different from 0?

In order to answer these questions we need to conduct tests of hypotheses about the population slope and the population correlation, and estimate the population slope using a confidence interval.


1. Testing hypotheses about the population slope

The question 'Can the population slope be considered different from 0?' is translated into the null and alternative hypotheses H0: β1 = 0 vs. Ha: β1 ≠ 0. For Example 14.3 (data file signdist.mtw), where the response variable is the Distance required to read a sign and the explanatory variable is the Age of the driver, the value of the test statistic is

t = (b1 − 0)/s.e.(b1) = −3.0068/0.4243 = −7.09

Note that the values necessary for that calculation, the value of the test statistic (−7.09), and the corresponding p-value (0.000) are part of the regression output. The p-value indicates that the null hypothesis should be rejected; we conclude that the variable Age is useful for explaining Distance and should be retained in the model.

Regression Analysis: Distance versus Age

The regression equation is
Distance = 577 - 3.01 Age

Predictor     Coef   SE Coef      T      P
Constant    576.68     23.47  24.57  0.000
Age        -3.0068    0.4243  -7.09  0.000

S = 49.76   R-Sq = 64.2%   R-Sq(adj) = 62.9%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  124333  124333  50.21  0.000
Residual Error  28   69334    2476
Total           29  193667
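The test statistic can be recomputed from the output values; a short Python check:

```python
# t = (b1 - 0) / s.e.(b1), using the Age row of the output above.
b1 = -3.0068    # estimated slope
se_b1 = 0.4243  # standard error of the slope

t = (b1 - 0) / se_b1  # about -7.09, matching T for Age
```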

2. Finding a confidence interval for the population slope

The formula for the confidence interval for the slope is b1 ± t* × s.e.(b1). Minitab does not automatically calculate this confidence interval, but the regression output contains the values needed to perform the calculation.

Regression Analysis: Distance versus Age

The regression equation is
Distance = 577 - 3.01 Age

Predictor     Coef   SE Coef      T      P
Constant    576.68     23.47  24.57  0.000
Age        -3.0068    0.4243  -7.09  0.000

S = 49.76   R-Sq = 64.2%   R-Sq(adj) = 62.9%


Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  124333  124333  50.21  0.000
Residual Error  28   69334    2476
Total           29  193667

b1 ± t* × s.e.(b1) becomes −3.0068 ± t* × 0.4243; we still need to find the multiplier t*. To calculate t*, use Calc>Probability Distributions>t and complete the required information in the dialog boxes. Given an area, we need to find a t* value, so select Inverse Cumulative Probability. Note that the number of degrees of freedom (28) appears in the regression output. For a 95% confidence interval, the necessary multiplier is the value t* such that the area under the curve up to t* is 0.975.

After clicking OK, the following output appears in the session window:

Inverse Cumulative Distribution Function

Student's t distribution with 28 DF

P( X <= x )        x
      0.975  2.04841

Hence, t* = 2.04841 and the 95% confidence interval for the population slope is −3.0068 ± 2.04841 × 0.4243. The limits of the confidence interval are:

−3.0068 − 2.04841 × 0.4243 = −3.88 feet
−3.0068 + 2.04841 × 0.4243 = −2.14 feet

We are 95% confident that for each additional year of age, the distance at which a person is able to read the sign decreases, on average, by somewhere between 2.14 and 3.88 feet.
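The interval arithmetic can be replicated in Python:

```python
# 95% confidence interval for the population slope: b1 ± t* × s.e.(b1).
b1 = -3.0068
se_b1 = 0.4243
t_star = 2.04841  # t multiplier for 28 df, cumulative probability 0.975

lower = b1 - t_star * se_b1  # about -3.88
upper = b1 + t_star * se_b1  # about -2.14
```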


3. Testing hypotheses about the population correlation coefficient

The hypotheses are H0: ρ = 0 vs. Ha: ρ ≠ 0.

We will conduct this test of hypotheses for the two variables (Age, Distance) in the data set signdist.mtw. The test in the book is for another dataset so the output displayed here is different from the results in the book. In order to perform the test of hypotheses using Minitab, select Stat>Basic Statistics>Correlation. In the Correlation window, select Age, Distance for the Variables dialog box and check the option Display p values.

In the output, the p-value corresponding to the test statistic for testing H0: ρ = 0 vs. Ha: ρ ≠ 0 will appear. The p-value is equal to 0 up to the third decimal place. Since the p-value is very small, the null hypothesis H0: ρ = 0 is rejected, and we conclude that there is an association between the maximum distance at which a driver is able to read a sign and the age of the driver.

Correlations: Age, Distance

Pearson correlation of Age and Distance = -0.801
P-Value = 0.000

It is not a coincidence that the p-value of the test of H0: ρ = 0 is equal to the p-value of the test of H0: β1 = 0 in the regression output. In simple linear regression, both tests arrive at the same conclusion, due to the relationship between the slope and the correlation.
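That relationship can be sketched numerically: in simple regression, r² equals R-Sq, and the usual t statistic for testing ρ = 0, t = r√(n − 2)/√(1 − r²), matches the t statistic for the slope:

```python
import math

# Connection between the correlation and the slope test in simple regression.
r = -0.801  # Pearson correlation of Age and Distance
n = 30      # number of observations in signdist.mtw

r_sq = r ** 2  # about 0.642, matching R-Sq = 64.2%
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
# t is about -7.08, agreeing with the slope t of -7.09 up to the rounding of r
```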


Section 14.4 Prediction interval for a specified value of x

A regression model can be used to predict unknown values of the response variable (Y) from specified values of the explanatory variable (X). In Example 14.3-continued, prediction intervals are calculated for the ages 21, 30, and 45. To obtain the prediction intervals for these ages in Minitab, first type the ages 21, 30, and 45 into an empty column (C3). Then select Stat>Regression>Regression. In the regression window, select the appropriate variables and click Options to make the Options window appear. Once there, place the cursor in the Prediction intervals for new observations dialog box and then select column C3.

Checking the Prediction limits box will store the limits (bounds) of the prediction intervals for each age (21, 30, and 45) into two columns (PLIM1, PLIM2) of the worksheet. If the box is not checked, the prediction intervals will appear in the session window only. Click OK to return to the Regression window, and then click OK again. A portion of the regression output is shown below, with the prediction intervals appearing under the column labeled 95% PI.

Predicted Values for New Observations

New Obs     Fit  SE Fit          95% CI            95% PI
      1  513.54   15.64  (481.50, 545.57)  (406.69, 620.39)
      2  486.48   12.73  (460.41, 512.54)  (381.26, 591.69)
      3  441.37    9.44  (422.05, 460.70)  (337.63, 545.12)

Values of Predictors for New Observations

New Obs   Age
      1  21.0
      2  30.0
      3  45.0
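The first prediction interval can be rebuilt from quantities in the output. For a new observation in simple regression, the 95% PI is fit ± t*·√(s² + SE Fit²); a Python sketch for Age = 21:

```python
import math

# Rebuilding the first prediction interval (Age = 21) from the output.
b0, b1 = 576.68, -3.0068  # intercept and slope
s, se_fit = 49.76, 15.64  # S and SE Fit from the output
t_star = 2.04841          # t multiplier, 28 df, 95%

fit = b0 + b1 * 21  # about 513.54
margin = t_star * math.sqrt(s ** 2 + se_fit ** 2)
lower, upper = fit - margin, fit + margin  # about (406.7, 620.4)
```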


Section 14.5 Prediction intervals for y and confidence intervals for E(Y)

A confidence interval can be used to estimate the mean value of the response variable Y for a specified value of the explanatory variable X. Example 14.2-continued displays the Minitab output for estimating the mean Weight (Y) in the population of college men for each of three Heights (X): 68 inches, 70 inches, and 72 inches. The data can be found in the file wtheightM.mtw. Open the wtheightM.mtw file by using File>Open Worksheet:

In order to obtain the confidence intervals for the three Heights (X): 68 inches, 70 inches, and 72 inches in Minitab we will first type the heights 68, 70, and 72 in an empty column (C4).


Use Stat>Regression>Regression and then fill in the dialog boxes as shown below. To get the confidence intervals for each of the three heights click Options.

Once there, place the cursor in the Prediction intervals for new observations dialog box and then select column C4. Checking the Confidence limits box will store the limits (bounds) of the confidence intervals for each of the heights 68, 70, and 72 into two columns (CLIM1, CLIM2) of the worksheet. If the box is not checked, the confidence intervals will appear in the session window only.


The regression output appears next. The confidence intervals appear under the column labeled 95.0% CI.

The regression equation is
Weight = - 318 + 7.00 Height

Predictor     Coef   SE Coef      T      P
Constant    -317.9     110.9  -2.87  0.007
Height       6.996     1.581   4.42  0.000

S = 24.00   R-Sq = 32.3%   R-Sq(adj) = 30.7%

Analysis of Variance

Source          DF     SS     MS      F      P
Regression       1  11277  11277  19.58  0.000
Residual Error  41  23617    576
Total           42  34894

Unusual Observations
Obs  Height  Weight     Fit  SE Fit  Residual  St Resid
 19    73.0  240.00  192.78    5.85     47.22     2.03R
 33    71.0  237.00  178.79    3.92     58.21     2.46R

R denotes an observation with a large standardized residual

Predicted Values for New Observations

New Obs     Fit  SE Fit           95.0% CI           95.0% PI
      1  157.80    4.96  (147.78, 167.81)  (108.31, 207.29)
      2  171.79    3.66  (164.39, 179.19)  (122.76, 220.82)
      3  185.78    4.72  (176.25, 195.31)  (136.38, 235.18)

Values of Predictors for New Observations

New Obs  Height
      1    68.0
      2    70.0
      3    72.0
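The first confidence interval can be rebuilt similarly: for the mean response, the 95% CI is fit ± t*·SE Fit. A sketch for Height = 68; the t* value for 41 df (about 2.0195) is assumed here, and small discrepancies come from the rounded coefficients:

```python
# Rebuilding the first confidence interval (Height = 68) from the output.
b0, b1 = -317.9, 6.996  # intercept and slope (rounded, from the output)
se_fit = 4.96           # SE Fit for Height = 68
t_star = 2.0195         # assumed t multiplier for 41 df, 95%

fit = b0 + b1 * 68             # about 157.8
lower = fit - t_star * se_fit  # about 147.8
upper = fit + t_star * se_fit  # about 167.8
```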


Section 14.6 Checking conditions for using regression models for inference

1. Residual Plots

In Section 14.6, residual plots are used to check the assumptions of a linear regression model. Figure 14.8 displays a plot of the residuals versus x for the model Weight = - 318 + 7.00 Height. Recall that the data are in the file wtheightM.mtw. The residual plots in Figures 14.8 and 14.9 can be produced using Stat>Regression>Regression. Fill in the dialog boxes as shown below and then click Graphs.

In the Regression Graphs window, indicate that you want to work with the Regular residuals (y − ŷ). Check Histogram of residuals and Residuals versus fits, and in the Residuals versus the variables dialog box select the explanatory variable Height.


The option Residuals versus variables (Height) produces the graph below (Figure 14.8 in the textbook). A horizontal reference line is drawn at 0.

[Figure: Residuals Versus Height (response is Weight)]

The option Residuals versus fits produces the graph below.


[Figure: Residuals Versus the Fitted Values (response is Weight)]

Note. In simple linear regression the plot of residuals versus x and the plot of residuals versus fitted values (predicted values) will show the same pattern. This is not necessarily true in multiple regression. In multiple regression, the residual plots should include: residuals versus fitted values and residuals versus all predictor variables. The option Histogram of residuals produces the graph below. (Figure 14.9 in the textbook)

[Figure: Histogram of the Residuals (response is Weight)]

The graphs above indicate that the straight-line model was the appropriate model to represent the relationship between Height and Weight for this particular data set. Figure 14.10 in the book is an example of a graph in which the model was not appropriate.
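As the note above says, in simple regression the residuals-versus-x plot and the residuals-versus-fits plot show the same pattern, because the fitted values are themselves a linear function of x. A small sketch with hypothetical heights illustrates this: the points appear in the same left-to-right order on both plots.

```python
# In simple regression, fit = b0 + b1*x, so plotting residuals against x
# or against the fits only rescales the horizontal axis.
xs = [65.0, 68.0, 70.0, 72.0, 75.0]  # hypothetical heights
b0, b1 = -318.0, 7.0                 # rounded model from above

fits = [b0 + b1 * x for x in xs]
# With a positive slope, ordering the points by x and by fit agrees:
order_by_x = sorted(range(len(xs)), key=lambda i: xs[i])
order_by_fit = sorted(range(len(xs)), key=lambda i: fits[i])
```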


Optional. In the Regression Graphs window, to display a layout with a histogram of residuals, a normal probability plot of residuals, a plot of residuals versus fits, and a plot of residuals versus order, select Four in one:

Clicking OK produces the following graphs.

[Figure: Residual Plots for Weight, a four-in-one layout: Normal Probability Plot of the Residuals, Residuals Versus the Fitted Values, Histogram of the Residuals, Residuals Versus the Order of the Data]

The normal probability plot is a more advanced plot used to check whether the residuals are normally distributed (condition 4 in Section 14.6). The plot of residuals versus order is used to determine whether the observations in the sample are independent of each other (condition 5 in Section 14.6).


2. Using transformations on the variables

Sometimes the relationship between two variables is not linear and hence a simple linear (straight-line) regression model is not appropriate. However, in many of these situations a simple linear regression model can be applied after transforming one or more of the variables. This type of solution is explained at the end of Section 14.6 of the text. An example, not provided in the book, of a natural log transformation of the y variable is shown next. The data below correspond to the number of bacteria and the time (in hours) passed since the beginning of the experiment.

time  Bacteria
   0        32
   1        47
   2        65
   3        92
   4       132
   5       190
   6       275

The plot below indicates that the relationship is strong but not linear. This is a type of relationship that can be transformed into a linear one by working with the natural log of bacteria.

[Figure: Scatterplot of Bacteria vs time]

Open a new worksheet using File>New and type in the values of the two variables: the values of time into column C1 and the values of bacteria into column C2. Name column C3 'ln Y'. To obtain the logarithms we have two options. The first option is to type, at the MTB> prompt:

MTB > loge c2 c3

The second option is to select Calc>Calculator from the menu and fill in the dialog boxes as shown below. Note that the Natural log function can be typed (LOGE) or selected from the Functions window. If the function Natural log was selected, then the Expression dialog box will show: LOGE(number). Double-click on bacteria to replace number.

After clicking OK, the worksheet will look like:

Now we can use Graph>Scatterplot>Simple to plot the variable ln y vs. time.


[Figure: Scatterplot of ln Y vs time]

The graph shows a strong linear relationship so we will use Stat>Regression>Regression to obtain the linear regression model to explain the response variable ln y (logarithm of the number of bacteria) in terms of the explanatory variable time (in hours).

Regression Analysis: ln Y versus time

The regression equation is
ln Y = 3.47 + 0.356 time

Predictor       Coef    SE Coef       T      P
Constant     3.47031    0.01037  334.54  0.000
time        0.355545   0.002877  123.58  0.000

S = 0.0152239   R-Sq = 100.0%   R-Sq(adj) = 100.0%

The model is not in terms of 'bacteria' but in terms of 'ln bacteria' or 'ln Y'. If we want to express the model in terms of bacteria, antilogarithms (exp) can be used:

ln Y = 3.47 + 0.356 time

can be written as (we prefer to use Ŷ instead of simply Y)

Ŷ = e^3.47 × e^(0.356 × time)

where Ŷ denotes the estimated number of bacteria. Note that a nonlinear model y = c·d^x can be written, using logarithms, as ln y = ln c + x ln d. To obtain the values of e^3.47 and e^0.356, the following commands can be used:
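The fit can be reproduced with ordinary least squares on the logged counts; a self-contained Python sketch using the data listed earlier:

```python
import math

# Least-squares fit of ln(bacteria) on time, reproducing the Minitab
# equation ln Y = 3.47 + 0.356 time.
time = [0, 1, 2, 3, 4, 5, 6]
bacteria = [32, 47, 65, 92, 132, 190, 275]
ln_y = [math.log(y) for y in bacteria]

n = len(time)
x_bar = sum(time) / n
y_bar = sum(ln_y) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(time, ln_y))
      / sum((x - x_bar) ** 2 for x in time))  # about 0.3555
b0 = y_bar - b1 * x_bar                       # about 3.470
```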


MTB > let k1=expo(3.47)
MTB > let k2=expo(0.356)
MTB > print k1 k2

(Alternatively, the Calc>Calculator option could have been used.) The output was

K1   32.1367
K2   1.42761

Since e^3.47 = 32.1367 and e^0.356 = 1.42761, the nonlinear model for bacteria can be written as

Ŷ = 32.1367 × (1.42761)^time
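A quick Python check of the back-transformation, using the values from the output above:

```python
import math

# Back-transforming the log-linear model: Y-hat = e^3.47 * (e^0.356)^time.
c = math.exp(3.47)   # about 32.14  (Minitab's K1)
d = math.exp(0.356)  # about 1.428  (Minitab's K2)

pred_t6 = c * d ** 6  # about 272, close to the observed count of 275 at time 6
```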

To see the graph of this model, create a new variable Ŷ (type the name Y hat in C4) by typing at the MTB> prompt

MTB> let c4=32.1367*(1.42761**C1)

The worksheet will look like the following:

Note that the values in the columns bacteria and Y hat are nearly the same (they would be even closer if we had not rounded the decimals in the linear regression model). To obtain the graph of the model, use Graph>Scatterplot>With Connect Line to plot Y hat vs. time and connect the points.


The graph of the model is displayed below:

[Figure: Scatterplot of Y Hat vs time]
