Page 1: Multiple  Regression

Multiple Regression
Chapter 18

Page 2: Multiple  Regression

18.1 Introduction

• In this chapter we extend the simple linear regression model to allow for any number of independent variables.

• We expect to build a model that fits the data better than the simple linear regression model.

Page 3: Multiple  Regression

• We will use the computer printout to
  – Assess the model
    • How well does it fit the data?
    • Is it useful?
    • Are any required conditions violated?
  – Employ the model
    • Interpreting the coefficients
    • Predictions using the prediction equation
    • Estimating the expected value of the dependent variable

Page 4: Multiple  Regression

18.2 Model and Required Conditions

• We allow for k independent variables to potentially be related to the dependent variable:

y = β0 + β1x1 + β2x2 + … + βkxk + ε

where y is the dependent variable, x1, …, xk are the independent variables, β0, β1, …, βk are the coefficients, and ε is the random error variable.

Page 5: Multiple  Regression

[Figure: a straight line in the (x, y) plane vs. a plane in (x1, x2, y) space]

The simple linear regression model allows for one independent variable, "x":

y = β0 + β1x + ε

The multiple linear regression model allows for more than one independent variable:

y = β0 + β1x1 + β2x2 + ε

Note how the straight line becomes a plane, and...

Page 6: Multiple  Regression

[Figure: a parabola in the (x, y) plane vs. a parabolic surface in (x1, x2, y) space]

... a parabola becomes a parabolic surface:

y = b0 + b1x²   becomes   y = b0 + b1x1² + b2x2

Page 7: Multiple  Regression

• Required conditions for the error variable ε
  – The error ε is normally distributed with mean equal to zero and a constant standard deviation σε (independent of the value of y). σε is unknown.
  – The errors are independent.

• These conditions are required in order to
  – estimate the model coefficients,
  – assess the resulting model.

Page 8: Multiple  Regression

18.3 Estimating the Coefficients and Assessing the Model

• The procedure
  – Obtain the model coefficients and statistics using statistical software.
  – Assess the model fit and usefulness using the model statistics.
  – Diagnose violations of the required conditions. Try to remedy problems when identified.
  – If the model passes the assessment tests, use it to interpret the coefficients and generate predictions.

Page 9: Multiple  Regression

Example 18.1 Where to locate a new motor inn?

– La Quinta Motor Inns is planning an expansion.
– Management wishes to predict which sites are likely to be profitable.
– Several areas where predictors of profitability can be identified are:
  • Competition
  • Market awareness
  • Demand generators
  • Demographics
  • Physical quality

Page 10: Multiple  Regression

[Diagram: profitability (Margin) and its potential predictors]

  • Competition: Rooms – number of hotel/motel rooms within 3 miles of the site.
  • Market awareness: Nearest – distance to the nearest La Quinta inn.
  • Customers (demand generators): Office space; College enrollment.
  • Community: Income – median household income.
  • Physical: Disttwn – distance to downtown.

Page 11: Multiple  Regression

– Data were collected from 100 randomly selected inns that belong to La Quinta, and the following suggested model was run:

Margin = β0 + β1Rooms + β2Nearest + β3Office + β4College + β5Income + β6Disttwn + ε

INN  MARGIN  ROOMS  NEAREST  OFFICE  COLLEGE  INCOME  DISTTWN
1    55.5    3203   0.1      549     8        37      12.1
2    33.8    2810   1.5      496     17.5     39      0.4
3    49      2890   1.9      254     20       39      12.2
4    31.9    3422   1        434     15.5     36      2.7
5    57.4    2687   3.4      678     15.5     32      7.9
6    49      3759   1.4      635     19       41      4
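To reproduce this analysis outside Excel, here is a minimal sketch using Python's statsmodels, assuming the data above is saved in a file named laquinta.csv (the file name and column layout are assumptions):

```python
# A sketch of fitting the suggested model, assuming the La Quinta data
# is in "laquinta.csv" with the columns shown above (names are assumptions).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("laquinta.csv")
predictors = ["ROOMS", "NEAREST", "OFFICE", "COLLEGE", "INCOME", "DISTTWN"]
X = sm.add_constant(df[predictors])    # adds the intercept column
model = sm.OLS(df["MARGIN"], X).fit()  # ordinary least squares
print(model.summary())                 # coefficients, t stats, R-square, F test
```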

Page 12: Multiple  Regression

• Excel output

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.724611
R Square            0.525062
Adjusted R Square   0.49442
Standard Error      5.512084
Observations        100

ANOVA
            df   SS        MS        F         Significance F
Regression  6    3123.832  520.6387  17.13581  3.03E-13
Residual    93   2825.626  30.38307
Total       99   5949.458

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  72.45461      7.893104        9.179483  1.11E-14  56.78049   88.12874
ROOMS      -0.00762      0.001255        -6.06871  2.77E-08  -0.01011   -0.00513
NEAREST    -1.64624      0.632837        -2.60136  0.010803  -2.90292   -0.38955
OFFICE     0.019766      0.00341         5.795594  9.24E-08  0.012993   0.026538
COLLEGE    0.211783      0.133428        1.587246  0.115851  -0.05318   0.476744
INCOME     -0.41312      0.139552        -2.96034  0.003899  -0.69025   -0.136
DISTTWN    0.225258      0.178709        1.260475  0.210651  -0.12962   0.580138

This is the sample regression equation (sometimes called the prediction equation):

MARGIN = 72.455 - 0.008ROOMS - 1.646NEAREST + 0.02OFFICE + 0.212COLLEGE - 0.413INCOME + 0.225DISTTWN

Let us assess this equation.

Page 13: Multiple  Regression

• Standard error of estimate
  – We need to estimate the standard error of estimate:

    sε = √[ SSE / (n − k − 1) ]

  – Compare sε to the mean value of y:
    • From the printout, Standard Error = 5.5121
    • Calculating the mean value of y we have ȳ = 45.739

  – It seems sε is not particularly small.
  – Can we conclude the model does not fit the data well?
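As a quick check, plugging the ANOVA values from the printout into this formula reproduces the reported standard error:

```python
# Reproduce the printout's Standard Error from the ANOVA table values.
import math

SSE, n, k = 2825.626, 100, 6
s_eps = math.sqrt(SSE / (n - k - 1))
print(round(s_eps, 4))  # 5.5121, matching the Excel output
```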

Page 14: Multiple  Regression

• Coefficient of determination
  – The definition is

    R² = 1 − SSE / Σ(yᵢ − ȳ)²

  – From the printout, R² = 0.5251
  – 52.51% of the variation in the measure of profitability is explained by the linear regression model formulated above.
  – When adjusted for degrees of freedom,
    Adjusted R² = 1 − [SSE/(n−k−1)] / [SS(Total)/(n−1)] = 49.44%
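Both quantities follow directly from the ANOVA sums of squares; a quick check:

```python
# Reproduce R-square and adjusted R-square from the ANOVA sums of squares.
SSE, SS_total, n, k = 2825.626, 5949.458, 100, 6

r2 = 1 - SSE / SS_total
adj_r2 = 1 - (SSE / (n - k - 1)) / (SS_total / (n - 1))
print(round(r2, 4), round(adj_r2, 4))  # 0.5251 0.4944
```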

Page 15: Multiple  Regression

• Testing the validity of the model
  – We pose the question:
    Is there at least one independent variable linearly related to the dependent variable?
  – To answer the question we test the hypothesis

    H0: β1 = β2 = … = βk = 0
    H1: At least one βi is not equal to zero.

  – If at least one βi is not equal to zero, the model is valid.

Page 16: Multiple  Regression

• To test these hypotheses we perform an analysis of variance procedure.

• The F test
  – Construct the F statistic:

    F = MSR / MSE,  where MSR = SSR / k and MSE = SSE / (n − k − 1)

  – Rejection region: F > Fα,k,n−k−1

• [Variation in y] = SSR + SSE. A large F results from a large SSR; then much of the variation in y is explained by the regression model, the null hypothesis should be rejected, and the model is deemed valid. (The required conditions must be satisfied.)
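A sketch of the same computation, using the ANOVA values reported for Example 18.1:

```python
# Build the F statistic from the ANOVA table and find its p-value.
from scipy import stats

SSR, SSE, n, k = 3123.832, 2825.626, 100, 6
MSR = SSR / k
MSE = SSE / (n - k - 1)
F = MSR / MSE
p_value = stats.f.sf(F, k, n - k - 1)  # upper-tail area
print(round(F, 2), p_value)            # 17.14 and about 3.0e-13
```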

Page 17: Multiple  Regression

• Example 18.1 - continued

• Excel provides the following ANOVA results:

ANOVA
            df   SS        MS        F         Significance F
Regression  6    3123.832  520.6387  17.13581  3.03382E-13
Residual    93   2825.626  30.38307
Total       99   5949.458

(In the SS column: SSR = 3123.832 and SSE = 2825.626; in the MS column: MSR = 520.6387 and MSE = 30.38307; F = MSR/MSE.)

Page 18: Multiple  Regression

• Example 18.1 - continued

Fα,k,n−k−1 = F0.05,6,100−6−1 = 2.17
F = 17.14 > 2.17

Also, the p-value (Significance F) = 3.03382×10⁻¹³.
Clearly, α = 0.05 > 3.03382×10⁻¹³, and the null hypothesis is rejected.

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the βi is not equal to zero. Thus, at least one independent variable is linearly related to y. This linear regression model is valid.

Page 19: Multiple  Regression

• Let us interpret the coefficients

– b0 = 72.5. This is the intercept, the value of y when all the variables take the value zero. Since the data range of the independent variables does not cover the value zero, do not interpret the intercept.

– b1 = −0.0076. In this model, for each additional 1000 rooms within 3 miles of the La Quinta inn, the operating margin decreases on average by 7.6% (assuming the other variables are held constant).

Page 20: Multiple  Regression

– b2 = −1.65. In this model, for each additional mile of distance between the nearest competitor and the La Quinta inn, the average operating margin decreases by 1.65%.

– b3 = 0.02. For each additional 1000 sq-ft of office space, the average operating margin increases by 0.02%.

– b4 = 0.21. For each additional thousand students, MARGIN increases by 0.21%.

– b5 = −0.41. For each additional $1000 of median household income, MARGIN decreases by 0.41%.

– b6 = 0.23. For each additional mile to the downtown center, MARGIN increases by 0.23% on average.

Page 21: Multiple  Regression

• Testing the coefficients
  – The hypotheses for each βi:

    H0: βi = 0
    H1: βi ≠ 0

  – Test statistic:

    t = bi / s_bi,   d.f. = n − k − 1

  – Excel printout:

           Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
Intercept  72.45461      7.893104        9.179483  1.11E-14  56.78048735   88.12874
ROOMS      -0.00762      0.001255        -6.06871  2.77E-08  -0.010110582  -0.00513
NEAREST    -1.64624      0.632837        -2.60136  0.010803  -2.902924523  -0.38955
OFFICE     0.019766      0.00341         5.795594  9.24E-08  0.012993085   0.026538
COLLEGE    0.211783      0.133428        1.587246  0.115851  -0.053178229  0.476744
INCOME     -0.41312      0.139552        -2.96034  0.003899  -0.690245235  -0.136
DISTTWN    0.225258      0.178709        1.260475  0.210651  -0.12962198   0.580138
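For instance, the ROOMS row of the printout can be reproduced from its coefficient and standard error:

```python
# Reproduce the ROOMS t statistic and two-sided p-value from the printout.
from scipy import stats

b, s_b, n, k = -0.00762, 0.001255, 100, 6
t = b / s_b
p = 2 * stats.t.sf(abs(t), n - k - 1)  # two-sided p-value, d.f. = n - k - 1
print(round(t, 2), p)                  # about -6.07 and 2.8e-08
```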

Page 22: Multiple  Regression

• Using the linear regression equation
  – The model can be used by
    • Producing a prediction interval for the particular value of y, for a given set of values of xi.
    • Producing an interval estimate for the expected value of y, for a given set of values of xi.
  – The model can be used to learn about relationships between the independent variables xi and the dependent variable y, by interpreting the coefficients βi.

Page 23: Multiple  Regression

• Example 18.1 - continued. Produce predictions.
  – Predict the MARGIN of an inn at a site with the following characteristics:
    • 3815 rooms within 3 miles,
    • Closest competitor 3.4 miles away,
    • 476,000 sq-ft of office space,
    • 24,500 college students,
    • $39,000 median household income,
    • 3.6 miles distance to downtown center.

MARGIN = 72.455 - 0.008(3815) - 1.646(3.4) + 0.02(476) + 0.212(24.5) - 0.413(39) + 0.225(3.6) = 37.1%

(The 37.1% figure is obtained with the full-precision coefficients from the printout; the rounded coefficients shown give about 35.8%.)
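The arithmetic behind this prediction, as a sketch using the full-precision coefficients:

```python
# Predicted margin for the site, using the full-precision coefficients.
coefs = {"const": 72.45461, "ROOMS": -0.00762, "NEAREST": -1.64624,
         "OFFICE": 0.019766, "COLLEGE": 0.211783, "INCOME": -0.41312,
         "DISTTWN": 0.225258}
site = {"const": 1, "ROOMS": 3815, "NEAREST": 3.4, "OFFICE": 476,
        "COLLEGE": 24.5, "INCOME": 39, "DISTTWN": 3.6}

margin = sum(coefs[name] * site[name] for name in coefs)
print(round(margin, 1))  # about 37.1
```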

Page 24: Multiple  Regression

18.4 Regression Diagnostics - II

• The required conditions for the model assessment to apply must be checked.
  – Is the error variable normally distributed?  → Draw a histogram of the residuals.
  – Is the error variance constant?  → Plot the residuals versus the predicted values ŷ.
  – Are the errors independent?  → Plot the residuals versus the time periods.
  – Can we identify outliers?
  – Is multicollinearity a problem?

Page 25: Multiple  Regression

• Example 18.2 House price and multicollinearity
  – A real estate agent believes that a house selling price can be predicted using the house size, number of bedrooms, and lot size.
  – A random sample of 100 houses was drawn and data recorded.
  – Analyze the relationship among the four variables.

Price   Bedrooms  H Size  Lot Size
124100  3         1290    3900
218300  4         2080    6600
117800  3         1250    3750
…       …         …       …

Page 26: Multiple  Regression

• Solution
  – The proposed model is

    PRICE = β0 + β1BEDROOMS + β2H-SIZE + β3LOTSIZE + ε

  – Excel solution:

Regression Statistics
Multiple R          0.74833
R Square            0.559998
Adjusted R Square   0.546248
Standard Error      25022.71
Observations        100

ANOVA
            df   SS        MS        F        Significance F
Regression  3    7.65E+10  2.55E+10  40.7269  4.57E-17
Residual    96   6.01E+10  6.26E+08
Total       99   1.37E+11

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  37717.59      14176.74        2.660526  0.009145  9576.963   65858.23
Bedrooms   2306.081      6994.192        0.329714  0.742335  -11577.3   16189.45
H Size     74.29681      52.97858        1.402393  0.164023  -30.8649   179.4585
Lot Size   -4.36378      17.024          -0.25633  0.798244  -38.1562   29.42862

The model is valid, but no variable is significantly related to the selling price!

Page 27: Multiple  Regression

• However, when regressing the price on each independent variable alone, it is found that each variable is strongly related to the selling price.

• Multicollinearity is the source of this problem.

          Price     Bedrooms  H Size    Lot Size
Price     1
Bedrooms  0.645411  1
H Size    0.747762  0.846454  1
Lot Size  0.740874  0.83743   0.993615  1

• Multicollinearity causes two kinds of difficulties:
  – The t statistics appear to be too small.
  – The coefficients cannot be interpreted as "slopes".
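One way to screen for this problem is to inspect the predictor correlations and variance inflation factors (VIFs); a sketch, assuming the house data is saved in houses.csv with the columns above (file and column names are assumptions):

```python
# Screen for multicollinearity with correlations and VIFs.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

houses = pd.read_csv("houses.csv")  # hypothetical file with the columns above
print(houses.corr())                # note H Size vs. Lot Size: about 0.99

X = sm.add_constant(houses[["Bedrooms", "H Size", "Lot Size"]])
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```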

Page 28: Multiple  Regression

• Remedying violations of the required conditions

– Nonnormality or heteroscedasticity can be remedied using transformations on the y variable.

– The transformations can improve the linear relationship between the dependent variable and the independent variables.

– Many computer software systems allow us to make the transformations easily.

Page 29: Multiple  Regression

• A brief list of transformations
  » y′ = log y (for y > 0)
    • Use when σε increases with y, or
    • Use when the error distribution is positively skewed.
  » y′ = y²
    • Use when σε² is proportional to E(y), or
    • Use when the error distribution is negatively skewed.
  » y′ = y^(1/2) (for y > 0)
    • Use when σε² is proportional to E(y).
  » y′ = 1/y
    • Use when σε² increases significantly when y increases beyond some value.
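These transformations are one-liners in most software; a minimal sketch with illustrative values only:

```python
# Applying the listed transformations to a response vector y (y > 0).
import numpy as np

y = np.array([20.0, 24.0, 26.0, 30.0, 32.0])  # illustrative values only

y_log = np.log(y)      # y' = log y
y_sq = y ** 2          # y' = y^2
y_sqrt = np.sqrt(y)    # y' = y^(1/2)
y_inv = 1.0 / y        # y' = 1/y
```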

Page 30: Multiple  Regression

• Example 18.3: Analysis, diagnostics, transformations
  – A statistics professor wanted to know whether the time limit affects the marks on a quiz.
  – A random sample of 100 students was split into 5 groups.
  – Each student wrote a quiz, but each group was given a different time limit. See data below.

Marks (by time limit, in minutes):
Time  40  45  50  55  60
      20  24  26  30  32
      23  26  25  32  31
      .   .   .   .   .

Analyze these results, and include diagnostics.

Page 31: Multiple  Regression

The model tested: MARK = β0 + β1TIME + ε

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.86254
R Square            0.743974
Adjusted R Square   0.741362
Standard Error      2.304609
Observations        100

ANOVA
            df   SS      MS        F         Significance F
Regression  1    1512.5  1512.5    284.7743  9.42E-31
Residual    98   520.5   5.311224
Total       99   2033

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  -2.2          1.64582         -1.33672  0.184409  -5.46608   1.066077
Time       0.55          0.032592        16.87526  9.42E-31  0.485322   0.614678

This model is useful and provides a good fit.

[Histogram of the residuals: the errors seem to be normally distributed]

Page 32: Multiple  Regression

[Scatter plot: standardized errors vs. predicted mark]

The standard error of estimate seems to increase with the predicted value of y.

Two transformations are used to remedy this problem:
1. y′ = log_e y
2. y′ = 1/y

Page 33: Multiple  Regression

Let us see what happens when a transformation is applied.

[Left plot: the original data, where "Mark" is a function of "Time"]
[Right plot: the modified data, where "LogMark" is a function of "Time"]

For example, the two points (40, 18) and (40, 23) become (40, 2.89) and (40, 3.135):

log_e 23 = 3.135
log_e 18 = 2.89

Page 34: Multiple  Regression

The new regression analysis and the diagnostics are:

The model tested: LOGMARK = β′0 + β′1TIME + ε′

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8783
R Square            0.771412
Adjusted R Square   0.769079
Standard Error      0.084437
Observations        100

ANOVA
            df   SS        MS        F         Significance F
Regression  1    2.357901  2.357901  330.7181  3.58E-33
Residual    98   0.698705  0.00713
Total       99   3.056606

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  2.129582      0.0603          35.31632  1.51E-57  2.009918   2.249246
Time       0.021716      0.001194        18.18566  3.58E-33  0.019346   0.024086

Predicted LogMark = 2.1295 + .0217Time

This model is useful and provides a good fit.

Page 35: Multiple  Regression

[Histogram of the residuals: the errors seem to be normally distributed]

[Scatter plot: standard residuals vs. predicted LogMark]

The standard error still changes with the predicted y, but the change is smaller than before.

Page 36: Multiple  Regression

How do we use the modified model to predict? Let TIME = 55 minutes:

LogMark = 2.1295 + .0217Time = 2.1295 + .0217(55) = 3.323

To find the predicted mark, take the antilog:

antilog_e 3.323 = e^3.323 = 27.743
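The same back-transformation in code:

```python
# Predict a mark from the log-scale model, then take the antilog.
import math

time = 55
log_mark = 2.1295 + 0.0217 * time  # prediction on the log_e scale
mark = math.exp(log_mark)          # antilog: back to the original scale
print(round(log_mark, 3), round(mark, 2))  # 3.323 and about 27.74
```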

Page 37: Multiple  Regression

18.5 Regression Diagnostics - III

• The Durbin-Watson Test
  – This test detects first-order autocorrelation between consecutive residuals in a time series.
  – If autocorrelation exists, the error variables are not independent.

    d = Σ_{i=2}^{n} (rᵢ − r_{i−1})² / Σ_{i=1}^{n} rᵢ²

    where rᵢ is the residual at time i. The range of d is 0 ≤ d ≤ 4.
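The statistic is simple to compute from a residual series; a sketch (statsmodels also ships an equivalent durbin_watson routine):

```python
# Compute the Durbin-Watson statistic from a residual series.
import numpy as np

def durbin_watson_stat(residuals):
    r = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(r) ** 2) / np.sum(r ** 2)

# Equivalent library routine:
# from statsmodels.stats.stattools import durbin_watson
```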

Page 38: Multiple  Regression

[Plot: residuals over time showing positive first-order autocorrelation]

Positive first-order autocorrelation occurs when consecutive residuals tend to be similar. Then the value of d is small (less than 2).

[Plot: residuals over time showing negative first-order autocorrelation]

Negative first-order autocorrelation occurs when consecutive residuals tend to differ markedly. Then the value of d is large (greater than 2).

Page 39: Multiple  Regression

• One-tail test for positive first-order autocorrelation
  – If d < dL, there is enough evidence to show that positive first-order correlation exists.
  – If d > dU, there is not enough evidence to show that positive first-order correlation exists.
  – If d is between dL and dU, the test is inconclusive.

• One-tail test for negative first-order autocorrelation
  – If d > 4 − dL, negative first-order correlation exists.
  – If d < 4 − dU, negative first-order correlation does not exist.
  – If d falls between 4 − dU and 4 − dL, the test is inconclusive.

Page 40: Multiple  Regression

• Two-tail test for first-order autocorrelation
  – If d < dL or d > 4 − dL, first-order autocorrelation exists.
  – If d falls between dL and dU, or between 4 − dU and 4 − dL, the test is inconclusive.
  – If d falls between dU and 4 − dU, there is no evidence for first-order autocorrelation.

[Number line from 0 to 4 with critical points dL, dU, 2, 4 − dU, 4 − dL:
 "correlation exists" below dL and above 4 − dL; "inconclusive" between dL and dU
 and between 4 − dU and 4 − dL; "no evidence" between dU and 4 − dU]
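These decision rules are mechanical enough to encode directly; a sketch, where dL and dU come from a Durbin-Watson table for the chosen α, n, and k:

```python
# Two-tail Durbin-Watson decision rule; dL and dU are table values.
def dw_two_tail(d, dL, dU):
    if d < dL or d > 4 - dL:
        return "first-order autocorrelation exists"
    if dU <= d <= 4 - dU:
        return "no evidence of first-order autocorrelation"
    return "inconclusive"

print(dw_two_tail(0.59, 1.10, 1.54))  # values from Example 18.4 below
```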

Page 41: Multiple  Regression

• Example 18.4
  – How does the weather affect the sales of lift tickets in a ski resort?
  – Data on the past 20 years' ticket sales, along with the total snowfall and the average temperature during Christmas week in each year, were collected.
  – The model hypothesized was

    TICKETS = β0 + β1SNOWFALL + β2TEMPERATURE + ε

  – Regression analysis yielded the following results:

Page 42: Multiple  Regression

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.3464529
R Square            0.1200296
Adjusted R Square   0.0165037
Standard Error      1711.6764
Observations        20

ANOVA
            df   SS         MS         F       Significance F
Regression  2    6793798.2  3396899.1  1.1594  0.3372706
Residual    17   49807214   2929836.1
Total       19   56601012

           Coefficients  Standard Error  t Stat     P-value  Lower 95%  Upper 95%
Intercept  8308.0114     903.7285        9.1930391  5E-08    6401.3083  10214.715
Snowfall   74.593249     51.574829       1.4463111  0.1663   -34.22028  183.40678
Tempture   -8.753738     19.704359       -0.444254  0.6625   -50.32636  32.818884

The model seems to be very poor:
• The fit is very low (R Square = 0.12).
• It is not valid (Significance F = 0.33).
• No variable is linearly related to Sales.

Diagnosis of the required conditions resulted in the following findings:

Page 43: Multiple  Regression

[Plot: residuals over time — the errors are not independent]

[Plot: residuals vs. predicted y — the error variance is constant]

[Histogram: the error distribution — the errors may be normally distributed]

Page 44: Multiple  Regression

[Plot: the residuals over time]

Test for positive first-order autocorrelation:
n = 20, k = 2. From the Durbin-Watson table: dL = 1.10, dU = 1.54. The statistic d = 0.59.

Conclusion: Because d < dL, there is sufficient evidence to infer that positive first-order autocorrelation exists.

The residuals (first few): -2793.99, -1723.23, -2342.03, -956.955, -1963.73, …
Durbin-Watson statistic: d = 0.5931

Using the computer - Excel:
Tools > Data Analysis > Regression (check the residual option, then OK)
Tools > Data Analysis Plus > Durbin-Watson Statistic > highlight the range of the residuals from the regression run > OK

Page 45: Multiple  Regression

The autocorrelation has occurred over time. Therefore, a time-dependent variable added to the model may correct the problem.

The modified regression model:

TICKETS = β0 + β1SNOWFALL + β2TEMPERATURE + β3YEARS + ε

• All the required conditions are met for this model.
• The fit of this model is high: R² = 0.74.
• The model is useful: Significance F = 5.93E-05.
• SNOWFALL and YEARS are linearly related to ticket sales.
• TEMPERATURE is not linearly related to ticket sales.
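A sketch of this remedy in Python, assuming the ski data is in ski.csv with columns TICKETS, SNOWFALL, TEMPERATURE observed over 20 consecutive years (file and column names are assumptions):

```python
# Add a time-trend variable and refit, then recheck Durbin-Watson.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

ski = pd.read_csv("ski.csv")               # hypothetical file
ski["YEARS"] = np.arange(1, len(ski) + 1)  # time-dependent variable

X = sm.add_constant(ski[["SNOWFALL", "TEMPERATURE", "YEARS"]])
fit = sm.OLS(ski["TICKETS"], X).fit()
print(fit.rsquared, fit.f_pvalue, durbin_watson(fit.resid))
```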
