
Chapter 15

Multiple Regression

Regression

Multiple Regression Model: y = β0 + β1x1 + β2x2 + … + βpxp + ε

Multiple Regression Equation: E(y) = β0 + β1x1 + β2x2 + … + βpxp

Estimated Multiple Regression Equation

ŷ = b0 + b1x1 + b2x2 + … + bpxp

Car Data

MPG   Weight   Year   Cylinders
18    3504     70     8
15    3693     70     8
18    3436     70     8
16    3433     70     8
17    3449     70     8
15    4341     70     8
14    4354     70     8
14    4312     70     8
14    4425     70     8
15    3850     70     8
...

Continuing on for 397 observations
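As a sketch of how the estimates b0, …, bp are obtained by least squares (the file name cars.csv is illustrative; the dataset is assumed to hold the 397 observations with columns MPG, Weight, Year, Cylinders):

```python
import numpy as np
import pandas as pd

cars = pd.read_csv("cars.csv")  # illustrative file name for the 397-car dataset

y = cars["MPG"].to_numpy(dtype=float)
X = cars[["Weight", "Year", "Cylinders"]].to_numpy(dtype=float)

# Prepend a column of ones so the first estimate is the intercept b0.
X_design = np.column_stack([np.ones(len(y)), X])

# Ordinary least squares: minimizes SSE = sum((y - X_design @ b) ** 2).
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(b)  # [b0, b_weight, b_year, b_cylinders]
```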

Multiple Regression, Example

            Coefficients   Standard Error   t Stat
Intercept   46.3           0.800            57.8
Weight      -0.00765       0.000259         -29.4

R Square 0.687

            Coefficients   Standard Error   t Stat
Intercept   -14.7          3.96             -3.71
Weight      -0.00665       0.000214         -31.0
Year        0.763          0.0490           15.5

R Square 0.807

Multiple Regression, Example

            Coefficients   Standard Error   t Stat
Intercept   -14.4          4.03             -3.58
Weight      -0.00652       0.000460         -14.1
Year        0.760          0.0498           15.2
Cylinders   -0.0741        0.232            -0.319

R Square 0.807

Predicted MPG for a car weighing 4,000 lbs, built in 1980, with 6 cylinders (Year coded as 80):

-14.4 - 0.00652(4000) + 0.76(80) - 0.0741(6) = -14.4 - 26.08 + 60.8 - 0.4446 = 19.88
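The same prediction as a quick calculation in code (coefficients rounded as on the slide):

```python
# A quick check of the prediction above, using the rounded coefficients from the slide.
b0, b_weight, b_year, b_cyl = -14.4, -0.00652, 0.76, -0.0741

weight, year, cylinders = 4000, 80, 6  # Year is coded as two digits (1980 -> 80)

mpg_hat = b0 + b_weight * weight + b_year * year + b_cyl * cylinders
print(round(mpg_hat, 2))  # ~19.88
```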

Sums of Squares

SSE = Σ(yᵢ - ŷᵢ)²

SSR = Σ(ŷᵢ - ȳ)²

SST = Σ(yᵢ - ȳ)²

SST = SSR + SSE

Multiple Coefficient of Determination

The share of the variation in the dependent variable explained by the estimated model.

R2 = SSR/SST
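A minimal sketch of these quantities in code (the function name is illustrative; y holds the actual values and y_hat the values predicted by the estimated equation):

```python
import numpy as np

def sums_of_squares(y, y_hat):
    """Return SSE, SSR, SST, and R-squared for actual values y and fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    y_bar = y.mean()
    sse = np.sum((y - y_hat) ** 2)      # unexplained variation
    ssr = np.sum((y_hat - y_bar) ** 2)  # variation explained by the regression
    sst = np.sum((y - y_bar) ** 2)      # total variation (SST = SSR + SSE)
    return sse, ssr, sst, ssr / sst
```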

Multiple Correlation Coefficient

R = √R² = r(y, ŷ)

The correlation coefficient of the actual and predicted values

Adjusted Multiple Coefficient of Determination

Ra² = 1 - (1 - R²)(n - 1)/(n - p - 1)

Regression Statistics
Multiple R           0.898
R Square             0.807
Adjusted R Square    0.805
Standard Error       3.44
Observations         397
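A quick check of the adjusted R² using the reported summary values n = 397, p = 3, and R² = 0.807:

```python
# Recomputing the adjusted R-squared from the reported summary values.
n, p, r2 = 397, 3, 0.807  # observations, independent variables, R-squared (rounded)

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adj_r2)  # ~0.8055; matches the reported 0.805 up to rounding of the R-squared input
```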

F Test for Overall Significance

H0: β1 = β2 = … = βp = 0
Ha: One or more of the parameters is not equal to zero

Reject H0 if F > Fα, or if p-value < α

F = MSR/MSE

ANOVA Table for Multiple Regression Model

Source       Sum of Squares   Degrees of Freedom   Mean Squares            F
Regression   SSR              p                    MSR = SSR/p             F = MSR/MSE
Error        SSE              n - p - 1            MSE = SSE/(n - p - 1)
Total        SST              n - 1

ANOVA Example

ANOVA

             df    SS      MS     F     Significance F
Regression   3     19382   6460   547   6.42E-140
Residual     393   4638    11.8
Total        396   24021
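As a sketch, the F statistic and its p-value can be reproduced from the sums of squares above using SciPy's F distribution:

```python
from scipy.stats import f

ssr, sse = 19382, 4638   # sums of squares from the ANOVA table
p, n = 3, 397            # independent variables, observations

msr = ssr / p                 # mean square due to regression
mse = sse / (n - p - 1)       # mean square error
f_stat = msr / mse
p_value = f.sf(f_stat, p, n - p - 1)   # upper-tail area of the F distribution

print(round(f_stat), p_value)  # ~547 and a p-value on the order of 1e-140
```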

t Test for Coefficients

H0: β1 = 0
Ha: β1 ≠ 0

Reject H0 if t < -tα/2 or t > tα/2, or if p-value < α

t = b1 / s_b1, where s_b1 is the standard error of b1

Using a t distribution with n - p - 1 degrees of freedom

t Test Example

            Coefficients   Standard Error   t Stat    P-value
Intercept   -14.48         4.038            -3.587    0.0003769
Weight      -0.006525      0.0004603        -14.18    3.892E-37
Year        0.7608         0.04985          15.26     1.258E-41
Cylinders   -0.07420       0.2322           -0.3196   0.7494
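For example, the t statistic and two-tailed p-value for Cylinders can be reproduced with SciPy's t distribution (a sketch using the rounded values from the table above):

```python
from scipy.stats import t

b_cyl, se_cyl = -0.07420, 0.2322   # coefficient and standard error for Cylinders
n, p = 397, 3

t_stat = b_cyl / se_cyl
p_value = 2 * t.sf(abs(t_stat), df=n - p - 1)   # two-tailed p-value

print(round(t_stat, 4), round(p_value, 4))  # roughly -0.3196 and 0.749
```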

Multicollinearity

When two or more independent variables are highly correlated.

When multicollinearity is severe the estimated values of coefficients will be unreliable.

Multicollinearity

Two guidelines for identifying multicollinearity:
• The absolute value of the correlation coefficient for two independent variables exceeds 0.7
• An independent variable is more strongly correlated with another independent variable than with the dependent variable

Multicollinearity

Table of correlation coefficients:

            MPG      Weight   Year     Cylinders
MPG         1
Weight      -0.829   1
Year        0.578    -0.300   1
Cylinders   -0.773   0.895    -0.344   1
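A sketch of how such a table can be produced with pandas ("cars.csv" is an illustrative file name for the 397-observation dataset):

```python
import pandas as pd

# Assumes the car dataset has columns MPG, Weight, Year, Cylinders.
cars = pd.read_csv("cars.csv")

print(cars[["MPG", "Weight", "Year", "Cylinders"]].corr())  # pairwise correlations
```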

Multicollinearity

            Coefficients   Standard Error   t Stat
Intercept   -14.4          4.03             -3.58
Weight      -0.00652       0.000460         -14.1
Year        0.760          0.0498           15.2
Cylinders   -0.0741        0.232            -0.319

R Square 0.807

            Coefficients   Standard Error   t Stat
Intercept   -16.9          4.95             -3.42
Year        0.747          0.0612           12.21
Cylinders   -2.99          0.133            -22.46

R Square 0.708

Qualitative Variables and Regression

Quantitative variable – A variable that can be measured numerically (interval or ratio scale of measurement)

Qualitative variable – A variable where labels or names are used to identify some attribute (nominal or ordinal scale of measurement)

Qualitative Variables and Regression

The effect of a qualitative variable can be estimated using a dummy variable.

A dummy variable can equal 0 or 1; it creates different y-intercepts for groups with different attributes.

Qualitative Variables and Regression

Assume we estimate a regression model for the number of sick days an employee takes per year. A dummy variable is included that equals 1 if the individual smokes and 0 if they do not. Age is also included in the model.

Qualitative Variables and Regression

Estimated model: Sick days taken = -1 + (3)Smoker + (.1)Age

Example of how the data would be coded:

Sick Days   Smoker   Age
3           0        45
6           1        50
0           0        20
5           0        65
10          1        60

Dummy Variables

Sick days taken = -1 + (3)Smoker + (.1)Age

What is the y-intercept for nonsmokers? -1
What is the y-intercept for smokers? 2
What is the predicted number of sick days for a 40-year-old smoker? 6
What is the average difference in the number of sick days taken by smokers and nonsmokers? 3
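A small sketch of these calculations (the function name is illustrative):

```python
def predicted_sick_days(smoker, age):
    """Prediction from the estimated model: -1 + 3*Smoker + 0.1*Age (Smoker is 0 or 1)."""
    return -1 + 3 * smoker + 0.1 * age

print(predicted_sick_days(smoker=1, age=40))                    # 6.0
print(predicted_sick_days(1, 40) - predicted_sick_days(0, 40))  # 3.0, the smoker/nonsmoker gap
```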

Dummy Variables

If an attribute has three or more possible values you must include k-1 dummy variables in the model, where k is the number of possible values.

Dummy Variables

Suppose we have three job classifications: manager, operator, and secretary

Operator dummy equals 1 if the person is an operator, 0 otherwise

Secretary dummy equals 1 if the person is a secretary, 0 otherwise

Manager is the omitted group (choice of omitted group will not alter the predicted values)

Dummy Variables

Sick days taken = -1 + (1)Operator + (1.5)Secretary + (.1)Age

What are the y-intercepts for each job classification? Managers = -1, Operators = 0, Secretaries = 0.5
What is the predicted number of sick days for a 40-year-old secretary? 4.5
What is the average difference in the number of sick days taken by operators and secretaries? 0.5

Dummy Variables

In some cases there will be multiple sets of dummy variables, such as:
Sick days taken = -1 + (3)Smoker + (1)Operator + (1.5)Secretary + (.1)Age

Note that there are now 6 different intercepts:
Nonsmoker, Manager: -1 (omitted group)
Smoker, Manager: 2
Nonsmoker, Operator: 0
Smoker, Operator: 3
Nonsmoker, Secretary: 0.5
Smoker, Secretary: 3.5
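A short sketch that enumerates these six intercepts from the model's coefficients:

```python
# Enumerating the six intercepts implied by
# sick days = -1 + 3*Smoker + 1*Operator + 1.5*Secretary + 0.1*Age
# (manager is the omitted job classification).
jobs = {"Manager": (0, 0), "Operator": (1, 0), "Secretary": (0, 1)}

for job, (operator, secretary) in jobs.items():
    for smoker in (0, 1):
        intercept = -1 + 3 * smoker + 1 * operator + 1.5 * secretary
        label = "Smoker" if smoker else "Nonsmoker"
        print(f"{label}, {job}: {intercept}")
```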

Dummy Variables

Note that when dummy variables are used we are assuming that the coefficients of the other variables are the same for all groups.

In this example the increase in sick days taken from aging one year is 0.1 for all of the groups.

If there is reason to believe the effect of an independent variable differs by group, you may want to estimate separate equations for each group.

Nonlinear Relationships

Nonlinear relationships can be modeled by including a variable that is a nonlinear function of an independent variable.

For example it is usually assumed that health care expenditures increase at an increasing rate as people age.

Nonlinear Relationships

In that case you might try including age squared in the model:
Health expend = 500 + (5)Age + (.5)AgeSQ

Age   Health Expend
10    600
20    800
30    1100
40    1500
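A quick check of the table values from the illustrative model:

```python
def health_expend(age):
    """Prediction from the illustrative model: 500 + 5*Age + 0.5*Age^2."""
    return 500 + 5 * age + 0.5 * age ** 2

for age in (10, 20, 30, 40):
    print(age, health_expend(age))  # 600, 800, 1100, 1500
```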

Nonlinear Relationships

If the dependent variable increases at a decreasing rate as the independent variable rises you might want to include the square root of the independent variable.

If you are unsure of the nature of the relationship you can use dummy variables for different ranges of values of the independent variable.

Non-continuous Relationships

If the relationship between the dependent variable and an independent variable is non-continuous, a slope dummy variable can be used to estimate two sets of coefficients for the independent variable.

For example, if natural gas usage is not affected by temperature when the temperature rises above 60 degrees, we could have:

Gas usage = b0 + b1(GT60) + b2(Temp) + b3(GT60)(Temp)

where GT60 is a dummy variable that equals 1 when the temperature is above 60 degrees and 0 otherwise.

Non-continuous Relationships

Note that at temperatures above 60 degrees the net effect of a 1-degree increase in temperature on gas usage is -0.056 (-0.866 + 0.810).

              Coefficients   Standard Error   t Stat   P-value
Intercept     53.002         2.415            21.95    7.48E-18
GT60          -46.623        16.682           -2.79    0.0098
Temp          -0.866         0.0595           -14.56   1.02E-13
(GT60)(Temp)  0.810          0.255            3.18     0.0039
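A sketch of the fitted slope-dummy model, confirming the net slope above 60 degrees (coefficients taken from the table above; the function name is illustrative):

```python
def gas_usage(temp):
    """Prediction from the estimated slope-dummy model in the table above."""
    gt60 = 1 if temp > 60 else 0
    return 53.002 - 46.623 * gt60 - 0.866 * temp + 0.810 * gt60 * temp

# Below 60 degrees each extra degree changes usage by -0.866;
# above 60 degrees the net slope is -0.866 + 0.810 = -0.056.
print(round(gas_usage(51) - gas_usage(50), 3))  # -0.866
print(round(gas_usage(71) - gas_usage(70), 3))  # -0.056
```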