
Week 5: Multicollinearity. Use of dummy variables. Stepwise Regression & Model Building. [80-96]

Multicollinearity (p80)

Multicollinearity exists when two or more independent variables used in a regression are moderately or highly correlated. When multicollinearity exists:
- Regression results can be confusing and misleading. For example, in a multiple regression model all partial slopes may be nonsignificant even though the global F-test is significant.
- Signs of the regression coefficients might not make sense.

- Loss of one dimension of information.

In a designed experiment, one can eliminate multicollinearity by defining the explanatory variables in an appropriately uncorrelated fashion (p81).

- Example: p82 of the course notes.
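The notes work in Minitab throughout; purely as an illustrative sketch (not part of the course material), the Python snippet below shows one common numerical check for multicollinearity, the variance inflation factor (VIF). The data are simulated and all variable names (x1, x2, x3) are made up; large VIFs for x1 and x2 flag the near-collinearity built into the simulation, while x3 stays near 1.

# Illustrative sketch (not from the notes): variance inflation factors
# on simulated data with made-up variable names.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # x2 nearly collinear with x1
x3 = rng.normal(size=n)                   # x3 roughly uncorrelated with the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

for j in range(X.shape[1]):
    print(f"VIF for x{j+1}: {vif(X, j):.1f}")   # x1 and x2 large; x3 close to 1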

- Models with qualitative explanatory variables

Example: gen = 1 for female

Row   gpa   hsm  gen
  1  3.32   10    0
  2  2.26    6    0
  3  2.35    8    0
  4  2.08    9    0
  5  3.38    8    0
  6  3.29   10    0
  7  3.21    8    0
  8  2.00    3    0
  9  3.18    9    0
 10  2.34    7    0
 11  3.08    9    0
...
218  2.86    9    1
219  3.32   10    1
220  2.07    9    1
221  0.85    7    1
222  1.86    7    1
223  2.59    5    1
224  2.28    9    1


The regression equation is
gpa = 0.903 + 0.207 hsm + 0.0269 gen

Predictor  Coef     StDev    T     P
Constant   0.9029   0.2447   3.69  0.000
hsm        0.20704  0.02885  7.18  0.000
gen        0.02693  0.09874  0.27  0.785

S = 0.7043   R-Sq = 19.1%   R-Sq(adj) = 18.3%

Analysis of Variance
Source          DF    SS       MS      F      P
Regression        2   25.847   12.923  26.06  0.000
Residual Error  221  109.616    0.496
Total           223  135.463
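To interpret the dummy variable: a male student (gen = 0) with hsm = 8 has fitted gpa 0.903 + 0.207(8) = 2.56, while a female student (gen = 1) with the same hsm has 2.56 + 0.027 ≈ 2.59. The gen coefficient simply shifts the intercept, and its large p-value (0.785) indicates no evidence of a difference in mean gpa between genders once hsm is taken into account.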


If the qualitative variable has more than two levels (say, l levels), introduce l-1 dummy variables.
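As a small illustration (not from the notes, which set up the dummies in Minitab), the sketch below codes the four-level region factor from the hospital example that follows into three dummy variables, with W as the baseline level. Only the first five rows of the data are used, and pandas is assumed to be available.

# Minimal sketch: coding a 4-level factor (NC, NE, S, W) into l-1 = 3 dummies.
import pandas as pd

df = pd.DataFrame({
    "length":  [7.13, 8.82, 8.34, 8.95, 11.20],
    "nnurses": [241, 52, 54, 148, 151],
    "region":  ["W", "NC", "S", "W", "NE"],
})

# One dummy per non-baseline level: NC, NE, S (W is the reference category).
dummies = pd.get_dummies(df["region"])[["NC", "NE", "S"]].astype(int)
df = pd.concat([df, dummies], axis=1)
print(df)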

Example
Length = length of stay in hospital (days)
Nnurses = number of nurses
Region: there are 4 regions: NC, NE, S and W

Row  length  nnurses  region  NC  NE  S
  1    7.13     241     W      0   0  0
  2    8.82      52     NC     1   0  0
  3    8.34      54     S      0   0  1
  4    8.95     148     W      0   0  0
  5   11.20     151     NE     0   1  0
  6    9.76     106     NC     1   0  0
  7    9.68     129     S      0   0  1
  8   11.18     360     NC     1   0  0
  9    8.67     118     S      0   0  1
...
109   11.80     469     NC     1   0  0
110    9.50      46     S      0   0  1
111    7.70     136     W      0   0  0
112   17.94     407     NE     0   1  0
113    9.41      22     S      0   0  1

The regression equation is
length = 7.52 + 0.00401 nnurses + 1.42 NC + 2.80 NE + 1.03 S

Predictor  Coef      StDev     T      P
Constant   7.5218    0.4272   17.61   0.000
nnurses    0.004010  0.001083  3.70   0.000
NC         1.4178    0.4869    2.91   0.004
NE         2.8028    0.4988    5.62   0.000
S          1.0256    0.4744    2.16   0.033

S = 1.585   R-Sq = 33.7%   R-Sq(adj) = 31.3%

Analysis of Variance
Source          DF   SS       MS      F      P
Regression       4   138.000  34.500  13.74  0.000
Residual Error 108   271.211   2.511
Total          112   409.210
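As a check on the output above, the fitted mean length of stay at nnurses = 150 in region NC is 7.5218 + 0.004010(150) + 1.4178 = 9.54 days. This is the Fit reported in the Predicted Values output below; note that the 95% prediction interval there is wider than the 95% confidence interval because it also accounts for the variation of an individual hospital around the estimated mean.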


Predicted Values (at nnurses = 150, NC = 1, NE = 0, S = 0)

Fit     StDev Fit   95.0% CI           95.0% PI
9.541   0.283       (8.981, 10.102)    (6.350, 12.732)

MINITAB Commands


Example

Row  Blue  Green  Lemon  Insects trapped
  1    0     0      1        45
  2    0     0      1        59
  3    0     0      1        48
  4    0     0      1        46
  5    0     0      1        38
  6    0     0      1        47
  7    0     0      0        21
  8    0     0      0        12
  9    0     0      0        14
 10    0     0      0        17
 11    0     0      0        13
 12    0     0      0        17
 13    0     1      0        37
 14    0     1      0        32
 15    0     1      0        15
 16    0     1      0        25
 17    0     1      0        39
 18    0     1      0        41
 19    1     0      0        16
 20    1     0      0        11
 21    1     0      0        20
 22    1     0      0        21
 23    1     0      0        14
 24    1     0      0         7

Test whether some colors are more attractive than others to beetles.

The regression equation is
Insects trapped = 15.7 - 0.83 Blue + 15.8 Green + 31.5 Lemon

Predictor  Coef     StDev   T      P
Constant   15.667   2.770    5.66  0.000
Blue       -0.833   3.917   -0.21  0.834
Green      15.833   3.917    4.04  0.001
Lemon      31.500   3.917    8.04  0.000

S = 6.784   R-Sq = 82.1%   R-Sq(adj) = 79.4%

Analysis of Variance
Source          DF   SS      MS      F      P
Regression       3   4218.5  1406.2  30.55  0.000
Residual Error  20    920.5    46.0
Total           23   5139.0


State whether the following statements are true or false.

a) The value of the F-statistic for testing any differences among the colours is 30.55.
b) We have evidence at p < 0.01 that the means for green and white are different.
c) We have evidence at p < 0.01 that the means for blue and white are different.
d) A 95% confidence interval for the difference between means for lemon yellow and white is (23.3, 39.7).
e) We may say that 82.1% of the variation in the number of insects trapped has been accounted for by the above model.
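For part a), the F-statistic comes directly from the Analysis of Variance table (F = 30.55 on 3 and 20 df). For part d), white is the baseline category (all three colour dummies equal 0), so the Lemon coefficient estimates the lemon-minus-white difference in means. With 20 error degrees of freedom, t(0.025, 20) ≈ 2.086, and 31.500 ± 2.086(3.917) ≈ 31.5 ± 8.2, giving the interval (23.3, 39.7) quoted above.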


Models with two qualitative variables

The performance, y (measured as mass burning rate per degree of crank angle), for six combinations of fuel type and engine brand (2 brands and 3 fuel types) was analyzed.

Data Display
Row  C1    perform  F2  F3  B2  F2B2  F3B2  F   B
  1  F1B1    65      0   0   0    0     0   F1  B1
  2  F1B1    73      0   0   0    0     0   F1  B1
  3  F1B1    68      0   0   0    0     0   F1  B1
  4  F1B2    36      0   0   1    0     0   F1  B2
  5  F2B1    78      1   0   0    0     0   F2  B1
  6  F2B1    82      1   0   0    0     0   F2  B1
  7  F2B2    50      1   0   1    1     0   F2  B2
  8  F2B2    43      1   0   1    1     0   F2  B2
  9  F3B1    48      0   1   0    0     0   F3  B1
 10  F3B1    46      0   1   0    0     0   F3  B1
 11  F3B2    61      0   1   1    0     1   F3  B2
 12  F3B2    62      0   1   1    0     1   F3  B2

Interaction model

The regression equation is
perform = 68.7 + 11.3 F2 - 21.7 F3 - 32.7 B2 - 0.83 F2B2 + 47.2 F3B2

Predictor  Coef     StDev   T      P
Constant   68.667   1.939   35.42  0.000
F2         11.333   3.066    3.70  0.010
F3        -21.667   3.066   -7.07  0.000
B2        -32.667   3.878   -8.42  0.000
F2B2       -0.833   5.130   -0.16  0.876
F3B2       47.167   5.130    9.19  0.000

S = 3.358   R-Sq = 97.1%   R-Sq(adj) = 94.8%

Analysis of Variance
Source          DF   SS       MS      F      P
Regression       5   2303.00  460.60  40.84  0.000
Residual Error   6     67.67   11.28
Total           11   2370.67
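As a check on how the interaction dummies work, the fitted mean for the F3B2 combination is 68.667 - 21.667 - 32.667 + 47.167 = 61.5, which is exactly the average of the two observed F3B2 values (61 and 62). The F2B2 interaction coefficient is essentially zero, while the F3B2 interaction is large, so the effect of brand depends strongly on fuel type for fuel 3.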


Source  DF  Seq SS
F2       1   92.04
F3       1   78.13
B2       1  688.09
F2B2     1  491.30
F3B2     1  953.44
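Note that the sequential sums of squares add up to the regression sum of squares in the ANOVA table above: 92.04 + 78.13 + 688.09 + 491.30 + 953.44 = 2303.00.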


Stepwise regression

A hospital surgical unit was interested in predicting survival in patients undergoing a particular type of liver operation. A random sample of patients was available for analysis. From each patient's record, the following information was extracted from the preoperative evaluation:
X1 = blood clotting score
X2 = prognostic index
X3 = enzyme function test score
X4 = liver function test score
X5 = age in years
X6 = indicator variable for gender (0 = M, 1 = F)
X7 and X8 = indicator variables for history of alcohol use (categorical: none, moderate, severe)
X7 = indicator of moderate use
X8 = indicator of severe use


Data Display
Row   X1   X2   X3    X4   X5  X6  X7  X8    Y     lnY
  1   6.7  62    81  2.59  50   0   1   0   695   6.544
  2   5.1  59    66  1.70  39   0   0   0   403   5.999
  3   7.4  57    83  2.16  55   0   0   0   710   6.565
  4   6.5  73    41  2.01  48   0   0   0   349   5.854
  5   7.8  65   115  4.30  45   0   0   1  2343   7.759
  6   5.8  38    72  1.42  65   1   1   0   348   5.852
  7   5.7  46    63  1.91  49   1   0   1   518   6.25
...
 50   3.9  82   103  4.55  50   0   1   0  1078   6.983
 51   6.6  77    46  1.95  50   0   1   0   405   6.005
 52   6.4  85    40  1.21  58   0   0   1   579   6.361
 53   6.4  59    85  2.33  63   0   1   0   550   6.310
 54   8.8  78    72  3.20  56   0   0   0   651   6.478

The regression equation is
Y = -1149 + 62.4 X1 + 8.97 X2 + 9.89 X3 + 50.4 X4 - 0.95 X5 + 15.9 X6 + 7.7 X7 + 321 X8

Predictor  Coef      StDev    T      P
Constant   -1148.8   242.3   -4.74   0.000
X1           62.39    24.47   2.55   0.014
X2            8.973    1.874  4.79   0.000
X3            9.888    1.742  5.68   0.000
X4           50.41    44.96   1.12   0.268
X5           -0.951    2.649 -0.36   0.721
X6           15.87    58.47   0.27   0.787
X7            7.71    64.96   0.12   0.906
X8          320.70    85.07   3.77   0.000

S = 201.4   R-Sq = 78.2%   R-Sq(adj) = 74.3%

Analysis of Variance
Source          DF   SS        MS      F      P
Regression       8   6543615   817952  20.16  0.000
Residual Error  45   1825906    40576
Total           53   8369521


[Figure: Residuals Versus the Fitted Values (response is Y); Residual plotted against Fitted Value]

[Figure: Normal Probability Plot of the Residuals (response is Y); Normal Score plotted against Residual]

[Figure: Histogram of the Residuals (response is Y); Frequency of Residual]


The regression equation is
lnY = 4.05 + 0.0685 X1 + 0.0135 X2 + 0.0150 X3 + 0.0080 X4 - 0.00357 X5 + 0.0842 X6 + 0.0579 X7 + 0.388 X8

Predictor  Coef       StDev     T      P
Constant   4.0505     0.2518    16.09  0.000
X1         0.06851    0.02542    2.70  0.010
X2         0.013452   0.001947   6.91  0.000
X3         0.014954   0.001809   8.26  0.000
X4         0.00802    0.04671    0.17  0.865
X5        -0.003566   0.002752  -1.30  0.202
X6         0.08421    0.06075    1.39  0.173
X7         0.05786    0.06748    0.86  0.396
X8         0.38838    0.08838    4.39  0.000

S = 0.2093   R-Sq = 84.6%   R-Sq(adj) = 81.9%

Analysis of Variance
Source          DF   SS       MS      F      P
Regression       8   10.8370  1.3546  30.93  0.000
Residual Error  45    1.9707  0.0438
Total           53   12.8077

[Figure: Residuals Versus the Fitted Values (response is lnY); Residual plotted against Fitted Value]


[Figure: Normal Probability Plot of the Residuals (response is lnY); Normal Score plotted against Residual]

[Figure: Histogram of the Residuals (response is lnY); Frequency of Residual]


Stepwise Regression: lnY versus X1, X2, X3, X4, X5, X6, X7, X8

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15
Response is lnY on 8 predictors, with N = 54

Step             1       2       3       4       5
Constant      5.264   4.351   4.291   3.852   3.867

X3           0.0151  0.0154  0.0145  0.0155  0.0151
T-Value        6.23    8.19    9.33   11.07   10.82
P-Value       0.000   0.000   0.000   0.000   0.000

X2                   0.0141  0.0149  0.0142  0.0139
T-Value                5.98    7.68    8.20    8.07
P-Value               0.000   0.000   0.000   0.000

X8                           0.429   0.353   0.363
T-Value                       5.08    4.57    4.74
P-Value                      0.000   0.000   0.000

X1                                   0.073   0.071
T-Value                               3.86    3.79
P-Value                              0.000   0.000

X6                                           0.087
T-Value                                       1.49
P-Value                                      0.142

S             0.375   0.291   0.238   0.211   0.208
R-Sq          42.76   66.33   77.80   82.99   83.74
R-Sq(adj)     41.66   65.01   76.47   81.60   82.05
Mallows C-p   117.4    50.5    18.9     5.8     5.5
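The Minitab output above alternates forward entry and backward removal using the stated alpha values. As a rough, non-authoritative sketch of the same idea in Python (the notes themselves use Minitab), the snippet below implements a simple enter/remove loop with statsmodels and pandas; the data are simulated, so all column names are hypothetical and the printed result will not match the surgical-unit output.

# Illustrative sketch: stepwise selection with alpha-to-enter = alpha-to-remove = 0.15.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def stepwise(X, y, alpha_enter=0.15, alpha_remove=0.15):
    selected = []
    while True:
        changed = False
        # Forward step: try adding the candidate with the smallest p-value.
        candidates = [c for c in X.columns if c not in selected]
        pvals = {}
        for c in candidates:
            model = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            pvals[c] = model.pvalues[c]
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_enter:
                selected.append(best)
                changed = True
        # Backward step: drop an included predictor whose p-value is too large.
        if selected:
            model = sm.OLS(y, sm.add_constant(X[selected])).fit()
            worst = model.pvalues.drop("const").idxmax()
            if model.pvalues[worst] > alpha_remove:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected

# Small simulated demo: y depends on x1 and x3 only.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(80, 5)), columns=[f"x{i}" for i in range(1, 6)])
y = 2 * X["x1"] - 1.5 * X["x3"] + rng.normal(size=80)
print(stepwise(X, y))   # typically ['x1', 'x3']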


Minitab commands for stepwise regression


All Possible Regressions Selection Procedure

R-sq Criterion:

R-sq = SSR/SST = 1 - SSE/SST

Response is lnY

Vars  R-Sq  R-Sq(adj)    C-p        S     Predictors in model
 1    42.8     41.7     117.4   0.37549   X
 1    42.2     41.0     119.2   0.37746   X
 1    22.1     20.6     177.9   0.43807   X
 1    13.9     12.2     201.8   0.46052   X
 1     6.1      4.3     224.7   0.48101   X
 2    66.3     65.0      50.5   0.29079   X X
 2    59.9     58.4      69.1   0.31715   X X
 2    54.9     53.1      84.0   0.33668   X X
 2    51.6     49.7      93.4   0.34850   X X
 2    50.8     48.9      95.9   0.35157   X X
 3    77.8     76.5      18.9   0.23845   X X X
 3    75.7     74.3      25.0   0.24934   X X X
 3    71.8     70.1      36.5   0.26885   X X X
 3    68.1     66.2      47.3   0.28587   X X X
 3    67.6     65.7      48.7   0.28802   X X X
 4    83.0     81.6       5.8   0.21087   X X X X
 4    81.4     79.9      10.3   0.22023   X X X X
 4    78.9     77.2      17.8   0.23498   X X X X
 4    78.4     76.6      19.3   0.23785   X X X X
 4    78.0     76.2      20.4   0.23982   X X X X
 5    83.7     82.1       5.5   0.20827   X X X X X
 5    83.6     81.9       6.0   0.20931   X X X X X
 5    83.3     81.6       6.8   0.21100   X X X X X
 5    83.2     81.4       7.2   0.21193   X X X X X
 5    81.8     79.9      11.3   0.22044   X X X X X
 6    84.3     82.3       5.8   0.20655   X X X X X X
 6    83.9     81.9       7.0   0.20934   X X X X X X
 6    83.9     81.8       7.2   0.20964   X X X X X X
 6    83.8     81.8       7.2   0.20982   X X X X X X
 6    83.7     81.6       7.6   0.21066   X X X X X X
 7    84.6     82.3       7.0   0.20705   X X X X X X X
 7    84.4     82.0       7.7   0.20867   X X X X X X X
 7    84.0     81.6       8.7   0.21081   X X X X X X X
 7    84.0     81.5       8.9   0.21136   X X X X X X X
 7    82.1     79.4      14.3   0.22306   X X X X X X X
 8    84.6     81.9       9.0   0.20927   X X X X X X X X

(Each X flags one of the predictors X1-X8 included in that model; within each size, models are listed in decreasing order of R-Sq.)
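The notes produce this table in Minitab. As a rough sketch only (not the course's procedure), the snippet below shows what an all-possible-regressions search looks like in Python, reporting R-sq, adjusted R-sq and Mallows C-p for every subset. It assumes statsmodels and pandas are available, and the data and column names are simulated, so the numbers will not match the table above.

# Illustrative sketch: all possible regressions on simulated data.
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(60, 4)), columns=["x1", "x2", "x3", "x4"])
y = 1.0 + 2.0 * X["x1"] - 1.0 * X["x3"] + rng.normal(size=60)

full = sm.OLS(y, sm.add_constant(X)).fit()
mse_full = full.mse_resid            # MSE of the full model, used in C-p
n = len(y)

rows = []
for k in range(1, X.shape[1] + 1):
    for subset in itertools.combinations(X.columns, k):
        fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
        p = k + 1                                    # parameters including the intercept
        cp = fit.ssr / mse_full - (n - 2 * p)        # Mallows C-p
        rows.append((k, fit.rsquared, fit.rsquared_adj, cp, subset))

report = pd.DataFrame(rows, columns=["Vars", "R-Sq", "R-Sq(adj)", "C-p", "model"])
print(report.sort_values(["Vars", "R-Sq"], ascending=[True, False]))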


Best Subsets Regression
Response is lnY

Vars  R-Sq  R-Sq(adj)    C-p        S     Predictors in model
 1    42.8     41.7     117.4   0.37549   X
 2    66.3     65.0      50.5   0.29079   X X
 3    77.8     76.5      18.9   0.23845   X X X
 4    83.0     81.6       5.8   0.21087   X X X X
 5    83.7     82.1       5.5   0.20827   X X X X X
 6    84.3     82.3       5.8   0.20655   X X X X X X
 7    84.6     82.3       7.0   0.20705   X X X X X X X
 8    84.6     81.9       9.0   0.20927   X X X X X X X X

(Each X flags one of the predictors X1-X8 included in the best model of that size.)
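Comparing the S and C-p values with the stepwise output above, the best subsets of sizes one through five are exactly the models visited by the stepwise procedure: {X3}, {X2, X3}, {X2, X3, X8}, {X1, X2, X3, X8} and {X1, X2, X3, X6, X8}.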

[Figure: R-sq plotted against the number of variables (vars) in the model, for the lnY example]


Ex: Response is crimes (seven predictors; the X's mark which predictors enter each model)

Vars  R-Sq  R-Sq(adj)   C-p      S      Predictors in model
 1    75.4    75.3      23.6   39995    X
 2    78.3    78.1      -0.2   37660    X X
 3    78.4    78.1       1.0   37671    X X X
 4    78.5    78.0       2.6   37732    X X X X
 5    78.5    78.0       4.1   37784    X X X X X
 6    78.5    77.9       6.1   37875    X X X X X X
 7    78.5    77.8       8.0   37968    X X X X X X X

[Figure: R-sq plotted against the number of variables (Vars) in the model, for the crimes example]

Other Criteria

R-sq (Adj)

R-sq(adj) = 1 - MSE / (SST/(n - 1))
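As a check, for the full eight-predictor lnY model above, MSE = 0.0438 and SST/(n - 1) = 12.8077/53 ≈ 0.2417, so R-sq(adj) = 1 - 0.0438/0.2417 ≈ 0.819, matching the reported 81.9%.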


MINITAB commands


