+ All Categories
Home > Documents > Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers...

Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers...

Date post: 01-Jan-2016
Category:
Upload: blaze-baker
View: 219 times
Download: 1 times
Share this document with a friend
Popular Tags:
40
Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II
Transcript
Page 1: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Lecture 13

Diagnostics in MLRVariance Inflation FactorsAdded variable plotsIdentifying outliers

BMTRY 701Biostatistical Methods II

Page 2: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Variance Inflation Factor (VIF)

Diagnostic for multicollinearity Describes the amount of an X that is explained

by the other X’s in the model If VIF is high, then it suggests that the covariate

should not be added. Why?

• it is redundant• it adds variance to the model• it creates ‘instability’ in the estimation

Page 3: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

How to calculate VIF?

Simple idea:

That is, the VIF for the jth covariate is the coefficient of determination (R2) obtained from regressing xj on the remaining x’s in the model

21

1

jj R

VIF

exxxxx JJjjjjj 1111110

Page 4: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Sounds like a lot of work!

You don’t actually have to estimate the regressions for each xj.

Some matrix notation:• X = matrix of covariates including a column for the intercept• XT = transpose of X. That is, flip X on its diagonal• X-1 = the inverse of X. That is, what you multiply X by to get the

identity matrix• I = the identity matrix. A matrix with 0’s on the off-diagonal and

1’s on the diagonal

Useful matrix: XTX. (see chapter 3 for lots on it!) Another useful matrix: (XTX)-1

Page 5: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

XTX

Recall what it means to standardize a variable:• subtract off the mean• divide by the standard deviation

Imagine that you standardize all of the variables in your model (x’s).

Call the new covariate matrix W Now, if calculate WTW (and divide by n-1), it is

the correlation matrix Lastly, take the inverse of WTW (i.e., (WTW)-1)

Page 6: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

VIFs

The diagonals of the (WTW)-1 matrix are the VIFs This is a natural by-product of the regression The (WTW)-1 matrix is estimated when the

regression is estimated

Rules of thumb:• VIF larger than 10 implies a serious multicollinearity

problem• VIFs of 5 or greater suggest that coefficient estimates

may be misleading due to multicollinearity

Page 7: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Getting the VIFs the old-fashioned way

# standardize variablesages <- (AGE-mean(AGE))/sqrt(var(AGE))censuss <- (CENSUS - mean(CENSUS))/sqrt(var(CENSUS))xrays <- (XRAY - mean(XRAY))/sqrt(var(XRAY))infrisks <- (INFRISK-mean(INFRISK))/sqrt(var(INFRISK))sqrtcults <- (sqrtCULT-mean(sqrtCULT))/sqrt(var(sqrtCULT))nurses <- (NURSE - mean(NURSE))/sqrt(var(NURSE))

# create matrix of covariatesxmat <- data.frame(ages, censuss, xrays, infrisks, sqrtcults, nurses)xmat <- as.matrix(xmat)n <- nrow(xmat) # estimate x-transpose x and divide by n-1cormat <- t(xmat)%*%xmat/(n-1)

# solve finds the inverse of a matrixvifmat <- solve(cormat)

round(diag(vifmat), 2)

Page 8: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

More practical way.

library(HH)mlr <- lm(logLOS ~ AGE + CENSUS + XRAY +

INFRISK + sqrtCULT + NURSE)

round(diag(vifmat), 2) ages censuss xrays infrisks sqrtcults nurses 1.10 5.88 1.39 2.01 1.92 5.94

vif(mlr)

AGE CENSUS XRAY INFRISK sqrtCULT NURSE 1.096204 5.875625 1.390417 2.007692 1.916983 5.935711

Page 9: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

What to do?

Unlikely that only one variable will have high VIF You need to then determine which to include,

which to remove Judgement should be based on science +

statistics!

Page 10: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

More diagnostics: the added variable plots

These can help check for adequacy of model Is there curvature between Y and X after

adjusting for the other X’s? “Refined” residual plots They show the marginal importance of an

individual predictor Help figure out a good form for the predictor

Page 11: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Example: SENIC

Recall the difficulty determining the form for INFIRSK in our regression model.

Last time, we settled on including one term, INFRISK^2

But, we could do an adjusted variable plot approach.

How? We want to know, adjusting for all else in the

model, what is the right form for INFRISK?

Page 12: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

R code

av1 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION) )av2 <- lm(INFRISK ~ AGE + XRAY + CENSUS + factor(REGION) )resy <- av1$residualsresx <- av2$residuals

plot(resx, resy, pch=16)

abline(lm(resy~resx), lwd=2)

Page 13: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Added Variable Plot

-2 -1 0 1 2 3

-0.2

0.0

0.2

0.4

resx

resy

Page 14: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

What does that show?

The relationship between logLOS and INFRISK if you added INFRISK to the regression

But, is that what we want to see? How about looking at residuals versus INFRISK

(before including INFRISK in the model)?

Page 15: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

R codemlr8 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))smoother <- lowess(INFRISK, mlr8$residuals)plot(INFRISK, mlr8$residuals)lines(smoother)

2 3 4 5 6 7 8

-0.2

0.0

0.2

0.4

INFRISK

mlr

8$

resi

du

als

Page 16: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

R code

> infrisk.star <- ifelse(INFRISK>4,INFRISK-4,0)> mlr9 <- lm(logLOS ~ INFRISK + infrisk.star + AGE + XRAY + > CENSUS + factor(REGION))> summary(mlr9)

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.798e+00 1.667e-01 10.790 < 2e-16 ***INFRISK 1.836e-03 1.984e-02 0.093 0.926478 infrisk.star 6.795e-02 2.810e-02 2.418 0.017360 * AGE 5.554e-03 2.535e-03 2.191 0.030708 * XRAY 1.361e-03 6.562e-04 2.073 0.040604 * CENSUS 3.718e-04 7.913e-05 4.698 8.07e-06 ***factor(REGION)2 -7.182e-02 3.051e-02 -2.354 0.020452 * factor(REGION)3 -1.030e-01 3.036e-02 -3.391 0.000984 ***factor(REGION)4 -2.068e-01 3.784e-02 -5.465 3.19e-07 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1137 on 104 degrees of freedomMultiple R-Squared: 0.6209, Adjusted R-squared: 0.5917 F-statistic: 21.29 on 8 and 104 DF, p-value: < 2.2e-16

Page 17: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Residual Plots

2 3 4 5 6 7 8

-0.2

-0.1

0.0

0.1

0.2

0.3

0.4

INFRISK

mlr

9$

resi

du

als

2 3 4 5 6 7 8

-0.2

-0.1

0.0

0.1

0.2

0.3

0.4

INFRISK

mlr

7$

resi

du

als

SPLINE FOR INFRISK INFRISK2

Page 18: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Which is better?

Cannot compare via ANOVA because they are not nested!

But, we can compare statistics qualitatively R-squared:

• MLR7: 0.60• MLR9: 0.62

Partial R-squared:• MLR7: 0.17• MLR9: 0.19

Page 19: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Identifying Outliers

Harder to do in the MLR setting than in the SLR setting.

Recall two concepts that make outliers important: • Leverage is a function of the explanatory variable(s)

alone and measures the potential for a data point to affect the model parameter estimates.

• Influence is a measure of how much a data point actually does affect the estimated model.

Leverage and influence both may be defined in terms of matrices

Page 20: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

“Hat” matrix

We must do some matrix stuff to understand this Notation for a MLR with p predictors and data on

n patients. The data:

nY

Y

Y

Y2

1

~

npn

p

p

XX

XX

XX

X

1

221

111

1

1

1

~

Page 21: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

More notation:

THE MODEL:

What are the dimensions of each?

Matrix Format for the MLR model

ne

e

e

e2

1

p

1

0

eXY

Page 22: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

“Transpose” and “Inverse”

X-transpose: X’ or XT

X-inverse: X-1

Hat matrix = H

Why is H important? It transforms Y’s to Yhat’s:

')'( 1 XXXXH

HYY ˆ

Page 23: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Estimating, based on fitted model

)()(2 HIMSEes

Variance-Covariance Matrix of residuals:

)1()(2iii hMSEes

Variance of ith residual:

MSEhhses ijijij )0()( 22

Covariance of ith and jth residual:

Page 24: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Other uses of H

YHIe )(

I = identity matrix

)()( 22 HIe

Variance-Covariance Matrix of residuals:

)1()( 22iii he

Variance of ith residual:

222 )0()( ijijij hhe

Covariance of ith and jth residual:

Page 25: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Property of hij’s

n

i

n

jijij hh

1 1

1

This means that each row of H sums to 1And, that each column of H sums to 1

Page 26: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Other use of H

Identifies points of leverage

0 5 10 15

-10

01

02

03

04

0

x

y

1 2

4

3

Page 27: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Using the Hat Matrix to identify outliers

Look at hii to see if a datapoint is an outlier Large values of hii imply small values of var(ei) As hii gets close to 1, var(ei) approaches 0. Note that

As hii approaches 1, yhat approaches y This gives hii the name “leverage” HIGH HAT VALUE IMPLIES POTENTIAL FOR

OUTLIER!

ji

jijiii

n

jjiji yhyhyhy

1

ˆ

Page 28: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

R code

hat <- hatvalues(reg)plot(1:102, hat)highhat <- ifelse(hat>0.10,1,0)plot(x,y)points(x[highhat==1], y[highhat==1],

col=2, pch=16, cex=1.5)

Page 29: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Hat values versus index

0 20 40 60 80 100

0.0

20

.06

0.1

00

.14

1:102

ha

t

Page 30: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Identifying points with high hii

0 5 10 15

-10

01

02

03

04

0

x

y

Page 31: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Does a high hat mean it has a large residual?

No. hii measures leverage, not influence Recall what hii is made of

• it depends ONLY on the X’s• it does not depend on the actual Y value

Look back at the plot: which of these is probably most “influential”

Standard cutoffs for “large” hii: • 2p/n• 0.5 very high, 0.2-0.5 high

Page 32: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Let’s look at our MLR9

Any outliers?

0 20 40 60 80 100

0.0

50

.10

0.1

50

.20

1:length(hat9)

ha

t9

Page 33: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Using the hat matrix in MLR

Studentized residuals Acknowledge:

• each residual has a different variance• magnitude of residual should be made relative to its

variance (or sd)

Studentized residuals recognize differences in sampling errors

Page 34: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Defining Studentized Residuals

From slide 15,

We then define

Comparing ei and ri• ei have different variance due to sampling variations• ri have constant variance

)1()(2iii hMSEes

)1()(ii

i

i

ii

hMSE

e

es

er

Page 35: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Deleted Residuals

Influence is more intuitively quantified by how things change when an observation is in versus out of the estimation process

Would be more useful to have residuals in the situation when the observation is removed.

Example: • if a Yi is far out then it may be very influential in the

regression and the residual will be small• but, if that case is removed before estimating and

then the residual is calculated based on the fit, the residual would be large

Page 36: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Deleted Residuals, di

Process:• delete ith case• fit regression with all other cases• obtain estimate of E(Yi) based on its X’s and fitted

model

)(

)(

ˆ

estimation from removed with valuefittedˆ

iiii

iii

YYd

YY

Page 37: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Deleted Residuals, di

Nice result: you don’t actually have to refit without the ith case!

where ei is the ‘plain’ residual from the ith case and hii is the hat value. Both are from the regression INCLUDING the case

For small hii: ei and di will be similar For large hii: ei and di will be different

ii

ii h

ed

1

Page 38: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Studentized Deleted Residuals

Recall the need to standardize, based on the knowledge of the variance

The difference between ti and ri?

)1(

)(

)( iii

i

i

ii

hMSE

e

ds

dt

Page 39: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Another nice result

You can calculate MSE(i) without refitting the model

2/1

2)1(

1

iiiii ehSSE

pnet

Page 40: Lecture 13 Diagnostics in MLR Variance Inflation Factors Added variable plots Identifying outliers BMTRY 701 Biostatistical Methods II.

Testing for outliers

outlier = Y observations whose studentized deleted residuals are large (in absolute value)

ti ~ t with n-p-1 degrees of freedom

Two examples: • simulated data• mlr9


Recommended