Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15.

Overview

Confounding (Chapter 11)

Interaction (Chapter 11)

Using Polynomial Terms (Chapter 15)

Regression: Primary Goals

We usually are focused on one of the following goals:

Predicting the response variable based on a set of predictors -- Reliability

Quantifying the relationship between the predictors and the response--Interpretability

In both situations, confounding and interaction can be concerns.

What is “Confounding”?

We saw this with the Smoking and Age predictors in our SBP example.

We consider the relationship of SBP to…

Smoking Status alone

Smoking Status along with age

Our interest is in determining whether smoking raises blood pressure.

SBP Example Continued

Smoking is confounded with Age

Smoking by itself is not significant

Without age, we are not able to see a difference in the smoking groups.

(The groups are actually different, but we cannot see it until we add age, a covariate.)

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            140.80000          3.66147     38.45     <.0001
SMK          1              7.02353          5.02350      1.40     0.1723

Smoking is confounded with Age (2)

Smoking variable tests significant

After adjusting for age, the two smoking groups are clearly different!

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             48.04960         11.12956      4.32     0.0002
AGE          1              1.70916          0.20176      8.47     <.0001
SMK          1             10.29439          2.76811      3.72     0.0009

Estimates

The effect of smoking is confounded with age – if we don’t first adjust for age, we won’t see the effect of smoking accurately.
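To make the adjusted estimates concrete, the fitted age-adjusted model can be evaluated directly. The sketch below uses Python (not the software that produced the output in these notes) and plugs in the coefficients reported in the table above; the helper name `predict_sbp` and the choice of age 53.25 (the sample mean age quoted later in these notes) are illustrative assumptions.

```python
# Coefficients copied from the age-adjusted parameter estimates table.
INTERCEPT = 48.04960
B_AGE = 1.70916
B_SMK = 10.29439


def predict_sbp(age, smk):
    """Predicted SBP from the fitted model; smk is 0 (non-smoker) or 1 (smoker)."""
    return INTERCEPT + B_AGE * age + B_SMK * smk


age = 53.25  # illustrative choice: the sample mean age
non_smoker = predict_sbp(age, 0)
smoker = predict_sbp(age, 1)
print(f"non-smoker: {non_smoker:.2f}, smoker: {smoker:.2f}")
print(f"difference at the same age: {smoker - non_smoker:.2f}")
```

Because this model has no interaction term, the smoker/non-smoker difference equals the SMK coefficient at every age.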

Confounding

Confounding exists if meaningfully different interpretations of a relationship of interest can be made depending on whether or not a nuisance variable (or covariate) is included in the model.
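A small simulation can show what "meaningfully different interpretations" looks like. The sketch below is written in Python with a hand-rolled least-squares helper; the data are made up for illustration (they are NOT the SBP data), and the names `solve` and `ols` are hypothetical helpers. The response is built so that smoking adds 10 units at any given age, but the smokers are younger, which hides the effect unless age is in the model.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    p = len(X[0])
    xtx = [[sum(r[j] * r[k] for r in X) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(p)]
    return solve(xtx, xty)


# Made-up data: four non-smokers followed by four (younger) smokers.
age = [50, 55, 60, 65, 44, 49, 54, 59]
smk = [0, 0, 0, 0, 1, 1, 1, 1]
noise = [1, -1, -1, 1, 1, -1, -1, 1]  # small deterministic "error"
y = [50 + 1.7 * a + 10 * s + e for a, s, e in zip(age, smk, noise)]

unadjusted = ols([smk], y)      # [intercept, smk]
adjusted = ols([age, smk], y)   # [intercept, age, smk]
print(f"smoking effect, ignoring age:  {unadjusted[1]:.2f}")
print(f"smoking effect, adjusting age: {adjusted[2]:.2f}")
```

In this toy data set the unadjusted comparison suggests almost no difference between the groups, while the age-adjusted model recovers the built-in difference of 10.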

How to find confounding?

Get lucky and stumble upon it (like we did)

Look for it intentionally by running a lot of different models and watching for variables that aren’t significant at first but become significant when adding other variables (covariates).

Confounding (2)

If confounding is present, it may lead to inaccurate results if we are not careful – important covariates MUST be included (even if they aren’t significant themselves!)

Making the variable of interest significant is enough to warrant including the covariate

If we had failed to adjust for age, we would not have gotten a good estimate of the difference due to smoking, and we would also have wrongly concluded that smoking status doesn’t matter.

Confounding vs. Multicollinearity

Parameter estimates will change wildly when (multi)collinearity is involved too!

They are almost opposite

SE’s increase and X1 becomes insignificant (added last) when X2 is in the model – (MULTI)COLLINEARITY

This (usually) works both ways—both variables “fight”

SE’s decrease and X1 becomes significant (added last) only when X2 is in the model – CONFOUNDING

Confounding is usually only one way: the covariate (Z) helps the confounded variable (X). In our example, Age is helping Smoking.

Confounding vs. Multicollinearity (2)

Can catch (multi)collinearity in the correlation matrix

Any single correlation > 0.9 indicates collinearity between just those two predictors

Any predictor that has several correlations between 0.5 and 0.9 with other predictors indicates multicollinearity

For confounding, there will usually be some correlation between X and Z but it will not be very large.

Our example: $r_{\text{age},\,\text{smk}} = 0.13$
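The correlation-matrix screen described above can be sketched in a few lines of Python. The thresholds 0.9 and 0.5 come from these notes; treating "several" as two or more, using absolute values of the correlations, the function names `pearson` and `screen`, and the data are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen(predictors, high=0.9, low=0.5, several=2):
    """Flag pairs with |r| > high, and predictors with `several` moderate |r|."""
    names = list(predictors)
    moderate = {name: 0 for name in names}
    flags = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = abs(pearson(predictors[a], predictors[b]))
            if r > high:
                flags.append(f"collinearity: {a} and {b} (|r| = {r:.2f})")
            elif r >= low:
                moderate[a] += 1
                moderate[b] += 1
    for name, count in moderate.items():
        if count >= several:
            flags.append(f"possible multicollinearity: {name} ({count} moderate correlations)")
    return flags


# Made-up predictors: x2 is nearly a multiple of x1.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "x3": [3, 1, 4, 1, 5, 9],
}
flags = screen(predictors)
for flag in flags:
    print(flag)
```

A screen like this only looks at pairwise correlations, so it is a first check rather than a full diagnosis.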

Interaction

Interaction is (sort of) one step beyond confounding – not only does it make a difference to adjust for Z, but the relationship between Y and X is fundamentally different at different levels of Z.

Can think of this as having a different regression line for each fixed level of Z. With no interaction, these lines would be parallel.

SBP Example

We found Age and Smk to both be important. Is it possible that they are interacting?

X = age

Z = 0 for non-smokers, 1 for smokers

Interaction

Looking at plots can give us some idea of interaction (parallel lines). However...

It is very easy to just test to see if the XZ interaction term is important.

Treat it just as you would any other variable and do a partial F-test.

Note that if a model includes XZ interaction term, it should also include X and Z main effects. We would never just look at the XZ term by itself.

Age/Smk Interaction Model

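Interaction is described mathematically using a product term:

$Y = \beta_0 + \beta_1 X + \beta_2 Z + \beta_{12} XZ + \varepsilon$

Or just:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$

where $X_3$ is $X_1 X_2$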
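To see why the product term gives a different regression line at each level of Z, substitute the two values of Z (0 for non-smokers, 1 for smokers) into the interaction model. The Greek symbols below follow the usual regression notation and are an assumption, since the slide formulas did not survive extraction intact.

```latex
\begin{aligned}
Z = 0:&\quad Y = \beta_0 + \beta_1 X + \varepsilon \\
Z = 1:&\quad Y = (\beta_0 + \beta_2) + (\beta_1 + \beta_{12})\, X + \varepsilon
\end{aligned}
```

So $\beta_2$ shifts the intercept and $\beta_{12}$ is the difference in slopes between the two groups. If $\beta_{12} = 0$, the two lines are parallel, which is the no-interaction case.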

SBP Example

The interaction term tests insignificant (p = 0.2918), so there is no significant interaction between age and smk

Suppose it was significant

Would then have to keep the age_smk interaction term AS WELL AS both the age and smk variables (even if age and smk themselves are insignificant)

Source     DF   Type I SS   F Value   Pr > F
age         1        3862     64.84   <.0001
smk         1         828     13.90   0.0009
age_smk     1          69      1.15   0.2918
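The F values in this table can be roughly reproduced by hand, which shows what the partial F-test for the interaction is doing. The Python sketch below assumes each F is the term's sequential SS (on 1 df) divided by the full model's mean squared error; the table does not report the MSE, so it is backed out of the age row. The function name `partial_f` is a hypothetical helper, and because the SS values are rounded the results agree only approximately.

```python
# Type I (sequential) sums of squares and F values copied from the table.
type1_ss = {"age": 3862.0, "smk": 828.0, "age_smk": 69.0}
reported_f = {"age": 64.84, "smk": 13.90, "age_smk": 1.15}

# Assumption: F = (SS / df) / MSE, using the MSE of the full model.
# The MSE is not in the table, so recover it from the age row.
mse = type1_ss["age"] / reported_f["age"]


def partial_f(extra_ss, extra_df, mse):
    """F statistic for the extra sum of squares explained by the added term(s)."""
    return (extra_ss / extra_df) / mse


for term, ss in type1_ss.items():
    print(f"{term:8s} computed F = {partial_f(ss, 1, mse):6.2f}   reported F = {reported_f[term]}")
```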

Confounding vs. Interaction

Y = response

X = predictor

Z = covariate / 2nd predictor

Is the estimated relationship between Y and X dramatically different if one adjusts or does not adjust for Z? If so: Confounding

Is the estimated relationship between Y and X meaningfully different at different values of Z? If so: Interaction

Correlations

One problem with using interaction terms is that they tend to be highly correlated with one or both of the original variables

In our example: Correlation between SMK and AGE_SMK turned out to be 0.98

This is NOT REAL!!! It is a form of “fake” collinearity, the variables aren’t really “fighting” to explain SS

To remove this “fake” collinearity just center the variables

Subtract the mean from all predictors

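This doesn’t change the overall fit or the significance test for the interaction term; it only removes what we are calling fake collinearity. (The estimates and tests for the lower-order terms can change, because after centering they describe the effect at the mean of the other variable.)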

How to center? SBP Example

Mean age was 53.25, subtract 53.25 from all the ages in the dataset and use these new values in the analysis

Mean smk was 0.53125; subtract it from every smk value in the same way

After centering:

Correlation between SMK and AGE_SMK is now 0.017 (so they weren’t really fighting, it just looked like it because we didn’t center)
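The effect of centering on this "fake" collinearity is easy to reproduce. The Python sketch below uses made-up ages and smoking indicators (NOT the SBP data), so the correlations it prints will not match the 0.98 and 0.017 quoted above; the helper names `pearson` and `center` are assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Made-up data: four non-smokers followed by four smokers.
age = [45, 50, 55, 60, 47, 52, 57, 62]
smk = [0, 0, 0, 0, 1, 1, 1, 1]

# Interaction term built from the raw variables.
raw_product = [a * s for a, s in zip(age, smk)]

# Interaction term built from the centered variables.
age_c, smk_c = center(age), center(smk)
centered_product = [a * s for a, s in zip(age_c, smk_c)]

r_raw = pearson(smk, raw_product)
r_centered = pearson(smk_c, centered_product)
print(f"corr(smk, age*smk), raw:      {r_raw:.3f}")
print(f"corr(smk, age*smk), centered: {r_centered:.3f}")
```

With raw variables the product is zero for every non-smoker and large for every smoker, so it mostly repeats the smoking indicator; after centering, that built-in pattern disappears.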

Maybe we should always center???

Polynomial Regression

Chapter 15

General Uses

Polynomial models used in situations where the relationship between Y and X is non-linear

Can usually see it in scatterplots

Should definitely catch it in residual plots!

Somewhat dangerous, since a polynomial model of order n – 1 will always fit n data points (with distinct X values) exactly.

Example?
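One way to see the danger: take any n points with distinct X values, and the polynomial of order n – 1 passes through every one of them, no matter how arbitrary the Y values are. The Python sketch below evaluates that polynomial with Lagrange interpolation (with n distinct X values, the least-squares fit of order n – 1 is this same interpolating polynomial). The data and the function name `lagrange` are made up for illustration.

```python
def lagrange(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs) - 1 through the points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Five arbitrary points: a degree-4 polynomial fits all of them exactly.
xs = [1, 2, 3, 4, 5]
ys = [3.0, -1.0, 4.0, 1.0, -5.0]

residuals = [y - lagrange(xs, ys, x) for x, y in zip(xs, ys)]
print("residuals at the data points:", residuals)
print("value between two data points (x = 4.5):", lagrange(xs, ys, 4.5))
```

Zero residuals here say nothing about how well the curve describes the relationship between the data points, which is why a perfect fit from a high-order polynomial should not be taken as evidence of a good model.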

Strategy for fitting

CENTER your variables to avoid the “fake” (multi)collinearity.

Use a special type of backward elimination procedure: test the highest order term first!

If a higher order term is significant, you MUST include all lower order terms for that variable
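The fitting strategy above can be sketched as a short loop. In the Python sketch below, `top_term_p_value` is a hypothetical callback that would refit the model of a given degree (on centered X) and return the p-value of its highest-order term; the 0.05 cutoff and the choice to stop at degree 1 without testing the linear term are assumptions, not part of these notes.

```python
def choose_degree(max_degree, top_term_p_value, alpha=0.05):
    """Backward elimination for one variable: drop the top term while it is insignificant.

    top_term_p_value(d) should refit the degree-d polynomial model and return
    the p-value for the X^d term. All lower-order terms are always kept.
    """
    degree = max_degree
    while degree > 1 and top_term_p_value(degree) >= alpha:
        degree -= 1
    return degree


def terms_to_keep(degree):
    """Every term from X up to X^degree stays in the model."""
    return ["x" if d == 1 else f"x{d}" for d in range(1, degree + 1)]


# p-value of the cubic term as reported in the cubic-model output of these notes.
reported = {3: 0.0077}
kept = choose_degree(3, reported.get)
print(kept, terms_to_keep(kept))

# Hypothetical p-values, for illustration only: cubic term not significant,
# quadratic term significant after refitting.
hypothetical = {3: 0.40, 2: 0.01}
print(choose_degree(3, hypothetical.get))
```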

Example

Problem 15.7 (sas/data available online)

X = amount of vaccine, Y = measure of skin response in rats.

12 data points

If we run just a simple linear regression, the R-square is only 45%, so we will consider a polynomial model and try to do better!

Scatter Plot

Residual plot

Cubic Model

x is X, x2 is X^2 = X*X, x3 is X^3 = X*X*X, etc.

X^3 is important, so we must keep X^2 and X. Why?

The cubic model (with X, X^2, and X^3) now explains 82% of the variation (it was only 45% for the linear model)

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             14.53498          0.20011     72.63     <.0001
x            1             -0.54454          0.11047     -4.93     0.0012
x2           1              0.12179          0.05386      2.26     0.0536
x3           1              0.28852          0.08177      3.53     0.0077
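The jump from 45% to 82% can be turned into a rough partial F-test for adding the X^2 and X^3 terms together. The Python sketch below uses the standard R-square form of the partial F statistic with the rounded values quoted in these notes (n = 12 data points, 4 parameters in the cubic model, 2 added terms), so the result is approximate; the function name `partial_f_from_r2` is a hypothetical helper.

```python
def partial_f_from_r2(r2_full, r2_reduced, n, p_full, q):
    """Partial F statistic from the R-square values of nested models.

    n      : number of observations
    p_full : number of parameters in the full model, including the intercept
    q      : number of extra terms in the full model
    """
    return ((r2_full - r2_reduced) / q) / ((1.0 - r2_full) / (n - p_full))


# Linear model R-square = 0.45, cubic model R-square = 0.82 (rounded values).
f_stat = partial_f_from_r2(0.82, 0.45, n=12, p_full=4, q=2)
print(f"F = {f_stat:.2f} on 2 and 8 degrees of freedom")
```

The 5% critical value for F with 2 and 8 degrees of freedom is roughly 4.46, so under these rounded inputs the two higher-order terms together add significantly to the linear model.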

Date post:	14-Dec-2015
Upload:	haven-hardesty