Date post: | 14-Dec-2015 |
Upload: | haven-hardesty |
Topic 7 – Other Regression Issues
Reading: Some parts of Chapters 11 and 15
Overview
Confounding (Chapter 11)
Interaction (Chapter 11)
Using Polynomial Terms (Chapter 15)
Regression: Primary Goals
We usually are focused on one of the following goals:
Predicting the response variable based on a set of predictors (Reliability)
Quantifying the relationship between the predictors and the response (Interpretability)
In both situations, confounding and interaction can be concerns.
What is “Confounding”?
We saw this with the Smoking and Age predictors in our SBP example.
We consider the relationship of SBP to…
Smoking Status alone
Smoking Status along with age
Our interest is in determining whether smoking raises blood pressure.
SBP Example Continued
Smoking is confounded with Age
Smoking by itself is not significant
Without age, we are not able to see a difference in the smoking groups.
(The groups are actually different, but we cannot see it until we add age, a covariate.)
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            140.80000          3.66147     38.45     <.0001
SMK          1              7.02353          5.02350      1.40     0.1723
Smoking is confounded with Age (2)
Smoking variable tests significant
After adjusting for age, the two smoking groups are clearly different!
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             48.04960         11.12956      4.32     0.0002
AGE          1              1.70916          0.20176      8.47     <.0001
SMK          1             10.29439          2.76811      3.72     0.0009
Estimates
The effect of smoking is confounded with age: if we don’t first adjust for age, we won’t see the effect of smoking accurately.
Confounding
Confounding exists if meaningfully different interpretations of a relationship of interest can be made depending on whether or not a nuisance variable (or covariate) is included in the model.
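The definition above can be illustrated with a small simulation. This is only a sketch on synthetic data, not the SBP dataset: the sample size, the coefficients (1.7 per year of age, 10 for smoking), the noise level, and the helper names `solve` and `ols` are all assumptions made for the illustration. Smokers are generated slightly younger on average, so leaving age out both distorts the smoking estimate and inflates its standard error.

```python
import math
import random


def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(rows, y):
    """Least squares with an intercept; returns (coefficients, SEs, SSE)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2
              for x, yi in zip(X, y))
    mse = sse / (len(y) - p)
    se = []
    for j in range(p):
        unit = [1.0 if i == j else 0.0 for i in range(p)]
        # solve(XtX, unit)[j] is the j-th diagonal entry of (X'X)^-1.
        se.append(math.sqrt(mse * solve(XtX, unit)[j]))
    return beta, se, sse


# Synthetic data (all values hypothetical).
rng = random.Random(0)
n = 200
smk = [i % 2 for i in range(n)]
age = [rng.uniform(30, 66) if s else rng.uniform(34, 70) for s in smk]
sbp = [50 + 1.7 * a + 10 * s + rng.gauss(0, 8) for a, s in zip(age, smk)]

# Model 1: smoking alone. Model 2: smoking adjusted for age.
b_unadj, se_unadj, _ = ols([[s] for s in smk], sbp)
b_adj, se_adj, _ = ols([[a, s] for a, s in zip(age, smk)], sbp)

print(f"SMK alone:            estimate {b_unadj[1]:6.2f}, SE {se_unadj[1]:.2f}")
print(f"SMK adjusted for AGE: estimate {b_adj[2]:6.2f}, SE {se_adj[2]:.2f}")
```

In runs like this the adjusted estimate should land near the simulated effect of 10 with a much smaller standard error, which is the same pattern as in the two SBP tables.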
How to find confounding?
Get lucky and stumble upon it (like we did)
Look for it intentionally by running a lot of different models and watching for variables that aren’t significant at first but become significant when adding other variables (covariates).
Confounding (2)
If confounding is present, it may lead to inaccurate results if we are not careful: important covariates MUST be included (even if they aren’t significant themselves!)
Making the variable of interest significant is enough to warrant including the covariate
If we had failed to adjust for age, we would not have gotten a good estimate of the difference due to smoking, and we would also have wrongly concluded that smoking status doesn’t matter.
Confounding vs. Multicollinearity
Parameter estimates will change wildly when (multi)collinearity is involved too!
They are almost opposite
SE’s increase and X1 becomes insignificant (added last) when X2 is in the model – (MULTI)COLLINEARITY
This (usually) works both ways—both variables “fight”
SE’s decrease and X1 becomes significant (added last) only when X2 is in the model – CONFOUNDING
Confounding is usually only one way: the covariate (Z) helps the confounded variable (X). In our example, Age is helping Smoking.
Confounding vs. Multicollinearity (2)
Can catch (multi)collinearity in the correlation matrix
Any single correlation > 0.9 suggests collinearity between just those two predictors
Any predictor that has several correlations between 0.5 and 0.9 with other predictors suggests multicollinearity
For confounding, there will usually be some correlation between X and Z but it will not be very large.
Our example: $r_{\text{age},\,\text{smk}} = 0.13$
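A minimal sketch of the correlation-matrix screen described above, using the 0.9 cutoff from this slide. The predictors `x1`, `x2`, `x3` are synthetic and hypothetical (`x2` is built as a noisy copy of `x1`), and `pearson` is a small helper written for the illustration.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# Synthetic predictors (hypothetical): x2 nearly duplicates x1, x3 is unrelated.
rng = random.Random(1)
n = 100
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + rng.gauss(0, 0.2) for a in x1]
x3 = [rng.gauss(0, 1) for _ in range(n)]
predictors = {"x1": x1, "x2": x2, "x3": x3}

# Check every pair of predictors and flag |r| > 0.9.
names = list(predictors)
flagged = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(predictors[a], predictors[b])
        print(f"r({a}, {b}) = {r:+.3f}")
        if abs(r) > 0.9:
            flagged.append((a, b))

print("pairs with |r| > 0.9:", flagged)
```

A pair that is merely confounded, like age and smk with r = 0.13, would not be flagged by this screen.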
Interaction
Interaction is (sort of) one step beyond confounding – not only does it make a difference to adjust for Z, but the relationship between Y and X is fundamentally different at different levels of Z.
Can think of this as having a different regression line for each fixed level of Z. With no interaction, these lines would be parallel.
SBP Example
We found Age and Smk to both be important. Is it possible that they are interacting?
X = age
Z = 0 for non-smokers, 1 for smokers
Interaction
Looking at plots can give us some idea of interaction (are the lines parallel?). However...
It is very easy to just test to see if the XZ interaction term is important.
Treat it just as you would any other variable and do a partial F-test.
Note that if a model includes an XZ interaction term, it should also include the X and Z main effects. We would never just look at the XZ term by itself.
Age/Smk Interaction Model
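Interaction is described mathematically using a product term:
$$Y = \beta_0 + \beta_1 X + \beta_2 Z + \beta_{12} XZ + \varepsilon$$
Or just:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$
where $X_1 = X$, $X_2 = Z$, and $X_3 = X_1 X_2$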
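A sketch of the partial F-test for the product term, again on synthetic data rather than the SBP dataset. The generated data contain no true interaction; the coefficients, noise level, and the helper names `solve` and `ols` are assumptions for the illustration. The reduced model has age and smk, the full model adds age*smk, and the F statistic compares their error sums of squares.

```python
import math
import random


def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(rows, y):
    """Least squares with an intercept; returns (coefficients, SEs, SSE)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2
              for x, yi in zip(X, y))
    mse = sse / (len(y) - p)
    se = []
    for j in range(p):
        unit = [1.0 if i == j else 0.0 for i in range(p)]
        # solve(XtX, unit)[j] is the j-th diagonal entry of (X'X)^-1.
        se.append(math.sqrt(mse * solve(XtX, unit)[j]))
    return beta, se, sse


# Synthetic data (hypothetical), generated WITHOUT an interaction.
rng = random.Random(2)
n = 200
age = [rng.uniform(30, 70) for _ in range(n)]
smk = [i % 2 for i in range(n)]
sbp = [50 + 1.7 * a + 10 * s + rng.gauss(0, 8) for a, s in zip(age, smk)]

reduced = [[a, s] for a, s in zip(age, smk)]        # X, Z
full = [[a, s, a * s] for a, s in zip(age, smk)]    # X, Z, XZ

_, _, sse_reduced = ols(reduced, sbp)
beta, se, sse_full = ols(full, sbp)

# Partial F for one added term: extra SS explained, divided by the full-model MSE.
df_error = n - 4
F = (sse_reduced - sse_full) / (sse_full / df_error)
t = beta[3] / se[3]

print(f"partial F for age*smk = {F:.3f} on (1, {df_error}) df")
print(f"t for age*smk = {t:.3f}, t squared = {t * t:.3f}")
```

Because only one term is added, the partial F equals the square of that term's t statistic, so either can be used to test the interaction.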
SBP Example
The interaction term is not significant (p = 0.2918), so there is no evidence of an interaction between age and smk
Suppose it were significant
We would then have to keep the age_smk interaction term AS WELL AS both the age and smk variables (even if age and smk themselves are insignificant)
Source     DF   Type I SS   F Value   Pr > F
age         1        3862     64.84   <.0001
smk         1         828     13.90   0.0009
age_smk     1          69      1.15   0.2918
Confounding vs. Interaction
Y = response
X = predictor
Z = covariate / 2nd predictor
Is the estimated relationship between Y and X dramatically different if one adjusts or does not adjust for Z? If yes: Confounding
Is the estimated relationship between Y and X meaningfully different at different values of Z? If yes: Interaction
Correlations
One problem with using interaction terms is that they tend to be highly correlated with one or both of the original variables
In our example: Correlation between SMK and AGE_SMK turned out to be 0.98
This is NOT REAL!!! It is a form of “fake” collinearity; the variables aren’t really “fighting” to explain SS
To remove this “fake” collinearity just center the variables
Subtract the mean from all predictors
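This doesn’t change the overall fit or the test of the interaction (highest-order) term; it only removes what we are calling fake collinearity. (The main-effect estimates and their tests can change, because after centering each one describes the effect at the mean of the other variable.)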
How to center? SBP Example
Mean age was 53.25, so subtract 53.25 from all the ages in the dataset and use these new values in the analysis
Mean smk was 0.53125, so subtract 0.53125 from every smk value in the same way
After centering:
Correlation between SMK and AGE_SMK is now 0.017 (so they weren’t really fighting, it just looked like it because we didn’t center)
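The effect of centering on this correlation can be sketched on synthetic data. The ages and smoking indicators below are hypothetical draws chosen to resemble the example's means (about 53 and 0.53); they are not the SBP data, and `pearson` is a helper written for the illustration.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# Synthetic predictors (hypothetical values).
rng = random.Random(4)
n = 200
age = [rng.uniform(40, 66) for _ in range(n)]
smk = [1 if rng.random() < 0.53 else 0 for _ in range(n)]

# Uncentered product term.
age_smk = [a * s for a, s in zip(age, smk)]
r_raw = pearson(smk, age_smk)

# Center each predictor at its own mean, then form the product.
age_mean = sum(age) / n
smk_mean = sum(smk) / n
age_c = [a - age_mean for a in age]
smk_c = [s - smk_mean for s in smk]
age_smk_c = [a * s for a, s in zip(age_c, smk_c)]
r_centered = pearson(smk_c, age_smk_c)

print(f"corr(smk, age*smk), uncentered: {r_raw:.3f}")
print(f"corr(smk, age*smk), centered:   {r_centered:.3f}")
```

The uncentered correlation is high simply because age*smk is zero for every non-smoker and large for every smoker; once both variables are centered, that built-in pattern disappears.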
Maybe we should always center???
Polynomial Regression
Chapter 15
General Uses
Polynomial models used in situations where the relationship between Y and X is non-linear
Can usually see it in scatterplots
Should definitely catch it in residual plots!
Somewhat dangerous, since a polynomial model of order n – 1 will always fit n data points (with distinct X values) exactly, even if the points are pure noise.
Example?
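One possible example, as a sketch: five arbitrary points fitted by a polynomial of order four. With n distinct X values, the least squares polynomial of order n – 1 is the interpolating polynomial, so it is evaluated here through the Lagrange form instead of a regression routine. The data values and the function name `lagrange` are made up for the illustration.

```python
def lagrange(xs, ys, x):
    """Evaluate at x the unique polynomial of order len(xs) - 1 through the points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Five arbitrary points with no real pattern (hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, -0.4, 7.9, 2.2, 5.5]

# Residuals of the order-4 polynomial at the observed points.
residuals = [y - lagrange(xs, ys, x) for x, y in zip(xs, ys)]
print("residuals at the data points:", residuals)

# The same curve evaluated between and beyond the observed X values.
print("fitted value at x = 4.5:", lagrange(xs, ys, 4.5))
print("fitted value at x = 6.0:", lagrange(xs, ys, 6.0))
```

The residuals are all zero, so R-square would be 100%, yet the curve only chases the noise and its values away from the data points have no meaning.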
Strategy for fitting
CENTER your variables to avoid the “fake” (multi)collinearity.
Use a special type of backward elimination procedure: test the highest order term first!
If a higher order term is significant, you MUST include all lower order terms for that variable
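The strategy above might be sketched as follows. Everything here is an assumption made for illustration: the 12 synthetic data points (generated from a quadratic), the choice to start at a cubic, the 5% two-sided level, and the helper names `solve` and `ols`. The critical values in `T_CRIT` are standard t-table values for 8 and 9 error degrees of freedom.

```python
import math
import random


def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(rows, y):
    """Least squares with an intercept; returns (coefficients, SEs, SSE)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2
              for x, yi in zip(X, y))
    mse = sse / (len(y) - p)
    se = []
    for j in range(p):
        unit = [1.0 if i == j else 0.0 for i in range(p)]
        # solve(XtX, unit)[j] is the j-th diagonal entry of (X'X)^-1.
        se.append(math.sqrt(mse * solve(XtX, unit)[j]))
    return beta, se, sse


# Two-sided 5% critical values of t, keyed by error degrees of freedom.
T_CRIT = {8: 2.306, 9: 2.262}

# Synthetic data (hypothetical): 12 points from a quadratic plus noise.
rng = random.Random(3)
x = [float(v) for v in range(1, 13)]
y = [2 + 0.5 * (v - 6.5) + 1.5 * (v - 6.5) ** 2 + rng.gauss(0, 0.3) for v in x]

# Step 1: center X.
xbar = sum(x) / len(x)
xc = [v - xbar for v in x]

# Step 2: start at the highest order considered and test only the top term.
order = 3
while order > 1:
    rows = [[c ** k for k in range(1, order + 1)] for c in xc]
    beta, se, _ = ols(rows, y)
    t_top = beta[order] / se[order]
    df = len(y) - (order + 1)
    print(f"order {order}: t for highest term = {t_top:.2f} (df = {df})")
    if abs(t_top) > T_CRIT[df]:
        break          # keep this term AND every lower-order term
    order -= 1         # drop the top term and refit one order lower

print("selected order:", order)
```

Lower-order terms are never tested while a higher-order term is still in the model, which is what keeps the final model hierarchical.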
Example
Problem 15.7 (sas/data available online)
X = amount of vaccine, Y = measure of skin response in rats.
12 data points
If we run just a simple linear regression, the R-square is only 45%, so we will consider a polynomial model and try to do better!
Scatter Plot
Residual plot
Cubic Model
x is $X$, x2 is $X^2 = X \cdot X$, x3 is $X^3 = X \cdot X \cdot X$, etc.
$X^3$ is important, so we must keep $X^2$ and $X$. Why?
The cubic model (with $X$, $X^2$, and $X^3$) now explains 82% of the variation (the linear model explained only 45%)
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             14.53498          0.20011     72.63     <.0001
x            1             -0.54454          0.11047     -4.93     0.0012
x2           1              0.12179          0.05386      2.26     0.0536
x3           1              0.28852          0.08177      3.53     0.0077