ASSUMPTION CHECKING
• In regression analysis with Stata
• In multi-level analysis with Stata (not much extra)
• In logistic regression analysis with Stata
NOTE: THIS WILL BE EASIER IN STATA THAN IT WAS IN SPSS
Assumption checking in “normal” multiple regression with Stata
Assumptions in regression analysis
• No multi-collinearity
• All relevant predictor variables included
• Homoscedasticity: all residuals are from a distribution with the same variance
• Linearity: the “true” model should be linear
• Independent errors: having information about the value of a residual should not give you information about the value of other residuals
• Errors are distributed normally
FIRST THE ONE THAT LEADS TO NOTHING NEW IN STATA (NOTE: SLIDE TAKEN LITERALLY FROM MMBR)
Independent errors: having information about the value of a residual should not give you information about the value of other residuals
Detect: ask yourself whether it is likely that knowledge about one residual would tell you something about the value of another residual. Typical cases:
- repeated measures
- clustered observations (people within firms / pupils within schools)
Consequences: as for heteroscedasticity. Usually, your confidence intervals are estimated too small (think about why that is!).
Cure: use multi-level analyses
In Stata (example: the Stata “auto.dta” data set):
sysuse auto
corr (correlation)
vif (variance inflation factors)
ovtest (omitted variable test)
hettest (heteroscedasticity test)
predict e, resid
swilk (test for normality)
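Put together, a minimal worked sequence could look like this (a sketch only: the model of price on mpg, weight, and foreign is chosen here for illustration, not prescribed by the slides):

sysuse auto, clear
regress price mpg weight foreign
* correlations and variance inflation factors of the predictors
corr mpg weight foreign
vif
* omitted-variable (misspecification) test
ovtest
* heteroscedasticity test
hettest
* store the residuals, then test them for normality
predict e, resid
swilk e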
Finding the commands
• “help regress”
• “regress postestimation”
and you will find most of them (and more) there
Multi-collinearity: a strong correlation between two or more of your predictor variables
You don’t want it, because:
1. It is more difficult to get higher R’s
2. The importance of predictors can be difficult to establish (b-hats tend to go to zero)
3. The estimates for the b-hats are unstable under slightly different regression attempts (“bouncing betas”)
Detect:
1. Look at the correlation matrix of the predictor variables
2. Calculate VIF factors while running the regression
Cure: make the multi-collinearity disappear, for instance by deleting variables or by combining them into a single variable
Stata: calculating the correlation matrix (“corr”) and VIF statistics (“vif”)
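For instance (a sketch on auto.dta; the predictor set is illustrative):

sysuse auto, clear
* correlation matrix of the predictors
corr mpg weight length
regress price mpg weight length
* vif must be run after the regression
vif

A common rule of thumb is to start worrying when a VIF goes well above 10 (equivalently, a tolerance 1/VIF below 0.1), but cut-offs differ between fields.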
Misspecification tests (replaces: “all relevant predictor variables included”)
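In Stata the usual check is ovtest, the Ramsey RESET test, run after the regression. A sketch (the model is again illustrative):

sysuse auto, clear
regress price mpg weight
* Ramsey RESET test, using powers of the fitted values
ovtest

A significant test suggests that relevant (possibly non-linear) terms are missing from the model.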
Homoscedasticity: all residuals are from a distribution with the same variance
Consequences: heteroscedasticity does not necessarily lead to biases in your estimated coefficients (b-hat), but it does lead to biases in the estimate of the width of the confidence interval, and the estimation procedure itself is not efficient.
Testing for heteroscedasticity in Stata
• Your residuals should have the same variance for all values of Y: hettest
• Your residuals should have the same variance for all values of X: hettest, rhs
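Both variants are run after the regression (a sketch; the model is illustrative):

sysuse auto, clear
regress price mpg weight
* Breusch-Pagan / Cook-Weisberg test against the fitted values of Y
hettest
* the same test, but against all right-hand-side variables
hettest, rhs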
Errors distributed normally
Errors are distributed normally (just the errors, not the variables themselves!)
Detect: look at the residual plots, test for normality
Consequences: as a rule of thumb, if n > 600 there is no problem; otherwise the confidence intervals are wrong.
Cure: try to fit a better model, or use more difficult ways of modeling instead (ask an expert).
First calculate the errors:
predict e, resid
Then test for normality:
swilk e
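To also inspect the residuals visually (a sketch; qnorm and kdensity are standard Stata graph commands, the choice of plots here is ours):

* normal quantile plot of the residuals
qnorm e
* residual density with a normal curve overlaid
kdensity e, normal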
Assumption checking in multi-level multiple regression with Stata
In multi-level analysis
• Test all that you would test for in multiple regression (poor man’s test: do this using ordinary multiple regression, e.g. “hettest”)
Add:
• xttest0 (see last week)
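xttest0 is the Breusch-Pagan Lagrange multiplier test for random effects, and runs after a random-effects xtreg. A sketch with hypothetical variables y, x1, x2 and a school identifier:

xtset school
xtreg y x1 x2, re
xttest0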
Add (extra): test visually whether the normality assumption holds, but do this for the random effects (e.g. the estimated school effects, as below).
Note: extra material (= not on the exam, bonus points if you know how to use it)
tab school, gen(sch_)
reg y sch_2-sch_28
gen coefs = .
for num 2/28: replace coefs = _b[sch_X] if _n==X
swilk coefs
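The logic of this trick: each school dummy’s coefficient estimates that school’s effect, so collecting the coefficients in a variable and testing them with swilk checks whether the school-level (random) effects look plausibly normal. In current Stata the old “for” command would be written as a forvalues loop (a sketch; swilk coefs then runs as above):

gen coefs = .
forvalues i = 2/28 {
    replace coefs = _b[sch_`i'] if _n == `i'
}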
Assumption checking in logistic regression with Stata
Assumptions
• Y is 0/1
• Ratio of cases to variables should be “reasonable”
• No cases where you have complete separation (Stata will remove these cases automatically)
• Linearity in the logit (comparable to “the true model should be linear” in multiple regression)
• Independence of errors (as in multiple regression)
Further things to do:
• Check goodness of fit and prediction for different groups (as done in the do-file you have; see the sketch after this list)
• Check the correlation matrix for strong correlations between predictors (corr)
• Check for outliers using regress and diag (but don’t tell anyone I suggested this)
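A sketch of the first two checks, using hypothetical variables y01 (a 0/1 outcome), x1, and x2; estat gof is Stata’s goodness-of-fit test after logit:

logit y01 x1 x2
* Hosmer-Lemeshow goodness-of-fit test with 10 groups
estat gof, group(10)
* strong correlations among the predictors?
corr x1 x2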