Working with Missing Values
Alan C. Acock
February 2007
Supporting material is available at www.oregonstate.edu/~acock/missing
Alan C. Acock, Working with Missing Values
Why are the Values Missing: The reason instructs the solution
• By design: completely random
– Missing completely at random (MCAR)
– 50% of items selected randomly for each interview
– 50% randomly selected for follow-up
– Effective when there are too many items or high costs
• Intentionally missing: researcher controlled
– Boys not asked about first menstruation
– Drop from analysis
– Sometimes unintentionally imputed
– Imputing doesn’t necessarily hurt
Why are the Values Missing
• Refusals: we may know the mechanism
– Adjusted for gender, race, education
– May be missing at random
– Otherwise, bias is likely without auxiliary variables
• Missing because of “don’t know” responses
– Between agree and disagree?
– Can we impute a better value?
– Should we?
Why are the Values Missing
• Missing by researcher error
– May be missing completely at random
– May reflect researcher bias
– Perceived risk to researcher
– Missing observation worse than missing value
• Code the reason each value is missing
– NLSY97 uses 5 types of missing values
– Treat each differently
Why are the Values Missing
• Understand why each value is missing
• Delete observations or variables where you do not intend to impute a value
– Drop variable
– Drop observation
Four Questions
• Do I want to have a value for this person?
• Is the value missing completely at random, or
• Do I have auxiliary variables that explain why it is missing, and
• Do I have covariates that predict the score?
Patterns of Missing Values
MISSING DATA PATTERNS
1 2 3 4 5 6 7 8 9 10
HLTH x x x x
CHILDS x x x x x x x x x x
HAP_GEN x x x x x
INCOME98 x x x x x x
AGE x x x x x x x x
EDUC x x x x x
– What is the problem with
• HLTH?
• INCOME98?
• EDUC?
Patterns of Missing Values
MISSING DATA PATTERN FREQUENCIES
Pattern  Freq   Pattern  Freq   Pattern  Freq
1        550    5        27     9        4
2        81     6        2      10       14
3        77     7        12
4        30     8        21
• Throw out 81 people in pattern 2?
• We have data on five of the six variables
• Income might not be a key predictor
• Why is health missing in patterns 5 to 10? Was this by design?
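The cost of dropping incomplete cases can be read straight off the pattern frequencies. The following is a minimal sketch in Python, using the ten frequencies from the slide; treating pattern 1 as the complete-case pattern is an assumption, since the slide does not say so explicitly.

```python
# Pattern frequencies as listed on the slide (pattern number -> count).
pattern_freq = {1: 550, 2: 81, 3: 77, 4: 30, 5: 27,
                6: 2, 7: 12, 8: 21, 9: 4, 10: 14}

total = sum(pattern_freq.values())

# Assumption: pattern 1 is the pattern with no missing values.
complete = pattern_freq[1]
listwise_loss = total - complete

print(f"total cases:         {total}")
print(f"kept by listwise:    {complete} ({complete / total:.1%})")
print(f"dropped by listwise: {listwise_loss} ({listwise_loss / total:.1%})")
print(f"pattern 2 alone:     {pattern_freq[2]} ({pattern_freq[2] / total:.1%})")
```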
Amount of Missing Values
PROPORTION OF DATA PRESENT
          HLTH  CHILDS  HAP_GEN  INC   AGE   EDUC
HLTH      .90
CHILDS    .90   1.00
HAP_GEN   .77   .82     .82
INCOME98  .76   .83     .70      .83
AGE       .90   .99     .81      .82   .99
EDUC      .77   .82     .82      .70   .81   .822
• Coverage for income is low when paired with educ, hlth, hap_gen
• If income is “just” a control variable, find a substitute or impute
• Over 50% of cases are present for all the combinations
• Could be worse if you did 3-way (hlth, income, educ)
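A table of this kind is simple to compute: for each pair of variables, count the share of cases in which both are observed. The sketch below uses a small hypothetical dataset (the records and variable names are made up for illustration and are not the data behind the slide).

```python
from itertools import combinations_with_replacement

# Hypothetical toy records; None marks a missing value.
rows = [
    {"hlth": 3, "income": 40, "educ": 12},
    {"hlth": None, "income": 55, "educ": 16},
    {"hlth": 2, "income": None, "educ": None},
    {"hlth": 4, "income": 30, "educ": 14},
]

def proportion_present(rows, a, b):
    """Share of rows in which both variable a and variable b are observed."""
    both = sum(1 for r in rows if r[a] is not None and r[b] is not None)
    return both / len(rows)

variables = ["hlth", "income", "educ"]

# Lower triangle plus diagonal, as in the slide's table.
coverage = {
    (a, b): proportion_present(rows, a, b)
    for a, b in combinations_with_replacement(variables, 2)
}

for (a, b), p in coverage.items():
    print(f"{a:>6} x {b:<6} {p:.2f}")
```

The diagonal entries give the share present for each variable alone; the off-diagonal entries show how much data a two-variable analysis would actually use.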
Raw Data Missingness
ID Var1 Var2 Var3
1 9 7 .
2 . 3 5
3 7 4 .
4 9 4 6
5 6 2 7
6 . . 5
ID D1 D2 D3
1 0 0 1
2 1 0 0
3 0 0 1
4 0 0 0
5 0 0 0
6 1 1 0
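The second table is derived mechanically from the first: each D is 1 where the corresponding Var is missing and 0 where it is observed. A minimal sketch in Python, with `None` standing in for the "." missing code:

```python
# Raw data from the slide; None stands for the "." missing-value code.
raw = {
    1: (9, 7, None),
    2: (None, 3, 5),
    3: (7, 4, None),
    4: (9, 4, 6),
    5: (6, 2, 7),
    6: (None, None, 5),
}

# D = 1 where the value is missing, 0 where it is observed.
indicators = {
    case_id: tuple(int(v is None) for v in values)
    for case_id, values in raw.items()
}

print("ID D1 D2 D3")
for case_id, (d1, d2, d3) in indicators.items():
    print(f"{case_id:>2} {d1:>2} {d2:>2} {d3:>2}")
```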
Missing Completely at Random (MCAR)
• The missingness is random: D1, D2, D3 are uncorrelated with anything!
• Correlate variables with D1, D2, D3 (or use logistic regression)
• Consider race, gender, age, education
• None of these should be correlated with D1, D2, or D3
• This is not correlating variables with the raw score!
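One way to run this check is to correlate each missingness indicator with each candidate variable. The sketch below is illustrative only: the indicator `d1` and the variables `age` and `male` are hypothetical values, and a logistic regression would be the more formal alternative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; with a 0/1 indicator this is the point-biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: d1 = 1 if Var1 is missing for that case.
d1 = [0, 1, 0, 0, 0, 1, 0, 1]
age = [25, 61, 33, 41, 29, 70, 38, 66]
male = [1, 0, 0, 1, 1, 0, 0, 1]

r_age = pearson(d1, age)
r_male = pearson(d1, male)

print(f"r(D1, age)  = {r_age:.2f}")
print(f"r(D1, male) = {r_male:.2f}")
```

In this made-up example the older respondents are the ones with missing values, so the correlation with age is large and MCAR would be hard to defend. Note that the indicator is correlated with the other variables, not the raw score of Var1 itself.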
Missing at Random (MAR)
• The missingness is random after you control for
– Variables in your analysis
– Auxiliary variables
– Probability of missingness NOT dependent on unobserved variables
• Correlate variables with D1, D2, D3
• Consider auxiliary variables: race, gender, age, education
Missing at Random (MAR)
• Include auxiliary variables as mechanisms for missingness
– If they are correlated significantly with the missingness indicators D1, D2, D3
• Data are MAR after controlling for auxiliary variables
• Auxiliary variables are available in many datasets
Problem with Traditional Approaches
Listwise deletion: the standard default
– It excludes many observations (50%?)
– A case may be missing only one variable, and that variable may not be important
– In longitudinal program evaluations
• Missing those with low level of implementation
– If MCAR, this reduces power but is unbiased
– Without MCAR, this is biased
– Political Science Journal: 50% deleted
Problem with Traditional Approaches
Mean Substitution
– Mean is often a bad estimate
– Attenuates variance
– Reduces effects for variables with missing data, or
– Exaggerates effects for variables with little missing data
– Reduces R²
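The variance attenuation is easy to see numerically. In this minimal sketch (hypothetical values, half of the cases missing), every missing case is filled with the observed mean, which adds cases but no spread.

```python
from statistics import mean, variance

# Hypothetical observed scores; four more cases are missing.
observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 4

# Mean substitution: every missing case receives the observed mean.
filled = observed + [mean(observed)] * n_missing

var_observed = variance(observed)
var_filled = variance(filled)

print(f"mean before/after:     {mean(observed):.2f} / {mean(filled):.2f}")
print(f"variance before/after: {var_observed:.2f} / {var_filled:.2f}")
```

The mean is unchanged, but the sum of squared deviations is now divided across twice as many cases, so the variance shrinks.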
Problem with Traditional Approaches
Pairwise Deletion (rarely used)
– Each correlation on different subsample
– The set of correlations comes from no single sample
– May not be able to invert matrix
– What is the right sample size?
– If it works, usually better than mean substitution or listwise deletion
Problem with Traditional Approaches
Ordinary regression imputation
– Multiple regression is used to predict the missing score
– Predicted value adds no new information if the predictors are already in your model (collinearity)
– Does nothing about uncertainty of predictions
• If R² = .90, the predicted value is good
• If R² = .10, the predicted value has a lot of noise
– Thus, predicted values are “too good”
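The "too good" problem can be illustrated with one predictor. The sketch below uses hypothetical data: it fits a simple regression on the complete cases, imputes the predicted values, and then shows a stochastic variant that adds a random residual. The stochastic variant is included for contrast and is not something the slide prescribes.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical complete cases (x, y) and x values for cases where y is missing.
complete = [(1, 2.1), (2, 2.9), (3, 4.2), (4, 4.8), (5, 6.1), (6, 6.8)]
x_missing = [2.5, 3.5, 5.5]

# Least-squares slope and intercept from the complete cases.
xs, ys = zip(*complete)
mx, my = mean(xs), mean(ys)
slope = sum((x - mx) * (y - my) for x, y in complete) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Residual standard deviation (n - 2 degrees of freedom).
residuals = [y - (intercept + slope * x) for x, y in complete]
resid_sd = (sum(e ** 2 for e in residuals) / (len(complete) - 2)) ** 0.5

# Ordinary regression imputation: every imputed point sits exactly on the line.
deterministic = [intercept + slope * x for x in x_missing]

# Stochastic variant: add a random residual so imputed values carry noise.
stochastic = [yhat + random.gauss(0, resid_sd) for yhat in deterministic]

for x, d, s in zip(x_missing, deterministic, stochastic):
    print(f"x={x}: on-the-line {d:.2f}, with residual {s:.2f}")
```

Because the deterministic values have zero residual variance, any analysis run on them overstates how well x predicts y.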
Problem with Traditional Approaches
Single Imputation (SPSS Module) (MAR)
– American Statistician article--done incorrectly
– Single imputation does not incorporate variability between multiple imputations
– Reviewers for many journals not aware of limitations of single imputation so . . .
– Easy to implement using SPSS
Modern Approaches
Multiple Imputation--Assumes MAR
– Imputation is done 5-20 times
– Model is estimated 5-20 times
– Estimates (R’s, B’s, Betas) are averaged
– Standard errors--variances between solutions incorporated
– Reflects uncertainty of the process
– Always better than single imputation
Modern Approaches
Multiple Imputation
– Available with the best statistical packages
• Stata
• SAS
– Available with freeware programs that work in conjunction with statistical packages
• Norm
• Amelia
• IVEware
• Mice
Modern Approaches
Full Information Maximum Likelihood (FIML)
– Assumes MAR
– Uses all available information
– Assumes the patterns would be the same if nothing were missing
– Results similar to multiple imputation
– Available with SEM programs
• Mplus
• LISREL
• AMOS
• EQS
Modern Approaches
Full Information Maximum Likelihood
– Easy changes in SEM programs will do this
– Researchers rarely include auxiliary variables
– Researchers rarely include covariates unless in model
– Possible to add auxiliary/predictor variables
– Mplus allows for both FIML estimation and multiple imputation; nice to compare results
How Multiple Imputation Works: Non-technical Explanation
• All variables may have some missing values, including DV
• Eliminate observations with missing values on all variables
– Missing wave of panel is just missing values
• Estimate covariance matrix (listwise)
• Regress xi on remaining variables
How Multiple Imputation Works
• Add residual based on strength of prediction
– R² = .90: add small error
– R² = .10: add big error
• You now have an actual or imputed value for all observations on all variables
• Estimate a covariance matrix
• This covariance matrix should be “better” because it utilizes more information
How Multiple Imputation Works
• If covariance matrices are different
– Repeat process until successive covariance matrices are virtually identical
• This provides first imputed dataset
• Repeat this process m times
– Results: m imputed datasets with no missing values
How Multiple Imputation Works
• Estimate your model with each of your m imputed datasets
• Combine the results using Rubin’s rules
– Parameter estimates: mean of their m values
– Standard errors: inflate the mean of the standard errors based on how much the solutions vary
– Standard errors (hence t-tests) will be unbiased if the data are MAR
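Rubin's rules for a single coefficient can be sketched in a few lines. The pooled estimate is the mean of the m estimates; the pooled variance is the mean squared standard error (within-imputation variance) plus (1 + 1/m) times the variance of the estimates across imputations (between-imputation variance). The function name `pool_rubin` and the five estimates below are hypothetical, chosen only to illustrate the arithmetic.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                       # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                 # variance across imputations
    total = within + (1 + 1 / m) * between        # total variance
    return q_bar, sqrt(total)

# Hypothetical results from m = 5 imputed datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled_b, pooled_se = pool_rubin(estimates, std_errors)
print(f"pooled estimate = {pooled_b:.3f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates differ across imputations, which is how the procedure reflects the uncertainty of the imputed values.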
How FIML is Implemented: Mplus
Title: Missing values including mechanisms
Data: File is miss_systematic-999.dat ;
Variables: Names are childs satfin male hap_gen ident income98 educ hlth age;
 Missing are all (-999) ;
 Usevariables are hlth childs hap_gen income98 age educ satfin male ;
Analysis: Type = missing ; ! without this, you get listwise deletion
FIML: Mplus Example
Model: hlth on childs hap_gen income98 age educ ;
satfin on childs hap_gen income98 age educ ;
male on childs hap_gen income98 age educ ;
Output: standardized ;
1. The “hlth” and “satfin” lines are the model
2. The “male” line is a nonsense equation that includes any covariates or auxiliary variables
Freeware Dedicated Packages
Package    Single Imputation   Multiple Imputation   FIML
Amelia     X                   X
IVEware    X                   X
Norm       X                   X
MICE       X                   X
Mx                                                   X
Commercial Statistical Packages
Package       Single Imputation   Multiple Imputation   FIML
SAS (MI)                          X
SPSS (EM)     X
Stata (ice)   X                   X
Commercial FIML Packages
Package   Single Imputation   Multiple Imputation   FIML
AMOS                                                X
EQS                                                 X
HLM                                                 X
LISREL                                              X
Mplus                         X                     X
Web Pages for Selected Software
• Amelia gking.harvard.edu/amelia/
• IVEware http://www.isr.umich.edu/src/smp/ive/
• Norm http://www.stat.psu.edu/~jls/misoftwa.html#aut
• MX www.vcu.edu/mx/
• SPSS www.spss.com
• EQS www.mvsoft.com/
• LISREL http://www.ssicentral.com/hlm/index.html
• Mplus www.statmodel.com
• SAS www.sas.com
• Stata www.stata.com