The Missing Data Problem
• Problems with Statistical Inference
• Sample Size & Power
• Biased Results
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 1-2). Hoboken, New Jersey: John Wiley & Sons.
Real World Examples
• Respondents in a household survey refuse to report income
• Missing results of manufacturing experiment due to equipment failure
• Voters’ inability to express preference for a political candidate in an opinion poll
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 1-2). Hoboken, New Jersey: John Wiley & Sons.
Outline
• Common Assumptions and Missing Data Patterns
• Taxonomy of Methods for Handling Missing Values
• Multiple Imputation
• Maximum Likelihood
• Simulation
Missing Data Patterns
• Not all missing data are created equal
• Missing due to a random process
• Missing due to a non-random process
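The difference between the two processes can be sketched with a small simulation. This is an illustration only: the lognormal income distribution, the 30% refusal rate, and the rule that respondents above the median refuse more often are all assumptions made up for the example, not figures from the cited sources.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical incomes (assumed lognormal purely for illustration).
income = rng.lognormal(mean=10.5, sigma=0.5, size=n)

# Random process: every respondent has the same 30% chance of not answering.
missing_random = rng.random(n) < 0.30

# Non-random process (assumed rule): respondents above the median income
# refuse 60% of the time, everyone else 10% of the time.
p_refuse = np.where(income > np.median(income), 0.60, 0.10)
missing_nonrandom = rng.random(n) < p_refuse

true_mean = income.mean()
mean_random = income[~missing_random].mean()
mean_nonrandom = income[~missing_nonrandom].mean()

print(f"mean of all incomes:                 {true_mean:,.0f}")
print(f"observed mean, random missingness:   {mean_random:,.0f}")
print(f"observed mean, non-random missingness: {mean_nonrandom:,.0f}")
```

Under the random process the observed mean stays close to the full-sample mean; under the non-random process it is pulled downward because high earners are under-represented among the responses.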
A Simple Example: Income Survey
Westfall, P., & Henning, K. (2013). Understanding Advanced Statistical Methods (1st ed.). Boca Raton, Florida: CRC Press, Taylor & Francis Group.
Multivariate Missing Data Processes: MCAR and MAR
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf
Taxonomy of Missing-Data Methods
• Complete Case Analysis (Listwise Deletion)
• Available Case Analysis (Pairwise Deletion)
• Least Squares on Imputed Data
• Multiple Imputation
• Maximum Likelihood (and Bayes)
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 19-20). Hoboken, New Jersey: John Wiley & Sons.
Complete Case Analysis (Listwise Deletion)
• Easy to implement
• Works well when MCAR assumption is met
• Wastes a lot of information
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
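A minimal sketch of listwise deletion with pandas, using a small made-up table (the column names and values are hypothetical). It shows how much observed information is discarded when any row with a missing value is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; NaN marks a missing response.
df = pd.DataFrame({
    "income":    [52.0, np.nan, 61.0, 48.0, np.nan, 75.0],
    "age":       [34.0, 45.0, np.nan, 29.0, 51.0, 62.0],
    "education": [16.0, 12.0, 18.0, np.nan, 14.0, 20.0],
})

# Listwise deletion: keep only rows with no missing values at all.
complete_cases = df.dropna()

cells_observed = int(df.notna().sum().sum())  # observed values in the raw data
cells_kept = complete_cases.size              # values that survive the deletion

print(f"rows kept: {len(complete_cases)} of {len(df)}")
print(f"observed values kept: {cells_kept} of {cells_observed}")
```

Only four values are missing here, yet four of the six rows are dropped, which is the sense in which the method wastes information.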
Available Case Analysis (Pairwise Deletion)
• Attempts to minimize the loss of data in listwise deletion
• Increases the power of your test relative to listwise deletion
• Is usually outperformed by Maximum Likelihood
• Caveat: Can result in non-positive definite covariance matrices
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
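The caveat about non-positive definite matrices can be illustrated with a deliberately extreme, made-up data set. The sketch relies on the fact that pandas `DataFrame.corr()` computes each correlation from the rows where both variables are observed, which is pairwise deletion. Each pair of variables is observed on a different set of rows, so the three correlations are mutually inconsistent.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Contrived data: x-y, y-z, and x-z are each observed on different rows.
df = pd.DataFrame({
    "x": [1, 2, 3, nan, nan, nan, 1, 2, 3],
    "y": [1, 2, 3, 1, 2, 3, nan, nan, nan],
    "z": [nan, nan, nan, 1, 2, 3, 3, 2, 1],
})

# Pairwise deletion: every entry uses the rows available for that pair.
pairwise_corr = df.corr()

# A valid correlation matrix has no negative eigenvalues.
eigenvalues = np.linalg.eigvalsh(pairwise_corr.to_numpy())

print(pairwise_corr)
print("smallest eigenvalue:", eigenvalues.min())
print("rows left under listwise deletion:", len(df.dropna()))
```

Here x and y are perfectly positively correlated, as are y and z, yet x and z are perfectly negatively correlated. No single complete data set could produce all three values, and the negative eigenvalue reflects that.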
Least Squares Imputation Methods
• Unconditional Mean Substitution
• Conditional Mean Imputation based on X
• Conditional Mean Imputation based on X and Y
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
Unconditional Mean Substitution
• Just take the sample mean of the observed data and use it for the missing values
• Heavily biases the covariance matrix
• Bias can be corrected but the inferences (confidence intervals, tests, etc.) are distorted and over-precise
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
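A small sketch of unconditional mean substitution on made-up numbers (the variables `x` and `y` are hypothetical). It compares the variance and covariance computed from the observed data with those computed after filling in the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y is missing for the last two cases.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, 3.9, 6.2, 7.8, np.nan, np.nan],
})

# Replace each missing value with the mean of the observed values.
filled = df.fillna(df.mean())

var_observed = df["y"].var()               # uses the four observed values
var_filled = filled["y"].var()             # treats the filled values as real data

cov_observed = df["x"].cov(df["y"])        # uses rows where both are observed
cov_filled = filled["x"].cov(filled["y"])

print(f"variance of y:   observed {var_observed:.2f}, after filling {var_filled:.2f}")
print(f"cov(x, y):       observed {cov_observed:.2f}, after filling {cov_filled:.2f}")
```

The filled-in cases sit exactly at the mean, so they add sample size without adding spread. Both the variance and the covariance shrink, and any standard error built on them looks more precise than it should.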
Conditional Mean Imputation
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
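One common form of conditional mean imputation is regression imputation: fit a regression on the complete cases and fill each missing value with its predicted value. The sketch below assumes a single incomplete predictor `x2` imputed from a fully observed predictor `x1`; both names and all values are hypothetical, and the original slide may have used a different setup.

```python
import numpy as np

# Hypothetical predictors: x1 is fully observed, x2 is missing for two cases.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.2, 1.9, 3.1, 3.8, np.nan, np.nan])

observed = ~np.isnan(x2)

# Fit x2 = intercept + slope * x1 on the complete cases only.
slope, intercept = np.polyfit(x1[observed], x2[observed], deg=1)

# Fill each missing x2 with its conditional mean given x1.
x2_imputed = x2.copy()
x2_imputed[~observed] = intercept + slope * x1[~observed]

print("imputed x2:", np.round(x2_imputed, 2))
```

The imputed values fall exactly on the fitted line, with no residual scatter, so this approach still understates the variability in the data.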
Multiple Imputation
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 19-20). Hoboken, New Jersey: John Wiley & Sons.
Steps Involved in Multiple Imputation
• Introduce random variation into the process of imputing missing values
• Generate several data sets, each with different imputed values
• Perform an analysis on each data set
• Combine the results into a single set of parameter estimates, standard errors, and test statistics
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf
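The four steps above can be sketched end to end on simulated data. This is a simplified illustration, not the procedure from the cited paper: the data-generating model, the 30% missingness rate, and the choice of 20 imputations are assumptions, and the regression coefficients are held fixed across imputations, whereas a fully proper multiple imputation would also draw them from their posterior distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 20  # sample size and number of imputed data sets (both assumed)

# Simulated data: y depends on x; about 30% of y is then set to missing.
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
y_obs = y.copy()
y_obs[rng.random(n) < 0.30] = np.nan
seen = ~np.isnan(y_obs)

# Imputation model: regress y on x using the complete cases.
slope, intercept = np.polyfit(x[seen], y_obs[seen], deg=1)
residuals = y_obs[seen] - (intercept + slope * x[seen])
resid_sd = np.std(residuals, ddof=2)

estimates = []
for _ in range(m):
    # Steps 1-2: impute with random noise, giving a different data set each time.
    completed = y_obs.copy()
    noise = rng.normal(scale=resid_sd, size=int((~seen).sum()))
    completed[~seen] = intercept + slope * x[~seen] + noise
    # Step 3: run the analysis (here, simply the mean of y) on each data set.
    estimates.append(completed.mean())

# Step 4: combine the point estimates by averaging across the data sets.
pooled_mean = float(np.mean(estimates))
print(f"pooled estimate of the mean of y: {pooled_mean:.3f}")
```

Only the point estimate is pooled here; pooling the standard errors needs the within- and between-imputation variances as well.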
Introducing Randomness into a M.I. Model
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf
Adding Variability to the Imputed Values
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf
Why Do We Want to Add Variability?
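• Imputed values are estimates, not observed data; the added variability carries that uncertainty into the standard errors, which is the whole point of multiple imputation
http://www.stat.columbia.edu/~gelman/arm/missing.pdf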
Combining Inferences from Imputed Data
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf
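The usual way to combine results across imputed data sets is Rubin's rules: average the point estimates, then add the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The helper name `pool_rubin` and the example numbers below are hypothetical; the sketch assumes that each analysis returns a point estimate and its squared standard error.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m estimates and their variances using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)

    q_bar = q.mean()            # pooled point estimate
    within = u.mean()           # average within-imputation variance
    between = q.var(ddof=1)     # variance of the estimates across imputations
    total = within + (1 + 1 / m) * between

    return q_bar, float(np.sqrt(total))

# Made-up results from three imputed data sets.
pooled_estimate, pooled_se = pool_rubin(
    estimates=[1.0, 2.0, 3.0],
    variances=[0.5, 0.5, 0.5],
)
print(f"pooled estimate: {pooled_estimate:.3f}, pooled SE: {pooled_se:.3f}")
```

The between-imputation term is what separates this from simply averaging the standard errors: the more the estimates disagree across imputed data sets, the larger the pooled standard error.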
Likelihood-Based Inference
https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf
ML with Ignorable Missing Data
https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf
ML with Ignorable Missing Data
https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf
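As one concrete case, consider two variables assumed to be bivariate normal, with x fully observed and y missing for some cases in a way that depends only on x. Under those assumptions the ML estimate of the mean of y has a closed form: the complete-case mean of y, adjusted by the regression slope times the gap between the mean of x over all cases and the mean of x over the complete cases. The sketch below uses made-up numbers and is not taken from the cited webinar.

```python
import numpy as np

# Hypothetical data: x is fully observed, y is missing for the two largest x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.0, 6.0, 8.0, np.nan, np.nan])
seen = ~np.isnan(y)

# Regression of y on x among the complete cases.
slope, intercept = np.polyfit(x[seen], y[seen], deg=1)

# ML estimate of the mean of x uses every case, including those missing y.
mu_x_ml = x.mean()

# ML estimate of the mean of y: complete-case mean plus a regression adjustment.
mu_y_ml = y[seen].mean() + slope * (mu_x_ml - x[seen].mean())

mu_y_complete_case = y[seen].mean()
print(f"complete-case mean of y: {mu_y_complete_case:.2f}")
print(f"ML estimate of mean of y: {mu_y_ml:.2f}")
```

No values are imputed. The estimate uses the incomplete cases only through what their x values reveal about the mean of x, which is how likelihood-based methods recover information that listwise deletion discards.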
Comparison of Methods
Listwise
• Easiest to implement
• Has minimal effect if data are MCAR, or MAR for large sample sizes
• Has a tendency to bias results
Pairwise
• Uses more information than listwise
• Increases statistical power
• Also easy to implement
Multiple Imputation
• Requires no special software once the imputed datasets are generated
• Requires specification of a model
• Requires more assumptions
Maximum Likelihood
• Requires specification of a model for each variable
• Most asymptotically efficient
• Most complex
• You get model comparison statistics (AIC, BIC, etc.)