Jörg Drechsler (Institute for Employment Research, Germany) &
Trivellore Raghunathan (University of Michigan)
UNECE workshop on data editing and imputation, Vienna
22 April 2008
Evaluating Different Approaches for Multiple Imputation Under
Linear Constraints
Overview
The Problem
The Data
A Little Background on Multiple Imputation
The Methodology
The Simulation Design
The Results
Conclusions/Future Work
The Problem
Some variables Y1, Y2, …, Yk have to sum up to a given total Yt:
$Y_1 + Y_2 + \dots + Y_k = Y_t$
Examples
- turnover in different regions
- number of employees with different qualification levels
- investment in different subcategories
The Data
The IAB Establishment Panel
The number of employees: $Y_t = Y_{work} + Y_{train} + Y_{exec} + Y_{own} + Y_{marg} + Y_{other}$
with
- Yt: total number of employees
- Ywork: number of blue collar + white collar workers
- Ytrain: number of trainees
- Yexec: number of executives
- Yown: number of owners + working family members
- Ymarg: number of “marginal” workers not covered by social security
- Yother: number of other employees
The Data
Summary Statistics
- data are heavily skewed
- most variables are semi-continuous
- low variation for the number of owners
- additional constraint: all variables >= 0

                  Min.  1st Quart.  Median    Mean  3rd Quart.   Max.  nb of obs != 0
total nb of emp      1           6      19   128.6          79  22920           11536
workers              0           3      14   109.9          66  19410           11211
trainees             0           0       0   6.101           3   1552            5232
executives           0           0       0   6.124           0   6323             729
owners               0           0       0  0.6667           1     21            5735
marginal workers     0           0       1   5.413           3   2492            5772
others               0           0       0  0.4577           0    566             555
A Little Background on Multiple Imputation
Generate random draws from $P(y_{mis} \mid y_{obs})$:
$P(y_{mis} \mid y_{obs}) = \int P(y_{mis}, \theta \mid y_{obs}) \, d\theta = \int P(y_{mis} \mid y_{obs}, \theta) \, P(\theta \mid y_{obs}) \, d\theta$
Imputation in two steps
1. Generate random draws for θ from its posterior distribution given the observed values, $P(\theta \mid y_{obs})$
2. Generate random draws for the missing values from the conditional predictive distribution given the drawn parameters, $P(y_{mis} \mid y_{obs}, \theta)$
Drawing from 1. can be difficult
Solution: MCMC techniques
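The two imputation steps can be illustrated on a deliberately small case. The sketch below assumes a univariate normal model with the standard noninformative prior, which is an assumption made for illustration and not a model from these slides; the function name `impute_normal` and the toy data are hypothetical.

```python
import random
import statistics

def impute_normal(y_obs, n_mis, rng):
    """One set of imputations under an assumed univariate normal model."""
    n = len(y_obs)
    mean = statistics.fmean(y_obs)
    s2 = statistics.variance(y_obs)              # sample variance, divisor n - 1
    # Step 1: draw the parameters (mu, sigma2) from their posterior.
    chi2 = rng.gammavariate((n - 1) / 2, 2.0)    # chi-square draw with n - 1 df
    sigma2 = (n - 1) * s2 / chi2
    mu = rng.gauss(mean, (sigma2 / n) ** 0.5)
    # Step 2: draw the missing values given the drawn parameters.
    return [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_mis)]

rng = random.Random(1)
y_obs = [4.1, 5.0, 3.8, 6.2, 5.5, 4.7]           # hypothetical observed values
# m = 10 imputations: repeat both steps independently for each one.
imputations = [impute_normal(y_obs, n_mis=2, rng=rng) for _ in range(10)]
```

Because the parameters are redrawn for every imputation, the imputed values differ between the m data sets by more than pure residual noise, which is what lets the combined variance reflect the uncertainty about θ.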
Gibbs Sampling
Generate random draws from conditional univariate distributions
$P(Y_1 \mid Y_{-1}, \theta_1)$
...
$P(Y_k \mid Y_{-k}, \theta_k)$
Iteration provides draws from the joint distribution
Imputation in two steps for every univariate distribution
Imputation model can vary for different variable types
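The cycling through univariate conditional models can be sketched for two variables. This is a simplified illustration only: it assumes simple linear conditional models and, for brevity, plugs in least-squares estimates instead of drawing the parameters from their posterior in each round. The names `fit_line` and `chained_imputation` and the data are hypothetical.

```python
import random

def fit_line(x, y):
    """Least-squares fit of y = a + b * x; returns (a, b, residual sd)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, (rss / (n - 2)) ** 0.5

def chained_imputation(y1, y2, iterations, rng):
    """Cycle through both conditional models; None marks a missing value."""
    cols = [list(y1), list(y2)]
    missing = [[v is None for v in col] for col in cols]
    # Starting values: fill each gap with the observed mean of its variable.
    for col in cols:
        observed = [v for v in col if v is not None]
        fill = sum(observed) / len(observed)
        for i, v in enumerate(col):
            if v is None:
                col[i] = fill
    for _ in range(iterations):
        for target, other in ((0, 1), (1, 0)):
            # Fit on units where the target is observed, then redraw its gaps.
            rows = [i for i, m in enumerate(missing[target]) if not m]
            a, b, sd = fit_line([cols[other][i] for i in rows],
                                [cols[target][i] for i in rows])
            for i, is_missing in enumerate(missing[target]):
                if is_missing:
                    cols[target][i] = a + b * cols[other][i] + rng.gauss(0.0, sd)
    return cols

rng = random.Random(2)
y1 = [1.0, 2.0, None, 4.0, 5.0, 6.0]
y2 = [2.1, None, 6.2, 8.1, 9.8, 12.2]
completed = chained_imputation(y1, y2, iterations=20, rng=rng)
```

Each variable could use a different model type inside the loop (for example a logit model for a binary variable), which is the flexibility the slide refers to.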
The Methodology
Five imputation methods
- simple imputation of all variables
- independent imputation considering semi-continuity
- nested imputation of the proportions
- non-Bayesian Dirichlet imputation
- Bayesian Dirichlet/Multinomial imputation
Simple Imputation
Impute all variables independently
Transform all continuous variables by taking the cube root
Ignore semi-continuity
Use simple linear models
Use same models as for independent imputation under semi-continuity
Fulfill constraints by:
- setting $Y_t = \sum_i Y_i$ if $Y_{t,imp} < \sum_i Y_{i,obs}$
- down-weighting all imputed subcategories if Yt is observed or $Y_{t,imp} > \sum_i Y_{i,obs}$
Example:
Data with missing values
Ytotal Y1 Y2 Y3 Y4 Y5 Y6
20     .  5  3  .  1  1
After imputation
Ytotal Y1 Y2 Y3 Y4 Y5 Y6
20     18 5  3  2  1  1
After down-weighting
Ytotal Y1 Y2 Y3 Y4 Y5 Y6
20     9  5  3  1  1  1
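The down-weighting step can be sketched as a proportional rescaling of the imputed cells. This is one plausible reading of the correction, not the authors' code; the function name `enforce_total` is hypothetical, and the sketch rescales in either direction and does not round to whole numbers.

```python
def enforce_total(total, values, imputed):
    """Rescale imputed subcategories so that all values sum to `total`.

    values  : subcategory values after imputation
    imputed : flags, True where the value was imputed
    """
    observed_sum = sum(v for v, flag in zip(values, imputed) if not flag)
    imputed_sum = sum(v for v, flag in zip(values, imputed) if flag)
    remainder = total - observed_sum      # amount left for the imputed cells
    if imputed_sum == 0 or remainder < 0:
        return list(values)               # cases this sketch does not handle
    factor = remainder / imputed_sum
    return [v * factor if flag else v for v, flag in zip(values, imputed)]

# Total 20, with Y1 and Y4 imputed as 18 and 2; observed cells sum to 10.
corrected = enforce_total(
    total=20,
    values=[18, 5, 3, 2, 1, 1],
    imputed=[True, False, False, True, False, False],
)
print(corrected)  # [9.0, 5, 3, 1.0, 1, 1]
```

The observed cells are never changed; only the imputed cells share the remaining 10 employees in proportion to their imputed values.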
Independent imputation
Impute all variables independently
Run a logit regression for all variables to address semi-continuity
Outcome: 1 if Yij > 0, 0 otherwise
Run a linear regression only for the units with Yij > 0 and impute only for missing units with a positive outcome in the logit regression
Set all other values to 0
Depending on the number of units with Yij > 0, stratify by Western/Eastern Germany and two quantiles of establishment size
Use only 20 explanatory variables for the number of executives and other workers, ≈ 100 variables for all other dependent variables
Use same correction methods afterwards
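The two-part logic for semi-continuous variables can be shown in a toy version without covariates. Everything here is a simplifying assumption: the logit regression is replaced by the observed share of positive values, the linear regression by a normal fit on the cube-root scale (borrowed from the simple approach), and parameter uncertainty is ignored. The function name `impute_two_part` is hypothetical.

```python
import random
import statistics

def impute_two_part(y_obs, n_mis, rng):
    """Impute a semi-continuous variable in two parts (covariate-free toy)."""
    positives = [y for y in y_obs if y > 0]
    p_positive = len(positives) / len(y_obs)     # stands in for the logit model
    roots = [y ** (1 / 3) for y in positives]    # cube-root scale
    mu, sd = statistics.fmean(roots), statistics.stdev(roots)
    imputed = []
    for _ in range(n_mis):
        if rng.random() < p_positive:            # part 1: is the value positive?
            draw = rng.gauss(mu, sd)             # part 2: size of the positive value
            imputed.append(max(draw, 0.0) ** 3)  # back-transform, keep >= 0
        else:
            imputed.append(0.0)                  # all other values are set to 0
    return imputed

rng = random.Random(5)
y_obs = [0, 0, 3, 0, 12, 1, 0, 7, 0, 2]          # hypothetical trainee counts
values = impute_two_part(y_obs, n_mis=4, rng=rng)
```

Separating the zero/positive decision from the size of the positive value keeps the spike at zero in the imputed data, which a single linear model would smooth away.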
Nested Imputation of Proportions
Address semi-continuity with a logit model
Calculate proportions of the total for all subcategories with positive outcome
Use a logit transformation on the proportions
Transformed variables range over ]-Inf; Inf[
Impute variables with linear models
Use almost the same models as for independent imputation under semi-continuity
Nested imputation: after imputing the number of workers, define proportions as $Y_{ij} / (Y_{i,total} - Y_{i,work})$
After imputation transform variables back and multiply with totals
Use same correction methods afterwards
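The transformation chain for one nested proportion can be traced with a few numbers. The figures are hypothetical, and the sketch only shows the forward and backward transformation, not the linear imputation model in between.

```python
import math

def logit(p):
    """Map a proportion in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Map a real number back to a proportion in (0, 1)."""
    return 1 / (1 + math.exp(-z))

total, workers, trainees = 50, 30, 5        # hypothetical establishment
remaining = total - workers                 # nested: what is left after workers
share = trainees / remaining                # proportion of the remainder
z = logit(share)                            # unbounded scale for the linear model
trainees_back = inv_logit(z) * remaining    # back-transform, multiply with total
```

The logit is undefined for proportions of exactly 0 or 1, which is one reason the zero cases are handled separately before the proportions are computed.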
Non-Bayesian Dirichlet Distribution
Following an idea by Tempelman (2007)
Ignore semi-continuity
Calculate nested proportions again
Assume Dirichlet distribution for the proportions
Generate starting values using the EM algorithm for the Dirichlet distribution
Non-Bayesian Dirichlet Distribution II
Imputation Algorithm (Data Augmentation):
- draw new values for α from $\alpha \mid Y_{obs}, Y_{mis} \sim N(\hat{\alpha}, V(\hat{\alpha}))$, with $\hat{\alpha}$ obtained by maximum likelihood estimation
- draw new values for $Y^{*}_{mis} \mid Y_{obs}, \alpha \sim Dir(\alpha_{m_i,mis})$, with $m_i$ the number of observations to impute for unit i
- calculate $Y_{i,mis} = (1 - \sum_j Y_{i,obs_j}) \, Y^{*}_{i,mis}$
Not fully Bayesian since the distribution of $\alpha \mid Y_{obs}, Y_{mis}$ is only approximated
Use same correction methods afterwards
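The last two steps of the data augmentation algorithm can be sketched for a single unit. The sketch takes the Dirichlet parameters as given, so the draw from the normal approximation is not shown, and it assumes that the missing proportions are drawn from a Dirichlet distribution over the missing components only and then rescaled to the share not yet used by the observed components. The function names and all numbers are hypothetical.

```python
import random

def draw_dirichlet(alpha, rng):
    """Dirichlet draw via normalised gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def impute_proportions(props, alpha, rng):
    """Fill missing proportions (None) so the completed vector sums to 1."""
    missing = [j for j, p in enumerate(props) if p is None]
    observed_sum = sum(p for p in props if p is not None)
    # Draw the proportions among the missing components only.
    star = draw_dirichlet([alpha[j] for j in missing], rng)
    completed = list(props)
    for j, share in zip(missing, star):
        # Rescale to the share left over by the observed components.
        completed[j] = (1 - observed_sum) * share
    return completed

rng = random.Random(3)
alpha = [8.0, 1.5, 0.5, 1.0, 1.2, 0.3]          # hypothetical parameter draw
props = [0.70, None, 0.05, None, 0.10, None]    # proportions of one unit
completed = impute_proportions(props, alpha, rng)
```

By construction the completed proportions add up to one, so the total is respected without a separate correction as long as the total itself is observed.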
Bayesian Dirichlet/Multinomial Imputation
Generate starting values using the simple imputation approach
For each unit generate a random draw from the Dirichlet distribution $p \sim Dir(\alpha)$ with $\alpha = (Y_{i,work}, Y_{i,train}, Y_{i,exec}, Y_{i,own}, Y_{i,marg}, Y_{i,other})$
For each unit generate a random draw from a multinomial distribution with $size = Y_{total} - \sum_j Y_{obs,j}$ and $prob = p^{*}_{mis}$
$p^{*}_{mis}$: weighted vector p for missing obs, $\sum p^{*}_{mis} = 1$
Use same correction methods afterwards
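One pass of the Dirichlet/multinomial step for a single unit might look as follows. This is a sketch under assumptions: zero counts are given probability zero, a uniform fallback is used if all missing components have zero weight, and the total is taken as observed. None of these details are stated on the slide, and the function name `impute_counts` is hypothetical.

```python
import random

def impute_counts(counts, missing, total, rng):
    """Redraw the missing subcategory counts of one unit.

    counts  : current values for all subcategories (starting values or last draw)
    missing : flags, True where the subcategory has to be imputed
    total   : observed total for the unit
    """
    # p ~ Dir(alpha) with alpha equal to the current counts (via gamma variates).
    gammas = [rng.gammavariate(c, 1.0) if c > 0 else 0.0 for c in counts]
    idx = [j for j, flag in enumerate(missing) if flag]
    weights = [gammas[j] for j in idx]
    if sum(weights) == 0:
        weights = [1.0] * len(idx)      # fallback chosen for this sketch only
    # Number of employees still to be distributed over the missing cells.
    size = total - sum(c for c, flag in zip(counts, missing) if not flag)
    result = list(counts)
    for j in idx:
        result[j] = 0
    # Multinomial draw; rng.choices normalises the weights to sum to 1.
    for j in rng.choices(idx, weights=weights, k=size):
        result[j] += 1
    return result

rng = random.Random(7)
counts = [30, 4, 2, 1, 6, 1]                    # hypothetical current values
missing = [False, True, False, False, True, True]
new_counts = impute_counts(counts, missing, total=45, rng=rng)
```

Because the multinomial draw distributes exactly the remaining number of employees, the imputed values are integers and the unit total is met by construction.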
The Simulation Design
Use fully observed survey data (n=11536)
Generate a random sample with replacement of size n
Generate ≈30% missing values for each variable (MAR)
Impute the missing values with the different approaches (m=10, iterations=20)
Calculate different quantities of interest
Repeat whole process of sampling and imputation 100 times
Generating missing values
X1 expected development for the number of employees in the next five years (6 categories)
X2 number of unskilled workers
X3 industry-wide wage agreement (1=Yes)
An increase in any X leads to a decrease in pmis
321 01.05.04.1 XXXY
$p_{mis} = \exp(Y) / (1 + \exp(Y))$
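The missing-data mechanism can be sketched as a logistic model on the three covariates. The coefficients below are hypothetical placeholders and not the values used in the study; they are only chosen to be negative so that larger values of any X lower the probability of being missing.

```python
import math
import random

def missing_probability(x1, x2, x3, coefs):
    """Logistic transformation of the linear predictor Y."""
    b0, b1, b2, b3 = coefs
    y = b0 + b1 * x1 + b2 * x2 + b3 * x3
    return math.exp(y) / (1 + math.exp(y))

# Hypothetical coefficients: intercept, then slopes for X1, X2, X3.
coefs = (0.5, -0.3, -0.01, -0.2)
rng = random.Random(4)
units = [(2, 10, 1), (5, 0, 0), (1, 40, 1)]     # (X1, X2, X3) for three toy units
p_mis = [missing_probability(*u, coefs) for u in units]
# A value is set to missing with probability p_mis (MAR given X1, X2, X3).
is_missing = [rng.random() < p for p in p_mis]
```

Since the probability depends only on fully observed covariates and not on the values that are deleted, the mechanism is missing at random.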
Quality measures
For all estimates of interest:
Compute the estimate from the original survey, $\hat{\theta}_{org}$
Compute the average estimate across the 100 samples, $E(\hat{\theta}_{sample})$
Compute the average estimate across the 100 imputed samples, $E(\hat{\theta}_{imp})$
Compute the 95% coverage rate for the fully observed samples and the imputed samples
Compute $var(\hat{\theta}_{sample}) / var(\hat{\theta}_{org})$
Compute $var(\hat{\theta}_{imp}) / var(\hat{\theta}_{org})$
Compute the average confidence interval overlap for the fully observed sample and the imputed sample
Confidence interval overlap
Suggested by Karr et al. (2006)
Measure the overlap of CIs from the original data and CIs from the imputed data
The higher the overlap, the higher the data utility
Compute the average relative CI overlap for any estimate k:
$J_k = \frac{1}{2} \left( \frac{U_{over,k} - L_{over,k}}{U_{orig,k} - L_{orig,k}} + \frac{U_{over,k} - L_{over,k}}{U_{syn,k} - L_{syn,k}} \right)$
[Figure: the CI for the original data $(L_{orig}, U_{orig})$ and the CI for the imputed data $(L_{syn}, U_{syn})$ on a common axis; their intersection is $(L_{over}, U_{over})$]
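The overlap measure is straightforward to compute once both intervals are available. The sketch below assumes that the overlap interval is the intersection of the two intervals (the larger lower bound and the smaller upper bound); the function name `ci_overlap` is hypothetical.

```python
def ci_overlap(orig, imputed):
    """Average relative overlap of two confidence intervals (lower, upper)."""
    l_orig, u_orig = orig
    l_imp, u_imp = imputed
    # Intersection of the two intervals.
    l_over, u_over = max(l_orig, l_imp), min(u_orig, u_imp)
    width = u_over - l_over          # negative if the intervals do not intersect
    # Overlap relative to each interval's own length, then averaged.
    return 0.5 * (width / (u_orig - l_orig) + width / (u_imp - l_imp))

print(ci_overlap((0, 10), (5, 15)))  # 0.5
print(ci_overlap((1, 3), (1, 3)))    # 1.0
```

Identical intervals give an overlap of 1, and the value shrinks as the intervals drift apart or differ in length, so the measure reacts to both bias and changes in variance.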
Estimates of Interest
Mean (Yi) in the 16 German Länder
Logit regression to explain collective wage agreements by establishment size
- Use number of employees covered by social security in 6 categories (employees covered by social security = workers + trainees):
Y~emp<10+emp<50+emp<100+emp<250+emp<750+emp>750+industry.dummies
- Compare the estimates for the establishment size from the different imputation methods
Example for the results
number of workers
Columns: org mean, sample mean, mis mean, imp mean, sample cov, mis cov, imp cov, sample overl, mis overl, imp overl, sample var_ratio, mis var_ratio, imp var_ratio, length
region1 79.85 81.67 94.16 81.78 0.97 0.81 0.97 0.81 0.65 0.81 1.01 1.24 1.02 392
region2 303.47 294.44 359.77 294.59 0.85 0.93 0.86 0.80 0.75 0.80 0.90 1.10 0.90 180
region3 110.78 111.65 126.29 111.65 0.93 0.70 0.93 0.79 0.60 0.78 0.99 1.19 0.99 834
region4 56.08 55.37 62.64 55.41 0.93 0.86 0.93 0.78 0.71 0.78 0.96 1.19 0.96 781
region5 181.80 183.64 213.98 183.49 0.83 0.74 0.83 0.77 0.61 0.77 0.98 1.20 0.98 1126
region6 122.21 123.29 141.48 123.31 0.95 0.76 0.95 0.79 0.63 0.79 1.00 1.19 1.00 746
region7 83.05 85.24 96.92 85.30 0.97 0.80 0.97 0.80 0.62 0.80 1.02 1.21 1.02 560
region8 158.33 159.85 187.97 159.87 0.93 0.89 0.93 0.79 0.69 0.79 0.98 1.20 0.98 902
region9 217.41 218.12 256.89 218.06 0.90 0.89 0.90 0.80 0.69 0.80 0.97 1.16 0.97 866
region10 71.62 72.53 91.07 72.58 0.89 0.87 0.89 0.80 0.70 0.80 0.96 1.21 0.96 396
region11 109.49 108.11 124.26 108.03 0.70 0.88 0.70 0.79 0.74 0.79 0.84 1.01 0.84 594
region12 50.06 49.64 53.91 49.65 0.92 0.91 0.92 0.78 0.73 0.78 0.97 1.16 0.97 777
region13 60.71 61.61 66.80 61.59 0.95 0.90 0.94 0.80 0.72 0.80 1.01 1.20 1.01 682
region14 86.23 87.16 97.63 87.32 0.89 0.89 0.90 0.77 0.71 0.77 0.97 1.23 0.97 949
region15 73.57 73.75 79.30 73.87 0.95 0.87 0.95 0.80 0.71 0.80 0.99 1.17 0.99 793
region16 60.69 61.23 68.29 61.30 0.96 0.93 0.97 0.81 0.71 0.81 0.99 1.25 0.99 958
average - - - - 0.91 0.85 0.91 0.79 0.69 0.79 0.97 1.18 0.97 -
Results Averaged Over Different Regions
workers
               sample  mis     imp     sample  mis     imp     sample     mis        imp
               cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
simple         0.908   0.852   0.909   0.793   0.686   0.793   0.972      1.182      0.971
independent    0.895   0.861   0.893   0.788   0.691   0.788   0.963      1.173      0.966
proportions    0.906   0.868   0.899   0.798   0.696   0.797   0.969      1.175      0.969
Dirichlet      0.906   0.876   0.924   0.791   0.692   0.785   0.972      1.184      1.261
Bayesian Dir.  0.893   0.870   0.922   0.791   0.698   0.778   0.968      1.178      1.656
trainees
               sample  mis     imp     sample  mis     imp     sample     mis        imp
               cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
simple         0.893   0.888   0.902   0.794   0.769   0.796   0.955      1.139      0.962
independent    0.898   0.892   0.909   0.793   0.775   0.794   0.958      1.138      1.004
proportions    0.894   0.900   0.937   0.798   0.776   0.797   0.963      1.143      1.153
Dirichlet      0.890   0.885   0.907   0.795   0.773   0.791   0.968      1.148      1.167
Bayesian Dir.  0.884   0.886   0.890   0.793   0.770   0.793   0.963      1.140      0.992
Results Averaged Over Different Regions
executives
               sample  mis     imp     sample  mis     imp     sample     mis        imp
               cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
simple         0.863   0.897   0.868   0.801   0.783   0.802   0.938      1.120      0.943
independent    0.827   0.861   0.833   0.786   0.769   0.786   0.930      1.101      0.942
proportions    0.855   0.889   0.886   0.795   0.776   0.798   0.949      1.127      1.025
Dirichlet      0.850   0.874   0.861   0.790   0.770   0.789   0.954      1.139      1.095
Bayesian Dir.  0.845   0.869   0.853   0.793   0.774   0.794   0.936      1.111      0.947
owners
               sample  mis     imp     sample  mis     imp     sample     mis        imp
               cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
simple         0.937   0.798   0.945   0.787   0.671   0.708   0.996      1.128      1.694
independent    0.946   0.802   0.943   0.791   0.674   0.685   0.995      1.126      2.576
proportions    0.938   0.806   0.951   0.797   0.674   0.590   0.996      1.128      4.394
Dirichlet      0.943   0.778   0.795   0.791   0.661   0.519   0.996      1.126      3.470
Bayesian Dir.  0.949   0.806   0.982   0.796   0.673   0.711   0.998      1.127      2.505
Results Averaged Over Different Regions
marginal workers
               sample  mis     imp     sample  mis     imp     sample     mis        imp
               cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
simple         0.865   0.921   0.873   0.792   0.757   0.793   0.948      1.238      0.957
independent    0.882   0.928   0.899   0.797   0.762   0.797   0.959      1.250      1.025
proportions    0.876   0.929   0.916   0.802   0.759   0.793   0.947      1.237      1.126
Dirichlet      0.888   0.919   0.916   0.799   0.759   0.791   0.956      1.250      1.113
Bayesian Dir.  0.874   0.928   0.903   0.794   0.757   0.794   0.954      1.243      1.017
others
               sample  mis     imp     sample  mis     imp     sample     mis        imp
               cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
simple         0.803   0.808   0.852   0.790   0.760   0.783   0.912      1.118      1.045
independent    0.804   0.811   0.866   0.793   0.762   0.777   0.916      1.118      1.220
proportions    0.800   0.822   0.937   0.790   0.765   0.746   0.903      1.101      2.150
Dirichlet      0.819   0.825   0.905   0.793   0.763   0.735   0.928      1.139      1.560
Bayesian Dir.  0.799   0.810   0.873   0.789   0.761   0.775   0.913      1.102      1.201
Average absolute deviation
                  simple  independent  proportions  Dirichlet  Bayesian Dirichlet
employees total   0.344   0.381        0.340        4.877      8.045
workers           0.219   0.349        0.836        4.172      7.688
trainees          0.073   0.130        0.328        0.298      0.135
executives        0.050   0.078        0.267        0.186      0.070
owners            0.043   0.069        0.160        0.143      0.056
marginal workers  0.079   0.133        0.234        0.283      0.220
others            0.048   0.070        0.151        0.131      0.060
Results for the regression
simple
            org     sample  mis     imp     sample  mis     imp     sample  mis     imp     sample     mis        imp
            mean    mean    mean    mean    cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
Intercept   -1.09   -1.09   -0.79   -1.10   0.94    0.52    0.93    0.78    0.47    0.79    1.01       1.24       1.01
10<x<=50    0.84    0.83    0.91    0.84    0.92    0.83    0.92    0.80    0.65    0.81    1.00       1.25       1.02
50<x<100    1.29    1.29    1.41    1.27    0.94    0.77    0.96    0.80    0.61    0.80    1.00       1.33       1.02
100<x<=250  1.81    1.81    1.89    1.81    0.97    0.86    0.96    0.79    0.70    0.79    1.00       1.30       1.02
250<x<=750  2.35    2.36    2.32    2.35    0.92    0.90    0.93    0.77    0.76    0.78    1.00       1.26       1.01
>750 emp.   3.86    3.93    3.81    3.93    0.98    0.93    0.97    0.80    0.76    0.80    1.03       1.21       1.04
independent
            org     sample  mis     imp     sample  mis     imp     sample  mis     imp     sample     mis        imp
            mean    mean    mean    mean    cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
Intercept   -1.09   -1.08   -0.78   -1.08   0.96    0.52    0.96    0.79    0.45    0.79    1.01       1.24       1.01
10<x<=50    0.84    0.84    0.92    0.85    0.88    0.74    0.93    0.76    0.61    0.76    1.00       1.26       1.02
50<x<100    1.29    1.30    1.41    1.30    0.96    0.81    0.96    0.82    0.62    0.82    1.00       1.32       1.03
100<x<=250  1.81    1.82    1.89    1.82    0.95    0.86    0.95    0.80    0.69    0.81    1.00       1.30       1.02
250<x<=750  2.35    2.35    2.31    2.35    0.98    0.95    0.98    0.80    0.76    0.81    1.00       1.25       1.01
>750 emp.   3.86    3.91    3.76    3.91    0.95    0.89    0.97    0.77    0.71    0.78    1.03       1.18       1.03
proportions
            org     sample  mis     imp     sample  mis     imp     sample  mis     imp     sample     mis        imp
            mean    mean    mean    mean    cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
Intercept   -1.09   -1.08   -0.78   -1.08   0.91    0.51    0.91    0.77    0.44    0.77    1.00       1.24       1.00
10<x<=50    0.84    0.83    0.91    0.85    0.97    0.81    0.96    0.82    0.63    0.81    1.00       1.25       1.03
50<x<100    1.29    1.29    1.40    1.29    0.97    0.74    0.97    0.80    0.62    0.81    1.00       1.32       1.03
100<x<=250  1.81    1.81    1.88    1.82    0.91    0.90    0.96    0.79    0.71    0.79    1.00       1.30       1.03
250<x<=750  2.35    2.34    2.30    2.36    0.94    0.93    0.94    0.81    0.75    0.79    1.00       1.25       1.02
>750 emp.   3.86    3.94    3.85    3.95    0.96    0.93    0.96    0.80    0.76    0.80    1.04       1.24       1.07
Results for the regression II
Bayesian Dirichlet
            org     sample  mis     imp     sample  mis     imp     sample  mis     imp     sample     mis        imp
            mean    mean    mean    mean    cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
Intercept   -1.09   -1.07   -0.76   -1.08   0.92    0.48    0.93    0.77    0.43    0.78    1.00       1.24       1.00
10<x<=50    0.84    0.84    0.92    0.85    0.96    0.74    0.96    0.79    0.60    0.78    1.00       1.25       1.02
50<x<100    1.29    1.28    1.39    1.26    0.95    0.82    0.94    0.80    0.65    0.80    1.00       1.33       1.03
100<x<=250  1.81    1.82    1.88    1.78    0.94    0.91    0.95    0.81    0.72    0.79    1.00       1.30       1.02
250<x<=750  2.35    2.35    2.32    2.31    0.94    0.92    0.89    0.79    0.75    0.76    1.00       1.26       1.01
>750 emp.   3.86    3.94    3.82    3.62    0.94    0.94    0.80    0.76    0.75    0.70    1.05       1.22       0.99
Dirichlet
            org     sample  mis     imp     sample  mis     imp     sample  mis     imp     sample     mis        imp
            mean    mean    mean    mean    cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
Intercept   -1.09   -1.08   -0.79   -1.08   0.95    0.58    0.96    0.80    0.48    0.80    1.00       1.24       1.00
10<x<=50    0.84    0.83    0.92    0.84    0.98    0.81    0.98    0.82    0.62    0.83    1.00       1.25       1.03
50<x<100    1.29    1.29    1.42    1.27    0.94    0.79    0.95    0.80    0.59    0.79    1.00       1.32       1.02
100<x<=250  1.81    1.81    1.89    1.77    0.95    0.88    0.95    0.80    0.69    0.79    1.00       1.30       1.01
250<x<=750  2.35    2.35    2.30    2.31    0.94    0.96    0.96    0.80    0.76    0.79    1.00       1.25       1.01
>750 emp.   3.86    3.88    3.77    3.67    0.95    0.96    0.84    0.80    0.76    0.73    1.01       1.19       0.95
Conclusions
All methods provide good repeated sampling properties
Differences between the approaches are relatively small
The Dirichlet and proportions approaches tend to introduce more variability
The Dirichlet and proportions approaches don’t work very well for owners and others
The simple approach seems to work best, with high coverage and low additional variability
Future Work
Compare the same approaches for more equally distributed subcategories
Thank you for your attention