UNECE workshop on data editing and imputation, Vienna 22. April 2008

Jörg Drechsler (Institute for Employment Research, Germany) &

Trivellore Raghunathan (University of Michigan)


Evaluating Different Approaches for Multiple Imputation Under Linear Constraints

2

Overview

The Problem

The Data

A Little Background on Multiple Imputation

The Methodology

The Simulation Design

The Results

Conclusions/Future Work

3

The Problem

Some variables Y1, Y2, …, Yk have to sum up to a given total Yt

Examples

- turnover in different regions

- number of employees with different qualification levels

- Investment in different subcategories

$Y_t = Y_1 + Y_2 + \dots + Y_k$

5

The Data

The IAB Establishment Panel

The number of employees

$Y_t = Y_{work} + Y_{train} + Y_{exec} + Y_{own} + Y_{marg} + Y_{other}$

with

- Yt total number of employees

- Ywork number of blue collar + white collar workers

- Ytrain number of trainees

- Yexec number of executives

- Yown number of owners + working family members

- Ymarg number of “marginal” workers not covered by social security

- Yother number of other employees

6

The Data

Summary Statistics

- data is heavily skewed

- most variables are semi-continuous

- low variation for the number of owners

- additional constraint: all variables >= 0

                  Min.  1st Quart.  Median  Mean    3rd Quart.  Max.   nb of obs != 0
total nb of emp   1     6           19      128.6   79          22920  11536
workers           0     3           14      109.9   66          19410  11211
trainees          0     0           0       6.101   3           1552   5232
executives        0     0           0       6.124   0           6323   729
owners            0     0           0       0.6667  1           21     5735
marginal workers  0     0           1       5.413   3           2492   5772
others            0     0           0       0.4577  0           566    555

8


A Little Background on Multiple Imputation

Generate random draws from $P(y_{mis} \mid y_{obs})$

$P(y_{mis} \mid y_{obs}) = \int P(y_{mis}, \theta \mid y_{obs}) \, d\theta = \int P(y_{mis} \mid y_{obs}, \theta) \, P(\theta \mid y_{obs}) \, d\theta$

Imputation in two steps

1. Generate random draws for θ from its posterior distribution given the observed values, $P(\theta \mid y_{obs})$

2. Generate random draws for the missing values from the conditional predictive distribution given the drawn parameters, $P(y_{mis} \mid y_{obs}, \theta)$

Drawing from 1. can be difficult

Solution: MCMC techniques
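The two imputation steps can be illustrated with a deliberately simple model. The sketch below assumes a univariate normal model with a noninformative prior, which is not one of the models used in the talk; the function name `impute_normal` and the convention that `None` marks a missing value are hypothetical.

```python
import math
import random


def impute_normal(y, m=5, seed=0):
    """Return m completed copies of y; None marks a missing value.

    Assumes y_i ~ N(mu, sigma2) with prior p(mu, sigma2) proportional to 1/sigma2.
    """
    rng = random.Random(seed)
    obs = [v for v in y if v is not None]
    n = len(obs)
    mean = sum(obs) / n
    ss = sum((v - mean) ** 2 for v in obs)
    completed = []
    for _ in range(m):
        # Step 1: draw theta = (mu, sigma2) from its posterior given y_obs.
        # sigma2 | y_obs = ss / chi2(n - 1); chi2(k) is Gamma(shape=k/2, scale=2).
        sigma2 = ss / rng.gammavariate((n - 1) / 2.0, 2.0)
        mu = rng.gauss(mean, math.sqrt(sigma2 / n))
        # Step 2: draw y_mis from the predictive distribution given theta.
        sd = math.sqrt(sigma2)
        completed.append([v if v is not None else rng.gauss(mu, sd) for v in y])
    return completed
```

Because new parameters are drawn for every completed data set, the m copies differ in their imputed values, which is what lets the between-imputation variance reflect the uncertainty from the missing data.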

9

Gibbs Sampling

Generate random draws from conditional univariate distributions

P(Y1|Y-1,θ1)

…

P(Yk|Y-k,θk)

Iteration provides draws from the joint distribution

Imputation in two steps for every univariate distribution

Imputation model can vary for different variable types
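A toy case shows how iterating over univariate conditionals yields draws from the joint distribution. The sketch below assumes a standardized bivariate normal with known correlation `rho`, chosen only because its conditionals are available in closed form; it is not the imputation model of the talk, and the function names are hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter=5000, burn_in=500, seed=0):
    """Gibbs sampler for (Y1, Y2) standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho ** 2)  # conditional standard deviation
    y1, y2 = 0.0, 0.0
    draws = []
    for it in range(n_iter + burn_in):
        y1 = rng.gauss(rho * y2, sd)  # draw from P(Y1 | Y2)
        y2 = rng.gauss(rho * y1, sd)  # draw from P(Y2 | Y1)
        if it >= burn_in:
            draws.append((y1, y2))
    return draws


def sample_correlation(draws):
    """Empirical correlation of the sampled pairs."""
    n = len(draws)
    m1 = sum(a for a, _ in draws) / n
    m2 = sum(b for _, b in draws) / n
    cov = sum((a - m1) * (b - m2) for a, b in draws)
    v1 = sum((a - m1) ** 2 for a, _ in draws)
    v2 = sum((b - m2) ** 2 for _, b in draws)
    return cov / math.sqrt(v1 * v2)
```

After the burn-in, the empirical correlation of the draws should be close to `rho`, although only univariate distributions were ever sampled.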

11

The Methodology

Five imputation methods

- simple imputation of all variables

- independent imputation considering semi-continuity

- nested imputation of the proportions

- non-Bayesian Dirichlet imputation

- Bayesian Dirichlet/Multinomial imputation

12

Simple Imputation

Impute all variables independently

Transform all continuous variables by taking the cube root

Ignore semi-continuity

Use simple linear models

Use same models as for independent imputation under semi-continuity

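Fulfill constraints by:

- setting $Y_t = \sum_i Y_i$ if $Y_{t,imp} < \sum_i Y_{i,obs}$

- down-weighting all imputed subcategories if Yt is observed or $Y_{t,imp} \geq \sum_i Y_{i,obs}$

Example (Ytotal observed, Y1 and Y4 missing):

                  Ytotal  Y1  Y2  Y3  Y4  Y5  Y6
with missings     20      .   5   3   .   1   1
after imputation  20      18  5   3   2   1   1
after correction  20      9   5   3   1   1   1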
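The correction step can be sketched as a small function. It assumes the rule reads as follows: if the total was itself imputed and is smaller than the sum of the observed subcategories, the total is set to the sum of all subcategories; otherwise the imputed subcategories are rescaled so that everything adds up to the total. The function name `enforce_total` and its arguments are hypothetical.

```python
def enforce_total(total, total_observed, parts, imputed_flags):
    """Make the subcategories in `parts` add up to `total`.

    total_observed: True if the total was reported, False if it was imputed.
    imputed_flags[j]: True if parts[j] was imputed, False if it was observed.
    Returns (corrected total, corrected parts).
    """
    obs_sum = sum(p for p, imp in zip(parts, imputed_flags) if not imp)
    imp_sum = sum(p for p, imp in zip(parts, imputed_flags) if imp)
    if not total_observed and total < obs_sum:
        # The imputed total contradicts the observed parts: replace the total.
        return float(sum(parts)), [float(p) for p in parts]
    # Rescale only the imputed parts; observed values are never changed.
    factor = (total - obs_sum) / imp_sum if imp_sum > 0 else 0.0
    corrected = [p * factor if imp else float(p)
                 for p, imp in zip(parts, imputed_flags)]
    return float(total), corrected
```

With the numbers from the example (total 20 observed, imputed values 18 and 2, observed values 5, 3, 1, 1), the observed parts sum to 10, so the imputed parts are multiplied by 10/20 and become 9 and 1.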

13

Independent imputation

Impute all variables independently

Run a logit regression for all variables to address semi-continuity

Outcome: 1 if Yij>0, 0 otherwise

Run a linear regression only for the units with Yij>0 and impute only for missing units with positive outcome in the logit regression

Set all other values to 0

Depending on number of units with Yij>0 stratify for Western/Eastern Germany and two quantiles for establishment size

Use only 20 explanatory variables for number of executives and other workers, ≈ 100 variables for all other dependent variables

Use same correction methods afterwards
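The draw for a single missing value under this two-part scheme might look like the sketch below. It assumes that the logit model has already produced a probability `p_positive` and that the linear model has produced a mean `mu` and residual standard deviation `sigma` on the cube-root scale; these inputs, the choice of scale, and the function name are hypothetical, and the sketch leaves out the draw of the model parameters themselves.

```python
import random


def draw_semicontinuous(p_positive, mu, sigma, rng):
    """One imputation draw for a semi-continuous variable.

    p_positive: probability of a positive value (assumed logit model output).
    mu, sigma: mean and residual sd on the cube-root scale (assumed linear
    model output for the units with a positive value).
    """
    # Part 1: decide whether the value is positive at all.
    if rng.random() >= p_positive:
        return 0.0
    # Part 2: draw on the transformed scale and transform back.
    z = rng.gauss(mu, sigma)
    return max(z, 0.0) ** 3  # keeps the result non-negative
```

Separating the zero/positive decision from the size of the positive value is what lets the imputed variable keep its spike at zero.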

14

Nested Imputation of Proportions

Address semi-continuity with logit-model

Calculate proportions of the total for all subcategories with positive outcome

Use a logit transformation on the proportions

Transformed variables lie in ]-Inf;Inf[

Impute variables with linear models

Use almost the same models as for independent imputation under semi-continuity

Nested imputation: after imputing the number of workers, define proportions as $Y_{ij} / (Y_{i,total} - Y_{i,work})$

After imputation transform variables back and multiply with totals

Use same correction methods afterwards
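The transformation and its inverse can be sketched as follows, assuming that a proportion is taken relative to what is left of the total after the number of workers has been imputed. The function names and the argument `remainder` are hypothetical; the forward transformation is only defined for values strictly between 0 and the remainder.

```python
import math


def to_logit_proportion(y, remainder):
    """Map a count y with 0 < y < remainder to the real line."""
    p = y / remainder                  # proportion of the remaining total
    return math.log(p / (1.0 - p))     # logit transformation


def from_logit_proportion(z, remainder):
    """Back-transform a value on the logit scale to a count."""
    p = 1.0 / (1.0 + math.exp(-z))     # inverse logit, always in (0, 1)
    return p * remainder               # multiply with the (remaining) total
```

For a unit with 20 employees in total and 8 workers, the remainder is 12; a value of 3 trainees corresponds to a proportion of 0.25, which is imputed on the logit scale and mapped back to a count afterwards. Since the inverse logit always lies in (0, 1), a back-transformed subcategory can never exceed the remainder on its own.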

15

Non Bayesian Dirichlet Distribution

Following an idea by Tempelman (2007)

Ignore semi-continuity

Calculate nested proportions again

Assume Dirichlet distribution for the proportions

Generate starting values using the EM-Algorithm for the Dirichlet Distribution

16

Non Bayesian Dirichlet Distribution II

Imputation Algorithm (Data Augmentation):

- draw new values for α from $\alpha \mid Y_{obs}, Y_{mis} \sim N(\hat{\alpha}, V(\hat{\alpha}))$, with $\hat{\alpha}$ obtained by Maximum-Likelihood-Estimation

- draw new values for $Y^*_{i,mis} \mid Y_{obs}, \alpha \sim Dir(\alpha_{mis})$ of dimension $m_i$, with $m_i$ the number of observations to impute for unit i

- Calculate $Y_{i,mis} = (1 - \sum_j Y_{i,obs,j}) \, Y^*_{i,mis}$

Not fully Bayesian since the distribution of $\alpha \mid Y_{obs}, Y_{mis}$ is only approximated
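The last two steps of the algorithm can be sketched for a single unit. The sketch assumes that the Dirichlet parameters are already given (in the algorithm they are redrawn in every iteration) and that the missing proportions, rescaled to sum to one, follow a Dirichlet distribution with the parameters of the missing components. The function names and the use of `None` for missing proportions are hypothetical.

```python
import random


def draw_dirichlet(alphas, rng):
    """Draw from Dir(alphas) by normalizing independent Gamma(alpha, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]


def impute_proportions(props, alphas, rng):
    """Fill the missing (None) proportions of one unit.

    props: proportions of the total, None where missing.
    alphas: Dirichlet parameters, one per subcategory (assumed given).
    """
    mis = [j for j, p in enumerate(props) if p is None]
    remainder = 1.0 - sum(p for p in props if p is not None)
    # Draw the rescaled missing proportions, which sum to one.
    star = draw_dirichlet([alphas[j] for j in mis], rng)
    out = list(props)
    for j, s in zip(mis, star):
        # Scale back so that observed and imputed proportions sum to one.
        out[j] = remainder * s
    return out
```

Because the imputed proportions are scaled by what the observed proportions leave over, the constraint holds by construction and no correction step is needed afterwards.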

17

Bayesian Dirichlet/Multinomial Imputation

Generate starting values using the simple imputation approach

For each unit generate a random draw from the Dirichlet distribution $p \sim Dir(\alpha)$ with $\alpha = (Y_{i,work}, Y_{i,train}, Y_{i,exec}, Y_{i,own}, Y_{i,marg}, Y_{i,other})$

For each unit generate a random draw from a multinomial distribution with $size = Y_{total} - \sum_j Y_{obs,j}$ and $prob = p^*_{mis}$

$p^*_{mis}$: weighted vector p for missing obs, $\sum p^*_{mis} = 1$
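One iteration of this scheme for a single unit can be sketched as below. It assumes that the current values of all subcategories (starting values where missing) serve as the Dirichlet parameters and that a subcategory with a current value of zero receives probability zero; the latter is an assumption of this sketch, as are the function name and its arguments.

```python
import random


def impute_counts(total, counts, missing, rng):
    """One Dirichlet/multinomial update for a single unit.

    total: observed total of the unit.
    counts: current integer values of all subcategories.
    missing: indices of the subcategories that have to be imputed.
    """
    # Dirichlet draw via Gamma variates, needed only for the missing
    # components; normalizing over them gives the weighted vector p*_mis.
    weights = [rng.gammavariate(counts[j], 1.0) if counts[j] > 0 else 0.0
               for j in missing]
    # Number of employees that is left to distribute.
    size = total - sum(c for j, c in enumerate(counts) if j not in missing)
    out = list(counts)
    for j in missing:
        out[j] = 0
    if size > 0 and sum(weights) > 0:
        # Multinomial draw: assign each remaining employee to one category.
        for j in rng.choices(missing, weights=weights, k=size):
            out[j] += 1
    return out
```

The multinomial draw distributes exactly the remaining number of employees, so the imputed counts are integers and add up to the total without any correction.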

19


The Simulation Design

Use fully observed survey data (n=11536)

Generate a random sample with replacement of size n

Generate ≈30% missing values for each variable (MAR)

Impute missing values with different approaches (m=10, iterations=20)

Calculate different quantities of interest

Repeat whole process of sampling and imputation 100 times

20

Generating missing values

X1 expected development for the number of employees in the next five years (6 categories)

X2 number of unskilled workers

X3 industry-wide wage agreement (1=Yes)

An increase in any X leads to a decrease of pmis

321 01.05.04.1 XXXY

$p_{mis} = \exp(Y) / (1 + \exp(Y))$
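The missing-data mechanism can be sketched as a logistic function of a linear predictor in X1, X2 and X3. The coefficients in `coefs` are hypothetical placeholders chosen only to have negative slopes, so that an increase in any X lowers the probability; they are not the values used in the study, and the function names are hypothetical as well.

```python
import math
import random


def p_missing(x1, x2, x3, coefs=(1.0, -0.3, -0.01, -0.5)):
    """Probability that a value is set to missing.

    coefs = (intercept, slope for X1, slope for X2, slope for X3);
    hypothetical values, not taken from the study.
    """
    b0, b1, b2, b3 = coefs
    y = b0 + b1 * x1 + b2 * x2 + b3 * x3
    # Logistic function, written in a form that avoids overflow.
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


def set_missing(values, probs, rng):
    """Replace each value by None with its own missingness probability."""
    return [None if rng.random() < p else v for v, p in zip(values, probs)]
```

Since the probability depends only on covariates that stay fully observed, the resulting mechanism is missing at random (MAR).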

21

Quality measures

For all estimates of interest:

Compute the estimate from the original survey, $\hat{\theta}_{org}$

Compute the average estimate across the 100 samples, $E(\hat{\theta}_{sample})$

Compute the average estimate across the 100 imputed samples, $E(\hat{\theta}_{imp})$

Compute the 95% coverage rate for the fully observed samples and the imputed samples

Compute $\operatorname{var}(\hat{\theta}_{sample}) / \operatorname{var}(\hat{\theta}_{org})$

Compute $\operatorname{var}(\hat{\theta}_{imp}) / \operatorname{var}(\hat{\theta}_{org})$

Compute the average confidence interval overlap for the fully observed sample and the imputed sample

22

Confidence interval overlap

Suggested by Karr et al. (2006)

Measure the overlap of CIs from the original data and CIs from the imputed data

The higher the overlap, the higher the data utility

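Compute the average relative CI overlap for any estimate k

$J_k = \frac{1}{2} \left( \frac{U_{over,k} - L_{over,k}}{U_{orig,k} - L_{orig,k}} + \frac{U_{over,k} - L_{over,k}}{U_{syn,k} - L_{syn,k}} \right)$

[Figure: the CI for the original data (L_orig, U_orig) and the CI for the imputed data (L_syn, U_syn) on a common axis, with their intersection (L_over, U_over)]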
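The overlap measure for one estimate can be computed directly from the two intervals. The sketch below takes the intersection of the intervals as the overlap and averages its length relative to each interval; the function name and the tuple representation of an interval are hypothetical, and the result is not truncated at zero, so disjoint intervals give a negative value.

```python
def ci_overlap(orig, syn):
    """Average relative overlap of two confidence intervals.

    orig: (lower, upper) bound of the CI from the original data.
    syn: (lower, upper) bound of the CI from the imputed data.
    """
    l_orig, u_orig = orig
    l_syn, u_syn = syn
    # Bounds of the intersection of the two intervals.
    l_over = max(l_orig, l_syn)
    u_over = min(u_orig, u_syn)
    width = u_over - l_over
    # Overlap relative to each interval, averaged.
    return 0.5 * (width / (u_orig - l_orig) + width / (u_syn - l_syn))
```

Identical intervals give an overlap of 1. For the intervals (0, 10) and (5, 15), the intersection (5, 10) covers half of each interval, so the measure is 0.5.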

23

Estimates of Interest

Mean (Yi) in the 16 German Länder

Logit regression to explain collective wage agreements by establishment size

- Use number of employees covered by social security in 6 categories (employees covered by social security = workers + trainees):

Y~emp<10+emp<50+emp<100+emp<250+emp<750+emp>750+industry.dummies

- Compare the estimates for the establishment size from the different imputation methods

25

Example for the results

number of workers

          org     sample  mis     imp     sample  mis     imp     sample  mis     imp     sample     mis        imp        length
          mean    mean    mean    mean    cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
region1   79.85   81.67   94.16   81.78   0.97    0.81    0.97    0.81    0.65    0.81    1.01       1.24       1.02       392
region2   303.47  294.44  359.77  294.59  0.85    0.93    0.86    0.80    0.75    0.80    0.90       1.10       0.90       180
region3   110.78  111.65  126.29  111.65  0.93    0.70    0.93    0.79    0.60    0.78    0.99       1.19       0.99       834
region4   56.08   55.37   62.64   55.41   0.93    0.86    0.93    0.78    0.71    0.78    0.96       1.19       0.96       781
region5   181.80  183.64  213.98  183.49  0.83    0.74    0.83    0.77    0.61    0.77    0.98       1.20       0.98       1126
region6   122.21  123.29  141.48  123.31  0.95    0.76    0.95    0.79    0.63    0.79    1.00       1.19       1.00       746
region7   83.05   85.24   96.92   85.30   0.97    0.80    0.97    0.80    0.62    0.80    1.02       1.21       1.02       560
region8   158.33  159.85  187.97  159.87  0.93    0.89    0.93    0.79    0.69    0.79    0.98       1.20       0.98       902
region9   217.41  218.12  256.89  218.06  0.90    0.89    0.90    0.80    0.69    0.80    0.97       1.16       0.97       866
region10  71.62   72.53   91.07   72.58   0.89    0.87    0.89    0.80    0.70    0.80    0.96       1.21       0.96       396
region11  109.49  108.11  124.26  108.03  0.70    0.88    0.70    0.79    0.74    0.79    0.84       1.01       0.84       594
region12  50.06   49.64   53.91   49.65   0.92    0.91    0.92    0.78    0.73    0.78    0.97       1.16       0.97       777
region13  60.71   61.61   66.80   61.59   0.95    0.90    0.94    0.80    0.72    0.80    1.01       1.20       1.01       682
region14  86.23   87.16   97.63   87.32   0.89    0.89    0.90    0.77    0.71    0.77    0.97       1.23       0.97       949
region15  73.57   73.75   79.30   73.87   0.95    0.87    0.95    0.80    0.71    0.80    0.99       1.17       0.99       793
region16  60.69   61.23   68.29   61.30   0.96    0.93    0.97    0.81    0.71    0.81    0.99       1.25       0.99       958
average                                   0.91    0.85    0.91    0.79    0.69    0.79    0.97       1.18       0.97

26

Results Averaged Over Different Regions

               sample  mis     imp     sample  mis     imp     sample     mis        imp
               cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio

workers
simple         0.908   0.852   0.909   0.793   0.686   0.793   0.972      1.182      0.971
independent    0.895   0.861   0.893   0.788   0.691   0.788   0.963      1.173      0.966
proportions    0.906   0.868   0.899   0.798   0.696   0.797   0.969      1.175      0.969
Dirichlet      0.906   0.876   0.924   0.791   0.692   0.785   0.972      1.184      1.261
Bayesian Dir.  0.893   0.870   0.922   0.791   0.698   0.778   0.968      1.178      1.656

trainees
simple         0.893   0.888   0.902   0.794   0.769   0.796   0.955      1.139      0.962
independent    0.898   0.892   0.909   0.793   0.775   0.794   0.958      1.138      1.004
proportions    0.894   0.900   0.937   0.798   0.776   0.797   0.963      1.143      1.153
Dirichlet      0.890   0.885   0.907   0.795   0.773   0.791   0.968      1.148      1.167
Bayesian Dir.  0.884   0.886   0.890   0.793   0.770   0.793   0.963      1.140      0.992

27


               sample  mis     imp     sample  mis     imp     sample     mis        imp
               cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio

executives
simple         0.863   0.897   0.868   0.801   0.783   0.802   0.938      1.120      0.943
independent    0.827   0.861   0.833   0.786   0.769   0.786   0.930      1.101      0.942
proportions    0.855   0.889   0.886   0.795   0.776   0.798   0.949      1.127      1.025
Dirichlet      0.850   0.874   0.861   0.790   0.770   0.789   0.954      1.139      1.095
Bayesian Dir.  0.845   0.869   0.853   0.793   0.774   0.794   0.936      1.111      0.947

owners
simple         0.937   0.798   0.945   0.787   0.671   0.708   0.996      1.128      1.694
independent    0.946   0.802   0.943   0.791   0.674   0.685   0.995      1.126      2.576
proportions    0.938   0.806   0.951   0.797   0.674   0.590   0.996      1.128      4.394
Dirichlet      0.943   0.778   0.795   0.791   0.661   0.519   0.996      1.126      3.470
Bayesian Dir.  0.949   0.806   0.982   0.796   0.673   0.711   0.998      1.127      2.505

28


               sample  mis     imp     sample  mis     imp     sample     mis        imp
               cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio

marginal workers
simple         0.865   0.921   0.873   0.792   0.757   0.793   0.948      1.238      0.957
independent    0.882   0.928   0.899   0.797   0.762   0.797   0.959      1.250      1.025
proportions    0.876   0.929   0.916   0.802   0.759   0.793   0.947      1.237      1.126
Dirichlet      0.888   0.919   0.916   0.799   0.759   0.791   0.956      1.250      1.113
Bayesian Dir.  0.874   0.928   0.903   0.794   0.757   0.794   0.954      1.243      1.017

others
simple         0.803   0.808   0.852   0.790   0.760   0.783   0.912      1.118      1.045
independent    0.804   0.811   0.866   0.793   0.762   0.777   0.916      1.118      1.220
proportions    0.800   0.822   0.937   0.790   0.765   0.746   0.903      1.101      2.150
Dirichlet      0.819   0.825   0.905   0.793   0.763   0.735   0.928      1.139      1.560
Bayesian Dir.  0.799   0.810   0.873   0.789   0.761   0.775   0.913      1.102      1.201

29

Average absolute deviation

                  simple       independent  proportions  Dirichlet    Bayesian Dirichlet
employees total   0.344        0.381        0.340        4.877        8.045
workers           0.219        0.349        0.836        4.172        7.688
trainees          0.073        0.130        0.328        0.298        0.135
executives        0.050        0.078        0.267        0.186        0.070
owners            0.043        0.069        0.160        0.143        0.056
marginal workers  0.079        0.133        0.234        0.283        0.220
others            0.048        0.070        0.151        0.131        0.060

30

Results for the regression

simple

            org     sample  mis     imp     sample  mis     imp     sample  mis     imp     sample     mis        imp
            mean    mean    mean    mean    cov     cov     cov     overl   overl   overl   var_ratio  var_ratio  var_ratio
Intercept   -1.09   -1.09   -0.79   -1.10   0.94    0.52    0.93    0.78    0.47    0.79    1.01       1.24       1.01
10<x<=50    0.84    0.83    0.91    0.84    0.92    0.83    0.92    0.80    0.65    0.81    1.00       1.25       1.02
50<x<100    1.29    1.29    1.41    1.27    0.94    0.77    0.96    0.80    0.61    0.80    1.00       1.33       1.02
100<x<=250  1.81    1.81    1.89    1.81    0.97    0.86    0.96    0.79    0.70    0.79    1.00       1.30       1.02
250<x<=750  2.35    2.36    2.32    2.35    0.92    0.90    0.93    0.77    0.76    0.78    1.00       1.26       1.01
>750 emp.   3.86    3.93    3.81    3.93    0.98    0.93    0.97    0.80    0.76    0.80    1.03       1.21       1.04

independent

[Table: org mean, sample mean, mis mean, imp mean, sample cov, mis cov, imp cov, sample overl, mis overl, imp overl, sample var_ratio, mis var_ratio, imp var_ratio; values not preserved]

proportions

[Table: org mean, sample mean, mis mean, imp mean, sample cov, mis cov, imp cov, sample overl, mis overl, imp overl, sample var_ratio, mis var_ratio, imp var_ratio; values not preserved]

31

Results for the regression II

Bayesian Dirichlet

[Table: org mean, sample mean, mis mean, imp mean, sample cov, mis cov, imp cov, sample overl, mis overl, imp overl, sample var_ratio, mis var_ratio, imp var_ratio; values not preserved]

Dirichlet

[Table: org mean, sample mean, mis mean, imp mean, sample cov, mis cov, imp cov, sample overl, mis overl, imp overl, sample var_ratio, mis var_ratio, imp var_ratio; values not preserved]


33

Conclusions

All methods provide good repeated sampling properties

Differences between the approaches are relatively small

Dirichlet and proportions approach tend to introduce more variability

Dirichlet and proportions approach don’t work very well for owners and others

The simple approach seems to work best with high coverage and low additional variability

Future Work: Compare the same approaches for more equally distributed subcategories

34

Thank you for your attention
