European Conference on Quality in Official Statistics 2008
Rome, 08.-11. July 2008
Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control
in the German IAB Establishment Panel
Jörg Drechsler, Stefan Bender (Institute for Employment Research, Germany)
& Susanne Rässler
(University of Bamberg)
2
Overview
Multiple Imputation for Statistical Disclosure Control
The IAB Establishment Panel
Application of The Two Approaches
Comparison of The Results
Conclusion
3
YsynthetischYsynthetischYsynthetischYsynthetisch
Fully synthetic data sets (Rubin 1993)
advantages: - data are fully synthetic- re-identification of single units almost impossible- all variables are still fully available
disadvantages: - strong dependence on the imputation model- setting up a model might be difficult/impossible
Yobserved
X Ynot observed
Ysynthetic
4
Partially synthetic data sets (Little 1993)
only potentially identifying or sensitive variables are replaced
5
Partially synthetic data sets (Little 1993)
only potentially identifying or sensitive variables are replaced
6
Partially synthetic data sets (Little 1993)
only potentially identifying or sensitive variables are replaced
advantages: - model dependence decreases- models are easier to set up
disadvantages: - true values remain in the data set- disclosure might still be possible
7
Overview
Multiple Imputation for Statistical Disclosure Control
The IAB Establishment Panel
Application of The Two Approaches
Comparison of the Results
Conclusions
8
The IAB Establishment Panel
Annually conducted Establishment Survey
Since 1993 in Western Germany, since 1996 in Eastern Germany
Population: All establishments with at least one employee covered by social security
Source: Official Employment Statistics
Response rate of repeatedly interviewed establishments more than 80%
Sample of more than 16.000 establishments in the last wave
Contents: employment structure, changes in employment, business policies, investment, training, remuneration,
working hours, collective wage agreements, works councils
9
Overview
Multiple Imputation for Statistical Disclosure Control
The IAB Establishment Panel
Application of the Two Approaches
Comparison of the Results
Conclusions
10
Generating fully synthetic data sets for the IAB Establishment Panel
Create a synthetic data set for selected variables from the wave 1997 from the Establishment Panel
Draw 10 new sample from the Official Employment Statistics using the same sampling design as for the Establishment Panel (Stratification by industry, size, and region)
The number of observations in each sample equals the number of observations in the panel ns=np=7332
Every sample is imputed ten times using sequential regression
Number of variables from the establishment panel: 48
Imputations are generated using IVEware by Raghunathan, Solenberger and Hoewyk (2001)
11
Imputation procedure for partially synthetic data
Only two variables are synthesized: - number of employees
- industry (16 categories)
Same variables for the imputation models
Imputation by sequential regression
Imputation model: - multinomial logit for the industry
- linear model for the cubic root of the nb of employees- 4 independent linear models defined by quartiles for the establishment size
Imputations based on own coding in R.
12
Overview
Multiple Imputation for Statistical Disclosure Control
The IAB Establishment Panel
Application of The Two Approaches
Comparison of the Results
Conclusion
13
Analytical validityCompare regression results from the original data with results from the synthetic data
First regression: Zwick (2005) analyses the productivity effects of different continuing
vocational training forms in Germany Probit regression to explain, why firms offer vocational training 13 Explanatory variables including: Share of qualified employees,
establishment size, industry, collective wage agreement, high qualification needs expected…
Second regression: Log(number of employees) on 15 industry dummies
Two data utility measures:- Comparison of the beta coefficients from the original data
set and the synthetic data sets- confidence interval overlap
14
Confidence interval overlapSuggested by Karr et al. (2006)
Measure the overlap of CIs from the original data and CIs from the synthetic data
The higher the overlap, the higher the data utility
Compute the average relative CI overlap for any k
ksynksyn
koverkover
korigkorig
koverkoverk LU
LULULU
J,,
,,
,,
,,
21
overUoverL
origL synL origUsynU
CI for the synthetic data
CI for the original data
15
Significant at the 0,1 % level Significant at the 1 % level Significant at the 5 % level
Results from the first regression (Zwick 2005)
16
Average overlap0,808 0,926
Average confidence interval (CI) overlap for the estimates from the first regression
17
= Significant at the 0,1 % level = Significant at the 1 % level = Significant at the 5 % level
Results from the second regression (log(nb. of employees) on industry)
= insignificant
180,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1,0
Industry dummy 1
Industry dummy 2
Industry dummy 3
Industry dummy 4
Industry dummy 5
Industry dummy 6
Industry dummy 7
Industry dummy 8
Industry dummy 9
Industry dummy 10
Industry dummy 11
Industry dummy 12
Industry dummy 13
Industry dummy 14
Industry dummy 15
Intercept
CI overlap fully synthetic data CI overlap part. synthetic dataAverage overlap
0,699 0,839
Average confidence interval (CI) overlap for the estimates from the second regression
19
Disclosure risk
Difficult to compare between partially and fully synthetic data sets
Disclosure risk is low for fully synthetic data sets, although not zero
DR is higher for partially synthetic data sets, because:
• True values remain in the data set
• Only survey respondents are included
For partially synthetic data sets a careful disclosure risk evaluation is necessary
20
Overview
Multiple Imputation for Statistical Disclosure Control
The IAB Establishment Panel
Application of The Two Approaches
Comparison of the Results
Conclusions
21
Conclusions
Generating synthetic data sets can be a useful method for SDC
Advantages for partially synthetic data sets:• Higher data validity• Imputation models easier to set up • Lower risk of biased imputations
Disadvantages for partially synthetic data sets:• Higher risk of disclosure• Careful disclosure risk evaluation necessary
Agencies will have to decide depending on the complexity of the survey and the expected risk of disclosure
22
Thank you for your attention