Synthetic data generation for anonymization purposes. Application to the Norwegian Survey on living conditions/EHIS
JOHAN HELDAL AND DIANA-CRISTINA IANCU
STATISTICS NORWAY, DEPARTMENT OF METHODOLOGY AND DATA COLLECTION
JOINT UNECE/EUROSTAT WORK SESSION ON STATISTICAL DATA CONFIDENTIALITY
29-31 OCTOBER 2019, THE HAGUE
Intro
• In some cases de-identification does not offer sufficient protection
• How to merge synthetic data based on surveys with other data sources, such as registers?
• An ideal solution:
◦ preserves data confidentiality
◦ allows for quality controls
◦ provides the same opportunities for data analysis as the non-anonymized data would
◦ enables international reporting to institutions
◦ provides the possibility to adjust data after a longer period
◦ can be reused with minimal adaptations
• Potential solution: create “satellite” datasets with register data that could be merged with the survey data as
needed, based on a key
Synthetic data generation
• Our solution: eliminate the key and match the “satellite” files through statistical matching
• We propose a model-free method that relies on statistical matching to replace the register information
corresponding to individuals from the original sample with register information corresponding to other similar
individuals
• Draw a “register sample” → Add register variables → Match the sample to the original survey and replace the
register data
Data
• Norwegian Survey on living conditions/EHIS (European Health Interview Survey) from 2015
• Comprehensive survey covering several topics
• Multiple uses for both aggregated results and microdata
• Conducted on a representative sample of individuals aged 16 and above
• The sample is divided into 19 strata, corresponding to the 19 counties in Norway
• Survey sample: 14,000 potential respondents for the entire country (700 individuals per county, except for
Oslo, with 1,400 individuals)
• Gross sample: 13,748 individuals
• Net sample: 8,164 individuals
Data
• Data collected before the interview:
◦ residence municipality of the respondent
◦ composition of the household
◦ name and address of the employer of each household member
◦ respondent’s occupation
• Data added after the interview is conducted:
◦ education
◦ income
◦ whether the respondent lives in a densely or sparsely populated area
◦ more detailed demographic information for the household and each family member, such as country of birth
and immigrant background
Method
• Step 1. Drawing a “register sample”
◦ We draw a sample of 42,000 individuals from the population register
◦ Survey respondents are not excluded prior to drawing the sample
• Step 2. Adding register variables
◦ Variables that would normally be linked to the Norwegian Survey on living conditions/EHIS
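Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which sampling design was used for the “register sample”, so simple random sampling without replacement is assumed here, and the toy register, its field names (`county`, `education`, `income`) and the helper functions are hypothetical.

```python
import random

def draw_register_sample(population_ids, size, seed=None):
    """Draw a simple random sample without replacement (assumed design)."""
    rng = random.Random(seed)
    return rng.sample(list(population_ids), size)

def add_register_variables(sample_ids, register):
    """Attach the register variables of each sampled person by ID lookup."""
    return [{"person_id": pid, **register[pid]} for pid in sample_ids]

# Toy population register with hypothetical fields and values.
register = {
    pid: {"county": pid % 19 + 1, "education": pid % 4, "income": 300_000 + 1_000 * pid}
    for pid in range(1, 1001)
}

# Survey respondents are not excluded: everyone in the register can be drawn.
sample_ids = draw_register_sample(register.keys(), size=100, seed=42)
register_sample = add_register_variables(sample_ids, register)
```

The real application draws 42,000 individuals; the toy size of 100 only keeps the example small.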
Method
• Step 3. Performing the statistical matching
◦ Match the enriched “register sample” to the original survey sample
◦ We follow the procedure outlined in D’Orazio (2016) for the statistical matching, which consists of five steps:
1) choosing the target variables
2) identifying the common variables
3) choosing the matching variables
4) applying a statistical matching method
5) evaluating the results
Method
• Step 3.1. Choosing the target variables
◦ Variables chosen for each of the two data sources
◦ The “register sample” can be adjusted by adding or removing register variables according to the needs of the
synthetic dataset users
• Step 3.2. Choosing the common variables
◦ Assess the definitions, accuracy and frequency distributions of the common variables
• Step 3.3. Choosing the matching variables
◦ Choose only relevant variables
◦ Apply the principle of parsimony
◦ In order to preserve the structure of the survey sample, we use county, gender, age and household size as
matching variables
Method
Figure: Distribution of gender in the EHIS survey sample compared with the “register sample”
Figure: Distribution of age in the EHIS survey sample compared with the “register sample”
Figure: Distribution of household size in the EHIS survey sample compared with the “register sample”
Method
• Step 3.4. Applying a statistical matching method

| | RND1 | RND2 | NN | NNC |
|---|---|---|---|---|
| Method family | Random hot deck statistical matching | Random hot deck statistical matching | Nearest neighbor distance hot deck statistical matching | Nearest neighbor distance hot deck statistical matching |
| Description | For every individual in the original survey sample, a donor is randomly selected from the donor dataset (“register sample”) | For every individual in the original survey sample, a donor is randomly selected from the donor dataset (“register sample”) | The closest donor to each record in the original survey is selected from the donor dataset, according to a distance computed on a subset of common variables | The closest donor to each record in the original survey is selected from the donor dataset, according to a distance computed on a subset of common variables |
| Version used | Only one donor is chosen randomly | Only one donor is chosen randomly among the 20 closest neighbors in terms of age and household size | Distance hot deck statistical matching without constraints | Constrained distance hot deck matching |
| Allows for the selection of a record as donor multiple times | Yes | Yes | Yes | No |
| Donation classes | County, Gender | County, Gender | County, Gender | County, Gender |
| Matching variables | Age, Household size | Age, Household size | Age, Household size | Age, Household size |
| Number of observations in the synthetic dataset | 8164 | 8164 | 8164 | 8164 |
| Number of distinct donors in the synthetic dataset | 5135 | 7285 | 7239 | 8164 |
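The three variants that allow a donor to be reused (RND1, RND2 and NN) can be illustrated with a minimal, standard-library Python sketch. It is not the implementation used for the results above: the Manhattan distance, the random tie-breaking, the field names (`hh_size`, `education`) and the toy data are all assumptions made for the example. The constrained variant (NNC) is left out because it requires solving an assignment problem over all records, which a per-record loop cannot do.

```python
import random
from collections import defaultdict

CLASS_VARS = ("county", "gender")   # donation classes
MATCH_VARS = ("age", "hh_size")     # matching variables

def distance(a, b):
    """Manhattan distance on the matching variables (an assumed choice)."""
    return sum(abs(a[v] - b[v]) for v in MATCH_VARS)

def hot_deck(recipients, donors, method, k=20, seed=None):
    """Pick one donor per recipient inside its donation class; donors may be reused."""
    rng = random.Random(seed)
    classes = defaultdict(list)
    for d in donors:
        classes[tuple(d[v] for v in CLASS_VARS)].append(d)

    matched = []
    for r in recipients:
        pool = classes.get(tuple(r[v] for v in CLASS_VARS), [])
        if not pool:
            raise ValueError("no donor in the recipient's donation class")
        if method == "RND1":        # any donor in the class
            donor = rng.choice(pool)
        elif method == "RND2":      # random among the k closest donors
            donor = rng.choice(sorted(pool, key=lambda d: distance(r, d))[:k])
        elif method == "NN":        # closest donor, ties broken at random
            best = min(distance(r, d) for d in pool)
            donor = rng.choice([d for d in pool if distance(r, d) == best])
        else:
            raise ValueError(f"unknown method: {method}")
        matched.append(donor)
    return matched

# Toy data with hypothetical field names and values.
gen = random.Random(1)

def person(i):
    return {"id": i, "county": i % 2 + 1, "gender": "FM"[(i // 2) % 2],
            "age": gen.randint(16, 90), "hh_size": gen.randint(1, 5)}

donors = [dict(person(i), education=gen.randint(0, 8)) for i in range(600)]
recipients = [person(i) for i in range(50)]

results = {m: hot_deck(recipients, donors, m, seed=7) for m in ("RND1", "RND2", "NN")}
# Synthetic records: the survey record plus the donor's register variable.
synthetic = [dict(r, education=d["education"]) for r, d in zip(recipients, results["NN"])]
distinct_donors = {m: len({d["id"] for d in ds}) for m, ds in results.items()}
```

`distinct_donors` corresponds to the last row of the table: with reuse allowed, the number of distinct donors can fall below the number of synthetic records.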
Method
• Step 3.5. Evaluating the results
◦ Assess the representativeness of the synthetic dataset
◦ Check the marginal distribution of the imputed variables
◦ Check the joint distribution of the imputed variables with the matching variables (the distribution in the
donor dataset, i.e. the register sample, is the reference)
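The four measures reported in the results tables can be computed as in the sketch below. The slides do not give the formulas, so the usual definitions are assumed: the dissimilarity index is half the sum of absolute differences between the two sets of proportions, the overlap is one minus the dissimilarity index, the Bhattacharyya coefficient is the sum of the square roots of the products of proportions, and Hellinger's distance is the square root of one minus the Bhattacharyya coefficient. The function names and toy data are hypothetical.

```python
from collections import Counter
from math import sqrt

def proportions(values, categories):
    """Relative frequency of each category, in the order given."""
    counts = Counter(values)
    n = len(values)
    return [counts[c] / n for c in categories]

def compare_distributions(synthetic, reference):
    """Compare two categorical distributions (assumed standard definitions)."""
    categories = sorted(set(synthetic) | set(reference))
    p = proportions(synthetic, categories)
    q = proportions(reference, categories)
    dissimilarity = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    bhattacharyya = sum(sqrt(a * b) for a, b in zip(p, q))
    return {
        "dissimilarity": dissimilarity,
        "overlap": 1 - dissimilarity,
        "bhattacharyya": bhattacharyya,
        # max() guards against a tiny negative value from rounding.
        "hellinger": sqrt(max(0.0, 1 - bhattacharyya)),
    }

# Marginal distribution of one variable (toy data).
synthetic_gender = ["F"] * 52 + ["M"] * 48
reference_gender = ["F"] * 50 + ["M"] * 50
marginal = compare_distributions(synthetic_gender, reference_gender)

# Joint distribution: pair the values of two variables record by record.
synthetic_county = [1, 2] * 50
reference_county = [1, 2] * 50
joint = compare_distributions(
    list(zip(synthetic_county, synthetic_gender)),
    list(zip(reference_county, reference_gender)),
)
```

In this toy example the dissimilarity index for gender is 0.02 and the overlap is 0.98. Joint distributions are handled by treating each combination of values as one category.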
Results
Similarity and dissimilarity measures for comparing estimated distributions of categorical variables from the synthetic datasets generated through random hot deck statistical matching

| Variable | RND1: Dissimilarity index | RND1: Overlap | RND1: Bhattacharyya coefficient | RND1: Hellinger's distance | RND2: Dissimilarity index | RND2: Overlap | RND2: Bhattacharyya coefficient | RND2: Hellinger's distance |
|---|---|---|---|---|---|---|---|---|
| Gender | 0.0010 | 0.9990 | 1.0000 | 0.0007 | 0.0010 | 0.9990 | 1.0000 | 0.0007 |
| County | 0.0393 | 0.9607 | 0.9985 | 0.0390 | 0.0393 | 0.9607 | 0.9985 | 0.0390 |
| Education level | 0.0172 | 0.9828 | 0.9997 | 0.0165 | 0.0276 | 0.9724 | 0.9996 | 0.0208 |
| Education field | 0.0176 | 0.9824 | 0.9996 | 0.0199 | 0.0170 | 0.9830 | 0.9997 | 0.0167 |
| Immigrant category | 0.0122 | 0.9878 | 0.9998 | 0.0158 | 0.0142 | 0.9858 | 0.9998 | 0.0143 |
| Country background | 0.0141 | 0.9859 | 0.9997 | 0.0176 | 0.0161 | 0.9839 | 0.9997 | 0.0178 |
| Degree of urbanization | 0.0052 | 0.9948 | 0.9999 | 0.0095 | 0.0011 | 0.9989 | 1.0000 | 0.0025 |
| Occupation | 0.0209 | 0.9791 | 0.9996 | 0.0198 | 0.0249 | 0.9751 | 0.9996 | 0.0198 |
| Works full-/part-time | 0.0018 | 0.9982 | 1.0000 | 0.0014 | 0.0058 | 0.9942 | 1.0000 | 0.0044 |
| Employment status | 0.0028 | 0.9972 | 0.9999 | 0.0084 | 0.0054 | 0.9946 | 0.9999 | 0.0106 |
Similarity and dissimilarity measures for comparing estimated distributions of categorical variables from the synthetic datasets generated through nearest neighbour distance hot deck statistical matching

| Variable | NN: Dissimilarity index | NN: Overlap | NN: Bhattacharyya coefficient | NN: Hellinger's distance | NNC: Dissimilarity index | NNC: Overlap | NNC: Bhattacharyya coefficient | NNC: Hellinger's distance |
|---|---|---|---|---|---|---|---|---|
| Gender | 0.0010 | 0.9990 | 1.0000 | 0.0007 | 0.0010 | 0.9990 | 1.0000 | 0.0007 |
| County | 0.0393 | 0.9607 | 0.9985 | 0.0390 | 0.0393 | 0.9607 | 0.9985 | 0.0390 |
| Education level | 0.0125 | 0.9875 | 0.9998 | 0.0143 | 0.0153 | 0.9847 | 0.9998 | 0.0151 |
| Education field | 0.0098 | 0.9902 | 0.9998 | 0.0123 | 0.0170 | 0.9830 | 0.9998 | 0.0158 |
| Immigrant category | 0.0136 | 0.9864 | 0.9997 | 0.0168 | 0.0143 | 0.9857 | 0.9998 | 0.0142 |
| Country background | 0.0119 | 0.9881 | 0.9998 | 0.0146 | 0.0145 | 0.9855 | 0.9997 | 0.0161 |
| Degree of urbanization | 0.0109 | 0.9891 | 0.9999 | 0.0093 | 0.0071 | 0.9929 | 0.9999 | 0.0071 |
| Occupation | 0.0162 | 0.9838 | 0.9997 | 0.0169 | 0.0096 | 0.9904 | 0.9999 | 0.0119 |
| Works full-/part-time | 0.0026 | 0.9974 | 1.0000 | 0.0020 | 0.0023 | 0.9977 | 1.0000 | 0.0018 |
| Employment status | 0.0067 | 0.9933 | 0.9999 | 0.0080 | 0.0034 | 0.9966 | 1.0000 | 0.0046 |
Results
Similarity and dissimilarity measures for comparing joint distributions of categorical variables from the synthetic datasets generated through random hot deck statistical matching

| Variables | RND1: Dissimilarity index | RND1: Overlap | RND1: Bhattacharyya coefficient | RND1: Hellinger's distance | RND2: Dissimilarity index | RND2: Overlap | RND2: Bhattacharyya coefficient | RND2: Hellinger's distance |
|---|---|---|---|---|---|---|---|---|
| County and education level | 0.0780 | 0.9220 | 0.9933 | 0.0819 | 0.0703 | 0.9297 | 0.9950 | 0.0705 |
| County and employment status | 0.0490 | 0.9510 | 0.9972 | 0.0528 | 0.0519 | 0.9481 | 0.9975 | 0.0503 |
| Gender and education level | 0.0206 | 0.9794 | 0.9996 | 0.0199 | 0.0314 | 0.9686 | 0.9993 | 0.0259 |
| Gender and employment status | 0.0152 | 0.9848 | 0.9998 | 0.0145 | 0.0056 | 0.9944 | 0.9999 | 0.0107 |
Similarity and dissimilarity measures for comparing joint distributions of categorical variables from the synthetic datasets generated through nearest neighbour distance hot deck statistical matching

| Variables | NN: Dissimilarity index | NN: Overlap | NN: Bhattacharyya coefficient | NN: Hellinger's distance | NNC: Dissimilarity index | NNC: Overlap | NNC: Bhattacharyya coefficient | NNC: Hellinger's distance |
|---|---|---|---|---|---|---|---|---|
| County and education level | 0.0698 | 0.9302 | 0.9951 | 0.0703 | 0.0608 | 0.9392 | 0.9960 | 0.0634 |
| County and employment status | 0.0463 | 0.9537 | 0.9972 | 0.0526 | 0.0462 | 0.9538 | 0.9978 | 0.0467 |
| Gender and education level | 0.0129 | 0.9871 | 0.9997 | 0.0174 | 0.0155 | 0.9845 | 0.9997 | 0.0168 |
| Gender and employment status | 0.0093 | 0.9907 | 0.9999 | 0.0098 | 0.0097 | 0.9903 | 0.9999 | 0.0089 |
Results
Figure: Distribution of age in each of the four synthetic datasets, compared with the “register sample”
The road ahead
• Synthesize the entire survey sample, to capture non-response patterns
• Perform robustness checks
◦ Change the size of the “register sample”
◦ Use alternative options in the implementation of the matching procedure
• Test the quality of the resulting synthetic datasets with respect to more variables and to the household
structure
• Compare the information loss due to the usage of synthetic data with the information loss caused by applying
traditional disclosure control methods on the original survey data
• Simulate an attack by an intruder on the synthetic datasets
Thank you!