Anonymisation of the EU-SILC database: result of the work of the EU-SILC TF
Jean-Marc Museux
The Statistical Office of the European Communities, Unit F3: Living conditions and social protection
Geneva - Nov. 2005 Eurostat - UNECE worksession
Outline
– EU-SILC Task Force on Anonymisation
– EU-SILC instrument and database
– Methodological issues
– Implementation
– Conclusions
EU-SILC Task Force on Anonymisation
Objective
– To come up with best practices and recommendations for anonymisation of EU-SILC databases
Participants
– B. Benard (Eurostat), L. Coppola (Istat), P. Feuvrier (INSEE), Ph. Gublin/J. Longhurst (ONS), N. Jukic (Stat of Slovenia), H. Minkel (Destatis), JM Museux (Eurostat), E. Schulte Nordholt (CBS), H. Sauli (Stat Fin)
EU-SILC instrument
Instrument: gathering ex post harmonised micro data on income and living conditions from 27 European States
– Regulatory framework
– Harmonised definitions
– Minimum methodological requirements (probability sampling, fieldwork, …)
– Methodological recommendations
– Main source for EU (income) poverty indicators
EU-SILC instrument
Variables
– Income (Canberra recommendations)
– Demographic
– Labour status
– Living conditions: housing, deprivation, health
Measurement units
– Households and individuals
EU-SILC instrument
Databases
– Annual cross sectional data from 2004 onwards (households and individuals)
– Longitudinal data (subset of individual variables), minimum 3 years spell (4 waves)
Data collection
– Implementation under the responsibility of EU+ National Statistical Institutes
– Flexibility: rotational design, pure panel or independent components; survey data and/or register data
Release policy
Interest of the database
– Social and employment policy monitoring (EU Commission services and study centres)
– Social research (universities, research centres)
Legal issues
– EU legislation allows for micro data release for scientific purposes
– Micro data have to be anonymised in order to minimise the risk of disclosure of individual information
– The EU-SILC regulation plans scientific release according to a strict timetable
Release policy
Eurostat main orientations
– Right to information collected with public money
– Maximise the utility of the data collected and the social return on the money invested (€20 million per year)
– Significant improvement of quality through user feedback
Implementation
– Encrypted CD-ROM with anonymised EU-SILC database released under licence to researchers
– Centralised (Luxembourg) Safe Centre with limited capacity
– Decentralised access under study
– Remote access not yet developed
Anonymisation – Main issues
Heterogeneous environment in the EU
– Different perceptions of disclosure risk
– No single European best practice
– Various implementations of essentially the same common principles
– Significant variations of disclosure risk (e.g. Norwegian income register available on the Web)
Harmonisation of procedures in order to ease international comparison
Anonymisation – Main issues
Methodological issues
– Common disclosure/attacker scenarios for EU purposes
– Measures of risk
– Hierarchical files (household and individual levels)
– Longitudinal aspects
– Cross sectional and longitudinal files matching
– Sampling design information
– Register matching
– Methods of protection
Methodological issues
Common disclosure/attacker scenarios
– Broad band approach considering combinations of 3 types of identifying/key variables
Geographic information
Sex
Age | Activity | Education | Dwelling | Marital Status | Citizenship | Place of Birth | Economic status | Employment | Sector of activity | Household size | Household type
Methodological issues
Common EU disclosure/attacker scenarios
3 additional and more complex attacker scenarios
EU1 (Simple attack with HH information, individual and household level)
– REGION x SEX x YEAR OF BIRTH x MARITAL STATUS x HH SIZE x HH TYPE
EU2 (Nosy neighbour individual attack)
– REGION x URBANISATION x SEX x DATE OF BIRTH x BASIC ACTIVITY STATUS x BATH OR SHOWER x DO YOU HAVE A CAR? x EDUCATION x OCCUPATION x SECTOR OF ACTIVITY x HH SIZE x HH TYPE
EU3 (Occupational group address book individual attack)
– REGION x URBANISATION x SEX x DATE OF BIRTH x EMPLOYMENT STATUS x OCCUPATION x SECTOR OF ACTIVITY
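A scenario key such as EU1 can be checked by counting how often each combination of key values occurs in the sample. The sketch below is illustrative only: the column names and the toy records are hypothetical, and treating a combination as unsafe when its sample frequency is at or below the threshold is an assumption, since the slides do not state the comparison.

```python
from collections import Counter

# Key variables of scenario EU1 as listed on the slide.
# The lower-case column names are hypothetical.
EU1_KEY = ("region", "sex", "year_of_birth", "marital_status", "hh_size", "hh_type")


def key_frequencies(records, key):
    """Count how many sample records share each combination of key values."""
    return Counter(tuple(rec[var] for var in key) for rec in records)


def records_at_risk(records, key, threshold):
    """Return the records whose key combination occurs at most `threshold` times.

    Assumption: "unsafe" means sample frequency <= threshold.
    """
    freq = key_frequencies(records, key)
    return [rec for rec in records if freq[tuple(rec[var] for var in key)] <= threshold]


# Made-up toy sample: the first two records share the same EU1 key.
toy_sample = [
    {"region": "R1", "sex": "F", "year_of_birth": 1960, "marital_status": "married",
     "hh_size": 2, "hh_type": "couple"},
    {"region": "R1", "sex": "F", "year_of_birth": 1960, "marital_status": "married",
     "hh_size": 2, "hh_type": "couple"},
    {"region": "R2", "sex": "M", "year_of_birth": 1931, "marital_status": "widowed",
     "hh_size": 1, "hh_type": "single"},
]

flagged = records_at_risk(toy_sample, EU1_KEY, threshold=1)
```

With a threshold of 1, only the third toy record is flagged, because its key combination is a sample unique.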
Methodological issues
Measure of risk and threshold
– For the broad band approach, thresholds are expressed in sample frequencies (heuristic developed by CBS-NL)
Sampling fraction f | Countries | Threshold = int(1 + 114 f)
1/50 – 1/2 | LU (f = 2.5%) | 5
1/100 – 1/50 | MT, IS, CY | 3
1/200 – 1/100 | EE, SI | 2
< 1/200 | All other 21 MS | 1
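The threshold heuristic can be written as a one-line function. This is a minimal sketch of the formula as printed on the slide, int(1 + 114 f); the function name is hypothetical. The sketch does not attempt to reproduce the table: for f = 2.5% the printed formula evaluates to 3, whereas the table lists 5 for LU.

```python
def frequency_threshold(f: float) -> int:
    """Sample-frequency threshold for a sampling fraction f, as int(1 + 114 f).

    `f` is a fraction between 0 and 1 (for example 0.01 for 1/100).
    """
    if not 0 < f <= 1:
        raise ValueError("sampling fraction must be in (0, 1]")
    return int(1 + 114 * f)


# Upper edges of three bands from the table, plus one value below 1/200.
thresholds = {f: frequency_threshold(f) for f in (0.02, 0.01, 0.005, 0.004)}
```

Because int() truncates, any sampling fraction below 1/114 gives a threshold of 1, which corresponds to protecting sample uniques.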
Methodological issues
Measure of risk and threshold for more complex scenarios
– Probability of a correct match, based on the key variables, between the survey database and the attacker's database
– Measure developed by Benedetti and Franconi and available in Mu-Argus
– Takes into account the hierarchical structure of the files: individuals/households
– In practice, due to software limitations, only six variables are handled simultaneously, and various combinations using subsets of the key variables are tested
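The six-variable limit means that a long key such as EU2 has to be tested through subsets. The sketch below simply enumerates every six-variable subset of the EU2 key; whether the Task Force tested all subsets or only a selection is not stated, so exhaustive enumeration is an assumption, and the lower-case variable names are hypothetical.

```python
from itertools import combinations

# The twelve key variables of scenario EU2, with hypothetical column names.
EU2_KEY = [
    "region", "urbanisation", "sex", "date_of_birth", "basic_activity_status",
    "bath_or_shower", "car", "education", "occupation", "sector_of_activity",
    "hh_size", "hh_type",
]

MAX_VARS = 6  # limit on simultaneously handled variables mentioned on the slide


def key_subsets(key, size=MAX_VARS):
    """Return all subsets of `key` with at most `size` variables per subset.

    If the key is shorter than `size`, the single full key is returned.
    """
    return list(combinations(key, min(size, len(key))))


subsets = key_subsets(EU2_KEY)
```

Each subset can then be passed to whatever risk measure is in use; the number of subsets grows quickly, which is one practical reason to test only selected combinations.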
Methodological issues
Hierarchical structure of information
– Household and individual information is collected in EU-SILC
– Household and individual records share common identifiers (linkable)
– Possibility of linkage is required for many statistical studies
– Increased risk of disclosure: individual information can be disclosed through household information and vice versa
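To illustrate why linkable household records raise the risk, one common simplification treats the re-identification of each household member as an independent event, so that the household is exposed as soon as any one member is identified. This is an assumption made for illustration, not necessarily the measure used by the Task Force; the function name and the risk values are hypothetical.

```python
from math import prod


def household_risk(individual_risks):
    """Probability that at least one household member is re-identified.

    Assumes independent individual risks r_i, giving 1 - prod(1 - r_i).
    """
    risks = list(individual_risks)
    if any(not 0.0 <= r <= 1.0 for r in risks):
        raise ValueError("each individual risk must lie in [0, 1]")
    return 1.0 - prod(1.0 - r for r in risks)


# Made-up individual risks for a two-person household.
example_risk = household_risk([0.1, 0.2])
```

Under this assumption the household risk is never lower than the highest individual risk, and it increases with household size.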
Methodological issues
Measure of risk and threshold
– In addition, external information on population uniques (ONS) is used to cross-check protection measures (for instance, households of 5 or more members, described by the age and sex of their members, are often population unique up to a high level of geographic aggregation)
Methodological issues
Longitudinal data
– The follow-up of individuals through time generates rare transitions in some key variables
– These transitions are potentially disclosive if the attacker's database is updated with the same frequency
– The corresponding risk is not easily estimated
Matching of longitudinal and cross sectional data files
– For rotational panel and pure panel designs, the longitudinal and cross sectional files can be matched on the basis of common variables
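The matching concern can be made concrete with a small sketch: index the cross sectional file by the values of the variables it shares with the longitudinal file, then keep the longitudinal records that hit exactly one cross sectional record. The variable names and toy records are hypothetical, and exact equality on all common variables is an assumed matching rule.

```python
from collections import defaultdict


def unique_matches(longitudinal, cross_sectional, common_vars):
    """Map each longitudinal record index to its cross sectional match,
    keeping only records that match exactly one cross sectional record."""
    index = defaultdict(list)
    for i, rec in enumerate(cross_sectional):
        index[tuple(rec[v] for v in common_vars)].append(i)

    matches = {}
    for j, rec in enumerate(longitudinal):
        candidates = index.get(tuple(rec[v] for v in common_vars), [])
        if len(candidates) == 1:
            matches[j] = candidates[0]
    return matches


# Made-up toy files sharing three variables.
COMMON = ("sex", "age", "hh_size")
cross = [
    {"sex": "F", "age": 34, "hh_size": 3, "region": "R1"},
    {"sex": "M", "age": 51, "hh_size": 2, "region": "R2"},
    {"sex": "M", "age": 51, "hh_size": 2, "region": "R3"},
]
longi = [
    {"sex": "F", "age": 34, "hh_size": 3},
    {"sex": "M", "age": 51, "hh_size": 2},
]

linked = unique_matches(longi, cross, COMMON)
```

In the toy data only the first longitudinal record is linked, and the link reveals a region that the longitudinal file alone does not carry. This is why the two files need coherent protection.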
Methodological issues
Sampling design information
– Design weights and strata identifiers are potentially disclosive because they are correlated with disaggregated geographical information
Register information
– Few variables (income components) in EU-SILC are obtained directly from registers
– The availability of registers to attackers is limited except in rare situations (income register in Norway and tax register in Finland)
Methodological issues
Methods of protection
Global/top recoding
– Usability of the database
– Requires a trade-off between variables
Local suppressions
– May make statistical analysis difficult
– Only if they allow a significant gain in global recoding of secondary variables
Experiments
Level of recoding significantly decreasing disclosure risk
– Geographic information needs to be coarsened depending on the size of the country (for large countries, NUTS1 and degree of urbanisation could be released)
– Country of birth and citizenship should be coarsened into 4 broad categories
– Age can be delivered in years but must be top coded (80+). This avoids the difficulty of ensuring coherence of protection of longitudinal and cross sectional data
– Number of rooms must be top coded (5+)
– ISCED levels 5 and 6 must be regrouped
– NACE is regrouped at 19 levels
– ISCO 2 digit code can be released
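Three of the recoding rules above are simple enough to sketch directly: top coding age at 80, top coding the number of rooms at 5, and merging ISCED levels 5 and 6. The field names and the label "5-6" are hypothetical. The geographic, citizenship and NACE recodings are left out because the slides do not give their category mappings.

```python
AGE_TOP = 80    # stands for "80+"
ROOMS_TOP = 5   # stands for "5+"


def recode(record):
    """Return a copy of `record` with top coding and ISCED regrouping applied.

    Expects the hypothetical fields "age", "rooms" and "isced".
    """
    out = dict(record)
    out["age"] = min(record["age"], AGE_TOP)
    out["rooms"] = min(record["rooms"], ROOMS_TOP)
    if record["isced"] in (5, 6):
        out["isced"] = "5-6"  # merged category for ISCED levels 5 and 6
    return out


# Made-up record before and after recoding.
original = {"age": 93, "rooms": 7, "isced": 6}
protected = recode(original)
```

Values already below the top codes pass through unchanged, so the recoding only affects the tails where rare, identifying values sit.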
Implementation
Remaining risks
– Identification of large households remains
– Rare transitions in longitudinal data
– Sampling design information
– Specific national circumstances
Researcher needs
– Household structure
– Longitudinal data for longitudinal analysis
– Design information for proper inference (not only variable but causal models)
Harmonisation and flexibility
Implementation
ECHP experience
– Large dissemination in research community under licence release
– Less protection
– No observed breach of confidentiality
For EU-SILC
– Developing responsible management of risk through controlled release and possibly audit provision and follow-up
Implementation
Eurostat approach
– Common rules for anonymisation of national databases
– Residual flexibility is allowed to adapt to national situations following national assessment according to common standards (measure of risk and thresholds, …)
Conclusions
Anonymisation is a matter of trade-off
– Among national perceptions of disclosure risk
– Between the right to privacy and researcher needs
– Between presence of risk and monitoring of risk
Value added of EU-SILC TF
– These trade-offs have been debated and made explicit