Anonymisation of the EU-SILC database: result of the work of the EU-SILC TF

Jean-Marc Museux

The Statistical Office of the European Communities Unit F3: Living conditions and social protection

[email protected]

Geneva - Nov. 2005 Eurostat - UNECE worksession

Outline

- EU-SILC Task Force on Anonymisation
- EU-SILC instrument and database
- Methodological issues
- Implementation
- Conclusions

EU-SILC Task Force on Anonymisation

Objective: to come up with best practices and recommendations for anonymisation of EU-SILC databases

Participants: B. Benard (Eurostat), L. Coppola (Istat), P. Feuvrier (INSEE), Ph. Gublin/J. Longhurst (ONS), N. Jukic (Stat of Slovenia), H. Minkel (Destatis), JM Museux (Eurostat), E. Schulte Nordholt (CBS), H. Sauli (Stat Fin)

EU-SILC instrument

Instrument:
- gathering ex post harmonised micro data
- on income and living conditions
- from 27 European States

Regulatory framework:
- Harmonised definitions
- Minimum methodological requirements (probability sampling, fieldwork, …)
- Methodological recommendations

Main source for EU (income) poverty indicators

EU-SILC instrument

Variables:
- Income (Canberra recommendations)
- Demographic
- Labour status
- Living conditions: housing, deprivation, health

Measurement units: households and individuals

EU-SILC instrument

Databases:
- Annual cross-sectional data from 2004 onwards (households and individuals)
- Longitudinal data (subset of individual variables), minimum 3-year spell (4 waves)

Data collection:
- Implementation under the responsibility of EU+ National Statistical Institutes
- Flexibility: rotational design, pure panel or independent components; survey data and/or register data

Release policy

Interest of the database:
- Social and employment policy monitoring (EU Commission services and study centres)
- Social research (universities, research centres)

Legal issues:
- EU legislation allows micro data release for scientific purposes
- Micro data have to be anonymised in order to minimise the risk of disclosure of individual information
- The EU-SILC regulation plans scientific release according to a strict timetable

Release policy

Eurostat main orientations:
- Right to information collected with public money
- Maximise the utility of the data collected and the social return on the money invested (20 million € per year)
- Significant improvement of quality through user feedback

Implementation:
- Encrypted CD-ROM with the anonymised EU-SILC database released under licence to researchers
- Centralised Safe Centre (Luxembourg) with limited capacity
- Decentralised access under study
- Remote access not yet developed

Anonymisation – Main issues

Heterogeneous environment in the EU:
- Different perceptions of disclosure risk
- No single European best practice
- Various implementations of nearly the same common principles
- Significant variations in disclosure risk (e.g. Norwegian income register available on the Web)

Harmonisation of procedures in order to ease international comparison

Anonymisation – Main issues

Methodological issues:
- Common disclosure/attacker scenarios for EU purposes
- Measures of risk
- Hierarchical files (household and individual levels)
- Longitudinal aspects
- Matching of cross-sectional and longitudinal files
- Sampling design information
- Register matching
- Methods of protection

Methodological issues

Common disclosure/attacker scenarios: broad band approach considering combinations of 3 types of identifying/key variables:
- Geographic information
- Sex
- One of: Age | Activity | Education | Dwelling | Marital Status | Citizenship | Place of Birth | Economic status | Employment | Sector of activity | Household size | Household type
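
The broad band check described above can be sketched as a frequency count over each three-way combination. This is only an illustration: the toy records, the field names and the function `unsafe_cells` are hypothetical, the threshold value is illustrative, and treating a cell as unsafe when its sample count is at or below the threshold is an assumption, not something stated on the slide.

```python
from collections import Counter

# Hypothetical toy sample; field names are illustrative, not EU-SILC variable codes.
records = [
    {"region": "R1", "sex": "F", "age": 34, "marital_status": "married"},
    {"region": "R1", "sex": "F", "age": 34, "marital_status": "single"},
    {"region": "R1", "sex": "F", "age": 34, "marital_status": "married"},
    {"region": "R1", "sex": "M", "age": 51, "marital_status": "married"},
    {"region": "R2", "sex": "F", "age": 34, "marital_status": "single"},
    {"region": "R2", "sex": "F", "age": 34, "marital_status": "single"},
]

def unsafe_cells(records, third_variable, threshold):
    """Count region x sex x third_variable cells and return the rare ones.

    Assumption: a cell is unsafe when its sample count is <= threshold.
    """
    counts = Counter(
        (r["region"], r["sex"], r[third_variable]) for r in records
    )
    return {cell: n for cell, n in counts.items() if n <= threshold}

# Each third variable is checked in turn, always crossed with region and sex.
for variable in ("age", "marital_status"):
    print(variable, unsafe_cells(records, variable, threshold=1))
```

Cells flagged this way would be candidates for recoding or suppression.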

Methodological issues

Common EU disclosure/attacker scenarios: 3 additional and more complex attacker scenarios

EU1 (Simple attack with HH information, individual and household level)
– REGION x SEX x YEAR OF BIRTH x MARITAL STATUS x HH SIZE x HH TYPE

EU2 (Nosy neighbour individual attack)
– REGION x URBANISATION x SEX x DATE OF BIRTH x BASIC ACTIVITY STATUS x BATH OR SHOWER x DO YOU HAVE A CAR? x EDUCATION x OCCUPATION x SECTOR OF ACTIVITY x HH SIZE x HH TYPE

EU3 (Occupational group address book individual attack)
– REGION x URBANISATION x SEX x DATE OF BIRTH x EMPLOYMENT STATUS x OCCUPATION x SECTOR OF ACTIVITY

Methodological issues

Measure of risk and threshold: for the broad band approach, thresholds are expressed in sample frequencies (heuristic developed by CBS-NL)

Sampling fraction f | Countries | Threshold = int(1 + 114 f)
1/50 – 1/2 | LU (f = 2.5%) | 5
1/100 – 1/50 | MT, IS, CY | 3
1/200 – 1/100 | EE, SI | 2
< 1/200 | All other 21 MS | 1
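
As a small worked example, the printed heuristic can be evaluated directly. The function name `broad_band_threshold` is hypothetical. Evaluated at the upper bound of the three lower bands (1/50, 1/100, 1/200) the formula gives 3, 2 and 1, in line with the table. At f = 2.5% it gives 3 rather than the 5 listed for LU, so the table may rest on a convention not shown on the slide; the sketch only implements the formula as printed.

```python
def broad_band_threshold(f: float) -> int:
    """Sample-frequency threshold from the printed heuristic int(1 + 114 f)."""
    return int(1 + 114 * f)

# Evaluate the formula at the band boundaries and at the LU sampling fraction.
for label, f in [("1/50", 1 / 50), ("1/100", 1 / 100), ("1/200", 1 / 200), ("2.5%", 0.025)]:
    print(f"f = {label}: threshold = {broad_band_threshold(f)}")
```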

Methodological issues

Measure of risk and threshold for more complex scenarios:
- Probability of a correct match, based on the key variables, between the survey database and the attacker's database
- Measure developed by Benedetti and Franconi and available in Mu-Argus
- Takes into account the hierarchical structure of the files: individuals/households
- In practice, due to software limitations, only six variables are handled simultaneously, and various combinations using subsets of the key variables are tested.
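
To give a sense of what testing subsets of six key variables involves, the sketch below enumerates every six-variable subset of the twelve EU2 key variables. The slides do not say which subsets were actually tested, so exhaustive enumeration is an assumption made purely for illustration, and the names `EU2_KEYS` and `MAX_VARS` are hypothetical.

```python
from itertools import combinations

# Key variables of scenario EU2 as listed in the presentation.
EU2_KEYS = [
    "REGION", "URBANISATION", "SEX", "DATE OF BIRTH",
    "BASIC ACTIVITY STATUS", "BATH OR SHOWER", "DO YOU HAVE A CAR?",
    "EDUCATION", "OCCUPATION", "SECTOR OF ACTIVITY", "HH SIZE", "HH TYPE",
]
MAX_VARS = 6  # software limit mentioned on the slide

# Assumption: every subset of six variables is a candidate run.
subsets = list(combinations(EU2_KEYS, MAX_VARS))
print(len(subsets), "candidate six-variable runs")
print(subsets[0])
```

In practice a much smaller, hand-picked set of combinations would probably be run, since each one requires a separate risk computation.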

Methodological issues

Hierarchical structure of information:
- Household and individual information are collected in EU-SILC
- Household and individual records share common identifiers (linkable)
- Possibility of linkage is required for many statistical studies
- Increased risk of disclosure: individual information can be disclosed through household information and vice versa

Methodological issues

Measure of risk and threshold: in addition, external information on population uniques (ONS) is used to cross-check protection measures (for instance, households of 5 or more members, described by the age and sex of their members, are often population unique up to a high level of geographic aggregation)
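
The household example can be illustrated by building a composition signature (the sorted ages and sexes of all members) for each large household and looking for signatures that occur only once. The data and function names below are hypothetical. The sketch counts uniques within a toy file, whereas the cross-check described above relies on external population information.

```python
from collections import Counter, defaultdict

# Hypothetical person-level records: (household id, age, sex).
persons = [
    ("H1", 45, "M"), ("H1", 43, "F"), ("H1", 15, "F"), ("H1", 12, "M"), ("H1", 8, "F"),
    ("H2", 8, "F"), ("H2", 12, "M"), ("H2", 15, "F"), ("H2", 43, "F"), ("H2", 45, "M"),
    ("H3", 50, "F"), ("H3", 52, "M"), ("H3", 20, "M"), ("H3", 18, "F"), ("H3", 16, "F"), ("H3", 80, "F"),
    ("H4", 30, "M"), ("H4", 29, "F"),
]

def household_signatures(persons, min_size=5):
    """Map each household with at least min_size members to its sorted (age, sex) tuple."""
    members = defaultdict(list)
    for hh_id, age, sex in persons:
        members[hh_id].append((age, sex))
    return {hh: tuple(sorted(m)) for hh, m in members.items() if len(m) >= min_size}

def unique_households(persons, min_size=5):
    """Return the large households whose composition occurs only once in the file."""
    signatures = household_signatures(persons, min_size)
    counts = Counter(signatures.values())
    return sorted(hh for hh, sig in signatures.items() if counts[sig] == 1)

print(unique_households(persons))
```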

Methodological issues

Longitudinal data:
- The follow-up of individuals through time generates rare transitions in some key variables.
- These transitions are potentially disclosive if the attacker's database is updated with the same frequency
- The corresponding risk is not easily estimated

Matching of longitudinal and cross-sectional data files:
- For rotational panel and pure panel designs, the longitudinal and cross-sectional files can be matched on the basis of common variables
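
A minimal sketch of such matching follows, assuming two toy files with hypothetical identifiers (`cid`, `lid`) and hypothetical common variables. A pair is reported only when the key combination is unique in both files, which is one possible definition of a successful match and not necessarily the one used by the Task Force.

```python
from collections import defaultdict

# Hypothetical extracts of a cross-sectional and a longitudinal file.
cross = [
    {"cid": "C1", "region": "A", "sex": "F", "year_of_birth": 1960, "hh_size": 2},
    {"cid": "C2", "region": "A", "sex": "M", "year_of_birth": 1975, "hh_size": 4},
    {"cid": "C3", "region": "A", "sex": "M", "year_of_birth": 1975, "hh_size": 4},
]
longi = [
    {"lid": "L9", "region": "A", "sex": "F", "year_of_birth": 1960, "hh_size": 2},
    {"lid": "L7", "region": "A", "sex": "M", "year_of_birth": 1975, "hh_size": 4},
]
COMMON = ("region", "sex", "year_of_birth", "hh_size")

def unique_matches(cross, longi, keys):
    """Link records whose values on the common variables are unique in both files."""
    def index(records, id_field):
        idx = defaultdict(list)
        for rec in records:
            idx[tuple(rec[k] for k in keys)].append(rec[id_field])
        return idx

    c_idx, l_idx = index(cross, "cid"), index(longi, "lid")
    return {
        c_ids[0]: l_idx[key][0]
        for key, c_ids in c_idx.items()
        if len(c_ids) == 1 and len(l_idx.get(key, [])) == 1
    }

print(unique_matches(cross, longi, COMMON))
```

If a variable is protected differently in the two files, for example age top coded in one only, a match of this kind can undo the protection, which is why the recoding needs to be coherent across files.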

Methodological issues

Sampling design information:
- Design weights and strata identifiers are potentially disclosive because they are correlated with disaggregated geographical information

Register information:
- Few variables (income components) in EU-SILC are obtained directly from registers
- The availability of registers to attackers is limited except in rare situations (income register in Norway and tax register in Finland)

Methodological issues

Methods of protection

Global/top recoding:
- Usability of the database
- Requires arbitration between variables

Local suppressions:
- May make statistical analysis difficult
- Used only if they allow a significant gain in the global recoding of secondary variables

Experiments

Level of recoding significantly decreasing disclosure risk:
- Geographic information needs to be coarsened depending on the size of the country (for large countries, NUTS1 and degree of urbanisation could be released)
- Country of birth and citizenship should be coarsened into 4 broad categories
- Age can be delivered in years but must be top coded (80+). This avoids the difficulty of ensuring coherent protection of longitudinal and cross-sectional data
- Number of rooms must be top coded (5+)
- ISCED levels 5 and 6 must be regrouped
- NACE is regrouped at 19 levels
- ISCO 2-digit code can be released
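
Three of the recodings listed above (age 80+, rooms 5+, ISCED 5 and 6 regrouped) can be sketched as follows. The field names and the function `recode` are hypothetical, and representing the open-ended classes by their lower bound (80 for "80+", 5 for "5+", 5 for "ISCED 5 or 6") is an assumption about coding, not the convention of the released files.

```python
AGE_CAP = 80      # ages 80 and over are released as 80, read as "80+"
ROOMS_CAP = 5     # 5 or more rooms are released as 5, read as "5+"

def recode(record):
    """Return a copy of the record with top coding and regrouping applied."""
    out = dict(record)
    out["age"] = min(record["age"], AGE_CAP)
    out["rooms"] = min(record["rooms"], ROOMS_CAP)
    # ISCED levels 5 and 6 are merged into one class, coded here as 5.
    out["isced"] = 5 if record["isced"] in (5, 6) else record["isced"]
    return out

print(recode({"age": 93, "rooms": 7, "isced": 6}))
print(recode({"age": 42, "rooms": 3, "isced": 3}))
```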

Implementation

Remaining risks:
- Identification of large households remains
- Rare transition in longitudinal data
- Sampling design information
- Specific national circumstances

Researcher needs:
- Household structure
- Longitudinal data for longitudinal analysis
- Design information for proper inference (not only variable but causal models)
- Harmonisation and flexibility

Implementation

ECHP experience:
- Large dissemination in the research community under licence release
- Less protection
- No observed breach of confidentiality

For EU-SILC:
- Developing responsible management of risk through controlled release and possibly audit provision and follow-up.

Implementation

Eurostat approach:
- Common rules for anonymisation of national databases
- Residual flexibility is allowed to adapt to national situations, following national assessment according to common standards (measure of risk and thresholds, …)

Conclusions

Anonymisation is a matter of trade-off:
- Among national perceptions of disclosure risk
- Between the right to privacy and researcher needs
- Between presence of risk and monitoring of risk

Value added of EU-SILC TF:
- These trade-offs have been debated and made explicit

