Post on 28-Mar-2015
CAPRI CCSR
Analysis of Information Loss: a Case Study From a UK Survey
Mark Elliot
Kingsley Purdam
Confidentiality and Privacy Group (CAPRI)
CCSR, University of Manchester
Outline
• Concepts
• General Method
• Results
Concepts
• Analytical Completeness
– Effects of Recodes
• Analytical Validity
– Effects of Perturbations
General Method
• Selected Sample of publications
• Contact Authors
• Phase 1 Questionnaire
• Phase 2 Rerun of Studies
Completeness Study
Example Recodes
• Age recoded from single years to Five-year bands.
• Area removed from data set but region left in.
• Ethnicity recoded from 10 to 4 categories:
– a. White
– b. Black
– c. Asian
– d. Other
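A banding recode of the kind listed above can be sketched in a few lines. This is a minimal illustration, assuming ages are non-negative integers; the helper name `age_band` and the label format are hypothetical and are not taken from the study.

```python
def age_band(age: int, width: int = 5) -> str:
    """Recode a single-year age into a band label of the given width.

    Hypothetical helper for illustration; the "lower-upper" label
    format is an assumption, not the coding used in the survey data.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // width) * width  # lower bound of the band containing age
    return f"{lower}-{lower + width - 1}"


# Made-up ages, recoded to five-year and ten-year bands.
ages = [0, 4, 17, 23, 65]
five_year = [age_band(a) for a in ages]
ten_year = [age_band(a, width=10) for a in ages]
```

Widening the band (`width=10`) merges adjacent five-year bands, which is why the coarser recode removes more analytical detail.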
Figure: Number of recodes impacting on analyses per author.
Legend categories, in extracted order: 13+, 10-12, 7-9, 4-6, 1-3.
Segment values, in extracted order: 30.4%, 8.7%, 17.4%, 21.7%, 21.7%.
Figure: Number of recodes severely impacting on analyses per author.
Legend categories, in extracted order: 13+, 10-12, 7-9, 4-6, 1-3, 0.
Segment values, in extracted order: 4.3%, 8.7%, 17.4%, 30.4%, 17.4%, 21.7%.
Percentage of authors giving each category of response to whether removing area while retaining region would affect their analyses:
• Other: 17.4%
• Severely affected: 34.8%
• Moderately affected: 8.7%
• Not affected: 39.1%
Percentage of authors giving each category of response to whether recoding age into ten-year bands would affect their analyses:
• Severely affected: 34.8%
• Moderately affected: 43.5%
• Not affected: 21.7%
• Utility index:
Utility = %none + (%moderate + %other)/2
• No great claims are made about this, but it is a
– useful way of summarising results, and
– can be compared to disclosure risk impact (using DIS: Skinner and Elliot 2003).
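The index is simple enough to check by hand. The sketch below applies the formula to the response percentages reported for the two example recodes (removing area while retaining region, and recoding age into ten-year bands). The function name `utility_index` is hypothetical, and the pairing of percentages with response categories assumes the chart values and legends are listed in the same order.

```python
def utility_index(pct_none: float, pct_moderate: float, pct_other: float = 0.0) -> float:
    """Utility = %none + (%moderate + %other) / 2.

    "Not affected" responses count in full, "moderately affected" and
    "other" responses count half, and "severely affected" responses
    count as zero.
    """
    return pct_none + (pct_moderate + pct_other) / 2


# Area removed, region retained (percentages as reported in the slides;
# the category pairing is an assumption based on legend order).
area_to_region = utility_index(pct_none=39.1, pct_moderate=8.7, pct_other=17.4)

# Age recoded into ten-year bands (no "other" category reported).
age_ten_year = utility_index(pct_none=21.7, pct_moderate=43.5)
```

Under that pairing the two results are about 52.15 and 43.45, in line with the utility index values of 52 and 43 reported for these recodes.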
Utility index for the data after recode

Variable                            To                 Utility index
Age                                 Five year bands    74
Age                                 Ten year bands     43
Area                                12 regions         52
Area                                4 countries        59
Country of birth                    2 categories       67
Country of birth                    4 categories       63
Ethnic group                        4 categories       59
Distance of move                    3 categories       83
Distance to work                    5 categories       91
Primary economic status             4 categories       63
Secondary economic status           Omit               87
Family type                         3 categories       61
Work hours                          4 bands            93
Work hours                          Top coded at 50    93
Tenure                              3 categories       65
Industry                            9 categories       91
Marital status                      3 categories       74
Occupation                          9 categories       83
Number of highest qualification     Omit               78
Level of highest qualification      2 categories       80
Subject of highest qualification    Omit               91
Relationship to household head      4 categories       83
Socio-economic group                Omit               61
Term time address                   Omit               96
Method of transport to work         5 categories       96
Work place                          Omit               89
Number of cars in household         3 categories       89
Dwelling space type                 5 categories       93
Number of residents per room        3 categories       89
A DIS analysis showing the probability of a correct match given a unique match of the SARs, using a base key (basic = age94, sex2, marital status5) + a selection of other variables, before and after recoding

Key                                Recoded variable   Categories (before->after)   SARs    Recoded   Impact
Area,Age,sex,mstatus,Occupation    OCCUPATION         74->10                       0.055   0.025     0.459
Area,Age,sex,mstatus,Industry      INDUSTRY           63->10                       0.049   0.026     0.524
Area,Age,sex,mstatus,hours         HOURS              73->50                       0.044   0.038     0.864
Area,Age,sex,mstatus,cobirth       COBIRTH            42->2                        0.041   0.038     0.927
Area,Age,sex,mstatus,primecon      PRIMECON           10->4                        0.028   0.021     0.766
Area,Age,sex,mstatus,tenure        TENURE             10->3                        0.028   0.022     0.802
Area,Age,sex,mstatus,ethnic        ETHNIC             10->4                        0.023   0.020     0.870
Area,Age,sex,mstatus,primecon      Age                93->10                       0.028   0.020     0.726
Area,Age,sex,mstatus,primecon      Age                93->20                       0.028   0.021     0.753
Region,Age,sex,mstatus,primecon    Geography          273->12                      0.028   0.020     0.711
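The slides do not define the Impact column. One reading that fits the figures is that it is the ratio of the match probability after recoding to the probability on the original SARs. The sketch below works through that reading as an assumption; the function name `risk_impact` is hypothetical.

```python
def risk_impact(p_before: float, p_after: float) -> float:
    """Ratio of the post-recode match probability to the original one.

    Assumed reading of the Impact column: values below 1 mean the
    recode has lowered the probability of a correct match.
    """
    if p_before <= 0:
        raise ValueError("p_before must be positive")
    return p_after / p_before


# (probability on original SARs, probability after recode), as tabulated.
rows = {
    "HOURS": (0.044, 0.038),
    "COBIRTH": (0.041, 0.038),
    "ETHNIC": (0.023, 0.020),
}
impacts = {name: round(risk_impact(before, after), 3)
           for name, (before, after) in rows.items()}
```

These three rows reproduce the tabulated impacts to three decimal places. Other rows differ in the second or third decimal, which would be expected if the published impacts were computed from unrounded probabilities.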
Table 3: Relationship between utility index and disclosure risk impact

Variable                   From             To                 Utility index (UI)   Disclosure risk impact (DRI)   UI/(DRI*100)
Age                        Single years     Five year bands    74                   0.75                           0.98
Age                        Single years     Ten year bands     43                   0.73                           0.59
Area                       278 areas        12 regions         52                   0.71                           0.73
Country of birth           42 categories    2 categories       67                   0.93                           0.72
Ethnic group               10 categories    4 categories       59                   0.87                           0.68
Primary economic status    10 categories    4 categories       63                   0.77                           0.82
Work hours                 Single hours     Top coded at 50    93                   0.86                           1.08
Industry                   61 categories    9 categories       91                   0.52                           1.74
Occupation                 73 categories    9 categories       83                   0.46                           1.81
Tenure                     10 categories    3 categories       65                   0.80                           0.81
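The final column of Table 3 can be recomputed from the other two. The sketch below does this for three rows; the function name `utility_risk_ratio` is hypothetical. Reading a larger ratio as a recode that keeps more utility relative to the risk that remains is an interpretation, not a claim made in the slides.

```python
def utility_risk_ratio(ui: float, dri: float) -> float:
    """UI / (DRI * 100), as in the last column of Table 3.

    UI is on a 0-100 scale and DRI on a 0-1 scale, so the factor of
    100 puts both on the same footing before dividing.
    """
    if dri <= 0:
        raise ValueError("dri must be positive")
    return ui / (dri * 100)


# (label, UI, DRI) taken from three rows of Table 3.
rows = [
    ("Age: ten year bands", 43, 0.73),
    ("Area: 12 regions", 52, 0.71),
    ("Work hours: top coded at 50", 93, 0.86),
]
ratios = {label: round(utility_risk_ratio(ui, dri), 2) for label, ui, dri in rows}
```

Small discrepancies for some other rows (for example Industry) are consistent with the published ratios having been computed from the three-decimal impact values rather than the two-decimal DRI shown in the table.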
Validity Study
ARGUS: Problems and Resolutions
• Key variable selection problematic.
– Not able to use Elliot and Dale (1999) scenario keys.
• Individual risk model doesn't work on unweighted data.
• Not able to block certain missing values from use.
Perturbed File 1
File with suppressions.
• All two dimensional tables.
• Three dimensional tables under scenarios.
Perturbed File 2
PRAMed file
• All variables PRAMed; levels set to maintain univariate distributions.
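To illustrate what "maintaining univariate distributions" can mean for a PRAMed variable, here is a minimal sketch of one simple invariant transition matrix, P = (1 - alpha) * I + alpha * 1 * pi^T: each value is kept with probability 1 - alpha and otherwise redrawn from the variable's own marginal distribution pi. This is an assumed construction for illustration only, not the procedure ARGUS applies; the function names, the `alpha` parameter and the toy data are all hypothetical.

```python
import random
from collections import Counter


def invariant_pram_matrix(marginals, alpha):
    """Build P[i][j] = (1 - alpha) * [i == j] + alpha * marginals[j].

    Each row sums to 1 and pi * P = pi, so the expected univariate
    distribution of the perturbed variable equals the original one.
    """
    cats = list(marginals)
    return {i: {j: (1 - alpha) * (i == j) + alpha * marginals[j] for j in cats}
            for i in cats}


def expected_marginals(marginals, matrix):
    """Expected distribution after perturbation: pi * P."""
    return {j: sum(marginals[i] * matrix[i][j] for i in marginals)
            for j in marginals}


def pram(values, matrix, rng):
    """Perturb each value by sampling from its row of the matrix."""
    perturbed = []
    for v in values:
        cats = list(matrix[v])
        weights = [matrix[v][c] for c in cats]
        perturbed.append(rng.choices(cats, weights=weights, k=1)[0])
    return perturbed


# Toy data (made up): a three-category marital status variable.
values = ["single"] * 50 + ["married"] * 40 + ["widowed"] * 10
counts = Counter(values)
marginals = {c: n / len(values) for c, n in counts.items()}

matrix = invariant_pram_matrix(marginals, alpha=0.2)
after = expected_marginals(marginals, matrix)
perturbed = pram(values, matrix, random.Random(0))
```

The invariance holds in expectation only: the counts in any single perturbed file will vary around the original counts.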
Perturbed File 3
1. Unperturbed! Control File.
Perturbed File 4
1. PRAMed as file 2.
2. Suppressions
• All two dimensional tables.
Overview of Results
• Basic analyses on the whole SAR (cross-tabs, correlations, simple regressions) lead to fairly consistent interpretations. However, some problems remain.
• Problems arise for all three perturbed files in more complex analyses and/or those involving sub-sections of the file (e.g. one geographical area).
Author/researcher descriptions of the effect of perturbations, by perturbation method used. Ten example studies.

Effect of perturbation (number of studies)
File   Perturbation method   None   Moderate   Severe
A      Suppressions          5      5          0
B      PRAM                  2      7          1
C      None                  10     0          0
D      Both                  1      5          4
Conclusions
• The study introduces methods for measuring the utility impact of disclosure control measures.
• The relationship between utility measures and disclosure risk measures represents the cost-benefit equation of disclosure control.