Date post: 28-Mar-2015
Statistical Disclosure Control Mark Elliot Confidentiality and Privacy Group CCSR University of Manchester
Transcript
Page 1: Statistical Disclosure Control Mark Elliot Confidentiality and Privacy Group CCSR University of Manchester.

Statistical Disclosure Control

Mark Elliot

Confidentiality and Privacy Group CCSR

University of Manchester


Overview

• CAPRI – who we are / what we do
• SDC – some basics
• SD Risk Assessment and Microdata
  – General Concepts
  – Our Approach
• SD Risk Assessment and Aggregate Data
  – General Concepts
  – Our Approach
• Statistical Disclosure and the Grid


Confidentiality And PRIvacy group

www.ccsr.ac.uk/capri

University of Manchester


Purpose

To investigate the Confidentiality and Privacy issues that arise from the collection, dissemination and analysis of data.


Multidisciplinary Approach

• Mark Elliot, Knowledge and Data Engineering
• Kingsley Purdam, Politics and Information Society
• Anna Manning, Data Mining and HPC
• Elaine Mackey, Social Policy
• Duncan Smith, Statistics and Stochastic Systems
• Karen McCullagh, the Law and Social Policy


Associate Members in Manchester

CS: Alan Rector, John Gurd, Len Freeman, Adel Taweel.
Computation: John Keane.
Psychology: Karen Lander, Lee Wickham.
Medicine: Iain Buchan.
Manchester Computing Centre: Stephen Pickles.
Law: Joseph Jakaneli, John Harris.


Research Programmes

The Social and Political Aspects of Confidentiality and Privacy

The Detection of Risky Records: Special Uniqueness

The Disclosure risk issues posed by the Grid

High Performance Computing and Statistical Disclosure

Medical Records: Clinical E-Science Framework

The SAMDIT methodology: Data Monitoring Centre


Consultancy

• ONS
  – Census
  – Social Survey
  – Neighbourhood Statistics
• US Census Bureau
• Australian Bureau of Statistics
• Statistics New Zealand


Statistical Disclosure Control


Sub Fields

• Disclosure risk assessment.
• Disclosure control methodology.
• Analytical validity.

• Microdata and Aggregate data.
• Business and Personal data.
• Intentional and Consequential data


Our General Approach: The SAMDIT method

• Scenario Analysis (Elliot and Dale 1999)
• Metric Development
• Implementation
• Testing


Microdata


The Microdata Disclosure Risk Problem: An Example

[Diagram: two files linked through shared key variables]

Identification file: Name, Address (ID variables); Sex, Age, .. (key variables)

Target file: Sex, Age, .. (key variables); Income, .. (target variables)
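
The linkage in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both files are taken to be lists of dictionaries, and the function name `link_on_keys`, the field names and the toy records are hypothetical rather than taken from the slides.

```python
from collections import defaultdict

def link_on_keys(identification_file, target_file, key_vars):
    """Pair each identified record with a target record sharing its key values.

    Only unique matches are returned, i.e. cases where the key combination
    occurs exactly once in the target file.
    """
    by_key = defaultdict(list)
    for rec in target_file:
        by_key[tuple(rec[v] for v in key_vars)].append(rec)

    matches = []
    for rec in identification_file:
        candidates = by_key.get(tuple(rec[v] for v in key_vars), [])
        if len(candidates) == 1:  # unique match on the key variables
            matches.append((rec, candidates[0]))
    return matches

# Hypothetical toy data: names sit in the identification file,
# income (the target variable) sits in the target file.
identification_file = [
    {"name": "P. Smith", "sex": "F", "age": 34},
    {"name": "J. Jones", "sex": "M", "age": 51},
]
target_file = [
    {"sex": "F", "age": 34, "income": "high"},
    {"sex": "M", "age": 51, "income": "low"},
    {"sex": "M", "age": 51, "income": "medium"},
]
linked = link_on_keys(identification_file, target_file, ["sex", "age"])
```

Here only the first person is matched uniquely. When the target file is a sample, a unique match is not necessarily a correct one, since the true record may simply be absent from the sample.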


Risk Assessment methods

• File Level
  – Population Uniqueness, e.g. Bethlehem (1990), Samuels (1998)
  – DIS; Skinner and Elliot (2002)
• Record level
  – Statistical modelling (Fienberg and Makov 1998, Skinner and Holmes 1998)
  – Computational Search; Elliot et al (2002)


Data Intrusion Simulation

• Uses microdata set (or table) itself to estimate risk - no population data.
• An estimate of the probability of a correct match (given a unique match).
• Special method: sub-sampling and re-sampling.
• General method: derivation from the equivalence class structure.


The DIS Method

Remove a small number of records

Microdata sample


The DIS Method II

Copy back a random number of the removed records (at a probability equivalent to the original sampling fraction)


The DIS Method III

Match the removed fragment against the truncated microdata file
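
The three steps above can be sketched as a small simulation. This is a minimal sketch, not the CAPRI implementation: it assumes each record is a tuple of key-variable values, and the function name `dis_special`, its parameters and the number of repetitions are hypothetical choices made for illustration.

```python
import random
from collections import Counter

def dis_special(sample, sampling_fraction, n_remove, n_iter=200, seed=0):
    """Estimate Pr(correct match | unique match) by sub-sampling and re-sampling.

    sample: list of hashable key tuples (one per microdata record).
    """
    rng = random.Random(seed)
    unique_matches = 0
    correct_matches = 0
    indices = list(range(len(sample)))
    for _ in range(n_iter):
        # Step I: remove a small number of records.
        removed = set(rng.sample(indices, n_remove))
        # Step II: copy each removed record back with probability equal
        # to the original sampling fraction.
        copied_back = {i for i in removed if rng.random() < sampling_fraction}
        truncated = [i for i in indices if i not in removed or i in copied_back]
        key_counts = Counter(sample[i] for i in truncated)
        # Step III: match the removed fragment against the truncated file.
        for i in removed:
            if key_counts[sample[i]] == 1:   # unique match
                unique_matches += 1
                if i in copied_back:         # the matched record is the record itself
                    correct_matches += 1
    if unique_matches == 0:
        return float("nan")
    return correct_matches / unique_matches
```

For example, `dis_special(records, 0.02, 10)` would correspond to a 2% sample; a removed record that was not copied back but still finds a unique match contributes a false match to the estimate.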


Validation

• Empirical validation: studies comparing DIS estimates with the results obtained using population data show no bias and small error (Elliot 2001).

• Mathematical proof: Skinner and Elliot (2002).


[Figure: Pr(cm|um) for 2% sample with basic key (age, sex, marital status). x-axis: population size (0 to 500000); y-axis: pr(cm|um) % (1.00 to 2.40); series: actual, estimated.]


Levels of Risk Analysis

• DIS
  – Works at the file level
  – Very good for comparative analyses
    • e.g. SAMs


Levels of Risk Analysis

• Record level risk is important
  – Variations in risk topography
  – Risky records


Special Uniques

• Original concept
  – Counterintuitive geographical effect, indicated two types of sample uniques.
  – Random and Special
  – Special
    • Epidemiological peculiarity
  – Random
    • Effect of sampling and variable definition


Special Uniques

• Changing definition:
  1. Sample uniques which remain unique despite geographical aggregation
  2. Sample uniques which remain unique through any variable aggregation
  3. Sample uniques on subset of key variables
  4. Dichotomy to Dimension


Minimal Sample Unique

• A sample unique set of variable values for which no proper subset is also unique.


Risk Signatures: combinations of minimal uniques

• Example
  – Unique pairs 0
  – Unique triples 5
  – Unique fourfolds 1
  – Unique fivefolds 3
  – Unique sixfolds 0
  – Unique sevenfolds 0
  – ………
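
A brute-force way to obtain such a signature for one record is sketched below. It is an illustration only, assuming records are equal-length tuples of categorical values; the function names `minimal_sample_uniques` and `risk_signature` and the toy data are hypothetical, and the exhaustive search is far less efficient than a dedicated search algorithm would need to be on real files.

```python
from collections import Counter
from itertools import combinations

def minimal_sample_uniques(records, target):
    """Return the minimal sets of variable positions on which records[target]
    is unique in the sample (no proper subset is also unique)."""
    n_vars = len(records[target])
    minimal = []
    for size in range(1, n_vars + 1):
        for cols in combinations(range(n_vars), size):
            # Any superset of a unique set is unique but not minimal: skip it.
            if any(set(m) <= set(cols) for m in minimal):
                continue
            value = tuple(records[target][c] for c in cols)
            count = sum(1 for r in records if tuple(r[c] for c in cols) == value)
            if count == 1:
                minimal.append(cols)
    return minimal

def risk_signature(records, target):
    """Count the record's minimal sample uniques by size (1 = single variable)."""
    sizes = Counter(len(m) for m in minimal_sample_uniques(records, target))
    return {size: sizes.get(size, 0) for size in range(1, len(records[target]) + 1)}

# Hypothetical toy sample: (sex, age, marital status)
records = [
    ("M", 30, "single"),
    ("M", 30, "married"),
    ("F", 30, "single"),
    ("F", 40, "married"),
]
signature = risk_signature(records, 3)
```

Because the search visits every subset of variables, its cost grows exponentially with the number of key variables.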


Special Uniques

• Problem: how to look at all the variables?
  – File may contain hundreds
  – Even with scenario keys individual records can contain hundreds of minimal sample uniques
  – Combinatorial explosion


HIPERSTAD Projects

• Funded by ESRC, ONS and EPSRC
• Use of high performance computing
  – Enables comprehensive analysis of patterns of uniqueness within each record
  – Has allowed investigation of more complex grading systems


Risk Signatures II

• Allow grading and classification of records
  – Differential treatment
  – Low impact high efficacy disclosure control


Combining DIS and SUDA

• A heuristic method for combining the two methods to provide a per record matching confidence has proved very effective

• ONS evaluation studies show that the combined method picks out high probability risk very well


SUDA software

• Available free under licence
• Used at ONS, ABS and Statistics New Zealand


Aggregate Data


Introduction

• Measurement of Disclosure Risk is an important precursor for its control
• Intruder/scenario based metrics are better than abstract ones
• Such metrics are available for microdata but not for aggregate data


Overview

• Overview of the issues and introducing the method on a conceptual level

• Details of the algorithms

• Ongoing and Future Work


The Issues

• Aggregate data is usually 100% data, so measures based on identification disclosure and sampling are meaningless

• A better approach is to evaluate what can be inferred through attribute disclosure


Attribute Disclosure

            High  Medium  Low  Total
Academics      0     100   50    150
Lawyers      100      50    5    155
Total        100     150   55    305

Income levels for two occupations


Attribute Disclosure

            High  Medium  Low  Total
Academics      1     100   50    151
Lawyers      100      50    5    155
Total        101     150   55    306

Income levels for two occupations


The Approach

• Rather than assess the risk of actual attribute disclosure, we propose estimating the probability of producing a potentially disclosive table, which we define as any table containing at least one zero
• The method/measure we propose can be applied to:
  – Single tables
  – Groups of tables
  – Unperturbed and perturbed tables
  – Unpublished tables
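
The definition of a potentially disclosive table can be illustrated with the two income tables shown earlier. The sketch below is an assumption-laden illustration: tables are plain nested lists of interior counts (totals omitted), and the function names `is_potentially_disclosive` and `subtract_known` are hypothetical.

```python
def is_potentially_disclosive(table):
    """A table is potentially disclosive if it contains at least one zero."""
    return any(count == 0 for row in table for count in row)

def subtract_known(table, known_cells):
    """Remove individuals the intruder already knows about.

    known_cells: list of (row, column) positions, one per known individual.
    """
    result = [list(row) for row in table]
    for r, c in known_cells:
        if result[r][c] == 0:
            raise ValueError("cannot subtract from an empty cell")
        result[r][c] -= 1
    return result

# Rows: Academics, Lawyers; columns: High, Medium, Low income.
first_table = [[0, 100, 50], [100, 50, 5]]
second_table = [[1, 100, 50], [100, 50, 5]]

# An intruder who knows the single high-income academic (for instance,
# because it is the intruder) can subtract that person from the table.
after_subtraction = subtract_known(second_table, [(0, 0)])
```

After the subtraction the second table has the same zero as the first, so the intruder can infer that no other academic has a high income.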


The Bounds Problem

• In a general sense any set of tables can be viewed as a set of bounds on the full table. For example, if we release two one-way frequency tables:

Var A
X       1
Y       2
Z      10
Total  13

Var B
P       6
Q       3
R       4
Total  13


The Bounds Problem

We are effectively releasing the marginals to a two-way frequency table where the entire joint distribution has been suppressed

          Var B
Var A     P    Q    R    Total
X         ?    ?    ?        1
Y         ?    ?    ?        2
Z         ?    ?    ?       10
Total     6    3    4       13


The cells in the joint distribution can be expressed as a set of bounds (or ranges of feasible values)

          Var B
Var A     P     Q     R     Total
X         0-1   0-1   0-1       1
Y         0-2   0-2   0-2       2
Z         3-6   0-3   1-4      10
Total     6     3     4        13
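
The ranges in this table can be reproduced from the two marginals alone. The sketch below uses the standard Fréchet bounds for a two-way table (a name not used on the slides): each cell is at most the smaller of its row and column totals, and at least the row total plus the column total minus the grand total, floored at zero. The function name `frechet_bounds` is hypothetical.

```python
def frechet_bounds(row_totals, col_totals):
    """Return (lower, upper) bounds for every cell of a two-way table
    given only its row and column totals."""
    total = sum(row_totals)
    if total != sum(col_totals):
        raise ValueError("row and column totals must agree")
    return [
        [(max(0, r + c - total), min(r, c)) for c in col_totals]
        for r in row_totals
    ]

# Var A totals (X, Y, Z) and Var B totals (P, Q, R) from the example.
bounds = frechet_bounds([1, 2, 10], [6, 3, 4])
for label, row in zip("XYZ", bounds):
    print(label, ["%d-%d" % cell for cell in row])
```

Any cell whose upper bound is zero is known exactly, which is why the later slides focus on upper bounds of zero.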


The Subtraction – Attribution Probability (SAP) Method

• The risk associated with a table release depends on the set of tables jointly, rather than on the individual tables.
• SAP can be used on single tables, groups of tables, perturbed or unperturbed tables.
• Bounds are calculated and then the probability of an intruder producing one or more upper bounds of zero by subtracting k random individuals from the table is calculated
• The output can be set for user defined levels of k


          Var1
Var2      A    B
C         3    9
D         2    2

          Var1
Var3      A    B
E         1   10
F         4    1


          Var2
Var3      C    D
E         8    3
F         4    1

          Var1 and Var2
Var3      A, C   A, D   B, C   B, D
E         0      1      8      2
F         3      1      1      0

• Original cell counts can be recovered from the marginal tables
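
This recovery can be checked by enumerating every three-way table consistent with the three released two-way tables. The sketch below is specific to 2x2x2 tables and is an illustration, not the method used by the authors; the function name `consistent_tables` and the index convention (Var1: A, B; Var2: C, D; Var3: E, F) are assumptions.

```python
def consistent_tables(m12, m13, m23):
    """Enumerate 2x2x2 tables n[i][j][k] matching all three two-way margins.

    m12[i][j] sums n over k, m13[i][k] sums over j, m23[j][k] sums over i.
    """
    solutions = []
    # With all two-way margins fixed, a 2x2x2 table has one free cell.
    for x in range(min(m12[0][0], m13[0][0], m23[0][0]) + 1):
        n = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
        n[0][0][0] = x
        n[0][0][1] = m12[0][0] - x
        n[0][1][0] = m13[0][0] - x
        n[1][0][0] = m23[0][0] - x
        n[0][1][1] = m12[0][1] - n[0][1][0]
        n[1][0][1] = m12[1][0] - n[1][0][0]
        n[1][1][0] = m13[1][0] - n[1][0][0]
        n[1][1][1] = m12[1][1] - n[1][1][0]
        cells = [n[i][j][k] for i in (0, 1) for j in (0, 1) for k in (0, 1)]
        if min(cells) < 0:
            continue
        margins_ok = all(
            m12[a][b] == n[a][b][0] + n[a][b][1]
            and m13[a][b] == n[a][0][b] + n[a][1][b]
            and m23[a][b] == n[0][a][b] + n[1][a][b]
            for a in (0, 1) for b in (0, 1)
        )
        if margins_ok:
            solutions.append(n)
    return solutions

m12 = [[3, 2], [9, 2]]    # Var1 (A, B) by Var2 (C, D)
m13 = [[1, 4], [10, 1]]   # Var1 (A, B) by Var3 (E, F)
m23 = [[8, 4], [3, 1]]    # Var2 (C, D) by Var3 (E, F)
tables = consistent_tables(m12, m13, m23)
```

Only one table survives the enumeration, and it is the full table shown above, including its two zero cells.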


Subtraction

• We consider that an intruder might have knowledge of the relevant population, as well as information in the table release

• We assume (at least initially) that the intruder has perfect knowledge of k randomly selected individuals


Single exact tables

• The lower / upper bounds are equal to the published counts
• The probability of an intruder recovering at least one zero by subtracting known individuals is found by calculating Hypergeometric probabilities and applying the inclusion / exclusion principle


• The marginal probability of observing all individuals in a cell is calculated for each individual cell, and the sum is added to a total (initially zero)

• The marginal probability of observing all individuals in a pair of cells is calculated for each pair of cells, and subtracted from the total

• The marginal probability of observing all individuals in a ‘triple’ of cells is calculated for each triple of cells, and added to the total

• And so on, until we have considered the table total, or all subsequent probabilities are zero
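
The alternating sum described in these steps can be written compactly. The notation is introduced here for illustration and is not from the slides: $N$ is the table total, $k$ the number of individuals subtracted, $S$ a set of cells, and $n_S$ the total count in those cells.

```latex
P(\text{at least one zero})
  = \sum_{r \ge 1} (-1)^{r+1}
    \sum_{\substack{|S| = r \\ n_S \le k}}
    \frac{\binom{N - n_S}{\,k - n_S\,}}{\binom{N}{k}}
```

Each fraction is the hypergeometric probability that the $k$ known individuals include everyone in the cells of $S$; sets with $n_S > k$ contribute zero, which is why the sum stops early.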


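For example, for k = 3 and the following table (and not showing zero probability terms),

1  2  4

$$\text{prob} = \frac{\binom{1}{1}\binom{6}{2}}{\binom{7}{3}} + \frac{\binom{2}{2}\binom{5}{1}}{\binom{7}{3}} - \frac{\binom{3}{3}\binom{4}{0}}{\binom{7}{3}} = \frac{\binom{6}{2} + \binom{5}{1} - \binom{4}{0}}{\binom{7}{3}} = 0.543$$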
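
The same calculation can be reproduced in code. This is a minimal sketch for a single exact table, assuming the table is given as a flat list of cell counts and that cells already equal to zero are ignored; the function name `prob_recover_zero` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def prob_recover_zero(cells, k):
    """Probability that subtracting k randomly chosen individuals empties
    at least one cell, by inclusion / exclusion over sets of cells."""
    total = sum(cells)
    positive = [c for c in cells if c > 0]  # assumption: existing zeros ignored
    denominator = comb(total, k)
    prob = Fraction(0)
    for r in range(1, len(positive) + 1):
        sign = 1 if r % 2 == 1 else -1
        for subset in combinations(positive, r):
            needed = sum(subset)
            if needed <= k:
                # Hypergeometric term: all 'needed' individuals are among the k.
                prob += sign * Fraction(comb(total - needed, k - needed), denominator)
    return prob

p = prob_recover_zero([1, 2, 4], 3)
print(p, round(float(p), 3))
```

For the table with cells 1, 2 and 4 and k = 3 this gives 19/35, which rounds to the 0.543 shown in the worked example.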


Example output

Mean probability of recovering a zero in a table given a subtraction level (k) for Income Support broken down by Gender and Age for exact and rounded output area tables.

        Gender                          Age
k       Exact   Rounded   Proportion    Exact   Rounded   Proportion
1       0.157   0.000     0.000         0.814   0.000     0.000
2       0.203   0.000     0.000         0.822   0.000     0.000
3       0.248   0.000     0.001         0.830   0.001     0.001
4       0.290   0.012     0.043         0.838   0.011     0.013
5       0.330   0.017     0.051         0.845   0.013     0.016
6       0.367   0.021     0.056         0.853   0.016     0.019
7       0.401   0.024     0.059         0.859   0.019     0.022
8       0.433   0.026     0.061         0.866   0.022     0.025
9       0.461   0.035     0.075         0.872   0.031     0.035
10      0.488   0.037     0.076         0.878   0.034     0.038
20      0.685   0.061     0.090         0.929   0.069     0.074
50      0.951   0.094     0.098         0.992   0.123     0.124
100     0.999   0.100     0.100         1.000   0.132     0.132
Mean    0.463   0.033     0.071         0.877   0.036     0.041


Confidentiality and the Grid

1) What new data possibilities does the Grid provide and what confidentiality implications do they have?

2) How could the Grid (or a Grid) be used to enable disclosure risk assessment and control?

3) How could a grid enable a data intruder?

4) What are the possibilities and issues provided by remote access?


New data

• One of the key potentials for the Grid is the possibility of bringing together different data sources through linking and fusing.
• This is precisely the disclosure risk situation. Our pilot project work shows that adding a third data set tends to increase the linkability of two other datasets.


New Access

• Virtual remote access has the potential to provide a safe setting model for data access.
• New question: how safe is that output?


Data Intrusion Detection

• Virtual access allows the possibility of monitoring use.

• Use patterns by user and across users can be analysed for patterns resembling intrusion (similar to fraud detection).


Confidential data access via a grid

[Diagram components: Raw Datasets; PRE-ACCESS Disclosure Control; PRE-ACCESS Data Quality Monitor; Treated Datasets; Grid Firewall; Data Intrusion Sentry; User Analytical Request; PRE-OUTPUT Disclosure Control; PRE-OUTPUT Data Quality Monitor]


Conclusions

• Statistical Disclosure Control is a maturing field.
  – Basic issues well defined
  – Theory and Practice still in development
• Grid presents new opportunities and new confidentiality risks.


Finally a plug…..

International Symposium on Confidentiality, Privacy and Disclosure in the 21st Century
  – Date: 3rd May
  – Venue: Manchester MANDEC Centre
  – See www.ccsr.ac.uk/capri/symposium

