Privacy Statistics and Data Linkage

Mark Elliot

Confidentiality and Privacy Group

University of Manchester

Overview

• The disclosure risk problem

• Some e-science possibilities

– Monitored data access

– Grid based Data Environment Analysis

• The meaning of privacy

Data Data Everywhere…

• Massive and exponential increase in data; Mackey and Purdam (2002); Purdam and Elliot (2002).

– These studies have led to the setting up of the data monitoring service.

• Singer (1999) noted three behavioural tendencies:

– Collect more information on each population unit

– Replace aggregate data with person specific databases

– Given the opportunity, collect personal information

• Purdam and Elliot add:

– Link data whenever you can

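Disclosure Risk I: Microdata

The Disclosure Risk Problem: Type I: Identification

[Diagram: an identification file and a target file, linked through the key variables they share]

Identification file: ID variables (Name, Address), key variables (Sex, Age, ..)

Target file: key variables (Sex, Age, ..), target variables (Income, ..)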
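The identification mechanism in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the talk: the records, the field names and the helper `link_on_keys` are all hypothetical, and uniqueness within a file is used as a stand-in for uniqueness in the population.

```python
from collections import defaultdict


def link_on_keys(identification_file, target_file, key_vars):
    """Pair names with target records whose key-variable combination
    is unique in both files (hypothetical helper, for illustration)."""

    def index(records):
        # Group records by their combination of key-variable values.
        groups = defaultdict(list)
        for rec in records:
            groups[tuple(rec[k] for k in key_vars)].append(rec)
        return groups

    id_index = index(identification_file)
    target_index = index(target_file)
    matches = []
    for key, id_recs in id_index.items():
        target_recs = target_index.get(key, [])
        # Only a one-to-one match on the keys counts as an identification.
        if len(id_recs) == 1 and len(target_recs) == 1:
            matches.append((id_recs[0]["name"], target_recs[0]))
    return matches


# Made-up records: ID variables plus key variables on one side,
# key variables plus a target variable on the other.
identification_file = [
    {"name": "A. Example", "address": "1 High St", "sex": "F", "age": 34},
    {"name": "B. Sample", "address": "2 Low Rd", "sex": "M", "age": 51},
    {"name": "C. Person", "address": "3 Mid Ln", "sex": "M", "age": 51},
]
target_file = [
    {"sex": "F", "age": 34, "income": "High"},
    {"sex": "M", "age": 51, "income": "Low"},
]

matches = link_on_keys(identification_file, target_file, ["sex", "age"])
```

Only the record that is unique on (sex, age) in both files is linked; the two 51-year-old men remain ambiguous, which is why coarser key variables reduce identification risk.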

Disclosure Risk II: Aggregate Tables of Counts

The Disclosure Risk Problem: Type II: Attribution

            High  Medium  Low  Total
Academics      0     100   50    150
Lawyers      100      50    5    155
Total        100     150   55    305

Income levels for two occupations

The Disclosure Risk Problem: Type II: Attribution

            High  Medium  Low  Total
Academics      1     100   50    150
Lawyers      100      50    5    155
Total        100     150   55    305

Income levels for two occupations

The Disclosure Risk Problem: Type II: Attribution

            High  Medium  Low  Total
Academics      0     100   50    150
Lawyers      100      50    5    155
Total        100     150   55    305

Income levels for two occupations
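A zero cell is what makes attribution possible: with no academics in the High column, anyone known to be an academic can be said not to have a high income, and anyone known to have a high income must be a lawyer. The sketch below scans a table of counts for such cells. It uses the counts from the table with the zero cell; the function names are hypothetical and the rule is a deliberately simple one, not a full disclosure risk assessment.

```python
# Interior cells of the table with the zero cell (margins omitted).
table = {
    "Academics": {"High": 0, "Medium": 100, "Low": 50},
    "Lawyers": {"High": 100, "Medium": 50, "Low": 5},
}


def transpose(table):
    """Swap rows and columns so the same check can be run on columns."""
    cols = {}
    for row, cells in table.items():
        for col, n in cells.items():
            cols.setdefault(col, {})[row] = n
    return cols


def attribution_risks(table):
    """List (group, relation, category) statements that the counts allow.

    A zero cell supports a negative statement ("not"); a group whose
    members all fall in a single cell supports a positive one ("is").
    """
    findings = []
    for group, cells in table.items():
        zero = [c for c, n in cells.items() if n == 0]
        nonzero = [c for c, n in cells.items() if n > 0]
        for c in zero:
            findings.append((group, "not", c))
        if len(nonzero) == 1:
            findings.append((group, "is", nonzero[0]))
    return findings


by_occupation = attribution_risks(table)
by_income = attribution_risks(transpose(table))
```

Here `by_occupation` reports that academics are not in the High band, and `by_income` reports that everyone in the High band is a lawyer.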

Multiple datasets

• Disclosure risk assessment for single datasets is a reasonably well understood problem.

• But what happens with multiple datasets?

Data Mining and the Grid

• Traditional Data Mining examines and identifies patterns on single (if massive) datasets.

• But Data Mining is really a method/approach/technology that has been waiting for the grid to happen.

• Smith and Elliot (2005, 2006, 2007)

• Increases in data availability lead inexorably to an increase in disclosure risk

• My ability to make linkages (disclosive or otherwise) between datasets X and Y is facilitated by the copresence of dataset Z.

• It’s all about information!
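The point about dataset Z can be made concrete with a small sketch. Everything here is an assumption made for illustration: the three datasets, their field names and the `join` helper are invented, and exact matching stands in for real record linkage. X and Y share no variables, so they cannot be linked directly; Z carries one variable from each and acts as a bridge.

```python
def join(left, right, key):
    """Inner join of two lists of dict records on a shared field."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical datasets: X holds identities, Y holds sensitive values.
X = [
    {"name": "A. Example", "postcode": "M13"},
    {"name": "B. Sample", "postcode": "S10"},
]
Y = [
    {"birth_year": 1970, "diagnosis": "asthma"},
    {"birth_year": 1982, "diagnosis": "diabetes"},
]
# Z shares one variable with X and one with Y.
Z = [
    {"postcode": "M13", "birth_year": 1970},
    {"postcode": "S10", "birth_year": 1982},
]

# X and Y have no fields in common, so no direct linkage is possible.
shared_xy = set(X[0]) & set(Y[0])

# With Z present, X links to Z on postcode and the result links to Y
# on birth_year, attaching a diagnosis to a name.
linked = join(join(X, Z, "postcode"), Y, "birth_year")
```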

CLEF: Clinical e-Science Framework

A solution involving monitored access

CLEF Consortium

Approximately 40 Staff from

• University of Manchester

• University of Sheffield

• University College London

• University of Brighton

• Royal Marsden Hospital, London

Purpose

• To provide a system for allowing research access to patient data, whilst maintaining privacy.

• Patient records

– Database

• Texts such as referral letters and other clinical texts

– Text mining system converts these to microdata

CLEF: one possible architecture

[Diagram components: Raw Data; PRE-ACCESS SDRA/SDC; PRE-ACCESS DQI Monitor; Treated Data; Firewall; Workbench; Data Intrusion Sentry; PRE-OUTPUT SDRA/SDC; PRE-OUTPUT DQI Monitor]

Data Sentry: an AI system

• Monitors patterns of analytical requests

– 3 levels: users, institution, world.

– Looking for intrusive patterns.

– Numbers of requests

• Stores analytical requests for future use.
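The simplest part of the sentry, counting requests at the three levels and storing them, might look like the sketch below. This is an assumption-laden illustration and not the CLEF design: the class name `RequestMonitor`, the threshold values and the flagging rule are all hypothetical, and detecting intrusive patterns in the content of requests would need far more than counts.

```python
from collections import Counter


class RequestMonitor:
    """Counts analytical requests per user, per institution and overall,
    and flags any level whose count exceeds a configured limit."""

    def __init__(self, limits):
        # limits is a dict such as {"user": 2, "institution": 3, "world": 10};
        # the values are illustrative, not recommended settings.
        self.limits = limits
        self.counts = {level: Counter() for level in ("user", "institution", "world")}
        self.log = []  # stored requests, kept for future use

    def record(self, user, institution, request):
        """Store the request and return the levels that are over their limit."""
        self.log.append((user, institution, request))
        keys = {"user": user, "institution": institution, "world": "all"}
        flagged = []
        for level, key in keys.items():
            self.counts[level][key] += 1
            if self.counts[level][key] > self.limits[level]:
                flagged.append(level)
        return flagged


monitor = RequestMonitor({"user": 2, "institution": 3, "world": 10})
first = monitor.record("u1", "inst_a", "mean income by age")
second = monitor.record("u1", "inst_a", "mean income by age and sex")
third = monitor.record("u1", "inst_a", "mean income by age, sex and ward")
fourth = monitor.record("u2", "inst_a", "count by ward")
```

The third request pushes user `u1` over the per-user limit; the fourth, from a different user at the same institution, trips the institution-level limit instead.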

CLEF Proposed Architecture

[Diagram components: Raw Data; PRE-ACCESS SDRA/SDC; PRE-ACCESS DQI Monitor; Treated Data; Firewall; Workbench; Data Intrusion Sentry; PRE-OUTPUT SDRA/SDC; PRE-OUTPUT DQI Monitor]

Data Quality

• User analyses are run on both treated and untreated data.

– Outputs are compared and assessed for difference.

– Major research area: Knowledge Engineering

• Analyses are stored and collectively run over pre and post SDC files for assessment of impact.
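A minimal sketch of this comparison follows, under stated assumptions: the data values are made up, rounding ages to the nearest five stands in for whatever disclosure control is actually applied, and relative difference is just one possible measure of impact. The names `relative_difference` and `stored_analyses` are hypothetical.

```python
from statistics import mean


def relative_difference(analysis, raw, treated):
    """Run the same analysis on raw and treated data and return the
    absolute difference relative to the raw result."""
    raw_out = analysis(raw)
    treated_out = analysis(treated)
    if raw_out == 0:
        # Avoid dividing by zero; fall back to the absolute difference.
        return abs(treated_out - raw_out)
    return abs(treated_out - raw_out) / abs(raw_out)


# Made-up ages; the treatment rounds each one to the nearest 5.
raw_ages = [23, 37, 41, 58, 62]
treated_ages = [5 * round(age / 5) for age in raw_ages]

# Stored analyses are run collectively over the pre- and post-SDC data.
stored_analyses = {"mean": mean, "max": max}
impact = {
    name: relative_difference(fn, raw_ages, treated_ages)
    for name, fn in stored_analyses.items()
}
```

In this toy case the mean barely moves while the maximum shifts more, which is the kind of analysis-specific impact a data quality monitor would need to report.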

The Grid: the context for massive combining.

• “Integrated infrastructure for high-performance distributed computation” Cannataro and Talia (2002)

– Grid middleware handles the technical issues: communication, security, access/authentication etc… Cole et al (2002)

• Data grid

• Knowledge grid

Grid based Data Environment Analysis

What’s it about?

• Disclosure risk analysis is forever constrained by the fact that we tend to look only at the release object.

– This is a bit like evaluating the risk of a house being vulnerable to flooding without looking at where it is located!

• Data Environment Analysis aims to remedy that situation and, in so doing, completely change the face of disclosure control…

What would it involve?

• Web Crawling

• Data Monitoring

• Synthetic Data Generation

• Grid based disclosure risk analysis

Web crawling

• Untrained Screen scraping of all web sites that collect personal data.

• Generic info gathering of web published personal info (personal web pages, MySpace etc.)

Data Monitoring

• The development of sophisticated metadatabases representing available info fields

• Combined Database of web available data.

– Involves intelligent interpretation of web data, record linkage and other AI crossover techniques.

Architecture

[Diagram components: Repository: Data & Metadata; Data monitor; Synthesiser; SDRA system; five Web Crawlers]

What next?

• Decide on roles.

• Identify funder.

• Develop grant application.

Synthetic Data Generation

• Uses techniques like multiple imputation to generate artificial data from the metadata generated by the data monitors and from data stored and accessed through data repositories.
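As a rough illustration of generating artificial records from metadata, the sketch below draws each field independently from a stated distribution. This is a much simpler technique than multiple imputation and is swapped in only to show the idea: it ignores relationships between fields, and the metadata, weights and function name `synthesise` are all assumptions.

```python
import random


def synthesise(metadata, n, seed=0):
    """Generate n artificial records by sampling each field independently
    from the category weights given in the metadata."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    records = []
    for _ in range(n):
        record = {}
        for field, weights in metadata.items():
            categories = list(weights)
            record[field] = rng.choices(
                categories, weights=[weights[c] for c in categories]
            )[0]
        records.append(record)
    return records


# Hypothetical metadata: category weights per field, of the kind a
# data monitor might record about an available dataset.
metadata = {
    "occupation": {"Academics": 150, "Lawyers": 155},
    "income": {"High": 100, "Medium": 150, "Low": 55},
}

synthetic = synthesise(metadata, n=20)
```

Because the fields are sampled independently, the output can contain combinations that are empty in real data; preserving such structure is what model-based methods like multiple imputation are for.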

Closing thoughts

A Blurring of Concepts

• The boundaries between data and processes become less distinct.

• Cyberidentities

– I am my data?

• The distinction between informational and physical privacy becomes less clear.

Data Growth

• There is no reason to suppose that data growth will not continue at the same breakneck pace

– The data environment will become increasingly rich

• In this context the meaning of “privacy” will undoubtedly change.

– But how?

The meaning of Privacy

• Do people care about privacy in an orthodox, absolute sense?

– What does a blog mean?

• Private-public: Public Privacy

– Control and ownership are more important than the absolute right to secrecy.

From Data Subjects to Data Citizens

• A data-actualised individual: in control of, and self-aware about, their own data.

• What would data citizens be concerned about?

– Ownership

– The use/abuse of their data

– Harm

– Permission/Consent

• This suggests that the law should focus on data abuse rather than privacy per se.

Summary

• Statistical disclosure presents a problem for the use of data.

• Multiple linkable datasets exacerbate that problem.

• E-science provides some tools for new modes of data access

But…..

• Assuming that the global culture continues to feed and be fed by the information explosion:

– Our view of ourselves/our data will/must change.

– The meaning of privacy must change with it.

• The key question is what sort of society we are constructing; the meaning of privacy will reflect this.

