Post on 31-Jan-2016
description
transcript
Privacy Statistics and Data Linkage
Mark Elliot
Confidentiality and Privacy Group
University of Manchester
Overview
• The disclosure risk problem
• Some e-science possibilities– Monitored data access– Grid based Data environment Analysis
• The meaning of privacy
Data Data Everywhere…• Massive and exponential increase in data; Mackey
and Purdam(2002); Purdam and Elliot(2002). – These studies have led to the setting up of the data monitoring service.
• Singer(1999) noted three behavioural tendencies:– Collect more information on each population unit
– Replace aggregate data with person specific databases
– Given the opportunity collect personal information
• Purdam and Elliot add:– Link data whenever you can
Disclosure Risk I: Microdata
The Disclosure Risk Problem:Type I: Identification
Name Address Sex Age ..
Income .. ..Sex Age ..
IDvariables
Keyvariables
Targetvariables
Identification file
Target file
Disclosure Risk II: Aggregate Tables of
Counts
The Disclosure Risk Problem:Type II: Attribution
High Medium Low TotalAccademics 0 100 50 150Lawyers 100 50 5 155Total 100 150 55 305
Income levels for two occupations
The Disclosure Risk Problem:Type II: Attribution
High Medium Low TotalAccademics 1 100 50 150Lawyers 100 50 5 155Total 100 150 55 305
Income levels for two occupations
The Disclosure Risk Problem:Type II: Attribution
High Medium Low TotalAccademics 0 100 50 150Lawyers 100 50 5 155Total 100 150 55 305
Income levels for two occupations
Multiple datasets
• Disclosure Risk assessment for single datasets is a reasonably understood problem.
• But what happens with multiple datasets?
Data Mining and the Grid
• Traditional Data Mining examines and identifies patterns on single (if massive) datasets.
• But Data Mining is really a method/approach/technology that has been waiting for the grid to happen.
• Smith and Elliot (2005,06,07)
• Increases in data availability lead inexorably to an increase in disclosure risk
• My ability to make linkages (disclosive or otherwise) between datasets X and Y is facilitated by the copresence of dataset Z.
• It’s all about information!
CLEF: Clinical e-Science Framework
A solution involving monitored access
CLEF Consortium
Approximately 40 Staff from
• University of Manchester
• University of Sheffield
• University College London
• University of Brighton
• Royal Marsden Hospital, London
Purpose
• To provide a system for allowing research access to patient data, whilst maintaining privacy.
• Patient records– Database
• Texts such as referral letters and other clinical texts– Text mining system convert to microdata
PRE-ACCESS DQI Monitor
Raw Data
Treated Data
Data Intrusion
sentry
PRE-OUTPUT SDRA/SDC
PRE-ACCESS SDRA/SDC
PRE-Output DQI Monitor
Firewall
CLEF one possible architecture
Workbench
Data Sentry: an AI system
• Monitors patterns of analytical requests– 3 levels: users, institution, world.– Looking for intrusive patterns.– Numbers of requests
• Stores Analytical requests for future use.
PRE-ACCESS DQI Monitor
Raw Data
Treated Data
Data Intrusion
sentry
PRE-OUTPUT SDRA/SDC
PRE-ACCESS SDRA/SDC
PRE-Output DQI Monitor
Firewall
CLEF Proposed Architecture
Workbench
Data Quality
• User analyses are run on both treated and untreated data. – Outputs are compared and assessed for
difference.– Major research area – Knowledge Engineering
• Analyses are stored and collectively run over pre and post SDC files for assessment of impact.
The Grid: the context for massive combining.
• “Integrated infrastructure for high-performance distributed computation” Cannataro and Talia (2002)
– Grid middleware handles the technical issues communication, security, access/authentication etc… Cole et al (2002)
• Data grid
• Knowledge grid
Grid based Data Environment Analysis
What’s it about?
• Disclosure risk analysis is forever constrained by the fact that we tend to only look at the release object. – This is a bit like evaluating the risk of a house
being vulnerable to flooding without looking at where it is located!
• Data Environment Analysis aims to remedy that situation and complete change the face of disclosure control in so doing…..
What would it involve?
• Web Crawling
• Data Monitoring
• Synthetic Data Generation
• Grid based disclosure risk analysis
Web crawling
• Untrained Screen scraping of all web sites that collect personal data.
• Generic info gathering of web published personal info (personal web pages, My space etc)
Data Monitoring
• The development of sophisticated metadatabases representing available info fields
• Combined Database of web available data. – Involves intelligent interpretation of web data,
record linkage and other AI crossover techniques.
Architecture
Repository: Data & Metadata
Data monitorSynthesiserSDRA system
Web Crawler
Web Crawler
Web Crawler
Web Crawler
Web Crawler
What next?
• Decide on roles.
• Identify funder.
• Develop grant application.
Synthetic Data Generation
• Uses techniques like multiple imputation to generate artificial data from the metadata generated by the data monitors and from data stored and accessed through data repositories.
Closing thoughts
A Blurring of Concepts
• The boundaries between data and processes become less distinct.
• Cyberidenties– I am my data?
• The distinction between informational and physical privacy becomes less distinct.
Data Growth
• There is no reason to suppose that data growth will not continue at the same break neck pace– The data environment will become increasingly
richer
• In this context the meaning of “privacy” will undoubtedly change.– But how?
The meaning of Privacy
• Do people care about privacy in an orthodox, absolute sense?– What does a blog mean?
• Private-public: Public Privacy
– Control and ownership are more important than the absolute right to secrecy.
From Data Subjects to Data Citizens
• A data actualised individual in control and self aware of their own data.
• What would data citizens be concerned about?– Ownership– The use/abuse of their data– Harm– Permission/Consent
• This suggests that the law should focus on data abuse rather than privacy per se.
Summary
• Statistical Disclosure prevents a problem for the use of data
• Multiple linkable datasets exacerbate that problem.
• E-science provides some tools for new modes of data access
But…..
• Assuming that the global culture continues to feed and be fed by the information explosion:– Our view of ourselves/our data will/must change.
– The meaning of privacy must change with it.
• The key question is what sort of society we are constructing; the meaning of privacy will reflect this.