Key Concepts in Confidential Data Management
Micah Altman
Director of Research
MIT Libraries
Prepared for
Privacy Tools for Research Data Orientation
Harvard U.
June 2015
It’s easy to leak private information…
• Birth date + ZIP code + gender uniquely identify ~87% of people in the U.S.
• Social security numbers can be predicted from birth date and birthplace
• Tables, graphs, and maps can reveal identifiable information
• People have been identified through movie rankings, search strings, movement patterns, shopping habits, writing style…
Brownstein, et al., 2006, NEJM 355(16)
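A back-of-the-envelope sketch of why birth date + ZIP code + gender is so identifying: the number of possible combinations is far larger than the population. Every constant below is a rough, illustrative assumption, not a figure from these slides.

```python
# All constants are rough assumptions for illustration only.
DAYS_PER_YEAR = 365
BIRTH_YEARS = 80           # assumed span of birth years among living people
GENDERS = 2                # as coded in the example tables in this deck
ZIP_CODES = 40_000         # assumed order of magnitude for U.S. ZIP codes
POPULATION = 320_000_000   # assumed order of magnitude for the U.S. population

# Number of distinct (birth date, ZIP code, gender) combinations.
cells = DAYS_PER_YEAR * BIRTH_YEARS * GENDERS * ZIP_CODES

# Average number of people per combination if people were spread evenly.
people_per_cell = POPULATION / cells

print(f"{cells:,} possible combinations")
print(f"{people_per_cell:.2f} people per combination on average")
```

With well under one person per combination on average, most people share their combination with nobody. Real populations are not spread evenly, so the true share of unique people is lower than an even spread would suggest.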
Privacy Core Concepts
• Privacy: control over the extent and circumstances of sharing
• Confidentiality: control over disclosure of information
• Identifiability: potential for learning about individuals based on their inclusion in a data set
• Sensitivity: potential for harm if information is disclosed and identified
Different types of identifiability
Record-linkage: “where’s Waldo”
• Match a real person to a precise record in a database
• Examples: direct identifiers
• Caveats: Satisfies compliance for specific laws, but not generally; substantial potential for harm remains
Indistinguishability: “hiding in the crowd”
• Individuals can be linked only to a cluster of records (of known size)
• Examples: k-anonymity, attribute disclosure
• Caveats: Potential for substantial harm may remain; must specify what external information is observable; need diversity for sensitive attributes
Limited adversarial learning: “confidentiality guaranteed”
• Formally bounds the total learning about any individual that occurs from a data release
• Examples: differential privacy, zero-knowledge proofs
• Caveats: Challenging to implement; requires an interactive system
Less Protection → More Protection
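As one illustration of limited adversarial learning, here is a minimal sketch of the Laplace mechanism for a differentially private count. The function names and the choice of epsilon are assumptions for illustration. A count changes by at most 1 when one person is added or removed, so noise with scale 1/epsilon is used.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(values, predicate, epsilon, rng):
    """Return a count with Laplace noise of scale 1/epsilon (sensitivity 1)."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# "# of crimes committed" column from the example table later in this deck.
crimes = [0, 0, 0, 0, 0, 1, 1, 2, 4, 16, 32, 64, 128, 256]

rng = random.Random(2015)  # seeded only so the sketch is repeatable
answer = noisy_count(crimes, lambda c: c > 0, epsilon=0.5, rng=rng)
print(f"noisy number of people with at least one crime: {answer:.1f}")
```

Smaller epsilon means more noise and stronger protection. Each answered query spends part of the privacy budget, which is one reason these systems are usually interactive.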
How sensitive is information, if reidentified?
How much harm & how likely? Information that…
• … creates minimal risk of harm, even if linked to an individual
• … creates a non-minimal risk of minor harm
  Examples: information that would reasonably be expected to cause embarrassment to some individuals
• … creates significant risk of moderate harm
  Examples: civil liability, moderate psychological harm, or material social harm to individuals or groups, economic discrimination, moderate economic direct costs, substantial loss to reputation
• … creates substantial risk of serious harm
  Examples: serious psychological harm; loss of insurability, loss of employability; substantial social harm to a vulnerable group
• … creates high risk of grave harm
  Examples: death; significant injury; persecution
Who is harmed? Data subjects; vulnerable groups; institutions; society
Laws define “anonymized” differently
Law | Identification Criteria | Sensitivity Criteria
FERPA | Direct; indirect; linked; bad intent | Any non-directory information
HIPAA | Direct/indirect: 18 identifiers, OR statistician verifies minimal risk; AND no actual knowledge of identified individual | Any medical information
Common Rule | Direct; indirect/linked, if “readily identifiable” | Private information, based on harm
MA 201 CMR 17 | First initial + last name | Financial, state, federal identifiers
Legal Constraints are Complicated
[Diagram: overlapping areas of law that can constrain data management. Labels recoverable from the figure:]
• Contract; License; Click-Wrap; TOU
• Intellectual Property; Copyright; Fair Use; DMCA; Database Rights; Moral Rights; Intellectual Attribution; Trade Secret; Patent; Trademark
• Access Rights; FOIA; State FOI Laws; Journal Replication Requirements; Funder Open Access
• Confidentiality; Common Rule (45 CFR 46); HIPAA; FERPA; EU Privacy Directive; Privacy Torts (Invasion, Defamation); Rights of Publicity; CIPSEA; State Privacy Laws
• Sensitive but Unclassified; Potentially Harmful (Archeological Sites, Endangered Species, Animal Testing, …); Classified
• Export Restrictions; EAR; ITAR
Privacy & Information Security
Privacy & security overlap
• Overlapping concepts and concerns
• Terminology can conflict
• Providing information-security “secrecy” is necessary but not sufficient to maintain confidentiality broadly
Difference in much of the general analytic approach
• Privacy analysis focuses on high-level principles, and specifically on harm vs. utility
• Infosec analysis focuses on maintaining security properties, and specifically on vulnerabilities & controls
Information Security Core Properties
• Confidentiality (secrecy): control over disclosure
• Integrity: control over modification
• Availability: authorized users can access as needed
• Authenticity: authorized users can validate information source
• Non-repudiation: all actions and changes provably sourced to a unique person
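A minimal sketch of how integrity and authenticity can be checked in practice, using a keyed hash (HMAC) from the Python standard library. The key, message, and function names are hypothetical. Because both parties share the key, this does not provide non-repudiation; that would need digital signatures.

```python
import hashlib
import hmac

def sign(key: bytes, message: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Check the tag in constant time; False if message or tag was altered."""
    return hmac.compare_digest(sign(key, message), tag)

key = b"hypothetical-shared-secret"       # assumption: distributed out of band
record = b"LINK=1401,Treat=0,acts=0"      # hypothetical record to protect

tag = sign(key, record)
print(verify(key, record, tag))                       # unmodified record
print(verify(key, b"LINK=1401,Treat=1,acts=0", tag))  # tampered record
```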
Simple Information Security Control Model
[Diagram: Resource Control Model. Components shown: Client; Credentials; Authentication; Access Control; Authorization; Request/Response; Resource; Auditing; Log; External Auditor]
Simplified Security Systems Model
Analysis:
• Threats (natural, unintentional, intentional)
• Vulnerabilities (logical, physical, social)
• Systems (computers, storage, networks)
Controls:
• Process (policies, procedures, training, …)
• Technical (identification, access, transmission, auditing, …)
• External (law, norms, economic, …)
NIST: Information Security Control Selection
• System Analysis
• Threat Modeling
• Vulnerability Identification
• Analysis: likelihood, impact, mitigating controls
• Institute Selected Controls
• Testing and Auditing
Research Design Decisions Relevant to Confidential Information
Collection:
• Human Subjects Plan and Data Management Plan
• Consent/licensing terms
• Methods
• Measures
Storage:
• Systems information security
• Data structures and partitioning
Dissemination:
• Vetting
• Disclosure limitation
• Data use agreements
IRBs and Confidential Information
• IRBs review consent procedures and documentation
• IRBs may review data management plans
  - May require procedures to minimize risk of disclosure
  - May require procedures to minimize harm resulting from disclosure
• IRBs make determination of sensitivity of information: potential harm resulting from disclosure
• IRBs make determination regarding whether data is de-identified for “public use” [see NHRPAC, “Recommendations on Public Use Data Files”]
Reducing Risk in Data Collection
• Avoid collecting sensitive information, unless it is required by research design, method, or hypothesis
  - Unnecessary sensitive information: not minimal risk
  - Reducing sensitivity: higher participation, greater honesty
• Collect sensitive information in private settings
  - Reduces risk of disclosure
  - Increases participation
• Reduce sensitivity through indirect measures
  - Less sensitive proxies, e.g. implicit association test [Greenwald, et al. 1998]
  - Unfolding brackets
  - Group response collection
  - Randomized response technique [Warner 1965]
  - Item count/unmatched count/list experiment technique
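A minimal sketch of a Warner-style randomized response design, to show why it reduces sensitivity while still supporting estimation. With probability p a respondent answers the sensitive question; otherwise they answer its opposite, so no single answer reveals their status. The function names, p = 0.75, and the simulated population are assumptions for illustration.

```python
import random

def randomized_answer(has_trait, p, rng):
    """With probability p answer 'I have the trait'; otherwise answer
    the opposite statement 'I do not have the trait'."""
    if rng.random() < p:
        return has_trait
    return not has_trait

def estimate_prevalence(answers, p):
    """Invert P(yes) = p*pi + (1-p)*(1-pi) to estimate pi (requires p != 0.5)."""
    yes_rate = sum(answers) / len(answers)
    return (yes_rate - (1 - p)) / (2 * p - 1)

rng = random.Random(1965)                    # seeded only for repeatability
population = [i < 3000 for i in range(10000)]  # assumed 30% true prevalence
answers = [randomized_answer(person, 0.75, rng) for person in population]
print(f"estimated prevalence: {estimate_prevalence(answers, 0.75):.3f}")
```

The closer p is to 0.5, the stronger the protection for each respondent and the noisier the estimate.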
Some Variations and Challenges in Collection
• Anonymizing data collection: through specialized trusted data intermediaries
• Cloud-based data collection: not-so-trusted intermediaries
• Field research: technical constraints, legal constraints, and change in risk and vulnerability surfaces
Common Not-Bad Practices for Storage
• Systematically identify confidential information being stored
• Use whole-disk/filesystem/media encryption to protect data at rest
• Use end-to-end encryption to protect data in motion
• Use core information hygiene to protect systems
  - Scan for confidential information regularly
  - Be thorough in disposal of information
• Very sensitive data requires more protection
Common Information Hygiene
Computer setup
• Use a virus checker, and keep it updated
• Use a host-based firewall
• Use strong credentials
• Use a locking screen-saver
• Lock default/open accounts
• Regularly scan for sensitive information
• Update your software regularly
Server setup
• Password guessing restrictions
• Idle session locking (or used on all clients)
• No password retrieval
• Keep access logs
Controlled destruction
• Rule based
• Physical
• Media level
Behavior
• Don’t share accounts or passwords
• Don’t use administrative accounts all the time
• Don’t run programs from untrusted sources
• Don’t give out your password to anyone
• Have a process for revoking user access when no longer needed/authorized (e.g. if user leaves university)
• Documented breach reporting procedure
• Users should have appropriate confidentiality training
Partitioning Information for Management
• Reduces risk in information management
• Partition information based on sensitivity
  - Identifying information
  - Descriptive information
  - Sensitive information
  - Other information
• Segregate
  - Storage of information
  - Access regimes
  - Data collection channels
  - Data transmission channels
• Plan to segregate as early as feasible in data collection and processing
• Link segregated information with artificial keys …
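A minimal sketch of partitioning with artificial keys: identifying fields go to one table, everything else to another, and the two are joined only by a random LINK value that carries no information about the person. The function name and sample records are hypothetical.

```python
import secrets

def partition(records, identifying_fields):
    """Split each record into an identified part and a de-identified part,
    joined only by a random artificial key stored under 'LINK'."""
    identified, deidentified = [], []
    used_links = set()
    for record in records:
        link = secrets.token_hex(8)
        while link in used_links:          # regenerate on the rare collision
            link = secrets.token_hex(8)
        used_links.add(link)
        id_part = {f: record[f] for f in identifying_fields}
        rest = {k: v for k, v in record.items() if k not in identifying_fields}
        identified.append({**id_part, "LINK": link})
        deidentified.append({"LINK": link, **rest})
    # Shuffle so that row order cannot be used to re-link the two tables.
    secrets.SystemRandom().shuffle(deidentified)
    return identified, deidentified

records = [
    {"Name": "A. Jones", "SSN": "12341", "Favorite Ice Cream": "Raspberry"},
    {"Name": "B. Jones", "SSN": "12342", "Favorite Ice Cream": "Pistachio"},
]
identified, deidentified = partition(records, ["Name", "SSN"])
print(identified)
print(deidentified)
```

The two tables would then be stored under different access regimes. Avoid keys derived from the identifiers themselves (for example, a hash of the SSN), since those can be recomputed by anyone who knows the identifier.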
Partitioned table
Name | SSN | Birthdate | Zipcode | Gender | LINK
A. Jones | 12341 | 01011961 | 02145 | M | 1401
B. Jones | 12342 | 02021961 | 02138 | M | 283
C. Jones | 12343 | 11111972 | 94043 | M | 8979
D. Jones | 12344 | 12121972 | 94043 | M | 7023
E. Jones | 12345 | 03251972 | 94041 | F | 1498
F. Jones | 12346 | 03251972 | 02127 | F | 1036
G. Jones | 12347 | 08081989 | 02138 | F | 3864
H. Smith | 12348 | 01011973 | 63200 | F | 2124
I. Smith | 12349 | 02021973 | 63300 | M | 4339
J. Smith | 12350 | 02021973 | 63400 | M | 6629
K. Smith | 12351 | 03031974 | 64500 | M | 9091
L. Smith | 12352 | 04041974 | 64600 | M | 9918
M. Smith | 12353 | 04041974 | 64700 | F | 4749
N. Smith | 12354 | 04041974 | 64800 | F | 8197
Not identified:
LINK | Favorite Ice Cream | Treat | # acts
1401 | Raspberry | 0 | 0
283 | Pistachio | 1 | 20
8979 | Chocolate | 0 | 0
7023 | Hazelnut | 1 | 12
1498 | Lemon | 0 | 0
1036 | Lemon | 1 | 7
3864 | Peach | 0 | 1
2124 | Lime | 1 | 17
4339 | Mango | 0 | 4
6629 | Coconut | 1 | 18
9091 | Frog | 0 | 32
9918 | Vanilla | 1 | 65
4749 | Pumpkin | 0 | 128
8197 | Allergic | 1 | 256
Suppress Information for Data Release
Published Outputs
* Jones * * 1961 021*
* Jones * * 1961 021*
* Jones * * 1972 9404*
* Jones * * 1972 9404*
* Jones * * 1972 9404*
• Modal practice: “The correlation between X and Y was large and statistically significant”
• Summary statistics
• Contingency table
• Public use sample microdata
• Information visualization
How many things are wrong with this picture?
Name | SSN | Birthdate | Zipcode | Gender | Favorite Ice Cream | # of crimes committed
A. Jones | 12341 | 01011961 | 02145 | M | Raspberry | 0
B. Jones | 12342 | 02021961 | 02138 | M | Pistachio | 0
C. Jones | 12343 | 11111972 | 94043 | M | Chocolate | 0
D. Jones | 12344 | 12121972 | 94043 | M | Hazelnut | 0
E. Jones | 12345 | 03251972 | 94041 | F | Lemon | 0
F. Jones | 12346 | 03251972 | 02127 | F | Lemon | 1
G. Jones | 12347 | 08081989 | 02138 | F | Peach | 1
H. Smith | 12348 | 01011973 | 63200 | F | Lime | 2
I. Smith | 12349 | 02021973 | 63300 | M | Mango | 4
J. Smith | 12350 | 02021973 | 63400 | M | Coconut | 16
K. Smith | 12351 | 03031974 | 64500 | M | Frog | 32
L. Smith | 12352 | 04041974 | 64600 | M | Vanilla | 64
M. Smith | 12353 | 04041974 | 64700 | F | Pumpkin | 128
N. Smith-Jones | 12354 | 04041974 | 64800 | F | Allergic | 256
What’s wrong with this picture?
[The table from the previous slide is repeated here with callouts. Callout labels, in the order they survive:]
• Identifier
• Sensitive
• Private Identifier
• Private Identifier
• Identifier
• Sensitive
• Unexpected Response?
• Mass resident
• FERPA too?
• Californian
• Twins, separated at birth?
Help, help, I’m being suppressed…
Name | SSN | Birthdate | Zipcode | Gender | Favorite Ice Cream | # of crimes committed
[Name 1] | 12341 | *1961 | 021* | M | Raspberry | .1
[Name 2] | 12342 | *1961 | 021* | M | Pistachio | -.1
[Name 3] | 12343 | *1972 | 940* | M | Chocolate | 0
[Name 4] | 12344 | *1972 | 940* | M | Hazelnut | 0
[Name 5] | 12345 | *1972 | 940* | F | Lemon | .6
[Name 6] | 12346 | *1972 | 021* | F | Lemon | .6
[Name 7] | 12347 | *1989 | 021* | * | Peach | 64.6
[Name 8] | 12348 | *1973 | 632* | F | Lime | 3
[Name 9] | 12349 | *1973 | 633* | M | Mango | 3
[Name 10] | 12350 | *1973 | 634* | M | Coconut | 37.2
[Name 11] | 12351 | *1974 | 645* | M | * | 37.2
[Name 12] | 12352 | *1974 | 646* | M | Vanilla | 37.2
[Name 13] | 12353 | *1974 | 647* | F | * | 64.4
[Name 14] | 12354 | *1974 | 648* | F | Allergic | 256
Callouts on the table: Row; Var; Synthetic; Global Recode; Local Suppression; Aggregation + Perturbation
Traditional Static Suppression
• Data reduction: observation, measure, cell
• Perturbation: microaggregation, rule-based data swapping, adding noise
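A minimal sketch of the global recode step visible in the table above: names become placeholders, birth dates are coarsened to the year, and ZIP codes to their first three digits. The function name and field names are assumptions, and birth dates are assumed to be MMDDYYYY strings as in the earlier slides.

```python
def global_recode(record, row_number):
    """Coarsen quasi-identifiers the same way for every record."""
    out = dict(record)
    out["Name"] = f"[Name {row_number}]"                # replace direct identifier
    out["Birthdate"] = "*" + record["Birthdate"][-4:]   # MMDDYYYY -> *YYYY
    out["Zipcode"] = record["Zipcode"][:3] + "*"        # 02145 -> 021*
    return out

raw = [
    {"Name": "A. Jones", "Birthdate": "01011961", "Zipcode": "02145", "Gender": "M"},
    {"Name": "B. Jones", "Birthdate": "02021961", "Zipcode": "02138", "Gender": "M"},
]
recoded = [global_recode(r, i) for i, r in enumerate(raw, start=1)]
for row in recoded:
    print(row)
```

After recoding, the two sample records share the same birth year, ZIP prefix, and gender, so they can no longer be told apart on those fields. The SSN column would still need to be dropped before any release.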
R can do it (R can do everything, but not really :-)
# setup
> library(sdcMicro)
# load data
> classexample.df <- read.csv("examplesdc.csv", as.is=TRUE, stringsAsFactors=FALSE, colClasses=c("character","character","character","character","factor","factor","numeric"))
# create a weight variable if needed
> classexample.df$weight <- 1
# simple frequency table shows that data is uniquely identified
> ftable(Birthdate~Zipcode, data=classexample.df)
Birthdate (columns) by Zipcode (rows):
Zipcode  01/01/1973 02/02/1973 03/25/1972 04/04/1974 08/08/1989 10/01/1961 11/11/1972 12/12/1972 20/02/1961 30/03/1974
02127             0          0          1          0          0          0          0          0          0          0
02138             0          0          0          0          1          0          0          0          1          0
02145             0          0          0          0          0          1          0          0          0          0
63200             1          0          0          0          0          0          0          0          0          0
63300             0          1          0          0          0          0          0          0          0          0
63400             0          1          0          0          0          0          0          0          0          0
64500             0          0          0          0          0          0          0          0          0          1
64600             0          0          0          1          0          0          0          0          0          0
64700             0          0          0          1          0          0          0          0          0          0
64800             0          0          0          1          0          0          0          0          0          0
94041             0          0          1          0          0          0          0          0          0          0
94043             0          0          0          0          0          0          1          1          0          0
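The same uniqueness check can be sketched without R by counting how many records share each combination of quasi-identifiers; the smallest group size is the k in k-anonymity. The function name and the sample rows (taken from the example tables in this deck) are used here for illustration only.

```python
from collections import Counter

def smallest_class(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values on all quasi-identifiers (the 'k' of k-anonymity)."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

raw = [
    {"Birthdate": "01011961", "Zipcode": "02145"},
    {"Birthdate": "02021961", "Zipcode": "02138"},
    {"Birthdate": "11111972", "Zipcode": "94043"},
    {"Birthdate": "12121972", "Zipcode": "94043"},
    {"Birthdate": "03251972", "Zipcode": "94041"},
]
recoded = [
    {"Birthdate": "*1961", "Zipcode": "021*"},
    {"Birthdate": "*1961", "Zipcode": "021*"},
    {"Birthdate": "*1972", "Zipcode": "9404*"},
    {"Birthdate": "*1972", "Zipcode": "9404*"},
    {"Birthdate": "*1972", "Zipcode": "9404*"},
]
print(smallest_class(raw, ["Birthdate", "Zipcode"]))      # every raw record is unique
print(smallest_class(recoded, ["Birthdate", "Zipcode"]))  # coarsened records form groups
```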
Suppression reduces utility
Common approach of anonymizing/suppressing data reduces usefulness
Minimizing disclosure in the presence of large external data sources reduces usefulness a lot
Anonymized data is not simply less informative -- it typically yields biased analyses
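A small worked example of how suppression can bias an analysis, using the “# of crimes committed” column from the example table. The suppression rule (drop records with 100 or more crimes because they are easy to single out) is a hypothetical one chosen for illustration.

```python
# "# of crimes committed" column from the example table in this deck.
crimes = [0, 0, 0, 0, 0, 1, 1, 2, 4, 16, 32, 64, 128, 256]

def mean(values):
    return sum(values) / len(values)

# Hypothetical release rule: suppress records with 100 or more crimes.
released = [c for c in crimes if c < 100]

full_mean = mean(crimes)
released_mean = mean(released)
print(f"mean on full data:     {full_mean}")
print(f"mean on released data: {released_mean}")
```

The released data does not just lose two rows. Any analysis of the mean is pulled downward, because the suppressed records are not a random sample of the data.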
Cradle-to-grave analysis: Consider Risks and Controls in Info Lifecycle
• What is collected? Scope of information collected
• Intended uses
• Potential benefits from data availability
• Re-identification (learning) risks
• Information sensitivity (harm out of context)
• Controls on retention
• Possible information transformations (aggregation, redaction)
• Post-disclosure control and evaluation: use limits, review, reporting, and information accountability
Some Overarching Principles for Consideration
Fair Information Practice:
• Notice/awareness
• Choice/consent
• Access/participation (verification, accuracy, correction)
• Integrity/security
• Enforcement/redress: self-regulation, private remedies, government enforcement
Privacy by design:
• Proactive not reactive; preventative not remedial
• Privacy as the default setting
• Privacy embedded into design
• Full Functionality – Positive-Sum, not Zero-Sum
• End-to-End Security – Full Lifecycle Protection
• Visibility and Transparency – Keep it Open
• Respect for User Privacy – Keep it User-Centric
OECD Principles:
• Collection limitation
• Data quality
• Purpose specification
• Use limitation
• Security safeguards
• Openness
• Individual participation
• Accountability
Challenges of big data…
Anonymization can completely destroy utility
• The “Netflix Problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008]
Observable behavior leaves unique “fingerprints”
• The “GIS” problem: fine geo-spatial-temporal data is impossible to mask when correlated with external data [Zimmerman 2008]
Big data can be rich, messy & surprising
• The “Facebook Problem”: possible to identify masked network data if only a few nodes are controlled [Backstrom, et al. 2007]
• The “Blog problem”: pseudonymous communication can be linked through textual analysis [Novak, et al. 2004]
Little data in a big world
• The “Favorite Ice Cream” problem: public information that is not risky can help us learn information that is risky
• The “Doesn’t Stay in Vegas” problem: information shared locally can be found anywhere
• The “Unintended Algorithmic Discrimination” problem: algorithms are often not transparent, and can amplify human biases
Source: [Calabrese 2008; Real Time Rome Project 2007]
Observations
Confidentiality requires limiting what an adversary can learn about an individual as a result of their being measured
Common overarching principles do not provide sufficient guidance to select effective controls and approaches
Generic/naïve use of extant data sharing or redaction controls and technologies is unlikely to provide adequate protection in a big data world.
Evaluate the privacy and security risks, controls, and accountability mechanisms, over the entire information lifecycle – including collection, consent, use, dissemination, and post-disclosure
Questions?
Web: informatics.mit.edu