Confidential data management_key_concepts

Date post: 04-Aug-2015
Upload: micah-altman
Key Concepts in Confidential Data Management
Micah Altman, Director of Research, MIT Libraries
Prepared for Privacy Tools for Research Data Orientation, Harvard U., June 2015
It’s easy to leak private information…
• Birth date + ZIP code + gender uniquely identify ~87% of people in the U.S.
• Social security numbers can be predicted from birth date and place of birth
• Tables, graphs, and maps can reveal identifiable information
• People have been identified through movie rankings, search strings, movement patterns, shopping habits, writing style…
Source: Brownstein, et al., 2006, NEJM 355(16)
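The first bullet can be made concrete with a small linkage sketch in R. Everything here is made up for illustration: `released` is a hypothetical file with names removed, `voters` is a hypothetical public list that still carries names, and the diagnoses are invented.

```r
# Hypothetical "de-identified" release: names removed, quasi-identifiers kept.
released <- data.frame(
  birthdate = c("1961-01-01", "1961-02-02", "1972-03-25", "1972-03-25"),
  zipcode   = c("02145", "02138", "94041", "02127"),
  gender    = c("M", "M", "F", "F"),
  diagnosis = c("asthma", "diabetes", "none", "migraine"),
  stringsAsFactors = FALSE
)

# Hypothetical public list (for example a voter roll) that includes names.
voters <- data.frame(
  name      = c("A. Jones", "B. Jones", "E. Jones", "Q. Public"),
  birthdate = c("1961-01-01", "1961-02-02", "1972-03-25", "1980-07-04"),
  zipcode   = c("02145", "02138", "94041", "10001"),
  gender    = c("M", "M", "F", "F"),
  stringsAsFactors = FALSE
)

quasi_ids <- c("birthdate", "zipcode", "gender")

# Share of released records whose quasi-identifier combination is unique.
combo <- do.call(paste, c(released[quasi_ids], sep = "|"))
share_unique <- mean(table(combo)[combo] == 1)

# Joining on the quasi-identifiers re-attaches names to sensitive values.
reidentified <- merge(voters, released, by = quasi_ids)
print(share_unique)
print(reidentified[, c("name", "diagnosis")])
```

In this toy file every combination is unique, so any released record whose combination also appears in the public list is re-identified by a plain join.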

Some Concepts and Terminology

Privacy Core Concepts
• Privacy: control over the extent and circumstances of sharing
• Confidentiality: control over disclosure of information
• Identifiability: potential for learning about individuals based on their inclusion in a data set
• Sensitivity: potential for harm if information is disclosed and identified

Different types of identifiability (ordered from less protection to more protection)

Record linkage ("where’s Waldo")
• Match a real person to a precise record in a database
• Examples: direct identifiers
• Caveats: protection at this level satisfies compliance for specific laws, but not generally; substantial potential for harm remains

Indistinguishability ("hiding in the crowd")
• Individuals can be linked only to a cluster of records (of known size)
• Examples: k-anonymity, attribute disclosure
• Caveats: potential for substantial harm may remain; must specify what external information is observable; need diversity for sensitive attributes

Limited adversarial learning ("confidentiality guaranteed")
• Formally bounds the total learning about any individual that occurs from a data release
• Examples: differential privacy, zero-knowledge proofs
• Caveats: challenging to implement; requires an interactive system
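As a rough illustration of the "limited adversarial learning" category, here is a minimal sketch of the Laplace mechanism for a differentially private count. The helper names `rlaplace` and `dp_count` are hypothetical (base R ships neither), the input values are toy data, and the choice of `epsilon = 0.5` is arbitrary.

```r
# Draw n samples from a Laplace(0, scale) distribution.
# The difference of two independent exponentials with rate 1/scale is Laplace.
rlaplace <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

# Differentially private count: a counting query changes by at most 1 when
# one person is added or removed (sensitivity 1), so the noise scale is 1/epsilon.
dp_count <- function(x, predicate, epsilon) {
  true_count <- sum(predicate(x))
  true_count + rlaplace(1, scale = 1 / epsilon)
}

set.seed(1)
crimes <- c(0, 0, 0, 0, 0, 1, 1, 2, 4, 16, 32, 64, 128, 256)  # toy values
noisy <- dp_count(crimes, function(v) v > 0, epsilon = 0.5)
print(noisy)  # the true count is 9; the released value is 9 plus Laplace noise
```

A smaller `epsilon` means more noise and a tighter bound on what a release reveals about any one person. Each additional query spends more of the privacy budget, which is one reason such systems are usually interactive.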

How sensitive is information, if reidentified?

• … creates minimal risk of harm, even if linked to an individual
• … creates a non-minimal risk of minor harm
  Examples: information that would reasonably be expected to cause embarrassment to some individuals
• … creates significant risk of moderate harm
  Examples: civil liability, moderate psychological harm, or material social harm to individuals or groups, economic discrimination, moderate economic direct costs, substantial loss to reputation
• … creates substantial risk of serious harm
  Examples: serious psychological harm; loss of insurability; loss of employability; substantial social harm to a vulnerable group
• … creates high risk of grave harm
  Examples: death; significant injury; persecution

Who is harmed? Data subjects; vulnerable groups; institutions; society
How much harm, and how likely?

Laws define “anonymized” differently

FERPA
• Identification criteria: direct; indirect; linked; bad intent
• Sensitivity criteria: any non-directory information

HIPAA
• Identification criteria: direct/indirect (18 identifiers), OR statistician verifies minimal risk; AND no actual knowledge of identified individual
• Sensitivity criteria: any medical information

Common Rule
• Identification criteria: direct; indirect/linked, if “readily identifiable”
• Sensitivity criteria: private information, based on harm

MA 201 CMR 17
• Identification criteria: first initial + last name
• Sensitivity criteria: financial, state, and federal identifiers

Legal Constraints are Complicated

[Diagram of overlapping legal areas: Contract, Intellectual Property, Access Rights, Confidentiality]

Constraints shown in the diagram:
• Copyright; Fair Use; DMCA; Database Rights; Moral Rights; Intellectual Attribution; Trade Secret; Patent; Trademark
• Common Rule (45 CFR 46); HIPAA; FERPA; EU Privacy Directive; Privacy Torts (Invasion, Defamation); Rights of Publicity
• Sensitive but Unclassified; Potentially Harmful (Archeological Sites, Endangered Species, Animal Testing, …); Classified
• FOIA; CIPSEA; State Privacy Laws; State FOI Laws; EAR; ITAR; Export Restrictions
• Journal Replication Requirements; Funder Open Access
• Contract; License; Click-Wrap; TOU

Privacy & Information Security

• Privacy and security overlap
  - Overlapping concepts and concerns
  - Terminology can conflict
• Providing information-security “secrecy” is necessary but not sufficient to maintain confidentiality broadly
• The general analytic approaches differ
  - Privacy analysis focuses on high-level principles, and specifically on harm vs. utility
  - Infosec analysis focuses on maintaining security properties, and specifically on vulnerabilities and controls

Information Security Core Properties
• Confidentiality (secrecy): control over disclosure
• Integrity: control over modification
• Availability: authorized users can access as needed
• Authenticity: authorized users can validate information source
• Non-repudiation: all actions and changes provably sourced to a unique person

Simple Information Security Control Model

[Diagram: Resource Control Model. Elements shown: Access Control, Client, Credentials, Authentication, Authorization, Request/Response, Resource, Auditing, Log, External Auditor]

Simplified Security Systems Model

Analysis:
• Threats (natural, unintentional, intentional)
• Vulnerabilities (logical, physical, social)
• Systems (computers, storage, networks)

NIST: Information Security Control Selection
1. System analysis
2. Threat modeling
3. Vulnerability identification
4. Analysis: likelihood, impact, mitigating controls
5. Institute selected controls
6. Testing and auditing

Controls:
• Process (policies, procedures, training, …)
• Technical (identification, access, transmission, auditing, …)
• External (law, norms, economic, …)

State of the Practice

Research Design Decisions Relevant to Confidential Information

• Collection
  - Human subjects plan and data management plan
  - Consent/licensing terms
  - Methods
  - Measures
• Storage
  - Systems information security
  - Data structures and partitioning
• Dissemination
  - Vetting
  - Disclosure limitation
  - Data use agreements

IRBs and Confidential Information

• IRBs review consent procedures and documentation
• IRBs may review data management plans
  - May require procedures to minimize risk of disclosure
  - May require procedures to minimize harm resulting from disclosure
• IRBs determine the sensitivity of information, that is, the potential harm resulting from disclosure
• IRBs determine whether data is de-identified for “public use” [see NHRPAC, “Recommendations on Public Use Data Files”]

Reducing Risk in Data Collection

• Avoid collecting sensitive information, unless it is required by research design, method, or hypothesis
  - Unnecessary sensitive information means the study is not minimal risk
  - Reducing sensitivity yields higher participation and greater honesty
• Collect sensitive information in private settings
  - Reduces risk of disclosure
  - Increases participation
• Reduce sensitivity through indirect measures
  - Less sensitive proxies, e.g., implicit association test [Greenwald, et al. 1998]
  - Unfolding brackets
  - Group response collection
  - Randomized response technique [Warner 1965]
  - Item count/unmatched count/list experiment technique
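To show how an indirect measure can still support estimation, here is a minimal simulation of a Warner-style randomized response design. The function names, the 30% true share, and the design probability `p = 0.75` are all assumptions chosen for illustration.

```r
# Warner-style randomized response, simulated.
# With probability p the respondent answers the sensitive statement
# ("I have the trait"); otherwise they answer its complement.
simulate_responses <- function(truth, p) {
  asked_direct <- runif(length(truth)) < p
  ifelse(asked_direct, truth, !truth)
}

# Unbiased estimate of the population share with the trait.
# P(yes) = p * share + (1 - p) * (1 - share), solved for share.
estimate_share <- function(responses, p) {
  stopifnot(p != 0.5)
  (mean(responses) + p - 1) / (2 * p - 1)
}

set.seed(42)
truth <- runif(20000) < 0.30          # hypothetical true share: 30%
responses <- simulate_responses(truth, p = 0.75)
estimate <- estimate_share(responses, p = 0.75)
print(estimate)
```

No single answer reveals whether a respondent has the trait, because the interviewer never learns which statement was answered. The cost is a noisier estimate, and the noise grows as `p` approaches 0.5.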

Some Variations and Challenges in Collection

• Anonymizing data collection: through specialized trusted data intermediaries
• Cloud-based data collection: not-so-trusted intermediaries
• Field research: technical constraints, legal constraints, and change in risk and vulnerability surfaces

Common Not-Bad Practices for Storage

• Systematically identify confidential information being stored
• Use whole-disk/filesystem/media encryption to protect data at rest
• Use end-to-end encryption to protect data in motion
• Use core information hygiene to protect systems
• Scan for confidential information regularly
• Be thorough in disposal of information

Very sensitive data requires more protection.

Common Information Hygiene

Computer setup
• Use a virus checker, and keep it updated
• Use a host-based firewall
• Use strong credentials
• Use a locking screen-saver
• Lock default/open accounts
• Regularly scan for sensitive information
• Update your software regularly

Server setup
• Password guessing restrictions
• Idle session locking (or used on all clients)
• No password retrieval
• Keep access logs

Controlled destruction
• Rule based
• Physical
• Media level

Behavior
• Don’t share accounts or passwords
• Don’t use administrative accounts all the time
• Don’t run programs from untrusted sources
• Don’t give out your password to anyone
• Have a process for revoking user access when no longer needed/authorized (e.g., if a user leaves the university)
• Have a documented breach reporting procedure
• Users should have appropriate confidentiality training

Partitioning Information for Management

• Reduces risk in information management
• Partition information based on sensitivity
  - Identifying information
  - Descriptive information
  - Sensitive information
  - Other information
• Segregate
  - Storage of information
  - Access regimes
  - Data collection channels
  - Data transmission channels
• Plan to segregate as early as feasible in data collection and processing
• Link segregated information with artificial keys …

Partitioned table

Name      SSN    Birthdate  Zipcode  Gender  LINK
A. Jones  12341  01011961   02145    M       1401
B. Jones  12342  02021961   02138    M       283
C. Jones  12343  11111972   94043    M       8979
D. Jones  12344  12121972   94043    M       7023
E. Jones  12345  03251972   94041    F       1498
F. Jones  12346  03251972   02127    F       1036
G. Jones  12347  08081989   02138    F       3864
H. Smith  12348  01011973   63200    F       2124
I. Smith  12349  02021973   63300    M       4339
J. Smith  12350  02021973   63400    M       6629
K. Smith  12351  03031974   64500    M       9091
L. Smith  12352  04041974   64600    M       9918
M. Smith  12353  04041974   64700    F       4749
N. Smith  12354  04041974   64800    F       8197

Not Identified:

LINK  Favorite Ice Cream  Treat  # acts
1401  Raspberry           0      0
283   Pistachio           1      20
8979  Chocolate           0      0
7023  Hazelnut            1      12
1498  Lemon               0      0
1036  Lemon               1      7
3864  Peach               0      1
2124  Lime                1      17
4339  Mango               0      4
6629  Coconut             1      18
9091  Frog                0      32
9918  Vanilla             1      65
4749  Pumpkin             0      128
8197  Allergic            1      256
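The split shown in the partitioned table could be produced along these lines. This is only a sketch: the three records are toy values, the column names are hypothetical, and `sample()` stands in for what would in practice be a cryptographically secure key generator, with the two tables stored under separate access regimes.

```r
# Toy records in the spirit of the partitioned table (values are made up).
records <- data.frame(
  name      = c("A. Jones", "B. Jones", "C. Jones"),
  ssn       = c("12341", "12342", "12343"),
  ice_cream = c("Raspberry", "Pistachio", "Chocolate"),
  acts      = c(0, 20, 0),
  stringsAsFactors = FALSE
)

set.seed(7)
# Artificial key: random, unique, and carrying no information about the person.
records$link <- sample(1000:9999, nrow(records))

# Partition by sensitivity; the two tables share only the link key.
identifying <- records[, c("name", "ssn", "link")]
sensitive   <- records[, c("link", "ice_cream", "acts")]

# Analysts work with `sensitive` alone; rejoining requires access to both tables.
rejoined <- merge(identifying, sensitive, by = "link")
print(sensitive)
```

Because the key is random rather than derived from the name or SSN, holding the sensitive table alone gives no route back to the identifying table.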

Suppress Information for Data Release

Published Outputs

  * Jones * * 1961 021*
  * Jones * * 1961 021*
  * Jones * * 1972 9404*
  * Jones * * 1972 9404*
  * Jones * * 1972 9404*

• Modal practice: “The correlation between X and Y was large and statistically significant”
• Summary statistics
• Contingency table
• Public use sample microdata
• Information visualization

How many things are wrong with this picture?

Name            SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
A. Jones        12341  01011961   02145    M       Raspberry           0
B. Jones        12342  02021961   02138    M       Pistachio           0
C. Jones        12343  11111972   94043    M       Chocolate           0
D. Jones        12344  12121972   94043    M       Hazelnut            0
E. Jones        12345  03251972   94041    F       Lemon               0
F. Jones        12346  03251972   02127    F       Lemon               1
G. Jones        12347  08081989   02138    F       Peach               1
H. Smith        12348  01011973   63200    F       Lime                2
I. Smith        12349  02021973   63300    M       Mango               4
J. Smith        12350  02021973   63400    M       Coconut             16
K. Smith        12351  03031974   64500    M       Frog                32
L. Smith        12352  04041974   64600    M       Vanilla             64
M. Smith        12353  04041974   64700    F       Pumpkin             128
N. Smith-Jones  12354  04041974   64800    F       Allergic            256

What’s wrong with this picture?

Name      SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
A. Jones  12341  01011961   02145    M       Raspberry           0
B. Jones  12342  02021961   02138    M       Pistachio           0
C. Jones  12343  11111972   94043    M       Chocolate           0
D. Jones  12344  12121972   94043    M       Hazelnut            0
E. Jones  12345  03251972   94041    F       Lemon               0
F. Jones  12346  03251972   02127    F       Lemon               1
G. Jones  12347  08081989   02138    F       Peach               1
H. Smith  12348  01011973   63200    F       Lime                2
I. Smith  12349  02021973   63300    M       Mango               4
J. Smith  12350  02021973   63400    M       Coconut             16
K. Smith  12351  03031974   64500    M       Frog                32
L. Smith  12352  04041974   64600    M       Vanilla             64
M. Smith  12353  04041974   64700    F       Pumpkin             128
N. Smith  12354  04041974   64800    F       Allergic            256

Callouts on the table: Identifier; Sensitive; Private identifier (twice); Identifier; Sensitive; Unexpected response?; Mass. resident; FERPA too?; Californian; Twins, separated at birth?

Help, help, I’m being suppressed…

Name       SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
[Name 1]   12341  *1961      021*     M       Raspberry           .1
[Name 2]   12342  *1961      021*     M       Pistachio           -.1
[Name 3]   12343  *1972      940*     M       Chocolate           0
[Name 4]   12344  *1972      940*     M       Hazelnut            0
[Name 5]   12345  *1972      940*     F       Lemon               .6
[Name 6]   12346  *1972      021*     F       Lemon               .6
[Name 7]   12347  *1989      021*     *       Peach               64.6
[Name 8]   12348  *1973      632*     F       Lime                3
[Name 9]   12349  *1973      633*     M       Mango               3
[Name 10]  12350  *1973      634*     M       Coconut             37.2
[Name 11]  12351  *1974      645*     M       *                   37.2
[Name 12]  12352  *1974      646*     M       Vanilla             37.2
[Name 13]  12353  *1974      647*     F       *                   64.4
[Name 14]  12354  *1974      648*     F       Allergic            256

Callouts on the table: Row; Var; Synthetic; Global recode; Local suppression; Aggregation + perturbation

Traditional Static Suppression
• Data reduction
  - Observation
  - Measure
  - Cell
• Perturbation
  - Microaggregation
  - Rule-based data swapping
  - Adding noise
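Three of the techniques named above (global recode, local suppression, adding noise) can be sketched in base R. This is an illustration on a toy data frame, not the procedure used to build the slide's table: the column names, the uniqueness rule used for suppression, and the noise level are all assumptions.

```r
# Toy microdata modeled on the example table (values are illustrative).
df <- data.frame(
  birthdate = c("01011961", "02021961", "11111972", "12121972", "08081989"),
  zipcode   = c("02145", "02138", "94043", "94043", "02138"),
  gender    = c("M", "M", "M", "M", "F"),
  crimes    = c(0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Global recode: coarsen every value of a variable the same way.
df$birth_year <- substr(df$birthdate, 5, 8)          # MMDDYYYY -> YYYY
df$zip3       <- paste0(substr(df$zipcode, 1, 3), "*")
df$birthdate  <- NULL
df$zipcode    <- NULL

# Local suppression: blank out a value only in records that remain unique
# on the recoded quasi-identifiers.
key  <- paste(df$birth_year, df$zip3, df$gender)
size <- ave(seq_along(key), key, FUN = length)        # records sharing each key
df$gender[size == 1] <- NA

# Perturbation: add random noise to a numeric measure.
set.seed(3)
df$crimes_noisy <- df$crimes + round(rnorm(nrow(df), sd = 0.5), 1)
print(df)
```

In this toy data only the 1989 record is still unique after recoding, so only its gender is suppressed.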

R can do it (R can do everything, but not really :-)

# setup
> library(sdcMicro)

# load data
> classexample.df <- read.csv("examplesdc.csv", as.is=TRUE, stringsAsFactors=FALSE, colClasses=c("character","character","character","character","factor","factor","numeric"))

# create a weight variable if needed
> classexample.df$weight <- 1

# simple frequency table shows that data is uniquely identified
> ftable(Birthdate~Zipcode, data=classexample.df)

        Birthdate 01/01/1973 02/02/1973 03/25/1972 04/04/1974 08/08/1989 10/01/1961 11/11/1972 12/12/1972 20/02/1961 30/03/1974
Zipcode
02127                      0          0          1          0          0          0          0          0          0          0
02138                      0          0          0          0          1          0          0          0          1          0
02145                      0          0          0          0          0          1          0          0          0          0
63200                      1          0          0          0          0          0          0          0          0          0
63300                      0          1          0          0          0          0          0          0          0          0
63400                      0          1          0          0          0          0          0          0          0          0
64500                      0          0          0          0          0          0          0          0          0          1
64600                      0          0          0          1          0          0          0          0          0          0
64700                      0          0          0          1          0          0          0          0          0          0
64800                      0          0          0          1          0          0          0          0          0          0
94041                      0          0          1          0          0          0          0          0          0          0
94043                      0          0          0          0          0          0          1          1          0          0
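The frequency table shows every cell is 0 or 1. The same check can be expressed as the smallest group size over the quasi-identifiers (the k in k-anonymity). The sketch below uses base R only and six rows copied from the table; `class_size` is a hypothetical helper, not an sdcMicro function.

```r
# Quasi-identifiers for six of the example records (as they appear in the table).
qi <- data.frame(
  Birthdate = c("10/01/1961", "20/02/1961", "11/11/1972",
                "12/12/1972", "03/25/1972", "03/25/1972"),
  Zipcode   = c("02145", "02138", "94043", "94043", "94041", "02127"),
  stringsAsFactors = FALSE
)

# Number of records that share each record's combination of values.
class_size <- function(d) {
  key <- do.call(paste, c(d, sep = "|"))
  as.integer(table(key)[key])
}

k_raw <- min(class_size(qi))            # 1 means at least one record is unique

# After a global recode to birth year and 3-digit ZIP, groups grow.
recoded <- data.frame(
  year = substr(qi$Birthdate, 7, 10),
  zip3 = substr(qi$Zipcode, 1, 3),
  stringsAsFactors = FALSE
)
sizes <- class_size(recoded)
print(k_raw)
print(sizes)
```

Even after recoding, the last record is alone in its group, so a release at this granularity would still need suppression or further coarsening.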

Suppression reduces utility

• The common approach of anonymizing/suppressing data reduces usefulness
• Minimizing disclosure in the presence of large external data sources reduces usefulness a lot
• Anonymized data is not simply less informative: it typically yields biased analyses
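One well-known way such bias arises is attenuation: when noise is added to a predictor to protect it, regression slopes shrink toward zero. The simulation below is a minimal sketch with assumed parameters (true slope 2, noise variance equal to the predictor's variance), not an analysis of any real release.

```r
set.seed(123)
n <- 10000
x <- rnorm(n)                      # true predictor
y <- 2 * x + rnorm(n)              # true slope is 2

# "Protect" x by adding independent noise with the same variance as x.
x_noisy <- x + rnorm(n, sd = 1)

slope_true  <- unname(coef(lm(y ~ x))["x"])
slope_noisy <- unname(coef(lm(y ~ x_noisy))["x_noisy"])

# Classical measurement error shrinks the slope by var(x) / (var(x) + var(noise)),
# here about 1/2, so the noisy estimate is near 1 rather than 2.
print(c(true = slope_true, noisy = slope_noisy))
```

The estimate from the protected data is not just less precise but systematically wrong, and an analyst can correct it only by knowing how the noise was added.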

Advancing the state of the art and practice…

Cradle-to-grave analysis: Consider Risks and Controls in the Information Lifecycle

• What is collected?
• Scope of information collected
• Intended uses
• Potential benefits from data availability
• Re-identification (learning) risks
• Information sensitivity (harm out of context)
• Controls on retention
• Possible information transformations (aggregation, redaction)
• Post-disclosure control and evaluation: use limits, review, reporting, and information accountability

Some Overarching Principles for Consideration

Fair Information Practice:
• Notice/awareness
• Choice/consent
• Access/participation (verification, accuracy, correction)
• Integrity/security
• Enforcement/redress (self-regulation, private remedies, government enforcement)

Privacy by design:
• Proactive not reactive; preventative not remedial
• Privacy as the default setting
• Privacy embedded into design
• Full functionality: positive-sum, not zero-sum
• End-to-end security: full lifecycle protection
• Visibility and transparency: keep it open
• Respect for user privacy: keep it user-centric

OECD Principles:
• Collection limitation
• Data quality
• Purpose specification
• Use limitation
• Security safeguards
• Openness
• Individual participation
• Accountability

Some Interventions

(but wait, there’s more…)

Challenges of big data…

• Anonymization can completely destroy utility
  - The “Netflix problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008]
• Observable behavior leaves unique “fingerprints”
  - The “GIS” problem: fine geo-spatial-temporal data is impossible to mask when correlated with external data [Zimmerman 2008; ]
• Big data can be rich, messy & surprising
  - The “Facebook problem”: it is possible to identify masked network data if only a few nodes are controlled [Backstrom, et al. 2007]
  - The “Blog problem”: pseudonymous communication can be linked through textual analysis [Novak, et al. 2004]
• Little data in a big world
  - The “Favorite Ice Cream” problem: public information that is not risky can help us learn information that is risky
  - The “Doesn’t Stay in Vegas” problem: information shared locally can be found anywhere
  - The “Unintended Algorithmic Discrimination” problem: algorithms are often not transparent, and can amplify human biases

Source: [Calabrese 2008; Real Time Rome Project 2007]

Observations

• Confidentiality requires limiting what an adversary can learn about an individual as a result of their being measured
• Common overarching principles do not provide sufficient guidance to select effective controls and approaches
• Generic/naïve use of extant data sharing or redaction controls and technologies is unlikely to provide adequate protection in a big data world
• Evaluate the privacy and security risks, controls, and accountability mechanisms over the entire information lifecycle, including collection, consent, use, dissemination, and post-disclosure

Questions?

Web: informatics.mit.edu