Making Digital Privacy Operational
typically relies on data anonymity

Latanya Sweeney, Ph.D.
Assistant Professor of Computer Science & of Public Policy

Director, Laboratory for International Data Privacy

Carnegie Mellon University
[email protected]

http://sos.heinz.cmu.edu/dataprivacy/

Two Questions:

1. What kinds of problems do data anonymity tools solve?

2. How is data anonymity different from security and privacy?

Two Questions:

1. What kinds of problems do data anonymity tools solve?

2. How is data anonymity different from security and privacy?

Bottom line: data anonymity addresses the identifiability of shared information.

“Can’t release data”

Accuracy, quality Distortion, anonymity

Holder

Recipient

Confidentiality, privacy, liability concerns

“Privacy is dead, get over it”

Ann 10/2/61 02139 cardiac
Abe 7/14/61 02139 cancer
Al 3/8/61 02138 liver

Accuracy, quality Distortion, anonymity

Recipient

Holder

Common Public Health reaction

“Share data while guaranteeing anonymity”

Accuracy, quality Distortion, anonymity

Holder

A* 1961 0213* cardiac
A* 1961 0213* cancer
A* 1961 0213* liver

Recipient

Computational solutions
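The step from the holder's records to the released records can be sketched in a few lines of Python. This is only an illustration of the distortion shown on the slide: the `Record` type and `generalize` function are hypothetical names, and the rule that two-digit years belong to the 1900s is an assumption.

```python
from typing import NamedTuple


class Record(NamedTuple):
    name: str
    birth: str       # M/D/YY, as written on the slide
    zip_code: str
    diagnosis: str


def generalize(rec: Record) -> Record:
    """Coarsen the identifying fields and leave the diagnosis untouched."""
    _month, _day, year = rec.birth.split("/")
    return Record(
        name=rec.name[0] + "*",            # keep only the first initial
        birth="19" + year,                 # assumption: two-digit years are 19xx
        zip_code=rec.zip_code[:4] + "*",   # mask the last ZIP digit
        diagnosis=rec.diagnosis,
    )


held = [
    Record("Ann", "10/2/61", "02139", "cardiac"),
    Record("Abe", "7/14/61", "02139", "cancer"),
    Record("Al", "3/8/61", "02138", "liver"),
]
released = [generalize(r) for r in held]
for r in released:
    print(*r)
```

After generalization the three records agree on every identifying field, so the diagnosis can no longer be tied to one of the three people by those fields alone.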

This talk

- New areas in CS
- Fact: lots of data out there
- Fact: few fields uniquely identify a person
- Examples of compromises
- Nature of computational solutions
- Anonymity versus Security and Privacy
- Real-world examples: HIPAA and bioterrorism surveillance

Data Anonymity (new area)

The study of computational solutions for releasing data such that the data remain practically useful while the identities of the subjects of the data are not revealed.

“Useful AND Secure”

Learning information about entities...

Data Linkage (“data detectives”):

combining disparate pieces of entity-specific information to learn more about an entity

Privacy Protection (“data protectors”):

release information such that certain entity-specific properties (such as identity) cannot be inferred; restrict what can be learned

Data Anonymity Lab at CMU

Work with real-world stakeholders:

- public health
- government agencies
- private industry

Kinds of projects currently underway:

- health data
- web data
- video surveillance data
- genetic data
- census surveys
- crime data
- grocery data, and so on…

http://sos.heinz.cmu.edu/dataprivacy/

This talk

- New areas in CS
- Fact: lots of data out there
- Fact: few fields uniquely identify a person
- Examples of compromises
- Nature of computational solutions
- Anonymity versus Security and Privacy
- Real-world examples: HIPAA and bioterrorism surveillance

Technically-empowered Society

[Figure: two charts over 1983-2003. Growth in available disk storage: GDSP (MB/person), axis 0-500. Growth in active web servers: servers (in millions), axis 0-35. Marked years: 1991, 1993 (first WWW conference), 1996, 2001.]

Behavior 1. Collect more

Global DSP over Time

[Figure: global DSP by year, 1983-2003; vertical axis 0-500.]

Examples (in DSP)      1983    1996
Each birth              280   1,864
Each hospital visit       0     663
Each grocery visit       32   1,272

Based on State of Illinois [Sweeney 99]. DSP in bytes/person

Expand an existing person-specific data collection.

Typical Birth Certificate Fields, post 1925

Field name
Child's first name
Child's middle name (sometimes, or initial)
Child's last name
Day, month and year of birth
City and/or County of birth (sometimes hospital)
Father's name
Mother's name (including maiden name)
Place of birth (address and town/city)
Mother's age and address
Mother's birthplace (town/city, state, county)
Mother's occupation
Mother, number of previous children
Father's age and address
Father's birthplace (town/city, state, county)
Father's occupation

Typical Electronic Birth Certificate Fields in 1999 - starting fields 1-15

Field#  Size  Field name
1       1     File Status
2       50    Baby’s First Name
3       50    Baby’s Middle Name
4       50    Baby’s Last Name
5       1     Baby’s Suffix Code
6       3     Baby’s Suffix Text
7       8     Baby’s Date of Birth
8       5     Baby’s Time of Birth
9       1     AM/PM Indicator
10      1     Baby’s Sex
11      3     Blood Type
12      1     Born Here?
13      40    Place of Birth
14      1     Facility Type

Typical Electronic Birth Certificate Fields in 1999 - starting fields 16-30

Field#  Size  Field name
16      20    County of Birth
17      6     Certifier’s Code
18      30    Certifier’s Name
19      1     Certifier’s Title
20      30    Attendant’s Name
21      1     Attendant’s Title
22      23    Attendant’s Address
23      19    Attendant’s City
24      2     Attendant’s State
25      10    Attendant’s Zip Code
26      50    Mother’s First Name
27      50    Mother’s Middle Name
28      50    Mother’s Last Name
29      9     Mother’s Social Security Number
30      8     Mother’s Date of Birth

Typical Electronic Birth Certificate Fields in 1999 - starting fields 31-45

Field#  Size  Field name
31      3     Mother’s State of Birth
32      7     Mother’s Residence Address
33      2     Mother’s Residence Direction
34      20    Residence Street Address
35      10    Residence Type
36      2     Residence Extension
37      10    Residence Apartment #
38      20    Mother’s Town of Residence
39      1     Mother’s Residence in City Limits
40      14    Mother’s County of Residence
41      3     Mother’s State of Residence
42      10    Mother’s Residence Zip Code
43      38    Mother’s Mailing Address
44      19    Mother’s Mailing City
45      2     Mother’s Mailing State

Typical Electronic Birth Certificate Fields in 1999 - starting fields 46-60

Field#  Size  Field name
46      10    Mother’s Mailing Zip Code
47      1     Mother Married?
48      50    Father’s First Name
49      50    Father’s Middle Name
50      50    Father’s Last Name
51      1     Father’s Suffix Code
52      9     Father’s Suffix Text
53      9     Father’s Social Security Number
54      8     Father’s Date of Birth
55      3     Father’s State of Birth
56      14    Mother’s Origin
57      14    Mother’s Race
58      2     Mother’s Elementary Education
59      2     Mother’s College Education
60      11    Mother’s Occupation

Typical Electronic Birth Certificate Fields in 1999 - continued fields 61-75

Field#  Size  Field name
61      11    Mother’s Industry
62      14    Father’s Origin
63      14    Father’s Race
64      2     Father’s Elementary Education
65      2     Father’s College Education
66      11    Father’s Occupation
67      11    Father’s Industry
68      1     Plurality
69      1     Birth Order
70      2     Live Births Still Living
71      2     Live Births Now Dead
72      4     Month/Year Last Live Birth
73      2     Number of Terminations
74      4     Month/Year Last Termination
75      1     Baby’s Weight Unit

Typical Electronic Birth Certificate Fields in 1999 - continued fields 76-90

Field#  Size  Field name
76      5     Baby’s Weight
77      6     Date of Last Normal Menses
78      1     Month Prenatal Care Began
79      2     Total Number of Visits
80      2     Apgar Score - 1 Minute
81      2     Apgar Score - 5 Minute
82      2     Estimate of Gestation
83      6     Date of Blood Test
84      22    Laboratory
85      1     Mother Transferred In
86      30    Facility Mother Transferred From
87      1     Baby Transferred Out
88      30    Facility Baby Transferred To
89      1     Tobacco Use During Pregnancy
90      3     Number of Cigarettes/Day

Typical Electronic Birth Certificate Fields in 1999 - continued fields 91-105

Field#  Size  Field name
91      1     Alcohol Use During Pregnancy
92      3     Number of Drinks/Week
93      3     Mother’s Weight Gain
94      1     Release Info For SSN
95      6     Operator Code
96      12    Hospital ID
97      1     Sent to Romans
98      1     Sent to APORS
99      16    Other Certifier Specify
100     12    Temporary Audit Number
101     16    Other Facility Specify
102     16    Other Attendant Specify
103     1     Mother’s Race
104     1     Father’s Race
105     2     Mother’s Origin

Typical Electronic Birth Certificate Fields in 1999 - continued fields 106-120

Field#  Size  Field name
106     2     Father’s Origin
107     1     Attendant Same YN
108     1     Mailing Address Same YN
109     1     Capture Father’s Info YN
110     2     Mother’s Age
111     2     Father’s Age
112     12    Baby’s Hospital Med. Rec.
113     1     High Risk Pregnancy YN
114     1     Care Giver (For Chicago)
115     1     Record Selected For Download
116     1     Downloaded
117     1     Printed
118     12    Form Number

MEDICAL RISK FACTORS
119     1     Anemia
120     1     Cardiac Disease

Typical Electronic Birth Certificate Fields in 1999 - continued fields 121-135

Field#  Size  Field name
121     1     Acute/Chronic Lung Disease
122     1     Diabetes
123     1     Genital Herpes
124     1     Hydramnios/Oligohydramnios
125     1     Hemoglobinopathy
126     1     Hypertension, Chronic
127     1     Hypertension, Preg. Assoc.
128     1     Eclampsia
129     1     Incompetent Cervix
130     1     Previous Infant 4000+ Grams
131     1     Previous Preterm or SGA Infant
132     1     Renal Disease
133     1     Rh Sensitization
134     1     Uterine Bleeding
135     1     No Medical Risk Factors

Typical Electronic Birth Certificate Fields in 1999 - continued fields 136-150

Field#  Size  Field name
136     40    Other Medical Risk Factors

OBSTETRIC PROCEDURES
137     1     Amniocentesis
138     1     Electronic Fetal Monitoring
139     1     Induction of Labor
140     1     Stimulation of Labor
141     1     Tocolysis
142     1     Ultrasound
143     1     No Obstetric Procedures
144     40    Other Obstetric Procedures

COMPLICATIONS OF LABOR & D
145     1     Febrile (>100 or 38C)
146     1     Meconium Moderate, Heavy
147     1     Premature Rupture (>12 Hrs)
148     1     Abruptio Placenta
149     1     Placenta Previa
150     1     Other Excessive Bleeding

Typical Electronic Birth Certificate Fields in 1999 - continued fields 151-165

Field#  Size  Field name
151     1     Seizures During Labor
152     1     Precipitous Labor (<3 Hrs)
153     1     Prolonged Labor (>20 Hrs)
154     1     Dysfunctional Labor
155     1     Breech/Malpresentation
156     1     Cephalopelvic Disproportion
157     1     Cord Prolapse
158     1     Anesthetic Complications
159     1     Fetal Distress
160     1     No Complications of L&D
161     40    Other Complications of L&D

METHOD OF DELIVERY
162     1     Vaginal
163     1     Vaginal After Previous C-Section
164     1     Primary C-Section
165     1     Repeat C-Section

Typical Electronic Birth Certificate Fields in 1999 - continued fields 166-180

Field#  Size  Field name
166     1     Forceps
167     1     Vacuum

ABNORMAL CONDITIONS OF NEWBO
168     1     Anemia
169     1     Birth Injury
170     1     Fetal Alcohol Syndrome
171     1     Hyaline Membrane Disease/RDS
172     1     Meconium Aspiration Syndrome
173     1     Assisted Ventilation <30
174     1     Assisted Ventilation >30
175     1     Seizures
176     1     No Abnormal Conditions of Newborn
177     40    Other Abnormal Condition of Newborn

CONGENITAL ANOMALIES OF CHILD
178     1     Anencephalus
179     1     Spina Bifida/Meningocele
180     1     Hydrocephalus

Typical Electronic Birth Certificate Fields in 1999 - continued fields 181-195

Field#  Size  Field name
181     1     Microcephalus
182     40    Other CNS Anomalies
183     1     Heart Malformations
184     40    Other Circ./Resp. Anomalies
185     1     Rectal Atresia/Stenosis
186     1     Tracheo-Esophageal Fistula/Esophag
187     1     Omphalocele/Gastroschisis
188     40    Other Gastrointestinal Ano.
189     1     Malformed Genitalia
190     1     Renal Agenesis
191     40    Other Urogenital Anomalies
192     1     Cleft Lip/Palate
193     1     Polydactyly/Syndactyly/Adactyly
194     1     Club Foot
195     1     Diaphragmatic Hernia

Typical Electronic Birth Certificate Fields in 1999 - continued fields 196-210

Field#  Size  Field name
196     40    Other Musculoskeletal/Integumental A
197     1     Down’s Syndrome
198     40    Other Chromosomal Anomalies
199     1     No Congenital Anomalies
200     40    Other Congenital Anomalies

CODE STRIP
201     1     Record Complete YN
202     1     Record Type
203     4     Facility ID
204     4     City of Birth
205     3     County of Birth
206     2     Mother’s State of Birth
207     2     Mother’s State of Residence
208     4     Mother’s Town of Residence
209     3     Mother’s County of Residence
210     2     Father’s State of Birth

Typical Electronic Birth Certificate Fields in 1999 - continued fields 211-226

Field#  Size  Field name
211     14    Certifier’s License Number
212     6     Laboratory ID Number
213     4     Mother Xfer Code
214     3     Mother Xfer County Code
215     4     Baby Xfer Code
216     3     Baby Xfer County Code
217     4     Year of Birth
218     7     Certificate #
219     1     Unique Code
220     8     File Date
221     2     Community Area
222     4     Census Tract
223     2     Century of Last Live Birth
224     2     Century of Last Termination
225     2     Century of Last Menses
226     2     Century of Blood Test

Behavior 2. Collect specifically

Global DSP over Time

[Figure: global DSP by year, 1983-2003; vertical axis 0-500.]

Examples (in DSP)      1983    1996
Each birth              280   1,864
Each hospital visit       0     663
Each grocery visit       32   1,272

Based on State of Illinois [Sweeney 99]. DSP in bytes/person

Replace an existing aggregate data collection with a person-specific one.

Hospital Discharge Data, fields 1-12

#   Field description                    Size
1   HOSPITAL ID NUMBER                   12
2   PATIENT DATE OF BIRTH (MMDDYYYY)     8
3   SEX                                  1
4   ADMIT DATE (MMDDYYYY)                8
5   DISCHARGE DATE (MMDDYYYY)            8
6   ADMIT SOURCE                         1
7   ADMIT TYPE                           1
8   LENGTH OF STAY (DAYS)                4
9   PATIENT STATUS                       2
10  PRINCIPAL DIAGNOSIS CODE             6
11  SECONDARY DIAGNOSIS CODE - 1         6
12  SECONDARY DIAGNOSIS CODE - 2         6

Hospital Discharge Data, fields 13-25

#   Field description                    Size
13  SECONDARY DIAGNOSIS CODE - 3         6
14  SECONDARY DIAGNOSIS CODE - 4         6
15  SECONDARY DIAGNOSIS CODE - 5         6
16  SECONDARY DIAGNOSIS CODE - 6         6
17  SECONDARY DIAGNOSIS CODE - 7         6
18  SECONDARY DIAGNOSIS CODE - 8         6
19  PRINCIPAL PROCEDURE CODE             7
20  SECONDARY PROCEDURE CODE - 1         7
21  SECONDARY PROCEDURE CODE - 2         7
22  SECONDARY PROCEDURE CODE - 3         7
23  SECONDARY PROCEDURE CODE - 4         7
24  SECONDARY PROCEDURE CODE - 5         7
25  DRG CODE                             3

Hospital Discharge Data, fields 26-37

#   Field description                    Size
26  MDC CODE                             2
27  TOTAL CHARGES                        9
28  ROOM AND BOARD CHARGES               9
29  ANCILLARY CHARGES                    9
30  ANESTHESIOLOGY CHARGES               9
31  PHARMACY CHARGES                     9
32  RADIOLOGY CHARGES                    9
33  CLINICAL LAB CHARGES                 9
34  LABOR-DELIVERY CHARGES               9
35  OPERATING ROOM CHARGES               9
36  ONCOLOGY CHARGES                     9
37  OTHER CHARGES                        9

Hospital Discharge Data, fields 38-50

#   Field description                    Size
38  NEWBORN INDICATOR                    1
39  PAYER ID 1                           9
40  TYPE CODE 1                          1
41  PAYER ID 2                           9
42  TYPE CODE 2                          1
43  PAYER ID 3                           9
44  TYPE CODE 3                          1
45  PATIENT ZIP CODE                     5
46  Patient Origin COUNTY                3
47  Patient Origin PLANNING AREA         3
48  Patient Origin HSA                   2
49  PATIENT CONTROL NUMBER
50  HOSPITAL HSA                         2

Hospital Discharge by State, Part 1

Columns: Mandate, Private (Insiders), Semi-Private (Limited), Semi-Public (Deniable), Public (No Restrictions), AHRQ SID

Alabama N N
Alaska N N
Arizona Y Y N Y Y Y
Arkansas Y Y N N N
California Y Y N Y Y Y
Colorado N Y N Y N Y
Connecticut Y Y N Y Y Y
Delaware Y Y N N* N*
District of Columbia N N
Florida Y N Y Y
Georgia Y N N N Y
Hawaii N Y N Y Y Y
Idaho N N
Illinois Y Y Y Y Y Y
Indiana Y Y N N N
Iowa Y Y N Y Y Y
Kansas Y Y N Y N Y
Kentucky Y Y N Y N
Louisiana N Y N
Maine Y Y N Y Y
Maryland Y Y N Y Y Y
Massachusetts Y Y N Y Y Y
Michigan N Y N Y N
Minnesota N Y N Y N
Missouri Y N Y Y Y
Mississippi N N

Hospital Discharge by State, Part 2

Columns: Mandate, Private (Insiders), Semi-Private (Limited), Semi-Public (Deniable), Public (No Restrictions), AHRQ SID

Montana N N
Nebraska N Y N Y Y
Nevada Y Y N N Y
New Hampshire Y Y N Y Y
New Jersey N Y Y N Y Y
New Mexico Y Y N N Y
New York Y Y N Y Y Y
North Carolina Y Y N N
North Dakota Y N N Y
Ohio Y Y N N N
Oklahoma Y Y N Y N
Oregon Y N Y Y Y
Pennsylvania Y Y Y Y Y Y
Rhode Island Y Y N Y Y
South Carolina Y Y N Y Y Y
South Dakota N N
Tennessee Y Y N Y Y Y
Texas Y Y N N N
Utah Y Y N Y Y Y
Vermont Y Y N Y Y
Virginia Y Y N Y Y
Washington Y Y N Y Y Y
West Virginia Y Y N Y Y
Wisconsin Y Y N Y Y Y
Wyoming Y Y N Y N

Behavior 3. Collect it if you can

Global DSP over Time

[Figure: global DSP by year, 1983-2003; vertical axis 0-500.]

Examples (in DSP)      1983    1996
Each birth              280   1,864
Each hospital visit       0     663
Each grocery visit       32   1,272

Based on State of Illinois [Sweeney 99]. DSP in bytes/person

Given a question or problem to solve, or merely provided the opportunity, gather information by starting a new person-specific data collection.

Grocery data

Columns: Food Lion, Fresh Fields, Safeway, Star Market

Name yes yes yes yes
Home street address yes yes yes yes
Home city yes yes yes yes
Home state yes yes yes yes
Home ZIP yes yes yes yes
Home phone number yes yes yes yes
Social Security Number yes

Additional data sometimes requested
Birth date yes yes
ZIP code of work place yes
Other stores where you shop yes yes
Number of people in household yes yes
Age of each person in household yes yes
How much do you spend each week yes yes

Additional data for accepting checks
Bank yes yes
Bank account number yes yes

Kinds of data releases

Insiders only: Private (Pr)

No restrictions: Public (Pu)

Larger possible distribution

Fewer people eligible

Limited access: Semi-private (SPr)

Deniable access: Semi-public (SPu)

Two Questions:

1. What kinds of problems do data anonymity tools solve?

2. How is data anonymity different from security and privacy?

Data anonymity tools address the identifiability of shared information in a setting with lots of available data.

This talk

- New areas in CS
- Fact: lots of data out there
- Fact: few fields uniquely identify a person
- Examples of compromises
- Nature of computational solutions
- Anonymity versus Security and Privacy
- Real-world examples: HIPAA and bioterrorism surveillance

Anonymous data

… implies that the data cannot be manipulated or linked to identify an individual.

De-identified Data

… all explicit identifiers, such as name, address and phone number, are removed, generalized or replaced with a made-up alternative.

De-identifying information provides no guarantee that the result is anonymous.

JLME 97, NRC 98

Health data (GIC example)

Ethnicity

Visit date

Diagnosis

Procedure

Medication

Total charge

ZIP

Birthdate

Sex

Medical Data


Population data (GIC example)

ZIP

Birthdate

Sex

Name

Address

Date registered

Party affiliation

Date last voted

Voter List

Linking to re-identify data

Ethnicity

Visit date

Diagnosis

Procedure

Medication

Total charge

ZIP

Birthdate

Sex

Name

Address

Date registered

Party affiliation

Date last voted

Medical Data Voter List
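The linking step can be sketched as a join on the three fields the two lists share. The records below are made up for illustration (the names echo earlier slides, and the sex values are assumptions); `link` is a hypothetical helper, not a tool named in the talk.

```python
from collections import defaultdict

QUASI_IDENTIFIER = ("zip", "birthdate", "sex")


def link(medical, voters, keys=QUASI_IDENTIFIER):
    """Return (name, diagnosis) pairs where exactly one voter matches a medical record."""
    by_key = defaultdict(list)
    for voter in voters:
        by_key[tuple(voter[k] for k in keys)].append(voter)

    reidentified = []
    for visit in medical:
        candidates = by_key.get(tuple(visit[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one voter fits the shared fields
            reidentified.append((candidates[0]["name"], visit["diagnosis"]))
    return reidentified


# De-identified medical data: no names, but ZIP, birthdate and sex remain.
medical = [
    {"zip": "02139", "birthdate": "10/2/61", "sex": "f", "diagnosis": "cardiac"},
    {"zip": "02139", "birthdate": "7/14/61", "sex": "m", "diagnosis": "cancer"},
    {"zip": "02141", "birthdate": "5/5/64", "sex": "m", "diagnosis": "obesity"},
]
# Voter list: names together with the same three fields.
voters = [
    {"name": "Ann", "zip": "02139", "birthdate": "10/2/61", "sex": "f"},
    {"name": "Abe", "zip": "02139", "birthdate": "7/14/61", "sex": "m"},
    {"name": "Don", "zip": "02141", "birthdate": "5/5/64", "sex": "m"},
    {"name": "Dave", "zip": "02141", "birthdate": "5/5/64", "sex": "m"},
]
print(link(medical, voters))
```

The third medical record matches two voters, so it is not re-identified; the first two each match a single voter and are.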

Uniqueness in Cambridge Voters

Birth date alone                 12%
Birth date & gender              29%
Birth date & 5-digit ZIP         69%
Birth date & full postal code    97%

Birth date includes month, day and year. Total 54,805 voters.

JLME 97

Few characteristics make a person unique

Birth includes month, day and year:

365 days x 100 years = 36,500 possibilities

Two genders and five 5-digit ZIP codes:

2 x 5 x 36,500 = 365,000 possibilities

But the Cambridge Voter list had:

54,805 voters

The possible combinations outnumber the voters by more than six to one, so in general, using (birth[mon,day,yr], gender, ZIP[5-digit]) provides a unique quasi-identifier.

JLME 97
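The counting argument above, and the kind of measurement behind the Cambridge percentages, can be sketched as follows. The constants come from the slide; `fraction_unique` and the four toy records are illustrative assumptions, not the study's code or data.

```python
from collections import Counter

DAYS_PER_YEAR = 365
YEARS = 100
GENDERS = 2
ZIP_CODES = 5
VOTERS = 54_805

birth_dates = DAYS_PER_YEAR * YEARS               # 36,500 possible birth dates
combinations = GENDERS * ZIP_CODES * birth_dates  # 365,000 possible combinations
ratio = combinations / VOTERS                     # combinations available per voter


def fraction_unique(records, fields):
    """Fraction of records whose values on `fields` occur exactly once."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Toy records: the last two share all three fields, so only two of four are unique.
toy = [
    {"birth": "10/2/61", "gender": "f", "zip": "02139"},
    {"birth": "7/14/61", "gender": "m", "zip": "02139"},
    {"birth": "3/8/61", "gender": "m", "zip": "02138"},
    {"birth": "3/8/61", "gender": "m", "zip": "02138"},
]
print(combinations, round(ratio, 1), fraction_unique(toy, ("birth", "gender", "zip")))
```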

{date of birth, gender, 5-digit ZIP} uniquely identifies 87.1% of USA pop.

ZIP 60623: 112,167 people, 11% unique (not 0%), because an insufficient number above the age of 55 live there.

{date of birth, gender, 5-digit ZIP} uniquely identifies 87.1% of USA pop.

ZIP 11794: 5,418 people, primarily between 19 and 24 (4,666 of 5,418, or 86%); only 13% unique.

Uniqueness of Demographics in U.S.

(each combined with Gender)   Date of Birth   Mon/Yr Birth   Year of Birth
ZIP (5-digit)                 87.1%           3.7%           0.04%
Town/Place                    58.4%           3.6%           0.04%
County                        18.1%           0.04%          0.00004%

{Year of birth, gender, County} uniquely identifies 0.00004% of U.S. pop.

[Figure: % population identified (0%-60%) against county population (0 to 10,000,000). Labeled points:]

Loving County, Texas: population 107, 53% unique

Yellowstone County, Montana: population 52, 25% unique

King County, Texas: population 354, 6% unique

Two Questions:

1. What kinds of problems do data anonymity tools solve?

2. How is data anonymity different from security and privacy?

Having lots of person-specific data available makes it difficult to protect against inferences unique to the subjects.

This talk

- New areas in CS
- Fact: lots of data out there
- Fact: few fields uniquely identify a person
- Examples of compromises
- Nature of computational solutions
- Anonymity versus Security and Privacy
- Real-world examples: HIPAA and bioterrorism surveillance

Cancer registry looks anonymous

Diagnosis          DiagDate   ZIP
Kaposi’s Sarcoma   1/18/91    32555
Kaposi’s Sarcoma   5/12/94    37581
Kaposi’s Sarcoma   3/5/92     32172
Kaposi’s Sarcoma   8/8/93     30158
Neuroblastoma      4/3/91     39164

Cancer registry looks anonymous

Diagnosis          DiagDate   ZIP
Neuroblastoma      7/93       32125
Neuroblastoma      1/92       31752
Neuroblastoma      8/91       38265
Neuroblastoma      5/94       37233
…                  …          …

Two Questions:

1. What kinds of problems do data anonymity tools solve?

2. How is data anonymity different from security and privacy?

Having lots of person-specific data available makes it difficult to protect against inferences unique to the subjects.

This talk

- New areas in CS
- Fact: lots of data out there
- Fact: few fields uniquely identify a person
- Examples of compromises
- Nature of computational solutions
- Anonymity versus Security and Privacy
- Real-world examples: HIPAA and bioterrorism surveillance

Disclosure overview

External Information Released Information

Ann 10/2/61 02139 diagnosis

Ann

Abe

Al

Dan

Don

Dave

Jcd

Jwq

Jxy

Private Information

c

f

g1

Subjects

Population

Universe

g2

Ann 10/2/61 02139 marriage

10/2/61 02139 diagnosis


Disclosure overview

External Information Released Information

Ann 10/2/61 02139 diagnosis

Ann

Abe

Al

Dan

Don

Dave

Jcd

Jwq

Jxy

Private Information

c

f

g

Subjects

Population

Universe

Jcd diagnosis

Ann 10/2/61 02139 marriage

Disclosure overview

External Information Released Information

Ann 10/2/61 02139 diagnosis

Ann

Abe

Al

Dan

Don

Dave

Jcd

Jwq

Jxy

Private Information

c

f

Subjects

Population

Universe

Al 3/8/61 02138 marriage2

Ann 10/2/61 02139 marriage1

A* 1961 0213* diagnosis

Techniques are specific to use

Technique            A-Data Mining   B-Statistical
De-identification    depends         depends
Encryption           depends         depends
Suppression          depends         no
Generalize values    depends         no
Swap values          no              yes
Substitution         depends         depends
Outlier to medians   no              depends
Perturbation         no              yes
Rounding             no              yes
Additive noise       no              yes
Sampling             depends         depends
Add tuples           no              yes
Scramble tuples      yes             yes

k-anonymity, enforce on release

- Quasi-identifier, profile {Birth 0.5, ZIP 0.7, Sex 0.3}
- Generalization: 10/27/59 → 1959
- Suppression: 02139 → *****
- Encryption: 3245123 → 2168582

AMIA 97, IEEE IFIP 97
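What "enforce k-anonymity on release" means can be stated as a check: every combination of quasi-identifier values that appears in the release must appear in at least k records. A minimal sketch, with `is_k_anonymous` as a hypothetical helper name and rows borrowed from the earlier Ann/Abe/Al slide:

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifier, k):
    """True if each quasi-identifier value combination occurs in at least k rows."""
    groups = Counter(tuple(row[field] for field in quasi_identifier) for row in rows)
    return all(size >= k for size in groups.values())


QI = ("birth", "zip")

raw = [
    {"birth": "10/2/61", "zip": "02139", "diagnosis": "cardiac"},
    {"birth": "7/14/61", "zip": "02139", "diagnosis": "cancer"},
    {"birth": "3/8/61", "zip": "02138", "diagnosis": "liver"},
]
generalized = [
    {"birth": "1961", "zip": "0213*", "diagnosis": "cardiac"},
    {"birth": "1961", "zip": "0213*", "diagnosis": "cancer"},
    {"birth": "1961", "zip": "0213*", "diagnosis": "liver"},
]

print(is_k_anonymous(raw, QI, 2))          # every raw row is its own group
print(is_k_anonymous(generalized, QI, 3))  # all three rows share one group
```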

Sample Data

SSN        Ethnicity  Birth     Sex  ZIP    Problem
819181496  Black      09/20/65  m    02141  short of breath
195925972  Black      02/14/65  m    02141  chest pain
902750852  Black      10/23/65  f    02138  hypertension
985820581  Black      08/24/65  f    02138  hypertension
209559459  Black      11/07/64  f    02138  obesity
679392975  Black      12/01/64  f    02138  chest pain
819491049  White      10/23/64  m    02138  chest pain
749201844  White      03/15/65  f    02139  hypertension
985302952  White      08/13/64  m    02139  obesity
874593560  White      05/05/64  m    02139  short of breath
703872052  White      02/13/67  m    02138  chest pain
963963603  White      03/21/67  m    02138  chest pain

Datafly results

SSN        Ethnicity  Birth  Sex  ZIP    Problem
902387250  Black      1965   m    0214*  short of breath
197150725  Black      1965   m    0214*  chest pain
486062381  Black      1965   f    0213*  hypertension
235978021  Black      1965   f    0213*  hypertension
214684616  Black      1964   f    0213*  obesity
135243442  Black      1964   f    0213*  chest pain
487620561  White      1964   m    0213*  chest pain
259003630  White      1964   m    0213*  obesity
410968224  White      1964   m    0213*  short of breath
664545413  White      1967   m    0213*  chest pain
860424429  White      1967   m    0213*  chest pain

IEEE IFIP 97, NRC 98
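The relationship between the two tables above can be reproduced with a simplified generalize-then-suppress sketch. This is not the Datafly algorithm itself, which decides for itself what to generalize; here the generalization (birth date to year, ZIP to four digits) is fixed by hand to mirror the slide, k = 2 is an assumption, and the SSN column is left out.

```python
from collections import Counter

# (ethnicity, birth, sex, ZIP, problem) rows copied from the Sample Data slide.
SAMPLE = [
    ("Black", "09/20/65", "m", "02141", "short of breath"),
    ("Black", "02/14/65", "m", "02141", "chest pain"),
    ("Black", "10/23/65", "f", "02138", "hypertension"),
    ("Black", "08/24/65", "f", "02138", "hypertension"),
    ("Black", "11/07/64", "f", "02138", "obesity"),
    ("Black", "12/01/64", "f", "02138", "chest pain"),
    ("White", "10/23/64", "m", "02138", "chest pain"),
    ("White", "03/15/65", "f", "02139", "hypertension"),
    ("White", "08/13/64", "m", "02139", "obesity"),
    ("White", "05/05/64", "m", "02139", "short of breath"),
    ("White", "02/13/67", "m", "02138", "chest pain"),
    ("White", "03/21/67", "m", "02138", "chest pain"),
]


def generalize(row):
    """Reduce the birth date to a year and mask the last ZIP digit."""
    ethnicity, birth, sex, zip_code, problem = row
    year = "19" + birth[-2:]  # assumption: two-digit years are 19xx
    return (ethnicity, year, sex, zip_code[:4] + "*", problem)


def release(rows, k):
    """Generalize every row, then suppress rows whose quasi-identifier group is smaller than k."""
    generalized = [generalize(r) for r in rows]
    group_sizes = Counter(r[:4] for r in generalized)  # first four fields form the quasi-identifier
    return [r for r in generalized if group_sizes[r[:4]] >= k]


released = release(SAMPLE, k=2)
for row in released:
    print(*row)
```

With these choices the one record that stays alone in its group after generalization (White, 1965, f, 0213*) is suppressed, which matches the eleven rows on the Datafly results slide.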

µ-Argus Results

SSN        Ethnicity  Birth  Sex  ZIP    Problem
           Black      1965   m    02141  short of breath
           Black      1965   m    02141  chest pain
           Black      1965   f    02138  hypertension
           Black      1965   f    02138  hypertension
           Black      1964   f    02138  obesity
           Black      1964   f    02138  chest pain
           White      1964   m    02138  chest pain
                             f    02139  hypertension
           White      1964   m    02139  obesity
           White      1964   m    02139  short of breath
           White      1967   m    02138  chest pain
           White      1967   m    02138  chest pain

JLME 97, NRC 98

k-similar results

SSN        Ethnicity  Birth  Sex  ZIP    Problem
486753948  Black      1965   m    02141  short of breath
758743753  Black      1965   m    02141  chest pain
976483662             1965   f    0213*  hypertension
845796834             1965   f    0213*  hypertension
497306730  Black      1964   f    02138  obesity
730768597  Black      1964   f    02138  chest pain
348993639  Caucasian  1964   m    0213*  chest pain
459734637             1965   f    0213*  hypertension
385692728  Caucasian  1964   m    0213*  obesity
537387873  Caucasian  1964   m    0213*  short of breath
385346532  Caucasian  1967   m    02138  chest pain
349863628  Caucasian  1967   m    02138  chest pain

This talk

- New areas in CS
- Fact: lots of data out there
- Fact: few fields uniquely identify a person
- Examples of compromises
- Nature of computational solutions
- Anonymity versus Security and Privacy
- Real-world examples: HIPAA and bioterrorism surveillance

Two Questions:

1. What kinds of problems do data anonymity tools solve?

2. How is data anonymity different from security and privacy?

Anonymity tools allow data to be shared with guarantees of anonymity while the data remain practically useful.

Traditional areas of Computer Security

authorization (can you access what you request)
authentication (are you who you say you are)

Examples of Authorization in Security

authorization (can you access what you request)
authentication (are you who you say you are)

secure communication or eavesdropping (did anyone else get the info)

encryption
file access privileges (read, write, execute)

Examples of Authentication in Security

authorization (can you access what you request)
authentication (are you who you say you are)

secure communication or eavesdropping
passwords
encryption
digital signatures
authenticity of data (“information assurance”)

Getting Data Into a System

Authentication: login with password
Authorization: allowed to write data
Encryption: to avoid eavesdropping

Getting Data Into a System

Authentication: login with password
Authorization: allowed to write data
Encryption: to avoid eavesdropping

Getting Data From a System

Authentication: login with password
Authorization: allowed to read data
Encryption: to avoid eavesdropping

Computer Security & Data Sharing

Authentication: login with password
Authorization: allowed to read/write data
Encryption: to avoid eavesdropping

BUT data can re-identify individual!

Data Anonymity Concerns Content

Authentication: login with password
Authorization: allowed to read/write data
Encryption: to avoid eavesdropping

Data can NOT reliably re-identify individual!

Incorrect Computer Security View

Computer Security

authentication
authorization

Privacy

Nuisance
Viruses

Incorrect Computer Security View

Computer Security

authentication
authorization

Privacy

Nuisance
Viruses

Privacy = privacy and confidentiality

Incorrect Computer Security View

Computer Security

authentication
authorization

Privacy

Nuisance
Viruses

Privacy = privacy and confidentiality

Computer security < public safety

Incomplete Computer Security View including non-technology

Laws

Computer Security Privacy

authentication
authorization

Regulations

Policies

Nuisance
Viruses

Computer Security, Privacy and Anonymity

Laws

Computer Security Privacy

authentication
authorization

Regulations

anonymity tools

Policies

Nuisance
Viruses

Computer Security and Anonymity are computational tools

Laws

Computer Security Privacy

Regulation

anonymity tools

authentication
authorization

Policies

Nuisance
Viruses

Z3 = {*****}
Z2 = {021**}
Z1 = {0213*, 0214*}
Z0 = {02138, 02139, 02141, 02142}

DGH_Z0: Z0 → Z1 → Z2 → Z3
VGH_Z0: 02138, 02139 → 0213*; 02141, 02142 → 0214*; 0213*, 0214* → 021**; 021** → *****

Merging Computer Security and Anonymity Computational Tools

What version of the information will you get?

anonymity tools

authentication
authorization
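The ZIP hierarchy on this slide (levels Z0 through Z3) can be written as a single function that masks trailing digits. This is a sketch: `generalize_zip` is a hypothetical name, and the number of digits kept at each level is read off the slide's four domains.

```python
# Digits kept at each level of the slide's hierarchy: Z0 = 5, Z1 = 4, Z2 = 3, Z3 = 0.
DIGITS_KEPT = {0: 5, 1: 4, 2: 3, 3: 0}


def generalize_zip(zip_code: str, level: int) -> str:
    """Map a 5-digit ZIP code from Z0 to its generalized value at the given level."""
    if level not in DIGITS_KEPT:
        raise ValueError("level must be 0, 1, 2 or 3")
    kept = DIGITS_KEPT[level]
    return zip_code[:kept] + "*" * (5 - kept)


Z0 = ["02138", "02139", "02141", "02142"]
# The distinct values at each level reproduce the domains Z0..Z3.
domains = [sorted({generalize_zip(z, level) for z in Z0}) for level in range(4)]
for level, domain in enumerate(domains):
    print(f"Z{level} = {domain}")
```

One way to read "What version of the information will you get?" is that authorization decides which level a recipient is served; that mapping is a policy choice and is not modeled here.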

This talk

- New areas in CS
- Fact: lots of data out there
- Fact: few fields uniquely identify a person
- Examples of compromises
- Nature of computational solutions
- Anonymity versus Security and Privacy
- Real-world examples: HIPAA and bioterrorism surveillance

Medical Privacy before HIPAA

No* medical privacy legislation (proposed or drafted) addresses these problems.

- incorrect belief that de-identified implies anonymous
- incorrect belief that linkage and mining are controlled by encryption
- incorrect belief that security is the same as privacy
- inability to enumerate all sources, users and uses of medical data

new technology offers better choices than all or nothing and allows for a spectrum of solutions

Flow from the Hospital

Ann 10/2/61 02139 cardiac
Abe 7/14/61 02139 cancer
Al 3/8/61 02138 liver

Hospital

Recipient

Holder

Public health

Insurance

State discharge

Researchers

Pharmaceutical company

(economic) (law) (law)

(IRB) (IRB)

Care Provider

Flow from the Provider

Ann 10/2/61 02139 cardiac
Abe 7/14/61 02139 cancer
Al 3/8/61 02138 liver

Care provider

Recipient

Holder

Public health

Insurance

State discharge

Pharmacy

(economic) (law) (law)

Researchers

Transcription service

Secondary Flows from the Hospital

Hospital

Public health

Insurance

State discharge

Researchers

Pharmaceutical company

(economic) (law) (law)

(IRB) (IRB)

Virtually no restrictions before HIPAA, some restrictions with HIPAA. Examples: WebMD and Envoy, cancer registries, hospital discharge public data sets.

Depiction of no data sharing by the data holder

Depiction of data holder sharing data with some recipients

Depiction of secondary sharing by recipients of the data

Medical Privacy and HIPAA

Security

- Audit trails, Authorization, Authentication
- Protected channels of communication

Privacy

- Limited applicability
- horrible distortion
- Increased role of IRB
- Safe harbor: {ZIP3, Year of Birth, Gender}

ELSE use data anonymity tools!
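The three-field safe harbor summary on this slide amounts to a fixed, blanket generalization, sketched below. The regulation itself lists more identifiers and conditions than {ZIP3, Year of Birth, Gender}, and none of that is modeled; `to_safe_harbor_fields` and the record layout are hypothetical.

```python
from datetime import date


def to_safe_harbor_fields(record):
    """Reduce ZIP to its first three digits and the birth date to a year; copy other fields."""
    reduced = dict(record)                                   # leave the caller's record intact
    reduced["zip3"] = reduced.pop("zip")[:3]                 # ZIP3
    reduced["birth_year"] = reduced.pop("birth_date").year   # year of birth only
    return reduced


patient = {
    "zip": "02139",
    "birth_date": date(1961, 10, 2),
    "gender": "f",
    "diagnosis": "cardiac",
}
print(to_safe_harbor_fields(patient))
```

Because the same coarsening is applied to every record regardless of the data, it can distort more than necessary, which is the slide's point about falling back on data anonymity tools.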

Computer Security and Anonymity are computational tools

Laws

Computer Security Privacy

Regulation

anonymity tools

authentication
authorization

Policies

Nuisance
Viruses

Detect Early using Onset, Coordinate Deaths & Hospital Admits

Based on results reported in Guillemin, 1999.

1979 Sverdlovsk Anthrax Outbreak

[Figure: cumulative cases (0-70) against time in days (0-50), with curves for Onset, Hospital Admits and Deaths.]

How can we detect onset?
How early on each can we predict?
How does coordination help?

1979 Sverdlovsk Anthrax Outbreak

[Figure: cases (0-70) against time in days (0-50), with curves for Onset, Hospital Admits and Deaths; axis labels "Prevalence (cases)" and "Cumulative Cases".]

Continuously Observe Behaviors to Detect Onset of Symptoms

Prodromic surveillance:

How many are acting ill?

Unusual behaviors → syndromes?

Not confirmed diagnoses!


Centralized Surveillance of Secondary Data

hospitals

schools

labs

groceries

physicians

animals

prescriptions

assisted living

deaths

businesses

detect

Emerging Central Authorities

1. Public health agency

2. Trusted broker of public health agency

3. Law enforcement agency

4. Corporation (for profit)

5. University (non-profit, non-competitor)


Access Instruments

hospitals

schools

labs

groceries

physicians

animals

prescriptions

assisted living

deaths

businesses

HIPAA

education laws

contract contract contract

contract

contract

HIPAA HIPAA contract

*Not including public health law

Mechanical distortion decisions typically render data useless

Gross overview

Sufficiently de-identified

Identifiable

Explicitly identified

Readily identifiable

Sufficiently anonymous

Unusual activity

Suspicious activity

Outbreak detected

Outbreak suspected

Normal operation

Datafly Identifiability 0..1 Detection Status 0..1

Explicitly identified data generates privacy concerns which may ultimately prohibit data sharing

Gross overview

Sufficiently de-identified

Identifiable

Explicitly identified

Readily identifiable

Sufficiently anonymous

Unusual activity

Suspicious activity

Outbreak detected

Outbreak suspected

Normal operation

Datafly Identifiability 0..1 Detection Status 0..1

Levels of identifiability matching detection status

Gross overview

Sufficiently de-identified

Identifiable

Explicitly identified

Readily identifiable

Sufficiently anonymous

Unusual activity

Suspicious activity

Outbreak detected

Outbreak suspected

Normal operation

Datafly Identifiability 0..1 Detection Status 0..1

Automated Privacy Module

hospitals

schools

labs

groceries

physicians

animals

prescriptions

assisted living

deaths

businesses

detect

Automated Privacy Module

data holder

detect

raw data

"anonymized" data

request with status

policy agreement

Levels of Identifiability and Detection Status

Gross overview

Sufficiently de-identified

Identifiable

Explicitly identified

Readily identifiable

Sufficiently anonymous

Unusual activity

Suspicious activity

Outbreak detected

Outbreak suspected

Normal operation

Datafly Identifiability 0..1 Detection Status 0..1

Dynamically Augment the Model When Surveillance Detects Possible Attack

- Lower the privacy threshold when potential attack detected
  - But how often, how quickly, to what level?
  - Can we take advantage of disease-specific processing?
  - Need to flesh out ideas by looking at data
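One way to picture "levels of identifiability matching detection status" is a lookup from a detection status in 0..1 to the most identifiable release permitted. The sketch below is hypothetical throughout: the slides give the level names but not their pairing or any thresholds, so the ordering of both lists and the five equal-width bands are assumptions.

```python
# Assumed ordering, least to most identifiable.
IDENTIFIABILITY_LEVELS = [
    "Sufficiently anonymous",
    "Sufficiently de-identified",
    "Identifiable",
    "Readily identifiable",
    "Explicitly identified",
]
# Assumed ordering, least to most alarming.
DETECTION_LEVELS = [
    "Normal operation",
    "Unusual activity",
    "Suspicious activity",
    "Outbreak suspected",
    "Outbreak detected",
]


def level_index(detection_status: float) -> int:
    """Map a detection status in [0, 1] to one of five equal-width bands (assumed thresholds)."""
    if not 0.0 <= detection_status <= 1.0:
        raise ValueError("detection status must lie in [0, 1]")
    return min(int(detection_status * 5), 4)


def permitted_release(detection_status: float) -> tuple:
    """Return the (detection level, identifiability level) pair for a status value."""
    i = level_index(detection_status)
    return DETECTION_LEVELS[i], IDENTIFIABILITY_LEVELS[i]


for status in (0.0, 0.5, 1.0):
    print(status, permitted_release(status))
```

The open questions on the slide (how often, how quickly, to what level) correspond to exactly the parts this sketch hard-codes.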


Probable Cause Predicate

Judge

Officer

Informant

facts

1. What is the basis of the knowledge?

2. Is the source believable?

Reasonable Cause Predicate

(Technology, Policy)

Detector

{DataSource_i}

facts

What is the minimal information needed based on reliable knowledge available?

Data Holder_j


Automated Privacy Module

hospitals

schools

labs

groceries

physicians

animals

prescriptions

assisted living

deaths

businesses

detect

Transmission uses traditional computer security tools.
Content is based on data anonymity tools.
Overall goal is public safety.

This talk

- New areas in CS
- Fact: lots of data out there
- Fact: few fields uniquely identify a person
- Examples of compromises
- Nature of computational solutions
- Anonymity versus Security and Privacy
- Real-world examples

For more information:

[email protected]
http://sos.heinz.cmu.edu/dataprivacy/

