Privacy Technology
Copyright (c) 1998-2006 Dr. Sweeney. 1
Further Understanding the Intersection of Technology and Privacy to Ensure and Protect
Client Data
Latanya Sweeney, PhD
privacy.cs.cmu.edu [email protected]
“We can provably know where domestic violence shelter clients have been without knowing who they are.”
Special Thanks To
Michelle Hayes
Michael Roanhouse
Mary Joel Holin
Julie Hovden
Disclaimer
The views and opinions in this presentation represent my own and are not necessarily those of
HUD, Abt, or any affiliates (or my cat’s or dog’s).
Known side effects include shock and applause.
1. Example: tracking people
2. Example: anonymizing data
3. Example: distributed surveillance
4. Example: trails of dots
5. Example: learning who you know
6. Example: identity theft
7. Example: fingerprint capture
8. Example: bio-terrorism surveillance
9. Example: privacy-preserving surveillance
10. Example: DNA privacy
11. Example: identity theft protections
12. Example: k-Anonymity
13. Example: webcam surveillance
14. Example: text de-identification
15. Example: face de-identification
16. Example: fraudulent spam
Privacy Technology
privacy.cs.cmu.edu
Technology Or Privacy
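[Figure: Privacy plotted against Usefulness. The “Traditional Belief System” treats the two as a trade-off (Technology Or Privacy); “This Work” aims for both (Technology And Privacy).]
Question in this Work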
How can Shelters construct UIDs without risk of re-identification while still achieving an accurate unduplicated accounting?
This talk will examine old approaches and introduce a new solution with “provable” properties.
This Talk
1. The Setting
2. Technology Survey
3. A Provable Privacy Solution
privacy.cs.cmu.edu
The Big Goal
Perform local unduplicated accountings of homeless visit patterns without identifying clients.
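Homeless Management Information Systems (HMIS)
[Diagram: hierarchy with HUD at the top; below it the CoCs (CoC1, CoC2, …); each CoC contains Shelters (Shelter1, Shelter2, Shelter3, …); each Shelter serves Clients (Client1 through Client5, …).]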
Goal: a local unduplicated accounting of Client Visit Patterns
[Diagram: data flow Client → Shelter → CoC → HUD, with the flows labeled Personal Information, Universal Data Elements, and Aggregate Information.]
Universal Data Elements
Name
Social Security Number
Date of Birth
Ethnicity and Race
Gender
Veteran Status
Disabling Condition
Residence Prior to Program Entry
Code of Last Permanent Address
Program Entry Date
Program Exit Date
Unique Person Identification Number
Program Identification Number
Household Identification Number
Unique Identifier (“UID”)
HUD Reporting (Sample)
AHAR Questions: Emergency Shelter - Individuals
Question #  Question
1  How many people used emergency shelters at __ time?
2  What is the distribution of family sizes using emergency shelters?
3  What are the demographics of individuals using emergency shelters?
3    distribution by gender?
3    distribution by race and ethnicity?
3    distribution by age group?
3    distribution by household size?
3    distribution by veteran status? By disabling condition?
4  What was the living arrangement the night before entering the emergency shelter?
4    within/outside geographical jurisdiction?
5  What is distribution of the number of nights in an emergency shelter?
5    distribution by gender?
5    distribution by age group?
Intimate Stalker “Threat”
• Knows detailed information about a targeted client
• Is highly motivated
• Can compromise a shelter or CoC to find the location of the targeted client (“re-identification”)
CoC “Threat”
• Has lots of other information that may contain the client
• Is motivated to learn information about clients generally
• Can link data to specific clients (“re-identification”)
Re-Identification
… occurs when explicit client identifiers (e.g., name or address) can be reasonably associated with the client’s de-identified information.
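Linking to Re-identify Clients
[Diagram: Alice gives Personal Information to a Shelter, and the record “9/19/60 F 37213” is passed to the CoC. External Information (“Alice, 9/19/60, 123 Main St”) allows the name “Alice” to be attached to that record.]
Linking to re-identify HMIS Data
[Diagram: two overlapping sets of fields.
Dataset only: Ethnicity, Visit date, PIN, Shelter ID.
Shared by both: ZIP, Birth date, Sex.
Voter List only: Name, Address, Date registered, Party affiliation, Date last voted.]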
L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics. 1997, 25:98-110.
Re-identification Results
(each combination also includes Gender)
              Date of Birth   Mon/Yr Birth   Year of Birth
ZIP 5-digit   87.1%           3.7%           0.04%
Town/Place    58.4%           3.6%           0.04%
County        18.1%           0.04%          0.00004%
Thwarting Linking
Using re-identification analysis, we can quantify linking risks associated with data elements and make changes accordingly.
We can thwart linking. The remainder of this talk assumes that linking precautions have been taken.
Question in this Work
How can Shelters construct UIDs without risk of re-identification while still achieving an accurate unduplicated accounting?
This talk will examine old approaches and introduce a new solution with “provable” properties.
This Talk
1. The Setting
2. Technology Survey
3. A Provable Privacy Solution
privacy.cs.cmu.edu
Minimal Risk v. Provable Privacy
Minimal risk technologies use a combination of technology, practices and policy to show that there is a minimal re-identification risk.
Provable privacy technology provides guarantees against re-identification.
Minimal Risk Technologies
Technology
• Encryption
• Hashing
• Encoding
• Scan Cards/RFID
• Biometrics
• Consent
Minimal Risk Technologies for HMIS
[Matrix: each UID technology rated against utility and privacy issues; the cell ratings are not recoverable from the slide text.]
Columns, grouped under UTILITY and PRIVACY, in order: Non-verifiable source; Verifiable source; Client Trust; Inflate Accounting; Deflate Accounting; Bad or missing info; Intimate stalker; Linking; Dictionary attack; Reverse engineer; Expose new issues.
Rows (UID TECHNOLOGY): Encoding; Hashing; Encryption; Scan Cards/RFID; Biometrics; Consent; Inconsistent Hash; Distributed Query.
Rating scale: Most severe/difficult problem; Moderate problem; A problem; May be a problem; No problem likely, or not applicable.
L. Sweeney. Risk Assessments of Personal Identification Technologies for Domestic Violence Homeless Shelters. Carnegie Mellon Tech Report CMU-ISRI-05-133. Pittsburgh: November 2005.
Encoding
Concatenate parts of source information into a UID.
Example: Using {date of birth, gender, ZIP}: “09121960F37213”
(“09121960” is the date of birth, “F” the sex, “37213” the ZIP.)
Problem: the UID provides explicitly sensitive source information. Non-sensitive source information is needed instead.
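A minimal sketch of the encoding approach, reproducing the slide's example value. The function names `encode_uid` and `decode_uid` and the MMDDYYYY layout are assumptions made for illustration; the point is that anyone who sees the UID can read the source information straight back out.

```python
from datetime import date


def encode_uid(birth: date, sex: str, zip_code: str) -> str:
    """Concatenate date of birth (MMDDYYYY), sex, and ZIP into a UID."""
    return f"{birth:%m%d%Y}{sex.upper()}{zip_code}"


def decode_uid(uid: str) -> tuple[str, str, str]:
    """Recover the source fields; no secret is needed, which is the problem."""
    return uid[:8], uid[8], uid[9:]


uid = encode_uid(date(1960, 9, 12), "F", "37213")
print(uid)              # 09121960F37213
print(decode_uid(uid))  # ('09121960', 'F', '37213')
```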
Hashing
UID based on part of the source information.
Example: Using {date of birth, gender, ZIP}: “8126r1329ws”, “986s594652”
Must be “strong”
Can be examined publicly.
Fast to compute but infeasible to reverse.
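A minimal sketch of a hashed UID. The slides do not name a hash function; SHA-256 from Python's standard library is an assumption here, as are the function name `hash_uid` and the `|` field separator.

```python
import hashlib


def hash_uid(birth: str, sex: str, zip_code: str) -> str:
    """Hash the source fields into a fixed-length UID (SHA-256 is an assumption)."""
    source = "|".join([birth, sex, zip_code])
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


uid_a = hash_uid("09121960", "F", "37213")
uid_b = hash_uid("09121960", "F", "37213")
uid_c = hash_uid("09121961", "F", "37213")

print(uid_a == uid_b)  # same source information gives the same UID
print(uid_a == uid_c)  # a one-digit change gives an unrelated UID
```

The same input always gives the same UID, which is what makes the accounting unduplicated, and also what makes the consistent hash linkable across Shelters.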
Problems with Consistent Hashing (1)
If the same hash value is broadly used with Clients, then it may lead to re-identifications through linking.
If the intimate stalker compromises a Shelter or the CoC, the hashed UID could be learned and used to locate the targeted Client.
Problems with Consistent Hashing (2)
If the source information is SSN or demographics, then the CoC could re-identify all UIDs by exhaustively computing the UID for every possible source value.
[Diagram: dictionary attack on UIDs hashed from Social Security Numbers]
Hash every possible Social Security Number and record the resulting UID:
Try “000-00-0000”: UID 869563
Try “000-00-0001”: UID 962656
Try “000-00-0002”: UID 072532
…
Try “104-51-2572”: UID 976526
Try “104-51-2573”: UID 149875
…
Try “999-99-9999”
Then look up each UID in the Dataset (976526, 072532, 149875) to recover its SSN.
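The dictionary attack can be sketched in a few lines. This is an illustration under assumptions: SHA-256 stands in for the unnamed hash, the helper names are hypothetical, and only a small slice of the SSN range is enumerated so the example runs quickly. A real attacker would enumerate the full range.

```python
import hashlib


def hash_ssn(ssn: str) -> str:
    """Consistent hash of an SSN (SHA-256 is an assumption)."""
    return hashlib.sha256(ssn.encode("utf-8")).hexdigest()


def format_ssn(n: int) -> str:
    """Format an integer as NNN-NN-NNNN."""
    s = f"{n:09d}"
    return f"{s[:3]}-{s[3:5]}-{s[5:]}"


def build_dictionary(start: int, stop: int) -> dict[str, str]:
    """Map every candidate UID in [start, stop) back to its SSN."""
    return {hash_ssn(format_ssn(n)): format_ssn(n) for n in range(start, stop)}


# UIDs as the CoC would receive them (the SSNs are unknown to the CoC).
observed_uids = [hash_ssn("104-51-2572"), hash_ssn("104-51-2573")]

# The attacker hashes every candidate and looks the observed UIDs up.
table = build_dictionary(104_512_000, 104_513_000)
recovered = [table.get(uid) for uid in observed_uids]
print(recovered)  # ['104-51-2572', '104-51-2573']
```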
Problems with Consistent Hashing (3)
bits  seconds
28    1
29    3
30    7
31    15
32    31
33    62
34    124
35    249
36    499
37    998
38    1996
39    3993
40    7986
41    15963
42    31926
43    63888
44    127725
45    255463
46    510774
47    1021463
[Figure: Time to Exhaust Count (seconds) plotted against Number of Bits]
Size of source information matters. Exhaust all SSNs in 4 seconds!
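A short calculation shows why SSNs sit at the cheap end of this table. It assumes every 9-digit string is a candidate SSN; the three timings are copied from the table above.

```python
import math

# Assumption: all 9-digit strings are candidate SSNs.
ssn_count = 10 ** 9
ssn_bits = math.log2(ssn_count)
print(f"SSN space is about {ssn_bits:.1f} bits")

# Timings copied from the table (bits -> seconds).
timings = {39: 3993, 40: 7986, 41: 15963}

# Each extra bit roughly doubles the time to exhaust the space.
ratio_40 = timings[40] / timings[39]
ratio_41 = timings[41] / timings[40]
print(ratio_40, ratio_41)
```

At just under 30 bits, the whole SSN space falls in the range the table lists as taking seconds, while each additional bit of source information doubles the attacker's work.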
Encryption
Like hashing, but has a key to reverse the result.
Example: Using {date of birth, gender, ZIP}: “8126r1329ws”
“8126r1329ws” + key = “9/12/1960, F, 37213”
The person with the key can reveal the sensitive source information.
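To make the reversibility concrete, here is a toy sketch using a one-time XOR key. This is not a cipher the slides specify and is only an illustration of the property at issue: whoever holds the key gets the source information back.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a key of the same length (toy cipher for illustration)."""
    return bytes(d ^ k for d, k in zip(data, key))


source = "9/12/1960, F, 37213".encode("utf-8")
key = secrets.token_bytes(len(source))  # random key, same length as the source

uid = xor_bytes(source, key)                    # encrypted UID
recovered = xor_bytes(uid, key).decode("utf-8")  # key holder reverses it

print(recovered)  # 9/12/1960, F, 37213
```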
Scan Cards / RFID Tags
Issue a card containing a UID to each client, who presents it for service.
Example: #57817
Can be lost or given away!
Should not contain personal information or Shelter information.
Biometrics
Use something always present with client and that typically does not change.
Example: fingerprint → UID “23968c235z9”
Fingerprints can often be linked to law-enforcement databases and re-identify clients.
Consent
Ask each client their permission to share data in exchange for services.
Disclose uses of data and circumstances of sharing. They may say “no.”
Identifiable information can be shared.
Forwarding identifiable information is not good.
This Talk
1. The HMIS Setting
2. Technology Survey
3. A Provable Privacy Solution
privacy.cs.cmu.edu
Question in this Work
How can Shelters construct UIDs without risk of re-identification while still achieving an accurate unduplicated accounting?
This talk will examine old approaches and introduce a new solution with “provable” properties.
The Big Idea in 3 Steps
1. Shelters assign UIDs.
   Client has same UID at same shelter, and different UID at other shelters.
2. Shelters securely ship data to CoC.
   “Fedex” UIDs and Universal Data Elements.
3. CoC and Shelters de-duplicate UIDs (described over next slides).
UID Assignment
Each Shelter has a private value.
Each Client has a private value.
Strong hashing is used to combine the Shelter and Client value to produce a UID for the client.
De-Duplication
Each Shelter re-hashes the UIDs from all other Shelters.
All re-hashed values that are the same represent the same client.
The “Commutative Property” of Strong Hashing
J. Benaloh and M. de Mare. One-way accumulators: a decentralized alternative to digital signatures. In Proceedings of Advances in Cryptology - EUROCRYPT '93, Lecture Notes in Computer Science, v 765, pages 274-285, Lofthus, Norway, 1994.
There exist strong hash functions such that, when all Shelters re-hash all UIDs, the re-hashed values are the same only for Clients whose source information was the same.
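One operation with this commutative property is modular exponentiation, since (x^a)^b = (x^b)^a mod p. The sketch below uses it to illustrate UID assignment and re-hashing. It is not presented as the construction used in this work: the modulus `P`, the use of SHA-256 to turn source information into a number, and all function names are assumptions, and the parameters are demo-sized.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; demo-sized modulus (assumption)


def client_value(source: str) -> int:
    """Map a Client's private source information to a number in [2, P-1]."""
    digest = hashlib.sha256(source.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 2) + 2


def new_shelter_key() -> int:
    """Pick a Shelter's private value. Requiring gcd(key, P-1) == 1 makes
    x -> x**key mod P one-to-one, so distinct Clients never collide."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key


def assign_uid(source: str, shelter_key: int) -> int:
    """Combine the Client value and the Shelter value into a UID."""
    return pow(client_value(source), shelter_key, P)


def rehash(uid: int, shelter_key: int) -> int:
    """Re-hash a UID that came from another Shelter."""
    return pow(uid, shelter_key, P)


key1, key2 = new_shelter_key(), new_shelter_key()
alice_at_1 = assign_uid("alice source info", key1)
alice_at_2 = assign_uid("alice source info", key2)
bob_at_2 = assign_uid("bob source info", key2)

# Re-hashing in either order agrees for the same Client only.
print(rehash(alice_at_1, key2) == rehash(alice_at_2, key1))  # True
print(rehash(alice_at_1, key2) == rehash(bob_at_2, key1))    # False
```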
Simplified Multiplication Example, 1
Each Shelter has its own private value: Shelter 1 has 13, Shelter 2 has 23.

Simplified Multiplication Example, 2
Multiply Client and Shelter private value to get UIDs. Shelter 1 (private value 13) sees Clients with private values 3 and 7:
Mult(3, 13) = 39
Mult(7, 13) = 91
Shelter 1 sends UIDs 39 and 91 to the CoC.

Simplified Multiplication Example, 3
The CoC stores UIDs of Clients from Shelter 1: 39 and 91.
Simplified Multiplication Example, 4
Multiply Client and Shelter private value to get UIDs. Shelter 2 (private value 23) sees Clients with private values 3 and 11:
Mult(3, 23) = 69
Mult(11, 23) = 253
Shelter 2 sends UIDs 69 and 253 to the CoC.

Simplified Multiplication Example, 5
The CoC holds UIDs 39 and 91 from Shelter 1 and UIDs 69 and 253 from Shelter 2.
CoC now knows there are 4 visits, but how many Clients?

Simplified Multiplication Example, 6
CoC sends UIDs from Shelter 2 (69 and 253) to Shelter 1 for re-hashing.

Simplified Multiplication Example, 7
Shelter 1 re-hashes with its private value 13:
Mult(69, 13) = 897
Mult(253, 13) = 3289
CoC stores the re-hashed values.

Simplified Multiplication Example, 8
CoC sends UIDs from Shelter 1 (39 and 91) to Shelter 2 for re-hashing.

Simplified Multiplication Example, 9
Shelter 2 re-hashes with its private value 23:
Mult(39, 23) = 897
Mult(91, 23) = 2093
CoC stores the re-hashed values.
Simplified Multiplication Example, 10
The CoC holds:
Shelter 1 UIDs 39 and 91, re-hashed to 897 and 2093.
Shelter 2 UIDs 69 and 253, re-hashed to 897 and 3289.
Re-hashed values that are the same represent the same Client. Which are the same?

Simplified Multiplication Example, 11
The re-hashed value 897 appears twice. CoC learns that Client 39 at Shelter 1 is the same Client as 69 at Shelter 2.

CoC Learns
Shelter 2 UID   Shelter 1 UID   Completely Re-hashed UID
253                             3289
                91              2093
69              39              897
[Diagram: three distinct Clients (Client1, Client2, Client3) across Shelter1 and Shelter2, as seen by the CoC.]

Simplified Multiplication Example
The Client with private value 3 visits both Shelters (private values 13 and 23):
Shelter 1: Mult(3, 13) = 39
Shelter 2: Mult(3, 23) = 69
Re-hashing at the other Shelter gives the same value either way:
(3 * 13) * 23 = 897
(3 * 23) * 13 = 897
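The whole multiplication walk-through can be replayed in a few lines. The values (Shelter private values 13 and 23, Client private values 3, 7, and 11) come from the slides; the variable and function names are hypothetical. Plain multiplication is only the simplified stand-in used in the slides and is not a strong hash.

```python
SHELTER_KEYS = {"Shelter1": 13, "Shelter2": 23}
CLIENT_VALUES = {"Shelter1": [3, 7], "Shelter2": [3, 11]}


def mult(value: int, key: int) -> int:
    """Simplified stand-in for a commutative strong hash."""
    return value * key


# Steps 1-5: each Shelter assigns UIDs and ships them to the CoC.
uids = {
    shelter: [mult(client, SHELTER_KEYS[shelter]) for client in clients]
    for shelter, clients in CLIENT_VALUES.items()
}
print(uids)  # {'Shelter1': [39, 91], 'Shelter2': [69, 253]}


def fully_rehash(uid: int, origin: str) -> int:
    """Steps 6-9: every Shelter other than the origin re-hashes the UID."""
    for shelter, key in SHELTER_KEYS.items():
        if shelter != origin:
            uid = mult(uid, key)
    return uid


# Steps 10-11: the CoC groups UIDs by their completely re-hashed value.
rehashed: dict[int, list[tuple[str, int]]] = {}
for shelter, shelter_uids in uids.items():
    for uid in shelter_uids:
        rehashed.setdefault(fully_rehash(uid, shelter), []).append((shelter, uid))

print(rehashed[897])  # [('Shelter1', 39), ('Shelter2', 69)]
print(len(rehashed))  # 3 distinct Clients behind 4 visits
```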
The Big Idea in 3 Steps
1. Shelters assign UIDs.
   Client has same UID at same shelter, and different UID at other shelters.
2. Shelters securely ship data to CoC.
   “Fedex” UIDs and Universal Data Elements.
3. CoC and Shelters de-duplicate UIDs.
   Re-hash UIDs to reveal which UIDs belong to the same client.
Note
The UIDs are not to be used for any other purpose than this reporting and de-duplication.
Shelters use different private values at each reporting period. This results in different hashes for the same Clients over different reporting periods.
A Provable Claim
Theorem. If the re-hashed values are the same, the Clients represented by the original UIDs provided the same source information.
A Provable Claim
A dictionary attack by the CoC will not yield reliable re-identifications.
[Diagram: the dictionary attack, in which the CoC hashes every possible Social Security Number (“000-00-0000” through “999-99-9999”) and compares the results with the UIDs in the Dataset (976526, 072532, 149875).]
A Provable Claim
Compromising a Shelter will not help the intimate stalker learn where a targeted Client is (or has been) at another Shelter.
[Diagram: Client1, Client2, and Client3 at Shelter1 and Shelter2.]
A Provable Claim
Compromising the CoC will not help the intimate stalker learn where a targeted Client is (or has been).
[Table held by the CoC:
Shelter 2 UID   Shelter 1 UID   Completely Re-hashed UID
253                             3289
                91              2093
69              39              897]
A Provable Claim
Even if the CoC pads the UIDs with known values, the CoC does not learn the source information of Clients.
[Diagram: a Planning Office exchanging UIDs with Shelter1 and Shelter2; sample values shown: ax41804, b3s7ghre, H27320yfh02, H2732, nw450.]
Over the Limit
If the intimate stalker compromises both the CoC and a Shelter the targeted Client visited, the intimate stalker can learn the locations of all Shelters the Client visited.
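The limit can be illustrated with the numbers from the multiplication example. This is a sketch with hypothetical names: the CoC's table alone is a set of unlabeled numbers, but once a compromised Shelter reveals which UID belongs to the targeted Client, the table links that UID to the Client's UIDs at every other Shelter.

```python
# The CoC's de-duplication table from the multiplication example:
# completely re-hashed value -> UID at each Shelter the Client visited.
coc_table = {
    897: {"Shelter1": 39, "Shelter2": 69},
    2093: {"Shelter1": 91},
    3289: {"Shelter2": 253},
}


def shelters_visited(table: dict, shelter: str, uid: int) -> list[str]:
    """Given the target's UID at one compromised Shelter, list every
    Shelter that the CoC table links to the same Client."""
    for links in table.values():
        if links.get(shelter) == uid:
            return sorted(links)
    return []


# Compromising Shelter 1 reveals that the targeted Client has UID 39 there.
print(shelters_visited(coc_table, "Shelter1", 39))  # ['Shelter1', 'Shelter2']

# A Client seen only at Shelter 1 links to nothing else.
print(shelters_visited(coc_table, "Shelter1", 91))  # ['Shelter1']
```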
Technologies for HMIS
[Matrix: each UID technology rated against utility and privacy issues; the cell ratings are not recoverable from the slide text.]
Columns, grouped under UTILITY and PRIVACY, in order: Non-verifiable source; Verifiable source; Client Trust; Inflate Accounting; Deflate Accounting; Bad or missing info; Intimate stalker; Linking; Dictionary attack; Reverse engineer; Expose new issues.
Rows (UID TECHNOLOGY): Encoding; Hashing; Encryption; Scan Cards/RFID; Biometrics; Consent; Inconsistent Hash; Distributed Query; This provable solution (footnote marks as extracted: *** ** **).
Rating scale: Most severe/difficult problem; Moderate problem; A problem; May be a problem; No problem likely, or not applicable.
* If enough parties are compromised, information can be learned.
** Shown is the worst case; it can be improved by source information selection.
Question in this Work
How can Shelters construct UIDs without risk of re-identification while still achieving an accurate unduplicated accounting?
First, use strong hashing, inconsistently across Shelters, to assign UIDs. Second, provide accounting information to the CoC through a secure means. Then, have each Shelter re-hash the UIDs of all other Shelters, in turn, to de-duplicate UIDs.
This Talk
1. The Setting
2. Technology Survey
3. A Provable Privacy Solution
privacy.cs.cmu.edu [email protected]