Lecture 23: Data Privacy
Guest Lecturer: Xi
CompSci 516: Data Intensive Computing Systems
(The slides were adapted from CompSci 216 Spring 15)
Data and ____ ☜ your favorite subject
Where is all this data coming from?
• Census surveys
• IRS records
• Medical records
• Insurance records
• Search logs
• Browse logs
• Shopping histories
• Photos
• Videos
• Smartphone sensors
• Mobility trajectories
• …
Very sensitive information …
Sometimes users can know and control who sees their information
… but not always!
Example: Targeted advertising
Source: http://graphicsweb.wsj.com/documents/divSlider/media/ecosystem100730.png
What websites track your behavior?
Source: http://blogs.wsj.com/wtk/
http://www.dictionary.com/
Servers track your information … so what?
[Diagram: Individuals 1 … N each contribute a record r1 … rN to a server's database, which either releases the dataset or answers queries over it.]
Does it matter … I am anonymous, right?
Source (http://xkcd.org/834/)
What if we ensure our names and other identifiers are never released?
Outline
• Why does naïve anonymization fail?
  – The Massachusetts governor privacy breach
  – AOL data publishing fiasco
  – Facebook privacy violation
• How to ensure data analysis without privacy leakage?
• Applications & research directions
The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party Affiliation, Date Last Voted
Shared attributes: Zip, Birth Date, Sex

• The Governor of MA was uniquely identified by joining the two datasets on Zip Code, Birth Date, and Sex; his name was thereby linked to his diagnosis.
• Zip, Birth Date, and Sex form a quasi-identifier: together they uniquely identify 87% of the US population.
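Below is a minimal sketch of this linkage attack in Python; the tables, column names, and people are hypothetical stand-ins for the real medical and voter datasets.

```python
# A sketch of the quasi-identifier linkage attack on toy data.
# All names and values here are hypothetical.
import pandas as pd

medical = pd.DataFrame({
    "zip":        ["02138", "02139", "02141"],
    "birth_date": ["1945-07-31", "1962-01-15", "1971-03-02"],
    "sex":        ["M", "F", "M"],
    "diagnosis":  ["heart disease", "flu", "cancer"],
})

voters = pd.DataFrame({
    "name":       ["Pat Smith", "Lee Jones", "Sam Brown"],
    "zip":        ["02138", "02139", "02141"],
    "birth_date": ["1945-07-31", "1962-01-15", "1971-03-02"],
    "sex":        ["M", "F", "M"],
})

# Join on the quasi-identifier (zip, birth_date, sex): anyone whose
# combination is unique in both tables has their name linked to their
# diagnosis, even though the medical data contains no names.
linked = medical.merge(voters, on=["zip", "birth_date", "sex"])
print(linked[["name", "diagnosis"]])
```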
AOL data publishing fiasco
AOL data publishing fiasco …
User ID | Search query
Ashwin222 | Uefa cup
Ashwin222 | Uefa champions league
Ashwin222 | Champions league final
Ashwin222 | Champions league final 2013
Jun156 | exchangeability
Jun156 | Proof of deFinitti's theorem
Brett12345 | Zombie games
Brett12345 | Warcraft
Brett12345 | Beatles anthology
Brett12345 | Ubuntu breeze
Austin222 | Python in thought
Austin222 | Enthought Canopy

User IDs replaced with random numbers (Ashwin222 → 865712345, Jun156 → 236712909, Brett12345 → 112765410, …), leaving the queries intact.
Privacy breach
[NYTimes 2006]
Privacy violations from Facebook
Source: http://article.wn.com/view/2012/08/28/Facebooks_new_app_bazaar_violates_punters_privacy_lobbyists/
Inference from Impressions: Sexual Orientation [Korolova JPC 2011]

[Figure: an advertiser targets an ad using a user's Facebook profile plus online data. Targeted at users "interested in men", the campaign receives 25 impressions; targeted at users "interested in women", it receives 0.]

Facebook uses private information to predict whether a user matches an ad.
Reason for privacy breach
• Anyone can run a campaign with strict targeting criteria
  – Zip, birth date and sex uniquely identify 87% of the US population
• "Private" and "Friends only" profile info is used to determine the match
• Default privacy settings lead to users having many publicly visible features
  – The default privacy setting for Likes, location, work place, etc. is public
Can Facebook release its graph?
• Suppose we just release the nodes and edges in the Facebook graph …
[Figures: two real network datasets: mobile communication networks (J. Onnela et al. PNAS 07) and sexual & injection drug partner networks (Potterat et al. STI 02)]
Naïve anonymization

[Graph: an email communication network over seven individuals: Alice, Bob, Cathy, Diane, Ed, Fred, and Grace]

• Consider the above email communication graph
  – Each node represents an individual
  – Each edge between two individuals indicates that they have exchanged emails
• Replace node identifiers with random numbers.
Attacks on naïve anonymization
• Alice has sent emails to exactly three individuals. Only one node in the anonymized network has degree three, so Alice can re-identify herself.
• Cathy has sent emails to five individuals. Only one node has degree five, so Cathy can re-identify herself.
• Now suppose Alice and Cathy share their knowledge about the anonymized network. What can they learn about the other individuals?
  – Only Bob has sent emails to both of them, so Bob can be identified.
  – Alice has sent emails to Bob, Cathy, and Ed only, so Ed can be identified.
  – Alice and Cathy can also learn that Bob and Ed are connected.
Attacks
• Matching attacks: the adversary matches external information to a naïvely anonymized network, achieving unique or partial node re-identification.
Local structure is highly identifying

[Figure, from Hay et al PVLDB 08, on the Friendster network (~4.5 million nodes): knowing only a node's degree leaves most nodes well protected, but knowing its neighbors' degrees as well uniquely identifies many nodes.]
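The sketch below illustrates this on a toy graph: it counts how many nodes are uniquely identified by their degree alone versus by their degree plus the sorted degrees of their neighbors. The graph and signature names are illustrative, not from Hay et al.

```python
# Structural re-identification risk on a toy graph: how many nodes
# share the same structural "signature"?
from collections import Counter

edges = [("A", "B"), ("A", "C"), ("A", "E"), ("B", "C"), ("B", "E"),
         ("C", "D"), ("C", "F"), ("C", "G"), ("D", "F")]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def degree_sig(node):
    # Adversary knows only the node's degree.
    return len(adj[node])

def neighbor_sig(node):
    # Adversary also knows the sorted degrees of the node's neighbors.
    return (len(adj[node]), tuple(sorted(len(adj[n]) for n in adj[node])))

for name, sig in [("degree only", degree_sig), ("degree + neighbors", neighbor_sig)]:
    classes = Counter(sig(n) for n in adj)
    unique = sum(1 for count in classes.values() if count == 1)
    print(f"{name}: {unique} of {len(adj)} nodes uniquely identified")
```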
Sensitive values in social networks
http://mattmckeon.com/facebook-privacy/
Sensitive values in social networks
• Some people are privacy conscious (like you)
• Most people are lazy and keep the default privacy settings (i.e., no privacy)
• An adversary can infer your sensitive attributes from the sensitive attributes of individuals with public profiles …
Servers track your information … and you are not anonymous
• Redlining: the practice of denying, or charging more for, services such as banking, insurance, or health care (or even access to supermarkets), or denying jobs, to residents of particular, often racially determined, areas.
Why care about privacy?
Can data analysis be done without breaching the privacy of individuals?
Outline
• Why does naïve anonymization fail?
• How to ensure data analysis without privacy leakage?
• Applications & research directions
Private data analysis problem

[Diagram: Individuals 1 … N each contribute a record r1 … rN to a server's database.]

• Privacy: no breach about any individual
• Utility: the server can still compute useful functions of the data
Private data analysis examples

Application | Data Collector | Third Party (adversary) | Private Information | Function (utility)
Medical | Hospital | Epidemiologist | Disease | Correlation between disease and geography
Genome analysis | Hospital | Statistician/Researcher | Genome | Correlation between genome and disease
Advertising | Google/FB/Y! | Advertiser | Clicks/Browsing | Number of clicks on an ad by age/region/gender …
Social Recommendations | Facebook | Another user | Friend links / profile | Recommend other users or ads to users based on the social network
Location Services | Verizon/AT&T | Verizon/AT&T | Location | Local search
Private data analysis methods
• Bare minimum protection:
  – K-anonymity [Sweeney IJUFKS 2002]
  – L-diversity [Machanavajjhala et al ICDE 2006]
  – T-closeness [Li et al ICDE 2007]
• Ideal (state of the art): Differential Privacy
K-anonymity [Sweeney IJUFKS 2002]
• If every row corresponds to one individual, then every row should look identical to at least k-1 other rows with respect to the quasi-identifier attributes.
K-anonymity

Original table:
Zip | Age | Nationality | Disease
13053 | 28 | Russian | Heart
13068 | 29 | American | Heart
13068 | 21 | Japanese | Flu
13053 | 23 | American | Flu
14853 | 50 | Indian | Cancer
14853 | 55 | Russian | Heart
14850 | 47 | American | Flu
14850 | 59 | American | Flu
13053 | 31 | American | Cancer
13053 | 37 | Indian | Cancer
13068 | 36 | Japanese | Cancer
13068 | 32 | American | Cancer

Generalized table (k = 4):
Zip | Age | Nationality | Disease
130** | <30 | * | Heart
130** | <30 | * | Heart
130** | <30 | * | Flu
130** | <30 | * | Flu
1485* | >40 | * | Cancer
1485* | >40 | * | Heart
1485* | >40 | * | Flu
1485* | >40 | * | Flu
130** | 30-40 | * | Cancer
130** | 30-40 | * | Cancer
130** | 30-40 | * | Cancer
130** | 30-40 | * | Cancer
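A minimal sketch of verifying k-anonymity, using the generalized table above: group rows by their quasi-identifier and check that every group has at least k rows.

```python
# Check k-anonymity: every quasi-identifier group must have >= k rows.
from collections import Counter

# (zip, age, nationality, disease) rows of the generalized table
rows = [
    ("130**", "<30",   "*", "Heart"),  ("130**", "<30",   "*", "Heart"),
    ("130**", "<30",   "*", "Flu"),    ("130**", "<30",   "*", "Flu"),
    ("1485*", ">40",   "*", "Cancer"), ("1485*", ">40",   "*", "Heart"),
    ("1485*", ">40",   "*", "Flu"),    ("1485*", ">40",   "*", "Flu"),
    ("130**", "30-40", "*", "Cancer"), ("130**", "30-40", "*", "Cancer"),
    ("130**", "30-40", "*", "Cancer"), ("130**", "30-40", "*", "Cancer"),
]

QID = (0, 1, 2)  # quasi-identifier columns: zip, age, nationality

def is_k_anonymous(rows, k):
    groups = Counter(tuple(row[i] for i in QID) for row in rows)
    return all(size >= k for size in groups.values())

print(is_k_anonymous(rows, 4))  # True: each QID group has exactly 4 rows
```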
K-anonymity in graphs
Attack 1: homogeneity
• Adversary's background knowledge: Name = Bob, Zip = 13053, Age = 35.
• In the 4-anonymous table above, Bob falls into the (130**, 30-40) group, and every row in that group has Cancer.
⇒ Bob has Cancer
Attack 2: background knowledge
• Adversary's background knowledge: Name = Umeko, Zip = 13068, Age = 24, Nationality = Japanese.
• Umeko falls into the (130**, <30) group, whose diseases are Heart and Flu.
• Japanese have a very low incidence of heart disease.
⇒ Umeko has Flu
Recall attacks on k-anonymity
• Homogeneity: Bob has Cancer.
• Background knowledge: Japanese have a very low incidence of heart disease, so Umeko has Flu.
3-diverse table

Zip | Age | Nationality | Disease
1306* | <=40 | * | Heart
1306* | <=40 | * | Flu
1306* | <=40 | * | Cancer
1306* | <=40 | * | Cancer
1485* | >40 | * | Cancer
1485* | >40 | * | Heart
1485* | >40 | * | Flu
1485* | >40 | * | Flu
1305* | <=40 | * | Heart
1305* | <=40 | * | Flu
1305* | <=40 | * | Cancer
1305* | <=40 | * | Cancer

• Bob (13053, 35) now falls into a group containing Heart, Flu, and Cancer ⇒ Bob has ?
• Umeko (13068, 24, Japanese) falls into a group containing Heart, Flu, and Cancer; even ruling out heart disease leaves Flu and Cancer ⇒ Umeko has ?

L-Diversity Principle: Every group of tuples with the same Q-ID values has ≥ L distinct sensitive values of roughly equal proportions.
L-diversity [Machanavajjhala et al ICDE 2006]
• L-diversity Principle: Every group of tuples with the same Q-ID values has ≥ L distinct "well represented" sensitive values.
• The link between identity and attribute value is the sensitive information: "Does Bob have cancer? Heart disease? Flu?" "Does Umeko have cancer? Heart disease? Flu?"
• Privacy is breached when the attribute value can be inferred with high probability: Pr["Bob has cancer" | published table, adv. knowledge] > t
L-diversity
• Limit adversarial knowledge: the adversary knows at most L-2 negation statements, e.g., "Umeko does not have heart disease."
• Consider the worst case: all possible conjunctions of at most L-2 such statements.
• Consequences: at least L distinct sensitive values should appear in every group, and the L distinct values should be of roughly equal proportions.
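A minimal sketch of checking distinct l-diversity on the 3-diverse table above; "distinct" is the simplest instantiation of "well represented" (it counts distinct sensitive values per group and ignores proportions).

```python
# Check distinct l-diversity: every QID group must contain >= l
# distinct sensitive values.
from collections import defaultdict

rows = [
    ("1306*", "<=40", "*", "Heart"),  ("1306*", "<=40", "*", "Flu"),
    ("1306*", "<=40", "*", "Cancer"), ("1306*", "<=40", "*", "Cancer"),
    ("1485*", ">40",  "*", "Cancer"), ("1485*", ">40",  "*", "Heart"),
    ("1485*", ">40",  "*", "Flu"),    ("1485*", ">40",  "*", "Flu"),
    ("1305*", "<=40", "*", "Heart"),  ("1305*", "<=40", "*", "Flu"),
    ("1305*", "<=40", "*", "Cancer"), ("1305*", "<=40", "*", "Cancer"),
]

def is_l_diverse(rows, l):
    groups = defaultdict(set)
    for *qid, sensitive in rows:
        groups[tuple(qid)].add(sensitive)
    return all(len(values) >= l for values in groups.values())

print(is_l_diverse(rows, 3))  # True: each group has Heart, Flu, and Cancer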
T-closeness [Li et al ICDE 2007]
• L-diversity asks that the L distinct sensitive values in each group be of roughly equal proportions; t-closeness instead requires the distribution of the sensitive attribute within each group to be within distance t of its distribution in the whole table.
• Example: let t = 0.75; privacy of individuals in the above group is ensured if the two distributions are within distance t (formula elided on the slides).

Theorem: For all groups g, for all sensitive values s in S, and for all background knowledge B with |B| ≤ L-2, the adversary's inference bound is equivalent to a syntactic condition on the table (both formulas elided on the slides).
Attack 3: composition

First release:
Zip | Age | Income | Disease
130** | [25-30] | >50k | None
130** | [25-30] | >50k | Stroke
130** | [25-30] | >50k | Flu
130** | [25-30] | >50k | Cancer
902** | [60-70] | <50k | Flu
902** | [60-70] | <50k | Stroke
902** | [60-70] | <50k | Flu
902** | [60-70] | <50k | Cancer

Second release:
Zip | Age | Nationality | Disease
130** | <40 | * | Cold
130** | <40 | * | Stroke
130** | <40 | * | Rash
1485* | >40 | * | Cancer
1485* | >40 | * | Flu
1485* | >40 | * | Cancer

If Bob is in both datasets, then Bob has Stroke! Stroke is the only disease consistent with his group in both releases.
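A tiny sketch of this intersection attack, assuming Bob falls into the 130** group of each release:

```python
# Composition attack: intersect the candidate diseases for Bob's group
# across the two independent releases.
release_1 = {"None", "Stroke", "Flu", "Cancer"}  # Bob's group, first release
release_2 = {"Cold", "Stroke", "Rash"}           # Bob's group, second release

print(release_1 & release_2)  # {'Stroke'}: Bob has Stroke
```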
Differential Privacy [Dwork et al TCC 2006]
• Consider two datasets:
  – D1: with Bob as one of the participants
  – D2: without Bob
• Answers are roughly the same whether or not Bob is in the data.
Differential Privacy
Algorithm A satisfies ε-differential privacy if:
- for every pair of neighboring tables D1, D2
- for every output O
Pr[A(D1) = O] ≤ e^ε Pr[A(D2) = O]
Meaning …

[Figure: D1 (Bob in the data) and D2 (Bob not in the data) each induce a probability distribution over the set of all outputs O1, …, Ok. For every output O, P[A(D1) = O] and P[A(D2) = O] differ by at most a factor e^ε, which bounds the worst discrepancy in probabilities.]
Privacy loss parameter ε
Algorithm A satisfies ε-differential privacy if:
- for every pair of neighboring tables D1, D2
- for every output O
Pr[A(D1) = O] ≤ e^ε Pr[A(D2) = O]
• The smaller the ε, the stronger the privacy (and the lower the utility).
• What the adversary learns about an individual is roughly the same even if the individual were not in the data (or lied about his/her value).
Algorithm 1: randomized response

With probability p, report the true value; with probability 1-p, report the flipped value.

True Disease (Y/N) | Reported Disease (Y/N)
Y | Y
Y | N
N | N
Y | N
N | Y
N | N

Since we know p, we can estimate the true proportion of Y in the data from the perturbed values.
Algorithm 1: randomized response
• Consider two databases D, D' that differ in the jth value: D[j] ≠ D'[j], and D[i] = D'[i] for all i ≠ j.
• Consider any output O. Each value is perturbed independently, so the output probabilities differ only at position j:
Pr[A(D) = O] / Pr[A(D') = O] = Pr[O[j] | D[j]] / Pr[O[j] | D'[j]] ≤ p / (1-p)
• Hence randomized response satisfies ε-differential privacy with ε = ln(p / (1-p)) (for p ≥ 1/2).
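The derivation above can be checked numerically. The sketch below computes the exact output probabilities of randomized response on a pair of neighboring 3-bit databases and verifies the e^ε bound; the databases and outputs are illustrative.

```python
# Numeric check of the DP guarantee of randomized response.
import math

p = 0.75                       # report the true bit with probability p
eps = math.log(p / (1 - p))    # claimed privacy loss, ~1.10

D1 = (1, 0, 1)                 # Bob's bit (position 2) is 1
D2 = (1, 0, 0)                 # neighboring database: Bob's bit is 0

def prob_output(db, out):
    """Exact probability that randomized response on db emits out."""
    pr = 1.0
    for true_bit, reported in zip(db, out):
        pr *= p if true_bit == reported else (1 - p)
    return pr

for out in [(1, 0, 1), (1, 0, 0), (0, 1, 1)]:
    ratio = prob_output(D1, out) / prob_output(D2, out)
    print(out, round(ratio, 3), ratio <= math.exp(eps) + 1e-9)
```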
Algorithm 1: randomized response
• Suppose n1 out of n people replied 'yes' and the rest said 'no'. What is the best estimator for the true fraction π of 'yes' values?
• Since E[n1/n] = πp + (1-π)(1-p), the unbiased estimator is π̂ = (n1/n - (1-p)) / (2p - 1).
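A minimal sketch of the mechanism and its estimator, under the assumption that each person flips their bit independently with probability 1-p:

```python
# Randomized response plus the unbiased estimator of the true fraction.
import random

def randomized_response(bit, p):
    """Report the true bit with probability p, the flipped bit otherwise."""
    return bit if random.random() < p else 1 - bit

def estimate_fraction(reports, p):
    """Invert E[mean] = pi*p + (1-pi)*(1-p) to recover pi."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

random.seed(0)
p = 0.75                                  # epsilon = ln(p/(1-p)) ~ 1.10
truth = [1] * 300 + [0] * 700             # true fraction of 'yes' is 0.30
reports = [randomized_response(b, p) for b in truth]
print(round(estimate_fraction(reports, p), 3))   # close to 0.30
```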
Algorithm 2: Laplace mechanism

[Diagram: the researcher sends query q to the database; the server returns the true answer q(D) plus noise η.]

• η is drawn from the Laplace distribution Lap(λ), with density h(η) ∝ exp(-|η| / λ), mean 0, and variance 2λ².
• Privacy depends on the λ parameter.
Laplace mechanism example
Qn: Release the histogram of admissions by diagnosis.
Ans:
• Compute the true histogram.
• Add noise drawn from Lap(1/ε) to each count in the histogram.
The noisy count is within ±1.38 of the true count (with probability ≈ 0.75) for ε = 1.
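A minimal sketch of this mechanism, assuming each person contributes to exactly one count (so each count has sensitivity 1); the diagnosis counts are made up for illustration:

```python
# Laplace mechanism for a histogram release.
import random

def laplace_noise(scale):
    """Sample Lap(0, scale) as the difference of two iid exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_histogram(counts, epsilon):
    """Add Lap(1/epsilon) noise to every count."""
    scale = 1.0 / epsilon
    return {k: v + laplace_noise(scale) for k, v in counts.items()}

random.seed(0)
true_counts = {"Heart": 120, "Flu": 340, "Cancer": 85}
print(private_histogram(true_counts, epsilon=1.0))
```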
DP Composition
Qn: Release 2 histograms of admissions: (a) by diagnosis, and (b) by age.
Ans:
• Compute the true histograms.
• Add noise from Lap(1/ε) to each count in both histograms.
Noisy counts are within ±1.38 of the true counts in both histograms … but the total privacy loss is (1+1) = 2
→ satisfies 2-differential privacy
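Sequential composition also tells us how to stay within a fixed total budget: split ε across the releases. The sketch below reuses private_histogram from the previous sketch and spends half the budget on each histogram (the counts are again made up):

```python
# Two releases under a total budget: privacy losses add up.
total_epsilon = 2.0
by_diagnosis = {"Heart": 120, "Flu": 340, "Cancer": 85}
by_age = {"<30": 210, "30-40": 180, ">40": 155}

# Each release consumes epsilon/2; together they satisfy
# total_epsilon-differential privacy by sequential composition.
print(private_histogram(by_diagnosis, epsilon=total_epsilon / 2))
print(private_histogram(by_age, epsilon=total_epsilon / 2))
```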
Interactive vs. publishing model

Interactive setting:
- queries arrive online and are answered adaptively, depending on the remaining privacy budget

Non-interactive setting:
- queries or the query types are known in advance
- publish synthetic data
- no limit on the number of queries
Outline
• Why does naïve anonymization fail?
• How to ensure data analysis without privacy leakage?
• Applications & research directions
Applications of DP: OnTheMap
• A Census application that plots commuting patterns of workers http://onthemap.ces.census.gov/
Applications of DP: OnTheMap
• DP synthetic data, generated with a noise-adding mechanism
• Used to compute Quarterly Workforce Indicators:
  – Total employment
  – Average earnings
  – New hires & separations
  – Unemployment statistics
Applications of DP: RAPPOR [Erlingsson et al]
• Randomized Aggregatable Privacy-preserving Ordinal Response
  – crowdsources statistics from end-user client software (Chromium)
  – based on randomized response
http://www.chromium.org/developers/design-documents/rappor
Other DP Applications & Research
• Network data
  – Private analysis of graph structure, Karwa et al, ACM Trans. Database Syst., 2014
• Multiple entries
  – Formal privacy protection for data products combining individual and employer frames, Haney et al, UNECE/Eurostat 2015
• Trajectory & location data
  – Geo-indistinguishability, Andrés et al, CCS 2013
  – DPT, He et al, VLDB 2015
• DP + Security
  – Root ORAM, Wagh et al, arXiv 2016
  – Private record linkage, Cao et al, ICDE 2015
• Beyond DP
  – Pufferfish privacy, Machanavajjhala, ACM Trans. Database Syst., 2014
  – Blowfish privacy, He et al, SIGMOD 2015
Summary
• The "data-driven" revolution has transformed many fields, but we need to address the privacy problem:
  – The Massachusetts governor privacy breach
  – AOL data publishing fiasco
  – Facebook privacy violation
• Tools like differential privacy can foster 'safe' data collection, analysis, and data sharing:
  – K-anonymity
  – L-diversity
  – T-closeness
  – Differential privacy
• More details on data privacy: see Ashwin's other course.