Lecture 6 : 590.03 Fall 12 1
Simulatability
"The enemy knows the system." – Claude Shannon

CompSci 590.03
Instructor: Ashwin Machanavajjhala
Announcements
• Please meet with me at least 2 times before you finalize your project (deadline Sep 28).
Recap – L-Diversity
• The link between identity and attribute value is the sensitive information: "Does Bob have Cancer? Heart disease? Flu?" "Does Umeko have Cancer? Heart disease? Flu?"
• The adversary knows ≤ L-2 negation statements: "Umeko does not have Heart Disease."
  – The data publisher may not know the exact adversarial knowledge.
• Privacy is breached when identity can be linked to an attribute value with high probability: Pr["Bob has Cancer" | published table, adv. knowledge] > t
Recap – 3-Diverse Table

Zip   | Age  | Nat. | Disease
1306* | <=40 | *    | Heart
1306* | <=40 | *    | Flu
1306* | <=40 | *    | Cancer
1306* | <=40 | *    | Cancer
1485* | >40  | *    | Cancer
1485* | >40  | *    | Heart
1485* | >40  | *    | Flu
1485* | >40  | *    | Flu
1305* | <=40 | *    | Heart
1305* | <=40 | *    | Flu
1305* | <=40 | *    | Cancer
1305* | <=40 | *    | Cancer

L-Diversity Principle: Every group of tuples with the same Q-ID values has ≥ L distinct sensitive values of roughly equal proportions.
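The principle can be checked mechanically. Below is a minimal sketch; the table encoding, the column names, and the use of a "no value exceeds half the group" test as a stand-in for "roughly equal proportions" are illustrative assumptions, not part of the lecture:

```python
from collections import Counter

def is_l_diverse(rows, qid_cols, sensitive_col, l, max_share=0.5):
    """Check that every Q-ID group has >= l distinct sensitive values,
    none of which dominates the group (a crude stand-in for
    'roughly equal proportions')."""
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in qid_cols)
        groups.setdefault(key, []).append(row[sensitive_col])
    for values in groups.values():
        counts = Counter(values)
        if len(counts) < l:
            return False
        if max(counts.values()) / len(values) > max_share:
            return False
    return True

# The 3-diverse table from the slide.
table = (
    [{"zip": "1306*", "age": "<=40", "disease": d} for d in ["Heart", "Flu", "Cancer", "Cancer"]]
    + [{"zip": "1485*", "age": ">40", "disease": d} for d in ["Cancer", "Heart", "Flu", "Flu"]]
    + [{"zip": "1305*", "age": "<=40", "disease": d} for d in ["Heart", "Flu", "Cancer", "Cancer"]]
)
print(is_l_diverse(table, ["zip", "age"], "disease", l=3))  # True
```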
Outline
• Simulatable Auditing
• Minimality Attack in anonymization
• Simulatable algorithms for anonymization
Query Auditing
• The database holds numeric values (say, salaries of employees).
• The database either truthfully answers a query or denies answering.
• Queries: MIN, MAX, SUM over subsets of the database.
• Question: when to allow/deny queries?

[Figure: a Researcher sends a Query to the Database; the auditor asks "Safe to publish?" and answers Yes or No.]
Why should we deny queries?
• Q1: Ben's sensitive value?
  – DENY
• Q2: Max sensitive value of males?
  – ANSWER: 2
• Q3: Max sensitive value of 1st-year PhD students?
  – ANSWER: 3
• But Q3 + Q2 => Xi = 3

Name | 1st year PhD | Gender | Sensitive value
Ben  | Y | M | 1
Bha  | N | M | 1
Ios  | Y | M | 1
Jan  | N | M | 2
Jian | Y | M | 2
Jie  | N | M | 1
Joe  | N | M | 2
Moh  | N | M | 1
Son  | N | F | 1
Xi   | Y | F | 3
Yao  | N | M | 2
Value-Based Auditing
• Let a1, a2, …, ak be the answers to previous queries Q1, Q2, …, Qk.
• Let ak+1 be the answer to Qk+1.
• ai = f(ci1x1, ci2x2, …, cinxn), i = 1 … k+1, where cim = 1 if Qi depends on xm (and 0 otherwise).
• Check if any xj has a unique solution.
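For MAX queries the check can be done with simple bound propagation rather than a general equation solver. A minimal sketch (MAX-only; representing each query as a set of variable indices is an assumption for illustration):

```python
def audit_max(queries):
    """Value-based auditing for MAX queries.

    queries: list of (index_set, answer) pairs, in the order answered.
    Returns {index: value} for every variable whose value is now uniquely
    determined (the auditor should have denied before reaching this state).
    """
    ub = {}  # upper bound per variable; absent = +infinity
    for s, a in queries:
        for j in s:
            ub[j] = min(ub.get(j, float("inf")), a)
    exposed = {}
    for s, a in queries:
        # A MAX answer a means some member of s attains a; if only one
        # member's upper bound still allows a, that member equals a.
        candidates = [j for j in s if ub[j] >= a]
        if len(candidates) == 1:
            exposed[candidates[0]] = a
    return exposed

# The upcoming slide example: answering both queries pins x5 = 10.
print(audit_max([({1, 2, 3, 4, 5}, 10), ({1, 2, 3, 4}, 8)]))  # {5: 10}
```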
Value-based Auditing
• Data values: {x1, x2, x3, x4, x5}; queries: MAX.
• Allow a query only if the value of no xi can be inferred.

Query 1: max(x1, x2, x3, x4, x5). Answer: 10.
The attacker learns: -∞ ≤ x1 … x5 ≤ 10.

Query 2: max(x1, x2, x3, x4). True answer: 8.
Answering would reveal -∞ ≤ x1 … x4 ≤ 8, and hence x5 = 10, so the auditor DENIES.

But a denial means some value could be compromised! The attacker asks: what could max(x1, x2, x3, x4) be?
• From the first answer, max(x1, x2, x3, x4) ≤ 10.
• If max(x1, x2, x3, x4) = 10, then there is no privacy breach and no denial.
• Hence max(x1, x2, x3, x4) < 10, which forces x5 = 10!

Denials leak information. The attack occurred because the privacy analysis did not assume that the attacker knows the algorithm.
Simulatable Auditing [Kenthapadi et al., PODS '05]
• An auditor is simulatable if the decision to deny a query Qk is made based on information already available to the attacker.
  – It can use the queries Q1, Q2, …, Qk and the answers a1, a2, …, ak-1.
  – It cannot use ak or the actual data to make the decision.
• Denials provably do not leak information:
  – the attacker could equivalently determine whether the query would be denied;
  – the attacker can mimic, or simulate, the auditor.
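A sketch of a simulatable auditor for MAX queries: the denial decision enumerates representative feasible answers to the new query, using only past queries and answers, never the data. The choice of candidate guesses (each past answer, plus one value below all of them) is my simplification for illustration, not the actual Kenthapadi et al. algorithm:

```python
def _bounds(queries):
    """Upper bound per variable implied by a set of MAX (index_set, answer) pairs."""
    ub = {}
    for s, a in queries:
        for j in s:
            ub[j] = min(ub.get(j, float("inf")), a)
    return ub

def _feasible(queries):
    # Every MAX answer must still be attainable by some member of its query set.
    ub = _bounds(queries)
    return all(any(ub[j] >= a for j in s) for s, a in queries)

def _exposed(queries):
    # A value is pinned when it is the only member that can attain a MAX answer.
    ub = _bounds(queries)
    pinned = {}
    for s, a in queries:
        candidates = [j for j in s if ub[j] >= a]
        if len(candidates) == 1:
            pinned[candidates[0]] = a
    return pinned

def simulatable_deny(past, new_set):
    """Decide DENY/allow for a new MAX query from past queries and answers
    ONLY (never the data or the true new answer): deny if ANY feasible
    answer to the new query would pin some value."""
    past_answers = sorted({a for _, a in past})
    # Representative guesses: each past answer, and one value below all of them.
    guesses = past_answers + ([past_answers[0] - 1] if past_answers else [0])
    for g in guesses:
        hyp = past + [(new_set, g)]
        if _feasible(hyp) and _exposed(hyp):
            return True
    return False

past = [(frozenset({1, 2, 3, 4, 5}), 10)]
print(simulatable_deny(past, frozenset({1, 2, 3, 4})))  # True: any answer < 10 would pin x5
```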
Simulatable Auditing Algorithm
• Data values: {x1, x2, x3, x4, x5}; queries: MAX.
• Allow a query only if the value of no xi can be inferred.

Query 1: max(x1, x2, x3, x4, x5). Answer: 10.
Query 2: max(x1, x2, x3, x4). Before computing the answer, the auditor considers every possible answer:
  – Ans > 10: not possible.
  – Ans = 10: -∞ ≤ x1 … x4 ≤ 10. SAFE.
  – Ans < 10: x5 = 10. UNSAFE.
Some possible answer is unsafe, so the auditor DENIES, without ever looking at the data.
Summary of Simulatable Auditing
• In some (many!) cases, the decision to deny must be based only on past queries and their answers.
• Denials can leak information if the adversary does not know all the information that is used to decide whether to deny the query.
Outline
• Simulatable Auditing
• Minimality Attack in anonymization
• Simulatable algorithms for anonymization
Minimality Attack on Generalization Algorithms
• Algorithms for K-anonymity, L-diversity, T-closeness, etc. try to maximize utility:
  – find a minimally generalized table in the lattice that satisfies privacy and maximizes utility.
• But … the attacker also knows this algorithm!
Example Minimality Attack [Wong et al., VLDB 2007]
• Dataset with one quasi-identifier and 2 values, q1 and q2; both generalize to Q.
• Sensitive attribute: Cancer (yes/no).
• We want to ensure P[Cancer = yes] < 1/2.
  – It is OK to learn that an individual does not have Cancer.
• Published table:

QID | Cancer
Q   | Yes
Q   | Yes
Q   | No
Q   | No
q2  | No
q2  | No
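Why the publisher might believe this table is safe: reading the posterior directly off the published table, a q1 individual could be any of the Q rows, half of which say Yes. A small sketch (the tuple encoding is illustrative):

```python
def naive_posterior(published, qid, sensitive="Yes"):
    """P[sensitive | qid] read directly off the published table: a tuple with
    quasi-identifier `qid` could be any row whose (possibly generalized) QID
    covers it.  This ignores what the attacker knows about the algorithm."""
    matches = [s for q, s in published if q in (qid, "Q")]  # "Q" covers q1 and q2
    return sum(s == sensitive for s in matches) / len(matches)

published = [("Q", "Yes"), ("Q", "Yes"), ("Q", "No"), ("Q", "No"),
             ("q2", "No"), ("q2", "No")]
print(naive_posterior(published, "q1"))  # 0.5: looks safe, until the attacker
                                         # reasons about the algorithm itself
```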
Which input datasets could have led to the published table?

Output dataset: the published table above, with {q1, q2} generalized to Q ("2-diverse").

Possible input datasets with 3 occurrences of q1:

QID | Cancer          QID | Cancer
q1  | Yes             q1  | Yes
q1  | Yes             q1  | No
q1  | No              q1  | No
q2  | No              q2  | Yes
q2  | No              q2  | No
q2  | No              q2  | No
But for an input with 3 occurrences of q1, a less generalized table already satisfies the privacy requirement:

QID | Cancer
q1  | Yes
Q   | No
Q   | No
q2  | Yes
q2  | No
q2  | No

This is a better generalization, so a minimal algorithm would not have published the fully generalized table. The input cannot have 3 occurrences of q1.
Possible input datasets with 1 occurrence of q1:

QID | Cancer          QID | Cancer
q2  | Yes             q2  | Yes
q1  | Yes             q2  | Yes
q2  | No              q1  | No
q2  | No              q2  | No
q2  | No              q2  | No
q2  | No              q2  | No
For an input with 1 occurrence of q1, there is again a better generalization:

QID | Cancer
q2  | Yes
Q   | No
Q   | No
q2  | Yes
q2  | No
q2  | No

So the input has neither 3 occurrences nor 1 occurrence of q1: there must be exactly two tuples with q1.
Possible input datasets with 2 occurrences of q1:

QID | Cancer     QID | Cancer     QID | Cancer
q1  | Yes        q2  | Yes        q1  | Yes
q1  | Yes        q2  | Yes        q2  | Yes
q2  | No         q1  | No         q1  | No
q2  | No         q1  | No         q2  | No
q2  | No         q2  | No         q2  | No
q2  | No         q2  | No         q2  | No

The third table already satisfies privacy, so it would have been published without any generalization.
In the input where both q1 tuples have Cancer = No (q2 Yes, q2 Yes, q1 No, q1 No, q2 No, q2 No), the attacker can only learn Cancer = No for q1. Learning Cancer = No is OK, hence this table is already private; it too would have been published without generalization.
That leaves a single possible input:

QID | Cancer
q1  | Yes
q1  | Yes
q2  | No
q2  | No
q2  | No
q2  | No

This is the ONLY input that results in the output! Hence P[Cancer = yes | q1] = 1.
Outline
• Simulatable Auditing
• Minimality Attack in anonymization
• Transparent Anonymization: Simulatable algorithms for anonymization
Transparent Anonymization
• Assume that the adversary knows the algorithm that is being used.
  – O: the output table
  – I(O, A): the input tables that result in O under algorithm A
  – I: all possible input tables
Transparent Anonymization
• Privacy must be guaranteed with respect to I(O, A):
  – probabilities must be computed assuming I(O, A) is the actual set of all possible input tables.
• What is an efficient algorithm for transparent anonymization, e.g., for L-diversity?
Ace Algorithm [Xiao et al., TODS 2010]

Step 1: Assign. Based only on the sensitive values, construct (in a randomized fashion) an intermediate L-diverse generalization.

Step 2: Split. Based only on the quasi-identifier values (without looking at sensitive values), deterministically refine the intermediate solution to maximize utility.
Step 1: Assign
• Input table
Step 1: Assign
• St is the set of all tuples (grouped by sensitive value).
• Iteratively, remove α tuples each from the β (≥ L) most frequent sensitive values:
  – 1st iteration: β = 2, α = 2
  – 2nd iteration: β = 2, α = 1
  – 3rd iteration: β = 2, α = 1
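The iterations above can be sketched as follows. This is not the paper's exact procedure: here β and α are passed in per round to mirror the slides (Ace derives them automatically), and tie-breaking among equally frequent values is left to `Counter`:

```python
from collections import Counter

def assign(tuples, sensitive_key, rounds):
    """Ace 'Assign' step, sketched: each round removes `alpha` tuples from
    each of the `beta` (>= L) most frequent remaining sensitive values and
    groups them into one bucket.  `rounds` is a list of (beta, alpha) pairs
    as in the slides."""
    remaining = list(tuples)
    buckets = []
    for beta, alpha in rounds:
        counts = Counter(t[sensitive_key] for t in remaining)
        top = [v for v, _ in counts.most_common(beta)]
        bucket = []
        for value in top:
            take = [t for t in remaining if t[sensitive_key] == value][:alpha]
            bucket.extend(take)
            for t in take:
                remaining.remove(t)
        buckets.append(bucket)
    return buckets

# The lecture's running example (names and diseases from the next slide).
data = [{"name": n, "disease": d} for n, d in [
    ("Ann", "dyspepsia"), ("Bob", "dyspepsia"), ("Gill", "flu"), ("Ed", "flu"),
    ("Don", "bronchitis"), ("Fred", "gastritis"), ("Hera", "diabetes"),
    ("Cate", "gastritis")]]
for b in assign(data, "disease", [(2, 2), (2, 1), (2, 1)]):
    print(sorted(t["disease"] for t in b))
```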
Intermediate Generalization

Name | Age | Zip   | Disease
Ann  | 21  | 10000 | Dyspepsia
Bob  | 27  | 18000 | Dyspepsia
Gill | 60  | 63000 | Flu
Ed   | 54  | 60000 | Flu
Don  | 32  | 35000 | Bronchitis
Fred | 60  | 63000 | Gastritis
Hera | 60  | 63000 | Diabetes
Cate | 32  | 35000 | Gastritis
Step 2: Split
• If a bucket B contains α > 1 tuples of each sensitive value, split it into two buckets Ba and Bb s.t.:
  – pick 1 ≤ αa < α tuples of each sensitive value in bucket B and put them in bucket Ba; the remaining tuples go to Bb;
  – the division (Ba, Bb) is optimal in terms of utility.

Name | Age | Zip
Ann  | 21  | 10000
Bob  | 27  | 18000
Gill | 60  | 63000
Ed   | 54  | 60000
Don  | 32  | 35000
Fred | 60  | 63000
Hera | 60  | 63000
Cate | 32  | 35000
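A sketch of Split on the first bucket of the intermediate generalization. Choosing the utility-optimal division is the hard part of the real algorithm; here a greedy sort by one quasi-identifier stands in for it, which is my simplification, not the paper's method:

```python
from collections import defaultdict

def split(bucket, sensitive_key, qid_key):
    """Ace 'Split' step, sketched: a bucket with alpha > 1 tuples of each
    sensitive value is divided into two buckets that each keep every
    sensitive value.  The division looks ONLY at quasi-identifiers (here:
    greedily by `qid_key`), never at the sensitive values themselves."""
    by_value = defaultdict(list)
    for t in sorted(bucket, key=lambda t: t[qid_key]):  # QID order only
        by_value[t[sensitive_key]].append(t)
    counts = {v: len(ts) for v, ts in by_value.items()}
    alpha = min(counts.values())
    if alpha < 2 or len(set(counts.values())) != 1:
        return [bucket]  # not splittable: needs alpha > 1 of EACH value
    half = alpha // 2
    ba, bb = [], []
    for ts in by_value.values():
        ba.extend(ts[:half])   # 1 <= alpha_a < alpha tuples of each value
        bb.extend(ts[half:])
    return [ba, bb]

bucket = [{"name": "Ann", "age": 21, "disease": "dyspepsia"},
          {"name": "Bob", "age": 27, "disease": "dyspepsia"},
          {"name": "Gill", "age": 60, "disease": "flu"},
          {"name": "Ed", "age": 54, "disease": "flu"}]
for b in split(bucket, "disease", "age"):
    print(sorted(t["name"] for t in b))
```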
Why does the Ace algorithm satisfy transparent L-diversity?
• Privacy must be guaranteed with respect to I(O, A):
  – probabilities must be computed assuming I(O, A) is the actual set of all possible input tables.
• O: the output table; I(O, A): the input tables that result in O under algorithm A; I: all possible input tables.
Ace Algorithm Analysis
Lemma 1: The Assign step satisfies transparent L-diversity.
Proof (sketch):
• Consider an intermediate output Int.
• Suppose there is some input table T such that Assign(T) = Int.
• Any other table T' in which the sensitive values of 2 individuals in the same group are swapped also leads to the same intermediate output Int.
Ace algorithm analysis
Both tables result in the same intermediate output.
Ace Algorithm Analysis
Lemma 1, continued:
• Since any within-group swap of sensitive values leads to the same intermediate output, the set of input tables I(Int, A) contains all possible assignments of diseases to individuals within each group of Int.
Ace Algorithm Analysis
• The set of input tables I(Int, A) contains all possible assignments of diseases to individuals in each group of Int, so, e.g., P[Ann has dyspepsia | I(Int, A) and Int] = 1/2:

Name | Age | Zip   | Disease
Ann  | 21  | 10000 | Dyspepsia
Bob  | 27  | 18000 | Dyspepsia
Gill | 60  | 63000 | Flu
Ed   | 54  | 60000 | Flu
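The 1/2 can be verified by brute force: under I(Int, A), every assignment of the group's disease multiset to its members is equally likely. A small check (names and diseases from the slide):

```python
from itertools import permutations

# Within one group of the intermediate generalization, the attacker knows
# only the multiset of diseases; every assignment to the individuals is
# equally likely under I(Int, A).
members = ["Ann", "Bob", "Gill", "Ed"]
diseases = ["dyspepsia", "dyspepsia", "flu", "flu"]

# Distinct assignments of the multiset to (Ann, Bob, Gill, Ed).
assignments = set(permutations(diseases))
p_ann_dyspepsia = sum(a[0] == "dyspepsia" for a in assignments) / len(assignments)
print(p_ann_dyspepsia)  # 0.5
```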
Ace Algorithm Analysis
Lemma 2: The Split phase also satisfies transparent L-diversity.
Proof (sketch):
• I(Int, Assign) contains all tables in which each individual is assigned an arbitrary sensitive value within its group in Int.
• Suppose some input table T ∈ I(Int, Assign) results in the final output O after Split.
Ace Algorithm Analysis
• Split does not depend on the sensitive values.

[Figure: a table T and the swapped table T' (diseases exchanged within a bucket, e.g. between Ann/Bob and Gill/Ed) both split into the same output buckets.]
Ace Algorithm Analysis
• If T ∈ I(Int, Assign) and T results in O after Split, then T' (Table T with diseases swapped within a bucket) is also in I(Int, Assign) and also results in O after Split.
Ace Algorithm Analysis
• Lemma 2: The Split phase also satisfies transparent L-diversity.
Proof (sketch):
• Let T' be generated by "swapping diseases" within some bucket.
• If T ∈ I(Int, Assign) and T results in O after Split, then T' ∈ I(Int, Assign) and T' also results in O after Split.
• For any individual, it is equally likely that the sensitive value is any one of ≥ L choices.
• Therefore, P[individual has disease | I(O, Ace)] < 1/L.
Summary
• Many systems assume privacy/security is guaranteed by assuming the adversary does not know the algorithm.
  – This is bad …
• Simulatable algorithms avoid this problem:
  – ideally, choices made by the algorithm should be simulatable by the adversary.
• Anonymization algorithms are also susceptible to adversaries who know the algorithm or the objective function.
• Transparent anonymization limits the inference an attacker (who knows the algorithm) can make about sensitive values.
Next Class
• Composition of privacy
• Differential Privacy
References
A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, "L-Diversity: Privacy beyond k-anonymity", ICDE 2006.
K. Kenthapadi, N. Mishra, K. Nissim, "Simulatable Auditing", PODS 2005.
R. Wong, A. Fu, K. Wang, J. Pei, "Minimality attack in privacy preserving data publishing", PVLDB 2007.
X. Xiao, Y. Tao, N. Koudas, "Transparent Anonymization: Thwarting adversaries who know the algorithm", TODS 2010.