
Lecture 6 : 590.03 Fall 12 1

Simulatability
“The enemy knows the system”, Claude Shannon

CompSci 590.03
Instructor: Ashwin Machanavajjhala

Lecture 6 : 590.03 Fall 12 2

Announcements
• Please meet with me at least 2 times before you finalize your project (deadline Sep 28).

Lecture 6 : 590.03 Fall 12 3

Recap – L-Diversity
• The link between identity and attribute value is the sensitive information.
  “Does Bob have Cancer? Heart disease? Flu?”
  “Does Umeko have Cancer? Heart disease? Flu?”

• The adversary knows ≤ L-2 negation statements, e.g., “Umeko does not have Heart Disease.”
  – The data publisher may not know the exact adversarial knowledge.

• Privacy is breached when an identity can be linked to an attribute value with high probability:
  Pr[“Bob has Cancer” | published table, adv. knowledge] > t

Lecture 6 : 590.03 Fall 12 4

Recap – 3-Diverse Table

Zip    Age    Nat.   Disease
1306*  <=40   *      Heart
1306*  <=40   *      Flu
1306*  <=40   *      Cancer
1306*  <=40   *      Cancer
1485*  >40    *      Cancer
1485*  >40    *      Heart
1485*  >40    *      Flu
1485*  >40    *      Flu
1305*  <=40   *      Heart
1305*  <=40   *      Flu
1305*  <=40   *      Cancer
1305*  <=40   *      Cancer

L-Diversity Principle: Every group of tuples with the same Q-ID values has ≥ L distinct sensitive values of roughly equal proportions.

Lecture 6 : 590.03 Fall 12 5

Outline
• Simulatable Auditing
• Minimality Attack in anonymization
• Simulatable algorithms for anonymization

Lecture 6 : 590.03 Fall 12 6

Query Auditing

The database has numeric values (say, salaries of employees). It either truthfully answers a question or denies answering.

Queries: MIN, MAX, SUM over subsets of the database.

Question: When to allow/deny queries?

[Diagram: a Researcher sends a Query to the Database; an auditor asks “Safe to publish?” and the system either answers (Yes) or denies (No).]

Lecture 6 : 590.03 Fall 12 7

Why should we deny queries?
• Q1: Ben’s sensitive value?
  – DENY
• Q2: Max sensitive value of males?
  – ANSWER: 2
• Q3: Max sensitive value of 1st-year PhD students?
  – ANSWER: 3
• But Q2 + Q3 => Xi = 3: every male value is at most 2, so the 1st-year maximum of 3 must come from the only female 1st-year student, Xi.

Name  1st-year PhD  Gender  Sensitive value
Ben   Y             M       1
Bha   N             M       1
Ios   Y             M       1
Jan   N             M       2
Jian  Y             M       2
Jie   N             M       1
Joe   N             M       2
Moh   N             M       1
Son   N             F       1
Xi    Y             F       3
Yao   N             M       2

Lecture 6 : 590.03 Fall 12 8

Value-Based Auditing
• Let a1, a2, …, ak be the answers to previous queries Q1, Q2, …, Qk, and let ak+1 be the answer to Qk+1.

  ai = f(ci1 x1, ci2 x2, …, cin xn),  i = 1 … k+1,
  where cim = 1 if Qi depends on xm, and 0 otherwise.

• Check whether any xj has a unique solution given these constraints; if so, deny Qk+1.
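For SUM queries, this check is plain linear algebra: the answered queries form a linear system C x = a, and xj has a unique solution exactly when the unit vector ej lies in the row space of C (equivalently, ej is orthogonal to the null space of C). Below is a minimal sketch of that test in Python/NumPy; the function name and the example queries are illustrative, not from the lecture.

```python
import numpy as np

def uniquely_determined(C, j, tol=1e-9):
    """For SUM queries C @ x = a, x[j] is pinned down exactly when the
    unit vector e_j lies in the row space of C: then every x consistent
    with the answers agrees on x[j]."""
    e_j = np.zeros(C.shape[1])
    e_j[j] = 1.0
    # Least-squares projection of e_j onto the row space (columns of C.T).
    coeffs, *_ = np.linalg.lstsq(C.T, e_j, rcond=None)
    return np.linalg.norm(e_j - C.T @ coeffs) < tol

# Example: SUM(x1, x2, x3) and SUM(x2, x3) together pin down x1.
C = np.array([[1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
print(uniquely_determined(C, 0))  # True  -> answering both reveals x1
print(uniquely_determined(C, 1))  # False -> x2 remains ambiguous
```

For MAX and MIN queries the check is interval reasoning rather than linear algebra, as the following slides illustrate.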

Lecture 6 : 590.03 Fall 12 9

Value-based Auditing
• Data values: {x1, x2, x3, x4, x5}; queries: MAX.
• Allow a query only if no xi’s value can be inferred.

[Figure: the five values x1 … x5 on a number line, x5 being the largest.]

Query 1: max(x1, x2, x3, x4, x5). Answer: 10.
  The attacker now knows -∞ ≤ x1 … x5 ≤ 10.

Query 2: max(x1, x2, x3, x4). True answer: 8.
  Answering would reveal -∞ ≤ x1 … x4 ≤ 8, and hence x5 = 10. So the auditor DENIES.

But the denial itself leaks. What could max(x1, x2, x3, x4) be?
• From the first answer, max(x1, x2, x3, x4) ≤ 10.
• If max(x1, x2, x3, x4) = 10, answering would cause no privacy breach, so the query would not have been denied.
• Hence max(x1, x2, x3, x4) < 10, which forces x5 = 10!

Denials leak information. The attack occurred because the privacy analysis did not assume that the attacker knows the algorithm.
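A toy version of this flawed value-based auditor makes the leak explicit: the deny decision peeks at the true answer, so the decision itself depends on the secret data. The helper names are mine, and the breach test is a simplified sufficient condition for exact MAX answers, not the full auditing machinery.

```python
def max_breaches(queries, answers, n):
    """True if some x[i] is pinned by exact MAX answers. ub[i] is the
    tightest upper bound on x[i] implied by the queries covering i."""
    ub = [min([a for q, a in zip(queries, answers) if i in q],
              default=float("inf")) for i in range(n)]
    for q, a in zip(queries, answers):
        # The maximum a must be attained by some element of q; if only one
        # element's upper bound reaches a, that element must equal a.
        if sum(1 for i in q if ub[i] >= a) == 1:
            return True
    return False

def value_based_deny(past_q, past_a, new_q, data):
    true_answer = max(data[i] for i in new_q)        # peeks at the data!
    return max_breaches(past_q + [new_q], past_a + [true_answer], len(data))

data = [3, 5, 8, 7, 10]
q1, q2 = {0, 1, 2, 3, 4}, {0, 1, 2, 3}
print(value_based_deny([], [], q1, data))      # False: answering 10 is safe
print(value_based_deny([q1], [10], q2, data))  # True: denied, yet the denial
                                               # itself tells the attacker x5 = 10
```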

Lecture 6 : 590.03 Fall 12 18

Simulatable Auditing [Kenthapadi et al., PODS ’05]

• An auditor is simulatable if the decision to deny a query Qk is made based only on information already available to the attacker.
  – Can use queries Q1, Q2, …, Qk and answers a1, a2, …, ak-1.
  – Cannot use ak or the actual data to make the decision.

• Denials provably do not leak information.
  – The attacker could equivalently determine whether the query would be denied.
  – The attacker can mimic, or simulate, the auditor.

Lecture 6 : 590.03 Fall 12 19

Simulatable Auditing Algorithm
• Data values: {x1, x2, x3, x4, x5}; queries: MAX.
• Allow a query only if no xi’s value can be inferred.

After max(x1, x2, x3, x4, x5) = 10 has been answered, the query max(x1, x2, x3, x4) arrives. Before computing its answer, the auditor considers every answer consistent with the past:
• Ans > 10: not possible.
• Ans = 10: then -∞ ≤ x1 … x4 ≤ 10 (SAFE).
• Ans < 10: then x5 = 10 (UNSAFE).

Since some consistent answer is unsafe, the auditor DENIES, without ever looking at the data.
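The same toy model becomes simulatable by removing the peek: instead of computing the true answer, the auditor enumerates answer values consistent with the past and denies if any of them would cause a breach. A minimal sketch, reusing max_breaches, q1, q2 from the previous sketch; testing each current upper bound plus one value strictly below them is my simplification of the regime analysis, not the algorithm from the paper.

```python
def simulatable_deny(past_q, past_a, new_q, n):
    """Decide from past queries/answers only: deny if ANY answer value
    consistent with the past would let max_breaches fire."""
    ub = [min([a for q, a in zip(past_q, past_a) if i in q],
              default=float("inf")) for i in range(n)]
    bounds = sorted({ub[i] for i in new_q if ub[i] != float("inf")})
    # In this toy model the breach behaviour only changes at these breakpoints.
    candidates = bounds + ([bounds[0] - 1] if bounds else [])
    return any(max_breaches(past_q + [new_q], past_a + [a], n)
               for a in candidates)

# The attacker can run the identical check, so a denial reveals nothing new.
print(simulatable_deny([q1], [10], q2, 5))  # True: denied without touching the data
```

Note the cost: the query is denied even in the case Ans = 10, where answering would have been harmless. Simulatable auditors buy provable safety by denying more often.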

Lecture 6 : 590.03 Fall 12 20

Summary of Simulatable Auditing

• The decision to deny must be based only on previously answered queries; in some (many!) cases this forces the auditor to deny.

• Denials can leak information if the adversary does not know all the information that is used to decide whether to deny the query.

Lecture 6 : 590.03 Fall 12 21

Outline
• Simulatable Auditing
• Minimality Attack in anonymization
• Simulatable algorithms for anonymization

Lecture 6 : 590.03 Fall 12 22

Minimality Attack on Generalization Algorithms

• Algorithms for K-anonymity, L-diversity, T-closeness, etc. try to maximize utility.
  – They find a minimally generalized table in the lattice that satisfies privacy and maximizes utility.

• But … the attacker also knows this algorithm!

Lecture 6 : 590.03 Fall 12 23

Example: Minimality Attack [Wong et al., VLDB ’07]

• Dataset with one quasi-identifier attribute and two values q1, q2; both generalize to Q.
• Sensitive attribute: Cancer (yes/no).
• We want to ensure P[Cancer = yes] < ½.
  – It is OK to learn that an individual does not have Cancer.

• Published table:

QID  Cancer
Q    Yes
Q    Yes
Q    No
Q    No
q2   No
q2   No

Lecture 6 : 590.03 Fall 12 24

Which input datasets could have led to the published table?

Output dataset ({q1, q2} → Q; “2-diverse”):

QID  Cancer
Q    Yes
Q    Yes
Q    No
Q    No
q2   No
q2   No

Case 1: the input had 3 occurrences of q1. Possible input datasets:

QID  Cancer       QID  Cancer
q1   Yes          q1   Yes
q1   Yes          q1   No
q1   No           q1   No
q2   No           q2   Yes
q2   No           q2   No
q2   No           q2   No

But then a less generalized table would already satisfy the privacy requirement:

QID  Cancer
q1   Yes
Q    No
Q    No
q2   Yes
q2   No
q2   No

This is a better generalization, so the utility-maximizing algorithm would have published it instead. Inputs with 3 occurrences of q1 are ruled out.

Case 2: the input had 1 occurrence of q1. Possible input datasets:

QID  Cancer       QID  Cancer
q2   Yes          q2   Yes
q1   Yes          q2   Yes
q2   No           q1   No
q2   No           q2   No
q2   No           q2   No
q2   No           q2   No

Again, a better generalization exists:

QID  Cancer
q2   Yes
Q    No
Q    No
q2   Yes
q2   No
q2   No

So these inputs are ruled out as well. There must be exactly two tuples with q1.

Case 3: the input had 2 occurrences of q1. Possible input datasets:

QID  Cancer       QID  Cancer       QID  Cancer
q1   Yes          q2   Yes          q1   Yes
q1   Yes          q2   Yes          q2   Yes
q2   No           q1   No           q1   No
q2   No           q1   No           q2   No
q2   No           q2   No           q2   No
q2   No           q2   No           q2   No

The second and third tables already satisfy privacy (learning Cancer = NO is OK), so the algorithm would have published them unchanged. The first table is the ONLY input that results in the published output!

P[Cancer = yes | q1] = 1
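This case analysis is exactly a computation of I(O, A): enumerate every candidate input, run the (publicly known) algorithm, and keep the inputs that map to the published output. Below is a brute-force sketch under simplifying assumptions of mine: rows are encoded as counts (individuals within a QID value are exchangeable), the minimal generalizer moves as few rows as possible into a single Q-group, and privacy is read as P[Cancer = yes] ≤ 1/2 per group so that the published “2-diverse” table itself counts as private.

```python
from itertools import product

N, YES = 6, 2   # 6 tuples, 2 of them with Cancer = yes

def private(rows, yes):
    # Privacy goal: P[Cancer = yes] <= 1/2 in every nonempty QID group.
    return rows == 0 or 2 * yes <= rows

def minimal_outputs(n1, y1, n2, y2):
    """All outputs a minimal generalizer could publish for an input with
    n1 q1-rows (y1 yes) and n2 q2-rows (y2 yes): move g1 q1-rows (gy1 yes)
    and g2 q2-rows (gy2 yes) into a Q-group, generalizing as few rows as
    possible while every group stays private."""
    best, outs = None, []
    for g1, g2 in product(range(n1 + 1), range(n2 + 1)):
        for gy1 in range(max(0, g1 - (n1 - y1)), min(y1, g1) + 1):
            for gy2 in range(max(0, g2 - (n2 - y2)), min(y2, g2) + 1):
                if (private(g1 + g2, gy1 + gy2)
                        and private(n1 - g1, y1 - gy1)
                        and private(n2 - g2, y2 - gy2)):
                    out = (g1 + g2, gy1 + gy2,       # Q-group rows / yes
                           n1 - g1, y1 - gy1,        # remaining q1 rows / yes
                           n2 - g2, y2 - gy2)        # remaining q2 rows / yes
                    if best is None or g1 + g2 < best:
                        best, outs = g1 + g2, [out]
                    elif g1 + g2 == best and out not in outs:
                        outs.append(out)
    return outs

published = (4, 2, 0, 0, 2, 0)   # 4 Q-rows with 2 yes; 2 q2-rows with 0 yes
for n1 in range(N + 1):
    for y1 in range(min(n1, YES) + 1):
        n2, y2 = N - n1, YES - y1
        if y2 <= n2 and published in minimal_outputs(n1, y1, n2, y2):
            print(f"consistent input: {n1} q1-rows ({y1} yes), {n2} q2-rows ({y2} yes)")
# Prints only n1 = 2, y1 = 2: every q1 individual has Cancer, so
# P[Cancer = yes | q1] = 1 even though the published table looks 2-diverse.
```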

Lecture 6 : 590.03 Fall 12 32

Outline
• Simulatable Auditing
• Minimality Attack in anonymization
• Transparent Anonymization: simulatable algorithms for anonymization

Lecture 6 : 590.03 Fall 12 33

Transparent Anonymization
• Assume that the adversary knows the algorithm A that is being used.

  I: all possible input tables
  O: output table
  I(O, A): the input tables that result in O under algorithm A

Transparent Anonymization
• Privacy must be guaranteed with respect to I(O, A).
  – Probabilities must be computed assuming I(O, A) is the actual set of all possible input tables.

• What is an efficient algorithm for transparent anonymization? For L-diversity in particular?

Lecture 6 : 590.03 Fall 12 35

Ace Algorithm [Xiao et al., TODS ’10]

Step 1: Assign
Based only on the sensitive values, construct (in a randomized fashion) an intermediate L-diverse generalization.

Step 2: Split
Based only on the quasi-identifier values (and without looking at sensitive values), deterministically refine the intermediate solution to maximize utility.

Lecture 6 : 590.03 Fall 12 36

Step 1: Assign
• Input table (figure; the same 8 tuples shown in the “Intermediate Generalization” slide below).

Lecture 6 : 590.03 Fall 12 37

Step 1: Assign
• St is the set of all tuples, grouped by sensitive value.
• Iteratively, remove α tuples each from the β (≥ L) most frequent sensitive values (sketched in code below):
  – 1st iteration: β = 2, α = 2
  – 2nd iteration: β = 2, α = 1
  – 3rd iteration: β = 2, α = 1
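A minimal sketch of the Assign idea in its simplest (α = 1) variant: repeatedly draw one tuple uniformly at random from each of the L most frequent remaining sensitive values. The random draw within each sensitive value is exactly what Lemma 1 below relies on; the data is the 8-person table from the next slide, while the function shape and the α = 1 policy are my simplification of Ace.

```python
import random
from collections import defaultdict

def assign(tuples, L, seed=None):
    """Form L-diverse buckets using sensitive values only: each bucket takes
    one randomly chosen tuple from each of the L most frequent remaining
    sensitive values, so no bucket repeats a value."""
    rng = random.Random(seed)
    groups = defaultdict(list)              # sensitive value -> its tuples
    for t in tuples:
        groups[t[-1]].append(t)             # t = (name, age, zip, disease)
    buckets = []
    while len(groups) >= L:
        top = sorted(groups, key=lambda v: len(groups[v]), reverse=True)[:L]
        bucket = [groups[v].pop(rng.randrange(len(groups[v]))) for v in top]
        for v in top:
            if not groups[v]:
                del groups[v]
        buckets.append(bucket)
    # Leftovers (fewer than L distinct values remain) would be merged into
    # existing buckets by the real algorithm; omitted in this sketch.
    return buckets

people = [("Ann", 21, 10000, "dyspepsia"), ("Bob", 27, 18000, "dyspepsia"),
          ("Gill", 60, 63000, "flu"),      ("Ed", 54, 60000, "flu"),
          ("Don", 32, 35000, "bronchitis"), ("Fred", 60, 63000, "gastritis"),
          ("Hera", 60, 63000, "diabetes"), ("Cate", 32, 35000, "gastritis")]
for b in assign(people, L=2, seed=0):
    print([p[0] for p in b], [p[-1] for p in b])
```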

Lecture 6 : 590.03 Fall 12 41

Intermediate Generalization

Bucket 1:
Name  Age  Zip    Disease
Ann   21   10000  Dyspepsia
Bob   27   18000  Dyspepsia
Gill  60   63000  Flu
Ed    54   60000  Flu

Bucket 2:
Name  Age  Zip    Disease
Don   32   35000  Bronchitis
Fred  60   63000  Gastritis
Hera  60   63000  Diabetes
Cate  32   35000  Gastritis

Lecture 6 : 590.03 Fall 12 42

Step 2: Split
• If a bucket contains α > 1 tuples of each sensitive value, split it into two buckets Ba and Bb s.t.:
  – Pick 1 ≤ αa < α tuples from each sensitive value in bucket B and put them in bucket Ba; the remaining tuples go to Bb.
  – The division (Ba, Bb) is optimal in terms of utility (see the sketch below).

Name  Age  Zip
Ann   21   10000
Bob   27   18000
Gill  60   63000
Ed    54   60000
Don   32   35000
Fred  60   63000
Hera  60   63000
Cate  32   35000
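A minimal sketch of the Split idea for the α = 2 bucket above: enumerate the ways of picking one tuple per sensitive value for Ba, score every split using the quasi-identifiers only, and keep the best. The spread-based cost is an assumed stand-in for the paper's actual utility metric; what matters is that the choice is deterministic and never consults the sensitive values beyond the per-value grouping fixed by Assign.

```python
from itertools import product

def qid_cost(bucket):
    """Assumed utility proxy: spread of age plus (scaled) spread of zip."""
    ages = [t[1] for t in bucket]
    zips = [t[2] for t in bucket]
    return (max(ages) - min(ages)) + (max(zips) - min(zips)) / 1000

def split(bucket):
    """Split a bucket holding 2 tuples of each sensitive value into two
    1-of-each buckets, minimizing total QID spread. Deterministic, and
    driven only by the quasi-identifiers."""
    by_value = {}
    for t in bucket:                        # t = (name, age, zip, disease)
        by_value.setdefault(t[3], []).append(t)
    best = None
    for picks in product(*(range(len(ts)) for ts in by_value.values())):
        ba = [ts[i] for ts, i in zip(by_value.values(), picks)]
        bb = [t for t in bucket if t not in ba]
        cost = qid_cost(ba) + qid_cost(bb)
        if best is None or cost < best[0]:
            best = (cost, ba, bb)
    return best[1], best[2]

bucket1 = [("Ann", 21, 10000, "dyspepsia"), ("Bob", 27, 18000, "dyspepsia"),
           ("Gill", 60, 63000, "flu"),      ("Ed", 54, 60000, "flu")]
ba, bb = split(bucket1)
print([t[0] for t in ba], [t[0] for t in bb])
# -> ['Ann', 'Gill'] ['Bob', 'Ed']  (ties broken by enumeration order)
```

Because the two candidate divisions happen to tie under this cost, the enumeration order decides; any deterministic, QID-only tie-break preserves the transparency argument.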

Lecture 6 : 590.03 Fall 12 43

Why does the Ace algorithm satisfy Transparent L-Diversity?

• Privacy must be guaranteed with respect to I(O, A).
  – Probabilities must be computed assuming I(O, A) is the actual set of all possible input tables.

  I: all possible input tables
  O: output table
  I(O, A): the input tables that result in O under algorithm A

Lecture 6 : 590.03 Fall 12 44

Ace Algorithm Analysis

Lemma 1: The Assign step satisfies transparent L-diversity.

Proof (sketch):
• Consider an intermediate output Int.
• Suppose there is some input table T such that Assign(T) = Int.
• Any other table T’, in which the sensitive values of 2 individuals in the same group are swapped, also leads to the same intermediate output Int.

Lecture 6 : 590.03 Fall 12 45

Ace Algorithm Analysis

[Figure: two input tables differing only by a swap of two individuals’ diseases within a group.] Both tables result in the same intermediate output.

Lecture 6 : 590.03 Fall 12 46

Ace Algorithm Analysis

Lemma 1: The Assign step satisfies transparent L-diversity.

Proof (sketch):
• Consider an intermediate output Int and an input table T with Assign(T) = Int.
• Any other table T’, in which the sensitive values of 2 individuals in the same group are swapped, also leads to the same intermediate output.
• Hence the set of input tables I(Int, A) contains all possible assignments of diseases to individuals within each group of Int.

Lecture 6 : 590.03 Fall 12 47

Ace Algorithm Analysis

Lemma 1: The Assign step satisfies transparent L-diversity.

Proof (sketch):
• The set of input tables I(Int, A) contains all possible assignments of diseases to individuals within each group of Int.
• In the group below, 2 of the 4 individuals have Dyspepsia and all assignments are equally likely, so
  P[Ann has dyspepsia | I(Int, A) and Int] = 2/4 = 1/2.

Name  Age  Zip    Disease
Ann   21   10000  Dyspepsia
Bob   27   18000  Dyspepsia
Gill  60   63000  Flu
Ed    54   60000  Flu

Lecture 6 : 590.03 Fall 12 48

Ace Algorithm Analysis

Lemma 2: The Split phase also satisfies transparent L-diversity.

Proof (sketch):
• I(Int, Assign) contains all tables in which each individual is assigned an arbitrary sensitive value within its group in Int.
• Suppose some input table T ∈ I(Int, Assign) results in the final output O after Split.

Lecture 6 : 590.03 Fall 12 49

Ace Algorithm Analysis
• Split does not depend on the sensitive values.

[Figure: the bucket {Ann, Bob: dyspepsia; Gill, Ed: flu} splits into two buckets, each with one dyspepsia and one flu tuple. In the swapped table T’, with the diseases of Ann and Bob (and of Gill and Ed) exchanged, Split produces exactly the same division.]

Lecture 6 : 590.03 Fall 12 50

Ace Algorithm Analysis

If T ∈ I(Int, Assign) and it results in O after Split,
then T’ ∈ I(Int, Assign) and it also results in O after Split.

[Figure: Table T and Table T’ side by side.]

Lecture 6 : 590.03 Fall 12 51

Ace Algorithm Analysis
• Lemma 2: The Split phase also satisfies transparent L-diversity.

Proof (sketch):
• Let T’ be generated by “swapping diseases” within some bucket.
• If T ∈ I(Int, Assign) and it results in O after Split, then T’ ∈ I(Int, Assign) and it also results in O after Split.
• For any individual, the sensitive value is therefore equally likely to be any of ≥ L choices.
• Therefore, P[individual has disease | I(O, Ace)] ≤ 1/L.

Lecture 6 : 590.03 Fall 12 52

Summary
• Many systems assume privacy/security is guaranteed because the adversary does not know the algorithm.
  – This is bad practice (“the enemy knows the system”).

• Simulatable algorithms avoid this problem.
  – Ideally, every choice made by the algorithm should be simulatable by the adversary.

• Anonymization algorithms are also susceptible to adversaries who know the algorithm or its objective function.

• Transparent anonymization limits the inference an attacker who knows the algorithm can make about sensitive values.

Lecture 6 : 590.03 Fall 12 53

Next Class
• Composition of privacy
• Differential Privacy

Lecture 6 : 590.03 Fall 12 54

References
A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, “L-Diversity: Privacy beyond k-anonymity”, ICDE 2006.
K. Kenthapadi, N. Mishra, K. Nissim, “Simulatable Auditing”, PODS 2005.
R. Wong, A. Fu, K. Wang, J. Pei, “Minimality attack in privacy preserving data publishing”, VLDB 2007.
X. Xiao, Y. Tao, N. Koudas, “Transparent Anonymization: Thwarting adversaries who know the algorithm”, TODS 2010.