Date post: 16-Jan-2016
Data Publishing against Realistic Adversaries

Johannes Gehrke, Cornell University, Ithaca, NY

Michaela Götz, Cornell University, Ithaca, NY

Ashwin Machanavajjhala, Yahoo! Research, Santa Clara, CA

Amedeo D’Ascanio, University of Bologna, Italy


Outline

- Introduction
- ε-privacy
- Adversary knowledge
- Adversary classes
- Applying ε-privacy to generalization
- Experimental evaluation
- Conclusion


Introduction

There are many reasons to publish data. Requirements:
- Preserve aggregate information about the population
- Preserve the privacy of sensitive information

Privacy: how much information can an adversary deduce from the released data?


Example

Alice knows that Rachel is 35 and that she lives in 13058

Alice knows that Rachel is 20 and that she has a very low probability of heart disease


Previous Definitions

l-diversity:
- The adversary knows l-2 pieces of information about the sensitive attribute
- These pieces of information are equally likely

t-closeness:
- Alice knows the distribution of sensitive values
- Rachel’s chances of having a disease follow the same odds

Differential privacy:
- Alice knows the exact disease of every patient except Rachel

“It’s flu season, a lot of elderly people will be in the hospital with flu symptoms”
- How do we model such background knowledge with l-diversity or t-closeness?
- Does Alice know everything about 1 billion patients?

Unrealistic assumptions!


ε-privacy

- A flexible language to define the sensitive information about each individual
- Privacy is measured as the difference in the adversary’s belief between the published table with and without the “victim”
- Different classes of adversaries (either realistic or unrealistic) are modeled based on their knowledge


Modeling sensitive information

- Positive disclosure: Alice knows that Rachel has flu, e.g. $\text{Rachel}[\text{Disease}] = \text{Flu}$
- Negative disclosure: Alice knows that Rachel does not have flu
- Sensitive information is specified using positive disclosure on a set of sensitive predicates Φ, e.g. $\text{Rachel}[\text{Disease}] \in \{\text{Ulcer}, \text{Dyspepsia}\}$


Modeling sensitive information: Example

$\phi_1(\text{Rachel}): t_{\text{Rachel}}[S] \in \{\text{Flu}\}$

$\phi_2(\text{Rachel}): t_{\text{Rachel}}[S] \in \{\text{Cancer}\}$

$\phi_3(\text{Rachel}): t_{\text{Rachel}}[S] \in \{\text{Ulcer}, \text{Dyspepsia}\}$

Each $\phi(u)$ takes the form $t_u[S] \in S_\phi$, $S_\phi \subseteq dom(S)$, where $dom(S)$ is the domain of the sensitive attribute.

$\Phi(\text{Rachel}) = \{\phi_1, \phi_2, \phi_3\}$: Rachel can protect against any kind of disclosure for flu, cancer and any stomach disease if for each subset S

[Slide labels: Negative disclosure, Positive disclosure; False, True]


Adversary Knowledge

Knowledge from other sources: usually modeled as the joint distribution P over the N and S attributes.

$P = (\ldots, p_i, \ldots)$, $i \in N \times S$, such that $\sum_{i \in N \times S} p_i = 1$

If the adversary has no preference for any value of i:

$\forall i \in N \times S: \; p_i = \frac{1}{|N \times S|}$


Adversary Knowledge: Two problems

- Where does the adversary learn her knowledge?
  - If 10% of the population has cancer ($s_i = s/10$), then for each i, $p_i = s_i/s = 0.1$
  - What if $T_{pub}$ has only 10 entries?
- Can the adversary change her prior?
  - The probability that a woman has cancer is $p_i = 0.5$, based on a sample of 100 women
  - The adversary then reads another table with 20k tuples where $s_i$ is 2k (so that $p_i = 0.1$)
  - If her prior is not strong, $p_i$ will change accordingly


Adversary Knowledge

To model adversaries we assume that:
- The adversary may know more than one prior
- The tuples are not independent of each other

Exchangeability: a sequence of random variables X1, X2, ..., Xn is exchangeable if every finite permutation of these random variables has the same joint probability distribution.
- If H is healthy and S is sick, the probability of seeing the sequence SSHSH is the same as the probability of HHSSS

According to de Finetti’s representation theorem, an exchangeable sequence of random variables is mathematically equivalent to:
- Choosing a data-generating distribution θ at random
- Creating the data by independently sampling from this chosen distribution θ
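
The exchangeability claim can be checked numerically. The sketch below assumes, purely for illustration, a Beta(a, b) prior over θ = Pr[sick] (the two-value case of a Dirichlet prior); the function names and the parameter values a = b = 1 are hypothetical and do not come from the slides.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def sequence_probability(seq, a=1.0, b=1.0):
    """Probability of one exact H/S sequence when theta = Pr[S] is first drawn
    from Beta(a, b) (an assumed prior) and tuples are then sampled i.i.d."""
    sick = seq.count("S")
    healthy = seq.count("H")
    # Integral of theta^sick * (1 - theta)^healthy against the Beta prior
    return exp(log_beta(a + sick, b + healthy) - log_beta(a, b))

# Same counts of S and H, different order: the probabilities coincide
p1 = sequence_probability("SSHSH")
p2 = sequence_probability("HHSSS")

# Exchangeable does not mean independent: seeing one S raises Pr[next is S]
p_s = sequence_probability("S")
p_s_given_s = sequence_probability("SS") / p_s

print(p1, p2, p_s, p_s_given_s)
```

Only the counts of S and H enter the formula, which is why any permutation of the sequence gets the same probability.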


Adversary Knowledge: Example

Assume two populations of equal size, Ω1 with only healthy people and Ω2 with only sick people. Table T is drawn only from Ω1 or only from Ω2.

- If the adversary doesn’t know which population has been chosen: $\Pr[t = H] = 0.5$
- If the adversary knows that just one tuple t' is healthy, then: $\Pr[t = H \mid t' = H] = 1$
- What if the tuples are assumed independent of each other? Then still $\Pr[t = H] = 0.5$
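
The two-population example can be worked through with a small Bayes update. This is a minimal sketch under the slide's assumptions (equal prior weight on Ω1 and Ω2); the dictionary keys and function names are illustrative only.

```python
from fractions import Fraction

# For each candidate population: probability that a single tuple is healthy
populations = {"omega1_all_healthy": Fraction(1), "omega2_all_sick": Fraction(0)}

# The adversary does not know which population T was drawn from
prior = {name: Fraction(1, 2) for name in populations}

def prob_healthy(belief):
    """Pr[t = H] under the adversary's current belief over populations."""
    return sum(belief[name] * populations[name] for name in populations)

def observe_healthy(belief):
    """Bayes update of the belief after learning that one tuple is healthy."""
    evidence = prob_healthy(belief)
    return {name: belief[name] * populations[name] / evidence for name in populations}

before = prob_healthy(prior)                  # belief before any observation
after = prob_healthy(observe_healthy(prior))  # belief after one healthy tuple

print(before, after)
```

An adversary who treats tuples as independent would skip the update step and keep the first value for every tuple.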


Dirichlet Distribution

More generally, T (of size n) is generated in two steps:
- A probability vector $\vec{p}$ is drawn from a distribution D
- Then n elements are drawn i.i.d. from the probability vector $\vec{p}$

D encodes the adversary's knowledge:
- If the adversary has no prior, every $\vec{p}$ is equally likely to be drawn from D
- If an adversary knows that 999 people out of 1k have cancer, she should model D so that it draws p_no(cancer) = 0.001 and p_yes(cancer) = 0.999

The Dirichlet distribution is used to model the prior over $\vec{p}$
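
The two-step generation process can be sketched as follows. The Dirichlet sampler uses the standard construction of normalising independent Gamma draws; the function names, the seed and the parameter vectors `[1, 1]` and `[1, 999]` are assumptions chosen for illustration, not values prescribed by the slides.

```python
import random

def sample_dirichlet(sigmas, rng):
    """Draw a probability vector from Dirichlet(sigmas) via normalised Gamma draws."""
    draws = [rng.gammavariate(s, 1.0) for s in sigmas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_table(sigmas, values, n, rng):
    """Step 1: draw p from D. Step 2: draw n sensitive values i.i.d. from p."""
    p = sample_dirichlet(sigmas, rng)
    table = rng.choices(values, weights=p, k=n)
    return p, table

rng = random.Random(0)  # fixed seed so the sketch is repeatable
values = ["no cancer", "cancer"]

# Adversary with no prior: every probability vector is equally likely
p_uniform, t_uniform = generate_table([1, 1], values, 10, rng)

# Adversary who believes roughly 999 out of 1000 people have cancer
p_informed, t_informed = generate_table([1, 999], values, 10, rng)

print(p_uniform, t_uniform)
print(p_informed, t_informed)
```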


Dirichlet Distribution

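The Dirichlet distribution $D(\sigma_1, \ldots, \sigma_k)$ gives the belief that the probabilities of k rival events are $p_i$, given that each event has been observed $\sigma_i - 1$ times:

$D(\vec{p}; \vec{\sigma}) = \frac{\Gamma(\sigma)}{\prod_i \Gamma(\sigma_i)} \prod_i p_i^{\sigma_i - 1}$, where $\sigma = \sum_i \sigma_i$ and $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx$

$\sigma$ is the stubbornness and $\vec{\sigma}/\sigma$ is the shape.

- Adversary without knowledge: D(σ1,…, σk) = D(1,…,1)
- After reading a dataset with counts (σ1-1,…, σk-1), the adversary may update her prior to D(σ1,…, σk)
- In this case not all $\vec{p}$ are equally likely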
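
The update rule can be illustrated with the numbers from the "Two problems" slide: an adversary who has seen 100 women, half of them with cancer, and then reads a table of 20k tuples with 2k cancer cases. This is a minimal sketch; the helper names are hypothetical, and the expected probability σ_i/σ is used as the adversary's point estimate.

```python
def update(prior_sigmas, counts):
    """Adding observed counts to the Dirichlet parameters gives the updated prior."""
    return [s + c for s, c in zip(prior_sigmas, counts)]

def expected_probabilities(sigmas):
    """Shape of the prior: sigma_i / sigma for each sensitive value."""
    total = sum(sigmas)
    return [s / total for s in sigmas]

# Parameter order: (cancer, no cancer)
no_knowledge = [1, 1]                            # D(1, 1)
weak_prior = update(no_knowledge, [50, 50])      # after the sample of 100 women
after_table = update(weak_prior, [2000, 18000])  # after reading the 20k-tuple table

stubbornness_before = sum(weak_prior)
stubbornness_after = sum(after_table)
belief_before = expected_probabilities(weak_prior)[0]
belief_after = expected_probabilities(after_table)[0]

print(stubbornness_before, belief_before)
print(stubbornness_after, belief_after)
```

Because the first prior has low stubbornness, the larger table dominates and the belief moves from 0.5 to roughly 0.1.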


Dirichlet Distribution

- The vector with the maximum likelihood is $\vec{p}^* = \vec{\sigma}/\sigma$, i.e. $p^*_i = \sigma_i/\sigma$
- As we increase σ, $\vec{p}^*$ becomes more likely
- If $\sigma \to \infty$, $\vec{p}^*$ is the only possible probability distribution
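
One way to see the effect of stubbornness is the spread of a single component of the Dirichlet prior. The sketch below relies on the standard variance formula for a Dirichlet marginal, Var[p_i] = m_i (1 - m_i) / (σ + 1) with m_i = σ_i/σ; the shape value 0.4 and the chosen values of σ are illustrative assumptions.

```python
from math import sqrt

def marginal_std(shape_i, sigma):
    """Standard deviation of p_i under a Dirichlet prior with shape_i = sigma_i / sigma."""
    return sqrt(shape_i * (1 - shape_i) / (sigma + 1))

shape_flu = 0.4  # assumed shape component, kept fixed while sigma grows
spreads = {sigma: marginal_std(shape_flu, sigma) for sigma in (10, 1_000, 100_000)}

for sigma, std in spreads.items():
    print(sigma, std)
```

The spread shrinks toward zero as σ grows, so the prior concentrates on the shape vector.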


Other Adversary Knowledge

Knowledge from individuals inside the published table

Full knowledge about a subset B of tuples in T


Definition

After $T_{pub}$ is published, the adversary's belief in a sensitive predicate $\phi(u)$ about an individual u in T is $p_{in}$.

If the individual u is removed from T, the belief becomes $p_{out}$.


Definition

- $p_{in}$ should not be much greater than $p_{out}$
- The greater it is, the more information about an individual’s sensitive predicate the adversary learns
- A table does not respect ε-privacy if $\frac{p_{in}}{p_{out}} > \epsilon$ or $\frac{1 - p_{out}}{1 - p_{in}} > \epsilon$


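Adversary Classes

Defined based on their prior built over the distribution of sensitive values:
- Class I: fixed $\sigma$ and fixed shape $\vec{\sigma}/\sigma$
- Class II: fixed $\sigma$ and arbitrary shape, i.e. any $D(\vec{\sigma})$ such that $\sum_{s \in S} \sigma(s) = \sigma$
- Class III: fixed shape $\vec{\sigma}/\sigma$ and arbitrarily large $\sigma$
- Class IV: arbitrary $\sigma$ and $\vec{\sigma}/\sigma$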


Adversary classes - Examples

Suppose there is another dataset with 30,000 tuples: 12,000 with flu and 18,000 with cancer.

- Class I: σ = 30k, D(12k, 18k)
- Class II: σ = 30k, arbitrary shape
- Class III: arbitrary σ, distribution (.4, .6)
- Class IV: arbitrary prior

Rachel is in the table. pin(flu) = .9 for all adversaries (it depends only on the published table).

pout(flu) changes for each adversary.


Adversary classes - Examples

- Class I: pout(flu) = (18k+12k)/(20k+30k) = .6
- Class II: pout(flu) = (18k+1)/(20k+30k) = .36002
- Class III: pout(flu) = .4
- Class IV: pout(flu) can take any value

So Rachel is granted ε-privacy with ε = 4, 6.4 and 6 against class I, II and III adversaries respectively, and no privacy against a class IV adversary.
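
The figures on this slide can be reproduced with exact arithmetic. The sketch assumes that a table is ε-private when both p_in/p_out ≤ ε and (1 - p_out)/(1 - p_in) ≤ ε, which is the reading that yields the slide's values 6.4 and 6; the function name `smallest_epsilon` is hypothetical.

```python
from fractions import Fraction

def smallest_epsilon(p_in, p_out):
    """Smallest epsilon for which both ratio conditions hold (assumed definition)."""
    return max(p_in / p_out, (1 - p_out) / (1 - p_in))

# Published table: 18k flu tuples out of 20k, so p_in(flu) = 0.9
table_flu, table_size = 18_000, 20_000
p_in = Fraction(table_flu, table_size)

p_out = {
    "I": Fraction(table_flu + 12_000, table_size + 30_000),  # prior D(12k, 18k)
    "II": Fraction(table_flu + 1, table_size + 30_000),      # worst-case shape
    "III": Fraction(4, 10),                                  # shape (.4, .6), huge sigma
}

epsilons = {cls: smallest_epsilon(p_in, p) for cls, p in p_out.items()}

for cls, eps in epsilons.items():
    print(cls, float(p_out[cls]), float(eps))
```

For all three classes the binding condition is the second ratio, since p_in is close to 1 and the adversary's belief that Rachel does not have flu drops sharply once she is in the table.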


Generalization and ε-privacy

- The set of sensitive predicates for each individual u is $\Phi(u) = \{\{s\} \mid s \in S\}$
- We can define a set of constraints that have to be checked during the generalization process


Check for Class I

- R1 and R2 have to be respected
- Combination of anonymity and closeness


Check for Class II

- R1 and R2 have to be respected
- Combination of anonymity and diversity


Check for Class III

- R1 and R2 have to be respected
- Only closeness

ε-privacy doesn’t guarantee privacy against a class IV adversary


Monotonicity

- Let T1 and T2 be generalizations of T such that T2 is a further generalization of T1: if T1 satisfies ε-privacy, then T2 also satisfies ε-privacy
- Useful for algorithms such as Incognito, Mondrian and the PET algorithm
- All checks shown before can be performed with time complexity O(N)


Choosing Parameters

The choice is application dependent. Example: US Census
- Stubbornness: number of individuals
- Shape: distribution of sensitive values
- Epsilon: between 10 and 100. Why?


Experimental results

- Data from the Minnesota Population Center with nearly 3M tuples
- The greater the stubbornness, the greater the epsilon needed to achieve privacy
- With small values of σ the cost function is better
- The average group size increases with σ


Embedding prior work

ε-privacy can cover some instantiations of:
- Recursive (c,2)-diversity
- Differential privacy
- t-closeness


Conclusions

- Definition of ε-privacy
- Definition of realistic adversaries
- How to cover scenarios not taken into account in previous work
- ε-privacy in the generalization process

Future work:
- Considering correlation between sensitive and non-sensitive values
- Applying ε-privacy to other algorithms
