
Data Publishing against Realistic Adversaries

Johannes Gehrke, Cornell University, Ithaca, NY

Michaela Götz, Cornell University, Ithaca, NY

Ashwin Machanavajjhala, Yahoo! Research, Santa Clara, CA

Amedeo D’Ascanio, University Of Bologna, Italy

Outline

Introduction
ε-privacy
Adversary knowledge
Adversary classes
Applying ε-privacy to generalization
Experimental evaluation
Conclusion



Introduction

There are many reasons to publish data. Two requirements:
Preserve aggregate information about the population
Preserve the privacy of sensitive information

Privacy: how much information can an adversary deduce from the released data?



Example

Alice knows that Rachel is 35 and lives in 13058

Alice knows that Rachel is 20 and has a very low probability of heart disease



Previous Definitions

ℓ-diversity
The adversary knows ℓ−2 pieces of information about the sensitive attribute
All pieces of information are considered equally likely

t-closeness
Alice knows the distribution of sensitive values
Rachel’s chances of having a disease follow the same odds

Differential privacy
Alice knows the exact disease of every patient except Rachel

“It’s flu season, a lot of elderly people will be in the hospital with flu symptoms”
How do we model such background knowledge with ℓ-diversity or t-closeness?
Does Alice know everything about 1 billion patients?

Unrealistic assumptions!



ε-privacy

A flexible language to define information about each individual

Privacy is measured as the difference in the adversary’s belief between the table published with and without the “victim”

Different classes of adversaries (either realistic or unrealistic) are modeled based on their knowledge



Modeling sensitive information

Positive disclosure: Alice knows that Rachel has flu

Negative disclosure: Alice knows that Rachel does not have flu

Sensitive information is modeled using positive disclosures on a set of sensitive predicates Φ

$\text{Rachel}[\text{Disease}] \in \{\text{Ulcer}, \text{Dyspepsia}\}$

$\text{Rachel}[\text{Disease}] \neq \text{Flu}$



Modeling sensitive information: Example

$\phi_1(\text{Rachel}): t_{\text{Rachel}}[S] \in \{\text{Flu}\}$
$\phi_2(\text{Rachel}): t_{\text{Rachel}}[S] \in \{\text{Cancer}\}$
$\phi_3(\text{Rachel}): t_{\text{Rachel}}[S] \in \{\text{Ulcer}, \text{Dyspepsia}\}$

Each $\phi(u)$ takes the form $t_u.S \in S_\phi$, $S_\phi \subset dom(S)$, where dom(S) is the domain of the sensitive attribute

With $\{\phi_1, \phi_2, \phi_3\}$, Rachel can protect against any kind of disclosure for flu, cancer and any stomach disease if for each subset S

[Figure: positive and negative disclosure, with truth values True / False]
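The predicate form above can be sketched in code. This is only an illustrative sketch: the domain `DOM_S` and the helper names `holds` and `negative_disclosure` are assumptions, not part of the original slides. It shows that a negative disclosure ("Rachel does not have flu") can be written as a positive disclosure on the complement set.

```python
# Hypothetical domain of the sensitive attribute S (illustrative values only).
DOM_S = frozenset({"Flu", "Cancer", "Ulcer", "Dyspepsia", "Heart Disease"})

# Each sensitive predicate phi(u) is represented by its subset S_phi of dom(S).
PHI_RACHEL = {
    "phi1": frozenset({"Flu"}),
    "phi2": frozenset({"Cancer"}),
    "phi3": frozenset({"Ulcer", "Dyspepsia"}),
}


def holds(predicate, value):
    """phi(u) is true when the individual's sensitive value lies in S_phi."""
    return value in predicate


def negative_disclosure(values):
    """Express 'u has none of `values`' as a positive predicate on the complement."""
    return DOM_S - frozenset(values)


# "Rachel does not have flu" becomes membership in dom(S) minus {Flu}.
not_flu = negative_disclosure({"Flu"})
print(holds(PHI_RACHEL["phi3"], "Ulcer"), holds(not_flu, "Flu"))
```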



Adversary Knowledge

Knowledge from other sources
Usually modeled as the joint distribution P over the N and S attributes:

$P = (\ldots, p_i, \ldots),\ i \in N \times S$, such that $\sum_{i \in N \times S} p_i = 1$

If the adversary has no preference for any value of i:

$\forall i \in N \times S,\ p_i = \frac{1}{|N \times S|}$



Adversary Knowledge: Two problems

Where does the adversary learn their knowledge?
If 10% of the population has cancer (s_i = s/10), then for each i, p_i = s_i/s = 0.1
What if Tpub has only 10 entries?

Can the adversary change their prior?
The probability that a woman has cancer is p_i = 0.5, based on a sample of 100 women
The adversary then reads another table with 20k tuples in which s_i is 2k (so that p_i = 0.1)
If the prior is not strong, p_i will change accordingly
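The second problem can be made concrete with a small calculation. This is a sketch under an assumption not stated on the slide: the prior is treated as a number of pseudo-observations that is simply pooled with the new counts. The function name `updated_belief` and the "strong prior" size of 1,000,000 are hypothetical.

```python
def updated_belief(prior_p, prior_n, observed_positive, observed_n):
    """Pool a prior (treated as prior_n pseudo-observations) with new counts."""
    prior_positive = prior_p * prior_n
    return (prior_positive + observed_positive) / (prior_n + observed_n)


# Weak prior from the slide: p = 0.5 based on 100 women,
# then a table with 20k tuples of which 2k have cancer.
weak = updated_belief(0.5, 100, 2_000, 20_000)

# Hypothetical strong prior: the same p = 0.5 but backed by 1,000,000 observations.
strong = updated_belief(0.5, 1_000_000, 2_000, 20_000)

print(round(weak, 4), round(strong, 4))
```

With the weak prior the belief moves almost all the way to 0.1, while the strong prior barely moves from 0.5.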



Adversary Knowledge

To model adversaries we assume that:
The adversary may hold more than one prior
The tuples are not independent of each other

Exchangeability: a sequence of random variables X1, X2, ..., Xn is exchangeable if every finite permutation of these random variables has the same joint probability distribution
If H is healthy and S is sick, the probability of seeing the sequence SSHSH is the same as the probability of HHSSS

According to de Finetti’s representation theorem, an exchangeable sequence of random variables is mathematically equivalent to:
Choosing a data-generating distribution θ at random
Creating the data by independently sampling from this chosen distribution θ
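The two-step view can be checked on the SSHSH / HHSSS example. This is a minimal sketch that assumes θ (the probability of "sick") is drawn from a Beta(a, b) distribution; the Beta choice and the function names are illustrative assumptions, not taken from the slides. The probability of a sequence then depends only on how many S and H it contains, and the tuples are not independent.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def sequence_probability(sequence, a=1.0, b=1.0):
    """Pr[sequence] when theta ~ Beta(a, b) and each tuple is S with probability theta."""
    sick = sequence.count("S")
    healthy = sequence.count("H")
    return exp(log_beta(a + sick, b + healthy) - log_beta(a, b))


p1 = sequence_probability("SSHSH")
p2 = sequence_probability("HHSSS")
print(p1, p2)  # equal: both sequences contain three S and two H

# Exchangeable does not mean independent: Pr[SS] differs from Pr[S] * Pr[S].
p_ss = sequence_probability("SS")
p_s = sequence_probability("S")
print(p_ss, p_s * p_s)
```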



Adversary Knowledge: Example

Assume two populations of equal size, Ω1 with only healthy people and Ω2 with only sick people. Table T is drawn entirely from Ω1 or entirely from Ω2

If the adversary doesn’t know which population has been chosen: $\Pr[t = H] = 0.5$

If the adversary knows that just one tuple $t'$ is healthy, then: $\Pr[t = H \mid t' = H] = 1$

What if the tuples were independent of each other? Still $\Pr[t = H] = 0.5$
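The two-population example can be worked through with exact fractions. A minimal sketch, assuming each population is chosen with probability 1/2 as the "equal size" wording suggests; the names `populations` and `pr_all_healthy` are illustrative.

```python
from fractions import Fraction

# Each population: (probability it was chosen, Pr[a tuple drawn from it is healthy]).
populations = [
    (Fraction(1, 2), Fraction(1)),  # Omega_1: only healthy people
    (Fraction(1, 2), Fraction(0)),  # Omega_2: only sick people
]


def pr_all_healthy(num_tuples):
    """Pr[num_tuples tuples are all healthy], mixing over the unknown population."""
    return sum(weight * healthy ** num_tuples for weight, healthy in populations)


prior = pr_all_healthy(1)                          # Pr[t = H]
posterior = pr_all_healthy(2) / pr_all_healthy(1)  # Pr[t = H | t' = H]
print(prior, posterior)
```

Learning that one tuple is healthy reveals which population was chosen, so the belief about every other tuple jumps from 1/2 to 1.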



Dirichlet Distribution

More generally, T (of size n) is generated in two steps:
A probability vector $\vec{p}$ is drawn from a distribution D
Then n elements are drawn i.i.d. from the probability vector $\vec{p}$

D encodes the adversary’s knowledge
If the adversary has no prior, every $\vec{p}$ is equally likely to be drawn from D
If an adversary knows that 999 people out of 1k have cancer, he should model D so that it draws p_no(cancer) = 0.001 and p_yes(cancer) = 0.999

The Dirichlet distribution is used to model the prior over $\vec{p}$
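The two-step generation can be sketched as follows. This is an illustrative sketch only: it samples the probability vector from a Dirichlet distribution by normalizing Gamma draws (a standard construction), and the parameters (999, 1), the table size 10, and the function names are assumptions chosen to echo the cancer example.

```python
import random


def sample_dirichlet(sigmas, rng):
    """Draw a probability vector from Dirichlet(sigmas) via normalized Gamma draws."""
    draws = [rng.gammavariate(s, 1.0) for s in sigmas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_table(n, values, sigmas, rng):
    """Step 1: draw p from D. Step 2: draw n sensitive values i.i.d. from p."""
    p = sample_dirichlet(sigmas, rng)
    table = rng.choices(values, weights=p, k=n)
    return p, table


rng = random.Random(0)  # fixed seed so the sketch is repeatable
p, table = generate_table(10, ["cancer", "no cancer"], [999, 1], rng)
print(p)
print(table)
```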




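$D(\vec{p};\, \sigma_1, \ldots, \sigma_k) = \frac{\Gamma(\sigma)}{\prod_{i=1}^{k} \Gamma(\sigma_i)} \prod_{i=1}^{k} p_i^{\sigma_i - 1}$

where $\sigma = \sum_i \sigma_i$ and $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\, dx$

This density expresses the belief that the probabilities of k rival events are $p_i$ given that each event has been observed $\sigma_i - 1$ times.

$\sigma$ is the stubbornness and $\sigma_i / \sigma$ is the shape

Adversary without knowledge: D(σ1, …, σk) = D(1, …, 1)
After reading a dataset with counts (σ1−1, …, σk−1), the adversary may update their prior to D(σ1, …, σk)
In this case not all $\vec{p}$ are equally likely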
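The update rule and the two summary quantities can be illustrated with a few lines of code. A minimal sketch: the function names are hypothetical, and the counts (11,999 flu, 17,999 cancer) are chosen only so that the result is D(12k, 18k).

```python
def update(prior_sigmas, counts):
    """Dirichlet parameters after observing `counts` occurrences of each value."""
    return [s + c for s, c in zip(prior_sigmas, counts)]


def stubbornness(sigmas):
    """sigma: the sum of all Dirichlet parameters."""
    return sum(sigmas)


def shape(sigmas):
    """sigma_i / sigma for each value."""
    total = sum(sigmas)
    return [s / total for s in sigmas]


uninformed = [1, 1]                             # D(1, 1): no knowledge
informed = update(uninformed, [11_999, 17_999]) # counts are sigma_i - 1
print(informed, stubbornness(informed), shape(informed))
```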




The vector with the maximum likelihood is $\vec{p}^*$ with $p^*_i = \sigma_i / \sigma$

As we increase σ, $\vec{p}^*$ becomes more likely

If $\sigma \to \infty$, $\vec{p}^*$ is the only possible probability distribution
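The concentration effect can be checked numerically. A minimal sketch using the standard closed-form mean and variance of a Dirichlet distribution (the mean is σ_i/σ); the shape (0.4, 0.6) and the three σ values are illustrative assumptions.

```python
def dirichlet_mean(sigmas):
    """Mean of each component: sigma_i / sigma."""
    total = sum(sigmas)
    return [s / total for s in sigmas]


def dirichlet_variance(sigmas):
    """Variance of each component: m_i * (1 - m_i) / (sigma + 1), with m_i = sigma_i / sigma."""
    total = sum(sigmas)
    return [(s / total) * (1 - s / total) / (total + 1) for s in sigmas]


# Keep the shape fixed at (0.4, 0.6) and increase the stubbornness sigma.
variances = []
for sigma in (10, 1_000, 100_000):
    sigmas = [0.4 * sigma, 0.6 * sigma]
    variances.append(dirichlet_variance(sigmas)[0])
    print(sigma, dirichlet_mean(sigmas), variances[-1])
```

The mean stays at the shape while the variance shrinks roughly like 1/σ, so the prior puts more and more of its mass near σ_i/σ.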



Other Adversary Knowledge

Knowledge from individuals inside the published table

Full knowledge about a subset B of tuples in T



Definition

After Tpub is published, the adversary’s belief in a sensitive predicate $\phi(u)$ about an individual u in T is p_in

If the individual u is removed from T, the belief becomes p_out



Definition

p_in should not be much greater than p_out

The greater it is, the more information about an individual’s sensitive predicate the adversary learns

A table does not respect ε-privacy if $\frac{p_{in}}{p_{out}} > \epsilon$ or $\frac{1 - p_{out}}{1 - p_{in}} > \epsilon$
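A check of this kind can be sketched in a few lines. The sketch assumes that ε-privacy bounds two ratios by ε, namely p_in / p_out and (1 − p_out) / (1 − p_in); this reading is consistent with the worked numbers given for the adversary classes later in the deck, but the function names and the sample values 0.5 and 0.25 are hypothetical.

```python
def privacy_ratio(p_in, p_out):
    """Smallest epsilon for which the pair (p_in, p_out) passes both ratio tests."""
    if p_out <= 0.0 or p_in >= 1.0:
        # A ratio would divide by zero; conservatively treat this as unbounded.
        return float("inf")
    return max(p_in / p_out, (1.0 - p_out) / (1.0 - p_in))


def is_epsilon_private(p_in, p_out, epsilon):
    """True when neither ratio exceeds epsilon."""
    return privacy_ratio(p_in, p_out) <= epsilon


# Illustrative values: belief 0.5 with the individual in the table, 0.25 without.
print(privacy_ratio(0.5, 0.25))
print(is_epsilon_private(0.5, 0.25, 3.0), is_epsilon_private(0.5, 0.25, 1.5))
```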



Adversary Classes

Defined based on their prior built over the distribution of sensitive values

Class I: fixed σ and shape $\sigma_i / \sigma$
Class II: fixed σ, arbitrary $D(\vec{\sigma})$ such that $\sum_{s \in S} \sigma(s) = \sigma$
Class III: fixed shape $\sigma_i / \sigma$ and arbitrarily large σ
Class IV: arbitrary σ and $\sigma_i / \sigma$



Adversary classes - Examples

Suppose there is another dataset with 30,000 tuples: 12,000 with flu and 18,000 with cancer

Class I: σ = 30k, D(12k, 18k)
Class II: σ = 30k, arbitrary shape
Class III: arbitrary σ, distribution (.4, .6)
Class IV: arbitrary prior

Rachel is in the table. p_in(flu) = .9 for all adversaries (it depends only on the published table)

p_out(flu) changes for each adversary



Adversary classes - Examples

Class I: p_out(flu) = (18k + 12k)/(20k + 30k) = .6
Class II: p_out(flu) = (18k + 1)/(20k + 30k) = .36002
Class III: p_out(flu) = .4
Class IV: p_out(flu) can take any value

So Rachel is granted 4-privacy, 6.4-privacy, and 6-privacy against Class I, II, and III adversaries respectively, and no privacy against a Class IV adversary
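The figures on this slide can be reproduced numerically. This is a sketch under two assumptions: the published table has 20k tuples of which 18k have flu (read off the slide's arithmetic and p_in = .9), and the privacy level is the larger of p_in / p_out and (1 − p_out) / (1 − p_in). The names `p_out` and `required_epsilon` are illustrative.

```python
PUBLISHED_FLU, PUBLISHED_TOTAL = 18_000, 20_000  # assumed; gives p_in(flu) = 0.9
PRIOR_TOTAL = 30_000                             # stubbornness sigma of the prior


def p_out(prior_flu):
    """Belief in flu after pooling the published counts with the prior's flu count."""
    return (PUBLISHED_FLU + prior_flu) / (PUBLISHED_TOTAL + PRIOR_TOTAL)


def required_epsilon(p_in, p_out_value):
    """Smallest epsilon satisfied by this pair, assuming the two-ratio reading."""
    return max(p_in / p_out_value, (1 - p_out_value) / (1 - p_in))


p_in = PUBLISHED_FLU / PUBLISHED_TOTAL
beliefs = {
    "Class I": p_out(12_000),  # prior D(12k, 18k)
    "Class II": p_out(1),      # value used on the slide for the arbitrary-shape prior
    "Class III": 0.4,          # arbitrarily large sigma: the belief equals the shape
}
epsilons = {name: required_epsilon(p_in, value) for name, value in beliefs.items()}
for name in beliefs:
    print(name, round(beliefs[name], 5), round(epsilons[name], 4))
```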



Generalization and ε-privacy

The set of sensitive predicates for each individual u is $\Phi(u) = \{\{s\} \mid s \in S\}$

We can define a set of constraints that have to be checked during the generalization process



Check for Class I

R1 and R2 have to be respected

Combination of anonymity and closeness



Check for Class II

R1 and R2 have to be respected

Combination of anonymity and diversity



Check for Class III

R1 and R2 have to be respected

Only closeness

ε-privacy doesn’t guarantee privacy against a Class IV adversary



Monotonicity

Let T1 and T2 be generalizations of T such that T2 is a further generalization of T1
If T1 satisfies ε-privacy, then T2 also satisfies ε-privacy

Useful for algorithms such as Incognito, Mondrian, and the PET algorithm

All the checks shown before can be performed with time complexity O(N)



Choosing Parameters

The choice is application dependent. Example: US Census

Stubbornness: number of individuals
Shape: distribution of sensitive values
Epsilon: between 10 and 100 (why?)



Experimental results

The greater the stubbornness, the greater the ε we need to achieve privacy

With small values of σ the cost function is better

The average group size increases with σ

Data from the Minnesota Population Center, with nearly 3M tuples



Embedding prior work

ε-privacy can cover some instantiations of:
Recursive (c,2)-diversity
Differential privacy
t-closeness



Conclusions

Definition of ε-privacy
Definition of realistic adversaries
How to cover scenarios not taken into account in previous works
ε-privacy in the generalization process

Future work:
Considering correlation between sensitive and non-sensitive values
Applying ε-privacy to other algorithms


