Data Publishing against Realistic Adversaries

Johannes Gehrke, Cornell University, Ithaca, NY
Michaela Götz, Cornell University, Ithaca, NY
Ashwin Machanavajjhala, Yahoo! Research, Santa Clara, CA

Presented by Amedeo D'Ascanio, University of Bologna, Italy
Outline

Introduction
ε-privacy
Adversary knowledge
Adversary classes
Applying ε-privacy to generalization
Experimental evaluation
Conclusion
Introduction

There are many reasons to publish data, with two requirements:
Preserve aggregate information about the population
Preserve the privacy of sensitive information

Privacy: how much information can an adversary deduce from the released data?
Example

Alice knows that Rachel is 35 and that she lives in ZIP code 13058
Alice knows that Rachel is 20 and that she has a very low probability of heart disease
Previous definitions

l-diversity: the adversary knows l−2 pieces of information about the sensitive attribute, and these pieces of information are equally likely
t-closeness: Alice knows the distribution of sensitive values; Rachel's chances of having a disease follow the same odds
Differential privacy: Alice knows the exact disease of every patient except Rachel

"It's flu season, a lot of elderly people will be in the hospital with flu symptoms." How do we model such background knowledge with l-diversity or t-closeness? Does Alice really know everything about 1 billion patients?

Unrealistic assumptions!
ε-privacy

A flexible language to define sensitive information about each individual
Privacy as the difference in the adversary's belief between the table published with and without the "victim"
Different classes of adversaries (either realistic or unrealistic), modeled based on their knowledge
Modeling sensitive information

Positive disclosure: Alice learns that Rachel has flu
Negative disclosure: Alice learns that Rachel does not have flu

Sensitive information is expressed using positive disclosure on a set of sensitive predicates Φ, e.g.:

t_Rachel[Disease] ∈ {Ulcer, Dyspepsia}
t_Rachel[Disease] = Flu
Modeling sensitive information: example

Negative disclosure predicates:

φ1(Rachel): t_Rachel[S] ∉ {Flu}
φ2(Rachel): t_Rachel[S] ∉ {Cancer}
φ3(Rachel): t_Rachel[S] ∉ {Ulcer, Dyspepsia}

Each φ(u) takes the form t_u[S] ∈ S′ with S′ ⊆ dom(S), where dom(S) is the domain of the sensitive attribute; a positive disclosure asserts that the predicate is true, a negative disclosure that it is false.

With Φ = {φ1, φ2, φ3}, Rachel can protect against any kind of disclosure for flu, cancer, and any stomach disease, for each such subset S′.
Adversary knowledge

Knowledge from other sources is usually modeled as a joint distribution P over the non-sensitive (N) and sensitive (S) attributes:

P = (…, p_i, …), such that Σ_{i ∈ N×S} p_i = 1

If the adversary has no preference for any value i:

p_i = 1 / |N×S| for each i ∈ N×S
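As a minimal sketch of this no-preference prior (the two tiny attribute domains below are hypothetical, chosen only for illustration), equal mass is assigned to every (non-sensitive, sensitive) value pair:

```python
from itertools import product

# Hypothetical tiny domains for the non-sensitive (N) and sensitive (S) attributes.
N = ["age:20", "age:35"]          # non-sensitive attribute values
S = ["Flu", "Cancer", "Ulcer"]    # sensitive attribute values

# A no-preference adversary assigns each (n, s) pair the same mass 1/|N x S|.
P = {pair: 1.0 / (len(N) * len(S)) for pair in product(N, S)}

# A valid joint distribution sums to 1.
assert abs(sum(P.values()) - 1.0) < 1e-12
```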
Adversary knowledge: two problems

Where does the adversary learn this knowledge? If the fraction of the population with cancer is 10% (s_i = s/10), then p_i = s_i/s = 0.1 for each i. But what if T_pub has only 10 entries?

Can the adversary change her prior? Suppose the probability that a woman has cancer is p_i = 0.5, based on a sample of 100 women. The adversary then reads another table with 20k tuples where s_i is 2k (so that p_i = 0.1). If her prior is not strong, p_i will change accordingly.
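The second problem can be illustrated with a conjugate Beta update, the two-outcome special case of the Dirichlet prior introduced later. The Beta(50, 50) parameterization is an assumption: it encodes the slide's 100-woman sample as pseudo-counts with mean 0.5.

```python
# Weak prior from a sample of 100 women: pseudo-counts 50/50, mean 0.5.
a, b = 50.0, 50.0

# Conjugate update after reading a table of 20k tuples with 2k cancer cases.
a2, b2 = a + 2_000, b + 18_000
posterior_mean = a2 / (a2 + b2)
print(round(posterior_mean, 3))  # the weak prior is swamped: 0.102
```

Because the prior carries only 100 pseudo-observations against 20,000 new ones, the posterior lands close to the new table's frequency of 0.1, matching the slide's claim that a weak prior changes accordingly.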
Adversary knowledge: modeling assumptions

To model adversaries we assume that:
The adversary knows only priors
The tuples are not independent of each other

Exchangeability: a sequence of random variables X1, X2, …, Xn is exchangeable if every finite permutation of these random variables has the same joint probability distribution. If H is healthy and S is sick, the probability of seeing the sequence SSHSH is the same as the probability of HHSSS.

According to de Finetti's representation theorem, an exchangeable sequence of random variables is mathematically equivalent to:
Choosing a data-generating distribution θ at random
Creating the data by independently sampling from this chosen distribution θ
Adversary knowledge: example

Assume two populations of equal size, Ω1 with only healthy people and Ω2 with only sick people. Table T is drawn entirely from either Ω1 or Ω2.

If the adversary doesn't know which population was chosen, then for any tuple t, Pr[t = H] = 0.5.
If the adversary learns that just one tuple t is healthy, then T must come from Ω1, so every other tuple is healthy with probability 1.
If the tuples were independent of each other, the observation would reveal nothing: still Pr[t = H] = 0.5.
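The effect of this dependence can be checked with a few lines of Bayes-rule arithmetic (a sketch of the two-population example, not code from the paper):

```python
# Two populations of equal size: Omega1 (all healthy), Omega2 (all sick).
# T is drawn entirely from one of them, chosen uniformly at random.
p_omega1 = 0.5  # prior on "T came from the all-healthy population"

# Before seeing anything, any tuple t is healthy with probability 0.5.
pr_healthy = p_omega1 * 1.0 + (1 - p_omega1) * 0.0
assert pr_healthy == 0.5

# Observing a single healthy tuple reveals the population:
# Pr[Omega1 | one healthy tuple] = (0.5 * 1) / (0.5 * 1 + 0.5 * 0) = 1,
# so every remaining tuple is now healthy with probability 1.
posterior_omega1 = (p_omega1 * 1.0) / (p_omega1 * 1.0 + (1 - p_omega1) * 0.0)
assert posterior_omega1 == 1.0
```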
Dirichlet distribution

More generally, T (of size n) is generated in two steps:
A probability vector p is drawn from a distribution D
Then the n elements are drawn i.i.d. according to p

D encodes the adversary's knowledge:
If the adversary has no prior, every p is equally likely to be drawn from D
If the adversary knows that 999 people out of 1,000 have cancer, D should be modeled so that it draws p_no(cancer) = 0.001 and p_yes(cancer) = 0.999

The Dirichlet distribution is used to model the prior over p.
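The two-step generation can be sketched with NumPy's Dirichlet sampler; the pseudo-counts (999, 1) are an assumed encoding of the 999-out-of-1,000 cancer example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: draw a probability vector p over the sensitive values from D.
# Pseudo-counts (999, 1) encode "999 of 1,000 observed people had cancer".
p = rng.dirichlet([999.0, 1.0])  # p[0] close to 0.999, p[1] close to 0.001

# Step 2: draw the n table entries i.i.d. according to p.
n = 10
table = rng.choice(["cancer", "no cancer"], size=n, p=p)
```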
Dirichlet distribution

D(σ1, …, σk) models the belief that the probabilities of k rival events are x_i, given that each event has been observed σ_i − 1 times:

f(x1, …, xk; σ1, …, σk) = (Γ(σ) / ∏_i Γ(σ_i)) · ∏_i x_i^(σ_i − 1)

where σ = Σ_i σ_i and Γ(t) = ∫_0^∞ x^(t−1) e^(−x) dx. σ is the stubbornness and σ_i/σ is the shape.

An adversary without knowledge has D(σ1, …, σk) = D(1, …, 1). After reading a dataset with counts (σ1 − 1, …, σk − 1), the adversary may update his prior to D(σ1, …, σk); in this case not all p are equally likely.
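A direct transcription of this density (a sketch, not the paper's code) makes the roles of stubbornness and shape concrete:

```python
from math import gamma, prod

def dirichlet_pdf(x, sigma):
    """Density of D(sigma_1, ..., sigma_k) at a point x on the simplex."""
    norm = gamma(sum(sigma)) / prod(gamma(s) for s in sigma)
    return norm * prod(xi ** (si - 1) for xi, si in zip(x, sigma))

# D(1,1,1): a prior-free adversary; every p is equally likely (constant density).
assert dirichlet_pdf([0.2, 0.3, 0.5], [1, 1, 1]) == dirichlet_pdf([0.6, 0.3, 0.1], [1, 1, 1])

# D(10, 90): density peaks near the shape sigma_i / sigma = (0.1, 0.9).
assert dirichlet_pdf([0.1, 0.9], [10, 90]) > dirichlet_pdf([0.5, 0.5], [10, 90])
```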
Dirichlet distribution

The vector with the maximum likelihood is p*, with p*_i = σ_i / σ
As we increase σ, p* becomes more likely
If σ → ∞, p* is the only possible probability distribution
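The concentration claim can be checked empirically: holding the shape fixed at (0.4, 0.6) (an arbitrary choice for illustration) and increasing σ shrinks the spread of sampled vectors around p*:

```python
import numpy as np

rng = np.random.default_rng(1)
shape = np.array([0.4, 0.6])  # fixed shape sigma_i / sigma

# Same shape, increasing stubbornness sigma: draws concentrate around p* = shape.
spreads = []
for sigma in (10, 1_000, 100_000):
    draws = rng.dirichlet(shape * sigma, size=2000)
    spreads.append(draws.std(axis=0).max())  # empirical spread around p*
```

The recorded spreads decrease monotonically with σ, while the sample mean stays near (0.4, 0.6).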
Other Adversary Knowledge
Knowledge from individuals inside the published table
Full knowledge about a subset B of tuples in T
Definition

After T_pub is published, the adversary's belief in a sensitive predicate φ(u) about an individual u in T is p_in(φ(u)).
If the individual u is removed from T, the belief becomes p_out(φ(u)).
Definition

p_in should not be much greater than p_out: the greater it is, the more information about an individual's sensitive predicate the adversary learns.
A table does not respect ε-privacy if p_in exceeds the bound that ε imposes relative to p_out.
Adversary classes

Adversary classes are defined based on the prior D built over the distribution of sensitive values:

Class I: σ fixed, shape σ_i/σ fixed
Class II: σ fixed, arbitrary shape
Class III: shape σ_i/σ fixed, σ arbitrarily large
Class IV: σ and shape σ_i/σ both arbitrary
Adversary classes: examples

Suppose the adversary has another dataset with 30,000 tuples: 12,000 with flu and 18,000 with cancer.

Class I: σ = 30k, D(12k, 18k)
Class II: σ = 30k, arbitrary shape
Class III: arbitrary σ, distribution (.4, .6)
Class IV: arbitrary prior

Rachel is in the published table. p_in(flu) = .9 for all adversaries (it depends only on the published table), while p_out(flu) changes for each adversary class.
Adversary classes: examples

Class I: p_out(flu) = (18k + 12k) / (20k + 30k) = .6
Class II: p_out(flu) = (18k + 1) / (20k + 30k) = .36002
Class III: p_out(flu) = .4
Class IV: p_out(flu) can be any value

So Rachel is granted privacy levels of .4, 6.4, and 6 against Class I, II, and III adversaries respectively, and no privacy against a Class IV adversary.
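These p_out values can be reproduced with a few lines of arithmetic (a sketch of the slide's computation; the 20k published-table size with 18k flu tuples is inferred from p_in = .9 and the denominators above):

```python
# Published table (inferred): 20k tuples, 18k with flu -> p_in(flu) = 0.9.
table_flu, table_n = 18_000, 20_000
# Adversary's other dataset: 30k tuples, 12k with flu.
prior_flu, prior_n = 12_000, 30_000

p_in = table_flu / table_n                                # 0.9

# Class I: prior counts fixed at D(12k, 18k), mixed with the table.
p_out_I = (table_flu + prior_flu) / (table_n + prior_n)   # 0.6

# Class II: same stubbornness, worst-case shape (prior mass pushed off flu).
p_out_II = (table_flu + 1) / (table_n + prior_n)          # 0.36002

# Class III: arbitrarily large sigma with fixed shape pins p_out at the shape.
p_out_III = 0.4
```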
Generalization and ε-privacy

The set of sensitive predicates for each individual u is Φ(u) = { t_u[S] ∈ {s} : s ∈ dom(S) }.

We can define a set of constraints that have to be checked during the generalization process.
Check for Class I

Rules R1 and R2 have to be respected: a combination of anonymity and closeness.
Check for Class II

Rules R1 and R2 have to be respected: a combination of anonymity and diversity.
Check for Class III

Rules R1 and R2 have to be respected: only closeness.

ε-privacy does not guarantee privacy against a Class IV adversary.
Monotonicity

Let T1 and T2 be generalizations of T such that T2 is a further generalization of T1. If T1 satisfies ε-privacy, then T2 also satisfies ε-privacy.
This is useful for algorithms such as Incognito, Mondrian, and the PET algorithm.
All the checks shown before have time complexity O(N).
Choosing parameters

The choice is application dependent, e.g. for the US Census:
Stubbornness: the number of individuals
Shape: the distribution of sensitive values
Epsilon: between 10 and 100. Why?
Experimental results

The more stubbornness we have, the greater the epsilon we need to achieve privacy
With small values of σ the cost function is better
The average group size increases with σ
Data from the Minnesota Population Center, with nearly 3M tuples
Embedding prior work

ε-privacy can cover some instantiations of:
Recursive (c,2)-diversity
Differential privacy
t-closeness
Conclusions

Definition of ε-privacy
Definition of realistic adversaries
How to cover scenarios not taken into account in previous works
ε-privacy in the generalization process

Future work:
Considering correlations between sensitive and non-sensitive values
Applying ε-privacy to other algorithms