Page 1:

CS208: Applied Privacy for Data Science Membership & Other Attacks

James Honaker & Salil Vadhan School of Engineering & Applied Sciences

Harvard University

February 8, 2019

Page 2:

Motivation

• Last time: on a dataset with n individuals, releasing m = n counts with error E = o(√n) allows for reconstructing a 1 − o(1) fraction of the sensitive attributes.

• Q: what happens if we allow error Ω(√n) ≤ E ≤ o(n)?

• A (today): if we release m = n² counts, the release can be vulnerable to “membership attacks”.

Page 3:

What is this √n threshold?

• If X = X_1 + ⋯ + X_n for independent random variables X_i, each with standard deviation σ, then the standard deviation of X is σ·√n.

• So the “sampling error” for a sum is typically Θ(√n).

• If the X_i’s are bounded (or “subgaussian”), then X will have Gaussian-like concentration around its mean μ:

  Pr[ |X − μ| > t·√n ] ≤ exp(−Ω(t²))   [Chernoff–Hoeffding bound]
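A quick numerical check of the σ·√n scaling (a minimal sketch of ours, not from the slides; the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# X = X_1 + ... + X_n with X_i ~ Bernoulli(1/2), so sigma = 1/2.
# The empirical standard deviation of X should track sigma * sqrt(n).
for n in [100, 400, 1600, 6400]:
    sums = rng.binomial(n, 0.5, size=100_000)   # 100k samples of the sum X
    print(f"n={n:5d}  empirical std={sums.std():7.2f}  "
          f"sigma*sqrt(n)={0.5 * np.sqrt(n):7.2f}")
```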

Page 4:

Normalized Counts (i.e. Averages)

• If X = (X_1 + ⋯ + X_n)/n for independent random variables X_i, each with standard deviation σ, then the standard deviation of X is σ/√n.

• So the “sampling error” for an average is typically Θ(1/√n).

• If the X_i’s are bounded (or “subgaussian”), then X will have Gaussian-like concentration around its mean μ:

  Pr[ |X − μ| > t/√n ] ≤ exp(−Ω(t²))   [Chernoff–Hoeffding bound]

This is why subsampling k out of n rows allows us to approximate m averages each to within ±O(√(log m / k)): the 1/√k factor is the standard deviation, and the √(log m) factor comes from the concentration bound (union-bounding over all m averages).
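A small simulation of this subsampling claim (our own sketch; the parameter choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

n, m, k = 100_000, 200, 1_000                  # rows, number of averages, subsample size
p = rng.random(m)                              # one frequency per attribute
X = rng.random((n, m)) < p                     # n x m binary data, column j ~ Bernoulli(p_j)

true_means = X.mean(axis=0)                    # the m exact averages
sub = X[rng.choice(n, size=k, replace=False)]  # k rows sampled without replacement
max_err = np.abs(sub.mean(axis=0) - true_means).max()

print(f"max error over {m} averages: {max_err:.4f}")
print(f"sqrt(log(m)/k) scale:        {np.sqrt(np.log(m) / k):.4f}")
```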

Page 5:

Motivation

• Last time: on a dataset with n individuals, releasing m = n averages with error E = o(1/√n) allows for reconstructing a 1 − o(1) fraction of the sensitive attributes.

• Q: what happens if we allow error Ω(1/√n) ≤ E ≤ o(1)?

• A (today): if we release m = n² averages, the release can be vulnerable to “membership attacks”.

Page 6:

PAUSE FOR INTRODUCTIONS

Page 7:

Membership Attacks: Setup

[Figure: a population from which a dataset X of n people with binary attributes is drawn; Alice’s data is either a row of X (“In”) or an independent draw from the population (“Out”); a mechanism (stats, ML model, …) computes outputs from X, and the attacker sees those outputs along with auxiliary info (“aux”).]

Attacker gets:
• Access to mechanism outputs
• Alice’s data
• (Possibly) auxiliary info about population

Then decides: if Alice is in the dataset X

[slide based on one from Adam Smith]

Page 8:

Membership Attacks: Examples

[Same figure as the previous slide.]

• Genome-wide Association Studies [Homer et al. ’08]
– release frequencies of SNPs (individual positions)
– determine whether Alice is in the “case group” [with a particular diagnosis]

• ML as a service [Shokri et al. ’17]
– apply models trained on X to Alice’s data

[slide based on one from Adam Smith]

Page 9:

[slide based on one from Adam Smith]

Membership Attacks from Means

[Same figure, now with d attributes/predicates per person; the mechanism releases the exact vector of column means x̄, e.g. (.5, .75, .5, .5, .75, .5, .25, .25, .5).]

Page 10:

• Population = a vector p = (p_1, …, p_d) of probabilities
– j’th attribute = iid Bernoulli(p_j), independent across j
– Adversary is given p (or a few random draws from the population)

[slide based on one from Adam Smith]

Membership Attacks from Means

[Same figure as the previous slide, now annotated with the population frequencies p, e.g. (.6, .9, .3, .1, .9, 0, .4, .2, .4), and aux = p.]

Page 11:

• Population = a vector p = (p_1, …, p_d) of probabilities
– j’th attribute = iid Bernoulli(p_j), independent across j
– Adversary gets a ≈ x̄ and p (or a few random draws)
– Only assume that a = M(x) has |a_j − x̄_j| ≤ α for every j with high probability. (“Noise” need not be independent or unbiased.)

[slide based on one from Adam Smith]
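Concretely, this setup can be instantiated as follows (a minimal sketch; the bounded uniform noise is just one mechanism M satisfying the assumption, and all parameters are our own choices):

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, alpha = 500, 5_000, 0.02        # people, attributes, error bound

p = rng.random(d)                      # population frequencies p_j, iid Uniform[0,1]
X = (rng.random((n, d)) < p) * 1.0     # dataset: entry (i, j) ~ Bernoulli(p_j)
x_bar = X.mean(axis=0)                 # exact column means

# One mechanism M with |a_j - x_bar_j| <= alpha: add bounded uniform noise.
a = x_bar + rng.uniform(-alpha, alpha, size=d)
```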

Membership Attacks from Noisy Means

[Same figure, except that a privacy mechanism now sits between X and the attacker and releases noisy means a ≈ x̄, e.g. (.54, .71, .49, .52, .80, .54, .20, .21, .45); aux = p as before.]

Page 12:

• We are interested in α > 1/√n.
• In this regime, if p is known to the mechanism, the attack can be prevented. (Q: Why?)
• So we will assume random p_j’s (e.g. iid uniform in [0,1]).

[slide based on one from Adam Smith]

[Same figure as the previous slide.]

Page 13:

Theorem [Dwork et al. ’15]: There is a constant c and an attacker A such that when d ≥ c·n and α < min{ √(d / O(n²·log(1/δ))), 1/2 }:

• If Alice is IN, then Pr[A(y, a, p) = IN] ≥ Ω(1/(α²·n)).

• If Alice is OUT, then Pr[A(y, a, p) = IN] ≤ δ.

[slide based on one from Adam Smith]

Membership Attacks from Noisy Means

[Same figure as the previous slide; Alice’s data is now denoted y.]

Page 14:

Membership Attacks from Noisy Means

Theorem [Dwork et al. ’15]: There is an attacker A such that when d ≥ O(n) and α < min{ √(d / O(n²·log(1/δ))), 1/2 }:

• If Alice is IN, then Pr[A(y, a, p) = IN] ≥ Ω(1/(α²·n)). (true positive)

• If Alice is OUT, then Pr[A(y, a, p) = IN] ≤ δ. (false positive)

Remarks:

• Only interesting when δ < Ω(1/(α²·n)).

• On average, successfully trace Ω(1/α²) members of the dataset, since each of the n members is a true positive with probability Ω(1/(α²·n)). This is the best possible. (Why?)

• Can safely release at most Õ(n²) means!

Page 15:

The Attacker

A(y, a, p) = IN if ⟨y − p, a − p⟩ > T; OUT if ⟨y − p, a − p⟩ ≤ T

Note: given p, a, can choose T = T_{p,a} = O(√(d·log(1/δ))) to make the false positive probability exactly δ.

[slide based on one from Adam Smith]
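A minimal end-to-end simulation of this attacker (our own sketch: the constant in the threshold and all parameters are illustrative choices made so that the separation is visible, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

n, d, alpha, delta = 30, 200_000, 0.2, 0.05   # note d >> n^2 log(1/delta) here

p = rng.random(d)                              # population frequencies, iid Uniform[0,1]
X = (rng.random((n, d)) < p) * 1.0             # dataset of n members
a = X.mean(axis=0) + rng.uniform(-alpha, alpha, d)   # noisy means, |a_j - x_bar_j| <= alpha

# Hoeffding-style threshold: each term of <y - p, a - p> is bounded,
# so T ~ sqrt(d log(1/delta)) keeps the false positive rate near delta.
T = np.sqrt(d * np.log(1 / delta) / 2)

def attack(y):
    """Report IN iff the inner product <y - p, a - p> exceeds T."""
    return "IN" if np.dot(y - p, a - p) > T else "OUT"

alice_in = X[0]                                # an actual member of X
alice_out = (rng.random(d) < p) * 1.0          # a fresh draw from the population

print("member of X:", attack(alice_in))        # typically IN
print("non-member: ", attack(alice_out))       # typically OUT
```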

[Same figure as the previous slide.]

Page 16:

Attacks on Aggregate Stats

• What error α makes sense? (worked numbers below)
– Estimation error due to sampling ≈ 1/√n
– Reconstruction attacks require α ≲ 1/√n, d ≥ n
– Robust membership attacks: α ≲ √d/n

• Lessons
– “Too many, too accurate” statistics reveal individual data
– “Aggregate” is hard to pin down
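To put rough numbers on these regimes (our own illustrative arithmetic, not from the slide):

```python
import numpy as np

n, d = 10_000, 1_000_000   # hypothetical dataset size and number of released stats

print("sampling error         ~ 1/sqrt(n)        =", 1 / np.sqrt(n))   # 0.01
print("reconstruction needs   alpha <~ 1/sqrt(n) =", 1 / np.sqrt(n))   # 0.01
print("membership works up to alpha <~ sqrt(d)/n =", np.sqrt(d) / n)   # 0.1
```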

[Figure: a number line of distortion α, marking reconstruction attacks below the sampling error 1/√n and membership attacks up to √d/n.]

[slide based on one from Adam Smith]

Page 17:

Membership Attacks on ML as a Service

[Shokri et al. 2017] Switch to slides from Reza Shokri’s talk

Page 18:

Another Attack on ML? [Fredrikson et al. ’14, cf. McSherry ’16]

[Figure: the same setup, except that the attacker holds only Alice’s known attributes and tries to infer a missing sensitive bit.]

Attacker gets:
• Access to mechanism outputs
• Some of Alice’s data
• (Possibly) auxiliary info about population

Then computes: a sensitive attribute of Alice


Page 19:

Another Attack on ML? [Fredrikson et al. ’14, cf. McSherry ’16]

[Same figure as the previous slide.]

Difference from reconstruction attacks:
• The above attack works even if Alice is not in the dataset; it is based on correlation between known & sensitive attributes.
• Reconstruction attacks work even when the sensitive bit is uncorrelated with the known attributes.


Page 20:

Goals of Differential Privacy

• Utility: enable “statistical analysis” of datasets
– e.g. inference about population, ML training, useful descriptive statistics

• Privacy: protect individual-level data
– against “all” attack strategies, auxiliary info

Q: Can it help with privacy in microtargeted advertising? [Korolova attacks]
– inference from impressions?
– inference from clicks?
– displaying intrusive ads?

Page 21:

“Five Views” Responses to Membership Attacks on GWAS
