
CS573 Data Privacy and Security

Midterm Review

Li Xiong

Department of Mathematics and Computer Science

Emory University

Principles of Data Security – CIA Triad

• Confidentiality

– Prevent the disclosure of information to unauthorized users

• Integrity

– Prevent improper modification

• Availability

– Make data available to legitimate users

Privacy vs. Confidentiality

• Confidentiality

– Prevent disclosure of information to unauthorized users

• Privacy

– Prevent disclosure of personal information to unauthorized users

– Control of how personal information is collected and used

– Prevent identification of individuals


Data Privacy and Security Measures

• Access control

– Restrict access to the (subset or view of) data to authorized users

• Cryptography

– Use encryption to encode information so that it can only be read by authorized users (protected in transit and in storage)

• Inference control

– Restrict inference from accessible data to sensitive, non-accessible individual information

• Technologies

– De-identification and Anonymization (input perturbation)

– Differential Privacy (output perturbation)

Inference Control

Original data → de-identification / anonymization → sanitized records

Traditional De-identification and Anonymization

• Attribute suppression, encoding, perturbation, generalization

• Subject to re-identification and disclosure attacks

Differentially Private Data Sharing

Original data → statistics / models / synthetic records

Statistical Data Sharing with Differential Privacy

• Macro data (versus micro data)

• Output perturbation (versus input perturbation)

• More rigorous guarantee

Cryptography

• Encoding data in a way that only authorized users can read it


Original data → encryption → encrypted data

Applications of Cryptography

• Secure data outsourcing

– Support computation and queries on encrypted data

(figure: computation / queries run directly on the encrypted data)

Applications of Cryptography

• Multi-party secure computations (secure function evaluation)

– Securely compute a function without revealing private inputs

(figure: parties with private inputs x1, x2, …, xn jointly compute f(x1, x2, …, xn))

Applications of Cryptography

• Private information retrieval (access privacy)

– Retrieve data without revealing the query (access pattern)

Course Topics

• Inference control
– De-identification and anonymization
– Differential privacy foundations
– Differential privacy applications
• Histograms
• Data mining
• Local differential privacy
• Location privacy
• Cryptography
• Access control
• Applications

k-Anonymity

Original data:
Race    Zipcode  Diagnosis
Caucas  78712    Flu
Asian   78705    Shingles
Caucas  78754    Flu
Asian   78705    Acne
AfrAm   78705    Acne
Caucas  78705    Flu

Anonymized data (k = 2):
Race         Zipcode  Diagnosis
Caucas       787XX    Flu
Asian/AfrAm  78705    Shingles
Caucas       787XX    Flu
Asian/AfrAm  78705    Acne
Asian/AfrAm  78705    Acne
Caucas       787XX    Flu

Quasi-identifiers (QID) = race, zipcode
Sensitive attribute = diagnosis
k-anonymity: the size of each QID group is at least k

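As an illustration (not part of the slides), a minimal Python check of this definition, assuming records are represented as dictionaries:

from collections import Counter

def is_k_anonymous(rows, qid_cols, k):
    # Count how often each combination of quasi-identifier values occurs;
    # the table is k-anonymous iff every QID group has at least k rows.
    groups = Counter(tuple(row[c] for c in qid_cols) for row in rows)
    return all(size >= k for size in groups.values())

rows = [
    {"race": "Caucas",      "zip": "787XX", "diag": "Flu"},
    {"race": "Asian/AfrAm", "zip": "78705", "diag": "Shingles"},
    {"race": "Caucas",      "zip": "787XX", "diag": "Flu"},
    {"race": "Asian/AfrAm", "zip": "78705", "diag": "Acne"},
    {"race": "Asian/AfrAm", "zip": "78705", "diag": "Acne"},
    {"race": "Caucas",      "zip": "787XX", "diag": "Flu"},
]
print(is_k_anonymous(rows, ["race", "zip"], k=2))   # True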

Problem of k-anonymity

Attacker's background knowledge: Rusty Shackleford is Caucas, zipcode 78705.

Anonymized data (k = 2):
Race         Zipcode  Diagnosis
Caucas       787XX    Flu
Asian/AfrAm  78705    Shingles
Caucas       787XX    Flu
Asian/AfrAm  78705    Acne
Asian/AfrAm  78705    Acne
Caucas       787XX    Flu

Problem: sensitive attributes are not "diverse" within each quasi-identifier group. Rusty's record falls in the Caucas/787XX group, where every record has Flu, so his diagnosis is disclosed despite k-anonymity.

l-Diversity

Race         Zipcode  Diagnosis
Caucas       787XX    Flu
Caucas       787XX    Shingles
Caucas       787XX    Acne
Caucas       787XX    Flu
Caucas       787XX    Acne
Caucas       787XX    Flu
Asian/AfrAm  78XXX    Flu
Asian/AfrAm  78XXX    Flu
Asian/AfrAm  78XXX    Acne
Asian/AfrAm  78XXX    Shingles
Asian/AfrAm  78XXX    Acne
Asian/AfrAm  78XXX    Flu

Entropy l-diversity: the entropy of the sensitive-attribute distribution within each quasi-identifier group must be at least log(l).

[Machanavajjhala et al. ICDE ‘06]
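As an illustration (not part of the slides), a sketch that computes each group's effective l as exp of its entropy; the table satisfies entropy l-diversity iff the minimum effective l is at least l:

import math
from collections import Counter, defaultdict

def min_effective_l(rows, qid_cols, sens_col):
    # Group rows by quasi-identifier values.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in qid_cols)].append(row[sens_col])
    smallest = float("inf")
    for values in groups.values():
        counts = Counter(values)
        n = len(values)
        # Entropy H of the sensitive-value distribution in this group;
        # entropy l-diversity requires H >= log(l), i.e. exp(H) >= l.
        h = -sum((c / n) * math.log(c / n) for c in counts.values())
        smallest = min(smallest, math.exp(h))
    return smallest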

Original dataset (99% have HIV-): 11 records HIV-, 1 record HIV+

Anonymization A:
Q1: HIV+, HIV-, HIV+, HIV-, HIV+, HIV-
Q2: HIV-, HIV-, HIV-, HIV-, HIV-, HIV-
The 50% HIV- quasi-identifier group Q1 is "diverse", yet this leaks a ton of information: a Q1 member is HIV+ with probability 50% instead of the baseline 1%.

Anonymization B:
Q1: HIV-, HIV-, HIV-, HIV+, HIV-, HIV-
Q2: HIV-, HIV-, HIV-, HIV-, HIV-, Flu
The 99% HIV- quasi-identifier groups are not "diverse", yet this anonymized database does not leak anything beyond the overall distribution.

Problem with l-diversity

• As the example above shows, l-diversity is neither necessary nor sufficient to prevent attribute disclosure when the sensitive-attribute distribution is skewed

t-Closeness [Li et al. ICDE '07]

• Distribution of sensitive attributes within each quasi-identifier group should be "close" to their distribution in the entire original database

Problems with Syntactic Privacy Notions

• Syntactic

– Focuses on the data transformation, not on what can be learned from the anonymized dataset

• "Quasi-identifier" fallacy

– Assumes a priori that the attacker will not know certain information about the target

– The attacker may know the records in the database or external information

Course Topics

• Inference control

– De-identification and anonymization

– Differential privacy foundations

– Differential privacy applications

• Histograms

• Data mining

• Location privacy

• Cryptography

• Access control

• Applications

Differential Privacy

• Statistical outcome is indistinguishable regardless of whether a particular user (record) is included in the data

Statistical Data Release: disclosure risk
(figure: original records → original histogram)

Statistical Data Release: differential privacy
(figure: original records → original histogram → perturbed histogram with differential privacy)

Differential Privacy

A privacy mechanism A gives ε-differential privacy if for all neighboring databases D, D', and for any possible output S ∈ Range(A),

Pr[A(D) = S] ≤ exp(ε) × Pr[A(D') = S]

• D and D' are neighboring databases if they differ in one record

Laplace Mechanism

• Add Laplace noise calibrated to the global sensitivity: A(D) = f(D) + Lap(Δf/ε)

• Global sensitivity: Δf = max over neighboring D, D' of |f(D) − f(D')|

Example: Laplace Mechanism

• For a single counting query Q over a dataset D, returning Q(D)+Laplace(1/ε) gives ε-differential privacy.

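A minimal sketch (not from the slides) of this mechanism, assuming numpy:

import numpy as np

def dp_count(data, predicate, epsilon):
    # A counting query has global sensitivity 1: adding or removing one
    # record changes the count by at most 1, so Lap(1/epsilon) noise
    # gives epsilon-differential privacy.
    true_count = sum(1 for x in data if predicate(x))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [42, 31, 28, 43, 65, 37]
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))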

Exponential Mechanism

• Sample an output r with probability based on a utility score function u(D, r)

• For a database D, output space R, and utility score function u : D × R → ℝ, the algorithm A with

Pr[A(D) = r] ∝ exp(ε × u(D, r) / (2Δu))

satisfies ε-differential privacy, where Δu is the sensitivity of the utility score function:

Δu = max over r and neighboring D, D' of |u(D, r) − u(D', r)|

Example: Exponential Mechanism

• Scoring/utility function u : Inputs × Outputs → ℝ

• D: nationalities of a set of people

• f(D): most frequent nationality in D

• u(D, O) = #(D, O), the number of people with nationality O
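A minimal sketch (not from the slides) of this example; the utility is a count, so its sensitivity Δu is 1:

import math
import random
from collections import Counter

def exponential_mechanism(data, outputs, utility, epsilon, delta_u):
    # Sample output r with probability proportional to
    # exp(epsilon * u(D, r) / (2 * delta_u)).
    weights = [math.exp(epsilon * utility(data, r) / (2 * delta_u))
               for r in outputs]
    return random.choices(outputs, weights=weights, k=1)[0]

nationalities = ["US", "US", "US", "India", "China", "India"]
candidates = ["US", "India", "China"]
count_utility = lambda d, r: Counter(d)[r]   # one record changes a count by at most 1
print(exponential_mechanism(nationalities, candidates, count_utility,
                            epsilon=1.0, delta_u=1))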

Composition theorems

• Sequential composition: mechanisms with budgets εi run on the same data give (Σi εi)-differential privacy

• Parallel composition: mechanisms with budgets εi run on disjoint subsets of the data give max(εi)-differential privacy

• Example: answering two counting queries on the same table, each with Lap(1/ε) noise, costs 2ε in total; adding Lap(1/ε) noise to every bin of a histogram costs only ε, because the bins partition the data

Differential Privacy

• Differential privacy ensures that an attacker can't infer the presence or absence of a single record in the input based on any output

• Building blocks

– Laplace, exponential mechanism

• Composition rules help build complex algorithms using building blocks

Course Topics

• Inference control

– De-identification and anonymization

– Differential privacy foundations

– Differential privacy applications

• Histograms

• Data mining

• Location privacy

• Cryptography

• Access control

• Applications

Baseline: Laplace Mechanism

• For the counting query Q on each histogram bin, returning Q(D) + Laplace(1/ε) gives ε-differential privacy (by parallel composition, since the bins are disjoint)

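A minimal sketch (not from the slides), assuming numpy; because the bins partition the data, parallel composition applies:

import numpy as np

def dp_histogram(values, bins, epsilon):
    # Each record falls into exactly one bin, so adding Lap(1/epsilon)
    # noise to every bin count costs epsilon in total (parallel
    # composition), not epsilon per bin.
    counts, edges = np.histogram(values, bins=bins)
    noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
    return noisy, edges

ages = np.random.randint(18, 90, size=1000)
noisy_counts, edges = dp_histogram(ages, bins=8, epsilon=0.5)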

DPCube [SecureDM 2010, ICDE 2012 demo]

Original records:
Name   Age  Income  HIV+
Frank  42   30K     Y
Bob    31   60K     Y
Mary   28   20K     Y
…      …    …       …

Original records → DP interface → DP unit histogram (ε/2-DP) → multi-dimensional partitioning → DP V-optimal histogram (ε/2-DP)

• Compute unit histogram counts with differential privacy (ε/2)

• Use the DP unit histogram for partitioning

• Compute V-optimal histogram counts with differential privacy (ε/2)

Private Spatial Decompositions [CPSSY 12]

• Index structures such as quadtrees and kd-trees

• Need to ensure that both the partitioning boundaries and the counts of each partition are differentially private

Histogram methods vs parametric methods

• Non-parametric methods: learn the empirical distribution through histograms (e.g. PSD, Privelet, FP, P-HP); only work well for low-dimensional data

• Parametric methods: fit the data to a distribution and make inferences about the parameters (e.g. PrivacyOnTheMap); the joint distribution is difficult to model

(figure: original data → perturbation → histogram → synthetic data)

DPCopula: a semi-parametric method

• Non-parametric estimation for each dimension: DP marginal histograms (Age, Hours/week, Income)

• Parametric estimation for the dependence structure

Original data set:
Age  Hours/week  Income
42   64          30K
31   82          60K
28   40          20K
43   36          80K
…    …           …

Original data set → DP marginal histograms + dependence structure → DP synthetic data set

PrivBayes: Bayesian Network

• Marginal distributions + Bayesian network

Network over (age, workclass, education, title, income):
Pr[age], Pr[work | age], Pr[edu | age], Pr[title | work], Pr[income | work]

Outline of the Algorithm

• STEP 1: Choose a suitable Bayesian network 𝒩, in a differentially private way: add the edges with the highest mutual information via the exponential mechanism

• STEP 2: Compute the conditional distributions implied by 𝒩: straightforward under differential privacy; inject noise via the Laplace mechanism

• STEP 3: Generate synthetic data by sampling from 𝒩: post-processing, so no privacy issues

Evaluation for DP Histograms

• Metrics: random range-count queries with random query predicates covering all attributes

• Report the relative error and absolute error of the noisy answers against the true answers

Course Topics

• Inference control

– De-identification and anonymization

– Differential privacy foundations

– Differential privacy applications

• Histograms

• Data mining

• Location privacy

• Cryptography

• Access control

• Applications

Frequent sequence mining (FSM)

Database D:
ID   Record
100  a→c→d
200  b→c→d
300  a→b→c→e→d
400  d→b
500  a→d→c→d

Scan D → C1 (candidate 1-seqs): {a}:3, {b}:3, {c}:4, {d}:4, {e}:1
F1 (frequent 1-seqs): {a}:3, {b}:3, {c}:4, {d}:4

Scan D → C2 (candidate 2-seqs):
{a→a}:0  {a→b}:1  {a→c}:3  {a→d}:3
{b→a}:0  {b→b}:2  {b→c}:2  {b→d}:1
{c→a}:0  {c→b}:0  {c→c}:0  {c→d}:4
{d→a}:0  {d→b}:1  {d→c}:1  {d→d}:0
F2 (frequent 2-seqs): {a→c}:3, {a→d}:3, {c→d}:4

Scan D → C3 (candidate 3-seqs): {a→c→d}
F3 (frequent 3-seqs): {a→c→d}:3

Baseline: Laplace Mechanism

Same database D; at each level k, add Laplace noise Lap(|Ck| / εk) to every candidate support.

Scan D → C1 (candidate 1-seqs), supports with noise:
{a}: 3 + 0.2   {b}: 3 − 0.4   {c}: 4 + 0.4   {d}: 4 − 0.5   {e}: 1 + 0.8
F1 (noisy supports above the frequency threshold): {a}:3.2, {c}:4.4, {d}:3.5

Scan D → C2 (candidate 2-seqs from F1), supports with noise:
{a→a}: 0 + 0.2   {a→c}: 3 + 0.3   {a→d}: 3 + 0.2
{c→a}: 0 − 0.5   {c→c}: 0 + 0.8   {c→d}: 4 + 0.2
{d→a}: 0 + 0.3   {d→c}: 1 + 2.1   {d→d}: 0 − 0.5
F2 (noisy supports): {a→c}:3.3, {a→d}:3.2, {c→d}:4.2, {d→c}:3.1

Scan D → C3 (candidate 3-seqs from F2), supports with noise:
{a→c→d}: 3 + 0   {a→d→c}: 1 + 0.3
F3 (noisy supports): {a→c→d}:3

Note the false positive: {d→c} (true support 1) enters F2 only because of a large noise draw of 2.1.
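A minimal sketch (not from the slides) of the per-level noisy pruning step, assuming numpy and a precomputed support table:

import numpy as np

def noisy_frequent(candidates, support, epsilon_k, threshold):
    # Baseline at level k: the budget epsilon_k is shared by |C_k|
    # candidate counts (sequential composition), so each noisy support
    # gets Lap(|C_k| / epsilon_k) noise; keep candidates above threshold.
    scale = len(candidates) / epsilon_k
    noisy = {c: support[c] + np.random.laplace(scale=scale) for c in candidates}
    return {c: s for c, s in noisy.items() if s >= threshold}

c1_support = {"a": 3, "b": 3, "c": 4, "d": 4, "e": 1}
print(noisy_frequent(list(c1_support), c1_support, epsilon_k=1.0, threshold=3))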

PFS2 Algorithm

• Sensitivity is impacted by two factors:

– Candidate size

– Sequence length

• Basic idea: reduce sensitivity

– Use the kth sample database for pruning candidate k-sequences: reduces candidate size

– Shrink sequences by transformation while maintaining the frequent patterns: reduces sequence length

Partition the original database into sample databases: 1st sample database, 2nd sample database, …, mth sample database.

DP Frequent Sequence Mining Evaluation

• Metrics

– F-score = 2 × precision × recall / (precision + recall)

– Relative error: RE = median over x ∈ X of |sup'x − supx| / supx

Course Topics

• Inference control
– De-identification and anonymization
– Differential privacy foundations
– Differential privacy applications
• Histograms
• Data mining
• Local differential privacy
• Location privacy
• Cryptography
• Access control
• Applications

Local Differential Privacy

(figure: users holding private values such as Finance.com, Fashion.com, WeirdStuff.com each report a locally perturbed value to the server)

• No trusted server

• Each user applies local perturbation before submitting the value to the server

• The server only aggregates the values

• Google Chrome deployment (RAPPOR)

Randomized Response [W 65]

With probability p, report the true value; with probability 1 − p, report the flipped value.

True value D (Disease Y/N)   Reported value O
Y                            Y
Y                            N
N                            N
Y                            N
N                            Y
N                            N
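A minimal sketch (not from the slides) of the mechanism together with the standard unbiased estimator for the true fraction of Y answers:

import random

def randomized_response(true_bit, p):
    # Report the true bit with probability p, the flipped bit otherwise.
    return true_bit if random.random() < p else 1 - true_bit

def estimate_fraction(reports, p):
    # E[observed] = p*f + (1 - p)*(1 - f), so solve for f.
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

truth = [1] * 300 + [0] * 700
reports = [randomized_response(b, p=0.75) for b in truth]
print(estimate_fraction(reports, p=0.75))   # close to the true fraction 0.3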

Differential Privacy Analysis

• Consider 2 databases D, D' (of size M) that differ in the jth value

• D[j] ≠ D'[j], but D[i] = D'[i] for all i ≠ j

• Consider some output O: the reports for all i ≠ j have identical probabilities under D and D', and the probability of the jth report changes by at most a factor of p / (1 − p)

• Hence Pr[O | D] ≤ (p / (1 − p)) × Pr[O | D'], i.e., randomized response satisfies ε-differential privacy with ε = ln(p / (1 − p))

Course Topics

• Inference control
– De-identification and anonymization
– Differential privacy foundations
– Differential privacy applications
• Histograms
• Data mining
• Local differential privacy
• Location privacy
• Cryptography
• Access control
• Applications

Individual Location Sharing: Existing Solutions and Challenges

• Private information retrieval

– Computationally expensive

• Spatial cloaking

– Syntactic privacy notion

– Temporal correlations due to road constraints and moving patterns

• Event-level differential privacy

– Protects an event (the exact location of a single user at a given time)

– Challenges: the large input domain (all locations on the map) makes the output useless; temporal correlations

Event-level differential privacy

Definition (Differential Privacy)
At any timestamp t, a randomized mechanism A satisfies ε-differential privacy if, for any output zt and any two locations x1 and x2, the following holds:

Pr[A(x1) = zt] ≤ exp(ε) × Pr[A(x2) = zt]

Intuition: the released location zt (observed by the adversary) will not help an adversary differentiate any input locations.

• Challenges:

– Distance does not capture location semantics

– Temporal correlations

Geo-indistinguishability [CCS 13]

Definition (Geo-indistinguishability)
At any timestamp t, a randomized mechanism A satisfies ε-geo-indistinguishability if, for any output zt and any two locations x1 and x2 within a circle of radius r, the following holds:

Pr[A(x1) = zt] ≤ exp(ε × r) × Pr[A(x2) = zt]

Intuition: the released location zt (observed by the adversary) will not help an adversary differentiate any two input locations that are close to each other.
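One mechanism that achieves geo-indistinguishability is the planar Laplace mechanism of the CCS 13 paper; a sketch, assuming scipy for the Lambert W function used to invert the radial CDF:

import math
import random
from scipy.special import lambertw

def planar_laplace(x, y, epsilon):
    # Draw a uniform angle, then a radius from the 2D Laplace radial
    # distribution by inverting its CDF with the Lambert W function
    # (branch -1); add the resulting displacement to the true location.
    theta = random.uniform(0.0, 2.0 * math.pi)
    p = random.random()
    r = -(lambertw((p - 1) / math.e, k=-1).real + 1) / epsilon
    return x + r * math.cos(theta), y + r * math.sin(theta)

print(planar_laplace(41.39, 2.15, epsilon=0.1))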

Differential privacy on δ-location set under temporal correlations [CCS 15]

Definition (Differential Privacy on δ-location set)
At any timestamp t, a randomized mechanism A satisfies ε-differential privacy on the δ-location set if, for any output zt and any two locations x1 and x2 in the δ-location set, the following holds:

Pr[A(x1) = zt] ≤ exp(ε) × Pr[A(x2) = zt]

Intuition: the released location zt (observed by the adversary) will not help an adversary differentiate any two locations in the δ-location set, the set of possible locations where the user might appear.

Course Topics

• Inference control/Differential Privacy

• Cryptography

– Foundations

– Applications

• Secure outsourcing

• Secure multiparty computations

• Private information retrieval

• Access control

• Applications


Operational model of encryption

m: plaintext; k: encryption key; k': decryption key
Encryption produces the ciphertext Ek(m); decryption recovers Dk'(Ek(m)) = m
The attacker observes the ciphertext.

• Kerckhoffs' assumption:

– The attacker knows E and D

– The attacker does not know the (decryption) key

• Attacker's goals:

– To systematically recover plaintext from ciphertext

– To deduce the (decryption) key

• Attack models:

– Ciphertext-only

– Known-plaintext

– (Adaptive) chosen-plaintext

– (Adaptive) chosen-ciphertext

Cryptography Primitives

• Symmetric encryption

• Public-key encryption

• Encryption schemes with different properties

– Homomorphic encryption

– Probabilistic encryption vs deterministic encryption

– Order-preserving encryption

– Commutative encryption

Symmetric Key Cryptography

• Bob and Alice share the same (symmetric) key KA-B

• Encryption: c = KA-B(m); decryption: m = KA-B(c)

• Examples: AES
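For illustration (not from the slides), shared-key encryption with the Python cryptography package; Fernet is an AES-based authenticated-encryption recipe:

from cryptography.fernet import Fernet

key = Fernet.generate_key()                  # the shared symmetric key K_A-B
f = Fernet(key)
token = f.encrypt(b"meet at noon")           # ciphertext c = K_A-B(m)
assert f.decrypt(token) == b"meet at noon"   # m = K_A-B(c)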

Public-Key Cryptography

• Public key for encryption and secret key for decryption

• Examples: RSA

Course Topics

• Inference control/Differential Privacy

• Cryptography

– Foundations

– Applications

• Secure multiparty computations

• Secure outsourcing

• Access control

• Applications


Multi-party secure computations (secure function evaluation)

• Securely compute a function without revealing private inputs

(figure: parties with private inputs x1, x2, …, xn jointly compute f(x1, x2, …, xn))

Security Model

• A protocol is secure if it emulates an ideal setting where the parties hand their inputs to a “trusted party,” who locally computes the desired outputs and hands them back to the parties [Goldreich-Micali-Wigderson 1987]

(figure: A holds x1 and receives f1(x1, x2); B holds x2 and receives f2(x1, x2))

Properties of the Definition

• Correctness

– All honest participants should receive the correct result of evaluating function f

• Privacy

– All corrupt participants should learn no more from the protocol than what they would learn in ideal model

– That is, nothing beyond their own private inputs (obviously) and the result of f

Adversary Models

• Semi-honest (aka passive; honest-but-curious)

– Follows protocol, but tries to learn more from received messages than he would learn in the ideal model

• Malicious

– Deviates from the protocol in arbitrary ways, lies about his inputs, may quit at any point

Security proof tools

• Real/ideal model: the real model can be simulated in the ideal model

– Key idea: show that whatever can be computed by a party participating in the protocol can be computed based on its input and output only

– There exists a polynomial-time simulator S such that {S(x, f(x, y))} ≡ {View(x, y)}

• Composition theorem

– If a protocol is secure in the hybrid model, where the protocol uses a trusted party that computes the (sub-)functionalities, and the calls to the trusted party are replaced by calls to secure protocols, then the resulting protocol is secure

– Prove that the component protocols are secure, then prove that the combined protocol is secure

General protocols

• Primitives

– Oblivious transfer (OT)

– Random shares


Oblivious Transfer (OT)

• Fundamental SMC primitive

• 1-out-of-2 Oblivious Transfer (OT)

(figure: sender S has m0, m1; receiver R has a choice bit b and obtains mb)

• S inputs two bits m0, m1; R inputs the index b of one of S's bits

• R learns his chosen bit mb; S learns nothing

– S does not learn which bit R has chosen; R does not learn the value of the bit he did not choose

[Rabin 1981]

Secret Sharing Scheme

• Splitting

– Encode the secret as an integer S

– Give each player i (except one) a random integer ri

– Give the last player the number S − Σ_{i=1..n−1} ri
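A minimal sketch (not from the slides) of additive splitting, working modulo a public prime so every share looks uniformly random:

import secrets

P = 2**127 - 1   # public prime modulus; shares live in Z_P

def split(secret, n):
    # n - 1 random shares, plus one share that forces the sum to the secret.
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

shares = split(42, n=5)
assert reconstruct(shares) == 42   # any n - 1 shares alone reveal nothing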

(t, n) threshold scheme

• Shamir’s scheme 1979

– It takes t points to define a polynomial of degree t − 1

– Create a polynomial of degree t − 1 with the secret as the constant coefficient and the remaining coefficients picked at random; find n points on the curve and give one to each player; at least t points are required to fit the polynomial and recover the secret
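A minimal sketch (not from the slides) of Shamir's (t, n) scheme over a prime field, reconstructing by Lagrange interpolation at x = 0:

import secrets

P = 2**127 - 1   # public prime modulus

def shamir_split(secret, t, n):
    # Degree t - 1 polynomial with the secret as the constant coefficient.
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(t - 1)]
    f = lambda x: sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def shamir_reconstruct(points):
    # Lagrange interpolation evaluated at x = 0 recovers the secret.
    secret = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * -xj % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = shamir_split(42, t=3, n=5)
assert shamir_reconstruct(shares[:3]) == 42   # any 3 of the 5 shares suffice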

General protocols

• Passively-secure computation for the semi-honest model

– Yao's garbled circuits for two parties (OT and symmetric encryption)

– GMW protocol for multiple parties (random shares and OT)

• From passively-secure protocols to actively-secure protocols for the malicious model

– Use zero-knowledge proofs to force parties to behave in a way consistent with the passively-secure protocol

Specialized protocols

• Use secret sharing, special encryption schemes, or randomized responses

– May reveal some information

– Tradeoff between security and efficiency

• Examples (see the sketch below)

– Secure sum by random shares

– Secure union (using commutative encryption)

• Build complex protocols from primitive protocols
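A minimal sketch (not from the slides) of secure sum with random shares in a ring topology: the first party masks the running total with a random value R and removes it at the end, so no party ever sees a true partial sum:

import secrets

M = 2**61 - 1   # public modulus, larger than any possible sum

def secure_sum(values):
    # Party 1 starts the ring with a random mask R; each party adds its
    # private value to the masked running total; party 1 removes R.
    R = secrets.randbelow(M)
    total = R
    for v in values:              # in a real protocol, each step is a message
        total = (total + v) % M
    return (total - R) % M

assert secure_sum([10, 20, 30]) == 60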

Course Topics

• Inference control/Differential Privacy

• Cryptography

– Foundations

– Applications

• Secure multiparty computations

• Secure outsourcing

• Access control

• Applications


Secure data outsourcing

• Support computation and queries on encrypted data


(figure: computation / queries run directly on the encrypted data)

Secure data outsourcing

• Crypto primitives

– Homomorphic encryption

• General protocols based on fully homomorphic encryption: computationally prohibitive

• Specialized protocols based on partially homomorphic encryption

– Property-preserving encryption

• Deterministic encryption vs probabilistic encryption

• Order-preserving encryption

Homomorphic Encryption

Eval(f, Enc[x]) → Enc[f(x)]: evaluate a function f over encrypted data without decrypting it.

Homomorphic encryption schemes

• Partially homomorphic encryption: only one type of operation is possible (× or +)

– Multiplicatively homomorphic: e.g. RSA

– Additively homomorphic: e.g. Paillier

• Somewhat homomorphic encryption: can perform a limited number of additions and multiplications

• Fully homomorphic encryption (FHE) (Gentry, 2010): can perform an unbounded number of additions and multiplications
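For illustration (not from the slides): textbook RSA is multiplicatively homomorphic, since Enc(a) × Enc(b) mod n = (ab)^e mod n = Enc(a × b). A toy example with deliberately tiny, insecure parameters:

# Toy RSA (insecure parameters; illustration only).
p, q = 61, 53
n = p * q                            # public modulus
e = 17                               # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent

enc = lambda m: pow(m, e, n)
dec = lambda c: pow(c, d, n)

a, b = 7, 6
# Multiplying ciphertexts multiplies the underlying plaintexts:
assert dec(enc(a) * enc(b) % n) == (a * b) % n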

Using Partially Homomorphic Encryption with two servers

• Two-server setting

– C1 holds the encrypted data E(a), E(b)

– C2 holds the decryption key sk

• Security goal

– C1 and C2 learn nothing about the data or the result

• Basic idea

– Utilize the additive homomorphic property

– Use random shares so that C2 only ever decrypts values blinded by random shares

• Primitive protocols: secure multiplication, secure comparison, … (see the sketch below)

• Build complex protocols from primitive protocols, e.g. secure kNN queries
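A sketch of the secure multiplication primitive under these assumptions, using the python-paillier (phe) library for the additively homomorphic encryption; the blinding values ra, rb play the role of the random shares:

from phe import paillier

pub, priv = paillier.generate_paillier_keypair()   # sk is held by C2 only

# C1 holds only ciphertexts:
E_a, E_b = pub.encrypt(4), pub.encrypt(7)

# C1 blinds each value with a random share (homomorphic addition):
ra, rb = 123, 456                  # in practice drawn uniformly at random
c1, c2 = E_a + ra, E_b + rb

# C2 decrypts the blinded values, multiplies, and re-encrypts; it only
# ever sees a + ra and b + rb, which are random to it:
blinded_prod = pub.encrypt(priv.decrypt(c1) * priv.decrypt(c2))

# C1 removes the blinding homomorphically, using
# (a + ra)(b + rb) = ab + a*rb + b*ra + ra*rb:
E_ab = blinded_prod - E_a * rb - E_b * ra - ra * rb
assert priv.decrypt(E_ab) == 28    # 4 * 7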

Secure data outsourcing

• Crypto primitives

– Homomorphic encryption

• General protocols based on fully homomorphic encryption: computationally prohibitive

• Specialized protocols based on partially homomorphic encryption

– Property-preserving encryption

• Deterministic encryption vs probabilistic encryption

• Order-preserving encryption

CryptDB

• Uses layers of encryption with different properties

• Decrypts layers only as needed to support queries

