Yang, et al. Differentially Private Data Publication and Analysis. Tutorial at SIGMOD'12
Part 4: Data Dependent Query Processing Methods
Yin "David" Yang, Zhenjie Zhang, Gerome Miklau
Prev. session: Marianne Winslett, Xiaokui Xiao
What we talked about in the last session
- Privacy is a major concern in data publishing
- Simple anonymization methods fail to provide sufficient privacy protection
- Definition of differential privacy: it is hard to tell whether a record is in the DB from query results (plausible deniability)
- Basic solutions
  - Laplace mechanism: inject Laplace noise into query results
  - Exponential mechanism: choose a result randomly; a "good" result has higher probability
  - Both are data independent methods
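As a reminder, the Laplace mechanism can be sketched in a few lines of Python (the function name and parameters here are illustrative, not from the tutorial):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Standard Laplace mechanism: add noise drawn from
    Laplace(0, sensitivity/epsilon) to the exact query answer."""
    scale = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# A count query has sensitivity 1: adding or removing one record
# changes the answer by at most 1
noisy_count = laplace_mechanism(true_answer=100, sensitivity=1.0, epsilon=0.5)
```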
Data independent vs. data dependent

                        Data independent methods   Data dependent methods
Sensitive info          Query results              Query results + data dependent parameters
Error source            Injected noise             Injected noise + information loss
Noise type              Unbiased                   Often biased
Asymptotic error bound  Higher                     Lower, with data dependent constants
Practical accuracy      Higher                     Lower for some data
Types of data dependent methods
Type 1: optimizing noisy results
  1. Inject noise
  2. Optimize the noisy query results based on their values
Type 2: transforming original data
  1. Transform the data to reduce the amount of necessary noise
  2. Inject noise
Optimizing noisy results: the hierarchical strategy presented in the last session
- Hierarchical strategy: a tree with a noisy count in each node
- Data dependent optimization: if a node N has a noisy count close to 0 (e.g., 0.05), set the noisy count at N to 0

[Figure: a hierarchy over values v1-v4 with internal nodes N1-N7; one node's noisy count of 0.05 is optimized to 0]
Hay et al. Boosting the Accuracy of Differentially-Private Queries Through Consistency, VLDB’10.
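The zero-snapping step above is pure post-processing of already-noisy counts, so it consumes no extra privacy budget. A minimal sketch (the threshold value is an assumption for illustration, not taken from the paper):

```python
def zero_small_counts(noisy_counts, threshold=0.5):
    """Snap noisy counts near 0 back to exactly 0. This operates only
    on already-private outputs, so it costs no additional budget."""
    return [0.0 if abs(c) < threshold else c for c in noisy_counts]

zero_small_counts([0.05, 12.3, -0.4, 7.8])  # -> [0.0, 12.3, 0.0, 7.8]
```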
Optimizing noisy results: iReduct
Setting: answer a set of m queries
Goal: minimize their total relative error
  RelErr = (noisy result − actual result) / actual result
Example:
- Two queries, q1 and q2
- Actual results: q1: 10, q2: 20
- Observation: we should add less noise to q1 than to q2 (the smaller answer suffers more, in relative terms, from the same noise)
Xiao et al. iReduct: Differential Privacy with Reduced Relative Errors, SIGMOD’11.
Answering queries differently leads to different total relative error
Continuing the example:
- Two queries, q1 and q2, with actual answers 10 and 20
- Suppose each of q1 and q2 has sensitivity 1
Two strategies:
- Strategy 1: answer q1 with ε/2 and q2 with ε/2: noise scale 2/ε on q1, 2/ε on q2
- Strategy 2: answer q1 with 2ε/3 and q2 with ε/3: noise scale 1.5/ε on q1, 3/ε on q2
  (lower relative error overall)
But we don't know which strategy is better before comparing their actual answers!
Idea of iReduct
1. Answer all queries with privacy budget ε/t
2. Refine the noisy results with budget ε/t, spending more budget on queries with smaller results
   How to refine a noisy count?
   - Method 1: obtain a new noisy version and compute a weighted average with the old version
   - Method 2: obtain a refined version directly from a more complicated distribution
3. Repeat the last step t − 1 times
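Refinement method 1 can be sketched as an inverse-variance weighted average of the old and new noisy estimates. The helper below is illustrative (`true_val` stands in for evaluating the query on the real data):

```python
import numpy as np

def refine(old_est, old_scale, true_val, new_scale):
    """iReduct refinement, method 1 (sketch): draw a fresh noisy
    estimate with Laplace scale new_scale, then combine old and new by
    inverse-variance weighting, which minimizes the variance of the
    combined estimate. Laplace(b) has variance 2*b^2."""
    new_est = true_val + np.random.laplace(scale=new_scale)
    w_old = 1.0 / (2 * old_scale ** 2)
    w_new = 1.0 / (2 * new_scale ** 2)
    return (w_old * old_est + w_new * new_est) / (w_old + w_new)
```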
Example of iReduct
- Iteration 1: answer q1 and q2 with ε/2t each; noisy results q1 = 16, q2 = 14
  Refine with budget ε/t, split 14/30 to q1 and 16/30 to q2 (the smaller result gets more budget)
- Iteration 2: noisy results q1 = 12, q2 = 24
  Refine with budget ε/t, split 2/3 to q1 and 1/3 to q2
- Iteration 3: noisy results q1 = 9, q2 = 22
  Refine with budget ε/t, split 22/31 to q1 and 9/31 to q2
- ...
Optimizing noisy results: MW
Problem: publish a histogram under DP that is optimized for a given query set
Idea:
- Start from a uniform histogram
- Repeat the following t times:
  - Evaluate all queries; find the query q with the worst accuracy
  - Modify the histogram to improve the accuracy of q, using a technique called multiplicative weights (MW)
Hardt et al. A simple and practical algorithm for differentially private data release, arXiv.
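One MW update step might look like the sketch below (`eta` is an assumed step size; the exact update and noise handling in the paper differ in details):

```python
import numpy as np

def mw_update(synthetic, query, noisy_true_answer, eta=0.05):
    """One multiplicative-weights step: boost the bins the query counts
    when the synthetic histogram undercounts the noisy true answer,
    shrink them when it overcounts, then renormalize so total mass is
    preserved. `query` is a 0/1 indicator vector over bins."""
    total = synthetic.sum()
    error = noisy_true_answer - query @ synthetic
    updated = synthetic * np.exp(eta * error * query / total)
    return updated * total / updated.sum()

hist = np.full(4, 25.0)        # uniform start, total mass 100
q = np.array([1, 1, 0, 0])     # range count over the first two bins
hist = mw_update(hist, q, noisy_true_answer=70.0)
```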
Example of MW

[Figure: the exact histogram vs. the evolving synthetic histogram, evaluated on two range count queries q1 and q2]

- Initial histogram: uniform; building it has no privacy budget cost! q1 is less accurate than q2.
- Iteration 1: optimize q1 (privacy cost: ε/t); q1 is still less accurate.
- Iteration 2: optimize q1 again (privacy cost: ε/t); now q2 is less accurate.
- Iteration 3: optimize q2 (privacy cost: ε/t).
Optimizing noisy results: NoiseFirst
Problem: publish a histogram
Xu et al. Differentially Private Histogram Publication, ICDE'12.

Original data in a medical statistical DB:

Name   Age  HIV+
Frank  42   Y
Bob    31   Y
Mary   28   Y
Dave   43   N
...    ...  ...

[Figure: the corresponding histogram]
Reduce error by merging bins

[Figure: noisy histogram vs. exact histogram vs. optimized histogram; the three merged bins each get count 2]

- Bin-merging scheme computed through dynamic programming
- Positive/negative noise cancels out!
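The cancellation effect can be seen in a short sketch where each merged group is replaced by its average (the group boundaries are given here; finding them by dynamic programming is omitted):

```python
def merge_bins(noisy_counts, groups):
    """NoiseFirst-style post-processing (sketch): replace each group of
    merged bins by the group's average count, so the independent noise
    terms inside a group partially cancel. `groups` lists (start, end)
    half-open index ranges."""
    out = list(noisy_counts)
    for start, end in groups:
        avg = sum(noisy_counts[start:end]) / (end - start)
        for i in range(start, end):
            out[i] = avg
    return out

# Merge the last three bins of a noisy histogram
merge_bins([4.0, 10.0, 1.5, 2.5, 2.0], [(2, 5)])  # -> [4.0, 10.0, 2.0, 2.0, 2.0]
```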
Next we focus on the second type.
Type 1: optimizing noisy results
  1. Inject noise
  2. Optimize the noisy query results based on their values
Type 2: transforming original data
  1. Transform the data to reduce the amount of necessary noise
  2. Inject noise
Transforming data: StructureFirst
An alternative solution for histogram publication

[Figure: original histogram (sensitivity ∆ = 1) vs. histogram after merging bins (∆ = 1/3 for a group of three bins, ∆ = 1/2 for a group of two)]

Lower sensitivity means less noise!

Xu et al. Differentially Private Histogram Publication, ICDE'12.
Related: Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM'10.
But the optimal structure is sensitive!

[Figure: original histogram; the optimal merging structures differ with and without Alice's record]

Alice is an HIV+ patient!
StructureFirst uses the exponential mechanism to render its structure differentially private.
- Randomly perturb the optimal histogram structure: set each boundary using the exponential mechanism, where a candidate boundary's probability decreases with the SSE it induces

[Figure: original histogram → merge bins (k* = 3) → randomly adjust boundaries → add Lap(∆/ε2) noise to the merged counts]

- Perturbing the structure consumes ε1; the noisy counts consume ε2 = ε − ε1
- The overall algorithm satisfies ε-DP
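Setting one boundary via the exponential mechanism can be sketched as follows (the scoring and sensitivity handling are simplified relative to the paper):

```python
import numpy as np

def choose_boundary(candidates, sse, epsilon, sensitivity):
    """Exponential-mechanism sketch for picking one histogram boundary:
    candidates that induce a smaller SSE get exponentially higher
    probability. `sensitivity` bounds how much one record can change
    any SSE score; scores are shifted by their max for stability."""
    scores = -np.asarray(sse, dtype=float)   # lower SSE -> higher score
    scores -= scores.max()                   # avoid exp overflow/underflow
    weights = np.exp(epsilon * scores / (2 * sensitivity))
    return np.random.choice(candidates, p=weights / weights.sum())
```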
Observations on StructureFirst
- Merging bins essentially compresses the data: reduced sensitivity vs. information loss
- Question: can we apply other compression algorithms? Yes!
  - Method 1: perform a Fourier transform, keep the first few coefficients, and discard all others
    Rastogi and Nath. Differentially Private Aggregation of Distributed Time-Series with Transformation and Encryption. SIGMOD'10
  - Method 2: apply the theory of sparse representation
    Li et al. Compressive Mechanism: Utilizing Sparse Representation in Differential Privacy. WPES'11
    Hardt and Roth. Beating Randomized Response on Incoherent Matrices. STOC'12
  - Your new paper?
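Method 1 can be illustrated with a rough sketch: keep the first k Fourier coefficients, perturb them, and reconstruct. The noise calibration below (scale √k/ε) is a deliberate simplification; see Rastogi and Nath for the exact mechanism:

```python
import numpy as np

def fourier_publish(counts, k, epsilon):
    """Fourier compression sketch: retain the first k DFT coefficients,
    add Laplace noise to them, zero out the rest, and invert. Discarding
    high-frequency coefficients loses information but shrinks the noise
    needed. The sqrt(k)/epsilon scale is a simplified calibration."""
    coef = np.fft.rfft(counts)
    noisy = np.zeros_like(coef)
    noisy[:k] = coef[:k] + np.random.laplace(scale=np.sqrt(k) / epsilon, size=k)
    return np.fft.irfft(noisy, n=len(counts))
```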
Transforming original data: k-d-tree
Problem: answer 2D range count queries
Solution: index the data with a k-d-tree

Cormode et al. Differentially Private Space Decompositions. ICDE'12.
Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM'10.

The k-d-tree structure is sensitive!
How to protect the k-d-tree structure?
Core problem: computing a differentially private median
- Method 1: exponential mechanism (best) [1]
- Method 2: simply replace the median with the mean [3]
- Method 3: cell-based method [2]
  - Partition the data with a grid
  - Compute differentially private counts using the grid

[1] Cormode et al. Differentially Private Space Decompositions. ICDE'12.
[2] Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM'10.
[3] Inan et al. Private Record Matching Using Differential Privacy. EDBT'10.
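Method 1 (the exponential-mechanism median) can be sketched over a discrete domain. The rank-distance score below has sensitivity 1 when neighboring datasets differ by replacing one record; this is an illustrative helper, not the papers' exact construction:

```python
import numpy as np

def private_median(data, lo, hi, epsilon):
    """Exponential-mechanism sketch for a private median over a known
    integer domain [lo, hi): score each candidate by minus its rank
    distance from the median position n/2 (sensitivity 1 under
    replace-one-record neighbors)."""
    data = np.sort(np.asarray(data))
    n = len(data)
    candidates = np.arange(lo, hi)
    ranks = np.searchsorted(data, candidates, side="right")
    scores = -np.abs(ranks - n / 2.0)
    scores -= scores.max()                     # numerical stability
    weights = np.exp(epsilon * scores / 2.0)
    return np.random.choice(candidates, p=weights / weights.sum())
```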
Transforming original data: S&A
S&A: Sample and Aggregate
Goal: answer a query q whose result does not depend on the dataset cardinality, e.g., avg
Idea 1:
- Randomly partition the dataset into m blocks
- Evaluate q on each block
- Return the average over the m blocks + Laplace noise
- Sensitivity: (max − min)/m
Idea 2: use the median of the block results instead of the average, with the exponential mechanism
- Sensitivity is 1! (Zhenjie has more on this)

Mohan et al. GUPT: Privacy Preserving Data Analysis Made Easy. SIGMOD'12.
Smith. Privacy-Preserving Statistical Estimation with Optimal Convergence Rates. STOC'11.
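Idea 1 can be sketched directly; `lo` and `hi` are assumed known bounds on the query's output, which is what makes the sensitivity (hi − lo)/m:

```python
import numpy as np

def sample_and_aggregate(data, query, m, epsilon, lo, hi):
    """Sample-and-aggregate sketch (idea 1): randomly split the data
    into m blocks, run `query` on each, and release the noisy average
    of the block results. Changing one record affects one block's
    answer by at most (hi - lo), hence the average by (hi - lo)/m."""
    shuffled = np.random.permutation(data)
    blocks = np.array_split(shuffled, m)
    block_answers = [query(b) for b in blocks]
    sensitivity = (hi - lo) / m
    return np.mean(block_answers) + np.random.laplace(scale=sensitivity / epsilon)

# Private average of values known to lie in [0, 1]
np.random.seed(0)
est = sample_and_aggregate(np.random.rand(10000), np.mean, m=100, epsilon=1.0, lo=0.0, hi=1.0)
```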
Systems using differential privacy
- Privacy on the Map
- PINQ
- Airavat
- GUPT
Summary of data dependent methods
- Data dependent vs. data independent
- Optimizing noisy results: simple optimizations; iterative methods
- Transforming original data: reduced sensitivity; caution: data dependent parameters may reveal information

Next: Zhenjie on differentially private data mining