Yang, et al. Differentially Private Data Publication and Analysis. Tutorial at SIGMOD'12
Part 4: Data Dependent Query Processing Methods
Yin "David" Yang, Zhenjie Zhang, Gerome Miklau
Prev. session: Marianne Winslett, Xiaokui Xiao
What we talked about in the last session
- Privacy is a major concern in data publishing
- Simple anonymization methods fail to provide sufficient privacy protection
- Definition of differential privacy: it is hard to tell whether a record is in the DB from query results (plausible deniability)
- Basic solutions
  - Laplace mechanism: inject Laplace noise into query results
  - Exponential mechanism: choose a result randomly; a "good" result has higher probability
  - Both are data independent methods
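As a reminder, the Laplace mechanism can be sketched in a few lines of Python (the function name and parameters here are illustrative, not from the tutorial):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Standard Laplace mechanism: add noise drawn from
    Laplace(0, sensitivity/epsilon) to the exact query answer."""
    scale = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# A count query has sensitivity 1: adding or removing one record
# changes the answer by at most 1
noisy_count = laplace_mechanism(true_answer=100, sensitivity=1.0, epsilon=0.5)
```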
Data independent vs. data dependent

                        Data independent methods   Data dependent methods
Sensitive info          Query results              Query results + data dependent parameters
Error source            Injected noise             Injected noise + information loss
Noise type              Unbiased                   Often biased
Asymptotic error bound  Higher                     Lower, with data dependent constants
Practical accuracy      Higher                     Lower for some data
Types of data dependent methods
Type 1: optimizing noisy results
  1. Inject noise
  2. Optimize the noisy query results based on their values
Type 2: transforming original data
  1. Transform the data to reduce the amount of necessary noise
  2. Inject noise
Optimizing noisy results: the hierarchical strategy presented in the last session
- Hierarchical strategy: a tree with a noisy count in each node
- Data dependent optimization: if a node N has a noisy count close to 0 (e.g., 0.05), set the noisy count at N to 0

[Figure: a hierarchy over values v1-v4 with internal nodes N1-N7; one node's noisy count of 0.05 is optimized to 0]
Hay et al. Boosting the Accuracy of Differentially-Private Queries Through Consistency, VLDB’10.
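The zero-snapping step above is pure post-processing of already-noisy counts, so it consumes no extra privacy budget. A minimal sketch (the threshold value is an assumption for illustration, not taken from the paper):

```python
def zero_small_counts(noisy_counts, threshold=0.5):
    """Snap noisy counts near 0 back to exactly 0. This operates only
    on already-private outputs, so it costs no additional budget."""
    return [0.0 if abs(c) < threshold else c for c in noisy_counts]

zero_small_counts([0.05, 12.3, -0.4, 7.8])  # -> [0.0, 12.3, 0.0, 7.8]
```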
Optimizing noisy results: iReduct
Setting: answer a set of m queries
Goal: minimize their total relative error
  RelErr = (noisy result − actual result) / actual result
Example:
- Two queries, q1 and q2
- Actual results: q1: 10, q2: 20
- Observation: we should add less noise to q1 than to q2 (the smaller answer suffers more, in relative terms, from the same noise)
Xiao et al. iReduct: Differential Privacy with Reduced Relative Errors, SIGMOD’11.
Answering queries differently leads to different total relative error
Continuing the example:
- Two queries, q1 and q2, with actual answers 10 and 20
- Suppose each of q1 and q2 has sensitivity 1
Two strategies:
- Strategy 1: answer q1 with ε/2 and q2 with ε/2: noise scale 2/ε on q1, 2/ε on q2
- Strategy 2: answer q1 with 2ε/3 and q2 with ε/3: noise scale 1.5/ε on q1, 3/ε on q2
  (lower relative error overall)
But we don't know which strategy is better before comparing their actual answers!
Idea of iReduct
1. Answer all queries with privacy budget ε/t
2. Refine the noisy results with budget ε/t, spending more budget on queries with smaller results
   How to refine a noisy count?
   - Method 1: obtain a new noisy version and compute a weighted average with the old version
   - Method 2: obtain a refined version directly from a more complicated distribution
3. Repeat the last step t − 1 times
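Refinement method 1 can be sketched as an inverse-variance weighted average of the old and new noisy estimates. The helper below is illustrative (`true_val` stands in for evaluating the query on the real data):

```python
import numpy as np

def refine(old_est, old_scale, true_val, new_scale):
    """iReduct refinement, method 1 (sketch): draw a fresh noisy
    estimate with Laplace scale new_scale, then combine old and new by
    inverse-variance weighting, which minimizes the variance of the
    combined estimate. Laplace(b) has variance 2*b^2."""
    new_est = true_val + np.random.laplace(scale=new_scale)
    w_old = 1.0 / (2 * old_scale ** 2)
    w_new = 1.0 / (2 * new_scale ** 2)
    return (w_old * old_est + w_new * new_est) / (w_old + w_new)
```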
Example of iReduct
- Iteration 1: answer q1 and q2 with ε/2t each; noisy results q1 = 16, q2 = 14
  Refine with budget ε/t, split 14/30 to q1 and 16/30 to q2 (the smaller result gets more budget)
- Iteration 2: noisy results q1 = 12, q2 = 24
  Refine with budget ε/t, split 2/3 to q1 and 1/3 to q2
- Iteration 3: noisy results q1 = 9, q2 = 22
  Refine with budget ε/t, split 22/31 to q1 and 9/31 to q2
- ...
Optimizing noisy results: MW
Problem: publish a histogram under DP that is optimized for a given query set
Idea:
- Start from a uniform histogram
- Repeat the following t times:
  - Evaluate all queries; find the query q with the worst accuracy
  - Modify the histogram to improve the accuracy of q, using a technique called multiplicative weights (MW)
Hardt et al. A simple and practical algorithm for differentially private data release, arXiv.
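One MW update step might look like the sketch below (`eta` is an assumed step size; the exact update and noise handling in the paper differ in details):

```python
import numpy as np

def mw_update(synthetic, query, noisy_true_answer, eta=0.05):
    """One multiplicative-weights step: boost the bins the query counts
    when the synthetic histogram undercounts the noisy true answer,
    shrink them when it overcounts, then renormalize so total mass is
    preserved. `query` is a 0/1 indicator vector over bins."""
    total = synthetic.sum()
    error = noisy_true_answer - query @ synthetic
    updated = synthetic * np.exp(eta * error * query / total)
    return updated * total / updated.sum()

hist = np.full(4, 25.0)        # uniform start, total mass 100
q = np.array([1, 1, 0, 0])     # range count over the first two bins
hist = mw_update(hist, q, noisy_true_answer=70.0)
```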
Example of MW

[Figure: the exact histogram vs. the evolving synthetic histogram, evaluated on two range count queries q1 and q2]

- Initial histogram: uniform; building it has no privacy budget cost! q1 is less accurate than q2.
- Iteration 1: optimize q1 (privacy cost: ε/t); q1 is still less accurate.
- Iteration 2: optimize q1 again (privacy cost: ε/t); now q2 is less accurate.
- Iteration 3: optimize q2 (privacy cost: ε/t).
Optimizing noisy results: NoiseFirst
Problem: publish a histogram
Xu et al. Differentially Private Histogram Publication, ICDE'12.

Original data in a medical statistical DB:

Name   Age  HIV+
Frank  42   Y
Bob    31   Y
Mary   28   Y
Dave   43   N
...    ...  ...

[Figure: the corresponding histogram]
Reduce error by merging bins

[Figure: noisy histogram vs. exact histogram vs. optimized histogram; the three merged bins each get count 2]

- Bin-merging scheme computed through dynamic programming
- Positive/negative noise cancels out!
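The cancellation effect can be seen in a short sketch where each merged group is replaced by its average (the group boundaries are given here; finding them by dynamic programming is omitted):

```python
def merge_bins(noisy_counts, groups):
    """NoiseFirst-style post-processing (sketch): replace each group of
    merged bins by the group's average count, so the independent noise
    terms inside a group partially cancel. `groups` lists (start, end)
    half-open index ranges."""
    out = list(noisy_counts)
    for start, end in groups:
        avg = sum(noisy_counts[start:end]) / (end - start)
        for i in range(start, end):
            out[i] = avg
    return out

# Merge the last three bins of a noisy histogram
merge_bins([4.0, 10.0, 1.5, 2.5, 2.0], [(2, 5)])  # -> [4.0, 10.0, 2.0, 2.0, 2.0]
```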
Next we focus on the second type.
Type 1: optimizing noisy results
  1. Inject noise
  2. Optimize the noisy query results based on their values
Type 2: transforming original data
  1. Transform the data to reduce the amount of necessary noise
  2. Inject noise
Transforming data: StructureFirst
An alternative solution for histogram publication

[Figure: original histogram (sensitivity ∆ = 1) vs. histogram after merging bins (∆ = 1/3 for a group of three bins, ∆ = 1/2 for a group of two)]

Lower sensitivity means less noise!

Xu et al. Differentially Private Histogram Publication, ICDE'12.
Related: Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM'10.
But the optimal structure is sensitive!

[Figure: original histogram; the optimal merging structures differ with and without Alice's record]

Alice is an HIV+ patient!
StructureFirst uses the exponential mechanism to render its structure differentially private.
- Randomly perturb the optimal histogram structure: set each boundary using the exponential mechanism, where a candidate boundary's probability decreases with the SSE it induces

[Figure: original histogram → merge bins (k* = 3) → randomly adjust boundaries → add Lap(∆/ε2) noise to the merged counts]

- Perturbing the structure consumes ε1; the noisy counts consume ε2 = ε − ε1
- The overall algorithm satisfies ε-DP
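Setting one boundary via the exponential mechanism can be sketched as follows (the scoring and sensitivity handling are simplified relative to the paper):

```python
import numpy as np

def choose_boundary(candidates, sse, epsilon, sensitivity):
    """Exponential-mechanism sketch for picking one histogram boundary:
    candidates that induce a smaller SSE get exponentially higher
    probability. `sensitivity` bounds how much one record can change
    any SSE score; scores are shifted by their max for stability."""
    scores = -np.asarray(sse, dtype=float)   # lower SSE -> higher score
    scores -= scores.max()                   # avoid exp overflow/underflow
    weights = np.exp(epsilon * scores / (2 * sensitivity))
    return np.random.choice(candidates, p=weights / weights.sum())
```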
Observations on StructureFirst
- Merging bins essentially compresses the data: reduced sensitivity vs. information loss
- Question: can we apply other compression algorithms? Yes!
  - Method 1: perform a Fourier transform, keep the first few coefficients, and discard all others
    Rastogi and Nath. Differentially Private Aggregation of Distributed Time-Series with Transformation and Encryption. SIGMOD'10
  - Method 2: apply the theory of sparse representation
    Li et al. Compressive Mechanism: Utilizing Sparse Representation in Differential Privacy. WPES'11
    Hardt and Roth. Beating Randomized Response on Incoherent Matrices. STOC'12
  - Your new paper?
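Method 1 can be illustrated with a rough sketch: keep the first k Fourier coefficients, perturb them, and reconstruct. The noise calibration below (scale √k/ε) is a deliberate simplification; see Rastogi and Nath for the exact mechanism:

```python
import numpy as np

def fourier_publish(counts, k, epsilon):
    """Fourier compression sketch: retain the first k DFT coefficients,
    add Laplace noise to them, zero out the rest, and invert. Discarding
    high-frequency coefficients loses information but shrinks the noise
    needed. The sqrt(k)/epsilon scale is a simplified calibration."""
    coef = np.fft.rfft(counts)
    noisy = np.zeros_like(coef)
    noisy[:k] = coef[:k] + np.random.laplace(scale=np.sqrt(k) / epsilon, size=k)
    return np.fft.irfft(noisy, n=len(counts))
```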
Transforming original data: k-d-tree
Problem: answer 2D range count queries
Solution: index the data with a k-d-tree

Cormode et al. Differentially Private Space Decompositions. ICDE'12.
Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM'10.

The k-d-tree structure is sensitive!
How to protect the k-d-tree structure?
Core problem: computing a differentially private median
- Method 1: exponential mechanism (best) [1]
- Method 2: simply replace the median with the mean [3]
- Method 3: cell-based method [2]
  - Partition the data with a grid
  - Compute differentially private counts using the grid

[1] Cormode et al. Differentially Private Space Decompositions. ICDE'12.
[2] Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM'10.
[3] Inan et al. Private Record Matching Using Differential Privacy. EDBT'10.
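Method 1 (the exponential-mechanism median) can be sketched over a discrete domain. The rank-distance score below has sensitivity 1 when neighboring datasets differ by replacing one record; this is an illustrative helper, not the papers' exact construction:

```python
import numpy as np

def private_median(data, lo, hi, epsilon):
    """Exponential-mechanism sketch for a private median over a known
    integer domain [lo, hi): score each candidate by minus its rank
    distance from the median position n/2 (sensitivity 1 under
    replace-one-record neighbors)."""
    data = np.sort(np.asarray(data))
    n = len(data)
    candidates = np.arange(lo, hi)
    ranks = np.searchsorted(data, candidates, side="right")
    scores = -np.abs(ranks - n / 2.0)
    scores -= scores.max()                     # numerical stability
    weights = np.exp(epsilon * scores / 2.0)
    return np.random.choice(candidates, p=weights / weights.sum())
```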
Transforming original data: S&A
S&A: Sample and Aggregate
Goal: answer a query q whose result does not depend on the dataset cardinality, e.g., avg
Idea 1:
- Randomly partition the dataset into m blocks
- Evaluate q on each block
- Return the average over the m blocks + Laplace noise
- Sensitivity: (max − min)/m
Idea 2: use the median of the block results instead of the average, with the exponential mechanism
- Sensitivity is 1! (Zhenjie has more on this)

Mohan et al. GUPT: Privacy Preserving Data Analysis Made Easy. SIGMOD'12.
Smith. Privacy-Preserving Statistical Estimation with Optimal Convergence Rates. STOC'11.
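Idea 1 can be sketched directly; `lo` and `hi` are assumed known bounds on the query's output, which is what makes the sensitivity (hi − lo)/m:

```python
import numpy as np

def sample_and_aggregate(data, query, m, epsilon, lo, hi):
    """Sample-and-aggregate sketch (idea 1): randomly split the data
    into m blocks, run `query` on each, and release the noisy average
    of the block results. Changing one record affects one block's
    answer by at most (hi - lo), hence the average by (hi - lo)/m."""
    shuffled = np.random.permutation(data)
    blocks = np.array_split(shuffled, m)
    block_answers = [query(b) for b in blocks]
    sensitivity = (hi - lo) / m
    return np.mean(block_answers) + np.random.laplace(scale=sensitivity / epsilon)

# Private average of values known to lie in [0, 1]
np.random.seed(0)
est = sample_and_aggregate(np.random.rand(10000), np.mean, m=100, epsilon=1.0, lo=0.0, hi=1.0)
```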
Systems using differential privacy
- Privacy on the Map
- PINQ
- Airavat
- GUPT
Summary of data dependent methods
- Data dependent vs. data independent
- Optimizing noisy results: simple optimizations; iterative methods
- Transforming original data: reduced sensitivity; caution: data dependent parameters may reveal information

Next: Zhenjie on differentially private data mining