Date post: | 16-Jan-2016 |
Upload: | cleopatra-barrett |
CS573 Data Privacy and Security
Statistical Databases
Li Xiong
Today
• Statistical databases
  – Definitions
  – Early query restriction methods
  – Output perturbation and differential privacy
Statistical Data Release
Age  City     Diagnosis
25   Lilburn  mantle cell lymphoma
35   Decatur  adult T-cell lymphoma
35   Lilburn  adult T-cell lymphoma
[Figure: population-count histogram over Diagnosis, Age (axis ticks 20 to 50), and City]
• Release a statistical summary of the data (vs. individual records)
• Useful for analysis and learning
  • Medical statistics
  • Query log statistics – frequent search terms
• Still need rigorous inference control
Statistical Database
• A statistical database is a database that provides statistics on subsets of records
• Supported statistics include SUM, MEAN, MEDIAN, COUNT, MAX, and MIN over a subset of records
• Inference control prevents inference from the released statistics to individual records
Methods
• Data perturbation/anonymization
• Query restriction
• Output perturbation
[Figure: three approaches to inference control
 Data Perturbation: noise is added to the original database to produce a perturbed database; user queries are answered from the perturbed database.
 Query Restriction: queries run against the original database; results are returned only for queries that pass the restriction (query set size between K and L − K).
 Output Perturbation: queries run against the original database; noise is added to the results before they are returned to the users.]
Methods
• Data perturbation/anonymization
• Query restriction
  – Query set size control
  – Query set overlap control
  – Query auditing
• Output perturbation
Query Set Size Control
• A query-set size control limits the number of records that must be in the query set
• A query result is released only if the size of the query set |C| satisfies
    K ≤ |C| ≤ L − K
  where L is the size of the database and K is a parameter that satisfies 0 ≤ K ≤ L/2
• The upper bound blocks inference through the complement of a small query set
Query Set Size Control
[Figure: query set size control. Query 1 and Query 2 run against the original database; each result is returned only if its query set size lies between K and L − K.]
Tracker
• Q1: Count ( Sex = Female ) = A
• Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B
• What if B = A+1? Then exactly one record matches (Age = 42 & Sex = Male & Employer = ABC), and that individual is isolated even though both query sets pass the size control
• Q3: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC & Diagnosis = Schizophrenia) )
• If Q3 returns A+1 the individual has the diagnosis; if it returns A the individual does not. Positively or negatively compromised!
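The tracker above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six records, the field names, the value K = 2, and the helper `count` are all hypothetical and not taken from the slides.

```python
# Toy database; every record and field name here is made up for illustration.
RECORDS = [
    {"sex": "F", "age": 30, "employer": "XYZ", "diagnosis": "Asthma"},
    {"sex": "F", "age": 42, "employer": "ABC", "diagnosis": "Schizophrenia"},
    {"sex": "F", "age": 51, "employer": "ABC", "diagnosis": "Diabetes"},
    {"sex": "M", "age": 42, "employer": "ABC", "diagnosis": "Schizophrenia"},
    {"sex": "M", "age": 35, "employer": "XYZ", "diagnosis": "Asthma"},
    {"sex": "M", "age": 60, "employer": "XYZ", "diagnosis": "Diabetes"},
]
K = 2  # assumed size-control parameter, 0 <= K <= L/2


def count(predicate):
    """COUNT query guarded by query set size control: K <= |C| <= L - K."""
    size = sum(1 for r in RECORDS if predicate(r))
    if not (K <= size <= len(RECORDS) - K):
        return None  # query refused
    return size


def is_female(r):
    return r["sex"] == "F"


def is_target(r):
    return r["sex"] == "M" and r["age"] == 42 and r["employer"] == "ABC"


# Asking about the target directly is refused (query set of size 1).
direct = count(is_target)

# Padding the query with the tracker predicate gets past the size control.
a = count(is_female)                                   # Q1
b = count(lambda r: is_female(r) or is_target(r))      # Q2
q3 = count(lambda r: is_female(r)
           or (is_target(r) and r["diagnosis"] == "Schizophrenia"))  # Q3

target_is_unique = (b == a + 1)
target_has_diagnosis = (q3 == a + 1)
```

The size control refuses the direct query but answers all three padded queries, which is why the slides conclude that size control alone does not prevent compromise.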
Query set size control
• If the threshold value K is large, it restricts too many queries
  – And it still does not guarantee protection from compromise
• The database can be easily compromised within 4-5 queries
Query Set Overlap Control
• Basic idea: each new query is checked against earlier queries for the number of records they have in common.
• If the new query shares more than a given threshold of records with any earlier query, the requested statistic is not released.
• A query q(C) is allowed only if, for every previously answered query q(D),
    |q(C) ∩ q(D)| ≤ r, r > 0
  where r is set by the administrator
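A minimal sketch of the overlap check, assuming query sets are represented as sets of record ids. The class name `OverlapControl` and its method `allow` are hypothetical names chosen for this example.

```python
class OverlapControl:
    """Refuse a query whose query set overlaps an earlier one in more than r records."""

    def __init__(self, r):
        self.r = r
        self.answered = []  # query sets (frozensets of record ids) answered so far

    def allow(self, query_set):
        query_set = frozenset(query_set)
        # The new query must satisfy |q(C) ∩ q(D)| <= r for every earlier q(D).
        if any(len(query_set & previous) > self.r for previous in self.answered):
            return False
        self.answered.append(query_set)
        return True


control = OverlapControl(r=1)
first = control.allow({1, 2, 3})    # nothing to compare against yet
second = control.allow({3, 4, 5})   # shares one record with the first query
third = control.allow({1, 2, 6})    # shares two records with the first query
```

Every new query is compared with all stored query sets, so the cost of the check grows with the query history.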
Query-set-overlap control
• Statistics for a set and its subset cannot be released – limiting usefulness
• High processing overhead – every new query compared with all previous ones
• Multiple users: a profile must be kept for each user, and collusion between users must be considered
• Still no formal privacy guarantee
Auditing
• Keeps up-to-date logs of all queries made by each user and checks for possible compromise when a new query is issued
• Excessive computation and storage requirements
• Only “efficient” methods for special types of queries
Audit Expert (Chin 1982)
• Query auditing method for SUM queries
• A SUM query can be considered as a linear equation
    $\sum_{i=1}^{L} a_i x_i = q$
  where $a_i \in \{0, 1\}$ indicates whether record i belongs to the query set, $x_i$ is the sensitive value, and q is the query result
• A set of SUM queries can be thought of as a system of linear equations
• Maintains a binary matrix representing the linearly independent queries and updates it when a new query is issued
• A row with all 0s except for the ith column indicates disclosure of $x_i$
Audit Expert
• Only stores linearly independent queries
• Not all queries are linearly independent
  Q1: Sum(Sex=M)
  Q2: Sum(Sex=M AND Age>20)
  Q3: Sum(Sex=M AND Age<=20)
  Here Q1 = Q2 + Q3, so only two of the three need to be stored
Audit Expert
• O(L^2) time complexity
• Further work reduced this to O(L) time and space when the number of queries < L
• Only for SUM queries
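The disclosure test can be illustrated with plain Gaussian elimination. This is a naive sketch that recomputes the reduced row echelon form from scratch for every new query; it is not the incremental matrix update of Audit Expert, and the function names `rref`, `disclosed_records`, and `audit` are assumptions made for this example.

```python
from fractions import Fraction


def rref(rows):
    """Return the nonzero rows of the reduced row echelon form of `rows`."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    pivot_row = 0
    for col in range(n_cols):
        if pivot_row == len(m):
            break
        # Find a row at or below pivot_row with a nonzero entry in this column.
        found = next((i for i in range(pivot_row, len(m)) if m[i][col] != 0), None)
        if found is None:
            continue
        m[pivot_row], m[found] = m[found], m[pivot_row]
        pivot = m[pivot_row][col]
        m[pivot_row] = [v / pivot for v in m[pivot_row]]
        # Clear this column in every other row.
        for i in range(len(m)):
            if i != pivot_row and m[i][col] != 0:
                factor = m[i][col]
                m[i] = [x - factor * y for x, y in zip(m[i], m[pivot_row])]
        pivot_row += 1
    return m[:pivot_row]


def disclosed_records(query_rows):
    """Indices i whose value x_i is determined by the answered SUM queries."""
    disclosed = []
    for row in rref(query_rows):
        nonzero = [i for i, v in enumerate(row) if v != 0]
        if len(nonzero) == 1:  # all 0s except the ith column
            disclosed.append(nonzero[0])
    return disclosed


def audit(answered, new_query):
    """True if answering new_query (a 0/1 row) discloses no single record."""
    return not disclosed_records(answered + [new_query])


answered = [[1, 1, 1, 0]]                  # SUM over records 0, 1, 2
safe = audit(answered, [0, 0, 1, 1])       # SUM over records 2, 3
unsafe = audit(answered, [1, 1, 0, 0])     # SUM over records 0, 1
leak = disclosed_records(answered + [[1, 1, 0, 0]])
```

Subtracting the second query from the first isolates record 2, which the reduced matrix shows as a row with a single nonzero column.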
Auditing – recent developments
• Online auditing
  – "Detect and deny" queries that violate the privacy requirement
  – Denials themselves may implicitly disclose sensitive information
• Offline auditing
  – Check whether a privacy requirement has been violated after the queries have been executed
  – Detects violations but does not prevent them
Methods
• Data perturbation/anonymization
• Query restriction
• Output perturbation
Differential privacy
• Differential privacy requires the outcome to be formally indistinguishable when run with and without any particular record in the data set
  – E.g.: Q = select count() where Age = [20,30] and Diagnosis = B
Differential Privacy
[Figure: output perturbation. The user sends query Q and receives A(D1), computed on data set D1 (Bob in), or A(D2), computed on data set D2 (Bob out).]
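• Differential privacy: a randomized mechanism A is alpha-differentially private if, for any two data sets D1 and D2 that differ in one record and any set of outputs S,
    $\Pr[A(D_1) \in S] \le e^{\alpha} \Pr[A(D_2) \in S]$
• Laplace mechanism: release Q(D) + Y, where Y is drawn from $\mathrm{Lap}(\Delta Q / \alpha)$
• Query sensitivity: $\Delta Q = \max_{D_1, D_2} |Q(D_1) - Q(D_2)|$, taken over data sets D1 and D2 that differ in one record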
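A minimal sketch of the Laplace mechanism for a count query, using only the Python standard library. The function names, the true count of 120, and the budget alpha = 0.5 are assumptions made for illustration; a count query has sensitivity 1 because adding or removing one record changes the count by at most 1.

```python
import random


def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) as the difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_mechanism(true_answer, sensitivity, alpha, rng):
    """Return Q(D) + Y with Y drawn from Lap(sensitivity / alpha)."""
    return true_answer + laplace_noise(sensitivity / alpha, rng)


rng = random.Random(0)
true_count = 120   # hypothetical answer to a count query
alpha = 0.5        # hypothetical privacy budget

samples = [laplace_mechanism(true_count, 1.0, alpha, rng) for _ in range(20000)]
mean_answer = sum(samples) / len(samples)
variance = sum((s - mean_answer) ** 2 for s in samples) / len(samples)
# Theory: the noise has mean 0 and variance 2 * (sensitivity / alpha) ** 2 = 8.
```

A smaller alpha gives a larger noise scale, so stronger privacy costs accuracy on every answer.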
Differential Privacy
[Figure: differentially private interface. The user sends query Q and receives A(D1) = Q(D1) + Y1 on data set D1 (Bob in), or A(D2) = Q(D2) + Y2 on data set D2 (Bob out).]
Composition of Differential Privacy
• Sequential composition [McSherry SIGMOD 09]
  – Let each Mi provide $\alpha_i$-differential privacy. The sequence of Mi provides $(\sum_i \alpha_i)$-differential privacy
• Parallel composition
  – If Di are disjoint subsets of the original database and Mi provides alpha-differential privacy for each Di, then the sequence of Mi provides alpha-differential privacy.
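The two composition rules suggest a simple budget accountant. The sketch below is an assumption-laden illustration: the class `PrivacyBudget` and its methods are hypothetical, and the parallel rule is written in its usual generalized form, where mechanisms on disjoint subsets with different budgets cost the maximum of those budgets.

```python
class PrivacyBudget:
    """Track how much of a total budget alpha a sequence of mechanisms has used."""

    def __init__(self, total_alpha):
        self.total = total_alpha
        self.spent = 0.0

    def _charge(self, cost):
        if self.spent + cost > self.total + 1e-12:
            raise ValueError("privacy budget exceeded")
        self.spent += cost
        return cost

    def spend_sequential(self, alphas):
        """Mechanisms run on the same data: the costs add up."""
        return self._charge(sum(alphas))

    def spend_parallel(self, alphas):
        """Mechanisms run on disjoint subsets: only the largest cost counts."""
        return self._charge(max(alphas))


budget = PrivacyBudget(total_alpha=1.0)
cost_sequential = budget.spend_sequential([0.2, 0.3])   # two queries on all data
cost_parallel = budget.spend_parallel([0.1, 0.4, 0.4])  # one query per disjoint subset
remaining = budget.total - budget.spent
```

Once the budget is used up, further queries have to be refused, which is the accumulation problem raised later under Challenges.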
[Figure: differentially private interface under repeated queries. The user sends Q1, Q2, … and receives A1(D1), A2(D1), … on data set D1 (Bob in), or A1(D2), A2(D2), … on data set D2 (Bob out).]
Differential Privacy
• Is unfettered access to raw data truly essential?
• Is released data sufficient (does it provide a sufficient utility guarantee)?
[Figure: raw data (the Age, City, Diagnosis table) passes through a privacy mechanism to produce released data (a count histogram over Diagnosis, Age, and City) for the user.]
Challenges
• Differential privacy cost accumulates quickly with the number of queries
  – Typical tasks require multiple queries or multiple steps
  – Need to support multiple users
• Impossible to guarantee utility for all (any) data or all (any) applications
Possible Middle Ground
• Guaranteed utility for certain applications
  – Counting queries, classification, logistic regression
• Guaranteed utility for certain kinds of data
  – Use prior or domain knowledge about data
  – Use intermediate results (differentially private)
[Figure: raw data passes through a privacy mechanism to produce released data for the user; the mechanism also draws on prior or domain knowledge, target applications, and intermediate results.]
Our Research: Adaptive Differentially Private Data Release
• Data knowledge
  • Dense and "smooth" data
  • High dimensional and sparse data
  • Dynamic data
• Application knowledge
  • Query workload
  • Specific tasks
Histogram Example
Strategy I: Baseline Cell Partitioning
[Figure: original 2×2 histogram over Age (20, 30) and Diagnosis (A, B) with cell counts 50, 10, 50, 90]
• Goal: to release a differentially private histogram to support random predicate queries
• Q: select count() where Age = [20,30] and Income = 40K
• If a query predicate consists of multiple cells or partitions, it will have aggregated perturbation error
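[Figure: differentially private cell histogram over Age (20, 30) and Diagnosis (A, B) with noisy counts 50’, 10’, 50’, 90’, released under alpha-DP. Each cell is answered as a count query:
 Q1: count() where Age = 20, Diagnosis = A
 Q2: count() where Age = 20, Diagnosis = B
 …]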
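A rough sketch of baseline cell partitioning and its aggregated error. The assignment of the counts 50, 10, 50, 90 to specific (Age, Diagnosis) cells, the value alpha = 1.0, and all function names are assumptions for illustration. Each record falls in exactly one cell, so each cell count has sensitivity 1 and, by parallel composition, every cell can use the full budget.

```python
import random


def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) as the difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_cell_histogram(cells, alpha, rng):
    """Add independent Lap(1 / alpha) noise to every cell count."""
    return {cell: c + laplace_noise(1.0 / alpha, rng) for cell, c in cells.items()}


def range_count(histogram, query_cells):
    """Answer a predicate query by summing the cells it covers."""
    return sum(histogram[cell] for cell in query_cells)


# Illustrative cell layout; the mapping of counts to cells is assumed.
cells = {(20, "A"): 50, (20, "B"): 10, (30, "A"): 50, (30, "B"): 90}
alpha = 1.0
rng = random.Random(1)


def mean_squared_error(query_cells, trials=5000):
    truth = range_count(cells, query_cells)
    total = 0.0
    for _ in range(trials):
        noisy = noisy_cell_histogram(cells, alpha, rng)
        total += (range_count(noisy, query_cells) - truth) ** 2
    return total / trials


mse_one_cell = mean_squared_error([(20, "A")])    # theory: 2 / alpha**2
mse_four_cells = mean_squared_error(list(cells))  # theory: 4 * 2 / alpha**2
```

The error variance grows linearly with the number of cells a query covers, which is the aggregated perturbation error noted above.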
Strategy II: Hierarchical Partitioning
• Large perturbation error due to small divided privacy budget at each level
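[Figure: hierarchical partitioning of the original histogram (Age 20, 30 × Diagnosis A, B; cell counts 50, 10, 50, 90). Three levels are released, each with budget alpha/3: the total count 200’; two partition counts 60’ and 140’; and the four cell counts 50’, 10’, 50’, 90’.]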
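The cost of dividing the budget can be made concrete with a small sketch. The level labels, alpha = 1.0, and the function names are hypothetical. Within a level the partitions are disjoint, so one record changes one count per level; across the three levels, sequential composition gives 3 × alpha/3 = alpha in total.

```python
import random


def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) as the difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def release_hierarchy(levels, alpha, rng):
    """Release every level of the hierarchy with an equal share of the budget."""
    per_level_alpha = alpha / len(levels)
    scale = 1.0 / per_level_alpha  # sensitivity 1 within each level
    return [
        {name: count + laplace_noise(scale, rng) for name, count in level.items()}
        for level in levels
    ]


# Counts from the slides; the labels are placeholders.
levels = [
    {"all": 200},
    {"group 1": 60, "group 2": 140},
    {"cell 1": 50, "cell 2": 10, "cell 3": 50, "cell 4": 90},
]
alpha = 1.0
noisy_levels = release_hierarchy(levels, alpha, random.Random(2))

# Variance of Laplace noise with scale b is 2 * b**2.
variance_per_count_hierarchy = 2 * (len(levels) / alpha) ** 2  # budget alpha/3 per level
variance_per_count_baseline = 2 * (1 / alpha) ** 2             # full budget per cell
```

With three levels every released count carries nine times the noise variance of a baseline cell count, which is the large perturbation error the slide refers to.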
DPCube Strategy: Two phase partitioning
• If a query predicate is contained in a published partition, the answer has to be estimated typically based on a uniform distribution assumption. This introduces an approximation error.
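[Figure: original histogram over Age (20, 30) and Diagnosis (A, B) with cell counts 50, 10, 50, 90, next to a released partition histogram with noisy counts 100’, 10’, 90’, in which the two cells with count 50 are merged into one partition]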
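A small sketch of the uniform-distribution estimate for a query that falls inside a published partition. The function name and the cell labels are hypothetical; the count 100 stands for the released partition count 100’ in the figure.

```python
def estimate_from_partition(partition_count, partition_cells, query_cells):
    """Estimate a query answer by assuming records are spread evenly over the partition."""
    overlap = len(set(partition_cells) & set(query_cells))
    return partition_count * overlap / len(partition_cells)


# A partition covering two cells with released count 100; the query asks for one cell.
partition_cells = [(20, "A"), (30, "A")]
estimate = estimate_from_partition(100, partition_cells, [(20, "A")])
```

If the true counts inside the partition are not equal, the gap between this estimate and the true cell count is the approximation error, separate from the perturbation error of the Laplace noise.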
DPCube Strategy: Two phase partitioning
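[Figure: two-phase partitioning of the original histogram (Age 20, 30 × Diagnosis A, B; cell counts 50, 10, 50, 90)
 1. Cell Partitioning produces the cell histogram with noisy counts 50’, 10’, 50’, 90’
 2. Multi-dimensional Partitioning produces the partition histogram with noisy counts 100’, 10’, 90’]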
Partitioning Algorithm
• Define a uniformity (randomness) measure H(Dt) for a partition Dt
  – information gain, variance
• Recursive algorithm Partition(Dt) for a given partition Dt
  • Find the best splitting point (e.g. largest information gain) and partition the data into Dt1 and Dt2
  • Partition(Dt1) and Partition(Dt2)
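The recursion can be sketched in one dimension with variance as the uniformity measure. This is an illustration, not the DPCube implementation: the stopping threshold, the choice of the split that minimizes the size-weighted variance of the two sides, and the function names are assumptions. The code itself adds no noise, so it gives no privacy guarantee on its own; here it is run on plain counts for clarity.

```python
def variance(counts):
    """Uniformity measure H: variance of the cell counts in a partition."""
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)


def partition(counts, lo=0, hi=None, threshold=1.0):
    """Recursively split counts[lo:hi] until each part is close to uniform.

    Returns a list of (start, end) index ranges, end exclusive.
    """
    if hi is None:
        hi = len(counts)
    if hi - lo <= 1 or variance(counts[lo:hi]) <= threshold:
        return [(lo, hi)]

    def cost(split):
        left, right = counts[lo:split], counts[split:hi]
        return len(left) * variance(left) + len(right) * variance(right)

    # Best splitting point: the one leaving the least variance within the two sides.
    best = min(range(lo + 1, hi), key=cost)
    return partition(counts, lo, best, threshold) + partition(counts, best, hi, threshold)


parts = partition([50, 50, 10, 90])
partition_counts = [sum([50, 50, 10, 90][start:end]) for start, end in parts]
```

On the counts used in the slides, the two cells of 50 end up in one partition and the cells 10 and 90 stay separate, giving partition counts 100, 10, and 90.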
Privacy and Utility of the Released Histogram
• The released data satisfies alpha-differential privacy
• Support for count queries and other OLAP queries and learning tasks
• Formal utility results
  – (epsilon, delta)-usefulness
• Experimental results for partition histogram
  – CENSUS dataset, 1M tuples, 4 attributes: Age (79), Education (14), Occupation (23), and Income (100)
  – Report absolute error and relative error for random count queries
DPCube Result Example
[Figure: four panels: original histogram; differentially private cell histogram; differentially private partition histogram; differentially private estimated cell histogram]
Experimental Results: Comparison with other partitioning strategies
• Higher alpha (lower privacy) results in lower error (higher utility)
• The kd-tree based approach outperforms the others
• Cell partitioning is comparable in absolute error but suffers in relative error due to the sparsity of the data
High dimensional sparse data
• Many real-world data are high dimensional and sparse
  • Web search log data, web transactions, etc.
• A direct application of the 2-phase approach
  – Cell histogram highly inaccurate
  – Computationally not scalable
Top-down recursive partitioning
• Recursively partition the spaces that have sufficient density
• Use a context-free taxonomy tree
• Dynamically allocate and keep track of the budget
Adaptive Hierarchical Strategy
[Figure: adaptive hierarchical strategy for sparse, high-dimensional data
 1a. Overall count
 1b. Partitioning of non-sparse regions
 2a. Partition count
 2b. Partitioning of non-sparse regions
 …
 n. Partition count]
Today
• Statistical databases
  – Definitions
  – Early query restriction methods
  – Output perturbation and differential privacy