Date post: | 03-Jan-2016 |
Upload: | ashlie-edwards |
Data Anonymization (1)
Outline
• Problem
• Concepts
• Algorithms on domain generalization hierarchy
• Algorithms on numerical data
The Massachusetts Governor Privacy Breach
Medical data
• Name • SSN • Visit Date • Diagnosis • Procedure • Medication • Total Charge
Voter list
• Name • Address • Date Registered • Party affiliation • Date last voted
Quasi-identifier shared by both datasets
• Zip
• Birth date
• Sex
• The Governor of MA was uniquely identified using Zip code, birth date, and sex, so the name could be linked to the diagnosis.
• These three attributes uniquely identify 87% of the US population (Sweeney, IJUFKS 2002).
Definitions
• Table: columns are attributes, rows are records
• Quasi-identifier (QI): a list of attributes that can potentially be used to identify individuals
• K-anonymity: every combination of QI values that occurs in the table appears at least k times
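The k-anonymity definition can be checked mechanically. The following is a minimal sketch, assuming records are Python dictionaries; the function name `is_k_anonymous` and the sample rows are illustrative and not from the slides.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifier, k):
    """Return True if every QI value combination present occurs at least k times."""
    counts = Counter(tuple(r[a] for a in quasi_identifier) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical sample table: two records share a QI value, one stands alone.
records = [
    {"zip": "0213*", "sex": "F", "diagnosis": "flu"},
    {"zip": "0213*", "sex": "F", "diagnosis": "asthma"},
    {"zip": "0214*", "sex": "M", "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["zip", "sex"], 2))
```

Dropping (suppressing) the third record would make the remaining table 2-anonymous with respect to zip and sex.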
Basic techniques
• Generalization
  Zip {02138, 02139} → 0213*
• Domain generalization hierarchy: A0 → A1 → … → An
  E.g. {02138, 02139} → 0213* → 021* → 02* → 0* → *
  This hierarchy is a tree structure
• Suppression
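Generalization and suppression can be sketched together for the Zip example. This is a minimal illustration, assuming the hierarchy simply truncates trailing digits and marks the truncation with a single "*"; the helper names `generalize_zip` and `generalize_and_suppress` are hypothetical.

```python
from collections import Counter

def generalize_zip(zip_code, level):
    """Climb `level` steps up the hierarchy by dropping trailing digits."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range")
    if level == 0:
        return zip_code  # A0: the original value
    return zip_code[: len(zip_code) - level] + "*"

def generalize_and_suppress(zips, level, k):
    """Generalize every value, then drop values whose group is smaller than k."""
    generalized = [generalize_zip(z, level) for z in zips]
    counts = Counter(generalized)
    return [g for g in generalized if counts[g] >= k]

print(generalize_zip("02138", 1))
print(generalize_and_suppress(["02138", "02139", "02141"], 1, 2))
```

In the second call, the value that ends up alone in its group after one level of generalization is suppressed rather than generalized further.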
Balance
• Better privacy guarantee vs. lower data utility
• There are many schemes satisfying the k-anonymity specification. We want to minimize the distortion of the table in order to maximize data utility.
• Suppression is required if we cannot find a k-anonymity group for a record.
Criteria
• Minimal generalization: the minimal generalization that satisfies the k-anonymization specification
• Minimal table distortion: a minimal generalization with minimal utility loss
  Use precision to evaluate the loss [Sweeney papers]
• Application-specific utility
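The precision criterion can be illustrated with a short sketch. It assumes the form usually attributed to Sweeney's papers: one minus the average, over all cells, of the generalization level applied divided by the height of that attribute's hierarchy. The slides do not spell out the formula, so the function name `precision` and its input layout are assumptions.

```python
def precision(levels, hierarchy_heights):
    """levels[j][i]: generalization level applied to attribute i of record j.
    hierarchy_heights[i]: number of generalization steps in attribute i's hierarchy.
    Returns 1.0 for an untouched table and 0.0 for a fully generalized one."""
    n_records = len(levels)
    n_attrs = len(hierarchy_heights)
    distortion = sum(
        row[i] / hierarchy_heights[i] for row in levels for i in range(n_attrs)
    )
    return 1 - distortion / (n_records * n_attrs)

# Two records, attributes (zip with 5 levels, sex with 1 level);
# zip generalized by one level, sex left as is.
print(precision([[1, 0], [1, 0]], [5, 1]))
```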
Complexity of finding the optimal solution on generalization
• NP-hard (Bayardo ICDE05)
• So all proposed algorithms are approximate algorithms
Shared features in different solutions
• Always satisfy the k-anonymity specification
  If some records do not, suppress them
• Differences are in the utility loss/cost function
  Sweeney's precision metric
  Discernibility & classification metrics
  Information-privacy metric
Algorithms
• Assume the domain generalization hierarchy is given
• Efficiency
• Utility maximization
Metrics to be optimized
• Two cost metrics that we want to minimize (Bayardo ICDE05)
• Discernibility: based on the # of items in each k-anonymity group
• Classification: the dataset has a class label column and the goal is to preserve the classification model; based on the # of records in minor classes in each group
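Both cost metrics can be computed from the k-anonymity groups. The sketch below assumes the common reading of these metrics: discernibility charges each record the size of its group (so a group of size n contributes n squared), and the classification metric counts records whose label differs from the majority label of their group. Penalty terms for suppressed records are left out, and all function names are illustrative.

```python
from collections import Counter, defaultdict

def group_by_qi(records, qi):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[a] for a in qi)].append(r)
    return list(groups.values())

def discernibility(groups):
    """Each record is penalized by the size of the group it falls in."""
    return sum(len(g) ** 2 for g in groups)

def classification_penalty(groups, label):
    """Count records that are not in the majority class of their group."""
    penalty = 0
    for g in groups:
        counts = Counter(r[label] for r in g)
        penalty += len(g) - max(counts.values())
    return penalty

# Hypothetical generalized table with a class label column.
table = [
    {"zip": "0213*", "cls": "A"}, {"zip": "0213*", "cls": "A"},
    {"zip": "0214*", "cls": "A"}, {"zip": "0214*", "cls": "B"},
    {"zip": "0214*", "cls": "B"},
]
groups = group_by_qi(table, ["zip"])
print(discernibility(groups), classification_penalty(groups, "cls"))
```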
Metrics
• Information-privacy metric: a combination of information loss and anonymity gain (Wang ICDE04)
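Metrics
• Information loss
  The dataset has class labels
• Entropy
  For a set S labeled by different classes, entropy is used to calculate the impurity of the labels:
  $\mathrm{Info}(S) = -\sum_i p_i \log p_i$, where $p_i$ is the percentage of label $i$ in $S$
• Information loss of a generalization G: {c1, c2, …, cn} → p
  $I(G) = \mathrm{Info}(S_p) - \sum_i \frac{N_{c_i}}{N_p}\,\mathrm{Info}(S_{c_i})$
  where $S_x$ is the set of records with value $x$ and $N_x$ is its size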
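The entropy-based information loss can be sketched in a few lines. This assumes a base-2 logarithm (the slide only says "log") and that the loss is the parent's entropy minus the size-weighted entropy of the child values being merged; the names `info` and `information_loss` are illustrative.

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels (base-2 log is an assumption)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_loss(children):
    """children: one list of class labels per child value c_i merged into parent p."""
    parent = [label for child in children for label in child]
    n_p = len(parent)
    weighted = sum(len(child) / n_p * info(child) for child in children)
    return info(parent) - weighted

# Merging two pure children mixes the labels, so information is lost.
print(information_loss([["A", "A"], ["B", "B"]]))
# Merging two children that are already equally mixed loses nothing.
print(information_loss([["A", "B"], ["A", "B"]]))
```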
Anonymity gain
• A(VID): # of records with the VID
• A_G(VID) >= A(VID): generalization improves or does not change A(VID)
• Anonymity gain: P(G) = x − A(VID), where
  x = A_G(VID) if A_G(VID) <= K
  x = K otherwise
• As long as k-anonymity is satisfied, further generalization of the VID does not add anonymity gain
Information-privacy combined metric
• IP = info loss / anonymity gain = I(G) / P(G)
• We want to minimize IP
• If P(G) == 0, use I(G) only
• Either a small I(G) or a large P(G) will reduce IP
• If the P(G)s are the same, pick the one with minimum I(G)
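The anonymity gain and the combined IP score can be sketched as follows. The code follows the definitions on these slides, with the count of records sharing the VID before and after generalization passed in as plain integers; it assumes the count before generalization does not already exceed k. The function names and the candidate list are hypothetical.

```python
def anonymity_gain(a_before, a_after, k):
    """P(G) = x - A(VID), with x capped at k (assumes a_before <= k)."""
    x = a_after if a_after <= k else k
    return x - a_before

def ip_score(info_loss, gain):
    """IP = I(G) / P(G); fall back to I(G) alone when the gain is zero."""
    return info_loss if gain == 0 else info_loss / gain

def pick_best(candidates):
    """candidates: (name, info_loss, gain) tuples. Lowest IP wins,
    and ties are broken by the smaller information loss."""
    return min(candidates, key=lambda c: (ip_score(c[1], c[2]), c[1]))[0]

# Growing a group from 1 to 3 records with k = 2 only counts as a gain of 1.
print(anonymity_gain(1, 3, 2))
# Same information loss, larger gain: the first candidate has the lower IP.
print(pick_best([("zip", 0.5, 2), ("sex", 0.5, 1)]))
```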
Domain-hierarchy based algorithms
• Sweeney's algorithm
• Bayardo's tree pruning algorithm
• Wang's top-down and bottom-up algorithms
• They are all dimension-by-dimension methods
Multidimensional techniques
• Categorical data?
  Categories are mapped to numbers ("numerized")
  Bayardo 95 paper
  Does the order matter? (no research on that)
• Numerical data
  K-anonymization → n-dim space partitioning
  Many existing techniques can be applied
Single-dimensional vs. multidimensional
The evolving procedure
• Categorical (domain hierarchy) [Sweeney, top-down/bottom-up]
• Numerized categories, single-dimensional [Bayardo05]
• Numerized/numerical, multidimensional [Mondrian, spatial indexing, …]
Method 1: Mondrian
• Numerize categorical data
• Apply a top-down partitioning process
[Figure: partitioning example showing step 1, steps 2.1 and 2.2, and an allowable cut]
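The top-down partitioning idea can be sketched as a short recursive function. This is a simplified illustration rather than the published algorithm: it assumes all attributes are already numeric, chooses the dimension with the widest raw range, splits at the median, and treats a cut as allowable only if both sides keep at least k records. The name `mondrian_partition` is hypothetical.

```python
def mondrian_partition(points, k):
    """points: list of equal-length numeric tuples. Returns a list of partitions,
    each holding at least k points (provided the input has at least k)."""
    dims = range(len(points[0]))
    # Try the dimension with the widest value range first.
    order = sorted(
        dims,
        key=lambda d: max(p[d] for p in points) - min(p[d] for p in points),
        reverse=True,
    )
    for d in order:
        values = sorted(p[d] for p in points)
        median = values[len(values) // 2]
        left = [p for p in points if p[d] < median]
        right = [p for p in points if p[d] >= median]
        if len(left) >= k and len(right) >= k:  # allowable cut
            return mondrian_partition(left, k) + mondrian_partition(right, k)
    return [points]  # no allowable cut left: this is a final group

# Hypothetical (age, salary in thousands) records.
points = [(25, 30), (27, 35), (40, 60), (45, 80), (60, 90), (62, 40)]
parts = mondrian_partition(points, 2)
print(parts)
```

Each returned partition becomes one k-anonymity group, and its records are published with the partition's ranges in place of their exact values.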
Method 2: spatial indexing
• Multidimensional spatial techniques
  Kd-tree (similar to the Mondrian algorithm)
  R-tree and its variations
[Figure: R-tree and R+-tree, showing the leaf layer and the upper layer]
Compacting bounds
• Example
  uncompacted: age [1-80], salary [10k-100k]
  compacted: age [20-40], salary [10k-50k]
• The original Mondrian does not consider compacting bounds
• For the R+-tree, it is done automatically
• Information is better preserved
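Compacting amounts to replacing a partition's bounds with the minimum bounding box of the records it actually contains. A minimal sketch, assuming records are dictionaries of numeric attributes; the function name `compact_bounds` and the sample group are illustrative.

```python
def compact_bounds(records, attributes):
    """Return the tightest (low, high) range per attribute that covers all records."""
    return {
        a: (min(r[a] for r in records), max(r[a] for r in records))
        for a in attributes
    }

# Hypothetical group whose partition was originally age [1-80], salary [10k-100k].
group = [
    {"age": 20, "salary": 10_000},
    {"age": 35, "salary": 30_000},
    {"age": 40, "salary": 50_000},
]
print(compact_bounds(group, ["age", "salary"]))
```

The published ranges shrink to what the group really spans, which is why more information is preserved.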
Benefits of using the R+-tree
• Scalable: originally designed for indexing large disk-based data
• Multi-granularity k-anonymity: layers
• Better performance
• Better quality
Performance
[Figure: performance plot, labeled: Mondrian]
Utility metrics
• Discernibility penalty
• KL divergence: describes the difference between a pair of distributions
  Anonymized data distribution
• Certainty penalty
  T: table, t: record, m: # of attributes, t.Ai: generalized range, T.Ai: total range
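The formula for the certainty penalty did not survive on this slide. Based on the symbol legend, a plausible reading is that each record contributes, for every attribute, the width of its generalized range divided by the width of the attribute's total range, summed over all m attributes and all records. The sketch below implements that assumed form; the function name and input layout are illustrative.

```python
def certainty_penalty(generalized, total_ranges):
    """generalized: one list per record t of (low, high) ranges, one per attribute (t.Ai).
    total_ranges: (low, high) per attribute over the whole table (T.Ai).
    Assumed form: sum over records and attributes of |t.Ai| / |T.Ai|."""
    penalty = 0.0
    for record in generalized:
        for (low, high), (t_low, t_high) in zip(record, total_ranges):
            penalty += (high - low) / (t_high - t_low)
    return penalty

# Two records generalized to age [20-40], salary [10-50] (in thousands),
# inside a table spanning age [0-80], salary [0-100].
total = [(0, 80), (0, 100)]
rows = [[(20, 40), (10, 50)], [(20, 40), (10, 50)]]
print(certainty_penalty(rows, total))
```

Under this reading a lower penalty means tighter published ranges, so compacted bounds score better than uncompacted ones.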
Other issues
• Sparse high-dimensionality
  Transactional data → boolean matrix
  "On the anonymization of sparse high-dimensional data", ICDE08
• Relates to the clustering problem of transactional data!
  The above paper uses matrix-based clustering; item-based clustering (?)
Other issues
• Effect of numerizing categorical data
  The ordering of categories may have a certain impact on quality
• General-purpose utility metrics vs. special task-oriented utility metrics
• Attacks on the k-anonymity definition