Privacy vs. Utility

Post on 03-Jan-2016


Privacy vs. Utility

Xintao Wu

University of North Carolina at Charlotte

Nov 10, 2008

2

Privacy

• Legal interpretation
  View of privacy in terms of the access that others have to us and our information.
  A general definition of privacy must be one that is measurable, of value, and actionable.

• Measuring privacy
  Secrecy: concerns the information that others may gather about us.
    The probability of a data item being accessed
    The change in an adversary's knowledge upon seeing the data
  Anonymity: addresses how much in the public gaze we are. Privacy leakage is measured in terms of the size of the blurring accompanying the release of data.
  Solitude: measures the degree to which others have physical access to us.

3

Privacy vs. Utility

• Encryption does not work in the publishing scenario.

• Utility
  The goal of privacy preservation measures is to secure access to confidential information while at the same time releasing aggregate information to the public.

4

Data anonymization methods

• Random perturbation
  Input perturbation
  Output perturbation

• Generalization
  The data domain has a natural hierarchical structure.
  The degree of perturbation can be measured in terms of the height of the resulting generalization above the leaf values.

• Suppression

• Permutation
  Destroys the link between identifying and sensitive attributes that could lead to a privacy leakage.

5

Statistical measures of anonymity

• Query restriction
  For a database of size N and a fixed parameter k, all queries that return either fewer than k or more than N-k records are rejected.
  Can be subverted by requesting a specific sequence of queries.
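As a small sketch of the k-restriction rule (the toy database, predicate, and parameter values below are illustrative assumptions, not from the slides):

```python
# Sketch of k-query restriction: a counting query is answered only if
# its result lies in [k, N - k]; otherwise it is rejected.

def restricted_count(records, predicate, k):
    """Answer a counting query only if the count lies in [k, N - k]."""
    n = len(records)
    count = sum(1 for r in records if predicate(r))
    if count < k or count > n - k:
        return None  # rejected: the answer would isolate too few records
    return count

db = [{"age": a} for a in (21, 25, 30, 34, 40, 52, 60, 67)]
print(restricted_count(db, lambda r: r["age"] < 50, k=2))  # answered: 5
print(restricted_count(db, lambda r: r["age"] > 65, k=2))  # rejected: None
```

A sequence of overlapping permitted queries can still be differenced to recover a rejected count, which is the subversion the slide refers to.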

• Anonymity via variance
  Lower-bound the variance of estimators of sensitive attributes.
  Utility is measured (by combining the perturbation scheme with a query restriction method) as the fraction of queries that are permitted after perturbation.
  Confidence interval: how hard it is to reconstruct the original data distribution.

• Anonymity via multiplicity
  K-anonymity

6

Probabilistic measures of anonymity

• Knowing aggregate information about the data, as well as the method of perturbation, reduces the privacy achieved.
  Perturbing X with a random value from [-1,1], the privacy achieved is 2 (the width of the uncertainty interval).
  If the distribution of X is revealed ([0,1] with prob. 0.5 and [4,5] with prob. 0.5), the privacy achieved is reduced to 1.
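The example above can be simulated; a minimal sketch mirroring the numbers on the slide:

```python
import random

# Sketch of the slide's example: X is drawn from [0,1] or [4,5] with equal
# probability, then perturbed with r ~ Uniform[-1,1]. The perturbed ranges
# [-1,2] and [3,6] are disjoint, so an adversary who knows the distribution
# can always place X in a width-1 interval instead of the width-2 one
# suggested by the perturbation range alone.

def sample(rng):
    x = rng.uniform(0, 1) if rng.random() < 0.5 else rng.uniform(4, 5)
    return x, x + rng.uniform(-1, 1)

def infer_interval(y):
    # The two perturbed ranges [-1,2] and [3,6] never overlap.
    return (0.0, 1.0) if y <= 2 else (4.0, 5.0)

rng = random.Random(0)
for _ in range(1000):
    x, y = sample(rng)
    lo, hi = infer_interval(y)
    assert lo <= x <= hi  # X is always pinned to an interval of width 1
```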

• Mutual information
  P(A|B) = 1 - 2^H(A|B)/2^H(A) = 1 - 2^(-I(A;B))
  H(A) encodes the amount of uncertainty (the degree of privacy) in a random variable.
  H(A|B) is the amount of privacy left in A after B is released.
  I(A;B) = H(A) - H(A|B) is the mutual information between A and B.
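These quantities can be computed directly from a joint distribution; a sketch with a toy distribution (the numbers are assumptions for illustration):

```python
from math import log2

# Sketch: entropy-based privacy loss 1 - 2^(-I(A;B)) computed from a toy
# joint distribution over binary A (original) and B (released).

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(a, b); A and B are positively correlated.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

pa = [sum(v for (a, b), v in joint.items() if a == i) for i in (0, 1)]
pb = [sum(v for (a, b), v in joint.items() if b == j) for j in (0, 1)]

h_a = entropy(pa)                              # H(A): privacy in A
h_ab = entropy(joint.values()) - entropy(pb)   # H(A|B) = H(A,B) - H(B)
i_ab = h_a - h_ab                              # I(A;B) = H(A) - H(A|B)
privacy_loss = 1 - 2 ** (-i_ab)                # fraction of A's privacy lost

print(round(h_a, 4), round(i_ab, 4), round(privacy_loss, 4))
```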

• Utility
  Statistical distance between the source distribution of the data and the perturbed distribution.

7

On the design and quantification of privacy preserving data mining algorithms, PODS01


9

Market basket data

• A privacy breach is defined as one in which the probability of some property of the input data is high, conditioned on the output perturbed data having certain properties. (Evfimievski et al.)

• Privacy is measured in terms of the probability of correctly reconstructing the original bit, given a perturbed bit. (Rizvi and Haritsa)

• Utility is the problem of reconstructing itemset frequencies accurately.
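A rough sketch of this style of bit perturbation and frequency reconstruction (the retention probability and data are assumptions, not the papers' experiments):

```python
import random

# Sketch of randomized-response bit perturbation in the style of Rizvi &
# Haritsa's MASK: each bit of the market-basket matrix is kept with
# probability p and flipped with probability 1 - p. The observed frequency
# of 1s can then be inverted to estimate the true item frequency.

def perturb(bits, p, rng):
    return [b if rng.random() < p else 1 - b for b in bits]

def estimate_frequency(perturbed, p):
    # E[observed] = f*p + (1 - f)*(1 - p)  =>  f = (obs - (1-p)) / (2p - 1)
    obs = sum(perturbed) / len(perturbed)
    return (obs - (1 - p)) / (2 * p - 1)

rng = random.Random(1)
true_bits = [1 if rng.random() < 0.3 else 0 for _ in range(100_000)]
noisy = perturb(true_bits, p=0.9, rng=rng)
est = estimate_frequency(noisy, p=0.9)
print(round(sum(true_bits) / len(true_bits), 3), round(est, 3))
```

With 100,000 bits the reconstructed frequency lands close to the true one, illustrating why aggregate itemset frequencies stay usable while individual bits do not.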

10

Measuring information transfer

Limiting privacy breaches in privacy preserving data mining, PODS03

If we look back from y, there is no easy way of telling whether the source is x1 or x2.

11

Measures based on generalization

• K-anonymity
• L-diversity
• P-sensitive-k-anonymity
• T-closeness

L-diversity may be difficult and unnecessary to achieve.
  Example: the sensitive attribute is the test result for a virus, with 99% of records being negative. The positive and negative values have different degrees of sensitivity.

L-diversity is insufficient to prevent attribute disclosure.
  Skewness attack: e.g., one equivalence class has an equal number of positive and negative records.
  Similarity attack: the sensitive attribute values in an equivalence class are distinct but semantically similar.

T-closeness: an equivalence class satisfies t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than t.
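As a sketch, k-anonymity and distinct l-diversity can be checked on a toy table (the attributes and values below are hypothetical):

```python
from collections import defaultdict

# Sketch: compute the k-anonymity level and the distinct l-diversity level
# of a toy anonymized table. Quasi-identifier: (age_range, zip_prefix);
# sensitive attribute: disease.

table = [
    ("20-29", "537**", "Flu"),
    ("20-29", "537**", "Cold"),
    ("20-29", "537**", "HIV"),
    ("30-39", "542**", "Flu"),
    ("30-39", "542**", "Flu"),
    ("30-39", "542**", "Cold"),
]

groups = defaultdict(list)
for age, zip_, disease in table:
    groups[(age, zip_)].append(disease)

k = min(len(v) for v in groups.values())        # smallest equivalence class
l = min(len(set(v)) for v in groups.values())   # fewest distinct sensitive values
print(k, l)  # 3 2
```

Here the table is 3-anonymous but only 2-diverse: the second class carries just two distinct diseases, which is exactly the kind of gap the skewness and similarity attacks exploit.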

12

Measuring distribution difference

13

Earth mover’s distance

14

EMD for numerical attribute
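The slide's figure is not in the transcript. As a sketch, the ordered-distance EMD used by t-closeness for a numerical attribute with m equally spaced values sums the absolute partial sums of p - q and scales by 1/(m-1); the distributions below are illustrative assumptions:

```python
# Sketch of the ordered-distance earth mover's distance for a numerical
# attribute with m ordered, equally spaced values:
#   EMD(p, q) = (1 / (m - 1)) * sum_{i=1}^{m-1} |r_1 + ... + r_i|,
# where r_j = p_j - q_j.

def emd_numerical(p, q):
    m = len(p)
    r = [pi - qi for pi, qi in zip(p, q)]
    cum, total = 0.0, 0.0
    for i in range(m - 1):
        cum += r[i]
        total += abs(cum)
    return total / (m - 1)

q = [0.25, 0.25, 0.25, 0.25]   # distribution in the whole table
p = [0.50, 0.50, 0.00, 0.00]   # one class, concentrated on low values
print(round(emd_numerical(p, q), 4))  # 0.3333
```

A class with this distribution would violate t-closeness for any t below 1/3.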

15

EMD for categorical attribute

16

EMD for categorical attribute

17

Permutation

• The goal of the k-anonymous blocks is that the diameter of the range of sensitive attribute values is larger than a parameter e.

• Permutation based anonymization can answer aggregate queries more accurately than generalization based anonymization.
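A minimal sketch of the idea, with hypothetical records: sensitive values are shuffled within each block, breaking the identifier-to-value link while leaving aggregates over the table exact.

```python
import random

# Sketch of permutation-based anonymization: within each block, the link
# between quasi-identifiers and the sensitive attribute is destroyed by
# shuffling the sensitive values, while whole-table aggregates (sums,
# means, counts) remain exactly correct.

def permute_blocks(records, block_size, rng):
    out = []
    for i in range(0, len(records), block_size):
        block = records[i:i + block_size]
        sensitive = [s for _, s in block]
        rng.shuffle(sensitive)  # break the per-record link inside the block
        out.extend((qid, s) for (qid, _), s in zip(block, sensitive))
    return out

rng = random.Random(0)
data = [("r%d" % i, salary) for i, salary in enumerate((30, 35, 60, 65, 90, 95))]
anon = permute_blocks(data, block_size=2, rng=rng)
print(sum(s for _, s in anon) / len(anon))  # 62.5, same mean as the original
```

Because values survive unchanged (only their assignment moves), aggregate queries over such data can be answered more accurately than over generalized data, which only retains coarse ranges.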

18

Anonymizing inference

• To protect the possible inferences that can be made from the data

• A privacy template is an inference on the data, coupled with a confidence bound. The requirement is that in the anonymized data, this inference not be valid with a confidence larger than the provided bound.

• Wang et al. Handicapping attacker's confidence: an alternative to k-anonymization

19

Measuring utility in generalization based anonymity

• The precision of a generalization scheme is 1 - the average height of a generalization (measured over all cells).

Bayardo and Agrawal, ICDE 05
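One common way to compute this normalizes each cell's generalization height by the depth of its attribute's hierarchy; a sketch, where the per-cell heights and hierarchy depths are assumptions for illustration:

```python
# Sketch of the precision metric: 1 minus the average normalized
# generalization height over all cells of the released table.

def precision(heights, depths):
    """heights: per-record list of generalization levels, one per attribute;
    depths: maximum hierarchy depth for each attribute."""
    total = sum(h / d for row in heights for h, d in zip(row, depths))
    cells = len(heights) * len(depths)
    return 1 - total / cells

# 3 records, 2 attributes; hierarchy depths 3 (age) and 2 (zip).
heights = [[1, 0], [2, 1], [0, 1]]
print(round(precision(heights, [3, 2]), 4))  # 0.6667
```

Precision 1 means the data was released at leaf level; precision 0 means every cell was generalized to the root (fully suppressed).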

20

Utility vs. privacy

• Most of the schemes for ensuring data anonymity focus on defining measures of anonymity, while using ad hoc measures of utility.

• Kifer & Gehrke (Injecting utility into anonymized datasets, SIGMOD06): after performing a standard anonymization, publish carefully chosen marginals of the source data. From these marginals, construct a consistent maximum entropy distribution, and measure utility as the KL-distance between this distribution and the source.
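As a sketch of that utility measure (the toy distributions below are assumptions, not data from the paper): the KL-distance between the source distribution and a reconstructed maximum-entropy distribution.

```python
from math import log2

# Sketch: utility as the Kullback-Leibler distance between the source
# distribution and the distribution reconstructed from published marginals.
# With no marginals published, the max-entropy reconstruction is uniform.

def kl_divergence(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

source = [0.4, 0.3, 0.2, 0.1]              # hypothetical source distribution
reconstructed = [0.25, 0.25, 0.25, 0.25]   # max-entropy with no marginals
print(round(kl_divergence(source, reconstructed), 4))
```

Publishing more informative marginals pulls the reconstruction toward the source, driving the KL-distance (and thus the utility loss) toward zero.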

• Rastogi et al. The boundary between privacy and utility in data publishing

21

Computational measures of anonymity

• Privacy statements are phrased in terms of the power of an adversary, rather than the amount of background knowledge the adversary possesses.

Dinur & Nissim. Revealing information while preserving privacy. PODS03

• Measuring anonymity via information transfer

• Indistinguishability
  A database is private if anything learnable from it can be learned in the absence of the database.

22

Anonymity via isolation

• A record is private if it cannot be singled out from its neighbors.

An adversary is defined as an algorithm that takes an anonymized database and some auxiliary information, and outputs a single point q.

An anonymization is successful if the adversary, combining the anonymization with auxiliary information, can do no better at isolation than a weaker adversary with no access to the anonymized data.

23

Metrics for quantifying data quality

• Quality of the data resulting from the PPDM process
  Accuracy
  Completeness
  Consistency

• Quality of the data mining results

• Chapter 8.4

24

Measures

Oliveira & Zaiane, privacy preserving frequent itemset mining, 2002

25

Generalization based

• The data quality metric is based on the height of generalization hierarchies.

Data should be generalized in as few steps as possible to preserve maximum utility.

Not all generalization steps are equal in the sense of information loss.

• General loss metric

• Classification metric
  Iyengar, KDD02

• Discernibility metric
  Bayardo & Agrawal, ICDE05
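A sketch of the discernibility metric on toy equivalence classes (the penalty for suppressed records, charged at the full table size, is omitted here):

```python
from collections import Counter

# Sketch of the discernibility metric: each record is charged the size of
# the equivalence class it falls into, so the total penalty is the sum of
# squared class sizes. Larger classes mean less discernible (but less
# useful) records.

def discernibility(quasi_identifiers):
    sizes = Counter(quasi_identifiers).values()
    return sum(s * s for s in sizes)

qi = [("20-29", "537**")] * 3 + [("30-39", "542**")] * 2
print(discernibility(qi))  # 3^2 + 2^2 = 13
```

The minimum penalty (every record in its own class) equals the table size; coarser generalizations push the penalty toward the square of the table size.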

26

Statistical based perturbation