Page 1: A Common Measure of Identity and Value Disclosure Risk

1

A Common Measure of Identity and Value Disclosure Risk

Krish Muralidhar, University of Kentucky

[email protected]

Rathin Sarathy, Oklahoma State University

[email protected]

Page 2: A Common Measure of Identity and Value Disclosure Risk

2

Context

This study presents developments in the context of numerical data that have been masked and released.

We assume that the categorical data (if any) have not been masked; this assumption can be relaxed.

Page 3: A Common Measure of Identity and Value Disclosure Risk

3

Empirical Assessment of Disclosure Risk

Is there a link between identity disclosure and value disclosure that will allow us to use a "common" measure?

Page 4: A Common Measure of Identity and Value Disclosure Risk

4

Basis for Disclosure

The “strength of the relationship”, in a multivariate sense, between the two datasets (original and masked) accounts for disclosure risk

Page 5: A Common Measure of Identity and Value Disclosure Risk

5

Value Disclosure

Value disclosure based on "strength of relationship":
- Palley & Simonoff (1987): R² measure for individual variables
- Tendick (1992): R² for linear combinations
- Muralidhar & Sarathy (2002): canonical correlation

Implicit assumption: the snooper can use linear models to improve their prediction of confidential values (Palley & Simonoff 1987; Fuller 1993; Tendick 1992; Muralidhar & Sarathy 1999, 2001).

Page 6: A Common Measure of Identity and Value Disclosure Risk

6

Identity Disclosure

Assessment of identity disclosure is often empirical in nature, e.g., Winkler's software (Census Bureau), based on a modified Fellegi-Sunter algorithm. The number (or proportion) of observations correctly re-identified represents an assessment of identity disclosure risk.

Theoretical attempts for numerical data:
- Fuller (1993): linear model
- Tendick (1992): linear model
- Fienberg, Makov, and Sanil (1997): Bayesian

Page 7: A Common Measure of Identity and Value Disclosure Risk

7

Fuller's Measure

Given the masked dataset Y and the original dataset X, and assuming normality, the probability that the j-th released record corresponds to any particular record that the intruder may possess is given by $P_j = (\sum_t k_t)^{-1} k_j$.

The intruder chooses the record j which maximizes $k_j = \exp\{-0.5\,(X - YH)A^{-1}(X - YH)'\}$, where $A = \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}$ and $H = \Sigma_{YY}^{-1}\Sigma_{YX}$.

$P_j$ may be treated as the identification probability (identity risk) of any particular record, and averaging over all records gives a mean identification probability, or mean identity disclosure risk, for the whole masked dataset.
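As a concrete illustration, here is a minimal numpy sketch of these match probabilities for one original record against all masked records. The function name, input layout, and use of supplied covariance blocks are assumptions for this example, not the authors' implementation.

```python
import numpy as np

def fuller_probabilities(x, Y, S_xx, S_xy, S_yy):
    """P_j for one original record x against every masked record (row of Y),
    following the normal-theory measure above.
    x: (k,), Y: (n, k); S_xx, S_xy, S_yy: (k, k) covariance blocks."""
    S_yx = S_xy.T
    H = np.linalg.solve(S_yy, S_yx)        # H = Sigma_YY^{-1} Sigma_YX
    A = S_xx - S_xy @ H                    # conditional covariance of X given Y
    resid = x - Y @ H                      # x - y_j H for every masked record j
    d2 = np.einsum('ij,ij->i', resid @ np.linalg.inv(A), resid)
    k_j = np.exp(-0.5 * d2)                # k_j = exp{-0.5 (x - y_j H) A^{-1} (x - y_j H)'}
    return k_j / k_j.sum()                 # P_j = k_j / sum_t k_t
```

Averaging the P_j of the true pairings over all records then gives the mean identification probability discussed later.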

Page 8: A Common Measure of Identity and Value Disclosure Risk

8

Fuller's Distance Measure

Based on best conditional densities.

While restricted to normal datasets, it relates identity risk to the association between the two datasets (though somewhat indirectly), as indicated by $k_j$, which contains $\Sigma_{XY}$.

Shows the connection between distance-based measures and probability-based measures.

Page 9: A Common Measure of Identity and Value Disclosure Risk

9

Our Goal

To show that both value disclosure and identity disclosure are determined by the degree of association between the masked and original datasets. This must be true, since both are based on best predictors.

When the best predictors are linear (e.g., for multivariate normal datasets), canonical correlation can capture the association, and both value disclosure and identity disclosure risk must be expressible in terms of canonical correlations.

Already shown for value disclosure (Muralidhar et al. 1999; Sarathy et al. 2002). We show here the relationship between identity disclosure and canonical correlation.

Page 10: A Common Measure of Identity and Value Disclosure Risk

10

Canonical Correlation Version of Fuller's Distance Measure

$(X - YH)A^{-1}(X - YH)' = (U - V\Lambda^{0.5})\,C^{-1}\,(U - V\Lambda^{0.5})'$,

where
- $U = X\,\Sigma_{XX}^{-0.5}\,e$ (the canonical variates for the X variables)
- $V = Y\,\Sigma_{YY}^{-0.5}\,f$ (the canonical variates for the Y variables)
- $C = (I - \Lambda)$
- $e$ is the eigenvector matrix of $\Sigma_{XX}^{-0.5}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-0.5}$
- $f$ is the eigenvector matrix of $\Sigma_{YY}^{-0.5}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-0.5}$
- $\Lambda$ is the diagonal matrix of eigenvalues, which are also the squared canonical correlations

Page 11: A Common Measure of Identity and Value Disclosure Risk

11

Therefore…

Identity disclosure risk is a function of the (linear) association between the two datasets (the lambdas, which are the squares of the canonical correlations).

$(U - V\Lambda^{0.5})(I - \Lambda)^{-1}(U - V\Lambda^{0.5})'$ relates this association to identity disclosure and also provides an "operational" way to assess this risk.

Compute this distance measure and match each original record to the masked record that minimizes the expression, as sketched below. The number of re-identified records then gives an overall empirical assessment of identity disclosure risk for a masked data release. (Empirical results shown later.)
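The following sketch, using only numpy, implements this operational procedure under the paper's linear/normal setting. Canonical variates and the lambdas are estimated with the standard SVD formulation of canonical correlation analysis, which is equivalent to the eigenvector definitions on the previous slide; the function names are illustrative, not the authors' code.

```python
import numpy as np

def canonical_variates(X, Y):
    """Return U, V (canonical variates of X and Y) and the squared
    canonical correlations lambda, estimated from the two samples."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    S_xx, S_yy = Xc.T @ Xc / (n - 1), Yc.T @ Yc / (n - 1)
    S_xy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):                      # symmetric inverse square root
        w, Q = np.linalg.eigh(S)
        return Q @ np.diag(w ** -0.5) @ Q.T

    Rx, Ry = inv_sqrt(S_xx), inv_sqrt(S_yy)
    E, rho, Ft = np.linalg.svd(Rx @ S_xy @ Ry)   # rho = canonical correlations
    U, V = Xc @ Rx @ E, Yc @ Ry @ Ft.T
    return U, V, rho ** 2

def reidentification_rate(X, Y):
    """Match each original record to the masked record minimizing
    (U - V Lam^0.5)(I - Lam)^-1 (U - V Lam^0.5)', and report the
    proportion of records matched back to their true counterparts."""
    U, V, lam = canonical_variates(X, Y)
    w = 1.0 / np.clip(1.0 - lam, 1e-12, None)            # diag of (I - Lambda)^-1
    diff = U[:, None, :] - np.sqrt(lam) * V[None, :, :]  # all (original, masked) pairs
    d2 = (w * diff ** 2).sum(axis=-1)
    best = d2.argmin(axis=1)                             # best masked match per original record
    return (best == np.arange(len(X))).mean()
```

For example, calling reidentification_rate(X, Y) on an original sample X and its masked release Y returns the proportion of records re-identified by this distance criterion.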

Page 12: A Common Measure of Identity and Value Disclosure Risk

12

Mean Identification Probability (MIDP)

Tendick computed bounds on identification probabilities for correlated additive noise methods

His expressions are specific to the method and not for the general case

We show a lower bound on MIDP for the general case (regardless of masking technique) that is based on canonical correlations

Page 13: A Common Measure of Identity and Value Disclosure Risk

13

Bound on MIDP

For a data set of size n with k confidential variables X, masked using any procedure to produce Y, the mean identification probability is bounded below by a closed-form expression in n and the squared canonical correlations $\lambda_1, \dots, \lambda_k$. [Equation not reproduced in this transcript.]

Page 14: A Common Measure of Identity and Value Disclosure Risk

14

Identification Probability (IDP)

For any given observation i in the original data set, the probability that it will be re-identified is given by a closed-form expression involving n, the squared canonical correlations $\lambda_j$, and exponential terms in the canonical variates $U_{ij}$, where $U_{ij}$ is the canonical variate for $X_{ij}$. [Equation not reproduced in this transcript.]

Page 15: A Common Measure of Identity and Value Disclosure Risk

15

An Example

Consider a data set with 10 variables and a specified covariance matrix.

Assume that the data is to be perturbed using simple noise addition with different levels of variance.

Compute MIDP for different sample sizes and different noise variances.

Page 16: A Common Measure of Identity and Value Disclosure Risk

16

Covariance Matrix of X

Page 17: A Common Measure of Identity and Value Disclosure Risk

17

MIDP

Page 18: A Common Measure of Identity and Value Disclosure Risk

18

Additive (Correlated) Noise

Kim (1986) suggested that the covariance structure of the noise term should be the same as that of the original confidential variables ($d\,\Sigma_{XX}$, where d is a constant representing the "level" of noise).

In this case, the canonical correlation for each (masked, original) variable pair is $[1/(1+d)]^{0.5}$ (see the sketch below).
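A small simulation sketch of this noise model follows; the seed, dimensions, and the particular $\Sigma_{XX}$ are arbitrary choices for illustration. In this special case the per-variable correlations between $X_j$ and $Y_j$ coincide with the canonical correlations, so they can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, d = 10, 5000, 0.2                      # illustrative dimensions and noise level
A = rng.standard_normal((k, k))
Sigma = A @ A.T + k * np.eye(k)              # an arbitrary positive-definite Sigma_XX
X = rng.multivariate_normal(np.zeros(k), Sigma, size=n)
noise = rng.multivariate_normal(np.zeros(k), d * Sigma, size=n)  # Cov(noise) = d * Sigma_XX
Y = X + noise                                # Kim-style correlated additive noise

theory = (1 / (1 + d)) ** 0.5
empirical = [np.corrcoef(X[:, j], Y[:, j])[0, 1] for j in range(k)]
print(theory, np.round(empirical, 3))        # each empirical value should be close to theory
```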

Page 19: A Common Measure of Identity and Value Disclosure Risk

19

MIDP

Page 20: A Common Measure of Identity and Value Disclosure Risk

20

Comparison of Simple Additive and Correlated Noise

For the same noise level:
- Correlated noise results in higher identity disclosure risk (Tendick 1993 also observed this)
- Correlated noise results in lower value disclosure risk (Tendick and Matloff 1994; Muralidhar et al. 1999)

Page 21: A Common Measure of Identity and Value Disclosure Risk

21

Other Procedures

For some other procedures (micro-aggregation, data swapping, etc.), it may be necessary to perform the masking and use the data to compute the canonical correlations.

Page 22: A Common Measure of Identity and Value Disclosure Risk

22

Data Sets with Categorical Non-confidential Variables

MIDP can be computed for subsets as well (see the sketch below). Example:
- Data set with 2000 observations
- Six numerical variables
- Three categorical (non-confidential) variables: gender, marital status, age group (1–6)
- Masking procedure is the rank-based proximity swap
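A hypothetical sketch of how unmasked categorical variables sharpen re-identification: matching candidates are restricted to records in the same categorical cell, and a simple standardized Euclidean matcher stands in for the canonical-correlation distance above. The column names and row-aligned data frames are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def within_cell_match_rate(orig, masked, cat_cols, num_cols):
    """orig and masked are row-aligned DataFrames; cat_cols are unmasked."""
    center = orig[num_cols].mean()
    scale = orig[num_cols].std()
    correct = 0
    for _, grp in orig.groupby(list(cat_cols)):
        idx = grp.index                              # records sharing this categorical cell
        Xg = ((grp[num_cols] - center) / scale).to_numpy()
        Yg = ((masked.loc[idx, num_cols] - center) / scale).to_numpy()
        d2 = ((Xg[:, None, :] - Yg[None, :, :]) ** 2).sum(axis=-1)
        best = d2.argmin(axis=1)                     # nearest masked record within the cell
        correct += int((idx[best] == idx).sum())
    return correct / len(orig)

# e.g. within_cell_match_rate(orig, masked,
#          ['gender', 'marital_status', 'age_group'], numeric_cols)  # hypothetical names
```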

Page 23: A Common Measure of Identity and Value Disclosure Risk

23

MIDP

Page 24: A Common Measure of Identity and Value Disclosure Risk

24

Using IDP

We can use the IDP bound to implement a record re-identification procedure by choosing the masked record with the highest IDP value.

Page 25: A Common Measure of Identity and Value Disclosure Risk

25

An IDP Example

- Data set consisting of 25 observations from a MVN(0, 1)
- Perturbed using independent noise with variance = 0.45
- MIDP = 0.2375, so approximately 6 observations should be re-identified using this criterion
- Re-identification by chance = 1/n = 0.04 (see the simulation sketch below)
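The experiment on this slide can be reproduced in outline as follows, reusing reidentification_rate() from the sketch after slide 11. The seed and the number of variables (5) are assumptions, since the slide states only n = 25, MVN(0, 1), and noise variance 0.45.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, noise_var = 25, 5, 0.45                # n and noise variance from the slide; k assumed
X = rng.standard_normal((n, k))              # a MVN(0, I) sample
Y = X + rng.normal(scale=noise_var ** 0.5, size=(n, k))   # independent additive noise

rate = reidentification_rate(X, Y)           # defined in the earlier sketch
print(f"re-identified {rate:.2f} of records; chance level 1/n = {1/n:.2f}")
```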

Page 26: A Common Measure of Identity and Value Disclosure Risk

26

An IDP Example

Page 27: A Common Measure of Identity and Value Disclosure Risk

27

Advantages

- Possible to compute MIDP with just aggregate information
- Possible to use IDP as a "record-linkage" tool for assessing the disclosure risk characteristics of a masking technique
- Computationally easier than alternative existing methods

Page 28: A Common Measure of Identity and Value Disclosure Risk

28

Disadvantages

- Assumes that the data have a multivariate normal distribution
- For large n, the lower bound is weak. MIDP appears to be overly pessimistic; we are working on finding out why this is so, and possibly modifying the bound.

Page 29: A Common Measure of Identity and Value Disclosure Risk

29

Weak Bound?

Sample result: n = 50, simple noise addition.

Noise   MIDP       Actual
0.10    0.990408   1.00
0.20    0.787552   0.94
0.30    0.034811   0.88
0.40    0.000000   0.72
0.50    0.000000   0.62
0.75    0.000000   0.46
1.00    0.000000   0.36

Page 30: A Common Measure of Identity and Value Disclosure Risk

30

Conclusion

Canonical correlation analysis can be used to assess both identity and value disclosure.

For normal data, this provides the best measure of both identity and value disclosure.

Page 31: A Common Measure of Identity and Value Disclosure Risk

31

Further Research

- Sensitivity to the normality assumption
- Comparison with Fellegi-Sunter based record linkage procedures
- Refining the bounds

Page 32: A Common Measure of Identity and Value Disclosure Risk

32

Our Research

You can find the details of our current and prior research at: http://gatton.uky.edu/faculty/muralidhar

