Date post: 28-Dec-2015
Privacy-Preserving Data Quality Assessment for High-Fidelity Data Sharing
Julien Freudiger, Shantanu Rane, Alex Brito, and Ersin Uzun, PARC
Transcript
Page 1: Privacy-Preserving Data Quality Assessment for High-Fidelity Data Sharing Julien Freudiger, Shantanu Rane, Alex Brito, and Ersin Uzun PARC.

Privacy-Preserving Data Quality Assessment for High-Fidelity Data Sharing

Julien Freudiger, Shantanu Rane, Alex Brito, and Ersin Uzun

PARC

Page 2

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

Page 3

What about data quality?

Alice does not know data quality prior to acquisition

Dirty data costs US businesses ~$600 billion annually [1]

Data cleaning accounts for up to 80% of development time

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

[1] W. Eckerson. Data quality and the bottom line. TDWI Report, The Data Warehouse Institute, 2002

[Pie chart: Data Cleaning 80, Data Exploration 20]

Page 4

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

Privacy concerns for Bob

Page 5

Alice: How many rows are complete?

Bob: All of them

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

Trust and privacy concerns for Alice

Page 6

Problem

Privacy-Preserving Data Quality Assessment

Page 7

Data Quality Metrics

Integrity constraints on attributes
=, >, [ ]
Example: age > 0

Dependency constraints across 2+ attributes
if, while, for
Example: if state == CA, then ZIP in [94000, 96199]

Many data quality metrics [1,2]: Completeness, Validity, Uniqueness, Consistency, Timeliness

[1] Y. Lee, D. Strong, B. Kahn, and R. Wang. AIMQ: a methodology for information quality assessment. Information & management, 40(2), 2002

[2] P. Cykana, A. Paul, and M. Stern. DoD Guidelines on Data Quality Management. In IQ, pages 154–171, 1996

Page 8

Data Quality Metrics

Completeness
Percentage of elements that are properly populated

Check for values such as NULL, “”, …

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL
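
As a plaintext illustration (with no privacy protection), completeness of the example table could be computed along the following lines. The set of values treated as missing and the dictionary layout are assumptions made for this sketch, not part of the slides.

```python
# Plaintext sketch: completeness = populated cells / all cells.
# The MISSING markers and the row layout are illustrative assumptions.
MISSING = {None, "", "NULL"}

rows = [
    {"first": "John", "last": "Steinbeck", "age": 32, "state": "CA", "zip": "94043"},
    {"first": "Jimi", "last": "Hendrix", "age": 27, "state": "WA", "zip": "01000"},
    {"first": "Isaac", "last": "Asimov", "age": -15, "state": "NY", "zip": None},
]

def completeness(table):
    """Return the fraction of cells that hold a non-missing value."""
    cells = [value for row in table for value in row.values()]
    populated = sum(1 for value in cells if value not in MISSING)
    return populated / len(cells)

score = completeness(rows)  # 14 of the 15 cells are populated
```

Only the missing ZIP lowers the score; the negative age still counts as populated, which is why validity is measured separately.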

Page 9

Data Quality Metrics

Validity
Percentage of elements whose attributes possess meaningful values

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL
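
A minimal plaintext sketch of validity, using the integrity constraint age > 0 mentioned earlier in the deck. The function name and the use of a predicate argument are choices made for illustration only.

```python
# Plaintext sketch: validity = values satisfying an integrity constraint / all values.
ages = [32, 27, -15]  # Age column of the example table

def validity(values, is_valid):
    """Return the fraction of values for which the predicate holds."""
    return sum(1 for value in values if is_valid(value)) / len(values)

age_validity = validity(ages, lambda age: age > 0)  # -15 fails, so 2 of 3 pass
```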

Page 10

Data Quality Metrics

Consistency
Degree to which the data attributes satisfy dependency constraints

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL
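
A plaintext sketch of a consistency check over State and ZIP. The CA range is the example constraint given earlier in the deck. The WA range is a made-up placeholder, added only so that the second row has a rule to violate. Treating a state without a rule as consistent is likewise an assumption.

```python
# Plaintext sketch: consistency = rows satisfying the dependency rules / all rows.
# CA range: example constraint from the slides. WA range: hypothetical placeholder.
ZIP_RANGES = {"CA": (94000, 96199), "WA": (98000, 99499)}

rows = [("CA", "94043"), ("WA", "01000"), ("NY", None)]  # (State, ZIP)

def satisfies(state, zip_code, rules=ZIP_RANGES):
    """A row passes if its state has no rule or its ZIP lies in the state's range."""
    if state not in rules:
        return True
    if zip_code is None:
        return False
    low, high = rules[state]
    return low <= int(zip_code) <= high

consistency = sum(1 for s, z in rows if satisfies(s, z)) / len(rows)
```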

Page 11

Desired Privacy Properties

Query Privacy
Bob should not learn the data quality constraint parameters and the resulting values

Data Privacy
Alice should not learn anything from Bob’s data besides the quality metric

Page 12

Application: High-Fidelity Cyber Threat Mitigation

[1] S. Katti, B. Krishnamurthy, and D. Katabi. Collaborating against common enemies. In IMC, 2005

[2] J. Zhang, P. A. Porras, and J. Ullrich. Highly predictive blacklisting. In USENIX Security, 2008

[3] P. Porras and V. Shmatikov. Large-scale collection and sanitization of network security data: risks and challenges. In NSPW, 2006

[Figure: four data tables, each with the fields IP, Port, Time, UID, APT]

Page 13

Solutions

Rely on existing cryptographic primitives

Develop custom solution

Page 14

Private Set Intersection

Set intersection or cardinality of set intersection

[1] M. Freedman, K. Nissim, and B. Pinkas. Efficient private matching and set intersection. In EUROCRYPT, 2004

[2] E. De Cristofaro, P. Gasti, and G. Tsudik. Fast and Private Computation of Cardinality of Set Intersection and Union. In CANS, 2012

Page 15

Private Set Intersection Completeness

Alice: {NULL}, indexed as (1, NULL), (2, NULL), …, (n, NULL)

Bob: {d1, …, dn}, indexed as (1, d1), (2, d2), …, (n, dn)

PSI-CA approach is inefficient
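
To see why this formulation scales with the number of tuples, the set construction can be sketched in plaintext. Ordinary Python sets stand in for PSI-CA here, so nothing is private. The column contents are hypothetical.

```python
# Plaintext sketch of the set construction only; a real PSI-CA protocol would
# compute the intersection size without revealing either set.
bob_column = ["94043", "01000", None, "98101", None]  # hypothetical ZIP column

# Pair every element with its row index so repeated values stay distinct.
bob_set = {(i, value) for i, value in enumerate(bob_column, start=1)}
alice_set = {(i, None) for i in range(1, len(bob_column) + 1)}

null_count = len(alice_set & bob_set)  # rows 3 and 5 match
```

Both sets contain one element per row, so any protocol that processes them element by element does work proportional to the number of tuples.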

Page 16

Encrypted-domain Computation

Ciphertexts: E(d1), E(d2)

Operation on ciphertexts: E(d1) * E(d2)

Decrypted result: d1 + d2

[1] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In EUROCRYPT, 1999
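
The figure relies on the additive homomorphism of the Paillier cryptosystem [1]. In the usual notation for that scheme (modulus n, generator g, random value r; these symbols do not appear on the slide), the property can be written out as follows.

```latex
\begin{align*}
E(d) &= g^{d}\, r^{n} \bmod n^{2} \\
E(d_1)\cdot E(d_2) &= g^{d_1 + d_2}\,(r_1 r_2)^{n} \bmod n^{2}
  \quad\text{(an encryption of } d_1 + d_2 \bmod n\text{)} \\
E(d)^{k} &= g^{k d}\,(r^{k})^{n} \bmod n^{2}
  \quad\text{(an encryption of } k\,d \bmod n\text{)}
\end{align*}
```

The last line is what lets a party holding a plaintext k scale an encrypted value without decrypting it.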

Page 17

Select & Aggregate Setup

Goal: Alice has a binary selector u, Bob has data vector v. Alice should discover the sum of selected elements from v.
Query Privacy: Bob should not find the selector vector.
Data Privacy: Alice should not discover any information other than the selected aggregate.

[Figure: Secure Select & Aggregate Protocol]

Page 18

Select & Aggregate Protocol

1. Alice sends element-wise encryptions of u to Bob.
2. Bob computes the dot product of u and v using additive homomorphic property, and sends it to Alice.
3. Alice decrypts the dot product.

[Figure: Secure Select & Aggregate Protocol]
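
The three protocol steps can be sketched with a toy Paillier implementation. This is a minimal illustration under stated assumptions, not the authors' implementation. The primes are tiny and insecure, g = n + 1 is one common parameter choice, and all function names are hypothetical.

```python
# Toy Paillier with tiny primes: for illustration only, NOT secure.
import math
import secrets

def keygen(p=1009, q=1013):
    """Return (public, private) keys using the simple choice g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (needs Python 3.8+)
    return (n, n + 1), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n

def bob_respond(pub, enc_u, v):
    """Homomorphic dot product: prod E(u_i)^(v_i) encrypts sum(u_i * v_i)."""
    n, _ = pub
    acc = 1  # trivial encryption of 0
    for c, vi in zip(enc_u, v):
        acc = acc * pow(c, vi, n * n) % (n * n)
    return acc

def select_and_aggregate(u, v):
    pub, priv = keygen()
    enc_u = [encrypt(pub, ui) for ui in u]  # step 1 (Alice)
    enc_dot = bob_respond(pub, enc_u, v)    # step 2 (Bob)
    return decrypt(pub, priv, enc_dot)      # step 3 (Alice)

result = select_and_aggregate([1, 0, 1, 1], [5, 7, 2, 10])  # selects 5 + 2 + 10
```

In practice Bob would likely also multiply in a fresh encryption of 0, so that the randomness of the returned ciphertext does not depend on v. That step is omitted here to mirror the three steps listed above.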

Page 19

Select & Aggregate Complexity

Cannot afford O(#tuples) complexity for large databases.

                    Alice  Bob
# Encryptions       K      0
# Decryptions       1      0
# Multiplications   0      K
# Exponentiations   0      K
# Transmissions     K      1

Page 20

Key Idea

1. Find a suitable low-dimensional representation.

2. Use Select & Aggregate to evaluate quality metric.

Page 21

Completeness Evaluation Setup

Example: Alice wants to find the number of NULL values in Bob’s data.
Query Privacy: Bob does not discover that Alice is searching for the number of NULLs.
Data Privacy: Alice discovers nothing else about Bob’s data.
Trick: Alice generates a Hashmap, Bob generates a Counting Hashmap.

HashMap (Alice): 0, …, H(NULL): 1, …, 0

Counting HashMap (Bob): H(b1): 23, …, H(NULL): 5, …, H(bt): 2

Page 22

Completeness Evaluation Protocol

Alice generates public encryption key and private decryption key for additively homomorphic cryptosystem.
The parties evaluate Select & Aggregate on Alice’s Hashmap and Bob’s Counting Hashmap.
By construction, protocol reveals number of NULLs to Alice.

HashMap (Alice): 0, …, H(NULL): 1, …, 0

Counting HashMap (Bob): H(b1): 23, …, H(NULL): 5, …, H(bt): 2

Output to Alice: 5

[Figure: Secure Select & Aggregate Protocol]
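
The two maps can be sketched in plaintext as fixed-length vectors indexed by a shared hash. The table size M, the use of SHA-256, and the bucket function are assumptions for this sketch, since the slides do not specify how H is built. The plain dot product at the end stands in for the encrypted Select & Aggregate step.

```python
# Plaintext sketch of the HashMap / Counting HashMap trick.
# M and the SHA-256-based bucket function are hypothetical choices.
import hashlib

M = 64  # table size both parties are assumed to agree on

def bucket(value, m=M):
    """Map a value to a slot index shared by Alice and Bob."""
    digest = hashlib.sha256(repr(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

def selector_map(target, m=M):
    """Alice's vector: a single 1 at the slot of the value she looks for."""
    u = [0] * m
    u[bucket(target, m)] = 1
    return u

def counting_map(values, m=M):
    """Bob's vector: how many of his values fall into each slot."""
    v = [0] * m
    for value in values:
        v[bucket(value, m)] += 1
    return v

bob_zip = ["94043", "01000", None, None, "94043"]  # hypothetical column
u = selector_map(None)
v = counting_map(bob_zip)
null_count = sum(ui * vi for ui, vi in zip(u, v))  # at least the two NULLs
```

With a fixed-size table, another value can collide with the NULL slot and inflate the count, so the table should be large relative to the number of distinct values.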

Page 23

Validity Evaluation Setup

Example: Alice wants to know how many of Bob’s entries are in the range [C,E].
Query Privacy: Bob does not discover the range of Alice’s searches.
Data Privacy: Alice discovers nothing else about Bob’s data.
Trick: Bob generates a histogram vector, Alice generates a binary selector vector on the support of the histogram.

Histogram of attribute: 01467201

Binary vector: 00011100

[Figure: histogram with bin labels A, B, C, D, E, F, G, Z]

Page 24

Validity Evaluation Protocol

As before, Alice and Bob run the Select & Aggregate protocol on Alice’s selector vector and Bob’s histogram.
By construction, protocol reveals number of “valid” values to Alice.
Protocol works for arbitrary range queries, uniqueness, timeliness.

Binary vector: 00011100

Histogram of attribute: 01467201

Output to Alice: 15

[Figure: Secure Select & Aggregate Protocol; histogram with bin labels A, B, C, D, E, F, G, Z]
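
The numbers on this slide can be checked with a plaintext dot product. In the actual protocol this sum would be evaluated under encryption; the helper name and the use of zero-based bin indices are assumptions for this sketch.

```python
# Plaintext stand-in for the protocol: the slides evaluate this dot product
# under encryption with Select & Aggregate.
histogram = [0, 1, 4, 6, 7, 2, 0, 1]  # Bob's bin counts from the slide

def range_selector(num_bins, first, last):
    """Alice's binary vector: 1 for bin indices in [first, last], else 0."""
    return [1 if first <= i <= last else 0 for i in range(num_bins)]

selector = range_selector(len(histogram), 3, 5)  # [0, 0, 0, 1, 1, 1, 0, 0]
valid_count = sum(u * v for u, v in zip(selector, histogram))  # 6 + 7 + 2
```

The cost depends on the number of bins (8 here), not on how many entries Bob's histogram summarizes.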

Page 25

Consistency Evaluation Setup

Example: Alice wants to know how many of Bob’s entries follow correct dependencies among attributes, e.g., State – Zipcode.
Query Privacy: Bob doesn’t discover which dependencies Alice is checking.
Data Privacy: Alice discovers nothing else about Bob’s data.
Trick: Bob generates a vector of observed associations, Alice generates a vector of desired associations.

Observed dependencies: 10111001

Expected dependencies: 11011011

Page 26

Alice and Bob agree upon an ordering of attribute values.
They also agree on a vectorization (flattening) pattern.
Need to securely compute how many of Bob’s dependencies are consistent with Alice’s rules.

Desired Dependencies

        CA  MA  MN  …
94304   1   0   0   0
55414   0   0   1   0
02139   0   1   0   0
94305   1   0   0   0

Observed Dependencies

        CA  MA  MN  …
94304   0   0   1   0
55414   0   0   1   0
02139   0   1   0   0
94305   1   0   0   0
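
A plaintext sketch of the flattening step, using the two matrices from this slide. Row-major order is one possible vectorization pattern and is assumed here; the plain dot product again stands in for the encrypted evaluation.

```python
# Plaintext sketch: flatten both matrices row by row, then take the dot product.
# Rows are ZIP codes 94304, 55414, 02139, 94305; columns follow the slide.
desired = [[1, 0, 0, 0],
           [0, 0, 1, 0],
           [0, 1, 0, 0],
           [1, 0, 0, 0]]
observed = [[0, 0, 1, 0],
            [0, 0, 1, 0],
            [0, 1, 0, 0],
            [1, 0, 0, 0]]

def flatten(matrix):
    """Row-major vectorization; both parties must use the same pattern."""
    return [cell for row in matrix for cell in row]

# 94304 is observed with MN instead of CA, so only three rows agree.
consistent = sum(d * o for d, o in zip(flatten(desired), flatten(observed)))
```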

Page 27

Consistency Evaluation Protocol

Alice and Bob run the Select & Aggregate protocol on Alice’s desired rule vector and Bob’s observed rule vector.
Protocol reveals number of “valid” dependencies to Alice.
Works for dependencies among arbitrary attribute combinations.

Expected dependencies: 11011011

Observed dependencies: 10111001

Output to Alice: 4

[Figure: Secure Select & Aggregate Protocol]

Page 28

Computational Complexity

[Figure: AZ 2012 votes, with bins D, R, L, G]

# uniques = # bins = 4

# tuples = 2,306,559

Metrics                            Proposed Protocols        Using PSI-CA
Completeness                       O(# uniques)              O(# tuples)
Validity, Timeliness, Uniqueness   O(# histogram bins)       O(# tuples)
Consistency                        O((# histogram bins)^m)   O((# tuples)^m)
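
For a rough sense of scale, the sketch below compares how much data Alice would upload in Select & Aggregate when the vectors have one entry per histogram bin versus one entry per tuple. It does not model PSI-CA itself. The 2048-bit modulus is an assumption, under which each Paillier ciphertext occupies 4096 bits.

```python
# Back-of-the-envelope sketch: Alice sends K ciphertexts in Select & Aggregate.
# Assumption: 2048-bit Paillier modulus, so each ciphertext is 4096 bits.
CIPHERTEXT_BYTES = 2 * 2048 // 8

def alice_upload_bytes(k):
    """Bytes Alice transmits when the vectors have length k."""
    return k * CIPHERTEXT_BYTES

per_bin = alice_upload_bytes(4)            # one entry per histogram bin
per_tuple = alice_upload_bytes(2_306_559)  # one entry per tuple
```

Under these assumptions the per-bin upload is about 2 KB, while the per-tuple upload is on the order of a gigabyte.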

Page 29

Conclusions & Discussion

• An important subclass of privacy-preserving data mining. Precursor to collaboration among untrusting entities.

• Existing protocols, e.g., PSI-CA, have high computational overhead.

• Can efficiently evaluate many DQ metrics via homomorphic operations on reduced-dimensionality descriptions.

• Future work:
  – DQ for non-numeric attributes.
  – Efficient protocols for testing sparse dependencies.
  – Extremely difficult: Private evaluation of reliability of data.

{jfreudig,srane}@parc.com

