Page 1

Privacy-Preserving Big Data Publishing

Hessam Zakerzadeh¹, Charu C. Aggarwal², Ken Barker¹

SSDBM'15

¹ University of Calgary, Canada   ² IBM T.J. Watson, USA

Page 2

Data Publishing


• OECD* declaration on access to research data

• Data-sharing policy of the Canadian Institutes of Health Research (CIHR)

* Organization for Economic Co-operation and Development

Ref: Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques (2010).

Page 3

Data Publishing

Benefits:
• Enabling the research community to confirm published results.
• Ensuring the availability of original data for meta-analysis.
• Making data available for instruction and education.

Requirement:
• The privacy of the individuals whose data is included must be preserved.

Page 4

Tug of War

• Preserving the privacy of the individuals whose data is included (requires anonymization).
• Preserving the usefulness (utility) of the published data.

Ref: http://www.picgifs.com

Page 5

Key Question in Data Publishing

How to preserve the privacy of individuals while publishing data of high utility?

Page 6

Privacy Models

Privacy-preserving models:
• Interactive setting (e.g. differential privacy)
• Non-interactive setting (e.g. k-anonymity, l-diversity)
  • Randomization
  • Generalization

Page 7

K-Anonymity

Attribute types:
• Identifiers (e.g. name): removed before publishing
• Quasi-identifiers (e.g. age, ZIP code): can be linked with external data to re-identify individuals
• Sensitive attributes (e.g. disease)
(A k-anonymity check over the quasi-identifiers is sketched below.)

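As a quick illustration of the k-anonymity condition (not part of the slides), the following sketch checks whether a table is k-anonymous over a chosen set of quasi-identifiers; the record layout and attribute names are invented for the example.

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        # k-anonymity: every combination of quasi-identifier values must be
        # shared by at least k records.
        groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
        return all(count >= k for count in groups.values())

    records = [
        {"age": 34, "zip": "T2N", "disease": "flu"},
        {"age": 34, "zip": "T2N", "disease": "cold"},
        {"age": 51, "zip": "T3A", "disease": "flu"},
    ]
    print(is_k_anonymous(records, ["age", "zip"], 2))  # False: the (51, "T3A") group has only one record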

Page 8

K-Anonymity

(Figure: a k-anonymized table partitioned into equivalence classes (EQs); image not reproduced.)

Page 9

L-Diversity

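The example figure on this slide is not reproduced. As a reminder of the standard definition (not taken from the slides), the simplest "distinct l-diversity" variant additionally requires every equivalence class to contain at least l distinct sensitive values. A minimal sketch, reusing the hypothetical record layout from the k-anonymity example:

    from collections import defaultdict

    def is_distinct_l_diverse(records, quasi_identifiers, sensitive_attr, l):
        # Collect the distinct sensitive values observed in each equivalence class.
        sensitive_per_eq = defaultdict(set)
        for r in records:
            eq_key = tuple(r[a] for a in quasi_identifiers)
            sensitive_per_eq[eq_key].add(r[sensitive_attr])
        # Every equivalence class must expose at least l distinct sensitive values.
        return all(len(values) >= l for values in sensitive_per_eq.values())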

Page 10

Assumptions in state-of-the-art anonymization

Implicit assumptions in current anonymization techniques:
• Small- to moderate-size data
• Batch, one-time process

Focus on:
• Quality of the published data

Page 11

Still valid?

New assumptions:
• Small- to moderate-size data → big data (terabytes or petabytes)
  • e.g. health data or web search logs
• Batch, one-time process → repeated application

Focus on:
• Quality of the published data + scalability

Page 12

Naïve solution?

Divide & conquer:
• Inspired by streaming-data anonymization techniques
• Divide the big data into small parts (fragments)
• Anonymize each part (fragment) individually and in isolation (a sketch follows)
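A minimal sketch of this naive baseline (my illustration, not code from the paper); anonymize_fragment stands in for any single-machine anonymization routine such as Mondrian:

    def naive_divide_and_conquer(records, fragment_size, anonymize_fragment, k):
        # Cut the big data set into fixed-size fragments and anonymize each one
        # in isolation; no fragment ever sees the rest of the data.
        published = []
        for start in range(0, len(records), fragment_size):
            fragment = records[start:start + fragment_size]
            published.extend(anonymize_fragment(fragment, k))
        return published

Because each fragment is generalized on its own, the large-crowd effect of the full data set is lost, which is exactly the quality problem raised on the next slide.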


Page 13

Naïve solution?

Divide & conquer:
• Inspired by streaming-data anonymization techniques
• Divide the big data into small parts (fragments)
• Anonymize each part (fragment) individually and in isolation

What do we lose?
• Rule of thumb: the more data, the less generalization/perturbation needed (the large-crowd effect).
• Anonymizing fragments in isolation therefore costs quality.

Page 14

Main question to answer in Big Data Privacy

Is it possible to take advantage of the entire data set in the anonymization process without losing scalability?

Page 15

MapReduce-Based Anonymization

Idea: distribute the computationally expensive anonymization process across processing nodes in such a way that the distribution does not degrade the quality (utility) of the anonymized data.

Page 16

MapReduce Paradigm

(Figure: data flow of a MapReduce job; image not reproduced.)

Page 17

Mondrian-like MapReduce Algorithm

Traditional Mondrian:
• Pick a dimension (e.g. the dimension with the widest range), called the cut dimension.
• Pick a point along the cut dimension, called the cut point.
• Split the data into two equivalence classes along the cut dimension at the cut point, provided the privacy condition is not violated.
• Repeat until no further split is possible.
(A sketch of the recursion follows.)
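A compact sketch of the Traditional Mondrian recursion described above, assuming numeric quasi-identifiers and the common widest-range/median heuristics; an illustration only, not the authors' implementation:

    def mondrian(records, k):
        # records: list of tuples of numeric quasi-identifier values
        dims = range(len(records[0]))
        spans = {d: max(r[d] for r in records) - min(r[d] for r in records) for d in dims}
        # Try cut dimensions from widest range to narrowest.
        for d in sorted(dims, key=lambda d: -spans[d]):
            values = sorted(r[d] for r in records)
            cut_point = values[len(values) // 2]            # cut point: the median value
            left = [r for r in records if r[d] < cut_point]
            right = [r for r in records if r[d] >= cut_point]
            if len(left) >= k and len(right) >= k:          # privacy condition: both halves keep >= k records
                return mondrian(left, k) + mondrian(right, k)
        return [records]                                    # no allowable cut: this equivalence class is final

mondrian(data, k) returns the final equivalence classes; each class is then generalized to the value ranges it covers before publishing.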


Pages 18-20

Mondrian-like MapReduce Algorithm: Traditional Mondrian

(Figures: step-by-step illustration of Mondrian cuts on an example data set; images not reproduced.)

Page 21

Mondrian-like MapReduce Algorithm

Preliminaries:
• Each equivalence class is divided into at most q equivalence classes.
• A global file is shared among all nodes. It contains the equivalence classes formed so far, organized in a tree structure (the equivalence-class tree).
• Initially, the global file contains only the most general equivalence class.
(A sketch of a possible node layout follows.)
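A possible in-memory representation of one entry of the global file (the slides do not give the exact file layout, so the field names here are assumptions):

    from dataclasses import dataclass, field

    @dataclass
    class EqClassNode:
        ranges: list                                  # one [low, high] interval per quasi-identifier dimension
        splittable: bool = True                       # the "1"/"0" flag used by Improvement 1 (later slide)
        children: list = field(default_factory=list)  # at most q child equivalence classes

    # Initially the equivalence-class tree holds only the most general class.
    root = EqClassNode(ranges=[[1, 100], [1, 100], [1, 100]])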


Page 22

Mapper


Page 23

Mapper: Example

Iteration 1: only one equivalence class exists (call it eq1).
eq1 = [1:100],[1:100],[1:100]

Data records:
12, 33, 5
56, 33, 11
12, 99, 5

Mapper's output (key, value):
< eq1, <(12,1),(33,1),(5,1)> >
< eq1, <(56,1),(33,1),(11,1)> >
< eq1, <(12,1),(99,1),(5,1)> >
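A sketch of a mapper producing exactly this output; find_leaf_eq (looking up a record's current equivalence class in the global file's tree) is a hypothetical helper and is not shown:

    def mapper(record, eq_tree):
        # Key: the equivalence class the record currently falls into (eq1 above).
        # Value: the record encoded as (value, count=1) pairs, one per dimension.
        eq = find_leaf_eq(eq_tree, record)   # hypothetical lookup helper
        yield eq, [(v, 1) for v in record]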


Page 24

Combiner


Page 25

Combiner example

Mapper's output (combiner's input):
< eq1, <(12,1),(33,1),(5,1)> >
< eq1, <(56,1),(33,1),(11,1)> >
< eq1, <(12,1),(99,1),(5,1)> >

Combiner's output (one value-count histogram per dimension):
< eq1, <{12:2, 56:1}, {33:2, 99:1}, {5:2, 11:1}> >
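A sketch of a combiner producing this output: it locally merges all records of one equivalence class into per-dimension value-count histograms (an illustration of the example, not the authors' code):

    from collections import Counter

    def combiner(eq, mapped_values):
        mapped_values = list(mapped_values)
        histograms = [Counter() for _ in mapped_values[0]]   # one histogram per dimension
        for value in mapped_values:
            for d, (v, count) in enumerate(value):
                histograms[d][v] += count
        # For the example: [{12: 2, 56: 1}, {33: 2, 99: 1}, {5: 2, 11: 1}]
        yield eq, histograms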


Page 26

Reducer


Page 27

Reducer’s output

Combiner's output (reducer's input):
< eq1, <{12:2, 56:1}, {33:2, 99:1}, {5:2, 11:1}> >

Reducer's output (in either case, added to the global file):
• If eq1 is splittable:
  <[12:56],[33:45],[5:11], "1">
  <[12:56],[45:99],[5:11], "1">
• If eq1 is un-splittable:
  <[12:56],[33:99],[5:11], "0">
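A sketch of the reducer's decision for this example. merge_histograms, choose_cut and split_ranges are assumed helpers (sum the per-dimension histograms, pick a cut dimension and cut point, and derive the two child range vectors); the split test mirrors Mondrian's privacy condition:

    def reducer(eq, histogram_lists, k):
        merged = merge_histograms(histogram_lists)    # assumed helper: add up per-dimension histograms
        cut_dim, cut_point = choose_cut(merged)       # assumed helper: e.g. widest range, median point
        left = sum(c for v, c in merged[cut_dim].items() if v < cut_point)
        right = sum(c for v, c in merged[cut_dim].items() if v >= cut_point)
        if left >= k and right >= k:                  # privacy condition holds for both halves
            yield split_ranges(eq, cut_dim, cut_point, "left"), "1"    # two splittable children,
            yield split_ranges(eq, cut_dim, cut_point, "right"), "1"   # both appended to the global file
        else:
            yield eq, "0"                             # un-splittable: eq becomes a leaf of the tree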


Page 28

Improvement 1 (Data transfer improvement)

• The mapper outputs only records belonging to splittable equivalence classes.
• Requirement: the global file includes a flag for each equivalence class indicating whether it is splittable ("1") or not ("0").
• If a data record belongs to an un-splittable EQ, do not output it.
• Example global file:
  [1:100],[1:100],[1:100], "1"
  [12:56],[33:45],[5:11], "1"
  [12:56],[45:99],[5:11], "1"
  [12:30],[33:45],[5:11], "1"
  [30:56],[33:45],[5:11], "0"
(A sketch of the filtered mapper follows.)
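A sketch of the mapper with this filter applied; as before, find_leaf_eq is a hypothetical lookup into the global file, and splittable is the "1"/"0" flag described above:

    def mapper_with_filter(record, eq_tree):
        eq = find_leaf_eq(eq_tree, record)   # hypothetical lookup helper
        if not eq.splittable:                # EQ already flagged "0" in the global file
            return                           # emit nothing: the record never crosses the network again
        yield eq, [(v, 1) for v in record]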


Page 29

Improvement 2 (Memory improvement)

• What if the global file doesn't fit into the memory of the mapper nodes?
• Break the global file down into multiple small files. How?
  • Split the big data into small input files.
  • Create one global file per small input file.
  • Each global file contains only a subset of all equivalence classes.
  • Each small global file is referred to as a global subset equivalence class (gsec) file.
(One possible construction is sketched below.)
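One possible way to build the gsec files, shown purely as an illustration (the slides do not specify the construction): keep, for each input fragment, only the equivalence classes that can contain that fragment's records.

    def covers(eq_ranges, record):
        # True if the record lies inside every [low, high] interval of the class.
        return all(low <= v <= high for v, (low, high) in zip(record, eq_ranges))

    def build_gsec_files(fragments, all_eq_classes):
        # fragments: {fragment_id: list of records}; all_eq_classes: range vectors of the EQ-tree leaves
        gsec_files = {}
        for fragment_id, records in fragments.items():
            gsec_files[fragment_id] = [eq for eq in all_eq_classes
                                       if any(covers(eq, r) for r in records)]
        return gsec_files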


Page 30

Improved Algorithm

Mapper's output:
< eq1, <f1,(12,1),(33,1),(5,1)> >
< eq1, <f1,(56,1),(33,1),(11,1)> >
< eq1, <f1,(12,1),(99,1),(5,1)> >

Combiner's output:
< eq1, <f1, {12:2, 56:1}, {33:2, 99:1}, {5:2, 11:1}> >

Reducer's output:
• If eq1 is splittable:
  <[12:56],[33:45],[5:11], "1", f1>
  <[12:56],[45:99],[5:11], "1", f1>
• If eq1 is un-splittable:
  <[12:56],[33:99],[5:11], "0", f1>


Page 31

(Figure: worked example of the algorithm with k = 2 and q = 2; image not reproduced.)

Page 32

Further Analysis

• Time complexity (per round)
  • Mapper
  • Combiner
  • Reducer

• Data transfer (per round)
  • Mapper to combiner (local)
  • Combiner to reducer (across the network)

Page 33

Experiments

The experiments answer the following questions:
• How well does the algorithm scale?
• How much information is lost in the anonymization process?
• How much data is transferred between mappers/combiners and between combiners/reducers in each iteration?

Page 34

Experiments

Data sets:
• Poker data set: 1M records, 11 dimensions each
• Synthetic data set: 10M records, 15 dimensions each, 1.4 GB
• Synthetic data set: 100M records, 15 dimensions each, 14.5 GB

Information-loss baseline:
• Each data set is split into 8 fragments; each fragment is anonymized individually.

State of the art:
• MapReduce Top-Down Specialization (MRTDS) [Zhang et al. '14]

Page 35

Experiments Settings

• Hadoop cluster on AceNet
• 32 nodes, each with 16 cores and 64 GB RAM
• Running Red Hat Enterprise Linux 4.8

Page 36

Information Loss vs. K


Page 37

Information Loss vs. L

Page 38

Running time vs. K (L)


Page 39

Data Transfer vs. Iteration # (between mappers and combiners)


Page 40

Data Transfer vs. Iteration # (between combiners and reducers)


Page 41

Future work

• Extension to other data types (graph data anonymization, set-valued data anonymization, etc.)

• Extension to other privacy models

Page 42