
UMEÅ UNIVERSITY

MASTER’S THESIS

Research on Differential Privacy and Case Study

Author: Bihil SELESHI, Samrawit ASSEFFA

Supervisor: Lili JIANG

A thesis submitted in fulfillment of the requirements for the degree of Master's Thesis in Computing Science

in the

Department of Computing Science


Abstract

Throughout the ages, human beings have preferred to keep most things secret and to brand this overall state with the title of privacy. Like most significant terms, privacy tends to create controversy regarding the extent of its flexible boundaries, since various technological advancements are slowly leaching away the power people have over their own information. Even as cell phone brands release new upgrades, the ways in which information is communicated have drastically increased, in turn facilitating the techniques by which people's privacy can be tampered with. Therefore, questioning the methodology by which people can maintain their privacy in the twenty-first century is a valid concern undoubtedly shared by much of the world's population.

Admittedly, data is everywhere. The world has become an explosion of information, and this should not come as a surprise, especially in a time when data storage is cheap and accessible. Various institutions use this data to conduct research, track the behavior of users, recommend products or maintain national security. As a result, corporations' need for information is growing by the minute. Companies need to know as much as possible about their customers. Nonetheless, how can this be achieved without compromising the privacy of individuals? How can companies provide great features and maintain great privacy?

These questions can be answered by a current, much-anticipated research topic in the field of data privacy: differential privacy. Differential privacy is a branch of statistics that aims to attain the widest range of data utility while achieving a robust, significant and mathematically rigorous definition of privacy.

Thus, the aim of this thesis is to describe and analyze the concept of differential privacy and the properties that lead to the betterment of data privacy. Hence, we study the basic state-of-the-art methods, the model and the challenges of differential privacy.

After analyzing the state-of-the-art differential privacy methods, this thesis focuses on an actual case study in which a dataset is experimented on with one of the differential privacy methods. We design a basic framework that tries to achieve a differential privacy guarantee and evaluate the results in terms of privacy effectiveness.


Acknowledgements

We would like to express our deep, sincere gratitude to our advisor Lili for her relentless patience, motivation and unending support, and most importantly for the many insightful conversations we had during the development of the ideas in this thesis.


Contents

Abstract
Acknowledgements
1 Background
2 Preliminary Knowledge on Privacy Preservation
  2.1 Privacy Information
    2.1.1 Personally identifiable information
    2.1.2 Privacy Preservation Models
      2.1.2.1 Anonymization
      2.1.2.2 K-Anonymity
      2.1.2.3 Homogeneity Attack
      2.1.2.4 Background Knowledge Attack
  2.2 Privacy Failures
    2.2.1 Insurance Commission (GIC)
    2.2.2 Search Log
    2.2.3 Netflix Prize
  2.3 Differential Privacy
3 Investigation on Differential Privacy
  3.1 The Need for Differential Privacy
  3.2 Definition of Differential Privacy
    3.2.1 Queries and Database
      3.2.1.1 Counting Queries
      3.2.1.2 Linear Queries
    3.2.2 Definitions
  3.3 Properties of Differential Privacy
    3.3.1 The Sensitivity
      3.3.1.1 The Global Sensitivity
      3.3.1.2 The Local Sensitivity
    3.3.2 The Privacy Budget
      3.3.2.1 Sequential composability
      3.3.2.2 Parallel composability
  3.4 Mechanisms of Differential Privacy
    3.4.1 Laplace Mechanism
    3.4.2 The Exponential Mechanism
    3.4.3 The Median Mechanism
  3.5 Data Release on Differential Privacy
    3.5.1 Interactive Data Release
    3.5.2 Non-Interactive Data Release
  3.6 Challenges in Differential Privacy
    3.6.1 Calculating Global Sensitivity
    3.6.2 Setting of privacy parameter ε
    3.6.3 Uncertainty of outcome
4 Experimental Framework Setup
  4.1 Hypothesis
  4.2 Experimental Framework Overview
    4.2.1 User Interface
    4.2.2 Privacy Preservation Mechanism
    4.2.3 Datastore
    4.2.4 Programming Language
  4.3 Algorithm Description
    4.3.1 Algorithm for Laplace Mechanism
5 Experimental Study on Differential Privacy
  5.1 Adult Dataset
    5.1.1 Dataset Description
    5.1.2 Experimental Design
    5.1.3 Results and Discussion
  5.2 Student Alcohol Consumption Dataset
    5.2.1 Dataset Description
    5.2.2 Experimental Design
    5.2.3 Results and Discussion
6 Related Work
  6.1 Early works on Statistical Disclosure Control Methods
    6.1.1 Conceptual Technique
    6.1.2 Query Restriction Technique
    6.1.3 Perturbation (Input Perturbation)
    6.1.4 Output Perturbation
  6.2 Frameworks on Differential Privacy based Data Analysis
    6.2.1 Privacy Integrated Queries (PINQ)
    6.2.2 Airavat
    6.2.3 Fuzz
    6.2.4 GUPT
  6.3 Work on Differential Privacy based Query Analysis
    6.3.1 Interactive (Online) Setting
    6.3.2 Non-Interactive (Offline) Setting
    6.3.3 Histogram and Contingency Table
7 Conclusion and Future Work
Bibliography


List of Figures

3.1 Laplace distribution with various parameters
3.2 Interactive and Non-Interactive data release
4.1 Architecture of the system
5.1 Graph for the Number of years in Education by Income and Race
5.2 Students' Alcohol consumption and Health status
5.3 Alcohol consumption and Health status with noise
5.4 Alcohol consumption and Absence days
5.5 Alcohol consumption and Absence days with noise
5.6 Alcohol consumption and Final grade
5.7 Alcohol consumption and Final grade with noise
5.8 Alcohol consumption and Final grade with high noise
5.9 Result of simple linear regression of health status on alcohol consumption
5.10 Result of simple linear regression of health status with added noise on alcohol consumption
5.11 Result of simple linear regression of school attendance on alcohol consumption
5.12 Result of simple linear regression of school attendance with added noise on alcohol consumption
5.13 Result of simple linear regression of school performance on alcohol consumption
5.14 Result of simple linear regression of school performance with added noise on alcohol consumption (ε = 0.1)
5.15 Result of simple linear regression of school performance with very high noise on alcohol consumption (ε = 0.0001)


List of Tables

2.1 Anonymity table of medical data
5.1 Adult dataset attributes
5.2 Mean query result for adult dataset
5.3 Result for Laplace mechanism for count query
5.4 Result for Laplace mechanism for mean query
5.5 The pre-processed student-related variables


Chapter 1

Background

In the past decades, personal data records have been increasingly collected by governments, health centers, social networks, organizations, companies, and individuals for data analysis and other purposes. This collected data has created opportunities for researchers, companies, organizations and decision makers. For example, medical records can be used to track the spread of disease, prevent epidemics, discover hidden links between illnesses, and support the early detection and control of disease [53]. On the other hand, the collected data might be sold, exchanged or shared. For example, organizations can deliver their customers' information to third parties such as advertising agents.

Sharing and exchanging this large body of data has been a critical tool for achieving the specific needs of researchers and organizations. For example, e-commerce is an information exchange created from plentiful activities including searching, browsing and Internet shopping, which can improve productivity. In the medical field, governments build medical record frameworks for the exchange of medical information. While releasing and sharing data creates new opportunities and helps researchers and individuals with better data analysis, it is crucial to protect the privacy of each person's information in the dataset. If anybody can be explicitly distinguished from the released data, their private data will potentially be compromised. Therefore, before releasing the dataset, the data curator must ensure privacy preservation in such a way that the identity of the individuals contained in the data cannot be perceived.

In a setting where a trusted data curator or custodian owns a database consisting of rows of data about specific individuals, a privacy breach occurs when an adversary infers this particular information, preying on background information even when the curator has published only an anonymized version of the data. This problem is commonly known as privacy-preserving data publishing.


Many anonymization techniques have been proposed to release datasets securely [47, 19, 1]. However, the dependence of the currently available privacy models on the background awareness of the adversary makes it hard to protect against unpredicted auxiliary information. Because of their vulnerability to background knowledge attacks, these techniques were not successful in preserving privacy [3, 58, 38, 34, 1].

We will try to give a solution to the above-described problem with the approach of differential privacy [12], which is growing rapidly in a wide range of settings and is being used by major technology companies, from Apple's iOS 10 [31] to Google's reporting project in Chrome. Differential privacy tries to guarantee the protection of sensitive information about an individual irrespective of the background knowledge of the attacker. Thus, we will present an analysis of differential privacy and study its applicability on a chosen sample dataset. The main contributions of this thesis include:

• An extensive investigation on privacy preservation and differential privacy, including state-of-the-art privacy protection methodologies and differential privacy frameworks, as well as their pros and cons.

• An experimental framework, which consists of back-end database deployment and algorithm implementation, to put differential privacy into practice and evaluation.

• An experimental study on differential privacy to evaluate counting queries with different differential privacy mechanisms on various datasets.

This thesis is organized as follows:

• In Chapter 2, we describe the preliminary knowledge on privacy preservation; specifically, we present the basic understanding of the syntactic privacy models used in privacy-preserving data publishing, such as k-anonymity, l-diversity and t-closeness, and in the last section we introduce differential privacy.

• In Chapter 3, we present the foundation of differential privacy that we depend on throughout the following chapters, and continue by defining differential privacy while showing known mechanisms and techniques for achieving it. We then finalize the chapter by describing the significant challenges and the currently available frameworks of differential privacy.


• In Chapter 4, we present the framework that we have applied to test out differential privacy and describe one of the algorithms that we have used in the experiment.

• In Chapter 5, we continue the investigation and present the process of achieving differential privacy by presenting the results and discussing the implications of differential privacy.

• In Chapter 6, we present instances of previous works which aimed to secure databases against information leakage and describe the currently existing frameworks for differential privacy.

The roles and responsibilities throughout the course of this thesis are as follows:

• Samrawit was responsible for the research on the preliminary study, the related work and the experimental study on differential privacy, and has covered Chapter 1, Chapter 2, Section 5.2 and Chapter 6.

• Bihil was responsible for the research on the properties, analysis, implementation and experimental part of differential privacy, and has covered the Abstract, Chapter 3, Chapter 4, Section 5.1 and Chapter 7.


Chapter 2

Preliminary Knowledge on Privacy Preservation and Failure

2.1 Privacy Information

An early privacy model by Dalenius in 1977 [10] articulated a desideratum that depicts the privacy goal of databases: anything that can be learned from the database about a particular individual should be determinable without access to the database. The idea behind this notion is making sure that the difference between the adversary's beliefs about a particular individual before and after seeing the data is small. However, this type of privacy cannot be achieved: Dwork demonstrated that such a privacy assurance is inconceivable because of the existence of background knowledge. Thus, Dwork came up with a new perspective on privacy preservation: the risk to one's privacy, or in general any risk, such as the risk of being denied automobile insurance, should not substantially increase as a result of participating in a database [10, 33].

Formerly, many works have been done to protect privacy. In the next sections we discuss the syntactic privacy models used in privacy-preserving data publishing, such as k-anonymity, l-diversity and t-closeness, and then introduce differential privacy.

2.1.1 Personally identifiable information

The data collected from different sources is stored in a database. Usually, the privacy-preserving data publisher categorizes the database attributes into four primary fields: Explicit Identifiers, Quasi-Identifiers, Sensitive Attributes, and Non-Sensitive Attributes.

Explicit Identifiers are a set of attributes that contain information that can be used to uniquely identify individuals, such as name or social security number. A Quasi-Identifier (QID) is a set of attributes, such as zip code, gender and birth date, whose combination could potentially distinguish individuals, while Sensitive Attributes contain sensitive personal information such as medical history or salary. Non-Sensitive Attributes contain the attributes that are not listed in the other fields [51]. Among these categories, personally identifiable information (PII) is information that can be used on its own or together with other data to identify individuals, for example a social security number, name or phone number [7]. Before releasing the data, the data curator has to ensure that individuals' personally identifiable information will not be disclosed while the data remains valuable. Many privacy-preserving methods have been proposed previously, and we discuss some of them below.

2.1.2 Privacy Preservation Models

2.1.2.1 Anonymization

Data anonymization is the process of removing personally identifiable information from a data set to protect individuals' privacy and to make it possible for data users and owners to share data securely for data analysis, decision-making, research and other purposes, so that the individuals whose information is in the data set stay anonymous. The curator (the person who collected the data) modifies the data by removing specific identifiers such as name, social security number, address and phone number. Even if the specific identifiers are removed, the availability of individuals' background information (e.g., in the public voter list) makes it easier for an adversary to re-identify individuals by linking the released data, making it very hard to publish data without disclosing privacy [3]. Once the data is released to a third party, it is hard for the owners to control the way the data is manipulated. Latanya Sweeney, then an MIT graduate student in computer science, showed that individuals' information in anonymously published data could be re-identified by linking the released data to publicly available data, re-identifying Governor William Weld's medical information [3, 54].

2.1.2.2 K-Anonymity

To deal with the shortcomings of simple data anonymization, researchers have proposed various methods to preserve privacy. One of the most popular is k-anonymity. To counter record linkage using quasi-identifiers, Samarati and Sweeney [47] proposed the idea of k-anonymity, whose endeavor is to release data with a scientific guarantee that a particular individual's data cannot be uniquely distinguished while the data can still be used in a sensible manner. A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 other individuals whose information also appears in the release. K-anonymity is thus a level of data protection against inference by linking: it prevents linking the released data to other information sources (background information). Nevertheless, k-anonymity does not guarantee privacy. Machanavajjhala et al. [34] used two attacks to show how k-anonymity fails to guarantee privacy. Let us discuss the two attacks based on Table 2.1.

              Non-Sensitive                                        Sensitive
 Name   Age              Zip code   Religion   Nationality        Medical History
 *      20 ≤ Age ≤ 30    110**      *          USA                Heart Disease
 *      20 ≤ Age ≤ 30    110**      *          USA                Heart Disease
 *      20 ≤ Age ≤ 30    110**      *          USA                Heart Disease
 *      40 ≤ Age ≤ 45    120**      *          Norway             Pneumonia
 *      40 ≤ Age ≤ 45    120**      *          Norway             Pneumonia
 *      40 ≤ Age ≤ 45    120**      *          Norway             Pneumonia
 *      Age > 46         130**      *          Mexico             Vitamin D deficiency
 *      Age > 46         130**      *          Mexico             Vitamin D deficiency
 *      Age > 46         130**      *          Mexico             Alzheimer's

TABLE 2.1: Anonymity table of medical data

2.1.2.3 Homogeneity Attack

The homogeneity attack shows that when there is little diversity in the sensitive attribute, the adversary can identify the value of the sensitive attribute for a whole group of k records. For example, consider a politician who intends to be elected to a post in the governance of a state and wants to use the medical history of his opponent to demonstrate to the populace that the opponent cannot, or is not ready to, handle the obligations of an agent of the state because of his medical problems. He searches for his opponent's medical information using the released 3-anonymous table from the hospital. Even though the data is a 3-anonymized table, he has some background information about his opponent: he knows that the patient is a 25-year-old American who lives in the postal area 11003. Because there is little diversity in the sensitive attribute of the matching group, he can conclude from this data that his rival has heart disease.
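To make the attack concrete, the following minimal sketch (a hypothetical illustration in Python, assuming pandas is available; it is not code from the thesis experiments) re-creates the generalized records of Table 2.1, checks the k-anonymity level, and flags equivalence classes whose sensitive attribute has no diversity:

```python
# Minimal sketch: detect equivalence classes vulnerable to the homogeneity attack.
# Assumes pandas; the records mirror the generalized rows of Table 2.1.
import pandas as pd

records = pd.DataFrame({
    "age_group":   ["20-30"] * 3 + ["40-45"] * 3 + [">46"] * 3,
    "zip_prefix":  ["110**"] * 3 + ["120**"] * 3 + ["130**"] * 3,
    "nationality": ["USA"] * 3 + ["Norway"] * 3 + ["Mexico"] * 3,
    "diagnosis":   ["Heart Disease"] * 3 + ["Pneumonia"] * 3
                   + ["Vitamin D deficiency"] * 2 + ["Alzheimer's"],
})

quasi_identifiers = ["age_group", "zip_prefix", "nationality"]
groups = records.groupby(quasi_identifiers)["diagnosis"]

# k-anonymity level: size of the smallest equivalence class.
k = groups.size().min()
print(f"table is {k}-anonymous")

# Homogeneity attack: classes where every record shares one sensitive value,
# so matching the quasi-identifiers reveals the diagnosis exactly.
for qid, diagnoses in groups:
    if diagnoses.nunique() == 1:
        print("homogeneous class", qid, "->", diagnoses.iloc[0])
```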


2.1.2.4 Background Knowledge Attack

In this attack, the adversary uses background knowledge to make the attack successful; we show that k-anonymity does not guarantee privacy against background knowledge attacks. For instance, a woman whose colleague's father is sick wants to learn the nature of the sickness. She knows that her colleague's father is old and that he is from Mexico, so she can conclude that he is suffering from either vitamin D deficiency or Alzheimer's. However, it is known that Mexicans, for the most part, are rarely affected by vitamin D deficiency, while Alzheimer's is a common neurological ailment in old people. Therefore, it is easy for her to conclude that her colleague's father has Alzheimer's. Using background knowledge, she identifies what disease her colleague's father has. From the above examples, it can be seen that k-anonymity does not guarantee privacy preservation.

2.2 Privacy Failures

The huge amount of collected data and the multiple ways of sharing it have led to a rapidly increasing accumulation of personal data, thereby leading to privacy issues such as exposure of sensitive data and mass harvesting of personal information by third parties [51]. Studies demonstrate that the majority of the US population can be uniquely distinguished by joining zip code, gender and date of birth, showing clearly that released information containing these attributes cannot be regarded as anonymous data [50, 18, 52]. If individuals can be distinguished in the released data, their private data will potentially be disclosed. The collected data sets contain private or sensitive information about individuals, and releasing these data sets can lead to the disclosure of personally sensitive information. Protecting the database from background knowledge attacks is very challenging: the adversary's background information, which the data curator cannot foresee, makes privacy preservation models vulnerable. As examined in the preceding section, the existing privacy standards do not protect privacy. Below we discuss instances of privacy protection failures which led to the identification of individuals or users.

2.2.1 Insurance Commission (GIC)

Latanya Sweeney, then an MIT graduate student in computer science, showed that individuals' information in anonymously released data could be re-identified by linking the released data to publicly available data (e.g., a voter registration list) and by using some background knowledge about the individual or an event. She demonstrated this by re-identifying the medical data of William Weld, the then Governor of Massachusetts, using hospital data which had been released to researchers by the Massachusetts Group Insurance Commission (GIC), since it was assumed to be anonymous. It was publicly known that the governor had collapsed on May 18, 1996, during a graduation ceremony at which he was about to receive an honorary doctorate. Even though GIC had modified the released data by deleting specific identifiers, Sweeney used her background knowledge from the media regarding the name of the hospital where the governor was admitted and regarding his residential address to identify the relevant hospital records in the released GIC data. She also bought the voter list for the governor's residential county, which contains many attributes including individuals' name, address, zip code, birthdate and gender, so that she could match the quasi-identifiers such as zip code, gender and date of birth with the GIC data to re-identify the governor's medical data. Her work proved that even though data publishers release a data set after anonymizing it by removing all personal identifiers, the remaining data can be used to identify individuals by linking it to other data, such as publicly available data sets [3, 54].

2.2.2 Search Log

In August 2006, America Online (AOL) research publicly released 20 million detailed search logs of a large number of AOL users, collected over a three-month period and intended for research purposes. AOL did not remove the content of the data, which raised privacy concerns; they tried to anonymize it by replacing identifiers such as the AOL username and IP address with unique identification numbers, so that researchers could relate searches to individual users. However, based on all the searches made by a single user, it was possible to quickly identify an individual's name and social security number, and even more sensitive information was disclosed. Within a couple of days there was an article in The New York Times revealing the identity of one of the searchers, and subsequently many other people were identified. AOL acknowledged the mistake, removed the released data and apologized for the release; however, they could not control the information leakage because the data had already been redistributed by others [58, 5].


2.2.3 Netflix Prize

In October 2006, Netflix, the world's largest online DVD rental service, publicly released a data set containing 100 million anonymized movie ratings created by 500,000 Netflix subscribers. The purpose of the release was to improve their movie recommendation system. At the time, they secured customers' privacy by removing all personal information and releasing data that contained only an anonymized user ID, the ratings, and the dates on which the subscriber rated the movies. Narayanan and Shmatikov from the University of Texas at Austin [38] demonstrated that an adversary who knows a little bit about an individual's subscription could identify the subscriber's record in the data set. Using IMDb (the Internet Movie Database) [37, 13] as the source of background knowledge, they identified subscribers' records and revealed customers' personal sensitive information [38, 37].

Protecting the database from these attacks is very challenging, especially when the adversary has sufficient background information at his or her disposal that is not even anticipated by the data curator. Hence, there is a need for a better privacy-preserving technique to prevent the leakage of individuals' private information and to give a much better guarantee of preserving privacy against the worst-case scenarios, where the adversary has almost all the background information. Thus, in this thesis, we choose differential privacy as the privacy model.

2.3 Differential Privacy

Differential privacy is a powerful standard for data privacy proposed by Dwork. It is based on the idea that the outcome of a statistical analysis is essentially equally likely independent of whether any individual joins or refrains from joining the database, i.e., one learns approximately the same thing either way [35]. It guarantees that the probability of any harm or benefit to any set of participants is basically the same regardless of whether or not any individual is in the dataset. To accomplish this, differential privacy adds random noise to the output of the query, so that the difference in the output caused by the presence or absence of a single person is covered up. Differential privacy has been studied theoretically, and it has been proved that it gives a rigorous guarantee of privacy even when the adversary has the worst-case background knowledge [11]. It neutralizes all linkage attacks and statistical attacks because it relies on a property of the data access mechanism that does not depend on the presence or absence of background knowledge.

Thus, because of its strong privacy guarantee against worst-case background knowledge attacks, differential privacy has been considered a promising privacy-preserving technique. Therefore, throughout this thesis, we describe its properties and analyze a selected case study.


Chapter 3

Investigation on Differential Privacy

In this chapter, we present the foundation of differential privacy that we depend on throughout this thesis. First, we give a summary of the need for data privacy and define the concepts of data domain and query that we are going to use. Then, we formally define differential privacy and present the known underlying mechanisms and techniques for achieving it. We finalize the chapter by describing the significant challenges and the currently available frameworks of differential privacy.

3.1 The Need for Differential Privacy

In today's information realm, extensive sensitive personal information is held by the services we use daily, such as search engines, mobile services, online social activity and so on. This vast amount of statistical sensitive personal information can be of enormous social value, for instance in enhancing economic utility, comprehending the spread of disease, or allocating resources. As described in the last chapter, information is obtained in several ways, ranging from opportunistic data collection that simply promises "privacy" to legally compelled collection; both should be treated equally, since otherwise there is no logical reason to engage in the actions that generate the data in the first place.

The loss of data privacy is imminent, since guaranteeing data privacy incorporates controlling access to information, controlling the flow of information, and controlling the purpose for which information is used. As seen in the last chapter, the several existing methods that try to preserve privacy are insufficient to give the desired data privacy. Thus, the need for differential privacy arises, with the hope of better and more robust data privacy.


3.2 Definition of Differential Privacy

3.2.1 Queries and Database

We take a finite data universe U, where d is a collection of records, commonly known as a dataset or database.

Definition 3.2.1. Two datasets D, D′ are adjacent or neighboring if their distance in the ℓ1 norm satisfies

$$\|D - D'\|_1 = \sum_{i=1}^{|U|} |D_i - D'_i| \le 1$$

3.2.1.1 Counting Queries

Counting queries are an essential class of queries that give some of the basic statistics on a database. They have the form "What fraction of the records in the database satisfy the property q?". Formally, a counting query is determined by a boolean predicate q : U → {0, 1}, and its evaluation on a database D = (x_1, ..., x_n) is defined as:

$$q(D) = \frac{1}{n} \sum_{i=1}^{n} q(x_i)$$
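As an illustration, here is a minimal sketch in plain Python (the records and the predicate are hypothetical, not taken from the thesis datasets) that evaluates a counting query as the fraction of records satisfying q:

```python
# Minimal sketch: evaluate a counting query q(D) = (1/n) * sum_i q(x_i).
# The records and the predicate are illustrative only.

def counting_query(database, predicate):
    """Fraction of records in the database that satisfy the boolean predicate."""
    n = len(database)
    return sum(1 for x in database if predicate(x)) / n

ages = [23, 35, 41, 29, 52, 61, 19]        # toy database of ages
q = lambda age: age >= 40                  # predicate: "is the person 40 or older?"
print(counting_query(ages, q))             # 3/7 ≈ 0.4286
```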

3.2.1.2 Linear Queries

Definition 3.2.2. A linear query is a vector Q ∈ [0, 1]^{|X|}, evaluated as Q(D) = (1/n)⟨Q, D⟩. Similarly, we can view Q as a function Q : X → [0, 1] and evaluate:

$$Q(D) = \frac{\sum_{x_i \in D} Q(x_i)}{n} = \frac{\sum_{i=1}^{|X|} Q(x_i) \cdot D[i]}{n}$$

3.2.2 Definitions

The definition of privacy admits broad and differing interpretations, and it can be difficult to nail down precisely. Some have tried to define it from a philosophical point of view: Warren and Brandeis [57] saw it as the "right to be let alone", while Dalenius [9] in 1977 tried to describe it as "anything that can be learned about a respondent from the statistical database should be learnable without access to the database." Other definitions are more technical and have considerably less succinct quotes portraying their intent. Every definition endeavors to represent and motivate a particular kind of privacy; while this is of great academic interest, it does not usually translate into immediately helpful guidance for the concerned data subject.

It is evident that the objective of privacy-preserving data analysis is to release information without giving up the privacy of any individual whose data contributes to the database. Having data that is usable enough to provide utility and providing secure privacy are the two compromises that pose the conflicting objectives of differential privacy.

Thus, differential privacy is yet another privacy definition, but one which is more unmistakably actionable and directly concerned with a specific concern of the data subject, namely what the consequence of participating in a particular database will be; it is also clearly both philosophical and technical. Differential privacy ensures that nothing that couldn't occur without access to your data will happen with access to your data. Moreover, it makes a strong guarantee [13] of privacy even when the adversary has arbitrary external knowledge: the probability that 'any particular thing' happens with access to your information is at most a multiple 'X' of the chance it would happen without your data, where the multiple X determines how much privacy is guaranteed.

Thus, formally, let A : D^n → Y be a randomized algorithm, and call two databases D1, D2 ∈ D^n adjacent or neighbouring if they differ in at most one value.

Definition 3.2.3 ((ε)-differential privacy) [12]. Let ε > 0. Define A to be (ε)-differentially private if for all neighbouring input datasets D1, D2, and for all (measurable) subsets Y of the output range, we have

$$\frac{\Pr[A(D_1) \in Y]}{\Pr[A(D_2) \in Y]} \le \exp(\varepsilon)$$

where the probability is taken over the coin tosses of A. Since we can interchange D1 and D2, the above definition implies that

$$\exp(-\varepsilon) \le \frac{\Pr[A(D_1) \in Y]}{\Pr[A(D_2) \in Y]} \le \exp(\varepsilon)$$

Since exp(ε) ≈ 1 + ε for small ε, we have roughly

$$1 - \varepsilon \lesssim \frac{\Pr[A(D_1) \in Y]}{\Pr[A(D_2) \in Y]} \lesssim 1 + \varepsilon$$

A common weakening of ε-differential privacy is the following notion of approximate privacy [23].


Definition 3.2.4 ((ε, δ)-differential privacy) [23]. Define A to be (ε, δ)-differentially private if for all neighbouring input datasets D1, D2, and for all (measurable) subsets Y of the output range, we have

$$\Pr[A(D_1) \in Y] \le \exp(\varepsilon) \times \Pr[A(D_2) \in Y] + \delta$$

In this case, the two parameters ε and δ control the level of privacy. The strongest version of differential privacy, in which δ = 0, is known as pure differential privacy, while the more general case where δ > 0 is known as approximate differential privacy and is less well understood [16].

It is important to note that differential privacy is a property of the randomized algorithm A and not of the dataset. The input dataset D1 generates a probability distribution on the range R(A) of the algorithm. Thus, differential privacy is a boundedness condition, parametrized by ε, on the ratio of the two probability density functions corresponding to the distributions generated on R(A) by D1 and D2. The closer the privacy parameter ε is to zero, the closer the two distributions are, and the higher the level of privacy. From the above definitions of differential privacy, it is quite evident what privacy means here: the information gained about a participant from the output of an algorithm is no more than the information acquired about that particular member without access to their data, which we informally call pure semantic security. In fact, it is clarified in the semantically-flavoured explanation [11], which states that regardless of external knowledge, an adversary with access to the sanitized database draws the same conclusions whether or not my data is included in the original database. Unluckily, the presence of arbitrary external information makes such a privacy definition impossible to achieve. We can see this in the following example [27]:

Consider a clinical study that explores the relationship between smoking and lung disease. A health insurance company which had no a priori understanding of that relationship might dramatically alter its "beliefs" (as encoded by insurance premiums) to account for the results of the study. The study would cause the company to raise premiums for smokers and lower them for non-smokers, regardless of whether they participated in the study. In this case, the conclusions drawn by the company about the riskiness of any one individual (say Alice) are strongly affected by the results of the study. This occurs regardless of whether Alice's data are included in the study.


From this perception, instead of aiming at pure semantic privacy, we should target a more appropriate definition of privacy and show that differential privacy achieves a relaxed version of semantic privacy. Differential privacy states that, whether or not a single user's data is included in the input, the amount of information learned by the adversary from the output of the algorithm is practically the same. Thus, the user must be assured that participating in the statistical database will not utterly change the outcome of functions run on the database.

Supposing that the user participates in the database, we now formalize this argument. We want to mathematically bound the statistical difference between the beliefs b1 and b2, where b1 is the belief about the values in the database given an output y = A(D), where D includes the user's data, and b2 is the belief about the database given the same output y = A(D′), where D′ does not include the user's data.

Differential privacy implies (relaxed) semantic privacy [27]. To understand the concept of semantic privacy we need to define the statistical difference. If P and Q are two distributions on the same discrete probability space, the statistical difference between P and Q is defined as:

$$SD(P, Q) = \max_{S \subseteq \mathcal{D}} |P[S] - Q[S]|$$

Suppose there is a randomized algorithm A that satisfies ε-differential privacy. Let b(D) denote the prior belief of the adversary on databases D ∈ D^n, and let b(D|y) denote the posterior belief on databases given an output y ∈ Y. Let b′(D|y) denote the posterior belief of the adversary when we use a different randomized algorithm A′(D) = A(D_{¬n}), where D_{¬n} is the database in which we keep the first n−1 values of D while the n-th value is replaced by some arbitrary value d_n ∈ D. We can now argue that for all D ∈ D^n and for all y ∈ Y,

$$SD\big(b(D \mid y), b'(D \mid y)\big) \le \exp(2\varepsilon) - 1$$

Theorem 3.2.5 (ε-differential privacy implies semantic security). Let A be an ε-differentially private algorithm. For all D ∈ D^n and y ∈ Y, we have

$$SD\big(b(D \mid y), b'(D \mid y)\big) \le \exp(2\varepsilon) - 1$$


Proof. By Bayes' rule [43], we know that

$$b(D \mid y) = \frac{\mu(y \mid D)\, b(D)}{\sum_{E \in \mathcal{D}^n} \mu(y \mid E)\, b(E)}$$

This yields

$$b(D \mid y) - b'(D \mid y) = \frac{\mu(y \mid D)\, b(D)}{\sum_{E} \mu(y \mid E)\, b(E)} - \frac{\mu'(y \mid D)\, b(D)}{\sum_{E} \mu'(y \mid E)\, b(E)}$$

Then, from the differential privacy inequalities, we get |b(D|y) − b′(D|y)| ≤ exp(2ε) − 1.

3.3 Properties of Differential Privacy

3.3.1 The Sensitivity

Sensitivity parametrizes how much perturbation is required in a differential privacy mechanism. Currently, global sensitivity and local sensitivity are the two notions mainly used in differential privacy.

3.3.1.1 The Global Sensitivity

Global sensitivity captures the maximal difference between the query results on neighboring databases; it is the quantity used in the differentially private mechanisms below. The formal definition:

Definition 3.3.1 (Global Sensitivity) [14]. For f : D^n → R^k, use the ℓ1 norm on R^k (denoted ||·||_1, or simply ||·||) as a distance metric on the outcomes of f. Then, the global sensitivity of f is

$$GS(f) = \max_{D_1, D_2} \|f(D_1) - f(D_2)\|_1$$

where the maximum is taken over neighboring databases D1, D2.

Global sensitivity works well for releasing low-sensitivity queries such as counts or sums. Take the count query as an example: it has GS(f) = 1, which is typically much smaller than the true answer. However, for queries like the median or the average, the global sensitivity can be much higher.


3.3.1.2 The Local Sensitivity

The magnitude of the noise added by the Laplace mechanism depends on GS(f) and the privacy parameter ε, but not on the database D itself. For many functions, adding noise calibrated to the global sensitivity yields much more noise than is warranted by the function's general insensitivity to any single individual's input. Thus, Nissim [43] proposed the notion of local sensitivity, which adjusts to the difference between query results on the databases neighboring the actual input [14]. The formal definition:

Definition 3.3.2 (Local Sensitivity) [39]. For f : D^n → R^k and D1 ∈ D^n, the local sensitivity of f at D1 is

$$LS(f)(D_1) = \max_{D_2} \|f(D_1) - f(D_2)\|_1$$

where the maximum is taken over databases D2 neighboring D1. Here, we observe that the global sensitivity of Definition 3.3.1 is

$$GS(f) = \max_{D_1} LS(f)(D_1)$$

Local sensitivity allows less noise for queries whose global sensitivity is high. For queries such as count or range, the local sensitivity is identical to the global sensitivity. An honest argument shows that every differentially private algorithm must add distortion at least as large as the local sensitivity on many inputs. However, finding algorithms whose error matches the local sensitivity is not straightforward: an algorithm that releases f with noise magnitude proportional to LS(f) on input D1 is not, in general, differentially private [35], since the noise magnitude itself can leak information.
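To illustrate the difference between the two notions, the following brute-force sketch (plain Python; the data domain and records are hypothetical) computes the local sensitivity of the median at one concrete database, where a neighbor is obtained by replacing a single record:

```python
# Minimal sketch: local sensitivity of the median at one concrete database,
# where a neighbor replaces a single record with any value from the data domain.
# Domain and records are illustrative only.
import statistics

def local_sensitivity_median(database, domain):
    base = statistics.median(database)
    worst = 0.0
    for i in range(len(database)):                 # position being replaced
        for value in domain:                       # candidate replacement value
            neighbor = list(database)
            neighbor[i] = value
            worst = max(worst, abs(base - statistics.median(neighbor)))
    return worst

domain = range(0, 101)                             # ages 0..100
database = [30, 32, 34, 36, 90]
print(local_sensitivity_median(database, domain))  # 2 for this particular database

# For comparison, the count query has global sensitivity 1: changing one
# record changes the number of records satisfying a predicate by at most 1.
```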

3.3.2 The Privacy Budget

An absolute privacy guarantee that holds regardless of the number of computations carried out on the data is the ideal condition that a data curator may want to achieve. However, a meaningful notion of privacy cannot be accomplished with an absolute privacy guarantee. The concept of a privacy budget comes into play due to the need to restrict the number of queries and to reason about the composability of differential privacy [36]. In Definition 3.2.3, the privacy level of the mechanism A is controlled by ε, which is defined as the privacy budget [44]. Currently, sequential composition and parallel composition are the two composition properties mostly used in the design of mechanisms.


3.3.2.1 Sequential composability

Sequential composability makes it possible to compute the results of k independent differentially private algorithms in sequence on a dataset without giving up privacy, while the privacy budget is added up for each step. Suppose there are k algorithms A_i(D; x_i), where x_i represents some auxiliary input, and each A_i is ε-differentially private for any auxiliary input x_i. Consider a sequence of computations {x_1 = A_1(D), x_2 = A_2(D; x_1), x_3 = A_3(D; x_1, x_2), ...} and let A(D) = x_k.

Theorem 3.3.3 (Sequential composability [55]). Let k_i(D), for i ∈ I, be computations over D providing ε_i-differential privacy. The sequence of computations (k_i(D))_{i∈I} provides $\left(\sum_{i \in I} \varepsilon_i\right)$-differential privacy.

Proof. Let D1, D2 be two neighboring databases. Then

$$\begin{aligned}
\Pr[A(D_1) = x_k] &= \Pr[A_1(D_1) = x_1]\,\Pr[A_2(D_1; x_1) = x_2] \cdots \Pr[A_k(D_1; x_1, \ldots, x_{k-1}) = x_k] \\
&\le \exp(k\varepsilon) \prod_{i=1}^{k} \Pr[A_i(D_2; x_1, \ldots, x_{i-1}) = x_i] \\
&= \exp(k\varepsilon)\,\Pr[A(D_2) = x_k]
\end{aligned}$$
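As a small illustration of budget accounting under sequential composition, the sketch below (assuming NumPy; the data, queries and total budget are hypothetical) answers two Laplace-noised count queries whose individual budgets add up to the overall ε:

```python
# Minimal sketch: sequential composition as privacy-budget accounting.
# Each noisy answer spends eps_i; by sequential composability the whole
# interaction is (sum_i eps_i)-differentially private. Data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
ages = np.array([23, 35, 41, 29, 52, 61, 19])

def noisy_count(predicate_mask, eps):
    # A count query has global sensitivity 1, so the Laplace scale is 1/eps.
    return predicate_mask.sum() + rng.laplace(scale=1.0 / eps)

total_budget = 1.0
eps_per_query = total_budget / 2              # split the budget over two queries

a1 = noisy_count(ages >= 40, eps_per_query)   # consumes 0.5
a2 = noisy_count(ages < 30, eps_per_query)    # consumes another 0.5
print(a1, a2, "spent:", 2 * eps_per_query)    # total spend equals 1.0
```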

3.3.2.2 Parallel composability

In situations where we have a sequence of queries made on non-intersecting subsets of the data, we can apply parallel composability. In this case, the largest of the individual privacy budgets determines the overall privacy guarantee. We consider a situation where we have k disjoint subsets D_i of a partitioned database D, and suppose we have k algorithms A_i(D_i; x_i), each of which is ε-differentially private.

Theorem 3.3.4 (Parallel composability [29]). Let k_i(D_i), for i ∈ I, be computations over D_i providing ε-differential privacy. If each D_i contains data on a set of subjects disjoint from the set of subjects of D_j for all j ≠ i, then (k_i(D_i))_{i∈I} provides ε-differential privacy.


Proof. Let D1, D2 be two neighboring databases. Assume that the j-th partition contains the differing element. Then

$$\begin{aligned}
\Pr[A(D_1) = x_k] &= \prod_{i=1}^{k} \Pr[A_i(D_1^i; x_1, \ldots, x_{i-1}) = x_i] \\
&\le \exp(\varepsilon)\,\Pr[A_j(D_2^j; x_1, \ldots, x_{j-1}) = x_j]\, \prod_{i \ne j} \Pr[A_i(D_1^i; x_1, \ldots, x_{i-1}) = x_i] \\
&= \exp(\varepsilon)\,\Pr[A(D_2) = x_k]
\end{aligned}$$
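By contrast, the following sketch (again NumPy, with hypothetical data) releases one noisy count per disjoint partition; since each record appears in exactly one partition, parallel composition gives an overall guarantee of ε rather than the sum of the budgets:

```python
# Minimal sketch: parallel composition. Each record falls into exactly one
# nationality partition, so releasing one eps-DP noisy count per partition
# is eps-DP overall (not len(partitions) * eps). Data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
nationalities = ["USA", "USA", "Norway", "Norway", "Mexico"]
eps = 0.5

noisy_histogram = {}
for country in set(nationalities):
    true_count = nationalities.count(country)       # disjoint subset of records
    noisy_histogram[country] = true_count + rng.laplace(scale=1.0 / eps)

print(noisy_histogram)   # overall guarantee: eps = 0.5 by parallel composition
```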

3.4 Mechanisms of Differential Privacy

3.4.1 Laplace Mechanism

In this section we introduce one of the most basic and handy tools in differential privacy. The Laplace mechanism adds random noise drawn from the Laplace distribution with mean 0 and scale GS(f)/ε, added independently to each query response, thus making sure that every query is perturbed appropriately. To analyze the Laplace mechanism we first need to define the Laplace distribution.

Definition 3.4.1 (Laplace distribution) [59]. The Laplace distribution is characterized by a location parameter θ (any real number) and a scale parameter λ (greater than 0), with the following probability density function:

$$f(x \mid \theta, \lambda) = \frac{1}{2\lambda} \exp\!\left(-\frac{|x - \theta|}{\lambda}\right)$$


FIGURE 3.1: Laplace distribution with various parameters [59]

The Laplace mechanism depends on the global sensitivity.

Theorem 3.4.2 (Laplace Mechanism) [14]. For any f : D^n → R^k and ε > 0, the following randomized mechanism A_f, called the Laplace mechanism, is ε-differentially private:

$$A_f(D) = f(D) + \mathrm{Lap}\!\left(\frac{GS(f)}{\varepsilon}\right)^k$$

where $\mathrm{Lap}(\lambda)^k$ denotes a vector of k independent draws from the Laplace distribution with scale λ.

Proof. Let D1, D2 be neighboring databases and consider any point x ∈ R^k. It is enough to compare the probability density of A_f(D1) with that of A_f(D2) at x.


Let f(D)_i denote the i-th coordinate of f(D). Then

$$\begin{aligned}
\frac{\Pr[A_f(D_1) = x]}{\Pr[A_f(D_2) = x]}
&= \frac{\prod_{i=1}^{k} \Pr\big(x_i - f(D_1)_i\big)}{\prod_{i=1}^{k} \Pr\big(x_i - f(D_2)_i\big)} \\
&= \frac{\prod_{i=1}^{k} \exp\big(-\varepsilon\,|x_i - f(D_1)_i| / GS(f)\big)}{\prod_{i=1}^{k} \exp\big(-\varepsilon\,|x_i - f(D_2)_i| / GS(f)\big)} \\
&= \exp\!\left(\frac{\varepsilon \sum_{i=1}^{k} \big(|x_i - f(D_2)_i| - |x_i - f(D_1)_i|\big)}{GS(f)}\right) \\
&\le \exp\!\left(\frac{\varepsilon \sum_{i=1}^{k} |f(D_2)_i - f(D_1)_i|}{GS(f)}\right) \\
&\le \exp(\varepsilon)
\end{aligned}$$

where the first inequality follows from the triangle inequality and the second from the definition of global sensitivity (Definition 3.3.1).
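A minimal implementation sketch of the Laplace mechanism follows (assuming NumPy; the dataset, query and ε values are illustrative, and this is not the exact algorithm implementation used in the thesis framework):

```python
# Minimal sketch of the Laplace mechanism A_f(D) = f(D) + Lap(GS(f)/eps),
# applied to a count query (global sensitivity 1). Illustrative only.
import numpy as np

rng = np.random.default_rng(42)

def laplace_mechanism(true_answer, global_sensitivity, eps):
    """Return the true answer perturbed with Laplace noise of scale GS(f)/eps."""
    scale = global_sensitivity / eps
    return true_answer + rng.laplace(loc=0.0, scale=scale)

ages = np.array([23, 35, 41, 29, 52, 61, 19])
true_count = int((ages >= 40).sum())       # "how many people are 40 or older?"

for eps in (1.0, 0.1, 0.01):
    print(eps, laplace_mechanism(true_count, global_sensitivity=1.0, eps=eps))
# Smaller eps means a larger noise scale, hence stronger privacy but lower accuracy.
```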

3.4.2 The Exponential Mechanism

The Laplace mechanism adds real-valued noise to the actual answer. However, not all query functions return numerical outputs. Hence, McSherry and Talwar [36] proposed a more general method that can answer non-numeric queries in a differentially private manner. Given a utility function u : D × R → R, the exponential mechanism selects an output r from an arbitrary range R for a database d with n elements from domain D, based on a score that represents the quality of r for d. The final output is close to the optimal choice under u, since the mechanism assigns exponentially higher probabilities of being selected to higher-scoring outputs.


Definition 3.4.3. (The Exponential Mechanism [36]) The exponential mechanism $M_E(x, u, R)$ selects and outputs an element $r \in R$ with probability proportional to

$$\exp\!\left(\frac{\varepsilon\, u(x, r)}{2\Delta u}\right)$$

It might be difficult to implement the exponential mechanism efficiently if the range $R$ is super-polynomially large in the parameters of the problem, since it can define a complex distribution over a large arbitrary domain.
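A minimal Python sketch of this selection rule follows (our own illustrative implementation using NumPy; the utility function, the candidate list and the sensitivity value are assumptions made for the example, not part of the thesis text).

import numpy as np

def exponential_mechanism(database, candidates, utility, sensitivity, epsilon):
    # Select one candidate with probability proportional to exp(eps * u / (2 * delta_u)).
    scores = np.array([utility(database, r) for r in candidates], dtype=float)
    weights = np.exp(epsilon * scores / (2.0 * sensitivity))
    probabilities = weights / weights.sum()
    return np.random.choice(candidates, p=probabilities)

# Toy example: privately pick the most common value in a list.
data = ["A", "B", "B", "C", "B"]
options = ["A", "B", "C"]
count_utility = lambda d, r: d.count(r)   # adding/removing one record changes a count by 1
print(exponential_mechanism(data, options, count_utility, sensitivity=1.0, epsilon=0.5))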

Theorem 3.4.4. (The exponential mechanism)[36]The exponential mechanism maintains ε-differential privacy.

Proof. We compare the ratio of the probabilities that the mechanism outputs some element $r \in R$ on two neighbouring databases $D_1$ and $D_2$ (i.e., $\|D_1 - D_2\| \leq 1$).

\begin{align*}
\frac{\Pr[M_E(D_1, u, R) = r]}{\Pr[M_E(D_2, u, R) = r]}
&= \frac{\exp\!\left(\frac{\varepsilon u(D_1, r)}{2\Delta u}\right) \Big/ \sum_{r' \in R} \exp\!\left(\frac{\varepsilon u(D_1, r')}{2\Delta u}\right)}
       {\exp\!\left(\frac{\varepsilon u(D_2, r)}{2\Delta u}\right) \Big/ \sum_{r' \in R} \exp\!\left(\frac{\varepsilon u(D_2, r')}{2\Delta u}\right)}\\[4pt]
&= \frac{\exp\!\left(\frac{\varepsilon u(D_1, r)}{2\Delta u}\right)}{\exp\!\left(\frac{\varepsilon u(D_2, r)}{2\Delta u}\right)}
   \cdot \frac{\sum_{r' \in R} \exp\!\left(\frac{\varepsilon u(D_2, r')}{2\Delta u}\right)}{\sum_{r' \in R} \exp\!\left(\frac{\varepsilon u(D_1, r')}{2\Delta u}\right)}\\[4pt]
&= \exp\!\left(\frac{\varepsilon\left(u(D_1, r) - u(D_2, r)\right)}{2\Delta u}\right)
   \cdot \frac{\sum_{r' \in R} \exp\!\left(\frac{\varepsilon u(D_2, r')}{2\Delta u}\right)}{\sum_{r' \in R} \exp\!\left(\frac{\varepsilon u(D_1, r')}{2\Delta u}\right)}\\[4pt]
&\leq \exp\!\left(\frac{\varepsilon}{2}\right) \cdot \exp\!\left(\frac{\varepsilon}{2}\right)
   \cdot \frac{\sum_{r' \in R} \exp\!\left(\frac{\varepsilon u(D_1, r')}{2\Delta u}\right)}{\sum_{r' \in R} \exp\!\left(\frac{\varepsilon u(D_1, r')}{2\Delta u}\right)}\\[4pt]
&= \exp(\varepsilon)
\end{align*}

where the inequality uses $|u(D_1, r') - u(D_2, r')| \leq \Delta u$ for every $r'$.

Equivalently, $\frac{\Pr[M_E(D_2, u, R) = r]}{\Pr[M_E(D_1, u, R) = r]} \geq \exp(-\varepsilon)$ by symmetry.

3.4.3 The Median Mechanism

The median mechanism is an interactive differentially private mechanism that answers arbitrary predicate queries $f_1, \ldots, f_k$ arriving on the fly, without knowledge of future queries, where $k$ can be large or even super-polynomial. It can answer exponentially more queries than mechanisms such as the Laplace mechanism under fixed privacy and accuracy constraints.


Theoretically, the mechanism is suitable for defining and identifying the equivalence of queries in the interactive setting [44]. Sorting queries into "hard" and "easy" ones at low privacy cost is the core concept of the median mechanism. The number of "hard" queries is bounded by $O(\log k \cdot \log |X|)$ due to a Vapnik-Chervonenkis (VC) dimension argument [55] and the constant-factor reduction of the set of candidate databases every time a "hard" query is answered. A query is considered "easy" if consistent answers can be obtained from the maintained collection of databases. The median mechanism can be described by the following steps [44]:

(1) Initialise $C_0 = \{\text{all databases of size } m \text{ over } X\}$.

(2) For each query $f_1, f_2, \ldots, f_k$ in turn:

(a) Define $r_i$ (see below) and let $\hat{r}_i = r_i + \mathrm{Lap}\!\left(\frac{2}{\varepsilon n \alpha'}\right)$.

(b) Let $t_i = \frac{3}{4} + j\gamma$, where $j \in \{0, 1, \ldots, \frac{3}{20\gamma}\}$ is chosen with probability proportional to $2^{-j}$.

(c) If $\hat{r}_i \geq t_i$, set $a_i$ to the median value of $f_i$ on $C_{i-1}$.

(d) If $\hat{r}_i < t_i$, set $a_i$ to $f_i(D) + \mathrm{Lap}\!\left(\frac{1}{\alpha'}\right)$.

(e) If $\hat{r}_i < t_i$, set $C_i$ to the databases $S$ of $C_{i-1}$ with $|f_i(S) - a_i| \leq \frac{\varepsilon}{50}$; otherwise $C_i = C_{i-1}$.

(f) If $\hat{r}_j < t_j$ for more than $20m \log |X|$ values of $j \leq i$, then halt and report failure.

The mechanism makes use of several additional parameters. Roth[55] sets them to

$$m = \frac{160000 \ln k \ln \frac{1}{\varepsilon}}{\varepsilon^2}$$

$$\alpha' = \frac{\alpha}{720\, m \ln |X|} = \Theta\!\left(\frac{\alpha \varepsilon^2}{\log |X| \log k \log \frac{1}{\varepsilon}}\right)$$

$$\gamma = \frac{4}{\alpha' \varepsilon n} \ln \frac{2k}{\alpha} = \Theta\!\left(\frac{\log |X| \log^2 k \log \frac{1}{\varepsilon}}{\alpha \varepsilon^3 n}\right)$$

The parameter $\alpha$ can be seen as the privacy cost as a function of the number of queries. The value $r_i$ used by the median mechanism is defined as

$$r_i = \frac{\sum_{S \in C_{i-1}} \exp\!\left(-\varepsilon^{-1} |f_i(D) - f_i(S)|\right)}{|C_{i-1}|}$$


3.5 Data Release on Differential Privacy

Delivering aggregate statistics on a dataset without unveiling any individual record is the primary goal of data release. In this private data release process, we consider the data curator, the trusted party that holds a database $D$ with sensitive information and a privacy parameter $\varepsilon > 0$. The other party in this process is the user (data analyst), seen as untrusted, who lays out a sequence of queries $q_1, \ldots, q_k$. The interactive and non-interactive settings, shown in Figure 3.2, are the two settings applied in the mechanisms, depending on how the answers to the query sets are arranged.

3.5.1 Interactive Data Release

In this setting, an interactive differential privacy interface is inserted between the users and the dataset to answer the users' queries while protecting privacy. Moreover, each query answered in this setting consumes part of the privacy budget. The curator and the user interact in $x$ rounds: in round $z \in \{1, \ldots, x\}$, the curator gives an answer $a_z$ to a chosen query $q_z$ from the collection of queries $Q$, and this query may be selected depending on the previous interaction $q_1, a_1, \ldots, q_{z-1}, a_{z-1}$. For an interactive mechanism, even if we know all the queries in advance, we can still obtain privacy-preserving answers by running the interactive mechanism on each of them, which makes this setting preferable to the non-interactive one.

3.5.2 Non-Interactive Data Release

In this setting, a set of queries $Q$ can be answered in a batch, which offers higher flexibility for data analysis compared to the interactive environment. Much higher noise has to be added to the answers to ensure differential privacy, and consequently there is a significant decline in the utility of the data. Thus, the major issue in this setting is managing the trade-off between utility and privacy by answering more queries with a limited privacy budget.


FIGURE 3.2: Interactive and Non-Interactive data release

3.6 Challenges in Differential Privacy

Differential privacy is a strong notion of privacy; however, the notion still has practical challenges and limitations. From the definition of differential privacy, we have seen that an individual has a limited impact on the published dataset whenever we apply one of the mechanisms of differential privacy. The dominant theory here is that a person's privacy is not breached as long as the information gained from the dataset does not depend on that individual being included, although this information could still be used to learn private details about the individual. This implies that, without violating the notion of differential privacy, specific types of background information can be combined with the differentially private outcome to learn accurate information about the person; hence the assurance of differential privacy is not absolute confidentiality [17].

3.6.1 Calculating Global Sensitivity

To get a sensibly bounded global sensitivity for a query, one must include all aspects of the domain with all conceivable tuples. For example, consider a hypothetical scenario given by Lee and Clifton [28]: Purdue University has put together a "short list" of alumni as possible commencement speakers. A local newspaper is writing a story on the value Indiana taxpayers get from Purdue and would like to know if these distinguished alumni are locals or world travelers. Purdue does not want to reveal the list (to avoid embarrassing those that are not selected, for example), but is willing to show the average distance people on the list have traveled from Purdue in their lifetimes. Outliers in the data, such as Purdue's Apollo astronaut alumni (who have been nearly 400,000 km from campus), drive the sensitivity of this query extremely high. In this case, calculating the global sensitivity will result in impractically large noise, which leaves the result with little utility. Simple queries with low sensitivity, for example count queries, have a weak effect on the utility of the data. However, high-sensitivity queries are involved in numerous applications. In general, providing privacy with a reasonable amount of noise while computing a global sensitivity is a challenging task.

3.6.2 Setting of privacy parameter ε

The question of how to set the privacy parameter $\varepsilon$ has not been adequately addressed since research on differential privacy started. The way in which $\varepsilon$ influences the ability to identify an individual is not clear, although differential privacy states that if it is hard to determine whether a person is included in a database, it is then also hard to learn that individual's record. The parameter $\varepsilon$ in $\varepsilon$-differential privacy does not show what has been revealed about the person; it rather limits the influence an individual has on the result. The impact $\varepsilon$ has on determining an individual is less clear for queries that retrieve large properties of the data than for queries that ask for precise information. Lee and Clifton [28] showed that, for a given setting of $\varepsilon$, the values in the database (as well as those not in it) can change the confidence an adversary attains about a particular individual. An improper value of epsilon causes a privacy breach, and even for the same value of epsilon the level of protection that an $\varepsilon$-differential privacy mechanism provides differs based on the values of the domain attributes and the type of queries.

3.6.3 Uncertainty of outcome

The results obtained from a differentially private mechanism can differ enormously, which makes their quality unreliable. Whenever the Laplace mechanism is applied, there might be a significant difference in the answer. An example provided in [48]: a differentially private query for the mean income of a single U.S. county, with $\varepsilon = 0.25$ (resp. $\varepsilon = 1.0$), deviates from the true value by \$10,000 or less only 3% (resp. 12%) of the time. This can be extremely misleading, given that the true value is \$16,708.


Chapter 4

Experimental Framework Setup

In this chapter, we give a detailed description of the architecture that we have used to experiment with the application of differential privacy, as it will be a test bed and guidance for the next chapter. Moreover, we present the hypothesis that we thought would best show the actual effect of differential privacy. Later in the chapter, we present the choice of programming language that we have used.

4.1 Hypothesis

Currently, there are several ways to implement differential privacy with various kinds of settings. Thus, some assumptions have to be stated, and in this thesis we have chosen to follow a basic model or architecture where a secured server is connected to a data store that is efficient enough to support the differential privacy mechanisms shown in the next section. The actual scenario we have assumed is that a data owner puts datasets in a secured system so that a data analyst can use the information for a particular purpose, while data privacy is provided by utilizing differential privacy methods. As for the mechanisms of differential privacy, we have applied the primary methods that are essential in protecting sensitive data. These methods are the Laplace and exponential mechanisms, and the algorithms used are described in the sections to come.

4.2 Experimental Framework Overview

The proposed model aims to confirm the thesis's hypothesis and verify that the existing mechanisms of differential privacy increase the privacy level of the database.


Thus, as presented above, we have used an underlying architecture for implementing differential privacy, as seen in Figure 4.1, and most importantly we have followed the interactive model. Each of the major parts and its respective purpose in this environment is presented in the next sections.

FIGURE 4.1: Architecture of the system

4.2.1 User Interface

This is an abstract part of the system where a data analyst or user connects to the dataset/database through the user interface, requests data using a query, and receives the data back with noise added to the true values by the differential privacy methods.

4.2.2 Privacy Preservation Mechanism

This part of the framework contains all the essential differential privacy methods needed to make the data private. At this point, the person using the framework need not know how to enforce the privacy requirement, nor be an expert in privacy, since one of the properties of the interactive model described in 3.5.1 manages this.


Hence, this part of the framework receives a query requesting data from the data analyst/user and fetches the raw data from the data store. It then enforces differential privacy by adding carefully calibrated noise to each query, depending on the type of mechanism it applies. Moreover, to mask the influence of any particular record on the outcome, the magnitude of the noise is chosen according to the mechanism; taking the Laplace mechanism into consideration, as described in 3.4.1 and 3.3.1, the noise is calibrated to the global sensitivity.
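To make this flow concrete, the following is a minimal sketch of such a privacy-preserving query layer (our own illustration; the function name, the use of pandas for reading the query result, and the Laplace mechanism with a caller-supplied sensitivity are assumptions, not the thesis framework code itself).

import numpy as np
import pandas.io.sql as psql

def answer_privately(conn, sql, column, sensitivity, epsilon):
    # Run the analyst's query against the data store (the raw, true answer)...
    result = psql.read_sql_query(sql, conn)
    # ...then perturb the selected column with Laplace noise of scale sensitivity/epsilon
    noise = np.random.laplace(0.0, sensitivity / epsilon, size=len(result))
    result[column] = result[column] + noise
    return result   # only the noisy values leave the privacy layer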

4.2.3 Datastore

This part of the framework is where all the sensitive datasets are stored, in a PostgreSQL database.

4.2.4 Programming Language

The programming language was selected for its ability to process a large dataset in the smallest amount of time and its support for mathematical operations. For these reasons, we have used Python within the interactive environment IPython Notebook [42]. Furthermore, Pandas [41] is used for managing large datasets, together with the NumPy [56] and SciPy [26] libraries.

4.3 Algorithm Description

For the purpose of showing how to apply differential privacy, we used one of the primary methods, the Laplace mechanism. In the next section, we describe the algorithm that we used to implement the mechanism.


4.3.1 Algorithm for Laplace Mechanism

Algorithm 1 Laplace Mechanism’s algorithm

1: function LAPLACE(D, Q : N^{|x|} → R^k, ε)        ▷ the Laplace mechanism based on the dataset, query and epsilon value
2:     Δ ← GS(Q)                                     ▷ calculate the global sensitivity
3:     for i ← 1, k do
4:         y_i ∼ Lap(Δ/ε)                            ▷ draw noise from the Laplace distribution based on epsilon and the sensitivity
5:     end for
6:     return Q(D) + (y_1, . . . , y_k)              ▷ the noise added to the true value
7: end function

For many differentially private methods, controlled noise is added to functions with low sensitivity [14]. Algorithm 1 adds Laplace noise, calibrated to the value of the sensitivity, drawn from the Laplace distribution [59] whose density function is $\frac{1}{2\lambda}\exp\!\left(-\frac{|x-\mu|}{\lambda}\right)$, where $\mu$ is the mean and $\lambda\,(>0)$ is a scale factor. It is known that for the query function $Q$ and the dataset $D$, a randomized mechanism that returns $Q(D) + (y_1, \ldots, y_k)$ as a response, where $y_1, \ldots, y_k$ are drawn independently and identically distributed from $\mathrm{Lap}(\frac{\Delta}{\varepsilon})$, gives $\varepsilon$-differential privacy [14].
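A minimal Python sketch of Algorithm 1 follows (our own illustrative implementation using NumPy; the function name, passing the query as a Python callable, and the example query are assumptions made for the illustration, not part of the thesis code).

import numpy as np

def laplace_mechanism(dataset, query, sensitivity, epsilon):
    # Return query(dataset) with i.i.d. Laplace(sensitivity/epsilon) noise
    # added to each of its k components, as in Algorithm 1.
    true_answer = np.atleast_1d(np.asarray(query(dataset), dtype=float))
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon, size=true_answer.shape)
    return true_answer + noise

# Example: a count query has global sensitivity 1.
records = [12, 9, 16, 15, 11]
noisy_count = laplace_mechanism(records, lambda d: len(d), sensitivity=1.0, epsilon=0.5)
print(noisy_count)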


Chapter 5

Experimental Study on Differential Privacy

In this chapter, we present the datasets that have been used and the analysis of the implementation that we have applied. We have used two types of datasets to which we have applied differential privacy. Each dataset's description and analysis is given in the sections to come.

5.1 Adult Dataset

5.1.1 Dataset Description

We have used the well-known UCI Adult dataset [32], which was extracted from US Census data (1994) and donated in 1996. It contains more than 30,000 instances (customer records) with the following 15 attributes (columns):

5.1.2 Experimental Design

The whole experiment was conducted on a Linux-based system with an Intel i5-4200 CPU and 4 GB of RAM, while the other system was a Windows-based system with an AMD CPU and 4 GB of RAM. To experiment with differential privacy, we have used two different datasets with different scenarios. The scenarios are assumed to depend on how the dataset would be used, in a way that makes sense when applying the experiment. Thus, we have taken into consideration implementing different kinds of queries and observing their results and behavior when using the methods of ε-differential privacy. The first scenario we considered is that of a data analyst who wants to find out the number of years a person has spent in education, by income and race, using the adult dataset.


Age: Numerical
Workclass: Nominal (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
Fnlgwt: Numerical
Education: Nominal (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool)
education-num: Numerical
marital-status: Nominal (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
Occupation: Nominal (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
relationship: Nominal (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
Race: Nominal (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
Sex: Nominal (Female, Male)
capital-gain: Numerical
capital-loss: Numerical
hours-per-week: Numerical
native-country: Nominal (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)
income: Nominal (<=50K, >50K)

TABLE 5.1: Adult dataset attributes

The first step in our experimental design, in this case, is applying the required counting query to get the data:

SQL 5.1: Counting query for adult dataset

SELECT
    education_num,
    COUNT(CASE WHEN race = 'White' THEN 'White' END) AS "NoWhite",
    COUNT(CASE WHEN race = 'Black' THEN 'Black' END) AS "NoBlack",
    COUNT(CASE WHEN race = 'Asian-Pac-Islander' THEN 'Asian-Pac-Islander' END) AS "NoAsianPacIslander",
    COUNT(CASE WHEN race = 'Amer-Indian-Eskimo' THEN 'Amer-Indian-Eskimo' END) AS "NoAmerIndianEskimo",
    COUNT(CASE WHEN race = 'Other' THEN 'Other' END) AS "NoOther"
FROM
    adulti
WHERE
    salary_group = '<=50K'
GROUP BY
    education_num, salary_group
ORDER BY
    education_num;

which would have given the data analyst the data shown in Figure 5.1; but before that happens, one of the differential privacy methods is applied to it in order to preserve privacy. The results and the behaviour of differential privacy are discussed in the next part.

FIGURE 5.1: Graph for the Number of years in Education by Incomeand Race

The second scenario we considered under this dataset is understanding the characteristics of the mean (aggregate) query when used with differential privacy. Here too, as a data analyst, we want to find out the average working hours per week for each work class in the dataset. To get the data, we apply the appropriate mean query:

SQL 5.2: Mean query for adult dataset

SELECT
    workclass, AVG(hours_per_week) AS AVGHoursPerWeek
FROM adulti
GROUP BY workclass

which delivers the data shown below, before one of the differential privacy methods is applied to it.

   work_class          hours/week
0  State-gov           39.031587
1  Federal-gov         41.379167
2  Private             40.267096
3  Local-gov           40.982800
4  Self-emp-inc        48.818100
5  Self-emp-not-inc    44.421881
6  NA                  31.919390
7  Never-worked        28.428571
8  Without-pay         32.714286

TABLE 5.2: Mean query result for adult dataset

5.1.3 Results and Discussion

Since individuals' privacy is the primary concern of differential privacy, different types of queries were considered to examine the level of privacy as well as the utility of the data provided by the proposed model. The query shown in 5.1 is a counting query that asks about the behaviour of races with respect to education and income, which is very sensitive information. The other type of query, the mean query 5.2, asks for confidential information about the average working time, which also needs to be protected. The main reason we included the mean query is that the way the sensitivity of the query is calculated affects the characteristics of differential privacy, as we will see in the next section. In the Laplace mechanism, the major step in achieving differential privacy is generating the Laplace noise. Therefore, since the value of ε governs the amount of noise created, the experiment was repeated 50, 100, 150 and 250 times over each dataset with different values of the parameter ε. The values taken into consideration for the parameter ε are 1, 0.5, 0.01, and 0.001. These choices of ε help test the performance and the effect on the utility of the data.


For the case of the counting query, the first step in getting the noise from the Laplace distribution is calculating the sensitivity. From the definition of global sensitivity 3.3.1, we know that it is the maximum difference between two neighboring databases; in this case it is the effect a record has on being included in a dataset or not, which is at most one. The result of the first query 5.1 is shown in the table below for all the ε parameter values.

ε=0.001       ε=0.01        ε=0.1         ε=0.5         ε=1           ε=2           True Value
27.812493     35.390516     37.326686     38.004313     37.949046     38.142612     38

114.25916 129.886678 128.211722 128.904062 128.987817 128.99923 129

376.422304 264.810698 265.716605 266.990905 267.004823 266.913432 267

508.410383 521.427331 514.27641 515.432698 515.11026 515.017971 515

318.298624 367.727981 380.454155 381.233755 381.025898 380.948868 381

633.457919 710.750242 708.279636 707.957029 708.001879 707.936147 708

905.903116 912.219137 927.73722 926.672365 927.043154 927.001282 927

229.255766 310.303729 307.250924 308.205609 308.102301 308.059411 308

7361.250424 7363.948766 7365.089796 7361.968016 7362.020765 7361.984653 7362

4766.829785 4986.021341 4951.782225 4951.46874 4951.640962 4951.942688 4952

999.507002 889.21688 874.203274 873.741552 874.00838 874.015033 874

668.459507 680.438207 678.677963 679.927983 679.971105 680.02492 680

2745.336556 2675.991294 2667.153078 2666.96552 2666.995558 2666.956198 2667

643.886931 672.841862 665.517948 666.025606 666.140248 666.002405 666

93.347772 141.364006 131.312386 131.979095 132.031384 132.035919 132

114.494777 89.435127 92.744101 93.117938 93.166598 92.96753 93

TABLE 5.3: Result for Laplace mechanism for count query

From the results above, we can see the effect that ε has on the quality of the results from this mechanism. This is because differential privacy is achieved by the Laplace mechanism generating noise from the Laplace distribution: as the value of ε becomes smaller, the amount of generated noise becomes larger, and as ε becomes larger, the noise becomes smaller. For the scenario of the mean query, the process of achieving differential privacy is the same as for the count query, but the major difference is calculating the sensitivity. Unlike the counting query, the sensitivity of the mean query is the maximum average difference between two neighboring databases, which is a variable amount other than one. This value leads to a noise addition that is not entirely sensible for use with the dataset. We can clearly see the result in Table 5.4.


True Value    ε=2           ε=1           ε=0.5         ε=0.1         ε=0.01        ε=0.001
39.031587     39.061643     39.069102     38.879876     39.45814      35.839112     131.613162

41.379167 41.290438 41.125853 41.26991 42.599621 25.580234 293.379021

40.267096 40.324735 40.353773 40.33514 41.172319 30.570535 60.370953

40.9828 41.125241 40.803465 41.515622 37.188855 55.853787 -121.15732

48.8181 48.783353 48.97514 48.997429 45.874634 51.523733 -124.031842

44.421881 44.452711 44.177682 43.37441 43.694063 63.597372 -41.977862

31.91939 32.051182 31.890406 31.856859 32.275897 85.031996 362.218115

28.428571 28.444793 28.36093 30.039959 26.413951 6.068707 104.168042

32.714286 32.846824 32.713454 32.852096 29.470895 47.369689 119.513105

TABLE 5.4: Result for Laplace Mechanism for Mean query

From the above results, we can observe that the Laplace mechanism behaves differently for the different types of queries run against the dataset. The first thing to understand is that the noise to be added depends on the kind of query and dataset, in which ε plays a major part. The other thing we can see from this mechanism is that the trade-off between utility and privacy is also governed by the value of ε.
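As a concrete illustration of the scales involved (a worked example based on the Laplace mechanism of Section 3.4.1, not a computation taken from the thesis text): for a counting query the global sensitivity is 1, so the Laplace noise scale is

$$b = \frac{GS(f)}{\varepsilon} = \frac{1}{\varepsilon},$$

which is $1$ for $\varepsilon = 1$ but $1000$ for $\varepsilon = 0.001$. This is consistent with the noisy counts in Table 5.3 staying close to the true values for large $\varepsilon$ and drifting far from them for the smallest $\varepsilon$.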

5.2 Student Alcohol Consumption Dataset

5.2.1 Dataset Description

We use the student alcohol consumption dataset from the UCI Machine Learning Repository for our study. The data contains the alcohol consumption of secondary-level students, including students' grades and demographic, social and school-related features. The data was first gathered and analyzed by Paulo Cortez and Alice Silva, University of Minho, Portugal, covering Portuguese students in two courses (Mathematics and Portuguese). It was collected using school reports and questionnaires from two public schools in the Alentejo region of Portugal during the 2005-2006 school year. Students are evaluated three times: G1 - first-period grade, G2 - second-period grade and G3 - final grade. Grades are given on a 20-point scale, from 0 (the lowest) to 20 (the highest), and are related to the course subject, Math or Portuguese [40, 6].


Attribute  Description (Domain)
sex        student sex (binary: female or male)

age student age (numeric: from 15 to 22)

school student school (binary: Gabriel Pereira or Mousinho da Silveira)

address student home address type (binary: urban or rural)

Pstatus parent cohabitation status (binary: living together or apart)

Medu mother education (numeric: from 0 to 4a )

Mjob mother job (nominalb )

Fedu father education (numeric: from 0 to 4a )

Fjob father job (nominalb)

guardian student guardian (nominal: mother, father or other)

famsize family size (binary: ≤ 3 or >3)

famrel quality of family relationships (numeric: from 1 – very bad to 5 – excellent)

reason reason to choose this school (nominal: close to home, school reputation, course preference or other)

traveltime home to school travel time (numeric: 1 – <15 min., 2 – 15 to 30 min., 3 – 30 min. to 1 hour or 4 – >1 hour)

studytime weekly study time (numeric: 1 – <2 hours, 2 – 2 to 5 hours, 3 – 5 to 10 hours or 4 – >10 hours)

failures number of past class failures (numeric: n if 1 ≤ n <3, else 4)

schoolsup extra educational school support (binary: yes or no)

famsup family educational support (binary: yes or no)

activities extra-curricular activities (binary: yes or no)

paidclass extra paid classes (binary: yes or no)

internet Internet access at home (binary: yes or no)

nursery attended nursery school (binary: yes or no)

higher wants to take higher education (binary: yes or no)

romantic with a romantic relationship (binary: yes or no)

freetime free time after school (numeric: from 1 – very low to 5 – very high)

goout going out with friends (numeric: from 1 – very low to 5 – very high)

Walc weekend alcohol consumption (numeric: from 1 – very low to 5 – very high)

Dalc workday alcohol consumption (numeric: from 1 – very low to 5 – very high)

health current health status (numeric: from 1 – very bad to 5 – very good)

Absences number of school absences (numeric: from 0 to 93)

G1 first period grade (numeric: from 0 to 20)

G2 second period grade (numeric: from 0 to 20)

G3 final grade (numeric: from 0 to 20)

TABLE 5.5: The pre-processed student related variables

a. 0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education.

b. teacher, health care related, civil services (e.g. administrative or police), at home or other.


5.2.2 Experimental Design

Alcohol consumption by secondary school (young) students has many negative impacts on their education, brain development, mental health, behaviour, health and future life. Our analysis intends to investigate the effects of students' alcohol consumption on their health status, education performance and school attendance. The student alcohol consumption dataset contains two attributes describing students' alcohol consumption: Dalc (workday alcohol consumption) and Walc (weekend alcohol consumption). We combine the two attributes to calculate the total amount of alcohol taken by a student over an entire week using the following formula

$$Alc = \frac{(Walc \times 2) + (Dalc \times 5)}{7}$$
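For example (a worked illustration with hypothetical values, not a record from the dataset): a student with $Walc = 4$ and $Dalc = 2$ gets $Alc = \frac{4 \times 2 + 2 \times 5}{7} = \frac{18}{7} \approx 2.57$, i.e., the weekend and workday scores weighted by the number of days each covers.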

We imported the CSV file of the student alcohol consumption dataset into the PostgreSQL database and connected the database to IPython using the following code:

import psycopg2 as pg  # assuming psycopg2 is the driver behind the pg alias used below

try:
    conn = pg.connect("dbname='Student' user='postgres' host='localhost' password='samri'")
except:
    print("I am unable to connect to the database")

After connecting IPython (Jupyter Notebook) to the PostgreSQL database, we used SQL counting queries in IPython to find the total weekly alcohol consumption of individual students together with their health status, school attendance and education performance (final grade score). After getting the results from the SQL query, we added noise to the data using the Laplace mechanism to protect the privacy of the actual values. We tried to do the analysis based on both a limited number of observations and all 649 observations in the dataset. However, the regression analysis on the limited observations produced outputs entirely different from the regression results on the entire dataset. For example, the regression analysis of school attendance on alcohol consumption showed that alcohol consumption was not a predictor of school attendance when the observations were limited, while it showed otherwise when the complete data was used. So, we opted to use the entire dataset to avoid the bias that would have followed from restricting the observations used for the regression analysis.

sql3 = '''SELECT (((walc * 2) + (dalc * 5))/7) AS AlcoholCons, health AS HealthStatus
FROM adulti
ORDER BY AlcoholCons;'''

dataframe3 = psql.read_sql_query(sql3, conn)
dataframe3

The results for this query are shown in Figure 5.2. After getting the true values of this dataset, we used the Laplace mechanism to add noise to the true values and make the data ε-differentially private. The code used to manipulate the data is given below, and the results are shown in Figure 5.3.

dataframe3Copy = dataframe3
data03 = add_laplace_noise(dataframe3.healthstatus, len(dataframe3.healthstatus), 0.001, 1)
dataframe3Copy.healthstatus = data03
dataframe3Copy
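The helper add_laplace_noise is not listed in the thesis text. A minimal sketch that is consistent with how it is called above (values, number of values, ε, sensitivity) and with the Laplace mechanism of Section 3.4.1 could look as follows; this is our own illustrative reconstruction, assuming NumPy, not the authors' actual code.

import numpy as np

def add_laplace_noise(values, n, epsilon, sensitivity):
    # Return the input values with i.i.d. Laplace(sensitivity/epsilon) noise added.
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon, size=n)
    return np.asarray(values, dtype=float) + noise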

To investigate the effect of alcohol consumption on school attendance we use the following query. The result for this query is shown in Figure 5.4.

Number of days students were absent from school and their alcohol consumption, from 1 - very low to 5 - very high:

sql2 = '''SELECT (((walc * 2) + (dalc * 5))/7) AS AlcoholConsumption, absences AS AbsenceDays
FROM adulti
ORDER BY (((walc * 2) + (dalc * 5))/7), absences;
'''

dataframe2 = psql.read_sql(sql2, conn)
dataframe2

The code used to manipulate the data is given below, and the result is shown in Figure 5.5.

dataframe2Copy = dataframe2
data01 = add_laplace_noise(dataframe2.absencedays, len(dataframe2.absencedays), 0.001, 1)
dataframe2Copy.absencedays = data01
dataframe2Copy

To investigate the effect of alcohol consumption on education performance we use the following query; the result is shown in Figure 5.6.

sql4 = '''SELECT (((walc * 2) + (dalc * 5))/7) AS AlcoholConsumption, G3 AS FinalGrade
FROM adulti
ORDER BY AlcoholConsumption, FinalGrade;
'''

dataframe4 = psql.read_sql_query(sql4, conn)
dataframe4


We manipulate the true values obtained from this query using the following code; the result is shown in Figure 5.7.

dataframe4Copy = dataframe4
data04 = add_laplace_noise(dataframe4.finalgrade, len(dataframe4.finalgrade), 0.1, 1)
dataframe4Copy.finalgrade = data04
dataframe4Copy

We have performed the experiment with different values of epsilon (ε = 0.1 and ε = 0.0001) on the final grade attribute to study the level of protection the ε-differentially private mechanism gives to the same attribute when the value of the privacy parameter differs. To compare the results obtained with the different values of ε, we also use ε = 0.0001 for the final grade attribute; the code we used is given below, and the result is shown in Figure 5.8.

dataframe4Copy = dataframe4
data04 = add_laplace_noise(dataframe4.finalgrade, len(dataframe4.finalgrade), 0.0001, 1)
dataframe4Copy.finalgrade = data04
dataframe4Copy

In our effort to draw scatter plots of the true-value data and the noisy data on the same graph, we observed that the resulting graph is too dense: the y-axis scale is large because the noisy values range from decimal values near 0 to large values. This causes the markers in the given range to overlap, making the graph congested. Thus, we opted to present the figures for the actual-value data and the noisy data separately.


FIGURE 5.2: Students' alcohol consumption and health status

FIGURE 5.3: Alcohol consumption and health status with noise


FIGURE 5.4: Alcohol consumption and absence days

FIGURE 5.5: Alcohol consumption and Absence days with noise


FIGURE 5.6: Alcohol consumption and final grade

FIGURE 5.7: Alcohol consumption and final grade with noise


FIGURE 5.8: Alcohol consumption and final grade with high noise

5.2.3 Results and Discussion

In the implementation introduced above, we presented the distinctions between the actual-value results and the differentially private results. In this section, we discuss the results we get from the implementation, that is, whether there is a correlation between alcohol consumption and students' performance, between alcohol consumption and students' health status, and between alcohol consumption and students' absence days from school. For this we have performed simple linear regression using Stata software version 13 [49]. The results are discussed below.


FIGURE 5.9: Result of simple linear regression of health status on alcohol consumption

FIGURE 5.10: Result of simple linear regression of health status with added noise on alcohol consumption

As can be seen from Figure 5.9, alcohol consumption by itself does not seem to predict the health status of a student (p-value = 0.338) at a significance level of 0.05. Figure 5.10 also shows that adding noise with ε = 0.001 does not change the result of the regression (whether the correlation is significant or not), albeit with some shift in the value of the coefficient from 0.098 to 0.12. This model could be further investigated by adding other predictors (age, sex), but as we are performing the analysis just to evaluate the effect of adding noise, we will not perform such an analysis.

FIGURE 5.11: Result of simple linear regression of school attendance on alcohol consumption


FIGURE 5.12: Result of simple linear regression of school attendance with added noise on alcohol consumption

We also investigated whether there is a change in the result of the regression analysis when noise is added with ε = 0.001. As we can see from Figures 5.11 and 5.12, the level of alcohol consumption significantly predicts the school attendance of a student (p-value < 0.05), though the model itself describes only 2.9% of the variance (R-squared = 0.0288), and 3.0% of the variance when noise is added to school attendance. This analysis shows that for a unit increase on the alcohol consumption scale, we would expect a 0.89 unit increase in the absence record of a student when no noise is introduced to the outcome variable, and a 0.99 unit increase when noise is added. Finally, we ran the regression analysis of school performance on alcohol consumption level while adding varying degrees of noise to the outcome variable (final grade). Figure 5.13 shows that increased alcohol consumption significantly predicts a decrease in the school performance of a student (p-value < 0.05 and coefficient = -0.72). Adding noise with a value of ε = 0.1 did not affect either this result or the statement that approximately 4% of the variance of school performance is accounted for by the given model (Figure 5.14). However, adding noise with ε = 0.0001 results in a different outcome, as shown in Figure 5.15. Here, we can see that the effect of alcohol consumption on school performance is not significant (p = 0.563) and only 0.05% of the variance of the final grade is accounted for by the given model, neither of which is similar to the result obtained in Figure 5.13.


FIGURE 5.13: Result of simple linear regression of school performance on alcohol consumption

FIGURE 5.14: Result of simple linear regression of school performance with added noise on alcohol consumption (ε=0.1)


FIGURE 5.15: Result of simple linear regression of school performance with very high noise on alcohol consumption (ε=0.0001)

ε is the privacy parameter governing the level of privacy given, and choosing a proper value of ε is very challenging. Even for the same value of ε, the privacy guarantee that differential privacy provides differs based on the type of query and the attributes of the domain. To compare the results of the data analysis when the ε value changes, we performed an experiment on the students' final grade attribute (ε = 0.1 and ε = 0.0001) to show the difference between the results we get from the differentially private mechanism. Based on our experiment, we observed that when ε decreases, the utility of the data decreases and the privacy level increases; when ε increases, the utility of the data increases and the privacy level decreases. When ε equals 0.0001, the results we get from the differentially private data release show that the privacy guarantee of ε-differential privacy is very high, but the data has no utility, that is, it leads to wrong data analysis results (a change in p-value). When we perform a statistical data analysis as shown in Figure 5.15, it indicates that there is no correlation between students' alcohol consumption and students' performance (final grade results), which is the wrong conclusion. For ε = 0.1 the utility of the data increased, but the privacy level of the data decreased: there is less difference between the released outputs and the real values of the dataset. From our experiment using the student alcohol consumption dataset, we observed that giving a sufficiently small value to ε protects the privacy of the data from the adversary. As the ε value decreases, the privacy level increases and it becomes difficult for the adversary to guess the exact values of the query results, but the utility of the data decreases.


Chapter 6

Related Work

As discussed above, differential privacy requires that the outcome of computationremains insensitive to any given individual record in the dataset. The need to avoidunauthorized disclosures of information on records has led to the development ofdifferent analysis techniques which improve the privacy preservation of records.We will discuss, afterward, instances of previous works which aimed to providesecurity to databases against information leakage.

6.1 Early works on Statistical Disclosure Control Methods

A statistical database (SDB) system allows only the retrieval of aggregate statistics,such as frequency and sample mean. The methods which are developed to providesecurity to SDBs fall under four general techniques: conceptual, query restriction,data perturbation, and output perturbation [46].

6.1.1 Conceptual Technique

Two models which fall under this general framework were proposed in the early 1980s. The conceptual model, described by Chin and Ozsoyoglu, serves as a platform for investigating security issues at the conceptual data model level. In this model, only the populations (entities with common attributes) and their statistics are accessible to the user. Merging and intersecting populations cannot be done by any data manipulation language, such as relational algebra. Within the conceptual model, the concept of atomic (A-) populations (i.e., the smallest sub-populations which cannot be decomposed any further and which contain either no or at least two entities) is used to deny information requests about a single entity. The lattice model describes SDB information represented in high-dimensional tabular format at different levels of aggregation. The cell suppression technique employed by the conceptual model for single entries may lead to the disclosure of redundant information. In the lattice design, several aggregate tables are produced by aggregating the high-dimensional table along each dimension, corresponding to each attribute. This process creates coarser aggregate tables which form a lattice. Allowing access only to aggregate tables gives more robust security to the database.

6.1.2 Query Restriction Technique

General query restriction techniques which have been developed include query-set-size control, query-set-overlap control, cell suppression, and partitioning. The query-set-size control method permits the release of statistics only if the size of the query set $|C|$ satisfies the condition $K \leq |C| \leq L - K$, where $L$ is the size of the database (the number of entities represented in the database) and $K$ is a parameter set by the curator (with the condition $0 \leq K \leq \frac{L}{2}$). For example, with $L = 1000$ and $K = 10$, only query sets containing between 10 and 990 entities would be answered. The drawback of this technique is that the privacy it provides can be compromised by the use of a tracker, even when $K$ is close to $\frac{L}{2}$. The query-set-overlap control mechanism responds to a query set only when the size of its intersection with the preceding queries of a given user is less than some parameter. This technique has its drawbacks because it does not fend off attacks from multiple users, it does not allow statistics to be released for both a group and its subgroups (for example, all students and the students taking a particular course), which limits the utility of the database, and user profiles must be updated continuously. The partitioning mechanism is based on the concept of clustering individual entities into several mutually exclusive subsets (similar to the idea of an atomic population) by partitioning the values of attributes. As in the case of atomic populations, one of the problems which arises from partitioning is the emergence of subsets with only a single entity. Merging distinct attributes may solve this issue, but it may also lead to information loss. The cell suppression technique removes from the released table all cells which would lead to a release of confidential information (all cells with sensitive information, and all cells with non-confidential information which could nevertheless lead to the release of confidential information). Finally, auditing involves keeping a trail of all queries entered by each user and always checking whether privacy would be compromised by the entry of a new query. Auditing, for this reason, is not economical regarding CPU time and storage, as the log keeps accumulating queries.


6.1.3 Perturbation (Input Perturbation)

This technique introduces noise into the underlying statistical database before making it available to users, and is considered relatively effective in protecting the privacy of statistical databases compared to the previously discussed mechanisms. Two types of data perturbation techniques are discussed in this thesis. The probability distribution approach considers the database as a sample from a given population with a particular probability distribution; the technique therefore takes another sample from the same population (thus with a similar probability distribution) and replaces the original statistical database with it. The probability distribution itself, instead of another sample, can also be used to replace the original statistical database. The value falsification approach perturbs the values of the attributes within the statistical database using multiplicative or additive noise, or other randomized processes.

6.1.4 Output Perturbation

This technique is different from data perturbation because with output perturbation the noise is injected into each query result instead of into the database itself.

6.2 Frameworks on Differential Privacy based Data Analysis

Early works on differential privacy focused on manually proving whether a particular algorithm is differentially private, i.e., that the responses to specific queries do not lead to information leakage. Lately, several systems which can mechanically perform differentially private data analysis (without expert intervention) have been developed. These systems allow untrusted users, with no expertise in privacy, to write algorithms and run statistical analyses without being occupied with privacy requirements beyond the defined privacy policy. In the interactive setting of data analysis, in which the user can only access the data through the interface and obtains only aggregate information, the systems provide the functionality of differential privacy to their users. We discuss some of these works in the following few paragraphs.


6.2.1 Privacy Integrated Queries (PINQ)

The first mechanism, proposed by McSherry [21] with the aim of protecting ε-differential privacy, was based on the concept of designing agents that track the privacy budget consumed by the queries (at run time) and cancel the computation when the budget is exhausted. This mechanism was used by McSherry to implement PINQ as a capability-based system. The implementation is based on the LINQ declarative query language, an SQL-like language embedded in C#. Data providers can use PINQ to wrap LINQ data sources in protected objects with an encoded differential privacy bound epsilon. The wrapper uses Laplace noise and the exponential mechanism to enforce differential privacy. The analyst can access a database through a query interface exposed by the thin privacy-preserving layer. The access layer implements differential privacy by adding carefully calibrated noise to each query. PINQ's restrictive language, which does not allow database transformation operations (for example, Select, Where, Join, Group By) unless they are followed by an aggregation operation, and its run-time checks ensure that the total amount of noise respects the encoded privacy bound epsilon. Despite PINQ's restrictive language, the analyst can use this interface for aggregate operations such as count (NoisyCount), sum (NoisySum) and average (NoisyAvg).

6.2.2 Airavat

Airavat combines an approach similar to PINQ with Mandatory Access Control (MAC) in a distributed, Java-based MapReduce framework, which can accelerate the processing of large datasets. Airavat implements a simple model consisting of a query which contains a sequence of not-necessarily-trusted chained micro-queries, called "mappers", and a subset from among a fixed list of macro-queries within the system, called "reducers", which are part of the trusted base. The mappers are responsible for updating the privacy budget and determining whether to continue or abort the analysis based on the adequacy of the remaining budget. When the analyst submits a query, they must also declare the anticipated numerical range of its outputs, an action similar to stating the sensitivity level. The reported sensitivity level is important for Airavat's calculation of the amount of noise which must be added to the reducers' outputs (after applying the aggregation function) to achieve ε-differential privacy.


6.2.3 Fuzz

PINQ and Airavat assume that the adversary can see only the results of his/her query, which ignores the fact that the adversary can guess some attributes with a high level of certainty by strategically observing CPU activity, execution time and global variables. Fuzz addresses this shortcoming by isolating the physical machine and allowing users to communicate with the database over the network only [8]. Fuzz uses a new type system to statically infer the privacy cost of arbitrary queries written in a special programming language, and it uses a primitive called predictable transactions to prevent information leakage through the execution-time side channel. Fuzz splits each query into a set of micro-queries. Each micro-query is expected to return within a specified time, its deadline; otherwise it is aborted and a default value is returned as the result of the micro-query execution. If the micro-query execution takes less time, the system waits and returns the result only after that particular time. Using this approach each query takes the same predictable amount of time on all databases of the same size.

6.2.4 GUPT

GUPT [45] is a platform which uses a new model of data sensitivity that decreases the privacy requirement of data over time. It uses this aging model of data sensitivity to describe the privacy budget in terms of the accuracy of the final output. This model enables GUPT to select an optimal size that reduces the perturbation added for differential privacy. GUPT also automatically distributes the privacy budget to each query according to the accuracy requirements. It uses a sample-and-aggregate differential privacy framework, and the data resampling method used by GUPT minimizes the error caused by the data partitioning scheme. Finally, the GUPT platform is safe against side-channel attacks such as timing attacks, privacy budget attacks, and state attacks.

6.3 Work on Differential Privacy based Query Analysis

6.3.1 Interactive (Online) Setting

In this setting, the data analyst submits queries to the administrator in an interactive way, based on the observed answers to previous queries, and the queries are answered immediately with no knowledge of future queries. Under the interactive setting, maintaining privacy and accuracy at the same time is difficult if a large


number of queries are submitted. Since the early work of Dinur and Nissim [22], in which they applied a polynomial reconstruction algorithm to SDBs to show that large perturbation is necessary to maintain privacy, several lines of research have been conducted to address this issue.

Roth and Roughgarden [15] introduced the median mechanism, which improves upon the independent Laplace mechanism and answers exponentially more interactive counting queries. The motivation for this mechanism is the classification of queries as “easy” and “hard” (with hard queries defined as queries whose answers completely determine the answers to all of the other queries) without exhausting the privacy budget. A basic implementation of the median mechanism is inefficient, since it requires sampling from a set of super-polynomial size, while a more efficient implementation comes at the cost of weaker utility.

The private multiplicative weights mechanism, whose goal is to answer queries through a privacy-preserving multiplicative weights update, was later developed by Hardt and Rothblum [20]. Its main result is a running time only linear in N (for each of the k queries), while the error scales roughly as (1/√n) · log k. Moreover, the proposed mechanism makes partial progress towards side-stepping previous negative results in the work of Dwork et al. [4] by relaxing the utility notion: Hardt and Rothblum [20] considered accuracy guarantees for the class of pseudo-smooth databases (i.e., underlying distributions that do not put too much weight on any particular data item) with sublinear running time. In later work, Gupta et al. [61], through a simple modular analysis, gave improved accuracy bounds for linear queries in the private multiplicative weights mechanism.
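A toy sketch of the multiplicative weights update is given below; the privacy accounting over hard queries is omitted, and the threshold, learning rate and noise scale are illustrative choices rather than the parameters from Hardt and Rothblum’s analysis.

```python
import numpy as np

def private_multiplicative_weights(hist, queries, epsilon, threshold, eta=0.05):
    """Toy private multiplicative weights: keep a synthetic distribution over the
    data universe and update it only on 'hard' queries whose noisy answer is far
    from the synthetic estimate. Privacy budget tracking is omitted."""
    n = hist.sum()
    true_dist = hist / n                       # normalized true histogram
    synth = np.ones(len(hist)) / len(hist)     # start from the uniform distribution
    answers = []
    for q in queries:                          # q: 0/1 numpy vector over the universe
        noisy = q @ true_dist + np.random.laplace(scale=1.0 / (epsilon * n))
        estimate = q @ synth
        if abs(noisy - estimate) <= threshold:
            answers.append(estimate)           # easy query: answer from synthetic data
        else:
            answers.append(noisy)              # hard query: answer with noise, then update
            sign = 1.0 if noisy > estimate else -1.0
            synth = synth * np.exp(eta * sign * q)   # multiplicative weights step
            synth = synth / synth.sum()              # re-normalize to a distribution
    return answers
```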

6.3.2 Non-Interactive (Offline) Setting

In this setting, the curator sanitizes the data before publishing a “safer” or “anonymized” version of the DB (for example, histograms or summary tables) or a synthetic DB (with the same distribution as the original DB) once and for all, and has no role after the release.

In the last few years, DP has transitioned from the conceptual level to application. It has been applied to several real-world datasets, a few of which we discuss below.


6.3.3 Histogram and Contingency Table

Using histograms is an effective way of statistically summarizing attributes. A histogram aggregates data points into intervals (groups) and represents each group with nonoverlapping bins corresponding to the exact counts of the data points. Chawla et al. [24] were the first to introduce histogram sanitization after a formal privacy definition had been agreed upon. They proposed two sanitization mechanisms: recursive histogram sanitization, in which bins are partitioned recursively into smaller cells until no region contains 2t or more real data points; and density-based input perturbation, in which noise from a spherically symmetric distribution (e.g., a Gaussian distribution) is added to the data points. Recently, Xu et al. [60], motivated by their observation that the quality of a histogram structure depends on balancing information loss against noise scale, proposed two algorithms for differentially private histogram computation: NoiseFirst (which determines the histogram structure after injecting random noise) and StructureFirst (which injects random Laplace noise into each count after determining the optimal histogram structure on the original count sequence). Afterward, they adapted DP-histograms to answer arbitrary range-count queries. The researchers used real-world datasets from the IPUMS census records of Brazil, search logs from Google Trends and AOL, NetTrace (an IP-level network trace dataset) and Social Network files in their experiments. The experimental results show that, compared to other proposed methods [14, 25, 30], NoiseFirst usually returns more accurate results for range-count queries with short ranges, especially unit-length queries, and provides a histogram with better visualization of the data distribution. Queries over large ranges, on the other hand, are better handled by StructureFirst.
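For reference, the baseline step that both algorithms refine, namely perturbing each bin count with Laplace noise of scale 1/ε, looks as follows in Python (a minimal sketch; the structure optimization of NoiseFirst and StructureFirst is not shown).

```python
import numpy as np

def noisy_histogram(values, bin_edges, epsilon):
    """Differentially private histogram: adding or removing one record changes
    exactly one bin count by one, so Laplace noise of scale 1/epsilon per bin
    suffices (parallel composition over disjoint bins)."""
    counts, _ = np.histogram(values, bins=bin_edges)
    return counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
```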


A contingency table is a table which summarizes the frequencies (counts) of attributes in the dataset. These counts are called marginals and can be used to compute correlations between attributes within the dataset; the calculated correlations are then used to reduce the amount of noise required for privacy protection. Barak et al. [2] have proposed methods to release a set of consistent marginals of a contingency table while preserving the privacy and accuracy of the original data. Their approach can be viewed as a general approach for synthetic data production: they utilize a Fourier transformation of the data to estimate low-order marginals whose counts are non-negative integers and whose sum is consistent with a set of marginals.

Xiao et al. have devised a differentially private method, Privelet (based on the Haar wavelet transformation), which optimizes range-count queries (count queries where the predicate on each attribute is a range) by reducing the magnitude of noise needed to guarantee ε-differential privacy when publishing a multidimensional frequency matrix. Privelet preserves privacy by modifying the frequency matrix M of the input data: the wavelet transform is first applied to M to produce another matrix C, into which polylogarithmic noise is then injected. They considered a scenario where count queries overlap (i.e., where at least one tuple satisfies multiple queries); under their approach, queries with smaller answers are injected with less noise, while queries with larger answers are injected with more noise. For their experiments, Xiao et al. used census data from both Brazil and the US.

Hay et al. [40] propose an approach based on hierarchical sums and least squares for achieving ε-differential privacy while ensuring a polylogarithmic noise variance in range-count query responses. Given a uni-dimensional frequency matrix M, Hay et al.’s algorithm adds Laplace noise directly to the replies of range-count queries on M, which it has computed beforehand, and then produces a noisy frequency matrix M* based on these noisy responses. The researchers tested their method on three datasets: SearchLog, NetTrace and Social Network data. Though both Privelet and Hay et al.’s method have comparable utility guarantees, Hay et al.’s algorithm is devised for uni-dimensional datasets, unlike the Privelet algorithm, which is also applicable to multi-dimensional queries. Recently, Li et al. [6] generalized both the Privelet and Hay et al. algorithms by introducing a two-stage process, the matrix mechanism, for answering a workload of linear counting queries. A separate set of queries (strategy queries) is used as a query proxy to the DB: noisy responses are produced for these strategy queries using the Laplace mechanism, and then noisy responses to the workload queries are derived from them. The researchers used this approach to exploit stronger correlation in the noise distribution, which preserves differential privacy while increasing accuracy.
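The pattern shared by these approaches, namely answering a transformed set of queries with Laplace noise and deriving the requested answers from the noisy transform, can be illustrated with a much simplified Haar-wavelet sketch; the uniform per-level noise scale below is a crude assumption and not the calibration used by Privelet or the matrix mechanism.

```python
import numpy as np

def haar_forward(v):
    """Unnormalized Haar decomposition of a vector whose length is a power of two."""
    coeffs, cur = [], np.asarray(v, dtype=float)
    while len(cur) > 1:
        coeffs.append((cur[0::2] - cur[1::2]) / 2.0)   # detail coefficients
        cur = (cur[0::2] + cur[1::2]) / 2.0            # averages for the next level
    coeffs.append(cur)                                  # overall average
    return coeffs

def haar_inverse(coeffs):
    cur = coeffs[-1]
    for diff in reversed(coeffs[:-1]):
        nxt = np.empty(2 * len(cur))
        nxt[0::2], nxt[1::2] = cur + diff, cur - diff
        cur = nxt
    return cur

def wavelet_noisy_counts(freq, epsilon):
    """Transform the frequency vector, perturb every coefficient with Laplace
    noise, and invert; noisy range counts are then sums over the result."""
    coeffs = haar_forward(freq)
    noisy = [c + np.random.laplace(scale=len(coeffs) / epsilon, size=c.shape)
             for c in coeffs]
    return haar_inverse(noisy)
```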


Chapter 7

Conclusion and Future Work

Differential privacy has become the de facto standard for guaranteeing privacy and is currently one of the most anticipated research topics in data privacy, used in a wide range of applications. Hence, in our thesis we carried out a study on differential privacy, tracing it from its earliest history, from Dalenius’ initial mathematical methodology that predates the birth of differential privacy, through Dwork’s landmark paper defining differential privacy, to an extensive analysis of the state of the art.

According to the experimental results, we can clearly see that differential privacy is a useful and powerful tool. Be that as it may, like all other tools it must be used appropriately, and one must understand what one is doing in order to ultimately obtain the privacy guarantees that differential privacy offers.

Thus, our general understanding of differential privacy from our thesis is as follows:

• Differential privacy is a strict privacy methodology with sufficient theory in support.

• Differential privacy protects sensitive data by adding randomized noise to the actual values, and one must understand that these noisy values are not the real values when using them in the real world.

• While the Laplace mechanism is a sound approach to achieving differential privacy, depending on the type of dataset it can produce widely fluctuating results, which limits its applicability in practice.

To date, dozens of inquiries have been made into choosing the right value of ε, which plays a significant role in differential privacy; however, there is still no general consensus on how to select it.


Therefore, future work should tackle this issue and work towards a better general approach to choosing the right value of ε.

While the underlying mechanisms of differential privacy are efficient enough to provide the required data privacy, they might not be flexible enough to be applied in all real-life scenarios, which can hinder achieving the maximum security and usability needed. Hence, analyzing and customizing other types of mechanisms besides the basic ones would be a worthwhile direction for future work.

Under data dependency, differential privacy rests on a vulnerable assumption that can lead to a reduction in the expected privacy level when applied to real-world datasets that exhibit natural dependence owing to various social, behavioral, and genetic relationships between users. This weak assumption is exposed by attacks, such as inference attacks, against differential privacy mechanisms. Thus, future work should also consider working towards a better mechanism that significantly improves the existing tools under this assumption.


Bibliography

[1] Nabil R. Adam and John C. Wortmann. “Security-Control Methods for Statistical Databases: A Comparative Study”. In: ACM Comput. Surv. 21.4 (1989), pp. 515–556.

[2] Airavat. URL: http://z.cs.utexas.edu/users/osa/airavat/.

[3] Daniel C. Barth-Jones. The ’Re-Identification’ of Governor William Weld’s Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now. Tech. rep. Columbia University - Mailman School of Public Health, Department of Epidemiology, July 2012.

[4] Shuchi Chawla et al. “Toward Privacy in Public Databases”. In: Proceedings of the Second International Conference on Theory of Cryptography. TCC’05. Cambridge, MA: Springer-Verlag, 2005, pp. 363–385. ISBN: 3-540-24573-1, 978-3-540-24573-5.

[5] Chronicle of AOL search query log release incident. URL: http://sifaka.cs.uiuc.edu/xshen/aol/aol_querylog.html.

[6] Paulo Cortez and Alice Maria Gonçalves Silva. “Using data mining to predict secondary school student performance”. In: (2008).

[7] Tore Dalenius. “Towards a methodology for statistical disclosure control”. In: Statistik Tidskrift 15 (1977), pp. 429–444.

[8] Irit Dinur and Kobbi Nissim. “Revealing Information While Preserving Privacy”. In: Proceedings of the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. PODS ’03. San Diego, California: ACM, 2003, pp. 202–210. ISBN: 1-58113-670-6.

[9] Cynthia Dwork. “An Ad Omnia Approach to Defining and Achieving Private Data Analysis”. In: Proceedings of the 1st ACM SIGKDD International Conference on Privacy, Security, and Trust in KDD. PinKDD’07. San Jose, CA, USA: Springer-Verlag, 2008, pp. 1–13. ISBN: 3-540-78477-2, 978-3-540-78477-7.


[10] Cynthia Dwork. “Differential Privacy”. In: Automata, Languages and Programming, 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part II. 2006, pp. 1–12.

[11] Cynthia Dwork. “Differential Privacy”. In: Automata, Languages and Programming: 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part II. Ed. by Michele Bugliesi et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 1–12. ISBN: 978-3-540-35908-1.

[12] Cynthia Dwork. “Differential Privacy: A Survey of Results”. In: Proceedings of the 5th International Conference on Theory and Applications of Models of Computation. TAMC’08. Xi’an, China: Springer-Verlag, 2008, pp. 1–19. ISBN: 3-540-79227-9, 978-3-540-79227-7.

[13] Cynthia Dwork and Aaron Roth. “The Algorithmic Foundations of Differential Privacy”. In: Foundations and Trends in Theoretical Computer Science 9.3-4 (2014), pp. 211–407.

[14] Cynthia Dwork et al. “Calibrating Noise to Sensitivity in Private Data Analysis”. In: Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings. Ed. by Shai Halevi and Tal Rabin. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 265–284. ISBN: 978-3-540-32732-5.

[15] Cynthia Dwork et al. “On the Complexity of Differentially Private Data Release: Efficient Algorithms and Hardness Results”. In: Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing. STOC ’09. Bethesda, MD, USA: ACM, 2009, pp. 381–390. ISBN: 978-1-60558-506-2.

[16] Cynthia Dwork et al. “Our Data, Ourselves: Privacy Via Distributed Noise Generation”. In: Advances in Cryptology - EUROCRYPT 2006, 25th Annual International Conference on the Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May 28 - June 1, 2006, Proceedings. 2006, pp. 486–503.

[17] Cynthia Dwork et al. “Our Data, Ourselves: Privacy Via Distributed Noise Generation”. In: Advances in Cryptology - EUROCRYPT 2006: 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May 28 - June 1, 2006. Proceedings. Ed. by Serge Vaudenay. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 486–503. ISBN: 978-3-540-34547-3.


[18] Philippe Golle. “Revisiting the Uniqueness of Simple Demographics in the US Population”. In: Proceedings of the 5th ACM Workshop on Privacy in Electronic Society. WPES ’06. Alexandria, Virginia, USA: ACM, 2006, pp. 77–80. ISBN: 1-59593-556-8.

[19] Andy Greenberg. Apple Differential Privacy. URL: https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/.

[20] Anupam Gupta, Aaron Roth, and Jonathan Ullman. “Iterative Constructions and Private Data Release”. In: CoRR abs/1107.3731 (2011).

[21] Andreas Haeberlen, Benjamin C. Pierce, and Arjun Narayan. “Differential Privacy Under Fire”. In: Proceedings of the 20th USENIX Conference on Security. SEC’11. San Francisco, CA: USENIX Association, 2011, pp. 33–33.

[22] Moritz Hardt and Guy N. Rothblum. “A Multiplicative Weights Mechanism for Privacy-Preserving Data Analysis”. In: 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA. 2010, pp. 61–70.

[23] Moritz Hardt and Kunal Talwar. “On the Geometry of Differential Privacy”. In: CoRR abs/0907.3754 (2009).

[24] Michael Hay et al. “Boosting the Accuracy of Differentially Private Histograms Through Consistency”. In: Proc. VLDB Endow. 3.1-2 (Sept. 2010), pp. 1021–1032. ISSN: 2150-8097.

[25] Michael Hay et al. “Boosting the Accuracy of Differentially Private Histograms Through Consistency”. In: Proc. VLDB Endow. 3.1-2 (Sept. 2010), pp. 1021–1032. ISSN: 2150-8097.

[26] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python. [Online; accessed 2017-02-01]. 2001–. URL: http://www.scipy.org/.

[27] Shiva Prasad Kasiviswanathan and Adam Smith. “On the ‘Semantics’ of differential privacy: A Bayesian formulation”. In: The Journal of Privacy and Confidentiality 6.1 (2014), pp. 1–16.

[28] Jaewoo Lee and Chris Clifton. “How Much Is Enough? Choosing ε for Differential Privacy”. In: Information Security, 14th International Conference, ISC 2011, Xi’an, China, October 26-29, 2011. Proceedings. 2011, pp. 325–340.

[29] Chao Li and Gerome Miklau. “An Adaptive Mechanism for Accurate Query Answering under Differential Privacy”. In: CoRR abs/1202.3807 (2012).


[30] Chao Li et al. “Optimizing Linear Counting Queries Under Differential Privacy”. In: Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. PODS ’10. Indianapolis, Indiana, USA: ACM, 2010, pp. 123–134. ISBN: 978-1-4503-0033-9.

[31] Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. “t-closeness: Privacy beyond k-anonymity and l-diversity”. In: ICDE. 2007.

[32] M. Lichman. UCI Machine Learning Repository. 2013. URL: http://archive.ics.uci.edu/ml/datasets/Adult.

[33] Ashwin Machanavajjhala et al. “l-Diversity: Privacy Beyond k-Anonymity”. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3-8 April 2006, Atlanta, GA, USA. 2006, p. 24.

[34] Ashwin Machanavajjhala et al. “l-diversity: Privacy beyond k-anonymity”. In: ICDE. 2006.

[35] Frank McSherry. “Privacy Integrated Queries: An Extensible Platform for Privacy-preserving Data Analysis”. In: Commun. ACM 53.9 (Sept. 2010), pp. 89–97. ISSN: 0001-0782.

[36] Frank McSherry and Kunal Talwar. “Mechanism Design via Differential Privacy”. In: Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science. FOCS ’07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 94–103. ISBN: 0-7695-3010-9.

[37] Arvind Narayanan and Vitaly Shmatikov. “How To Break Anonymity of the Netflix Prize Dataset”. In: CoRR abs/cs/0610105 (2006).

[38] Arvind Narayanan and Vitaly Shmatikov. “Robust De-anonymization of Large Sparse Datasets”. In: Proceedings of the 2008 IEEE Symposium on Security and Privacy. SP ’08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 111–125. ISBN: 978-0-7695-3168-7.

[39] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. “Smooth Sensitivity and Sampling in Private Data Analysis”. In: Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing. STOC ’07. San Diego, California, USA: ACM, 2007, pp. 75–84. ISBN: 978-1-59593-631-8.

[40] Fabio Pagnotta and HM Amran. “Using data mining to predict secondary school student alcohol consumption”. In: Department of Computer Science, University of Camerino ().


[41] pandas: Python Data Analysis Library. Online. 2012. URL: http://pandas.pydata.org/.

[42] Fernando Pérez and Brian E. Granger. “IPython: a System for Interactive Scientific Computing”. In: Computing in Science and Engineering 9.3 (May 2007), pp. 21–29. ISSN: 1521-9615. URL: http://ipython.org.

[43] Shiva Prasad Kasiviswanathan and Adam Smith. A Note on Differential Privacy: Defining Resistance to Arbitrary Side Information. 803.

[44] Aaron Roth and Tim Roughgarden. “Interactive Privacy via the Median Mechanism”. In: Proceedings of the Forty-second ACM Symposium on Theory of Computing. STOC ’10. Cambridge, Massachusetts, USA: ACM, 2010, pp. 765–774. ISBN: 978-1-4503-0050-6.

[45] Aaron Roth and Tim Roughgarden. “Interactive Privacy via the Median Mechanism”. In: Proceedings of the Forty-second ACM Symposium on Theory of Computing. STOC ’10. Cambridge, Massachusetts, USA: ACM, 2010, pp. 765–774. ISBN: 978-1-4503-0050-6.

[46] Indrajit Roy et al. “Airavat: Security and Privacy for MapReduce”. In: Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation. NSDI’10. San Jose, California: USENIX Association, 2010, pp. 20–20.

[47] Pierangela Samarati and Latanya Sweeney. Protecting Privacy when Disclosing Information: k-Anonymity and Its Enforcement through Generalization and Suppression. Tech. rep. 1998.

[48] Rathindra Sarathy and Krishnamurty Muralidhar. “Evaluating Laplace Noise Addition to Satisfy Differential Privacy for Numeric Data”. In: Trans. Data Privacy 4.1 (Apr. 2011), pp. 1–17. ISSN: 1888-5063.

[49] S Stata. “Release 13. Statistical software”. In: StataCorp LP, College Station, TX (2013).

[50] L Sweeney. Uniqueness of Simple Demographics in the U.S. Population. Tech. rep. Technical Report LIDAP-WP4. Pittsburgh: School of Computer Science, Data Privacy Laboratory, July 2000.

[51] Latanya Sweeney. “K-anonymity: A Model for Protecting Privacy”. In: Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10.5 (Oct. 2002), pp. 557–570. ISSN: 0218-4885.


[52] Latanya Sweeney. k-Anonymity: A Model for Protecting Privacy. 2002.

[53] Christine Task. “An Illustrated Primer in Differential Privacy”. In: XRDS 20.1 (Sept. 2013), pp. 53–57. ISSN: 1528-4972.

[54] Hamid Ebadi Tavallaei. “PINQuin, a framework for differentially private analysis”. Chalmers University of Technology, 2012.

[55] V. N. Vapnik and A. Ya. Chervonenkis. “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities”. In: Theory of Probability and its Applications 16.2 (1971), pp. 264–280.

[56] Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. “The NumPy array: a structure for efficient numerical computation”. In: CoRR abs/1102.1523 (2011).

[57] Samuel D. Warren and Louis D. Brandeis. “The Right to Privacy”. In: Harvard Law Review 4.5 (Dec. 1890), pp. 193–220.

[58] Wikipedia. AOL search data leak. Nov. 2016. URL: https://en.wikipedia.org/wiki/AOL_search_data_leak.

[59] Wikipedia. Laplace distribution. [Online; accessed 15-Dec-2016]. 2016. URL: https://en.wikipedia.org/wiki/Laplace_distribution.

[60] Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. “Differential Privacy via Wavelet Transforms”. In: IEEE Trans. on Knowl. and Data Eng. 23.8 (Aug. 2011), pp. 1200–1214. ISSN: 1041-4347.

[61] Jia Xu et al. “Differentially Private Histogram Publication”. In: The VLDB Journal 22.6 (Dec. 2013), pp. 797–822. ISSN: 1066-8888.

