Privacy and k-Anonymity
Guy Sagy, November 2008
Seminar in Databases (236826)

Outline
- Introduction
- k-Anonymity
- Generalization & Suppression
- MinGen - Theoretical Algorithm
- Mondrian - A greedy partition algorithm

What is Privacy?
- Society is experiencing exponential growth in the number and variety of data collections containing person-specific information.
- Sharing this collected information is valuable for both research and business, but publishing the data may put personal privacy at risk.
- Objective: maximize data utility while limiting disclosure risk to an acceptable level.
- Note:
  - There is no clear definition of "disclosure" or of "acceptable level".
  - This is not the traditional security of data (e.g., access control, theft, hacking).

Example
- For medical research (e.g., genetics, infectious diseases), a hospital has person-specific patient data that it wants to publish.
- It wants to publish the data such that:
  - the information remains practically useful;
  - the identity of an individual cannot be determined.
- An adversary might infer the secret/sensitive data from the published database.

Example - cont.
The data contains:
- Identifiers: {name, ssn}
- Non-sensitive data: {zip-code, nationality, age}
- Sensitive data: {medical condition, salary, location}

Name is an identifier; Zip, Age and Nationality are non-sensitive data; Condition is sensitive data.
| # | Name | Zip | Age | Nationality | Condition |
| 1 | Kumar | 13053 | 28 | Indian | Heart Disease |
| 2 | Bob | 13067 | 29 | American | Heart Disease |
| 3 | Ivan | 13053 | 35 | Canadian | Viral Infection |
| 4 | Umeko | 13067 | 36 | Japanese | Cancer |

Example - cont. [SW02-A]
Published data (Zip, Age and Nationality are non-sensitive data; Condition is sensitive data):
| # | Zip | Age | Nationality | Condition |
| 1 | 13053 | 28 | Indian | Heart Disease |
| 2 | 13067 | 29 | American | Heart Disease |
| 3 | 13053 | 35 | Canadian | Viral Infection |
| 4 | 13067 | 36 | Japanese | Cancer |

Voter list:
| # | Name | Zip | Age | Nationality |
| 1 | John | 13053 | 28 | American |
| 2 | Bob | 13067 | 29 | American |
| 3 | Chris | 13053 | 23 | American |

Data leak! Do we have a privacy violation? Bob's (Zip, Age, Nationality) values match exactly one row of the published data, which reveals that he has heart disease.
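
The linking attack on this slide can be sketched in a few lines of Python. This is only an illustration: the dictionary keys and the helper name `link` are assumptions made for the example, and the rows are copied from the two tables above.

```python
# Published table: names removed, but Zip/Age/Nationality are still present.
published = [
    {"zip": "13053", "age": 28, "nationality": "Indian", "condition": "Heart Disease"},
    {"zip": "13067", "age": 29, "nationality": "American", "condition": "Heart Disease"},
    {"zip": "13053", "age": 35, "nationality": "Canadian", "condition": "Viral Infection"},
    {"zip": "13067", "age": 36, "nationality": "Japanese", "condition": "Cancer"},
]

# Public voter list: names together with the same three attributes.
voters = [
    {"name": "John", "zip": "13053", "age": 28, "nationality": "American"},
    {"name": "Bob", "zip": "13067", "age": 29, "nationality": "American"},
    {"name": "Chris", "zip": "13053", "age": 23, "nationality": "American"},
]

QI = ("zip", "age", "nationality")  # attributes shared by both tables


def link(published, voters, qi=QI):
    """Return (name, condition) for every voter matching exactly one published row."""
    leaks = []
    for voter in voters:
        matches = [row for row in published
                   if all(row[a] == voter[a] for a in qi)]
        if len(matches) == 1:  # a unique match re-identifies the person
            leaks.append((voter["name"], matches[0]["condition"]))
    return leaks


leaks = link(published, voters)
print(leaks)
```

Only Bob is re-identified here; John and Chris do not match any published row on all three attributes.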

Example - cont. [SW02-A]
- The Group Insurance Commission (GIC) in Massachusetts sold data on state employees' health that was believed to be anonymous.
- The voter registration list for Cambridge, Massachusetts was sold for $20.
- William Weld was governor of Massachusetts and lived in Cambridge, Massachusetts:
  - six people had his particular birth date;
  - three of them were men;
  - he was the only one in his 5-digit ZIP code.

Medical data only: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
Voter list only: Name, Address, Date registered, Party affiliation, Date last voted
Quasi-Identifier (QI), present in both: Zip, Birthdate, Gender

Example-2 - AOL (2006)
| AnonID | Query | QueryTime | ItemRank | ClickURL |
| 1326 | konig wheels | 18/04/2006 13:29 | 1 | http://www.konigwheels.com |
| 1326 | jet blue airlines | 27/04/2006 15:29 | | |
| 1326 | coats tire equipment | 28/04/2006 15:53 | | |
| 1326 | coats tire equipment | 03/05/2006 19:15 | | |
| 1326 | verizon wireless | 09/05/2006 00:09 | | |
| 1326 | www.crazyradiodeals.com | 23/05/2006 18:00 | | |
| 1337 | uslandrecords.com | 01/03/2006 11:50 | 1 | http://www.seda-cog.org |
| 1337 | titlesourcein.com | 14/03/2006 15:45 | | |
| 1337 | titlesourceinc | 14/03/2006 15:45 | 1 | http://www.titlesourceinc.com |
| 1337 | select business services | 14/03/2006 15:51 | | |
| 1337 | select business services title | 14/03/2006 15:52 | | |
| 1337 | cbc companies | 14/03/2006 15:52 | 2 | http://www.cbc-companies.com |
| 1337 | cbc companies | 14/03/2006 15:52 | 3 | http://www.cbc-companies.com |
| 1337 | national real estate settlement services | 14/03/2006 15:59 | 1 | http://www.realtms.com |
Example2 – cont.
Example-3

k-Anonymity [SW02-A]
Change the data in such a way that, for each tuple in the resulting table, there are at least (k-1) other tuples with the same value for the quasi-identifier. The result is a k-anonymized table.
| # | Zip | Age | Nationality | Condition |
| 1 | 130** | < 40 | * | Heart Disease |
| 2 | 130** | < 40 | * | Heart Disease |
| 3 | 130** | < 40 | * | Viral Infection |
| 4 | 130** | < 40 | * | Cancer |
This is a 4-anonymized table. Why? All four tuples share the same quasi-identifier value (130**, < 40, *), so each tuple has 3 others that are indistinguishable from it.
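
The requirement above can be checked mechanically by counting how often each quasi-identifier value occurs. The following is a minimal sketch; the function name `is_k_anonymous` and the dictionary keys are assumptions made for this example.

```python
from collections import Counter


def is_k_anonymous(rows, qi, k):
    """True if every quasi-identifier value combination occurs at least k times."""
    counts = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(c >= k for c in counts.values())


QI = ("zip", "age", "nationality")

# The anonymized table from the slide.
table = [
    {"zip": "130**", "age": "< 40", "nationality": "*", "condition": "Heart Disease"},
    {"zip": "130**", "age": "< 40", "nationality": "*", "condition": "Heart Disease"},
    {"zip": "130**", "age": "< 40", "nationality": "*", "condition": "Viral Infection"},
    {"zip": "130**", "age": "< 40", "nationality": "*", "condition": "Cancer"},
]

print(is_k_anonymous(table, QI, 4))  # one QI group of size 4
print(is_k_anonymous(table, QI, 5))  # no group is that large
```

The sensitive attribute (Condition) is deliberately ignored: k-anonymity only constrains the quasi-identifier columns.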

k-Anonymity - Formal Definition
- $RT$: released table
- $(A_1, A_2, \ldots, A_n)$: attributes
- $QI_{RT}$: quasi-identifier
- $RT[QI_{RT}]$: projection of $RT$ on $QI_{RT}$
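
With this notation, the informal requirement from the previous slide can be written as a single condition. The formula below is a paraphrase, not a quotation of [SW02-B]; the tuple variables $t$ and $t'$ are introduced here for illustration only.

```latex
RT \text{ satisfies } k\text{-anonymity}
\iff
\forall t \in RT:\;
\bigl|\{\, t' \in RT : t'[QI_{RT}] = t[QI_{RT}] \,\}\bigr| \ge k
```

In words: every combination of quasi-identifier values that occurs in $RT[QI_{RT}]$ occurs there at least $k$ times.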

k-Anonymity Example [SW02-B]
| Tuple | Country | Birth | Gender | ZIP | Problem |
| t1 | USA | 1965 | m | 02141 | short breath |
| t2 | USA | 1965 | m | 02141 | chest pain |
| t3 | USA | 1964 | f | 02138 | obesity |
| t4 | USA | 1964 | f | 02138 | chest pain |
| t5 | Non-USA | 1964 | m | 02138 | chest pain |
| t6 | Non-USA | 1964 | m | 02138 | obesity |
| t7 | Non-USA | 1964 | m | 02138 | short breath |
Example of k-anonymity, where k=2 and QI={Country, Birth, Gender, ZIP}

k-Anonymity - The challenge
Theorem 1 in [SW02-B] claims: let $RT(A_1, \ldots, A_n)$ be a table, $QI_{RT} = (A_i, \ldots, A_j)$ be the quasi-identifier associated with $RT$, $\{A_i, \ldots, A_j\} \subseteq \{A_1, \ldots, A_n\}$, and let $RT$ satisfy k-anonymity. Then each sequence of values in $RT[A_x]$ appears with at least k occurrences in $RT[QI_{RT}]$ for $x = i, \ldots, j$.
Can we use this property to build a k-anonymous table easily? That is, does the converse hold: if each sequence of values in $RT[A_x]$ appears with at least k occurrences, is the table k-anonymous?
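
k-Anonymity - The challenge - cont.
| # | Zip | Age | Nationality | Condition |
| 1 | 1 | 20 | * | Heart Disease |
| 2 | 1 | 30 | * | Heart Disease |
| 3 | 2 | 20 | * | Viral Infection |
| 4 | 2 | 30 | * | Cancer |
No! Each Zip value and each Age value appears twice, yet every (Zip, Age) combination appears only once, so the table is not 2-anonymous.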
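
The counterexample can be verified by counting occurrences per attribute and then for the attributes combined. This is a small sketch; the helper `min_count` and the dictionary keys are assumptions made for the example, and the suppressed Nationality column is left out because it is constant.

```python
from collections import Counter

# Zip and Age columns of the counterexample table.
rows = [
    {"zip": "1", "age": 20},
    {"zip": "1", "age": 30},
    {"zip": "2", "age": 20},
    {"zip": "2", "age": 30},
]


def min_count(rows, attrs):
    """Smallest number of rows sharing the same values on `attrs`."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    return min(counts.values())


# Every single attribute value occurs twice...
per_attribute = {a: min_count(rows, (a,)) for a in ("zip", "age")}
# ...but every combination of values occurs only once.
combined = min_count(rows, ("zip", "age"))

print(per_attribute, combined)
```

So the per-attribute property of Theorem 1 is necessary for 2-anonymity here, but not sufficient.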

How to create k-Anonymity?
- Generalization: replace the original value by a semantically consistent but less specific value.
- Suppression: the data is not released at all. It can be viewed as the most general level of generalization.

| # | Zip | Age | Nationality | Condition |
| 1 | 130** | < 40 | * | Heart Disease |
| 2 | 130** | < 40 | * | Heart Disease |
In this table, Zip and Age are generalized and Nationality is suppressed.

Generalization & Hierarchies
ZIP:
- Z0 = {13053, 13058, 13063, 13067}
- Z1 = {1305*, 1306*}
- Z2 = {130**}
- Z3 = {*****}
Age:
- {28, 29} -> < 30
- {35, 36} -> 3*
- {< 30, 3*} -> < 40
- < 40 -> *
Nationality:
- {US, Canadian} -> American
- {Japanese, Indian} -> Asian
- {American, Asian} -> *
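
One possible way to encode such hierarchies in code is a masking rule for ZIP codes and a parent map for categorical values. The sketch below is an assumption about representation, not something prescribed by the slides; the names `generalize_zip`, `generalize` and `NATIONALITY_PARENT` are made up for the example.

```python
def generalize_zip(zip_code, level):
    """Map a 5-digit ZIP to level Z0..Z3: mask 1 or 2 digits, then everything."""
    if level >= 3:
        return "*" * len(zip_code)
    return zip_code[:len(zip_code) - level] + "*" * level


# Child -> parent edges of the Nationality hierarchy from the slide.
NATIONALITY_PARENT = {
    "US": "American", "Canadian": "American",
    "Japanese": "Asian", "Indian": "Asian",
    "American": "*", "Asian": "*",
}


def generalize(value, parent, level):
    """Climb `level` steps up a hierarchy; the root '*' has no parent and stays put."""
    for _ in range(level):
        value = parent.get(value, value)
    return value


print([generalize_zip("13053", lvl) for lvl in range(4)])
print([generalize("Indian", NATIONALITY_PARENT, lvl) for lvl in range(3)])
```

Suppression then needs no special case: it is simply the top level of each hierarchy.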

Generalization & Hierarchies
The number of generalized tables is:
$$\prod_{i=1}^{n} \left( |DGH_i| + 1 \right)$$
($|DGH_i|$ = maximum generalization level of $A_i$)
(Note: not every generalization creates a k-anonymous table.)
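
As a quick worked instance of this product, take the three hierarchies from the previous slide. The heights used below (3 for ZIP, 3 for Age, 2 for Nationality) are read off those hierarchies and should be treated as an assumption of this example.

```python
from math import prod

# Maximum generalization level |DGH_i| per attribute (assumed from the slide's hierarchies).
heights = {"zip": 3, "age": 3, "nationality": 2}

# Each attribute can independently sit at level 0..|DGH_i|, i.e. |DGH_i| + 1 choices.
num_generalizations = prod(h + 1 for h in heights.values())
print(num_generalizations)  # (3+1) * (3+1) * (2+1)
```

Even for this tiny example there are 48 candidate tables, which hints at why exhaustive search does not scale.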
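
Three possible generalizations of the same table:

| # | Zip | Age | Nationality | Condition |
| 1 | 13053 | < 40 | * | Heart Disease |
| 2 | 13053 | < 40 | * | Viral Infection |
| 3 | 13067 | < 40 | * | Heart Disease |
| 4 | 13067 | < 40 | * | Cancer |

| # | Zip | Age | Nationality | Condition |
| 1 | 130** | < 30 | American | Heart Disease |
| 2 | 130** | < 30 | American | Viral Infection |
| 3 | 130** | 3* | Asian | Heart Disease |
| 4 | 130** | 3* | Asian | Cancer |

| # | Zip | Age | Nationality | Condition |
| 1 | 130** | < 40 | * | Heart Disease |
| 2 | 130** | < 40 | * | Viral Infection |
| 3 | 130** | < 40 | * | Heart Disease |
| 4 | 130** | < 40 | * | Cancer |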

K-minimal Generalizations
Intuition: the generalization that does not generalize the data more than needed (over-generalizing decreases the utility of the published dataset!).
K-minimal generalization: $T_m$ is said to be a minimal generalization of $RT$ if
- $T_m$ satisfies the k-anonymity requirement with respect to $QI_{RT}$, and
- $\forall T_z$: $RT \le T_z$, $T_z \le T_m$, and $T_z$ satisfies the k-anonymity requirement with respect to $QI_{RT}$ $\Rightarrow$ $T_z = T_m$.
(Here $T_a \le T_b$ means that $T_b$ is a generalization of $T_a$.)

2-minimal generalizations:

| # | Zip | Age | Nationality | Condition |
| 1 | 13053 | < 40 | * | Heart Disease |
| 2 | 13053 | < 40 | * | Viral Infection |
| 3 | 13067 | < 40 | * | Heart Disease |
| 4 | 13067 | < 40 | * | Cancer |

| # | Zip | Age | Nationality | Condition |
| 1 | 130** | < 30 | American | Heart Disease |
| 2 | 130** | < 30 | American | Viral Infection |
| 3 | 130** | 3* | Asian | Heart Disease |
| 4 | 130** | 3* | Asian | Cancer |

NOT a 2-minimal generalization:

| # | Zip | Age | Nationality | Condition |
| 1 | 130** | < 40 | * | Heart Disease |
| 2 | 130** | < 40 | * | Viral Infection |
| 3 | 130** | < 40 | * | Heart Disease |
| 4 | 130** | < 40 | * | Cancer |

There are many k-minimal anonymized tables. Which one to pick?

K-minimal Generalizations
There are many k-minimal generalizations; which one is preferred, then? There is no clear and "correct" answer:
- The one that creates minimum distortion to the data, where distortion is
  $$D = \frac{\sum_{i} \dfrac{\text{level of generalization of } A_i}{|DGH_{A_i}|}}{\text{number of attributes}}$$
- The one that minimizes the normalized average equivalence class size metric
  $$C_{AVG} = \frac{\text{total\_records} / \text{total\_equiv\_classes}}{k}$$
- The one with minimum suppression
- The one that best supports the research (least damage to the "interesting" attributes)
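
Both metrics are easy to compute once the generalization levels and the anonymized rows are known. The sketch below assumes hierarchy heights of 3 (ZIP), 3 (Age) and 2 (Nationality); the function names `distortion` and `c_avg` are made up for this example.

```python
from collections import Counter


def distortion(levels, heights):
    """D: average over attributes of (generalization level / hierarchy height)."""
    return sum(levels[a] / heights[a] for a in levels) / len(levels)


def c_avg(rows, k):
    """C_AVG: (total records / number of equivalence classes) / k."""
    classes = Counter(rows)  # rows are tuples of quasi-identifier values
    return (len(rows) / len(classes)) / k


heights = {"zip": 3, "age": 3, "nationality": 2}  # assumed hierarchy heights

# Table with Zip 130**, Age < 30 or 3*, Nationality American or Asian.
table_b = [("130**", "< 30", "American"), ("130**", "< 30", "American"),
           ("130**", "3*", "Asian"), ("130**", "3*", "Asian")]
levels_b = {"zip": 2, "age": 1, "nationality": 1}

# Fully generalized table: Zip 130**, Age < 40, Nationality *.
table_c = [("130**", "< 40", "*")] * 4
levels_c = {"zip": 2, "age": 2, "nationality": 2}

print(distortion(levels_b, heights), c_avg(table_b, k=2))
print(distortion(levels_c, heights), c_avg(table_c, k=2))
```

Under both metrics the fully generalized table scores worse (higher), which matches the intuition that it generalizes more than needed.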

Algorithm for finding minimal generalization [SW02-B]
Theoretical model (MinGen):
- Store the set of all possible generalizations of RT over QI in allgens.
- From allgens, store all the tables that satisfy k-anonymity in protected.
- Define a comparison measure, score.
- From protected, choose the table with the best score.
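
The four steps above can be sketched as a brute-force search. This is a simplified illustration, not the algorithm as specified in [SW02-B]: it only enumerates whole-column generalization levels, uses the distortion measure as the score, and assumes the four-row example table with "US" standing in for Bob's nationality. All names (`mingen`, `HIERARCHIES`, `climb`, `gen_zip`) are hypothetical.

```python
from collections import Counter
from itertools import product

# Parent maps for the Age and Nationality hierarchies (assumed encoding).
AGE_PARENT = {28: "< 30", 29: "< 30", 35: "3*", 36: "3*",
              "< 30": "< 40", "3*": "< 40", "< 40": "*"}
NAT_PARENT = {"US": "American", "Canadian": "American",
              "Japanese": "Asian", "Indian": "Asian",
              "American": "*", "Asian": "*"}


def climb(value, parent, level):
    """Walk `level` steps up a hierarchy given as a child -> parent map."""
    for _ in range(level):
        value = parent.get(value, value)
    return value


def gen_zip(zip_code, level):
    """Mask the last `level` digits; level 3 masks the whole ZIP."""
    if level >= 3:
        return "*" * len(zip_code)
    return zip_code[:len(zip_code) - level] + "*" * level


# attribute -> (hierarchy height, generalization function)
HIERARCHIES = {
    "zip": (3, gen_zip),
    "age": (3, lambda v, lvl: climb(v, AGE_PARENT, lvl)),
    "nationality": (2, lambda v, lvl: climb(v, NAT_PARENT, lvl)),
}


def mingen(rows, k):
    """Return (score, levels, table) with the lowest distortion among k-anonymous tables."""
    attrs = list(HIERARCHIES)
    best = None
    # allgens: every combination of generalization levels
    for levels in product(*(range(HIERARCHIES[a][0] + 1) for a in attrs)):
        table = [tuple(HIERARCHIES[a][1](r[a], lvl) for a, lvl in zip(attrs, levels))
                 for r in rows]
        # protected: skip tables that violate k-anonymity
        if min(Counter(table).values()) < k:
            continue
        # score: average relative generalization level (lower is better)
        score = sum(lvl / HIERARCHIES[a][0] for a, lvl in zip(attrs, levels)) / len(attrs)
        if best is None or score < best[0]:
            best = (score, dict(zip(attrs, levels)), table)
    return best


rows = [
    {"zip": "13053", "age": 28, "nationality": "Indian"},
    {"zip": "13067", "age": 29, "nationality": "US"},
    {"zip": "13053", "age": 35, "nationality": "Canadian"},
    {"zip": "13067", "age": 36, "nationality": "Japanese"},
]
best = mingen(rows, k=2)
print(best[1], best[2])
```

The loop visits every level combination, so its cost grows with the product of the hierarchy heights; that is the exponential search space that motivates heuristic algorithms.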

Algorithm for finding minimal generalization
- The search space is exponential.
- The problem is NP-hard!
- We present one proposed algorithm [LDR06] (K. LeFevre, D.J. DeWitt, R. Ramakrishnan, 2006): a multidimensional algorithm (Mondrian).

Single Dimensional Partitioning
A single-dimensional partitioning defines, for each attribute $A_i$, a set of non-overlapping single-dimensional intervals that cover $D_{X_i}$.

| Age (data) | Age (partitioning) |
| 20 | 20-24 |
| 22 | 20-24 |
| 22 | 20-24 |
| 24 | 20-24 |
| 26 | 26-31 |
| 30 | 26-31 |
| 30 | 26-31 |
| 31 | 26-31 |
| 38 | 38-44 |
| 40 | 38-44 |
| 42 | 38-44 |
| 44 | 38-44 |

Single Dimensional Partitioning
[Figure: grid over Age (intervals 20-24, 26-31, 38-44) and Zip Code (intervals 2120-2129, 2130-2139, 2140-2149). Caption: 12 areas of partitioning.]

Multidimensional Partitioning
- Assume all attributes come from a discrete numeric domain (every set can be mapped to such a domain).
- The domain of $A_i$ is denoted by $D_{X_i}$.
- Each tuple can be represented as $(v_1, v_2, \ldots, v_d) \in D_{X_1} \times D_{X_2} \times \ldots \times D_{X_d}$.
- A multidimensional partitioning defines a set of multidimensional regions.

Multidimensional Partitioning - cont.
Attributes = {ZipCode, Age}

Multidimensional Partitioning - Why is it good?
Voter registration data:
| Name | Age | Sex | Zipcode |
| Ahmed | 25 | Male | 53710 |
| Bob | 28 | Male | 53711 |
| Claire | 31 | Female | 90210 |
| Dave | 19 | Male | 2174 |
| Evelyn | 40 | Female | 2237 |

Patient data:
| Age | Sex | Zipcode | Disease |
| 25 | Male | 53710 | Flu |
| 25 | Female | 53712 | Hepatitis |
| 26 | Male | 53711 | Bronchitis |
| 27 | Male | 53710 | Broken Arm |
| 27 | Female | 53712 | AIDS |
| 28 | Male | 53711 | Bronchitis |

Multidimensional Partitioning - cont.
Patient data:
| Age | Sex | Zipcode | Disease |
| 25 | Male | 53710 | Flu |
| 26 | Male | 53711 | Bronchitis |
| 27 | Male | 53710 | Broken Arm |
| 28 | Male | 53711 | Bronchitis |
| 25 | Female | 53712 | Hepatitis |
| 27 | Female | 53712 | AIDS |

Single dimensional:
| Age | Sex | Zipcode | Disease |
| 25-28 | Male | 53710-11 | Flu |
| 25-28 | Male | 53710-11 | Bronchitis |
| 25-28 | Male | 53710-11 | Broken Arm |
| 25-28 | Male | 53710-11 | Bronchitis |
| 25-28 | Female | 53712 | Hepatitis |
| 25-28 | Female | 53712 | AIDS |

Multi dimensional:
| Age | Sex | Zipcode | Disease |
| 25-26 | Male | 53710-11 | Flu |
| 25-26 | Male | 53710-11 | Bronchitis |
| 27-28 | Male | 53710-11 | Broken Arm |
| 27-28 | Male | 53710-11 | Bronchitis |
| 25-27 | Female | 53712 | Hepatitis |
| 25-27 | Female | 53712 | AIDS |

Finding k-Anonymous Multidimensional Partitioning
Given a set $P$ of unique (point, count) pairs, with points in d-dimensional space, is there a multidimensional partitioning for $P$ such that:
- for every region $R_i$, $\sum_{p \in R_i} count(p) \ge k$ or $\sum_{p \in R_i} count(p) = 0$ (k-anonymity), and
- $C_{AVG} \le c$ for a positive constant $c$ (average number of records in each partition)?
This problem is NP-complete. Proof: reduction from Partition.

Mondrian - A Greedy Partitioning Algorithm [LDR06]
[Figure: example partitioning of points over Age and Weight; k-anonymity, k = 3]

Mondrian(partition)
  if (no allowable multidimensional cut for partition)
    return φ : partition → summary
  else
    dim ← choose_dimension()
    fs ← frequency_set(partition, dim)
    splitVal ← find_median(fs)
    lhs ← {t ∈ partition : t.dim ≤ splitVal}
    rhs ← {t ∈ partition : t.dim > splitVal}
    return Mondrian(rhs) ∪ Mondrian(lhs)
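
A runnable sketch of the recursion above is given below. It rests on several assumptions that are not fixed by the slide: tuples are plain numeric tuples, a cut is "allowable" when both halves keep at least k tuples, dimensions are tried in order of decreasing value range, and the lower median is used as the split value.

```python
def mondrian(partition, k, dims):
    """Greedy median partitioning; returns a list of (summary, tuples) regions."""
    def value_range(d):
        vals = [t[d] for t in partition]
        return max(vals) - min(vals)

    # Try the widest dimension first (an assumed choose_dimension heuristic).
    for dim in sorted(dims, key=value_range, reverse=True):
        values = sorted(t[dim] for t in partition)
        split_val = values[(len(values) - 1) // 2]  # lower median
        lhs = [t for t in partition if t[dim] <= split_val]
        rhs = [t for t in partition if t[dim] > split_val]
        if len(lhs) >= k and len(rhs) >= k:  # allowable cut
            return mondrian(lhs, k, dims) + mondrian(rhs, k, dims)

    # No allowable cut: summarize the region by its per-dimension (min, max).
    summary = tuple((min(t[d] for t in partition), max(t[d] for t in partition))
                    for d in dims)
    return [(summary, partition)]


# (Age, Zipcode) pairs taken from the patient data slide.
points = [(25, 53710), (25, 53712), (26, 53711),
          (27, 53710), (27, 53712), (28, 53711)]
regions = mondrian(points, k=2, dims=(0, 1))
for summary, members in regions:
    print(summary, len(members))
```

Because a cut is only taken when both sides keep at least k tuples, every returned region is k-anonymous by construction. The regions need not match the hand-made multidimensional table shown earlier, since the greedy median split is only a heuristic.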

Mondrian - Example [LDR06]
Anonymizations for two attributes with a discrete normal distribution (μ = 25, σ = 2)

Mondrian Quality
By definition of k-anonymity:
$$C^{*}_{AVG} = \frac{\text{total\_records} / \text{total\_equiv\_classes}}{k} \ge 1$$
From Theorem 2 in [LDR06]: the maximum number of points in any region $R_i$ is $2d(k-1) + m$, where $m$ is the maximum number of copies of any distinct point in $P$.
For constant $d$, $m$, $k$, $C_{AVG}$ is therefore within a constant factor of $C^{*}_{AVG}$:
$$\frac{C_{AVG}}{C^{*}_{AVG}} \le \frac{2d(k-1) + m}{k}$$
Piet Mondrian (1872-1944)
(*) wikipedia
Privacy – Last Example

Bibliography
[SW02-A] "k-Anonymity: A Model for Protecting Privacy", L. Sweeney, 2002
[SW02-B] "Achieving k-Anonymity Privacy Protection Using Generalization and Suppression", L. Sweeney, 2002
[LDR06] "Mondrian Multidimensional k-Anonymity", K. LeFevre, D.J. DeWitt, R. Ramakrishnan, 2006
http://en.wikipedia.org/wiki/Piet_Mondrian
Presentations:
- "Privacy In Databases", B. Aditya Prakash
- "K-Anonymity and Other Cluster-Based Methods", Ge. Ruan