Personalized Privacy Preservation
Xiaokui Xiao, Yufei Tao
City University of Hong Kong
Privacy preserving data publishing
Microdata
• Purposes:
  – Allow researchers to effectively study the correlation between various attributes
  – Protect the privacy of every patient
Name   Age  Sex  Zipcode  Disease
Andy     4  M    12000    gastric ulcer
Bill     5  M    14000    dyspepsia
Ken      6  M    18000    pneumonia
Nash     9  M    19000    bronchitis
Alice   12  F    22000    flu
Betty   19  F    24000    pneumonia
Linda   21  F    33000    gastritis
Jane    25  F    34000    gastritis
Sarah   28  F    37000    flu
Mary    56  F    58000    flu
A naïve solution
• Remove the Name column and publish the remaining attributes
• It does not work. See next.
Microdata
Name   Age  Sex  Zipcode  Disease
Andy     4  M    12000    gastric ulcer
Bill     5  M    14000    dyspepsia
Ken      6  M    18000    pneumonia
Nash     9  M    19000    bronchitis
Alice   12  F    22000    flu
Betty   19  F    24000    pneumonia
Linda   21  F    33000    gastritis
Jane    25  F    34000    gastritis
Sarah   28  F    37000    flu
Mary    56  F    58000    flu
publish
Age  Sex  Zipcode  Disease
  4  M    12000    gastric ulcer
  5  M    14000    dyspepsia
  6  M    18000    pneumonia
  9  M    19000    bronchitis
 12  F    22000    flu
 19  F    24000    pneumonia
 21  F    33000    gastritis
 25  F    34000    gastritis
 28  F    37000    flu
 56  F    58000    flu
Inference attack
Published table
Age  Sex  Zipcode  Disease
  4  M    12000    gastric ulcer
  5  M    14000    dyspepsia
  6  M    18000    pneumonia
  9  M    19000    bronchitis
 12  F    22000    flu
 19  F    24000    pneumonia
 21  F    33000    gastritis
 25  F    34000    gastritis
 28  F    37000    flu
 56  F    58000    flu
An external database (a voter registration list)
Name   Age  Sex  Zipcode
Andy     4  M    12000
Bill     5  M    14000
Ken      6  M    18000
Nash     9  M    19000
Mike     7  M    17000
Alice   12  F    22000
Betty   19  F    24000
Linda   21  F    33000
Jane    25  F    34000
Sarah   28  F    37000
Mary    56  F    58000
• An adversary can join the two tables on the quasi-identifier (QI) attributes: Age, Sex, Zipcode
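The inference attack on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `link` and the tuple layout are assumptions made for the example, and the data is copied from the two tables above.

```python
# Published table (Name removed): (age, sex, zipcode, disease)
published = [
    (4, "M", 12000, "gastric ulcer"), (5, "M", 14000, "dyspepsia"),
    (6, "M", 18000, "pneumonia"), (9, "M", 19000, "bronchitis"),
    (12, "F", 22000, "flu"), (19, "F", 24000, "pneumonia"),
    (21, "F", 33000, "gastritis"), (25, "F", 34000, "gastritis"),
    (28, "F", 37000, "flu"), (56, "F", 58000, "flu"),
]
# External database (voter registration list): (name, age, sex, zipcode)
voters = [
    ("Andy", 4, "M", 12000), ("Bill", 5, "M", 14000), ("Ken", 6, "M", 18000),
    ("Nash", 9, "M", 19000), ("Mike", 7, "M", 17000), ("Alice", 12, "F", 22000),
    ("Betty", 19, "F", 24000), ("Linda", 21, "F", 33000), ("Jane", 25, "F", 34000),
    ("Sarah", 28, "F", 37000), ("Mary", 56, "F", 58000),
]

def link(published, voters):
    """Join the two tables on the QI attributes (age, sex, zipcode)."""
    by_qi = {}
    for age, sex, zipcode, disease in published:
        by_qi.setdefault((age, sex, zipcode), []).append(disease)
    return {
        name: by_qi[(age, sex, zipcode)]
        for name, age, sex, zipcode in voters
        if (age, sex, zipcode) in by_qi
    }

linked = link(published, voters)
for name, diseases in linked.items():
    print(name, diseases)
```

Every patient is matched to exactly one disease because each QI combination is unique in the published table. Mike is not matched because he does not appear in the microdata.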
Generalization
• Transform each QI value into a less specific form
A generalized table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu
An external database
Name   Age  Sex  Zipcode
Andy     4  M    12000
Bill     5  M    14000
Ken      6  M    18000
Nash     9  M    19000
Mike     7  M    17000
Alice   12  F    22000
Betty   19  F    24000
Linda   21  F    33000
Jane    25  F    34000
Sarah   28  F    37000
Mary    56  F    58000
Information loss
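A minimal sketch of the generalization step, assuming the interval boundaries are given in advance (here they are read off the generalized table on this slide). The names `generalize`, `AGE_INTERVALS` and `ZIP_INTERVALS` are hypothetical, and how good intervals are chosen is not covered.

```python
# Interval boundaries read off the generalized table on the slide.
AGE_INTERVALS = [(1, 5), (6, 10), (11, 20), (21, 25), (26, 60)]
ZIP_INTERVALS = [(10001, 15000), (15001, 20000), (20001, 25000),
                 (30001, 35000), (35001, 60000)]

def generalize(value, intervals):
    """Return the (low, high) interval that contains value."""
    for low, high in intervals:
        if low <= value <= high:
            return (low, high)
    raise ValueError(f"no interval covers {value}")

def generalize_row(age, sex, zipcode, disease):
    """Generalize the numeric QI values; Sex and Disease are kept as they are."""
    return (generalize(age, AGE_INTERVALS), sex,
            generalize(zipcode, ZIP_INTERVALS), disease)

print(generalize_row(4, "M", 12000, "gastric ulcer"))
print(generalize_row(56, "F", 58000, "flu"))
```

The wider the intervals, the less precise the published QI values become, which is the information loss noted on the slide.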
k-anonymity
• The following table is 2-anonymous: every QI group contains at least 2 tuples
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu
• 5 QI groups
• Quasi-identifier (QI) attributes: Age, Sex, Zipcode
• Sensitive attribute: Disease
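The 2-anonymity claim can be checked mechanically. The sketch below assumes a row layout of (age, sex, zipcode, disease) with the first three fields as the QI attributes; `is_k_anonymous` is a name chosen for the example.

```python
from collections import Counter

# The 2-anonymous table from the slide, written as (QI group, diseases).
groups = [
    (("[1, 5]", "M", "[10001, 15000]"), ["gastric ulcer", "dyspepsia"]),
    (("[6, 10]", "M", "[15001, 20000]"), ["pneumonia", "bronchitis"]),
    (("[11, 20]", "F", "[20001, 25000]"), ["flu", "pneumonia"]),
    (("[21, 25]", "F", "[30001, 35000]"), ["gastritis", "gastritis"]),
    (("[26, 60]", "F", "[35001, 60000]"), ["flu", "flu"]),
]
table = [qi + (disease,) for qi, diseases in groups for disease in diseases]

def qi_group_sizes(rows):
    """Count how many tuples share each combination of QI values."""
    return Counter(row[:3] for row in rows)

def is_k_anonymous(rows, k):
    """True if every QI group holds at least k tuples."""
    return all(size >= k for size in qi_group_sizes(rows).values())

print(len(qi_group_sizes(table)))   # number of QI groups
print(is_k_anonymous(table, 2))
print(is_k_anonymous(table, 3))
```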
Drawback of k-anonymity
• What is the disease of Linda?
A 2-anonymous table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu
An external database
Name   Age  Sex  Zipcode
Andy     4  M    12000
Bill     5  M    14000
Ken      6  M    18000
Nash     9  M    19000
Mike     7  M    17000
Alice   12  F    22000
Betty   19  F    24000
Linda   21  F    33000
Jane    25  F    34000
Sarah   28  F    37000
Mary    56  F    58000
A better criterion: l-diversity
• Each QI-group
  – has at least l different sensitive values
  – even the most frequent sensitive value does not have a lot of tuples
A 2-diverse table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
An external database
Name   Age  Sex  Zipcode
Andy     4  M    12000
Bill     5  M    14000
Ken      6  M    18000
Nash     9  M    19000
Alice   12  F    22000
Mike     7  M    17000
Betty   19  F    24000
Linda   21  F    33000
Jane    25  F    34000
Sarah   28  F    37000
Mary    56  F    58000
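A hedged sketch of an l-diversity check on the two tables seen so far. The slide only says the most frequent value should not have "a lot of tuples"; the threshold used here (at most a 1/l fraction of each group) is an assumption made for the example, as are the function names.

```python
from collections import Counter, defaultdict

def expand(groups):
    """Turn (QI group, diseases) pairs into rows (age, sex, zipcode, disease)."""
    return [qi + (d,) for qi, diseases in groups for d in diseases]

common = [
    (("[1, 5]", "M", "[10001, 15000]"), ["gastric ulcer", "dyspepsia"]),
    (("[6, 10]", "M", "[15001, 20000]"), ["pneumonia", "bronchitis"]),
    (("[11, 20]", "F", "[20001, 25000]"), ["flu", "pneumonia"]),
]
two_anonymous = expand(common + [
    (("[21, 25]", "F", "[30001, 35000]"), ["gastritis", "gastritis"]),
    (("[26, 60]", "F", "[35001, 60000]"), ["flu", "flu"]),
])
two_diverse = expand(common + [
    (("[21, 60]", "F", "[30001, 60000]"), ["gastritis", "gastritis", "flu", "flu"]),
])

def diversity_report(rows):
    """Per QI group: (number of distinct diseases, share of the most frequent one)."""
    groups = defaultdict(list)
    for *qi, disease in rows:
        groups[tuple(qi)].append(disease)
    report = {}
    for qi, diseases in groups.items():
        counts = Counter(diseases)
        report[qi] = (len(counts), max(counts.values()) / len(diseases))
    return report

def is_l_diverse(rows, l):
    """Assumed reading: >= l distinct values and top value share <= 1/l per group."""
    return all(distinct >= l and top_share <= 1 / l
               for distinct, top_share in diversity_report(rows).values())

print(is_l_diverse(two_anonymous, 2))
print(is_l_diverse(two_diverse, 2))
```

The 2-anonymous table fails because the group containing Linda holds only gastritis, which is exactly the drawback shown on the previous slide.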
Motivation 1: Personalization
• Andy does not want anyone to know that he had a stomach problem
• Sarah does not mind at all if others find out that she had flu
A 2-diverse table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
An external database
Name   Age  Sex  Zipcode
Andy     4  M    12000
Bill     5  M    14000
Ken      6  M    18000
Nash     9  M    19000
Mike     7  M    17000
Alice   12  F    22000
Betty   19  F    24000
Linda   21  F    33000
Jane    25  F    34000
Sarah   28  F    37000
Mary    56  F    58000
Motivation 2: Non-primary case
Microdata
Name   Age  Sex  Zipcode  Disease
Andy     4  M    12000    gastric ulcer
Andy     4  M    12000    dyspepsia
Ken      6  M    18000    pneumonia
Nash     9  M    19000    bronchitis
Alice   12  F    22000    flu
Betty   19  F    24000    pneumonia
Linda   21  F    33000    gastritis
Jane    25  F    34000    gastritis
Sarah   28  F    37000    flu
Mary    56  F    58000    flu
Motivation 2: Non-primary case (cont.)
2-diverse table
Age       Sex  Zipcode         Disease
4         M    12000           gastric ulcer
4         M    12000           dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
An external database
Name   Age  Sex  Zipcode
Andy     4  M    12000
Ken      6  M    18000
Nash     9  M    19000
Mike     7  M    17000
Alice   12  F    22000
Betty   19  F    24000
Linda   21  F    33000
Jane    25  F    34000
Sarah   28  F    37000
Mary    56  F    58000
Motivation 3: SA generalization
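• How many female patients are there with age above 30?
• Estimate from the generalized table, assuming ages are uniformly distributed within the QI group [21, 60]: 4 ∙ (60 – 30 + 1) / (60 – 21 + 1) = 3.1 ≈ 3
• Real answer: 1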
A generalized table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
An external database
Name   Age  Sex  Zipcode
Andy     4  M    12000
Bill     5  M    14000
Ken      6  M    18000
Nash     9  M    19000
Mike     7  M    17000
Alice   12  F    22000
Betty   19  F    24000
Linda   21  F    33000
Jane    25  F    34000
Sarah   28  F    37000
Mary    56  F    58000
Motivation 3: SA generalization (cont.)
• Generalization of the sensitive attribute is beneficial in this case
A better generalized table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  flu
56        F    58000           respiratory infection
An external database
Name   Age  Sex  Zipcode
Andy     4  M    12000
Bill     5  M    14000
Ken      6  M    18000
Nash     9  M    19000
Mike     7  M    17000
Alice   12  F    22000
Betty   19  F    24000
Linda   21  F    33000
Jane    25  F    34000
Sarah   28  F    37000
Mary    56  F    58000
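The query estimates for the two "Motivation 3" tables can be reproduced with a short sketch. It assumes, as the slide's formula does, that ages are spread uniformly over the integers of each generalized interval and that the query range starts at 30. The function name `estimate_count` is hypothetical.

```python
def estimate_count(group_size, low, high, q_low, q_high):
    """Expected number of tuples of a QI group [low, high] that fall in
    [q_low, q_high], assuming ages are uniform over the integers in the group."""
    overlap = max(0, min(high, q_high) - max(low, q_low) + 1)
    return group_size * overlap / (high - low + 1)

# First table: the 4 female tuples share the age interval [21, 60].
coarse = estimate_count(4, 21, 60, 30, 60)        # 4 * 31 / 40

# Better table: 3 tuples in [21, 30] plus one tuple with exact age 56.
refined = estimate_count(3, 21, 30, 30, 60) + estimate_count(1, 56, 56, 30, 60)

# Ground truth from the microdata (ages of Linda, Jane, Sarah, Mary).
real = sum(1 for age in (21, 25, 28, 56) if age >= 30)

print(coarse, refined, real)
```

Under these assumptions the second table gives an estimate much closer to the real answer, because generalizing Mary's disease lets her age stay exact.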
Personalized anonymity
• We propose
  – a mechanism to capture personalized privacy requirements
  – criteria for measuring the degree of security provided by a generalized table
  – an algorithm for generating publishable tables
Guarding node
Taxonomy of the Disease attribute (indentation shows parent and child):
any illness
  respiratory system problem
    respiratory infection
      flu
      pneumonia
      bronchitis
  digestive system problem
    stomach disease
      gastric ulcer
      dyspepsia
      gastritis
• Andy does not want anyone to know that he had a stomach problem
• He can specify “stomach disease” as the guarding node for his tuple
• The data publisher should prevent an adversary from associating Andy with “stomach disease”
Name  Age  Sex  Zipcode  Disease        guarding node
Andy    4  M    12000    gastric ulcer  stomach disease
Guarding node
[Figure: taxonomy of the Disease attribute]
• Sarah is willing to disclose her exact symptom
• She can specify Ø as the guarding node for her tuple
Name   Age  Sex  Zipcode  Disease  guarding node
Sarah   28  F    37000    flu      Ø
Guarding node
[Figure: taxonomy of the Disease attribute]
• Bill does not have any special preference
• He can specify his sensitive value itself as the guarding node for his tuple
Name  Age  Sex  Zipcode  Disease    guarding node
Bill    5  M    14000    dyspepsia  dyspepsia
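The three guarding-node examples can be expressed with a small taxonomy structure. This is a sketch under assumptions: the parent and child edges in `TAXONOMY` are read off the slide's figure, `None` stands for Ø, and `leaves_under` and `is_breach` are names invented for the example.

```python
# Assumed taxonomy edges (parent -> children), read off the slide's figure.
TAXONOMY = {
    "any illness": ["respiratory system problem", "digestive system problem"],
    "respiratory system problem": ["respiratory infection"],
    "respiratory infection": ["flu", "pneumonia", "bronchitis"],
    "digestive system problem": ["stomach disease"],
    "stomach disease": ["gastric ulcer", "dyspepsia", "gastritis"],
}

def leaves_under(node):
    """All leaf diseases in the subtree rooted at node (a leaf returns itself)."""
    children = TAXONOMY.get(node)
    if not children:
        return {node}
    result = set()
    for child in children:
        result |= leaves_under(child)
    return result

def is_breach(guarding_node, inferred_disease):
    """True if associating the person with inferred_disease violates the
    guarding node. None plays the role of the empty guarding node."""
    if guarding_node is None:
        return False
    return inferred_disease in leaves_under(guarding_node)

print(is_breach("stomach disease", "dyspepsia"))   # Andy: any stomach disease counts
print(is_breach(None, "flu"))                      # Sarah: nothing to protect
print(is_breach("dyspepsia", "gastric ulcer"))     # Bill: only his exact value counts
```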
A personalized approach
[Figure: taxonomy of the Disease attribute]
Name   Age  Sex  Zipcode  Disease        guarding node
Andy     4  M    12000    gastric ulcer  stomach disease
Bill     5  M    14000    dyspepsia      dyspepsia
Ken      6  M    18000    pneumonia      respiratory infection
Nash     9  M    19000    bronchitis     bronchitis
Alice   12  F    22000    flu            flu
Betty   19  F    24000    pneumonia      pneumonia
Linda   21  F    33000    gastritis      gastritis
Jane    25  F    34000    gastritis      Ø
Sarah   28  F    37000    flu            Ø
Mary    56  F    58000    flu            flu
Personalized anonymity
• A table satisfies personalized anonymity with a parameter pbreach
  – iff no adversary can breach the privacy requirement of any tuple with a probability above pbreach
• If pbreach = 0.3, then any adversary should have no more than 30% probability to find out that:
  – Andy had a stomach disease
  – Bill had dyspepsia
  – etc.
Name   Age  Sex  Zipcode  Disease        guarding node
Andy     4  M    12000    gastric ulcer  stomach disease
Bill     5  M    14000    dyspepsia      dyspepsia
Ken      6  M    18000    pneumonia      respiratory infection
Nash     9  M    19000    bronchitis     bronchitis
Alice   12  F    22000    flu            flu
Betty   19  F    24000    pneumonia      pneumonia
Linda   21  F    33000    gastritis      gastritis
Jane    25  F    34000    gastritis      Ø
Sarah   28  F    37000    flu            Ø
Mary    56  F    58000    flu            flu
Personalized anonymity
• Personalized anonymity with respect to a predefined parameter pbreach
  – an adversary can breach the privacy requirement of any tuple with a probability at most pbreach
Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection
• We need a method for calculating the breach probabilities
What is the probability that Andy had some stomach problem?
Combinatorial reconstruction
• Assumptions
  – the adversary has no prior knowledge about each individual
  – every individual involved in the microdata also appears in the external database
Combinatorial reconstruction
• Andy does not want anyone to know that he had some stomach problem
• What is the probability that the adversary can find out that “Andy had a stomach disease”?
Name   Age  Sex  Zipcode
Andy     4  M    12000
Bill     5  M    14000
Ken      6  M    18000
Nash     9  M    19000
Mike     7  M    17000
Alice   12  F    22000
Betty   19  F    24000
Linda   21  F    33000
Jane    25  F    34000
Sarah   28  F    37000
Mary    56  F    58000
Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection
Combinatorial reconstruction (cont.)
• Can each individual appear more than once?
  – No = the primary case
  – Yes = the non-primary case
• Some possible reconstructions:
[Figure: two example assignments of the tuples (gastric ulcer, dyspepsia, pneumonia, bronchitis) to the individuals (Andy, Bill, Ken, Nash, Mike), one for the primary case and one for the non-primary case]
Breach probability (primary)
• Totally 120 possible reconstructions (5 ∙ 4 ∙ 3 ∙ 2 ways to assign the 4 tuples of Andy’s QI group to distinct individuals among the 5 candidates)
• If Andy is associated with a stomach disease in nb reconstructions, the probability that the adversary associates Andy with some stomach problem is nb / 120
• Andy is associated with
  – gastric ulcer in 24 reconstructions
  – dyspepsia in 24 reconstructions
  – gastritis in 0 reconstructions
• nb = 48
• The breach probability for Andy’s tuple is 48 / 120 = 2 / 5
[Figure: the candidate individuals (Andy, Bill, Ken, Nash, Mike) and the tuples of the QI group (gastric ulcer, dyspepsia, pneumonia, bronchitis), next to the taxonomy of the Disease attribute]
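The counts for the primary case can be verified by brute-force enumeration. This is a sketch for this one QI group only, not the general method; the variable names are chosen for the example.

```python
from fractions import Fraction
from itertools import permutations

candidates = ["Andy", "Bill", "Ken", "Nash", "Mike"]   # males aged 1 to 10 in the external database
tuples = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}  # leaves under Andy's guarding node

# Primary case: the 4 tuples belong to 4 distinct individuals.
reconstructions = list(permutations(candidates, len(tuples)))

# Count reconstructions in which Andy owns a tuple with a stomach disease.
n_b = sum(
    any(owner == "Andy" and disease in stomach
        for owner, disease in zip(owners, tuples))
    for owners in reconstructions
)
breach_primary = Fraction(n_b, len(reconstructions))

print(len(reconstructions), n_b, breach_primary)
```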
Breach probability (non-primary)
• Totally 625 possible reconstructions (each of the 4 tuples can belong to any of the 5 candidates: 5 ∙ 5 ∙ 5 ∙ 5)
• Andy is associated with gastric ulcer or dyspepsia or gastritis in 225 reconstructions
• nb = 225
• The breach probability for Andy’s tuple is 225 / 625 = 9 / 25
[Figure: the taxonomy of the Disease attribute, next to the candidate individuals (Andy, Bill, Ken, Nash, Mike) and the tuples of the QI group (gastric ulcer, dyspepsia, pneumonia, bronchitis)]
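The non-primary counts can be checked the same way, again as a brute-force sketch for this QI group only. The only change from the primary case is that one individual may own several tuples.

```python
from fractions import Fraction
from itertools import product

candidates = ["Andy", "Bill", "Ken", "Nash", "Mike"]
tuples = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}

# Non-primary case: every tuple is assigned independently to any candidate.
reconstructions = list(product(candidates, repeat=len(tuples)))

# Andy is breached if he owns at least one tuple with a stomach disease.
n_b = sum(
    any(owner == "Andy" and disease in stomach
        for owner, disease in zip(owners, tuples))
    for owners in reconstructions
)
breach_non_primary = Fraction(n_b, len(reconstructions))

print(len(reconstructions), n_b, breach_non_primary)
```

The same count follows by complement: Andy owns neither stomach tuple in 4 ∙ 4 ∙ 5 ∙ 5 = 400 reconstructions, leaving 625 – 400 = 225.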
Breach probability: Formal results
Name   Age  Sex  Zipcode
Andy     4  M    12000
Bill     5  M    14000
Ken      6  M    18000
Nash     9  M    19000
Mike     7  M    17000
Alice   12  F    22000
Betty   19  F    24000
Linda   21  F    33000
Jane    25  F    34000
Sarah   28  F    37000
Mary    56  F    58000
Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection
More in our paper
• An algorithm for computing generalized tables that
  – satisfy personalized anonymity with a predefined pbreach
  – reduce information loss by employing generalization on both the QI attributes and the sensitive attribute
Experiment settings 1
• Goal: To show that k-anonymity and l-diversity do not always provide sufficient privacy protection
• Real dataset (attributes: Age, Education, Gender, Marital-status, Occupation, Income)
• Pri-leaf
• Nonpri-leaf
• Pri-mixed
• Nonpri-mixed
• Cardinality = 100k
Degree of privacy protection (Pri-leaf)
pbreach = 0.25 (k = 4, l = 4)
Degree of privacy protection (Nonpri-leaf)
pbreach = 0.25 (k = 4, l = 4)
Degree of privacy protection (Pri-mixed)
pbreach = 0.25 (k = 4, l = 4)
Degree of privacy protection (Nonpri-mixed)
pbreach = 0.25 (k = 4, l = 4)
Experiment settings 2
• Goal: To show that applying generalization on both the QI attributes and the sensitive attribute will lead to more effective data analysis
Accuracy of analysis (no personalization)
Accuracy of analysis (with personalization)
Conclusions
• k-anonymity and l-diversity are not sufficient for the Non-primary case
• Guarding nodes allow individuals to describe their privacy requirements better
• Generalization on the sensitive attribute is beneficial
Thank you!
Datasets and implementation are available for download at
http://www.cs.cityu.edu.hk/~taoyf