14th ACM CCS
Fengjun Li
Automaton Segmentation: A New Approach to Preserve Privacy in
XML Information Brokering
Fengjun Li, Bo Luo, Peng Liu, Dongwon Lee and Chao-Hsien Chu
College of Information Sciences and Technology
The Pennsylvania State University
Motivation
Solution
Evaluation
Conclusion
Outline
• Motivation
• Solution
– The broker-coordinator overlay approach
– Automaton segmentation scheme
– Query encryption scheme
• Evaluation
• Conclusion
Information Brokering Scenario
[Figure: data owners and users in the information brokering scenario]
Naïve approach
[Figure: Bob issues a query about Alice; the data owners shown are California Hospital at Los Angeles and Mt. Sinai Hospital at NYC]
Bob knows little:
• Name: Alice
• Observed symptoms
Privacy Threats
• Privacy threats outside the proxy
– Curious insider at the hospital
– Link eavesdropper
• Privacy threats from the proxy
– Compromised proxy
Privacy Threat: Curious Insider
[Figure: Bob, California Hospital at Los Angeles, and Mt. Sinai Hospital at NYC; a probing query reveals the data location]
Probing query:
Q: /provider/…/patient [name()=‘Alice’]//*
Privacy Threat: Curious insider & eavesdropping
[Figure: Bob, California Hospital at Los Angeles, and Mt. Sinai Hospital at NYC. Query? Encrypted!]
Privacy Threat: Curious insider & eavesdropping
[Figure: Bob, California Hospital at Los Angeles, and Mt. Sinai Hospital at NYC. Location? From California Hospital, LA, to Mt. Sinai Hospital, NYC]
Privacy Threat: Curious insider & eavesdropping
[Figure: Bob, California Hospital at Los Angeles, and Mt. Sinai Hospital at NYC. Probing query: "It was Alice! Blood problem?"]
Q: /provider/…/patient [name()=‘Tom’]//*
Q: /provider/…/patient [name()=‘Steve’]//*
Q: /provider/…/patient [name()=‘Alice’]//*
Q: /provider/…/patient [name()=‘Tom’]//*
Major Privacy Concerns: Summary
[Figure: Bob and the query Q in transit]
• Query content
Q: /provider/…/patient [name()=‘Alice’]/symptom [cancer()=‘blood’]//*
• Data location
Q: /provider/…/patient [name()=‘Alice’]//*
• Patient location
Our Solution: Overview
• To block probing queries
– In-network access control
• To protect data location privacy, patient location privacy, and metadata privacy
– Automaton segmentation
• Defeat all the aforementioned privacy threats.
• Achieve superior privacy protection.
Preliminary: How does the proxy work?
• Routing rules: $R_{index} = \{object,\ destination(s)\}$
– Object is an XPath expression.
– Destination is an IP address.
– Routing rules represent the physical distribution of data objects.
Example Routing Rules
R1: {/site/people, 192.168.0.2}
R2: {//africa/item, 192.168.0.15}
R3: {//asia/item, 192.168.0.16}
An Example Routing Automaton
• Query routing: one-to-many XPath matching.
• Routing automaton: a non-deterministic finite automaton (NFA) that captures the routing rules.
R1: {/site/people, 192.168.0.2}
R2: {//africa/item, 192.168.0.15}
R3: {//asia/item, 192.168.0.16}
[Figure: routing NFA for R1–R3, with states 0–7; transitions labeled site, people, ε, *, africa, asia, and item; accept states annotated with 192.168.0.2, 192.168.0.15, and 192.168.0.16]
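The routing automaton can be illustrated with a minimal sketch. It assumes a much reduced XPath subset (element names joined by `/` or `//`, no predicates), and the class name `RoutingNFA` and its methods are hypothetical, not taken from the talk. Each `//` is modeled as an ε-transition into a state that loops on `*`, mirroring the ε and `*` edges in the figure.

```python
import re

EPS = None  # label used for epsilon transitions


def parse_steps(xpath):
    """Split '/a//b' into [('/', 'a'), ('//', 'b')]."""
    return re.findall(r"(//|/)([^/]+)", xpath)


class RoutingNFA:
    def __init__(self):
        self.trans = {0: []}   # state -> list of (label, next_state)
        self.accept = {}       # accept state -> set of destinations

    def _new_state(self):
        state = len(self.trans)
        self.trans[state] = []
        return state

    def _follow(self, state, label):
        """Reuse an existing non-loop transition on label, or create one."""
        for lbl, target in self.trans[state]:
            if lbl == label and target != state:
                return target
        target = self._new_state()
        self.trans[state].append((label, target))
        return target

    def add_rule(self, xpath, destination):
        state = 0
        for axis, name in parse_steps(xpath):
            if axis == "//":
                # descendant axis: epsilon into a state that loops on '*'
                state = self._follow(state, EPS)
                if ("*", state) not in self.trans[state]:
                    self.trans[state].append(("*", state))
            state = self._follow(state, name)
        self.accept.setdefault(state, set()).add(destination)

    def _closure(self, states):
        seen, stack = set(states), list(states)
        while stack:
            for lbl, target in self.trans[stack.pop()]:
                if lbl is EPS and target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    def route(self, path):
        """Return every destination whose rule matches the element path."""
        current = self._closure({0})
        for element in path:
            moved = {t for s in current for lbl, t in self.trans[s]
                     if lbl in ("*", element)}
            current = self._closure(moved)
        destinations = set()
        for state in current:
            destinations |= self.accept.get(state, set())
        return destinations


nfa = RoutingNFA()
nfa.add_rule("/site/people", "192.168.0.2")    # R1
nfa.add_rule("//africa/item", "192.168.0.15")  # R2
nfa.add_rule("//asia/item", "192.168.0.16")    # R3
```

Calling `nfa.route(["site", "regions", "asia", "item"])` walks the path one element at a time and collects the destinations attached to whichever accept states are reached.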
How to block probing queries?
[Figure: routing rules are captured by the global routing automaton and access control rules (ACR) by the access control automaton; integrated rules yield the integrated global automaton]
The Integrated Automaton
QUERY: “/*/regions/asia/item[name=’Abacavir’]/location”
Q’: “/site/regions/asia/item[name=’Abacavir’]/location”
[Figure: the integrated automaton, with states 0–18; transitions labeled site, regions, categories, people, person, item, name, location, quantity, description, address, emailaddress, *, and ε; states annotated with the destinations 192.168.0.2, 192.168.0.15, and 192.168.0.16]
System Architecture
Brokers and Coordinators
• Brokers
– Connect users
– Forward queries to the root-coordinator
– Apply pre-protection before forwarding
• Coordinators
– Root-coordinator, coordinator, and leaf-coordinator
– They form a coordinator tree
– The leaf-coordinator does not hold any automaton piece, but the other two do.
• The Super Node
– Initiation and offline maintenance
Automaton Segmentation
[Figure: the broker-coordinator network (user, super node, brokers, coordinators, and data servers) beside the automaton to be segmented, with states 0–11 and transitions labeled site, categories, regions, *, ε, item, location, quantity, description, and name]
Automaton Segmentation
[Figure: the same broker-coordinator network and automaton, with automaton segments assigned to coordinators]
• Automaton segmentation granularity
• Segment deployment and replication
Automaton Segmentation
[Figure: the broker-coordinator network; the query travels from Broker 3 to the root-coordinator]
Q = “/*/regions/asia/item[name=’Abacavir’]/location”
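One way to picture segmented routing is the sketch below. It rests on simplifying assumptions: each coordinator holds exactly one XPath step, predicates and the descendant axis are ignored, and the tree layout, class names, and `handle` method are hypothetical. Leaf-coordinators hold only data server addresses, as stated on the Brokers and Coordinators slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LeafCoordinator:
    """Holds accept information only: the data server addresses."""
    servers: List[str]

    def handle(self, steps: List[str]) -> List[str]:
        # Accept only if every query step was consumed upstream.
        return list(self.servers) if not steps else []


@dataclass
class Coordinator:
    """Holds one automaton segment, here a single XPath step."""
    step: str
    next_hops: list = field(default_factory=list)

    def handle(self, steps: List[str]) -> List[str]:
        if not steps:
            return []
        head = steps[0]
        # A wildcard on either side matches any element name.
        if head != self.step and "*" not in (head, self.step):
            return []
        found = []
        for hop in self.next_hops:
            found.extend(hop.handle(steps[1:]))  # forward the remaining steps
        return found


# Hypothetical coordinator tree for site/regions/{africa,asia}/item
africa = Coordinator("africa", [Coordinator("item", [LeafCoordinator(["192.168.0.15"])])])
asia = Coordinator("asia", [Coordinator("item", [LeafCoordinator(["192.168.0.16"])])])
root = Coordinator("site", [Coordinator("regions", [africa, asia])])

print(root.handle(["*", "regions", "asia", "item"]))
```

No single object in this sketch sees both a complete rule and a server address: inner coordinators know one step and their next hops, and only the leaf knows where the data lives.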
Query Segment Encryption
Assume $Q = s_1 s_2 \ldots s_n$, where $s_i$ is a segment.
[Figure: the sequence s1 s2 … sn shown at each of the coordinators 1, 2, 3, …, n]
Query Segment Encryption
[Figure: the broker-coordinator network, annotated with the message seen at each hop]
Bob, “/*/regions/asia/item[name=’Abacavir’]/location”
Bob, “xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx”
Broker 3, “/*/regions/asia/item[name=’Abacavir’]/location”
Broker 3, “xxxx/regions/asia/item[name=’Abacavir’]/location”
Broker 3, “xxxxxxxxxx/asia/item[name=’Abacavir’]/location”
Broker 3, “xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx”
Broker 3, “xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx”
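The progression of messages above can be mimicked with a small sketch. It performs no real encryption: replacing characters with `x` merely stands in for ciphertext, so the masked lengths need not match the slide. The helper names `split_steps` and `mask_processed` are hypothetical, and only child-axis (`/`) steps are assumed.

```python
def split_steps(xpath):
    """Split an XPath string into steps, each keeping its leading '/'.

    Slashes inside predicates ([...]) do not start a new step.
    """
    steps, current, depth = [], "", 0
    for ch in xpath:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "/" and depth == 0 and current:
            steps.append(current)
            current = ""
        current += ch
    if current:
        steps.append(current)
    return steps


def mask_processed(xpath, processed):
    """Hide the first `processed` steps; 'x' is a placeholder for ciphertext."""
    steps = split_steps(xpath)
    hidden = ["x" * len(step) for step in steps[:processed]]
    return "".join(hidden + steps[processed:])


query = "/*/regions/asia/item[name='Abacavir']/location"
for done in range(len(split_steps(query)) + 1):
    # Each iteration shows what a later coordinator could still read.
    print(done, mask_processed(query, done))
```

In a real deployment each hidden step would be ciphertext under a key that later coordinators do not hold; the sketch only shows which part of the query remains readable after each hop.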
Privacy Analysis
• Answers are returned through the leaf-coordinator and the broker.
• Unauthorized queries are rejected by the coordinators: probing queries are blocked.
• Leaf-coordinators know data server addresses, but nothing about the data.
– Leaf-coordinators cannot see queries.
– Leaf-coordinators only hold accept states.
• Other coordinators hold partial routing rules, but NO location information.
Data Location Privacy
Privacy Analysis
• Query content is encrypted in communication.
• Among all proxy components, ONLY the root-coordinators can see the content of the query.
Query Content Privacy (Partially)
• NO coordinator knows the user location; ONLY the broker does.
• But the broker does not know the query content.
User Location Privacy
• The automaton is split into pieces; NO proxy knows an entire (access control/routing) rule.
Metadata Privacy
Privacy Analysis
• If eavesdropping → communication between the leaf-coordinator and the broker
• If a broker is compromised → user location
• If a root-coordinator is compromised → query content
• If a coordinator is compromised → one segment of the automaton
• If a leaf-coordinator is compromised → IP addresses of data servers
• If collusive, …
Performance Evaluation
• Settings
– Coordinators
• Java (JDK 5.0)
• Windows desktop
– Data
• XMark DTD and XML documents
• Synthetic rules
• Synthetic queries
• Network assumption
– It is unfair to use our Gigabit intranet to measure network latency
– Internet latency
End-to-End Query Processing Time
• Average query brokering time ($T_C$)
– AC enforcement, routing to the next coordinator, and encryption
– Average value from experiment: 1.9 ms
• Average network transmission time ($T_N$)
– Estimated using Internet latency; average latency between two coordinators: 100 ms
• Average number of hops ($N_{HOP}$)
– Estimated from experiment: 5.7
$T_{forward} = T_C \times N_{HOP} + T_N \times (N_{HOP} + 1) = 681$ (ms)
• Without any privacy protection, $T_{forward}$ = 211 (ms)
• Average query evaluation time at data server ($T_E$)
• Average backward data transmission time ($T_{back}$)
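The forward-time figure can be checked by plugging the slide's measured values into the formula. The function name `forward_time_ms` is only an illustrative choice.

```python
def forward_time_ms(t_c, t_n, n_hop):
    """T_forward = T_C * N_HOP + T_N * (N_HOP + 1), all times in ms."""
    return t_c * n_hop + t_n * (n_hop + 1)


# Values from the slide: T_C = 1.9 ms, T_N = 100 ms, N_HOP = 5.7
t_forward = forward_time_ms(t_c=1.9, t_n=100.0, n_hop=5.7)
print(round(t_forward))  # 1.9 * 5.7 + 100 * 6.7 = 10.83 + 670 = 680.83, about 681
```

Network latency accounts for 670 ms of the total, so the brokering work at the coordinators (about 11 ms) is a small fraction of the end-to-end forward time.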
System Scalability
• Total computation from all the coordinators
– Measured by NSeg: total number of query segments in the PPIB system.
• Parameters:
– Query frequency
– Size of rules
System Scalability
• NSeg vs. NQuery
• NSeg vs. NRules
[Figure: two plots of NSeg, one against NQuery and one against NRules; the legend lists series for 20, 40, 60, 80, and 100 rules]
Conclusion
• Design the first privacy-preserving information brokering system (PPIB).
– Integrate query routing with in-broker access control
• Design an automaton segmentation scheme to preserve query content privacy.
– Integrate automaton segmentation with query routing and access control
• Provide the most comprehensive privacy protection for IBSs, with insignificant performance degradation and good scalability.
Questions?
• Thank you!
Privacy Analysis
• If the root coordinator is compromised:
– PPIB vs. centralized proxy
• We still protect query location and data location privacy
– Full query segment encryption
• Also encrypt un-processed XPath steps
– Relaxed: encrypt all predicates
• Coordinators need to be authenticated to decrypt an XPath step
• Extra overhead: very complicated authentication process and key management scheme
Privacy Analysis
• If one coordinator is compromised,
– What he obtains:
• A segment of the integrated automaton → an XPath step
• Public resource: XML schema
– What he can infer:
• Pre-path from root to itself
– Multiple paths available → k-anonymity
• Post-path from itself to the accept states
– Multiple accept states available → l-diversity
• Together → t-concealment
A Real Example
• Assume 5 hospitals are sharing data; each hospital has 10,000 patients
– Assume each hospital has 10 roles
• How large is the total amount of index data?
– Assume we only index on patient name: 10000*5 = 50K
– Could be greatly reduced at the router.
• How large is the data server for each hospital?
– TB level: 100M*10K = 1T
• How many rules are there for each hospital?
– 10-30 rules per role, 100-300 per hospital, 500-1500 in total
– Rules may be identical or similar
• How large is the global access control automaton?
– A fair guess (MRQ [SUTC 06]): 50 distinct paths
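The back-of-the-envelope numbers on this slide can be reproduced directly. The constant names are illustrative, and reading "100M" as roughly 100 MB of data per patient is an assumption; the slide does not state the unit.

```python
HOSPITALS = 5
PATIENTS_PER_HOSPITAL = 10_000
ROLES_PER_HOSPITAL = 10
RULES_PER_ROLE = (10, 30)             # (low, high) estimate from the slide
BYTES_PER_PATIENT = 100 * 10**6       # assumption: "100M" means about 100 MB

# Index on patient name only: one entry per patient across all hospitals.
index_entries = HOSPITALS * PATIENTS_PER_HOSPITAL

# Data volume held by one hospital's data server.
bytes_per_hospital = BYTES_PER_PATIENT * PATIENTS_PER_HOSPITAL

# Rule counts as (low, high) ranges.
rules_per_hospital = tuple(r * ROLES_PER_HOSPITAL for r in RULES_PER_ROLE)
rules_total = tuple(r * HOSPITALS for r in rules_per_hospital)

print(index_entries)        # 50000
print(bytes_per_hospital)   # 1000000000000, i.e. 1 TB
print(rules_per_hospital)   # (100, 300)
print(rules_total)          # (500, 1500)
```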
A Real Example
• How many coordinators are needed in each category?
– A fair guess: 50 leaf coordinators, 10 intermediate coordinators
– Replicas needed
• What is the size of each automaton segment?
– One XPath step per segment
– Memory consumption of Java implementation: 10 KB level
• What is the average size of a query?
– Consider the size of health care schema (e.g. HL7)
– A fair guess: 4-8 paths
Differences from other anonymizing services
• Target destinations are known beforehand
• Users do not know where to send the query
– Query routing is necessary and unavoidable
– A proxy with the routing automaton knows too much