Date posted: 24-Dec-2015
Preserving Privacy in GPS Traces via Uncertainty-Aware Path Cloaking
Baik Hoh, Marco Gruteser, Hui Xiong, Ansaf Alrabady
All images are credited to “ACM” Hoh et al (2007), pp. 161-170
Problem
GPS traces are taken from “probe” vehicles to provide services
Traffic Monitoring Application
GPS location, heading, and speed data
Other research has shown that even if this data is anonymized, individual routes can be identified.
Problem: Traffic Monitoring
GPS points are mapped to a road segment
The average speed of those vectors is calculated
Congestion is inferred
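The three monitoring steps above can be sketched in a few lines. This is a minimal illustration, not the deployed system: it assumes each sample has already been map-matched to a segment id, and the segment names, the `FREE_FLOW_KMH` table, and the `ratio` cutoff are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical free-flow speeds (km/h) per road segment; illustrative values only.
FREE_FLOW_KMH = {"seg_a": 100.0, "seg_b": 50.0}

def average_speeds(samples):
    """Group (segment_id, speed_kmh) samples by segment and average them."""
    by_segment = defaultdict(list)
    for segment_id, speed_kmh in samples:
        by_segment[segment_id].append(speed_kmh)
    return {seg: mean(speeds) for seg, speeds in by_segment.items()}

def congested_segments(samples, ratio=0.5):
    """Flag a segment when its mean speed is below ratio * free-flow speed (assumed rule)."""
    averages = average_speeds(samples)
    return sorted(seg for seg, avg in averages.items()
                  if avg < ratio * FREE_FLOW_KMH[seg])

# Samples are assumed to be already mapped to road segments.
samples = [("seg_a", 30.0), ("seg_a", 40.0), ("seg_b", 45.0), ("seg_b", 55.0)]
print(average_speeds(samples))
print(congested_segments(samples))
```

With few probes per segment the average rests on very few samples, which is why the penetration rate matters for accuracy.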
Requirements
Spatial Accuracy
Road Coverage
Achieved by “penetration rate”
Initial deployments fall short, and privacy suffers
Problem: How Privacy is Compromised
Individuals can be identified by the starting and ending points in the GPS trace
Data points can be linked together using target tracking and “Maximum Likelihood Detection”
For a set of possible points, select the point with the highest probability of belonging to this route.
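The linking step can be illustrated with a small sketch. It assumes a constant-velocity motion model and an isotropic Gaussian likelihood around the predicted position. Both are simplifications chosen here for illustration, and the `sigma` value is hypothetical, not taken from the paper.

```python
import math

def predict_next(position, velocity, dt):
    """Constant-velocity prediction of the position after dt seconds (assumed model)."""
    return (position[0] + velocity[0] * dt, position[1] + velocity[1] * dt)

def likelihood(candidate, predicted, sigma=100.0):
    """Unnormalized Gaussian likelihood of a candidate point; sigma (meters) is hypothetical."""
    d2 = (candidate[0] - predicted[0]) ** 2 + (candidate[1] - predicted[1]) ** 2
    return math.exp(-d2 / (2 * sigma ** 2))

def most_likely_next(position, velocity, dt, candidates):
    """Pick the candidate with the highest likelihood of continuing the trace."""
    predicted = predict_next(position, velocity, dt)
    return max(candidates, key=lambda c: likelihood(c, predicted))

last_position = (0.0, 0.0)   # meters, local coordinates
velocity = (10.0, 0.0)       # m/s, heading east
candidates = [(580.0, 20.0), (100.0, 400.0), (-300.0, 0.0)]
best = most_likely_next(last_position, velocity, 60.0, candidates)
print(best)
```

An attacker repeating this step for each new time slice can chain anonymous samples back into a single route.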
Problem: Existing Algorithms
Existing anonymity algorithms cause severe degradation to the utility of the data
K-anonymity using CliqueCloak modifies trace data beyond usability
Thought to be the most accurate system with any anonymity guarantee
Even with an anonymity set as small as 3, location accuracy degrades to between 1000 and 2000 m, even with 2000 probes.
Increasing the penetration rate would help, but higher penetration rates are not possible in early deployment, and lower-density areas of the map would never be accurate.
Factors to Consider
The longer the attacker can follow an individual trace, the better they are able to guess who you are, and where you are going
Relative Weighted Coverage Metric
When samples are withheld, road coverage decreases
Congestion monitoring is more important on popular routes
Coverage is limited by the original data set, so coverage can’t get better; it can only go down.
High Level: The metric measures the coverage delta between the original data set and the confused data set. It is a measure of data quality.
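A rough sketch of such a coverage measure follows. The weighting is an assumption made here (a segment's weight is its sample count in the original data, as a stand-in for popularity); the exact definition in the paper may differ, and the segment names are made up.

```python
from collections import Counter

def weighted_coverage(original_segments, released_segments):
    """Share of the original, popularity-weighted coverage kept after withholding samples.

    Each argument is a list with one road segment id per GPS sample.
    A segment's weight is its sample count in the original data (an assumption).
    """
    weights = Counter(original_segments)
    still_covered = set(released_segments) & set(weights)
    total = sum(weights.values())
    kept = sum(weights[seg] for seg in still_covered)
    return kept / total if total else 0.0

original = ["highway"] * 6 + ["arterial"] * 3 + ["side_street"]
released = ["highway"] * 5 + ["arterial"] * 2   # the side street is no longer covered
print(weighted_coverage(original, released))
```

Because the released samples are a subset of the original ones, the value cannot exceed 1, and losing a lightly used segment costs less than losing a popular one.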
Time-to-confusion Metric
The mean time-to-confusion (MTTC) is meant to be a measurement of privacy
The lower the average trackable trip time, the more privacy you have as an individual in the overall system.
The time-to-confusion threshold is the maximum time for which an individual may be tracked.
High Level: Time-to-confusion is the time you are able to be “tracked” after de-anonymization.
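As an illustration of the metric, the sketch below measures how long each trip can be followed before the tracker is confused. The input format is hypothetical: each trip is a time-sorted list of `(timestamp_s, confused)` pairs, where `confused` marks a sample the adversary cannot confidently link to the trace.

```python
from statistics import mean

def tracking_durations(trip):
    """Split one trip into tracked intervals, each ending at a point of confusion."""
    durations = []
    start = None
    for timestamp, confused in trip:
        if start is None:
            start = timestamp          # a new tracked interval begins here
        if confused:
            durations.append(timestamp - start)
            start = None
    if start is not None and trip:
        durations.append(trip[-1][0] - start)  # trip ended while still being tracked
    return durations

def mean_time_to_confusion(trips):
    """Average tracked duration over all intervals of all trips."""
    return mean(d for trip in trips for d in tracking_durations(trip))

trip = [(0, False), (60, False), (120, True), (180, False), (240, True),
        (300, False), (360, False), (420, False), (480, False)]
print(tracking_durations(trip))
print(mean_time_to_confusion([trip]))
```

The maximum of the same durations gives the worst case for any individual, which is what a time-to-confusion threshold bounds.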
“Uncertainty-aware” algorithm
Calculates the probability of a particular point belonging to a “trip” and verifies that the trip cannot be followed, due to the existence of other points which could just as probably fit that trip
High Level: Ensures that a specific level of uncertainty is maintained for every “trip” in the trace data.
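One way to quantify this uncertainty is the Shannon entropy of the candidate probabilities: entropy is zero when a single point is the only plausible continuation, and grows as other points become equally probable. The sketch below assumes that measure. The threshold is in bits and is a hypothetical parameter; how the percentage thresholds quoted in these slides map onto it is not specified here.

```python
import math

def tracking_uncertainty(likelihoods):
    """Shannon entropy (bits) of the normalized candidate likelihoods."""
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one candidate must have positive likelihood")
    probabilities = [value / total for value in likelihoods if value > 0]
    return sum(-p * math.log2(p) for p in probabilities)

def is_trackable(likelihoods, uncertainty_threshold):
    """True when the adversary's uncertainty is below the (hypothetical) threshold."""
    return tracking_uncertainty(likelihoods) < uncertainty_threshold

print(tracking_uncertainty([1.0]))        # a single candidate: no uncertainty
print(tracking_uncertainty([2.0, 2.0]))   # two equally likely candidates
print(is_trackable([0.9, 0.1], 0.5))
```

A trip that stays trackable under this test for too long is the case the cloaking step has to break up.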
Put it all together
Given all the points in a particular slice of time, if a single point could have been tracked longer than the time-to-confusion threshold, AND the point in this time slice can be correlated to that trace with high probability, that point is omitted from the set of published data.
Allows tracking for a limited time, but prevents tracking the entire trip.
The starting location and ending location are not connected, so it’s not possible to identify who the individual is or where they are going, thus privacy is preserved.
Mean time-to-confusion is the average time between omitted points on a “trip”
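The release rule described above can be sketched for a single vehicle as follows. This is a simplified illustration, not the authors' algorithm: the per-sample uncertainty is assumed to be precomputed, the parameter names are hypothetical, and the trace is treated as broken after every omitted point, which ignores reacquisition.

```python
def cloak_trip(samples, ttc_threshold_s, uncertainty_threshold):
    """Return (released, withheld) timestamps for one vehicle's samples.

    samples is a time-sorted list of (timestamp_s, uncertainty) pairs, where
    uncertainty estimates how unsure an adversary would be when linking this
    sample to the vehicle's earlier released samples (assumed precomputed).
    """
    released, withheld = [], []
    tracking_start = None  # first sample of the current linkable run
    for timestamp, uncertainty in samples:
        if tracking_start is None:
            tracking_start = timestamp
        linkable = uncertainty < uncertainty_threshold
        tracked_too_long = timestamp - tracking_start > ttc_threshold_s
        if linkable and tracked_too_long:
            withheld.append(timestamp)
            tracking_start = None       # simplification: the trace is broken here
        else:
            released.append(timestamp)
            if not linkable:
                tracking_start = timestamp  # adversary is confused; restart the clock
    return released, withheld

# One sample per minute for ten minutes, all easy to link (low uncertainty).
samples = [(t, 0.1) for t in range(0, 600, 60)]
released, withheld = cloak_trip(samples, ttc_threshold_s=300, uncertainty_threshold=0.5)
print(released)
print(withheld)
```

In this run only the sample that would extend tracking past five minutes is withheld, so most of the data is still published.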
Data
Data was collected from 233 volunteer vehicles over 7 days
Data covers a 70km by 70km metropolitan area (70km = 43.5 miles)
Samples are taken every 1 minute while ignition is “on”
Results: Off-Peak, High Density
10am to 11:30am
Gray dots are released
Black dots are excluded
Results: On-Peak, High Density
5pm to 6:30pm
Gray dots are released
Black dots are excluded
Results: Maximum TTC
If the uncertainty threshold (UT) is 40% and TTC = 5 min, 92.5% of points may be published.
If UT is 99% and TTC = 5 min, still over 65% of points may be published.
If only 92.5% of points are published and randomly selected, at least one route is traceable for 35 minutes.
Results: Median TTC
If UT is 40% and TTC = 5 min, MTTC is 1 minute for the data set.
If UT is 99% and TTC = 5 min, MTTC is 1 minute for the data set.
Publishing 80% of points randomly still identified 15% of routes for over 10 minutes. (median not specified)
Results: Relative Weighted Road Coverage
When the uncertainty threshold = 95% and TTC = 5 min, 81% of data samples are released and road coverage is still 95%.
If 20% of data samples are removed randomly, 80% of samples are published and road coverage is only 79.3%.
As you can see, there is significantly more degradation in the case of randomly throwing out data.
Other Considerations
The authors also consider algorithm modifications to address reacquisition. Maximum TTC is still preserved, but quality is only marginally better than when data points are randomly removed.
The authors also do not make their algorithm aware of real topography, which could be taken advantage of by an attacker. If topography were also considered, this problem could be averted.
There are many open research areas (in 2007).