
Preserving Privacy in GPS Traces via Uncertainty-Aware Path Cloaking

Baik Hoh, Marco Gruteser, Hui Xiong, Ansaf Alrabady

All images are credited to “ACM” Hoh et al (2007), pp. 161-170

Problem

GPS traces are taken from “probe” vehicles to provide services

Traffic Monitoring Application

GPS location, heading, and speed data

Other research has shown that even if this data is anonymized, individual routes can be identified.

Problem: Traffic Monitoring

GPS points are mapped to a road segment

The average speed of those vectors is calculated

Congestion is inferred

Requirements

Spatial Accuracy

Road Coverage

Both are achieved through a high enough “penetration rate” (the share of vehicles that act as probes)

Initial deployments fall short – and privacy suffers

Problem: How Privacy is Compromised

Individuals can be identified by starting and ending points in the GPS trace

Data points can be linked together using target tracking and “Maximum Likelihood Detection”

For a set of possible points, select the point with the highest probability of belonging to this route.
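
The linking step can be illustrated with a minimal sketch. It assumes the attacker predicts the next position from the last reported location, speed, and heading, and scores each candidate with an isotropic Gaussian around that prediction. The function names, the heading convention (radians from the x-axis), and the `sigma` value are illustrative assumptions, not details taken from the slides.

```python
import math


def predict(last, dt):
    """Dead-reckon the next position from the last sample.

    last = (x, y, speed, heading): metres, metres, m/s, radians from the x-axis.
    """
    x, y, speed, heading = last
    return (x + speed * dt * math.cos(heading),
            y + speed * dt * math.sin(heading))


def likelihood(candidate, predicted, sigma):
    """Unnormalised Gaussian likelihood of a candidate given the prediction."""
    dx = candidate[0] - predicted[0]
    dy = candidate[1] - predicted[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def link_next(last, candidates, dt=60.0, sigma=200.0):
    """Return the index of the most likely next point and all scores.

    dt=60 s mirrors a one-sample-per-minute trace; sigma is an assumed value.
    """
    predicted = predict(last, dt)
    scores = [likelihood(c, predicted, sigma) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


# A vehicle at the origin heading along the x-axis at 10 m/s.
best, scores = link_next((0.0, 0.0, 10.0, 0.0),
                         [(590.0, 10.0), (0.0, 600.0), (1200.0, 0.0)])
```

When one candidate scores far above the rest, the attacker can extend the trace with little doubt; when several candidates score similarly, the link becomes a guess.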


Problem: Existing Algorithms

Existing anonymity algorithms cause severe degradation to the utility of the data

Problem: Existing Algorithms

K-anonymity using CliqueCloak modifies trace data beyond usability

Thought to be the most accurate system with any anonymity guarantee

Even with an anonymity set as small as 3, location accuracy drops to between 1000 and 2000 m, even with 2000 probes.

Increasing penetration rate would help, but:

Higher penetration rates are not possible in early deployment

Lower density areas of the map would never be accurate.

Factors to Consider

The longer the attacker can follow an individual trace, the better they are able to guess who you are and where you are going.

Relative Weighted Coverage Metric

When samples are withheld, road coverage decreases

Congestion monitoring is more important on popular routes

Coverage is limited by the original data set, so coverage can’t get better; it can only go down.

High Level: The metric measures the coverage delta between the original data set and the confused data set. It is a measure of data quality.
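
A rough sketch of such a metric follows. The slides do not give the exact weighting formula, so this assumes each road segment carries a hypothetical popularity weight and that a segment counts as covered when at least one sample maps to it. All names are illustrative.

```python
def weighted_coverage(segments, weights):
    """Sum the weights of the distinct road segments that have a sample."""
    return sum(weights[s] for s in set(segments))


def relative_weighted_coverage(original_segments, released_segments, weights):
    """Weighted coverage of the released data relative to the original data.

    The result is at most 1.0, because released samples are a subset of the
    original ones and cannot cover a segment the original set missed.
    """
    base = weighted_coverage(original_segments, weights)
    if base == 0:
        raise ValueError("original data set covers no weighted road segment")
    kept = set(released_segments) & set(original_segments)
    return weighted_coverage(kept, weights) / base


# Hypothetical weights: busier roads matter more for congestion monitoring.
weights = {"highway": 5.0, "arterial": 3.0, "side_street": 1.0}
original = ["highway", "highway", "arterial", "side_street"]
released = ["highway", "arterial"]
ratio = relative_weighted_coverage(original, released, weights)  # 8/9
```

Losing the only sample on a side street costs little here, while losing every highway sample would cost more than half of the score.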

Time-to-confusion Metric

The mean time-to-confusion (MTTC) is meant to be a measurement of privacy

The lower the average trackable trip time, the more privacy you have as an individual in the overall system.

How long an individual can be tracked is a time-to-confusion threshold.

High Level: Time-to-confusion is the time you are able to be “tracked” after de-anonymization.
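
As a sketch of how this could be measured, the code below assumes each sample of a trace has already been labelled with whether the attacker could confidently link it to the previous sample. A tracking run ends at the first sample that cannot be linked. The input layout and function names are assumptions made for illustration.

```python
def tracking_durations(samples):
    """Return the length in seconds of every uninterrupted tracking run.

    samples: list of (t_seconds, linked) in time order. linked is True when
    the attacker could confidently link the sample to the previous one; the
    flag of the first sample is ignored because it has no predecessor.
    """
    durations = []
    start = prev = None
    for t, linked in samples:
        if start is None:
            start = prev = t
        elif linked:
            prev = t
        else:
            # Confusion point: close the current run and start a new one.
            durations.append(prev - start)
            start = prev = t
    if start is not None:
        durations.append(prev - start)
    return durations


def mean_time_to_confusion(traces):
    """Average tracking-run duration over all traces, in seconds."""
    runs = [d for trace in traces for d in tracking_durations(trace)]
    return sum(runs) / len(runs) if runs else 0.0


trace = [(0, False), (60, True), (120, True), (180, False), (240, True)]
runs = tracking_durations(trace)        # [120, 60]
mttc = mean_time_to_confusion([trace])  # 90.0
```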

“Uncertainty-aware” algorithm

Calculates the probability of a particular point belonging to a “trip” and verifies that the trip cannot be followed, due to the existence of other points which could just as probably fit that trip

High Level: Ensures that a specific level of uncertainty is maintained for every “trip” in the trace data.
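
One way to put a number on this uncertainty is Shannon entropy over the candidate points. The slides do not say how uncertainty is computed, so the sketch below is an assumption: the inputs are unnormalised likelihoods that each candidate continues the trip, and `min_entropy_bits` is a hypothetical threshold.

```python
import math


def tracking_uncertainty(likelihoods):
    """Shannon entropy in bits of the normalised candidate likelihoods."""
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one candidate needs a positive likelihood")
    probs = [v / total for v in likelihoods if v > 0]
    return -sum(p * math.log2(p) for p in probs)


def is_confusing(likelihoods, min_entropy_bits):
    """True when the candidates leave the attacker at least this uncertain."""
    return tracking_uncertainty(likelihoods) >= min_entropy_bits


one_clear_match = tracking_uncertainty([0.99, 0.01])  # close to 0 bits
two_equal_matches = tracking_uncertainty([1.0, 1.0])  # 1 bit
```

A single dominant candidate gives entropy near zero, meaning the trip can be followed. Two equally plausible candidates give one bit, meaning the attacker is reduced to a coin flip.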

Put it all together

Given all the points in a particular slice of time, if a single point could have been tracked longer than the time-to-confusion threshold, AND the point in this time slice can be correlated to that trace with high probability, that point is omitted from the set of published data.

Allows tracking for a limited time, but prevents tracking the entire trip.

The starting location and ending location are not connected, so it’s not possible to identify who the individual is or where they are going, thus privacy is preserved.

Mean time-to-confusion is the average time between omitted points on a “trip”
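
The release rule described above can be sketched as a per-time-slice filter. This is a simplified illustration, not the authors' implementation: it assumes the tracked time and linking confidence for each point have already been computed, and the parameter names and the 0.95 confidence value are hypothetical. The 300-second default mirrors a 5-minute time-to-confusion threshold.

```python
def filter_time_slice(points, tracked_seconds, link_confidence,
                      max_tracked_seconds=300.0, confidence_threshold=0.95):
    """Return the point ids that may be published for one time slice.

    points: ids of the samples reported in this slice.
    tracked_seconds: id -> how long the trace behind the point has been
        trackable so far.
    link_confidence: id -> probability that the point can be linked to
        that trace.
    """
    released = []
    for point_id in points:
        too_long = tracked_seconds.get(point_id, 0.0) > max_tracked_seconds
        linkable = link_confidence.get(point_id, 0.0) >= confidence_threshold
        if too_long and linkable:
            continue  # omit: publishing it would extend tracking past the limit
        released.append(point_id)
    return released


released = filter_time_slice(
    ["a", "b", "c"],
    tracked_seconds={"a": 360.0, "b": 360.0, "c": 120.0},
    link_confidence={"a": 0.99, "b": 0.40, "c": 0.99},
)  # ["b", "c"]
```

Point "a" is withheld because both conditions hold. Point "b" has been tracked just as long but is ambiguous enough to publish, and point "c" is still within the allowed tracking time.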

Data

Used data collected from 233 volunteer vehicles over 7 days

Data covers a 70km by 70km metropolitan area (70km = 43.5 miles)

Samples are taken every 1 minute while ignition is “on”

Data

Results: Off-Peak, High Density

Off-Peak, High Density: 10am – 11:30am

Gray dots are released

Black dots are excluded

Results: On-Peak, High Density

On-Peak, High Density: 5pm – 6:30pm

Gray dots are released

Black dots are excluded

Results: Comparison

Off-Peak and On-Peak

Results: Maximum TTC

If UT = 40%, TTC = 5 min: 92.5% of points may be published

If UT = 99%, TTC = 5 min: still over 65% of points may be published.

If only 92.5% of points are published and randomly selected, at least one route is traceable for 35 minutes.

Results: Median TTC

If UT = 40%, TTC = 5 min: MTTC is 1 minute for the data set.

If UT = 99%, TTC = 5 min: MTTC is 1 minute for the data set.

Publishing 80% of points randomly still identified 15% of routes for over 10 minutes. (median not specified)

Results: Relative Weighted Road Coverage

When Uncertainty Threshold = 95% and TTC = 5 min:

81% of data samples are released

Road coverage is still 95%

If 20% of data samples are removed randomly:

80% of samples are published

Road coverage is only 79.3%

As you can see, there is significantly more degradation in the case of randomly throwing out data.

Other Considerations

The authors also consider algorithm modifications to address reacquisition.

Maximum TTC is still preserved, but quality is only marginally better than when data points are randomly removed

The authors also do not make their algorithm aware of real topography, which could be taken advantage of by an attacker

If topography were also considered, this problem could be averted.

There are many open research areas (in 2007).

Conclusion

Intelligently removing data points to confuse a de-anonymization algorithm is successful even for low-penetration deployments.

