On the Approximability of Geometric and Geographic Generalization and the Min-Max Bin Covering Problem
Michael T. Goodrich
Dept. of Computer Science
joint with Wenliang Du, David Eppstein, and George Lueker
Motivation
• Privacy is a concern with respect to information in relational databases
– rows are associated with people
– columns are attributes
• K-anonymity
– No query should reveal fewer than K individuals
image source: http://neodv8.blogspot.com/2007/09/neutral-mask-masterclass.html
Generalization
• Replace specific attributes with more general ones, so no category has fewer than K members.
source: ℓ-Diversity: Privacy Beyond k-Anonymity Ashwin Machanavajjhala Johannes Gehrke Daniel Kifer Muthuramakrishnan Venkitasubramaniam Department of Computer Science, Cornell University
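A minimal sketch of the generalization condition, assuming a single attribute given as a list of values. The function name `is_k_anonymous` and the age data are hypothetical and serve only as illustration; they are not taken from the talk.

```python
from collections import Counter

def is_k_anonymous(generalized_values, k):
    """Return True if every category that occurs has at least k members."""
    counts = Counter(generalized_values)
    return all(c >= k for c in counts.values())

# Hypothetical example: exact ages are too specific, decades are not.
ages = [23, 27, 31, 38, 24, 35]
decades = [f"{(a // 10) * 10}s" for a in ages]

raw_ok = is_k_anonymous(ages, 2)             # every age occurs only once
generalized_ok = is_k_anonymous(decades, 3)  # "20s" and "30s" have 3 members each
```

Replacing each age with its decade is one possible generalization; the optimization problems in the talk ask for groupings that satisfy the condition while keeping categories as small as possible.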
Data Types
• Linear: Easy greedy algorithm is optimal
• Unordered: arbitrary groupings possible
• GPS coordinates: group using rectangles
• Zip codes: should use proximity, not text
image source: http://eagereyes.org/Applications/ZIPScribbleMap.html
Previous Work
• [Samarati, Sweeney, 98] introduce the concept of k-anonymization and generalization to achieve it.
• [Meyerson, Williams, 04] show optimal generalization of unordered data is NP-hard, but their proof requires as many attributes as people. Similar proofs are due to [Aggarwal et al., 05] and [Byun et al., 07].
• [Khanna, Muthukrishnan, Paterson, 98] study a rectangle tiling problem similar to GPS coordinate generalization, showing 5/4-approximations are not possible unless P=NP.
• Lots more work on k-anonymization and its variants…
Our Results
• Zip codes: has a 4-approximation, but no 4/3-approximation unless P=NP
• GPS coordinates: has a 5-approximation, but no 4/3-approximation unless P=NP
• Unordered: is NP-hard but has a PTAS. Also, this version of the problem gives rise to a new type of bin-packing problem.
Min-Max Bin Covering
[Figure: items packed into bins, with bin levels labeled "max" and "min (k)".]
image source: http://www.developerfusion.com/article/5540/bin-packing/4/
Min-Max Bin Cover is NP-hard
• Reduction from:
• Reduction method:
A Next-Fit Method: “Fold”
• Theorem: There is a linear-time algorithm, A, guaranteeing
• Proof idea: Put items of size at least k into their own bins, and use Next Fit for remaining items.
– All but the last bin have level at most 2k − 2, as each has level at most k − 1 before its last item is added, and that item has size at most k − 1.
– There may be one leftover bin with level less than k, which must be merged with some other bin.
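The proof idea can be sketched in code as follows. This is an illustrative reading, not the authors' implementation: the function name `fold` is hypothetical, item sizes are assumed to be positive integers, and the leftover bin is merged into the lowest-level bin, which is one possible choice of "some other bin".

```python
def fold(sizes, k):
    """Pack positive integer sizes into bins of level >= k (sketch of "Fold")."""
    if sum(sizes) < k:
        raise ValueError("total size is below k, so no bin can be covered")
    # Items of size at least k each get a bin of their own.
    bins = [[s] for s in sizes if s >= k]
    current, level = [], 0
    for s in sizes:
        if s >= k:
            continue
        current.append(s)
        level += s
        if level >= k:
            # Next Fit: close the bin as soon as it reaches level k.
            bins.append(current)
            current, level = [], 0
    if current:
        # Leftover bin with level < k. Assumption: merge it into the
        # bin with the lowest level (any other bin would also be valid).
        min(bins, key=sum).extend(current)
    return bins

example_sizes = [5, 1, 2, 2, 1, 3, 1]
example_bins = fold(example_sizes, 3)
example_levels = [sum(b) for b in example_bins]
```

Under these assumptions, merging a leftover of level at most k − 1 into a Next Fit bin gives level at most 3k − 3, and merging it into a single-item bin gives at most that item's size plus k − 1.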
Our PTAS: “Spread”
• Theorem: For each fixed ϵ > 0, there is a polynomial time algorithm Aϵ that, given some instance X of Min-Max Bin Covering, finds a solution satisfying Aϵ(X) ≤ (1 + ϵ)(Opt(X) + 1).
• Note: Normalize so k=1 and note that if there is an item of size > 3, then the Next-Fit Theorem gives an optimal solution.
• We can assume, wlog, that the optimal solution has cost at most 3.
The Spread Algorithm Warm-up
• Call items < ϵ “small” and others “large”
– Note that any solution will have at most 3n bins.
• For any packing P, let the type of P be a packing where we throw out all small items and round each large item down to the largest value no greater than it that is a product of ϵ and a power of (1+ϵ).
– Example rounded value: ϵ(1+ϵ)^5
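The rounding rule can be illustrated with a short sketch, assuming sizes are already normalized so that k = 1. The function name `round_down` is hypothetical, and the two correction loops are only a guard against floating-point error in the logarithm.

```python
import math

def round_down(size, eps):
    """Round a large item down to the nearest eps * (1 + eps)**i, i >= 0.

    Returns None for small items (size < eps), which a type discards.
    """
    if size < eps:
        return None
    i = int(math.floor(math.log(size / eps, 1 + eps)))
    # Guard against floating-point error in the logarithm.
    while eps * (1 + eps) ** (i + 1) <= size:
        i += 1
    while i > 0 and eps * (1 + eps) ** i > size:
        i -= 1
    return eps * (1 + eps) ** i

# With eps = 0.5 the rounded values are 0.5, 0.75, 1.125, 1.6875, 2.53125, ...
rounded_one = round_down(1.0, 0.5)
rounded_three = round_down(3.0, 0.5)
rounded_small = round_down(0.3, 0.5)
```

Each large item shrinks by a factor of less than (1+ϵ), which is where the (1+ϵ) factor in the approximation guarantee comes from.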
More Warm-up
• There are a constant number of rounded values, for fixed ϵ; hence, a constant number of configurations (ways of filling a bin to at most 3 with rounded values).
• Represent a type by counts of each configuration, so that there are a polynomial number of types (with at most 3n bins).
[Figure: a type shown as a bin count for each of a constant number of configurations; the example has five configurations with bin counts 4, 0, 6, 8, 1.]
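To make the counting argument concrete, the sketch below enumerates configurations for a given ϵ, assuming k is normalized to 1 and the cap on a bin is 3. The names `rounded_values` and `configurations` are hypothetical, and the empty configuration (a bin holding only small items) is included as a design choice.

```python
def rounded_values(eps, cap=3.0):
    """All values eps * (1 + eps)**i that do not exceed cap."""
    values, v = [], eps
    while v <= cap:
        values.append(v)
        v *= 1 + eps
    return values

def configurations(eps, cap=3.0):
    """Enumerate multisets of rounded values with total at most cap.

    Each configuration is a tuple of counts, one count per rounded value.
    """
    values = rounded_values(eps, cap)
    found = []

    def extend(start, remaining, counts):
        found.append(tuple(counts))
        # Only add values with index >= start so each multiset appears once.
        for j in range(start, len(values)):
            if values[j] <= remaining + 1e-12:
                counts[j] += 1
                extend(j, remaining - values[j], counts)
                counts[j] -= 1

    extend(0, cap, [0] * len(values))
    return found

# Tiny illustration with eps = 1: rounded values are 1 and 2.
tiny = configurations(1.0)
```

The number of configurations depends only on ϵ, not on the number of items n, so a type (one bin count per configuration) ranges over polynomially many possibilities.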
The Spread Algorithm
• For each type T:
1. Let T’ be the packing with rounded values replaced by the corresponding original (large) values.
2. Pack small values into T’ using the greedy method of choosing the bin with the lowest level.
3. Merge pairs of smallest bins until every bin has a level of at least 1.
• Pick the packing that minimizes the size of the largest bin.
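Steps 2 and 3 for a single type can be sketched with a min-heap of bin levels. This is a simplified illustration that tracks levels only, not bin contents; the function name `complete_packing` and the sample numbers are hypothetical, and k is assumed to be normalized to 1.

```python
import heapq

def complete_packing(large_levels, small_items):
    """Greedy completion and merging for one candidate packing T'.

    large_levels: levels of the bins after placing the original large items.
    small_items: sizes of the small items still to be placed.
    Returns the final bin levels in increasing order.
    """
    heap = list(large_levels)
    heapq.heapify(heap)
    # Step 2: put each small item into the bin with the lowest level.
    for s in small_items:
        lowest = heapq.heappop(heap)
        heapq.heappush(heap, lowest + s)
    # Step 3: merge the two smallest bins until every bin has level >= 1.
    while len(heap) > 1 and heap[0] < 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, a + b)
    return sorted(heap)

final_levels = complete_packing([0.6, 1.2, 0.3], [0.2, 0.1, 0.1])
cost = max(final_levels)
```

The outer loop of Spread would call a routine like this once per type and keep the packing with the smallest `cost`.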
Why it Works
• The type for the optimal solution is considered by the Spread algorithm.
• The T’ in this instance has cost at most (1+ϵ) times the optimal cost.
• During the greedy completion, the maximum bin must be at most (1+ϵ)Opt + ϵ, for otherwise we would have used more than the original set of items.
• When we merge bins, we may merge one with level less than 1 with one of level (1+ϵ)Opt + ϵ; hence a maximum of (1+ϵ)Opt + 1 + ϵ = (1+ϵ)(Opt + 1).
Experimental Results
• Apply to names in the U.S. Census data:
– FEMALE-1990: Female first names and their frequencies, for names with frequency at least 0.001%.
– MALE-1990: Male first names and their frequencies, for names with frequency at least 0.001%.
– LAST-1990: Surnames and their frequencies, for surnames with frequency at least 0.001%.
Fold versus Spread
• Apply to random and sorted orders, since both algorithms consider items according to their input order.
• Test each algorithm for increasing k.
• At certain threshold levels of k, the number of bins is reduced, which causes some “jaggedness” in the results.
Female-1990
Male-1990
Last-1990
Zip Code Generalization is NP-Hard
• Formally, 3-Regular Planar Partition into Paths of Length 2 (3PPPL2): Given a 3-regular planar graph G, can G be partitioned into paths of length 2?
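As an illustration of the decision problem, the sketch below verifies a proposed solution, assuming "partitioned into paths of length 2" means that every vertex lies on exactly one three-vertex path. The function name `is_path_partition` and the triangular prism example are hypothetical choices made here.

```python
def is_path_partition(adj, paths):
    """Check that paths (a, b, c), with b in the middle, partition the vertices.

    adj maps each vertex to the set of its neighbours.
    """
    seen = set()
    for a, b, c in paths:
        trio = {a, b, c}
        if len(trio) != 3:
            return False
        # Both edges of the path a - b - c must exist in the graph.
        if b not in adj[a] or c not in adj[b]:
            return False
        # No vertex may be used by two paths.
        if seen & trio:
            return False
        seen |= trio
    return seen == set(adj)

# Triangular prism: a 3-regular planar graph on 6 vertices.
prism = {
    0: {1, 2, 3}, 1: {0, 2, 4}, 2: {0, 1, 5},
    3: {0, 4, 5}, 4: {1, 3, 5}, 5: {2, 3, 4},
}
good = is_path_partition(prism, [(0, 1, 2), (3, 4, 5)])
bad = is_path_partition(prism, [(0, 4, 2), (3, 1, 5)])
```

Checking a given partition is easy; the hardness lies in deciding whether any valid partition exists.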
Proof Sketch
• Reduction from 3-Dimensional Matching:
– Given triples (x,y,z) from sets X,Y,Z, find a set of triples such that each member of X, Y, and Z belongs to exactly one triple.
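The 3-Dimensional Matching condition can be stated as a small checker. This is a sketch with hypothetical names (`is_3d_matching`, the sample sets), intended only to pin down what a valid solution looks like.

```python
from collections import Counter

def is_3d_matching(X, Y, Z, triples, chosen):
    """Check that chosen is a subset of triples covering X, Y, Z exactly once."""
    if not set(chosen) <= set(triples):
        return False
    xs = Counter(x for x, _, _ in chosen)
    ys = Counter(y for _, y, _ in chosen)
    zs = Counter(z for _, _, z in chosen)
    # Each member of X, Y and Z must appear in exactly one chosen triple.
    return xs == Counter(X) and ys == Counter(Y) and zs == Counter(Z)

X, Y, Z = {"a", "b"}, {1, 2}, {"p", "q"}
triples = [("a", 1, "p"), ("b", 2, "q"), ("b", 1, "q")]
valid = is_3d_matching(X, Y, Z, triples, [("a", 1, "p"), ("b", 2, "q")])
invalid = is_3d_matching(X, Y, Z, triples, [("a", 1, "p"), ("b", 1, "q")])
```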
Proof Sketch
• Crossover gadget:
Additional Results
• A 4/3-approximation algorithm for planar graphs
• NP-hardness and 4/3-approximation algorithm for two-dimensional points.
Conclusion
• We have shown that generalization is NP-hard and in some cases cannot be arbitrarily approximated unless P=NP.
• We have given approximation algorithms for the versions we study:
– unordered data
– planar graphs (generalized into connected components)
– two-dimensional points (generalized with rectangles)