On the Approximability of Geometric and Geographic Generalization and the Min-Max Bin Covering Problem

Michael T. Goodrich

Dept. of Computer Science

joint with Wenliang Du, David Eppstein, and George Lueker

Motivation

• Privacy is a concern with respect to information in relational databases
– rows are associated with people
– columns are attributes

• K-anonymity
– No query should reveal fewer than K individuals

image source: http://neodv8.blogspot.com/2007/09/neutral-mask-masterclass.html

Generalization

• Replace specific attributes with more general ones, so no category has fewer than K members.

source: ℓ-Diversity: Privacy Beyond k-Anonymity Ashwin Machanavajjhala Johannes Gehrke Daniel Kifer Muthuramakrishnan Venkitasubramaniam Department of Computer Science, Cornell University

Data Types

• Linear: Easy greedy algorithm is optimal

• Unordered: arbitrary groupings possible

• GPS coordinates: group using rectangles

• Zip codes: should use proximity, not text

image source: http://eagereyes.org/Applications/ZIPScribbleMap.html

Previous Work

• [Samarati, Sweeney, 98] introduce the concept of k-anonymization and generalization to achieve it.

• [Meyerson, Williams, 04] show that optimal generalization of unordered data is NP-hard, but their proof requires as many attributes as people. Similar proofs are due to [Aggarwal et al., 05] and [Byun et al., 07].

• [Khanna, Muthukrishnan, Paterson, 98] study a rectangle tiling problem similar to GPS coordinate generalization, showing that 5/4-approximations are not possible unless P=NP.

• Lots more work on k-anonymization and its variants…

Our Results

• Zip codes: has a 4-approximation, but no 4/3-approximation unless P=NP

• GPS coordinates: has a 5-approximation, but no 4/3-approximation unless P=NP

• Unordered: is NP-hard but has a PTAS. Also, this version of the problem gives rise to a new type of bin-packing problem.

Min-Max Bin Covering

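[Figure: items packed into bins. Every bin must reach the minimum level k; the goal is to minimize the maximum bin level.]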

image source: http://www.developerfusion.com/article/5540/bin-packing/4/

Min-Max Bin Cover is NP-hard

• Reduction from:

• Reduction method:

A Next-Fit Method: “Fold”

• Theorem: There is a linear-time algorithm, A, guaranteeing

• Proof idea: Put items of size at least k into their own bins, and use Next Fit for the remaining items.
– All but the last bin have level at most 2k − 2, as each has level at most k − 1 before its last item is added, and that item has size at most k − 1.
– There may be one leftover bin with level less than k, which must be merged with some other bin.
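The proof idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `fold` and `max_level` are hypothetical, item sizes are assumed to be positive integers, and merging the leftover bin into the lowest-level bin is an assumption (the slide only says "some other bin").

```python
def fold(sizes, k):
    """Return bins (lists of item sizes), each with level at least k."""
    if sum(sizes) < k:
        raise ValueError("total size is below k, so no bin can be covered")
    # Items of size at least k go into their own bins.
    bins = [[s] for s in sizes if s >= k]
    # Next Fit on the remaining items: close a bin as soon as it reaches k.
    current, level = [], 0
    for s in sizes:
        if s >= k:
            continue
        current.append(s)
        level += s
        if level >= k:
            bins.append(current)
            current, level = [], 0
    if current:
        # Leftover bin with level < k. Assumption: merge it into the
        # bin whose level is currently lowest.
        min(bins, key=sum).extend(current)
    return bins


def max_level(bins):
    """Cost of a solution: the level of the fullest bin."""
    return max(sum(b) for b in bins)


example = fold([5, 2, 2, 1, 3, 1], k=3)
print(example, max_level(example))  # prints [[5], [3, 1, 1], [2, 2]] 5
```

Both passes over the input are linear; only the final merge scans the bins once more.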

Our PTAS: “Spread”

• Theorem: For each fixed ϵ > 0, there is a polynomial time algorithm Aϵ that, given some instance X of Min-Max Bin Covering, finds a solution satisfying Aϵ(X) ≤ (1 + ϵ)(Opt(X) + 1).

• Note: Normalize so that k = 1. If there is an item of size > 3, then the Next-Fit theorem gives an optimal solution.

• We can assume, wlog, that the optimal solution has cost at most 3

The Spread Algorithm Warm-up

• Call items < ϵ “small” and others “large”
– Note that any solution will have at most 3n bins.

• For any packing P, let the type of P be the packing obtained by throwing out all small items and rounding each large item down to the largest value no greater than it that is a product of ϵ and a power of (1+ϵ), for example ϵ(1+ϵ)^5.
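The rounding step can be illustrated as follows. This is a sketch under stated assumptions: `round_down` is a hypothetical helper, "rounding down" is read as "largest value of the form ϵ(1+ϵ)^j not exceeding the item", and exact `Fraction` arithmetic is used to avoid floating-point ties.

```python
from fractions import Fraction


def round_down(x, eps):
    """Round a large item down to the nearest eps * (1 + eps)**j.

    Returns None for small items (x < eps), which a type discards.
    """
    if x < eps:
        return None
    value = eps
    # Step up through eps, eps(1+eps), eps(1+eps)^2, ... while staying <= x.
    while value * (1 + eps) <= x:
        value *= 1 + eps
    return value


eps = Fraction(1, 2)
print(round_down(Fraction(3, 2), eps))  # prints 9/8, i.e. eps * (1 + eps)**2
```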

More Warm-up

• There are a constant number of rounded values, for fixed ϵ; hence, a constant number of configurations, i.e., ways of filling a bin to at most 3 with rounded values.

• Represent a type by counts of each configuration, so that there are a polynomial number of types (with at most 3n bins).

[Figure: a type shown as a row of bin counts (4, 0, 6, 8, 1), one count per configuration; the number of configurations is constant.]
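To make the counting concrete, the configurations for a given ϵ can be enumerated directly. The sketch below is illustrative only: `rounded_values` and `configurations` are hypothetical names, a configuration is modeled as a non-decreasing tuple of rounded values with total at most 3, and the empty configuration (a bin holding only small items) is included as an assumption.

```python
from fractions import Fraction


def rounded_values(eps, cap=3):
    """All values eps * (1 + eps)**j that are at most cap."""
    values = []
    v = eps
    while v <= cap:
        values.append(v)
        v *= 1 + eps
    return values


def configurations(eps, cap=3):
    """All multisets of rounded values whose total is at most cap."""
    values = rounded_values(eps, cap)
    found = []

    def extend(start, chosen, total):
        found.append(tuple(chosen))
        # Only use values at index >= start so each multiset appears once.
        for i in range(start, len(values)):
            if total + values[i] <= cap:
                extend(i, chosen + [values[i]], total + values[i])

    extend(0, [], 0)
    return found


# With eps = 1 the rounded values are 1 and 2, giving six configurations:
# (), (1), (1,1), (1,1,1), (1,2), (2).
print(len(configurations(Fraction(1))))  # prints 6
```

The count depends only on ϵ and the cap, not on the number of items, which is why a type can be stored as one bin count per configuration.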

The Spread Algorithm

• For each type T:
1. Let T’ be the packing with rounded values replaced with the corresponding original (large) values.
2. Pack small values into T’ using the greedy method of choosing the bin with lowest level.
3. Merge pairs of smallest bins until every bin has a level of at least 1.

• Pick the resulting packing that minimizes the size of the largest bin.
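Steps 2 and 3 for a single candidate packing T’ can be sketched with a min-heap keyed on bin level. This is a hedged illustration, not the authors' code: `complete_packing` is a hypothetical name, k is normalized to 1, the enumeration of types (step 1 and the outer loop) is omitted, and "merge pairs of smallest bins" is read as repeatedly merging the two lowest-level bins.

```python
import heapq
from fractions import Fraction


def complete_packing(large_bins, small_items):
    """Greedily add small items to T', then merge bins until all reach 1."""
    if not large_bins:
        raise ValueError("need at least one bin to pack into")
    # Heap entries are (level, unique id, items); ids break ties between
    # equal levels so the item lists are never compared.
    heap = [(sum(b), i, list(b)) for i, b in enumerate(large_bins)]
    heapq.heapify(heap)
    # Step 2: each small item goes into the bin with the lowest level.
    for s in small_items:
        level, i, items = heapq.heappop(heap)
        items.append(s)
        heapq.heappush(heap, (level + s, i, items))
    # Step 3: merge the two smallest bins while some bin is below level 1.
    while len(heap) > 1 and heap[0][0] < 1:
        l1, i1, b1 = heapq.heappop(heap)
        l2, _, b2 = heapq.heappop(heap)
        heapq.heappush(heap, (l1 + l2, i1, b1 + b2))
    return [items for _, _, items in heap]


quarter = Fraction(1, 4)
packing = complete_packing(
    [[Fraction(3, 2)], [Fraction(1, 2)], []],
    [quarter] * 4,
)
print(sorted(sum(b) for b in packing))
```

In the full algorithm this routine would be run once per type, keeping the packing whose largest bin is smallest.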

Why it Works

• The type for the optimal solution is considered by the Spread algorithm.

• The T’ in this instance has cost at most (1+ϵ) times the optimal cost.

• During the greedy completion, the maximum bin level must be at most (1+ϵ)Opt + ϵ, for otherwise we would have used more than the original set of items.

• When we merge bins, we may merge one with level less than 1 with one of level (1+ϵ)Opt + ϵ; hence the maximum is at most (1+ϵ)Opt + 1 + ϵ.
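Written as a single chain, the three contributions above add up to the bound stated in the Spread theorem. The labels under the braces are annotations added here for readability:

```latex
\[
A_\epsilon(X)
  \;\le\;
  \underbrace{(1+\epsilon)\,\mathrm{Opt}(X)}_{\text{un-rounded large items}}
  \;+\;
  \underbrace{\epsilon}_{\text{one small item}}
  \;+\;
  \underbrace{1}_{\text{merged bin of level } < 1}
  \;=\;
  (1+\epsilon)\bigl(\mathrm{Opt}(X) + 1\bigr).
\]
```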

Experimental Results

• Apply to names in the U.S. Census data:
– FEMALE-1990: Female first names and their frequencies, for names with frequency at least 0.001%.

– MALE-1990: Male first names and their frequencies, for names with frequency at least 0.001%.

– LAST-1990: Surnames and their frequencies, for surnames with frequency at least 0.001%.

Fold versus Spread

• Apply to random and sorted orders, since both algorithms consider items according to their input order.

• Test each algorithm for increasing k.

• At certain threshold levels of k, the number of bins is reduced, which causes some “jaggedness” in the results.

Female-1990

Male-1990

Last-1990

Zip Code Generalization is NP-Hard

• Formally, 3-Regular Planar Partition into Paths of Length 2 (3PPPL2): Given a 3-regular planar graph G, can G be partitioned into paths of length 2?
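A small certificate checker may help make the 3PPPL2 question concrete. It is a sketch under assumptions: `is_path_partition` is a hypothetical name, "partitioned" is read as partitioning the vertices into groups of three, and a group counts as a path of length 2 (two edges) when at least two of its three vertex pairs are edges of G. The checker does not verify that G is planar or 3-regular.

```python
def is_path_partition(edges, triples):
    """Check that triples partition the vertices into paths of length 2."""
    edge_set = set()
    vertices = set()
    for u, v in edges:
        edge_set.add(frozenset((u, v)))
        vertices.update((u, v))
    # Every vertex must appear in exactly one triple.
    seen = [v for t in triples for v in t]
    if len(seen) != len(vertices) or set(seen) != vertices:
        return False
    # Three vertices contain a 2-edge path iff at least two pairs are edges.
    for a, b, c in triples:
        present = sum(frozenset(p) in edge_set
                      for p in ((a, b), (b, c), (a, c)))
        if present < 2:
            return False
    return True


# Triangular prism: a 3-regular planar graph on 6 vertices.
prism = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5),
         (0, 3), (1, 4), (2, 5)]
print(is_path_partition(prism, [(0, 1, 2), (3, 4, 5)]))  # prints True
print(is_path_partition(prism, [(0, 1, 5), (2, 3, 4)]))  # prints False
```

Checking a proposed partition is easy; the hardness claim concerns deciding whether any valid partition exists.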

Proof Sketch

• Reduction from 3-Dimensional Matching:
– Given triples (x,y,z) from sets X,Y,Z, find a set of triples such that each member of X, Y, and Z belongs to exactly one triple.

Proof Sketch

• Crossover gadget:

Additional Results

• A 4/3-approximation algorithm for planar graphs

• NP-hardness and 4/3-approximation algorithm for two-dimensional points.

Conclusion

• We have shown that generalization is NP-hard and in some cases cannot be arbitrarily approximated unless P=NP.

• We have given approximation algorithms for the versions we study:
– unordered data
– planar graphs (generalized into connected components)
– two-dimensional points (generalized with rectangles)