Lower Bounds for Massive Dataset Algorithms
T.S. Jayram (IBM Almaden)
IIT Kanpur Workshop on Data Streams
Space…the Final Frontier. These are the voyages of the starship Enterprise. Its five-year mission: to explore strange new worlds, to seek out new life and new civilizations, to boldly go where no man has gone before.
-Star Trek
Traditionally, "efficient" computation is identified with polynomial time
Clearly, this is not adequate for computations over massive data sets
We would like a similarly clean, single notion of efficient computation over massive datasets
Efficient Algorithms for Massive Data Sets
A Single Theory?
Modern computing systems are complex and varied:
- Memory architectures
- Distributed computing
- Randomization
- Etc.
Paradigms of computing: sampling, sketching, data streams, read/write streams, stream-sort, map-reduce, and many more yet to come
Lower Bounds
This is a fertile ground for proving results
- Many successes
- Certain problems seem to be fundamental
- Reductions play a big role
Models with Limited Main Memory
- Sampling: query a small number of data elements
- Data streams: stream through the data in a one-way fashion; limited main memory storage
[Diagram: an algorithm with limited main memory reading from a large data set]
Distributed Computing
- Sketching: compress data chunks into small "sketches"; compute over the sketches
[Diagram: distributed algorithms exchanging sketches of their data sets]
Sampling: Lower Bounds for Symmetric Functions
In general, sampling algorithms are adaptive
Proof idea:
- Let T be a sampling algorithm for the function
- Randomly permute the data elements
- Run T
- The resulting algorithm estimates the function and uses uniform samples
Theorem [Bar-Yossef, Kumar, Sivakumar]. When estimating symmetric functions, uniform sampling is the best possible.
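To make the proof idea concrete, here is a minimal Python sketch (the sampler interface and the prefix_mean example are hypothetical, chosen purely for illustration): wrapping any sampling algorithm with a random permutation of the data makes every query it issues hit a uniformly random element, whatever its adaptive strategy.

```python
import random

def permute_then_sample(sampler, data):
    # Randomly permute the data, then run the (possibly adaptive) sampler.
    # Since the permutation is uniform and the function is symmetric, every
    # query the sampler makes now reads a uniformly random data element.
    perm = list(range(len(data)))
    random.shuffle(perm)
    return sampler(lambda i: data[perm[i]], len(data))

# Hypothetical "adaptive" sampler that happens to probe a fixed prefix.
def prefix_mean(query, n, k=100):
    k = min(k, n)
    return sum(query(i) for i in range(k)) / k

data = [0.0] * 500 + [1.0] * 500
print(permute_then_sample(prefix_mean, data))  # close to the true mean 0.5
```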
Lower Bounds for Uniform Sampling
Tools:
- Combinatorics: block sensitivity [Nisan]
- Statistics: Hellinger distance, Kullback-Leibler divergence, Jensen-Shannon divergence [Bar-Yossef et al.]
- Information theory [Bar-Yossef]
Example
Find the mean of n numbers in [0,1]
Requires Ω(1/ε²) samples to approximate additively within ε
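A quick empirical sanity check of the 1/ε² behavior (a minimal sketch; the data set and the choice k = 1/ε² are illustrative assumptions, not from the slides): with k = 1/ε² uniform samples, the empirical mean lands within ε of the truth with constant probability.

```python
import random

def estimate_mean(data, k):
    # Average of k uniform samples, drawn with replacement.
    return sum(random.choice(data) for _ in range(k)) / k

# Half 0's and half 1's: true mean 0.5. With k ~ 1/eps^2 samples the
# standard deviation of the estimate is eps/2, so the additive error is
# at most eps with constant probability (Chebyshev/Chernoff).
data = [0] * 5000 + [1] * 5000
for eps in (0.1, 0.05, 0.025):
    k = int(1 / eps**2)
    errs = [abs(estimate_mean(data, k) - 0.5) for _ in range(200)]
    frac_ok = sum(e <= eps for e in errs) / len(errs)
    print(f"eps={eps:5.3f}  k={k:5d}  Pr[error <= eps] ~ {frac_ok:.2f}")
```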
Lower Bound proof using Hellinger distance
Step 1
Simplify to a decision problem:
- Input a: ½ + ε fraction of 0's and ½ − ε fraction of 1's
- Input b: ½ − ε fraction of 0's and ½ + ε fraction of 1's
Given x ∈ {a, b}, any sampling algorithm for the mean (with additive error ε/4) can distinguish whether x = a or x = b
Step 2
Let Pa = the distribution on {0,1} induced by sampling uniformly from a; similarly Pb
Compute the Hellinger distance h²(Pa, Pb). For discrete distributions P, Q:
h²(P, Q) = ½ ‖√P − √Q‖₂² = 1 − Σ_x √(P(x) Q(x))
Here h²(Pa, Pb) = 1 − √(1 − 4ε²) = O(ε²)
Lower bound via Hellinger Distance
Key idea: the multiplicative property of the Hellinger distance:
1 − h²(P^⊗k, Q^⊗k) = (1 − h²(P, Q))^k
Theorem. Any uniform sampling algorithm needs Ω(1/h²(Pa, Pb)) samples to distinguish input a from input b.
Since h²(Pa, Pb) = O(ε²), this yields the Ω(1/ε²) bound for the mean.
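The quantities above are easy to compute directly. A minimal sketch (representing the two-point distributions as plain dicts is my own choice): h²(Pa, Pb) comes out to 1 − √(1 − 4ε²) ≈ 2ε², so the theorem gives back the Ω(1/ε²) sample bound.

```python
from math import sqrt

def h2(P, Q):
    # Squared Hellinger distance: h^2(P,Q) = 1 - sum_x sqrt(P(x) * Q(x)).
    return 1 - sum(sqrt(P[x] * Q[x]) for x in P)

for eps in (0.1, 0.05, 0.01):
    Pa = {0: 0.5 + eps, 1: 0.5 - eps}  # one uniform sample from input a
    Pb = {0: 0.5 - eps, 1: 0.5 + eps}  # one uniform sample from input b
    d = h2(Pa, Pb)                     # = 1 - sqrt(1 - 4 eps^2) ~ 2 eps^2
    # By the multiplicative property, k samples separate Pa from Pb only
    # once (1 - d)^k is bounded away from 1, i.e. k = Omega(1/d).
    print(f"eps={eps}: h2 = {d:.2e}, 2*eps^2 = {2 * eps**2:.2e}, "
          f"sample bound ~ {1 / d:.0f}")
```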
Lower Bounds for Data Streams
The idea is to somehow bound the flow of information (this yields space lower bounds)
The model is too "fine-grained" to prove lower bounds against directly
Instead, we consider more powerful models (hopefully simpler to tackle)
Communication complexity
Alice holds x and Bob holds y; they exchange messages to compute f(x, y)
Extensions to multiple parties
Resources:
- # bits communicated
- # rounds
Connection to Data Streams
A data stream algorithm for f(x∘y) with space s and k passes ⇒ an O(ks)-bit, 2k-round protocol for f(x, y)
A data stream algorithm for f(x1∘x2∘…∘xt) with space s and k passes ⇒ an O(tks)-bit protocol for f(x1, x2, …, xt)
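The reduction is mechanical. A minimal sketch, assuming a stream algorithm exposed through a hypothetical (init, update, output) interface whose state fits in s bits:

```python
def one_pass_protocol(algo, x, y):
    # Simulate a one-pass, space-s stream algorithm on the stream x . y
    # (concatenation) as a one-round protocol: Alice runs it on her half x
    # and sends the s-bit memory state; Bob resumes from that state on y.
    # With k passes, the state crosses 2k times: O(ks) bits, 2k rounds.
    state = algo.init()
    for item in x:                 # Alice's half of the stream
        state = algo.update(state, item)
    # --- Alice -> Bob: the memory state ---
    for item in y:                 # Bob's half of the stream
        state = algo.update(state, item)
    return algo.output(state)

class DistinctCount:               # toy exact-F0 algorithm, for illustration
    def init(self): return set()
    def update(self, st, item): st.add(item); return st
    def output(self, st): return len(st)

print(one_pass_protocol(DistinctCount(), [1, 2, 3], [3, 4]))  # prints 4
```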
Caveat
- Communication complexity usually deals with decision problems
- Data stream problems involve approximate computations
- The usual reduction techniques yield promise problems in c.c.
Set Disjointness
Sets A, B ⊆ [n]; Alice has A and Bob has B
Is A ∩ B = ∅? A classical problem in c.c.
t-party version [Alon, Matias, Szegedy]
C.C. Lower Bounds for Set Disjointness
Theorem. The randomized c.c. of Disjointness is Ω(n) [Kalyanasundaram, Schnitger; Razborov]
Remarks:
- Choose a "hard distribution" on inputs and show a lower bound on communication
- Unfortunately, the hard distributions here involve correlated inputs
- The arguments are somewhat tricky
Direct Sum Methodology
x and y are the characteristic vectors of the sets
AND(a, b) = a ∧ b
INT(x, y) = ∨_i (x_i ∧ y_i) = ∨_i AND(x_i, y_i)
Establish that any protocol for INT must solve n independent copies of AND (see the sketch below)
This is not true for communication itself!
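For concreteness, a minimal sketch of the decomposition (the set encoding below is illustrative): INT is just an OR of n independent AND instances, one per coordinate.

```python
def AND(a, b):
    return a & b

def INT(x, y):
    # INT(x, y) = OR_i AND(x_i, y_i): equals 1 iff the sets whose
    # characteristic vectors are x and y intersect (are NOT disjoint).
    return int(any(AND(a, b) for a, b in zip(x, y)))

A, B = {1, 3}, {0, 2}                  # two disjoint subsets of [4]
x = [int(i in A) for i in range(4)]    # characteristic vector of A
y = [int(i in B) for i in range(4)]    # characteristic vector of B
print(INT(x, y))                       # 0: A and B are disjoint
```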
Information Cost
Let P be a protocol for a function f
Information cost of P = I(X, Y : P(X, Y)), where X, Y are suitably distributed and I(· : ·) denotes Shannon mutual information [Chakrabarti, Shi, Wirth, Yao]
Conditional information cost of P = I(X, Y : P(X, Y) | D) [Bar-Yossef, J., Kumar, Sivakumar]
Information Complexity
Let μ be a distribution on inputs
IC_μ(f) = the minimum information cost of a protocol computing f, where the inputs are distributed according to μ
Proposition. CC(f) ≥ IC_μ(f)
Proof: Let P compute f. Then
I(X, Y : P(X, Y) | D) ≤ H(P(X, Y) | D) ≤ H(P(X, Y)) (conditioning reduces entropy) ≤ |P| (entropy is bounded by the number of bits)
Information vs Communication Complexity
Distribution for Disjointness
For each i = 1, …, n, independently: D_i ∈_R {a, b}
- If D_i = a then x_i = 0 and y_i ∈_R {0, 1}
- If D_i = b then x_i ∈_R {0, 1} and y_i = 0
Remarks:
- This always produces disjoint sets!
- Conditioned on D, X and Y are independent
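To make the conditional information cost concrete, a toy numeric check for a single AND coordinate under the distribution above (only the distribution comes from the slides; the trivial one-round protocol below, where Alice just sends her bit, is a hypothetical example):

```python
from collections import defaultdict
from math import log2

def mutual_info(joint):
    # I(U; V) in bits, from a dict {(u, v): probability}.
    pu, pv = defaultdict(float), defaultdict(float)
    for (u, v), p in joint.items():
        pu[u] += p
        pv[v] += p
    return sum(p * log2(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)

def transcript(x, y):
    # Hypothetical trivial protocol: Alice sends x, Bob replies AND(x, y).
    return (x, x & y)

# Conditional information cost: I(X,Y : Pi | D) = E_d [ I(X,Y : Pi | D=d) ].
ic = 0.0
for inputs in ([(0, 0), (0, 1)],    # D = a: x = 0, y uniform
               [(0, 0), (1, 0)]):   # D = b: y = 0, x uniform
    joint = {((x, y), transcript(x, y)): 0.5 for (x, y) in inputs}
    ic += 0.5 * mutual_info(joint)
print(ic)  # 0.5 bits: even this trivial protocol leaks information
```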
Direct Sum Theorem
Theorem. IC(INT) ≥ n · IC(AND)
[Diagram: the n coordinate pairs (X1, Y1), (X2, Y2), …, (Xn, Yn), each governed by its own D_i ∈ {a, b}]
Information Complexity of AND
Nice connections to statistical distances
In the case of AND, this reduces to lower-bounding the Hellinger distance between the transcript distributions on the inputs (0,1) and (1,0): h²(AND(0,1), AND(1,0))
[Diagram: the 2×2 input grid for AND, with the row reachable under D = a and the column reachable under D = b]
More Thoughts
- Extension to t-party set disjointness: a lower bound of Ω(n/t²), later improved to Ω(n/(t log t)) [Chakrabarti, Khot, Sun]
- Yields optimal space lower bounds for the frequency moments F_k, k > 2
- The method also gives optimal bounds for L∞; [Saks, Sun] proved similar bounds for 1 pass
- For L_p, p > 2, the space bound is polynomial, with a minor gap between the upper and lower bounds in terms of p
Reductions – Example for F0
Indexing: Alice holds a set A of size n/2, Bob holds an element b. Is b ∈ A?
The one-way c.c. of Indexing is Ω(n); shatter coefficients are useful here [BJKS]
Reduction: F0 = n/2 or n/2 + 1 depending on whether b ∈ A (see the sketch below)
The gap can be amplified by padding, yielding an Ω(1/ε) bound
Improved to Ω(1/ε²), but this requires substantially new ideas [Indyk, Woodruff; Woodruff]
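A minimal sketch of the reduction (the (init, update, output) interface and the ExactF0 class are illustrative stand-ins for a real F0 streaming algorithm):

```python
def indexing_via_f0(algo, A, b):
    # Alice streams her set A through the F0 algorithm and sends its
    # memory state; Bob appends his element b. The number of distinct
    # elements is |A| if b is in A and |A| + 1 otherwise, so any F0
    # algorithm accurate enough to see this gap solves Indexing and
    # inherits its Omega(n) one-way communication lower bound.
    state = algo.init()
    for a in A:                        # Alice's part of the stream
        state = algo.update(state, a)
    # --- Alice -> Bob: the memory state ---
    state = algo.update(state, b)      # Bob's single element
    return algo.output(state) == len(A)   # True iff b was already in A

class ExactF0:                         # toy exact F0 "algorithm"
    def init(self): return set()
    def update(self, st, x): st.add(x); return st
    def output(self, st): return len(st)

print(indexing_via_f0(ExactF0(), {0, 2, 4, 6}, 4))  # True: 4 is in A
```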
Lower bounds for Sketching
Simultaneous messages: each player i sends a single message (a sketch A_i(x_i)) to a referee, who outputs f(x1, x2, …, xt)
[Diagram: players send A1(x1), A2(x2), …, At(xt) to the referee]
Beyond Data Streams: a Peek at External Memory
Efficient access to external memory is possible in restricted ways: I/O rates for sequential read/write access to disks are as good as random access to main memory
New models of I/O-efficient computing:
- Read/write streams [Grohe, Schweikardt; Grohe, Hernich, Schweikardt]
- StrSort [Aggarwal, Datar, Rajagopalan, Ruhl]
- Map-reduce [Dean, Ghemawat]
Read/Write Streams
Also called reversal Turing machines by [GS]
[Diagram: a machine with bounded memory reading and writing t streams]
Critical resources:
- # streams t, space s
- No constraint on the length of the streams, but the number of reversals is at most r
This gives an (r, s, t) read/write stream algorithm
Sorting has an (O(log N), O(log N), O(1)) read/write stream algorithm (a sketch follows)
What happens when the number of reversals is o(log N)?
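For intuition about the sorting bound, a minimal sketch (an assumption-laden toy: Python lists stand in for sequential streams, whereas real read/write streams live on external tapes): bottom-up merge sort does one sequential pass per round and only O(log N) rounds.

```python
def stream_merge_sort(items):
    # Bottom-up merge sort with two "streams": each round reads stream a
    # sequentially, merging consecutive runs of length w onto stream b,
    # then doubles w. O(log N) rounds (reversals), O(1) streams, and only
    # counters in memory -- mirroring the (O(log N), O(log N), O(1))
    # read/write stream algorithm for Sorting. (List slicing below stands
    # in for sequential reads of the two current runs.)
    a, n, w = list(items), len(items), 1
    while w < n:
        b = []
        for lo in range(0, n, 2 * w):
            left, right = a[lo:lo + w], a[lo + w:lo + 2 * w]
            i = j = 0
            while i < len(left) or j < len(right):
                if j >= len(right) or (i < len(left) and left[i] <= right[j]):
                    b.append(left[i]); i += 1
                else:
                    b.append(right[j]); j += 1
        a, w = b, 2 * w
    return a

print(stream_merge_sort([5, 3, 8, 1, 9, 2, 7]))  # [1, 2, 3, 5, 7, 8, 9]
```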
Lower Bounds
There is no reduction using c.c.: e.g., Equality of strings is easy here. So what gives?
The intuition is that it is hard to compare data elements at random locations
Grohe and Schweikardt formalize this and give a nice lower bound technique, later extended to 1-sided error [GHS]
Difficult Problems for Read/Write Streams
A direct-sum type of problem with the inputs moved around:
h( g(x1, y_π(1)), g(x2, y_π(2)), …, g(xm, y_π(m)) )
[Diagram: an outer function h composed over inner functions g, with the y-inputs permuted by π]
Pick a permutation π with small monotonicity
Previous Results
- Sorting with o(log N) reversals requires Ω(N^{1/5}) space [GS]
- Set Equality with o(log N) reversals requires Ω(N^{1/4}) space [GHS]; this also applies to Sorting
- These bounds hold for the deterministic and randomized 1-sided error models
Our Results
[Beame, J., Rudra] Lower bounds for 2-sided error randomized computation
Set Disjointness with o(log N/log log N) reversals requires near-linear space
We derive our results in a direct-sum framework
Lower Bound Technique
1st step: the list machine. It records the potential ways in which subsets of the input elements can be "compared" at different stages of the computation
2nd step: the skeleton. It describes the information flow in terms of the locations of the elements that are compared
Key Theorem of [GS, GHS]
Skeletons resemble transcripts in c.c.
Theorem. The skeletons partition the input domain such that:
(1) the number of skeletons is "small"
(2) the output depends only on the skeleton
(3) each skeleton satisfies a weak rectangle-like property
Semi-Rectangle Property of Skeletons
Transcript in c.c.: the inputs mapped to a given transcript form a rectangle
Skeleton: for "most" coordinate pairs (i, π(i)), and for any assignment to the x_j and y_π(j), ∀ j ≠ i:
the inputs of the skeleton restricted to this assignment and then projected to (i, π(i)) form a rectangle
Working with Skeletons
In [GS, GHS], the proofs use only one coordinate pair
For Set Disjointness, the distribution on a single coordinate is skewed towards the 0's of the function, so with 2-sided error we cannot hope for a similar lower bound
Therefore, we keep track of multiple coordinate pairs
Tricky part: keeping track of the inputs as we vary the coordinate pairs
Remarks
Currently, our direct-sum framework works for primitive functions that have high discrepancy or corruption; it would be nice to have an information complexity based approach
We consider two kinds of composition operators: ⊕ and ∨
This yields lower bounds for Intersection Size Mod 2 (Inner Product)
Summary
We have powerful techniques from combinatorics, information theory, and Fourier analysis to tackle problems of "information flow" in massive data set computations
These techniques have also influenced complexity theory; e.g., [J., Kumar, Sivakumar] resolved open questions in communication complexity
Promise problems still pose a challenge, e.g., Gap-Hamming for multiple passes