Lower Bounds for Massive Dataset Algorithms
T.S. Jayram (IBM Almaden)
IIT Kanpur Workshop on Data Streams
Space…the Final Frontier. These are the voyages of the starship Enterprise. Its five-year mission: to explore strange new worlds, to seek out new life and new civilizations, to boldly go where no man has gone before.
-Star Trek
Traditionally, "efficient" computation is identified with polynomial time
Clearly, this is not adequate for computations over massive data sets
We would like a similarly clean, single notion of efficient computation over massive datasets
Efficient Algorithms for Massive Data Sets
A Single Theory?
Modern computing systems are complex and varied:
- Memory architectures
- Distributed computing
- Randomization
- Etc.
Paradigms of computing: sampling, sketching, data streams, read/write streams, stream-sort, map-reduce, and many more yet to come
Lower Bounds
This is a fertile ground for proving results
- Many successes
- Certain problems seem to be fundamental
- Reductions play a big role
Models with Limited Main Memory
- Sampling: query a small number of data elements
- Data streams: stream through the data in a one-way fashion; limited main memory storage
[Diagram: an algorithm with limited main memory reading from a large data set]
Distributed Computing
- Sketching: compress data chunks into small "sketches"; compute over the sketches
[Diagram: distributed algorithms exchanging sketches of their data sets]
Sampling: Lower Bounds for Symmetric Functions
In general, sampling algorithms are adaptive
Proof idea:
- Let T be a sampling algorithm for the function
- Randomly permute the data elements
- Run T
- The resulting algorithm estimates the function and uses uniform samples
Theorem [Bar-Yossef, Kumar, Sivakumar]. When estimating symmetric functions, uniform sampling is the best possible.
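To make the proof idea concrete, here is a minimal Python sketch (the sampler interface and the prefix_mean example are hypothetical, chosen purely for illustration): wrapping any sampling algorithm with a random permutation of the data makes every query it issues hit a uniformly random element, whatever its adaptive strategy.

```python
import random

def permute_then_sample(sampler, data):
    # Randomly permute the data, then run the (possibly adaptive) sampler.
    # Since the permutation is uniform and the function is symmetric, every
    # query the sampler makes now reads a uniformly random data element.
    perm = list(range(len(data)))
    random.shuffle(perm)
    return sampler(lambda i: data[perm[i]], len(data))

# Hypothetical "adaptive" sampler that happens to probe a fixed prefix.
def prefix_mean(query, n, k=100):
    k = min(k, n)
    return sum(query(i) for i in range(k)) / k

data = [0.0] * 500 + [1.0] * 500
print(permute_then_sample(prefix_mean, data))  # close to the true mean 0.5
```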
Lower Bounds for Uniform Sampling
Tools:
- Combinatorics: block sensitivity [Nisan]
- Statistics: Hellinger distance, Kullback-Leibler divergence, Jensen-Shannon divergence [Bar-Yossef et al.]
- Information theory [Bar-Yossef]
Example
Find the mean of n numbers in [0,1]
Requires Ω(1/ε²) samples to approximate additively within ε
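A quick empirical sanity check of the 1/ε² behavior (a minimal sketch; the data set and the choice k = 1/ε² are illustrative assumptions, not from the slides): with k = 1/ε² uniform samples, the empirical mean lands within ε of the truth with constant probability.

```python
import random

def estimate_mean(data, k):
    # Average of k uniform samples, drawn with replacement.
    return sum(random.choice(data) for _ in range(k)) / k

# Half 0's and half 1's: true mean 0.5. With k ~ 1/eps^2 samples the
# standard deviation of the estimate is eps/2, so the additive error is
# at most eps with constant probability (Chebyshev/Chernoff).
data = [0] * 5000 + [1] * 5000
for eps in (0.1, 0.05, 0.025):
    k = int(1 / eps**2)
    errs = [abs(estimate_mean(data, k) - 0.5) for _ in range(200)]
    frac_ok = sum(e <= eps for e in errs) / len(errs)
    print(f"eps={eps:5.3f}  k={k:5d}  Pr[error <= eps] ~ {frac_ok:.2f}")
```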
Lower Bound proof using Hellinger distance
Step 1
Simplify to a decision problem:
- Input a: ½ + ε fraction of 0's and ½ − ε fraction of 1's
- Input b: ½ − ε fraction of 0's and ½ + ε fraction of 1's
Given x ∈ {a, b}, any sampling algorithm for the mean (with additive error ε/4) can distinguish whether x = a or x = b
Step 2
Let Pa = the distribution on {0,1} induced by sampling uniformly from a; similarly Pb
Compute the Hellinger distance h²(Pa, Pb). For discrete distributions P, Q:
h²(P, Q) = ½ ‖√P − √Q‖₂² = 1 − Σ_x √(P(x) Q(x))
Here h²(Pa, Pb) = 1 − √(1 − 4ε²) = O(ε²)
Lower bound via Hellinger Distance
Key idea: the multiplicative property of the Hellinger distance:
1 − h²(P^⊗k, Q^⊗k) = (1 − h²(P, Q))^k
Theorem. Any uniform sampling algorithm needs Ω(1/h²(Pa, Pb)) samples to distinguish input a from input b.
Since h²(Pa, Pb) = O(ε²), this yields the Ω(1/ε²) bound for the mean.
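The quantities above are easy to compute directly. A minimal sketch (representing the two-point distributions as plain dicts is my own choice): h²(Pa, Pb) comes out to 1 − √(1 − 4ε²) ≈ 2ε², so the theorem gives back the Ω(1/ε²) sample bound.

```python
from math import sqrt

def h2(P, Q):
    # Squared Hellinger distance: h^2(P,Q) = 1 - sum_x sqrt(P(x) * Q(x)).
    return 1 - sum(sqrt(P[x] * Q[x]) for x in P)

for eps in (0.1, 0.05, 0.01):
    Pa = {0: 0.5 + eps, 1: 0.5 - eps}  # one uniform sample from input a
    Pb = {0: 0.5 - eps, 1: 0.5 + eps}  # one uniform sample from input b
    d = h2(Pa, Pb)                     # = 1 - sqrt(1 - 4 eps^2) ~ 2 eps^2
    # By the multiplicative property, k samples separate Pa from Pb only
    # once (1 - d)^k is bounded away from 1, i.e. k = Omega(1/d).
    print(f"eps={eps}: h2 = {d:.2e}, 2*eps^2 = {2 * eps**2:.2e}, "
          f"sample bound ~ {1 / d:.0f}")
```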
Lower Bounds for Data Streams
The idea is to somehow bound the flow of information (this yields space lower bounds)
The model is too "fine-grained" to prove lower bounds against directly
Instead, we consider more powerful models (hopefully simpler to tackle)
Communication complexity
Alice holds x and Bob holds y; they exchange messages to compute f(x, y)
Extensions to multiple parties
Resources:
- # bits communicated
- # rounds
Connection to Data Streams
A data stream algorithm for f(x∘y) with space s and k passes ⇒ an O(ks)-bit, 2k-round protocol for f(x, y)
A data stream algorithm for f(x1∘x2∘…∘xt) with space s and k passes ⇒ an O(tks)-bit protocol for f(x1, x2, …, xt)
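The reduction is mechanical. A minimal sketch, assuming a stream algorithm exposed through a hypothetical (init, update, output) interface whose state fits in s bits:

```python
def one_pass_protocol(algo, x, y):
    # Simulate a one-pass, space-s stream algorithm on the stream x . y
    # (concatenation) as a one-round protocol: Alice runs it on her half x
    # and sends the s-bit memory state; Bob resumes from that state on y.
    # With k passes, the state crosses 2k times: O(ks) bits, 2k rounds.
    state = algo.init()
    for item in x:                 # Alice's half of the stream
        state = algo.update(state, item)
    # --- Alice -> Bob: the memory state ---
    for item in y:                 # Bob's half of the stream
        state = algo.update(state, item)
    return algo.output(state)

class DistinctCount:               # toy exact-F0 algorithm, for illustration
    def init(self): return set()
    def update(self, st, item): st.add(item); return st
    def output(self, st): return len(st)

print(one_pass_protocol(DistinctCount(), [1, 2, 3], [3, 4]))  # prints 4
```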
Caveat
- Communication complexity usually deals with decision problems
- Data stream problems involve approximate computations
- The usual reduction techniques yield promise problems in c.c.
Set Disjointness
Sets A, B ⊆ [n]; Alice has A and Bob has B
Is A ∩ B = ∅? A classical problem in c.c.
t-party version [Alon, Matias, Szegedy]
C.C. Lower Bounds for Set Disjointness
Theorem. The randomized c.c. of Disjointness is Ω(n) [Kalyanasundaram, Schnitger; Razborov]
Remarks:
- Choose a "hard distribution" on inputs and show a lower bound on communication
- Unfortunately, the hard distributions here involve correlated inputs
- The arguments are somewhat tricky
Direct Sum Methodology
x and y are the characteristic vectors of the sets
AND(a, b) = a ∧ b
INT(x, y) = ∨_i (x_i ∧ y_i) = ∨_i AND(x_i, y_i)
Establish that any protocol for INT must solve n independent copies of AND (see the sketch below)
This is not true for communication itself!
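For concreteness, a minimal sketch of the decomposition (the set encoding below is illustrative): INT is just an OR of n independent AND instances, one per coordinate.

```python
def AND(a, b):
    return a & b

def INT(x, y):
    # INT(x, y) = OR_i AND(x_i, y_i): equals 1 iff the sets whose
    # characteristic vectors are x and y intersect (are NOT disjoint).
    return int(any(AND(a, b) for a, b in zip(x, y)))

A, B = {1, 3}, {0, 2}                  # two disjoint subsets of [4]
x = [int(i in A) for i in range(4)]    # characteristic vector of A
y = [int(i in B) for i in range(4)]    # characteristic vector of B
print(INT(x, y))                       # 0: A and B are disjoint
```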
Information Cost
Let P be a protocol for a function f
Information cost of P = I(X, Y : P(X, Y)), where X, Y are suitably distributed and I(· : ·) denotes Shannon mutual information [Chakrabarti, Shi, Wirth, Yao]
Conditional information cost of P = I(X, Y : P(X, Y) | D) [Bar-Yossef, J., Kumar, Sivakumar]
Information Complexity
Let μ be a distribution on inputs
IC_μ(f) = the minimum information cost of a protocol computing f, where the inputs are distributed according to μ
Proposition. CC(f) ≥ IC_μ(f)
Proof: Let P compute f. Then
I(X, Y : P(X, Y) | D) ≤ H(P(X, Y) | D) ≤ H(P(X, Y)) (conditioning reduces entropy) ≤ |P| (entropy is bounded by the number of bits)
Information vs Communication Complexity
Distribution for Disjointness
For each i = 1, …, n, independently: D_i ∈_R {a, b}
- If D_i = a then x_i = 0 and y_i ∈_R {0, 1}
- If D_i = b then x_i ∈_R {0, 1} and y_i = 0
Remarks:
- This always produces disjoint sets!
- Conditioned on D, X and Y are independent
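To make the conditional information cost concrete, a toy numeric check for a single AND coordinate under the distribution above (only the distribution comes from the slides; the trivial one-round protocol below, where Alice just sends her bit, is a hypothetical example):

```python
from collections import defaultdict
from math import log2

def mutual_info(joint):
    # I(U; V) in bits, from a dict {(u, v): probability}.
    pu, pv = defaultdict(float), defaultdict(float)
    for (u, v), p in joint.items():
        pu[u] += p
        pv[v] += p
    return sum(p * log2(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)

def transcript(x, y):
    # Hypothetical trivial protocol: Alice sends x, Bob replies AND(x, y).
    return (x, x & y)

# Conditional information cost: I(X,Y : Pi | D) = E_d [ I(X,Y : Pi | D=d) ].
ic = 0.0
for inputs in ([(0, 0), (0, 1)],    # D = a: x = 0, y uniform
               [(0, 0), (1, 0)]):   # D = b: y = 0, x uniform
    joint = {((x, y), transcript(x, y)): 0.5 for (x, y) in inputs}
    ic += 0.5 * mutual_info(joint)
print(ic)  # 0.5 bits: even this trivial protocol leaks information
```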
Direct Sum Theorem
Theorem. IC(INT) ≥ n · IC(AND)
[Diagram: the n coordinate pairs (X1, Y1), (X2, Y2), …, (Xn, Yn), each governed by its own D_i ∈ {a, b}]
Information Complexity of AND
Nice connections to statistical distances
In the case of AND, this reduces to lower-bounding the Hellinger distance between the transcript distributions on the inputs (0,1) and (1,0): h²(AND(0,1), AND(1,0))
[Diagram: the 2×2 input grid for AND, with the row reachable under D = a and the column reachable under D = b]
More Thoughts
- Extension to t-party set disjointness: a lower bound of Ω(n/t²), later improved to Ω(n/(t log t)) [Chakrabarti, Khot, Sun]
- Yields optimal space lower bounds for the frequency moments F_k, k > 2
- The method also gives optimal bounds for L∞; [Saks, Sun] proved similar bounds for 1 pass
- For L_p, p > 2, the space bound is polynomial, with a minor gap between the upper and lower bounds in terms of p
Reductions – Example for F0
Indexing: Alice holds a set A of size n/2, Bob holds an element b. Is b ∈ A?
The one-way c.c. of Indexing is Ω(n); shatter coefficients are useful here [BJKS]
Reduction: F0 = n/2 or n/2 + 1 depending on whether b ∈ A (see the sketch below)
The gap can be amplified by padding, yielding an Ω(1/ε) bound
Improved to Ω(1/ε²), but this requires substantially new ideas [Indyk, Woodruff; Woodruff]
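A minimal sketch of the reduction (the (init, update, output) interface and the ExactF0 class are illustrative stand-ins for a real F0 streaming algorithm):

```python
def indexing_via_f0(algo, A, b):
    # Alice streams her set A through the F0 algorithm and sends its
    # memory state; Bob appends his element b. The number of distinct
    # elements is |A| if b is in A and |A| + 1 otherwise, so any F0
    # algorithm accurate enough to see this gap solves Indexing and
    # inherits its Omega(n) one-way communication lower bound.
    state = algo.init()
    for a in A:                        # Alice's part of the stream
        state = algo.update(state, a)
    # --- Alice -> Bob: the memory state ---
    state = algo.update(state, b)      # Bob's single element
    return algo.output(state) == len(A)   # True iff b was already in A

class ExactF0:                         # toy exact F0 "algorithm"
    def init(self): return set()
    def update(self, st, x): st.add(x); return st
    def output(self, st): return len(st)

print(indexing_via_f0(ExactF0(), {0, 2, 4, 6}, 4))  # True: 4 is in A
```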
Lower bounds for Sketching
Simultaneous messages: each player i sends a single message (a sketch A_i(x_i)) to a referee, who outputs f(x1, x2, …, xt)
[Diagram: players send A1(x1), A2(x2), …, At(xt) to the referee]
Beyond Data Streams: a Peek at External Memory
Efficient access to external memory is possible in restricted ways: I/O rates for sequential read/write access to disks are as good as random access to main memory
New models of I/O-efficient computing:
- Read/write streams [Grohe, Schweikardt; Grohe, Hernich, Schweikardt]
- StrSort [Aggarwal, Datar, Rajagopalan, Ruhl]
- Map-reduce [Dean, Ghemawat]
Read/Write Streams
Also called reversal Turing machines by [GS]
[Diagram: a machine with bounded memory reading and writing t streams]
Critical resources:
- # streams t, space s
- No constraint on the length of the streams, but the number of reversals is at most r
This gives an (r, s, t) read/write stream algorithm
Sorting has an (O(log N), O(log N), O(1)) read/write stream algorithm (a sketch follows)
What happens when the number of reversals is o(log N)?
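For intuition about the sorting bound, a minimal sketch (an assumption-laden toy: Python lists stand in for sequential streams, whereas real read/write streams live on external tapes): bottom-up merge sort does one sequential pass per round and only O(log N) rounds.

```python
def stream_merge_sort(items):
    # Bottom-up merge sort with two "streams": each round reads stream a
    # sequentially, merging consecutive runs of length w onto stream b,
    # then doubles w. O(log N) rounds (reversals), O(1) streams, and only
    # counters in memory -- mirroring the (O(log N), O(log N), O(1))
    # read/write stream algorithm for Sorting. (List slicing below stands
    # in for sequential reads of the two current runs.)
    a, n, w = list(items), len(items), 1
    while w < n:
        b = []
        for lo in range(0, n, 2 * w):
            left, right = a[lo:lo + w], a[lo + w:lo + 2 * w]
            i = j = 0
            while i < len(left) or j < len(right):
                if j >= len(right) or (i < len(left) and left[i] <= right[j]):
                    b.append(left[i]); i += 1
                else:
                    b.append(right[j]); j += 1
        a, w = b, 2 * w
    return a

print(stream_merge_sort([5, 3, 8, 1, 9, 2, 7]))  # [1, 2, 3, 5, 7, 8, 9]
```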
Lower Bounds
There is no reduction using c.c.: e.g., Equality of strings is easy here. So what gives?
The intuition is that it is hard to compare data elements at random locations
Grohe and Schweikardt formalize this and give a nice lower bound technique, later extended to 1-sided error [GHS]
Difficult Problems for Read/Write Streams
A direct-sum type of problem with the inputs moved around:
h( g(x1, y_π(1)), g(x2, y_π(2)), …, g(xm, y_π(m)) )
[Diagram: an outer function h composed over inner functions g, with the y-inputs permuted by π]
Pick a permutation π with small monotonicity
Previous Results
- Sorting with o(log N) reversals requires Ω(N^{1/5}) space [GS]
- Set Equality with o(log N) reversals requires Ω(N^{1/4}) space [GHS]; this also applies to Sorting
- These bounds hold for the deterministic and randomized 1-sided error models
Our Results
[Beame, J., Rudra] Lower bounds for 2-sided error randomized computation
Set Disjointness with o(log N/log log N) reversals requires near-linear space
We derive our results in a direct-sum framework
Lower Bound Technique
1st step: the list machine. It records the potential ways in which subsets of the input elements can be "compared" at different stages of the computation
2nd step: the skeleton. It describes the information flow in terms of the locations of the elements that are compared
Key Theorem of [GS, GHS]
Skeletons resemble transcripts in c.c.
Theorem. The skeletons partition the input domain such that:
(1) the number of skeletons is "small"
(2) the output depends only on the skeleton
(3) each skeleton satisfies a weak rectangle-like property
Semi-Rectangle Property of Skeletons
Transcript in c.c.: the inputs mapped to a given transcript form a rectangle
Skeleton: for "most" coordinate pairs (i, π(i)), and for any assignment to the x_j and y_π(j), ∀ j ≠ i:
the inputs of the skeleton restricted to this assignment and then projected to (i, π(i)) form a rectangle
Working with Skeletons
In [GS, GHS], the proofs use only one coordinate pair
For Set Disjointness, the distribution on a single coordinate is skewed towards the 0's of the function, so with 2-sided error we cannot hope for a similar lower bound
Therefore, we keep track of multiple coordinate pairs
Tricky part: keeping track of the inputs as we vary the coordinate pairs
Remarks
Currently, our direct-sum framework works for primitive functions that have high discrepancy or corruption; it would be nice to have an information complexity based approach
We consider two kinds of composition operators: ⊕ and ∨
This yields lower bounds for Intersection Size Mod 2 (Inner Product)
Summary
We have powerful techniques from combinatorics, information theory, and Fourier analysis to tackle problems of "information flow" in massive data set computations
These techniques have also influenced complexity theory; e.g., [J., Kumar, Sivakumar] resolved open questions in communication complexity
Promise problems still pose a challenge, e.g., Gap-Hamming for multiple passes