Lower Bounds for Massive Dataset Algorithms T.S. Jayram (IBM Almaden) IIT Kanpur Workshop on Data Streams
Page 1:

Lower Bounds for Massive Dataset Algorithms

T.S. Jayram(IBM Almaden)

IIT Kanpur Workshop on Data Streams

Page 2:

Space…the Final Frontier. These are the voyages of the starship Enterprise. Its five-year mission: to explore strange new worlds, to seek out new life and new civilizations, to boldly go where no man has gone before.

-Star Trek

Page 3:

Traditionally, "efficient" computation is identified with polynomial time. Clearly, this is not adequate for computations over massive data sets.

Similarly, we want a single notion of efficient computation over massive data sets.

Efficient Algorithms for Massive Data Sets

Page 4:

A Single Theory?

Modern computing systems are complex and varied: memory architectures, distributed computing, randomization, etc.

Paradigms of computing: sampling, sketching, data streams, read/write streams, stream-sort, map-reduce, and many more yet to come.

Page 5:

Lower Bounds

This is fertile ground for proving results. There have been many successes; certain problems seem to be fundamental, and reductions play a big role.

Page 6:

Sampling: Query a small number of data elements

Data streams: Stream through the data in a one-way fashion; limited main memory storage

Models with Limited Main Memory


Page 7:

Distributed Computing

Sketching: Compress data chunks into small “sketches”; compute over the sketches


Page 8:

Sampling: Lower Bounds for Symmetric Functions

In general, sampling algorithms are adaptive

Theorem [Bar-Yossef, Kumar, Sivakumar]: When estimating symmetric functions, uniform sampling is the best possible.

Proof idea: Let T be a sampling algorithm for the function. Randomly permute the data elements, then run T. The resulting algorithm estimates the function and uses only uniform samples.

Page 9:

Lower Bounds for Uniform Sampling

Tools: block sensitivity, Hellinger distance, Kullback-Leibler divergence, Jensen-Shannon divergence

Approaches: combinatorics [Nisan], statistics [Bar-Yossef et al.], information theory [Bar-Yossef]

Page 10:

Example

Find the mean of n numbers in [0,1].

Approximating it additively within ε requires Ω(1/ε²) samples.

Lower bound proof using Hellinger distance.
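As a quick sanity check (our own sketch, not from the slides) of the matching upper bound: with k on the order of 1/ε² uniform samples, the empirical mean of a {0,1} dataset is reliably within ε/4 of the true mean.

```python
import random

def estimate_mean(data, k, rng):
    """Estimate the mean of `data` from k uniform samples with replacement."""
    return sum(rng.choice(data) for _ in range(k)) / k

rng = random.Random(0)
n, eps = 100_000, 0.05
# A dataset with a (1/2 + eps) fraction of 0's and the rest 1's.
num_zeros = int((0.5 + eps) * n)
a = [0] * num_zeros + [1] * (n - num_zeros)
true_mean = (n - num_zeros) / n          # = 1/2 - eps

# k ~ c/eps^2 samples suffice; the theorem says Omega(1/eps^2) are necessary.
k = int(64 / eps**2)                     # 25,600 samples
est = estimate_mean(a, k, rng)
assert abs(est - true_mean) < eps / 4
```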

Page 11:

Step 1

Simplify to a decision problem:
a: ½ + ε fraction of 0's and ½ − ε fraction of 1's
b: ½ − ε fraction of 0's and ½ + ε fraction of 1's

Given x ∈ {a, b}, any sampling algorithm for the mean (with additive error ε/4) can distinguish whether x = a or x = b.

Page 12:

Step 2

Let Pa = the distribution on {0,1} obtained by sampling uniformly from a; similarly Pb.

Compute the Hellinger distance h²(Pa, Pb). For discrete distributions P, Q:

h²(P, Q) = ½‖√P − √Q‖² = 1 − Σx (P(x) Q(x))^½

h²(Pa, Pb) = O(ε²)
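This calculation is easy to verify numerically; a small Python sketch (the helper name `hellinger_sq` is ours):

```python
from math import sqrt

def hellinger_sq(p, q):
    """Squared Hellinger distance 1 - sum_x sqrt(p(x) q(x)) for discrete dists."""
    return 1.0 - sum(sqrt(p[x] * q[x]) for x in p)

eps = 0.01
# Pa: one uniform sample from input a is 1 with prob 1/2 - eps; Pb is symmetric.
Pa = {0: 0.5 + eps, 1: 0.5 - eps}
Pb = {0: 0.5 - eps, 1: 0.5 + eps}

h2 = hellinger_sq(Pa, Pb)
# For small eps, h2 ~ 2 * eps^2, consistent with the O(eps^2) bound above.
```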

Page 13:

Lower bound via Hellinger Distance

Key idea: the multiplicative property of Hellinger distance:

1 − h²(Pᵏ, Qᵏ) = (1 − h²(P, Q))ᵏ

Theorem: Any uniform sampling algorithm needs Ω(1/h²(Pa, Pb)) samples to distinguish input a from input b.
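The multiplicative property is an exact identity for product distributions (k independent samples induce Pᵏ versus Qᵏ), which a brute-force check confirms; `power` below is our helper for the k-fold product:

```python
from itertools import product
from math import sqrt

def hellinger_sq(p, q):
    """Squared Hellinger distance 1 - sum_x sqrt(p(x) q(x))."""
    return 1.0 - sum(sqrt(p[x] * q[x]) for x in p)

def power(p, k):
    """k-fold product distribution p^k over length-k tuples."""
    out = {}
    for xs in product(p, repeat=k):
        v = 1.0
        for x in xs:
            v *= p[x]
        out[xs] = v
    return out

eps, k = 0.1, 5
P = {0: 0.5 + eps, 1: 0.5 - eps}   # one uniform sample from input a
Q = {0: 0.5 - eps, 1: 0.5 + eps}   # one uniform sample from input b

# 1 - h^2(P^k, Q^k) = (1 - h^2(P, Q))^k, so k must be Omega(1/h^2(P, Q))
# before P^k and Q^k become statistically distinguishable.
lhs = 1 - hellinger_sq(power(P, k), power(Q, k))
rhs = (1 - hellinger_sq(P, Q)) ** k
assert abs(lhs - rhs) < 1e-12
```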

Page 14:

Lower Bounds for Data Streams

Idea is to somehow bound the flow of information (yields space lower bounds)

Model is too “fine-grained” to prove lower bounds directly

Instead, we consider more powerful models (hopefully simpler to tackle)

Page 15:

Communication complexity

Two parties hold inputs x and y respectively and communicate to compute f(x, y).

Extensions to multiple parties

Resources:

# bits

# rounds
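As an illustrative aside (ours, not from the slides): the classic fingerprinting protocol for Equality shows how randomization can cut the "# bits" resource from n down to O(log n). The sketch below is a toy implementation with a hypothetical helper name:

```python
import random

def eq_protocol(x: bytes, y: bytes, rng) -> bool:
    """One-round randomized protocol for Equality: Alice sends her input's
    residue modulo a random prime (plus the prime itself), Bob compares.
    Communication is O(log n) bits versus the Omega(n) deterministic cost."""
    n = max(len(x), len(y), 1)
    # Any random prime up to ~poly(n) works; a sieve is fine at toy scale.
    primes = [p for p in range(2, 200 * n)
              if all(p % d for d in range(2, int(p**0.5) + 1))]
    p = rng.choice(primes)
    return int.from_bytes(x, "big") % p == int.from_bytes(y, "big") % p

rng = random.Random(1)
# Equal inputs are always accepted; unequal ones are rejected w.h.p.
assert eq_protocol(b"massive data", b"massive data", rng)
```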

Page 16:

Connection to Data Streams

A data stream algorithm for f(x∘y) (x concatenated with y) with space s and k passes ⇒ an O(ks)-bit, 2k-round protocol for f(x, y).

A data stream algorithm for f(x₁∘x₂∘⋯∘xₜ) with space s and k passes ⇒ an O(tks)-bit protocol for f(x₁, x₂, …, xₜ).

Page 17:

Caveat

Communication complexity usually deals with decision problems

Data stream problems involve approximate computations

Usual reduction techniques yield promise problems in c.c.

Page 18:

Set Disjointness

Sets A, B ⊆ [n]; Alice has A and Bob has B.

Is A ∩ B = ∅? A classical problem in c.c.

t-party version [Alon,Matias,Szegedy]

Page 19:

C.C. Lower Bounds for Set Disjointness

Theorem: The randomized c.c. of Disjointness is Ω(n). [Kalyanasundaram, Schnitger; Razborov]

Remarks: Choose a "hard distribution" on inputs and show a lower bound on communication. Unfortunately, the hard distributions here involve correlated inputs, and the arguments are somewhat tricky.

Page 20:

Direct Sum Methodology

x and y are characteristic vectors. AND(a, b) = a ∧ b. INT(x, y) = ∨ᵢ (xᵢ ∧ yᵢ) = ∨ᵢ AND(xᵢ, yᵢ).

Establish that any protocol for INT must solve n independent copies of AND

This is not true for communication itself !
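Spelled out in code, with the characteristic-vector convention of the slide:

```python
def AND(a: int, b: int) -> int:
    return a & b

def INT(x, y) -> int:
    """Set intersection (the complement of Disjointness) as an OR of n
    independent AND copies over the characteristic vectors."""
    result = 0
    for xi, yi in zip(x, y):
        result |= AND(xi, yi)
    return result

# {1,3} vs {2,3}: intersecting; {1} vs {2}: disjoint.
assert INT([0, 1, 0, 1], [0, 0, 1, 1]) == 1
assert INT([0, 1, 0, 0], [0, 0, 1, 0]) == 0
```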

Page 21:

Information Cost

Let P be a protocol for a function f.

Information cost of P = I(X,Y : P(X,Y)), where X, Y are suitably distributed and I(· : ·) denotes Shannon mutual information [Chakrabarti, Sun, Wirth, Yao]

Conditional information cost of P = I(X,Y : P(X,Y) | D) [Bar-Yossef, J., Kumar, Sivakumar]

Page 22:

Information Complexity

Let μ be a distribution on inputs.

ICμ(f) = the minimum information cost of a protocol computing f when the inputs are distributed according to μ.

Page 23:

Proposition: CC(f) ≥ IC(f).

Proof: Let P compute f.
I(X,Y : P(X,Y) | D) ≤ H(P(X,Y) | D)
≤ H(P(X,Y))  (conditioning reduces entropy)
≤ |P|  (entropy is bounded by the number of bits)

Information vs Communication Complexity

Page 24:

Distribution for Disjointness

For each i = 1, …, n, independently: Dᵢ ∈R {a, b}

If Dᵢ = a then xᵢ = 0 and yᵢ ∈R {0, 1}
If Dᵢ = b then xᵢ ∈R {0, 1} and yᵢ = 0

Remarks: This always produces disjoint sets! Conditioned on D, X and Y are independent.
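A small sampler for this distribution (our own sketch) makes both remarks easy to check:

```python
import random

def sample_disjointness_input(n, rng):
    """Sample (x, y, D) from the hard distribution for Disjointness: each
    coordinate i independently picks D_i in {a, b}; under a, x_i = 0 and
    y_i is random, under b the roles are swapped."""
    x, y, D = [], [], []
    for _ in range(n):
        d = rng.choice("ab")
        if d == "a":
            xi, yi = 0, rng.randint(0, 1)
        else:
            xi, yi = rng.randint(0, 1), 0
        x.append(xi); y.append(yi); D.append(d)
    return x, y, D

rng = random.Random(0)
x, y, D = sample_disjointness_input(20, rng)
# Remark 1: the sets are always disjoint (no coordinate has x_i = y_i = 1).
assert all(not (xi and yi) for xi, yi in zip(x, y))
# Remark 2: conditioned on D, X and Y are independent, since only one of
# them is random in each coordinate.
```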

Page 25:

Direct Sum Theorem

Theorem: IC(INT) ≥ n · IC(AND)


Page 26:

Information Complexity of AND

Nice connections to statistical distances

In the case of AND, this reduces to lower-bounding the Hellinger distance

h²(AND(0,1), AND(1,0)),

where AND(x, y) denotes the distribution of the protocol's transcript on input (x, y).


Page 27:

More Thoughts

Extension to t-party Set Disjointness: a lower bound of Ω(n/t²), improved to Ω(n/(t log t)) [Chakrabarti, Khot, Sun]. This yields optimal space lower bounds for the frequency moments Fₖ, k > 2.

The method also gives optimal bounds for L∞. [Saks, Sun] proved similar bounds for 1 pass. For Lₚ, p > 2, the space bound is polynomial, with a minor gap between upper and lower bounds in terms of p.

Page 28:

Reductions – Example for F0

Indexing: Alice holds a set A of size n/2, Bob holds an element b. Is b ∈ A?

The one-way c.c. of Indexing is Ω(n); shatter coefficients are useful here [BJKS].

In the reduction, F0 = n/2 or n/2 + 1. The gap can be amplified by padding, yielding an Ω(1/ε) bound. This was improved to Ω(1/ε²), which requires substantial new ideas [Indyk, Woodruff; Woodruff].
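The heart of the reduction can be sketched in a few lines (ours; `f0` is computed exactly purely for illustration, whereas a streaming algorithm would approximate it in small space):

```python
def f0(stream):
    """Number of distinct elements (computed exactly, for illustration)."""
    return len(set(stream))

n = 12
A = set(range(n // 2))            # Alice's set of size n/2
for b in (3, 9):                  # first b in A, then b not in A
    stream = list(A) + [b]        # Alice streams A, then hands the algorithm's
                                  # state to Bob, who appends his element b
    # F0 = n/2 iff b is in A; distinguishing n/2 from n/2 + 1 thus solves
    # Indexing, whose one-way c.c. is Omega(n), so exact one-pass F0
    # requires Omega(n) space.
    assert f0(stream) == n // 2 + (0 if b in A else 1)
```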

Page 29:

Lower bounds for Sketching

Simultaneous messages: each party i sends a single message ("sketch") Aᵢ(xᵢ) to a referee, who must output f(x₁, x₂, …, xₜ).

Page 30:

Beyond Data Streams: a Peek at External Memory

Efficient access to external memory is possible in restricted ways: I/O rates for sequential read/write access to disks are as good as random access to main memory.

New models of I/O-efficient computing: read/write streams [Grohe, Schweikardt; Grohe, Hernich, Schweikardt], StrSort [Aggarwal, Datar, Rajagopalan, Ruhl], map-reduce [Dean, Ghemawat].

Page 31:

Read/Write Streams

Also called Reversal Turing Machines by [GS]

The input is spread across t streams; a machine with bounded memory reads and writes the streams sequentially.

Page 32:

Critical Resources

#tapes t, space s. There is no constraint on the length of the streams, but the number of reversals is at most r. This gives an (r, s, t) read/write stream algorithm.

Sorting has an (O(log N), O(log N), O(1)) read/write stream algorithm. What happens when the number of reversals is o(log N)?
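A toy sketch (ours, with Python lists standing in for streams) of the merge-sort idea behind that bound: each round makes one sequential pass that merges sorted runs pairwise, so O(log N) rounds suffice.

```python
import random

def rw_stream_sort(data):
    """Sketch of sorting with read/write streams: each round merges sorted
    runs pairwise (a sequential pass over two streams onto a third), doubling
    the run length, so O(log N) rounds/reversals and O(log N) bits of working
    state suffice, matching the (O(log N), O(log N), O(1)) algorithm above."""
    runs = [[x] for x in data]
    rounds = 0
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            if i + 1 == len(runs):            # odd run out: carry it forward
                merged.append(runs[i])
                continue
            a, b, out = runs[i], runs[i + 1], []
            ia = ib = 0
            while ia < len(a) and ib < len(b):  # sequential two-stream merge
                if a[ia] <= b[ib]:
                    out.append(a[ia]); ia += 1
                else:
                    out.append(b[ib]); ib += 1
            out += a[ia:] + b[ib:]
            merged.append(out)
        runs = merged
        rounds += 1
    return (runs[0] if runs else []), rounds

data = random.Random(0).sample(range(1000), 64)
out, rounds = rw_stream_sort(data)
assert out == sorted(data) and rounds == 6   # log2(64) = 6 rounds
```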

Page 33:

Lower Bounds

There is no reduction using c.c.: e.g., Equality of strings is easy here. So what gives?

The intuition is that it is hard to compare data elements at random locations.

Grohe and Schweikardt formalized this and gave a nice lower bound technique, later extended to 1-sided error [GHS].

Page 34:

Difficult Problems for Read/Write Streams

A direct-sum type of problem with inputs moved around: compute h(g(x₁, y₁), g(x₂, y₂), …, g(xₘ, yₘ)), where the y's arrive in permuted order y_π(1), …, y_π(m).

Pick a permutation π with small monotonicity.

Page 35:

Previous Results

Sorting with o(log N) reversals requires Ω(N^(1/5)) space [GS].

Set Equality with o(log N) reversals requires Ω(N^(1/4)) space [GHS]; this also applies to Sorting.

These bounds hold for the deterministic and randomized 1-sided error models.

Page 36:

Our Results

[Beame, J., Rudra]: lower bounds for 2-sided error randomized computation.

Set Disjointness with o(log N / log log N) reversals requires near-linear space.

We derive our results in a direct-sum framework.

Page 37:

Lower Bound Technique

1st step: List machine. Records the potential ways in which subsets of input elements can be "compared" at different stages of the computation.

2nd step: Skeleton. Describes the information flow in terms of the locations of elements that are compared.

Page 38:

Key Theorem of [GS,GHS]

Skeletons resemble transcripts in c.c.

Theorem: The skeletons partition the input domain such that

(1) the number of skeletons is "small"

(2) the output depends only on the skeleton

(3) each skeleton satisfies a weak rectangle-like property

Page 39:

Semi-Rectangle Property of Skeletons

Transcript in c.c.: the inputs mapped to a given transcript form a rectangle.

Skeleton: for "most" coordinate pairs (i, π(i)), and for any assignment to xⱼ and y_π(j), ∀ j ≠ i, the inputs of the skeleton restricted to this assignment and then projected to (i, π(i)) form a rectangle.

Page 40:

Working with Skeletons

In [GS,GHS], the proofs use only one coordinate pair.

For Set Disjointness, the distribution on a single coordinate is skewed towards the 0's of the function, so with 2-sided error we cannot hope for a similar lower bound. Therefore, we keep track of multiple coordinate pairs.

Tricky part: keeping track of the inputs as we vary the coordinate pairs.

Page 41:

Remarks

Currently, our direct-sum framework works for primitive functions that have high discrepancy or corruption. It would be nice to have an information complexity based approach.

We consider two kinds of composition operators: ⊕ and ∨. This yields lower bounds for Intersection Size Mod 2 (Inner Product).

Page 42:

Summary

We have powerful techniques from combinatorics, information theory, and Fourier analysis to tackle problems of "information flow" in massive data set computations.

These techniques have also influenced complexity theory: e.g., [J., Kumar, Sivakumar] resolved open questions in communication complexity.

Promise problems still pose a challenge, e.g., Gap-Hamming for multiple passes.

