Date post: 26-Mar-2015
Page 1: Tight Bounds for Distributed Functional Monitoring David Woodruff IBM Almaden Qin Zhang Aarhus University MADALGO.

Tight Bounds for Distributed Functional Monitoring

David Woodruff

IBM Almaden

Qin Zhang

Aarhus University

MADALGO

Page 2:

Distributed Functional Monitoring

[Figure: coordinator C and sites P1, P2, P3, …, Pk; labels: communication, time.]

• Inputs: $x_1, x_2, x_3, \ldots, x_k$
• Updates: $x_i \leftarrow x_i + e_j$
• Static case vs. dynamic case
• Problems on $x_1 + x_2 + \cdots + x_k$: sampling, p-norms, heavy hitters, compressed sensing, quantiles, entropy
• Authors: Can, Cormode, Huang, Muthukrishnan, Patt-Shamir, Shafrir, Tirthapura, Wang, Yi, Zhao, many others

Page 3:

Motivation

• Data distributed and stored in the cloud
  – Impractical to put data on a single device
• Sensor networks
  – Communication very power-intensive
• Network routers
  – Bandwidth limitations

Page 4:

Problems

• Which functions $f(x_1, \ldots, x_k)$ do we care about?
• $x_1, \ldots, x_k$ are non-negative length-$n$ vectors
• $x = \sum_{i=1}^{k} x_i$
• $f(x_1, \ldots, x_k) = |x|_p = \left(\sum_{i=1}^{n} x_i^p\right)^{1/p}$, where $x_i$ inside the sum denotes the $i$-th coordinate of $x$
• $|x|_0$ is the number of non-zero coordinates

What is the randomized communication cost of these problems? I.e., the minimal cost of a protocol which, for every input, fails with probability < 1/3.

Static case, dynamic case
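
As a point of reference for the definitions above, here is a minimal Python sketch that computes the aggregate vector and its norms centrally, with no communication constraints. The function names `aggregate` and `p_norm` are illustrative choices, not names from the talk.

```python
from typing import List


def aggregate(site_vectors: List[List[int]]) -> List[int]:
    """Return x = x_1 + ... + x_k, the coordinate-wise sum of the site vectors."""
    return [sum(column) for column in zip(*site_vectors)]


def p_norm(x: List[int], p: float) -> float:
    """Return |x|_p for p > 0, or the number of non-zero coordinates for p = 0."""
    if p == 0:
        return float(sum(1 for value in x if value != 0))
    return sum(value ** p for value in x) ** (1.0 / p)


# Three sites, each holding a non-negative vector of length 3.
sites = [[1, 0, 2], [0, 0, 2], [3, 0, 0]]
x = aggregate(sites)          # [4, 0, 4]
distinct = p_norm(x, 0)       # 2.0 non-zero coordinates
euclidean = p_norm(x, 2)      # sqrt(32)
```

The communication question on this slide is how much of this computation can be avoided when each row of `sites` lives on a different machine.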

Page 5:

Exact Answers

• An $\Omega(n)$ communication bound for computing $|x|_p$, $p \neq 1$
• Reduction from 2-Player Set-Disjointness (DISJ)
• Alice has a set $S \subseteq [n]$ of size $n/4$
• Bob has a set $T \subseteq [n]$ of size $n/4$ with either $|S \cap T| = 0$ or $|S \cap T| = 1$
• Is $S \cap T = \emptyset$?
• $|X \cap Y| = 1 \Rightarrow \mathrm{DISJ}(X,Y) = 1$, $|X \cap Y| = 0 \Rightarrow \mathrm{DISJ}(X,Y) = 0$
• [KS, R]: $\Omega(n)$ communication
• Prohibitive for applications
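
The slide does not spell out the reduction. One standard way to see it, sketched below under that assumption, is to let Alice and Bob hold the indicator vectors of S and T. If the sets are disjoint, every coordinate of the sum is 0 or 1, so both $|x|_0$ and $|x|_p^p$ equal $|S| + |T|$. A single common element changes that value for every $p \neq 1$, while for $p = 1$ the value is always $|S| + |T|$. The names `indicator`, `norm_power` and `disj_via_norm` are hypothetical, the universe is indexed from 0, and only integer p is handled to keep the comparison exact.

```python
from typing import List, Set


def indicator(s: Set[int], n: int) -> List[int]:
    """Indicator vector of s inside the universe {0, ..., n-1}."""
    return [1 if j in s else 0 for j in range(n)]


def norm_power(x: List[int], p: int) -> int:
    """Return |x|_p^p for integer p >= 1, or the non-zero count for p = 0."""
    if p == 0:
        return sum(1 for value in x if value != 0)
    return sum(value ** p for value in x)


def disj_via_norm(S: Set[int], T: Set[int], n: int, p: int) -> int:
    """Guess DISJ(S, T) from an exact norm of the summed indicator vectors.

    Returns 1 (sets intersect) when the norm differs from its value on
    disjoint inputs, else 0. For p = 1 the norm carries no information.
    """
    x = [a + b for a, b in zip(indicator(S, n), indicator(T, n))]
    value_if_disjoint = len(S) + len(T)
    return 1 if norm_power(x, p) != value_if_disjoint else 0


intersecting = disj_via_norm({0, 1}, {1, 3}, n=8, p=2)   # 1
disjoint = disj_via_norm({0, 1}, {2, 3}, n=8, p=2)       # 0
blind = disj_via_norm({0, 1}, {1, 3}, n=8, p=1)          # 0, p = 1 cannot tell
```

This is why an exact protocol for $|x|_p$ with $p \neq 1$ would also solve DISJ and therefore inherits its linear lower bound.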

Page 6:

Approximate Answers

$f(x_1, \ldots, x_k) = (1 \pm \varepsilon)\,|x|_p$

What is the randomized communication cost as a function of k, ε, and n?

Ignore log(nk/ε) factors

Page 7:

Previous Results

Lower bounds in static model, upper bounds in dynamic model (underlying vectors are non-negative)

• $|x|_0$: $\Omega(k + \varepsilon^{-2})$ and $O(k \cdot \varepsilon^{-2})$
• $|x|_p$: $\Omega(k + \varepsilon^{-2})$
• $|x|_2$: $O(k^2/\varepsilon + k^{1.5}/\varepsilon^3)$
• $|x|_p$, $p > 2$: $O(k^{2p+1} n^{1-2/p} \cdot \mathrm{poly}(1/\varepsilon))$

Page 8:

Our Results

Lower bounds in static model, upper bounds in dynamic model (underlying vectors are non-negative)

• $|x|_0$: $\Omega(k + \varepsilon^{-2})$ and $O(k \cdot \varepsilon^{-2})$; lower bound improved to $\Omega(k \cdot \varepsilon^{-2})$
• $|x|_p$: $\Omega(k + \varepsilon^{-2})$; lower bound improved to $\Omega(k^{p-1} \cdot \varepsilon^{-2})$. Talk will focus on $p = 2$
• $|x|_2$: $O(k^2/\varepsilon + k^{1.5}/\varepsilon^3)$; upper bound improved to $O(k \cdot \mathrm{poly}(1/\varepsilon))$
• $|x|_p$, $p > 2$: $O(k^{2p+1} n^{1-2/p} \cdot \mathrm{poly}(1/\varepsilon))$; upper bound improved to $O(k^{p-1} \cdot \mathrm{poly}(1/\varepsilon))$

First lower bounds to depend on the product of $k$ and $\varepsilon^{-2}$

Upper bound doesn't depend polynomially on $n$

Page 9:

Talk Outline

• Lower Bounds
  – Non-zero elements
  – Euclidean norm
• Upper Bounds
  – p-norm

Page 10:

Previous Lower Bounds

• Lower bounds for any p-norm, $p \neq 1$
• [CMY]: $\Omega(k)$
• [ABC]: $\Omega(\varepsilon^{-2})$
• Reduction from Gap-Orthogonality (GAP-ORT)
• Alice, Bob have $u, v \in \{0,1\}^{\varepsilon^{-2}}$, respectively
• $|\Delta(u, v) - 1/(2\varepsilon^2)| < 1/\varepsilon$ or $|\Delta(u, v) - 1/(2\varepsilon^2)| > 2/\varepsilon$
• [CR, S]: $\Omega(\varepsilon^{-2})$ communication
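
The GAP-ORT promise can be written as a small Python check. This sketch assumes $\Delta(u, v)$ denotes Hamming distance, which the slide does not state explicitly. The labels "near" and "far" and the function names are illustrative only.

```python
from typing import List, Optional


def hamming(u: List[int], v: List[int]) -> int:
    """Number of positions where u and v differ."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    return sum(a != b for a, b in zip(u, v))


def gap_ort_case(u: List[int], v: List[int], eps: float) -> Optional[str]:
    """Classify (u, v) under the GAP-ORT promise.

    Returns "near" if |Delta - 1/(2 eps^2)| < 1/eps, "far" if it exceeds
    2/eps, and None when the input falls in the gap (no promise applies).
    """
    center = 1.0 / (2.0 * eps ** 2)
    deviation = abs(hamming(u, v) - center)
    if deviation < 1.0 / eps:
        return "near"
    if deviation > 2.0 / eps:
        return "far"
    return None


# eps = 0.1 gives vectors of length 1/eps^2 = 100 and a center of 50.
u = [0] * 100
near_case = gap_ort_case(u, [1] * 50 + [0] * 50, eps=0.1)   # "near"
far_case = gap_ort_case(u, [1] * 100, eps=0.1)              # "far"
```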

Page 11:

Talk Outline

• Lower Bounds
  – Non-zero elements
  – Euclidean norm
• Upper Bounds
  – p-norm

Page 12:

Lower Bound for Distinct Elements

• Improve bound to optimal $\Omega(k \cdot \varepsilon^{-2})$
• Simpler problem: k-GAP-THRESH
  – Each site $P_i$ holds a bit $Z_i$
  – $Z_i$ are i.i.d. Bernoulli($\beta$)
  – Decide if $\sum_{i=1}^{k} Z_i > \beta k + (\beta k)^{1/2}$ or $\sum_{i=1}^{k} Z_i < \beta k - (\beta k)^{1/2}$; otherwise don't care
• Rectangle property: for any correct protocol transcript $\tau$, $Z_1, Z_2, \ldots, Z_k$ are independent conditioned on $\tau$
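
A small Python sketch of the k-GAP-THRESH input distribution and promise, for concreteness. The function names and the "high"/"low" labels are illustrative and do not come from the talk.

```python
import math
import random
from typing import List, Optional


def sample_bits(k: int, beta: float, rng: random.Random) -> List[int]:
    """Draw Z_1, ..., Z_k i.i.d. Bernoulli(beta), one bit per site."""
    return [1 if rng.random() < beta else 0 for _ in range(k)]


def gap_thresh_label(bits: List[int], beta: float) -> Optional[str]:
    """Return "high" or "low" when the sum is outside the gap, else None."""
    k = len(bits)
    total = sum(bits)
    gap = math.sqrt(beta * k)
    if total > beta * k + gap:
        return "high"
    if total < beta * k - gap:
        return "low"
    return None  # inside the gap: any answer is acceptable


rng = random.Random(0)
bits = sample_bits(k=100, beta=0.5, rng=rng)
label = gap_thresh_label(bits, beta=0.5)
```

With k = 100 and beta = 0.5 the thresholds are roughly 57.07 and 42.93, so a protocol only has to be right when the sum of the bits is at least 58 or at most 42.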

Page 13:

A Key Lemma

• Lemma: For any protocol $\Pi$ which succeeds w.pr. > .9999, the transcript $\tau$ is such that w.pr. > 1/2, for at least $k/2$ different $i$, $H(Z_i \mid \tau) < H(.01\beta)$
• Proof: Suppose $\tau$ does not satisfy this
  – With large probability, $\beta k - O(\beta k)^{1/2} < \mathbf{E}\left[\sum_{i=1}^{k} Z_i \mid \tau\right] < \beta k + O(\beta k)^{1/2}$
  – Since the $Z_i$ are independent given $\tau$, $\sum_{i=1}^{k} Z_i \mid \tau$ is a sum of independent Bernoullis
  – Since most $H(Z_i \mid \tau)$ are large, by anti-concentration, both events occur with constant probability: $\sum_{i=1}^{k} Z_i \mid \tau > \beta k + (\beta k)^{1/2}$ and $\sum_{i=1}^{k} Z_i \mid \tau < \beta k - (\beta k)^{1/2}$
  – So $\Pi$ can't succeed with large probability
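
The anti-concentration step can be checked numerically in a simplified setting. The sketch below assumes the conditioned bits are still i.i.d. Bernoulli($\beta$), which is stronger than what the proof uses (there they are only independent), and computes the exact probability of landing on either side of the gap. The helper name `tail_probabilities` is hypothetical.

```python
import math
from typing import Tuple


def tail_probabilities(k: int, beta: float) -> Tuple[float, float]:
    """Exact lower and upper tail mass of Binomial(k, beta) outside the gap.

    Lower tail: sum < beta*k - sqrt(beta*k).
    Upper tail: sum > beta*k + sqrt(beta*k).
    """
    mean = beta * k
    gap = math.sqrt(beta * k)
    lower = 0.0
    upper = 0.0
    for j in range(k + 1):
        mass = math.comb(k, j) * beta ** j * (1.0 - beta) ** (k - j)
        if j < mean - gap:
            lower += mass
        elif j > mean + gap:
            upper += mass
    return lower, upper


lower, upper = tail_probabilities(k=400, beta=0.5)
```

For k = 400 and beta = 0.5 both tails carry several percent of the mass, so a transcript that leaves most bits undetermined cannot support an answer that is correct with probability .9999.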

Page 14:

Composition Idea

[Figure: coordinator C runs one DISJ instance with each site P1, P2, P3, …, Pk; the outputs are Z1, Z2, Z3, …, Zk.]

• The input to $P_i$ in k-GAP-THRESH, denoted $Z_i$, is the output of a 2-party Disjointness (DISJ) instance between C and site $P_i$
• Let $X$ be a random set of size $1/(4\varepsilon^2)$ from $\{1, 2, \ldots, 1/\varepsilon^2\}$
• For each $i$, if $Z_i = 1$, then choose $Y_i$ so that $\mathrm{DISJ}(X, Y_i) = 1$, else choose $Y_i$ so that $\mathrm{DISJ}(X, Y_i) = 0$
• Distributional complexity $\Omega(1/\varepsilon^2)$ [Razborov]
• Can think of C as a player
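
A hedged Python sketch of how such a composed instance could be sampled. The slide fixes only X and the requirement DISJ(X, Y_i) = Z_i. The choices that each Y_i has the same size as X and meets X in exactly one element when Z_i = 1 are assumptions borrowed from the DISJ promise on the "Exact Answers" slide, and `sample_composed_instance` is a hypothetical name.

```python
import random
from typing import List, Set, Tuple


def sample_composed_instance(
    k: int, eps: float, beta: float, rng: random.Random
) -> Tuple[Set[int], List[Set[int]], List[int]]:
    """Sample (X, [Y_1..Y_k], [Z_1..Z_k]) with |X & Y_i| = Z_i for every i."""
    m = round(1.0 / eps ** 2)            # universe {1, ..., 1/eps^2}
    size = m // 4                        # |X| = 1/(4 eps^2)
    universe = list(range(1, m + 1))
    X = set(rng.sample(universe, size))  # held by the coordinator C
    outside = [e for e in universe if e not in X]

    Z: List[int] = []
    Y: List[Set[int]] = []
    for _ in range(k):
        z = 1 if rng.random() < beta else 0
        # Assumed: |Y_i| = |X|, with exactly z elements shared with X.
        y = set(rng.sample(outside, size - z))
        if z == 1:
            y.add(rng.choice(sorted(X)))
        Z.append(z)
        Y.append(y)
    return X, Y, Z


X, Y, Z = sample_composed_instance(k=20, eps=0.25, beta=0.5, rng=random.Random(0))
```

Every site shares the same X with the coordinator, which is what lets C be treated as a player in each of the k DISJ instances.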

Page 15:

Putting it All Together

• Key Lemma $\Rightarrow$ For most $i$, $H(Z_i \mid \tau) < H(.01\beta)$
• Since $H(Z_i) = H(\beta)$ for all $i$, for most $i$ protocol $\Pi$ solves $\mathrm{DISJ}(X, Y_i)$ with constant probability
• Since the $Z_i \mid \tau$ are independent, solving DISJ requires communication $\Omega(\varepsilon^{-2})$ on each of $k/2$ copies
• Total communication is $\Omega(k \cdot \varepsilon^{-2})$
• Can show a reduction:
  – $|x|_0 > 1/(2\varepsilon^2) + 1/\varepsilon$ if $\sum_{i=1}^{k} Z_i > \beta k + (\beta k)^{1/2}$
  – $|x|_0 < 1/(2\varepsilon^2) - 1/\varepsilon$ if $\sum_{i=1}^{k} Z_i < \beta k - (\beta k)^{1/2}$

Page 16:

Talk Outline

• Lower Bounds
  – Non-zero elements
  – Euclidean norm
• Upper Bounds
  – p-norm

Page 17:

Lower Bound for Euclidean Norm

• Improve $\Omega(k + \varepsilon^{-2})$ bound to optimal $\Omega(k \cdot \varepsilon^{-2})$
• Base problem: Gap-Orthogonality (GAP-ORT(X, Y))
  – Consider uniform distribution on (X, Y)
• We observe an information lower bound for GAP-ORT
• Sherstov's lower bound for GAP-ORT holds for uniform distribution on (X, Y)
• [BBCR] + [Sherstov] $\Rightarrow$ for any protocol $\Pi$ and $t > 0$, $I(X, Y; \Pi) = \Omega(1/(\varepsilon^2 \log t))$ or $\Pi$ uses $t$ communication

Page 18:

Information Implications

• By chain rule, $I(X, Y; \Pi) = \sum_{i=1}^{1/\varepsilon^2} I(X_i, Y_i; \Pi \mid X_{<i}, Y_{<i}) = \Omega(\varepsilon^{-2})$

• For most $i$, $I(X_i, Y_i; \Pi \mid X_{<i}, Y_{<i}) = \Omega(1)$

• Maximum Likelihood Principle: non-trivial advantage in guessing (Xi, Yi)

Page 19:

2-BIT k-Party DISJ

We compose GAP-ORT with a variant of k-Party DISJ

[Figure: sites P1, P2, P3, …, Pk hold sets $T_1, T_2, T_3, \ldots, T_k \subseteq [k^2]$.]

• Choose a random $j \in [k^2]$
  – $j$ doesn't occur in any $T_i$
  – $j$ occurs only in $T_1, \ldots, T_{k/2}$
  – $j$ occurs only in $T_{k/2+1}, \ldots, T_k$
  – $j$ occurs in $T_1, \ldots, T_k$
• All $j' \neq j$ occur in at most one set $T_i$ (assume $k \geq 4$)
• We show $\Omega(k)$ information cost

Page 20:

Rough Composition Idea

[Figure: one GAP-ORT instance composed with $1/\varepsilon^2$ 2-BIT k-party DISJ instances.]

• Bits $X_i$ and $Y_i$ in GAP-ORT determine output of $i$-th 2-BIT k-party DISJ instance
• An algorithm for approximating Euclidean norm solves GAP-ORT, therefore solves most 2-BIT k-party DISJ instances
• Show $\Omega(k/\varepsilon^2)$ overall information is revealed
  – Information adds (if we condition on enough “helper” variables)
  – $P_i$ participates in all instances

Page 21:

Talk Outline

• Lower Bounds
  – Non-zero elements
  – Euclidean norm
• Upper Bounds
  – p-norm

Page 22:

Algorithm for p-norm

• We get $k^{p-1}\,\mathrm{poly}(1/\varepsilon)$, improving $k^{2p+1} n^{1-2/p}\,\mathrm{poly}(1/\varepsilon)$ for general $p$ and $O(k^2/\varepsilon + k^{1.5}/\varepsilon^3)$ for $p = 2$

• Our protocol is the first 1-way protocol, that is, all communication is from sites to coordinator

• Focus on Euclidean norm (p = 2) in talk

• Non-negative vectors

• Just determine if Euclidean norm exceeds a threshold θ

Page 23:

The Most Naïve Thing to Do

• $x_i$ is Site $i$'s current vector

• $x = \sum_{i=1}^{k} x_i$

• Suppose Site $i$ sees an update $x_i \leftarrow x_i + e_j$

• Send $j$ to Coordinator with a certain probability that only depends on $k$ and $\theta$?

Page 24:

Sample and Send

[Figure: sites P1, P2, P3, …, Pk send sampled updates to coordinator C. Row $i$ of the input matrix is Site $i$'s vector. Each row holds a block of $k$ ones, in disjoint positions across rows, giving $|x|_2^2 = k^2$; with an additional column of ones (1 1 1 1 1), $|x|_2^2 = 2k^2$.]

• Send each update with probability at least $1/k$
• Communication = $O(k)$, so okay

• Suppose $x$ has $k^4$ coordinates that are 1, and may have a unique coordinate which is $k^2$, occurring $k$ times on each site
  – Send update with probability $1/k^2$
  – Will find the large coordinate
  – But communication is $\Omega(k^2)$

Page 25:

What Is Happening?

• Sampling with probability $\approx 1/k^2$ is good to get a few samples from the heavy item

• But all the light coordinates are in the way, making the communication $\Omega(k^2)$

• Suppose we put a barrier of $k$, that is, sample with probability $\approx 1/k^2$ but only send an item if it has occurred at least $k$ times on a site

• Now communication is $O(1)$ and we have found the heavy coordinate

• But light coordinates also contribute to overall $|x|_2$ value
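
The effect of the barrier on the light coordinates can be sketched as an expected-message count. The slides leave the exact eligibility rule open. This sketch assumes an update to a coordinate may be sent only once that coordinate's local count has reached the barrier, and the names `expected_messages` and `light_instance` are hypothetical.

```python
from typing import Dict, List


def expected_messages(
    local_counts: List[Dict[int, int]], prob: float, barrier: int = 1
) -> float:
    """Expected number of messages when eligible updates are sent w.p. prob.

    local_counts[i][j] is the number of updates to coordinate j at site i.
    Assumed rule: the t-th local update to a coordinate is eligible iff
    t >= barrier, so a coordinate with c local updates has
    max(0, c - barrier + 1) eligible updates.
    """
    eligible = 0
    for counts in local_counts:
        for c in counts.values():
            eligible += max(0, c - barrier + 1)
    return prob * eligible


def light_instance(k: int) -> List[Dict[int, int]]:
    """k^4 coordinates of value 1, spread round-robin over k sites."""
    sites: List[Dict[int, int]] = [dict() for _ in range(k)]
    for j in range(k ** 4):
        sites[j % k][j] = 1
    return sites


k = 10
light = light_instance(k)
no_barrier = expected_messages(light, prob=1 / k ** 2)               # about k^2
with_barrier = expected_messages(light, prob=1 / k ** 2, barrier=k)  # 0.0
```

Without a barrier the light coordinates alone cost about $k^2$ messages. With a barrier of $k$ they cost nothing, but they are also never reported, which is the problem raised in the last bullet.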

Page 26:

Algorithm for Euclidean Norm

• Sample at different scales with different barriers
• Use public coin to create $O(\log n)$ groups $T_1, \ldots, T_{\log n}$ of the $n$ input coordinates
• $T_z$ contains $n/2^z$ random coordinates
• Suppose Site $i$ sees the update $x_i \leftarrow x_i + e_j$
• For each $T_z$ containing $j$
  – If $(x_i)_j > (\theta/2^z)^{1/2}/k$ then with probability $(2^z/\theta)^{1/2} \cdot \mathrm{poly}(\varepsilon^{-1} \log n)$, send $(j, z)$ to the coordinator
• Expected communication $\tilde{O}(k)$
• If a group of coordinates contributes to $|x|_2$, there is a $z$ for which a few coordinates in the group are sampled multiple times
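
The site-side rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: a shared seed stands in for the public coin, the unspecified poly(ε⁻¹ log n) factor is exposed as a hypothetical `boost` parameter, the probability is capped at 1, and the coordinator's estimation step is omitted because the slides do not describe it.

```python
import math
import random
from typing import Dict, List, Set, Tuple


def make_groups(n: int, seed: int) -> Dict[int, Set[int]]:
    """Build groups T_1..T_{log n}; T_z holds n / 2^z random coordinates.

    Every site calls this with the same seed (stand-in for the public coin).
    """
    rng = random.Random(seed)
    levels = max(1, int(math.log2(n)))
    return {z: set(rng.sample(range(n), n >> z)) for z in range(1, levels + 1)}


def on_update(
    x_local: List[int],
    j: int,
    groups: Dict[int, Set[int]],
    k: int,
    theta: float,
    boost: float,
    rng: random.Random,
) -> List[Tuple[int, int]]:
    """Apply x_i <- x_i + e_j at one site and return the (j, z) pairs to send."""
    x_local[j] += 1
    messages: List[Tuple[int, int]] = []
    for z, members in groups.items():
        if j not in members:
            continue
        barrier = math.sqrt(theta / 2 ** z) / k
        if x_local[j] > barrier:
            # boost is a placeholder for poly(1/eps * log n); capped at 1.
            prob = min(1.0, math.sqrt(2 ** z / theta) * boost)
            if rng.random() < prob:
                messages.append((j, z))
    return messages


groups = make_groups(n=16, seed=7)
site_vector = [0] * 16
outgoing = on_update(site_vector, j=3, groups=groups, k=4, theta=64.0,
                     boost=2.0, rng=random.Random(1))
```

All communication flows from the site to the coordinator, matching the 1-way property claimed on the "Algorithm for p-norm" slide.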

Page 27:

Conclusions

• Improved communication lower and upper bounds for estimating $|x|_p$
• Implies tight lower bounds for estimating entropy, heavy hitters, quantiles
• Implications for data stream model
  – First lower bound for $|x|_0$ without Gap-Hamming
  – Useful information cost lower bound for Gap-Hamming, or protocol has very large communication
  – Improve $\Omega(n^{1-2/p}/\varepsilon^{2/p})$ bound for estimating $|x|_p$ in a stream to $\Omega(n^{1-2/p}/\varepsilon^{4/p})$

