Distributed (Local) Monotonicity Reconstruction


Michael Saks, Rutgers University

C. Seshadhri, Princeton University (now IBM Almaden)

Overview

Introduce a new class of algorithmic problems: Distributed Property Reconstruction (extending the framework of program self-correction, robust property testing, and locally decodable codes)

A solution for the property Monotonicity

Data Sets

Data set = function f : Γ → V

Γ = finite index set

V = value set

In this talk, Γ = [n]^d = {1,…,n}^d

V = nonnegative integers

f = d-dimensional array of nonnegative integers

Distance between two data sets

For f, g with common domain Γ:

dist(f,g) = fraction of domain where f(x) ≠ g(x)

Properties of data sets

Focus of this talk:

Monotone: nondecreasing along every line (Order preserving)

When d=1, monotone = sorted

Some algorithmic problems for P

Given data set f (as input):

Recognition: Does f satisfy P?

Robust testing: Define ε(f) = min{ dist(f,g) : g satisfies P }. For some 0 ≤ ε1 < ε2 < 1, output either

ε(f) > ε1: f is far from P

ε(f) < ε2: f is close to P

(If ε1 < ε(f) < ε2 then can decide either)

Property Reconstruction

Setting:

Given f

We expect f to satisfy P (e.g. we run algorithms on f that assume P)

but f may not satisfy P

Reconstruction problem for P: Given data set f, produce data set g that

satisfies P

is close to f: dist(f,g) is not much bigger than ε(f)

What does it mean to produce g?

Offline computation

Input: function table for f

Output: function table for g

Distributed monotonicity reconstruction

Want algorithm A that on input x, computes g(x)

may query f(y) for any y

has access to a short random string s

and is otherwise deterministic.

Distributed Property Reconstruction

Goal: WHP (with probability close to 1, over choices of random string s):

g has property P

dist(g,f) = O( ε(f) )

Each A(x) runs quickly, in particular only reads f(y) for a small number of y.

Distributed Property Reconstruction

Precursors:

Online Data Reconstruction Model (Ailon-Chazelle-Liu-Seshadhri) [ACCL]

Locally Decodable Codes and Program Self-Correction (Blum-Luby-Rubinfeld; Rubinfeld-Sudan; etc.)

Graph Coloring (Goldreich-Goldwasser-Ron)

Monotonicity Testing (Dodis-Goldreich-Lehman-Raskhodnikova-Ron-Samorodnitsky; Goldreich-Goldwasser-Lehman-Ron-Samorodnitsky; Fischer; Fischer-Lehman-Newman-Raskhodnikova-Rubinfeld-Samorodnitsky; Ergun-Kannan-Kumar-Rubinfeld-Vishwanathan; etc.)

Tolerant Property Testing (Parnas, Ron, Rubinfeld)

Example: Local Decoding of Codes

Data set f = boolean string of length n

Property = is a code word of a given error correcting code C

Reconstruction = Decoding to a close code word

Distributed reconstruction = Local decoding

Key issue: making answers consistent

For an error correcting code, can assume input f decodes to a unique g. The set of positions that need to be corrected is determined by f.

For a general property, many different g (even exponentially many) that are close to f may have the property

We want to ensure that A produces one of them.

An example

Monotonicity with input array:

1,…,100, 111,…,120, 101,…,110, 121,…,200.

Monotonicity Reconstruction: d=1

f is a linear array of length n

First attempt at distributed reconstruction:

A(x) looks at f(x) and f(x-1)

If f(x) ≥ f(x-1), then g(x) = f(x)

Otherwise, we have a non-monotonicity; set g(x) = max { f(x) , f(x-1) }
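The slide does not spell out why the first attempt fails, so here is a minimal sketch that runs it on the input array from the example slide. The function names and the choice g(1) = f(1) are assumptions made for illustration, and the code uses 0-based indices.

```python
def example_array():
    # The array from the example slide: 1..100, 111..120, 101..110, 121..200
    return (list(range(1, 101)) + list(range(111, 121))
            + list(range(101, 111)) + list(range(121, 201)))

def first_attempt(f):
    # g(x) = max{f(x), f(x-1)}; the first entry has no predecessor,
    # so we assume g(1) = f(1)
    return [f[0]] + [max(f[i], f[i - 1]) for i in range(1, len(f))]

def is_monotone(a):
    return all(a[i] <= a[i + 1] for i in range(len(a) - 1))

f = example_array()
g = first_attempt(f)
# 0-based positions 110 and 111 hold f-values 101 and 102, preceded by 120
print(is_monotone(g), g[110], g[111])
```

The local fix repairs the single drop from 120 to 101 but leaves the next position below the repaired value, so g is still not monotone.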

Monotonicity Reconstruction: d=1

Second attempt

Set g(x) = max{ f(1), f(2),…, f(x) }

g is monotone but

A(x) requires time Ω(x)

dist(g,f) may be much larger than ε(f)
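A small sketch of how far the prefix-maximum output can drift from f. The input below (one large value followed by a sorted run) is a hypothetical example, not from the talk. For d=1, ε(f) is computed here as 1 minus the relative length of a longest nondecreasing subsequence, which is the standard characterization of the fewest entries that must change.

```python
from bisect import bisect_right
from itertools import accumulate

def eps(f):
    # d=1: fewest changes needed = n - (longest nondecreasing subsequence)
    tails = []
    for v in f:
        i = bisect_right(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return (len(f) - len(tails)) / len(f)

def dist(f, g):
    # fraction of positions where f and g differ
    return sum(a != b for a, b in zip(f, g)) / len(f)

f = [100] + list(range(1, 100))   # hypothetical input: one outlier, then 1..99
g = list(accumulate(f, max))      # second attempt: g(x) = max{f(1),...,f(x)}

print(eps(f), dist(f, g))
```

Changing the single outlier would make f sorted, yet the prefix maximum overwrites every other entry.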

Our results (for general d)

A distributed monotonicity reconstruction algorithm for general dimension d such that:

Time to compute g(x) is (log n)^{O(d)}

dist(f,g) ≤ C_1(d)·ε(f)

Shared random string s has size (d log n)^{O(1)}

(Builds on prior results on monotonicity testing and online monotonicity reconstruction.)

Which array values should be changed?

A subset S of Γ is f-monotone if f restricted to S is monotone.

For each x in Γ, A(x) must:

Decide whether g(x) = f(x)

If not, then determine g(x)

Preserved = { x : g(x) = f(x) }

Corrected = { x : g(x) ≠ f(x) }

In particular, Preserved must be f-monotone

Identifying Preserved

The partition (Preserved, Corrected) must satisfy:

Preserved is f-monotone

|Corrected|/|Γ| = O(ε(f))

Preliminary algorithmic problem:

Classification problem

Classify each y in Γ as Green or Red

Green is f-monotone

Red has size O(ε(f)|Γ|)

Need subroutine Classify(y).

A sufficient condition for f-monotonicity

A pair (x,y) in Γ × Γ is a violation if x < y and f(x) > f(y)

To guarantee that Green is f-monotone:

Red should hit all violations:

For every violation (x,y) at least one of x,y is Red

Classify: 1-dimensional case

d=1: Γ = {1,…,n}, f is a linear array.

For x in Γ, and subinterval J of Γ:

violations(x,J) = |{y in J : x and y form a violation}|

Constructing a large f-monotone set

The set Bad:

x is in Bad if for some interval J containing x, violations(x,J) ≥ |J|/2

Lemma.

1) Good = Γ \ Bad is f-monotone

2) |Bad| ≤ 4 ε(f)|Γ|

Proof of 1): If x,y are a violation then one of them is Bad for the interval [x,y]: every z in [x,y] forms a violation with x (if f(z) < f(x)) or with y (if f(z) ≥ f(x) > f(y)), so violations(x,[x,y]) + violations(y,[x,y]) ≥ |[x,y]|.

So we'd like to take: Green = Good, Red = Bad
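As a sanity check on part 1 of the lemma, the following brute-force sketch computes Bad straight from the definition, trying every interval, on a small hypothetical array. It uses 0-based indices and is far too slow to be the intended algorithm. The helper names are illustrative.

```python
def forms_violation(f, a, b):
    # a and b form a violation if the earlier position holds the larger value
    lo, hi = min(a, b), max(a, b)
    return lo < hi and f[lo] > f[hi]

def bad_set(f):
    n = len(f)
    bad = set()
    for x in range(n):
        for lo in range(x + 1):          # every interval J = [lo, hi] containing x
            for hi in range(x, n):
                size = hi - lo + 1
                count = sum(forms_violation(f, x, y) for y in range(lo, hi + 1))
                if 2 * count >= size:    # violations(x, J) >= |J|/2
                    bad.add(x)
    return bad

f = [1, 2, 3, 9, 4, 5, 6, 0, 7, 8]       # hypothetical input
bad = bad_set(f)
good_values = [f[x] for x in range(len(f)) if x not in bad]
print(sorted(bad), good_values)
```

Both endpoints of an adjacent violation land in Bad here, because the interval of size 2 already meets the |J|/2 threshold for each of them.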

How do we compute Good?

To test whether y is in Good:

For each interval J containing y, check violations(y,J) < |J|/2

Difficulties

There are Ω(n) intervals J containing y

For each J, computing violations(y,J) takes time Ω(|J|).

Speeding up the computation

Estimate violations(y,J) by random sampling

sample size polylog(n) is sufficient

violations*(y,J) denotes the estimate

Compute violations*(y,J) only for a carefully chosen set of test intervals
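A minimal sketch of the sampling estimate, assuming uniform sampling with replacement from J and a seeded generator standing in for the shared random string s. The function name, the parameters, and the sample size used below are illustrative; the talk only states that polylog(n) samples suffice.

```python
import random

def violations_estimate(f, y, lo, hi, samples, rng):
    # Estimate violations(y, J) for J = [lo, hi] (0-based, inclusive):
    # sample points z of J uniformly and scale the hit rate by |J|.
    hits = 0
    for _ in range(samples):
        z = rng.randint(lo, hi)
        a, b = min(y, z), max(y, z)
        if a < b and f[a] > f[b]:
            hits += 1
    return hits / samples * (hi - lo + 1)

rng = random.Random(0)                     # stand-in for the shared random string s
f = [10, 1, 2, 3, 4, 5, 6, 7, 8, 9]        # hypothetical input: position 0 violates with all others
print(violations_estimate(f, 0, 1, 9, 50, rng))
print(violations_estimate(sorted(f), 0, 0, 9, 50, rng))
```

The cost per interval is the sample size rather than |J|, which is what makes the estimate usable inside a fast Classify.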

The Test Set T

Assume n = |Γ| = 2^k

k layers of intervals

Layer j consists of 2^{k-j+1} - 1 intervals of size 2^j

Subroutine classify

To classify y:

If for each J in T containing y, violations*(y,J) < 0.1 |J|

then y is Green

else y is Red
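A sketch of the Classify rule under two simplifying assumptions: exact violation counts are used in place of the sampled estimate violations*, and the test intervals of size 2^j start at multiples of 2^{j-1} (a layout consistent with the stated layer sizes, but not confirmed by the slides). Positions are 0-based and the array length must be a power of two.

```python
def classify(f, y):
    # Return "Green" or "Red" for 0-based position y.
    n = len(f)
    size = 2
    while size <= n:
        step = size // 2
        for start in range(0, n - size + 1, step):
            if start <= y < start + size:
                # exact count of points in J that form a violation with y
                count = sum(
                    1 for z in range(start, start + size)
                    if (z < y and f[z] > f[y]) or (y < z and f[y] > f[z])
                )
                if count >= 0.1 * size:
                    return "Red"
        size *= 2
    return "Green"

f = [1, 2, 3, 9, 4, 5, 6, 0, 7, 8, 10, 11, 12, 13, 14, 15]   # hypothetical input
colors = [classify(f, y) for y in range(len(f))]
green_values = [f[y] for y in range(len(f)) if colors[y] == "Green"]
print(colors.count("Red"), green_values)
```

With exact counts, every violation (x,y) fits inside a test interval of size less than 4 times its span, so one endpoint reaches the 0.1 threshold and is marked Red. The sampled version keeps this guarantee only with high probability.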

Where are we?

We have a subroutine Classify

On input x, Classify outputs Green or Red

Runs in time polylog(n)

WHP:

Green is f-monotone

|Red| ≤ 20 ε(f)|Γ|

Defining g(x) for Red x

The natural way to define g(x) is:

Green(x) = { y : y ≤ x and y is Green }

m(x) = max Green(x)

g(x) = max{ f(y) : y in Green(x) } = f(m(x))

In particular, this gives g(x) = f(x) for Green x
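A minimal offline sketch of this definition, assuming the Green/Red classification is already given as a list of booleans. The slides do not say what g(x) should be when no Green point precedes x; the sketch assumes 0, the smallest value in V. The input array and classification are hypothetical.

```python
def reconstruct(f, green):
    # g(x) = f(m(x)), where m(x) is the last Green position <= x (0-based)
    g, last = [], None
    for x in range(len(f)):
        if green[x]:
            last = x
        # assumption: output 0 when Green(x) is empty
        g.append(f[last] if last is not None else 0)
    return g

f = [1, 2, 3, 9, 4, 5, 6, 0, 7, 8]                  # hypothetical input
red = {3, 4, 6, 7}                                  # hypothetical classification
green = [x not in red for x in range(len(f))]
g = reconstruct(f, green)
print(g)  # [1, 2, 3, 3, 3, 5, 5, 5, 7, 8]
```

This scan computes every m(x) in one pass, which is the offline view; the distributed algorithm has to find m(x) for a single x without reading the whole prefix.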

Computing m(x)

Can search back from x to find first Green

Inefficient if x is preceded by a long Red stretch

Approximating m(x)?

Pick random Sample(x) of points less than x

Density inversely proportional to distance from x

Size is polylog(n)

Green*(x) = { y : y in Sample(x), y is Green }

m*(x) = max { y in Green*(x) }
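One way to get density inversely proportional to distance, assumed here and not taken from the talk, is to draw a fixed number of points from each distance window [2^i, 2^{i+1}). Seeding the generator from the shared string s and the position x is likewise an illustrative choice that makes repeated calls agree. Positions are 1-based.

```python
import random

def sample(x, per_scale, s):
    # Draw per_scale points at each distance scale [width, 2*width) below x,
    # so the density at distance d is roughly per_scale / d.
    rng = random.Random(f"{s}:{x}")     # same (s, x) always yields the same sample
    points = set()
    width = 1
    while width < x:
        hi = min(2 * width - 1, x - 1)  # clip so that sampled points stay >= 1
        for _ in range(per_scale):
            points.add(x - rng.randint(width, hi))
        width *= 2
    return points

def m_star(x, is_green, per_scale, s):
    # m*(x): the largest sampled Green point below x, or None if there is none
    greens = [y for y in sample(x, per_scale, s) if is_green(y)]
    return max(greens) if greens else None

pts = sample(100, 3, "s")
print(len(pts), m_star(100, lambda y: y % 2 == 0, 3, "s"))
```

The sample has at most per_scale points per scale and about log2(x) scales, so its size stays polylogarithmic when per_scale is.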

Is m*(x) good enough?

[Figure: positions m*(x) ≤ y ≤ x marked on the array]

Suppose y is Green and m*(x) ≤ y ≤ x

Since y is Green: g(y) = f(y)

and we can have g(x) = f(m*(x)) < f(y) = g(y), even though y ≤ x

g is not monotone

Is m*(x) good enough?

To ensure monotonicity we need:

x < y implies m*(x) ≤ m*(y)

Requires relaxing the requirement: for all Green z, m*(z) = z


Thinning out Green* (x)

Plan: Eliminate certain unsafe points from Green*(x)

Roughly, y is unsafe for x if some interval beginning with y and containing x has a high density of Reds.

(Then for some z > x there is a non-trivial chance that Sample(z) has no Green points ≥ y.)

Thinning out Green* (x)

Green*(x) = { y : y in Sample(x), y is Green }

m*(x) = max { y in Green*(x) }

Green^(x) = { y : y in Green*(x), y safe for x }

m^(x) = max { y in Green^(x) }

(Hiding: Efficient implementation of Green^(x))

Redefining Green^(x)

if x ≤ y, then m^(x) ≤ m^(y)

|{ x : m^(x) ≠ x }| = O(ε(f)|Γ|).

Summary of 1-dimensional case

Classify points as Green and Red

Few Red points

Green is f-monotone

For each x, choose Sample(x)

Size polylog(n)

All points less than x

Density inversely proportional to distance from x

Green^(x) = Green points from Sample(x) that are safe for x

m^(x) is the maximum of Green^(x)

Output g(x) = f(m^(x))

Dimension greater than 1

For x ≤ y, want g(x) ≤ g(y)

Red/Green Classification

Extend the Red/Green classification to higher dimensions:

f restricted to Green is monotone

Red is small

Straightforward (mostly) extension of 1-dimensional case

Given Red/Green classification

In the one-dimensional case,

Green^(x) = sampled Green points safe for x

g(x) = f(max { y : y in Green^(x) })

The Green points below x

Set of Green maxima could be very large

Sparse random sampling will only roughly capture the frontier

Finding an appropriate definition of unsafe points is much harder than in the one-dimensional case

Further work

The g produced by our algorithm has dist(g,f) ≤ C(d)·ε(f)

Our C(d) is exp(d^2). What should C(d) be? (Guess: C(d) = exp(d))

Distributed reconstruction for other interesting properties? (Reconstructing expanders: Kale, Peres, Seshadhri, FOCS 08)
