Embedding the Ulam metric into ℓ1

For the course "Advanced Data Structures"

Αντώνης Αχιλλέως

Metrics

A metric space is a pair <X, d> s.t. X is a set, d : X^2 → R, and for all x, y, z in X:

1. d(x, y) ≥ 0, and d(x, y) = 0 iff x = y
2. d(x, y) = d(y, x)
3. d(x, y) ≤ d(x, z) + d(z, y)

Two metric spaces: Edit Distance

Let Σ be a set of symbols and Σ^n the set of all finite sequences (strings, or n-tuples) of characters from Σ.

Edit operations on an element of Σ^n are the following:

- adding a character
- deleting a character
- replacing a character

If, for x, y in Σ^n, ed(x, y) is the minimum number of edit operations needed to transform x into y, then <Σ^n, ed> is a metric space.

Two metric spaces: The Ulam metric of dimension n

Let Σ be as before, but let P_n be the set of strings of n distinct characters from Σ, where n = |Σ|.

If x, y are in P_n, define UL(x, y) to be the minimum number of character moves needed to transform x into y.

<P_n, UL> is a metric space.

The above definitions are limited: we need pairs of strings with different characters, so we let n < |Σ| and use ed instead of UL. We can see that for all x, y, UL(x, y) ≤ ed(x, y) ≤ 2 UL(x, y).

Embeddings

An embedding of a metric space <X, d> into a target metric space <Y, m> is a mapping f : X → Y s.t. there are real numbers C, s such that for all x, y in X,

d(x, y) ≤ s∙m(f(x), f(y)) ≤ C∙d(x, y)

The minimum C that satisfies the above inequality for some s is called the distortion of the embedding f.
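To make the definition concrete, here is a minimal sketch that computes the distortion of a map on a finite sample of points. It assumes f is injective on the sample (so m never vanishes); the function name `distortion` and the example maps are illustrative only and do not come from the slides.

```python
from itertools import combinations

def distortion(points, d, m, f):
    """Smallest C with d(x,y) <= s*m(f(x),f(y)) <= C*d(x,y) on the sample.

    Assumes f is injective on `points`, so m(f(x), f(y)) > 0 for x != y.
    """
    pairs = list(combinations(points, 2))
    # s must be at least the largest contraction d/m,
    # otherwise the left inequality fails for some pair.
    s = max(d(x, y) / m(f(x), f(y)) for x, y in pairs)
    # With the smallest valid s, C must cover the largest expansion s*m/d.
    return s * max(m(f(x), f(y)) / d(x, y) for x, y in pairs)

line = lambda a, b: abs(a - b)
# Squaring stretches far points more than near ones: distortion 4 here.
print(distortion([0, 1, 3], line, line, lambda x: x * x))
# A pure scaling is absorbed by s, so its distortion is 1.
print(distortion([0, 1, 3], line, line, lambda x: 2 * x))
```

The scaling factor s is what lets a uniformly stretched copy of the space count as distortion-free.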

Edit distance algorithm in O(n^2)

If LCS(x, y) is the length of the longest common subsequence of x and y, where x, y are strings of length n, then

n – LCS(x, y) ≤ ed(x, y) ≤ 2(n – LCS(x, y))
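The slide only names the O(n^2) algorithm. A minimal sketch of the standard dynamic programs for ed and LCS, which may or may not be the exact variant presented in the lecture, looks as follows; the function names are illustrative.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and substitutions, in O(|x||y|)."""
    prev = list(range(len(y) + 1))          # distances from the empty prefix of x
    for i, a in enumerate(x, 1):
        cur = [i]                           # deleting the first i characters of x
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]

def lcs_length(x, y):
    """Length of the longest common subsequence, in O(|x||y|)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

x, y = "abcde", "bcdea"
n = len(x)
gap = n - lcs_length(x, y)
# The inequality from the slide: n - LCS <= ed <= 2(n - LCS).
print(gap, edit_distance(x, y), 2 * gap)
```

Both functions keep only two rows of the table, so they use O(n) memory.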

Theorem

For every n, the Ulam metric of dimension n can be embedded into ℓ1^{O(|Σ|^2)} with distortion O(log n).

Let n be an integer, and suppose it is a power of 2. Let m = |Σ|, so we can suppose that Σ = {1, 2, …, m}. The embedding is the following:

The embedding

The embedding is f : P_n → ℓ1^{m^2}.

Associate every coordinate of the target space with a distinct pair {a, b}, where a, b in Σ and a ≠ b. Every permutation p in P_n receives the following coordinates in the new space:

f(p)_{a,b} = 1/(p^{-1}(b) – p^{-1}(a)), if a, b appear in p,
f(p)_{a,b} = 0, if they don't.

The proof is given by the following two lemmas.
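The coordinates above can be computed directly. The following is a small sketch, not the lecture's own code: it stores the coordinates in a dictionary keyed by pairs (a, b) with a < b, and the names `ulam_embed` and `l1_distance` are illustrative. The ordering a < b, which fixes the sign of each coordinate, is an assumption chosen to agree with the Embed pseudocode at the end of the slides.

```python
from itertools import combinations

def ulam_embed(p, m):
    """Coordinates f(p)_{a,b} for a string p of distinct symbols from {1..m}."""
    pos = {a: i for i, a in enumerate(p, 1)}       # pos[a] = p^{-1}(a), 1-based
    coords = {}
    for a, b in combinations(range(1, m + 1), 2):  # a < b fixes the sign
        if a in pos and b in pos:
            coords[(a, b)] = 1.0 / (pos[b] - pos[a])
        else:
            coords[(a, b)] = 0.0                   # a or b does not appear in p
    return coords

def l1_distance(u, v):
    """The l1 norm of u - v, for coordinate dictionaries with equal keys."""
    return sum(abs(u[k] - v[k]) for k in u)

p, q = [2, 1, 3], [1, 2, 3]
print(ulam_embed(p, 3))
print(l1_distance(ulam_embed(p, 3), ulam_embed(q, 3)))
```

Here p and q differ by one character move, and the inverted pair {1, 2} contributes most of the ℓ1 distance.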

Lemma 1 - Expansion

Let p and q be permutations of length n. Then,

║f(p) – f(q)║_1 ≤ O(log n)∙ed(p, q)

Proof:

First notice that f can be extended to strings of length less than n. So we only need to show that the inequality holds for the case ed(p, q) = 1, where p has size n and q has size n – 1. Also, we will treat a substitution as a character deletion followed by an insertion.

Proof of lemma 1 (cont.)

q is obtained from p by deleting p[s] for some s.

So, p[i] = q[i] for i < s, and p[i+1] = q[i] for i ≥ s.

║f(p) – f(q)║_1 = ∑_{a,b in Σ} |f(p)_{a,b} – f(q)_{a,b}|

Ignore {a, b} not entirely in p. So, a = p[i], b = p[j], i < j. Cases: i = s; i, j < s; i, j > s; and i < s < j (on the whiteboard).

QED

Definitions needed

- LIS(p): the length of the longest increasing subsequence of p.
- breakpoint: a position i in [k-1] s.t. p[i] > p[i+1].
- b(p): the number of breakpoints in p.
- p_0, p_1 are a partition of p if they are disjoint subsequences of p and every x of p appears in p_0 or p_1.
- block: a pair of positions {2i – 1, 2i}.
- a partition p_0, p_1 is block-balanced if they also partition every block with one element each.

Proposition 1

Let p be a permutation of length k, k even. Then, for every block-balanced partition of p into p_0, p_1,

LIS(p) ≥ LIS(p_0) + LIS(p_1) – 2b(p)

We will prove that LIS(p) ≥ 2 LIS(p_0) – 2b(p), assuming w.l.o.g. that LIS(p_0) ≥ LIS(p_1). The argument follows.

Argument points

- We try to augment LIS(p_0) with points from p_1.
- If j is a position in p_0, let j' be the other position of its block, so that {j', j} is a block.
- If j is in LIS(p_0), then j' is a candidate.
- #candidates = LIS(p_0)
- LIS(p_0) can always be augmented by LIS(p_0) – 2b(p) candidates.
- Every breakpoint can only be blamed for at most 2 candidates.
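The definitions and Proposition 1 can be checked by brute force on small permutations. The sketch below is only an illustration (all function names are hypothetical); it enumerates every block-balanced partition of p and tests the inequality LIS(p) ≥ LIS(p_0) + LIS(p_1) – 2b(p).

```python
from bisect import bisect_left
from itertools import permutations, product

def lis_length(p):
    """Length of the longest increasing subsequence (patience sorting)."""
    tails = []                      # tails[i]: smallest tail of an increasing run of length i+1
    for v in p:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)

def breakpoints(p):
    """b(p): number of positions i with p[i] > p[i+1]."""
    return sum(1 for i in range(len(p) - 1) if p[i] > p[i + 1])

def block_balanced_partitions(p):
    """Yield every (p0, p1) taking exactly one element from each block {2i-1, 2i}."""
    for choice in product((0, 1), repeat=len(p) // 2):
        p0 = [p[2 * i + c] for i, c in enumerate(choice)]
        p1 = [p[2 * i + 1 - c] for i, c in enumerate(choice)]
        yield p0, p1

def proposition_holds(p):
    b = breakpoints(p)
    return all(lis_length(p) >= lis_length(p0) + lis_length(p1) - 2 * b
               for p0, p1 in block_balanced_partitions(p))

# Exhaustive check over all permutations of length 6.
print(all(proposition_holds(p) for p in permutations(range(1, 7))))
```

For the identity permutation b(p) = 0 and the bound holds with equality, since any block-balanced split of an increasing sequence gives two increasing halves.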

Lemma 2 - Contraction

Let p and q be permutations of length n, and assume that n is a power of 2. Then

║f(p) – f(q)║_1 ≥ (1/16) ed(p, q)

For the proof assume:

- p and q have the same characters
- q = (1, 2, 3, …, n)

So, ed(p, q) ≤ 2(n – LCS(p, q)) = 2(n – LIS(p))

Proof of Lemma 2

Partition p into p_0, p_1 at random, uniformly splitting every block.

Partition p_0 into p_00, p_01 at random, uniformly splitting every block, etc., recursively, until we have singleton subsequences p_σ for σ in {0,1}^{log n}. Let ε be the empty string, so p_ε = p.

LIS(p) ≥ E[LIS(p_0) + LIS(p_1)] – 2b(p) ≥ ∑_{σ in {0,1}^{log n}} E[LIS(p_σ)] – 2 ∑_{k ≤ log n} ∑_{σ in {0,1}^{k-1}} E[b(p_σ)]

Proof of Lemma 2 (cont.)

Each of the n singletons p_σ has LIS(p_σ) = 1. So, n – LIS(p) ≤ 2 E[∑∑ b(p_σ)] ≤ 8 ∑ 1/(i – j), where the last sum is over all i > j with p[i] < p[j].

For such i, j, f(p)_{p[i],p[j]} = 1/(j – i) < 0, and f(q)_{p[i],p[j]} = 1/(p[j] – p[i]) > 0.

So, |f(p)_{p[i],p[j]} – f(q)_{p[i],p[j]}| > 1/(i – j)

And 8║f(p) – f(q)║_1 ≥ (1/2) ed(p, q), which ends the proof.

Applications (some definitions)

X_{n,t} includes all t-non-repetitive strings of length n over Σ.

B_{n,t} includes all t-bounded-occurrence strings of length n over Σ.

X_{n,r,t} includes all (t, r)-non-repetitive strings of length n over Σ.

Non-repetitive strings

(X_{n,t}, ed) embeds with distortion 2t into the Ulam metric of dimension n – t + 1 and alphabet size 2^t. Consequently, it embeds into ℓ1 with distortion O(t log n).

Σ = {0, 1}^t, and for x in {0,1}^n, f(x) is defined by: f(x)_j = x[j] … x[j + t – 1]

½ ed(x, y) ≤ ed(f(x), f(y)) ≤ t ed(x, y)

(proof …)
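The map f(x)_j = x[j] … x[j + t – 1] is a sliding window over x. A minimal sketch follows, under the assumption (not spelled out on the slides) that a string is t-non-repetitive when all of its length-t substrings are distinct; the function names are illustrative.

```python
def window_embed(x, t):
    """f(x): the n - t + 1 overlapping substrings of length t, in order."""
    return [x[j:j + t] for j in range(len(x) - t + 1)]

def is_t_non_repetitive(x, t):
    """Assumed definition: no length-t substring occurs twice in x."""
    windows = window_embed(x, t)
    return len(set(windows)) == len(windows)

print(window_embed("0011", 2))          # three distinct symbols over {0,1}^2
print(is_t_non_repetitive("0011", 2))
print(is_t_non_repetitive("00101", 2))  # "01" occurs twice
```

When x is t-non-repetitive under this assumption, the image is a string of n – t + 1 distinct symbols, which is exactly what the Ulam metric requires.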

Bounded-occurrence strings

(B_{n,t}, ed) embeds with distortion t into the Ulam metric of dimension n over an extended alphabet of size t|Σ|. Consequently, it embeds into ℓ1 with distortion O(t log n).

Just substitute each a in Σ with a_1, a_2, …, a_t and extend Σ to Σ', of size t|Σ|.

Substitute the j-th occurrence of a in x with a_j to obtain f(x).

ed(x, y) ≤ ed(f(x), f(y)) ≤ t ed(x, y) follows.
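The substitution of the j-th occurrence of a with a_j can be sketched in a few lines. The representation of a_j as the tuple (a, j) and the name `occurrence_embed` are choices made here for illustration.

```python
from collections import defaultdict

def occurrence_embed(x):
    """Replace the j-th occurrence of each symbol a in x by the new symbol (a, j)."""
    seen = defaultdict(int)         # seen[a]: occurrences of a read so far
    out = []
    for a in x:
        seen[a] += 1
        out.append((a, seen[a]))
    return out

fx = occurrence_embed("abab")
print(fx)
# All symbols of f(x) are distinct, so f(x) is a valid point of the Ulam metric.
print(len(set(fx)) == len(fx))
```

If every symbol occurs at most t times in x, the index j never exceeds t, so the extended alphabet Σ' has size at most t|Σ|.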

Sketching t-non-repetitive strings

For every k, there exists a polynomial-time sketching algorithm that solves the k vs Ω(k t log n) gap edit distance problem on t-non-repetitive strings of length n, using sketches of size O(1).

We use the following:

For all k and ε > 0, there exists a polynomial-time sketching algorithm that solves the k vs (1+ε)k gap Hamming distance problem on binary strings of length n, using a sketch of size O(1/ε^2).

Sketching t-non-repetitive strings (cont.)

Convert ℓ1 into the Hamming metric:

- Round each coordinate to a multiple of 1/(Cn^2) for sufficiently large C > 0 (distortion increases by 2).
- Convert this to an element of the Hamming space…
- Use the sketching algorithm for Hamming distance.

Locally non-repetitive strings

For every t and every k, there exists an embedding f of the (t, 180tk)-non-repetitive strings into ℓ1, such that for every two strings x, y,

Ω(min{k, ed(x, y)/(t log(tk))}) ≤ ║f(x) – f(y)║_1 ≤ ed(x, y)

… Proof …

The embedding

- Let x be a (t, 180tk)-non-repetitive string and W = 56tk. Append to x the string a_1 a_2 … a_{2W+t} (new symbols).
- Use anchors α_1, α_2, …, α_{r_x}, where r_x = O(n/tk). Define φ_i.
- Embed the φ_i's into ℓ1^{O(tk)}. φ_i is a string of length at most 2W + t ≤ 180tk, so it is t-non-repetitive.
- Concatenate to φ(x) in ℓ1^{O(n)}.
- Choose r in {0, 1}^{O(n)} of the same length s.t. r_i = 1 independently with probability 1/(kt log(kt)).
- Then, f'(x) = r∙φ(x) mod 2

Two lemmas that complete the proof

1. If x and y are (t, 180tk)-non-repetitive strings, then Pr[f'(x) ≠ f'(y)] ≤ O(ed(x, y)/k).
2. If x and y are (t, 180tk)-non-repetitive strings, then Pr[f'(x) ≠ f'(y)] ≥ Ω(min{ed(x, y)/(kt log(kt)), 1}).

Also, if f(x) is the concatenation of k f' results, it follows:

║f(x) – f(y)║_1 = k E[|f'(x) – f'(y)|] = k Pr[f'(x) ≠ f'(y)] ≤ O(ed(x, y)).

Resulting…

For every t, k, there exists a polynomial-time sketching algorithm that solves the k vs Ω(t k log k) gap edit distance problem for (t, 180tk)-non-repetitive strings using sketches of size O(1).

This improves a previous result and gives a sketching algorithm for the Ulam metric for this gap (with t = 1).

Embed(x) (of the Ulam metric)

(x is the inverse of the permutation; if a is not in the permutation, then x[a] = 0)

A[1..m][1..m]: array of real; i, j: int
Begin
  for i := 1 to m do
    for j := 1 to i - 1 do
      if x[i]*x[j] <> 0 then A[j, i] := 1/(x[i] - x[j])
      else A[j, i] := 0;
  output(A);
End.

Embednr(x) (of a (t, 180tk)-non-repetitive string x of size n)

const W = 56tk
var A, B[1..n+2W+t]: array of int; c, i, j, top, l, m, h: int;
Begin
  h := 0;
  for all possible coin tosses do {
    h := h + 1;
    // copy x into A and find its largest symbol (top is a local, distinct from the parameter k)
    top := 0;
    for j := 1 to n do { A[j] := x[j]; if top < x[j] then top := x[j] };
    // append 2W+t new symbols
    for j := n+1 to n+2W+t do { top := top + 1; A[j] := top };
    c := 1; i := 1;
    repeat
      1. s_ij := the j'th string of size t starting in A[c+W+j-1];
      2. pick random permutation Π on Σ^t
      3. set a_i := min{s_ij} and l := such that s_il = min{s_ij} (by perm. Π)
      4. c := c + W + l; i := i + 1
    until c > n
    rx := i;
    for i := 1 to rx do φ[i] := substring of x starting after a_{i-1}, ending at the end of a_i;
    for i := 1 to rx do B[i] := Embed[φ[i]];
    φ' := concatenate B;
    pick random r with r[i] = 1 with probability 1/(kt log(kt));
    f' := 0; for all i do f' := f' + r[i]*φ'[i] mod 2;
    f[h] := f'
  }
End.

(We're done, you can wake up)

End (Good night...)

Date post:	14-Dec-2015