Lempel-Ziv and Related Algorithms W. Szpankowski Department of Computer Science Purdue University W. Lafayette, IN 47907 April 29, 2008 AofA and IT logos Wroclaw 2008 Research supported by NSF, AFSOR, and NIH. Joint work with M. Drmota, P . Jacquet, C. Knessl, and M. Ward.
Page 1: Lempel-Ziv and Related Algorithms · AofA and IT logos Wroclaw 2008 ∗Research supported by NSF, AFSOR, and NIH. †Joint work with M. Drmota, P. Jacquet, C. Knessl, and M. Ward.

Lempel-Ziv and Related Algorithms∗

W. Szpankowski†

Department of Computer Science

Purdue University

W. Lafayette, IN 47907

April 29, 2008

AofA and IT logos

Wroclaw 2008

∗Research supported by NSF, AFSOR, and NIH.
†Joint work with M. Drmota, P. Jacquet, C. Knessl, and M. Ward.

Outline

1. Universal Source Coding

2. Error-Resilient Lempel-Ziv’77 (Analytic Pattern Matching)

3. Method of Types (Nonlinear Functional Equations)

Algorithms: are at the heart of virtually all computing technologies;

Combinatorics: provides indispensable tools for finding patterns and structures;

Information: permeates every corner of our lives and shapes our universe.

Goals of Source Coding

The basic problem of source coding (i.e., data compression) is to

find codes with shortest descriptions (lengths) either on average or for

individual sequences when the source (i.e., statistics of the underlying

probability distribution) is unknown (i.e., universal source coding).

Definition: A block-to-variable (BV) length code $C_n : \mathcal{A}^n \to \{0,1\}^*$ is a bijective mapping from the set of all sequences of length $n$ over the alphabet $\mathcal{A}$ to the set $\{0,1\}^*$ of binary sequences.

For a probabilistic source model $S$ and a code $C_n$ we let:

• $P(x_1^n)$ be the probability of $x_1^n = x_1 \ldots x_n$;

• $L(C_n, x_1^n)$ be the code length for $x_1^n$;

• Entropy $H_n(P) = -\sum_{x_1^n} P(x_1^n) \lg P(x_1^n)$; entropy rate $h \sim H(X_1^n)/n$.
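The entropy definitions above can be checked numerically on a small case. The sketch below assumes a binary memoryless source and reads lg as the base-2 logarithm; the function names are illustrative and not part of the talk.

```python
import math
from itertools import product

def block_entropy(p_one: float, n: int) -> float:
    """H_n(P) = -sum_x P(x) lg P(x) over all binary x of length n,
    for a memoryless source emitting 1 with probability p_one."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        k = sum(x)  # number of ones in x
        prob = p_one ** k * (1.0 - p_one) ** (n - k)
        total -= prob * math.log2(prob)
    return total

def entropy_rate(p_one: float) -> float:
    """Per-symbol entropy h of the same memoryless source, in bits."""
    q = 1.0 - p_one
    return -(p_one * math.log2(p_one) + q * math.log2(q))

if __name__ == "__main__":
    n, p = 8, 0.3
    print(block_entropy(p, n) / n, entropy_rate(p))
```

For a memoryless source the two printed numbers agree, since $H_n(P) = nh$ exactly in that case.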

Outline Update

1. Universal Source Coding

2. Algorithms: Error-Resilient Lempel-Ziv’77

(a) Redundant Bits in LZ’77

(b) Design of Encoder and Decoder

(c) Analysis through the Suffix Tree

3. Combinatorics: Method of Types

4. Information: Non-Prefix Codes

LZ’77 Scheme

The popular Lempel-Ziv’77 scheme works on-line: It compresses phrases by

consecutively replacing the longest prefix of the non-compressed portion

of a file with a pointer and the length of its copy.

The devastating effect of errors in LZ’77 is a long-standing open problem.

Castelli and Lastras in 2004 proved that a single error in LZ’77 corrupts $O(n^{2/3})$ phrases, thus about $O(n^{2/3} \log n)$ symbols, where $n$ is the size of the file to be compressed.

Figure 1: LZ’77 pointers from the current position into the history (also for LZRS’77; here $M_n = 4$, with the four copies labeled 00, 01, 10, 11).
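A minimal sketch of the greedy parsing described above may help. It assumes the classical triple format (offset, length, next symbol), whereas the slide speaks only of a pointer and a length; the quadratic search and the function names are illustrative simplifications, not the gzip implementation.

```python
def lz77_parse(text: str):
    """Greedy LZ'77 parse into (offset, length, next_symbol) triples.
    offset counts backwards from the current position; 0 means no copy."""
    i, n, tokens = 0, len(text), []
    while i < n:
        best_len, best_off = 0, 0
        for start in range(i):
            length = 0
            # The copy may run past position i (overlapping match); one symbol
            # is always kept back to serve as the explicit next symbol.
            while i + length < n - 1 and text[start + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - start
        tokens.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return tokens

def lz77_unparse(tokens) -> str:
    """Inverse of lz77_parse: replay each copy symbol by symbol."""
    buf = []
    for off, length, symbol in tokens:
        for _ in range(length):
            buf.append(buf[-off])
        buf.append(symbol)
    return "".join(buf)

if __name__ == "__main__":
    sample = "abracadabra abracadabra"
    tokens = lz77_parse(sample)
    print(tokens)
    print(lz77_unparse(tokens) == sample)
```

Because every later phrase is decoded by copying from earlier output, a corrupted pointer or length propagates forward, which is the fragility the slide refers to.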

Our Main Idea of Error Resilient LZ’77

1. We observe that there are usually multiple copies of the longest prefix. By $M_n$ we denote the number of copies of the longest prefix of the uncompressed string that appear in the database.

2. By a judicious choice of pointers in the LZ’77 scheme, we can recover $\lfloor \log_2 M_n \rfloor$ bits without losing a bit in compression.

3. Use parity bits recovered from the multiple copies (redundancy) for the Reed-Solomon channel coding.

Note: If the greediness of LZ’77 is relaxed (say, by looking for the 10th longest prefix instead of the longest one), then the number of copies found in the database will increase significantly. This would allow even more errors to be corrected.
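The pointer-choice idea in items 1 and 2 can be sketched as follows. This is an illustration under simplifying assumptions (brute-force search, copies ordered by start position, enough message bits available); the function names are hypothetical and do not come from the gzipS code.

```python
def longest_match_copies(history: str, lookahead: str):
    """Return (length, positions): the longest prefix of `lookahead` that has a
    copy starting inside `history`, and all start positions of such copies.
    len(positions) plays the role of M_n."""
    text = history + lookahead  # a copy may run over into the lookahead
    best_len, positions = 0, []
    for start in range(len(history)):
        length = 0
        while length < len(lookahead) and text[start + length] == lookahead[length]:
            length += 1
        if length > best_len:
            best_len, positions = length, [start]
        elif length == best_len and length > 0:
            positions.append(start)
    return best_len, positions

def choose_pointer(positions, bits: str):
    """Encoder side: spend floor(log2 M) message bits on the choice of copy.
    Returns (chosen position, remaining bits)."""
    if not positions:
        return None, bits
    nbits = len(positions).bit_length() - 1  # floor(log2 M)
    chunk = bits[:nbits]
    index = int(chunk, 2) if chunk else 0
    return positions[index], bits[nbits:]

def recover_bits(positions, chosen) -> str:
    """Decoder side: the index of the chosen copy among all copies gives the bits."""
    nbits = len(positions).bit_length() - 1
    return format(positions.index(chosen), "b").zfill(nbits) if nbits > 0 else ""

if __name__ == "__main__":
    length, copies = longest_match_copies("abxabyabzabw", "abq")
    pointer, rest = choose_pointer(copies, "10")
    print(length, copies, pointer, recover_bits(copies, pointer))
```

In this example there are four copies of the prefix "ab", so two message bits ride on the pointer at no cost in compressed size. The decoder can rebuild the same list of copies because it already holds the history and learns the phrase from whichever pointer it receives.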

Experimental Results: I

Table 1: The compression of “gzip -3” (we also call it LZS’77) versus “gzipS -3” for the files of the Calgary corpus; the last column shows the total number of available bytes for error correction.

file     file size    gzip      gzipS     redundant
bib      111,261      39,473    39,511    1,721
book1    768,771      333,776   336,256   14,524
book2    610,856      228,321   228,242   10,361
geo      102,400      69,478    71,168    4,101
news     377,109      155,290   156,150   5,956
obj1     21,504       10,584    10,783    353
obj2     246,814      89,467    89,757    3,628
paper1   53,161       20,110    20,204    937
paper2   82,199       32,529    32,507    1,551
paper3   46,526       19,450    19,567    893
paper4   13,286       5,853     5,898     249
paper5   11,954       5,252     5,294     210
paper6   38,105       14,433    14,506    738
pic      513,216      62,357    61,259    3,025
progc    39,611       14,510    14,660    736
progl    71,646       18,310    18,407    1,106
progp    49,379       12,532    12,572    741
trans    93,695       22,178    22,098    1,201

Encoder and Decoder of LZRS’77

We use the family of Reed-Solomon codes RS(255, 255 − 2e) that contains blocks of 255 bytes, of which 255 − 2e are data and 2e are parity.

Encoder: The data is broken into blocks of size 255 − 2e. Blocks are processed in reverse order, beginning with the very last. When processing block $i$, the encoder first computes the Reed-Solomon parity bits for block $i + 1$ and then embeds the extra bits in the pointers of block $i$.

Decoder: The decoder receives a sequence of pointers, preceded by the parity bits of the first block, which are used to correct block $B_1$. Once block $B_1$ is correct, it is decompressed using LZS’77. Redundant bits of block $B_1$ are used as parity bits to correct block $B_2$, etc.

[Figure 2 omitted: blocks $B_1, B_2, B_3, \ldots, B_b$; for each block, an RS step followed by an "Adjust pointers" step on the preceding block, with the parity of $B_1$ stored.]

Figure 2: The right-to-left sequence of operations on the blocks.

Error-Resilient Algorithm LZRS’77

LZRS’77 ENCODER (X, e)

    let b, j, n ← 1, 1, |X|
    while j < n do
        append LZ’77 COMPRESS(X_j) to B_b
        advance j past the phrase just compressed
        if |B_b| = 255 − 2e then let b ← b + 1
    for i ← b, . . . , 2 do
        let RS_i ← REED SOLOMON ENCODER(B_i, e)
        embed in the block B_{i−1} the bits RS_i using LZS’77
    let RS_1 ← REED SOLOMON ENCODER(B_1, e)
    return RS_1, B_1, B_2, . . . , B_b

LZRS’77 DECODER (RS_1, B_1, B_2, . . . , B_b, e)

    D ← empty string
    if REED SOLOMON DECODER(B_1 + RS_1, e) = errors
        then correct B_1
    append LZ’77 DECOMPRESS(B_1) to D
    recover RS_2 from the pointers used in B_1 using LZS’77
    for i ← 2, . . . , b do
        if REED SOLOMON DECODER(B_i + RS_i, e) = errors
            then correct B_i
        append LZ’77 DECOMPRESS(B_i) to D
        recover RS_{i+1} from the pointers in B_i using LZS’77
    return D

Experimental Results II

[Figure 3 omitted: four plots of the probability that the file could not be recovered (vertical axis, 0 to 1) against the number of errors injected (horizontal axis). Panel titles: t=1, 10 buffers; t=1, 100 buffers; t=2, 10 buffers; t=2, 100 buffers.]

Figure 3: The probability that a file of b blocks could not be recovered correctly vs. the number of errors distributed over the blocks. Top-left: e = 1 and b = 10; top-right: e = 1 and b = 100; lower-left: e = 2 and b = 10; lower-right: e = 2 and b = 100 (e.g., for e = 2 and b = 100, LZRS’77 can decompress correctly with 20 uniformly distributed errors 90% of the time).
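A rough way to read Figure 3 is a balls-into-bins model: each injected error lands in one of the b blocks uniformly at random, and decoding fails as soon as some block receives more than e errors. The simulation below is only a sketch under that assumption; it ignores how the actual experiment injected errors and any interaction between blocks.

```python
import random

def failure_probability(num_errors: int, num_blocks: int, e: int,
                        trials: int = 20000, seed: int = 1) -> float:
    """Monte Carlo estimate of P(some block receives more than e errors)
    when num_errors errors fall uniformly at random over num_blocks blocks."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        hits = [0] * num_blocks
        for _ in range(num_errors):
            hits[rng.randrange(num_blocks)] += 1
        if max(hits) > e:
            failures += 1
    return failures / trials

if __name__ == "__main__":
    # Setting quoted in the caption: e = 2, b = 100, 20 errors.
    print(failure_probability(20, 100, 2))
```

Under this toy model the estimated failure rate for e = 2, b = 100 and 20 errors comes out near one in ten, in line with the 90% recovery figure quoted in the caption.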

Analysis of Mn Via Suffix Trees

Performance of LZRS’77 depends on $M_n$. How does $M_n$ typically behave?

Build a suffix tree from the first $n$ suffixes of the database $X$ (i.e., $S_1 = X_1^\infty$, $S_2 = X_2^\infty$, ..., $S_n = X_n^\infty$). Then insert the $(n+1)$st suffix, $S_{n+1} = X_{n+1}^\infty$.

Observe: The depth of insertion of $S_{n+1}$ is the $(n+1)$st phrase length. Also, $M_n$ is the size of the subtree that starts at the insertion point of the $(n+1)$st suffix.

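[Figure 4 omitted: a suffix tree on $S_1, \ldots, S_4$ with the inserted suffix $S_5$ and the subtree of size $M_n$ marked.]

Figure 4: $M_4$ (= 2) is the size of the subtree at the insertion point of $S_5$.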

Analyzing Mn

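The $i$th suffix of $X$ is $X^{(i)} = X_i X_{i+1} X_{i+2} \ldots$. Consider the longest prefix $w$ of $X^{(n+1)}$ such that $X^{(i)}$ also has $w$ as a prefix, for some $1 \le i \le n$. Then

$$M_n = \#\{1 \le i \le n \mid X^{(i)} = X_i X_{i+1} X_{i+2} \ldots \text{ has } w \text{ as a prefix}\}.$$

1. Independent Strings: Tries built over $n$ independent strings.

(i) The average $E[M_n^I]$ satisfies the recurrence ($p = 1 - q$ is the probability of a “1”):

$$E[M_n^I] = p^n\left(qn + pE[M_n^I]\right) + q^n\left(pn + qE[M_n^I]\right) + \sum_{k=1}^{n-1} \binom{n}{k} p^k q^{n-k}\left(pE[M_k^I] + qE[M_{n-k}^I]\right);$$

(ii) The probability generating functions $E[u^{M_n^I}]$ satisfy

$$E[u^{M_n^I}] = p^n\left(qu^n + pE[u^{M_n^I}]\right) + q^n\left(pu^n + qE[u^{M_n^I}]\right) + \sum_{k=1}^{n-1} \binom{n}{k} p^k q^{n-k}\left(pE[u^{M_k^I}] + qE[u^{M_{n-k}^I}]\right).$$

The pattern matching approach also gives ($\beta = \alpha \oplus 1$)

$$M^I(z,u) = \sum_{n=1}^{\infty}\sum_{k=1}^{\infty} P(M_n^I = k)\,u^k z^n = \sum_{w \in \mathcal{A}^*,\ \alpha \in \mathcal{A}} \frac{uP(\beta)P(w)}{1 - z(1 - P(w))} \cdot \frac{zP(w)P(\alpha)}{1 - z\left(1 + uP(w)P(\alpha) - P(w)\right)}.$$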
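Recurrence (i) can be iterated numerically once the two terms containing $E[M_n^I]$ are moved to the left-hand side, which gives $E[M_n^I]\,(1 - p^{n+1} - q^{n+1}) = n(p^n q + q^n p) + \sum_{k=1}^{n-1}\binom{n}{k}p^k q^{n-k}(pE[M_k^I] + qE[M_{n-k}^I])$. The sketch below does exactly that in floating point; the function name is illustrative.

```python
import math

def mean_copies(n_max: int, p: float):
    """Iterate recurrence (i) for E[M_n^I], n = 1..n_max.
    Returns a list `mean` with mean[n] = E[M_n^I] (mean[0] is unused)."""
    q = 1.0 - p
    mean = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        # Terms where all n strings fall on one side and the new string on the other.
        boundary = n * (p ** n * q + q ** n * p)
        # Proper splits: k strings start with "1", n - k start with "0".
        inner = sum(
            math.comb(n, k) * p ** k * q ** (n - k) * (p * mean[k] + q * mean[n - k])
            for k in range(1, n)
        )
        mean[n] = (boundary + inner) / (1.0 - p ** (n + 1) - q ** (n + 1))
    return mean

if __name__ == "__main__":
    values = mean_copies(200, 0.5)
    print(values[1], values[2], values[200], 1.0 / math.log(2.0))
```

For the symmetric case p = 1/2 the values settle quickly near $1/\ln 2 \approx 1.44$, which suggests that the entropy $h$ appearing in the later results is measured with natural logarithms.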

Analyzing Mn for Dependent Strings

2. Suffix Trees: Using analytic combinatorics on words we prove that

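$$M(z,u) = \sum_{n=1}^{\infty}\sum_{k=1}^{\infty} P(M_n = k)\,u^k z^n = \sum_{w \in \mathcal{A}^*,\ \alpha \in \mathcal{A}} \frac{uP(\beta)P(w)}{D_w(z)} \cdot \frac{D_{w\alpha}(z) - (1 - z)}{D_w(z) - u\left(D_{w\alpha}(z) - (1 - z)\right)}$$

where $D_w(z) = (1 - z)S_w(z) + z^m P(w)$, with $m = |w|$, and $S_w(z)$ is the autocorrelation polynomial, namely

$$S_w(z) = \sum_{k \in \mathcal{P}(w)} P(w_{k+1}^m)\,z^{m-k}$$

where $\mathcal{P}(w)$ denotes the positions $k$ of $w$ satisfying $w_1 \ldots w_k = w_{m-k+1} \ldots w_m$.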
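As a small worked instance of these definitions, the sketch below computes the autocorrelation set, the coefficients of $S_w(z)$, and the value of $D_w(z)$ for a word over a memoryless source. The dictionary `prob` of symbol probabilities and the function names are assumptions made for the example.

```python
def autocorrelation_set(w: str):
    """Positions k (1 <= k <= m) with w_1..w_k equal to w_{m-k+1}..w_m."""
    m = len(w)
    return [k for k in range(1, m + 1) if w[:k] == w[m - k:]]

def autocorrelation_poly(w: str, prob: dict):
    """Coefficients {exponent: coefficient} of
    S_w(z) = sum over k in P(w) of P(w_{k+1}^m) z^(m-k)."""
    m = len(w)
    coeffs = {}
    for k in autocorrelation_set(w):
        tail = 1.0
        for symbol in w[k:]:  # probability of the suffix w_{k+1} .. w_m
            tail *= prob[symbol]
        coeffs[m - k] = tail
    return coeffs

def d_w(w: str, prob: dict, z: float) -> float:
    """D_w(z) = (1 - z) S_w(z) + z^m P(w)."""
    s = sum(c * z ** exp for exp, c in autocorrelation_poly(w, prob).items())
    p_w = 1.0
    for symbol in w:
        p_w *= prob[symbol]
    return (1.0 - z) * s + z ** len(w) * p_w

if __name__ == "__main__":
    fair = {"0": 0.5, "1": 0.5}
    print(autocorrelation_set("101"), autocorrelation_poly("101", fair))
```

For w = 101 the word overlaps itself at k = 1 and trivially at k = 3, so $S_w(z) = 1 + P(01)\,z^2$.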

For any $\varepsilon > 0$ there exists $\beta > 1$ such that (all hard analytic work is here!)

$$\Pr(M_n = k) - \Pr(M_n^I = k) = O(n^{-\varepsilon}\beta^{-k}).$$

Random suffix trees resemble random independent tries (cf. P. Jacquet,

W.S., 1994, Ward, 2005).

Main Results

Theorem 1 (Ward, W.S., 2005). Let $z_k = \frac{2kr\pi i}{\ln p}$ $\forall k \in \mathbb{Z}$, where $\frac{\ln p}{\ln q} = \frac{r}{s}$ for some relatively prime $r, s \in \mathbb{Z}$ (i.e., $\frac{\ln p}{\ln q}$ is rational).

The $j$th factorial moment $E[(M_n)^{\underline{j}}] = E[M_n(M_n - 1) \cdots (M_n - j + 1)]$ is

$$E[(M_n)^{\underline{j}}] = \Gamma(j)\,\frac{q(p/q)^j + p(q/p)^j}{h} + \delta_j(\log_{1/p} n) + O(n^{-\eta})$$

where $h = -p \log p - q \log q$ is the entropy rate, $\eta > 0$, $\Gamma$ is the Euler gamma function, and

$$\delta_j(t) = \sum_{k \neq 0} -\frac{e^{2kr\pi i t}\,\Gamma(z_k + j)\left(p^j q^{-z_k - j + 1} + q^j p^{-z_k - j + 1}\right)}{p^{-z_k + 1}\ln p + q^{-z_k + 1}\ln q}.$$

$\delta_j$ is a periodic function that has a small magnitude and exhibits fluctuation when $\frac{\ln p}{\ln q}$ is rational.

Note: On average there are $E[M_n] \sim 1/h$ additional pointers.

j     $\frac{1}{\ln 2}\sum_{k \neq 0}\left|\Gamma\left(j - \frac{2ki\pi}{\ln 2}\right)\right|$
1     1.4260 × 10^{-5}
3     1.2072 × 10^{-3}
5     1.1421 × 10^{-1}
6     1.1823 × 10^{0}
8     1.4721 × 10^{2}
9     1.7798 × 10^{3}
10    2.2737 × 10^{4}
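The tabulated sums can be reproduced without a complex gamma routine, using two standard identities: $|\Gamma(1 + iy)|^2 = \pi y / \sinh(\pi y)$ and $\Gamma(z + 1) = z\,\Gamma(z)$. The sketch below is one way to do it; the function names and the truncation at ten terms are choices made here, not taken from the talk.

```python
import math

def gamma_abs(j: int, y: float) -> float:
    """|Gamma(j + i*y)| for an integer j >= 1 and real y > 0."""
    # |Gamma(1 + iy)|^2 = pi*y / sinh(pi*y)
    value = math.sqrt(math.pi * y / math.sinh(math.pi * y))
    # Gamma(z + 1) = z * Gamma(z), so each step multiplies by |m + iy|.
    for m in range(1, j):
        value *= math.hypot(m, y)
    return value

def fluctuation_bound(j: int, terms: int = 10) -> float:
    """(1/ln 2) * sum over k != 0 of |Gamma(j - 2*k*i*pi/ln 2)|, truncated to |k| <= terms.
    The terms for k and -k are equal, hence the factor 2."""
    step = 2.0 * math.pi / math.log(2.0)
    total = 2.0 * sum(gamma_abs(j, k * step) for k in range(1, terms + 1))
    return total / math.log(2.0)

if __name__ == "__main__":
    for j in (1, 3, 5, 10):
        print(j, fluctuation_bound(j))
```

The sum is dominated by the k = ±1 terms, so the truncation has no visible effect at the precision of the table.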

Distribution of Mn

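Theorem 2 (Ward, W.S., 2005). Let $z_k = \frac{2kr\pi i}{\ln p}$ $\forall k \in \mathbb{Z}$, where $\frac{\ln p}{\ln q} = \frac{r}{s}$ for some relatively prime $r, s \in \mathbb{Z}$. Then

$$P(M_n = j) = \frac{p^j q + q^j p}{jh} + \sum_{k \neq 0} -\frac{e^{2kr\pi i \log_{1/p} n}\,\Gamma(z_k)\,(p^j q + q^j p)\,(z_k)^{\overline{j}}}{j!\left(p^{-z_k + 1}\ln p + q^{-z_k + 1}\ln q\right)} + O(n^{-\eta})$$

where $\eta > 0$, $\Gamma$ is the Euler gamma function, and $(z_k)^{\overline{j}} = z_k(z_k + 1)\cdots(z_k + j - 1)$ is the rising factorial.

Therefore, $M_n$ follows the logarithmic series distribution with mean $1/h$ (plus some fluctuations).

The logarithmic series distribution $\left((p^j q + q^j p)/(jh)\right)$ is well concentrated around its mean $E[M_n] \approx 1/h$.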

[Plot of the logarithmic series distribution omitted; horizontal axis x from 2 to 8, vertical axis from 0 to 0.8.]
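The leading term of Theorem 2 can be checked directly: the weights $(p^j q + q^j p)/(jh)$ should sum to 1 and have mean $1/h$ when $h$ is computed with natural logarithms. The sketch below assumes that convention; the function names are illustrative.

```python
import math

def entropy_nats(p: float) -> float:
    """h = -p ln p - q ln q."""
    q = 1.0 - p
    return -p * math.log(p) - q * math.log(q)

def log_series_pmf(j: int, p: float) -> float:
    """Leading term of P(M_n = j): (p^j q + q^j p) / (j h)."""
    q = 1.0 - p
    return (p ** j * q + q ** j * p) / (j * entropy_nats(p))

if __name__ == "__main__":
    p = 0.3
    mass = sum(log_series_pmf(j, p) for j in range(1, 500))
    mean = sum(j * log_series_pmf(j, p) for j in range(1, 500))
    print(mass, mean, 1.0 / entropy_nats(p))
```

The total mass is 1 because $\sum_j x^j/j = -\ln(1 - x)$, and the mean is $1/h$ because $\sum_j (p^j q + q^j p) = p + q = 1$.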

Outline Update

1. Universal Source Coding

2. Algorithms: Error-Resilient Lempel-Ziv’77

3. Combinatorics: Method of Types

(a) Markov Types and Eulerian Paths

(b) Universal Types and Enumeration of Binary Trees

(c) Nonlinear Functional Equations Arising in AofA

4. Information: Non-Prefix Codes

Method of Types

The method of types is a powerful technique in information theory; it

reduces calculations of the probability of rare events to combinatorics.

Sequences are of the same type if they have the same empirical

distribution.

Warm-up Problem: How many binary strings $x_1^n$ of length $n$ generated by a memoryless source have $k$ “1”s (i.e., have the same Bernoulli type)? All such strings have the same probability

$$P(x_1^n) = p^k (1 - p)^{n-k}$$

where $p$ is the probability of generating a 1.

Answer: Certainly, the answer is $\binom{n}{k}$.
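The warm-up can be confirmed by brute force for small n. The sketch below enumerates a Bernoulli type class and computes its total probability; the function names are illustrative.

```python
from itertools import product
from math import comb

def type_class(n: int, k: int):
    """All binary strings of length n with exactly k ones."""
    return [x for x in product((0, 1), repeat=n) if sum(x) == k]

def type_class_probability(n: int, k: int, p: float) -> float:
    """Probability that a memoryless source emits some string of this type:
    every member has probability p^k (1 - p)^(n - k)."""
    return len(type_class(n, k)) * p ** k * (1.0 - p) ** (n - k)

if __name__ == "__main__":
    print(len(type_class(6, 2)), comb(6, 2))
    print(sum(type_class_probability(6, k, 0.3) for k in range(7)))
```

Summing the class probabilities over all types returns 1, which is the sense in which the method of types reduces a probability computation to counting.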

Markov Types

Consider a Markov source over an $m$-ary alphabet with the transition matrix $P = \{p_{ij}\}_{i,j=1}^m$, that is, $P(X_{t+1} = j \mid X_t = i) = p_{ij}$.

The probability of $x_1^n$ is

$$P(x_1^n) = p_{11}^{k_{11}} \cdots p_{mm}^{k_{mm}}$$

where $k_{ij}$ is the number of pair symbols $ij$ in $x_1^n$, that is, $i$ followed by $j$.

Example: Let $x_1^n = 01101$, then

$$P(01101) = p_{01}^2\,p_{11}\,p_{10}.$$

For circular strings (i.e., after the $n$th symbol we re-visit the first symbol of $x_1^n$), the matrix $[k_{ij}]$ satisfies the following constraints that we denote as $\mathcal{F}_n$:

$$\sum_{1 \le i,j \le m} k_{ij} = n, \qquad \sum_{j=1}^m k_{ij} = \sum_{j=1}^m k_{ji} \quad \forall i \quad \text{(balance property)}.$$
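The pair counts in the example, and the two constraints for circular strings, can be verified mechanically. The sketch below is a minimal illustration with hypothetical helper names.

```python
from collections import Counter

def pair_counts(x: str, cyclic: bool = False) -> Counter:
    """Count adjacent symbol pairs (i followed by j) in x.
    With cyclic=True the last symbol is also followed by the first."""
    pairs = list(zip(x, x[1:]))
    if cyclic and x:
        pairs.append((x[-1], x[0]))
    return Counter(pairs)

def is_balanced(counts: Counter, alphabet) -> bool:
    """Balance property: for each symbol i, sum_j k_ij equals sum_j k_ji."""
    return all(
        sum(counts[(i, j)] for j in alphabet) == sum(counts[(j, i)] for j in alphabet)
        for i in alphabet
    )

if __name__ == "__main__":
    print(pair_counts("01101"))               # exponents of p_01, p_11, p_10
    cyc = pair_counts("01101", cyclic=True)
    print(sum(cyc.values()), is_balanced(cyc, "01"))
```

The open string 01101 gives $k_{01} = 2$, $k_{11} = 1$, $k_{10} = 1$, matching the example; closing it into a cycle adds one more 10 pair, after which the counts sum to n = 5 and every symbol is entered as often as it is left.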

Markov Types and Eulerian Cycles

Problem. Let $\mathbf{k} = [k_{ij}]_{i,j=1}^m$ be a given frequency matrix satisfying the balance property.

A: How many strings of a given frequency matrix $\mathbf{k}$ (given type) are there?

Example: Let $\mathcal{A} = \{0, 1\}$ and

$$\mathbf{k} = \begin{bmatrix} 1 & 2 \\ 2 & 2 \end{bmatrix}.$$

B: How to enumerate Eulerian paths (types) in a multigraph with $|\mathcal{A}|$ vertices and $k_{ij}$ edges between the $i$th and $j$th vertices?

We are interested in:

$N_{\mathbf{k}}$ – number of (cyclic) strings $x_1^n$ belonging to the same type $\mathbf{k}$.

$N_{\mathbf{k}}^{a}$ – number of strings $x_1^n$ of type $\mathbf{k}$ and starting with a symbol $a$.

$N_{\mathbf{k}}^{a,b}$ – number of strings $x_1^n$ of type $\mathbf{k}$, starting with a symbol $a$ and ending with $b$.

Main Technical Tool

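Let $g_{\mathbf{k}}$ be a sequence of scalars indexed by matrices $\mathbf{k}$ and

$$g(\mathbf{z}) = \sum_{\mathbf{k}} g_{\mathbf{k}}\,\mathbf{z}^{\mathbf{k}}$$

be its regular generating function, and

$$\mathcal{F}g(\mathbf{z}) = \sum_{\mathbf{k} \in \mathcal{F}} g_{\mathbf{k}}\,\mathbf{z}^{\mathbf{k}} = \sum_{n \ge 0}\sum_{\mathbf{k} \in \mathcal{F}_n} g_{\mathbf{k}}\,\mathbf{z}^{\mathbf{k}}$$

the $\mathcal{F}$-generating function of $g_{\mathbf{k}}$ for which $\mathbf{k} \in \mathcal{F}$.

Lemma 1. Let $g(\mathbf{z}) = \sum_{\mathbf{k}} g_{\mathbf{k}}\,\mathbf{z}^{\mathbf{k}}$. Then

$$\mathcal{F}g(\mathbf{z}) := \sum_{n \ge 0}\sum_{\mathbf{k} \in \mathcal{F}_n} g_{\mathbf{k}}\,\mathbf{z}^{\mathbf{k}} = \left(\frac{1}{2i\pi}\right)^m \oint \frac{dx_1}{x_1} \cdots \oint \frac{dx_m}{x_m}\; g\!\left(\left[z_{ij}\frac{x_j}{x_i}\right]\right)$$

where the $ij$-th coefficient of $\left[z_{ij}\frac{x_j}{x_i}\right]$ is $z_{ij}\frac{x_j}{x_i}$.

Proof. It suffices to observe

$$g\!\left(\left[z_{ij}\frac{x_j}{x_i}\right]\right) = \sum_{\mathbf{k}} g_{\mathbf{k}}\,\mathbf{z}^{\mathbf{k}} \prod_{i=1}^m x_i^{\sum_j k_{ji} - \sum_j k_{ij}}.$$

Thus $\mathcal{F}g(\mathbf{z})$ is the coefficient of $g\!\left(\left[z_{ij}\frac{x_j}{x_i}\right]\right)$ at $x_1^0 x_2^0 \cdots x_m^0$.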

Enumeration of Eulerian Paths

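1. Define for an $m$-ary alphabet

$$B_{\mathbf{k}} = \binom{k_1}{k_{11} \cdots k_{1m}} \cdots \binom{k_m}{k_{m1} \cdots k_{mm}}.$$

2. Let $N^a_{\mathbf{k},\mathbf{k}'}$ be the number of ways matrix $\mathbf{k}$ is transformed into another matrix $\mathbf{k}'$ when the Eulerian path starts with symbol $a$:

$$N^a_{\mathbf{k},\mathbf{k}'} = N^a_{\mathbf{k}-\mathbf{k}'} \times B_{\mathbf{k}'}, \qquad k'_a = 0.$$

Since $\sum_{\mathbf{k}'} N^a_{\mathbf{k},\mathbf{k}'} = B_{\mathbf{k}}$, hence $B_{\mathbf{k}} = \sum_{\mathbf{k}' \in \mathcal{F},\ k'_a = 0} N^a_{\mathbf{k}-\mathbf{k}'} \times B_{\mathbf{k}'}$.

3. We find

$$N^{b,a}_{\mathbf{k}} = [\mathbf{z}^{\mathbf{k}}]\,B(\mathbf{z})\,z_{ba} \cdot \det{}_{bb}(\mathbf{I} - \mathbf{z}),$$

and using Cauchy’s formula we can prove that

$$N^{b,a}_{\mathbf{k}} = \frac{k_{ba}}{k_b}\,B_{\mathbf{k}} \cdot \det{}_{bb}(\mathbf{I} - \mathbf{k}^*)\left(1 + O\!\left(\frac{1}{n}\right)\right),$$

where $\mathbf{k}^*$ is the normalized matrix such that $\mathbf{k}^* = [k_{ij}/k_i]$.

4. For example, for a binary Markov source we have

$$N^{0,0}_{\mathbf{k}} \sim \frac{k_{10}}{k_{10} + k_{11}} \binom{k_{00} + k_{01}}{k_{00}} \binom{k_{10} + k_{11}}{k_{10}} = \frac{k_{10}}{k_{10} + k_{11}}\,B_{\mathbf{k}}.$$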
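For the example matrix k = [[1, 2], [2, 2]] the counts are small enough to enumerate. The sketch below counts cyclic binary strings of that type by brute force and evaluates $B_{\mathbf{k}}$ and the binary expression in item 4. Reading $N^{0,0}_{\mathbf{k}}$ as the number of cyclic strings of type k whose first symbol is 0 is an assumption made here for the comparison.

```python
from itertools import product
from math import comb

K = ((1, 2), (2, 2))  # example frequency matrix: K[i][j] = k_ij

def cyclic_type(x):
    """Pair-count matrix of a binary tuple x read as a circular string."""
    k = [[0, 0], [0, 0]]
    for a, b in zip(x, x[1:] + x[:1]):
        k[a][b] += 1
    return tuple(map(tuple, k))

def brute_force_counts(k):
    """(number of cyclic strings of type k, number of those starting with 0)."""
    n = sum(map(sum, k))
    strings = [x for x in product((0, 1), repeat=n) if cyclic_type(x) == k]
    return len(strings), sum(1 for x in strings if x[0] == 0)

def b_k(k):
    """B_k for a binary alphabet: one binomial coefficient per row."""
    return comb(k[0][0] + k[0][1], k[0][0]) * comb(k[1][0] + k[1][1], k[1][0])

def binary_formula(k):
    """k_10 / (k_10 + k_11) * B_k, the expression in item 4."""
    return k[1][0] * b_k(k) / (k[1][0] + k[1][1])

if __name__ == "__main__":
    print(brute_force_counts(K), b_k(K), binary_formula(K))
```

In this instance there are 21 cyclic strings of type k, of which 9 start with 0, and the binary expression also evaluates to 9 (with $B_{\mathbf{k}} = 18$). One small case does not establish the asymptotic statement, but it does make the objects concrete.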

Universal Types

Seroussi introduced in 2003 universal types for stationary ergodic sources:

Sequences of the same length $p$ are said to be of the same universal type if they generate the same set of phrases in the Lempel-Ziv’78 parsing.

Figure 5: Two universal types and the corresponding binary trees.
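A short sketch of the Lempel-Ziv'78 incremental parsing makes the definition concrete. The two example sequences are chosen here for illustration and are not the ones drawn in Figure 5.

```python
def lz78_phrases(s: str):
    """Incremental (Lempel-Ziv'78) parsing: each phrase is the shortest prefix
    of the remaining input that has not appeared as a phrase before.
    Returns (phrases, leftover), where leftover is an incomplete last phrase."""
    phrases, seen, current = [], set(), ""
    for symbol in s:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    return phrases, current

if __name__ == "__main__":
    first, _ = lz78_phrases("010001")
    second, _ = lz78_phrases("100100")
    print(first, second, set(first) == set(second))
```

The sequences 010001 and 100100 parse into the same four phrases {0, 1, 00, 01} in different orders, so they share a universal type. The phrase lengths add up to 6, the sequence length, which is the path length of the associated tree.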

Number of Types and Binary Trees

The Lempel-Ziv’78 parsing scheme of a sequence of length $p$ can be represented by a binary tree of path length $p$. Let

– $\mathcal{T}_n$ be the set of binary trees built on $n$ nodes.

– $\mathcal{T}_p$ be the set of binary trees with the path length equal to $p$.

# universal types over $\mathcal{A}^p$ ≡ $|\mathcal{T}_p|$: the number of trees of a given path length $p$.

How to enumerate binary trees of a given path length $p$?

Enumeration of Binary Trees: Tn vs Tp

Let $b(n, p)$ be the number of binary trees with $n$ nodes and path length $p$. It satisfies:

$$b(n, p) = \sum_{k + \ell = n - 1}\ \sum_{r + s + n - 1 = p} b(k, r)\,b(\ell, s).$$

[Figure omitted: a tree counted by $b(n+1, p)$ splits at the root into subtrees counted by $b(k, r)$ and $b(n-k, p-n-r)$.]

Define $B_n(w) = \sum_{p=0}^{\infty} b(n, p)\,w^p$ and $B(z, w) = \sum_{n=0}^{\infty} z^n B_n(w)$. Then

$$B(z, w) = 1 + zB^2(zw, w).$$

This functional equation is asymmetric with respect to $z$ and $w$.

We want to study the number of trees in $\mathcal{T}_p$ (of a given path length $p$). Observe

$$|\mathcal{T}_p| = \sum_{n \ge 0} b(n, p) = [w^p]\,B(1, w).$$

We set $z = 1$ in the functional equation, leading to

$$B(1, w) = 1 + B^2(w, w),$$

which is not algebraically solvable.
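Although the functional equation has no closed-form solution, the recurrence for $b(n, p)$ is easy to run as a dynamic program. The sketch below tabulates $b(n, p)$ up to a chosen path length and sums over $n$ to get $|\mathcal{T}_p|$. It assumes $b(0, 0) = 1$ (the empty tree), which matches the constant term 1 in $B(z, w)$; the function names are illustrative.

```python
from math import comb

def tree_counts(p_max: int):
    """Table b[n][p]: number of binary trees with n nodes and path length p,
    for 0 <= p <= p_max and 0 <= n <= p_max + 1 (larger n cannot occur)."""
    n_max = p_max + 1
    b = [[0] * (p_max + 1) for _ in range(n_max + 1)]
    b[0][0] = 1  # the empty tree
    for n in range(1, n_max + 1):
        shift = n - 1  # each of the n - 1 non-root nodes sits one level deeper
        for k in range(n):  # k nodes in the left subtree, n - 1 - k in the right
            left, right = b[k], b[n - 1 - k]
            for r in range(p_max + 1 - shift):
                if left[r]:
                    for s in range(p_max + 1 - shift - r):
                        b[n][r + s + shift] += left[r] * right[s]
    return b

def trees_by_path_length(p_max: int):
    """|T_p| for p = 0..p_max, obtained by summing b(n, p) over n."""
    b = tree_counts(p_max)
    return [sum(row[p] for row in b) for p in range(p_max + 1)]

if __name__ == "__main__":
    print(trees_by_path_length(10))
```

Summing a row of the table over $p$ instead gives the Catalan number $C_n$, which is a convenient sanity check on the recurrence.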

Generalization: Knuth’s Problem

During the 10th Seminar on Analysis of Algorithms, MSRI, 2004, Knuth posed the problem of analyzing the left and the right path lengths in random binary trees.

Let $N(p, q; n)$ be the number of binary trees with $n$ nodes that have a total right path length $p$ and a total left path length $q$. Define

$$B_n(w, v) = \sum_{p}\sum_{q} N(p, q; n)\,w^p v^q$$

which satisfies the recurrence ($B_0(w, v) = 1$)

$$B_{n+1}(w, v) = \sum_{i=0}^{n} w^i v^{n-i}\,B_i(w, v)\,B_{n-i}(w, v), \qquad n \ge 0.$$

Thus the triple transform $B(w, v, z) = \sum_{n=0}^{\infty} B_n(w, v)\,z^n$ satisfies Knuth’s functional equation

$$B(w, v, z) = 1 + zB(w, v, wz)\,B(w, v, vz).$$

Catalan Numbers and Uniform Model

1. Setting $w = 1$ we get

$$B(1, z) = 1 + zB^2(1, z)$$

which can be solved explicitly

$$B(1, z) = \left(1 - \sqrt{1 - 4z}\right)/(2z)$$

leading to the Catalan number $C_n = [z^n]\,B(1, z)$.

2. Set $w = v$ (and define $B(w, z) := B(w, w, z)$) to find

$$B(w, z) = 1 + zB^2(w, zw).$$

This describes the total path length $L_n$ in the $\mathcal{T}_n$-uniform model, that is,

$$P(L_n = p) = \frac{b(n, p)}{C_n}$$

where $b(n, p)$ is the number of trees with $n$ nodes and path length $p$.

Path Length Distribution

Louchard (1984) and Takacs (1991) show that

$$\frac{E[L_n^r]}{(2n^3)^{r/2}} \sim \frac{2\sqrt{\pi}}{\Gamma((3r - 1)/2)}\,w_r$$

where $w_r$ satisfies the following nonlinear recurrence ($w_0 = -1$) for $r \ge 1$

$$2w_r = (3r - 4)\,r\,w_{r-1} + \sum_{j=1}^{r-1} \binom{r}{j} w_j w_{r-j}$$

or, setting $c_r = w_r/r!$,

$$2c_r = (3r - 4)\,c_{r-1} + \sum_{j=1}^{r-1} c_j c_{r-j}.$$

In other words, the limiting distribution of the total path length satisfies

$$P\!\left(\frac{L_n}{\sqrt{2n^3}} \le x\right) \to W(x)$$

where $W(x)$ is the Airy distribution defined by its moments through $w_r$.
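Both recurrences are easy to iterate exactly with rational arithmetic, which also confirms that they describe the same sequence through $c_r = w_r/r!$. The sketch below is a minimal illustration; the function names are not from the talk.

```python
from fractions import Fraction
from math import comb, factorial

def w_sequence(r_max: int):
    """w_0..w_rmax from 2 w_r = (3r - 4) r w_{r-1} + sum_j C(r, j) w_j w_{r-j}, w_0 = -1."""
    w = [Fraction(-1)]
    for r in range(1, r_max + 1):
        conv = sum(comb(r, j) * w[j] * w[r - j] for j in range(1, r))
        w.append(((3 * r - 4) * r * w[r - 1] + conv) / 2)
    return w

def c_sequence(r_max: int):
    """c_0..c_rmax from 2 c_r = (3r - 4) c_{r-1} + sum_j c_j c_{r-j}, c_0 = -1."""
    c = [Fraction(-1)]
    for r in range(1, r_max + 1):
        conv = sum(c[j] * c[r - j] for j in range(1, r))
        c.append(((3 * r - 4) * c[r - 1] + conv) / 2)
    return c

if __name__ == "__main__":
    w, c = w_sequence(6), c_sequence(6)
    print([str(v) for v in w])
    print(all(w[r] == factorial(r) * c[r] for r in range(7)))
```

The first values are $w_1 = 1/2$, $w_2 = 5/4$ and $w_3 = 45/4$, and the identity $w_r = r!\,c_r$ holds term by term.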

Right Path Length, Area under Bernoulli Walk

3. Let us now set $v = 1$ in the triple transform equation. Then

$$B(w, z) = 1 + zB(w, wz)\,B(w, z)$$

while $B_n(w) = [z^n]\,B(w, z)$ satisfies

$$B_{n+1}(w) = \sum_{i=0}^{n} w^i\,B_i(w)\,B_{n-i}(w).$$

Observe that $B_n(w)$ is the generating function of the right path length $R_n$ in the $\mathcal{T}_n$ model.

It was also studied by Takacs, who analyzed the area $2nA_n$ under a Bernoulli excursion in $2n$ steps.

[Figure omitted: a Bernoulli excursion over $2n$ steps, with the area under the path marked $R_n$.]

$$P\!\left(\frac{A_n}{\sqrt{2n}} \le x\right) = P\!\left(\frac{2nA_n}{2n\sqrt{2n}} \le x\right) = P\!\left(\frac{R_n}{\sqrt{2n^3}} \le x\right) = P\!\left(\frac{L_n}{2\sqrt{2n^3}} \le x\right) = W(x)$$

where $W(x)$ is the Airy distribution.

Finally, it appears in the Kleitman-Winston conjecture.

Back to Types: Number of Trees with a Given Path Length

4. Setting $z = 1$ in the previous equation we arrive at

$$B(w, 1) = 1 + B^2(w, w).$$

Observe that $[w^p]\,B(w, 1)$ is the number of binary trees with path length $p$.

Seroussi (2004) and Knessl & W.S. (2004) prove that ($c_1$, $c_2$ are constants)

$$|\mathcal{T}_p| = \frac{1}{(\log_2 p)\sqrt{\pi p}}\; 2^{\frac{2p}{\log_2 p}\left(1 + c_1 \log^{-2/3} p + c_2 \log^{-1} p + O(\log^{-4/3} p)\right)}.$$

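When selecting a tree randomly from $\mathcal{T}_p$ we may define $N_p$, the number of nodes in the $\mathcal{T}_p$-model.

Surprisingly, we can prove that $N_p$ is asymptotically normal, that is,

$$\Pr\{N_p = n\} = \frac{b(n, p)}{\sum_{n=0}^{\infty} b(n, p)} \sim \frac{1}{\sqrt{2\pi \mathrm{Var}[N_p]}} \exp\!\left[-\frac{(n - E[N_p])^2}{2\,\mathrm{Var}[N_p]}\right]$$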

where

$$E[N_p] \sim \frac{p}{\log_2 p}, \qquad \mathrm{Var}[N_p] \sim \frac{p}{(\log_2 p)^{5/3}} \cdot \frac{(\log 2)\,A_0}{6\,(2^{1/3})}$$

where $A_0$ is a constant.

WKB Method – Open Problems

Knessl and W.S. use the so-called WKB method (heuristic).

The WKB method assumes that the solution, $B(\xi; n)$, to a functional equation has the following asymptotic form

$$B(\xi; n) \sim e^{n\varphi(\xi)}\left[A(\xi) + \frac{1}{n}A^{(1)}(\xi) + \frac{1}{n^2}A^{(2)}(\xi) + \cdots\right],$$

where $\varphi(\xi)$ and $A(\xi), A^{(1)}(\xi), \ldots$ are unknown functions.

These functions must be determined from the equation itself, often in conjunction with the asymptotic matching principle.

For example, for $w = 1 + a/n^{3/2}$ with $a = -Y^{3/2}$ we found that

$$B_n(w) \sim \frac{4^{n+1}}{n^{3/2}}\,(-a)\sum_{j=0}^{\infty} \exp\!\left(-|r_j|\,4^{1/3}\,Y\right)$$

where $r_j$ are the roots of the Airy function, $\mathrm{Ai}(r_j) = 0$.

Open Problems: The above two results concerning the size of $\mathcal{T}_p$ and the distribution of $N_p$ do not have analytic solutions.

Back to Knuth’s Problem

5. Let $D_n$ be a random variable representing the path difference between the left and the right paths in a binary tree.

We observe that the distribution of $D_n$ can be only characterized by moments. Janson 2006, Knessl, W.S., 2005, and Panholzer, 2006 proved that

$$\frac{E[D_n^{2m+2}]}{n^{5(m+1)/2}} \to \frac{(2m + 2)!\,\sqrt{\pi}\,\Delta_m}{\Gamma\!\left(\frac{5}{2}m + 2\right)}$$

where $\Delta_m$ satisfy the following nonlinear recurrence

$$\Delta_{m+1} = \frac{(5m + 6)(5m + 4)}{8}\,\Delta_m + \frac{1}{4}\sum_{\ell=0}^{m} \Delta_\ell\,\Delta_{m-\ell}, \qquad m \ge 0.$$

Again, a nonlinear recurrence for the coefficients at the normalized moments!

Page 3 of Flajolet’s Talk in Princeton’98

Distributions: Difference equations

For nonlinear parameters, esp. quadratic ones, decomposability implies a functional equation.

$$X_n = n + X_{\text{smaller}} + X'_{\text{smaller}}$$

Trie sort (Jacquet, Regnier)

$$F(z, w) = F\!\left(\frac{wz}{2}, w\right)^2 + az + b$$

Digital search and compression (J, Szpank, Louchard)

$$\partial_z F(z, w) = F\!\left(\frac{wz}{2}, w\right)^2$$

Quicksort (Hennequin + Regnier, Rosler)

$$\partial_z F(z, w) = F(wz, w)^2$$

In situ permutation (Knuth, Prodinger)

$$\partial_z F(z, w) = F(z, w) \cdot F(wz, q)$$

Linear probing hashing (Knuth, Fl-Poblete-Viola)

$$\partial_z F(z, w) = F(z, w) \cdot \frac{F(z, w) - wF(wz, q)}{1 - w}$$

Path length in trees (Louchard, Takacs)

$$F(z, w) = \frac{z}{1 - F(wz, w)}$$

Open Problem – A Conjecture

Consider problems characterized by a nonlinear differential-functional equation

$$A_1\,\partial_w B(w, z) + A_2\,B(w, z) = a(w, z) + b(w, z)\,B(w^{\alpha_1}z^{\beta_1}, w^{\alpha_2}z^{\beta_2})\,B(w^{\alpha_3}z^{\beta_3}, w^{\alpha_4}z^{\beta_4})$$

where $a(w, z)$, $b(w, z)$ are slowly growing functions, and $\alpha_i, \beta_i \in \{0, 1\}$.

Let $Z_n$ be a random variable such that for some $a_m \to \infty$

$$\frac{E[Z_n^m]}{a_n} \to c_m$$

where in general $c_m$ satisfies

$$c_{m+1} = \alpha_m + \beta_m c_m + \sum_{i=0}^{m-1} \gamma_i\,c_i\,c_{m-i}$$

with some initial conditions, and given $\alpha_m$, $\beta_m$ and $\gamma_m$.

Similar recurrences appear in quicksort, linear hashing, path length in binary trees, area under a Bernoulli walk, enumeration of trees with given path length, and many others.

A new class of distributions? Can we characterize it?

Analytic Information Theory

• In the 1997 Shannon Lecture Jacob Ziv presented compelling

arguments for “backing off” from first-order asymptotics in order to

predict the behavior of real systems with finite length description.

• To overcome these difficulties we propose replacing first-order analyses

by full asymptotic expansions and more accurate analyses (e.g., large

deviations, central limit laws).

• Following Knuth and Hadamard’s precept¹, we study information theory

problems using techniques of complex analysis such as generating

functions, combinatorial calculus, Rice’s formula, Mellin transform,

Fourier series, sequences distributed modulo 1, saddle point methods,

analytic poissonization and depoissonization, and singularity analysis.

• This program, which applies complex-analytic tools to information

theory, constitutes analytic information theory.

¹ The shortest path between two truths on the real line passes through the complex plane.

