A rate of convergence result for a universal D-semifaithful code

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 39, NO.3, MAY 1993 813

A Rate of Convergence Result for a Universal D-Semifaithful Code

Bin Yu and T. P. Speed

Abstract: The problem of optimal rate universal coding in the context of rate-distortion theory is considered. A D-semifaithful universal coding scheme for discrete memoryless sources is given. The main result is a refined covering lemma based on the random coding argument and the method of types. The average codelength of the code is shown to approach its lower bound, the rate-distortion function, at a rate $O(n^{-1}\log n)$, and this is conjectured to be optimal based on a result of Pilc. Issues of constructiveness and universality are also addressed.

Index Terms- Discrete memoryless source, rate-distortion, D-semifaithful, universal coding, optimal rate, random coding, method of types.

I. INTRODUCTION

ENTROPY has a central position in information theory, in part because in the limit it gives the shortest possible per-symbol average length of a noiseless code. If we consider a discrete memoryless source with distribution $P_0$, the entropy $H(P_0)$ serves as a nonasymptotic lower bound to the average expected codelength for data strings from this source. Moreover, the entropy lower bound can be achieved asymptotically at the rate $O(n^{-1})$ when the source distribution $P_0$ is known, and at the rate $O(n^{-1}\log n)$ when $P_0$ is not known.

Rissanen [19] improved the entropy lower bound by showing that entropy $+ \tfrac{1}{2} k n^{-1}\log n$ is an asymptotic lower bound to the average expected codelength. His bound holds for data strings from parametric statistical models satisfying mild regularity conditions, and the $k$ in the lower bound is the dimension of the model. Discrete memoryless sources are covered by his result, with $k$ there being the cardinality of the source alphabet minus one, and the rate $O(n^{-1}\log n)$ is optimal in this case, when $P_0$ is not known. The rate $O(n^{-1}\log n)$ has been shown to be achievable for various other statistical models; see, for example, Davisson [8], Rissanen [19], [20], Hannan and Kavalieris [10], Hemerly and Davis [11], Gerencser and Rissanen [9], Clarke and Barron [5], and Weinberger, Lempel, and Ziv [27]. Extensions to nonparametric models can be found in Barron and Cover [1], Rissanen, Speed, and Yu [21], and Yu and Speed [28].

Rate-distortion theory was started by Shannon [23], and in that context we consider block-codes with a fidelity criterion,

Manuscript received June 10, 1991; revised August 17, 1992. This work was supported in part by the Wisconsin Alumni Research Foundation and the Army Research Office Grant DAAL03-91-G-007, and the National Science Foundation Grant DMS 8802378.

B. Yu is with the Department of Statistics, University of Wisconsin, Madison, WI 53706.

T. P. Speed is with the Department of Statistics, University of California, Berkeley, CA 94720.

IEEE Log Number 9206227.

or semifaithful codes, to use the term from a recent paper of Ornstein and Shields [15]. Instead of the expected codelength used in noiseless coding, it is natural in rate-distortion theory to consider the log of the number of D-balls required to cover the n-tuple space of the source alphabet under some single-letter distance measure. The role of entropy in noiseless coding is then taken by the rate-distortion function, in the following sense: the rate-distortion function gives a lower bound to the log of the covering number, which we may also refer to as the expected codelength of a D-semifaithful code, and this lower bound can be achieved in the limit by certain D-semifaithful codes. In particular, Ornstein and Shields [15] obtain D-semifaithful codes which achieve the rate-distortion function lower bound almost surely, for ergodic sequences, and Shields [22] uses Markov types for similar results. Earlier work for other classes of sources includes Neuhoff, Gray, and Davisson [14], Mackenthun and Pursley [13], and Kieffer [12]. In the case of memoryless sources, the achievability proof can be found in standard texts; see, for example, Cover and Thomas [7] for a recent exposition using the random coding argument. However, no results have yet been provided on the rate at which this lower bound is approached.

In this paper, we describe a D-semifaithful universal coding scheme for memoryless sources and obtain an associated rate result. We show, for a discrete memoryless source with a source alphabet of $J$ elements and an unknown distribution $P_0$, that under some mild smoothness conditions on the rate-distortion function, a universal D-semifaithful code can be constructed such that the average expected length of this code tends to the rate-distortion function at the rate $n^{-1}\log n$. The techniques used are the method of types and random coding. The main result will be based on a refined covering lemma (Theorem 1) for type classes. It is "refined" because it improves the $o(1)$ term in the covering lemma in Csiszar and Korner [6] to an $O(n^{-1}\log n)$ term. In other words, we are able to give a better upper bound on (the log of) the number of D-balls needed to cover a type class, equivalently, on the number of D-semifaithful code words required to encode a type class. Then a two-stage code is constructed as the D-semifaithful code for all strings: first we encode the type class, and next we encode the elements of each type class using the refined covering lemma. The above results are contained in Section II.

In Section III, we conjecture that the rate $n^{-1}\log n$ is asymptotically optimal. Our conjecture is based on a result of Pilc [16], [17], which is expressed in terms of the inverse of the rate-distortion function: the distortion-rate function. Pilc has upper bounds and lower bounds for noiseless channels and

0018-9448/93$03.00 © 1993 IEEE


for noisy channels, but we use his results only for noiseless channels. Unfortunately, although the rate $n^{-1}\log n$ in the upper bound of our two-stage code matches that in Pilc's lower bound, his lower bound is on the log cardinality of an expected D-semifaithful code (cf. the forthcoming definition) while our code is pointwise D-semifaithful with an upper bound on the expected codelength. Hence, we do not know at this stage if the rate $n^{-1}\log n$ is indeed optimal in terms of expected codelength. Moreover, his bound does not include Rissanen's since it holds only for nonzero distortion levels.

In Section IV, we compare our code with the code corresponding to Pilc's upper bound. The main point made there is that our code is universal, while the other one is not. In addition, the issue of construction versus pure existence is addressed in relation to our code and the one corresponding to Pilc's upper bound.

We start with some preliminaries on rate-distortion theory and the method of types. Our main reference on rate-distortion theory is Berger [2], and that on the method of types is Csiszar and Korner [6].

II. PRELIMINARIES

Let $A_0 = \{1, 2, \ldots, J\}$ be the source alphabet, and let $B_0 = \{1, 2, \ldots, K\}$ be the reproducing alphabet. $B_0$ could be the same as or a subset of $A_0$. We assume our source is memoryless, i.e., that the letters $x_1, \ldots, x_n$ which make up our strings are mutually independent and identically distributed (i.i.d.) with distribution $P_0$ on $A_0$. Without loss of generality we assume $P_0(j) > 0$ for all $j \in A_0$. We use a single-letter fidelity criterion to measure the distortion between any $n$th-order source string $x^n = (x_1, \ldots, x_n) \in A_0^n$ and its code word $y^n \in B_0^n$. More precisely, let

$$d_n(x^n, y^n) = n^{-1} \sum_{t=1}^{n} d(x_t, y_t),$$

where $d$ is a bounded real nonnegative function on $A_0 \times B_0$, with maximum $d_M$ and minimum $d_m$. Then the rate-distortion function $R_n(P_0, D)$ for the distribution of $x_1, \ldots, x_n$ equals $nR(P_0, D)$, where the rate-distortion function $R(P_0, D)$ of $P_0$ can be formally defined as follows:

$$R(P_0, D) = \min_{W} I(W, P_0) = \min_{W} \sum_{j=1}^{J} \sum_{k=1}^{K} P_0(j) W(k|j) \log \frac{W(k|j)}{Q(k)},$$

where the minimum is taken over the set of matrices $W$ from $A_0$ to $B_0$ such that $W(k|j) \ge 0$ for any $j, k$, $\sum_{k=1}^{K} W(k|j) = 1$ for all $j$,

$$\sum_{j=1}^{J} \sum_{k=1}^{K} P_0(j) W(k|j) d(j, k) \le D,$$

and $Q$ is the marginal distribution on $B_0$ induced by $P_0$ and $W$, i.e., for $k \in B_0$,

$$Q(k) = \sum_{j=1}^{J} P_0(j) W(k|j).$$
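The two quantities inside the minimization, the mutual information $I(W, P)$ and the expected distortion of a test channel $W$, can be evaluated directly from their definitions. The following is a minimal Python sketch, not taken from the paper: the function names, the use of natural logarithms, and the example (a uniform binary source with Hamming distortion and a binary symmetric test channel) are all illustrative assumptions.

```python
import math

def mutual_information(P, W):
    """I(W, P) in nats; P[j] is the source law, W[j][k] stands for W(k|j)."""
    K = len(W[0])
    # Output marginal Q(k) = sum_j P(j) W(k|j)
    Q = [sum(P[j] * W[j][k] for j in range(len(P))) for k in range(K)]
    total = 0.0
    for j, pj in enumerate(P):
        for k in range(K):
            w = W[j][k]
            if pj > 0 and w > 0:  # terms with zero mass contribute nothing
                total += pj * w * math.log(w / Q[k])
    return total

def expected_distortion(P, W, d):
    """sum_{j,k} P(j) W(k|j) d(j,k) for a distortion matrix d[j][k]."""
    return sum(P[j] * W[j][k] * d[j][k]
               for j in range(len(P)) for k in range(len(W[0])))

# Hypothetical example: uniform binary source, Hamming distortion,
# binary symmetric test channel with crossover probability D.
P = [0.5, 0.5]
D = 0.1
W = [[1 - D, D], [D, 1 - D]]
d = [[0, 1], [1, 0]]
rate = mutual_information(P, W)
dist = expected_distortion(P, W, d)
```

For this symmetric example the test channel meets the distortion constraint with equality, and its mutual information equals $\log 2 - H(D)$ in nats, which is the known value of the rate-distortion function for a uniform binary source.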


The following properties of $R(P, D)$ can be found in Berger [2].

1) $R(P, \cdot)$ is convex and monotonically decreasing on $[0, D_{\max}]$, where $D_{\max} = \min_k \sum_{j=1}^{J} P(j) d(j, k)$. Moreover, for $D \ge D_{\max}$, $R(P, D) = 0$, and $R(P, 0) = H(P) = -\sum_{j=1}^{J} P(j) \log P(j)$. Hence $R'_D(P, D) \le 0$ for any $D$, where $'$ denotes differentiation with respect to $D$.

2) If $I(W, P) = R(P, D)$, then for any $j$ and $k$:

$$\frac{\partial I(W, P)}{\partial W(k|j)} \Big|_{W} = P(j) \log \frac{W(k|j)}{Q(k)},$$

where for any $j, k$:

$$W(k|j) = \frac{Q(k) e^{s d(j,k)}}{\sum_{\ell} Q(\ell) e^{s d(j,\ell)}},$$

with $s = R'_D(P, D) \le 0$.

Definition (D-semifaithful code): A map $M_n : A_0^n \to B_0^n$ is called a pointwise D-semifaithful code if for any $x^n \in A_0^n$,

$$d_n(x^n, M_n(x^n)) \le D.$$

Similarly, a map $M_n$ is called expected D-semifaithful with respect to a source distribution $P_0$ if, whenever $(x_1, \ldots, x_n)$ are i.i.d. with common distribution $P_0$,

$$E_{P_0} d_n(x^n, M_n(x^n)) \le D.$$

Since our main argument will be based on the method of types, we next introduce the definition of type and some of its properties. We will follow the notation of Csiszar and Korner [6], with $1\{A\}$ denoting the indicator of the event $A$ and $\triangleq$ meaning equal by definition.

Definition (Type): The type of a sequence $x^n \in A_0^n$ is the distribution $P_{x^n}$ on $A_0$ defined for $j \in A_0$ by

$$P_{x^n}(j) \triangleq n^{-1} N(j|x^n) \triangleq n^{-1} \sum_{t=1}^{n} 1\{x_t = j\},$$

that is, the empirical distribution of $x^n$ on $A_0$. We write $T_P^n = \{x^n : x^n \text{ has type } P\}$ for any $P$ on $A_0$ such that $\{nP(j)\}$ are integers.

For any given $x^n \in A_0^n$ and a stochastic matrix $W : A_0 \to B_0$, we next define conditional types.

Definition (Conditional type): The conditional type $W$ of a sequence $y^n \in B_0^n$ given $x^n \in A_0^n$ is defined for $j \in A_0$, $k \in B_0$ by

$$N(j|x^n) W(k|j) \triangleq N(j, k|x^n, y^n) \triangleq \sum_{t=1}^{n} 1\{x_t = j \text{ and } y_t = k\}.$$

We denote the set of sequences $y^n \in B_0^n$ having the conditional type $W$ given $x^n$ by $T_W(x^n)$.

The cardinality of a type class, or a conditional type class can be bounded above and below as in the following results from Csiszar and Korner [6].

Lemma 1: For any type $P$ of sequences in $A_0^n$,

$$(n + 1)^{-J} \exp(nH(P)) \le |T_P^n| \le \exp(nH(P)). \qquad (1.1)$$


Lemma 2: For every $x^n \in A_0^n$ and stochastic matrix $V : A_0 \to B_0$ such that $T_V(x^n)$ is nonempty,

$$(n+1)^{-JK} \exp\{nH(V|P_{x^n})\} \le |T_V(x^n)| \le \exp\{nH(V|P_{x^n})\}, \qquad (1.2)$$

where $H(V|P) = \sum_{j=1}^{J} P(j) H(V(\cdot|j)) = -\sum_{j=1}^{J} P(j) \sum_{k=1}^{K} V(k|j) \log V(k|j)$.

Lemma 3: The total number of type classes is at most $(n + 1)^J$.
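For very small $n$ the type classes can be enumerated by brute force, which gives a quick sanity check of Lemma 1 and Lemma 3. The sketch below is illustrative only: the ternary alphabet, the block length $n = 6$, and all function names are assumptions made for the example, and types are stored as integer count vectors $(nP(1), \ldots, nP(J))$.

```python
import itertools
import math
from collections import Counter

def type_counts(seq, alphabet):
    """Return the type of seq as integer counts (n*P(j) for each letter j)."""
    counts = Counter(seq)
    return tuple(counts[a] for a in alphabet)

def entropy(counts):
    """Entropy in nats of the type with the given integer counts."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

alphabet = (1, 2, 3)  # J = 3 (hypothetical small alphabet)
n = 6
J = len(alphabet)

# Size of every type class, found by listing all J**n sequences.
classes = Counter(type_counts(x, alphabet)
                  for x in itertools.product(alphabet, repeat=n))

num_types = len(classes)
lemma3_ok = num_types <= (n + 1) ** J
# Lemma 1: (n+1)^(-J) exp(nH(P)) <= |T_P| <= exp(nH(P)),
# with a tiny tolerance for floating-point rounding at the upper bound.
lemma1_ok = all(
    (n + 1) ** (-J) * math.exp(n * entropy(c)) <= size
    <= math.exp(n * entropy(c)) * (1 + 1e-12)
    for c, size in classes.items()
)
```

The enumeration costs $J^n$ steps, so this is only a check for toy sizes and not a practical way to work with types.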

III. A UNIVERSAL POINTWISE D-SEMIFAITHFUL CODE

In this section, we first use the random coding argument in Csiszar and Korner [6] to prove a refined covering lemma (Theorem 1). Then, we go on to give a two-stage universal D-semifaithful coding scheme (Theorem 2) with the rate $n^{-1}\log n$. We begin with a proposition extracted from Csiszar and Korner [6].

For a given type $P$ on $A_0$, positive constant $D$ in $(0, D_{\max})$, and a subset $B$ of $B_0^n$, write

$$U_D(B) = \{x^n \in T_P^n : d_n(x^n, B) > D\},$$

where $d_n(x^n, B) := \min_{y^n \in B} d_n(x^n, y^n)$.

Proposition 1: Suppose $Z^{(m)} = \{Z_1, \ldots, Z_m\}$ are i.i.d. and uniform over a subset $G \subset B_0^n$. If for some $m_n$ we have $E|U_D(Z^{(m_n)})| < 1$, then there is a set $B_{P,D}$ such that $|B_{P,D}| \le m_n$ and $|U_D(B_{P,D})| < 1$. This implies that $U_D(B_{P,D}) = \emptyset$, in other words, that $B_{P,D}$ "covers" $T_P^n$ within distance $D$.

See Csiszar and Korner [6] for the proof. Moreover, note that

$$|U_D(Z^{(m)})| = \sum_{x^n \in T_P^n} 1\{U_D(Z^{(m)})\}(x^n),$$

implying

$$E|U_D(Z^{(m)})| = \sum_{x^n \in T_P^n} \Pr(x^n \in U_D(Z^{(m)})). \qquad (2.1)$$

For any fixed $x^n \in T_P^n$, because the $Z$ are i.i.d.,

$$\Pr(x^n \in U_D(Z^{(m)})) = [\Pr(d_n(x^n, Z) > D)]^m, \qquad (2.2)$$

where $Z$ is uniformly distributed over $G$. Furthermore, if we can find a subset $G_1(x^n) \subset G$ such that for any $y^n \in G_1(x^n)$ we have $d_n(x^n, y^n) \le D$, then

$$\Pr(d_n(x^n, Z) > D) = 1 - \Pr(d_n(x^n, Z) \le D) \le 1 - \Pr(Z \in G_1(x^n)) = 1 - |G_1(x^n)|/|G|. \qquad (2.3)$$

Combining (2.1), (2.2), and (2.3), we get

$$E|U_D(Z^{(m)})| \le \sum_{x^n \in T_P^n} \left(1 - \frac{|G_1(x^n)|}{|G|}\right)^m \le \sum_{x^n \in T_P^n} \exp\left(-\frac{|G_1(x^n)|}{|G|}\, m\right). \qquad (2.4)$$

The last inequality holds because $(1 - t)^m \le \exp(-tm)$ for any $t > 0$. Next we choose a conditional type class as $G_1(x^n)$, and a type class as $G$. For the chosen $G_1(x^n)$ and $G$, we select $m_n$ using (2.4) such that $E|U_D(Z^{(m_n)})| < 1$. For any type $P$ and constant $D$ in $(0, D_{\max})$, take $W$ such that

$$\sum_{j,k} W(k|j) P(j) d(j, k) \le D^*,$$

and $I(W, P) = R(P, D^*)$, where $D^* = D - n^{-1}JKd_M$. Note that this $W$ depends on both $P$ and $D^*$, but for simplicity, we do not indicate this dependence in any way. Because the $nP(j)$ are integers, we can find a stochastic matrix $[W]$, a truncation of $W$, such that for all $j$ and $k$, $n[W](k|j)P(j)$ are integers, and

$$|W(k|j) - [W](k|j)| \le \frac{1}{nP(j)}, \quad \text{for } j = 1, \ldots, J.$$

Let $[Q] = [W] \cdot P$, i.e., $[Q](k) = \sum_{j=1}^{J} [W](k|j)P(j)$. Then, the $n[Q](k)$ are also integers. Therefore, the type class $T_{[Q]}$ and the conditional type class $T_{[W]}(x^n)$ are well defined for all $x^n \in T_P^n$.

Let us take $G = T_{[Q]}$ and $G_1(x^n) = T_{[W]}(x^n)$. Then, for any $y^n \in G_1(x^n)$,

$$N(j, k|x^n, y^n) = [W](k|j) N(j|x^n) = [W](k|j) P(j) n,$$

and

$$N(k|y^n) = \sum_{j} N(j, k|x^n, y^n) = \sum_{j} [W](k|j) P(j) n = [Q](k) n,$$

that is, $y^n \in G = T_{[Q]}$. Hence, $G_1(x^n) \subseteq G$. In addition, for any $y^n \in G_1(x^n)$, since $D^* = D - n^{-1}JKd_M$,

$$d_n(x^n, y^n) = \sum_{j,k} [W](k|j) P(j) d(j, k) \le \sum_{j,k} \left(W(k|j) P(j) + n^{-1}\right) d(j, k) \le D^* + n^{-1}JKd_M = D.$$

Recalling the bounds in (1.1) and (1.2), we have

$$|G| \le \exp(nH([Q])), \qquad |G_1(x^n)| \ge (n + 1)^{-JK} \exp\{nH([W]|P)\}.$$

Thus, since $I([W], P) = H([Q]) - H([W]|P)$,

$$-\frac{|G_1(x^n)|}{|G|} \le -(n + 1)^{-JK} \exp(-nI([W], P)). \qquad (2.5)$$
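The paper does not say how the truncation $[W]$ is produced. One simple way, offered here purely as an assumption, is largest-remainder rounding applied row by row to the joint counts $nP(j)W(k|j)$: each count is rounded down, and the leftover units in a row go to the entries with the largest fractional parts. Every count then moves by less than one, which gives $|W(k|j) - [W](k|j)| < 1/(nP(j))$ with integer $n[W](k|j)P(j)$. The function name and the example numbers below are hypothetical.

```python
import math
from fractions import Fraction

def truncate_channel(W, counts):
    """Integer joint counts N[j][k] ~ counts[j] * W[j][k] by largest-remainder rounding.

    counts[j] plays the role of n*P(j); row j of the result sums to counts[j]
    and each entry differs from counts[j]*W[j][k] by less than 1.
    The truncated channel is then [W](k|j) = N[j][k] / counts[j].
    """
    N = []
    for row, nj in zip(W, counts):
        exact = [nj * w for w in row]
        base = [math.floor(e) for e in exact]
        leftover = nj - sum(base)  # units still to be placed in this row
        # Give the leftover units to the entries with the largest fractional parts.
        order = sorted(range(len(row)),
                       key=lambda k: exact[k] - base[k], reverse=True)
        for k in order[:leftover]:
            base[k] += 1
        N.append(base)
    return N

# Hypothetical example with n = 12 and type counts (5, 7); exact rationals
# are used so that floor() is not disturbed by floating-point rounding.
W = [[Fraction(7, 10), Fraction(3, 10)],
     [Fraction(1, 3), Fraction(2, 3)]]
counts = (5, 7)
N = truncate_channel(W, counts)
Q_counts = [sum(N[j][k] for j in range(len(counts))) for k in range(2)]  # n*[Q](k)
```

The column sums `Q_counts` are the integers $n[Q](k)$, so $[Q]$ is itself a type, as the argument requires.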


Putting (2.5) into (2.4), we find

$$E|U_D(Z^{(m)})| \le \sum_{x^n \in T_P^n} \exp\{-(n+1)^{-JK} \exp(-nI([W], P))\, m\} = |T_P^n| \exp\{-(n+1)^{-JK} \exp(-nI([W], P))\, m\}.$$

Finally, we get by Lemma 1:

$$E|U_D(Z^{(m)})| \le \exp(nH(P)) \exp\{-(n+1)^{-JK} \exp(-nI([W], P))\, m\}. \qquad (2.6)$$

Now we can choose $m = m_n$ as an integer such that

$$\exp\{nI([W], P) + (JK + 2) \log(n + 1)\} \le m_n \le \exp\{nI([W], P) + (JK + 4) \log(n + 1)\}.$$

Then for such an $m_n$, (2.6) gives

$$E|U_D(Z^{(m_n)})| \le \exp(n \log J) \exp(-(n + 1)^2) < 1, \quad \text{for } n \text{ large}.$$
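The last step can be spelled out as follows. This is a short worked check that uses only the lower bound on $m_n$ and the elementary inequality $H(P) \le \log J$:

```latex
\begin{align*}
(n+1)^{-JK} e^{-nI([W],P)}\, m_n
  &\ge (n+1)^{-JK} e^{-nI([W],P)} \cdot e^{nI([W],P)} (n+1)^{JK+2}
   = (n+1)^2, \\
E\,|U_D(Z^{(m_n)})|
  &\le e^{nH(P)}\, e^{-(n+1)^2}
   \le e^{\,n \log J - (n+1)^2},
\end{align*}
```

and the right-hand side is below 1 as soon as $(n+1)^2 > n \log J$. The upper bound on $m_n$ is not used here; it only controls the size of the codebook in Theorem 1.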

Applying Proposition 1, we obtain the following theorem.

Theorem 1 (Refined Covering Lemma for Type Classes): Given a type $P$ on $A_0$ and $D$ in $(0, D_{\max})$, there is a subset $B_{P,D} \subset B_0^n$ such that for any $x^n \in T_P^n$, $d_n(x^n, B_{P,D}) \le D$ and

$$|B_{P,D}| \le \exp\{nI([W], P) + (JK + 4) \log(n + 1)\},$$

where for any $j, k$,

$$|[W](k|j) - W(k|j)| \le \frac{1}{nP(j)},$$

and $I(W, P) = R(P, D^*)$ for $D^* = D - n^{-1}JKd_M$.

Next, we show that we can replace $[W]$ by $W$ in Theorem 1. Since $[W]$ is close to $W$, and $D^*$ is close to $D$, we expect $I([W], P)$ to be close to $I(W, P) = R(P, D^*)$, hence close to $R(P, D)$. Formally, we expand $I([W], P)$ around $I(W, P)$ as follows:

$$I([W], P) = I(W, P) + \sum_{j=1}^{J} \sum_{k=1}^{K-1} \frac{\partial I}{\partial W(k|j)}(\cdot, P)\Big|_{W(k|j)} \big([W](k|j) - W(k|j)\big) + \cdots, \qquad (2.7)$$

where $(\cdots)$ denotes smaller order terms. Since $I(W, P) = R(P, D^*)$, by property 2) in Section I, for any $k, j$,

$$\frac{\partial I(\cdot, P)}{\partial W(k|j)}\Big|_{W} = P(j) \log \frac{W(k|j)}{Q(k)}.$$

Note that for any $j, k$, $|[W](k|j) - W(k|j)| \le (nP(j))^{-1}$, so we have from (2.7):

$$I([W], P) \le I(W, P) + n^{-1} \left\{ \sum_{j=1}^{J} \sum_{k=1}^{K-1} \left| \log \frac{W(k|j)}{Q(k)} \right| \right\} + o(n^{-1}).$$


However, again by property 2) in Section I, for all $j, k$ we have

$$W(k|j) = \frac{Q(k) e^{s d(j,k)}}{\sum_{\ell} Q(\ell) e^{s d(j,\ell)}},$$

where $s = R'_D(P, D^*) < 0$. Hence,

$$\frac{W(k|j)}{Q(k)} = \frac{e^{s d(j,k)}}{\sum_{\ell} Q(\ell) e^{s d(j,\ell)}} \le \frac{e^{-|s| d_m}}{\sum_{\ell} Q(\ell) e^{-|s| d_M}} = e^{|s|(d_M - d_m)}.$$

Similarly $W(k|j)/Q(k) \ge e^{-|s|(d_M - d_m)}$, and hence,

$$\left| \log \frac{W(k|j)}{Q(k)} \right| \le |s|(d_M - d_m).$$

Without loss of generality, assume $d_m = 0$. We then get

$$I([W], P) \le I(W, P) + n^{-1}JK|s|d_M + o(n^{-1}) = R(P, D^*) + n^{-1}JK|s|d_M + o(n^{-1}).$$

On the other hand,

$$I(W, P) = R(P, D^*) = R(P, D) + R'_D(P, D)(D^* - D) + \cdots = R(P, D) + |s|JKd_M n^{-1} + \cdots.$$

Thus,

$$I([W], P) \le R(P, D) + 2JK|s|d_M n^{-1} + o(n^{-1}), \qquad (2.8)$$

where $s = R'_D(P, D)$. We have proved the following.

Corollary 1: Under the assumptions of Theorem 1,

$$\log|B_{P,D}| \le nR(P, D) + 2JK|s|d_M + o(1) + (KJ + 4) \log(n + 1).$$

Theorem 2 (Universal Pointwise D-Semifaithful Coding): Let $\mathcal{F}$ be a class of distributions on $A_0$ such that for some $D \in (0, D_{\max})$, the derivatives $\{\partial^2 R(P, D)/\partial P_j \partial P_{j'} : j, j' = 1, \ldots, J\}$ are uniformly bounded over $\mathcal{F}$ by a constant $C$, and $|E_{P_0} R'_D(P_{x^n}, D)| < \infty$ for all $P_0 \in \mathcal{F}$. Then there exists a two-stage code $M_n : A_0^n \to B_0^n$ such that

$$d_n(x^n, M_n(x^n)) \le D \quad \text{for all } x^n \in A_0^n,$$

and for all $P_0 \in \mathcal{F}$, as $n \to \infty$,

$$n^{-1} E_{P_0} L(M_n(x^n)) \le R(P_0, D) + (KJ + J + 4) n^{-1} \log(n + 1) + O(n^{-1}).$$

Proof of Theorem 2: For any $x^n \in A_0^n$, our coding scheme has two stages. First, we encode $P_{x^n}$, which by Lemma 3 takes at most $J \log(n + 1)$ bits. Next, we use Corollary 1, which asserts that for the type class $T^n_{P_{x^n}}$ there is a $B_{P_{x^n},D}$ which covers $T^n_{P_{x^n}}$ with radius $D$. We then take $M_n : T^n_{P_{x^n}} \to B_{P_{x^n},D}$, where $M_n(x^n) = y^n \in B_{P_{x^n},D}$ is such that $d_n(x^n, y^n) \le D$. This takes at most $\log|B_{P_{x^n},D}|$ bits, which, since $R'_D < 0$, is bounded by

$$nR(P_{x^n}, D) + (KJ + 4) \log(n + 1) - 2KJ d_M R'_D(P_{x^n}, D) + o(1).$$


For simplicity, denote $P_{x^n}$ by $\hat{P}$. Taking the expectation of this last expression gives a bound on the expected codelength of

$$E_{P_0} R(\hat{P}, D) + (KJ + 4) n^{-1} \log(n + 1) - 2KJ d_M E_{P_0} R'_D(\hat{P}, D) n^{-1} + o(n^{-1}).$$

To prove the theorem, it suffices to show

$$E_{P_0} R(\hat{P}, D) \le R(P_0, D) + O(n^{-1}), \qquad (2.9)$$

because $|E_{P_0} R'_D(\hat{P}, D)| < \infty$ by assumption. The idea to show (2.9) is simply a Taylor expansion of $R(P, D)$ around $P = P_0$, but because the Taylor expansion holds only in an $o(1)$ neighborhood of $P_0$, some effort has to be made to give a rigorous proof.

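We split the set of types into two disjoint subsets:

$$\Omega_n = \{P : |P(j) - P_0(j)| \le n^{-1/2} \log n, \text{ for all } j \in A_0\},$$

and

$$\Omega_n^c = \{P : |P(j) - P_0(j)| > n^{-1/2} \log n, \text{ for some } j \in A_0\},$$

and we break the expectation $E_{P_0} R(\hat{P}, D)$ up similarly, defining

$$E_1 = \sum_{P \in \Omega_n} \mathbb{P}_0(x^n \in T_P^n) R(P, D), \qquad E_2 = \sum_{P \in \Omega_n^c} \mathbb{P}_0(x^n \in T_P^n) R(P, D),$$

where $\mathbb{P}_0$ denotes the probability measure on sequences defined by $P_0$. Before we go further, we need a good bound on $\sum_{P \in \Omega_n^c} \mathbb{P}_0(x^n \in T_P^n)$. Hoeffding's inequality (cf. Pollard [18, p. 191]) implies that for all $j$,

$$\mathbb{P}_0(x^n : |P_{x^n}(j) - P_0(j)| > n^{-1/2} \log n) \le \exp(-2[\log n]^2 \cdot n/4n) = \exp(-\tfrac{1}{2}(\log n)^2).$$

As a result,

$$\sum_{P \in \Omega_n^c} \mathbb{P}_0(x^n \in T_P^n) \le \sum_{j=1}^{J} \mathbb{P}_0(x^n : |P_{x^n}(j) - P_0(j)| > n^{-1/2} \log n) \le J \exp(-\tfrac{1}{2}(\log n)^2) = J n^{-\frac{1}{2} \log n}. \qquad (2.10)$$

Then (2.10) and the inequality $R(P, D) \le \log J$ together yield

$$E_2 = \sum_{P \in \Omega_n^c} \mathbb{P}_0(x^n \in T_P^n) R(P, D) \le (\log J) J n^{-\frac{1}{2} \log n} = O(n^{-1}), \quad \text{for } n \text{ large}.$$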

On $\Omega_n$, we can expand $R(P, D)$ as

$$R(P, D) = R(P_0, D) + \sum_{j=1}^{J} \frac{\partial R}{\partial P_j}(P_0, D) \big(P(j) - P_0(j)\big) + \frac{1}{2} \sum_{j, j'} \big(P(j') - P_0(j')\big) \frac{\partial^2 R}{\partial P_j \partial P_{j'}}(P_0^*, D) \big(P(j) - P_0(j)\big),$$

where $P_0^*$ is in between $P_0$ and $P$. Because the partial derivatives around $P_0$ are bounded by a constant $C$, the third term on the right is bounded by

$$2JC \sum_{j=1}^{J} \big(P(j) - P_0(j)\big)^2,$$

and so its expectation is $O(n^{-1})$ by a known result concerning the multinomial variance.

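Moreover, the fact that $E_{P_0}(\hat{P}(j) - P_0(j)) = 0$ for all $j$ implies that

$$\Big| \sum_{P \in \Omega_n} \mathbb{P}_0(x^n \in T_P^n)\big(P(j) - P_0(j)\big) \Big| = \Big| \sum_{P \in \Omega_n^c} \mathbb{P}_0(x^n \in T_P^n)\big(P(j) - P_0(j)\big) \Big| \le 2 \sum_{P \in \Omega_n^c} \mathbb{P}_0(x^n \in T_P^n) = O(n^{-1}). \qquad (2.11)$$

Similarly, we can show

$$\sum_{P \in \Omega_n} \mathbb{P}_0(x^n \in T_P^n) R(P_0, D) = R(P_0, D) + O(n^{-1}). \qquad (2.12)$$

Hence,

$$E_{P_0} R(\hat{P}, D) \le R(P_0, D) + O(n^{-1}).$$

This completes the proof that

$$n^{-1} E_{P_0} L(M_n(x^n)) \le R(P_0, D) + (KJ + J + 4) n^{-1} \log(n + 1) + O(n^{-1}). \qquad \Box$$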

Remark: Note that it is very easy to check that the boundedness conditions on the derivatives of the rate-distortion function are satisfied by a Bernoulli($p$) source with distortion measured by Hamming distance. In that case, the rate-distortion function is known, cf. Cover and Thomas [7], to be

$$R(p, D) = \begin{cases} H(p) - H(D), & \text{if } 0 \le D \le \min(p, 1 - p), \\ 0, & \text{if } D > \min(p, 1 - p). \end{cases}$$

Then, for $0 < D < \min(p, 1 - p)$,

$$R'_D(p, D) = -\log \frac{1 - D}{D}$$

and

$$\frac{\partial^2}{\partial p^2} R(p, D) = -\frac{1}{p(1 - p)}.$$


It is clear that the boundedness conditions are satisfied in this case if we choose $\mathcal{F}$ as the set of binary distributions with uniform bounds on $p$ away from both 0 and 1. In general, if those derivatives exist, they are likely to be bounded. Thus, our conditions do not appear to be too stringent.
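The closed forms for the Bernoulli case are easy to check numerically. The following sketch is an illustration rather than part of the paper: it works in nats, compares the two derivative formulas against central finite differences at one hypothetical point $(p, D) = (0.3, 0.1)$, and all function names are assumptions.

```python
import math

def h(x):
    """Binary entropy in nats."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

def rate_distortion_bernoulli(p, D):
    """R(p, D) for a Bernoulli(p) source under Hamming distortion."""
    return h(p) - h(D) if 0 <= D <= min(p, 1 - p) else 0.0

def dR_dD(D):
    """Closed form of the derivative in D: -log((1 - D) / D)."""
    return -math.log((1 - D) / D)

def d2R_dp2(p):
    """Closed form of the second derivative in p: -1 / (p (1 - p))."""
    return -1.0 / (p * (1 - p))

p, D, eps = 0.3, 0.1, 1e-4
R = rate_distortion_bernoulli
# Central finite differences as an independent check of the closed forms.
num_dD = (R(p, D + eps) - R(p, D - eps)) / (2 * eps)
num_d2p = (R(p + eps, D) - 2 * R(p, D) + R(p - eps, D)) / eps ** 2
```

The second derivative blows up as $p$ approaches 0 or 1, which is why the class of distributions is taken with $p$ bounded away from both endpoints.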

IV. LOWER BOUND

For a string of i.i.d. discrete source letters from $A_0$, we have shown in the last section that under some smoothness conditions there is a pointwise D-semifaithful code with its average expected codelength tending to $R(P_0, D)$ at the rate $n^{-1}\log n$. Recalling that $R(P_0, D)$ is a lower bound on the average codelength of such D-semifaithful codes, we may ask: is the rate $n^{-1}\log n$ the best possible?

Unfortunately, we have not been able to show that $n^{-1}\log n$ is the optimal rate, though we conjecture it is the case. The main reason for our conjecture is a lower bound due to Pilc [16], [17] in terms of the distortion-rate function $D(P_0, R)$, which is the inverse function of $R(P_0, D)$ in the variable $D$. Note that $D(P_0, \cdot)$ is defined on $[0, H(P_0)]$.

Theorem 3 (Pilc [17]): Assume that $x_1, \ldots, x_n$ are i.i.d. with distribution $P_0$ on $A_0$, and that $M_n$ is a map from $A_0^n \to B_0^n$. Given $R \in (0, H(P_0))$, if $|M_n(A_0^n)| \le 2^{nR}$, then

$$E_{P_0} d_n(x^n, M_n(x^n)) \ge D(P_0, R) + \frac{1}{2|s_0(R)|} \frac{\log n}{n} (1 + o(1)), \qquad (3.1)$$

where $s_0$ satisfies

$$\mu(s_0, P_0) - s_0 \mu'(s_0, P_0) = -R$$

with

$$\mu(s, P_0) = \sum_{j=1}^{J} P_0(j) \log\Big( \sum_{k=1}^{K} Q(k) \exp(s\, d(j, k)) \Big),$$

and

$$Q = W P_0.$$

Moreover, for any $\epsilon > 0$, there exists $N(\epsilon) > 0$ such that if $n > N(\epsilon)$,

$$\min E_{P_0} d_n(x^n, M_n(x^n)) \le D(P_0, R) + \frac{1}{2}(1 + \epsilon) \frac{1}{|s_0|} \frac{\log n}{n} (1 + o(1)), \qquad (3.2)$$

where the minimum is taken over all codes (maps) $M_n$ such that $|M_n(A_0^n)| \le 2^{nR}$.

It is worth noting that the constant $\tfrac{1}{2}$ in front of the rate $n^{-1}\log n$ does not depend on the dimension $J$ of the source distribution $P_0$, whereas in the noiseless case the corresponding constant is $\tfrac{1}{2}(J - 1)$.

Applying the function $R(P_0, \cdot)$ to both sides of (3.1), we get the following corollary.

Corollary 2 (Lower Bound): Under the assumptions of Theorem 3, for any expected D-semifaithful code $M_n$, if $|M_n| = 2^{nR}$, then as $n \to \infty$,

$$R \ge R(P_0, D) + \tfrac{1}{2} n^{-1} \log n\, (1 + o(1)). \qquad (3.3)$$

Proof: By (3.1) and the fact that $R(P_0, \cdot)$ is decreasing,

$$R(P_0, E_{P_0} d_n(x^n, M_n(x^n))) \le R\Big(P_0, D(P_0, R) + \frac{1}{2|s_0|} \frac{\log n}{n} (1 + o(1))\Big) = R(P_0, D(P_0, R)) + R'_D(P_0, D(P_0, R)) \frac{\log n}{2|s_0| n} (1 + o'(1)),$$

where the $o'(1)$ term represents the sum of the $o(1)$ term in the previous expression and the smaller order terms from the Taylor expansion. From the parametric representation of $R(P_0, \cdot)$ (Berger [2]), it is easy to see that $s_0(R) = R'_D(P_0, D(P_0, R)) < 0$.

Also note that $R'_D < 0$, $E_{P_0} d_n(x^n, M_n(x^n)) \le D$, and $R(P_0, D(P_0, R)) = R$, so we have

$$R \ge R(P_0, E_{P_0} d_n(x^n, M_n(x^n))) + \tfrac{1}{2} n^{-1} \log n\, (1 + o'(1)).$$

Since $R(P_0, \cdot)$ is decreasing,

$$R(P_0, E_{P_0} d_n(x^n, M_n(x^n))) \ge R(P_0, D).$$

This completes the proof of (3.3). $\Box$

Remark 1: Pilc's lower bound in Theorem 3 relies on some large deviation bounds from Shannon and Gallager [25], [26]. Those bounds are for tails of sums of i.i.d. variables and are accurate to the order $n^{-1/2} \exp(-cn)$ with the best constant $c$. Moreover, Pilc's original lower bound does not hold for noiseless coding because at $R = H(P_0)$, $s_0(R) = R'_D(P_0, D) = -\infty$. Hence, his lower bound does not include Rissanen's lower bound in the noiseless coding case as a special case.

Remark 2: From the previous section, we have a universal code which is pointwise D-semifaithful and $R(P_0, D) + O(n^{-1}\log n)$ in expected codelength. It would have been perfect if Pilc's result were in terms of expected codelength and for pointwise D-semifaithful codes. However, Pilc's lower bound in the form of Corollary 2 is something like a dual to the result we seek; it says that for any expected D-semifaithful code, the log of the cardinality of the set of its code words is bounded below by $R(P_0, D) + O(n^{-1}\log n)$. Note that this log cardinality is not a random quantity, unlike the codelength of our universal code. For a pointwise D-semifaithful code, the log cardinality is likely to be bigger than the expected codelength.

V. DISCUSSION

In this section, we compare the proofs of our Theorem 2 and Pilc's Theorem 3 from the points of view of constructiveness and universality. Both Pilc's upper bound (3.2) and our Theorem 2 involve a random coding argument. We might think that neither of them can give a code constructively, and that we cannot really say which code is universal, since neither result looks constructive. On the other hand, we observe that the random coding argument in Theorem 2 does not need the true distribution $P_0$, since we proved the existence of a D-semifaithful code for each type class, while Pilc's random coding argument used a knowledge of $P_0$. This difference in random coding seems to suggest that our code might be universal, whereas Pilc's coding might not be so. We now show this to be the case.

A. Construction of a Universal D-Semifaithful Code when Po is Unknown

For each type class $T_P^n$, let $[Q] = [W] \cdot P$ as in Section II. $[W]$ can be obtained to any precision numerically (if not analytically) by Blahut's algorithm, Blahut [4]. We require a precision of order $(nP(j))^{-1}$. Then, the $n[Q](k)$ are integers, i.e., $[Q]$ is a type. Take $m_n$ to be the integer part of $\exp\{nI([W], P) + 3JK \log(n + 1)\}$. For this $m_n$, we could in principle search through all subsets of size $m_n$ in $T_{[Q]}$ in order to find a subset $B_{P,D}$ that covers $T_P^n$ within distance $D$. That is, for each subset $B$, we check whether $d_n(x^n, B) \le D$ for all $x^n \in T_P^n$. For the $m_n$ previously chosen, Theorem 1 guarantees the existence of such a pointwise D-semifaithful code. In other words, through exhaustive search, we can find at least one set $B_{P,D} \subset T_{[Q]}$ satisfying our D-covering requirement. We take the first such $B_{P,D}$ found as our codebook for $T_P^n$, and we have "constructed" a universal pointwise D-semifaithful code. Note that the code we just described has its codelength approach the rate-distortion function lower bound at the rate $n^{-1}\log n$, and this rate is optimal in the noiseless coding case.
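To make the covering requirement concrete, here is a toy Python sketch for a binary alphabet with Hamming distortion. It does not implement the exhaustive search over all subsets of size $m_n$ described above, which is exponentially expensive; instead it swaps in a greedy set-cover heuristic, so the codebook it returns covers the type class within distance $D$ but carries no size guarantee from Theorem 1. The block length, the distortion level, the choice of the candidate type class (picked by hand rather than computed as $[Q] = [W] \cdot P$), and all function names are assumptions.

```python
import itertools

def hamming_dn(x, y):
    """Per-letter Hamming distortion d_n(x^n, y^n)."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def type_class(n, ones):
    """All binary sequences of length n with exactly `ones` ones."""
    seqs = []
    for idx in itertools.combinations(range(n), ones):
        s = [0] * n
        for i in idx:
            s[i] = 1
        seqs.append(tuple(s))
    return seqs

def greedy_cover(source_class, candidates, D):
    """Greedy heuristic: repeatedly add the candidate covering most uncovered sequences."""
    uncovered = set(source_class)
    codebook = []
    while uncovered:
        best = max(candidates,
                   key=lambda y: sum(hamming_dn(x, y) <= D for x in uncovered))
        newly = {x for x in uncovered if hamming_dn(x, best) <= D}
        if not newly:
            raise ValueError("candidates cannot cover the type class at distortion D")
        codebook.append(best)
        uncovered -= newly
    return codebook

n, D = 8, 0.25                 # hypothetical toy parameters
T_P = type_class(n, 4)         # source type class: 70 sequences
G = type_class(n, 4)           # candidate code words, chosen by hand
B = greedy_cover(T_P, G, D)

# The pointwise requirement: every sequence is within D of some code word.
covered = all(min(hamming_dn(x, y) for y in B) <= D for x in T_P)
```

In this toy case each code word covers at most 17 of the 70 sequences (itself and the 16 sequences obtained by swapping a one with a zero), so any valid codebook needs at least 5 code words.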

B. Construction of a D-Semifaithful Code when Po is Known

Pilc's upper bound (3.2) says that for any $R \in (0, H(P_0))$ and $\epsilon > 0$, we can use a random coding argument to find a map $M_n : A_0^n \to B_0^n$ such that as $n \to \infty$,

$$E_{P_0} d_n(x^n, M_n(x^n)) \le D(P_0, R) + \frac{1}{2}(1 + \epsilon) \frac{1}{|s_0|} \frac{\log n}{n} (1 + o(1)).$$

When $P_0$ is known, for any fixed $D \in (0, D_{\max})$, we can take $R_n$ to be $R(P_0, D) + \tfrac{1}{2}(1 + \epsilon) n^{-1} \log n$. For this $R_n$, we can search through all subsets in $B_0^n$ of size less than or equal to $2^{nR_n}$. We choose the codebook $B_{P_0}$ as the set such that

$$E_{P_0} d_n(x^n, B_{P_0}) = \min_{B} E_{P_0} d_n(x^n, B), \qquad (4.1)$$

where the min is taken over all $B$ with $|B| \le 2^{nR_n}$. Pilc's result guarantees this codebook $B_{P_0}$ satisfies

$$E_{P_0} d_n(x^n, B_{P_0}) \le D(P_0, R_n) + \frac{1}{2}(1 + \epsilon) \frac{1}{|s_0|} \frac{\log n}{n} (1 + o(1))$$

$$= D(P_0, R(P_0, D)) - \frac{1}{2}(1 + \epsilon) \frac{1}{|s_0|} \frac{\log n}{n} + \frac{1}{2}(1 + \epsilon) \frac{1}{|s_0|} \frac{\log n}{n} + o(n^{-1} \log n).$$

The last equality holds because of the Taylor expansion of $D(P_0, \cdot)$, the fact that $D'(P_0, R(P_0, D)) = s_0^{-1}$, and $D' < 0$. It follows that

$$E_{P_0} d_n(x^n, B_{P_0}) = D + o(n^{-1} \log n).$$

Without knowledge of the $o(1)$ term in Pilc's upper bound, $D + o(n^{-1}\log n)$ is the best level of distortion we can establish; we cannot deduce that the code is expected semifaithful at the exact level $D$.

The code $B_{P_0}$ clearly depends on $P_0$, as we need to know $P_0$ to check (4.1). The $(D + o(n^{-1}\log n))$-semifaithful code obtained from Pilc's upper bound is, therefore, not universal.

When $P_0$ is not known, a natural remedy would be to use the empirical distribution $\hat{P}$ instead of $P_0$ in the construction we have just outlined. But this does not work if we want to keep the rate $n^{-1}\log n$. The problem here is that when we replace $P_0$ by $\hat{P}$ in (4.1), we create an error of magnitude $(n^{-1}\log\log n)^{1/2}$, since $\|\hat{P} - P_0\| = O[(n^{-1}\log\log n)^{1/2}]$. This rate dominates the desired rate $n^{-1}\log n$.

There is another difference between our code and Pilc's. The code $B_{P_0}$ has the stated distortion on average, i.e., it is an expected D-semifaithful code, but the codelength is pointwise $R(P_0, D) + \tfrac{1}{2}(1 + \epsilon) n^{-1} \log n$, not in expectation. On the other hand, our code $\{B_{P,D} : P \text{ any type}\}$ is pointwise D-semifaithful with the expected codelength $R(P_0, D) + (KJ + J + 4) n^{-1} \log n$. Ignoring the issue of nonuniversality, and the different constants in front of $n^{-1}\log n$, we might say that Pilc's result is dual to ours. We doubt that there exists a universal code that is pointwise D-semifaithful and whose log cardinality approaches the lower bound $R(P, D)$ at the rate $n^{-1}\log n$.

A technical difference between the two results concerns the mathematical tools employed. We both use the random coding argument, but the rate $n^{-1}\log n$ came out of the method of types for us, while for Pilc it came out of the large deviation results of Shannon and Gallager [25], [26]. This is not surprising, however, since large deviation results can be obtained using the method of types in the discrete memoryless source case. Due to the elegance of the method of types, our proofs are simpler and more direct than those of Pilc. Both results rely on the assumption that the source is i.i.d., although Pilc has results on noisy channels, too. Large deviation results do exist for independent but not identically distributed variables, but we are not aware of any result as refined as that required by Pilc's bounds.

ACKNOWLEDGMENT

The authors would like to thank two referees for very helpful comments. The authors would also like to thank Dr. T. Linder for pointing out that the boundedness condition on the second derivative of $R(P, D)$ in Theorem 2 is superfluous. This is because $-R'_D(P, D)$ is positive and bounded by $\log J / D$, as $R(P, D)$ is a convex function of $D$.

REFERENCES

[1] A. R. Barron and T. M. Cover, "Minimum complexity density estimation," IEEE Trans. Inform. Theory, vol. 37, pp. 1034-1054, July 1991.

[2] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression. Englewood Cliffs, NJ: Prentice-Hall, 1971.

[3] T. Berger and L. D. Davisson, Advances in Source Coding. New York: Springer-Verlag, 1975.

[4] R. E. Blahut, Principles and Practice of Information Theory. Reading, MA: Addison-Wesley, 1978.

[5] B. S. Clarke and A. R. Barron, "Information theoretic asymptotics of Bayes methods," IEEE Trans. Inform. Theory, vol. 36, pp. 453-471, May 1990.

[6] I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981.

[7] T. M. Cover and J. A. Thomas, "Elements of information theory," Lecture Notes, Stanford Univ., CA, 1990.

[8] L. D. Davisson, "Minimax noiseless universal coding for Markov sources," IEEE Trans. Inform. Theory, vol. IT-29, pp. 211-215, 1983.

[9] L. Gerencser and J. Rissanen, "A prediction bound for Gaussian ARMA processes," Proc. 25th CDC, Athens, Greece, vol. 3, 1986, pp. 1487-1490.

[10] E. J. Hannan and L. Kavalieris, "Regression, autoregressive models," J. Time Series Anal., vol. 7, pp. 27-49, 1986.

[11] E. M. Hemerly and M. H. A. Davis, "Strong consistency of the predictive least squares criterion for order determination of autoregressive processes," Ann. Statist., vol. 17, pp. 941-946, 1989.

[12] J. Kieffer, "A unified approach to weak universal source coding," IEEE Trans. Inform. Theory, vol. IT-24, pp. 674-682, 1978.

[13] K. M. Mackenthun and M. B. Pursley, "Variable-rate universal block source coding subject to a fidelity constraint," IEEE Trans. Inform. Theory, vol. IT-24, pp. 349-360, 1978.

[14] D. L. Neuhoff, R. M. Gray, and L. D. Davisson, "Fixed rate universal block source coding with a fidelity criterion," IEEE Trans. Inform. Theory, vol. IT-21, pp. 511-523, 1975.

[15] D. S. Ornstein and P. C. Shields, "Universal almost sure data compression," Ann. Probab., vol. 18, pp. 441-452, 1990.

[16] R. J. Pilc, "Coding theorems for discrete source-channel pairs," Ph.D. thesis, Dept. Elect. Eng., M.I.T., Cambridge, MA, 1967.

[17] __, "The transmission distortion of a source as a function of the encoding block length," Bell Syst. Tech. J., vol. 47, pp. 827-885, 1968.

[18] D. Pollard, Convergence of Stochastic Processes. New York: Springer-Verlag, 1984.

[19] J. Rissanen, "Stochastic complexity and modeling," Ann. Statist., vol. 14, pp. 1080-1100, 1986.

[20] __, "Complexity of strings in the class of Markov sources," IEEE Trans. Inform. Theory, vol. IT-34, pp. 526-532, July 1986.

[21] J. Rissanen, T. P. Speed, and B. Yu, "Density estimation by stochastic complexity," IEEE Trans. Inform. Theory, vol. 38, pt. I, pp. 315-323, Mar. 1992.

[22] P. Shields, "Universal almost sure data compression using Markov types," Probl. Contr. Inform. Theory, vol. 19, pp. 269-277, 1990.

[23] C. Shannon and W. Weaver, A Mathematical Theory of Communication. Urbana, IL: Univ. Illinois Press, 1949.

[24] C. Shannon, "Coding theorems for a discrete source with a fidelity criterion," in Information and Decision Processes, R. E. Machol, Ed. New York: McGraw-Hill, 1959.

[25] C. Shannon and R. G. Gallager, "Lower bounds to error probability for coding on discrete memoryless channels I," Inform. Contr., vol. 10, pp. 65-103, 1969.

[26] __, "Lower bounds to error probability for coding on discrete memoryless channels II," Inform. Contr., vol. 10, pp. 523-552, 1969.

[27] M. J. Weinberger, A. Lempel, and J. Ziv, "A sequential algorithm for the universal coding of finite memory sources," IEEE Trans. Inform. Theory, vol. 38, pp. 1002-1014, May 1992.

[28] B. Yu and T. Speed, "Data compression and histograms," Probability Theory Related Fields, vol. 92, pp. 195-229, 1992.
