
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 37, NO. 6, NOVEMBER 1991 1605

Cutting and Stacking: A Method for Constructing Stationary Processes

Paul C. Shields, Member, IEEE

Abstract -The method of cutting and stacking has long been used to construct counterexamples in ergodic theory. Recently it has been used to construct examples of interest in information theory and probability theory. The method builds a stationary ergodic process by describing sample paths as concatenations of non-overlapping blocks of varying lengths. Induction is used to show how these blocks are concatenated to form longer and longer blocks and a geometric model is used to guarantee stationarity. The basic ideas of the method and some recent applications to information theory problems are described.

Index Terms -Ergodic processes, typical sequences.

I. INTRODUCTION

A GENERAL method for constructing examples of stationary, ergodic, finite-alphabet processes with desired properties will be described in this paper. The method, known as "cutting and stacking," is well known to ergodic theorists; recently this author has used the method to construct several examples of interest to information theorists. The main purpose of this paper is to serve as an appendix to these recent constructions; thus this paper is both a tutorial introduction to the cutting and stacking ideas and a unification and extension of the basic concepts. We hope this paper will provide both the basic understanding and the basic results needed to make it possible for readers who are unfamiliar with the ideas to learn them and make use of them in their own work.

To provide some motivation for information theorists, three problems recently solved using the cutting and stacking method will be described. The first is the divergence-rate problem for ergodic sources, stated as follows. Let P(a) and Q(a) be two probability distributions on the finite set A. The (informational) divergence is defined by the formula

D(P‖Q) = Σ_{a ∈ A} P(a) log [P(a)/Q(a)].

See the Csiszár-Körner book, [2], for a discussion of the divergence concept and its role in information theory and statistics.

Manuscript received September 11, 1990; revised May 20, 1991. This work was supported in part by NSF Grant DMS-874630 and a Fulbright Lectureship at Eötvös Loránd University, Budapest, Hungary. This work was presented in part at the IEEE Information Theory Workshop, Eindhoven, The Netherlands, 1990, and in part at the IEEE International Symposium on Information Theory, Budapest, Hungary, June 24-28, 1991. The author is with the Department of Mathematics, University of Toledo, Toledo, OH 43606. IEEE Log Number 9102342.

Now suppose that P and Q are stationary, ergodic processes with finite alphabet A, that is, ergodic shift-invariant Borel probability measures on the set A^∞ of all infinite sequences of symbols from A. The divergence-rate is defined by

D(P‖Q) = lim sup_n D_n(P^n‖Q^n),   (1)

where P^n is the measure induced by P on the set A^n of sequences a_1^n = a_1, a_2, ⋯, a_n, a_i ∈ A, by the formula

P^n(a_1^n) = P({x ∈ A^∞: x_1^n = a_1^n}),   a_1^n ∈ A^n,

and

D_n(P^n‖Q^n) = (1/n) Σ_{a_1^n ∈ A^n} P^n(a_1^n) log [P^n(a_1^n)/Q^n(a_1^n)]

is the nth-order divergence per step.

Problem 1: If P and Q are "reasonable" processes, is the limit superior in (1) really a limit, and does a Shannon-McMillan divergence-rate theorem hold, that is, does

lim_n (1/n) log [P^n(x_1^n)/Q^n(x_1^n)] = D(P‖Q),   a.s.?
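As a quick sanity check on these definitions (our own illustration, not part of the paper), the nth-order divergence per step can be computed by brute force. For i.i.d. measures it is constant in n, so in that special case the limit superior in (1) is trivially a limit; the sketch below, with function names of our choosing, verifies this numerically for a small binary example.

```python
import math
from itertools import product

def divergence(p, q):
    """Single-letter informational divergence D(P||Q) in nats."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def divergence_rate_n(p, q, n):
    """nth-order divergence per step, D_n(P^n||Q^n), for i.i.d. product measures."""
    total = 0.0
    for word in product(range(len(p)), repeat=n):
        pn = math.prod(p[a] for a in word)   # P^n of the word
        qn = math.prod(q[a] for a in word)   # Q^n of the word
        if pn > 0:
            total += pn * math.log(pn / qn)
    return total / n

p, q = [0.5, 0.5], [0.25, 0.75]
d = divergence(p, q)
# For i.i.d. measures D_n does not depend on n, since divergence is additive
# over independent coordinates and the 1/n normalization cancels the sum.
for n in (1, 2, 3):
    assert abs(divergence_rate_n(p, q, n) - d) < 1e-12
```

For non-i.i.d. processes no such shortcut exists, which is what makes Problem 1 nontrivial.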

Csiszár and this author have recently shown that there is a limiting divergence-rate and the Shannon-McMillan divergence theorem holds in the case when Q is a finite-state process, [3]. This author has shown, using the cutting and stacking method, that if P is i.i.d. then there exists a B-process Q, in the sense of Gray, [5], [16], such that the limit superior in (1) is not a limit, and no Shannon-McMillan divergence-rate theorem holds for the pair P, Q. The class of B-processes coincides with the class of stationary codings of i.i.d. processes; the class includes the aperiodic Markov processes and the mixing finite-state processes and is the smallest class containing the mixing finite-state processes that is closed in the d̄-metric, hence is a "reasonable" candidate for the class of "reasonable" processes, (see [9], [10], [16]).

The second problem is related to the first and is concerned with the positivity property, namely,

Problem 2: If P and Q are "reasonable" processes and D(P‖Q) = 0 can we conclude that P = Q?

Marton has recently shown that if Q satisfies the "blowing-up" property, [7], and D(P‖Q) = 0 then P = Q, [3]. It is also shown that many processes, including the

0018-9448/91$01.00 © 1991 IEEE


mixing finite-state processes, satisfy the "blowing-up" property; it can be shown that processes that satisfy this property must be B-processes. This author has shown, using the cutting and stacking method, that if P is the i.i.d. process generated by unbiased coin-tossing there is a B-process Q ≠ P such that D(P‖Q) = 0.

The third problem is the waiting-time problem of Wyner and Ziv, [22]. Let x, y ∈ A^∞ and define

W_n(x, y) = min {m: y_{m+i-1} = x_i, 1 ≤ i ≤ n},

that is, W_n(x, y) is the waiting time until the initial segment x_1^n of x is seen in the sequence y. If x and y are independently selected then it is natural to think of x as a "training" sequence and y as a sequence to be compressed by telling where its blocks occur in x.

Problem 3: If P is ergodic with entropy-rate H and x and y are selected independently does (1/n) log W_n(x, y) converge in probability to H?

Wyner and Ziv have shown that the corresponding recurrence-time problem has a positive solution. The recurrence time is defined by R_n(x) = W_n(x, Tx), where Tx is the shift of x, namely, Tx = x_2, x_3, ⋯. Wyner and Ziv show that (1/n) log R_n(x) converges in probability to H, for any ergodic process P, a result later strengthened to almost sure convergence by Ornstein and Weiss, [14], [22]. Wyner and Ziv also obtained a positive answer to Problem 3 for the case when P is Markov. This author has shown, using cutting and stacking, that there is an ergodic process for which the answer to Problem 3 is no, [21]. Thus in the general ergodic case one cannot be sure that the use of training sequences will lead to optimal compression.
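The quantities in Problem 3 are easy to experiment with numerically. The sketch below (our own code and naming, purely illustrative) computes W_n(x, y) and R_n(x) for simulated fair coin-tossing, where both are typically of order 2^n since H = log 2.

```python
import random

def waiting_time(x, y, n):
    """W_n(x, y): least m >= 1 with y_{m+i-1} = x_i for 1 <= i <= n.
    Sequences are 0-indexed Python lists; returns None if no match occurs."""
    target = x[:n]
    for m in range(1, len(y) - n + 2):
        if y[m - 1:m - 1 + n] == target:
            return m
    return None

def recurrence_time(x, n):
    """R_n(x) = W_n(x, Tx), where Tx = x_2, x_3, ... is the shifted sequence."""
    return waiting_time(x, x[1:], n)

random.seed(0)
N, n = 1 << 16, 8
x = [random.randint(0, 1) for _ in range(N)]  # "training" sequence
y = [random.randint(0, 1) for _ in range(N)]  # sequence to be compressed
w = waiting_time(x, y, n)   # typically of order 2^n = 256 for fair coin-tossing
r = recurrence_time(x, n)
```

The counterexample of [21] shows that, for some ergodic processes, (1/n) log W_n does not concentrate near H; no simulation of a "nice" process will exhibit this.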

II. THE CUTTING AND STACKING METHOD

There are several standard methods for specifying a stationary process, P. One common procedure is to specify the joint distributions, P(a_1^n) = P^n(a_1^n), for all n and all a_1^n, or equivalently, the conditional probabilities P(a_n | a_1^{n-1}). These then yield, via the Kolmogorov construction, a shift-invariant measure P on the space A^∞ of infinite sequences and the random variables of the process are thought of as the coordinate functions on the product space. Another common method is to define a process as a function of some given process, e.g., moving-average processes and finite-state processes are usually described as functions of i.i.d. and Markov processes, respectively.

The Kolmogorov model is conceptually so useful that it is now common practice to think of the functions of an A-valued process as coordinate functions on a fixed sequence space; the measure P then carries all the information about the process. Stationarity, for example, is just the statement that the measure P is shift-invariant. Also, two processes are said to be equivalent if they have the same joint distributions for all n; this is the same as saying they have the same Kolmogorov representation.

The cutting and stacking method specifies the structure of the "typical" sequences for the process, that is, the sequences that have the correct limiting frequencies of all orders. The method builds the typical sequences by concatenating blocks of varying lengths, using a geometric model to control how the blocks fit together and to guarantee stationarity. Of course, the focus on typical sequences implicitly involves ergodic processes; while the cutting and stacking method can be used to build nonergodic processes it is more natural to use it to build ergodic processes.

As noted in the preceding paragraph, the cutting and stacking method uses a geometric model to guarantee stationarity. The geometric model for a process comes from a different and more abstract way of thinking about shift-invariant measures, described as follows. If the process is ergodic then the measure P will be nonatomic. A nonatomic measure is really just Lebesgue measure in disguised form, that is, there is an invertible measure-preserving mapping from A^∞ onto the unit interval that carries P onto Lebesgue measure λ. This isomorphism carries the shift onto a transformation T on the unit interval which preserves λ. The natural partition of the sequence space into the time-0 sets defined by {x: x_0 = a}, a ∈ A, is carried by this isomorphism onto a partition Π = {π_a} of the unit interval. The process can then be recovered by selecting a point ω at random in the unit interval and defining X_n to be the index of the member of Π to which T^n ω belongs.

In summary, the geometric model for a stationary ergodic process represents it as a transformation T on the unit interval that preserves Lebesgue measure λ, together with a partition Π = {π_a} of the unit interval, with X_n(ω) = a, if T^n ω ∈ π_a. We shall continue to use the notation P for a stationary measure on A^∞ except when we want to stress the geometric model, in which case we use λ. The notation {X_n} will either mean the associated sequence of coordinate functions defined on A^∞ or the process {X_n(ω)} defined on the unit interval by the λ-invariant transformation T and the partition Π.
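To make the geometric model concrete, here is a small simulation (our own illustration, not a construction from the paper): an irrational rotation of [0,1) is an invertible Lebesgue-measure-preserving transformation, and together with the partition π_0 = [0,1/2), π_1 = [1/2,1) it defines a stationary binary process X_n(ω) in exactly the manner just described.

```python
import math
import random

def sample_path(T, partition_index, omega, n_steps):
    """Generate X_0, X_1, ..., X_{n-1}, where X_k is the partition index of T^k(omega)."""
    out = []
    for _ in range(n_steps):
        out.append(partition_index(omega))
        omega = T(omega)
    return out

# Rotation by the golden-ratio fraction: invertible and preserves Lebesgue measure.
alpha = (math.sqrt(5) - 1) / 2

def T(w):
    return (w + alpha) % 1.0

def index(w):
    return 0 if w < 0.5 else 1   # pi_0 = [0, 1/2), pi_1 = [1/2, 1)

random.seed(1)
xs = sample_path(T, index, random.random(), 10000)
freq1 = sum(xs) / len(xs)   # close to lambda(pi_1) = 1/2
```

Measure preservation of T is what makes the resulting process stationary; a different λ-preserving T with the same partition gives a different stationary process.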

The cutting and stacking method focuses on defining a transformation T on the unit interval X = [0,1) that preserves Lebesgue measure λ, along with a finite partition Π. The principal features of the method are the following.

The transformation T and partition Π are defined in stages. At stage n, they are defined on a subset S_n of the unit interval. The subset S_n is a union of a collection ℒ_n of disjoint subintervals. Each subinterval in ℒ_n is mapped linearly by T onto a subinterval of [0,1) of the same length, which either belongs to ℒ_n or is disjoint from ℒ_n. Each subinterval in ℒ_n is contained in only one set of the partition Π. S_n ⊂ S_{n+1} and λ(S_n) → 1. Once T or Π is defined at a point ω its definition does not change at later stages.


The fact that each subinterval in S_n is mapped linearly by T onto an interval of the same length guarantees that the final transformation preserves Lebesgue measure. Since the set [0,1) - S_n on which T is not defined at stage n decreases to 0 in measure, for almost every ω ∈ [0,1) there will be a stage after which Tω is defined. Likewise, there will be a stage after which ω has been assigned to a member of the partition Π. We stress here that as the stage number increases, more is revealed about the process, that is, the transformation and the partition. The transformation is not defined as a limit of transformations, nor is the partition defined as a limit of partitions. The limit involved in these constructions is that the domain of definition of T and of Π at stage n increases as n grows, exhausting the entire unit interval as n → ∞.

The automatic measure-preserving property of the cutting and stacking method allows the builder to focus attention on the dynamic properties desired, rather than on the often messy details of defining joint distributions. Visual control over the details is obtained by stacking the subintervals at a given stage into columns in such a way that the action of T maps points upwards; the domain of T at this stage is the union of all but the top interval of each column. The column picture has another useful feature, for it specifies how frequencies of nonoverlapping blocks occur in typical names and also allows the estimation of kth-order joint probabilities, provided k is small relative to the column height. We go from stage n to stage n + 1 by cutting the stage n columns into subcolumns and stacking these in some way, sometimes adjoining further subintervals. (Hence, the name "cutting and stacking.")
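The estimation of kth-order probabilities from a column picture can be made concrete as follows (an illustrative sketch with our own naming, not code from the paper): each level of a column of width w carries measure w, so the measure of the set of points whose next k symbols spell a given word is the total width of the levels from which reading upward spells that word, up to an error contributed by the top k - 1 levels of each column.

```python
def kth_order_estimate(columns, word):
    """Estimate P(X_1^k = word) from a column picture.

    `columns` is a list of (label, width) pairs, the label being the column's
    partition name read bottom to top.  Level j of a column has Lebesgue
    measure `width`, and we count the levels from which the next k symbols
    read upward spell `word`.  The top k-1 levels of each column are ignored,
    so the estimate is exact up to an error of at most (k-1) * sum of widths.
    """
    k = len(word)
    total = 0.0
    for label, width in columns:
        total += width * sum(label[j:j + k] == word
                             for j in range(len(label) - k + 1))
    return total

# Stage 2 of the coin-tossing example below: four columns of height 2, width 1/8.
stage2 = [("00", 1/8), ("01", 1/8), ("10", 1/8), ("11", 1/8)]
p11 = kth_order_estimate(stage2, "11")   # 1/8, versus the true value 1/4
```

The discrepancy for the word 11 is exactly the point made in the examples below: a stage specifies frequencies of nonoverlapping blocks but gives no information across block boundaries, so the estimates become accurate only when k is small relative to the column height.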

The basic terminology for organizing cutting and stacking constructions will be given in the next section. Ergodicity and the estimation of joint distributions from the column pictures will be discussed in Section IV, along with a brief outline of the constructions of the counterexamples to Problems 1-3. We close this section by discussing two examples that illustrate the ideas involved. The first shows how to construct the familiar binary symmetric i.i.d. process using cutting and stacking, while the second shows how to construct a related process that is less easy to describe by the usual joint distribution methods. In these particular examples, the partition is already fully defined at the first stage; the transformation, however, is only fully defined in the limit of stages.

A. Coin-Tossing

In our first example we show how the familiar binary symmetric i.i.d. process can be constructed by cutting and stacking. At Stage 1, the unit interval [0,1) is partitioned into the two subintervals, [0,1/2), [1/2,1), and Π is defined by π_0 = [0,1/2), π_1 = [1/2,1). The domain of T at Stage 1 is empty; note, however, that the partition is fully defined. All we can say about our process from knowledge of the first stage is that 0's and 1's occur with equal probability. (See Fig. 1.)

Fig. 1. Coin-tossing: The first stage.

Fig. 2. Cutting and stacking to construct Stage 2. (a) Cut each interval into four subintervals of equal width. (b) Restack to get all combinations of 0's and 1's.

To go to the second stage we cut each first stage interval into four subintervals of equal length and stack them in pairs into four columns of height 2 and width 1/8, so that each possible combination of 0's and 1's occurs. There are several ways to do this, one of which is illustrated in Fig. 2. The transformation T at Stage 2 is defined only on the bottom subintervals; each point in the union of the bottom subintervals is mapped directly upwards by T. All that can be said at the end of this stage about a typical sequence of the process is that it can be partitioned into blocks of length 2 in such a way that 00, 01, 10, and 11 occur with equal frequency; this stage gives no information about how these blocks are connected.

To go to the third stage each second stage column is cut into eight subcolumns of equal width; these subcolumns are stacked in pairs into 16 columns of height 4 and width 1/64, so that each possible combination of 0's and 1's occurs. For example, to obtain the third stage column labeled 0010 take one of the subcolumns of the second stage column labeled 01 and stack it above one of the subcolumns of the second stage column labeled 00. See Fig. 3. The transformation T is now defined on all but the top layer; each point in a lower layer is mapped directly upwards to the next layer. More can now be said about the process, namely, that a typical sequence can be partitioned into blocks of length 4 in such a way that the 16 possible blocks of length 4 occur with equal frequency. No information is given at this stage about how these blocks are connected.

After n stages we have 2^{2^{n-1}} columns of height 2^{n-1} and width 1/(2^{n-1} · 2^{2^{n-1}}). There is a one-to-one correspondence between columns and binary sequences of length 2^{n-1}; the ith term of a sequence specifies the member of the partition Π to which the ith level belongs. The transformation T is defined on all but the top layer; each point in a lower layer is mapped directly upwards to the next layer. A typical sequence of the process can be partitioned into blocks of length 2^{n-1} and each possible block occurs with the same frequency. The nth stage gives no information about how these blocks are concatenated


Fig. 3. Construction of third stage columns. (a) Cut each column into eight columns. (b) Reassemble to get 16 columns. Arrow shows column 0010.

to form the process. We go from stage n to stage n + 1 by cutting each nth stage column into 2 · 2^{2^{n-1}} subcolumns of equal width and stacking these subcolumns into 2^{2^n} columns of height 2^n so that each possible binary sequence of length 2^n occurs. The definition of T is extended to all but the top layer of the new collection by mapping points directly upwards.

This completes our construction; our final process is the end result of all the stages. At each stage every ω ∈ [0,1) belongs to some column. The total width of all the columns shrinks to zero, thus eventually almost surely ω will belong to a level below the top level; as soon as this happens we know the value of Tω, namely, it will be the point directly above ω in that column. The value of X_1(ω) will then be 0 or 1, depending on whether Tω belongs to π_0 or π_1. The reader can verify that the process we have defined has the same joint distributions as that generated by unbiased coin-tossing.
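The stage-to-stage bookkeeping of this construction is easy to mimic symbolically. In the sketch below (our own code; only labels and widths are tracked, the underlying intervals are implicit, and the order in which a pair of labels is concatenated is a convention that does not affect the statistics), one round of cutting and stacking pairs every ordered pair of stage-n columns, and the asserted column count and width match the formulas in the text.

```python
from fractions import Fraction

def next_stage(columns):
    """One round of cutting and stacking for equal-width columns: cut each of
    the K columns into 2*K equal-width subcolumns, then stack the subcolumns
    in pairs so that every ordered pair of old labels occurs exactly once."""
    K = len(columns)
    return {lower + upper: columns[lower] / (2 * K)
            for lower in columns for upper in columns}

stage = {"0": Fraction(1, 2), "1": Fraction(1, 2)}   # Stage 1
for n in range(2, 5):                                # Stages 2, 3, 4
    stage = next_stage(stage)
    height = 2 ** (n - 1)
    assert len(stage) == 2 ** height                      # 2^(2^(n-1)) columns
    assert all(w == Fraction(1, height * 2 ** height)     # width 1/(2^(n-1) 2^(2^(n-1)))
               for w in stage.values())
# Total measure (height times width, summed over columns) is always 1.
assert sum(len(label) * w for label, w in stage.items()) == 1
```

Exact rational widths make the measure checks exact rather than approximate.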

In Section III, the terminology "independent cutting and stacking" will be used to describe the cutting and stacking operation used to go from stage n to stage n + 1; the general form of this construction will be described. In Section IV, we shall provide a general method for calculating frequencies of overlapping blocks by directly using the column picture; this can then be used to define joint distributions.

The construction can easily be generalized to produce an arbitrary finite-alphabet i.i.d. process; the details are left to the reader.

B. Repeated Symbols

We now construct a variation of the preceding example; here we force 1's to occur in pairs. To form Stage 1, we cut the unit interval into three subintervals of length 1/3 and form two columns from these, a left column of height 1 and a right column of height 2. These are shown in Fig. 4. The column of height 1 is assigned to π_0 and the column of height 2 is assigned to π_1. The transformation T is defined only on the lower interval of the right column; it maps a point in that column directly upwards to the interval above. Thus, all we know after Stage 1 is that 0's occur 1/3 of the time and 1's occur in pairs.

Fig. 4. First stage: 1's occur in pairs.

Fig. 5. Stage 2 by cutting and stacking. (a) Cut each column into four subcolumns of equal width. (b) Restack to get all combinations of pairs of subcolumns.

To go to the second stage we cut each first stage column into four subcolumns of equal width and stack these subcolumns in pairs so that each possible combination appears. Thus we obtain four columns of width 1/12; one of height 2, labeled by 00; two of height 3, labeled by 011 and 110; and one of height 4, labeled 1111. (See Fig. 5.) The transformation T is now defined on all but the top level of each column; each point in a lower layer is mapped directly upwards to the next layer. After this stage is completed we can say more about the process, namely, that a typical sequence can be partitioned into blocks of lengths 2, 3, and 4. Of these blocks, 2/12 have length 2 and consist of 0 followed by 0, 3/12 have length 3 and consist of 0 followed by 11, 3/12 have length 3 and consist of 11 followed by 0, and 4/12 have length 4 and consist of four consecutive 1's. No information is given at this stage about how these blocks are connected.

To go to the third stage we cut each second stage column into eight subcolumns of equal width and stack these subcolumns in pairs in all possible ways to obtain 16 columns of heights ranging from 4 to 8. For example, there will be four columns of height 5 corresponding to the ways of arranging one column of 00 with one column of either 011 or 110. Thus the four columns of height 5 are labeled by

00011, 00110, 01100, 11000.

To obtain the third stage column labeled 01100, for example, we select one of the subcolumns of the second stage column labeled 00 and stack it above one of the subcolumns of the second stage column labeled 011, as shown in Fig. 6. After the completion of the third stage the transformation T is now defined on all but the top layer; each point in a lower layer is mapped directly upwards to the next layer. Now we can say that a typical sequence can be partitioned into blocks of lengths 4, 5, 6, 7, and 8. For example, 5/96 of the blocks will have length 5 and be labeled 00011, 5/96 of the blocks will have length 5 and be labeled 00110, and 6/96 of the blocks will have length 6 and be labeled 110110, etc. No information is given at this stage about how these blocks are connected.

Fig. 6. Construction of the third stage. (a) Cut each column into eight copies. (b) Reassemble to get 16 columns.

After n stages we have 2^{2^{n-1}} columns of equal width, with heights ranging from 2^{n-1} to 2^n. All possible columns occur subject only to the condition that a column of height h is labeled by a binary sequence b_1^h that can be partitioned into 2^{n-1} - j blocks of 0's of length 1 and j blocks of 1's of length 2, for some j ∈ [0, 2^{n-1}]. The transformation T is defined on all but the top level of each column; each point in a lower level is mapped directly upwards to the next level. We go from stage n to stage n + 1 by cutting each column into 2 · 2^{2^{n-1}} subcolumns of equal width and stacking these subcolumns in pairs into 2^{2^n} columns such that all possible combinations of pairs occur. The definition of T is extended to all but the top layer of the new collection by mapping points directly upwards.

This completes our second construction; our final process is again the end result of all the stages. At each stage every ω ∈ [0,1) belongs to some column. As in the first example, the total width of all the columns shrinks to zero and thus eventually almost surely ω will belong to a level below the top level; as soon as this happens we know the value of Tω, namely, it will be the point directly above ω in that column. The value of X_1(ω) will then be 0 or 1, depending on whether Tω belongs to π_0 or π_1. The reader might want to show directly that the resulting process is ergodic; we shall describe a general method for establishing ergodicity in Section IV. Our final process is actually a B-process, as shown in [15].
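The same symbolic bookkeeping used for the coin-tossing example applies here (again our own illustrative code, tracking only labels and widths). Starting from the two first-stage columns, every later stage consists of columns whose names have all runs of 1's of even length, which is the sense in which 1's are forced to occur in pairs in every typical sequence.

```python
from collections import Counter
from fractions import Fraction
from itertools import groupby

def next_stage(columns):
    """Cut each of the K equal-width columns into 2*K equal-width subcolumns
    and stack the subcolumns in pairs so every ordered pair of labels occurs."""
    K = len(columns)
    return {a + b: columns[a] / (2 * K) for a in columns for b in columns}

# Stage 1: a height-1 column named 0 and a height-2 column named 11, each of
# width 1/3 (the height-2 column uses two of the three intervals).
stage = {"0": Fraction(1, 3), "11": Fraction(1, 3)}
for _ in range(3):          # build Stages 2, 3, 4
    stage = next_stage(stage)

# Every column name has its 1's in runs of even length: 1's occur in pairs.
for label in stage:
    assert all(len(list(run)) % 2 == 0
               for sym, run in groupby(label) if sym == "1")

heights = Counter(len(label) for label in stage)
assert min(heights) == 8 and max(heights) == 16      # 2^{n-1} to 2^n at stage n = 4
assert sum(len(label) * w for label, w in stage.items()) == 1   # total measure 1
```

At Stage 2 this reproduces the four columns of width 1/12 with heights 2, 3, 3, 4 described in the text.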

Remark 1: The preceding example can be slightly modified to produce a Markov chain; all that is needed is that the three intervals of the first stage be given distinct names. Thus if we assign the left interval of Fig. 4 to π_0, the lower right interval to π_1, and the upper right interval to π_2, then apply the same cutting and stacking procedure, the resulting process will be an aperiodic Markov process with the three states {0,1,2}, as shown in [15]. The two-state process we actually constructed is, therefore, a function of a Markov chain. In general, our second construction can be modified to produce any Markov chain or function of a Markov chain. In fact, it can be shown that any finite-alphabet ergodic process can be constructed by cutting and stacking; of course, full knowledge of the process may be needed in order to see how to go from one stage to the next. We do not want to mislead the reader, however, for the power of the method is not due to its ability to construct familiar examples, but is instead due to the freedom it allows the user to focus on desired properties, rather than the details of stationarity (or ergodicity, as we shall see in Section IV).

III. THE BASIC OPERATIONS

In this section, we present some terminology that is useful in making cutting and stacking constructions; a slightly modified form of the nomenclature of [11] will be used. Most of this terminology has already been used informally in our discussion of the two examples of the preceding section and is generally suggestive and natural. The single exception is the use of the word "gadget" for a collection of columns together with a partition. (The precise definition of gadget will be given later.) The word "gadget" was introduced by Ornstein to help organize his isomorphism theory, [10], and is still commonly used, even though it has no suggestive meaning. It would be nice to have a better word, a word that captures the idea of a set of columns and a partition.

A. Columns, Transfomations, Partitions, Gadgets

We begin by defining the basic objects.

Column: A column 𝒞 = (L_1, L_2, …, L_h) of height h = h(𝒞) is an ordered, disjoint collection of subintervals of the unit interval of equal width w = w(𝒞). The interval L_1 is called the base, L_h is called the top, and L_j is called the jth level of the column. The support of 𝒞 is the union of the L_j; the measure of 𝒞, denoted by λ(𝒞), is the measure of its support. Of course, λ(𝒞) = hw.

Column Transformation: A column defines a transformation T = T(𝒞); namely, the transformation T that maps L_j linearly onto L_{j+1}, for 1 ≤ j < h. We usually picture a column by drawing L_{j+1} directly above L_j and think of T as mapping upwards. (See Fig. 7.) Note that T is not defined on the top, L_h, and T^{-1} is not defined on the base, L_1. Note that the column (L_1, L_2, …, L_h) can now be represented in the form (B, TB, T^2 B, …, T^{h-1} B), where B is the base, L_1. A full-height subcolumn of the column 𝒞 is a column of the form (I, TI, T^2 I, …, T^{h-1} I), where I is a subinterval of the base B. (See Fig. 8.)

Fig. 7. Transformation defined by a column.

Fig. 8. Full-height subcolumn.

Compatible Partition: A partition Π = {π_0, π_1, …, π_{k-1}} of the support of 𝒞 is compatible with 𝒞 if each column level is contained in exactly one set of Π. Associated with a column is its Π-name; this is the sequence c_1^h = c_1^h(𝒞) defined by the condition L_j ⊂ π_{c_j}. Unless stated otherwise, all partitions will be partitions into k sets, labeled by the integers {0, 1, …, k − 1}.

Gadget: A gadget 𝒢 is a finite collection of disjoint columns {𝒞_i}, together with a partition Π = Π(𝒢) that is compatible with each column. The support of 𝒢 is the union of the supports of its columns. The width w = w(𝒢) of a gadget is the sum of the widths of its columns and the measure λ(𝒢) of a gadget is the sum of the measures of its columns, that is, the measure of its support. A subgadget of 𝒢 is obtained by selecting some of the columns of 𝒢. If {𝒢_i} is a finite collection of gadgets with disjoint support then their union, 𝒢 = ∪_i 𝒢_i, is the gadget whose columns are the columns of all of the 𝒢_i.

Gadget Transformation: Associated with a gadget 𝒢 is the transformation T = T(𝒢) that acts as T(𝒞) on each column 𝒞. Note that T(𝒢) is not defined on the top interval of each column of 𝒢, although it may be the restriction of a transformation that is defined on the top. We say that a gadget 𝒢 extends a gadget ℋ if the support of 𝒢 includes the support of ℋ, if T(𝒢) extends T(ℋ), and if Π(𝒢) extends Π(ℋ).

B. Cutting, Stacking, Adjoining

The cutting and stacking operations that are commonly used will now be defined, along with an adjoining operation.

Cutting into Copies: The basic cutting operation is the cutting of gadgets into copies. The distribution of a gadget 𝒢 (with J columns) is the vector

dist(𝒢) = (w(𝒞_1)/w(𝒢), w(𝒞_2)/w(𝒢), …, w(𝒞_J)/w(𝒢)),

which gives the (normalized) widths of the columns. A gadget 𝒢′ is a copy of the gadget 𝒢 if they have the same distribution and corresponding columns have the same partition names (and hence the same heights). A gadget 𝒢 can be cut into M copies of itself, {𝒢_m, 1 ≤ m ≤ M}, according to a given probability M-vector γ = (γ_m), by cutting each column

𝒞 = (L_j : 1 ≤ j ≤ h(𝒞))

into disjoint full-height subcolumns

𝒞_m = (L_{j,m} : 1 ≤ j ≤ h(𝒞)), 1 ≤ m ≤ M,

where w(𝒞_m) = w(L_{j,m}) = γ_m w(L_j). The gadget 𝒢_m formed from the columns 𝒞_m is called the copy of 𝒢 of width γ_m. Note that the action of the gadget transformation T is not affected by the copying operation. Fig. 9 illustrates the case M = 2.

Fig. 9. Cutting a gadget into copies.
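The cutting operation is easy to model concretely. The sketch below is an illustration, not code from the paper; the tuple representation and the names measure and cut_into_copies are my own. A column is represented by its Π-name and its width, and a gadget is cut into copies according to a probability vector γ:

```python
# A column is modeled by (name, width): its Pi-name, one symbol per
# level, and the common width of its levels. A gadget is a list of
# disjoint columns. (Representation and names are illustrative.)

def measure(gadget):
    # lambda(G) = sum over columns of height * width
    return sum(len(name) * width for name, width in gadget)

def cut_into_copies(gadget, gamma):
    """Cut a gadget into copies according to a probability vector gamma:
    each column is cut into full-height subcolumns of width gamma_m * w(C),
    so every copy keeps the distribution and column names of the original."""
    assert abs(sum(gamma) - 1.0) < 1e-12
    return [[(name, g * width) for name, width in gadget] for g in gamma]

# The repeated-1's starting gadget: a column '0' of height 1 and a
# column '11' of height 2, each of width 1/3, total measure 1.
G = [("0", 1 / 3), ("11", 1 / 3)]
print(measure(G))                      # close to 1.0
halves = cut_into_copies(G, [0.5, 0.5])
print(halves[0])                       # same names, half the widths
```

Cutting changes only the widths; the names, heights, and distribution of each copy agree with the original, exactly as the definition requires.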


Fig. 10. Stacking columns.

Fig. 11. Adjoining an interval to a column.

Stacking Columns: The basic stacking operation is defined for two columns as follows. Let

𝒞_1 = (L_{1,j} : 1 ≤ j ≤ h(𝒞_1)), 𝒞_2 = (L_{2,j} : 1 ≤ j ≤ h(𝒞_2))

be disjoint columns of the same width. The stack 𝒞_1 * 𝒞_2 is the column with levels L_j, where

L_j = L_{1,j}, 1 ≤ j ≤ h(𝒞_1); L_{h(𝒞_1)+j} = L_{2,j}, 1 ≤ j ≤ h(𝒞_2).

(See Fig. 10.) Note that the map T = T(𝒞_1 * 𝒞_2) extends both T(𝒞_1) and T(𝒞_2).

Adjoining Intervals: Intervals can also be adjoined at the top or bottom of a column or the top or bottom of each column of a gadget. This concept can be expressed in terms of the stacking of columns, by calling the interval to be adjoined a column of height 1, but that is sometimes a nuisance. Suppose 𝒞 = (L_j) is a column and I is an interval of the same width as 𝒞 and disjoint from the support of 𝒞. The column 𝒞 * I is formed by stacking I on top of 𝒞; we say that we have adjoined I to the top of 𝒞. The transformation T = T(𝒞) is extended by mapping the top of 𝒞 linearly onto I. Likewise, I can be adjoined to the bottom of 𝒞, obtaining I * 𝒞. (See Fig. 11.)

Fig. 13. Independent cutting and stacking. (a) 𝒢 and ℋ. (b) Cut ℋ according to dist(𝒢). (c) Cut each column of 𝒢 according to dist(ℋ). (d) Stack ℋ onto 𝒢.

Stacking Gadgets onto Columns: A gadget 𝒢 can be stacked on top of a column 𝒞 of the same width, provided that they have disjoint supports. This gives a new gadget 𝒞 * 𝒢 defined as follows. Cut 𝒞 into full-height subcolumns 𝒞_i according to the distribution of 𝒢, that is, so that w(𝒞_i) = w(𝒟_i), where 𝒟_i is the ith column of 𝒢. Stack 𝒟_i on top of 𝒞_i to get the new column 𝒞_i * 𝒟_i. This produces a new gadget 𝒞 * 𝒢, namely, the gadget with columns 𝒞_i * 𝒟_i. It is called the gadget obtained by stacking 𝒢 onto 𝒞. (See Fig. 12.)

Stacking Gadgets onto Gadgets: Gadgets can also be stacked on top of other gadgets in several ways; the most important of these is called independent cutting and stacking. If 𝒢 and ℋ are gadgets with disjoint supports and equal width, this gives the gadget 𝒢 * ℋ, defined as follows. Suppose the columns of 𝒢 are {𝒞_i}. Cut ℋ into copies {ℋ_i} so that w(ℋ_i) = w(𝒞_i). The gadget 𝒢 * ℋ is then the union of the gadgets 𝒞_i * ℋ_i. Note that w(𝒢 * ℋ) = w(𝒢), and the number of columns of 𝒢 * ℋ is the product of the number of columns of 𝒢 by the number of columns of ℋ. (See Fig. 13.)

M-Fold Stacking: A gadget can be cut into copies and these copies stacked to form new gadgets. The M-fold independent cutting and stacking of a single gadget 𝒢 is defined by cutting 𝒢 into M copies {𝒢_m} of equal width and successively independently cutting and stacking them to obtain

𝒢^{*(M)} = 𝒢_1 * 𝒢_2 * ⋯ * 𝒢_M.

Note that w(𝒢^{*(M)}) = w(𝒢)/M, and the number of columns of 𝒢^{*(M)} is the Mth power of the number of columns of 𝒢. Thus the columns of 𝒢^{*(M)} have names that are M-fold concatenations of column names of the initial gadget 𝒢. The independent cutting and stacking construction contains more than just this concatenation information, for it carries with it the information about the distribution of the concatenations, namely, that they are independently concatenated according to the distribution of 𝒢. Repeated two-fold independent cutting and stacking was used in the two examples of the preceding section. (See Fig. 14.)
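The mechanics of independent cutting and stacking can be checked in the same toy (Π-name, width) representation used earlier. This is a sketch of my own, not the paper's code: names concatenate, column counts multiply, and the normalized widths multiply like independent probabilities:

```python
def independent_stack(G, H):
    """Independent cutting and stacking G * H of two gadgets of equal
    width: H is cut into copies following the distribution of G, and the
    copy for each column of G is stacked on top of it."""
    w_G = sum(w for _, w in G)
    return [(cname + dname, (cw / w_G) * dw)
            for cname, cw in G for dname, dw in H]

def m_fold(G, M):
    """M-fold independent cutting and stacking: cut G into M copies of
    equal width, then stack them successively."""
    copies = [[(n, w / M) for n, w in G] for _ in range(M)]
    result = copies[0]
    for nxt in copies[1:]:
        result = independent_stack(result, nxt)
    return result

G = [("0", 0.5), ("11", 0.5)]     # illustrative two-column gadget
G3 = m_fold(G, 3)
print(len(G3))                    # 2**3 = 8 columns
print(sum(w for _, w in G3))      # total width w(G)/3, as in the text
print(sorted(n for n, _ in G3))   # 3-fold concatenations of '0' and '11'
```

The width of each new column is the product of the normalized widths of its factors times the copy width, which is exactly the "independent concatenation" statement in the text.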

Cyclical Stacking: A second form of cutting and stacking has recently been used in [12], [13] and is useful in connection with our negative answer to Problem 3, the waiting-time problem, [21]. Let {𝒢_m, 1 ≤ m ≤ M} be an ordered collection of disjoint gadgets, all of the same width. The cyclical independent cutting and stacking of this collection is the gadget

𝒢_1 * 𝒢_2 * ⋯ * 𝒢_M.   (2)

Thus a column name from 𝒢_1 * 𝒢_2 * ⋯ * 𝒢_M is a concatenation of a column name from 𝒢_1, followed by a name from 𝒢_2, then a name from 𝒢_3, and so on. Note that, as in the preceding paragraph, the definition of cyclical independent cutting and stacking implies that these names are put in independently, according to the distribution of names in the separate gadgets.

C. Defining Processes from Gadgets

Let T be a transformation of the unit interval that preserves Lebesgue measure and let Π be a partition of the unit interval.

Definition 1: The pair T, Π admits a gadget 𝒢 if Π is compatible with 𝒢 and if the restriction of T to the union of all but the top level of each column of 𝒢 is equal to the transformation T(𝒢) defined by the gadget.

We also say that a process admits a gadget 𝒢 if it is equivalent to the process defined by a pair T, Π that admits 𝒢. To say that an ergodic process admits a gadget 𝒢 gives partial information of the following form about the typical sequences of the process.

A typical sequence x = x_1^∞ ∈ A^∞ can be partitioned into blocks x(1), x(2), ….

Each block x(j) is either assigned to a column 𝒞 of 𝒢 whose name is x(j) or is left unassigned.

The limiting relative frequency of j's such that x(j) is assigned to 𝒞 is λ(𝒞).

This is the basic frequency interpretation of the gadget concept.

Fig. 14. Two-fold independent cutting and stacking. (a) Cut 𝒢 into two copies. (b) Cut 𝒢_2 into two more copies. (c) Stack.

Fig. 15. Gadget admitted by T.

As an example, suppose an ergodic process admits the gadget shown in Fig. 15, where the 6 left-hand intervals each have length 0.1 and the 3 right-hand intervals each have length 0.05, so the support of the gadget has measure 0.75. This tells us that a typical sequence for the process can be partitioned into blocks of length 4, 2, 3, and blocks of unspecified lengths. Blocks consisting of 1000 cover 0.4 of the sequence; blocks consisting of 11 cover 0.2 of the sequence; and blocks consisting of 100 cover 0.15 of the sequence. The gadget gives no information about the remainder of the sequence, nor does it specify how the blocks are put together.
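The arithmetic in this example can be checked mechanically. The following sketch (my own toy representation, not from the paper) encodes the gadget described in the text as three (name, width) columns:

```python
# The gadget of the example: the block '1000' (height 4) and the block
# '11' (height 2) sit over left-hand intervals of length 0.1, and the
# block '100' (height 3) over right-hand intervals of length 0.05.
columns = [("1000", 0.1), ("11", 0.1), ("100", 0.05)]

for name, w in columns:
    # fraction of a typical sequence covered by this block type
    print(name, len(name) * w)

support = sum(len(n) * w for n, w in columns)
print(support)   # about 0.75, the measure of the gadget's support
```

The covered fractions 0.4, 0.2, and 0.15 and the total 0.75 match the figures quoted in the text.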

As previously noted, to say that a process admits a gadget gives partial information about the structure of typical sequences. To gain fuller information we need an increasing sequence of gadgets that exhausts the unit interval. Let us recall that a gadget 𝒢 extends a gadget ℋ if the support of 𝒢 includes the support of ℋ, if T(𝒢) extends T(ℋ), and if Π(𝒢) extends Π(ℋ). The cutting and stacking method constructs a transformation T and partition Π by defining a sequence of gadgets {𝒢_n} with the following conditions.

1) lim_n w(𝒢_n) = 0 and lim_n λ(𝒢_n) = 1.

2) 𝒢_{n+1} extends 𝒢_n.

The transformation T = T({𝒢_n}) is defined to be the common extension of the T(𝒢_n) and the partition Π =


Π({𝒢_n}) is defined to be the common extension of the Π(𝒢_n). The transformation T(𝒢_n) is defined on all but the top levels of 𝒢_n and the complement of its support; hence Conditions 1) and 2) guarantee that T is defined on a set of full measure. Likewise, Π is defined almost everywhere. Note that the transformation T automatically preserves Lebesgue measure λ; thus the process defined by T and Π is stationary. The following terminology summarizes these ideas.

Definition 2: A sequence of gadgets {𝒢_n} is complete if it satisfies Conditions 1) and 2). T = T({𝒢_n}) is called its associated transformation and Π = Π({𝒢_n}) is called its associated partition. The associated process {X_n} is defined by selecting ω at random in the unit interval and setting X_n(ω) = a if T^n ω ∈ π_a.

Thus a complete sequence defines a process, essentially by revealing more and more information about the structure of typical sequences, until, in the limit, the typical sequences are fully specified. Note also that if T is the associated transformation of a complete sequence then T is necessarily invertible. This is because on each 𝒢_n the mapping T^{-1} is defined on all but the bottom level of each column, and Condition 1) guarantees that the measure of the union of the bottom levels shrinks to 0.

A final note about our terminology. The kind of extension we actually use can be formulated in a more explicit form, which is often useful and which, in some sense, justifies the name "cutting and stacking." We say that a gadget 𝒢 is built by cutting and stacking from a gadget ℋ if 𝒢 is obtained by first cutting ℋ into copies, then, possibly, adjoining some additional intervals at the top and/or bottom of each column and/or adding some further columns, and then stacking the resulting collection of columns in some way. An alternative way to say this is to say that there is a gadget ℋ′ disjoint from ℋ such that each column of 𝒢 is a union of full-height subcolumns of ℋ′ ∪ ℋ. It is not hard to show that if 𝒢 is built by cutting and stacking from ℋ and ℋ is built by cutting and stacking from 𝒦 then 𝒢 is built by cutting and stacking from 𝒦.

It is not quite true that the concept of "extension" is equivalent to the concept of "built by cutting and stacking," for 𝒢 can be an extension of ℋ without being built by cutting and stacking from ℋ. It is true, however, that if 𝒢 extends ℋ then the columns of 𝒢 can be cut into full-height subcolumns so that the resulting gadget is built from ℋ by cutting and stacking. In particular, without loss of generality, Condition 2) could be replaced by the condition:

2′) 𝒢_{n+1} is built by cutting and stacking from 𝒢_n.

D. Examples Revisited

Now let us apply our terminology to the two examples of the preceding section. For the first example, the coin-tossing process, define 𝒢_1 to consist of two intervals of length 1/2, one assigned to π_0, the other to π_1. Then proceed by induction, defining 𝒢_{n+1} to be the two-fold independent cutting and stacking of 𝒢_n. The sequence {𝒢_n} is complete; its associated process is the coin-tossing process. The same process can, of course, be defined by many different gadget constructions. For example, we could start with the same 𝒢_1 and define 𝒢_{n+1} to be the ten-fold independent cutting and stacking of 𝒢_n.

For the second example, the repeated 1's process, define 𝒢_1 to consist of two columns of width 1/3, one of height 1 which is assigned to π_0, the other of height 2 that is assigned to π_1. Then proceed as in the preceding paragraph, defining 𝒢_{n+1} to be the two-fold independent cutting and stacking of 𝒢_n. The sequence {𝒢_n} is again complete; this time its associated process is the repeated 1's process.
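The coin-tossing construction just described can be simulated directly. In this sketch (the toy (Π-name, width) representation is mine, not the paper's), three stages of two-fold independent cutting and stacking already yield the uniform distribution on 4-blocks:

```python
def two_fold(G):
    """Two-fold independent cutting and stacking: cut G into two copies
    of equal width and independently cut and stack the second onto the
    first; names concatenate, normalized widths multiply."""
    w_half = sum(w for _, w in G) / 2
    A = [(n, w / 2) for n, w in G]          # first copy
    B = [(n, w / 2) for n, w in G]          # second copy
    return [(a + b, (aw / w_half) * bw) for a, aw in A for b, bw in B]

G = [("0", 0.5), ("1", 0.5)]                # stage 1: two intervals of length 1/2
for _ in range(2):
    G = two_fold(G)                         # stages 2 and 3

names = sorted(n for n, _ in G)
print(len(G), names == [format(i, "04b") for i in range(16)])
print(all(abs(len(n) * w - 1 / 16) < 1e-12 for n, w in G))
```

At stage 3 there are 16 columns of height 4, one for each 4-bit name, each of measure 1/16, which matches the i.i.d. fair-coin block probabilities; this is why the associated process is coin tossing.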

IV. THEORETICAL RESULTS

The cutting and stacking method always produces a stationary process, because the mapping T is defined so that it preserves Lebesgue measure. The process is not guaranteed to be ergodic, however, and even if it is ergodic it may not satisfy stronger properties such as mixing, completely positive entropy, or the very weak Bernoulli property of Ornstein. In this section we discuss a condition that guarantees ergodicity and show how joint distributions can be estimated from a gadget. We also briefly sketch the construction of our examples that give negative answers to the three problems stated in the introduction.

A. Ergodicity

Throughout this subsection {𝒢_n} will denote a complete sequence of gadgets with associated transformation T and partition Π. Tilde notation will be used to indicate support; for example, 𝒞̃ will denote the support of the column 𝒞 and 𝒢̃ will denote the support of the gadget 𝒢.

A condition guaranteeing ergodicity is that the columns of a given 𝒢_n are well-distributed among the columns of 𝒢_N when N is sufficiently large relative to n. To make this concept precise, suppose the gadget 𝒢 is built by cutting and stacking from the gadget ℋ, suppose 𝒞 is a column of 𝒢, suppose 𝒟 is a column of ℋ, and note that 𝒞̃ ∩ 𝒟̃ is a union of full-height subcolumns of 𝒟 of width w(𝒞), consisting of those subcolumns of 𝒟 that are used to build 𝒞. (See Fig. 16.) In particular, the conditional probability λ(𝒞̃|𝒟̃) = λ(𝒞̃ ∩ 𝒟̃)/λ(𝒟̃) can be interpreted as the fraction of the top of 𝒟 that appears in the column 𝒞.

Now we state the definition of well-distributed and our principal theorem.

Definition 3: Suppose 𝒢 is built from ℋ by cutting and stacking and 0 < ε < 1. ℋ is (1 − ε)-well-distributed in 𝒢 if

λ(𝒞̃|𝒟̃) ≥ (1 − ε)λ(𝒞̃)

for each column 𝒞 in 𝒢 and each column 𝒟 in ℋ.

Fig. 16. Subcolumns of 𝒟 that appear in 𝒞.

Theorem 1: The complete sequence {𝒢_n} defines an ergodic process if it has the property that given ε > 0 and n there is an N > n such that 𝒢_n is (1 − ε)-well-distributed in 𝒢_N.

Proof: We give the proof only for the case when 𝒢_{n+1} is built by cutting and stacking from 𝒢_n. Let B be a measurable set of positive measure such that T^{-1}B = B. The goal is to show that λ(B) = 1. Towards this end note that since the intervals of 𝒢_n shrink to 0 as n → ∞ they must generate the σ-algebra. Therefore there must be an n such that λ(𝒢̃_n) > 1 − ε and an interval I of 𝒢_n such that

λ(B ∩ I) ≥ (1 − ε²)λ(I).

Let 𝒟 be the column of 𝒢_n to which I belongs and note that T^k(B ∩ I) = B ∩ T^k I for each integer k, and hence,

λ(𝒟̃ ∩ B) ≥ (1 − ε²)λ(𝒟̃).   (3)

Now choose N so large that 𝒢_n is (1 − ε²)-well-distributed in 𝒢_N. A column 𝒞 in 𝒢_N will be called B-almost full in 𝒟 if

λ(𝒞̃ ∩ 𝒟̃ ∩ B) ≥ (1 − ε)λ(𝒞̃ ∩ 𝒟̃).

Suppose 𝒞 is B-almost full in 𝒟 and suppose {𝒟_i} is the collection of full-height subcolumns of 𝒟 of width w(𝒞) that are used to build 𝒞. The almost-full condition implies that 𝒞 must be B-almost full in at least one of these subcolumns, that is, there is an i such that

λ(𝒞̃ ∩ 𝒟̃_i ∩ B) ≥ (1 − ε)λ(𝒞̃ ∩ 𝒟̃_i).

In particular, some level of 𝒟_i must meet B in a set of measure at least (1 − ε)w(𝒞). The argument used to prove (3) then gives

λ(𝒞̃ ∩ B) ≥ (1 − ε)λ(𝒞̃).   (4)

The Markov inequality, together with the definition of well-distributed and the fact that λ(𝒢̃_N) ≥ 1 − ε, shows that the set of B-almost full columns has total measure at least 1 − 3ε. This combined with (4) shows that λ(B) ≥ (1 − 3ε)(1 − ε), which completes the proof of the theorem. □

The preceding theorem can be reformulated in a number of useful ways. All we needed to know was that any given invariant set eventually meets a large fraction of some interval of 𝒢_n and that each column of 𝒢_n is eventually well-distributed in a later gadget. Once we have found such an n then we are free to make n larger. Thus the following is an immediate corollary to our theorem.

Corollary 1: If {𝒢_n} is complete and 𝒢_n is (1 − ε_n)-well-distributed in 𝒢_{n+1}, where ε_n → 0, then it defines an ergodic process.

Remark 2: If we start with a gadget 𝒢_1 of measure 1 and do only repeated two-fold independent cutting and stacking then the conditions of Theorem 1 will certainly hold. This shows, in particular, that the two examples constructed in Section II-A and Section II-B are ergodic. They are, of course, mixing; this can be shown by extending Theorem 1 to obtain conditions for mixing. A much stronger result can be proved in this case, namely, if two columns of the starting gadget 𝒢_1 have heights differing by 1 and we do only independent cutting and stacking then the final process will be a B-process. The reason is that the partition into column levels at stage n will define an aperiodic Markov chain. The details of this argument can be found in [15].

B. Estimating Joint Distributions

A gadget gives us information about the frequencies of nonoverlapping blocks that appear in typical sequences. Frequencies of overlapping blocks, that is, joint distributions, can be estimated, provided the block length is short relative to column height. Let us examine this approximation idea more closely. Throughout this subsection {𝒢_n} will denote a complete sequence of gadgets with associated transformation T, partition Π, and process {X_n}. We will continue to use tilde notation for support; e.g., 𝒞̃ denotes the support of the column 𝒞.

A sequence a_1^m of length m defines the cylinder set

[a_1^m] = {x ∈ A^∞ : x_i = a_i, 1 ≤ i ≤ m},

which may be expressed directly in terms of the transformation T and partition Π as

[a_1^m] = {ω : T^i ω ∈ π_{a_i}, 1 ≤ i ≤ m}.

The (absolute) frequency f(a_1^m | c_1^h) of occurrence of the block a_1^m in the sequence c_1^h is defined by sliding a window of length m along c_1^h and counting occurrences of a_1^m, that is,

f(a_1^m | c_1^h) = |{i : c_i^{i+m−1} = a_1^m, 1 ≤ i ≤ h − m + 1}|.

Note that we ignore the final m − 1 terms of c_1^h. The relative (conditional) frequency is defined by p(a_1^m | 𝒞) = f(a_1^m | c_1^h)/h.

Now let us translate the frequency idea into column and gadget language. Let 𝒞 = (L_1, L_2, …, L_h) be a column of height h and denote its Π-name by c_1^h, so that L_j ⊂ π_{c_j}. Note that if i ≤ h − m + 1 then to say that a point ω ∈ L_i belongs to the cylinder set [a_1^m] is just to say that L_{i+j−1} ⊂ π_{a_j}, 1 ≤ j ≤ m. This establishes the following basic connection between absolute frequencies and column width:

λ([a_1^m] ∩ ⋃_{i=1}^{h−m+1} L_i) = f(a_1^m | c_1^h) w(𝒞).

Dividing both sides by λ(𝒞̃) = h w(𝒞) then produces the following bound:

p(a_1^m | 𝒞) ≤ λ([a_1^m] | 𝒞̃) ≤ p(a_1^m | 𝒞) + (m − 1)/h.

Thus, except for the end effect of the top m − 1 levels of a column, the conditional probability of a cylinder set relative to a column can be estimated by counting frequencies along the column name. The conditional probability of a cylinder set relative to a gadget is obtained by summing over the columns of the gadget:

λ([a_1^m] | 𝒢̃) = Σ_𝒞 λ([a_1^m] | 𝒞̃) λ(𝒞̃)/λ(𝒢̃).

These observations imply that if most of the support of a gadget is contained in columns that are long relative to m then mth-order conditional probabilities can be estimated by counting frequencies along the columns of the gadget. To state this precisely, define 𝒢(h) to be the set of columns of 𝒢 of height at least h.

Theorem 2: If h > m/ε and λ(𝒢̃(h)) > (1 − ε)λ(𝒢̃) then

|λ([a_1^m] | 𝒢̃) − Σ_𝒞 p(a_1^m | 𝒞) λ(𝒞̃)/λ(𝒢̃)| ≤ 2ε.

In particular, if 𝒢 is a gadget of width less than δ and measure greater than 1 − δ and if δ is small enough, then most of the mass of 𝒢 is contained in columns whose height is long relative to m. Theorem 2 tells us that we can then estimate the mth-order joint distribution,

λ([a_1^m]) = Prob{X_i = a_i, 1 ≤ i ≤ m},

by taking the average of the relative frequencies of a_1^m in each column name, weighted by the column measure.
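The estimation recipe of Theorem 2 fits in a few lines. The sketch below is illustrative (function and variable names are mine): count overlapping occurrences along each column name, divide by the height, and average with weights λ(𝒞̃)/λ(𝒢̃):

```python
def estimate(gadget, a):
    """Estimate lambda([a]) by the measure-weighted average of the
    relative frequencies p(a | C) = f(a | c)/h along the column names.
    Columns are (name, width) pairs."""
    m = len(a)
    total = sum(len(n) * w for n, w in gadget)      # lambda(G)
    est = 0.0
    for n, w in gadget:
        f = sum(1 for i in range(len(n) - m + 1) if n[i:i + m] == a)
        est += (f / len(n)) * (len(n) * w) / total  # p(a|C) * lambda(C)/lambda(G)
    return est

# A stage gadget for coin tossing: all sixteen 4-bit names, measure 1/16 each.
gadget = [(format(i, "04b"), 1 / 64) for i in range(16)]
print(estimate(gadget, "1"))    # 0.5, exactly the first-order probability
print(estimate(gadget, "11"))   # 3/16; the true value 1/4 is within (m-1)/h = 1/4
```

The second estimate shows the end effect: the final m − 1 terms of each name are ignored, so the estimate undershoots the true cylinder probability, but by no more than (m − 1)/h, as the bound above predicts.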

C. Counterexamples

To demonstrate the power of the cutting and stacking method we now sketch our construction of a B-process for which the answer to Problem 2 is negative. At the end of this sketch, we show how the construction can be modified to yield a negative answer to the divergence-rate question of Problem 1 and make a few remarks about our negative answer to the waiting-time problem, Problem 3.

Let P be the i.i.d. coin-tossing process. We want to construct a B-process Q ≠ P such that D(P‖Q) = 0. The problem is easy to solve if we allow Q to be nonergodic, as suggested by the results in [6]. We could, for example, take R to be any stationary process, fix an α in the range 0 < α < 1 and, for each n, define Q^n = αP^n + (1 − α)R^n.

Then

Q^n(a_1^n) ≥ αP^n(a_1^n),

and hence D_n(P^n‖Q^n) ≤ −(1/n) log α, so that

D(P‖Q) = 0.
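The mixture bound is easy to verify numerically. Here is a small check with illustrative parameters (R is taken to be a biased i.i.d. process purely for the demonstration):

```python
import math

# If Q^n = alpha * P^n + (1 - alpha) * R^n, then Q^n(a) >= alpha * P^n(a),
# so the per-symbol divergence D_n(P^n || Q^n) is at most -(1/n) log alpha.
n, alpha = 6, 0.1
words = [format(i, "06b") for i in range(2 ** n)]

def iid(p1, w):
    # probability of the word w under i.i.d. bits with P(1) = p1
    k = w.count("1")
    return p1 ** k * (1 - p1) ** (len(w) - k)

P = {w: iid(0.5, w) for w in words}          # fair coin tossing
R = {w: iid(0.9, w) for w in words}          # an arbitrary other process
Q = {w: alpha * P[w] + (1 - alpha) * R[w] for w in words}

D_n = sum(P[w] * math.log(P[w] / Q[w]) for w in words) / n
print(D_n, -math.log(alpha) / n)             # D_n stays below the bound
```

Making α_n tend to 0 slowly, as in condition (6) below, drives the bound, and hence the normalized divergence, to 0.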

The key idea here is clearly the lower bound on Q^n(a_1^n). Suppose

1 > α_n ↓ 0, (1/n) log α_n → 0.   (6)

It is enough to construct a B-process Q, which is not equal to the coin-tossing process P, such that

Q^n(a_1^n) ≥ α_n P^n(a_1^n), a_1^n ∈ A^n.   (7)

We first show how to obtain an ergodic Q. The basic idea is to make Q look like the coin-tossing process P on part of the space, so that (7) will hold for a given n, and make Q look like something different on the rest of the space. Then a fraction of the first part on which P is being mimicked is cut off and mixed into the second part. If this fraction is small enough then (7) will hold for n + 1; passing to the limit will then yield our desired Q. This idea is easy to implement using cutting and stacking; we use two disjoint gadgets, mimicking P on the left gadget and building something different on the right gadget.

We already know how to build P by a sequence of gadgets; just let 𝒢_{n+1} be the two-fold independent cutting and stacking of 𝒢_n, where 𝒢_1 consists of the two intervals π_0 = [0, 1/2), π_1 = [1/2, 1), as discussed in Section II-A and Section III-D. The process Q will be built in stages. The nth-stage gadget will be the disjoint union of two gadgets of total measure 1, a left gadget ℒ_n and a right gadget ℛ_n. The left gadget ℒ_n will be a copy of 𝒢_{k_n} of measure β_n for suitable choice of k_n and β_n. This is what we mean by "mimicking the construction of P on the left gadget." The structure of ℛ_n will become apparent as we describe how to go to stage n + 1.

To go to stage n + 1 the following steps are performed, in the order given. (See Fig. 17.)

1) Cut the left gadget ℒ_n into two copies, ℒ_n^1 and ℒ_n^2, of sizes β_{n+1} < β_n and β_n − β_{n+1}, respectively.

2) Do L_n-fold independent cutting and stacking to the first copy ℒ_n^1 to obtain the new left gadget ℒ_{n+1}.

3) Take the union of the second copy, ℒ_n^2, with ℛ_n, obtaining ℒ_n^2 ∪ ℛ_n.

4) Do R_n-fold independent cutting and stacking to the union ℒ_n^2 ∪ ℛ_n to produce the new right gadget ℛ_{n+1}.

The first step guarantees that the new left gadget ℒ_{n+1} will have measure β_{n+1}. As long as L_n is a power of 2, the second step guarantees that ℒ_{n+1} will be a copy of a 𝒢_{k_{n+1}} for some k_{n+1} > k_n. The initial right gadget and the sequences β_n ↓ 0, L_n ↑ ∞, and R_n can be chosen so that the following will hold.

a) Q^n(a_1^n) ≥ α_n P^n(a_1^n), a_1^n ∈ A^n.

b) Q is a B-process.
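The measure bookkeeping in steps 1)-4) can be sketched as follows. The particular β_n values here are hypothetical, chosen only to illustrate the flow of mass from the left side to the right:

```python
# At each stage the left gadget keeps measure beta_{n+1} and donates
# beta_n - beta_{n+1} to the right side; the total measure stays 1.
betas = [0.5, 0.3, 0.2, 0.1, 0.05]     # hypothetical decreasing choices
left, right = betas[0], 1 - betas[0]
for b_next in betas[1:]:
    donated = left - b_next            # step 1: the second copy L_n^2
    left = b_next                      # step 2: the new left gadget
    right += donated                   # steps 3) and 4): merged into the right gadget
    print(left, right)                 # left shrinks toward 0, right grows toward 1
```

Since β_n ↓ 0, the part of the space that mimics P shrinks away, yet at each finite stage it is large enough to secure the lower bound a).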


Fig. 17. Construction of stage n + 1. (a) nth stage. (b) Cut ℒ_n into two parts ℒ_n^1 and ℒ_n^2. (c) Do L_n-fold independent cutting and stacking to ℒ_n^1 to produce ℒ_{n+1}. (d) Do R_n-fold independent cutting and stacking to ℒ_n^2 ∪ ℛ_n to produce ℛ_{n+1}.

Fig. 18. First stage left and right gadgets.

The choice of β_n is quite easy; all we need is that β_n > α_n and 1 > β_n ↓ 0. The initial left gadget ℒ_1 will consist of two intervals, each of length β_1/2, with one assigned to π_0, the other to π_1. Then ℒ_1 is a copy of 𝒢_1 and the inequality a) certainly holds for n = 1. The initial right gadget ℛ_1 consists of one interval of length 1 − β_1; it is assigned to π_0. (See Fig. 18.) No matter what we do in later stages we are guaranteed that Q ≠ P, since Q(0) > P(0).

The parameter L_n controls the amount of cutting and stacking needed to go to stage n + 1; hence, it determines h_{n+1}, the height of ℒ_{n+1}. (Note that the columns of 𝒢_{k_{n+1}}, hence of ℒ_{n+1}, all have the same height, h_{n+1}.) If this height is long enough, then, by the law of large numbers, for most columns 𝒞 ∈ 𝒢_{k_{n+1}} each relative frequency p(a_1^{n+1}|𝒞) will be about 2^{−(n+1)}. This same frequency property will have to hold in the corresponding columns of the copy ℒ_{n+1} for most columns of ℒ_{n+1}, which implies that

Q([a_1^{n+1}] | ℒ̃_{n+1}) ≈ 2^{−(n+1)}.

Thus,

Q([a_1^{n+1}] ∩ ℒ̃_{n+1}) ≈ 2^{−(n+1)} β_{n+1} = β_{n+1} P^{n+1}([a_1^{n+1}]).

Since β_{n+1} > α_{n+1} we can choose L_n so large that the inequality a) holds for n + 1. Thus we have D(P‖Q) = 0.

The parameter R_n, which controls the amount of independent cutting and stacking of ℒ_n^2 ∪ ℛ_n used to produce the new right gadget ℛ_{n+1}, is free to be chosen as we please. To force Q to be ergodic we just choose R_n large enough to force ℒ_n^2 ∪ ℛ_n to be (1 − ε_n)-well-distributed in ℛ_{n+1}, which is possible by Remark 2. Appropriate choice of ε_n will then force Q to be ergodic, by Theorem 1.

The reader is referred to [20] for the details of the above construction. To obtain a B-process two modifications need to be made in the above construction. First, the initial right gadget ℛ_1 is redefined to consist of one column of height 2 and width (1 − β_1)/2; both levels are assigned to π_0. This guarantees that ℒ_n^2 ∪ ℛ_n will always contain two columns with heights differing by 1. The arguments of [15] can then be applied to choose the R_n so large that a suitable approximate form of Ornstein's very weak Bernoulli property, [10], holds, and thereby guarantee that the final process is a B-process. The reader is again referred to [20] for the details.

A negative answer to the divergence-rate problem, Problem 1, can be obtained by modifying the previous construction. The first modification is to alternate between transferring small amounts to the right side, to obtain lower bounds like (7) for a subsequence of n's, and transferring large amounts, to get upper bounds of the form Q^n(a_1^n) ≤ γ_n P^n(a_1^n), a_1^n ∈ A^n, for other values of n. The second modification is that some intervals, which are marked with a different symbol, say 2, are adjoined to the right side to guarantee this upper bound. Suitable choice of α_n will yield liminf_n D_n(P^n‖Q^n) = 0, while suitable choice of γ_n will yield limsup_n D_n(P^n‖Q^n) = 1. The reader is referred to [20] for the details.

A negative answer to the waiting-time problem can be obtained by modifying a recent gadget construction of Ornstein and Weiss, [12], [13]. At stage n, there will be a large number g_n of disjoint subgadgets {𝒢_{n,g} : g ≤ g_n} with a very strong separation property, namely, long enough blocks selected from names of columns drawn from different subgadgets must disagree in about 1/2 their places, provided they are not too long relative to gadget height. The subgadgets are then merged by cyclical cutting and stacking in various cyclical patterns to obtain a much larger number g_{n+1} of (n + 1)-stage subgadgets with only a small change in the separation property. The g_n must grow rapidly to guarantee that points will stay far apart, while the merging guarantees ergodicity in the limit. If x and y are selected at random then there will be infinitely many stages when they belong to different gadgets; the


gadgets can be made so long and thin that the waiting time to see in y an initial segment of x will be as large as we please. The reader is referred to [21] for details.

Remark 3: The method of cutting and stacking was invented by von Neumann and Kakutani; see Friedman's book, [4], for a discussion of early work. A highlight of this early work was Chacon's simple example of a weak mixing but not mixing transformation, [1]. Ornstein used the method to construct examples of processes with trivial tail that are not B-processes, [11]; since then cutting and stacking has been frequently used to construct examples with various kinds of mixing properties. The method also provides a useful way to think about processes; for example, we recently used the method to construct counterexamples showing that entropy is only loosely connected to the string matching problem of DNA analysis and to prefix generation, [17], [18]. Cutting and stacking suggested our original solutions; the published examples use instead a simpler, but related, method for converting from block-codes to stationary codes, [8], [16]. The cutting and stacking method was also used to show that universal redundancy rates do not exist, [19], although it was not central to the construction.

ACKNOWLEDGMENTS

This paper is an outgrowth of lectures on the cutting and stacking method given at the Mathematics Institute of the Hungarian Academy of Sciences in the fall of 1989. The author is grateful to I. Csiszár, K. Marton, and G. Tusnády of the Institute, and to A. Wyner and J. Ziv for many useful conversations and suggestions about the ideas discussed here.

REFERENCES

[1] R. Chacon, "Transformations with continuous spectrum," J. Math. Mech., vol. 16, 1966.

[2] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Budapest: Akadémiai Kiadó, 1981.

[3] I. Csiszár, K. Marton, and P. Shields, "The divergence-rate for processes," in preparation.

[4] N. Friedman, Introduction to Ergodic Theory. New York: Van Nostrand Reinhold, 1970.

[5] R. Gray, "Sliding-block source coding," IEEE Trans. Inform. Theory, vol. IT-21, pp. 357-368, July 1975.

[6] J. Kieffer, "A counterexample to Perez's generalization of the Shannon-McMillan theorem," Ann. Probab., vol. 1, pp. 362-364, 1973.

[7] K. Marton, "A simple proof of the blowing-up lemma," IEEE Trans. Inform. Theory, vol. IT-32, pp. 445-447, May 1986.

[8] D. Neuhoff and P. Shields, "Block and sliding-block source coding," IEEE Trans. Inform. Theory, vol. IT-23, pp. 211-215, Mar. 1977.

[9] D. S. Ornstein, "An application of ergodic theory to probability theory," Ann. Probab., vol. 1, pp. 43-58, 1973.

[10] —, Ergodic Theory, Randomness, and Dynamical Systems. New Haven, CT: Yale Univ. Press, 1973.

[11] D. Ornstein and P. Shields, "An uncountable family of K-automorphisms," Advances in Math., vol. 10, pp. 63-88, 1973.

[12] —, "The d̄-recognition of processes," Prob. Theory and Related Fields, to appear.

[13] D. Ornstein and B. Weiss, "How sampling reveals a process," Ann. Probab., vol. 18, pp. 905-930, 1990.

[14] —, "Entropy and data compression schemes," preprint.

[15] P. Shields, "Cutting and independent stacking of intervals," Math. Syst. Theory, vol. 7, pp. 1-4, 1973.

[16] —, "Stationary coding of processes," IEEE Trans. Inform. Theory, vol. IT-25, pp. 283-291, May 1979.

[17] —, "String matching: The ergodic case," Ann. Probab., to appear.

[18] —, "Entropy and prefixes," Ann. Probab., to appear.

[19] —, "Universal redundancy rates don't exist," IEEE Trans. Inform. Theory, submitted.

[20] —, "Some divergence-rate counterexamples," in preparation.

[21] —, "Waiting times for processes," in preparation.

[22] A. Wyner and J. Ziv, "Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression," IEEE Trans. Inform. Theory, vol. 35, pp. 1250-1258, Nov. 1989.

