
Chapter 4

FINITE-STATE MARKOV CHAINS

4.1 Introduction

The counting processes {N(t), t ≥ 0} of Chapters 2 and 3 have the property that N(t) changes at discrete instants of time, but is defined for all real t ≥ 0. Such stochastic processes are generally called continuous time processes. The Markov chains to be discussed in this and the next chapter are stochastic processes defined only at integer values of time, n = 0, 1, . . . . At each integer time n ≥ 0, there is an integer valued random variable (rv) Xn, called the state at time n, and the process is the family of rv's {Xn, n ≥ 0}. These processes are often called discrete time processes, but we prefer the more specific term integer time processes. An integer time process {Xn; n ≥ 0} can also be viewed as a continuous time process {X(t); t ≥ 0} by taking X(t) = Xn for n ≤ t < n + 1, but since changes only occur at integer times, it is usually simpler to view the process only at integer times.

In general, for Markov chains, the set of possible values for each rv Xn is a countable set usually taken to be {0, 1, 2, . . . }. In this chapter (except for Theorems 4.2 and 4.3), we restrict attention to a finite set of possible values, say {1, . . . , M}. Thus we are looking at processes whose sample functions are sequences of integers, each between 1 and M. There is no special significance to using integer labels for states, and no compelling reason to include 0 as a state for the countably infinite case and not to include 0 for the finite case. For the countably infinite case, the most common applications come from queueing theory, and the state often represents the number of waiting customers, which can be zero. For the finite case, we often use vectors and matrices, and it is more conventional to use positive integer labels. In some examples, it will be more convenient to use more illustrative labels for states.

Definition 4.1. A Markov chain is an integer time process, {Xn, n ≥ 0} for which each rv Xn, n ≥ 1, is integer valued and depends on the past only through the most recent rv Xn−1,


i.e., for all integer n ≥ 1 and all integer i, j, k, . . . ,m,

Pr{Xn=j | Xn−1=i,Xn−2=k, . . . ,X0=m} = Pr{Xn=j | Xn−1=i}. (4.1)

Pr{Xn=j | Xn−1=i} depends only on i and j (not n) and is denoted by

Pr{Xn=j | Xn−1=i} = Pij . (4.2)

The initial state X0 has an arbitrary probability distribution, which is required for a full probabilistic description of the process, but is not needed for most of the results. A Markov chain in which each Xn has a finite set of possible sample values is a finite-state Markov chain.

The rv Xn is called the state of the chain at time n. The possible values for the state at time n, namely {1, . . . , M} or {0, 1, . . . }, are also generally called states, usually without too much confusion. Thus Pij is the probability of going to state j given that the previous state is i; the new state, given the previous state, is independent of all earlier states. The use of the word state here conforms to the usual idea of the state of a system: the state at a given time summarizes everything about the past that is relevant to the future. Note that the transition probabilities, Pij, do not depend on n. Occasionally, a more general model is required where the transition probabilities do depend on n. In such situations, (4.1) and (4.2) are replaced by

Pr{Xn=j | Xn−1=i,Xn−2=k, . . . ,X0=m} = Pr{Xn=j | Xn−1=i} = Pij(n). (4.3)

A process that obeys (4.3), with a dependence on n, is called a non-homogeneous Markov chain. Some people refer to a Markov chain (as defined in (4.1) and (4.2)) as a homogeneous Markov chain. We will discuss only the homogeneous case (since not much of general interest can be said about the non-homogeneous case) and thus omit the word homogeneous as a qualifier. An initial probability distribution for X0, combined with the transition probabilities {Pij} (or {Pij(n)} for the non-homogeneous case), defines the probabilities for all events.
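As an illustration of the defining property, the transition probabilities Pij are all that is needed to sample a path of the chain: given the current state, the next state is drawn from the corresponding row of [P]. The following sketch (not from the text) assumes NumPy and labels the states 0, . . . , M−1 rather than 1, . . . , M; the numerical values of [P] are arbitrary.

```python
import numpy as np

def simulate_chain(P, pi0, n_steps, rng=None):
    """Sample X_0, ..., X_n for a finite-state Markov chain.

    P    : (M, M) transition matrix; row i is Pr{X_n = . | X_{n-1} = i}.
    pi0  : initial distribution for X_0.
    """
    rng = np.random.default_rng() if rng is None else rng
    M = P.shape[0]
    path = np.empty(n_steps + 1, dtype=int)
    path[0] = rng.choice(M, p=pi0)
    for n in range(1, n_steps + 1):
        # The next state depends on the past only through the current state.
        path[n] = rng.choice(M, p=P[path[n - 1]])
    return path

# A two-state chain with P12 = 0.3 and P21 = 0.4 (states relabeled 0 and 1).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
print(simulate_chain(P, pi0=[1.0, 0.0], n_steps=20))
```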

Markov chains can be used to model an enormous variety of physical phenomena and can be used to approximate most other kinds of stochastic processes. To see this, consider sampling a given process at a high rate in time, and then quantizing it, thus converting it into a discrete time process, {Zn; −∞ < n < ∞}, where each Zn takes on a finite set of possible values. In this new process, each variable Zn will typically have a statistical dependence on past values that gradually dies out in time, so we can approximate the process by allowing Zn to depend on only a finite number of past variables, say Zn−1, . . . , Zn−k. Finally, we can define a Markov process where the state at time n is Xn = (Zn, Zn−1, . . . , Zn−k+1). The state Xn = (Zn, Zn−1, . . . , Zn−k+1) then depends only on Xn−1 = (Zn−1, . . . , Zn−k+1, Zn−k), since the new part of Xn, i.e., Zn, is independent of Zn−k−1, Zn−k−2, . . . , and the other variables comprising Xn are specified by Xn−1. Thus {Xn} forms a Markov chain approximating the original process. This is not always an insightful or desirable model, but at least provides one possibility for modeling relatively general stochastic processes.

Markov chains are often described by a directed graph (see Figure 4.1). In the graphical representation, there is one node for each state and a directed arc for each non-zero transition probability. If Pij = 0, then the arc from node i to node j is omitted; thus the difference between zero and non-zero transition probabilities stands out clearly in the graph. Several of the most important characteristics of a Markov chain depend only on which transition probabilities are zero, so the graphical representation is well suited for understanding these characteristics. A finite-state Markov chain is also often described by a matrix [P] (see Figure 4.1). If the chain has M states, then [P] is an M by M matrix with elements Pij. The matrix representation is ideally suited for studying algebraic and computational issues.


Figure 4.1: Graphical and Matrix Representation of a 6 state Markov Chain; a directed arc from i to j is included in the graph if and only if (iff) Pij > 0.

4.2 Classification of states

This section, except where indicated otherwise, applies to Markov chains with both finite and countable state spaces. We start with several definitions.

Definition 4.2. An (n-step) walk1 is an ordered string of nodes {i0, i1, . . . , in}, n ≥ 1, in which there is a directed arc from im−1 to im for each m, 1 ≤ m ≤ n. A path is a walk in which the nodes are distinct. A cycle is a walk in which the first and last nodes are the same and the other nodes are distinct.

Note that a walk can start and end on the same node, whereas a path cannot. Also the number of steps in a walk can be arbitrarily large, whereas a path can have at most M − 1 steps and a cycle at most M steps.

Definition 4.3. A state j is accessible from i (abbreviated as i → j) if there is a walk in the graph from i to j.

For example, in Figure 4.1(a), there is a walk from node 1 to node 3 (passing through node 2), so state 3 is accessible from 1. There is no walk from node 5 to 3, so state 3 is not accessible from 5. State 2, for example, is accessible from itself, but state 6 is not accessible from itself. To see the probabilistic meaning of accessibility, suppose that a walk i0, i1, . . . , in exists from node i0 to in. Then, conditional on X0 = i0, there is a positive probability, Pi0i1, that X1 = i1, and consequently (since Pi1i2 > 0), there is a positive probability that X2 = i2. Continuing this argument, there is a positive probability that Xn = in, so that Pr{Xn=in | X0=i0} > 0. Similarly, if Pr{Xn=in | X0=i0} > 0, then there is an n-step walk from i0 to in. Summarizing, i → j if and only if (iff) Pr{Xn=j | X0=i} > 0 for some n ≥ 1. We denote Pr{Xn=j | X0=i} by P^n_ij. Thus, for n ≥ 1, P^n_ij > 0 iff the graph has an n-step walk from i to j (perhaps visiting the same node more than once). For the example in Figure 4.1(a), P²_13 = P12P23 > 0. On the other hand, P^n_53 = 0 for all n ≥ 1. An important relation that we use often in what follows is that if there is an n-step walk from state i to j and an m-step walk from state j to k, then there is a walk of m + n steps from i to k. Thus

P^n_ij > 0 and P^m_jk > 0 imply P^{n+m}_ik > 0. (4.4)

This also shows that

i → j and j → k imply i → k. (4.5)

1 We are interested here only in directed graphs, and thus undirected walks and paths do not arise.
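Since i → j exactly when P^n_ij > 0 for some n with 1 ≤ n ≤ M (a walk, if one exists, can always be shortened to at most M steps), accessibility can be checked directly from the zero/non-zero pattern of [P]. A minimal sketch, assuming NumPy and 0-based state labels:

```python
import numpy as np

def accessible(P):
    """R[i, j] is True iff state j is accessible from state i (i -> j).

    Uses the fact that i -> j iff P^n_ij > 0 for some n with 1 <= n <= M,
    working only with the zero/non-zero pattern of [P]."""
    M = P.shape[0]
    A = (P > 0).astype(int)                  # adjacency matrix of the directed graph
    walk = np.eye(M, dtype=int)
    reach = np.zeros((M, M), dtype=int)
    for _ in range(M):
        walk = (walk @ A > 0).astype(int)    # n-step walks for n = 1, 2, ..., M
        reach |= walk
    return reach.astype(bool)
```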

Definition 4.4. Two distinct states i and j communicate (abbreviated i ↔ j) if i is accessible from j and j is accessible from i.

An important fact about communicating states is that if i ↔ j and m ↔ j then i ↔ m. To see this, note that i ↔ j and m ↔ j imply that i → j and j → m, so that i → m. Similarly, m → i, so i ↔ m.

Definition 4.5. A class T of states is a non-empty set of states such that for each state i ∈ T, i communicates with each j ∈ T (except perhaps itself) and does not communicate with any j ∉ T.

For the example of Fig. 4.1(a), {1, 2, 3, 4} is one class of states, {5} is another, and {6} is another. Note that state 6 does not communicate with itself, but {6} is still considered to be a class. The entire set of states in a given Markov chain is partitioned into one or more disjoint classes in this way.
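For a numerical chain, this partition can be read off from mutual accessibility. A sketch building on the hypothetical accessible() helper above: two distinct states share a class iff each is accessible from the other, and a state that cannot return to itself (like state 6 here) becomes a singleton class.

```python
def classes(P):
    """Partition the states 0..M-1 into classes as in Definition 4.5."""
    R = accessible(P)
    remaining, result = set(range(P.shape[0])), []
    while remaining:
        i = min(remaining)
        cls = {j for j in remaining if j == i or (R[i, j] and R[j, i])}
        result.append(sorted(cls))
        remaining -= cls
    return result
```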

Definition 4.6. For finite-state Markov chains, a recurrent state is a state i that is accessible from all states that are accessible from i (i is recurrent if i → j implies that j → i). A transient state is a state that is not recurrent.

Recurrent and transient states for Markov chains with a countably infinite set of states will be defined in the next chapter.

According to the definition, a state i in a finite-state Markov chain is recurrent if there is no possibility of going to a state j from which there can be no return. As we shall see later, if a Markov chain ever enters a recurrent state, it returns to that state eventually with probability 1, and thus keeps returning infinitely often (in fact, this property serves as the definition of recurrence for Markov chains without the finite-state restriction). A state i is transient if there is some j that is accessible from i but from which there is no possible return. Each time the system returns to i, there is a possibility of going to j; eventually this possibility will occur, and then no more returns to i can occur (this can be thought of as a mathematical form of Murphy's law).


Theorem 4.1. For finite-state Markov chains, either all states in a class are transient or all are recurrent.2

Proof: Assume that state i is transient (i.e., for some j, i → j but j ↛ i) and suppose that i and m are in the same class (i.e., i ↔ m). Then m → i and i → j, so m → j. Now if j → m, then the walk from j to m could be extended to i; this is a contradiction, and therefore there is no walk from j to m, and m is transient. Since we have just shown that all nodes in a class are transient if any are, it follows that the states in a class are either all recurrent or all transient.

For the example of Fig. 4.1(a), {1, 2, 3, 4} is a transient class and {5} is a recurrent class. In terms of the graph of a Markov chain, a class is transient if there are any directed arcs going from a node in the class to a node outside the class. Every finite-state Markov chain must have at least one recurrent class of states (see Exercise 4.1), and can have arbitrarily many additional classes of recurrent states and transient states.
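The graph criterion just stated translates directly into code: a class of a finite-state chain is transient exactly when some transition leads out of it. A sketch, again using the hypothetical classes() helper:

```python
def recurrent_and_transient_classes(P):
    """Split the classes of a finite-state chain into recurrent and transient ones."""
    recurrent, transient = [], []
    for cls in classes(P):
        inside = set(cls)
        leaves = any(P[i, j] > 0
                     for i in cls for j in range(P.shape[0]) if j not in inside)
        (transient if leaves else recurrent).append(cls)
    return recurrent, transient
```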

States can also be classified according to their periods (see Figure 4.2). In Fig. 4.2(a), given that X0 = 2, we see that X1 must be either 1 or 3, X2 must then be either 2 or 4, and in general, Xn must be 2 or 4 for n even and 1 or 3 for n odd. On the other hand, if X0 is 1 or 3, then Xn is 2 or 4 for n odd and 1 or 3 for n even. Thus the effect of the starting state never dies out. Fig. 4.2(b) illustrates another example in which the state alternates from odd to even and the memory of the starting state never dies out. The states in both these Markov chains are said to be periodic with period 2.


Figure 4.2: Periodic Markov Chains

Definition 4.7. The period of a state i, denoted d(i), is the greatest common divisor (gcd) of those values of n for which P^n_ii > 0. If the period is 1, the state is aperiodic, and if the period is 2 or more, the state is periodic.3

For example, in Figure 4.2(a), P^n_11 > 0 for n = 2, 4, 6, . . . . Thus d(1), the period of state 1, is two. Similarly, d(i) = 2 for the other states in Figure 4.2(a). For Fig. 4.2(b), we have P^n_11 > 0 for n = 4, 8, 10, 12, . . . ; thus d(1) = 2, and it can be seen that d(i) = 2 for all the states. These examples suggest the following theorem.

2 This theorem is also true for Markov chains with a countable state space, but the proof here is inadequate. Also recurrent classes with a countable state space are further classified into either positive-recurrent or null-recurrent, a distinction that does not appear in the finite-state case.

3 For completeness, we say that the period is infinite if P^n_ii = 0 for all n ≥ 1. Such states do not have the intuitive characteristics of either periodic or aperiodic states. Such a state cannot communicate with any other state, and cannot return to itself, so it corresponds to a singleton class of transient states. The notion of periodicity is of primary interest for recurrent states.

Theorem 4.2. For any Markov chain (with either a finite or countably infinite number of states), all states in the same class have the same period.

Proof: Let i and j be any distinct pair of states in a class. Then i ↔ j and there is some r such that P^r_ij > 0 and some s such that P^s_ji > 0. Since there is a walk of length r + s going from i to j and back to i, r + s must be divisible by d(i). Let t be any integer such that P^t_jj > 0. Since there is a walk of length r + t + s that goes first from i to j, then to j again, and then back to i, r + t + s is divisible by d(i), and thus t is divisible by d(i). Since this is true for any t such that P^t_jj > 0, d(j) is divisible by d(i). Reversing the roles of i and j, d(i) is divisible by d(j), so d(i) = d(j).

Since the states in a class all have the same period and are either all recurrent or all transient, we refer to the class itself as having the period of its states and as being recurrent or transient. Similarly, if a Markov chain has a single class of states, we refer to the chain as having the corresponding period and being recurrent or transient.
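The period of a state can be computed straight from Definition 4.7 by collecting return times n with P^n_ii > 0 and taking their gcd. For a finite chain, return times up to a few multiples of M already determine the gcd, so a small bound suffices; the sketch below (assuming NumPy, 0-based labels) works only with the zero/non-zero pattern of [P]. For the chains of Figure 4.2, it would return 2 for every state.

```python
from math import gcd
import numpy as np

def period(P, i, n_max=None):
    """d(i) = gcd of the n >= 1 with P^n_ii > 0; returns 0 if no return is possible."""
    M = P.shape[0]
    n_max = 3 * M if n_max is None else n_max
    A = (P > 0).astype(int)
    walk = np.eye(M, dtype=int)
    d = 0
    for n in range(1, n_max + 1):
        walk = (walk @ A > 0).astype(int)   # zero/non-zero pattern of [P]^n
        if walk[i, i]:
            d = gcd(d, n)
    return d
```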

Theorem 4.3. If a recurrent class in a finite-state Markov chain has period d, then the states in the class can be partitioned into d subsets, S1, S2, . . . , Sd, such that all transitions out of subset Sm go to subset Sm+1 for m < d and to subset S1 for m = d. That is, if j ∈ Sm and Pjk > 0, then k ∈ Sm+1 for m < d and k ∈ S1 for m = d.

Proof: See Figure 4.3 for an illustration of the theorem. For a given state in the class, say state 1, define the sets S1, . . . , Sd by

Sm = {j : P^{nd+m}_1j > 0 for some n ≥ 0}; 1 ≤ m ≤ d. (4.6)

For each given j in the class, we first show that there is one and only one value of m such that j ∈ Sm. Since 1 ↔ j, there is some r for which P^r_1j > 0 and some s for which P^s_j1 > 0. Since there is a walk from 1 to 1 (through j) of length r + s, r + s is divisible by d. Define m, 1 ≤ m ≤ d, by r = m + nd, where n is an integer. From (4.6), j ∈ Sm. Now let r′ be any other integer such that P^{r′}_1j > 0. Then r′ + s is also divisible by d, so that r′ − r is divisible by d. Thus r′ = m + n′d for some integer n′ and that same m. Since r′ is any integer such that P^{r′}_1j > 0, j is in Sm for only that one value of m. Since j is arbitrary, this shows that the sets Sm are disjoint and partition the class.

Finally, suppose j ∈ Sm and Pjk > 0. Given a walk of length r = nd + m from state 1 to j, there is a walk of length nd + m + 1 from state 1 to k. It follows that if m < d, then k ∈ Sm+1 and if m = d, then k ∈ S1, completing the proof.

We have seen that each class of states (for a finite-state chain) can be classified both in terms of its period and in terms of whether or not it is recurrent. The most important case is that in which a class is both recurrent and aperiodic.

Figure 4.3: Structure of a Periodic Markov Chain with d = 3. Note that transitions only go from one subset Sm to the next subset Sm+1 (or from Sd to S1).

Definition 4.8. For a finite-state Markov chain, an ergodic class of states is a class that is both recurrent and aperiodic4. A Markov chain consisting entirely of one ergodic class is called an ergodic chain.

We shall see later that these chains have the desirable property that P^n_ij becomes independent of the starting state i as n → ∞. The next theorem establishes the first part of this by showing that P^n_ij > 0 for all i and j when n is sufficiently large. The Markov chain in Figure 4.4 illustrates the theorem by illustrating how large n must be in the worst case.

Figure 4.4: An ergodic chain with M = 6 states in which P^m_ij > 0 for all m > (M − 1)² and all i, j, but P^{(M−1)²}_11 = 0. The figure also illustrates that an M state Markov chain must have a cycle with M − 1 or fewer nodes. To see this, note that an ergodic chain must have cycles, since each node must have a walk to itself, and any subcycle of repeated nodes can be omitted from that walk, converting it into a cycle. Such a cycle might have M nodes, but a chain with only an M node cycle would be periodic. Thus some nodes must be on smaller cycles, such as the cycle of length 5 in the figure.

Theorem 4.4. For an ergodic M state Markov chain, P^m_ij > 0 for all i, j, and all m ≥ (M − 1)² + 1.

4 For Markov chains with a countably infinite state space, ergodic means that the states are positive-recurrent and aperiodic (see Chapter 5, Section 5.1).


Proof*:5 As shown in Figure 4.4, the chain must contain a cycle with fewer than M nodes. Let τ ≤ M − 1 be the number of nodes on a smallest cycle in the chain and let i be any given state on such a cycle. Define T(m), m ≥ 1, as the set of states accessible from the fixed state i in m steps. Thus T(1) = {j : Pij > 0}, and for arbitrary m ≥ 1,

T(m) = {j : P^m_ij > 0}. (4.7)

Since i is on a cycle of length τ, P^τ_ii > 0. For any m ≥ 1 and any j ∈ T(m), we can then construct an m + τ step walk from i to j by going from i to i in τ steps and then to j in another m steps. This is true for all j ∈ T(m), so

T(m) ⊆ T(m + τ). (4.8)

By defining T(0) to be the singleton set {i}, (4.8) also holds for m = 0, since i ∈ T(τ). By starting with m = 0 and iterating on (4.8),

T (0) ⊆ T (τ) ⊆ T (2τ) ⊆ · · · ⊆ T (nτ) ⊆ · · · . (4.9)

We now show that if one of the inclusion relations in (4.9) is satisfied with equality, then all the subsequent relations are satisfied with equality. More generally, assume that T(m) = T(m + s) for some m ≥ 0 and s ≥ 1. Note that T(m + 1) is the set of states that can be reached in one step from states in T(m), and similarly T(m + s + 1) is the set reachable in one step from T(m + s) = T(m). Thus T(m + 1) = T(m + 1 + s). Iterating this result,

T (m) = T (m + s) implies T (n) = T (n + s) for all n ≥ m. (4.10)

Thus, (4.9) starts with strict inclusions and then continues with equalities. Since the entire set has M members, there can be at most M − 1 strict inclusions in (4.9). Thus

T ((M− 1)τ) = T (nτ) for all integers n ≥ M− 1. (4.11)

Define k as (M − 1)τ . We can then rewrite (4.11) as

T (k) = T (k + jτ) for all j ≥ 1. (4.12)

We next show that T(k) consists of all M nodes in the chain. The central part of this is to show that T(k) = T(k + 1). Let t be any positive integer other than τ such that P^t_ii > 0. Letting m = k in (4.8) and using t in place of τ,

T(k) ⊆ T(k + t) ⊆ T(k + 2t) ⊆ · · · ⊆ T(k + τt). (4.13)

Since T(k + τt) = T(k), this shows that

T(k) = T(k + t). (4.14)

Now let s be the smallest positive integer such that

T (k) = T (k + s). (4.15)

5Proofs marked with an asterisk can be omitted without loss of continuity.


From (4.11), we see that (4.15) holds when s takes the value τ. Thus, the minimizing s must lie in the range 1 ≤ s ≤ τ. We will show that s = 1 by assuming s > 1 and establishing a contradiction. Since the chain is aperiodic, there is some t not divisible by s for which P^t_ii > 0. This t can be represented by t = js + ℓ where 1 ≤ ℓ < s and j ≥ 0. Iterating (4.15), we get T(k) = T(k + js), and applying (4.10) to this,

T(k + ℓ) = T(k + js + ℓ) = T(k + t) = T(k),

where we have used t = js + ℓ followed by (4.14). This is the desired contradiction, since ℓ < s. Thus s = 1 and T(k) = T(k + 1). Iterating this,

T(k) = T(k + n) for all n ≥ 0. (4.16)

Since the chain is ergodic, each state j continues to be accessible after k steps. Therefore j must be in T(k + n) for some n ≥ 0, which, from (4.16), implies that j ∈ T(k). Since j is arbitrary, T(k) must be the entire set of states. Thus P^n_ij > 0 for all n ≥ k and all j.

This same argument can be applied to any state i on the given cycle with τ nodes. Any state m not on this cycle has a path to the cycle using at most M − τ steps. Using this path to reach a node i on the cycle, and following this with all the walks from i of length k = (M − 1)τ, we see that

P^{M−τ+(M−1)τ}_mj > 0 for all j, m.

The proof is complete, since M − τ + (M − 1)τ ≤ (M − 1)² + 1 for all τ, 1 ≤ τ ≤ M − 1, with equality when τ = M − 1.

Figure 4.4 illustrates a situation where the bound (M − 1)² + 1 is met with equality. Note that there is one cycle of length M − 1 and the single node not on this cycle, node 1, is the unique starting node at which the bound is met with equality.
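Theorem 4.4 and the worst case of Figure 4.4 are easy to check numerically. The sketch below (assuming NumPy) builds a 6-state chain with the structure described in the caption: a cycle through all M states together with one extra arc that creates a cycle of length M − 1 not containing state 1. The exact arc labeling of the figure may differ, and the transition probabilities chosen here are arbitrary.

```python
import numpy as np

def first_all_positive_power(P, max_m=1000):
    """Smallest m for which every entry of [P]^m is strictly positive."""
    Pm = np.eye(P.shape[0])
    for m in range(1, max_m + 1):
        Pm = Pm @ P
        if np.all(Pm > 0):
            return m
    return None

M = 6
P = np.zeros((M, M))
for i in range(M - 1):
    P[i, i + 1] = 1.0        # state 1 -> 2 -> ... -> M in the text's labeling
P[M - 1, 0] = 0.5            # M -> 1
P[M - 1, 1] = 0.5            # M -> 2, giving a cycle of length M - 1 without state 1
print(first_all_positive_power(P), (M - 1) ** 2 + 1)   # both are 26 for this chain
```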

4.3 The matrix representation

The matrix [P] of transition probabilities of a Markov chain is called a stochastic matrix; that is, a stochastic matrix is a square matrix of non-negative terms in which the elements in each row sum to 1. We first consider the n step transition probabilities P^n_ij in terms of [P]. The probability of going from state i to state j in two steps is the sum over h of all possible two step walks, from i to h and from h to j. Using the Markov condition in (4.1),

P²_ij = Σ_{h=1}^{M} Pih Phj.

It can be seen that this is just the ij term of the product of matrix [P] with itself; denoting [P][P] as [P]², this means that P²_ij is the (i, j) element of the matrix [P]². Similarly, P^n_ij is the ij element of the nth power of the matrix [P]. Since [P]^{m+n} = [P]^m [P]^n, this means that

P^{m+n}_ij = Σ_{h=1}^{M} P^m_ih P^n_hj. (4.17)

This is known as the Chapman-Kolmogorov equation. An efficient approach to compute [P]^n (and thus P^n_ij) for large n is to multiply [P]² by [P]², then [P]⁴ by [P]⁴, and so forth, and then multiply these binary powers together as needed.
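The squaring idea can be written out in a few lines; NumPy's np.linalg.matrix_power provides the same computation. The values of [P] below are arbitrary.

```python
import numpy as np

def matrix_power(P, n):
    """Compute [P]^n by repeated squaring."""
    result = np.eye(P.shape[0])
    square = P.copy()
    while n > 0:
        if n & 1:                  # fold in the current binary power when its bit is set
            result = result @ square
        square = square @ square   # [P]^2, [P]^4, [P]^8, ...
        n >>= 1
    return result

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
print(matrix_power(P, 50))
print(np.linalg.matrix_power(P, 50))   # same result from the library routine
```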

The matrix [P]^n (i.e., the matrix of transition probabilities raised to the nth power) is very important for a number of reasons. The i, j element of this matrix is P^n_ij, which is the probability of being in state j at time n given state i at time 0. If memory of the past dies out with increasing n, then we would expect the dependence on both n and i to disappear in P^n_ij. This means, first, that [P]^n should converge to a limit as n → ∞, and, second, that each row of [P]^n should tend to the same set of probabilities. If this convergence occurs (and we later determine the circumstances under which it occurs), [P]^n and [P]^{n+1} will be the same in the limit n → ∞, which means lim [P]^n = (lim [P]^n)[P]. If all the rows of lim [P]^n are the same, equal to some row vector π = (π1, π2, . . . , πM), this simplifies to π = π[P]. Since π is a probability vector (i.e., its components are the probabilities of being in the various states in the limit n → ∞), its components must be non-negative and sum to 1.

Definition 4.9. A steady-state probability vector (or a steady-state distribution) for a Markov chain with transition matrix [P] is a vector π that satisfies

π = π[P] ; where Σ_i πi = 1 ; πi ≥ 0, 1 ≤ i ≤ M. (4.18)

The steady-state probability vector is also often called a stationary distribution. If a probability vector π satisfying (4.18) is taken as the initial probability assignment of the chain at time 0, then that assignment is maintained forever. That is, if Pr{X0=i} = πi for all i, then Pr{X1=j} = Σ_i πi Pij = πj for all j, and, by induction, Pr{Xn=j} = πj for all j and all n > 0.
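Numerically, (4.18) is just a linear system: π([P] − [I]) = 0 together with Σ_i πi = 1, where one redundant equation can be replaced by the normalization constraint. A sketch assuming NumPy and a chain with a single recurrent class:

```python
import numpy as np

def steady_state(P):
    """Solve pi = pi [P] with sum(pi) = 1 for a chain with one recurrent class."""
    M = P.shape[0]
    A = P.T - np.eye(M)
    A[-1, :] = 1.0            # replace one (redundant) equation by the normalization
    b = np.zeros(M)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi = steady_state(P)
print(pi)            # [4/7, 3/7] for this two-state chain
print(pi @ P)        # equals pi again, so the assignment is maintained forever
```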

If [P]^n converges as above, then, for each starting state, the steady-state distribution is reached asymptotically. There are a number of questions that must be answered for a steady-state distribution as defined above:

1. Does π = π[P] always have a probability vector solution?

2. Does π = π[P] have a unique probability vector solution?

3. Do the rows of [P]^n converge to a probability vector solution of π = π[P]?

We first give the answers to these questions for finite-state Markov chains and then derive them. First, (4.18) always has a solution (although this is not necessarily true for infinite-state chains). The answer to the second and third questions is simpler with the following definition:


Definition 4.10. A unichain is a finite-state Markov chain that contains a single recurrent class plus, perhaps, some transient states. An ergodic unichain is a unichain for which the recurrent class is ergodic.

A unichain, as we shall see, is the natural generalization of a recurrent chain to allow for some initial transient behavior without disturbing the long term asymptotic behavior of the underlying recurrent chain.

The answer to the second question above is that the solution to (4.18) is unique iff [P] is the transition matrix of a unichain. If there are r recurrent classes, then π = π[P] has r linearly independent solutions. For the third question, each row of [P]^n converges to the unique solution of (4.18) if [P] is the transition matrix of an ergodic unichain. If there are multiple recurrent classes, but all of them are aperiodic, then [P]^n still converges, but to a matrix with non-identical rows. If the Markov chain has one or more periodic recurrent classes, then [P]^n does not converge.

We first look at these answers from the standpoint of matrix theory and then proceed in Chapter 5 to look at the more general problem of Markov chains with a countably infinite number of states. There we use renewal theory to answer these same questions (and to discover the differences that occur for infinite-state Markov chains). The matrix theory approach is useful computationally and also has the advantage of telling us something about rates of convergence. The approach using renewal theory is very simple (given an understanding of renewal processes), but is more abstract.

4.3.1 The eigenvalues and eigenvectors of P

A convenient way of dealing with the nth power of a matrix is to find the eigenvalues and eigenvectors of the matrix.

Definition 4.11. The row vector π is a left eigenvector of [P] of eigenvalue λ if π ≠ 0 and π[P] = λπ. The column vector ν is a right eigenvector of eigenvalue λ if ν ≠ 0 and [P]ν = λν.

We first treat the special case of a Markov chain with two states. Here the eigenvalues and eigenvectors can be found by elementary (but slightly tedious) algebra. The eigenvector equations can be written out as

π1 P11 + π2 P21 = λπ1          P11 ν1 + P12 ν2 = λν1
π1 P12 + π2 P22 = λπ2          P21 ν1 + P22 ν2 = λν2. (4.19)

These equations have a non-zero solution iff the matrix [P − λI], where [I] is the identity matrix, is singular (i.e., there must be a non-zero ν for which [P − λI]ν = 0). Thus λ must be such that the determinant of [P − λI], namely (P11 − λ)(P22 − λ) − P12 P21, is equal to 0. Solving this quadratic equation in λ, we find that λ has two solutions, λ1 = 1 and λ2 = 1 − P12 − P21. Assume initially that P12 and P21 are not both 0. Then the solutions for the left and right eigenvectors, π(1) and ν(1) of λ1, and π(2) and ν(2) of λ2, are given by

π(1)_1 = P21/(P12+P21)    π(1)_2 = P12/(P12+P21)    ν(1)_1 = 1                ν(1)_2 = 1
π(2)_1 = 1                π(2)_2 = −1               ν(2)_1 = P12/(P12+P21)    ν(2)_2 = −P21/(P12+P21).


These solutions contain an arbitrary normalization factor. Now let [Λ] = [ λ1 0 ; 0 λ2 ] (the diagonal matrix of eigenvalues) and let [U] be a matrix with columns ν(1) and ν(2). Then the two right eigenvector equations in (4.19) can be combined compactly as [P][U] = [U][Λ]. It turns out (given the way we have normalized the eigenvectors) that the inverse of [U] is just the matrix whose rows are the left eigenvectors of [P] (this can be verified by direct calculation, and we show later that any right eigenvector of one eigenvalue must be orthogonal to any left eigenvector of another eigenvalue). We then see that [P] = [U][Λ][U]⁻¹ and consequently [P]^n = [U][Λ]^n[U]⁻¹. Multiplying this out, we get

[P]^n = [ π1 + π2 λ2^n , π2 − π2 λ2^n ; π1 − π1 λ2^n , π2 + π1 λ2^n ],   where π1 = P21/(P12 + P21), π2 = 1 − π1.

Recalling that λ2 = 1 − P12 − P21, we see that |λ2| ≤ 1. If P12 = P21 = 0, then λ2 = 1 so that [P] and [P]^n are simply identity matrices. If P12 = P21 = 1, then λ2 = −1 so that [P]^n alternates between the identity matrix for n even and [P] for n odd. In all other cases, |λ2| < 1 and [P]^n approaches the matrix whose rows are both equal to π.
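A quick numerical check of the closed form above (assuming NumPy; the values of P12 and P21 are arbitrary):

```python
import numpy as np

P12, P21 = 0.3, 0.4
P = np.array([[1 - P12, P12],
              [P21, 1 - P21]])
pi1 = P21 / (P12 + P21)
pi2 = 1 - pi1
lam2 = 1 - P12 - P21

n = 7
closed_form = np.array([[pi1 + pi2 * lam2**n, pi2 - pi2 * lam2**n],
                        [pi1 - pi1 * lam2**n, pi2 + pi1 * lam2**n]])
print(np.allclose(np.linalg.matrix_power(P, n), closed_form))   # True
```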

Parts of this special case generalize to an arbitrary finite number of states. In particular, λ = 1 is always an eigenvalue and the vector e whose components are all equal to 1 is always a right eigenvector of λ = 1 (this follows immediately from the fact that each row of a stochastic matrix sums to 1). Unfortunately, not all stochastic matrices can be represented in the form [P] = [U][Λ][U]⁻¹ (since M independent right eigenvectors need not exist; see Exercise 4.9). In general, the diagonal matrix of eigenvalues in [P] = [U][Λ][U]⁻¹ must be replaced by something called a Jordan form, which does not easily lead us to the desired results. In what follows, we develop the powerful Perron and Frobenius theorems, which are useful in their own right and also provide the necessary results about [P]^n in general.

4.4 Perron-Frobenius theory

A real vector x (i.e., a vector with real components) is defined to be positive, denoted x > 0, if xi > 0 for each component i. A real matrix [A] is positive, denoted [A] > 0, if Aij > 0 for each i, j. Similarly, x is non-negative, denoted x ≥ 0, if xi ≥ 0 for all i. [A] is non-negative, denoted [A] ≥ 0, if Aij ≥ 0 for all i, j. Note that it is possible to have x ≥ 0 and x ≠ 0 without having x > 0, since x > 0 means that all components of x are positive and x ≥ 0, x ≠ 0 means that at least one component of x is positive and all are non-negative. Next, x > y and y < x both mean x − y > 0. Similarly, x ≥ y and y ≤ x mean x − y ≥ 0. The corresponding matrix inequalities have corresponding meanings.

We start by looking at the eigenvalues and eigenvectors of positive square matrices. In what follows, when we assert that a matrix, vector, or number is positive or non-negative, we implicitly mean that it is real also. We will prove Perron's theorem, which is the critical result for dealing with positive matrices. We then generalize Perron's theorem to the Frobenius theorem, which treats a class of non-negative matrices called irreducible matrices. We finally specialize the results to stochastic matrices.


Perron's theorem shows that a square positive matrix [A] always has a positive eigenvalue λ that exceeds the magnitude of all other eigenvalues. It also shows that this λ has a right eigenvector ν that is positive and unique within a scale factor. It establishes these results by relating λ to the following frequently useful optimization problem. For a given square matrix [A] > 0, and for any non-zero vector6 x ≥ 0, let g(x) be the largest real number a for which ax ≤ [A]x. Let λ be defined by

λ = sup_{x≠0, x≥0} g(x). (4.20)

We can express g(x) explicitly by rewriting ax ≤ [A]x as a xi ≤ Σ_j Aij xj for all i. Thus, the largest a for which this is satisfied is

g(x) = min_i gi(x)   where   gi(x) = (Σ_j Aij xj) / xi. (4.21)

Since [A] > 0, x ≥ 0 and x ≠ 0, it follows that the numerator Σ_j Aij xj is positive for all i. Thus gi(x) is positive for xi > 0 and infinite for xi = 0, so g(x) > 0. It is shown in Exercise 4.10 that g(x) is a continuous function of x over x ≠ 0, x ≥ 0 and that the supremum in (4.20) is actually achieved as a maximum.
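To see the optimization problem (4.20)-(4.21) in action, one can evaluate g(x) directly and compare it with the largest eigenvalue obtained from a standard eigenvalue routine. A sketch assuming NumPy; the positive matrix below is arbitrary.

```python
import numpy as np

def g(A, x):
    """g(x) of (4.21): the largest a with a*x <= [A]x componentwise, for x >= 0, x != 0."""
    with np.errstate(divide="ignore"):
        return np.min((A @ x) / x)      # g_i(x) is infinite where x_i = 0

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
lam = eigvals[k].real
nu = np.abs(eigvecs[:, k].real)         # the Perron eigenvector, made positive

print(lam, g(A, nu))                    # the maximum of g is attained at nu and equals lam
print(g(A, np.ones(2)))                 # any other x >= 0 gives g(x) <= lam
```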

Theorem 4.5 (Perron). Let [A] > 0 be an M by M matrix, let λ > 0 be given by (4.20) and (4.21), and let ν be a vector x that maximizes (4.20). Then

1. λν = [A]ν and ν > 0.

2. For any other eigenvalue µ of [A], |µ| < λ.

3. If x satisfies λx = [A]x, then x = βν for some (possibly complex) number β.

Discussion: Property (1) asserts not only that the solution λ of the optimization problem is an eigenvalue of [A], but also that the optimizing vector ν is an eigenvector and is strictly positive. Property (2) says that λ is strictly greater than the magnitude of any other eigenvalue, and thus we refer to it in what follows as the largest eigenvalue of [A]. Property (3) asserts that the eigenvector ν is unique (within a scale factor), not only among positive vectors but among all (possibly complex) vectors.

Proof* Property 1: We are given that

λ = g(ν) ≥ g(x) for each x ≥ 0, x ≠ 0. (4.22)

We must show that λν = [A]ν, i.e., that λνi = Σ_j Aij νj for each i, or equivalently that

λ = g(ν) = gi(ν) = (Σ_j Aij νj) / νi   for each i. (4.23)

Thus we want to show that the minimum in (4.21) is achieved by each i, 1 ≤ i ≤ M. To show this, we assume the contrary and demonstrate a contradiction. Thus, suppose that g(ν) < gk(ν) for some k. Let ek be the kth unit vector and let ε be a small positive number. The contradiction will be to show that g(ν + εek) > g(ν) for small enough ε, thus violating (4.22). For i ≠ k,

gi(ν + εek) = (Σ_j Aij νj + εAik) / νi > (Σ_j Aij νj) / νi. (4.24)

gk(ν + εek), on the other hand, is continuous in ε > 0 as ε increases from 0 and thus remains greater than g(ν) for small enough ε. This shows that g(ν + εek) > g(ν), completing the contradiction. This also shows that νk must be greater than 0 for each k.

6 Note that the set of nonzero vectors x for which x ≥ 0 is different from the set {x > 0} in that the former allows some xi to be zero, whereas the latter requires all xi to be positive.

Property 2: Let µ be any eigenvalue of [A]. Let x ≠ 0 be a right eigenvector (perhaps complex) for µ. Taking the magnitude of each side of µx = [A]x, we get the following for each component i

|µ||xi| = |Σ_j Aij xj| ≤ Σ_j Aij |xj|. (4.25)

Let u = (|x1|, |x2|, . . . , |xM|), so (4.25) becomes |µ|u ≤ [A]u. Since u ≥ 0, u ≠ 0, it follows from the definition of g(x) that |µ| ≤ g(u). From (4.20), g(u) ≤ λ, so |µ| ≤ λ.

Next assume that |µ| = λ. From (4.25), then, λu ≤ [A]u, so u achieves the maximization in (4.20) and part 1 of the theorem asserts that λu = [A]u. This means that (4.25) is satisfied with equality, and it follows from this (see Exercise 4.11) that x = βu for some (perhaps complex) scalar β. Thus x is an eigenvector of λ, and µ = λ. Thus |µ| = λ is impossible for µ ≠ λ, so λ > |µ| for all eigenvalues µ ≠ λ.

Property 3: Let x be any eigenvector of λ. Property 2 showed that x = βu where ui = |xi| for each i and u is a non-negative eigenvector of eigenvalue λ. Since ν > 0, we can choose α > 0 so that ν − αu ≥ 0 and νi − αui = 0 for some i. Now ν − αu is either identically 0 or else an eigenvector of eigenvalue λ, and thus strictly positive. Since νi − αui = 0 for some i, ν − αu = 0. Thus u and x are scalar multiples of ν, completing the proof.

Next we apply the results above to a more general type of non-negative matrix called an irreducible matrix. Recall that we analyzed the classes of a finite-state Markov chain in terms of a directed graph where the nodes represent the states of the chain and a directed arc goes from i to j if Pij > 0. We can draw the same type of directed graph for an arbitrary non-negative matrix [A]; i.e., a directed arc goes from i to j if Aij > 0.

Definition 4.12. An irreducible matrix is a non-negative matrix such that for every pair of nodes i, j in its graph, there is a walk from i to j.

For stochastic matrices, an irreducible matrix is thus the matrix of a recurrent Markov chain. If we denote the i, j element of [A]^n by A^n_ij, then we see that A^n_ij > 0 iff there is a walk of length n from i to j in the graph. If [A] is irreducible, a walk exists from any i to any j (including j = i) with length at most M, since the walk need visit each other node at most once. Thus A^n_ij > 0 for some n, 1 ≤ n ≤ M, and Σ_{n=1}^{M} A^n_ij > 0. The key to analyzing irreducible matrices is then the fact that the matrix [B] = Σ_{n=1}^{M} [A]^n is strictly positive.
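This characterization gives an immediate numerical test for irreducibility: form Σ_{n=1}^{M} [A]^n and check that every entry is positive. A sketch assuming NumPy; the example matrices are arbitrary.

```python
import numpy as np

def is_irreducible(A):
    """True iff the non-negative matrix [A] is irreducible: sum of [A]^1..[A]^M is > 0."""
    M = A.shape[0]
    total = np.zeros((M, M))
    An = np.eye(M)
    for _ in range(M):
        An = An @ A
        total += An
    return bool(np.all(total > 0))

P_recurrent = np.array([[0.0, 1.0],
                        [1.0, 0.0]])     # a recurrent (period 2) chain
P_unichain = np.array([[1.0, 0.0],
                       [0.5, 0.5]])      # state 0 is a trapping state, state 1 transient
print(is_irreducible(P_recurrent), is_irreducible(P_unichain))   # True False
```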


Theorem 4.6 (Frobenius). Let [A] ≥ 0 be an M by M irreducible matrix and let λ be the supremum in (4.20) and (4.21). Then the supremum is achieved as a maximum at some vector ν and the pair λ, ν have the following properties:

1. λν = [A]ν and ν > 0.

2. For any other eigenvalue µ of [A], |µ| ≤ λ.

3. If x satisfies λx = [A]x, then x = βν for some (possibly complex) number β.

Discussion: Note that this is almost the same as the Perron theorem, except that [A] is irreducible (but not necessarily positive), and the magnitudes of the other eigenvalues need not be strictly less than λ. When we look at recurrent matrices of period d, we shall find that there are d − 1 other eigenvalues of magnitude equal to λ. Because of the possibility of other eigenvalues with the same magnitude as λ, we refer to λ as the largest real eigenvalue of [A].

Proof* Property 1: We first establish property 1 for a particular choice of λ and ν and then show that this choice satisfies the optimization problem in (4.20) and (4.21). Let [B] = Σ_{n=1}^{M} [A]^n > 0. Using Theorem 4.5, we let λB be the largest eigenvalue of [B] and let ν > 0 be the corresponding right eigenvector. Then [B]ν = λB ν. Also, since [B][A] = [A][B], we have [B]{[A]ν} = [A][B]ν = λB [A]ν. Thus [A]ν is a right eigenvector for eigenvalue λB of [B] and thus equal to ν multiplied by some positive scale factor. Define this scale factor to be λ, so that [A]ν = λν and λ > 0. We can relate λ to λB by [B]ν = Σ_{n=1}^{M} [A]^n ν = (λ + λ² + · · · + λ^M)ν. Thus λB = λ + λ² + · · · + λ^M.

Next, for any non-zero x ≥ 0, let g > 0 be the largest number such that [A]x ≥ gx. Multiplying both sides of this by [A], we see that [A]²x ≥ g[A]x ≥ g²x. Similarly, [A]^i x ≥ g^i x for each i ≥ 1, so it follows that [B]x ≥ (g + g² + · · · + g^M)x. From the optimization property of λB in Theorem 4.5, this shows that λB ≥ g + g² + · · · + g^M. Since λB = λ + λ² + · · · + λ^M, we conclude that λ ≥ g, showing that λ, ν solve the optimization problem for [A] in (4.20) and (4.21).

Properties 2 and 3: The first half of the proof of property 2 in Theorem 4.5 applies here also to show that |µ| ≤ λ for all eigenvalues µ of [A]. Finally, let x be an arbitrary vector satisfying [A]x = λx. Then, from the argument above, x is also a right eigenvector of [B] with eigenvalue λB, so from Theorem 4.5, x must be a scalar multiple of ν, completing the proof.

Corollary 4.1. The largest real eigenvalue λ of an irreducible matrix [A] ≥ 0 has a positive left eigenvector π. π is the unique left eigenvector of λ (within a scale factor) and is the only non-negative non-zero vector (within a scale factor) that satisfies λπ ≤ π[A].

Proof: A left eigenvector of [A] is a right eigenvector (transposed) of [A]^T. The graph corresponding to [A]^T is the same as that for [A] with all the arc directions reversed, so that all pairs of nodes still communicate and [A]^T is irreducible. Since [A] and [A]^T have the same eigenvalues, the corollary is just a restatement of the theorem.


Corollary 4.2. Let λ be the largest real eigenvalue of an irreducible matrix and let the right and left eigenvectors of λ be ν > 0 and π > 0. Then, within a scale factor, ν is the only non-negative right eigenvector of [A] (i.e., no other eigenvalues have non-negative eigenvectors). Similarly, within a scale factor, π is the only non-negative left eigenvector of [A].

Proof: Theorem 4.6 asserts that ν is the unique right eigenvector (within a scale factor) of the largest real eigenvalue λ, so suppose that u is a right eigenvector of some other eigenvalue µ. Letting π be the left eigenvector of λ, we have π[A]u = λπu and also π[A]u = µπu. Thus πu = 0. Since π > 0, u cannot be non-negative and non-zero. The same argument shows the uniqueness of π.

Corollary 4.3. Let [P] be a stochastic irreducible matrix (i.e., the matrix of a recurrent Markov chain). Then λ = 1 is the largest real eigenvalue of [P], e = (1, 1, . . . , 1)^T is the right eigenvector of λ = 1, unique within a scale factor, and there is a unique probability vector π > 0 that is a left eigenvector of λ = 1.

Proof: Since each row of [P] adds up to 1, [P]e = e. Corollary 4.2 asserts the uniqueness of e and the fact that λ = 1 is the largest real eigenvalue, and Corollary 4.1 asserts the uniqueness of π.

The proof above shows that every stochastic matrix, whether irreducible or not, has an eigenvalue λ = 1 with e = (1, . . . , 1)^T as a right eigenvector. In general, a stochastic matrix with r recurrent classes has r independent non-negative right eigenvectors and r independent non-negative left eigenvectors; the left eigenvectors can be taken as the steady-state probability vectors within the r recurrent classes (see Exercise 4.14).

The following corollary, proved in Exercise 4.13, extends corollary 4.3 to unichains.

Corollary 4.4. Let [P] be the transition matrix of a unichain. Then λ = 1 is the largest real eigenvalue of [P], e = (1, 1, . . . , 1)^T is the right eigenvector of λ = 1, unique within a scale factor, and there is a unique probability vector π ≥ 0 that is a left eigenvector of λ = 1; πi > 0 for each recurrent state i and πi = 0 for each transient state.

Corollary 4.5. The largest real eigenvalue λ of an irreducible matrix [A] ≥ 0 is a strictly increasing function of each component of [A].

Proof: For a given irreducible [A], let [B] satisfy [B] ≥ [A], [B] ≠ [A]. Let λ be the largest real eigenvalue of [A] and ν > 0 be the corresponding right eigenvector. Then λν = [A]ν ≤ [B]ν, but λν ≠ [B]ν. Let µ be the largest real eigenvalue of [B], which is also irreducible. If µ ≤ λ, then µν ≤ λν ≤ [B]ν, and µν ≠ [B]ν, which is a contradiction of property 1 in Theorem 4.6. Thus, µ > λ.

We are now ready to study the asymptotic behavior of [A]^n. The simplest and cleanest result holds for [A] > 0. We establish this in the following corollary and then look at the case of greatest importance, that of a stochastic matrix for an ergodic Markov chain. More general cases are treated in Exercises 4.13 and 4.14.


Corollary 4.6. Let λ be the largest eigenvalue of [A] > 0 and let π and ν be the positive left and right eigenvectors of λ, normalized so that πν = 1. Then

lim_{n→∞} [A]^n / λ^n = νπ. (4.26)

Proof*: Since ν > 0 is a column vector and π > 0 is a row vector, νπ is a positive matrix of the same dimension as [A]. Since [A] > 0, we can define a matrix [B] = [A] − ανπ which is positive for small enough α > 0. Note that π and ν are left and right eigenvectors of [B] with eigenvalue µ = λ − α. We then have µ^n ν = [B]^n ν, which when pre-multiplied by π yields

(λ − α)^n = π[B]^n ν = Σ_i Σ_j πi B^n_ij νj,

where B^n_ij is the i, j element of [B]^n. Since each term in the above summation is positive, we have (λ − α)^n ≥ πi B^n_ij νj, and therefore B^n_ij ≤ (λ − α)^n/(πi νj). Thus, for each i, j, lim_{n→∞} B^n_ij λ^{−n} = 0, and therefore lim_{n→∞} [B]^n λ^{−n} = 0. Next we use a convenient matrix identity: for any eigenvalue λ of a matrix [A], and any corresponding right and left eigenvectors ν and π, normalized so that πν = 1, we have {[A] − λνπ}^n = [A]^n − λ^n νπ (see Exercise 4.12). Applying the same identity to [B], we have {[B] − µνπ}^n = [B]^n − µ^n νπ. Finally, since [B] = [A] − ανπ, we have [B] − µνπ = [A] − λνπ, so that

[A]^n − λ^n νπ = [B]^n − µ^n νπ. (4.27)

Dividing both sides of (4.27) by λ^n and taking the limit of both sides of (4.27) as n → ∞, the right hand side goes to 0, completing the proof.

Note that for a stochastic matrix [P] > 0, this corollary simplifies to lim_{n→∞} [P]^n = eπ. This means that lim_{n→∞} P^n_ij = πj, which means that the probability of being in state j after a long time is πj, independent of the starting state.
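Corollary 4.6 is easy to observe numerically: dividing [A]^n by λ^n leaves only the rank-one matrix νπ in the limit. A sketch assuming NumPy; the positive matrix and the power n below are arbitrary.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])                       # an arbitrary positive matrix

eigvals_r, right = np.linalg.eig(A)
k_r = np.argmax(eigvals_r.real)
lam = eigvals_r[k_r].real
nu = np.abs(right[:, k_r].real).reshape(-1, 1)   # positive right eigenvector (column)

eigvals_l, left = np.linalg.eig(A.T)
k_l = np.argmax(eigvals_l.real)
pi = np.abs(left[:, k_l].real).reshape(1, -1)    # positive left eigenvector (row)
pi = pi / (pi @ nu)                              # normalize so that pi nu = 1

n = 30
print(np.linalg.matrix_power(A, n) / lam**n)     # approaches ...
print(nu @ pi)                                   # ... the rank-one limit nu pi
```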

Theorem 4.7. Let [P] be the transition matrix of an ergodic finite-state Markov chain. Then λ = 1 is the largest real eigenvalue of [P], and λ > |µ| for every other eigenvalue µ. Furthermore, lim_{n→∞} [P]^n = eπ, where π > 0 is the unique probability vector satisfying π[P] = π and e = (1, 1, . . . , 1)^T is the unique vector ν (within a scale factor) satisfying [P]ν = ν.

Proof: From Corollary 4.3, λ = 1 is the largest real eigenvalue of [P], e is the unique (within a scale factor) right eigenvector of λ = 1, and there is a unique probability vector π such that π[P] = π. From Theorem 4.4, [P]^m is positive for sufficiently large m. Since [P]^m is also stochastic, λ = 1 is strictly larger than the magnitude of any other eigenvalue of [P]^m. Let µ be any other eigenvalue of [P] and let x be a right eigenvector of µ. Note that x is also a right eigenvector of [P]^m with eigenvalue (µ)^m. Since λ = 1 is the only eigenvalue of [P]^m of magnitude 1 or more, we either have |µ| < λ or (µ)^m = λ. If (µ)^m = λ, then x must be a scalar times e. This is impossible, since x cannot be an eigenvector of [P] with both eigenvalue λ and µ. Thus |µ| < λ. Similarly, π > 0 is the unique left eigenvector of [P]^m with eigenvalue λ = 1, and πe = 1. Corollary 4.6 then asserts that lim_{n→∞} [P]^{mn} = eπ. Multiplying by [P]^i for any i, 1 ≤ i < m, we get lim_{n→∞} [P]^{mn+i} = eπ, so lim_{n→∞} [P]^n = eπ.

Theorem 4.7 generalizes easily to an ergodic unichain (see Exercise 4.15). In this case, as one might suspect, πi = 0 for each transient state i and πi > 0 within the ergodic class. Theorem 4.7 becomes:

Theorem 4.8. Let [P] be the transition matrix of an ergodic unichain. Then λ = 1 is the largest real eigenvalue of [P], and λ > |µ| for every other eigenvalue µ. Furthermore,

lim_{m→∞} [P]^m = eπ, (4.28)

where π ≥ 0 is the unique probability vector satisfying π[P] = π and e = (1, 1, . . . , 1)^T is the unique ν (within a scale factor) satisfying [P]ν = ν.

If a chain has a periodic recurrent class, [P]^m never converges. The existence of a unique probability vector solution to π[P] = π for a periodic recurrent chain is somewhat mystifying at first. If the period is d, then the steady-state vector π assigns probability 1/d to each of the d subsets of Theorem 4.3. If the initial probabilities for the chain are chosen as Pr{X0 = i} = πi for each i, then for each subsequent time n, Pr{Xn = i} = πi. What is happening is that this initial probability assignment starts the chain in each of the d subsets with probability 1/d, and subsequent transitions maintain this randomness over subsets. On the other hand, [P]^n cannot converge because P^n_ii, for each i, is zero except when n is a multiple of d. Thus the memory of the starting state never dies out. An ergodic Markov chain does not have this peculiar property, and the memory of the starting state dies out (from Theorem 4.7).

The intuition to be associated with the word ergodic is that of a process in which time-averages are equal to ensemble-averages. Using the general definition of ergodicity (which is beyond our scope here), a periodic recurrent Markov chain in steady-state (i.e., with Pr{Xn = i} = πi for all n and i) is ergodic.

Thus the notion of ergodicity for Markov chains is slightly different than that in the general theory. The difference is that we think of a Markov chain as being specified without specifying the initial state distribution, and thus different initial state distributions really correspond to different stochastic processes. If a periodic Markov chain starts in steady state, then the corresponding stochastic process is stationary, and otherwise not.

4.5 Markov chains with rewards

Suppose that each state i in a Markov chain is associated with some reward, ri. As the Markov chain proceeds from state to state, there is an associated sequence of rewards that are not independent, but are related by the statistics of the Markov chain. The situation is similar to, but simpler than, that of renewal-reward processes. As with renewal-reward processes, the reward ri could equally well be a cost or an arbitrary real valued function of the state. In this section, the expected value of the aggregate reward over time is analyzed.


The model of Markov chains with rewards is surprisingly broad. We have already seen that almost any stochastic process can be approximated by a Markov chain. Also, as we saw in studying renewal theory, the concept of rewards is quite graphic not only in modeling such things as corporate profits or portfolio performance, but also for studying residual life, queueing delay, and many other phenomena.

In Section 4.6, we shall study Markov decision theory, or dynamic programming. This can be viewed as a generalization of Markov chains with rewards in the sense that there is a "decision maker" or "policy maker" who in each state can choose between several different policies; for each policy, there is a given set of transition probabilities to the next state and a given expected reward for the current state. Thus the decision maker must make a compromise between the expected reward of a given policy in the current state (i.e., the immediate reward) and the long term benefit from the next state to be entered. This is a much more challenging problem than the current study of Markov chains with rewards, but a thorough understanding of the current problem provides the machinery to understand Markov decision theory also.

Frequently it is more natural to associate rewards with transitions rather than states. If rij denotes the reward associated with a transition from i to j and Pij denotes the corresponding transition probability, then ri = Σ_j Pij rij is the expected reward associated with a transition from state i. Since we analyze only expected rewards here, and since the effect of transition rewards rij are summarized into the state rewards ri = Σ_j Pij rij, we henceforth ignore transition rewards and consider only state rewards.

The steady-state expected reward per unit time, assuming a single recurrent class of states, is easily seen to be g = ∑_i πi ri, where πi is the steady-state probability of being in state i.
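As a concrete illustration, the gain per stage can be computed numerically by first solving π[P] = π, ∑_i πi = 1 for the steady-state vector and then forming the inner product with the reward vector. The following sketch (Python with NumPy) does this; the specific two-state chain is the one that appears later in Figure 4.6, used here only as a check against the symmetry argument given there.

```python
import numpy as np

def steady_state_gain(P, r):
    """Return (pi, g) where pi solves pi P = pi, sum(pi) = 1, and g = pi . r.

    Assumes [P] is the transition matrix of a unichain, so the steady-state
    vector is unique.
    """
    M = P.shape[0]
    # Replace one (redundant) balance equation with the normalization constraint.
    A = np.vstack([(P.T - np.eye(M))[:-1], np.ones(M)])
    b = np.zeros(M)
    b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    return pi, float(pi @ r)

# Two-state chain with P12 = P21 = 0.01 and rewards r = (0, 1) (Figure 4.6 below).
P = np.array([[0.99, 0.01],
              [0.01, 0.99]])
r = np.array([0.0, 1.0])
pi, g = steady_state_gain(P, r)
print(pi, g)    # pi = (0.5, 0.5), g = 0.5
```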

The following examples demonstrate that it is also important to understand the transient behavior of rewards. This transient behavior will turn out to be even more important when we study Markov decision theory and dynamic programming.

Example 4.5.1 (Expected first-passage time). A common problem when dealing with Markov chains is that of finding the expected number of steps, starting in some initial state, before some given final state is entered. Since the answer to this problem does not depend on what happens after the given final state is entered, we can modify the chain to convert the given final state, say state 1, into a trapping state (a trapping state i is a state from which there is no exit, i.e., for which Pii = 1). That is, we set P11 = 1, P1j = 0 for all j ≠ 1, and leave Pij unchanged for all i ≠ 1 and all j (see Figure 4.5).

Figure 4.5: The conversion of a four state Markov chain into a chain for which state 1 is a trapping state. Note that the outgoing arcs from node 1 have been removed.

Let vi be the expected number of steps to reach state 1 starting in state i ≠ 1. This number


of steps includes the first step plus the expected number of steps from whatever state is entered next (which is 0 if state 1 is entered next). Thus, for the chain in Figure 4.5, we have the equations

v2 = 1 + P23v3 + P24v4

v3 = 1 + P32v2 + P33v3 + P34v4

v4 = 1 + P42v2 + P43v3.

For an arbitrary chain of M states where 1 is a trapping state and all other states are transient, this set of equations becomes

vi = 1 + ∑_{j≠1} Pij vj ;    i ≠ 1.        (4.29)

If we define ri = 1 for i ≠ 1 and ri = 0 for i = 1, then ri is a unit reward for not yet entering the trapping state, and vi is the expected aggregate reward before entering the trapping state. Thus by taking r1 = 0, the reward ceases upon entering the trapping state, and vi is the expected transient reward, i.e., the expected first passage time from state i to state 1. Note that in this example, rewards occur only in transient states. Since transient states have zero steady-state probabilities, the steady-state gain per unit time, g = ∑_i πi ri, is 0.

If we define v1 = 0, then (4.29), along with v1 = 0, has the vector form

v = r + [P]v ;    v1 = 0.        (4.30)

For a Markov chain with M states, (4.29) is a set of M − 1 equations in the M − 1 variables v2 to vM. The equation v = r + [P]v is a set of M linear equations, of which the first is the vacuous equation v1 = 0 + v1, and, with v1 = 0, the last M − 1 correspond to (4.29). It is not hard to show that (4.30) has a unique solution for v under the condition that states 2 to M are all transient states and 1 is a trapping state, but we prove this later, in Lemma 4.1, under more general circumstances.
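For a numerical check, (4.29)–(4.30) can be solved directly as a linear system: drop the row and column for the trapping state, solve (I − [P]) v = r over the remaining states, and set the trapping-state component of v to 0. A minimal sketch in Python/NumPy; the four-state transition matrix is a made-up example for illustration, not the chain of Figure 4.5, and state 0 plays the role of state 1 in the text.

```python
import numpy as np

def aggregate_reward_to_trapping(P, r, trap=0):
    """Solve v = r + [P]v with v[trap] = 0 (eq. 4.30): the expected aggregate
    reward earned before entering the trapping state `trap`.

    Assumes `trap` is reached with probability 1 from every other state.
    For r_i = 1 (i != trap) this is the expected first-passage time.
    """
    M = P.shape[0]
    others = [i for i in range(M) if i != trap]
    A = np.eye(len(others)) - P[np.ix_(others, others)]
    v = np.zeros(M)
    v[others] = np.linalg.solve(A, np.asarray(r, dtype=float)[others])
    return v

# Hypothetical four-state chain.
P = np.array([[0.2, 0.8, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4],
              [0.6, 0.0, 0.4, 0.0]])
r = np.array([0.0, 1.0, 1.0, 1.0])          # unit reward until trapping
print(aggregate_reward_to_trapping(P, r))   # expected first-passage times to state 0
```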

Example 4.5.2. Assume that a Markov chain has M states, {0, 1, . . . , M − 1}, and that the state represents the number of customers in an integer time queueing system. Suppose we wish to find the expected sum of the times all customers spend in the system, starting at an integer time where i customers are in the system, and ending at the first instant when the system becomes idle. From our discussion of Little's theorem in Section 3.6, we know that this sum of times is equal to the sum of the number of customers in the system, summed over each integer time from the initial time with i customers to the final time when the system becomes empty. As in the previous example, we modify the Markov chain to make state 0 a trapping state. We take ri = i as the “reward” in state i, and vi as the expected aggregate reward until the trapping state is entered. Using the same reasoning as in the previous example, vi is equal to the immediate “reward” ri = i plus the expected reward from whatever state is entered next. Thus vi = ri + ∑_{j≥1} Pij vj. With v0 = 0, this is v = r + [P]v. This has a unique solution for v, as will be shown later in Lemma 4.1. This same analysis is valid for any choice of reward ri for each transient state i; the reward in the trapping state must be 0 so as to keep the expected aggregate reward finite.
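The same solver from the previous sketch applies directly to this example; only the reward vector changes, from all ones to ri = i. For instance, with a hypothetical 4-state queue (state = number of customers, invented transition rows), one might write:

```python
# Reuses aggregate_reward_to_trapping from the previous sketch.
import numpy as np

P_queue = np.array([[0.6, 0.4, 0.0, 0.0],   # illustrative birth-death-like rows
                    [0.5, 0.2, 0.3, 0.0],
                    [0.0, 0.5, 0.2, 0.3],
                    [0.0, 0.0, 0.6, 0.4]])
r_queue = np.arange(4)                      # r_i = i customers in the system
print(aggregate_reward_to_trapping(P_queue, r_queue, trap=0))
```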


In the above examples, the Markov chain has a trapping state with zero gain, so the expected gain is essentially a transient phenomenon until entering the trapping state. We now look at the more general case of a unichain, i.e., a chain with a single recurrent class, possibly along with some transient states. In this more general case, there can be some average gain per unit time, along with some transient gain depending on the initial state. We first look at the aggregate gain over a finite number of time units, thus providing a clean way of going to the limit.

Example 4.5.3. The example in Figure 4.6 provides some intuitive appreciation for the general problem. Note that the chain tends to persist in whatever state it is in for a relatively long time. Thus if the chain starts in state 2, not only is an immediate reward of 1 achieved, but there is a high probability of an additional gain of 1 on many successive transitions. Thus the aggregate value of starting in state 2 is considerably more than the immediate reward of 1. On the other hand, we see from symmetry that the expected gain per unit time, over a long time period, must be one half.

Figure 4.6: Markov chain with rewards (two states with P11 = P22 = 0.99, P12 = P21 = 0.01, r1 = 0, r2 = 1).

Returning to the general case, it is convenient to work backward from a final time rather than forward from the initial time. This will be quite helpful later when we consider dynamic programming and Markov decision theory. For any final time m, define stage n as n time units before the final time, i.e., as time m − n in Figure 4.7. Equivalently, we often view the final time as time 0, and then stage n corresponds to time −n.

Time:    −n   −n+1   −n+2   −n+3   · · ·   −2   −1   0
Stage:    n    n−1    n−2    n−3   · · ·    2    1   0

Time:    m−n   · · ·   m−3   m−2   m−1   m
Stage:     n   · · ·     3     2     1   0

Figure 4.7: Alternate views of stages.

As a final generalization of the problem (which will be helpful in the solution), we allow the reward at the final time (i.e., in stage 0) to be different from that at other times. The final reward in state i is denoted ui, and u = (u1, . . . , uM)^T. We denote the expected aggregate reward from stage n up to and including the final stage (stage zero), given state i at stage n, as vi(n,u). Note that the notation here is taking advantage of the Markov property. That is, given that the chain is in state i at time −n (i.e., stage n), the expected aggregate reward up to and including time 0 is independent of the states before time −n and is independent of when the Markov chain started prior to time −n.


The expected aggregate reward can be found by starting at stage 1. Given that the chain is in state i at time −1, the immediate reward is ri. The chain then makes a transition (with probability Pij) to some state j at time 0 with a final reward of uj. Thus

vi(1,u) = ri + ∑_j Pij uj.        (4.31)

For the example of Figure 4.6 (assuming the final reward is the same as that at the other stages, i.e., ui = ri for i = 1, 2), we have v1(1,u) = 0.01 and v2(1,u) = 1.99.

The expected aggregate reward for stage 2 can be calculated in the same way. Given state i at time −2 (i.e., stage 2), there is an immediate reward of ri and, with probability Pij, the chain goes to state j at time −1 (i.e., stage 1) with an expected additional gain of vj(1,u). Thus

vi(2,u) = ri + ∑_j Pij vj(1,u).        (4.32)

Note that vj(1,u), as calculated in (4.31), includes the gain in stages 1 and 0, and does not depend on how state j was entered. Iterating the above argument to stage 3, 4, . . . , n,

vi(n,u) = ri + ∑_j Pij vj(n−1,u).        (4.33)

This can be written in vector form as

v(n,u) = r + [P]v(n−1,u);    n ≥ 1,        (4.34)

where r is a column vector with components r1, r2, . . . , rM and v(n,u) is a column vector with components v1(n,u), . . . , vM(n,u). By substituting (4.34), with n replaced by n − 1, into the last term of (4.34),

v(n,u) = r + [P]r + [P]^2 v(n−2,u);    n ≥ 2.        (4.35)

Applying the same substitution recursively, we eventually get an explicit expression for v(n,u),

v(n,u) = r + [P]r + [P]^2 r + · · · + [P]^{n−1} r + [P]^n u.        (4.36)

Eq. (4.34), applied iteratively, is more convenient for calculating v(n,u) than (4.36), but neither gives us much insight into the behavior of the expected aggregate reward, especially for large n. We can get a little insight by averaging the components of (4.36) over the steady-state probability vector π. Since π[P]^m = π for all m and πr is, by definition, the steady state gain per stage g, this gives us

πv(n,u) = ng + πu.        (4.37)

This result is not surprising, since when the chain starts in steady-state at stage n, it remains in steady-state, yielding a gain per stage of g until the final reward at stage 0. For the example of Figure 4.6 (again assuming u = r), Figure 4.8 tabulates this steady


state expected aggregate gain and compares it with the expected aggregate gain vi(n,u) for initial states 1 and 2. Note that v1(n,u) is always less than the steady-state average by an amount approaching 25 with increasing n. Similarly, v2(n,u) is greater than the average by the corresponding amount. In other words, for this example, vi(n,u) − πv(n,u), for each state i, approaches a limit as n → ∞. This limit is called the asymptotic relative gain for starting in state i, relative to starting in steady state. In what follows, we shall see that this type of asymptotic behavior is quite general.

     n    πv(n, r)    v1(n, r)    v2(n, r)
     1       1           0.01        1.99
     2       1.5         0.0298      2.9702
     4       2.5         0.098       4.902
    10       5.5         0.518      10.482
    40      20.5         6.420      34.580
   100      50.5        28.749      72.250
   400     200.5       175.507     225.492
  1000     500.5       475.500     525.500

Figure 4.8: The expected aggregate reward, as a function of starting state and stage, for the example of Figure 4.6.
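The entries of Figure 4.8 can be reproduced (up to rounding) by iterating the backward recursion (4.34) directly. A small sketch for the chain of Figure 4.6, with u = r:

```python
import numpy as np

P = np.array([[0.99, 0.01],
              [0.01, 0.99]])
r = np.array([0.0, 1.0])
pi = np.array([0.5, 0.5])      # steady-state vector, by symmetry

u = r.copy()                   # final reward equal to the per-stage reward
v = u.copy()                   # v(0, u) = u
for n in range(1, 1001):
    v = r + P @ v              # v(n, u) = r + [P] v(n-1, u), eq. (4.34)
    if n in (1, 2, 4, 10, 40, 100, 400, 1000):
        print(n, pi @ v, v[0], v[1])
```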

Initially we consider only ergodic Markov chains and first try to understand the asymptotic behavior above at an intuitive level. For large n, the probability of being in state j at time 0, conditional on starting in i at time −n, is P^n_{ij} ≈ πj. Thus, the expected final reward at time 0 is approximately πu for each possible starting state at time −n. For (4.36), this says that the final term [P]^n u is approximately (πu)e for large n. Similarly, in (4.36), [P]^{n−m} r ≈ ge if n − m is large. This means that for very large n, each unit increase or decrease in n simply adds or subtracts ge to the vector gain. Thus, we might conjecture that, for large n, v(n,u) is the sum of an initial transient term w, an intermediate term nge, and a final term, (πu)e, i.e.,

v(n,u) ≈ w + nge + (πu)e,        (4.38)

where we also conjecture that the approximation becomes exact as n → ∞. Substituting (4.37) into (4.38), the conjecture (which we shall soon validate) is

v(n,u) ≈ w + (πv(n,u))e.        (4.39)

That is, the component wi of w tells us how profitable it is, in the long term, to start in a particular state i rather than start in steady-state. Thus w is called the asymptotic relative gain vector or, for brevity, the relative gain vector. In the example of the table above, w = (−25, +25).

There are two reasonable approaches to validate the conjecture above and to evaluate the relative gain vector w. The first is explored in Exercise 4.22 and expands on the intuitive argument leading to (4.38) to show that w is given by

w = ∑_{n=0}^{∞} ([P]^n − eπ) r.        (4.40)


This expression is not a very useful way to calculate w, and thus we follow the second approach here, which provides both a convenient expression for w and a proof that the approximation in (4.38) becomes exact in the limit.

Rearranging (4.38) and going to the limit,

w = lim_{n→∞} {v(n,u) − nge − (πu)e}.        (4.41)

The conjecture, which is still to be proven, is that the limit in (4.41) actually exists. We now show that if this limit exists, w must have a particular form. In particular, substituting (4.34) into (4.41),

w = lim_{n→∞} {r + [P]v(n−1,u) − nge − (πu)e}
  = r − ge + [P] lim_{n→∞} {v(n−1,u) − (n−1)ge − (πu)e}
  = r − ge + [P]w.

Thus, if the limit in (4.41) exists, that limiting vector w must satisfy w + ge = r + [P]w. The following lemma shows that this equation has a solution. The lemma does not depend on the conjecture in (4.41); we are simply using this conjecture to motivate why the equation (4.42) is important.

Lemma 4.1. Let [P] be the transition matrix of an M state unichain. Let r = (r1, . . . , rM)^T be a reward vector, let π = (π1, . . . , πM) be the steady state probabilities of the chain, and let g = ∑_i πi ri. Then the equation

w + ge = r + [P]w        (4.42)

has a solution for w. With the additional condition πw = 0, that solution is unique.

Discussion: Note that v = r + [P]v in Example 4.5.1 is a special case of (4.42) in which π = (1, 0, . . . , 0) and r = (0, 1, . . . , 1)^T and thus g = 0. With the added condition v1 = πv = 0, the solution is unique. Example 4.5.2 is the same, except that r is different, and thus also has a unique solution.

Proof: Rewrite (4.42) as

{[P] − [I]}w = ge − r.        (4.43)

Let w̃ be a particular solution to (4.43) (if one exists). Then any solution to (4.43) can be expressed as w̃ + x for some x that satisfies the homogeneous equation {[P] − [I]}x = 0. For x to satisfy {[P] − [I]}x = 0, however, x must be a right eigenvector of [P] with eigenvalue 1. From Theorem 4.8, x must have the form αe for some number α. This means that if a particular solution w̃ to (4.43) exists, then all solutions have the form w = w̃ + αe. For a particular solution to (4.43) to exist, ge − r must lie in the column space of the matrix [P] − [I]. This column space is the space orthogonal to the left null space of [P] − [I]. This left null space, however, is simply the set of left eigenvectors of [P] of eigenvalue 1, i.e., the scalar multiples of π. Thus, a particular solution exists iff π(ge − r) = 0. Since πge = g and πr = g, this equality is satisfied and a particular solution exists. Since all solutions


have the form w = w̃ + αe, setting πw = 0 determines the value of α to be −πw̃, thus yielding a unique solution with πw = 0 and completing the proof.

It is not necessary to assume that g = πr in the lemma. If g is treated as a variable in (4.42), then, by pre-multiplying any solution w, g of (4.42) by π, we find that g = πr must be satisfied. This means that (4.42) can be viewed as M linear equations in the M + 1 variables w, g, and the set of solutions can be found without first calculating π. Naturally, π must be found to find the particular solution with πw = 0.
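Numerically, (4.42) together with the normalization πw = 0 is a square linear system in the M + 1 unknowns (w, g). A sketch of one way to solve it, assuming a unichain transition matrix:

```python
import numpy as np

def relative_gain(P, r):
    """Solve w + g e = r + [P] w with pi w = 0 for a unichain; return (w, g).

    The normalization pi w = 0 makes this the relative gain vector of (4.42).
    """
    M = P.shape[0]
    # Steady-state vector, as in the earlier sketch.
    A = np.vstack([(P.T - np.eye(M))[:-1], np.ones(M)])
    b = np.zeros(M); b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    # (I - P) w + g e = r, together with pi w = 0, in the unknowns (w, g).
    B = np.zeros((M + 1, M + 1))
    B[:M, :M] = np.eye(M) - P
    B[:M, M] = 1.0
    B[M, :M] = pi
    x = np.linalg.solve(B, np.concatenate([r, [0.0]]))
    return x[:M], x[M]

# Figure 4.6 again: expect w = (-25, +25) and g = 0.5.
P = np.array([[0.99, 0.01], [0.01, 0.99]])
r = np.array([0.0, 1.0])
w, g = relative_gain(P, r)
print(w, g)
```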

If the final reward vector is chosen to be any solution w of (4.42) (not necessarily the one with πw = 0), then

v(1,w) = r + [P]w = w + ge
v(2,w) = r + [P]{w + ge} = w + 2ge
· · ·
v(n,w) = r + [P]{w + (n−1)ge} = w + nge.        (4.44)

This is a simple explicit expression for expected aggregate gain for this special final reward vector. We now show how to use this to get a simple expression for v(n,u) for arbitrary u. From (4.36),

v(n,u) − v(n,w) = [P]^n {u − w}.        (4.45)

Note that this is valid for any Markov unichain and any reward vector. Substituting (4.44) into (4.45),

v(n,u) = nge + w + [P]^n {u − w}.        (4.46)

It should now be clear why we wanted to allow the final reward vector to differ from the reward vector at other stages. The result is summarized in the following theorem:

Theorem 4.9. Let [P] be the transition matrix of a unichain. Let r be a reward vector and w a solution to (4.42). Then the expected aggregate reward vector over n stages is given by (4.46). If the unichain is ergodic and w satisfies πw = 0, then

lim_{n→∞} {v(n,u) − nge} = w + (πu)e.        (4.47)

Proof: The argument above established (4.46). If the recurrent class is ergodic, then [P]^n approaches a matrix whose rows each equal π, and (4.47) follows.

The set of solutions to (4.42) has the form w + αe where w satisfies πw = 0 and α is any real number. The factor α cancels out in (4.46), so any solution can be used. In (4.47), however, the restriction to πw = 0 is necessary. We have defined the (asymptotic) relative gain vector w to satisfy πw = 0 so that, in the ergodic case, the expected aggregate gain v(n,u) can be cleanly split into an initial transient w, the intermediate gain nge (i.e., g per stage), and the final gain πu, as in (4.47). We shall call other solutions to (4.42) shifted relative gain vectors.


Recall that Examples 4.5.1 and 4.5.2 showed that the aggregate reward vi from state i to enter a trapping state, state 1, is given by the solution to v = r + [P]v, v1 = 0. This aggregate reward, in the general setup of Theorem 4.9, is lim_{n→∞} v(n,u). Since g = 0 and u = 0 in these examples, (4.47) simplifies to lim_{n→∞} v(n,u) = w where w = r + [P]w and πw = w1 = 0. Thus, we see that (4.47) gives the same answer as we got in these examples.

For the example in Figure 4.6, we have seen that w = (−25, 25) (see Exercise 4.21 also). The large relative gain for state 2 accounts for both the immediate reward and the high probability of multiple additional rewards through remaining in state 2. Note that w2 cannot be interpreted as the expected reward up to the first transition from state 2 to 1. The reason for this is that the gain starting from state 1 cannot be ignored; this can be seen from Figure 4.9, which modifies Figure 4.6 by changing P12 to 1. In this case (see Exercise 4.21), w2 − w1 = 1/1.01 ≈ 0.99, reflecting the fact that state 1 is always left immediately, thus reducing the advantage of starting in state 2.

Figure 4.9: A variation of Figure 4.6 in which P12 = 1 (so P11 = 0), with P21 = 0.01, P22 = 0.99, r1 = 0, r2 = 1.

We can now interpret the general solution in (4.46) by viewing ge as the steady state gain per stage, viewing w as the dependence on the initial state, and viewing [P]^n {u − w} as the dependence on the final reward vector u. If the recurrent class is ergodic, then, as seen in (4.47), this final term is asymptotically independent of the starting state and w, but depends on πu.

Example 4.5.4. In order to understand better why (4.47) can be false without the assumption of an ergodic unichain, consider a two state periodic chain with P12 = P21 = 1, r1 = r2 = 0, and arbitrary final reward with u1 ≠ u2. Then it is easy to see that for n even, v1(n) = u1, v2(n) = u2, and for n odd, v1(n) = u2, v2(n) = u1. Thus, the effect of the final reward on the initial state never dies out.

For a unichain with a periodic recurrent class of period d, as in the example above, it is a little hard to interpret w as an asymptotic relative gain vector, since the last term of (4.46) involves w also (i.e., the relative gain of starting in different states depends on both n and u). The trouble is that the final reward happens at a particular phase of the periodic variation, and the starting state determines the set of states at which the final reward is assigned. If we view the final reward as being randomized over a period, with equal probability of occurring at each phase, then, from (4.46),

(1/d) ∑_{m=0}^{d−1} [v(n+m,u) − (n+m)ge] = w + (1/d) [P]^n {I + [P] + · · · + [P]^{d−1}} {u − w}.

Going to the limit n → ∞, and using the result of Exercise 4.18, this becomes almost the


same as the result for an ergodic unichain, i.e.,

lim_{n→∞} (1/d) ∑_{m=0}^{d−1} (v(n+m,u) − (n+m)ge) = w + (eπ)u.        (4.48)

There is an interesting analogy between the steady-state vector π and the relative gain vector w. If the recurrent class of states is ergodic, then any initial distribution on the states approaches the steady state with increasing time, and similarly the effect of any final gain vector becomes negligible (except for the choice of (πu)) with an increasing number of stages. On the other hand, if the recurrent class is periodic, then starting the Markov chain in steady-state maintains the steady state, and similarly, choosing the final gain to be the relative gain vector maintains the same relative gain at each stage.

Theorem 4.9 treated only unichains, and it is sometimes useful to look at asymptotic expressions for chains with m > 1 recurrent classes. In this case, the analogous quantity to a relative gain vector can be expressed as a solution to

w + ∑_{i=1}^{m} g^{(i)} ν^{(i)} = r + [P]w,        (4.49)

where g^{(i)} is the gain of the ith recurrent class and ν^{(i)} is the corresponding right eigenvector of [P] (see Exercise 4.14). Using a solution to (4.49) as a final gain vector, we can repeat the argument in (4.44) to get

v(n,w) = w + n ∑_{i=1}^{m} g^{(i)} ν^{(i)}    for all n ≥ 1.        (4.50)

As expected, the average reward per stage depends on the recurrent class of the initial state. If the initial state, j, is transient, the average reward per stage is averaged over the recurrent classes, using the probability ν^{(i)}_j that state j eventually reaches class i. For an arbitrary final reward vector u, (4.50) can be combined with (4.45) to get

v(n,u) = w + n ∑_{i=1}^{m} g^{(i)} ν^{(i)} + [P]^n {u − w}    for all n ≥ 1.        (4.51)

Eqn. (4.49) always has a solution (see Exercise 4.27), and in fact has an m dimensional set of solutions given by w = w̃ + ∑_i αi ν^{(i)}, where α1, . . . , αm can be chosen arbitrarily and w̃ is any given solution.

4.6 Markov decision theory and dynamic programming

4.6.1 Introduction

In the previous section, we analyzed the behavior of a Markov chain with rewards. In this section, we consider a much more elaborate structure in which a decision maker can select


between various possible decisions for rewards and transition probabilities. In place of the reward ri and the transition probabilities {Pij ; 1 ≤ j ≤ M} associated with a given state i, there is a choice between some number Ki of different rewards, say r^{(1)}_i, r^{(2)}_i, . . . , r^{(Ki)}_i, and a corresponding choice between Ki different sets of transition probabilities, say {P^{(1)}_{ij} ; 1 ≤ j ≤ M}, {P^{(2)}_{ij} ; 1 ≤ j ≤ M}, . . . , {P^{(Ki)}_{ij} ; 1 ≤ j ≤ M}. A decision maker then decides between these Ki possible decisions each time the chain is in state i. Note that if the decision maker chooses decision k for state i, then the reward is r^{(k)}_i and the transition probabilities from state i are {P^{(k)}_{ij} ; 1 ≤ j ≤ M}; it is not possible to choose r^{(k)}_i for one k and {P^{(k)}_{ij} ; 1 ≤ j ≤ M} for another k. We assume that, given Xn = i, and given decision k at time n, the probability of entering state j at time n + 1 is P^{(k)}_{ij}, independent of earlier states and decisions.

Figure 4.10 shows an example of this situation in which the decision maker can choose between two possible decisions in state 2 (K2 = 2) and has no freedom of choice in state 1 (K1 = 1). This figure illustrates the familiar tradeoff between instant gratification (alternative 2) and long term gratification (alternative 1).

Figure 4.10: A Markov decision problem with two alternatives in state 2. In state 1, r1 = 0, P11 = 0.99, P12 = 0.01. Under decision 1 in state 2, r^{(1)}_2 = 1, P^{(1)}_{21} = 0.01, P^{(1)}_{22} = 0.99; under decision 2, r^{(2)}_2 = 50, P^{(2)}_{21} = 1.

It is also possible to consider the situation in which the rewards for each decision are associated with transitions; that is, for decision k in state i, the reward r^{(k)}_{ij} is associated with a transition from i to j. This means that the expected reward for a transition from i with decision k is given by r^{(k)}_i = ∑_j P^{(k)}_{ij} r^{(k)}_{ij}. Thus, as in the previous section, there is no essential loss in generality in restricting attention to the case in which rewards are associated with the states.

The set of rules used by the decision maker in selecting different alternatives at each stage of the chain is called a policy. We want to consider the expected aggregate reward over n trials of the “Markov chain,” as a function of the policy used by the decision maker. If the policy uses the same decision, say ki, at each occurrence of state i, for each i, then that policy corresponds to a homogeneous Markov chain with transition probabilities P^{(ki)}_{ij}. We denote the matrix of these transition probabilities as [P^k], where k = (k1, . . . , kM). Such a policy, i.e., making the decision for each state i independent of time, is called a stationary policy. The aggregate reward for any such stationary policy was found in the previous section. Since both rewards and transition probabilities depend only on the state and the corresponding decision, and not on time, one feels intuitively that stationary policies make a certain amount of sense over a long period of time. On the other hand, assuming some final reward ui for being in state i at the end of the nth trial, one might expect the best policy to depend on time, at least close to the end of the n trials.

In what follows, we first derive the optimal policy for maximizing expected aggregate reward over an arbitrary number n of trials. We shall see that the decision at time m, 0 ≤ m < n, for


the optimal policy does in fact depend both on m and on the final rewards {ui; 1 ≤ i ≤ M}. We call this optimal policy the optimal dynamic policy. This policy is found from the dynamic programming algorithm, which, as we shall see, is conceptually very simple. We then go on to find the relationship between the optimal dynamic policy and the optimal stationary policy and show that each has the same long term gain per trial.

4.6.2 Dynamic programming algorithm

As in our development of Markov chains with rewards, we consider expected aggregate reward over n time periods and we use stages, counting backwards from the final trial. First consider the optimum decision with just one trial (i.e., with just one stage). We start in a given state i at stage 1, make a decision k, obtain the reward r^{(k)}_i, then go to some state j with probability P^{(k)}_{ij} and obtain the final reward uj. This expected aggregate reward is maximized over the choice of k, i.e.,

v^*_i(1,u) = max_k { r^{(k)}_i + ∑_j P^{(k)}_{ij} uj }.        (4.52)

We use the notation v^*_i(n,u) to represent the maximum expected aggregate reward for n stages starting in state i. Note that v^*_i(1,u) depends on the final reward vector u = (u1, u2, . . . , uM)^T. Next consider the maximum expected aggregate reward starting in state i at stage 2. For each state j, 1 ≤ j ≤ M, let vj(1,u) be the expected aggregate reward, over stages 1 and 0, for some arbitrary policy, conditional on the chain being in state j at stage 1. Then if decision k is made in state i at stage 2, the expected aggregate reward for stage 2 is r^{(k)}_i + ∑_j P^{(k)}_{ij} vj(1,u). Note that no matter what policy is chosen at stage 2, this expression is maximized at stage 1 by choosing the stage 1 policy that maximizes vj(1,u). Thus, independent of what we choose at stage 2 (or at earlier times), we must use v^*_j(1,u) for the aggregate gain from stage 1 onward in order to maximize the overall aggregate gain from stage 2. Thus, at stage 2, we achieve maximum expected aggregate gain, v^*_i(2,u), by choosing the k that achieves the following maximum:

v^*_i(2,u) = max_k { r^{(k)}_i + ∑_j P^{(k)}_{ij} v^*_j(1,u) }.        (4.53)

Repeating this argument for successively larger n, we obtain the general expression

v^*_i(n,u) = max_k { r^{(k)}_i + ∑_j P^{(k)}_{ij} v^*_j(n−1,u) }.        (4.54)

Note that this is almost the same as (4.33), differing only by the maximization over k. We can also write this in vector form, for n ≥ 1, as

v^*(n,u) = max_k { r^k + [P^k] v^*(n−1,u) },        (4.55)

where for n = 1, we take v^*(0,u) = u. Here k is a set (or vector) of decisions, k = (k1, k2, . . . , kM) where ki is the decision to be used in state i. [P^k] denotes a matrix whose


(i, j) element is P^{(ki)}_{ij}, and r^k denotes a vector whose ith element is r^{(ki)}_i. The maximization over k in (4.55) is really M separate and independent maximizations, one for each state, i.e., (4.55) is simply a vector form of (4.54). Another frequently useful way to rewrite (4.54) or (4.55) is as follows:

v^*(n,u) = r^{k′} + [P^{k′}] v^*(n−1)    for k′ such that
r^{k′} + [P^{k′}] v^*(n−1) = max_k { r^k + [P^k] v^*(n−1) }.        (4.56)

If k′ satisfies (4.56), it is called an optimal decision for stage n. Note that (4.54), (4.55), and (4.56) are valid with no restrictions (such as recurrent or aperiodic states) on the possible transition probabilities [P^k].

The dynamic programming algorithm is just the calculation of (4.54), (4.55), or (4.56), performed successively for n = 1, 2, 3, . . . . The development of this algorithm, as a systematic tool for solving this class of problems, is due to Bellman [Bel57]. This algorithm yields the optimal dynamic policy for any given final reward vector, u. Along with the calculation of v^*(n,u) for each n, the algorithm also yields the optimal decision at each stage. The surprising simplicity of the algorithm is due to the Markov property. That is, v^*_i(n,u) is the aggregate present and future reward conditional on the present state. Since it is conditioned on the present state, it is independent of the past (i.e., how the process arrived at state i from previous transitions and choices).

Although dynamic programming is computationally straightforward and convenient^7, the asymptotic behavior of v^*(n,u) as n → ∞ is not evident from the algorithm. After working out some simple examples, we look at the general question of asymptotic behavior.
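A compact implementation of the recursion (4.54)–(4.55) is sketched below. Decisions are represented by giving, for each state, a list of (reward, transition-row) pairs; this data layout and the function name are just one possible choice for illustration, not anything fixed by the text.

```python
import numpy as np

def dynamic_program(decisions, u, n_stages):
    """Backward dynamic programming, eq. (4.54)/(4.55).

    decisions[i] is a list of (r, P_row) pairs, one per decision k available
    in state i, where r is the reward r_i^(k) and P_row is the row
    {P_ij^(k); 1 <= j <= M}.  Returns (v, policy) where v[i] = v*_i(n, u)
    and policy[n][i] is an optimal decision index for state i at stage n.
    """
    M = len(decisions)
    v = np.asarray(u, dtype=float).copy()     # v*(0, u) = u
    policy = {}
    for n in range(1, n_stages + 1):
        new_v = np.empty(M)
        stage_policy = []
        for i in range(M):
            values = [r + np.dot(P_row, v) for (r, P_row) in decisions[i]]
            k_opt = int(np.argmax(values))
            stage_policy.append(k_opt)
            new_v[i] = values[k_opt]
        v = new_v
        policy[n] = stage_policy
    return v, policy
```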

Example 4.6.1. Consider Fig. 4.10, repeated below, with the final rewards u2 = u1 = 0.

[Figure 4.10, repeated.]

Since there is no reward in stage 0, uj = 0. Also r1 = 0, so, from (4.52), the aggregate gain in state 1 at stage 1 is

v^*_1(1,u) = r1 + ∑_j P1j uj = 0.

Similarly, since policy 1 has an immediate reward r^{(1)}_2 = 1 in state 2, and policy 2 has an immediate reward r^{(2)}_2 = 50,

v^*_2(1,u) = max{ [ r^{(1)}_2 + ∑_j P^{(1)}_{2j} uj ], [ r^{(2)}_2 + ∑_j P^{(2)}_{2j} uj ] } = max{1, 50} = 50.

^7 Unfortunately, many dynamic programming problems of interest have enormous numbers of states and possible choices of decision (the so called curse of dimensionality), and thus, even though the equations are simple, the computational requirements might be beyond the range of practical feasibility.


We can now go on to stage 2, using the results above for v^*_j(1,u). From (4.53),

v^*_1(2,u) = r1 + P11 v^*_1(1,u) + P12 v^*_2(1,u) = P12 v^*_2(1,u) = 0.5

v^*_2(2,u) = max{ [ r^{(1)}_2 + ∑_j P^{(1)}_{2j} v^*_j(1,u) ], [ r^{(2)}_2 + P^{(2)}_{21} v^*_1(1,u) ] }
           = max{ [1 + P^{(1)}_{22} v^*_2(1,u)], 50 }
           = max{50.5, 50} = 50.5.

Thus, we have seen that, in state 2, decision 1 is preferable at stage 2, while decision 2 is preferable at stage 1. What is happening is that the choice of decision 2 at stage 1 has made it very profitable to be in state 2 at stage 1. Thus if the chain is in state 2 at stage 2, it is preferable to choose decision 1 (i.e., the small unit gain) at stage 2 with the corresponding high probability of remaining in state 2 at stage 1. Continuing this computation for larger n, one finds that v^*_1(n,u) = (n−1)/2 and v^*_2(n,u) = 50 + (n−1)/2. The optimum dynamic policy is decision 2 for stage 1 and decision 1 for all stages n > 1.
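As a usage example, the dynamic_program sketch given earlier can check these values numerically, using the Figure 4.10 parameters as reconstructed above (so the specific numbers below are tied to that reconstruction):

```python
# State 1 has a single decision; state 2 has two (Figure 4.10).
decisions = [
    [(0.0, [0.99, 0.01])],                       # state 1: r1 = 0
    [(1.0, [0.01, 0.99]), (50.0, [1.0, 0.0])],   # state 2: decisions 1 and 2
]
u = [0.0, 0.0]
v, policy = dynamic_program(decisions, u, n_stages=4)
print(v)                        # approx. [1.5, 51.5] at n = 4
print(policy[1][1], policy[4][1])
# For state 2: decision index 1 ("decision 2") at stage 1,
# decision index 0 ("decision 1") at stage 4.
```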

This example also illustrates that the maximization of expected gain is not necessarily what is most desirable in all applications. For example, people who want to avoid risk might well prefer decision 2 at stage 2. This guarantees a reward of 50, rather than taking a small chance of losing that reward.

Example 4.6.2 (Shortest Path Problems). The problem of finding the shortest paths between nodes in a directed graph arises in many situations, from routing in communication networks to calculating the time to complete complex tasks. The problem is quite similar to the expected first passage time of Example 4.5.1. In that problem, arcs in a directed graph were selected according to a probability distribution, whereas here, we must make a decision about which arc to take. Although there are no probabilities here, the problem can be posed as dynamic programming. We suppose that we want to find the shortest path from each node in a directed graph to some particular node, say node 1 (see Figure 4.11). The link lengths are arbitrary numbers that might reflect physical distance, or might reflect an arbitrary type of cost. The length of a path is the sum of the lengths of the arcs on that path. In terms of dynamic programming, a policy is a choice of arc out of each node. Here we want to minimize cost (i.e., path length) rather than maximizing reward, so we simply replace the maximum in the dynamic programming algorithm with a minimum (or, if one wishes, all costs can be replaced with negative rewards).

Figure 4.11: A shortest path problem. The arcs are marked with their lengths. Any unmarked link has length 1.

We start the dynamic programming algorithm with a final cost vector that is 0 for node 1 and infinite for all other nodes. In stage 1, we choose the arc from node 2 to 1 and that


from 4 to 1; the choice at node 3 is immaterial. The stage 1 costs are then

v1(1,u) = 0,   v2(1,u) = 4,   v3(1,u) = ∞,   v4(1,u) = 1.

In stage 2, the cost v3(2,u), for example, is

v3(2,u) = min[ 2 + v2(1,u), 4 + v4(1,u) ] = 5.

The set of costs at stage 2 is

v1(2,u) = 0,   v2(2,u) = 2,   v3(2,u) = 5,   v4(2,u) = 1,

and the policy is for node 2 to go to 4, node 3 to 4, and 4 to 1. At stage 3, node 3 switches to node 2, reducing its path length to 4, and nodes 2 and 4 are unchanged. Further iterations yield no change, and the resulting policy is also the optimal stationary policy.

It can be seen without too much difficulty, for the example of Figure 4.11, that these final aggregate costs and shortest paths also result no matter what final cost vector u (with u1 = 0) is used. We shall see later that this always happens so long as all the cycles in the directed graph (other than the self loop from node 1 to node 1) have positive cost.
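The same backward recursion, with a minimum in place of the maximum, computes these shortest paths. A sketch follows; the adjacency structure is a guess at the graph of Figure 4.11, reconstructed from the stage-by-stage costs quoted above, so treat the specific arcs and lengths as illustrative only.

```python
import math

# arcs[i] = list of (j, length) pairs for arcs out of node i (nodes 1..4).
# Reconstructed from the text: (2,1) has length 4, (3,2) length 2, (3,4) length 4,
# the unmarked arcs (2,4) and (4,1) have length 1, and node 1 has a zero-cost self loop.
arcs = {1: [(1, 0)],
        2: [(1, 4), (4, 1)],
        3: [(2, 2), (4, 4)],
        4: [(1, 1)]}

v = {1: 0.0, 2: math.inf, 3: math.inf, 4: math.inf}   # final cost vector u
for stage in range(1, 5):
    v = {i: min(length + v[j] for (j, length) in arcs[i]) for i in arcs}
    print(stage, v)
# Settles at v = {1: 0, 2: 2, 3: 4, 4: 1}, matching the costs quoted in the text.
```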

4.6.3 Optimal stationary policies

In Example 4.6.1, we saw that there was a final transient (for stage 1) in which decision 2 was taken, and in all other stages, decision 1 was taken. Thus, the optimal dynamic policy used a stationary policy (using decision 1) except for a final transient. It seems reasonable to expect this same type of behavior for typical but more complex Markov decision problems. We can get a clue about how to demonstrate this by first looking at a situation in which the aggregate expected gain of a stationary policy is equal to that of the optimal dynamic policy. Denote some given stationary policy by the vector k′ = (k′1, . . . , k′M) of decisions in each state. Assume that the Markov chain with transition matrix [P^{k′}] is a unichain, i.e., recurrent with perhaps additional transient states. The expected aggregate reward for this stationary policy is then given by (4.46), using the Markov chain with transition matrix [P^{k′}] and reward vector r^{k′}. Let w′ be the relative gain vector for the stationary policy k′. Recall from (4.44) that if w′ is used as the final reward vector, then the expected aggregate gain simplifies to

v^{k′}(n,w′) − ng′e = w′,        (4.57)

where g′ = ∑_i π^{k′}_i r^{k′}_i is the steady-state gain, π^{k′} is the steady-state probability vector, and the relative gain vector w′ satisfies

w′ + g′e = r^{k′} + [P^{k′}]w′ ;    π^{k′} w′ = 0.        (4.58)

The fact that the right hand side of (4.57) is independent of the stage, n, leads us to hypothesize that if the stationary policy k′ is the same as the dynamic policy except for a final transient, then that final transient might disappear if we use w′ as a final reward


vector. To pursue this hypothesis, assume a final reward equal to w′. Then, if k′ maximizes r^k + [P^k]w′ over k, we have

v^*(1,w′) = r^{k′} + [P^{k′}]w′ = max_k { r^k + [P^k]w′ }.        (4.59)

Substituting (4.58) into (4.59), we see that the vector decision k′ is optimal at stage 1 if

w′ + g′e = r^{k′} + [P^{k′}]w′ = max_k { r^k + [P^k]w′ }.        (4.60)

If (4.60) is satisfied, then the optimal gain is given by

v^*(1,w′) = w′ + g′e.        (4.61)

The following theorem now shows that if (4.60) is satisfied, then the decision k′ that maximizes r^k + [P^k]w′ is not only an optimal dynamic policy for stage 1 but is also optimal at all stages (i.e., the stationary policy k′ is also an optimal dynamic policy).

Theorem 4.10. Assume that (4.60) is satisfied for some w′, g′, and k′. Then, if the final reward vector is equal to w′, the stationary policy k′ is an optimal dynamic policy and the optimal expected aggregate gain satisfies

v^*(n,w′) = w′ + ng′e.        (4.62)

Proof: Since k′ maximizes r^k + [P^k]w′, it is an optimal decision at stage 1 for the final vector w′. From (4.60), w′ + g′e = r^{k′} + [P^{k′}]w′, so v^*(1,w′) = w′ + g′e. Thus (4.62) is satisfied for n = 1, and we use induction on n, with n = 1 as a basis, to verify (4.62) in general. Thus, assume that (4.62) is satisfied for n. Then, from (4.55),

v^*(n+1,w′) = max_k { r^k + [P^k] v^*(n,w′) }        (4.63)
            = max_k { r^k + [P^k] {w′ + ng′e} }        (4.64)
            = ng′e + max_k { r^k + [P^k]w′ }        (4.65)
            = (n+1)g′e + w′.        (4.66)

Eqn (4.64) follows from the inductive hypothesis of (4.62), (4.65) follows because [P^k]e = e for all k, and (4.66) follows from (4.60). This verifies (4.62) for n + 1. Also, since k′ maximizes (4.65), it also maximizes (4.63), showing that k′ is the optimal decision at stage n + 1. This completes the inductive step and thus the proof.

Since our major interest in stationary policies is to help understand the relationship between the optimal dynamic policy and stationary policies, we define an optimal stationary policy as follows:

Definition 4.13. A stationary policy k′ is optimal if there is some final reward vector w′ for which k′ is the optimal dynamic policy.


From Theorem 4.10, we see that if there is a solution to (4.60), then the stationary policy k′ that maximizes r^k + [P^k]w′ is an optimal stationary policy. Eqn. (4.60) is known as Bellman's equation, and we now explore the situations in which it has a solution (since these solutions give rise to optimal stationary policies).

Theorem 4.10 made no assumptions beyond Bellman's equation about w′, g′, or the stationary policy k′ that maximizes r^k + [P^k]w′. However, if k′ corresponds to a unichain, then, from Lemma 4.1 and its following discussion, w′ and g′ are uniquely determined (aside from an additive factor of αe in w′) as the relative gain vector and gain per stage for k′.

If Bellman's equation has a solution, w′, g′, then, for every decision k, we have

w′ + g′e ≥ r^k + [P^k]w′,   with equality for some k′.        (4.67)

The Markov chains with transition matrices [P^k] might have multiple recurrent classes, so we let π^{k,R} denote the steady-state probability vector for a given recurrent class R of k. Premultiplying both sides of (4.67) by π^{k,R},

π^{k,R} w′ + g′ π^{k,R} e ≥ π^{k,R} r^k + π^{k,R} [P^k] w′,   with equality for some k′.        (4.68)

Recognizing that π^{k,R} e = 1 and π^{k,R} [P^k] = π^{k,R}, this simplifies to

g′ ≥ π^{k,R} r^k,   with equality for some k′.        (4.69)

This says that if Bellman's equation has a solution w′, g′, then the gain per stage g′ in that solution is greater than or equal to the gain per stage in each recurrent class of each stationary policy, and is equal to the gain per stage in each recurrent class of the maximizing stationary policy, k′. Thus, the maximizing stationary policy is either a unichain or consists of several recurrent classes all with the same gain per stage.

We have been discussing the properties that any solution of Bellman's equation must have, but still have no guarantee that any such solution must exist. The following subsection describes a fairly general algorithm (policy iteration) to find a solution of Bellman's equation, and also shows why, in some cases, no solution exists. Before doing this, however, we look briefly at the overall relations between the states in a Markov decision problem.

For any Markov decision problem, consider a directed graph for which the nodes of the graph are the states in the Markov decision problem, and, for each pair of states (i, j), there is a directed arc from i to j if P^{(ki)}_{ij} > 0 for some decision ki.

Definition 4.14. A state i in a Markov decision problem is reachable from state j if there is a path from j to i in the above directed graph.

Note that if i is reachable from j, then there is a stationary policy in which i is accessible from j (i.e., for each arc (m, l) on the path, a decision km in state m is used for which P^{(km)}_{ml} > 0).

Definition 4.15. A state i in a Markov decision problem is inherently transient if it is not reachable from some state j that is reachable from i. A state i is inherently recurrent


if it is not inherently transient. A class I of states is inherently recurrent if each i ∈ I is inherently recurrent, each is reachable from each other, and no state j ∉ I is reachable from any i ∈ I. A Markov decision problem is inherently recurrent if all states form an inherently recurrent class.

An inherently recurrent class of states is a class that, once entered, can never be left, but which has no subclass with that property. An inherently transient state is transient in at least one stationary policy, but might be recurrent in other policies (but all the states in any such recurrent class must be inherently transient). In the following subsection, we analyze inherently recurrent Markov decision problems. Multiple inherently recurrent classes can be analyzed one by one using the same approach, and we later give a short discussion of inherently transient states.

4.6.4 Policy iteration and the solution of Bellman’s equation

The general idea of policy iteration is to start with an arbitrary unichain stationary policy k′ and to find its gain per stage g′ and its relative gain vector w′. We then check whether Bellman's equation, (4.60), is satisfied, and if not, we find another stationary policy k that is ‘better’ than k′ in a sense to be described later. Unfortunately, the ‘better’ policy that we find might not be a unichain, so the following lemma shows that any such policy can be converted into an equally ‘good’ unichain policy. The algorithm then iteratively finds better and better unichain stationary policies, until eventually one of them satisfies Bellman's equation and is thus optimal.

Lemma 4.2. Let k = (k1, . . . , kM) be an arbitrary stationary policy in an inherently recurrent Markov decision problem. Let R be a recurrent class of states in k. Then a unichain stationary policy k̃ = (k̃1, . . . , k̃M) exists with the recurrent class R and with k̃j = kj for j ∈ R.

Proof: Let j be any state in R. By the inherently recurrent assumption, there is a decision vector, say k′, under which j is accessible from all other states (see Exercise 4.38). Choosing k̃i = ki for i ∈ R and k̃i = k′_i for i ∉ R completes the proof.

Now that we are assured that unichain stationary policies exist and can be found, we can state the policy improvement algorithm for inherently recurrent Markov decision problems. This algorithm is a generalization of Howard's policy iteration algorithm, [How60].

Policy Improvement Algorithm

1. Choose an arbitrary unichain policy k′.

2. For policy k′, calculate w′ and g′ from w′ + g′e = r^{k′} + [P^{k′}]w′.

3. If w′ + g′e = max_k { r^k + [P^k]w′ }, then stop; k′ is optimal.

4. Otherwise, choose i and ki so that w′_i + g′ < r^{(ki)}_i + ∑_j P^{(ki)}_{ij} w′_j. For j ≠ i, let kj = k′_j.

5. If the policy k = (k1, . . . , kM) is not a unichain, then let R be the recurrent class in policy k that contains state i, and let k̃ be the unichain policy of Lemma 4.2. Update k to the value of k̃.


6. Update k′ to the value of k and return to step 2.
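A sketch of this algorithm in code follows (Python/NumPy, using the same decisions data layout as the earlier dynamic programming sketch). For simplicity it assumes every policy it encounters is already a unichain (so step 5 is omitted), and it improves every state with a strictly better decision in each pass, a common variant of step 4, which in the text changes one state at a time.

```python
import numpy as np

def policy_improvement(decisions, tol=1e-9, max_iter=1000):
    """Howard-style policy improvement under the simplifying assumption that
    every policy encountered is a unichain.  Returns (policy, w, g)."""
    M = len(decisions)
    policy = [len(d) - 1 for d in decisions]          # step 1: arbitrary start
    for _ in range(max_iter):
        # Step 2: build [P^k'], r^k' and solve w' + g'e = r^k' + [P^k']w', pi w' = 0.
        P = np.array([decisions[i][policy[i]][1] for i in range(M)], dtype=float)
        r = np.array([decisions[i][policy[i]][0] for i in range(M)])
        A = np.vstack([(P.T - np.eye(M))[:-1], np.ones(M)])
        b = np.zeros(M); b[-1] = 1.0
        pi = np.linalg.solve(A, b)
        B = np.zeros((M + 1, M + 1))
        B[:M, :M] = np.eye(M) - P
        B[:M, M] = 1.0
        B[M, :M] = pi
        x = np.linalg.solve(B, np.concatenate([r, [0.0]]))
        w, g = x[:M], x[M]
        # Steps 3-4: look for states and decisions that beat the current policy.
        improved = False
        for i in range(M):
            for k, (rk, Prow) in enumerate(decisions[i]):
                if rk + np.dot(Prow, w) > w[i] + g + tol:
                    policy[i] = k                     # steps 4 and 6
                    improved = True
        if not improved:                              # step 3: Bellman satisfied
            return policy, w, g
    raise RuntimeError("did not converge")

# The decision problem of Figure 4.10: the optimal stationary policy uses decision 1
# in state 2, with gain per stage g = 0.5 and relative gain vector (-25, 25).
decisions = [
    [(0.0, [0.99, 0.01])],
    [(1.0, [0.01, 0.99]), (50.0, [1.0, 0.0])],
]
print(policy_improvement(decisions))
```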

If the stopping test in step 3 fails, then there is some i for which w′_i + g′ < max_{ki} { r^{(ki)}_i + ∑_j P^{(ki)}_{ij} w′_j }, so step 4 can always be executed if the algorithm does not stop in step 3. The resulting policy k then satisfies

w′ + g′e ≤≠ r^k + [P^k]w′,        (4.70)

where ≤≠ means that the inequality holds componentwise and is strict for at least one component (namely i) of the vectors.

Note that at the end of step 4, [P^k] differs from [P^{k′}] only in the transitions out of state i. Thus the set of states from which i is accessible is the same in k′ as k. If i is recurrent in the unichain k′, then it is accessible from all states in k′ and thus also accessible from all states in k. It follows that i is also recurrent in k and that k is a unichain (see Exercise 4.2). On the other hand, if i is transient in k′, and if R′ is the recurrent class of k′, then R′ must also be a recurrent class of k, since the transitions from states in R′ are unchanged. There are then two possibilities when i is transient in k′. First, if the changes in P^{(ki)}_{ij} eliminate all the paths from i to R′, then a new recurrent class R will be formed with i a member. This is the case in which step 5 is used to change k back to a unichain. Alternatively, if a path still exists to R′, then i is transient in k and k is a unichain with the same recurrent class R′ as k′. These results are summarized in the following lemma:

Lemma 4.3. There are only three possibilities for k at the end of step 4 of the policy improvement algorithm for inherently recurrent Markov decision problems. First, k is a unichain and i is recurrent in both k′ and k. Second, k is not a unichain and i is transient in k′ and recurrent in k. Third, k is a unichain with the same recurrent class as k′ and i is transient in both k′ and k.

The following lemma now asserts that the new policy on returning to step 2 of the algorithm is an improvement over the previous policy k′.

Lemma 4.4. Let k′ be the unichain policy of step 2 in an iteration of the policy improvement algorithm for an inherently recurrent Markov decision problem. Let g′, w′, R′ be the gain per stage, relative gain vector, and recurrent class respectively of k′. Assume the algorithm doesn't stop at step 3 and let k be the unichain policy of step 6. Then either the gain per stage g of k satisfies g > g′ or else the recurrent class of k is R′, the gain per stage satisfies g = g′, and there is a shifted relative gain vector, w, of k satisfying

w′ ≤≠ w   and   w′_j = wj for each j ∈ R′.        (4.71)

Proof*: The policy k of step 4 satisfies (4.70) with strict inequality for the component i in which k′ and k differ. Let R be any recurrent class of k and let π be the steady-state probability vector for R. Premultiplying both sides of (4.70) by π, we get

πw′ + g′ ≤ πr^k + π[P^k]w′.        (4.72)


Recognizing that π[P^k] = π and cancelling terms, this shows that g′ ≤ πr^k. Now (4.70) is satisfied with strict inequality for component i, and thus, if πi > 0, (4.72) is satisfied with strict inequality. Thus,

g′ ≤ πr^k   with equality iff πi = 0.        (4.73)

For the first possibility of Lemma 4.3, k is a unichain and i ∈ R. Thus g′ < πr^k = g. Similarly, for the second possibility in Lemma 4.3, i ∈ R for the new recurrent class that is formed in k, so again g′ < πr^k. Since k̃ is a unichain with the recurrent class R, we have g′ < g again. For the third possibility in Lemma 4.3, i is transient in R′ = R. Thus πi = 0, so π′ = π, and g′ = g. Thus, to complete the proof, we must demonstrate the validity of (4.71) for this case.

We first show that, for each n ≥ 1,

v^k(n,w′) − ng′e ≤ v^k(n+1,w′) − (n+1)g′e.        (4.74)

For n = 1,

v^k(1,w′) = r^k + [P^k]w′.        (4.75)

Using this, (4.70) can be rewritten as

w′ ≤≠ v^k(1,w′) − g′e.        (4.76)

Using (4.75) and then (4.76),

v^k(1,w′) − g′e = r^k + [P^k]w′ − g′e
                ≤ r^k + [P^k]{v^k(1,w′) − g′e} − g′e        (4.77)
                = r^k + [P^k]v^k(1,w′) − 2g′e
                = v^k(2,w′) − 2g′e.

We now use induction on n, using n = 1 as the basis, to demonstrate (4.74) in general. For any n > 1, assume (4.74) for n − 1 as the inductive hypothesis.

v^k(n,w′) − ng′e = r^k + [P^k]v^k(n−1,w′) − ng′e
                 = r^k + [P^k]{v^k(n−1,w′) − (n−1)g′e} − g′e
                 ≤ r^k + [P^k]{v^k(n,w′) − ng′e} − g′e
                 = v^k(n+1,w′) − (n+1)g′e.

This completes the induction, verifying (4.74) and showing that v^k(n,w′) − ng′e is non-decreasing in n. Since k is a unichain, Lemma 4.1 asserts that k has a shifted relative gain vector w, i.e., a solution to (4.42). From (4.46),

v^k(n,w′) = w + ng′e + [P^k]^n {w′ − w}.        (4.78)

Since [P^k]^n is a stochastic matrix, its elements are each between 0 and 1, so the sequence of vectors v^k(n,w′) − ng′e must be bounded independent of n. Since this sequence is also non-decreasing, it must have a limit, say w̃,

lim_{n→∞} v^k(n,w′) − ng′e = w̃.        (4.79)


We next show that w̃ satisfies (4.42) for k.

w̃ = lim_{n→∞} {v^k(n+1,w′) − (n+1)g′e}
  = lim_{n→∞} {r^k + [P^k]v^k(n,w′) − (n+1)g′e}        (4.80)
  = r^k − g′e + [P^k] lim_{n→∞} {v^k(n,w′) − ng′e} = r^k − g′e + [P^k]w̃.        (4.81)

Thus w̃ is a shifted relative gain vector for k. Finally we must show that w̃ satisfies the conditions on w in (4.71). Using (4.76) and iterating with (4.74),

w′ ≤≠ v^k(n,w′) − ng′e ≤ w̃   for all n ≥ 1.        (4.82)

Premultiplying each term in (4.82) by the steady-state probability vector π for k,

πw′ ≤ πv^k(n,w′) − ng′ ≤ πw̃.        (4.83)

Now, k is the same as k′ over the recurrent class, and π = π′ since π is non-zero only over the recurrent class. This means that the first inequality above is actually an equality. Also, going to the limit, we see that πw′ = πw̃. Since πi ≥ 0 and w′_i ≤ w̃_i, this implies that w′_i = w̃_i for all recurrent i, completing the proof.

We now see that each iteration of the algorithm either increases the gain per stage or holds the gain constant and increases the shifted relative gain vector w. Thus the sequence of policies found by the algorithm can never repeat. Since there are a finite number of stationary policies, the algorithm must eventually terminate at step 3. Thus we have proved the following important theorem.

Theorem 4.11. For any inherently recurrent Markov decision problem, there is a solution to Bellman's equation and a maximizing stationary policy that is a unichain.

There are also many interesting Markov decision problems, such as shortest path problems, that contain not only an inherently recurrent class but also some inherently transient states. The following theorem then applies.

Theorem 4.12. Consider a Markov decision problem with a single inherently recurrent class of states and one or more inherently transient states. Let g^* be the maximum gain per stage over all recurrent classes of all stationary policies and assume that each recurrent class with gain per stage equal to g^* is contained in the inherently recurrent class. Then there is a solution to Bellman's equation and a maximizing stationary policy that is a unichain.

Proof*: Let k be a stationary policy which has a recurrent class, R, with gain per stage g^*. Let j be any state in R. Since j is inherently recurrent, there is a decision vector k̃ under which j is accessible from all other states. Choose k′ such that k′_i = ki for all i ∈ R and k′_i = k̃_i for all i ∉ R. Then k′ is a unichain policy with gain per stage g^*. Suppose the policy improvement algorithm is started with this unichain policy. If the algorithm stops at step 3, then k′ satisfies Bellman's equation and we are done. Otherwise, from Lemma 4.4, the unichain policy in step 6 of the algorithm either has a larger gain per stage (which is impossible) or has the same recurrent class R and has a relative gain vector w satisfying


(4.71). Iterating the algorithm, we find successively larger relative gain vectors. Since the policies cannot repeat, the algorithm must eventually stop with a solution to Bellman's equation.

The above theorems give us a good idea of the situations under which optimal stationary policies and solutions to Bellman's equation exist. However, we call a stationary policy optimal if it is the optimal dynamic policy for one special final reward vector. In the next subsection, we will show that if an optimal stationary policy is unique and is an ergodic unichain, then that policy is optimal except for a final transient no matter what the final reward vector is.

4.6.5 Stationary policies with arbitrary final rewards

We start out this subsection with the main theorem, then build up some notation and preliminary ideas for the proof, then prove a couple of lemmas, and finally prove the theorem.

Theorem 4.13. Assume that k′ is a unique optimal stationary policy and is an ergodic unichain with the ergodic class $R = \{1, 2, \ldots, m\}$. Let w′ and g′ be the relative gain vector and gain per stage for k′. Then, for any final reward vector u, the following limit exists and is independent of i:

$$ \lim_{n\to\infty}\bigl[v_i^*(n,u) - ng' - w_i'\bigr] = \beta(u), \tag{4.84} $$

where $\beta(u)$ satisfies

$$ \beta(u) = \lim_{n\to\infty} \pi'\bigl[v^*(n,u) - ng'e - w'\bigr] \tag{4.85} $$

and $\pi'$ is the steady-state vector for k′.

Discussion: The theorem says that, asymptotically, the relative advantage of starting in one state rather than another is independent of the final reward vector, i.e., that for any states i, j, $\lim_{n\to\infty}[v_i^*(n,u) - v_j^*(n,u)]$ is independent of u. For the shortest path problem, for example, this says that $v^*(n,u)$ converges to the shortest path vector for any choice of u for which $u_i = 0$. This means that if the arc lengths change, we can start the algorithm at the shortest paths for the previous arc lengths, and the algorithm is guaranteed to converge to the correct new shortest paths.

To see why the theorem can be false without the ergodic assumption, consider Example 4.5.4 where, even without any choice of decisions, (4.84) is false. Exercise 4.34 shows why the theorem can be false without the uniqueness assumption.

It can also be shown (see Exercise 4.35) that for any Markov decision problem satisfying the hypotheses of Theorem 4.13, there is some $n_0$ such that the optimal dynamic policy uses the optimal stationary policy for all stages $n \ge n_0$. Thus, the dynamic part of the optimal dynamic policy is strictly a transient.
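This behavior is easy to observe numerically. The fragment below is a small made-up illustration (the transition matrices, rewards, and final reward vector are not from the text): running the dynamic programming recursion, the maximizing decisions stop changing after a few stages, and the difference $v_1^*(n,u) - v_2^*(n,u)$ settles to a constant that does not depend on the final reward vector chosen.

```python
import numpy as np

def bellman_backup(P, r, v_prev):
    """One stage of the dynamic programming recursion
    v_i*(n+1) = max_k [ r_i^(k) + sum_j P_ij^(k) v_j*(n) ]."""
    stage = np.array([r[k] + P[k] @ v_prev for k in range(len(P))])
    return stage.max(axis=0), stage.argmax(axis=0)

# Made-up 2-state, 2-decision example.
P = [np.array([[0.5, 0.5], [0.5, 0.5]]),
     np.array([[0.9, 0.1], [0.1, 0.9]])]
r = [np.array([1.0, 0.0]), np.array([0.5, 0.2])]

v = np.array([10.0, -3.0])            # an arbitrary final reward vector u
for n in range(1, 16):
    v, decisions = bellman_backup(P, r, v)
    print(n, decisions, round(v[0] - v[1], 4))
# After a few stages the decisions freeze and v[0] - v[1] converges;
# repeating with a different u changes only the early stages.
```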

The proof of the theorem is quite lengthy. Under the restricted condition that k′ is an ergodic Markov chain, the proof is simpler and involves only Lemma 4.5.


We now develop some notation required for the proof of the theorem. Given a final reward vector u, define $k_i(n)$ for each i and n as the k that maximizes $r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j^*(n,u)$. Then

$$ v_i^*(n+1,u) = r_i^{(k_i(n))} + \sum_j P_{ij}^{(k_i(n))} v_j^*(n,u) \;\ge\; r_i^{(k_i')} + \sum_j P_{ij}^{(k_i')} v_j^*(n,u). \tag{4.86} $$

Similarly, since $k_i'$ maximizes $r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j^*(n,w')$,

$$ v_i^*(n+1,w') = r_i^{(k_i')} + \sum_j P_{ij}^{(k_i')} v_j^*(n,w') \;\ge\; r_i^{(k_i(n))} + \sum_j P_{ij}^{(k_i(n))} v_j^*(n,w'). \tag{4.87} $$

Subtracting (4.87) from (4.86), we get the following two inequalities:

$$ v_i^*(n+1,u) - v_i^*(n+1,w') \;\ge\; \sum_j P_{ij}^{(k_i')}\bigl[v_j^*(n,u) - v_j^*(n,w')\bigr]. \tag{4.88} $$

$$ v_i^*(n+1,u) - v_i^*(n+1,w') \;\le\; \sum_j P_{ij}^{(k_i(n))}\bigl[v_j^*(n,u) - v_j^*(n,w')\bigr]. \tag{4.89} $$

Define

$$ \delta_i(n) = v_i^*(n,u) - v_i^*(n,w'). $$

Then (4.88) and (4.89) become

$$ \delta_i(n+1) \ge \sum_j P_{ij}^{(k_i')}\,\delta_j(n). \tag{4.90} $$

$$ \delta_i(n+1) \le \sum_j P_{ij}^{(k_i(n))}\,\delta_j(n). \tag{4.91} $$

Since $v_i^*(n,w') = ng' + w_i'$ for all i, n,

$$ \delta_i(n) = v_i^*(n,u) - ng' - w_i'. $$

Thus the theorem can be restated as asserting that $\lim_{n\to\infty}\delta_i(n) = \beta(u)$ for each state i.

Define

$$ \delta_{\max}(n) = \max_i \delta_i(n); \qquad \delta_{\min}(n) = \min_i \delta_i(n). $$

Then, from (4.90), $\delta_i(n+1) \ge \sum_j P_{ij}^{(k_i')}\delta_{\min}(n) = \delta_{\min}(n)$. Since this is true for all i,

$$ \delta_{\min}(n+1) \ge \delta_{\min}(n). \tag{4.92} $$

In the same way, from (4.91),

$$ \delta_{\max}(n+1) \le \delta_{\max}(n). \tag{4.93} $$

The following lemma shows that (4.84) is valid for each of the recurrent states.


Lemma 4.5. Under the hypotheses of Theorem 4.13, the limiting expression for $\beta(u)$ in (4.85) exists and

$$ \lim_{n\to\infty}\delta_i(n) = \beta(u) \qquad \text{for } 1 \le i \le m. \tag{4.94} $$

Proof* of Lemma 4.5: Multiplying each side of (4.90) by $\pi_i'$ and summing over i,

$$ \pi'\delta(n+1) \;\ge\; \pi'[P^{k'}]\delta(n) = \pi'\delta(n). $$

Thus $\pi'\delta(n)$ is non-decreasing in n. Also, from (4.93), $\pi'\delta(n) \le \delta_{\max}(n) \le \delta_{\max}(1)$. Since $\pi'\delta(n)$ is non-decreasing and bounded, it has a limit $\beta(u)$ as defined by (4.85) and

$$ \pi'\delta(n) \le \beta(u); \qquad \lim_{n\to\infty}\pi'\delta(n) = \beta(u). \tag{4.95} $$

Next, iterating (4.90) m times, we get

$$ \delta(n+m) \ge [P^{k'}]^m\,\delta(n). $$

Since the recurrent class of k′ is ergodic, (4.28) shows that $\lim_{m\to\infty}[P^{k'}]^m = e\pi'$. Thus

$$ [P^{k'}]^m = e\pi' + [\chi(m)], $$

where $[\chi(m)]$ is a sequence of matrices for which $\lim_{m\to\infty}[\chi(m)] = 0$. Thus

$$ \delta(n+m) \ge e\pi'\delta(n) + [\chi(m)]\delta(n). $$

For any $\varepsilon > 0$, (4.95) shows that for all sufficiently large n, $\pi'\delta(n) \ge \beta(u) - \varepsilon/2$. Also, since $\delta_{\min}(1) \le \delta_i(n) \le \delta_{\max}(1)$ for all i and n, and since $[\chi(m)] \to 0$, we see that $[\chi(m)]\delta(n) \ge -(\varepsilon/2)e$ for all large enough m. Thus, for all large enough n and m, $\delta_i(n+m) \ge \beta(u) - \varepsilon$. Thus, for any $\varepsilon > 0$, there is an $n_0$ such that for all $n \ge n_0$,

$$ \delta_i(n) \ge \beta(u) - \varepsilon. \tag{4.96} $$

Also, from (4.95), we have $\pi'[\delta(n) - \beta(u)e] \le 0$, so

$$ \pi'[\delta(n) - \beta(u)e + \varepsilon e] \le \varepsilon. \tag{4.97} $$

From (4.96), each term $\pi_i'[\delta_i(n) - \beta(u) + \varepsilon]$ on the left side of (4.97) is non-negative, so each must also be smaller than $\varepsilon$. For $\pi_i' > 0$, it follows that

$$ \delta_i(n) - \beta(u) + \varepsilon \le \varepsilon/\pi_i' \qquad \text{for all such } i \text{ and all } n \ge n_0. \tag{4.98} $$

Since $\varepsilon > 0$ is arbitrary, (4.96) and (4.98) together with $\pi_i' > 0$ show that $\lim_{n\to\infty}\delta_i(n) = \beta(u)$, completing the proof of Lemma 4.5.

Since k′ is a unique optimal stationary policy, we have

$$ r_i^{(k_i')} + \sum_j P_{ij}^{(k_i')} w_j' \;>\; r_i^{(k_i)} + \sum_j P_{ij}^{(k_i)} w_j' $$

for all i and all $k_i \ne k_i'$. Since this is a finite set of strict inequalities, there is an $\alpha > 0$ such that, for all i and $k_i \ne k_i'$,

$$ r_i^{(k_i')} + \sum_j P_{ij}^{(k_i')} w_j' \;\ge\; r_i^{(k_i)} + \sum_j P_{ij}^{(k_i)} w_j' + \alpha. \tag{4.99} $$

Since $v_i^*(n,w') = ng' + w_i'$,

$$ v_i^*(n+1,w') = r_i^{(k_i')} + \sum_j P_{ij}^{(k_i')} v_j^*(n,w') \tag{4.100} $$

$$ \ge r_i^{(k_i(n))} + \sum_j P_{ij}^{(k_i(n))} v_j^*(n,w') + \alpha \tag{4.101} $$

for each i with $k_i(n) \ne k_i'$. Subtracting (4.101) from (4.86),

$$ \delta_i(n+1) \le \sum_j P_{ij}^{(k_i(n))}\delta_j(n) - \alpha \qquad \text{for } k_i(n) \ne k_i'. \tag{4.102} $$

Since $\delta_i(n) \le \delta_{\max}(n)$, (4.102) can be further bounded by $\delta_i(n+1) \le \delta_{\max}(n) - \alpha$ for $k_i(n) \ne k_i'$. Combining this with $\delta_i(n+1) = \sum_j P_{ij}^{(k_i')}\delta_j(n)$ for $k_i(n) = k_i'$,

$$ \delta_i(n+1) \le \max\Bigl[\delta_{\max}(n) - \alpha,\; \sum_j P_{ij}^{(k_i')}\delta_j(n)\Bigr]. \tag{4.103} $$

Next, since k′ is a unichain, we can renumber the transient states, $m < i \le M$, so that $\sum_{j<i} P_{ij}^{(k_i')} > 0$ for each i, $m < i \le M$. Since this is a finite set of strict inequalities, there is some $\gamma > 0$ such that

$$ \sum_{j<i} P_{ij}^{(k_i')} \ge \gamma \qquad \text{for } m < i \le M. \tag{4.104} $$

The quantity $\delta_i(n)$ for each transient state i is somewhat difficult to work with directly, so we define the new quantity $\tilde{\delta}_i(n)$, which will be shown in the following lemma to upper bound $\delta_i(n)$. The definition of $\tilde{\delta}_i(n)$ is given iteratively for $n \ge 1$, $m < i \le M$, as

$$ \tilde{\delta}_i(n+1) = \max\Bigl[\tilde{\delta}_M(n) - \alpha,\; \gamma\tilde{\delta}_{i-1}(n) + (1-\gamma)\tilde{\delta}_M(n)\Bigr]. \tag{4.105} $$

The boundary conditions for this are defined to be

$$ \tilde{\delta}_i(1) = \delta_{\max}(1); \qquad m < i \le M \tag{4.106} $$

$$ \tilde{\delta}_m(n) = \sup_{n'\ge n}\,\max_{i\le m}\,\delta_i(n'). \tag{4.107} $$

Lemma 4.6. Under the hypotheses of Theorem 4.13, with $\alpha$ defined by (4.99) and $\gamma$ defined by (4.104), the following three inequalities hold:

$$ \tilde{\delta}_i(n) \le \tilde{\delta}_i(n-1) \qquad \text{for } n \ge 2,\; m \le i \le M \tag{4.108} $$

$$ \tilde{\delta}_i(n) \le \tilde{\delta}_{i+1}(n) \qquad \text{for } n \ge 1,\; m \le i < M \tag{4.109} $$

$$ \delta_j(n) \le \tilde{\delta}_i(n) \qquad \text{for } n \ge 1,\; j \le i,\; m \le i \le M. \tag{4.110} $$


Proof* of (4.108): Since the supremum in (4.107) is over a set decreasing in n,

$$ \tilde{\delta}_m(n) \le \tilde{\delta}_m(n-1) \qquad \text{for } n \ge 1. \tag{4.111} $$

This establishes (4.108) for i = m. To establish (4.108) for n = 2, note that $\tilde{\delta}_i(1) = \delta_{\max}(1)$ for i > m and

$$ \tilde{\delta}_m(1) = \sup_{n'\ge 1}\,\max_{i\le m}\,\delta_i(n') \;\le\; \sup_{n'\ge 1}\,\delta_{\max}(n') \;\le\; \delta_{\max}(1). \tag{4.112} $$

Thus

$$ \tilde{\delta}_i(2) = \max\Bigl[\tilde{\delta}_M(1) - \alpha,\; \gamma\tilde{\delta}_{i-1}(1) + (1-\gamma)\tilde{\delta}_M(1)\Bigr] \le \delta_{\max}(1) = \tilde{\delta}_i(1) \qquad \text{for } i > m. $$

Finally, we use induction for $n \ge 2$, $i > m$, using n = 2 as the basis. Assuming (4.108) for a given $n \ge 2$,

$$ \tilde{\delta}_i(n+1) = \max\bigl[\tilde{\delta}_M(n)-\alpha,\; \gamma\tilde{\delta}_{i-1}(n) + (1-\gamma)\tilde{\delta}_M(n)\bigr] \le \max\bigl[\tilde{\delta}_M(n-1)-\alpha,\; \gamma\tilde{\delta}_{i-1}(n-1) + (1-\gamma)\tilde{\delta}_M(n-1)\bigr] = \tilde{\delta}_i(n). $$

Proof* of (4.109): Using (4.112) and the fact that $\tilde{\delta}_i(1) = \delta_{\max}(1)$ for i > m, (4.109) is valid for n = 1. Using induction on n with n = 1 as the basis, we assume (4.109) for a given $n \ge 1$. Then for $m \le i < M$,

$$ \tilde{\delta}_i(n+1) \le \tilde{\delta}_i(n) \le \gamma\tilde{\delta}_i(n) + (1-\gamma)\tilde{\delta}_M(n) \le \max\bigl[\tilde{\delta}_M(n) - \alpha,\; \gamma\tilde{\delta}_i(n) + (1-\gamma)\tilde{\delta}_M(n)\bigr] = \tilde{\delta}_{i+1}(n+1). $$

Proof* of (4.110): Note that $\delta_j(n) \le \tilde{\delta}_m(n)$ for all $j \le m$ and $n \ge 1$ by the definition in (4.107). From (4.109), $\delta_j(n) \le \tilde{\delta}_i(n)$ for $j \le m \le i$. Also, for all i > m and $j \le i$, $\delta_j(1) \le \delta_{\max}(1) = \tilde{\delta}_i(1)$. Thus (4.110) holds for n = 1. We complete the proof by using induction on n for $m < j \le i$, using n = 1 as the basis. Assume (4.110) for a given $n \ge 1$. Then $\delta_j(n) \le \tilde{\delta}_M(n)$ for all j, and it then follows that $\delta_{\max}(n) \le \tilde{\delta}_M(n)$. Similarly, $\delta_j(n) \le \tilde{\delta}_{i-1}(n)$ for $j \le i-1$. For i > m, we then have

$$ \delta_i(n+1) \le \max\Bigl[\delta_{\max}(n)-\alpha,\; \sum_j P_{ij}^{(k_i')}\delta_j(n)\Bigr] \le \max\Bigl[\tilde{\delta}_M(n)-\alpha,\; \sum_{j<i} P_{ij}^{(k_i')}\tilde{\delta}_{i-1}(n) + \sum_{j\ge i} P_{ij}^{(k_i')}\tilde{\delta}_M(n)\Bigr] \le \max\Bigl[\tilde{\delta}_M(n)-\alpha,\; \gamma\tilde{\delta}_{i-1}(n) + (1-\gamma)\tilde{\delta}_M(n)\Bigr] = \tilde{\delta}_i(n+1), $$

where the final inequality follows from the definition of $\gamma$. Finally, using (4.109) again, we have $\delta_j(n+1) \le \tilde{\delta}_j(n+1) \le \tilde{\delta}_i(n+1)$ for $m < j \le i$, completing the proof of Lemma 4.6.

Proof* of Theorem 4.13: From (4.108), $\tilde{\delta}_i(n)$ is non-increasing in n for $i \ge m$. Also, from (4.109) and (4.97), $\tilde{\delta}_i(n) \ge \tilde{\delta}_m(n) \ge \beta(u)$. Thus $\lim_{n\to\infty}\tilde{\delta}_i(n)$ exists for each $i \ge m$. We then have

$$ \lim_{n\to\infty}\tilde{\delta}_M(n) = \max\Bigl[\lim_{n\to\infty}\tilde{\delta}_M(n)-\alpha,\; \gamma\lim_{n\to\infty}\tilde{\delta}_{M-1}(n) + (1-\gamma)\lim_{n\to\infty}\tilde{\delta}_M(n)\Bigr]. $$


Since $\alpha > 0$, the second term in the maximum above must achieve the maximum in the limit. Thus,

$$ \lim_{n\to\infty}\tilde{\delta}_M(n) = \lim_{n\to\infty}\tilde{\delta}_{M-1}(n). \tag{4.113} $$

In the same way,

$$ \lim_{n\to\infty}\tilde{\delta}_{M-1}(n) = \max\Bigl[\lim_{n\to\infty}\tilde{\delta}_M(n)-\alpha,\; \gamma\lim_{n\to\infty}\tilde{\delta}_{M-2}(n) + (1-\gamma)\lim_{n\to\infty}\tilde{\delta}_M(n)\Bigr]. $$

Again, the second term must achieve the maximum, and using (4.113),

$$ \lim_{n\to\infty}\tilde{\delta}_{M-1}(n) = \lim_{n\to\infty}\tilde{\delta}_{M-2}(n). $$

Repeating this argument,

$$ \lim_{n\to\infty}\tilde{\delta}_i(n) = \lim_{n\to\infty}\tilde{\delta}_{i-1}(n) \qquad \text{for each } i,\; m < i \le M. \tag{4.114} $$

Now, from (4.94), $\lim_{n\to\infty}\delta_i(n) = \beta(u)$ for $i \le m$. From (4.107), then, we see that $\lim_{n\to\infty}\tilde{\delta}_m(n) = \beta(u)$. Combining this with (4.114),

$$ \lim_{n\to\infty}\tilde{\delta}_i(n) = \beta(u) \qquad \text{for each } i \text{ such that } m \le i \le M. \tag{4.115} $$

Combining this with (4.110), we see that for any $\varepsilon > 0$ and any i, $\delta_i(n) \le \beta(u) + \varepsilon$ for large enough n. Combining this with (4.96) completes the proof.

4.7 Summary

This chapter has developed the basic results about finite-state Markov chains from a primarily algebraic standpoint. It was shown that the states of any finite-state chain can be partitioned into classes, where each class is either transient or recurrent, and each class is periodic or aperiodic. If the entire chain is one recurrent class, then the Frobenius theorem, with all its corollaries, shows that $\lambda = 1$ is an eigenvalue of largest magnitude and has positive right and left eigenvectors, unique within a scale factor. The left eigenvector (scaled to be a probability vector) is the steady-state probability vector. If the chain is also aperiodic, then the eigenvalue $\lambda = 1$ is the only eigenvalue of magnitude 1, and all rows of $[P]^n$ converge geometrically in n to the steady-state vector. This same analysis can be applied to each aperiodic recurrent class of a general Markov chain, given that the chain ever enters that class.
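These convergence statements are easy to check numerically. The fragment below is a small made-up illustration (the matrix is not from the text): it computes the steady-state vector as a left eigenvector of eigenvalue 1 and shows every row of $[P]^n$ approaching that vector.

```python
import numpy as np

# A made-up ergodic (recurrent, aperiodic) chain.
P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

# Steady-state vector: left eigenvector of eigenvalue 1, scaled to a probability vector.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi = pi / pi.sum()

for n in (1, 5, 10, 20):
    Pn = np.linalg.matrix_power(P, n)
    print(n, np.abs(Pn - pi).max())   # max row deviation from pi, decaying geometrically
```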

For a periodic recurrent chain of period d, there are d − 1 other eigenvalues of magnitude 1, with all d eigenvalues uniformly placed around the unit circle in the complex plane. Exercise 4.17 shows how to interpret these eigenvectors, and shows that $[P]^{nd}$ converges geometrically as $n \to \infty$.

For an arbitrary finite-state Markov chain, if the initial state is transient, then the Markov chain will eventually enter a recurrent state, and the probability that this takes more than n steps approaches zero geometrically in n; Exercise 4.14 shows how to find the probability that each recurrent class is entered. Given an entry into a particular recurrent class, the results above can be used to analyze the behavior within that class.

The results about Markov chains were extended to Markov chains with rewards. As with renewal processes, the use of reward functions provides a systematic way to approach a large class of problems ranging from first passage times to dynamic programming. The key result here is Theorem 4.9, which provides both an exact expression and an asymptotic expression for the expected aggregate reward over n stages.
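As a quick numerical companion (the values below are made up, not from the text), the expected aggregate reward can be generated by the stage recursion $v(n,u) = r + [P]\,v(n-1,u)$ starting from the final reward vector; for an ergodic unichain, $v(n,u) - ng$ settles to a constant vector, which is the asymptotic part of the expression.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.6, 0.4]])      # made-up ergodic unichain
r = np.array([1.0, 3.0])        # reward vector
u = np.array([0.0, 5.0])        # final reward vector

# Gain per stage g = pi r, with pi the steady-state vector.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi = pi / pi.sum()
g = pi @ r

v = u.copy()
for n in range(1, 61):
    v = r + P @ v               # aggregate expected reward over n stages
    if n % 20 == 0:
        print(n, v - n * g)     # settles to a constant vector
```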

Finally, the results on Markov chains with rewards were used to understand Markov decision theory. We developed the Bellman dynamic programming algorithm, and also investigated the optimal stationary policy. Theorem 4.13 demonstrated the relationship between the optimal dynamic policy and the optimal stationary policy. This section provided only an introduction to dynamic programming and omitted all discussion of discounting (in which future gain is considered worth less than present gain because of interest rates). We also omitted infinite state spaces.

For an introduction to vectors, matrices, and linear algebra, see any introductory text on linear algebra such as Strang [20]. Gantmacher [11] has a particularly complete treatment of non-negative matrices and Perron-Frobenius theory. For further reading on Markov decision theory and dynamic programming, see Bertsekas [3]. Bellman [1] is of historic interest and quite readable.

4.8 Exercises

Exercise 4.1. a) Prove that, for a finite-state Markov chain, if $P_{ii} > 0$ for some i in a recurrent class A, then class A is aperiodic.

b) Show that every finite-state Markov chain contains at least one recurrent set of states. Hint: Construct a directed graph in which the states are nodes and an edge goes from i to j if $i \to j$ but i is not accessible from j. Show that this graph contains no cycles, and thus contains one or more nodes with no outgoing edges. Show that each such node is in a recurrent class. Note: this result is not true for Markov chains with countably infinite state spaces.

Exercise 4.2. Consider a finite-state Markov chain in which some given state, say state 1, is accessible from every other state. Show that the chain has at most one recurrent class of states. (Note that, combined with Exercise 4.1, there is exactly one recurrent class and the chain is then a unichain.)

Exercise 4.3. Show how to generalize the graph in Figure 4.4 to an arbitrary number of states $M \ge 3$ with one cycle of M nodes and one of M − 1 nodes. For M = 4, let node 1 be the node not in the cycle of M − 1 nodes. List the set of states accessible from node 1 in n steps for each $n \le 12$ and show that the bound in Theorem 4.5 is met with equality. Explain why the same result holds for all larger M.


Exercise 4.4. Consider a Markov chain with one ergodic class of m states, say $\{1, 2, \ldots, m\}$, and M − m other states that are all transient. Show that $P_{ij}^n > 0$ for all $j \le m$ and $n \ge (m-1)^2 + 1 + M - m$.

Exercise 4.5. a) Let $\tau$ be the number of states in the smallest cycle of an arbitrary ergodic Markov chain of $M \ge 3$ states. Show that $P_{ij}^n > 0$ for all $n \ge (M-2)\tau + M$. Hint: Look at the last part of the proof of Theorem 4.4.

b) For $\tau = 1$, draw the graph of an ergodic Markov chain (generalized for arbitrary $M \ge 3$) for which there is an i, j for which $P_{ij}^n = 0$ for $n = 2M - 3$. Hint: Look at Figure 4.4.

d) For arbitrary $\tau < M - 1$, draw the graph of an ergodic Markov chain (generalized for arbitrary M) for which there is an i, j for which $P_{ij}^n = 0$ for $n = (M-2)\tau + M - 1$.

Exercise 4.6. A transition probability matrix P is said to be doubly stochastic if

$$ \sum_j P_{ij} = 1 \text{ for all } i; \qquad \sum_i P_{ij} = 1 \text{ for all } j. $$

That is, the row sum and the column sum each equal 1. If a doubly stochastic chain has M states and is ergodic (i.e., has a single class of states and is aperiodic), calculate its steady-state probabilities.

Exercise 4.7. a) Find the steady-state probabilities $\pi_0, \ldots, \pi_{k-1}$ for the Markov chain below. Express your answer in terms of the ratio $\rho = p/q$. Pay particular attention to the special case $\rho = 1$.

b) Sketch $\pi_0, \ldots, \pi_{k-1}$. Give one sketch for $\rho = 1/2$, one for $\rho = 1$, and one for $\rho = 2$.

c) Find the limit of $\pi_0$ as k approaches $\infty$; give separate answers for $\rho < 1$, $\rho = 1$, and $\rho > 1$. Find limiting values of $\pi_{k-1}$ for the same cases.

[Figure: birth-death Markov chain for Exercise 4.7 — states 0, 1, ..., k−1 in a row; each transition to the right has probability p and each transition to the left has probability 1−p, with self-loops at the two end states.]

Exercise 4.8. a) Find the steady-state probabilities for each of the Markov chains in Figure 4.2 of Section 4.1. Assume that all clockwise probabilities in the first graph are the same, say p, and assume that $P_{4,5} = P_{4,1}$ in the second graph.

b) Find the matrices $[P]^2$ for the same chains. Draw the graphs for the Markov chains represented by $[P]^2$, i.e., the graph of two step transitions for the original chains. Find the steady-state probabilities for these two step chains. Explain why your steady-state probabilities are not unique.

c) Find $\lim_{n\to\infty}[P]^{2n}$ for each of the chains.


Exercise 4.9. Answer each of the following questions for each of the following non-negative matrices [A]:

$$ \text{i)}\;\; \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \qquad\qquad \text{ii)}\;\; \begin{bmatrix} 1 & 0 & 0 \\ 1/2 & 1/2 & 0 \\ 0 & 1/2 & 1/2 \end{bmatrix}. $$

a) Find $[A]^n$ in closed form for arbitrary $n > 1$.

b) Find all eigenvalues and all right eigenvectors of [A].

c) Use (b) to show that there is no diagonal matrix $[\Lambda]$ and no invertible matrix [Q] for which $[A][Q] = [Q][\Lambda]$.

d) Rederive the result of part (c) using the result of (a) rather than (b).

Exercise 4.10. a) Show that $g(x)$, as given in (4.21), is a continuous function of x for $x \ge 0$, $x \ne 0$.

b) Show that $g(x) = g(\beta x)$ for all $\beta > 0$. Show that this implies that the supremum of $g(x)$ over $x \ge 0$, $x \ne 0$ is the same as the supremum over $x \ge 0$, $\sum_i x_i = 1$. Note that this shows that the supremum must be achieved, since it is a supremum of a continuous function over a closed and bounded space.

Exercise 4.11. a) Show that if $x_1$ and $x_2$ are real or complex numbers, then $|x_1 + x_2| = |x_1| + |x_2|$ implies that for some $\beta$, $\beta x_1$ and $\beta x_2$ are both real and non-negative.

b) Show from this that if the inequality in (4.25) is satisfied with equality, then there is some $\beta$ for which $\beta x_i = |x_i|$ for all i.

Exercise 4.12. a) Let $\lambda$ be an eigenvalue of a matrix [A], and let $\nu$ and $\pi$ be right and left eigenvectors respectively of $\lambda$, normalized so that $\pi\nu = 1$. Show that

$$ [[A] - \lambda\nu\pi]^2 = [A]^2 - \lambda^2\nu\pi. $$

b) Show that $[[A]^n - \lambda^n\nu\pi][[A] - \lambda\nu\pi] = [A]^{n+1} - \lambda^{n+1}\nu\pi$.

c) Use induction to show that $[[A] - \lambda\nu\pi]^n = [A]^n - \lambda^n\nu\pi$.

Exercise 4.13. Let [P] be the transition matrix for a Markov unichain with M recurrent states, numbered 1 to M, and K transient states, M+1 to M+K. Thus [P] can be partitioned as

$$ [P] = \begin{bmatrix} [P_r] & [0] \\ [P_{tr}] & [P_{tt}] \end{bmatrix}. $$

a) Show that $[P]^n$ can be partitioned as

$$ [P]^n = \begin{bmatrix} [P_r]^n & [0] \\ [P_{ij}^n] & [P_{tt}]^n \end{bmatrix}. $$

That is, the blocks on the diagonal are simply products of the corresponding blocks of [P], and the lower left block is whatever it turns out to be.


b) Let $Q_i$ be the probability that the chain will be in a recurrent state after K transitions, starting from state i, i.e., $Q_i = \sum_{j\le M} P_{ij}^K$. Show that $Q_i > 0$ for all transient i.

c) Let Q be the minimum $Q_i$ over all transient i and show that $P_{ij}^{nK} \le (1-Q)^n$ for all transient i, j (i.e., show that $[P_{tt}]^n$ approaches the all zero matrix [0] with increasing n).

d) Let $\pi = (\pi_r, \pi_t)$ be a left eigenvector of [P] of eigenvalue 1 (if one exists). Show that $\pi_t = 0$ and show that $\pi_r$ must be positive and be a left eigenvector of $[P_r]$. Thus show that $\pi$ exists and is unique (within a scale factor).

e) Show that e is the unique right eigenvector of [P ] of eigenvalue 1 (within a scale factor).

Exercise 4.14. Generalize Exercise 4.13 to the case of a Markov chain [P] with r recurrent classes and one or more transient classes. In particular,

a) Show that [P] has exactly r linearly independent left eigenvectors, $\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(r)}$, of eigenvalue 1, and that the ith can be taken as a probability vector that is positive on the ith recurrent class and zero elsewhere.

b) Show that [P] has exactly r linearly independent right eigenvectors, $\nu^{(1)}, \nu^{(2)}, \ldots, \nu^{(r)}$, of eigenvalue 1, and that the ith can be taken as a vector with $\nu_j^{(i)}$ equal to the probability that recurrent class i will ever be entered starting from state j.

Exercise 4.15. Prove Theorem 4.8. Hint: Use Theorem 4.7 and the results of Exercise 4.13.

Exercise 4.16. Generalize Exercise 4.15 to the case of a Markov chain [P] with r aperiodic recurrent classes and one or more transient classes. In particular, using the left and right eigenvectors $\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(r)}$ and $\nu^{(1)}, \ldots, \nu^{(r)}$ of Exercise 4.14, show that

$$ \lim_{n\to\infty}[P]^n = \sum_i \nu^{(i)}\pi^{(i)}. $$

Exercise 4.17. Suppose a Markov chain with an irreducible matrix [P] is periodic with period d and let $T_i$, $1 \le i \le d$, be the ith subset in the sense of Theorem 4.3. Assume the states are numbered so that the first $M_1$ states are in $T_1$, the next $M_2$ are in $T_2$, and so forth. Thus [P] has the block form given by

$$ [P] = \begin{bmatrix} 0 & [P_1] & 0 & \cdots & 0 \\ 0 & 0 & [P_2] & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & [P_{d-1}] \\ [P_d] & 0 & \cdots & 0 & 0 \end{bmatrix}, $$

where $[P_i]$ has dimension $M_i$ by $M_{i+1}$ for $i < d$ and $M_d$ by $M_1$ for $i = d$.


a) Show that $[P]^d$ has the form

$$ [P]^d = \begin{bmatrix} [Q_1] & 0 & \cdots & 0 \\ 0 & [Q_2] & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & [Q_d] \end{bmatrix}, $$

where $[Q_i] = [P_i][P_{i+1}]\cdots[P_d][P_1]\cdots[P_{i-1}]$.

b) Show that $[Q_i]$ is the matrix of an ergodic Markov chain, so that with the eigenvectors defined in Exercises 4.14 and 4.16, $\lim_{n\to\infty}[P]^{nd} = \sum_i \nu^{(i)}\pi^{(i)}$.

c) Show that $\hat{\pi}^{(i)}$, the left eigenvector of $[Q_i]$ of eigenvalue 1, satisfies $\hat{\pi}^{(i)}[P_i] = \hat{\pi}^{(i+1)}$ for $i < d$ and $\hat{\pi}^{(d)}[P_d] = \hat{\pi}^{(1)}$.

d) Let $\alpha = 2\pi\sqrt{-1}/d$ and let $\pi^{(k)} = (\hat{\pi}^{(1)}, \hat{\pi}^{(2)}e^{\alpha k}, \hat{\pi}^{(3)}e^{2\alpha k}, \ldots, \hat{\pi}^{(d)}e^{(d-1)\alpha k})$. Show that $\pi^{(k)}$ is a left eigenvector of [P] of eigenvalue $e^{-\alpha k}$.

Exercise 4.18. (continuation of Exercise 4.17). a) Show that, with the eigenvectors defined in Exercises 4.14 and 4.16,

$$ \lim_{n\to\infty}[P]^{nd}[P] = \sum_{i=1}^{d}\nu^{(i)}\pi^{(i+1)}, $$

where $\pi^{(d+1)}$ is taken to be $\pi^{(1)}$.

b) Show that, for $1 \le j < d$,

$$ \lim_{n\to\infty}[P]^{nd}[P]^j = \sum_{i=1}^{d}\nu^{(i)}\pi^{(i+j)}, $$

where $\pi^{(d+m)}$ is taken to be $\pi^{(m)}$ for $1 \le m < d$.

c) Show that

$$ \lim_{n\to\infty}[P]^{nd}\Bigl\{ [I] + [P] + \cdots + [P]^{d-1} \Bigr\} = \biggl(\sum_{i=1}^{d}\nu^{(i)}\biggr)\biggl(\sum_{i=1}^{d}\pi^{(i)}\biggr). $$

d) Show that

$$ \lim_{n\to\infty}\frac{1}{d}\Bigl( [P]^n + [P]^{n+1} + \cdots + [P]^{n+d-1} \Bigr) = e\pi, $$

where $\pi$ is the steady-state probability vector for [P]. Hint: Show that $e = \sum_i \nu^{(i)}$ and $\pi = (1/d)\sum_i \pi^{(i)}$.

e) Show that the above result is also valid for periodic unichains.


Exercise 4.19. Assume a friend has developed an excellent program for finding the steady-state probabilities for finite-state Markov chains. More precisely, given the transition matrix [P], the program returns $\lim_{n\to\infty} P_{ii}^n$ for each i. Assume all chains are aperiodic.

a) You want to find the expected time to first reach a given state k starting from a different state m for a Markov chain with transition matrix [P]. You modify the matrix to [P′] where $P'_{km} = 1$, $P'_{kj} = 0$ for $j \ne m$, and $P'_{ij} = P_{ij}$ otherwise. How do you find the desired first passage time from the program output given [P′] as an input? (Hint: The times at which a Markov chain enters any given state can be considered as renewals in a (perhaps delayed) renewal process).

b) Using the same [P′] as the program input, how can you find the expected number of returns to state m before the first passage to state k?

c) Suppose, for the same Markov chain [P] and the same starting state m, you want to find the probability of reaching some given state n before the first passage to k. Modify [P] to some [P″] so that the above program with [P″] as an input allows you to easily find the desired probability.

d) Let $\Pr\{X(0) = i\} = Q_i$, $1 \le i \le M$, be an arbitrary set of initial probabilities for the same Markov chain [P] as above. Show how to modify [P] to some [P‴] for which the steady-state probabilities allow you to easily find the expected time of the first passage to state k.

Exercise 4.20. Suppose A and B are each ergodic Markov chains with transition probabilities $\{P_{A_i,A_j}\}$ and $\{P_{B_i,B_j}\}$ respectively. Denote the steady-state probabilities of A and B by $\{\pi_{A_i}\}$ and $\{\pi_{B_i}\}$ respectively. The chains are now connected and modified as shown below. In particular, states $A_1$ and $B_1$ are now connected and the new transition probabilities P′ for the combined chain are given by

$$ P'_{A_1,B_1} = \varepsilon, \qquad P'_{A_1,A_j} = (1-\varepsilon)P_{A_1,A_j} \quad \text{for all } A_j, $$

$$ P'_{B_1,A_1} = \delta, \qquad P'_{B_1,B_j} = (1-\delta)P_{B_1,B_j} \quad \text{for all } B_j. $$

All other transition probabilities remain the same. Think intuitively of $\varepsilon$ and $\delta$ as being small, but do not make any approximations in what follows. Give your answers to the following questions as functions of $\varepsilon$, $\delta$, $\{\pi_{A_i}\}$ and $\{\pi_{B_i}\}$.

[Figure: chain A and chain B side by side, with state $A_1$ connected to state $B_1$ by a transition of probability $\varepsilon$ and state $B_1$ connected back to $A_1$ by a transition of probability $\delta$.]

a) Assume that $\varepsilon > 0$, $\delta = 0$ (i.e., that A is a set of transient states in the combined chain). Starting in state $A_1$, find the conditional expected time to return to $A_1$ given that the first transition is to some state in chain A.


b) Assume that $\varepsilon > 0$, $\delta = 0$. Find $T_{A,B}$, the expected time to first reach state $B_1$ starting from state $A_1$. Your answer should be a function of $\varepsilon$ and the original steady-state probabilities $\{\pi_{A_i}\}$ in chain A.

c) Assume $\varepsilon > 0$, $\delta > 0$. Find $T_{B,A}$, the expected time to first reach state $A_1$, starting in state $B_1$. Your answer should depend only on $\delta$ and $\{\pi_{B_i}\}$.

d) Assume $\varepsilon > 0$ and $\delta > 0$. Find P′(A), the steady-state probability that the combined chain is in one of the states $\{A_j\}$ of the original chain A.

e) Assume $\varepsilon > 0$, $\delta = 0$. For each state $A_j \ne A_1$ in A, find $v_{A_j}$, the expected number of visits to state $A_j$, starting in state $A_1$, before reaching state $B_1$. Your answer should depend only on $\varepsilon$ and $\{\pi_{A_i}\}$.

f) Assume $\varepsilon > 0$, $\delta > 0$. For each state $A_j$ in A, find $\pi'_{A_j}$, the steady-state probability of being in state $A_j$ in the combined chain. Hint: Be careful in your treatment of state $A_1$.

Exercise 4.21. For the Markov chain with rewards in Figure 4.6,

a) Find the general solution to (4.42) and then find the particular solution (the relative gain vector) with $\pi w = 0$.

b) Modify Figure 4.6 by letting $P_{12}$ be an arbitrary probability. Find g and w again and give an intuitive explanation of why $P_{12}$ affects $w_2$.

Exercise 4.22. a) Show that, for any i,

$$ [P]^i r = ([P]^i - e\pi)r + ge. $$

b) Show that (4.36) can be rewritten as

$$ v(n,u) = \sum_{i=0}^{n-1}([P]^i - e\pi)r + nge + (\pi u)e. $$

c) Show that if [P] is a positive stochastic matrix, then $\sum_{i=0}^{n-1}([P]^i - e\pi)$ converges in the limit $n \to \infty$. Hint: You can use the same argument as in the proof of Corollary 4.6. Note: this sum also converges for an arbitrary ergodic Markov chain.

Exercise 4.23. Consider the Markov chain below:

a) Suppose the chain is started in state i and goes through n transitions; let $v_i(n,u)$ be the expected number of transitions (out of the total of n) until the chain enters the trapping state, state 1. Find an expression for $v(n,u) = (v_1(n,u), v_2(n,u), v_3(n,u))$ in terms of $v(n-1,u)$ (take $v_1(n,u) = 0$ for all n). (Hint: view the system as a Markov reward system; what is the value of r?)

b) Solve numerically for $\lim_{n\to\infty} v(n,u)$. Interpret the meaning of the elements $v_i$ in the solution of (4.30).

c) Give a direct argument why (4.30) provides the solution directly to the expected time from each state to enter the trapping state.


[Figure: Markov chain for Exercise 4.23 — states 1, 2, 3, with state 1 a trapping state (self-transition probability 1) and transition probabilities of 1/2 and 1/4 on the edges among states 2, 3 and into state 1.]

Exercise 4.24. Consider a sequence of IID binary rv's $X_1, X_2, \ldots$. Assume that $\Pr\{X_i = 1\} = p_1$, $\Pr\{X_i = 0\} = p_0 = 1 - p_1$. A binary string $(a_1, a_2, \ldots, a_k)$ occurs at time n if $X_n = a_k, X_{n-1} = a_{k-1}, \ldots, X_{n-k+1} = a_1$. For a given string $(a_1, a_2, \ldots, a_k)$, consider a Markov chain with k + 1 states $\{0, 1, \ldots, k\}$. State 0 is the initial state, state k is a final trapping state where $(a_1, a_2, \ldots, a_k)$ has already occurred, and each intervening state i, $0 < i < k$, has the property that if the subsequent k − i variables take on the values $a_{i+1}, a_{i+2}, \ldots, a_k$, the Markov chain will move successively from state i to i+1 to i+2 and so forth to k. For example, if k = 2 and $(a_1, a_2) = (0, 1)$, the corresponding chain is given by

[Figure: chain for $(a_1, a_2) = (0, 1)$ — states 0, 1, 2; an observed 0 moves state 0 to state 1 and keeps state 1 in state 1, while an observed 1 keeps state 0 in state 0 and moves state 1 to the trapping state 2.]

a) For the chain above, find the mean first passage time from state 0 to state 2.

b) For parts b to d, let $(a_1, a_2, a_3, \ldots, a_k) = (0, 1, 1, \ldots, 1)$, i.e., zero followed by k − 1 ones. Draw the corresponding Markov chain for k = 4.

c) Let $v_i$, $1 \le i \le k$, be the expected first passage time from state i to state k. Note that $v_k = 0$. Show that $v_0 = 1/p_0 + v_1$.

d) For each i, $1 \le i < k$, show that $v_i = \alpha_i + v_{i+1}$ and $v_0 = \beta_i + v_{i+1}$, where $\alpha_i$ and $\beta_i$ are each a product of powers of $p_0$ and $p_1$. Hint: use induction, or iteration, starting with i = 1, and establish both equalities together.

e) Let k = 3 and let $(a_1, a_2, a_3) = (1, 0, 1)$. Draw the corresponding Markov chain for this string. Evaluate $v_0$, the expected first passage time for the string 1,0,1 to occur.

f) Use renewal theory to explain why the answer in part e is different from that in part d with k = 3.

Exercise 4.25. a) Find $\lim_{n\to\infty}[P]^n$ for the Markov chain below. Hint: Think in terms of the long term transition probabilities. Recall that the edges in the graph for a Markov chain correspond to the positive transition probabilities.

b) Let $\pi^{(1)}$ and $\pi^{(2)}$ denote the first two rows of $\lim_{n\to\infty}[P]^n$ and let $\nu^{(1)}$ and $\nu^{(2)}$ denote the first two columns of $\lim_{n\to\infty}[P]^n$. Show that $\pi^{(1)}$ and $\pi^{(2)}$ are independent left eigenvectors of [P], and that $\nu^{(1)}$ and $\nu^{(2)}$ are independent right eigenvectors of [P]. Find the eigenvalue for each eigenvector.


[Figure: Markov chain for Exercise 4.25 — states 1 and 2 are trapping states ($P_{11} = P_{22} = 1$); state 3 has transitions $P_{31}$ to state 1, $P_{32}$ to state 2, and a self-transition $P_{33}$.]

c) Let r be an arbitrary reward vector and consider the equation

$$ w + g^{(1)}\nu^{(1)} + g^{(2)}\nu^{(2)} = r + [P]w. \tag{4.116} $$

Determine what values $g^{(1)}$ and $g^{(2)}$ must have in order for (4.116) to have a solution. Argue that with the additional constraints $w_1 = w_2 = 0$, (4.116) has a unique solution for w and find that w.

d) Show that, with the w above, $w' = w + \alpha\nu^{(1)} + \beta\nu^{(2)}$ satisfies (4.116) for all choices of scalars $\alpha$ and $\beta$.

e) Assume that the reward at stage 0 is $u = w$. Show that $v(n,w) = n(g^{(1)}\nu^{(1)} + g^{(2)}\nu^{(2)}) + w$.

f) For an arbitrary reward u at stage 0, show that $v(n,u) = n(g^{(1)}\nu^{(1)} + g^{(2)}\nu^{(2)}) + w + [P]^n(u - w)$. Note that this verifies (4.49)-(4.51) for this special case.

Exercise 4.26. Generalize Exercise 4.25 to the general case of two recurrent classes and an arbitrary set of transient states. In part (f), you will have to assume that the recurrent classes are ergodic. Hint: generalize the proof of Lemma 4.1 and Theorem 4.9.

Exercise 4.27. Generalize Exercise 4.26 to an arbitrary number of recurrent classes and an arbitrary number of transient states. This verifies (4.49)-(4.51) in general.

Exercise 4.28. Let u and u′ be arbitrary final reward vectors with $u \le u'$.

a) Let k be an arbitrary stationary policy and prove that $v^{k}(n,u) \le v^{k}(n,u')$ for each $n \ge 1$.

b) Prove that $v^*(n,u) \le v^*(n,u')$ for each $n \ge 1$. This is known as the monotonicity theorem.

Exercise 4.29. George drives his car to the theater, which is at the end of a one-way street. There are parking places along the side of the street and a parking garage that costs $5 at the theater. Each parking place is independently occupied or unoccupied with probability 1/2. If George parks n parking places away from the theater, it costs him n cents (in time and shoe leather) to walk the rest of the way. George is myopic and can only see the parking place he is currently passing. If George has not already parked by the time he reaches the nth place, he first decides whether or not he will park if the place is unoccupied, and then observes the place and acts according to his decision. George can never go back and must park in the parking garage if he has not parked before.

a) Model the above problem as a 2 state Markov decision problem. In the "driving" state, state 2, there are two possible decisions: park if the current place is unoccupied or drive on whether or not the current place is unoccupied.


b) Find $v_i^*(n,u)$, the minimum expected aggregate cost for n stages (i.e., immediately before observation of the nth parking place) starting in state i = 1 or 2; it is sufficient to express $v_i^*(n,u)$ in terms of $v_i^*(n-1)$. The final costs, in cents, at stage 0 should be $v_2(0) = 500$, $v_1(0) = 0$.

c) For what values of n is the optimal decision the decision to drive on?

d) What is the probability that George will park in the garage, assuming that he follows the optimal policy?

Exercise 4.30. Consider the dynamic programming problem below with two states and two possible policies, denoted k and k′. The policies differ only in state 2.

[Figure: two 2-state chains. Under policy k: $r_1 = 0$ with $P_{11} = P_{12} = 1/2$, and $r_2^{k} = 5$ with $P_{21} = 1/8$, $P_{22} = 7/8$. Under policy k′: state 1 is unchanged, and $r_2^{k'} = 6$ with $P_{21} = 1/4$, $P_{22} = 3/4$.]

a) Find the steady-state gain per stage, g and g′, for stationary policies k and k′. Show that g = g′.

b) Find the relative gain vectors, w and w′, for stationary policies k and k′.

c) Suppose the final reward, at stage 0, is $u_1 = 0$, $u_2 = u$. For what range of u does the dynamic programming algorithm use decision k in state 2 at stage 1?

d) For what range of u does the dynamic programming algorithm use decision k in state 2 at stage 2? At stage n? You should find that (for this example) the dynamic programming algorithm uses the same decision at each stage n as it uses in stage 1.

e) Find the optimal gain $v_2^*(n,u)$ and $v_1^*(n,u)$ as a function of stage n assuming u = 10.

f) Find $\lim_{n\to\infty}v^*(n,u)$ and show how it depends on u.

Exercise 4.31. Consider a Markov decision problem in which the stationary policies k and k′ each satisfy Bellman's equation, (4.60), and each correspond to ergodic Markov chains.

a) Show that if $r^{k'} + [P^{k'}]w' \ge r^{k} + [P^{k}]w'$ is not satisfied with equality, then $g' > g$.

b) Show that $r^{k'} + [P^{k'}]w' = r^{k} + [P^{k}]w'$ (Hint: use part a).

c) Find the relationship between the relative gain vector $w^{k}$ for policy k and the relative gain vector w′ for policy k′. (Hint: Show that $r^{k} + [P^{k}]w' = ge + w'$; what does this say about w and w′?)

e) Suppose that policy k uses decision 1 in state 1 and policy k′ uses decision 2 in state 1 (i.e., $k_1 = 1$ for policy k and $k_1 = 2$ for policy k′). What is the relationship between $r_1^{(k)}, P_{11}^{(k)}, P_{12}^{(k)}, \ldots, P_{1J}^{(k)}$ for k equal to 1 and 2?

f) Now suppose that policy k uses decision 1 in each state and policy k′ uses decision 2 in each state. Is it possible that $r_i^{(1)} > r_i^{(2)}$ for all i? Explain carefully.

g) Now assume that $r_i^{(1)}$ is the same for all i. Does this change your answer to part f)? Explain.


Exercise 4.32. Consider a Markov decision problem with three states. Assume that each stationary policy corresponds to an ergodic Markov chain. It is known that a particular policy $k' = (k_1, k_2, k_3) = (2, 4, 1)$ is the unique optimal stationary policy (i.e., the gain per stage in steady-state is maximized by always using decision 2 in state 1, decision 4 in state 2, and decision 1 in state 3). As usual, $r_i^{(k)}$ denotes the reward in state i under decision k, and $P_{ij}^{(k)}$ denotes the probability of a transition to state j given state i and given the use of decision k in state i. Consider the effect of changing the Markov decision problem in each of the following ways (the changes in each part are to be considered in the absence of the changes in the other parts):

a) $r_1^{(1)}$ is replaced by $r_1^{(1)} - 1$.

b) $r_1^{(2)}$ is replaced by $r_1^{(2)} + 1$.

c) $r_1^{(k)}$ is replaced by $r_1^{(k)} + 1$ for all state 1 decisions k.

d) for all i, $r_i^{(k_i)}$ is replaced by $r_i^{(k_i)} + 1$ for the decision $k_i$ of policy k′.

For each of the above changes, answer the following questions; give explanations:

1) Is the gain per stage, g′, increased, decreased, or unchanged by the given change?

2) Is it possible that another policy, $k \ne k'$, is optimal after the given change?

Exercise 4.33. (The Odoni Bound) Let k′ be the optimal stationary policy for a Markov decision problem and let g′ and $\pi'$ be the corresponding gain and steady-state probability respectively. Let $v_i^*(n,u)$ be the optimal dynamic expected reward for starting in state i at stage n.

a) Show that $\min_i[v_i^*(n,u) - v_i^*(n-1,u)] \le g' \le \max_i[v_i^*(n,u) - v_i^*(n-1,u)]$ for $n \ge 1$. Hint: Consider premultiplying $v^*(n,u) - v^*(n-1,u)$ by $\pi'$ or by $\pi^{k}$, where k is the optimal dynamic policy at stage n.

b) Show that the lower bound is non-decreasing in n and the upper bound is non-increasing in n, and both converge to g′ with increasing n.

Exercise 4.34. Consider a Markov decision problem with three states, {1, 2, 3}. For state 3, there are two decisions, $r_3^{(1)} = r_3^{(2)} = 0$ and $P_{3,1}^{(1)} = P_{3,2}^{(2)} = 1$. For state 1, there are two decisions, $r_1^{(1)} = 0$, $r_1^{(2)} = -100$ and $P_{1,1}^{(1)} = P_{1,3}^{(2)} = 1$. For state 2, there are two decisions, $r_2^{(1)} = 0$, $r_2^{(2)} = -100$ and $P_{2,2}^{(1)} = P_{2,3}^{(2)} = 1$.

a) Show that there are two ergodic unichain optimal stationary policies, one using decision 1 in states 1 and 3 and decision 2 in state 2. The other uses the opposite decision in each state.

b) Find the relative gain vector for each of the above stationary policies.

c) Let u be the final reward vector. Show that the first stationary policy above is the optimal dynamic policy in all stages if $u_1 \ge u_2 + 100$ and $u_3 \ge u_2 + 100$. Show that a non-unichain stationary policy is the optimal dynamic policy if $u_1 = u_2 = u_3$.


d) Theorem 4.13 implies that, under the conditions of the theorem, $\lim_{n\to\infty}[v_i^*(n,u) - v_j^*(n,u)]$ is independent of u. Show that this is not true for the conditions of this exercise.

Exercise 4.35. Assume that k′ is a unique optimal stationary policy and corresponds to an ergodic unichain (as in Theorem 4.13). Let w′ and g′ be the relative gain and gain per stage for k′ and let u be an arbitrary final reward vector.

a) Let $k' = (k_1', k_2', \ldots, k_M')$. Show that there is some $\alpha > 0$ such that, for each i and each $k \ne k_i'$,

$$ r_i^{(k_i')} + \sum_j P_{ij}^{(k_i')}w_j' \;\ge\; r_i^{(k)} + \sum_j P_{ij}^{(k)}w_j' + \alpha. $$

Hint: Look at the proof of Lemma 4.5.

b) Show that there is some $n_0$ such that for all $n \ge n_0$,

$$ \Bigl| v_j^*(n-1,u) - (n-1)g' - w_j' - \beta(u) \Bigr| < \alpha/2, $$

where $\beta(u)$ is given in Theorem 4.13.

c) Use part b) to show that for all i and all $n \ge n_0$,

$$ r_i^{(k_i')} + \sum_j P_{ij}^{(k_i')}v_j^*(n-1,u) \;>\; r_i^{(k_i')} + \sum_j P_{ij}^{(k_i')}w_j' + (n-1)g' + \beta(u) - \alpha/2. $$

d) Use parts a) and b) to show that for all i, all $n \ge n_0$, and all $k \ne k_i'$,

$$ r_i^{(k)} + \sum_j P_{ij}^{(k)}v_j^*(n-1,u) \;<\; r_i^{(k_i')} + \sum_j P_{ij}^{(k_i')}w_j' + (n-1)g' + \beta(u) - \alpha/2. $$

e) Combine parts c) and d) to conclude that the optimal dynamic policy uses policy k′ for all $n \ge n_0$.

Exercise 4.36. Consider an integer time queueing system with a finite buffer of size 2. At the beginning of the nth time interval, the queue contains at most two customers. There is a cost of one unit for each customer in queue (i.e., the cost of delaying that customer). If there is one customer in queue, that customer is served. If there are two customers, an extra server is hired at a cost of 3 units and both customers are served. Thus the total immediate cost for two customers in queue is 5, the cost for one customer is 1, and the cost for 0 customers is 0. At the end of the nth time interval, either 0, 1, or 2 new customers arrive (each with probability 1/3).

a) Assume that the system starts with $0 \le i \le 2$ customers in queue at time −1 (i.e., in stage 1) and terminates at time 0 (stage 0) with a final cost u of 5 units for each customer in queue (at the beginning of interval 0). Find the expected aggregate cost $v_i(1,u)$ for $0 \le i \le 2$.


b) Assume now that the system starts with i customers in queue at time −2 with the same final cost at time 0. Find the expected aggregate cost $v_i(2,u)$ for $0 \le i \le 2$.

c) For an arbitrary starting time −n, find the expected aggregate cost $v_i(n,u)$ for $0 \le i \le 2$.

d) Find the cost per stage and find the relative cost (gain) vector.

e) Now assume that there is a decision maker who can choose whether or not to hire the extra server when there are two customers in queue. If the extra server is not hired, the 3 unit fee is saved, but only one of the customers is served. If there are two arrivals in this case, assume that one is turned away at a cost of 5 units. Find the minimum dynamic aggregate expected cost $v_i^*(1)$, $0 \le i \le 2$, for stage 1 with the same final cost as before.

f) Find the minimum dynamic aggregate expected cost $v_i^*(n,u)$ for stage n, $0 \le i \le 2$.

g) Now assume a final cost u of one unit per customer rather than 5, and find the new minimum dynamic aggregate expected cost $v_i^*(n,u)$, $0 \le i \le 2$.

Exercise 4.37. Consider a finite-state ergodic Markov chain $\{X_n; n \ge 0\}$ with an integer valued set of states $\{-K, -K+1, \ldots, -1, 0, 1, \ldots, +K\}$, a set of transition probabilities $P_{ij}$, $-K \le i, j \le K$, and initial state $X_0 = 0$. One example of such a chain is given by:

[Figure: example chain with states −1, 0, 1; the transition probabilities shown are 0.9 and 0.1 (transitions of probability 0.9 around the states, with self-loops of probability 0.1).]

Let $\{S_n; n \ge 0\}$ be a stochastic process with $S_n = \sum_{i=0}^{n}X_i$. Parts (a), (b), and (c) are independent of parts (d) and (e). Parts (a), (b), and (c) should be solved both for the special case in the above graph and for the general case.

a) Find $\lim_{n\to\infty}\mathrm{E}[X_n]$ for the example and express $\lim_{n\to\infty}\mathrm{E}[X_n]$ in terms of the steady-state probabilities of $\{X_n, n \ge 0\}$ for the general case.

b) Show that $\lim_{n\to\infty}S_n/n$ exists with probability one and find the value of the limit. Hint: apply renewal-reward theory to $\{X_n; n \ge 0\}$.

c) Assume that $\lim_{n\to\infty}\mathrm{E}[X_n] = 0$. Find $\lim_{n\to\infty}\mathrm{E}[S_n]$.

d) Show that

$$ \Pr\{S_n = s_n \mid S_{n-1} = s_{n-1}, S_{n-2} = s_{n-2}, S_{n-3} = s_{n-3}, \ldots, S_0 = 0\} = \Pr\{S_n = s_n \mid S_{n-1} = s_{n-1}, S_{n-2} = s_{n-2}\}. $$

e) Let $Y_n = (S_n, S_{n-1})$ (i.e., $Y_n$ is a random two dimensional integer valued vector). Show that $\{Y_n; n \ge 0\}$ (where $Y_0 = (0, 0)$) is a Markov chain. Describe the transition probabilities of $\{Y_n; n \ge 0\}$ in terms of $\{P_{ij}\}$.


Exercise 4.38. Consider a Markov decision problem with M states in which some state, say state 1, is inherently reachable from each other state.

a) Show that there must be some other state, say state 2, and some decision, $k_2$, such that $P_{21}^{(k_2)} > 0$.

b) Show that there must be some other state, say state 3, and some decision, $k_3$, such that either $P_{31}^{(k_3)} > 0$ or $P_{32}^{(k_3)} > 0$.

c) Assume, for some i and some set of decisions $k_2, \ldots, k_i$, that, for each j, $2 \le j \le i$, $P_{jl}^{(k_j)} > 0$ for some $l < j$ (i.e., that each state from 2 to i has a non-zero transition to a lower numbered state). Show that there is some state (other than 1 to i), say i + 1, and some decision $k_{i+1}$ such that $P_{i+1,l}^{(k_{i+1})} > 0$ for some $l \le i$.

d) Use parts a), b), and c) to observe that there is a stationary policy $k = k_1, \ldots, k_M$ for which state 1 is accessible from each other state.

