Bertsekas, Dynamic Programming and Optimal Control, Vol. II: Solutions


    Solutions Vol. II, Chapter 1

    1.5

(a) We have
$$\sum_{j=1}^n \tilde p_{ij}(u) = \sum_{j=1}^n \frac{p_{ij}(u) - m_j}{1 - \sum_{k=1}^n m_k} = \frac{\sum_{j=1}^n p_{ij}(u) - \sum_{j=1}^n m_j}{1 - \sum_{k=1}^n m_k} = 1.$$
Therefore, the $\tilde p_{ij}(u)$ are transition probabilities.

(b) We have for the modified problem
$$\tilde J(i) = \min_{u \in U(i)} \Big[ g(i,u) + \alpha\Big(1 - \sum_{j=1}^n m_j\Big) \sum_{j=1}^n \frac{p_{ij}(u) - m_j}{1 - \sum_{k=1}^n m_k}\, \tilde J(j) \Big]
= \min_{u \in U(i)} \Big[ g(i,u) + \alpha \sum_{j=1}^n p_{ij}(u)\tilde J(j) - \alpha \sum_{k=1}^n m_k \tilde J(k) \Big].$$
So
$$\tilde J(i) + \frac{\alpha\sum_{k=1}^n m_k \tilde J(k)}{1-\alpha} = \min_{u\in U(i)} \Big[ g(i,u) + \alpha\sum_{j=1}^n p_{ij}(u)\tilde J(j) - \alpha\sum_{k=1}^n m_k\Big(1 - \frac{1}{1-\alpha}\Big)\tilde J(k) \Big],$$
$$\tilde J(i) + \frac{\alpha\sum_{k=1}^n m_k \tilde J(k)}{1-\alpha} = \min_{u\in U(i)} \Big[ g(i,u) + \alpha\sum_{j=1}^n p_{ij}(u)\Big(\tilde J(j) + \frac{\alpha\sum_{k=1}^n m_k \tilde J(k)}{1-\alpha}\Big) \Big].$$
Thus
$$\tilde J(i) + \frac{\alpha\sum_{k=1}^n m_k \tilde J(k)}{1-\alpha} = J^*(i), \qquad \forall\, i.$$
Q.E.D.
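As a sanity check, the identity above can be verified numerically. The sketch below is not part of the original solution; it uses hypothetical data and takes the modified problem in the form reconstructed above, i.e., transition probabilities $(p_{ij}(u)-m_j)/(1-\sum_k m_k)$ and discount factor $\alpha(1-\sum_k m_k)$:

import numpy as np

np.random.seed(5)
n, n_ctrl, alpha = 3, 2, 0.9
P = np.random.rand(n_ctrl, n, n) + 1.0; P /= P.sum(axis=2, keepdims=True)  # p_ij(u)
g = np.random.rand(n_ctrl, n)                                              # g(i, u)
m = np.full(n, 0.05)                      # m_j > 0, small enough that p_ij - m_j >= 0
s = m.sum()
Pt = (P - m) / (1.0 - s)                  # modified transition probabilities

def vi(P_, g_, disc, iters=4000):
    # plain value iteration for a discounted problem
    J = np.zeros(n)
    for _ in range(iters):
        J = np.min(g_ + disc * P_ @ J, axis=0)
    return J

Jstar = vi(P, g, alpha)                   # original problem
Jtil = vi(Pt, g, alpha * (1.0 - s))       # modified problem
print(Jtil + alpha * (m @ Jtil) / (1.0 - alpha))   # agrees componentwise with Jstar
print(Jstar)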

    1.7

We show that for any bounded function $J : S \to \mathbb{R}$, we have
$$J \le T(J) \ \Rightarrow\ T(J) \le F(J), \qquad (1)$$
$$J \ge T(J) \ \Rightarrow\ T(J) \ge F(J). \qquad (2)$$
For any $\mu$, define
$$F_\mu(J)(i) = \frac{g\big(i,\mu(i)\big) + \alpha\sum_{j\ne i} p_{ij}\big(\mu(i)\big)J(j)}{1 - \alpha p_{ii}\big(\mu(i)\big)}$$
and note that
$$F_\mu(J)(i) = \frac{T_\mu(J)(i) - \alpha p_{ii}\big(\mu(i)\big)J(i)}{1 - \alpha p_{ii}\big(\mu(i)\big)}. \qquad (3)$$
Fix $\varepsilon > 0$. If $J \le T(J)$, let $\mu$ be such that $F_\mu(J) \le F(J) + \varepsilon e$. Then, using Eq. (3),
$$F(J)(i) + \varepsilon \ge F_\mu(J)(i) = \frac{T_\mu(J)(i) - \alpha p_{ii}\big(\mu(i)\big)J(i)}{1 - \alpha p_{ii}\big(\mu(i)\big)} \ge \frac{T(J)(i) - \alpha p_{ii}\big(\mu(i)\big)T(J)(i)}{1 - \alpha p_{ii}\big(\mu(i)\big)} = T(J)(i).$$
Since $\varepsilon > 0$ is arbitrary, we obtain $F(J)(i) \ge T(J)(i)$. Similarly, if $J \ge T(J)$, let $\mu$ be such that $T_\mu(J) \le T(J) + \varepsilon e$. Then, using Eq. (3),
$$F(J)(i) \le F_\mu(J)(i) = \frac{T_\mu(J)(i) - \alpha p_{ii}\big(\mu(i)\big)J(i)}{1 - \alpha p_{ii}\big(\mu(i)\big)} \le \frac{T(J)(i) + \varepsilon - \alpha p_{ii}\big(\mu(i)\big)T(J)(i)}{1 - \alpha p_{ii}\big(\mu(i)\big)} \le T(J)(i) + \frac{\varepsilon}{1-\alpha}.$$
Since $\varepsilon > 0$ is arbitrary, we obtain $F(J)(i) \le T(J)(i)$.

From (1) and (2) we see that $F$ and $T$ have the same fixed points, so $J^*$ is the unique fixed point of $F$. Using the definition of $F$, it can be seen that for any scalar $r > 0$ we have
$$F(J + re) \le F(J) + \alpha re, \qquad F(J) - \alpha re \le F(J - re). \qquad (4)$$
Furthermore, $F$ is monotone, that is,
$$J \le J' \ \Rightarrow\ F(J) \le F(J'). \qquad (5)$$
For any bounded function $J$, let $r > 0$ be such that
$$J^* - re \le J \le J^* + re.$$
Applying $F$ repeatedly to this relation and using Eqs. (4) and (5), we obtain
$$F^k(J) - \alpha^k re \le J^* \le F^k(J) + \alpha^k re.$$
Therefore $F^k(J)$ converges to $J^*$. From Eqs. (1), (2), and (5) we see that
$$J \le T(J) \ \Rightarrow\ T^k(J) \le F^k(J) \le J^*,$$
$$J \ge T(J) \ \Rightarrow\ T^k(J) \ge F^k(J) \ge J^*.$$
These equations demonstrate the faster convergence property of $F$ over $T$.

As a final result (not explicitly required in the problem statement), we show that for any two bounded functions $J : S \to \mathbb{R}$, $J' : S \to \mathbb{R}$, we have
$$\max_j |F(J)(j) - F(J')(j)| \le \alpha \max_j |J(j) - J'(j)|, \qquad (6)$$
so $F$ is a contraction mapping with modulus $\alpha$. Indeed, we have
$$F(J)(i) = \min_{u\in U(i)} \frac{g(i,u) + \alpha\sum_{j\ne i} p_{ij}(u)J(j)}{1 - \alpha p_{ii}(u)}
= \min_{u\in U(i)} \bigg[\frac{g(i,u) + \alpha\sum_{j\ne i} p_{ij}(u)J'(j)}{1 - \alpha p_{ii}(u)} + \frac{\alpha\sum_{j\ne i} p_{ij}(u)\big[J(j) - J'(j)\big]}{1 - \alpha p_{ii}(u)}\bigg]
\le F(J')(i) + \alpha\max_j |J(j) - J'(j)|, \qquad \forall\, i,$$
where we have used the fact that
$$\frac{\alpha\sum_{j\ne i} p_{ij}(u)}{1 - \alpha p_{ii}(u)} \le \alpha, \qquad \text{since } 1 - \alpha p_{ii}(u) \ge 1 - p_{ii}(u) = \sum_{j\ne i} p_{ij}(u).$$
Thus, we have
$$F(J)(i) - F(J')(i) \le \alpha\max_j |J(j) - J'(j)|, \qquad \forall\, i.$$
The roles of $J$ and $J'$ may be reversed, so we can also obtain
$$F(J')(i) - F(J)(i) \le \alpha\max_j |J(j) - J'(j)|, \qquad \forall\, i.$$
Combining the last two inequalities, we see that
$$|F(J)(i) - F(J')(i)| \le \alpha\max_j |J(j) - J'(j)|, \qquad \forall\, i.$$
By taking the maximum over $i$, Eq. (6) follows.
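The faster convergence of $F$ over $T$ can be illustrated numerically. The sketch below (a hypothetical 3-state, 2-control discounted MDP, not part of the original solution) iterates both mappings from $J = 0$, so that $J \le T(J)$ and the ordering $T^k(J) \le F^k(J) \le J^*$ applies:

import numpy as np

np.random.seed(0)
n, m, alpha = 3, 2, 0.9
P = np.random.rand(m, n, n); P /= P.sum(axis=2, keepdims=True)   # P[u, i, j] = p_ij(u)
g = np.random.rand(m, n)                                         # g[u, i] = g(i, u)

def T(J):
    # (TJ)(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]
    return np.min(g + alpha * P @ J, axis=0)

def F(J):
    # (FJ)(i) = min_u [ g(i,u) + alpha * sum_{j != i} p_ij(u) J(j) ] / (1 - alpha p_ii(u))
    vals = np.empty((m, n))
    for u in range(m):
        for i in range(n):
            off = alpha * (P[u, i] @ J - P[u, i, i] * J[i])
            vals[u, i] = (g[u, i] + off) / (1.0 - alpha * P[u, i, i])
    return np.min(vals, axis=0)

Jstar = np.zeros(n)
for _ in range(5000):                    # fixed point J*, to high accuracy
    Jstar = T(Jstar)

JT = JF = np.zeros(n)                    # start both from J = 0
for k in range(1, 31):
    JT, JF = T(JT), F(JF)
    if k % 10 == 0:
        print(k, np.max(np.abs(JT - Jstar)), np.max(np.abs(JF - Jstar)))
# The F-error is no larger than the T-error at every iteration, as the solution shows.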

    1.9

(a) Since $J, J' \in B(S)$, i.e., they are real-valued, bounded functions on $S$, we know that the infimum and the supremum of their difference are finite. We shall denote
$$m = \min_{x\in S}\big[J(x) - J'(x)\big] \qquad \text{and} \qquad M = \max_{x\in S}\big[J(x) - J'(x)\big].$$
Thus
$$m \le J(x) - J'(x) \le M, \qquad \forall\, x\in S,$$
or
$$J'(x) + m \le J(x) \le J'(x) + M, \qquad \forall\, x\in S.$$
Now we apply the mapping $T$ to the above inequalities. By property (1) we know that $T$ will preserve the inequalities. Thus
$$T(J' + me)(x) \le T(J)(x) \le T(J' + Me)(x), \qquad \forall\, x\in S.$$
By property (2) we know that
$$T(J)(x) + \min[a_1 r,\, a_2 r] \le T(J + re)(x) \le T(J)(x) + \max[a_1 r,\, a_2 r].$$
If we replace $J$ by $J'$ and $r$ by $m$ or $M$, we get the inequalities
$$T(J')(x) + \min[a_1 m,\, a_2 m] \le T(J' + me)(x) \le T(J')(x) + \max[a_1 m,\, a_2 m]$$
and
$$T(J')(x) + \min[a_1 M,\, a_2 M] \le T(J' + Me)(x) \le T(J')(x) + \max[a_1 M,\, a_2 M].$$
Thus
$$T(J')(x) + \min[a_1 m,\, a_2 m] \le T(J)(x) \le T(J')(x) + \max[a_1 M,\, a_2 M],$$
so that
$$|T(J)(x) - T(J')(x)| \le \max\big[a_1|M|,\, a_2|M|,\, a_1|m|,\, a_2|m|\big].$$
We also have
$$\max\big[a_1|M|,\, a_2|M|,\, a_1|m|,\, a_2|m|\big] \le a_2\max\big[|M|,\, |m|\big] \le a_2\sup_{x\in S}|J(x) - J'(x)|.$$
Thus
$$|T(J)(x) - T(J')(x)| \le a_2\max_{x\in S}|J(x) - J'(x)|,$$
from which
$$\max_{x\in S}|T(J)(x) - T(J')(x)| \le a_2\max_{x\in S}|J(x) - J'(x)|.$$
Thus $T$ is a contraction mapping, since we know by the statement of the problem that $0 \le a_1 \le a_2 < 1$. Since the set $B(S)$ of bounded real-valued functions is a complete linear space, we conclude that the contraction mapping $T$ has a unique fixed point $J^*$, and $\lim_{k\to\infty} T^k(J)(x) = J^*(x)$.

(b) We shall first prove the lower bounds for $J^*(x)$. The upper bounds follow by a similar argument. Since $J, T(J) \in B(S)$, there exists a $c$ ($|c| < \infty$) such that
$$J(x) + c \le T(J)(x). \qquad (1)$$
We apply $T$ on both sides of (1), and since $T$ preserves the inequalities (by assumption (1)), we have by applying the relation of assumption (2)
$$J(x) + \min[c + a_1 c,\, c + a_2 c] \le T(J)(x) + \min[a_1 c,\, a_2 c] \le T(J + ce)(x) \le T^2(J)(x). \qquad (2)$$
Similarly, if we apply $T$ again we get
$$J(x) + \min_{i\in\{1,2\}}\big[c + a_i c + a_i^2 c\big] \le T(J)(x) + \min\big[a_1 c + a_1^2 c,\, a_2 c + a_2^2 c\big] \le T^2(J)(x) + \min\big[a_1^2 c,\, a_2^2 c\big] \le T\big(T(J) + \min[a_1 c, a_2 c]e\big)(x) \le T^3(J)(x).$$
Thus by induction we conclude
$$J(x) + \min\Big[\sum_{m=0}^k a_1^m c,\ \sum_{m=0}^k a_2^m c\Big] \le T(J)(x) + \min\Big[\sum_{m=1}^k a_1^m c,\ \sum_{m=1}^k a_2^m c\Big] \le \cdots \le T^k(J)(x) + \min\big[a_1^k c,\, a_2^k c\big] \le T^{k+1}(J)(x). \qquad (3)$$
By taking the limit as $k \to \infty$ and noting that the quantities in the minimization are monotone, and either nonnegative or nonpositive, we conclude that
$$J(x) + \min\Big[\frac{c}{1-a_1},\ \frac{c}{1-a_2}\Big] \le T(J)(x) + \min\Big[\frac{a_1 c}{1-a_1},\ \frac{a_2 c}{1-a_2}\Big] \le \cdots \le T^k(J)(x) + \min\Big[\frac{a_1^k c}{1-a_1},\ \frac{a_2^k c}{1-a_2}\Big] \le T^{k+1}(J)(x) + \min\Big[\frac{a_1^{k+1} c}{1-a_1},\ \frac{a_2^{k+1} c}{1-a_2}\Big] \le J^*(x). \qquad (4)$$
Finally we note that
$$\min\big[a_1^k c,\, a_2^k c\big] \le T^{k+1}(J)(x) - T^k(J)(x).$$
Thus
$$\min\big[a_1^k c,\, a_2^k c\big] \le \inf_{x\in S}\big(T^{k+1}(J)(x) - T^k(J)(x)\big).$$
Let $b_{k+1} = \inf_{x\in S}\big(T^{k+1}(J)(x) - T^k(J)(x)\big)$. Thus $\min[a_1^k c,\, a_2^k c] \le b_{k+1}$. From the above relation we infer that
$$\min\Big[\frac{a_1^{k+1} c}{1-a_1},\ \frac{a_2^{k+1} c}{1-a_2}\Big] \le \min\Big[\frac{a_1 b_{k+1}}{1-a_1},\ \frac{a_2 b_{k+1}}{1-a_2}\Big] = c_{k+1}.$$
Therefore
$$T^k(J)(x) + \min\Big[\frac{a_1^k c}{1-a_1},\ \frac{a_2^k c}{1-a_2}\Big] \le T^{k+1}(J)(x) + c_{k+1}.$$
This relationship gives for $k = 1$
$$T(J)(x) + \min\Big[\frac{a_1 c}{1-a_1},\ \frac{a_2 c}{1-a_2}\Big] \le T^2(J)(x) + c_2.$$
Let
$$c = \inf_{x\in S}\big(T(J)(x) - J(x)\big).$$
Then the above inequality still holds. From the definition of $c_1$ we have
$$c_1 = \min\Big[\frac{a_1 c}{1-a_1},\ \frac{a_2 c}{1-a_2}\Big].$$
Therefore
$$T(J)(x) + c_1 \le T^2(J)(x) + c_2,$$
and $T(J)(x) + c_1 \le J^*(x)$ from Eq. (4). Similarly, let $J_1(x) = T(J)(x)$, and let
$$b_2 = \min_{x\in S}\big(T^2(J)(x) - T(J)(x)\big) = \min_{x\in S}\big(T(J_1)(x) - J_1(x)\big).$$
If we proceed as before, we get
$$J_1(x) + \min\Big[\frac{b_2}{1-a_1},\ \frac{b_2}{1-a_2}\Big] \le T(J_1)(x) + \min\Big[\frac{a_1 b_2}{1-a_1},\ \frac{a_2 b_2}{1-a_2}\Big] \le T^2(J_1)(x) + \min\Big[\frac{a_1^2 b_2}{1-a_1},\ \frac{a_2^2 b_2}{1-a_2}\Big] \le \cdots \le J^*(x).$$
Then
$$\min\big[a_1 b_2,\, a_2 b_2\big] \le \min_{x\in S}\big[T^2(J_1)(x) - T(J_1)(x)\big] = \min_{x\in S}\big[T^3(J)(x) - T^2(J)(x)\big] = b_3.$$
Thus
$$\min\Big[\frac{a_1^2 b_2}{1-a_1},\ \frac{a_2^2 b_2}{1-a_2}\Big] \le \min\Big[\frac{a_1 b_3}{1-a_1},\ \frac{a_2 b_3}{1-a_2}\Big].$$
Thus
$$T(J_1)(x) + \min\Big[\frac{a_1 b_2}{1-a_1},\ \frac{a_2 b_2}{1-a_2}\Big] \le T^2(J_1)(x) + \min\Big[\frac{a_1 b_3}{1-a_1},\ \frac{a_2 b_3}{1-a_2}\Big],$$
or
$$T^2(J)(x) + c_2 \le T^3(J)(x) + c_3,$$
and
$$T^2(J)(x) + c_2 \le J^*(x).$$
Proceeding similarly, the result is proved. The reverse inequalities can be proved by a similar argument.

(c) Let us first consider the state $x = 1$:
$$F(J)(1) = \min_{u\in U(1)}\Big[g(1,u) + \alpha\sum_{j=1}^n p_{1j}(u)J(j)\Big].$$
Thus
$$F(J + re)(1) = \min_{u\in U(1)}\Big[g(1,u) + \alpha\sum_{j=1}^n p_{1j}(u)(J + re)(j)\Big] = \min_{u\in U(1)}\Big[g(1,u) + \alpha\sum_{j=1}^n p_{1j}(u)J(j) + \alpha r\Big] = F(J)(1) + \alpha r.$$
Thus
$$\frac{F(J + re)(1) - F(J)(1)}{r} = \alpha. \qquad (1)$$
Since $0 \le \alpha \le 1$, we conclude that $\alpha^n \le \alpha$. Thus
$$\alpha^n \le \frac{F(J + re)(1) - F(J)(1)}{r} = \alpha.$$
For the state $x = 2$ we proceed similarly, and we get
$$F(J)(2) = \min_{u\in U(2)}\Big[g(2,u) + \alpha p_{21}F(J)(1) + \alpha\sum_{j=2}^n p_{2j}J(j)\Big]$$
and
$$F(J + re)(2) = \min_{u\in U(2)}\Big[g(2,u) + \alpha p_{21}F(J + re)(1) + \alpha\sum_{j=2}^n p_{2j}(J + re)(j)\Big] = \min_{u\in U(2)}\Big[g(2,u) + \alpha p_{21}F(J)(1) + \alpha^2 r p_{21} + \alpha\sum_{j=2}^n p_{2j}J(j) + \alpha\sum_{j=2}^n p_{2j}r\Big],$$
where, for the last equality, we used relation (1). Thus we conclude
$$F(J + re)(2) = F(J)(2) + \alpha^2 r p_{21} + \alpha r(1 - p_{21}),$$
which yields
$$\frac{F(J + re)(2) - F(J)(2)}{r} = \alpha^2 p_{21} + \alpha(1 - p_{21}). \qquad (2)$$
Now let us study the behavior of the right-hand side of Eq. (2). We have $0 < \alpha < 1$, so that $\alpha^2 \le \alpha^2 p_{21} + \alpha(1 - p_{21}) \le \alpha$.

Claim:
$$\alpha^x \le \frac{F(J + re)(x) - F(J)(x)}{r} \le \alpha \qquad \text{for every state } x.$$
Proof: We shall employ an inductive argument. Obviously the result holds for $x = 1, 2$. Let us assume that it holds for all $x \le i$. We shall prove it for $x = i + 1$:
$$F(J)(i+1) = \min_{u\in U(i+1)}\Big[g(i+1,u) + \alpha\sum_{j=1}^i p_{i+1,j}F(J)(j) + \alpha\sum_{j=i+1}^n p_{i+1,j}J(j)\Big],$$
$$F(J + re)(i+1) = \min_{u\in U(i+1)}\Big[g(i+1,u) + \alpha\sum_{j=1}^i p_{i+1,j}F(J + re)(j) + \alpha\sum_{j=i+1}^n p_{i+1,j}(J + re)(j)\Big].$$
We know that $\alpha^j r \le F(J + re)(j) - F(J)(j) \le \alpha r$ for $j \le i$; thus
$$F(J)(i+1) + \alpha r\sum_{j=1}^i \alpha^j p_{i+1,j} + \alpha r(1 - p) \le F(J + re)(i+1) \le F(J)(i+1) + \alpha^2 r p + \alpha r(1 - p),$$
where
$$p = \sum_{j=1}^i p_{i+1,j}.$$
Obviously
$$\sum_{j=1}^i \alpha^j p_{i+1,j} \ge \alpha^i\sum_{j=1}^i p_{i+1,j} = \alpha^i p.$$
Thus, for $x = i+1$,
$$\alpha^{i+1}p + \alpha(1 - p) \le \frac{F(J + re)(x) - F(J)(x)}{r} \le \alpha^2 p + \alpha(1 - p).$$
Since $0 < \alpha < 1$, we have $\alpha^{i+1} \le \alpha^{i+1}p + \alpha(1 - p)$ and $\alpha^2 p + \alpha(1 - p) \le \alpha$, which completes the induction.

For property (2) we note that
$$T(J + re)(x) = g(x) + M(J + re)(x) = g(x) + MJ(x) + rMe(x) = T(J)(x) + rMe(x).$$
We have
$$a_1 \le Me(x) \le a_2,$$
so that
$$\frac{T(J + re)(x) - T(J)(x)}{r} = Me(x)$$
and
$$a_1 \le \frac{T(J + re)(x) - T(J)(x)}{r} \le a_2.$$
Thus property (2) also holds if $a_2 < 1$.

    1.10

(a) If there is a unique $\mu$ such that $T_\mu(J) = T(J)$, then there exists an $\varepsilon > 0$ such that for all $\delta \in \mathbb{R}^n$ with $\max_i |\delta(i)| \le \varepsilon$ we have
$$F(J + \delta) = T(J + \delta) - (J + \delta) = g_\mu + \alpha P_\mu(J + \delta) - (J + \delta) = g_\mu + (\alpha P_\mu - I)(J + \delta).$$
It follows that $F$ is linear around $J$ and its Jacobian is $\alpha P_\mu - I$.

(b) We first note that the equation defining Newton's method is the first order Taylor series expansion of $F$ around $J_k$. If $\mu^k$ is the unique $\mu$ such that $T_\mu(J_k) = T(J_k)$, then $F$ is linear near $J_k$ and coincides with its first order Taylor series expansion around $J_k$. Therefore the vector $J_{k+1}$ obtained by the Newton iteration satisfies
$$F_{\mu^k}(J_{k+1}) = 0,$$
or
$$T_{\mu^k}(J_{k+1}) = J_{k+1}.$$
This equation yields $J_{k+1} = J_{\mu^k}$, so the next policy $\mu^{k+1}$ is obtained as
$$\mu^{k+1} = \arg\min_\mu T_\mu(J_{\mu^k}).$$
This is precisely the policy iteration algorithm.
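The equivalence can be checked numerically. The sketch below (a hypothetical 4-state, 3-control discounted MDP, not part of the original solution) performs one Newton step on $F(J) = T(J) - J$ using the Jacobian $\alpha P_\mu - I$ from part (a) and compares it with the exact policy evaluation step of policy iteration:

import numpy as np

np.random.seed(1)
n, m, alpha = 4, 3, 0.9
P = np.random.rand(m, n, n); P /= P.sum(axis=2, keepdims=True)   # P[u, i, j] = p_ij(u)
g = np.random.rand(m, n)                                         # g[u, i] = g(i, u)

J = np.zeros(n)
for k in range(6):
    mu = np.argmin(g + alpha * P @ J, axis=0)     # minimizing policy at J (assumed unique)
    P_mu = P[mu, np.arange(n), :]
    g_mu = g[mu, np.arange(n)]
    TJ = g_mu + alpha * P_mu @ J
    # Newton step on F(J) = T(J) - J with Jacobian alpha*P_mu - I:
    J_newton = J - np.linalg.solve(alpha * P_mu - np.eye(n), TJ - J)
    # Policy iteration step: exact evaluation of mu, J_mu = (I - alpha*P_mu)^(-1) g_mu:
    J_pi = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
    print(k, np.max(np.abs(J_newton - J_pi)))     # ~0: the two iterates coincide
    J = J_pi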


    1.12

For simplicity, we consider the case where $U(i)$ consists of a single control. The calculations are very similar for the more general case. We first compute the row sums $\sum_{j=1}^n \tilde M_{ij}$. We apply the definition of the quantities $\tilde M_{ij}$:
$$\sum_{j=1}^n \tilde M_{ij} = \sum_{j=1}^n \bigg[\delta_{ij} + \frac{(1-\lambda)(M_{ij} - \delta_{ij})}{1 - \lambda m_i}\bigg] = \sum_{j=1}^n \delta_{ij} + \sum_{j=1}^n \frac{(1-\lambda)(M_{ij} - \delta_{ij})}{1 - \lambda m_i} = 1 + \frac{(1-\lambda)m_i}{1 - \lambda m_i} - \frac{1-\lambda}{1 - \lambda m_i} = 1 - \frac{(1-\lambda)(1 - m_i)}{1 - \lambda m_i}.$$
Let $J^*_1, \ldots, J^*_n$ satisfy
$$J^*_i = g_i + \sum_{j=1}^n M_{ij}J^*_j. \qquad (1)$$
We substitute $J^*$ into the new equation
$$\tilde J_i = \tilde g_i + \sum_{j=1}^n \tilde M_{ij}\tilde J_j$$
and manipulate the equation until we reach a relation that holds trivially:
$$J^*_i = \frac{(1-\lambda)g_i}{1 - \lambda m_i} + \sum_{j=1}^n \delta_{ij}J^*_j + \frac{1-\lambda}{1 - \lambda m_i}\sum_{j=1}^n (M_{ij} - \delta_{ij})J^*_j = \frac{(1-\lambda)g_i}{1 - \lambda m_i} + J^*_i + \frac{1-\lambda}{1 - \lambda m_i}\sum_{j=1}^n M_{ij}J^*_j - \frac{1-\lambda}{1 - \lambda m_i}J^*_i = J^*_i + \frac{1-\lambda}{1 - \lambda m_i}\bigg(g_i + \sum_{j=1}^n M_{ij}J^*_j - J^*_i\bigg).$$
This relation follows trivially from Eq. (1) above. Thus $J^*$ is a solution of
$$\tilde J_i = \tilde g_i + \sum_{j=1}^n \tilde M_{ij}\tilde J_j.$$

    1.17

The form of Bellman's equation for the tax problem is
$$J(x) = \min_i\Big[\sum_{j\ne i}c_j(x_j) + \alpha E_{w_i}\big\{J\big(x_1,\ldots,x_{i-1},f_i(x_i,w_i),x_{i+1},\ldots,x_n\big)\big\}\Big].$$
Let $J'(x) = -J(x)$. Then
$$J'(x) = \max_i\Big[-\sum_{j=1}^n c_j(x_j) + c_i(x_i) + \alpha E_{w_i}\big\{J'(\,\cdot\,)\big\}\Big].$$
Let $\tilde J(x) = (1-\alpha)J'(x) + \sum_{j=1}^n c_j(x_j)$. By substitution we obtain
$$\tilde J(x) = \max_i\Big[-(1-\alpha)\sum_{j=1}^n c_j(x_j) + (1-\alpha)c_i(x_i) + \alpha E_{w_i}\big\{(1-\alpha)J'(\,\cdot\,)\big\}\Big] + \sum_{j=1}^n c_j(x_j) = \max_i\Big[c_i(x_i) - \alpha E_{w_i}\big\{c_i\big(f_i(x_i,w_i)\big)\big\} + \alpha E_{w_i}\big\{\tilde J(\,\cdot\,)\big\}\Big].$$
Thus $\tilde J$ satisfies Bellman's equation of a multi-armed bandit problem with reward
$$R_i(x_i) = c_i(x_i) - \alpha E_{w_i}\big\{c_i\big(f_i(x_i,w_i)\big)\big\}.$$

    1.18

Bellman's equation for the restart problem is
$$J(x) = \max\Big[R(x_0) + \alpha E\big\{J\big(f(x_0,w)\big)\big\},\ R(x) + \alpha E\big\{J\big(f(x,w)\big)\big\}\Big]. \qquad (A)$$
Now, consider the one-armed bandit problem with reward $R(x)$:
$$J(x,M) = \max\Big\{M,\ R(x) + \alpha E\big[J\big(f(x,w),M\big)\big]\Big\}. \qquad (B)$$
We have
$$J(x_0,M) = R(x_0) + \alpha E\big[J\big(f(x_0,w),M\big)\big] > M$$
if $M < m(x_0)$, and $J(x_0,M) = M$ otherwise. This implies that
$$R(x_0) + \alpha E\big[J\big(f(x_0,w),m(x_0)\big)\big] = m(x_0).$$
Therefore the forms of both Bellman's equations (A) and (B) are the same when $M = m(x_0)$.


    Solutions Vol. II, Chapter 2

    2.1

(a) (i) First, we need to define a state space for the problem. The obvious choice for a state variable is our location. However, this does not encapsulate all of the necessary information. We also need to include the value of $c$ if it is known. Thus, let the state space consist of the following $2m+2$ states: $\{S, S_1, \ldots, S_m, I_1, \ldots, I_m, D\}$, where $S$ is associated with being at the starting point with no information, $S_i$ and $I_i$ are associated with being at $S$ and $I$, respectively, and knowing that $c = c_i$, and $D$ is the termination state.

At state $S$, there are two possible controls: go directly to $D$ (direct) or go to an intermediate point (indirect). If control direct is selected, we go to state $D$ with probability 1, and the cost is $g(S,\text{direct},D) = a$. If control indirect is selected, we go to state $I_i$ with probability $p_i$, and the cost is $g(S,\text{indirect},I_i) = b$.

At state $S_i$, for $i \in \{1,\ldots,m\}$, we have the same controls as at state $S$. Again, if control direct is selected, we go to state $D$ with probability 1, and the cost is $g(S_i,\text{direct},D) = a$. If, on the other hand, control indirect is selected, we go to state $I_i$ with probability 1, and the cost is $g(S_i,\text{indirect},I_i) = b$.

At state $I_i$, for $i \in \{1,\ldots,m\}$, there are also two possible controls: go back to the start (start) or go to the destination (dest). If control start is selected, we go to state $S_i$ with probability 1, and the cost is $g(I_i,\text{start},S_i) = b$. If control dest is selected, we go to state $D$ with probability 1, and the cost is $g(I_i,\text{dest},D) = c_i$.

We have thus formulated the problem as a stochastic shortest path problem. Bellman's equation for this problem is
$$J^*(S) = \min\Big[a,\ b + \sum_{i=1}^m p_iJ^*(I_i)\Big],$$
$$J^*(S_i) = \min\big[a,\ b + J^*(I_i)\big],$$
$$J^*(I_i) = \min\big[c_i,\ b + J^*(S_i)\big].$$
We assume that $b > 0$. Then Assumptions 5.1 and 5.2 hold, since all improper policies have infinite cost. As a result, if $\mu^*(I_i) = \text{start}$, then $\mu^*(S_i) = \text{direct}$. If $\mu^*(I_i) \ne \text{start}$, then we never reach state $S_i$, and so it does not matter what the control is in this case. Thus, $J^*(S_i) = a$ and $\mu^*(S_i) = \text{direct}$. From this, it is easy to derive the optimal costs and controls for the other states:
$$J^*(I_i) = \min[c_i,\ b + a], \qquad \mu^*(I_i) = \begin{cases}\text{dest}, & \text{if } c_i < b + a,\\ \text{start}, & \text{otherwise},\end{cases}$$
$$J^*(S) = \min\Big[a,\ b + \sum_{i=1}^m p_i\min(c_i,\, b + a)\Big], \qquad \mu^*(S) = \begin{cases}\text{direct}, & \text{if } a < b + \sum_{i=1}^m p_i\min(c_i,\, b + a),\\ \text{indirect}, & \text{otherwise}.\end{cases}$$
For the numerical case given, we see that $a < b + \sum_{i=1}^m p_i\min(c_i,\, b + a)$, since $a = 2$ and $b + \sum_{i=1}^m p_i\min(c_i,\, b + a) = 2.5$. Hence $\mu^*(S) = \text{direct}$. We need not consider the other states, since they will never be reached.

(ii) In this case, every time we are at the starting location, our available information is the same. We thus no longer need the states $S_i$ from part (i). Our state space for this part is then $\{S, I_1, \ldots, I_m, D\}$.

At state $S$, the possible controls are $\{\text{direct},\text{indirect}\}$. If control direct is selected, we go to state $D$ with probability 1, and the cost is $g(S,\text{direct},D) = a$. If control indirect is selected, we go to state $I_i$ with probability $p_i$, and the cost is $g(S,\text{indirect},I_i) = b$ [same as in part (i)].

At state $I_i$, for $i \in \{1,\ldots,m\}$, the possible controls are $\{\text{start},\text{dest}\}$. If control start is selected, we go to state $S$ with probability 1, and the cost is $g(I_i,\text{start},S) = b$. If control dest is selected, we go to state $D$ with probability 1, and the cost is $g(I_i,\text{dest},D) = c_i$.

Bellman's equation for this stochastic shortest path problem is
$$J^*(S) = \min\Big[a,\ b + \sum_{i=1}^m p_iJ^*(I_i)\Big],$$
$$J^*(I_i) = \min\big[c_i,\ b + J^*(S)\big].$$
The optimal policy can be described by
$$\mu^*(S) = \begin{cases}\text{direct}, & \text{if } a < b + \sum_{i=1}^m p_iJ^*(I_i),\\ \text{indirect}, & \text{otherwise},\end{cases} \qquad \mu^*(I_i) = \begin{cases}\text{dest}, & \text{if } c_i < b + J^*(S),\\ \text{start}, & \text{otherwise}.\end{cases}$$
We will solve the problem for the numerical case by guessing an optimal policy $\mu$ and then showing that the resulting cost $J_\mu$ satisfies $J_\mu = TJ_\mu$. Since $J^*$ is the unique solution to this equation, our policy is optimal. So let's guess the initial policy to be
$$\mu(S) = \text{direct}, \qquad \mu(I_1) = \text{dest}, \qquad \mu(I_2) = \text{start}.$$
Then
$$J_\mu(S) = a = 2, \qquad J_\mu(I_1) = c_1 = 0, \qquad J_\mu(I_2) = b + J_\mu(S) = 1 + 2 = 3.$$
From Bellman's equation, we have
$$J(S) = \min\big(2,\ 1 + 0.5(3 + 0)\big) = 2,$$
$$J(I_1) = \min(0,\ 1 + 2) = 0,$$
$$J(I_2) = \min(5,\ 1 + 2) = 3.$$
Thus, our policy is optimal.
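The same fixed point can be reached by simple value iteration. The following check uses the numerical values that appear above ($a = 2$, $b = 1$, $c_1 = 0$, $c_2 = 5$) and takes $p_1 = p_2 = 0.5$, an assumption consistent with the computation shown; it reproduces $J(S) = 2$, $J(I_1) = 0$, $J(I_2) = 3$:

a, b = 2.0, 1.0
c = [0.0, 5.0]
p = [0.5, 0.5]

J_S, J_I = 0.0, [0.0, 0.0]
for _ in range(100):
    # Bellman updates for the states S and I_1, I_2
    J_S = min(a, b + sum(pi * Ji for pi, Ji in zip(p, J_I)))
    J_I = [min(ci, b + J_S) for ci in c]
print(J_S, J_I)   # -> 2.0 [0.0, 3.0]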

(b) The state space for this problem is the same as for part (a)(ii): $\{S, I_1, \ldots, I_m, D\}$.

At state $S$, the possible controls are $\{\text{direct},\text{indirect}\}$. If control direct is selected, we go to state $D$ with probability 1, and the cost is $g(S,\text{direct},D) = a$. If control indirect is selected, we go to state $I_i$ with probability $p_i$, and the cost is $g(S,\text{indirect},I_i) = b$ [same as in parts (a)(i) and (a)(ii)].

At state $I_i$, for $i \in \{1,\ldots,m\}$, we have an additional option of waiting. So the possible controls are $\{\text{start},\text{dest},\text{wait}\}$. If control start is selected, we go to state $S$ with probability 1, and the cost is $g(I_i,\text{start},S) = b$. If control dest is selected, we go to state $D$ with probability 1, and the cost is $g(I_i,\text{dest},D) = c_i$. If control wait is selected, we go to state $I_j$ with probability $p_j$, and the cost is $g(I_i,\text{wait},I_j) = d$.

Bellman's equation is
$$J^*(S) = \min\Big[a,\ b + \sum_{i=1}^m p_iJ^*(I_i)\Big],$$
$$J^*(I_i) = \min\Big[c_i,\ b + J^*(S),\ d + \sum_{j=1}^m p_jJ^*(I_j)\Big].$$
We can describe the optimal policy as follows:
$$\mu^*(S) = \begin{cases}\text{direct}, & \text{if } a < b + \sum_{i=1}^m p_iJ^*(I_i),\\ \text{indirect}, & \text{otherwise}.\end{cases}$$
If direct was selected, we do not need to consider the other states (other than $D$), since they will never be reached. If indirect was selected, then defining $k = \min(2b,\, d)$, we see that
$$\mu^*(I_i) = \begin{cases}\text{dest}, & \text{if } c_i < k + \sum_{j=1}^m p_jJ^*(I_j),\\ \text{start}, & \text{if } c_i > k + \sum_{j=1}^m p_jJ^*(I_j) \text{ and } 2b < d,\\ \text{wait}, & \text{if } c_i > k + \sum_{j=1}^m p_jJ^*(I_j) \text{ and } 2b > d.\end{cases}$$


    2.2

Let's define the following states:
H: last flip outcome was heads;
T: last flip outcome was tails;
C: caught (this is the termination state).

(a) We can formulate this problem as a stochastic shortest path problem with state C being the termination state. There are four possible policies: $\pi_1$ = {always flip the fair coin}, $\pi_2$ = {always flip the two-headed coin}, $\pi_3$ = {flip the fair coin if the last outcome was heads / flip the two-headed coin if the last outcome was tails}, and $\pi_4$ = {flip the fair coin if the last outcome was tails / flip the two-headed coin if the last outcome was heads}. The only way to reach the termination state is to be caught cheating. Under all policies except $\pi_1$, this is inevitable. Thus $\pi_1$ is an improper policy, and $\pi_2$, $\pi_3$, and $\pi_4$ are proper policies.

(b) Let $J_{\pi_1}(H)$ and $J_{\pi_1}(T)$ be the expected benefits of policy $\pi_1$ when the starting state is H and T, respectively. The expected benefit starting from state T up to the first return to T (and always using the fair coin) is
$$\frac12\Big(1 + \frac12 + \frac1{2^2} + \cdots\Big) - \frac m2 = \frac12(2 - m).$$
Therefore
$$J_{\pi_1}(T) = \begin{cases}+\infty & \text{if } m \le 2,\\ -\infty & \text{if } m > 2.\end{cases}$$
Also we have
$$J_{\pi_1}(H) = \frac12\big(1 + J_{\pi_1}(H)\big) + \frac12 J_{\pi_1}(T),$$
so
$$J_{\pi_1}(H) = 1 + J_{\pi_1}(T).$$
It follows that if $m > 2$, then $\pi_1$ results in infinite cost for any initial state.

(c,d) The expected one-stage rewards at each stage are:
play fair in state H: $\tfrac12$;
cheat in state H: $1 - p$;
play fair in state T: $\tfrac{1-m}2$;
cheat in state T: $0$.

We show that any policy that cheats at H at some stage cannot be optimal. As a result we can eliminate cheating from the control constraint set of state H.

Indeed, suppose we are at state H at some stage and consider a policy $\pi$ which cheats at the first stage and then follows the optimal policy $\pi^*$ from the second stage on. Consider a policy $\pi'$ which plays fair at the first stage, and then follows $\pi^*$ from the second stage on if the outcome of the first stage is H, or cheats at the second stage and follows $\pi^*$ from the third stage on if the outcome of the first stage is T. We have
$$J_\pi(H) = (1 - p)\big[1 + J^*(H)\big],$$
$$J_{\pi'}(H) = \frac12\big(1 + J^*(H)\big) + \frac12(1 - p)\big[1 + J^*(H)\big] = \frac12 + \frac12\big[J^*(H) + J_\pi(H)\big] \ge \frac12 + J_\pi(H),$$
where the inequality follows from the fact that $J^*(H) \ge J_\pi(H)$, since $\pi^*$ is optimal. Therefore the reward of policy $\pi$ can be improved by at least $\tfrac12$ by switching to policy $\pi'$, and therefore $\pi$ cannot be optimal.

We now need only consider policies in which the gambler plays fair at state H: $\pi_1$ and $\pi_3$. Under $\pi_1$, we saw from part (b) that the expected benefits are
$$J_{\pi_1}(T) = +\infty \quad \text{if } m \le 2, \qquad J_{\pi_1}(H) = +\infty \quad \text{if } m \le 2.$$
Under $\pi_3$, we have
$$J_{\pi_3}(T) = (1 - p)J_{\pi_3}(H),$$
$$J_{\pi_3}(H) = \frac12\big[1 + J_{\pi_3}(H)\big] + \frac12 J_{\pi_3}(T).$$
Solving these two equations yields
$$J_{\pi_3}(T) = \frac{1 - p}{p}, \qquad J_{\pi_3}(H) = \frac1p.$$
Thus if $m > 2$, it is optimal to cheat if the last flip was tails and play fair otherwise, and if $m \le 2$, it is optimal to always play fair.
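A quick check of the closed forms for $\pi_3$ (not part of the original solution): solving the two linear equations above for several values of $p$ reproduces $J_{\pi_3}(H) = 1/p$ and $J_{\pi_3}(T) = (1-p)/p$:

import numpy as np

for p in (0.1, 0.3, 0.7):
    # Unknowns (J(H), J(T)):  -(1-p) J(H) + J(T) = 0  and  (1/2) J(H) - (1/2) J(T) = 1/2.
    A = np.array([[-(1.0 - p), 1.0],
                  [0.5, -0.5]])
    b = np.array([0.0, 0.5])
    JH, JT = np.linalg.solve(A, b)
    print(p, (JH, JT), (1.0 / p, (1.0 - p) / p))   # solved values vs. closed forms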


    2.7

(a) Let $i$ be any state in $S_m$. Then
$$J^*(i) = \min_{u\in U(i)}E\big\{g(i,u,j) + J^*(j)\big\}
= \min_{u\in U(i)}\bigg[\sum_{j\in S_m}p_{ij}(u)\big[g(i,u,j) + J^*(j)\big] + \sum_{j\in S_{m-1}\cup\cdots\cup S_1\cup\{t\}}p_{ij}(u)\big[g(i,u,j) + J^*(j)\big]\bigg]$$
$$= \min_{u\in U(i)}\Bigg[\sum_{j\in S_m}p_{ij}(u)\big[g(i,u,j) + J^*(j)\big] + \Big(1 - \sum_{j\in S_m}p_{ij}(u)\Big)\,\frac{\sum_{j\in S_{m-1}\cup\cdots\cup S_1\cup\{t\}}p_{ij}(u)\big[g(i,u,j) + J^*(j)\big]}{1 - \sum_{j\in S_m}p_{ij}(u)}\Bigg].$$
In the above equation, we can think of the union of $S_{m-1},\ldots,S_1$, and $t$ as an aggregate termination state $t_m$ associated with $S_m$. The probability of a transition from $i\in S_m$ to $t_m$ (under $u$) is given by
$$p_{it_m}(u) = 1 - \sum_{j\in S_m}p_{ij}(u).$$
The corresponding cost of a transition from $i\in S_m$ to $t_m$ (under $u$) is given by
$$\hat g(i,u,t_m) = \frac{\sum_{j\in S_{m-1}\cup\cdots\cup S_1\cup\{t\}}p_{ij}(u)\big[g(i,u,j) + J^*(j)\big]}{p_{it_m}(u)}.$$
Thus, for $i\in S_m$, Bellman's equation can be written as
$$J^*(i) = \min_{u\in U(i)}\bigg[\sum_{j\in S_m}p_{ij}(u)\big[g(i,u,j) + J^*(j)\big] + p_{it_m}(u)\big[\hat g(i,u,t_m) + 0\big]\bigg].$$
Note that with respect to $S_m$, the termination state $t_m$ is both absorbing and of zero cost. Let $t_m$ and $\hat g(i,u,t_m)$ be similarly constructed for $m = 1,\ldots,M$.

The original stochastic shortest path problem can be solved as $M$ stochastic shortest path subproblems. To see how, start with evaluating $J^*(i)$ for $i\in S_1$ (where $t_1 = \{t\}$). With the values of $J^*(i)$ for $i\in S_1$ in hand, the $\hat g$ cost terms for the $S_2$ problem can be computed. The solution of the original problem continues in this manner as the solution of $M$ stochastic shortest path problems in succession.

(b) Suppose that in the finite horizon problem there are $n$ states. Define a new state space $S_{\text{new}}$ and sets $S_m$ as follows:
$$S_{\text{new}} = \big\{(k,i)\ \big|\ k\in\{0,1,\ldots,M-1\}\ \text{and}\ i\in\{1,2,\ldots,n\}\big\},$$
$$S_m = \big\{(k,i)\ \big|\ k = M - m\ \text{and}\ i\in\{1,2,\ldots,n\}\big\}$$
for $m = 1,2,\ldots,M$. (Note that the $S_m$'s do not overlap.) By associating $S_m$ with the state space of the original finite-horizon problem at stage $k = M - m$, we see that if $i_k\in S_m$, then $i_{k+1}\in S_{m-1}$ under all policies. By augmenting a termination state $t$ which is absorbing and of zero cost, we see that the original finite-horizon problem can be cast as a stochastic shortest path problem with the special structure indicated in the problem statement.


    2.8

Let $J^*$ be the optimal cost of the original problem and $\tilde J$ be the optimal cost of the modified problem. Then we have
$$J^*(i) = \min_u\sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + J^*(j)\big)$$
and
$$\tilde J(i) = \min_u\sum_{j=1,\,j\ne i}^n\frac{p_{ij}(u)}{1 - p_{ii}(u)}\bigg(g(i,u,j) + \frac{g(i,u,i)\,p_{ii}(u)}{1 - p_{ii}(u)} + \tilde J(j)\bigg).$$
For each $i$, let $\mu(i)$ be a control such that
$$J^*(i) = \sum_{j=1}^n p_{ij}\big(\mu(i)\big)\Big(g\big(i,\mu(i),j\big) + J^*(j)\Big).$$
Then
$$J^*(i) = \sum_{j=1,\,j\ne i}^n p_{ij}\big(\mu(i)\big)\Big(g\big(i,\mu(i),j\big) + J^*(j)\Big) + p_{ii}\big(\mu(i)\big)\Big(g\big(i,\mu(i),i\big) + J^*(i)\Big).$$
By collecting the terms involving $J^*(i)$ and then dividing by $1 - p_{ii}(\mu(i))$,
$$J^*(i) = \frac1{1 - p_{ii}(\mu(i))}\Bigg[\sum_{j=1,\,j\ne i}^n p_{ij}\big(\mu(i)\big)\Big(g\big(i,\mu(i),j\big) + J^*(j)\Big) + p_{ii}\big(\mu(i)\big)g\big(i,\mu(i),i\big)\Bigg].$$
Since $\sum_{j=1,\,j\ne i}^n\frac{p_{ij}(\mu(i))}{1 - p_{ii}(\mu(i))} = 1$, we have
$$J^*(i) = \frac1{1 - p_{ii}(\mu(i))}\sum_{j=1,\,j\ne i}^n p_{ij}\big(\mu(i)\big)\Big(g\big(i,\mu(i),j\big) + J^*(j)\Big) + \sum_{j=1,\,j\ne i}^n\frac{p_{ij}(\mu(i))}{1 - p_{ii}(\mu(i))}\cdot\frac{p_{ii}(\mu(i))\,g\big(i,\mu(i),i\big)}{1 - p_{ii}(\mu(i))}
= \sum_{j=1,\,j\ne i}^n\frac{p_{ij}(\mu(i))}{1 - p_{ii}(\mu(i))}\bigg(g\big(i,\mu(i),j\big) + J^*(j) + \frac{p_{ii}(\mu(i))\,g\big(i,\mu(i),i\big)}{1 - p_{ii}(\mu(i))}\bigg).$$
Therefore $J^*(i)$ is the cost of the stationary policy $\{\mu,\mu,\ldots\}$ in the modified problem. Thus
$$\tilde J(i) \le J^*(i), \qquad \forall\, i.$$
Similarly, for each $i$, let $\bar\mu(i)$ be a control such that
$$\tilde J(i) = \sum_{j=1,\,j\ne i}^n\frac{p_{ij}(\bar\mu(i))}{1 - p_{ii}(\bar\mu(i))}\bigg(g\big(i,\bar\mu(i),j\big) + \frac{g\big(i,\bar\mu(i),i\big)\,p_{ii}(\bar\mu(i))}{1 - p_{ii}(\bar\mu(i))} + \tilde J(j)\bigg).$$
Then, using a reverse argument from before, we see that $\tilde J(i)$ is the cost of the stationary policy $\{\bar\mu,\bar\mu,\ldots\}$ in the original problem. Thus
$$J^*(i) \le \tilde J(i), \qquad \forall\, i.$$
Combining the two results, we have $J^*(i) = \tilde J(i)$, and thus the two problems have the same optimal costs.

If $p_{ii}(u) = 1$ for some $i \ne t$, we can eliminate $u$ from $U(i)$ without increasing $J^*(i)$ or any other optimal cost $J^*(j)$, $j \ne i$. If that were not so, every optimal stationary policy would have to use $u$ at state $i$ and would therefore be improper, which is a contradiction.
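The equivalence can be checked numerically. The sketch below (a hypothetical proper SSP with three states plus a termination state; not part of the original solution) removes the self-transitions as above, folds their expected cost into the remaining transitions, and verifies that value iteration gives the same optimal costs for both formulations:

import numpy as np

np.random.seed(2)
n, m = 3, 2                       # states 0..n-1, plus termination state t = n
P = np.random.rand(m, n, n + 1); P /= P.sum(axis=2, keepdims=True)
G = np.random.rand(m, n, n + 1)   # G[u, i, j] = g(i, u, j), including g(i, u, t)

def solve(P, G, iters=2000):
    # value iteration with J(t) fixed at 0
    J = np.zeros(n + 1)
    for _ in range(iters):
        J[:n] = np.min(np.sum(P * (G + J), axis=2), axis=0)
    return J[:n]

# Modified problem: p'_ij = p_ij/(1 - p_ii) for j != i, p'_ii = 0, and
# g'(i,u,j) = g(i,u,j) + p_ii * g(i,u,i) / (1 - p_ii).
P2, G2 = P.copy(), G.copy()
for u in range(m):
    for i in range(n):
        pii = P[u, i, i]
        extra = pii * G[u, i, i] / (1.0 - pii)
        P2[u, i] = P[u, i] / (1.0 - pii); P2[u, i, i] = 0.0
        G2[u, i] = G[u, i] + extra; G2[u, i, i] = 0.0
print(solve(P, G))
print(solve(P2, G2))   # the two vectors agree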


    2.17

Consider a modified stochastic shortest path problem where the state space is denoted by $\tilde S$, the control space by $\tilde U$, the transition costs by $\tilde g$, and the transition probabilities by $\tilde p$. Let the state space be $\tilde S = S_S\cup S_{SU}$, where
$$S_S = \{1,\ldots,n,t\}, \quad \text{where each } i\in S_S \text{ corresponds to } i\in S,$$
$$S_{SU} = \big\{(i,u)\ \big|\ i\in S,\ u\in U(i)\big\}, \quad \text{where each } (i,u)\in S_{SU} \text{ corresponds to } i\in S \text{ and } u\in U(i).$$
For $i,j\in S_S$ and $u\in U(i)$, we define $\tilde U(i) = U(i)$, $\tilde g(i,u,j) = g(i,u,j)$, and $\tilde p_{ij}(u) = p_{ij}(u)$. For $(i,u)\in S_{SU}$ and $j\in S_S$, the only possible control is $\bar u = u$ (i.e., $\tilde U\big((i,u)\big) = \{u\}$), and we have $\tilde g\big((i,u),u,j\big) = g(i,u,j)$ and $\tilde p_{(i,u)j}(u) = p_{ij}(u)$.

Since trajectories originating from a state $i\in S_S$ are equivalent to trajectories in the original problem, the optimal cost-to-go value for state $i$ in the modified problem is $J^*(i)$, the optimal cost-to-go value from the original problem. Let us denote the optimal cost-to-go value for $(i,u)\in S_{SU}$ by $\tilde J(i,u)$. Then $J^*(i)$ and $\tilde J(i,u)$ solve uniquely Bellman's equation of the modified problem, which is
$$J^*(i) = \min_{u\in U(i)}\sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + J^*(j)\big), \qquad (1)$$
$$\tilde J(i,u) = \sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + J^*(j)\big). \qquad (2)$$
The Q-factors for the original problem are defined as
$$Q(i,u) = \sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + J^*(j)\big),$$
so from Eq. (2), we have
$$Q(i,u) = \tilde J(i,u), \qquad \forall\, (i,u). \qquad (3)$$
Also from Eqs. (1) and (2), we have
$$J^*(i) = \min_{u\in U(i)}\tilde J(i,u), \qquad \forall\, i. \qquad (4)$$
Thus from Eqs. (1)-(4), we obtain
$$Q(i,u) = \sum_{j=1}^n p_{ij}(u)\Big(g(i,u,j) + \min_{u'\in U(j)}Q(j,u')\Big). \qquad (5)$$
It remains to show that there is no other solution to Eq. (5). Indeed, if $\tilde Q(i,u)$ were such that
$$\tilde Q(i,u) = \sum_{j=1}^n p_{ij}(u)\Big(g(i,u,j) + \min_{u'\in U(j)}\tilde Q(j,u')\Big), \qquad \forall\, (i,u), \qquad (6)$$
then by defining
$$\hat J(i) = \min_{u\in U(i)}\tilde Q(i,u) \qquad (7)$$
we obtain from Eq. (6)
$$\tilde Q(i,u) = \sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + \hat J(j)\big), \qquad \forall\, (i,u). \qquad (8)$$
By combining Eqs. (7) and (8), we have
$$\hat J(i) = \min_{u\in U(i)}\sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + \hat J(j)\big), \qquad \forall\, i. \qquad (9)$$
Thus $\hat J(i)$ and $\tilde Q(i,u)$ satisfy Bellman's equations (1)-(2) for the modified problem. Since this Bellman equation is solved uniquely by $J^*(i)$ and $\tilde J(i,u)$, we see that
$$\tilde Q(i,u) = \tilde J(i,u) = Q(i,u), \qquad \forall\, (i,u).$$
Thus the Q-factors $Q(i,u)$ solve Eq. (5) uniquely.
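A numerical sketch of the uniqueness result (hypothetical SSP data, not part of the original solution): iterating Eq. (5) converges to Q-factors whose minima over $u$ coincide with the optimal costs from ordinary value iteration:

import numpy as np

np.random.seed(3)
n, m = 3, 2
P = np.random.rand(m, n, n + 1); P /= P.sum(axis=2, keepdims=True)  # last column: state t
G = np.random.rand(m, n, n + 1)

J = np.zeros(n + 1)                        # value iteration on J, with J(t) = 0
Q = np.zeros((m, n))                       # Q-value iteration on Eq. (5)
for _ in range(2000):
    J[:n] = np.min(np.sum(P * (G + J), axis=2), axis=0)
    Jq = np.concatenate([np.min(Q, axis=0), [0.0]])   # min_u Q(j,u), plus value 0 at t
    Q = np.sum(P * (G + Jq), axis=2)
print(np.min(Q, axis=0))   # equals J[:n] up to numerical tolerance
print(J[:n])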


    Solutions Vol. II, Chapter 3

    3.4

By using the relation $T_\mu(J^*) \le T(J^*) + \varepsilon e = J^* + \varepsilon e$ and the monotonicity of $T_\mu$, we obtain
$$T_\mu^2(J^*) \le T_\mu(J^*) + \alpha\varepsilon e \le J^* + \varepsilon e + \alpha\varepsilon e.$$
Proceeding similarly, we obtain
$$T_\mu^k(J^*) \le T_\mu(J^*) + \Big(\sum_{i=0}^{k-2}\alpha^i\Big)\varepsilon e \le J^* + \Big(\sum_{i=0}^{k-1}\alpha^i\Big)\varepsilon e,$$
and by taking the limit as $k\to\infty$, the desired result $J_\mu \le J^* + \big(\varepsilon/(1-\alpha)\big)e$ follows.

    3.5

Under Assumption P, we have by Prop. 1.2(a) that $\lim_{k\to\infty}T^k(J) \ge J^*$. Let $r > 0$ be such that
$$J^* \ge J - re.$$
Then, applying $T^k$ to this inequality, we have
$$J^* = T^k(J^*) \ge T^k(J) - \alpha^k re.$$
Taking the limit as $k\to\infty$, we obtain $J^* \ge \lim_{k\to\infty}T^k(J)$, which combined with the earlier shown relation yields $\lim_{k\to\infty}T^k(J) = J^*$. Under Assumption N, the proof is analogous, using Prop. 1.2(b).

    3.8

From the proof of Proposition 1.1, we know that there exists a policy $\pi$ such that, for all $\varepsilon_i > 0$,
$$J_\pi(x) \le J^*(x) + \sum_{i=0}^\infty\alpha^i\varepsilon_i.$$
Let
$$\varepsilon_i = \frac{\varepsilon}{2^{i+1}\alpha^i} > 0.$$
Thus,
$$J_\pi(x) \le J^*(x) + \sum_{i=0}^\infty\frac{\varepsilon}{2^{i+1}} = J^*(x) + \varepsilon, \qquad \forall\, x\in S.$$


    If


where
$$\mu_i^*(x) = -(R_i + B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}A_ix,$$
$$\mu_{p-1}^*(x) = -(R_{p-1} + B_{p-1}'K_0B_{p-1})^{-1}B_{p-1}'K_0A_{p-1}x,$$
and $K_0,\ldots,K_{p-1}$ satisfy the coupled set of $p$ algebraic Riccati equations
$$K_i = A_i'\big[K_{i+1} - K_{i+1}B_i(R_i + B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}\big]A_i + Q_i, \qquad i = 0,\ldots,p-2,$$
$$K_{p-1} = A_{p-1}'\big[K_0 - K_0B_{p-1}(R_{p-1} + B_{p-1}'K_0B_{p-1})^{-1}B_{p-1}'K_0\big]A_{p-1} + Q_{p-1}.$$

    3.14

The formulation of the problem falls under Assumption P for periodic policies. Moreover, the problem is discounted. Since the $w_k$ are independent with zero mean, the optimality equation for the equivalent stationary problem reduces to the following system of equations:
$$J(x_0,0) = \min_{u_0\in U(x_0)}E_{w_0}\big\{x_0'Q_0x_0 + u_0(x_0)'R_0u_0(x_0) + \alpha J(A_0x_0 + B_0u_0 + w_0,\,1)\big\}$$
$$J(x_1,1) = \min_{u_1\in U(x_1)}E_{w_1}\big\{x_1'Q_1x_1 + u_1(x_1)'R_1u_1(x_1) + \alpha J(A_1x_1 + B_1u_1 + w_1,\,2)\big\}$$
$$\cdots$$
$$J(x_{p-1},p-1) = \min_{u_{p-1}\in U(x_{p-1})}E_{w_{p-1}}\big\{x_{p-1}'Q_{p-1}x_{p-1} + u_{p-1}(x_{p-1})'R_{p-1}u_{p-1}(x_{p-1}) + \alpha J(A_{p-1}x_{p-1} + B_{p-1}u_{p-1} + w_{p-1},\,0)\big\} \qquad (1)$$
From the analysis in Section 7.8 of Ch. 7 on periodic problems we see that there exists a periodic policy
$$\{\mu_0^*,\mu_1^*,\ldots,\mu_{p-1}^*,\mu_0^*,\mu_1^*,\ldots,\mu_{p-1}^*,\ldots\}$$
which is optimal. In order to obtain the solution we argue as follows. Let us assume that the solution is of the same form as the one for the general quadratic problem. In particular, assume that
$$J(x,i) = x'K_ix + c_i,$$
where $c_i$ is a constant and $K_i$ is positive definite. This is justified by applying the successive approximation method and observing that the sets
$$U_k(x_i,\lambda,i) = \big\{u_i\in\mathbb R^m\ \big|\ x'Qx + u_i'Ru_i + (Ax + Bu_i)'K_{i+1}^k(Ax + Bu_i) \le \lambda\big\}$$
are compact. The latter claim can be seen from the fact that $R > 0$ and $K_{i+1}^k \ge 0$. Then, by Proposition 7.7, $\lim_{k\to\infty}J_k(x_i,i) = J(x_i,i)$, and the form of the solution obtained from successive approximation is as described above.

In particular, we have for $0 \le i \le p-1$
$$J(x,i) = \min_{u_i\in U(x_i)}E_{w_i}\big\{x'Q_ix + u_i(x)'R_iu_i(x) + \alpha J(A_ix + B_iu_i + w_i,\,i+1)\big\}$$
$$= \min_{u_i\in U(x_i)}E_{w_i}\big\{x'Q_ix + u_i(x)'R_iu_i(x) + \alpha\big[(A_ix + B_iu_i + w_i)'K_{i+1}(A_ix + B_iu_i + w_i) + c_{i+1}\big]\big\}$$
$$= \min_{u_i\in U(x_i)}E_{w_i}\big\{x'(Q_i + \alpha A_i'K_{i+1}A_i)x + u_i'(R_i + \alpha B_i'K_{i+1}B_i)u_i + 2\alpha x'A_i'K_{i+1}B_iu_i + 2\alpha w_i'K_{i+1}B_iu_i + 2\alpha x'A_i'K_{i+1}w_i + \alpha w_i'K_{i+1}w_i + \alpha c_{i+1}\big\}$$
$$= \min_{u_i\in U(x_i)}\big\{x'(Q_i + \alpha A_i'K_{i+1}A_i)x + u_i'(R_i + \alpha B_i'K_{i+1}B_i)u_i + 2\alpha x'A_i'K_{i+1}B_iu_i\big\} + \alpha E_{w_i}\{w_i'K_{i+1}w_i\} + \alpha c_{i+1},$$
where we have taken into consideration the fact that $E(w_i) = 0$. Minimizing the above quantity gives
$$u_i = -\alpha(R_i + \alpha B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}A_ix. \qquad (2)$$
Thus
$$J(x,i) = x'\big[Q_i + A_i'\big(\alpha K_{i+1} - \alpha^2K_{i+1}B_i(R_i + \alpha B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}\big)A_i\big]x + c_i = x'K_ix + c_i,$$
where $c_i = \alpha E_{w_i}\{w_i'K_{i+1}w_i\} + \alpha c_{i+1}$ and
$$K_i = Q_i + A_i'\big(\alpha K_{i+1} - \alpha^2K_{i+1}B_i(R_i + \alpha B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}\big)A_i.$$
Now, for this solution to be consistent we must have $K_p = K_0$. This leads to the following system of equations:
$$K_0 = Q_0 + A_0'\big(\alpha K_1 - \alpha^2K_1B_0(R_0 + \alpha B_0'K_1B_0)^{-1}B_0'K_1\big)A_0$$
$$\cdots$$
$$K_i = Q_i + A_i'\big(\alpha K_{i+1} - \alpha^2K_{i+1}B_i(R_i + \alpha B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}\big)A_i$$
$$\cdots$$
$$K_{p-1} = Q_{p-1} + A_{p-1}'\big(\alpha K_0 - \alpha^2K_0B_{p-1}(R_{p-1} + \alpha B_{p-1}'K_0B_{p-1})^{-1}B_{p-1}'K_0\big)A_{p-1} \qquad (3)$$
This system of equations has a positive definite solution since (from the description of the problem) the system is controllable, i.e., there exists a sequence of controls $\{u_0,\ldots,u_r\}$ such that $x_{r+1} = 0$. Thus the result follows.
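The coupled system (3) can be solved by successive approximation, which is how the form of the solution was justified above. The sketch below (hypothetical scalar data with period p = 2; not part of the original solution) iterates the p Riccati recursions until the $K_i$ stop changing, then forms the gains of the optimal periodic policy from Eq. (2):

import numpy as np

alpha, p = 0.9, 2
A = [np.array([[1.1]]), np.array([[0.8]])]
B = [np.array([[1.0]]), np.array([[0.5]])]
Q = [np.array([[1.0]]), np.array([[2.0]])]
R = [np.array([[1.0]]), np.array([[1.0]])]

def riccati_step(K_next, A, B, Q, R):
    # K_i = Q_i + A_i'(alpha*K_{i+1} - alpha^2*K_{i+1} B_i (R_i + alpha*B_i'K_{i+1}B_i)^-1 B_i'K_{i+1}) A_i
    M = np.linalg.inv(R + alpha * B.T @ K_next @ B)
    return Q + A.T @ (alpha * K_next - alpha**2 * K_next @ B @ M @ B.T @ K_next) @ A

K = [np.zeros((1, 1)) for _ in range(p)]
for _ in range(500):
    for i in reversed(range(p)):
        K[i] = riccati_step(K[(i + 1) % p], A[i], B[i], Q[i], R[i])
print([k.item() for k in K])               # the periodic cost matrices K_0, K_1
L = [-alpha * np.linalg.inv(R[i] + alpha * B[i].T @ K[(i + 1) % p] @ B[i])
     @ B[i].T @ K[(i + 1) % p] @ A[i] for i in range(p)]
print([l.item() for l in L])               # gains of the optimal periodic policy, u_i = L_i x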

    3.16

(a) Consider the stationary policy $\{\mu^0,\mu^0,\ldots\}$, where $\mu^0(x) = L_0x$. We have
$$J_0(x) = 0,$$
$$T_{\mu^0}(J_0)(x) = x'Qx + x'L_0'RL_0x,$$
$$T_{\mu^0}^2(J_0)(x) = x'Qx + x'L_0'RL_0x + \alpha E\big\{(Ax + BL_0x + w)'Q(Ax + BL_0x + w)\big\} = x'M_1x + \text{constant},$$
where $M_1 = Q + L_0'RL_0 + \alpha(A + BL_0)'Q(A + BL_0)$, and
$$T_{\mu^0}^3(J_0)(x) = x'Qx + x'L_0'RL_0x + \alpha E\big\{(Ax + BL_0x + w)'M_1(Ax + BL_0x + w)\big\} + \text{constant} = x'M_2x + \text{constant}.$$
Continuing similarly, we get
$$M_{k+1} = Q + L_0'RL_0 + \alpha(A + BL_0)'M_k(A + BL_0).$$
Using a very similar analysis as in Section 8.2, we get $M_k \to K_0$, where
$$K_0 = Q + L_0'RL_0 + \alpha(A + BL_0)'K_0(A + BL_0).$$

(b)
$$J_{\mu^1}(x) = \lim_{N\to\infty}E_{w_k,\,k=0,\ldots,N-1}\Bigg\{\sum_{k=0}^{N-1}\alpha^k\big(x_k'Qx_k + \mu^1(x_k)'R\mu^1(x_k)\big)\Bigg\} = \lim_{N\to\infty}T_{\mu^1}^N(J_0)(x).$$
Proceeding as in the proof of the validity of policy iteration (Section 7.3, Chapter 7), we have
$$T_{\mu^1}(J_{\mu^0}) = T(J_{\mu^0}),$$
$$J_{\mu^0}(x) = x'K_0x + \text{constant} = T_{\mu^0}(J_{\mu^0})(x) \ge T_{\mu^1}(J_{\mu^0})(x).$$
Hence, we obtain
$$J_{\mu^0}(x) \ge T_{\mu^1}(J_{\mu^0})(x) \ge \cdots \ge T_{\mu^1}^k(J_{\mu^0})(x) \ge \cdots,$$
implying
$$J_{\mu^0}(x) \ge \lim_{k\to\infty}T_{\mu^1}^k(J_{\mu^0})(x) = J_{\mu^1}(x).$$

(c) As in part (b), we show that
$$J_{\mu^k}(x) = x'K_kx + \text{constant} \le J_{\mu^{k-1}}(x).$$
Now since
$$0 \le x'K_kx \le x'K_{k-1}x, \qquad \forall\, x,$$
we have $K_k \to K$. The form of $K$ is
$$K = \alpha(A + BL)'K(A + BL) + Q + L'RL, \qquad L = -\alpha(\alpha B'KB + R)^{-1}B'KA.$$
To show that $K$ is indeed the optimal cost matrix, we have to show that it satisfies
$$K = A'\big[\alpha K - \alpha^2KB(\alpha B'KB + R)^{-1}B'K\big]A + Q = \alpha A'\big[KA + KBL\big] + Q.$$
Let us expand the formula for $K$, using the formula for $L$:
$$K = \alpha\big(A'KA + A'KBL + L'B'KA + L'B'KBL\big) + Q + L'RL.$$
Substituting, we get
$$K = \alpha\big(A'KA + A'KBL + L'B'KA\big) + Q - \alpha L'B'KA = \alpha A'KA + \alpha A'KBL + Q.$$
Thus $K$ is the optimal cost matrix.

A second approach: (a) We know that
$$J_{\mu^0}(x) = \lim_{n\to\infty}T_{\mu^0}^n(J_0)(x).$$
Following the analysis in Section 8.1, we have
$$J_0(x) = 0,$$
$$T_{\mu^0}(J_0)(x) = E\big\{x'Qx + \mu^0(x)'R\mu^0(x)\big\} = x'Qx + \mu^0(x)'R\mu^0(x) = x'(Q + L_0'RL_0)x,$$
$$T_{\mu^0}^2(J_0)(x) = E\big\{x'Qx + \mu^0(x)'R\mu^0(x) + \alpha(Ax + B\mu^0(x) + w)'Q(Ax + B\mu^0(x) + w)\big\} = x'\big(Q + L_0'RL_0 + \alpha(A + BL_0)'Q(A + BL_0)\big)x + \alpha E\{w'Qw\}.$$
Define
$$K_0^0 = Q, \qquad K_0^{k+1} = Q + L_0'RL_0 + \alpha(A + BL_0)'K_0^k(A + BL_0).$$
Then
$$T_{\mu^0}^{k+1}(J_0)(x) = x'K_0^{k+1}x + \sum_{m=0}^{k-1}\alpha^{k-m}E\{w'K_0^mw\}.$$
The convergence of $K_0^{k+1}$ follows from the analysis of Section 4.1. Thus
$$J_{\mu^0}(x) = x'K_0x + \frac{\alpha}{1-\alpha}E\{w'K_0w\}$$
(as in Section 8.1), which proves the required relation.

(b) Let $\mu^1(x)$ be the solution of
$$\min_u\big\{u'Ru + \alpha(Ax + Bu)'K_0(Ax + Bu)\big\},$$
which yields
$$u^1 = -\alpha(R + \alpha B'K_0B)^{-1}B'K_0Ax = L_1x.$$
Thus
$$L_1 = -\alpha(R + \alpha B'K_0B)^{-1}B'K_0A = -M^{-1}\Lambda,$$
where $M = R + \alpha B'K_0B$ and $\Lambda = \alpha B'K_0A$. Let us consider the cost associated with $\mu^1$ if we ignore $w$:
$$J_{\mu^1}(x) = \sum_{k=0}^\infty\alpha^k\big(x_k'Qx_k + \mu^1(x_k)'R\mu^1(x_k)\big) = \sum_{k=0}^\infty\alpha^kx_k'(Q + L_1'RL_1)x_k.$$
However, we know the following:
$$x_{k+1} = (A + BL_1)^{k+1}x_0 + \sum_{m=1}^{k+1}(A + BL_1)^{k+1-m}w_m.$$
Thus, if we ignore the disturbance $w$ we get
$$J_{\mu^1}(x) = x_0'\sum_{k=0}^\infty\alpha^k\big[(A + BL_1)'\big]^k(Q + L_1'RL_1)(A + BL_1)^kx_0.$$
Let us call
$$K_1 = \sum_{k=0}^\infty\alpha^k\big[(A + BL_1)'\big]^k(Q + L_1'RL_1)(A + BL_1)^k. \qquad (1)$$
We know that
$$K_0 - \alpha(A + BL_0)'K_0(A + BL_0) - L_0'RL_0 = Q.$$
Substituting in (1) we have
$$K_1 = \sum_{k=0}^\infty\alpha^k\big[(A + BL_1)'\big]^k\big(K_0 - \alpha(A + BL_1)'K_0(A + BL_1)\big)(A + BL_1)^k + \sum_{k=0}^\infty\alpha^k\big[(A + BL_1)'\big]^k\big[\alpha(A + BL_1)'K_0(A + BL_1) - \alpha(A + BL_0)'K_0(A + BL_0) + L_1'RL_1 - L_0'RL_0\big](A + BL_1)^k.$$
However, we know that
$$K_0 = \sum_{k=0}^\infty\alpha^k\big[(A + BL_1)'\big]^k\big(K_0 - \alpha(A + BL_1)'K_0(A + BL_1)\big)(A + BL_1)^k.$$
Thus we conclude that
$$K_1 - K_0 = \sum_{k=0}^\infty\alpha^k\big[(A + BL_1)'\big]^k\,\Delta\,(A + BL_1)^k,$$
where
$$\Delta = \alpha(A + BL_1)'K_0(A + BL_1) - \alpha(A + BL_0)'K_0(A + BL_0) + L_1'RL_1 - L_0'RL_0.$$
We manipulate the above equation further and we obtain
$$\Delta = L_1'(R + \alpha B'K_0B)L_1 - L_0'(R + \alpha B'K_0B)L_0 + \alpha L_1'B'K_0A + \alpha A'K_0BL_1 - \alpha L_0'B'K_0A - \alpha A'K_0BL_0$$
$$= L_1'ML_1 - L_0'ML_0 + L_1'\Lambda + \Lambda'L_1 - L_0'\Lambda - \Lambda'L_0$$
$$= -(L_0 - L_1)'M(L_0 - L_1) - (\Lambda + ML_1)'(L_0 - L_1) - (L_0 - L_1)'(\Lambda + ML_1).$$
However, it is seen that
$$\Lambda + ML_1 = 0.$$
Thus
$$\Delta = -(L_0 - L_1)'M(L_0 - L_1).$$
Since $M > 0$, we conclude that
$$K_0 - K_1 = \sum_{k=0}^\infty\alpha^k\big[(A + BL_1)'\big]^k(L_0 - L_1)'M(L_0 - L_1)(A + BL_1)^k \ge 0.$$
Similarly, the optimal solution for the case where there are no disturbances satisfies the equation
$$K^* = Q + L^{*\prime}RL^* + \alpha(A + BL^*)'K^*(A + BL^*)$$
with $L^* = -\alpha(R + \alpha B'K^*B)^{-1}B'K^*A$. If we follow the same steps as above we will obtain
$$K_1 - K^* = \sum_{k=0}^\infty\alpha^k\big[(A + BL_1)'\big]^k(L_1 - L^*)'M(L_1 - L^*)(A + BL_1)^k \ge 0.$$
Thus $K^* \le K_1 \le K_0$. Since $K_1$ is bounded, we conclude that $A + BL_1$ is stable (otherwise $K_1 \to \infty$). Thus, the sum converges and $K_1$ is the solution of
$$K_1 = \alpha(A + BL_1)'K_1(A + BL_1) + Q + L_1'RL_1.$$
Now, returning to the case with the disturbances $w$, we conclude as in case (a) that
$$J_{\mu^1}(x) = x'K_1x + \frac{\alpha}{1-\alpha}E\{w'K_1w\}.$$
Since $K_1 \le K_0$, we conclude that $J_{\mu^1}(x) \le J_{\mu^0}(x)$, which proves the result.

(c) The policy iteration is defined as follows. Let
$$L_k = -\alpha(R + \alpha B'K_{k-1}B)^{-1}B'K_{k-1}A.$$
Then $\mu^k(x) = L_kx$ and
$$J_{\mu^k}(x) = x'K_kx + \frac{\alpha}{1-\alpha}E\{w'K_kw\},$$
where $K_k$ is obtained as the solution of
$$K_k = \alpha(A + BL_k)'K_k(A + BL_k) + Q + L_k'RL_k. \qquad (2)$$
If we follow the steps of (b) we can prove that
$$K^* \le K_k \le \cdots \le K_1 \le K_0. \qquad (3)$$
Thus, by the theorem of monotonic convergence of positive operators (Kantorovich and Akilov, Functional Analysis in Normed Spaces, p. 189), we conclude that
$$\bar K = \lim_{k\to\infty}K_k$$
exists. Then, taking the limit of both sides of Eq. (2), we have
$$\bar K = \alpha(A + B\bar L)'\bar K(A + B\bar L) + Q + \bar L'R\bar L,$$
with
$$\bar L = -\alpha(R + \alpha B'\bar KB)^{-1}B'\bar KA.$$
However, according to Section 4.1, $K^*$ is the unique solution of the above equation. Thus $\bar K = K^*$ and the result follows.
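The monotone convergence in part (c) is easy to see on a scalar example. The sketch below (hypothetical scalar data A, B, Q, R and discount alpha; not part of the original solution) alternates exact policy evaluation with the gain update $L_k$; the evaluated cost coefficients $K_k$ decrease toward the solution of the discounted algebraic Riccati equation:

alpha, A, B, Q, R = 0.95, 1.2, 1.0, 1.0, 1.0

def evaluate(L):
    # Solve K = alpha*(A+B*L)^2*K + Q + R*L^2 for the scalar K (requires alpha*(A+B*L)^2 < 1).
    c = alpha * (A + B * L) ** 2
    return (Q + R * L * L) / (1.0 - c)

L = -1.0                       # an initial stabilizing gain: alpha*(A+B*L)^2 = 0.038 < 1
K = evaluate(L)
for k in range(8):
    L = -alpha * B * K * A / (R + alpha * B * K * B)    # policy improvement
    K_new = evaluate(L)                                  # policy evaluation
    print(k, K, "->", K_new)
    K = K_new
# K decreases monotonically after the first improvement and converges to the solution of
# K = Q + alpha*A^2*K - alpha^2*A^2*B^2*K^2/(R + alpha*B^2*K).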


    Solutions Vol. II, Chapter 4

    4.4

(a) We have
$$T^{k+1}h_0 = T(T^kh_0) = T\big(h_i^k + (T^kh_0)(i)\,e\big) = Th_i^k + (T^kh_0)(i)\,e.$$
The $i$-th component of this equation yields
$$(T^{k+1}h_0)(i) = (Th_i^k)(i) + (T^kh_0)(i).$$
Subtracting these two relations, we obtain
$$T^{k+1}h_0 - (T^{k+1}h_0)(i)\,e = Th_i^k - (Th_i^k)(i)\,e,$$
from which
$$h_i^{k+1} = Th_i^k - (Th_i^k)(i)\,e.$$
Similarly, we have
$$T^{k+1}h_0 = T(T^kh_0) = T\Big(\bar h^k + \frac1n\sum_i(T^kh_0)(i)\,e\Big) = T\bar h^k + \frac1n\sum_i(T^kh_0)(i)\,e.$$
From this equation, we obtain
$$\frac1n\sum_i(T^{k+1}h_0)(i) = \frac1n\sum_i(T\bar h^k)(i) + \frac1n\sum_i(T^kh_0)(i).$$
By subtracting these two relations, we obtain
$$\bar h^{k+1} = T\bar h^k - \frac1n\sum_i(T\bar h^k)(i)\,e.$$
The proof for $\hat h^k$ is similar.

(b) We have
$$\bar h^k = T^kh_0 - \Big(\frac1n\sum_i(T^kh_0)(i)\Big)e = \frac1n\sum_{i=1}^nh_i^k.$$
So since $h_i^k$ converges, the same is true for $\bar h^k$. Also,
$$\hat h^k = T^kh_0 - \min_i(T^kh_0)(i)\,e$$
and
$$\hat h^k(j) = (T^kh_0)(j) - \min_i(T^kh_0)(i) = \max_i\Big[(T^kh_0)(j) - (T^kh_0)(i)\Big] = \max_ih_i^k(j).$$
Since $h_i^k$ converges, the same is true for $\hat h^k$.
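A small numerical sketch (hypothetical 3-state average-cost MDP; not part of the original solution) of the three normalizations discussed above, each computed directly from $T^kh_0$; all three settle down as $k$ grows, and they satisfy the relations of part (b):

import numpy as np

np.random.seed(6)
n, m = 3, 2
P = np.random.rand(m, n, n) + 0.3; P /= P.sum(axis=2, keepdims=True)
g = np.random.rand(m, n)

def T(h):
    return np.min(g + P @ h, axis=0)

h = np.zeros(n)
for k in range(1, 201):
    h = T(h)                    # h holds T^k h_0
    h_i = h - h[0]              # h_i^k with reference component i = 0
    h_bar = h - h.mean()        # average-subtracted variant
    h_hat = h - h.min()         # min-subtracted variant
    if k % 50 == 0:
        print(k, h_i, h_bar, h_hat)
# h_bar equals the average of the h_i^k over i, and h_hat(j) = max_i h_i^k(j), as in part (b).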


    4.8

Bellman's equation for the auxiliary $(1-\beta)$-discounted problem is as follows:
$$J(i) = \min_{u\in U(i)}\Big[g(i,u) + (1-\beta)\sum_j\tilde p_{ij}(u)J(j)\Big]. \qquad (1)$$
Using the definition of $\tilde p_{ij}(u)$, we obtain
$$\sum_j\tilde p_{ij}(u)J(j) = \sum_{j\ne t}(1-\beta)^{-1}p_{ij}(u)J(j) + (1-\beta)^{-1}\big(p_{it}(u) - \beta\big)J(t),$$
or
$$\sum_j\tilde p_{ij}(u)J(j) = \sum_j(1-\beta)^{-1}p_{ij}(u)J(j) - (1-\beta)^{-1}\beta J(t).$$
This together with (1) leads to
$$J(i) = \min_{u\in U(i)}\Big[g(i,u) + \sum_jp_{ij}(u)J(j) - \beta J(t)\Big],$$
or, equivalently,
$$\beta J(t) + J(i) = \min_{u\in U(i)}\Big[g(i,u) + \sum_jp_{ij}(u)J(j)\Big]. \qquad (2)$$
Returning to the problem of minimizing the average cost per stage, we notice that we have to solve the equation
$$\lambda + h(i) = \min_{u\in U(i)}\Big[g(i,u) + \sum_jp_{ij}(u)h(j)\Big]. \qquad (3)$$
Using (2), it follows that (3) is satisfied for $\lambda = \beta J(t)$ and $h(i) = J(i)$ for all $i$. Thus, by Proposition 2.1, we conclude that $\beta J(t)$ is the optimal average cost and $J(i)$ is a corresponding differential cost at state $i$.
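The construction can be checked numerically. The sketch below (a hypothetical 3-state average-cost MDP with reference state t = 0, and beta chosen small enough that $p_{it}(u) \ge \beta$ for all i and u; not part of the original solution) solves the auxiliary $(1-\beta)$-discounted problem and compares $\beta J(t)$ with the optimal average cost obtained by relative value iteration:

import numpy as np

np.random.seed(4)
n, m, t = 3, 2, 0
P = np.random.rand(m, n, n) + 0.2; P /= P.sum(axis=2, keepdims=True)
g = np.random.rand(m, n)
beta = 0.5 * P[:, :, t].min()          # ensures p_it(u) - beta >= 0

# Auxiliary discounted problem: ptil_ij = p_ij/(1-beta) for j != t, ptil_it = (p_it-beta)/(1-beta).
Pt = P / (1.0 - beta); Pt[:, :, t] = (P[:, :, t] - beta) / (1.0 - beta)
J = np.zeros(n)
for _ in range(5000):
    J = np.min(g + (1.0 - beta) * Pt @ J, axis=0)

# Relative value iteration on the original average-cost problem.
h = np.zeros(n)
for _ in range(5000):
    Th = np.min(g + P @ h, axis=0)
    h = Th - Th[t]
lam = np.min(g + P @ h, axis=0)[t] - h[t]     # optimal average cost (h[t] = 0)
print(beta * J[t], lam)                        # the two numbers agree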

