7/25/2019 2997 spring 2004
2.997 Decision-Making in Large-Scale Systems February 4
MIT, Spring 2004 Handout #1
Lecture Note 1
1 Markov decision processes
In this class we will study discrete-time stochastic systems. We can describe the evolution (dynamics) of these systems by the following equation, which we call the system equation:

    x_{t+1} = f(x_t, a_t, w_t),   (1)

where x_t ∈ S, a_t ∈ A_{x_t} and w_t ∈ W denote the system state, decision and random disturbance at time t, respectively. In words, the state of the system at time t+1 is a function f of the state, the decision and a random disturbance at time t. An important assumption of this class of models is that, conditioned on the current state x_t, the distribution of future states x_{t+1}, x_{t+2}, . . . is independent of the past states x_{t−1}, x_{t−2}, . . . . This is the Markov property, which gives rise to the name Markov decision processes.
An alternative representation of the system dynamics is given through transition probability matrices: for each state-action pair (x, a), we let P_a(x, y) denote the probability that the next state is y, given that the current state is x and the current action is a.
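As a small illustration, the transition-matrix representation can be computed from the system equation (1) by summing disturbance probabilities. The sketch below is ours (the function names and the toy random walk are not from the notes):

```python
# Illustrative sketch (names are ours): given the system equation f(x, a, w)
# and the distribution of the disturbance w, the entry
# P_a(x, y) = Prob(f(x, a, w) = y) is obtained by summing the probabilities
# of all disturbances that carry x to y under action a.

def transition_matrix(states, f, a, w_dist):
    """Return P_a as a dict of dicts: P[x][y] = Prob(next state is y | x, a)."""
    P = {x: {y: 0.0 for y in states} for x in states}
    for x in states:
        for w, pw in w_dist.items():
            P[x][f(x, a, w)] += pw
    return P

# Toy example: a random walk on {0, 1, 2} with reflecting boundaries,
# where the action adds drift.
states = [0, 1, 2]
f = lambda x, a, w: min(max(x + a + w, 0), 2)
w_dist = {-1: 0.5, 1: 0.5}   # disturbance is -1 or +1 with equal probability

P0 = transition_matrix(states, f, 0, w_dist)
# From state 1 under a = 0 the walk moves to 0 or 2, each with probability 0.5.
```

Each row of the resulting matrix sums to one, since every disturbance value maps x to exactly one next state.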
We are concerned with the problem of how to make decisions over time. In other words, we would like to pick an action a_t ∈ A_{x_t} at each time t. In real-world problems, this is typically done with some objective in mind, such as minimizing costs, maximizing profits or rewards, or reaching a goal. Let u(x, t) take values in A_x, for each x. Then we can think of u as a decision rule that prescribes an action from the set of available actions A_x based on the current time stage t and current state x. We call u a policy.
In this course, we will assess the quality of each policy based on costs that are accumulated additively over time. More specifically, we assume that at each time stage t a cost g_{a_t}(x_t) is incurred. In the next section, we describe some of the optimality criteria that will be used in this class when choosing a policy.
Based on the previous discussion, we characterize a Markov decision process by a tuple (S, A, P(·, ·), g(·)), consisting of a state space, a set of actions associated with each state, transition probabilities and costs associated with each state-action pair. For simplicity, we will assume throughout the course that S and A_x are finite. Most results extend to the case of countably or uncountably infinite state and action spaces under certain technical assumptions.
2 Optimality Criteria
In the previous section we described Markov decision processes, and introduced the notion that decisions
2. Average cost:

    lim sup_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} g_{a_t}(x_t) | x_0 = x ]   (3)
3. Infinite-horizon discounted cost:

    E[ Σ_{t=0}^{∞} α^t g_{a_t}(x_t) | x_0 = x ],   (4)
where α ∈ (0, 1) is a discount factor expressing temporal preferences. The presence of a discount factor is most intuitive in problems involving cash flows, where the value of the same nominal amount of money at a later time stage is not the same as its value at an earlier time stage, since money at the earlier stage can be invested at a risk-free interest rate and is therefore equivalent to a larger nominal amount at a later stage. However, discounted costs also offer good approximations to the other optimality criteria. In particular, it can be shown that, when the state and action spaces are finite, there is a large enough
A common choice for the state of this system is an 8-dimensional vector containing the queue lengths.
Since each server serves multiple queues, in each time step it is necessary to decide which queue each of
the different servers is going to serve. A decision of this type may be coded as an 8-dimensional vector a indicating which queues are being served, satisfying the constraint that no more than one queue associated with each server is being served, i.e., a_i ∈ {0, 1}, and a_1 + a_3 + a_8 ≤ 1, a_2 + a_6 ≤ 1, a_4 + a_5 + a_7 ≤ 1. We can impose additional constraints on the choices of a as desired, for instance considering only non-idling policies.
Policies are described by a mapping u returning an allocation of server effort a as a function of the system state x. We represent the evolution of the queue lengths in terms of transition probabilities: the conditional probabilities for the next state x(t+1) given that the current state is x(t) and the current action is a(t). For instance,

    Prob(x_1(t+1) = x_1(t)+1 | x(t), a(t)) = λ_1,
    Prob(x_3(t+1) = x_3(t)+1, x_2(t+1) = x_2(t)−1 | x(t), a(t)) = μ_2 I(x_2(t) > 0, a_2(t) = 1),
    Prob(x_3(t+1) = x_3(t)−1 | x(t), a(t)) = μ_3 I(x_3(t) > 0, a_3(t) = 1),

corresponding to an arrival to queue 1, a departure from queue 2 and an arrival to queue 3, and a departure from queue 3. I(·) is the indicator function. Transition probabilities related to other events are defined similarly.
We may consider costs of the form g(x) = Σ_i x_i, the total number of unfinished units in the system. For instance, this is a reasonably common choice of cost for manufacturing systems, which are often modelled as queueing networks.
Tetris

Tetris is a computer game whose essential rule is to fit together a sequence of geometrically different pieces, which fall from the top of the screen stochastically, so as to complete contiguous rows of blocks. Pieces arrive sequentially and the geometric shapes of the pieces are independently distributed. A falling piece can be rotated and moved horizontally into a desired position. Note that the rotation and movement of a falling piece must be scheduled and executed before it reaches the remaining pile of pieces at the bottom of the screen. Once a piece reaches the remaining pile, the piece must rest there and cannot be rotated or moved.
To put the Tetris game into the framework of Markov decision processes, one could define the state to correspond to the current configuration and the current falling piece. The decision in each time stage is where to place the current falling piece. Transitions to the next board configuration follow deterministically from the current state and action; transitions to the next falling piece are given by its distribution, which could be, for instance, uniform over all piece types. Finally, we associate a reward with each state-action pair, corresponding to the points achieved by the number of rows eliminated.
by

    x_{t+1} = Σ_{i=1}^{n} a_t^i e_t^i x_t.

Therefore, transition probabilities can be derived from the distribution of the rate of return of each risky asset. We associate with each state-action pair (x, a) a reward g_a(x) = x(1 − Σ_{i=1}^{n} a_i), corresponding to the amount of wealth consumed.
4 Solving Finite-Horizon Problems
Finding a policy that minimizes the finite-horizon cost corresponds to solving the following optimization problem:

    min_{u(·,·)} E[ Σ_{t=0}^{T−1} g_{u(x_t,t)}(x_t) | x_0 = x ]   (5)
A naive approach to solving (5) is to enumerate all possible policies u(x, t), evaluate the corresponding expected cost, and choose the policy that minimizes it. However, note that the number of policies grows exponentially in the number of states and time stages. A central idea in dynamic programming is that the computation required to find an optimal policy can be greatly reduced by noting that (5) can be rewritten as follows:

    min_{a∈A_x} [ g_a(x) + Σ_{y∈S} P_a(x, y) min_{u(·,·)} E[ Σ_{t=1}^{T−1} g_{u(x_t,t)}(x_t) | x_1 = y ] ].   (6)
Define J*(x, t_0) as follows:

    J*(x, t_0) = min_{u(·,·)} E[ Σ_{t=t_0}^{T−1} g_{u(x_t,t)}(x_t) | x_{t_0} = x ].
It is clear from (6) that, if we know J*(·, t_0 + 1), we can easily find J*(x, t_0) by solving

    J*(x, t_0) = min_{a∈A_x} [ g_a(x) + Σ_{y∈S} P_a(x, y) J*(y, t_0 + 1) ].   (7)

Moreover, (6) suggests that an optimal action at state x and time t_0 is simply one that minimizes the right-hand side in (7). It is easy to verify that this is the case by using backwards induction. We call J*(x, t) the cost-to-go function. It can be found recursively by noting that

    J*(x, T − 1) = min_a g_a(x)

and J*(x, t), t = 0, . . . , T − 2, can be computed via (7).
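The backward recursion above can be sketched directly in code. The two-state MDP below is a toy example of our own (the function and variable names are not from the notes):

```python
# Backward-induction sketch of recursion (7) with the terminal condition
# J(x, T-1) = min_a g_a(x).  The small MDP here is our own toy example.

def backward_induction(S, A, g, P, T):
    """Return cost-to-go J[t][x] and an optimal policy u[t][x] for horizon T."""
    J = [dict() for _ in range(T)]
    u = [dict() for _ in range(T)]
    for x in S:                                # terminal stage
        u[T - 1][x] = min(A[x], key=lambda a: g[a][x])
        J[T - 1][x] = g[u[T - 1][x]][x]
    for t in range(T - 2, -1, -1):             # J(x,t) = min_a g_a(x) + sum_y P_a(x,y) J(y,t+1)
        for x in S:
            q = {a: g[a][x] + sum(P[a][x][y] * J[t + 1][y] for y in S) for a in A[x]}
            u[t][x] = min(q, key=q.get)
            J[t][x] = q[u[t][x]]
    return J, u

S = [0, 1]
A = {0: ['stay', 'go'], 1: ['stay', 'go']}
g = {'stay': {0: 1.0, 1: 0.0}, 'go': {0: 2.0, 1: 0.5}}
P = {'stay': {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
     'go':   {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}}}

J, u = backward_induction(S, A, g, P, T=3)
# With horizon 3, the optimal first action at state 0 is 'go': pay 2 now to
# reach the zero-cost state 1, rather than pay 1 per stage by staying.
```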
Note that finding J*(x, t) for all x ∈ S and t = 0, . . . , T − 1 requires a number of computations that
1. Find (somehow), for every x and t_0,

    J*(x, t_0) = min_{u(·,·)} E[ Σ_{t=t_0}^{∞} α^{t−t_0} g_{u(x_t,t)}(x_t) | x_{t_0} = x ]   (8)

2. The optimal action for state x at time t_0 is given by

    u*(x, t_0) = argmin_{a∈A_x} [ g_a(x) + α Σ_{y∈S} P_a(x, y) J*(y, t_0 + 1) ].   (9)
We may also conjecture that, as in the finite-horizon case, J*(x, t) satisfies a recursive relation of the form

    J*(x, t) = min_{a∈A_x} [ g_a(x) + α Σ_{y∈S} P_a(x, y) J*(y, t + 1) ].
The first thing to note in the infinite-horizon case is that, based on expression (8), we have J*(x, t) = J*(x, t') = J*(x) for all t and t'. Indeed, note that, for every u,

    E[ Σ_{t=t_0}^{∞} α^{t−t_0} g_{u(x_t,t)}(x_t) | x_{t_0} = x ]
        = Σ_{t=t_0}^{∞} Σ_y α^{t−t_0} Prob_u(x_t = y | x_{t_0} = x) g_{u(y)}(y)
        = Σ_{t=t_0}^{∞} Σ_y α^{t−t_0} Prob_u(x_{t−t_0} = y | x_0 = x) g_{u(y)}(y)
        = Σ_{t=0}^{∞} Σ_y α^t Prob_u(x_t = y | x_0 = x) g_{u(y)}(y).
Intuitively, since transition probabilities P_u(x, y) do not depend on time, infinite-horizon problems look the same regardless of the value of the initial time stage t, as long as the initial state is the same.

Note also that, since J*(x, t) = J*(x), we can also infer from (9) that the optimal policy u*(x, t) does not depend on the current stage t, so that u*(x, t) = u*(x) for some function u*(·). We call policies that do not depend on the time stage stationary. Finally, J* must satisfy the following equation:

    J*(x) = min_{a∈A_x} [ g_a(x) + α Σ_{y∈S} P_a(x, y) J*(y) ].

This is called Bellman's equation. We will show in the next lecture that the cost-to-go function is the unique solution of Bellman's equation and the stationary policy u* is optimal.
Proof First, we have

    J = J̄ + (J − J̄) ≤ J̄ + ||J − J̄|| e.

We now have

    T J − T J̄ ≤ T(J̄ + ||J − J̄|| e) − T J̄ = T J̄ + α ||J − J̄|| e − T J̄ = α ||J − J̄|| e.

The first inequality follows from monotonicity and the second equality from the offset property of T. Since J and J̄ are arbitrary, we conclude by the same reasoning that T J̄ − T J ≤ α ||J − J̄|| e. The lemma follows.
2.997 Decision-Making in Large-Scale Systems February 9
MIT, Spring 2004 Handout #2
Lecture Note 2
1 Summary: Markov Decision Processes

Markov decision processes can be characterized by (S, A, g(·), P(·, ·)), where

    S denotes a finite set of states
    A_x denotes a finite set of actions for state x ∈ S
    g_a(x) denotes the finite time-stage cost for action a ∈ A_x and state x ∈ S
    P_a(x, y) denotes the transition probability when the taken action is a ∈ A_x, the current state is x, and the next state is y

Let u(x, t) denote the policy for state x at time t and, similarly, let u(x) denote the stationary policy for state x. Taking the stationary policy u(x) into consideration, we introduce the following notation

    g_u(x) ≡ g_{u(x)}(x)
    P_u(x, y) ≡ P_{u(x)}(x, y)

to represent the cost function and transition probabilities under policy u(x).
2 Cost-to-go Function and Bellman's Equation

In the previous lecture, we defined the discounted-cost, infinite-horizon cost-to-go function as

    J*(x) = min_u E[ Σ_{t=0}^{∞} α^t g_u(x_t) | x_0 = x ].

We also conjectured that J* should satisfy Bellman's equation

    J*(x) = min_a [ g_a(x) + α Σ_{y∈S} P_a(x, y) J*(y) ],

or, using the operator notation introduced in the previous lecture,

    J* = T J*.
3 Value Iteration

The value iteration algorithm goes as follows:

1. Initialize with some J_0; set k = 0
2. J_{k+1} = T J_k; set k = k + 1
3. Go back to 2
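The three-step loop above can be sketched as follows, with a sup-norm stopping rule added so the loop terminates. The single-state MDP used to exercise it is our own toy example:

```python
# Direct implementation of the value iteration loop, stopping when the
# sup-norm update ||J_{k+1} - J_k|| falls below a tolerance.

def bellman_update(J, S, A, g, P, alpha):
    """One application of T: (TJ)(x) = min_a g_a(x) + alpha * sum_y P_a(x,y) J(y)."""
    return {x: min(g[a][x] + alpha * sum(P[a][x][y] * J[y] for y in S) for a in A[x])
            for x in S}

def value_iteration(S, A, g, P, alpha, tol=1e-10):
    J = {x: 0.0 for x in S}                        # step 1: J_0 = 0
    while True:
        Jn = bellman_update(J, S, A, g, P, alpha)  # step 2: J_{k+1} = T J_k
        if max(abs(Jn[x] - J[x]) for x in S) < tol:
            return Jn
        J = Jn                                     # step 3: go back to 2

# Toy check (our own example): a single state with per-stage cost 1 and
# discount alpha = 0.9 has J* = 1 / (1 - alpha) = 10.
S = [0]; A = {0: ['a']}; g = {'a': {0: 1.0}}; P = {'a': {0: {0: 1.0}}}
J = value_iteration(S, A, g, P, alpha=0.9)
```

The geometric convergence rate proved in Theorem 1 below means the number of iterations needed grows like log(1/tol) / log(1/α).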
Theorem 1

    lim_{k→∞} J_k = J*.
Proof Since J_0(·) and g_·(·) are finite, there exists a real number M satisfying |J_0(x)| ≤ M and |g_a(x)| ≤ M for all a ∈ A_x and x ∈ S. Then we have, for every integer K ≥ 1 and real number α ∈ (0, 1),

    J_K(x) = (T^K J_0)(x) = min_u E[ Σ_{t=0}^{K−1} α^t g_u(x_t) + α^K J_0(x_K) | x_0 = x ]
           ≤ min_u E[ Σ_{t=0}^{K−1} α^t g_u(x_t) | x_0 = x ] + α^K M.

From

    J*(x) = min_u E[ Σ_{t=0}^{K−1} α^t g_u(x_t) + Σ_{t=K}^{∞} α^t g_u(x_t) | x_0 = x ],

we have

    |(T^K J_0)(x) − J*(x)|
        = | min_u E[ Σ_{t=0}^{K−1} α^t g_u(x_t) + α^K J_0(x_K) | x_0 = x ]
            − min_u E[ Σ_{t=0}^{K−1} α^t g_u(x_t) + Σ_{t=K}^{∞} α^t g_u(x_t) | x_0 = x ] |
        ≤ max_u E[ α^K |J_0(x_K)| + Σ_{t=K}^{∞} α^t |g_u(x_t)| | x_0 = x ]
        ≤ α^K M + α^K M / (1 − α) → 0  as K → ∞.
Theorem 2 J* is the unique solution of Bellman's equation.

Proof We first show that J* = T J*. By the contraction principle,

    ||T^{k+1} J_0 − T^k J_0|| = ||T(T^k J_0) − T(T^{k−1} J_0)|| ≤ α ||T^k J_0 − T^{k−1} J_0|| ≤ α^k ||T J_0 − J_0|| → 0  as k → ∞.

Since for all k we have ||J* − T J*|| ≤ ||J* − T^{k+1} J_0|| + ||T^{k+1} J_0 − T^k J_0|| + ||T^k J_0 − T J*||, and every term on the right-hand side vanishes as k → ∞, we conclude that J* = T J*. We next show that J* is the unique solution to J = T J. Suppose that J_1 = T J_1 and J_2 = T J_2. Then

    ||J_1 − J_2|| = ||T J_1 − T J_2|| ≤ α ||J_1 − J_2||,

which implies J_1 = J_2.

It can be advantageous to update states one at a time, sweeping through the state space in a fixed order 1, 2, . . . , |S| and using, for each state x, the values already updated in the current sweep for the states y < x. We hence define the operator F as follows:

    (F J)(x) = min_{a∈A_x} [ g_a(x) + α Σ_{y<x} P_a(x, y)(F J)(y) + α Σ_{y≥x} P_a(x, y) J(y) ]   (1)
Lemma 1

    ||F J − F J̄|| ≤ α ||J − J̄||

Proof By the definition of F, we consider the case x = 1:

    |(F J)(1) − (F J̄)(1)| = |(T J)(1) − (T J̄)(1)| ≤ α ||J − J̄||.

For the case x = 2, by the definition of F, we have

    |(F J)(2) − (F J̄)(2)| ≤ α max{ |(F J)(1) − (F J̄)(1)|, |J(2) − J̄(2)|, . . . , |J(|S|) − J̄(|S|)| } ≤ α ||J − J̄||.

Repeating the same reasoning for x = 3, . . . , we can show by induction that |(F J)(x) − (F J̄)(x)| ≤ α ||J − J̄|| for all x ∈ S. Hence, we conclude ||F J − F J̄|| ≤ α ||J − J̄||.
Theorem 3 F has the unique fixed point J*.

Proof By the definition of the operator F and Bellman's equation J* = T J*, we have F J* = J*, so J* is a fixed point of F. Convergence of the iterates of F follows from the previous lemma. Uniqueness of the fixed point J* holds by the contraction property of F.
2.997 Decision-Making in Large-Scale Systems February 11
MIT, Spring 2004 Handout #4
Lecture Note 3
1 Value Iteration

Using value iteration, starting at an arbitrary J_0, we generate a sequence {J_k} by

    J_{k+1} = T J_k,  integer k ≥ 0.

We have shown that the sequence J_k → J* as k → ∞, and derived the error bounds

    ||J_k − J*|| ≤ α^k ||J_0 − J*||.

Recall that the greedy policy u_J with respect to value J is defined by T J = T_{u_J} J. We also denote by u_k = u_{J_k} the greedy policy with respect to value J_k. Then we have the following lemma.

Lemma 1 Given α ∈ (0, 1),

    ||J_{u_k} − J_k|| ≤ 1/(1 − α) ||T J_k − J_k||.
Proof:

    J_{u_k} − J_k = (I − α P_{u_k})^{−1} g_{u_k} − J_k
                  = (I − α P_{u_k})^{−1} (g_{u_k} + α P_{u_k} J_k − J_k)
                  = (I − α P_{u_k})^{−1} (T J_k − J_k)
                  = Σ_{t=0}^{∞} α^t P_{u_k}^t (T J_k − J_k)
                  ≤ Σ_{t=0}^{∞} α^t P_{u_k}^t e ||T J_k − J_k||
                  = Σ_{t=0}^{∞} α^t e ||T J_k − J_k||
                  = e/(1 − α) ||T J_k − J_k||,

where I is an identity matrix, and e is a vector of unit elements with appropriate dimension. The third equality comes from T J_k = g_{u_k} + α P_{u_k} J_k, i.e., u_k is the greedy policy w.r.t. J_k, and the fourth equality holds because (I − α P_{u_k})^{−1} = Σ_{t=0}^{∞} α^t P_{u_k}^t. By switching J_{u_k} and J_k, we can obtain J_k − J_{u_k} ≤ e/(1 − α) ||T J_k − J_k||, and hence conclude the lemma.
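The proof rests on exact policy evaluation, J_u = (I − αP_u)^{−1} g_u, which can be computed by solving a linear system. A small NumPy sketch (the two-state chain is our own example) that also extracts the greedy policy:

```python
# Exact policy evaluation J_u = (I - alpha P_u)^{-1} g_u via a linear solve,
# plus greedy-policy extraction.  The two-state MDP is our own toy example.
import numpy as np

def evaluate_policy(P_u, g_u, alpha):
    """Solve (I - alpha P_u) J = g_u for J_u."""
    n = len(g_u)
    return np.linalg.solve(np.eye(n) - alpha * P_u, g_u)

def greedy_policy(J, P, g, alpha):
    """u_J(x) = argmin_a g_a(x) + alpha * sum_y P_a(x,y) J(y)."""
    Q = g + alpha * P @ J          # Q[a, x]: one-step cost of action a at state x
    return np.argmin(Q, axis=0)

P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay put
              [[0.0, 1.0], [1.0, 0.0]]])   # action 1: switch states
g = np.array([[2.0, 0.0],                  # g[a, x]
              [1.0, 1.0]])
alpha = 0.9

J_u = evaluate_policy(P[0], g[0], alpha)   # evaluate the "always stay" policy
u = greedy_policy(J_u, P, g, alpha)
# J_u = [20, 0]; the greedy policy w.r.t. J_u switches out of the costly state 0.
```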
Theorem 1

    ||J_{u_k} − J*|| ≤ 2/(1 − α) ||J_k − J*||

Proof:

    ||J_{u_k} − J*|| = ||J_{u_k} − J_k + J_k − J*||
                     ≤ ||J_{u_k} − J_k|| + ||J_k − J*||
                     ≤ 1/(1 − α) ||T J_k − J_k|| + ||J_k − J*||
                     ≤ 1/(1 − α) (||T J_k − J*|| + ||J* − J_k||) + ||J_k − J*||
                     ≤ 1/(1 − α) (α ||J_k − J*|| + ||J_k − J*||) + ||J_k − J*||
                     = 2/(1 − α) ||J_k − J*||.

The second inequality comes from Lemma 1 and the third inequality holds by the contraction principle. □
2 Optimality of Stationary Policy

Before proving the main theorem of this section, we introduce the following useful lemma.

Lemma 2 If J ≥ T J, then J ≥ J*. If J ≤ T J, then J ≤ J*.

Proof: Suppose that J ≥ T J. Applying the operator T on both sides repeatedly k − 1 times and using the monotonicity property of T, we have

    J ≥ T J ≥ T^2 J ≥ · · · ≥ T^k J.

For sufficiently large k, T^k J approaches J*. We hence conclude J ≥ J*. The other statement follows the same argument. □
We show the optimality of the stationary policy by the following theorem.

Theorem 2 Let u = (u_1, u_2, . . .) be any policy and let u* = u_{J*} be greedy with respect to J*. Then

    J_{u*} ≤ J_u  and  J_{u*} = J*.

Moreover, let ū be any stationary policy such that T_ū J* ≠ T J*. Then J_ū(x) > J*(x) for at least one state x ∈ S.
Then

    ||J_k^u − J_u|| ≤ M (1 + 1/(1 − α)) α^k → 0  as k → ∞.

If u = (ū, ū, . . .) is stationary, then

    ||J_k^ū − J_ū|| → 0  as k → ∞.

Thus, we have J_k^{u*} = T_{u*}^k J* = T_{u*}^{k−1}(T J*) = T_{u*}^{k−1} J* = · · · = J*. Therefore J_{u*} = J*. For any other policy u = (u_1, u_2, . . .), for all k,

    J_u ≥ T_{u_1} · · · T_{u_k} J* − α^k M (1 + 1/(1 − α)) e
        ≥ T_{u_1} · · · T_{u_{k−1}} (T J*) − α^k M (1 + 1/(1 − α)) e
        = T_{u_1} · · · T_{u_{k−1}} J* − α^k M (1 + 1/(1 − α)) e
        ≥ · · · ≥ J* − α^k M (1 + 1/(1 − α)) e.

Therefore J_u ≥ J*.

Take a stationary policy ū such that T_ū J* ≠ T J*, i.e., T_ū J* ≥ T J*, with at least one state x ∈ S such that (T_ū J*)(x) > (T J*)(x). Observe

    J* = T J* ≤ T_ū J*.

Applying T_ū on both sides and using the monotonicity property of T, or applying Lemma 2,

    J* ≤ T_ū J* ≤ T_ū^2 J* ≤ · · · ≤ T_ū^k J* ≤ · · · ≤ J_ū,

and J*(x) < (T_ū J*)(x) ≤ J_ū(x) for that state x.
Proof: If u_k is optimal, then we are done. Now suppose that u_k is not optimal. Then

    T J_{u_k} ≤ T_{u_k} J_{u_k} = J_{u_k},

with strict inequality for at least one state x. Since T_{u_{k+1}} J_{u_k} = T J_{u_k} and J_{u_k} = T_{u_k} J_{u_k}, we have

    J_{u_k} = T_{u_k} J_{u_k} ≥ T J_{u_k} = T_{u_{k+1}} J_{u_k} ≥ · · · ≥ T_{u_{k+1}}^n J_{u_k} → J_{u_{k+1}}  as n → ∞.

Therefore, policy u_{k+1} is an improvement over policy u_k. □
In step 2, we solve J_{u_k} = g_{u_k} + α P_{u_k} J_{u_k}, which would require a significant amount of computation. We thus introduce another algorithm which requires fewer computations in step 2.
3.1 Asynchronous Policy Iteration

The algorithm goes as follows.

1. Start with a policy u_0 and a cost-to-go function J_0; set k = 0
2. For some subset S_k ⊂ S, do one of the following:
   (i) value update: J_{k+1}(x) = (T_{u_k} J_k)(x), x ∈ S_k
   (ii) policy update: u_{k+1}(x) = u_{J_k}(x), x ∈ S_k
   (states x ∉ S_k keep J_{k+1}(x) = J_k(x) and u_{k+1}(x) = u_k(x))
3. k = k + 1; go back to step 2
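The loop above can be sketched on a toy problem of our own. Here the subsets S_k cycle through single states, alternating a value update and a policy update at each state so that every state receives both kinds of update infinitely often:

```python
# Sketch of asynchronous policy iteration (our own toy two-state problem).
# J_0 is chosen large enough that T_{u_0} J_0 <= J_0 holds.
import numpy as np

def async_policy_iteration(P, g, alpha, n_iter=1000):
    n_states = g.shape[1]
    u = np.zeros(n_states, dtype=int)             # initial policy u_0
    J = np.full(n_states, g.max() / (1 - alpha))  # guarantees T_{u_0} J_0 <= J_0
    for k in range(n_iter):
        x = (k // 2) % n_states                   # subset S_k = {x}
        if k % 2 == 0:                            # (i) value update at x
            J[x] = g[u[x], x] + alpha * P[u[x], x] @ J
        else:                                     # (ii) policy update at x
            u[x] = np.argmin(g[:, x] + alpha * P[:, x] @ J)
    return J, u

P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])  # action 1: switch states
g = np.array([[2.0, 0.0],                 # g[a, x]
              [1.0, 1.0]])
J, u = async_policy_iteration(P, g, alpha=0.9)
# The iterates decrease monotonically toward J* = [1, 0].
```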
Theorem 4 If T_{u_0} J_0 ≤ J_0 and infinitely many value and policy updates are performed on each state, then

    lim_{k→∞} J_k = J*.
Proof: We prove this theorem in two steps. First, we will show that

    J* ≤ J_{k+1} ≤ J_k,  for all k.

This implies that J_k is a nonincreasing sequence. Since J_k is lower bounded by J*, J_k will converge to some value, i.e., J_k → J̄ as k → ∞. Next, we will show that J_k converges to J*, i.e., J̄ = J*.
Lemma 3 If T_{u_0} J_0 ≤ J_0, the sequence J_k generated by asynchronous policy iteration converges.

Proof: We start by showing that, if T_{u_k} J_k ≤ J_k, then T_{u_{k+1}} J_{k+1} ≤ J_{k+1} ≤ J_k. Suppose we have a value update. Then
Now suppose that we have a policy update. Then J_{k+1} = J_k. Moreover, for x ∈ S_k, we have

    (T_{u_{k+1}} J_{k+1})(x) = (T_{u_{k+1}} J_k)(x)
                             = (T J_k)(x)
                             ≤ (T_{u_k} J_k)(x)
                             ≤ J_k(x)
                             = J_{k+1}(x).

The first equality follows from J_k = J_{k+1}, the second equality and first inequality follow from the fact that u_{k+1}(x) is greedy with respect to J_k for x ∈ S_k, the second inequality follows from the induction hypothesis, and the third equality follows from J_k = J_{k+1}. For x ∉ S_k, we have

    (T_{u_{k+1}} J_{k+1})(x) = (T_{u_k} J_k)(x)
                             ≤ J_k(x)
                             = J_{k+1}(x).

The equalities follow from J_k = J_{k+1} and u_{k+1}(x) = u_k(x) for x ∉ S_k, and the inequality follows from the induction hypothesis.

Since by hypothesis T_{u_0} J_0 ≤ J_0, we conclude that J_k is a nonincreasing sequence. Moreover, we have T_{u_k} J_k ≤ J_k, hence J_k ≥ J_{u_k} ≥ J*, so that J_k is bounded below. It follows that J_k converges to some limit J̄. □
Lemma 4 Suppose that J_k → J̄, where J_k is generated by asynchronous policy iteration, and suppose that there are infinitely many value and policy updates at each state. Then J̄ = J*.

Proof: First note that, since T J_k ≤ J_k, by continuity of the operator T, we must have T J̄ ≤ J̄. Now suppose that (T J̄)(x) < J̄(x) for some state x. Then, by continuity, there is an iteration index k̄ such that (T J_k)(x) < J̄(x) for all k ≥ k̄. Let k'' > k' > k̄ correspond to iterations of the asynchronous policy iteration algorithm such that there is a policy update at state x at iteration k', a value update at state x at iteration k'', and no updates at state x in the iterations between k' and k''.
We have concluded that J_{k''+1}(x) < J̄(x). However, since by hypothesis J_k ≥ J̄ for all k, we have a contradiction, and it must follow that T J̄ = J̄, so that J̄ = J*. □
2.997 Decision-Making in Large-Scale Systems February 17
MIT, Spring 2004 Handout #6
Lecture Note 4
Average-cost Problems

In average-cost problems, we aim at finding a policy u which minimizes

    J_u(x) = lim sup_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} g_u(x_t) | x_0 = x ].   (1)

Since the state space is finite, it can be shown that the limsup can actually be replaced with lim for any stationary policy. In the previous lectures, we first found the cost-to-go functions J*(x) (for discounted problems) or J*(x, t) (for finite-horizon problems) and then found the optimal policy through the cost-to-go functions. However, in the average-cost problem, J_u(x) does not offer enough information for an optimal policy to be found; in particular, in most cases of interest we will have J_u(x) = λ_u for some scalar λ_u, for all x, so that it does not allow us to distinguish the value of being in each state.
We will start by deriving some intuition based on finite-horizon problems. Consider a set of states S = {x_1, x_2, . . . , x, . . . , x_n}. The states are visited in a sequence with some initial state x̄, say

    x̄, . . . , x, . . . , x, . . . , x, . . .

Let T_i(x), i = 1, 2, . . . , be the stages corresponding to the ith visit to state x, starting at state x̄. Let

    λ_i^u(x) = E[ Σ_{t=T_i(x)}^{T_{i+1}(x)−1} g_u(x_t) / (T_{i+1}(x) − T_i(x)) ].

Intuitively, we must have λ_i^u(x) = λ_j^u(x), and λ_i^u(x) must be independent of the initial state x̄, since we have the same transition probabilities whenever we start a new trajectory in state x. Going back to the definition of the function

    J(x, T) = min_u E[ Σ_{t=0}^{T} g_u(x_t) | x_0 = x ],

we conjecture that the function can be approximated as follows:

    J(x, T) ≈ λ(x) T + h(x) + o(T),  as T → ∞.   (2)

Note that, since λ(x) is independent of the initial state, we can rewrite the approximation as

    J(x, T) ≈ λ T + h(x) + o(T),  as T → ∞.   (3)
We can now speculate about a version of Bellman's equation for computing λ and h. Approximating J(x, T) as in (3), we have

    J(x, T + 1) = min_a [ g_a(x) + Σ_y P_a(x, y) J(y, T) ]
    λ(T + 1) + h(x) + o(T) = min_a [ g_a(x) + Σ_y P_a(x, y) (λ T + h(y) + o(T)) ].

Therefore, we have

    λ + h(x) = min_a [ g_a(x) + Σ_y P_a(x, y) h(y) ].   (4)
As we did in the cost-to-go context, we set

    T_u h = g_u + P_u h  and  T h = min_u T_u h.

Then we have:

Lemma 1 (Monotonicity) Let h ≥ h̄ be arbitrary. Then T h ≥ T h̄ (and T_u h ≥ T_u h̄).

Lemma 2 (Offset) For all h and k ∈ R, we have T(h + ke) = T h + ke.

Notice that the contraction principle does not hold for T h = min_u T_u h.
Bellman's Equation

From the discussion above, we can write Bellman's equation

    λ e + h = T h.   (5)

Before examining the existence of solutions to Bellman's equation, we show that a solution of Bellman's equation yields the optimal policy, by the following theorem.
Theorem 1 Suppose that λ and h satisfy Bellman's equation. Let u* be greedy with respect to h, i.e., T h = T_{u*} h. Then

    J_{u*}(x) = λ,  for all x,

and

    λ ≤ J_u(x),  for all x and all policies u.

Proof: Let u = (u_1, u_2, . . .), and let N be arbitrary.
Then

    T_{u_1} T_{u_2} · · · T_{u_N} h ≥ N λ e + h,

and since (T_{u_1} · · · T_{u_N} h)(x) = E[ Σ_{t=0}^{N−1} g_u(x_t) + h(x_N) | x_0 = x ], we have

    E[ Σ_{t=0}^{N−1} g_u(x_t) + h(x_N) | x_0 = x ] ≥ N λ + h(x).

Dividing both sides by N and taking the limit as N approaches infinity, we have

    J_u ≥ λ e.

Taking u = (u*, u*, u*, . . .), all the inequalities above become equalities. Thus

    λ e = J_{u*}.
This theorem says that, if Bellman's equation has a solution, then we can get an optimal policy from it. Note that, if (λ, h) is a solution to Bellman's equation, then (λ, h + ke) is also a solution, for all scalars k. Hence, if Bellman's equation (5) has a solution, then it has infinitely many solutions. However, unlike the case of discounted-cost and finite-horizon problems, the average-cost Bellman's equation does not necessarily have a solution. In particular, the previous theorem implies that, if a solution exists, then the average cost J_{u*}(x) is the same for all initial states. It is easy to come up with examples where this is not the case. For instance, consider the case when the transition probability matrix is the identity, i.e., each state visits itself every time, and each state incurs a different cost g(·). Then the average cost depends on the initial state, and Bellman's equation cannot have a solution. Hence, Bellman's equation does not always hold.
2.997 Decision-Making in Large-Scale Systems February 18
MIT, Spring 2004 Handout #7
Lecture Note 5
Relationship between Discounted and Average-Cost Problems

In this lecture, we will show that optimal policies for discounted-cost problems with a large enough discount factor are also optimal for average-cost problems. The analysis will also show that, if the optimal average cost is the same for all initial states, then the average-cost Bellman's equation has a solution.
Note that the optimal average cost is independent of the initial state. Recall that

    J_u(x) = lim sup_{N→∞} (1/N) E[ Σ_{t=0}^{N−1} g_u(x_t) | x_0 = x ]

or, equivalently,

    J_u = lim_{N→∞} (1/N) Σ_{t=0}^{N−1} P_u^t g_u.

We also let J_{u,α} denote the discounted cost-to-go function associated with policy u when the discount factor is α, i.e.,

    J_{u,α} = Σ_{t=0}^{∞} α^t P_u^t g_u = (I − α P_u)^{−1} g_u.
The following theorem formalizes the relationship between the discounted cost-to-go function and the average cost.

Theorem 1 For every stationary policy u, there is h_u such that

    J_{u,α} = (1/(1 − α)) J_u + h_u + O(|1 − α|).   (1)

Theorem 1 follows easily from the following proposition.
Proposition 1 For all stationary policies u, we have

    (I − α P_u)^{−1} = (1/(1 − α)) P̄_u + H_u + O(|1 − α|),   (2)

where

    P̄_u = lim_{N→∞} (1/N) Σ_{t=0}^{N−1} P_u^t.   (3)
Proof: Let M_α = (1 − α)(I − α P_u)^{−1}. Then, since

    M_α(x, y) = (1 − α) Σ_{t=0}^{∞} α^t P_u^t(x, y) ≤ (1 − α) Σ_{t=0}^{∞} α^t · 1 = 1,

M_α(x, y) is of the form

    M_α(x, y) = p(α) / q(α),

where p(α) and q(α) are polynomials such that q(1) ≠ 0. We conclude that the limit lim_{α→1} M_α exists. Let P̄_u = lim_{α→1} M_α. We can do a Taylor expansion of M_α around α = 1, so that

    M_α = P̄_u + (1 − α) H_u + O((1 − α)^2),

where H_u = −dM_α/dα |_{α=1}. Therefore

    (I − α P_u)^{−1} = (1/(1 − α)) P̄_u + H_u + O(|1 − α|)

for some P̄_u and H_u.

Next, observe that

    (1 − α)(I − α P_u)(I − α P_u)^{−1} = (1 − α) I

for all α. Taking the limit as α → 1 yields

    (I − P_u) P̄_u = 0,

so that P̄_u = P_u P̄_u. We can use the same reasoning to conclude that P̄_u = P̄_u P_u. We also have

    (I − α P_u) P̄_u = (1 − α) P̄_u,

hence for every α we have

    P̄_u = (1 − α)(I − α P_u)^{−1} P̄_u = M_α P̄_u,

and taking the limit as α → 1 yields P̄_u P̄_u = P̄_u.

We now show that, for every t ≥ 1, P_u^t − P̄_u = (P_u − P̄_u)^t. For t = 1, it is trivial. Suppose that the result holds up to n − 1, i.e., P_u^{n−1} − P̄_u = (P_u − P̄_u)^{n−1}. Then

    (P_u − P̄_u)^n = (P_u − P̄_u)(P_u^{n−1} − P̄_u) = P_u^n − P_u P̄_u − P̄_u P_u^{n−1} + P̄_u P̄_u = P_u^n − P̄_u.

By induction, we have P_u^t − P̄_u = (P_u − P̄_u)^t.

Now note that

    H_u = lim_{α→1} (M_α − P̄_u)/(1 − α) = lim_{α→1} Σ_{t=0}^{∞} α^t (P_u^t − P̄_u).
Hence H_u = (I − P_u + P̄_u)^{−1} − P̄_u.

We now show P̄_u H_u = 0. Observe

    P̄_u H_u = P̄_u (I − P_u + P̄_u)^{−1} − P̄_u P̄_u
             = P̄_u Σ_{t=0}^{∞} (P_u − P̄_u)^t − P̄_u
             = P̄_u − P̄_u = 0,

since P̄_u (P_u − P̄_u)^t = 0 for every t ≥ 1. Therefore, P̄_u H_u = 0.

Observe that (I − P_u + P̄_u)(P̄_u + H_u) = I. Since P̄_u H_u = 0, expanding this product yields

    P̄_u + H_u = I + P_u H_u.
By multiplying P̄_u + H_u = I + P_u H_u by P_u^k and using P_u^k P̄_u = P̄_u, we have

    P̄_u + P_u^k H_u = P_u^k + P_u^{k+1} H_u,  for all k.

Summing from k = 0 to k = N − 1, we have

    N P̄_u + Σ_{k=0}^{N−1} P_u^k H_u = Σ_{k=0}^{N−1} P_u^k + Σ_{k=1}^{N} P_u^k H_u,

or, equivalently,

    N P̄_u = Σ_{k=0}^{N−1} P_u^k + (P_u^N − I) H_u.

Dividing both sides by N and letting N → ∞, we have

    lim_{N→∞} (1/N) Σ_{k=0}^{N−1} P_u^k = P̄_u,

which is (3).
Since P̄_u = P̄_u P_u and P̄_u itself is a stochastic matrix, the rows of P̄_u have a special meaning. Let π_u denote a row of P̄_u. Then π_u = π_u P_u, i.e., π_u(x) = Σ_y π_u(y) P_u(y, x), which is the probability that x_1 = x when x_0 is distributed according to π_u. We can conclude that any row of the matrix P̄_u is a stationary distribution for the Markov chain under the policy u. However, does this observation mean that all rows in P̄_u are identical?
Theorem 2

    J_{u,α} = (1/(1 − α)) J_u + h_u + O(|1 − α|)

Proof:

    J_{u,α} = (I − α P_u)^{−1} g_u
            = [ (1/(1 − α)) P̄_u + H_u + O(|1 − α|) ] g_u
            = (1/(1 − α)) P̄_u g_u + H_u g_u + O(|1 − α|).

Since J_u = P̄_u g_u, the result follows with h_u = H_u g_u. □
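Theorem 2 is easy to check numerically: as α → 1, (1 − α) J_{u,α} should approach the average cost P̄_u g_u, whose common value is π_u g_u. The two-state chain below is our own toy example:

```python
# Numerical sanity check of Theorem 2 on our own two-state chain:
# (1 - alpha) * J_{u,alpha} approaches the average cost pi_u @ g_u as alpha -> 1.
import numpy as np

P_u = np.array([[0.5, 0.5],
                [0.2, 0.8]])
g_u = np.array([1.0, 3.0])

# Stationary distribution pi_u: left eigenvector of P_u for eigenvalue 1.
w, v = np.linalg.eig(P_u.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()
avg_cost = pi @ g_u                  # the common value of (P̄_u g_u)(x)

alpha = 0.9999
J_alpha = np.linalg.solve(np.eye(2) - alpha * P_u, g_u)
# Both entries of (1 - alpha) * J_alpha are close to avg_cost, with the
# remaining gap of order (1 - alpha) coming from the h_u term.
```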
2 Blackwell Optimality

In this section, we will show that policies that are optimal for the discounted-cost criterion with large enough discount factors are also optimal for the average-cost criterion. Indeed, we can actually strengthen the notion of average-cost optimality and establish the existence of policies that are optimal for all large enough discount factors.

Definition 1 (Blackwell Optimality) A stationary policy u* is called Blackwell optimal if there exists ᾱ ∈ (0, 1) such that u* is optimal for all α ∈ [ᾱ, 1).
Theorem 3 There exists a stationary Blackwell optimal policy, and it is also optimal for the average-cost problem among all stationary policies.

Proof: Since there are only finitely many policies, we must have for each state x a policy ū_x such that

    J_{ū_x, α}(x) ≤ J_{u, α}(x)

for all u and all large enough α. If we take the policy ū to be given by ū(x) = ū_x(x), then ū must satisfy Bellman's equation

    J_{ū,α} = min_u { g_u + α P_u J_{ū,α} }

for all large enough α, and we conclude that ū is Blackwell optimal.

Now let u* be Blackwell optimal. Also suppose that u is optimal for the average-cost problem. Then

    (1/(1 − α)) J_{u*} + h_{u*} + O(|1 − α|) ≤ (1/(1 − α)) J_u + h_u + O(|1 − α|),  for all α ∈ [ᾱ, 1).

Taking the limit as α → 1, we conclude that

    J_{u*} ≤ J_u,

and u* must be optimal for the average-cost problem. □
Remark 1 It is actually possible to establish average-cost optimality of Blackwell optimal policies among the set of all policies, not only stationary ones.

Remark 2 An algorithm for computing Blackwell optimal policies involves lexicographic optimization of J_u, h_u and higher-order terms in the Taylor expansion of J_{u,α}.
Theorem 3 implies that, if the optimal average cost is the same regardless of the initial state, then the average-cost Bellman's equation has a solution.
Proof: We have, for all large enough α,

    J_{u*,α} = min_u { g_u + α P_u J_{u*,α} }.

Substituting J_{u*,α} = (λ/(1 − α)) e + h_{u*} + O((1 − α)^2), we get

    (λ/(1 − α)) e + h_{u*} + O((1 − α)^2) = min_u { g_u + α P_u [ (λ/(1 − α)) e + h_{u*} + O((1 − α)^2) ] }
                                          = min_u { g_u + (αλ/(1 − α)) e + α P_u h_{u*} + O((1 − α)^2) }.

Since (λ/(1 − α)) e − (αλ/(1 − α)) e = λ e, this yields

    λ e + h_{u*} + O((1 − α)^2) = min_u { g_u + α P_u h_{u*} + O((1 − α)^2) }.

Taking the limit as α → 1, we get

    λ e + h_{u*} = min_u { g_u + P_u h_{u*} } = T h_{u*}.  □
In the average-cost setting, existence of a solution to Bellman's equation actually depends on the structure of transition probabilities in the system. Some sufficient conditions for the optimal average cost to be the same regardless of the initial state are given below.

Definition 2 We say that two states x, y communicate under policy u if there are k, k̄ ∈ {1, 2, . . .} such that

    P_u^k(x, y) > 0,  P_u^{k̄}(y, x) > 0.

Definition 3 We say that a state x is recurrent under policy u if, conditioned on the fact that it is visited at least once, it is visited infinitely many times.

Definition 4 We say that a state x is transient under policy u if it is only visited finitely many times, regardless of the initial condition of the system.

Definition 5 We say that a policy u is unichain if all of its recurrent states communicate.

We state without proof the following theorem.

Theorem 4 Either of the following conditions is sufficient for the optimal average cost to be the same regardless of the initial state:

1. There exists a unichain optimal policy.
2. For every pair of states x and y, there is a policy u such that x and y communicate.
3 Value Iteration
One way to obtain this value is to calculate it for a finite but very large horizon to approximate the limit, expecting that such an approximation is accurate. Hence we consider

    T^k J_0 = min_u E[ Σ_{t=0}^{k−1} g_u(x_t) + J_0(x_k) ].

Recall that J(x, T) ≈ λ T + h(x). Choosing some reference state x̄, for all x we have

    J(x, T) − J(x̄, T) ≈ h(x) − h(x̄).

Then

    h_k(x) = J(x, k) − k λ

gives a sequence of estimates of h, for k = 1, 2, . . . .

Note that, since (λ, h + ke) is a solution to Bellman's equation for all k whenever (λ, h) is a solution, we can choose the value of h at a single state arbitrarily. Letting h(x̄) = 0, we have the following commonly used version of value iteration:

    h_{k+1}(x) = (T h_k)(x) − (T h_k)(x̄).   (8)
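Iteration (8) can be sketched directly. The two-state, two-action problem below is our own toy example (its transition probabilities and costs are not from the notes):

```python
# Sketch of relative value iteration (8): iterate
# h_{k+1}(x) = (T h_k)(x) - (T h_k)(xbar), and read off lambda = (T h)(xbar).
import numpy as np

def relative_value_iteration(P, g, xbar=0, n_iter=2000):
    n_states = g.shape[1]
    h = np.zeros(n_states)
    for _ in range(n_iter):
        Th = (g + np.einsum('axy,y->ax', P, h)).min(axis=0)  # (T h)(x)
        h = Th - Th[xbar]
    return Th[xbar], h          # average-cost estimate lambda and differential cost h

P = np.array([[[0.9, 0.1], [0.5, 0.5]],    # action 0: slow service
              [[0.9, 0.1], [0.9, 0.1]]])   # action 1: fast service, extra effort
g = np.array([[0.0, 1.0],
              [0.2, 1.2]])
lam, h = relative_value_iteration(P, g)
# At convergence, lambda e + h = T h holds up to numerical tolerance,
# and h(xbar) = 0 by construction.
```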
Theorem 5 Let h_k be given by (8). Then, if h_k → h, we have

    λ = (T h)(x̄)  and  λ e + h = T h.

Note that there must exist a solution to the average-cost Bellman's equation for value iteration to converge. However, it can be shown that existence of a solution is not a sufficient condition.
2.997 Decision-Making in Large-Scale Systems February 23
MIT, Spring 2004 Handout #9
Lecture Note 6
1 Application to Queueing Networks

In the first part of this lecture, we will discuss the application of dynamic programming to the queueing network introduced in [1], which illustrates several issues encountered in the application of dynamic programming in practical problems. In particular, we consider the issues that arise when value iteration is applied to problems with a large or infinite state space.

The main points in [1], which we overview today, are the following:
- Naive implementation of value iteration may lead to slow convergence and, in the case of infinite state spaces, policies with infinite average cost in every iteration step, even though the iterates J_k(x) converge pointwise to J*(x) for every state x;

- Under certain conditions, with proper initialization J_0, we can have faster convergence and stability guarantees;

- In queueing networks, a proper J_0 can be found from well-known heuristics such as fluid model solutions.

We will illustrate these issues with examples involving queueing networks. For the generic results, including a proof of convergence of average-cost value iteration for MDPs with infinite state spaces, refer to [1].
1.1 Multiclass queueing networks

Consider a queueing network as illustrated in Fig. 1.

[Figure 1: a multiclass queueing network in which three servers (Machine 1, Machine 2, Machine 3) each serve several of the queues.]
We introduce some notation:

    N: the number of queues in the system
    λ_i: probability of an exogenous arrival at queue i
    μ_i: probability that a job at queue i is completed if the job is being served
    x_i: state, the length of queue i
    g(x) = Σ_{i=1}^{N} x_i: cost function, in which the state is x = (x_1, . . . , x_N)
    a ∈ {0, 1}^N: a_i = 1 if a job from queue i is being served, and a_i = 0 otherwise.
The interpretation is as follows. At each time stage, at most one of the following events can happen: a new job arrives at queue i with probability λ_i; or a job from queue i that is currently being served has its service completed, with probability μ_i, and either moves to another queue or leaves the system, depending on the structure of the network. Note that, at each time stage, a server may choose to process a job from any of the queues associated with it. Therefore the decision a encodes which queue is being processed at each server. We refer to such a queueing network as multiclass because jobs at different queues have different service rates and trajectories through the system.
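The at-most-one-event dynamics above can be sketched as a simulator. The routing and rates below are our own toy example (two queues in tandem, not the network of Fig. 1):

```python
# One simulated transition under the notation above.  The tandem layout,
# rates, and routing table are our own toy example: queue 1 feeds queue 2,
# and queue 2 exits the system.
import random

LAMBDA = [0.2, 0.0]      # exogenous arrival probabilities lambda_i
MU = [0.3, 0.25]         # service completion probabilities mu_i
ROUTE = {0: 1, 1: None}  # where a completed job goes (None = leaves system)

def step(x, a, rng=random):
    """One transition from state x (queue lengths) under action a (a_i = 1 if queue i served)."""
    u, acc = rng.random(), 0.0
    x = list(x)
    for i, lam in enumerate(LAMBDA):          # exogenous arrival events
        acc += lam
        if u < acc:
            x[i] += 1
            return tuple(x)
    for i, mu in enumerate(MU):               # service completion events
        if a[i] == 1 and x[i] > 0:
            acc += mu
            if u < acc:
                x[i] -= 1
                if ROUTE[i] is not None:
                    x[ROUTE[i]] += 1
                return tuple(x)
    return tuple(x)                           # no event: state unchanged

g = lambda x: sum(x)                          # cost: total number of jobs in system
```

From the empty state, the only possible events are an exogenous arrival at queue 1 or no event at all, matching the dynamics described above.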
As seen before, an optimal policy could be derived from the differential cost function h*, which is the solution of Bellman's equation:

    λ* e + h* = T h*.

Consider using value iteration for estimating h*. This requires some initial guess h_0. A common choice is h_0 = 0; however, we will show that this can lead to slow convergence of h_k. Indeed, we know that h* is equivalent to a quadratic, in the sense that there is a constant γ > 0 and a solution to Bellman's equation such that

    (1/γ) Σ_i x_i^2 ≤ h*(x) ≤ γ Σ_i x_i^2.

Now let h_0 = 0. Then

    T^k h_0(x) = min_u E[ Σ_{t=0}^{k−1} Σ_{i=1}^{N} x_i(t) | x_0 = x ].

Since

    E[x_i(t)] = E[x_i(t − 1)] + E[A_i(t)] − E[D_i(t)],

where A_i(t) indicates an arrival at queue i at time t (with probability at most λ_i) and D_i(t) ≥ 0 indicates a departure, we have

    E[x_i(t) − x_i(t − 1)] ≤ λ_i.   (1)

By (1), we have

    E[x_i(1)] ≤ E[x_i(0)] + λ_i.
Thus,

    T^k h_0(x) ≤ Σ_{t=0}^{k−1} Σ_{i=1}^{N} (x_i(0) + t λ_i) = Σ_{i=1}^{N} [ k x_i(0) + (k(k − 1)/2) λ_i ].

This implies that h_k(x) is upper bounded by a linear function of the state x. In order for it to approach a quadratic function of x, the iteration number k must have the same magnitude as x. It follows that, if the state space is very large, convergence is slow. Moreover, if the state space is infinite, which is the case if queues do not have finite buffers, only pointwise convergence of h_k(x) to h*(x) can be ensured, but for every k, there is some state x such that h_k(x) is a poor approximation to h*(x).
Example 1 (Single queue length with controlled service rate) Consider a single queue with:

- state x defined as the queue length;
- P_a(x, x + 1) = λ (arrival rate);
- P_a(x, x − 1) = μ_1 + a μ_2, where the action is a ∈ {0, 1};
- P_a(x, x) = 1 − λ − μ_1 − a μ_2.

Let the cost function be defined as

    g_a(x) = (1 + a) x.
The interpretation is as follows. At each time stage, there is a choice between processing jobs at a lower service rate μ1 or at a higher service rate μ1 + μ2. Processing at a higher service rate helps to decrease future queue lengths, but an extra cost must be paid for the extra effort.
Suppose that λ > μ1. Then, under any policy u with u(x) = 0 for all x ≥ x0, whenever the queue length is at least x0 there are on average more job arrivals than departures, and it can be shown that eventually the queue length converges to infinity, leading to infinite average cost.

Suppose that h0(x) = 0, ∀x. Then in every iteration k, there exists an xk such that hk(x) = (T^k h0)(x) = cx + d for all x ≥ xk. Moreover, when hk = cx + d in a neighborhood of x, the greedy action is uk(x) = 0, in which case the average cost goes to infinity.
As shown in [1], using the initial value h0(x) = 1 + x² leads to stable policies for every iterate hk, and ensures convergence to the optimal policy. The choice of h0 as a quadratic arises from problem-specific knowledge. Moreover, appropriate choices in the case of queueing networks can be derived from well-known heuristics and analysis specific to the field.
Simulation-based Methods

The dynamic programming algorithms studied so far have the following characteristics: they update the cost-to-go estimate at every state, infinitely often; they assume that the transition probabilities and costs are known; and they require computing expectations over next states.
In realistic scenarios, each of these requirements may pose difficulties. When the state space is large, performing updates infinitely often in every state may be prohibitive; even when it is feasible, a clever order of visitation may considerably speed up convergence. In many cases, the system parameters are not known, and instead one has only access to observations about the system. Finally, even if the transition probabilities are known, computing expectations of the form (2) may be costly. In the next few lectures, we will study simulation-based methods, which aim at alleviating these issues.
2.1 Asynchronous value iteration

We describe asynchronous value iteration (AVI) as

J_{k+1}(x) = (T J_k)(x) for x ∈ S_k,  J_{k+1}(x) = J_k(x) otherwise,

where S_k ⊆ S is the set of states updated at stage k.
We have seen that, if every state has its value updated infinitely many times, then AVI converges (see arguments in Problem Set 1). The question remains as to whether convergence may be improved by selecting states in a particular order, and whether we can dispense with the requirement of visiting every state infinitely many times.
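As a concrete illustration, the sketch below runs asynchronous value iteration on a tiny discounted MDP (all transition probabilities, costs, and the discount factor are made up for the example), updating one state per step in round-robin order. Since every state is still updated infinitely often, it reaches the same fixed point as the synchronous iteration:

```python
import numpy as np

# Hypothetical 2-state, 2-action discounted MDP: P[a][x][y], g[a][x].
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
g = np.array([[1.0, 2.0], [4.0, 0.5]])
alpha = 0.9  # discount factor

def bellman(J, x):
    """(T J)(x) = min_a [ g_a(x) + alpha * sum_y P_a(x, y) J(y) ]."""
    return min(g[a, x] + alpha * P[a, x] @ J for a in range(2))

# Asynchronous VI: one state per step, each state updated infinitely often.
J_avi = np.zeros(2)
for k in range(500):
    x = k % 2
    J_avi[x] = bellman(J_avi, x)

# Synchronous VI for comparison.
J_sync = np.zeros(2)
for _ in range(500):
    J_sync = np.array([bellman(J_sync, x) for x in range(2)])
```

Both iterates end up at the unique fixed point of T, which is what the convergence argument from Problem Set 1 guarantees.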
We will consider a version of AVI where state updates are based on actual or simulated trajectories of the system. It seems reasonable to expect that, if the system is often encountered at certain states, more emphasis should be placed on obtaining accurate estimates and good actions for those states, motivating performing value updates more often at those states. In the limit, it is clear that if a state is never visited, under any policy, then the value of the cost-to-go function at such a state never comes into play in the decision-making process, and no updates need to be performed for such a state at all. Based on the notion that state trajectories contain information about which states are most relevant, we propose the following version of AVI. We call it real-time value iteration (RTVI).
1. Take an arbitrary state x0. Let k = 0.

2. Choose action u_k in some fashion.

3. Let x_{k+1} = f(x_k, u_k, w_k) (recall from Lecture 1 that f gives an alternative representation for state transitions).

4. Let J_{k+1}(x_{k+1}) = (T J_k)(x_{k+1}).

5. Let k = k + 1 and return to step 2.
2.2 Exploration vs. Exploitation

Note that there is still an element missing in the description of RTVI, namely, how to choose action u_k.
In general, choosing u_k greedily does not ensure convergence to the optimal policy. One possible failure scenario is illustrated in Figure 2. Suppose that there is a subset of states B which is recurrent under an optimal policy, and a disjoint subset of states A which is recurrent under another policy. If we start with a guess J0 which is high enough at states outside region A, and always choose actions greedily, then an action that never leads to states outside region A will be selected. Hence RTVI never has a chance of updating and correcting the initial guess J0 at states in subset B, and in particular, the optimal policy is never achieved.

It turns out that, if we choose the initial value J0 ≤ J*, then greedy policy selection performs well, as shown in Fig. 2(b). We state this concept formally in the following theorem.
The previous discussion highlights a tradeoff that is fundamental to learning algorithms: the conflict of exploitation versus exploration. In particular, there is usually tension between exploiting information accumulated by previous learning steps and exploring different options, possibly at a certain cost, in order to gather more information.
Figure 2: Initial Value Selection. (a) Improper initial value J0 with greedy policy selection. (b) Initial value J0 less than or equal to J*.
Theorem 1 If J0 ≤ J* and all states are reachable from one another, then the real-time value iteration algorithm (RTVI) with greedy policy u_t satisfies the following:

(a) J_k converges to some limit J̄;
(b) J̄ = J* at all states visited infinitely many times;
(c) after some number of iterations, all decisions are optimal.
Proof: Since T is monotone, we have (T J′)(x) ≤ (T J)(x) for all x whenever J′(x) ≤ J(x) for all x; in particular, J0 ≤ J* implies J_k ≤ J* for all k.
Hence one could regard J restricted to the set A of states visited infinitely often as a vector in ℝ^{|A|}. So T_A acts as the DP operator for the subset A of states, and

||T_A J − T_A J′|| ≤ α ||J − J′||.

Therefore, RTVI is AVI over A, with every state visited infinitely many times. Thus, J_k converges to a limit J̄, with J̄(x) = J0(x) for x ∉ A (states outside A are never updated).

Since the states x ∉ A are never visited, we must have

Pa(x, y) = 0, ∀ x ∈ A, y ∉ A,

where a is greedy with respect to J̄. Let ū be the greedy policy of J̄. Then

J̄(x) = g_ū(x) + Σ_{y∈A} P_ū(x, y) J̄(y) = Σ over all y ∈ S of g_ū(x) + P_ū(x, y) J̄(y), ∀ x ∈ A.

Therefore, we conclude

J̄(x) = J_ū(x) ≥ J*(x), ∀ x ∈ A.

By the hypothesis J0 ≤ J*, we also have J̄ ≤ J*, and we conclude

J̄(x) = J*(x), ∀ x ∈ A.
References

[1] R.-R. Chen and S. P. Meyn, "Value Iteration and Optimization of Multiclass Queueing Networks," Queueing Systems, 32, pp. 65-97, 1999.
2.997 Decision-Making in Large-Scale Systems    February 25
MIT, Spring 2004    Handout #10
Lecture Note 7
1 Real-Time Value Iteration

Recall the real-time value iteration (RTVI) algorithm:

choose u_k in some fashion;
observe x_{k+1} = f(x_k, u_k, w_k);
update J_{k+1}(x_k) = (T J_k)(x_k), and J_{k+1}(x) = J_k(x) for x ≠ x_k.

We thus have

(T J_k)(x_k) = min_a [ g_a(x_k) + Σ_y P_a(x_k, y) J_k(y) ].
We encounter the following two questions in this algorithm:

1. What if we do not know P_a(x, y)?
2. Even if we know or can simulate P_a(x, y), computing Σ_y P_a(x, y) J(y) may be expensive.

To overcome these two problems, we consider the Q-learning approach.
2 Q-Learning

2.1 Q-factors

For every state-action pair, we consider

Q*(x, a) = g_a(x) + α Σ_y P_a(x, y) J*(y),   (1)
J*(x) = min_a Q*(x, a).   (2)

We can interpret these equations as Bellman's equations for an MDP with an expanded state space. We have the original states x ∈ S, with associated sets of feasible actions A_x, and extra states (x, a), x ∈ S, a ∈ A_x, corresponding to state-action pairs, for which there is only one action available and no decision must be made. Note that, whenever we are in a state x where a decision must be made, the system transitions deterministically to state (x, a) based on the state and the action a chosen. Therefore we circumvent the need to compute expectations Σ_y P_a(x, y) J(y) associated with greedy policies.
The operator H defining these equations satisfies the following properties:

Monotonicity: for all Q, Q̄ such that Q ≤ Q̄, HQ ≤ HQ̄.
Offset: H(Q + Ke) = HQ + αKe.
Contraction: ||HQ − HQ̄||∞ ≤ α ||Q − Q̄||∞, for all Q, Q̄.

It follows that H has a unique fixed point, corresponding to the Q-factor Q*.
2.2 Q-Learning

We now develop a real-time value iteration algorithm for computing Q*. An algorithm analogous to RTVI for computing the cost-to-go function is as follows:

Q_{t+1}(x_t, u_t) = g_{u_t}(x_t) + α Σ_y P_{u_t}(x_t, y) min_a Q_t(y, a).

However, this algorithm undermines the idea that Q-learning is motivated by situations where we do not know P_a(x, y) or find it expensive to compute expectations Σ_y P_a(x, y) J(y). Alternatively, we consider variants that implicitly estimate this expectation, based on state transitions observed in system trajectories. Based on this idea, one possibility is to utilize a scheme of the form

Q_{t+1}(x_t, a_t) = g_{a_t}(x_t) + α min_a Q_t(x_{t+1}, a).

However, note that such an algorithm should not be expected to converge; in particular, min_a Q_t(x_{t+1}, a) is a noisy estimate of Σ_y P_{a_t}(x_t, y) min_a Q_t(y, a). We consider a small-step version of this scheme, where the noise is attenuated:

Q_{t+1}(x_t, a_t) = (1 − γ_t) Q_t(x_t, a_t) + γ_t [ g_{a_t}(x_t) + α min_a Q_t(x_{t+1}, a) ].   (4)
We will study the properties of (4) under the more general framework of stochastic approximations, which are at the core of many simulation-based or real-time dynamic programming algorithms.
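Update (4) can be sketched on a small synthetic MDP (all numbers below are hypothetical, and a discount factor alpha is assumed). Note that the learner only ever touches sampled transitions; the matrix P appears solely inside the simulator:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state, 2-action discounted MDP, *unknown* to the learner.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a][x][y]
              [[0.4, 0.6], [0.5, 0.5]]])
g = np.array([[2.0, 1.0], [0.5, 3.0]])    # g[a][x]
alpha = 0.5

Q = np.zeros((2, 2))                      # Q[x][a]
visits = np.zeros((2, 2))
x = 0
for t in range(100_000):
    a = int(rng.integers(2))              # exploratory (uniform) action
    x_next = rng.choice(2, p=P[a, x])     # observed transition (simulator only)
    visits[x, a] += 1
    gamma = 1.0 / visits[x, a]            # diminishing step size
    target = g[a, x] + alpha * Q[x_next].min()
    Q[x, a] = (1 - gamma) * Q[x, a] + gamma * target   # update (4)
    x = x_next

J = Q.min(axis=1)                         # J(x) = min_a Q(x, a), as in (2)
```

The iterates approach the fixed point Q* of the operator defined by (1)-(2), without the expectation over next states ever being computed explicitly.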
3 Stochastic Approximation

In the stochastic approximation setting, the goal is to solve a system of equations

r = Hr,

where r is a vector in ℝⁿ for some n and H is an operator defined on ℝⁿ. If we know how to compute Hr for any given r, it is common to try to solve this system of equations by value iteration:

r_{k+1} = H r_k.   (5)
We can also do the summation recursively by setting

r_t^{(i)} = (1/i) Σ_{j=1}^{i} (H r_t + w_j),

r_t^{(i+1)} = (i/(i+1)) r_t^{(i)} + (1/(i+1)) (H r_t + w_{i+1}).

Therefore, r_{t+1} = r_t^{(k)}. Finally, we may consider replacing the samples H r_t + w_i with samples H r_t^{(i−1)} + w_i, obtaining the final form

r_{t+1} = (1 − γ_t) r_t + γ_t (H r_t + w_t).
A simple application of these ideas involves estimating the expected value of a random variable by drawing i.i.d. samples.
Example 1 Let v1, v2, . . . be i.i.d. random variables. Given

r_{t+1} = (t/(t+1)) r_t + (1/(t+1)) v_{t+1},

we know that r_t → E[v] by the strong law of large numbers. We can actually prove a more general version:

r_{t+1} = (1 − γ_t) r_t + γ_t v_{t+1} → E[v]  w.p. 1,

if Σ_{t=1}^∞ γ_t = ∞ and Σ_{t=1}^∞ γ_t² < ∞.
3.1 Lyapunov function analysis

The question we try to answer is: does (8) converge? If so, where does it converge to? We will first illustrate the basic ideas of Lyapunov function analysis by considering a deterministic case.

3.1.1 Deterministic Case

In the deterministic case, we have S(r, w) = S(r). Suppose there exists some unique r* such that

S(r*) = H r* − r* = 0.

The basic idea is to show that a certain measure of distance between r_t and r* is decreasing.
Example 2 Suppose that F is a contraction with respect to ||·||₂. Then

r_{t+1} = r_t + γ_t (F r_t − r_t)

converges.
Proof: Since F is a contraction with modulus α, there exists a unique r* s.t. F r* = r*. Let

V(r) = ||r − r*||₂.

We will show V(r_{t+1}) ≤ V(r_t). Observe

||r_{t+1} − r*||₂ = ||r_t + γ_t (F r_t − r_t) − r*||₂
= ||(1 − γ_t)(r_t − r*) + γ_t (F r_t − r*)||₂
≤ (1 − γ_t) ||r_t − r*||₂ + γ_t ||F r_t − r*||₂
≤ (1 − γ_t) ||r_t − r*||₂ + γ_t α ||r_t − r*||₂
= ||r_t − r*||₂ − (1 − α) γ_t ||r_t − r*||₂.

Therefore, ||r_t − r*||₂ is nonincreasing and bounded below by zero, and hence converges to some c ≥ 0. Iterating the inequality above, and using ||r_t − r*||₂ ≥ c,

||r_{t+1} − r*||₂ ≤ ||r_t − r*||₂ − (1 − α) γ_t c
≤ ||r_{t−1} − r*||₂ − (1 − α)(γ_t + γ_{t−1}) c
≤ . . .
≤ ||r_0 − r*||₂ − (1 − α) c Σ_{l=1}^{t} γ_l.

Hence, if Σ_l γ_l = ∞ and c > 0, the right-hand side diverges to −∞, a contradiction. We conclude that c = 0, i.e., r_t → r*.
1. We define a distance V(r_t) ≥ 0 indicating how far r_t is from a solution r* satisfying S(r*) = 0.
2. We show that the distance is nonincreasing in t.
3. We show that the distance indeed converges to 0.

The argument also involves the basic result that every nonincreasing sequence bounded below converges, in order to show that the distance converges.
Motivated by these points, we introduce the notion of a Lyapunov function:

Definition 1 We call a function V a Lyapunov function if V satisfies

(a) V(·) ≥ 0,
(b) (∇V(r))ᵀ S(r) ≤ 0,
(c) V(r) = 0 ⟺ S(r) = 0.
3.1.2 Stochastic Case

The argument used for convergence in the stochastic case parallels the argument used in the deterministic case. Let F_t denote all information that is available at stage t, and let

S̄_t(r) = E[S(r, w_t) | F_t].

Then we require a Lyapunov function V satisfying

V(·) ≥ 0,   (9)
(∇V(r_t))ᵀ S̄_t(r_t) ≤ −c V(r_t),   (10)
||∇V(r) − ∇V(r̄)|| ≤ L ||r − r̄||,   (11)
E[ ||S(r_t, w_t)||² | F_t ] ≤ K₁ + K₂ V(r_t),   (12)

for some constants c, L, K₁ and K₂.

Note that (9) and (10) are direct analogues of requiring the existence of a distance that is nonincreasing in t; moreover, (10) ensures that the distance decreases at a certain rate if r_t is far from a desired solution r* satisfying V(r*) = 0. Condition (11) imposes some regularity on V which is required to show that V(r_t) does indeed converge to 0, and condition (12) imposes some control over the noise.
A last point worth mentioning is that (10) implies that the expected value of V(r_t) is nonincreasing; however, we may have V(r_{t+1}) > V(r_t) occasionally. Therefore we need a stochastic counterpart to the result that every nonincreasing sequence bounded below converges. The stochastic counterpart of interest to our analysis is given below.
Theorem 1 (Supermartingale Convergence Theorem) Suppose that X_t, Y_t and Z_t are nonnegative random variables adapted to F_t, satisfying

E[X_{t+1} | F_t] ≤ X_t − Y_t + Z_t,  with Σ_{t=1}^∞ Z_t < ∞ w.p. 1.

Then

1. X_t converges to a limit (which can be a random variable) with probability 1,
2. Σ_{t=1}^∞ Y_t < ∞.
In order to complete the proof of Theorem 1 from the four lemmas above, we have to consider the probabilities of two forms of failure:

- failure to stop the algorithm with a near-optimal policy
References

[1] M. Kearns and S. Singh, "Near-Optimal Reinforcement Learning in Polynomial Time," Machine Learning, Volume 49, Issue 2, pp. 209-232, Nov. 2002.
2.997 Decision-Making in Large-Scale Systems March 8
MIT, Spring 2004 Handout #13
Lecture Note 10
1 Value Function Approximation
DP problems are centered around the cost-to-go function J* or the Q-factor Q*. In certain problems, such as linear-quadratic-Gaussian (LQG) systems, J* exhibits some structure which allows for its compact representation:

Example 1 In an LQG system, we have

x_{k+1} = A x_k + B u_k + C w_k,  x_k ∈ ℝⁿ,
g(x, u) = xᵀ D x + uᵀ E u,

where x_k represents the system state, u_k represents the control action, and w_k is Gaussian noise. It can be shown that the optimal policy is of the form

u_k = L_k x_k

and the optimal cost-to-go function is of the form

J*(x) = xᵀ R x + S,  R ∈ ℝⁿˣⁿ, S ∈ ℝ,

where R is a symmetric matrix. It follows that, if there are n state variables (i.e., x_k ∈ ℝⁿ), storing J* requires storing n(n+1)/2 + 1 real numbers, corresponding to the matrix R and the scalar S. The computational time and storage space required is quadratic in the number of state variables.
In general, we are not as lucky as in the LQG system case, and exact representation ofJ requires that
it be stored as a lookup table, with one value per state. Therefore, the space is proportional to the size of
the state space, which grows exponentially with the number of state variables. This problem, known as the
curse of dimensionality, makes dynamic programming intractable in the face of most problems of practical scale.
Example 2 Consider the game of Tetris, represented in Fig. 1. As seen in previous lectures, this game may be represented as an MDP, and a possible choice of state is the pair (B, P), in which B ∈ {0, 1}ⁿˣᵐ represents the board configuration and P represents the current falling piece. More specifically, we have b(i, j) = 1 if position (i, j) of the board is filled, and b(i, j) = 0 otherwise.

If there are p different types of pieces, and the board has dimension n × m, the number of states is on the order of p 2^{nm}.
Figure 1: A Tetris game
as a deterministic optimization problem, in the following way. Denote by λ(u) the average cost of policy u. Then our problem corresponds to

min_{u∈U} λ(u),   (1)

where U is the set of all possible policies. In principle, we could solve (1) by enumerating all policies and choosing the one with the smallest value of λ(u); however, note that the number of policies is exponential in the number of states (we have |U| = |A|^{|S|}), and if there is no special structure to U, this problem requires even more computational time than solving Bellman's equation for the cost-to-go function. A possible approach to approximating the solution is to transform problem (1) by considering only a tractable subset of all policies:

min_{u∈F} λ(u),

where F is a subset of the policy space. If F has some appropriate format, e.g., we consider policies that are parameterized by a continuous variable, we may be able to solve this problem without having to enumerate all policies in the set, but by using some standard optimization method such as gradient descent. Methods based on this idea are called approximations in the policy space, and will be studied later on in this class.
(2) Cost-to-go Function Approximation

Another approach to approximating the dynamic programming solution is to approximate the cost-to-go function. The underlying idea for cost-to-go function approximation is that J* has some structure that allows for an approximate compact representation

J*(x) ≈ J̃(x, r), for some parameter r ∈ ℝᵖ.
Example 3

J̃(x, r) = cos(xᵀ r)  — nonlinear in r
J̃(x, r) = r₀ + r₁ᵀ x  — linear in r
J̃(x, r) = r₀ + r₁ᵀ φ(x)  — linear in r
In the next few lectures, we will focus on cost-to-go function approximation. Note that there are two important preconditions to the development of an effective approximation. First, we need to choose a parameterization J̃ that can closely approximate the desired cost-to-go function. In this respect, a suitable choice requires some practical experience or theoretical analysis that provides rough information on the shape of the function to be approximated. Regularities associated with the function, for example, can guide the choice of representation. Designing an approximation architecture is a problem-specific task, and it is not the main focus of this course; however, we provide some general guidelines and illustrations via case studies involving queueing problems. Second, given a parameterization for the cost-to-go function approximation, we need an efficient algorithm that computes appropriate parameter values.

We will start by describing usual choices of approximation architectures.
2 Approximation Architectures
2.1 Neural Networks
A common choice for an approximation architecture are neural networks. Fig. ??represents a neural network.
The underlying idea is as follows: we first convert the original state x into a vector x n, for somen. This
vector is used as the input to a linear layerof the neural network, which maps the input to a vector y m,
for some m, such that yj = n
i=1 rijxi. The vector y is then used as the input to a sigmoidal layer, which
outputs a vector z m
with the property that zi = f(yi), and f(.) is a sigmoidal function. A sigmoidalfunction is any function with the following properties:
1. monotonically increasing
2. differentiable
3. bounded
Fig. 3 represents a typical sigmoidal function.
The combination of a linear and a sigmoidal layer is called a perceptron, and a neural network consists
of a chain of one or more perceptrons (i.e., the output of a sigmoidal layer can be redirected to another
sigmoidal layer, and so on). Finally, the output of the neural network consists of a weighted sum of the
output z of the final layer: J̃(x, r) = Σ_{i=1}^{m} r_i z_i.
Figure 2: A neural network
Figure 3: A sigmoidal function
of functions on some bounded and closed set; if the functions are uniformly smooth, we can get error O(1/n) with n sigmoidal functions (Barron, 1990). Note, however, that in order to obtain a good approximation, an adequate set of weights r must be found. Backpropagation, which is simply a gradient descent algorithm, is able to find a local optimum among all sets of weights, but finding the global optimum may be a difficult problem.
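A minimal forward pass through one perceptron (linear layer, sigmoidal layer, and output weights) can be sketched as follows; the dimensions and weights below are arbitrary, and the logistic function is used as the sigmoidal f:

```python
import numpy as np

def sigmoid(y):
    """A sigmoidal function: monotonically increasing, differentiable, bounded."""
    return 1.0 / (1.0 + np.exp(-y))

def perceptron_forward(x_bar, R, r_out):
    """One perceptron (linear layer + sigmoidal layer), followed by a
    weighted sum of the final-layer output z."""
    y = R @ x_bar          # linear layer: y_j = sum_i r_ij * x_bar_i
    z = sigmoid(y)         # sigmoidal layer: z_i = f(y_i)
    return r_out @ z       # output: weighted sum of z

rng = np.random.default_rng(3)
R = rng.normal(size=(4, 3))     # hypothetical weights: 3 inputs -> 4 hidden units
r_out = rng.normal(size=4)
x_bar = np.array([0.5, -1.0, 2.0])
value = perceptron_forward(x_bar, R, r_out)
```

Chaining several such perceptrons (feeding z into another linear layer) gives the multi-layer network described above; training the weights R and r_out is what backpropagation does.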
2.2 State Space Partitioning

Another common choice of approximation architecture is based on partitioning of the state space. The underlying idea is that similar states may be grouped together. For instance, in an MDP involving
2.3 Features

A special case of state space partitioning consists of mapping states to features, and considering approximations of the cost-to-go function that are functions of the features. The hope is that the features would capture aspects of the state that are relevant for the decision-making process and discard irrelevant details, thus providing a more compact representation. At the same time, one would also hope that, with an appropriate choice of features, the mapping from features to the (approximate) cost-to-go function would be smoother than that from the original state to the cost-to-go function, thereby allowing for successful approximation with architectures that are suitable for smooth mappings (e.g., polynomials). This process is represented below:

State x → features f(x) → J̃(f(x), r),

where the mapping from f(x) to J̃ is smooth.

Example 4 Consider the Tetris game. What features should we choose?

1. |h(i) − h(i+1)| (height differences)
2. the number of holes
3. max_i h(i)
2.997 Decision-Making in Large-Scale Systems    March 10
MIT, Spring 2004    Handout #14
Lecture Note 11
1 Complexity and Model Selection

In this lecture, we will consider the problem of supervised learning. The setup is as follows. We have pairs (x, y), distributed according to a joint distribution P(x, y). We would like to describe the relationship between x and y through some function f chosen from a set of available functions C, so that y ≈ f(x). Ideally, we would choose f by solving

f* = argmin_{f∈C} E_{(x,y)∼P} [ (y − f(x))² ]   (test error).

However, we will assume that the distribution P is not known; rather, we only have access to samples (x_i, y_i). Intuitively, we may try to solve

min_{f∈C} (1/n) Σ_{i=1}^{n} (y_i − f(x_i))²   (training error)

instead. It also seems that the richer the class C is, the better the chance of correctly describing the relationship between x and y. In this lecture, we will show that this is not the case: the appropriate complexity of C, and the selection of a model for describing how x and y are related, must be guided by how much data is actually available. This issue is illustrated in the following example.
Example 1 Consider fitting the following data by a polynomial of finite degree:

x: 1    2    3    4
y: 2.5  3.5  4.5  5.5

Among several others, the following polynomials fit the data perfectly:

y = x + 1.5
y = 2x⁴ − 20x³ + 70x² − 99x + 49.5

Which polynomial should we choose?
Now consider the following (possibly noisy) data:

x: 1    2    3    4
y: 2.3  3.5  4.7  5.5

Fitting the data with a first-degree polynomial yields y = 1.03x + 1.3; fitting it with a fourth-degree polynomial yields (among others) y = 2x⁴ − 20.0667x³ + 70.4x² − 99.5333x + 49.5. Which polynomial should we choose?
[Figure: training error vs. test error for the two fits.]
It seems intuitive in the previous example that a line may be the best description for the relationship between x and y, even though a polynomial of degree 3 describes the data perfectly in both cases and no linear function is able to describe the data perfectly in the second case. Is the intuition correct, and if so, how can we decide on an appropriate representation, if relying solely on the training error does not seem completely reasonable?

The essence of the problem is as follows. Ultimately, what we are interested in is the ability of our fitted curve to predict future data, rather than simply explaining the observed data. In other words, we would like to choose a predictor that minimizes the expected error |y(x) − ŷ(x)| over all possible x. We call this the test error. The average error over the data set is called the training error.
We will show that training error and test error can be related through a measure of the complexity of the class of predictors being considered. Appropriate choice of a predictor will then be shown to require balancing the training error and the complexity of the predictors being considered. Their relationship is described in Fig. 1, where we plot test and training errors versus complexity of the predictor class C when the number of samples is fixed.
The main difficulty is that, as indicated in Fig. 1, there exists a tradeoff between complexity and the errors, i.e., the training error and the test error; while the approximation error over the sampled points goes to zero as we consider richer approximation classes, the same is not true for the test error, which we are ultimately interested in minimizing. This is due to the fact that, with only
Figure 1: Error vs. degree of the approximation function (training error decreases with the maximum degree d, while the test error is minimized at an intermediate, optimal degree).
3.1 Classification with a finite number of classifiers

Suppose that, given n samples (x_i, y_i), i = 1, . . . , n, we need to choose a classifier from a finite set of classifiers f_1, . . . , f_d. Define

ε(k) = E[ |y − f_k(x)| ],
ε_n(k) = (1/n) Σ_{i=1}^{n} |y_i − f_k(x_i)|.

In words, ε(k) is the test error associated with classifier f_k, and ε_n(k) is a random variable representing the training error associated with classifier f_k over the samples (x_i, y_i), i = 1, . . . , n. As described before, we would like to find k* = argmin_k ε(k), but cannot compute ε directly. Let us consider using instead

k̂ = argmin_k ε_n(k).
We are interested in the following question: how does the test error ε(k̂) compare to the optimal error ε(k*)? Suppose that

|ε_n(k) − ε(k)| ≤ ε̄, ∀k,   (1)

for some ε̄ > 0. Then we have

ε(k̂) ≤ ε_n(k̂) + ε̄ ≤ ε_n(k*) + ε̄ ≤ ε(k*) + 2ε̄.

In words, if the training error is close to the test error for all classifiers f_k, then using k̂ instead of k* is near-optimal. But can we expect (1) to hold?

Observe that the |y_i − f_k(x_i)| are i.i.d. Bernoulli random variables. From the strong law of large numbers, we must have ε_n(k) → ε(k) w.p. 1. This means that, if there are sufficiently many samples, (1) should be true. Having only finitely many samples, we face two questions:

(1) How many samples are needed before we have high confidence that ε_n(k) is close to ε(k)?
(2) Can we show that ε_n(k) approaches ε(k) equally fast for all f_k ∈ C?
The first question is resolved by the Chernoff bound: for i.i.d. Bernoulli random variables x_i, i = 1, . . . , n, we have

P( |(1/n) Σ_{i=1}^{n} x_i − E[x_1]| > ε ) ≤ 2 exp(−2nε²).
i=1
Moreover, since there are only finitely many functions in C, uniform convergence of n(k) to (k) followsimmediately:
P(k
:
(k)
(k) >
) =
(k)
(k) >
})| |
P
(k{| |d
(k)(k) >}) P({| |k=1
2d exp(2n2).
Therefore
we
have
the
following
theorem.
Theorem 1 With probability at least 1 − δ, the training set (x_i, y_i), i = 1, . . . , n, will be such that

test error ≤ training error + ε(d, n, δ),

where

ε(d, n, δ) = sqrt( (log 2d + log(1/δ)) / (2n) ).
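Theorem 1's bound is easy to evaluate and to invert for a sample-size requirement. The sketch below computes ε(d, n, δ) and the smallest n achieving a target ε (the particular values of d, δ and ε are arbitrary), illustrating that the required n grows only logarithmically in the number of classifiers d:

```python
import math

def error_bound(d, n, delta):
    """epsilon(d, n, delta) from Theorem 1: with probability >= 1 - delta,
    test error <= training error + epsilon."""
    return math.sqrt((math.log(2 * d) + math.log(1 / delta)) / (2 * n))

def samples_needed(d, delta, eps):
    """Smallest n with error_bound(d, n, delta) <= eps (invert the bound)."""
    return math.ceil((math.log(2 * d) + math.log(1 / delta)) / (2 * eps ** 2))

# Required samples for epsilon = 0.1 at confidence delta = 0.05:
n_10 = samples_needed(10, 0.05, 0.1)          # d = 10 classifiers
n_10000 = samples_needed(10_000, 0.05, 0.1)   # d = 10,000 classifiers
```

Multiplying the number of candidate classifiers by a factor of 1000 increases the sample requirement only modestly, which is the log d behavior discussed next.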
4 Measures of Complexity

In Theorem 1, the error ε(d, n, δ) is on the order of sqrt(log d). In other words, the more classifiers are under consideration, the larger the bound on the difference between the test and training errors, and the difference grows as a function of log d. It follows that, for our purposes, log d captures the complexity of the class of classifiers.

4.1 VC dimension
The VC dimension is a property of a class C of functions: for each set C, we have an associated measure of complexity, d_VC(C). d_VC captures how much variability there is between different functions in C. The underlying idea is as follows. Take n points x_1, . . . , x_n, and consider the binary vectors in {−1, +1}ⁿ formed by applying a function f ∈ C to (x_1, . . . , x_n). How many different vectors can we come up with? In other words, consider the following matrix:

f_1(x_1)  f_1(x_2)  . . .  f_1(x_n)
f_2(x_1)  f_2(x_2)  . . .  f_2(x_n)
   .         .       .        .

where f_i ∈ C. How many distinct rows can this matrix have? This discussion leads to the notion of shattering and to the definition of the VC dimension.
Definition 1 (Shattering) A set of points x_1, . . . , x_n is shattered by a class C of classifiers if, for any assignment of labels y_i ∈ {−1, +1}, there is f ∈ C such that f(x_i) = y_i, ∀i.

Definition 2 The VC dimension of C is the cardinality of the largest set it can shatter.
Example 1 Consider |C| = d. Suppose x_1, x_2, . . . , x_n is shattered by C. Then we need d ≥ 2ⁿ and thus n ≤ log₂ d. This means that d_VC(C) ≤ log₂ d.
Example 2 Consider C = {hyperplanes in ℝ²} (i.e., linear classifiers). Any two points in ℝ² can be shattered, hence d_VC(C) ≥ 2. There are three points in ℝ² that C can shatter, hence d_VC(C) ≥ 3. Since C cannot shatter any four points in ℝ², we have d_VC(C) ≤ 3, and it follows that d_VC(C) = 3. Moreover, it can be shown that, if C = {hyperplanes in ℝⁿ}, then d_VC(C) = n + 1.

Example 3 If C is the set of all convex sets in ℝ², we can show that d_VC(C) = ∞.
It turns out that the VC dimension provides a generalization of the results from the previous section, for finite sets of classifiers, to general classes of classifiers:

Theorem 2 With probability at least 1 − δ over the choice of sample points (x_i, y_i), i = 1, . . . , n, we have

ε(f) ≤ ε_n(f) + ε(n, d_VC(C), δ), ∀ f ∈ C,

where

ε(n, d_VC, δ) = sqrt( [ d_VC (log(2n/d_VC) + 1) + log(4/δ) ] / n ).
5 Structural Risk Minimization
Based on the previous results, we may consider the following approach to selecting a class of functions C whose complexity is appropriate for the number of samples available. Suppose that we have several classes C_1 ⊂ C_2 ⊂ . . . ⊂ C_p; note that complexity increases from C_1 to C_p. We have classifiers f_1, f_2, . . . , f_p which minimize the training error ε_n(f_i) within each class. Then, given a confidence level δ, we may find upper bounds on the test error ε(f_i) associated with each classifier:

ε(f_i) ≤ ε_n(f_i) + ε(d_VC(C_i), n, δ),

with probability at least 1 − δ, and we can choose the classifier f_i that minimizes the above upper bound. This approach is called structural risk minimization.
There are two difficulties associated with structural risk minimization: first, the upper bounds provided by Theorems 1 and 2 may be loose; second, it may be difficult to determine the VC dimension of a given class of classifiers, and rough estimates or upper bounds may have to be used instead. Still, this may be a reasonable approach if we have a limited amount of data. If we have a lot of data, an alternative approach is as follows. We can split the data into three sets: a training set, a validation set and a test set. We can use the training set to find the classifier f_i within each class C_i that minimizes the training error; use the validation set to estimate the test error of each selected classifier f_i, and choose the classifier f̂ from f_1, . . . , f_p with the smallest estimate; and finally, use the test set to generate an estimate of the test error associated with f̂.
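The data-splitting procedure can be sketched as follows, with polynomial classes of increasing degree standing in for C_1 ⊂ · · · ⊂ C_p (the data-generating model and split sizes below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data from y = x + 1.3 + noise; candidate class C_i = polynomials
# of degree i, so complexity increases with i.
x = rng.uniform(0, 4, size=300)
y = x + 1.3 + rng.normal(scale=0.3, size=300)

# Split into training, validation and test sets.
x_tr, y_tr = x[:100], y[:100]
x_val, y_val = x[100:200], y[100:200]
x_te, y_te = x[200:], y[200:]

def mse(coeffs, xs, ys):
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Training set: best fit f_i within each class C_i.
fits = {deg: np.polyfit(x_tr, y_tr, deg) for deg in range(1, 7)}
# Validation set: choose the classifier with the smallest estimated test error.
best_deg = min(fits, key=lambda d: mse(fits[d], x_val, y_val))
# Test set: an unbiased estimate of the selected classifier's error.
test_estimate = mse(fits[best_deg], x_te, y_te)
```

Training error alone would always favor the highest degree; the held-out validation set is what penalizes the extra complexity.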
2.997 Decision-Making in Large-Scale Systems    March 12
MIT, Spring 2004    Handout #16
Lecture Note 12

1 Value Function Approximation and Policy Performance
Recall that two tasks must be accomplished in order for a value function approximation scheme to be successful:

1. We must pick a good representation J̃, such that J*(·) ≈ J̃(·, r) for at least some parameter r;
2. We must pick a good parameter r̄, such that J*(x) ≈ J̃(x, r̄).

Consider approximating J* with a linear architecture, i.e., let

J̃(x, r) = Σ_{i=1}^{p} φ_i(x) r_i,

for some functions φ_i, i = 1, . . . , p. We can define a matrix Φ ∈ ℝ^{|S|×p} given by

Φ = [ φ_1 . . . φ_p ].

With this notation, we can represent each function J̃(·, r) as

J̃ = Φr.
Fig. 1 gives a geometric interpretation of value function approximation. We may think of J* as a vector in ℝ^{|S|}; by considering approximations of the form J̃ = Φr, we restrict attention to the hyperplane J̃ = Φr in the same space. Given a norm ||·|| (e.g., the Euclidean norm), an ideal value function approximation algorithm would choose r minimizing ||J* − Φr||; in other words, it would find the projection Φr* of J* onto the hyperplane. Note that ||J* − Φr*|| is a natural measure of the quality of the approximation architecture, since it is the best approximation error that can be attained by any algorithm given the choice of Φ.

Algorithms for value function approximation found in the literature do not compute the projection Φr*, since this is an intractable problem. Building on the knowledge that J* satisfies Bellman's equation, value function approximation typically involves adaptation of exact dynamic programming algorithms. For instance, drawing inspiration from value iteration, one might consider an approximate value iteration scheme; the hope is that, if the architecture Φ is capable of producing a good approximation to J*, then the approximation algorithm should be able to
produce a relatively good approximation.
Another important question concerns the choice of the norm ||·|| used to measure approximation errors. Recall that, ultimately, we are not interested in finding an approximation to J*, but rather in finding a good policy for the original decision problem. Therefore we would like to choose ||·|| to reflect the performance associated with approximations to J*.
Figure 1: Value function approximation. J* is a vector in ℝ^{|S|} (one coordinate per state); approximations are restricted to the hyperplane J̃ = Φr.
2 Performance Bounds

We are interested in the following question. Let u_J be the greedy policy associated with an arbitrary function J, and J_{u_J} be the cost-to-go function associated with that policy. Can we relate the increase in cost ||J_{u_J} − J*|| to the approximation error ||J − J*||?

Recall the following theorem, from Lecture Note 3:

Theorem 1 Let J be arbitrary, and let u_J be a greedy policy with respect to J. Let J_{u_J} be the cost-to-go function for policy u_J. Then

||J_{u_J} − J*||∞ ≤ (2α/(1 − α)) ||J − J*||∞.
It is unrealistic to expect that we could approximate J* uniformly well over all states (which is required by Theorem 1), or that we could find a policy u_J that yields a cost-to-go uniformly close to J* over all states.
The following example illustrates the notion that having a large error ||J − J*||∞ does not necessarily lead to a bad policy. Moreover, minimizing ||J − J*||∞ may also lead to undesirable results.
Example 1 Consider a single queue with controlled service rate. Let x denote the queue length, B denote the buffer size, and

P_a(x, x+1) = λ, ∀a, x = 0, 1, . . . , B−1,
P_a(x, x−1) = μ(a), ∀a, x = 1, 2, . . . , B,
P_a(B, B+1) = 0, ∀a,
g_a(x) = x + q(a).

Suppose that we are interested in minimizing the average cost in this problem. Then we would like to find an approximation to the differential cost function h*. Suppose that we consider only linear approximations: h̃(x, r) = r1 + r2·x. At the top of Figure 1, we represent h* and two possible approximations, h1 and h2. h1 is chosen so as to minimize ‖h̃ − h*‖_∞. Which one is a better approximation? Note that h1 yields smaller approximation errors than h2 over large states, but yields large approximation errors over most of the state space. In particular, as we increase the buffer size B, h1 should lead to worse and worse approximation errors in almost all states. h2, on the other hand, has an interesting property, which we now describe. At the bottom of Figure 1, we represent the stationary state distribution π(x) encountered under the optimal policy. Note that it decays exponentially with x, and large states are rarely visited. This suggests that, for practical purposes, h2 may lead to a better policy, since it approximates h* better than h1 over the set of states that are visited almost all of the time.
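The effect described in the example can be reproduced numerically. In the sketch below, the convex function h*(x) = x²/B is an assumed stand-in (the notes do not give h* in closed form), and the stationary distribution is taken to be geometric; h1 is the sup-norm-optimal linear fit and h2 the π-weighted least-squares fit:

```python
import numpy as np

B = 50
x = np.arange(B + 1, dtype=float)
h_star = x**2 / B                      # assumed convex stand-in for h*
pi = 0.5**x; pi /= pi.sum()            # exponentially decaying state distribution

Phi = np.stack([np.ones_like(x), x], axis=1)   # linear architecture r1 + r2*x

# h1: minimize ||h~ - h*||_inf via a crude grid search over (r1, r2)
best, r1 = np.inf, None
for a in np.linspace(-20, 20, 201):
    for b in np.linspace(0, 2 * B, 201):
        err = np.max(np.abs(a + b * x - h_star))
        if err < best:
            best, r1 = err, (a, b)
h1 = r1[0] + r1[1] * x

# h2: minimize the pi-weighted least-squares error
r2 = np.linalg.lstsq(np.sqrt(pi)[:, None] * Phi, np.sqrt(pi) * h_star,
                     rcond=None)[0]
h2 = Phi @ r2

sup_err_h1 = np.max(np.abs(h1 - h_star))
sup_err_h2 = np.max(np.abs(h2 - h_star))
w_err_h1 = pi @ np.abs(h1 - h_star)    # pi-weighted average error
w_err_h2 = pi @ np.abs(h2 - h_star)
```

As in the example, h1 wins in the sup-norm but h2 is far more accurate on the states that are actually visited.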
What the previous example hints at is that, in the case of a large state space, it may be important to consider errors J − J* that differentiate between more and less important states. In the next section, we will introduce the notion of weighted norms and present performance bounds that take state relevance into account.
2.1 Performance Bounds with State-Relevance Weights
We first introduce the following weighted norms:

‖J* − J̃(·, r)‖_∞ = max_{x∈S} |J*(x) − J̃(x, r)|

‖J* − J̃(·, r)‖_{∞,ν} = max_{x∈S} ν(x) |J*(x) − J̃(x, r)|

‖J* − J̃(·, r)‖_{1,c} = Σ_{x∈S} c(x) |J*(x) − J̃(x, r)|,   (c > 0).
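For a finite state space these norms are straightforward to evaluate; a small sketch (the error vector and weights below are illustrative):

```python
import numpy as np

def norm_inf(e):
    """||e||_inf = max_x |e(x)|."""
    return np.max(np.abs(e))

def norm_inf_weighted(e, nu):
    """||e||_{inf,nu} = max_x nu(x) |e(x)|."""
    return np.max(nu * np.abs(e))

def norm_1_c(e, c):
    """||e||_{1,c} = sum_x c(x) |e(x)|, for weights c > 0."""
    return c @ np.abs(e)

# Illustrative error vector J*(x) - J~(x, r) and weights over 3 states
e = np.array([1.0, -2.0, 0.5])
c = np.array([0.5, 0.3, 0.2])
```

When c is a probability distribution, ‖·‖_{1,c} is exactly the expected absolute error under c, which is what the bounds below exploit.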
[Figure: at the top, h*(x) and the linear approximations h1 and h2, for 0 ≤ x ≤ B; at the bottom, the stationary distribution of x.]
Theorem 2 Let J be such that J ≤ TJ, and let u_J be a greedy policy with respect to J. Then

‖J_{u_J} − J*‖_{1,c} ≤ (1 / (1 − α)) ‖J − J*‖_{1,μ_{c,J}},

where

μ_{c,J}^T = (1 − α) Σ_{t=0}^∞ α^t c^T P_{u_J}^t = (1 − α) c^T (I − αP_{u_J})^{-1},

or equivalently,

μ_{c,J}(x) = (1 − α) Σ_{y∈S} c(y) Σ_{t=0}^∞ α^t P_{u_J}^t(y, x), ∀x ∈ S.

Proof: We have J ≤ TJ, and hence J ≤ J* ≤ J_{u_J}. Then

‖J_{u_J} − J*‖_{1,c} = c^T (J_{u_J} − J*)
≤ c^T (J_{u_J} − J)
= c^T ((I − αP_{u_J})^{-1} g_{u_J} − J)
= c^T (I − αP_{u_J})^{-1} (g_{u_J} + αP_{u_J} J − J)
= c^T (I − αP_{u_J})^{-1} (TJ − J)   (since u_J is greedy with respect to J)
≤ c^T (I − αP_{u_J})^{-1} (J* − J)   (since J ≤ J* implies TJ ≤ TJ* = J*)
= (1 / (1 − α)) ‖J − J*‖_{1,μ_{c,J}}.
Comparing Theorems 1 and 2, we have

‖J_{u_J} − J*‖_∞ ≤ (2α / (1 − α)) ‖J − J*‖_∞,

‖J_{u_J} − J*‖_{1,c} ≤ (1 / (1 − α)) ‖J − J*‖_{1,μ_{c,J}}.
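The second bound can also be checked numerically. In the illustrative construction below, J is obtained by shifting J* down so that the hypothesis J ≤ TJ of Theorem 2 holds, and μ_{c,J} is computed from its matrix definition:

```python
import numpy as np

rng = np.random.default_rng(2)
n, A, alpha = 6, 3, 0.9
P = rng.random((A, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((A, n))

def T(J):                                    # exact Bellman operator
    return np.min(g + alpha * P @ J, axis=0)

J_star = np.zeros(n)
for _ in range(3000):                        # exact value iteration
    J_star = T(J_star)

# Shift J* down by d with min(d) >= alpha * max(d), which guarantees J <= TJ
d = rng.uniform(0.9, 1.0, size=n)
J = J_star - d

u = np.argmin(g + alpha * P @ J, axis=0)     # greedy policy u_J
Pu, gu = P[u, np.arange(n), :], g[u, np.arange(n)]
J_u = np.linalg.solve(np.eye(n) - alpha * Pu, gu)

c = np.full(n, 1.0 / n)                      # state-relevance weights
mu = (1 - alpha) * c @ np.linalg.inv(np.eye(n) - alpha * Pu)   # mu_{c,J}

lhs = c @ np.abs(J_u - J_star)               # ||J_uJ - J*||_{1,c}
rhs = (mu @ np.abs(J - J_star)) / (1 - alpha)
```

Note that μ_{c,J} is itself a probability distribution: it is the (1 − α)-discounted occupancy distribution of the chain P_{u_J} started from c.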
There are two important differences between these bounds:

1. The first bound relates performance to the worst possible approximation error over all states, whereas the second involves only a weighted average of errors. Therefore we expect the second bound to exhibit better scaling properties.

2. The first bound presents a worst-case guarantee on performance: the cost-to-go starting from any initial state x cannot be greater than the stated bound. The second bound presents a guarantee on the expected cost-to-go, given that the initial state is distributed according to distribution c. Although this is a weaker guarantee, it represents a more realistic requirement for large-scale systems, and raises the possibility of exploiting information about how important each different state is in the overall decision problem.
This step is typically done in real-time, as the system is being controlled. If the set of available actions A is relatively small and the summation Σ_y P_a(x, y) J̃(y, r) can be computed relatively fast, then evaluating
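The real-time step referred to above — picking, at the current state x, an action that is greedy with respect to the approximation J̃(·, r) — can be sketched as follows (the MDP arrays and the tiny two-state example are illustrative assumptions):

```python
import numpy as np

def greedy_action(x, P, g, alpha, J_tilde):
    """Return argmin_a [ g_a(x) + alpha * sum_y P_a(x,y) * J~(y, r) ].
    P: (A, n, n) transition probabilities, g: (A, n) costs,
    J_tilde: length-n vector of approximate values J~(y, r)."""
    q = g[:, x] + alpha * P[:, x, :] @ J_tilde   # one Q-value per action
    return int(np.argmin(q))

# Tiny example: action 0 moves to state 0, action 1 moves to state 1
P = np.array([[[1.0, 0.0], [1.0, 0.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
g = np.zeros((2, 2))
J_tilde = np.array([0.0, 10.0])                  # state 1 looks expensive
a = greedy_action(0, P, g, alpha=0.9, J_tilde=J_tilde)
```

For each decision, only the |A| sums over next states are needed, so the per-step cost is modest whenever A is small and the transitions from x are sparse or quick to enumerate.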