7/25/2019 2997 spring 2004
2.997 Decision-Making in Large-Scale Systems February 4
MIT, Spring 2004 Handout #1
Lecture Note 1
1 Markov decision processes
In this class we will study discrete-time stochastic systems. We can describe the evolution (dynamics) of these systems by the following equation, which we call the system equation:

    x_{t+1} = f(x_t, a_t, w_t),   (1)

where x_t ∈ S, a_t ∈ A_{x_t} and w_t ∈ W denote the system state, decision and random disturbance at time t, respectively. In words, the state of the system at time t+1 is a function f of the state, the decision and a random disturbance at time t. An important assumption of this class of models is that, conditioned on the current state x_t, the distribution of future states x_{t+1}, x_{t+2}, . . . is independent of the past states x_{t−1}, x_{t−2}, . . . . This is the Markov property, which gives rise to the name Markov decision processes.
An alternative representation of the system dynamics is given through transition probability matrices: for each state-action pair (x, a), we let P_a(x, y) denote the probability that the next state is y, given that the current state is x and the current action is a.
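As a small illustration, the transition-matrix representation can be computed from the system equation (1) by summing disturbance probabilities. The sketch below is ours (the function names and the toy random walk are not from the notes):

```python
# Illustrative sketch (names are ours): given the system equation f(x, a, w)
# and the distribution of the disturbance w, the entry
# P_a(x, y) = Prob(f(x, a, w) = y) is obtained by summing the probabilities
# of all disturbances that carry x to y under action a.

def transition_matrix(states, f, a, w_dist):
    """Return P_a as a dict of dicts: P[x][y] = Prob(next state is y | x, a)."""
    P = {x: {y: 0.0 for y in states} for x in states}
    for x in states:
        for w, pw in w_dist.items():
            P[x][f(x, a, w)] += pw
    return P

# Toy example: a random walk on {0, 1, 2} with reflecting boundaries,
# where the action adds drift.
states = [0, 1, 2]
f = lambda x, a, w: min(max(x + a + w, 0), 2)
w_dist = {-1: 0.5, 1: 0.5}   # disturbance is -1 or +1 with equal probability

P0 = transition_matrix(states, f, 0, w_dist)
# From state 1 under a = 0 the walk moves to 0 or 2, each with probability 0.5.
```

Each row of the resulting matrix sums to one, since every disturbance value maps x to exactly one next state.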
We are concerned with the problem of how to make decisions over time. In other words, we would like to pick an action a_t ∈ A_{x_t} at each time t. In real-world problems, this is typically done with some objective in mind, such as minimizing costs, maximizing profits or rewards, or reaching a goal. Let u(x, t) take values in A_x, for each x. Then we can think of u as a decision rule that prescribes an action from the set of available actions A_x based on the current time stage t and current state x. We call u a policy.
In this course, we will assess the quality of each policy based on costs that are accumulated additively over time. More specifically, we assume that at each time stage t a cost g_{a_t}(x_t) is incurred. In the next section, we describe some of the optimality criteria that will be used in this class when choosing a policy.
Based on the previous discussion, we characterize a Markov decision process by a tuple (S, A, P(·, ·), g(·)), consisting of a state space, a set of actions associated with each state, transition probabilities and costs associated with each state-action pair. For simplicity, we will assume throughout the course that S and A_x are finite. Most results extend to the case of countably or uncountably infinite state and action spaces under certain technical assumptions.
2 Optimality Criteria
In the previous section we described Markov decision processes, and introduced the notion that decisions
2. Average cost:

    lim sup_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} g_{a_t}(x_t) | x_0 = x ]   (3)
3. Infinite-horizon discounted cost:

    E[ Σ_{t=0}^{∞} α^t g_{a_t}(x_t) | x_0 = x ],   (4)
where α ∈ (0, 1) is a discount factor expressing temporal preferences. The presence of a discount factor is most intuitive in problems involving cash flows, where the value of the same nominal amount of money at a later time stage is not the same as its value at an earlier time stage, since money at the earlier stage can be invested at a risk-free interest rate and is therefore equivalent to a larger nominal amount at a later stage. However, discounted costs also offer good approximations to the other optimality criteria. In particular, it can be shown that, when the state and action spaces are finite, there is a large enough
A common choice for the state of this system is an 8-dimensional vector containing the queue lengths.
Since each server serves multiple queues, in each time step it is necessary to decide which queue each of
the different servers is going to serve. A decision of this type may be coded as an 8-dimensional vector a indicating which queues are being served, satisfying the constraint that no more than one queue associated with each server is being served, i.e., a_i ∈ {0, 1}, and a_1 + a_3 + a_8 ≤ 1, a_2 + a_6 ≤ 1, a_4 + a_5 + a_7 ≤ 1. We can impose additional constraints on the choices of a as desired, for instance considering only non-idling policies.
Policies are described by a mapping u returning an allocation of server effort a as a function of the system state x. We represent the evolution of the queue lengths in terms of transition probabilities: the conditional probabilities for the next state x(t+1) given that the current state is x(t) and the current action is a(t). For instance,

    Prob(x_1(t+1) = x_1(t)+1 | x(t), a(t)) = λ_1,
    Prob(x_3(t+1) = x_3(t)+1, x_2(t+1) = x_2(t)−1 | x(t), a(t)) = μ_2 I(x_2(t) > 0, a_2(t) = 1),
    Prob(x_3(t+1) = x_3(t)−1 | x(t), a(t)) = μ_3 I(x_3(t) > 0, a_3(t) = 1),

corresponding to an arrival to queue 1, a departure from queue 2 and an arrival to queue 3, and a departure from queue 3. I(·) is the indicator function. Transition probabilities related to other events are defined similarly.
We may consider costs of the form g(x) = Σ_i x_i, the total number of unfinished units in the system. For instance, this is a reasonably common choice of cost for manufacturing systems, which are often modelled as queueing networks.
Tetris

Tetris is a computer game whose essential rule is to fit together a sequence of geometrically different pieces, which fall from the top of the screen stochastically, so as to complete contiguous rows of blocks. Pieces arrive sequentially and the geometric shapes of the pieces are independently distributed. A falling piece can be rotated and moved horizontally into a desired position. Note that the rotation and movement of a falling piece must be scheduled and executed before it reaches the remaining pile of pieces at the bottom of the screen. Once a piece reaches the remaining pile, the piece must rest there and cannot be rotated or moved.
To put the Tetris game into the framework of Markov decision processes, one could define the state to correspond to the current configuration and the current falling piece. The decision in each time stage is where to place the current falling piece. Transitions to the next board configuration follow deterministically from the current state and action; transitions to the next falling piece are given by its distribution, which could be, for instance, uniform over all piece types. Finally, we associate a reward with each state-action pair, corresponding to the points achieved by the number of rows eliminated.
by

    x_{t+1} = Σ_{i=1}^{n} a_t^i e_t^i x_t.

Therefore, transition probabilities can be derived from the distribution of the rate of return of each risky asset. We associate with each state-action pair (x, a) a reward g_a(x) = x(1 − Σ_{i=1}^{n} a_i), corresponding to the amount of wealth consumed.
4 Solving Finite-Horizon Problems
Finding a policy that minimizes the finite-horizon cost corresponds to solving the following optimization problem:

    min_{u(·,·)} E[ Σ_{t=0}^{T−1} g_{u(x_t,t)}(x_t) | x_0 = x ]   (5)
A naive approach to solving (5) is to enumerate all possible policies u(x, t), evaluate the corresponding expected cost, and choose the policy that minimizes it. However, note that the number of policies grows exponentially in the number of states and time stages. A central idea in dynamic programming is that the computation required to find an optimal policy can be greatly reduced by noting that (5) can be rewritten as follows:

    min_{a∈A_x} [ g_a(x) + Σ_{y∈S} P_a(x, y) min_{u(·,·)} E[ Σ_{t=1}^{T−1} g_{u(x_t,t)}(x_t) | x_1 = y ] ].   (6)
Define J*(x, t_0) as follows:

    J*(x, t_0) = min_{u(·,·)} E[ Σ_{t=t_0}^{T−1} g_{u(x_t,t)}(x_t) | x_{t_0} = x ].
It is clear from (6) that, if we know J*(·, t_0 + 1), we can easily find J*(x, t_0) by solving

    J*(x, t_0) = min_{a∈A_x} [ g_a(x) + Σ_{y∈S} P_a(x, y) J*(y, t_0 + 1) ].   (7)

Moreover, (6) suggests that an optimal action at state x and time t_0 is simply one that minimizes the right-hand side in (7). It is easy to verify that this is the case by using backwards induction. We call J*(x, t) the cost-to-go function. It can be found recursively by noting that

    J*(x, T − 1) = min_a g_a(x)

and J*(x, t), t = 0, . . . , T − 2, can be computed via (7).
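The backward recursion above can be sketched directly in code. The two-state MDP below is a toy example of our own (the function and variable names are not from the notes):

```python
# Backward-induction sketch of recursion (7) with the terminal condition
# J(x, T-1) = min_a g_a(x).  The small MDP here is our own toy example.

def backward_induction(S, A, g, P, T):
    """Return cost-to-go J[t][x] and an optimal policy u[t][x] for horizon T."""
    J = [dict() for _ in range(T)]
    u = [dict() for _ in range(T)]
    for x in S:                                # terminal stage
        u[T - 1][x] = min(A[x], key=lambda a: g[a][x])
        J[T - 1][x] = g[u[T - 1][x]][x]
    for t in range(T - 2, -1, -1):             # J(x,t) = min_a g_a(x) + sum_y P_a(x,y) J(y,t+1)
        for x in S:
            q = {a: g[a][x] + sum(P[a][x][y] * J[t + 1][y] for y in S) for a in A[x]}
            u[t][x] = min(q, key=q.get)
            J[t][x] = q[u[t][x]]
    return J, u

S = [0, 1]
A = {0: ['stay', 'go'], 1: ['stay', 'go']}
g = {'stay': {0: 1.0, 1: 0.0}, 'go': {0: 2.0, 1: 0.5}}
P = {'stay': {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
     'go':   {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}}}

J, u = backward_induction(S, A, g, P, T=3)
# With horizon 3, the optimal first action at state 0 is 'go': pay 2 now to
# reach the zero-cost state 1, rather than pay 1 per stage by staying.
```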
Note that finding J*(x, t) for all x ∈ S and t = 0, . . . , T − 1 requires a number of computations that
1. Find (somehow), for every x and t_0,

    J*(x, t_0) = min_{u(·,·)} E[ Σ_{t=t_0}^{∞} α^{t−t_0} g_{u(x_t,t)}(x_t) | x_{t_0} = x ]   (8)

2. The optimal action for state x at time t_0 is given by

    u*(x, t_0) = argmin_{a∈A_x} [ g_a(x) + α Σ_{y∈S} P_a(x, y) J*(y, t_0 + 1) ].   (9)
We may also conjecture that, as in the finite-horizon case, J*(x, t) satisfies a recursive relation of the form

    J*(x, t) = min_{a∈A_x} [ g_a(x) + α Σ_{y∈S} P_a(x, y) J*(y, t + 1) ].
The first thing to note in the infinite-horizon case is that, based on expression (8), we have J*(x, t) = J*(x, t') = J*(x) for all t and t'. Indeed, note that, for every u,

    E[ Σ_{t=t_0}^{∞} α^{t−t_0} g_{u(x_t,t)}(x_t) | x_{t_0} = x ]
        = Σ_{t=t_0}^{∞} Σ_y α^{t−t_0} Prob_u(x_t = y | x_{t_0} = x) g_{u(y)}(y)
        = Σ_{t=t_0}^{∞} Σ_y α^{t−t_0} Prob_u(x_{t−t_0} = y | x_0 = x) g_{u(y)}(y)
        = Σ_{t=0}^{∞} Σ_y α^t Prob_u(x_t = y | x_0 = x) g_{u(y)}(y).
Intuitively, since transition probabilities P_u(x, y) do not depend on time, infinite-horizon problems look the same regardless of the value of the initial time stage t, as long as the initial state is the same.

Note also that, since J*(x, t) = J*(x), we can also infer from (9) that the optimal policy u*(x, t) does not depend on the current stage t, so that u*(x, t) = u*(x) for some function u*(·). We call policies that do not depend on the time stage stationary. Finally, J* must satisfy the following equation:

    J*(x) = min_{a∈A_x} [ g_a(x) + α Σ_{y∈S} P_a(x, y) J*(y) ].

This is called Bellman's equation. We will show in the next lecture that the cost-to-go function is the unique solution of Bellman's equation and the stationary policy u* is optimal.
Proof First, we have

    J = J̄ + (J − J̄) ≤ J̄ + ||J − J̄|| e.

We now have

    T J − T J̄ ≤ T(J̄ + ||J − J̄|| e) − T J̄ = T J̄ + α ||J − J̄|| e − T J̄ = α ||J − J̄|| e.

The first inequality follows from monotonicity and the second equality from the offset property of T. Since J and J̄ are arbitrary, we conclude by the same reasoning that T J̄ − T J ≤ α ||J − J̄|| e. The lemma follows.
2.997 Decision-Making in Large-Scale Systems February 9
MIT, Spring 2004 Handout #2
Lecture Note 2
1 Summary: Markov Decision Processes

Markov decision processes can be characterized by (S, A, g(·), P(·, ·)), where

    S denotes a finite set of states
    A_x denotes a finite set of actions for state x ∈ S
    g_a(x) denotes the finite time-stage cost for action a ∈ A_x and state x ∈ S
    P_a(x, y) denotes the transition probability when the taken action is a ∈ A_x, the current state is x, and the next state is y

Let u(x, t) denote the policy for state x at time t and, similarly, let u(x) denote the stationary policy for state x. Taking the stationary policy u(x) into consideration, we introduce the following notation

    g_u(x) ≡ g_{u(x)}(x)
    P_u(x, y) ≡ P_{u(x)}(x, y)

to represent the cost function and transition probabilities under policy u(x).
2 Cost-to-go Function and Bellman's Equation

In the previous lecture, we defined the discounted-cost, infinite-horizon cost-to-go function as

    J*(x) = min_u E[ Σ_{t=0}^{∞} α^t g_u(x_t) | x_0 = x ].

We also conjectured that J* should satisfy Bellman's equation

    J*(x) = min_a [ g_a(x) + α Σ_{y∈S} P_a(x, y) J*(y) ],

or, using the operator notation introduced in the previous lecture,

    J* = T J*.
3 Value Iteration

The value iteration algorithm goes as follows:

1. Initialize with some J_0; set k = 0
2. J_{k+1} = T J_k; set k = k + 1
3. Go back to 2
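The three-step loop above can be sketched as follows, with a sup-norm stopping rule added so the loop terminates. The single-state MDP used to exercise it is our own toy example:

```python
# Direct implementation of the value iteration loop, stopping when the
# sup-norm update ||J_{k+1} - J_k|| falls below a tolerance.

def bellman_update(J, S, A, g, P, alpha):
    """One application of T: (TJ)(x) = min_a g_a(x) + alpha * sum_y P_a(x,y) J(y)."""
    return {x: min(g[a][x] + alpha * sum(P[a][x][y] * J[y] for y in S) for a in A[x])
            for x in S}

def value_iteration(S, A, g, P, alpha, tol=1e-10):
    J = {x: 0.0 for x in S}                        # step 1: J_0 = 0
    while True:
        Jn = bellman_update(J, S, A, g, P, alpha)  # step 2: J_{k+1} = T J_k
        if max(abs(Jn[x] - J[x]) for x in S) < tol:
            return Jn
        J = Jn                                     # step 3: go back to 2

# Toy check (our own example): a single state with per-stage cost 1 and
# discount alpha = 0.9 has J* = 1 / (1 - alpha) = 10.
S = [0]; A = {0: ['a']}; g = {'a': {0: 1.0}}; P = {'a': {0: {0: 1.0}}}
J = value_iteration(S, A, g, P, alpha=0.9)
```

The geometric convergence rate proved in Theorem 1 below means the number of iterations needed grows like log(1/tol) / log(1/α).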
Theorem 1

    lim_{k→∞} J_k = J*.
Proof Since J_0(·) and g_·(·) are finite, there exists a real number M satisfying |J_0(x)| ≤ M and |g_a(x)| ≤ M for all a ∈ A_x and x ∈ S. Then we have, for every integer K ≥ 1 and real number α ∈ (0, 1),

    J_K(x) = (T^K J_0)(x) = min_u E[ Σ_{t=0}^{K−1} α^t g_u(x_t) + α^K J_0(x_K) | x_0 = x ]
           ≤ min_u E[ Σ_{t=0}^{K−1} α^t g_u(x_t) | x_0 = x ] + α^K M.

From

    J*(x) = min_u E[ Σ_{t=0}^{K−1} α^t g_u(x_t) + Σ_{t=K}^{∞} α^t g_u(x_t) | x_0 = x ],

we have

    |(T^K J_0)(x) − J*(x)|
        = | min_u E[ Σ_{t=0}^{K−1} α^t g_u(x_t) + α^K J_0(x_K) | x_0 = x ]
            − min_u E[ Σ_{t=0}^{K−1} α^t g_u(x_t) + Σ_{t=K}^{∞} α^t g_u(x_t) | x_0 = x ] |
        ≤ max_u E[ α^K |J_0(x_K)| + Σ_{t=K}^{∞} α^t |g_u(x_t)| | x_0 = x ]
        ≤ α^K M + α^K M / (1 − α) → 0  as K → ∞.
Theorem 2 J* is the unique solution of Bellman's equation.

Proof We first show that J* = T J*. By the contraction principle,

    ||T^{k+1} J_0 − T^k J_0|| = ||T(T^k J_0) − T(T^{k−1} J_0)|| ≤ α ||T^k J_0 − T^{k−1} J_0|| ≤ α^k ||T J_0 − J_0|| → 0  as k → ∞.

Since for all k we have ||J* − T J*|| ≤ ||J* − T^{k+1} J_0|| + ||T^{k+1} J_0 − T^k J_0|| + ||T^k J_0 − T J*||, and every term on the right-hand side vanishes as k → ∞, we conclude that J* = T J*. We next show that J* is the unique solution to J = T J. Suppose that J_1 = T J_1 and J_2 = T J_2. Then

    ||J_1 − J_2|| = ||T J_1 − T J_2|| ≤ α ||J_1 − J_2||,

which implies J_1 = J_2.

It can be advantageous to update states one at a time, sweeping through the state space in a fixed order 1, 2, . . . , |S| and using, for each state x, the values already updated in the current sweep for the states y < x. We hence define the operator F as follows:

    (F J)(x) = min_{a∈A_x} [ g_a(x) + α Σ_{y<x} P_a(x, y)(F J)(y) + α Σ_{y≥x} P_a(x, y) J(y) ]   (1)
Lemma 1

    ||F J − F J̄|| ≤ α ||J − J̄||

Proof By the definition of F, we consider the case x = 1:

    |(F J)(1) − (F J̄)(1)| = |(T J)(1) − (T J̄)(1)| ≤ α ||J − J̄||.

For the case x = 2, by the definition of F, we have

    |(F J)(2) − (F J̄)(2)| ≤ α max{ |(F J)(1) − (F J̄)(1)|, |J(2) − J̄(2)|, . . . , |J(|S|) − J̄(|S|)| } ≤ α ||J − J̄||.

Repeating the same reasoning for x = 3, . . . , we can show by induction that |(F J)(x) − (F J̄)(x)| ≤ α ||J − J̄|| for all x ∈ S. Hence, we conclude ||F J − F J̄|| ≤ α ||J − J̄||.
Theorem 3 F has the unique fixed point J*.

Proof By the definition of the operator F and Bellman's equation J* = T J*, we have F J* = J*, so J* is a fixed point of F. Convergence of the iterates of F follows from the previous lemma. Uniqueness of the fixed point J* holds by the contraction property of F.
2.997 Decision-Making in Large-Scale Systems February 11
MIT, Spring 2004 Handout #4
Lecture Note 3
1 Value Iteration

Using value iteration, starting at an arbitrary J_0, we generate a sequence {J_k} by

    J_{k+1} = T J_k,  integer k ≥ 0.

We have shown that the sequence J_k → J* as k → ∞, and derived the error bounds

    ||J_k − J*|| ≤ α^k ||J_0 − J*||.

Recall that the greedy policy u_J with respect to value J is defined by T J = T_{u_J} J. We also denote by u_k = u_{J_k} the greedy policy with respect to value J_k. Then we have the following lemma.

Lemma 1 Given α ∈ (0, 1),

    ||J_{u_k} − J_k|| ≤ 1/(1 − α) ||T J_k − J_k||.
Proof:

    J_{u_k} − J_k = (I − α P_{u_k})^{−1} g_{u_k} − J_k
                  = (I − α P_{u_k})^{−1} (g_{u_k} + α P_{u_k} J_k − J_k)
                  = (I − α P_{u_k})^{−1} (T J_k − J_k)
                  = Σ_{t=0}^{∞} α^t P_{u_k}^t (T J_k − J_k)
                  ≤ Σ_{t=0}^{∞} α^t P_{u_k}^t e ||T J_k − J_k||
                  = Σ_{t=0}^{∞} α^t e ||T J_k − J_k||
                  = e/(1 − α) ||T J_k − J_k||,

where I is an identity matrix, and e is a vector of unit elements with appropriate dimension. The third equality comes from T J_k = g_{u_k} + α P_{u_k} J_k, i.e., u_k is the greedy policy w.r.t. J_k, and the fourth equality holds because (I − α P_{u_k})^{−1} = Σ_{t=0}^{∞} α^t P_{u_k}^t. By switching J_{u_k} and J_k, we can obtain J_k − J_{u_k} ≤ e/(1 − α) ||T J_k − J_k||, and hence conclude the lemma.
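The proof rests on exact policy evaluation, J_u = (I − αP_u)^{−1} g_u, which can be computed by solving a linear system. A small NumPy sketch (the two-state chain is our own example) that also extracts the greedy policy:

```python
# Exact policy evaluation J_u = (I - alpha P_u)^{-1} g_u via a linear solve,
# plus greedy-policy extraction.  The two-state MDP is our own toy example.
import numpy as np

def evaluate_policy(P_u, g_u, alpha):
    """Solve (I - alpha P_u) J = g_u for J_u."""
    n = len(g_u)
    return np.linalg.solve(np.eye(n) - alpha * P_u, g_u)

def greedy_policy(J, P, g, alpha):
    """u_J(x) = argmin_a g_a(x) + alpha * sum_y P_a(x,y) J(y)."""
    Q = g + alpha * P @ J          # Q[a, x]: one-step cost of action a at state x
    return np.argmin(Q, axis=0)

P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay put
              [[0.0, 1.0], [1.0, 0.0]]])   # action 1: switch states
g = np.array([[2.0, 0.0],                  # g[a, x]
              [1.0, 1.0]])
alpha = 0.9

J_u = evaluate_policy(P[0], g[0], alpha)   # evaluate the "always stay" policy
u = greedy_policy(J_u, P, g, alpha)
# J_u = [20, 0]; the greedy policy w.r.t. J_u switches out of the costly state 0.
```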
Theorem 1

    ||J_{u_k} − J*|| ≤ 2/(1 − α) ||J_k − J*||

Proof:

    ||J_{u_k} − J*|| = ||J_{u_k} − J_k + J_k − J*||
                     ≤ ||J_{u_k} − J_k|| + ||J_k − J*||
                     ≤ 1/(1 − α) ||T J_k − J_k|| + ||J_k − J*||
                     ≤ 1/(1 − α) (||T J_k − J*|| + ||J* − J_k||) + ||J_k − J*||
                     ≤ 1/(1 − α) (α ||J_k − J*|| + ||J_k − J*||) + ||J_k − J*||
                     = 2/(1 − α) ||J_k − J*||.

The second inequality comes from Lemma 1 and the third inequality holds by the contraction principle. □
2 Optimality of Stationary Policy

Before proving the main theorem of this section, we introduce the following useful lemma.

Lemma 2 If J ≥ T J, then J ≥ J*. If J ≤ T J, then J ≤ J*.

Proof: Suppose that J ≥ T J. Applying the operator T on both sides repeatedly k − 1 times and using the monotonicity property of T, we have

    J ≥ T J ≥ T^2 J ≥ · · · ≥ T^k J.

For sufficiently large k, T^k J approaches J*. We hence conclude J ≥ J*. The other statement follows the same argument. □
We show the optimality of the stationary policy by the following theorem.

Theorem 2 Let u = (u_1, u_2, . . .) be any policy and let u* = u_{J*} be greedy with respect to J*. Then

    J_{u*} ≤ J_u  and  J_{u*} = J*.

Moreover, let ū be any stationary policy such that T_ū J* ≠ T J*. Then J_ū(x) > J*(x) for at least one state x ∈ S.
Then

    ||J_k^u − J_u|| ≤ M (1 + 1/(1 − α)) α^k → 0  as k → ∞.

If u = (ū, ū, . . .) is stationary, then

    ||J_k^ū − J_ū|| → 0  as k → ∞.

Thus, we have J_k^{u*} = T_{u*}^k J* = T_{u*}^{k−1}(T J*) = T_{u*}^{k−1} J* = · · · = J*. Therefore J_{u*} = J*. For any other policy u = (u_1, u_2, . . .), for all k,

    J_u ≥ T_{u_1} · · · T_{u_k} J* − α^k M (1 + 1/(1 − α)) e
        ≥ T_{u_1} · · · T_{u_{k−1}} (T J*) − α^k M (1 + 1/(1 − α)) e
        = T_{u_1} · · · T_{u_{k−1}} J* − α^k M (1 + 1/(1 − α)) e
        ≥ · · · ≥ J* − α^k M (1 + 1/(1 − α)) e.

Therefore J_u ≥ J*.

Take a stationary policy ū such that T_ū J* ≠ T J*, i.e., T_ū J* ≥ T J*, with at least one state x ∈ S such that (T_ū J*)(x) > (T J*)(x). Observe

    J* = T J* ≤ T_ū J*.

Applying T_ū on both sides and using the monotonicity property of T, or applying Lemma 2,

    J* ≤ T_ū J* ≤ T_ū^2 J* ≤ · · · ≤ T_ū^k J* ≤ · · · ≤ J_ū,

and J*(x) < (T_ū J*)(x) ≤ J_ū(x) for that state x.
Proof: If u_k is optimal, then we are done. Now suppose that u_k is not optimal. Then

    T J_{u_k} ≤ T_{u_k} J_{u_k} = J_{u_k},

with strict inequality for at least one state x. Since T_{u_{k+1}} J_{u_k} = T J_{u_k} and J_{u_k} = T_{u_k} J_{u_k}, we have

    J_{u_k} = T_{u_k} J_{u_k} ≥ T J_{u_k} = T_{u_{k+1}} J_{u_k} ≥ · · · ≥ T_{u_{k+1}}^n J_{u_k} → J_{u_{k+1}}  as n → ∞.

Therefore, policy u_{k+1} is an improvement over policy u_k. □
In step 2, we solve J_{u_k} = g_{u_k} + α P_{u_k} J_{u_k}, which would require a significant amount of computation. We thus introduce another algorithm which requires fewer computations in step 2.
3.1 Asynchronous Policy Iteration

The algorithm goes as follows.

1. Start with a policy u_0 and a cost-to-go function J_0; set k = 0
2. For some subset S_k ⊂ S, do one of the following:
   (i) value update: J_{k+1}(x) = (T_{u_k} J_k)(x), x ∈ S_k
   (ii) policy update: u_{k+1}(x) = u_{J_k}(x), x ∈ S_k
   (states x ∉ S_k keep J_{k+1}(x) = J_k(x) and u_{k+1}(x) = u_k(x))
3. k = k + 1; go back to step 2
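The loop above can be sketched on a toy problem of our own. Here the subsets S_k cycle through single states, alternating a value update and a policy update at each state so that every state receives both kinds of update infinitely often:

```python
# Sketch of asynchronous policy iteration (our own toy two-state problem).
# J_0 is chosen large enough that T_{u_0} J_0 <= J_0 holds.
import numpy as np

def async_policy_iteration(P, g, alpha, n_iter=1000):
    n_states = g.shape[1]
    u = np.zeros(n_states, dtype=int)             # initial policy u_0
    J = np.full(n_states, g.max() / (1 - alpha))  # guarantees T_{u_0} J_0 <= J_0
    for k in range(n_iter):
        x = (k // 2) % n_states                   # subset S_k = {x}
        if k % 2 == 0:                            # (i) value update at x
            J[x] = g[u[x], x] + alpha * P[u[x], x] @ J
        else:                                     # (ii) policy update at x
            u[x] = np.argmin(g[:, x] + alpha * P[:, x] @ J)
    return J, u

P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])  # action 1: switch states
g = np.array([[2.0, 0.0],                 # g[a, x]
              [1.0, 1.0]])
J, u = async_policy_iteration(P, g, alpha=0.9)
# The iterates decrease monotonically toward J* = [1, 0].
```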
Theorem 4 If T_{u_0} J_0 ≤ J_0 and infinitely many value and policy updates are performed on each state, then

    lim_{k→∞} J_k = J*.
Proof: We prove this theorem in two steps. First, we will show that

    J* ≤ J_{k+1} ≤ J_k,  for all k.

This implies that J_k is a nonincreasing sequence. Since J_k is lower bounded by J*, J_k will converge to some value, i.e., J_k → J̄ as k → ∞. Next, we will show that J_k converges to J*, i.e., J̄ = J*.
Lemma 3 If T_{u_0} J_0 ≤ J_0, the sequence J_k generated by asynchronous policy iteration converges.

Proof: We start by showing that, if T_{u_k} J_k ≤ J_k, then T_{u_{k+1}} J_{k+1} ≤ J_{k+1} ≤ J_k. Suppose we have a value update. Then
Now suppose that we have a policy update. Then J_{k+1} = J_k. Moreover, for x ∈ S_k, we have

    (T_{u_{k+1}} J_{k+1})(x) = (T_{u_{k+1}} J_k)(x)
                             = (T J_k)(x)
                             ≤ (T_{u_k} J_k)(x)
                             ≤ J_k(x)
                             = J_{k+1}(x).

The first equality follows from J_k = J_{k+1}, the second equality and first inequality follow from the fact that u_{k+1}(x) is greedy with respect to J_k for x ∈ S_k, the second inequality follows from the induction hypothesis, and the third equality follows from J_k = J_{k+1}. For x ∉ S_k, we have

    (T_{u_{k+1}} J_{k+1})(x) = (T_{u_k} J_k)(x)
                             ≤ J_k(x)
                             = J_{k+1}(x).

The equalities follow from J_k = J_{k+1} and u_{k+1}(x) = u_k(x) for x ∉ S_k, and the inequality follows from the induction hypothesis.

Since by hypothesis T_{u_0} J_0 ≤ J_0, we conclude that J_k is a nonincreasing sequence. Moreover, we have T_{u_k} J_k ≤ J_k, hence J_k ≥ J_{u_k} ≥ J*, so that J_k is bounded below. It follows that J_k converges to some limit J̄. □
Lemma 4 Suppose that J_k → J̄, where J_k is generated by asynchronous policy iteration, and suppose that there are infinitely many value and policy updates at each state. Then J̄ = J*.

Proof: First note that, since T J_k ≤ J_k, by continuity of the operator T, we must have T J̄ ≤ J̄. Now suppose that (T J̄)(x) < J̄(x) for some state x. Then, by continuity, there is an iteration index k̄ such that (T J_k)(x) < J̄(x) for all k ≥ k̄. Let k'' > k' > k̄ correspond to iterations of the asynchronous policy iteration algorithm such that there is a policy update at state x at iteration k', a value update at state x at iteration k'', and no updates at state x in the iterations between k' and k''.
We have concluded that J_{k''+1}(x) < J̄(x). However, since by hypothesis J_k ≥ J̄ for all k, we have a contradiction, and it must follow that T J̄ = J̄, so that J̄ = J*. □
2.997 Decision-Making in Large-Scale Systems February 17
MIT, Spring 2004 Handout #6
Lecture Note 4
Average-cost Problems

In average-cost problems, we aim at finding a policy u which minimizes

    J_u(x) = lim sup_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} g_u(x_t) | x_0 = x ].   (1)

Since the state space is finite, it can be shown that the limsup can actually be replaced with lim for any stationary policy. In the previous lectures, we first found the cost-to-go functions J*(x) (for discounted problems) or J*(x, t) (for finite-horizon problems) and then found the optimal policy through the cost-to-go functions. However, in the average-cost problem, J_u(x) does not offer enough information for an optimal policy to be found; in particular, in most cases of interest we will have J_u(x) = λ_u for some scalar λ_u, for all x, so that it does not allow us to distinguish the value of being in each state.
We will start by deriving some intuition based on finite-horizon problems. Consider a set of states S = {x_1, x_2, . . . , x, . . . , x_n}. The states are visited in a sequence with some initial state x̄, say

    x̄, . . . , x, . . . , x, . . . , x, . . .

Let T_i(x), i = 1, 2, . . . , be the stages corresponding to the ith visit to state x, starting at state x̄. Let

    λ_i^u(x) = E[ Σ_{t=T_i(x)}^{T_{i+1}(x)−1} g_u(x_t) / (T_{i+1}(x) − T_i(x)) ].

Intuitively, we must have λ_i^u(x) = λ_j^u(x), and λ_i^u(x) must be independent of the initial state x̄, since we have the same transition probabilities whenever we start a new trajectory in state x. Going back to the definition of the function

    J(x, T) = min_u E[ Σ_{t=0}^{T} g_u(x_t) | x_0 = x ],

we conjecture that the function can be approximated as follows:

    J(x, T) ≈ λ(x) T + h(x) + o(T),  as T → ∞.   (2)

Note that, since λ(x) is independent of the initial state, we can rewrite the approximation as

    J(x, T) ≈ λ T + h(x) + o(T),  as T → ∞.   (3)
We can now speculate about a version of Bellman's equation for computing λ and h. Approximating J(x, T) as in (3), we have

    J(x, T + 1) = min_a [ g_a(x) + Σ_y P_a(x, y) J(y, T) ]
    λ(T + 1) + h(x) + o(T) = min_a [ g_a(x) + Σ_y P_a(x, y) (λ T + h(y) + o(T)) ].

Therefore, we have

    λ + h(x) = min_a [ g_a(x) + Σ_y P_a(x, y) h(y) ].   (4)
As we did in the cost-to-go context, we set

    T_u h = g_u + P_u h  and  T h = min_u T_u h.

Then we have:

Lemma 1 (Monotonicity) Let h ≥ h̄ be arbitrary. Then T h ≥ T h̄ (and T_u h ≥ T_u h̄).

Lemma 2 (Offset) For all h and k ∈ R, we have T(h + ke) = T h + ke.

Notice that the contraction principle does not hold for T h = min_u T_u h.
Bellman's Equation

From the discussion above, we can write Bellman's equation

    λ e + h = T h.   (5)

Before examining the existence of solutions to Bellman's equation, we show that a solution of Bellman's equation yields the optimal policy, by the following theorem.
Theorem 1 Suppose that λ and h satisfy Bellman's equation. Let u* be greedy with respect to h, i.e., T h = T_{u*} h. Then

    J_{u*}(x) = λ,  for all x,

and

    λ ≤ J_u(x),  for all x and all policies u.

Proof: Let u = (u_1, u_2, . . .), and let N be arbitrary.
Then

    T_{u_1} T_{u_2} · · · T_{u_N} h ≥ N λ e + h,

and since (T_{u_1} · · · T_{u_N} h)(x) = E[ Σ_{t=0}^{N−1} g_u(x_t) + h(x_N) | x_0 = x ], we have

    E[ Σ_{t=0}^{N−1} g_u(x_t) + h(x_N) | x_0 = x ] ≥ N λ + h(x).

Dividing both sides by N and taking the limit as N approaches infinity, we have

    J_u ≥ λ e.

Taking u = (u*, u*, u*, . . .), all the inequalities above become equalities. Thus

    λ e = J_{u*}.
This theorem says that, if Bellman's equation has a solution, then we can get an optimal policy from it. Note that, if (λ, h) is a solution to Bellman's equation, then (λ, h + ke) is also a solution, for all scalars k. Hence, if Bellman's equation (5) has a solution, then it has infinitely many solutions. However, unlike the case of discounted-cost and finite-horizon problems, the average-cost Bellman's equation does not necessarily have a solution. In particular, the previous theorem implies that, if a solution exists, then the average cost J_{u*}(x) is the same for all initial states. It is easy to come up with examples where this is not the case. For instance, consider the case when the transition probability matrix is the identity, i.e., each state visits itself every time, and each state incurs a different cost g(·). Then the average cost depends on the initial state, and Bellman's equation cannot have a solution. Hence, Bellman's equation does not always hold.
2.997 Decision-Making in Large-Scale Systems February 18
MIT, Spring 2004 Handout #7
Lecture Note 5
Relationship between Discounted and Average-Cost Problems

In this lecture, we will show that optimal policies for discounted-cost problems with a large enough discount factor are also optimal for average-cost problems. The analysis will also show that, if the optimal average cost is the same for all initial states, then the average-cost Bellman's equation has a solution.
Note that the optimal average cost is independent of the initial state. Recall that

    J_u(x) = lim sup_{N→∞} (1/N) E[ Σ_{t=0}^{N−1} g_u(x_t) | x_0 = x ]

or, equivalently,

    J_u = lim_{N→∞} (1/N) Σ_{t=0}^{N−1} P_u^t g_u.

We also let J_{u,α} denote the discounted cost-to-go function associated with policy u when the discount factor is α, i.e.,

    J_{u,α} = Σ_{t=0}^{∞} α^t P_u^t g_u = (I − α P_u)^{−1} g_u.
The following theorem formalizes the relationship between the discounted cost-to-go function and the average cost.

Theorem 1 For every stationary policy u, there is h_u such that

    J_{u,α} = (1/(1 − α)) J_u + h_u + O(|1 − α|).   (1)

Theorem 1 follows easily from the following proposition.
Proposition 1 For all stationary policies u, we have

    (I − α P_u)^{−1} = (1/(1 − α)) P̄_u + H_u + O(|1 − α|),   (2)

where

    P̄_u = lim_{N→∞} (1/N) Σ_{t=0}^{N−1} P_u^t.   (3)
Proof: Let M_α = (1 − α)(I − α P_u)^{−1}. Then, since

    M_α(x, y) = (1 − α) Σ_{t=0}^{∞} α^t P_u^t(x, y) ≤ (1 − α) Σ_{t=0}^{∞} α^t · 1 = 1,

M_α(x, y) is of the form

    M_α(x, y) = p(α) / q(α),

where p(α) and q(α) are polynomials such that q(1) ≠ 0. We conclude that the limit lim_{α→1} M_α exists. Let P̄_u = lim_{α→1} M_α. We can do a Taylor expansion of M_α around α = 1, so that

    M_α = P̄_u + (1 − α) H_u + O((1 − α)^2),

where H_u = −dM_α/dα |_{α=1}. Therefore

    (I − α P_u)^{−1} = (1/(1 − α)) P̄_u + H_u + O(|1 − α|)

for some P̄_u and H_u.

Next, observe that

    (1 − α)(I − α P_u)(I − α P_u)^{−1} = (1 − α) I

for all α. Taking the limit as α → 1 yields

    (I − P_u) P̄_u = 0,

so that P̄_u = P_u P̄_u. We can use the same reasoning to conclude that P̄_u = P̄_u P_u. We also have

    (I − α P_u) P̄_u = (1 − α) P̄_u,

hence for every α we have

    P̄_u = (1 − α)(I − α P_u)^{−1} P̄_u = M_α P̄_u,

and taking the limit as α → 1 yields P̄_u P̄_u = P̄_u.

We now show that, for every t ≥ 1, P_u^t − P̄_u = (P_u − P̄_u)^t. For t = 1, it is trivial. Suppose that the result holds up to n − 1, i.e., P_u^{n−1} − P̄_u = (P_u − P̄_u)^{n−1}. Then

    (P_u − P̄_u)^n = (P_u − P̄_u)(P_u^{n−1} − P̄_u) = P_u^n − P_u P̄_u − P̄_u P_u^{n−1} + P̄_u P̄_u = P_u^n − P̄_u.

By induction, we have P_u^t − P̄_u = (P_u − P̄_u)^t.

Now note that

    H_u = lim_{α→1} (M_α − P̄_u)/(1 − α) = lim_{α→1} Σ_{t=0}^{∞} α^t (P_u^t − P̄_u).
Hence H_u = (I − P_u + P̄_u)^{−1} − P̄_u.

We now show P̄_u H_u = 0. Observe

    P̄_u H_u = P̄_u (I − P_u + P̄_u)^{−1} − P̄_u P̄_u
             = P̄_u Σ_{t=0}^{∞} (P_u − P̄_u)^t − P̄_u
             = P̄_u − P̄_u = 0,

since P̄_u (P_u − P̄_u)^t = 0 for every t ≥ 1. Therefore, P̄_u H_u = 0.

Observe that (I − P_u + P̄_u)(P̄_u + H_u) = I. Since P̄_u H_u = 0, expanding this product yields

    P̄_u + H_u = I + P_u H_u.
By multiplying P̄_u + H_u = I + P_u H_u by P_u^k and using P_u^k P̄_u = P̄_u, we have

    P̄_u + P_u^k H_u = P_u^k + P_u^{k+1} H_u,  for all k.

Summing from k = 0 to k = N − 1, we have

    N P̄_u + Σ_{k=0}^{N−1} P_u^k H_u = Σ_{k=0}^{N−1} P_u^k + Σ_{k=1}^{N} P_u^k H_u,

or, equivalently,

    N P̄_u = Σ_{k=0}^{N−1} P_u^k + (P_u^N − I) H_u.

Dividing both sides by N and letting N → ∞, we have

    lim_{N→∞} (1/N) Σ_{k=0}^{N−1} P_u^k = P̄_u,

which is (3).
Since P̄_u = P̄_u P_u and P̄_u itself is a stochastic matrix, the rows of P̄_u have a special meaning. Let π_u denote a row of P̄_u. Then π_u = π_u P_u, i.e., π_u(x) = Σ_y π_u(y) P_u(y, x), which is the probability that x_1 = x when x_0 is distributed according to π_u. We can conclude that any row of the matrix P̄_u is a stationary distribution for the Markov chain under the policy u. However, does this observation mean that all rows in P̄_u are identical?
Theorem 2

    J_{u,α} = (1/(1 − α)) J_u + h_u + O(|1 − α|)

Proof:

    J_{u,α} = (I − α P_u)^{−1} g_u
            = [ (1/(1 − α)) P̄_u + H_u + O(|1 − α|) ] g_u
            = (1/(1 − α)) P̄_u g_u + H_u g_u + O(|1 − α|).

Since J_u = P̄_u g_u, the result follows with h_u = H_u g_u. □
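Theorem 2 is easy to check numerically: as α → 1, (1 − α) J_{u,α} should approach the average cost P̄_u g_u, whose common value is π_u g_u. The two-state chain below is our own toy example:

```python
# Numerical sanity check of Theorem 2 on our own two-state chain:
# (1 - alpha) * J_{u,alpha} approaches the average cost pi_u @ g_u as alpha -> 1.
import numpy as np

P_u = np.array([[0.5, 0.5],
                [0.2, 0.8]])
g_u = np.array([1.0, 3.0])

# Stationary distribution pi_u: left eigenvector of P_u for eigenvalue 1.
w, v = np.linalg.eig(P_u.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()
avg_cost = pi @ g_u                  # the common value of (P̄_u g_u)(x)

alpha = 0.9999
J_alpha = np.linalg.solve(np.eye(2) - alpha * P_u, g_u)
# Both entries of (1 - alpha) * J_alpha are close to avg_cost, with the
# remaining gap of order (1 - alpha) coming from the h_u term.
```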
2 Blackwell Optimality

In this section, we will show that policies that are optimal for the discounted-cost criterion with large enough discount factors are also optimal for the average-cost criterion. Indeed, we can actually strengthen the notion of average-cost optimality and establish the existence of policies that are optimal for all large enough discount factors.

Definition 1 (Blackwell Optimality) A stationary policy u* is called Blackwell optimal if there exists ᾱ ∈ (0, 1) such that u* is optimal for all α ∈ [ᾱ, 1).
Theorem 3 There exists a stationary Blackwell optimal policy, and it is also optimal for the average-cost problem among all stationary policies.

Proof: Since there are only finitely many policies, we must have for each state x a policy ū_x such that

    J_{ū_x, α}(x) ≤ J_{u, α}(x)

for all u and all large enough α. If we take the policy ū to be given by ū(x) = ū_x(x), then ū must satisfy Bellman's equation

    J_{ū,α} = min_u { g_u + α P_u J_{ū,α} }

for all large enough α, and we conclude that ū is Blackwell optimal.

Now let u* be Blackwell optimal. Also suppose that u is optimal for the average-cost problem. Then

    (1/(1 − α)) J_{u*} + h_{u*} + O(|1 − α|) ≤ (1/(1 − α)) J_u + h_u + O(|1 − α|),  for all α ∈ [ᾱ, 1).

Taking the limit as α → 1, we conclude that

    J_{u*} ≤ J_u,

and u* must be optimal for the average-cost problem. □
Remark 1 It is actually possible to establish average-cost optimality of Blackwell optimal policies among the set of all policies, not only stationary ones.

Remark 2 An algorithm for computing Blackwell optimal policies involves lexicographic optimization of J_u, h_u and higher-order terms in the Taylor expansion of J_{u,α}.
Theorem 3 implies that, if the optimal average cost is the same regardless of the initial state, then the average-cost Bellman's equation has a solution.
Proof: We have, for all large enough α,

    J_{u*,α} = min_u { g_u + α P_u J_{u*,α} }.

Substituting J_{u*,α} = (λ/(1 − α)) e + h_{u*} + O((1 − α)^2), we get

    (λ/(1 − α)) e + h_{u*} + O((1 − α)^2) = min_u { g_u + α P_u [ (λ/(1 − α)) e + h_{u*} + O((1 − α)^2) ] }
                                          = min_u { g_u + (αλ/(1 − α)) e + α P_u h_{u*} + O((1 − α)^2) }.

Since (λ/(1 − α)) e − (αλ/(1 − α)) e = λ e, this yields

    λ e + h_{u*} + O((1 − α)^2) = min_u { g_u + α P_u h_{u*} + O((1 − α)^2) }.

Taking the limit as α → 1, we get

    λ e + h_{u*} = min_u { g_u + P_u h_{u*} } = T h_{u*}.  □
In the average-cost setting, existence of a solution to Bellman's equation actually depends on the structure of transition probabilities in the system. Some sufficient conditions for the optimal average cost to be the same regardless of the initial state are given below.

Definition 2 We say that two states x, y communicate under policy u if there are k, k̄ ∈ {1, 2, . . .} such that

    P_u^k(x, y) > 0,  P_u^{k̄}(y, x) > 0.

Definition 3 We say that a state x is recurrent under policy u if, conditioned on the fact that it is visited at least once, it is visited infinitely many times.

Definition 4 We say that a state x is transient under policy u if it is only visited finitely many times, regardless of the initial condition of the system.

Definition 5 We say that a policy u is unichain if all of its recurrent states communicate.

We state without proof the following theorem.

Theorem 4 Either of the following conditions is sufficient for the optimal average cost to be the same regardless of the initial state:

1. There exists a unichain optimal policy.
2. For every pair of states x and y, there is a policy u such that x and y communicate.
3 Value Iteration
One way to obtain this value is to calculate it for a finite but very large horizon to approximate the limit, expecting that such an approximation is accurate. Hence we consider

    T^k J_0 = min_u E[ Σ_{t=0}^{k−1} g_u(x_t) + J_0(x_k) ].

Recall that J(x, T) ≈ λ T + h(x). Choosing some reference state x̄, for all x we have

    J(x, T) − J(x̄, T) ≈ h(x) − h(x̄).

Then

    h_k(x) = J(x, k) − k λ

gives a sequence of estimates of h, for k = 1, 2, . . . .

Note that, since (λ, h + ke) is a solution to Bellman's equation for all k whenever (λ, h) is a solution, we can choose the value of h at a single state arbitrarily. Letting h(x̄) = 0, we have the following commonly used version of value iteration:

    h_{k+1}(x) = (T h_k)(x) − (T h_k)(x̄).   (8)
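Iteration (8) can be sketched directly. The two-state, two-action problem below is our own toy example (its transition probabilities and costs are not from the notes):

```python
# Sketch of relative value iteration (8): iterate
# h_{k+1}(x) = (T h_k)(x) - (T h_k)(xbar), and read off lambda = (T h)(xbar).
import numpy as np

def relative_value_iteration(P, g, xbar=0, n_iter=2000):
    n_states = g.shape[1]
    h = np.zeros(n_states)
    for _ in range(n_iter):
        Th = (g + np.einsum('axy,y->ax', P, h)).min(axis=0)  # (T h)(x)
        h = Th - Th[xbar]
    return Th[xbar], h          # average-cost estimate lambda and differential cost h

P = np.array([[[0.9, 0.1], [0.5, 0.5]],    # action 0: slow service
              [[0.9, 0.1], [0.9, 0.1]]])   # action 1: fast service, extra effort
g = np.array([[0.0, 1.0],
              [0.2, 1.2]])
lam, h = relative_value_iteration(P, g)
# At convergence, lambda e + h = T h holds up to numerical tolerance,
# and h(xbar) = 0 by construction.
```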
Theorem 5 Let h_k be given by (8). Then, if h_k → h, we have

    λ = (T h)(x̄)  and  λ e + h = T h.

Note that there must exist a solution to the average-cost Bellman's equation for value iteration to converge. However, it can be shown that existence of a solution is not a sufficient condition.
2.997 Decision-Making in Large-Scale Systems February 23
MIT, Spring 2004 Handout #9
Lecture Note 6
1 Application to Queueing Networks

In the first part of this lecture, we will discuss the application of dynamic programming to the queueing network introduced in [1], which illustrates several issues encountered in the application of dynamic programming in practical problems. In particular, we consider the issues that arise when value iteration is applied to problems with a large or infinite state space.

The main points in [1], which we overview today, are the following:
- Naive implementation of value iteration may lead to slow convergence and, in the case of infinite state spaces, policies with infinite average cost in every iteration step, even though the iterates J_k(x) converge pointwise to J*(x) for every state x;

- Under certain conditions, with proper initialization J_0, we can have faster convergence and stability guarantees;

- In queueing networks, a proper J_0 can be found from well-known heuristics such as fluid model solutions.

We will illustrate these issues with examples involving queueing networks. For the generic results, including a proof of convergence of average-cost value iteration for MDPs with infinite state spaces, refer to [1].
1.1 Multiclass queueing networks

Consider a queueing network as illustrated in Fig. 1.

[Figure 1: a multiclass queueing network in which three servers (Machine 1, Machine 2, Machine 3) each serve several of the queues.]
We introduce some notation:

    N: the number of queues in the system
    λ_i: probability of an exogenous arrival at queue i
    μ_i: probability that a job at queue i is completed if the job is being served
    x_i: state, the length of queue i
    g(x) = Σ_{i=1}^{N} x_i: cost function, in which the state is x = (x_1, . . . , x_N)
    a ∈ {0, 1}^N: a_i = 1 if a job from queue i is being served, and a_i = 0 otherwise.
The interpretation is as follows. At each time stage, at most one of the following events can happen: a new job arrives at queue i with probability λ_i; or a job from queue i that is currently being served has its service completed, with probability μ_i, and either moves to another queue or leaves the system, depending on the structure of the network. Note that, at each time stage, a server may choose to process a job from any of the queues associated with it. Therefore the decision a encodes which queue is being processed at each server. We refer to such a queueing network as multiclass because jobs at different queues have different service rates and trajectories through the system.
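The at-most-one-event dynamics above can be sketched as a simulator. The routing and rates below are our own toy example (two queues in tandem, not the network of Fig. 1):

```python
# One simulated transition under the notation above.  The tandem layout,
# rates, and routing table are our own toy example: queue 1 feeds queue 2,
# and queue 2 exits the system.
import random

LAMBDA = [0.2, 0.0]      # exogenous arrival probabilities lambda_i
MU = [0.3, 0.25]         # service completion probabilities mu_i
ROUTE = {0: 1, 1: None}  # where a completed job goes (None = leaves system)

def step(x, a, rng=random):
    """One transition from state x (queue lengths) under action a (a_i = 1 if queue i served)."""
    u, acc = rng.random(), 0.0
    x = list(x)
    for i, lam in enumerate(LAMBDA):          # exogenous arrival events
        acc += lam
        if u < acc:
            x[i] += 1
            return tuple(x)
    for i, mu in enumerate(MU):               # service completion events
        if a[i] == 1 and x[i] > 0:
            acc += mu
            if u < acc:
                x[i] -= 1
                if ROUTE[i] is not None:
                    x[ROUTE[i]] += 1
                return tuple(x)
    return tuple(x)                           # no event: state unchanged

g = lambda x: sum(x)                          # cost: total number of jobs in system
```

From the empty state, the only possible events are an exogenous arrival at queue 1 or no event at all, matching the dynamics described above.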
As seen before, an optimal policy could be derived from the differential cost function h*, which is the solution of Bellman's equation:

    λ* e + h* = T h*.

Consider using value iteration for estimating h*. This requires some initial guess h_0. A common choice is h_0 = 0; however, we will show that this can lead to slow convergence of h_k. Indeed, we know that h* is equivalent to a quadratic, in the sense that there is a constant γ > 0 and a solution to Bellman's equation such that

    (1/γ) Σ_i x_i^2 ≤ h*(x) ≤ γ Σ_i x_i^2.

Now let h_0 = 0. Then

    T^k h_0(x) = min_u E[ Σ_{t=0}^{k−1} Σ_{i=1}^{N} x_i(t) | x_0 = x ].

Since

    E[x_i(t)] = E[x_i(t − 1)] + E[A_i(t)] − E[D_i(t)],

where A_i(t) indicates an arrival at queue i at time t (with probability at most λ_i) and D_i(t) ≥ 0 indicates a departure, we have

    E[x_i(t) − x_i(t − 1)] ≤ λ_i.   (1)

By (1), we have

    E[x_i(1)] ≤ E[x_i(0)] + λ_i.
Thus,

    T^k h_0(x) ≤ Σ_{t=0}^{k−1} Σ_{i=1}^{N} (x_i(0) + t λ_i) = Σ_{i=1}^{N} [ k x_i(0) + (k(k − 1)/2) λ_i ].

This implies that h_k(x) is upper bounded by a linear function of the state x. In order for it to approach a quadratic function of x, the iteration number k must have the same magnitude as x. It follows that, if the state space is very large, convergence is slow. Moreover, if the state space is infinite, which is the case if queues do not have finite buffers, only pointwise convergence of h_k(x) to h*(x) can be ensured, but for every k, there is some state x such that h_k(x) is a poor approximation to h*(x).
Example 1 (Single queue length with controlled service rate) Consider a single queue with:

- state x defined as the queue length;
- P_a(x, x + 1) = λ (arrival rate);
- P_a(x, x − 1) = μ_1 + a μ_2, where the action is a ∈ {0, 1};
- P_a(x, x) = 1 − λ − μ_1 − a μ_2.

Let the cost function be defined as

    g_a(x) = (1 + a) x.
The interpretation is as follows. At each time stage, there is a choice between processing jobs at a lower service rate μ1 or at a higher service rate μ1 + μ2. Processing at a higher service rate helps to decrease future queue lengths, but an extra cost must be paid for the extra effort.
Suppose that λ > μ1. Then, under any policy u with u(x) = 0 for all x ≥ x0, whenever the queue length is at least x0 there are on average more job arrivals than departures, and it can be shown that eventually the queue length converges to infinity, leading to infinite average cost.

Suppose that h0(x) = 0, ∀x. Then in every iteration k, there exists an xk such that hk(x) = (T^k h0)(x) = cx + d for all x ≥ xk. Moreover, when hk = cx + d in a neighborhood of x, the greedy action is uk(x) = 0, in which case the average cost goes to infinity.
As shown in [1], using the initial value h0(x) = 1 + x² leads to stable policies for every iterate hk, and ensures convergence to the optimal policy. The choice of h0 as a quadratic arises from problem-specific knowledge. Moreover, appropriate choices in the case of queueing networks can be derived from well-known heuristics and analysis specific to the field.
Simulation-based Methods

The dynamic programming algorithms studied so far have the following characteristics: they update the cost-to-go estimate at every state, infinitely often; they assume that the transition probabilities and costs are known; and they require computing expectations over next states.
In realistic scenarios, each of these requirements may pose difficulties. When the state space is large, performing updates infinitely often in every state may be prohibitive; even when it is feasible, a clever order of visitation may considerably speed up convergence. In many cases, the system parameters are not known, and instead one has only access to observations about the system. Finally, even if the transition probabilities are known, computing expectations of the form (2) may be costly. In the next few lectures, we will study simulation-based methods, which aim at alleviating these issues.
2.1 Asynchronous value iteration

We describe asynchronous value iteration (AVI) as

J_{k+1}(x) = (T J_k)(x) for x ∈ S_k,  J_{k+1}(x) = J_k(x) otherwise,

where S_k ⊆ S is the set of states updated at stage k.
We have seen that, if every state has its value updated infinitely many times, then AVI converges (see arguments in Problem Set 1). The question remains as to whether convergence may be improved by selecting states in a particular order, and whether we can dispense with the requirement of visiting every state infinitely many times.
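As a concrete illustration, the sketch below runs asynchronous value iteration on a tiny discounted MDP (all transition probabilities, costs, and the discount factor are made up for the example), updating one state per step in round-robin order. Since every state is still updated infinitely often, it reaches the same fixed point as the synchronous iteration:

```python
import numpy as np

# Hypothetical 2-state, 2-action discounted MDP: P[a][x][y], g[a][x].
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
g = np.array([[1.0, 2.0], [4.0, 0.5]])
alpha = 0.9  # discount factor

def bellman(J, x):
    """(T J)(x) = min_a [ g_a(x) + alpha * sum_y P_a(x, y) J(y) ]."""
    return min(g[a, x] + alpha * P[a, x] @ J for a in range(2))

# Asynchronous VI: one state per step, each state updated infinitely often.
J_avi = np.zeros(2)
for k in range(500):
    x = k % 2
    J_avi[x] = bellman(J_avi, x)

# Synchronous VI for comparison.
J_sync = np.zeros(2)
for _ in range(500):
    J_sync = np.array([bellman(J_sync, x) for x in range(2)])
```

Both iterates end up at the unique fixed point of T, which is what the convergence argument from Problem Set 1 guarantees.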
We will consider a version of AVI where state updates are based on actual or simulated trajectories of the system. It seems reasonable to expect that, if the system is often encountered at certain states, more emphasis should be placed on obtaining accurate estimates and good actions for those states, motivating performing value updates more often at those states. In the limit, it is clear that if a state is never visited, under any policy, then the value of the cost-to-go function at such a state never comes into play in the decision-making process, and no updates need to be performed for such a state at all. Based on the notion that state trajectories contain information about which states are most relevant, we propose the following version of AVI. We call it real-time value iteration (RTVI).
1. Take an arbitrary state x0. Let k = 0.

2. Choose action u_k in some fashion.

3. Let x_{k+1} = f(x_k, u_k, w_k) (recall from Lecture 1 that f gives an alternative representation for state transitions).

4. Let J_{k+1}(x_{k+1}) = (T J_k)(x_{k+1}).

5. Let k = k + 1 and return to step 2.
2.2 Exploration vs. Exploitation

Note that there is still an element missing in the description of RTVI, namely, how to choose action u_k.
In general, choosing u_k greedily does not ensure convergence to the optimal policy. One possible failure scenario is illustrated in Figure 2. Suppose that there is a subset of states B which is recurrent under an optimal policy, and a disjoint subset of states A which is recurrent under another policy. If we start with a guess J0 which is high enough at states outside region A, and always choose actions greedily, then an action that never leads to states outside region A will be selected. Hence RTVI never has a chance of updating and correcting the initial guess J0 at states in subset B, and in particular, the optimal policy is never achieved.

It turns out that, if we choose the initial value J0 ≤ J*, then greedy policy selection performs well, as shown in Fig. 2(b). We state this concept formally in the following theorem.
The previous discussion highlights a tradeoff that is fundamental to learning algorithms: the conflict of exploitation versus exploration. In particular, there is usually tension between exploiting information accumulated by previous learning steps and exploring different options, possibly at a certain cost, in order to gather more information.
Figure 2: Initial Value Selection. (a) Improper initial value J0 with greedy policy selection. (b) Initial value J0 less than or equal to J*.
Theorem 1 If J0 ≤ J* and all states are reachable from one another, then the real-time value iteration algorithm (RTVI) with greedy policy u_t satisfies the following:

(a) J_k converges to some limit J̄;
(b) J̄ = J* at all states visited infinitely many times;
(c) after some number of iterations, all decisions are optimal.
Proof: Since T is monotone, we have (T J′)(x) ≤ (T J)(x) for all x whenever J′(x) ≤ J(x) for all x; in particular, J0 ≤ J* implies J_k ≤ J* for all k.
Hence one could regard J restricted to the set A of states visited infinitely often as a vector in ℝ^{|A|}. So T_A acts as the DP operator for the subset A of states, and

||T_A J − T_A J′|| ≤ α ||J − J′||.

Therefore, RTVI is AVI over A, with every state visited infinitely many times. Thus, J_k converges to a limit J̄, with J̄(x) = J0(x) for x ∉ A (states outside A are never updated).

Since the states x ∉ A are never visited, we must have

Pa(x, y) = 0, ∀ x ∈ A, y ∉ A,

where a is greedy with respect to J̄. Let ū be the greedy policy of J̄. Then

J̄(x) = g_ū(x) + Σ_{y∈A} P_ū(x, y) J̄(y) = Σ over all y ∈ S of g_ū(x) + P_ū(x, y) J̄(y), ∀ x ∈ A.

Therefore, we conclude

J̄(x) = J_ū(x) ≥ J*(x), ∀ x ∈ A.

By the hypothesis J0 ≤ J*, we also have J̄ ≤ J*, and we conclude

J̄(x) = J*(x), ∀ x ∈ A.
References

[1] R.-R. Chen and S. P. Meyn, "Value Iteration and Optimization of Multiclass Queueing Networks," Queueing Systems, 32, pp. 65-97, 1999.
2.997 Decision-Making in Large-Scale Systems    February 25
MIT, Spring 2004    Handout #10
Lecture Note 7
1 Real-Time Value Iteration

Recall the real-time value iteration (RTVI) algorithm:

choose u_k in some fashion;
observe x_{k+1} = f(x_k, u_k, w_k);
update J_{k+1}(x_k) = (T J_k)(x_k), and J_{k+1}(x) = J_k(x) for x ≠ x_k.

We thus have

(T J_k)(x_k) = min_a [ g_a(x_k) + Σ_y P_a(x_k, y) J_k(y) ].
We encounter the following two questions in this algorithm:

1. What if we do not know P_a(x, y)?
2. Even if we know or can simulate P_a(x, y), computing Σ_y P_a(x, y) J(y) may be expensive.

To overcome these two problems, we consider the Q-learning approach.
2 Q-Learning

2.1 Q-factors

For every state-action pair, we consider

Q*(x, a) = g_a(x) + α Σ_y P_a(x, y) J*(y),   (1)
J*(x) = min_a Q*(x, a).   (2)

We can interpret these equations as Bellman's equations for an MDP with an expanded state space. We have the original states x ∈ S, with associated sets of feasible actions A_x, and extra states (x, a), x ∈ S, a ∈ A_x, corresponding to state-action pairs, for which there is only one action available and no decision must be made. Note that, whenever we are in a state x where a decision must be made, the system transitions deterministically to state (x, a) based on the state and the action a chosen. Therefore we circumvent the need to compute expectations Σ_y P_a(x, y) J(y) associated with greedy policies.
The operator H defining these equations satisfies the following properties:

Monotonicity: for all Q, Q̄ such that Q ≤ Q̄, HQ ≤ HQ̄.
Offset: H(Q + Ke) = HQ + αKe.
Contraction: ||HQ − HQ̄||∞ ≤ α ||Q − Q̄||∞, for all Q, Q̄.

It follows that H has a unique fixed point, corresponding to the Q-factor Q*.
2.2 Q-Learning

We now develop a real-time value iteration algorithm for computing Q*. An algorithm analogous to RTVI for computing the cost-to-go function is as follows:

Q_{t+1}(x_t, u_t) = g_{u_t}(x_t) + α Σ_y P_{u_t}(x_t, y) min_a Q_t(y, a).

However, this algorithm undermines the idea that Q-learning is motivated by situations where we do not know P_a(x, y) or find it expensive to compute expectations Σ_y P_a(x, y) J(y). Alternatively, we consider variants that implicitly estimate this expectation, based on state transitions observed in system trajectories. Based on this idea, one possibility is to utilize a scheme of the form

Q_{t+1}(x_t, a_t) = g_{a_t}(x_t) + α min_a Q_t(x_{t+1}, a).

However, note that such an algorithm should not be expected to converge; in particular, min_a Q_t(x_{t+1}, a) is a noisy estimate of Σ_y P_{a_t}(x_t, y) min_a Q_t(y, a). We consider a small-step version of this scheme, where the noise is attenuated:

Q_{t+1}(x_t, a_t) = (1 − γ_t) Q_t(x_t, a_t) + γ_t [ g_{a_t}(x_t) + α min_a Q_t(x_{t+1}, a) ].   (4)
We will study the properties of (4) under the more general framework of stochastic approximations, which are at the core of many simulation-based or real-time dynamic programming algorithms.
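Update (4) can be sketched on a small synthetic MDP (all numbers below are hypothetical, and a discount factor alpha is assumed). Note that the learner only ever touches sampled transitions; the matrix P appears solely inside the simulator:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state, 2-action discounted MDP, *unknown* to the learner.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a][x][y]
              [[0.4, 0.6], [0.5, 0.5]]])
g = np.array([[2.0, 1.0], [0.5, 3.0]])    # g[a][x]
alpha = 0.5

Q = np.zeros((2, 2))                      # Q[x][a]
visits = np.zeros((2, 2))
x = 0
for t in range(100_000):
    a = int(rng.integers(2))              # exploratory (uniform) action
    x_next = rng.choice(2, p=P[a, x])     # observed transition (simulator only)
    visits[x, a] += 1
    gamma = 1.0 / visits[x, a]            # diminishing step size
    target = g[a, x] + alpha * Q[x_next].min()
    Q[x, a] = (1 - gamma) * Q[x, a] + gamma * target   # update (4)
    x = x_next

J = Q.min(axis=1)                         # J(x) = min_a Q(x, a), as in (2)
```

The iterates approach the fixed point Q* of the operator defined by (1)-(2), without the expectation over next states ever being computed explicitly.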
3 Stochastic Approximation

In the stochastic approximation setting, the goal is to solve a system of equations

r = Hr,

where r is a vector in ℝⁿ for some n and H is an operator defined on ℝⁿ. If we know how to compute Hr for any given r, it is common to try to solve this system of equations by value iteration:

r_{k+1} = H r_k.   (5)
We can also do the summation recursively by setting

r_t^{(i)} = (1/i) Σ_{j=1}^{i} (H r_t + w_j),

r_t^{(i+1)} = (i/(i+1)) r_t^{(i)} + (1/(i+1)) (H r_t + w_{i+1}).

Therefore, r_{t+1} = r_t^{(k)}. Finally, we may consider replacing the samples H r_t + w_i with samples H r_t^{(i−1)} + w_i, obtaining the final form

r_{t+1} = (1 − γ_t) r_t + γ_t (H r_t + w_t).
A simple application of these ideas involves estimating the expected value of a random variable by drawing i.i.d. samples.
Example 1 Let v1, v2, . . . be i.i.d. random variables. Given

r_{t+1} = (t/(t+1)) r_t + (1/(t+1)) v_{t+1},

we know that r_t → E[v] by the strong law of large numbers. We can actually prove a more general version:

r_{t+1} = (1 − γ_t) r_t + γ_t v_{t+1} → E[v]  w.p. 1,

if Σ_{t=1}^∞ γ_t = ∞ and Σ_{t=1}^∞ γ_t² < ∞.
3.1 Lyapunov function analysis

The question we try to answer is: does (8) converge? If so, where does it converge to? We will first illustrate the basic ideas of Lyapunov function analysis by considering a deterministic case.

3.1.1 Deterministic Case

In the deterministic case, we have S(r, w) = S(r). Suppose there exists some unique r* such that

S(r*) = H r* − r* = 0.

The basic idea is to show that a certain measure of distance between r_t and r* is decreasing.
Example 2 Suppose that F is a contraction with respect to ||·||₂. Then

r_{t+1} = r_t + γ_t (F r_t − r_t)

converges.
Proof: Since F is a contraction with modulus α, there exists a unique r* s.t. F r* = r*. Let

V(r) = ||r − r*||₂.

We will show V(r_{t+1}) ≤ V(r_t). Observe

||r_{t+1} − r*||₂ = ||r_t + γ_t (F r_t − r_t) − r*||₂
= ||(1 − γ_t)(r_t − r*) + γ_t (F r_t − r*)||₂
≤ (1 − γ_t) ||r_t − r*||₂ + γ_t ||F r_t − r*||₂
≤ (1 − γ_t) ||r_t − r*||₂ + γ_t α ||r_t − r*||₂
= ||r_t − r*||₂ − (1 − α) γ_t ||r_t − r*||₂.

Therefore, ||r_t − r*||₂ is nonincreasing and bounded below by zero, and hence converges to some c ≥ 0. Iterating the inequality above, and using ||r_t − r*||₂ ≥ c,

||r_{t+1} − r*||₂ ≤ ||r_t − r*||₂ − (1 − α) γ_t c
≤ ||r_{t−1} − r*||₂ − (1 − α)(γ_t + γ_{t−1}) c
≤ . . .
≤ ||r_0 − r*||₂ − (1 − α) c Σ_{l=1}^{t} γ_l.

Hence, if Σ_l γ_l = ∞ and c > 0, the right-hand side diverges to −∞, a contradiction. We conclude that c = 0, i.e., r_t → r*.
1. We define a distance V(r_t) ≥ 0 indicating how far r_t is from a solution r* satisfying S(r*) = 0.
2. We show that the distance is nonincreasing in t.
3. We show that the distance indeed converges to 0.

The argument also involves the basic result that every nonincreasing sequence bounded below converges, in order to show that the distance converges.
Motivated by these points, we introduce the notion of a Lyapunov function:

Definition 1 We call a function V a Lyapunov function if V satisfies

(a) V(·) ≥ 0,
(b) (∇V(r))ᵀ S(r) ≤ 0,
(c) V(r) = 0 ⟺ S(r) = 0.
3.1.2 Stochastic Case

The argument used for convergence in the stochastic case parallels the argument used in the deterministic case. Let F_t denote all information that is available at stage t, and let

S̄_t(r) = E[S(r, w_t) | F_t].

Then we require a Lyapunov function V satisfying

V(·) ≥ 0,   (9)
(∇V(r_t))ᵀ S̄_t(r_t) ≤ −c V(r_t),   (10)
||∇V(r) − ∇V(r̄)|| ≤ L ||r − r̄||,   (11)
E[ ||S(r_t, w_t)||² | F_t ] ≤ K₁ + K₂ V(r_t),   (12)

for some constants c, L, K₁ and K₂.

Note that (9) and (10) are direct analogues of requiring the existence of a distance that is nonincreasing in t; moreover, (10) ensures that the distance decreases at a certain rate if r_t is far from a desired solution r* satisfying V(r*) = 0. Condition (11) imposes some regularity on V which is required to show that V(r_t) does indeed converge to 0, and condition (12) imposes some control over the noise.
A last point worth mentioning is that (10) implies that the expected value of V(r_t) is nonincreasing; however, we may have V(r_{t+1}) > V(r_t) occasionally. Therefore we need a stochastic counterpart to the result that every nonincreasing sequence bounded below converges. The stochastic counterpart of interest to our analysis is given below.
Theorem 1 (Supermartingale Convergence Theorem) Suppose that X_t, Y_t and Z_t are nonnegative random variables adapted to F_t, satisfying

E[X_{t+1} | F_t] ≤ X_t − Y_t + Z_t,  with Σ_{t=1}^∞ Z_t < ∞ w.p. 1.

Then

1. X_t converges to a limit (which can be a random variable) with probability 1,
2. Σ_{t=1}^∞ Y_t < ∞.
In order to complete the proof of Theorem 1 from the four lemmas above, we have to consider the probabilities of two forms of failure:

- failure to stop the algorithm with a near-optimal policy
References

[1] M. Kearns and S. Singh, "Near-Optimal Reinforcement Learning in Polynomial Time," Machine Learning, Volume 49, Issue 2, pp. 209-232, Nov. 2002.
2.997 Decision-Making in Large-Scale Systems March 8
MIT, Spring 2004 Handout #13
Lecture Note 10
1 Value Function Approximation
DP problems are centered around the cost-to-go function J* or the Q-factor Q*. In certain problems, such as linear-quadratic-Gaussian (LQG) systems, J* exhibits some structure which allows for its compact representation:

Example 1 In an LQG system, we have

x_{k+1} = A x_k + B u_k + C w_k,  x_k ∈ ℝⁿ,
g(x, u) = xᵀ D x + uᵀ E u,

where x_k represents the system state, u_k represents the control action, and w_k is Gaussian noise. It can be shown that the optimal policy is of the form

u_k = L_k x_k

and the optimal cost-to-go function is of the form

J*(x) = xᵀ R x + S,  R ∈ ℝⁿˣⁿ, S ∈ ℝ,

where R is a symmetric matrix. It follows that, if there are n state variables (i.e., x_k ∈ ℝⁿ), storing J* requires storing n(n+1)/2 + 1 real numbers, corresponding to the matrix R and the scalar S. The computational time and storage space required is quadratic in the number of state variables.
In general, we are not as lucky as in the LQG system case, and exact representation ofJ requires that
it be stored as a lookup table, with one value per state. Therefore, the space is proportional to the size of
the state space, which grows exponentially with the number of state variables. This problem, known as the
curse of dimensionality, makes dynamic programming intractable in the face of most problems of practical scale.
Example 2 Consider the game of Tetris, represented in Fig. 1. As seen in previous lectures, this game may be represented as an MDP, and a possible choice of state is the pair (B, P), in which B ∈ {0, 1}ⁿˣᵐ represents the board configuration and P represents the current falling piece. More specifically, we have b(i, j) = 1 if position (i, j) of the board is filled, and b(i, j) = 0 otherwise.

If there are p different types of pieces, and the board has dimension n × m, the number of states is on the order of p 2^{nm}.
Figure 1: A Tetris game
as a deterministic optimization problem, in the following way. Denote by λ(u) the average cost of policy u. Then our problem corresponds to

min_{u∈U} λ(u),   (1)

where U is the set of all possible policies. In principle, we could solve (1) by enumerating all policies and choosing the one with the smallest value of λ(u); however, note that the number of policies is exponential in the number of states (we have |U| = |A|^{|S|}), and if there is no special structure to U, this problem requires even more computational time than solving Bellman's equation for the cost-to-go function. A possible approach to approximating the solution is to transform problem (1) by considering only a tractable subset of all policies:

min_{u∈F} λ(u),

where F is a subset of the policy space. If F has some appropriate format, e.g., we consider policies that are parameterized by a continuous variable, we may be able to solve this problem without having to enumerate all policies in the set, but by using some standard optimization method such as gradient descent. Methods based on this idea are called approximations in the policy space, and will be studied later on in this class.
(2) Cost-to-go Function Approximation

Another approach to approximating the dynamic programming solution is to approximate the cost-to-go function. The underlying idea for cost-to-go function approximation is that J* has some structure that allows for an approximate compact representation

J*(x) ≈ J̃(x, r), for some parameter r ∈ ℝᵖ.
Example 3

J̃(x, r) = cos(xᵀ r)  — nonlinear in r
J̃(x, r) = r₀ + r₁ᵀ x  — linear in r
J̃(x, r) = r₀ + r₁ᵀ φ(x)  — linear in r
In the next few lectures, we will focus on cost-to-go function approximation. Note that there are two important preconditions to the development of an effective approximation. First, we need to choose a parameterization J̃ that can closely approximate the desired cost-to-go function. In this respect, a suitable choice requires some practical experience or theoretical analysis that provides rough information on the shape of the function to be approximated. Regularities associated with the function, for example, can guide the choice of representation. Designing an approximation architecture is a problem-specific task, and it is not the main focus of this course; however, we provide some general guidelines and illustrations via case studies involving queueing problems. Second, given a parameterization for the cost-to-go function approximation, we need an efficient algorithm that computes appropriate parameter values.

We will start by describing usual choices of approximation architectures.
2 Approximation Architectures
2.1 Neural Networks
A common choice for an approximation architecture are neural networks. Fig. ??represents a neural network.
The underlying idea is as follows: we first convert the original state x into a vector x n, for somen. This
vector is used as the input to a linear layerof the neural network, which maps the input to a vector y m,
for some m, such that yj = n
i=1 rijxi. The vector y is then used as the input to a sigmoidal layer, which
outputs a vector z m
with the property that zi = f(yi), and f(.) is a sigmoidal function. A sigmoidalfunction is any function with the following properties:
1. monotonically increasing
2. differentiable
3. bounded
Fig. 3 represents a typical sigmoidal function.
The combination of a linear and a sigmoidal layer is called a perceptron, and a neural network consists
of a chain of one or more perceptrons (i.e., the output of a sigmoidal layer can be redirected to another
sigmoidal layer, and so on). Finally, the output of the neural network consists of a weighted sum of the
output z of the final layer: J̃(x, r) = Σ_{i=1}^{m} r_i z_i.
Figure 2: A neural network
Figure 3: A sigmoidal function
of functions on some bounded and closed set; if the functions are uniformly smooth, we can get error O(1/n) with n sigmoidal functions (Barron, 1990). Note, however, that in order to obtain a good approximation, an adequate set of weights r must be found. Backpropagation, which is simply a gradient descent algorithm, is able to find a local optimum among all sets of weights, but finding the global optimum may be a difficult problem.
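A minimal forward pass through one perceptron (linear layer, sigmoidal layer, and output weights) can be sketched as follows; the dimensions and weights below are arbitrary, and the logistic function is used as the sigmoidal f:

```python
import numpy as np

def sigmoid(y):
    """A sigmoidal function: monotonically increasing, differentiable, bounded."""
    return 1.0 / (1.0 + np.exp(-y))

def perceptron_forward(x_bar, R, r_out):
    """One perceptron (linear layer + sigmoidal layer), followed by a
    weighted sum of the final-layer output z."""
    y = R @ x_bar          # linear layer: y_j = sum_i r_ij * x_bar_i
    z = sigmoid(y)         # sigmoidal layer: z_i = f(y_i)
    return r_out @ z       # output: weighted sum of z

rng = np.random.default_rng(3)
R = rng.normal(size=(4, 3))     # hypothetical weights: 3 inputs -> 4 hidden units
r_out = rng.normal(size=4)
x_bar = np.array([0.5, -1.0, 2.0])
value = perceptron_forward(x_bar, R, r_out)
```

Chaining several such perceptrons (feeding z into another linear layer) gives the multi-layer network described above; training the weights R and r_out is what backpropagation does.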
2.2 State Space Partitioning

Another common choice of approximation architecture is based on partitioning of the state space. The underlying idea is that similar states may be grouped together. For instance, in an MDP involving
2.3 Features

A special case of state space partitioning consists of mapping states to features, and considering approximations of the cost-to-go function that are functions of the features. The hope is that the features would capture aspects of the state that are relevant for the decision-making process and discard irrelevant details, thus providing a more compact representation. At the same time, one would also hope that, with an appropriate choice of features, the mapping from features to the (approximate) cost-to-go function would be smoother than that from the original state to the cost-to-go function, thereby allowing for successful approximation with architectures that are suitable for smooth mappings (e.g., polynomials). This process is represented below:

State x → features f(x) → J̃(f(x), r),

where the mapping from f(x) to J̃ is smooth.

Example 4 Consider the Tetris game. What features should we choose?

1. |h(i) − h(i+1)| (height differences)
2. the number of holes
3. max_i h(i)
2.997 Decision-Making in Large-Scale Systems    March 10
MIT, Spring 2004    Handout #14
Lecture Note 11
1 Complexity and Model Selection

In this lecture, we will consider the problem of supervised learning. The setup is as follows. We have pairs (x, y), distributed according to a joint distribution P(x, y). We would like to describe the relationship between x and y through some function f chosen from a set of available functions C, so that y ≈ f(x). Ideally, we would choose f by solving

f* = argmin_{f∈C} E_{(x,y)∼P} [ (y − f(x))² ]   (test error).

However, we will assume that the distribution P is not known; rather, we only have access to samples (x_i, y_i). Intuitively, we may try to solve

min_{f∈C} (1/n) Σ_{i=1}^{n} (y_i − f(x_i))²   (training error)

instead. It also seems that the richer the class C is, the better the chance of correctly describing the relationship between x and y. In this lecture, we will show that this is not the case: the appropriate complexity of C, and the selection of a model for describing how x and y are related, must be guided by how much data is actually available. This issue is illustrated in the following example.
Example 1 Consider fitting the following data by a polynomial of finite degree:

x: 1    2    3    4
y: 2.5  3.5  4.5  5.5

Among several others, the following polynomials fit the data perfectly:

y = x + 1.5
y = 2x⁴ − 20x³ + 70x² − 99x + 49.5

Which polynomial should we choose?
Now consider the following (possibly noisy) data:

x: 1    2    3    4
y: 2.3  3.5  4.7  5.5

Fitting the data with a first-degree polynomial yields y = 1.03x + 1.3; fitting it with a fourth-degree polynomial yields (among others) y = 2x⁴ − 20.0667x³ + 70.4x² − 99.5333x + 49.5. Which polynomial should we choose?
[Figure: training error vs. test error for the two fits.]
It seems intuitive in the previous example that a line may be the best description for the relationship between x and y, even though a polynomial of degree 3 describes the data perfectly in both cases and no linear function is able to describe the data perfectly in the second case. Is the intuition correct, and if so, how can we decide on an appropriate representation, if relying solely on the training error does not seem completely reasonable?

The essence of the problem is as follows. Ultimately, what we are interested in is the ability of our fitted curve to predict future data, rather than simply explaining the observed data. In other words, we would like to choose a predictor that minimizes the expected error |y(x) − ŷ(x)| over all possible x. We call this the test error. The average error over the data set is called the training error.
We will show that training error and test error can be related through a measure of the complexity of the class of predictors being considered. Appropriate choice of a predictor will then be shown to require balancing the training error and the complexity of the predictors being considered. Their relationship is described in Fig. 1, where we plot test and training errors versus complexity of the predictor class C when the number of samples is fixed.
The main difficulty is that, as indicated in Fig. 1, there exists a tradeoff between complexity and the errors, i.e., the training error and the test error; while the approximation error over the sampled points goes to zero as we consider richer approximation classes, the same is not true for the test error, which we are ultimately interested in minimizing. This is due to the fact that, with only
Figure 1: Error vs. degree of the approximation function (training error decreases with the maximum degree d, while the test error is minimized at an intermediate, optimal degree).
3.1 Classification with a finite number of classifiers

Suppose that, given n samples (x_i, y_i), i = 1, . . . , n, we need to choose a classifier from a finite set of classifiers f_1, . . . , f_d. Define

ε(k) = E[ |y − f_k(x)| ],
ε_n(k) = (1/n) Σ_{i=1}^{n} |y_i − f_k(x_i)|.

In words, ε(k) is the test error associated with classifier f_k, and ε_n(k) is a random variable representing the training error associated with classifier f_k over the samples (x_i, y_i), i = 1, . . . , n. As described before, we would like to find k* = argmin_k ε(k), but cannot compute ε directly. Let us consider using instead

k̂ = argmin_k ε_n(k).
We are interested in the following question: how does the test error ε(k̂) compare to the optimal error ε(k*)? Suppose that

|ε_n(k) − ε(k)| ≤ ε̄, ∀k,   (1)

for some ε̄ > 0. Then we have

ε(k̂) ≤ ε_n(k̂) + ε̄ ≤ ε_n(k*) + ε̄ ≤ ε(k*) + 2ε̄.

In words, if the training error is close to the test error for all classifiers f_k, then using k̂ instead of k* is near-optimal. But can we expect (1) to hold?

Observe that the |y_i − f_k(x_i)| are i.i.d. Bernoulli random variables. From the strong law of large numbers, we must have ε_n(k) → ε(k) w.p. 1. This means that, if there are sufficiently many samples, (1) should be true. Having only finitely many samples, we face two questions:

(1) How many samples are needed before we have high confidence that ε_n(k) is close to ε(k)?
(2) Can we show that ε_n(k) approaches ε(k) equally fast for all f_k ∈ C?
The first question is resolved by the Chernoff bound: for i.i.d. Bernoulli random variables x_i, i = 1, . . . , n, we have

P( |(1/n) Σ_{i=1}^{n} x_i − E[x_1]| > ε ) ≤ 2 exp(−2nε²).
i=1
Moreover, since there are only finitely many functions in C, uniform convergence of n(k) to (k) followsimmediately:
P(k
:
(k)
(k) >
) =
(k)
(k) >
})| |
P
(k{| |d
(k)(k) >}) P({| |k=1
2d exp(2n2).
Therefore
we
have
the
following
theorem.
Theorem 1 With probability at least 1 − δ, the training set (x_i, y_i), i = 1, . . . , n, will be such that

test error ≤ training error + ε(d, n, δ),

where

ε(d, n, δ) = sqrt( (log 2d + log(1/δ)) / (2n) ).
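Theorem 1's bound is easy to evaluate and to invert for a sample-size requirement. The sketch below computes ε(d, n, δ) and the smallest n achieving a target ε (the particular values of d, δ and ε are arbitrary), illustrating that the required n grows only logarithmically in the number of classifiers d:

```python
import math

def error_bound(d, n, delta):
    """epsilon(d, n, delta) from Theorem 1: with probability >= 1 - delta,
    test error <= training error + epsilon."""
    return math.sqrt((math.log(2 * d) + math.log(1 / delta)) / (2 * n))

def samples_needed(d, delta, eps):
    """Smallest n with error_bound(d, n, delta) <= eps (invert the bound)."""
    return math.ceil((math.log(2 * d) + math.log(1 / delta)) / (2 * eps ** 2))

# Required samples for epsilon = 0.1 at confidence delta = 0.05:
n_10 = samples_needed(10, 0.05, 0.1)          # d = 10 classifiers
n_10000 = samples_needed(10_000, 0.05, 0.1)   # d = 10,000 classifiers
```

Multiplying the number of candidate classifiers by a factor of 1000 increases the sample requirement only modestly, which is the log d behavior discussed next.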
4 Measures of Complexity

In Theorem 1, the error ε(d, n, δ) is on the order of sqrt(log d). In other words, the more classifiers are under consideration, the larger the bound on the difference between the test and training errors, and the difference grows as a function of log d. It follows that, for our purposes, log d captures the complexity of the class of classifiers.

4.1 VC dimension
The VC dimension is a property of a class C of functions: for each set C, we have an associated measure of complexity, d_VC(C). d_VC captures how much variability there is between different functions in C. The underlying idea is as follows. Take n points x_1, . . . , x_n, and consider the binary vectors in {−1, +1}ⁿ formed by applying a function f ∈ C to (x_1, . . . , x_n). How many different vectors can we come up with? In other words, consider the following matrix:

f_1(x_1)  f_1(x_2)  . . .  f_1(x_n)
f_2(x_1)  f_2(x_2)  . . .  f_2(x_n)
   .         .       .        .

where f_i ∈ C. How many distinct rows can this matrix have? This discussion leads to the notion of shattering and to the definition of the VC dimension.
Definition 1 (Shattering) A set of points x_1, . . . , x_n is shattered by a class C of classifiers if, for any assignment of labels y_i ∈ {−1, +1}, there is f ∈ C such that f(x_i) = y_i, ∀i.

Definition 2 The VC dimension of C is the cardinality of the largest set it can shatter.
Example 1 Consider |C| = d. Suppose x_1, x_2, . . . , x_n is shattered by C. Then we need d ≥ 2ⁿ and thus n ≤ log₂ d. This means that d_VC(C) ≤ log₂ d.
Example 2 Consider C = {hyperplanes in ℝ²} (i.e., linear classifiers). Any two points in ℝ² can be shattered, hence d_VC(C) ≥ 2. There are three points in ℝ² that C can shatter, hence d_VC(C) ≥ 3. Since C cannot shatter any four points in ℝ², we have d_VC(C) ≤ 3, and it follows that d_VC(C) = 3. Moreover, it can be shown that, if C = {hyperplanes in ℝⁿ}, then d_VC(C) = n + 1.

Example 3 If C is the set of all convex sets in ℝ², we can show that d_VC(C) = ∞.
It turns out that the VC dimension provides a generalization of the results from the previous section, for finite sets of classifiers, to general classes of classifiers:

Theorem 2 With probability at least 1 − δ over the choice of sample points (x_i, y_i), i = 1, . . . , n, we have

ε(f) ≤ ε_n(f) + ε(n, d_VC(C), δ), ∀ f ∈ C,

where

ε(n, d_VC, δ) = sqrt( [ d_VC (log(2n/d_VC) + 1) + log(4/δ) ] / n ).
5 Structural Risk Minimization
Based on the previous results, we may consider the following approach to selecting a class of functions C whose complexity is appropriate for the number of samples available. Suppose that we have several classes C_1 ⊂ C_2 ⊂ . . . ⊂ C_p; note that complexity increases from C_1 to C_p. We have classifiers f_1, f_2, . . . , f_p which minimize the training error ε_n(f_i) within each class. Then, given a confidence level δ, we may find upper bounds on the test error ε(f_i) associated with each classifier:

ε(f_i) ≤ ε_n(f_i) + ε(d_VC(C_i), n, δ),

with probability at least 1 − δ, and we can choose the classifier f_i that minimizes the above upper bound. This approach is called structural risk minimization.
There are two difficulties associated with structural risk minimization: first, the upper bounds provided by Theorems 1 and 2 may be loose; second, it may be difficult to determine the VC dimension of a given class of classifiers, and rough estimates or upper bounds may have to be used instead. Still, this may be a reasonable approach if we have a limited amount of data. If we have a lot of data, an alternative approach is as follows. We can split the data into three sets: a training set, a validation set and a test set. We can use the training set to find the classifier f_i within each class C_i that minimizes the training error; use the validation set to estimate the test error of each selected classifier f_i, and choose the classifier f̂ from f_1, . . . , f_p with the smallest estimate; and finally, use the test set to generate an estimate of the test error associated with f̂.
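The data-splitting procedure can be sketched as follows, with polynomial classes of increasing degree standing in for C_1 ⊂ · · · ⊂ C_p (the data-generating model and split sizes below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data from y = x + 1.3 + noise; candidate class C_i = polynomials
# of degree i, so complexity increases with i.
x = rng.uniform(0, 4, size=300)
y = x + 1.3 + rng.normal(scale=0.3, size=300)

# Split into training, validation and test sets.
x_tr, y_tr = x[:100], y[:100]
x_val, y_val = x[100:200], y[100:200]
x_te, y_te = x[200:], y[200:]

def mse(coeffs, xs, ys):
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Training set: best fit f_i within each class C_i.
fits = {deg: np.polyfit(x_tr, y_tr, deg) for deg in range(1, 7)}
# Validation set: choose the classifier with the smallest estimated test error.
best_deg = min(fits, key=lambda d: mse(fits[d], x_val, y_val))
# Test set: an unbiased estimate of the selected classifier's error.
test_estimate = mse(fits[best_deg], x_te, y_te)
```

Training error alone would always favor the highest degree; the held-out validation set is what penalizes the extra complexity.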
2.997 Decision-Making in Large-Scale Systems    March 12
MIT, Spring 2004    Handout #16
Lecture Note 12

1 Value Function Approximation and Policy Performance
Recall that two tasks must be accomplished in order for a value function approximation scheme to be successful:

1. We must pick a good representation J̃, such that J*(·) ≈ J̃(·, r) for at least some parameter r;
2. We must pick a good parameter r̄, such that J*(x) ≈ J̃(x, r̄).

Consider approximating J* with a linear architecture, i.e., let

J̃(x, r) = Σ_{i=1}^{p} φ_i(x) r_i,

for some functions φ_i, i = 1, . . . , p. We can define a matrix Φ ∈ ℝ^{|S|×p} given by

Φ = [ φ_1 . . . φ_p ].

With this notation, we can represent each function J̃(·, r) as

J̃ = Φr.
Fig. 1 gives a geometric interpretation of value function approximation. We may think of J* as a vector in ℝ^{|S|}; by considering approximations of the form J̃ = Φr, we restrict attention to the hyperplane J̃ = Φr in the same space. Given a norm ||·|| (e.g., the Euclidean norm), an ideal value function approximation algorithm would choose r minimizing ||J* − Φr||; in other words, it would find the projection Φr* of J* onto the hyperplane. Note that ||J* − Φr*|| is a natural measure of the quality of the approximation architecture, since it is the best approximation error that can be attained by any algorithm given the choice of Φ.

Algorithms for value function approximation found in the literature do not compute the projection Φr*, since this is an intractable problem. Building on the knowledge that J* satisfies Bellman's equation, value function approximation typically involves adaptation of exact dynamic programming algorithms. For instance, drawing inspiration from value iteration, one might consider an approximate value iteration scheme; the hope is that, if the architecture Φ is capable of producing a good approximation to J*, then the approximation algorithm should be able to
produce a relatively good approximation.
Another important question concerns the choice of the norm ||·|| used to measure approximation errors. Recall that, ultimately, we are not interested in finding an approximation to J*, but rather in finding a good policy for the original decision problem. Therefore we would like to choose ||·|| to reflect the performance associated with approximations to J*.
Figure 1: Value function approximation. J* is a vector in ℝ^{|S|} (one coordinate per state); approximations are restricted to the hyperplane J̃ = Φr.
2 Performance Bounds

We are interested in the following question. Let u_J be the greedy policy associated with an arbitrary function J, and J_{u_J} be the cost-to-go function associated with that policy. Can we relate the increase in cost ||J_{u_J} − J*|| to the approximation error ||J − J*||?

Recall the following theorem, from Lecture Note 3:

Theorem 1 Let J be arbitrary, and let u_J be a greedy policy with respect to J. Let J_{u_J} be the cost-to-go function for policy u_J. Then

||J_{u_J} − J*||∞ ≤ (2α/(1 − α)) ||J − J*||∞.
It is unrealistic to expect that we could approximate J* uniformly well over all states (which is required by Theorem 1), or that we could find a policy u_J that yields a cost-to-go uniformly close to J* over all states.
The following example illustrates the notion that having a large error ||J − J*||∞ does not necessarily lead to a bad policy. Moreover, minimizing ||J − J*||∞ may also lead to undesirable results.
Example 1 Consider a single queue with controlled service rate. Let x denote the queue length, B denote the buffer size, and

P_a(x, x+1) = λ, ∀a, x = 0, 1, . . . , B−1,
P_a(x, x−1) = μ(a), ∀a, x = 1, 2, . . . , B,
P_a(B, B+1) = 0, ∀a,
g_a(x) = x + q(a).

Suppose that we are interested in minimizing the average cost in this problem. Then we would like to find an approximation to the differential cost function h*. Suppose that we consider only linear approximations: h̃(x, r) = r1 + r2·x. At the top of Figure 1, we represent h* and two possible approximations, h1 and h2. h1 is chosen so as to minimize ‖h̃ − h*‖_∞. Which one is a better approximation? Note that h1 yields smaller approximation errors than h2 over large states, but yields large approximation errors over most of the state space. In particular, as we increase the buffer size B, h1 should lead to worse and worse approximation errors in almost all states. h2, on the other hand, has an interesting property, which we now describe. At the bottom of Figure 1, we represent the stationary state distribution π(x) encountered under the optimal policy. Note that it decays exponentially with x, and large states are rarely visited. This suggests that, for practical purposes, h2 may lead to a better policy, since it approximates h* better than h1 over the set of states that are visited almost all of the time.
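The effect described in the example can be reproduced numerically. In the sketch below, the convex function h*(x) = x²/B is an assumed stand-in (the notes do not give h* in closed form), and the stationary distribution is taken to be geometric; h1 is the sup-norm-optimal linear fit and h2 the π-weighted least-squares fit:

```python
import numpy as np

B = 50
x = np.arange(B + 1, dtype=float)
h_star = x**2 / B                      # assumed convex stand-in for h*
pi = 0.5**x; pi /= pi.sum()            # exponentially decaying state distribution

Phi = np.stack([np.ones_like(x), x], axis=1)   # linear architecture r1 + r2*x

# h1: minimize ||h~ - h*||_inf via a crude grid search over (r1, r2)
best, r1 = np.inf, None
for a in np.linspace(-20, 20, 201):
    for b in np.linspace(0, 2 * B, 201):
        err = np.max(np.abs(a + b * x - h_star))
        if err < best:
            best, r1 = err, (a, b)
h1 = r1[0] + r1[1] * x

# h2: minimize the pi-weighted least-squares error
r2 = np.linalg.lstsq(np.sqrt(pi)[:, None] * Phi, np.sqrt(pi) * h_star,
                     rcond=None)[0]
h2 = Phi @ r2

sup_err_h1 = np.max(np.abs(h1 - h_star))
sup_err_h2 = np.max(np.abs(h2 - h_star))
w_err_h1 = pi @ np.abs(h1 - h_star)    # pi-weighted average error
w_err_h2 = pi @ np.abs(h2 - h_star)
```

As in the example, h1 wins in the sup-norm but h2 is far more accurate on the states that are actually visited.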
What the previous example hints at is that, in the case of a large state space, it may be important to consider errors J − J* that differentiate between more and less important states. In the next section, we will introduce the notion of weighted norms and present performance bounds that take state relevance into account.
2.1 Performance Bounds with State-Relevance Weights
We first introduce the following weighted norms:

‖J* − J̃(·, r)‖_∞ = max_{x∈S} |J*(x) − J̃(x, r)|

‖J* − J̃(·, r)‖_{∞,ν} = max_{x∈S} ν(x) |J*(x) − J̃(x, r)|

‖J* − J̃(·, r)‖_{1,c} = Σ_{x∈S} c(x) |J*(x) − J̃(x, r)|,   (c > 0).
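For a finite state space these norms are straightforward to evaluate; a small sketch (the error vector and weights below are illustrative):

```python
import numpy as np

def norm_inf(e):
    """||e||_inf = max_x |e(x)|."""
    return np.max(np.abs(e))

def norm_inf_weighted(e, nu):
    """||e||_{inf,nu} = max_x nu(x) |e(x)|."""
    return np.max(nu * np.abs(e))

def norm_1_c(e, c):
    """||e||_{1,c} = sum_x c(x) |e(x)|, for weights c > 0."""
    return c @ np.abs(e)

# Illustrative error vector J*(x) - J~(x, r) and weights over 3 states
e = np.array([1.0, -2.0, 0.5])
c = np.array([0.5, 0.3, 0.2])
```

When c is a probability distribution, ‖·‖_{1,c} is exactly the expected absolute error under c, which is what the bounds below exploit.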
[Figure: at the top, h*(x) and the linear approximations h1 and h2, for 0 ≤ x ≤ B; at the bottom, the stationary distribution of x.]
Theorem 2 Let J be such that J ≤ TJ, and let u_J be a greedy policy with respect to J. Then

‖J_{u_J} − J*‖_{1,c} ≤ (1 / (1 − α)) ‖J − J*‖_{1,μ_{c,J}},

where

μ_{c,J}^T = (1 − α) Σ_{t=0}^∞ α^t c^T P_{u_J}^t = (1 − α) c^T (I − αP_{u_J})^{-1},

or equivalently,

μ_{c,J}(x) = (1 − α) Σ_{y∈S} c(y) Σ_{t=0}^∞ α^t P_{u_J}^t(y, x), ∀x ∈ S.

Proof: We have J ≤ TJ, and hence J ≤ J* ≤ J_{u_J}. Then

‖J_{u_J} − J*‖_{1,c} = c^T (J_{u_J} − J*)
≤ c^T (J_{u_J} − J)
= c^T ((I − αP_{u_J})^{-1} g_{u_J} − J)
= c^T (I − αP_{u_J})^{-1} (g_{u_J} + αP_{u_J} J − J)
= c^T (I − αP_{u_J})^{-1} (TJ − J)   (since u_J is greedy with respect to J)
≤ c^T (I − αP_{u_J})^{-1} (J* − J)   (since J ≤ J* implies TJ ≤ TJ* = J*)
= (1 / (1 − α)) ‖J − J*‖_{1,μ_{c,J}}.
Comparing Theorems 1 and 2, we have

‖J_{u_J} − J*‖_∞ ≤ (2α / (1 − α)) ‖J − J*‖_∞,

‖J_{u_J} − J*‖_{1,c} ≤ (1 / (1 − α)) ‖J − J*‖_{1,μ_{c,J}}.
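The second bound can also be checked numerically. In the illustrative construction below, J is obtained by shifting J* down so that the hypothesis J ≤ TJ of Theorem 2 holds, and μ_{c,J} is computed from its matrix definition:

```python
import numpy as np

rng = np.random.default_rng(2)
n, A, alpha = 6, 3, 0.9
P = rng.random((A, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((A, n))

def T(J):                                    # exact Bellman operator
    return np.min(g + alpha * P @ J, axis=0)

J_star = np.zeros(n)
for _ in range(3000):                        # exact value iteration
    J_star = T(J_star)

# Shift J* down by d with min(d) >= alpha * max(d), which guarantees J <= TJ
d = rng.uniform(0.9, 1.0, size=n)
J = J_star - d

u = np.argmin(g + alpha * P @ J, axis=0)     # greedy policy u_J
Pu, gu = P[u, np.arange(n), :], g[u, np.arange(n)]
J_u = np.linalg.solve(np.eye(n) - alpha * Pu, gu)

c = np.full(n, 1.0 / n)                      # state-relevance weights
mu = (1 - alpha) * c @ np.linalg.inv(np.eye(n) - alpha * Pu)   # mu_{c,J}

lhs = c @ np.abs(J_u - J_star)               # ||J_uJ - J*||_{1,c}
rhs = (mu @ np.abs(J - J_star)) / (1 - alpha)
```

Note that μ_{c,J} is itself a probability distribution: it is the (1 − α)-discounted occupancy distribution of the chain P_{u_J} started from c.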
There are two important differences between these bounds:

1. The first bound relates performance to the worst possible approximation error over all states, whereas the second involves only a weighted average of errors. Therefore we expect the second bound to exhibit better scaling properties.

2. The first bound presents a worst-case guarantee on performance: the cost-to-go starting from any initial state x cannot be greater than the stated bound. The second bound presents a guarantee on the expected cost-to-go, given that the initial state is distributed according to distribution c. Although this is a weaker guarantee, it represents a more realistic requirement for large-scale systems, and raises the possibility of exploiting information about how important each different state is in the overall decision problem.
This step is typically done in real-time, as the system is being controlled. If the set of available actions A is relatively small and the summation Σ_y P_a(x, y) J̃(y, r) can be computed relatively fast, then evaluating
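The real-time step referred to above — picking, at the current state x, an action that is greedy with respect to the approximation J̃(·, r) — can be sketched as follows (the MDP arrays and the tiny two-state example are illustrative assumptions):

```python
import numpy as np

def greedy_action(x, P, g, alpha, J_tilde):
    """Return argmin_a [ g_a(x) + alpha * sum_y P_a(x,y) * J~(y, r) ].
    P: (A, n, n) transition probabilities, g: (A, n) costs,
    J_tilde: length-n vector of approximate values J~(y, r)."""
    q = g[:, x] + alpha * P[:, x, :] @ J_tilde   # one Q-value per action
    return int(np.argmin(q))

# Tiny example: action 0 moves to state 0, action 1 moves to state 1
P = np.array([[[1.0, 0.0], [1.0, 0.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
g = np.zeros((2, 2))
J_tilde = np.array([0.0, 10.0])                  # state 1 looks expensive
a = greedy_action(0, P, g, alpha=0.9, J_tilde=J_tilde)
```

For each decision, only the |A| sums over next states are needed, so the per-step cost is modest whenever A is small and the transitions from x are sparse or quick to enumerate.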