LECTURE SLIDES - DYNAMIC PROGRAMMING
BASED ON LECTURES GIVEN AT THE MASSACHUSETTS INST. OF TECHNOLOGY
CAMBRIDGE, MASS., FALL 2012
DIMITRI P. BERTSEKAS

These lecture slides are based on the two-volume book "Dynamic Programming and Optimal Control," Athena Scientific, by D. P. Bertsekas (Vol. I, 3rd Edition, 2005; Vol. II, 4th Edition, 2012); see http://www.athenasc.com/dpbook.html

Two related reference books:
(1) "Neuro-Dynamic Programming," Athena Scientific, by D. P. Bertsekas and J. N. Tsitsiklis, 1996
(2) "Introduction to Probability" (2nd edition), Athena Scientific, by D. P. Bertsekas and J. N. Tsitsiklis, 2008
6.231: DYNAMIC PROGRAMMING LECTURE 1

LECTURE OUTLINE
- Problem Formulation
- Examples
- The Basic Problem
- Significance of Feedback

DP AS AN OPTIMIZATION METHODOLOGY
- Generic optimization problem:
  min_{u ∈ U} g(u)
  where u is the optimization/decision variable, g(u) is the cost function, and U is the constraint set
- Categories of problems:
  - Discrete (U is finite) or continuous
  - Linear (g is linear and U is polyhedral) or nonlinear
  - Stochastic or deterministic: in stochastic problems the cost involves a stochastic parameter w, which is averaged, i.e., it has the form
    g(u) = E_w{ G(u, w) }
    where w is a random parameter.
- DP can deal with complex stochastic problems where information about w becomes available in stages, and the decisions are also made in stages and make use of this information.

BASIC STRUCTURE OF STOCHASTIC DP
- Discrete-time system
  x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, 1, ..., N−1
  - k: discrete time
  - x_k: state; summarizes past information that is relevant for future optimization
  - u_k: control; decision to be selected at time k from a given set
  - w_k: random parameter (also called disturbance or noise, depending on the context)
  - N: horizon, or number of times control is applied
- Cost function that is additive over time:
  E{ g_N(x_N) + sum_{k=0}^{N−1} g_k(x_k, u_k, w_k) }
- Alternative system description: P(x_{k+1} | x_k, u_k)
  x_{k+1} = w_k  with  P(w_k | x_k, u_k) = P(x_{k+1} | x_k, u_k)

INVENTORY CONTROL EXAMPLE
[figure: inventory system; stock x_k at period k, stock ordered u_k, demand w_k; x_{k+1} = x_k + u_k − w_k; cost of period k: c u_k + r(x_k + u_k − w_k)]
- Discrete-time system
  x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k − w_k
- Cost function that is additive over time:
  E{ g_N(x_N) + sum_{k=0}^{N−1} g_k(x_k, u_k, w_k) } = E{ sum_{k=0}^{N−1} ( c u_k + r(x_k + u_k − w_k) ) }
- Optimization over policies: rules/functions u_k = μ_k(x_k) that map states to controls

ADDITIONAL ASSUMPTIONS
- The set of values that the control u_k can take depends at most on x_k and not on prior x or u
- The probability distribution of w_k does not depend on past values w_{k−1}, ..., w_0, but may depend on x_k and u_k
  - Otherwise past values of w or x would be useful for future optimization
- Sequence of events envisioned in period k:
  - x_k occurs according to x_k = f_{k−1}(x_{k−1}, u_{k−1}, w_{k−1})
  - u_k is selected with knowledge of x_k, i.e., u_k ∈ U_k(x_k)
  - w_k is random and generated according to a distribution P_{w_k}(x_k, u_k)

DETERMINISTIC FINITE-STATE PROBLEMS
- Scheduling example: find the optimal sequence of operations A, B, C, D
- A must precede B, and C must precede D
- Given startup costs S_A and S_C, and setup transition cost C_{mn} from operation m to operation n
[figure: decision tree of partial schedules built from the initial state, with arc costs S_A, S_C, and C_{mn}]

STOCHASTIC FINITE-STATE PROBLEMS
- Example: find a two-game chess match strategy
- Timid play draws with prob. p_d > 0 and loses with prob. 1 − p_d; bold play wins with prob. p_w and loses with prob. 1 − p_w

LABEL CORRECTING METHODS
[flowchart: when node i exits OPEN, each child j is tested with "Is d_i + a_ij < d_j?" (is the path s → i → j better than the current path s → j?) and "Is d_i + a_ij < UPPER?" (does the path s → i → j have a chance to be part of a shorter s → t path?); if YES to both, set d_j = d_i + a_ij and INSERT j into OPEN]

EXAMPLE
[figure: tree of partial tours over cities A, B, C, D (nodes AB, AC, AD, ABC, ..., ADCB) rooted at origin node s = A, with an artificial terminal node t and the arc costs used in the iteration table below]

  Iter. No.   Node Exiting OPEN   OPEN after Iteration   UPPER
  0           -                   1                      ∞
  1           1                   2, 7, 10               ∞
  2           2                   3, 5, 7, 10            ∞
  3           3                   4, 5, 7, 10            ∞
  4           4                   5, 7, 10               43
  5           5                   6, 7, 10               43
  6           6                   7, 10                  13
  7           7                   8, 10                  13
  8           8                   9, 10                  13
  9           9                   10                     13
  10          10                  Empty                  13

- Note that some nodes never entered OPEN

VALIDITY OF LABEL CORRECTING METHODS
- Proposition: If there exists at least one path from the origin to the destination, the label correcting algorithm terminates with UPPER equal to the shortest distance from the origin to the destination
- Proof: (1) Each time a node j enters OPEN, its label is decreased and becomes equal to the length of some path from s to j
  (2) The number of possible distinct path lengths is finite, so the number of times a node can enter OPEN is finite, and the algorithm terminates
  (3) Let (s, j_1, j_2, ..., j_k, t) be a shortest path and let d* be the shortest distance. If UPPER > d* at termination, UPPER will also be larger than the length of all the paths (s, j_1, ..., j_m), m = 1, ..., k, throughout the algorithm. Hence, node j_k will never enter the OPEN list with d_{j_k} equal to the shortest distance from s to j_k. Similarly, node j_{k−1} will never enter the OPEN list with d_{j_{k−1}} equal to the shortest distance from s to j_{k−1}. Continue to j_1 to get a contradiction.
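The label correcting iteration described above is easy to state in code. Below is a minimal Python sketch of the generic method (OPEN kept as a plain first-in/first-out list, one of several possible queue disciplines); the 4-node graph and arc lengths at the bottom are made-up illustrative data, not part of the slides.

    import math

    def label_correcting(children, length, s, t):
        # children[i]: list of child nodes of i;  length[(i, j)]: arc length a_ij >= 0
        d = {s: 0.0}              # labels d_i: length of some path s -> i found so far
        UPPER = math.inf          # length of the best s -> t path found so far
        OPEN = [s]
        while OPEN:
            i = OPEN.pop(0)       # remove a node from OPEN
            for j in children.get(i, []):
                dij = d[i] + length[(i, j)]
                # is s -> i -> j better than the current s -> j path, and can it
                # possibly be part of a path shorter than UPPER?
                if dij < min(d.get(j, math.inf), UPPER):
                    if j == t:
                        UPPER = dij       # improved complete path to the destination
                    else:
                        d[j] = dij
                        OPEN.append(j)    # (re)enter OPEN
        return UPPER

    # hypothetical 4-node graph with origin 1 and destination 4
    children = {1: [2, 3], 2: [3, 4], 3: [4]}
    length = {(1, 2): 5, (1, 3): 1, (2, 3): 1, (2, 4): 2, (3, 4): 6}
    print(label_correcting(children, length, 1, 4))   # -> 7.0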

6.231 DYNAMIC PROGRAMMING LECTURE 4

LECTURE OUTLINE
- Deterministic continuous-time optimal control
- Examples
- Connection with the calculus of variations
- The Hamilton-Jacobi-Bellman equation as a continuous-time limit of the DP algorithm
- The Hamilton-Jacobi-Bellman equation as a sufficient condition
- Examples

PROBLEM FORMULATION
- Continuous-time dynamic system:
  ẋ(t) = f( x(t), u(t) ),  0 ≤ t ≤ T,  x(0): given,
  where
  - x(t) ∈ R^n: state vector at time t
  - u(t) ∈ U ⊂ R^m: control vector at time t
  - U: control constraint set
  - T: terminal time
- Admissible control trajectories { u(t) | t ∈ [0, T] }: piecewise continuous functions with u(t) ∈ U for all t ∈ [0, T]; they uniquely determine { x(t) | t ∈ [0, T] }
- Problem: find an admissible control trajectory { u(t) | t ∈ [0, T] } and corresponding state trajectory { x(t) | t ∈ [0, T] } that minimize the cost
  h( x(T) ) + ∫_0^T g( x(t), u(t) ) dt
- f, h, g are assumed continuously differentiable

EXAMPLE I
- Motion control: a unit mass moves on a line under the influence of a force u
- x(t) = ( x_1(t), x_2(t) ): position and velocity of the mass at time t
- Problem: from a given ( x_1(0), x_2(0) ), bring the mass near a given final position-velocity pair (x̄_1, x̄_2) at time T, in the sense:
  minimize | x_1(T) − x̄_1 |² + | x_2(T) − x̄_2 |²
  subject to the control constraint
  | u(t) | ≤ 1, for all t ∈ [0, T]
- The problem fits the framework with
  ẋ_1(t) = x_2(t),  ẋ_2(t) = u(t),
  h( x(T) ) = | x_1(T) − x̄_1 |² + | x_2(T) − x̄_2 |²,
  g( x(t), u(t) ) = 0, for all t ∈ [0, T]

EXAMPLE II
- A producer with production rate x(t) at time t may allocate a portion u(t) of his/her production rate to reinvestment and 1 − u(t) to production of a storable good. Thus x(t) evolves according to
  ẋ(t) = γ u(t) x(t),
  where γ > 0 is a given constant
- The producer wants to maximize the total amount of product stored
  ∫_0^T ( 1 − u(t) ) x(t) dt
  subject to
  0 ≤ u(t) ≤ 1, for all t ∈ [0, T]
- The initial production rate x(0) is a given positive number

EXAMPLE III (CALCULUS OF VARIATIONS)
[figure: a curve x(t) from a given point α at t = 0 to a given line at t = T, with ẋ(t) = u(t); Length = ∫_0^T √(1 + (u(t))²) dt]

- Find a curve from a given point to a given line that has minimum length
- The problem is
  minimize ∫_0^T √( 1 + ( ẋ(t) )² ) dt
  subject to x(0) = α
- Reformulation as an optimal control problem:
  minimize ∫_0^T √( 1 + ( u(t) )² ) dt
  subject to ẋ(t) = u(t), x(0) = α

HAMILTON-JACOBI-BELLMAN EQUATION I
- We discretize [0, T] at times 0, δ, 2δ, ..., Nδ, where δ = T/N, and we let
  x_k = x(kδ),  u_k = u(kδ),  k = 0, 1, ..., N
- We also discretize the system and cost:
  x_{k+1} = x_k + f(x_k, u_k)·δ,   h(x_N) + sum_{k=0}^{N−1} g(x_k, u_k)·δ
- We write the DP algorithm for the discretized problem
  J̃*(Nδ, x) = h(x),
  J̃*(kδ, x) = min_{u ∈ U} [ g(x, u)·δ + J̃*( (k+1)·δ, x + f(x, u)·δ ) ]
- Assume J̃* is differentiable and Taylor-expand:
  J̃*(kδ, x) = min_{u ∈ U} [ g(x, u)·δ + J̃*(kδ, x) + ∇_t J̃*(kδ, x)·δ + ∇_x J̃*(kδ, x)' f(x, u)·δ + o(δ) ]
- Cancel J̃*(kδ, x), divide by δ, and take the limit
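To make the discretization above concrete, here is a minimal Python sketch that runs the discretized DP recursion for the scalar system ẋ = u with |u| ≤ 1, zero running cost, and terminal cost (1/2)x(T)² (the example analyzed on a later slide); the state/control grids and the linear interpolation are illustrative choices, not part of the slides.

    import numpy as np

    # scalar example: xdot = u, |u| <= 1, running cost 0, terminal cost 0.5 * x(T)**2
    T, N = 1.0, 100
    delta = T / N
    xs = np.linspace(-2.0, 2.0, 401)            # state grid (illustrative)
    us = np.linspace(-1.0, 1.0, 21)             # control grid (illustrative)
    J = 0.5 * xs**2                             # terminal cost h(x)

    for k in range(N - 1, -1, -1):
        # J(k*delta, x) = min_u [ g(x, u)*delta + J((k+1)*delta, x + f(x, u)*delta) ], with g = 0
        candidates = [np.interp(xs + u * delta, xs, J) for u in us]
        J = np.min(np.stack(candidates), axis=0)

    # compare with the closed-form optimal cost-to-go given on a later slide:
    # J*(0, x) = 0.5 * max(0, |x| - T)**2
    J_exact = 0.5 * np.maximum(0.0, np.abs(xs) - T)**2
    print(np.max(np.abs(J - J_exact)))          # small discretization error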

HAMILTON-JACOBI-BELLMAN EQUATION II
- Let J*(t, x) be the optimal cost-to-go of the continuous problem. Assuming the limit is valid,
  lim_{k→∞, δ→0, kδ=t} J̃*(kδ, x) = J*(t, x), for all t, x,
  we obtain for all t, x,
  0 = min_{u ∈ U} [ g(x, u) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, u) ]
  with the boundary condition J*(T, x) = h(x)
- This is the Hamilton-Jacobi-Bellman (HJB) equation - a partial differential equation, which is satisfied for all time-state pairs (t, x) by the cost-to-go function J*(t, x) (assuming J* is differentiable and the preceding informal limiting procedure is valid)
- Hard to tell a priori if J*(t, x) is differentiable
- So we use the HJB Eq. as a verification tool; if we can solve it for a differentiable J*(t, x), then:
  - J* is the optimal cost-to-go function
  - The control μ*(t, x) that minimizes in the RHS for each (t, x) defines an optimal control

VERIFICATION/SUFFICIENCY THEOREM
- Suppose V(t, x) is a solution to the HJB equation; that is, V is continuously differentiable in t and x, and is such that for all t, x,
  0 = min_{u ∈ U} [ g(x, u) + ∇_t V(t, x) + ∇_x V(t, x)' f(x, u) ],
  V(T, x) = h(x), for all x
- Suppose also that μ*(t, x) attains the minimum above for all t and x
- Let { x*(t) | t ∈ [0, T] } and u*(t) = μ*( t, x*(t) ), t ∈ [0, T], be the corresponding state and control trajectories
- Then
  V(t, x) = J*(t, x), for all t, x,
  and { u*(t) | t ∈ [0, T] } is optimal
- Limitations of the theorem

PROOF
- Let { ( û(t), x̂(t) ) | t ∈ [0, T] } be any admissible control-state trajectory. We have for all t ∈ [0, T]
  0 ≤ g( x̂(t), û(t) ) + ∇_t V( t, x̂(t) ) + ∇_x V( t, x̂(t) )' f( x̂(t), û(t) ).
- Using the system equation dx̂(t)/dt = f( x̂(t), û(t) ), the RHS of the above is equal to
  g( x̂(t), û(t) ) + (d/dt) [ V( t, x̂(t) ) ]
- Integrating this expression over t ∈ [0, T],
  0 ≤ ∫_0^T g( x̂(t), û(t) ) dt + V( T, x̂(T) ) − V( 0, x̂(0) ).
- Using V(T, x) = h(x) and x̂(0) = x(0), we have
  V( 0, x(0) ) ≤ h( x̂(T) ) + ∫_0^T g( x̂(t), û(t) ) dt.
- If we use u*(t) and x*(t) in place of û(t) and x̂(t), the inequalities become equalities, and
  V( 0, x(0) ) = h( x*(T) ) + ∫_0^T g( x*(t), u*(t) ) dt

EXAMPLE OF THE HJB EQUATION
- Consider the scalar system ẋ(t) = u(t), with | u(t) | ≤ 1 and cost (1/2)( x(T) )². The HJB equation is
  0 = min_{|u| ≤ 1} [ ∇_t V(t, x) + ∇_x V(t, x)·u ], for all t, x,
  with the terminal condition V(T, x) = (1/2)x²
- Evident candidate for optimality: μ*(t, x) = −sgn(x). Corresponding cost-to-go
  J*(t, x) = (1/2) ( max{ 0, |x| − (T − t) } )².
- We verify that J* solves the HJB Eq., and that u = −sgn(x) attains the min in the RHS. Indeed,
  ∇_t J*(t, x) = max{ 0, |x| − (T − t) },
  ∇_x J*(t, x) = sgn(x)·max{ 0, |x| − (T − t) }.
  Substituting, the HJB Eq. becomes
  0 = min_{|u| ≤ 1} [ 1 + sgn(x)·u ]·max{ 0, |x| − (T − t) }
  and holds as an identity for all x and t.

LINEAR QUADRATIC PROBLEM
- Consider the n-dimensional linear system
  ẋ(t) = A x(t) + B u(t),
  and the quadratic cost
  x(T)' Q_T x(T) + ∫_0^T ( x(t)' Q x(t) + u(t)' R u(t) ) dt
- The HJB equation is
  0 = min_{u ∈ R^m} [ x'Qx + u'Ru + ∇_t V(t, x) + ∇_x V(t, x)' (Ax + Bu) ],
  with the terminal condition V(T, x) = x' Q_T x.
- We try a solution of the form
  V(t, x) = x' K(t) x,  K(t): n × n symmetric,
  and show that V(t, x) solves the HJB equation if
  K̇(t) = −K(t)A − A'K(t) + K(t)B R⁻¹ B'K(t) − Q
  with the terminal condition K(T) = Q_T
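As a small numerical companion to the linear quadratic slide above, the sketch below integrates the scalar version of the Riccati differential equation backward in time with a simple Euler scheme; the coefficient values and step count are arbitrary illustrative choices, not part of the slides.

    # backward Euler integration of the scalar Riccati ODE from the slide above:
    #   Kdot(t) = -2*a*K(t) + (b**2 / r) * K(t)**2 - q,    K(T) = qT
    # (scalar case of Kdot = -KA - A'K + K B R^-1 B' K - Q; illustrative values)
    a, b, q, r, qT = 1.0, 1.0, 1.0, 1.0, 0.5
    T, steps = 2.0, 2000
    h = T / steps

    K = qT
    for _ in range(steps):
        Kdot = -2.0 * a * K + (b**2 / r) * K**2 - q
        K -= h * Kdot                 # step backward in time from t to t - h

    print("K(0) ~", K)                # V(0, x) ~ K(0) * x**2; feedback u = -(b / r) * K(t) * x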

6.231 DYNAMIC PROGRAMMING LECTURE 5

LECTURE OUTLINE
- Examples of stochastic DP problems
- Linear-quadratic problems
- Inventory control

LINEAR-QUADRATIC PROBLEMS
- System: x_{k+1} = A_k x_k + B_k u_k + w_k
- Quadratic cost
  E_{w_k, k=0,1,...,N−1}{ x_N' Q_N x_N + sum_{k=0}^{N−1} ( x_k' Q_k x_k + u_k' R_k u_k ) }
  where Q_k ≥ 0 and R_k > 0 (in the positive (semi)definite sense). The w_k are independent and zero mean
- DP algorithm:
  J_N(x_N) = x_N' Q_N x_N,
  J_k(x_k) = min_{u_k} E{ x_k' Q_k x_k + u_k' R_k u_k + J_{k+1}( A_k x_k + B_k u_k + w_k ) }
- Key facts:
  - J_k(x_k) is quadratic
  - Optimal policy { μ*_0, ..., μ*_{N−1} } is linear: μ*_k(x_k) = L_k x_k
  - Similar treatment of a number of variants

DERIVATION
- By induction verify that
  μ*_k(x_k) = L_k x_k,  J_k(x_k) = x_k' K_k x_k + constant,
  where the L_k are matrices given by
  L_k = −( B_k' K_{k+1} B_k + R_k )⁻¹ B_k' K_{k+1} A_k,
  and where the K_k are symmetric positive semidefinite matrices given by
  K_N = Q_N,
  K_k = A_k' ( K_{k+1} − K_{k+1} B_k ( B_k' K_{k+1} B_k + R_k )⁻¹ B_k' K_{k+1} ) A_k + Q_k.
- This is called the discrete-time Riccati equation.
- Just like DP, it starts at the terminal time N and proceeds backwards.
- Certainty equivalence holds (the optimal policy is the same as when w_k is replaced by its expected value E{w_k} = 0).

ASYMPTOTIC BEHAVIOR OF RICCATI EQ.
- Assume a time-independent system and cost per stage, and some technical assumptions: controllability of (A, B) and observability of (A, C), where Q = C'C
- The Riccati equation converges, lim_{k→−∞} K_k = K, where K is positive definite, and is the unique (within the class of positive semidefinite matrices) solution of the algebraic Riccati equation
  K = A' ( K − KB ( B'KB + R )⁻¹ B'K ) A + Q
- The corresponding steady-state controller μ*(x) = Lx, where
  L = −( B'KB + R )⁻¹ B'KA,
  is stable in the sense that the matrix (A + BL) of the closed-loop system
  x_{k+1} = ( A + BL ) x_k + w_k
  satisfies lim_{k→∞} ( A + BL )^k = 0.

GRAPHICAL PROOF FOR SCALAR SYSTEMS
[figure: F(P) plotted against the 45-degree line; the iterates P_k, P_{k+1} converge to the positive fixed point P*; F has a vertical asymptote at P = −R/B² and approaches A²R/B² + Q as P → ∞]
- Riccati equation (with P_k = K_{N−k}):
  P_{k+1} = A² ( P_k − B² P_k² / ( B² P_k + R ) ) + Q,
  or P_{k+1} = F(P_k), where
  F(P) = A² R P / ( B² P + R ) + Q.
- Note the two steady-state solutions, satisfying P = F(P), of which only one is positive.

RANDOM SYSTEM MATRICES
- Suppose that { A_0, B_0 }, ..., { A_{N−1}, B_{N−1} } are not known but rather are independent random matrices that are also independent of the w_k
- The DP algorithm is
  J_N(x_N) = x_N' Q_N x_N,
  J_k(x_k) = min_{u_k} E_{w_k, A_k, B_k}{ x_k' Q_k x_k + u_k' R_k u_k + J_{k+1}( A_k x_k + B_k u_k + w_k ) }
- Optimal policy μ*_k(x_k) = L_k x_k, where
  L_k = −( R_k + E{ B_k' K_{k+1} B_k } )⁻¹ E{ B_k' K_{k+1} A_k },
  and where the matrices K_k are given by
  K_N = Q_N,
  K_k = E{ A_k' K_{k+1} A_k } − E{ A_k' K_{k+1} B_k } ( R_k + E{ B_k' K_{k+1} B_k } )⁻¹ E{ B_k' K_{k+1} A_k } + Q_k

PROPERTIES
- Certainty equivalence may not hold
- The Riccati equation may not converge to a steady state
[figure: F̃(P) against the 45-degree line, with vertical asymptote at P = −R/E{B²}]
- We have P_{k+1} = F̃(P_k), where
  F̃(P) = E{A²} R P / ( E{B²} P + R ) + Q + T P² / ( E{B²} P + R ),
  T = E{A²} E{B²} − ( E{A} )² ( E{B} )²
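The discrete-time Riccati equation from the derivation slide can be iterated directly. The following Python sketch runs it backward for an illustrative 2-dimensional time-invariant system (matrices chosen arbitrarily, with (A, B) controllable and Q > 0) and exhibits the steady-state gain and the stable closed-loop matrix described on the asymptotic-behavior slide.

    import numpy as np

    # illustrative time-invariant data (not from the slides)
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    Q = np.eye(2)
    R = np.array([[0.5]])

    K = np.eye(2)                     # K_N = Q_N (illustrative terminal cost)
    for _ in range(500):              # K_k = A'(K - K B (B'K B + R)^-1 B'K) A + Q
        BtK = B.T @ K
        K = A.T @ (K - BtK.T @ np.linalg.solve(B.T @ K @ B + R, BtK)) @ A + Q

    L = -np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)     # steady-state gain, mu*(x) = L x
    print("closed-loop eigenvalues:", np.linalg.eigvals(A + B @ L))   # strictly inside the unit circle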

INVENTORY CONTROL
- x_k: stock, u_k: stock purchased, w_k: demand
  x_{k+1} = x_k + u_k − w_k,  k = 0, 1, ..., N−1
- Minimize
  E{ sum_{k=0}^{N−1} ( c u_k + H(x_k + u_k) ) }
  where
  H(x + u) = E{ r(x + u − w) }
  is the expected shortage/holding cost, with r defined, e.g., for some p > 0 and h > 0, as
  r(x) = p·max(0, −x) + h·max(0, x)
- DP algorithm:
  J_N(x_N) = 0,
  J_k(x_k) = min_{u_k ≥ 0} [ c u_k + H(x_k + u_k) + E{ J_{k+1}(x_k + u_k − w_k) } ]

OPTIMAL POLICY
- The DP algorithm can be written as
  J_N(x_N) = 0,
  J_k(x_k) = min_{u_k ≥ 0} G_k(x_k + u_k) − c x_k,
  where
  G_k(y) = c y + H(y) + E{ J_{k+1}(y − w) }.
- If G_k is convex and lim_{|x|→∞} G_k(x) → ∞, we have
  μ*_k(x_k) = S_k − x_k if x_k < S_k,  0 if x_k ≥ S_k,
  where S_k minimizes G_k(y).
- This is shown, assuming that c < p, by showing that J_k is convex for all k, and lim_{|x|→∞} J_k(x) → ∞

JUSTIFICATION
- Graphical inductive proof that J_k is convex.
[figure: plots of −cy, H(y), and cy + H(y) against y, showing the minimizer S_{N−1} of G_{N−1} and the resulting J_{N−1}(x_{N−1})]

6.231 DYNAMIC PROGRAMMING LECTURE 6

LECTURE OUTLINE
- Stopping problems
- Scheduling problems
- Other applications

PURE STOPPING PROBLEMS
- Two possible controls:
  - Stop (incur a one-time stopping cost, and move to a cost-free and absorbing stop state)
  - Continue [using x_{k+1} = f_k(x_k, w_k) and incurring the cost-per-stage]
- Each policy consists of a partition of the set of states x_k into two regions:
  - Stop region, where we stop
  - Continue region, where we continue
[figure: state space split into a STOP REGION and a CONTINUE REGION, with transitions into the stop state]

EXAMPLE: ASSET SELLING
- A person has an asset, and at k = 0, 1, ..., N−1 receives a random offer w_k
- May accept w_k and invest the money at a fixed rate of interest r, or reject w_k and wait for w_{k+1}. Must accept the last offer w_{N−1}
- DP algorithm (x_k: current offer, T: stop state):
  J_N(x_N) = x_N if x_N ≠ T,  0 if x_N = T,
  J_k(x_k) = max[ (1 + r)^{N−k} x_k, E{ J_{k+1}(w_k) } ] if x_k ≠ T,  0 if x_k = T.
- Optimal policy:
  accept the offer x_k if x_k > α_k,
  reject the offer x_k if x_k < α_k,
  where
  α_k = E{ J_{k+1}(w_k) } / (1 + r)^{N−k}.

FURTHER ANALYSIS
[figure: the thresholds α_1, α_2, ..., α_{N−1} over k = 0, 1, ..., N, separating the ACCEPT and REJECT regions]
- Can show that α_k ≥ α_{k+1} for all k
- Proof: Let V_k(x_k) = J_k(x_k) / (1 + r)^{N−k} for x_k ≠ T. Then the DP algorithm is V_N(x_N) = x_N and
  V_k(x_k) = max[ x_k, (1 + r)⁻¹ E_w{ V_{k+1}(w) } ].
  We have α_k = E_w{ V_{k+1}(w) } / (1 + r), so it is enough to show that V_k(x) ≥ V_{k+1}(x) for all x and k. Start with V_{N−1}(x) ≥ V_N(x) and use the monotonicity property of DP.
- We can also show that α_k → ā as k → −∞. This suggests that for an infinite horizon the optimal policy is stationary.
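The thresholds α_k of the asset selling example can be computed by running the scaled DP recursion V_k from the proof above. Here is a short Python sketch that does so for a hypothetical uniform offer distribution; the distribution, horizon, and interest rate are made-up inputs, not part of the slides.

    import numpy as np

    # illustrative data: offers w uniform on [0, 1], horizon N, interest rate r
    N, r = 10, 0.05
    w_grid = np.linspace(0.0, 1.0, 1001)     # discretized offers, treated as equally likely

    V = w_grid.copy()                        # V_N(x) = x on the offer grid
    alphas = []
    for k in range(N - 1, -1, -1):
        alpha = np.mean(V) / (1.0 + r)       # alpha_k = E{V_{k+1}(w)} / (1 + r)
        alphas.append(alpha)
        V = np.maximum(w_grid, alpha)        # V_k(x) = max(x, alpha_k)

    alphas = alphas[::-1]                    # alpha_0, ..., alpha_{N-1}
    print(["%.3f" % a for a in alphas])      # nonincreasing in k, as shown above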
GENERAL STOPPING PROBLEMS
- At time k, we may stop at cost t(x_k) or choose a control u_k ∈ U(x_k) and continue:
  J_N(x_N) = t(x_N),
  J_k(x_k) = min[ t(x_k), min_{u_k ∈ U(x_k)} E{ g(x_k, u_k, w_k) + J_{k+1}( f(x_k, u_k, w_k) ) } ]
- Optimal to stop at time k for x in the set
  T_k = { x | t(x) ≤ min_{u ∈ U(x)} E{ g(x, u, w) + J_{k+1}( f(x, u, w) ) } }
- Since J_{N−1}(x) ≤ J_N(x), we have J_k(x) ≤ J_{k+1}(x) for all k, so
  T_0 ⊂ ··· ⊂ T_k ⊂ T_{k+1} ⊂ ··· ⊂ T_{N−1}.
- The interesting case is when all the T_k are equal (to T_{N−1}, the set where it is better to stop than to go one step and stop). This can be shown to be true if
  f(x, u, w) ∈ T_{N−1}, for all x ∈ T_{N−1}, u ∈ U(x), w.

SCHEDULING PROBLEMS
- We have a set of tasks to perform; the ordering is subject to optimal choice. Costs depend on the order
- There may be stochastic uncertainty, and precedence and resource availability constraints
- Some of the hardest combinatorial problems are of this type (e.g., traveling salesman, vehicle routing, etc.)
- Some special problems admit a simple quasi-analytical solution method:
  - The optimal policy has an index form, i.e., each task has an easily calculable cost index, and it is optimal to select the task that has the minimum value of index (multi-armed bandit problems - to be discussed later)
  - Some problems can be solved by an interchange argument (start with some schedule, interchange two adjacent tasks, and see what happens). They require the existence of an optimal policy which is open-loop.

EXAMPLE: THE QUIZ PROBLEM
- Given a list of N questions. If question i is answered correctly (given probability p_i), we receive reward R_i; if not, the quiz terminates. Choose the order of questions to maximize expected reward.
- Let i and j be the kth and (k+1)st questions in an optimally ordered list
  L = ( i_0, ..., i_{k−1}, i, j, i_{k+2}, ..., i_{N−1} )
  E{reward of L} = E{reward of { i_0, ..., i_{k−1} }}
    + p_{i_0} ··· p_{i_{k−1}} ( p_i R_i + p_i p_j R_j )
    + p_{i_0} ··· p_{i_{k−1}} p_i p_j E{reward of { i_{k+2}, ..., i_{N−1} }}
- Consider the list with i and j interchanged
  L' = ( i_0, ..., i_{k−1}, j, i, i_{k+2}, ..., i_{N−1} )
  Since L is optimal, E{reward of L} ≥ E{reward of L'}, so it follows that
  p_i R_i + p_i p_j R_j ≥ p_j R_j + p_j p_i R_i
  or
  p_i R_i / (1 − p_i) ≥ p_j R_j / (1 − p_j).
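The interchange argument above yields an index rule: answer the questions in decreasing order of p_i R_i / (1 − p_i). The small Python check below compares that rule with brute-force enumeration of all orderings on random instances; the instance size and distributions are arbitrary illustrative choices.

    import itertools
    import random

    def expected_reward(order, p, R):
        # expected total reward if the questions are answered in the given order
        total, surv = 0.0, 1.0
        for i in order:
            total += surv * p[i] * R[i]
            surv *= p[i]                     # must answer i correctly to continue
        return total

    random.seed(0)
    for _ in range(100):
        n = 5
        p = [random.random() for _ in range(n)]
        R = [random.random() for _ in range(n)]
        best = max(itertools.permutations(range(n)),
                   key=lambda o: expected_reward(o, p, R))
        index_rule = sorted(range(n), key=lambda i: p[i] * R[i] / (1 - p[i]), reverse=True)
        assert abs(expected_reward(best, p, R) - expected_reward(index_rule, p, R)) < 1e-9
    print("index rule matched brute-force enumeration on all random instances")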

MINIMAX CONTROL
- Consider the basic problem with the difference that the disturbance w_k, instead of being random, is just known to belong to a given set W_k(x_k, u_k).
- Find a policy π that minimizes the cost
  J_π(x_0) = max_{w_k ∈ W_k(x_k, μ_k(x_k)), k=0,1,...,N−1} [ g_N(x_N) + sum_{k=0}^{N−1} g_k(x_k, μ_k(x_k), w_k) ]
- The DP algorithm takes the form
  J_N(x_N) = g_N(x_N),
  J_k(x_k) = min_{u_k ∈ U(x_k)} max_{w_k ∈ W_k(x_k, u_k)} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) ]
  (Exercise 1.5 in the text, solution posted on the www).

UNKNOWN-BUT-BOUNDED CONTROL
- For each k, keep the x_k of the controlled system
  x_{k+1} = f_k( x_k, μ_k(x_k), w_k )
  inside a given set X_k, the target set at time k.
- This is a minimax control problem, where the cost at stage k is
  g_k(x_k) = 0 if x_k ∈ X_k,  1 if x_k ∉ X_k.
- We must reach at time k the set
  X̄_k = { x_k | J_k(x_k) = 0 }
  in order to be able to maintain the state within the subsequent target sets.
- Start with X̄_N = X_N, and for k = 0, 1, ..., N−1,
  X̄_k = { x_k ∈ X_k | there exists u_k ∈ U_k(x_k) such that f_k(x_k, u_k, w_k) ∈ X̄_{k+1}, for all w_k ∈ W_k(x_k, u_k) }

6.231 DYNAMIC PROGRAMMING LECTURE 7

LECTURE OUTLINE
- Problems with imperfect state info
- Reduction to the perfect state info case
- Linear quadratic problems
- Separation of estimation and control

BASIC PROBLEM WITH IMPERFECT STATE INFO
- Same as the basic problem of Chapter 1 with one difference: the controller, instead of knowing x_k, receives at each time k an observation of the form
  z_0 = h_0(x_0, v_0),  z_k = h_k(x_k, u_{k−1}, v_k), k ≥ 1
- The observation z_k belongs to some space Z_k.
- The random observation disturbance v_k is characterized by a probability distribution
  P_{v_k}( · | x_k, ..., x_0, u_{k−1}, ..., u_0, w_{k−1}, ..., w_0, v_{k−1}, ..., v_0)
- The initial state x_0 is also random and characterized by a probability distribution P_{x_0}.
- The probability distribution P_{w_k}( · | x_k, u_k) of w_k is given, and it may depend explicitly on x_k and u_k but not on w_0, ..., w_{k−1}, v_0, ..., v_{k−1}.
- The control u_k is constrained to a given subset U_k (this subset does not depend on x_k, which is not assumed known).

INFORMATION VECTOR AND POLICIES
- Denote by I_k the information vector, i.e., the information available at time k:
  I_k = (z_0, z_1, ..., z_k, u_0, u_1, ..., u_{k−1}), k ≥ 1,  I_0 = z_0
- We consider policies π = { μ_0, μ_1, ..., μ_{N−1} }, where each function μ_k maps the information vector I_k into a control u_k and
  μ_k(I_k) ∈ U_k, for all I_k, k ≥ 0
- We want to find a policy π that minimizes
  J_π = E_{x_0, w_k, v_k, k=0,...,N−1}{ g_N(x_N) + sum_{k=0}^{N−1} g_k(x_k, μ_k(I_k), w_k) }

k=0gk_xk, k(Ik), wk__subjecttotheequationsxk+1= fk_xk, k(Ik), wk_, k 0,z0= h0(x0, v0), zk= hk_xk, k1(Ik1), vk_, k 1REFORMULATIONASPERFECTINFOPROBL. WehaveIk+1= (Ik, zk+1, uk), k = 0, 1, . . . , N2, I0= z0Viewthis as a dynamic systemwithstate Ik,controluk,andrandomdisturbancezk+1 WehaveP(zk+1 [Ik, uk) = P(zk+1 [Ik, uk, z0, z1, . . . , zk),since z0, z1, . . . , zkare part of the information vec-torIk. Thustheprobabilitydistributionof zk+1depends explicitly only on the state Ikand controlukandnotonthepriordisturbanceszk, . . . , z0 WriteE_gk(xk, uk, wk)_= E_Exk,wk_gk(xk, uk, wk)|Ik, uk__sothecostperstageofthenewsystemis gk(Ik, uk) =Exk,wk_gk(xk, uk, wk) [Ik, uk_DPALGORITHMWriting the DP algorithm for the (reformulated)perfectstateinfoproblemanddoingthealgebra:Jk(Ik) = minukUk_Exk, wk, zk+1_gk(xk, uk, wk)+Jk+1(Ik, zk+1, uk)|Ik, uk__fork = 0, 1, . . . , N 2,andfork = N 1,JN1(IN1) = minuN1UN1_ExN1, wN1_gN_fN1(xN1, uN1, wN1)_+gN1(xN1, uN1, wN1)|IN1, uN1__ TheoptimalcostJisgivenbyJ= Ez0_J0(z0)_LINEAR-QUADRATICPROBLEMS System: xk+1= Akxk +Bkuk +wk QuadraticcostEwkk=0,1,...,N1_xNQNxN+N1

k=0(xkQkxk +ukRkuk)_whereQk 0andRk> 0 Observationszk= Ckxk +vk, k = 0, 1, . . . , N 1 w0, . . . , wN1,v0, . . . , vN1indep.zeromean Keyfacttoshow:Optimal policy 0, . . . , N1 is of the form:k(Ik) = LkExk [ IkLk: sameasfortheperfectstateinfocaseEstimation problem and control problem canbesolvedseparatelyDPALGORITHMI LaststageN 1(supressingindexN 1):JN1(IN1) =minuN1_ExN1,wN1_xN1QxN1+ uN1RuN1 + (AxN1 +BuN1 + wN1) Q(AxN1 +BuN1 +wN1)|IN1, uN1__ SinceEwN1 [ IN1=EwN1=0, theminimizationinvolvesminuN1_uN1(BQB +R)uN1+ 2E{xN1|IN1}AQBuN1TheminimizationyieldstheoptimalN1:uN1= N1(IN1) = LN1ExN1 [IN1whereLN1= (BQB +R)1BQADPALGORITHMII SubstitutingintheDPalgorithmJN1(IN1) =ExN1_xN1KN1xN1 [IN1_+ExN1__xN1ExN1 [IN1_ PN1_xN1ExN1 [IN1_ [IN1_+EwN1wN1QNwN1,wherethematricesKN1andPN1aregivenbyPN1= AN1QNBN1(RN1 +BN1QNBN1)1 BN1QNAN1,KN1= AN1QNAN1PN1 +QN1 Note the structure of JN1: inadditiontothe quadratic andconstant terms, it involves aquadraticintheestimationerrorxN1ExN1 [IN1DPALGORITHMIII DPequationforperiodN 2:JN2(IN2) =minuN2_ExN2,wN2,zN1{xN2QxN2+uN2RuN2 +JN1(IN1)|IN2, uN2}_=E_xN2QxN2|IN2_+minuN2_uN2RuN2+E_xN1KN1xN1|IN2, uN2__+E__xN1 E{xN1|IN1}_ PN1_xN1 E{xN1|IN1}_|IN2, uN2_+EwN1{wN1QNwN1} Keypoint: Wehaveexcludedthenexttolastterm from the minimization with respect to uN2 This term turns out to be independentof uN2QUALITYOFESTIMATIONLEMMA Currentestimationerrorisunaectedbypastcontrols: Foreveryk,thereisafunctionMks.t.xkExk [ Ik = Mk(x0, w0, . . . , wk1, v0, . . . , vk),independentlyofthepolicybeingused Consequence: Usingthelemma,xN1ExN1 [IN1 = N1,whereN1: functionofx0, w0, . . . , wN2, v0, . . . , vN1 SinceN1isindependentofuN2,thecondi-tionalexpectationofN1PN1N1satisesEN1PN1N1 [IN2, uN2= EN1PN1N1 [IN2andisindependentofuN2. SominimizationintheDPalgorithmyieldsuN2= N2(IN2) = LN2ExN2 [IN2FINALRESULT Continuingsimilarly(usingalsothequalityofestimationlemma)k(Ik) = LkExk [Ik,whereLkisthesameasforperfectstateinfo:Lk= (Rk +BkKk+1Bk)1BkKk+1Ak,withKkgeneratedusingtheRiccatiequation:KN= QN, Kk= AkKk+1AkPk +Qk,Pk= AkKk+1Bk(Rk +BkKk+1Bk)1BkKk+1Akxk + 1 = Akxk + Bkuk + wkLkukwkxkzk = Ckxk + vkDelayEstimatorE{xk | Ik}uk- 1zkvkzkukSEPARATIONINTERPRETATION The optimal controller can be decomposedinto(a) Anestimator,whichusesthedatatogener-atetheconditionalexpectationExk [Ik.(b) An actuator, which multiplies Exk [ Ik bythegainmatrixLkandappliesthecontrolinputuk= LkExk [Ik. Generically the estimate x of a random vector xgivensomeinformation(randomvector)I,whichminimizesthemeansquarederrorEx|x x|2[I = |x|22Ex [ I x +| x|2is Ex [I (set to zero the derivative with respectto xoftheabovequadraticform). The estimator portion of the optimal controllerisoptimalfortheproblemofestimatingthestatexkassumingthecontrolisnotsubjecttochoice. Theactuatorportionisoptimalforthecontrolproblemassumingperfectstateinformation.STEADYSTATE/IMPLEMENTATIONASPECTS As N , the solution of the Riccati equationconvergestoasteadystateandLk L. If x0, wk, andvkareGaussian, Exk [ IkisalinearfunctionofIkandisgeneratedbyanicerecursivealgorithm,theKalmanlter. TheKalmanlterinvolvesalsoaRiccatiequa-tion, soforN , andastationarysystem, italsohasasteady-statestructure. Thus,forGaussianuncertainty,thesolutionisniceandpossessesasteadystate.For nonGaussian uncertainty, computing Exk [ Ikmaybeverydicult, soasuboptimal solutionistypicallyused. Mostcommonsuboptimal controller: ReplaceExk [ Ik by the estimate produced by the Kalmanlter(actasifx0,wk,andvkareGaussian). 
Itcanbeshownthatthiscontrollerisoptimalwithin the class of controllers that are linear func-tionsofIk.6.231DYNAMICPROGRAMMINGLECTURE8LECTUREOUTLINE DPforimperfectstateinfo Sucientstatistics Conditional state distributionas a sucientstatistic Finite-statesystems ExamplesREVIEW:IMPERFECTSTATEINFOPROBLEM Insteadofknowingxk,wereceiveobservationsz0= h0(x0, v0), zk= hk(xk, uk1, vk), k 0 Ik: informationvectoravailableattimek:I0= z0, Ik= (z0, z1, . . . , zk, u0, u1, . . . , uk1), k 1Optimization over policies = 0, 1, . . . , N1,wherek(Ik) Uk,forallIkandk. FindapolicythatminimizesJ=Ex0,wk,vkk=0,...,N1_gN(xN) +N1

k=0gk_xk, k(Ik), wk__subjecttotheequationsxk+1= fk_xk, k(Ik), wk_, k 0,z0= h0(x0, v0), zk= hk_xk, k1(Ik1), vk_, k 1DPALGORITHM DPalgorithm:Jk(Ik) = minukUk_Exk, wk, zk+1_gk(xk, uk, wk)+Jk+1(Ik, zk+1, uk)|Ik, uk__fork = 0, 1, . . . , N 2,andfork = N 1,JN1(IN1) = minuN1UN1_ExN1, wN1_gN_fN1(xN1, uN1, wN1)_+gN1(xN1, uN1, wN1)|IN1, uN1__ TheoptimalcostJisgivenbyJ= Ez0_J0(z0)_.SUFFICIENTSTATISTICS Suppose that we can nd a function Sk(Ik) suchthattheright-handsideoftheDPalgorithmcanbewrittenintermsofsomefunctionHkasminukUkHk_Sk(Ik), uk_.Such a function Sk is called a sucient statistic. Anoptimal policyobtainedbytheprecedingminimizationcanbewrittenask(Ik) = k_Sk(Ik)_,wherekisanappropriatefunction. Exampleofasucientstatistic: Sk(Ik) = Ik AnotherimportantsucientstatisticSk(Ik) = Pxk|IkDPALGORITHMINTERMSOFPXK|IK FilteringEquation: Pxk|Ikisgeneratedrecur-sively by a dynamic system (estimator) of the formPxk+1|Ik+1= k_Pxk|Ik, uk, zk+1_forasuitablefunctionk DPalgorithmcanbewrittenasJk(Pxk|Ik) = minukUk_Exk,wk,zk+1_gk(xk, uk, wk)+Jk+1_k(Pxk|Ik, uk, zk+1)_ [Ik, uk__ It is the DP algorithm for a new problem whosestateisPxk|Ik(alsocalledbeliefstate)ukxkDelayEstimatoruk- 1uk- 1vkzkzkwkk- 1Actuatorxk + 1 = fk(xk ,uk ,wk) zk = hk(xk ,uk - 1,vk)System MeasurementPxk| IkkEXAMPLE:ASEARCHPROBLEM Ateachperiod, decidetosearchornotsearchasitethatmaycontainatreasure. Ifwesearchandatreasureispresent, wenditwithprob.andremoveitfromthesite. Treasuresworth: V . Costofsearch: C States: treasure present&treasure notpresent Each search can be viewed as an observation ofthestate Denotepk: prob.oftreasurepresentatthestartoftimekwithp0given. pkevolvesattimekaccordingtotheequationpk+1=___pkifnotsearch,0 ifsearchandndtreasure,pk(1)pk(1)+1pkifsearchandnotreasure.Thisisthelteringequation.SEARCHPROBLEM(CONTINUED) DPalgorithmJk(pk) = max_0, C +pkV+ (1 pk)Jk+1_pk(1 )pk(1 ) + 1 pk__,withJN(pN) = 0. CanbeshownbyinductionthatthefunctionsJksatisfyJk(pk)___= 0 if pk CV ,> 0 if pk>CV . Furthermore, itisoptimal tosearchatperiodkifandonlyifpkV C(expectedreward from the nextsearch the costofthesearch-amyopicrule)FINITE-STATESYSTEMS-POMDP Suppose the systemis a nite-state Markovchain,withstates1, . . . , n. Thentheconditional probabilitydistributionPxk|Ikisann-vector_P(xk= 1 [ Ik), . . . , P(xk= n [ Ik)_ TheDPalgorithmcanbeexecutedoverthen-dimensional simplex (state space is not expandingwithincreasingk) Whenthecontrol andobservationspaces arealsonitesets theproblemis calledaPOMDP(PartiallyObservedMarkovDecisionProblem). For POMDPit turns out that thecost-to-gofunctions Jkinthe DPalgorithmare piecewiselinearandconcave(Exercise5.7). This is conceptually important. It is also usefulinpracticebecauseitformsthebasisforapproxi-mations.INSTRUCTIONEXAMPLEI Teachingastudentsomeitem. PossiblestatesareL: Itemlearned,orL: Itemnotlearned. Possibledecisions: T: Terminatetheinstruc-tion, or T: Continue the instruction for one periodand then conduct a test that indicates whether thestudenthaslearnedtheitem. Possible test outcomes: R: Student gives a cor-rectanswer,orR: Studentgivesanincorrectan-swer. ProbabilisticstructureL L Rr t1 11 - r 1 - tL R L Costofinstruction: Iperperiod Cost of terminatinginstruction: 0if studenthaslearnedtheitem,andC> 0ifnot.INSTRUCTIONEXAMPLEIILet pk: prob. student has learned the item giventhetestresultssofarpk= P(xk= L [ z0, z1, . . . , zk). UsingBayesruleweobtainthelteringequa-tionpk+1= (pk, zk+1)=_1(1t)(1pk)1(1t)(1r)(1pk)ifzk+1= R,0 ifzk+1= R. 
DPalgorithm:Jk(pk) = min_(1 pk)C,I+Ezk+1_Jk+1_(pk, zk+1)___.startingwithJN1(pN1) = min_(1pN1)C,I+(1t)(1pN1)C.INSTRUCTIONEXAMPLEIII WritetheDPalgorithmasJk(pk) = min_(1 pk)C,I +Ak(pk),whereAk(pk) =P(zk+1= R [Ik)Jk+1_(pk, R)_+P(zk+1= R [Ik)Jk+1_(pk, R)_ Can show by induction that Ak(p) are piecewiselinear,concave,monotonicallydecreasing,withAk1(p) Ak(p) Ak+1(p), forall p [0, 1].(Thecost-to-goatknowledgeprob.pincreasesaswecomeclosertotheendofhorizon.)0pCII + AN - 1(p)I + AN - 2(p)I + AN - 3(p)1aN - 1aN - 3aN - 21 -IC6.231DYNAMICPROGRAMMINGLECTURE9LECTUREOUTLINE Suboptimalcontrol Costapproximationmethods: Classication Certaintyequivalentcontrol: Anexample Limitedlookaheadpolicies Performancebounds Problemapproximationapproach Parametriccost-to-goapproximationPRACTICALDIFFICULTIESOFDP ThecurseofdimensionalityExponential growth of the computational andstoragerequirementsasthenumberofstatevariablesandcontrolvariablesincreasesQuickexplosionof thenumber of statesincombinatorialproblemsIntractabilityofimperfectstateinformationproblems ThecurseofmodelingMathematicalmodelsComputer/simulationmodels Theremaybereal-timesolutionconstraintsA family of problems may be addressed. Thedata of the problem to be solved is given withlittleadvancenoticeThe problem data may change as the systemiscontrolledneedforon-linereplanningCOST-TO-GOFUNCTIONAPPROXIMATION UseapolicycomputedfromtheDPequationwhere the optimal cost-to-go function Jk+1isreplacedbyanapproximationJk+1. (SometimesE_gk_isalsoreplacedbyanapproximation.) Applyk(xk),whichattainstheminimuminminukUk(xk)E_gk(xk, uk, wk)+ Jk+1_fk(xk, uk, wk)__ ThereareseveralwaystocomputeJk+1: O-line approximation:The entire functionJk+1 is computed for every k, before the con-trolprocessbegins. On-line approximation: Only the valuesJk+1(xk+1)at the relevant next states xk+1are com-puted and used to compute ukjust after thecurrentstatexkbecomesknown. Simulation-basedmethods: These are o-line and on-line methods that share the com-moncharacteristic that theyare basedonMonte-Carlo simulation. Some of these meth-ods aresuitablefor problems of verylargesize.CERTAINTYEQUIVALENTCONTROL(CEC) Idea: Replace the stochastic problemwithadeterministicproblem Ateachtimek,thefutureuncertainquantitiesarexedatsometypicalvalues On-line implementation for a perfect state infoproblem. Ateachtimek:(1) Fixthewi, i k, at somewi. Solvethedeterministicproblem:minimize gN(xN) +N1

i=kgi_xi, ui, wi_wherexkisknown,andui Ui, xi+1= fi_xi, ui, wi_.(2) Useascontrol therstelementintheopti-malcontrolsequencefound. Soweapply k(xk)thatminimizesgk_xk, uk, wk_+Jk+1_fk(xk, uk, wk)_whereJk+1istheoptimalcostofthecorrespond-ingdeterministicproblem.ALTERNATIVEOFF-LINEIMPLEMENTATION Let_d0(x0), . . . , dN1(xN1)_beanoptimalcontroller obtained from the DP algorithm for thedeterministicproblemminimize gN(xN) +N1

k=0gk_xk, k(xk), wk_subjectto xk+1= fk_xk, k(xk), wk_, k(xk) Uk TheCECappliesattimekthecontrol inputdk(xk). Inanimperfectinfoversion, xkisreplacedbyanestimatexk(Ik).xkDelayEstimatoruk- 1uk- 1vkzkzkwkActuatorxk + 1 = fk(xk ,uk ,wk) zk = hk(xk ,uk - 1,vk)System Measurementmkduk=mkd(xk)xk(Ik)PARTIALLYSTOCHASTICCEC Instead of xing all future disturbances to theirtypical values, x only some, and treat the rest asstochastic. Important special case: Treat animperfectstateinformationproblemasoneofperfectstateinformation, usinganestimatexk(Ik)of xkasifitwereexact. Multiaccess communication example:black Con-sider controlling the slottedAloha system(Ex-ample 5.1.1 inthe text) by optimally choosingthe probabilityof transmissionof waitingpack-ets. This is ahardproblemof imperfect stateinfo,whoseperfectstateinfoversioniseasy. NaturalpartiallystochasticCEC: k(Ik) = min_1,1xk(Ik)_,wherexk(Ik)isanestimateofthecurrentpacketbacklogbasedontheentirepastchannel historyofsuccesses,idles,andcollisions(whichisIk).GENERALCOST-TO-GOAPPROXIMATION One-steplookahead(1SL) policy: At eachkandstatexk,usethecontrolk(xk)thatminukUk(xk)E_gk(xk, uk, wk)+ Jk+1_fk(xk, uk, wk)__,whereJN= gN.Jk+1: approximation to true cost-to-go Jk+1 Two-steplookahead policy: At eachk andxk, use the control k(xk) attaining the minimumabove, wherethefunctionJk+1isobtainedusinga 1SL approximation (solve a 2-step DP problem). IfJk+1isreadilyavailableandtheminimiza-tionaboveisnottoohard, the1SLpolicyisim-plementableon-line. Sometimes one also replaces Uk(xk) above withasubsetofmostpromisingcontrolsUk(xk). As thelengthof lookaheadincreases, there-quiredcomputationquicklyexplodes.PERFORMANCEBOUNDSFOR1SL Let Jk(xk) be the cost-to-go from (xk, k) of the1SLpolicy,basedonfunctionsJk. Assumethatforall(xk, k),wehaveJk(xk) Jk(xk), (*)whereJN= gNandforallk,Jk(xk) = minukUk(xk)E_gk(xk, uk, wk)+Jk+1_fk(xk, uk, wk)__,[soJk(xk)iscomputedalongwithk(xk)]. ThenJk(xk) Jk(xk), forall(xk, k). Important application: WhenJkis the cost-to-go of some heuristic policy (then the 1SL policy iscalledthe rolloutpolicy). TheboundcanbeextendedtothecasewherethereisakintheRHSof(*). ThenJk(xk) Jk(xk) +k + +N1COMPUTATIONALASPECTS Sometimes nonlinear programming can be usedto calculate the 1SL or the multistep version [par-ticularlywhenUk(xk)isnotadiscreteset]. Con-nection with stochastic programming methods(seetext). Thechoiceof theapproximatingfunctionsJkiscritical,andiscalculatedinavarietyofways. Someapproaches:(a) ProblemApproximation: Approximatetheoptimal cost-to-go withsome cost derivedfromarelatedbutsimplerproblem(b) Parametric Cost-to-Go Approximation: Ap-proximate the optimal cost-to-go with a func-tion of a suitable parametric form, whose pa-rameters are tuned by some heuristic or sys-tematic scheme (Neuro-Dynamic Program-ming)(c) Rollout Approach: Approximate the opti-malcost-to-gowiththecostofsomesubop-timalpolicy, whichiscalculatedeitherana-lyticallyorbysimulationPROBLEMAPPROXIMATION Many(problem-dependent)possibilitiesReplace uncertain quantities by nominal val-ues, orsimplifythecalculationof expectedvaluesbylimitedsimulationSimplifydicultconstraintsordynamics Exampleof enforceddecomposition: Routemvehicles thatmoveover a graph. Eachnodehas avalue.Therstvehiclethatpassesthroughthenode collects its value. Maxthe total collectedvalue,subjecttoinitialandnaltimeconstraints(plustimewindowsandotherconstraints). Usuallythe1-vehicleversionoftheproblemismuchsimpler. Thismotivatesanapproximationobtainedbysolvingsinglevehicleproblems. 
1SLscheme: Attimekandstatexk(positionofvehiclesandcollectedvaluenodes), considerallpossiblekthmovesbythevehicles,andattheresulting states we approximate the optimal value-to-gowiththevaluecollected byoptimizingthevehicleroutesone-at-a-timePARAMETRICCOST-TO-GOAPPROXIMATIONUse a cost-to-go approximation from a paramet-ricclassJ(x, r)wherexisthecurrentstateandr=(r1, . . . , rm)isavector of tunablescalars(weights). Byadjustingtheweights, onecanchangetheshapeoftheapproximationJsothatitisrea-sonablyclosetothetrueoptimalcost-to-gofunc-tion. Twokeyissues:Thechoiceof parametricclassJ(x, r) (theapproximationarchitecture).Methodfor tuningthe weights (trainingthearchitecture). Successful application strongly depends on howthese issues are handled,and on insight about theproblem.Sometimes a simulation-based algorithm is used,particularly when there is no mathematical modelofthesystem. We will look in detail at these issues after a fewlectures.APPROXIMATIONARCHITECTURES Dividedinlinearandnonlinear[i.e., linearornonlineardependenceofJ(x, r)onr].Linear architectures are easier to train, but non-linearones(e.g.,neuralnetworks)arericher. ArchitecturesbasedonfeatureextractionFeature ExtractionMappingCost Approximator w/Parameter Vector rFeatureVector y State xCost ApproximationJ(y,r ) Ideally, thefeatures will encodemuchof thenonlinearitythatisinherentinthecost-to-goap-proximated,andtheapproximationmaybequiteaccuratewithoutacomplicatedarchitecture. Sometimesthestatespaceispartitioned, andlocalfeaturesareintroducedforeachsubsetofthepartition(theyare0outsidethesubset). Withawell-chosenfeaturevectory(x),wecanusealineararchitectureJ(x, r) =J_y(x), r_=

iriyi(x)ANEXAMPLE-COMPUTERCHESSPrograms use a feature-based position evaluatorthatassignsascoretoeachmove/positionFeatureExtractionWeightingof FeaturesScoreFeatures:Material balance,Mobility,Safety, etcPosition Evaluator Manycontext-dependentspecialfeatures. Mostoftentheweightingof featuresislinearbutmultisteplookaheadisinvolved. Most oftenthe trainingis done bytrial anderror.6.231DYNAMICPROGRAMMINGLECTURE10LECTUREOUTLINE Rolloutalgorithms Costimprovementproperty Discretedeterministicproblems Approximationsofrolloutalgorithms ModelPredictiveControl(MPC) Discretizationofcontinuoustime Discretizationofcontinuousspace OthersuboptimalapproachesROLLOUTALGORITHMS One-steplookaheadpolicy:Ateachkandstatexk,usethecontrolk(xk)thatminukUk(xk)E_gk(xk, uk, wk)+ Jk+1_fk(xk, uk, wk)__,whereJN= gN.Jk+1: approximation to true cost-to-go Jk+1 Rolloutalgorithm: WhenJkis the cost-to-goofsomeheuristicpolicy(calledthebasepolicy) Cost improvement property (to be shown): Therolloutalgorithmachievesnoworse(andusuallymuch better) cost than the base heuristic startingfromthesamestate.Main diculty: CalculatingJk(xk) may be com-putationallyintensive if the cost-to-go of the basepolicycannotbeanalyticallycalculated.MayinvolveMonteCarlosimulationif theproblemisstochastic.Thingsimproveinthedeterministiccase.EXAMPLE:THEQUIZPROBLEM ApersonisgivenNquestions; answeringcor-rectly question i has probability pi, rewardvi.Quizterminatesattherstincorrectanswer. Problem: Choosetheorderingof questionssoastomaximizethetotalexpectedreward. Assumingnootherconstraints,itisoptimaltouse the index policy: Answer questions in decreas-ingorderofpivi/(1 pi). Withminorchangesintheproblem,theindexpolicyneednotbeoptimal. Examples:Alimit(1stinequalityMin selection of k(xk) ==> 2nd inequalityDenitionofHk, k==>lastequalityDISCRETEDETERMINISTICPROBLEMS Anydiscreteoptimizationproblem(withnitenumber of choices/feasible solutions) can be repre-sented sequentially by breaking down the decisionprocessintostages. A tree/shortest path representation. The leavesofthetreecorrespondtothefeasiblesolutions. Decisionscanbemadeinstages.May complete partial solutions, one stage atatime.Mayapplyrollout withanyheuristic thatcancompleteapartialsolution.Nocostlystochasticsimulationneeded. Example: Traveling salesman problem. Find aminimum cost tour that goes exactly once througheachofNcities.ABC ABD ACB ACD ADB ADCABCDABACADABDC ACBD ACDB ADBC ADCBOrigin Node s AEXAMPLE:THEBREAKTHROUGHPROBLEMroot GivenabinarytreewithNstages. Each arc is either free or is blocked (crossed outinthegure). Problem: Find a free pathfrom the root to theleaves(suchastheoneshownwiththicklines). Base heuristic (greedy): Follow the right branchiffree;elsefollowtheleftbranchiffree. This is ararerollout instancethat admits adetailedanalysis. For large Nandgivenprob. of free branch:therollout algorithmrequires O(N) times morecomputation, buthasO(N)timeslargerprob.ofndingafreepaththanthegreedyalgorithm.DET. EXAMPLE:ONE-DIMENSIONALWALK Apersontakeseitheraunitsteptotheleftoraunitsteptotheright. Minimizethecostg(i)ofthepointiwherehewillendupafterNsteps.g(i)i N N - 2 -N 0(N,0)(0,0)(N,-N) (N,N)i_i_ Base heuristic: Always go to the right. Rolloutndstherightmostlocal minimum. Base heuristic: Compare always go to the rightand always go the left. Choose the best of the two.Rolloutndsaglobal minimum.DET. EXAMPLE:ONE-DIMENSIONALWALK Apersontakeseitheraunitsteptotheleftoraunitsteptotheright. 
Minimizethecostg(i)ofthepointiwherehewillendupafterNsteps.g(i)i N N - 2 -N 0(N,0)(0,0)(N,-N) (N,N)i_i_ Base heuristic: Always go to the right. Rolloutndstherightmostlocal minimum. Base heuristic: Compare always go to the rightand always go the left. Choose the best of the two.Rolloutndsaglobal minimum.AROLLOUTISSUEFORDISCRETEPROBLEMS ThebaseheuristicneednotconstituteapolicyintheDPsense. Reason: Dependingonitsstartingpoint, thebaseheuristicmaynotapplythesamecontrolatthesamestate. As a result the cost improvement property maybe lost (except if the base heuristic has a propertycalledsequential consistency; see the text for aformaldenition). Thecostimprovementpropertyisrestoredintwoways:Thebaseheuristichasapropertycalledse-quential improvement (see the text for a for-maldenition).A variant of the rollout algorithm, called for-tiedrollout, is used, whichenforces costimprovement. Roughlyspeakingthebestsolutionfoundsofar is maintained, anditisfollowedwhenever atanytimethestan-dardversionofthealgorithmtriestofollowaworsesolution(seethetext).ROLLINGHORIZONWITHROLLOUT Wecanusearollinghorizonapproximationincalculatingthecost-to-goofthebaseheuristic. Becausetheheuristicissuboptimal, theratio-naleforalongrollinghorizonbecomesweaker.Example: N-stage stopping problem where thestoppingcostis0, thecontinuationcostiseitheror 1, where0 0)limNE_N1

k=0_tk+1tketg_xk, k(xk)_dtx0= i_ AveragecostperunittimelimN1E{tN}E_N1

k=0_tk+1tkg_xk, k(xk)_dtx0= i_ We will see that both problems have equivalentdiscrete-timeversions.ANOTEONNOTATION The scaled CDF Qij(, u) can be used to modeldiscrete, continuous, andmixeddistributionsforthetransitiontime. Generally, expected values of functions of canbewrittenasintegralsinvolvingd Qij(, u). Forexample, the conditional expected value of giveni,j,anduiswrittenasE [ i, j, u =_0 d Qij(, u)pij(u) IfQij(, u)iscontinuouswithrespectto,itsderivativeqij(, u) =dQijd(, u)can beviewed as a scaleddensityfunction. Ex-pected values of functions of can then be writtenintermsofqij(, u). ForexampleE [ i, j, u =_0 qij(, u)pij(u)dIf Qij(, u) is discontinuous and staircase-like,expectedvaluescanbewrittenassummations.DISCOUNTEDCASE-COSTCALCULATION Forapolicy = 0, 1, . . .,writeJ(i) = E{1sttransitioncost}+E{eJ1(j) | i, 0(i)}where E1sttransitioncost = E__0etg(i, u)dt_andJ1(j)isthecost-to-goof1= 1, 2, . . . Wecalculatethetwocosts intheRHS. TheE1sttransitioncost, if u is applied at state i, isG(i, u) =Ej_E{1sttransitioncost |j}_=n

j=1pij(u)_0__0etg(i, u)dt_dQij(, u)pij(u)=g(i, u)n

j=1_01 edQij(, u) ThustheE1sttransitioncostisG_i, 0(i)_= g_i, 0(i)_n

j=1_01 edQij_, 0(i)_(Thesummationtermcanbeviewedas adis-countedlengthof the transition intervalt1t0.)COSTCALCULATION(CONTINUED) Alsotheexpected(discounted) costfromthenextstatejisE_eJ1(j) [ i, 0(i)_= Ej_Ee[ i, 0(i), jJ1(j) [ i, 0(i)_=n

j=1pij(0(i))__0e dQij(, 0(i))pij(0(i))_J1(j)=n

j=1mij_0(i)_J1(j)wheremij(u)isgivenbymij(u) =_0edQij(, u)_ 0,or process none. The cost per unit time of anunlledorderisc. Maxnumberofunlledordersisn. ThenonzerotransitiondistributionsareQi1(, Fill) = Qi(i+1)(, NotFill) = min_1,max_ Theone-stageexpectedcostGisG(i, Fill) = 0, G(i, NotFill) = c i,where=n

j=1_01 edQij(, u) =_max01 emaxd Thereisaninstantaneouscost g(i, Fill) = K, g(i, NotFill) = 0MANUFACTURERSEXAMPLECONTINUED The eective discountfactors mij(u) in Bell-mansEquationaremi1(Fill) = mi(i+1)(NotFill) = ,where =_0edQij(, u) =_max0emaxd=1 emaxmax BellmansequationhastheformJ(i) = min_K+J(1),ci+J(i+1), i = 1, 2, . . . Asinthediscrete-timecase, wecanconcludethatthereexistsanoptimalthresholdi:lltheorders theirnumberiexceedsiAVERAGECOST MinimizelimN1EtNE__tN0g_x(t), u(t)_dt_assuming there is a special state that is recurrentunderallpolicies TotalexpectedcostofatransitionG(i, u) = g(i, u)i(u),wherei(u): Expectedtransitiontime. WenowapplytheSSPargumentusedforthediscrete-timecase. Dividetrajectoryintocyclesmarked by successive visits to n. The cost at (i, u)isG(i, u) i(u), whereistheoptimal ex-pected cost per unit time. Each cycle is viewed asa state trajectory of a corresponding SSP problemwiththeterminationstatebeingessentiallyn. SoBellmansEq.fortheaveragecostproblem:h(i) = minuU(i)__G(i, u) i(u) +n

j=1pij(u)h(j)__MANUFACTUREREXAMPLE/AVERAGECOST Theexpectedtransitiontimesarei(Fill) = i(NotFill) =max2theexpectedtransitioncostisG(i, Fill) = 0, G(i, NotFill) =c i max2andthereisalsotheinstantaneouscost g(i, Fill) = K, g(i, NotFill) = 0 Bellmansequation:h(i) = min_K max2+h(1),cimax2max2+h(i + 1)_ Againitcanbeshownthatathresholdpolicyisoptimal.6.231DYNAMICPROGRAMMINGLECTURE15LECTUREOUTLINE Westartanine-lecturesequenceonadvancedinnitehorizonDPandapproximationmethods Weallowinnitestatespace,sothestochasticshortest path framework cannot be used any more Resultsarerigorousassumingacountabledis-turbancespaceThis includes deterministic problems witharbitrary state space, andcountable stateMarkovchainsOtherwisethemathematicsofmeasurethe-orymakeanalysisdicult, althoughthe-nal results are essentially the same as forcountabledisturbancespace We use Volume II starting with the discountedproblem(Chapter1) ThecentralmathematicalstructureisthattheDP mappingis a contractionmapping(insteadofexistenceofaterminationstate)DISCOUNTEDPROBLEMS/BOUNDEDCOST Stationarysystemwitharbitrarystatespacexk+1= f(xk, uk, wk), k = 0, 1, . . . Costofapolicy= 0, 1, . . .J(x0) = limNEwkk=0,1,..._N1

k=0kg_xk, k(xk), wk__with < 1, and for some M, we have [g(x, u, w)[ Mforall(x, u, w) Shorthandnotation for DP mappings (operateonfunctionsofstatetoproduceotherfunctions)(TJ)(x) = minuU(x)Ew_g(x, u, w) +J_f(x, u, w)__, xTJistheoptimalcostfunctionfortheone-stageproblemwithstagecostgandterminalcostJ. Foranystationarypolicy(TJ)(x) = Ew_g_x, (x), w_+J_f(x, (x), w)__, xSHORTHANDTHEORYASUMMARY Costfunctionexpressions[withJ0(x) 0]J(x) = limk(T0T1 TkJ0)(x), J(x) = limk(TkJ0)(x) Bellmansequation:J= TJ, J= TJ Optimalitycondition:: optimal TJ= TJ Valueiteration: Forany(bounded)Jandallx,J(x) =limk(TkJ)(x) Policyiteration:Givenk,Policyevaluation: FindJkbysolvingJk= TkJkPolicyimprovement: Findk+1suchthatTk+1Jk= TJkTWOKEYPROPERTIES Monotonicityproperty: For any functions JandJsuchthatJ(x) J(x)forall x, andany(TJ)(x) (TJ)(x), x,(TJ)(x) (TJ)(x), x. Constant Shift property: For anyJ, anyscalarr,andany_T(J+re)_(x) = (TJ)(x) +r, x,_T(J+re)_(x) = (TJ)(x) +r, x,whereeistheunitfunction[e(x) 1]. These properties hold for almost all DP models. A third important property that holds for some(but not all) DP models is that Tand Tare con-tractionmappings(moreonthislater).CONVERGENCEOFVALUEITERATION IfJ0 0,J(x) = limN(TNJ0)(x), forallxProof: For anyinitial statex0, andpolicy=0, 1, . . .,J(x0) = E_

k=0kg_xk, k(xk), wk__= E_N1

k=0kg_xk, k(xk), wk__+E_

k=Nkg_xk, k(xk), wk__ThetailportionsatisesE_

k=Nkg_xk, k(xk), wk__NM1 ,whereM [g(x, u, w)[. Taketheminoverofbothsides. Q.E.D.BELLMANSEQUATION The optimal cost function Jsatises BellmansEq.,i.e. J= TJ.Proof:ForallxandN,J(x) NM1 (TNJ0)(x) J(x) +NM1 ,whereJ0(x) 0andM [g(x, u, w)[. ApplyingT tothis relation, andusingMonotonicityandConstantShift,(TJ)(x) N+1M1 (TN+1J0)(x) (TJ)(x) +N+1M1 TakingthelimitasN andusingthefactlimN(TN+1J0)(x) = J(x)weobtainJ= TJ. Q.E.D.THECONTRACTIONPROPERTYContraction property: For any bounded func-tionsJandJ,andany,maxx(TJ)(x) (TJ)(x) maxxJ(x) J(x),maxx(TJ)(x)(TJ)(x) maxxJ(x)J(x).Proof:Denotec = maxxSJ(x) J(x).ThenJ(x) c J(x) J(x) +c, xApplyTtobothsides,andusetheMonotonicityandConstantShiftproperties:(TJ)(x) c (TJ)(x) (TJ)(x) +c, xHence(TJ)(x) (TJ)(x) c, x.Q.E.D.IMPLICATIONSOFCONTRACTIONPROPERTY Wecanstrengthenourearlierresult: BellmansequationJ= TJhasauniquesolu-tion,namelyJ,andforanyboundedJ,wehavelimk(TkJ)(x) = J(x), xProof:Usemaxx(TkJ)(x) J(x)= maxx(TkJ)(x) (TkJ)(x) kmaxxJ(x) J(x) SpecialCase: For each stationary , Jis theuniquesolutionofJ= TJandlimk(TkJ)(x) = J(x), x,foranyboundedJ. Convergencerate:Forallk,maxx(TkJ)(x) J(x) kmaxxJ(x) J(x)NEC.ANDSUFFICIENTOPT.CONDITION Astationarypolicyisoptimal ifandonlyif(x)attainstheminimuminBellmansequationforeachx;i.e.,TJ= TJ.Proof: If TJ= TJ, then using Bellmans equa-tion(J= TJ),wehaveJ= TJ,so by uniqueness of the xed point of T, we obtainJ= J;i.e.,isoptimal. Conversely, if the stationary policy is optimal,wehaveJ= J,soJ= TJ.Combining this withBellmans equation(J=TJ),weobtainTJ= TJ. Q.E.D.COMPUTATIONALMETHODS-ANOVERVIEW Typically must work with a nite-state system.Possiblyanapproximationoftheoriginalsystem. ValueiterationandvariantsGauss-Seidelandasynchronousversions PolicyiterationandvariantsCombinationwith(possiblyasynchronous)valueiterationOptimisticpolicyiteration Linearprogrammingmaximizen

i=1J(i)subjectto J(i) g(i, u) +n

j=1pij(u)J(j), (i, u) Versionswithsubspaceapproximation: useinplace of J(i) a low-dim. basis function representa-tion,withstatefeaturesm(i),m = 1, . . . , sJ(i, r) =s

m=1rmm(i)andmodifythebasicmehodsappropriately.USINGQ-FACTORSI Letthestatesbei =1, . . . , n. WecanwriteBellmansequationasJ(i) = minuU(i)Q(i, u) i = 1, . . . , n,whereQ(i, u) =n

j=1pij(u)_g(i, u, j) + minvU(j)Q(j, v)_forall(i, u) Q(i, u)iscalledthe optimalQ-factorof(i, u) Q-factors haveoptimal cost interpretationinanaugmentedproblemwhosestatesarei and(i, u), u U(i) - the optimal cost vector is (J, Q)The Bellman Eq. is J= TJ, Q= FQwhere(FQ)(i, u) =n

j=1pij(u)_g(i, u, j) + minvU(j)Q(j, v)_ Ithasauniquesolution.USINGQ-FACTORSII WecanequivalentlywritetheVImethodasJk+1(i) = minuU(i)Qk+1(i, u), i = 1, . . . , n,whereQk+1isgeneratedforalliandu U(i)byQk+1(i, u) =n

j=1pij(u)_g(i, u, j) + minvU(j)Qk(j, v)_orJk+1= TJk,Qk+1= FQk. Itrequiresequal amountofcomputation... itjustneedsmorestorage Havingoptimal Q-factors is convenient whenimplementinganoptimalpolicyon-lineby(i) = minuU(i)Q(i, u) Once Q(i, u) are known, the model [g andpij(u)]isnotneeded. Model-freeoperationLater we will see how stochastic/sampling meth-odscanbeusedtocalculate(approximationsof)Q(i, u) using a simulator of the system (no modelneeded)6.231DYNAMICPROGRAMMINGLECTURE16LECTUREOUTLINE Reviewofbasictheoryofdiscountedproblems Monotonicityandcontractionproperties ContractionmappingsinDP Discountedproblems: Countable state spacewithunboundedcosts GeneralizeddiscountedDPDISCOUNTEDPROBLEMS/BOUNDEDCOST Stationarysystemwitharbitrarystatespacexk+1= f(xk, uk, wk), k = 0, 1, . . . Costofapolicy= 0, 1, . . .J(x0) = limNEwkk=0,1,..._N1

k=0kg_xk, k(xk), wk__with < 1, and for some M, we have [g(x, u, w)[ Mforall(x, u, w) Shorthandnotation for DP mappings (operateonfunctionsofstatetoproduceotherfunctions)(TJ)(x) = minuU(x)Ew_g(x, u, w) +J_f(x, u, w)__, xTJistheoptimalcostfunctionfortheone-stageproblemwithstagecostgandterminalcostJ. Foranystationarypolicy(TJ)(x) = Ew_g_x, (x), w_+J_f(x, (x), w)__, xSHORTHANDTHEORYASUMMARY Costfunctionexpressions[withJ0(x) 0]J(x) = limk(T0T1 TkJ0)(x), J(x) = limk(TkJ0)(x) Bellmansequation:J= TJ, J= TJ Optimalitycondition:: optimal TJ= TJ Valueiteration: Forany(bounded)Jandallx,J(x) =limk(TkJ)(x) Policyiteration:Givenk,Policyevaluation: FindJkbysolvingJk= TkJkPolicyimprovement: Findk+1suchthatTk+1Jk= TJkMAJORPROPERTIES Monotonicityproperty: For anyfunctionsJandJonthestatespaceXsuchthatJ(x) J(x)forallx X,andany(TJ)(x) (TJ)(x), x X,(TJ)(x) (TJ)(x), x X.Contraction property: For any bounded func-tionsJandJ,andany,maxx(TJ)(x) (TJ)(x) maxxJ(x) J(x),maxx(TJ)(x)(TJ)(x) maxxJ(x)J(x). The contraction property canbe writteninshorthandas|TJTJ| |JJ|, |TJTJ| |JJ|,whereforanyboundedfunctionJ, wedenoteby|J|thesup-norm|J| = maxxXJ(x).CONTRACTIONMAPPINGS Givenareal vectorspaceY withanorm ||(seetextfordenitions).A function F: Y Yis said to be a contractionmappingifforsome (0, 1),wehave|Fy Fz| |y z|, forally, z Y.iscalledthemodulusofcontractionofF. Linear case, Y= n: Fy= Ay +b is a contrac-tionifandonlyifalleigenvaluesofAarestrictlywithintheunitcircle. Form>1, wesaythatFisanm-stagecon-tractionifFmisacontraction. Importantexample: Let Xbe a set (e.g.,statespace inDP), v : X be apositive-valuedfunction. Let B(X) be the set of all functionsJ: X such that J(s)/v(s) is bounded over s. Theweightedsup-normonB(X):|J| = maxsX[J(s)[v(s). Importantspecial case: Thediscountedprob-lemmappingsTandT[forv(s) 1, = ].ADP-LIKECONTRACTIONMAPPING LetX= 1, 2, . . .,andletF: B(X) B(X)bealinearmappingoftheform(FJ)(i) = b(i) +

jXa(i, j) J(j), iwhere b(i)and a(i, j)are some scalars. ThenFisacontractionwithmodulusif

jX[a(i, j)[ v(j)v(i) , i[Thinkof the special case where a(i, j) are thetransitionprobs.ofapolicy]. LetF: B(X) B(X)beamappingof theform(FJ)(i) =minM(FJ)(i), iwhereMisparameterset, andforeach M,FisacontractionmappingfromB(X)toB(X)with modulus . Then Fis a contraction mappingwithmodulus.CONTRACTIONMAPPINGFIXED-POINTTH. ContractionMappingFixed-PointTheorem:IfF: B(X) B(X)isacontractionwithmodulus (0, 1), thenthereexistsauniqueJB(X)suchthatJ= FJ.Furthermore, if JisanyfunctioninB(X), thenFkJconvergestoJandwehave|FkJ J| k|J J|, k = 1, 2, . . . . Similar result if F is anm-stagecontractionmapping. This is aspecial case of ageneral result forcontractionmappings F : Y Y over normedvector spaces Ythat are complete: every sequenceykthatisCauchy(satises |ym yn| 0asm, n )converges. ThespaceB(X)is complete[see thetext(Sec-tion1.5)foraproof].GENERALFORMSOFDISCOUNTEDDP Monotonicityassumption:IfJ, J R(X)andJ J,thenH(x, u, J) H(x, u, J), x X, u U(x) Contractionassumption:ForeveryJ B(X),thefunctionsTJandTJbelongtoB(X).For some (0, 1) and all J, J B(X), Tsatises|TJ TJ| |J J| Wecanshowall thestandardanalytical andcomputational resultsofdiscountedDPbasedonthesetwoassumptions. Withjustthemonotonicityassumption(asinshortestpathproblem)wecanstill showvariousforms of thebasicresults under appropriateas-sumptions(likeintheSSPproblem).EXAMPLES DiscountedproblemsH(x, u, J) = E_g(x, u, w) +J_f(x, u, w)__ DiscountedSemi-MarkovProblemsH(x, u, J) = G(x, u) +n

y=1mxy(u)J(y)wheremxyarediscountedtransitionprobabili-ties,denedbythetransitiondistributions ShortestPathProblemsH(x, u, J) =_axu +J(u) ifu ,= d,axdifu = dwheredisthedestination MinimaxProblemsH(x, u, J) = maxwW(x,u)_g(x, u, w)+J_f(x, u, w)_GENERALFORMSOFDISCOUNTEDDP Monotonicityassumption:IfJ, J R(X)andJ J,thenH(x, u, J) H(x, u, J), x X, u U(x) Contractionassumption:ForeveryJ B(X),thefunctionsTJandTJbelongtoB(X).Forsome (0, 1)andallJ, J B(X),HsatisesH(x, u, J)H(x, u, J) maxyXJ(y)J(y)forallx Xandu U(x). Wecanshowall thestandardanalytical andcomputational resultsofdiscountedDPbasedonthesetwoassumptions Withjustthemonotonicityassumption(asinshortestpathproblem)wecanstill showvariousforms of thebasicresults under appropriateas-sumptions(likeintheSSPproblem)RESULTSUSINGCONTRACTION The mappingsTandTare sup-norm contrac-tionmappings withmodulus over B(X), andhave unique xed points in B(X), denoted JandJ,respectively(cf. Bellmansequation). Proof :Fromcontractionassumptionandxedpointth. ForanyJ B(X)and /,limkTkJ= J, limkTkJ= J(cf. convergence of value iteration). Proof : FromcontractionpropertyofTandT. WehaveTJ=TJif andonlyif J=J(cf. optimalitycondition). Proof : TJ=TJ,thenTJ=J, implyingJ=J. Conversely,ifJ= J,thenTJ= TJ= J= J= TJ. Useful boundfor J: For all J B(X), /|JJ| |TJ J|1 Proof :|TkJJ| k

=1|TJT1J| |TJJ|k

=11RESULTSUSINGMON. ANDCONTRACTION Optimalityofxedpoint:J(x) =minMJ(x), x X Furthermore,forevery > 0,thereexists /suchthatJ(x) J(x) J(x) +, x X Nonstationarypolicies: Considerthesetofall sequences= 0, 1, . . .withk /forallk,anddeneJ(x) = liminfk(T0T1 TkJ)(x), x X,withJbeinganyfunction(thechoiceof Jdoesnotmatter) WehaveJ(x) = minJ(x), x X6.231DYNAMICPROGRAMMINGLECTURE17LECTUREOUTLINE Reviewofcomputational theoryofdiscountedproblems Valueiteration(VI),policyiteration(PI) OptimisticPI Computational methods for generalized dis-countedDP AsynchronousalgorithmsDISCOUNTEDPROBLEMS Stationarysystemwitharbitrarystatespacexk+1= f(xk, uk, wk), k = 0, 1, . . . Costofapolicy= 0, 1, . . .J(x0) = limNEwkk=0,1,..._N1

k=0kg_xk, k(xk), wk__ ShorthandnotationforDPmappings(n-stateMarkovchaincase)(TJ)(i) = minuU(i)n

j=1pij(u)_g(i, u, j) +J(j)_, iTJistheoptimal costfunctionfortheone-stageproblemwithstagecostgandterminalcostJ. Foranystationarypolicy(TJ)(i) =n

j=1pij_(i)__g(i, (i), j) +J(j)_, iNote: Tislinear[inshortTJ= P(g +J)].SHORTHANDTHEORYASUMMARY Costfunctionexpressions(withJ0 0)J=limkT0T1 TkJ0, J=limkTkJ0 Bellmansequation:J= TJ, J= TJ Optimalitycondition:: optimal TJ= TJ Contraction: |TJ1TJ2| |J1J2| Valueiteration:Forany(bounded)JJ=limkTkJ Policyiteration:Givenk,Policyevaluation: FindJkbysolvingJk= TkJkPolicyimprovement: Findk+1suchthatTk+1Jk= TJkINTERPRETATIONOFVIANDPIJJ=TJ0Prob. =1JJ=TJ0Prob. =1TJProb. =1Prob. =TJProb. =1Prob. =1JJTJ45DegreeLineProb.= 1Prob.=JJ=TJ0Prob. =11JJJJ1 = T1J1Policy Improvement Exact Policy Evaluation Approximate PolicyEvaluationPolicy Improvement Exact Policy Evaluation Approximate PolicyEvaluationTJT1JJPolicyImprovementExactPolicyEvaluation(ExactifJ0J0J0J0=TJ0=TJ0=TJ0Donot Replace SetS=T2J0Donot Replace SetS=T2J0nValueIterationsVIANDPIMETHODSFORQ-LEARNING WecanwriteBellmansequationasJ(i) = minuU(i)Q(i, u) i = 1, . . . , n,whereQis thevector of optimal Q-factors theuniquesolutionofQ = FQwhere(FQ)(i, u) =n

j=1pij(u)_g(i, u, j) + minvU(j)Q(j, v)_ VI andPI for Q-factors are mathematicallyequivalenttoVIandPIforcosts. Theyrequireequal amountof computation...theyjustneedmorestorage. Forexample,wecanwritetheVImethodasJk+1(i) = minuU(i)Qk+1(i, u), i = 1, . . . , n,whereQk+1isgeneratedforalliandu U(i)byQk+1(i, u) =n

j=1pij(u)_g(i, u, j) + minvU(j)Qk(j, v)_APPROXIMATEPI Supposethatthepolicyevaluationisapproxi-mate,accordingto,maxx[Jk(x) Jk(x)[ , k = 0, 1, . . .and policy improvement is approximate, accordingto,maxx[(Tk+1Jk)(x)(TJk)(x)[ , k = 0, 1, . . .whereandaresomepositivescalars. ErrorBound: Thesequence kgeneratedbyapproximatepolicyiterationsatiseslimsupkmaxxS_Jk(x) J(x)_ + 2(1 )2 Typical practical behavior: The method makessteady progress up to a point and then the iteratesJkoscillatewithinaneighborhoodof J.OPTIMISTICPI ThisisPI, wherepolicyevaluationiscarriedoutbyanitenumberofVI Shorthanddenition:ForsomeintegersmkTkJk= TJk, Jk+1= TmkkJk, k = 0, 1, . . .Ifmk 1itbecomesVIIfmk= itbecomesPIFor intermediate values of mk, it is generallymoreecientthaneitherVIorPITJ=minTJJJ=TJ0Prob. =11JJPolicy Improvement Exact Policy Evaluation Approximate PolicyEvaluationPolicy Improvement Exact Policy Evaluation Approximate PolicyEvaluationJ0J0= TJ0JJ0 = T0J0= TJ0= T0J0J1= T20J0T0JApprox.PolicyEvaluationEXTENSIONSTOGENERALIZEDDISC.DP All the preceding VI and PI methods extend togeneralizeddiscountedDP. Summary: For a mapping H: XU R(X) ,consider(TJ)(x) = minuU(x)H(x, u, J), x X.(TJ)(x) = H_x, (x), J_, x X. WewanttondJsuchthatJ(x) = minuU(x)H(x, u, J), x XandasuchthatTJ= TJ.Discounted, Discounted Semi-Markov, MinimaxH(x, u, J) = E_g(x, u, w) +J_f(x, u, w)__H(x, u, J) = G(x, u) +n

EXTENSIONS TO GENERALIZED DISC. DP
• All the preceding VI and PI methods extend to generalized discounted DP.
• Summary: For a mapping H: X × U × R(X) → R, consider
(TJ)(x) = min_{u∈U(x)} H(x, u, J), x ∈ X
(T_μ J)(x) = H(x, μ(x), J), x ∈ X
• We want to find J* such that
J*(x) = min_{u∈U(x)} H(x, u, J*), x ∈ X
and a μ* such that T_{μ*} J* = TJ*.
• Examples - discounted, discounted semi-Markov, minimax:
H(x, u, J) = E{ g(x, u, w) + α J(f(x, u, w)) }
H(x, u, J) = G(x, u) + Σ_{y=1}^n m_xy(u) J(y)
H(x, u, J) = max_{w∈W(x,u)} [ g(x, u, w) + α J(f(x, u, w)) ]

ASSUMPTIONS AND RESULTS
• Monotonicity assumption: If J, J' ∈ R(X) and J ≤ J', then
H(x, u, J) ≤ H(x, u, J'), for all x ∈ X, u ∈ U(x)
• Contraction assumption: For every J ∈ B(X), the functions T_μ J and TJ belong to B(X). For some α ∈ (0, 1) and all J, J' ∈ B(X), H satisfies
|H(x, u, J) − H(x, u, J')| ≤ α max_{y∈X} |J(y) − J'(y)|
for all x ∈ X and u ∈ U(x).
• Standard algorithmic results extend:
 − Generalized VI converges to J*, the unique fixed point of T
 − Generalized PI and optimistic PI generate {μ^k} such that lim_{k→∞} ||J_{μ^k} − J*|| = 0

ASYNCHRONOUS ALGORITHMS
• Motivation for asynchronous algorithms
 − Faster convergence
 − Parallel and distributed computation
 − Simulation-based implementations
• General framework: Partition X into disjoint nonempty subsets X_1, ..., X_m, and use a separate processor ℓ updating J(x) for x ∈ X_ℓ
• Let J be partitioned as J = (J_1, ..., J_m), where J_ℓ is the restriction of J on the set X_ℓ.
• Synchronous algorithm:
J_ℓ^{t+1}(x) = T(J_1^t, ..., J_m^t)(x), x ∈ X_ℓ, ℓ = 1, ..., m
• Asynchronous algorithm: For some subsets of times R_ℓ,
J_ℓ^{t+1}(x) = T(J_1^{τ_{ℓ1}(t)}, ..., J_m^{τ_{ℓm}(t)})(x) if t ∈ R_ℓ, and J_ℓ^{t+1}(x) = J_ℓ^t(x) if t ∉ R_ℓ,
where t − τ_{ℓj}(t) are communication "delays"

ONE-STATE-AT-A-TIME ITERATIONS
• Important special case: Assume n states, a separate processor for each state, and no delays
• Generate a sequence of states {x^0, x^1, ...}, generated in some way, possibly by simulation (each state is generated infinitely often)
• Asynchronous VI:
J_ℓ^{t+1} = T(J_1^t, ..., J_n^t)(ℓ) if ℓ = x^t, and J_ℓ^{t+1} = J_ℓ^t if ℓ ≠ x^t,
where T(J_1^t, ..., J_n^t)(ℓ) denotes the ℓ-th component of the vector T(J_1^t, ..., J_n^t) = TJ^t, and for simplicity we write J_ℓ^t instead of J_ℓ^t(ℓ)
• The special case where {x^0, x^1, ...} = {1, ..., n, 1, ..., n, 1, ...} is the Gauss-Seidel method (a code sketch follows)
• We can show that J^t → J* under the contraction assumption
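A minimal sketch of the one-state-at-a-time iteration just described, for a tabular discounted problem; the visit sequence and the arrays P, G, alpha are hypothetical inputs, and every state should appear infinitely often.

```python
import numpy as np

def asynchronous_vi(P, G, alpha, state_sequence):
    """One-state-at-a-time (Gauss-Seidel style) value iteration.

    At each step only the component J(x^t) is updated, using the latest
    values of all other components. P[u], G[u], alpha, and state_sequence
    are hypothetical inputs.
    """
    m = len(P)
    J = np.zeros(P[0].shape[0])
    for i in state_sequence:                      # e.g. 0, 1, ..., n-1, 0, 1, ...
        J[i] = min((P[u][i] * (G[u][i] + alpha * J)).sum() for u in range(m))
    return J

# Gauss-Seidel sweep order: visit the states cyclically, many times
# J = asynchronous_vi(P, G, alpha, list(range(n)) * 100)
```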

6.231 DYNAMIC PROGRAMMING LECTURE 18
LECTURE OUTLINE
• Analysis of asynchronous VI and PI for generalized discounted DP
• Undiscounted problems
• Stochastic shortest path problems (SSP)
• Proper and improper policies
• Analysis and computational methods for SSP
• Pathologies of SSP

REVIEW OF ASYNCHRONOUS ALGORITHMS
• General framework: Partition X into disjoint nonempty subsets X_1, ..., X_m, and use a separate processor ℓ updating J(x) for x ∈ X_ℓ
• Let J be partitioned as J = (J_1, ..., J_m), where J_ℓ is the restriction of J on the set X_ℓ.
• Synchronous algorithm:
J_ℓ^{t+1}(x) = T(J_1^t, ..., J_m^t)(x), x ∈ X_ℓ, ℓ = 1, ..., m
• Asynchronous algorithm: For some subsets of times R_ℓ,
J_ℓ^{t+1}(x) = T(J_1^{τ_{ℓ1}(t)}, ..., J_m^{τ_{ℓm}(t)})(x) if t ∈ R_ℓ, and J_ℓ^{t+1}(x) = J_ℓ^t(x) if t ∉ R_ℓ,
where t − τ_{ℓj}(t) are communication "delays"

ASYNCHRONOUS CONV. THEOREM I
• Assume that for all ℓ, j = 1, ..., m, R_ℓ is infinite and lim_{t→∞} τ_{ℓj}(t) = ∞
• Proposition: Let T have a unique fixed point J*, and assume that there is a sequence of nonempty subsets {S(k)} ⊂ R(X) with S(k+1) ⊂ S(k) for all k, and with the following properties:
(1) Synchronous Convergence Condition: Every sequence {J^k} with J^k ∈ S(k) for each k converges pointwise to J*. Moreover, we have
TJ ∈ S(k+1), for all J ∈ S(k), k = 0, 1, ...
(2) Box Condition: For all k, S(k) is a Cartesian product of the form
S(k) = S_1(k) × ... × S_m(k),
where S_ℓ(k) is a set of real-valued functions on X_ℓ, ℓ = 1, ..., m.
Then for every J ∈ S(0), the sequence {J^t} generated by the asynchronous algorithm converges pointwise to J*.

ASYNCHRONOUS CONV. THEOREM II
• Interpretation of assumptions: (Figure: a synchronous iteration from any J = (J_1, J_2) in S(k) moves into S(k+1), component-by-component.)
• Convergence mechanism: (Figure: component-wise J_1 and J_2 iterations within the nested sets S(0) ⊃ S(k) ⊃ S(k+1).) Key: Independent component-wise improvement. An asynchronous component iteration from any J in S(k) moves into the corresponding component portion of S(k+1)

UNDISCOUNTED PROBLEMS
• System: x_{k+1} = f(x_k, u_k, w_k)
• Cost of a policy π = {μ_0, μ_1, ...}
J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N−1} g(x_k, μ_k(x_k), w_k) }
• Shorthand notation for DP mappings
(TJ)(x) = min_{u∈U(x)} E_w{ g(x, u, w) + J(f(x, u, w)) }, for all x
• For any stationary policy μ
(T_μ J)(x) = E_w{ g(x, μ(x), w) + J(f(x, μ(x), w)) }, for all x
• T and T_μ need not be contractions in general, but their monotonicity is helpful (see Ch. 4, Vol. II of the text for an analysis).
• SSP problems provide a "soft boundary" between the easy finite-state discounted problems and the hard undiscounted problems.
 − They share features of both.
 − Some of the nice theory is recovered because of the termination state.

SSP THEORY SUMMARY I
• As earlier, we have a cost-free termination state t, a finite number of states 1, ..., n, and a finite number of controls, but we will make weaker assumptions.
• Mappings T and T_μ (modified to account for termination state t):
(TJ)(i) = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) J(j) ], i = 1, ..., n
(T_μ J)(i) = g(i, μ(i)) + Σ_{j=1}^n p_ij(μ(i)) J(j), i = 1, ..., n
• Definition: A stationary policy μ is called proper, if under μ, from every state i, there is a positive probability path that leads to t.
• Important fact: If μ is proper, T_μ is a contraction with respect to some weighted max norm:
max_i (1/v_i) |(T_μ J)(i) − (T_μ J')(i)| ≤ ρ_μ max_i (1/v_i) |J(i) − J'(i)|
• T is similarly a contraction if all μ are proper (the case discussed in the text, Ch. 7, Vol. I).

SSP THEORY SUMMARY II
• The theory can be pushed one step further. Assume that:
(a) There exists at least one proper policy
(b) For each improper μ, J_μ(i) = ∞ for some i
• Then T is not necessarily a contraction, but:
 − J* is the unique solution of Bellman's equation
 − μ* is optimal if and only if T_{μ*} J* = TJ*
 − lim_{k→∞} (T^k J)(i) = J*(i) for all i
 − Policy iteration terminates with an optimal policy, if started with a proper policy
• Example: Deterministic shortest path problem with a single destination t.
 − States <=> nodes; Controls <=> arcs
 − Termination state <=> the destination
 − Assumption (a) <=> every node is connected to the destination
 − Assumption (b) <=> all cycle costs > 0
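Before the analysis that follows, here is a minimal sketch of VI for a tabular SSP under the assumptions above; the arrays P and g, and the convention that probability mass not accounted for in a row of P[u] goes to the termination state, are hypothetical modeling choices for illustration.

```python
import numpy as np

def ssp_value_iteration(P, g, tol=1e-9, max_iters=10_000):
    """VI for an SSP with states 1..n plus a cost-free termination state t.

    Hypothetical inputs: P[u] is n x n and its rows may sum to less than 1,
    the missing mass being the probability of moving to t; g[u][i] is the
    expected stage cost g(i, u). Under the standard SSP assumptions, VI
    converges to J*.
    """
    m, n = len(P), P[0].shape[0]
    J = np.zeros(n)
    for _ in range(max_iters):
        Q = np.array([g[u] + P[u] @ J for u in range(m)])  # candidate (TJ)(i) per u
        J_new = Q.min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            break
        J = J_new
    return J_new, Q.argmin(axis=0)
```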

SSP ANALYSIS I
• For a proper policy μ, J_μ is the unique fixed point of T_μ, and T_μ^k J → J_μ for all J (holds by the theory of Vol. I, Section 7.2)
• Key Fact: A μ satisfying J ≥ T_μ J for some J ∈ R^n must be proper - true because
J ≥ T_μ^k J = P_μ^k J + Σ_{m=0}^{k−1} P_μ^m g_μ
and some component of the term on the right blows up if μ is improper (by our assumptions).
• Consequence: T can have at most one fixed point within R^n.
Proof: If J and J' are two fixed points, select μ and μ' such that J = TJ = T_μ J and J' = TJ' = T_{μ'} J'. By the preceding assertion, μ and μ' must be proper, and J = J_μ and J' = J_{μ'}. Also
J = T^k J ≤ T_{μ'}^k J → J_{μ'} = J'
Similarly, J' ≤ J, so J = J'.

SSP ANALYSIS II
• We now show that T has a fixed point, and also that policy iteration converges.
• Generate a sequence of proper policies {μ^k} by policy iteration starting from a proper policy μ^0.
• μ^1 is proper and J_{μ^0} ≥ J_{μ^1} since
J_{μ^0} = T_{μ^0} J_{μ^0} ≥ T J_{μ^0} = T_{μ^1} J_{μ^0} ≥ T_{μ^1}^k J_{μ^0} ≥ J_{μ^1}
• Thus {J_{μ^k}} is nonincreasing, some policy μ is repeated, with J_μ = T J_μ. So J_μ is a fixed point of T.
• Next show T^k J → J_μ for all J, i.e., value iteration converges to the same limit as policy iteration. (Sketch: True if J = J_μ; argue using the properness of μ to show that the terminal cost difference J − J_μ does not matter.)
• To show J_μ = J*, for any π = {μ_0, μ_1, ...}
T_{μ_0} ... T_{μ_{k−1}} J_0 ≥ T^k J_0,
where J_0 ≡ 0. Take limsup as k → ∞ to obtain J_π ≥ J_μ, so μ is optimal and J_μ = J*.

SSP ANALYSIS III
• Contraction Property: If all policies are proper (cf. Section 7.1, Vol. I), T and T_μ are contractions with respect to a weighted sup norm.
Proof: Consider a new SSP problem where the transition probabilities are the same as in the original, but the transition costs are all equal to −1. Let Ĵ be the corresponding optimal cost vector. For all μ,
Ĵ(i) = −1 + min_{u∈U(i)} Σ_{j=1}^n p_ij(u) Ĵ(j) ≥ −1 + Σ_{j=1}^n p_ij(μ(i)) Ĵ(j)
For v_i = −Ĵ(i), we have v_i ≥ 1, and for all μ,
Σ_{j=1}^n p_ij(μ(i)) v_j ≤ v_i − 1 ≤ ρ v_i, i = 1, ..., n,
where ρ = max_{i=1,...,n} (v_i − 1)/v_i < 1.
This implies T and T_μ are contractions of modulus ρ for the norm ||J|| = max_{i=1,...,n} |J(i)|/v_i (by the results of earlier lectures).

PATHOLOGIES I: DETERM. SHORTEST PATHS
• If there is a cycle with cost = 0, Bellman's equation has an infinite number of solutions. Example:
(Figure: nodes 1, 2 and destination t; arcs 1 → 2 and 2 → 1 with cost 0, and arc 2 → t with cost 1.)
• We have J*(1) = J*(2) = 0.
• Bellman's equation is J(1) = J(2), J(2) = min[ J(1), 1 ]. It has J* as solution.
• Set of solutions of Bellman's equation: { J | J(1) = J(2) ≤ 1 }.

PATHOLOGIES II: DETERM. SHORTEST PATHS
• If there is a cycle with cost < 0, then J*(i) = −∞ for some i, and Bellman's equation has no solution within the space of real-valued functions.
• ... (i.e., max_i |J̃_{k+1}(i) − (T J̃_k)(i)| ≤ δ),
limsup_{k→∞} max_{i=1,...,n} ( J̃_k(i, r_k) − J*(i) ) ≤ 2αδ / (1 − α)²

AN EXAMPLE OF FAILURE
• Consider two states 1 and 2, and a single policy.
 − Deterministic transitions: 1 → 2 and 2 → 2
 − Transition costs ≡ 0, so J*(1) = J*(2) = 0.
• Consider an approximate VI scheme that approximates cost functions within S = { (r, 2r) | r ∈ R } with a weighted least squares fit; here Φ = (1, 2)'
• Given J_k = (r_k, 2r_k), we find J_{k+1} = (r_{k+1}, 2r_{k+1}), where for weights v_1, v_2 > 0, r_{k+1} is obtained as
r_{k+1} = arg min_r [ v_1 ( r − (TJ_k)(1) )² + v_2 ( 2r − (TJ_k)(2) )² ]
or Φr_{k+1} = Π_v T(Φr_k) (= least squares fit of T(Φr_k)).
• With straightforward calculation, r_{k+1} = αβ r_k, where
β = 2(v_1 + 2v_2)/(v_1 + 4v_2) > 1
• So if α > 1/β, the sequence {r_k} diverges and so does {J_k}.
• Difficulty: T is a contraction, but Π_v T is not.
• Norm mismatch problem (to be reencountered)
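The divergence in this example is easy to check numerically; the sketch below uses the hypothetical choices α = 0.9 and v_1 = v_2 = 1, for which αβ = 1.08 > 1.

```python
import numpy as np

# The two-state divergence example above, checked numerically.
alpha, v1, v2 = 0.9, 1.0, 1.0
Phi = np.array([[1.0], [2.0]])          # S = {(r, 2r)}
P = np.array([[0.0, 1.0],               # deterministic transitions 1 -> 2, 2 -> 2
              [0.0, 1.0]])
V = np.diag([v1, v2])                   # weights of the least squares fit

beta = 2 * (v1 + 2 * v2) / (v1 + 4 * v2)
print("alpha * beta =", alpha * beta)   # > 1 here, so r_k diverges

r = 1.0
for k in range(20):
    TJ = alpha * P @ (Phi[:, 0] * r)    # T(Phi r_k); all transition costs are zero
    r = float(np.linalg.solve(Phi.T @ V @ Phi, Phi.T @ V @ TJ)[0])
print("r_20 =", r)                      # grows without bound when alpha * beta > 1
```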

INDIRECT POLICY EVALUATION
• For the current policy μ, consider the corresponding mapping T_μ:
(T_μ J)(i) = Σ_{j=1}^n p_ij ( g(i, j) + α J(j) ), i = 1, ..., n
• The solution J_μ of Bellman's equation J_μ = T_μ J_μ is approximated by the solution of
Φr = Π T_μ(Φr)
(Figure: the subspace S spanned by the basis functions, T_μ(Φr), and its projection on S. Indirect method: solving a projected form of Bellman's equation.)

WEIGHTED EUCLIDEAN PROJECTIONS
• Consider a weighted Euclidean norm
||J||_v = sqrt( Σ_{i=1}^n v_i ( J(i) )² ),
where v is a vector of positive weights v_1, ..., v_n.
• Let Π denote the projection operation onto S = { Φr | r ∈ R^s } with respect to this norm, i.e., for any J ∈ R^n,
ΠJ = Φr*, where r* = arg min_{r∈R^s} ||J − Φr||²_v

KEY QUESTIONS AND RESULTS
• Does the projected equation have a solution?
• Under what conditions is the mapping ΠT_μ a contraction, so ΠT_μ has a unique fixed point?
• Assuming ΠT_μ has a unique fixed point Φr*, how close is Φr* to J_μ?
• Assumption: P_μ has a single recurrent class and no transient states, i.e., it has steady-state probabilities that are positive:
ξ_j = lim_{N→∞} (1/N) Σ_{k=1}^N P(i_k = j | i_0 = i) > 0
• Proposition: ΠT_μ is a contraction of modulus α with respect to the weighted Euclidean norm ||·||_ξ, where ξ = (ξ_1, ..., ξ_n) is the steady-state probability vector. The unique fixed point Φr* of ΠT_μ satisfies
||J_μ − Φr*||_ξ ≤ ( 1/sqrt(1 − α²) ) ||J_μ − ΠJ_μ||_ξ

PRELIMINARIES: PROJECTION PROPERTIES
• Important property of the projection Π on S with weighted Euclidean norm ||·||_v: for all J ∈ R^n, J̄ ∈ S, the Pythagorean Theorem holds:
||J − J̄||²_v = ||J − ΠJ||²_v + ||ΠJ − J̄||²_v
• Proof: Geometrically, (J − ΠJ) and (ΠJ − J̄) are orthogonal in the scaled geometry of the norm ||·||_v, where two vectors x, y ∈ R^n are orthogonal if Σ_{i=1}^n v_i x_i y_i = 0.

Expand the quadratic in the RHS below:
||J − J̄||²_v = ||(J − ΠJ) + (ΠJ − J̄)||²_v
• The Pythagorean Theorem implies that the projection is nonexpansive, i.e.,
||ΠJ − ΠJ̄||_v ≤ ||J − J̄||_v, for all J, J̄ ∈ R^n.
To see this, note that
||Π(J − J̄)||²_v ≤ ||Π(J − J̄)||²_v + ||(I − Π)(J − J̄)||²_v = ||J − J̄||²_v

PROOF OF CONTRACTION PROPERTY
• Lemma: We have
||Pz||_ξ ≤ ||z||_ξ, for all z ∈ R^n
• Proof: Let p_ij be the components of P. For all z ∈ R^n, we have
||Pz||²_ξ = Σ_{i=1}^n ξ_i ( Σ_{j=1}^n p_ij z_j )²
 ≤ Σ_{i=1}^n ξ_i Σ_{j=1}^n p_ij z_j²
 = Σ_{j=1}^n Σ_{i=1}^n ξ_i p_ij z_j²
 = Σ_{j=1}^n ξ_j z_j² = ||z||²_ξ,
where the inequality follows from the convexity of the quadratic function, and the next-to-last equality follows from the defining property Σ_{i=1}^n ξ_i p_ij = ξ_j of the steady-state probabilities.

• Using the lemma, the nonexpansiveness of Π, and the definition T_μ J = g + αPJ, we have
||ΠT_μ J − ΠT_μ J̄||_ξ ≤ ||T_μ J − T_μ J̄||_ξ = α ||P(J − J̄)||_ξ ≤ α ||J − J̄||_ξ
for all J, J̄ ∈ R^n. Hence ΠT_μ is a contraction of modulus α.

PROOF OF ERROR BOUND
• Let Φr* be the fixed point of ΠT. We have
||J_μ − Φr*||_ξ ≤ ( 1/sqrt(1 − α²) ) ||J_μ − ΠJ_μ||_ξ
• Proof: We have
||J_μ − Φr*||²_ξ = ||J_μ − ΠJ_μ||²_ξ + ||ΠJ_μ − Φr*||²_ξ
 = ||J_μ − ΠJ_μ||²_ξ + ||ΠTJ_μ − ΠT(Φr*)||²_ξ
 ≤ ||J_μ − ΠJ_μ||²_ξ + α² ||J_μ − Φr*||²_ξ,
where
 − the first equality uses the Pythagorean Theorem,
 − the second equality holds because J_μ is the fixed point of T and Φr* is the fixed point of ΠT,
 − the inequality uses the contraction property of ΠT.
Q.E.D.

MATRIX FORM OF PROJECTED EQUATION
• Its solution is the vector J̃ = Φr*, where r* solves the problem
min_{r∈R^s} || Φr − (g + αPΦr*) ||²_ξ
• Setting to 0 the gradient with respect to r of this quadratic, we obtain
Φ'Ξ( Φr* − (g + αPΦr*) ) = 0,
where Ξ is the diagonal matrix with the steady-state probabilities ξ_1, ..., ξ_n along the diagonal.
• This is just the orthogonality condition: The error Φr* − (g + αPΦr*) is orthogonal to the subspace spanned by the columns of Φ.
• Equivalently, Cr* = d, where
C = Φ'Ξ(I − αP)Φ, d = Φ'Ξg

PROJECTED EQUATION: SOLUTION METHODS
• Matrix inversion: r* = C^{-1} d
• Projected Value Iteration (PVI) method:
Φr_{k+1} = ΠT(Φr_k) = Π(g + αPΦr_k)
Converges to r* because ΠT is a contraction.
(Figure: the value iterate T(Φr_k) = g + αPΦr_k and its projection Φr_{k+1} on the subspace S spanned by the basis functions.)
• PVI can be written as:
r_{k+1} = arg min_{r∈R^s} || Φr − (g + αPΦr_k) ||²_ξ
By setting to 0 the gradient with respect to r,
Φ'Ξ( Φr_{k+1} − (g + αPΦr_k) ) = 0,
which yields
r_{k+1} = r_k − (Φ'ΞΦ)^{-1} (C r_k − d)

SIMULATION-BASED IMPLEMENTATIONS
• Key idea: Calculate simulation-based approximations based on k samples
C_k ≈ C, d_k ≈ d
• Matrix inversion r* = C^{-1} d is approximated by
r̂_k = C_k^{-1} d_k
This is the LSTD (Least Squares Temporal Differences) method.
• PVI method r_{k+1} = r_k − (Φ'ΞΦ)^{-1}(Cr_k − d) is approximated by
r_{k+1} = r_k − G_k (C_k r_k − d_k),
where G_k ≈ (Φ'ΞΦ)^{-1}
This is the LSPE (Least Squares Policy Evaluation) method.
• Key fact: C_k, d_k, and G_k can be computed with low-dimensional linear algebra (of order s; the number of basis functions).

SIMULATION MECHANICS
• We generate an infinitely long trajectory (i_0, i_1, ...) of the Markov chain, so states i and transitions (i, j) appear with long-term frequencies ξ_i and p_ij.
• After generating the transition (i_t, i_{t+1}), we compute the row φ(i_t)' of Φ and the cost component g(i_t, i_{t+1}).
• We form
C_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) ( φ(i_t) − α φ(i_{t+1}) )' ≈ Φ'Ξ(I − αP)Φ
d_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) g(i_t, i_{t+1}) ≈ Φ'Ξg
Also, in the case of LSPE,
G_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) φ(i_t)' ≈ Φ'ΞΦ
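A minimal sketch of LSTD and LSPE built from one simulated trajectory, using the sample averages above; the trajectory, cost, and feature inputs are hypothetical, and enough samples are assumed so that C_k and G_k are invertible.

```python
import numpy as np

def lstd_lspe(trajectory, costs, phi, alpha, lspe_iters=20):
    """Simulation-based LSTD / LSPE estimates from one long trajectory.

    Hypothetical inputs: trajectory is a list of visited states i_0,...,i_k,
    costs[t] is g(i_t, i_{t+1}), and phi(i) returns the feature vector of
    state i (the i-th row of Phi).
    """
    s = phi(trajectory[0]).shape[0]
    C, d, G = np.zeros((s, s)), np.zeros(s), np.zeros((s, s))
    k = len(trajectory) - 1
    for t in range(k):
        f, f_next = phi(trajectory[t]), phi(trajectory[t + 1])
        C += np.outer(f, f - alpha * f_next)
        d += f * costs[t]
        G += np.outer(f, f)
    C, d, G = C / k, d / k, G / k
    r_lstd = np.linalg.solve(C, d)        # LSTD: r_k = C_k^{-1} d_k
    r = np.zeros(s)                       # LSPE: r <- r - G_k^{-1} (C_k r - d_k)
    for _ in range(lspe_iters):
        r = r - np.linalg.solve(G, C @ r - d)
    return r_lstd, r
```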

• Convergence proof: View C, d, and G as expectations; use law of large numbers arguments.
• Note that C_k, d_k, and G_k can be formed incrementally.

6.231 DYNAMIC PROGRAMMING LECTURE 21
LECTURE OUTLINE
• Review of approximate policy iteration
• Projected equation methods for policy evaluation
• Optimistic versions
• Multistep projected equation methods
• Bias-variance tradeoff
• Exploration-enhanced implementations
• Oscillations

REVIEW: PROJECTED BELLMAN EQUATION
• For a fixed policy μ to be evaluated, consider the corresponding mapping T:
(TJ)(i) = Σ_{j=1}^n p_ij ( g(i, j) + α J(j) ), i = 1, ..., n,
or more compactly, TJ = g + αPJ
• Approximate Bellman's equation J = TJ by Φr = ΠT(Φr), or the matrix form/orthogonality condition Cr* = d, where
C = Φ'Ξ(I − αP)Φ, d = Φ'Ξg
(Figure: the projected form of Bellman's equation, Φr = ΠT(Φr), on the subspace S spanned by the basis functions.)

PROJECTED EQUATION METHODS
• Matrix inversion: r* = C^{-1} d
• Iterative Projected Value Iteration (PVI) method:
Φr_{k+1} = ΠT(Φr_k) = Π(g + αPΦr_k)
Converges to r* if ΠT is a contraction. True if Π is projection w.r.t. the steady-state distribution norm.
• Simulation-Based Implementations: Generate a (k+1)-transition simulated sequence {i_0, i_1, ..., i_k} and approximations C_k ≈ C and d_k ≈ d:
C_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) ( φ(i_t) − α φ(i_{t+1}) )' ≈ Φ'Ξ(I − αP)Φ
d_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) g(i_t, i_{t+1}) ≈ Φ'Ξg
• LSTD: r̂_k = C_k^{-1} d_k
• LSPE: r_{k+1} = r_k − G_k (C_k r_k − d_k), where G_k ≈ (Φ'ΞΦ)^{-1}. Converges to r* if ΠT is a contraction.
• Key fact: C_k, d_k, and G_k can be computed with low-dimensional linear algebra (of order s; the number of basis functions).

OPTIMISTIC VERSIONS
• Use coarse approximations C_k ≈ C and d_k ≈ d, based on few simulation samples
• PI context: Evaluate (coarsely) the current policy, then do a policy improvement
• Very complex behavior (see the subsequent discussion on oscillations)
• Often approaches the limit more quickly (as optimistic methods often do)
• The matrix inversion/LSTD method has serious problems due to large simulation noise (because of limited sampling)
• A stepsize γ ∈ (0, 1] in LSPE may be useful to damp the effect of simulation noise:
r_{k+1} = r_k − γ G_k (C_k r_k − d_k)
• In the context of PI, LSPE tends to cope better because of its iterative nature (when a policy is changed, the current r_k, C_k, d_k, G_k may be used as a "hot start" for the iterations of the new policy evaluation)

MULTISTEP METHODS FOR POL. EVALUATION
• Introduce a multistep version of Bellman's equation J = T^(λ) J, where for λ ∈ [0, 1),
T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}
• T^(λ) has the same fixed point as T, so it may be used as a basis for approximation.
• T^t is a contraction with modulus α^t, with respect to the weighted Euclidean norm ||·||_ξ, where ξ is the steady-state distribution vector of the chain. Hence T^(λ) is a contraction with modulus
α_λ = (1 − λ) Σ_{t=0}^∞ α^{t+1} λ^t = α(1 − λ)/(1 − αλ)
• Note that α_λ → 0 as λ → 1.
• Let Φr*_λ be the fixed point of ΠT^(λ). Then
||J_μ − Φr*_λ||_ξ ≤ ( 1/sqrt(1 − α_λ²) ) ||J_μ − ΠJ_μ||_ξ
• Φr*_λ depends on λ; it is closer to J_μ as λ → 1.

BIAS-VARIANCE TRADEOFF
• Error bound ||J_μ − Φr*_λ||_ξ ≤ ( 1/sqrt(1 − α_λ²) ) ||J_μ − ΠJ_μ||_ξ
• Bias-variance tradeoff:
 − As λ → 1, we have α_λ → 0, so the error bound (and the quality of approximation) improves as λ → 1. In fact lim_{λ→1} Φr*_λ = ΠJ_μ (= the "direct" approximation solution)
 − But simulation noise in approximating T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1} increases.

(Figure: the subspace S = {Φr | r ∈ R^s}, the solutions of the projected equation Φr = ΠT^(λ)(Φr) for λ = 0 and λ = 1, the simulation error, and the bias relative to J_μ.)

MORE ON MULTISTEP METHODS
• The simulation process to obtain C^(λ)_k and d^(λ)_k is similar to the case λ = 0 (single simulation trajectory i_0, i_1, ..., more complex formulas)
C^(λ)_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} ( φ(i_m) − α φ(i_{m+1}) )'
d^(λ)_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} g(i_m, i_{m+1})
• In the context of approximate policy iteration, we can use optimistic versions (few samples between policy updates).
• Many different/incremental versions ... see text.
• Note the λ-tradeoffs:
 − As λ → 1, C^(λ)_k and d^(λ)_k contain more simulation noise, so more samples are needed for a close approximation of r_λ (the solution of the projected equation)
 − The error bound ||J_μ − Φr_λ||_ξ becomes smaller
 − As λ → 1, ΠT^(λ) becomes a contraction for an arbitrary projection norm
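A minimal sketch of the C^(λ)_k and d^(λ)_k estimates above, written in the equivalent "eligibility trace" form obtained by swapping the order of the two sums; the trajectory, cost, and feature inputs are hypothetical as in the λ = 0 sketch.

```python
import numpy as np

def lstd_lambda(trajectory, costs, phi, alpha, lam):
    """LSTD(lambda): estimate C^(lam)_k and d^(lam)_k from one trajectory.

    Hypothetical inputs: trajectory[t] = i_t, costs[t] = g(i_t, i_{t+1}),
    phi(i) is the feature vector of state i, lam in [0, 1).
    """
    s = phi(trajectory[0]).shape[0]
    C, d, z = np.zeros((s, s)), np.zeros(s), np.zeros(s)
    k = len(trajectory) - 1
    for t in range(k):
        f, f_next = phi(trajectory[t]), phi(trajectory[t + 1])
        z = alpha * lam * z + f   # z_t = sum_{m<=t} (alpha*lam)^{t-m} phi(i_m)
        C += np.outer(z, f - alpha * f_next)
        d += z * costs[t]
    return np.linalg.solve(C / k, d / k)   # r = (C^(lam)_k)^{-1} d^(lam)_k
```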

POLICY ITERATION ISSUES - EXPLORATION
• 1st major issue: exploration. A common remedy is the off-policy approach: Replace P of the current policy with
P̄ = (I − B)P + BQ,
where B is a diagonal matrix with β_i ∈ [0, 1] on the diagonal, and Q is another transition matrix.
• Then the LSTD and LSPE formulas must be modified ... otherwise the policy associated with P̄ (not P) is evaluated (see the textbook, Section 6.4).
• Alternatives: Geometric and free-form sampling
• Both of these use multiple short simulated trajectories, with random restart state, chosen to enhance exploration
• Geometric sampling uses trajectories with geometrically distributed number of transitions with parameter λ ∈ [0, 1). It implements LSTD(λ) and LSPE(λ) with exploration (see the text).
• Free-form sampling uses trajectories with a more generally distributed number of transitions. It implements a method for approximation of the solution of a generalized multistep Bellman equation (see the text).

POLICY ITERATION ISSUES - OSCILLATIONS
• Define for each policy μ
R_μ = { r | T_μ(Φr) = T(Φr) }
• These sets form the greedy partition of the parameter r-space
R^s = ∪_μ R_μ = ∪_μ { r | T_μ(Φr) = T(Φr) }
(Figure: the greedy partition of the r-space into the regions R_μ, R_μ', R_μ'', R_μ''', ...)

• For a policy μ, R_μ is the set of all r such that policy improvement based on Φr produces μ
• Oscillations of nonoptimistic approximate PI: r_μ is generated by an evaluation method so that Φr_μ ≈ J_μ
(Figure: the oscillation cycle r_{μ^k} → μ^{k+1} → r_{μ^{k+1}} → μ^{k+2} → r_{μ^{k+2}} → μ^{k+3} → ..., with each r_{μ^k} lying in the region R_{μ^{k+1}}.)

MORE ON OSCILLATIONS/CHATTERING
• For optimistic PI a different picture holds
(Figure: the iterates r_1, r_2, r_3 chattering between the regions R_{μ^1}, R_{μ^2}, R_{μ^3}.)
• Oscillations are less violent, but the "limit" point is meaningless!
• Fundamentally, oscillations are due to the lack of monotonicity of the projection operator, i.e., J ≤ J' does not imply ΠJ ≤ ΠJ'.
• If approximate PI uses policy evaluation
Φr = (W T_μ)(Φr)
with W some monotone operator, the generated policies converge (to a possibly nonoptimal limit).
• The operator W used in the aggregation approach has this monotonicity property.

6.231 DYNAMIC PROGRAMMING LECTURE 22
LECTURE OUTLINE
• Aggregation as an approximation methodology
• Aggregate problem
• Examples of aggregation
• Simulation-based aggregation
• Q-Learning

PROBLEM APPROXIMATION - AGGREGATION
• Another major idea in ADP is to approximate the cost-to-go function of the problem with the cost-to-go function of a simpler problem. The simplification is often ad-hoc/problem-dependent.
• Aggregation is a systematic approach for problem approximation. Main elements:
 − Introduce a few "aggregate" states, viewed as the states of an "aggregate" system
 − Define transition probabilities and costs of the aggregate system, by relating original system states with aggregate states
 − Solve (exactly or approximately) the "aggregate" problem by any kind of value or policy iteration method (including simulation-based methods)
 − Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem
• Hard aggregation example: Aggregate states are subsets of original system states, treated as if they all have the same cost.

AGGREGATION/DISAGGREGATION PROBS
• The aggregate system transition probabilities are defined via two (somewhat arbitrary) choices
(Figure: original system states i, j with transitions according to p_ij(u) and cost g(i, u, j); aggregate states x, y; disaggregation probabilities d_xi (matrix D) and aggregation probabilities φ_jy (matrix Φ).)
p̂_xy(u) = Σ_{i=1}^n d_xi Σ_{j=1}^n p_ij(u) φ_jy,
ĝ(x, u) = Σ_{i=1}^n d_xi Σ_{j=1}^n p_ij(u) g(i, u, j)

• For each original system state j and aggregate state y, the aggregation probability φ_jy:
 − The "degree of membership of j in the aggregate state y."
 − In hard aggregation, φ_jy = 1 if state j belongs to aggregate state/subset y.
• For each aggregate state x and original system state i, the disaggregation probability d_xi:
 − The "degree of i being representative of x."
 − In hard aggregation, one possibility is that all states i that belong to aggregate state/subset x have equal d_xi.

AGGREGATE PROBLEM
• The transition probability from aggregate state x to aggregate state y under control u is
p̂_xy(u) = Σ_{i=1}^n d_xi Σ_{j=1}^n p_ij(u) φ_jy, or P̂(u) = D P(u) Φ,
where the rows of D and Φ are the disaggregation and aggregation probabilities.
• The aggregate expected transition cost is
ĝ(x, u) = Σ_{i=1}^n d_xi Σ_{j=1}^n p_ij(u) g(i, u, j), or ĝ = D P g
• The optimal cost function of the aggregate problem, denoted R̂, is
R̂(x) = min_{u∈U} [ ĝ(x, u) + α Σ_y p̂_xy(u) R̂(y) ], for all x,
or R̂ = min_u [ ĝ + α P̂ R̂ ] - Bellman's equation for the aggregate problem.
• The optimal cost function J* of the original problem is approximated using interpolation:
J̃(j) = Σ_y φ_jy R̂(y), for all j
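A minimal sketch of this construction for the hard-aggregation case; P, G, alpha, and the grouping of original states into aggregate states are hypothetical inputs, with equal disaggregation probabilities within each group.

```python
import numpy as np

def solve_by_hard_aggregation(P, G, alpha, groups, iters=500):
    """Hard-aggregation approximation of J*, following the slides above.

    Hypothetical inputs: P[u], G[u] are n x n probability/cost arrays;
    groups[i] gives the aggregate state of original state i.
    """
    m, n = len(P), P[0].shape[0]
    num_agg = max(groups) + 1
    Phi = np.zeros((n, num_agg))
    Phi[np.arange(n), groups] = 1.0                  # aggregation probs phi_jy
    D = (Phi / Phi.sum(axis=0)).T                    # disaggregation probs d_xi
    # Aggregate transition probabilities and expected costs, per control
    P_hat = [D @ P[u] @ Phi for u in range(m)]
    g_hat = [D @ (P[u] * G[u]).sum(axis=1) for u in range(m)]
    # Solve the aggregate Bellman equation R = min_u [g_hat + alpha * P_hat R] by VI
    R = np.zeros(num_agg)
    for _ in range(iters):
        R = np.min([g_hat[u] + alpha * P_hat[u] @ R for u in range(m)], axis=0)
    return Phi @ R     # interpolated approximation of J* on the original states
```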

EXAMPLE I: HARD AGGREGATION
• Group the original system states into subsets, and view each subset as an aggregate state
• Aggregation probs: φ_jy = 1 if j belongs to aggregate state y. For example, with 9 original states and aggregate states x_1, x_2, x_3, x_4:
Φ = [ 1 0 0 0
      1 0 0 0
      0 1 0 0
      1 0 0 0
      1 0 0 0
      0 1 0 0
      0 0 1 0
      0 0 1 0
      0 0 0 1 ]
• Disaggregation probs: There are many possibilities, e.g., all states i within aggregate state x have equal prob. d_xi.
• If the optimal cost vector J* is piecewise constant over the aggregate states/subsets, hard aggregation is exact. Suggests grouping states with roughly equal cost into aggregates.
• Soft aggregation (provides "soft boundaries" between aggregate states).

EXAMPLE II: FEATURE-BASED AGGREGATION
• If we know good features, it makes sense to group together states that have similar features
• Essentially discretize the features and assign a weight to each discretization point
(Figure: special states → feature extraction mapping → feature vector → aggregate states.)
• A general approach for passing from a feature-based state representation to an aggregation-based architecture
• Hard aggregation architecture based on features is more powerful (nonlinear/piecewise constant in the features, rather than linear)
• ... but may require many more aggregate states to reach the same level of performance as the corresponding linear feature-based architecture

EXAMPLE III: REP. STATES/COARSE GRID
• Choose a collection of "representative" original system states, and associate each one of them with an aggregate state. Then "interpolate"
(Figure: original state space with representative/aggregate states x, y_1, y_2, y_3 and nearby original states j, j_1, j_2, j_3.)
• Disaggregation probabilities are d_xi = 1 if i is equal to representative state x.
• Aggregation probabilities associate original system states with convex combinations of representative states:
j ~ Σ_{y∈A} φ_jy y
• Well-suited for Euclidean space discretization
• Extends nicely to continuous state space, including belief space of POMDP

EXAMPLE IV: REPRESENTATIVE FEATURES
• Choose a collection of "representative" subsets of original system states, and associate each one of them with an aggregate state
(Figure: original state space partitioned into aggregate states/subsets S_x1, S_x2, S_x3 ("small cost" regions), with transitions p_ij between original states i, j.)
• Common case: S_x is a group of states with "similar features"
• Hard aggregation is a special case: ∪_x S_x = {1, ..., n}
• Aggregation with representative states is a special case: S_x consists of just one state
• With representative features, the aggregation approach is a special case of the projected equation approach with "seminorm" or oblique projection (see text). So the TD methods and multistep Bellman equation methodology apply

APPROXIMATE PI BY AGGREGATION
• Consider approximate PI for the original problem, with evaluation done using the aggregate problem (other possibilities exist - see the text)
• Evaluation of policy μ: J̃ = ΦR, where R = DT_μ(ΦR) (R is the vector of costs of aggregate states corresponding to μ). May use simulation.
• Similar form to the projected equation ΦR = ΠT_μ(ΦR) (ΦD in place of Π).
• Advantages: It has no problem with exploration or with oscillations.
• Disadvantage: The rows of D and Φ must be probability distributions.
(Figure: the aggregation/disaggregation diagram with p̂_xy(u) = Σ_i d_xi Σ_j p_ij(u) φ_jy and ĝ(x, u) = Σ_i d_xi Σ_j p_ij(u) g(i, u, j), repeated.)

Q-LEARNING I
• Q-learning has two motivations:
 − Dealing with multiple policies simultaneously
 − Using a model-free approach [no need to know p_ij(u), only be able to simulate them]
• The Q-factors are defined by
Q*(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α J*(j) ), for all (i, u)
• Since J* = TJ*, we have J*(i) = min_{u∈U(i)} Q*(i, u), so the Q-factors solve the equation
Q*(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u'∈U(j)} Q*(j, u') )
• Q*(i, u) can be shown to be the unique solution of this equation. Reason: This is Bellman's equation for a system whose states are the original states 1, ..., n, together with all the pairs (i, u).
• Value iteration: For all (i, u)
Q(i, u) := Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u'∈U(j)} Q(j, u') )

Q-LEARNING II
• Use any probabilistic mechanism to select the sequence of pairs (i_k, u_k) [all pairs (i, u) are chosen infinitely often], and for each k, select j_k according to p_{i_k j}(u_k).
• Q-learning algorithm: updates Q(i_k, u_k) by
Q(i_k, u_k) := ( 1 − γ_k(i_k, u_k) ) Q(i_k, u_k) + γ_k(i_k, u_k) ( g(i_k, u_k, j_k) + α min_{u'∈U(j_k)} Q(j_k, u') )
• Stepsize γ_k(i_k, u_k) must converge to 0 at a proper rate (e.g., like 1/k).
• Important mathematical point: In the Q-factor version of Bellman's equation the order of expectation and minimization is reversed relative to the ordinary cost version of Bellman's equation:
J*(i) = min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α J*(j) )
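A minimal sketch of the tabular Q-learning update above; the model-free interface (sample_next, G_sample), the uniform choice of pairs, and the 1/k stepsize rule are hypothetical illustration choices.

```python
import numpy as np

def q_learning(sample_next, G_sample, n, m, alpha, num_steps=100_000, seed=0):
    """Tabular Q-learning as on the slide above.

    Hypothetical model-free interface: sample_next(i, u, rng) returns a
    successor j drawn according to p_ij(u), and G_sample(i, u, j) returns
    the transition cost g(i, u, j).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, m))
    visits = np.zeros((n, m))
    for _ in range(num_steps):
        i, u = rng.integers(n), rng.integers(m)   # choose a pair (i_k, u_k)
        j = sample_next(i, u, rng)                # simulate j_k ~ p_ij(u)
        visits[i, u] += 1
        gamma = 1.0 / visits[i, u]                # stepsize -> 0 at rate 1/k
        target = G_sample(i, u, j) + alpha * Q[j].min()
        Q[i, u] = (1 - gamma) * Q[i, u] + gamma * target
    return Q.min(axis=1), Q.argmin(axis=1)        # approximate J*, greedy policy
```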

• Q-learning can be shown to converge to the true/exact Q-factors (sophisticated stochastic approximation proof).
• Major drawback: The large number of pairs (i, u) - no function approximation is used.

Q-FACTOR APPROXIMATIONS
• Basis function approximation for Q-factors:
Q̃(i, u, r) = φ(i, u)'r
• We can use approximate policy iteration and LSPE/LSTD/TD for policy evaluation (the exploration issue is acute).
• Optimistic policy iteration methods are frequently used on a heuristic basis. Example (very optimistic): At iteration k, given r_k and state/control (i_k, u_k):
(1) Simulate the next transition (i_k, i_{k+1}) using the transition probabilities p_{i_k j}(u_k).
(2) Generate control u_{k+1} from
u_{k+1} = arg min_{u∈U(i_{k+1})} Q̃(i_{k+1}, u, r_k)
(3) Update the parameter vector via
r_{k+1} = r_k − (LSPE or TD-like correction)
• Unclear validity. Solid basis for an important but very special case: optimal stopping (see the text)

6.231 DYNAMIC PROGRAMMING LECTURE 23
LECTURE OUTLINE
• Additional topics in ADP
• Stochastic shortest path problems
• Average cost problems
• Generalizations
• Basis function adaptation
• Gradient-based approximation in policy space

REVIEW: PROJECTED BELLMAN EQUATION
• Policy Evaluation: Approximate Bellman's equation J_μ = T_μ J_μ by the projected equation
Φr = ΠT_μ(Φr),
which can be solved by simulation-based methods such as LSPE(λ), LSTD(λ), or TD(λ). (A related approach is aggregation - simpler in various ways.)
(Figure: the projected form of Bellman's equation on the subspace S spanned by the basis functions.)
• These ideas apply to other (linear) Bellman equations, e.g., for SSP and average cost.
• Important Issue: Construct a simulation framework where ΠT_μ [or ΠT_μ^(λ)] is a contraction.

STOCHASTIC SHORTEST PATHS
• Introduce the approximation subspace S = { Φr | r ∈ R^s } and, for a given proper policy, Bellman's equation and its projected version
J = TJ = g + PJ, Φr = ΠT(Φr)
Also its λ-version
Φr = ΠT^(λ)(Φr), T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}
• Question: What should be the norm of projection? How to implement it by simulation?
• Speculation based on the discounted case: It should be a weighted Euclidean norm with weight vector ξ = (ξ_1, ..., ξ_n), where ξ_i should be some type of long-term occupancy probability of state i (which can be generated by simulation).
• But what does "long-term occupancy probability of a state" mean in the SSP context?
• How do we generate infinite length trajectories given that termination occurs with prob. 1?

SIMULATION FOR SSP
• We envision simulation of trajectories up to termination, followed by restart at state i with some fixed probabilities q_0(i) > 0.
• Then the long-term occupancy probability of a state i is proportional to
q(i) = Σ_{t=0}^∞ q_t(i), i = 1, ..., n,
where q_t(i) = P(i_t = i), i = 1, ..., n, t = 0, 1, ...
• We use the projection norm
||J||_q = sqrt( Σ_{i=1}^n q(i) ( J(i) )² )

• Column sampling: generate transitions (i, j) according to a matrix P with p_ij > 0 if a_ij ≠ 0, i.e., for each i, the relative frequency of (i, j) is p_ij (connection to importance sampling); here a_ij and b_i denote the entries of a general linear fixed-point equation J = AJ + b, whose projected version Φr = Π(AΦr + b) is solved by the methods below.
• Row sampling may be done using a Markov chain with transition matrix Q (unrelated to P)
• Row sampling may also be done without a Markov chain - just sample rows according to some known distribution (e.g., a uniform)

ROW AND COLUMN SAMPLING
(Figure: row sampling of a state sequence i_0, i_1, ..., i_k according to ξ (may use a Markov chain Q), and column sampling of transitions (i_0, j_0), ..., (i_k, j_k) according to a Markov chain P with P ~ |A|; projection on the subspace.)
• Row sampling ~ State Sequence Generation in DP. Affects:
 − The projection norm.
 − Whether ΠA is a contraction.
• Column sampling ~ Transition Sequence Generation in DP.
 − Can be totally unrelated to row sampling. Affects the sampling/simulation error.
 − "Matching" P with |A| is beneficial (has an effect like in importance sampling).
• Independent row and column sampling allows exploration at will! Resolves the exploration problem that is critical in approximate policy iteration.

LSTD-LIKE METHOD
• Optimality condition/equivalent form of the projected equation
Σ_{i=1}^n ξ_i φ(i) ( φ(i) − Σ_{j=1}^n a_ij φ(j) )' r* = Σ_{i=1}^n ξ_i φ(i) b_i
• The two expected values are approximated by row and column sampling (batch 0 → t).
• We solve the linear equation
Σ_{k=0}^t φ(i_k) ( φ(i_k) − (a_{i_k j_k}/p_{i_k j_k}) φ(j_k) )' r_t = Σ_{k=0}^t φ(i_k) b_{i_k}
• We have r_t → r*, regardless of ΠA being a contraction (by law of large numbers; see the next slide).
• Issues of singularity or near-singularity of I − ΠA may be important; see the text.
• An LSPE-like method is also possible, but requires that ΠA is a contraction.
• Under the assumption Σ_{j=1}^n |a_ij| ≤ 1 for all i, there are conditions that guarantee contraction of ΠA; see the text.
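A minimal sketch of the simulation-based linear equation above, for a general equation J = AJ + b with independent row and column sampling; all inputs are hypothetical, and the importance-sampling ratio a_ij/p_ij plays the role of the term in the slide.

```python
import numpy as np

def lstd_general(row_states, col_states, A, b, p_col, phi):
    """LSTD-like estimate for the projected equation Phi r = Pi(A Phi r + b).

    Hypothetical inputs: row_states[k] = i_k sampled with frequencies xi_i,
    col_states[k] = j_k sampled according to p_col[i_k, j_k], A and b define
    the linear equation J = AJ + b, and phi(i) is the feature vector of i.
    """
    s = phi(row_states[0]).shape[0]
    C, d = np.zeros((s, s)), np.zeros(s)
    for i, j in zip(row_states, col_states):
        ratio = A[i, j] / p_col[i, j]          # importance-sampling correction
        C += np.outer(phi(i), phi(i) - ratio * phi(j))
        d += phi(i) * b[i]
    return np.linalg.solve(C, d)               # r_t -> r* by the law of large numbers
```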

JUSTIFICATION W/ LAW OF LARGE NUMBERS
• We will "match" terms in the exact optimality condition and the simulation-based version.
• Let ξ̂_i^t be the relative frequency of i in row sampling up to time t. We have
(1/(t+1)) Σ_{k=0}^t φ(i_k) φ(i_k)' = Σ_{i=1}^n ξ̂_i^t φ(i) φ(i)' ≈ Σ_{i=1}^n ξ_i φ(i) φ(i)'
(1/(t+1)) Σ_{k=0}^t φ(i_k) b_{i_k} = Σ_{i=1}^n ξ̂_i^t φ(i) b_i ≈ Σ_{i=1}^n ξ_i φ(i) b_i
• Let p̂_ij^t be the relative frequency of (i, j) in column sampling up to time t. Then
(1/(t+1)) Σ_{k=0}^t (a_{i_k j_k}/p_{i_k j_k}) φ(i_k) φ(j_k)' = Σ_{i=1}^n ξ̂_i^t Σ_{j=1}^n p̂_ij^t (a_ij/p_ij) φ(i) φ(j)' ≈ Σ_{i=1}^n ξ_i Σ_{j=1}^n a_ij φ(i) φ(j)'

BASIS FUNCTION ADAPTATION I
• An important issue in ADP is how to select basis functions.
• A possible approach is to introduce basis functions that are parametrized by a vector θ, and optimize over θ, i.e., solve the problem
min_{θ∈Θ} F( J̃(θ) ),
where J̃(θ) is the solution of the projected equation.
• One example is
F( J̃(θ) ) = || J̃(θ) − T( J̃(θ) ) ||²
• Another example is
F( J̃(θ) ) = Σ_{i∈I} | J(i) − J̃(θ)(i) |²,
where I is a subset of states, and J(i), i ∈ I, are the costs of the policy at these states calculated directly by simulation.

BASIS FUNCTION ADAPTATION II
• Some algorithm may be used to minimize F( J̃(θ) ) over θ. A challenge here is that the algorithm should use low-dimensional calculations.
• One possibility is to use a form of random search (the cross-entropy method); see the paper by Menache, Mannor, and Shimkin (Annals of Oper. Res., Vol. 134, 2005).
• Another possibility is to use a gradient method. For this it is necessary to estimate the partial derivatives of J̃(θ) with respect to the components of θ. It turns out that by differentiating the projected equation, these partial derivatives can be calculated using low-dimensional operations. See the references in the text.

APPROXIMATION IN POLICY SPACE I
• Consider an average cost problem, where the problem data are parametrized by a vector r, i.e., a cost vector g(r) and a transition probability matrix P(r).
• Let λ(r) be the (scalar) average cost per stage, satisfying Bellman's equation
λ(r) e + h(r) = g(r) + P(r) h(r),
where h(r) is the differential cost vector.
• Consider minimizing λ(r) over r (here the data dependence on control is encoded in the parametrization).
• Other than random search, we can try to solve the problem by nonlinear programming/gradient descent methods.
• Important fact: If Δλ, Δg, ΔP are the changes in λ, g, P due to a small change Δr from a given r, we have
Δλ = ξ'(Δg + ΔP h),
where ξ is the steady-state probability distribution/vector corresponding to P(r), and all the quantities above are evaluated at r.

APPROXIMATION IN POLICY SPACE II
• Proof of the gradient formula: We have, by "differentiating" Bellman's equation,
Δλ(r) e + Δh(r) = Δg(r) + ΔP(r) h(r) + P(r) Δh(r)
By left-multiplying with ξ',
Δλ(r) ξ'e + ξ'Δh(r) = ξ'( Δg(r) + ΔP(r) h(r) ) + ξ'P(r) Δh(r)
Since Δλ(r) ξ'e = Δλ(r) and ξ' = ξ'P(r), this equation simplifies to
Δλ = ξ'(Δg + ΔP h)
• Since we don't know ξ, we cannot implement a gradient-like method for minimizing λ(r). An alternative is to use "sampled gradients", i.e., generate a simulation trajectory (i_0, i_1, ...), and change r once in a while, in the direction of a simulation-based estimate of ξ'(Δg + ΔP h).
• Important Fact: Δλ can be viewed as an expected value!
• There is much research on this subject; see the text.

6.231 DYNAMIC PROGRAMMING OVERVIEW - EPILOGUE
LECTURE OUTLINE
• Finite horizon problems
 − Deterministic vs Stochastic
 − Perfect vs Imperfect State Info
• Infinite horizon problems
 − Stochastic shortest path problems
 − Discounted problems
 − Average cost problems

FINITE HORIZON PROBLEMS - ANALYSIS
• Perfect state info
 − A general formulation - Basic problem, DP algorithm
 − A few nice problems admit analytical solution
• Imperfect state info
 − Reduction to perfect state info - Sufficient statistics
 − Very few nice problems admit an analytical solution
 − Finite-state problems admit reformulation as perfect state info problems whose states are probability distributions (the belief vectors)

FINITE HORIZON PROBS - EXACT COMP. SOL.
• Deterministic finite-state problems
 − Equivalent to shortest path
 − A wealth of fast algorithms
 − Hard combinatorial problems are a special case (but # of states grows exponentially)
• Stochastic perfect state info problems
 − The DP algorithm is the only choice
 − Curse of dimensionality is the big bottleneck
• Imperfect state info problems
 − Forget it!
 − Only trivial examples admit an exact computational solution

FINITE HORIZON PROBS - APPROX. SOL.
• Many techniques (and combinations thereof) to choose from
• Simplification approaches
 − Certainty equivalence
 − Problem simplification
 − Rolling horizon
 − Aggregation - Coarse grid discretization
• Limited lookahead combined with:
 − Rollout
 − MPC (an important special case)
 − Feature-based cost function approximation
• Approximation in policy space
 − Gradient methods

