
Hierarchical Optimal Control of MDPs

Amy McGovern, Univ. of Massachusetts, Amherst, MA 01003, amy@cs.umass.edu

Doina Precup, Univ. of Massachusetts, Amherst, MA 01003, [email protected]

Balaraman Ravindran, Univ. of Massachusetts, Amherst, MA 01003, [email protected]

Satinder Singh, Univ. of Colorado, Boulder, CO 80309, [email protected]

Richard S. Sutton, Univ. of Massachusetts, Amherst, MA 01003, [email protected]

Abstract

Fundamental to reinforcement learning, as well as to the theory of systems and control, is the problem of representing knowledge about the environment and about possible courses of action hierarchically, at a multiplicity of interrelated temporal scales. For example, a human traveler must decide which cities to go to, whether to fly, drive, or walk, and the individual muscle contractions involved in each step. In this paper we survey a new approach to reinforcement learning in which each of these decisions is treated uniformly. Each low-level action and high-level course of action is represented as an option, a (sub)controller and a termination condition. The theory of options is based on the theories of Markov and semi-Markov decision processes, but extends these in significant ways. Options can be used in place of actions in all the planning and learning methods conventionally used in reinforcement learning. Options and models of options can be learned for a wide variety of different subtasks, and then rapidly combined to solve new tasks. Options enable planning and learning simultaneously at a wide variety of time scales, and toward a wide variety of subtasks, substantially increasing the efficiency and abilities of reinforcement learning systems.

Introduction

The field of reinforcement learning is entering a new phase, in which it considers learning at multiple levels, and at multiple temporal and spatial scales. Such hierarchical approaches are advantageous in very large problems because they provide a principled way of forming approximate solutions. They also allow much greater flexibility and richness in what is learned. In particular, we can consider not just one task, but a whole range of tasks, solve them independently, and yet be able to combine their individual solutions quickly to solve new overall tasks. We also allow the learner to work not just with primitive actions, but with higher-level, temporally extended actions, called options. In effect, the learner can choose among subcontrollers rather than just low-level actions. This new direction is also consonant with reinforcement learning's roots in artificial intelligence, which has long focused on planning and knowledge representation at higher levels. In this paper we survey our work in recent years forming part of this trend.

From the point of view of classical control, our new work constitutes a hierarchical approach to solving Markov decision problems (MDPs), and in this paper we present it in that way. In particular, our methods are closely related to semi-Markov methods commonly used with discrete-event systems. The work we describe differs from most other hierarchical approaches in that we do not lump states together into larger states. We keep the original state representation and instead alter the temporal aspects of the actions.

In this paper we survey our recent and ongoing work in temporal abstraction and hierarchical control of Markov decision processes (Precup, Sutton & Singh 1998a,b, in prep.). This work is part of a larger trend toward focusing on these issues by many researchers in reinforcement learning (e.g., Singh, 1992a,b; Kaelbling, 1993; Lin, 1993; Dayan & Hinton, 1993; Thrun & Schwartz, 1995; Huber & Grupen, 1997; Dietterich, 1998; Parr & Russell, 1998).

Markov Decision Processes

In this section we briefly describe the conventional reinforcement learning framework of discrete-time, finite Markov decision processes, or MDPs, which forms the basis for our extensions to temporally extended courses of action. In this framework, a learning agent interacts with an environment at some discrete, lowest-level time scale $t = 0, 1, 2, \ldots$. On each time step the agent perceives the state of the environment, $s_t \in S$, and on that basis chooses a primitive action, $a_t \in A$. In response to each action, $a_t$, the environment produces one step later a numerical reward, $r_{t+1}$, and a next state, $s_{t+1}$. The environment's transition dynamics are modeled by one-step state-transition probabilities,

$p^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \},$

and one-step expected rewards,

$r^a_s = E\{ r_{t+1} \mid s_t = s, a_t = a \},$

for all $s, s' \in S$ and $a \in A$. These two sets of quantities together constitute the one-step model of the environment.

The agent's objective is to learn an optimal Markov policy, a mapping from states to probabilities of taking each available primitive action, $\pi : S \times A \mapsto [0, 1]$, that maximizes the expected discounted future reward from each state $s$:

$V^\pi(s) = E\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s, \pi \},$

where $\gamma \in [0, 1]$ is a discount-rate parameter. $V^\pi(s)$ is called the value of state $s$ under policy $\pi$, and $V^\pi$ is called the state-value function for $\pi$.


The unique optimal state-value function, $V^*(s) = \max_\pi V^\pi(s)$, $\forall s \in S$, gives the value of a state under an optimal policy. Any policy that achieves $V^*$ is by definition an optimal policy. There are also action-value functions, $Q^\pi : S \times A \mapsto \Re$ and $Q^* : S \times A \mapsto \Re$, that give the value of a state given that a particular action is initially taken in it, and a given policy is followed afterwards.
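As a concrete illustration of these definitions, the following minimal Python sketch evaluates $V^\pi$ for a finite MDP given the one-step model; the array layout and function name are illustrative assumptions, not from the paper:

import numpy as np

def evaluate_policy(p, r, pi, gamma=0.9, tol=1e-8):
    # Iterative policy evaluation for a finite MDP (a sketch).
    # p[a, s, s2] = Pr{s_{t+1} = s2 | s_t = s, a_t = a}   (one-step transition probabilities)
    # r[a, s]     = E{r_{t+1} | s_t = s, a_t = a}          (one-step expected rewards)
    # pi[s, a]    = probability of taking action a in state s
    # Returns V with V[s] approximating V^pi(s).
    n_actions, n_states, _ = p.shape
    V = np.zeros(n_states)
    while True:
        # Bellman backup: V(s) = sum_a pi(s, a) [ r(s, a) + gamma * sum_s' p(s, a, s') V(s') ]
        Q = r + gamma * np.einsum('ast,t->as', p, V)   # Q[a, s]
        V_new = np.einsum('sa,as->s', pi, Q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new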

Options

We use the term options for our generalization of primitive actions to include temporally extended courses of action. Formally, an option consists of three components: an input set $I \subseteq S$, a policy $\pi : S \times A \mapsto [0, 1]$, and a termination condition $\beta : S \mapsto [0, 1]$. An option $\langle I, \pi, \beta \rangle$ is available in state $s$ if and only if $s \in I$. If the option is taken, then actions are selected according to $\pi$ until the option terminates stochastically according to $\beta$. In particular, the next action $a_t$ is selected according to the probability distribution $\pi(s_t, \cdot)$. The environment then makes a transition to state $s_{t+1}$, where the option either terminates, with probability $\beta(s_{t+1})$, or else continues, determining $a_{t+1}$ according to $\pi(s_{t+1}, \cdot)$, possibly terminating in $s_{t+2}$ according to $\beta(s_{t+2})$, and so on. When the option terminates, the agent has the opportunity to select another option.

The input set and termination condition of an option together restrict its range of application in a potentially useful way. In particular, they limit the range over which the option's policy needs to be defined. For example, a handcrafted policy $\pi$ for a mobile robot to dock with its battery charger might be defined only for states $I$ in which the battery charger is within sight. The termination condition $\beta$ would be defined to be 1 outside of $I$ and when the robot is successfully docked. It is natural to assume that all states where an option might continue are also states where the option might be taken (i.e., that $\{ s : \beta(s) < 1 \} \subseteq I$). In this case, $\pi$ needs to be defined only over $I$ rather than over all of $S$.
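Read literally, the triple $\langle I, \pi, \beta \rangle$ maps directly onto a small data structure. Here is a minimal Python sketch (class and helper names are illustrative assumptions; for simplicity the sketch's policy returns a sampled action rather than a distribution over actions):

from dataclasses import dataclass
from typing import Any, Callable, Hashable, Set

State = Hashable
Action = Any

@dataclass(eq=False)
class Option:
    # An option <I, pi, beta>: input set, internal policy, termination condition.
    input_set: Set[State]               # I: states in which the option may be initiated
    policy: Callable[[State], Action]   # pi: maps a state to the action the option takes there
    beta: Callable[[State], float]      # beta: probability of terminating in a state

    def available(self, s: State) -> bool:
        return s in self.input_set

def primitive_option(a: Action, states: Set[State]) -> Option:
    # A primitive action is the special case of an option that lasts exactly one step.
    return Option(input_set=states, policy=lambda s: a, beta=lambda s: 1.0)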

The definition of options is crafted to make them as much like actions as possible, except temporally extended. Because options terminate in a well-defined way, we can consider policies that select options instead of primitive actions. Let the set of options available in state $s$ be denoted $\mathcal{O}_s$; the set of all options is denoted $\mathcal{O} = \bigcup_{s \in S} \mathcal{O}_s$. When initiated in a state $s_t$, the Markov policy over options $\mu : S \times \mathcal{O} \mapsto [0, 1]$ selects an option $o \in \mathcal{O}_{s_t}$ according to probability distribution $\mu(s_t, \cdot)$. The option $o$ is then taken in $s_t$, determining actions until it terminates in $s_{t+k}$, at which point a new option is selected, according to $\mu(s_{t+k}, \cdot)$, and so on. In this way a policy over options, $\mu$, determines a policy over actions, or flat policy, $\pi = f(\mu)$. Henceforth we use the unqualified term policy for Markov policies over options, which include Markov flat policies as a special case. Note, however, that $f(\mu)$ is typically not Markov because the action taken in a state depends on which option is being taken at the time, not just on the state. We define the value of a state $s$ under a general flat policy $\pi$ as the expected return if the policy is started in $s$:

$V^\pi(s) \stackrel{\mathrm{def}}{=} E\{ r_{t+1} + \gamma r_{t+2} + \cdots \mid \mathcal{E}(\pi, s, t) \},$

where $\mathcal{E}(\pi, s, t)$ denotes the event of $\pi$ being initiated in $s$ at time $t$. The value of a state under a general policy (i.e., a policy over options) $\mu$ can then be defined as the value of the state under the corresponding flat policy: $V^\mu(s) \stackrel{\mathrm{def}}{=} V^{f(\mu)}(s)$.
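Because the flat behavior is generated by running each chosen option to stochastic termination, the execution loop is easy to sketch. The snippet below reuses the hypothetical Option class above and assumes an env.step(s, a) that returns (next_state, reward); the quantities it returns (discounted in-option return, termination state, accumulated discount, duration) are reused later for option models and learning:

import random

def run_option(env, s, option, gamma):
    # Execute an option <I, pi, beta> from state s until it terminates stochastically.
    # Returns (R, s_term, gamma^k, k): the discounted reward accumulated inside the option,
    # the state in which it terminated, the accumulated discount, and the duration.
    assert option.available(s)
    R, discount, k = 0.0, 1.0, 0
    while True:
        a = option.policy(s)                   # action chosen by the option's internal policy
        s, reward = env.step(s, a)
        R += discount * reward
        discount *= gamma
        k += 1
        if random.random() < option.beta(s):   # terminate with probability beta(s)
            return R, s, discount, k

def run_policy_over_options(env, s, mu, gamma, n_decisions):
    # mu(s) is assumed to sample an option available in s; running the chosen options
    # back to back induces the flat policy f(mu).  Returns the discounted return observed.
    G, discount = 0.0, 1.0
    for _ in range(n_decisions):
        R, s, d, _ = run_option(env, s, mu(s), gamma)
        G += discount * R
        discount *= d
    return G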

MDP + Options = SMDP

Options are closely related to the actions in a special kind of decision problem known as a semi-Markov decision process, or SMDP (e.g., see Puterman, 1994). In fact, a fixed set of options defines a new discrete-time SMDP embedded within the original MDP, as suggested by Figure 1. The top panel shows the state trajectory over discrete time of an MDP, the middle panel shows the larger state changes over continuous time of an SMDP, and the last panel shows how these two levels of analysis can be superimposed through the use of options. In this case the underlying base system is an MDP, with regular, single-step transitions, while the options define larger transitions, like those of an SMDP, that last for a number of discrete steps. All the usual SMDP theory applies to the superimposed SMDP defined by the options but, in addition, we have an explicit interpretation of them in terms of the underlying MDP. We will now outline the way in which some of the SMDP results can be interpreted and used in the context of MDPs and options.

Figure 1: The state trajectory of an MDP is made up of small, discrete-time transitions, whereas that of an SMDP comprises larger, continuous-time transitions. Options enable an MDP trajectory to be analyzed at either level.

Planning with options requires a model of their consequences. Fortunately, the appropriate form of model for options, analogous to the $r^a_s$ and $p^a_{ss'}$ defined earlier for actions, is known from existing SMDP theory. For each state in which an option may be started, this kind of model predicts the state in which the option will terminate and the total reward received along the way. These quantities are discounted in a particular way. For any option $o$, let $\mathcal{E}(o, s, t)$ denote the event of $o$ being initiated in state $s$ at time $t$. Then the reward part of the model of $o$ for state $s \in S$ is

$r^o_s = E\{ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \mid \mathcal{E}(o, s, t) \},$   (1)

where $t + k$ is the random time at which $o$ terminates. The state-prediction part of the model of $o$ for state $s$ is

$p^o_{ss'} = \sum_{j=1}^{\infty} \gamma^j \Pr\{ s_{t+k} = s', \, k = j \mid \mathcal{E}(o, s, t) \} = E\{ \gamma^k \delta_{s' s_{t+k}} \mid \mathcal{E}(o, s, t) \},$   (2)

for all $s' \in S$, under the same conditions, where $\delta_{ss'}$ is an identity indicator, equal to 1 if $s = s'$, and equal to 0 otherwise. Thus, $p^o_{ss'}$ is a combination of the likelihood that $s'$ is the state in which $o$ terminates together with a measure of how delayed that outcome is relative to $\gamma$. We call this kind of model a multi-time model (Precup and Sutton, 1998) because it describes the outcome of an option not at a single time, but at potentially many different times, appropriately combined.
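Equations (1) and (2) suggest a straightforward Monte Carlo estimator of an option's model. The sketch below, which reuses the hypothetical run_option helper above and assumes a finite state set, simply averages discounted in-option returns and discounted termination indicators:

from collections import defaultdict

def estimate_option_model(env, option, states, gamma, n_rollouts=1000):
    # Monte Carlo estimates of r^o_s and p^o_{ss'} as defined in Eqs. (1) and (2).
    r_o = {}
    p_o = defaultdict(float)
    for s in states:
        if not option.available(s):
            continue
        total_R = 0.0
        discounted_terminations = defaultdict(float)
        for _ in range(n_rollouts):
            R, s_term, disc, _ = run_option(env, s, option, gamma)
            total_R += R
            discounted_terminations[s_term] += disc   # sample of gamma^k * delta_{s' s_{t+k}}
        r_o[s] = total_R / n_rollouts
        for s2, total in discounted_terminations.items():
            p_o[(s, s2)] = total / n_rollouts
    return r_o, p_o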

Using multi-time models we can write Bellman equations for general policies and options. For instance, let us denote a restricted set of options by $\mathcal{O}$ and the set of all policies selecting only from options in $\mathcal{O}$ by $\Pi(\mathcal{O})$. Then the optimal value function given that we can select only from $\mathcal{O}$ is

$V^*_\mathcal{O}(s) = \max_{o \in \mathcal{O}_s} \left[ r^o_s + \sum_{s'} p^o_{ss'} V^*_\mathcal{O}(s') \right].$

A corresponding optimal policy, denoted $\mu^*_\mathcal{O}$, is any policy that achieves $V^*_\mathcal{O}$, i.e., for which $V^{\mu^*_\mathcal{O}}(s) = V^*_\mathcal{O}(s)$ in all states $s \in S$. If $V^*_\mathcal{O}$ and models of the options are known, then $\mu^*_\mathcal{O}$ can be formed by choosing in any proportion among the maximizing options in the equation above.

It is straightforward to extend MDP planning methods to SMDPs. The policies found using temporally abstract options are approximate in the sense that they achieve only $V^*_\mathcal{O}$, which is typically less than $V^*$.
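Given option models, the Bellman optimality equation above can be solved with ordinary synchronous value iteration, with options in place of actions. A tabular sketch, where option_models is assumed to be a list of (option, r_o, p_o) triples such as the estimator above produces:

def svi_with_options(states, option_models, n_iterations=50):
    # Synchronous value iteration over a set of options O, approximating V*_O.
    # r_o[s] ~ r^o_s and p_o[(s, s2)] ~ p^o_{ss'}.  Because discounting is already
    # folded into p_o, no explicit gamma appears in the backup.
    V = {s: 0.0 for s in states}
    for _ in range(n_iterations):
        V_new = {}
        for s in states:
            backups = [
                r_o[s] + sum(p_o.get((s, s2), 0.0) * V[s2] for s2 in states)
                for option, r_o, p_o in option_models
                if option.available(s)
            ]
            V_new[s] = max(backups) if backups else V[s]
        V = V_new
    return V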

The Rooms Example

As a simple illustration of planning with options, consider the rooms example, a gridworld environment of four rooms shown in Figure 2. The cells of the grid correspond to the states of the environment. From any state the agent can perform one of four actions, up, down, left, or right. These actions usually move the agent in the corresponding direction, but with 1/3 probability they instead move the agent in another, random direction. In each of the four rooms, the system is also provided with two built-in hallway options that take the agent from anywhere within the room to one of the hallway cells leading out of the room.
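The stochastic cell-to-cell dynamics just described are simple to sketch. In the snippet below the grid layout, wall set, and function name are illustrative assumptions, and "another, random direction" is read as a uniform choice among the other three directions:

import random

ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def rooms_step(state, action, walls, slip=1.0 / 3.0):
    # One primitive transition in the rooms gridworld.  With probability `slip` the
    # intended action is replaced by one of the other three directions, chosen at
    # random; a move into a wall cell leaves the state unchanged.
    if random.random() < slip:
        action = random.choice([a for a in ACTIONS if a != action])
    dr, dc = ACTIONS[action]
    candidate = (state[0] + dr, state[1] + dc)
    return state if candidate in walls else candidate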

[Figure 2 diagram: a four-room gridworld with hallways, showing the 4 stochastic primitive actions (up, down, left, right, failing 33% of the time) and 8 multi-step hallway options, two per room, each leading to one of the room's 2 hallways; the goal location is marked G.]

Figure 2: The rooms example is a gridworld environment with stochastic cell-to-cell actions and room-to-room hallway options. Two of the hallway options are suggested by the arrows labeled $o_1$ and $o_2$. The label G indicates a location used as a goal.

To complete the specification of the planning problem we designate one state as a goal, say the state labeled G, by providing a reward of +1 on arrival there. Figure 3 shows the results of applying synchronous value iteration (SVI) to this problem with and without options. The upper part of the figure shows the value function after the first two iterations of SVI using just primitive actions. The region of accurately valued states moved out by one cell on each iteration, but after two iterations most states still had their initial arbitrary value of zero. In the lower part of the figure are shown the corresponding value functions for SVI with the hallway options. In the first iteration all states in the rooms adjacent to the goal state became accurately valued, and in the second iteration all the states became accurately valued. Rather than planning step-by-step, the hallway options enable the planning to proceed at a higher level, room-by-room, and thus be much faster.

[Figure 3 panels: value functions at Iteration #0, #1, and #2, with cell-to-cell primitive actions (top row) and with room-to-room options (bottom row); V(goal) = 1 in both cases.]

Figure 3: Value functions formed over iterations of planning by synchronous value iteration with primitive actions and with hallway options. The hallway options enable planning to proceed room-by-room rather than cell-by-cell. The area of the disk in each cell is proportional to the estimated value of the state, where a disk that just fills a cell represents a value of 1.0. We use discounting with $\gamma = 0.9$ for this task.


[Figure 4 panels: trajectories through the space of landmarks from S to G (SMDP solution, 600 steps; termination-improved solution, 474 steps), and the corresponding state-value functions plotted over the roughly 3 x 3 space, with values ranging from about -600 to 0.]

Figure 4: Termination improvement in navigating with landmark-directed controllers. The task (left) is to navigate from S to G in minimum time using options based on controllers that each run to one of seven landmarks (the black dots). The circles show the region around each landmark within which the controllers operate. The thin line shows the optimal behavior that uses only these controllers run to termination, and the thick line shows the corresponding termination-improved behavior, which cuts the corners. The right panels show the state-value functions for the SMDP-optimal and termination-improved policies. Note that the latter is greater.


Termination Improvement

So far we have assumed that an option, once started, must be followed until it terminates. This assumption is necessary to apply the theoretical machinery of SMDPs. On the other hand, the whole point of the options framework is that one also has an interpretation in terms of the underlying MDP. This enables us to consider interrupting options before they would terminate normally. For example, suppose we have determined the option-value function $Q^\mu(s, o)$ for some policy $\mu$ and for all state-option pairs $(s, o)$ that could be encountered while following $\mu$. This function tells us how well we do while following $\mu$ and committing irrevocably to each option, but it can also be used to re-evaluate our commitment on each step. Suppose at time $t$ we are in the midst of executing option $o$. If $o$ is Markov in $s$, then we can compare the value of continuing with $o$, which is $Q^\mu(s_t, o)$, to the value of terminating $o$ and selecting a new option according to $\mu$, which is $V^\mu(s) = \sum_{o'} \mu(s, o') Q^\mu(s, o')$. If the latter is more highly valued, then why not terminate $o$ and allow the switch? Indeed, we have shown that this new way of behaving is guaranteed to be better. We characterize this as an improvement in the termination condition of the original option, i.e., as a termination improvement.
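The re-evaluation described above amounts to one comparison per step. A minimal sketch, assuming a tabular Q indexed by (state, option) pairs and a callable mu(s, o) giving selection probabilities (both illustrative names):

def should_interrupt(s, current_option, mu, Q, options):
    # Termination improvement: interrupt the executing option o in state s whenever
    # V^mu(s) = sum_{o'} mu(s, o') Q^mu(s, o') exceeds Q^mu(s, o), the value of continuing.
    value_of_continuing = Q[(s, current_option)]
    value_of_switching = sum(mu(s, o) * Q[(s, o)] for o in options if o.available(s))
    return value_of_switching > value_of_continuing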

Figure 4 shows a simple example of termination improvement. Here the task is to navigate from a start location to a goal location within a continuous two-dimensional state space. The actions are movements of 0.01 in any direction from the current state. Rather than working with these low-level actions, infinite in number, we introduce seven landmark locations. For each landmark we define a controller that takes us to the landmark in a direct path. Each controller is only applicable within a limited range of states, in this case within a certain distance of the corresponding landmark. Each controller then defines an option: the circular region around the controller's landmark is the option's input set, the controller itself is the policy, and arrival at the target landmark is the termination condition. We denote the set of seven landmark options by $\mathcal{O}$. Any action within 0.01 of the goal location transitions to the terminal state, $\gamma = 1$, and the reward is $-1$ on all transitions, which makes this a minimum-time task.

One of the landmarks coincides with the goal, so it is possible to reach the goal while picking only from $\mathcal{O}$. The optimal policy within $\mathcal{O}$ runs from landmark to landmark, as shown by the thin line in Figure 4. This is the optimal solution to the SMDP defined by $\mathcal{O}$ and is indeed the best that one can do while picking only from these options. But of course one can do better if the options are not followed all the way to each landmark. The trajectory shown by the thick line in Figure 4 cuts the corners and is shorter. This is the termination-improved policy with respect to the SMDP-optimal policy. The termination-improved policy takes 474 steps from start to goal which, while not as good as the optimal policy in primitive actions (425 steps), is much better, for no additional cost, than the SMDP-optimal policy, which takes 600 steps. The state-value functions of the SMDP-optimal and termination-improved policies are also shown on the right in Figure 4.
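The landmark-directed controllers described above can be written as options over the continuous space. The following is a minimal sketch; since the states are continuous, a membership predicate stands in for the input set, and the numeric tolerances are assumptions:

import math

def landmark_option(landmark, radius, step=0.01, arrive_tol=0.01):
    # A landmark-directed controller packaged as an option in continuous 2-D space.
    # Input set: states within `radius` of the landmark.  Policy: a movement of length
    # `step` directly toward the landmark.  Termination: arrival at the landmark.
    def in_input_set(s):
        return math.dist(s, landmark) <= radius

    def policy(s):
        dx, dy = landmark[0] - s[0], landmark[1] - s[1]
        d = math.hypot(dx, dy) or 1.0
        return (step * dx / d, step * dy / d)      # action = small displacement vector

    def beta(s):
        return 1.0 if math.dist(s, landmark) <= arrive_tol else 0.0

    return in_input_set, policy, beta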

Another illustration of termination improvement in a more complex task is shown in Figure 5. The task here is to fly a reconnaissance plane from base, to observe as many sites as possible, from a given set of sites, and return to base without running out of fuel. The local weather at each site flips between cloudy and clear according to independent Poisson processes. If the sky at a given site is cloudy when the plane gets there, no observation is made and the reward is 0. If the sky is clear, the plane gets a reward, according to the importance of the site. The plane has a limited amount of fuel, and it consumes one unit of fuel during each time tick. If the fuel runs out before reaching the base, the plane crashes and receives a reward of $-100$.

[Figure 5 diagram: a map of observation sites around the base, annotated with each site's reward and the mean time between weather changes, with options running between sites; bar chart: expected reward per mission for the SMDP planner, the static re-planner, and termination improvement, under low-fuel and high-fuel conditions.]

Figure 5: The mission planning task and the performance of policies constructed by SMDP methods, termination improvement of the SMDP policy, and an optimal static re-planner that does not take into account possible changes in weather conditions.

The primitive actions are tiny movements in any direction (there is no inertia). The state of the system is described by several variables: the current position of the plane, the fuel level, the sites that have been observed so far, and the current weather at each of the remaining sites. This state-action space has on the order of billions of elements (assuming 100 discretization levels of the continuous variables), making the problem intractable by normal dynamic programming methods. We introduced options that can take the plane to each of the sites (including the base), from any position in the input space. The resulting SMDP has only 874,800 elements and it is feasible to determine $V^*_\mathcal{O}(s')$ exactly for all sites $s'$. From this solution and the model of the options, we can determine $Q^*_\mathcal{O}(s, o) = r^o_s + \sum_{s'} p^o_{ss'} V^*_\mathcal{O}(s')$ for any option $o$ and any state $s$ in the whole space.
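That last step is just the option-model backup written once per (state, option) pair; a one-function sketch, reusing the hypothetical model dictionaries assumed earlier:

def q_from_option_model(s, option_model, V_O, states):
    # Q*_O(s, o) = r^o_s + sum_{s'} p^o_{ss'} V*_O(s'), computed from the option's model.
    option, r_o, p_o = option_model
    return r_o[s] + sum(p_o.get((s, s2), 0.0) * V_O[s2] for s2 in states)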

The data in Figure 5 compare the SMDP and termination-improvement policies found for the problem with the performance of a static planner, which exhaustively searches for the best tour assuming the weather does not change, and then re-plans whenever the weather does change. The policy obtained by the termination-improvement approach performs significantly better than the SMDP policy, which in turn is significantly better than the static planner.

Intra-Option Learning

Optimal value functions can be determined by learning as well as by planning. One natural approach is to use SMDP learning methods (Bradtke & Duff (1995), Parr & Russell (1998), Mahadevan et al. (1997), and McGovern, Sutton & Fagg (1997)), which treat complete option executions just as primitive actions are treated in conventional reinforcement learning methods. One drawback to these methods is that they need to execute an option to termination before they can learn about it. Because of this, they can only be applied to one option at a time, the option that is executing at that time. More interesting and potentially more powerful methods are possible by taking advantage of the structure inside each option. In particular, if the options are Markov and we are willing to look inside them, then we can use special temporal-difference methods to learn usefully about an option before the option terminates. This is the main idea behind intra-option methods.
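The contrast can be made concrete with the two kinds of one-step updates. The update rules below are not spelled out in this survey; they are sketches in the spirit of the cited SMDP Q-learning and intra-option learning papers, simplified to tabular values and deterministic option policies. Q is assumed to be a defaultdict(float) over (state, option) pairs, and at least the primitive options are assumed available everywhere:

def smdp_q_update(Q, s, o, R, s_next, disc, options, alpha=0.1):
    # SMDP Q-learning: update only when option o has terminated.  R is the discounted
    # reward accumulated while o ran, and disc = gamma^k for its duration k (both as
    # returned by run_option above).
    best_next = max(Q[(s_next, o2)] for o2 in options if o2.available(s_next))
    Q[(s, o)] += alpha * (R + disc * best_next - Q[(s, o)])

def intra_option_q_update(Q, s, a, reward, s_next, options, gamma, alpha=0.1):
    # Intra-option learning: a single primitive transition (s, a, reward, s_next) updates
    # every Markov option whose policy would have chosen a in s, whether or not that
    # option was actually the one executing.
    best_next = max(Q[(s_next, o2)] for o2 in options if o2.available(s_next))
    for o in options:
        if not o.available(s) or o.policy(s) != a:
            continue
        # Blend continuing the option with terminating and re-selecting, weighted by beta.
        U = (1.0 - o.beta(s_next)) * Q[(s_next, o)] + o.beta(s_next) * best_next
        Q[(s, o)] += alpha * (reward + gamma * U - Q[(s, o)])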

[Figure 6 plots, for the rooms task: absolute error in option values, averaged over options, versus episodes (0 to 10,000), for SMDP Q-learning, Macro Q-learning, and intra-option value learning; and average on-line reward versus episodes for the same three methods.]

Figure 6: Comparison of SMDP, intra-option, and Macro Q-learning. Intra-option methods converge faster to correct values.

Figure 6 shows an illustration of the advantages of intra-option learning in the rooms example. In this case small negative rewards were introduced at all the states, and the goal was located at G. We experimented with two SMDP methods: one-step SMDP Q-learning (Bradtke & Duff, 1995) and a hierarchical form of Q-learning called Macro Q-learning (McGovern, Sutton & Fagg, 1997). Although the SMDP methods can be used here, they were much slower than the intra-option method.

Analyzing the Effects of Options

Adding options can either accelerate or retard learning depending on their appropriateness to the particular task. In another aspect of our work, we are trying to break down the effect of options into components that can be measured and studied independently. The two components that we have studied so far are: the effect of options on initial exploratory behavior, independent of learning, and the effect of learning with options on the speed at which correct value information is propagated, independent of the behavior. We have found that both of these effects are significant.

We have measured these effects in gridworld tasks and in the larger, simulated robotics task shown on the left in Figure 7. This is a foraging task in a two-dimensional space. The circular robot inhabits a world with two rooms, one door connecting them, and one food object.


[Figure 7 panels: the simulated environment (left) and the robot's position during a random walk without options and with options (right).]

Figure 7: The simulated robotic foraging task. The picture of the environment shows the five sonars, the doorway sensors, and the food sensor. The graphs on the right-hand side represent the position of the robot during a random walk.

The robot has simulated sonars to sense the distance to the nearest wall in each of five fixed directions, and simple inertial dynamics, with friction and inelastic collisions with the walls. We provide two options, one which orients the robot towards the door, and the other which drives the robot forward until it encounters a wall. Because the state space is continuous and large, we used a tile-coding function approximator.
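Tile coding maps the continuous sensor readings to a sparse set of active features. The following minimal sketch is a generic illustration only; the tiling sizes, bounds, and index scheme are assumptions, not the paper's implementation:

import numpy as np

def tile_indices(x, bounds, n_tilings=8, tiles_per_dim=10):
    # Map a continuous state vector x to one active tile index per tiling.  Each tiling
    # is an offset grid over the (assumed) bounded state space; a value estimate is then
    # the sum of the weights at the returned indices.
    x = np.asarray(x, dtype=float)
    lows = np.array([b[0] for b in bounds])
    highs = np.array([b[1] for b in bounds])
    scaled = (x - lows) / (highs - lows) * tiles_per_dim     # position in tile units
    indices = []
    for t in range(n_tilings):
        coords = np.floor(scaled + t / n_tilings).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim)           # room for the offset overflow
        idx = t
        for c in coords:
            idx = idx * (tiles_per_dim + 1) + int(c)         # flatten into one weight index
        indices.append(idx)
    return indices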

To assess the effect of options, we examined the behavior for 100,000 steps when the actions were selected randomly from the primitive actions only and from both the primitive actions and the options. The two right panels in Figure 7 show a projection of one such trajectory onto the two spatial dimensions. The options have a large influence on this exploratory behavior. With options, the robot crosses more often between the two rooms and travels more often with high velocity. In preliminary results we have also been able to show faster learning of efficient foraging strategies through the use of options.

Closing

In this paper we have briefly surveyed a number of ways in which temporal abstraction can contribute to the hierarchical control of MDPs. We have presented some of the basic theory and several suggestive examples, but many of the most interesting questions remain open.

Acknowledgments

The authors gratefully acknowledge the contributions to these ideas of many colleagues, especially Andrew Barto, Ron Parr, Tom Dietterich, Andrew Fagg, Leo Zelevinsky, and Manfred Huber. We also thank Paul Cohen, Robbie Moll, Mance Harmon, Sascha Engelbrecht, and Ted Perkins for helpful reactions and constructive criticism. This work was supported by NSF grant ECS-9511805 and grant AFOSR-F49620-96-1-0254, both to Andrew Barto and Richard Sutton. Satinder Singh was supported by NSF grant IIS-9711753.

References

Bradtke, S. J., and Duff, M. O. 1995. Reinforcement learning methods for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems 7, 393–400. MIT Press.

Dayan, P., and Hinton, G. E. 1993. Feudal reinforcement learning. In Advances in Neural Information Processing Systems 5, 271–278. Morgan Kaufmann.

Dietterich, T. G. 1998. The MAXQ method for hierarchical reinforcement learning. In Proc. of the 15th Intl. Conf. on Machine Learning. Morgan Kaufmann.

Huber, M., and Grupen, R. A. 1997. A feedback control structure for on-line learning tasks. Robotics and Autonomous Systems 22(3-4):303–315.

Kaelbling, L. P. 1993. Hierarchical learning in stochastic domains: Preliminary results. In Proc. of the 10th Intl. Conf. on Machine Learning, 167–173. Morgan Kaufmann.

Lin, L.-J. 1993. Reinforcement Learning for Robots Using Neural Networks. Ph.D. Dissertation, Carnegie Mellon University.

Mahadevan, S.; Marchallek, N.; Das, T. K.; and Gosavi, A. 1997. Self-improving factory simulation using continuous-time average-reward reinforcement learning. In Proc. of the 14th Intl. Conf. on Machine Learning, 202–210. Morgan Kaufmann.

McGovern, A.; Sutton, R. S.; and Fagg, A. H. 1997. Roles of macro-actions in accelerating reinforcement learning. In Grace Hopper Celebration of Women in Computing, 13–17.

Parr, R., and Russell, S. 1998. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems 10. MIT Press.

Precup, D.; Sutton, R. S.; and Singh, S. 1998a. Multi-time models for temporally abstract planning. In Advances in Neural Information Processing Systems 10. MIT Press.

Precup, D.; Sutton, R. S.; and Singh, S. 1998b. Theoretical results on reinforcement learning with temporally abstract options. In Machine Learning: ECML-98. Proceedings of the 10th European Conference on Machine Learning, 382–393. Springer.

Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.

Singh, S. P. 1992a. Reinforcement learning with a hierarchy of abstract models. In Proc. of the 10th National Conf. on Artificial Intelligence, 202–207. MIT/AAAI Press.

Singh, S. P. 1992b. Scaling reinforcement learning by learning variable temporal resolution models. In Proc. of the 9th Intl. Conf. on Machine Learning, 406–415. Morgan Kaufmann.

Sutton, R. S.; Precup, D.; and Singh, S. 1998. Intra-option learning about temporally abstract actions. In Proc. of the 15th Intl. Conf. on Machine Learning. Morgan Kaufmann.

Sutton, R. S.; Precup, D.; and Singh, S. In preparation. Between MDPs and semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales.

Thrun, S., and Schwartz, A. 1995. Finding structure in reinforcement learning. In Advances in Neural Information Processing Systems 7, 385–392. MIT Press.

