Page 1

ECE-517: Reinforcement Learning in Artificial Intelligence

Lecture 6: Optimality Criterion in MDPs

Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee

Fall 2011

September 8, 2011

Page 2

Outline

Optimal value functions (cont.)
Implementation considerations
Optimality and approximation

Page 3

Recap on Value Functions

We define the state-value function for policy $\pi$ as

$V^{\pi}(s) = E_{\pi}\big\{ \textstyle\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s \big\}$

Similarly, we define the action-value function for $\pi$ as

$Q^{\pi}(s,a) = E_{\pi}\big\{ \textstyle\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s,\, a_t = a \big\}$

The Bellman equation:

$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{\pi}(s') \big]$

The value function $V^{\pi}$ is the unique solution to its Bellman equation.
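
Because $V^{\pi}$ is the unique solution to a linear system, it can be computed directly when the dynamics are known. Below is a minimal Python sketch of this; the array names P, R, and policy and their shapes are illustrative assumptions, not notation from the lecture.

import numpy as np

def bellman_solve(P, R, policy, gamma=0.9):
    # V^pi(s) = sum_a pi(s,a) * sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V^pi(s'))
    # In matrix form: V = r_pi + gamma * P_pi @ V, i.e. (I - gamma * P_pi) V = r_pi.
    # P: transition probabilities, shape (S, A, S); R: expected rewards, shape (S, A, S);
    # policy: pi(s, a), shape (S, A).
    S = P.shape[0]
    P_pi = np.einsum('sa,sax->sx', policy, P)        # state-to-state transition matrix under pi
    r_pi = np.einsum('sa,sax,sax->s', policy, P, R)  # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

For large state sets an iterative sweep (policy evaluation) would replace the direct solve.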

Page 4

Optimal Value Functions

A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states, i.e.

$\pi \geq \pi' \;\; \text{iff} \;\; V^{\pi}(s) \geq V^{\pi'}(s) \;\; \forall s \in S$

There is always at least one policy (a.k.a. an optimal policy) that is better than or equal to all other policies.

Optimal policies share the same optimal state-value function, defined as

$V^{*}(s) = \max_{\pi} V^{\pi}(s) \;\; \forall s \in S$

Optimal policies also share the same optimal action-value function, defined as

$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) \;\; \forall s \in S,\, a \in A(s)$
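
As a reminder of how these two definitions fit together (a standard identity, not shown explicitly on this slide), the optimal state-value and action-value functions satisfy

$V^{*}(s) = \max_{a \in A(s)} Q^{*}(s,a)$

which is the form the Bellman optimality equation on the following slides builds on.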

Page 5

Optimal Value Functions (cont.)

The latter gives the expected return for taking action a in state s and thereafter following an optimal policy. Thus, we can write

$Q^{*}(s,a) = E\big\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s,\, a_t = a \big\}$

Since $V^{*}$ is the value function for a policy, it must satisfy the Bellman equation. This is called the Bellman optimality equation.

Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.

Page 6

Optimal Value Functions (cont.)

[The equations shown on this slide did not survive in the transcript.]
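
Given that the previous slide introduces the Bellman optimality equation and the next slide gives the version for Q, this slide presumably showed the standard form for V* from Sutton and Barto, reproduced here for reference; the $P^{a}_{ss'}$, $R^{a}_{ss'}$ notation for transition probabilities and expected rewards is assumed:

$V^{*}(s) = \max_{a \in A(s)} Q^{\pi^{*}}(s,a) = \max_{a \in A(s)} E\big\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s,\, a_t = a \big\} = \max_{a \in A(s)} \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{*}(s') \big]$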

Page 7

Optimal Value Functions (cont.)

The Bellman optimality equation for Q is:
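
[The equation image is not preserved; in the standard form, with the same assumed notation as above, it reads]

$Q^{*}(s,a) = E\big\{ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s,\, a_t = a \big\} = \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a') \big]$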

Backup diagrams: arcs have been added at the agent's choice points to represent that the maximum over that choice is taken, rather than the expected value (given some policy).

Page 8

Optimal Value Functions (cont.)

For finite MDPs, the Bellman optimality equation has a unique solution, independent of the policy.

The Bellman optimality equation is actually a system of equations, one for each state:
  N equations (one for each state) in N unknowns, the values V*(s)
  This assumes you know the dynamics of the environment.

Once one has V*, it is relatively easy to determine an optimal policy:
  For each state there will be one or more actions for which the maximum is obtained in the Bellman optimality equation.
  Any policy that assigns nonzero probability only to these actions is an optimal policy.
  This translates to a one-step search, i.e. greedy decisions will be optimal.
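
As a concrete illustration of solving this system and then acting greedily, here is a minimal value-iteration sketch in Python; it presumes the dynamics are known, and the array names P and R and their shapes are assumptions for illustration rather than the lecture's notation.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    # Solve V*(s) = max_a sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V*(s')) by repeated backups.
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = np.einsum('sax,sax->sa', P, R + gamma * V[None, None, :])  # backup for every (s, a)
        V_new = Q.max(axis=1)                                          # max over actions
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    greedy_policy = Q.argmax(axis=1)  # any maximizing action is optimal (one-step greedy search)
    return V_new, greedy_policy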

Page 9

Optimal Value Functions (cont.)

With Q*, the agent does not even have to do a one-step-ahead search:
  For any state s, the agent can simply find any action that maximizes Q*(s,a).

The action-value function effectively embeds the results of all one-step-ahead searches. It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair.

The agent does not need to know anything about the dynamics of the environment.

Q: What are the implementation tradeoffs here?
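
One way to see the trade-off the slide asks about: acting greedily from a Q-table needs only an argmax, while acting from V* requires a one-step lookahead through the model. A small sketch (the names Q, V, P, R are illustrative assumptions):

import numpy as np

def act_from_q(Q, s):
    # With Q*, action selection is just an argmax over the row for state s; no model needed.
    return int(np.argmax(Q[s]))            # Q: array of shape (S, A)

def act_from_v(V, P, R, s, gamma=0.9):
    # With only V*, the agent must do a one-step lookahead through the dynamics P, R.
    lookahead = [np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(P.shape[1])]
    return int(np.argmax(lookahead))

The Q-table costs O(|S||A|) memory but needs no model at decision time; storing V* costs only O(|S|) but requires P and R for the lookahead.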

Page 10

Implementation Considerations

Computational complexity
  How complex is it to evaluate the value and state-value functions?
    In software
    In hardware

Data flow constraints
  Which part of the data needs to be globally vs. locally available?
  Impact of memory bandwidth limitations

Page 11

Recycling Robot revisited

A transition graph is a useful way to summarize the dynamics of a finite MDP:
  A state node for each possible state
  An action node for each possible state-action pair
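
A compact way to write down such a transition graph in code is as a nested mapping from state and action to (probability, next state, reward) triples. The sketch below uses the standard parameterization of the recycling robot from Sutton and Barto (probabilities alpha and beta, rewards r_search and r_wait, and a -3 penalty when the battery is depleted); these values are an assumption about the version the lecture uses.

def recycling_robot(alpha, beta, r_search, r_wait):
    # dynamics[state][action] = list of (probability, next_state, reward)
    return {
        'high': {
            'search':   [(alpha, 'high', r_search), (1 - alpha, 'low', r_search)],
            'wait':     [(1.0, 'high', r_wait)],
        },
        'low': {
            'search':   [(beta, 'low', r_search), (1 - beta, 'high', -3.0)],  # depleted: rescued, penalty -3
            'wait':     [(1.0, 'low', r_wait)],
            'recharge': [(1.0, 'high', 0.0)],
        },
    }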

Page 12

Bellman Optimality Equations for the Recycling Robot

To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge, by h, l, s, w, and re, respectively.
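
The slide's equations themselves are not preserved in this transcript. As a point of reference, the Bellman optimality equations for this example as they appear in Sutton and Barto, with α and β the search-success probabilities and R^s, R^w the expected search and wait rewards (this notation is assumed here), are:

$V^{*}(h) = \max\big\{ R^{s} + \gamma\big[\alpha V^{*}(h) + (1-\alpha) V^{*}(l)\big],\;\; R^{w} + \gamma V^{*}(h) \big\}$

$V^{*}(l) = \max\big\{ \beta R^{s} - 3(1-\beta) + \gamma\big[(1-\beta) V^{*}(h) + \beta V^{*}(l)\big],\;\; R^{w} + \gamma V^{*}(l),\;\; \gamma V^{*}(h) \big\}$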

Page 13

Optimality and Approximation

Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens:
  Finding an optimal policy usually involves a heavy computational load.
  Typically, agents compute approximations to the optimal policy.

A critical aspect of the problem facing the agent is always the computational resources available to it, in particular the amount of computation it can perform in a single time step.

Practical considerations are thus:
  Computational complexity
  Memory available
    Tabular methods apply for small state sets
  Communication overhead (for distributed implementations)
  Hardware vs. software

Page 14

Are approximations good or bad?

RL typically relies on approximation mechanisms (see later). This could be an opportunity:
  Efficient "feature-extraction" types of approximation may actually reduce "noise".
  They make it practical for us to address large-scale problems.

In general, making "bad" decisions in RL results in learning opportunities (online).

The online nature of RL encourages learning more effectively from events that occur frequently.
  This is supported in nature.

Capturing regularities is a key property of RL.

