Page 1

ECE-517: Reinforcement Learning in Artificial Intelligence

Lecture 6: Optimality Criterion in MDPs

Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee

Fall 2011

September 8, 2011

Page 2

Outline

Optimal value functions (cont.)
Implementation considerations
Optimality and approximation

Page 3

Recap on Value Functions

We define the state-value function for policy $\pi$ as

$V^{\pi}(s) = E_{\pi}\big\{ \textstyle\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s \big\}$

Similarly, we define the action-value function for $\pi$ as

$Q^{\pi}(s,a) = E_{\pi}\big\{ \textstyle\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s,\, a_t = a \big\}$

The Bellman equation:

$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{\pi}(s') \big]$

The value function $V^{\pi}$ is the unique solution to its Bellman equation.
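
Because $V^{\pi}$ is the unique solution to a linear system, it can be computed directly when the dynamics are known. Below is a minimal Python sketch of this; the array names P, R, and policy and their shapes are illustrative assumptions, not notation from the lecture.

import numpy as np

def bellman_solve(P, R, policy, gamma=0.9):
    # V^pi(s) = sum_a pi(s,a) * sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V^pi(s'))
    # In matrix form: V = r_pi + gamma * P_pi @ V, i.e. (I - gamma * P_pi) V = r_pi.
    # P: transition probabilities, shape (S, A, S); R: expected rewards, shape (S, A, S);
    # policy: pi(s, a), shape (S, A).
    S = P.shape[0]
    P_pi = np.einsum('sa,sax->sx', policy, P)        # state-to-state transition matrix under pi
    r_pi = np.einsum('sa,sax,sax->s', policy, P, R)  # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

For large state sets an iterative sweep (policy evaluation) would replace the direct solve.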

Page 4

Optimal Value Functions

A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states, i.e.

$\pi \geq \pi' \;\; \text{iff} \;\; V^{\pi}(s) \geq V^{\pi'}(s) \;\; \forall s \in S$

There is always at least one policy (a.k.a. an optimal policy) that is better than or equal to all other policies.

Optimal policies share the same optimal state-value function, defined as

$V^{*}(s) = \max_{\pi} V^{\pi}(s) \;\; \forall s \in S$

Optimal policies also share the same optimal action-value function, defined as

$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) \;\; \forall s \in S,\, a \in A(s)$
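
As a reminder of how these two definitions fit together (a standard identity, not shown explicitly on this slide), the optimal state-value and action-value functions satisfy

$V^{*}(s) = \max_{a \in A(s)} Q^{*}(s,a)$

which is the form the Bellman optimality equation on the following slides builds on.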

Page 5

Optimal Value Functions (cont.)

The latter gives the expected return for taking action a in state s and thereafter following an optimal policy. Thus, we can write

$Q^{*}(s,a) = E\big\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s,\, a_t = a \big\}$

Since $V^{*}$ is the value function for a policy, it must satisfy the Bellman equation. This is called the Bellman optimality equation.

Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.

Page 6

Optimal Value Functions (cont.)

[The equations shown on this slide did not survive in the transcript.]
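
Given that the previous slide introduces the Bellman optimality equation and the next slide gives the version for Q, this slide presumably showed the standard form for V* from Sutton and Barto, reproduced here for reference; the $P^{a}_{ss'}$, $R^{a}_{ss'}$ notation for transition probabilities and expected rewards is assumed:

$V^{*}(s) = \max_{a \in A(s)} Q^{\pi^{*}}(s,a) = \max_{a \in A(s)} E\big\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s,\, a_t = a \big\} = \max_{a \in A(s)} \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{*}(s') \big]$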

Page 7

Optimal Value Functions (cont.)

The Bellman optimality equation for Q is:
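
[The equation image is not preserved; in the standard form, with the same assumed notation as above, it reads]

$Q^{*}(s,a) = E\big\{ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s,\, a_t = a \big\} = \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a') \big]$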

Backup diagrams: arcs have been added at the agent's choice points to represent that the maximum over that choice is taken, rather than the expected value (given some policy).

Page 8

Optimal Value Functions (cont.)

For finite MDPs, the Bellman optimality equation has a unique solution, independent of the policy.

The Bellman optimality equation is actually a system of equations, one for each state:
  N equations (one for each state) in N unknowns, the values V*(s)
  This assumes you know the dynamics of the environment.

Once one has V*, it is relatively easy to determine an optimal policy:
  For each state there will be one or more actions for which the maximum is obtained in the Bellman optimality equation.
  Any policy that assigns nonzero probability only to these actions is an optimal policy.
  This translates to a one-step search, i.e. greedy decisions will be optimal.
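
As a concrete illustration of solving this system and then acting greedily, here is a minimal value-iteration sketch in Python; it presumes the dynamics are known, and the array names P and R and their shapes are assumptions for illustration rather than the lecture's notation.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    # Solve V*(s) = max_a sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V*(s')) by repeated backups.
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = np.einsum('sax,sax->sa', P, R + gamma * V[None, None, :])  # backup for every (s, a)
        V_new = Q.max(axis=1)                                          # max over actions
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    greedy_policy = Q.argmax(axis=1)  # any maximizing action is optimal (one-step greedy search)
    return V_new, greedy_policy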

Page 9

Optimal Value Functions (cont.)

With Q*, the agent does not even have to do a one-step-ahead search:
  For any state s, the agent can simply find any action that maximizes Q*(s,a).

The action-value function effectively embeds the results of all one-step-ahead searches. It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair.

The agent does not need to know anything about the dynamics of the environment.

Q: What are the implementation tradeoffs here?
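
One way to see the trade-off the slide asks about: acting greedily from a Q-table needs only an argmax, while acting from V* requires a one-step lookahead through the model. A small sketch (the names Q, V, P, R are illustrative assumptions):

import numpy as np

def act_from_q(Q, s):
    # With Q*, action selection is just an argmax over the row for state s; no model needed.
    return int(np.argmax(Q[s]))            # Q: array of shape (S, A)

def act_from_v(V, P, R, s, gamma=0.9):
    # With only V*, the agent must do a one-step lookahead through the dynamics P, R.
    lookahead = [np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(P.shape[1])]
    return int(np.argmax(lookahead))

The Q-table costs O(|S||A|) memory but needs no model at decision time; storing V* costs only O(|S|) but requires P and R for the lookahead.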

Page 10

Implementation Considerations

Computational complexity
  How complex is it to evaluate the value and state-value functions?
    In software
    In hardware

Data flow constraints
  Which part of the data needs to be globally vs. locally available?
  Impact of memory bandwidth limitations

Page 11

Recycling Robot revisited

A transition graph is a useful way to summarize the dynamics of a finite MDP:
  A state node for each possible state
  An action node for each possible state-action pair
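
A compact way to write down such a transition graph in code is as a nested mapping from state and action to (probability, next state, reward) triples. The sketch below uses the standard parameterization of the recycling robot from Sutton and Barto (probabilities alpha and beta, rewards r_search and r_wait, and a -3 penalty when the battery is depleted); these values are an assumption about the version the lecture uses.

def recycling_robot(alpha, beta, r_search, r_wait):
    # dynamics[state][action] = list of (probability, next_state, reward)
    return {
        'high': {
            'search':   [(alpha, 'high', r_search), (1 - alpha, 'low', r_search)],
            'wait':     [(1.0, 'high', r_wait)],
        },
        'low': {
            'search':   [(beta, 'low', r_search), (1 - beta, 'high', -3.0)],  # depleted: rescued, penalty -3
            'wait':     [(1.0, 'low', r_wait)],
            'recharge': [(1.0, 'high', 0.0)],
        },
    }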

Page 12

Bellman Optimality Equations for the Recycling Robot

To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge, by h, l, s, w, and re, respectively.
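
The slide's equations themselves are not preserved in this transcript. As a point of reference, the Bellman optimality equations for this example as they appear in Sutton and Barto, with α and β the search-success probabilities and R^s, R^w the expected search and wait rewards (this notation is assumed here), are:

$V^{*}(h) = \max\big\{ R^{s} + \gamma\big[\alpha V^{*}(h) + (1-\alpha) V^{*}(l)\big],\;\; R^{w} + \gamma V^{*}(h) \big\}$

$V^{*}(l) = \max\big\{ \beta R^{s} - 3(1-\beta) + \gamma\big[(1-\beta) V^{*}(h) + \beta V^{*}(l)\big],\;\; R^{w} + \gamma V^{*}(l),\;\; \gamma V^{*}(h) \big\}$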

Page 13

Optimality and Approximation

Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens:
  Finding an optimal policy usually involves a heavy computational load.
  Typically, agents compute approximations to the optimal policy.

A critical aspect of the problem facing the agent is always the computational resources available to it, in particular the amount of computation it can perform in a single time step.

Practical considerations are thus:
  Computational complexity
  Memory available
    Tabular methods apply for small state sets
  Communication overhead (for distributed implementations)
  Hardware vs. software

Page 14

Are approximations good or bad?

RL typically relies on approximation mechanisms (see later). This could be an opportunity:
  Efficient "feature-extraction" types of approximation may actually reduce "noise".
  They make it practical for us to address large-scale problems.

In general, making "bad" decisions in RL results in learning opportunities (online).

The online nature of RL encourages learning more effectively from events that occur frequently.
  This is supported in nature.

Capturing regularities is a key property of RL.

