
reinforcement learning - Dipendra Misra · What is reinforcement learning? “Reinforcement...

Date post: 09-Aug-2020
Reinforcement Learning Dipendra Misra Cornell University [email protected] https://dipendramisra.wordpress.com/
Transcript
Page 1: reinforcement learning - Dipendra Misra · What is reinforcement learning? “Reinforcement learning is a computation approach that emphasizes on learning by the individual from direct

Reinforcement Learning

Dipendra Misra

Cornell University

[email protected]

https://dipendramisra.wordpress.com/

Page 2

Task

Setup from Lenz et al. 2014

Grasp the green cup.

Output: Sequence of controller actions

Page 3

Supervised Learning

Setup from Lenz et al. 2014

Grasp the green cup.

Expert Demonstrations

Page 4

Supervised Learning

Setup from Lenz et al. 2014

Grasp the green cup.

Expert Demonstrations

Problem?

Page 5

Supervised Learning

Grasp the cup.

Problem?

Training data

Test data

No exploration

Page 6

Exploring the environment

Page 7

What is reinforcement learning?

“Reinforcement learning is a computational approach that emphasizes learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment”

- R. Sutton and A. Barto

Page 8

Interaction with the environment

[Figure: the agent sends an action to the environment and receives a reward plus a new environment. Setup from Lenz et al. 2014.]

Scalar reward

Page 9

Interaction with the environment

[Figure: a sequence of actions $a_1, a_2, \dots, a_n$ with rewards $r_1, r_2, \dots, r_n$. Setup from Lenz et al. 2014.]

Episodic vs Non-Episodic

Page 10

Rollout

[Figure: a sequence of actions $a_1, a_2, \dots, a_n$ with rewards $r_1, r_2, \dots, r_n$. Setup from Lenz et al. 2014.]

$\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \cdots, a_n, r_n, s_n \rangle$
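A rollout can be produced by a simple interaction loop. The sketch below is only an illustration: the chain environment, the `step` function, and the uniform random policy are assumptions made up for this example and are not part of the slides.

```python
import random

GOAL = 4  # hypothetical terminal state of a 5-state chain (states 0..4)

def step(state, action):
    """Toy dynamics (an assumption for illustration): move left or right on a chain."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return reward, next_state

def random_policy(state, rng):
    """Pick action -1 or +1 uniformly at random, ignoring the state."""
    return rng.choice([-1, 1])

def rollout(start_state, max_steps, rng):
    """Return the flat sequence [s1, a1, r1, s2, a2, r2, ..., an, rn, s_last]."""
    trajectory = [start_state]
    state = start_state
    for _ in range(max_steps):
        action = random_policy(state, rng)
        reward, state = step(state, action)
        trajectory += [action, reward, state]
        if state == GOAL:  # episodic task: stop at the terminal state
            break
    return trajectory

trajectory = rollout(start_state=0, max_steps=50, rng=random.Random(0))
rewards = trajectory[2::3]  # r1, r2, ..., rn
print(len(rewards), "steps, total reward", sum(rewards))
```

Slicing the flat list with a stride of three recovers the states, actions, and rewards separately.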

Page 11

Setup

$s_t \rightarrow s_{t+1}$ with action $a_t$ and reward $r_t$ (e.g., \$1)

Page 12

Policy

$\pi(s, a) = 0.9$
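One way to read $\pi(s, a) = 0.9$ is as the probability of choosing action $a$ in state $s$. The following sketch stores such a policy as a nested dictionary and samples from it; the state and action names and all probabilities other than 0.9 are hypothetical.

```python
import random

# Hypothetical stochastic policy: policy[state][action] plays the role of pi(s, a).
policy = {
    "cup_visible": {"grasp": 0.9, "move_left": 0.1},
    "cup_hidden": {"grasp": 0.2, "move_left": 0.8},
}

def sample_action(policy, state, rng):
    """Draw an action a with probability pi(state, a)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {"grasp": 0, "move_left": 0}
for _ in range(10_000):
    counts[sample_action(policy, "cup_visible", rng)] += 1
print(counts)  # "grasp" should be chosen roughly 90% of the time
```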

Page 13

Interaction with the environment

[Figure: the agent sends an action to the environment and receives a reward plus a new environment. Setup from Lenz et al. 2014.]

Objective?

Page 14

Objective

[Figure: a sequence of actions $a_1, a_2, \dots, a_n$ with rewards $r_1, r_2, \dots, r_n$. Setup from Lenz et al. 2014.]

$\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \cdots, a_n, r_n, s_n \rangle$

maximize expected reward

$E\left[\sum_{t=1}^{n} r_t\right]$

Problem?

Page 15

Discounted Reward

[Figure: a sequence of actions $a_1, a_2, \dots, a_n$ with rewards $r_1, r_2, \dots, r_n$.]

maximize expected reward

$E\left[\sum_{t=0}^{\infty} r_{t+1}\right]$

Problem? unbounded

discount future reward

$E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right], \quad \gamma \in [0, 1)$

Page 16

Discounted Reward

[Figure: a sequence of actions $a_1, a_2, \dots, a_n$ with rewards $r_1, r_2, \dots, r_n$.]

maximize discounted expected reward

$E\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$

if $r \le M$ and $\gamma \in [0, 1)$:

$E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] \le \sum_{t=0}^{\infty} \gamma^t M = \frac{M}{1 - \gamma}$
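The bound $M / (1 - \gamma)$ can be checked numerically. This is a minimal sketch: the values of `M` and `gamma` and the reward sequence are arbitrary choices for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum over t of gamma**t * rewards[t], working backwards through the list."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

M = 2.0        # assumed upper bound on every reward
gamma = 0.9    # discount factor in [0, 1)
rewards = [M] * 500  # worst case: every reward equals the bound

value = discounted_return(rewards, gamma)
bound = M / (1.0 - gamma)
print(value, "<=", bound)
```

With every reward at the bound, the truncated sum approaches $M / (1 - \gamma)$ from below as the sequence gets longer.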

Page 17

Need for discounting

• To keep the problem well formed

• Evidence that humans discount future reward

Page 18

Markov Decision Process

An MDP is a tuple $(S, A, P, R, \gamma)$ where

• $S$ is a finite or infinite set of states

• $A$ is a finite or infinite set of actions

• For the transition $s \xrightarrow{a} s'$ (Markov assumption):

• $P^a_{s,s'} \in P$ is the transition probability

• $R^a_{s,s'} \in R$ is the reward for the transition

• $\gamma \in [0, 1]$ is the discount factor
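For a finite MDP, the tuple can be written down directly as a data structure. The sketch below is one possible encoding, not a standard API: the `MDP` class, its field layout, and the two-state example are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list   # S
    actions: list  # A
    P: dict        # P[(s, a)] maps each next state s' to its probability
    R: dict        # R[(s, a, s')] is the reward for that transition
    gamma: float   # discount factor

    def validate(self):
        """Check that gamma is in [0, 1] and that each P[(s, a)] sums to 1."""
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        for s in self.states:
            for a in self.actions:
                total = sum(self.P[(s, a)].values())
                if abs(total - 1.0) > 1e-9:
                    raise ValueError(f"P[{s!r}, {a!r}] sums to {total}")

# A made-up two-state example, not the one shown in the slides.
mdp = MDP(
    states=["far", "near"],
    actions=["wait", "reach"],
    P={
        ("far", "wait"): {"far": 1.0},
        ("far", "reach"): {"near": 0.7, "far": 0.3},
        ("near", "wait"): {"near": 1.0},
        ("near", "reach"): {"near": 0.9, "far": 0.1},
    },
    R={("far", "reach", "near"): 1.0, ("near", "reach", "near"): 2.0},
    gamma=0.9,
)
mdp.validate()
# In this encoding, transitions missing from R are treated as reward 0.
print(mdp.R.get(("far", "wait", "far"), 0.0))
```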

Page 19

MDP Example

Example from Sutton and Barto 1998

Page 20

Summary

[Figure: a sequence of actions $a_1, a_2, \dots, a_n$ with rewards $r_1, r_2, \dots, r_n$.]

Maximize discounted expected reward

$E\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$

MDP is a tuple $(S, A, P, R, \gamma)$

Agent controls the policy $\pi(s, a)$

Page 21

What we learned

Reinforcement Learning

Agent-Reward-Environment

No supervision

Exploration

MDP

Policy

Page 22

Value functions

• Expected reward from following a policy

State value function

$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, \pi\right]$

State action value function

$Q^\pi(s, a) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, a_1 = a, \pi\right]$

Page 23

State Value function

$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, \pi\right]$

[Figure: tree of trajectories starting at $s_1$. The step $s_1 \xrightarrow{a_1} s_2$ has probability $\pi(s_1, a_1) P^{a_1}_{s_1,s_2}$ and reward $R^{a_1}_{s_1,s_2}$; the step $s_2 \xrightarrow{a_2} s_3$ has probability $\pi(s_2, a_2) P^{a_2}_{s_2,s_3}$ and reward $R^{a_2}_{s_2,s_3}$.]

Page 24

State Value function

[Figure: tree of trajectories starting at $s_1$. The step $s_1 \xrightarrow{a_1} s_2$ has probability $\pi(s_1, a_1) P^{a_1}_{s_1,s_2}$ and reward $R^{a_1}_{s_1,s_2}$; the step $s_2 \xrightarrow{a_2} s_3$ has probability $\pi(s_2, a_2) P^{a_2}_{s_2,s_3}$ and reward $R^{a_2}_{s_2,s_3}$.]

$V^\pi(s_1) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] = \sum_{t} (r_1 + \gamma r_2 + \cdots)\, p(t)$

where $t = \langle s_1, a_1, s_2, a_2, \cdots \rangle = \langle s_1, a_1, s_2 \rangle : t'$

Page 25

State Value function

$V^\pi(s_1) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$

$= \sum_{t} (r_1 + \gamma r_2 + \cdots)\, p(t)$

where $t = \langle s_1, a_1, s_2, a_2, \cdots \rangle = \langle s_1, a_1, s_2 \rangle : t'$

$= \sum_{a_1, s_2} \sum_{t'} P(s_1, a_1, s_2)\, P(t' \mid s_1, a_1, s_2) \left\{ R^{a_1}_{s_1,s_2} + \gamma (r_2 + \cdots) \right\}$

$= \sum_{a_1, s_2} P(s_1, a_1, s_2) \left\{ R^{a_1}_{s_1,s_2} + \gamma \underbrace{\sum_{t'} P(t' \mid s_1, a_1, s_2)(r_2 + \cdots)}_{V^\pi(s_2)} \right\}$

$= \sum_{a_1} \pi(s_1, a_1) \sum_{s_2} P^{a_1}_{s_1,s_2} \left\{ R^{a_1}_{s_1,s_2} + \gamma V^\pi(s_2) \right\}$

Page 26

Bellman Self-Consistency Eqn

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$

similarly

$Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$

$Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma \sum_{a'} \pi(s', a')\, Q^\pi(s', a') \right\}$

Page 27

Bellman Self-Consistency Eqn

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$

Given N states, we have N equations in N variables

Solve the above equation

Does it have a unique solution?

Yes, it does. Exercise: Prove it.
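For a small finite MDP, the N equations can be solved directly as a linear system. Averaging over the policy gives $(I - \gamma P^\pi) V^\pi = r^\pi$. The sketch below assumes a made-up 2-state, 2-action MDP and a hand-written Gauss-Jordan solver; all numbers and names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Hypothetical MDP: P[s][a][s2] transition probability, R[s][a][s2] reward, pi[s][a] policy.
P = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.0, 1.0]]]
R = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [0.0, 0.5]]]
pi = [[0.5, 0.5], [0.9, 0.1]]
gamma = 0.9
n = len(P)

# Average over the policy: state-to-state matrix P_pi and expected one-step reward r_pi.
P_pi = [[sum(pi[s][a] * P[s][a][s2] for a in range(2)) for s2 in range(n)]
        for s in range(n)]
r_pi = [sum(pi[s][a] * P[s][a][s2] * R[s][a][s2] for a in range(2) for s2 in range(n))
        for s in range(n)]

# Rearranged self-consistency equation: (I - gamma * P_pi) V = r_pi.
A = [[(1.0 if s == s2 else 0.0) - gamma * P_pi[s][s2] for s2 in range(n)]
     for s in range(n)]
V = solve_linear(A, r_pi)
print(V)
```

Substituting `V` back into the right-hand side of the equation reproduces `V` up to floating-point error, which is a quick sanity check on the solve.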

Page 28

Optimal Policy

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$

Given a state $s$, policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \ge \pi_2$) if:

$V^{\pi_1}(s) \ge V^{\pi_2}(s)$

How to define a globally optimal policy?

Page 29

Optimal Policy

policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \ge \pi_2$) if:

$V^{\pi_1}(s) \ge V^{\pi_2}(s)$

How to define a globally optimal policy?

$\pi^*$ is an optimal policy if:

$V^{\pi^*}(s) \ge V^\pi(s) \quad \forall s \in S, \pi$

Does it always exist?

Yes, it always does.

Page 30

Existence of Optimal Policy

Leader policy for every state $s$ is: $\pi_s = \arg\max_{\pi} V^\pi(s)$

Define: $\pi^*(s, a) = \pi_s(s, a) \quad \forall s, a$

To show $\pi^*$ is optimal, or equivalently:

$\delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) \ge 0$

$V^{\pi^*}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^{\pi^*}(s') \right\}$

$V^{\pi_s}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^{\pi_s}(s') \right\}$

Page 31

Existence of Optimal Policy

Leader policy for every state $s$ is: $\pi_s = \arg\max_{\pi} V^\pi(s)$

Define: $\pi^*(s, a) = \pi_s(s, a) \quad \forall s, a$

$V^{\pi^*}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^{\pi^*}(s') \right\}$

$V^{\pi_s}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^{\pi_s}(s') \right\}$

$\delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) = \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \left\{ V^{\pi^*}(s') - V^{\pi_s}(s') \right\}$

The braced term is $\ge \delta(s')$, because $\pi_{s'}$ is the leader at $s'$ and so $V^{\pi_{s'}}(s') \ge V^{\pi_s}(s')$. Hence:

$\delta(s) \ge \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \left\{ \delta(s') \right\} = \gamma\, \mathrm{conv}(\delta(s'))$

Page 32

Existence of Optimal Policy

Leader policy for every state $s$ is: $\pi_s = \arg\max_{\pi} V^\pi(s)$

Define: $\pi^*(s, a) = \pi_s(s, a) \quad \forall s, a$

$\delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) = \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \left\{ V^{\pi^*}(s') - V^{\pi_s}(s') \right\}$

$\ge \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \left\{ \delta(s') \right\} = \gamma\, \mathrm{conv}(\delta(s'))$

$\delta(s) \ge \gamma \min_{s'} \delta(s')$

$\min_{s} \delta(s) \ge \gamma \min_{s'} \delta(s')$

$\gamma \in [0, 1) \Rightarrow \min_{s} \delta(s) \ge 0$

Hence proved

Page 33

Bellman’s Optimality Condition

Define $V^*(s) = V^{\pi^*}(s)$ and $Q^*(s, a) = Q^{\pi^*}(s, a)$

$V^{\pi^*}(s) = \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a) \le \max_{a} Q^{\pi^*}(s, a)$

Claim: $V^{\pi^*}(s) = \max_{a} Q^{\pi^*}(s, a)$

Let $V^{\pi^*}(s) < \max_{a} Q^{\pi^*}(s, a)$

Define $\pi'(s) = \arg\max_{a} Q^{\pi^*}(s, a)$

Page 34

Bellman’s Optimality Condition

$\pi'(s) = \arg\max_{a} Q^{\pi^*}(s, a)$

$V^{\pi'}(s) = Q^{\pi'}(s, \pi'(s))$

$V^{\pi^*}(s) = \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a)$

$\delta(s) = V^{\pi'}(s) - V^{\pi^*}(s) = Q^{\pi'}(s, \pi'(s)) - \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a)$

$\ge Q^{\pi'}(s, \pi'(s)) - Q^{\pi^*}(s, \pi'(s)) = \gamma \sum_{s'} P^{\pi'(s)}_{s,s'}\, \delta(s')$

$\delta(s) \ge 0$ and $\exists s'$ such that $\delta(s') > 0$

$\Rightarrow \pi^*$ is not optimal, a contradiction

Page 35

Bellman’s Optimality Condition

$V^*(s) = \max_{a} Q^*(s, a)$

$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^*(s') \right\}$

similarly

$Q^*(s, a) = \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma \max_{a'} Q^*(s', a') \right\}$

Page 36

Optimal policy from Q value

Given $Q^*(s, a)$, an optimal policy is given by:

$\pi^*(s) = \arg\max_{a} Q^*(s, a)$

Corollary: Every MDP has a deterministic optimal policy
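Reading a deterministic policy off a table of Q values is a per-state argmax. In this sketch the table `Q_star` and its state and action names are made-up numbers standing in for $Q^*$, not values computed from any MDP in the slides.

```python
def greedy_policy(Q):
    """Map each state to an action that maximizes Q[state][action]."""
    return {
        state: max(action_values, key=action_values.get)
        for state, action_values in Q.items()
    }

# Hypothetical Q* table: Q_star[state][action].
Q_star = {
    "s0": {"left": 0.2, "right": 1.5},
    "s1": {"left": 0.7, "right": 0.1},
}

print(greedy_policy(Q_star))
```

When several actions tie for the maximum, `max` returns the first one it encounters, so the result is still a single deterministic action per state.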

Page 37

Summary

Bellman’s self-consistency equation

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$

Bellman’s optimality condition

$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^*(s') \right\}$

An optimal policy $\pi^*$ exists such that:

$V^{\pi^*}(s) \ge V^\pi(s) \quad \forall s \in S, \pi$

Page 38

What we learned

Reinforcement Learning

Agent-Reward-Environment

No supervision

Exploration

MDP

Policy

Consistency Equation

Optimal Policy

Optimality Condition

Page 39

Solving MDP

To solve an MDP is to find an optimal policy

Page 40

Bellman’s Optimality Condition

$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^*(s') \right\}$

Iteratively solve the above equation

Page 41

Bellman Backup Operator

$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^*(s') \right\}$

$T : V \rightarrow V$

$(TV)(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V(s') \right\}$

Page 42

Dynamic Programming Solution

Initialize $V^0$ randomly

do

$V^{t+1} = T V^t$

until $\|V^{t+1} - V^t\|_\infty \le \epsilon$

return $V^{t+1}$

where $V^{t+1}(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^t(s') \right\}$
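The iteration above can be sketched in a few lines of Python. The 2-state, 2-action MDP here is made up for illustration, and the sketch starts from zeros rather than a random $V^0$ so that the run is reproducible; both are assumptions, not part of the slides.

```python
def bellman_backup(V, P, R, gamma):
    """Apply T once: (TV)(s) = max_a sum_s2 P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])."""
    return [
        max(
            sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in enumerate(P[s][a]))
            for a in range(len(P[s]))
        )
        for s in range(len(P))
    ]

def value_iteration(P, R, gamma, epsilon=1e-8):
    """Repeat V <- TV until the sup-norm change is at most epsilon."""
    V = [0.0] * len(P)  # assumption: zeros instead of a random start
    while True:
        V_next = bellman_backup(V, P, R, gamma)
        if max(abs(x - y) for x, y in zip(V_next, V)) <= epsilon:
            return V_next
        V = V_next

# Hypothetical MDP: P[s][a][s2] transition probability, R[s][a][s2] reward.
P = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.0, 1.0]]]
R = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [0.0, 0.5]]]
gamma = 0.9

V = value_iteration(P, R, gamma)
print(V)
```

The returned vector is approximately a fixed point of `bellman_backup`, which is what the stopping rule tests for.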

Page 43

Convergence

Theorem: for $(TV)(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V(s') \right\}$,

$\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$

where $\|x\|_\infty = \max\{|x_1|, |x_2|, \cdots, |x_k|\}; \; x \in \mathbb{R}^k$

Proof:

$|(TV_1)(s) - (TV_2)(s)| = \left| \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V_1(s') \right\} - \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V_2(s') \right\} \right|$

using $\left| \max_{x} f(x) - \max_{x} g(x) \right| \le \max_{x} |f(x) - g(x)|$

Page 44

Convergence

Theorem: $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$

where $\|x\|_\infty = \max\{|x_1|, |x_2|, \cdots, |x_k|\}; \; x \in \mathbb{R}^k$

Proof:

$|(TV_1)(s) - (TV_2)(s)| \le \max_{a} \gamma \left| \sum_{s'} P^a_{s,s'} \left( V_1(s') - V_2(s') \right) \right|$

$\le \max_{a} \gamma \sum_{s'} P^a_{s,s'} \left| V_1(s') - V_2(s') \right|$

$\le \gamma \max_{a} \max_{s'} \left| V_1(s') - V_2(s') \right|$

$= \gamma \|V_1 - V_2\|_\infty$

$\Rightarrow \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$
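The contraction inequality can be spot-checked numerically. This sketch draws random pairs $V_1, V_2$ and records the largest observed ratio $\|TV_1 - TV_2\|_\infty / \|V_1 - V_2\|_\infty$; the MDP and the sampling range are assumptions chosen for illustration, and a numerical check is not a substitute for the proof.

```python
import random

def bellman_backup(V, P, R, gamma):
    """Apply T once: (TV)(s) = max_a sum_s2 P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])."""
    return [
        max(
            sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in enumerate(P[s][a]))
            for a in range(len(P[s]))
        )
        for s in range(len(P))
    ]

def sup_norm(x, y):
    """Infinity-norm distance between two value vectors."""
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical MDP: P[s][a][s2] transition probability, R[s][a][s2] reward.
P = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.0, 1.0]]]
R = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [0.0, 0.5]]]
gamma = 0.9

rng = random.Random(0)
worst_ratio = 0.0
for _ in range(1000):
    V1 = [rng.uniform(-10.0, 10.0) for _ in P]
    V2 = [rng.uniform(-10.0, 10.0) for _ in P]
    after = sup_norm(bellman_backup(V1, P, R, gamma), bellman_backup(V2, P, R, gamma))
    worst_ratio = max(worst_ratio, after / sup_norm(V1, V2))

print(worst_ratio, "<=", gamma)
```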

Page 45

Optimal is a fixed point

$V^* = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^*(s') \right\} = TV^*$

$V^*$ is a fixed point of $T$

Page 46

Optimal is the fixed point

$V^* = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^*(s') \right\} = TV^*$

$V^*$ is a fixed point of $T$

Theorem: $V^*$ is the only fixed point of $T$

Proof: Let $TV_1 = V_1$ and $TV_2 = V_2$.

$\|V_1 - V_2\|_\infty = \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$

As $\gamma \in [0, 1)$, therefore $\|V_1 - V_2\|_\infty = 0 \Rightarrow V_1 = V_2$

Page 47

Dynamic Programming Solution

Initialize $V^0$ randomly

do

$V^{t+1} = T V^t$

until $\|V^{t+1} - V^t\|_\infty \le \epsilon$

return $V^{t+1}$

Theorem: algorithm converges for all $V^0$

Proof:

$\|V^{t+1} - V^*\|_\infty = \|TV^t - TV^*\|_\infty \le \gamma \|V^t - V^*\|_\infty$

$\|V^t - V^*\|_\infty \le \gamma^t \|V^0 - V^*\|_\infty$

$\lim_{t \to \infty} \|V^t - V^*\|_\infty \le \lim_{t \to \infty} \gamma^t \|V^0 - V^*\|_\infty = 0$

Problem?
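The geometric bound in the proof, $\|V^t - V^*\|_\infty \le \gamma^t \|V^0 - V^*\|_\infty$, can be observed on a small example. In this sketch $V^*$ is only approximated by running many backups, and the MDP and the starting vector are made up for illustration.

```python
def bellman_backup(V, P, R, gamma):
    """Apply T once: (TV)(s) = max_a sum_s2 P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])."""
    return [
        max(
            sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in enumerate(P[s][a]))
            for a in range(len(P[s]))
        )
        for s in range(len(P))
    ]

def sup_norm(x, y):
    """Infinity-norm distance between two value vectors."""
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical MDP: P[s][a][s2] transition probability, R[s][a][s2] reward.
P = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.0, 1.0]]]
R = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [0.0, 0.5]]]
gamma = 0.9

# Approximate V* by iterating far past convergence (an approximation, not an exact solve).
V_star = [0.0, 0.0]
for _ in range(2000):
    V_star = bellman_backup(V_star, P, R, gamma)

V = [50.0, -50.0]  # arbitrary starting point V^0
initial_error = sup_norm(V, V_star)
errors = []  # pairs (actual error at step t, bound gamma**t * initial error)
for t in range(1, 31):
    V = bellman_backup(V, P, R, gamma)
    errors.append((sup_norm(V, V_star), gamma ** t * initial_error))

for t in (1, 10, 30):
    print(t, errors[t - 1])
```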

Page 48

Summary

Iteratively solving optimality condition

$V^{t+1}(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^t(s') \right\}$

Bellman Backup Operator

$(TV)(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V(s') \right\}$

Convergence of the iterative solution

Page 49

What we learned

Reinforcement Learning

Agent-Reward-Environment

No supervision

Exploration

MDP

Policy

Consistency Equation

Optimal Policy

Optimality Condition

Bellman Backup Operator

Iterative Solution

Page 50

In next tutorial

• Value and Policy Iteration

• Monte Carlo Solution

• SARSA and Q-Learning

• Policy Gradient Methods

• Learning to search OR Atari game paper?

https://dipendramisra.wordpress.com/

