Introduction to Reinforcement Learning
Ather Gattami
Senior Scientist, RISE SICS
Stockholm, Sweden
November 3, 2017
Introduction Dynamical Systems Bellman’s Principle of Optimality Reinforcement Learning
Outline
1 Introduction
2 Dynamical Systems
3 Bellman’s Principle of Optimality
4 Reinforcement Learning
Success Stories
Reinforcement Learning in A Nutshell
Used in problems where actions (decisions) have to be made
Each action (decision) affects future states of the system
Success is measured by a scalar reward signal
Goal: take actions (decisions) to maximize reward (or minimize cost) when no system model is available
Dynamical Systems
Let s_k, y_k, a_k be the state, observation, and action at time step k, respectively.
Deterministic model:
s_{k+1} = f_k(s_k, a_k)
y_k = g_k(s_k, a_k)
Stochastic model (Markov Decision Process):
P(s_{k+1} | s_k, a_k, s_{k-1}, a_{k-1}, ...) = P(s_{k+1} | s_k, a_k)
P(y_k | s_k, a_k, s_{k-1}, a_{k-1}, ...) = P(y_k | s_k, a_k)
We assume perfect state observation, that is, y_k = s_k.
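The deterministic model above can be sketched as a short rollout loop. The dynamics f, observation map g, and feedback rule used here are invented for illustration; only the structure s_{k+1} = f_k(s_k, a_k), y_k = g_k(s_k, a_k) comes from the slides.

```python
def f(s, a):
    """Hypothetical linear dynamics: next state s_{k+1} = 0.9*s_k + a_k."""
    return 0.9 * s + a

def g(s, a):
    """Perfect state observation, y_k = s_k (as assumed above)."""
    return s

s = 1.0
trajectory = []
for k in range(5):
    a = -0.1 * s          # an illustrative fixed feedback policy
    y = g(s, a)           # observation at step k
    trajectory.append((k, s, a, y))
    s = f(s, a)           # state update

print(trajectory[0])
```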
Dynamical Systems
Given a dynamical system with states, observations, and actions given by s_k, y_k, and a_k, respectively, and scalar-valued rewards r_k(s_k, a_k), find the actions a_k that maximize the expected discounted reward

R_T = E( sum_{k=1}^{T} δ^k r_k(s_k, a_k) )

where 0 < δ ≤ 1 is the discount factor.
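For a single realized trajectory the sum inside the expectation is easy to evaluate; the reward sequence below is made up for illustration.

```python
# Discounted reward for one sampled trajectory:
# R_T = sum_{k=1}^{T} delta^k * r_k
delta = 0.9
rewards = [1.0, 0.5, 2.0, 0.0, 1.5]   # r_1, ..., r_T (hypothetical values)

R_T = sum(delta**k * r for k, r in enumerate(rewards, start=1))
print(R_T)
```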
Example
Let r_k(θ_k, F_k) = -θ_k², where θ_k and F_k are time-discretized values of the angle θ (the state) and the force F (the action). Maximize the reward function R with respect to F.
Bellman’s Principle of Optimality
Bellman’s Equation
Definition: A policy π(s_k) defines a probability distribution over actions given a state s_k,

P(A_k = a_k | S_k = s_k)

For deterministic policies, the action is given by a_k = π(s_k) with probability 1.
Bellman’s Equation
Let δ^i · Q_i^π(s, a) be the expected reward-to-go from time step k = i to k = T given the policy π(s_k), the state s_i = s, and the action a_i = a. That is,

Q_i^π(s, a) = E( sum_{k=i}^{T} δ^{k-i} r_k(s_k, a_k) | s_i = s, a_i = a )

Then,

Q_i^π(s, a) = E( r_i(s_i, a_i) + δ · sum_{k=i+1}^{T} δ^{k-(i+1)} r_k(s_k, a_k) | s_i = s, a_i = a )
            = E( r_i(s, a) + δ · Q_{i+1}^π(s_{i+1}, a_{i+1}) | s_i = s, a_i = a )
Bellman’s Equation
Let π* be an optimal policy that maximizes Q_i^π(s, a) for all i, and define

Q_i*(s, a) = Q_i^{π*}(s, a)

Since the policy is optimal for all i, we have

π*(s) = arg sup_a Q_i*(s, a)

Bellman's Equation:

Q_i*(s, a) = E( r_i(s, a) + δ · Q_{i+1}*(s_{i+1}, a_{i+1}) | s_i = s, a_i = a )
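For a finite horizon, Bellman's equation can be solved exactly by recursing backwards from i = T. A minimal sketch on a toy deterministic MDP (2 states, 2 actions, both the reward and the transition rule invented for illustration):

```python
delta = 0.9
T = 3
states, actions = [0, 1], [0, 1]

def r(s, a):
    """Hypothetical reward: 1 for taking action 1 in state 0, else 0."""
    return 1.0 if (s, a) == (0, 1) else 0.0

def f(s, a):
    """Hypothetical deterministic transition: the action picks the next state."""
    return a

# Q[i][(s, a)] holds Q*_i(s, a); the terminal stage i = T has no future term.
Q = {T: {(s, a): r(s, a) for s in states for a in actions}}
for i in range(T - 1, 0, -1):
    Q[i] = {
        (s, a): r(s, a) + delta * max(Q[i + 1][(f(s, a), ap)] for ap in actions)
        for s in states for a in actions
    }

# Greedy policy at the first stage: pi*(s) = arg max_a Q*_1(s, a)
pi_star = {s: max(actions, key=lambda a: Q[1][(s, a)]) for s in states}
print(pi_star)
```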
Dynamic Programming
If the system model is known, we can use dynamic programming. Let

V*(s) = sup_a Q_i*(s, a)

The Bellman equation is given by

V(s_k) = sup_{a_k} E( r_k(s_k, a_k) + V(s_{k+1}) )
       = sup_{a_k} sum_{s'∈S} P(s' | s_k, a_k) ( r_k(s_k, a_k) + V(s') )
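When the model P(s' | s, a) is known, the Bellman equation can be solved by fixed-point (value) iteration. The 2-state, 2-action transition probabilities and rewards below are invented for illustration:

```python
delta = 0.9
states, actions = [0, 1], [0, 1]
# P[(s, a)] lists P(s' | s, a) over s' in states (hypothetical model)
P = {(0, 0): [0.8, 0.2], (0, 1): [0.1, 0.9],
     (1, 0): [0.5, 0.5], (1, 1): [0.9, 0.1]}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 0.5}

V = {s: 0.0 for s in states}
for _ in range(200):  # the Bellman operator contracts at rate delta
    V = {s: max(sum(p * (r[(s, a)] + delta * V[sp])
                    for sp, p in zip(states, P[(s, a)]))
                for a in actions)
         for s in states}

print({s: round(v, 3) for s, v in V.items()})
```

After enough iterations V satisfies the Bellman equation up to numerical tolerance, and the greedy action in each state recovers an optimal policy.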
Model Free Optimization and Reinforcement Learning
What if we don’t have the system model?
If the system is deterministic, the model is given by

s_{k+1} = f_k(s_k, a_k)

If the system is stochastic, the model is given by

P(s_{k+1} | s_k, a_k)
Q-Learning
Let s = s_k and s' = s_{k+1}. The Q function is learned by minimizing, with respect to Q, the loss function

l = ( r(s, a) + δ sup_{a'} Q(s', a') - Q(s, a) )²

Gradient-descent update rule with step size (learning rate) α:

Q(s, a) ← Q(s, a) + α ( r(s, a) + δ sup_{a'} Q(s', a') - Q(s, a) )

The optimal policy is estimated from Q(s, a):

π(s) = arg sup_a Q(s, a)
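The update rule above can be run directly on sampled transitions, with no model of the environment. A minimal tabular sketch, on the same kind of invented two-state toy environment used earlier (the environment and exploration scheme are assumptions, not from the slides):

```python
import random

random.seed(0)
delta, alpha = 0.9, 0.1
states, actions = [0, 1], [0, 1]

def step(s, a):
    """Hypothetical environment: the action picks the next state;
    reward 1 is earned only for taking action 1 in state 0."""
    return a, (1.0 if (s, a) == (0, 1) else 0.0)

Q = {(s, a): 0.0 for s in states for a in actions}
s = 0
for _ in range(20000):
    a = random.choice(actions)          # pure exploration, for simplicity
    s_next, reward = step(s, a)
    # Q(s,a) <- Q(s,a) + alpha * (r + delta * max_a' Q(s',a') - Q(s,a))
    td_target = reward + delta * max(Q[(s_next, ap)] for ap in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    s = s_next

policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
print(policy)
```

Note that the behavior policy (uniform random) differs from the greedy policy being learned: Q-learning is off-policy, so this still estimates Q*.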
Q-Learning
Theorem: Consider the Q-learning algorithm given by

Q(s, a) ← Q(s, a) + α ( r(s, a) + δ sup_{a'} Q(s', a') - Q(s, a) )

Under standard conditions (every state-action pair is visited infinitely often and the step sizes α decay appropriately), the Q-learning algorithm converges to the optimal action-value function, Q(s, a) → Q*(s, a).
Deep Reinforcement Learning
Deep reinforcement learning (used in AlphaGo, which defeated the World Champion in Go) is reinforcement learning in which the Q function is approximated with a deep neural network.

The loss function, minimized with respect to the neural network weights w (with w⁻ denoting separate, periodically updated target weights), is

l = ( r(s, a) + δ sup_{a'} Q(s', a', w⁻) - Q(s, a, w) )²
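One semi-gradient step on this loss can be sketched with a linear approximation Q(s, a, w) = w·φ(s, a) standing in for the deep network; the feature map φ and the sampled transition are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
delta, alpha = 0.9, 0.05
n_actions, n_features = 2, 4

def phi(s, a):
    """Hypothetical one-hot feature vector for the pair (s, a)."""
    v = np.zeros(n_features)
    v[a * 2 + s % 2] = 1.0
    return v

w = rng.normal(size=n_features)        # online weights
w_minus = w.copy()                     # frozen target weights

# One sampled transition (s, a, r, s') -- made up for the sketch.
s, a, reward, s_next = 0, 1, 1.0, 1

q_sa = w @ phi(s, a)
target = reward + delta * max(w_minus @ phi(s_next, ap) for ap in range(n_actions))
td_error = target - q_sa
# Semi-gradient of l w.r.t. w, with the target term held fixed at w_minus:
w += alpha * td_error * phi(s, a)

print(round(float(td_error), 4))
```

Holding w⁻ fixed while differentiating only Q(s, a, w) is what makes this a "semi-gradient" step; after the update, Q(s, a, w) has moved a fraction α of the way toward the target.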
Cart Pole Example
The horizontal ("x") axis represents the time axis and the vertical ("y") axis represents the upward position of the pole (the "z" axis in space).
End of Presentation
THANKS FOR LISTENING!