Deep Reinforcement Learning with Shallow Trees
Matineh Shaker, AI Scientist (Bonsai)
MLConf San Francisco
10 November 2017
Outline
● Introduction to RL (Reinforcement Learning)
● Markov decision processes
● Value-based methods
● Concept-Network Reinforcement Learning (CNRL)
● Use cases
A Reinforcement Learning Example
Rocket Trajectory Optimization: OpenAI Gym’s LunarLander Simulator
State:
x_position, y_position, x_velocity, y_velocity, angle, angular_velocity, left_leg, right_leg

Action (Discrete):
do nothing (0), fire left engine (1), fire main engine (2), fire right engine (3)

Action (Continuous):
main engine power, left/right engine power
Reward: moving from the top of the screen to the landing pad with zero speed is worth about 100-140 points. The episode finishes if the lander crashes (additional -100) or comes to rest (additional +100). Each leg-ground contact is +10. Firing the main engine costs 0.3 points per frame.
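To make the agent-environment interface concrete, here is a minimal interaction loop. The real environment requires OpenAI Gym with the Box2D extras; to keep the sketch self-contained, a stub with the same interface shape is used instead, and its dynamics are purely illustrative (only the -0.3-per-frame main-engine cost mirrors the reward description above).

```python
import random

class LunarLanderStub:
    """Stand-in with the same interface shape as OpenAI Gym's
    LunarLander-v2 (the real environment needs `gym[box2d]`).
    Dynamics here are fake; only the main-engine cost is real."""
    def reset(self):
        self.t = 0
        # 8-dim state: x, y, vx, vy, angle, angular velocity, left leg, right leg
        return [0.0] * 8

    def step(self, action):
        assert action in (0, 1, 2, 3)          # the four discrete actions
        self.t += 1
        reward = -0.3 if action == 2 else 0.0  # main engine: -0.3 per frame
        done = self.t >= 100                   # fake episode cutoff
        return [0.0] * 8, reward, done, {}

env = LunarLanderStub()
state, done, total = env.reset(), False, 0.0
while not done:
    action = random.randrange(4)               # random policy, for illustration
    state, reward, done, _ = env.step(action)
    total += reward
```

With the real environment, only the construction line changes (`gym.make("LunarLander-v2")`); the loop is the same.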
Basic RL Concepts
Reward Hypothesis
Goals can be described by maximizing the expected cumulative reward.

Sequential Decision Making
Actions may have long-term consequences. Rewards may be delayed, like a financial investment. Sometimes the agent sacrifices instant rewards to maximize long-term reward (just like life!).

State Data
Sequential and non-i.i.d.: the agent’s actions affect the subsequent data samples.
Definitions
Policy
Dictates the agent’s behavior, mapping from state to action:
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P(At = a | St = s)

Value function
Determines how good each state (and action) is:
Vπ(s) = Eπ[ Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s ]
Qπ(s,a) is the analogous value of taking action a in state s.

Model
Predicts what the environment will do next (the simulator’s job, for instance).
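The discounted sum inside the value function can be computed directly from one trajectory’s reward sequence; a small helper (illustrative, not from the slides):

```python
def discounted_return(rewards, gamma=0.99):
    """Return R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    by folding from the back of the trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
assert discounted_return([1, 1, 1], gamma=0.5) == 1.75
```

The value function Vπ(s) is the expectation of this quantity over trajectories that start in s and follow π.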
Agent and Environment
At each time step, the agent: Receives observationReceives rewardTakes action
The environment: Receives actionSends next observationSends next reward
Markov Decision Processes (MDP)
A mathematical framework for sequential decision making: an environment in which all states are Markovian, i.e. the next-state distribution depends only on the current state (and action), not on the history.
A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩: states, actions, transition probabilities, rewards, and a discount factor.
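A tiny MDP can be written out explicitly. The two-state “battery” example below is hypothetical (state and action names are made up for illustration), but it shows each element of the tuple:

```python
# A two-state MDP (S, A, P, R, gamma), written out explicitly.
S = ["low", "high"]          # battery levels (illustrative)
A = ["wait", "recharge"]
gamma = 0.9

# P[s][a] = list of (next_state, probability); R[s][a] = expected reward.
P = {
    "high": {"wait":     [("high", 0.7), ("low", 0.3)],
             "recharge": [("high", 1.0)]},
    "low":  {"wait":     [("low", 1.0)],
             "recharge": [("high", 1.0)]},
}
R = {"high": {"wait": 2.0, "recharge": 0.0},
     "low":  {"wait": 1.0, "recharge": -1.0}}

# Markov property check: each (s, a) fully determines a next-state
# distribution, and each distribution sums to 1.
for s in S:
    for a in A:
        assert abs(sum(p for _, p in P[s][a]) - 1.0) < 1e-9
```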
Pictures from David Silver’s Slides
Exploration vs. Exploitation Dilemma
● Reinforcement learning (especially model-free RL) is like trial-and-error learning.
● The agent should find a good policy that maximizes future rewards from its experiences of the environment, in a potentially very large state space.
● Exploration gathers more information about the environment, while exploitation uses known information to maximize reward.
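One standard way to balance the two is ε-greedy action selection (a common baseline, not specific to this talk): explore with probability ε, otherwise exploit the current value estimates.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon, explore (uniform random action);
    otherwise exploit the argmax of the current Q estimates."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

random.seed(0)
counts = [0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy([0.1, 0.9, 0.2], epsilon=0.1)] += 1
# The greedy action (index 1) dominates, but the others still get tried.
```

Annealing ε from 1.0 toward a small value over training is the usual schedule: explore heavily at first, exploit once estimates are trustworthy.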
Value Based Methods: Q-Learning
The problems:
● The iterative Bellman update is not scalable: computing Q(s,a) for every state-action pair is infeasible for large state spaces.
Solution:
● Use a function approximator, such as a (differentiable) neural network, to estimate Q(s,a).
Using the Bellman equation as an iterative update to find the optimal policy:
Qi+1(s,a) = E[ r + γ max_a' Qi(s', a') | s, a ]
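In the tabular case, this update is one line of code. The standard Q-learning step (with learning rate α) blends the old estimate toward the Bellman target; the numbers below are illustrative:

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]"""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

# Q[state] is a list of action values for two actions.
Q = {0: [0.0, 0.0], 1: [1.0, 0.0]}
q_update(Q, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
# target = 1 + 0.9 * 1 = 1.9; Q[0][0] <- 0 + 0.5 * (1.9 - 0) = 0.95
assert abs(Q[0][0] - 0.95) < 1e-12
```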
Use a function approximator to estimate the action-value function:
Q(s, a; θ) ≈ Q*(s, a)
θ is the function parameter (the weights of the neural network).
When the function approximator is a deep neural network: DQN.
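The shape of such an approximator, stripped to essentials: θ is a set of weight matrices, the input is a state vector, and the output is one value per discrete action. The sketch below uses a tiny two-layer network with made-up sizes (8 inputs matching LunarLander’s state, 4 outputs matching its discrete actions); a real DQN is deeper and is trained, not just initialized.

```python
import numpy as np

rng = np.random.default_rng(0)

# theta = (W1, W2): a tiny Q-network mapping an 8-dim state
# to 4 action values, Q(s, .; theta).
W1 = rng.normal(scale=0.1, size=(8, 32))
W2 = rng.normal(scale=0.1, size=(32, 4))

def q_network(state):
    h = np.maximum(0.0, state @ W1)   # ReLU hidden layer
    return h @ W2                     # one value per discrete action

state = np.zeros(8)                   # e.g. LunarLander's 8-dim observation
q_values = q_network(state)
greedy_action = int(np.argmax(q_values))
```

The greedy policy with respect to this network is just `argmax` over its output, exactly as in the tabular case.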
Loss function (regressing Q toward the Bellman target):
L(θ) = E[ ( r + γ max_a' Q(s', a'; θ) − Q(s, a; θ) )² ]
Value Based Methods: DQN
Learning from batches of consecutive samples is problematic and costly:
- Sample correlation: consecutive samples are correlated, which makes learning inefficient.
- Bad feedback loops: the current Q-network parameters dictate the next training samples and can lead to bad feedback loops (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side).
To solve both, use Experience Replay:
- Continually update a replay memory of transitions (st, at, rt, st+1).
- Train the Q-network on random mini-batches of transitions from the replay memory.
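A replay memory is a small data structure; a minimal sketch (a bounded deque with uniform sampling, which is the standard construction):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (s, a, r, s_next) transitions;
    the oldest transitions fall out when capacity is reached."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # A uniform random mini-batch breaks the temporal correlation
        # between consecutive samples.
        return random.sample(self.memory, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(250):                  # only the most recent 100 are kept
    buf.push((t, 0, 0.0, t + 1))
batch = buf.sample(8)
```

Each training step then computes the DQN loss on `batch` instead of on the latest transition.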
Concept Network Reinforcement Learning
● Solves complex tasks by decomposing them into high-level actions, or “concepts”.
● A multi-level hierarchical RL approach, inspired by Sutton’s Options framework:
○ enables efficient exploration through abstraction over low-level actions,
○ improves sample efficiency significantly,
○ especially in sparse-reward settings.
● Allows existing solutions to sub-problems to be composed into an overall solution without requiring re-training.
Temporal Abstractions
● At each time t, for each state st, a higher-level “selector” chooses a concept ct among all concepts available to it.
● Each concept remains active for some time, until a predefined terminal state is reached.
● An internal critic evaluates how close the agent is to satisfying the terminal condition of ct, and sends a reward rc(t) to the selector.
● Similar to baseline RL, except that an extra layer of abstraction is defined over the set of “primitive” actions, forming concepts: executing a concept corresponds to running a sequence of primitive actions until its terminal condition fires.
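The control flow above can be sketched in a few lines. Everything here is illustrative (the toy task, the concept names, and the termination test are made up); it only shows the selector/concept/terminal-condition loop, not the CNRL training procedure:

```python
def run_with_concepts(select_concept, concepts, state, max_steps=100):
    """Selector picks a concept; the concept's policy runs until its own
    terminal condition fires; control then returns to the selector."""
    trace = []
    for _ in range(max_steps):
        c = select_concept(state)
        policy, done_fn = concepts[c]
        while not done_fn(state) and len(trace) < max_steps:
            state = policy(state)          # primitive actions run here
            trace.append(c)
        if done_fn(state) and c == "stack":  # task-level terminal concept
            break
    return state, trace

# Toy 1-D task: walk from 0 up to 5 ("lift"), then mark done ("stack").
concepts = {
    "lift":  (lambda s: s + 1, lambda s: s >= 5),
    "stack": (lambda s: s,     lambda s: s >= 5),
}
state, trace = run_with_concepts(lambda s: "lift" if s < 5 else "stack",
                                 concepts, state=0)
```

In CNRL the selector and the per-concept policies are learned; here they are hard-coded purely to show the temporal abstraction.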
Robotics Pick and Place with Concepts
Lift → Orient → Stack
Deep Reinforcement Learning for Dexterous Manipulation with Concept Networks
https://arxiv.org/abs/1709.06977
Thank you!
Backup Slides for Q/A:
Definitions
State: the agent’s internal representation of the environment; the information the agent uses to pick the next action.
Policy: dictates the agent’s behavior, mapping from state to action. Deterministic policy: a = π(s). Stochastic policy: π(a|s) = P(At = a | St = s).
Value function: determines how good each state (and action) is: Vπ(s) = Eπ[ Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s ]; similarly Qπ(s,a) for state-action pairs.
Model: predicts what the environment will do next (the simulator’s job, for instance).
RL’s Main Loop
Value Based Methods: DQN with Experience Replay (2)
Learning vs Planning
Learning (model-free reinforcement learning):
- The environment is initially unknown.
- The agent interacts with the environment, without knowledge of its dynamics.
- The agent improves its policy based on previous interactions.

Planning (model-based reinforcement learning):
- A model of the environment is known or acquired.
- The agent performs computations with the model, without any external interaction.
- The agent improves its policy based on those computations with the model.
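Planning with a known model can be shown in miniature with value iteration: given ⟨S, A, P, R, γ⟩, values are computed by pure computation, with no environment interaction at all. The two-state chain below is a made-up example:

```python
# Deterministic toy model: P maps (state, action) -> next state,
# R maps (state, action) -> reward.  Purely illustrative.
P = {("s0", "go"): "s1", ("s0", "stay"): "s0",
     ("s1", "go"): "s1", ("s1", "stay"): "s1"}
R = {("s0", "go"): 0.0, ("s0", "stay"): 0.0,
     ("s1", "go"): 1.0, ("s1", "stay"): 1.0}
gamma = 0.5

# Value iteration: repeatedly apply the Bellman optimality backup.
V = {"s0": 0.0, "s1": 0.0}
for _ in range(50):
    V = {s: max(R[(s, a)] + gamma * V[P[(s, a)]] for a in ("go", "stay"))
         for s in V}

# s1 yields reward 1 forever: V(s1) = 1/(1-gamma) = 2.
# s0 reaches s1 after one zero-reward step: V(s0) = gamma * V(s1) = 1.
```

Model-free learning, by contrast, would have to estimate these values from sampled transitions.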
LunarLander with Concept Network
Introduction to RL: Challenges
Playing Atari with Deep Reinforcement Learning, Mnih et al., DeepMind
Policy-Based Methods
● The Q-function can be complex and unnecessary; all we want is the best action!
● Example: in a very high-dimensional state space, it is wasteful and costly to learn the exact value of every (state, action) pair.
● Define parameterized policies: π_θ(a|s), with parameters θ.
● For each policy, define its value: J(θ) = E[ Σt γ^t rt ; π_θ ].
● Do gradient ascent on the policy parameters to find the optimal policy: θ ← θ + α ∇θ J(θ).
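The simplest instance of this idea is REINFORCE on a one-step bandit: a softmax policy over two actions, with the score-function gradient ∇θ log π_θ(a) · r. All numbers below are illustrative:

```python
import math, random

def pi(theta):
    """Softmax policy over 2 actions, parameterized by theta."""
    z = [math.exp(theta[a]) for a in (0, 1)]
    s = sum(z)
    return [p / s for p in z]

rewards = [0.0, 1.0]            # action 1 is the better arm
theta = [0.0, 0.0]
random.seed(0)
for _ in range(2000):
    probs = pi(theta)
    a = 0 if random.random() < probs[0] else 1   # sample from the policy
    r = rewards[a]
    # REINFORCE: grad of log pi(a) w.r.t. theta[b] is 1{b==a} - probs[b]
    for b in (0, 1):
        theta[b] += 0.1 * ((1.0 if b == a else 0.0) - probs[b]) * r

assert pi(theta)[1] > 0.9        # the policy now strongly prefers action 1
```

For sequential tasks the reward `r` is replaced by the (discounted) return from time t, but the gradient-ascent structure is the same.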