Deep Reinforcement Learning
Outline
● What is Reinforcement Learning (RL) and why is it useful?
● What is Deep Reinforcement Learning (DRL)?
● DRL case studies
What is Reinforcement Learning?
● The Workflow:
○ An agent interacts with an environment to learn a policy that maximizes the cumulative reward it receives from the environment
○ By initially selecting actions at random, the agent explores the environment and gradually learns which actions lead to positive or negative rewards
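
The workflow above can be sketched with tabular Q-learning on a toy corridor. This is a minimal illustration, not a method taken from the slides: the environment (`step`), the state layout, and every hyperparameter (`alpha`, `gamma`, `epsilon`) are assumptions chosen for brevity.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (assumed toy layout)
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Hypothetical environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else -0.01   # small step cost, positive reward at the goal
    return next_state, reward, done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a Q-table with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)                            # explore: random action
            else:
                a = max(range(2), key=lambda i: q[state][i])    # exploit current estimate
            next_state, reward, done = step(state, ACTIONS[a])
            # One-step Q-learning target; no bootstrapping from a terminal state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

def greedy_policy(q):
    """Best known action for every non-terminal state."""
    return [ACTIONS[max(range(2), key=lambda i: q[s][i])] for s in range(N_STATES - 1)]
```

Calling `greedy_policy(train())` returns the action the agent prefers in each state. In this corridor the learned policy should be to move right everywhere, since that is the only way to reach the rewarded goal.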
Where is Reinforcement Learning Applied?
● General algorithm that can be applied to many complex problems
● Solve problems where the rules are simple but the solution is not
○ Games like Go and Chess
○ Robotic control
● Stochastic and partially observable environments where manually designing even a good solution is difficult or impossible
Why is Reinforcement Learning Useful?
● RL represents a paradigm shift from hand-engineering a solution to specifying an objective; expert knowledge of how to solve the problem is not required
● Creates agents that exhibit complex behavior and often discover novel solutions
○ AlphaGo Zero taught human experts new strategies for a 3,000-year-old game
● Adaptive to changes in objectives or environments
● The same RL algorithm can solve different problems
Deep Reinforcement Learning (DRL)
● The combination of Reinforcement Learning and Deep Neural Networks (NNs) creates a very general algorithm that can be applied to a wide range of problems
○ Neural networks can process many different kinds of data: numeric, images, video, audio, and any combination thereof
● NNs can generalize from past experience to new states
● NNs and RL algorithms are largely independent, so advancements in either area improve agent performance
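
One way to picture the role of the network: it replaces a lookup table with a function that maps any observation vector to one value estimate per action, so states never seen before still produce an output. The sketch below is a hypothetical, untrained two-layer network in plain Python; the layer sizes, the `tanh` activation, and the names `init_mlp`, `q_values`, and `greedy_action` are illustrative assumptions, and the training step is omitted.

```python
import math
import random

def init_mlp(n_inputs, n_hidden, n_actions, seed=0):
    """Randomly initialized weights for a tiny two-layer network (no biases)."""
    rng = random.Random(seed)

    def layer(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"w1": layer(n_hidden, n_inputs), "w2": layer(n_actions, n_hidden)}

def q_values(params, observation):
    """Forward pass: observation vector -> one estimated value per action."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, observation)))
              for row in params["w1"]]
    return [sum(w * h for w, h in zip(row, hidden)) for row in params["w2"]]

def greedy_action(params, observation):
    """Index of the action with the highest estimated value."""
    q = q_values(params, observation)
    return max(range(len(q)), key=lambda i: q[i])
```

Because the RL update rule only needs `q_values`, the network architecture can be swapped without touching the learning algorithm, which is the independence the last bullet refers to.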
Case Studies (Why learn on video games?)
● Simplified views into real-world scenarios
● An extensive list of environments already exists:
○ Driving simulators, tactical & strategy games, complex 3D worlds
● Games require many characteristics desirable in real-world scenarios
○ Quick decision making
○ Balancing short-term vs long-term goals
○ Adaptability to evolving scenarios
DRL Timeline (Breakout)
● Dec 2013 - First successful application of Deep Reinforcement Learning
● Nov 2016 - DeepMind announces plans to research StarCraft 2 (SC2)
● Jan 2019 - DRL agent beats professional players
● Only 5 years from Breakout to SC2
○ A 34-year difference in release dates
○ Breakout: May 1976 - SC2: July 2010
AlphaStar - StarCraft 2
● SC2 is a complex game, played in real time, which requires micro and macro strategic decision-making along with resource management
● Partially observable, requiring enemy positions to be scouted/tracked
● Initially trained to mimic human actions/strategy
● Multiple agents with differing objectives are trained by competing against each other, with AlphaStar incorporating the best strategies discovered
● Unlike AlphaGo, AlphaStar does not use a search algorithm
AlphaStar - 10, Humans - 0
● Beat TLO, a professional player ranked in the top 600, 5-0
○ “AlphaStar takes well-known strategies and turns them on their head. The agent demonstrated strategies I hadn’t thought of before...”
● After another week of training, defeated MaNa, a top 10 player, 5-0
○ “I’ve realised how much my gameplay relies on forcing mistakes and being able to exploit human reactions…”
OpenAI Five - DotA 2
● Multi-agent 5 vs 5 game, played in real time and partially observable; each agent must fulfill its role and trade off personal vs team rewards
● Trained with no human supervision; agents start from random action policies and learn through self-play
○ 180 years of play per day for 9 days -> 1,620 years of play
● Reward shaping is used for the final agent, but good policies can be learned from only a binary win/loss signal
● No explicit communication channel between agents; they collaborate based on a shared view of the environment (emergent swarming)
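
The difference between a binary win/loss signal, a shaped reward, and the personal vs team trade-off mentioned above can be sketched as follows. This is not OpenAI Five's actual reward function: the event names, the weights, and the `team_weight` blending parameter are all hypothetical and only illustrate the idea.

```python
def binary_reward(won):
    """Sparse signal: the agent only learns whether the game was won or lost."""
    return 1.0 if won else -1.0

def shaped_reward(won, events, weights):
    """Binary outcome plus weighted bonuses for intermediate events.

    `events` maps a hypothetical event name to how often it occurred;
    `weights` maps the same names to assumed per-event bonuses.
    """
    bonus = sum(weights.get(name, 0.0) * count for name, count in events.items())
    return binary_reward(won) + bonus

def blend_team_reward(individual, team_weight):
    """Mix each agent's own reward with the team average.

    team_weight = 0.0 keeps rewards purely personal;
    team_weight = 1.0 gives every agent the team mean.
    """
    mean = sum(individual) / len(individual)
    return [(1.0 - team_weight) * r + team_weight * mean for r in individual]
```

For example, `blend_team_reward([1.0, 0.0, 0.0, 0.0, 0.0], 1.0)` spreads one agent's reward evenly over all five agents, while a `team_weight` of 0.0 leaves it with the agent that earned it.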
OpenAI Five - DotA 2
● Won 2-0 against semi-pro (99th percentile) players
○ Attack target coordination
○ Flanking & ambushes
○ Perfect timing / low-level control
○ Punishes opponents' mistakes without hesitation
● Lost 0-2 against professional players
○ Humans were able to adapt during the game to exploit the AI
Robots Learning (Walker)
● Learns a robust policy in 2 hours, training only on a flat surface
Robots Learning (Quadruped)
● https://youtu.be/aTDkYFZFWug?t=157
● “The learned policies are also robust to changes in hardware… different robot configurations, which roughly contribute 2.0 kg to the total weight, and a new drive which has a spring three times stiffer than the original one.”
● “In terms of computational cost ... the inference on the robot requires less than 25 µs using a single CPU thread.”
● “…this process [designing the rewards & NN architecture] takes about two days for the locomotion policies presented in this work.”
Where Does RL Fail?
● Generalization: applying learned concepts to new & unseen environments
● In the maze environment, the agent overfits even with 20,000 training mazes
● AlphaStar competed on only one map
● OpenAI Five plays with hero (18/117), item, and skill restrictions
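
Overfitting of this kind is usually detected by comparing performance on the training environments with performance on held-out ones. The sketch below assumes a hypothetical `run_episode(policy, env_id)` callback that returns True when the policy solves that environment; the toy "policy" is just a set of memorized maze IDs and stands in for an agent that has not generalized.

```python
def success_rate(policy, env_ids, run_episode):
    """Fraction of the environments in env_ids that the policy solves."""
    solved = sum(1 for env_id in env_ids if run_episode(policy, env_id))
    return solved / len(env_ids)

def generalization_gap(policy, train_ids, test_ids, run_episode):
    """Training success rate minus held-out success rate (0.0 means no gap)."""
    return (success_rate(policy, train_ids, run_episode)
            - success_rate(policy, test_ids, run_episode))

# Toy stand-in: a "policy" that has only memorized its training mazes.
train_ids = list(range(0, 100))
test_ids = list(range(100, 120))
memorized = set(train_ids)

def solves(policy, env_id):
    """Hypothetical episode runner: succeeds only on memorized mazes."""
    return env_id in policy

gap = generalization_gap(memorized, train_ids, test_ids, solves)
```

A gap near zero suggests the agent has learned something transferable; a large gap, as in this toy case, indicates memorization of the training set.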
RL Summary
● Used where the rules are defined but the optimal solution is unknown
● Typically requires an accurate simulation
● Adaptable to rule/requirement changes: retrain instead of re-engineer
● In real-world use cases, special care is needed to test and evaluate performance in unseen scenarios
● DotA 2 Rematch - April 13th
The Power of Reinforcement Learning
AlphaStar was able to beat a professional player in a restricted setting after 7 days of training, then a top 10 player the following week. “...the first success took almost three years of research time and the second success took seven days. Similarly, although OpenAI’s DotA 2 agent lost against a pro team, they were able to beat their old agent 80% of the time with 10 days of training. Wonder where it’s at now…”
- Alex Irpan, Software Engineer at Google Brain Robotics
Q&A