ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 19: Case Studies
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2010
November 10, 2010
Final Project Recap
Requirements: Presentation
In-class 15-minute presentation + 5 minutes for questions
Presentation assignment slots have been posted on the website
Project report – due Friday, Dec 3rd
Comprehensive documentation of your work
Recall that the Final Project is 30% of the course grade!
Introduction
We'll discuss several case studies of reinforcement learning
The intention is to illustrate some of the trade-offs and issues that arise in real applications
For example, we emphasize how domain knowledge is incorporated into the formulation and solution of the problem
We also highlight the representation issues that are so often critical to successful applications
Applications of reinforcement learning are still far from routine and typically require as much art as science
Making applications easier and more straightforward is one of the goals of current research in reinforcement learning
TD-Gammon (Tesauro, 1992, 1994, 1995, …)
One of the most impressive applications of RL to date is Gerry Tesauro's (IBM) backgammon player
TD-Gammon required little backgammon knowledge, yet learned to play extremely well, near the level of the world's strongest grandmasters
The learning algorithm was a straightforward combination of the TD(λ) algorithm and nonlinear function approximation
FA using a FFNN trained by backpropagating TD errors
There are probably more professional backgammon players than there are professional chess players
BG is in part a game of chance, which can be viewed as a large MDP
TD-Gammon (cont.)
The game is played with 15 white and 15 black pieces on a board of 24 locations, called points
Here's a typical position early in the game, seen from the perspective of the white player
TD-Gammon (cont.)
White has just rolled a 5 and a 2, so it can move one of its pieces 5 steps and one (possibly the same piece) 2 steps
The objective is to advance all pieces to points 19-24, and then off the board
Hitting – landing on a point occupied by a single opposing piece removes that piece from play (to the bar)
30 pieces, 24 locations implies an enormous number of configurations (state set is ~10^20)
Effective branching factor of ~400: there are 21 distinct dice rolls, and each roll can typically be played in about 20 ways (21 × 20 ≈ 400)
TD-Gammon - details
Although the game is highly stochastic, a complete description of the game's state is available at all times
The estimated value of any state was meant to predict the probability of winning starting from that state
Reward: 0 at all times except those in which the game is won, when it is 1
Episodic (game = episode), undiscounted
Non-linear form of TD(λ) using a FF neural network
Weights initialized to small random numbers
Backpropagation of TD error
Four input units for each point; unary encoding of number of white pieces, plus other features
Use of afterstates
Learning during self-play – fully incremental; a sketch of this update loop follows
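To make the training loop concrete, here is a minimal sketch (not Tesauro's code) of TD(λ) with a feedforward network and afterstate-based move selection. The 198-input/40-hidden sizing follows the published description, but the environment hooks (reset, legal_afterstates, step), the greedy move choice, and the assumption that positions are encoded from the side-to-move's perspective are placeholders:

```python
import numpy as np

# Minimal sketch of TD-Gammon-style TD(lambda) self-play learning.
# NOT Tesauro's code: the environment interface and move selection
# are assumptions; only the algorithm structure follows the slides.

N_IN, N_HID = 198, 40          # 198-unit board encoding, 40 hidden units
alpha, lam = 0.1, 0.7          # step size and trace-decay parameter

rng = np.random.default_rng(0)
W1 = rng.uniform(-0.1, 0.1, (N_HID, N_IN))   # small random initial weights
W2 = rng.uniform(-0.1, 0.1, (1, N_HID))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def value(x):
    """Estimated probability of winning from encoded position x."""
    h = sigmoid(W1 @ x)
    return sigmoid(W2 @ h)[0], h

def grads(x, h, v):
    """Gradient of the scalar output w.r.t. each weight matrix."""
    dv = v * (1.0 - v)                    # sigmoid derivative at output
    dW2 = dv * h[None, :]
    dh = W2[0] * dv * h * (1.0 - h)       # backprop through hidden layer
    dW1 = dh[:, None] * x[None, :]
    return dW1, dW2

def play_one_game(env):
    """One self-play episode with fully incremental TD(lambda) updates."""
    global W1, W2
    e1, e2 = np.zeros_like(W1), np.zeros_like(W2)   # eligibility traces
    x = env.reset()                                  # encoded start position
    v, h = value(x)
    while True:
        # Afterstate move selection: evaluate the position resulting
        # from each legal move and greedily pick the best one.
        x_next = max(env.legal_afterstates(), key=lambda a: value(a)[0])
        reward, done = env.step(x_next)              # 1 on a win, else 0

        # Accumulate traces with the gradient of the current estimate.
        dW1, dW2 = grads(x, h, v)
        e1 = lam * e1 + dW1                          # undiscounted: gamma = 1
        e2 = lam * e2 + dW2

        v_next, h_next = (float(reward), None) if done else value(x_next)
        delta = v_next - v                           # TD error
        W1 += alpha * delta * e1
        W2 += alpha * delta * e2
        if done:
            return
        x, v, h = x_next, v_next, h_next
```

Afterstate evaluation is what lets a single value network double as the move-selection policy during self-play.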
TD-Gammon – Neural Network Employed
Summary of TD-Gammon Results
Two players played against each other
Each had no prior knowledge of the game
Only the rules of the game were prescribed
Humans learn from machines: TD-Gammon learned to play certain opening positions differently than was the convention among the best human players
Rebuttal on TD-Gammon
For an alternative view, see "Why did TD-Gammon Work?", Jordan Pollack and Alan Blair, NIPS 9 (1997)
Claim: it was the "co-evolutionary training strategy, playing games against itself, which led to the success"
Any such approach would work with backgammon
Success does not extend to other problems, e.g. Tetris and maze-type problems, where the exploration issue comes up
The Acrobot
Robotic application of RL
Roughly analogous to a gymnast swinging on a high bar
The first joint (corresponding to the hands on the bar) cannot exert torque
The second joint (corresponding to the gymnast bending at the waist) can
This system has been widely studied by control engineers and machine learning researchers
The Acrobot (cont.)
One objective for controlling the Acrobot is to swing the tip (the "feet") above the first joint, by an amount equal to the length of one of the links, in minimum time
In this task, the torque applied at the second joint is limited to three choices: positive torque of a fixed magnitude, negative torque of the same magnitude, or no torque
A reward of –1 is given on all time steps until the goal is reached, which ends the episode. No discounting is used
Thus, the optimal value of any state is minus the minimum time to reach the goal (an integer number of steps)
Sutton (1996) addressed the Acrobot swing-up task in an on-line, model-free context, sketched below
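As a concrete illustration, here is a minimal sketch of Sarsa(λ) with replacing eligibility traces and a linear, tile-coded action-value function, in the spirit of Sutton (1996). The tile coder (encode), the simulator interface (env), and the parameter values are assumed placeholders; the greedy policy with zero-initialized values (optimistic, given the –1 rewards) stands in for the original exploration scheme:

```python
import numpy as np

# Minimal sketch of Sarsa(lambda) with replacing traces and linear,
# tile-coded features for the Acrobot swing-up task. The tile coder
# and simulator are placeholder assumptions, not the original code.

N_FEATURES = 4096          # size of the tile-coded feature space
ACTIONS = (-1.0, 0.0, 1.0) # fixed negative, zero, or positive torque
alpha, gamma, lam = 0.1, 1.0, 0.9   # undiscounted task, gamma = 1

w = np.zeros((len(ACTIONS), N_FEATURES))   # one weight vector per action

def q(f, a):
    """Action value: sum of weights over the active tiles f."""
    return w[a, f].sum()

def pick_action(f):
    """Greedy; zero-initialized values are optimistic here, which
    drives exploration without an explicit epsilon."""
    return int(np.argmax([q(f, a) for a in range(len(ACTIONS))]))

def run_episode(env, encode):
    """One swing-up episode; env/encode are assumed interfaces."""
    global w
    z = np.zeros_like(w)                  # eligibility traces
    s = env.reset()
    f = encode(s)                         # indices of active tiles
    a = pick_action(f)
    steps = 0
    while True:
        s, reward, done = env.step(ACTIONS[a])   # reward is -1 each step
        delta = reward - q(f, a)
        z[a, f] = 1.0                     # replacing traces on active tiles
        if done:
            w += alpha * delta * z
            return steps
        f2 = encode(s)
        a2 = pick_action(f2)
        delta += gamma * q(f2, a2)        # complete the TD error
        w += alpha * delta * z
        z *= gamma * lam                  # decay all traces
        f, a = f2, a2
        steps += 1
```

With γ = 1 and a reward of –1 per step, maximizing the learned action values is exactly minimizing the time to reach the goal.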
Acrobot Learning Curves for Sarsa(λ)
Typical Acrobot Learned Behavior
RL in Robotics
Robot motor capabilities were investigated using RL
Walking, grabbing and delivering – MIT Media Lab
Robocup competitions – soccer games
Sony AIBOs are commonly employed
Maze-type problems
Balancing on unstable platforms
Multi-dimensional input streams
Hopefully some new applications soon
Introduction to Wireless Sensor Networks (WSN)
A sensor network is composed of a large number of sensor nodes, which are densely deployed either inside the phenomenon or very close to it
Random deployment
Cooperative capabilities
May be wireless or wired; however, most modern applications require wireless communications
May be mobile or static
Main challenge: maximize the life of the network under battery constraints!
Communication Topology of Sensor Networks
Fire detection and monitoring
Nodes we have here at the labNodes we have here at the lab
UCB TelosB
Intel Mote
Energy Consumption in WSN
Sources of energy consumption: sensing, computation, and communication (dominant)
Energy wastes in communications:
Collisions (packet retransmission increases energy consumption)
Idle listening (listening to the channel when the node is not intending to transmit)
Communication overhead (the communications cost of the MAC protocol)
Overhearing (receiving packets which are destined to other nodes)
MAC-related problems in WSN
Goal: to schedule or coordinate the communications among multiple nodes sharing the same wireless radio frequency
[Figure: a seven-node wireless topology illustrating the hidden- and exposed-terminal problems]
Hidden terminal problem: nodes 5 and 3 both want to transmit data to node 1. Since node 3 is out of the communication range of node 5, if they transmit simultaneously, node 1 will experience a collision.
Exposed terminal problem: node 1 sends data to node 3; since node 5 also overhears this transmission, the transmission from node 6 to node 5 is needlessly constrained.
S-MAC – Example of WSN MAC Protocol
S-MAC — by Ye, Heidemann and Estrin (2003)
Tradeoffs: energy vs. latency vs. fairness
Major components in S-MAC:
• Periodic listen and sleep
• Collision avoidance
• Overhearing avoidance
• Message passing
RL-MAC (Z. Liu, I. Arel, 2005)
Formulate the MAC problem as an RL problem
Similar frame-based structure as in S-MAC/T-MAC
Each node infers the state of other nodes as part of its decision-making process
Both the active time and the duty cycle are a function of the traffic load; Q-learning was used
The main effort involved crafting the reward signal, which reflects:
n_b – number of packets queued
t_r – action (active time)
Ratio of successful receptions vs. transmissions
Number of failed transmission attempts
A term reflecting packet delay
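As an illustration of this formulation (not the paper's exact equations), here is a minimal tabular Q-learning sketch in which a node picks its active time each frame from a quantized queue-length state. The reward weights w1–w3, the state/action discretization, and the node interface are assumptions:

```python
import numpy as np

# Minimal sketch of a Q-learning duty-cycle controller in the spirit
# of RL-MAC. The discretization, reward weights, and node interface
# are placeholder assumptions, not the paper's exact formulation.

N_QUEUE_LEVELS = 8          # state: quantized number of queued packets (n_b)
ACTIVE_TIMES = (1, 2, 4, 8) # action: active time t_r, in slots per frame
alpha, gamma, eps = 0.1, 0.9, 0.1

Q = np.zeros((N_QUEUE_LEVELS, len(ACTIVE_TIMES)))

def reward(stats, t_r):
    """Placeholder reward: favor successful traffic, penalize failed
    attempts and the energy cost of staying awake longer."""
    w1, w2, w3 = 1.0, 0.5, 0.1            # assumed weights
    return (w1 * stats.packets_delivered
            - w2 * stats.failed_attempts
            - w3 * t_r)                    # longer active time costs energy

def choose_action(s):
    if np.random.rand() < eps:             # epsilon-greedy exploration
        return np.random.randint(len(ACTIVE_TIMES))
    return int(np.argmax(Q[s]))

def on_frame_end(node):
    """Run once per MAC frame: observe, act, learn."""
    s = min(node.queue_length, N_QUEUE_LEVELS - 1)        # current state
    a = choose_action(s)
    stats = node.run_frame(active_slots=ACTIVE_TIMES[a])  # assumed hook
    r = reward(stats, ACTIVE_TIMES[a])
    s2 = min(node.queue_length, N_QUEUE_LEVELS - 1)       # next state
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
```

The reward trades delivered traffic against the energy cost of a longer active time, which is exactly the tension the reward components above capture.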
RL-MAC Results
RL-MAC Results (cont.)
Summary
RL is a powerful tool which can support a wide range of applications
There is an art to defining the observations, states, rewards and actions
Main goal: formulate an "as simple as possible" representation
Depends on the application
Can impact results significantly
Fits in both high-resource and low-resource systems
Next class, we'll talk about a particular class of RL techniques called Neuro-Dynamic Programming