Date post: | 19-Jan-2016 |
Upload: | jemimah-gibson |
1
Reinforcement Learning
Chapter 13
• What is Reinforcement Learning?
• Q-Learning
• Examples
2
Machine Learning Categories
3
What’s Reinforcement Learning?
• An autonomous agent should learn to choose optimal actions in each state to achieve its goals.
• The agent learns how to achieve that goal by trial-and-error interactions with its environment.
4
Example: Learning to ride a bike
• Suppose that, in the first trial, the RL system begins riding the bicycle and performs a series of actions that leave the bicycle tilted 45 degrees to the right.
• At this point, there are two possible actions:
– Turn the handlebars right: crash to the ground (a negative reinforcement)
– Turn the handlebars left: crash to the ground (a negative reinforcement)
5
Example: Learning to ride a bike
• At this point, the RL system has learned not only that turning the handlebars right or left when tilted 45 degrees to the right is bad, but also that the "state" of being tilted 45 degrees to the right is bad.
• Again, the RL system begins another trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right. ……
6
Reinforcement Learning: Suitable for state-action problems
• Board games, e.g. backgammon, chess, 8-puzzle, … (Imran Ghory, "Reinforcement Learning in Board Games", 2004)
[Figure: state-transition graph with states s0, s1, s2, s3, s5, s6, s7, s8 connected by actions a1 to a7]
7
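What’s Reinforcement Learning?
[Figure: agent-environment interaction loop. The environment sends a state and a reward to the agent, and the agent responds with an action, giving the sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, …]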
s : state
a : action
r : a reward function
control policy π : S -> A
8
Example: TD-Gammon
• Tesauro (1995)
• Uses RL to learn to play backgammon at world-championship level
• Immediate reward
– +100 for a win
– -100 for a loss
– 0 for all other states
• Trained by playing 1.5 million games against itself
• Now approximately equal to the best human player
9
An Example of a Reward Function
10
The Goal in Reinforcement Learning
• Goal: learn to choose actions that maximize:
$r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$
• where $0 \le \gamma < 1$ is the discount factor
• The discount factor is used to exponentially decrease the weight of reinforcements received in the future
• It’s called: Discounted Cumulative Reward
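The discounted sum above can be sketched in a few lines of Python. This is only an illustration: the function name `discounted_return` and the sample reward sequence are assumptions, not part of the slides, and the sum is truncated to a finite list of rewards.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r0 + gamma*r1 + gamma^2*r2 + ... for a finite reward list."""
    total = 0.0
    # Accumulate from the last reward backwards, so each step needs
    # only one multiplication and one addition.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical sequence: a reward of 100 arrives two steps in the future,
# so it is weighted by gamma^2 = 0.81.
print(discounted_return([0, 0, 100], gamma=0.9))
```

With a discount factor of 0.9, a reward of 100 that arrives two steps later contributes about 81 to the total, which shows how later reinforcements count for exponentially less.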
11
Discounted Cumulative Reward
$\gamma = 0.9$
12
Other Options
• Finite-horizon model: $\sum_{i=0}^{h} r_i$
• Average-reward model: $\lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h} r_i$
• Average discounted reward model:
13
Different Types of Learning Tasks
• Agent’s actions:
– Deterministic, or
– Nondeterministic
• The agent may or may not be able to predict the next state that will result from each action
• Trainer of the agent:
– An expert (who shows it examples of optimal action sequences), or
– The agent itself (it trains itself by performing actions of its own choice)
14
Q-Learning for Simple Deterministic Worlds
15
example
$Q(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} Q(s_2, a')$
$\leftarrow 0 + 0.9 \max\{63, 81, 100\}$
$\leftarrow 90$
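The update in this example can be reproduced with a small table-based sketch. It assumes the deterministic Q-learning rule Q(s, a) <- r + gamma * max Q(s', a'), with gamma = 0.9 as in the example. The dictionary layout, the action names in s2, and the initial value of Q(s1, right) are hypothetical and chosen only for illustration.

```python
GAMMA = 0.9


def q_update(q, state, action, reward, next_state, gamma=GAMMA):
    """Apply Q(s, a) <- r + gamma * max over a' of Q(s', a') and return the new value."""
    # A state with no recorded actions contributes 0 to the update.
    best_next = max(q[next_state].values(), default=0.0)
    q[state][action] = reward + gamma * best_next
    return q[state][action]


# Hypothetical Q table: s2 has three actions with estimates 63, 81 and 100.
q = {
    "s1": {"right": 0.0},
    "s2": {"a": 63.0, "b": 81.0, "c": 100.0},
}

new_value = q_update(q, "s1", "right", reward=0.0, next_state="s2")
print(new_value)
```

The immediate reward is 0, so the new estimate is 0.9 times the best estimate available in s2, matching the value of 90 worked out above.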
16
RL as a function approximation method
• Learning the control policy (π) is very similar to the function approximation problem, except:
1. Delayed reward
– In RL, the trainer provides only a sequence of immediate reward values, so the agent faces the problem of temporal credit assignment.
2. Exploration or exploitation (next slide)
– Exploration to collect new information, or exploitation of what the agent has already learned to maximize the cumulative reward.
– In RL, the agent influences the distribution of training examples through the action sequence it chooses.
17
Explore or Exploit?
• Q-learning does not specify how to choose among the possible actions. Some options:
– Uniform random selection
– Selecting the action with the highest Q-value
– Selection based on the following probability:
$P(a_i \mid s) = \frac{k^{Q(s, a_i)}}{\sum_j k^{Q(s, a_j)}}$, with $k > 0$
– Small k => exploration, large k => exploitation
– Common choice: a small k at the beginning of the learning process, then gradually increasing k
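The probabilistic option can be sketched as follows, assuming the selection rule from Mitchell's Chapter 13, where action a_i is chosen in state s with probability k^Q(s, a_i) divided by the sum over j of k^Q(s, a_j). The function names and the Q-values are hypothetical.

```python
import random


def action_probabilities(q_values, k):
    """Map each action to k**Q(s, a) / sum_j k**Q(s, a_j), for k > 0."""
    weights = {action: k ** q for action, q in q_values.items()}
    total = sum(weights.values())
    return {action: w / total for action, w in weights.items()}


def choose_action(q_values, k, rng=random):
    """Sample one action according to the probabilities above."""
    probs = action_probabilities(q_values, k)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


# Hypothetical Q estimates for three actions in one state.
q_values = {"left": 1.0, "right": 2.0, "up": 3.0}

print(action_probabilities(q_values, k=1.0))   # k = 1: every action equally likely
print(action_probabilities(q_values, k=10.0))  # larger k: "up" gets most of the mass
print(choose_action(q_values, k=10.0))
```

With k = 1 every weight is 1, so the choice is uniform (pure exploration). As k grows, the action with the highest Q-value takes almost all of the probability (exploitation). For large Q-values the powers can overflow, so a practical version would likely rescale them first.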
18
RL vs. other function approximation (continued)
3. Partially observable states
– In many practical situations, the sensors provide only partial information (like the camera on the front of a robot).
– Solution: consider previous observations together with the current sensor data.
4. Life-long learning
– Unlike the function approximation task, in RL robots need to learn many tasks simultaneously and to keep learning online indefinitely.
19
RL Convergence
• Proved in Mitchell, pp. 377-378.
• Three conditions for convergence:
– Deterministic Markov Decision Process (MDP)
– Immediate rewards are positive and bounded
– The agent selects every state-action pair infinitely often.
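A small experiment can illustrate these conditions. The sketch below is not from the slides: it assumes a hypothetical deterministic three-state world (s0 -> s1 -> goal), nonnegative bounded rewards, a discount factor of 0.9, and uniform random action selection, so that every state-action pair keeps being visited. The names `DELTA`, `REWARD` and `train` are illustrative.

```python
import random

GAMMA = 0.9
ACTIONS = ("go", "stay")

# Hypothetical deterministic world: "go" moves toward the goal, "stay" does not.
DELTA = {
    ("s0", "go"): "s1", ("s0", "stay"): "s0",
    ("s1", "go"): "goal", ("s1", "stay"): "s1",
}
# Bounded, nonnegative rewards: only entering the goal pays off.
REWARD = {
    ("s0", "go"): 0.0, ("s0", "stay"): 0.0,
    ("s1", "go"): 100.0, ("s1", "stay"): 0.0,
}


def train(episodes=500, max_steps=20, seed=0):
    """Run tabular Q-learning with uniform random exploration."""
    rng = random.Random(seed)
    q = {pair: 0.0 for pair in DELTA}  # table initialised to zero
    for _ in range(episodes):
        state = rng.choice(["s0", "s1"])
        for _ in range(max_steps):
            if state == "goal":  # absorbing state ends the episode
                break
            action = rng.choice(ACTIONS)
            nxt = DELTA[(state, action)]
            # The goal has no actions, so it contributes the default 0.
            best_next = max(
                (q[(nxt, a)] for a in ACTIONS if (nxt, a) in q), default=0.0
            )
            q[(state, action)] = REWARD[(state, action)] + GAMMA * best_next
            state = nxt
    return q


for pair, value in sorted(train().items()):
    print(pair, round(value, 2))
```

In this toy world the table should settle at Q(s1, go) = 100, Q(s1, stay) = 90, Q(s0, go) = 90 and Q(s0, stay) = 81. A finite run only approximates the "infinitely often" condition, so this is a demonstration rather than a proof.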
20
Markov Decision Process
• Finite set of states: S; set of actions: A
– t: discrete time step
– $s_t$: the state at time t
– $a_t$: the action at time t
• At each discrete time step, the agent observes state $s_t \in S$ and chooses action $a_t \in A$.
• It then receives the immediate reward $r_t$, and the state changes to $s_{t+1}$.
• Markov assumption: $s_{t+1} = \delta(s_t, a_t)$, $r_t = r(s_t, a_t)$
– i.e., $r_t$ and $s_{t+1}$ depend only on the current state and action
• The functions $\delta$ and $r$ may be nondeterministic
• The functions $\delta$ and $r$ are not necessarily known to the agent
[Figure: sequence of states, actions, and rewards over time: s_t, a_t, r_t, then s_t+1, a_t+1, r_t+1, then s_t+2, a_t+2, r_t+2, …]
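The interaction sequence described on this slide can be sketched as a rollout through a deterministic MDP. The transition table `DELTA` (standing in for the transition function), the reward table `REWARD`, and the two-state world are hypothetical, and the policy is passed in as an ordinary function from states to actions.

```python
# Hypothetical deterministic MDP with states "s0", "s1" and actions "go", "stay".
DELTA = {
    ("s0", "go"): "s1", ("s0", "stay"): "s0",
    ("s1", "go"): "s1", ("s1", "stay"): "s1",
}
REWARD = {
    ("s0", "go"): 100.0, ("s0", "stay"): 0.0,
    ("s1", "go"): 0.0, ("s1", "stay"): 0.0,
}


def rollout(start, policy, steps):
    """Follow `policy` for `steps` time steps and return [(s_t, a_t, r_t), ...]."""
    trajectory = []
    state = start
    for _ in range(steps):
        action = policy(state)
        # Markov assumption: reward and next state depend only on (s_t, a_t).
        reward = REWARD[(state, action)]
        trajectory.append((state, action, reward))
        state = DELTA[(state, action)]
    return trajectory


print(rollout("s0", lambda s: "go", steps=3))
```

Because the next state and the reward are looked up from the current state and action alone, the rollout never needs the earlier history, which is what the Markov assumption states.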
21
Other issues in RL (pp. 381-386)
• Reinforcement learning for nondeterministic rewards and actions
• Temporal difference learning
• Generalizing from examples
• Relationship to dynamic programming
• Continuous reinforcement learning (state-of-the-art)
22
Homework
• 13.3 – Tic-Tac-Toe