Research in Intelligent Mobile Robotics (and related topics)
Part 2: Learning

Anna Helena Reali, [email protected], www.pcs.usp.br/~anna
Laboratório de Técnicas Inteligentes, Escola Politécnica da Universidade de São Paulo

Carlos Henrique Costa, [email protected], www.comp.ita.br/~carlos
Divisão de Ciência da Computação, Instituto Tecnológico de Aeronáutica
MultiBot - Meeting #1, Lisboa 2003 - part II 2
Reinforcement Learning
Learning via model-free direct experimentation.
Based on Markov Decision Process (MDP) theory.
Related to Dynamic Programming (well-established theoretical basis).
Grounded on the concept of prediction learning.
Strong theoretical results in single-agent cases (proofs of convergence, error bounds).
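A minimal sketch of the tabular Q-learning loop underlying most of the work in this part; the gym-like `env` interface (reset/step/actions) and parameter values are illustrative assumptions, not the actual experiment code.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.9, gamma=0.95, epsilon=0.2):
    """Run one episode of tabular Q-learning.

    Assumes a hypothetical env exposing reset(), step(a) -> (s', r, done)
    and a list of discrete actions env.actions.
    """
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy exploration
        if random.random() < epsilon:
            a = random.choice(env.actions)
        else:
            a = max(env.actions, key=lambda x: Q[(s, x)])
        s_next, r, done = env.step(a)
        # one-step temporal-difference update
        target = r + gamma * max(Q[(s_next, x)] for x in env.actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

Q = defaultdict(float)  # state-action values, default 0
```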
MultiBot - Meeting #1, Lisboa 2003 - part II 3
Global Maps: Magellan, Neural net model
[Figure: learned global maps for Map 1, Map 2 and Map 3]
MultiBot - Meeting #1, Lisboa 2003 - part II 4
Experiments: Learning (Sildomar Takahashi, Carlos Ribeiro)
Saphira simulator, "Magellan"
MultiBot - Meeting #1, Lisboa 2003 - part II 5
Results “Magellan” – Map 1, Q-learning
[Plots: Q-learning on Map 2, panels A (original), B (perfect) and C (worst): steps per episode over 15 episodes, reinforcement per step, and total steps per configuration]
MultiBot - Meeting #1, Lisboa 2003 - part II 6
Results on Path Learning: "Magellan"
Learning rate = 0.9, temporal discount = 0.95, Dyna: 30 simulated experiences per step
A training course = 20 learning episodes
Results: average over 10 complete training courses (2.5 hours)
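A sketch of the Dyna-style step implied by "30 simulated experiences per step", assuming a dictionary model that caches the last observed transition for each visited (s, a) pair; all names here are illustrative, not the original experiment code.

```python
import random

def dyna_q_step(Q, model, s, a, r, s_next, actions,
                alpha=0.9, gamma=0.95, planning_steps=30):
    """One real experience followed by `planning_steps` simulated updates."""
    def update(s1, a1, r1, s2):
        best = max(Q.get((s2, b), 0.0) for b in actions)
        Q[(s1, a1)] = Q.get((s1, a1), 0.0) + alpha * (r1 + gamma * best - Q.get((s1, a1), 0.0))

    update(s, a, r, s_next)          # direct RL update from the real step
    model[(s, a)] = (r, s_next)      # store the observed transition
    for _ in range(planning_steps):  # replay transitions from the learned model
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        update(ps, pa, pr, ps2)
```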
MultiBot - Meeting #1, Lisboa 2003 - part II 7
Conclusions
Trajectory learning based on RL suffers graceful degradation w.r.t. map quality.
Perceptual aliasing is not necessarily catastrophic for RL-based robot path learning.
MultiBot - Meeting #1, Lisboa 2003 - part II 8
Option Policies (Leticia Friske, Carlos Ribeiro)
An option policy is a mapping state → option.
An option is a sequence of actions (possibly a partial solution or subplan).
Allows more aggressive exploration of the state space.
[Diagram: trajectory s0 --a1,r1--> s1 --a2,r2--> s2 --a3,r3--> s3]
MultiBot - Meeting #1, Lisboa 2003 - part II 9
O Options
Set of options o1, o2, o3, ..., on. Each is defined by an action policy πi.
o = ⟨I, π, β⟩, where:
I ⊆ S is the input set
π is the action policy
β is a termination condition (more generally, a probability distribution over states)
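A small sketch of the ⟨I, π, β⟩ option structure above as a data type; field names and types are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

State, Action = Any, Any

@dataclass
class Option:
    """An option o = <I, pi, beta> as defined above."""
    initiation_set: Set[State]             # I: subset of S where the option may start
    policy: Callable[[State], Action]      # pi: intra-option action policy
    termination: Callable[[State], float]  # beta: probability of terminating in a state

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set
```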
MultiBot - Meeting #1, Lisboa 2003 - part II 10
O Options: Example
Option o1 = ⟨S1, π1, T=3⟩
Option o2 = ⟨S2, π2, T=2⟩
Option policy O:
O(s0) = O(s17) = o1
…
O(s3) = o2
[Figure: grid world (states 0-35) illustrating policies π1 and π2]
MultiBot - Meeting #1, Lisboa 2003 - part II 11
OS Options
Set of options os1, os2, os3, ..., osn. Each of them is a fixed sequence of actions.
Sequential execution of actions, independent of the state the agent is in.
os = ⟨S, seq, β⟩, where:
S is the set of possible states
seq is an action sequence {at, at+1, ..., aT-1}
β is a termination condition (more generally, a probability distribution over states)
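A companion sketch of the ⟨S, seq, β⟩ structure: an open-loop option that runs a fixed action sequence regardless of state. The env interface and signature of `termination` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class SequenceOption:
    """An OS option <S, seq, beta>: a fixed open-loop action sequence."""
    seq: List[Any]                           # actions executed in order, ignoring state
    termination: Callable[[Any, int], bool]  # beta: may stop early given (state, step index)

    def run(self, env, state):
        """Execute the sequence until it ends or beta fires."""
        for i, a in enumerate(self.seq):
            if self.termination(state, i):
                break
            state, _reward, done = env.step(a)
            if done:
                break
        return state
```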
MultiBot - Meeting #1, Lisboa 2003 - part II 12
OS Options: Example
Option o1 = ⟨S1, seq1, T=3⟩
Option o2 = ⟨S2, seq2, T=2⟩
Option policy O:
O(s0) = O(s17) = o1
…
O(s3) = o2
[Figure: grid world (states 0-35) illustrating the two action sequences]
MultiBot - Meeting #1, Lisboa 2003 - part II 13
Option Policy Autonomous Learning
Q-Learning over options:
Visits state s_t and selects option o_t
Executes o_t until the termination state s_{t+k} is reached
Receives reinforcement r_o = r_{t+1} + γ r_{t+2} + ... + γ^{k-1} r_{t+k}
Updates Q_t(s_t, o_t) according to:
ΔQ_t = α_t [ r_o + γ^k max_{o'} Q(s_{t+k}, o') - Q_t(s_t, o_t) ]
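A sketch of that update as code, applied once an option has terminated; the data structures (dict-valued Q, list of per-step rewards) are illustrative assumptions.

```python
def smdp_q_update(Q, options, s, o, rewards, s_next, alpha=0.9, gamma=0.9):
    """Q-learning update over options, applied when option `o` terminates.

    `rewards` is [r_{t+1}, ..., r_{t+k}], collected while `o` ran from
    state `s` to termination state `s_next`.
    """
    k = len(rewards)
    # discounted return accumulated inside the option
    r_o = sum((gamma ** i) * r for i, r in enumerate(rewards))
    best_next = max(Q.get((s_next, o2), 0.0) for o2 in options)
    td_error = r_o + (gamma ** k) * best_next - Q.get((s, o), 0.0)
    Q[(s, o)] = Q.get((s, o), 0.0) + alpha * td_error
```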
MultiBot - Meeting #1, Lisboa 2003 - part II 14
Termination Improvement
Termination can be determined either by:
the termination condition, or
a value criterion: termination caused by the result of a comparison among values of states visited during execution of an option.
[Diagram: trajectory s0 s1 s2 s3 s4 with candidate termination points t0, t1, t2, t3 and final termination T]
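One possible reading of the value criterion, sketched below: interrupt the running option as soon as some other option already looks better at the state just reached. This is an assumption about the comparison used, not the authors' exact rule.

```python
def should_interrupt(Q, options, s, current_option):
    """Value-criterion termination check at state `s`."""
    continuing = Q.get((s, current_option), 0.0)
    best_other = max(
        (Q.get((s, o), 0.0) for o in options if o is not current_option),
        default=float("-inf"),
    )
    return best_other > continuing
```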
MultiBot - Meeting #1, Lisboa 2003 - part II 15
Experimental Results
Path learning, Khepera simulator
Algorithm: Q-Learning
Parameters:
punishment: -1
reward: [0, +1] depending on target proximity
temporal discount: 0.9
learning rate: 0.9
exploration rate: 20%, 5% and 1%
MultiBot - Meeting #1, Lisboa 2003 - part II 16
Experimental Results
Options OS and Termination Improvement
Three robots (average over 10 learning courses for each):
Robot A: 7 options of length 2 (3^2 = 9, minus redundancies)
Robot B: 15 options of length 3
Robot C: 15 options (same as robot B) with Termination Improvement
MultiBot - Meeting #1, Lisboa 2003 - part II 17
Results (learning)
[Plot: comparative learning curves for robots A, B and C: steps per episode (0-2500) vs. learning episodes (0-50)]
MultiBot - Meeting #1, Lisboa 2003 - part II 18
Results
Better results for robot C:
Higher convergence speed: option duration control via Termination Improvement makes it easier to define a set of convenient options.
Fine tuning of rewards in regions close to the target, since Termination Improvement yields shorter-duration options in such regions.
MultiBot - Meeting #1, Lisboa 2003 - part II 19
Conclusion
Termination Improvement:
Increases convergence speed.
Produces better rewards (fine tuning).
Assumes knowledge of the states visited during execution of an option: requires a simulation model in stochastic domains.
MultiBot - Meeting #1, Lisboa 2003 - part II 20
Future Work
More results: different domains, comparison among different kinds of options (there are references for some results already observed).
Experiments on a real robotic platform.
Extensions: combination of different kinds of options, automatic segmentation (autonomous definition of the option set; PhD work, L. Friske).
Formalization: relationship with hybrid control (?)
MultiBot - Meeting #1, Lisboa 2003 - part II 21
RL in Games
Concurrent policy learning in model-free, multi-agent scenarios.
Agents learning concurrently: non-stationary scenario.
Non-stationary scenario: the reward received by an agent depends on the behavior of the other agents.
MultiBot - Meeting #1, Lisboa 2003 - part II 22
Markov Games – MG
Extension of the Markov Decision Process (MDP) to multiple agents.
n agents interacting with the environment via perception and action.
On each interaction step:
Each agent i senses the current state st and chooses an action ai.
The set of actions a1, ..., an alters the state.
A reinforcement signal ri is provided to each agent i to indicate the desirability of the resulting state.
MultiBot - Meeting #1, Lisboa 2003 - part II 23
Minimax-Q [Littman, 1994]
An algorithm for solving a specialization of the MG framework: two agents (A and B), actions in alternating turns, zero-sum game.
Combines Minimax and Q-learning.
Assumes discrete states and actions in a tabular representation.
Goal of A: to learn an optimal policy of actions that maximizes an expected cumulative sum of discounted reinforcements.
Not easy: it depends on the actions the opponent performs!
Idea: evaluate each policy with respect to the opponent's strategy that makes it look the worst.
Converges to equilibrium, but rather slowly for practical purposes.
MultiBot - Meeting #1, Lisboa 2003 - part II 24
A solution: experience generalization (Renê Pegoraro, Carlos Ribeiro, Anna Reali)
Modify the basic RL technique to include prior information about the structure of the state-action space in such a way that:
The well-studied characteristics of the RL algorithm are maintained.
The inclusion of information improves performance in practical applications (faster "convergence").
MultiBot - Meeting #1, Lisboa 2003 - part II 25
Minimax-QS
Prior knowledge can be encoded by a spreading mechanism: a single experience can update more than a single action value.
Better use of experience (experience generalization): the consequence of choosing action at when the opponent chooses ot at state st is spread to all other similar triples (s, a, o).
Minimax-QS: an algorithm for experience generalization based on Minimax-Q.
MultiBot - Meeting #1, Lisboa 2003 - part II 26
Minimax-QS Iteration
Loop until convergence:
1. At state s_t: A selects a_t, B selects o_t.
2. A receives reinforcement r_t.
3. Next state is s_{t+1}.
4. Q values for every (s, a, o) are updated according to:
Q_{t+1}(s, a, o) ← Q_t(s, a, o) + α_t σ_t(s_t, a_t, o_t, s, a, o) [ r_t + γ V(s_{t+1}) - Q_t(s, a, o) ]
where 0 ≤ σ_t(s_t, a_t, o_t, s, a, o) ≤ 1 is the spreading function.
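A minimal sketch of that spread update step, assuming dict-valued Q and V tables and a user-supplied `sigma` function; names and data structures are illustrative.

```python
def minimax_qs_update(Q, V, states, actions, opp_actions,
                      s_t, a_t, o_t, r_t, s_next,
                      alpha, gamma, sigma):
    """Spread one experience (s_t, a_t, o_t, r_t, s_next) to all similar
    triples (s, a, o), weighted by sigma in [0, 1]."""
    for s in states:
        for a in actions:
            for o in opp_actions:
                w = sigma(s_t, a_t, o_t, s, a, o)
                if w == 0.0:
                    continue
                td = r_t + gamma * V.get(s_next, 0.0) - Q.get((s, a, o), 0.0)
                Q[(s, a, o)] = Q.get((s, a, o), 0.0) + alpha * w * td
```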
MultiBot - Meeting #1, Lisboa 2003 - part II 27
Minimax-QS: Properties
By using a spreading function it is possible to reduce the learning time of Minimax-Q.
If the spreading mechanism σ_t vanishes at least as quickly as the learning rate α_t, then Minimax-QS converges to the action values generated by Minimax-Q: this means that Minimax-QS converges to equilibrium, faster (hopefully) than Minimax-Q.
Proof of convergence: see [Ribeiro, Pegoraro, Costa, 2002].
MultiBot - Meeting #1, Lisboa 2003 - part II 28
Experiments in the Soccer Domain
Soccer simulator (Littman):
Two-player zero-sum game played on a 4x5 grid.
Players are always on distinct grid squares.
Player actions: N, S, E, W, stand.
If a player attempts to move to a grid square occupied by the other player, the move fails and the second player gets the ball.
If a player tries an action that would take it off the board, the move does not take place.
MultiBot - Meeting #1, Lisboa 2003 - part II 29
Experiments in the Soccer Domain
Initial configuration: players A and B placed in the positions shown, ball possession given randomly to A or B.
If the player with the ball reaches the goal, it scores and the board is reset to its initial configuration.
[Figure: 4x5 grid with the initial positions of A and B]
MultiBot - Meeting #1, Lisboa 2003 - part II 30
Spreading Functions
σ_t(s_t, a_t, o_t, s, a, o) = g_t(s_t, s) δ(a_t, a) δ(o_t, o), with g_t(s_t, s) = τ^d
τ is either a constant (0.7) or a decreasing linear function of the iteration number (0.7 maximum).
d is the minimal number of actions needed to move B to its position in the similar configuration considered.
[Figure: example board configurations with similarity factors 0.7 (B one move away) and 0.49 (B two moves away); the configuration shown is similar to the original by a factor of 0.49]
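The spreading function above, sketched as code; `distance` is a hypothetical helper returning the number of moves d between B's positions in the two configurations, and the constant-τ case is assumed.

```python
def spreading(s_t, a_t, o_t, s, a, o, distance, tau=0.7):
    """sigma = g(s_t, s) * delta(a_t, a) * delta(o_t, o), with g = tau ** d."""
    if a != a_t or o != o_t:
        return 0.0           # spread only to the same action pair
    d = distance(s_t, s)     # hypothetical helper: moves taking B between configurations
    return tau ** d
```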
MultiBot - Meeting #1, Lisboa 2003 - part II 31
Results against random opponent
MultiBot - Meeting #1, Lisboa 2003 - part II 32
Results against Minimax-Q opponent
MultiBot - Meeting #1, Lisboa 2003 - part II 33
Minimax-QS: Conclusions
Minimax-QS enhances Minimax-Q by embedding prior knowledge, keeping the convergence properties of the latter.
Results confirm the usefulness of the spreading function, mainly in the beginning of the learning process (even when using a very simple domain-dependent spreading function).
Problem: a wrong choice of spreading function can degrade the performance of Minimax-QS. But this is a problem with any approximation scheme…
MultiBot - Meeting #1, Lisboa 2003 - part II 34
Heuristically Accelerated Learning (Reinaldo Bianchi, Anna Reali, Carlos Ribeiro)
Heuristically Accelerated Learning (HAL):
allows the use of heuristics to speed up RL algorithms;
uses automatic methods for the extraction of the heuristic from the problem domain and/or the learning process.
MultiBot - Meeting #1, Lisboa 2003 - part II 35
The Heuristic Function
HALs are modeled using a heuristic function H that:
influences the choice of actions;
is strongly related to the policy of the system;
is updated during the learning process.
The heuristic function is only used to influence the learner's exploration.
MultiBot - Meeting #1, Lisboa 2003 - part II 36
Ex. HAL: QH-Learning
An extension of Q-Learning. It uses any "Heuristic from X" method to define the heuristic, and the following rule to choose the action to perform:
a_t(s_t) = argmax_a [ Q(s_t, a) + H(s_t, a) ]  if q ≤ p,
a_t(s_t) = a_random  otherwise.
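The selection rule above as a short sketch; the dict-valued Q and H tables and the exploration parameter p are illustrative assumptions.

```python
import random

def qh_select_action(Q, H, s, actions, p=0.9):
    """QH-Learning action choice: greedy w.r.t. Q + H with probability p,
    a random action otherwise."""
    if random.random() <= p:
        return max(actions, key=lambda a: Q.get((s, a), 0.0) + H.get((s, a), 0.0))
    return random.choice(actions)
```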
MultiBot - Meeting #1, Lisboa 2003 - part II 37
Experiments with HAL: domain
MultiBot - Meeting #1, Lisboa 2003 - part II 38
The V*-table
MultiBot - Meeting #1, Lisboa 2003 - part II 39
Optimal Policy
MultiBot - Meeting #1, Lisboa 2003 - part II 40
QH-Learning Experiments
Case-based implementation:
Runs until the 100th iteration.
Using the edges in the current V-table, checks if the current case is similar to one already learned.
If there is a similar case, uses the policy π*_c of the previous case as the heuristic:
H(s_t, a) = 1 if a = π*_c(s_t), 0 otherwise.
MultiBot - Meeting #1, Lisboa 2003 - part II 41
V-table + Edge Detection
[Figure panels: V*-table (previous case); edges of the V*-table (stored previous case); policy π*_c (defines H(s, a))]
MultiBot - Meeting #1, Lisboa 2003 - part II 42
Q x QH – good heuristic
H(s_t, a) = 1 if a = π*_c(s_t), 0 otherwise.
MultiBot - Meeting #1, Lisboa 2003 - part II 43
Q x QH – poor heuristic
H(s_t, a) = 0 if a = π*_c(s_t), 1 otherwise.
MultiBot - Meeting #1, Lisboa 2003 - part II 44
Conclusions
Good heuristic function: the system converges (accelerated learning).
Even if using a poor heuristic, the system is able to recover from bad performance.
Problem: a wrong choice of heuristic function can degrade the performance of RL algorithms. But this is a problem with any approximation scheme…
We are now investigating how to define appropriate heuristic functions.