Research in Intelligent Mobile Robotics (and related topics)
Part 2: Learning

Anna Helena Reali, [email protected], www.pcs.usp.br/~anna
Laboratório de Técnicas Inteligentes, Escola Politécnica da Universidade de São Paulo

Carlos Henrique Costa, [email protected], www.comp.ita.br/~carlos
Divisão de Ciência da Computação, Instituto Tecnológico de Aeronáutica
MultiBot - Meeting #1, Lisboa 2003 - part II 2
Reinforcement Learning
Learning via model-free direct experimentation.
Based on Markov Decision Process (MDP) theory.
Related to Dynamic Programming (well-established theoretical basis).
Grounded on the concept of prediction learning.
Strong theoretical results in single-agent cases (proofs of convergence, error bounds).
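A minimal sketch of the tabular Q-learning loop underlying most of the work in this part; the gym-like `env` interface (reset/step/actions) and parameter values are illustrative assumptions, not the actual experiment code.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.9, gamma=0.95, epsilon=0.2):
    """Run one episode of tabular Q-learning.

    Assumes a hypothetical env exposing reset(), step(a) -> (s', r, done)
    and a list of discrete actions env.actions.
    """
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy exploration
        if random.random() < epsilon:
            a = random.choice(env.actions)
        else:
            a = max(env.actions, key=lambda x: Q[(s, x)])
        s_next, r, done = env.step(a)
        # one-step temporal-difference update
        target = r + gamma * max(Q[(s_next, x)] for x in env.actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

Q = defaultdict(float)  # state-action values, default 0
```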
MultiBot - Meeting #1, Lisboa 2003 - part II 3
Global Maps: Magellan, Neural net model
[Figure: learned global maps for Map 1, Map 2 and Map 3]
MultiBot - Meeting #1, Lisboa 2003 - part II 4
Experiments: Learning (Sildomar Takahashi, Carlos Ribeiro)
Saphira simulator, "Magellan"
MultiBot - Meeting #1, Lisboa 2003 - part II 5
Results “Magellan” – Map 1, Q-learning
[Plots: Q-learning on Map 2, panels A (original), B (perfect) and C (worst): steps per episode over 15 episodes, reinforcement per step, and total steps per configuration]
MultiBot - Meeting #1, Lisboa 2003 - part II 6
Results on Path Learning: "Magellan"
Learning rate = 0.9, temporal discount = 0.95, Dyna: 30 simulated experiences per step
A training course = 20 learning episodes
Results: average over 10 complete training courses (2.5 hours)
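A sketch of the Dyna-style step implied by "30 simulated experiences per step", assuming a dictionary model that caches the last observed transition for each visited (s, a) pair; all names here are illustrative, not the original experiment code.

```python
import random

def dyna_q_step(Q, model, s, a, r, s_next, actions,
                alpha=0.9, gamma=0.95, planning_steps=30):
    """One real experience followed by `planning_steps` simulated updates."""
    def update(s1, a1, r1, s2):
        best = max(Q.get((s2, b), 0.0) for b in actions)
        Q[(s1, a1)] = Q.get((s1, a1), 0.0) + alpha * (r1 + gamma * best - Q.get((s1, a1), 0.0))

    update(s, a, r, s_next)          # direct RL update from the real step
    model[(s, a)] = (r, s_next)      # store the observed transition
    for _ in range(planning_steps):  # replay transitions from the learned model
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        update(ps, pa, pr, ps2)
```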
MultiBot - Meeting #1, Lisboa 2003 - part II 7
Conclusions
Trajectory learning based on RL suffers graceful degradation w.r.t. map quality.
Perceptual aliasing is not necessarily catastrophic for RL-based robot path learning.
MultiBot - Meeting #1, Lisboa 2003 - part II 8
Option Policies (Leticia Friske, Carlos Ribeiro)
An option policy is a mapping state → option.
An option is a sequence of actions (possibly a partial solution or subplan).
Allows more aggressive exploration of the state space.
[Diagram: trajectory s0 --a1,r1--> s1 --a2,r2--> s2 --a3,r3--> s3]
MultiBot - Meeting #1, Lisboa 2003 - part II 9
O Options
Set of options o1, o2, o3, ..., on. Each is defined by an action policy πi.
o = ⟨I, π, β⟩, where:
I ⊆ S is the input set
π is the action policy
β is a termination condition (more generally, a probability distribution over states)
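A small sketch of the ⟨I, π, β⟩ option structure above as a data type; field names and types are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

State, Action = Any, Any

@dataclass
class Option:
    """An option o = <I, pi, beta> as defined above."""
    initiation_set: Set[State]             # I: subset of S where the option may start
    policy: Callable[[State], Action]      # pi: intra-option action policy
    termination: Callable[[State], float]  # beta: probability of terminating in a state

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set
```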
MultiBot - Meeting #1, Lisboa 2003 - part II 10
O Options: Example
Option o1 = ⟨S1, π1, T=3⟩
Option o2 = ⟨S2, π2, T=2⟩
Option policy O:
O(s0) = O(s17) = o1
…
O(s3) = o2
[Figure: grid world (states 0-35) illustrating policies π1 and π2]
MultiBot - Meeting #1, Lisboa 2003 - part II 11
OS Options
Set of options os1, os2, os3, ..., osn. Each of them is a fixed sequence of actions.
Sequential execution of actions, independent of the state the agent is in.
os = ⟨S, seq, β⟩, where:
S is the set of possible states
seq is an action sequence {at, at+1, ..., aT-1}
β is a termination condition (more generally, a probability distribution over states)
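A companion sketch of the ⟨S, seq, β⟩ structure: an open-loop option that runs a fixed action sequence regardless of state. The env interface and signature of `termination` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class SequenceOption:
    """An OS option <S, seq, beta>: a fixed open-loop action sequence."""
    seq: List[Any]                           # actions executed in order, ignoring state
    termination: Callable[[Any, int], bool]  # beta: may stop early given (state, step index)

    def run(self, env, state):
        """Execute the sequence until it ends or beta fires."""
        for i, a in enumerate(self.seq):
            if self.termination(state, i):
                break
            state, _reward, done = env.step(a)
            if done:
                break
        return state
```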
MultiBot - Meeting #1, Lisboa 2003 - part II 12
OS Options: Example
Option o1 = ⟨S1, seq1, T=3⟩
Option o2 = ⟨S2, seq2, T=2⟩
Option policy O:
O(s0) = O(s17) = o1
…
O(s3) = o2
[Figure: grid world (states 0-35) illustrating the two action sequences]
MultiBot - Meeting #1, Lisboa 2003 - part II 13
Option Policy Autonomous Learning
Q-Learning over options:
Visits state s_t and selects option o_t
Executes o_t until the termination state s_{t+k} is reached
Receives reinforcement r_o = r_{t+1} + γ r_{t+2} + ... + γ^{k-1} r_{t+k}
Updates Q_t(s_t, o_t) according to:
ΔQ_t = α_t [ r_o + γ^k max_{o'} Q(s_{t+k}, o') - Q_t(s_t, o_t) ]
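A sketch of that update as code, applied once an option has terminated; the data structures (dict-valued Q, list of per-step rewards) are illustrative assumptions.

```python
def smdp_q_update(Q, options, s, o, rewards, s_next, alpha=0.9, gamma=0.9):
    """Q-learning update over options, applied when option `o` terminates.

    `rewards` is [r_{t+1}, ..., r_{t+k}], collected while `o` ran from
    state `s` to termination state `s_next`.
    """
    k = len(rewards)
    # discounted return accumulated inside the option
    r_o = sum((gamma ** i) * r for i, r in enumerate(rewards))
    best_next = max(Q.get((s_next, o2), 0.0) for o2 in options)
    td_error = r_o + (gamma ** k) * best_next - Q.get((s, o), 0.0)
    Q[(s, o)] = Q.get((s, o), 0.0) + alpha * td_error
```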
MultiBot - Meeting #1, Lisboa 2003 - part II 14
Termination Improvement
Termination can be determined either by:
the termination condition, or
a value criterion: termination caused by the result of a comparison among values of states visited during execution of an option.
[Diagram: trajectory s0 s1 s2 s3 s4 with candidate termination points t0, t1, t2, t3 and final termination T]
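One possible reading of the value criterion, sketched below: interrupt the running option as soon as some other option already looks better at the state just reached. This is an assumption about the comparison used, not the authors' exact rule.

```python
def should_interrupt(Q, options, s, current_option):
    """Value-criterion termination check at state `s`."""
    continuing = Q.get((s, current_option), 0.0)
    best_other = max(
        (Q.get((s, o), 0.0) for o in options if o is not current_option),
        default=float("-inf"),
    )
    return best_other > continuing
```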
MultiBot - Meeting #1, Lisboa 2003 - part II 15
Experimental Results
Path learning, Khepera simulator
Algorithm: Q-Learning
Parameters:
punishment: -1
reward: [0, +1] depending on target proximity
temporal discount: 0.9
learning rate: 0.9
exploration rate: 20%, 5% and 1%
MultiBot - Meeting #1, Lisboa 2003 - part II 16
Experimental Results
Options OS and Termination Improvement
Three robots (average over 10 learning courses for each):
Robot A: 7 options of length 2 (3^2 = 9, minus redundancies)
Robot B: 15 options of length 3
Robot C: 15 options (same as robot B) with Termination Improvement
MultiBot - Meeting #1, Lisboa 2003 - part II 17
Results (learning)
[Plot: comparative learning curves for robots A, B and C: steps per episode (0-2500) vs. learning episodes (0-50)]
MultiBot - Meeting #1, Lisboa 2003 - part II 18
Results
Better results for robot C:
Higher convergence speed: option duration control via Termination Improvement makes it easier to define a set of convenient options.
Fine tuning of rewards in regions close to the target, since Termination Improvement yields shorter-duration options in such regions.
MultiBot - Meeting #1, Lisboa 2003 - part II 19
Conclusion
Termination Improvement:
Increases convergence speed.
Produces better rewards (fine tuning).
Assumes knowledge of the states visited during execution of an option: requires a simulation model in stochastic domains.
MultiBot - Meeting #1, Lisboa 2003 - part II 20
Future Work
More results: different domains, comparison among different kinds of options (there are references for some results already observed).
Experiments on a real robotic platform.
Extensions: combination of different kinds of options, automatic segmentation (autonomous definition of the option set; PhD work, L. Friske).
Formalization: relationship with hybrid control (?)
MultiBot - Meeting #1, Lisboa 2003 - part II 21
RL in Games
Concurrent policy learning in model-free, multi-agent scenarios.
Agents learning concurrently: non-stationary scenario.
Non-stationary scenario: the reward received by an agent depends on the behavior of the other agents.
MultiBot - Meeting #1, Lisboa 2003 - part II 22
Markov Games – MG
Extension of the Markov Decision Process (MDP) to multiple agents.
n agents interacting with the environment via perception and action.
On each interaction step:
Each agent i senses the current state st and chooses an action ai.
The set of actions a1, ..., an alters the state.
A reinforcement signal ri is provided to each agent i to indicate the desirability of the resulting state.
MultiBot - Meeting #1, Lisboa 2003 - part II 23
Minimax-Q [Littman, 1994]
An algorithm for solving a specialization of the MG framework: two agents (A and B), actions in alternating turns, zero-sum game.
Combines Minimax and Q-learning.
Assumes discrete states and actions in a tabular representation.
Goal of A: to learn an optimal policy of actions that maximizes an expected cumulative sum of discounted reinforcements.
Not easy: it depends on the actions the opponent performs!
Idea: evaluate each policy with respect to the opponent's strategy that makes it look the worst.
Converges to equilibrium, but rather slowly for practical purposes.
MultiBot - Meeting #1, Lisboa 2003 - part II 24
A solution: experience generalization (Renê Pegoraro, Carlos Ribeiro, Anna Reali)
Modify the basic RL technique to include prior information about the structure of the state-action space in such a way that:
The well-studied characteristics of the RL algorithm are maintained.
The inclusion of information improves performance in practical applications (faster "convergence").
MultiBot - Meeting #1, Lisboa 2003 - part II 25
Minimax-QS
Prior knowledge can be encoded by a spreading mechanism: a single experience can update more than a single action value.
Better use of experience (experience generalization): the consequence of choosing action at when the opponent chooses ot at state st is spread to all other similar triples (s, a, o).
Minimax-QS: an algorithm for experience generalization based on Minimax-Q.
MultiBot - Meeting #1, Lisboa 2003 - part II 26
Minimax-QS Iteration
Loop until convergence:
1. At state s_t: A selects a_t, B selects o_t.
2. A receives reinforcement r_t.
3. Next state is s_{t+1}.
4. Q values for every (s, a, o) are updated according to:
Q_{t+1}(s, a, o) ← Q_t(s, a, o) + α_t σ_t(s_t, a_t, o_t, s, a, o) [ r_t + γ V(s_{t+1}) - Q_t(s, a, o) ]
where 0 ≤ σ_t(s_t, a_t, o_t, s, a, o) ≤ 1 is the spreading function.
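A minimal sketch of that spread update step, assuming dict-valued Q and V tables and a user-supplied `sigma` function; names and data structures are illustrative.

```python
def minimax_qs_update(Q, V, states, actions, opp_actions,
                      s_t, a_t, o_t, r_t, s_next,
                      alpha, gamma, sigma):
    """Spread one experience (s_t, a_t, o_t, r_t, s_next) to all similar
    triples (s, a, o), weighted by sigma in [0, 1]."""
    for s in states:
        for a in actions:
            for o in opp_actions:
                w = sigma(s_t, a_t, o_t, s, a, o)
                if w == 0.0:
                    continue
                td = r_t + gamma * V.get(s_next, 0.0) - Q.get((s, a, o), 0.0)
                Q[(s, a, o)] = Q.get((s, a, o), 0.0) + alpha * w * td
```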
MultiBot - Meeting #1, Lisboa 2003 - part II 27
Minimax-QS: Properties
By using a spreading function it is possible to reduce the learning time of Minimax-Q.
If the spreading mechanism σ_t vanishes at least as quickly as the learning rate α_t, then Minimax-QS converges to the action values generated by Minimax-Q: this means that Minimax-QS converges to equilibrium, faster (hopefully) than Minimax-Q.
Proof of convergence: see [Ribeiro, Pegoraro, Costa, 2002].
MultiBot - Meeting #1, Lisboa 2003 - part II 28
Experiments in the Soccer Domain
Soccer simulator (Littman):
Two-player zero-sum game played on a 4x5 grid.
Players are always on distinct grid squares.
Player actions: N, S, E, W, stand.
If a player attempts to move to a grid square occupied by the other player, the move fails and the second player gets the ball.
If a player tries an action that would take it off the board, the move does not take place.
MultiBot - Meeting #1, Lisboa 2003 - part II 29
Experiments in the Soccer Domain
Initial configuration: players A and B placed in the positions shown, ball possession given randomly to A or B.
If the player with the ball reaches the goal, it scores and the board is reset to its initial configuration.
[Figure: 4x5 grid with the initial positions of A and B]
MultiBot - Meeting #1, Lisboa 2003 - part II 30
Spreading Functions
σ_t(s_t, a_t, o_t, s, a, o) = g_t(s_t, s) δ(a_t, a) δ(o_t, o), with g_t(s_t, s) = τ^d
τ is either a constant (0.7) or a decreasing linear function of the iteration number (0.7 maximum).
d is the minimal number of actions needed to move B to its position in the similar configuration considered.
[Figure: example board configurations with similarity factors 0.7 (B one move away) and 0.49 (B two moves away); the configuration shown is similar to the original by a factor of 0.49]
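The spreading function above, sketched as code; `distance` is a hypothetical helper returning the number of moves d between B's positions in the two configurations, and the constant-τ case is assumed.

```python
def spreading(s_t, a_t, o_t, s, a, o, distance, tau=0.7):
    """sigma = g(s_t, s) * delta(a_t, a) * delta(o_t, o), with g = tau ** d."""
    if a != a_t or o != o_t:
        return 0.0           # spread only to the same action pair
    d = distance(s_t, s)     # hypothetical helper: moves taking B between configurations
    return tau ** d
```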
MultiBot - Meeting #1, Lisboa 2003 - part II 31
Results against random opponent
MultiBot - Meeting #1, Lisboa 2003 - part II 32
Results against Minimax-Q opponent
MultiBot - Meeting #1, Lisboa 2003 - part II 33
Minimax-QS: Conclusions
Minimax-QS enhances Minimax-Q by embedding prior knowledge, keeping the convergence properties of the latter.
Results confirm the usefulness of the spreading function, mainly in the beginning of the learning process (even when using a very simple domain-dependent spreading function).
Problem: a wrong choice of spreading function can degrade the performance of Minimax-QS. But this is a problem with any approximation scheme…
MultiBot - Meeting #1, Lisboa 2003 - part II 34
Heuristically Accelerated Learning (Reinaldo Bianchi, Anna Reali, Carlos Ribeiro)
Heuristically Accelerated Learning (HAL):
allows the use of heuristics to speed up RL algorithms;
uses automatic methods for the extraction of the heuristic from the problem domain and/or the learning process.
MultiBot - Meeting #1, Lisboa 2003 - part II 35
The Heuristic Function
HALs are modeled using a heuristic function H that:
influences the choice of actions;
is strongly related to the policy of the system;
is updated during the learning process.
The heuristic function is only used to influence the learner's exploration.
MultiBot - Meeting #1, Lisboa 2003 - part II 36
Ex. HAL: QH-Learning
An extension of Q-Learning. It uses any "Heuristic from X" method to define the heuristic, and the following rule to choose the action to perform:
a_t(s_t) = argmax_a [ Q(s_t, a) + H(s_t, a) ]  if q ≤ p,
a_t(s_t) = a_random  otherwise.
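The selection rule above as a short sketch; the dict-valued Q and H tables and the exploration parameter p are illustrative assumptions.

```python
import random

def qh_select_action(Q, H, s, actions, p=0.9):
    """QH-Learning action choice: greedy w.r.t. Q + H with probability p,
    a random action otherwise."""
    if random.random() <= p:
        return max(actions, key=lambda a: Q.get((s, a), 0.0) + H.get((s, a), 0.0))
    return random.choice(actions)
```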
MultiBot - Meeting #1, Lisboa 2003 - part II 37
Experiments with HAL: domain
MultiBot - Meeting #1, Lisboa 2003 - part II 38
The V*-table
MultiBot - Meeting #1, Lisboa 2003 - part II 39
Optimal Policy
MultiBot - Meeting #1, Lisboa 2003 - part II 40
QH-Learning Experiments
Case-based implementation:
Runs until the 100th iteration.
Using the edges in the current V-table, checks if the current case is similar to one already learned.
If there is a similar case, uses the policy π*_c of the previous case as the heuristic:
H(s_t, a) = 1 if a = π*_c(s_t), 0 otherwise.
MultiBot - Meeting #1, Lisboa 2003 - part II 41
V-table + Edge Detection
[Figure panels: V*-table (previous case); edges of the V*-table (stored previous case); policy π*_c (defines H(s, a))]
MultiBot - Meeting #1, Lisboa 2003 - part II 42
Q x QH – good heuristic
H(s_t, a) = 1 if a = π*_c(s_t), 0 otherwise.
MultiBot - Meeting #1, Lisboa 2003 - part II 43
Q x QH – poor heuristic
H(s_t, a) = 0 if a = π*_c(s_t), 1 otherwise.
MultiBot - Meeting #1, Lisboa 2003 - part II 44
Conclusions
Good heuristic function: the system converges (accelerated learning).
Even if using a poor heuristic, the system is able to recover from bad performance.
Problem: a wrong choice of heuristic function can degrade the performance of RL algorithms. But this is a problem with any approximation scheme…
We are now investigating how to define appropriate heuristic functions.