
Lifelong Learning for Disturbance Rejection on Mobile Robots

GRASP LABORATORY

David Isele, José Marcio Luna, Eric Eaton, Gabriel V. de la Cruz, James Irwin, Brandon Kallaher, Matthew E. Taylor


Motivation

Problem 1: Without prior knowledge, RL in a new task is slow

Idea: Reuse knowledge from previously learned tasks

[Figure: standard "tabula rasa" initialization vs. initialization via transfer on a go-to-goal task, with a timeline of previous tasks leading up to the current task]

We focus on the lifelong learning case:
• The agent learns multiple tasks consecutively
• We want stability guarantees as the number of tasks grows large

Background


Background: Policy Gradient Methods for Control

• The agent makes sequential decisions, interacting with the environment through consecutive actions
• The problem is formalized as a Markov Decision Process (MDP)
• PG methods support continuous state and action spaces
– Have shown recent success in applications to robotic control [Kober & Peters 2011; Peters & Schaal 2008; Sutton et al. 2000]

[Figure: agent-environment loop with a reward function and probabilistic transitions; a policy gradient learner collects n trajectories and uses them to update the policy]

Goal: find the policy that maximizes the expected return

J(θ) = ∫ p_θ(τ) R(τ) dτ,

where p_θ(τ) is the probability of trajectory τ under the policy with parameters θ, and R(τ) is the trajectory's reward.
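To make this objective concrete, here is the standard likelihood-ratio form of its gradient [Sutton et al. 2000]; this identity is textbook material rather than something shown on the slides:

$$ \nabla_\theta J(\theta) \;=\; \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau \;=\; \mathbb{E}_{\tau \sim p_\theta}\!\big[\, \nabla_\theta \log p_\theta(\tau)\, R(\tau) \,\big], $$

which can be estimated by averaging over the n sampled trajectories. Finite-difference methods (next slide) avoid even this requirement: they need only rollout returns, not ∇_θ log p_θ.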

Background: Finite Difference Policy Gradients

1.) Approximate the change in reward with sampled perturbations (disturbances) of the policy parameters
2.) Use the pseudo-inverse to find the gradient
3.) Update the current policy
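A minimal sketch of these three steps in Python/NumPy, following the standard finite-difference estimator [Peters & Schaal 2008]. The perturbation scale, sample count, and step size are illustrative assumptions, and `rollout_return` stands in for whatever routine estimates a policy's return on the robot; none of these names come from the talk:

```python
import numpy as np

def fd_policy_gradient(theta, rollout_return, n_samples=20, sigma=0.1, alpha=0.01):
    """One finite-difference policy-gradient update.

    theta          -- current policy parameters, shape (d,)
    rollout_return -- callable mapping parameters to an estimated return J(theta),
                      e.g., the average reward of a few trajectories
    """
    d = theta.shape[0]
    J_ref = rollout_return(theta)

    # Step 1: approximate the change in reward with sampled perturbations.
    delta_theta = sigma * np.random.randn(n_samples, d)   # one perturbation per row
    delta_J = np.array([rollout_return(theta + dt) for dt in delta_theta]) - J_ref

    # Step 2: use the pseudo-inverse to find the gradient, solving
    # delta_J ≈ delta_theta @ g in the least-squares sense:
    # g = (ΔΘᵀ ΔΘ)⁻¹ ΔΘᵀ ΔJ
    g = np.linalg.pinv(delta_theta) @ delta_J

    # Step 3: update the current policy by gradient ascent on the return.
    return theta + alpha * g
```

Because only rollout returns are required, the method applies even when the policy or the robot dynamics are not differentiable, which is convenient for hardware experiments.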

Lifelong PG Learning


Lifelong Machine Learning

[Figure: a lifelong learning system operating over time; tasks arrive in sequence (..., t−3, t−2, t−1, t, t+1, t+2, t+3, ...); trajectories for the current task t flow into the system, a learned policy flows out, and previously learned knowledge is stored and reused]

1.) Tasks are received consecutively
2.) Knowledge is transferred from previously learned tasks
3.) New knowledge is stored for future use
4.) Existing knowledge is refined
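A sketch of these four steps as a training loop. The class and method names are hypothetical placeholders, not an API from the paper, and the average-policy transfer is a simplified stand-in for PG-ELLA's shared latent basis:

```python
import numpy as np

class LifelongLearner:
    """Hypothetical skeleton of a lifelong learning system (steps 1-4 above)."""

    def __init__(self, d):
        self.policies = {}                  # step 3: stored per-task knowledge
        self.d = d                          # policy parameter dimension

    def transfer(self, task_id):
        """Step 2: initialize a new task's policy from stored knowledge
        (here: the average of previously learned policies, standing in
        for PG-ELLA's shared latent basis)."""
        if self.policies:
            return np.mean(list(self.policies.values()), axis=0)
        return np.zeros(self.d)             # no prior knowledge: tabula rasa

    def update_knowledge(self, task_id, policy):
        """Steps 3-4: store the new policy; a real system such as PG-ELLA
        would also refine the shared knowledge used by all tasks."""
        self.policies[task_id] = policy

def lifelong_training(learner, task_stream, rl_step, n_iters=50):
    for task in task_stream:                # step 1: tasks arrive consecutively
        policy = learner.transfer(task.id)  # step 2: transfer prior knowledge
        for _ in range(n_iters):
            policy = rl_step(policy, task)  # e.g., one FD policy-gradient update
        learner.update_knowledge(task.id, policy)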

PG-ELLA Objective

Issue: the objective depends on the trajectories of all tasks, so evaluating it directly grows more expensive as tasks accumulate

Solution: approximate each task's loss with a second-order expansion around its single-task solution, so only that solution and its Hessian need to be kept
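For completeness, the objective being referenced is the PG-ELLA objective of Bou Ammar et al. [ICML 2014], which factors each task's policy parameters as θ^(t) = L s^(t), a shared latent basis L times a sparse task-specific code s^(t); the notation below follows that paper rather than the slides:

$$ e_T(L) \;=\; \frac{1}{T} \sum_{t=1}^{T} \min_{s^{(t)}} \Big[ -J\big(L s^{(t)}\big) + \mu \big\| s^{(t)} \big\|_1 \Big] \;+\; \lambda \| L \|_F^2 . $$

Evaluating J(L s^(t)) requires trajectories from every task seen so far. PG-ELLA removes that dependence via a second-order Taylor expansion of −J around each task's single-task solution α^(t), replacing the first term with ‖α^(t) − L s^(t)‖²_{Γ^(t)}, where Γ^(t) is the Hessian of −J at α^(t) (the "Hessian" on the slide). Each task then contributes only α^(t) and Γ^(t), so old trajectories never need to be revisited.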

Experiments

Verification on Robots

Results for Robot Go-to-Goal Task


• Run RL on a new robot (new goal and disturbance) for a small number of iterations
• Use PG-ELLA to adjust the policy according to known solutions
• Continue training

Takeaway: PG-ELLA improves learning
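In terms of the hypothetical skeleton sketched earlier, this protocol might look as follows; `adjust_with_pg_ella` and the iteration counts are made-up placeholders, not the paper's experimental settings:

```python
# Hypothetical warm-start protocol for one new robot, reusing the earlier sketch.
policy = np.zeros(learner.d)                    # fresh policy for the new robot
for _ in range(5):                              # brief RL on the new task
    policy = rl_step(policy, new_task)
policy = adjust_with_pg_ella(policy, new_task)  # hypothetical PG-ELLA adjustment
for _ in range(45):                             # continue training
    policy = rl_step(policy, new_task)
```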

Better Results Incorporating Prior


• Initializing with the average policy of the other robots further improves the benefit

Takeaway: PG-ELLA improves learning

GRASP LABORATORY

Thank you!

Questions?

This research was supported by ONR N00014-11-1-0139, AFRL FA8750-14-1-0069, AFRL FA8750-14-1-0070, NSF IIS-1149917, NSF IIS-1319412, USDA 2014-67021-22174, and a Google Research Award.

Lifelong Learning for Disturbance Rejection on Mobile Robots


David Isele, José Marcio Luna, Eric Eaton, Gabriel V. de la Cruz, James Irwin, Brandon Kallaher, Matthew E. Taylor