Page 1

Coaching: Learning and Using Environment and Agent Models for Advice

Patrick Riley

February 1, 2005

Thesis Committee: Manuela Veloso, Chair

Tom Mitchell
Jack Mostow

Milind Tambe, University of Southern California

Pages 2-7

Coaching?

[Diagram: an agent interacts with the environment through effectors and perceptors; the coach receives global, external observations (in a specified language) as well as past observations, maintains an observation history and environment/agent models, and sends advice to the agent.]

Page 8

Thesis Question

What algorithms can be used by an automated coach agent to provide advice to one or more agents in order to improve their performance?

Pages 9-11

Outline

• Prologue
  - Robot soccer environment
  - Coaching sub-questions

• Technical sections
  - Matching opponents to models
  - Learning/using environment models

• Epilogue
  - Relation to previous work
  - Review/overview of thesis contributions
  - Future work

Page 12

Motivating Environment: Simulated Robot Soccer

• Real-time constraints

• Noisy actions

• Noisy and incomplete sensation

• Near-continuous state/action spaces

• 22 distributed player agents

Pages 13-15

Simulated Robot Soccer: Coaching

• Coach agent with global view and limited communication
  - Coach does not see agent actions or intentions

• Community-created standard advice language named CLang
  - Rule based
  - Conditions are logical combinations of world state atoms
  - Actions are recommended macro-actions like passing and positioning (a schematic sketch follows below)

• Basis for 4 years of coach competitions at RoboCup events
  - Run different coaches with same teams
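To make the rule-based form of this advice concrete, here is a minimal, hypothetical sketch of how a coach-side rule pairing a world-state condition with a recommended macro-action could be represented. This is not actual CLang syntax; the names AdviceRule, ball_in_left_corner, and the coordinate thresholds are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict

WorldState = Dict[str, float]   # e.g. {"ball_x": -40.0, "ball_y": -25.0}

@dataclass
class AdviceRule:
    """One coach rule: if condition(state) holds, recommend the macro-action."""
    condition: Callable[[WorldState], bool]
    macro_action: str             # e.g. "pass", "position"
    parameters: Dict[str, object]

def ball_in_left_corner(state: WorldState) -> bool:
    # Hypothetical condition over world-state atoms (ball position).
    return state["ball_x"] < -35.0 and state["ball_y"] < -20.0

# Hypothetical rule: when the ball is deep in our left corner, recommend a
# pass toward midfield (target given as x, y, radius).
clear_the_corner = AdviceRule(
    condition=ball_in_left_corner,
    macro_action="pass",
    parameters={"target": (0.0, 0.0, 10.0)},
)
```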

Pages 16-19

My Questions in Coaching

• What can the coach learn from observations?
  - Opponent models; learn and/or select from given set
  - Learn environment models

• How can models be used to get desired actions for agents?
  - Plan a response to predicted behavior
  - Imitate a good team
  - Solve for universal plan

• Once the coach has desired actions, how does the coach adapt advice to the agent abilities?

• What format does advice take?

Pages 20-24

How to Study Coaching?

• Isolate questions with various domains

[Diagram: the coaching sub-questions (Learn Models, Use Models, Adapt Advice, Format Advice) mapped to the study domains: Predator-Prey, RCSSMaze, a soccer sub-game, and full simulated soccer.]

Page 25

Opponent Models

Page 26

Why Opponent Models?

• Dealing with opponents is a fertile area for advice

• Adapting to the current opponent can mean better performance

Pages 27-30

Predicting Opponent Movement

M : S_W × S_O^p × A → R_O^p

  M     Opponent model
  S_W   Set of world states
  S_O   Set of opponent states
  p     Players per team
  A     Planned actions of our team
  R_O   Probability distributions over opponent states

• Use predicted opponent movement to plan team actions
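Read as code, M takes the world state, the p current opponent states, and our planned team actions, and returns one probability distribution over next opponent states per opponent player. A minimal sketch under that reading (the class and method names and the toy "all to ball" behavior are illustrative assumptions, not the thesis implementation):

```python
from typing import Dict, List, Tuple

WorldState = dict                      # e.g. {"ball": (x, y)}
OpponentState = Tuple[float, float]    # an opponent's (x, y) position
TeamActions = dict                     # e.g. planned ball movement
# A distribution over a discretized set of possible next opponent states.
Distribution = Dict[OpponentState, float]

class OpponentModel:
    """M : S_W x S_O^p x A -> R_O^p (one distribution per opponent player)."""

    def predict(self,
                world: WorldState,
                opponents: List[OpponentState],
                actions: TeamActions) -> List[Distribution]:
        raise NotImplementedError

class AllToBallModel(OpponentModel):
    """Toy model: every opponent is predicted to step toward the ball."""

    def predict(self, world, opponents, actions):
        bx, by = world["ball"]
        distributions = []
        for (x, y) in opponents:
            step = (x + 0.1 * (bx - x), y + 0.1 * (by - y))
            distributions.append({step: 1.0})   # deterministic toy prediction
        return distributions
```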

Pages 31-32

Selecting Between Opponent Models

• Online, must make quick decisions with small amounts of data

• Rather than learning a new model from scratch, the coach will select between models from a predefined set

• Model chosen affects the plan generated

[Diagram: five candidate models, Model 1 through Model 5.]

Pages 33-34

Selecting Between Opponent Models

• Maintain a probability distribution over the set of models, P[M_i]

• Use observation o = (w, s, a, e) to update with naive Bayes

  w   World state (ball location)
  s   Starting opponent states (locations)
  a   Team actions (ball movement)
  e   Ending opponent states (locations)

P[M_i | o] = P[e_1 | w, s, a, M_i] · P[e_2 | w, s, a, M_i] · · · P[e_p | w, s, a, M_i]   (what the opponent model calculates)
           × P[w, s, a] / P[o]                                                           (normalization constant)
           × P[M_i]                                                                      (prior)
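A minimal sketch of this update, assuming each candidate model exposes a per-player likelihood P[e_j | w, s, a, M_i] (the likelihood method and model bookkeeping below are illustrative, not the thesis code). Rather than computing P[w, s, a] / P[o] explicitly, the sketch simply renormalizes the posteriors after each observation, which has the same effect.

```python
from typing import List

class CandidateModel:
    """One opponent model M_i with a prior and a per-player likelihood."""

    def __init__(self, name: str, prior: float):
        self.name = name
        self.posterior = prior   # starts at the prior P[M_i]

    def likelihood(self, e_j, w, s, a) -> float:
        """P[e_j | w, s, a, M_i]; a real model uses its movement prediction."""
        raise NotImplementedError

def bayes_update(models: List[CandidateModel], w, s, a, e) -> None:
    """Update P[M_i | o] for o = (w, s, a, e), treating opponents as independent."""
    for m in models:
        for e_j in e:                       # naive Bayes: product over opponents
            m.posterior *= m.likelihood(e_j, w, s, a)
    total = sum(m.posterior for m in models)
    if total > 0:                           # renormalize (the norm. constant)
        for m in models:
            m.posterior /= total

# After updating, the coach plans against the highest-posterior model:
# best = max(models, key=lambda m: m.posterior)
```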

Pages 35-37

Can Models Be Recognized?

We presented an algorithm to select a model from a set. Does it select the correct one?

• Define a set of five models

• Define a set of teams that (mostly) act like the models

• Observe each of the five teams playing while the coach makes plans

• For each of the teams, how often is the correct model selected?

Page 38

Recognition Results

[Plot: probability of correct recognition (roughly 0.65 to 1) versus number of observations (0 to 16), one curve per model: No Movement, All to Ball, All Defensive, All Offensive, One to Ball.]

Page 39

Environment Models

Pages 40-41

Environment Model?

• Model the effects of possible agent actions on the state of the world
  - Our algorithms learn an abstract Markov Decision Process

• A coach must have some knowledge to provide advice

• An environment model can be solved to get a desired action policy for the agents

Page 42

Observations, ..., Advice

[Diagram: Observations of Past Execution → ... → Advice]

Pages 43-46

What are Observations?

t, score, play mode,
⟨x_ball, y_ball, Δx_ball, Δy_ball⟩
⟨x_1, y_1, Δx_1, Δy_1, θ^B_1, θ^N_1, view_1, ...⟩
⟨x_2, y_2, Δx_2, Δy_2, θ^B_2, θ^N_2, view_2, ...⟩
...
⟨x_22, y_22, Δx_22, Δy_22, θ^B_22, θ^N_22, view_22, ...⟩

• Only state, no actions
  - But produced by agents taking actions

• Externally visible global view

• Observation logs exist for many processes, not just soccer
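A rough dataclass rendering of one such observation record; the field names below are assumptions for illustration, not the simulator's exact log format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PlayerObservation:
    x: float                  # position
    y: float
    dx: float                 # velocity
    dy: float
    body_angle: float         # theta^B
    neck_angle: float         # theta^N
    view: str                 # view mode / width

@dataclass
class Observation:
    t: int                                    # simulation cycle
    score: Tuple[int, int]
    play_mode: str                            # e.g. "play_on"
    ball: Tuple[float, float, float, float]   # x, y, dx, dy
    players: List[PlayerObservation]          # all 22 players, globally observed
```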

Page 47

Observations, Markov Chain, ..., Advice

[Diagram: Observations of Past Execution → Abstract Markov Chain → ... → Advice]

Pages 48-51

Observations to Markov Chain

[Diagram: observed executions are split into observed state transitions (e.g. s1 → s72, s72 → s51, s1 → s12, s12 → s47); for every state, the abstract transitions out of it are collected and combined into transition probabilities (e.g. probabilities 0.4, 0.2, 0.4 out of s1).]

Pages 52-55

State Abstraction in Robot Soccer

[Figure panels illustrating the state abstraction: Goal, Ball possession, Ball grid, Player occupancy.]
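A minimal sketch of what an abstraction function B over such features might look like, reusing the Observation record sketched earlier. The grid resolution, field dimensions, and possession rule are illustrative assumptions, not the thesis abstraction.

```python
from typing import Optional, Tuple

# Abstract state: (possession, ball_grid_cell); None stands for epsilon.
AbstractState = Tuple[str, Tuple[int, int]]

FIELD_LENGTH, FIELD_WIDTH = 105.0, 68.0     # assumed field dimensions
GRID_X, GRID_Y = 6, 4                       # assumed ball-grid resolution

def our_player_closest_to_ball(obs) -> bool:
    """Assumed helper: first 11 players are ours; nearest player decides possession."""
    bx, by, _, _ = obs.ball
    dists = [((p.x - bx) ** 2 + (p.y - by) ** 2, i) for i, p in enumerate(obs.players)]
    _, nearest = min(dists)
    return nearest < 11

def abstract_state(obs) -> Optional[AbstractState]:
    """B : observation state -> abstract state (or None, i.e. epsilon)."""
    if obs.play_mode != "play_on":
        return None                         # epsilon: ignore stopped play
    bx, by, _, _ = obs.ball
    col = min(max(int((bx + FIELD_LENGTH / 2) / FIELD_LENGTH * GRID_X), 0), GRID_X - 1)
    row = min(max(int((by + FIELD_WIDTH / 2) / FIELD_WIDTH * GRID_Y), 0), GRID_Y - 1)
    possession = "ours" if our_player_closest_to_ball(obs) else "theirs"
    return (possession, (col, row))
```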

Pages 56-59

Observations to Markov Chain: Formalism

Observation data (Observe):  s′_i ∈ S′, e.g.  s′_9 → s′_3 → s′_2 → s′_7 ...,  s′_3 → s′_9 → s′_3 → s′_2 ...

State abstraction (Extract):  ⟨S, B : S′ → S ∪ {ε}⟩

Abstract state traces:  s_i ∈ S, e.g.  s_2 → s_1 → s_2 ...,  s_1 → s_2 → s_1 ...

Markov chain (Combine):  ⟨S, T_MC⟩

  S′     Set of observation states
  S      Set of abstract states
  B      Abstraction function
  T_MC   Transition function
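A compact sketch of the Observe/Extract/Combine pipeline under this formalism: apply B to each observation, break traces wherever B returns ε, and turn transition counts into the transition function T_MC. Function names and the trace-breaking convention are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, List

State = Hashable

def extract_traces(observations: Iterable, B) -> List[List[State]]:
    """Apply the abstraction B; an epsilon (None) result ends the current trace."""
    traces, current = [], []
    for obs in observations:
        s = B(obs)
        if s is None:                        # epsilon: end the current trace
            if len(current) > 1:
                traces.append(current)
            current = []
        elif not current or current[-1] != s:
            current.append(s)                # record only abstract-state changes
    if len(current) > 1:
        traces.append(current)
    return traces

def markov_chain(traces: List[List[State]]) -> Dict[State, Dict[State, float]]:
    """Combine transitions into T_MC: per-state empirical transition probabilities."""
    counts: Dict[State, Counter] = defaultdict(Counter)
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[a][b] += 1
    return {s: {t: c / sum(nbrs.values()) for t, c in nbrs.items()}
            for s, nbrs in counts.items()}
```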

Page 60

Observations, MC, MDP, ..., Advice

[Diagram: Observations of Past Execution → (State Abstraction) Abstract Markov Chain → (Instantiate Abstract Actions) Abstract MDP → ... → Advice]

Pages 61-63

Markov Chain to MDP

How to infer actions from a Markov chain?

• Solution: Introduce abstract action templates
  - Sets of primary and secondary transitions
  - Non-deterministic, but no probabilities

[Diagram: an abstract action template a, drawn as a set of primary transitions and a set of secondary transitions among abstract states (s0, s1, s2, s3, s4, s7).]

• Same action templates for different agents

Pages 64-70

Markov Chain to MDP: Example

[Worked example: at state s0 the Markov chain has outgoing transition probabilities .3, .6, and .1. Abstract action templates a0, a1, a2 are each defined by primary and secondary transitions; the templates whose transitions match at s0 (here a0 and a1) become the actions available at s0 in the resulting MDP. Each matched chain transition's probability is assigned to the matching templates, split evenly when several templates share it (e.g. .6/2), and then normalized per action, yielding splits of .5/.5 and .25/.75 over the successor states s1, s2, s3.]
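A small sketch of this probability-splitting step under the description above. Template matching is simplified to explicit successor sets, and the assignment of probabilities to successors is assumed for illustration; all names are illustrative rather than the thesis code.

```python
from typing import Dict, Hashable, Set

State = Hashable
Chain = Dict[State, Dict[State, float]]    # T_MC from the earlier sketch

def mdp_transitions(chain: Chain,
                    templates: Dict[str, Set[State]]
                    ) -> Dict[State, Dict[str, Dict[State, float]]]:
    """For each state, build per-action transition probabilities from the chain.

    templates maps an abstract action name to the set of successor states it
    covers (primary plus secondary), a simplification of the thesis templates.
    """
    mdp: Dict[State, Dict[str, Dict[State, float]]] = {}
    for s, succ_probs in chain.items():
        actions: Dict[str, Dict[State, float]] = {}
        for name, covered in templates.items():
            mass: Dict[State, float] = {}
            for t, p in succ_probs.items():
                if t in covered:
                    # Split a shared transition's probability evenly among the
                    # templates that cover it (e.g. .6 / 2).
                    sharers = sum(1 for c in templates.values() if t in c)
                    mass[t] = p / sharers
            total = sum(mass.values())
            if total > 0:                    # normalize per action
                actions[name] = {t: m / total for t, m in mass.items()}
        if actions:
            mdp[s] = actions
    return mdp

# Example mirroring the slide's numbers: the .6 transition is shared by both actions.
chain = {"s0": {"s1": 0.6, "s2": 0.3, "s3": 0.1}}
templates = {"a0": {"s1", "s2"}, "a1": {"s1", "s3"}}
print(mdp_transitions(chain, templates))
# -> a0: {s1: 0.5, s2: 0.5}, a1: {s1: 0.75, s3: 0.25} (up to float rounding)
```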

Page 71

Observations, MC, MDP, Policy, Advice

[Diagram: Observations of Past Execution → (State Abstraction) Abstract Markov Chain → (Instantiate Abstract Actions) Abstract MDP → Policy → Advice]

Pages 72-73

Adding Rewards

• We have learned an abstract transition model
  - MDP is currently reward-less

• The model cannot be solved for an action policy until rewards are added

• The same transition model can be used for many different reward signals

Pages 74-77

MDP to Advice

MDP + Reward Signal = Policy

Pages 78-82

Formalism

Markov Chain ⟨S, T_MC⟩
  + Associate Actions: Abstract Actions ⟨A, C_p, C_s⟩
  → (Abstract) MDP
  + Add rewards: Reward R
  → (Abstract) MDP with reward ⟨S, A, T_MDP, R⟩

  S         Set of abstract states
  A         Set of abstract actions
  C_p, C_s  Primary, secondary transition descriptions
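Once rewards are attached, the abstract MDP ⟨S, A, T_MDP, R⟩ can be solved for a policy with any standard method; below is a minimal value-iteration sketch over the transition structure built in the earlier sketches. The discount factor and convergence threshold are assumed values, not from the thesis, and every state is assumed to have at least one action.

```python
from typing import Dict, Hashable

State = Hashable
MDP = Dict[State, Dict[str, Dict[State, float]]]   # T_MDP[s][a][s'] = probability

def value_iteration(mdp: MDP, reward: Dict[State, float],
                    gamma: float = 0.95, eps: float = 1e-6) -> Dict[State, str]:
    """Return a greedy policy (abstract state -> abstract action)."""
    values = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            best = max(sum(p * (reward.get(t, 0.0) + gamma * values.get(t, 0.0))
                           for t, p in succ.items())
                       for succ in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < eps:
            break
    return {s: max(actions,
                   key=lambda a: sum(p * (reward.get(t, 0.0) + gamma * values.get(t, 0.0))
                                     for t, p in actions[a].items()))
            for s, actions in mdp.items()}
```

Different reward signals R can be swapped in against the same learned T_MDP, which is the point made on the Adding Rewards slide.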

Pages 83-85

Empirical: Flawed Opponent

Can our learning algorithm exploit an opponent's strategy?

• Test against a team with a flaw that we program
  - A known set of states and actions will have high value

• The opponent team (on right) will not go into the corridor below

Pages 86-88

Empirical: Flawed Opponent Results

Training
  Team   Score Difference    Mean Ball X           % Attacking
  SM     12.2 [11.3, 13.2]   19.0 [18.93, 19.11]   43%
  EKA    7.3 [6.5, 8.1]      14.6 [14.47, 14.65]   35%
  CM4    0.7 [0.4, 1.0]      1.1 [1.04, 1.16]      24%

Testing
  CM4    3.1 [2.5, 3.7]      9.5 [9.46, 9.64]      35%

Pages 89-90

Empirical: Flawed Opponent Results

• Each dot represents a location of the ball when our team owned the ball

[Figure: ball-location scatter plots for the Training and Testing runs.]

Pages 91-92

Soccer is Complicated!

• Team of advice receivers

• Team of opponents

• Infrequent, hard-to-achieve reward
  - Unclear evaluation metrics

• Unknown optimal policy

Pages 93-95

Introducing RCSSMaze

• Continuous state/action spaces, partial observability

• Single executing agent receiving advice
  - "Wall" agents execute fixed movement behaviors

• We approximately know the optimal policy

[Figure: the maze, with the agent's Start position marked.]

Pages 96-97

RCSSMaze Training

Can our algorithm learn a model for effective advice?

• Training data (240 minutes)
  - Agent randomly picks one of the given points
  - Heads directly to the point until it is reached or the agent is reset to start
  - 5% of the time, heads in a random direction

[Figure: the maze, with the Start position marked.]

Page 98

RCSSMaze Rewards

• We can put reward wherever we want

[Figure: the maze with three reward locations marked: Reward 0, Reward 1, Reward 2.]

Pages 99-100

RCSSMaze Results

• A trial begins when the agent is at the start state

• A trial ends when
  - A positive reward is received
  - The agent is reset to the start state

• A successful trial is one that receives positive reward

  Reward   % in Training   % with MDP
  0        < 1%            64%
  1        1%              60%
  2        7%              93%

Pages 101-103

MDP Learning and Other Domains

• We used the MDP for advice, but environment models are useful in other contexts

• Algorithm inputs
  - External observations (do not need to see inside agents' heads)
  - Abstract state space
  - Abstract action templates

• Apply any reward function

Page 104

Summary and Previous and Future Work

Pages 105-106

Coaching and Previous Work

Intelligent Tutoring Systems

• Systems to instruct human students

• Generally used with a complete and correct expert model

• Focused on humans

Agents Taking Advice

• Lots of Reinforcement Learning [e.g. Maclin and Shavlik, 1996]

• How to operationalize advice? [e.g. Mostow, 1981]

• Use some similar techniques to incorporate advice, but the real concern is giving advice

Pages 107-108

Coaching and Previous Work

Abstract/Factored Markov Decision Processes

• Efficient reasoning by learning/using abstractions [e.g. Dearden and Boutilier, 1997, Uther and Veloso, 2002]

• Factored representations [Dean and Kanazawa, 1989] and their applications [e.g. Guestrin et al., 2001]

Coaching in Robot Soccer

• This thesis grew with and helped define this field

• Early coaching work dealt with formations [Takahashi, 2000]

• ISAAC [Raines et al., 2000]

• Opponent modeling [Steffens, 2002, Kuhlmann et al., 2004]

Pages 109-114

Big Picture Summary

[Diagram: observations from the environment (current observations and past observation logs) feed the coach's components: opponent model selection, planning a response, an expert's coding of models, learning environment models, and a policy solver that produces policies; advice formatting and adaptation to the advice receivers (via receiver models) turn these into advice to the agents.]

Page 115

Contributions

• Several opponent model representations, with learning and advice generation algorithms (in robot soccer)

• Algorithms for learning an abstract MDP from observations, given a state abstraction and abstract action templates

• Study of adapting advice in a predator-prey environment, considering agent limitations and communication bandwidth

• Multi-Agent Simple Temporal Networks: a novel multi-agent plan representation and accompanying execution algorithm

• Largest empirical study of coaching in simulated robot soccer (5000 games / 2500 hours)

Page 116

Future Work: Abstract MDP Learning

• Recursive Learning and Verification of Abstract Markov Decision Processes

• Learning Hierarchical Semi-Markov Decision Processes from External Observation

• Refining State Abstractions for Markov Decision Process Learning

Page 117

Future Work: Adapting to Advice Receivers

• Learning About Agents While Giving Advice

• Talking Back: How Advice Receivers Can Help Their Coaches

• What I See and What I Don't: What a Coach Needs to Know About Partial Observability

Page 118

Questions?

Page 119

Why is the Coach a Separate Agent?

• Some of the reasoning described could be done by a single executing agent

• Advice language provides abstraction to work across agents

• Agent systems will be more distributed

Page 120

Why Coaching?

Disclaimer: This isn't a philosophy talk

Coach/agent separation is a forced distribution

• Why would/should one make their agent system like this?

• Agent systems will be more distributed; how will agents interact?

• Knowledge transfer will not always be easy

Page 121

Coaching Problem Properties

• Team goals

• External, observing coach

• Advice, not control

• Access to past behavior logs

• Advice at execution, not training

Page 122

Coaching Problem Dimensions

• Online vs. offline learning

• One-time vs. occasional vs. continual advice

• Advice as actions vs. macro-actions vs. plans

Page 123

Coaching General Lessons

• The coach and advice receivers are a tightly coupled system

• Coach learning will require iteration to achieve the best performance

• A tradeoff exists between how much of the state space the advice covers and how good the advice is

• Different observability by the coach and agents can be ignored somewhat, but will need to be considered at times

• Analyzing the past behavior of an agent is most useful when the future will look similar to the past

Page 124

Empirical: Circle Passing

• By using a domain smaller than the whole soccer game, we can better isolate effects

• Setup
  - Give the players a fixed action strategy
  - Because of noise, the coach will see other possible action results

• Coach learns a model, then gives advice

• Different rewards lead to different agent behaviors

Page 125

Circle Passing: Setup

• Six players trying to pass in a circle

• Not all passes are successful

• Some kicks result in passes to other players or a dribble

Page 126

Circle Passing: Reward

• Can apply any reward function

• We'll describe one (more in the thesis)

• In the middle (miskicks from several players go here)

Page 127

Circle Passing: Results

• We consider a trial a success if:
  - From a random starting position
  - Reward is received within 200 cycles (20 seconds)

  Success % During Training   40%
  Success % With Advice       88%

Page 128

RCSSMaze: Recursive Learning

  Training      # Rew. (Training)        % Success (Testing)
  Data          R0      R1      R2       R0      R1      R2
  Original      11      115     1055     64%     60%     93%
  From R0       676     0       0        82%     n/a     n/a
  From R1       1       2909    0        0%      67%     n/a
  From R2       0       0       9088     n/a     n/a     78%

Pages 129-130

References

T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 76(1-2):3-74, 1989.

Richard Dearden and Craig Boutilier. Abstraction and approximate decision theoretic planning. Artificial Intelligence, 89(1):219-283, 1997.

Diana Gordon and Devika Subramanian. A multi-strategy learning scheme for knowledge assimilation in embedded agents. Informatica, 17, 1993.

Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored MDPs. In Advances in Neural Information Processing Systems 14, 2001.

Gregory Kuhlmann, Peter Stone, and Justin Lallinger. The champion UT Austin Villa 2003 simulator online coach team. In Daniel Polani, Brett Browning, Andrea Bonarini, and Kazuo Yoshida, editors, RoboCup-2003: Robot Soccer World Cup VII. Springer Verlag, Berlin, 2004. To appear.

Richard Maclin and Jude W. Shavlik. Creating advice-taking reinforcement learners. Machine Learning, 22:251-282, 1996.

Jack Mostow. Mechanical Transformation of Task Heuristics into Operational Procedures. PhD thesis, Carnegie Mellon University, 1981.

Taylor Raines, Milind Tambe, and Stacy Marsella. Automated assistant to aid humans in understanding team behaviors. In Proceedings of the Fourth International Conference on Autonomous Agents (Agents-2000), 2000.


Timo Steffens. Feature-based declarative opponent-modelling in multi-agent systems. Master's thesis, Institute of Cognitive Science, Osnabrück, 2002. URL citeseer.nj.nec.com/steffens02featurebased.html.

Tomoichi Takahashi. Kasugabito III. In Veloso, Pagello, and Kitano, editors, RoboCup-99: Robot Soccer World Cup III, number 1856 in Lecture Notes in Artificial Intelligence, pages 592-595. Springer-Verlag, Berlin, 2000.

William Uther and Manuela Veloso. TTree: Tree-based state generalization with temporally abstract actions. In Proceedings of SARA-2002, Edmonton, Canada, August 2002.
