Design Principles for Creating Human-Shapable Agents
W. Bradley Knox, Ian Fasel, and Peter Stone
The University of Texas at Austin, Department of Computer Sciences
Transcript
Page 1:

Design Principles for Creating Human-Shapable Agents

W. Bradley Knox, Ian Fasel, and Peter Stone

The University of Texas at Austin, Department of Computer Sciences

Page 2:

Transferring human knowledge through natural forms of communication

Potential benefits over purely autonomous learners:

• Decrease sample complexity
• Learn in the absence of a reward function
• Allow lay users to teach agents the policies that they prefer (no programming!)
• Learn in more complex domains

Page 3:

Shaping

Def. - creating a desired behavior by reinforcing successive approximations of the behavior

LOOK magazine, 1952

Page 4:

The Shaping Scenario (in this context)

A human trainer observes an agent and manually delivers reinforcement (a scalar value), signaling approval or disapproval.

E.g., training a dog with treats as in the previous picture

Page 5:

The Shaping Problem (for computational agents)

Within a sequential decision making task, how can an agent harness state descriptions and occasional scalar human reinforcement signals to learn a good task policy?

Page 6:

Previous work on human-shapable agents

• Clicker training for entertainment agents (Blumberg et al., 2002; Kaplan et al., 2002)

• Sophie’s World (Thomaz & Breazeal, 2006)
  – RL with reward = environmental (MDP) reward + human reinforcement

• Social software agent Cobot in LambdaMOO (Isbell et al., 2006)
  – RL with reward = human reinforcement

Page 7:

MDP reward vs. human reinforcement

• MDP reward (within reinforcement learning):
  – Key problem: credit assignment from sparse rewards

• Reinforcement from a human trainer:
  – Trainer has long-term impact in mind
  – Reinforcement is within a small temporal window of the targeted behavior
  – Credit assignment problem is largely removed

Page 8:

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)

• TAMER approach:
  – Learn a model of human reinforcement
  – Directly exploit the model to determine the policy

• If greedy: take the action a = argmax_a Ĥ(s, a), where Ĥ is the learned model of human reinforcement (sketched below)
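That greedy policy can be written in a few lines. This is a minimal sketch, not code from the talk; `features` (the state-action feature extractor) and `weights` (the learned linear model of human reinforcement) are assumed placeholders.

```python
import numpy as np

def greedy_action(state, actions, weights, features):
    """Choose the action with the highest predicted human reinforcement H_hat(s, a).

    `features(state, action)` returns a feature vector and `weights` is the
    learned linear model; both are hypothetical stand-ins here.
    """
    scores = [float(np.dot(weights, features(state, a))) for a in actions]
    return actions[int(np.argmax(scores))]
```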

Page 9:

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)

Learning from targeted human reinforcement is a supervised learning problem, not a reinforcement learning problem.

Page 10:

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)

Page 11:

The Shaped Agent’s Perspective

• Each time step, the agent (loop sketched below):
  – receives a state description
  – might receive a scalar human reinforcement signal
  – chooses an action
  – does not receive an environmental reward signal (if learning purely from shaping)
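A minimal sketch of this per-time-step loop, under assumed interfaces that do not come from the slides: `env` yields state descriptions, `get_human_reinforcement()` returns a scalar or `None` when the trainer is silent, and `agent` exposes `choose_action` and `update`.

```python
def run_shaping_episode(env, agent, get_human_reinforcement):
    """Pure-shaping loop: the environmental reward is never used.

    For simplicity, any reinforcement observed on a step is credited to the
    previous step's (state, action) pair (no finer temporal credit assignment).
    """
    state, done = env.reset(), False
    prev = None                                # last (state, action) pair
    while not done:
        reinf = get_human_reinforcement()      # scalar, or None if the trainer was silent
        if reinf is not None and prev is not None:
            agent.update(*prev, reinf)         # supervised update of the learned model
        action = agent.choose_action(state)    # greedy w.r.t. predicted reinforcement
        prev = (state, action)
        state, done = env.step(action)         # environmental reward is ignored
    return agent
```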

Page 12:

Tetris

• Drop blocks to make solid horizontal lines, which then disappear

• |state space| > 2^250

• Challenging but slow

• 21 features extracted from (s, a)
• TAMER model (sketched below):
  – Linear model over features
  – Gradient descent updates
• Greedy action selection
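A sketch of such a linear model with incremental gradient descent updates. The class name, feature handling, and step size are illustrative assumptions, not the exact implementation or values used in the experiments.

```python
import numpy as np

class LinearReinforcementModel:
    """Linear model of human reinforcement over (state, action) features,
    trained by stochastic gradient descent on squared prediction error."""

    def __init__(self, num_features, step_size=0.02):  # step size is a guess
        self.w = np.zeros(num_features)
        self.step_size = step_size

    def predict(self, feats):
        # Predicted human reinforcement H_hat(s, a) for one feature vector.
        return float(np.dot(self.w, feats))

    def update(self, feats, human_reinf):
        # Move the weights to reduce (H_hat(s, a) - h)^2 for this sample.
        feats = np.asarray(feats, dtype=float)
        error = human_reinf - self.predict(feats)
        self.w += self.step_size * error * feats
```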

Page 13:

TAMER in action: Tetris

[Videos: before training, during training, and after training]

Page 14:

TAMER Results: Tetris (9 subjects)

Page 15:

TAMER Results: Tetris (9 subjects)

Page 16:

TAMER Results: Mountain Car (19 subjects)

Page 17:

Conjectures on how to create an agent that can be interactively shaped by a human trainer

1. For many tasks, greedily exploiting the human trainer’s reinforcement function yields a good policy.

2. Modeling a human trainer’s reinforcement is a supervised learning problem (not RL).

3. Exploration can be driven by negative reinforcement alone.

4. Credit assignment to a dense state-action history should …

5. A human trainer’s reinforcement function is not static.

6. Human reinforcement is a function of states and actions.

7. In an MDP, human reinforcement should be treated differently from environmental reward.

8. Human trainers reinforce predicted action as well as recent action.

Page 18:

the end.

Page 19:

Mountain Car

• Drive back and forth, gaining enough momentum to get to the goal on top of the hill

• Continuous state space
  – Velocity and position

• Simple but rapid actions

• Feature extraction (sketched below):
  – 2D Gaussian RBFs over velocity and position of the car
  – One “grid” of RBFs per action

• TAMER model:
  – Linear model over RBF features
  – Gradient descent updates
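A sketch of that feature extraction. The grid of RBF centers and the Gaussian width are assumptions (the slide does not give them), and position and velocity are assumed to be normalized to comparable scales.

```python
import numpy as np

def rbf_features(position, velocity, action, actions, centers, width=0.15):
    """Per-action RBF features: one grid of 2D Gaussian RBFs per action.

    `centers` is a list of (position, velocity) grid points and `width` is an
    assumed Gaussian width. Only the block belonging to the chosen action is
    filled in; the blocks for the other actions stay zero, so a single linear
    model can score every action."""
    activations = np.array([
        np.exp(-((position - p) ** 2 + (velocity - v) ** 2) / (2.0 * width ** 2))
        for p, v in centers
    ])
    feats = np.zeros(len(actions) * len(centers))
    i = actions.index(action)
    feats[i * len(centers):(i + 1) * len(centers)] = activations
    return feats
```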

Page 20:

TAMER in action: Mountain Car

[Videos: before training, during training, and after training]

Page 21:

TAMER Results: Mountain Car (19 subjects)

Page 22:

TAMER Results: Mountain Car (19 subjects)

Page 23:

HOW TO: Convert a basic TD-Learning agent into a TAMER agent (without temporal credit assignment)

1. The underlying function approximator must be a Q-function (for state-action values).
2. Set the discount factor (gamma) to 0.
3. Make action selection fully greedy.
4. Human reinforcement replaces the environmental reward.
5. If no human input is received, make no update.
6. Remove any eligibility traces (can just change the parameter lambda to 0).
7. Maybe lower alpha to 0.01 or less.

A sketch of the resulting update appears below.
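A minimal sketch of what the converted update looks like for a linear Q-function under the settings above; the names and the exact step size are assumptions, not code from the talk.

```python
import numpy as np

GAMMA = 0.0    # step 2: no discounting, so there is no bootstrapping term
ALPHA = 0.01   # step 7: small learning rate
# Steps 3 and 6 (fully greedy selection, lambda = 0) affect action choice and
# eligibility traces elsewhere in the agent; with lambda = 0 only the current
# step's features are updated.

def tamer_q_update(weights, feats, human_reinf):
    """Steps 4 and 5: human reinforcement replaces the environmental reward,
    and silence (human_reinf is None) produces no update. With GAMMA = 0 the
    TD target is just the human signal, so Q becomes a model of H_hat(s, a)."""
    if human_reinf is None:
        return weights
    feats = np.asarray(feats, dtype=float)
    target = human_reinf + GAMMA * 0.0           # next-state value is irrelevant
    error = target - float(np.dot(weights, feats))
    return weights + ALPHA * error * feats
```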

Page 24:

HOW TO: Convert a TD-Learning agent into a TAMER agent (cont.)

With credit assignment (more frequent time steps):

1. Save (features, human reinforcement) for each time step in a window from 0.2 seconds before to about 0.8 seconds.
2. Define a probability distribution function over the window (a uniform distribution is probably fine).
3. The credit for each state-action pair is the integral of the pdf from the time of the next most recent time step to the time step for that pair.

• For the update, both the reward prediction (in place of state-action-value prediction) used to calculate the error and the calculation of the gradient for any one weight use the weighted sum, for each action, of the features in the window (the weights are the “credit” calculated in the last step).
• Time measurements used for credit assignment should be in real time, not simulation time.

A sketch of this credit-assignment scheme appears below.
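A sketch of the credit computation under a uniform distribution. The window bounds follow the slide's figures; the function name, data layout, and everything else are assumptions. It returns the credit-weighted sum of feature vectors that the update would use in place of a single step's features.

```python
import numpy as np

def credited_feature_sum(window, reinf_time, near=0.2, far=0.8):
    """Uniform-pdf credit over recent time steps, all in real (wall-clock) time.

    `window` is a list of (timestamp, feature_vector) pairs, newest first.
    Each step's credit is the portion of the uniform pdf on
    [reinf_time - far, reinf_time - near] lying between that step and the next
    more recent step; the result is the credit-weighted sum of features."""
    lo, hi = reinf_time - far, reinf_time - near
    total = np.zeros_like(np.asarray(window[0][1], dtype=float))
    upper_edge = hi                               # integrate from the newest edge back
    for t, feats in window:                       # newest -> oldest
        lower_edge = min(max(t, lo), upper_edge)  # clip this step's time to the window
        credit = (upper_edge - lower_edge) / (hi - lo)
        total += credit * np.asarray(feats, dtype=float)
        upper_edge = lower_edge
        if upper_edge <= lo:                      # no probability mass remains
            break
    return total
```

In the update, this credit-weighted sum stands in for a single step's feature vector both when computing the prediction error and when computing each weight's gradient, matching the recipe above.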

