Design Principles for Creating Human-Shapable Agents
W. Bradley Knox, Ian Fasel, and Peter Stone
The University of Texas at Austin, Department of Computer Sciences
Transcript
Page 1:

Design Principles for Creating Human-Shapable Agents

W. Bradley Knox, Ian Fasel, and Peter Stone

The University of Texas at Austin, Department of Computer Sciences

Page 2:

Transferring human knowledge through natural forms of communication

Potential benefits over purely autonomous learners:

• Decrease sample complexity
• Learn in the absence of a reward function
• Allow lay users to teach agents the policies that they prefer (no programming!)
• Learn in more complex domains

Page 3:

Shaping

Def. - creating a desired behavior by reinforcing successive approximations of the behavior

LOOK magazine, 1952

Page 4:

The Shaping Scenario (in this context)

A human trainer observes an agent and manually delivers reinforcement (a scalar value), signaling approval or disapproval.

E.g., training a dog with treats as in the previous picture

Page 5:

The Shaping Problem (for computational agents)

Within a sequential decision making task, how can an agent harness state descriptions and occasional scalar human reinforcement signals to learn a good task policy?

Page 6:

Previous work on human-shapable agents

• Clicker training for entertainment agents (Blumberg et al., 2002; Kaplan et al., 2002)

• Sophie’s World (Thomaz & Breazeal, 2006)
  – RL with reward = environmental (MDP) reward + human reinforcement

• Social software agent Cobot in LambdaMOO (Isbell et al., 2006)
  – RL with reward = human reinforcement

Page 7:

MDP reward vs. human reinforcement

• MDP reward (within reinforcement learning):
  – Key problem: credit assignment from sparse rewards

• Reinforcement from a human trainer:
  – Trainer has long-term impact in mind
  – Reinforcement is within a small temporal window of the targeted behavior
  – Credit assignment problem is largely removed

Page 8:

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)

• TAMER approach:
  – Learn a model of human reinforcement
  – Directly exploit the model to determine the policy

• If greedy: take the action a = argmax_a Ĥ(s, a), where Ĥ is the learned model of human reinforcement (sketched below)
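That greedy policy can be written in a few lines. This is a minimal sketch, not code from the talk; `features` (the state-action feature extractor) and `weights` (the learned linear model of human reinforcement) are assumed placeholders.

```python
import numpy as np

def greedy_action(state, actions, weights, features):
    """Choose the action with the highest predicted human reinforcement H_hat(s, a).

    `features(state, action)` returns a feature vector and `weights` is the
    learned linear model; both are hypothetical stand-ins here.
    """
    scores = [float(np.dot(weights, features(state, a))) for a in actions]
    return actions[int(np.argmax(scores))]
```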

Page 9:

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)

Learning from targeted human reinforcement is a supervised learning problem, not a reinforcement learning problem.

Page 10:

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)

Page 11:

The Shaped Agent’s Perspective

• Each time step, the agent (loop sketched below):
  – receives a state description
  – might receive a scalar human reinforcement signal
  – chooses an action
  – does not receive an environmental reward signal (if learning purely from shaping)
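A minimal sketch of this per-time-step loop, under assumed interfaces that do not come from the slides: `env` yields state descriptions, `get_human_reinforcement()` returns a scalar or `None` when the trainer is silent, and `agent` exposes `choose_action` and `update`.

```python
def run_shaping_episode(env, agent, get_human_reinforcement):
    """Pure-shaping loop: the environmental reward is never used.

    For simplicity, any reinforcement observed on a step is credited to the
    previous step's (state, action) pair (no finer temporal credit assignment).
    """
    state, done = env.reset(), False
    prev = None                                # last (state, action) pair
    while not done:
        reinf = get_human_reinforcement()      # scalar, or None if the trainer was silent
        if reinf is not None and prev is not None:
            agent.update(*prev, reinf)         # supervised update of the learned model
        action = agent.choose_action(state)    # greedy w.r.t. predicted reinforcement
        prev = (state, action)
        state, done = env.step(action)         # environmental reward is ignored
    return agent
```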

Page 12:

Tetris

• Drop blocks to make solid horizontal lines, which then disappear

• |state space| > 2^250

• Challenging but slow

• 21 features extracted from (s, a)
• TAMER model (sketched below):
  – Linear model over features
  – Gradient descent updates
• Greedy action selection
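A sketch of such a linear model with incremental gradient descent updates. The class name, feature handling, and step size are illustrative assumptions, not the exact implementation or values used in the experiments.

```python
import numpy as np

class LinearReinforcementModel:
    """Linear model of human reinforcement over (state, action) features,
    trained by stochastic gradient descent on squared prediction error."""

    def __init__(self, num_features, step_size=0.02):  # step size is a guess
        self.w = np.zeros(num_features)
        self.step_size = step_size

    def predict(self, feats):
        # Predicted human reinforcement H_hat(s, a) for one feature vector.
        return float(np.dot(self.w, feats))

    def update(self, feats, human_reinf):
        # Move the weights to reduce (H_hat(s, a) - h)^2 for this sample.
        feats = np.asarray(feats, dtype=float)
        error = human_reinf - self.predict(feats)
        self.w += self.step_size * error * feats
```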

Page 13:

TAMER in action: Tetris

[Videos: before training, during training, and after training]

Page 14:

TAMER Results: Tetris (9 subjects)

Page 15:

TAMER Results: Tetris (9 subjects)

Page 16:

TAMER Results: Mountain Car (19 subjects)

Page 17:

Conjectures on how to create an agent that can be interactively shaped by a human trainer

1. For many tasks, greedily exploiting the human trainer’s reinforcement function yields a good policy.

2. Modeling a human trainer’s reinforcement is a supervised learning problem (not RL).

3. Exploration can be driven by negative reinforcement alone.

4. Credit assignment to a dense state-action history should …

5. A human trainer’s reinforcement function is not static.

6. Human reinforcement is a function of states and actions.

7. In an MDP, human reinforcement should be treated differently from environmental reward.

8. Human trainers reinforce predicted action as well as recent action.

Page 18:

the end.

Page 19:

Mountain Car

• Drive back and forth, gaining enough momentum to get to the goal on top of the hill

• Continuous state space
  – Velocity and position

• Simple but rapid actions

• Feature extraction (sketched below):
  – 2D Gaussian RBFs over velocity and position of the car
  – One “grid” of RBFs per action

• TAMER model:
  – Linear model over RBF features
  – Gradient descent updates
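A sketch of that feature extraction. The grid of RBF centers and the Gaussian width are assumptions (the slide does not give them), and position and velocity are assumed to be normalized to comparable scales.

```python
import numpy as np

def rbf_features(position, velocity, action, actions, centers, width=0.15):
    """Per-action RBF features: one grid of 2D Gaussian RBFs per action.

    `centers` is a list of (position, velocity) grid points and `width` is an
    assumed Gaussian width. Only the block belonging to the chosen action is
    filled in; the blocks for the other actions stay zero, so a single linear
    model can score every action."""
    activations = np.array([
        np.exp(-((position - p) ** 2 + (velocity - v) ** 2) / (2.0 * width ** 2))
        for p, v in centers
    ])
    feats = np.zeros(len(actions) * len(centers))
    i = actions.index(action)
    feats[i * len(centers):(i + 1) * len(centers)] = activations
    return feats
```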

Page 20:

TAMER in action: Mountain Car

[Videos: before training, during training, and after training]

Page 21:

TAMER Results: Mountain Car (19 subjects)

Page 22:

TAMER Results: Mountain Car (19 subjects)

Page 23:

HOW TO: Convert a basic TD-Learning agent into a TAMER agent (without temporal credit assignment)

1. The underlying function approximator must be a Q-function (for state-action values).
2. Set the discount factor (gamma) to 0.
3. Make action selection fully greedy.
4. Human reinforcement replaces the environmental reward.
5. If no human input is received, make no update.
6. Remove any eligibility traces (can just change the parameter lambda to 0).
7. Maybe lower alpha to 0.01 or less.

A sketch of the resulting update appears below.
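A minimal sketch of what the converted update looks like for a linear Q-function under the settings above; the names and the exact step size are assumptions, not code from the talk.

```python
import numpy as np

GAMMA = 0.0    # step 2: no discounting, so there is no bootstrapping term
ALPHA = 0.01   # step 7: small learning rate
# Steps 3 and 6 (fully greedy selection, lambda = 0) affect action choice and
# eligibility traces elsewhere in the agent; with lambda = 0 only the current
# step's features are updated.

def tamer_q_update(weights, feats, human_reinf):
    """Steps 4 and 5: human reinforcement replaces the environmental reward,
    and silence (human_reinf is None) produces no update. With GAMMA = 0 the
    TD target is just the human signal, so Q becomes a model of H_hat(s, a)."""
    if human_reinf is None:
        return weights
    feats = np.asarray(feats, dtype=float)
    target = human_reinf + GAMMA * 0.0           # next-state value is irrelevant
    error = target - float(np.dot(weights, feats))
    return weights + ALPHA * error * feats
```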

Page 24:

HOW TO: Convert a TD-Learning agent into a TAMER agent (cont.)

With credit assignment (more frequent time steps):

1. Save (features, human reinforcement) for each time step in a window from 0.2 seconds before to about 0.8 seconds.
2. Define a probability distribution function over the window (a uniform distribution is probably fine).
3. The credit for each state-action pair is the integral of the pdf from the time of the next most recent time step to the time step for that pair.

• For the update, both the reward prediction (in place of state-action-value prediction) used to calculate the error and the calculation of the gradient for any one weight use the weighted sum, for each action, of the features in the window (the weights are the “credit” calculated in the last step).
• Time measurements used for credit assignment should be in real time, not simulation time.

A sketch of this credit-assignment scheme appears below.
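A sketch of the credit computation under a uniform distribution. The window bounds follow the slide's figures; the function name, data layout, and everything else are assumptions. It returns the credit-weighted sum of feature vectors that the update would use in place of a single step's features.

```python
import numpy as np

def credited_feature_sum(window, reinf_time, near=0.2, far=0.8):
    """Uniform-pdf credit over recent time steps, all in real (wall-clock) time.

    `window` is a list of (timestamp, feature_vector) pairs, newest first.
    Each step's credit is the portion of the uniform pdf on
    [reinf_time - far, reinf_time - near] lying between that step and the next
    more recent step; the result is the credit-weighted sum of features."""
    lo, hi = reinf_time - far, reinf_time - near
    total = np.zeros_like(np.asarray(window[0][1], dtype=float))
    upper_edge = hi                               # integrate from the newest edge back
    for t, feats in window:                       # newest -> oldest
        lower_edge = min(max(t, lo), upper_edge)  # clip this step's time to the window
        credit = (upper_edge - lower_edge) / (hi - lo)
        total += credit * np.asarray(feats, dtype=float)
        upper_edge = lower_edge
        if upper_edge <= lo:                      # no probability mass remains
            break
    return total
```

In the update, this credit-weighted sum stands in for a single step's feature vector both when computing the prediction error and when computing each weight's gradient, matching the recipe above.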

