
NW Computational Intelligence Laboratory

Implementing DHP in Software: Taking Control of the Pole-Cart System

Lars Holmstrom

Overview

• Provides a brief overview of Dual Heuristic Programming (DHP)

• Describes a software implementation of DHP for designing a non-linear controller for the pole-cart system

• Follows the methodology outlined in:

– Lendaris, G.G. & J.S. Neidhoefer, "Guidance in the Use of Adaptive Critics for Control," Ch. 4 in Handbook of Learning and Approximate Dynamic Programming, Si et al., Eds., IEEE Press & Wiley Interscience, pp. 97-124, 2004.

DHP Foundations

• Reinforcement Learning

– A process in which an agent learns behaviors through trial-and-error interactions with its environment, based on "reinforcement" signals acquired over time

– In contrast to Supervised Learning, where an error signal based on the desired outcome of an action is known, reinforcement signals only indicate that one action is "better" or "worse" than another, not which action is "best"

DHP Foundations (continued)

• Dynamic Programming

– Provides a mathematical formalism for finding optimal solutions to control problems within a Markovian decision process

– "Cost to Go" Function

– Bellman's Recursion
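
The equations for the cost-to-go function and Bellman's recursion appeared as images in the original slides. A standard statement, assuming a discount factor γ and instantaneous utility (cost) U(t), is:

\[ J(t) = \sum_{k=0}^{\infty} \gamma^{k}\, U(t+k) \]

\[ J(t) = U(t) + \gamma\, J(t+1) \quad \text{(Bellman's recursion)} \]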

DHP Foundations (continued)

• Adaptive Critics

– An application of Reinforcement Learning for solving Dynamic Programming problems

– The Critic is charged with the task of estimating J for a particular control policy π

– The Critic's knowledge about J, in turn, allows us to improve the control policy π

– This process is iterated until the optimal J surface, J*, is found along with the associated optimal control policy π*

DHP Architecture

Weight Update Calculation for the Action Network
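
The update equations on this slide were images. In the standard DHP formulation, the action network's weights w_a are adjusted to descend the estimated cost-to-go; with λ(t+1) denoting the critic's estimate of ∂J(t+1)/∂x(t+1) and η a learning rate (notation assumed here, not taken from the slide), the update has the form:

\[ \frac{\partial J(t)}{\partial u(t)} = \frac{\partial U(t)}{\partial u(t)} + \gamma\, \lambda(t+1)^{\top} \frac{\partial x(t+1)}{\partial u(t)}, \qquad \Delta w_a = -\eta\, \frac{\partial J(t)}{\partial u(t)}\, \frac{\partial u(t)}{\partial w_a} \]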

Calculating the Critic Targets
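
This equation was also an image in the deck. The usual DHP critic target λ°(t), toward which the critic is trained (component form; notation assumed), accounts for the dependence of U(t) and x(t+1) on the state both directly and through the control:

\[ \lambda^{\circ}_{j}(t) = \frac{\partial U(t)}{\partial x_{j}(t)} + \sum_{k} \frac{\partial U(t)}{\partial u_{k}(t)} \frac{\partial u_{k}(t)}{\partial x_{j}(t)} + \gamma \sum_{i} \lambda_{i}(t+1) \left[ \frac{\partial x_{i}(t+1)}{\partial x_{j}(t)} + \sum_{k} \frac{\partial x_{i}(t+1)}{\partial u_{k}(t)} \frac{\partial u_{k}(t)}{\partial x_{j}(t)} \right] \]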

The Pole Cart Problem

• The dynamical system (plant) consists of a cart on a length of track with an inverted pendulum attached to it.

• The control problem is to balance the inverted pendulum while keeping the cart near the center of the track by applying a horizontal force to the cart.

• Pole Cart Animation

Simulating the Plant
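
The simulation details were shown as code and equations in the slides. The plant is the familiar cart-pole system; one common frictionless formulation of its dynamics (the deck's exact model and constants may differ) uses applied force F, cart mass m_c, pole mass m, pole half-length l, and gravity g:

\[ \ddot{\theta} = \frac{g \sin\theta + \cos\theta \, \dfrac{-F - m l \dot{\theta}^{2} \sin\theta}{m_c + m}}{l \left( \dfrac{4}{3} - \dfrac{m \cos^{2}\theta}{m_c + m} \right)}, \qquad \ddot{x} = \frac{F + m l \left( \dot{\theta}^{2} \sin\theta - \ddot{\theta} \cos\theta \right)}{m_c + m} \]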

Calculating the Instantaneous Derivative
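
A minimal sketch of this step in Python, implementing the equations of motion above (the state ordering, constant values, and function name are illustrative assumptions, not the toolkit's actual interface):

```python
import numpy as np

# Assumed plant constants (SI units); the deck's values may differ.
GRAVITY = 9.81       # m/s^2
CART_MASS = 1.0      # kg
POLE_MASS = 0.1      # kg
POLE_HALF_LEN = 0.5  # m

def state_derivative(state, force):
    """Return d/dt of the state [x, x_dot, theta, theta_dot] for the cart-pole plant."""
    x, x_dot, theta, theta_dot = state
    total_mass = CART_MASS + POLE_MASS
    sin_t, cos_t = np.sin(theta), np.cos(theta)

    temp = (-force - POLE_MASS * POLE_HALF_LEN * theta_dot**2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t + cos_t * temp) / (
        POLE_HALF_LEN * (4.0 / 3.0 - POLE_MASS * cos_t**2 / total_mass))
    x_acc = (force + POLE_MASS * POLE_HALF_LEN *
             (theta_dot**2 * sin_t - theta_acc * cos_t)) / total_mass

    return np.array([x_dot, x_acc, theta_dot, theta_acc])
```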

Iterating One Step In Time
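
Continuing the sketch above, a single step in time can be taken with simple Euler integration (the toolkit may well use a higher-order method such as Runge-Kutta; the step size is an assumption):

```python
def step(state, force, dt=0.02):
    """Advance the cart-pole state by one time step using Euler integration."""
    return state + dt * state_derivative(state, force)
```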

Iterating the Model Over a Trajectory
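
And a sketch of rolling the model out over a whole trajectory under a fixed control policy (here `policy` is a stand-in for the action network, not a toolkit name):

```python
def rollout(initial_state, policy, num_steps, dt=0.02):
    """Simulate num_steps of the plant under the given policy and return the state history."""
    states = [np.asarray(initial_state, dtype=float)]
    for _ in range(num_steps):
        force = policy(states[-1])
        states.append(step(states[-1], force, dt))
    return np.stack(states)
```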

Running the Simulation

Calculating the Model Jacobians

• Analytically

• Numerical approximation

• Backpropagation
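
For instance, the numerical-approximation option can be realized with central finite differences around the one-step model sketched earlier (step size and names are assumptions):

```python
def model_jacobian(state, force, eps=1e-5, dt=0.02):
    """Finite-difference Jacobians of the next state w.r.t. the current state and control."""
    n = len(state)
    dx_next_dx = np.zeros((n, n))
    for j in range(n):
        bump = np.zeros(n)
        bump[j] = eps
        dx_next_dx[:, j] = (step(state + bump, force, dt) -
                            step(state - bump, force, dt)) / (2 * eps)
    dx_next_du = (step(state, force + eps, dt) - step(state, force - eps, dt)) / (2 * eps)
    return dx_next_dx, dx_next_du
```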

Defining a Utility Function

• The utility function, together with the plant dynamics, determines the optimal control policy

• For this example, I will choose

• Note: there is no penalty for control effort, cart velocity, or pole angular velocity
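
The specific utility appeared as an equation image in the slide. A plausible quadratic form consistent with the note above, penalizing only cart position x and pole angle θ (the weights a and b, and the quadratic form itself, are assumptions), would be:

\[ U(t) = a\, x(t)^{2} + b\, \theta(t)^{2} \]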

Setting Up the DHP Training Loop

• For each training iteration (step in time):

– Measure the current state

– Calculate the control to apply

– Calculate the control Jacobian

– Iterate the model

– Calculate the model Jacobian

– Calculate the utility derivative

– Calculate the present lambda

– Calculate the future lambda

– Calculate the reinforcement signal for the controller

– Train the controller

– Calculate the desired target for the critic

– Train the critic
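
A hedged Python skeleton of this loop, tying together the pieces sketched above; the network objects, their forward/train/input_jacobian methods, and utility_grad are illustrative assumptions, not the actual DHP Toolkit interface:

```python
def dhp_training_step(state, action_net, critic_net, utility_grad, gamma=0.95, dt=0.02):
    """One DHP iteration for a scalar control: act, simulate, form targets, train both networks."""
    # Measure the current state and calculate the control to apply, plus its Jacobian.
    u = action_net.forward(state)                 # assumed method
    du_dx = action_net.input_jacobian(state)      # assumed: d(control)/d(state), shape (n,)

    # Iterate the model and calculate the model Jacobians.
    next_state = step(state, u, dt)
    dxn_dx, dxn_du = model_jacobian(state, u, dt=dt)

    # Utility derivatives and the future lambda estimate.
    dU_dx, dU_du = utility_grad(state, u)         # assumed: shape (n,) and scalar
    lam_next = critic_net.forward(next_state)     # lambda(t+1), shape (n,)

    # Reinforcement signal for the controller: dJ(t)/du(t).
    dJ_du = dU_du + gamma * (lam_next @ dxn_du)
    action_net.train(state, dJ_du)                # assumed gradient-descent update of w_a

    # Desired target for the critic (see the critic-target equation above); the critic's
    # present output lambda(t) is moved toward this target.
    lam_target = (dU_dx + dU_du * du_dx
                  + gamma * (lam_next @ dxn_dx + (lam_next @ dxn_du) * du_dx))
    critic_net.train(state, lam_target)           # assumed supervised update toward the target

    return next_state
```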

Defining an Experiment

• Define the neural network architecture for action and critic networks

• Define the constants to be used for the model

• Set up the lesson plan

– Define incremental steps in the learning process

• Set up a test plan

Defining an Experiment in the DHP Toolkit

Training Step 1: 2 Degrees

Training Step 2: -5 Degrees

Training Step 2: 15 Degrees

Training Step 2: -30 Degrees

Testing Step 2: 20 Degrees

Testing Step 2: 30 Degrees

Software Availability

• This software is available to anyone who would like to make use of it

• We also have software available for performing backpropagation through time (BPTT) experiments

• Set up an appointment with me or come in during my office hours to get more information about the software

