Statistical learning and optimal control: A framework for biological learning and motor control

Lecture 4: Stochastic optimal control

Reza Shadmehr

Johns Hopkins School of Medicine

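Summary: Optimal control of a linear system with quadratic cost

$$x^{(k+1)} = A x^{(k)} + C u^{(k)}$$

$$y^{(k+1)} = B x^{(k+1)}$$

Stacking the states, observations, and commands over the whole movement into $x_h$, $y_h$, and $u_h$:

$$y_h = G x_h, \qquad x_h = E x^{(0)} + F u_h$$

$$J_a = (y_h - r_h)^\top T (y_h - r_h) + u_h^\top L u_h + \lambda^\top (x_h - E x^{(0)} - F u_h)$$

$$u_h = (L + F^\top G^\top T G F)^{-1} F^\top G^\top T (r_h - G E x^{(0)})$$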

Issues with the control policy:

• What if the system is perturbed while the control policy is being executed? The commands are computed in advance from the starting state, so with the current approach there is no compensation for the perturbation.

• In reality, both the state update equation and the measurement equation are subject to noise. How do we take that into account?

• To resolve this, we need a way to figure out what command to produce, given that we find ourselves at some state x at some time k. Once we figure this out, we will consider the situation where we cannot measure x directly, but have noise to deal with. Our best estimate will be through the Kalman filter. This will link estimation with control.

Starting at state

$$x^{(0)}$$

Sequence of actions

$$u^{(0)}, u^{(1)}, \ldots, u^{(p-1)}$$

Observations

$$y^{(1)}, y^{(2)}, \ldots, y^{(p)} \quad \text{with} \quad y^{(k)} = B x^{(k)}$$

Cost to minimize

$$J = \sum_{k=0}^{p-1} u^{(k)\top} L^{(k)} u^{(k)} + y^{(k+1)\top} T^{(k+1)} y^{(k+1)}$$

Cost at the last time point

$$J^{(p)} = y^{(p)\top} T^{(p)} y^{(p)} = x^{(p)\top} B^\top T^{(p)} B x^{(p)}$$

$$W^{(p)} = B^\top T^{(p)} B$$

Note that at the last time step, cost is a quadratic function of state. Writing it in terms of the state and command at p-1:

$$J^{(p)} = (A x^{(p-1)} + C u^{(p-1)})^\top W^{(p)} (A x^{(p-1)} + C u^{(p-1)})$$

$$= x^{(p-1)\top} A^\top W^{(p)} A x^{(p-1)} + u^{(p-1)\top} C^\top W^{(p)} C u^{(p-1)} + 2 u^{(p-1)\top} C^\top W^{(p)} A x^{(p-1)}$$

Cost-to-go at the next to the last time point

$$J^{(p-1)} = y^{(p-1)\top} T^{(p-1)} y^{(p-1)} + u^{(p-1)\top} L^{(p-1)} u^{(p-1)} + J^{(p)}$$

$$= x^{(p-1)\top} B^\top T^{(p-1)} B x^{(p-1)} + u^{(p-1)\top} L^{(p-1)} u^{(p-1)} + J^{(p)}$$

$$\frac{dJ^{(p-1)}}{du^{(p-1)}} = 2 L^{(p-1)} u^{(p-1)} + 2 C^\top W^{(p)} C u^{(p-1)} + 2 C^\top W^{(p)} A x^{(p-1)} = 0$$

$$u^{(p-1)} = -(L^{(p-1)} + C^\top W^{(p)} C)^{-1} C^\top W^{(p)} A x^{(p-1)}$$

$$G^{(p-1)} = (L^{(p-1)} + C^\top W^{(p)} C)^{-1} C^\top W^{(p)} A$$

$$u^{(p-1)} = -G^{(p-1)} x^{(p-1)}$$

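We will now show that if we choose the optimal u at step p-1, then the cost to go is once again a quadratic function of state x.

$$J^{(p)} = x^{(p-1)\top} A^\top W^{(p)} A x^{(p-1)} + u^{(p-1)\top} C^\top W^{(p)} C u^{(p-1)} + 2 u^{(p-1)\top} C^\top W^{(p)} A x^{(p-1)}$$

$$J^{(p-1)} = x^{(p-1)\top} B^\top T^{(p-1)} B x^{(p-1)} + u^{(p-1)\top} L^{(p-1)} u^{(p-1)} + J^{(p)}$$

$$= x^{(p-1)\top} (B^\top T^{(p-1)} B + A^\top W^{(p)} A) x^{(p-1)} + u^{(p-1)\top} (L^{(p-1)} + C^\top W^{(p)} C) u^{(p-1)} + 2 u^{(p-1)\top} C^\top W^{(p)} A x^{(p-1)}$$

Substituting the optimal command $u^{(p-1)} = -G^{(p-1)} x^{(p-1)}$, with $G^{(p-1)} = (L^{(p-1)} + C^\top W^{(p)} C)^{-1} C^\top W^{(p)} A$:

$$u^{(p-1)\top} (L^{(p-1)} + C^\top W^{(p)} C) u^{(p-1)} = x^{(p-1)\top} G^{(p-1)\top} (L^{(p-1)} + C^\top W^{(p)} C) G^{(p-1)} x^{(p-1)}$$

Can be simplified to: $x^{(p-1)\top} A^\top W^{(p)} C G^{(p-1)} x^{(p-1)}$

$$2 u^{(p-1)\top} C^\top W^{(p)} A x^{(p-1)} = -2 x^{(p-1)\top} G^{(p-1)\top} C^\top W^{(p)} A x^{(p-1)}$$

Can be simplified to: $-2 x^{(p-1)\top} A^\top W^{(p)} C G^{(p-1)} x^{(p-1)}$

$$J^{(p-1)} = x^{(p-1)\top} (B^\top T^{(p-1)} B + A^\top W^{(p)} A - A^\top W^{(p)} C G^{(p-1)}) x^{(p-1)}$$

$$= x^{(p-1)\top} (B^\top T^{(p-1)} B + A^\top W^{(p)} (A - C G^{(p-1)})) x^{(p-1)}$$

$$= x^{(p-1)\top} W^{(p-1)} x^{(p-1)}$$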

We just showed that for the last time step, the cost to go is a quadratic function of x:

$$J^{(p)} = x^{(p)\top} W^{(p)} x^{(p)}$$

The optimal u at time point p-1 minimizes the cost to go J(p-1):

$$G^{(p-1)} = (L^{(p-1)} + C^\top W^{(p)} C)^{-1} C^\top W^{(p)} A$$

$$u^{(p-1)} = -G^{(p-1)} x^{(p-1)}$$

If at time point p-1 we indeed carry out this optimal policy u, then the cost to go at time p-1 also becomes a quadratic function of x:

$$J^{(p-1)} = x^{(p-1)\top} W^{(p-1)} x^{(p-1)}$$

$$W^{(p-1)} = B^\top T^{(p-1)} B + A^\top W^{(p)} (A - C G^{(p-1)})$$

If we now repeat the process and find the optimal u for time point p-2, it will be:

$$G^{(p-2)} = (L^{(p-2)} + C^\top W^{(p-1)} C)^{-1} C^\top W^{(p-1)} A$$

$$u^{(p-2)} = -G^{(p-2)} x^{(p-2)}$$

And if we apply the optimal u at time points p-2 and p-1, then the cost to go at time point p-2 will be a quadratic function of x:

$$J^{(p-2)} = x^{(p-2)\top} W^{(p-2)} x^{(p-2)}$$

$$W^{(p-2)} = B^\top T^{(p-2)} B + A^\top W^{(p-1)} (A - C G^{(p-2)})$$

So in general, if for time points t+1, …, p we have calculated the optimal policy for u, then the above gives us a recipe to compute the optimal policy for time point t.

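Summary of the linear quadratic tracking problem

$$x^{(k+1)} = A x^{(k)} + C u^{(k)}$$

$$y^{(k+1)} = B x^{(k+1)}$$

$$J^{(0)} = \sum_{k=0}^{p-1} y^{(k+1)\top} T^{(k+1)} y^{(k+1)} + u^{(k)\top} L^{(k)} u^{(k)}$$

[Diagram: timeline from $x^{(0)}$, $u^{(0)}$, $y^{(1)}$, $x^{(1)}$ through $u^{(p-1)}$, $x^{(p)}$, $y^{(p)}$, with the cost to go marked from each time point to the end.]

At the last time point:

$$J^{(p)} = x^{(p)\top} W^{(p)} x^{(p)}$$

$$W^{(p)} = B^\top T^{(p)} B$$

At the next to the last time point:

$$G^{(p-1)} = (L^{(p-1)} + C^\top W^{(p)} C)^{-1} C^\top W^{(p)} A$$

$$u^{(p-1)} = -G^{(p-1)} x^{(p-1)}$$

$$J^{(p-1)} = x^{(p-1)\top} W^{(p-1)} x^{(p-1)}$$

$$W^{(p-1)} = B^\top T^{(p-1)} B + A^\top W^{(p)} (A - C G^{(p-1)})$$

At the first time point:

$$G^{(0)} = (L^{(0)} + C^\top W^{(1)} C)^{-1} C^\top W^{(1)} A$$

$$u^{(0)} = -G^{(0)} x^{(0)}$$

$$J^{(0)} = x^{(0)\top} W^{(0)} x^{(0)}$$

$$W^{(0)} = B^\top T^{(0)} B + A^\top W^{(1)} (A - C G^{(0)})$$

The procedure is to compute the matrices W and G from the last time point to the first time point.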
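The backward pass described here can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `lq_backward_pass` and `simulate`, the list layout of the weights, and the scalar example are hypothetical and do not come from the lecture. Only the recursion for W and G follows the summary.

```python
import numpy as np

def lq_backward_pass(A, C, B, T_list, L_list):
    """Backward recursion for the feedback gains G^(k) and cost matrices W^(k).

    Assumed layout: T_list[k] holds T^(k) for k = 0..p,
    L_list[k] holds L^(k) for k = 0..p-1.
    """
    p = len(L_list)
    W = [None] * (p + 1)
    G = [None] * p
    W[p] = B.T @ T_list[p] @ B                      # W^(p) = B' T^(p) B
    for k in range(p - 1, -1, -1):                  # last time point to first
        CtW = C.T @ W[k + 1]
        # G^(k) = (L^(k) + C' W^(k+1) C)^-1 C' W^(k+1) A
        G[k] = np.linalg.solve(L_list[k] + CtW @ C, CtW @ A)
        # W^(k) = B' T^(k) B + A' W^(k+1) (A - C G^(k))
        W[k] = B.T @ T_list[k] @ B + A.T @ W[k + 1] @ (A - C @ G[k])
    return G, W

def simulate(A, C, G, x0, kick_step=None, kick=None):
    """Apply u^(k) = -G^(k) x^(k); optionally perturb the state at one step."""
    x = np.array(x0, dtype=float)
    states = [x.copy()]
    for k, G_k in enumerate(G):
        x = A @ x + C @ (-G_k @ x)
        if kick_step is not None and k == kick_step:
            x = x + kick                            # the next command sees the perturbed state
        states.append(x.copy())
    return np.array(states)

# Hypothetical scalar example: one time step, unit weights, T^(0) = 0.
one = np.array([[1.0]])
G_demo, W_demo = lq_backward_pass(one, one, one, [0 * one, one], [one])
states_demo = simulate(one, one, G_demo, [2.0])
```

Because the command at each step is a gain applied to the current state, a perturbation introduced through `kick` is taken into account by every later command, which is the property the open-loop solution lacked.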

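Modeling of an elbow movement

$$k = 3 \ \mathrm{N \cdot m/rad}, \qquad b = 0.45 \ \mathrm{N \cdot m \cdot s/rad}, \qquad m = 0.3 \ \mathrm{kg \cdot m^2/rad}$$

Continuous time model of the elbow

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad \dot{x}_1 = x_2, \qquad \dot{x}_2 = -\frac{k}{m} x_1 - \frac{b}{m} x_2 + \frac{1}{m} u$$

$$\dot{x} = A_c x + C_c u, \qquad A_c = \begin{bmatrix} 0 & 1 \\ -k/m & -b/m \end{bmatrix}, \qquad C_c = \begin{bmatrix} 0 \\ 1/m \end{bmatrix}$$

$$y = B x$$

Discrete time model of the elbow

$$\Delta = 0.01 \ \mathrm{sec}$$

$$x^{(k+1)} = A x^{(k)} + C u^{(k)}$$

$$A = \exp(A_c \Delta), \qquad C = A_c^{-1} (\exp(A_c \Delta) - I) C_c$$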
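A minimal sketch of the discretization step for the elbow model, using the parameter values k = 3, b = 0.45, m = 0.3 and a 0.01 sec time step. The helper `expm_taylor` is a hypothetical stand-in that computes the matrix exponential with a truncated Taylor series, which is adequate here only because the entries of A_c Δ are small; a library routine would normally be used instead.

```python
import numpy as np

def expm_taylor(M, terms=20):
    """Matrix exponential by truncated Taylor series (assumes norm(M) is small)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for n in range(1, terms):
        term = term @ M / n          # M^n / n!
        result = result + term
    return result

k_stiff, b_visc, m_inertia = 3.0, 0.45, 0.3   # stiffness, viscosity, inertia
dt = 0.01                                     # time step in seconds

# Continuous time model: x = [position, velocity]
A_c = np.array([[0.0, 1.0],
                [-k_stiff / m_inertia, -b_visc / m_inertia]])
C_c = np.array([[0.0],
                [1.0 / m_inertia]])

# Discrete time model: A = exp(A_c dt), C = A_c^-1 (A - I) C_c
A = expm_taylor(A_c * dt)
C = np.linalg.solve(A_c, (A - np.eye(2)) @ C_c)
```

The formula for C requires A_c to be invertible, which holds here because the stiffness k is nonzero.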

Goal: Reach a target at 30 deg in 300 ms and hold the arm there for 100 ms.

[Figure: position and motor command versus time (0 to 0.4 sec) for three conditions: unperturbed movement; arm held at start for 200 ms; force pulse to the arm for 50 ms. Further panels plot the motor cost L, the position cost in T, and the velocity cost in T versus time.]

Movement with a via point: we set the cost to be high at the time when we are supposed to be at the via points.

[Figure: position cost in T, motor command, position, and position gain in G versus time (0 to 0.6 sec) for the via-point movement.]

Stochastic optimal control

Biological processes have noise. For example, neurons fire stochastically in response to a constant input, and muscles produce a stochastic force in response to constant stimulation. Here we will see how to solve the optimal control problem with additive Gaussian noise.

$$x^{(k+1)} = A x^{(k)} + C u^{(k)} + \varepsilon_x^{(k)}, \qquad \varepsilon_x \sim N(0, Q)$$

$$y^{(k+1)} = B x^{(k+1)} + \varepsilon_y^{(k+1)}, \qquad \varepsilon_y \sim N(0, R)$$

Cost to minimize

$$J = \sum_{k=0}^{p-1} u^{(k)\top} L^{(k)} u^{(k)} + y^{(k+1)\top} T^{(k+1)} y^{(k+1)}$$

Because there is noise, we are no longer able to observe x directly. Rather, the best we can do is to estimate it. As we saw before, for a linear system with additive noise the best estimate of state is through the Kalman filter. So our goal is to determine the best command u for the current estimate of x so that we can minimize the global cost function.

Approach: as before, at the last time point p the cost is a quadratic function of x. We will find the optimal motor command for time point p-1 so that it minimizes the expected cost to go. If we perform the optimal motor command at p-1, then we will see that the cost to go at p-1 is again a quadratic function of x.

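Preliminaries: Expected value of a squared random variable. In the following example, we assume that x is the random variable.

Scalar x

$$v = x^2$$

$$\mathrm{var}[x] = E[x^2] - E[x]^2$$

$$E[v] = E[x^2] = \mathrm{var}[x] + E[x]^2$$

Vector x

$$v = x^\top x$$

$$\mathrm{var}[x] = E[x x^\top] - E[x] E[x]^\top$$

For any vector $r = [r_1, r_2, \ldots, r_n]^\top$:

$$r^\top r = r_1^2 + r_2^2 + \cdots + r_n^2$$

$$r r^\top = \begin{bmatrix} r_1^2 & r_1 r_2 & \cdots & r_1 r_n \\ r_2 r_1 & r_2^2 & \cdots & r_2 r_n \\ \vdots & & \ddots & \vdots \\ r_n r_1 & r_n r_2 & \cdots & r_n^2 \end{bmatrix}$$

$$\mathrm{tr}[r r^\top] = \sum_i r_i^2 = r^\top r$$

Therefore:

$$E[v] = E[x^\top x] = E[\mathrm{tr}[x x^\top]] = \mathrm{tr}[E[x x^\top]]$$

$$= \mathrm{tr}[\mathrm{var}[x]] + \mathrm{tr}[E[x] E[x]^\top]$$

$$= \mathrm{tr}[\mathrm{var}[x]] + E[x]^\top E[x]$$

With a weighting matrix A:

$$v = x^\top A x$$

$$E[v] = \mathrm{tr}[A \, \mathrm{var}[x]] + E[x]^\top A E[x]$$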
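The identity for the expected value of a quadratic form can be checked numerically. The sketch below draws Gaussian samples and compares the sample mean of x'Ax with tr[A var(x)] + E[x]'A E[x]. The particular mean, covariance, and weighting matrix (named `A_w` to avoid confusion with the system matrix) are arbitrary values chosen for illustration, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])                    # E[x] (arbitrary)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                # var[x] (arbitrary)
A_w = np.array([[3.0, 1.0],
                [1.0, 2.0]])                  # weighting matrix (arbitrary)

# Sample mean of the quadratic form x' A x
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
empirical = np.mean(np.einsum("ni,ij,nj->n", samples, A_w, samples))

# Closed form: tr[A var(x)] + E[x]' A E[x]
predicted = np.trace(A_w @ Sigma) + mu @ A_w @ mu
```

The identity itself does not require x to be Gaussian; only the mean and variance of x enter the result.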

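Cost at the last time point

$$J^{(p)} = y^{(p)\top} T^{(p)} y^{(p)}$$

Dropping the cross term between $x^{(p)}$ and $\varepsilon_y^{(p)}$, which has zero mean:

$$J^{(p)} = x^{(p)\top} B^\top T^{(p)} B x^{(p)} + \varepsilon_y^{(p)\top} T^{(p)} \varepsilon_y^{(p)}$$

$$W^{(p)} = B^\top T^{(p)} B$$

$$E[J^{(p)}] = E[x^{(p)\top} W^{(p)} x^{(p)}] + E[\varepsilon_y^{(p)\top} T^{(p)} \varepsilon_y^{(p)}]$$

$$= \mathrm{tr}[W^{(p)} \mathrm{var}[x^{(p)}]] + E[x^{(p)}]^\top W^{(p)} E[x^{(p)}] + \mathrm{tr}[T^{(p)} \mathrm{var}[\varepsilon_y^{(p)}]]$$

$$= \mathrm{tr}[W^{(p)} Q] + E[x^{(p)}]^\top W^{(p)} E[x^{(p)}] + \mathrm{tr}[T^{(p)} R]$$

Given the state and command at p-1, $E[x^{(p)}] = A x^{(p-1)} + C u^{(p-1)}$, so:

$$E[J^{(p)}] = (A x^{(p-1)} + C u^{(p-1)})^\top W^{(p)} (A x^{(p-1)} + C u^{(p-1)}) + \mathrm{tr}[W^{(p)} Q] + \mathrm{tr}[T^{(p)} R]$$

$$= x^{(p-1)\top} A^\top W^{(p)} A x^{(p-1)} + u^{(p-1)\top} C^\top W^{(p)} C u^{(p-1)} + 2 x^{(p-1)\top} A^\top W^{(p)} C u^{(p-1)} + \mathrm{tr}[W^{(p)} Q] + \mathrm{tr}[T^{(p)} R]$$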

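Cost-to-go at the next to the last time point

$$J^{(p-1)} = y^{(p-1)\top} T^{(p-1)} y^{(p-1)} + u^{(p-1)\top} L^{(p-1)} u^{(p-1)} + E[J^{(p)}]$$

$$= x^{(p-1)\top} (B^\top T^{(p-1)} B + A^\top W^{(p)} A) x^{(p-1)} + u^{(p-1)\top} (L^{(p-1)} + C^\top W^{(p)} C) u^{(p-1)}$$

$$\quad + 2 x^{(p-1)\top} A^\top W^{(p)} C u^{(p-1)} + \mathrm{tr}[W^{(p)} Q] + \mathrm{tr}[T^{(p)} R] + \varepsilon_y^{(p-1)\top} T^{(p-1)} \varepsilon_y^{(p-1)}$$

(The cross term between $x^{(p-1)}$ and $\varepsilon_y^{(p-1)}$ has zero mean and does not depend on u, so it is omitted.)

$$\frac{dJ^{(p-1)}}{du^{(p-1)}} = 2 (L^{(p-1)} + C^\top W^{(p)} C) u^{(p-1)} + 2 C^\top W^{(p)} A x^{(p-1)} = 0$$

$$u^{(p-1)} = -(L^{(p-1)} + C^\top W^{(p)} C)^{-1} C^\top W^{(p)} A x^{(p-1)}$$

$$G^{(p-1)} = (L^{(p-1)} + C^\top W^{(p)} C)^{-1} C^\top W^{(p)} A$$

$$u^{(p-1)} = -G^{(p-1)} x^{(p-1)}$$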

So we see that if our system has additive state or measurement noise, the optimal motor command remains the same as if the system had no noise at all. When we use the optimal policy at time point p-1, we see that, as before, the cost-to-go at p-1 is a quadratic function of x. The matrix W at p-1 remains the same as when the system had no noise; the noise only adds terms that do not depend on u.

The problem is that we do not have x. The best that we can do is to estimate x via the Kalman filter. We do this in the next slide.

On trial p-1, our best estimate of x is the prior.

$$u^{(p-1)} = -G^{(p-1)} \hat{x}^{(p-1|p-2)}$$

We compute the prior for the current trial from the posterior of the last trial.

$$\hat{x}^{(p-1|p-2)} = A \hat{x}^{(p-2|p-2)} + C u^{(p-2)}$$

The posterior estimate, with Kalman gain $K^{(p-2)}$:

$$\hat{x}^{(p-2|p-2)} = \hat{x}^{(p-2|p-3)} + K^{(p-2)} (y^{(p-2)} - B \hat{x}^{(p-2|p-3)})$$

$$\hat{x}^{(p-1|p-2)} = A \hat{x}^{(p-2|p-3)} + C u^{(p-2)} + A K^{(p-2)} (y^{(p-2)} - B \hat{x}^{(p-2|p-3)})$$

Our short-hand way to note the prior estimate of x on trial p-1.

$$\hat{x}^{(p-1)} \equiv \hat{x}^{(p-1|p-2)}$$

$$\hat{x}^{(p-1)} = A \hat{x}^{(p-2)} + C u^{(p-2)} + A K^{(p-2)} (y^{(p-2)} - B \hat{x}^{(p-2)})$$

$$u^{(p-1)} = -G^{(p-1)} \hat{x}^{(p-1)}$$

Although the noises in the system do not affect the gain G, the estimate of x is of course affected by the noises because the Kalman gain is influenced by them.

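Summary of stochastic optimal control for a linear system with additive Gaussian noise and quadratic cost

$$x^{(k+1)} = A x^{(k)} + C u^{(k)} + \varepsilon_x^{(k)}, \qquad \varepsilon_x \sim N(0, Q)$$

$$y^{(k+1)} = B x^{(k+1)} + \varepsilon_y^{(k+1)}, \qquad \varepsilon_y \sim N(0, R)$$

Cost to go at the start

$$J^{(0)} = \sum_{k=0}^{p-1} u^{(k)\top} L^{(k)} u^{(k)} + y^{(k+1)\top} T^{(k+1)} y^{(k+1)}$$

Cost to go at the end

$$J^{(p)} = x^{(p)\top} W^{(p)} x^{(p)} + w^{(p)}$$

$$W^{(p)} = B^\top T^{(p)} B$$

$$w^{(p)} = \varepsilon_y^{(p)\top} T^{(p)} \varepsilon_y^{(p)}, \qquad E[w^{(p)}] = \mathrm{tr}[T^{(p)} R]$$

At the next to the last time point:

$$G^{(p-1)} = (L^{(p-1)} + C^\top W^{(p)} C)^{-1} C^\top W^{(p)} A$$

$$u^{(p-1)} = -G^{(p-1)} \hat{x}^{(p-1)}$$

$$J^{(p-1)} = x^{(p-1)\top} W^{(p-1)} x^{(p-1)} + w^{(p-1)}$$

$$W^{(p-1)} = B^\top T^{(p-1)} B + A^\top W^{(p)} (A - C G^{(p-1)})$$

$$w^{(p-1)} = \mathrm{tr}[W^{(p)} Q] + \varepsilon_y^{(p-1)\top} T^{(p-1)} \varepsilon_y^{(p-1)} + E[w^{(p)}]$$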

The duality of the Kalman filter and optimal control

In the estimation problem, we have a model of how we think the hidden states x are related to observations y.

$$x^{(n+1)} = A x^{(n)} + \varepsilon_x^{(n)}, \qquad \varepsilon_x \sim N(0, Q)$$

$$y^{(n)} = H x^{(n)} + \varepsilon_y^{(n)}, \qquad \varepsilon_y \sim N(0, R)$$

Given an observation y, we have a rule with which we can change our estimates.

$$\hat{x}^{(n|n)} = \hat{x}^{(n|n-1)} + k^{(n)} (y^{(n)} - H \hat{x}^{(n|n-1)})$$

Our objective is to minimize the trace of the variance of our estimate $\hat{x}$. This variance is P. This trace is our scalar cost function, which is quadratic in terms of $\hat{x}$. We minimize it by finding the optimal gain k.

$$P^{(n|n)} = \mathrm{var}[\hat{x}^{(n|n)}]$$

$$\mathrm{tr}[P] = \mathrm{tr}\left[E[(x - \hat{x})(x - \hat{x})^\top]\right] = E\left[\mathrm{tr}[(x - \hat{x})(x - \hat{x})^\top]\right] = E[(x - \hat{x})^\top (x - \hat{x})]$$

$$k^{(n)} = P^{(n|n-1)} H^\top (H P^{(n|n-1)} H^\top + R)^{-1}$$

$$P^{(n|n)} = (I - k^{(n)} H) P^{(n|n-1)}$$

If we use this optimal k, then we can compute the variance in the next time step. Our cost (i.e., variance) of course still remains quadratic in terms of $\hat{x}$.

$$P^{(n+1|n)} = A P^{(n|n)} A^\top + Q = A (I - k^{(n)} H) P^{(n|n-1)} A^\top + Q$$

$$P^{(n+1|n)} = A \left(I - P^{(n|n-1)} H^\top (H P^{(n|n-1)} H^\top + R)^{-1} H\right) P^{(n|n-1)} A^\top + Q$$
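One step of the estimation rule above can be sketched as follows. The function name `kalman_step`, its argument order, and the scalar example values are assumptions made for illustration. The update equations are the gain, posterior, and prior-propagation formulas for the model without a control input.

```python
import numpy as np

def kalman_step(x_prior, P_prior, y, A, H, Q, R):
    """One Kalman filter step: measurement update, then propagation to the next prior."""
    S = H @ P_prior @ H.T + R
    # k = P H' (H P H' + R)^-1, computed with a linear solve (S and P are symmetric)
    K = np.linalg.solve(S, H @ P_prior).T
    x_post = x_prior + K @ (y - H @ x_prior)                 # posterior estimate
    P_post = (np.eye(P_prior.shape[0]) - K @ H) @ P_prior    # posterior variance
    x_next = A @ x_post                                      # prior for the next step
    P_next = A @ P_post @ A.T + Q                            # prior variance for the next step
    return K, x_post, P_post, x_next, P_next

# Hypothetical scalar example
one = np.array([[1.0]])
K_demo, x_post_demo, P_post_demo, x_next_demo, P_next_demo = kalman_step(
    x_prior=np.array([0.0]), P_prior=one, y=np.array([2.0]),
    A=one, H=one, Q=0.5 * one, R=one)
```

With equal prior variance and measurement noise, the gain is 0.5 and the posterior lands halfway between the prior estimate and the observation.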

The duality of the Kalman filter and optimal control, continued.

In the control problem, we have a model of how we think the hidden states x are related to commands u and observations y.

$$x^{(k+1)} = A x^{(k)} + C u^{(k)} + \varepsilon_x^{(k)}, \qquad \varepsilon_x \sim N(0, Q)$$

$$y^{(k+1)} = B x^{(k+1)} + \varepsilon_y^{(k+1)}, \qquad \varepsilon_y \sim N(0, R)$$

Our objective is to find the u that minimizes a scalar cost. To find this u, we run time backwards!

$$J^{(0)} = \sum_{k=0}^{p-1} u^{(k)\top} L^{(k)} u^{(k)} + y^{(k+1)\top} T^{(k+1)} y^{(k+1)}$$

We start at the end time point and find the optimal u that minimizes the cost to go. When we find this u, we then step back to the preceding time point and so on.

$$J^{(p)} = x^{(p)\top} W^{(p)} x^{(p)} + w^{(p)}$$

$$J^{(p-1)} = y^{(p-1)\top} T^{(p-1)} y^{(p-1)} + u^{(p-1)\top} L^{(p-1)} u^{(p-1)} + E[J^{(p)}]$$

$$u^{(p-1)} = -G^{(p-1)} \hat{x}^{(p-1)}$$

$$J^{(p-1)} = x^{(p-1)\top} W^{(p-1)} x^{(p-1)} + w^{(p-1)}$$

The cost to go is a quadratic function of hidden states. This is very similar to the Kalman filter, where the cost was a quadratic function of the hidden states as well.

Duality of optimal control and Kalman filter, continued.

Kalman Filter

$$k^{(n)} = P^{(n|n-1)} H^\top (R + H P^{(n|n-1)} H^\top)^{-1}$$

$$P^{(n+1|n)} = A \left(I - P^{(n|n-1)} H^\top (R + H P^{(n|n-1)} H^\top)^{-1} H\right) P^{(n|n-1)} A^\top + Q$$

Here P is the state uncertainty, R is the measurement noise, and Q is the state noise.

Optimal control

$$u^{(k)} = -(L^{(k)} + C^\top W^{(k+1)} C)^{-1} C^\top W^{(k+1)} A \, \hat{x}^{(k)}$$

$$W^{(k-1)} = A^\top \left(I - W^{(k)} C (L^{(k-1)} + C^\top W^{(k)} C)^{-1} C^\top\right) W^{(k)} A + B^\top T^{(k-1)} B$$

Here W is the weighting of state, L is the motor cost, and $B^\top T B$ is the tracking cost.

So W is like an estimate of the state uncertainty matrix, $B^\top T B$ is like the state update noise Q, and L is like the measurement noise R.

In optimal control, the motor commands are generated by applying a gain to the state. This gain is like the Kalman gain.
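The correspondence between the two recursions can be checked numerically. In the sketch below, which uses hypothetical function names and arbitrary example matrices, one backward step of the control recursion for W gives the same matrix as one forward step of the Kalman variance recursion once A is replaced by its transpose, H by the transpose of C, R by L, and Q by the tracking cost. The transposes are part of the mapping and follow from comparing the two formulas term by term.

```python
import numpy as np

def control_riccati_step(W_next, A, C, L, tracking_cost):
    """W^(k-1) = A' (I - W C (L + C' W C)^-1 C') W A + B' T B."""
    inner = np.linalg.inv(L + C.T @ W_next @ C)
    n = A.shape[0]
    return A.T @ (np.eye(n) - W_next @ C @ inner @ C.T) @ W_next @ A + tracking_cost

def kalman_riccati_step(P_prior, A, H, R, Q):
    """P^(n+1|n) = A (I - P H' (R + H P H')^-1 H) P A' + Q."""
    inner = np.linalg.inv(R + H @ P_prior @ H.T)
    n = A.shape[0]
    return A @ (np.eye(n) - P_prior @ H.T @ inner @ H) @ P_prior @ A.T + Q

# Arbitrary example matrices (not from the lecture)
A_ex = np.array([[1.0, 0.1], [-0.1, 0.9]])
C_ex = np.array([[0.0], [0.5]])
L_ex = np.array([[2.0]])
S_ex = np.diag([3.0, 0.5])        # plays the role of B' T B
W_ex = np.diag([1.0, 4.0])        # plays the role of W^(k)

W_prev = control_riccati_step(W_ex, A_ex, C_ex, L_ex, S_ex)
P_dual = kalman_riccati_step(W_ex, A_ex.T, C_ex.T, L_ex, S_ex)
```

Here `W_prev` and `P_dual` agree to numerical precision, which is the sense in which W behaves like the state uncertainty P.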

Noise characteristics of biological systems are not additive Gaussian

Noise in the motor output grows with the size of the motor command

The standard deviation of noise grows with mean force in an isometric task. Participants produced a given force with their thumb flexors. In one condition (labeled “voluntary”), the participants generated the force, whereas in another condition (labeled “NMES”) the experimenters stimulated their muscles artificially to produce force. To guide force production, the participants viewed a cursor that displayed thumb force, but the experimenters analyzed the data during a 4-s period in which this feedback had disappeared. A. Force produced by a typical participant. The period without visual feedback is marked by the horizontal bar in the 1st and 3rd columns (top right) and is expanded in the 2nd and 4th columns. B. When participants generated force, noise (measured as the standard deviation) increased linearly with force magnitude. Abbreviations: NMES, neuromuscular electrical stimulation; MVC, maximum voluntary contraction. From Jones et al. (2002) J Neurophysiol 88:1533.

Electrical stimulation of the muscle

Voluntary contraction of the muscle

Representing signal dependent noise

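$$\phi \sim N(0, I)$$

Vector of zero mean, variance 1 random variables

$$x^{(k+1)} = A x^{(k)} + B u^{(k)} + \varepsilon_x^{(k)} + [\text{noise with standard deviation that grows linearly with } u]$$

$$y^{(k)} = H x^{(k)} + \varepsilon_y^{(k)} + [\text{noise with standard deviation that grows linearly with } x]$$

In each equation, the $\varepsilon$ term is zero mean Gaussian noise. The last term is the signal dependent motor noise in the state equation and the signal dependent sensory noise in the measurement equation.

Example with two motor commands:

$$\text{motor noise} = \begin{bmatrix} c_1 u_1^{(k)} \phi_1^{(k)} \\ c_2 u_2^{(k)} \phi_2^{(k)} \end{bmatrix} = C_1 u^{(k)} \phi_1^{(k)} + C_2 u^{(k)} \phi_2^{(k)}$$

$$C_1 = \begin{bmatrix} c_1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad C_2 = \begin{bmatrix} 0 & 0 \\ 0 & c_2 \end{bmatrix}$$

$$\text{motor noise} = \sum_i C_i u^{(k)} \phi_i^{(k)}$$

$$\mathrm{var}[\text{motor noise}] = \sum_i C_i u^{(k)} \mathrm{var}[\phi_i^{(k)}] u^{(k)\top} C_i^\top = \sum_i C_i u^{(k)} u^{(k)\top} C_i^\top$$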
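A short sketch of this representation: each matrix C_i scales one component of the command, and a unit-variance random number multiplies the result. The function names, the scaling constants, and the command vector below are hypothetical values chosen to illustrate the construction.

```python
import numpy as np

def signal_dependent_noise(u, C_list, rng):
    """Draw one sample of sum_i C_i u phi_i with phi_i ~ N(0, 1)."""
    phi = rng.standard_normal(len(C_list))
    return sum(phi_i * (C_i @ u) for phi_i, C_i in zip(phi, C_list))

def noise_covariance(u, C_list):
    """Variance of the motor noise: sum_i C_i u u' C_i'."""
    return sum(np.outer(C_i @ u, C_i @ u) for C_i in C_list)

c1, c2 = 0.2, 0.1                                  # hypothetical scaling constants
C_list = [np.array([[c1, 0.0], [0.0, 0.0]]),
          np.array([[0.0, 0.0], [0.0, c2]])]
u = np.array([10.0, -4.0])                         # hypothetical motor command

cov = noise_covariance(u, C_list)                  # diag((c1 u1)^2, (c2 u2)^2)

rng = np.random.default_rng(1)
draws = np.array([signal_dependent_noise(u, C_list, rng) for _ in range(20_000)])
sample_std = draws.std(axis=0)                     # close to |c_i u_i|
```

Doubling the command doubles the standard deviation of the noise, which is the linear growth reported in the force data.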

Control problem with signal dependent noise (Todorov 2005)

$$x^{(k+1)} = A x^{(k)} + B u^{(k)} + \varepsilon_x^{(k)} + \sum_i C_i u^{(k)} \phi_i^{(k)}$$

$$y^{(k)} = H x^{(k)} + \varepsilon_y^{(k)} + \sum_i D_i x^{(k)} \mu_i^{(k)}$$

$$\hat{x}^{(k+1)} = A \hat{x}^{(k)} + A K^{(k)} (y^{(k)} - H \hat{x}^{(k)}) + B u^{(k)} + \varepsilon_{\hat{x}}^{(k)}$$

$$\phi_i \sim N(0, 1), \qquad \mu_i \sim N(0, 1)$$

$$\varepsilon_x \sim N(0, Q_x), \qquad \varepsilon_y \sim N(0, Q_y), \qquad \varepsilon_{\hat{x}} \sim N(0, Q_{\hat{x}})$$

Cost per step:

$$u^{(k)\top} L u^{(k)} + x^{(k)\top} T^{(k)} x^{(k)}$$

To find the motor commands that minimize the total cost, we start at the last time step p and work backwards. At time step p, the cost is a quadratic function of x. At time step p-1, we can find the optimal u that minimizes the cost to go. When we find this optimal u, the cost to go at p-1 will be a quadratic function of x plus a quadratic function of $x - \hat{x}$. In general, by induction we can prove that as long as we apply the optimal u, the cost to go will have this quadratic form. This proof is due to E. Todorov, Neural Computation, 2005.

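Cost at time step p (last time step)

$$J^{(p)} = x^{(p)\top} T^{(p)} x^{(p)}$$

$$J^{(p-1)} = u^{(p-1)\top} L u^{(p-1)} + x^{(p-1)\top} T^{(p-1)} x^{(p-1)} + E[J^{(p)} \mid x^{(p-1)}, \hat{x}^{(p-1)}, u^{(p-1)}]$$

$$E[J^{(p)} \mid x^{(p-1)}, \hat{x}^{(p-1)}, u^{(p-1)}] = E[x^{(p)} \mid \cdot]^\top T^{(p)} E[x^{(p)} \mid \cdot] + \mathrm{tr}\left[T^{(p)} \mathrm{var}[x^{(p)} \mid \cdot]\right]$$

$$\mathrm{var}[x^{(p)} \mid x^{(p-1)}, u^{(p-1)}] = Q_x + \sum_i C_i u^{(p-1)} u^{(p-1)\top} C_i^\top$$

$$E[J^{(p)} \mid \cdot] = (A x^{(p-1)} + B u^{(p-1)})^\top T^{(p)} (A x^{(p-1)} + B u^{(p-1)}) + \mathrm{tr}[T^{(p)} Q_x] + u^{(p-1)\top} \Big(\sum_i C_i^\top T^{(p)} C_i\Big) u^{(p-1)}$$

$$J^{(p-1)} = x^{(p-1)\top} T^{(p-1)} x^{(p-1)} + x^{(p-1)\top} A^\top T^{(p)} A x^{(p-1)} + \mathrm{tr}[T^{(p)} Q_x]$$

$$\quad + u^{(p-1)\top} \Big(L + B^\top T^{(p)} B + \sum_i C_i^\top T^{(p)} C_i\Big) u^{(p-1)} + 2 u^{(p-1)\top} B^\top T^{(p)} A x^{(p-1)}$$

Optimal u to minimize the cost-to-go at time step p-1

$$\frac{dJ^{(p-1)}}{du^{(p-1)}} = 2 \Big(L + B^\top T^{(p)} B + \sum_i C_i^\top T^{(p)} C_i\Big) u^{(p-1)} + 2 B^\top T^{(p)} A x^{(p-1)} = 0$$

Because x is not available to the controller, the command is computed from its estimate:

$$u^{(p-1)} = -\Big(L + B^\top T^{(p)} B + \sum_i C_i^\top T^{(p)} C_i\Big)^{-1} B^\top T^{(p)} A \, \hat{x}^{(p-1)}$$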

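Cost-to-go at p-1

$$G^{(p-1)} = \Big(L + B^\top T^{(p)} B + \sum_i C_i^\top T^{(p)} C_i\Big)^{-1} B^\top T^{(p)} A$$

$$u^{(p-1)} = -G^{(p-1)} \hat{x}^{(p-1)}$$

$$J^{(p-1)} = x^{(p-1)\top} T^{(p-1)} x^{(p-1)} + x^{(p-1)\top} A^\top T^{(p)} A x^{(p-1)} + \mathrm{tr}[T^{(p)} Q_x]$$

$$\quad + \hat{x}^{(p-1)\top} G^{(p-1)\top} B^\top T^{(p)} A \hat{x}^{(p-1)} - 2 \hat{x}^{(p-1)\top} G^{(p-1)\top} B^\top T^{(p)} A x^{(p-1)}$$

For a symmetric matrix Z:

$$\hat{x}^\top Z \hat{x} - 2 \hat{x}^\top Z x = (x - \hat{x})^\top Z (x - \hat{x}) - x^\top Z x$$

With $Z = G^{(p-1)\top} B^\top T^{(p)} A = A^\top T^{(p)} B G^{(p-1)}$ and $e^{(p-1)} = x^{(p-1)} - \hat{x}^{(p-1)}$:

$$J^{(p-1)} = x^{(p-1)\top} \underbrace{\left(T^{(p-1)} + A^\top T^{(p)} A - A^\top T^{(p)} B G^{(p-1)}\right)}_{W_x^{(p-1)}} x^{(p-1)} + e^{(p-1)\top} \underbrace{A^\top T^{(p)} B G^{(p-1)}}_{W_e^{(p-1)}} e^{(p-1)} + \underbrace{\mathrm{tr}[T^{(p)} Q_x]}_{w^{(p-1)}}$$

$$J^{(p-1)} = x^{(p-1)\top} W_x^{(p-1)} x^{(p-1)} + e^{(p-1)\top} W_e^{(p-1)} e^{(p-1)} + w^{(p-1)}$$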

J(p-1) is the cost-to-go at time step p-1, assuming that the optimal u is produced at p-1. Note that unlike the cost at time step p, this cost-to-go is quadratic in terms of x and the error in estimation of x. So now we need to show that if we continue to produce the optimal u at each time step, the cost-to-go remains in this form for all time steps.

$$J^{(k+1)} = x^{(k+1)\top} W_x^{(k+1)} x^{(k+1)} + e^{(k+1)\top} W_e^{(k+1)} e^{(k+1)} + w^{(k+1)}$$

$$J^{(k)} = u^{(k)\top} L u^{(k)} + x^{(k)\top} T^{(k)} x^{(k)} + E[J^{(k+1)} \mid x^{(k)}, \hat{x}^{(k)}, u^{(k)}]$$

Conjecture: If at some time point k+1 the cost-to-go under an optimal control policy is quadratic in x and e, and provided that we produce a u that minimizes the cost-to-go at time step k, then the cost-to-go at time step k will also be quadratic. To prove this, our first step is to find the u that minimizes the cost-to-go at time step k, and then show that the resulting optimal cost-to-go remains in the quadratic form above.

To compute the expected value term, we need to do some work on the term e.

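$$e^{(k+1)} = x^{(k+1)} - \hat{x}^{(k+1)}$$

$$x^{(k+1)} = A x^{(k)} + B u^{(k)} + \varepsilon_x^{(k)} + \sum_i C_i u^{(k)} \phi_i^{(k)}$$

$$\hat{x}^{(k+1)} = A \hat{x}^{(k)} + A K^{(k)} (y^{(k)} - H \hat{x}^{(k)}) + B u^{(k)} + \varepsilon_{\hat{x}}^{(k)}$$

where the observation is

$$y^{(k)} = H x^{(k)} + \varepsilon_y^{(k)} + \sum_i D_i x^{(k)} \mu_i^{(k)}$$

Subtracting the second equation from the first:

$$e^{(k+1)} = (A - A K^{(k)} H) e^{(k)} + \varepsilon_x^{(k)} - \varepsilon_{\hat{x}}^{(k)} - A K^{(k)} \varepsilon_y^{(k)} + \sum_i C_i u^{(k)} \phi_i^{(k)} - A K^{(k)} \sum_i D_i x^{(k)} \mu_i^{(k)}$$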

$$E[e^{(k+1)} \mid x^{(k)}, \hat{x}^{(k)}, u^{(k)}] = (A - A K^{(k)} H) e^{(k)}$$

$$\mathrm{var}[e^{(k+1)} \mid x^{(k)}, \hat{x}^{(k)}, u^{(k)}] = Q_x + \sum_i C_i u^{(k)} u^{(k)\top} C_i^\top + A K^{(k)} Q_y K^{(k)\top} A^\top + \sum_i A K^{(k)} D_i x^{(k)} x^{(k)\top} D_i^\top K^{(k)\top} A^\top + Q_{\hat{x}}$$

To compute the expected value of J(k+1), we compute the expected value of the two quadratic terms. The third term, $w^{(k+1)}$, does not depend on u and carries over unchanged.

$$E[x^{(k+1)\top} W_x^{(k+1)} x^{(k+1)} \mid x^{(k)}, \hat{x}^{(k)}, u^{(k)}] = (A x^{(k)} + B u^{(k)})^\top W_x^{(k+1)} (A x^{(k)} + B u^{(k)}) + \mathrm{tr}[W_x^{(k+1)} Q_x] + u^{(k)\top} C_x^{(k)} u^{(k)}$$

$$E[e^{(k+1)\top} W_e^{(k+1)} e^{(k+1)} \mid x^{(k)}, \hat{x}^{(k)}, u^{(k)}] = e^{(k)\top} (A - A K^{(k)} H)^\top W_e^{(k+1)} (A - A K^{(k)} H) e^{(k)}$$

$$\quad + \mathrm{tr}\left[W_e^{(k+1)} (Q_x + Q_{\hat{x}} + A K^{(k)} Q_y K^{(k)\top} A^\top)\right] + u^{(k)\top} C_e^{(k)} u^{(k)} + x^{(k)\top} D^{(k)} x^{(k)}$$

where

$$C_x^{(k)} = \sum_i C_i^\top W_x^{(k+1)} C_i, \qquad C_e^{(k)} = \sum_i C_i^\top W_e^{(k+1)} C_i, \qquad D^{(k)} = \sum_i D_i^\top K^{(k)\top} A^\top W_e^{(k+1)} A K^{(k)} D_i$$

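$$J^{(k)} = u^{(k)\top} L u^{(k)} + x^{(k)\top} T^{(k)} x^{(k)} + E[J^{(k+1)} \mid x^{(k)}, \hat{x}^{(k)}, u^{(k)}]$$

$$= u^{(k)\top} \Big(L + \sum_i C_i^\top W_x^{(k+1)} C_i + \sum_i C_i^\top W_e^{(k+1)} C_i + B^\top W_x^{(k+1)} B\Big) u^{(k)} + 2 u^{(k)\top} B^\top W_x^{(k+1)} A x^{(k)} + [\text{terms that do not depend on } u]$$

$$\frac{dJ^{(k)}}{du^{(k)}} = 2 \Big(L + \sum_i C_i^\top W_x^{(k+1)} C_i + \sum_i C_i^\top W_e^{(k+1)} C_i + B^\top W_x^{(k+1)} B\Big) u^{(k)} + 2 B^\top W_x^{(k+1)} A x^{(k)} = 0$$

Computing the command from the estimate of x:

$$u^{(k)} = -\underbrace{\Big(L + \sum_i C_i^\top W_x^{(k+1)} C_i + \sum_i C_i^\top W_e^{(k+1)} C_i + B^\top W_x^{(k+1)} B\Big)^{-1} B^\top W_x^{(k+1)} A}_{G^{(k)}} \, \hat{x}^{(k)}$$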

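Substituting $u^{(k)} = -G^{(k)} \hat{x}^{(k)}$:

$$J^{(k)} = \hat{x}^{(k)\top} G^{(k)\top} B^\top W_x^{(k+1)} A \hat{x}^{(k)} - 2 \hat{x}^{(k)\top} G^{(k)\top} B^\top W_x^{(k+1)} A x^{(k)}$$

$$\quad + x^{(k)\top} \Big(T^{(k)} + A^\top W_x^{(k+1)} A + \sum_i D_i^\top K^{(k)\top} A^\top W_e^{(k+1)} A K^{(k)} D_i\Big) x^{(k)}$$

$$\quad + e^{(k)\top} (A - A K^{(k)} H)^\top W_e^{(k+1)} (A - A K^{(k)} H) e^{(k)}$$

$$\quad + \mathrm{tr}[W_x^{(k+1)} Q_x] + \mathrm{tr}\left[W_e^{(k+1)} (Q_x + Q_{\hat{x}} + A K^{(k)} Q_y K^{(k)\top} A^\top)\right] + w^{(k+1)}$$

Rewriting the first two terms as a quadratic in e minus a quadratic in x:

$$W_x^{(k)} = T^{(k)} + A^\top W_x^{(k+1)} A - A^\top W_x^{(k+1)} B G^{(k)} + \sum_i D_i^\top K^{(k)\top} A^\top W_e^{(k+1)} A K^{(k)} D_i$$

$$W_e^{(k)} = A^\top W_x^{(k+1)} B G^{(k)} + (A - A K^{(k)} H)^\top W_e^{(k+1)} (A - A K^{(k)} H)$$

$$w^{(k)} = \mathrm{tr}[W_x^{(k+1)} Q_x] + \mathrm{tr}\left[W_e^{(k+1)} (Q_x + Q_{\hat{x}} + A K^{(k)} Q_y K^{(k)\top} A^\top)\right] + w^{(k+1)}$$

$$J^{(k)} = x^{(k)\top} W_x^{(k)} x^{(k)} + e^{(k)\top} W_e^{(k)} e^{(k)} + w^{(k)}$$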

So we just showed that if at some time point k+1 the cost-to-go under an optimal control policy is quadratic in x and e, and provided that we produce a u that minimizes the cost-to-go at time step k, then the cost-to-go at time step k will also be quadratic. Since we had earlier shown that at time step p-1 the cost is quadratic in x and e, we now have the solution to our problem.

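Summary: Control problem with signal dependent noise (Todorov 2005)

$$x^{(k+1)} = A x^{(k)} + B u^{(k)} + \varepsilon_x^{(k)} + \sum_i C_i u^{(k)} \phi_i^{(k)}$$

$$y^{(k)} = H x^{(k)} + \varepsilon_y^{(k)} + \sum_i D_i x^{(k)} \mu_i^{(k)}$$

$$\hat{x}^{(k+1)} = A \hat{x}^{(k)} + A K^{(k)} (y^{(k)} - H \hat{x}^{(k)}) + B u^{(k)} + \varepsilon_{\hat{x}}^{(k)}$$

$$\phi_i \sim N(0, 1), \qquad \mu_i \sim N(0, 1)$$

$$\varepsilon_x \sim N(0, Q_x), \qquad \varepsilon_y \sim N(0, Q_y), \qquad \varepsilon_{\hat{x}} \sim N(0, Q_{\hat{x}})$$

Cost per step

$$u^{(k)\top} L u^{(k)} + x^{(k)\top} T^{(k)} x^{(k)}$$

Optimal policy

$$u^{(k)} = -G^{(k)} \hat{x}^{(k)}$$

$$G^{(k)} = \Big(L + \sum_i C_i^\top W_x^{(k+1)} C_i + \sum_i C_i^\top W_e^{(k+1)} C_i + B^\top W_x^{(k+1)} B\Big)^{-1} B^\top W_x^{(k+1)} A$$

$$W_x^{(k)} = T^{(k)} + A^\top W_x^{(k+1)} A - A^\top W_x^{(k+1)} B G^{(k)} + \sum_i D_i^\top K^{(k)\top} A^\top W_e^{(k+1)} A K^{(k)} D_i$$

$$W_e^{(k)} = A^\top W_x^{(k+1)} B G^{(k)} + (A - A K^{(k)} H)^\top W_e^{(k+1)} (A - A K^{(k)} H)$$

$$w^{(k)} = \mathrm{tr}[W_x^{(k+1)} Q_x] + \mathrm{tr}\left[W_e^{(k+1)} (Q_x + Q_{\hat{x}} + A K^{(k)} Q_y K^{(k)\top} A^\top)\right] + w^{(k+1)}$$

For the last time step

$$W_x^{(p)} = T^{(p)}, \qquad W_e^{(p)} = 0, \qquad w^{(p)} = 0$$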
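The backward recursion in this summary can be sketched as follows, under the assumption that the filter gains K(k) are already given. Computing the gains themselves is not shown here. The function name `sdn_backward_pass`, the list layout, and the scalar example are hypothetical.

```python
import numpy as np

def sdn_backward_pass(A, B, H, C_list, D_list, K_list, L, T_list):
    """Feedback gains G^(k) for signal dependent noise, given filter gains K^(k).

    Assumed layout: T_list[k] holds T^(k) for k = 0..p,
    K_list[k] holds K^(k) for k = 0..p-1.
    """
    p = len(T_list) - 1
    n = A.shape[0]
    Wx, We = T_list[p], np.zeros((n, n))           # last time step
    G = [None] * p
    for k in range(p - 1, -1, -1):
        K = K_list[k]
        Cx = sum(Ci.T @ Wx @ Ci for Ci in C_list)  # sum_i C_i' W_x C_i
        Ce = sum(Ci.T @ We @ Ci for Ci in C_list)  # sum_i C_i' W_e C_i
        Dk = sum(Di.T @ K.T @ A.T @ We @ A @ K @ Di for Di in D_list)
        G[k] = np.linalg.solve(L + Cx + Ce + B.T @ Wx @ B, B.T @ Wx @ A)
        AKH = A - A @ K @ H
        Wx_new = T_list[k] + A.T @ Wx @ A - A.T @ Wx @ B @ G[k] + Dk
        We_new = A.T @ Wx @ B @ G[k] + AKH.T @ We @ AKH
        Wx, We = Wx_new, We_new
    return G

# Hypothetical scalar example: one time step, with and without signal dependent motor noise
one = np.array([[1.0]])
K_demo = [0.5 * one]
G_noisy = sdn_backward_pass(one, one, one, [0.5 * one], [], K_demo, one, [one, one])
G_clean = sdn_backward_pass(one, one, one, [], [], K_demo, one, [one, one])
```

In this example the gain with signal dependent motor noise is smaller than the gain without it, because the noise term adds to the effective motor cost in the expression for G.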
