
Predictive Inverse Optimal Control for Linear-Quadratic-Gaussian Systems

Xiangli Chen
Department of Computer Science
University of Illinois at Chicago
Chicago, IL 60607
xchen40@uic.edu

Brian D. Ziebart
Department of Computer Science
University of Illinois at Chicago
Chicago, IL 60607

Abstract

Predictive inverse optimal control is a powerful approach for estimating the control policy of an agent from observed control demonstrations. Its usefulness has been established in a number of large-scale sequential decision settings characterized by complete state observability. However, many real decisions are made in situations where the state is not fully known to the agent making decisions. Though extensions of predictive inverse optimal control to partially observable Markov decision processes have been developed, their applicability has been limited by the complexities of inference in those representations. In this work, we extend predictive inverse optimal control to the linear-quadratic-Gaussian control setting. We establish close connections between optimal control laws for this setting and the probabilistic predictions under our approach. We demonstrate the effectiveness and benefit of estimating control policies that are influenced by partial observability on both synthetic and real datasets.

1 Introduction

Predicting sequences of behavior is an important task for many artificial intelligence applications. It is of key importance for human-robot interaction and human-computer interaction systems. For example, robots that more efficiently and safely navigate around people and user interfaces that can autonomously adapt to improve a user's task efficiency [3] each require accurate behavior predictions. Perfect accuracy is an unrealistic objective for this task given the large number of behavior sequences that are possible. Instead, statistical methods are needed to characterize the inherent uncertainties of this prediction task and guide appropriate decision making in artificial intelligence applications.

Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.

Two main approaches for constructing predictive models for this behavior prediction task are: (1) direct policy estimation [22], which learns a mapping from contextual situations to actions; and (2) inverse optimal control (IOC) methods (also known as inverse reinforcement learning and inverse planning), which view behavior under a sequential decision process and estimate a reward/cost function that rationalizes demonstrated behavior [23, 19, 30, 2]. Often, a learned reward function from the latter approach generalizes well to other portions of the decision process's state space and even to other decision processes in which the same reward function is applicable. Policy estimation is not nearly as adaptive to contextual changes in the decision/control setting. However, inverse optimal control requires planning, decision, and control problems to be repeatedly solved¹. This can be computationally expensive. As a result, inverse optimal control methods that have been beneficially employed to estimate decision policies for behavior prediction tasks have been restricted to settings with low-dimensional state-action spaces and/or full state observability [31, 11, 29, 16] to make the optimal control problem tractable.

Unfortunately, the assumptions of a low-dimensional state-control space and full state observability often do not match reality for many important prediction tasks. Often a human actor has only partial knowledge of the "state of the world" and takes actions that are delayed responses to noisy observations of the actual world state. For example, a person may walk through an environment with occlusions and therefore have uncertainty about the locations of obstacles. Similarly, user interfaces may change in ways that users do not anticipate, leading to observed behavior sequences affected by human response times. Extensions of inverse optimal control techniques address either high dimensionality or partial observability, but not both. IOC methods for high-dimensional data assume a linear-quadratic control setting [29, 15], for which optimal control is tractable even for very high-dimensional state-control spaces. IOC methods for partially-observable decision process representations [28, 7] have been limited to partially-observable Markov decision processes with small state-action spaces.

¹ Inverse optimal control methods that purportedly address the IOC problem by estimating value or cost-to-go functions without solving optimal control problems [9] more closely resemble (the dual optimization problem of) direct policy estimation methods.

In this paper, we extend IOC methods to settings with both high-dimensional state-control spaces and partial observability. We specifically investigate the discrete-time linear-quadratic-Gaussian (LQG) control setting. This is a special sub-class of partially-observable decision processes for which optimal control is efficient even for large state and control dimensions. We formulate the inverse LQG problem from robust estimation first principles to obtain a predictive distribution over state-control sequences. Like the optimal control solution, which combines a Kalman filter [12] with a linear-quadratic regulator [14], our approach finds a similar separation between state estimation and control policy estimation to provide probabilistic predictive distributions. We demonstrate the benefits of incorporating partial observability for predictive inverse optimal control in a synthetic control prediction task and for modeling mouse cursor pointing motions.

2 Background & Related Work

2.1 Linear quadratic Gaussian control

Linear-quadratic-Gaussian (LQG) control problems seek the optimal control policy of partially observed linear systems. Apart from its initial value, which is Gaussian distributed (1), the unobserved state of the system, denoted as value ~x_t (or random variable ~X_t) at time t, evolves as a noisy linear function of the previous state ~x_{t-1} and control ~u_{t-1} (2). The state itself is not directly observed by the controller; instead, observation variables ~z_t (or, as random variables, ~Z_t) that are noisy linear functions of the state are observed (3). The state dynamics and observation noise are each conditional Gaussian distributions with means defined by linear relationships of the A, B, and C matrices and covariance matrices Σ_{d_1}, Σ_d, and Σ_o characterizing the noise:

\vec{X}_1 \sim N(\vec{\mu}, \Sigma_{d_1});   (1)
\vec{X}_{t+1} \,|\, \vec{x}_t, \vec{u}_t \sim N(A\vec{x}_t + B\vec{u}_t, \Sigma_d);   (2)
\vec{Z}_t \,|\, \vec{x}_t \sim N(C\vec{x}_t, \Sigma_o).   (3)

The independence properties of the LQG control setting are illustrated by the Bayesian network in Figure 1.

Figure 1: A probabilistic graphical model for the partially-observable control setting.
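For concreteness, the generative model (1)-(3) can be sampled directly. The sketch below is our own minimal illustration (the function and variable names, such as simulate_lqg and policy, are ours rather than the paper's); policy stands in for any history-dependent controller.

import numpy as np

def simulate_lqg(A, B, C, Sigma_d1, Sigma_d, Sigma_o, mu, policy, T, rng=None):
    """Sample one state/observation/control trajectory from Eqs. (1)-(3)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.multivariate_normal(mu, Sigma_d1)               # X_1 ~ N(mu, Sigma_d1)           (1)
    xs, zs, us = [x], [], []
    for t in range(T):
        z = rng.multivariate_normal(C @ x, Sigma_o)         # Z_t | x_t ~ N(C x_t, Sigma_o)   (3)
        zs.append(z)
        u = policy(zs, us)                                  # control from past observations/controls
        us.append(u)
        x = rng.multivariate_normal(A @ x + B @ u, Sigma_d) # X_{t+1} | x_t, u_t              (2)
        xs.append(x)
    return np.array(xs), np.array(zs), np.array(us)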

Given the state and observation dynamics, the LQG optimal control problem is to obtain the control policy π : ~U_{1:t-1} × ~Z_{1:t} → ~U_t that minimizes an expected cost defined quadratically in terms of a cost matrix M:

E_{f(\vec{Z}_{1:T+1}, \vec{X}_{1:T+1}, \vec{U}_{1:T})}\left[ \sum_{t=1}^{T+1} \vec{X}_t^\top M \vec{X}_t \right].   (4)

The optimal control law is obtained by separating the problem into a state estimation task and a linear-quadratic-regulation optimal control problem for the estimated state. State estimation is accomplished using a Kalman filter [12]. Due to the linear characteristics of the problem, only the mean of the state estimate is needed. For the estimate of the state's mean conditioned on previous and current observations and previous controls, \hat{x}_t(+), the optimal control policy is recursively defined [25] as:

\vec{u}_t = -L_t \hat{x}_t(+),
\hat{x}_{t+1} = A\hat{x}_t + B\vec{u}_t + K_t\big(\vec{z}_t - C(A\hat{x}_t + B\vec{u}_t)\big), \qquad \hat{x}_1 = E[\vec{x}_1],

where K_t is the Kalman gain,

K_t = H_t C^\top (C H_t C^\top + \Sigma_o)^{-1},

and H_t is determined by the following matrix Riccati difference equation that runs forward in time,

H_{t+1} = A\big(H_t - H_t C^\top (C H_t C^\top + \Sigma_o)^{-1} C H_t\big) A^\top + \Sigma_d,
H_1 = E[\vec{X}_1 \vec{X}_1^\top] = \Sigma_{d_1} + \vec{\mu}\vec{\mu}^\top.


The feedback gain matrix is

L_t = (B^\top F_{t+1} B)^{-1} B^\top F_{t+1} A,

where F_t is determined by the following matrix Riccati difference equation that runs backward in time:

F_t = \begin{cases} M + A^\top F_{t+1} A - A^\top F_{t+1} B (B^\top F_{t+1} B)^{-1} B^\top F_{t+1} A, & t \le T \\ M, & t = T + 1. \end{cases}   (5)

The linear-quadratic regulator (LQR) can be viewed as the full-observability special case of LQG with C = I and Σ_o = 0, where I is the identity matrix, so that the observation variable ~z_t is equivalent to the unobserved state ~x_t.
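The separation above translates directly into code. The following is a sketch under our own naming conventions (lqg_gains is not from the paper): a backward Riccati pass computes the feedback gains L_t of Eq. (5) and a forward Riccati pass computes the Kalman gains K_t.

import numpy as np

def lqg_gains(A, B, C, M, Sigma_d, Sigma_o, Sigma_d1, mu, T):
    """Compute feedback gains L_1..L_T (backward pass, Eq. 5) and
    Kalman gains K_1..K_T (forward pass) for the LQG problem."""
    F = M.copy()                                 # F_{T+1} = M
    Ls = [None] * (T + 1)                        # 1-indexed storage: Ls[t] = L_t
    for t in range(T, 0, -1):
        Ls[t] = np.linalg.solve(B.T @ F @ B, B.T @ F @ A)   # L_t = (B^T F_{t+1} B)^{-1} B^T F_{t+1} A
        F = M + A.T @ F @ A - A.T @ F @ B @ Ls[t]           # F_t
    H = Sigma_d1 + np.outer(mu, mu)              # H_1 = E[X_1 X_1^T]
    Ks = [None] * (T + 1)
    for t in range(1, T + 1):
        Ks[t] = H @ C.T @ np.linalg.inv(C @ H @ C.T + Sigma_o)   # K_t
        H = A @ (H - Ks[t] @ C @ H) @ A.T + Sigma_d              # H_{t+1}
    return Ls, Ks

The controller then maintains the state estimate with the Kalman recursion shown above and applies u_t = -L_t x̂_t(+).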

2.2 Inverse optimal control

In contrast with the optimal control problem of obtaining a policy that minimizes some cumulative expected cost, inverse optimal control [20, 1] takes (samples from) a policy and tries to obtain a cost function for which the observed behavior is, ideally, optimal. Early approaches often assumed a linear functional form [5, 24, 1] for the cost function in terms of features f, cost(x_t) = θ^T f(x_t), and optimality for some choice of weights θ. In practice, observed behavior is not consistently optimal for any linear cost function from this family of functions [1], and the optimality assumption breaks down. Even though the linear weights of the function are unknown and behavior can be arbitrarily sub-optimal, any policy that has the same expected feature statistics, E[Σ_t f(x_t)], as the demonstrated feature statistic expectation is guaranteed to have the same expected cost [1]. Mixtures of optimal policies can instead be employed to guarantee the same expected costs as (sub-optimal) demonstrated behavior [1].

We distinguish IOC approaches that match the expected cost of demonstrated behavior from those that attempt to provide predictions of behavior. The principle of maximum entropy has previously been employed for this purpose. Maximum entropy IOC [30] provides a stochastic control policy that robustly minimizes the predictive log-loss for policies that, in expectation, generate certain expected features. In contrast, mixtures of optimal policies [1] can produce infinite log-loss when they provide no support for demonstrated policies. We build upon the maximum entropy IOC approach in this work.

2.3 Directed information theory

We view the LQG setting using concepts and measures from directed information theory [17, 18, 13, 26, 21]. The joint distribution of states, observations, and controls is factored into two causally conditioned probability distributions [13],

f(\vec{x}_{1:T}, \vec{z}_{1:T}, \vec{u}_{1:T}) = f(\vec{u}_{1:T} || \vec{x}_{1:T}, \vec{z}_{1:T}) \, f(\vec{x}_{1:T}, \vec{z}_{1:T} || \vec{u}_{1:T-1}),   (6)

where

f(\vec{u}_{1:T} || \vec{x}_{1:T}, \vec{z}_{1:T}) \triangleq \prod_{t=1}^{T} f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{x}_{1:t}, \vec{z}_{1:t})   (7)

and

f(\vec{x}_{1:T}, \vec{z}_{1:T} || \vec{u}_{1:T-1}) \triangleq \prod_{t=1}^{T} f(\vec{x}_t, \vec{z}_t \,|\, \vec{x}_{1:t-1}, \vec{z}_{1:t-1}, \vec{u}_{1:t-1}).   (8)

The causal entropy [13] of the control policy,

H(\vec{U}_{1:T} || \vec{X}_{1:T}, \vec{Z}_{1:T}) \triangleq E\left[ -\log_2 f(\vec{U}_{1:T} || \vec{X}_{1:T}, \vec{Z}_{1:T}) \right] = E\left[ \sum_{t=1}^{T} H(\vec{U}_t \,|\, \vec{X}_{1:t}, \vec{Z}_{1:t}, \vec{U}_{1:t-1}) \right],   (9)

is a measure of the uncertainty of the causally conditioned probability distribution (7). It can be interpreted as the amount of information or "surprise" (in bits when using base 2) present in expectation for a control sequence ~u_{1:T} sampled from the joint state, observation, and control distribution (6), given only previous observation and control variables.

Due to the specific independence properties of partial observability in the LQG setting (shown in Figure 1), the control policy reduces to

f(\vec{u}_{1:T} || \vec{z}_{1:T}) \triangleq \prod_{t=1}^{T} f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}),   (10)

and the causal entropy of the control policy reduces to H(~U_{1:T} || ~Z_{1:T}). However, as we shall see, representing the control distribution in its more general form and constraining it to possess the required independence properties is crucial for our approach.

3 Inverse Linear-Quadratic-Gaussian Control

We employ a robust estimation approach for learning the control policy in a way that generalizes to different settings (§3.1). We show that this approach can be posed as a convex optimization problem (§3.2) leading to a maximum causal entropy problem (§3.3). The dual solution decomposes into a state estimation component and a (softened) optimal control component, enabling efficient inference (§3.4).


3.1 Robust policy estimation

We consider a set of policies denoted by Ξ that are similar to observed sequences of states, observations, and controls (defined precisely in §3.3). We follow the robust estimation formulation [27, 10] of maximum entropy inverse optimal control [28] to select the single policy with the best worst-case predictive guarantees from this set. This can be viewed as a two-player game in which the policy estimate, \hat{f}, that minimizes loss is first chosen, and then an evaluation policy, f ∈ Ξ, is adversarially chosen to maximize the loss subject to matching known/observed properties of the actual policy:

\min_{\{\hat{f}(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T})\}} \; \max_{\{f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T})\} \in \Xi} \text{Loss}(\hat{f}, f)   (11)
\ge \max_{\{f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T})\} \in \Xi} \; \min_{\{\hat{f}(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T})\}} \text{Loss}(\hat{f}, f)   (12)
= \max_{\{f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T})\} \in \Xi} \text{Loss}(f, f).   (13)

In general, weak Lagrangian duality holds and the dual optimization problem (12) provides a lower bound on the primal optimization problem (11). The causal log-loss,

\text{Loss}(\hat{f}, f) = E_f\left[ -\log \hat{f}(\vec{U}_{1:T} || \vec{Z}_{1:T}, \vec{X}_{1:T}) \right],   (14)

measures the amount of "surprise" (in bits when log_2 is used) when control sequences sampled from f are observed while control sequences from \hat{f} are expected. When it is employed as the loss function, the dual optimization problem reduces to maximizing the causal entropy (13): Loss(f, f) = H(\vec{U}_{1:T} || \vec{Z}_{1:T}, \vec{X}_{1:T}).

3.2 A convex definition of the LQG policy set

We seek to strengthen our analysis of the dual solution so that primal-dual equality holds in (12). This strong duality requires the set of policies (10) to be convex [6], which is not obviously the case. We introduce the partial observability causal simplex (Definition 1), which extends the causal simplex [28] to the partial-observability setting. It is defined by affine constraints that ensure that members of the set factor according to (10). This is accomplished by preventing unobserved variables (~x_{1:T}) and not-yet-revealed variables (~z_{t+1:T}) from influencing a control variable's distribution (~u_t).

Definition 1. The partial observability causal simplex for f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T}), denoted by ∆, is defined by the following set of constraints: for all \vec{u}_{1:T} \in \vec{U}_{1:T}, \vec{x}_{1:T}, \vec{x}'_{1:T} \in \vec{X}_{1:T}, and \vec{z}_{1:T}, \vec{z}'_{1:T} \in \vec{Z}_{1:T},

f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T}) \ge 0,   (15)

\int_{\vec{u}'_{1:T} \in \vec{U}_{1:T}} f(\vec{u}'_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T}) \, d\vec{u}'_{1:T} = 1,   (16)

f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T}) = f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}'_{1:T});   (17)

and, for all τ \in \{0, \ldots, T\} such that \vec{z}_{1:\tau} = \vec{z}'_{1:\tau} and \vec{x}_{1:\tau} = \vec{x}'_{1:\tau},

\int_{\vec{u}_{\tau+1:T} \in \vec{U}_{\tau+1:T}} f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T}) \, d\vec{u}_{\tau+1:T} = \int_{\vec{u}_{\tau+1:T} \in \vec{U}_{\tau+1:T}} f(\vec{u}_{1:T} || \vec{z}'_{1:T}, \vec{x}'_{1:T}) \, d\vec{u}_{\tau+1:T}.   (18)

The non-negativity constraints (15) and normalization constraints (16) ensure a valid probability distribution. The next set of constraints (17) enforces partial observability: the controls do not depend on the hidden state. The final set of constraints (18) ensures that only previous ~x and ~z variables influence controls ~u (causal conditioning). Because all of the equalities and inequalities are affine, the partial observability causal simplex is a convex set.

3.3 Maximum causal entropy estimation

Redefining the domain of the estimated policy f(~u_{1:T} || ~z_{1:T}) using the partial observability causal simplex (Definition 1) enables strong duality. The dual of the robust policy estimation formulation (Section 3.1) reduces to maximizing the causal entropy (9) as a selection measure from the set of policies (∆) matching quadratic state expectation constraints (Definition 2).

Definition 2. The maximum causal entropy inverse LQG policy is obtained from:

\underset{\{f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T})\} \in \Delta}{\text{argmax}} \; H(\vec{U}_{1:T} || \vec{Z}_{1:T}, \vec{X}_{1:T})   (19)

such that:

E\left[ \sum_{t=1}^{T+1} \vec{X}_t \vec{X}_t^\top \right] = \tilde{E}\left[ \sum_{t=1}^{T+1} \vec{X}_t \vec{X}_t^\top \right],   (20)

where ∆ is the partial observability causal simplex of Definition 1, E[·] is the expectation under the estimated policy, and \tilde{E}[·] is the empirical expectation from observed behavior sequence data.

This choice of constraints is motivated by inverse optimal control (Section 2.2). They ensure that the stochastic control policy matches the performance of observed behavior on unknown state-based quadratic cost functions² (Corollary 1).

² We employ state-based functions for notational simplicity. Control-based functions could also be explicitly added with an additional constraint or implicitly by incorporating an "action memory" into the state vector.


Corollary 1 ([1]). For any unknown quadratic cost function, parameterized by matrix M, matching expected feature counts guarantees equivalent performance on the unknown cost function:

\forall M \in \mathbb{R}^{|S| \times |S|}, \quad E\left[ \sum_{t=1}^{T+1} \vec{X}_t \vec{X}_t^\top \right] = \tilde{E}\left[ \sum_{t=1}^{T+1} \vec{X}_t \vec{X}_t^\top \right] \implies E\left[ \sum_{t=1}^{T+1} \vec{X}_t^\top M \vec{X}_t \right] = \tilde{E}\left[ \sum_{t=1}^{T+1} \vec{X}_t^\top M \vec{X}_t \right].

Many different mixture distributions over deterministic policies can satisfy this constraint [1]. Thus, the causal entropy (19) can be viewed as a tie-breaking criterion that resolves the ill-posedness of inverse optimal control and provides strong robust prediction guarantees.

3.4 Predictive inverse LQG distribution

The Lagrangian dual provides a value-equivalent solution to the primal constrained optimization problem (19)³, while leading to a more compact representation of the policy.

Theorem 1. The solution to the partially-observable maximum causal entropy problem (Definition 2) takes the following recursive form, where M is the Lagrangian multiplier matrix:

f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}) = e^{Q(\vec{u}_{1:t}, \vec{z}_{1:t}) - V(\vec{u}_{1:t-1}, \vec{z}_{1:t})},   (21)

where

Q(\vec{u}_{1:t}, \vec{z}_{1:t}) = \begin{cases} E[\vec{X}_{T+1}^\top M \vec{X}_{T+1} \,|\, \vec{u}_{1:T}, \vec{z}_{1:T}], & t = T; \\ E[\vec{X}_{t+1}^\top M \vec{X}_{t+1} + V(\vec{U}_{1:t}, \vec{Z}_{1:t+1}) \,|\, \vec{u}_{1:t}, \vec{z}_{1:t}], & t < T, \end{cases}   (22)

V(\vec{u}_{1:t-1}, \vec{z}_{1:t}) = \underset{\vec{u}_t}{\text{softmax}} \; Q(\vec{u}_{1:t}, \vec{z}_{1:t}) \triangleq \log \int_{\vec{u}_t} e^{Q(\vec{u}_{1:t}, \vec{z}_{1:t})} \, d\vec{u}_t.   (23)

The probability distribution can be interpreted as a softened relaxation of the Bellman optimal policy criterion [4] in which the softmax function replaces the max function: softmax_x f(x) ≜ log ∫_x e^{f(x)} dx. It serves as a smooth interpolator of the maximum function.
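To make the softmax concrete (this is our own illustrative sketch, not code from the paper): when Q is quadratic in ~u_t with a negative-definite quadratic term, the softmax over ~u_t is the log-normalizer of a Gaussian and has a closed form, which is what makes the recursion of Theorem 2 tractable. The one-dimensional check below compares that closed form against direct numerical integration.

import numpy as np

# Q(u) = w_uu * u^2 + 2 * b * u with w_uu < 0 (concave quadratic in u).
w_uu, b = -2.0, 1.05

# Closed form: softmax_u Q(u) = -b^2 / w_uu + (1/2) log(pi) - (1/2) log(-w_uu).
closed_form = -b ** 2 / w_uu + 0.5 * np.log(np.pi) - 0.5 * np.log(-w_uu)

# Numerical check: log of a Riemann-sum approximation of the integral of exp(Q(u)).
u = np.linspace(-10.0, 10.0, 200001)
du = u[1] - u[0]
numerical = np.log(np.sum(np.exp(w_uu * u ** 2 + 2.0 * b * u)) * du)

print(closed_form, numerical)   # the two values agree to several decimal places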

Unfortunately, in the LQG setting, the value functions of Theorem 1 are still unwieldy since they depend on the entire history of actions and observations. As in optimal LQG control [14], a more practical algorithm is obtained by separating state estimation from the policy distribution. Assuming a Gaussian belief over the current state, ~X_t | b_t(~u_{1:t-1}, ~z_{1:t}) ∼ N(~µ_{b_t}, Σ_{b_t}), that is based on the entire history, the policy can be recursively obtained according to Theorem 2.

³ Strong duality is subject to mild feasibility requirements on feature matching.

Theorem 2. Given a belief state which summarizes the history ~u_{1:t-1}, ~z_{1:t} up to time step t (i.e., \vec{X}_t | b_t \sim N(\vec{\mu}_{b_t}, \Sigma_{b_t})), the recurrence values (22), (23) are Markovian quadratic functions of the form:

Q(\vec{u}_t, \vec{\mu}_{b_t}) = \begin{bmatrix} \vec{u}_t \\ \vec{\mu}_{b_t} \end{bmatrix}^\top W_t \begin{bmatrix} \vec{u}_t \\ \vec{\mu}_{b_t} \end{bmatrix},   (24)

V(\vec{z}_t, \vec{u}_{t-1}, \vec{\mu}_{b_{t-1}}) = \begin{bmatrix} \vec{z}_t \\ \vec{u}_{t-1} \\ \vec{\mu}_{b_{t-1}} \end{bmatrix}^\top D_t \begin{bmatrix} \vec{z}_t \\ \vec{u}_{t-1} \\ \vec{\mu}_{b_{t-1}} \end{bmatrix},   (25)

W_t = \begin{cases} [B \;\; A]^\top M [B \;\; A], & t = T \\ [B \;\; A]^\top M [B \;\; A] + D_{t+1(U\mu,Z)} [CB \;\; CA] + [CB \;\; CA]^\top D_{t+1(Z,U\mu)} \\ \quad + [CB \;\; CA]^\top D_{t+1(Z,Z)} [CB \;\; CA] + D_{t+1(U\mu,U\mu)}, & t < T \end{cases}   (26)

D_t = P_t^\top \big( W_{t(\mu,\mu)} - W_{t(U,\mu)}^\top W_{t(U,U)}^{-1} W_{t(U,\mu)} \big) P_t,   (27)

where

P_t = \big[ E_{t+1} \;\;\; B - E_{t+1} C B \;\;\; A - E_{t+1} C A \big],
E_{t+1} = (\Sigma_d + A \Sigma_{b_t}^\top A^\top)^\top C^\top \big( \Sigma_o + C (\Sigma_d + A \Sigma_{b_t}^\top A^\top)^\top C^\top \big)^{-1}.

The probabilistic control policy for a belief state with mean \vec{\mu}_{b_t} is then:

\vec{U}_t \,|\, \vec{\mu}_{b_t} \sim N\Big( -W_{t(U,U)}^{-1} W_{t(U,\mu)} \vec{\mu}_{b_t}, \; -\tfrac{1}{2} W_{t(U,U)}^{-1} \Big).   (28)

Theorem 3 establishes the connection to optimal control: the mean/mode of the control distribution is the optimal control and, in fact, the (stochastic) maximum causal entropy probabilistic control policy can be directly obtained from the optimal control solution.

Theorem 3. The terms of the stochastic control policy (28) are related to the LQG optimal control laws as:

W_{t(U,U)} = B^\top F_{t+1} B; \qquad W_{t(U,\mu)} = B^\top F_{t+1} A,   (29)

where F_{t+1} is defined by the optimal control law (5), and the Lagrangian multiplier matrix M in (26) is given as the cost matrix in (5).

Thus, existing methods for solving LQG optimal control problems can be used to recover the stochastic control policy given the cost matrix M.
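In code (a sketch under our own naming conventions, using Theorem 3 rather than the direct D_t/W_t recursion), the per-step Gaussian policy parameters of Eq. (28) can be read off the backward Riccati matrices F_t. Note that the covariance magnitude here assumes the Lagrange multiplier's sign convention makes W_{t(U,U)} negative definite, so that -½ W_{t(U,U)}^{-1} is a valid covariance.

import numpy as np

def inverse_lqg_policy_from_riccati(A, B, M_cost, T):
    """Gaussian policy parameters implied by Eq. (28) and Theorem 3:
    mean gain -W_{t(U,U)}^{-1} W_{t(U,mu)} = -L_t and covariance of
    magnitude (1/2)(B^T F_{t+1} B)^{-1}."""
    F = M_cost.copy()                                  # F_{T+1} = M
    gains, covs = [None] * (T + 1), [None] * (T + 1)   # 1-indexed
    for t in range(T, 0, -1):
        BFB = B.T @ F @ B                              # |W_{t(U,U)}|
        BFA = B.T @ F @ A                              # |W_{t(U,mu)}|
        gains[t] = -np.linalg.solve(BFB, BFA)          # -L_t: policy mean is gains[t] @ mu_bt
        covs[t] = 0.5 * np.linalg.inv(BFB)             # policy covariance magnitude
        F = M_cost + A.T @ F @ A - BFA.T @ np.linalg.solve(BFB, BFA)   # Eq. (5)
    return gains, covs

# Sampling a control at time t given belief mean mu_bt:
#   u_t ~ N(gains[t] @ mu_bt, covs[t])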


3.5 Model Fitting

According to the previously developed theory of maximum causal entropy [28], the gradient of the Lagrangian dual with respect to the Lagrangian multiplier matrix M is:

E_f\left[ \sum_{t=1}^{T+1} \vec{X}_t \vec{X}_t^\top \right] - \tilde{E}\left[ \sum_{t=1}^{T+1} \vec{X}_t \vec{X}_t^\top \right].   (30)

This comes from the constraint of the convex optimization problem (20).

We can compute the expectation of the quadratic state moments over the distribution of state-control-observation trajectories provided by our estimated policy by conditioning on the mean and variance of the belief state:

E_f\left[ \vec{X}_t \vec{X}_t^\top \right] = E_{f(\vec{u}_{1:t-1}, \vec{z}_{1:t})}\left[ E_{f(\vec{x}_t | \vec{u}_{1:t-1}, \vec{z}_{1:t})}\left[ \vec{X}_t \vec{X}_t^\top \,\middle|\, \vec{U}_{1:t-1}, \vec{Z}_{1:t} \right] \right] = E\left[ \Sigma_{b_t} + \vec{\mu}_{b_t} \vec{\mu}_{b_t}^\top \right] = \Sigma_{b_t} + \text{Var}[\vec{\mu}_{b_t}] + E[\vec{\mu}_{b_t}] E[\vec{\mu}_{b_t}]^\top.

The mean and variance of the belief state, ~µ_{b_t} and Σ_{b_t}, are recursively computed according to a Kalman filter [12].
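A simple moment-matching loop (our own sketch; the paper does not prescribe an optimizer, and model_moment_fn is a hypothetical helper that evaluates the model-side moment via the belief-state identity above) would then adjust M until the gradient (30) vanishes:

import numpy as np

def fit_cost_matrix(empirical_moment, model_moment_fn, n, lr=1e-3, iters=500):
    """Gradient-based fitting of the Lagrange multiplier matrix M.
    empirical_moment: data-side sum_t E~[X_t X_t^T] (an n x n matrix).
    model_moment_fn(M): the same quantity under the stochastic policy induced by M,
    computed with the Kalman-filter belief statistics of Section 3.5."""
    M = np.zeros((n, n))
    for _ in range(iters):
        grad = model_moment_fn(M) - empirical_moment   # Eq. (30)
        M = M - lr * grad      # step direction is an assumption (dual minimized over M)
        M = 0.5 * (M + M.T)    # keep the multiplier symmetric
    return M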

4 Experimental Validation

We evaluate the performance of our inverse LQG approach on controlled data (Sec. 4.1) and real mouse cursor movement data (Sec. 4.2) to investigate its benefits in comparison to full-observability models of behavior, using the average empirical causal log-loss over test data. We refer to this empirical causal log-loss as the trajectory log-loss. Assume we have N test trajectories {~x_{1:T_n+1}, ~z_{1:T_n}, ~u_{1:T_n}}_{n=1}^{N}, and let \hat{f} be the probabilistic control policy learned from the training data; then the average trajectory log-loss is:

-\frac{1}{N} \sum_{n=1}^{N} \log \hat{f}(\vec{u}_{1:T_n} || \vec{z}_{1:T_n}) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log \hat{f}(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log \hat{f}(\vec{u}_t \,|\, \vec{\mu}_{b_t}).
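In code (a sketch; the trajectory format, belief_mean_fn, and the 1-indexed gains/covs lists follow our earlier illustrative sketches rather than anything specified in the paper), the trajectory log-loss is accumulated by running the belief recursion of Lemma 4 and evaluating the Gaussian policy of Eq. (28) at each observed control:

import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian."""
    d = x - mean
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def trajectory_log_loss(trajectories, belief_mean_fn, gains, covs):
    """Average trajectory log-loss over a list of (z_seq, u_seq) test pairs.
    belief_mean_fn(z_seq, u_seq) returns the belief means mu_b1..mu_bT."""
    total = 0.0
    for z_seq, u_seq in trajectories:
        mu_b = belief_mean_fn(z_seq, u_seq)
        for t, u_t in enumerate(u_seq, start=1):
            mean = gains[t] @ mu_b[t - 1]            # policy mean given the belief mean
            total += -gaussian_logpdf(u_t, mean, covs[t])
    return total / len(trajectories)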

4.1 Controlled demonstrations of the benefits of partial observability

In our first set of experiments, we investigate the benefits of incorporating partial observability into predictive inverse optimal control. This provides some insight into whether it is sufficient to simply ignore partial observability and use inverse optimal control (IOC) models that assume full observability. We vary the state and observation noise of an LQG control problem and measure the average empirical trajectory log-loss compared to treating the problem as a fully-observed linear-quadratic regulator (LQR) control process.

We collect data via an optimal LQG controller [14] applied to a spring-mass system:

\vec{X}_{t+1} = A\vec{X}_t + B\vec{U}_t + \epsilon_d, \qquad \vec{Z}_t = C\vec{X}_t + \epsilon_o,
\vec{X}_1 \sim N(\vec{0}, \Sigma_{d_1}), \qquad \epsilon_d \sim N(\vec{0}, \Sigma_d), \qquad \epsilon_o \sim N(\vec{0}, \Sigma_o),

A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad C = \begin{bmatrix} 1 & 0 \end{bmatrix}.

The controller minimizes the following expected quadratic cost function:

J = E\left[ \sum_{t=1}^{T+1} \vec{X}_t^\top Q \vec{X}_t \right],

where we set (using I as the identity matrix):

Q = I_{2\times2}, \qquad \Sigma_{d_1} = \Sigma_d = \sigma_d I_{2\times2}, \qquad \Sigma_o = \sigma_o I_{1\times1}.

From the setting of the observation dynamics C, only the first component of the two-dimensional state \vec{X}_t is observed, which provides the partial-observability scenario.

For each experiment where we vary the noise of the system, we generate 2000 state-observation-control trajectories of length T = 30 by applying the optimal LQG controller. We use the first 1000 trajectories as training data and the remaining 1000 trajectories as testing data. To simulate the LQR model in the LQG setting, we set C = I_{2×2} and let ~Z_t = ~X_t. We note that the average trajectory log-loss can be negative, as in these experiments, because it is taken over a continuous distribution.
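The setup is compact enough to write down directly (a sketch; only the matrices and noise settings come from the paper, the scaffolding and names are ours):

import numpy as np

# Spring-mass system of Section 4.1.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])        # only the first state component is observed
Q = np.eye(2)                      # state cost matrix
T = 30                             # trajectory length

def noise_covariances(sigma_d, sigma_o):
    """Covariances swept in the experiments: Sigma_d1 = Sigma_d = sigma_d * I,
    and a scalar observation noise Sigma_o = sigma_o."""
    return sigma_d * np.eye(2), sigma_d * np.eye(2), sigma_o * np.eye(1)

# The fully-observed LQR baseline instead uses C = I, so that Z_t = X_t.
C_lqr = np.eye(2)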

Figure 2, above, shows that LQG has significantly better performance than LQR as the observation noise σ_o increases. This is because the controller is basing its controls on noisy observations that are increasingly different from the true state. It also shows that the log-loss decreases as the observation noise increases. This is because, as the observation noise increases, the controller's state estimates are less certain. Paying large costs for controls becomes less worthwhile, and controls from a smaller range (closer to 0) are instead produced. These lower-variance controls are easier to predict and have smaller log-loss. As shown in Figure 2, below, when the state noise σ_d increases, it dominates the test performance of both LQG and LQR. This is because, as the state noise increases, state estimation becomes increasingly error-prone for both LQG and LQR.


Figure 2: Above: Withheld average trajectory log-loss as the observation noise, σ_o, increases (with fixed state transition dynamics noise σ_d = 0.01). Below: Withheld average trajectory log-loss as the state transition dynamics noise, σ_d, increases (with fixed observation noise σ_o = 1.0).

4.2 Estimating mouse cursor pointing trajectories

Modeling mouse cursor pointing motions is an important machine learning problem for human-computer interaction tasks. A number of interventional techniques have been developed to facilitate pointing target acquisition (e.g., adjusting the control-display ratio dynamics, enlarging targets, etc.) [3]. However, better predictions of the intended target are required for these intervention techniques to be successfully employed in the wild [29].

We use data captured from 20 non-motor-impaired computer users performing computer cursor pointing tasks to assess the benefits of the LQG approach versus previous LQR models that have been employed for this task [29]. Users are presented with a sequence of circular clicking targets to select, and their mouse cursor data is collected at 100Hz. We specifically investigate whether incorporating a response delay using our LQG framework provides better predictions than the LQR approach, which assumes an instantaneous response to the changes in mouse cursor position. Our assumption is instead that, due to imprecise human abilities for fine-grained control, cursor navigation is essentially an open-loop control problem and that incorporating feedback delay will produce better policy estimates. Some evidence of this is demonstrated by the cursor pointing trajectories in Figure 3.

Figure 3: Example mouse cursor trajectories terminating at small circle positions, exhibiting characteristics of delayed feedback.

We follow the previous work's control formulation [29]. The instantaneous state

\vec{x}_t \triangleq [\, x_t \;\; y_t \;\; \dot{x}_t \;\; \dot{y}_t \;\; \ddot{x}_t \;\; \ddot{y}_t \,]^\top

is represented by the relative position, velocity, and acceleration vectors towards the target and orthogonal to the target at discrete points in time. These dynamics (e.g., velocities and accelerations) are defined according to difference equations,

\begin{pmatrix} \dot{x}_t \\ \dot{y}_t \end{pmatrix} = \begin{pmatrix} x_t - x_{t-1} \\ y_t - y_{t-1} \end{pmatrix},   (31)

\begin{pmatrix} \ddot{x}_t \\ \ddot{y}_t \end{pmatrix} = \begin{pmatrix} \dot{x}_t - \dot{x}_{t-1} \\ \dot{y}_t - \dot{y}_{t-1} \end{pmatrix},   (32)

and can easily be expressed as a linear dynamics model with the control vector ~u_t representing the change in position. Under this dynamics model, mouse pointing motion data follows a linear relationship (with optional zero-mean Gaussian noise ε):

\vec{x}_{t+1} = A\vec{x}_t + B\vec{u}_t + \epsilon,

where

A = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}.

Though the cursor positions are located at discrete pixels and the controls are the discrete differences of these pixel locations, using a discrete model for estimating the control policy is not feasible. Specifically, since the dimensionality of the state space is six, any reasonably fine-grained discretization of each dimension (position, velocity, acceleration) will lead to an intractably large discrete decision process that is further exacerbated by partial observability (as a partially-observed Markov decision process).

We consider a control system with a delayed observation of t_0 time steps. This is formally represented in the LQG model by augmenting the LQG state with the previous t_0 states and having the observation dynamics only reveal the state from t_0 time steps ago. For example, a delay-one model has the following dynamics matrices:

A' = \begin{bmatrix} A & 0 \\ I & 0 \end{bmatrix}, \qquad B' = \begin{bmatrix} B \\ 0 \end{bmatrix}, \qquad C' = \begin{bmatrix} 0 & I \end{bmatrix}.
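The augmentation generalizes to any delay t_0 by stacking t_0 extra copies of the state (a sketch with our own helper name; with C = I and t_0 = 1 it reproduces the matrices A', B', C' above):

import numpy as np

def delay_augment(A, B, C, t0):
    """Build A', B', C' for an observation delay of t0 steps: the augmented state
    stacks [x_t, x_{t-1}, ..., x_{t-t0}] and the observation reveals only the
    oldest block (the state from t0 steps ago)."""
    n, m = A.shape[0], B.shape[1]
    k = t0 + 1                                          # number of stacked state copies
    A_aug = np.zeros((k * n, k * n))
    A_aug[:n, :n] = A                                   # newest block evolves with A
    for i in range(1, k):                               # older blocks shift down by one step
        A_aug[i * n:(i + 1) * n, (i - 1) * n:i * n] = np.eye(n)
    B_aug = np.zeros((k * n, m))
    B_aug[:n, :] = B
    C_aug = np.zeros((C.shape[0], k * n))
    C_aug[:, -n:] = C                                   # observe only the delayed block
    return A_aug, B_aug, C_aug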

We additionally compare our prediction approach against a direct policy estimation method: kth-order Markov models of orders k ∈ {1, 2, 3, 4}. For this continuous state-action setting, estimating the Markov model reduces to a linear regression problem of the form

\vec{s}_t = [\, \vec{s}_{t-1} \;\; \vec{s}_{t-2} \;\; \cdots \;\; \vec{s}_{t-k} \,] \vec{\alpha} + \epsilon,   (33)

with zero-mean Gaussian noise ε ∼ N(0, σ²). The state at each time is the x and y position of the mouse cursor. Regression parameters ~α are estimated by minimizing the sum of squared errors, as is standard in ordinary linear regression. Control estimates ~u_t are simply the difference between the next state estimate, ~s_t, and the previous state, ~s_{t-1}, with the distribution determined by the Gaussian model.
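As a baseline sketch (our own code and naming, not from the paper), the kth-order model of Eq. (33) can be fit with ordinary least squares on stacked lagged positions:

import numpy as np

def fit_kth_order_markov(positions, k):
    """Least-squares fit of Eq. (33). `positions` is a (T, 2) array of mouse
    cursor (x, y) positions from one trajectory."""
    X = np.hstack([positions[k - i - 1:len(positions) - i - 1] for i in range(k)])
    Y = positions[k:]
    alpha, *_ = np.linalg.lstsq(X, Y, rcond=None)       # (2k, 2) coefficient matrix
    resid = Y - X @ alpha
    sigma2 = resid.var(axis=0)                          # per-coordinate noise variance
    return alpha, sigma2

# Predicted control: u_t = s_t_hat - s_{t-1}, where s_t_hat = stacked_lags @ alpha.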

Of the 4,949 mouse cursor trajectories, we randomly select 3,000 as training data and use the remaining 1,949 for evaluation. In Figure 4, we evaluate different choices of delayed feedback, t_0. We use no observation noise. Note that the model is equivalent to the LQR setting when t_0 = 0. As shown in the figure, the LQG setting with a delay of t_0 = 3 has the best performance. The Markov models of 3rd and 4th order outperform the LQR model, but are not more predictive than the LQG models with delays of 1, 2, 3, or 4. (Note that the 1st-order Markov model is significantly worse and does not appear in the figure.) The advantage of the LQG model over the noiseless LQR inverse optimal control model and the direct policy estimation of the Markov model shows that modeling mouse pointing motions as an LQG problem is advantageous compared to the previous LQR model, which assumes instantaneous responses.

Figure 4: Average trajectory log-loss of: the LQG model with various amounts of delay, t_0; the LQR model; and Markov models of order 2, 3, 4.

Other partial-observability mechanisms are likely to also influence cursor pointing motions and improve the prediction of common overshooting and correcting motions. For example, a noisy observation model may be more appropriate than the delayed, perfect observation model we employ. However, our experiments provide a solid first step in predicting pointing motion control sequences using the LQG framework.

5 Discussion and Future Work

In this paper, we extended maximum entropy inverse optimal control to the LQG control setting. We established a separation property that allows inference in the resulting model to be performed efficiently. Despite the formulation of our approach being distinct from optimal LQG control, we found close connections between the two methods, including the ability to use an LQG solver as an integral part of the inverse LQG inference procedure. We demonstrated the advantages of the LQG representation for predictive inverse optimal control both on a synthetic dataset and on real mouse cursor data.

Of significant future interest are general methods for approximating non-linear control problems, with observed state-observation-action trajectories placed within the LQG framework and inverse optimal control used to construct predictive models. These linearizing approximations have been well studied in the control literature [14], but it remains to be seen whether reasonable learning can occur when an entire trajectory distribution must be approximated rather than the deterministic optimal controller.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 1227495.


References

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. International Conference on Machine Learning, pages 1–8, 2004.

[2] Monica Babes, Vukosi Marivate, Kaushik Subramanian, and Michael L. Littman. Apprenticeship learning about multiple intentions. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 897–904, 2011.

[3] Ravin Balakrishnan. "Beating" Fitts' law: virtual enhancements for pointing facilitation. International Journal of Human-Computer Studies, 61(6):857–874, 2004.

[4] R. Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, 6:679–684, 1957.

[5] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear matrix inequalities in system and control theory. SIAM, 15, 1994.

[6] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, March 2004.

[7] Jaedeug Choi and Kee-Eung Kim. Inverse reinforcement learning in partially observable environments. The Journal of Machine Learning Research, 12:691–730, 2011.

[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, 2006.

[9] Krishnamurthy Dvijotham and Emanuel Todorov. Inverse optimal control with linearly-solvable MDPs. In Proc. International Conference on Machine Learning, pages 335–342, 2010.

[10] P. D. Grünwald and A. P. Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32:1367–1433, 2004.

[11] P. Henry, C. Vollmer, B. Ferris, and D. Fox. Learning to navigate through crowded environments. In Proc. International Conference on Robotics and Automation, pages 981–986, 2010.

[12] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

[13] G. Kramer. Capacity results for the discrete memoryless network. IEEE Transactions on Information Theory, 49(1):4–21, Jan 2003.

[14] Huibert Kwakernaak and Raphael Sivan. Linear Optimal Control Systems, volume 1. Wiley-Interscience, New York, 1972.

[15] Sergey Levine and Vladlen Koltun. Continuous inverse optimal control with locally optimal examples. In International Conference on Machine Learning (ICML 2012), 2012.

[16] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems, pages 207–215, 2013.

[17] Hans Marko. The bidirectional communication theory – a generalization of information theory. IEEE Transactions on Communications, pages 1345–1351, 1973.

[18] James L. Massey. Causality, feedback and directed information. In Proc. IEEE International Symposium on Information Theory and Its Applications, pages 27–30, 1990.

[19] Gergely Neu and Csaba Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proc. UAI, pages 295–302, 2007.

[20] Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proc. International Conference on Machine Learning, pages 663–670, 2000.

[21] Haim H. Permuter, Young-Han Kim, and Tsachy Weissman. On directed information and gambling. In Proc. IEEE International Symposium on Information Theory, pages 1403–1407, 2008.

[22] D. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems 1, 1989.

[23] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In Proc. IJCAI, pages 2586–2591, 2007.

[24] N. Ratliff, J. A. Bagnell, and M. Zinkevich. Maximum margin planning. In Proc. ICML, pages 729–736, 2006.

[25] Robert F. Stengel. Optimal Control and Estimation. Courier Dover Publications, 2012.

[26] S. Tatikonda and S. Mitter. Control under communication constraints. IEEE Transactions on Automatic Control, 49(7):1056–1068, 2004.

[27] F. Topsøe. Information theoretical optimization techniques. Kybernetika, 15(1):8–27, 1979.

[28] Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. Modeling interaction via the principle of maximum causal entropy. In Proc. International Conference on Machine Learning, pages 1255–1262, 2010.

[29] Brian D. Ziebart, Anind K. Dey, and J. Andrew Bagnell. Probabilistic pointing target prediction via inverse optimal control. In Proceedings of the ACM International Conference on Intelligent User Interfaces, pages 1–10, 2012.

[30] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proc. AAAI Conference on Artificial Intelligence, pages 1433–1438, 2008.

[31] Brian D. Ziebart, Andrew Maas, Anind K. Dey, and J. Andrew Bagnell. Navigate like a cabbie: Probabilistic reasoning from observed context-aware behavior. In Proc. International Conference on Ubiquitous Computing, pages 322–331, 2008.


Proofs

Lemma 1. The constrained optimization of (19) is equivalent to:

\underset{\{f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T})\}}{\text{argmax}} \; H(\vec{U}_{1:T} || \vec{Z}_{1:T}, \vec{X}_{1:T})   (34)

where

f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T}) = \prod_{t=1}^{T} f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}_{1:t}),   (35)

and, for all t ∈ {1, ..., T}, \vec{u}_{1:t} \in \vec{U}_{1:t}, \vec{z}_{1:t} \in \vec{Z}_{1:t}, \vec{x}_{1:t}, \vec{x}'_{1:t} \in \vec{X}_{1:t},

f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}_{1:t}) \ge 0, \qquad \int_{\vec{u}_t \in \vec{U}_t} f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}_{1:t}) \, d\vec{u}_t = 1,   (36)

f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}_{1:t}) = f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}'_{1:t}).   (37)

Proof of Lemma 1. The previously developed theory of maximum causal entropy [28] shows that the causally conditioned probability distribution defined according to the affine constraints (15), (16), and (18) is equivalent to the one defined by the decomposition into a product of conditional probabilities (35), (36). We then show that the partial observability constraint (17) implies (37). For all \vec{u}_{1:T} \in \vec{U}_{1:T}, \vec{x}_{1:T}, \vec{x}'_{1:T} \in \vec{X}_{1:T}, and \vec{z}_{1:T} \in \vec{Z}_{1:T},

\prod_{t=1}^{T} f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}_{1:t}) = \prod_{t=1}^{T} f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}'_{1:t}).

In particular, it is possible that

f(\vec{u}_1 | \vec{z}_1, \vec{x}_1) \cdots f(\vec{u}_t | \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}_{1:t}) \cdots f(\vec{u}_T | \vec{u}_{1:T-1}, \vec{z}_{1:T}, \vec{x}_{1:T}) = f(\vec{u}_1 | \vec{z}_1, \vec{x}_1) \cdots f(\vec{u}_t | \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}'_{1:t}) \cdots f(\vec{u}_T | \vec{u}_{1:T-1}, \vec{z}_{1:T}, \vec{x}_{1:T}).

Thus, f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}_{1:t}) = f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}'_{1:t}). It is easy to show that (37) implies (17).

Lemma 2. The constrained optimization defined in Lemma 1 is equivalent to:

\underset{\{f(\vec{u}_{1:T} || \vec{z}_{1:T})\}}{\text{argmax}} \; H(\vec{U}_{1:T} || \vec{Z}_{1:T})   (38)

subject to, for all \vec{u}_{1:T} \in \vec{U}_{1:T} and \vec{z}_{1:T}, \vec{z}'_{1:T} \in \vec{Z}_{1:T},

f(\vec{u}_{1:T} || \vec{z}_{1:T}) \ge 0, \qquad \int_{\vec{u}'_{1:T} \in \vec{U}_{1:T}} f(\vec{u}'_{1:T} || \vec{z}_{1:T}) \, d\vec{u}'_{1:T} = 1,   (39)

and, for all τ \in \{1, \ldots, T\} such that \vec{z}_{1:\tau} = \vec{z}'_{1:\tau},

\int_{\vec{u}_{\tau+1:T} \in \vec{U}_{\tau+1:T}} f(\vec{u}_{1:T} || \vec{z}_{1:T}) \, d\vec{u}_{\tau+1:T} = \int_{\vec{u}_{\tau+1:T} \in \vec{U}_{\tau+1:T}} f(\vec{u}_{1:T} || \vec{z}'_{1:T}) \, d\vec{u}_{\tau+1:T}.   (40)

Proof of Lemma 2. For all t ∈ {1, ..., T}, \vec{u}_{1:t} \in \vec{U}_{1:t}, \vec{z}_{1:t} \in \vec{Z}_{1:t}, and \vec{x}_{1:t}, \vec{x}'_{1:t} \in \vec{X}_{1:t},

f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}_{1:t}) = f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}'_{1:t}) = f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}).

Then,

\prod_{t=1}^{T} f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}_{1:t}) = \prod_{t=1}^{T} f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}) = f(\vec{u}_{1:T} || \vec{z}_{1:T}).

As in the proof of Lemma 1, the causally conditioned probability distribution defined by a product of conditional probabilities is equivalent to the one defined by the affine constraints (39), (40).


To show that the objective function (38) is equivalent to (19), we first show that

\int_{\vec{x}_{1:T} \in \vec{X}_{1:T}} f(\vec{z}_{1:T}, \vec{x}_{1:T} || \vec{u}_{1:T-1})
= \int_{\vec{x}_{1:T}} \frac{\prod_{t=1}^{T} f(\vec{z}_t, \vec{x}_t | \vec{z}_{1:t-1}, \vec{x}_{1:t-1}, \vec{u}_{1:t-1}) \; \prod_{t=1}^{T} f(\vec{u}_t | \vec{u}_{1:t-1}, \vec{z}_{1:t}, \vec{x}_{1:t})}{\prod_{t=1}^{T} f(\vec{u}_t | \vec{u}_{1:t-1}, \vec{z}_{1:t})}
= \int_{\vec{x}_{1:T}} \frac{f(\vec{u}_{1:T}, \vec{z}_{1:T}, \vec{x}_{1:T})}{\prod_{t=1}^{T} f(\vec{u}_t | \vec{u}_{1:t-1}, \vec{z}_{1:t})}
= \frac{f(\vec{u}_{1:T}, \vec{z}_{1:T})}{\prod_{t=1}^{T} f(\vec{u}_t | \vec{u}_{1:t-1}, \vec{z}_{1:t})}
= \frac{\prod_{t=1}^{T} f(\vec{u}_t | \vec{u}_{1:t-1}, \vec{z}_{1:t}) \; \prod_{t=1}^{T} f(\vec{z}_t | \vec{z}_{1:t-1}, \vec{u}_{1:t-1})}{\prod_{t=1}^{T} f(\vec{u}_t | \vec{u}_{1:t-1}, \vec{z}_{1:t})}
= \prod_{t=1}^{T} f(\vec{z}_t | \vec{z}_{1:t-1}, \vec{u}_{1:t-1}) = f(\vec{z}_{1:T} || \vec{u}_{1:T-1}).

Then,

H(\vec{U}_{1:T} || \vec{Z}_{1:T}, \vec{X}_{1:T})
= -\int_{\vec{u}_{1:T}, \vec{z}_{1:T}, \vec{x}_{1:T}} f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T}) f(\vec{z}_{1:T}, \vec{x}_{1:T} || \vec{u}_{1:T-1}) \log f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T})
= -\int_{\vec{u}_{1:T}, \vec{z}_{1:T}} f(\vec{u}_{1:T} || \vec{z}_{1:T}) \log f(\vec{u}_{1:T} || \vec{z}_{1:T}) \int_{\vec{x}_{1:T}} f(\vec{z}_{1:T}, \vec{x}_{1:T} || \vec{u}_{1:T-1})
= -\int_{\vec{u}_{1:T}, \vec{z}_{1:T}} f(\vec{u}_{1:T} || \vec{z}_{1:T}) f(\vec{z}_{1:T} || \vec{u}_{1:T-1}) \log f(\vec{u}_{1:T} || \vec{z}_{1:T})
= H(\vec{U}_{1:T} || \vec{Z}_{1:T}).

Lemma 3. Suppose the constrained optimization problem in Lemma 2 has the following additional constraint (for F : \vec{U}_{1:T} \times \vec{Z}_{1:T} \to \mathbb{R}^N and \vec{c} \in \mathbb{R}^N):

E_{f(\vec{u}_{1:T}, \vec{z}_{1:T})}\left[ F(\vec{U}_{1:T}, \vec{Z}_{1:T}) \right] = \vec{c}.   (41)

Then the solution to this optimization problem has the form

f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}) = e^{Q(\vec{u}_{1:t}, \vec{z}_{1:t}) - V(\vec{u}_{1:t-1}, \vec{z}_{1:t})},

where the Q and V functions take the following recursive form:

Q(\vec{u}_{1:t}, \vec{z}_{1:t}) = \begin{cases} \lambda^\top F(\vec{u}_{1:T}, \vec{z}_{1:T}), & t = T; \\ E[V(\vec{U}_{1:t}, \vec{Z}_{1:t+1}) \,|\, \vec{u}_{1:t}, \vec{z}_{1:t}], & t < T, \end{cases}

V(\vec{u}_{1:t-1}, \vec{z}_{1:t}) = \underset{\vec{u}_t}{\text{softmax}} \; Q(\vec{u}_{1:t}, \vec{z}_{1:t}) \triangleq \log \int_{\vec{u}_t} e^{Q(\vec{u}_{1:t}, \vec{z}_{1:t})} \, d\vec{u}_t.

Proof of Lemma 3. We first show that, for any joint distribution g(\vec{u}_{1:T}, \vec{z}_{1:T}), the following equation holds:

E_g\left[ -\log f(\vec{U}_{1:T} || \vec{Z}_{1:T}) \right] = \int_{\vec{z}_1} f(\vec{z}_1) V(\vec{z}_1) - E_g\left[ \lambda^\top F(\vec{U}_{1:T}, \vec{Z}_{1:T}) \right].   (42)


E_g\left[ \sum_{t=1}^{T} -\log f(\vec{U}_t \,|\, \vec{U}_{1:t-1}, \vec{Z}_{1:t}) \right]
= E_g\left[ -\lambda^\top F(\vec{U}_{1:T}, \vec{Z}_{1:T}) - \sum_{t=1}^{T-1} Q(\vec{U}_{1:t}, \vec{Z}_{1:t}) + \sum_{t=1}^{T} V(\vec{U}_{1:t-1}, \vec{Z}_{1:t}) \right]
= E_g\left[ -\lambda^\top F(\vec{U}_{1:T}, \vec{Z}_{1:T}) \right] - \int_{\vec{u}_{1:T}, \vec{z}_{1:T}} g(\vec{u}_{1:T}, \vec{z}_{1:T}) \sum_{t=1}^{T-1} \int_{\vec{z}_{t+1}} f(\vec{z}_{t+1} | \vec{u}_{1:t}, \vec{z}_{1:t}) V(\vec{u}_{1:t}, \vec{z}_{1:t+1}) + \int_{\vec{u}_{1:T}, \vec{z}_{1:T}} g(\vec{u}_{1:T}, \vec{z}_{1:T}) \sum_{t=1}^{T} V(\vec{u}_{1:t-1}, \vec{z}_{1:t})
= E_g\left[ -\lambda^\top F(\vec{U}_{1:T}, \vec{Z}_{1:T}) \right] - \sum_{t=1}^{T-1} \int_{\vec{u}_{1:t}, \vec{z}_{1:t+1}} g(\vec{u}_{1:t}, \vec{z}_{1:t+1}) V(\vec{u}_{1:t}, \vec{z}_{1:t+1}) + \int_{\vec{u}_{1:T}, \vec{z}_{1:T}} g(\vec{u}_{1:T}, \vec{z}_{1:T}) \sum_{t=1}^{T} V(\vec{u}_{1:t-1}, \vec{z}_{1:t}),

which implies Equation (42).

For any arbitrary causally conditioned probability distribution g(\vec{u}_{1:T} || \vec{z}_{1:T}) satisfying the expectation constraint (41), we show:

H_g(\vec{U}_{1:T} || \vec{Z}_{1:T}) \le H_f(\vec{U}_{1:T} || \vec{Z}_{1:T}).

E_g\left[ -\log g(\vec{U}_{1:T} || \vec{Z}_{1:T}) \right]
= -\int_{\vec{u}_{1:T}, \vec{z}_{1:T}} g(\vec{u}_{1:T}, \vec{z}_{1:T}) \log\left( \frac{g(\vec{u}_{1:T} || \vec{z}_{1:T}) f(\vec{z}_{1:T} || \vec{u}_{1:T-1})}{f(\vec{u}_{1:T} || \vec{z}_{1:T}) f(\vec{z}_{1:T} || \vec{u}_{1:T-1})} \, f(\vec{u}_{1:T} || \vec{z}_{1:T}) \right)
= -D_{KL}\big( g(\vec{u}_{1:T}, \vec{z}_{1:T}) \,||\, f(\vec{u}_{1:T}, \vec{z}_{1:T}) \big) - \int_{\vec{u}_{1:T}, \vec{z}_{1:T}} g(\vec{u}_{1:T}, \vec{z}_{1:T}) \log f(\vec{u}_{1:T} || \vec{z}_{1:T})
\le -\int_{\vec{u}_{1:T}, \vec{z}_{1:T}} g(\vec{u}_{1:T}, \vec{z}_{1:T}) \log f(\vec{u}_{1:T} || \vec{z}_{1:T})
= \int_{\vec{z}_1} f(\vec{z}_1) V(\vec{z}_1) - E_g\left[ \lambda^\top F(\vec{U}_{1:T}, \vec{Z}_{1:T}) \right]
= \int_{\vec{z}_1} f(\vec{z}_1) V(\vec{z}_1) - E_f\left[ \lambda^\top F(\vec{U}_{1:T}, \vec{Z}_{1:T}) \right]
= H_f(\vec{U}_{1:T} || \vec{Z}_{1:T}).

D_{KL} is the Kullback-Leibler divergence, which is non-negative [8]. Thus, f(\vec{u}_t \,|\, \vec{u}_{1:t-1}, \vec{z}_{1:t}) is the solution to the optimization problem in Lemma 2 with the expectation constraint (41) incorporated.

Proof of Theorem 1. We first incorporate the expectation constraint (20) into the constrained optimization problem defined in Lemma 2:

E_{f(\vec{u}_{1:T}, \vec{z}_{1:T+1}, \vec{x}_{1:T+1})}\left[ \sum_{t=1}^{T+1} \vec{X}_t \vec{X}_t^\top \right]
= \int_{\vec{u}_{1:T}, \vec{z}_{1:T+1}, \vec{x}_{1:T+1}} f(\vec{u}_{1:T} || \vec{z}_{1:T}, \vec{x}_{1:T}) f(\vec{z}_{1:T+1}, \vec{x}_{1:T+1} || \vec{u}_{1:T}) \sum_{t=1}^{T+1} \vec{x}_t \vec{x}_t^\top
= \int_{\vec{u}_{1:T}, \vec{z}_{1:T}} f(\vec{u}_{1:T} || \vec{z}_{1:T}) f(\vec{z}_{1:T} || \vec{u}_{1:T-1}) \frac{\int_{\vec{x}_{1:T+1}, \vec{z}_{T+1}} f(\vec{z}_{1:T+1}, \vec{x}_{1:T+1} || \vec{u}_{1:T}) \sum_{t=1}^{T+1} \vec{x}_t \vec{x}_t^\top}{f(\vec{z}_{1:T} || \vec{u}_{1:T-1})}
= E_{f(\vec{u}_{1:T}, \vec{z}_{1:T})}\left[ \frac{\int_{\vec{X}_{1:T+1}, \vec{Z}_{T+1}} f(\vec{Z}_{1:T+1}, \vec{X}_{1:T+1} || \vec{U}_{1:T}) \sum_{t=1}^{T+1} \vec{X}_t \vec{X}_t^\top}{f(\vec{Z}_{1:T} || \vec{U}_{1:T-1})} \right].


According to Lemma 3, the solution to the constrained problem defined in Lemma 2 with the expectation constraint (41) incorporated takes the following recursive form:

Q(\vec{u}_{1:t}, \vec{z}_{1:t}) = \begin{cases} \dfrac{\int_{\vec{x}_{1:T+1}, \vec{z}_{T+1}} f(\vec{z}_{1:T+1}, \vec{x}_{1:T+1} || \vec{u}_{1:T}) \sum_{t=1}^{T+1} \vec{x}_t^\top M \vec{x}_t}{f(\vec{z}_{1:T} || \vec{u}_{1:T-1})}, & t = T; \\[6pt] E[V(\vec{U}_{1:t}, \vec{Z}_{1:t+1}) \,|\, \vec{u}_{1:t}, \vec{z}_{1:t}], & t < T, \end{cases}

V(\vec{u}_{1:t-1}, \vec{z}_{1:t}) = \underset{\vec{u}_t}{\text{softmax}} \; Q(\vec{u}_{1:t}, \vec{z}_{1:t}) \triangleq \log \int_{\vec{u}_t} e^{Q(\vec{u}_{1:t}, \vec{z}_{1:t})} \, d\vec{u}_t.

The terminal value decomposes as

Q(\vec{u}_{1:T}, \vec{z}_{1:T}) = \underbrace{\frac{\int_{\vec{x}_{1:T}} f(\vec{z}_{1:T}, \vec{x}_{1:T} || \vec{u}_{1:T-1}) \, E[\vec{X}_{T+1}^\top M \vec{X}_{T+1} \,|\, \vec{x}_T, \vec{u}_T]}{f(\vec{z}_{1:T} || \vec{u}_{1:T-1})}}_{\text{we define this as } Q'(\vec{u}_{1:T}, \vec{z}_{1:T})} + \underbrace{\frac{\int_{\vec{x}_{1:T}} f(\vec{z}_{1:T}, \vec{x}_{1:T} || \vec{u}_{1:T-1}) \sum_{t=1}^{T} \vec{x}_t^\top M \vec{x}_t}{f(\vec{z}_{1:T} || \vec{u}_{1:T-1})}}_{\text{a constant with respect to } \vec{u}_T \text{; we define it as } W_T},

Q'(\vec{u}_{1:T}, \vec{z}_{1:T}) = \frac{\int_{\vec{x}_{1:T}} f(\vec{z}_{1:T}, \vec{x}_{1:T} || \vec{u}_{1:T-1}) f(\vec{u}_{1:T} || \vec{x}_{1:T}, \vec{z}_{1:T}) \, E[\vec{X}_{T+1}^\top M \vec{X}_{T+1} \,|\, \vec{x}_T, \vec{u}_T]}{f(\vec{z}_{1:T} || \vec{u}_{1:T-1}) f(\vec{u}_{1:T} || \vec{z}_{1:T})}
= \frac{\int_{\vec{x}_{T+1}} f(\vec{x}_{T+1}, \vec{u}_{1:T}, \vec{z}_{1:T}) \, \vec{x}_{T+1}^\top M \vec{x}_{T+1}}{f(\vec{u}_{1:T}, \vec{z}_{1:T})}
= E\left[ \vec{X}_{T+1}^\top M \vec{X}_{T+1} \,\middle|\, \vec{u}_{1:T}, \vec{z}_{1:T} \right].

Let V'(\vec{u}_{1:T-1}, \vec{z}_{1:T}) = \log \int_{\vec{u}_T} e^{Q'(\vec{u}_{1:T}, \vec{z}_{1:T})} \, d\vec{u}_T. Then

Q(\vec{u}_{1:T-1}, \vec{z}_{1:T-1}) = \int_{\vec{z}_T} f(\vec{z}_T | \vec{u}_{1:T-1}, \vec{z}_{1:T-1}) \big( W_T + V'(\vec{u}_{1:T-1}, \vec{z}_{1:T}) \big)
= \frac{\int_{\vec{z}_T, \vec{x}_{1:T}} f(\vec{z}_{1:T}, \vec{x}_{1:T} || \vec{u}_{1:T-1}) f(\vec{u}_{1:T-1} || \vec{x}_{1:T-1}, \vec{z}_{1:T-1}) \, \vec{x}_T^\top M \vec{x}_T}{f(\vec{z}_{1:T-1} || \vec{u}_{1:T-2}) f(\vec{u}_{1:T-1} || \vec{z}_{1:T-1})} + E\left[ V'(\vec{U}_{1:T-1}, \vec{Z}_{1:T}) \,\middle|\, \vec{u}_{1:T-1}, \vec{z}_{1:T-1} \right] + W_{T-1}
= \underbrace{E\left[ \vec{X}_T^\top M \vec{X}_T + V'(\vec{U}_{1:T-1}, \vec{Z}_{1:T}) \,\middle|\, \vec{u}_{1:T-1}, \vec{z}_{1:T-1} \right]}_{\text{we define this as } Q'(\vec{u}_{1:T-1}, \vec{z}_{1:T-1})} + W_{T-1},

and let V'(\vec{u}_{1:T-2}, \vec{z}_{1:T-1}) = \log \int_{\vec{u}_{T-1}} e^{Q'(\vec{u}_{1:T-1}, \vec{z}_{1:T-1})} \, d\vec{u}_{T-1}.

For t < T - 1, the argument for Q'(\vec{u}_{1:t}, \vec{z}_{1:t}) and V'(\vec{u}_{1:t-1}, \vec{z}_{1:t}) is similar. We redefine Q(\vec{u}_{1:t}, \vec{z}_{1:t}) = Q'(\vec{u}_{1:t}, \vec{z}_{1:t}) and V(\vec{u}_{1:t-1}, \vec{z}_{1:t}) = V'(\vec{u}_{1:t-1}, \vec{z}_{1:t}), which gives the recursive form in Theorem 1.

Lemma 4. The distribution of the belief state, \vec{X}_t | b_t \sim N(\vec{\mu}_{b_t}, \Sigma_{b_t}), is recursively defined as follows, and \Sigma_{b_t} is independent of b_t:

\vec{\mu}_{b_1} = \vec{\mu} + \Sigma_{d_1}^\top C^\top (\Sigma_o + C \Sigma_{d_1}^\top C^\top)^{-1} (\vec{Z}_1 - C\vec{\mu}),   (43)

\Sigma_{b_1} = \Sigma_{d_1} - \Sigma_{d_1}^\top C^\top (\Sigma_o + C \Sigma_{d_1}^\top C^\top)^{-1} C \Sigma_{d_1},   (44)

\vec{\mu}_{b_{t+1}} = B\vec{U}_t + A\vec{\mu}_{b_t} + (\Sigma_d + A \Sigma_{b_t}^\top A^\top)^\top C^\top \big( \Sigma_o + C (\Sigma_d + A \Sigma_{b_t}^\top A^\top)^\top C^\top \big)^{-1} \big( \vec{Z}_{t+1} - C(B\vec{U}_t + A\vec{\mu}_{b_t}) \big),   (45)

\Sigma_{b_{t+1}} = \Sigma_d + A \Sigma_{b_t}^\top A^\top - (\Sigma_d + A \Sigma_{b_t}^\top A^\top)^\top C^\top \big( \Sigma_o + C (\Sigma_d + A \Sigma_{b_t}^\top A^\top)^\top C^\top \big)^{-1} C (\Sigma_d + A \Sigma_{b_t}^\top A^\top).   (46)
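A direct transcription of Lemma 4 (our own sketch; because covariances are symmetric, the inner transposes drop out of the arithmetic):

import numpy as np

def belief_init(mu, Sigma_d1, z1, C, Sigma_o):
    """Initial belief from Eqs. (43)-(44)."""
    S = Sigma_o + C @ Sigma_d1 @ C.T
    K = Sigma_d1 @ C.T @ np.linalg.inv(S)
    return mu + K @ (z1 - C @ mu), Sigma_d1 - K @ C @ Sigma_d1

def belief_update(mu_b, Sigma_b, u, z_next, A, B, C, Sigma_d, Sigma_o):
    """One step of the Kalman-filter belief recursion of Eqs. (45)-(46)."""
    pred_mean = B @ u + A @ mu_b                 # predicted state mean
    pred_cov = Sigma_d + A @ Sigma_b @ A.T       # predicted state covariance
    S = Sigma_o + C @ pred_cov @ C.T             # innovation covariance
    K = pred_cov @ C.T @ np.linalg.inv(S)        # gain
    mu_next = pred_mean + K @ (z_next - C @ pred_mean)   # Eq. (45)
    Sigma_next = pred_cov - K @ C @ pred_cov             # Eq. (46)
    return mu_next, Sigma_next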


Proof of Lemma 4. Since \vec{Z}_1 | \vec{x}_1 \sim N(C\vec{x}_1, \Sigma_o) and \vec{X}_1 \sim N(\vec{\mu}, \Sigma_{d_1}), applying Gaussian transformation techniques, it is easy to show that the distribution of the initial belief state \vec{X}_1 | b_1 (that is, \vec{X}_1 | \vec{z}_1) is a Gaussian distribution with mean (43) and variance (44).

Note that f(\vec{x}_{t+1} | \vec{x}_t, \vec{u}_t, b_t) = f(\vec{x}_{t+1} | \vec{x}_t, \vec{u}_t), with \vec{X}_{t+1} | \vec{x}_t, \vec{u}_t \sim N(A\vec{x}_t + B\vec{u}_t, \Sigma_d), and f(\vec{x}_t | \vec{u}_t, b_t) = f(\vec{x}_t | b_t), with \vec{X}_t | b_t \sim N(\vec{\mu}_{b_t}, \Sigma_{b_t}). Then

\vec{X}_{t+1} | \vec{u}_t, b_t \sim N\big( B\vec{u}_t + A\vec{\mu}_{b_t}, \; \Sigma_d + A \Sigma_{b_t}^\top A^\top \big).

Furthermore, f(\vec{z}_{t+1} | \vec{x}_{t+1}, \vec{u}_t, b_t) = f(\vec{z}_{t+1} | \vec{x}_{t+1}), with \vec{Z}_{t+1} | \vec{x}_{t+1} \sim N(C\vec{x}_{t+1}, \Sigma_o). Thus, it is easy to show that the joint distribution of (\vec{X}_{t+1}, \vec{Z}_{t+1}) | \vec{u}_t, b_t is:

N\left( \begin{bmatrix} B\vec{u}_t + A\vec{\mu}_{b_t} \\ C(B\vec{u}_t + A\vec{\mu}_{b_t}) \end{bmatrix}, \; \begin{bmatrix} \Sigma_d + A \Sigma_{b_t}^\top A^\top & (\Sigma_d + A \Sigma_{b_t}^\top A^\top)^\top C^\top \\ C(\Sigma_d + A \Sigma_{b_t}^\top A^\top) & \Sigma_o + C (\Sigma_d + A \Sigma_{b_t}^\top A^\top)^\top C^\top \end{bmatrix} \right).

Finally, f(\vec{x}_{t+1} | b_{t+1}) = f(\vec{x}_{t+1} | \vec{z}_{t+1}, \vec{u}_t, b_t) = f(\vec{x}_{t+1}, \vec{z}_{t+1} | \vec{u}_t, b_t) / f(\vec{z}_{t+1} | \vec{u}_t, b_t), which gives the distribution of \vec{X}_{t+1} | b_{t+1} with mean (45) and variance (46).

Proof of Theorem 2.

E[\vec{X}_{t+1}^\top M \vec{X}_{t+1} \,|\, \vec{u}_{1:t}, \vec{z}_{1:t}] = E[\vec{X}_{t+1}^\top M \vec{X}_{t+1} \,|\, \vec{u}_t, b_t]
= (B\vec{u}_t + A\vec{\mu}_{b_t})^\top M (B\vec{u}_t + A\vec{\mu}_{b_t}) + \text{tr}\big( M(\Sigma_d + A \Sigma_{b_t}^\top A^\top) \big)
= \begin{bmatrix} \vec{u}_t \\ \vec{\mu}_{b_t} \end{bmatrix}^\top [B \;\; A]^\top M [B \;\; A] \begin{bmatrix} \vec{u}_t \\ \vec{\mu}_{b_t} \end{bmatrix} + \text{constant}.

Thus Q(\vec{u}_{1:T}, \vec{z}_{1:T}) = Q(\vec{u}_T, \vec{\mu}_{b_T}) = E[\vec{X}_{T+1}^\top M \vec{X}_{T+1} \,|\, \vec{u}_T, b_T], which gives W_T.

V(\vec{u}_{1:T-1}, \vec{z}_{1:T}) = V(\vec{\mu}_{b_T}) = V(\vec{z}_T, \vec{u}_{T-1}, \vec{\mu}_{b_{T-1}}) = \log \int_{\vec{u}_T} e^{Q(\vec{u}_T, \vec{\mu}_{b_T})} \, d\vec{u}_T
= \vec{\mu}_{b_T}^\top \big( W_{T(\mu,\mu)} - W_{T(U,\mu)}^\top W_{T(U,U)}^{-1} W_{T(U,\mu)} \big) \vec{\mu}_{b_T} + \text{constant}
= \begin{bmatrix} \vec{z}_T \\ \vec{u}_{T-1} \\ \vec{\mu}_{b_{T-1}} \end{bmatrix}^\top P_T^\top \big( W_{T(\mu,\mu)} - W_{T(U,\mu)}^\top W_{T(U,U)}^{-1} W_{T(U,\mu)} \big) P_T \begin{bmatrix} \vec{z}_T \\ \vec{u}_{T-1} \\ \vec{\mu}_{b_{T-1}} \end{bmatrix} + \text{constant},

which gives D_T. Thus, writing C_{BA} \triangleq [CB \;\; CA],

E[V(\vec{U}_{1:t}, \vec{Z}_{1:t+1}) \,|\, \vec{u}_{1:t}, \vec{z}_{1:t}] = E[V(\vec{Z}_{t+1}, \vec{U}_t, \vec{\mu}_{b_t}) \,|\, \vec{u}_t, \vec{\mu}_{b_t}]
= E\left[ \begin{bmatrix} \vec{u}_t \\ \vec{\mu}_{b_t} \end{bmatrix}^\top D_{t+1(U\mu,Z)} \vec{Z}_{t+1} + \vec{Z}_{t+1}^\top D_{t+1(Z,U\mu)} \begin{bmatrix} \vec{u}_t \\ \vec{\mu}_{b_t} \end{bmatrix} \,\middle|\, \begin{bmatrix} \vec{u}_t \\ \vec{\mu}_{b_t} \end{bmatrix} \right] + E\left[ \vec{Z}_{t+1}^\top D_{t+1(Z,Z)} \vec{Z}_{t+1} \,\middle|\, \begin{bmatrix} \vec{u}_t \\ \vec{\mu}_{b_t} \end{bmatrix} \right] + \begin{bmatrix} \vec{u}_t \\ \vec{\mu}_{b_t} \end{bmatrix}^\top D_{t+1(U\mu,U\mu)} \begin{bmatrix} \vec{u}_t \\ \vec{\mu}_{b_t} \end{bmatrix} + \text{constant}
= \begin{bmatrix} \vec{u}_t \\ \vec{\mu}_{b_t} \end{bmatrix}^\top \big( D_{t+1(U\mu,Z)} C_{BA} + C_{BA}^\top D_{t+1(Z,U\mu)} + C_{BA}^\top D_{t+1(Z,Z)} C_{BA} + D_{t+1(U\mu,U\mu)} \big) \begin{bmatrix} \vec{u}_t \\ \vec{\mu}_{b_t} \end{bmatrix} + \text{constant}.

Q(\vec{u}_t, \vec{\mu}_{b_t}) = E[\vec{X}_{t+1}^\top M \vec{X}_{t+1} + V(\vec{Z}_{t+1}, \vec{U}_t, \vec{\mu}_{b_t}) \,|\, \vec{u}_{1:t}, \vec{z}_{1:t}], which gives W_t in (26). The quadratic form of V(\vec{z}_t, \vec{u}_{t-1}, \vec{\mu}_{b_{t-1}}) is similar to that of V(\vec{z}_T, \vec{u}_{T-1}, \vec{\mu}_{b_{T-1}}), which gives D_t in (27).


Proof of Theorem 3. It is easy to check that the initial setting W_T = [B \;\; A]^\top M [B \;\; A] matches (5). For the general case, we plug D_{t+1} (27) into W_t (26) and check W_{t(U,U)} first. To simplify the proof, define

\phi_t = W_{t(\mu,\mu)} - W_{t(U,\mu)}^\top W_{t(U,U)}^{-1} W_{t(U,\mu)}.

Then, from (26) and (27),

W_t = [B \;\; A]^\top M [B \;\; A] + [B - E_{t+1}CB \;\; A - E_{t+1}CA]^\top \phi_{t+1} E_{t+1} [CB \;\; CA] + [CB \;\; CA]^\top E_{t+1}^\top \phi_{t+1} [B - E_{t+1}CB \;\; A - E_{t+1}CA] + [CB \;\; CA]^\top E_{t+1}^\top \phi_{t+1} E_{t+1} [CB \;\; CA] + [B - E_{t+1}CB \;\; A - E_{t+1}CA]^\top \phi_{t+1} [B - E_{t+1}CB \;\; A - E_{t+1}CA],

W_{t(U,U)} = B^\top M B + (B - E_{t+1}CB)^\top \phi_{t+1} E_{t+1} C B + (E_{t+1}CB)^\top \phi_{t+1} (B - E_{t+1}CB) + (E_{t+1}CB)^\top \phi_{t+1} E_{t+1} C B + (B - E_{t+1}CB)^\top \phi_{t+1} (B - E_{t+1}CB)
= B^\top M B + B^\top \phi_{t+1} B.

That is, B^\top F_{t+1} B = B^\top M B + B^\top \phi_{t+1} B.   (47)

By expanding \phi_{t+1}, Equation (47) matches Equation (5). W_{t(U,\mu)}, W_{t(\mu,U)}, and W_{t(\mu,\mu)} follow by a similar argument.

