Active Learning of Dynamics for Data-Driven Control Using Koopman Operators

Ian Abraham and Todd D. Murphey

Abstract—This paper presents an active learning strategy for robotic systems that takes into account task information, enables fast learning, and allows control to be readily synthesized by taking advantage of the Koopman operator representation. We first motivate the use of representing nonlinear systems as linear Koopman operator systems by illustrating the improved model-based control performance with an actuated Van der Pol system. Information-theoretic methods are then applied to the Koopman operator formulation of dynamical systems where we derive a controller for active learning of robot dynamics. The active learning controller is shown to increase the rate of information about the Koopman operator. In addition, our active learning controller can readily incorporate policies built on the Koopman dynamics, enabling the benefits of fast active learning and improved control. Results using a quadcopter illustrate single-execution active learning and stabilization capabilities during free-fall. The results for active learning are extended for automating Koopman observables and we implement our method on real robotic systems.

Index Terms—Active Learning, Information Theoretic Control, Koopman Operators, Single Execution Learning.

I. INTRODUCTION

In order to enable active learning for robots, we need a control algorithm that readily incorporates task information, learns a dynamic model representation, and is capable of incorporating policies for solving additional tasks during the learning process. In this work, we develop an active learning controller that enables a robot to learn an expressive representation of its dynamics using Koopman operators [1]–[4]. Koopman operators represent a nonlinear dynamical system as a linear, infinite-dimensional system by evolving functions of the state (also known as function observables) in time [1]–[4]. Often, these linear representations can capture the behavior of the dynamics globally while enabling the use of known linear quadratic control methods. As a result, the Koopman operator representation changes how we represent the dynamic constraints of robotic systems, carrying more nonlinear dynamic information and often improving control authority.

Koopman operator dynamics are typically found through data-driven methods that generate an approximation to the theoretical infinite-dimensional Koopman operator [2], [4], [5]. These data-driven methods require robotic systems to be actuated in order to collect data. The process for data collection in robotics is an active process that relies on control; therefore, learning the Koopman operator formulation, for robotics, is an active learning process.

In this paper, we use the Koopman operator representation for improving control authority of nonlinear robotic systems. Moreover, we address the problem of calculating the linear representation of the Koopman operator by exploiting an information-theoretic active learning strategy based on the structure of Koopman operators. As a result, we are able to demonstrate active learning through data-driven control in real-time settings where only a single execution of the robotic system is possible. Thus, the contribution of this paper is a method for active learning of Koopman operator representations of nonlinear dynamical systems which exploits both information-theoretic measures and improved control authority based on Koopman operators.

Authors are with the Neuroscience and Robotics lab (NxR) at the Department of Mechanical Engineering, Northwestern University, 2145 Sheridan Road, Evanston, IL, 60208. Videos of the experiments and sample code can be found at https://sites.google.com/view/active-learning-koopman-op. email: [email protected], [email protected]. Manuscript received September 15, 2018; revised June 11, 2019.

A. History and Related Work

Active learning in robotics has recently been a topic of interest [6]–[10]. Much work has been done in active learning for parameter identification [11]–[14] as well as active learning for state-control mappings in reinforcement learning [9], [15]–[18] and adaptive control [19]–[21]. In particular, much of the mentioned work refers to exciting a robot's dynamics (using information-theoretic measures [10], [12], [13], reward functions [9], [10], [15], [17] in reinforcement learning, and other methods [22], [23]) in order to obtain the "best" set of measurements that resolve a parameter or the "best-case" mapping (either of the state-control map or of the dynamics). This paper uses active learning to enable robots to learn Koopman operator representations of a robot's own dynamic process.

Koopman operators were first proposed in 1931 in work by B. O. Koopman [1]. At the time, approximating the Koopman operator was computationally infeasible; the onset of computers enabled data-driven methods that approximate the Koopman operator [2], [4], [24]. Other research involves computation of Koopman eigenfunctions and Koopman-invariant subspaces that determine the size of the Koopman operator [25]–[27]. This allows for finite-dimensional Koopman operators that capture nonlinear dynamics while compressing the overall state dimension used to represent the dynamical system.

Recent work on combining model-based control methods and Koopman operators has suggested that control based on Koopman operators is a promising avenue for many fields including robotics [3], [5], [26]–[33]. In particular, recent work from the authors implemented a controller using a Koopman operator representation of a robotic system in an experimental setting of a robot in sand [32]. Koopman operators are closely related to latent variable (embedded) dynamic models [34]. In embedded dynamic models, an autoencoder [34], [35] is used to compress the original state-space into a lower-dimensional representation. The embedded dynamics model then only evolves the states that are useful for predicting the overall dynamical system's behavior. Koopman operators represent the state of some dynamical system in a higher- or lower-dimensional representation where the evolution of the embedding is a linear dynamical system. Thus, Koopman operators are a special case of an embedded dynamic model where the latent variable describes the nonlinearities of a dynamical system and is represented as a linear differential equation.

B. Relation to Previous Work

We extend previous work in [32] with new examples of control with Koopman operator representations of robotic systems. In addition, we provide an example in Section III which gives further intuition for the use of Koopman operator dynamics. Moreover, we address design choices when generating a Koopman operator dynamic representation of a robotic system and provide a methodology towards automating these design choices. Last, we introduce a method for enabling the robot to actively learn Koopman operator dynamics while taking advantage of linear quadratic (LQ) approaches for control. We note that there is no overlap between the results and theoretical content presented in this paper and [32].

C. Outline

The paper outline is as follows: Section II introduces the Koopman operator and data-driven methods to approximate the Koopman operator from data, including a recursively defined online approach for approximating the Koopman operator. Section III motivates using Koopman operator representations of dynamical systems for control. Section IV introduces a controller that enables robots to learn the Koopman operator dynamics. Simulated results for active learning using our method are provided with comparisons in Section V. Section VI discusses methods for automating the design specifications of the Koopman operator. Last, robot experiments are provided in Section VII and concluding remarks in Section VIII.

II. KOOPMAN OPERATORS

This section introduces the Koopman operator and formulates the Koopman operator for control of robotic systems.

A. Infinite Dimensional Koopman Operator

Let us first define the continuous dynamical system whose state evolution is defined by

x(t_i + t_s) = F(x(t_i), u(t_i), t_s) = x(t_i) + ∫_{t_i}^{t_i + t_s} f(x(s), u(s)) ds,   (1)

where t_i is the ith sampling time and t_s is the sampling interval, x(t) : R → R^n is the state of the robot at time t, u(t) : R → R^m is the applied actuation to the robot at time t, f(x, u) : R^n × R^m → R^n is the unknown dynamics of the robot, and F(x(t_i), u(t_i), t_s) is the mapping which advances the state x(t_i) to x(t_i + t_s). In addition, let us define an observation function g(x(t)) : R^n → R^c ∈ C, where C is the space of all observation functions. The Koopman operator K is an infinite-dimensional operator that acts directly on the elements of C:

[K g](x(t_i)) = g(F(x(t_i), u(t_i), t_s)),   (2)

where u(t_i), t_s are implicitly defined in F such that

K g(x(t_i)) = g(F(x(t_i), u(t_i), t_s)) = g(x(t_{i+1})).   (3)

In words, the Koopman operator K takes any observation of state g(x(t_i)) at time t_i and time shifts the observation, subject to the control u(t_i), to the next observable time t_{i+1}. This formulation assumes equal time spacing t_s = t_{i+1} − t_i = t_i − t_{i−1}.

B. Approximating the Data-Driven Koopman Operator

The Koopman operator K is infeasible to compute in the infinite-dimensional space. A finite subspace approximation K ∈ R^{c×c} to the operator, acting on C̄ ⊂ C, is used, where we define a subset of function observables (or observations of state) z(x) = [ψ_1(x), ψ_2(x), …, ψ_c(x)]^⊤ ∈ R^c. Each scalar-valued ψ_i ∈ C, and the span of all ψ_i is the finite subspace C̄ ⊂ C. The operator K acting on z(x(t_i)) is then represented in discrete time as

z(x(t_{i+1})) = K z(x(t_i)) + r(x(t_i)),   (4)

where r(x) ∈ C is the residual function error. In principle, as c → ∞, the residual error goes to zero [3], [4]; however, it is sometimes possible to find c < ∞ such that r(x) = 0 [26]. Equation (4) gives us the discrete-time transition of observations of state in time. We overload the notation for the Koopman operator and write the differential equation for the observations of state as

ż = K z(x(t_i)) + r(x(t_i)),   (5)

where the continuous-time K is acquired by taking the matrix logarithm as t_{i+1} − t_i → 0.

Provided a data set D = {x(t_m)}_{m=0}^{M}, we can compute the approximate Koopman operator K using least-squares minimization over the parameters of K:

min_K  (1/2) ∑_{m=0}^{M−1} ‖ z(x(t_{m+1})) − K z(x(t_m)) ‖².   (6)

Since (6) is convex in K, the solution is given by

K = A G†,   (7)

where † denotes the Moore–Penrose pseudoinverse and

A = (1/M) ∑_{m=0}^{M−1} z(x(t_{m+1})) z(x(t_m))^⊤,   G = (1/M) ∑_{m=0}^{M−1} z(x(t_m)) z(x(t_m))^⊤.   (8)

The continuous-time operator is then given by log(K)/t_s. Note that we can solve (6) using gradient descent methods [36] or other optimization methods. We write a recursive least-squares update [20], [37] which adaptively updates K as more data is acquired.
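As a minimal sketch of the batch fit in (6)–(8) (not the authors' implementation; the basis map psi standing in for z(x) is a placeholder the user supplies), the operator can be computed with a pseudoinverse and converted to continuous time with a matrix logarithm:

    import numpy as np
    from scipy.linalg import logm

    def fit_koopman(X, psi, ts):
        """X: (M+1, n) array of sampled states; psi: map x -> R^c; ts: sampling interval."""
        Z = np.array([psi(x) for x in X])        # lifted snapshots, shape (M+1, c)
        Zc, Zn = Z[:-1], Z[1:]                   # z(x(t_m)) and z(x(t_{m+1}))
        M = Zc.shape[0]
        A = Zn.T @ Zc / M                        # A in Eq. (8)
        G = Zc.T @ Zc / M                        # G in Eq. (8)
        K_d = A @ np.linalg.pinv(G)              # discrete-time operator, Eq. (7)
        K_c = logm(K_d).real / ts                # continuous-time operator, log(K)/t_s
        return K_d, K_c

    # Example with the Van der Pol basis of Appendix A-A: z(x) = [x1, x2, x1^2, x2*x1^2]
    psi = lambda x: np.array([x[0], x[1], x[0]**2, x[1]*x[0]**2])
    X = np.random.randn(100, 2)                  # stand-in for recorded state samples
    K_d, K_c = fit_koopman(X, psi, ts=0.01)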

C. Koopman Operator for Control

The Koopman operator can include a predefined input u that contributes to the evolution of z(x(t)). Consider observable functions that include the control input, v(x, u) : R^n × R^m → R^{c_u}, where c = c_x + c_u. The resulting computed Koopman operator can be divided into sub-matrices

K = [ K_x  K_u ;  ·  · ],   (9)

where K_x ∈ R^{c_x × c_x} and K_u ∈ R^{c_x × c_u}. Note that the term (·) in (9) refers to terms that evolve the observations on control z_u, which are ignored as there is no ambiguity in their evolution (they are determined by the controller). The Koopman operator dynamical system with control is then

ż = f(z, u) = K_x z(x(t_i)) + K_u v(x(t_i), u(t_i)).   (10)

Note that the data set D must now store u(t_i), u(t_{i+1}) in order to compute the Koopman operator matrix K_u.
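A sketch of this split, under the assumption (used in the later examples) that the control observables are v(x, u) = u, extends the least-squares fit above to the stacked observables and then reads off the blocks K_x and K_u of (9):

    import numpy as np

    def fit_koopman_with_control(X, U, psi):
        """X: (M+1, n) states; U: (M+1, m) inputs; psi: map x -> R^{c_x}."""
        W = np.hstack([np.array([psi(x) for x in X]), U])   # stacked observables [z(x); v(x, u)]
        Wc, Wn = W[:-1], W[1:]
        M = Wc.shape[0]
        A = Wn.T @ Wc / M
        G = Wc.T @ Wc / M
        K = A @ np.linalg.pinv(G)                           # full operator, as in Eq. (7)
        cx = W.shape[1] - U.shape[1]
        return K[:cx, :cx], K[:cx, cx:]                     # K_x and K_u blocks of Eq. (9)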

III. ENHANCING CONTROL AUTHORITY WITH KOOPMAN OPERATORS

Koopman operators map dynamic constraints into a linear dynamical system in a modified state-space. The Koopman operator structure allows one to use linear quadratic (LQ) control methods to compute optimal controllers for nonlinear systems that can often outperform locally optimal LQ controllers obtained through linearizing the nonlinear dynamics model.

Let us consider control of the nonlinear forced Van der Pol oscillator, the dynamics of which are defined in Appendix A-A, as an example. We specify the control task as minimizing the following LQ objective

J = ∫_{t_i}^{t_i + T} ( x(t)^⊤ Q x(t) + u(t)^⊤ R u(t) ) dt + x(t_i + T)^⊤ Q_f x(t_i + T),   (11)

where Q ∈ R^{n×n}, R ∈ R^{m×m}, and Q_f ∈ R^{n×n}. Choosing the set of function observables (Appendix A-A), we can compute a Koopman operator K by repeated simulation of the Van der Pol oscillator subject to uniformly random control inputs for 5000 randomly sampled initial conditions.

Since the Van der Pol oscillator dynamics are nonlinear, a solution to the LQ control problem is to linearize the dynamics about the equilibrium state x_t = [0, 0]^⊤ and form a linear quadratic regulator (LQR). Using the Koopman operator formulation of the Van der Pol dynamics, we can compute a controller in a similar manner using the following objective

J = ∫_{t_i}^{t_i + T} ( z(t)^⊤ Q̄ z(t) + u(t)^⊤ R u(t) ) dt + z(t_i + T)^⊤ Q̄_f z(t_i + T),   (12)

where

Q̄ = [ Q  0 ;  0  0 ] ∈ R^{c_x × c_x}  and  Q̄_f = [ Q_f  0 ;  0  0 ] ∈ R^{c_x × c_x}.   (13)

Setting Q̄ and Q̄_f to weight only the state observables allows us to compare the same control objective using the linearized dynamics against the Koopman operator dynamics, where the first terms in the function observables z(x(t)) are the state of the Van der Pol system itself.
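For concreteness, a minimal sketch of the lifted LQR in (12)–(13) (assuming v(x, u) = u and a continuous-time K fit as in (7), with scipy used to solve the Riccati equation) might look like the following; the resulting gain acts on z(x) rather than on x:

    import numpy as np
    from scipy.linalg import solve_continuous_are

    def lifted_lqr(Kx, Ku, Q, R, n):
        """Kx: (cx, cx); Ku: (cx, m); Q: (n, n) state weight; R: (m, m); n: state dimension."""
        cx = Kx.shape[0]
        Qbar = np.zeros((cx, cx))
        Qbar[:n, :n] = Q                       # zero-pad so only the true states are weighted, Eq. (13)
        P = solve_continuous_are(Kx, Ku, Qbar, R)
        return np.linalg.solve(R, Ku.T @ P)    # feedback gain L, with u = -L z(x)

    # Usage with the Van der Pol basis z(x) = [x1, x2, x1^2, x2*x1^2] (cx = 4, n = 2):
    # L = lifted_lqr(Kx, Ku, Q=np.eye(2), R=0.1*np.eye(1), n=2); u = -L @ psi(x)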

Fig. 1: Control performance of a forced Van der Pol oscillator with LQR control using the learned Koopman operator, the linearization of the known system dynamics, and the linearization of a learned state-space model using the same data and basis functions as the Koopman operator. (a) Integrated tracking error. (b) Trajectory and target. The control performance using the Koopman operator dynamics is shown to outperform the LQR control with known dynamics. The learned dynamics model performs equally to the known dynamics model and is overlaid on top of the known dynamics results.

Figure 1 illustrates the improvement in control performance when using the Koopman operator dynamics for LQ control instead of linearizing the dynamics around a local region. We compare the control authority against a learned dynamics model in the original state-space computed using Bayesian optimization with the same functions used for the Koopman operator. This illustrates that the data used to compute the Koopman operator can learn a nonlinear model of the Van der Pol dynamics in the original state-space. The Koopman operator formulation of the Van der Pol approximates the dynamic constraints as a linear dynamical system in a higher-dimensional space that captures nonlinear dynamical behavior. As a result, the Koopman operator formulation coupled with LQ methods can be used to enhance the control of the Van der Pol system as shown in Figure 1b. Computing the resulting trajectory error (Figure 1a) shows that the trajectory taken from the Koopman operator controller results in less overall integrated error. This is due to formulating the LQ controller with additional information in the form of a dynamical system that evolves functions of state.

While this example illustrates the possible benefits of utilizing the Koopman operator formulation, we ignored how the data was collected for the Van der Pol dynamical system. In fact, computing the Koopman operator used random inputs. For this example, such an approach works reasonably well, but requires a significant amount of data to fully cover the state-space of the Van der Pol system. The following sections introduce a method that enables a robot to actively learn the Koopman operator.

IV. CONTROL SYNTHESIS FOR ACTIVE LEARNING OF KOOPMAN OPERATOR DYNAMICS

Active learning controllers need to consider existing policies that solve a task while generalizing to learning objectives. In this section, we formulate a controller for active learning that takes into account the Koopman operator dynamics as well as policies generated for solving tasks using the Koopman operator linear dynamics. We generate an active learning controller that takes into account existing policies by first deriving the mode insertion gradient [38], [39]. The mode insertion gradient calculates how an objective changes when switching from one control strategy to another. We then formulate an active learning controller by minimizing the mode insertion gradient while including policies that solve a specified task.¹ The derived controller is then shown to increase the rate of change of the information measure, which guides the robot towards important regions of state-space, improving the data collection and the quality of the learned Koopman operator dynamics model.

A. Control Formulation

Active learning allows a robotic agent to self-excite the dynamical states in order to collect data that results in a Koopman operator K that can be used to describe the system evolution. We formulate the active learning problem as a hybrid switching problem [41] where the goal is to switch between a policy for a task and an information maximizing controller that assists the dynamical system in collecting informative data.

Consider a general objective function of the form

J = ∫_{t_i}^{t_i + T} ℓ(z(s), µ(z(s))) ds + m(z(t_i + T)),   (14)

where z(t) : R → R^{c_x} is the value of the function observables at time t subject to the Koopman dynamics in (10) starting from initial condition z(x(t_i)), ℓ(z, u) : R^{c_x} × R^m → R is the running cost, m(z) : R^{c_x} → R is the terminal cost, and µ(z) : R^{c_x} → R^{c_u} is a C¹ differentiable policy. In this work, the running cost is split into two parts:

ℓ(z, u) = ℓ_learn(z, u) + ℓ_task(z, u),

where ℓ_learn is the information maximizing objective (learning task) and ℓ_task(z, u) is the task objective for which the policy µ(z) is a solution to (14) when ℓ_learn = 0.

Given equation (14), we want to synthesize a controller that is bounded to the policy µ(z), but also allows for improvement of an information measure for active learning. To do so, we examine in Proposition 1 how sensitive (14) is to switching from the policy µ(z) to an arbitrary control vector µ*(t) at time τ for a time duration λ.

¹During training, the policies derived from the Koopman operator dynamics will be inaccurate; however, over time and gathered experience, both the model and policy will converge. This is a common approach in most model-based reinforcement learning techniques [40].

Proposition 1. The sensitivity of switching from µ to µ* for all τ ∈ [t_i, t_i + T], for an infinitesimally small λ (also known as the mode insertion gradient [38], [39]), is given by

∂J/∂λ |_{τ, λ=0} = ρ(τ)^⊤ (f_2 − f_1),   (15)

where z(t) is a solution to (10) with u(t) = µ(z(t)) and z(t_i) = z(x(t_i)), f_2 = f(z(τ), µ*(τ)), f_1 = f(z(τ), µ(z(τ))), and

ρ̇ = −( ∂ℓ/∂z + (∂µ/∂z)^⊤ ∂ℓ/∂u ) − ( ∂f/∂z + ∂f/∂u ∂µ/∂z )^⊤ ρ,   (16)

subject to the terminal condition ρ(t_i + T) = ∂/∂z m(z(t_i + T)).

Proof. See Appendix B-A.

We can write an unconstrained optimization problem for calculating µ*(τ) over the interval τ ∈ [t_i, t_i + T] that will minimize the mode insertion gradient. We can write this optimization problem using a secondary objective function

J_2 = ∫_{t_i}^{t_i + T} ∂J/∂λ |_{τ=t, λ=0} + (1/2) ‖ µ*(t) − µ(z(t)) ‖²_R dt,   (17)

where R ∈ R^{m×m} bounds the change of µ* from µ(z), and ∂J/∂λ |_{τ=t, λ=0} is evaluated at τ = t. Solving equation (17) with respect to µ*(t) can be viewed as a functional optimization over µ*(t) ∀t ∈ [t_i, t_i + T]. Since equation (17) is quadratic in µ*, we can compute a closed-form solution for any application time τ ∈ [t_i, t_i + T].

Proposition 2. Assuming that v(x, u) is differentiable, the control solution that minimizes (17) is

µ*(t) = −R^{−1} ( K_u ∂v/∂u )^⊤ ρ(t) + µ(z(t)).   (18)

Proof. Since (17) is separable in time, we take the derivative of (17) with respect to µ*(t) at each point in t, which gives the following expression:

∂_{µ*} J_2 = ∫_{t_i}^{t_i + T} ∂_{µ*} ( ρ(t)^⊤ (f_2 − f_1) ) + R ( µ*(t) − µ(z(t)) ) dt
          = ∫_{t_i}^{t_i + T} ( K_u ∂v/∂u )^⊤ ρ(t) + R ( µ*(t) − µ(z(t)) ) dt.   (19)

Solving for µ*(t) in (19) gives the control solution

µ*(t) = −R^{−1} ( K_u ∂v/∂u )^⊤ ρ(t) + µ(z(t)).

Proposition 2 gives a formula for switching to µ*(t) to improve the objective (14). We can use equation (18) with (B-A) to show that our approach improves the active learning objective subject to bounds placed on arbitrary tasks included in (14).
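A minimal sketch of evaluating (18) at a single time instant (assuming the co-state ρ(t) has already been computed from (16), and that dv_du is a user-supplied Jacobian of the control observables with respect to u) is:

    import numpy as np

    def active_learning_control(rho, Ku, dv_du, R, u_policy):
        """rho: (cx,) co-state; Ku: (cx, cu); dv_du: (cu, m); R: (m, m); u_policy: (m,) policy value."""
        B = Ku @ dv_du                                     # effective input map in the lifted space
        return -np.linalg.solve(R, B.T @ rho) + u_policy   # mu*(t) = -R^{-1} (Ku dv/du)^T rho + mu(z)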

Corollary 1. Assume that the Koopman operator dynamics for a system are defined by the following control-affine structure:

ż = K_x z(x(t)) + K_u v(x(t)) u(t),   (20)

where v(x) : R^n → R^{c_u × m}.² Moreover, assume that ∂H/∂µ ≠ 0, where H is the control Hamiltonian for (14). Then

∂_λ J = −‖ ( K_u v(x) )^⊤ ρ ‖²_{R^{−1}} < 0   (21)

for µ*(t) ∈ U ∀t ∈ [t_i, t_i + T], where U is the control space.

Proof. Inserting (18) into (B-A) gives

∂_λ J = ρ(t)^⊤ ( K_u v(x(t)) ) ( −R^{−1} ( K_u v(x(t)) )^⊤ ρ(t) ),

which can be written as the norm

∂_λ J = −‖ ( K_u v(x) )^⊤ ρ ‖²_{R^{−1}} < 0.

Because we define our objective to be reasonably general, we can add both stabilization terms as well as information measures that allow a robot to actively identify its own dynamics. The following subsection provides an overview of the Fisher information measure and information bounds based on our controller. We first describe the Fisher information matrix for the Koopman operator parameters and then generate an information measure. We then show, using (18) and Corollary 1, that we can approximately calculate the gain in information to first order.

B. Information Maximization

Using the controller defined in (18), we investigate information measures that we can use in (14) to enable the robot to actively learn the Koopman operator dynamics. In this work, we use the Fisher information [42], [43] to generate an information measure for active learning. The Fisher information is a way of measuring how much information a random variable has about a set of parameters. If we treat calculating the Koopman operator dynamics as a maximum likelihood estimation problem where the likelihood is given by π(z | K) : R^{c_x} → R_+, we can compute the Fisher information matrix over the parameters that compose the Koopman operator K. The Fisher information matrix is computed as

I[z | K] = E[ (∂/∂κ log π(z | K))^⊤ (∂/∂κ log π(z | K)) ] ∈ R^{|κ| × |κ|},   (22)

where E is the expectation operator, κ = {K_{i,j} | K_{i,j} ∈ K}, and |κ| is the cardinality of the vector κ. Assuming that π is a Gaussian distribution, (22) becomes

I[z | K] = (∂f/∂κ)^⊤ Σ^{−1} (∂f/∂κ),   (23)

where Σ ∈ R^{c_x × c_x} is the noise covariance matrix. Because the Fisher information defined here is positive semi-definite, we use the trace of the Fisher information matrix [44] in ℓ(z, u). This measure allows us to synthesize control actions that maximize the T-optimality measure of the Fisher information matrix [44].

²This formulation assumes that we can recover x(t) from z(t) for computing v(x).

Definition 1. The T-optimality measure is given by the trace of the Fisher information matrix (22) and defined as

I(K) = tr I[z | K] ≥ 0.   (24)

In this work we incorporate (24) into (14) additively using 1/(I + ε), that is,

ℓ_learn(z, u) = 1/(I(K) + ε),

where ε ≪ 1 is a small number to prevent singular solutions due to the positive semi-definite Fisher information matrix [45]–[47], and I is computed using the evaluation of K at time t_i. By minimizing (14) we also minimize the inverse of the T-optimality (which maximizes the T-optimality).
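Under the Gaussian assumption in (23) and the linear-in-parameters form of the Koopman model, the Jacobian ∂f/∂κ is built directly from the current observables, so the T-optimality trace has a simple closed form. The following sketch (our own illustration, with Sigma_inv a user-supplied noise precision matrix) evaluates that measure and the learning cost ℓ_learn:

    import numpy as np

    def t_optimality(z, v, Sigma_inv):
        """z: (cx,) state observables; v: (cu,) control observables; Sigma_inv: (cx, cx)."""
        w = np.concatenate([z, v])           # df_i/dK_{i,j} selects entry j of the stacked observables
        # The Fisher matrix is a Kronecker product of Sigma_inv and w w^T (up to parameter ordering),
        # so its trace is (w^T w) * tr(Sigma_inv).
        return float(w @ w) * np.trace(Sigma_inv)

    def learn_cost(z, v, Sigma_inv, eps=1e-3):
        return 1.0 / (t_optimality(z, v, Sigma_inv) + eps)   # l_learn = 1 / (I(K) + eps)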

Assumption 1. Assume that I(K̂) > 0 implies I(K) > 0, where K̂ is an approximation to the Koopman operator K computed from the data set D = {x(t_m), u(t_m)}_{m=0}^{i} that contains data up until the current sampling time t_i.

Theorem 1. Given Assumption 1 and dynamics (20), the change in information³ ∆I subject to (18) is given to first order by

∆I ≈ ( ‖ (K_u v(x))^⊤ ρ ‖²_{R^{−1}} + ℓ_task(z, µ*) − ℓ_task(z, µ) ) I_{µ*} I_µ + O(∆t),   (25)

where I_{µ*}, I_µ are the T-optimality measures (24) from applying the controls µ* and µ.

Proof. See Appendix B-B.

Theorem 1 shows that our controller increases the rate of information that a robot would have normally acquired if it had only used the control policy µ(z). Weighing the information measure against the task objective allows us to ensure that the relative information gain is positive when using the active learning controller. That is, the difference between the information from using the policy µ(z) and the control µ*(t) will be positive. Other heuristics can be used, such as a decaying weight on the information gain or setting the weight to 0 at a specific time so that the robot attempts the task. We provide a basic overview of the control procedure in Algorithm 1. Videos of the experiments and example code can be found at https://sites.google.com/view/active-learning-koopman-op.

Algorithm 1 Active Learning Control

1: initialize: objective ℓ(z, u), policy µ(z), normally distributed random K ∼ N(0, 1).
2: sample state measurement x(t_i)
3: add x(t_i) to data set D, update K and µ(z)
4: simulate z(t), ρ(t) for t ∈ [t_i, t_i + T] with conditions z(t_i) = z(x(t_i)) and ρ(t_i + T) = ∂/∂z m(z(t_i + T)) using µ(z)
5: compute µ*(t) = −R^{−1} ( K_u ∂v/∂u )^⊤ ρ(t) + µ(z(t))
6: return µ*(t_i)
7: update timer t_i → t_{i+1}
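A structural sketch of one pass through Algorithm 1 (not the authors' implementation) is shown below; update_koopman, simulate_forward, simulate_costate, and policy are hypothetical placeholders standing in for the recursive least-squares update, the forward rollout of (10), the backward integration of (16), and the LQ policy, respectively:

    import numpy as np

    def active_learning_step(x, D, Kx, Ku, policy, psi, dv_du, R, ts, T):
        D.append(x)                                          # step 3: add measurement to data set D
        Kx, Ku = update_koopman(Kx, Ku, D, ts)               # step 3: recursive least-squares update (placeholder)
        z = simulate_forward(Kx, Ku, psi(x), policy, T)      # step 4: roll out z(t) under mu(z), Eq. (10) (placeholder)
        rho = simulate_costate(Kx, Ku, z, policy, T)         # step 4: integrate rho(t) backward, Eq. (16) (placeholder)
        B = Ku @ dv_du
        u_star = -np.linalg.solve(R, B.T @ rho[0]) + policy(z[0])   # steps 5-6: Eq. (18) at t_i
        return u_star, Kx, Ku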

The following sections use our derived controller to enable active learning of Koopman operator dynamics.

³With respect to the information acquired from applying only µ(z).

V. SINGLE EXECUTION ACTIVE LEARNING OF FREE-FALLING QUADCOPTERS

In this example, we illustrate the capabilities of combining the Koopman operator representation of a dynamical system and active learning for single-execution model learning of a free-falling quadcopter for stabilization. Additionally, we compare our approach to other common learning strategies such as active learning with Gaussian processes [48]–[50], online model adaptation through direct attempts at the task of stabilization (a common online reinforcement learning and adaptive control approach [19], [20], [37], [51], [52]), and a two-stage noisy motor input (often referred to as "motor babble" [53]–[55]).

A. Problem Statement

The task is as follows: The quadcopter, with dynamics described in Appendix A-B and [56], must learn a model within the first second of free-falling and then use the model to generate a stabilizing controller, preventing itself from falling any further. We define success of the quadcopter in the task when ‖x − x_d‖² < 0.01, where x_d is the desired target state defined by zero linear and angular velocity. The controllers are designed as linear quadratic regulators using the model that was learned and the LQ objectives provided in Section III. The parameters used for this example are defined in Appendix A-B and follow the same parameter choices as in Section III for fairness towards the learning methods against which we are comparing.

We compare the information gained (based on the T-optimality condition) and the stabilization error in time against various learning strategies. Each learning strategy is tested with the same 20 uniformly sampled initial velocities (and angular velocities) between −2 and 2 radians/meters per second. After each trial, the learned dynamics model is reset so that no information from the previous trials is used.

B. Other Active Learning Strategies

We compare our method for active learning against common dynamic model learning strategies. Specifically, we compare three model learning approaches against our method: a two-stage noisy control input approach [53], direct stabilization with an adaptive model using least squares [19], [37], and an active learning strategy using a Gaussian process [57], [58]. Each of these strategies generates a Koopman operator using the functions of state defined in Appendix A-B to build a dynamic model of the quadcopter. The Gaussian process formulation is the only model where the functions map to the original state-space, resulting in a nonlinear dynamics model.

a) Least Squares Adaptive Stabilization: The first strategy we compare to is to attempt the task of stabilization while updating the model of the dynamics recursively [19], [37]. This is often a strategy used in model-based reinforcement learning [54] and adaptive control [37].

b) Two-Stage Motor Babble: The second strategy is a two-stage approach using noisy motor input (motor babble) for the first second and then pure stabilization [53]. Rather than directly attempting to stabilize the dynamics, the priority is to simply try all possible motor inputs regardless of the model of the dynamics that is being constructed. The motor babble strategy allows us to bound the motor excitation, which prevents the rotors from destabilizing once the learning stage is complete. As with the direct stabilization method, we use a recursive least-squares update for the model of the Koopman operator.

c) Active Learning with Gaussian Process: The last strategy is an active Gaussian process strategy [57], [58]. In this active learning strategy, we build a model of the dynamics of the quadcopter by generating a Gaussian process dynamics model [50], [57]. Using the variance estimate [58], we uniformly sample points around the current state bounded by some constant ε and find the state which maximizes the variance. The sampled state with the largest variance is then used to generate a local LQ controller to guide the quadcopter dynamics to that state to collect the data. After the first second, the Gaussian process model is used to generate a stabilizing controller by linearizing the model about the final desired stabilization state. The kernel function used is computed using the functions of state provided in Appendix A-B for a fair comparison.

Note that for the two-stage, least-squares adaptive, and our approach, we learn a Koopman operator dynamics model which we use to compute an LQ controller. The Gaussian process model is in the original state-space as described in [50].

C. Results

Figure 2 (a) illustrates the information (T-optimality of the Fisher information matrix) for each method. Our approach to active learning is shown to improve upon the information when compared to motor babble (the most basic method for active learning). The other methods outperform our approach in terms of the overall information gain by overly exciting the dynamics. The direct adaptive stabilization method utilizes the incorrect dynamics model to self-adjust and eventually stabilize the quadcopter (as shown in the variance). The active Gaussian process approach uses the covariance estimate to actuate the quadcopter towards uncertain regions. Collecting data in uncertain regions allows the active Gaussian process approach to actively select where the quadcopter should collect data next.

It is worth noting that these approaches will often lead the quadcopter towards unstable regions, making it difficult to stabilize the dynamics in time. Our approach actively synthesizes when it is best to learn and stabilize, which assists in quickly stabilizing the quadcopter dynamics (see Figure 2 (b)). The addition of the Koopman operator dynamics further enhances the control authority of the quadcopter as shown with the direct adaptive stabilization, motor babble, and our approach to active learning. While the active Gaussian process model does at times succeed, the method relies on both the quality of data acquired and the local linear approximation to the dynamics. This results in a deficit of nonlinear information that is needed to successfully achieve the learning task in a single execution.

Fig. 2: Monte-Carlo simulation comparing various learning strategies to stabilize a quadcopter falling for 20 trials with uniformly sampled initial linear and angular velocities. (a) Information gain (trace of the Fisher information matrix) is shown for the various learning strategies. (b) Stabilization error and standard deviation are shown over time for each learning strategy (noisy inputs, direct stabilization, GP, our method) over 20 trajectories. (c) Representative time series snapshots are shown depicting the various learning strategies. With our approach, maximization of the information measure, coupled with the Koopman operator formulation of the dynamics, enables quick stabilization of the quadcopter.

D. Sensitivity to Initialization and Parameters

We further test our algorithm against sensitivities to initialization of the Koopman operator. Our algorithm requires an initial guess at the Koopman operator in order to bootstrap the active learning process. We accomplish this using the same experiment described in the previous section, which used a zero-mean, unit-variance normally distributed initialization of the Koopman operator. We vary the variance that initializes the Koopman operator parameters using a normal distribution with zero mean and a variance drawn from the experiment set {0.01, 0.1, 1.0, 10.0}.

In Fig. 3 we find that so long as the initialization of the Koopman operator is reasonable (non-zero and within an order of magnitude), the performance is comparable to the active learning described in Fig. 2. However, this may not be true for all autonomous systems, and results may vary depending on the sampling frequency and the behavior of the underlying system. A benchmark is provided for stabilizing the quadcopter when the Koopman operator is precomputed in Fig. 3, illustrating the performance of the control authority when using the Koopman operator-based controller.

The choices of the parameters of our algorithm can also affect its performance. Specifically, setting the value of the regularization term R too large will prevent the robot from significantly exploring the states of the robot. In contrast, if the regularization term is set too low, the robot will widen its breadth of exploration, which can be harmful to the robot if the states are not bounded. A similar effect is achieved by adding a weight on the active learning objective.

Fig. 3: Resulting sensitivities in stabilization error (a) and information gain (b) with respect to variance levels in the Koopman operator initialization (variances of 1.0, 0.1, 0.01, and 10.0). Benchmark stabilization performance is provided for a known/precalculated Koopman operator.

Changes in the time horizon T will also affect the performance of the algorithm. Generally, smaller T will result in more reactive behaviors, while larger T tends to produce more intent-driven control responses. Choosing these values appropriately will be problem specific; however, the limited number of tunable parameters (not including choosing a task objective) provides the advantage of ease of implementation.

E. Discussion

While the single-execution capabilities of the Koopman operator with active learning are appealing, not all robotic systems will be capable of such drastic performance. In particular, this example relies on some prior knowledge of the underlying robotic system and the dynamics that govern the system. The functions of state are chosen such that they include nonlinear elements (e.g., cross-product terms that we expect will help in stabilization). Thus, the approximate Koopman operator is predicting the evolution of nonlinear elements found in the original nonlinear dynamics. Often these underlying structures that we can exploit are not known or easily found in robotics. Choosing random polynomial or Fourier expansions as function observables can sometimes work (see Section VII), but often can lead to unstable eigenvalues in the Koopman operator dynamics, which can make model-based control difficult to synthesize [26].

Recent work has attempted to address these issues using sparse optimization [59] or discovering invariances in the state-space [26]. A promising method is automating the discovery of the function observables by learning the functions from data [60]. By using current advances in neural networks and function representation, it is possible to automate the discovery of function observables. The following section further develops the work in automating the discovery of function observables for Koopman operators through the use of our approach for active learning.

VI. AUTOMATING DISCOVERY OF KOOPMAN OPERATOR FUNCTION OBSERVABLES

As a solution to automating the choice of function observables, deep neural networks [60] have been used to automatically discover the function observables. In this section, we illustrate that we can use these neural networks coupled with our approach for active learning to automatically discover the Koopman operator and the associated functions of state.

Fig. 4: (a) Resulting stabilization time of a cart pendulum using Koopman operators with automatic function discovery. (b) Control response of a 2-link robot using Koopman operators with automatic function discovery. Active learning improves the rate of success of each task.

A. Including Automatic Function Discovery

Revisiting Equation (10), we can parameterize z(x) and v(x, u) using a multi-layer neural network with parameters θ ∈ R^d. We denote the parameterization of z, v as z_θ(x) and v_θ(x, u), where the subscript θ denotes that the function observables are parameterized by the same set of parameters θ. Given the same data set that was defined previously, D = {x(t_m), u(t_m)}_{m=0}^{M}, the new optimization problem that is to be solved is

min_{K, θ}  (1/2) ∑_{m=0}^{M−1} ‖ z̄_θ(x(t_{m+1}), u(t_{m+1})) − K z̄_θ(x(t_m), u(t_m)) ‖²,   (26)

where z̄_θ(x, u) = [z_θ(x)^⊤, v_θ(x, u)^⊤]^⊤. Equation (26) can be solved using any of the current techniques for gradient descent (the Adam method [61] is used in this work). The continuous-time Koopman operator is obtained similarly using the matrix log of K, resulting in the differential equation

ż_θ = K_x z_θ(x(t)) + K_u v_θ(x(t), u(t)).   (27)

Because we are now optimizing over θ, we lose the sample efficiency of single-execution learning that was illustrated in the example in Section V. Active learning can be used; however, adding the additional parameters θ to the information measure significantly increases the computational cost of calculating the Fisher information measure (22). As a result, we only compute the information measure with respect to K in order to avoid the computational overhead of maximizing information with respect to θ.
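A minimal sketch of the joint optimization in (26) (our own illustration in PyTorch, with layer sizes loosely following Appendix A-C; the state x is appended to the network output, which, as discussed in Section VI-C, helps avoid the trivial solution z_θ = 0) is:

    import torch
    import torch.nn as nn

    class KoopmanNet(nn.Module):
        def __init__(self, n=4, m=1, cx=40, cu=10, hidden=20):
            super().__init__()
            self.z = nn.Sequential(nn.Linear(n, hidden), nn.ReLU(), nn.Linear(hidden, cx - n))
            self.v = nn.Sequential(nn.Linear(n + m, hidden), nn.ReLU(), nn.Linear(hidden, cu))
            self.K = nn.Parameter(0.1 * torch.randn(cx + cu, cx + cu))   # discrete-time operator

        def lift(self, x, u):
            zx = torch.cat([x, self.z(x)], dim=-1)          # include x in z_theta to avoid z_theta = 0
            vu = self.v(torch.cat([x, u], dim=-1))
            return torch.cat([zx, vu], dim=-1)              # stacked observables [z_theta(x); v_theta(x, u)]

    def train(model, X, U, iters=2000, lr=1e-3):
        """X: (M+1, n) and U: (M+1, m) float tensors of states and inputs."""
        opt = torch.optim.Adam(model.parameters(), lr=lr)   # Adam, as referenced in the text
        for _ in range(iters):
            z_now = model.lift(X[:-1], U[:-1])
            z_next = model.lift(X[1:], U[1:])
            loss = 0.5 * ((z_next - z_now @ model.K.T) ** 2).sum(dim=-1).mean()   # Eq. (26)
            opt.zero_grad(); loss.backward(); opt.step()
        return model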

B. Examples

We illustrate the use of deep networks for automating the function observables for the Koopman operator for stabilizing a cart pendulum and controlling a 2-link robot arm to a target. A neural network is first initialized (see Appendix A-C for details) for the Koopman operator functions z_θ, v_θ as well as an LQ controller for the task at hand. At each iteration, the robot attempts the task and learns the Koopman operator dynamics by minimizing (26). We compare against decaying additive control noise as well as our method for active learning, where a weight on the information measure is used which decays at each iteration according to γ^{i+1}, where 0 < γ < 1 and i is the iteration number. The data collected is then used to update the parameters θ and K using (26), and the LQ controller is updated with the new K_x, K_u parameters.

Figure 4 illustrates that we can automate the process of learning the function observables as well as the Koopman operator. With the addition of active learning, the process of learning the Koopman operator and the function observables is improved. In particular, stabilization of the cart pendulum is achieved in only 50 iterations, in comparison to additive noise which takes over 100 iterations. Similarly, the 2-link robot can be controlled to the target configuration within 5 iterations with our active learning approach.

C. Discussion

While this method is promising, there still exist significant issues that merit more investigation in future work. One of these is the trivial solution where z_θ, v_θ = 0. This issue often depends on how the parameters θ are initialized. This trivial solution has been addressed in [62]; however, their approach requires significantly complicating how the regression (26) is formulated. We found that adding the state x as part of the neural network output of z_θ was enough to overcome the trivial solution.

VII. ROBOT EXPERIMENTS

Our last set of examples tests our active learning strategy with robot experiments. We use the robots depicted in Figure 5 to illustrate control and active learning with Koopman operators. The Sphero SPRK robot (Figure 5a) is a differential drive robot inside of a clear outer ball. We test trajectory tracking of the SPRK robot in a sand terrain, where the challenge is that the SPRK must be able to learn how to maneuver in sand. The Sawyer robot (Figure 5b) is a 7-link robot arm whose task is to track a trajectory defined at the end effector, where the challenge is the high dimensionality of the robot. We refer the reader to the attached multimedia, which has clips of the experiments.

Fig. 5: Depiction of the robots used for experimentation: (a) Sphero SPRK, (b) Sawyer robot.

A. Experiments: Granular Media and Sphero SPRK

Active learning is applied in an experimental setting using the Sphero SPRK robot (Fig. 5a) in sand. The interaction between sand and the SPRK robot makes physics-based models challenging.

(c) Controller performance:
Method                    RMSE     Correlation   Phase Lag (rad)
Koopman-based Control     0.3010   0.4028        1.1262
Controller in [32]        0.3535   0.1034        1.4667

Fig. 6: Experiment using the Sphero SPRK robot in sand. (a) The experimental setup is depicted with the SPRK robot inside the sand pit (sand barrier shown). Position information is calculated with an overhanging Xbox Kinect using OpenCV [63] for tracking. (b) Performance of the SPRK robot using the Koopman operator-based controller after active learning on the target trajectory, compared with the state-space linear-model controller from [32]. (c) Performance measures showing that active learning significantly outperforms non-active learning in the robot experiment. The attached multimedia shows the experiment being executed.

The parameters for the experiment are defined in Appendix A-D. The experiment starts with 20 seconds of active learning. After actively identifying the Koopman operator, the weight on information maximization is set to zero at t = 20 and the objective is switched to track the trajectory shown in Fig. 6b. In Fig. 6c, we show the average root mean squared error (RMSE) of the x–y trajectory tracking, the average x–y Pearson's correlation using a two-sided hypothesis test (values close to 1 indicate responsive controllers), and the phase lag of the experimental results. Note that in contrast to previous work by the authors [32], actively learning the Koopman operator improves the performance of the model-based controller. In particular, we find that the overall responsiveness and phase lag of the Koopman-based controller improved after active learning in sand.

(c) Controller performance:
Method                    RMSE     Correlation   Phase Lag (rad)
Koopman-based Control     0.0228   0.9777        0.2826
Sawyer Joint Controller   0.0443   0.6026        0.7041

Fig. 7: Experiment using Sawyer. Experimental data visualized using RViz [64]. (a) End-effector trajectory paths using the embedded Rethink joint controller and the Koopman operator controller; both controllers run at 100 Hz. (b) Trajectories overlaid from both controller responses against the target trajectory. (c) Controller performance shows that the Koopman operator-based controller with active learning performs comparably. We refer the reader to the attached multimedia to view clips of this experiment.

B. Experiments: Trajectory Tracking of Rethink Sawyer Robot

In this experiment, we use active learning with the Koopman operator to model a 7-DoF Sawyer robot arm from Rethink Robotics. The 7-DoF system is of interest because it is both high dimensional and inertial effects tend to dominate the dynamics of the system. We define the parameters used for this experiment in Appendix A-E.

Figure 7 illustrates a comparison of the embedded controller in the Sawyer robot and the data-driven Koopman operator controller. Here, we show the average root mean squared error of the tracking position, the Pearson's correlation using a two-sided hypothesis test (values close to 1 indicate responsive controllers), and the phase lag of the trajectory tracking. The resulting controller using the Koopman operator is shown to be comparable to the built-in controller, with the inclusion of the evolution of the nonlinearities of the Sawyer robot improving overall trajectory tracking performance. The trajectories of the two methods are overlaid, which illustrates the improvement in control from the Koopman operator after active learning has occurred. Since data is always being acquired online, the Koopman operator is continuously being updated as the robot tracks the trajectory. The Koopman operator-based controller is able to capture dynamic effects of the individual joints from data. This is further reinforced by the improved results found in Fig. 7c. Note that one can build a model to solve for similar, if not better, inverse dynamics of the Sawyer robot that can be computed for control. In particular, the Sawyer robot provides an implementation of inverse dynamics in the robot's embedded controller. However, our approach provides high accuracy without needing such a model ahead of time and without linearizing the nonlinear dynamics.

VIII. CONCLUSION

In this paper, we use Koopman operators as a method for enhancing control of robotic systems. In addition, we contribute a method for active learning of Koopman operator dynamics for robotic systems. The active learning controller enables robots to learn their own dynamics quickly while taking into account the linear structure of the Koopman operator to enhance LQ control. We illustrate various examples of robot control with Koopman operators and provide examples for automating design choices for Koopman operators. Last, we show that our method is applicable to actual robotic systems.

APPENDIX A
PARAMETERS FOR VARIOUS EXAMPLES

A. Control of the Forced Van der Pol Oscillator

The nonlinear dynamics that govern the Van der Pol oscillator are given by the differential equations

d/dt [x_1, x_2]^⊤ = [x_2, −x_1 + ε(1 − x_1²)x_2 + u]^⊤,

where ε = 1 and u is the control input.

The Koopman operator functions used are defined as

z(x) = [x_1, x_2, x_1², x_2 x_1²]^⊤

and v(u) = u. The same functions are used to compute a regression problem where the final equation is given by

d/dt [x_1, x_2]^⊤ = A z(x) + B v(u),

where A ∈ R^{n × c_x} and B ∈ R^{n × c_u} are both generated using linear regression.

The weight parameters for LQ control are

Q = diag([1, 1]) and R = 0.1,

where

Q̄ = [ Q  0 ;  0  0 ] ∈ R^{c_x × c_x}.   (28)

B. Quadcopter Free-Falling

The quadcopter system dynamics are defined as

ḣ = h [ ω̂  v ;  0  0 ],
J ω̇ = M + J ω × ω,
v̇ = (1/m) F e_3 − ω × v − g R^⊤ e_3,

where h = (R, p) ∈ SE(3), the inputs to the system are u = [u_1, u_2, u_3, u_4], and

F = k_t (u_1 + u_2 + u_3 + u_4),
M = [ k_t l (u_2 − u_4),  k_t l (u_3 − u_1),  k_m (u_1 − u_2 + u_3 − u_4) ]^⊤

(see [56] for more details on the dynamics and parameters used). Note that in this formulation of the quadcopter, the control vector u has bidirectional thrust.

The measurements of the state of the quadcopter are given by

[a_g, ω, v]^⊤ ∈ R^9,   (29)

where a_g ∈ R^3 denotes the body-centered gravity vector and ω, v are the body angular and linear velocities, respectively. The sampling rate for this system is 200 Hz.

We define the basis functions for this system as

z(x) = [a_g, ω, v, g(v, ω)]^⊤ ∈ R^18,

where g(v, ω) = [v_3 ω_3, v_2 ω_3, v_3 ω_1, v_1 ω_3, v_2 ω_1, v_1 ω_2, ω_2 ω_3, ω_1 ω_3, ω_1 ω_2] are the chosen basis functions such that ω_i, v_i are elements of the body-centered angular and linear velocity ω, v, respectively. The functions for control are

v(u) = u ∈ R^4.
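A small sketch (assuming the 9-dimensional measurement x = [a_g, ω, v] described above) that builds the 18-dimensional quadcopter observables z(x) = [a_g, ω, v, g(v, ω)] is:

    import numpy as np

    def quad_basis(x):
        ag, w, v = x[0:3], x[3:6], x[6:9]
        g = np.array([v[2]*w[2], v[1]*w[2], v[2]*w[0], v[0]*w[2], v[1]*w[0],
                      v[0]*w[1], w[1]*w[2], w[0]*w[2], w[0]*w[1]])   # cross terms from Appendix A-B
        return np.concatenate([ag, w, v, g])                          # z(x) in R^18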

The LQ control parameters for the stabilization problem are given as

Q = diag([1, 1, 1, 1, 1, 1, 5, 5, 5]) and R = diag([1, 1, 1, 1]),

where the weights on the additional functions in Q̄ are set to zero as in (28). The time horizon used is 0.1 s.

The active learning controller uses a weight on the information measure of 0.1 and a regularization weight of diag([1000, 1000, 1000, 1000]) in (17). Motor noise used in the two-stage method is given by uniform noise at 33% of the control saturation.

C. Neural Network Automatic Function Discovery Configuration

In this example, we use the Roboschool environments [65] for the robot simulations.

For the cart pendulum example, we use a three-layer network with a single hidden layer for z_θ and v_θ, with {4, 20, 40} and {2, 20, 10} nodes respectively for each layer, making c_x = 40 and c_u = 10. The exploration noise used on the control is given by additive zero-mean noise with a variance of 40% of motor saturation, decreasing at a rate of 0.9^{i+1}. The decay weight on the information measure is given by 0.2^{i+1}. The LQ weights are given by Q = diag([50.0, 1.0, 10.0, 0.1, 0, …, 0]), where the first non-zero weights correspond to the states of the cart pendulum. A time horizon of 0.1 s is used with a sampling rate of 50 Hz. The regularization weight R = 1 × 10^6.

For the 2-link robot example, we use a similar three-layer network with a single hidden layer for z_θ and v_θ, with {4, 20, 40} and {2, 20, 20} nodes respectively for each layer, making c_x = 40 and c_u = 10. The exploration noise used on the control is given by additive zero-mean noise with a variance of 40% of motor saturation, decreasing at a rate of 0.9^{i+1}. The decay weight on the information measure is given by 0.2^{i+1}. The LQ weights are given by Q = diag([10.0, 1.0, 20.0, 1.0, 0, …, 0]), where the first non-zero weights correspond to the states of the 2-link robot. A time horizon of 0.05 s is used with a sampling rate of 100 Hz. The regularization weight R = diag([1 × 10^6, 1 × 10^6]).

D. SPRK Tracking in Sand

The SPRK robot is running a 30 Hz sampling rate forcontrol and state estimation. Control vectors are filtered usinga low-pass filter to avoid noisy responses in the robot. Thecontroller weights are defined as

Q = diag([60, 60, 5, 5, ~1])and R = diag([0.1, 0.1]).

The control regularization is R = R. A weight of 80 is addedto the information measure. A time horizon of 0.5s is used tocompute the controller.

We run the active learning controller for 20 seconds and then set the weight of the information measure to zero and track the end-effector trajectory given by

    \begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = \begin{bmatrix} 0.5\cos(t) + 1.12 \\ 0.3\sin(2t) + 0.85 \end{bmatrix}.

In this example, the set of functions is chosen as a polynomial expansion of the velocity states \dot{x} = [\dot{x}, \dot{y}] to the 3rd order. The function observables are defined as

    z(x) = [x, y, \dot{x}, \dot{y}, 1, \dot{x}^2, \dot{y}^2, \dot{x}^2\dot{y}, \ldots, \dot{x}^3\dot{y}^3]^\top \in \mathbb{R}^{18}

and

    v(x, u) = u \in \mathbb{R}^2.
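A hedged sketch of these observables and of the tracked reference is given below; the exact ordering of the velocity monomials is our assumption.

```python
import numpy as np

def z_sprk(x):
    """Observables for the SPRK: position plus all monomials xd^i * yd^j
    (i, j = 0..3) of the velocity states.  Ordering is illustrative."""
    px, py, xd, yd = x
    monomials = [xd**i * yd**j for i in range(4) for j in range(4)]  # 16 terms
    return np.array([px, py] + monomials)                            # R^18

def sprk_reference(t):
    """End-effector reference tracked after the 20 s learning phase."""
    return np.array([0.5 * np.cos(t) + 1.12,
                     0.3 * np.sin(2 * t) + 0.85])
```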

E. Sawyer Control

The Sawyer robot was run at a sampling rate of 100 Hz. Control vectors are filtered using a low-pass filter to avoid noisy responses in the robot. The controller weights are defined as

Q = diag([200 × \vec{1} ∈ \mathbb{R}^{14}, \vec{1}]) and R = diag([0.001 × \vec{1} ∈ \mathbb{R}^{7}]).

The control regularization weight is set equal to R. A weight of 2000 is added to the information measure. A time horizon of 0.5 s is used to compute the controller.

We run the active learning controller for 20 seconds and then set the weight of the information measure to zero and track the end-effector trajectory given by

    \begin{bmatrix} x(t) \\ y(t) \\ z(t) \end{bmatrix} = \begin{bmatrix} 0.8 \\ 0.1\cos(2t) \\ 0.1\sin(4t) + 0.4 \end{bmatrix}.

The functions of state used to compute the Koopman operator are defined as

    z(x) = [x^\top, 1, \theta_1\theta_2, \theta_2\theta_3, \ldots, \theta_6^3\theta_7^3, \dot{\theta}_1\dot{\theta}_2, \ldots, \dot{\theta}_6^3\dot{\theta}_7^3]^\top \in \mathbb{R}^{51}

with v(u) = u ∈ R^7 as the torque input control of each individual joint and states x containing the joint angles and joint velocities.
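One possible realization of this observable map is sketched below; the interpretation of the pairwise joint terms as adjacent-joint products raised to powers 1-3 (which yields the stated dimension of 51) is our reading of the definition and should be treated as an assumption.

```python
import numpy as np

def z_sawyer(x):
    """Observables for the Sawyer: x = [theta (7), theta_dot (7)] in R^14,
    plus 1, plus adjacent-joint products (theta_i theta_{i+1})^p and
    (thetadot_i thetadot_{i+1})^p for p = 1..3.  Layout is illustrative."""
    th, thd = x[:7], x[7:]
    pos_terms = [(th[i] * th[i + 1]) ** p for p in (1, 2, 3) for i in range(6)]
    vel_terms = [(thd[i] * thd[i + 1]) ** p for p in (1, 2, 3) for i in range(6)]
    return np.concatenate([x, [1.0], pos_terms, vel_terms])   # 14 + 1 + 18 + 18 = 51

def sawyer_reference(t):
    """End-effector reference tracked after the 20 s learning phase."""
    return np.array([0.8, 0.1 * np.cos(2 * t), 0.1 * np.sin(4 * t) + 0.4])
```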

APPENDIX B
PROOFS

A. Proof of Proposition 1

Proposition 1: The sensitivity of switching from µ to µ* at any time τ ∈ [t_i, t_i + T] for an infinitesimally small λ (also known as the mode insertion gradient [38], [39]) is given by

    \frac{\partial J}{\partial \lambda}\Big|_{\tau, \lambda = 0} = \rho(\tau)^\top (f_2 - f_1),

where z(t) is a solution to (10) with u(t) = µ(z(t)) and z(t_i) = z(x(t_i)), f_2 = f(z(τ), µ*(τ)), f_1 = f(z(τ), µ(z(τ))), and

    \dot{\rho} = -\left(\frac{\partial \ell}{\partial z} + \frac{\partial \mu}{\partial z}^\top \frac{\partial \ell}{\partial u}\right) - \left(\frac{\partial f}{\partial z} + \frac{\partial f}{\partial u}\frac{\partial \mu}{\partial z}\right)^\top \rho

subject to the terminal condition ρ(t_i + T) = \frac{\partial}{\partial z} m(z(t_i + T)).

Proof. Consider the objective (14) evaluated at a trajectory z(t) ∀ t ∈ [t_i, t_i + T] generated from a dynamical system. Furthermore, assume that z(t_i + T) is generated by a policy µ(z(t)) ∀ t ∉ [τ, τ + λ] and a controller µ*(t) ∀ t ∈ [τ, τ + λ], where τ is the time of application of the control µ* and λ is its duration. Formally, z(t_i + T) can be written as

    z(t_i + T) = z(t_i) + \int_{t_i}^{\tau} f(z(t), \mu(z(t)))\, dt + \int_{\tau}^{\tau + \lambda} f(z(t), \mu^\star(t))\, dt + \int_{\tau + \lambda}^{t_i + T} f(z(t), \mu(z(t)))\, dt,    (30)

where f(z, u): \mathbb{R}^{c_x} \times \mathbb{R}^{c_u} \to \mathbb{R}^{c_x} is a mapping which describes the time evolution of the state z(t).

Using (30) and (14), we compute the derivative of (14) with respect to the duration λ of the control µ* applied at any time τ ∈ [t_i, t_i + T]:

    \frac{\partial}{\partial \lambda} J \Big|_{\tau} = \int_{\tau + \lambda}^{t_i + T} \left(\frac{\partial \ell}{\partial z} + \frac{\partial \mu}{\partial z}^\top \frac{\partial \ell}{\partial u}\right)^\top \frac{\partial z}{\partial \lambda}\, dt,    (31)

where

    \frac{\partial z(t)}{\partial \lambda} = f_2 - f_1 + \int_{\tau + \lambda}^{t} \left(\frac{\partial f}{\partial z} + \frac{\partial f}{\partial u}\frac{\partial \mu}{\partial z}\right)^\top \frac{\partial z(s)}{\partial \lambda}\, ds    (32)

such that f_2 = f(z(τ), µ*(τ)), f_1 = f(z(τ), µ(z(τ))) are boundary terms from applying Leibniz's rule.

Because (32) is a linear convolution with initial condition ∂z(τ)/∂λ = f_2 - f_1, we are able to rewrite the solution to ∂z(t)/∂λ using a state-transition matrix Φ(t, τ) [66] with initial condition f_2 - f_1 as

    \frac{\partial z(t)}{\partial \lambda} = \Phi(t, \tau)(f_2 - f_1).    (33)

Since the term f_2 - f_1 is evaluated at time τ, we can write (31) as

    \frac{\partial}{\partial \lambda} J \Big|_{\tau} = \int_{\tau + \lambda}^{t_i + T} \left(\frac{\partial \ell}{\partial z} + \frac{\partial \mu}{\partial z}^\top \frac{\partial \ell}{\partial u}\right)^\top \Phi(t, \tau)\, dt\, (f_2 - f_1).    (34)

Taking the limit of (34) as λ → 0 gives us the sensitivity of (14) with respect to switching at any time τ ∈ [t_i, t_i + T]. We can further define the adjoint (or co-state) variable

    \rho(\tau)^\top = \int_{\tau}^{t_i + T} \left(\frac{\partial \ell}{\partial z} + \frac{\partial \mu}{\partial z}^\top \frac{\partial \ell}{\partial u}\right)^\top \Phi(t, \tau)\, dt \in \mathbb{R}^{c_x},

which allows us to define the mode insertion gradient [39] as

    \frac{\partial}{\partial \lambda} J \Big|_{t = \tau} = \rho(\tau)^\top (f_2 - f_1),

where

    \dot{\rho} = -\left(\frac{\partial \ell}{\partial z} + \frac{\partial \mu}{\partial z}^\top \frac{\partial \ell}{\partial u}\right) - \left(\frac{\partial f}{\partial z} + \frac{\partial f}{\partial u}\frac{\partial \mu}{\partial z}\right)^\top \rho

subject to the terminal condition ρ(t_i + T) = \frac{\partial}{\partial z} m(z(t_i + T)).
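To make the construction concrete, the sketch below backward-integrates the adjoint equation of Proposition 1 for a generic lifted linear system with a linear feedback policy and evaluates ρ(τ)^T (f_2 - f_1) numerically; the dynamics, cost, and policy are placeholders rather than the systems used in the paper.

```python
import numpy as np

# Placeholder lifted linear dynamics f(z, u) = A z + B u with feedback mu(z) = -K z
# and running cost l(z, u) = 0.5 z^T Q z (so dl/du = 0).  All values are illustrative.
A = np.array([[0.0, 1.0], [-1.0, -0.1]])
B = np.array([[0.0], [1.0]])
K = np.array([[1.0, 0.5]])
Q = np.eye(2)
dt, T = 0.01, 1.0
N = int(T / dt)

# Forward pass: roll out z(t) under the default policy mu.
z = np.zeros((N + 1, 2))
z[0] = np.array([1.0, 0.0])
for k in range(N):
    u = -K @ z[k]
    z[k + 1] = z[k] + (A @ z[k] + B @ u) * dt

# Backward pass: rho_dot = -(dl/dz + dmu/dz^T dl/du) - (df/dz + df/du dmu/dz)^T rho,
# with terminal condition rho(t_i + T) = dm/dz = 0 (no terminal cost here).
Acl = A - B @ K
rho = np.zeros((N + 1, 2))
for k in reversed(range(N)):
    rho[k] = rho[k + 1] + (Q @ z[k + 1] + Acl.T @ rho[k + 1]) * dt

# Mode insertion gradient at time index k_tau for an alternative control u_star.
k_tau, u_star = N // 2, np.array([0.5])
f1 = A @ z[k_tau] + B @ (-K @ z[k_tau])
f2 = A @ z[k_tau] + B @ u_star
dJ_dlambda = rho[k_tau] @ (f2 - f1)
```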

B. Proof of Theorem 1

Theorem 1: Given Assumption 1 and dynamics (20), the change in information^4 ∆I subject to (18) is given to first order by

    \Delta I \approx \left( \|(K_u v(x))^\top \rho\|^2_{R^{-1}} + \ell_{\text{task}}(z, \mu^\star) - \ell_{\text{task}}(z, \mu) \right) I_{\mu^\star} I_\mu + O(\Delta t),    (35)

where I_{µ*}, I_µ are the T-optimality measures (24) from applying the controls µ* and µ.

Proof. First define (14) for a controller as

    J(u(t)) = \int_{t_i}^{t_i + \Delta t} \frac{1}{I_u} + \ell_{\text{task}}(z(t), u(t))\, dt,    (36)

where ∆t < T is a time duration, z(t) is subject to the controller u(t), and I_u is the measure of information from applying the control u. If we consider the difference between J(µ*) and J(µ), where µ is a controller that minimizes ℓ_task(z, u), then

    J(\mu^\star) - J(\mu) = \int_{t_i}^{t_i + \Delta t} \frac{1}{I_{\mu^\star}} - \frac{1}{I_\mu} + \ell_{\text{task}}(z, \mu^\star) - \ell_{\text{task}}(z, \mu)\, dt
                          \approx \Delta t \left( \frac{1}{I_{\mu^\star}} - \frac{1}{I_\mu} + \ell_{\text{task}}(z, \mu^\star) - \ell_{\text{task}}(z, \mu) \right) + O(\Delta t).    (37)

^4 With respect to the information acquired from applying only µ(z).

From Corollary 1 and the fact that

    \frac{\partial}{\partial \lambda} J\, \Delta t \approx J(\mu^\star) - J(\mu),

we can show that

    \frac{\partial}{\partial \lambda} J\, \Delta t \approx J(\mu^\star) - J(\mu) \approx \Delta t \left( \frac{1}{I_{\mu^\star}} - \frac{1}{I_\mu} + \ell_{\text{task}}(z, \mu^\star) - \ell_{\text{task}}(z, \mu) \right) + O(\Delta t),    (38)

which we rearrange (38) and insert (21) to get

−‖ (Kuv(x))>ρ‖2

R−1 ≈(

1

Iµ?

− 1

Iµ+ `task(z, µ?)− `task(z, µ)

)+O(∆t).

≈ Iµ − Iµ? + (`task(z, µ?)− `task(z, µ))Iµ?IµIµ?

+O(∆t).(39)

Setting ∆I = I_{µ*} - I_µ in (39) and simplifying gives the relative information gain

    \Delta I \approx \left( \|(K_u v(x))^\top \rho\|^2_{R^{-1}} + \ell_{\text{task}}(z, \mu^\star) - \ell_{\text{task}}(z, \mu) \right) I_{\mu^\star} I_\mu + O(\Delta t).
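As a simple numerical illustration, the first-order estimate (35) of the information change can be evaluated directly from the two T-optimality measures and the task costs; all values below are arbitrary placeholders.

```python
import numpy as np

def delta_info(rho, Ku_v, R_inv, l_task_star, l_task_mu, I_star, I_mu):
    """First-order estimate of the information gain Delta I in (35):
    (||(K_u v)^T rho||^2_{R^-1} + l_task(z, mu*) - l_task(z, mu)) * I_star * I_mu."""
    w = Ku_v.T @ rho
    return (w @ R_inv @ w + l_task_star - l_task_mu) * I_star * I_mu

# Arbitrary illustrative values (dimensions: c_x = 3 observables, c_u = 2 controls).
rho = np.array([0.2, -0.1, 0.05])
Ku_v = np.random.default_rng(0).normal(size=(3, 2))
dI = delta_info(rho, Ku_v, np.eye(2), l_task_star=0.4, l_task_mu=0.3,
                I_star=0.8, I_mu=0.5)
```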

ACKNOWLEDGMENT

The authors would like to thank Giorgos Mamakoukas for his insight, discussion, and thorough review of this paper.

This material is based upon work supported by the National Science Foundation under award NSF CPS 1837515. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] B. O. Koopman, "Hamiltonian systems and transformation in Hilbert space," Proceedings of the National Academy of Sciences, vol. 17, no. 5, pp. 315-318, 1931.

[2] I. Mezic, "Analysis of fluid flows via spectral properties of the Koopman operator," Annual Review of Fluid Mechanics, vol. 45, pp. 357-378, 2013.

[3] ——, "On applications of the spectral theory of the Koopman operator in dynamical systems and control theory," in IEEE Int. Conf. on Decision and Control (CDC), 2015, pp. 7034-7041.

[4] M. Budisic, R. Mohr, and I. Mezic, "Applied Koopmanism," Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 22, no. 4, p. 047510, 2012.

[5] M. Korda and I. Mezic, "Linear predictors for nonlinear dynamical systems: Koopman operator meets model predictive control," arXiv preprint arXiv:1611.03537, 2016.

[6] N. Roy and A. McCallum, "Toward optimal active learning through Monte Carlo estimation of error reduction," International Conference on Machine Learning, pp. 441-448, 2001.

[7] A. Baranes and P.-Y. Oudeyer, "Active learning of inverse models with intrinsically motivated goal exploration in robots," Robotics and Autonomous Systems, vol. 61, no. 1, pp. 49-73, 2013.

[8] C. Dima, M. Hebert, and A. Stentz, "Enabling learning from large datasets: Applying active learning to mobile robotics," in IEEE Int. Conf. on Robotics and Automation (ICRA), vol. 1, 2004, pp. 108-114.


[9] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238-1274, 2013.

[10] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou, "Information theoretic MPC for model-based reinforcement learning," in IEEE Int. Conf. on Robotics and Automation (ICRA), 2017.

[11] B. Armstrong, "On finding exciting trajectories for identification experiments involving systems with nonlinear dynamics," The International Journal of Robotics Research, vol. 8, no. 6, pp. 28-48, 1989.

[12] A. D. Wilson, J. A. Schultz, A. R. Ansari, and T. D. Murphey, "Dynamic task execution using active parameter identification with the Baxter research robot," IEEE Transactions on Automation Science and Engineering, vol. 14, no. 1, pp. 391-397, 2017.

[13] A. D. Wilson, J. A. Schultz, and T. D. Murphey, "Trajectory synthesis for Fisher information maximization," IEEE Transactions on Robotics, vol. 30, no. 6, pp. 1358-1370, 2014.

[14] K. Ayusawa and E. Yoshida, "Motion retargeting for humanoid robots based on simultaneous morphing parameter identification and motion optimization," IEEE Transactions on Robotics, vol. 33, no. 6, pp. 1343-1357, 2017.

[15] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928-1937.

[16] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control," in International Conference on Machine Learning, 2016, pp. 1329-1338.

[17] M. Cutler, T. J. Walsh, and J. P. How, "Real-world reinforcement learning via multifidelity simulators," IEEE Transactions on Robotics, vol. 31, no. 3, pp. 655-671, 2015.

[18] W. Yu, J. Tan, C. K. Liu, and G. Turk, "Preparing for the unknown: Learning a universal policy with online system identification," in Proceedings of Robotics: Science and Systems, 2017.

[19] K. S. Sin and G. C. Goodwin, "Stochastic adaptive control using a modified least squares algorithm," Automatica, vol. 18, no. 3, pp. 315-321, 1982.

[20] F. Ding, X. Wang, Q. Chen, and Y. Xiao, "Recursive least squares parameter estimation for a class of output nonlinear systems based on the model decomposition," Circuits, Systems, and Signal Processing, vol. 35, no. 9, pp. 3323-3338, 2016.

[21] F. Ding, D. Meng, J. Dai, Q. Li, A. Alsaedi, and T. Hayat, "Least squares based iterative parameter estimation algorithm for stochastic dynamical systems with ARMA noise using the model equivalence," International Journal of Control, Automation and Systems, vol. 16, no. 2, pp. 630-639, 2018.

[22] V. Bonnet, P. Fraisse, A. Crosnier, M. Gautier, A. González, and G. Venture, "Optimal exciting dance for identifying inertial parameters of an anthropomorphic structure," IEEE Transactions on Robotics, vol. 32, no. 4, pp. 823-836, 2016.

[23] J. Jovic, A. Escande, K. Ayusawa, E. Yoshida, A. Kheddar, and G. Venture, "Humanoid and human inertia parameter identification using hierarchical optimization," IEEE Transactions on Robotics, vol. 32, no. 3, pp. 726-735, 2016.

[24] J. H. Tu, C. W. Rowley, D. M. Luchtenburg, S. L. Brunton, and J. N. Kutz, "On dynamic mode decomposition: theory and applications," Journal of Computational Dynamics, vol. 1, no. 2, pp. 391-421, 2014.

[25] A. Mauroy and I. Mezic, "Global stability analysis using the eigenfunctions of the Koopman operator," IEEE Transactions on Automatic Control, vol. 61, no. 11, pp. 3356-3369, 2016.

[26] S. L. Brunton, B. W. Brunton, J. L. Proctor, and J. N. Kutz, "Koopman invariant subspaces and finite linear representations of nonlinear dynamical systems for control," PloS one, vol. 11, no. 2, p. e0150171, 2016.

[27] E. Kaiser, J. N. Kutz, and S. L. Brunton, "Data-driven discovery of Koopman eigenfunctions for control," arXiv preprint arXiv:1707.01146, 2017.

[28] A. Sootla and D. Ernst, "Pulse-based control using Koopman operator under parametric uncertainty," IEEE Transactions on Automatic Control, 2017.

[29] C. W. Rowley, "Low-order models for control of fluids: Balanced models and the Koopman operator," Advances in Computation, Modeling and Control of Transitional and Turbulent Flows, p. 60, 2015.

[30] J. L. Proctor, S. L. Brunton, and J. N. Kutz, "Dynamic mode decomposition with control," Journal on Applied Dynamical Systems, vol. 15, no. 1, pp. 142-161, 2016.

[31] A. Surana, "Koopman operator based observer synthesis for control-affine nonlinear systems," in IEEE Int. Conf. on Decision and Control (CDC), 2016, pp. 6492-6499.

[32] I. Abraham, G. de la Torre, and T. Murphey, "Model-based control using Koopman operators," in Proceedings of Robotics: Science and Systems, 2017.

[33] A. Broad, T. Murphey, and B. Argall, "Learning models for shared control of human-machine systems with unknown dynamics," in Proceedings of Robotics: Science and Systems, 2017.

[34] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, "Embed to control: A locally linear latent dynamics model for control from raw images," in Advances in Neural Information Processing Systems, 2015, pp. 2746-2754.

[35] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, "Variational autoencoder for deep learning of images, labels and captions," in Advances in Neural Information Processing Systems, 2016, pp. 2352-2360.

[36] M. Rattray, D. Saad, and S.-i. Amari, "Natural gradient descent for on-line learning," Physical Review Letters, vol. 81, no. 24, p. 5461, 1998.

[37] T. Lai and C.-Z. Wei, "Extended least squares and their applications to adaptive control and prediction in linear systems," IEEE Transactions on Automatic Control, vol. 31, no. 10, pp. 898-906, 1986.

[38] M. Egerstedt, Y. Wardi, and F. Delmotte, "Optimal control of switching times in switched dynamical systems," in IEEE Int. Conf. on Decision and Control (CDC), vol. 3, 2003, pp. 2138-2143.

[39] H. Axelsson, Y. Wardi, M. Egerstedt, and E. Verriest, "Gradient descent approach to optimal mode scheduling in hybrid dynamical systems," Journal of Optimization Theory and Applications, vol. 136, no. 2, pp. 167-186, 2008.

[40] C. G. Atkeson and J. C. Santamaria, "A comparison of direct and model-based reinforcement learning," in IEEE International Conference on Robotics and Automation, vol. 4, 1997, pp. 3557-3564.

[41] A. R. Ansari and T. D. Murphey, "Sequential action control: Closed-form optimal control for nonlinear and nonsmooth systems," IEEE Transactions on Robotics, vol. 32, no. 5, pp. 1196-1214, 2016.

[42] F. Pukelsheim, Optimal Design of Experiments. SIAM, 2006.

[43] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.

[44] N. Nahi and G. Napjus, "Design of optimal probing signals for vector parameter estimation," in IEEE Conference on Decision and Control, vol. 10, 1971, pp. 162-168.

[45] T. Morimura, E. Uchibe, and K. Doya, "Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces," in International Symposium on Information Geometry and Its Applications, 2005, pp. 256-263.

[46] H. Wei, J. Zhang, F. Cousseau, T. Ozeki, and S.-i. Amari, "Dynamics of learning near singularities in layered networks," Neural Computation, vol. 20, no. 3, pp. 813-843, 2008.

[47] M. Inoue, H. Park, and M. Okada, "On-line learning theory of soft committee machines with correlated hidden units–steepest gradient descent and natural gradient descent–," Journal of the Physical Society of Japan, vol. 72, no. 4, pp. 805-810, 2003.

[48] X. Yan, V. Indelman, and B. Boots, "Incremental sparse GP regression for continuous-time trajectory estimation and mapping," Robotics and Autonomous Systems, vol. 87, pp. 120-132, 2017.

[49] D. Nguyen-Tuong and J. Peters, "Incremental online sparsification for model learning in real-time robot control," Neurocomputing, vol. 74, no. 11, pp. 1859-1867, 2011.

[50] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, "Gaussian processes for data-efficient learning in robotics and control," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 408-423, 2015.

[51] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238-1274, 2013.

[52] P. Kormushev, S. Calinon, and D. G. Caldwell, "Robot motor skill coordination with EM-based reinforcement learning," in International Conference on Intelligent Robots and Systems, 2010, pp. 3232-3237.

[53] R. Saegusa, G. Metta, G. Sandini, and S. Sakka, "Active motor babbling for sensorimotor learning," in International Conference on Robotics and Biomimetics, 2009, pp. 794-799.

[54] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," arXiv preprint arXiv:1708.02596, 2017.

[55] R. F. Reinhart, "Autonomous exploration of motor skills by skill babbling," Autonomous Robots, vol. 41, no. 7, pp. 1521-1537, 2017.


[56] T. Fan and T. Murphey, "Online feedback control for input-saturated robotic systems on Lie groups," in Proceedings of Robotics: Science and Systems, 2016.

[57] F. Berkenkamp, A. P. Schoellig, and A. Krause, "Safe controller optimization for quadrotors with Gaussian processes," in IEEE Int. Conf. on Robotics and Automation (ICRA), 2016, pp. 491-496.

[58] J. Schreiter, D. Nguyen-Tuong, M. Eberts, B. Bischoff, H. Markert, and M. Toussaint, "Safe exploration for active learning with Gaussian processes," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2015, pp. 133-149.

[59] B. Kramer, P. Grover, P. Boufounos, S. Nabi, and M. Benosman, "Sparse sensing and DMD-based identification of flow regimes and bifurcations in complex flows," Journal on Applied Dynamical Systems, vol. 16, no. 2, pp. 1164-1196, 2017.

[60] E. Yeung, S. Kundu, and N. Hodas, "Learning deep neural network representations for Koopman operators of nonlinear dynamical systems," arXiv preprint arXiv:1708.06850, 2017.

[61] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[62] B. Lusch, J. N. Kutz, and S. L. Brunton, "Deep learning for universal linear embeddings of nonlinear dynamics," Nature Communications, vol. 9, no. 1, p. 4950, 2018.

[63] Itseez, "Open source computer vision library," https://github.com/itseez/opencv, 2015.

[64] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "ROS: an open-source robot operating system," in ICRA Workshop on Open Source Software, vol. 3, no. 3.2. Kobe, 2009, p. 5.

[65] O. Klimov and J. Shulman, "Roboschool," https://github.com/openai/roboschool, 2017.

[66] B. D. Anderson and J. B. Moore, Optimal Control: Linear Quadratic Methods. Courier Corporation, 2007.

Todd D. Murphey received his B.S. degree in mathematics from the University of Arizona and the Ph.D. degree in Control and Dynamical Systems from the California Institute of Technology. He is a Professor of Mechanical Engineering at Northwestern University. His laboratory is part of the Neuroscience and Robotics Laboratory, and his research interests include robotics, control, computational methods for biomechanical systems, and computational neuroscience. Honors include the National Science Foundation CAREER award in 2006, membership in the 2014-2015 DARPA/IDA Defense Science Study Group, and Northwestern's Professorship of Teaching Excellence. He was a Senior Editor of the IEEE Transactions on Robotics.

Ian Abraham received the B.S. degree in Mechanical and Aerospace Engineering from Rutgers University and the M.S. degree in Mechanical Engineering from Northwestern University. He is currently a Ph.D. Candidate working in the Neuroscience and Robotics Lab. His Ph.D. work focuses on active sensing and efficient robot learning.

