
IntentionGAN: Multi-Modal Imitation Learning from Unstructured Demonstrations

Karol Hausman∗, Yevgen Chebotar∗, Stefan Schaal, Gaurav Sukhatme, Joseph J. Lim
University of Southern California, Los Angeles, USA

I. INTRODUCTION

Traditionally, imitation learning has focused on using isolated demonstrations of a particular skill [3]. The demonstration is usually provided in the form of kinesthetic teaching, which requires the user to spend sufficient time to provide the right training data. This constrained setup for imitation learning is difficult to scale to real world scenarios, where robots have to be able to execute a combination of different skills. To learn these skills, the robots would require a large number of robot-tailored demonstrations, since at least one isolated demonstration has to be provided for every individual skill.

In order to improve the scalability of imitation learning, we propose a framework that can learn to imitate skills from a set of unstructured and unlabeled demonstrations of various tasks.

As a motivating example, consider a highly unstructured data source, e.g. a video of a person cooking a meal. A complex activity, such as cooking, involves a set of simpler skills such as grasping, reaching, cutting, pouring, etc. In order to learn from such data, three components are required: i) the ability to map the image stream to state-action pairs that can be executed by a robot, ii) the ability to segment the data into simple skills, and iii) the ability to imitate each of the segmented skills. In this work, we tackle the latter two components, leaving the first one for future work.

In this paper, we present a novel imitation learning method that learns a multi-modal stochastic policy, which is able to imitate a number of automatically segmented tasks using a set of unstructured and unlabeled demonstrations. Our results indicate that the presented technique can separate the demonstrations into sensible individual skills and imitate these skills using a learned multi-modal policy.

II. MULTI-MODAL IMITATION LEARNING

The traditional imitation learning scenario considers the problem of learning to imitate one skill from demonstrations. The demonstrations represent samples from a single expert policy $\pi_{E_1}$. In this work, we focus on an imitation learning setup where we learn from unstructured and unlabeled demonstrations of various tasks. The demonstrations come from a set of expert policies $\pi_{E_1}, \pi_{E_2}, \ldots, \pi_{E_k}$, where $k$ can be unknown, that optimize different reward functions/tasks. We refer to this set of unstructured expert policies as a mixture of policies $\pi_E$. We aim to segment the demonstrations of these policies into separate tasks and learn a multi-modal policy that imitates all of them.

∗ Equal contribution

To be able to learn multi-modal policy distributions, we augment the policy input with a latent intention $i$ distributed by a categorical or uniform distribution $p(i)$, similar to [1]. The goal of the intention variable is to select a specific mode of the policy, which corresponds to one of the skills presented in the demonstrations. The resulting policy can be expressed as

$$\pi^i(a \mid s, i) = \frac{p(i \mid s, a)\, \pi^i(a \mid s)}{p(i)}.$$
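This expression follows from Bayes' rule under the assumption that the latent intention is drawn independently of the state, so that $p(i \mid s) = p(i)$; written out for completeness:

$$\pi^i(a \mid s, i) = p(a \mid s, i) = \frac{p(i \mid s, a)\, p(a \mid s)}{p(i \mid s)} = \frac{p(i \mid s, a)\, \pi^i(a \mid s)}{p(i)}.$$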

We augment the trajectory $\tau$ to include the latent intention as $\tau_i = (s_0, a_0, i_0, \ldots, s_T, a_T, i_T)$. The resulting reward of the trajectory with the latent intention is $R(\tau_i) = \sum_{t=0}^{T} \gamma^t R(s_t, a_t, i_t)$, where $R(s, a, i)$ is a reward function that depends on the latent intention $i$, as we have multiple demonstrations that optimize different reward functions for different tasks. The expected discounted reward is equal to

$$\mathbb{E}_{\pi^i_\theta}[R(\tau_i)] = \int R(\tau_i)\, \pi^i_\theta(\tau_i)\, \mathrm{d}\tau_i, \quad \text{where} \quad \pi^i_\theta(\tau_i) = p_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi^i_\theta(a_t \mid s_t, i_t)\, p(i_t).$$
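As a concrete reading of this return, the following minimal Python sketch (function and argument names are ours, purely illustrative) accumulates $R(\tau_i) = \sum_{t=0}^{T} \gamma^t R(s_t, a_t, i_t)$ for a single intention-augmented trajectory:

def discounted_return(states, actions, intentions, reward_fn, gamma=0.99):
    # R(tau_i) = sum_t gamma^t * R(s_t, a_t, i_t) for one augmented trajectory
    total = 0.0
    for t, (s, a, i) in enumerate(zip(states, actions, intentions)):
        total += (gamma ** t) * reward_fn(s, a, i)
    return total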

Here, we show an extension of the derivation presented in [2] for a policy $\pi^i(a \mid s, i)$ augmented with the latent intention variable $i$, which uses demonstrations from a set of expert policies $\pi_E$. We are aiming at maximum-entropy policies whose mode can be determined from the latent intention variable $i$. Accordingly, we transform the original max-entropy inverse reinforcement learning (IRL) problem [5] to reflect this goal:

$$\max_R \Big( \max_{\pi^i} H(\pi^i(a \mid s)) - H(\pi^i(a \mid s, i)) + \mathbb{E}_{\pi^i} R(s, a, i) \Big) - \mathbb{E}_{\pi_E} R(s, a, i).$$

This objective reflects our goal: we aim to obtain a multi-modal policy that has high entropy when no intention is given, but collapses to a particular task when the intention is specified. Analogously to the solution for a single expert policy presented in [2], this optimization objective results in the optimization of the generative adversarial imitation learning network with the state-action pairs $(s, a)$ being sampled from a set of expert policies $\pi_E$:

$$\max_\theta \min_w \; \mathbb{E}_{i \sim p(i),\, (s,a) \sim \pi^i_\theta}\big[\log(D_w(s, a))\big] + \mathbb{E}_{(s,a) \sim \pi_E}\big[1 - \log(D_w(s, a))\big] + \lambda_H H(\pi^i_\theta(a \mid s)) - \lambda_I H(\pi^i_\theta(a \mid s, i)), \quad (1)$$

where $\lambda_I$, $\lambda_H$ correspond to the weighting parameters on the respective objectives. The resulting entropy term $H(\pi^i_\theta(a \mid s, i))$ can be expressed as

$$H(\pi^i_\theta(a \mid s, i)) = \mathbb{E}_{i \sim p(i),\, (s,a) \sim \pi^i_\theta}\big[-\log(\pi^i_\theta(a \mid s, i))\big] = -\mathbb{E}_{i \sim p(i),\, (s,a) \sim \pi^i_\theta}\big[\log(p(i \mid s, a))\big] + H(\pi^i_\theta(a \mid s)) - H(i), \quad (2)$$


where $H(i)$ is a constant that does not influence the optimization. This results in the same optimization objective as for the single expert policy [2], with an additional term $\lambda_I\, \mathbb{E}_{i \sim p(i),\, (s,a) \sim \pi^i_\theta}\big[\log(p(i \mid s, a))\big]$ responsible for rewarding state-action pairs that make the latent intention inference easier. We refer to this cost as the latent intention cost and represent $p(i \mid s, a)$ with a neural network.
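To make this term concrete, below is a minimal PyTorch sketch (our own illustration, not the authors' implementation) of a network representing $p(i \mid s, a)$, of the latent intention bonus $\lambda_I \log p(i \mid s, a)$, and of a maximum-likelihood update for the posterior; the architecture, the training scheme, and the weighting constant are assumptions made for readability.

import torch
import torch.nn as nn

class IntentionPosterior(nn.Module):
    # Neural network representing p(i | s, a) over a categorical latent intention.
    def __init__(self, state_dim, action_dim, num_intentions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_intentions))

    def log_prob(self, s, a, i):
        logits = self.net(torch.cat([s, a], dim=-1))
        return torch.distributions.Categorical(logits=logits).log_prob(i)

def latent_intention_bonus(posterior, s, a, i, lambda_i=0.1):
    # lambda_I * log p(i | s, a): rewards state-action pairs whose intention is easy to infer.
    with torch.no_grad():
        return lambda_i * posterior.log_prob(s, a, i)

def posterior_update(posterior, optimizer, s, a, i):
    # Fit p(i | s, a) by maximum likelihood on the policy's own rollouts (an assumed scheme).
    loss = -posterior.log_prob(s, a, i).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In practice, such a bonus would be added to the discriminator-based imitation reward before each policy update, with the posterior trained on the policy's own rollouts (e.g., using torch.optim.Adam over posterior.parameters()).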

III. EXPERIMENTS

Reacher. The actuator is a 2-DoF arm attached at the center of the scene. There are two targets placed at random positions throughout the environment. The goal of the task is, given a data set of reaching motions to random targets, to discover the dependency of the target selection on the intention and learn a policy that is capable of reaching different targets based on the specified intention input.

Walker-2D. The Walker-2D is a 6-DoF bipedal robot consisting of two legs and feet attached to a common base. The goal of this task is to learn a policy that can switch between three different behaviors dependent on the discovered intentions: running forward, running backward, and jumping. We use TRPO [4] to train single expert policies and create a combined data set of all three behaviors that is used to train a multi-modal policy using our imitation framework.
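As a sketch of the data preparation implied here (the data layout and function below are hypothetical, not from the paper), the combined training set simply pools the expert trajectories and drops their task labels:

import random

def build_unlabeled_demos(expert_trajectories):
    # expert_trajectories: dict mapping task name -> list of trajectories,
    # where each trajectory is a list of (state, action) pairs.
    pool = [pair
            for trajectories in expert_trajectories.values()
            for trajectory in trajectories
            for pair in trajectory]
    random.shuffle(pool)  # task labels are discarded; only unlabeled (state, action) pairs remain
    return pool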

Humanoid. Humanoid is a high-dimensional robot with 17 degrees of freedom. Similar to Walker-2D, the goal of the task is to be able to discover three different policies: running forward, running backward, and balancing, from the combined expert demonstrations of all of them.

The performance of our method in all of these setups can be seen in our supplementary video: http://sites.google.com/view/nips17intentiongan.

We first evaluate the influence of the latent intention cost on the Reacher task. For these experiments, we use a categorical intention distribution with the number of categories equal to the number of targets.
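A small sketch of this setup (how the intention enters the policy is not specified above; the one-hot concatenation and names below are our assumptions): the intention is drawn from a categorical distribution with one category per target and appended to the observation that the policy receives.

import numpy as np

NUM_TARGETS = 2  # categorical intention distribution with one category per target

def sample_intention(num_intentions=NUM_TARGETS):
    # Draw i ~ p(i) from a uniform categorical distribution.
    return np.random.randint(num_intentions)

def policy_observation(observation, intention, num_intentions=NUM_TARGETS):
    # Append a one-hot encoding of the intention to the environment observation (assumed design).
    one_hot = np.zeros(num_intentions, dtype=np.float32)
    one_hot[intention] = 1.0
    return np.concatenate([np.asarray(observation, dtype=np.float32), one_hot])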

To demonstrate the development of different intentions, in Fig. 1 (left) we present the Reacher rewards over training iterations for different intention variables. When the latent intention cost is included (Fig. 1-1), the separation of different skills for different intentions starts to emerge around the 1000-th iteration and leads to a multi-modal policy that, given the intention value, consistently reaches the target associated with that intention. In the case of the standard imitation learning GAN setup (Fig. 1-2), the network learns how to imitate reaching only one of the targets for both intention values.

We also seek to further understand whether our model extends to segmenting and imitating policies that perform different tasks. In particular, we evaluate whether our framework is able to learn a multi-modal policy on the Walker-2D task. The results are depicted in Fig. 2 (left). The additional latent intention cost results in a policy that is able to autonomously segment and mimic all three behaviors and achieve a similar performance to the expert policies (Fig. 2-1). Different intention variable values correspond to different expert policies: 0 - running forwards, 1 - jumping, and 2 - running backwards. The imitation learning GAN method is shown as a baseline in Fig. 2-2. The results show that the policy collapses to a single mode, where all different intention variable values correspond to the jumping behavior, ignoring the demonstrations of the other two skills.

Fig. 1. Rewards of different Reacher policies for 2 targets for different intention values over the training iterations with (1) and without (2) the latent intention cost.

Fig. 2. Top: Rewards of Walker-2D policies for different intention values over the training iterations with (1) and without (2) the latent intention cost. Bottom: Rewards of Humanoid policies for different intention values over the training iterations with (3) and without (4) the latent intention cost.

To test if our multi-modal imitation learning framework scales to high-dimensional tasks, we evaluate it in the Humanoid environment. Fig. 2 (right) shows the rewards obtained for different values of the intention variable. Similarly to Walker-2D, the latent intention cost enables the neural network to segment the tasks and learn a multi-modal imitation policy. In this case, however, due to the high dimensionality of the task, the resulting policy is able to mimic the running forwards and balancing policies almost as well as the experts, but it achieves a suboptimal performance on the running backwards task (Fig. 2-3). The imitation learning GAN baseline collapses to a uni-modal policy that maps all the intention values to a balancing behavior (Fig. 2-4).

IV. CONCLUSIONS

We present a novel imitation learning method that learns a multi-modal stochastic policy, which is able to imitate a number of automatically segmented tasks using a set of unstructured and unlabeled demonstrations. The presented approach learns the notion of intention and is able to perform different tasks based on the policy intention input.


REFERENCES

[1] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets, 2016.

[2] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. CoRR, abs/1606.03476, 2016.

[3] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.

[4] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 1889–1897. JMLR.org, 2015.

[5] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Dieter Fox and Carla P. Gomes, editors, AAAI, pages 1433–1438. AAAI Press, 2008.

