
Learning Generalized Reactive Policies using Deep Neural Networks

Edward Groshev, Department of Computer Science, University of California, Berkeley,
Berkeley, CA 94720, [email protected]

Aviv Tamar, Department of Computer Science, University of California, Berkeley,
Berkeley, CA 94720, [email protected]

Siddharth Srivastava∗, School of Computing, Informatics, and Decision Systems Engineering,
Arizona State University, Tempe, AZ 85281, [email protected]

Pieter Abbeel, Department of Computer Science, University of California, Berkeley,
Berkeley, CA 94720, [email protected]

Abstract

We consider the problem of learning for planning, where knowledge acquired while planning is reused to plan faster in new problem instances. For robotic tasks, among others, plan execution can be captured as a sequence of visual images. For such domains, we propose to use deep neural networks in learning for planning, based on learning a reactive policy that imitates execution traces produced by a planner. We investigate architectural properties of deep networks that are suitable for learning long-horizon planning behavior, and explore how to learn, in addition to the policy, a heuristic function that can be used with classical planners or search algorithms such as A∗. Our results on the challenging Sokoban domain show that, with a suitable network design, complex decision making policies and powerful heuristic functions can be learned through imitation. Videos available at https://sites.google.com/site/learn2plannips/.

1 Introduction

In order to help with day to day chores such as organizing a cabinet or arranging a dinner table, robots need to be able to plan: to reason about the best course of action that could lead to a given objective. Unfortunately, planning is known to be a challenging computational problem. The plan existence problem for deterministic, fully observable environments is PSPACE-complete when expressed using rudimentary propositional representations [2]. Such results have inspired the learning for planning paradigm: learning, or reusing the knowledge acquired while planning across multiple problem instances (in the form of triangle tables [6], learning control knowledge for planning [31], and constructing generalized plans [25], among other approaches) with the goal of faster plan computation in a new problem instance.

∗Some of the work was done while this author was at United Technologies Research Center

arXiv:1708.07280v1 [cs.AI] 24 Aug 2017


Figure 1: The Sokoban domain (best viewed in color). In Sokoban the agent (red dot) needs to push around movable objects (purple dots) between unmovable obstacles (blue squares) to a goal position (green square). In this figure we show a challenging Sokoban instance with one object. From left to right, we plot several steps in the shortest plan for this task: arrows represent the agent's path, and light purple dots show the resulting object movement. This 44 step trajectory was produced by our learned DNN policy. Note that it demonstrates reasoning about dead ends that may happen many steps after the initial state.

One challenge in learning for planning, however, is how to select a good representation for the learning problem, and prior approaches (e.g., [11, 15, 30, 31, 25]) have relied upon hand-written domain descriptions and feature sets based on languages such as PDDL [7].

Recently, deep neural networks (DNNs) have been used to automatically extract expressive features from data, leading to state-of-the-art learning results in image classification [13], natural language processing [26], and control [16], among other domains. The phenomenal success of DNNs across various disciplines motivates us to investigate whether DNNs can learn useful representations in the learning for planning setting as well.

In this work, we present an imitation learning (IL) approach for learning a generalized reactive policy (GRP) – a policy that mimics a planner, by drawing upon past plan executions for similar problems. Our learned GRP captures a reactive policy in the form of a DNN that predicts the action to be taken under any situation, given an observation of the planning domain and the current state. Our approach can also be used to automatically generate heuristic functions for a given domain, to be used in arbitrary directed search algorithms such as A∗ [22].

Imitation learning has been previously used with DNNs to learn policies for tasks that involve short horizon reasoning such as path following and obstacle avoidance [20, 21, 27], focused robot skills [17, 18], and recently block stacking [4]. In this work, we investigate whether IL can be used to learn tasks that require longer-horizon reasoning, such as demonstrated by state-of-the-art planners.

For the purpose of this paper, we restrict our attention to planning problems for which plan execution can be accurately captured as a sequence of images. This category captures a number of problems of interest in household robotics including setting the dinner table. In particular, we focus on the Sokoban domain (see Figure 1), which has been described as the most challenging problem in the literature on learning for planning [5]: “Sokoban has been demonstrated to be a very challenging domain for AI search and planning algorithms, even when significant human domain knowledge is provided [10]. This domain is more complex than the others.”2

Our experiments reveal that several architectural components are required to make IL work efficientlyin learning for planning:

1. A deep network. We attribute this finding to the network having to learn some form of planning computation, which cannot be captured with a shallow architecture.

2. Structuring the network to receive as input pairs of current state and goal observations. This allows us to ‘bootstrap’ the data, by training with all pairs of states in a demonstration trajectory.

3. Predicting plan length as an auxiliary training signal can improve IL performance. In addition, the plan length can be effectively exploited as a heuristic by standard planners.

We believe that these observations are general, and will hold for many domains. For the particular case of Sokoban, using these insights, we were able to demonstrate a 97% success rate in one-object domains, and an 87% success rate in two-object domains. In Figure 1 we show an example test domain, and a non-trivial solution produced by our learned DNN.

2This quote is in reference to the general Sokoban domain, containing multiple objects and goals.


The performance of reactive policies learned by our method for a given domain is consistent across a number of test problems generated at random from the same domain. We also show that the learned heuristic function significantly improves upon standard hand-designed heuristics for Sokoban.

1.1 Related Work

The interface of planning and learning has been investigated extensively in the past.

Within the learning for planning literature [5], several studies considered learning a reactive policy, which is similar to our imitation learning approach. The works of Khardon [11], Martin and Geffner [15], and Yoon et al. [30] learn policies represented as decision lists on the logical problem representation, which needs to be hand specified. Our approach requires as input only a set of successful plans and their executions—our neural network architecture is able to learn a reactive policy that predicts the best action to execute based on an image of the current state of the environment without any additional representational expressions. Our approach thus offers two major advantages over prior efforts: (1) in situations where successful plan executions can be observed, e.g. by observing humans solving problems, our approach can eliminate the effort required in designing domain representations; (2) in situations where guarantees of success are required, and domain representations are available, our approach can automatically generate a representation-independent heuristic function which can be used with arbitrary directed search algorithms.

Works in imitation learning have mainly focused on learning skills from human demonstrations, such as driving and obstacle avoidance [20, 21], and robot table tennis and rope manipulation [17, 18]. Pfeiffer et al. [19] recently applied IL to learning obstacle avoidance from a motion planner. These skills do not require learning a planning computation, in the sense that the difference between the train and test environments is mostly in the observation (e.g., the visual driving conditions), but not in the task goal (e.g., stay on the lane and avoid obstacles). Indeed, the recent work of Tamar et al. [27] demonstrated the difficulty of generalizing goal directed behavior from demonstration. The recent work of Duan et al. [4] showed learning of various block stacking tasks where the goal was specified by an additional execution trace. From a planning perspective, the Sokoban domain considered here is considerably more challenging than block stacking or navigation between obstacles. The ‘one-shot’ techniques in [4], however, are complementary to this work. The impressive AlphaGo [23] program learned a DNN strategy for Go, using a combination of IL and reinforcement learning through self-play. Extending our work to reinforcement learning is a direction for future research.

Recently, several authors considered DNN architectures that are suitable for learning a planning computation. In [27], a value iteration planning computation was embedded within the network structure, and demonstrated successful learning on 2D gridworld navigation. Due to the curse of dimensionality, it is not clear how to extend that work to planning domains with much larger state spaces, such as the Sokoban domain considered here. The predictron architecture [24] uses ideas from temporal difference learning to design a DNN for reward prediction. Their architecture uses a recurrent network to internally simulate the future state transitions and predict future rewards. At present, it is also not clear how to use the predictron for learning a policy, as we do here. Concurrently with our work, Weber et al. [29] proposed a DNN architecture that combines model based planning with model free components for reinforcement learning, and demonstrated results on the Sokoban domain. In comparison, our IL approach requires significantly fewer training instances of the planning problem (over 3 orders of magnitude) to achieve similar performance in Sokoban.

2 Background

In this section we present our formulation and preliminaries.

Planning: We focus on fully observable, deterministic task planning problems described in the formal language PDDL [7]. Such planning problems are defined as a tuple Π = 〈E, F, I, G, A〉, with the following definitions:

E: a set of entities in the domain (e.g. individual objects or locations).
F: a set of binary fluents that describe relations between entities: ObjAt(obj3, loc1), InGripper(obj2), etc.
I ∈ 2^F: a conjunction of fluents that are initially true.
G ∈ 2^F: a conjunction of fluents that characterizes the set of goal states.


A: a set of actions that describe the ways in which an agent can alter the world. Each action is characterized by: preconditions, a set of fluents that describes the set of states where the action is applicable, and effects, a set of fluents that change after the action is carried out.

For clarity, we will describe both preconditions and effects as conjunctive lists of fluents. As an example, the discrete move action could be represented as follows:

Move(loc1, loc2):
{pre: RobotAt(loc1),
 eff: ¬RobotAt(loc1), RobotAt(loc2)}

We introduce some additional notation for the planning problem, to make the connection with imitation learning clearer. We denote by S = 2^F the state space of the planning problem. A state s ∈ S corresponds to the values of each fluent in F. The initial state s0 is defined by I, and a goal state sg is defined by G. The task in planning is to find a sequence of actions – the so-called plan – that, when consecutively applied to the initial state, results in a goal state. In addition, we denote by o(Π, s) the observation for a problem Π when the state is s. For example, o can be an image of the current game state, as depicted in Figure 1 for Sokoban. We let τ = {s0, o0, a0, s1, . . . , sg, og} denote the state-observation-action trajectory implied by the plan. The plan length is the number of states in τ.
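As a concrete reading of this formalism, the following minimal Python sketch (not from the paper; the fluent strings and the Move helper are illustrative) represents a state as the set of fluents that currently hold and applies the Move action by checking its precondition and applying its effects:

    # Minimal illustration of the STRIPS-style semantics above; names are illustrative.
    from typing import FrozenSet

    State = FrozenSet[str]  # a state is the set of fluents that are currently true

    def move(state: State, loc1: str, loc2: str) -> State:
        """Apply Move(loc1, loc2): requires RobotAt(loc1); deletes it and adds RobotAt(loc2)."""
        if f"RobotAt({loc1})" not in state:
            raise ValueError("precondition not satisfied")
        return frozenset(state - {f"RobotAt({loc1})"} | {f"RobotAt({loc2})"})

    s0: State = frozenset({"RobotAt(loc1)", "ObjAt(obj3, loc1)"})
    s1 = move(s0, "loc1", "loc2")  # now contains RobotAt(loc2) instead of RobotAt(loc1)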

Learning for Planning: In the learning for planning setting, we are given a set Dtrain of Ntrain problem instances {Π1, . . . , ΠNtrain}, which will be used for learning a model that can improve and/or replace a planner, and a set Dtest of Ntest problem instances that will be used for evaluating the learned model. We assume that the training and test domains are similar in some sense, so that relevant knowledge can be extracted from the training set to improve performance on the test set. Concretely, both training and test domain instances come from the same distribution.

Imitation Learning: In imitation learning (IL), demonstrations of an expert performing a task are given in the form of observation-action trajectories Dimitation = {o0, a0, o1, . . . , oT, aT}. The goal is to find a policy – a mapping from observation to actions a = µ(o), which imitates the expert. A straightforward IL approach is behavioral cloning [20], in which standard supervised learning is used to learn µ from the data.
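Behavioral cloning therefore reduces to ordinary supervised learning over the demonstrated observation-action pairs. A minimal sketch of such a training loop is given below; this is illustrative rather than the authors' code, and the policy_net module, data loader, and tensor shapes are assumptions:

    import torch
    import torch.nn as nn

    def behavioral_cloning(policy_net: nn.Module, loader, epochs: int = 10, lr: float = 1e-3):
        """Fit mu(o) -> a by cross-entropy over expert (observation, action) pairs."""
        opt = torch.optim.Adam(policy_net.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for obs, act in loader:          # obs: (B, C, H, W) images, act: (B,) action indices
                logits = policy_net(obs)     # (B, num_actions)
                loss = loss_fn(logits, act)
                opt.zero_grad()
                loss.backward()
                opt.step()
        return policy_net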

3 Imitation Learning in Learning for Planning

We define a generalized reactive policy (GRP) as a function that maps <problem instance, state> to action, thus generalizing the concept of a policy. In this work, similar to [27], we assume that the problem instance and state are given as an image observation. Such a representation is suitable for many robotic problems3.

In this section we describe our approach for learning GRPs. Our approach consists of two stages: a data generation stage and a policy training stage.

Data generation: given the training domains Dtrain, we generate a data set for imitation learning, Dimitation. For each Π ∈ Dtrain, we run an off-the-shelf planner to generate a plan and corresponding trajectory τ, and then add the observations and actions in τ to Dimitation. In our experiments we used the Fast-Forward (FF) planner [9], though any other PDDL planner can be used instead.
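A sketch of this data generation loop is shown below; solve_with_planner and render_observation are hypothetical wrappers around an off-the-shelf PDDL planner and an image renderer, not functions defined in the paper:

    def generate_imitation_data(train_problems, solve_with_planner, render_observation):
        """Roll out a planner on each training instance and record (observation, action) pairs."""
        data = []
        for problem in train_problems:
            plan = solve_with_planner(problem)      # list of (state, action) pairs along the plan
            trajectory = []
            for state, action in plan:
                obs = render_observation(problem, state)
                trajectory.append((obs, action))
            data.append(trajectory)                 # trajectories are kept intact for bootstrapping
        return data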

Policy training: Given the generated data Dimitation, we use IL to learn a policy µ.

The learned policy µ maps an observation to an action, and therefore can be readily deployed to any test problem in Dtest. If the observation o(Π, s) contains sufficient information about Π and s, then potentially, the policy µ can represent the decision making of the planning algorithm used to generate the data. Moreover, if there is a shared structure between the domains in Dtrain, such as subgoals, or simple decision rules in certain situations, a good learning algorithm has the potential to learn the shared structure.

One may wonder why such a naive approach would even learn to produce the complex decision making ability that is required to solve unseen instances in Dtest. Indeed, as we show in our experiments, naive behavioral cloning with standard shallow neural networks fails on this task. One of the contributions of this work is the investigation of DNN representations that make this simple approach succeed.

3It is also possible to extend our work to graph representations using convolutions on graphs [27, 3]. We defer this to future work.



4 Network Architecture

In this section, we present our DNN architecture for learning a GRP. In particular, we propose two design choices that aid in learning long-horizon planning behavior.

4.1 Goal Based Policy for Data Bootstrapping

In the IL literature (e.g., [20, 21]), the policy is typically structured as a mapping from observation to action, a = µ(o). In order for this policy to represent goal directed behavior, as in the planning domains we consider, the observation o at each state must contain information about the goal state sg.

Recall that our training data Dimitation consists of Ntrain trajectories composed of observation-action pairs. This means that the number of training samples for a policy a = µ(o) is equal to the number of observation-action pairs in the training data.

We propose, instead, to structure the policy as a mapping from both a current observation and a goal observation to the current action, a = µ(o, og). We term such a policy structure a goal based policy. Our reasoning for such a structure is based on the following fact:

Proposition 1. For a planning problem Π with initial state s0 and goal state sg, let τopt = {s0, s1, . . . , sg} denote the shortest plan from s0 to sg. Let µopt(s) denote an optimal policy for Π, in the sense that executing it from s0 generates the shortest path τopt to sg. Then, µopt is also optimal for the modified problem obtained from Π by replacing the initial and goal states with any two states si, sj ∈ τopt such that i < j.

Proposition 1 underlies classical planning methods such as triangle tables [6]. Here, we exploit it to design our DNN based on the following observation: if we structure the DNN to take as input both the current observation and a goal observation, then for each observation-action trajectory in our data Dimitation, any pair of observations oi, oj with i < j can be used as a sample for training the policy. We term this bootstrapping the data. For a given trajectory of length T, the bootstrap can potentially increase the number of training samples from T to (T − 1)^2/2. In practice, for each trajectory τ ∈ Dimitation, we uniformly sample nbootstrap pairs of observations from τ. In each pair, the first observation is treated as the current observation, while the last observation is treated as the goal observation. This results in nbootstrap + T training samples for each trajectory τ, which are added to a bootstrap training set Dbootstrap to be used instead of Dimitation for training the policy.4
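A sketch of this bootstrapping step, under the assumption that a trajectory is stored as a time-ordered list of (observation, action) pairs, is as follows (illustrative code, not the authors' implementation):

    import random

    def bootstrap_trajectory(trajectory, n_bootstrap):
        """Create (current_obs, goal_obs, action) samples from one demonstration trajectory."""
        T = len(trajectory)
        goal_obs = trajectory[-1][0]
        # the T original samples, with the trajectory's final observation as the goal
        samples = [(obs, goal_obs, act) for obs, act in trajectory]
        # n_bootstrap extra samples: a later observation serves as the goal for an earlier one
        for _ in range(n_bootstrap):
            i, j = sorted(random.sample(range(T), 2))   # i < j, justified by Proposition 1
            obs_i, act_i = trajectory[i]
            obs_j, _ = trajectory[j]
            samples.append((obs_i, obs_j, act_i))
        return samples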

4.2 Network Structure

We propose a general structure for a network that can learn a GRP from visual execution traces.

Our network is depicted in Figure 2. The current state and goal state observations are passed through several layers of convolution which are shared between the action prediction network and the plan length prediction network. There are also skip connections from the input layer to every convolution layer.

The shared representation is motivated by the fact that both the actions and the overall plan length are integral parts of a plan. Having knowledge of the actions makes it easy to determine plan length, and vice versa, knowledge about the plan length can act as a template for determining the actions. The skip connections are motivated by the fact that several planning algorithms can be seen as applying a repeated computation, based on the planning domain, to a latent variable. For example, greedy search expands the current node based on the possible next states, which are encoded in the domain; value iteration is a repeated modification of the value given the reward and state transitions, which are also encoded in the domain. Since the network receives no knowledge about the domain other than what is present in the observation, we hypothesize that feeding the observation to every conv-net layer can facilitate the learning of similar planning computations. We note that in value iteration networks [27], similar skip connections were used in an explicit neural network implementation of value iteration.
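The sketch below illustrates this structure in PyTorch-style code; the layer widths, the 3x3 kernels, and the assumption of a 3-channel image per observation are ours, not specified in the text:

    import torch
    import torch.nn as nn

    class GRPNet(nn.Module):
        """Shared conv trunk with input skip connections, plus action and plan-length heads."""

        def __init__(self, obs_channels=3, hidden=64, depth=8, num_actions=4, grid=9):
            super().__init__()
            in_ch = 2 * obs_channels  # current and goal observations stacked along channels
            self.convs = nn.ModuleList([
                nn.Conv2d((hidden if i > 0 else in_ch) + in_ch, hidden, 3, padding=1)
                for i in range(depth)
            ])
            self.action_head = nn.Sequential(
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
                nn.Flatten(), nn.Linear(hidden * grid * grid, num_actions))
            self.length_head = nn.Sequential(
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
                nn.Flatten(), nn.Linear(hidden * grid * grid, 1))

        def forward(self, obs, goal_obs):
            x = torch.cat([obs, goal_obs], dim=1)   # (B, 2 * obs_channels, H, W)
            h = x
            for conv in self.convs:
                # skip connection: the raw input is concatenated to every conv layer's input
                h = torch.relu(conv(torch.cat([h, x], dim=1)))
            return self.action_head(h), self.length_head(h)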

4Note that for the Sokoban domain, goal observations in the test set (i.e., real goals) do not contain the robot position, while the goal observations in the bootstrap training set include the robot position. However, this inconsistency had no effect in practice, which we verified by explicitly removing the robot from the observation.


Figure 2: Network architecture. A pair of current and goal observations are passed in to a shared conv-net. This shared representation is input to an action prediction conv-net and a plan length prediction conv-net. Skip connections from the input observations to all conv-layers are added.

4.3 Generalization to Different Problem Sizes

A primary challenge in learning for planning is finding representations that can generalize across different problem sizes. For example, we expect that a good policy for Sokoban should work well on the instances it was trained on, 9×9 domains for example, as well as on larger instances, such as 12×12 domains.

While the convolution layers can be applied to any image size, the number of inputs to the fully connected layer is strictly tied to the image size. This means that the network architecture described above is fixed to a particular domain size. To remove this dependency, we employ a trick used in fully convolutional networks [14], and keep only a k × k window of the last convolution layer, centered around the current agent position. This modification makes our DNN applicable to any domain size.
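One possible implementation of this cropping trick is sketched below (assumed details: feature maps laid out as (B, C, H, W) and per-example agent coordinates already extracted from the observation):

    import torch
    import torch.nn.functional as F

    def crop_around_agent(features: torch.Tensor, agent_pos, k: int) -> torch.Tensor:
        """Keep a k x k window of the last conv layer, centered on the agent position.

        Zero-padding by k // 2 keeps the window valid near the borders, so the fully
        connected head sees a fixed-size input regardless of the domain size.
        """
        half = k // 2
        padded = F.pad(features, (half, half, half, half))
        crops = [padded[b, :, r:r + k, c:c + k] for b, (r, c) in enumerate(agent_pos)]
        return torch.stack(crops)   # (B, C, k, k)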

5 Experiments

Here we report our experiments on learning for planning with DNNs. Our goal is to answer the following questions:

1. What makes a good DNN architecture for learning planning behavior?

2. Is the DNN plan length prediction a useful planning heuristic?

3. Can DNN-based policies and heuristics generalize to changes in the size of the domain?

We consider the Sokoban domain, as described in Figure 1, with a 9×9 grid and two difficulty levels: moving a single object, and a harder task of moving two objects. We generated training data using a random level generator5 for Sokoban. Note that in Sokoban, the last observation in a trajectory contains the agent's final position, which reveals information about the plan. Since our networks take a goal observation as input, we added an additional observation at the end of each trajectory with the agent removed, such that we can feed in a goal observation without revealing additional information.

For imitation learning, we represent the policy with the DNNs described in Section 4 and optimize using Adam [12] (step size 0.001). When training using the bootstrap method of Section 4.1, we selected nbootstrap = T for generating Dbootstrap. Unless stated otherwise, the training set used in all experiments comprised 45k observation-action trajectories.

We use two metrics to evaluate policy performance on the set of test domains Dtest. The first one is execution success rate. Starting from the initial state we execute the policy and track whether or not the goal state is reached. The second metric is classification error on the next action, whether or not it matches what the planner would have done. For evaluating the accuracy of plan length prediction, we measure the average ℓ1 loss (absolute difference).
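The success-rate metric amounts to rolling out the policy in a simulator; a sketch is below, where simulate_step, is_goal, and the per-problem observation fields are hypothetical hooks into a Sokoban simulator rather than an interface defined in the paper:

    def success_rate(policy, test_problems, simulate_step, is_goal, max_steps=200):
        """Execute the learned policy on each test instance and count goal completions."""
        successes = 0
        for problem in test_problems:
            obs, goal_obs = problem.initial_observation, problem.goal_observation
            for _ in range(max_steps):
                if is_goal(problem):
                    successes += 1
                    break
                action = policy(obs, goal_obs)      # e.g., argmax of the action head
                obs = simulate_step(problem, action)
        return successes / len(test_problems)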

5We did not use the Sokoban data from the learning for planning competition as it only contains 60 training domains, which is not enough samples for training DNNs. Our generator works as follows: we assume the room dimensions are a multiple of 3 and partition the grid into 3x3 blocks. Each block is filled with a randomly chosen and randomly rotated pattern from a predefined set of 17 different patterns. To make sure the generated levels are not too easy and not impossible, we discard the ones containing open areas greater than 3x4 and discard the ones with disconnected floor tiles. For more details we refer the reader to Taylor et al. [28].
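A compact sketch of such a generator follows; the two 3x3 patterns, the exact open-area threshold, and the rejection rule are placeholders standing in for the 17 patterns and the filtering described in footnote 5 and in Taylor et al. [28]:

    import random
    import numpy as np

    # Placeholder patterns (0 = floor, 1 = wall); the real generator uses 17 predefined patterns.
    PATTERNS = [np.array([[1, 0, 0], [0, 0, 0], [0, 0, 1]]),
                np.array([[0, 0, 0], [1, 1, 0], [0, 0, 0]])]

    def has_open_area(grid, h, w):
        """True if the grid contains an all-floor rectangle of size h x w."""
        rows, cols = grid.shape
        return any(not grid[r:r + h, c:c + w].any()
                   for r in range(rows - h + 1) for c in range(cols - w + 1))

    def floor_is_connected(grid):
        """Flood fill from one floor tile and check that all floor tiles are reachable."""
        floors = set(zip(*np.where(grid == 0)))
        if not floors:
            return False
        start = next(iter(floors))
        seen, stack = {start}, [start]
        while stack:
            r, c = stack.pop()
            for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nxt in floors and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen == floors

    def generate_level(rows=9, cols=9):
        """Tile the grid with randomly chosen, randomly rotated 3x3 patterns, then filter."""
        grid = np.zeros((rows, cols), dtype=int)
        for r in range(0, rows, 3):
            for c in range(0, cols, 3):
                grid[r:r + 3, c:c + 3] = np.rot90(random.choice(PATTERNS), k=random.randrange(4))
        # discard levels with large open areas or disconnected floors (threshold illustrative)
        if has_open_area(grid, 4, 4) or not floor_is_connected(grid):
            return None   # caller retries until a valid level is produced
        return grid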


Num Params: 556288 (same for all three networks)

                Deep-8    Wide-2    Wide-1
  error rate    0.068     0.092     0.129
  succ rate     0.83      0.62      0.38

Table 1: Comparison of deep vs. shallow networks with the same number of parameters. We started with a DNN containing 8 convolution layers (each layer with 64 filters) and cut it down to 2 and 1 convolution layers (with layers of depth 256 and 512, respectively). It is clear that the deeper network outperforms the shallow ones, even though the number of parameters is unchanged, and is thus better suited to represent planning-based behavior in our data.

5.1 Evaluating Network Structure

In this section, based on several ablation experiments, we aim to tease out the important ingredients for a successful GRP.

In Figure 3 we plot the success rate on two-object Sokoban, for different network depths, and with or without skip connections. The results suggest that deeper networks perform better, with skip connections resulting in a consistent advantage. To further establish this claim, in Table 1 we compare deep networks with shallow and wide networks that have the same number of parameters. The improved results for the deeper networks suggest that for learning the planning based reasoning in this data, the deeper the network the better. We note a related observation in the context of a DNN representation of the value iteration planning algorithm in [27]. However, in our experiments the performance levels off after 14 layers. We attribute this to the general difficulty of training deep DNNs due to gradient propagation, as evident in the failure of training the 14 layer architecture without skip connections (Figure 3).

Figure 3: Depth and skip connections. We plot the success rate for deterministic execution on the 2 object environment. Note that deeper networks show improved success rates and that skip connections improve performance consistently. Also, we were unable to successfully train a 14 layer deep network without skip connections.

We also investigated the benefit of having a shared representation for both action and plan length prediction, compared to predicting each with a separate network. The ablation results are presented in Table 2. Interestingly, the plan length prediction improves the accuracy of the action prediction.

5.2 Evaluating Bootstrap Performance

Here we evaluate the bootstrapping approach of Section 4.1. In Table 2 we show the success rate and plan length prediction error for architectures with and without the bootstrapping. As can be observed, the bootstrapping resulted in better use of the data, and led to improved results.

We also investigated the performance of bootstrapping with respect to the size of the training dataset. We observed that for smaller datasets, a non-uniform sampling strategy for the bootstrap data performed better. For each τ ∈ Dimitation, we sampled an observation o from a distribution that is linearly increasing in time, such that observations near the goal have higher probability. We also noted that in Sokoban, all goal observations in the test set have the objects placed at goal positions.


                                  w/ bootstrap    w/o bootstrap
  Predict plan length             2.211           2.481            ℓ1 norm
  Predict plan length & actions   2.205           2.319            ℓ1 norm
                                  0.844           0.818            Succ Rate
  Predict actions                 0.814           0.814            Succ Rate

Table 2: The benefits of bootstrapping and having a shared representation. We measure ℓ1 norm error for plan length prediction, and success rate on execution for evaluating action prediction. Best performance was obtained using bootstrapping and the shared representation. For this experiment the training set contained 25k observation-action trajectories.

Thus, bootstrapped goal observations in which the objects are not at a goal position would be visually different from the test data, and thus less effective for learning. We therefore made a modification to the bootstrapped observations. For each sampled observation o′, we form a modified trajectory τ′ that starts from the initial observation in τ and ends at o′, and we update the goal location6 to be the object position in o′. Thus, the object positions in o′ are visually modified to look as goals for the trajectory leading to o′. Both τ and τ′ are added to a bootstrap training set Dbootstrap which is used for training instead of Dimitation. The performance of this bootstrapping strategy is shown in Figure 4. As should be expected, the performance improvement due to data augmentation is more significant for smaller data sets.
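The linearly increasing sampling distribution mentioned above can be implemented, for instance, by weighting each candidate goal index by its time step; this is a sketch under our own reading of the description, as the exact weighting used in the experiments is not specified:

    import numpy as np

    def sample_goal_indices(T, n_bootstrap, rng=None):
        """Sample goal-observation indices 1..T-1 with probability increasing linearly in time,
        so observations near the end of the trajectory (near the goal) are chosen more often."""
        rng = rng or np.random.default_rng()
        indices = np.arange(1, T)
        probs = indices / indices.sum()          # weight proportional to the time step
        return rng.choice(indices, size=n_bootstrap, p=probs)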

Figure 4: The effect of bootstrapping on the performance of 2-object Sokoban, as a function of the dataset size. For this experiment we used the non-uniform sampling strategy described in Section 5.2, which performed better on smaller datasets.

5.3 Evaluating GRP Performance

The learned GRP in the best performing architecture (14 layers, with bootstrapping and a shared representation) can solve one-object Sokoban with 97% success rate, and two-object Sokoban with 87% success rate. In Figure 1 we plot a trajectory that the policy predicted in a challenging one-object domain. The two-object trajectories are harder to visualise in images, and we provide a video demonstration at https://sites.google.com/site/learn2plannips/. We observed that the policy learned to predict actions that avoid dead ends that happen far in the future, as Figure 1 demonstrates. The learned policy can be deployed in a new planning problem instead of running a planner.

5.4 DNN as a Heuristic Generator

In terms of computation speed, running a forward pass of the DNN is generally faster than a planner. However, with some nonzero probability, the policy will fail to accomplish the task. In this section, we show that the plan length predicted by the DNN can also be used as a heuristic within standard planners. This approach can guarantee a successful completion of the task. The advantage, as we demonstrate, is that the learned DNN plan length can significantly improve upon an off-the-shelf heuristic that does not make use of the training data.

6We note that this procedure requires us to make a change to the observations, which limits our approach to domains where such a modification is feasible. In Sokoban, performing this modification is straightforward, and we believe the same should hold for many other domains. We also note that a related idea was recently suggested in the context of reinforcement learning [1].


                      A* + NN         A* + M           GS + NN          GS + M         FF             FD
  plan length         32.7 (12.6)     32.3 (12.2)      35.3 (16.5)      42.9 (20.5)    41.7 (20.5)    32.2 (12.1)
  # nodes explored    175.6 (434.3)   3565 (5431)      100.5 (346.8)    482 (1071)     1414 (2259)    37593 (44296)
  # nodes generated   647.2 (1655)    18026 (30481)    370.9 (1440)     1858 (4761)    N/A            107424 (127176)

Table 3: Comparison of the learned DNN heuristic with the standard Manhattan heuristic for different planners. Average performance measures are reported, with standard deviations in parentheses. Performance was evaluated on 9x9 domains in 2-object Sokoban.

                            A* + NN          A* + M           GS + NN          GS + M          FF              FD
  12x12  plan length        45.5 (18.6)      44.9 (18.2)      55.9 (32.0)      59.9 (30.9)     59.6 (31.6)     44.5 (17.6)
         # nodes explored   3332 (12216)     18048 (34436)    919 (4466)       1774 (5642)     5248 (13239)    143818 (188792)
  15x15  plan length        57.9 (23.3)      57.0 (22.6)      84.7 (55.7)      78.3 (39.1)     76.6 (37.6)     56.0 (21.2)
         # nodes explored   24111 (57151)    55386 (96996)    3790 (12042)     4119 (9509)     12115 (30136)   415240 (574833)
  18x18  plan length        67.6 (22.6)      66.2 (21.7)      116.2 (72.9)     90.7 (39.3)     90.4 (39.0)     65.3 (21.0)
         # nodes explored   44780 (75472)    70347 (99312)    10288 (31501)    9011 (37592)    23568 (89494)   614375 (799367)

Table 4: A NN heuristic trained on 9x9 instances, evaluated on bigger domains. As the domain instances increase in size, the NN heuristic consistently outperforms the off-the-shelf classical planner in terms of number of nodes explored.

In particular, we investigated using the DNN as a heuristic in greedy search and A∗ search [22]. In Table 3 we evaluate 3 planning performance measures: number of nodes generated during search, number of nodes explored during search, and length of the plan that was found. The first two measures indicate the planning speed, where evaluating fewer nodes translates to faster planning. The total plan length is used as a measure of plan quality. We evaluate planning performance on the test set. As can be seen, in terms of the number of nodes explored, the learned NN heuristic significantly outperforms the Manhattan heuristic7 in both greedy search and A* search. Even though the NN heuristic is not guaranteed to be admissible, when used in conjunction with A*, plan quality is very close to optimal. We also add a comparison to two state-of-the-art planners: Fast Forward (FF, [9]) and Fast Downward (FD, [8]). FD uses an anytime algorithm, so we constrained the planning time to be no more than 5 minutes per domain. For the 9x9 domains, FD always found the optimal solution. We note that A* with our learned heuristic dramatically outperformed both planners in terms of the number of nodes explored. In terms of plan length, A* with our heuristic outperforms FF and is comparable with FD.
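For reference, plugging the plan-length prediction into A* looks roughly as follows; this is a sketch in which successors, goal_test, and nn_heuristic are hypothetical hooks into the Sokoban simulator and the trained network, and states are assumed to be hashable:

    import heapq
    from itertools import count

    def astar(start, goal_test, successors, nn_heuristic):
        """A* search where the heuristic is the DNN's predicted plan length."""
        tie = count()                 # tie-breaker so the heap never compares states directly
        frontier = [(nn_heuristic(start), next(tie), 0, start, [])]
        best_g = {start: 0}
        while frontier:
            _, _, g, state, plan = heapq.heappop(frontier)
            if goal_test(state):
                return plan
            for action, nxt, cost in successors(state):
                g2 = g + cost
                if g2 < best_g.get(nxt, float("inf")):
                    best_g[nxt] = g2
                    heapq.heappush(frontier,
                                   (g2 + nn_heuristic(nxt), next(tie), g2, nxt, plan + [action]))
        return None

Greedy search uses the same loop but orders the frontier by nn_heuristic(nxt) alone, which is consistent with Table 3: fewer nodes explored, at the cost of somewhat longer plans.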

The utility of faster planning, exploring fewer nodes, comes into play when the robot performs similar planning tasks over and over again – in general, reduced performance latency tends to increase user satisfaction.

5.5 DNN Heuristic Generalization

We also evaluate generalization using our GRP. In Table 4 and Figure 5, we evaluate a DNN heuristic that was trained on 9×9 domains on larger domains. During training, we chose the window size k = 1, to encourage learning a domain-invariant policy.

7Note that the Manhattan heuristic is only admissible in the 1-object case. We also tried other heuristics such as Euclidean distance, Hamiltonian distance, and the max over the three. Hamiltonian distance took too long to compute. Overall, Manhattan distance gave the best performance.



Figure 5: Here we compare the generalization performance of the learned NN heuristic against a Manhattan distance heuristic, both used in A*. The NN heuristic was trained only on 9x9 instances and then evaluated on bigger instances. As seen in (a), the learned heuristic consistently explores fewer nodes, which results in faster planning. In terms of plan quality (b), the learned heuristic finds plans that are close to optimal.

5.6 Analysis of Failure Modes

In this section we investigate the failure modes of the learned GRP. We noticed that there were two primary failure modes. The first failure mode is due to cycles in the policy, and is a consequence of using a deterministic policy. For example, when the agent is between two objects, a deterministic policy may oscillate, moving back and forth between the two. We found that a stochastic policy significantly reduces this type of failure. However, stochastic policies have some non-zero probability of choosing actions that lead to a dead end (e.g., pushing the box directly up against a wall), which can lead to different failures. The second failure mode was the inability of our policy to foresee long-term dependencies between the two objects. An example of such a case is shown in Figure 6 (f-h), where deciding which object to move first requires a look-ahead of more than 20 steps. A possible explanation for this failure is that such scenarios are not frequent in the training data.
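The stochastic variant amounts to sampling from the softmax over the action logits instead of taking the argmax, e.g. (illustrative code, not the authors' implementation):

    import torch

    def select_action(logits: torch.Tensor, stochastic: bool = True) -> int:
        """Sampling from the softmax breaks back-and-forth cycles of the deterministic policy,
        at the risk of occasionally sampling an action that leads to a dead end."""
        if stochastic:
            probs = torch.softmax(logits, dim=-1)
            return int(torch.multinomial(probs, 1).item())
        return int(logits.argmax().item())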

Additionally, we investigated whether the failure cases can be related to specific features of the task. Specifically, we considered the task plan length (computed using FD), the number of walls in the domain, and the planning time with the FD planner (results are similar with other planners). Intuitively, these features are expected to correlate with the difficulty of the task. In Figure 6 (a-c) we plot the success rate vs. the features described above. As expected, success rate decreases with plan length. Interestingly, however, several domains that required a long time for FD were ‘easy’ for the learned policy, and had a high success rate. Further investigation revealed that these domains had large open areas, which are ‘hard’ for planners to solve due to a large branching factor, but admit a simple policy. An example of one such domain is shown in Figure 6 (d-e). We also note that the number of walls had no visible effect on success rate – it is the configuration of the walls that matters, and not their quantity.

5.7 Comparison to Value Iteration Networks

Value iteration networks (VINs; [27]) are DNN designs that have the capacity to perform a value iteration planning computation. While value iteration can easily be applied to small state space problems such as 2D navigation, as shown in [27], the state space in Sokoban is much larger, as it involves the interaction of the agent with the movable objects, and not only the obstacles. Here we demonstrate that indeed VINs cannot solve this domain. We first trained a VIN on Sokoban with no movable objects, which is equivalent to a navigation task. As expected, the VIN learned a successful policy with a 97.5% success rate. However, for 1 object Sokoban, VIN performance dropped to 53.5%, and for 2 objects it dropped to 8.5%. This shows that VINs (at least as implemented in [27]) are not suitable for learning the complex planning behavior in Sokoban.



Figure 6: Analysis of Failure Modes. (a-c): Success rate vs. features of the domain. Plan length (a) seems to be the main factor in determining success rate. Longer plans fail more often. While there is some relationship between planning time and success rate (b), planning time is not always an accurate indicator, as explained in (d,e). The number of walls (c) does not affect success rate. (d,e): Domains containing large open rooms result in a high branching factor and thus produce the illusion of difficulty while still having a simple underlying policy. The domain in (d) took FD significantly longer to solve, 8.6 seconds compared to 1.6 seconds for the domain in (e), although it has a shorter optimal solution, 51 steps compared to 65 steps. This is because the domain in (e) can be broken up into small regions which are all connected by hallways, a configuration that reduces the branching factor and thus the overall planning time. (f-h): Demonstration of the 2nd failure mode in Section 5.6. From the start state, the policy moves the first object using the path shown in (f). It proceeds to move the next object using the path in (g). As the game state approaches (h) it becomes clear that the current domain is no longer solvable. The lower object needs to be pushed down but is blocked by the upper object, which can no longer be moved out of the way. In order to solve this level, the first object must either be moved to the bottom goal or must be moved after the second object has been placed at the bottom goal. Both solutions require a look-ahead consisting of 20+ steps.

6 Conclusion

In this work we explored a simple yet powerful approach in learning for planning, based on imitation learning from visual execution traces of a planner. We used deep neural networks for learning a policy, and proposed several network designs that improve learning performance in this setting. In addition, we proposed networks that can be used to learn a heuristic for off-the-shelf planners, which led to significant improvements over standard heuristics that do not leverage learning.

Our results on the challenging Sokoban domain suggest that DNNs have the capability to extract powerful features from observations, and the potential to learn the type of ‘visual thinking’ that makes some planning problems easy for humans but very hard for automatic planners.

There is still much to explore in employing deep networks for planning. While representations for images based on deep conv-nets have become standard, representations for other modalities such as graphs and logical expressions are not yet mature (although see [3] for recent developments). We believe that the results presented here will motivate future research in representation learning for planning.

References

[1] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. arXiv preprint arXiv:1707.01495, 2017.


[2] Tom Bylander. The computational complexity of propositional STRIPS planning. Artificial Intelligence, 69(1-2):165–204, 1994.

[3] Hanjun Dai, Elias B. Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. arXiv preprint arXiv:1704.01665, 2017.

[4] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017.

[5] Alan Fern, Roni Khardon, and Prasad Tadepalli. The first learning track of the international planning competition. Machine Learning, 84(1):81–107, 2011.

[6] Richard E. Fikes, Peter E. Hart, and Nils J. Nilsson. Learning and executing generalized robot plans. Artificial Intelligence, 3:251–288, 1972. URL http://www.sciencedirect.com/science/article/pii/0004370272900513.

[7] Maria Fox and Derek Long. PDDL2.1: An extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research (JAIR), 20:61–124, 2003.

[8] Malte Helmert. The Fast Downward planning system. Journal of Artificial Intelligence Research (JAIR), 26:191–246, 2006.

[9] Jörg Hoffmann. FF: The fast-forward planning system. AI Magazine, 22:57–62, 2001.

[10] Andreas Junghanns and Jonathan Schaeffer. Sokoban: Enhancing general single-agent search methods using domain knowledge. Artificial Intelligence, 129(1-2):219–251, 2001.

[11] Roni Khardon. Learning action strategies for planning domains. Artificial Intelligence, 113(1):125–148, 1999.

[12] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[14] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[15] Mario Martin and Hector Geffner. Learning generalized policies in planning using concept languages. 2000.

[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[17] Katharina Mülling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.

[18] Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. arXiv preprint arXiv:1703.02018, 2017.

[19] Mark Pfeiffer, Michael Schaeuble, Juan Nieto, Roland Siegwart, and Cesar Cadena. From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots. arXiv preprint arXiv:1609.07910, 2016.

[20] Dean A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.


[21] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

[22] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 1995.

[23] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[24] David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. arXiv preprint arXiv:1612.08810, 2016.

[25] Siddharth Srivastava, Neil Immerman, Shlomo Zilberstein, and Tianjiao Zhang. Directed search for generalized plans using classical planners. In Proceedings of the International Conference on Automated Planning and Scheduling, pages 226–233, Freiburg, Germany, 2011. URL http://rbr.cs.umass.edu/shlomo/papers/SIZZicaps11.html.

[26] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[27] Aviv Tamar, Sergey Levine, Pieter Abbeel, Yi Wu, and Garrett Thomas. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2146–2154, 2016.

[28] Joshua Taylor and Ian Parberry. Procedural generation of Sokoban levels. 2011.

[29] Théophane Weber, Sébastien Racanière, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.

[30] SungWook Yoon, Alan Fern, and Robert Givan. Inductive policy selection for first-order MDPs. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 568–576. Morgan Kaufmann Publishers Inc., 2002.

[31] Sungwook Yoon, Alan Fern, and Robert Givan. Learning control knowledge for forward search planning. Journal of Machine Learning Research, 9(Apr):683–718, 2008.


