
Learning Models for Following Natural Language Directions in Unknown Environments

Sachithra Hemachandra∗ Felix Duvallet∗ Thomas M. Howard Nicholas Roy Anthony Stentz Matthew R. Walter

Abstract—Natural language offers an intuitive and flexible means for humans to communicate with the robots that we will increasingly work alongside in our homes and workplaces. Recent advancements have given rise to robots that are able to interpret natural language manipulation and navigation commands, but these methods require a prior map of the robot's environment. In this paper, we propose a novel learning framework that enables robots to successfully follow natural language route directions without any previous knowledge of the environment. The algorithm utilizes spatial and semantic information that the human conveys through the command to learn a distribution over the metric and semantic properties of spatially extended environments. Our method uses this distribution in place of the latent world model and interprets the natural language instruction as a distribution over the intended behavior. A novel belief space planner reasons directly over the map and behavior distributions to solve for a policy using imitation learning. We evaluate our framework on a voice-commandable wheelchair. The results demonstrate that by learning and performing inference over a latent environment model, the algorithm is able to successfully follow natural language route directions within novel, extended environments.

I. INTRODUCTION

Over the past decade, robots have moved out of controlled isolation and into our homes and workplaces, where they coexist with people in domains that include healthcare and manufacturing. One long-standing challenge to realizing robots that behave effectively as our partners is to develop command and control mechanisms that are both intuitive and efficient. Natural language offers a flexible medium through which people can communicate with robots, without requiring specialized interfaces or significant prior training. For example, a voice-commandable wheelchair [1] allows the mobility-impaired to independently and safely navigate their surroundings simply by speaking to the chair, without the need for traditional head-actuated switches or sip-and-puff arrays. Recognizing these advantages, much attention has been paid of late to developing algorithms that enable robots to interpret natural language expressions that provide route directions [2], [3], [4], [5], that command manipulation [6], [7], and that convey environment knowledge [8], [9].

∗The first two authors contributed equally to this paper.

S. Hemachandra and N. Roy are with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA USA {sachih,tmhoward,nickroy}@csail.mit.edu

F. Duvallet and A. Stentz are with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA USA {felixd,tony}@cmu.edu

T. M. Howard is with the University of Rochester, Rochester, NY USA [email protected]

M. R. Walter is with the Toyota Technological Institute at Chicago, Chicago, IL USA [email protected]

Fig. 1. Our goal is to enable robots to autonomously follow natural language commands (e.g., "Go to the kitchen that is down the hallway") without any prior knowledge of their environment.


Natural language interpretation becomes particularly challenging when the expression references areas in the environment unknown to the robot. Consider an example in which a user directs the voice-commandable wheelchair to "go to the kitchen that is down the hallway," when the wheelchair is in an unknown environment and the hallway and kitchen are outside the field-of-view of its sensors (Fig. 1). Unable to associate the hallway and kitchen with specific locations, most existing solutions to language understanding would result in the robot exploring until it happens upon a kitchen. By reasoning over the spatial and semantic environment information that the command conveys, however, the robot would be able to follow the spoken directions more efficiently.

In this paper, we propose a framework that follows natural language route directions within unknown environments by exploiting spatial and semantic knowledge implicit in the commands. There are three algorithmic contributions that are integral to our approach. The first is a learned language understanding model that efficiently infers environment annotations and desired behaviors from the user's command. The second is an estimation-theoretic algorithm that learns a distribution over hypothesized world models by treating the inferred annotations as observations of the environment and fusing them with observations from the robot's sensor streams (Fig. 2). The third is a belief space policy learned from human demonstrations that reasons directly over the world model distribution to identify suitable navigation actions.

Fig. 2. Visualization of the evolution of the semantic map over time ((a) t = 3, (b) t = 4, (c) t = 8) as the robot follows the command "go to the kitchen that is down the hallway." Small circles and large filled-in areas denote sampled and visited regions, respectively, each colored according to its type (lab: green, hallway: yellow, kitchen: blue). The robot (a) first samples possible locations of the kitchen and moves towards them, (b) then observes the hallway and refines its estimate using the "down" relation provided by the user. Finally, the robot (c) reaches the actual kitchen and declares it has finished following the direction.

This paper generalizes previous work by the authors [10], which was limited to object-relative navigation within small, open environments. The novel contributions of this work enable robots to follow natural language route directions in large, complex environments. They include: a hierarchical framework that learns a compact probabilistic graphical model for language understanding; a semantic map inference algorithm that hypothesizes the existence and location of regions in spatially extended environments; and a belief space policy learned from human demonstrations that considers spatial relationships with respect to a hypothesized map distribution. We demonstrate these advantages through simulations and experiments with a voice-commandable wheelchair in an office-like environment.

II. RELATED WORK

Recent advancements in language understanding have enabled robots to understand free-form commands that instruct them to manipulate objects [6], [7] or navigate through environments using route directions [2], [3], [4], [7], [11]. With few exceptions, most of these techniques require a priori knowledge of the location, geometry, colloquial name, and type of all objects and regions within the environment [3], [7], [6]. Without known world models, however, interpreting free-form commands becomes much more difficult. Existing methods have dealt with this by learning a parser that maps the natural language command directly to plans [2], [4], [11]. Alternatively, Duvallet et al. [12] use imitation learning to train a policy that reasons about uncertainty in the grounding and that is able to backtrack as necessary. However, none of these approaches explicitly utilize the knowledge that the instruction conveys to influence their models of the environment, nor do they reason about its uncertainty. Instead, our framework treats language as an additional, albeit noisy, sensor that we use to learn a distribution over hypothesized world models, by taking advantage of information implicitly contained in a given command.

Related to our algorithm's ability to learn world models, state-of-the-art semantic mapping frameworks exist that focus on using the robot's sensor observations to update its representation of the world [13], [14]. Some methods additionally incorporate natural language descriptions in order to improve the learned world models [8], [9]. These techniques, however, only use language to update regions of the environment that the robot has observed and are not able to extend the maps based on natural language. Our approach treats natural language as another sensor and uses it to extend the spatial representation by adding both topological and metric information regarding hypothesized regions in the environment, which is then used for planning. Williams et al. [15] use a cognitive architecture to add unvisited locations to a partial map. However, they only reason about topological relationships to unknown places, do not maintain multiple hypotheses, and make strong assumptions about the environment that limit the applicability to real systems. In contrast, our approach reasons both topologically and metrically about regions, and can deal with ambiguity, which allows us to operate in challenging environments.

III. APPROACH OVERVIEW

We define natural language direction following as one of inferring the robot's trajectory x_{t+1:T} that is most likely for a given command Λ^t:

\arg\max_{x_{t+1:T} \in \Re^n} \; p(x_{t+1:T} \mid \Lambda^t, z^t, u^t), \qquad (1)

where z^t and u^t are the history of sensor observations and odometry data, respectively. Traditionally, this problem has been solved by also conditioning the distribution on a known world model. Without any a priori knowledge of the environment, we treat this world model as a latent variable S_t. We then interpret the natural language command in terms of the latent world model, which results in a distribution over behaviors β_t. We then solve the inference problem (1) by marginalizing over the latent world model and behaviors:

\arg\max_{x_{t+1:T} \in \Re^n} \int_{\beta_t} \int_{S_t} p(x_{t+1:T} \mid \beta_t, S_t, \Lambda^t) \, p(\beta_t \mid S_t, \Lambda^t) \, p(S_t \mid \Lambda^t) \, dS_t \, d\beta_t, \qquad (2)

where we have omitted the measurement z^t and odometry u^t histories for lack of space. By structuring the problem in this way, we are able to treat inference as three coupled learning problems.

Fig. 3. Outline of the framework: the natural language command (e.g., "go to the kitchen that is down the hallway") is converted into parse tree(s); annotation inference produces an annotation distribution, which semantic mapping fuses with sensor observations to form a map distribution; behavior inference produces a behavior distribution; and the policy planner selects an action.

The framework (Fig. 3) first converts the natural language direction into a set of environment annotations using learned language grounding models. It then treats these annotations as observations of the environment (i.e., the existence, name, and relative location of rooms) that it uses together with data from the robot's onboard sensors to learn a distribution over possible world models (third factor in Eqn. 2). Our framework then infers a distribution over behaviors conditioned upon the world model and the command (second factor). We then solve for the navigation actions that are consistent with this behavior distribution (first factor) using a learned belief space policy that commands a single action to the robot. As the robot executes this action, we update the world model distribution based upon new utterances and sensor observations, and subsequently select an updated action according to the policy. This process repeats as the robot navigates.
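To make this loop concrete, the sketch below mirrors the data flow of Fig. 3 in Python. It is only a schematic rendering under assumed interfaces: the objects passed in (annotation_model, map_filter, behavior_model, policy, robot) and their methods are hypothetical placeholders, not the authors' implementation.

# Schematic rendering of the Fig. 3 loop; all objects and methods are hypothetical placeholders.
def follow_direction(command, robot, annotation_model, map_filter, behavior_model, policy):
    # Annotation inference runs once on the command (Section IV-A).
    annotations = annotation_model.infer(command)
    while True:
        # Semantic mapping fuses annotations with onboard sensing (third factor of Eqn. 2, Section V).
        map_filter.update(annotations, robot.sense(), robot.odometry())
        # Behavior inference is conditioned on the current map distribution (second factor, Section IV-B).
        behaviors = behavior_model.infer(command, map_filter.particles())
        # The belief space policy commands a single action (first factor, Section VI).
        action = policy.select_action(robot.state(), map_filter.particles(), behaviors)
        if action == "stop":
            return
        robot.execute(action)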

The rest of this paper details each of these components in turn. We then demonstrate our approach to following natural language directions through large unstructured indoor environments on the robot shown in Fig. 1 as well as in simulated experiments. We additionally evaluate our approach to learning belief space policies on a corpus of natural language directions through one floor of an indoor building.

IV. NATURAL LANGUAGE UNDERSTANDING

Our framework relies on learned models to identify the existence of annotations and behaviors conveyed by free-form language and to convert these into a form suitable for semantic mapping and the belief space planner. This is a challenge because of the diversity of natural language directions, annotations, and behaviors. We perform this translation using the Hierarchical Distributed Correspondence Graph (HDCG) model [16], which is a more efficient extension of the Distributed Correspondence Graph (DCG) [7]. The DCG exploits the grammatical structure of language to formulate a probabilistic graphical model that expresses the correspondence φ ∈ Φ between linguistic elements from the command and their corresponding constituents (groundings) γ ∈ Γ. The factors f in the DCG are represented by log-linear models with feature weights that are learned from a training corpus. The task of grounding a given expression then becomes a problem of inference on the DCG model.

The HDCG model employs DCG models in a hierarchical fashion, by inferring rules R to construct the space of groundings for lower levels in the hierarchy. At any one level, the algorithm constructs the space of groundings based upon a distribution over the rules from the previous level:

\Gamma \rightarrow \Gamma(R). \qquad (3)

The HDCG model treats these rules and, in turn, the structure of the graph, as latent variables. Language understanding then proceeds by performing inference on the marginalized models:

\arg\max_{\Phi} \int_{R} p\left(\Phi \mid R, \Gamma(R), \Lambda, \Psi\right) \, p\left(R \mid \Gamma(R), \Lambda, \Psi\right) \qquad (4)

= \arg\max_{\Phi} \int_{R} \prod_i \prod_j f\left(\Phi_{ij}, \Gamma_{ij}(R), \Lambda_i, \Psi, R\right) \times \prod_i \prod_j f\left(R, \Lambda_i, \Psi, \Gamma_{ij}(R)\right). \qquad (5)

We now describe how the HDCG model infers annotations (representing our knowledge of the environment inferred from the language) and behaviors (representing the intent of the command) to understand the natural language command given by the user.

A. Annotation Inference

An annotation is a set of object types and subspaces. A subspace is defined here as a spatial relationship (e.g., down, left, right) with respect to an object type. In the experiments described in Section VII we assume 17 object types and 12 spatial relationships. We also permit object types to express a spatial relationship with another object type. We denote object types by their physical type (e.g., kitchen, hallway), subspaces as the relationship type with an object type argument (e.g., down(kitchen), left(hallway)), and object types with spatial relationships as an object type with a subspace argument (e.g., kitchen(down(hallway))). Since the number of possible combinations of annotations is equal to the power set of the number of symbols, 2^{3,485} annotations can be expressed by an instruction.¹ The HDCG model infers a distribution of graphical models to efficiently generate annotations by assuming conditional independence of constituents and eliminating symbols that are learned to be irrelevant to the utterance. For example, Figure 4 illustrates the model for the direction "go to the kitchen that is down the hall." In this example only 4 of the 3,485 symbols (two object types, one subspace, and one object type with a spatial relationship) are active in this model. Note that all factors with inactive correspondence variables are not illustrated in Figures 4 and 5. At the root of the sentence the symbols for an object type (kitchen) and an object type with a spatial relationship (kitchen(down(hallway))) are sent to the semantic map to fuse with other observations.

¹3,485 symbols = 17 object types, 204 subspaces, and 3,264 object types with spatial relationships (we exclude object types with spatial relationships to the same object type).
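The symbol count can be reproduced with a few lines of arithmetic; the sketch below is just a worked restatement of the numbers in the footnote above.

# Worked restatement of the symbol counts in Section IV-A and footnote 1.
object_types = 17
spatial_relations = 12

subspaces = spatial_relations * object_types                 # 12 x 17 = 204 (e.g., down(kitchen))
typed_subspaces = object_types * subspaces                   # 17 x 204 = 3,468 candidate combinations
same_type = object_types * spatial_relations                 # exclude e.g. kitchen(down(kitchen)): 204
objects_with_relations = typed_subspaces - same_type         # 3,264 (e.g., kitchen(down(hallway)))

symbols = object_types + subspaces + objects_with_relations
print(symbols)                                               # 3485
# The space of annotations is the power set of these symbols, i.e., 2**3485 possible sets.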

Fig. 4. The active groundings in annotation inference for the direction "go to the kitchen that is down the hall", with γ1 = down(hallway), γ2 = hallway, γ3 = kitchen, and γ4 = kitchen(down(hallway)). The two symbols at the root of the sentence (γ3, γ4) are sent to the semantic map to fuse with other observations.

B. Behavior Inference

A behavior is a set of objects, subspaces, actions, objectives, and constraints. Behavior inference differs from annotation inference by considering objects from the semantic map and subspaces defined with respect to objects from the semantic map instead of only object types. We denote actions by their type and an object or subspace argument (e.g., navigate(hallway)), objectives by their type (e.g., quickly, safely), and constraints as objects with a spatial relationship from the semantic map (e.g., o4(down(o3))). In the experiments presented in Section VII we assume 4 action types, 3 objectives, and 12 spatial relations. Just as with annotation inference, the HDCG model eliminates irrelevant action types, objective types, objects, and spatial relationships to efficiently infer behaviors. Figure 5 illustrates the model for the direction "go to the kitchen that is down the hall" in the context of an inferred map. In this example a navigate action with a goal relative to o1 would be inferred as the most likely behavior for the policy planner.

V. SEMANTIC MAPPING

We represent the world model as a modified semantic map [8] S_t = {G_t, X_t}, a hybrid metric and topological representation of the environment. The topology G_t consists of nodes n_i that denote locations in the environment, edges that denote inter-node connections, and non-overlapping regions R_α = {n_1, n_2, ..., n_m} that represent spatially coherent areas compatible with a human's decomposition of space (e.g., rooms and hallways). We associate a pose x_i with each node n_i, the vector of which constitutes the metric map X_t. Each region is also labeled according to its type (e.g., kitchen, hallway). An edge connects two regions that the robot has transitioned between or for which language indicates the existence of an inter-region spatial relation (e.g., that the kitchen is "down" the hallway).
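One plausible in-code rendering of this hybrid representation is sketched below; the class and field names are illustrative assumptions, not the authors' data structures.

# Illustrative data structures for the semantic map S_t = {G_t, X_t}; names are assumptions.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    pose: tuple                  # (x, y, theta); all node poses together form the metric map X_t

@dataclass
class Region:
    region_id: int
    label: str                   # semantic type, e.g. "kitchen" or "hallway"
    nodes: list = field(default_factory=list)    # spatially coherent set of node ids
    hypothesized: bool = False   # True if the region was induced by language and not yet visited

@dataclass
class SemanticMap:
    nodes: dict = field(default_factory=dict)    # node_id -> Node (vertices of the topology G_t)
    regions: dict = field(default_factory=dict)  # region_id -> Region
    edges: set = field(default_factory=set)      # (region_id, region_id): traversals or
                                                 # language-implied spatial relations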

Annotations extracted from a given command provide information regarding the existence, relative location, and type of regions² in the environment.

²Regions as defined by the mapping framework are also considered as objects for the purpose of natural language understanding.

Fig. 5. The active groundings in behavior inference for the direction "go to the kitchen that is down the hall" in the context of an inferred map with 4 objects (o1: kitchen, o2: hallway, o3: elevator lobby, o4: lab space), with γ5 = down(o2), γ6 = o2, γ7 = o1, and γ8 = navigate(o1). In this example a navigate action with a goal relative to o1 would be sent to the policy planner.

We learn a distribution over world models consistent with these annotations by treating them as observations α_t in a filtering framework. We combine these observations with those from other sensors onboard the robot (LIDAR and region appearance observations) z_t to maintain a distribution over the semantic map:

p(S_t \mid \Lambda^t, z^t, u^t) \approx p(S_t \mid \alpha^t, z^t, u^t) \qquad (6a)
= p(G_t, X_t \mid \alpha^t, z^t, u^t) \qquad (6b)
= p(X_t \mid G_t, \alpha^t, z^t, u^t) \, p(G_t \mid \alpha^t, z^t, u^t), \qquad (6c)

where we assume that an utterance Λ_t provides a set of annotations α_t. The factorization within the last line models the metric map induced by the topology, as with pose graph representations [17]. We maintain this distribution over time using a Rao-Blackwellized particle filter (RBPF) [18], with a sample-based approximation of the distribution over the topology, and a Gaussian distribution over metric poses.

The robot observes transitions between environment regions and the semantic label of its current region. As scene understanding is not the focus of this work, we use AprilTag fiducials [19] placed in each region that denote its label. Unlike our earlier work [9], in which we segment regions based only on their spatial coherence using spatial clustering, here we additionally use the presence of conflicting spatial appearance tags to also segment the region. As such, we assume that we are aware of the segmentation of the space immediately, which is not possible with a purely spectral clustering based approach, allowing us to immediately evaluate each particle's likelihood based on the observation of region appearance. In turn, we can down-weight particles that are inconsistent with the actual layout of the world sooner, reducing the number of actions the robot must take to satisfy the command.

We maintain each particle through the three steps of the RBPF. First, we propagate the topology by sampling modifications to the graph when the robot receives new sensor observations or annotations. Second, we perform a Bayesian update to the pose distribution based upon the sampled modifications to the underlying graph. Third, we update the weight of each particle based on the likelihood of generating the given observations, and resample as needed to avoid particle depletion. We now outline this process in more detail.
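These three steps map naturally onto a short update routine. The sketch below is schematic: the propose, update_poses, likelihood, and resample callables are assumed stand-ins for the operations detailed in Sections V-A through V-C, not the authors' code.

# Schematic RBPF update over semantic-map particles; the callables are assumed placeholders.
def rbpf_update(particles, annotations, observations, odometry,
                propose, update_poses, likelihood, resample):
    for p in particles:
        # Step 1: proposal -- add a node/edge for the robot motion and sample graph
        # modifications implied by the annotations and observations (Eq. 7).
        propose(p, annotations, observations, odometry)
        # Step 2: Bayesian update of the Gaussian over node poses given the sampled topology.
        update_poses(p)
        # Step 3: re-weight by the likelihood of the annotations and appearance observations (Eq. 8).
        p.weight *= likelihood(p, annotations, observations)
    total = sum(p.weight for p in particles)
    for p in particles:
        p.weight /= total
    # Resample as needed to avoid particle depletion.
    return resample(particles)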

During the proposal step, we first add an additional node n_t and edge to each particle's topology that model the robot's motion u_t, yielding a new topology S_t^{(i)-}. We then sample modifications to the topology Δ_t^{(i)} = {Δ_{α_t}^{(i)}, Δ_{z_t}^{(i)}} based on the most recent annotations α_t and sensor observations z_t:

p(S_t^{(i)} \mid S_{t-1}^{(i)}, \alpha_t, z_t, u_t) = p(\Delta_{\alpha_t}^{(i)} \mid S_t^{(i)-}, \alpha_t) \, p(\Delta_{z_t}^{(i)} \mid S_t^{(i)-}, z_t) \, p(S_t^{(i)-} \mid S_{t-1}^{(i)}, u_t). \qquad (7)

This updates the proposed graph topology S_t^{(i)-} with the graph modifications Δ_t^{(i)} to yield the new semantic map S_t^{(i)}. The updates can include the addition and deletion of nodes and regions from the graph that represent newly hypothesized or observed regions, and edges that express spatial relations inferred from observations or annotations.

We sample graph modifications from two independent proposal distributions for annotations α_t and robot observations z_t. This is done by sampling a grounding for each observation and modifying the graph according to the implied grounding.

A. Graph modifications based on natural language

Given a set of annotations α_t = {α_{t,j}}, we sample modifications to the graph for each particle. An annotation α_{t,j} contains a spatial relation and figure when the language describes one region (e.g., "go to the elevator lobby"), and an additional landmark when the language describes the relation between two regions (e.g., "go to the lobby through the hallway"). We use a likelihood model over the spatial relation to sample landmark and figure pairs for the grounding. This model employs a Dirichlet process prior that accounts for the fact that the annotation may refer to regions that exist in the map or to unknown regions. If either the landmark or the figure is sampled as a new region, we add it to the graph and create an edge between them. We also sample the metric constraint associated with this edge based on the spatial relation. The spatial relation models employ features that describe the locations of the regions, their boundaries, and the robot's location at the time of the utterance, and are trained based upon a natural language corpus [6].
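To illustrate the flavor of this sampling step, the sketch below draws a grounding for an annotation's figure with a CRP-like trade-off between existing regions and a new hypothesized region, weighted by a relation-likelihood callable. The weighting scheme and function names are simplifying assumptions for illustration, not the paper's learned models.

import random

# Illustrative CRP-style grounding sampler; the weighting is an assumption, not the paper's model.
def sample_figure_grounding(figure_label, regions, relation_likelihood, alpha=1.0):
    """Return an existing region to ground the figure to, or None to hypothesize a new region."""
    candidates = [r for r in regions if r.label == figure_label]
    weights = [relation_likelihood(r) for r in candidates]   # how well each region fits the relation
    weights.append(alpha)                                    # probability mass reserved for a new region
    pick = random.uniform(0.0, sum(weights))
    acc = 0.0
    for region, w in zip(candidates, weights):
        acc += w
        if pick <= acc:
            return region              # ground to an existing region
    return None                        # caller adds a hypothesized region, an edge, and a metric constraint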

B. Graph modifications based on robot observations

If the robot does not observe a region transition (i.e., the robot is in the same region as before), the algorithm adds the new node n_t to the current region and modifies its spatial extent. If there are any edges denoting spatial relations to hypothesized regions, the algorithm resamples their constraints if their likelihood changes significantly due to the modified spatial extent of the current region.

Alternatively, if the robot observes a region transition, the new node n_t is assigned to a new or existing region as follows. First, the algorithm checks if the robot is in a previously visited region, based on spatial proximity, in which case it will add n_t to that region. Otherwise, it will create a new region and check whether it matches a region that was previously hypothesized based on an annotation (for example, a newly-visited kitchen can be the same as a hypothesized kitchen described with language). We do so by sampling a grounding to any unobserved regions in the topology using a Dirichlet process prior. If this process results in a grounding to an existing hypothesized region, we remove the hypothesized region and adjust the topology accordingly, resampling any edges to yet-unobserved regions. For example, if an annotation suggested the existence of a "kitchen down the hallway," and we grounded the robot's current region to the hypothesized hallway, we would reevaluate the "down" relation for the hypothesized kitchen with respect to this detected hallway.

C. Re-weighting particles and resampling

After modifying each particle's topology, we perform a Bayesian update to its Gaussian distribution. We then re-weight each particle according to the likelihood of generating language annotations and region appearance observations:

w_t^{(i)} = p(z_t, \alpha_t \mid S_{t-1}^{(i)}) \, w_{t-1}^{(i)} = p(\alpha_t \mid S_{t-1}^{(i)}) \, p(z_t \mid S_{t-1}^{(i)}) \, w_{t-1}^{(i)}. \qquad (8)

When calculating the likelihood of each region appearance observation, we consider the current node's region type and calculate the likelihood of generating this observation given the topology. In effect, this down-weights any particle with a sampled region of a particular type existing on top of a known traversed region of a different type. We use a likelihood model that describes the observation of a region's type, with a latent binary variable v that denotes whether or not the observation is valid. We marginalize over v to arrive at the likelihood of generating the given observation, where R_u is the set of unobserved regions in particle S_{t-1}^{(i)}:

p(z_t \mid S_{t-1}^{(i)}) = \prod_{R_i \in R_u} \left( \sum_{v \in \{0,1\}} p(z_t \mid v, R_i) \, p(v \mid R_i) \right). \qquad (9)

For annotations, we use the language grounding likelihood under the map at the previous time step. As such, a particle with an existing pair of regions conforming to a specified language constraint will be weighted higher than one without. When the particle weights fall below a threshold, we resample particles to avoid particle depletion [18].
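A minimal numerical version of this re-weighting step might look as follows. The validity prior, the uniform fallback over region types, and the effective-sample-size resampling trigger are illustrative assumptions (the paper resamples when the weights fall below a threshold), not the learned models used on the robot.

import numpy as np

# Illustrative particle re-weighting (Eqs. 8-9); likelihood terms and thresholds are assumptions.
def appearance_likelihood(observed_label, unobserved_regions, p_obs_given_valid, p_valid=0.9):
    """Eq. 9: marginalize the binary validity variable v for each unobserved region."""
    likelihood = 1.0
    for region in unobserved_regions:
        valid_term = p_valid * p_obs_given_valid(observed_label, region)   # p(z | v=1, R_i) p(v=1 | R_i)
        invalid_term = (1.0 - p_valid) * (1.0 / 17.0)                      # uniform over 17 types if invalid
        likelihood *= valid_term + invalid_term
    return likelihood

def reweight_and_resample(particles, weights, annotation_lik, appearance_lik, ratio=0.5):
    """Eq. 8: scale each weight by the annotation and appearance likelihoods, then resample."""
    w = np.asarray(weights, dtype=float)
    w *= np.array([annotation_lik(p) * appearance_lik(p) for p in particles])
    w /= w.sum()
    if 1.0 / np.sum(w ** 2) < ratio * len(particles):        # low effective sample size
        idx = np.random.choice(len(particles), size=len(particles), p=w)
        particles = [particles[i] for i in idx]
        w = np.full(len(particles), 1.0 / len(particles))
    return particles, w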

VI. REASONING AND LEARNING IN BELIEF SPACE

Searching for the complete trajectory that is optimal in the distribution of maps would be intractable. Instead, we treat direction following as sequential decision making under uncertainty, where a policy π minimizes a single step of the cost function c over the available actions a ∈ A_t from state x:

\pi(x, S_t) = \arg\min_{a \in A_t} c(x, a, S_t). \qquad (10)

After executing the action and updating the map distribution, we repeat this process until the policy declares it has completed following the direction using a separate stop action.

As the robot travels in the environment, it keeps track of the nodes in the topological graph G_t it has visited (V) and frontiers (F) that lie at the edge of explored space. The action set A_t consists of paths to nodes in the graph. An additional action a_stop declares that the policy has completed following the direction. Intuitively, an action represents a single step along the path that takes the robot towards its destination. Each action may explore new parts of the environment (for example, continuing to travel down a hallway) or backtrack if the policy has made a mistake (for example, traveling to a room in a different part of the environment). The following sections explain how the policy reasons in belief space, and the novel imitation learning formulation to train the policy from demonstrations of correct behavior.
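As a small illustration of how such an action set could be enumerated, the sketch below builds paths to visited and frontier nodes plus an explicit stop action; the shortest_path planner passed in is an assumed placeholder.

# Illustrative action-set construction; shortest_path is an assumed planner callable.
def build_action_set(current_node, visited_nodes, frontier_nodes, shortest_path):
    actions = []
    for target in list(visited_nodes) + list(frontier_nodes):
        if target != current_node:
            actions.append(("goto", shortest_path(current_node, target)))   # a path through G_t
    actions.append(("stop", None))   # a_stop: declare the direction has been followed
    return actions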

A. Belief Space Reasoning using Distribution Embedding

The semantic map S_t provides a distribution over the possible locations of the landmarks relevant to the command the robot is following. As such, the policy π must reason about a distribution of action features when computing the cost of any action a. We accomplish this by embedding the action feature distribution in a Reproducing Kernel Hilbert Space (RKHS), using the mean feature map [20] consisting of the first K moments of the features computed with respect to each map sample S_t^{(i)} (and its likelihood):

\hat{\Phi}_1(x, a, S_t) = \sum_{S_t^{(i)}} p(S_t^{(i)}) \, \phi(x, a, S_t^{(i)}) \qquad (11)

\hat{\Phi}_2(x, a, S_t) = \sum_{S_t^{(i)}} p(S_t^{(i)}) \left( \phi(x, a, S_t^{(i)}) - \hat{\Phi}_1 \right)^2 \qquad (12)

\ldots

\hat{\Phi}_K(x, a, S_t) = \sum_{S_t^{(i)}} p(S_t^{(i)}) \left( \phi(x, a, S_t^{(i)}) - \hat{\Phi}_1 \right)^K \qquad (13)

Intuitively, this formulation computes features for the action and all hypothesized landmarks individually, aggregates these feature vectors, and then computes moments of the feature vector distribution (mean, variance, and higher order statistics). A simplified illustration, shown in Figure 6, shows how our approach computes belief space features for two actions with a hypothesized kitchen (with two possible locations).
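In code, Eqs. 11-13 reduce to weighted central moments of the per-sample feature vectors. The sketch below assumes a feature function phi and a list of weighted map samples; it is a plausible rendering of the embedding, not the authors' implementation.

import numpy as np

# Weighted feature moments over map samples (Eqs. 11-13); phi and the sample list are assumptions.
def feature_moments(x, action, map_samples, sample_weights, phi, K=3):
    feats = np.array([phi(x, action, s) for s in map_samples])   # shape: (num_samples, d)
    w = np.asarray(sample_weights, dtype=float)
    w = w / w.sum()
    mean = (w[:, None] * feats).sum(axis=0)                      # \hat{Phi}_1
    moments = [mean]
    for k in range(2, K + 1):
        moments.append((w[:, None] * (feats - mean) ** k).sum(axis=0))   # \hat{Phi}_k
    return np.concatenate(moments)                               # F = [\hat{Phi}_1; ...; \hat{Phi}_K]

The policy of Eq. 15 then amounts to selecting the action whose concatenated moment vector has the smallest inner product with the learned weights W.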

The cost function in Equation 10 can now be rewritten as a weighted sum of the first K moments of the feature distribution:

c(x, a, S_t) = \sum_{i=1}^{K} w_i^T \hat{\Phi}_i(x, a, S_t). \qquad (14)

By concatenating the weights and moments into respective column vectors W := [w_1; \ldots; w_K] and F := [\hat{\Phi}_1; \ldots; \hat{\Phi}_K], we can rewrite the policy in Equation 10 as minimizing a weighted sum of the feature moments F_a for action a:

\pi(x, S_t) = \arg\min_{a \in A_t} W^T F_a. \qquad (15)

Fig. 6. Simplified illustration of computing feature moments in the space of hypothesized landmarks (in this case, two kitchens and actions a1, a2). To compute the features over a landmark distribution, we compute the features for each action across all hypothesized landmark samples (φ(a1, S^(1)), φ(a1, S^(2)) and φ(a2, S^(1)), φ(a2, S^(2))), and aggregate them by computing moment statistics.

The vector φ(x, a, S_t^{(i)}) contains features of the action and a single landmark in S_t^{(i)}. It includes geometric features describing the shape of the action (e.g., the cumulative change in angle), the geometry of the landmark (e.g., the area of the landmark), and the relationship between the action and landmark (e.g., the difference between the ending and starting distances to the landmark). See [12] for more details.

B. Imitation Learning Formulation

We use imitation learning to train the policy by treating action prediction as a multi-class classification problem: given an expert demonstration, we wish to correctly predict the expert's action among all possible actions for the same state. Although prior work introduced imitation learning for training a direction following policy, it operated in partially known environments [12]. Instead, we train a belief space policy that reasons in a distribution of hypothesized maps.

We assume the expert's policy π* minimizes the unknown immediate cost C(x, a*, S_t) of performing the demonstrated action a* from state x, under the map distribution S_t. However, since we cannot directly observe the true costs of the expert's policy, we must instead minimize a surrogate loss that penalizes disagreements between the expert's action a* and the policy's action a, using the multi-class hinge loss [21]:

\ell(x, a^*, c, S_t) = \max\left(0, \, 1 + c(x, a^*, S_t) - \min_{a \neq a^*} \left[ c(x, a, S_t) \right] \right). \qquad (16)

The minimum of this loss occurs when the cost of the expert's action is lower than the cost of all other actions, with a margin of one. This loss can be re-written and combined with Equation 15 to yield:

\ell(x, a^*, W, S_t) = W^T F_{a^*} - \min_a \left[ W^T F_a - l_{xa} \right], \qquad (17)

where the margin l_{xa} = 0 if a = a* and 1 otherwise. This ensures that the expert's action is better than all other actions by a margin [22]. Adding a regularization term λ to Equation 17 yields our complete optimization loss:

\ell(x, a^*, W, S_t) = \frac{\lambda}{2} \|W\|^2 + W^T F_{a^*} - \min_a \left[ W^T F_a - l_{xa} \right]. \qquad (18)

Although this loss function is convex, it is not differentiable.

TABLE I
DIRECTION FOLLOWING EFFICIENCY ON THE ROBOT

                     Distance (m)        Time (s)
Algorithm           Mean    Std Dev    Mean     Std Dev
Known Map           13.10   0.67       62.48    16.61
With Language       12.62   0.62       122.14   32.48
Without Language    24.91   13.55      210.35   97.73

However, we can optimize it efficiently by taking the subgradient of Equation 18 and computing action predictions for the loss-augmented policy [22]:

\frac{\partial \ell}{\partial W} = \lambda W + F_{a^*} - F_{a'} \qquad (19)

a' = \arg\min_a \left[ W^T F_a - l_{xa} \right]. \qquad (20)

Note that a' (the best loss-augmented action) is simply the solution to our policy using a loss-augmented cost. This leads to the update rule for the weights W:

W_{t+1} \leftarrow W_t - \alpha \frac{\partial \ell}{\partial W} \qquad (21)

with a learning rate α ∝ 1/t^γ. Intuitively, if the current policy disagrees with the expert's demonstration, Equation 21 decreases the weight (and thus the cost) for the features of the demonstrated action F_{a*}, and increases the weight for the features of the planned action F_{a'}. If the policy produces actions that agree with the expert's demonstration, the update will only be for the regularization term. As in our prior work, we train the policy using the DAGGER (Dataset Aggregation) algorithm [23], which learns a policy by iterating between collecting data (using the current policy) and applying expert corrections on all states visited by the policy (using the expert's demonstrated policy).
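The loss-augmented update of Eqs. 19-21 fits in a few lines; the sketch below assumes the action moment vectors are stacked in a matrix and uses an illustrative regularization constant and step schedule, not the values used in the experiments.

import numpy as np

# One subgradient step of Eqs. 19-21; lam and the step schedule are illustrative assumptions.
def policy_update(W, F_actions, expert_index, step, lam=0.01, gamma=0.5):
    """F_actions stacks the moment vectors F_a of all available actions, one per row."""
    margins = np.ones(len(F_actions))
    margins[expert_index] = 0.0                        # l_xa = 0 for the expert's action, 1 otherwise
    a_prime = int(np.argmin(F_actions @ W - margins))  # Eq. 20: best loss-augmented action
    grad = lam * W + F_actions[expert_index] - F_actions[a_prime]   # Eq. 19
    alpha = 1.0 / (step ** gamma)                      # learning rate alpha proportional to 1/t^gamma (step >= 1)
    return W - alpha * grad                            # Eq. 21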

Treating direction following in the space of possible semantic maps as a problem of sequential decision making under uncertainty provides an efficient approximate solution to the belief space planning problem. By using a kernel embedding of the distribution of features for a given action, our approach can learn a policy that reasons about the distribution of semantic maps.

VII. RESULTS

We implemented the algorithm on our voice-commandable wheelchair (Fig. 1), which is equipped with three forward-facing cameras with a collective field-of-view of 120 degrees, and forward- and rearward-facing LIDARs. We set up an experiment in which the wheelchair was placed in a lobby within MIT's Stata Center, with several hallways, offices, and lab spaces, as well as a kitchen on the same floor. As scene understanding is not the focus of this paper, we placed AprilTag fiducials [19] to identify the existence and semantic type of regions in the environment. We trained the HDCG models from a parallel corpus of 54 fully-labeled examples. We then directed the wheelchair to execute the novel instruction "go to the kitchen that is down the hallway."

We compare our framework against two other methods.

TABLE II
DIRECTION FOLLOWING EFFICIENCY IN SIMULATION

                     Distance (m)        Time (s)
Algorithm           Mean    Std Dev    Mean    Std Dev
Known Map           12.88   0.06       18.32   3.54
With Language       16.64   6.84       82.78   10.56
Without Language    25.28   12.99      85.57   17.80

Fig. 7. Ground truth path (from the start, past the water fountains and door, to the cabinet) for the direction "go to the door after the water fountain, turn right, go straight to the cabinet." The direction contains information about the door's location (i.e., it is after the water fountain) that is important to distinguishing it from the other doors in the same hallway.

The first emulates the previous state-of-the-art and uses a known map of the environment in order to infer the actions consistent with the route direction. The second assumes no prior knowledge of the environment (as with ours) and opportunistically grounds the command in the map, but does not use language to modify the map. We performed six experiments with our algorithm, three with the known map method, and five with the method that does not use language, all of which were successful (the robot reached the kitchen). Table I compares the total distance traveled and execution time for the three methods. Our algorithm resulted in paths with lengths close to those of the known map, and significantly outperformed the method that did not use language. Our framework did require significantly more time to follow the directions than the known map case, due to the fact that it repeats the three steps of the algorithm when new sensor data arrives. Figure 2 shows a visualization of the semantic maps over several time steps for one successful run on the robot.

We performed a similar evaluation in a simulated environment comprised of an office, hallway, and kitchen. With the robot starting in the office, we ran ten simulations of each method. As with the physical experiment, our method resulted in an average length closer to that of the known map case, but with a longer average run time (Table II).

To evaluate the performance of the learned belief space policy in isolation on a larger corpus of natural language directions (with more verbs, spatial relations, and landmarks), we performed cross-validation trials of the policy operating in a simplified simulated map. We evaluated the policy using a corpus of 55 multi-step natural language directions, some of which refer to navigation landmarks (for example, the direction shown in Fig. 7). These directions are similar to those in our prior work [12].

Fig. 8. Tukey box plots showing the mean ending distance error of 27 natural language directions over 200 cross-validation trials, with and without belief space reasoning (plotted values — with: 0.88, 2.84, 5.54 m; without: 2.26, 7.65, 12.93 m). Reasoning about the distribution of landmarks (with) improves direction following performance compared to without.

For this cross-validation evaluation, we trained the policy on 28 randomly-sampled directions, then evaluated the learned policy on the remaining 27 directions (measuring the average ending distance error across the held-out directions). The results of this experiment, shown in Fig. 8, demonstrate the benefit of using the additional information available in the direction to infer a distribution of possible environment models. By contrast, our prior approach (without belief space reasoning) ignores this information, which results in larger ending distance errors.

VIII. CONCLUSIONS

Robots that can understand and follow natural language directions in unknown environments are one step towards intuitive human-robot interaction. Reasoning about parts of the environment that have not yet been detected would help enable seamless coordination in human-robot teams.

We have generalized our prior work to move beyond object-relative navigation in small, open environments. The primary contributions of this work include:
• a hierarchical framework that learns a compact probabilistic graphical model for language understanding;
• a semantic map inference algorithm that hypothesizes the existence and location of spatially coherent regions in large environments; and
• a belief space policy that reasons directly over the hypothesized map distribution and is trained based on expert demonstrations.

Together, these algorithms are integral to efficiently interpreting and following natural language route directions in unknown, spatially extended, and complex environments. We evaluated our algorithm through a series of simulations as well as demonstrations on a voice-commandable autonomous wheelchair tasked with following natural language route instructions in an office-like environment.

In the future, we plan to carry out experiments on a more diverse set of commands. Other future work will focus on handling sequences of commands, as well as streams of commands that are given during execution to change the behavior of the robot.

ACKNOWLEDGMENTS

This work was supported in part by the Robotics Consortium of the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement W911NF-10-2-0016, and by ONR under MURI grant "Reasoning in Reduced Information Spaces" (no. N00014-09-1-1052).

REFERENCES

[1] S. Hemachandra, T. Kollar, N. Roy, and S. Teller, "Following and interpreting narrated guided tours," in Proc. IEEE Int'l Conf. on Robotics and Automation (ICRA), 2011.
[2] M. MacMahon, B. Stankiewicz, and B. Kuipers, "Walk the talk: Connecting language, knowledge, and action in route instructions," in Proc. Nat'l Conf. on Artificial Intelligence (AAAI), 2006.
[3] T. Kollar, S. Tellex, D. Roy, and N. Roy, "Toward understanding natural language directions," in Proc. Int'l Conf. on Human-Robot Interaction, 2010.
[4] D. L. Chen and R. J. Mooney, "Learning to interpret natural language navigation instructions from observations," in Proc. Nat'l Conf. on Artificial Intelligence (AAAI), 2011.
[5] C. Matuszek, N. FitzGerald, L. Zettlemoyer, L. Bo, and D. Fox, "A joint model of language and perception for grounded attribute learning," in Proc. Int'l Conf. on Machine Learning (ICML), 2012.
[6] S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. Teller, and N. Roy, "Understanding natural language commands for robotic navigation and mobile manipulation," in Proc. Nat'l Conf. on Artificial Intelligence (AAAI), 2011.
[7] T. Howard, S. Tellex, and N. Roy, "A natural language planner interface for mobile manipulators," in Proc. IEEE Int'l Conf. on Robotics and Automation (ICRA), 2014.
[8] M. R. Walter, S. Hemachandra, B. Homberg, S. Tellex, and S. Teller, "Learning semantic maps from natural language descriptions," in Proc. Robotics: Science and Systems (RSS), 2013.
[9] S. Hemachandra, M. R. Walter, S. Tellex, and S. Teller, "Learning spatial-semantic representations from natural language descriptions and scene classifiers," in Proc. IEEE Int'l Conf. on Robotics and Automation (ICRA), 2014.
[10] F. Duvallet, M. R. Walter, T. Howard, S. Hemachandra, J. Oh, S. Teller, N. Roy, and A. Stentz, "Inferring maps and behaviors from natural language instructions," in Proc. Int'l Symp. on Experimental Robotics (ISER), 2014.
[11] C. Matuszek, E. Herbst, L. Zettlemoyer, and D. Fox, "Learning to parse natural language commands to a robot control system," in Proc. Int'l Symp. on Experimental Robotics (ISER), 2012.
[12] F. Duvallet, T. Kollar, and A. Stentz, "Imitation learning for natural language direction following through unknown environments," in Proc. IEEE Int'l Conf. on Robotics and Automation (ICRA), 2013.
[13] H. Zender, O. Martínez Mozos, P. Jensfelt, G. Kruijff, and W. Burgard, "Conceptual spatial representations for indoor mobile robots," Robotics and Autonomous Systems, 2008.
[14] A. Pronobis, O. Martínez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," Int'l J. of Robotics Research, 2010.
[15] T. Williams, R. Cantrell, G. Briggs, P. Schermerhorn, and M. Scheutz, "Grounding natural language references to unvisited and hypothetical locations," in Proc. Nat'l Conf. on Artificial Intelligence (AAAI), 2013.
[16] T. M. Howard, I. Chung, O. Propp, M. R. Walter, and N. Roy, "Efficient natural language interfaces for assistive robots," in IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems (IROS) Workshop on Rehabilitation and Assistive Robotics, 2014.
[17] M. Kaess, A. Ranganathan, and F. Dellaert, "iSAM: Incremental smoothing and mapping," Trans. on Robotics, 2008.
[18] A. Doucet, N. de Freitas, K. Murphy, and S. Russell, "Rao-Blackwellised particle filtering for dynamic Bayesian networks," in Proc. Conf. on Uncertainty in Artificial Intelligence (UAI), 2000.
[19] E. Olson, "AprilTag: A robust and flexible visual fiducial system," in Proc. IEEE Int'l Conf. on Robotics and Automation (ICRA), May 2011.
[20] A. Smola, A. Gretton, L. Song, and B. Schölkopf, "A Hilbert space embedding for distributions," in Algorithmic Learning Theory: 18th International Conference, 2007.
[21] K. Crammer and Y. Singer, "On the algorithmic implementation of multiclass kernel-based vector machines," Journal of Machine Learning Research, 2002.
[22] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich, "Maximum margin planning," in Proc. Int'l Conf. on Machine Learning (ICML), 2006.
[23] S. Ross, G. J. Gordon, and J. A. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in Proc. Int'l Conf. on Artificial Intelligence and Statistics, 2011.

