
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 108–117, Berlin, Germany, August 7-12, 2016. ©2016 Association for Computational Linguistics

Incremental Acquisition of Verb Hypothesis Space towards Physical World Interaction

Lanbo She and Joyce Y. Chai
Department of Computer Science and Engineering

Michigan State University
East Lansing, Michigan 48824, USA

{shelanbo, jchai}@cse.msu.edu

Abstract

As a new generation of cognitive robots start to enter our lives, it is important to enable robots to follow human commands and to learn new actions from human language instructions. To address this issue, this paper presents an approach that explicitly represents verb semantics through hypothesis spaces of fluents and automatically acquires these hypothesis spaces by interacting with humans. The learned hypothesis spaces can be used to automatically plan for lower-level primitive actions towards physical world interaction. Our empirical results have shown that the representation of a hypothesis space of fluents, combined with the learned hypothesis selection algorithm, outperforms a previous baseline. In addition, our approach applies incremental learning, which can contribute to life-long learning from humans in the future.

1 Introduction

As a new generation of cognitive robots start to enter our lives, it is important to enable robots to follow human commands (Tellex et al., 2014; Thomason et al., 2015) and to learn new actions from human language instructions (Cantrell et al., 2012; Mohan et al., 2013). To achieve such a capability, one of the fundamental challenges is to link higher-level concepts expressed by human language to lower-level primitive actions the robot is familiar with.

While grounding language to perception (Gorniak and Roy, 2007; Chen and Mooney, 2011; Kim and Mooney, 2012; Artzi and Zettlemoyer, 2013; Tellex et al., 2014; Liu et al., 2014; Liu and Chai, 2015) has received much attention in recent years, less work has addressed grounding language to robotic action. Actions are often expressed by verbs or verb phrases. Most semantic representations for verbs are based on argument frames (e.g., thematic roles which capture participants of an action). For example, suppose a human directs a robot to "fill the cup with milk". The robot will need to first create a semantic representation for the verb "fill" where "the cup" and "milk" are grounded to the respective objects in the environment (Yang et al., 2016). Even if the robot is successful in this first step, it still may not be able to execute the action "fill", as it does not know how this higher-level action corresponds to its lower-level primitive actions.

In robotic systems, operations usually consist of multiple segments of lower-level primitive actions (e.g., move to, open gripper, and close gripper) which are executed both sequentially and concurrently. Task scheduling provides the order or schedule for executing different segments of actions, and action planning provides the plan for executing each individual segment. Primitive actions are often predefined in terms of how they change the state of the physical world. Given a goal, task scheduling and action planning will derive a sequence of primitive actions that can change the initial environment to the goal state. The goal state of the physical world becomes a driving force for robot actions. Thus, beyond semantic frames, modeling verb semantics through their effects on the state of the world may provide a link to connect higher-level language and lower-level primitive actions.

Motivated by this perspective, we have developed an approach where each verb is explicitly represented by a hypothesis space of fluents (i.e., desired goal states) of the physical world, which is incrementally acquired and updated through interacting with humans. More specifically, given a human command, if there is no knowledge about the corresponding verb (i.e., no existing hypothesis space for that verb), the robot will initiate a learning process by asking human partners to demonstrate the sequence of actions that is necessary to accomplish this command. Based on this demonstration, a hypothesis space of fluents for that verb frame will be automatically acquired. If there is an existing hypothesis space for the verb, the robot will select the best hypothesis that is most relevant to the current situation and plan for the sequence of lower-level actions. Based on the outcome of the actions (e.g., whether it has successfully executed the command), the corresponding hypothesis space will be updated. In this fashion, a hypothesis space for each encountered verb frame is incrementally acquired and updated through continuous interactions with human partners. In this paper, to focus our effort on representations and learning algorithms, we adopted an existing benchmark dataset (Misra et al., 2015) to simulate the incremental learning process and interaction with humans.

Compared to previous works (She et al., 2014b; Misra et al., 2015), our approach has three unique characteristics. First, rather than a single goal state associated with a verb, our approach captures a space of hypotheses which can potentially account for a wider range of novel situations when the verb is applied. Second, given a new situation, our approach can automatically identify the best hypothesis that fits the current situation and plan for lower-level actions accordingly. Third, through incremental learning and acquisition, our approach has the potential to contribute to life-long learning from humans. This paper provides details on the hypothesis space representation, the induction and inference algorithms, as well as experiments and evaluation results.

2 Related Work

Our work here is motivated by previous linguistic studies on verbs, action modeling in AI, and recent advances in grounding language to actions.

Previous linguistic studies (Hovav and Levin, 2008; Hovav and Levin, 2010) propose that action verbs can be divided into two types: manner verbs that "specify as part of their meaning a manner of carrying out an action" (e.g., nibble, rub, laugh, run, swim), and result verbs that "specify the coming about of a result state" (e.g., clean, cover, empty, fill, chop, cut, open, enter). Recent work has shown that explicitly modeling the resulting change of state for action verbs can improve grounded language understanding (Gao et al., 2016). Motivated by these studies, this paper focuses on result verbs and uses hypothesis spaces to explicitly represent the result states associated with these verbs.

In the AI literature on action modeling, action schemas are defined with preconditions and effects. Thus, representing verb semantics for action verbs using resulting states can be connected to the agent's underlying planning modules. Different from earlier works in the planning community that learn action models from example plans (Wang, 1995; Yang et al., 2007) and from interactions (Gil, 1994), our goal here is to explore the representation of verb semantics and its acquisition through language and action.

There has been some work in the robotics community on translating natural language to robotic operations (Kress-Gazit et al., 2007; Jia et al., 2014; Sung et al., 2014; Spangenberg and Henrich, 2015), but not for the purpose of learning new actions. To support action learning, we previously developed a system where the robot can acquire the meaning of a new verb (e.g., stack) by following a human's step-by-step language instructions (She et al., 2014a; She et al., 2014b). By performing the actions at each step, the robot is able to acquire the desired goal state associated with the new verb. Our empirical results have shown that representing acquired verbs by resulting states allows the robot to plan for primitive actions in novel situations. Moreover, recent work (Misra et al., 2014; Misra et al., 2015) has presented an algorithm for grounding higher-level commands such as "microwave the cup" to lower-level robot operations, where each verb lexicon is represented as the desired resulting states. Their empirical evaluations have once again shown the advantage of representing verbs as desired states in robotic systems. Different from these previous works, we represent verb semantics through a hypothesis space of fluents (rather than a single hypothesis). In addition, we present an incremental learning approach for inducing the hypothesis space and selecting the best hypothesis.

3 An Incremental Learning Framework

Figure 1: An incremental process of verb acquisition (i.e., learning) and application (i.e., inference).

An overview of our incremental learning framework is shown in Figure 1. Given a language command Li (e.g., "fill the cup with water.") and an environment Ei (e.g., the simulated environment shown in Figure 1), the goal is to identify a sequence of lower-level robotic actions to perform the command. Similar to previous works (Pasula et al., 2007; Mourão et al., 2012), the environment Ei is represented by a conjunction of grounded state fluents, where each fluent describes either the property of an object or a relation (e.g., spatial) between objects. The language command Li is first translated to an intermediate representation of a grounded verb frame vi through semantic parsing and referential grounding (e.g., for "fill the cup", the argument the cup is grounded to Cup1 in the scene). The system knowledge of each verb frame (e.g., fill(x)) is represented by a Hypothesis Space H, where each hypothesis (i.e., a node) is a description of possible fluents, or in other words resulting states, that are attributed to executing the verb command. Given a verb frame vi and an environment Ei, a Hypothesis Selector will choose an optimal hypothesis from the space H to describe the expected resulting state of executing vi in Ei. Given this goal state and the current environment, a symbolic planner such as the STRIPS planner (Fikes and Nilsson, 1971) is used to generate an action sequence for the agent to execute. If the action sequence correctly performs the command (e.g., as evaluated by a human partner), the hypothesis selector will be updated with the success of its prediction. On the other hand, if the action has never been encountered (i.e., the system has no knowledge about this verb and thus the corresponding space is empty) or the predicted action sequence is incorrect, the human partner will provide an action sequence ~Ai that can correctly perform command vi in the current environment. Using ~Ai as the ground truth information, the system will not only update the hypothesis selector, but will also update the existing space of vi. The updated hypothesis space is treated as system knowledge of vi, which will be used in future interaction. Through this procedure, a hypothesis space for each verb frame vi is continually and incrementally updated through human-robot interaction.

Figure 2: An example hypothesis space for the verb frame fill(x). The bottom node captures the state changes after executing the fill command in the environment. Anchored by the bottom node, the hypothesis space is generated in a bottom-up fashion. Each node represents a potential goal state. The highlighted nodes are pruned during induction, as they are not consistent with the bottom node.
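To make the representation concrete, the environment-as-fluents idea can be sketched in a few lines. This is an illustrative encoding, not the authors' implementation; the Fluent type, the example objects (Cup1, Sink1, Table1), and the closed-world reading of negation are our assumptions.

```python
from typing import NamedTuple, Tuple, FrozenSet

class Fluent(NamedTuple):
    """One grounded state fluent, e.g. Has(Cup1, Water) or ¬On(Cup1, Table1)."""
    predicate: str
    args: Tuple[str, ...]
    positive: bool = True

# An environment is a conjunction (here: a set) of grounded fluents.
Environment = FrozenSet[Fluent]

env: Environment = frozenset({
    Fluent("Has", ("Cup1", "Water")),
    Fluent("In", ("Cup1", "Sink1")),
    Fluent("On", ("Cup1", "Table1"), positive=False),
})

def holds(env: Environment, fluent: Fluent) -> bool:
    """A positive fluent holds iff it is asserted; a negative one holds iff
    its positive counterpart is absent (a closed-world assumption)."""
    if fluent.positive:
        return fluent in env
    return Fluent(fluent.predicate, fluent.args, True) not in env
```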

4 State Hypothesis Space

To bridge human language and robotic actions, previous works have studied representing the semantics of a verb with a single resulting state (She et al., 2014b; Misra et al., 2015). One problem with this representation is that when the verb is applied in a new situation, if any part of the resulting state cannot be satisfied, the symbolic planner will not be able to generate a plan of lower-level actions to execute the verb command. The planner is also not able to determine whether the failed part of the state representation is even necessary. In fact, this effect is similar to the over-fitting problem. For example, given a sequence of actions performing fill(x), the induced hypothesis could be "Has(x, Water) ∧ Grasping(x) ∧ In(x, o1) ∧ ¬(On(x, o2))", where x is a graspable object (e.g., a cup or bowl), o1 is any type of sink, and o2 is any table. However, during inference, when applied to a new situation that does not have any type of sink or table, this hypothesis will not be applicable. Nevertheless, the first two terms Has(x, Water) ∧ Grasping(x) may already be sufficient to generate a plan for completing the verb command.

Figure 3: A training instance {Ei, vi, ~Ai} for hypothesis space induction. E′i is the resulting environment of executing ~Ai in Ei. The change of state in E′i compared to Ei is highlighted in bold. Different heuristics generate different Base Hypotheses as shown at the bottom.

To handle this over-fitting problem, we propose a hierarchical hypothesis space to represent verb semantics, as shown in Figure 2. The space is organized in a specific-to-general hierarchical structure. Formally, a hypothesis space H for a verb frame is defined as ⟨N, E⟩, where each ni ∈ N is a hypothesis node and each eij ∈ E is a directed edge pointing from parent ni to child nj, meaning that node nj is more general than ni and has one less constraint.

In Figure 2, the bottom hypothesis (n1) is Has(x, Water) ∧ Grasping(x) ∧ In(x, o1) ∧ ¬(On(x, o2)). A hypothesis ni represents a conjunction of parameterized state fluents lk:

ni := ∧k lk, where lk := [¬] predk(xk1 [, xk2])

A fluent lk is composed of a predicate (e.g., an object status such as Has, or a spatial relation such as On) and a set of argument variables, and it can be positive or negative. Take the bottom node in Figure 2 as an example: it contains four fluents, including one negative term (i.e., ¬(On(x, o2))) and three positive terms. During inference, the parameters are grounded to the environment to check whether the hypothesis is applicable.
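The parameterized-fluent notation can be made concrete with a small sketch. The tuple encoding and the ground helper below are hypothetical, not part of the paper's system; constants such as Water are simply passed through unchanged.

```python
# A parameterized fluent: (predicate, argument variables, sign).
# A hypothesis is a conjunction (here: a tuple) of such fluents,
# e.g. the bottom node n1 of Figure 2:
n1 = (
    ("Has", ("x", "Water"), True),
    ("Grasping", ("x",), True),
    ("In", ("x", "o1"), True),
    ("On", ("x", "o2"), False),
)

def ground(hypothesis, binding):
    """Substitute variables (x, o1, o2, ...) with concrete objects.
    Terms not in the binding (constants such as 'Water') are kept as-is."""
    return [
        (pred, tuple(binding.get(a, a) for a in args), sign)
        for pred, args, sign in hypothesis
    ]

# Grounding n1 against a concrete scene:
goal = ground(n1, {"x": "Cup1", "o1": "Sink1", "o2": "Table1"})
```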

5 Hypothesis Space Induction

Given an initial environment Ei, a language command which contains the verb frame vi, and a corresponding action sequence ~Ai, {Ei, vi, ~Ai} forms a training instance for hypothesis space induction. First, based on different heuristics, a base hypothesis is generated by comparing the state difference between the final and the initial environment. Second, a hypothesis space H is induced on top of this Base Hypothesis in a bottom-up fashion, and during induction some nodes are pruned. Third, if the system has existing knowledge for the same verb frame (i.e., an existing hypothesis space Ht for the same verb frame), the newly induced space is merged with the previous knowledge. Next we explain each step in detail.

5.1 Base Hypothesis Induction

One key concept in the space induction is the Base Hypothesis (e.g., the bottom node in Figure 2), which provides the foundation for building a space. As shown in Figure 3, given a verb frame vi and a working environment Ei, the action sequence ~Ai given by a human will change the initial environment Ei to a final environment E′i. The state changes are highlighted in Figure 3. Suppose a state change can be described by n fluents. Then the first question is which of these n fluents should be included in the base hypothesis. To gain some understanding of what would be a good representation, we applied different heuristics for choosing fluents to form a base hypothesis, as shown in Figure 3:

• H1argonly: only includes the changed states associated with the argument objects specified in the frame (e.g., in Figure 3, Kettle1 is the only argument).

• H2manip: includes the changed states of all the objects that have been manipulated in the action sequence taught by the human.

• H3argrelated: includes the changed states of all the objects related to the argument objects in the final environment. An object o is considered "related to" an argument object if there is a state fluent that includes both o and an argument object in one predicate (e.g., Stove is related to the argument object Kettle1 through On(Kettle1, Stove)).

• H4all: includes all the fluents whose values are changed from Ei to E′i (e.g., all four highlighted state fluents in E′i).

Input: A Base Hypothesis h
Initialization: Set initial space H : ⟨N, E⟩ with N:[h] and E:[ ];
    set a temporary hypothesis container T :[h]
while T is not empty do
    Pop an element t from T
    Generate children [t(0), ..., t(k)] from t by removing each single fluent
    foreach i = 0 ... k do
        if t(i) is consistent with t then
            Append t(i) to T;
            Add t(i) to N if not already in;
            Add link t → t(i) to E if not already in;
        else
            Prune t(i) and any node that can be generalized from t(i)
        end
    end
end
Output: Hypothesis space H

Algorithm 1: A single hypothesis space induction algorithm. H is a space initialized with a base hypothesis and an empty set of links. T is a temporary container of candidate hypotheses.
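All four heuristics start from the same candidate set: the fluents whose truth values changed between Ei and E′i. A minimal sketch of that state diff, and of an object-based filter in the spirit of H1argonly and H2manip, might look as follows; the fluent encoding and the example scene are illustrative assumptions, not the paper's data format.

```python
def state_diff(env_before, env_after):
    """Fluents whose truth value changed from Ei to Ei' (the H4all candidate
    set): fluents newly asserted, plus negations of fluents that disappeared."""
    added = env_after - env_before
    removed = env_before - env_after
    negated = {(pred, args, not sign) for pred, args, sign in removed}
    return added | negated

def base_hypothesis(diff, objects_of_interest):
    """Keep only the changed fluents mentioning at least one object of
    interest -- the argument objects for H1argonly, or the manipulated
    objects for H2manip."""
    return {f for f in diff if set(f[1]) & objects_of_interest}

# Toy scene loosely following Figure 3: the kettle moves to the stove
# and ends up containing water.
e_before = {("On", ("Kettle1", "Table1"), True)}
e_after = {("On", ("Kettle1", "Stove"), True), ("Has", ("Kettle1", "Water"), True)}
diff = state_diff(e_before, e_after)       # H4all keeps all of these
h1 = base_hypothesis(diff, {"Kettle1"})    # H1argonly: argument object only
```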

5.2 Single Space Induction

First we define consistency between two hypotheses:

Definition. Hypotheses h1 and h2 are consistent if and only if the action sequence ~A1 generated by a symbolic planner based on goal state h1 is exactly the same as the action sequence ~A2 generated based on goal state h2.

Given a base hypothesis, the space induction process is a while-loop that generalizes hypotheses in a bottom-up fashion and stops when no hypotheses can be further generalized. As shown in Algorithm 1, a hypothesis node t is first generalized to a set of immediate children [t(0), ..., t(k)] by removing a single fluent from t. For example, the base hypothesis n1 in Figure 2 is composed of 4 fluents, so 4 immediate child nodes can potentially be generated. If a child node t(i) is consistent with its parent t (as determined by the consistency definition above), node t(i) and a link t → t(i) are added to the space H. The node t(i) is also added to a temporary hypothesis container, waiting to be further generalized. On the other hand, some child hypotheses can be inconsistent with their parents. For example, the gray node (n2) in Figure 2 is a child node that is inconsistent with its parent (n1). As n2 does not explicitly specify Has(x, Water) as part of its goal state, the symbolic planner generates fewer steps to achieve goal state n2 than goal state n1. This implies that the semantics of achieving n2 may differ from those of achieving n1. Such hypotheses that are inconsistent with their parents are pruned. In addition, if t(i) is inconsistent with its parent t, any children of t(i) are also inconsistent with t (e.g., the children of n2 in Figure 2 are also gray nodes, meaning they are inconsistent with the base hypothesis). Through pruning, the size of the entire space can be greatly reduced.

In the resulting hypothesis space, every single hypothesis is consistent with the base hypothesis. By keeping only consistent hypotheses via pruning, we can remove fluents that are not representative of the main goal associated with the verb.
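Algorithm 1 can be sketched compactly. The snippet below is an illustrative implementation, not the authors' code: hypotheses are frozensets of fluents, and the planner-based consistency check is abstracted into a consistent(parent, child) callback. The toy consistency predicate in the example run is purely for demonstration.

```python
from collections import deque

def induce_space(base, consistent):
    """A sketch of Algorithm 1. `base` is a base hypothesis given as a
    frozenset of fluents; `consistent(parent, child)` stands in for the
    planner-based consistency check of Section 5.2. Returns the node set N
    and the parent->child edge set E."""
    nodes, edges = {base}, set()
    queue = deque([base])
    while queue:
        t = queue.popleft()
        for fluent in t:                 # generalize: drop one fluent at a time
            child = frozenset(t - {fluent})
            if not child:
                continue
            if consistent(t, child):     # keep consistent generalizations
                edges.add((t, child))
                if child not in nodes:
                    nodes.add(child)
                    queue.append(child)
            # else: prune child; nothing is generalized from a pruned node
    return nodes, edges

# Toy run: treat a child as consistent iff it still contains the fluent "A".
base = frozenset({"A", "B", "C"})
N, E = induce_space(base, lambda t, c: "A" in c)
```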

5.3 Space Merging

If the robot has existing knowledge (i.e., a hypothesis space Ht) for a verb frame, the hypothesis space H induced from a new instance of the same verb is merged with the existing space Ht. Currently, a new space Ht+1 is generated where the nodes of Ht+1 are the union of the nodes of H and Ht, and the links in Ht+1 are generated by checking the parent-child relationship between nodes. In future work, more space merging operations will be explored, and human feedback will be incorporated into the induction process.
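The merging step might be sketched as follows, with hypotheses again encoded as frozensets of fluents; the parent-child test (child equals parent minus exactly one fluent) mirrors the edge definition of Section 4. This is an illustrative sketch, not the paper's implementation.

```python
def merge_spaces(space_a, space_b):
    """Merge an existing space Ht with a newly induced space H (Section 5.3):
    nodes are the union; links are recomputed by checking the immediate
    parent-child relation between every pair of nodes."""
    nodes = space_a | space_b
    edges = {
        (p, c)
        for p in nodes
        for c in nodes
        if len(p) == len(c) + 1 and c < p   # c is p with one fluent removed
    }
    return nodes, edges

# Two tiny spaces sharing the node {A}:
nodes, edges = merge_spaces(
    {frozenset({"A", "B"}), frozenset({"A"})},
    {frozenset({"A", "C"}), frozenset({"A"})},
)
```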

6 Hypothesis Selection

Hypothesis selection is applied when the agent intends to execute a command. Given a verb frame extracted from the language command, the agent first selects the best hypothesis (describing the goal state) from the existing knowledge base, and then applies a symbolic planner to generate an action sequence to achieve the goal. In our framework, the model for selecting the best hypothesis is incrementally learned through continuous interaction with humans. More specifically, given a correct action sequence (whether performed by the robot or provided by the human), a regression model is trained to capture the fitness of a hypothesis given a particular situation.

Inference: Given a verb frame vi and a working environment Ei, the goal of inference is to estimate how well each hypothesis hk from a space Ht describes the expected result of performing vi in Ei. The best-fit hypothesis will be used as the goal state to generate the action sequence. Specifically, the "goodness" of describing command vi with hypothesis hk in environment Ei is formulated as follows:

f(hk | vi; Ei; Ht) = W^T · Φ(hk, vi, Ei, Ht)    (1)

where Φ(hk, vi, Ei, Ht) is a feature vector capturing multiple aspects of the relations between hk, vi, Ei, and Ht, as shown in Table 1, and W captures the weight associated with each feature. Example global features include whether the candidate goal hk is in the top level of the entire space Ht and whether hk has the highest frequency. Example local features include whether most of the fluents in hk are already satisfied in the current scene Ei (as such an hk is unlikely to be a desired goal state). The features also include whether the same verb frame vi has been performed in a similar scene during previous interactions, as the corresponding hypotheses induced during that experience are more likely to be relevant and are thus preferred.

Parameter Estimation: Given an action sequence ~Ai that illustrates how to correctly perform command vi in environment Ei during interaction, the model weights are incrementally updated with¹:

Wt+1 = Wt − η ( α ∂R(Wt)/∂Wt + ∂L(Jki, fki)/∂Wt )

where fki := f(hk | vi; Ei; Ht) is defined in Equation 1. Jki is the dependent variable the model should approximate, where Jki := J(si, hk) is the Jaccard Index (details in Section 7) between hypothesis hk and the set of changed states si (i.e., the states changed by executing the illustration action sequence ~Ai in the current environment). L(Jki, fki) is a squared loss function, αR(Wt) is the penalty term, and η is the constant learning rate.
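Equation 1 and the update rule above amount to a linear model trained by stochastic gradient descent. The sketch below assumes a squared loss L = (f − J)²/2 and an L2 penalty R(W) = ‖W‖²/2; the exact scaling inside scikit-learn's SGDRegressor, which the authors report using, may differ, and the toy feature vector is our invention.

```python
import numpy as np

def score(W, phi):
    """Equation 1: f(hk | vi; Ei; Ht) = W^T · Φ(hk, vi, Ei, Ht)."""
    return float(W @ phi)

def sgd_step(W, phi, jaccard, eta=0.01, alpha=1e-4):
    """One incremental update W_{t+1} = W_t - eta*(alpha*dR/dW + dL/dW),
    with squared loss L = (f - J)^2 / 2 and L2 penalty R = ||W||^2 / 2."""
    f = score(W, phi)
    grad_loss = (f - jaccard) * phi   # dL/dW for the squared loss
    grad_penalty = W                  # dR/dW for the L2 penalty
    return W - eta * (alpha * grad_penalty + grad_loss)

W = np.zeros(3)
phi = np.array([1.0, 0.0, 0.5])      # toy feature vector Φ(hk, vi, Ei, Ht)
for _ in range(500):                 # repeated updates move f toward J = 0.8
    W = sgd_step(W, phi, jaccard=0.8)
```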

7 Experiment Setup

Dataset Description. To evaluate our approach, we applied the dataset made available by (Misra et al., 2015). To support incremental learning, each utterance from every original paragraph is extracted so that each command/utterance only contains one verb and its arguments. The corresponding initial environment and an action sequence taught by a human for each command are also extracted. An example is shown in Figure 3, where Li is a language command, Ei is the initial working environment, and ~Ai is a sequence of primitive actions, given by the human, that completes the command. In the original data, some sentences are not aligned with any actions, and thus cannot be used for either learning or evaluation. Removing these unaligned sentences resulted in a total of 991 data instances, including 165 different verb frames.

¹ The SGD regressor in scikit-learn (Pedregosa et al., 2011) is used to perform the linear regression with L2 regularization.

Features on candidate hypothesis hk and the space Ht:
1. If hk belongs to the top level of Ht.
2. If hk has the highest frequency in Ht.

Features on hk and the current situation Ei:
3. Portion of fluents in hk that are already satisfied by Ei.
4. Portion of non-argument objects in hk (examples of non-argument objects are o1 and o2 in Figure 2).

Features on relations between a testing verb frame vi and previous interaction experience:
5. Whether the same verb frame vi has been executed previously with the same argument objects.
6. Similarities between noun phrase descriptions used in the current command and commands from the interaction history.

Table 1: Current features used for incremental learning of the regression model. The first two are binary features and the rest are real-valued features.

Among the 991 data instances, 793 were used for incremental learning (i.e., space induction and hypothesis selector learning). Specifically, given a command, if the robot correctly predicts an action sequence², this correct prediction is used to update the hypothesis selector. Otherwise, the agent will request a correct action sequence from the human, which is used for hypothesis space induction as well as for updating the hypothesis selector.

The hypothesis spaces and regression-based selectors acquired at each run were evaluated on the remaining 20% (198 testing instances). Specifically, for each testing instance, the induced space and the hypothesis selector were applied to identify a desired goal state. Then a symbolic planner³ was applied to predict an action sequence ~A(p) based on this predicted goal state. We then compared ~A(p) with the ground truth action sequence ~A(g) using the following two metrics.

• IED (Instruction Edit Distance) measures the similarity between the ground truth action sequence ~A(g) and the predicted sequence ~A(p). Specifically, the edit distance d between the two action sequences is first calculated. Then d is rescaled as IED = 1 − d/max(|~A(g)|, |~A(p)|), such that IED ranges from 0 to 1 and a larger IED means the two sequences are more similar.

• SJI (State Jaccard Index). Because different action sequences could lead to the same goal state, we also use the Jaccard Index to check the overlap between the changed states. Specifically, executing the ground truth action sequence ~A(g) in the initial scene Ei results in a final environment E′i. Suppose the set of changed states between Ei and E′i is c(g). For the predicted action sequence, we can calculate another set of changed states c(p). The Jaccard Index between c(g) and c(p) is evaluated, which also ranges from 0 to 1; a larger SJI means the predicted state changes are more similar to the ground truth.

² Currently, a prediction is considered correct if the predicted result (c(p)) is similar to a human-labeled action sequence (c(g)) (i.e., SJI(c(g), c(p)) > 0.5).

³ The symbolic planner implemented by (Rintanen, 2012) was utilized to generate action sequences.

Figure 4: The overall performance on the testing set with different configurations for generating the base hypothesis and for hypothesis selection: (a) IED results, (b) SJI results. Each configuration was run five times by randomly shuffling the order of learning instances, and the averaged performance is reported. The result from Misra2015 is shown as a line. Results that are statistically significantly better than Misra2015 are marked with ∗ (paired t-test, p < 0.05).
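Both metrics can be sketched directly from their definitions: IED rescales the Levenshtein distance between action sequences, and SJI is the Jaccard Index between the two sets of changed fluents. The encodings of actions and fluents below are illustrative.

```python
def ied(seq_a, seq_b):
    """Instruction Edit Distance rescaled to [0, 1]:
    IED = 1 - d / max(len(a), len(b)); larger means more similar."""
    m, n = len(seq_a), len(seq_b)
    if max(m, n) == 0:
        return 1.0
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                d[i][j] = i + j
            else:
                d[i][j] = min(
                    d[i - 1][j] + 1,      # deletion
                    d[i][j - 1] + 1,      # insertion
                    d[i - 1][j - 1] + (seq_a[i - 1] != seq_b[j - 1]),
                )
    return 1.0 - d[m][n] / max(m, n)

def sji(changed_gold, changed_pred):
    """State Jaccard Index between the two sets of changed fluents."""
    if not changed_gold and not changed_pred:
        return 1.0
    return len(changed_gold & changed_pred) / len(changed_gold | changed_pred)
```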

Configurations. We also compared the results of using the regression-based selector to select a hypothesis (i.e., RegressionBased) with the following strategies for selecting the hypothesis:

• Misra2015: The state-of-the-art system reported in (Misra et al., 2015) on the command/utterance-level evaluation.4

4 We applied the same system described in (Misra et al., 2015) to predict action sequences. The only difference is that here we report the performance at the command level, not at the paragraph level.

• MemoryBased: Given the induced space, only the base hypotheses hk from each learning instance are used. Because these hk do not have any relaxation, they represent pure memorization-based learning.

• MostGeneral: In this case, only those hypotheses from the top level of the hypothesis space are used, which contain the fewest fluents. These nodes are the most relaxed hypotheses in the space.

• MostFrequent: In this setting, the hypotheses that are most frequently observed across the learning instances are used.
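The baseline strategies can be illustrated with a toy hypothesis space. The fluent names and the Counter-based representation below are assumptions for illustration only; MemoryBased simply restricts the space to the unrelaxed base hypotheses:

```python
from collections import Counter

# Toy hypothesis space: each hypothesis is a frozenset of fluents; counts
# record how often each hypothesis was induced from the learning instances.
# Fluent names are made up for illustration.
space = Counter({
    frozenset({"grasped(x)"}): 1,                                # most relaxed
    frozenset({"grasped(x)", "on(x, y)"}): 3,                    # most frequent
    frozenset({"grasped(x)", "on(x, y)", "near(robot, y)"}): 2,  # a base hypothesis
})

def most_general(space):
    """MostGeneral: hypotheses with the fewest fluents (top of the space)."""
    k = min(len(h) for h in space)
    return [h for h in space if len(h) == k]

def most_frequent(space):
    """MostFrequent: hypotheses observed most often in the learning instances."""
    top = max(space.values())
    return [h for h, c in space.items() if c == top]
```

Here most_general(space) returns the single-fluent hypothesis and most_frequent(space) returns the two-fluent hypothesis observed three times.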

8 Results

8.1 Overall performance

The results of the overall performance across different configurations are shown in Figure 4. For both IED and SJI (i.e., Figure 4(a) and Figure 4(b)), the hypothesis spaces with the regression-model-based hypothesis selector always achieve the best performance across different configurations and outperform the previous approach (Misra et al., 2015). Among the base hypothesis induction strategies, H4all, which considers all the changed states, achieves the best performance across all configurations. This is because H4all keeps all of the state-change information compared with the other heuristics. The performance of H2manip is similar to that of H4all. The reason is that, when all the manipulated objects are considered, the resulting set of changed states covers most of the fluents in H4all. On the other dimension, the regression-based hypothesis selector achieves the best performance and the MemoryBased strategy has the lowest performance. Results for MostGeneral and MostFrequent fall between the regression-based selector and MemoryBased.

(a) Using the regression-based selector to select a hypothesis, comparing the base hypothesis induction heuristics. (b) Inducing the base hypothesis with H4all, comparing the different hypothesis selection strategies.

Figure 5: Incremental learning results. The spaces and regression models acquired at different incremental learning cycles are evaluated on the testing set. The averaged Jaccard Index is reported.

8.2 Incremental Learning Results

Figure 5 presents the incremental learning results on the testing set. To better present the results, we show the performance after each learning cycle of 40 instances. The averaged Jaccard Index (SJI) is reported. Specifically, Figure 5(a) shows the results of configurations comparing different base hypothesis induction heuristics using regression-model-based hypothesis selection. After using 200 out of 840 (23.8%) learning instances, all four curves achieve more than 80% of their overall performance. For example, for the heuristic H4all, the final average Jaccard Index is 0.418; when 200 instances are used, the score is 0.340 (0.340/0.418 ≈ 81%). The same holds for the other heuristics. After 200 instances, H4all and H2manip consistently achieve better performance than H1argonly and H3argrelated. This result indicates that while changes of state mostly affect the arguments of the verbs, other state changes in the environment cannot be ignored; modeling them actually leads to better performance. Using H4all for base hypothesis induction, Figure 5(b) shows the results of comparing different hypothesis selection strategies. The regression-model-based selector always outperforms the other selection strategies.

8.3 Results on Frequently Used Verb Frames

Besides the overall evaluation, we have also taken a closer look at individual verb frames. Most of the verb frames in the data have a very low frequency, which cannot produce statistically significant results, so we only selected verb frames with frequency larger than 40 for this evaluation. For each verb frame, 60% of the data are used for incremental learning and 40% for testing. For each frame, a regression-based selector is trained separately. The resulting SJI curves are shown in Figure 6.

Figure 6: Incremental evaluation for individual verb frames. Four frequently used verb frames are examined: place(x, y), put(x, y), take(x), and turn(x). The X-axis is the number of incremental learning instances, and the Y-axis is the averaged SJI computed with H4all base hypothesis induction and the regression-based hypothesis selector.
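A regression-based hypothesis selector of this kind can be sketched with ordinary least squares. The two features below (number of fluents and frequency in the learning instances) and the goodness targets are assumptions for illustration only, not the paper's actual feature set:

```python
import numpy as np

# Illustrative training data: each row describes a candidate hypothesis by
# [number of fluents, frequency across learning instances]; the target is a
# goodness score, e.g. the SJI achieved when planning with it (assumed).
X = np.array([[1.0, 1.0], [2.0, 3.0], [3.0, 2.0], [2.0, 1.0]])
y = np.array([0.20, 0.65, 0.50, 0.40])

# Fit a linear model by least squares, with a bias column appended.
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

def score(features):
    """Predicted goodness of a hypothesis with the given features."""
    return float(np.dot(np.append(features, 1.0), w))

def select(candidates):
    """Pick the candidate hypothesis with the highest predicted score."""
    return max(candidates, key=lambda c: score(c["features"]))
```

In this setting, select would hand the winning hypothesis's fluents to a symbolic planner as the goal state, mirroring the role the selector plays in the pipeline described above.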

As shown in Figure 6, all four curves become steady after 8 learning instances are used. However, while some verb frames have final SJIs of more than 0.55 (i.e., take(x) and turn(x)), others have relatively lower results (e.g., results for put(x, y) are lower than 0.4). After examining the learning instances for put(x, y), we found that these data are noisier than the training data for the other frames. One source of errors is incorrect object grounding results. For example, a problematic training instance is "put the pillow on the couch", where the object grounding module cannot correctly ground the "couch" to the target object. As a result, the changed states of the second argument (i.e., the "couch") are incorrectly identified, which leads to incorrect prediction of the desired states during inference. Another common error source is the automated parsing of utterances. The action frames generated from the parsing results could be incorrect in the first place, which would contribute to a hypothesis space for a wrong frame. These different types of errors are difficult for the system itself to recognize. This points to the future direction of involving humans in a dialogue to learn a more reliable hypothesis space for verb semantics.

9 Conclusion

This paper presents an incremental learning approach that represents and acquires semantics of action verbs based on state changes of the environment. Specifically, we propose a hierarchical hypothesis space, where each node in the space describes a possible effect of the verb on the world. Given a language command, the induced hypothesis space, together with a learned hypothesis selector, can be applied by the agent to plan for lower-level actions. Our empirical results have demonstrated a significant improvement in performance compared to a previous leading approach. More importantly, as our approach is based on incremental learning, it can potentially be integrated in a dialogue system to support life-long learning from humans. Our future work will extend the current approach with dialogue modeling to learn more reliable hypothesis spaces of resulting states for verb semantics.

Acknowledgments

This work was supported by IIS-1208390 and IIS-1617682 from the National Science Foundation. The authors would like to thank Dipendra K. Misra and colleagues for providing the evaluation data, and the anonymous reviewers for valuable comments.

References

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1(1):49–62.

R. Cantrell, K. Talamadupula, P. Schermerhorn, J. Benton, S. Kambhampati, and M. Scheutz. 2012. Tell me when and why to do it! Run-time planner model updates via natural language instruction. In Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction (HRI'12), pages 471–478, Boston, Massachusetts, USA, March.

David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-2011), pages 859–865, San Francisco, California, USA, August.

Richard E. Fikes and Nils J. Nilsson. 1971. STRIPS: A new approach to the application of theorem proving to problem solving. In Proceedings of the 2nd International Joint Conference on Artificial Intelligence (IJCAI'71), pages 608–620, London, England.

Qiaozi Gao, Malcolm Doering, Shaohua Yang, and Joyce Y. Chai. 2016. Physical causality of action verbs in grounded language understanding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Berlin, Germany.

Yolanda Gil. 1994. Learning by experimentation: Incremental refinement of incomplete planning domains. In Proceedings of the Eleventh International Conference on Machine Learning (ICML'94), pages 87–95, New Brunswick, NJ, USA.

P. Gorniak and D. Roy. 2007. Situated language understanding as filtering perceived affordances. Cognitive Science, 31(2):197–231.

Malka Rappaport Hovav and Beth Levin. 2008. Reflections on manner/result complementarity. Lecture notes.

Malka Rappaport Hovav and Beth Levin. 2010. Reflections on manner/result complementarity. In Lexical Semantics, Syntax, and Event Structure, pages 21–38.

Yunyi Jia, Ning Xi, Joyce Y. Chai, Yu Cheng, Rui Fang, and Lanbo She. 2014. Perceptive feedback for natural language control of robotic operations. In 2014 IEEE International Conference on Robotics and Automation (ICRA 2014), pages 6673–6678, Hong Kong, China, May 31–June 7.

Joohyun Kim and Raymond J. Mooney. 2012. Unsupervised PCFG induction for grounded language learning with highly ambiguous supervision. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL'12), pages 433–444, Jeju Island, Korea.

Hadas Kress-Gazit, Georgios E. Fainekos, and George J. Pappas. 2007. From structured English to robot motion. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2007), pages 2717–2722.

Changsong Liu and Joyce Y. Chai. 2015. Learning to mediate perceptual differences in situated human-robot dialogue. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI'15), pages 2288–2294, Austin, Texas, USA.

Changsong Liu, Lanbo She, Rui Fang, and Joyce Y. Chai. 2014. Probabilistic labeling for efficient referential grounding based on collaborative discourse. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL'14, Volume 2: Short Papers), pages 13–18, Baltimore, MD, USA.

Dipendra Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. 2014. Tell me Dave: Context-sensitive grounding of natural language to mobile manipulation instructions. In Proceedings of Robotics: Science and Systems (RSS'14), Berkeley, US.

Dipendra Kumar Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. 2015. Environment-driven lexicon induction for high-level instructions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP'15, Volume 1: Long Papers), pages 992–1002, Beijing, China.

Shiwali Mohan, James Kirk, and John Laird. 2013. A computational model for situated task learning with interactive instruction. In Proceedings of the International Conference on Cognitive Modeling (ICCM'13).

Kira Mourão, Luke S. Zettlemoyer, Ronald P. A. Petrick, and Mark Steedman. 2012. Learning STRIPS operators from noisy and incomplete observations. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI'12), pages 614–623, Catalina Island, CA, USA.

Hanna M. Pasula, Luke S. Zettlemoyer, and Leslie Pack Kaelbling. 2007. Learning symbolic models of stochastic domains. Journal of Artificial Intelligence Research, 29:309–352.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jussi Rintanen. 2012. Planning as satisfiability: Heuristics. Artificial Intelligence, 193:45–86.

Lanbo She, Yu Cheng, Joyce Y. Chai, Yunyi Jia, Shaohua Yang, and Ning Xi. 2014a. Teaching robots new actions through natural language instructions. In The 23rd IEEE International Symposium on Robot and Human Interactive Communication (IEEE RO-MAN'14), pages 868–873, Edinburgh, UK.

Lanbo She, Shaohua Yang, Yu Cheng, Yunyi Jia, Joyce Y. Chai, and Ning Xi. 2014b. Back to the blocks world: Learning new actions through situated human-robot dialogue. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 89–97, Philadelphia, PA, USA, June. Association for Computational Linguistics.

M. Spangenberg and D. Henrich. 2015. Grounding of actions based on verbalized physical effects and manipulation primitives. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 844–851, Hamburg, Germany.

Jaeyong Sung, Bart Selman, and Ashutosh Saxena. 2014. Synthesizing manipulation sequences for under-specified tasks using unrolled Markov random fields. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'14), pages 2970–2977, Chicago, IL, USA.

Stefanie Tellex, Pratiksha Thaker, Joshua Joseph, and Nicholas Roy. 2014. Learning perceptually grounded word meanings from unaligned parallel data. Machine Learning, 94(2):151–167.

Jesse Thomason, Shiqi Zhang, Raymond Mooney, and Peter Stone. 2015. Learning to interpret natural language commands through human-robot dialog. In Proceedings of the 2015 International Joint Conference on Artificial Intelligence (IJCAI), pages 1923–1929, Buenos Aires, Argentina.

Xuemei Wang. 1995. Learning by observation and practice: An incremental approach for planning operator acquisition. In Proceedings of the Twelfth International Conference on Machine Learning (ICML'95), pages 549–557, Tahoe City, California, USA.

Qiang Yang, Kangheng Wu, and Yunfei Jiang. 2007. Learning action models from plan examples using weighted MAX-SAT. Artificial Intelligence, 171(2–3):107–143.

Shaohua Yang, Qiaozi Gao, Changsong Liu, Caiming Xiong, Song-Chun Zhu, and Joyce Y. Chai. 2016. Grounded semantic role labeling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL'16), San Diego, California.
