arXiv:1805.00462v1 [cs.CL] 26 Apr 2018

Interactive Language Acquisition with One-shot Visual Concept Learning through a Conversational Game

Haichao Zhang†, Haonan Yu†, and Wei Xu†§
† Baidu Research - Institute of Deep Learning, Sunnyvale, USA

§ National Engineering Laboratory for Deep Learning Technology and Applications, Beijing, China
{zhanghaichao,haonanyu,wei.xu}@baidu.com

Abstract

Building intelligent agents that can communicate with and learn from humans in natural language is of great value. Supervised language learning is limited because it mainly captures the statistics of the training data, and it is hardly adaptive to new scenarios or flexible enough to acquire new knowledge without inefficient retraining or catastrophic forgetting. We highlight the perspective that conversational interaction serves as a natural interface both for language learning and for novel knowledge acquisition, and we propose a joint imitation and reinforcement approach for grounded language learning through an interactive conversational game. The agent trained with this approach is able to actively acquire information by asking questions about novel objects and to use the just-learned knowledge in subsequent conversations in a one-shot fashion. Results compared with other methods verify the effectiveness of the proposed approach.

1 Introduction

Language is one of the most natural forms of communication for humans and is typically viewed as fundamental to human intelligence; it is therefore crucial for an intelligent agent to be able to use language to communicate with humans as well. While supervised training with deep neural networks has led to encouraging progress in language learning, it suffers from the problem of capturing mainly the statistics of the training data, and from a lack of adaptiveness to new scenarios and flexibility for acquiring new knowledge without inefficient retraining or catastrophic forgetting. Moreover, supervised training of deep neural network models needs a large number of training samples, while many interesting applications require rapid learning from a small amount of data, which poses an even greater challenge to the supervised setting.

In contrast, humans learn in a way very different from the supervised setting (Skinner, 1957; Kuhl, 2004). First, humans act upon the world and learn from the consequences of their actions (Skinner, 1957; Kuhl, 2004; Petursdottir and Mellor, 2016). While for mechanical actions such as movement the consequences mainly follow geometrical and mechanical principles, for language, humans act by speaking, and the consequence is typically a response in the form of verbal and other behavioral feedback (e.g., nodding) from the conversation partner (i.e., teacher). These types of feedback typically contain informative signals on how to improve language skills in subsequent conversations and play an important role in humans' language acquisition process (Kuhl, 2004; Petursdottir and Mellor, 2016). Second, humans have shown a celebrated ability to learn new concepts from a small amount of data (Borovsky et al., 2003). From even just one example, children seem to be able to make inferences and draw plausible boundaries between concepts, demonstrating the ability of one-shot learning (Lake et al., 2011).

The language acquisition process and the one-shot learning ability of human beings are both impressive manifestations of human intelligence, and they inspire novel settings and algorithms for computational language learning. In this paper, we leverage conversation as both an interactive environment for language learning (Skinner, 1957) and a natural interface for acquiring new knowledge (Baker et al., 2002). We propose an approach for interactive language acquisition with one-shot concept learning ability. The proposed approach allows an agent to learn grounded language from scratch, acquire the transferable skill of actively seeking and memorizing information about novel objects, and develop the one-shot learning ability, purely through conversational interaction with a teacher.

2 Related Work

Supervised Language Learning. Deep neural network-based language learning has seen great success on many applications, including machine translation (Cho et al., 2014b), dialogue generation (Wen et al., 2015; Serban et al., 2016), and image captioning and visual question answering (Antol et al., 2015). For training, a large amount of labeled data is needed, requiring significant effort to collect. Moreover, this setting essentially captures the statistics of the training data and does not respect the interactive nature of language learning, rendering it less flexible for acquiring new knowledge without retraining or forgetting (Stent and Bangalore, 2014).

Reinforcement Learning for Sequences. Some recent studies used reinforcement learning (RL) to tune the performance of a pre-trained language model according to certain metrics (Ranzato et al., 2016; Bahdanau et al., 2017; Li et al., 2016; Yu et al., 2017). Our work is also related to RL in natural language action space (He et al., 2016) and shares a similar motivation with Weston (2016) and Li et al. (2017), which explored language learning through pure textual dialogues. However, in these works (He et al., 2016; Weston, 2016; Li et al., 2017), a set of candidate sequences is provided and the action is to select one from the set. Our main focus is rather on learning language from scratch: the agent has to learn to generate a sequence action rather than simply select one from a provided candidate set.

Communication and Emergence of Language. Recent studies have examined learning to communicate (Foerster et al., 2016; Sukhbaatar et al., 2016) and invent language (Lazaridou et al., 2017; Mordatch and Abbeel, 2018). The emerged language needs to be interpreted by humans via post-processing (Mordatch and Abbeel, 2018). We, however, aim to achieve language learning from the dual perspectives of understanding and generation, and the speaking action of the agent is readily understandable without any post-processing. Some studies on language learning have used a guesser-responder setting in which the guesser tries to achieve the final goal (e.g., classification) by collecting additional information through asking the responder questions (Strub et al., 2017; Das et al., 2017). These works try to optimize the question being asked to help the guesser achieve the final goal, while we focus on transferable speaking and one-shot ability.

One-shot Learning and Active Learning. One-shot learning has been investigated in some recent works (Lake et al., 2011; Santoro et al., 2016; Woodward and Finn, 2016). The memory-augmented network (Santoro et al., 2016) stores visual representations mixed with ground truth class labels in an external memory for one-shot learning. A class label is always provided following the presentation of an image; thus the agent receives information from the teacher in a passive way. Woodward and Finn (2016) present efforts toward active learning, using a vanilla recurrent neural network (RNN) without an external memory. Both lines of study focus on image classification only, meaning the class label is directly provided for memorization. In contrast, we target language and one-shot learning via conversational interaction, and the learner has to learn to extract important information from the teacher's sentences for memorization.

3 The Conversational Game

We construct a conversational game inspired by experiments on language development in infants from cognitive science (Waxman, 2004). The game is implemented with the XWORLD simulator (Yu et al., 2018; Zhang et al., 2017) and is publicly available online.¹ It provides an environment for the agent² to learn language and develop the one-shot learning ability. One-shot learning here means that during test sessions, no further training happens to the agent, and it has to correctly answer the teacher's questions about novel images of never-before-seen classes after being taught only once by the teacher, as illustrated in Figure 1. To succeed in this game, the agent has to learn to 1) speak by generating sentences, 2) extract and memorize useful information with only one exposure and use it in subsequent conversations, and 3) behave adaptively according to context and its own knowledge (e.g., asking questions about unknown objects and answering questions about something known), all achieved through interacting with the teacher.

¹ https://github.com/PaddlePaddle/XWorld
² We use the term agent interchangeably with learner.


Figure 1: Interactive language and one-shot concept learning. Within a session Sl, the teacher may ask questions, answer the learner's questions, make statements, or say nothing. The teacher also provides reward feedback based on the learner's responses as (dis-)encouragement. The learner alternates between interpreting the teacher's sentences and generating a response through the interpreter and speaker. Left: Initially, the learner can barely say anything meaningful. Middle: Later it can produce meaningful responses for interaction. Right: After training, when confronted with an image of a cherry, a novel class that the learner never saw during training, the learner can ask a question about it ("what is it") and generate a correct statement ("this is cherry") for another instance of cherry after being taught only once.

This makes our game distinct from other seemingly relevant games, in which the agent cannot speak (Wang et al., 2016) or "speaks" by selecting a candidate from a provided set (He et al., 2016; Weston, 2016; Li et al., 2017) rather than generating sentences by itself, or games that mainly focus on slow learning (Das et al., 2017; Strub et al., 2017) and fall short on one-shot learning.

In this game, sessions (Sl) are randomly instantiated during interaction. Testing sessions are constructed with a separate dataset containing concepts that never appear during training, in order to evaluate the language and one-shot learning ability. Within a session, the teacher randomly selects an object and interacts with the learner about it by randomly 1) posing a question (e.g., "what is this"), 2) saying nothing (i.e., ""), or 3) making a statement (e.g., "this is monkey"). When the teacher asks a question or says nothing: i) if the learner raises a question, the teacher will provide a statement about the object asked (e.g., "it is frog") with a question-asking reward (+0.1); ii) if the learner says nothing, the teacher will still provide an answer (e.g., "this is elephant") but with an incorrect-reply reward (−1) to discourage the learner from remaining silent; iii) for all other incorrect responses from the learner, the teacher will provide an incorrect-reply reward and move on to the next random object for interaction. When the teacher makes a statement, the learner receives no reward if it generates a correct statement; otherwise an incorrect-reply reward is given. The session ends if the learner answers the teacher's question correctly, generates a correct statement when the teacher says nothing (receiving a correct-answer reward +1), or when the maximum number of steps is reached. The sentence from the teacher at each time step is generated using a context-free grammar, as shown in Table 1.

A success is reached if the learner behaves correctly during the whole session: asking questions about novel objects, generating answers when asked, and making statements when the teacher says nothing about objects that have been taught within the session. Otherwise it is a failure.
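To make the reward scheme above concrete, the following is a minimal sketch of the per-step feedback logic. The `is_correct_statement` oracle and the simple question check are placeholders introduced for illustration; only the numeric values (+0.1, −1, +1) come from the description above, and the maximum-step cutoff is assumed to be handled outside this function.

```python
def step_reward(teacher_act, learner_utterance, is_correct_statement):
    """Sketch of the per-step reward logic described in Section 3.

    teacher_act: "question" | "silence" | "statement"
    learner_utterance: the learner's generated sentence ("" means silence)
    is_correct_statement: bool, whether the learner produced a correct
        statement about the current object (hypothetical oracle).
    Returns (reward, session_ends).
    """
    if teacher_act in ("question", "silence"):
        # Rough proxy for "the learner raises a question" (grammar in Table 1).
        if learner_utterance.startswith(("what", "tell what")):
            return 0.1, False          # question-asking reward
        if learner_utterance == "":
            return -1.0, False         # silence is discouraged
        if is_correct_statement:
            return 1.0, True           # correct answer / statement ends the session
        return -1.0, False             # any other incorrect reply
    # Teacher made a statement (teaching step): no reward if the learner's
    # statement is correct, otherwise an incorrect-reply reward.
    return (0.0 if is_correct_statement else -1.0), False
```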

Table 1: Grammar for the teacher's sentences.

  start     → question | silence | statement
  question  → Q1 | Q2 | Q3
  silence   → " "
  statement → A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8
  Q1 → "what"
  Q2 → "what" M
  Q3 → "tell what" N
  M  → "is it" | "is this" | "is there" | "do you see" | "can you see" | "do you observe" | "can you observe"
  N  → "it is" | "this is" | "there is" | "you see" | "you can see" | "you observe" | "you can observe"
  A1 → G
  A2 → "it is" G
  A3 → "this is" G
  A4 → "there is" G
  A5 → "i see" G
  A6 → "i observe" G
  A7 → "i can see" G
  A8 → "i can observe" G
  G  → object name
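As an illustration of how teacher utterances could be sampled from this grammar, here is a small sketch. The function name and the uniform sampling are assumptions made for illustration; the production lists mirror Table 1.

```python
import random

M = ["is it", "is this", "is there", "do you see",
     "can you see", "do you observe", "can you observe"]
N = ["it is", "this is", "there is", "you see",
     "you can see", "you observe", "you can observe"]
A_PREFIXES = ["", "it is", "this is", "there is",
              "i see", "i observe", "i can see", "i can observe"]

def sample_teacher_sentence(object_name):
    """Sample one sentence from the context-free grammar in Table 1."""
    kind = random.choice(["question", "silence", "statement"])
    if kind == "silence":
        return ""
    if kind == "question":
        return random.choice(["what",
                              "what " + random.choice(M),
                              "tell what " + random.choice(N)])
    prefix = random.choice(A_PREFIXES)
    return (prefix + " " + object_name).strip()

# e.g. sample_teacher_sentence("monkey") -> "what is this" or "this is monkey"
```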

4 Interactive Language Acquisition via Joint Imitation and Reinforcement

Motivation. The goal is to learn to converse and develop the one-shot learning ability by conversing with a teacher and improving from the teacher's feedback. We propose a joint imitation and reinforce approach to achieve this goal. Imitation helps the agent develop the basic ability to generate sensible sentences. As learning is done by observing the teacher's behaviors during conversation, the agent essentially imitates the teacher from a third-person perspective (Stadie et al., 2017) rather than imitating an expert agent who is conversing with the teacher (Das et al., 2017; Strub et al., 2017). During conversations, the agent perceives sentences and images without any explicit labeling of ground truth answers, and it has to learn to make sense of raw perceptions, extract useful information, and save it for later use when generating an answer to the teacher's question. While it is tempting to purely imitate the teacher, an agent trained this way only develops echoic behavior (Skinner, 1957), i.e., mimicry. Reinforce leverages confirmative feedback from the teacher to learn to converse adaptively beyond mimicry by adjusting the action policy. It enables the learner to use the acquired speaking ability and adapt it according to reward feedback. This is analogous to some views on babies' language-learning process: babies use their acquired speaking skills by trial and error with parents and improve according to the consequences of their speaking actions (Skinner, 1957; Petursdottir and Mellor, 2016). The fact that babies do not fully develop speaking capabilities without the ability to hear (Houston and Miyamoto, 2011), and that it is hard to have a meaningful conversation with a trained parrot, signifies the importance of both imitation and reinforcement in language learning.

Formulation. The agent's response can be modeled as a sample from a probability distribution over the possible sequences. Specifically, for one session, given the visual input v^t and conversation history H^t = {w^1, a^1, ..., w^t}, the agent's response a^t can be generated by sampling from a distribution over speaking actions, a^t ~ p^S_θ(a|H^t, v^t). The agent interacts with the teacher by outputting the utterance a^t and receives feedback from the teacher in the next step, with w^{t+1} a sentence as verbal feedback and r^{t+1} a reward (positive values as encouragement, negative values as discouragement, according to a^t, as described in Section 3). Central to the goal is learning p^S_θ(·). We formulate the problem as the minimization of the cost function

$$\mathcal{L}_\theta = \underbrace{\mathbb{E}_W\Big[-\sum\nolimits_t \log p^I_\theta(\mathbf{w}^t|\cdot)\Big]}_{\text{Imitation } \mathcal{L}^I_\theta} \;+\; \underbrace{\mathbb{E}_{p^S_\theta}\Big[-\sum\nolimits_t [\gamma]^{t-1}\cdot r^t\Big]}_{\text{Reinforce } \mathcal{L}^R_\theta},$$

where $\mathbb{E}_W(\cdot)$ is the expectation over all the sentences $W$ from the teacher, $\gamma$ is a reward discount factor, and $[\gamma]^t$ denotes the exponentiation over $\gamma$. While the imitation term directly learns the predictive distribution $p^I_\theta(\mathbf{w}^t|\mathcal{H}^{t-1}, \mathbf{a}^{t-1})$, it contributes to $p^S_\theta(\cdot)$ through parameter sharing between them.
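For intuition only, here is a schematic sketch of how the two terms could be assembled; the array inputs and the helper name are placeholders, not the authors' implementation, and the gradient of the reinforce term is in practice estimated with the policy gradient of Eqn. (6) below.

```python
import numpy as np

def joint_loss(imitation_log_probs, action_log_probs, rewards, gamma=0.99):
    """Sketch of L_theta = L^I (imitation) + L^R (reinforce) for one session.

    imitation_log_probs: log p^I(w^t | .) for each teacher sentence
    action_log_probs:    log p^S(a^t | .) for each sampled agent response
    rewards:             reward r^t received after each agent response
    """
    # Imitation term: negative log-likelihood of the teacher's sentences.
    L_I = -np.sum(imitation_log_probs)

    # Reinforce term: negative discounted return ([gamma]^(t-1) with t = 1, 2, ...).
    discounts = gamma ** np.arange(len(rewards))
    L_R = -np.sum(discounts * np.asarray(rewards))

    return L_I + L_R
```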

Architecture. The learner comprises four major components: external memory, interpreter, speaker, and controller, as shown in Figure 2. The external memory is flexible for storing and retrieving information (Graves et al., 2014; Santoro et al., 2016), making it a natural component of our network for one-shot learning. The interpreter is responsible for interpreting the teacher's sentences, extracting information from the perceived signals, and saving it to the external memory. The speaker is in charge of generating sentence responses with reading access to the external memory. The response could be a question asking for information or a statement answering the teacher's question, leveraging the information stored in the external memory. The controller modulates the behavior of the speaker to generate responses according to context (e.g., the learner's knowledge status).

At time step t, the interpreter uses an interpreter-RNN to encode the input sentence w^t from the teacher, together with historical conversational information, into a state vector h^t_I. h^t_I is then passed through a residue-structured network, which is an identity mapping augmented with a learnable controller f(·), implemented with fully connected layers, to produce c^t. Finally, c^t is used as the initial state of the speaker-RNN for generating the response a^t. The final state h^t_last of the speaker-RNN is used as the initial state of the interpreter-RNN at the next time step.
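The data flow just described can be summarized in a short sketch; all module objects here (interpreter RNN, controller MLP, speaker RNN) are placeholders standing in for the networks described in Appendix A.2, not a faithful implementation.

```python
def learner_step(interpreter_rnn, controller, speaker_rnn,
                 teacher_sentence, h_last_prev):
    """One time step of the interpreter -> controller -> speaker pipeline.

    h_last_prev: final speaker state from the previous step, used to
                 initialize the interpreter-RNN (it carries the history H^t).
    Returns the generated response and the new carry-over state.
    """
    # Interpreter encodes the teacher's sentence plus history into h_I.
    h_I = interpreter_rnn.encode(teacher_sentence, init_state=h_last_prev)

    # Residue-structured control: identity mapping plus learnable controller.
    c = h_I + controller(h_I)

    # Speaker generates the response, initialized with the control vector c.
    response, h_last = speaker_rnn.generate(init_state=c)
    return response, h_last
```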

4.1 Imitation with Memory Augmented Neural Network for Echoic Behavior

The teacher's way of speaking provides a source for the agent to imitate. For example, the syntax for composing a sentence is a useful skill the agent can learn from the teacher's sentences, which could benefit both interpreter and speaker. Imitation is achieved by predicting the teacher's future sentences with the interpreter and by parameter sharing between interpreter and speaker.


Figure 2: Network structure. (a) Illustration of the overall architecture. At each time step, the learner uses the interpreter module to encode the teacher's sentence. The visual perception is also encoded and used as a key to retrieve information from the external memory. The last state of the interpreter-RNN is passed through a controller. The controller's output is added to its input and used as the initial state of the speaker-RNN. The interpreter-RNN updates the external memory with an importance-weighted (illustrated with transparency) piece of information extracted from the perception input. 'Mix' denotes a mixture of word embedding vectors. (b) The structures of the interpreter-RNN (top) and the speaker-RNN (bottom). The interpreter-RNN and speaker-RNN share parameters.

For prediction, we can represent the probability of the next sentence w^t conditioned on the image v^t as well as previous sentences from both the teacher and the learner, {w^1, a^1, ..., w^{t-1}, a^{t-1}}, as

$$p^I_\theta(\mathbf{w}^t|\mathcal{H}^{t-1}, \mathbf{a}^{t-1}, \mathbf{v}^t) = \prod_i p^I_\theta\big(w^t_i \,\big|\, \mathbf{w}^t_{1:i-1}, \mathbf{h}^{t-1}_{\text{last}}, \mathbf{v}^t\big), \quad (1)$$

where $\mathbf{h}^{t-1}_{\text{last}}$ is the last state of the RNN at time step $t-1$, summarizing $\{\mathcal{H}^{t-1}, \mathbf{a}^{t-1}\}$ (cf. Figure 2), and $i$ indexes words within a sentence.

It is natural to model the probability of the i-th word in the t-th sentence with an RNN, where the sentences up to t and the words up to i within the t-th sentence are captured by a fixed-length state vector $\mathbf{h}^t_i = \mathrm{RNN}(\mathbf{h}^t_{i-1}, w^t_i)$. To incorporate knowledge learned and stored in the external memory, the generation of the next word is adaptively based on i) the predictive distribution of the next word from the state of the RNN, capturing the syntactic structure of sentences, and ii) the information from the external memory, representing previously learned knowledge, via a fusion gate $g$:

$$p^I_\theta(w^t_i|\mathbf{h}^t_i, \mathbf{v}^t) = (1-g)\cdot \mathbf{p}_h + g\cdot \mathbf{p}_r, \quad (2)$$

where $\mathbf{p}_h = \mathrm{softmax}\big(\mathbf{E}^{\mathsf{T}} f_{\mathrm{MLP}}(\mathbf{h}^t_i)\big)$ and $\mathbf{p}_r = \mathrm{softmax}(\mathbf{E}^{\mathsf{T}}\mathbf{r})$. $\mathbf{E}\in\mathbb{R}^{d\times k}$ is the word embedding table, with $d$ the embedding dimension and $k$ the vocabulary size. $\mathbf{r}$ is a vector read out from the external memory using a visual key, as detailed in the next section. $f_{\mathrm{MLP}}(\cdot)$ is a multi-layer perceptron (MLP) for bridging the semantic gap between the RNN state space and the word embedding space. The fusion gate $g$ is computed as $g = f(\mathbf{h}^t_i, c)$, where $c$ is the confidence score $c = \max(\mathbf{E}^{\mathsf{T}}\mathbf{r})$; a well-learned concept should have a large score by design (Appendix A.2).
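A small numerical sketch of the fusion in Eqn. (2) follows; the randomly shaped placeholder arrays stand in for the projected RNN state, the embedding table, and the memory read-out, and are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def next_word_distribution(E, h_proj, r, g):
    """Fuse the RNN-based prediction with the memory read-out (Eqn. 2).

    E:      (d, k) word embedding table
    h_proj: (d,)   RNN state projected into the embedding space, f_MLP(h)
    r:      (d,)   vector read from the external memory
    g:      scalar fusion gate in [0, 1]
    """
    p_h = softmax(E.T @ h_proj)   # syntax-driven prediction
    p_r = softmax(E.T @ r)        # knowledge-driven prediction from memory
    return (1.0 - g) * p_h + g * p_r

# Example with toy sizes: d = 8, k = 5.
rng = np.random.default_rng(0)
E, h_proj, r = rng.normal(size=(8, 5)), rng.normal(size=8), rng.normal(size=8)
p = next_word_distribution(E, h_proj, r, g=0.7)   # p sums to 1 over the 5 words
```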

Multimodal Associative Memory. We use a multimodal memory for storing visual (v) and sentence (s) features, one memory per modality, while preserving the correspondence between them (Baddeley, 1992). Information organization is more structured than in the single-modality memory used in Santoro et al. (2016), and cross-modality retrieval is straightforward under this design. A visual encoder, implemented as a convolutional neural network followed by fully connected layers, is used to encode the visual image $\mathbf{v}$ into a visual key $\mathbf{k}_v$, and then the corresponding sentence feature can be retrieved from the memory as

$$\mathbf{r} \leftarrow \mathrm{READ}(\mathbf{k}_v, \mathbf{M}_v, \mathbf{M}_s). \quad (3)$$

$\mathbf{M}_v$ and $\mathbf{M}_s$ are memories for the visual and sentence modalities with the same number of slots (columns). A memory read is implemented as $\mathbf{r} = \mathbf{M}_s\boldsymbol{\alpha}$, with $\boldsymbol{\alpha}$ a soft reading weight obtained through the visual modality by computing the cosine similarities between $\mathbf{k}_v$ and the slots of $\mathbf{M}_v$.

A memory write is similar to that of the Neural Turing Machine (Graves et al., 2014), but with a content importance gate $g_{\mathrm{mem}}$ to adaptively control whether the content $\mathbf{c}$ should be written into memory:

$$\mathbf{M}_m \leftarrow \mathrm{WRITE}(\mathbf{M}_m, \mathbf{c}_m, g_{\mathrm{mem}}), \quad m\in\{v, s\}.$$


For the visual modality, $\mathbf{c}_v = \mathbf{k}_v$. For the sentence modality, $\mathbf{c}_s$ has to be selectively extracted from the sentence generated by the teacher. We use an attention mechanism to achieve this: $\mathbf{c}_s = \mathbf{W}\boldsymbol{\eta}$, where $\mathbf{W}$ denotes the matrix whose columns are the embedding vectors of all the words in the sentence. $\boldsymbol{\eta}$ is a normalized attention vector representing the relative importance of each word in the sentence, as measured by the cosine similarity between the sentence representation vector and each word's context vector, computed using a bidirectional RNN. The scalar-valued content importance gate $g_{\mathrm{mem}}$ is computed as a function of the sentence from the teacher, meaning that the importance of the content to be written into memory depends on the content itself (cf. Appendix A.3 for more details). The memory write is achieved with an erase and an add operation:

$$\mathbf{M}_m = \mathbf{M}_m - \mathbf{M}_m \odot (g_{\mathrm{mem}}\cdot \mathbf{1}\cdot \boldsymbol{\beta}^{\mathsf{T}}),$$
$$\mathbf{M}_m = \mathbf{M}_m + g_{\mathrm{mem}}\cdot \mathbf{c}_m\cdot \boldsymbol{\beta}^{\mathsf{T}}, \quad m\in\{v, s\},$$

where $\odot$ denotes the Hadamard product and the write location $\boldsymbol{\beta}$ is determined with a Least Recently Used Access mechanism (Santoro et al., 2016).
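The read and write operations above can be sketched in a few lines. The least-recently-used slot selection is simplified here to choosing a given slot index, and the softmax normalization of the reading weights is an assumption; both are for illustration only.

```python
import numpy as np

def cosine(a, B):
    """Cosine similarity between vector a and each column of B."""
    return (B.T @ a) / (np.linalg.norm(B, axis=0) * np.linalg.norm(a) + 1e-8)

def memory_read(k_v, M_v, M_s):
    """Eqn. (3): read a sentence feature using a visual key."""
    alpha = np.exp(cosine(k_v, M_v))
    alpha /= alpha.sum()              # soft reading weights over the slots
    return M_s @ alpha

def memory_write(M, c, g_mem, slot):
    """Erase-then-add write into one slot, gated by content importance."""
    beta = np.zeros(M.shape[1]); beta[slot] = 1.0
    M = M - M * (g_mem * np.outer(np.ones(M.shape[0]), beta))   # erase
    M = M + g_mem * np.outer(c, beta)                           # add
    return M
```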

4.2 Context-adaptive Behavior Shaping through Reinforcement Learning

Imitation fosters the basic language ability for generating echoic behavior (Skinner, 1957), but it is not enough for conversing adaptively with the teacher according to context and the knowledge state of the learner. Thus we leverage reward feedback to shape the behavior of the agent by optimizing the policy using RL. The agent's response $\mathbf{a}^t$ is generated by the speaker, which can be modeled as a sample from a distribution over all possible sequences, given the conversation history $\mathcal{H}^t = \{\mathbf{w}^1, \mathbf{a}^1, \cdots, \mathbf{w}^t\}$ and visual input $\mathbf{v}^t$:

$$\mathbf{a}^t \sim p^S_\theta(\mathbf{a}|\mathcal{H}^t, \mathbf{v}^t). \quad (4)$$

As $\mathcal{H}^t$ can be encoded by the interpreter-RNN as $\mathbf{h}^t_I$, the action policy can be represented as $p^S_\theta(\mathbf{a}|\mathbf{h}^t_I, \mathbf{v}^t)$. To leverage the language skill that is learned via imitation through the interpreter, we generate the sentence by implementing the speaker with an RNN that shares parameters with the interpreter-RNN, but with a conditional signal modulated by a controller network (Figure 2):

$$p^S_\theta(\mathbf{a}^t|\mathbf{h}^t_I, \mathbf{v}^t) = p^I_\theta\big(\mathbf{a}^t \,\big|\, \mathbf{h}^t_I + f(\mathbf{h}^t_I, c), \mathbf{v}^t\big). \quad (5)$$

The reason for using a controller f(·) for modulation is that the basic language model only offers the learner the echoic ability to generate a sentence, but not necessarily behavior that adapts to context (e.g., asking questions when facing novel objects and providing an answer for a previously learned object according to its own knowledge state). Without any additional module or learning signal, the agent's behavior would be the same as the teacher's because of parameter sharing; thus, it would be difficult for the agent to learn to speak in an adaptive manner.

To learn from the consequences of speaking actions, the policy p^S_θ(·) is adjusted by maximizing the expected future reward, as represented by L^R_θ. As a non-differentiable sampling operation is involved in Eqn. (4), the policy gradient theorem (Sutton and Barto, 1998) is used to derive the gradient for updating p^S_θ(·) in the reinforce module:

$$\nabla_\theta \mathcal{L}^R_\theta = \mathbb{E}_{p^S_\theta}\Big[\sum\nolimits_t A^t \cdot \nabla_\theta \log p^S_\theta(\mathbf{a}^t|\mathbf{c}^t)\Big], \quad (6)$$

where $A^t = V(\mathbf{h}^t_I, \mathbf{c}^t) - r^{t+1} - \gamma V(\mathbf{h}^{t+1}_I, \mathbf{c}^{t+1})$ is the advantage (Sutton and Barto, 1998), estimated using a value network $V(\cdot)$. The imitation module contributes by implementing $\mathcal{L}^I_\theta$ as a cross-entropy loss (Ranzato et al., 2016) and minimizing it with respect to the parameters of $p^I_\theta(\cdot)$, which are shared with $p^S_\theta(\cdot)$. The training signal from imitation takes the shortcut connection without going through the controller. More details on $f(\cdot)$ and $V(\cdot)$ are provided in Appendix A.2.
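The per-step update in Eqn. (6) amounts to weighting the log-probability of each sampled response by its advantage estimate. A schematic version follows, with the value estimates supplied as plain arrays; the function name is a placeholder and the sign convention simply reproduces the expression above.

```python
import numpy as np

def advantage_estimates(values, next_values, rewards, gamma=0.99):
    """Advantage A^t used to weight grad log p^S(a^t | c^t) in Eqn. (6).

    values:      V(h^t_I, c^t) for each step of a session
    next_values: V(h^{t+1}_I, c^{t+1}) for each step (0 at the final step)
    rewards:     r^{t+1} received after each speaking action
    """
    values, next_values, rewards = map(np.asarray, (values, next_values, rewards))
    # Advantage as written in the paper: A^t = V_t - r_{t+1} - gamma * V_{t+1}.
    return values - rewards - gamma * next_values

# The policy update accumulates A^t * grad log p(a^t | c^t) over a session,
# while the imitation loss (cross-entropy on teacher sentences) bypasses the
# controller via the shortcut connection.
```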

5 Experiments

We conduct experiments and compare with baseline approaches. We first experiment with a word-level task in which the teacher and the learner communicate a single word each time. We then investigate the impact of image variations on concept learning. We further evaluate on the more challenging sentence-level task, in which the teacher and the agent communicate in the form of sentences with varying lengths.

Setup. To evaluate the performance in learning a transferable ability, rather than the ability to fit a particular dataset, we use an Animal dataset for training and test the trained models on a Fruit dataset (Figure 1). More details on the datasets are provided in Appendix A.1. Each session involves two randomly sampled classes, and the maximum number of interaction steps is six.

[Figure 3 plot: training reward versus number of games (×10³) for Reinforce, Imitation, Imitation+Gaussian-RL, and Proposed.]

Figure 3: Evolution of reward during training for the word-level task without image variations.

Baselines. The following methods are compared:

• Reinforce: a baseline model with the same network structure as the proposed model, trained using RL only, i.e., minimizing L^R_θ;
• Imitation: a recurrent encoder-decoder (Serban et al., 2016) model with the same structure as ours, trained via imitation (minimizing L^I_θ);
• Imitation+Gaussian-RL: a joint imitation and reinforcement method using a Gaussian policy (Duan et al., 2016) in the latent space of the control vector c^t (Zhang et al., 2017). The policy is changed by modifying the control vector c^t that the action policy depends upon.

Training Details. The training algorithm is implemented with the deep learning platform PaddlePaddle.³ The whole network is trained from scratch in an end-to-end fashion. The network is randomly initialized without any pre-training and is trained with decayed Adagrad (Duchi et al., 2011). We use a batch size of 16, a learning rate of 1×10⁻⁵, and a weight decay rate of 1.6×10⁻³. We also exploit experience replay (Wang et al., 2017; Yu et al., 2018). The reward discount factor γ is 0.99, the word embedding dimension d is 1024, and the dictionary size k is 80. The visual image size is 32×32, the maximum length of a generated sentence is 6, and the memory size is 10. Word embedding vectors are initialized as random vectors and remain fixed during training. A sampling operation is used for sentence generation during training for exploration, while a max operation is used during testing for both Proposed and the Reinforce baseline. The max operation is used in both training and testing for the Imitation and Imitation+Gaussian-RL baselines.

³ https://github.com/PaddlePaddle/Paddle

[Figure 4 bars: success rate (%) and reward on the test set for Reinforce, Imitation, Imitation+Gaussian-RL, and Proposed.]

Figure 4: Test performance for the word-level task without image variations. Models are trained on the Animal dataset and tested on the Fruit dataset.

[Figure 5 plots: success rate (%) and reward versus image variation ratio (0 to 1) for Reinforce, Imitation, Imitation+Gaussian-RL, and Proposed.]

Figure 5: Test success rate and reward for the word-level task on the Fruit dataset under different test image variation ratios, for models trained on the Animal dataset with a variation ratio of 0.5 (solid lines) and without variation (dashed lines).


5.1 Word-Level Task

In this experiment, we focus on a word-level task, which offers an opportunity to analyze and understand the underlying behavior of different algorithms while being free from distracting factors. Note that although the teacher speaks a word each time, the learner still has to learn to generate a full sentence ended with an end-of-sentence symbol.

Figure 3 shows the evolution of reward during training for the different approaches. Reinforce makes very little progress, mainly due to the difficulty of exploration in the large space of sequence actions. Imitation obtains higher rewards than Reinforce during training, as it can avoid some penalties by generating sensible sentences such as questions. Imitation+Gaussian-RL gets higher rewards than both Imitation and Reinforce, indicating that the RL component reshapes the action policy toward higher rewards. However, as the Gaussian policy optimizes the action policy indirectly in a latent feature space, it is less efficient for exploration and learning. Proposed achieves the highest final reward during training.

We train the models using the Animal dataset and evaluate them on the Fruit dataset.


Figure 6: Visualization of the CNN features with t-SNE. Ten classes randomly sampled from (a-b) the Animal dataset and (c-d) the Fruit dataset, with features extracted using the visual encoder trained without (a, c) and with (b, d) image variations on the Animal dataset.


Figure 7: Example results of the proposed approach on novel classes. The learner can ask about the new class and use the interpreter to extract useful information from the teacher's sentence via the word-level attention η and the content importance gate g_mem jointly. The speaker uses the fusion gate g to adaptively switch between the signal from the RNN (small g) and the external memory (large g) to generate sentence responses.

Figure 4 summarizes the success rate and average reward over 1K testing sessions. As can be observed, Reinforce achieves the lowest success rate (0.0%) and reward (−6.0) due to its inherent inefficiency in learning. Imitation performs better than Reinforce in terms of both success rate (28.6%) and reward (−2.7). Imitation+Gaussian-RL achieves a higher reward (−1.2) during testing, but its success rate (32.1%) is similar to that of Imitation, mainly due to the rigorous criteria for success. Proposed reaches the highest success rate (97.4%) and average reward (+1.1),⁴ outperforming all baseline methods by a large margin. From this experiment, it is clear that imitation with a proper use of reinforcement is crucial for achieving adaptive behaviors (e.g., asking questions about novel objects and proactively generating answers or statements about learned objects).

⁴ The testing reward is higher than the training reward mainly due to the action sampling in training for exploration.

5.2 Learning with Image Variations

To evaluate the impact of within-class image variations on one-shot concept learning, we train models with and without image variations, and during testing compare their performance under different image variation ratios (the chance of a novel image instance being present within a session), as shown in Figure 5. The performance of the model trained without image variations drops significantly as the variation ratio increases. We also evaluate the performance of models trained under a variation ratio of 0.5. Figure 5 clearly shows that although there is also an expected performance drop, the performance degrades more gradually, indicating the importance of image variation for learning one-shot concepts. Figure 6 visualizes sampled training and testing images represented by their corresponding features, extracted using the visual encoder trained without and with image variations. Clusters of visually similar concepts emerge in the feature space when training with image variations, indicating that a more discriminative visual encoder was obtained for learning generalizable concepts.

5.3 Sentence-Level Task

We further evaluate the model on sentence-level tasks. The teacher's sentences are generated using the grammar shown in Table 1 and have a number of variations, with sentence lengths ranging from one to five. Example sentences from the teacher are presented in Appendix A.1. This task is more challenging than the word-level task in two ways: i) information processing is more difficult, as the learner has to learn to extract useful information which could appear at different locations in the sentence; ii) sentence generation is also more difficult than in the word-level task, and the learner has to adaptively fuse information from the RNN and the external memory to generate a complete sentence. Figure 8 compares the different approaches in terms of success rate and average reward on the novel test set. Proposed again outperforms all other methods in terms of both success rate (82.8%) and average reward (+0.8), demonstrating its effectiveness even for the more complex sentence-level task.

We also visualize the information extraction andthe adaptive sentence composing process of theproposed approach when applied to a test set. Asshown in Figure 7, the agent learns to extract use-ful information from the teacher’s sentence anduse the content importance gate to control whatcontent is written into the external memory. Con-cretely, sentences containing object names have alarger gmem value, and the word corresponding toobject name has a larger value in the attention vec-tor η compared to other words in the sentence.The combined effect of η and gmem suggests thatwords corresponding to object names have higherlikelihoods of being written into the external mem-ory. The agent also successfully learns to usethe external memory for storing the informationextracted from the teacher’s sentence, to fuse itadaptively with the signal from the RNN (captur-ing the syntactic structure) and to generate a com-plete sentence with the new concept included. Thevalue of the fusion gate g is small when gener-ating words like “what,”, “i,” “can,” and “see,”meaning it mainly relies on the signal from theRNN for generation (c.f., Eqn.(2) and Figure 7).In contrast, when generating object names (e.g.,“banana,” and “cucumber”), the fusion gate g hasa large value, meaning that there is more emphasison the signal from the external memory. This ex-periment showed that the proposed approach is ap-plicable to the more complex sentence-level taskfor language learning and one-shot learning. Moreinterestingly, it learns an interpretable operationalprocess, which can be easily understood. More re-sults including example dialogues from differentapproaches are presented in Appendix A.4.

6 Discussion

In this work, we have presented an approach for grounded language acquisition with one-shot visual concept learning.

[Figure 8 bars: success rate (%) and reward on the test set for Reinforce, Imitation, Imitation+Gaussian-RL, and Proposed.]

Figure 8: Test performance for the sentence-level task with image variations (variation ratio = 0.5).

This is achieved purely by interacting with a teacher and learning from the feedback arising naturally during the interaction, through joint imitation and reinforcement learning with a memory augmented neural network. Experimental results show that the proposed approach is effective for language acquisition with one-shot visual concept learning across several different settings, compared with several baseline approaches.

In the current work, we designed and used a computer game (a synthetic task with synthetic language) for training the agent. This is mainly because, to the best of our knowledge, no existing dataset is adequate for the interactive language learning and one-shot learning problem addressed here. Although our current design is an artificial game, there is a reasonable amount of variation both within and across sessions, e.g., the object classes to be learned within a session, the presentation order of the selected classes, the sentence patterns, and the image instances used. All these factors contribute to the complexity of the learning task, making it non-trivial and already very challenging for existing approaches, as shown by the experimental results. While offering flexibility in training, one downside of using a synthetic task is its limited amount of variation compared with real-world scenarios with natural languages. Although it might be non-trivial to extend the proposed approach to real natural language directly, we regard this work as an initial step toward this ultimate, ambitious goal, and our game might shed some light on designing more advanced games or performing real-world data collection. We plan to investigate the generalization and application of the proposed approach to more realistic environments with more diverse tasks in future work.

Acknowledgments

We thank the reviewers and PC members for their efforts in helping improve the paper. We thank Xiaochen Lian and Xiao Chu for their discussions.


References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In International Conference on Computer Vision (ICCV).

Alan Baddeley. 1992. Working memory. Science, 255(5044):556–559.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. An actor-critic algorithm for sequence prediction. In International Conference on Learning Representations (ICLR).

Ann C. Baker, Patricia J. Jensen, and David A. Kolb. 2002. Conversational Learning: An Experiential Approach to Knowledge Creation. Copley Publishing Group.

Arielle Borovsky, Marta Kutas, and Jeff Elman. 2003. Learning to use words: Event related potentials index single-shot contextual word learning. Cognition, 116(2):289–296.

K. Cho, B. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014a. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Empirical Methods in Natural Language Processing (EMNLP).

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Empirical Methods in Natural Language Processing (EMNLP).

Abhishek Das, Satwik Kottur, Jose M.F. Moura, Stefan Lee, and Dhruv Batra. 2017. Learning cooperative visual dialog agents with deep reinforcement learning. In International Conference on Computer Vision (ICCV).

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. 2016. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML).

J. Duchi, E. Hazan, and Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems (NIPS).

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. CoRR, abs/1410.5401.

Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. 2016. Deep reinforcement learning with a natural language action space. In Association for Computational Linguistics (ACL).

Derek M. Houston and Richard T. Miyamoto. 2011. Effects of early auditory experience on word learning and speech perception in deaf children with cochlear implants: Implications for sensitive periods of language development. Otol Neurotol, 31(8):1248–1253.

Patricia K. Kuhl. 2004. Early language acquisition: cracking the speech code. Nat Rev Neurosci, 5(2):831–843.

Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. 2011. One shot learning of simple visual concepts. In Proceedings of the 33rd Annual Meeting of the Cognitive Science Society.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. In International Conference on Learning Representations (ICLR).

Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, and Jason Weston. 2017. Learning through dialogue interactions by asking questions. In International Conference on Learning Representations (ICLR).

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Empirical Methods in Natural Language Processing (EMNLP).

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. 2013. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop.

Igor Mordatch and Pieter Abbeel. 2018. Emergence of grounded compositional language in multi-agent populations. In Association for the Advancement of Artificial Intelligence (AAAI).

Anna Ingeborg Petursdottir and James R. Mellor. 2016. Reinforcement contingencies in language acquisition. Policy Insights from the Behavioral and Brain Sciences, 4(1):25–32.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In International Conference on Learning Representations (ICLR).

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML).

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Association for the Advancement of Artificial Intelligence (AAAI).

B. F. Skinner. 1957. Verbal Behavior. Copley Publishing Group.

Bradly C. Stadie, Pieter Abbeel, and Ilya Sutskever. 2017. Third-person imitation learning. In International Conference on Learning Representations (ICLR).

Amanda Stent and Srinivas Bangalore. 2014. Natural Language Generation in Interactive Systems. Cambridge University Press.

Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron C. Courville, and Olivier Pietquin. 2017. End-to-end optimization of goal-driven and visually grounded dialogue systems. In International Joint Conference on Artificial Intelligence (IJCAI).

Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. 2016. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems (NIPS).

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.

S. I. Wang, P. Liang, and C. Manning. 2016. Learning language games through interaction. In Association for Computational Linguistics (ACL).

Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. 2017. Sample efficient actor-critic with experience replay. In International Conference on Learning Representations (ICLR).

Sandra R. Waxman. 2004. Everything had a name, and each name gave birth to a new thought: links between early word learning and conceptual organization. Cambridge, MA: The MIT Press.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-hao Su, David Vandyke, and Steve J. Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Empirical Methods in Natural Language Processing (EMNLP).

Jason Weston. 2016. Dialog-based language learning. In Advances in Neural Information Processing Systems (NIPS).

Mark Woodward and Chelsea Finn. 2016. Active one-shot learning. In NIPS Deep Reinforcement Learning Workshop.

Haonan Yu, Haichao Zhang, and Wei Xu. 2018. Interactive grounded language acquisition and generalization in a 2D world. In International Conference on Learning Representations (ICLR).

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Association for the Advancement of Artificial Intelligence (AAAI).

Haichao Zhang, Haonan Yu, and Wei Xu. 2017. Listen, interact and talk: Learning to speak via interaction. In NIPS Workshop on Visually-Grounded Interaction and Language.


A Appendix

A.1 Datasets and Example Sentences

The Animal dataset contains 40 animal classes with 408 images in total, about 10 images per class on average. The Fruit dataset contains 16 classes and 48 images in total, with 3 images per class. The object classes and images are summarized in Table 2 and Figure 9. Example sentences from the teacher in different cases (questioning, answering, and saying nothing) are presented in Table 3.

Table 2: Object classes for the two datasets.

  Set     #cls/img   Object Names
  Animal  40/408     armadillo, bear, bull, butterfly, camel, cat, chicken, cobra, condor, cow, crab, crocodile, deer, dog, donkey, duck, elephant, fish, frog, giraffe, goat, hedgehog, kangaroo, koala, lion, monkey, octopus, ostrich, panda, peacock, penguin, pig, rhinoceros, rooster, seahorse, snail, spider, squirrel, tiger, turtle
  Fruit   16/48      apple, avocado, banana, blueberry, cabbage, cherry, coconut, cucumber, fig, grape, lemon, orange, pineapple, pumpkin, strawberry, watermelon

Table 3: Example sentences from the teacher.

  Category            Example Sentences
  Empty               ""
  Question            "what", "what is it", "what is this", "what is there", "what do you see", "what can you see", "what do you observe", "what can you observe", "tell what it is", "tell what this is", "tell what there is", "tell what you see", "tell what you can see", "tell what you observe", "tell what you can observe"
  Answer / Statement  "apple", "it is apple", "this is apple", "there is apple", "i see apple", "i observe apple", "i can see apple", "i can observe apple"

A.2 Network Details

A.2.1 Visual Encoder

The visual encoder takes an input image and outputs a visual feature vector. It is implemented as a convolutional neural network (CNN) followed by fully connected (FC) layers. The CNN has four layers with 32, 64, 128, and 256 filters of size 3×3, respectively, each followed by max-pooling with a pooling size of 3 and a stride of 2. The ReLU activation is used for all layers. Two FC layers with output dimensions of 512 and 1024 are used after the CNN, with ReLU and linear activations respectively.
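For concreteness, here is a sketch of such an encoder, written with PyTorch purely for illustration (the paper's implementation uses PaddlePaddle); the padding choices and the resulting flattened size are assumptions, since the appendix does not specify them.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """CNN + FC encoder mapping a 3x32x32 image to a 1024-d visual key."""
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (32, 64, 128, 256):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(kernel_size=3, stride=2, padding=1)]
            in_ch = out_ch
        self.cnn = nn.Sequential(*layers)
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(256 * 2 * 2, 512), nn.ReLU(),
                                nn.Linear(512, 1024))   # linear output layer

    def forward(self, images):
        return self.fc(self.cnn(images))

# key = VisualEncoder()(torch.randn(1, 3, 32, 32))  # -> shape (1, 1024)
```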

A.2.2 Interpreter and Speaker

The interpreter and speaker are implemented with the interpreter-RNN and speaker-RNN respectively, and they share parameters. The RNN is implemented using the Gated Recurrent Unit (Cho et al., 2014a) with a state dimension of 1024. Before being input to the RNN, word ids are first projected to a word embedding vector of dimension 1024, followed by two FC layers with ReLU activations and a third FC layer with a linear activation, all having output dimensions of 1024.

A.2.3 Fusion Gate

The fusion gate g is implemented as two FC layers with ReLU activations and a third FC layer with a sigmoid activation. The output dimensions are 50, 10, and 1, respectively.

A.2.4 Controller

The controller f(·) together with the identity mapping forms a residue-structured network:

$$\mathbf{c} = \mathbf{h} + f(\mathbf{h}). \quad (7)$$

f(·) is implemented as two FC layers with ReLU activations and a third FC layer with a linear activation, all having an output dimension of 1024.

A.2.5 Value Network

The value network is introduced to estimate the expected accumulated future reward. It takes the state vector of the interpreter-RNN, h_I, and the confidence c as input. It is implemented as two FC layers with ReLU activations and output dimensions of 512 and 204 respectively. The third layer is another FC layer with a linear activation and an output dimension of 1. It is trained by minimizing the cost (Sutton and Barto, 1998)

$$\mathcal{L}_V = \mathbb{E}_{p^S_\theta}\big(V(\mathbf{h}^t_I, c^t) - r^{t+1} - \lambda V'(\mathbf{h}^{t+1}_I, c^{t+1})\big)^2,$$

where V'(·) denotes a target version of the value network, whose parameters remain fixed until copied from V(·) periodically (Mnih et al., 2013).
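A toy numpy sketch of this target-network regression, with scalar value estimates standing in for the network outputs:

```python
import numpy as np

def value_loss(v, v_target_next, rewards, lam=0.99):
    """Squared error between V(h^t, c^t) and r^{t+1} + lam * V'(h^{t+1}, c^{t+1}).

    v:             value estimates V(.) for each step
    v_target_next: target-network estimates V'(.) at the next step (0 at the end)
    rewards:       observed rewards r^{t+1}
    """
    v, v_target_next, rewards = map(np.asarray, (v, v_target_next, rewards))
    td_error = v - rewards - lam * v_target_next
    return np.mean(td_error ** 2)

# The target network V' is a frozen copy of V, refreshed periodically, which
# stabilizes the regression target during training.
```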


Figure 9: Dataset images. Top: Animal dataset. Bottom: Fruit dataset.

A.2.6 Confidence Score

The confidence score c is defined as

$$c = \max(\mathbf{E}^{\mathsf{T}}\mathbf{r}), \quad (8)$$

where $\mathbf{E}\in\mathbb{R}^{d\times k}$ is the word embedding table, with d the embedding dimension and k the vocabulary size. $\mathbf{r}\in\mathbb{R}^d$ is the vector read out from the sentence modality of the external memory:

$$\mathbf{r} = \mathbf{M}_s\boldsymbol{\alpha}, \quad (9)$$

where $\boldsymbol{\alpha}$ is a soft reading weight obtained through the visual modality by computing the cosine similarities between $\mathbf{k}_v$ and the slots of $\mathbf{M}_v$. The content stored in the memory is extracted from the teacher's sentence $\{w_1, w_2, \cdots, w_i, \cdots, w_n\}$ as (detailed in Section A.3)

$$\mathbf{c}_s = [\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_i, \cdots, \mathbf{w}_n]\,\boldsymbol{\eta}, \quad (10)$$

where $\mathbf{w}_i\in\mathbb{R}^d$ denotes the embedding vector extracted from the word embedding table $\mathbf{E}$ for the word $w_i$. Therefore, for a well-learned concept, with an effective $\boldsymbol{\eta}$ for information extraction and an effective $\boldsymbol{\alpha}$ for information retrieval, $\mathbf{r}$ should be an embedding vector corresponding mainly to the label word associated with the visual image. The value of c should then be large, and the maximum is reached at the location where that label word resides in the embedding table. For a completely novel concept, as the memory contains no information about it, the reading attention $\boldsymbol{\alpha}$ will not be focused, and thus $\mathbf{r}$ will be an average of existing word embedding vectors in the external memory, leading to a small value of c.
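The score in Eqns. (8)-(9) reduces to a couple of matrix products; a toy sketch, with the array shapes as assumptions for illustration:

```python
import numpy as np

def confidence_score(E, M_s, alpha):
    """c = max(E^T r) with r = M_s @ alpha (Eqns. 8-9).

    E:     (d, k) word embedding table
    M_s:   (d, num_slots) sentence-modality memory
    alpha: (num_slots,) soft reading weights from the visual modality
    """
    r = M_s @ alpha          # retrieved sentence feature
    scores = E.T @ r         # similarity of r to every word embedding
    return scores.max()      # large when r matches one stored label word

# A focused alpha on a slot holding a clean label embedding gives a high c;
# an unfocused alpha averages slots and yields a low c (novel concept).
```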

A.3 Sentence Content Extraction and Importance Gate

A.3.1 Content Extraction

We use an attention scheme to extract the useful information from a sentence to be written into memory. Given a sentence $\mathbf{w} = \{w_1, w_2, \cdots, w_n\}$ and the corresponding word embedding vectors $\{\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_n\}$, a summary of the sentence is first generated using a bidirectional RNN, yielding the states $\{\overrightarrow{\mathbf{w}}_1, \overrightarrow{\mathbf{w}}_2, \cdots, \overrightarrow{\mathbf{w}}_n\}$ for the forward pass and $\{\overleftarrow{\mathbf{w}}_1, \overleftarrow{\mathbf{w}}_2, \cdots, \overleftarrow{\mathbf{w}}_n\}$ for the backward pass. The summary vector is the concatenation of the last state of the forward pass and the first state of the backward pass:

$$\mathbf{s} = \mathrm{concat}(\overrightarrow{\mathbf{w}}_n, \overleftarrow{\mathbf{w}}_1). \quad (11)$$

The context vector is the concatenation of the word embedding vector and the state vectors of both the forward and backward passes:

$$\bar{\mathbf{w}}_i = \mathrm{concat}(\mathbf{w}_i, \overrightarrow{\mathbf{w}}_i, \overleftarrow{\mathbf{w}}_i). \quad (12)$$

The word-level attention $\boldsymbol{\eta} = [\eta_1, \eta_2, \cdots, \eta_i, \cdots]$ is computed as the cosine similarity between the transformed sentence summary vector $\mathbf{s}$ and each context vector $\bar{\mathbf{w}}_i$:

$$\eta_i = \cos\!\big(f^{\theta_1}_{\mathrm{MLP}}(\mathbf{s}),\, f^{\theta_2}_{\mathrm{MLP}}(\bar{\mathbf{w}}_i)\big). \quad (13)$$

Both MLPs contain two FC layers with output dimensions of 1024, with a linear and a Tanh activation respectively. The content $\mathbf{c}_s$ to be written into the memory is computed as

$$\mathbf{c}_s = \mathbf{W}\boldsymbol{\eta} = [\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_n]\,\boldsymbol{\eta}. \quad (14)$$

A.3.2 Importance Gate

The content importance gate is computed as $g_{\mathrm{mem}} = \sigma(f_{\mathrm{MLP}}(\mathbf{s}))$, meaning that the importance of the content to be written into the memory depends on the sentence from the teacher. The MLP contains two FC layers with ReLU activations and output dimensions of 50 and 30 respectively, followed by another FC layer with a linear activation and an output dimension of 20. The output layer is an FC layer with an output dimension of 1 and a sigmoid activation σ.
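Putting A.3.1 and A.3.2 together, the extraction step reduces to cosine attention over context vectors followed by a scalar gate. A sketch follows, where the fixed matrices P_s, P_w and the vector v_gate are placeholders standing in for the MLP transforms described above:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def extract_content(word_embs, fwd_states, bwd_states, P_s, P_w, v_gate):
    """Eqns. (11)-(14) plus the importance gate g_mem = sigma(f(s)).

    word_embs:  (n, d) embeddings of the n words in the teacher's sentence
    fwd_states, bwd_states: (n, h) bidirectional-RNN states
    P_s, P_w:   placeholder projections standing in for the two MLPs
    v_gate:     placeholder weight vector standing in for the gate MLP
    """
    s = np.concatenate([fwd_states[-1], bwd_states[0]])        # Eqn. (11)
    ctx = [np.concatenate([w, f, b])                            # Eqn. (12)
           for w, f, b in zip(word_embs, fwd_states, bwd_states)]
    eta = np.array([cos(P_s @ s, P_w @ c) for c in ctx])        # Eqn. (13)
    c_s = word_embs.T @ eta                                     # Eqn. (14)
    g_mem = 1.0 / (1.0 + np.exp(-(v_gate @ s)))                 # importance gate
    return c_s, g_mem
```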

A.4 Example Dialogues on Novel Data

We train models on the Animal dataset and perform the evaluation on the Fruit dataset. Example dialogues of the different approaches are shown in Table 4. Reinforce arrives at a policy in which the learner keeps silent. Both Imitation and Imitation+Gaussian-RL can generate sensible sentences, but they cannot speak adaptively according to context. Proposed can speak adaptively according to context, asking for information about novel classes and answering questions after being taught only once by the teacher.


Table 4: Example dialogues from different approaches (T: teacher, L: learner; ✔ correct response, ✘ incorrect response).

Reinforce
  T: what | L:  ✔
  T: i can see cucumber | L:  ✘
  T: there is cucumber | L:  ✘
  T: tell what you can observe | L:  ✔
  T: i observe cabbage | L:  ✘
  T:  | L:  ✘
  FAILURE

  T: there is orange | L:  ✘
  T: tell what it is | L:  ✔
  T: i see pineapple | L:  ✘
  T: what can you see | L:  ✔
  T: there is orange | L:  ✘
  T: what can you see | L:  ✘
  FAILURE

  T: i observe grape | L:  ✘
  T: i can observe orange | L:  ✘
  T: what is it | L:  ✘
  T: i see grape | L:  ✘
  T: tell what it is | L:  ✘
  T: i observe orange | L:  ✘
  FAILURE

Imitation
  T: apple | L: what is this  ✘
  T: tell what it is | L: i can see condor  ✘
  T: this is apple | L: what is this  ✘
  T: tell what there is | L: i can see apple  ✘
  T:  | L: what is this  ✘
  T: i can observe apple | L: what is this  ✘
  FAILURE

  T: what is there | L: i can observe condor  ✘
  T: tell what you see | L: i can see condor  ✘
  T: tell what you can see | L: i can see condor  ✘
  T: what is there | L: i can see condor  ✘
  T:  | L: what is this  ✔
  T: i can see cucumber | L: what is this  ✘
  FAILURE

  T: i see lemon | L: what is this  ✘
  T: i see lemon | L: what is this  ✘
  T: tell what you can observe | L: i can see lemon  ✘
  T: tell what there is | L: what is this  ✘
  T: it is lemon | L: what is this  ✘
  T: tell what you observe | L: what is this  ✘
  FAILURE

Imitation+Gaussian-RL
  T: i can see strawberry | L: i can see snail  ✘
  T: it is strawberry | L: i can see strawberry  ✔
  T: i can see strawberry | L: i can see strawberry  ✔
  T: blueberry | L: i can see blueberry  ✔
  T: what | L: i can see blueberry  ✔
  FAILURE

  T: what do you observe | L: i can see tiger  ✘
  T: i can observe coconut | L: i can see coconut  ✔
  T: what can you see | L: i can see tiger  ✘
  T: what | L: i can see dog  ✘
  T: tell what you can observe | L: i can see tiger  ✘
  FAILURE

  T: i can observe strawberry | L: i can see strawberry  ✔
  T: this is pineapple | L: i can see pineapple  ✔
  T: this is pineapple | L: i can see pineapple  ✔
  T: what do you see | L: i can see pineapple  ✘
  FAILURE

Proposed
  T:  | L: tell what you observe  ✔
  T: this is watermelon | L: watermelon  ✔
  T: tell what there is | L: tell what you see  ✔
  T: i observe lemon | L: lemon  ✔
  T:  | L: lemon  ✔
  SUCCESS

  T: tell what you can observe | L: tell what you observe  ✔
  T: i observe cherry | L: cherry  ✔
  T: i observe orange | L: orange  ✔
  T: tell what there is | L: orange  ✔
  SUCCESS

  T:  | L: tell what you observe  ✔
  T: i see blueberry | L: blueberry  ✔
  T: there is orange | L: orange  ✔
  T: i can observe orange | L: orange  ✔
  T: tell what it is | L: this is blueberry  ✔
  SUCCESS

  T: i can see grape | L: grape  ✔
  T:  | L: tell what you observe  ✔
  T: i can see pumpkin | L: pumpkin  ✔
  T:  | L: grape  ✔
  SUCCESS

  T: what | L: tell what you observe  ✔
  T: i can see avocado | L: avocado  ✔
  T: this is coconut | L: coconut  ✔
  T: what is there | L: this is avocado  ✔
  SUCCESS

  T: it is coconut | L: i can see coconut  ✔
  T: this is pumpkin | L: i can see pumpkin  ✔
  T: what is it | L: i can see pumpkin  ✘
  T: what do you see | L: i can see pumpkin  ✔
  FAILURE

