
Learning Generative End-to-end Dialog Systems with Knowledge

Tiancheng Zhao

December 2017

Language Technologies Institute
School of Computer Science
Carnegie Mellon University

Pittsburgh, PA 15213

Thesis Committee:
Maxine Eskenazi, Chair (Carnegie Mellon)
William Cohen (Carnegie Mellon)
Louis-Philippe Morency (Carnegie Mellon)
Dilek Hakkani-Tur (Google)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2017 Tiancheng Zhao


Keywords: dialog systems, end-to-end models, deep learning, reinforcement learning, generative models, transfer learning, zero-shot learning


Abstract

Dialog systems are intelligent agents that can converse with humans in natural language and assist them. Traditional dialog systems follow a modular approach and often have trouble expanding to new or more complex domains, which hinders the development of more powerful future dialog systems. This dissertation targets an ambitious goal: to create domain-agnostic learning algorithms and dialog models that can continuously learn to converse in any new domain while requiring only a small amount of new data for the adaptation. Achieving this goal first requires powerful statistical models that are expressive enough to capture the natural language and decision-making processes in dialogs across many domains; second, it requires a learning framework that enables models to share knowledge from previous experience so they can learn to converse in new domains with limited data.

End-to-end (E2E) generative dialog models based on encoder-decoder neural networks are strong candidates for the first requirement. The basic idea is to use an encoder network to map a dialog context into a learned distributed representation and then use a decoder network to generate the next system response. These models are not restricted to hand-crafted intermediate states and can in principle generalize to novel system responses that are not observed in training. However, it is far from trivial to build a full-fledged dialog system using encoder-decoder models. Thus, in the first stage of this thesis, we develop a set of novel neural network architectures that offer key properties required for dialog modeling. Experiments show that the resulting system can interact with both users and symbolic knowledge bases, model complex dialog policies, and reason over long discourse history.

We tackle the second requirement by proposing a novel learning with knowledge (LWK) framework that can adapt the proposed system to new domains with minimal data. Two types of knowledge are studied: 1) domain knowledge from human experts, and 2) a model's knowledge from learning related domains. To incorporate this knowledge, we first propose a domain description that compactly encodes domain knowledge. We then develop novel domain-aware models and training algorithms that teach the system to learn from data in related domains and generalize to unseen ones. Experiments show the proposed framework is able to achieve strong performance in new domains with limited, even zero, in-domain training data.

In conclusion, this dissertation shows that by combining specialized encoder-decoder models with the proposed LWK framework, E2E generative dialog models can be readily applied to complex dialog applications and can easily be expanded to new domains with extremely limited resources, which we believe is an important step towards future general-purpose conversational agents that are more natural and intelligent.


Contents

1 Introduction
    1.1 Introduction
    1.2 Thesis Hypothesis
    1.3 Proposal Outline

2 Related Work
    2.1 Dialog Systems
        2.1.1 Task-oriented Dialog Systems
        2.1.2 Chat-oriented Dialog Systems
    2.2 Deep Reinforcement Learning
    2.3 Deep Generative Models for Natural Language
        2.3.1 Encoder-Decoder Models
        2.3.2 Variational Latent Variable Models
    2.4 Learning from Multiple Sources of Data
        2.4.1 Learning Domain-Invariant Representation
        2.4.2 Zero-shot Learning

3 Generative E2E Dialog Models in a Single Domain
    3.1 Introduction
    3.2 Formulations and Notations
    3.3 Learning to Interface with Symbolic Knowledge Base
        3.3.1 KB As an Environment
        3.3.2 Learning from Delayed Rewards
        3.3.3 Model Architecture
        3.3.4 Experiments
        3.3.5 Results and Discussion
        3.3.6 Conclusion
    3.4 Modeling Stochastic Dialog Policy
        3.4.1 The Dull Response Problem and Beyond
        3.4.2 Latent Variable Dialog Model
        3.4.3 Experiments
        3.4.4 Quantitative Results
        3.4.5 Qualitative Results
    3.5 Handling Slot Expansion
        3.5.1 Slot Expansion
        3.5.2 Handling Slot Expansion via Delexicalized Memory
        3.5.3 A Baseline Implementation
        3.5.4 Evaluations and Experiments
        3.5.5 Results and Discussion
    3.6 Put It All Together and Discussion

4 Learning with Knowledge to Converse in New Domains
    4.1 The Need for Knowledge
        4.1.1 Problem Formulation
        4.1.2 Challenges
    4.2 Learning with Knowledge
        4.2.1 Domain Description
    4.3 Corpora for Benchmarking Learning with Knowledge
        4.3.1 SimDial: A Multi-domain Synthetic Dialog Generator
    4.4 Pilot Study on SimDial
        4.4.1 Study Overview
        4.4.2 Baseline Model
        4.4.3 Experiments
        4.4.4 Results
        4.4.5 Discussion

5 Proposed Work and Timeline
    5.1 Proposed Work
        5.1.1 Create a Human-human Multi-domain Dialog Corpus
        5.1.2 Improve Performance on Single Domain
        5.1.3 Fully Develop Learning with Knowledge
    5.2 Timeline

Bibliography


List of Figures

2.1 Dialog System Pipeline for Task-oriented Dialog Systems

3.1 Dataset creation from an example dialog.
3.2 The challenge of interfacing with KB and our solution.
3.3 A reinforcement learning interpretation of the proposed method to interface with KB.
3.4 The network takes the observation $o_t$ at turn $t$. The recurrent unit updates its hidden state based on both the history and the current turn embedding. Then the model outputs the Q-values for all actions. The policy network in grey is masked by the action mask.
3.5 Graphs showing the evolution of the win rate during training.
3.6 Given A's question, there exist many valid responses from B for different assumptions of the latent variables, e.g., B's hobby.
3.7 Graphical models of CVAE (a) and kgCVAE (b)
3.8 The neural network architectures for the baseline and the proposed CVAE/kgCVAE models. ⊕ denotes the concatenation of the input vectors. The dashed blue connections only appear in kgCVAE.
3.9 BLEU-4 precision/recall vs. the number of distinct reference dialog acts.
3.10 t-SNE visualization of the posterior $z$ for test responses with the top 8 most frequent dialog acts. The size of the circle represents the response length.
3.11 The value of the KL divergence during training with different setups on Penn Treebank.
3.12 High-level illustration of a GEDM with a delexicalized memory unit.
3.13 The implementation of the delexicalized-memory-augmented GEDM.
3.14 An example of entity indexing and utterance lexicalization.
3.15 The unified model with all techniques.

4.1 High-level Architecture for Domain-Aware Dialog Models
4.2 Overall Architecture of the SimDial Data Generator


List of Tables

3.1 Summary of the available questions. $Q_a$ is the number of questions for attribute $a$.
3.2 Performance of the three systems
3.3 State tracking performance of the three systems. The results are in the format of precision/recall.
3.4 $r^2$ of the linear regression for predicting the number of guesses in the test dataset.
3.5 Performance on perplexity and BLEU scores. The highest score is in bold. Note that our BLEU scores are normalized to [0, 1].
3.6 Performance on semantic matching and dialog acts. The highest score is in bold.
3.7 Generated responses from the baselines and kgCVAE in two examples. kgCVAE also provides the predicted dialog act for each response. The context only shows the last utterance due to space limits (the actual context window size is 10).
3.8 The reconstruction perplexity and KL terms on the Penn Treebank test set.
3.9 Performance of each model on automatic measures.

4.1 Complexity specifications for clean and noisy conditions
4.2 Statistics of the four datasets.
4.3 Results on Clean vs. Noisy Data
4.4 Varying the size of training data on Noisy-Rest
4.5 Results on Clean vs. Noisy Data
4.6 Example Errors


List of Abbreviations

AI Artificial Intelligence

ASR Automatic Speech Recognition

DM Dialog Management

E2E End-to-end

GEDM Generative E2E Dialog Model

KB Knowledge Base

KL Kullback-Leibler (divergence)

LSTM Long Short-term Memory

MDP Markov Decision Process

MTL Multi-task Learning

NLG Natural Language Generation

NLP Natural Language Processing

NLU Natural Language Understanding

POMDP Partially Observable Markov Decision Process

RL Reinforcement Learning

RNN Recurrent Neural Network

TL Transfer Learning

TTS Text-to-speech

ZSL Zero-shot Learning


Chapter 1

Introduction

1.1 Introduction

Dialog systems are artificial intelligence (AI) agents that can communicate with humans via natural language to assist, inform and entertain. Dialog systems have been successfully applied to many tasks, such as customer service, flight booking and social chatting. Yet the traditional approach to developing dialog systems is limited by hand-crafted intermediate representations and struggles to expand rapidly to new and complex domains. This dissertation targets an ambitious goal: to develop general learning algorithms and domain-agnostic dialog models that can learn to converse in any domain and require only a minimal amount of new data for adapting the system to a new domain. Achieving this objective will not only give us more capable dialog systems, but will also advance our understanding of how to build AI systems that come closer to humans' remarkable learning ability. The above definition suggests two key desired properties: "converse in any domain" and "require little data to adapt to new domains". The first property implies the need for powerful statistical models that can simultaneously capture the complex natural language and decision-making in conversations from multiple domains. The second property requires that the target system be able to incorporate inductive bias, from either a human teacher or its experience of learning related domains, to improve its performance in the new domain.

End-to-end (E2E) generative dialog systems based on encoder-decoder deep neural networks [15, 92] are indeed among the strongest candidates to satisfy the first property. The basic idea is to use an encoder network to map a dialog context, e.g. the discourse history, into a learned distributed representation and then use a decoder network to generate the next system response. These models are not restricted to hand-crafted states and can in principle generalize to novel system responses that are not observed in training. However, despite an encoder-decoder's expressive modeling power, its vanilla architecture is not sufficient for building full-fledged dialog systems, for several reasons, such as its inability to interface with external knowledge, its tendency to generate dull responses, etc. Therefore, this thesis first focuses on evolving encoder-decoder models to deal with some of the key challenges of modeling dialog, by proposing a


set of novel architectures that gracefully solve these limitations.

After building an E2E dialog model that can proficiently handle all operations in a

single dialog domain, we then study the second property: learning new conversational skills with small data. In fact, the data scarcity problem is extremely common across dialog domains, owing to the difficulty of collecting human-computer conversations. To the best of our knowledge, current E2E generative dialog models are data-hungry and have only been successfully applied to domains with abundant training material. This limitation rules out using the E2E approach for rapid prototyping and day-0 deployment in domains with no existing data. As a result, developers would still choose traditional rule-based methods to prototype their initial systems and only consider utilizing E2E models at later stages of development. Therefore, the ability to adapt with little data becomes an even more essential requirement for using E2E systems in any real-world application.

In order to learn from little data, this dissertation proposes a novel learning with knowledge (LWK) framework, which strives to introduce inductive bias into an E2E model by incorporating knowledge, so that it can generalize to new domains with only a few training samples. The first type of knowledge LWK utilizes is a model's past experience of learning other tasks, which is closely related to transfer learning [63]. By allowing experience to be shared across domains, models only need to learn domain-specific behavior from the limited in-domain training data, and can reuse domain-invariant experience gained from other domains. A more extreme scenario is the zero-shot learning problem [44], where there is zero training data in the new domain and thus no data from which to learn domain-specific information. Learning from zero data is hard but not impossible. LWK tackles this challenge by leveraging the knowledge of human experts, who can describe the "task" of the new domain. Specifically, LWK proposes a domain specification with which human experts succinctly express domain meta information; the resulting specification is encoded by neural networks as part of the input to the model. The extra knowledge provided by the domain specification permits the model to relate the underlying decision-making processes of different domains, so that it can figure out the proper actions even in unseen domains. For example, an experienced human customer service agent can quickly adapt to talking about a new product given just the product's description, without needing to study historical dialogs. Our preliminary experiments show the effectiveness of our framework on a synthetic dataset, and we propose to advance this idea much further on real-world datasets. Furthermore, although out of the scope of this thesis, the methods developed here can also be applied to related conditional natural language generation tasks, e.g. image captioning or question answering.

1.2 Thesis Hypothesis

In short, this dissertation will focus on verifying the following main research hypothesis:

By developing specialized neural architectures for dialog systems and


incorporating knowledge via the LWK framework, a generative E2E dialog system can simultaneously achieve strong performance in multiple domains and can adapt to a new domain with small, even zero, in-domain conversational data for training.

We approach this ultimate goal in two stages: first, we develop novel encoder-decoder models to achieve good performance in a single domain with abundant data. The second stage then focuses on developing learning algorithms that allow the model from the first stage to generalize across domains with the help of knowledge. The evaluation will be carried out on both synthetic and real-world datasets, and through live interactions on our DialPort platform [115].

1.3 Proposal Outline

The rest of the proposal is organized as follows:

• Chapter 2: Related Work
This chapter gives an overview of related research areas, including both work on dialog systems and related work in machine learning.

• Chapter 3: Generative E2E Dialog Models in a Single Domain
This chapter first identifies and analyzes the special challenges in building E2E dialog systems and presents a set of novel E2E models to tackle these challenges.

• Chapter 4: Learning with Knowledge to Converse in New Domains
This chapter formally formulates the problem and describes the LWK framework. It also proposes two novel corpora for training and evaluating the proposed framework. Experimental results from a pilot study are also presented.

• Chapter 5: Proposed Work and Timeline
This chapter presents the proposed work and lays out the timeline for this proposal.


Chapter 2

Related Work

2.1 Dialog Systems

Research in dialog systems can be roughly divided into two categories in terms of applications: task-oriented and chat-oriented systems.

2.1.1 Task-oriented Dialog Systems

Task-oriented systems are designed to achieve certain goals via conversing with human users, e.g. flight booking or restaurant recommendation. They are sometimes also referred to as slot-filling systems. The classical way to build a task-oriented system is the pipeline approach, which divides the whole system into several modules, as shown in Figure 2.1. The dialog system pipeline contains the following components: natural language understanding (NLU) maps user utterances to a semantic representation. This information is further processed by the dialog state tracker (DST), which accumulates the input of the current turn along with the dialog history. The DST outputs the current dialog state, and the dialog policy (DP) selects the next system action based on the dialog state. Then natural language generation (NLG) maps the selected action to its surface form. For many applications, an external knowledge base (KB), e.g. a bus schedule table or a Web API, is interfaced with the DST and DP to provide external information. If spoken language interaction is enabled, this pipeline also includes automatic speech recognition (ASR) and text-to-speech (TTS) for transforming the signals between audio and text.
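To make the dataflow concrete, one turn through the pipeline can be sketched as a single function. This is an illustrative Python skeleton only: every component (nlu, dst, policy, nlg, kb, asr, tts) is a hypothetical placeholder interface, not the implementation of any particular system.

```python
# Schematic of one turn through the task-oriented pipeline described above.
# All components are hypothetical callables; only the dataflow is the point.
def pipeline_turn(user_input, dialog_state, nlu, dst, policy, nlg, kb, asr=None, tts=None):
    text = asr(user_input) if asr else user_input   # optional speech front-end
    semantics = nlu(text)                           # NLU: utterance -> semantic frame
    dialog_state = dst(dialog_state, semantics)     # DST: fold this turn into the history
    kb_results = kb.query(dialog_state)             # KB: fetch external information
    action = policy(dialog_state, kb_results)       # DP: dialog state -> system action
    response = nlg(action)                          # NLG: action -> surface form
    return (tts(response) if tts else response), dialog_state
```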

One key limitation of the pipeline approach is that the intermediate interfaces among the different components are often hand-crafted, e.g. a set of dialog state variables. Although many advanced statistical methods have been developed to estimate these variables in a data-driven fashion [55, 98], the whole system is limited by the hand-crafted intermediate representations, so it struggles to generalize to new or more complex domains. Moreover, since each module is independently optimized, online adaptation of a modular system is tedious. For example, when one module (e.g. NLU) is retrained with new data, all the others (e.g. DM) that depend on it become sub-optimal due to


the fact that they were trained on the output distributions of the older version of that module.

Figure 2.1: Dialog System Pipeline for Task-oriented Dialog Systems

Therefore, end-to-end (E2E) dialog systems based on deep learning models were created to alleviate this problem by automatically learning intermediate features as high-dimensional distributed representations. Wen et al. [95] first introduced a fully differentiable network architecture that can be trained jointly on both oracle system responses and intermediate labels. After supervised training, the dialog policy of this model can be fine-tuned using reinforcement learning [80]. Furthermore, our work first utilized deep reinforcement learning to enable a task-oriented E2E dialog system to learn to interface with an external KB [114]. Later work has extended this idea using soft attention [21]. The above approaches still retain part of the intermediate representations (e.g. dialog acts) from the classical pipeline and combine a subset of the dialog pipeline into one E2E model. Meanwhile, there is a line of research that strives to completely remove intermediate representations and build models that directly map an observed dialog to the system response. Two approaches are usually considered: retrieval-based and generative systems. In retrieval-based systems, the dialog history is encoded by an encoder network and a list of system responses is also encoded into utterance embeddings; a matching function is then deployed to find the best-matching system response w.r.t. the given dialog context [9, 101]. Generative systems often build on the encoder-decoder framework [15, 92], which uses an encoder network to encode the dialog history and then uses a decoder network to generate the system response token by token [9, 25, 108]. Our work was the first to transform a real-world system (i.e. CMU Let's Go) into an E2E generative model [116] and test it with users through a spoken interface. This branch of research has focused on investigating various neural network architectures in order to better generate the next system response. Comparing the two approaches, retrieval-based methods can output more fluent and controllable responses but are limited by the manually created response pool. By contrast, generative systems can produce novel utterances that did not appear in training and have the potential to scale up to more complex situations, but they are much more challenging to optimize.


2.1.2 Chat-oriented Dialog Systems

Chat-oriented dialog systems are designed to carry out open-domain conversations, and so are not restricted to a certain domain or a specific goal. Chat-oriented dialog systems have mainly been used for entertainment and social chat. Because of their unlimited scope, building a chat-oriented system is more difficult than building a task-oriented system in one domain. In terms of development techniques, chat-oriented systems also follow either retrieval-based methods or generative models, often sharing the same models as the ones described for task-oriented systems. However, chat-oriented systems face distinctly different challenges compared to task-oriented ones.

First of all, since it is infeasible to manually design dialog state variables as in the task-oriented case, modeling long-term context is an open-ended research question for open-domain chatting. The hierarchical encoder [74] has been proposed to exploit the hierarchical structure in dialog and has shown better results than encoding the entire discourse history word by word. More fine-grained encoders [37, 104] have also been proposed to better extract key elements (e.g. entities) from the discourse history. Furthermore, recent research has found that encoder-decoder models tend to generate generic and dull responses (e.g., I don't know) rather than meaningful and specific answers [45, 76]. To tackle this problem, one line of research has focused on augmenting the input of encoder-decoder models with richer context information, in order to generate more specific responses. Li et al. [46] captured speakers' characteristics by encoding background information and speaking style into distributed embeddings, which are used to re-rank the responses generated by an encoder-decoder model. Xing et al. [103] maintain a topic encoding, based on Latent Dirichlet Allocation (LDA) [5] of the conversation, to encourage the model to output more topic-coherent responses. A second category of solutions improves the decoding algorithm to encourage more diverse responses, including decoding with beam search and its variations [102], encouraging responses that have long-term payoff [47], or adding a mutual-information loss in addition to the standard maximum likelihood estimation [45]. The last category of solutions introduces a latent random variable to model a distribution over responses given a dialog context [11, 76, 117]. Our work [117] also falls into this category and is one of the early works to explore the information learned in the latent representation. The advantage of having a latent variable is that at test time the system can generate various responses by sampling from the latent variable.

Finally, although task-oriented and chat-oriented systems have usually been explored independently, there is pioneering work on combining the two types of systems. Our work found that interleaving chat with task-oriented dialog can improve the robustness of the system against misunderstanding errors, and improve user satisfaction by keeping users engaged with the system [116]. Yu et al. [110] used reinforcement learning to learn the interleaving strategy. Other work has used a social chat reasoner to improve the rapport between the computer and the human in order to develop an amicable long-term relationship [112].


2.2 Deep Reinforcement Learning

Reinforcement learning (RL) is perhaps the most popular method for learning a dialog policy in dialog systems [30, 94, 100]. RL models are based on the Markov Decision Process (MDP). An MDP is a tuple $(S, A, P, \gamma, R)$, where $S$ is a set of states; $A$ is a set of actions; $P$ defines the transition probability $P(s'|s,a)$; $R$ defines the expected immediate reward $R(s,a)$; and $\gamma \in [0,1)$ is the discounting factor. The goal of reinforcement learning is to find the optimal policy $\pi^*$ such that the expected cumulative return is maximized [84]. MDPs assume full observability of the internal states of the world, which is rarely true for real-world applications. The Partially Observable Markov Decision Process (POMDP) takes the uncertainty in the state variable into account. A POMDP is defined by a tuple $(S, A, P, \gamma, R, O, Z)$, where $O$ is a set of observations and $Z$ defines an observation probability $P(o|s,a)$; the other variables are the same as in MDPs. Solving a POMDP usually requires computing the belief state $b(s)$, the probability distribution over all possible states, such that $\sum_s b(s) = 1$. It has been shown that the belief state is sufficient for optimal control [58], so the objective is to find $\pi^*: b \to a$ that maximizes the expected future return.
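For reference, the standard belief-state recursion from the POMDP literature (not written out in the text), in the notation above, is

$$b'(s') = \eta \, P(o \mid s', a) \sum_{s \in S} P(s' \mid s, a) \, b(s)$$

where $\eta$ is a normalizing constant ensuring $\sum_{s'} b'(s') = 1$.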

The deep Q-network (DQN) introduced by Mnih et al. [57] uses a deep neural network (DNN) to parametrize the Q-value function $Q(s,a;\theta)$ and achieves human-level performance in playing many Atari games. DQN keeps two separate models: a target network $\theta_i^-$ and a behavior network $\theta_i$. For every $K$ new samples, DQN uses $\theta_i^-$ to compute the target values $y^{DQN}$ and updates the parameters of $\theta_i$. Only after every $C$ updates are the new weights of $\theta_i$ copied over to $\theta_i^-$. Furthermore, DQN utilizes experience replay to store all previous experience tuples $(s, a, r, s')$. Before each model update, the algorithm samples a mini-batch of experiences of size $M$ from the memory and computes the gradient of the following loss function:

$$L(\theta_i) = \mathbb{E}_{(s,a,r,s')}\left[\left(y^{DQN} - Q(s,a;\theta_i)\right)^2\right] \qquad (2.1)$$

$$y^{DQN} = r + \gamma \max_{a'} Q(s',a';\theta_i^-) \qquad (2.2)$$

Recently, van Hasselt et al. [91] addressed the overestimation problem of standard Q-learning by introducing double DQN, and Schaul et al. [72] improved the convergence speed of DQN via prioritized experience replay. We found both modifications useful and included them in our studies.
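As a concrete reference, Eq. (2.1)-(2.2) can be written down in a few lines of PyTorch. This is a minimal sketch under assumed shapes and hyperparameters (the QNetwork layout, replay capacity, batch size, and the terminal-state mask are illustrative additions), not the configuration used in [57] or in this thesis.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Illustrative Q-network: maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def dqn_loss(behavior, target, batch, gamma=0.99):
    """Eq. (2.1)-(2.2): squared TD error against the frozen target network."""
    s, a, r, s_next, done = batch
    q = behavior(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta_i)
    with torch.no_grad():                                  # theta_i^- is not updated here
        y = r + gamma * target(s_next).max(dim=1).values * (1.0 - done)
    return nn.functional.mse_loss(q, y)

replay = deque(maxlen=100_000)  # experience replay of (s, a, r, s', done) tuples

def sample_batch(m=32):
    s, a, r, s2, d = zip(*random.sample(replay, m))
    return (torch.stack(s), torch.tensor(a), torch.tensor(r),
            torch.stack(s2), torch.tensor(d))
```

Every $C$ gradient steps the behavior weights would be copied into the target network, e.g. via target.load_state_dict(behavior.state_dict()).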

An extension of DQN is the Deep Recurrent Q-Network (DRQN), which introduces a Long Short-Term Memory (LSTM) layer [38] on top of the convolutional layer of the original DQN model [36], allowing DRQN to solve POMDPs. The recurrent neural network can thus be viewed as an approximation of the belief state that aggregates information from a sequence of observations. Hausknecht [36] showed that DRQN performs significantly better than DQN when an agent only observes partial states. A similar model was proposed by Narasimhan and Kulkarni [59], which learns to play Multi-User Dungeon (MUD) games [19] with game states hidden in natural language paragraphs. Our work [114] was the first to apply DRQN to E2E dialog systems and learn, from delayed rewards, a dialog policy that can interact with both users and an external KB.


2.3 Deep Generative Models for Natural Language

Generative modeling is an area of machine learning research that deals with models of the data distribution $P(X)$, where $X$ are data points that can be high-dimensional and structured. Generative models have been extensively studied in many fields, including computer vision and natural language processing, and remain one of the most exciting areas of research. This section focuses on backpropagation-based generative models for natural language using neural networks, which sit at the core of this dissertation. Besides introducing generic generative models, we are especially interested in conditional generative models $P(X|C)$, where $C$ is an arbitrary variable in a high-dimensional space that can influence the distribution of $X$.

2.3.1 Encoder-Decoder Models

The most common yet very powerful conditional generative model for natural language is the encoder-decoder model [15, 92]. In its standard form, an encoder-decoder models the conditional distribution $P(X|C)$ of target word tokens $X$ given a word sequence $C$, and is also known as the sequence-to-sequence model. The basic idea is to use an encoder recurrent neural network (RNN) to encode the context sentence $C$ into a distributed representation and then use a decoder RNN to predict the words in the target sentence $X$. Let $w^c_i$ and $w^x_j$ denote the $i$th word in the context sentence and the $j$th word in the target sentence respectively, and let $\text{RNN}_e$ and $\text{RNN}_d$ denote the encoder and decoder RNNs. The source sentence is then encoded by recursively applying:

$$h^e_0 = 0 \qquad (2.3)$$

$$h^e_i = \text{RNN}_e(w^c_i, h^e_{i-1}) \qquad (2.4)$$

The last hidden state of the encoder RNN, $h^e_{|c|}$, is then treated as the representation of $c$, which in theory is able to encode all of the information in the context sentence. The initial state of $\text{RNN}_d$ is initialized to $h^e_{|c|}$, and the words in the target sentence $x$ are predicted sequentially via:

$$o_j = \text{softmax}(W h^d_j + b) \qquad (2.5)$$

$$h^d_j = \text{RNN}_d(w^x_j, h^d_{j-1}) \qquad (2.6)$$

where $o_j$ is the decoder RNN's output probability over every word in the vocabulary at time step $j$. In order for the model to predict the first word in the target sentence and to emit a terminal symbol indicating the end of generation, special symbols BOS and EOS are usually padded at the beginning and end of the target sentence. Moreover, it is important to note that the encoder and decoder networks are not limited to RNNs or word sequences. Within the scope of $X$ being word sequences, past research has investigated a variety of encoders, including convolutional neural networks (CNN) to encode visual data [92], tree encoders to encode syntactic trees [26], and hierarchical RNNs to encode conversations [74]. Although the standard encoder-decoder models are


very simple, they have achieved impressive results in a wide range of natural language processing (NLP) tasks, including machine translation [92] and image captioning [93].
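The sketch below implements Eq. (2.3)-(2.6) in PyTorch. The GRU cells, dimensions, and the teacher-forcing batch layout are illustrative assumptions, not the exact configuration used in the thesis; BOS/EOS padding of the target is assumed to be done upstream.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder over word indices, following Eq. (2.3)-(2.6)."""
    def __init__(self, vocab_size, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)  # RNN_e
        self.decoder = nn.GRU(emb_dim, hidden, batch_first=True)  # RNN_d
        self.out = nn.Linear(hidden, vocab_size)                  # W, b in Eq. (2.5)

    def forward(self, context, target):
        # Encode c; the final hidden state h^e_{|c|} summarizes the context.
        _, h_last = self.encoder(self.embed(context))
        # Initialize RNN_d with h^e_{|c|}; predict each target word from the previous one.
        dec_states, _ = self.decoder(self.embed(target[:, :-1]), h_last)
        return self.out(dec_states)  # logits o_j over the vocabulary

model = Seq2Seq(vocab_size=10_000)
ctx = torch.randint(0, 10_000, (2, 12))  # batch of context token ids
tgt = torch.randint(0, 10_000, (2, 8))   # batch of target ids: BOS w_1 ... EOS
logits = model(ctx, tgt)                 # shape (2, 7, 10_000)
```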

Memory-Augmented Encoder-Decoder

Although standard encoder-decoder models are able to learn long-term dependencies in theory, they often struggle to deal with long-term information in practice. The attention mechanism [2, 52] is an important extension of encoder-decoder models that enables better modeling of long-term context. The general idea is, instead of asking the encoder RNN to summarize the context $C$ into a single fixed-size distributed representation, to allow it to create a dynamically sized distributed representation (usually a list of fixed-size vectors), and then to equip the decoder RNN with a reading mechanism that can retrieve a subset of the information from the dynamic source representation. Specifically, let the dynamic representation of the context sentence be $H^e = \{h^e_1, ..., h^e_{|c|}\}$. At each decoder step, the update function then becomes:

$$o_j = \text{softmax}(W[h^d_j, m^e_j] + b) \qquad (2.7)$$

$$m^e_j = \sum_i^{|c|} \alpha_{ij} h^e_i \qquad (2.8)$$

$$\alpha_{ij} = f(h^e_i, h^d_j) \qquad (2.9)$$

$$h^d_j = \text{RNN}_d(w^x_j, h^d_{j-1}) \qquad (2.10)$$

where $\alpha_{ij}$ is the scalar attention score computed via a matching function $f$, which can be a simple dot product, a bi-linear mapping, or a neural network [52].
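For illustration, here is a dot-product instantiation of Eq. (2.8)-(2.9) in PyTorch, with the customary softmax normalization of the scores (a detail the equations fold into $f$):

```python
import torch

def dot_product_attention(dec_state, enc_states):
    """dec_state: (batch, hidden) = h^d_j; enc_states: (batch, src_len, hidden) = H^e."""
    scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)  # f(h^e_i, h^d_j)
    alpha = torch.softmax(scores, dim=1)                               # alpha_{ij}
    m = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)           # m^e_j, Eq. (2.8)
    return m, alpha
```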

A recent extension of the attention mechanism is the copy mechanism [35, 54]. Similar to the attention mechanism, the copy mechanism also utilizes a pointer to dynamically read from a variable-length representation of the source sentence. However, rather than asking the decoder RNN to output the next word via its softmax layer, the copy mechanism directly copies and outputs the selected word according to the attention. The main advantage of the copy mechanism is its ability to handle rare words and OOVs in cases where other encoder-decoder models fail [54]. Past work has also found that the copy mechanism leads to better generalization performance when the task inherently has a copy nature, e.g. entity references [25, 118].
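A minimal sketch of a generate/copy mixture in the spirit of [35, 54]; the scalar gate p_gen and the exact mixing scheme differ between the cited models, so this is an illustrative variant rather than either paper's formulation:

```python
import torch

def copy_mixture(gen_logits, attn_weights, src_token_ids, p_gen):
    """Mix the decoder's softmax with a copy distribution over source tokens.

    gen_logits: (batch, vocab); attn_weights, src_token_ids: (batch, src_len);
    p_gen: (batch, 1), an assumed learned gate in (0, 1).
    """
    gen_dist = torch.softmax(gen_logits, dim=1)
    copy_dist = torch.zeros_like(gen_dist)
    copy_dist.scatter_add_(1, src_token_ids, attn_weights)  # route attention mass to source words
    return p_gen * gen_dist + (1.0 - p_gen) * copy_dist
```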

2.3.2 Variational Latent Variable Models

The generation of real-world data usually involves a hierarchical process. For example, given a dialog context, the speaker may first decide the high-level action of the response, e.g. ask a question or give a suggestion, and then in a second stage generate the actual response in natural language, taking into account low-level factors. Such high-level decisions are often unobserved in data and are referred to as latent variables. The objective to


maximize for unconditional generation is the marginal probability of the data $X$:

$$P(X) = \int P(X|z;\theta) P(z) \, dz \qquad (2.11)$$

One of the most successful frameworks for modeling such phenomena is the variational autoencoder (VAE) [43, 68]. The idea of the VAE is to encode the input $x$ into a probability distribution over $z$, instead of the point encoding used in an autoencoder. The VAE then applies a decoder network to reconstruct the original input using samples from $z$. To generate images, a VAE first obtains a sample of $z$ from the prior distribution, e.g. $\mathcal{N}(0, I)$, and then produces an image via the decoder network. To deal with the integral over the high-dimensional latent variable $z$, the VAE utilizes Stochastic Gradient Variational Bayes (SGVB): instead of directly optimizing the marginal log likelihood of the data, it optimizes the evidence lower bound (ELBO), a lower bound on the actual data log likelihood. The ELBO is usually expressed as:

$$\log P(X) \geq \mathbb{E}_{z \sim q(z|x)}[\log P(X|z)] - \text{KL}(q(z|x) \,\|\, P(z)) \qquad (2.12)$$

To do conditional generation, the VAE has been extended to the conditional variational autoencoder (CVAE) [77, 107]. The CVAE introduces a new random variable $C$ that is given at the generation stage. The goal is then to maximize a lower bound on the conditional probability, which can be expressed as:

$$\log P(X|C) \geq \mathbb{E}_{z \sim q(z|X,C)}[\log P(X|z,C)] - \text{KL}(q(z|X,C) \,\|\, P(z|C)) \qquad (2.13)$$

Both VAE and CVAE were first introduced for modeling images and were later extended to modeling natural language. Although VAE/CVAE has achieved impressive results in image generation, adapting it to natural language generation is non-trivial. Bowman et al. [10] used a VAE with Long Short-Term Memory (LSTM)-based recognition and decoder networks to generate sentences from a latent Gaussian variable. They showed that their model is able to generate diverse sentences even with a greedy LSTM decoder. They also reported difficulty in training because the LSTM decoder tends to ignore the latent variable; we refer to this issue as the vanishing latent variable problem. Meanwhile, the CVAE has been adapted for dialog response generation [76, 117]. Our work [117] improves the training methods for the CVAE to alleviate the vanishing latent variable problem and also incorporates human-annotated data to guide the learning of the high-level latent variable.
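As a concrete reference, the CVAE objective of Eq. (2.13) with diagonal-Gaussian recognition and prior networks (the common parameterization, assumed here) reduces to a closed-form KL term plus a reconstruction term:

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between two diagonal Gaussians."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0,
        dim=1)

def reparameterize(mu, logvar):
    """SGVB reparameterization: z = mu + sigma * eps, eps ~ N(0, I)."""
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def cvae_loss(recon_log_prob, mu_q, logvar_q, mu_p, logvar_p):
    """Negative ELBO of Eq. (2.13), averaged over the batch.

    recon_log_prob: log P(X|z, C) from the decoder, shape (batch,)
    (mu_q, logvar_q): recognition network q(z|X, C)
    (mu_p, logvar_p): prior network P(z|C)
    """
    return (gaussian_kl(mu_q, logvar_q, mu_p, logvar_p) - recon_log_prob).mean()
```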

2.4 Learning from Multiple Sources of Data

Learning from multiple sources of data is a well-known approach to improving the generalization ability of machine learning models, especially in domains with limited data. Biologically, humans and animals continuously learn new tasks by leveraging the knowledge from related tasks they have acquired before. From a machine learning perspective, learning multiple tasks together can be viewed as a method of including inductive bias [56], which requires the learned decision hypothesis to be general enough for


all the tasks that are jointly learned. Today the research community has developed several areas of study rooted in this notion, including meta learning [87], transfer learning [63], domain adaptation [20], and zero-shot learning [44]. This proposal follows the notation introduced in the transfer learning community [63], which is most relevant to the proposed work. In this notation, a domain is defined as $D = \{X, P(X)\}$, the input space and its marginal distribution, whereas a task is defined as $T = \{Y, P(Y|X)\}$, an output space and its conditional distribution. The core of transfer learning is leveraging existing data in related tasks or domains (referred to as the source) to improve a machine learning model's performance on a target, which can have either a different domain or a different task compared to the sources. Given source/target domains $D_S$/$D_T$ and source/target tasks $T_S$/$T_T$, there are four possible transfer learning scenarios:

1. $X_S \neq X_T$: the input spaces of the source and target domains differ, e.g. text vs. image, or English vs. Chinese.

2. $P(X_S) \neq P(X_T)$: the marginal distributions of the input space differ, e.g. movie reviews vs. restaurant reviews.

3. $Y_S \neq Y_T$: the label spaces of the source and the target differ, e.g. sentiment vs. dialog act classification.

4. $P(Y_S|X_S) \neq P(Y_T|X_T)$: the conditional probabilities differ, e.g. movie review scores in 2010 vs. 2016.

Given the above formulation, two topics are particularly relevant to our proposed work: learning domain-invariant representations and zero-shot learning.

2.4.1 Learning Domain-Invariant Representation

Learning a domain-invariant representation is a powerful method for developing models that can operate in different domains (scenarios 1 & 2). Although the surface-level inputs $X$ or $P(X)$ might be quite different in the source and the target, there may exist a shared representation $h(X)$ that can be extracted from both domains. Since $h$ is independent of the surface inputs, the conditional model trained in the source domain, $P(Y|h)$, can be used for the target. Many methods have been developed to achieve this goal. One popular method is multi-task learning (MTL) [27]. MTL trains models jointly on several tasks to improve their generalization ability, drawing domain-specific information from the training feedback of related tasks [12]. Deep learning-based MTL can generally be divided into two categories: hard parameter sharing and soft parameter sharing [71]. The hard parameter sharing approach applies the same neural network layers across all tasks to learn a shared representation, which can reduce the chance of overfitting. For soft parameter sharing, each task has its own model, but the weights are tied by some regularization metric, e.g. L2 distance [24]. MTL has been applied to many NLP tasks. Collobert et al. [18] used a hard-shared neural network to learn models that simultaneously do part-of-speech tagging, chunking, named entity recognition, and semantic role labeling, resulting in better performance than training individual models for each task. A more recent


work [51] applied MTL to encoder-decoder models, jointly learning machine translation with image captioning and syntactic parsing, which led to performance gains. MTL has also improved the NLU module of a dialog system [49] by jointly modeling slot filling and intent prediction.
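A minimal sketch of the hard parameter sharing scheme described above; the task names and dimensions are hypothetical:

```python
import torch.nn as nn

class HardSharedMTL(nn.Module):
    """One shared encoder; one lightweight output head per task."""
    def __init__(self, input_dim, hidden, task_dims):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU())  # shared layers
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, d) for t, d in task_dims.items()})

    def forward(self, x, task):
        return self.heads[task](self.shared(x))  # same features, task-specific prediction

model = HardSharedMTL(input_dim=300, hidden=128,
                      task_dims={"slot_filling": 20, "intent": 8})
```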

2.4.2 Zero-shot Learning

Zero-shot learning (ZSL) refers to an extreme situation where there is no training data available for the target domain, so the label space $Y$ is unseen in the source. Although difficult for machines, humans are indeed capable of ZSL. For example, after a person reads a detailed description of what a cat looks like, this person should be able to recognize an image of a cat even if he/she has never seen one before. Therefore, the major challenge of ZSL is to construct a shared representation $g$ of the output space $Y$, so that a model trained on the source, $P(g|X)$, can still be used to predict meaningful outputs that can be related to the new labels.

ZSL was first introduced in the computer vision community [44, 62], which has focused on recognizing unseen objects in images. The major approach is to parametrize the object $Y$ into semantic output attributes instead of directly predicting the object class index. As a result, at test time, the model first predicts the semantic attributes of the input image; the final prediction is then obtained by comparing the predicted attributes with a list of candidate objects. More recent work [70] improves this idea by jointly learning a bi-linear mapping that directly fuses the information from the semantic codes and the input image for prediction. Besides image recognition, recent work has explored the notion of task generalization in robotics, so that a robot can execute a new task that is not mentioned in training [23, 61]. In this case, a task is described by one demonstration or a sequence of instructions, and the system needs to learn to break the instructions down into previously learned skills. ZSL has also been applied to individual components in the dialog system pipeline. Chen et al. [14] developed an intent classifier that can predict new intent labels that are not included in the training data. Bapna et al. [3] extended the idea to the slot-filling module to track novel slot types. Both papers leverage a natural language description of the label (intent or slot type) in order to learn a semantic embedding of the label space; given any new labels, the model can then still make predictions. Moreover, there has been extensive work on learning domain-adaptable dialog policies by first training a dialog policy on K previous domains and then testing the policy on the (K+1)-th new domain. Gasic et al. [31] used Gaussian Processes with cross-domain kernel functions; the resulting policy can leverage the experience from other domains to make educated decisions in a new one. Finally, ZSL has also been applied to NLG. Wen et al. [96] used delexicalized data to synthetically generate NLG training data for a new domain.
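The label-description idea used in [14] and [3] can be sketched as follows; the architecture and the dot-product compatibility function are illustrative assumptions, not the cited models:

```python
import torch
import torch.nn as nn

class ZeroShotScorer(nn.Module):
    """Score inputs against embeddings of natural language label descriptions.

    Because labels enter only through their description embeddings, unseen
    labels can be scored at test time by embedding their descriptions.
    """
    def __init__(self, input_dim, desc_dim, joint_dim=128):
        super().__init__()
        self.f_x = nn.Linear(input_dim, joint_dim)  # input encoder
        self.f_y = nn.Linear(desc_dim, joint_dim)   # label-description encoder (g)

    def forward(self, x, label_descs):
        # (batch, joint) @ (joint, n_labels) -> compatibility scores per label
        return self.f_x(x) @ self.f_y(label_descs).t()
```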

In summary, past ZSL research for dialog has mostly focused on adapting individualmodules of a pipeline-based dialog system. We consider our proposal to be the first stepin exploring the notion of adapting an entire E2E dialog system to new domains for taskgeneralization.


Chapter 3

Generative E2E Dialog Models in a Single Domain

3.1 Introduction

Teaching machines to converse like humans for real-world purposes is arguably one of the hardest challenges in AI. To carry out a smooth and meaningful conversation with a human, a dialog system needs to be competent in understanding natural language, making intelligent decisions and generating appropriate responses. Moreover, the system also needs to be able to operate under uncertainty, i.e. noise introduced by speech recognition errors or the inherent ambiguity of natural language. Finally, unlike many NLP tasks, a dialog system is interactive, so it has to reason and plan over multiple turns with human users. One popular philosophy for building dialog systems is divide-and-conquer, i.e. solving a set of sub-problems of the main problem and grouping the sub-problem solutions into an integrated pipeline. The conventional dialog system falls into this category, and past research has focused on improving individual modules in the overall pipeline. This dissertation, however, is built upon a different hypothesis: "build an E2E domain-agnostic dialog system that is not tailored for specific applications, and later adapt this model to one or several specific applications by learning from domain-specific data". This hypothesis is backed up by the recent revolution in deep learning-based E2E models that achieve state-of-the-art performance in a wide range of tasks, including image recognition and speech recognition. One main reason for E2E models' success is their ability to learn representations: E2E models break the barriers between modules and learn features that maximize the performance of the "end" goal. E2E systems also provide unprecedented flexibility to adapt to new domains, since a general-purpose model only needs to be re-trained on data from the new domain. These benefits have motivated us to develop domain-agnostic generative dialog models.

Generative E2E dialog models (GEDM) based on encoder-decoder models are strong candidates for our purpose because they naturally overcome some of the long-standing limitations of conventional technologies. For clarity of discussion, a GEDM can be


divided into three parts: recognition, decision-making and generation. Recognition refers to the process of mapping a raw dialog context input into a distributed dialog state representation; decision-making refers to the process of mapping the dialog state to a distributed system action (i.e. the initial hidden state of the decoder RNN); and generation is the process of generating the system response in words. Compared to the traditional dialog pipeline, a GEDM has advantages in all three processes:

(1) For recognition, a pipeline system depends on the NLU to parse user utterances into semantic frames (e.g. slot-values and dialog acts), which soon becomes intractable if we want complex system behavior, e.g. discourse obligations, self-disclosure, or anaphora resolution. Instead, a GEDM directly operates on the raw dialog context input and learns a distributed dialog state representation in a utilitarian manner, i.e. it learns the dialog state features needed to imitate the expected system behavior. (2) For decision-making, a traditional system relies on hand-crafted state variables and system acts. A sub-optimal design of these leads to sub-optimal system performance, and such designs are not flexible enough to adapt to new system behaviors or domains. Instead, a GEDM can learn the optimal configuration of dialog states and system acts in a data-driven fashion, with no assumptions about the task or domain they are used for. (3) For generation, a GEDM is advantageous because it is generative. Given a large enough output vocabulary, it has the potential to generate responses in any domain from the same model. This is crucial for developing general-purpose dialog systems that can seamlessly carry out both chat and task conversations in multiple domains. This property is unique compared to traditional NLG or retrieval-based systems, which can only output from a given pool of system responses.

Unfortunately, GEDMs have not yet become the silver bullet for dialog systems. As noted in Chapter 1, GEDMs are limited because: (1) the standard architecture cannot model some of the essential properties of dialog systems, and (2) training requires large amounts of data, while data scarcity is common in dialog applications. The second challenge will be addressed in Chapter 4. In this chapter, we identify three major limitations of standard encoder-decoder models:

1. Interface with Symbolic Knowledge Bases: task-oriented dialog systems not only interact with human users but also need to interface with symbolic knowledge bases to read/write external information and later convey it to human users. Unfortunately, a straightforward interface with symbolic knowledge involves non-differentiable operators and requires special care to remain E2E trainable.

2. Stochastic Dialog Policy: the reasoning process of deciding what to say next is highly stochastic at multiple levels, from the discourse level down to the word level, resulting in a multi-modal distribution over valid responses. In other words, given a similar dialog context, there are many equally good responses. On the other hand, training standard encoder-decoder models with maximum likelihood leads to dull and generic responses, such as "I am not sure", as reported in [45, 117].

3. Slot Expansion: for dialog systems that deal with a real-world knowledge base, the propositional content will evolve over time, e.g. new movies are added to the database. Vanilla encoder-decoders are trained on an offline corpus and do not learn to generalize to these new entities.

As we can see, solving the above challenges is required to make a GEDM capable of replacing traditional methods for building full-fledged dialog systems. The following are our solutions for introducing the desired abilities into GEDMs, so they can interface with both users and knowledge, deal with domain expansion and learn multi-modal stochastic outputs. The rest of the chapter is organized as follows: Section 3.2 defines the notation; Sections 3.3-3.5 describe the proposed solutions to the above-mentioned challenges; Section 3.6 aggregates the techniques and presents our final model. The results in Sections 3.3-3.5 are published in [114, 116, 117].

3.2 Formulations and Notations

We first formally describe the variables involved in a dialog dataset. Without loss of generality, a dialog dataset can be represented as a list of $(c, x)$ pairs, where $c$ can be arbitrary structured data that describes a dialog context, e.g. discourse history, speaker information, etc., and $x$ is a system response to context $c$. Further, a dialog context $c$ is a list of utterances $[(u_1, m_1), ..., (u_t, m_t), ..., (u_T, m_T)]$, where each $u_t$ is a natural language utterance expressed as a sequence of word tokens $[w^t_1, ..., w^t_i, ..., w^t_{|u_t|}]$, and $m_t$ contains meta features about $u_t$, including speaker identity, ASR confidence, etc. Meanwhile, a system response $x$ is also represented by a sequence of tokens $[w_1, ..., w_j, ..., w_{|x|}]$. Finally, we use $C$ and $X$ to denote the random variables corresponding to the context and the system response. Figure 3.1 shows an example dialog expressed in the above format.

Figure 3.1: Dataset creation from an example dialog.

Given the above notation, in the supervised learning setting the goal is to create a probabilistic model parameterized by $\theta$ and find the response with maximum conditional probability, $x^* = \arg\max_x P(x|c; \theta)$. In the value-based reinforcement learning setting, the goal is to create a Q-value function parameterized by $\theta$ and find the response that maximizes the discounted cumulative return, $x^* = \arg\max_x Q_\theta(x, c)$.

Finally, for a standard encoder-decoder-based GEDM, we denote the encoder network as $h = f_e(c)$, which outputs a matrix $h$ that is a "summary" of the dialog context. The decision-making function is denoted by $z = \pi(h)$, which decides the initial state of the decoder. Lastly, $x = f_d(z)$ is the decoder that generates the system response in natural language.
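
To make this division concrete, the following sketch expresses the three functions as one neural model. It is a minimal illustration under assumed design choices (GRU cells, a single linear policy layer, illustrative sizes), not the exact architecture developed later in this chapter:

import torch
import torch.nn as nn

class GEDM(nn.Module):
    # A minimal sketch of the recognition / decision-making / generation split.
    def __init__(self, vocab_size, embed_size=200, hidden_size=400):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Recognition: summarize the dialog context, h = f_e(c)
        self.encoder = nn.GRU(embed_size, hidden_size, batch_first=True)
        # Decision-making: map the summary to a distributed action, z = pi(h)
        self.policy = nn.Linear(hidden_size, hidden_size)
        # Generation: decode the response word by word, x = f_d(z)
        self.decoder = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, context_tokens, response_tokens):
        _, h = self.encoder(self.embedding(context_tokens))  # h = f_e(c)
        z = torch.tanh(self.policy(h))                       # z = pi(h)
        out, _ = self.decoder(self.embedding(response_tokens), z)
        return self.output(out)  # next-word logits, trained with cross entropy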


3.3 Learning to Interface with Symbolic Knowledge Base

3.3.1 KB As An Environment

The goal is to predict $X$ given $C$. We can divide $X$ into two categories: KB-dependent and KB-independent system outputs. For the KB-independent outputs, $C$ only needs to contain the conversation history. For the KB-dependent system outputs, however, the content cannot be inferred from the raw conversation history. For example, in Figure 3.1 the system cannot generate "Paris 66 is a good choice" given just the conversation history, because such information can only be obtained from an external KB. Moreover, the system cannot simply memorize the answer for this given dialog context, because the content of the KB can change over time and there can be multiple matches. Therefore, a task-oriented dialog system that involves information access has to be able to dynamically query an external KB. What happens under the hood is illustrated in Figure 3.2(1). The system first generates a query $q$ for the KB and receives the KB result $u_{kb}$. Next, the model generates the system response based on $u_{kb}$. Now it becomes evident that if the loss is only computed at the system's verbal output, it is not feasible to compute the gradient and back-propagate through the KB unless we assume the KB is differentiable (e.g. a memory block with attention [21]). Such an assumption is not preferable here because a dialog system frequently needs to interface with various types of KBs, e.g. web APIs or relational databases, which are by no means differentiable.

Figure 3.2: The challenge of interfacing with KB and our solution.

To solve this problem, we propose the following: instead of training a uni-task model that only generates natural language responses for the human user, we train a multi-task model that is able to output both verbal responses $x_v$ for the human and KB responses $x_{kb}$ for the KB. Then, as shown in Figure 3.2(2), we can break one $(c, x)$ data point into two data points if $x$ happens to be KB-dependent. The two data points are $(c, x_{kb})$ and $([c, (u_{kb}, m_{kb})], x)$, and the loss function can be computed at both places and used to update the model parameters.
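
A minimal sketch of this data-point expansion is shown below; the field names x_kb, u_kb and m_kb are hypothetical handles for the stored KB query, the KB result and its meta features:

def expand_kb_example(c, x, x_kb=None, u_kb=None, m_kb=None):
    # c: dialog context as a list of (utterance, meta) pairs; x: system response.
    if x_kb is None:
        return [(c, x)]               # KB-independent: keep the pair as-is
    return [
        (c, x_kb),                    # (c, x_kb): learn to emit the KB query
        (c + [(u_kb, m_kb)], x),      # ([c, (u_kb, m_kb)], x): verbalize the result
    ]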

Although the above formulation bypasses the problem of back-propagating gradients through non-differentiable KB operations, two new issues need to be addressed to make it actually useful:

1. Correctly generating structured KB queries with arbitrary formats can be hard.

2. Sometimes the database query is latent and not stored in the dialog data.


Figure 3.3: A reinforcement learning interpretation of the proposed method to interfacewith KB.

Benefiting from the flexibility of encoder-decoder models with RNN decoders, the first challenge is less of a problem because the decoder can in principle generate arbitrary sequential outputs, including KB queries, e.g. the SQL-style query SELECT * LOC=PITTSBURGH TYPE=FRENCH. It only becomes an issue when the system is expected to generate a novel query that contains entities not covered in the training data, which will be investigated in Section 3.5. The second problem is the bigger challenge, because a dialog corpus only contains the verbal conversation history. The following study demonstrates our solution by having an E2E dialog agent play a conversational game with users using deep RL. The agent is able to learn to query the KB and hold conversations with users without the need for ground-truth labels for the KB interaction. The results are published in our past work [114].

3.3.2 Learning from Delayed Rewards

In fact, the problem of latent queries is an instance of a more general challenge in AI: learning from delayed rewards [85]. The agent may not receive immediate feedback signals when interacting with the KB, and can only decide whether or not its previous KB actions were correct based on the user's reaction several steps later. RL is one of the classical methods for solving such problems. From an RL perspective, our approach is equivalent to treating the KB as a part of the environment together with the users. The action space of the system is then the union of verbal actions and KB queries. As shown in Figure 3.3, both the KB and the users can respond to the system via new observations and reward signals. This formulation suggests that the agent should be able to learn the optimal policy even when it receives no reward for any of its KB actions, as long as it can receive rewards elsewhere from verbal actions. We will therefore denote our solution as KB as an Environment (KaaE).

To test our hypothesis, we built a dialog system that can play 20 questions (20Q) withhuman users. The game rules are as follows: at the beginning of each game, the userthinks of a famous person. Then the agent asks the user a series of Yes/No questions.


The user honestly answers, using one of three intents: YES, NO or UNKNOWN. Theuser can answer with any natural utterance representing one of the three intents. Theagent can make guesses at any turn, but a wrong guess results in a negative reward.The goal is to guess the correct person within a maximum number of turns with theleast number of wrong guesses. An example game conversation is as follows:

Sys: Is this person male?
User: Yes I think so.
Sys: Is this person an artist?
User: He is not an artist.
...
Sys: I guess this person is Bill Gates.
User: Correct.

We can formulate the game as a slot-filling dialog. Assume the system has $|Q|$ available questions to select from at each turn. The answer to each question becomes a slot, and each slot has three possible values: yes/no/unknown. Due to the length limit and the wrong-guess penalty, the optimal policy does not allow the agent to ask all of the questions regardless of the context, or to guess every person in the database one by one. Specifically, we learn an optimal policy that either generates a verbal response or modifies the current estimated dialog state based on the new observations. This formulation makes it possible to obtain a state tracker even without the labelled data required for DST, as long as rewards from the users and the database are available. Furthermore, in cases where dialog state tracking labels are available, the proposed model can incorporate them with minimal modification and greatly accelerate its learning. Thus, the following sections describe two models, RL and Hybrid-RL, corresponding to two labelling scenarios: 1) only dialog success labels and 2) dialog success labels together with state tracking labels.

3.3.3 Model Architecture

We consider a task-oriented dialog task in which there are $S$ slots, each with cardinality $C_i$, $i \in [0, S)$. The environment consists of a user $E_u$ and a database $E_{kb}$. The agent can send verbal actions $a^v \in A_v$ to the user, and the user will reply with natural language responses $o^u$ and rewards $r^u$. In order to interface with the database environment $E_{kb}$, the agent can apply special actions $a^h \in A_h$ that modify a query hypothesis $h$. The hypothesis is a slot-filling form that represents the most likely slot values given the observed evidence. Given this hypothesis $h$, the database can perform a normal query and give the results as observations $o^{kb}$ and rewards $r^{kb}$.

At each turn $t$, the agent applies its selected action $a_t \in \{A_v, A_h\}$ and receives observations from either the user or the database. We can then define the observation $o_t$ of turn $t$ as:

$$o_t = \begin{bmatrix} a_t \\ o^u_t \\ o^{kb}_t \end{bmatrix} \quad (3.1)$$


Figure 3.4: The network takes the observation $o_t$ at turn $t$. The recurrent unit updates its hidden state based on both the history and the current turn embedding. Then the model outputs the Q-values for all actions. The policy network in grey is masked by the action mask.

We then use an LSTM network as the dialog state tracker, which is capable of aggregating information over turns and generating a dialog state representation, $b_t = \mathrm{LSTM}(o_t, b_{t-1})$, where $b_t$ is an approximation of the belief state at turn $t$. Finally, the dialog state representation from the LSTM network is the input to $S+1$ policy networks implemented as Multilayer Perceptrons (MLPs). The first policy network approximates the Q-value function for all verbal actions, $Q(b_t, a^v)$, while the rest estimate the Q-value function for each slot, $Q(b_t, a^h)$, as shown in Figure 3.4.
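
A sketch of this architecture is given below, assuming pre-computed turn embeddings as input; the hidden sizes follow the ones reported later in the training details (256 for the LSTM, 128 for each MLP head):

import torch
import torch.nn as nn

class DialogQNetwork(nn.Module):
    # LSTM state tracker feeding S+1 MLP Q-value heads (cf. Figure 3.4).
    def __init__(self, obs_size, n_verbal_actions, slot_cardinalities,
                 hidden_size=256):
        super().__init__()
        self.tracker = nn.LSTM(obs_size, hidden_size, batch_first=True)

        def q_head(n_out):  # one small MLP per action group
            return nn.Sequential(nn.Linear(hidden_size, 128), nn.Tanh(),
                                 nn.Linear(128, n_out))

        self.verbal_q = q_head(n_verbal_actions)                    # Q(b_t, a^v)
        self.slot_q = nn.ModuleList([q_head(c) for c in slot_cardinalities])

    def forward(self, turn_embeddings):
        out, _ = self.tracker(turn_embeddings)  # b_t = LSTM(o_t, b_{t-1})
        b_t = out[:, -1]                        # approximate belief state
        return self.verbal_q(b_t), [q(b_t) for q in self.slot_q]   # Q(b_t, a^h)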

Incorporating State Tracking Labels

The pure RL approach described in the previous section can suffer from slow convergence when the cardinality of the slots is large. This is due to the nature of reinforcement learning: it has to try different actions (possible values of a slot) in order to estimate the expected long-term payoff. A supervised classifier, on the other hand, can learn much more efficiently. A typical multi-class classification loss function (e.g. categorical cross entropy) assumes that there is a single correct label, so that it encourages the probability of the correct label and suppresses the probabilities of all the wrong ones. Modeling dialog state tracking as a Q-value function has advantages over a local classifier. For instance, take the situation where a user wants to send an email and the state tracker needs to estimate the user's goal from among three possible values: send, edit and delete. In a classification task, all the incorrect labels (edit, delete) are treated as equally undesirable. However, the cost of mistakenly recognizing the user goal as delete is much larger than edit, which can only be learned from the future rewards. In order to train the slot-filling policy with both short-term and long-term supervision signals, we decompose the reward function for $A_h$ into two parts:

$$Q^\pi(b, a^h) = R(b, a^h) + \gamma \sum_{b'} P(b'|b, a^h) V^\pi(b') \quad (3.2)$$

$$R(b, a^h, b') = R(b, a^h) + P(a^h|b) \quad (3.3)$$


where $P(a^h|b)$ is the conditional probability that the correct label of the slot is $a^h$ given the current belief state. In practice, instead of training a separate model to estimate $P(a^h|b)$, we can replace $P(a^h|b)$ with $\mathbb{1}(y = a^h)$ as the sample reward $r$, where $y$ is the label. Furthermore, a key observation is that although it is expensive to collect data from the user $E_u$, one can easily sample trajectories of interaction with the database since $P(b'|b, a^h)$ is known. Therefore, we can accelerate learning by generating synthetic experiences, i.e. tuples $(b, a^h, r, b')$ for all $a^h \in A_h$, and adding them to the experience replay buffer. This approach is closely related to Dyna Q-Learning [83]. The difference is that Dyna Q-Learning uses the estimated environment dynamics to generate experiences, while our method only uses the known transition function (i.e. the dynamics of the database) to generate synthetic samples.
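
The sketch below illustrates this synthetic experience generation; db_step is a hypothetical helper that applies the known database transition, and the label y plays the role of the indicator reward:

def synthetic_experiences(belief, y, actions, db_step):
    # One synthetic tuple (b, a_h, r, b') per hypothesis action, to be
    # appended to the replay buffer; no real user interaction is needed
    # because P(b'|b, a_h) is known.
    experiences = []
    for a_h in actions:                        # e.g. ["yes", "no", "unknown"]
        next_belief = db_step(belief, a_h)     # known database dynamics
        r = 1.0 if a_h == y else 0.0           # 1(y = a_h) replaces P(a_h|b)
        experiences.append((belief, a_h, r, next_belief))
    return experiences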

Implementation Details

We can optimize the network architecture in several ways to improve its efficiency:

Shared State Tracking Policies: it is more efficient to tie the weights of the policy networks for similar slots and use the index of the slot as an input. This reduces the number of parameters that need to be learned and encourages shared structures.

Reward Shaping based on the Database: the reward signals from the users are usually sparse (arriving at the end of a dialog); the database, however, can provide frequent rewards to the agent. Reward shaping is a technique used to speed up learning. Ng et al. [60] showed that potential-based reward shaping does not alter the optimal solution; it only impacts the learning speed. The shaped reward with pseudo reward function $F(s, a, s')$ is defined as:

$$\tilde{R}(s, a, s') = R(s, a, s') + F(s, a, s') \quad (3.4)$$

$$F(s, a, s') = \gamma\phi(s') - \phi(s) \quad (3.5)$$

Let the total number of entities in the database be $D$, let $d_t$ be the number of entities consistent with the current hypothesis at turn $t$, and let $P_{max}$ be the max potential. The potential $\phi(s)$ is:

$$\phi(s_t) = P_{max}\left(1 - \frac{d_t}{D}\right) \quad \text{if } d_t > 0 \quad (3.6)$$

$$\phi(s_t) = 0 \quad \text{if } d_t = 0 \quad (3.7)$$

The intuition behind this potential function is to encourage the agent to narrow down the possible range of valid entities as quickly as possible. Meanwhile, if no entities are consistent with the current hypothesis, this implies that there were mistakes in previous slot filling, which gives a potential of 0.
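
Equations 3.4-3.7 translate directly into a few lines of code; in this sketch, d_t is the number of entities consistent with the current hypothesis:

def potential(d_t, D, p_max=2.0):
    # phi(s_t) from Equations 3.6-3.7: higher when fewer entities remain valid
    return p_max * (1.0 - d_t / D) if d_t > 0 else 0.0

def shaped_reward(r, d_t, d_next, D, gamma=0.99):
    # Equations 3.4-3.5: add F(s, a, s') = gamma * phi(s') - phi(s) to the reward
    return r + gamma * potential(d_next, D) - potential(d_t, D)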

3.3.4 Experiments

Simulator Construction

We constructed a simulator for 20Q. The simulator has two parts: a database of 100famous people and a user simulator.


We selected 100 people from Freebase [8]; each of them has 6 attributes: birthday, birthplace, degree, gender, profession and nationality. We manually designed several Yes/No questions for each attribute that are available to the agent. Each question covers a different set of possible values for a given attribute and thus carries a different discriminative power to pinpoint the person that the user is thinking of. As a result, the agent needs to judiciously select its questions, given the context of the game, in order to narrow down the range of valid people. There are 31 questions in total. Table 3.1 shows a summary.

Attribute     Qa   Example Question
Birthday       3   Was he/she born before 1950?
Birthplace     9   Was he/she born in USA?
Degree         4   Does he/she have a PhD?
Gender         2   Is this person male?
Profession     8   Is he/she an artist?
Nationality    5   Is he/she a citizen of an Asian country?

Table 3.1: Summary of the available questions. $Q_a$ is the number of questions for attribute $a$.

At the beginning of each game, the simulator first uniformly samples a person from the database as the person it is thinking of. There is also a 5% chance that the simulator will treat an attribute as unknown, in which case it will answer with the unknown intent for any question related to that attribute. After the game begins, when the agent asks a question, the simulator first determines the answer (yes, no or unknown) and replies using natural language. In order to generate realistic natural language with the yes/no/unknown intents, we collected utterances from the Switchboard Dialog Act (SWDA) Corpus [39]. We post-processed the results and removed irrelevant utterances, which led to 508, 445 and 251 unique utterances with the intents yes/no/unknown, respectively. We keep the frequency counts for each unique expression, so at run time the simulator can sample a response according to the original distribution in the SWDA Corpus.

A game is terminated when one of four conditions is fulfilled: 1) the agent guesses the correct answer, 2) no people in the database are consistent with the current hypothesis, 3) the max game length (100 steps) is reached, or 4) the max number of guesses (10) is reached. Only condition 1, the agent guessing the correct answer, is treated as a game victory. The win and loss rewards are 30 and −30, and a wrong guess incurs a −5 penalty.

Training Details

The user environment $E_u$ is the simulator, which only accepts verbal actions, either a Yes/No question or a guess, and replies with a natural language utterance. Therefore $A_v$ contains $|Q|+1$ actions, in which the first $|Q|$ actions are questions and the last action makes a guess, given the results from the database.


The database environment reads in a query hypothesis $h$ and returns a list of people that satisfy the constraints in the query. $h$ has size $|Q|$ and each dimension can take one of three values: yes/no/unknown. Since the cardinality of all slots is the same, we only need one slot-filling policy network with 3 Q-value outputs for yes/no/unknown, which modifies the value of the most recently asked question. Thus $A_h = \{\text{yes}, \text{no}, \text{unknown}\}$. For example, consider $|Q| = 3$ and the hypothesis $h$: [unknown, unknown, unknown]. If the most recently asked question is $Q_1$ (1-based), then applying the action $a^h = \text{yes}$ results in the new hypothesis: [yes, unknown, unknown].
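
A sketch of the database environment's core logic follows. Here the database is assumed to map each person to a list of ground-truth yes/no answers, one per question:

def apply_hypothesis_action(hypothesis, last_question_idx, a_h):
    # Modify the slot of the most recently asked question (a_h in A_h).
    new_hypothesis = list(hypothesis)
    new_hypothesis[last_question_idx] = a_h
    return new_hypothesis

def db_query(database, hypothesis):
    # Return every person consistent with all committed (non-unknown) slots.
    return [person for person, answers in database.items()
            if all(h == "unknown" or h == answers[i]
                   for i, h in enumerate(hypothesis))]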

To represent the observation $o_t$ in distributed form, we use a bag-of-bigrams feature vector to represent a user utterance, a one-hot vector to represent a system action, and a single discrete number to represent the number of people satisfying the current hypothesis. The hyper-parameters of the neural network model are as follows: the size of the turn embedding is 30; the size of the LSTMs is 256; each policy network has a hidden layer of size 128 with tanh activation. We also apply a dropout rate of 0.3 to both the LSTM and tanh layer outputs. The network has a total of 470,005 parameters. The network was trained with RMSProp [88]. For the DRQN hyper-parameters, the behavior network was updated every 4 steps and the interval $C$ between target network updates is 1000. $\epsilon$-greedy exploration is used for training, where $\epsilon$ is linearly decreased from 1 to 0.1. The reward shaping constant $P_{max}$ is 2 and the discounting factor $\gamma$ is 0.99. The resulting network was evaluated every 5000 steps and the model was trained for up to 120,000 steps. Each evaluation records the agent's performance with a greedy policy over 200 independent episodes.

3.3.5 Results and Discussion

Dialog Policy Analysis

We compare the performance of three models: a strong modular baseline, RL and Hybrid-RL. The baseline has an independently trained state tracker and dialog policy. The state tracker is also an LSTM-based classifier that takes in a dialog history and predicts the slot-value of the latest question. The dialog policy is a DRQN that assumes perfect slot-filling during training and simply controls the next verbal action. Thus the essential difference between the baseline and the proposed models is that the state tracker and dialog policy are not trained jointly. Also, since Hybrid-RL effectively changes the reward function, the typical average cumulative reward metric is not applicable for performance comparison. Therefore, we directly compare the win rate and average game length in the discussion below.

             Win Rate (%)   Avg Turn
Baseline         68.5          12.2
RL               85.6          21.6
Hybrid-RL        90.5          19.22

Table 3.2: Performance of the three systems.


Table 3.2 shows that both proposed models achieve a significantly higher win rate than the baseline by asking more questions before making guesses. Figure 3.5 illustrates the learning process of the three models. The horizontal axis is the total number of interactions between the agent and either the user or the database. The baseline model has the fastest learning speed, but its performance saturates quickly because the dialog policy was not trained together with the state tracker. The dialog policy is thus not aware of the uncertainty in slot-filling, and the slot-filler does not distinguish between the consequences of different wrong labels (e.g. classifying yes as no versus as unknown). On the other hand, although RL reaches high performance by the end of training, it struggles in the early stages and suffers from slow convergence. This is due to the fact that correct slot-filling is a prerequisite for winning 20Q, while the reward signal has a long delayed horizon in the RL approach. Finally, the Hybrid-RL approach converges to the optimal solution much faster than RL because it efficiently exploits the information in the state tracking labels.

Figure 3.5: Graphs showing the evolution of the win rate during training.

State Tracking Analysis

One of our hypotheses is that the RL approach can learn a good state tracker using only dialog success reward signals. We ran the best trained models with a greedy policy and collected 10,000 samples. Table 3.3 reports the precision and recall of slot filling in these trajectories. The results indicate that the RL model learns a completely different strategy compared to the baseline. The RL model aims for high precision, so it predicts unknown when the input is ambiguous, which is a safer option than predicting yes/no, because confusing yes with no may lead to a contradiction and a game failure. This is very different from the baseline, which does not distinguish between incorrect labels. Therefore, although the baseline achieves better classification metrics, it does not take the long-term payoff into account and performs sub-optimally in terms of overall performance.

             Unknown      Yes          No
Baseline     0.99/0.60    0.96/0.97    0.94/0.95
RL           0.21/0.77    1.00/0.93    0.95/0.51
Hybrid-RL    0.54/0.60    0.98/0.92    0.94/0.93

Table 3.3: State tracking performance of the three systems. The results are in the format of precision/recall.

Dialog State Representation Analysis

Tracking the state over multiple turns is crucial because the agent's optimal action depends on the history, e.g. the questions it has already asked and the number of guesses it has spent. Furthermore, one of our assumptions is that the output of the LSTM network is an approximation of the belief state in the POMDP. We conducted a study to test these hypotheses. We ran the Hybrid-RL models saved at 20K, 50K and 100K steps against the simulator with a greedy policy and recorded 10,000 samples for each model. The study checks whether we can reconstruct an important state feature, the number of guesses the agent has made, from the dialog state embedding. We divide the collected 10,000 samples into 80% for training and 20% for testing. We used the LSTM output as input features to a linear regression model with $l_2$ regularization. Table 3.4 shows that the coefficient of determination $r^2$ increases for models trained with more data.

Model    20K     50K     100K
$r^2$    0.05    0.51    0.77

Table 3.4: $r^2$ of the linear regression for predicting the number of guesses on the test dataset.

3.3.6 Conclusion

In conclusion, combining deep RL with KaaE can indeed make a dialog system learn to communicate with both the KB and users without explicit supervision. Furthermore, our state representation analysis was the first work to explicitly confirm that E2E dialog models with an LSTM encoder can learn a latent dialog state representation.


3.4 Modeling Stochastic Dialog Policy

3.4.1 The Dull Response Problem and Beyond

Having enabled a GEDM to interface with a KB, this section focuses on how to better interact with users. When modeling dialog response generation for complex domains (e.g. open-domain chatting), past research has found that encoder-decoder models tend to generate generic and dull responses (e.g., I don't know) rather than meaningful and specific answers [45, 76]. There have been many attempts to explain and solve this limitation, including enriching the encoder with a wider range of features or improving the decoding algorithms, e.g. variations of beam search (see Chapter 2 for details). Building upon the past work, our key proposed idea is to model dialog as a one-to-many problem at the discourse level. Several reasons motivate this. First, real-world conversational data are collected from many different speakers, and each speaker follows their own decision-making policy. Therefore the response distribution given a context is generated from a mixture of policies. Second, even for the same speaker, there are many other latent factors that decide the next response, e.g. his/her relationship with the listener. Therefore, given a similar dialog history (and other observed inputs), there may exist many valid responses (at the discourse level), each corresponding to a certain configuration of latent variables that are not present in the input. To uncover the potential responses, we model a probabilistic distribution over the distributed utterance embeddings of the potential responses using a latent variable (Figure 3.6). This allows us to generate diverse responses by drawing samples from the learned distribution and reconstructing their words via a decoder neural network.

Figure 3.6: Given A’s question, there exists many valid responses from B for differentassumptions of the latent variables, e.g., B’s hobby.

Next we present a novel neural dialog model adapted from conditional variational autoencoders (CVAE) [77, 107], which introduces a latent variable that can capture the discourse-level variations described above. We then propose Knowledge-Guided CVAE (kgCVAE), which enables easy integration of expert knowledge and results in improved performance and model interpretability. Last, we develop a training method that addresses the difficulty of optimizing CVAE for natural language generation [10]. We evaluate our models on human-human conversation data and obtain promising results in: (a) generating appropriate and discourse-level diverse responses, and (b) showing that the proposed training method is more effective than previous techniques. This study is published in [117].


3.4.2 Latent Variable Dialog Model

Figure 3.7: Graphical models of CVAE (a) and kgCVAE (b)

Conditional Variational Autoencoder (CVAE) for Dialog Generation

Each dyadic conversation can be represented via three random variables: the dialog context $c$ (context window size $k-1$), the response utterance $x$ (the $k$th utterance) and a latent variable $z$, which is used to capture the latent distribution over the valid responses. Further, $c$ is composed of the dialog history (the preceding $k-1$ utterances), the conversational floor (1 if the utterance is from the same speaker as $x$, otherwise 0) and meta features $m$ (e.g. the topic). We then define the conditional distribution $p(x, z|c) = p(x|z, c)p(z|c)$, and our goal is to use deep neural networks (parameterized by $\theta$) to approximate $p(z|c)$ and $p(x|z, c)$. We refer to $p_\theta(z|c)$ as the prior network and $p_\theta(x|z, c)$ as the response decoder. The generative process of $x$ is then (Figure 3.7 (a)):

1. Sample a latent variable z from the prior network pθ(z|c).

2. Generate x through the response decoder pθ(x|z, c).

Figure 3.8: The neural network architectures for the baseline and the proposed CVAE/kgCVAE models. $\oplus$ denotes the concatenation of the input vectors. The dashed blue connections only appear in kgCVAE.


CVAE is trained to maximize the conditional log likelihood of $x$ given $c$, which involves an intractable marginalization over the latent variable $z$. As proposed in [77, 107], CVAE can be efficiently trained within the Stochastic Gradient Variational Bayes (SGVB) framework [43] by maximizing the variational lower bound of the conditional log likelihood. We assume that $z$ follows a multivariate Gaussian distribution with a diagonal covariance matrix and introduce a recognition network $q_\phi(z|x, c)$ to approximate the true posterior distribution $p(z|x, c)$. Sohn et al. [77] have shown that the variational lower bound can be written as:

$$\mathcal{L}(\theta, \phi; x, c) = -KL(q_\phi(z|x, c) \| p_\theta(z|c)) + \mathbb{E}_{q_\phi(z|c,x)}[\log p_\theta(x|z, c)] \leq \log p(x|c) \quad (3.8)$$

Figure 3.8 gives an overview of our model. The utterance encoder is a bidirectional recurrent neural network (BRNN) [73] with gated recurrent units (GRU) [16] that encodes each utterance into a fixed-size vector by concatenating the last hidden states of the forward and backward RNN, $u_i = [\vec{h}_i, \overleftarrow{h}_i]$. $x$ is simply $u_k$. The context encoder is a 1-layer GRU network that encodes the preceding $k-1$ utterances by taking $u_{1:k-1}$ and the corresponding conversation floor as inputs. The last hidden state $h^c$ of the context encoder is concatenated with the meta features, $c = [h^c, m]$. Since we assume $z$ follows an isotropic Gaussian distribution, the recognition network is $q_\phi(z|x, c) \sim \mathcal{N}(\mu, \sigma^2 I)$ and the prior network is $p_\theta(z|c) \sim \mathcal{N}(\mu', \sigma'^2 I)$, and we have:

$$\begin{bmatrix} \mu \\ \log(\sigma^2) \end{bmatrix} = W_r \begin{bmatrix} x \\ c \end{bmatrix} + b_r \quad (3.9)$$

$$\begin{bmatrix} \mu' \\ \log(\sigma'^2) \end{bmatrix} = \mathrm{MLP}_p(c) \quad (3.10)$$

We then use the reparametrization trick [43] to obtain samples of $z$, either from $\mathcal{N}(z; \mu, \sigma^2 I)$ predicted by the recognition network (training) or from $\mathcal{N}(z; \mu', \sigma'^2 I)$ predicted by the prior network (testing). Finally, the response decoder is a 1-layer GRU network with initial state $s_0 = W_i[z, c] + b_i$. The response decoder then predicts the words in $x$ sequentially.
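
A sketch of the recognition network, the prior network and the reparametrized sampling is shown below; the layer sizes follow the hyperparameters reported in Section 3.4.3, and the single linear layer mirrors Equation 3.9:

import torch
import torch.nn as nn

class LatentNetworks(nn.Module):
    # Recognition network (Eq. 3.9), prior network (Eq. 3.10) and sampling.
    def __init__(self, x_size, c_size, z_size=200, mlp_size=400):
        super().__init__()
        self.recognition = nn.Linear(x_size + c_size, 2 * z_size)
        self.prior = nn.Sequential(nn.Linear(c_size, mlp_size), nn.Tanh(),
                                   nn.Linear(mlp_size, 2 * z_size))

    def sample(self, c, x=None):
        # Recognition network at training time (x observed), prior at test time
        if x is not None:
            params = self.recognition(torch.cat([x, c], dim=-1))
        else:
            params = self.prior(c)
        mu, log_var = params.chunk(2, dim=-1)
        eps = torch.randn_like(mu)               # reparametrization trick [43]
        return mu + torch.exp(0.5 * log_var) * eps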

Knowledge-Guided CVAE (kgCVAE)

In practice, training CVAE is a challenging optimization problem that often requires a large amount of data. On the other hand, past research in spoken dialog systems and discourse analysis has suggested that many linguistic cues capture crucial features of natural conversation. For example, dialog acts [66] have been widely used in dialog managers [48, 67, 114] to represent the propositional function of the system. Therefore, we conjecture that it will be beneficial for the model to learn a meaningful latent $z$ if it is provided with explicitly extracted discourse features during training.

In order to incorporate the linguistic features into the basic CVAE model, we first denote the set of linguistic features as $y$. We then assume that the generation of $x$ depends on $c$, $z$ and $y$, and that $y$ relies on $z$ and $c$, as shown in Figure 3.7. Specifically, during training, the initial state of the response decoder is $s_0 = W_i[z, c, y] + b_i$ and the input at every step is $[e_t, y]$, where $e_t$ is the word embedding of the $t$th word in $x$. In addition, there is an MLP that predicts $y' = \mathrm{MLP}_y(z, c)$ based on $z$ and $c$. In the testing stage, the predicted $y'$ is used by the response decoder instead of the oracle $y$. We denote the modified model as knowledge-guided CVAE (kgCVAE); developers can add any desired discourse features that they wish the latent variable $z$ to capture. The kgCVAE model is trained by maximizing:

$$\mathcal{L}(\theta, \phi; x, c, y) = -KL(q_\phi(z|x, c, y) \| p_\theta(z|c)) + \mathbb{E}_{q_\phi(z|c,x,y)}[\log p(x|z, c, y)] + \mathbb{E}_{q_\phi(z|c,x,y)}[\log p(y|z, c)] \quad (3.11)$$

Since the reconstruction of $y$ is now a part of the loss function, kgCVAE can encode $y$-related information into $z$ more efficiently than discovering it from the surface-level $x$ and $c$ alone. Another advantage of kgCVAE is that it can output a high-level label (e.g. dialog act) along with the word-level responses, which allows easier interpretation of the model's outputs.

Optimization Challenges

A straightforward VAE with an RNN decoder fails to encode meaningful information in $z$ due to the vanishing latent variable problem [10]. Bowman et al. [10] proposed two solutions: (1) KL annealing: gradually increasing the weight of the KL term from 0 to 1 during training; (2) word drop decoding: setting a certain percentage of the target words to 0. We found that CVAE suffers from the same issue when the decoder is an RNN. We did not consider word drop decoding because Bowman et al. [10] have shown that it may hurt performance when the drop rate is too high.

As a result, we propose a simple yet novel technique to tackle the vanishing latent variable problem: the bag-of-word loss. The idea is to introduce an auxiliary loss that requires the decoder network to predict the bag of words in the response $x$, as shown in Figure 3.8(b). We decompose $x$ into two variables: $x_o$ with word order and $x_{bow}$ without order, and assume that $x_o$ and $x_{bow}$ are conditionally independent given $z$ and $c$: $p(x, z|c) = p(x_o|z, c)p(x_{bow}|z, c)p(z|c)$. Due to the conditional independence assumption, the latent variable is forced to capture global information about the target response. Let $f = \mathrm{MLP}_b(z, c) \in \mathbb{R}^V$, where $V$ is the vocabulary size. We then have:

$$\log p(x_{bow}|z, c) = \log \prod_{t=1}^{|x|} \frac{e^{f_{x_t}}}{\sum_j^V e^{f_j}} \quad (3.12)$$

where $|x|$ is the length of $x$ and $x_t$ is the word index of the $t$th word in $x$. The modified variational lower bound is:

$$\mathcal{L}'(\theta, \phi; x, c) = \mathcal{L}(\theta, \phi; x, c) + \mathbb{E}_{q_\phi(z|c,x,y)}[\log p(x_{bow}|z, c)] \quad (3.13)$$


We will show that the bag-of-word loss in Equation 3.13 is very effective against the vanishing latent variable problem and that it is complementary to the KL annealing technique.
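
A sketch of the BOW loss in Equation 3.12 is given below; f_logits stands for the output of $\mathrm{MLP}_b$, and pad_id is an assumed padding token index:

import torch.nn.functional as F

def bow_loss(f_logits, target_tokens, pad_id=0):
    # f_logits: (batch, V) from MLP_b; target_tokens: (batch, |x|) word indices.
    log_probs = F.log_softmax(f_logits, dim=-1)           # log(e^{f_j} / sum e^{f_j})
    token_log_probs = log_probs.gather(1, target_tokens)  # select the f_{x_t} terms
    mask = (target_tokens != pad_id).float()              # ignore padding positions
    return -(token_log_probs * mask).sum(dim=1).mean()    # -log p(x_bow | z, c)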

3.4.3 Experiments

We chose the Switchboard (SW) 1 Release 2 Corpus [33] to evaluate the proposed models. SW contains 2400 two-sided telephone conversations with manually transcribed speech and alignment. At the beginning of each call, a computer operator gave the callers prompts that define the desired topic of discussion. There are 70 available topics. We randomly split the data into 2316/60/62 dialogs for train/validate/test. The pre-processing includes: (1) tokenizing with the NLTK tokenizer [4]; (2) removing non-verbal symbols and repeated words due to false starts; (3) keeping the top 10K most frequent word types as the vocabulary. The final data have 207,833/5,225/5,481 $(c, x)$ pairs for train/validate/test. Furthermore, a subset of SW was manually labeled with dialog acts [79]. We extracted dialog act labels based on the dialog act recognizer proposed in [69]. The features include the uni-grams and bi-grams of the utterance and the contextual features of the last 3 utterances. We trained a Support Vector Machine (SVM) [86] with a linear kernel on the subset of SW with human annotations. There are 42 types of dialog acts and the SVM achieved 77.3% accuracy on held-out data. The rest of the SW data were then labeled with dialog acts using the trained SVM dialog act recognizer.

Training

We trained with the following hyperparameters (selected according to the loss on the validation dataset): the word embedding has size 200 and is shared everywhere. We initialize the word embeddings from GloVe embeddings pre-trained on Twitter [65]. The utterance encoder has a hidden size of 300 for each direction. The context encoder has a hidden size of 600 and the response decoder has a hidden size of 400. The prior network and the MLP for predicting $y$ both have 1 hidden layer of size 400 with tanh non-linearity. The latent variable $z$ has a size of 200. The context window $k$ is 10. All initial weights are sampled from the uniform distribution [-0.08, 0.08]. The mini-batch size is 30. The models are trained end-to-end using the Adam optimizer [42] with a learning rate of 0.001 and gradient clipping at 5. We selected the best models based on the variational lower bound on the validation data. Finally, we use the BOW loss along with KL annealing over 10,000 batches to achieve the best performance. Section 3.4.5 gives a detailed argument for the importance of the BOW loss.

We compared three neural dialog models: a strong baseline model, CVAE and kgCVAE. The baseline model is an encoder-decoder neural dialog model without latent variables, similar to [75]. The baseline model uses the same context encoder to encode the dialog history and the meta features as shown in Figure 3.8. The encoded context $c$ is directly fed into the decoder network as the initial state. The hyperparameters of the baseline are the same as the ones reported in Section 3.4.3, and the baseline is trained to minimize the standard cross entropy loss of the decoder RNN without any auxiliary loss.


Also, to compare the diversity introduced by the stochasticity of the proposed latent variable versus that of the RNN softmax at each decoding step, we generate N responses from the baseline by sampling from the softmax. For CVAE/kgCVAE, we sample N times from the latent $z$ and use only greedy decoding, so that the randomness comes entirely from the latent variable $z$.

3.4.4 Quantitative Results

Automatically evaluating an open-domain generative dialog model is an open research challenge [50]. Following our one-to-many hypothesis, we propose the following metrics. We assume that for a given dialog context $c$, there exist $M_c$ reference responses $r_j$, $j \in [1, M_c]$. Meanwhile, a model can generate $N$ hypothesis responses $h_i$, $i \in [1, N]$. The generalized response-level precision/recall for a given dialog context is:

$$\mathrm{precision}(c) = \frac{\sum_{i=1}^{N} \max_{j \in [1, M_c]} d(r_j, h_i)}{N}$$

$$\mathrm{recall}(c) = \frac{\sum_{j=1}^{M_c} \max_{i \in [1, N]} d(r_j, h_i)}{M_c}$$

where $d(r_j, h_i)$ is a distance function that lies between 0 and 1 and measures the similarity between $r_j$ and $h_i$. The final score is averaged over the entire test dataset, and we report the performance with 3 types of distance functions in order to evaluate the systems from various linguistic points of view (a sketch of the metric follows the list below):

1. Smoothed Sentence-level BLEU [13]: BLEU is a popular metric that measures the geometric mean of modified n-gram precision with a length penalty [45, 64]. We use BLEU-1 to 4 as our lexical similarity metrics and normalize the scores to the [0, 1] scale.

2. Cosine Distance of Bag-of-word Embedding: a simple method to obtain sentence embeddings is to take the average or the extrema of all the word embeddings in the sentences [1, 28]. Here $d(r_j, h_i)$ is the cosine distance of the two embedding vectors. We used GloVe embeddings and denote the average method as A-bow and the extrema method as E-bow. The score is normalized to [0, 1].

3. Dialog Act Match: to measure the similarity at the discourse level, the same dialog act tagger from Section 3.4.3 is applied to label all the generated responses of each model. We set $d(r_j, h_i) = 1$ if $r_j$ and $h_i$ have the same dialog act, and $d(r_j, h_i) = 0$ otherwise.
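
The metric itself is only a few lines of code; the sketch below computes precision and recall for one context, given any of the above similarity functions d in [0, 1]:

def response_precision_recall(references, hypotheses, d):
    # references: the M_c reference responses; hypotheses: the N samples.
    precision = sum(max(d(r, h) for r in references)
                    for h in hypotheses) / len(hypotheses)
    recall = sum(max(d(r, h) for h in hypotheses)
                 for r in references) / len(references)
    return precision, recall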

One challenge of using the above metrics is that there is only one reference response/context rather than multiple, which impacts the reliability of our measures. Inspired by [78], we utilized information retrieval techniques to gather 10 extra candidate reference responses/contexts from other conversations with the same topics. The 10 candidate references were then filtered by two experts, and serve as the ground truth to train the reference response classifier. The result is 6.69 extra references on average per context. The average number of distinct reference dialog acts is 4.2. Tables 3.5 and 3.6 show the results.

Models     Perplexity (KL)   BLEU-1         BLEU-2         BLEU-3         BLEU-4
Baseline   35.4 (n/a)        0.405/0.336    0.300/0.281    0.272/0.254    0.226/0.215
CVAE       20.2 (11.36)      0.372/0.381    0.295/0.322    0.265/0.292    0.223/0.248
kgCVAE     16.02 (13.08)     0.412/0.411    0.350/0.356    0.310/0.318    0.262/0.272

Table 3.5: Performance on perplexity and BLEU scores (precision/recall). The highest score is in bold. Note that our BLEU scores are normalized to [0, 1].

Models     A-bow          E-bow          Dialog Act
Baseline   0.951/0.935    0.827/0.801    0.736/0.514
CVAE       0.954/0.943    0.815/0.812    0.804/0.807
kgCVAE     0.961/0.944    0.804/0.807    0.721/0.598

Table 3.6: Performance on semantic matching and dialog acts (precision/recall). The highest score is in bold.

The proposed models outperform the baseline in terms of recall on all the metrics with statistical significance. This confirms our hypothesis that generating responses with discourse-level diversity can lead to a more comprehensive coverage of the potential responses than promoting only word-level diversity. As for precision, we observed that the baseline has higher or similar scores to CVAE on all metrics, which is expected since the baseline tends to generate the most likely and safe responses repeatedly in the $N$ hypotheses. However, kgCVAE is able to achieve the highest precision and recall on 4 metrics at the same time (BLEU-1 to 4, A-bow). One reason for kgCVAE's good performance is that the predicted dialog act label in kgCVAE can regularize the generation process of its RNN decoder by forcing it to generate more coherent and precise words. We further analyze the precision/recall of BLEU-4 by looking at the average score versus the number of distinct reference dialog acts. A low number of distinct dialog acts represents a situation where the dialog context puts a strong constraint on the range of the next response (low entropy), while a high number indicates the opposite (high entropy). Figure 3.9 shows that CVAE/kgCVAE achieve significantly higher recall than the baseline in high-entropy contexts. It also shows that CVAE suffers from lower precision, especially in low-entropy contexts. Finally, kgCVAE achieves higher precision than both the baseline and CVAE across the full spectrum of context entropy.

Figure 3.9: BLEU-4 precision/recall vs. the number of distinct reference dialog acts.


3.4.5 Qualitative Results

Example 1 - Topic: Recycling
Context: A: are they doing a lot of recycling out in Georgia?
Target-B (statement): well at my workplace we have palaces for aluminium cans and we have a separate trash can for recyclable paper

Baseline+Sampling:
1. well I'm a graduate student and have two kids
2. well I was in last year and so we've had lots of recycling
3. I'm not sure
4. well I don't know I just moved here in new york

kgCVAE+Greedy:
1. (non-understand) pardon
2. (statement) oh you're not going to have a curbside pick up here
3. (statement) okay I am sure about a recycling center
4. (yes-answer) yeah so

Example 2 - Topic: Child Care
Context: A: you know a private home to take their children to when they're young until they hit the preschool age and they
Target-B (backchannel): uh-huh

Baseline+Sampling:
1. um - hum
2. yeah
3. um - hum
4. uh-huh

kgCVAE+Greedy:
1. (backchannel) uh-huh
2. (turn-exit) um-hum
3. (backchannel) yeah
4. (statement) oh yeah I think that's part of the problem

Table 3.7: Generated responses from the baseline and kgCVAE in two examples. kgCVAE also provides the predicted dialog act for each response. The context only shows the last utterance due to space limits (the actual context window size is 10).

Table 3.7 shows the outputs generated by the baseline and kgCVAE. In example 1, caller A begins with an open-ended question. The kgCVAE model generates highly diverse answers that cover multiple plausible dialog acts. Further, we notice that the generated text exhibits dialog acts similar to the ones predicted separately by the model, implying the consistency of natural language generation based on $y$. In contrast, the responses from the baseline model are limited to local n-gram variations and share a similar prefix, i.e. I'm. Example 2 is a situation where caller A is telling B stories. The ground truth response is a back-channel, and the range of valid answers is more constrained than in example 1 since B is playing the role of a listener. The baseline successfully predicts "uh-huh". The kgCVAE model is also able to generate various ways of back-channeling. This implies that the latent $z$ is able to capture context-sensitive variations: in low-entropy dialog contexts it models lexical diversity, while in high-entropy ones it models discourse-level diversity. Moreover, kgCVAE is occasionally able to generate more sophisticated grounding (sample 4) beyond a simple back-channel, which is also an acceptable response given the dialog context.

In addition, past work [43] has shown that the recognition network is able to learn to cluster high-dimensional data, so we conjecture that the posterior $z$ output by the recognition network should cluster the responses into meaningful groups. Figure 3.10 visualizes the posterior $z$ of the responses in the test dataset in 2D space using t-SNE [53]. We found that the learned latent space is highly correlated with the dialog act and the length of responses, which confirms our assumption.

Figure 3.10: t-SNE visualization of the posterior $z$ for test responses with the top 8 frequent dialog acts. The size of the circle represents the response length.

Results for Bag-of-Word Loss

Finally, we evaluate the effectiveness of the bag-of-word (BOW) loss for training VAE or CVAE with an RNN decoder. To compare with past work [10], we conducted the same language modelling (LM) task on Penn Treebank using VAE. The network architecture is the same except that we use GRU instead of LSTM. We compared four training setups: (1) standard VAE without any heuristics; (2) VAE with KL annealing (KLA); (3) VAE with BOW loss; (4) VAE with both BOW loss and KLA. Intuitively, a well-trained model should lead to a low reconstruction loss and a small but non-trivial KL cost. For all models with KLA, the KL weight increases linearly from 0 to 1 in the first 5000 batches.

Table 3.8 shows the reconstruction perplexity and the KL cost on the test dataset. The standard VAE fails to learn a meaningful latent variable: its KL cost is close to 0 and its reconstruction perplexity is similar to that of a small LSTM LM [111]. KLA helps to improve the reconstruction loss, but it requires early stopping since the model falls back to the standard VAE after the KL weight becomes 1. Finally, the models with BOW loss achieve significantly lower perplexity and a larger KL cost.


Model      Perplexity   KL cost
Standard   122.0        0.05
KLA        111.5        2.02
BOW        97.72        7.41
BOW+KLA    73.04        15.94

Table 3.8: The reconstruction perplexity and KL terms on the Penn Treebank test set.

Figure 3.11 visualizes the evolution of the KL cost. We can see that for the standard model, the KL cost crashes to 0 at the beginning of training and never recovers. On the contrary, the model with only KLA learns to encode substantial information in the latent $z$ while the KL cost weight is small. However, after the KL weight increases to 1 (after 5000 batches), the model once again decides to ignore the latent $z$ and falls back to the naive implementation. The model with BOW loss, however, consistently converges to a non-trivial KL cost even without KLA, which confirms the importance of the BOW loss for training latent variable models with RNN decoders. Last but not least, our experiments showed that the conclusions drawn from the LM task with VAE also apply to training CVAE/kgCVAE, so we used the BOW loss together with KLA for all previous experiments.

Figure 3.11: The value of the KL divergence during training with different setups onPenn Treebank.

3.5 Handling Slot Expansion

3.5.1 Slot Expansion

Slot Expansion refers to a common situation where new entities are introduced into a task-oriented dialog domain. There are several reasons for this. Take a restaurant recommendation system for example: new cuisine types and new city locations may be gradually added over time. Also, the restaurant information returned from a KB may vary from time to time, even given the same KB query. Finally, real users can mention new entities that do not appear in the training data. This phenomenon is reflected as an expansion of the slot vocabulary, while the set of slots that need to be filled stays the same. The new entities that are not covered in the training data will be referred to as out-of-vocabulary (OOV) entities. Ideally, a robust dialog system should be immune to slot expansion because the underlying recognition, decision-making and generation processes are not affected by the new entities. Traditional pipeline-based dialog systems have strong performance in this aspect, since the only module that needs to be updated is the NLU component; the rest of the pipeline can be used without modification. On the other hand, later experiments will show that GEDMs are vulnerable to slot expansion for several reasons. First, since the new entities are often not included in the input vocabulary of the encoder network, they are mapped to 〈unk〉 symbols, which are usually mistreated by the encoder network. Second, the new entities might be missing from the output vocabulary altogether, which effectively prevents the decoder from generating them through its output softmax. For example, if a user said: "I like pizza", and pizza is a new entity, it is impossible for the decoder to generate "Do you mean pizza?". The best possible output from the system is "Do you mean 〈unk〉?", which is not satisfactory.

Building GEDMs that are invariant to slot expansion is necessary, for two reasons. First, for task-oriented dialog systems, the entities in the system output carry a significant proportion of the information that the system tries to convey. Failing to accurately generate these words has a devastating effect on the continuation of the dialog. Second, being immune to slot expansion implies better generalization of a GEDM, which can lead to better performance. This is because the underlying decision-making and generation processes of a slot-filling dialog system are independent of the entity values. Pushing a GEDM towards slot-expansion invariance thus forces the model to learn the underlying process instead of leveraging language model statistics and simply memorizing the frequent entities in the training data.

3.5.2 Handling Slot Expansion via Delexicalized Memory

The proposed solution is to augment a GEDM with a trainable memory unit. The basic intuition is as follows: first, register all entities $[e_1, ..., e_i, ..., e_{K_c}]$ in a dialog context $c$ in a memory $\mathcal{M}$, where $K_c$ is the number of unique entities in $c$. The system then creates features for each entity, denoted by $\phi(e_i)$, and the recognition and decision-making processes of the GEDM operate on $\phi(e_i)$ instead of $e_i$. Finally, in the generation stage, if the decoder decides to generate $e_i$, it can use the mapping $\phi(e_i) \to e_i$ to produce $e_i$. Even if $e_i$ is actually OOV, as long as $\phi$ still contains meaningful information, the decoder can still generate $e_i$ when needed. Since $\phi$ extracts lexically independent features, the above memory block is called Delexicalized Memory. Given this high-level procedure, we can now define two key functions. The first is a write function $\mathcal{M} = w(c)$. This write function needs to decide three elements: the number of entities $K_c$ in $c$, the feature $\phi(e_i)$, $i \in [1, K_c]$ for each entity, and the pointers from $\phi(e_i)$ to $e_i$. The last element can be easily implemented by a dictionary, while the first two elements require special effort. The second function is a read function $x = r(\mathcal{M}, z)$ that generates the system response leveraging information from both its initial state $z$ and $\mathcal{M}$. Therefore, the goal becomes to construct the $w$ and $r$ functions.

Figure 3.12: High-level illustration of a GEDM with delexicalized memory unit.
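
The interface of the two functions can be sketched as follows; extract_entities stands in for whatever recognizer implements the non-trainable part of phi (a hypothetical helper here):

class DelexicalizedMemory:
    # Minimal write/read interface: M = w(c) and the phi(e_i) -> e_i lookup
    # that the read function r(M, z) uses during decoding.
    def __init__(self, extract_entities):
        self.extract_entities = extract_entities  # hypothetical NER helper
        self.table = []                            # entries of (phi(e_i), e_i)

    def write(self, context_utterances):
        # M = w(c): register each entity under lexicon-independent features
        self.table = []
        for utt in context_utterances:
            for surface, ent_type in self.extract_entities(utt):
                phi = (ent_type, len(self.table))  # here: type + occurrence order
                self.table.append((phi, surface))
        return self.table

    def lookup(self, phi):
        # Used at generation time: map phi(e_i) back to the surface form e_i
        for features, surface in self.table:
            if features == phi:
                return surface
        return "<unk>"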

Before diving into the implementations of these two functions, we first discuss their connections to past work and their properties. First of all, registering key entities from a discourse history and referencing them in the system output is closely related to the grounding theory of human-human communication [17]. Clark [17] argues that to have effective communication, interlocutors need to build mutual knowledge and beliefs through grounding behaviors, e.g. acknowledgment or confirmation. The proposed $\mathcal{M}$ can be viewed as an approximation of the mutual belief for a given dialog context, and $w$ and $r$ can be thought of as the dialog moves that modify the common ground. Moreover, from a system perspective, a slot-filling dialog system uses grounding strategies to corroborate concepts with users. This is especially useful when the acoustic environment is noisy, so that the ASR has a high chance of recognizing wrong words [7]. Such behavior can be explained here as selecting an entity from $\mathcal{M}$ according to $\phi(e_i)$, which could encode information about the system's confidence in the correctness of $e_i$.

Last, to test our hypothesis, the following experiments show our initial results of implementing the delexicalized memory using hand-crafted write/read functions, published in [116]. The experimental outcomes confirm the importance and effectiveness of such a memory unit for modeling task-oriented dialogs. The hand-crafted write/read functions, however, make the models no longer E2E trainable and may lead to sub-optimal behavior. Therefore, we plan to extend this idea and develop E2E trainable write/read functions so that the memory becomes a part of the E2E model. The proposed plan is described in Section 5.1.


3.5.3 A Baseline Implementation

This implementation consists of three steps, as shown in Figure 3.13: a) the write function w: entity indexing (EI), b) the encoder-decoder (ED), and c) the read function r: system utterance lexicalization (UL). The intuition is to leverage domain-general named-entity recognition (NER) [89] techniques to extract salient entities from the raw dialog history and convert the lexical values of the entities into entity indexes. The outputs from the decoder network are then lexicalized by replacing the entity indexes and special KB tokens with natural language. The following sections explain each step in detail.

Entity Indexing and Utterance Lexicalization

Figure 3.13: The implementation of delexicalized memory augmented GEDM.

Entity Indexing EI has two parts. First, EI utilizes an existing domain-general NER to extract entities from both the user and system utterances. Note that the entities here are assumed to be a super-set of the slots in the domain. For example, a flight-booking system may contain two slots, [from-LOCATION] and [to-LOCATION], for the departure and arrival city, respectively. However, EI only extracts every mention of [LOCATION] in the utterances and leaves the task of distinguishing between departure and arrival to the encoder-decoder model. Furthermore, this step replaces each KB search result with its search query. The second step of EI involves constructing an indexed entity table. Each entity is indexed by its order of occurrence in the conversation. Therefore, the feature function φ in EI is simply φ : e_i → [entity type, occurrence index]. Figure 3.14 shows an example in which there are two [LOCATION] mentions.

Utterance Lexicalization is the reverse of EI. Since EI is a deterministic process, its effect can always be reversed by finding the corresponding entity in the indexed entity table and replacing the index with its word. For KB search, a simple string-matching algorithm can search for the special [kb-search] token and take the following generated


Figure 3.14: An example of entity indexing and utterance lexicalization.

entities as the arguments to the KB. Then the actual KB results can replace the original KB query. Figure 3.14 shows an example of utterance lexicalization.
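As a concrete illustration, the following toy sketch implements EI and UL for a single utterance; the dictionary-based "NER" and all names are stand-ins for illustration only (the real implementation uses a domain-general NER [89]).

    from typing import Dict, List, Tuple

    TOY_NER: Dict[str, str] = {"pittsburgh": "LOCATION", "seattle": "LOCATION"}

    def entity_indexing(utterance: str, table: List[Tuple[str, str]]) -> str:
        """Replace each recognized entity with [TYPE-k] and register it in the table."""
        out = []
        for word in utterance.split():
            etype = TOY_NER.get(word.lower())
            if etype is None:
                out.append(word)
            else:
                index = sum(1 for t, _ in table if t == etype)  # occurrence index
                table.append((etype, word))
                out.append(f"[{etype}-{index}]")
        return " ".join(out)

    def utterance_lexicalization(utterance: str, table: List[Tuple[str, str]]) -> str:
        """Reverse EI by looking up [TYPE-k] tokens in the indexed entity table."""
        out = []
        for token in utterance.split():
            if token.startswith("[") and token.endswith("]"):
                etype, idx = token[1:-1].rsplit("-", 1)
                matches = [w for t, w in table if t == etype]
                out.append(matches[int(idx)])  # k-th occurrence of this type
            else:
                out.append(token)
        return " ".join(out)

    table: List[Tuple[str, str]] = []
    delex = entity_indexing("leave from Pittsburgh and go to Seattle", table)
    # delex == "leave from [LOCATION-0] and go to [LOCATION-1]"
    assert utterance_lexicalization(delex, table) == "leave from Pittsburgh and go to Seattle"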

Encoder-Decoder Models

The encoder-decoder model can then read in the EI-processed dialog history and predict the system's next utterance in EI format. Specifically, a dialog history of k turns is represented by [(a_0, u_0, c_0), ..., (a_{k−1}, u_{k−1}, c_{k−1})], in which a_i, u_i and c_i are, respectively, the system utterance, user utterance and ASR confidence score at turn i. Each utterance in the dialog history is encoded into a fixed-size vector using the Convolutional Neural Networks (CNNs) proposed in [41]. Specifically, each word in an utterance x is mapped to its word embedding, so that an utterance is represented as a matrix R ∈ R^{|x|×D}, in which D is the size of the word embedding. Then L filters of sizes 1, 2 and 3 conduct convolutions on R to obtain a feature map, c, of n-gram features with window sizes 1, 2 and 3. Then c is passed through a nonlinear ReLU [32] layer, followed by a max-pooling layer to obtain a compact summary of salient n-gram features, i.e. e(x) = maxpool(ReLU(c + b)). Using CNNs to capture word-order information is crucial, because the encoder-decoder has to be able to distinguish fine-grained differences between entities. For example, a simple bag-of-words embedding approach will fail to distinguish between the two location entities in "leave from [LOCATION-0] and go to [LOCATION-1]", while a CNN encoder can capture the context information of these two entities.
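A sketch of this CNN utterance encoder is shown below in PyTorch (the framework is an assumption; the text does not prescribe one), with filter windows of sizes 1, 2 and 3 and 100 feature maps each, followed by ReLU and max-pooling over time.

    import torch
    import torch.nn as nn

    class CNNUtteranceEncoder(nn.Module):
        def __init__(self, vocab_size: int, embed_dim: int = 100, num_maps: int = 100):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            # one Conv1d per n-gram window size (1, 2 and 3)
            self.convs = nn.ModuleList(
                [nn.Conv1d(embed_dim, num_maps, kernel_size=k) for k in (1, 2, 3)]
            )

        def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
            # word_ids: (batch, seq_len) -> (batch, embed_dim, seq_len) for Conv1d
            emb = self.embedding(word_ids).transpose(1, 2)
            # ReLU then max over time for each window size, then concatenate
            pooled = [torch.relu(conv(emb)).max(dim=2).values for conv in self.convs]
            return torch.cat(pooled, dim=1)  # (batch, 3 * num_maps)

    enc = CNNUtteranceEncoder(vocab_size=1000)
    utt = torch.randint(0, 1000, (4, 12))  # a batch of 4 utterances, 12 tokens each
    print(enc(utt).shape)                  # torch.Size([4, 300])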

After obtaining utterance embeddings, a turn-level dialog history encoder network similar to the one proposed in [114] is used. A turn embedding is a simple concatenation of the system utterance embedding, the user utterance embedding and the confidence score, t_i = [e(a_i); e(u_i); c_i]. Then a Long Short-Term Memory (LSTM) [38] network reads the sequence of turn embeddings in the dialog history via the recursive state update h_{i+1} = LSTM(t_{i+1}, h_i), in which h_i is the LSTM hidden state output at turn i.

Decoding with/without Attention A baseline decoder takes the last hidden state of the encoder as its initial state and decodes the next system utterance word by word, as shown in [82]. This assumes that the fixed-size hidden state is expressive enough to encode all important information about the history of a dialog. However, this assumption may often be violated for a task that has long-term dependencies or requires complex reasoning over the entire source sequence. The attention mechanism proposed in [2] in the machine translation community has helped encoder-decoder models achieve state-of-the-art performance in various tasks [2, 106]. Attention allows the decoder to look over every hidden state in the encoder and dynamically decide the importance of each hidden state at each decoding step, which significantly improves the model's ability to handle long-term


dependencies. We experiment with decoders both with and without attention. Attention is computed similarly to the multiplicative attention described in [52]. We denote the hidden state of the decoder at time step j by s_j, and the hidden state output of the encoder at turn i by h_i. We then predict the next word by

a_{ji} = softmax(h_i^T W_a s_j + b_a)    (3.14)

c_j = \sum_i a_{ji} h_i    (3.15)

s̃_j = tanh(W_s [s_j; c_j])    (3.16)

p(w_j | s_j, c_j) = softmax(W_o s̃_j)    (3.17)

The decoder's next state is updated by s_{j+1} = LSTM(s̃_j, e(w_{j+1}), s_j).
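The following sketch traces Equations 3.14-3.17 for one decoding step with randomly initialized weights; the tensor shapes and the helper name are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def attention_step(s_j, H, W_a, b_a, W_s, W_o):
        """s_j: (hid,) decoder state; H: (turns, hid) encoder outputs h_i."""
        a_j = F.softmax(H @ W_a @ s_j + b_a, dim=0)        # Eq. 3.14: a_ji over turns
        c_j = a_j @ H                                      # Eq. 3.15: context vector
        s_tilde = torch.tanh(W_s @ torch.cat([s_j, c_j]))  # Eq. 3.16
        p_w = F.softmax(W_o @ s_tilde, dim=0)              # Eq. 3.17: p(w_j | s_j, c_j)
        return p_w, a_j

    hid, vocab, turns = 8, 20, 5
    p_w, attn = attention_step(
        torch.randn(hid), torch.randn(turns, hid),
        torch.randn(hid, hid), torch.tensor(0.0),
        torch.randn(hid, 2 * hid), torch.randn(vocab, hid))
    print(p_w.shape, attn.shape)  # torch.Size([20]) torch.Size([5])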

3.5.4 Evaluations and Experiments

System performance was assessed from four perspectives that are essential for task-oriented systems: BLEU, entities, dialog acts, and KB queries. The online evaluation is composed of the objective task success rate, the number of turns, and subjective satisfaction with human users.

BLEU [64]: compares n-gram precision with a length penalty, and has been a popular score for evaluating the performance of natural language generation [97] and open-domain dialog models [47]. Corpus-level BLEU-4 is reported.

Entities: This metric measures the model's performance in generating the correct slot-values. The slot-values mostly occur in grounding utterances (e.g. explicit/implicit confirms) and KB queries. We compute precision, recall, and F-score.

Acts: Each system utterance is made up of one or more dialog acts, e.g. "leaving at [TIME-0], where do you want to go?" → [implicit-confirm, request(arrival place)]. To evaluate whether a generated utterance has the same dialog acts as the ground truth, we trained a multi-label dialog act tagger using one-vs-rest Support Vector Machines (SVMs) [90], with bag-of-bigram features for each dialog act label. Since the natural language generation module in Let's Go is handcrafted, the dialog act tagger achieved 99.4% average label accuracy on a held-out dataset. We used this dialog act tagger to tag both the ground truth and the generated responses. Then we computed the micro-average precision, recall, and F-score.

KB Queries: Although the entities metric already covers the KB queries, the precision/recall/F-score of system utterances that contain KB queries are also explicitly measured, due to their importance. Specifically, this metric measures whether the system is able to generate the special [kb-query] symbol to initiate a KB query, as well as how accurate the corresponding KB query arguments are.

These four evaluation metrics will be used throughout for experiments on task-oriented systems. Furthermore, a combined score can be obtained by taking the average of these four metrics, which will be referred to as the BEAK score.
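Since BEAK is simply the mean of the four metrics (all on a 0-100 scale), it can be computed as below; the EI+Attn results reported later in Table 3.9 are used as a check.

    def beak(bleu: float, entity_f1: float, act_f1: float, kb_f1: float) -> float:
        """Average of BLEU, Entity F1, Act F1 and KB F1, each on a 0-100 scale."""
        return (bleu + entity_f1 + act_f1 + kb_f1) / 4.0

    print(beak(59.3, 64.2, 81.5, 62.2))  # 66.8, matching the EI+Attn row of Table 3.9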


Data and Training

The CMU Let's Go Bus Information System [67] is a task-oriented spoken dialog system that provides bus information. We combined the train1a and train1b datasets from DSTC 1 [98], which contain 2608 dialogs in total. The average dialog length is 9.07 turns. The dialogs were randomly split into 85/5/10 proportions for train/dev/test data. The data is noisy since the dialogs were collected from real users via telephone lines. Furthermore, this version of Let's Go used an in-house database containing the Port Authority bus schedule. In the current version, that database was replaced with the Google Directions API, which both reduces the human burden of maintaining a database and opens the possibility of extending Let's Go to cities other than Pittsburgh. Connecting to the Google Directions API involves a POST call to their URL, with our given access key as well as the needed parameters: departure place, arrival place, departure time, and the travel mode, which we always set to TRANSIT to obtain relevant bus routes. There are 14 distinct dialog acts available to the system, and each system utterance contains one or more dialog acts. Lastly, the system vocabulary size is 1311 and the user vocabulary size is 1232. After the EI process, the sizes become 214 and 936, respectively.
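A sketch of such a request is shown below; the endpoint, HTTP verb and parameter names (origin, destination, departure_time, mode, key) follow the public Directions API documentation, and the access key and addresses are placeholders.

    import requests

    params = {
        "origin": "Forbes Ave at Murray Ave, Pittsburgh",   # departure place
        "destination": "Pittsburgh International Airport",  # arrival place
        "departure_time": "now",
        "mode": "transit",                                  # always TRANSIT for bus routes
        "key": "YOUR_ACCESS_KEY",
    }
    resp = requests.get("https://maps.googleapis.com/maps/api/directions/json",
                        params=params)
    routes = resp.json().get("routes", [])  # bus routes returned by the API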

For all experiments, the word embedding size was 100. The sizes of the LSTM hidden states for both the encoder and decoder were 500, with 1 layer. The attention context size was also 500. We tied the CNN weights for encoding system and user utterances. Each CNN has 3 filter windows, of sizes 1, 2, and 3, with 100 feature maps each. We trained the model end-to-end using Adam [42], with a learning rate of 1e-3 and a batch size of 40. To combat overfitting, we apply dropout [111] to the LSTM layer outputs and the CNN outputs after the max-pooling layer, with a dropout rate of 40%.

3.5.5 Results and Discussion

Metrics     Baseline        EI      EI+Attn
BLEU        36.9            54.6    59.3
Entity F1   35.2            62.1    64.2
Act F1      80.5            80.0    81.5
KB F1       N/A             51.9    62.2
BEAK        50.8 (w/o KB)   62.2    66.8

Table 3.9: Performance of each model on automatic measures.

Three systems were compared: the basic encoder-decoder model without EI (Baseline), the basic model with EI pre-processing (EI), and the model with an attentional decoder (EI+Attn). The comparison was carried out on exactly the same held-out test dataset containing 261 dialogs. Table 3.9 shows the results. It can be seen that all models achieve similar performance on the dialog act metric, even the baseline model. This confirms the capacity of encoder-decoder models to learn the "shape" of a conversation, since they have achieved impressive results in more challenging settings, e.g. modeling open-domain conversations. Furthermore, since the DSTC1 data was collected


over several months, there were minor updates made to the dialog manager. Therefore, there are inherent ambiguities in the data (the dialog manager may take different actions in the same situation). We conjecture that ~80% is near the upper limit for our data in modeling the system's next dialog act given the dialog history.

On the other hand, the proposed methods significantly improved the metrics related to slots and KB queries. The inclusion of EI alone improved the F-score of slots by a relative 76%, which confirms that EI is crucial in developing slot-value-independent encoder-decoder models for task-oriented dialogs. Likewise, the inclusion of attention further improved the prediction of slots in system utterances. Adding attention also improved the performance of predicting KB queries, more so than the overall slot accuracy. This is expected, since KB queries are usually issued near the end of a conversation, which requires global reasoning over the entire dialog history. The use of attention allows the decoder to look over the history and make better decisions rather than simply depending on the context summary in the last hidden layer of the encoder. Because of the good performance achieved by the models with the attention decoder, the attention weights from Equation 3.14 at every step of the decoding process are visualized for two example dialogs from the test data. For both figures, the vertical axes show the dialog history flowing from the top to the bottom. Each row is a turn in the format of (system utterance # user utterance). The top horizontal axis shows the predicted next system utterance. The darkness of a bar indicates the value of the attention weight calculated in Equation 3.14.

3.6 Putting It All Together and Discussion

Figure 3.15: The unified model with all techniques.

Finally, Figure 3.15 shows the proposed final model, the Stochastic Entity-Agnostic Memory Network (SteamNet), which addresses all three challenges raised in the introduction of this chapter. In summary, we introduce a context encoder f_e to map the raw dialog context into a meaningful distributed representation h = f_e(c). Also, a write function will


create a memory block that contains salient delexicalized entity features for grounding, M = w(c). Then, to model the stochastic policy of dialog, we deploy a stochastic random variable to represent the next high-level system action, z = π(h), which is the initial state of the decoder. Eventually, the generated system response words are constructed via a decoder function x = f_d(z, M). The generated x can be either a verbal action that is sent to the user or a KB action that sends a query to the KB. Then either the user or the KB will respond with new observations and feedback, which are added to the dialog context. This completes the entire loop of operation of our proposed GEDM.
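The loop can be summarized schematically as follows; f_e, w, pi and f_d stand for the neural modules named above and are passed in as callables, so the snippet only makes the data flow concrete rather than committing to an implementation.

    def steamnet_step(context, f_e, w, pi, f_d, user, kb):
        h = f_e(context)        # encode dialog context: h = f_e(c)
        M = w(context)          # write delexicalized memory: M = w(c)
        z = pi(h)               # sample the latent high-level action: z = pi(h)
        x = f_d(z, M)           # decode the response words: x = f_d(z, M)
        if x.startswith("[kb-query]"):
            observation = kb(x)    # KB action: the query goes to the knowledge base
        else:
            observation = user(x)  # verbal action: the response goes to the user
        context.append((x, observation))  # feedback extends the dialog context
        return x, observation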

We plan to improve the current SteamNet in two respects. First, the current latent action is a uni-modal Gaussian random variable, which cannot capture more complex multi-modal distributions. Moreover, although our study in Figure 3.10 indicates that the model learns a meaningful posterior latent distribution by grouping system responses, such clustering is not human-interpretable without manual annotation (e.g. dialog acts) and manual inspection. Therefore, we plan to develop a more flexible latent variable to solve these two issues. Second, the write and read functions that create/access the delexicalized memory are currently implemented via EI and UL. EI has limitations because it depends on an external NER to recognize and index every entity in the dialog history. If the EI process misses some entities, it becomes impossible for the downstream models to recover. As a result, we aim to integrate the write and read functions into the neural network and automatically learn how to write to and read from the delexicalized memory in an end-to-end fashion. A more detailed description of the proposed work can be found in Section 5.1.

After finishing the proposed work, we plan to conduct a comprehensive evaluation of SteamNet on various datasets (both task and chat) and compare its performance with other encoder-decoder models. Also, SteamNet will be the base model for Chapter 4, in which we explore methods to efficiently adapt it to new domains with a minimal amount of target-domain training dialogs.

At last, we want to discuss the relationship between SteamNet and the traditional dialog pipeline. The conventional dialog system follows a pipeline approach and consists of NLU, DST, DP and NLG. The encoder function f_e and the write function w play the roles of NLU and DST, which map the dialog context into a dialog state representation. The generated h and M can be thought of as approximations of the summary state and belief state developed in POMDP-based dialog management [109]. One possible counterargument is: why not model the dialog state as posterior inference over a latent dialog state variable, instead of using a deterministic mapping? In fact, solving a POMDP relies on finding the belief state over its hidden state, i.e. a distribution over all possible states, which results in a belief-state MDP [40]. The optimal policy of the given POMDP is equivalent to the optimal policy on the belief-state MDP. Therefore, there is no need to introduce stochasticity in the state encoder, and a deterministic state representation is sufficient to learn features for optimal decision-making. Moreover, the DP corresponds to the policy network π, which is a stochastic layer. This solves the one-to-many mapping problem in dialog decision-making. At last, the decoder function f_d can be thought of as a context-dependent NLG that realizes the system acts into natural language.


Chapter 4

Learning with Knowledge to Converse in New Domains

4.1 The Need for Knowledge

This chapter explores methods for developing E2E dialog systems in domains with low resources. Deep learning systems have achieved super-human performance in applications with gigantic amounts of labeled data, e.g. speech recognition [105]. Unfortunately, deep models are known to be prone to overfitting when the training data size is orders of magnitude smaller than the number of parameters. In contrast, adult humans are able to grasp new skills with only a few examples. One of the reasons why human intelligence can generalize to new domains so quickly is its ability to leverage past experience in learning related tasks and to distill abstractions from the surface input. Moreover, past research in machine learning theory [56] has suggested that introducing an inductive bias, i.e. a bias towards a particular set of model hypotheses, is required to learn models with generalization ability from a finite set of examples. The inductive bias can be obtained from analogy to previously learned generalizations, factual knowledge about the domain, etc.

Inspired by this, this chapter proposes to incorporate knowledge in order to adapt E2E dialog systems to domains with limited training material. The knowledge here is defined to focus on two types of information: (1) the experience from learning related domains and (2) a human expert's distilled knowledge about a domain. With the help of these two sources of information, the goal is to enable the model to learn transferable abstractions from related domains and to learn a general mechanism for adapting to new domains given human-provided domain meta information. Take a task-oriented dialog system in the movie domain, for example. If the system has learning experience in other domains, such as restaurant booking, it can ideally transfer common knowledge, e.g. grounding strategies, to the new domain and only needs to learn movie-specific knowledge. An example of the second category of knowledge is as follows. Imagine a human gives the system detailed descriptions of each slot; then the system can associate each slot with past slots that it has already learned to recognize, by uncovering the relatedness contained in the


slot descriptions. Therefore, we believe that by transferring knowledge among domains, learning to converse in a new domain should require only a small amount of data, not even necessarily dialog data, for the adaptation.

4.1.1 Problem Formulation

Now we formally define the problem that this chapter aims to solve. Consider the case where there is a source training dataset that contains abundant dialog data generated from K − 1 domains: D^S_train = {(c^(n), x^(n), k^(n)), n = 1...N}, where c and x are the dialog context and corresponding system response respectively, as defined in Chapter 3. The new variable k ∈ [1, K] is the index of the domain that a data point belongs to. Meanwhile, there is a target training dataset D^T_train that is much smaller than D^S_train. D^T_train is generated solely from the Kth (target) domain and can be empty. When D^T_train = ∅, we denote the problem as the zero-shot dialog learning problem. The union of the two training sets is denoted as D_train = D^S_train ∪ D^T_train. There are also two datasets for testing, D^S_test and D^T_test, where D^S_test is generated from the same domain distribution as D^S_train and D^T_test is generated from the Kth domain only. Besides the above train and test data, we assume that there is access to a set of domain descriptions for all K domains. Let D^D = {d_k, k = 1...K}, where d_k is a structured data point that contains the essential meta information describing a system's responsibilities in domain k. Domain descriptions will be formally defined in the next section.

Then the primary goal is to develop a GEDM that can achieve strong performance on D^T_test by training on D_train. Our secondary goal is to check whether the model is able to achieve performance on D^S_test no worse than models trained only on D^S_train, to confirm that it is not subject to catastrophic forgetting [29] after adapting to the target domain.

4.1.2 Challenges

Achieving the above goals forms a challenging transfer learning problem. Following the transfer learning notation in Chapter 2, a dialog domain consists of {C, P(C)}, which defines the input space and its marginal distribution, and {X, P(X|C)}, which defines the output space and the conditional distribution. Benefiting from the flexible input and output interface of GEDMs, given a large enough vocabulary, the input and output spaces, i.e. C and X, are the same across all domains. However, both P(C) and P(X|C) will be drastically different from one domain to another. For example, let one source domain be restaurant recommendation and the target domain be flight booking. A typical dialog context in the source can be: "c = User: I am looking for Chinese food. Sys: What's your location?" It is evident that the probability of this context appearing in the target domain, P^T(c), is very low. As for P(X|C), given the same dialog context c, the system in different domains should generate wildly different responses. For example, at the beginning of a dialog, the restaurant system may introduce itself with: "I can tell you where to eat", whereas the flight system may go with: "Let me book a flight for you". Therefore, learning to transfer is hard because both P(C) and P(X|C) differ between source and target.


More specifically, an encoder-decoder based GEDM will face the following challenges when trained on D_train and tested on D^T_test. First, since the training data from domain K is limited, a significant proportion of the words in D^T_test will be OOVs. Also, the utterances in target-domain dialog contexts will have different natural language expressions involving new slots and intentions. Both issues may make the encoder fail to output meaningful representations of the dialog context. Moreover, the system responses X in the target domain will likely never appear in the training data, so the softmax-based output of the decoder may never be able to generate words or phrases that are frequent in the target but rare in the source. Our pilot study in Section 4.4.2 shows that the standard encoder-decoder completely fails when tested in a new domain.

4.2 Learning with Knowledge

Figure 4.1: High-level Architecture for Domain-Aware Dialog Models

Figure 4.1 shows a high-level view of the proposed learning with knowledge (LWK) framework. Compared to the SteamNet presented in Section 3.6, the major difference is the introduction of a domain description as a part of the input to the system. Since the system is now aware of which domain it is operating in, we denote this type of dialog system as a Domain-Aware Dialog Model. Incorporating domain information is crucial for adapting a GEDM to new domains. First, past research in domain adaptation [20] has shown that successful domain adaptation requires models to learn both domain-independent and domain-specific parameters to achieve better performance. Therefore, with the help of domain descriptions, the encoder and decoder networks should implicitly partition their parameters to exhibit domain-specific behaviors conditioned on different domains. Second, in the more challenging zero-shot dialog learning paradigm, leveraging the meta information in domain descriptions enables the system to reuse skills acquired from other domains that share similar traits [23, 61]. Designing novel neural models to achieve the above goals is very challenging. Specifically, the open-ended research questions that need to be solved include but are not limited to: (1) what is the format of a domain description that can describe a dialog task? (2) what kind of domain encoder is best for encoding such a description? and (3) what is the best fusion mechanism for combining the information from a domain description with the rest of the encoder-decoder components? Answering these questions sits at the core of this thesis.


4.2.1 Domain Description

An ideal domain description should capture all domain-specific information about a given domain so that a domain-aware GEDM can operate well in this new domain with little in-domain training data. In past zero-shot learning work on image classification [62], it is relatively straightforward to represent the image labels in terms of fine-grained attributes or semantic codes. Unfortunately, it is a much more challenging task to represent a dialog domain in a compact manner. The extreme case is defining a domain description for open-domain chat-oriented systems, which have no constraints on the potential topics of discussion and system incentives. Therefore, as a first step towards describing a dialog domain, this thesis will focus on defining the domain description for task-oriented slot-filling dialog systems. Slot-filling dialog systems have been extensively studied [6, 67, 109, 113]. It is known that the following items carry important domain-specific information: (1) a set of slots, (2) system utterance semantic frames, and (3) user utterance semantic frames. The domain ontology proposed in our previous work [113] is a promising foundation on which to build the proposed domain description for LWK. The current domain ontology, however, is only used as a part of the dialog state for the planning of dialog management. Improving and standardizing a new version of the domain ontology for this thesis is part of our proposed work. A hypothetical sketch of such a description is shown below.
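For illustration, a domain description d_k for a restaurant domain might look like the following; the exact schema is part of the proposed work, so every field name here is an assumption, chosen only to reflect the three items listed above.

    restaurant_domain = {
        "name": "restaurant",
        "user_slots": ["location", "food_type"],       # slots the user specifies
        "system_slots": ["name", "parking", "hours"],  # slots the system can inform
        "system_frames": [                             # system utterance semantics
            {"act": "request", "slot": "location"},
            {"act": "implicit-confirm", "slot": "food_type"},
            {"act": "inform", "slot": "name"},
        ],
        "user_frames": [                               # user utterance semantics
            {"act": "inform", "slot": "location"},
            {"act": "yn-question", "slot": "parking"},
        ],
    }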

4.3 Corpora for Benchmarking Learning with Knowledge

In order to evaluate LWK's ability to generalize to new domains, multi-domain dialog datasets are needed. Unfortunately, the currently public dialog datasets for both task- and chat-oriented dialogs were not designed for this purpose. The closest ones are the data from the dialog state tracking challenges (DSTC) since 2013 [98], which are mostly task-oriented dialogs collected from various systems, e.g. CMU Let's Go [67]. However, these datasets are not satisfactory for two main reasons: 1) there are not enough domains in these datasets, and our focus is to train models on a large number of domains to probe the limits of domain generalization; 2) these data were collected from different hand-crafted dialog systems, which do not exhibit complex enough system behavior to test the expressive power of GEDMs. Therefore, we develop two new corpora, one synthetic and one real-world, designed to be used as benchmarks for developing GEDMs in multiple domains. The following section describes the synthetic one (completed work), and Section 5.1.1 presents our plan to collect a new real-world corpus.

4.3.1 SimDial: A Multi-domain Synthetic Dialog Generator

Collecting a large real-world dialog corpus is a tedious process. Therefore, simulated data has commonly been used as the initial test bed for evaluating and training dialog systems [22, 99]. SimDial is a configurable, domain-agnostic synthetic dialog generator that can generate synthetic conversations in any slot-filling domain with various types of noise conditions.


        Environmental   Interaction
        ASR error       Hesitation   Self Restart   Self Correct
Clean   0.0             0.0          0.0            0.0
Noisy   0.3             0.4          0.1            0.2

        Propositional
        Yes/No Question Mix   Reject   Two-slots   Don't Care   Two Goals
Clean   0.0                   0.0      0.0         0.0          0.0
Noisy   0.4                   0.5      0.3         0.1          0.4

Table 4.1: Complexity Specifications for clean and noisy conditions.

The overall architecture of SimDial is shown in Figure 4.2. To generate

Figure 4.2: Overall Architecture of SimDial Data Generator

conversation data using SimDial, developers need to provide two specifications: a domain specification (DS) and a complexity specification (CS). The DS defines the content of the target domain, which is equivalent to the concept of domain description defined in LWK. The CS, on the other hand, defines how complex the generated conversations will be along three aspects:

1. Environmental Complexity: defines the noise level of the communication channel, e.g. the ASR error rate.

2. Propositional Complexity: defines the level of linguistic variation in user utterances in terms of propositional content.

3. Interaction Complexity: defines the level of linguistic variation in user utterances in terms of real-time interaction.

Table 4.1 shows the two complexity specs that will be used in the rest of this chapter. The first one is Clean, which introduces zero noise; the second is Noisy, which introduces non-trivial noise in all three aspects.

Moreover, SimDial is equipped with a user simulator and a system simulator that can converse with each other. We assume the system output is perfectly transmitted to the user, while the


user output is corrupted by a noisy channel that introduces uncertainty and errors. Therefore, the system needs to carry out error-handling strategies [6] to successfully complete the tasks. The following are two examples, with the clean CS and the noisy CS respectively (the number before each user utterance is the simulated ASR confidence score).

Clean Complexity Spec
Sys: What can I do for you?
User (1.0): Hi. Recommend a restaurant.
Sys: Which place?
User (1.0): I am at Pittsburgh.
Sys: What kind of food do you like?
User (1.0): Chinese food.
Sys: China Palace is a good choice. Anything else?
User (1.0): What kind of parking options does it have?
Sys: It has street parking.
User (1.0): Thanks. Bye.

Noisy Complexity Spec
Sys: What can I do for you?
User (0.54): Hello robot. I need to find a restaurant.
Sys: Which city are you interested in?
User (0.75): Boston. Oh no, at uhm Seattle actually. And I like Chinese food.
Sys: I believe you said Seattle and Chinese food. Restaurant Tao is a nice place. Anything else?
User (0.6): Not done yet. Is it open now?
Sys: Yes, it is open right now. What else?
User (0.69): No more questions. Thanks.
Sys: Bye.

It is evident that the construction of SimDial matches LWK's assumption about transferring knowledge among domains. This is because SimDial is essentially a domain-agnostic function that can continue to operate in new domains given just the DS (domain description). The generated data thus share a common process that decides the data distribution. SimDial data therefore allow us to evaluate whether the proposed models are able to learn the underlying domain-agnostic dialog decision-making function from the noisy surface natural language form, without worrying about whether the data follow this basic assumption. Therefore, SimDial provides an ideal prerequisite before testing the proposed methods on the real-world dataset, which is described in Section 5.1.1.

4.4 Pilot Study on SimDial

4.4.1 Study Overview

This section explores the limits of the baseline encoder-decoder model by varying the level of complexity and the amount of training data. Specifically, we use SimDial to generate synthetic conversations with two different complexity specifications and three different domain descriptions. Then we trained a standard hierarchical encoder-decoder


model [74, 116] with attention [2] and evaluated the system performance using the BEAK score and its breakdown, defined in Section 3.5. The goals are two-fold: first, we wish to show that correctly modeling data generated from SimDial is not trivial even for powerful encoder-decoder models; second, we wish to show the limitations of standard E2E models in generalizing across domains and dealing with small data, which paves the road for the LWK framework proposed in the rest of this proposal. The results of this study will serve as the baseline for the advanced methods proposed later.

4.4.2 Baseline Model

Following the same notation defined in Section 3.2, we first use a bi-directional GRU utterance encoder to encode every utterance in the context c. Let the last hidden layers of the forward and backward GRUs be h^f and h^b; then the final embedding of an utterance and its meta information is e(u_t, m_t) = [h^f_{|u_t|}; h^b_1; p; f], where p is the ASR confidence score and f is a binary bit that indicates the speaker (system or user). Then a context encoder GRU is used to encode the list of utterance embeddings according to the recurrent update h_t = GRU(e(u_t, m_t), h_{t−1}). Last, a decoder GRU with attention is used to generate the system response by attending to the hidden states h_t of the context GRU at each turn.

a_{jt} = softmax(h_t^T W_a s_j + b_a)    (4.1)

c_j = \sum_t a_{jt} h_t    (4.2)

s̃_j = tanh(W_s [s_j; c_j])    (4.3)

p(w_j | s_j, c_j) = softmax(W_o s̃_j)    (4.4)

Also, this model follows our KB-as-an-environment (KaaE) approach described in Section 3.3, so that the encoder also reads in the results from the KB as an utterance, and the decoder generates system utterances that convey the information in the KB results. An example KB result is: {Name: Paris 66; Type: French; Hours: 9am-8pm}. We treat this structured result (i.e. a dictionary) the same as a sequence of word tokens, and we use the same bi-directional GRU utterance encoder to obtain its embedding. Last, we do not include the latent action variable proposed in Section 3.4, because the current system simulator in SimDial is deterministic at the dialog-act level; we therefore omit the stochastic node here for simplicity. We are currently constructing a system simulator that will have configurable stochastic behavior, and we plan to evaluate the benefit of the stochastic policy in the proposed future work.
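One plausible flattening of such a KB result into a token sequence is sketched below; the exact tokenization is not specified in the text, so this is only an assumption.

    kb_result = {"Name": "Paris 66", "Type": "French", "Hours": "9am-8pm"}
    # Flatten the dict into tokens so the bi-directional GRU encoder can read it.
    tokens = [tok for key, val in kb_result.items()
              for tok in [key + ":"] + val.split()]
    print(tokens)  # ['Name:', 'Paris', '66', 'Type:', 'French', 'Hours:', '9am-8pm']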

4.4.3 Experiments

Four datasets were generated using SimDial. For the complexity specs, we use the noisy and clean versions defined in Table 4.1. For the domain specifications, three versions were constructed: Restaurant (Rest), Restaurant-2 (Rest-2) and Bus. Four combinations


              Vocab Size   Avg Dialog Length   # User Slots   # Sys Slots
Clean-Rest    251          11.1                2              4
Noisy-Rest    288          17.58               2              4
Noisy-Rest-2  372          17.56               2              4
Noisy-Bus     386          20.62               3              3

Table 4.2: Statistics of the four datasets.

were used: Clean-Rest, Noisy-Rest, Noisy-Rest-2, and Noisy-Bus. Table 4.2 shows the statistics of each dataset. By default, 2000 dialogs are generated for training and 500 dialogs for testing.

Three experiments were conducted. (1) The first experiment looks at the effect of introducing non-trivial noise. The baseline is trained on 2000 dialogs from each of Clean-Rest and Noisy-Rest. For each setting, the trained model is tested on another 500 generated dialogs. (2) The second experiment looks at how the size of the training data affects system performance. We used the Noisy-Rest domain and gradually reduced the number of training dialogs from 2000 down to 20. All the models were tested on the same 500 dialogs. (3) The last experiment tests generalization ability in a different domain. We trained the baseline model on 2000 Noisy-Rest dialogs and tested on three datasets. The first one is data also from Noisy-Rest; the second one is from Noisy-Rest-2, which also contains dialogs about restaurants (i.e. the same set of slots) but with a completely different slot vocabulary (aka slot expansion); the last one is Noisy-Bus, which contains conversations from a completely new domain.

The utterance encoder is a bidirectional GRU network with hidden size 256 for each direction. The context encoder and response decoder both have size 512 with GRU cells. The word embedding size is 200 and is shared by every RNN. The model is trained with Adam [42] with a learning rate of 1e-3 and a batch size of 20. Early stopping is conducted according to the validation loss. 30% dropout is applied at the input and output layers of each RNN [111] to alleviate overfitting.

4.4.4 Results

Clean vs. Noisy Data

            PPL     BLEU   Ent F1   Act F1   KB F1   BEAK
Clean-Rest  1.123   62.4   99.8%    100%     99.9%   90.6
Noisy-Rest  1.174   60.1   83.5%    92.5%    92.6%   82.2

Table 4.3: Results on Clean vs. Noisy Data

Table 4.3 shows the performance of the baseline model on the restaurant domain with clean vs. noisy complexity. Since the models are trained on a sufficiently large dataset, they achieve good performance. The baseline system obtains near-perfect performance on the clean dataset. On the noisy dataset, the performance drops,


especially on the Ent F1 score (99.8% to 83.5%), and there is about an 8% drop in terms of dialog acts and KB search. The major errors fall into the following categories:

1. The model fails to imitate the system simulator's policy: for example, the ground-truth system decides to implicitly confirm a slot, e.g. "You said Pittsburgh", while the model predicts an explicit confirm, e.g. "Do you mean Pittsburgh". This indicates that the model was not able to learn a good enough dialog state representation, because the simulator decides the next action according to a confidence score tracked by an internal Hidden Markov Model (HMM).

2. The model fails to refer to the correct slot: "Do you mean Boston?" instead of "Do you mean Pittsburgh?". This indicates the model is not able to track all the slot values, a problem that should be alleviated by the delexicalized memory proposed in Section 3.5.

3. The model fails to generate the correct KB search: "Search Food Type=French" instead of "Search Food Type=Chinese". Again, this problem should be improved by the delexicalized memory mechanism.

The above analysis indicates that there is room for improving the model architecture even in cases where training data is sufficient. We plan to show the updated results after finishing the proposed work in Chapter 3. Moreover, the results indicate that solving the generated SimDial data can be a challenging task for state-of-the-art neural encoder-decoder models under a reasonably complex complexity specification. The task can be even harder if a more aggressive complexity specification is used.

Reduced Training Data

Train Size   PPL     BLEU   Ent F1   Act F1   KB F1   BEAK
2000         1.174   60.1   83.5%    92.5%    92.6%   82.2
1000         1.173   58.4   82.0%    91.0%    86.4%   79.4
500          1.275   49.4   73.8%    90.1%    60.7%   67.7
200          1.782   37.5   39.3%    90.3%    31.7%   49.4
20           2.181   24.6   26.4%    78.7%    27.1%   39.2

Table 4.4: Varying the size of training data on Noisy-Rest

Table 4.4 shows the effect of reducing the size of the training data on the baseline model. First, performance indeed decreases as the amount of training data shrinks. The system is able to maintain its performance when trained with 1000 dialogs, but performance drops significantly once the training size falls to 500 dialogs or fewer. When using only 20 dialogs, training becomes unstable and the system is prone to falling into local optima and generating illegible sentences. Moreover, an interesting phenomenon is that the F-1 score for dialog acts drops much more slowly than the F-1 scores for Entity and KB (i.e., Act F-1 is still over 90% while the others drop to about 40%). This implies that learning to predict the form of the next system response is a simpler task than accurately generating the correct key words, e.g. a KB query or a slot value.


Testing on New Domains

              PPL     BLEU   Ent F1   Act F1   KB F1   BEAK
Noisy-Rest    1.174   60.1   83.5%    92.5%    92.6%   82.2
Noisy-Rest-2  5.504   36.1   51.7%    91.6%    53.2%   58.1
Noisy-Bus     926.4   14.2   37.4%    79.6%    38.0%   39.1

Table 4.5: Results on testing in new domains.

At last, Table 4.5 shows the performance of a model trained on Noisy-Rest with 2000 training dialogs and tested on two new domains, i.e. Noisy-Rest-2 and Noisy-Bus. As described above, Rest-2 is a domain very similar to Rest, sharing the same slot types and system utterance templates. However, Rest-2 has a quite different slot vocabulary compared to Rest. The results indicate that the model is able to maintain a similar dialog act F-1 score, but the performance on entity and KB drops to about 50%. This implies that the model can still predict the next action to take even though the slot values are different. However, the model is no longer able to generate these slots accurately in the decoding stage. This is reflected in both the BLEU score and the low performance for entity and KB query. The typical error examples in Table 4.6 further confirm the model's incompetence in generating novel slot values that are not observed in the training data.

Rest-2:
  Target:    Do you mean healthy?
  Generated: Do you mean Boston?

  Target:    KB FOOD: salad LOC: Mellon Park RET: open
  Generated: KB FOOD: Italian LOC: Austin RET: open

Bus:
  Target:    Bus 143 can take you there. What else can I do?
  Generated: Restaurant 69 is a good choice. Are you happy about my answer?

  Target:    I believe you said Airport. When are you going?
  Generated: Do you mean Thai?

Table 4.6: Example Errors

Then we look at the more challenging case of testing on the bus domain, which differs from Rest in terms of both slots and system utterances. Now the challenges are two-fold: (1) generate the correct sentence form, and (2) make sure the entities and KB queries are correct. We observed that the model failed in both respects. Table 4.6 shows some typical errors. For the first example in the Bus domain, the model is able to generate a response with the correct dialog act, i.e. inform + request-more. However, the lexical realization is completely in the restaurant domain, which leads to poor scores for Entity F-1 and BLEU. The second example shows that for a system utterance that is shared among domains, i.e. a domain-independent grounding utterance, the model fails to generate the correct slot "airport". Instead, it can only generate a restaurant-domain slot, "Thai" food in this case.


4.4.5 Discussion

Based on the above experiments, the standard dialog model can achieve almost perfect performance when trained with full-size data under a clean condition. We then observed a minor performance drop when the model is trained to imitate the more complex system behavior under the noisy complexity specification. This suggests there is space for improving the E2E model itself, even when training with large data. The second experiment confirms that training deep E2E models requires large training data. When the data size is too small, the evidence from the training data fails to provide enough constraints to learn robust models, leading to overfitting. The next question, then, is: does training a general GEDM jointly on several source domains with abundant data solve the problem? Our last experiment shows that blindly training a GEDM on related domains cannot produce models that generalize to new domains. Table 4.6 shows that when the new domain is a result of slot expansion, the baseline GEDM fails to generate the new entities. When the new domain is completely different from the source, the baseline fails to generate either the entities or the correct system utterance forms. In summary, our error analysis indicates several key issues that need to be addressed: (1) The encoder network needs to extract better salient representations from the dialog history to better support the imitation of the policy. (2) The encoder network needs to learn both domain-invariant and domain-specific representations of the dialog context. (3) The decoder needs to handle OOV entities that are not included in the training. (4) The decoder network needs to learn to generalize to new, domain-specific system utterances with limited resources.


Chapter 5

Proposed Work and Timeline

5.1 Proposed Work

5.1.1 Create a Human-human Multi-domain Dialog Corpus

SimDial is a great test bed for studying the development of dialog systems that can simultaneously master several domains. However, it is, after all, synthetic data and will have discrepancies with real-world datasets. Specifically, SimDial data can only be used as a prerequisite exam before testing models in real-world situations. Therefore, to fully evaluate the proposed method, we plan to collect a large human-human corpus in multiple task-driven domains via Amazon Mechanical Turk. Specifically, we will pair two workers in a chat room and show them the DS at the beginning of the session. Then the two workers are asked to have a task-oriented dialog in which one worker plays the role of a user and the other plays the role of a computer. The data collection process will cover several different domains, and we plan to collect over 10K dialogs in total. The exact experimental design is part of the proposed work. The following key properties are expected from this real-world dialog dataset:

1. Complex natural language expressions in each of the given domains, from both users and systems.

2. Mixed task and chat conversations. The worker who plays the system can generate interesting system responses that are beyond the limits of hand-crafted dialog acts and strategies. The workers may not always stay on task and may go off on garden-path behavior.

3. Real user behavior. SimDial attempts to model some user behaviors, e.g. changing goals, via its propositional complexity function. The collected human-human dataset is expected to contain more complex behavior patterns.

4. Stochastic policy: since multiple workers will play the role of the system, the underlying policy will by no means be the same, whereas SimDial is generated from a single dialog policy.

Therefore, the collected human-human data will serve as a validity check to confirm that the results from SimDial generalize to real-world applications.


5.1.2 Improve Performance on Single Domain

Improve Latent Action Random Variable

The latent variable model proposed in Section 3.4 assumes a uni-modal Gaussian distribution over the system response. However, it is common that an optimal dialog system should consider a multi-modal distribution in which there are several groups of possible responses. Due to the mode-missing property of the KL divergence KL(q‖p), the proposal distribution will capture only one of the modes of the true posterior distribution. The following dialog is an example. At one stage in the dialog, the system may have two goals to pursue: 1) stay on the main task and ask about the second slot, or 2) switch to social talk, self-disclosure in this case, to improve rapport. Both options are valid responses and have their own variations that need to be modeled. Ideally, the latent action variable should have two modes, each corresponding to one type of intention.

System-1: What kind of movie do you like?
User-1: I like sci-fi movies a lot.
System-2: (option-1) Okay. Which theater do you want to go to?
System-2: (option-2) I like sci-fi movies too! Star Wars is my favorite. How about you?
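One candidate for such a multi-modal latent action is a mixture-of-Gaussians latent variable, sketched below; this is purely illustrative of the desired two-mode behavior, not a committed design for the proposed work.

    import torch

    def sample_mixture_z(mode_logits, means, log_vars):
        """mode_logits: (K,); means, log_vars: (K, dim). One Gaussian per intention."""
        k = torch.distributions.Categorical(logits=mode_logits).sample()
        eps = torch.randn_like(means[k])
        return means[k] + torch.exp(0.5 * log_vars[k]) * eps  # reparameterized sample

    # Two modes, e.g. "stay on task" vs. "social talk", each a 16-dim Gaussian.
    z = sample_mixture_z(torch.zeros(2), torch.randn(2, 16), torch.zeros(2, 16))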

Improve Delexicalized Memory

Section 3.5 proposes an abstract construction of the delexicalized memory, which consists of a write function w and a read function r. w is responsible for recognizing salient entities in the dialog context and creating entity-independent features φ(·) for each entity. The read function r is used to select a relevant entity according to φ and then generate the corresponding entity. So far, the write function is implemented as EI and the read function as UL. Both methods depend on an external NER to recognize entities in user utterances. Unfortunately, a dialog system often needs to recognize fine-grained concepts that are not covered by off-the-shelf NERs. Therefore, we plan to work on an E2E delexicalized memory that can learn to extract salient entities from a raw dialog history and register them in a neural memory block that contains both the feature φ(e) and the pointer from the feature to the linked entity e. There are several promising directions for achieving this goal, drawing inspiration from recent variations of the attention mechanism [52], Neural Turing Machines [34] and memory networks [81].

5.1.3 Fully Develop Learning with Knowledge

Develop Domain Description and Its Encoders

Section 4.2 identifies key attributes of a good domain description and some promising foundations to work with. We plan to build on the domain ontology developed in our past work [113] and formalize its format into a general domain description for any type of slot-filling task-oriented dialog system. Moreover, the resulting domain description format may contain highly structured data, such as lists and dictionaries. These data formats are challenging for a deep neural network to read in and


create meaningful representations from. Therefore, we will also work on developing novel neural encoders to encode the information in our domain descriptions. The resulting methods will be crucial for informing a domain-aware GEDM about its current domain information, and will also be valuable to the AI community in general for encoding structured data using neural networks.

Develop Fusion Methods to Incorporate Domain Knowledge in GEDMs

Given the domain description and its encoded representation, the next major challenge is how to fuse this source of information into our GEDM, so that it can seamlessly alter its output behavior conditioned on a certain domain. More specifically, there are two fusion processes that need to be addressed. The first is the fusion between a domain description and the encoder part of a GEDM. Given a domain, the encoder should exhibit different encoding behavior on the same dialog context. For example, in one domain the encoder should focus on recognizing locations, and in a different domain it should focus on recognizing weather types. This becomes more complex because there can be shared slots among domains, so that besides domain-specific behavior, the encoder should also learn domain-independent behavior at the same time. Moreover, the above expected behavior also affects the write function of the proposed delexicalized memory. Therefore, solving the fusion problem between domain knowledge and the encoder is crucial to providing a robust dialog context feature extractor for the decoder. The second part is the fusion between a domain description and the decoder. The experiments in Section 4.4.2 show that without domain knowledge, the decoder fails to generate domain-specific entities and utterances. Some of the key questions that need to be answered are: (1) how to incorporate domain knowledge into the read function of the delexicalized memory; and (2) in the zero-shot dialog learning case, how the decoder can learn domain-specific utterances (e.g. "the weather is XX" in a weather domain) without observing any weather dialogs. Solving this problem is in fact solving a more general question: how to adapt the output distribution of an RNN text decoder to new domains with limited resources.

5.2 Timeline

At last, the timeline for this proposal is organized around the proposed milestones. The overall goal is to complete the final thesis within 16 months (thesis defense in April 2019).

• March 2018: Collect the human-human dataset on Mechanical Turk (Section 5.1.1).
• April 2018: Improve performance on a single domain (Section 5.1.2).
  - Work on the E2E delexicalized memory.
  - Work on improving the latent action variable.
• December 2018: Fully develop LWK (Section 5.1.3).
  - Develop domain descriptions.


  - Develop domain-aware encoders.
  - Develop domain-aware decoders.
• February 2019: Thesis writing.
• April 2019: Thesis defense.


Bibliography

[1] Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207. 2

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. 2.3.1, 3.5.3, 4.4.1

[3] Ankur Bapna, Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck. 2017. Towards zero-shot frame semantic parsing for domain scaling. arXiv preprint arXiv:1707.02363. 2.4.2

[4] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python. O'Reilly Media, Inc. 3.4.3

[5] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan):993–1022. 2.1.2

[6] Dan Bohus and Alexander I Rudnicky. 2003. RavenClaw: Dialog management using hierarchical task decomposition and an expectation agenda. Computer Speech and Language. 4.2.1, 4.3.1

[7] Dan Bohus and Alexander I Rudnicky. 2005. Error handling in the RavenClaw dialog management framework. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 225–232. 3.5.2

[8] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, pages 1247–1250. 3.3.4

[9] Antoine Bordes and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683. 2.1.1

[10] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349. 2.3.2, 3.4.1, 3.4.2, 3.4.5

[11] Kris Cao and Stephen Clark. 2017. Latent variable dialogue models and their diversity. arXiv preprint arXiv:1702.05962. 2.1.2

61

Page 74: Learning Generative End-to-end Dialog Systems with Knowledgetianchez/data/TianchezPhdProposal.pdf · November 21, 2017 DRAFT Abstract Dialog systems are intelligent agents that can

November 21, 2017DRAFT

[12] Rich Caruana. 1998. Multitask learning. In Learning to learn, Springer, pages 95–133.2.4.1

[13] Boxing Chen and Colin Cherry. 2014. A systematic comparison of smoothing tech-niques for sentence-level bleu. ACL 2014 page 362. 1

[14] Yun-Nung Chen, Dilek Hakkani-Tur, and Xiaodong He. 2016. Zero-shot learning ofintent embeddings for expansion by convolutional deep structured semantic models.In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conferenceon. IEEE, pages 6045–6049. 2.4.2

[15] Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau,Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase rep-resentations using rnn encoder-decoder for statistical machine translation. arXivpreprint arXiv:1406.1078 . 1.1, 2.1.1, 2.3.1

[16] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014.Empirical evaluation of gated recurrent neural networks on sequence modeling.arXiv preprint arXiv:1412.3555 . 3.4.2

[17] Herbert H Clark, Susan E Brennan, et al. 1991. Grounding in communication. Per-spectives on socially shared cognition 13(1991):127–149. 3.5.2

[18] Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu,and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journalof Machine Learning Research 12(Aug):2493–2537. 2.4.1

[19] Pavel Curtis. 1992. Mudding: Social phenomena in text-based virtual realities. Highnoon on the electronic frontier: Conceptual issues in cyberspace pages 347–374. 2.2

[20] Hal Daume III. 2009. Frustratingly easy domain adaptation. arXiv preprintarXiv:0907.1815 . 2.4, 4.2

[21] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, FaisalAhmed, and Li Deng. 2016. End-to-end reinforcement learning of dialogue agentsfor information access. arXiv preprint arXiv:1609.00777 . 2.1.1, 3.3.1

[22] Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexan-der Miller, Arthur Szlam, and Jason Weston. 2015. Evaluating prerequisite qualitiesfor learning end-to-end dialog systems. arXiv preprint arXiv:1511.06931 . 4.3.1

[23] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider,Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. 2017. One-shot imitation learn-ing. arXiv preprint arXiv:1703.07326 . 2.4.2, 4.2

[24] Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015. Low resource de-pendency parsing: Cross-lingual parameter sharing in a neural network parser. InACL (2). pages 845–850. 2.4.1

[25] Mihail Eric and Christopher D Manning. 2017. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. arXivpreprint arXiv:1701.04024 . 2.1.1, 2.3.1

[26] Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-

62

Page 75: Learning Generative End-to-end Dialog Systems with Knowledgetianchez/data/TianchezPhdProposal.pdf · November 21, 2017 DRAFT Abstract Dialog systems are intelligent agents that can

November 21, 2017DRAFT

sequence attentional neural machine translation. arXiv preprint arXiv:1603.06075 .2.3.1

[27] Theodoros Evgeniou and Massimiliano Pontil. 2004. Regularized multi–task learn-ing. In Proceedings of the tenth ACM SIGKDD international conference on Knowledgediscovery and data mining. ACM, pages 109–117. 2.4.1

[28] Gabriel Forgues, Joelle Pineau, Jean-Marie Larcheveque, and Real Tremblay. 2014.Bootstrapping dialog systems with word embeddings. In NIPS, Modern MachineLearning and Natural Language Processing Workshop. 2

[29] Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trendsin cognitive sciences 3(4):128–135. 4.1.1

[30] M Gasic, F Jurcıcek, Simon Keizer, Francois Mairesse, Blaise Thomson, Kai Yu, andSteve Young. 2010. Gaussian processes for fast policy optimisation of pomdp-baseddialogue managers. In Proceedings of the 11th Annual Meeting of the Special InterestGroup on Discourse and Dialogue. Association for Computational Linguistics, pages201–204. 2.2

[31] Milica Gasic and Steve Young. 2014. Gaussian processes for pomdp-based dia-logue manager optimization. IEEE/ACM Transactions on Audio, Speech, and LanguageProcessing 22(1):28–40. 2.4.2

[32] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifierneural networks. In Aistats. 106, page 275. 3.5.3

[33] John J Godfrey and Edward Holliman. 1997. Switchboard-1 release 2. LinguisticData Consortium, Philadelphia . 3.4.3

[34] Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXivpreprint arXiv:1410.5401 . 5.1.2

[35] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copyingmechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393 . 2.3.1

[36] Matthew Hausknecht and Peter Stone. 2015. Deep recurrent q-learning for partiallyobservable mdps. arXiv preprint arXiv:1507.06527 . 2.2

[37] Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun.2016. Tracking the world state with recurrent entity networks. arXiv preprintarXiv:1612.03969 . 2.1.2

[38] Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neuralcomputation 9(8):1735–1780. 2.2, 3.5.3

[39] Dan Jurafsky, Elizabeth Shriberg, and Debra Biasca. 1997. Switchboard swbd-damsl shallow-discourse-function annotation coders manual. Institute of CognitiveScience Technical Report pages 97–102. 3.3.4

[40] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. 1998. Plan-ning and acting in partially observable stochastic domains. Artificial intelligence101(1):99–134. 3.6

63

Page 76: Learning Generative End-to-end Dialog Systems with Knowledgetianchez/data/TianchezPhdProposal.pdf · November 21, 2017 DRAFT Abstract Dialog systems are intelligent agents that can

November 21, 2017DRAFT

[41] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXivpreprint arXiv:1408.5882 . 3.5.3

[42] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimiza-tion. arXiv preprint arXiv:1412.6980 . 3.4.3, 3.5.4, 4.4.3

[43] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXivpreprint arXiv:1312.6114 . 2.3.2, 3.4.2, 3.4.2, 3.4.5

[44] Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. 2008. Zero-data learning ofnew tasks. In AAAI. volume 1, page 3. 1.1, 2.4, 2.4.2

[45] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. Adiversity-promoting objective function for neural conversation models. arXiv preprintarXiv:1510.03055 . 2.1.2, 2, 3.4.1, 1

[46] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. Apersona-based neural conversation model. arXiv preprint arXiv:1603.06155 . 2.1.2

[47] Jiwei Li, Will Monroe, Alan Ritter, and Dan Jurafsky. 2016. Deep reinforcementlearning for dialogue generation. arXiv preprint arXiv:1606.01541 . 2.1.2, 3.5.4

[48] Diane J Litman and James F Allen. 1987. A plan recognition model for subdialoguesin conversations. Cognitive science 11(2):163–200. 3.4.2

[49] Bing Liu and Ian Lane. 2016. Attention-based recurrent neural network models forjoint intent detection and slot filling. arXiv preprint arXiv:1609.01454 . 2.4.1

[50] Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin,and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empiricalstudy of unsupervised evaluation metrics for dialogue response generation. arXivpreprint arXiv:1603.08023 . 3.4.4

[51] Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser.2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114 . 2.4.1

[52] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effec-tive approaches to attention-based neural machine translation. arXiv preprintarXiv:1508.04025 . 2.3.1, 2.3.1, 3.5.3, 5.1.2

[53] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne.Journal of Machine Learning Research 9(Nov):2579–2605. 3.4.5

[54] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointersentinel mixture models. arXiv preprint arXiv:1609.07843 . 2.3.1

[55] Gregoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, DilekHakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. 2015. Us-ing recurrent neural networks for slot filling in spoken language understanding.IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 23(3):530–539. 2.1.1

[56] Tom M Mitchell. 1980. The need for biases in learning generalizations. Department ofComputer Science, Laboratory for Computer Science Research, Rutgers Univ. New

64

Page 77: Learning Generative End-to-end Dialog Systems with Knowledgetianchez/data/TianchezPhdProposal.pdf · November 21, 2017 DRAFT Abstract Dialog systems are intelligent agents that can

November 21, 2017DRAFT

Jersey. 2.4, 4.1

[57] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness,Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Os-trovski, et al. 2015. Human-level control through deep reinforcement learning. Nature518(7540):529–533. 2.2

[58] George E Monahan. 1982. State of the arta survey of partially observable markovdecision processes: theory, models, and algorithms. Management Science 28(1):1–16.2.2

[59] Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language un-derstanding for text-based games using deep reinforcement learning. arXiv preprintarXiv:1506.08941 . 2.2

[60] Andrew Y Ng, Daishi Harada, and Stuart Russell. 1999. Policy invariance underreward transformations: Theory and application to reward shaping. In ICML. vol-ume 99, pages 278–287. 3.3.3

[61] Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. 2017. Zero-shot task generalization with multi-task deep reinforcement learning. arXiv preprintarXiv:1706.05064 . 2.4.2, 4.2

[62] Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. 2009.Zero-shot learning with semantic output codes. In Advances in neural information pro-cessing systems. pages 1410–1418. 2.4.2, 4.2.1

[63] Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Trans-actions on knowledge and data engineering 22(10):1345–1359. 1.1, 2.4

[64] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: amethod for automatic evaluation of machine translation. In Proceedings of the 40th an-nual meeting on association for computational linguistics. Association for ComputationalLinguistics, pages 311–318. 1, 3.5.4

[65] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove:Global vectors for word representation. In EMNLP. volume 14, pages 1532–43. 3.4.3

[66] Massimo Poesio and David Traum. 1998. Towards an axiomatization of dialogueacts. In Proceedings of the Twente Workshop on the Formal Semantics and Pragmatics ofDialogues (13th Twente Workshop on Language Technology. Citeseer. 3.4.2

[67] Antoine Raux, Brian Langner, Dan Bohus, Alan W Black, and Maxine Eskenazi.2005. Lets go public! taking a spoken dialog system to the real world. In in Proc. ofInterspeech 2005. Citeseer. 3.4.2, 3.5.4, 4.2.1, 4.3

[68] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochas-tic backpropagation and approximate inference in deep generative models. arXivpreprint arXiv:1401.4082 . 2.3.2

[69] Eugenio Ribeiro, Ricardo Ribeiro, and David Martins de Matos. 2015. The influenceof context on dialogue act recognition. arXiv preprint arXiv:1506.00839 . 3.4.3

[70] Bernardino Romera-Paredes and Philip Torr. 2015. An embarrassingly simple ap-

65

Page 78: Learning Generative End-to-end Dialog Systems with Knowledgetianchez/data/TianchezPhdProposal.pdf · November 21, 2017 DRAFT Abstract Dialog systems are intelligent agents that can

November 21, 2017DRAFT

proach to zero-shot learning. In International Conference on Machine Learning. pages2152–2161. 2.4.2

[71] Sebastian Ruder. 2017. An overview of multi-task learning in deep neural net-works. arXiv preprint arXiv:1706.05098 . 2.4.1

[72] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2015. Prioritizedexperience replay. arXiv preprint arXiv:1511.05952 . 2.2

[73] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural net-works. IEEE Transactions on Signal Processing 45(11):2673–2681. 3.4.2

[74] Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and JoellePineau. 2015. Building end-to-end dialogue systems using generative hierarchicalneural network models. arXiv preprint arXiv:1507.04808 . 2.1.2, 2.3.1, 4.4.1

[75] Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and JoellePineau. 2016. Building end-to-end dialogue systems using generative hierarchicalneural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelli-gence (AAAI-16). 3.4.3

[76] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, JoellePineau, Aaron Courville, and Yoshua Bengio. 2016. A hierarchical latent variableencoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069 .2.1.2, 2.3.2, 3.4.1

[77] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured outputrepresentation using deep conditional generative models. In Advances in Neural In-formation Processing Systems. pages 3483–3491. 2.3.2, 3.4.1, 3.4.2

[78] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Mar-garet Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural networkapproach to context-sensitive generation of conversational responses. arXiv preprintarXiv:1506.06714 . 3.4.4

[79] Andreas Stolcke, Noah Coccaro, Rebecca Bates, Paul Taylor, Carol Van Ess-Dykema, Klaus Ries, Elizabeth Shriberg, Daniel Jurafsky, Rachel Martin, and MarieMeteer. 2000. Dialogue act modeling for automatic tagging and recognition of con-versational speech. Computational linguistics 26(3):339–373. 3.4.3

[80] Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, DavidVandyke, Tsung-Hsien Wen, and Steve Young. 2016. Continuously learning neuraldialogue management. arXiv preprint arXiv:1606.02689 . 2.1.1

[81] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memorynetworks. In Advances in neural information processing systems. pages 2440–2448. 5.1.2

[82] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learningwith neural networks. In Advances in neural information processing systems. pages 3104–3112. 3.5.3

[83] Richard S Sutton. 1990. Integrated architectures for learning, planning, and re-acting based on approximating dynamic programming. In Proceedings of the seventh

66

Page 79: Learning Generative End-to-end Dialog Systems with Knowledgetianchez/data/TianchezPhdProposal.pdf · November 21, 2017 DRAFT Abstract Dialog systems are intelligent agents that can

November 21, 2017DRAFT

international conference on machine learning. pages 216–224. 3.3.3

[84] Richard S Sutton and Andrew G Barto. 1998. Introduction to reinforcement learning.MIT Press. 2.2

[85] Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction,volume 1. MIT press Cambridge. 3.3.2

[86] Johan AK Suykens and Joos Vandewalle. 1999. Least squares support vector ma-chine classifiers. Neural processing letters 9(3):293–300. 3.4.3

[87] Sebastian Thrun and Lorien Pratt. 2012. Learning to learn. Springer Science & Busi-ness Media. 2.4

[88] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gra-dient by a running average of its recent magnitude. COURSERA: Neural Networks forMachine Learning 4:2. 3.3.4

[89] Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003shared task: Language-independent named entity recognition. In Proceedings of theseventh conference on Natural language learning at HLT-NAACL 2003-Volume 4. Associa-tion for Computational Linguistics, pages 142–147. 3.5.3

[90] Grigorios Tsoumakas and Ioannis Katakis. 2006. Multi-label classification: Anoverview. International Journal of Data Warehousing and Mining 3(3). 3.5.4

[91] Hado Van Hasselt, Arthur Guez, and David Silver. 2015. Deep reinforcement learn-ing with double q-learning. arXiv preprint arXiv:1509.06461 . 2.2

[92] Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprintarXiv:1506.05869 . 1.1, 2.1.1, 2.3.1, 2.3.1

[93] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Showand tell: A neural image caption generator. In Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition. pages 3156–3164. 2.3.1

[94] Marilyn A. Walker. 2000. An application of reinforcement learning to dialoguestrategy selection in a spoken dialogue system for email. Journal of Artificial Intelli-gence Research pages 387–416. 2.2

[95] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M Rojas-Barahona, Pei-HaoSu, Stefan Ultes, David Vandyke, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562 . 2.1.1

[96] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M Rojas-Barahona, Pei-HaoSu, David Vandyke, and Steve Young. 2016. Multi-domain neural network languagegeneration for spoken dialogue systems. arXiv preprint arXiv:1603.01232 . 2.4.2

[97] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, andSteve Young. 2015. Semantically conditioned lstm-based natural language generationfor spoken dialogue systems. arXiv preprint arXiv:1508.01745 . 3.5.4

[98] Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. 2013. Thedialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference. pages

67

Page 80: Learning Generative End-to-end Dialog Systems with Knowledgetianchez/data/TianchezPhdProposal.pdf · November 21, 2017 DRAFT Abstract Dialog systems are intelligent agents that can

November 21, 2017DRAFT

404–413. 2.1.1, 3.5.4, 4.3

[99] Jason Williams and Steve Young. 2003. Using wizard-of-oz simulations to boot-strap reinforcement-learning-based dialog management systems. In Proceedings of the4th SIGDIAL Workshop on Discourse and Dialogue. 4.3.1

[100] Jason D Williams and Steve Young. 2007. Partially observable markov decisionprocesses for spoken dialog systems. Computer Speech & Language 21(2):393–422. 2.2

[101] Jason D Williams and Geoffrey Zweig. 2016. End-to-end lstm-based dialogcontrol optimized with supervised and reinforcement learning. arXiv preprintarXiv:1606.01269 . 2.1.1

[102] Sam Wiseman and Alexander M Rush. 2016. Sequence-to-sequence learning asbeam-search optimization. arXiv preprint arXiv:1606.02960 . 2.1.2

[103] Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma.2016. Topic augmented neural response generation with a joint attention mechanism.arXiv preprint arXiv:1606.08340 . 2.1.2

[104] Chen Xing, Wei Wu, Yu Wu, Ming Zhou, Yalou Huang, and Wei-Ying Ma. 2017.Hierarchical recurrent attention network for response generation. arXiv preprintarXiv:1701.07149 . 2.1.2

[105] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, An-dreas Stolcke, Dong Yu, and Geoffrey Zweig. 2017. The microsoft 2016 conversationalspeech recognition system. In Acoustics, Speech and Signal Processing (ICASSP), 2017IEEE International Conference on. IEEE, pages 5255–5259. 4.1

[106] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, RuslanSalakhutdinov, Richard S Zemel, and Yoshua Bengio. 2015. Show, attend and tell:Neural image caption generation with visual attention. In ICML. volume 14, pages77–81. 3.5.3

[107] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. 2015. Attribute2image:Conditional image generation from visual attributes. arXiv preprint arXiv:1512.00570. 2.3.2, 3.4.1, 3.4.2

[108] Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. 2016. Reference-awarelanguage models. arXiv preprint arXiv:1611.01628 . 2.1.1

[109] Stephanie Young, Jost Schatzmann, Karl Weilhammer, and Hui Ye. 2007. Thehidden information state approach to dialog management. In Acoustics, Speech andSignal Processing, 2007. ICASSP 2007. IEEE International Conference on. IEEE, volume 4,pages IV–149. 3.6, 4.2.1

[110] Zhou Yu, Alan W Black, and Alexander I Rudnicky. 2017. Learning conversationalsystems that interleave task and non-task content. arXiv preprint arXiv:1703.00099 .2.1.2

[111] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural net-work regularization. arXiv preprint arXiv:1409.2329 . 3.4.5, 3.5.4, 4.4.3

[112] Ran Zhao, Alexandros Papangelis, and Justine Cassell. 2014. Towards a dyadic

68

Page 81: Learning Generative End-to-end Dialog Systems with Knowledgetianchez/data/TianchezPhdProposal.pdf · November 21, 2017 DRAFT Abstract Dialog systems are intelligent agents that can

November 21, 2017DRAFT

computational model of rapport management for human-virtual agent interaction.In International Conference on Intelligent Virtual Agents. Springer, pages 514–527. 2.1.2

[113] Tiancheng Zhao. 2016. Reinforest: Multi-domain dialogue management usinghierarchical policies and knowledge ontology. Technical Report . 4.2.1, 5.1.3

[114] Tiancheng Zhao and Maxine Eskenazi. 2016. Towards end-to-end learning for dia-log state tracking and management using deep reinforcement learning. arXiv preprintarXiv:1606.02560 . 2.1.1, 2.2, 3.1, 3.3.1, 3.4.2, 3.5.3

[115] Tiancheng Zhao, Kyusong Lee, and Maxine Eskenazi. 2016. Dialport: Connectingthe spoken dialog research community to real user data. In Spoken Language Technol-ogy Workshop (SLT), 2016 IEEE. IEEE, pages 83–90. 1.2

[116] Tiancheng Zhao, Allen Lu, Kyusong Lee, and Maxine Eskenazi. 2017. Genera-tive encoder-decoder models for task-oriented spoken dialog systems with chattingcapability. arXiv preprint arXiv:1706.08476 . 2.1.1, 2.1.2, 3.1, 3.5.2, 4.4.1

[117] Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-leveldiversity for neural dialog models using conditional variational autoencoders. arXivpreprint arXiv:1703.10960 . 2.1.2, 2.3.2, 2, 3.1, 3.4.1

[118] Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generatingstructured queries from natural language using reinforcement learning. arXiv preprintarXiv:1709.00103 . 2.3.1

69


Recommended