
Knowledge as a Teacher: Knowledge-Guided Structural Attention Networks

Yun-Nung Chen* Dilek Hakkani-Tür† Gokhan Tur† Asli Celikyilmaz‡ Jianfeng Gao‡ Li Deng‡

*National Taiwan University, Taipei, Taiwan   †Google Research, Mountain View, CA   ‡Microsoft Research, Redmond, WA

{*y.v.chen, †dilek, †gokhan}@ieee.org   {‡aslicel, ‡jfgao, ‡deng}@microsoft.com

Abstract

Natural language understanding (NLU) is a core component of a spoken dialogue system. Recently, recurrent neural networks (RNN) have obtained strong results on NLU due to their superior ability to preserve sequential information over time. Traditionally, the NLU module tags semantic slots for utterances considering their flat structures, as the underlying RNN structure is a linear chain. However, natural language exhibits linguistic properties that provide rich, structured information for better understanding. This paper introduces a novel model, knowledge-guided structural attention networks (K-SAN), a generalization of RNN that additionally incorporates non-flat network topologies guided by prior knowledge. The model has two characteristics: 1) important substructures can be captured from small training data, allowing the model to generalize to previously unseen test data; 2) the model automatically identifies the salient substructures that are essential for predicting the semantic tags of the given sentences, so that understanding performance can be improved. Experiments on the benchmark Air Travel Information System (ATIS) data show that the proposed K-SAN architecture can effectively extract salient knowledge from substructures with an attention mechanism, and outperforms state-of-the-art neural network based frameworks.

1 Introduction

In the past decade, goal-oriented spoken dialogue systems (SDS), such as the virtual personal assistants Microsoft's Cortana and Apple's Siri, have been incorporated into various devices, allowing users to speak to systems freely in order to finish tasks more efficiently. A key component of these conversational systems is the natural language understanding (NLU) module; it refers to the targeted understanding of human speech directed at machines (Tur and De Mori, 2011). The goal of such "targeted" understanding is to convert the recognized user speech into a task-specific semantic representation of the user's intention, at each turn, that aligns with the back-end knowledge and action sources for task completion. The dialogue manager then interprets the semantics of the user's request and associated back-end results, and decides the most appropriate system action by exploiting semantic context and user-specific meta-information, such as geo-location and personal preferences (McTear, 2004; Rudnicky and Xu, 1999).

A typical NLU pipeline includes domain classification, intent determination, and slot filling (Tur and De Mori, 2011). NLU first decides the domain of the user's request given the input utterance and, based on the domain, predicts the intent and fills the associated slots corresponding to a domain-specific semantic template. For example, Figure 1 shows a user utterance, "show me the flights from seattle to san francisco", and its semantic frame, find_flight(origin="seattle", dest="san francisco"). It is easy to see the relationship between the origin city and the destination city in this example, although these do not appear next to each other.
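As a concrete illustration of the pipeline's output, the hedged sketch below represents the semantic frame above as a simple data structure; the SemanticFrame class and its field names are hypothetical, not a schema defined in the paper.

```python
# A hedged sketch of the semantic frame find_flight(origin="seattle",
# dest="san francisco"); the class and field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class SemanticFrame:
    domain: str                                  # output of domain classification
    intent: str                                  # output of intent determination
    slots: dict = field(default_factory=dict)    # output of slot filling

frame = SemanticFrame(
    domain="flight",
    intent="find_flight",
    slots={"origin": "seattle", "dest": "san francisco"},
)
print(frame)
```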


show me the flights from seattle to san francisco
O    O  O   O       O    B-origin O  B-dest I-dest

Figure 1: An example utterance annotated with its semantic slots in the IOB format (S).

Traditionally, domain detection and intent prediction are framed as utterance classification problems, where several classifiers such as support vector machines and maximum entropy have been employed (Haffner et al., 2003; Chelba et al., 2003; Chen et al., 2014). Slot filling is then framed as a word sequence tagging task, where the IOB (in-out-begin) format is applied for representing slot tags as illustrated in Figure 1, and hidden Markov models (HMM) or conditional random fields (CRF) have been employed for slot tagging (Pieraccini et al., 1992; Wang et al., 2005).
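To make the IOB convention concrete, here is a minimal sketch (not from the paper) that converts span-level slot annotations into per-word IOB tags for the Figure 1 example; the helper name and the span encoding are assumptions for illustration.

```python
# Minimal IOB conversion sketch: B- marks the first word of a slot,
# I- marks its continuation, and O marks words outside any slot.

def to_iob(words, slots):
    """slots maps a slot name to the (start, end) word indices it covers."""
    tags = ["O"] * len(words)
    for name, (start, end) in slots.items():
        tags[start] = "B-" + name                 # first word of the slot
        for i in range(start + 1, end + 1):       # continuation words
            tags[i] = "I-" + name
    return tags

words = "show me the flights from seattle to san francisco".split()
slots = {"origin": (5, 5), "dest": (7, 8)}        # word indices of the spans
print(list(zip(words, to_iob(words, slots))))
# seattle -> B-origin, san -> B-dest, francisco -> I-dest, the rest -> O
```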

With the advances in deep learning, deep belief networks (DBNs) with deep neural networks (DNNs) have been applied to domain and intent classification tasks (Sarikaya et al., 2011; Tur et al., 2012; Sarikaya et al., 2014). Recently, Ravuri and Stolcke (2015) proposed an RNN architecture for intent determination. For slot filling, deep learning has been viewed as a feature generator whose neural architecture can be merged with CRFs (Xu and Sarikaya, 2013). Yao et al. (2013) and Mesnil et al. (2015) later employed RNNs for sequence labeling in order to perform slot filling. However, the above studies benefit from large training data without leveraging any existing knowledge. When tagging sequences, RNNs, with their underlying linear-chain structures, treat them as flat structures, potentially ignoring the structured information typical of natural language sequences.

Hierarchical structures and semantic relationships contain linguistic characteristics of the input word sequences forming sentences, and such information may help interpret their meaning. Furthermore, prior knowledge would help in the tagging of sequences, especially when dealing with previously unseen sequences (Tur et al., 2010; Deoras and Sarikaya, 2013). Prior work exploited external web-scale knowledge graphs such as Freebase and Wikipedia for improving NLU (Heck et al., 2013; Ma et al., 2015b; Chen et al., 2014). Liu et al. (2013) and Chen et al. (2015) proposed approaches that leverage linguistic knowledge encoded in parse trees for language understanding, where the extracted syntactic structural features and semantic dependency features enhance inference model learning, and the models achieve better language understanding performance in various domains.

Even with the emerging paradigm of integrating deep learning and linguistic knowledge for different NLP tasks (Socher et al., 2014), most of the previous work utilized such linguistic knowledge and knowledge bases as additional features fed into neural networks, and then learned the models for tagging sequences. These feature-enrichment based approaches have two possible limitations: 1) poor generalization and 2) error propagation. Poor generalization comes from the mismatch between knowledge bases and the input data, and features incorrectly extracted due to errors in earlier processing propagate those errors to the neural models. In order to address these issues and better learn sequence tagging models, this paper proposes knowledge-guided structural attention networks, K-SAN, a generalization of RNNs that automatically learns attention guided by external or prior knowledge and generates sentence-based representations specifically for modeling sequence tagging. The main difference between K-SAN and previous approaches is that knowledge plays the role of a teacher to guide the network where and how much to focus attention while considering the whole linguistic structure simultaneously. Our main contributions are three-fold:

• End-to-end learning: To our knowledge, this is the first neural network approach that utilizes general knowledge as guidance in an end-to-end fashion, where the model automatically learns important substructures with an attention mechanism.

• Generalization for different knowledge: There is no required schema for the knowledge, and different types of parsing results, such as dependency relations, knowledge graph-specific relations, and the parsing output of hand-crafted grammars, can serve as the knowledge guidance in this model.

• Efficiency and parallelizability: Because the substructures from the input utterance are modeled separately, modeling time may not increase linearly with respect to the number of words in the input sentence.

In the following sections, we empirically show the benefit of K-SAN on the targeted NLU task.

2 Related Work

Knowledge-Based Representations There is an emerging trend of learning representations at different levels, such as word embeddings (Mikolov et al., 2013), character embeddings (Ling et al., 2015), and sentence embeddings (Le and Mikolov, 2014; Huang et al., 2013). In addition to fully unsupervised embedding learning, knowledge bases have been widely utilized to learn entity embeddings with specific functions or relations (Celikyilmaz and Hakkani-Tur, 2015; Yang et al., 2014). Different from prior work, this paper focuses on learning composable substructure embeddings that are informative for understanding.

Recently, linguistic structures have been taken into account in deep learning frameworks. Ma et al. (2015a) and Tai et al. (2015) both proposed dependency-based approaches to combine deep learning and linguistic structures, where the model used tree-based n-grams instead of surface ones to capture knowledge-guided relations for sentence modeling and classification. Roth and Lapata (2016) utilized lexicalized dependency paths to learn embedding representations for semantic role labeling. However, the performance of these approaches highly depends on the quality of parsing the "whole" sentence, and there is no control over the degree of attention on different substructures. Learning robust representations that incorporate whole structures still remains unsolved. In this paper, we address this limitation by proposing K-SAN to learn robust representations of whole sentences, where the whole representation is composed of the salient substructures in order to avoid error propagation.

Neural Attention and Memory Model One of the earliest works with a memory component applied to language processing is memory networks (Weston et al., 2015; Sukhbaatar et al., 2015), which encode facts into vectors and store them in memory for question answering (QA). Following their success, Xiong et al. (2016) proposed dynamic memory networks (DMN) to additionally capture position and temporality of transitive reasoning steps for different QA tasks. The idea is to encode important knowledge and store it in memory for future usage with attention mechanisms. Attention mechanisms allow neural network models to selectively pay attention to specific parts, and various tasks have shown the effectiveness of attention mechanisms.

However, most previous work focused on classification or prediction tasks (predicting a single word given a question), and there are few studies for NLU tasks (slot tagging). Based on the fact that linguistic or knowledge-based substructures can be treated as prior knowledge to benefit language understanding, this work borrows the idea from memory models to improve NLU. Unlike prior NLU work that utilized representations learned from knowledge bases to enrich features of the current sentence, this paper directly learns a sentence representation incorporating memorized substructures with an automatically decided attention mechanism in an end-to-end manner.

3 Knowledge-Guided Structural Attention Networks (K-SAN)

For the NLU task, given an utterance with a sequence of words/tokens $\vec{s} = w_1, ..., w_T$, our model predicts the corresponding semantic tags $\vec{y} = y_1, ..., y_T$ for each word/token by incorporating knowledge-guided structures. The proposed model is illustrated in Figure 2. The knowledge encoding module first leverages external knowledge to generate a linguistic structure for the utterance, where a discrete set of knowledge-guided substructures $\{x_i\}$ is encoded into a set of vector representations (§3.1). The model learns the representation of the whole sentence by paying different attention to the substructures (§3.2). Then the learned vector encoding the knowledge-guided structure is used to improve the semantic tagger (§4).

3.1 Knowledge Encoding Module

The prior knowledge obtained from external resources, such as dependency relations, knowledge bases, etc., provides richer information to help decide the semantic tags given an input utterance. This paper takes dependency relations as an example for knowledge encoding; other structured relations can be applied in the same way.


Figure 2: The illustration of knowledge-guided structural attention networks (K-SAN) for NLU. (The diagram shows the knowledge encoding module, where $M_{kg}$ encodes the knowledge-guided substructures $\{x_i\}$ into memory vectors $m_i$ and the sentence encoder $M_{in}$ produces $u$; an inner product followed by a softmax yields the knowledge attention distribution $p_i$; a weighted sum gives the encoded knowledge representation $h$; $M_{out}$ produces the knowledge-guided representation $o$; and the RNN tagger outputs the slot tagging sequence $y$ for the input sentence "show me the flights from seattle to san francisco".)

Figure 3: The knowledge-guided substructures of dependency parsing, $x_i$, on an example sentence $s$. For "show me the flights from seattle to san francisco", the four substructures obtained from the parse tree are: 1. "show me"; 2. "show flights the"; 3. "show flights from seattle"; 4. "show flights to francisco san".

The input utterance is parsed by a dependency parser, and the substructures are built according to the paths from the root to all leaves (Chen and Manning, 2014). For example, the dependency parse of the utterance "show me the flights from seattle to san francisco" is shown in Figure 3, where the associated substructures are obtained from the parse tree for knowledge encoding. Here we do not utilize the dependency relation labels in the experiments, for better generalization, because the labels may not always be available for different knowledge resources. Note that the number of substructures may be less than the number of words in the utterance, because non-leaf nodes do not have corresponding substructures, in order to reduce duplicated information in the model. The top-left component of Figure 2 illustrates the module for modeling knowledge-guided substructures.
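The following is a minimal sketch (hypothetical data structure, not the authors' code) of how such substructures can be read off a parse: each root-to-leaf path of the dependency tree becomes one substructure, relation labels are dropped, and non-leaf nodes add no extra path. The example tree is an assumed parse; the exact attachment of function words depends on the parser.

```python
# Build knowledge-guided substructures as root-to-leaf paths of a
# dependency tree, ignoring relation labels.

def root_to_leaf_paths(children, root):
    """children: head word -> list of dependent words (labels ignored)."""
    paths = []

    def walk(node, path):
        path = path + [node]
        deps = children.get(node, [])
        if not deps:                  # a leaf closes one substructure
            paths.append(path)
        for child in deps:            # non-leaf nodes only extend paths
            walk(child, path)

    walk(root, [])
    return paths

# Assumed parse of "show me the flights from seattle to san francisco".
children = {
    "show": ["me", "flights"],
    "flights": ["the", "seattle", "francisco"],
    "seattle": ["from"],
    "francisco": ["to", "san"],
}
for path in root_to_leaf_paths(children, "show"):
    print(" ".join(path))
```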

3.2 Model Architecture

The model embeds all knowledge-guided substructures into a continuous space and stores the embeddings of all $x_i$'s in the knowledge memory. The representation of the input utterance is then compared with the encoded knowledge representations to integrate the structure carried by the knowledge via an attention mechanism. Then the knowledge-guided representation of the sentence is taken together with the word sequence for estimating the semantic tags. The four main procedures are described below.

Encoded Knowledge Representation To store the knowledge-guided structure, we convert each substructure (e.g., a path from the root to a leaf of the dependency tree), $x_i$, into a structure vector $m_i$ with dimension $d$ by embedding the substructure in a continuous space through the knowledge encoding model $M_{kg}$. The input utterance $s$ is also embedded into a vector $u$ with the same dimension through the model $M_{in}$:

$m_i = M_{kg}(x_i),$    (1)
$u = M_{in}(s).$    (2)

We apply three types of knowledge encoding models for $M_{kg}$ and $M_{in}$ in order to model the multiple words of a substructure $x_i$ or an input sentence $s$ into a vector representation: 1) fully-connected neural networks (NN) with linear activation, 2) recurrent neural networks (RNN), and 3) convolutional neural networks (CNN) with a window size of 3 and a max-pooling operation. For example, one of the substructures shown in Figure 3, "show flights seattle from", is encoded into a vector embedding. In the experiments, the weights of $M_{kg}$ and $M_{in}$ are tied together based on their consistent ability of sequence encoding.
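As an illustration of one of these encoders, the sketch below implements a CNN-style $M_{kg}$/$M_{in}$ with window size 3 and max-pooling over time in plain numpy; the tanh activation, zero padding, and weight shapes are assumptions rather than the paper's exact configuration.

```python
import numpy as np

# Hedged sketch of a CNN encoder: concatenate each window of 3 word
# embeddings, apply a filter matrix, then max-pool over time to get one
# d-dimensional vector for a substructure or a sentence.

def cnn_encode(word_vecs, W, b):
    """word_vecs: (T, e) embeddings; W: (3*e, d) filters; b: (d,)."""
    T, e = word_vecs.shape
    padded = np.vstack([np.zeros((1, e)), word_vecs, np.zeros((1, e))])
    windows = [padded[t:t + 3].reshape(-1) for t in range(T)]   # window size 3
    feats = np.tanh(np.stack(windows) @ W + b)                  # (T, d)
    return feats.max(axis=0)                                    # max-pooling

rng = np.random.default_rng(0)
e, d = 100, 150                               # embedding and encoding sizes
W, b = rng.normal(size=(3 * e, d)) * 0.01, np.zeros(d)
substructure = rng.normal(size=(4, e))        # e.g. a 4-word substructure
m_i = cnn_encode(substructure, W, b)          # its encoded vector m_i
```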

Knowledge Attention Distribution In the embedding space, we compute the match between the current utterance vector $u$ and each substructure vector $m_i$ by taking their inner product followed by a softmax,

$p_i = \mathrm{softmax}(u^\top m_i),$    (3)

where $\mathrm{softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}$, and $p_i$ can be viewed as the attention distribution for modeling important substructures from external knowledge in order to understand the current utterance.

Sentence Representation In order to encode the knowledge-guided structure, a vector $h$ is computed as a sum over the encoded knowledge embeddings weighted by the attention distribution,

$h = \sum_i p_i m_i,$    (4)

which indicates that the sentence pays different attention to different substructures guided by the external knowledge. Because the function from input to output is smooth, we can easily compute gradients and back-propagate through it. The sum of the substructure vector $h$ and the current input embedding $u$ is then passed through a neural network model $M_{out}$ to generate an output knowledge-guided representation $o$,

$o = M_{out}(h + u),$    (5)

where we employ a fully-connected dense network for $M_{out}$.
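Putting equations (1)-(5) together, the following numpy sketch shows how the attention over the knowledge memory and the knowledge-guided representation are computed; the choice of tanh for $M_{out}$ and the random placeholders for $m_i$ and $u$ are assumptions, and any encoder such as the CNN above could supply them.

```python
import numpy as np

# Hedged sketch of the K-SAN representation: attention over encoded
# substructures (eq. 3), a weighted sum (eq. 4), and a dense output
# layer standing in for M_out (eq. 5).

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def ksan_representation(substructure_vecs, u, W_out, b_out):
    """substructure_vecs: list of m_i vectors (d,); u: sentence vector (d,)."""
    M = np.stack(substructure_vecs)           # knowledge memory, shape (n, d)
    p = softmax(M @ u)                        # eq. (3): attention distribution
    h = p @ M                                 # eq. (4): weighted sum of memory
    o = np.tanh((h + u) @ W_out + b_out)      # eq. (5): M_out as a dense layer
    return o, p

d = 150
rng = np.random.default_rng(1)
m = [rng.normal(size=d) for _ in range(4)]    # encoded substructures, eq. (1)
u = rng.normal(size=d)                        # encoded sentence, eq. (2)
W_out, b_out = rng.normal(size=(d, d)) * 0.01, np.zeros(d)
o, attention = ksan_representation(m, u, W_out, b_out)
```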

Sequence Tagging To estimate the tag sequence $\vec{y}$ corresponding to an input word sequence $\vec{s}$, we use an RNN module for training a slot tagger, where the knowledge-guided representation $o$ is fed into the input of the model in order to incorporate the structure information:

$\vec{y} = \mathrm{RNN}(o, \vec{s}).$    (6)

4 Recurrent Neural Network Tagger

4.1 Chain-Based RNN Tagger

Given $\vec{s} = w_1, ..., w_T$, the model predicts $\vec{y} = y_1, ..., y_T$, where the tag $y_i$ is aligned with the word $w_i$. We use the Elman RNN architecture, consisting of an input layer, a hidden layer, and an output layer (Elman, 1990). The input, hidden, and output layers consist of sets of neurons representing the input, hidden state, and output at each time step $t$: $w_t$, $h_t$, and $y_t$, respectively.

$h_t = \phi(W w_t + U h_{t-1}),$    (7)
$y_t = \mathrm{softmax}(V h_t),$    (8)

where $\phi$ is a smooth bounded function such as tanh, and $y_t$ is the probability distribution over semantic tags given the current hidden state $h_t$. The sequence probability can be formulated as

$p(\vec{y} \mid \vec{s}) = p(\vec{y} \mid w_1, ..., w_T) = \prod_i p(y_i \mid w_1, ..., w_i).$    (9)

The model can be trained using backpropagation to maximize the conditional likelihood of the training set labels.
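A minimal numpy sketch of this chain-based tagger follows; the tanh nonlinearity and the random initialization are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the Elman tagger in equations (7)-(8):
# h_t = tanh(W w_t + U h_{t-1}), y_t = softmax(V h_t).

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def elman_tag(word_vecs, W, U, V):
    """word_vecs: (T, e) embeddings; returns (T, n_tags) tag distributions."""
    h = np.zeros(U.shape[0])
    outputs = []
    for w_t in word_vecs:                 # left-to-right over the sentence
        h = np.tanh(W @ w_t + U @ h)      # eq. (7): recurrent hidden state
        outputs.append(softmax(V @ h))    # eq. (8): per-word tag distribution
    return np.stack(outputs)              # their product gives eq. (9)

rng = np.random.default_rng(2)
e, hid, n_tags = 100, 100, 5
W = rng.normal(size=(hid, e)) * 0.01
U = rng.normal(size=(hid, hid)) * 0.01
V = rng.normal(size=(n_tags, hid)) * 0.01
probs = elman_tag(rng.normal(size=(9, e)), W, U, V)   # 9 words, 9 distributions
```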

To overcome the frequent vanishing gradient issue when modeling long-term dependencies, gated RNNs were designed to use a more sophisticated activation function than the usual one, consisting of an affine transformation followed by a simple element-wise nonlinearity combined with gating units (Chung et al., 2014), such as long short-term memory (LSTM) and gated recurrent units (GRU) (Hochreiter and Schmidhuber, 1997; Cho et al., 2014). RNNs employing either of these recurrent units have been shown to perform well in tasks that require capturing long-term dependencies (Mesnil et al., 2015; Yao et al., 2014; Graves et al., 2013; Sutskever et al., 2014). In this paper, we use an RNN with GRU cells to allow each recurrent unit to adaptively capture dependencies of different time scales (Cho et al., 2014; Chung et al., 2014), because RNN-GRU can yield performance comparable to RNN-LSTM while needing fewer parameters and less data for generalization (Chung et al., 2014).

Figure 4: The joint tagging model that incorporates a chain-based RNN tagger (upper block) and a knowledge-guided RNN tagger (lower block).

A GRU has two gates, a reset gate $r$ and an update gate $z$ (Cho et al., 2014; Chung et al., 2014). The reset gate determines the combination of the new input and the previous memory, and the update gate decides how much the unit updates its activation, or content.

$r = \sigma(W^r w_t + U^r h_{t-1}),$    (10)
$z = \sigma(W^z w_t + U^z h_{t-1}),$    (11)

where $\sigma$ is a logistic sigmoid function. The final activation of the GRU at time $t$, $h_t$, is a linear interpolation between the previous activation $h_{t-1}$ and the candidate activation $\tilde{h}_t$:

$h_t = (1 - z) \odot \tilde{h}_t + z \odot h_{t-1},$    (12)
$\tilde{h}_t = \phi(W^h w_t + U^h(h_{t-1} \odot r)),$    (13)

where $\odot$ is an element-wise multiplication. When the reset gate is off, it effectively makes the unit act as if it is reading the first symbol of an input sequence, allowing it to forget the previously computed state. Then $y_t$ can be computed as in (8).
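For reference, a single GRU step as written in equations (10)-(13) can be sketched in numpy as below; the weight shapes and initialization are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of one GRU step, following equations (10)-(13).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(w_t, h_prev, params):
    Wr, Ur, Wz, Uz, Wh, Uh = params
    r = sigmoid(Wr @ w_t + Ur @ h_prev)               # eq. (10): reset gate
    z = sigmoid(Wz @ w_t + Uz @ h_prev)               # eq. (11): update gate
    h_tilde = np.tanh(Wh @ w_t + Uh @ (h_prev * r))   # eq. (13): candidate
    return (1.0 - z) * h_tilde + z * h_prev           # eq. (12): interpolation

rng = np.random.default_rng(3)
e, hid = 100, 100
params = tuple(rng.normal(size=(hid, e if i % 2 == 0 else hid)) * 0.01
               for i in range(6))                     # Wr, Ur, Wz, Uz, Wh, Uh
h = gru_step(rng.normal(size=e), np.zeros(hid), params)
```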

4.2 Knowledge-Guided RNN Tagger

In order to model the encoded knowledge from previous turns, for each time step $t$, the knowledge-guided sentence representation $o$ in (5) is fed into the RNN model together with the word $w_t$. For the plain RNN, the hidden layer can be formulated as

$h_t = \phi(M o + W w_t + U h_{t-1})$    (14)

to replace (7), as illustrated in the right block of Figure 2. RNN-GRU can incorporate the encoded knowledge in a similar way, where $Mo$ can be added into the gating mechanisms for modeling contextual knowledge.

4.3 Joint RNN Tagger

Because the chain-based tagger and the knowledge-guided tagger carry different information, a joint RNN tagger is proposed to balance the information between the two model architectures. Figure 4 presents the architecture of the joint RNN tagger.

$h^1_t = \phi(W^1 w_t + U^1 h_{t-1}),$    (15)
$h^2_t = \phi(M o + W^2 w_t + U^2 h_{t-1}),$    (16)
$y_t = \mathrm{softmax}(V(\alpha h^1_t + (1 - \alpha) h^2_t)),$    (17)

where $\alpha$ is the weight for balancing chain-based and knowledge-guided information. By jointly considering chain-based information ($h^1_t$) and knowledge-guided information ($h^2_t$), the joint RNN tagger is expected to achieve better generalization, and the performance may be less sensitive to poor structures from external knowledge. In the experiments, $\alpha$ is set to 0.5 to balance the two sides. The objective of the proposed model is to maximize the sequence probability $p(\vec{y} \mid \vec{s})$ in (9), and the model can be trained in an end-to-end manner, where the error is back-propagated through the whole architecture.
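The joint tagger of equations (15)-(17) can be sketched as follows; plain RNN cells and separate recurrent states for the two branches are simplifying assumptions (the paper also uses GRU cells), and the initialization is illustrative.

```python
import numpy as np

# Hedged sketch of the joint tagger: a chain-based hidden state (eq. 15)
# and a knowledge-guided hidden state (eq. 16) are interpolated with
# weight alpha before the softmax output layer (eq. 17).

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def joint_tag(word_vecs, o, params, alpha=0.5):
    W1, U1, W2, U2, M, V = params
    h1 = np.zeros(U1.shape[0])
    h2 = np.zeros(U2.shape[0])
    outputs = []
    for w_t in word_vecs:
        h1 = np.tanh(W1 @ w_t + U1 @ h1)              # eq. (15)
        h2 = np.tanh(M @ o + W2 @ w_t + U2 @ h2)      # eq. (16)
        outputs.append(softmax(V @ (alpha * h1 + (1 - alpha) * h2)))  # eq. (17)
    return np.stack(outputs)

rng = np.random.default_rng(4)
e, hid, d, n_tags = 100, 100, 150, 5
params = (rng.normal(size=(hid, e)) * 0.01, rng.normal(size=(hid, hid)) * 0.01,
          rng.normal(size=(hid, e)) * 0.01, rng.normal(size=(hid, hid)) * 0.01,
          rng.normal(size=(hid, d)) * 0.01, rng.normal(size=(n_tags, hid)) * 0.01)
probs = joint_tag(rng.normal(size=(9, e)), rng.normal(size=d), params)
```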

5 Experiments

5.1 Experimental Setup

The dataset for the experiments is the benchmark ATIS corpus, which is extensively used by the NLU community (Mesnil et al., 2015). There are 4978 training utterances selected from Class A (context independent) in ATIS-2 and ATIS-3, and 893 test utterances selected from the ATIS-3 Nov93 and Dec94 sets. In the experiments, we only use lexical features. In order to show the robustness to data scarcity, we conduct the experiments with 3 different sizes of training data (Small, Medium, and Large), where Small is 1/40 of the original set, Medium is 1/10 of the original set, and Large is the full set. The evaluation metric for NLU is the F-measure on the predicted slots¹.

For experiments with K-SAN, we parse all data with the Stanford dependency parser (Chen and Manning, 2014) and represent words with embeddings trained on the in-domain data, where the parser is pre-trained on PTB. The loss function is cross-entropy.

¹The evaluation script used is conlleval.


Model      | Encoder (Mkg/Min) | Knowledge | Tagger | Small  | Medium | Large
Baseline   | -                 | ✗         | CRF    | 58.94  | 78.74  | 89.73
Baseline   | -                 | ✗         | RNN    | 68.58  | 84.55  | 92.97
Baseline   | CNN               | ✗         | RNN    | 73.57  | 85.52  | 93.88
Structural | -                 | ✓         | CRF    | 59.55  | 78.71  | 90.13
Structural | DCNN              | ✓         | RNN    | 70.24  | 83.80  | 93.25
Structural | Tree-RNN          | ✓         | RNN    | 73.50  | 83.92  | 92.28
Proposed   | K-SAN (NN)        | ✓         | RNN    | 74.11† | 85.97  | 93.98†
Proposed   | K-SAN (RNN)       | ✓         | RNN    | 73.13  | 86.85† | 94.97†
Proposed   | K-SAN (CNN)       | ✓         | RNN    | 74.60† | 87.99† | 94.86†

Table 1: The F1 scores of predicted slots on different sizes of ATIS training data, where K-SAN utilizes the dependency relations parsed from the Stanford parser. Small: 1/40 set; Medium: 1/10 set; Large: original set. († indicates that the performance is significantly better than all baseline models with p < 0.05 in the t-test.)

The optimizer we use is Adam with the default settings (Kingma and Ba, 2014), where the learning rate $\lambda = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The maximum number of training iterations for our K-SAN models is set to 300. The dimensionality of the input word embeddings is 100, and the hidden layer sizes are in {50, 100, 150}. The dropout rates are set to {0.25, 0.50}. All reported results are from the joint RNN tagger, and the hyperparameters are tuned on the dev set for all experiments.
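For concreteness, the Adam update (Kingma and Ba, 2014) with the hyperparameters listed above can be sketched as below; the actual experiments would rely on a deep learning framework's optimizer, so this is only an illustrative reference implementation.

```python
import numpy as np

# Hedged sketch of one Adam step with lr=0.001, beta1=0.9, beta2=0.999,
# eps=1e-8, matching the settings reported in the text.

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad                 # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2            # second-moment estimate
    m_hat = m / (1 - b1 ** t)                    # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
for t in range(1, 4):                            # a few dummy updates
    grad = np.array([0.1, -0.2, 0.3])
    theta, m, v = adam_step(theta, grad, m, v, t)
```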

5.2 Baseline

To validate the effectiveness of the proposed model, we compare its performance with the following baselines.

• Baseline:
  – CRF Tagger (Tur et al., 2010): predicts a semantic slot for each word with a context window (size = 5).
  – RNN Tagger (Mesnil et al., 2015): predicts a semantic slot for each word.
  – CNN Encoder-Tagger (Kim, 2014): tags semantic slots with consideration of sentence embeddings learned by a convolutional model.

• Structural: NLU models that utilize linguistic information when tagging slots, where DCNN and Tree-RNN are state-of-the-art approaches for embedding sentences with linguistic structures.
  – CRF Tagger (Tur et al., 2010): predicts slots based on lexical (5-word window) and syntactic (dependent head in the parse tree) features.
  – DCNN (Ma et al., 2015a): predicts slots by incorporating sentence embeddings learned by a convolutional model with consideration of dependency tree structures.
  – Tree-RNN (Tai et al., 2015): predicts slots with sentence embeddings learned by an RNN model based on the tree structures of sentences.

5.3 Slot Filling Results

Table 1 shows the performance of slot filling on different sizes of training data, where the three datasets (Small, Medium, and Large) use 1/40, 1/10, and the whole training set. Among the baselines (models without knowledge features), the CNN Encoder-Tagger achieves the best performance on all datasets.

Among the structural models (models with knowledge encoding), the Tree-RNN Encoder-Tagger performs better for the Small data but slightly worse than the DCNN Encoder-Tagger on the Large set.

CNN (Kim, 2014) performs better compared to DCNN (Ma et al., 2015a) and Tree-RNN (Tai et al., 2015), even though CNN does not leverage external knowledge when encoding sentences. When comparing the NLU performance between the baselines and the other state-of-the-art structural models, there is no significant difference. This suggests that encoding sentence information without distinguishing substructures may not capture salient semantics well enough to improve understanding performance.


Figure 5: The visualization of the decoded knowledge-guided structural attention for both relations and words learned from different sizes of training data (Small, Medium, Large) on the utterance "find nonstop flights from salt lake city to new york on saturday april ninth". Relations and words with darker color indicate higher attention weights generated by the proposed K-SAN with CNN. The slot tags (flight_stop, fromloc.city_name, toloc.city_name, depart_date.day_name, depart_date.month_name, depart_date.day_number) are shown in the figure for reference. Note that the dependency relations are incorrectly parsed by the Stanford parser in this example, but our model is still able to benefit from the structural information.

Among the proposed K-SAN models, CNN encoding performs best on Small (75% F1) and Medium (88% F1), and RNN encoding performs best on the Large set (95% F1). Also, most of the proposed models outperform all baselines, and the improvement on the small dataset is more significant. This suggests that the proposed models generalize better and are less sensitive to unseen data. For example, given the utterance "which flights leave on monday from montreal and arrive in chicago in the morning", "morning" can be correctly tagged with the semantic tag B-arrive_time.period_of_day by K-SAN, but it is incorrectly tagged with B-depart_time.period_of_day by the baselines, because knowledge guides the model to pay correct attention to salient substructures. The proposed model achieves state-of-the-art performance on the large dataset (vs. RNN-BLSTM in the baselines), showing the effectiveness of leveraging knowledge-guided structures for learning task-specific embeddings and the robustness to data scarcity and mismatch.

5.4 Attention Analysis

In order to show the effectiveness of learning correct attention from much smaller training data through the proposed model, we visualize the attention for both words and relations decoded by K-SAN with CNN in Figure 5. Darker blocks and lines indicate higher attention for words and relations, respectively. From the figure, the words and relations with higher attention are the most crucial parts for predicting correct slots, e.g., origin, destination, and time. Furthermore, the difference in attention distribution between the three dataset sizes is not significant; this suggests that our proposed model is able to pay correct attention to important substructures guided by the external knowledge even when the training data is scarce.

5.5 Knowledge Generalization

In order to show the capacity of generalization to different knowledge resources, we apply the K-SAN model to different knowledge bases. Below we compare two types of knowledge formats: dependency trees and Abstract Meaning Representation (AMR). AMR is a semantic formalism in which the meaning of a sentence is encoded as a rooted, directed, acyclic graph (Banarescu et al., 2013), where nodes represent concepts and labeled directed edges represent the relations between two concepts. The formalism is based on propositional logic and neo-Davidsonian event representations (Parsons, 1990; Davidson, 1967). The semantic concepts in AMR have been leveraged to benefit multiple NLP tasks (Liu et al., 2015). Unlike the syntactic information from dependency trees, the AMR graph contains semantic information, which may offer more specific conceptual relations. Figure 6 shows the comparison of a dependency tree and an AMR graph associated with the same example utterance and how the knowledge-guided substructures are constructed.

Table 2 presents the performance of CRF and K-SAN (CNN) taggers that utilize dependency relations and AMR edges as knowledge guidance on the same datasets, where CRF takes the head words from either dependency trees or AMR graphs as additional features and K-SAN incorporates knowledge-guided substructures as illustrated in Figure 6.


Figure 6: The constructing procedure of knowledge-guided substructures, $x_i$, on an example sentence $s$ ("show me the flights from seattle to san francisco"). (a) Syntax: the dependency tree, with substructures 1. "show me", 2. "show flights the", 3. "show flights from seattle", 4. "show flights to francisco san". (b) Semantics: the AMR graph, with substructures 1. "show you", 2. "show flight seattle", 3. "show flight san francisco", 4. "show i".

Approach    | Knowledge       | Parser     | Max #Substructure | Small | Medium | Large
CRF         | Dependency Tree | Stanford   | -                 | 59.55 | 78.71  | 90.13
CRF         | Dependency Tree | SyntaxNet  | -                 | 61.09 | 78.87  | 90.92
CRF         | AMR Graph       | Rule-Based | -                 | 59.55 | 79.15  | 89.97
CRF         | AMR Graph       | JAMR       | -                 | 61.12 | 78.64  | 90.25
K-SAN (CNN) | Dependency Tree | Stanford   | 53                | 74.60 | 87.99  | 94.86
K-SAN (CNN) | Dependency Tree | SyntaxNet  | 25                | 74.35 | 88.40  | 95.00
K-SAN (CNN) | AMR Graph       | Rule-Based | 19                | 74.32 | 88.14  | 94.85
K-SAN (CNN) | AMR Graph       | JAMR       | 8                 | 74.27 | 88.27  | 94.89

Table 2: The F1 scores of predicted slots with knowledge from different resources.

The dependency trees are obtained from the Stanford dependency parser or the SyntaxNet parser², and the AMR graphs are generated by a rule-based AMR parser or JAMR³.

Among the four knowledge resources (different types obtained from different parsers), all results show similar performance across the three dataset sizes. The maximum number of substructures for the dependency tree is larger than that for the AMR graph (53 and 25 vs. 19 and 8), because syntax is more general and may provide richer cues for guiding more attention, while semantics is more specific and may offer stronger guidance. In sum, the models applying the four different resources achieve similar performance, and all significantly outperform the state-of-the-art NLU tagger, showing the effectiveness, generalization, and robustness of the proposed K-SAN model.

²https://github.com/tensorflow/models/tree/master/syntaxnet
³https://github.com/jflanigan/jamr

6 Conclusion

This paper proposes a novel model, knowledge-guided structural attention networks (K-SAN), which leverages prior knowledge as guidance to incorporate non-flat topologies and learn suitable attention for the different substructures that are salient for specific tasks. The structured information can be captured from small training data, so the model has better generalization and robustness. The experiments show the benefits and effectiveness of the proposed model on the language understanding task, where knowledge-guided substructures captured from different resources all help tagging performance, and state-of-the-art performance is achieved on the ATIS benchmark dataset.


References

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the Linguistic Annotation Workshop and Interoperability with Discourse.

Asli Celikyilmaz and Dilek Hakkani-Tur. 2015. Convolutional neural network based semantic tagging with entity embeddings. In NIPS Workshop on Machine Learning for SLU and Interaction.

Ciprian Chelba, Monika Mahajan, and Alex Acero. 2003. Speech utterance classification. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages I–280. IEEE.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.

Yun-Nung Chen, Dilek Hakkani-Tur, and Gokhan Tur. 2014. Deriving local relational surface forms from dependency-based entity embeddings for unsupervised spoken language understanding. In 2014 IEEE Spoken Language Technology Workshop (SLT), pages 242–247. IEEE.

Yun-Nung Chen, William Yang Wang, Anatole Gershman, and Alexander I. Rudnicky. 2015. Matrix factorization with knowledge graph propagation for unsupervised spoken language understanding. In Proceedings of ACL-IJCNLP.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Donald Davidson. 1967. The logical form of action sentences.

Anoop Deoras and Ruhi Sarikaya. 2013. Deep belief network based semantic taggers for spoken language understanding. In INTERSPEECH, pages 2713–2717.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Alan Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE.

Patrick Haffner, Gokhan Tur, and Jerry H. Wright. 2003. Optimizing SVMs for complex call classification. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages I–632. IEEE.

Larry P. Heck, Dilek Hakkani-Tur, and Gokhan Tur. 2013. Leveraging knowledge graphs for web-scale unsupervised semantic parsing. In INTERSPEECH, pages 1594–1598.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Wang Ling, Tiago Luis, Luis Marujo, Ramon Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.

Jingjing Liu, Panupong Pasupat, Yining Wang, Scott Cyphers, and James Glass. 2013. Query understanding enhanced by hierarchical parsing structures. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 72–77. IEEE.

Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1077–1086.

Mingbo Ma, Liang Huang, Bing Xiang, and Bowen Zhou. 2015a. Dependency-based convolutional neural networks for sentence embedding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 174–179.

Yi Ma, Paul A. Crook, Ruhi Sarikaya, and Eric Fosler-Lussier. 2015b. Knowledge graph inference for spoken dialog systems. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5346–5350. IEEE.

Michael F. McTear. 2004. Spoken Dialogue Technology: Toward the Conversational User Interface. Springer Science & Business Media.

Gregoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. 2015. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530–539.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Terence Parsons. 1990. Events in the Semantics of English: A Study in Subatomic Semantics.

Roberto Pieraccini, Evelyne Tzoukermann, Zakhar Gorelov, Jean-Luc Gauvain, Esther Levin, Chin-Hui Lee, and Jay G. Wilpon. 1992. A speech understanding system based on statistical representation of semantics. In 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 193–196. IEEE.

Suman Ravuri and Andreas Stolcke. 2015. Recurrent neural network and LSTM models for lexical utterance classification. In Sixteenth Annual Conference of the International Speech Communication Association.

Michael Roth and Mirella Lapata. 2016. Neural semantic role labeling with dependency path embeddings. arXiv preprint arXiv:1605.07515.

Alexander Rudnicky and Wei Xu. 1999. An agenda-based dialog management architecture for spoken language systems. In IEEE Automatic Speech Recognition and Understanding Workshop, volume 13, page 17.

Ruhi Sarikaya, Geoffrey E. Hinton, and Bhuvana Ramabhadran. 2011. Deep belief nets for natural language call-routing. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5680–5683. IEEE.

Ruhi Sarikaya, Geoffrey E. Hinton, and Anoop Deoras. 2014. Application of deep belief networks for natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):778–784.

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2431–2439.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

Gokhan Tur and Renato De Mori. 2011. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. John Wiley & Sons.

Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck. 2010. What is left to be understood in ATIS? In 2010 IEEE Spoken Language Technology Workshop (SLT), pages 19–24. IEEE.

Gokhan Tur, Li Deng, Dilek Hakkani-Tur, and Xiaodong He. 2012. Towards deeper understanding: Deep convex networks for semantic utterance classification. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5045–5048. IEEE.

Ye-Yi Wang, Li Deng, and Alex Acero. 2005. Spoken language understanding. IEEE Signal Processing Magazine, 22(5):16–31.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In International Conference on Learning Representations (ICLR).

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. arXiv preprint arXiv:1603.01417.

Puyang Xu and Ruhi Sarikaya. 2013. Convolutional neural network based triangular CRF for joint intent detection and slot filling. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 78–83. IEEE.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.

Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. In INTERSPEECH, pages 2524–2528.

Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. In 2014 IEEE Spoken Language Technology Workshop (SLT), pages 189–194. IEEE.

