Attentive Convolution

Wenpeng Yin, Hinrich Schütze

Center for Information and Language Processing

LMU Munich, [email protected]

Abstract

In NLP, convolution neural networks (CNNs) have benefited less than recurrent neural networks (RNNs) from attention mechanisms. We hypothesize that this is because attention in CNNs has been mainly implemented as attentive pooling (i.e., it is applied to pooling) rather than as attentive convolution (i.e., it is integrated into convolution). Convolution is the differentiator of CNNs in that it can powerfully model the higher-level representation of a word by taking into account its local fixed-size context in input text $t^x$. In this work, we propose an attentive convolution network, AttentiveConvNet. It extends the context scope of the convolution operation, deriving higher-level features for a word not only from local context, but also from information extracted from nonlocal context by the attention mechanism commonly used in RNNs. This nonlocal context can come (i) from parts of the input text $t^x$ that are distant or (ii) from a second input text, the context text $t^y$. In an evaluation on sentence relation classification (textual entailment and answer sentence selection) and text classification, experiments demonstrate that AttentiveConvNet has state-of-the-art performance and outperforms RNN/CNN variants with and without attention. All code will be publicly released.

1 Introduction

Natural language processing (NLP) has benefited greatly from the resurgence of deep neural networks (DNNs), due to their high performance with less need of engineered features. A standard DNN consists of a series of non-linear transformation layers, each producing a fixed-dimensional hidden representation. For tasks like machine translation that have large input spaces, this paradigm must encode the entire input text in one hidden state, resulting in an information bottleneck; systems cannot encode all input information in this bottleneck, but do not know how to select the subset that is needed for correct subsequent decisions. In response, attention mechanisms are commonly used to perform a soft-selection over hidden representations; this mechanism scales with input size, dynamically picking only the information that is important for each step (e.g., of the translation process). Attention is roughly a variable-length memory model and has been shown to be important for good performance on many tasks.

Convolution neural networks (CNNs, LeCun et al. (1998)) and recurrent neural networks (RNNs, Elman (1990)) are two important types of DNNs. Most work on attention has been done for RNNs. Attention-based RNNs typically take three types of input to make a decision at the current step: (i) the current input state, (ii) a representation of local context (computed unidirectionally or bidirectionally, Rocktäschel et al. (2016)) and (iii) the attention-weighted sum of hidden states corresponding to nonlocal context (e.g., the hidden states of the encoder in neural machine translation (Bahdanau et al., 2015a)). An important question therefore is whether CNNs can benefit from such an attention mechanism as well and how. This is our technical motivation.

Our second motivation is natural language understanding (NLU). Attentive convolution is needed for NLU tasks. We distinguish two main cases: intertext and intratext. Intertext means that, in addition to the input text $t^x$, there is a second input text $t^y$ over which we compute attention. Consider the SNLI textual entailment examples in Table 1; here the input text $t^x$ is the hypothesis and the context text $t^y$ is the premise. Intuitively, a system in this task should match “three bikers” and “in town” in the premise with “a group of bikers” and “in the street” in hyp1, respectively; and match “three bikers” and “stop” in the premise with “the bikers” and “did n’t stop” in hyp2, respectively. Thus, cross-text, i.e., intertext, phrase matching is important: we cannot just match words, we also need to take their contexts into account. Intratext means that we compute attention over the input text $t^x$; so the same text takes the roles of both input text $t^x$ and context text $t^y$. Consider the sentiment analysis example in Table 1. In addition to taking into consideration local contexts (as most sentiment analysis systems do), “remember” (per default a neutral word) can only be interpreted as enhancing negativity here if it is related to the distant phrase “ludicrous”.

textual entailment example
  premise   three bikers stop in town
  hyp1      a group of bikers are in the street
  hyp2      the bikers did n’t stop in the town

sentiment analysis example
  With the 2017 NBA All-Star game in the books I think we can all agree that this was definitely one to remember. Not because of the three-point shootout, the dunk contest, or the game itself but because of the ludicrous trade that occurred after the festivities.

Table 1: Examples for inter- and intratext attention.

arXiv:1710.00519v1 [cs.CL] 2 Oct 2017

A convolutional filter in conventional CNNs models mutual dependencies within a local context window, but ignores both intertext and intratext context. Attentive pooling CNNs (Yin and Schütze, 2015; dos Santos et al., 2016) also do intertext modeling, but they perform the phrase matching after the convolution process, so that no rich intertext and intratext context is available to convolutional filters.

In this work, we propose attentive convolution networks, AttentiveConvNets. In the intratext case (text classification in our example), AttentiveConvNets extend the local context window of standard CNNs to cover the entire input text $t^x$. In the intertext case (sentence relation identification in our example), AttentiveConvNets extend the local context window to cover a second input text $t^y$, the context text. Our formalization is the same for intertext and intratext. So we will use $t^a$ from now on to refer to the text over which we compute attention. $t^a$ is $t^y$ for intertext and $t^x$ for intratext.

For a convolution operation over a window in $t^x$ like (left_context, target, right_context), we first compare the representation of target with all hidden states in $t^a$ to get an attentive context representation att_context; the CNN then derives a higher-level representation for target, denoted as target_new, by integrating target with three pieces of context: left_context, right_context and att_context. We can have two interpretations for this attentive convolution. (i) For intratext, a higher-level word representation target_new is learned by considering local (i.e., left_context and right_context) as well as nonlocal (i.e., att_context) context. (ii) For intertext, representation target_new is generated to denote the cross-text aligned phrases (target, att_context) in context left_context and right_context.

We apply AttentiveConvNets to two sentence relation identification tasks, SNLI textual entailment (Bowman et al., 2015) and question-aware answer sentence selection on SQUAD (Rajpurkar et al., 2016), and to the large-scale Yelp sentiment classification task (Lin et al., 2017). AttentiveConvNet outperforms competitive DNNs with and without attention and achieves state-of-the-art results on the three tasks.

Overall, we make the following contributions:

• This is the first work that enables CNNs to acquire the attention mechanism commonly employed in RNNs.

• We distinguish and build flexible modules (attention source, attention focus and attention beneficiary) to greatly advance the expressivity of attention mechanisms in CNNs.

• AttentiveConvNet provides a new way to broaden the originally constrained scope of filters in conventional CNNs. Broader and richer context comes from either external inputs (intertext) or internal inputs (intratext).

• AttentiveConvNet shows its superiority over competitive DNNs with and without attention.

2 Related Work

In this section we discuss attention-related DNNs in NLP, the most relevant work for our paper.

2.1 RNNs with Attention

Graves (2013) and Graves et al. (2014) first introduced a differentiable attention mechanism that allows RNNs to focus on different parts of the input. This idea has been broadly explored in RNNs. We now summarize three bodies of relevant NLP work.

Sequence-to-sequence text generation. This category follows Encoder(source) → Decoder(target). The attention mechanism in Seq2Seq learning allows an RNN decoder to directly access information about the input each time before it emits a symbol.

Bahdanau et al. (2015b) bring the attention idea into neural machine translation (NMT), extending the basic encoder-decoder by automatically aligning the current decoding hidden state with each source hidden state; all source hidden states are then weight-averaged as a context vector. The model then predicts a target word based on the context vector and the previously generated target words. Luong et al. (2015) extend this “global” attention (Bahdanau et al., 2015b) and propose a type of “local” attention for NMT that focuses on a subset of the source positions per target word. Libovický and Helcl (2017) further extend the attention mechanism for multimodal translation: multiple input sources, single target. Kim et al. (2017) generalize soft-selection attention by specifying possible structural dependencies among source elements in a soft manner.

Other work on text generation includes response generation in social media (Shang et al., 2015), document reconstruction (Li et al., 2015) and document summarization (Nallapati et al., 2016).

Machine comprehension. This category follows a function f(text_doc, text_question) → text_answer. Hermann et al. (2015) provide reading comprehension datasets “CNN” and “Daily Mail” and develop a class of attention-based RNNs that learn to read real documents and answer complex questions. For a (document, question) pair, the system predicts which token in the document answers the question. The key contribution is the attention mechanism; it supports learning a compact joint representation for the (document, question) pair.

Kumar et al. (2016) and Xiong et al. (2016) present dynamic memory networks (DMNs) for modeling question answering for bAbI. A DMN is still a recurrent attention process over the question representation, the fact representations in the document and the previous memory episode.

Wang and Jiang (2017) present the first attention-based RNN for SQUAD (Rajpurkar et al., 2016), in which the input is a (document, question) pair and the output is a text span in the document as the answer. Other attentive RNNs for SQUAD include Dynamic Coattention Networks (Xiong et al., 2017), Bi-Directional Attention Flow (BIDAF) (Seo et al., 2017), Recurrent Span Representations (Lee et al., 2016), Document Reader (Chen et al., 2017a), Reasoning Network (Shen et al., 2017), Ruminating Reader (Gong and Bowman, 2017), Mnemonic Reader (Hu et al., 2017) and R-Net (Wang et al., 2017a).

We define sentence relation classification as f(text_x, text_y) → class. Rocktäschel et al. (2016) employ neural word-to-word attention similar to Bahdanau et al. (2015b) and Hermann et al. (2015) for SNLI (Bowman et al., 2015). The difference is that attention is not used to generate words, but to obtain a text-pair representation from fine-grained cross-text alignments. Wang and Jiang (2016) propose match-LSTM, an extension of the attention of Rocktäschel et al. (2016). Cheng et al. (2016) present a new machine reader that equips an LSTM with a memory tape instead of a memory cell to store the past information and adaptively use it without severe information compression. Other work on attentive matching includes Multi-Perspective Matching (Wang et al., 2017b) and Enhanced LSTM (Chen et al., 2017b).

Miao et al. (2016) present the Neural Answer Selection Model (NASM), based on LSTM and attention, to identify the correct sentences answering a factual question from a set of candidate sentences. NASM applies an attention model to focus on the words in the answer sentence that are prominent for predicting the answer matched to the current question.

2.2 CNNs with Attentive Pooling

In NLP, there is little work on attention-based CNNs; two exceptions are (Yin et al., 2016) and (dos Santos et al., 2016). These two papers mainly implement the attention in pooling, i.e., the convolution is not affected. Specifically, their systems work on two input sentences, each with a set of hidden states generated by a convolution layer. Each sentence then learns a weight for every hidden state by comparing this hidden state with all hidden states in the other sentence. Finally, each input sentence obtains a representation by a weighted mean pooling over all its hidden states. The core component, weighted mean pooling, was referred to as “attentive pooling”; it aims to yield the sentence representation.

Figure 1: Attentive convolution layer. (a) Light attentive convolution layer; (b) Advanced attentive convolution layer.
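As a rough illustration of the attentive pooling scheme described in this subsection, the following minimal sketch computes one weight per hidden state and then a weighted mean. The cited systems differ in how they turn pairwise comparisons into a weight; the choices here (dot-product comparison, maximum over the other sentence, softmax normalization) and the function name `attentive_pooling` are assumptions made for illustration only, not the formulation of either cited paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pooling(H_x, H_y):
    """Pool the hidden states of one sentence into a single vector.

    H_x, H_y: lists of d-dimensional hidden states (one list entry per position).
    Assumption: the weight of a hidden state in H_x is derived from its best
    dot-product match in H_y, normalized with a softmax.
    """
    scores = [max(dot(h_x, h_y) for h_y in H_y) for h_x in H_x]
    weights = softmax(scores)
    d = len(H_x[0])
    # Weighted mean pooling: only the weights carry cross-sentence information.
    pooled = [sum(w * h_x[k] for w, h_x in zip(weights, H_x)) for k in range(d)]
    return pooled, weights
```

Note that the hidden states of the other sentence influence the result only through the scalar weights, which is the property contrasted with attentive convolution in the remainder of this subsection.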

In contrast to attentive convolution, attentive pooling does not connect the hidden states of cross-text aligned phrases directly and in a fine-grained manner to the final decision making; only the matching scores contribute to the final weighting in mean pooling. This important distinction between attentive convolution and attentive pooling is further discussed in Section 4.1 (see paragraph “Analysis”).

Inspired by the attention mechanisms in RNNs, we assume that it is the hidden states of aligned phrases rather than their matching scores that can better contribute to the representation learning and decision making. Hence, our attentive convolution work differs from attentive pooling in that it uses attended hidden states from extra context (intertext) or broader context range (intratext) to participate in the convolution. In experiments, we will show its superiority.

2.3 Attention in Other DNN Architectures

Parikh et al. (2016) address SNLI by accumulating fine-grained word-by-word alignments, computed by feedforward neural networks.

Vaswani et al. (2017)’s Transformer uses self-attention (Cheng et al., 2016; Lin et al., 2017), dispensing with recurrence and convolutions entirely.

3 AttentiveConvNet Model

We use bold uppercase, e.g., $\mathbf{H}$, for matrices; bold lowercase, e.g., $\mathbf{h}$, for vectors; bold lowercase with index, e.g., $\mathbf{h}_i$, for columns of $\mathbf{H}$; and non-bold lowercase for scalars.

For our system, we assume that at a certain layer, AttentiveConvNet represents text $t$ ($t \in \{t^x, t^a\}$) as a sequence of hidden states $\mathbf{h}_i \in \mathbb{R}^d$ ($i = 1, 2, \ldots, |t|$), forming feature map $\mathbf{H} \in \mathbb{R}^{d \times |t|}$, where $d$ is the dimensionality of hidden states. Each hidden state $\mathbf{h}_i$ has its left context $\mathbf{l}_i$ and right context $\mathbf{r}_i$. In concrete CNN systems, contexts $\mathbf{l}_i$ and $\mathbf{r}_i$ can cover multiple adjacent hidden states; we set $\mathbf{l}_i = \mathbf{h}_{i-1}$ and $\mathbf{r}_i = \mathbf{h}_{i+1}$ for simplicity in the following description.

We now describe light and advanced versions of AttentiveConvNet. Recall that AttentiveConvNets aim to compute a representation for $t^x$ in a way that convolution filters encode not only local context, but also attentive context over $t^a$.

3.1 Light AttentiveConvNet

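Figure 1(a) shows the light version of AttentiveConvNet. It differs in two key points, (i) and (ii), both from the basic convolution layer that models a single piece of text and from the Siamese CNN that models two text pieces in parallel. (i) A matching process by an energy function¹ determines how relevant each hidden state in text $t^a$ is to the current hidden state $\mathbf{h}^x_i$ in $t^x$. We then compute an average of the hidden states in $t^a$, weighted by the matching scores, to get the attentive context $\mathbf{c}^x_i$ for $\mathbf{h}^x_i$. (ii) Convolution for position $i$ in $t^x$ integrates hidden state $\mathbf{h}^x_i$ with three sources of context: left context $\mathbf{h}^x_{i-1}$, right context $\mathbf{h}^x_{i+1}$ and attentive context $\mathbf{c}^x_i$.

¹ Similar to Bordes et al. (2014), we use this term broadly to refer to any semantic matching function.

Attentive Context Vector Generation. First, an energy function $f_e(\mathbf{h}^x_i, \mathbf{h}^a_j)$ in the matching process generates a matching score $e_{i,j}$ between a hidden state in $t^x$ and a hidden state in $t^a$ by (i) dot product:

$$e_{i,j} = (\mathbf{h}^x_i)^T \cdot \mathbf{h}^a_j \tag{1}$$

or (ii) bilinear form:

$$e_{i,j} = (\mathbf{h}^x_i)^T \mathbf{W}_e \mathbf{h}^a_j \tag{2}$$

(where $\mathbf{W}_e \in \mathbb{R}^{d \times d}$) or (iii) additive projection:

$$e_{i,j} = (\mathbf{v}_e)^T \cdot \tanh(\mathbf{W}_e \cdot \mathbf{h}^x_i + \mathbf{U}_e \cdot \mathbf{h}^a_j) \tag{3}$$

where $\mathbf{W}_e, \mathbf{U}_e \in \mathbb{R}^{d \times d}$ and $\mathbf{v}_e \in \mathbb{R}^d$.

Given the matching scores, the attentive context $\mathbf{c}^x_i$ for hidden state $\mathbf{h}^x_i$ is the weighted average of all hidden states in $t^a$:

$$\mathbf{c}^x_i = \sum_j \mathrm{softmax}(\mathbf{e}_i)_j \cdot \mathbf{h}^a_j \tag{4}$$

We refer to the concatenation of attentive contexts $[\mathbf{c}^x_1; \ldots; \mathbf{c}^x_i; \ldots; \mathbf{c}^x_{|t^x|}]$ as the feature map $\mathbf{C}^x \in \mathbb{R}^{d \times |t^x|}$ for $t^x$.

Attentive Convolution. A position $i$ in $t^x$ at layer $n$ has hidden state $\mathbf{h}^{x,n}_i$, left context $\mathbf{h}^{x,n}_{i-1}$, right context $\mathbf{h}^{x,n}_{i+1}$ and attentive context $\mathbf{c}^{x,n}_i$. Attentive convolution then generates the higher-level hidden state at position $i$ at layer $n+1$:

$$\mathbf{h}^{x,n+1}_i = \tanh(\mathbf{W} \cdot [\mathbf{h}^{x,n}_{i-1}, \mathbf{h}^{x,n}_i, \mathbf{h}^{x,n}_{i+1}, \mathbf{c}^{x,n}_i] + \mathbf{b}) \tag{5}$$

$$\phantom{\mathbf{h}^{x,n+1}_i} = \tanh(\mathbf{W}_1 \cdot [\mathbf{h}^{x,n}_{i-1}, \mathbf{h}^{x,n}_i, \mathbf{h}^{x,n}_{i+1}] + \mathbf{W}_2 \cdot \mathbf{c}^{x,n}_i + \mathbf{b}) \tag{6}$$

where $\mathbf{W} \in \mathbb{R}^{d \times 4d}$ is the concatenation of $\mathbf{W}_1 \in \mathbb{R}^{d \times 3d}$ and $\mathbf{W}_2 \in \mathbb{R}^{d \times d}$, and $\mathbf{b} \in \mathbb{R}^d$.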
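The light layer can be sketched in a few lines of plain Python, which may help in checking that Equations 5 and 6 compute the same thing. This is a minimal sketch under stated assumptions, not the released implementation: it uses the dot-product energy of Equation 1, stores a feature map as a list of per-position vectors (the transpose of the $d \times |t|$ layout above), and zero-pads the sequence boundaries. All function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_context(h_x, H_a):
    """Eq. 1 (dot-product energy) followed by Eq. 4 (softmax-weighted average)."""
    weights = softmax([dot(h_x, h_a) for h_a in H_a])
    return [sum(w * h_a[k] for w, h_a in zip(weights, H_a))
            for k in range(len(H_a[0]))]

def local_windows(H_x):
    """Concatenated [h_{i-1}, h_i, h_{i+1}] per position (zero padding: assumption)."""
    pad = [0.0] * len(H_x[0])
    padded = [pad] + H_x + [pad]
    return [padded[i - 1] + padded[i] + padded[i + 1]
            for i in range(1, len(padded) - 1)]

def attentive_conv_eq5(H_x, H_a, W, b):
    """Eq. 5: one filter W (d x 4d) over the concatenation of window and context."""
    out = []
    for h, window in zip(H_x, local_windows(H_x)):
        z = matvec(W, window + attentive_context(h, H_a))
        out.append([math.tanh(zk + bk) for zk, bk in zip(z, b)])
    return out

def attentive_conv_eq6(H_x, H_a, W1, W2, b):
    """Eq. 6: width-3 convolution on H_x plus width-1 convolution on C_x."""
    out = []
    for h, window in zip(H_x, local_windows(H_x)):
        local = matvec(W1, window)                        # standard convolution
        attended = matvec(W2, attentive_context(h, H_a))  # convolution on context
        out.append([math.tanh(l + a + bk)
                    for l, a, bk in zip(local, attended, b)])
    return out
```

Passing the same feature map for both arguments (`H_a = H_x`) gives the intratext setting; passing the feature map of a second text gives the intertext setting.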

As Equation 6 shows, Equation 5 can be achieved by summing up the results of two separate and parallel convolution steps before the non-linearity. The first is still standard convolution-without-attention over feature map $\mathbf{H}^{x,n}$ with filter width 3 over the window $(\mathbf{h}^{x,n}_{i-1}, \mathbf{h}^{x,n}_i, \mathbf{h}^{x,n}_{i+1})$. The second is convolution on the feature map $\mathbf{C}^{x,n}$, i.e., the attentive context, with filter width 1, i.e., over each $\mathbf{c}^{x,n}_i$. Finally, we sum up the results element-wise and add the bias term and the non-linearity. This divide-and-conquer makes the attentive convolution easy to implement in practice with no need to create a new feature map, as required in Equation 5, to integrate $\mathbf{H}^{x,n}$ and $\mathbf{C}^{x,n}$.

  role         text
  premise      Three firefighter come out of subway station
  hypothesis   Three firefighters putting out a fire inside of a subway station

Table 2: Multi-granular alignments required in textual entailment

Our experiments show that this light version of AttentiveConvNet works much better than the basic CNN. The following two considerations show that there is space to improve its expressivity.

(i) Higher-level or more abstract representations are required in subsequent layers. We find that directly forwarding the hidden states in $t^x$ or $t^a$ to the matching process is less optimal in some tasks. Pre-learning some higher-level or more abstract representations helps in the subsequent learning phase.

(ii) Multi-granular alignments are preferred in some text pair modeling cases. Table 2 shows another example from SNLI. On the unigram level, “out” in the premise matches “out” in the hypothesis perfectly, while “out” in the premise is contradictory to “inside” in the hypothesis. But their contexts, “come out” in the premise and “putting out a fire” in the hypothesis, clearly indicate that they are not semantically equivalent. And the ground truth conclusion for this pair is “neutral”, i.e., the hypothesis is possibly true. Therefore, matching should be conducted across phrase granularities.

We now present advanced AttentiveConvNet. It is more expressive and modular, based on the two foregoing considerations (i) and (ii).

3.2 Advanced AttentiveConvNet

Adel and Schütze (2017) distinguish between focus and source of attention. The focus of attention is the layer of the network that is reweighted by attention weights. The source of attention is the information source that is used to compute the attention weights. Adel and Schütze (2017) showed that increasing the scope of the attention source is beneficial. Here we further extend this principle to define the beneficiary of attention: the feature map (labeled “beneficiary” in Figure 1(b)) that is contextualized by the attentive context (labeled “attentive context” in Figure 1(b)). In the light attentive convolutional layer (Figure 1(a)), the source of attention is the hidden states in text $t^x$, the focus of attention is the hidden states of text $t^a$, and the beneficiary of attention is again the hidden states of $t^x$, i.e., it is identical to the source of attention.

We now try to distinguish these three concepts further to promote the expressivity of an attentive convolutional layer. We call it “advanced AttentiveConvNet”, see Figure 1(b). It differs from the light version in three ways: (i) attention source is learned by function $f_{mgran}(\mathbf{H}^x)$, feature map $\mathbf{H}^x$ of $t^x$ acting as input; (ii) attention focus is learned by function $f_{mgran}(\mathbf{H}^a)$, feature map $\mathbf{H}^a$ of $t^a$ acting as input; (iii) attention beneficiary is learned by function $f_{bene}(\mathbf{H}^x)$, $\mathbf{H}^x$ acting as input. Both functions $f_{mgran}()$ and $f_{bene}()$ are based on a gated convolutional function $f_{gconv}()$:

$$\mathbf{o}_i = \tanh(\mathbf{W}_h \cdot \mathbf{i}_i + \mathbf{b}_h) \tag{7}$$

$$\mathbf{g}_i = \mathrm{sigmoid}(\mathbf{W}_g \cdot \mathbf{i}_i + \mathbf{b}_g) \tag{8}$$

$$f_{gconv}(\mathbf{i}_i) = \mathbf{g}_i \cdot \mathbf{u}_i + (1 - \mathbf{g}_i) \cdot \mathbf{o}_i \tag{9}$$

where $\mathbf{i}_i$ is a composed representation, denoting a generally defined input phrase $[\cdots, \mathbf{u}_i, \cdots]$ of arbitrary length with $\mathbf{u}_i$ as the central unigram-level hidden state; the gate $\mathbf{g}_i$ sets a trade-off between the unigram-level input $\mathbf{u}_i$ and the temporary output $\mathbf{o}_i$ at the phrase level. We elaborate these modules in the remainder of this subsection.
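Equations 7 to 9 can be illustrated with a short sketch. It assumes that the composed representation $\mathbf{i}_i$ is the concatenation of the hidden states in the phrase and that the products in Equation 9 are element-wise; neither detail is spelled out above, so both are assumptions, as is the function name `gated_conv`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    return [sum(w * x for w, x in zip(row, v)) + bk for row, bk in zip(W, b)]

def gated_conv(phrase, W_h, b_h, W_g, b_g):
    """Gated convolution over one phrase (Eqs. 7-9).

    phrase: odd-length list of d-dimensional hidden states; the middle one is u_i.
    W_h, W_g: d x (len(phrase) * d) matrices as lists of rows; b_h, b_g: length d.
    """
    u = phrase[len(phrase) // 2]
    i_vec = [x for h in phrase for x in h]                # composed input (assumed: concatenation)
    o = [math.tanh(z) for z in affine(W_h, i_vec, b_h)]   # Eq. 7
    g = [sigmoid(z) for z in affine(W_g, i_vec, b_g)]     # Eq. 8
    # Eq. 9: the gate interpolates between the unigram input and the phrase output.
    return [gk * uk + (1.0 - gk) * ok for gk, uk, ok in zip(g, u, o)]
```

With a gate close to 1 the output stays at the unigram-level input $\mathbf{u}_i$; with a gate close to 0 it becomes the phrase-level output $\mathbf{o}_i$.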

Attention Source. First, we present a general instance of generating the source of attention by function $f_{mgran}(\mathbf{H})$, learning word representations in multi-granular context. In our system, we consider granularities one and three, corresponding to uni-gram hidden state and tri-gram hidden state. For the uni-hidden state case, it is a gated convolution layer:

$$\mathbf{h}^x_{uni,i} = f_{gconv}(\mathbf{h}^x_i) \tag{10}$$

For the tri-hidden state case:

$$\mathbf{h}^x_{tri,i} = f_{gconv}([\mathbf{h}^x_{i-1}, \mathbf{h}^x_i, \mathbf{h}^x_{i+1}]) \tag{11}$$

Finally, the overall hidden state at position $i$ is the concatenation of $\mathbf{h}_{uni,i}$ and $\mathbf{h}_{tri,i}$:

$$\mathbf{h}^x_{mgran,i} = [\mathbf{h}^x_{uni,i}, \mathbf{h}^x_{tri,i}] \tag{12}$$

i.e., $f_{mgran}(\mathbf{H}^x) = \mathbf{H}^x_{mgran}$.

Such a comprehensive hidden state can encode the semantics of multigranular spans at a position, such as “out” and “come out of”. Gating here implicitly enables cross-granular alignments in the subsequent attention mechanism as it sets a highway connection (Srivastava et al., 2015) between the input granularity and the output granularity.
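Equations 10 to 12 can be sketched as follows. The sketch assumes separate parameter sets for the uni-gram and tri-gram gated convolutions, zero padding at the sequence boundaries, and a concatenated phrase as the composed input; these details and the names `gated_conv` and `multi_granular` are illustrative assumptions rather than the paper's implementation.

```python
import math

def gated_conv(phrase, params):
    """Gated convolution (Eqs. 7-9) over one phrase; params = (W_h, b_h, W_g, b_g)."""
    W_h, b_h, W_g, b_g = params
    u = phrase[len(phrase) // 2]                # central unigram-level state
    i_vec = [x for h in phrase for x in h]      # composed input (assumed: concatenation)
    out = []
    for k in range(len(u)):
        o = math.tanh(sum(w * x for w, x in zip(W_h[k], i_vec)) + b_h[k])
        z = sum(w * x for w, x in zip(W_g[k], i_vec)) + b_g[k]
        g = 1.0 / (1.0 + math.exp(-z))
        out.append(g * u[k] + (1.0 - g) * o)
    return out

def multi_granular(H, uni_params, tri_params):
    """Return the 2d-dimensional state [h_uni, h_tri] for every position in H."""
    pad = [0.0] * len(H[0])                     # zero padding: assumption
    padded = [pad] + H + [pad]
    states = []
    for i in range(1, len(padded) - 1):
        h_uni = gated_conv([padded[i]], uni_params)            # Eq. 10
        h_tri = gated_conv(padded[i - 1:i + 2], tri_params)    # Eq. 11
        states.append(h_uni + h_tri)                           # Eq. 12 (concatenation)
    return states
```

The output has twice the input dimensionality per position, so any energy function applied afterwards has to be sized for $2d$-dimensional source and focus states.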

Attention Focus. For simplicity, we use the same architecture for the attention source (just introduced) and for the attention focus, $t^a$; i.e., for the attention focus: $f_{mgran}(\mathbf{H}^a) = \mathbf{H}^a_{mgran}$. See Figure 1(b). Thus, the focus of attention will participate in the matching process as well as be reweighted to form an attentive context vector. We leave exploring different architectures for attention source and focus for future work.

Another benefit of multi-granular hidden states in attention focus is to keep structure information in the context vector. In standard attention mechanisms in RNNs, all hidden states are weight-averaged into a context vector, so the order information is missing. In CNNs, bigger-granular hidden states keep the local order or structures to boost the attentive effect.

Attention Beneficiary. In our system, we simply use $f_{gconv}()$ over uni-granularity to learn a more abstract representation over the current hidden representations in $\mathbf{H}^x$, so that

$$f_{bene}(\mathbf{h}^x_i) = f_{gconv}(\mathbf{h}^x_i) \tag{13}$$

Subsequently, the attentive context vector $\mathbf{c}^x_i$ is generated based on attention source feature map $f_{mgran}(\mathbf{H}^x)$ and attention focus feature map $f_{mgran}(\mathbf{H}^a)$, according to the description in the light attentive convolutional layer. Then attentive convolution is conducted over attention beneficiary feature map $f_{bene}(\mathbf{H}^x)$ and the attentive context vectors $\mathbf{C}^x$ to get the higher-layer feature map $\mathbf{H}^{x,n+1}$ for text $t^x$. A symmetrical process can be carried out for the text $t^y$ as well to form a two-way attentional system. This is the architecture we use for sentence relation classification below.

3.3 Analysis

AttentiveConvNet consists of three modules; each can be flexibly built. In addition, we can do (max, mean etc.) pooling over the output feature map of the attentive convolution layer, or stack new attentive convolution layers to form deep architectures.

Compared to the standard attention mechanism in RNNs, AttentiveConvNet has a similar energy function and a similar process of computing context vectors, but differs in three ways. (i) The discrimination of attention source, focus and beneficiary improves expressivity. (ii) In CNNs, the surrounding hidden states for a concrete position are available, so the attention matching is able to encode the left context as well as the right context. RNNs, however, need to be bidirectional to yield both left and right context representations. (iii) As attentive convolution can be implemented by summing up two separate convolution steps, this architecture provides the attentive representations as well as representations computed without the use of attention. This trick is helpful in practice to use richer representations for better performance. In contrast, such a clean modular separation of representations computed with and without attention is harder to realize in attention-based RNNs.

Prior attention mechanisms explored in CNNs mostly involve attentive pooling (Yin et al., 2016; dos Santos et al., 2016), i.e., the weights of the post-convolution pooling layer are determined by attention. These weights come from an energy function between hidden states of two text pieces. However, a weight value is not informative enough to tell the relationships between aligned objects. Consider a textual entailment sentence pair for which we need to determine whether “inside → outside” holds. The cosine similarity of these two words is high, e.g., ≈ .7 in word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). On the other hand, cosine similarity between “inside” and “in” is lower: .31 in word2vec, .46 in GloVe. Apparently, the higher number .7 does not mean “outside” is more likely than “in” to be entailed by “inside”. Instead, joint representations for aligned phrases $[\mathbf{h}_{inside}, \mathbf{h}_{outside}]$, $[\mathbf{h}_{inside}, \mathbf{h}_{in}]$ are more informative and enable fine-grained reasoning. This illustrates why attentive context vectors participating in the convolution operation are expected to be more effective than the post-convolution attentive pooling.

Inter-text attention & intra-text attention. Figures 1(a)-1(b) depict the modeling for two text pieces $t^x$ and $t^y$. This is a common application of the attention mechanism in the literature; we call it inter-text attention. But AttentiveConvNet can also be applied to model a single text input, i.e., intra-text attention. As the sentiment analysis example in Table 1 shows, a text piece can contain informative points at different locations; conventional CNNs’ ability to model nonlocal dependency is limited due to fixed-size filter widths. In AttentiveConvNet, we can set $t^a = t^x$. The attentive context vector then accumulates all related parts together for a given position. In other words, our intra-text attentive convolution is able to connect all related spans together to form a comprehensive decision. This is a new way to broaden the scope of conventional filter widths: a filter now covers not only the local window, but also those spans that are related yet beyond the scope of the window.

4 Experiments

We evaluate intertext attention on sentence relation classification (textual entailment and answer sentence selection) and intratext attention on “single-text” classification.

All experiments share a common setup. The input is represented using 300-dimensional publicly available GloVe embeddings; OOVs are randomly initialized. The architecture consists of the following seven layers in sequence: embedding, attentive convolution, max-pooling, composition, hidden 1, hidden 2 and logistic regression. The input to logistic regression is the concatenation of the outputs of the previous three layers: composition, hidden 1 and hidden 2. We use AdaGrad (Duchi et al., 2011) for training. Embeddings are fine-tuned during training.

The natural settings for sentence relation and single-text classification are intertext and intratext, respectively. For sentence relation, we also test using both intertext and intratext, referred to as “advanced&intra-attention” in the tables. We always report “light” and “advanced” AttentiveConvNet performance and compare against three types of baselines: (i) w/o-attention, (ii) with-attention: RNNs with attention and attentive pooling CNNs and (iii) prior state of the art, typeset in italics.

4.1 Textual Entailment

Dataset: Stanford Natural Language Inference (SNLI, Bowman et al. (2015)), split 549,367 / 9,842 / 9,824 into train/dev/test; sentence relation classes: entailment, contradiction, neutral. Setup: dropout of .1 for the output of each layer, learning rate .02, hidden size 300 across layers, batch size 50, filter width 3. In this task, the architecture is “Siamese” up to the max-pooling layer for the two inputs $t^x$ and $t^y$. Let $\mathbf{r}^z$ be the output of max-pooling for input $t^z$. The composition layer concatenates $\mathbf{r}^x$, $\mathbf{r}^y$ and $\mathbf{r}^x \odot \mathbf{r}^y$ (where $\odot$ is dot product) and passes this on as input to hidden layer 1.

  Systems                                          acc
  w/o attention
    bi-CNN                                         80.3
    bi-LSTM (Bowman et al., 2015)                  77.6
    Tree-CNN (Mou et al., 2016)                    82.1
    NSE (Munkhdalai and Yu, 2017)                  84.8
  with attention
    W-by-W attention (Rocktäschel et al., 2016)    83.5
    Self-Attentive (Lin et al., 2017)              84.4
    Match-LSTM (Wang and Jiang, 2016)              86.1
    Decompose Attention (Parikh et al., 2016)      86.8
    LSTMN (Cheng et al., 2016)                     89.1
    ABCNN (Yin et al., 2016)                       83.7
    APCNN (dos Santos et al., 2016)                83.9
  AttentiveConvNet
    light                                          86.3
    advanced                                       87.8
    advanced&intra-attention                       88.4
    ensemble                                       89.3

Table 3: Performance comparison on SNLI test

Baselines: (i) w/o-attention. bi-CNN: Siamese CNN, very similar to AttentiveConvNet, but without attention; bi-LSTM (Bowman et al., 2015): Siamese LSTM; Tree-CNN (Mou et al., 2016): Siamese CNN over dependency trees of sentences; NSE: Neural Semantic Encoders (Munkhdalai and Yu, 2017); (ii) with-attention. Word-by-Word Attention (Rocktäschel et al., 2016), the first work that employs the standard attention mechanism in an RNN system in this task, and its enhanced variants: Match-LSTM (Wang and Jiang, 2016) and the state-of-the-art LSTMN (Cheng et al., 2016); Self-Attentive (Lin et al., 2017), an intra-sentence attention model; Decompose Attention (Parikh et al., 2016), the first work that achieves fine-grained cross-sentence alignments and reasoning without convolutional or recurrent components. Attentive pooling CNNs: ABCNN (Yin et al., 2016) and APCNN (dos Santos et al., 2016).

Results. In Table 3, AttentiveConvNet outperforms bi-CNN, its w/o-attention equivalent, by at least 6.0 points (86.3 vs. 80.3 for the light version). This shows the effectiveness of attentive convolution. The AttentiveConvNet ensemble² outperforms all prior with-attention work: W-by-W attention (Rocktäschel et al., 2016), Self-Attentive (Lin et al., 2017), Match-LSTM (Wang and Jiang, 2016) and LSTMN (Cheng et al., 2016). The single system advanced&intra-attention outperforms three of these baselines and is close to LSTMN (89.1 vs. 88.4).

² The ensemble uses five copies of the AttentiveConvNet-advanced&intra-attention system with different parameters, trained over different minibatches. In testing, we forward the same minibatch to each copy, average the output probabilities and pick the highest-probability label.
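The test-time combination described in the ensemble footnote (average the output probabilities of the copies, then pick the highest-probability label) amounts to the following sketch. The function name `ensemble_predict` and the example probabilities are hypothetical and only illustrate the averaging step.

```python
def ensemble_predict(prob_lists, labels):
    """Average per-model class probabilities and return the best label.

    prob_lists: one probability vector per model copy, all over the same labels.
    """
    n_models = len(prob_lists)
    avg = [sum(p[k] for p in prob_lists) / n_models for k in range(len(labels))]
    best = max(range(len(labels)), key=lambda k: avg[k])
    return labels[best], avg

# Illustrative call with made-up probabilities from three model copies.
labels = ["entailment", "contradiction", "neutral"]
label, avg = ensemble_predict(
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.5, 0.4, 0.1]], labels)
```

Averaging probabilities rather than taking a majority vote lets a confident copy outweigh copies that are nearly undecided.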

Analysis. As attentive pooling CNN baselines, we use ABCNN and APCNN. As we discussed in related work, in attentive pooling, information flows through attention weights rather than through attentive context vectors. Attentive context vectors are much more informative than attention weights when making decisions based on aligned phrases. Take the following SNLI pair as an example. Premise: “A couple is eating outside at a table and he is pointing at something”; Hypothesis: “A couple is eating inside at a table and he is pointing at something”. They only differ by a single word: outside vs. inside. However, outside and inside have cosine similarity ≈ .7 for word2vec and GloVe. There are two problems for ABCNN and APCNN. (i) Attentive pooling has an implicit assumption that hidden states that are better matched should be highly weighted. In the above example, all words except for the contradictory pair “outside” / “inside” can find the best match with cosine similarity 1.0 since all are identical words. This means these identical words will be more highly weighted than “outside” / “inside” in the respective text representation. However, it is apparent that “outside” / “inside” are more decisive than other words in this instance. Attentive convolution takes the hidden states of “outside” and “inside” (not their uninformative similarity) directly as input. This capability is needed here to make the correct decision. (ii) Attentive pooling essentially is a weighted average operation. Each text representation is therefore an unordered sum of all hidden states. This makes it less able to distinguish the relationships of individual, aligned hidden state pairs. Attentive convolution instead relies on both the two aligned hidden states and their context hidden states to make a fine-grained judgment.

Recall that in Section 3.2 we implemented unigram-level and trigram-level hidden states for multigranular alignments and proposed a gating mechanism to achieve alignment across granularities. Therefore, apart from the above overall performance, we also evaluate the contributions of the following three key architecture settings of AttentiveConvNet: (i) tri-hidden states in attention source and attention focus; (ii) uni-hidden states in attention source and attention focus; (iii) convolution gates in attention source, focus and beneficiary learning. Table 4 reports the ablation test of AttentiveConvNet (advanced&intra-attention) on SNLI dev. Each component is found to contribute to overall performance; uni-hidden states show an especially big benefit compared to tri-hidden states. This hints that unigram-level alignment already provides strong information. This is consistent with the basic rationale of previous work (Parikh et al., 2016) and it is why we do not stack another attentive convolution layer.

Systems                                        acc
AttentiveConvNet (advanced+intra-attention)    88.7
  w/o tri-hidden                               88.4
  w/o uni-hidden                               87.3
  w/o gate                                     88.2

Table 4: Ablation test on SNLI dev

4.2 Answer Sentence Selection

We create an answer sentence selection benchmark based on SQUAD (Rajpurkar et al., 2016), referred to as SQUAD-AnSS, as follows. A SQUAD instance has the form (passage, question, answer) where answer is a subsequence of passage. We split each passage into sentences. The sentence containing the answer is labeled positive, the other sentences negative. As SQUAD only releases train and dev publicly (not test), we use 105,980 question-sentence pairs derived from squad-dev as the test set. We derive 448,307 question-sentence pairs from squad-train and split them into train (400,000) and dev (48,307).
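The construction above can be sketched as follows. This is only an illustration: the regex sentence splitter and the substring test for "contains the answer" are assumptions (the text does not say which splitter was used or how answer offsets were handled), and the function names are hypothetical.

```python
import re


def split_sentences(passage):
    """Naive sentence splitter on ., ! and ? (an assumption, not the authors' tool)."""
    parts = re.split(r"(?<=[.!?])\s+", passage)
    return [p.strip() for p in parts if p.strip()]


def make_anss_pairs(passage, question, answer):
    """Turn one (passage, question, answer) instance into labeled pairs.

    Each passage sentence yields (question, sentence, label), where label is 1
    if the sentence contains the answer string and 0 otherwise.
    """
    return [(question, s, int(answer in s)) for s in split_sentences(passage)]


pairs = make_anss_pairs(
    passage="Munich is in Bavaria. It hosts LMU. The river Isar flows through it.",
    question="Which university is in Munich?",
    answer="LMU",
)
for question, sentence, label in pairs:
    print(label, sentence)
```

A plain substring test can mark more than one sentence as positive when the answer string recurs in the passage; using the answer's character offset would avoid that.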

Most work on SQUAD processes the entire input paragraph, then detects the answer span in this long sequence. SQUAD-AnSS is a stepping stone towards an alternative approach: first rank sentences, then detect the answer span in the top-ranked sentence. Our performance is almost 90% (Table 5), suggesting that this approach is promising.

Systems                                  p@1
w/o attention
  WordC1                                 68.87
  WordC2                                 79.59
  CDSSM (Shen et al., 2014)              67.21
  bi-CNN                                 77.94
  Bi-GRU (Tang et al., 2017)             78.14
with attention
  Sentence-Rank                          86.00
  ABCNN (Yin et al., 2016)               84.72
  APCNN (dos Santos et al., 2016)        84.14
AttentiveConvNet
  light                                  87.42∗
  advanced                               88.27∗
  advanced&intra-att                     88.54∗

Table 5: System comparison on SQUAD-AnSS. Significant improvements over state of the art are marked with ∗ (test of equal proportions, p < .05).

Setup. We treat this task as a ranking problem: rank the predicted probability of a positive pair higher than that of a negative pair by a margin (set to .85). We use precision at 1 for evaluation since only the top-1 sentence is used in our SQUAD application scenario. Other hyperparameters: learning rate .02, batch size 30. The composition layer outputs [rx, ry, rx ∘ ry] as for textual entailment.
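The ranking objective and the precision-at-1 measure from the Setup can be sketched as below. The pairwise hinge form and the summation over negatives are assumptions; the text only states that a positive pair should outscore a negative pair by a margin of .85. Function names are hypothetical.

```python
import numpy as np


def margin_ranking_loss(pos_score, neg_scores, margin=0.85):
    """Hinge loss pushing the positive score above each negative score by `margin`.

    Assumed form: sum over negatives of max(0, margin - pos + neg).
    """
    neg_scores = np.asarray(neg_scores, dtype=float)
    return float(np.maximum(0.0, margin - pos_score + neg_scores).sum())


def precision_at_1(questions):
    """Fraction of questions whose top-scored sentence is a positive one.

    `questions` is a list of (scores, labels) pairs, one pair per question.
    """
    hits = 0
    for scores, labels in questions:
        top = int(np.argmax(scores))
        hits += int(labels[top] == 1)
    return hits / len(questions)


loss = margin_ranking_loss(0.9, [0.1, 0.5])
p_at_1 = precision_at_1([
    ([0.2, 0.9, 0.1], [0, 1, 0]),  # top-ranked sentence is the positive one
    ([0.8, 0.3], [0, 1]),          # top-ranked sentence is a negative one
])
print(loss, p_at_1)
```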

Baselines. (i) w/o-attention. Two feature engineering methods: WordC1 (word cooccurrence normalized by answer sentence length) and WordC2 (word cooccurrence normalized by question length); three DNN systems: CDSSM (Shen et al., 2014), bi-CNN and Bi-(directional) GRU (Tang et al., 2017). (ii) with-attention. Two attentive pooling CNNs (ABCNN and APCNN) and the state-of-the-art system Sentence-Rank (Wang et al., 2017a).
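A minimal sketch of the two word cooccurrence features follows. Lowercasing, whitespace tokenization and counting distinct shared word types are assumptions; the text only specifies the two normalizers.

```python
def word_cooccurrence(question, sentence):
    """Return (WordC1, WordC2) for one question-sentence pair.

    WordC1 normalizes the overlap by answer sentence length,
    WordC2 normalizes it by question length.
    """
    q_tokens = question.lower().split()
    s_tokens = sentence.lower().split()
    overlap = len(set(q_tokens) & set(s_tokens))
    wordc1 = overlap / len(s_tokens) if s_tokens else 0.0
    wordc2 = overlap / len(q_tokens) if q_tokens else 0.0
    return wordc1, wordc2


c1, c2 = word_cooccurrence("who founded lmu munich", "lmu munich was founded in 1472")
print(c1, c2)
```

Candidate sentences for a question would then be ranked by one of the two scores.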

Results and Analysis. AttentiveConvNet outperforms all prior systems. It is 3 points above the attentive pooling CNNs and up to 2.54 above the state of the art. We attribute the success of AttentiveConvNet to two factors. (i) A question-sentence match is more easily and effectively detected by matching some local core phrases than by global semantic matching. This is widely recognized in the literature. E.g., WordC2, the simple method with overlapping features, is even more effective than the Bi-GRU of Tang et al. (2017) (note that GRU systems are widely developed to learn global semantics of text); this indicates that surface overlap is already very effective for downweighting many negative sentence candidates. CNNs are the DNNs with the most strength in deriving robust local features. (ii) Attentive convolution enables fine-grained cross-text phrase matching with consideration of surrounding context so that more effective reasoning can be achieved.

Systems                          acc
w/o attention
  Paragraph Vector               58.43
  Lin et al. BiLSTM              61.99
  Lin et al. CNN                 62.05
  MultichannelCNN (Kim)          64.62
with attention
  CNN+internal attention         61.43
  Lin et al. RNN Self-Att.       64.21
AttentiveConvNet
  light                          66.75
  advanced                       67.36∗

Table 6: System comparison on Yelp. Significant improvements over state of the art are marked with ∗ (test of equal proportions, p < .05).
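The test of equal proportions named in the caption can be sketched as a pooled two-proportion z-test. Whether a one-sided or two-sided variant was used, and which systems were compared, is not stated; the two-sided form and the counts below are assumptions chosen only for illustration.

```python
import math


def equal_proportions_test(correct_a, correct_b, n_a, n_b):
    """Two-sided pooled z-test for the equality of two proportions.

    Returns (z, p_value) under the normal approximation.
    """
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 1347 vs. 1284 correct predictions out of 2000 each.
z, p_value = equal_proportions_test(1347, 1284, 2000, 2000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```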

4.3 Text Classification

We evaluate sentiment analysis on Yelp (Lin et al., 2017): 500K/2000/2000 review-star pairs in train/dev/test. Most text instances in this dataset are long: the 25%, 50% and 75% percentiles are 46, 81 and 125 words, respectively. The task is five-way classification: one to five stars. The measure is accuracy.

Setup. The composition layer passes the output rx of max-pooling through without modification. Hyperparameters: learning rate .01, hidden size 500 across layers, dropout .1, batch size 10.

Baselines. (i) w/o-attention. Three baselines from Lin et al. (2017): Paragraph Vector (Le and Mikolov, 2014) (unsupervised sentence representation learning), BiLSTM and CNN. We also reimplement MultichannelCNN (Kim, 2014), recognized as a simple but surprisingly strong sentence modeler. (ii) with-attention. RNN Self-Attention (Lin et al., 2017) is directly comparable to AttentiveConvNet: it also uses intra-text attention. We also reimplement CNN+internal attention (Adel and Schütze, 2017), an intra-text attention idea similar to, but less complicated than, that of Lin et al. (2017).

Figure 2: AttentiveConvNet vs. MultichannelCNN. (x-axis: indices of sorted text groups, 1 to 10; y-axis: acc; curves: MultichannelCNN, AttentiveConvNet, and 0.6 + difference of the two curves.)

Results and Analysis. Table 6 shows that AttentiveConvNet outperforms the w/o-attention baselines. More importantly, it outperforms the two self-attentive models: CNN+internal attention (Adel and Schütze, 2017) and RNN Self-Attention (Lin et al., 2017). Adel and Schütze (2017) generate an attention weight for each CNN hidden state by a linear transformation of the same hidden state, then compute the weighted average over all hidden states as the text representation. Lin et al. (2017) extend that idea by generating a group of attention weight vectors; RNN hidden states are then averaged by those diverse weight vectors, allowing different aspects of the text to be extracted into multiple vector representations. Both works are essentially weighted mean pooling, similar to the attentive pooling in Yin et al. (2016) and dos Santos et al. (2016).
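The weighted mean pooling described in the Results and Analysis paragraph above can be sketched as follows. This is a simplification: the softmax normalization and the purely linear scoring are assumptions (Lin et al. (2017) score hidden states with a small nonlinear network), and the parameter names `w` and `W` are hypothetical.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def single_vector_pooling(H, w):
    """One weight per hidden state from a linear map, then a weighted average.

    H: (n, d) hidden states, w: (d,) scoring vector. Returns a (d,) vector.
    """
    alpha = softmax(H @ w)   # (n,) attention weights that sum to 1
    return alpha @ H


def multi_vector_pooling(H, W):
    """A group of weight vectors, one per column of W.

    H: (n, d), W: (d, r). Returns (r, d): one pooled vector per weight vector.
    """
    A = softmax(H @ W, axis=0)   # (n, r), each column sums to 1
    return A.T @ H


rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
w = rng.normal(size=4)
W = rng.normal(size=(4, 3))

pooled = single_vector_pooling(H, w)
shuffled = single_vector_pooling(H[::-1], w)  # same states, reversed order
print(np.allclose(pooled, shuffled))
```

Reversing the order of the hidden states leaves the pooled vector unchanged, which is the sense in which weighted mean pooling yields an unordered sum.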

Next, we compare AttentiveConvNet with MultichannelCNN for different length ranges. We sort the 2000 test instances by length, then split them into 10 groups, each consisting of 200 instances. Figure 2 shows the performance of AttentiveConvNet vs. MultichannelCNN.
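The length-based grouping can be sketched as below; the function name and the synthetic inputs are hypothetical, and ties in length are broken by original order, which is an assumption.

```python
import numpy as np


def accuracy_by_length_group(lengths, correct, n_groups=10):
    """Sort instances by length, split into equal groups, return per-group accuracy.

    lengths: text length of each instance; correct: 1 if the prediction was
    right, else 0. With 2000 instances and 10 groups, each group has 200.
    """
    lengths = np.asarray(lengths)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(lengths, kind="stable")
    groups = np.array_split(correct[order], n_groups)
    return [float(g.mean()) for g in groups]


# Synthetic illustration only: random lengths and random correctness flags.
rng = np.random.default_rng(0)
lengths = rng.integers(5, 400, size=2000)
correct = rng.integers(0, 2, size=2000)
per_group = accuracy_by_length_group(lengths, correct)
print(len(per_group))
```

Running this for two systems and subtracting the per-group accuracies gives the kind of difference curve plotted in Figure 2.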

We observe that AttentiveConvNet consistently outperforms MultichannelCNN, the strongest baseline system, for all lengths. Furthermore, the improvement over MultichannelCNN generally increases with length. This is evidence that AttentiveConvNet models long text more effectively. This is likely due to AttentiveConvNet’s capability to encode broader context in its filters.

5 Summary

We presented AttentiveConvNet, the first work that enables CNNs to acquire the attention mechanism commonly employed in RNNs. AttentiveConvNet combines the strengths of CNNs with the strengths of the RNN attention mechanism. On the one hand, it makes broad and rich context available for prediction, either context from external inputs (intertext) or internal inputs (intratext). On the other hand, it can take full advantage of the strengths of convolution: it is more order-sensitive than attention in RNNs and local-context information can be powerfully and efficiently modeled through convolution filters. Our experiments demonstrate that AttentiveConvNet performs better than prior DNNs with and without attention.

References

Heike Adel and Hinrich Schütze. 2017. Exploring different dimensions of attention for uncertainty detection. In Proceedings of EACL, pages 22–34.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015a. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015b. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2014. A semantic matching energy function for learning with multi-relational data - application to word-sense disambiguation. Machine Learning, 94(2):233–259.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP, pages 632–642.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017a. Reading Wikipedia to answer open-domain questions. In Proceedings of ACL, pages 1870–1879.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017b. Enhanced LSTM for natural language inference. In Proceedings of ACL, pages 1657–1668.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of EMNLP, pages 551–561.

Cícero Nogueira dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive pooling networks. CoRR, abs/1602.03609.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Yichen Gong and Samuel R. Bowman. 2017. Ruminating reader: Reasoning with gated multi-hop attention. CoRR, abs/1704.07415.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. CoRR, abs/1410.5401.

Alex Graves. 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of NIPS, pages 1693–1701.

Minghao Hu, Yuxing Peng, and Xipeng Qiu. 2017. Mnemonic reader for machine comprehension. CoRR, abs/1705.02798.

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured attention networks. In Proceedings of ICLR.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, pages 1746–1751.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In Proceedings of ICML, pages 1378–1387.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of ICML, pages 1188–1196.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Kenton Lee, Tom Kwiatkowski, Ankur P. Parikh, and Dipanjan Das. 2016. Learning recurrent span representations for extractive question answering. CoRR, abs/1611.01436.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of ACL, pages 1106–1115.

Jindřich Libovický and Jindřich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of ACL, pages 196–202.

Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In Proceedings of ICLR.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, pages 1412–1421.

Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In Proceedings of ICML, pages 1727–1736.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.

Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In Proceedings of ACL, pages 130–136.

Tsendsuren Munkhdalai and Hong Yu. 2017. Neural semantic encoders. In Proceedings of EACL, pages 397–407.

Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of CoNLL, pages 280–290.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of EMNLP, pages 2249–2255.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, pages 2383–2392.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proceedings of ICLR.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of ACL, pages 1577–1586.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of CIKM, pages 101–110.

Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2017. ReasoNet: Learning to stop reading in machine comprehension. In Proceedings of SIGKDD, pages 1047–1055.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Proceedings of NIPS, pages 2377–2385.

Duyu Tang, Nan Duan, Tao Qin, and Ming Zhou. 2017. Question answering and question generation as dual tasks. CoRR, abs/1706.02027.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of NAACL, pages 1442–1451.

Shuohang Wang and Jing Jiang. 2017. Machine comprehension using Match-LSTM and answer pointer. In Proceedings of ICLR.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017a. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL, pages 189–198.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017b. Bilateral multi-perspective matching for natural language sentences. In Proceedings of IJCAI, pages 4144–4150.

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In Proceedings of ICML, pages 2397–2406.

Caiming Xiong, Victor Zhong, and Richard Socher. 2017. Dynamic coattention networks for question answering. In Proceedings of ICLR.

Wenpeng Yin and Hinrich Schütze. 2015. MultiGranCNN: An architecture for general matching of text chunks on multiple levels of granularity. In Proceedings of ACL, pages 63–73.

Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. TACL, 4:259–272.
