
Hadi Amiri, Philip Resnik, Jordan Boyd-Graber, and Hal Daumé III. Learning Text Pair Similarity with Context-sensitive Autoencoders. Association for Computational Linguistics, 2016.

@inproceedings{Amiri:Resnik:Boyd-Graber:Daume-III-2016,
  Author = {Hadi Amiri and Philip Resnik and Jordan Boyd-Graber and Hal {Daum\'{e} III}},
  Url = {docs/2016_acl_context_ae.pdf},
  Booktitle = {Association for Computational Linguistics},
  Location = {Berlin, Brandenburg},
  Year = {2016},
  Title = {Learning Text Pair Similarity with Context-sensitive Autoencoders},
}

Links:

• Code [http://www.umiacs.umd.edu/~hadi/csAe/index.html]

Downloaded from http://cs.colorado.edu/~jbg/docs/2016_acl_context_ae.pdf


Learning Text Pair Similarity with Context-sensitive Autoencoders

Hadi Amiri¹, Philip Resnik¹, Jordan Boyd-Graber², Hal Daumé III¹
¹Institute for Advanced Computer Studies, University of Maryland, College Park, MD

²Department of Computer Science, University of Colorado, Boulder, CO

{hadi,resnik,hal}@umd.edu, [email protected]

Abstract

We present a pairwise context-sensitive autoencoder for computing text pair similarity. Our model encodes input text into context-sensitive representations and uses them to compute similarity between text pairs. Our model outperforms the state-of-the-art models in two semantic retrieval tasks and a contextual word similarity task. For retrieval, our unsupervised approach that merely ranks inputs with respect to the cosine similarity between their hidden representations shows comparable performance with the state-of-the-art supervised models and in some cases outperforms them.

1 Introduction

Representation learning algorithms learn representations that reveal intrinsic low-dimensional structure in data (Bengio et al., 2013). Such representations can be used to induce similarity between textual contents by computing similarity between their respective vectors (Huang et al., 2012; Silberer and Lapata, 2014).

Recent research has made substantial progress on semantic similarity using neural networks (Rothe and Schütze, 2015; Dos Santos et al., 2015; Severyn and Moschitti, 2015). In this work, we focus our attention on deep autoencoders and extend these models to integrate sentential or document context information about their inputs. We represent context information as low-dimensional vectors that are injected into deep autoencoders. To the best of our knowledge, this is the first work that enables integrating context into autoencoders.

In representation learning, context may appear in various forms. For example, the context of a current sentence in a document could be its neighboring sentences (Lin et al., 2015; Wang and Cho, 2015), topics associated with the sentence (Mikolov and Zweig, 2012; Le and Mikolov, 2014), the document that contains the sentence (Huang et al., 2012), or combinations of these (Ji et al., 2016). It is important to integrate context into neural networks because these models are often trained with only local information about their individual inputs. For example, recurrent and recursive neural networks only use local information about previously seen words in a sentence to predict the next word or composition.¹ Context information (such as topical information), on the other hand, often captures global information that can guide neural networks to generate more accurate representations.

We investigate the utility of context information in three semantic similarity tasks: contextual word sense similarity, in which we aim to predict semantic similarity between given word pairs in their sentential context (Huang et al., 2012; Rothe and Schütze, 2015); question ranking, in which we aim to retrieve semantically equivalent questions with respect to a given test question (Dos Santos et al., 2015); and answer ranking, in which we aim to rank single-sentence answers with respect to a given question (Severyn and Moschitti, 2015).

The contributions of this paper are as follows: (1) integrating context information into deep autoencoders and (2) showing that such integration improves the representation performance of deep autoencoders across several different semantic similarity tasks.

Our model outperforms the state-of-the-art supervised baselines in three semantic similarity tasks. Furthermore, the unsupervised version of our autoencoder shows comparable performance with the supervised baseline models and in some cases outperforms them.

¹ For example, RNNs can predict the word "sky" given the sentence "clouds are in the ___," but they are less accurate when longer history or global context is required, e.g., predicting the word "French" given the paragraph "I grew up in France. ... I speak fluent ___."

2 Context-sensitive Autoencoders

2.1 Basic Autoencoders

We first provide a brief description of basic autoencoders and extend them to context-sensitive ones in the next section. Autoencoders are trained using a local unsupervised criterion (Vincent et al., 2010; Hinton and Salakhutdinov, 2006; Vincent et al., 2008). Specifically, the basic autoencoder in Figure 1(a) locally optimizes the hidden representation h of its input x such that h can be used to accurately reconstruct x,

h = g(Wx + bh)    (1)

x̂ = g(W′h + bx),    (2)

where x̂ is the reconstruction of x, the learning parameters W ∈ R^{d′×d} and W′ ∈ R^{d×d′} are weight matrices, bh ∈ R^{d′} and bx ∈ R^{d} are bias vectors for the hidden and output layers respectively, and g is a nonlinear function such as tanh(·).² Equation (1) encodes the input into an intermediate representation and Equation (2) decodes the resulting representation.

Training a single-layer autoencoder corresponds to optimizing the learning parameters to minimize the overall loss between inputs and their reconstructions. For real-valued x, the squared loss l(x) = ‖x − x̂‖² is often used (Vincent et al., 2010):

min_Θ ∑_{i=1}^{n} l(x^(i)),    Θ = {W, W′, bh, bx}.    (3)

This can be achieved using mini-batch stochastic gradient descent (Zeiler, 2012).
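To make Equations (1)–(3) concrete, the following is a minimal NumPy sketch of the forward pass and reconstruction loss; the variable names, toy sizes, and random initialization are ours rather than the paper's, and the actual parameter updates would be handled by mini-batch SGD as noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden = 100, 100          # input and hidden sizes (illustrative values)

# Learning parameters Theta = {W, W', b_h, b_x}
W = rng.normal(scale=0.01, size=(d_hidden, d))
Wp = rng.normal(scale=0.01, size=(d, d_hidden))
b_h = np.zeros(d_hidden)
b_x = np.zeros(d)

def encode(x):
    # Equation (1): h = g(Wx + b_h), with g = tanh
    return np.tanh(W @ x + b_h)

def decode(h):
    # Equation (2): reconstruction of x; with squared loss the decoder
    # nonlinearity is often dropped (footnote 2)
    return Wp @ h + b_x

def reconstruction_loss(x):
    # l(x) = ||x - x_hat||^2; Equation (3) sums this over training examples
    x_hat = decode(encode(x))
    return float(np.sum((x - x_hat) ** 2))

x = rng.normal(size=d)           # a toy real-valued input vector
print(reconstruction_loss(x))
```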

2.2 Integrating Context into Autoencoders

We extend the above basic autoencoder to integrate context information about inputs. We assume that, for each training example x ∈ R^d, we have a context vector cx ∈ R^k that contains contextual information about the input.³

² If the squared loss is used for optimization, as in Equation (3), nonlinearity is often not used in Equation (2) (Vincent et al., 2010).

³ We slightly abuse the notation throughout this paper by referring to cx or hi as vectors, not elements of vectors.


Figure 1: Schematic representation of basic and context-sensitive autoencoders: (a) the basic autoencoder maps its input x into the representation h such that it can reconstruct x with minimum loss, and (b) the context-sensitive autoencoder maps its inputs x and hc into a context-sensitive representation h (hc is the representation of the context information associated with x).

The nature of this context vector depends on the input and target task. For example, neighboring words can be considered the context of a target word in the contextual word similarity task.

We first learn the hidden representation hc ∈ R^{d′} for the given context vector cx. For this, we use the same process as discussed above for the basic autoencoder, where we use cx as the input in Equations (1) and (2) to obtain hc. We then use hc to develop our context-sensitive autoencoder as depicted in Figure 1(b). This autoencoder maps its inputs x and hc into a context-sensitive representation h as follows:

h = g(Wx + Vhc + bh)    (4)

x̂ = g(W′h + bx)    (5)

ĥc = g(V′h + bhc).    (6)

Our intuition is that if h leads to a good reconstruction of its inputs, it has retained the information available in the input. Therefore, it is a context-sensitive representation.

The loss function must then compute the loss between the input pair (x, hc) and its reconstruction (x̂, ĥc). For optimization, we can still use the squared loss with a different set of parameters to minimize the overall loss on the training examples:

l(x, hc) = ‖x − x̂‖² + λ‖hc − ĥc‖²

min_Θ ∑_{i=1}^{n} l(x^(i), hc^(i)),    Θ = {W, W′, V, V′, bh, bx, bhc},    (7)


Figure 2: Proposed framework for integrating context into deep autoencoders. The context layer (cx and hc) and the context-sensitive representation of the input (hn) are shown in light red and gray, respectively. (a) Pre-training properly initializes a stack of context-sensitive denoising autoencoders (DAEs), (b) a context-sensitive deep autoencoder is created from the properly initialized DAEs, and (c) the network in (b) is unrolled and its parameters are fine-tuned for optimal reconstruction.

where λ ∈ [0, 1] is a weight parameter that controls the effect of context information in the reconstruction process.
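As an illustration, here is a minimal NumPy sketch of Equations (4)–(7), assuming hc has already been obtained by running the basic autoencoder of Section 2.1 on the context vector cx; shapes, names, and random initialization are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden = 100, 100                            # illustrative sizes

# Theta = {W, W', V, V', b_h, b_x, b_hc}
W = rng.normal(scale=0.01, size=(d_hidden, d))
Wp = rng.normal(scale=0.01, size=(d, d_hidden))
V = rng.normal(scale=0.01, size=(d_hidden, d_hidden))
Vp = rng.normal(scale=0.01, size=(d_hidden, d_hidden))
b_h, b_x, b_hc = np.zeros(d_hidden), np.zeros(d), np.zeros(d_hidden)

def context_forward(x, h_c):
    h = np.tanh(W @ x + V @ h_c + b_h)            # Equation (4)
    x_hat = np.tanh(Wp @ h + b_x)                 # Equation (5)
    h_c_hat = np.tanh(Vp @ h + b_hc)              # Equation (6)
    return h, x_hat, h_c_hat

def context_loss(x, h_c, lam=0.5):
    # Equation (7): reconstruct both the input and its context representation
    _, x_hat, h_c_hat = context_forward(x, h_c)
    return float(np.sum((x - x_hat) ** 2) + lam * np.sum((h_c - h_c_hat) ** 2))

x, h_c = rng.normal(size=d), rng.normal(size=d_hidden)
print(context_loss(x, h_c, lam=0.5))
```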

2.2.1 Denoising

Denoising autoencoders (DAEs) reconstruct an input from a corrupted version of it for more effective learning (Vincent et al., 2010). The corrupted input is then mapped to a hidden representation from which we obtain the reconstruction. However, the reconstruction loss is still computed with respect to the uncorrupted version of the input, as before. Denoising autoencoders effectively learn representations by reversing the effect of the corruption process. We use masking noise to corrupt the inputs, where a fraction η of input units are randomly selected and set to zero (Vincent et al., 2008).
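A small sketch of the masking-noise corruption described here (the function name and sizes are ours): a fraction η of the input units is selected at random and set to zero, while the loss is still computed against the clean input.

```python
import numpy as np

def mask_corrupt(x, eta, rng):
    """Return a copy of x with a fraction eta of its units set to zero."""
    corrupted = x.copy()
    n_zero = int(round(eta * x.size))
    idx = rng.choice(x.size, size=n_zero, replace=False)
    corrupted[idx] = 0.0
    return corrupted

rng = np.random.default_rng(0)
x = rng.normal(size=300)
x_tilde = mask_corrupt(x, eta=0.2, rng=rng)   # feed x_tilde to the encoder,
                                              # but compute the loss against x
```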

2.2.2 Deep Context-Sensitive Autoencoders

Autoencoders can be stacked to create deep networks. A deep autoencoder is composed of multiple hidden layers that are stacked together. The initial weights in such networks need to be properly initialized through a greedy layer-wise training approach. Random initialization does not work because deep autoencoders converge to poor local minima with large initial weights and result in tiny gradients in the early layers with small initial weights (Hinton and Salakhutdinov, 2006).

Our deep context-sensitive autoencoder is composed of a stacked set of DAEs. As discussed above, we first need to properly initialize the learning parameters (weights and biases) associated with each DAE. As shown in Figure 2(a), we first train DAE-0, which initializes the parameters associated with the context layer. The training procedure is exactly the same as training a basic autoencoder (Section 2.1 and Figure 1(a)).⁴ We then treat hc and x as "inputs" for DAE-1 and use the same approach as in training a context-sensitive autoencoder to initialize the parameters of DAE-1 (Section 2.2 and Figure 1(b)). Similarly, the i-th DAE is built on the output of the (i−1)-th DAE, and so on, until the desired number of layers (e.g., n layers) are initialized. For denoising, the corruption is only applied to the "inputs" of individual autoencoders. For example, when we are training DAE-i, h_{i−1} and hc are first obtained from the original inputs of the network (x and cx) through a single forward pass, and then their corrupted versions are computed to train DAE-i.

⁴ Figure 2(a) shows compact schematic diagrams of the autoencoders used in Figures 1(a) and 1(b).

Figure 2(b) shows that the n properly initialized DAEs can be stacked to form a deep context-sensitive autoencoder. We unroll this network to fully optimize its weights through gradient descent and backpropagation (Vincent et al., 2010; Hinton and Salakhutdinov, 2006).
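The encoder side of the stacked network in Figure 2(b) can be summarized by the sketch below: layer 0 encodes the context vector, layer 1 combines the input with hc, and every deeper layer re-injects hc. The randomly initialized weights here are placeholders standing in for the DAE pre-trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, d_hidden, n_layers = 300, 200, 200, 3      # illustrative sizes

# Placeholder parameters; in the model these come from greedy DAE pre-training.
V0 = rng.normal(scale=0.01, size=(d_hidden, k))            # context layer (DAE-0)
Ws = [rng.normal(scale=0.01, size=(d_hidden, d))] + \
     [rng.normal(scale=0.01, size=(d_hidden, d_hidden)) for _ in range(n_layers - 1)]
Vs = [rng.normal(scale=0.01, size=(d_hidden, d_hidden)) for _ in range(n_layers)]
bs = [np.zeros(d_hidden) for _ in range(n_layers + 1)]

def encode_stack(x, c_x):
    h_c = np.tanh(V0 @ c_x + bs[0])                        # context representation hc
    h = np.tanh(Ws[0] @ x + Vs[0] @ h_c + bs[1])           # DAE-1
    for i in range(1, n_layers):                           # DAE-2 ... DAE-n
        h = np.tanh(Ws[i] @ h + Vs[i] @ h_c + bs[i + 1])
    return h                                               # context-sensitive h_n

h_n = encode_stack(rng.normal(size=d), rng.normal(size=k))
```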

2.2.3 Unrolling and Fine-tuning

We optimize the learning parameters of our initialized context-sensitive deep autoencoder by unfolding its n layers and making a 2n−1 layer network whose lower layers form an "encoder" network and whose upper layers form a "decoder" network (Figure 2(c)). A global fine-tuning stage backpropagates through the entire network to fine-tune the weights for optimal reconstruction. In this stage, we update the network parameters again by training the network to minimize the loss between the original inputs and their actual reconstruction. We backpropagate the error derivatives first through the decoder network and then through the encoder network. Each decoder layer tries to recover the input of its corresponding encoder layer. As such, the weights are initially symmetric and the decoder weights do need to be learned.

After the training is complete, the hidden layer hn contains a context-sensitive representation of the inputs x and cx.

2.3 Context Information

Context is task and data dependent. For example, a sentence or document that contains a target word forms the word's context.

When context information is not readily available, we use topic models to determine such context for individual inputs (Blei et al., 2003; Stevens et al., 2012). In particular, we use Non-negative Matrix Factorization (NMF) (Lin, 2007): given a training set with n instances, i.e., X ∈ R^{v×n}, where v is the size of a global vocabulary and the scalar k is the number of topics in the dataset, we learn the topic matrix D ∈ R^{v×k} and context matrix C ∈ R^{k×n} using the following sparse coding algorithm:

min_{D,C} ‖X − DC‖_F² + µ‖C‖_1,    s.t. D ≥ 0, C ≥ 0,    (8)

where each column in C is a sparse representation of an input over all topics and will be used as global context information in our model. We obtain context vectors for test instances by transforming them according to the NMF model fitted on the training data. We also note that advanced topic modeling approaches, such as syntactic topic models (Boyd-Graber and Blei, 2009), can be more effective here as they generate linguistically rich context information.
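In practice, such context vectors can be computed with an off-the-shelf NMF implementation. The sketch below uses scikit-learn on a toy document-term matrix (documents as rows, so the learned document-topic factors correspond to the columns of C in Equation (8)); for simplicity the l1 sparsity term µ‖C‖_1 is omitted, and all sizes are made up.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_docs, vocab, k = 50, 400, 10                 # toy sizes; the paper uses k = 200-300
X = rng.random((n_docs, vocab))                # stand-in for a term-frequency matrix

nmf = NMF(n_components=k, init="nndsvd", max_iter=400, random_state=0)
C_train = nmf.fit_transform(X)                 # (n_docs, k): one context vector per input
C_test = nmf.transform(rng.random((5, vocab))) # transform test instances with the fitted model
```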

3 Text Pair Similarity

We present unsupervised and supervised approaches for predicting semantic similarity scores for input texts (e.g., a pair of words), each with its corresponding context information. These scores are then used to rank "documents" against "queries" (in retrieval tasks) or to evaluate how the predictions of a model correlate with human judgments (in the contextual word similarity task).


Figure 3: Pairwise context-sensitive autoencoder for computing text pair similarity.


In unsupervised settings, given a pair of input texts with their corresponding context vectors, (x1, cx1) and (x2, cx2), we determine their semantic similarity score by computing the cosine similarity between their hidden representations h^1_n and h^2_n, respectively.

In supervised settings, we use a copy of our context-sensitive autoencoder to make a pairwise architecture, as depicted in Figure 3. Given (x1, cx1), (x2, cx2), and their binary relevance score, we use h^1_n and h^2_n as well as additional features (see below) to train our pairwise network (i.e., further fine-tune the weights) to predict a similarity score for the input pair as follows:

rel(x1, x2) = softmax(M0 a + M1 h^1_n + M2 h^2_n + b),    (9)

where a carries additional features, the Ms are weight matrices, and b is the bias. We use the difference and similarity between the context-sensitive representations of the inputs, h^1_n and h^2_n, as additional features:

hsub = |h^1_n − h^2_n|,    hdot = h^1_n ⊙ h^2_n,    (10)

where hsub and hdot capture the element-wise difference and similarity (in terms of the sign of elements in each dimension) between h^1_n and h^2_n, respectively. We expect elements in hsub to be small for semantically similar and relevant inputs and large otherwise. Similarly, we expect elements in hdot to be positive for relevant inputs and negative otherwise.
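A rough sketch of both settings (all names and shapes are ours): the unsupervised score is the cosine between the two context-sensitive representations, while the supervised score follows Equation (9), with the features of Equation (10) stacked into the additional-feature vector a.

```python
import numpy as np

def cosine(h1, h2):
    # Unsupervised setting: cosine similarity of hidden representations
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-12))

def pair_features(h1, h2):
    # Equation (10): element-wise difference and sign agreement
    return np.abs(h1 - h2), h1 * h2

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def supervised_rel(h1, h2, a, M0, M1, M2, b):
    # Equation (9): distribution over {irrelevant, relevant}
    return softmax(M0 @ a + M1 @ h1 + M2 @ h2 + b)

rng = np.random.default_rng(0)
d_hidden = 200
h1, h2 = rng.normal(size=d_hidden), rng.normal(size=d_hidden)
h_sub, h_dot = pair_features(h1, h2)
a = np.concatenate([h_sub, h_dot])                    # additional features
M0 = rng.normal(scale=0.01, size=(2, a.size))
M1 = rng.normal(scale=0.01, size=(2, d_hidden))
M2 = rng.normal(scale=0.01, size=(2, d_hidden))
b = np.zeros(2)
print(cosine(h1, h2), supervised_rel(h1, h2, a, M0, M1, M2, b))
```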

We can use any task-specific features as additional features. These include features from the minimal edit sequences between parse trees of the input pairs (Heilman and Smith, 2010; Yao et al., 2013), lexical semantic features extracted from resources such as WordNet (Yih et al., 2013), or other features such as word overlap features (Severyn and Moschitti, 2015; Severyn and Moschitti, 2013). We can also use the additional features of Equation (10) computed for BOW representations of the inputs x1 and x2. Such additional features improve the performance of our model and the baseline models.

4 Experiments

In this section, we use the t-test for significance testing and an asterisk (*) to indicate significance at α = 0.05.

4.1 Data and Context Information

We use three datasets: "SCWS," a word similarity dataset with ground-truth labels on the similarity of pairs of target words in sentential context from Huang et al. (2012); "qAns," a TREC QA dataset with ground-truth labels for semantically relevant questions and (single-sentence) answers from Wang et al. (2007); and "qSim," a community QA dataset crawled from Stack Exchange with ground-truth labels for semantically equivalent questions from Dos Santos et al. (2015). Table 1 shows statistics of these datasets. To enable direct comparison with previous work, we use the same training, development, and test data provided by Dos Santos et al. (2015) and Wang et al. (2007) for qSim and qAns, respectively, and the entire data of SCWS (in the unsupervised setting).

We consider local and global context for target words in SCWS. The local context of a target word is its ten neighboring words (five before and five after) (Huang et al., 2012), and its global context is a short paragraph that contains the target word (surrounding sentences). We compute average word embeddings to create context vectors for target words.

Also, we consider the question title and body and the answer text as input in qSim and qAns, and we use NMF to create global context vectors for questions and answers (Section 2.3).

4.2 Parameter Setting

We use pre-trained word vectors from GloVe (Pennington et al., 2014). However, because qSim questions are about specific technical topics, we only use GloVe as initialization.

Data   Split      #Pairs   %Rel
SCWS   All data   2003     100.0%
qAns   Train-All  53K      12.00%
qAns   Train      4,718    7.40%
qAns   Dev        1,148    19.30%
qAns   Test       1,517    18.70%
qSim   Train      205K     0.048%
qSim   Dev        43M      0.001%
qSim   Test       82M      0.001%

Table 1: Data statistics. (#Pairs: number of word-word pairs in SCWS, question-answer pairs in qAns, and question-question pairs in qSim; %Rel: percentage of positive pairs.)

For the unsupervised SCWS task, following Huang et al. (2012), we use 100-dimensional word embeddings, d = 100, with hidden layers and context vectors of the same size, d′ = 100, k = 100. In this unsupervised setting, we set the weight parameter λ = 0.5, the masking noise η = 0, and the depth of our model n = 3. Tuning these parameters will further improve the performance of our model.

For qSim and qAns, we use 300-dimensional word embeddings, d = 300, with hidden layers of size d′ = 200. We set the size of the context vectors k (the number of topics) using the reconstruction error of NMF on the training data for different values of k. This leads to k = 200 for qAns and k = 300 for qSim. We tune the other hyper-parameters (η, n, and λ) using development data.

We set each input x (target words in SCWS, question titles and bodies in qSim, and question titles and single-sentence answers in qAns) to the average of the word embeddings in the input. Input vectors could be initialized through more accurate approaches (Mikolov et al., 2013b; Li and Hovy, 2014); however, averaging leads to reasonable representations and is often used to initialize neural networks (Clinchant and Perronnin, 2013; Iyyer et al., 2015).
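The averaging step itself is straightforward; a sketch with a made-up toy vocabulary standing in for the pre-trained GloVe vectors:

```python
import numpy as np

# Stand-in embedding table; in the paper these are pre-trained GloVe vectors.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=300) for w in "how do i reset my password account".split()}

def average_embedding(text, emb, dim=300):
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

x = average_embedding("How do I reset my password", emb)   # input vector for the model
```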

4.3 Contextual Word Similarity

We first consider the contextual word similarity task, in which a model should predict the semantic similarity between words in their sentential context. For this evaluation, we compute Spearman's ρ correlation (Kokoska and Zwillinger, 2000) between the "relevance scores" predicted by different models and human judgments (Section 3).
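The evaluation reduces to a single correlation; a sketch with SciPy and made-up scores:

```python
from scipy.stats import spearmanr

model_scores = [0.82, 0.10, 0.55, 0.31, 0.90]   # predicted pair similarities (toy values)
human_scores = [9.0, 1.5, 6.0, 4.0, 8.5]        # SCWS-style human judgments (toy values)

rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman's rho x 100 = {100 * rho:.1f}")
```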

The state-of-the-art model for this task is a semi-supervised approach (Rothe and Schütze, 2015). This model uses resources like WordNet to compute embeddings for different senses of words. Given a pair of target words and their context (neighboring words and sentences), this model represents each target word as the average of its sense embeddings weighted by cosine similarity to the context. The cosine similarity between the representations of the words in a pair is then used to determine their semantic similarity. Also, the Skip-gram model (Mikolov et al., 2013a) has been extended (Neelakantan et al., 2014; Chen et al., 2014) to learn contextual word pair similarity in an unsupervised way.

Table 2 shows the performance of different models on the SCWS dataset. SAE, CSAE-LC, and CSAE-LGC show the performance of our pairwise autoencoders without context, with local context, and with local and global context, respectively. In the case of CSAE-LGC, we concatenate local and global context to create context vectors. CSAE-LGC performs significantly better than the baselines, including the semi-supervised approach of Rothe and Schütze (2015). It is also interesting that SAE (without any context information) outperforms the pre-trained word embeddings (Pre-trained embeds.).

Comparing the performance of CSAE-LC and CSAE-LGC indicates that global context is useful for accurate prediction of semantic similarity between word pairs. We further investigate these models to understand why global context is useful. Table 3 shows an example in which global context (words in neighboring sentences) effectively helps to judge the semantic similarity between "Airport" and "Airfield," while local context (ten neighboring words) is less effective in helping the models to relate the two words.

Furthermore, we study the effect of global context on different POS tag categories. As Figure 4 shows, global context has a greater impact on the A-A and N-N categories. We expect high improvement in the N-N category, as noun senses are fairly self-contained and often refer to concrete things; thus, broader (not only local) context is needed to judge their semantic similarity. However, we do not know the reason for the improvement in the A-A category, as adjective interpretation in context is often affected by local context (e.g., the nouns that adjectives modify). One reason for the improvement could be that adjectives are often interchangeable, and this characteristic makes their meaning less sensitive to local context.

Model                         Context   ρ×100
Huang et al. (2012)           LGC       65.7
Chen et al. (2014)            LGC       65.4
Neelakantan et al. (2014)     LGC       69.3
Rothe and Schütze (2015)      LGC       69.8
Pre-trained embeds. (GloVe)   -         60.2
SAE                           -         61.1
CSAE                          LC        66.4
CSAE                          LGC       70.9*

Table 2: Spearman's ρ correlation between model predictions and human judgments in contextual word similarity. (LC: local context only; LGC: local and global context.)

. . . No cases in Gibraltar were reported. The airport is built on the isthmus which the Spanish Government claim not to have been ceded in the Treaty of Utrecht. Thus the integration of Gibraltar Airport in the Single European Sky system has been blocked by Spain. The 1987 agreement for joint control of the airport with . . .

. . . called "Tazi" by the German pilots. On 23 Dec 1942, the Soviet 24th Tank Corps reached nearby Skassirskaya and on 24 Dec, the tanks reached Tatsinskaya. Without any soldiers to defend the airfield, it was abandoned under heavy fire. In a little under an hour, 108 Ju-52s and 16 Ju-86s took off for Novocherkassk, leaving 72 Ju-52s and many other aircraft burning on the ground. A new base was established . . .

Table 3: The importance of global context (neighboring sentences) in predicting the semantically similar words (Airport, Airfield).

4.4 Answer Ranking Performance

We evaluate the performance of our model on the answer ranking task, in which a model should retrieve correct answers from a set of candidates for test questions. For this evaluation, we rank answers with respect to each test question according to the "relevance score" between the question and each answer (Section 3).
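For reference, the two ranking metrics reported below can be computed as in this small sketch, where each query is represented by the 0/1 relevance labels of its candidates sorted by the model's relevance score:

```python
def average_precision(labels):
    """labels: 0/1 relevance of ranked candidates, best-scored first."""
    hits, precisions = 0, []
    for i, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(ranked_lists):
    return sum(average_precision(l) for l in ranked_lists) / len(ranked_lists)

def mean_reciprocal_rank(ranked_lists):
    rr = []
    for labels in ranked_lists:
        rank = next((i for i, rel in enumerate(labels, start=1) if rel), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(ranked_lists)

# Toy example: two questions with their ranked candidate answers.
ranked = [[0, 1, 0, 1], [1, 0, 0, 0]]
print(mean_average_precision(ranked), mean_reciprocal_rank(ranked))
```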

The state-of-the-art model for answer ranking on qAns is the pairwise convolutional neural network (PCNN) presented in (Severyn and Moschitti, 2015). PCNN is a supervised model that first maps input question-answer pairs to hidden representations through a standard convolutional neural network (CNN) and then utilizes these representations in a pairwise CNN to compute a relevance score for each pair. This model also utilizes external word overlap features for each question-answer pair.⁵ PCNN outperforms other competing CNN models (Yu et al., 2014) and models that use syntax and semantic features (Heilman and Smith, 2010; Yao et al., 2013).

⁵ Word overlap and IDF-weighted word overlap computed for (a) all words and (b) only non-stop words for each question-answer pair (Severyn and Moschitti, 2015).


Figure 4: Effect of global context on contextual word similarity for different parts of speech (N: noun, V: verb, A: adjective). We only consider frequent categories.


Tables 4 and 5 show the performance of different models in terms of Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) in supervised and unsupervised settings. PCNN-WO and PCNN show the baseline performance with and without word overlap features. SAE and CSAE show the performance of our pairwise autoencoders without and with context information, respectively. Their "X-DST" versions show their performance when the additional features (Equation 10) are used. These features are computed for the hidden and BOW representations of question-answer pairs. We also include word overlap features as additional features.

Table 4 shows that SAE and CSAE consistently outperform PCNN, and SAE-DST and CSAE-DST outperform PCNN-WO when the models are trained on the larger training dataset, "Train-All." But PCNN shows slightly better performance than our model on "Train," the smaller training dataset. We conjecture this is because PCNN's convolution filter is wider (n-grams, n > 2) (Severyn and Moschitti, 2015).

Table 5 shows that the performance of the unsupervised SAE and CSAE is comparable to, and in some cases better than, the performance of the supervised PCNN model. We attribute the high performance of our models to context information that leads to richer representations of inputs.

Furthermore, comparing the performance of CSAE and SAE in both supervised and unsupervised settings in Tables 4 and 5 shows that context information consistently improves the MAP and MRR performance in all settings, except for MRR on "Train" (supervised setting), which leads to a comparable performance.

Model       Train            Train-All
            MAP     MRR      MAP     MRR
PCNN        62.58   65.91    67.09   72.80
SAE         65.69*  71.70*   69.54*  75.47*
CSAE        67.02*  70.99*   72.29*  77.29*
PCNN-WO     73.29   79.62    74.59   80.78
SAE-DST     72.53   76.97    76.38*  82.11*
CSAE-DST    71.26   76.88    76.75*  82.90*

Table 4: Answer ranking in supervised setting

Model   Train            Train-All
        MAP     MRR      MAP     MRR
SAE     63.81   69.30    66.37   71.71
CSAE    64.86*  69.93*   66.76*  73.79*

Table 5: Answer ranking in unsupervised setting.

Context-sensitive representations significantly improve the performance of our model and often lead to higher MAP than the models that ignore context information.

4.5 Question Ranking Performance

In the question ranking task, given a test question, a model should retrieve the top-K questions that are semantically equivalent to the test question, for K = {1, 5, 10}. We use qSim for this evaluation.
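Precision at rank K for a single test question is equally simple to compute; a sketch:

```python
def precision_at_k(labels, k):
    """labels: 0/1 relevance of ranked candidates, best-scored first."""
    return sum(labels[:k]) / k

ranked = [0, 1, 1, 0, 0, 1, 0, 0, 0, 0]     # toy ranking for one test question
print(precision_at_k(ranked, 1), precision_at_k(ranked, 5), precision_at_k(ranked, 10))
```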

We compare our autoencoders against the PCNN and PBOW-PCNN models presented in Dos Santos et al. (2015). PCNN is a pairwise convolutional neural network, and PBOW-PCNN is a joint model that combines vector representations obtained from a pairwise bag-of-words (PBOW) network and a pairwise convolutional neural network (PCNN). Both models are supervised, as they require similarity scores to train the network.

Table 6 shows the performance of different models in terms of Precision at Rank K (P@K). CSAE is more precise than the baseline; the CSAE and CSAE-DST models consistently outperform the baselines on P@1, an important metric in search applications (CSAE also outperforms PCNN on P@5). Although the context-sensitive models are more precise than the baselines at higher ranks, the PCNN and PBOW-PCNN models remain the best models for P@10.

Tables 6 and 7 show that context information consistently improves the results at all ranks in both supervised and unsupervised settings. The performance of the unsupervised SAE and CSAE models is comparable with that of the supervised PCNN model at higher ranks.


Figure 5: Reconstruction error and improvement: (a) and (b) reconstruction error on qSim and qAns, respectively; errNMF shows the reconstruction error of NMF (smaller error is better); (c) improvement in reconstruction error vs. topic density: greater improvement is obtained in topics with lower density.

Model       P@1    P@5    P@10
PCNN        20.0   33.8   40.4
SAE         16.8   29.4   32.8
CSAE        21.4   34.9   37.2
PBOW-PCNN   22.3   39.7   46.4
SAE-DST     22.2   35.9   42.0
CSAE-DST    24.6   37.9   38.9

Table 6: Question ranking in supervised setting

Model   P@1    P@5    P@10
SAE     17.3   32.4   32.8
CSAE    18.6   33.2   34.1

Table 7: Question ranking in unsupervised setting

5 Performance Analysis and Discussion

We investigate the effect of context information on reconstructing inputs and try to understand the reasons for the improvement in reconstruction error. We compute the average reconstruction error of SAE and CSAE (Equations (3) and (7)). For these experiments, we set λ = 0 in Equation (7) so that we can directly compare the resulting loss of the two models. CSAE still uses context information with λ = 0, but it does not backpropagate the reconstruction loss of the context information.

Figures 5(a) and 5(b) show the average reconstruction error of SAE and CSAE on the qSim and qAns datasets. Context information consistently improves reconstruction. The improvement is greater on qSim, which contains a smaller number of words per question than qAns. Also, both models generate smaller reconstruction errors than NMF (Section 2.3). The lower performance of NMF is because it reconstructs inputs merely using the global topics identified in the datasets, while our models utilize both local and global information to reconstruct inputs.

5.1 Analysis of Context Information

The improvement in reconstruction error mainly stems from areas in the data where "topic density" is lower. We define the topic density of a topic as the number of documents that are assigned to the topic by our topic model. We compute the average improvement in reconstruction error for each topic T_j using the loss functions for the basic and context-sensitive autoencoders:

Δ_j = (1/|T_j|) ∑_{x ∈ T_j} [ l(x) − l(x, hc) ],

where we set λ = 0. Figure 5(c) shows the improvement in reconstruction error versus topic density on qSim. Lower topic densities show greater improvement. This is because they have insufficient training data to train the networks. However, injecting context information improves the reconstruction power of our model by providing more information. The improvements in denser areas are smaller because neural networks can train effectively in these areas.⁶

⁶ We observed the same pattern in qAns.
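Once per-instance losses for SAE and CSAE are available, this per-topic analysis takes only a few lines; in the sketch below the loss arrays and topic assignments are placeholders:

```python
import numpy as np

def per_topic_improvement(topic_ids, loss_sae, loss_csae):
    """Average improvement Delta_j = mean over x in T_j of l(x) - l(x, h_c)."""
    improvements, densities = {}, {}
    for j in np.unique(topic_ids):
        mask = topic_ids == j
        densities[j] = int(mask.sum())                      # topic density
        improvements[j] = float((loss_sae[mask] - loss_csae[mask]).mean())
    return densities, improvements

rng = np.random.default_rng(0)
topic_ids = rng.integers(0, 5, size=1000)                   # argmax topic per instance
loss_sae = rng.random(1000) + 0.5                           # placeholder SAE losses
loss_csae = loss_sae - rng.random(1000) * 0.2               # placeholder CSAE losses
densities, improvements = per_topic_improvement(topic_ids, loss_sae, loss_csae)
```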

5.2 Effect of Depth

The intuition behind deep autoencoders (and, generally, deep neural networks) is that each layer learns a more abstract representation of the input than the previous one (Hinton and Salakhutdinov, 2006; Bengio et al., 2013). We investigate whether adding depth to our context-sensitive autoencoder improves its performance on the contextual word similarity task.


Figure 6: Effect of depth in contextual word similarity. Three hidden layers is optimal for this task.


Figure 6 shows that as we increase the depth of our autoencoders, their performance initially improves. The CSAE-LGC model, which uses both local and global context, benefits more from a greater number of hidden layers than CSAE-LC, which only uses local context. We attribute this to the use of global context in CSAE-LGC, which leads to more accurate representations of words in their context. We also note that with just a single hidden layer, CSAE-LGC largely improves the performance compared to CSAE-LC.

6 Related Work

Representation learning models have been effective in many tasks such as language modeling (Bengio et al., 2003; Mikolov et al., 2013b), topic modeling (Nguyen et al., 2015), paraphrase detection (Socher et al., 2011), and ranking tasks (Yih et al., 2013). We briefly review works that use context information for text representation.

Huang et al. (2012) presented an RNN model that uses document-level context information to construct more accurate word representations. In particular, given a sequence of words, the approach uses other words in the document as external (global) knowledge to predict the next word in the sequence. Other approaches have also modeled context at the document level (Lin et al., 2015; Wang and Cho, 2015; Ji et al., 2016).

Ji et al. (2016) presented a context-sensitive RNN-based language model that integrates representations of previous sentences into the language model of the current sentence. They showed that this approach outperforms several RNN language models on a text coherence task.

Liu et al. (2015) proposed a context-sensitive RNN model that uses Latent Dirichlet Allocation (Blei et al., 2003) to extract topic-specific word embeddings. Their best-performing model regards each topic associated with a word in a sentence as a pseudo word, learns topic and word embeddings, and then concatenates the embeddings to obtain topic-specific word embeddings.

Mikolov and Zweig (2012) extended a basic RNN language model (Mikolov et al., 2010) with an additional feature layer to integrate external information (such as topic information) about inputs into the model. They showed that such information improves the perplexity of language models.

In contrast to previous research, we integrate context into deep autoencoders. To the best of our knowledge, this is the first work to do so. Also, in this paper, we depart from most previous approaches by demonstrating the value of context information in sentence-level semantic similarity and ranking tasks such as QA ranking. Our approach to the ranking problems, both answer ranking and question ranking, differs from previous approaches in that we judge the relevance between inputs based on their context information. We showed that adding sentential or document context information about questions (or answers) leads to better rankings.

7 Conclusion and Future Work

We introduce an effective approach to integrate sentential or document context into deep autoencoders and show that such integration is important in semantic similarity tasks. In the future, we aim to investigate other types of linguistic context (such as POS tag and word dependency information, word sense, and discourse relations) and develop a unified representation learning framework that integrates such linguistic context with representation learning models.

Acknowledgments

We thank the anonymous reviewers for their thoughtful comments. This paper is based upon work supported, in whole or in part, with funding from the United States Government. Boyd-Graber is supported by NSF grants IIS/1320538, IIS/1409287, and NCSE/1422492. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsors.

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, March.

Yoshua Bengio, Aaron Courville, and Pierre Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8).

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Jordan L. Boyd-Graber and David M. Blei. 2009. Syntactic topic models. In Proceedings of NIPS.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of EMNLP.

Stéphane Clinchant and Florent Perronnin. 2013. Aggregating continuous word embeddings for information retrieval. In the Workshop on Continuous Vector Space Models and their Compositionality, ACL.

Cicero Dos Santos, Luciano Barbosa, Dasha Bogdanova, and Bianca Zadrozny. 2015. Learning hybrid representations to retrieve semantically equivalent questions. In Proceedings of ACL-IJCNLP.

Michael Heilman and Noah A. Smith. 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In Proceedings of NAACL.

Geoffrey E. Hinton and Ruslan R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of ACL.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of ACL-IJCNLP.

Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2016. Document context language models. ICLR (Workshop track).

S. Kokoska and D. Zwillinger. 2000. CRC Standard Probability and Statistics Tables and Formulae, Student Edition. Taylor & Francis.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of ICML.

Jiwei Li and Eduard Hovy. 2014. A model of coherence based on distributed sentence representation. In Proceedings of EMNLP.

Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. 2015. Hierarchical recurrent neural network for document modeling. In Proceedings of EMNLP.

Chuan-bi Lin. 2007. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779.

Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015. Topical word embeddings. In Proceedings of AAAI.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In Spoken Language Technologies. IEEE.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of INTERSPEECH.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of EMNLP.

Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. TACL, 3:299–313.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.

Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of ACL-IJCNLP.

Aliaksei Severyn and Alessandro Moschitti. 2013. Automatic feature engineering for answer selection and extraction. In Proceedings of EMNLP.

Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of SIGIR.

Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In Proceedings of ACL.

Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of NIPS.

Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler. 2012. Exploring topic coherence over many models and many topics. In Proceedings of EMNLP-CoNLL.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of ICML.

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408.

Tian Wang and Kyunghyun Cho. 2015. Larger-context language modelling. CoRR, abs/1511.03729.

Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proceedings of EMNLP-CoNLL.

Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. 2013. Answer extraction as sequence tagging with tree edit distance. In Proceedings of NAACL.

Wen-tau Yih, Ming-Wei Chang, Christopher Meek, and Andrzej Pastusiak. 2013. Question answering using enhanced lexical semantic models. In Proceedings of ACL.

Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep learning for answer sentence selection. In NIPS Deep Learning Workshop.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701.

