
Proceedings of NAACL-HLT 2019, pages 2543–2554, Minneapolis, Minnesota, June 2 – June 7, 2019. ©2019 Association for Computational Linguistics


Data-efficient Neural Text Compression with Interactive Learning

Avinesh P.V.S and Christian M. Meyer
Research Training Group AIPHES and UKP Lab

Computer Science Department, Technische Universität Darmstadt
www.aiphes.tu-darmstadt.de, www.ukp.tu-darmstadt.de

Abstract

Neural sequence-to-sequence models have been successfully applied to text compression. However, these models were trained on huge automatically induced parallel corpora, which are only available for a few domains and tasks. In this paper, we propose a novel interactive setup to neural text compression that enables transferring a model to new domains and compression tasks with minimal human supervision. This is achieved by employing active learning, which intelligently samples from a large pool of unlabeled data. Using this setup, we can successfully adapt a model trained on small data of 40k samples for a headline generation task to a general text compression dataset at an acceptable compression quality with just 500 sampled instances annotated by a human.

1 Introduction

Text compression is the task of condensing one or multiple sentences into a shorter text of a given length, preserving the most important information. In natural language generation applications, such as summarization, text compression is a major step to condense the extracted important content of the source documents. But text compression can also be applied in a wide range of related applications, including the generation of headlines (Filippova et al., 2015), captions (Wubben et al., 2016), subtitles (Vandegehinste and Pan, 2004; Luotolahti and Ginter, 2015), and the compression of text for small screens (Corston-Oliver, 2001).

Neural sequence-to-sequence (Seq2Seq) models have shown remarkable success in many areas of natural language processing and specifically in natural language generation tasks, including text compression (Rush et al., 2015; Filippova et al., 2015; Yu et al., 2018; Kamigaito et al., 2018). Despite their success, Seq2Seq models have a major drawback, as they require huge parallel corpora with pairs of source and compressed text to be able to learn the parameters for the model. So far, the size of the training data has been proportional to the increase in the model's performance (Koehn et al., 2003; Suresh, 2010), which is a major hurdle if only limited annotation capacities are available to manually produce a corpus. That is why existing research employs large-scale automatically extracted compression pairs, such as the first sentence and the presumably shorter headline of a news article. However, such easy-to-extract source data is only available for a few tasks, domains, and genres, and the corresponding models do not generalize well from the task of headline generation to other text compression tasks.

In this paper, we propose an interactive setup to neural text compression, which learns to compress based on user feedback acquired during training time. For the first time, we apply active learning (AL) methods to neural text compression, which greatly reduces the amount of required training data and thus yields a much more data-efficient training and annotation workflow. In our experiments, we find that this approach enables the successful transfer of a model trained on headline generation data to a general text compression task with a minimum of parallel training instances.

The objective of AL is to efficiently select unlabeled instances that a user should annotate to advance the training. A key component of AL is the choice of the sampling strategy, which curates the samples in order to maximize the model's performance with a minimum amount of user interaction. Many AL sampling strategies have proven effective for human-supervised natural language processing tasks other than compression (Hahn et al., 2012; Peris and Casacuberta, 2018; Liu et al., 2018).

In our work, we exploit the application of uncertainty-based sampling using attention dispersion and structural similarity for choosing samples to be annotated for our interactive Seq2Seq text compression model. We employ the AL strategies for (a) learning a model with a minimum of data, and (b) adapting a pretrained model with few user inputs to a new domain.

In the remaining paper, we first discuss related work and introduce the state-of-the-art Seq2Seq architecture for the neural text compression task. Then, we propose our novel interactive compression approach and demonstrate how batch-mode AL can be integrated with neural Seq2Seq models for text compression. In section 4, we introduce our experimental setup, and in section 5, we evaluate our AL strategies and show that our approach successfully enables (a) learning the Seq2Seq model with a minimum of data and (b) transferring a pretrained headline generation model to a new compression task and dataset with minimal user interaction. To encourage further research and enable reproducing our results, we publish our code as open-source software.1

2 Related Work

In this section, we discuss work related to our research concerning (1) neural text compression models, (2) existing text compression corpora, and (3) active learning for neural models.

Neural text compression. Neural text compression models can be broadly classified into two categories: (a) deletion-based extractive models and (b) abstractive models. The goal of the deletion-based models is to delete unimportant words from a source text to generate a shorter version of the text. In contrast, abstractive models generate a shorter text by inserting, reordering, reformulating, or deleting words of the source text.

Previously, deletion-based extractive methods explored various modeling approaches, including the noisy-channel model (Knight and Marcu, 2002; Turner and Charniak, 2005), integer linear programming (Clarke and Lapata, 2007), variational autoencoders (Miao and Blunsom, 2016), and Seq2Seq models (Filippova et al., 2015). Similarly, recent abstractive models have seen tree-to-tree transduction models (Cohn and Lapata, 2013) and variations of Seq2Seq models, such as attention (Rush et al., 2015), attentive long short-term memory (LSTM) models (Wubben et al., 2016), and operation networks where the Seq2Seq model decoder is replaced with a deletion decoder and a copy-generate decoder (Yu et al., 2018).

1 https://github.com/UKPLab/NAACL2019-interactiveCompression

Filippova et al. (2015) show that Seq2Seq models without any linguistic features have the ability to delete unimportant information. Kamigaito et al. (2018) incorporate higher-order dependency features into a Seq2Seq model and report promising results. Rush et al. (2015) propose an attention-based Seq2Seq model for generating headlines. Chopra et al. (2016) further improve this task with recurrent neural networks. Although Seq2Seq models show state-of-the-art results on different compression datasets, there is yet no work which investigates whether large training corpora are needed to train neural compression models and whether there are efficient ways to train and adapt them to other datasets with few annotations.

Text compression corpora. Early publicly available text compression datasets are manually curated but small (Cohn and Lapata, 2008; Clarke and Lapata, 2006, 2008). These datasets are typically used by unsupervised approaches, as they are 200 times smaller than the annotated data used for training state-of-the-art supervised approaches. Filippova and Altun (2013) introduce an extractive compression dataset of 250k headline and first sentence compression pairs based on Google News, which they use for training a supervised compression method. Similarly, Rush et al. (2015) create another large abstractive dataset of 4 million headline and first sentence compression pairs from news articles extracted from the Annotated Gigaword corpus (Napoles et al., 2012). Although these datasets are large, they predominantly address headline generation for news.

Creating such large corpora manually for a new task or domain is hard. Toutanova et al. (2016) pioneered the manual creation of a multi-reference compression dataset MSR-OANC with 6k sentence–short paragraph pairs from business letters, newswire, journals, and technical documents sampled from the Open American National Corpus.2 They provide five crowd-sourced rewrites for a fixed compression ratio and also acquire quality judgments. This dataset covers multiple genres compared to the large automatically collected compression datasets, and Toutanova et al. (2016) show that neural Seq2Seq models trained on headline generation datasets fail to achieve state-of-the-art results as compared to an ILP-based unsupervised method. In our work, we go beyond that and investigate strategies to easily adapt pretrained models to such small datasets employing minimal user input.

2 https://www.anc.org/data/oanc

Figure 1: Pipeline of our interactive text compression model. The pipeline is divided into three main components: (1) neural Seq2Seq text compression model, (2) active learning, and (3) interactive text compression.

Active learning for neural models. AL has been successfully applied to various natural language processing tasks, including corpus annotation (Hahn et al., 2012; Yan et al., 2011), domain adaptation (Chan and Ng, 2007), personalized summarization (P. V. S. and Meyer, 2017), machine translation (Haffari and Sarkar, 2009), language generation (Mairesse et al., 2010), and many more. Only recently, it has been applied to neural models: Wang et al. (2017a) propose an AL approach for a black-box semantic role labelling (SRL) model where the AL framework is an add-on to the neural SRL models. Peris and Casacuberta (2018) use AL in neural machine translation. They propose quality estimation sampling, coverage sampling, and attention distraction sampling strategies to query data for interactive machine translation. Liu et al. (2018) additionally propose an AL simulation trained on a high-resource language pair to transfer their model to low-resource language pairs. In another line of research, Sener and Savarese (2018) discuss a core-set AL approach as a batch sampling method for neural image classification based on convolutional neural networks. Although AL techniques have been widely used in natural language processing, to our knowledge, there is yet no work on the use of AL for neural text compression. We fill this gap by putting the human in the loop to learn effectively from a minimal amount of interactive feedback, and for the first time, we explore this data-efficient AL-based approach to adapt a model to a new compression dataset.

3 Approach

To address this research problem, we first describe the neural Seq2Seq text compression models we use. Then, we introduce our active learning strategies to select the training samples interactively for in-domain training as well as for domain adaptation, and we describe a novel interactive neural text compression setup. Figure 1 illustrates the main components of our system.

3.1 Neural Seq2Seq Text Compression

In this work, we employ state-of-the-art Seq2Seq models with attention (Seq2Seq-gen) (Rush et al., 2015) and pointer-generator networks with coverage (Pointer-gen) (See et al., 2017) as our base models, which we use for our AL-based interactive text compression setup.

Both Seq2Seq models are built upon the encoder-decoder framework by Sutskever et al. (2014). The encoder encodes the input sequence $x = (x_1, x_2, \dots, x_n)$, represented by an embedding matrix, into a continuous space using a bidirectional LSTM network and outputs a sequence of hidden states. The decoder is a conditional bidirectional LSTM network with attention distribution (Luong et al., 2015)

$$a_i^j = \frac{\exp(e_i^j)}{\sum_{k=1}^{n} \exp(e_k^j)} \quad (1)$$

where $e_i^j$ is computed at each generation step $j$ from the encoder states $h_i^{enc}$ and the decoder states $h_j^{dec}$:

$$e_i^j = q \cdot \tanh(W_h^{enc} h_i^{enc} + W_h^{dec} h_j^{dec} + b_{att}) \quad (2)$$


where $q$, $W_h^{enc}$, $W_h^{dec}$, and $b_{att}$ are learnable parameters. The attention distribution $a_i^j$ is used to compute the weighted sum of the encoder hidden states, also known as the context vector

$$c_j^* = \sum_{i=1}^{n} a_i^j h_i^{enc} \quad (3)$$

To obtain the vocabulary distribution $P_j^{vocab}$ at generation step $j$, we concatenate the context vector with the decoder state $h_j^{dec}$ and pass it through two linear layers:

$$P_j^{vocab} = \mathrm{softmax}(W_v(W_v'[h_j^{dec}; c_j^*] + b_v') + b_v) \quad (4)$$

where $W_v$, $W_v'$, $b_v$, and $b_v'$ are learnable parameters. $P_j^{vocab}$ is a probability distribution over all words in the vocabulary $V$. Based on the vocabulary distribution, the model generates the target sequence $y = y_1, y_2, \dots, y_m$, $m \le n$ with

$$y_j = \mathrm{argmax}_w \, P_j^{vocab}(w), \quad w \in V \quad (5)$$

for each generation step $j$. Finally, during training, we define the loss function for generation step $j$ as the negative log likelihood of the target word $y_j$ and the overall loss function for the target word sequence as $L$:

$$L = \frac{1}{m} \sum_{j=0}^{m} -\log P_j^{vocab}(y_j) \quad (6)$$
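As a concrete illustration of Eqs. (1)–(5), the following PyTorch-style sketch computes a single decoder step; the function and tensor names are ours and the shapes are assumptions for illustration, not the authors' OpenNMT-based implementation.

```python
# Illustrative single decoder step for Eqs. (1)-(5); names and shapes are assumed.
import torch
import torch.nn.functional as F

def decoder_step(h_enc, h_dec_j, W_enc, W_dec, q, b_att, W_in, b_in, W_out, b_out):
    """h_enc: (n, d) encoder states; h_dec_j: (d,) decoder state at generation step j."""
    # Eq. (2): additive attention scores e_i^j over the n source positions
    e_j = torch.tanh(h_enc @ W_enc + h_dec_j @ W_dec + b_att) @ q        # (n,)
    # Eq. (1): attention distribution a_i^j
    a_j = F.softmax(e_j, dim=0)                                          # (n,)
    # Eq. (3): context vector c_j^* = attention-weighted sum of encoder states
    c_j = a_j @ h_enc                                                    # (d,)
    # Eq. (4): two linear layers + softmax give the vocabulary distribution
    hidden = torch.cat([h_dec_j, c_j]) @ W_in + b_in
    p_vocab = F.softmax(hidden @ W_out + b_out, dim=0)                   # (|V|,)
    # Eq. (5): greedy choice of the target word at step j
    y_j = torch.argmax(p_vocab)
    return a_j, c_j, p_vocab, y_j
```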

Another state-of-the-art approach we use for our experiments is the pointer-generator network (Pointer-gen) proposed by See et al. (2017). This model determines a probability function to either generate words from the vocabulary $V$ or copy words from the source text by sampling from the attention distribution $a_i^j$, as shown in Eq. 8. The model achieves this by calculating an additional generation probability $p_{gen}$ for generation step $j$, which is computed from the context vector $c_j^*$, the decoder state $h_j^{dec}$, and the current input to the decoder $x_j'$:

$$p_{gen} = \sigma(W_c^T c_j^* + W_{h^{dec}}^T h_j^{dec} + W_{x'}^T x_j' + b_{gen}) \quad (7)$$

$$P_j(w) = p_{gen} P_j^{vocab}(w) + (1 - p_{gen}) \sum_{i: x_i = w} a_i^j \quad (8)$$

where the vectors $W_c$, $W_{h^{dec}}$, $W_{x'}$, $b_{gen}$ are learnable parameters, $n$ is the number of words in the source text, the sum in Eq. 8 runs over all source positions $i$ whose word $x_i$ equals $w$, and $\sigma$ is the sigmoid function.

The model also uses a coverage mechanism to keep track of words already generated by the model and to discourage repetition. In the coverage model, a coverage vector is calculated as the sum of the attention distributions across all previous decoding steps and is passed as an extra input to the attention mechanism:

$$c_i^j = \sum_{k=0}^{j-1} a_i^k \quad (9)$$

$$e_i^j = q \cdot \tanh(W_h^{enc} h_i^{enc} + W_h^{dec} h_j^{dec} + W_c c_i^j + b_{att}) \quad (10)$$

where $W_c$ is an additional learnable parameter.
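A minimal sketch of the copy mechanism and coverage attention of Eqs. (7)–(10) follows; again, the tensor names and shapes are assumptions for illustration rather than the authors' code.

```python
# Illustrative pointer-generator mixture (Eqs. 7-8) and coverage attention (Eqs. 9-10).
import torch

def pointer_generator_dist(a_j, p_vocab, c_j, h_dec_j, x_in_j,
                           w_c, w_h, w_x, b_gen, src_ids, vocab_size):
    """src_ids: (n,) LongTensor holding the vocabulary id of each source word."""
    # Eq. (7): generation probability from context vector, decoder state and decoder input
    p_gen = torch.sigmoid(w_c @ c_j + w_h @ h_dec_j + w_x @ x_in_j + b_gen)
    # Eq. (8): copy distribution scatters the attention mass onto the source words' ids
    copy_dist = torch.zeros(vocab_size).scatter_add(0, src_ids, a_j)
    return p_gen * p_vocab + (1.0 - p_gen) * copy_dist

def coverage_attention_scores(h_enc, h_dec_j, coverage, W_enc, W_dec, w_cov, q, b_att):
    """coverage: (n,) sum of attention over previous decoding steps (Eq. 9)."""
    # Eq. (10): the coverage vector enters the attention scoring as an extra input
    return torch.tanh(h_enc @ W_enc + h_dec_j @ W_dec
                      + coverage.unsqueeze(1) * w_cov + b_att) @ q
```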

3.2 Active Learning

Toutanova et al. (2016) show that Seq2Seq models, which perform well on large news headline generation datasets, fail to achieve good performance on their MSR-OANC multi-genre compression dataset. A major issue with training Seq2Seq models is the lack of domain-specific data and the expensive process of creating parallel compression pairs. It is therefore indispensable to minimize the cost of data annotation. This is where AL comes into play, whose key element is a strategy for selecting the samples the user should annotate so as to yield a more efficient training process. For text compression, we suggest AL strategies that maximize the model's coverage and the diversity of the samples. To this end, we build upon work in uncertainty sampling (Peris and Casacuberta, 2018; Wang et al., 2017b) and propose a new strategy to predict the sample diversity at a structural level.

Coverage constraint sampling (Coverage-AL). An important factor on which text compression models are evaluated is coverage (Marsi et al., 2010): the ability of the model to learn the deletion or generation rules from the training samples and to apply them to an input source text. Wu et al. (2016) first proposed using attention weights to calculate a coverage penalty for machine translation systems, and Peris and Casacuberta (2018) extended the attention weights to estimate an attention-dispersion-based uncertainty score for a sentence. The idea of attention dispersion is that if the neural Seq2Seq compression model is uncertain, the attention weights will be dispersed across the source text while generating the target words. Samples with higher dispersion have their attention weights more uniformly distributed across the source sentences. Thus, the goal is to find the samples with high uncertainty based on attention dispersion. As we want to quantify the extent to which the attention distribution differs from a normal distribution, we propose to use a skewness score. The skewness score measures the attention dispersion while decoding a target word $y_j$:

$$\mathrm{skewness}(y_j) = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(a_i^j - \frac{1}{n}\right)^3}{\left(\frac{1}{n}\sum_{i=1}^{n}\left(a_i^j - \frac{1}{n}\right)^2\right)^{3/2}} \quad (11)$$

where $a_i^j$ is the attention weight assigned by the attention layer to the $i$-th source word when decoding the $j$-th target word, and $\frac{1}{n}$ is the mean of the attention weights for the target word $y_j$.

The skewness of a normal distribution is zero, and since we are interested in samples with heavy tails, we take the negative of the skewness averaged across all target words to obtain the uncertainty coverage score $C_{score}$:

$$C_{score}(x, y) = \frac{\sum_{j=1}^{m} -\mathrm{skewness}(y_j)}{m} \quad (12)$$

where $m$ is the number of target words.
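Assuming the attention matrix of a decoded candidate is available, the Coverage-AL score of Eqs. (11)–(12) can be sketched as follows (illustrative NumPy code, not the authors' implementation).

```python
# Illustrative Coverage-AL uncertainty score: negative skewness of the attention
# weights (Eq. 11), averaged over all target words (Eq. 12).
import numpy as np

def coverage_al_score(attention):
    """attention: (m, n) array; row j holds a_i^j over the n source words for target word j."""
    n = attention.shape[1]
    mean = 1.0 / n                               # attention rows sum to 1, so their mean is 1/n
    centered = attention - mean
    m3 = np.mean(centered ** 3, axis=1)          # third central moment per target word
    m2 = np.mean(centered ** 2, axis=1)          # second central moment (variance)
    skew = m3 / np.maximum(m2, 1e-12) ** 1.5     # Eq. (11), guarded against zero variance
    return float(np.mean(-skew))                 # Eq. (12): average negative skewness
```

A higher score corresponds to attention that is closer to uniform, i.e., more dispersed, so such samples would be treated as more uncertain and queried for annotation first.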

Diversity constraint sampling (Diversity-AL). Diversity sampling methods have been used in information retrieval (Xu et al., 2007) and image classification (Wang et al., 2017b). The core idea is that samples that are highly similar to each other typically yield little new information and thus low performance. Similarly, to increase the diversity of the samples in neural text compression, we propose a novel scoring metric to measure the diversity of multiple source texts at a structural level. Our intuition is that integrating part-of-speech, dependency, and named entity information is useful for text compression, e.g., to learn which named entities are important and how to compress a wide range of phrase types and syntactically complex sentences. Thus, we consider part-of-speech tags, dependency trees, and named entity embeddings and calculate the structural similarity of the source text with regard to the target text. We use a multi-task convolutional neural network similar to Søgaard and Goldberg (2016) trained on OntoNotes and Common Crawl to learn the structural embeddings consisting of tag, dependency, and named entity embeddings. The diversity score $D_{score}$ is calculated as the cosine similarity between the average of the structural embeddings of the words in the source sentence and the average of the structural embeddings of the words in the target compression, as in Eq. 13:

$$D_{score}(x, y) = \frac{E_{struc}(x) \cdot E_{struc}(y)}{||E_{struc}(x)|| \cdot ||E_{struc}(y)||} \quad (13)$$

where $E_{struc}(\cdot)$ is the average structural embedding of a text.
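A sketch of the Diversity-AL score of Eq. (13); the embedding function is a placeholder for the spaCy-based tag, dependency, and named-entity embeddings described above, not a specific library call.

```python
# Illustrative Diversity-AL score: cosine similarity of averaged structural embeddings.
import numpy as np

def diversity_al_score(source_tokens, target_tokens, embed_tokens):
    """embed_tokens: callable mapping a list of tokens to a (len, d) array of embeddings."""
    e_src = embed_tokens(source_tokens).mean(axis=0)    # E_struc(x)
    e_tgt = embed_tokens(target_tokens).mean(axis=0)    # E_struc(y)
    # Eq. (13): cosine similarity of the two averaged structural embeddings
    return float(e_src @ e_tgt /
                 (np.linalg.norm(e_src) * np.linalg.norm(e_tgt) + 1e-12))
```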

These AL sampling strategies are applied interactively while training to make better use of the data by selecting the most uncertain instances. Additionally, both strategies can be applied for domain adaptation by actively querying user annotations for a domain-specific dataset in an interactive text compression setup, which we describe next.

3.3 Interactive Text Compression

In this subsection, we introduce our interactive text compression setup. Our goal is to select the batches of training samples efficiently, using minimal samples, and to be able to transfer the models to new datasets for different domains and genres with little labeled data.

We consider an initial collection of parallel instances $D = \{(x_i, y_i) \mid 1 \le i \le N\}$ consisting of pairs of an input text $x_i$ and its corresponding compression $y_i$. Additionally, we consider unlabeled instances $D' = \{x_i \mid i > N\}$, for which we only know the uncompressed source texts. Our goal is to sample sets of unlabeled instances $S_t \subset D'$ which should be annotated by a user in each time step $t$. The interactive compression model can only see the labeled pairs from the initial dataset $D$ in the beginning, but then incrementally learns from the user annotations.

Algorithm 1 provides an overview of our interactive compression setup. The inputs are the labeled compression pairs $D$ and the unlabeled source texts $D'$. $D$ is used to initially train the neural text compression model $M$. In line 5, we start the interactive feedback loop iterating over $t = 0, \dots, T$. We first sample a set of unlabeled source texts $S_t$ (line 6) by using our AL strategies introduced in section 3.2 and then loop over each of the unlabeled samples to be annotated or supervised by the human in line 10. As the user feedback in the current time step for the sample $S_t$, we obtain the compressions $Y_t$ of the sampled source texts $S_t$ from the user and use them for online training of the model $M$. After $T$ iterations, or if there are no samples left for querying (i.e., $S_t = \emptyset$), we stop the iteration and return the updated Seq2Seq model $M$.

Algorithm 1 Interactive Text Compression
 1: procedure InteractiveCompression()
 2:   input: Text Compression Pairs D,
 3:          Unlabeled Text D′
 4:   M ← learnSeq2Seq(D)
 5:   for t = 0, ..., T do
 6:     S_t ← getSample(D′)
 7:     if S_t = ∅ then
 8:       return M
 9:     else
10:       Y_t ← queryUser(S_t)
11:       M ← update(M, S_t, Y_t)
12:       D′ ← D′ − S_t
13:     end if
14:   end for
15: end procedure
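Algorithm 1 translates almost directly into Python. In the sketch below the helper functions are placeholders mirroring the pseudocode; the only deviation is that the sampling function also receives the model, since both AL strategies score candidates with it.

```python
# Minimal sketch of Algorithm 1; learn_seq2seq, get_sample, query_user and update
# are placeholders mirroring the pseudocode, not a specific OpenNMT API.
def interactive_compression(D, D_prime, T, learn_seq2seq, get_sample, query_user, update):
    """D: list of (source, compression) pairs; D_prime: set of unlabeled source texts."""
    M = learn_seq2seq(D)                      # line 4: train the initial Seq2Seq model
    for t in range(T + 1):                    # line 5: interactive feedback loop
        S_t = get_sample(M, D_prime)          # line 6: AL strategy selects a batch to annotate
        if not S_t:                           # lines 7-8: nothing left to query
            return M
        Y_t = query_user(S_t)                 # line 10: the user supplies compressions
        M = update(M, S_t, Y_t)               # line 11: online update of the model
        D_prime = D_prime - set(S_t)          # line 12: remove the annotated sources
    return M
```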

4 Experimental Setup

4.1 Data

For our experiments, we use the large Google News text compression corpus3 by Filippova and Altun (2013), which contains 250k automatically extracted deletion-based compressions from aligned headlines and first sentences of news articles. Recent studies on text compression have extensively used this dataset (e.g., Zhao et al., 2018; Kamigaito et al., 2018). We carry out our in-domain active learning experiments on the Google News compression corpus.

To evaluate our interactive setup, we adapt the trained models to the MSR-OANC text compression corpus by Toutanova et al. (2016), which contains 6k crowdsourced multi-genre compressions from the Open American National Corpus. This corpus is well-suited to evaluate our interactive setup, since it is sourced from a mixture of newswire, letters, journals, and non-fiction genres, in contrast to the Google News corpus covering only newswire.

3 https://github.com/google-research-datasets/sentence-compression

Dataset       | # Train  | # Dev  | # Test
Google News   | 195,000  | 5,000  | 10,000
MSR-OANC      |   5,000  |   448  |    785

Table 1: Statistics of the compression datasets

For evaluating the compressions against the reference compressions, we use a Python wrapper4 of the ROUGE metric (Lin, 2004) with the parameters suggested by Owczarzak et al. (2012) as yielding high correlation with human judgments (i.e., with stemming and without stopword removal).5
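The footnoted wrapper is run with the parameters "-n 2 -c 95 -r 1000 -a -m". As a rough stand-in, the sketch below scores one compression pair from Table 3 with Google's rouge_score package; this is a different ROUGE implementation than the wrapper used in the paper, so absolute numbers may differ slightly.

```python
# Illustrative ROUGE evaluation with the rouge_score package (not the files2rouge
# wrapper used in the paper); stemming on, no stopword removal.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "Two Chinese war ships , arrived at the port of Trincomalee will visit .",   # reference
    "Two Chinese war ships arrived at the port of Trincomalee will visit .")     # system output
print({name: round(score.fmeasure, 4) for name, score in scores.items()})
```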

4.2 Preprocessing and Parameters

To preprocess the datasets, we perform tokenization. We obtain the structural embeddings for a sentence using spaCy6 embeddings learned using a multi-task convolutional neural network.

To evaluate and assess the effectiveness of our active learning-based sampling approaches, we set up our interactive text compression approach for the two state-of-the-art Seq2Seq models consisting of a generative model (Seq2Seq-gen) and a generate-and-copy model (Pointer-gen), as described in Section 3.1. For the neural Seq2Seq text compression experiments, we set the beam size and batch size to 10 and 30, respectively. We use the Adam optimizer (Kingma and Ba, 2015) for the gradient-based optimization. Finally, the neural network parameters, i.e., weights and biases, are randomly initialized.
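For reference, the reported settings can be collected in a small configuration dict; the key names below are ours for illustration, not OpenNMT option names.

```python
# Hyperparameters reported above, gathered in an illustrative config (assumed key names).
train_config = {
    "beam_size": 10,         # beam width used at decoding time
    "batch_size": 30,        # training batch size
    "optimizer": "adam",     # Adam (Kingma and Ba, 2015)
    "param_init": "random",  # weights and biases randomly initialized
}
```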

In order to assess the effectiveness of AL for neural text compression, we extend the OpenNMT7 implementations with our interactive framework following Algorithm 1. The sampling strategy selects instances to be annotated interactively by the user in batches. Next, the neural text compression model is incrementally updated with the selected samples.

Due to the presence of a human in the loop, this setup typically demands real user feedback, but the cost of collecting sufficient data for the various settings of our models is prohibitive. Thus, in our experiments, the users were simulated by using the compression pairs from our corpus as the sentences annotated by the user.

4 https://github.com/pltrdy/files2rouge
5 -n 2 -c 95 -r 1000 -a -m
6 https://spacy.io/
7 https://github.com/OpenNMT/OpenNMT-py
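The simulated user described above amounts to looking up the reference compression for each sampled source text; a hypothetical helper is sketched below, assuming the corpus is held in a dictionary.

```python
# Illustrative simulated user: returns the corpus reference compression for each
# sampled source text instead of querying a real annotator (assumed data structure).
def simulated_user(sampled_sources, reference_lookup):
    """reference_lookup: dict mapping a source text to its gold compression."""
    return [reference_lookup[src] for src in sampled_sources]
```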


Methods      |         UB          |       Random        |     Coverage-AL     |    Diversity-AL
             |  R1     R2     RL   |  R1     R2     RL   |  R1     R2     RL   |  R1     R2     RL
Seq2Seq-gen  | 59.94  52.08  59.78 | 61.60  50.03  61.37 | 62.89  51.38  62.56 | 62.54  50.19  62.13
Pointer-gen  | 79.26  71.77  79.08 | 71.61  61.15  71.28 | 78.11  70.50  77.89 | 77.45  70.30  77.38

Table 2: ROUGE-1 (R1), ROUGE-2 (R2) and ROUGE-L (RL) achieved by the state-of-the-art models using our sampling strategies, evaluated on the Google compression test set. Bold marks the best AL strategy.

Figure 2: Analysis of the active learning approaches combined with state-of-the-art Seq2Seq compression models on the Google compression dataset while varying the training sizes.

5 Results and Analysis

Our experiments address two main research questions for in-domain training and domain adaptation of neural text compression:

• Which active learning strategies are useful in text compression to select training samples such that higher performance can be achieved with a minimum of labeled instances?

• Which instances are to be annotated interactively by the user such that the model adapts quickly to a new dataset?

In-domain Active Learning. For the in-domain active learning experiments, we choose the Google News text compression training corpus and sample for corpus sizes between 10% and 100% in ten-percent-point steps. As a baseline, we use a random sampling strategy to test the state-of-the-art Seq2Seq neural text compression models. Figure 2 suggests that our coverage-based sampling (Coverage-AL) and diversity-based sampling (Diversity-AL) strategies outperform the random sampling strategy throughout all training sizes. A key observation is that our sampling strategies are behind the upper bound by just 0.5% ROUGE-2 when only 20% of the training data is used. Table 2 illustrates the results of our sampling strategies when 20% of the data is used for training. All the results are in comparison to the upper bound (UB) receiving 100% of the training data.

Coverage-AL performs better than Diversity-AL for both the Seq2Seq-gen and Pointer-gen models. However, both strategies are still not effective for the Seq2Seq-gen model, where random sampling performs on par with the active learning sampling approaches. We believe this is due to the Seq2Seq-gen model's inability to copy from the source text in the sampled set as a consequence of active learning in the batch setting. For the Pointer-gen model, in contrast, we observed that adding new samples for training with both the Coverage-AL and Diversity-AL strategies had a greater impact while the model had not yet adapted. We attribute the effectiveness of the Coverage-AL strategy over Diversity-AL to its exploitation of the model uncertainty, as Diversity-AL only uses the similarity of the samples but does not integrate the model uncertainty.

Table 3 presents an example sentence compression pair from the Google News dataset and the generated compressions of both neural Seq2Seq models when using one of the three sampling strategies. The example shows that detailed descriptions like the names of the ships “JING GANGSHA” and “HENG SHUI” are dropped by all models. In particular, the Seq2Seq-gen model has the problem of generating words not present in the original text (e.g., “toddlers”, “Scottsbluff”). In contrast, the Pointer-gen model's ability to copy from the original text restrains the model from generating irrelevant words. Although the Diversity-AL based models recognized the phrasal constructs crucial for the sentence meaning, Coverage-AL generated the closest compression to the reference.

Source text:  Two Chinese war ships , “ JING GANGSHA ” and “ HENG SHUI ” arrived at the port of Trincomalee on 13 th January 2014 on a good will visit .
Reference:    Two Chinese war ships , arrived at the port of Trincomalee will visit .

Seq2Seq-gen
  + Random:       Two Chinese war ships , arrived at the port of toddlers on 13 th January 2014 .
  + Coverage-AL:  Two Chinese war ships , arrived at the port of Trincomalee on a good will visit .
  + Diversity-AL: Two Chinese war ships arrived at the port of Scottsbluff on 13 th .

Pointer-gen
  + Random:       Two Chinese war ships , arrived at the port of Trincomalee on 13 th January 2014 .
  + Coverage-AL:  Two Chinese war ships arrived at the port of Trincomalee will visit .
  + Diversity-AL: Two Chinese war ships , arrived at the port of Trincomalee .

Table 3: In-domain active learning example sentence and compressions for the Google News compression dataset when using 20% of labelled compressions with the Random, Coverage-AL, and Diversity-AL sampling strategies

Methods      |    MSR-OANC ID     |       Random       |    Coverage-AL     |    Diversity-AL
             |  R1    R2    RL    |  R1    R2    RL    |  R1    R2    RL    |  R1    R2    RL
Seq2Seq-gen  | 30.05 10.42 26.87  | 33.51 13.60 30.26  | 35.10 15.00 32.78  | 34.85 14.92 32.41
Pointer-gen  | 35.24 16.57 32.56  | 38.19 21.87 37.94  | 39.59 24.87 37.02  | 39.42 24.70 36.86

Table 4: ROUGE-1 (R1), ROUGE-2 (R2) and ROUGE-L (RL) achieved by the state-of-the-art models using our sampling strategies when interactively retrained using 10% of the MSR-OANC training set. The results are compared to the models trained on the in-domain training set (MSR-OANC ID). Bold marks the best AL strategy.

Active learning for domain adaptation. To test our interactive Seq2Seq model using active learning strategies in the domain adaptation scenario, we train the model on the Google News compression corpus and test it on the multi-genre MSR-OANC compression dataset. Additionally, for domain adaptation, the neural Seq2Seq model is updated incrementally using our interactive compression Algorithm 1. The sampling strategies select the instances to be interactively annotated by the user. As the cost of interactive experimentation with real users is prohibitive, we use simulated feedback from the labeled sentence compressions of the MSR-OANC training data. The two sampling strategies used for in-domain active learning are used for interactive compression with the state-of-the-art Seq2Seq models. Table 4 illustrates the results of the interactive text compression model when applied to the MSR-OANC text compression dataset. One interesting observation is the fact that our sampling strategies at 10% of the training data (≈ 500 samples) perform better than models trained on the in-domain training data (MSR-OANC ID) with 5k training instances, by +8.3% and +8.2% ROUGE-2.

Figure 3 shows the results for the various sample sizes of the 5k training instances. The results show a similar trend as the active learning for the interactive data-selection scenario. The Coverage-AL and Diversity-AL strategies do not show significant differences from each other. However, the two active learning strategies achieve on average +2.5% ROUGE-2 better results than random sampling. The results demonstrate that the use of relevant training samples is useful for transferring the models to new domains and genres.

Table 5 shows an example from the MSR-OANC compression dataset. The example illustrates similar compression properties as seen in the in-domain settings. In particular, the two models learned to drop appositions, optional modifiers, detailed clauses, etc. Additionally, we also observed that the difficult cases were those where there is little to be removed, but due to the higher compression ratios seen during training, the models removed more than required. This confirms the cause of the lower ROUGE scores compared to the Google News corpus.

Source text:  Given the urgency of the situation in Alaska , Defenders needs your immediate assistance to help save Alaska 's wolves from same - day airborne land - and - shoot slaughter .
Reference:    Given the urgency of the situation in Alaska , Defenders needs your immediate assistance saving Alaska 's wolves from slaughter .

Seq2Seq-gen
  + Random:       Immediate assistance to save Alaska's tundra .
  + Coverage-AL:  Sometimes needs your assistance to help save Alaska 's wolves .
  + Diversity-AL: The situation in Alaska, help save Alaska 's tundra .

Pointer-gen
  + Random:       Immediate assistance to help save Alaska s wolves .
  + Coverage-AL:  The urgency of the situation in Alaska , Defenders needs your immediate assistance .
  + Diversity-AL: Defenders needs your assistance to help save Alaska 's wolves .

Table 5: Domain adaptation example from the MSR-OANC dataset when trained on 20% of labelled compressions with the Random, Coverage-AL, and Diversity-AL sampling strategies

Figure 3: Analysis of the active learning for domain adaptation on the MSR-OANC dataset while varying the training data.

6 Conclusion

We propose a novel neural text compression approach using a neural Seq2Seq method with an interactive setup that aims at (a) learning an in-domain model with a minimum of data and (b) adapting a pretrained model with few user inputs to a new domain or genre. In this paper, we investigate two uncertainty-based active learning strategies with (a) a coverage constraint using attention dispersion and (b) a diversity constraint using structural similarity to make better use of the user in the loop for preparing training data pairs. The active learning based data selection methodology samples the data such that the most uncertain samples are available for training first. Experimental results show that the selected samples achieve comparable performance to the state-of-the-art systems, but trained on 80% less in-domain training data. Active learning with an interactive text compression model helps in transferring models trained on a large parallel corpus for a headline generation task to a general compression dataset with just 500 sampled instances. Additionally, the same in-domain active learning based data selection shows a notable performance improvement in an online interactive domain adaptation setup. Our experiments demonstrate that instead of more training data, relevant training data is essential for training Seq2Seq models in both in-domain training as well as domain adaptation.

In future work, we plan to explore several directions. First, we intend to investigate further applications of our interactive setup, e.g., in movie subtitle compression or television closed captions, where there is not sufficient training data to build neural models. On a more general level, the interactive setup and the active learning strategies presented can also be used for other natural language processing tasks, such as question answering, to transfer a model to a new domain or genre.

Acknowledgments

This work has been supported by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1. We also acknowledge the useful suggestions of the anonymous reviewers.


References

Yee Seng Chan and Hwee Tou Ng. 2007. Domain adaptation with active learning for word sense disambiguation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), June 23–30, 2007, Prague, Czech Republic, pages 49–56.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT), June 12–17, 2016, San Diego, CA, USA, pages 93–98.

James Clarke and Mirella Lapata. 2006. Models for sentence compression: A comparison across domains, training requirements and evaluation measures. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL), July 17–21, 2006, Sydney, Australia, pages 377–384.

James Clarke and Mirella Lapata. 2007. Modelling compression with discourse constraints. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL), June 28–30, 2007, Prague, Czech Republic, pages 1–11.

James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research, 31:399–429.

Trevor Cohn and Mirella Lapata. 2008. Sentence compression beyond word deletion. In COLING 2008, 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18–22 August 2008, Manchester, UK, pages 137–144.

Trevor Cohn and Mirella Lapata. 2013. An abstractive approach to sentence compression. ACM Transactions on Intelligent Systems and Technology, 4(3):41:1–41:35.

Simon Corston-Oliver. 2001. Text compaction for display on very small screens. In Proceedings of the NAACL Workshop on Automatic Summarization, June 3, 2001, Pittsburgh, PA, USA, pages 89–98.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), September 17–21, 2015, Lisbon, Portugal, pages 360–368.

Katja Filippova and Yasemin Altun. 2013. Overcoming the lack of parallel data in sentence compression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), October 18–21, 2013, Seattle, WA, USA, pages 1481–1491.

Gholamreza Haffari and Anoop Sarkar. 2009. Active learning for multilingual statistical machine translation. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing (ACL/IJCNLP), August 2–7, 2009, Singapore, pages 181–189.

Udo Hahn, Elena Beisswanger, Ekaterina Buyko, and Erik Faessler. 2012. Active learning-based corpus annotation – the PathoJen experience. In American Medical Informatics Association Annual Symposium (AMIA), November 3–7, 2012, Chicago, IL, USA.

Hidetaka Kamigaito, Katsuhiko Hayashi, Tsutomu Hirao, and Masaaki Nagata. 2018. Higher-order syntactic attention network for longer sentence compression. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT), June 1–6, 2018, New Orleans, LA, USA, pages 1716–1726.

Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), May 7–9, 2015, San Diego, CA, USA.

Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), May 27–June 1, 2003, Edmonton, Canada, pages 48–54.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop “Text Summarization Branches Out”, July 25–26, 2004, Barcelona, Spain, pages 74–81.

Ming Liu, Wray L. Buntine, and Gholamreza Haffari. 2018. Learning to actively learn neural machine translation. In Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL), October 31–November 1, 2018, Brussels, Belgium, pages 334–344.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), September 17–21, 2015, Lisbon, Portugal, pages 1412–1421.

Juhani Luotolahti and Filip Ginter. 2015. Sentence compression for automatic subtitling. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA), May 11–13, 2015, Vilnius, Lithuania, pages 135–143.

François Mairesse, Milica Gašić, Filip Jurčíček, Simon Keizer, Blaise Thomson, Kai Yu, and Steve J. Young. 2010. Phrase-based statistical language generation using graphical models and active learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), July 11–16, 2010, Uppsala, Sweden, pages 1552–1561.

Erwin Marsi, Emiel Krahmer, Iris Hendrickx, and Walter Daelemans. 2010. On the limits of sentence compression by deletion. In Empirical Methods in Natural Language Generation: Data-oriented Methods and Empirical Evaluation, volume 5790 of Lecture Notes in Artificial Intelligence, pages 45–66.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), November 1–4, 2016, Austin, TX, USA, pages 319–328.

Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), June 7–8, 2012, Montreal, Canada, pages 95–100.

Karolina Owczarzak, John M. Conroy, Hoa Trang Dang, and Ani Nenkova. 2012. An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of the NAACL-HLT Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, June 3–8, 2012, Montreal, Canada, pages 1–9.

Avinesh P. V. S. and Christian M. Meyer. 2017. Joint optimization of user-desired content in multi-document summaries by learning from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), July 30–August 4, 2017, Vancouver, Canada, pages 1353–1363.

Álvaro Peris and Francisco Casacuberta. 2018. Active learning for interactive neural machine translation of data streams. In Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL), October 31–November 1, 2018, Brussels, Belgium, pages 151–160.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), September 17–21, 2015, Lisbon, Portugal, pages 379–389.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), July 30–August 4, 2017, Vancouver, Canada, pages 1073–1083.

Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In Proceedings of the 6th International Conference on Learning Representations (ICLR), May 6–9, New Orleans, LA, USA.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), August 7–12, 2016, Berlin, Germany.

Bipin Suresh. 2010. Inclusion of large input corpora in statistical machine translation. Technical report, Stanford University.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8–13, 2014, Montreal, Canada, pages 3104–3112.

Kristina Toutanova, Chris Brockett, Ke M. Tran, and Saleema Amershi. 2016. A dataset and evaluation metrics for abstractive compression of sentences and short paragraphs. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), November 1–4, 2016, Austin, TX, USA, pages 340–350.

Jenine Turner and Eugene Charniak. 2005. Supervised and unsupervised learning for sentence compression. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), June 25–30, 2005, Ann Arbor, MI, USA, pages 290–297.

Vincent Vandegehinste and Yi Pan. 2004. Sentence compression for automated subtitling: A hybrid approach. In Proceedings of the ACL Workshop “Text Summarization Branches Out”, July 25–26, 2004, Barcelona, Spain, pages 89–95.

Chenguang Wang, Laura Chiticariu, and Yunyao Li. 2017a. Active learning for black-box semantic role labeling with neural factors. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), August 19–25, 2017, Melbourne, Australia, pages 2908–2914.

Gaoang Wang, Jenq-Neng Hwang, Craig S. Rose, and Farron Wallace. 2017b. Uncertainty sampling based active learning with diversity constraint by sparse selection. In 19th IEEE International Workshop on Multimedia Signal Processing (MMSP), October 16–18, 2017, Luton, UK, pages 1–6.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv, 1609.08144.

Sander Wubben, Emiel Krahmer, Antal van den Bosch, and Suzan Verberne. 2016. Abstractive compression of captions with attentive recurrent neural networks. In Proceedings of the Ninth International Natural Language Generation Conference (INLG), September 5–8, 2016, Edinburgh, UK, pages 41–50.

Zuobing Xu, Ram Akella, and Yi Zhang. 2007. Incorporating diversity and density in active learning for relevance feedback. In Advances in Information Retrieval, 29th European Conference on IR Research, ECIR 2007, Rome, Italy, April 2–5, 2007, Proceedings, pages 246–257.

Yan Yan, Romer Rosales, Glenn Fung, and Jennifer G. Dy. 2011. Active learning from crowds. In Proceedings of the 28th International Conference on Machine Learning (ICML), June 28–July 2, 2011, Bellevue, WA, USA, pages 1161–1168.

Naitong Yu, Jie Zhang, Minlie Huang, and Xiaoyan Zhu. 2018. An operation network for abstractive sentence compression. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), August 20–26, 2018, Santa Fe, NM, USA, pages 1065–1076.

Yang Zhao, Zhiyuan Luo, and Akiko Aizawa. 2018. A language model based evaluator for sentence compression. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), July 15–20, 2018, Melbourne, Australia, pages 170–175.

