Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 675–686, Melbourne, Australia, July 15-20, 2018. ©2018 Association for Computational Linguistics


Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting

Yen-Chun Chen and Mohit Bansal
UNC Chapel Hill

{yenchun, mbansal}@cs.unc.edu

Abstract

Inspired by how humans summarize long documents, we propose an accurate and fast summarization model that first selects salient sentences and then rewrites them abstractively (i.e., compresses and paraphrases) to generate a concise overall summary. We use a novel sentence-level policy gradient method to bridge the non-differentiable computation between these two neural networks in a hierarchical way, while maintaining language fluency. Empirically, we achieve the new state-of-the-art on all metrics (including human evaluation) on the CNN/Daily Mail dataset, as well as significantly higher abstractiveness scores. Moreover, by first operating at the sentence-level and then the word-level, we enable parallel decoding of our neural generative model that results in substantially faster (10-20x) inference speed as well as 4x faster training convergence than previous long-paragraph encoder-decoder models. We also demonstrate the generalization of our model on the test-only DUC-2002 dataset, where we achieve higher scores than a state-of-the-art model.

1 Introduction

The task of document summarization has two main paradigms: extractive and abstractive. The former method directly chooses and outputs the salient sentences (or phrases) in the original document (Jing and McKeown, 2000; Knight and Marcu, 2000; Martins and Smith, 2009; Berg-Kirkpatrick et al., 2011). The latter abstractive approach involves rewriting the summary (Banko et al., 2000; Zajic et al., 2004), and has seen substantial recent gains due to neural sequence-to-sequence models (Chopra et al., 2016; Nallapati et al., 2016; See et al., 2017; Paulus et al., 2018). Abstractive models can be more concise by performing generation from scratch, but they suffer from slow and inaccurate encoding of very long documents, with the attention model being required to look at all encoded words (in long paragraphs) for decoding each generated summary word (slow, one by one sequentially). Abstractive models also suffer from redundancy (repetitions), especially when generating multi-sentence summaries.

To address both these issues and combine the advantages of both paradigms, we propose a hybrid extractive-abstractive architecture, with policy-based reinforcement learning (RL) to bridge together the two networks. Similar to how humans summarize long documents, our model first uses an extractor agent to select salient sentences or highlights, and then employs an abstractor network to rewrite (i.e., compress and paraphrase) each of these extracted sentences. To overcome the non-differentiable behavior of our extractor and train on available document-summary pairs without saliency labels, we next use actor-critic policy gradient with sentence-level metric rewards to connect these two neural networks and to learn sentence saliency. We also avoid common language fluency issues (Paulus et al., 2018) by preventing the policy gradients from affecting the abstractive summarizer's word-level training, which is supported by our human evaluation study. Our sentence-level reinforcement learning takes into account the word-sentence hierarchy, which better models the language structure and makes parallelization possible. Our extractor combines reinforcement learning and pointer networks, which is inspired by Bello et al. (2017)'s attempt to solve the Traveling Salesman Problem. Our abstractor is a simple encoder-aligner-decoder model (with copying) and is trained on pseudo document-summary sentence pairs obtained via simple automatic matching criteria.

Thus, our method incorporates the abstractive paradigm's advantages of concisely rewriting sentences and generating novel words from the full vocabulary, yet it adopts intermediate extractive behavior to improve the overall model's quality, speed, and stability. Instead of encoding and attending to every word in the long input document sequentially, our model adopts a human-inspired coarse-to-fine approach that first extracts all the salient sentences and then decodes (rewrites) them (in parallel). This also avoids almost all redundancy issues because the model has already chosen non-redundant salient sentences to abstractively summarize (but adding an optional final reranker component does give additional gains by removing the few remaining across-sentence repetitions).

Empirically, our approach is the new state-of-the-art on all ROUGE metrics (Lin, 2004) as well as on METEOR (Denkowski and Lavie, 2014) of the CNN/Daily Mail dataset, achieving statistically significant improvements over previous models that use complex long-encoder, copy, and coverage mechanisms (See et al., 2017). The test-only DUC-2002 improvement also shows our model's better generalization than this strong abstractive system. In addition, we surpass the popular lead-3 baseline on all ROUGE scores with an abstractive model. Moreover, our sentence-level abstractive rewriting module also produces substantially more (3x) novel N-grams that are not seen in the input document, as compared to the strong flat-structured model of See et al. (2017). This empirically justifies that our RL-guided extractor has learned sentence saliency, rather than benefiting from simply copying longer sentences. We also show that our model maintains the same level of fluency as a conventional RNN-based model because the reward does not leak to our abstractor's word-level training. Finally, our model's training is 4x and inference is more than 20x faster than the previous state-of-the-art. The optional final reranker gives further improvements while maintaining a 7x speedup.

Overall, our contribution is threefold: First, we propose a novel sentence-level RL technique for the well-known task of abstractive summarization, effectively utilizing the word-then-sentence hierarchical structure without annotated matching sentence-pairs between the document and ground-truth summary. Next, our model achieves the new state-of-the-art on all metrics of multiple versions of a popular summarization dataset (as well as a test-only dataset) both extractively and abstractively, without loss in language fluency (also demonstrated via human evaluation and abstractiveness scores). Finally, our parallel decoding results in a significant 10-20x speed-up over the previous best neural abstractive summarization system with even better accuracy.¹

2 Model

In this work, we consider the task of summarizing a given long text document into several (ordered) highlights, which are then combined to form a multi-sentence summary. Formally, given a training set of document-summary pairs $\{x_i, y_i\}_{i=1}^N$, our goal is to approximate the function $h : \mathcal{X} \to \mathcal{Y}$, $\mathcal{X} = \{x_i\}_{i=1}^N$, $\mathcal{Y} = \{y_i\}_{i=1}^N$, such that $h(x_i) = y_i,\ 1 \leq i \leq N$. Furthermore, we assume there exists an abstracting function $g$ defined as: $\forall s \in S_i, \exists d \in D_i$ such that $g(d) = s,\ 1 \leq i \leq N$, where $S_i$ is the set of summary sentences in $y_i$ and $D_i$ the set of document sentences in $x_i$; i.e., in any given pair of document and summary, every summary sentence can be produced from some document sentence. For simplicity, we omit the subscript $i$ in the remainder of the paper. Under this assumption, we can further define another latent function $f : \mathcal{X} \to \mathcal{D}^n$ that satisfies $f(x) = \{d_t\}_{t=1}^n$ and $y = h(x) = [g(d_1), g(d_2), \ldots, g(d_n)]$, where $[\cdot,\cdot]$ denotes sentence concatenation. This latent function $f$ can be seen as an extractor that chooses the salient (ordered) sentences in a given document for the abstracting function $g$ to rewrite. Our overall model consists of these two submodules, the extractor agent and the abstractor network, to approximate the above-mentioned $f$ and $g$, respectively.
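To make this decomposition concrete, here is a minimal, purely illustrative Python sketch of how $h$ composes $f$ and $g$; the `extract` and `rewrite` placeholders are hypothetical stand-ins for the neural extractor and abstractor described next, not the paper's actual implementation.

```python
# A minimal sketch of the f/g/h decomposition described above.
# `extract` (f) and `rewrite` (g) are hypothetical stand-ins for the
# paper's neural extractor and abstractor; here they are trivial
# placeholders just to show how h composes them.

def extract(document_sents):
    """f: pick the (ordered) salient sentences; placeholder takes the first two."""
    return document_sents[:2]

def rewrite(sentence):
    """g: compress/paraphrase one sentence; placeholder truncates to 10 tokens."""
    return " ".join(sentence.split()[:10])

def summarize(document_sents):
    """h(x) = [g(d_1), g(d_2), ..., g(d_n)] with {d_t} = f(x)."""
    return [rewrite(d) for d in extract(document_sents)]

doc = ["First long sentence about the main event in great detail .",
       "A second salient sentence with more facts .",
       "Background filler ."]
print(summarize(doc))
```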

2.1 Extractor Agent

The extractor agent is designed to model $f$, which can be thought of as extracting salient sentences from the document. We exploit a hierarchical neural model to learn the sentence representations of the document and a 'selection network' to extract sentences based on their representations.

¹We are releasing our code, best pretrained models, as well as output summaries, to promote future research: https://github.com/ChenRocks/fast_abs_rl


Figure 1: Our extractor agent: the convolutional encoder computes representation $r_j$ for each sentence. The RNN encoder (blue) computes context-aware representation $h_j$ and then the RNN decoder (green) selects sentence $j_t$ at time step $t$. With $j_t$ selected, $h_{j_t}$ will be fed into the decoder at time $t+1$. (Figure not reproduced: it shows embedded word vectors fed to the convolutional sentence encoder, the encoded sentence representations $r_j$ passing through bi-LSTMs into context-aware sentence representations $h_j$, and an LSTM decoder producing the extraction probabilities, i.e., the policy.)

2.1.1 Hierarchical Sentence Representation

We use a temporal convolutional model (Kim, 2014) to compute $r_j$, the representation of each individual sentence in the documents (details in supplementary). To further incorporate global context of the document and capture the long-range semantic dependency between sentences, a bidirectional LSTM-RNN (Hochreiter and Schmidhuber, 1997; Schuster et al., 1997) is applied on the convolutional output. This enables learning a strong representation, denoted as $h_j$ for the $j$-th sentence in the document, that takes into account the context of all previous and future sentences in the same document.
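The following is a hedged PyTorch sketch of this hierarchical encoder; the embedding size, filter counts, kernel sizes, and LSTM width are illustrative assumptions (the paper's exact hyperparameters are in its supplementary material), not the released configuration.

```python
# A sketch of the hierarchical sentence encoder, assuming PyTorch and
# hyperparameters chosen for illustration only.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, n_filters=100,
                 kernel_sizes=(3, 4, 5), lstm_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Temporal convolutions over each sentence (Kim, 2014): r_j
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])
        # Bidirectional LSTM over the sequence of sentence vectors: h_j
        self.bilstm = nn.LSTM(n_filters * len(kernel_sizes), lstm_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, doc):  # doc: (n_sents, max_words) word ids
        emb = self.embed(doc).transpose(1, 2)    # (n_sents, emb, words)
        r = torch.cat([conv(emb).relu().max(dim=2).values
                       for conv in self.convs], dim=1)  # (n_sents, conv_dim)
        h, _ = self.bilstm(r.unsqueeze(0))       # (1, n_sents, 2*lstm_dim)
        return r, h.squeeze(0)

enc = SentenceEncoder(vocab_size=5000)
r, h = enc(torch.randint(0, 5000, (7, 30)))  # 7 sentences, 30 words each
print(r.shape, h.shape)
```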

2.1.2 Sentence Selection

Next, to select the extracted sentences based on the above sentence representations, we add another LSTM-RNN to train a Pointer Network (Vinyals et al., 2015), to extract sentences recurrently. We calculate the extraction probability by:

$$u_j^t = \begin{cases} v_p^\top \tanh(W_{p1} h_j + W_{p2} e_t) & \text{if } j_t \neq j_k\ \forall k < t \\ -\infty & \text{otherwise} \end{cases} \quad (1)$$

$$P(j_t \mid j_1, \ldots, j_{t-1}) = \mathrm{softmax}(u^t) \quad (2)$$

where the $e_t$'s are the output of the glimpse operation (Vinyals et al., 2016):

$$a_j^t = v_g^\top \tanh(W_{g1} h_j + W_{g2} z_t) \quad (3)$$

$$\alpha^t = \mathrm{softmax}(a^t) \quad (4)$$

$$e_t = \sum_j \alpha_j^t W_{g1} h_j \quad (5)$$

Figure 2: Reinforced training of the extractor (for one extraction step) and its interaction with the abstractor: the extractor observes the document sentences $d_1, \ldots, d_5$, extracts $d_{j_t}$ as its action, the abstractor rewrites it to $g(d_{j_t})$, and the match against the ground-truth summary sentence $s_t$ provides the reward for the policy gradient update. For simplicity, the critic network is not shown. Note that all $d$'s and $s_t$ are raw sentences, not vector representations. (Figure not reproduced.)

In Eqn. 3, $z_t$ is the output of the added LSTM-RNN (shown in green in Fig. 1), which is referred to as the decoder. All the $W$'s and $v$'s are trainable parameters. At each time step $t$, the decoder performs a 2-hop attention mechanism: it first attends to the $h_j$'s to get a context vector $e_t$ and then attends to the $h_j$'s again for the extraction probabilities.² This model is essentially classifying all sentences of the document at each extraction step. An illustration of the whole extractor is shown in Fig. 1.
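A small PyTorch sketch of this two-hop attention follows, under the assumption of randomly initialized parameters and illustrative dimensions; it mirrors Eqns. (1)-(5) directly, including the force-zeroing of already-extracted sentences (footnote 2).

```python
# A sketch of the two-hop attention in Eqns. (1)-(5), assuming PyTorch;
# dimensions are illustrative, and `extracted` marks already-selected
# sentences whose probability is forced to zero (footnote 2).
import torch
import torch.nn.functional as F

def extraction_step(h, z_t, params, extracted):
    """h: (n_sents, d) context-aware reps; z_t: (d,) decoder output."""
    (Wg1, Wg2, vg, Wp1, Wp2, vp) = params
    # Hop 1 (glimpse, Eqns. 3-5): attend to h to build context e_t
    a = torch.tanh(h @ Wg1.T + z_t @ Wg2.T) @ vg           # Eqn. (3)
    alpha = F.softmax(a, dim=0)                            # Eqn. (4)
    e_t = (alpha.unsqueeze(1) * (h @ Wg1.T)).sum(dim=0)    # Eqn. (5)
    # Hop 2 (Eqn. 1): attend to h again for extraction scores
    u = torch.tanh(h @ Wp1.T + e_t @ Wp2.T) @ vp
    u = u.masked_fill(extracted, float('-inf'))            # no repeats
    return F.softmax(u, dim=0)                             # Eqn. (2)

n, d = 6, 32
params = tuple(torch.randn(s) for s in
               [(d, d), (d, d), (d,), (d, d), (d, d), (d,)])
probs = extraction_step(torch.randn(n, d), torch.randn(d),
                        params, torch.zeros(n, dtype=torch.bool))
print(probs)
```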

2.2 Abstractor Network

The abstractor network approximates $g$, which compresses and paraphrases an extracted document sentence to a concise summary sentence. We use the standard encoder-aligner-decoder (Bahdanau et al., 2015; Luong et al., 2015). We add the copy mechanism³ to help directly copy some out-of-vocabulary (OOV) words (See et al., 2017). For more details, please refer to the supplementary.

²Note that we force-zero the extraction prob. of already extracted sentences so as to prevent the model from using repeating document sentences and suffering from redundancy. This is non-differentiable and hence only done in RL training.

3 Learning

Given that our extractor performs a non-differentiable hard extraction, we apply standard policy gradient methods to bridge the back-propagation and form an end-to-end trainable (stochastic) computation graph. However, simply starting from a randomly initialized network to train the whole model in an end-to-end fashion is infeasible. When randomly initialized, the extractor would often select sentences that are not relevant, so it would be difficult for the abstractor to learn to abstractively rewrite. On the other hand, without a well-trained abstractor the extractor would get a noisy reward, which leads to a bad estimate of the policy gradient and a sub-optimal policy. We hence propose optimizing each sub-module separately using maximum-likelihood (ML) objectives: train the extractor to select salient sentences (fit $f$) and the abstractor to generate shortened summaries (fit $g$). Finally, RL is applied to train the full model end-to-end (fit $h$).

3.1 Maximum-Likelihood Training for Submodules

Extractor Training: In Sec. 2.1.2, we have formulated our sentence selection as classification. However, most of the summarization datasets are end-to-end document-summary pairs without extraction (saliency) labels for each sentence. Hence, we propose a simple similarity method to provide a 'proxy' target label for the extractor. Similar to the extractive model of Nallapati et al. (2017), for each ground-truth summary sentence, we find the most similar document sentence $d_{j_t}$ by:⁴

$$j_t = \mathrm{argmax}_i\big(\mathrm{ROUGE\text{-}L}_{recall}(d_i, s_t)\big) \quad (6)$$

Given these proxy training labels, the extractor is then trained to minimize the cross-entropy loss.
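As an illustration of Eqn. (6), here is a sketch of the proxy-label construction using a naive token-level LCS implementation of ROUGE-L recall; a real pipeline would presumably use a standard ROUGE package rather than this simplified scorer.

```python
# A sketch of the proxy-label construction in Eqn. (6), assuming a
# simple token-level longest-common-subsequence implementation of
# ROUGE-L recall (a real system would use a standard ROUGE package).

def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a, b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l_recall(doc_sent, summ_sent):
    d, s = doc_sent.lower().split(), summ_sent.lower().split()
    return lcs_len(d, s) / len(s) if s else 0.0

def proxy_labels(doc_sents, summ_sents):
    """For each ground-truth summary sentence s_t, pick j_t by Eqn. (6)."""
    return [max(range(len(doc_sents)),
                key=lambda i: rouge_l_recall(doc_sents[i], s))
            for s in summ_sents]

doc = ["the cat sat on the mat", "stocks fell sharply today",
       "the cat is very fluffy"]
print(proxy_labels(doc, ["the cat sat on a mat"]))  # -> [0]
```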

³We use the terminology of copy mechanism (originally named pointer-generator) in order to avoid confusion with the pointer network (Vinyals et al., 2015).

⁴Nallapati et al. (2017) selected sentences greedily to maximize the global summary-level ROUGE, whereas we match exactly one document sentence for each ground-truth summary sentence based on the individual sentence-level score.

Abstractor Training: For the abstractor training, we create training pairs by taking each summary sentence and pairing it with its extracted document sentence (based on Eqn. 6). The network is trained as a usual sequence-to-sequence model to minimize the cross-entropy loss

$$L(\theta_{abs}) = -\frac{1}{M}\sum_{m=1}^{M} \log P_{\theta_{abs}}(w_m \mid w_{1:m-1})$$

of the decoder language model at each generation step, where $\theta_{abs}$ is the set of trainable parameters of the abstractor and $w_m$ the $m$-th generated word.
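In PyTorch terms, this objective is ordinary teacher-forced cross-entropy; the sketch below assumes a hypothetical decoder that has already produced per-step logits.

```python
# A sketch of the abstractor's ML objective L(θ_abs): average
# per-token negative log-likelihood under teacher forcing, assuming
# PyTorch and an illustrative decoder that returns logits per step.
import torch
import torch.nn.functional as F

def ml_loss(step_logits, target_ids):
    """step_logits: (M, vocab) logits for w_1..w_M given w_{1:m-1};
    target_ids: (M,) gold word ids. Returns -1/M Σ log P(w_m|w_{1:m-1})."""
    return F.cross_entropy(step_logits, target_ids)  # mean over M steps

M, vocab = 12, 5000
logits = torch.randn(M, vocab, requires_grad=True)  # stand-in decoder output
loss = ml_loss(logits, torch.randint(0, vocab, (M,)))
loss.backward()  # gradients would flow into the (hypothetical) decoder
print(float(loss))
```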

3.2 Reinforce-Guided Extraction

Here we explain how policy gradient techniques are applied to optimize the whole model. To make the extractor an RL agent, we can formulate a Markov Decision Process (MDP)⁵: at each extraction step $t$, the agent observes the current state $c_t = (D, d_{j_{t-1}})$, samples an action $j_t \sim \pi_{\theta_a,\omega}(c_t, j) = P(j)$ from Eqn. 2 to extract a document sentence, and receives a reward⁶

$$r(t+1) = \mathrm{ROUGE\text{-}L}_{F_1}(g(d_{j_t}), s_t) \quad (7)$$

after the abstractor summarizes the extracted sentence $d_{j_t}$. We denote the trainable parameters of the extractor agent by $\theta = \{\theta_a, \omega\}$ for the decoder and hierarchical encoder, respectively. We can then train the extractor with policy-based RL. We illustrate this process in Fig. 2.

The vanilla policy gradient algorithm, REINFORCE (Williams, 1992), is known for high variance. To mitigate this problem, we add a critic network with trainable parameters $\theta_c$ to predict the state-value function $V^{\pi_{\theta_a,\omega}}(c)$. The predicted value of the critic, $b_{\theta_c,\omega}(c)$, is called the 'baseline', which is then used to estimate the advantage function $A^{\pi_\theta}(c, j) = Q^{\pi_{\theta_a,\omega}}(c, j) - V^{\pi_{\theta_a,\omega}}(c)$, because the total return $R_t$ is an estimate of the action-value function $Q(c_t, j_t)$. Instead of maximizing $Q(c_t, j_t)$ as done in REINFORCE, we maximize $A^{\pi_\theta}(c, j)$ with the following policy gradient:

$$\nabla_{\theta_a,\omega} J(\theta_a, \omega) = \mathbb{E}\big[\nabla_{\theta_a,\omega} \log \pi_\theta(c, j)\, A^{\pi_\theta}(c, j)\big] \quad (8)$$

And the critic is trained to minimize the square loss $L_c(\theta_c, \omega) = (b_{\theta_c,\omega}(c_t) - R_t)^2$. This is known as the Advantage Actor-Critic (A2C), a synchronous variant of A3C (Mnih et al., 2016). For more A2C details, please refer to the supplementary.

⁵Strictly speaking, this is a Partially Observable Markov Decision Process (POMDP). We approximate it as an MDP by assuming that the RNN hidden state contains all past info.

⁶In Eqn. 6, we use ROUGE-recall because we want the extracted sentence to contain as much information as possible for rewriting. Nevertheless, for Eqn. 7, ROUGE-F1 is more suitable because the abstractor $g$ is supposed to rewrite the extracted sentence $d$ to be as concise as the ground truth $s$.
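A compact sketch of how the A2C losses in Eqn. (8) and the critic's square loss might be computed, assuming PyTorch; the episode data (log-probabilities, baseline values, returns) are illustrative placeholders for what the extractor policy and critic would actually produce.

```python
# A sketch of the A2C update in Eqn. (8), assuming PyTorch. `log_probs`
# are log π(j_t|c_t) of the sampled extractions, `values` the critic's
# baselines b(c_t), and `returns` the observed total returns R_t built
# from the ROUGE rewards of Eqn. (7); all values here are illustrative.
import torch

def a2c_losses(log_probs, values, returns):
    advantage = returns - values.detach()           # A = R_t - b(c_t)
    policy_loss = -(log_probs * advantage).mean()   # maximize E[log π · A]
    critic_loss = (values - returns).pow(2).mean()  # (b(c_t) - R_t)^2
    return policy_loss, critic_loss

T = 4  # extraction steps in one episode
log_probs = torch.randn(T, requires_grad=True)  # from the policy net
values = torch.randn(T, requires_grad=True)     # from the critic net
returns = torch.tensor([1.8, 1.2, 0.7, 0.4])    # discounted ROUGE returns
p_loss, c_loss = a2c_losses(log_probs, values, returns)
(p_loss + c_loss).backward()
print(float(p_loss), float(c_loss))
```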

Intuitively, our RL training works as follows: if the extractor chooses a good sentence, after the abstractor rewrites it the ROUGE match would be high and thus the action is encouraged. If a bad sentence is chosen, though the abstractor still produces a compressed version of it, the summary would not match the ground truth and the low ROUGE score discourages this action. Our RL with a sentence-level agent is a novel attempt in neural summarization. We use RL as a saliency guide without altering the abstractor's language model, while previous work applied RL on the word-level, which could be prone to gaming the metric at the cost of language fluency.⁷

Learning how many sentences to extract: In a typical RL setting like game playing, an episode is usually terminated by the environment. In text summarization, on the other hand, the agent does not know in advance how many summary sentences to produce for a given article (since the desired length varies for different downstream applications). We make an important yet simple, intuitive adaptation to solve this: adding a 'stop' action to the policy action space. In the RL training phase, we add another set of trainable parameters $v_{EOE}$ (EOE stands for 'End-Of-Extraction') with the same dimension as the sentence representation. The pointer-network decoder treats $v_{EOE}$ as one of the extraction candidates and hence naturally results in a stop action in the stochastic policy (see the sketch below). We set the reward for the agent performing EOE to $\mathrm{ROUGE\text{-}1}_{F_1}([\{g(d_{j_t})\}_t], [\{s_t\}_t])$; whereas for any extraneous, unwanted extraction step, the agent receives zero reward. The model is therefore encouraged to extract when there are still remaining ground-truth summary sentences (to accumulate intermediate reward), and to learn to stop by optimizing a global ROUGE and avoiding extra extraction.⁸ Overall, this modification allows dynamic decisions on the number of sentences based on the input document, eliminates the need for tuning a fixed number of steps, and enables a data-driven adaptation for any specific dataset/application.

⁷During this RL training of the extractor, we keep the abstractor parameters fixed. Because the input sentences for the abstractor are extracted by an intermediate stochastic policy of the extractor, it is impossible to find the correct target summary for the abstractor to fit $g$ with the ML objective. Though it is possible to optimize the abstractor with RL, in our preliminary experiments we found that this does not improve the overall ROUGE, most likely because this RL optimizes at a sentence-level and can add across-sentence redundancy. We achieve SotA results without this abstractor-level RL.

⁸We use ROUGE-1 for terminal reward because it is a better measure of bag-of-words information (i.e., whether all the important information has been generated); while ROUGE-L is used as intermediate reward since it is known for better measurement of language fluency within a local sentence.
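The sketch below illustrates the stop-action mechanics under simplified assumptions: a trainable $v_{EOE}$ vector is appended to the candidate set so that a (stand-in) pointer scorer can select 'stop' like any other action.

```python
# A sketch of the 'stop' action: a trainable vector v_EOE is appended
# to the sentence representations as one more extraction candidate, so
# the pointer decoder can choose to end the episode. PyTorch, with
# illustrative dimensions and a stand-in scoring function.
import torch
import torch.nn.functional as F

d, n_sents = 32, 5
v_eoe = torch.nn.Parameter(torch.randn(d))   # End-Of-Extraction vector

def extraction_scores(h, query):
    """Stand-in for the pointer scores u_j^t; last index means 'stop'."""
    candidates = torch.cat([h, v_eoe.unsqueeze(0)], dim=0)  # (n+1, d)
    return candidates @ query                                # (n+1,)

h = torch.randn(n_sents, d)
probs = F.softmax(extraction_scores(h, torch.randn(d)), dim=0)
action = torch.multinomial(probs, 1).item()
print("stop" if action == n_sents else f"extract sentence {action}")
```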

3.3 Repetition-Avoiding Reranking

Existing abstractive summarization systems on long documents suffer from generating repeating and redundant words and phrases. To mitigate this issue, See et al. (2017) propose the coverage mechanism and Paulus et al. (2018) incorporate tri-gram avoidance during beam search at test time. Our model without these already performs well because the summary sentences are generated from mutually exclusive document sentences, which naturally avoids redundancy. However, we do get a small further boost to the summary quality by removing a few 'across-sentence' repetitions, via a simple reranking strategy: at the sentence level, we apply the same beam-search tri-gram avoidance (Paulus et al., 2018). We keep all $k$ sentence candidates generated by beam search, where $k$ is the size of the beam. Next, we rerank all $k^n$ combinations of the $n$ generated summary-sentence beams. The summaries are reranked by the number of repeated $N$-grams, the smaller the better. We also apply the diverse decoding algorithm described in Li et al. (2016) (which has almost no computation overhead) so as to get the above approach to produce useful diverse reranking lists. We show how much the redundancy affects the summarization task in Sec. 6.2.
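A pure-Python sketch of this reranking step, assuming the $k$ beam candidates per sentence are already available; tie-breaking and the diverse-decoding component (Li et al., 2016) are omitted.

```python
# A sketch of the reranking strategy: keep k beam candidates per
# sentence, enumerate the k^n combinations, and pick the combination
# with the fewest repeated N-grams (illustrative pure-Python version;
# ties are broken arbitrarily).
from itertools import product

def ngrams(tokens, n=3):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def repeated_ngram_count(summary_sents, n=3):
    grams = [g for s in summary_sents for g in ngrams(s.split(), n)]
    return len(grams) - len(set(grams))

def rerank(beams):
    """beams: list of n lists, each with k candidate sentences."""
    return min(product(*beams), key=repeated_ngram_count)

beams = [["the president visited paris on monday",
          "the president went to paris"],
         ["he visited paris to sign a treaty",
          "he signed a treaty in paris"]]
print(rerank(beams))
```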

4 Related Work

Early summarization works mostly focused on extractive and compression-based methods (Jing and McKeown, 2000; Knight and Marcu, 2000; Clarke and Lapata, 2010; Berg-Kirkpatrick et al., 2011; Filippova et al., 2015). Recent large-sized corpora attracted neural methods for abstractive summarization (Rush et al., 2015; Chopra et al., 2016). Some of the recent successes in neural abstractive models include hierarchical attention (Nallapati et al., 2016), coverage (Suzuki and Nagata, 2016; Chen et al., 2016; See et al., 2017), RL-based metric optimization (Paulus et al., 2018), graph-based attention (Tan et al., 2017), and the copy mechanism (Miao and Blunsom, 2016; Gu et al., 2016; See et al., 2017).



Our model shares some high-level intuition with extract-then-compress methods. Earlier attempts in this paradigm used Hidden Markov Models and rule-based systems (Jing and McKeown, 2000), statistical models based on parse trees (Knight and Marcu, 2000), and integer linear programming based methods (Martins and Smith, 2009; Gillick and Favre, 2009; Clarke and Lapata, 2010; Berg-Kirkpatrick et al., 2011). Recent approaches investigated discourse structures (Louis et al., 2010; Hirao et al., 2013; Kikuchi et al., 2014; Wang et al., 2015), graph cuts (Qian and Liu, 2013), and parse trees (Li et al., 2014; Bing et al., 2015). For neural models, Cheng and Lapata (2016) used a second neural net to select words from an extractor's output. Our abstractor does not merely 'compress' the sentences but generatively produces novel words. Moreover, our RL bridges the extractor and the abstractor for end-to-end training.

Reinforcement learning has been used to optimize the non-differentiable metrics of language generation and to mitigate exposure bias (Ranzato et al., 2016; Bahdanau et al., 2017). Henß et al. (2015) use Q-learning-based RL for extractive summarization. Paulus et al. (2018) use RL policy gradient methods for abstractive summarization, utilizing sequence-level metric rewards with curriculum learning (Ranzato et al., 2016) or a weighted ML+RL mixed loss (Paulus et al., 2018) for stability and language fluency. We use sentence-level rewards to optimize the extractor while keeping our ML-trained abstractor decoder fixed, so as to achieve the best of both worlds.

Training a neural network to use another fixed network has been investigated in machine translation for better decoding (Gu et al., 2017a) and real-time translation (Gu et al., 2017b). They used a fixed pretrained translator and applied policy gradient techniques to train another task-specific network. In question answering (QA), Choi et al. (2017) extract one sentence and then generate the answer from the sentence's vector representation with RL bridging. Another recent work attempted a new coarse-to-fine attention approach on summarization (Ling and Rush, 2017) and found desired sharp focus properties for scaling to larger inputs (though without metric improvements). Very recently (concurrently), Narayan et al. (2018) use RL for ranking sentences in pure extraction-based summarization and Celikyilmaz et al. (2018) investigate multiple communicating encoder agents to enhance the copying abstractive summarizer.

Finally, there are some loosely-related recent works: Zhou et al. (2017) proposed a selective gate to improve the attention in abstractive summarization. Tan et al. (2018) used an extract-then-synthesis approach on QA, where an extraction model predicts the important spans in the passage and then another synthesis model generates the final answer. Swayamdipta et al. (2017) attempted cascaded non-recurrent small networks on extractive QA, resulting in a scalable, parallelizable model. Fan et al. (2017) added controlling parameters to adapt the summary to length, style, and entity preferences. However, none of these used RL to bridge the non-differentiability of neural models.

5 Experimental Setup

Please refer to the supplementary for full training details (all hyperparameter tuning was performed on the validation set). We use the CNN/Daily Mail dataset (Hermann et al., 2015) modified for summarization (Nallapati et al., 2016). Because there are two versions of the dataset, original text and entity-anonymized, we show results on both versions for a fair comparison to prior works. The experiments run training and evaluation for each version separately. Despite the fact that the two versions have been considered by the summarization community as two different datasets, we use the same hyper-parameter values for both dataset versions to show the generalization of our model. We also show improvements on the DUC-2002 dataset in a test-only setup.

5.1 Evaluation Metrics

For all the datasets, we evaluate standard ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004) on full-length F1 (with stemming), following previous works (Nallapati et al., 2017; See et al., 2017; Paulus et al., 2018). Following See et al. (2017), we also evaluate on METEOR (Denkowski and Lavie, 2014) for a more thorough analysis.

5.2 Modular Extractive vs. Abstractive

Our hybrid approach is capable of both extractive and abstractive (i.e., rewriting every sentence) summarization. The extractor alone performs extractive summarization. To investigate the effect of the recurrent extractor (rnn-ext), we implement a feed-forward extractive baseline ff-ext (details in supplementary). It is also possible to apply RL to the extractor without using the abstractor (rnn-ext + RL).⁹


Models                            ROUGE-1  ROUGE-2  ROUGE-L  METEOR

Extractive Results
lead-3 (See et al., 2017)           40.34    17.70    36.57   22.21
Narayan et al. (2018)               40.0     18.2     36.6      -
ff-ext                              40.63    18.35    36.82   22.91
rnn-ext                             40.17    18.11    36.41   22.81
rnn-ext + RL                        41.47    18.72    37.76   22.35

Abstractive Results
See et al. (2017) (w/o coverage)    36.44    15.66    33.42   16.65
See et al. (2017)                   39.53    17.28    36.38   18.72
Fan et al. (2017) (controlled)      39.75    17.29    36.54     -
ff-ext + abs                        39.30    17.02    36.93   20.05
rnn-ext + abs                       38.38    16.12    36.04   19.39
rnn-ext + abs + RL                  40.04    17.61    37.59   21.00
rnn-ext + abs + RL + rerank         40.88    17.80    38.54   20.38

Table 1: Results on the original, non-anonymized CNN/Daily Mail dataset. Adding RL gives statistically significant improvements for all metrics over non-RL rnn-ext models (and over the state-of-the-art See et al. (2017)) in both extractive and abstractive settings with p < 0.01. Adding the extra reranking stage yields statistically significantly better results in terms of all ROUGE metrics with p < 0.01.

Benefiting from the high modularity of our model, we can make our summarization system abstractive by simply applying the abstractor on the extracted sentences. Our abstractor rewrites each sentence and generates novel words from a large vocabulary, and hence every word in our overall summary is generated from scratch, placing our full model in the abstractive paradigm.¹⁰ We run experiments on separately trained extractor/abstractor models (ff-ext + abs, rnn-ext + abs) and the reinforced full model (rnn-ext + abs + RL), as well as the final reranking version (rnn-ext + abs + RL + rerank).

6 Results

For easier comparison, we show separate tables for the original-text vs. anonymized versions: Table 1 and Table 2, respectively. Overall, our model achieves strong improvements and the new state-of-the-art on both extractive and abstractive settings for both versions of the CNN/DM dataset (with some comparable results on the anonymized version). Moreover, Table 3 shows the generalization of our abstractive system to an out-of-domain test-only setup (DUC-2002), where our model achieves better scores than See et al. (2017).

6.1 Extractive Summarization

In the extractive paradigm, we compare our model with the extractive model from Nallapati et al. (2017) and a strong lead-3 baseline. For producing our summary, we simply concatenate the extracted sentences from the extractors. From Table 1 and Table 2, we can see that our feed-forward extractor outperforms the lead-3 baseline, empirically showing that our hierarchical sentence encoding model is capable of extracting salient sentences.¹¹

⁹In this case the abstractor function is $g(d) = d$.

¹⁰Note that the abstractive CNN/DM dataset does not include any human-annotated extraction label, and hence our models do not receive any direct extractive supervision.

Models                           R-1    R-2    R-L

Extractive Results
lead-3 (Nallapati et al., 2017)  39.2   15.7   35.5
Nallapati et al. (2017)          39.6   16.2   35.3
ff-ext                           39.51  16.85  35.80
rnn-ext                          38.97  16.65  35.32
rnn-ext + RL                     40.13  16.58  36.47

Abstractive Results
Nallapati et al. (2016)          35.46  13.30  32.65
Fan et al. (2017) (controlled)   38.68  15.40  35.47
Paulus et al. (2018) (ML)        38.30  14.81  35.49
Paulus et al. (2018) (RL+ML)     39.87  15.82  36.90
ff-ext + abs                     38.73  15.70  36.33
rnn-ext + abs                    37.58  14.68  35.24
rnn-ext + abs + RL               38.80  15.66  36.37
rnn-ext + abs + RL + rerank      39.66  15.85  37.34

Table 2: ROUGE for anonymized CNN/DM.


The reinforced extractor performs the best, because of the ability to get the summary-level reward and the reduced train-test mismatch from feeding in the previous extraction decision. The improvement over lead-3 is consistent across both tables. In Table 2, it outperforms the previous best neural extractive model (Nallapati et al., 2017). In Table 1, our model also outperforms a recent, concurrent work by Narayan et al. (2018), showing that our pointer-network extractor and reward formulations are very effective when combined with A2C RL.

¹¹The ff-ext model outperforms rnn-ext possibly because it does not predict sentence ordering and is thus easier to optimize, and the n-gram based metrics do not consider sentence ordering. Also note that in our MDP formulation, we cannot apply RL on ff-ext due to its historyless nature. Even if applied naively, there is no way for the feed-forward model to learn the EOE action described in Sec. 3.2.


Models               R-1    R-2    R-L
See et al. (2017)    37.22  15.78  33.90
rnn-ext + abs + RL   39.46  17.34  36.72

Table 3: Generalization to DUC-2002 (F1).


6.2 Abstractive Summarization

After applying the abstractor, the ff-ext based model still outperforms the rnn-ext model. Both combined models exceed the pointer-generator model (See et al., 2017) without coverage by a large margin for all metrics, showing the effectiveness of our 2-step hierarchical approach: our method naturally avoids repetition by extracting multiple sentences with different keypoints.¹²

Moreover, after applying reinforcement learning, our model performs better than the best model of See et al. (2017) and the best ML-trained model of Paulus et al. (2018). Our reinforced model outperforms the ML-trained rnn-ext + abs baseline with statistical significance of p < 0.01 on all metrics for both versions of the dataset, indicating the effectiveness of the RL training. Also, rnn-ext + abs + RL is statistically significantly better than See et al. (2017) for all metrics with p < 0.01.¹³ In the supplementary, we show the learning curve of our RL training, where the average reward goes up quickly after the extractor learns the End-of-Extract action and then stabilizes. For all the above models, we use standard greedy decoding and find that it performs well.

Reranking and Redundancy: Although the extract-then-abstract approach inherently will not generate repeating sentences like other neural decoders do, there might still be across-sentence redundancy because the abstractor is not aware of the other extracted sentences when decoding one. Hence, we incorporate the optional reranking strategy described in Sec. 3.3. The improved ROUGE scores indicate that this successfully removes some remaining redundancies and hence produces more concise summaries.

¹²A trivial lead-3 + abs baseline obtains ROUGE of (37.37, 15.59, 34.82), which again confirms the importance of our reinforce-based sentence selection.

¹³We calculate statistical significance based on the bootstrap test (Noreen, 1989; Efron and Tibshirani, 1994) with 100K samples. The output of Paulus et al. (2018) is not available, so we could not test for statistical significance there.

Models                        Relevance  Readability  Total
See et al. (2017)             120        128          248
rnn-ext + abs + RL + rerank   137        133          270
Equally good/bad              43         39           82

Table 4: Human Evaluation: pairwise comparison between our final model and See et al. (2017).

Our best abstractive model (rnn-ext + abs + RL + rerank) is clearly superior to that of See et al. (2017). We are comparable on R-1 and R-2, with a 0.4 point improvement on R-L, w.r.t. Paulus et al. (2018).¹⁴

We also outperform the results of Fan et al. (2017) on both the original and anonymized dataset versions. Several previous works have pointed out that extractive baselines are very difficult to beat (in terms of ROUGE) by an abstractive system (See et al., 2017; Nallapati et al., 2017). Note that our best model is one of the first abstractive models to outperform the lead-3 baseline on the original-text CNN/DM dataset. Our extractive experiment serves as a complementary analysis of the effect of RL with extractive systems.

6.3 Human Evaluation

We also conduct a human evaluation to ensure the robustness of our training procedure. We measure relevance and readability of the summaries. Relevance is based on the summary containing important, salient information from the input article, being correct by avoiding contradictory/unrelated information, and avoiding repeated/redundant information. Readability is based on the summary's fluency, grammaticality, and coherence. To evaluate both these criteria, we design the following Amazon MTurk experiment: we randomly select 100 samples from the CNN/DM test set and ask the human testers (3 for each sample) to rank between summaries (for relevance and readability) produced by our model and that of See et al. (2017) (the models were anonymized and randomly shuffled), i.e., A is better, B is better, both are equally good/bad. Following previous work, the input article and ground-truth summaries are also shown to the human participants in addition to the two model summaries.¹⁵ From the results shown in Table 4, we can see that our model is better in both relevance and readability w.r.t. See et al. (2017).

¹⁴We do not list the scores of their pure RL model because they discussed its bad readability.

¹⁵We selected human annotators that were located in the US, had an approval rate greater than 95%, and had at least 10,000 approved HITs on record.


Models                        Total time (hr)      Words / sec
See et al. (2017)             12.9                 14.8
rnn-ext + abs + RL            0.68                 361.3
rnn-ext + abs + RL + rerank   2.00 (1.46 + 0.54)   109.8

Table 5: Speed comparison with See et al. (2017).

6.4 Speed Comparison

Our two-stage extractive-abstractive hybrid model is not only the SotA on summary quality metrics, but, more importantly, also gives a significant speed-up in both train and test time over a strong neural abstractive system (See et al., 2017).¹⁶

Our full model is composed of an extremely fast extractor and a parallelizable abstractor, where the computation bottleneck is the abstractor, which has to generate summaries with a large vocabulary from scratch.¹⁷ The main advantage of our abstractor at decoding time is that we can first compute all the extracted sentences for the document, and then abstract every sentence concurrently (in parallel) to generate the overall summary. In Table 5, we show the substantial test-time speed-up of our model compared to See et al. (2017).¹⁸ We calculate the total decoding time for producing all summaries for the test set.¹⁹ Due to the fact that the main test-time speed bottleneck of an RNN language generation model is that the model is constrained to generate one word at a time, the total decoding time is dependent on the number of total words generated; we hence also report the decoded words per second for a fair comparison. Our model without reranking is extremely fast. From Table 5 we can see that we achieve a speed-up of 18x in time and 24x in word generation rate. Even after adding the (optional) reranker, we still maintain a 6-7x speed-up (and hence a user can choose to use the reranking component depending on their downstream application's speed requirements).²⁰
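The following sketch shows the batching idea in isolation, assuming PyTorch and a stand-in decoder: one greedy decode loop serves all $n$ extracted sentences at once, so each decoder step advances all $n$ output sentences together instead of generating one long summary word by word.

```python
# A sketch of why sentence-level decoding parallelizes: the n extracted
# sentences are batched and rewritten together, so each decoder step
# advances all n output sentences at once. PyTorch, illustrative model.
import torch

def parallel_greedy_decode(decoder_step, enc_states, max_len=30, bos=1):
    """enc_states: (n_sents, d). One decode loop serves all sentences."""
    n = enc_states.size(0)
    tokens = torch.full((n, 1), bos, dtype=torch.long)
    for _ in range(max_len):
        logits = decoder_step(enc_states, tokens)   # (n, vocab)
        nxt = logits.argmax(dim=1, keepdim=True)    # greedy, batched
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens

vocab, d = 100, 16
W = torch.randn(d, vocab)
fake_step = lambda enc, toks: enc @ W   # stand-in for a real decoder
out = parallel_greedy_decode(fake_step, torch.randn(5, d))
print(out.shape)  # 5 sentences decoded concurrently
```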

¹⁶This is the only publicly available code with a pretrained model for neural summarization on which we can test the speed.

¹⁷The time needed for the extractor is negligible w.r.t. the abstractor because it does not require large matrix multiplications for generating every word. Moreover, with the convolutional encoder at the word level made parallelizable by the hierarchical rnn-ext, our model is scalable for very long documents.

¹⁸For details of the training speed-up, please see the supplementary.

¹⁹We time the model of See et al. (2017) using a beam size of 4 (used for their best-reported scores). Without beam search, it gets significantly worse ROUGE of (36.62, 15.12, 34.08), so we do not compare speed-ups w.r.t. that version.

²⁰Most of the recent neural abstractive summarization systems are of similar algorithmic complexity to that of See et al. (2017). The main differences, such as the training objective (ML vs. RL) and copying (soft/hard), have negligible test run-time compared to the slowest component: the long-summary attentional decoder's sequential generation; and this is the component that we substantially speed up via our parallel sentence decoding with sentence-selection RL.

                              Novel N-gram (%)
Models                        1-gram  2-gram  3-gram  4-gram
See et al. (2017)             0.1     2.2     6.0     9.7
rnn-ext + abs + RL + rerank   0.3     10.0    21.7    31.6
reference summaries           10.8    47.5    68.2    78.2

Table 6: Abstractiveness: novel n-gram counts.

7 Analysis

7.1 Abstractiveness

We compute an abstractiveness score (See et al., 2017) as the ratio of novel n-grams in the generated summary that are not present in the input document. The results are shown in Table 6: our model rewrites substantially more abstractive summaries than previous work. A potential reason for this is that when trained with individual sentence pairs, the abstractor learns to drop more document words so as to write individual summary sentences as concisely as human-written ones; thus the improvement in multi-gram novelty.
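For reference, here is a sketch of the novel n-gram ratio computation (following the measure of See et al. (2017), with whitespace tokenization as a simplifying assumption).

```python
# A sketch of the abstractiveness score: the fraction of summary
# n-grams that never appear in the input document (tokenization
# simplified to whitespace splitting).

def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_ratio(document, summary, n):
    doc_grams = ngram_set(document.lower().split(), n)
    summ_toks = summary.lower().split()
    summ_grams = [tuple(summ_toks[i:i + n])
                  for i in range(len(summ_toks) - n + 1)]
    if not summ_grams:
        return 0.0
    novel = sum(1 for g in summ_grams if g not in doc_grams)
    return novel / len(summ_grams)

doc = "the quick brown fox jumps over the lazy dog"
summ = "a quick fox jumps over a sleepy dog"
for n in (1, 2, 3, 4):
    print(n, round(novel_ngram_ratio(doc, summ, n), 2))
```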

7.2 Qualitative Analysis on Output Examples

We show examples of how our best model selects sentences and then rewrites them. In the supplementary Figure 2 and Figure 3, we can see how the abstractor rewrites the extracted sentences concisely while keeping the mentioned facts. Adding the reranker makes the output more compact globally. We observe that when rewriting longer text, the abstractor would have many facts to choose from (Figure 3, sentence 2), and this is where the reranker helps avoid redundancy across sentences.

8 Conclusion

We propose a novel sentence-level RL model for abstractive summarization, which makes the model aware of the word-sentence hierarchy. Our model achieves the new state-of-the-art on both CNN/DM versions as well as better generalization on the test-only DUC-2002, along with a significant speed-up in training and decoding.

Acknowledgments

We thank the anonymous reviewers for their helpful comments. This work was supported by a Google Faculty Research Award, a Bloomberg Data Science Research Grant, an IBM Faculty Award, and NVidia GPU awards.



References

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. An actor-critic algorithm for sequence prediction. In ICLR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Michele Banko, Vibhu O. Mittal, and Michael J. Witbrock. 2000. Headline generation based on statistical translation. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL '00, pages 318–325, Stroudsburg, PA, USA. Association for Computational Linguistics.

Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. 2017. Neural combinatorial optimization with reinforcement learning. arXiv preprint 1611.09940.

Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 481–490, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca J. Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In ACL.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In NAACL-HLT.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for modeling documents. In IJCAI.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 484–494, Berlin, Germany. Association for Computational Linguistics.

Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. 2017. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 209–220. Association for Computational Linguistics.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98, San Diego, California. Association for Computational Linguistics.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Bradley Efron and Robert J. Tibshirani. 1994. An Introduction to the Bootstrap. CRC Press.

Angela Fan, David Grangier, and Michael Auli. 2017. Controllable abstractive summarization. arXiv preprint, abs/1711.05217.

Katja Filippova, Enrique Alfonseca, Carlos Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP'15).

Dan Gillick and Benoit Favre. 2009. A scalable global model for summarization. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, ILP '09, pages 10–18, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jiatao Gu, Kyunghyun Cho, and Victor O. K. Li. 2017a. Trainable greedy decoding for neural machine translation. In EMNLP.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O. K. Li. 2017b. Learning to translate in real-time with neural machine translation. In EACL.

Sebastian Henß, Margot Mieskes, and Iryna Gurevych. 2015. A reinforcement learning approach for adaptive single- and multi-document summarization. In International Conference of the German Society for Computational Linguistics and Language Technology (GSCL-2015), pages 3–12.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).

Tsutomu Hirao, Yasuhisa Yoshida, Masaaki Nishino, Norihito Yasuda, and Masaaki Nagata. 2013. Single-document summarization as a tree knapsack problem. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1515–1520, Seattle, Washington, USA. Association for Computational Linguistics.


Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Hongyan Jing and Kathleen R. McKeown. 2000. Cut and paste based text summarization. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, NAACL 2000, pages 178–185, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yuta Kikuchi, Tsutomu Hirao, Hiroya Takamura, Manabu Okumura, and Masaaki Nagata. 2014. Single document summarization based on nested tree structure. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 315–320, Baltimore, Maryland. Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.

Kevin Knight and Daniel Marcu. 2000. Statistics-based summarization - step one: Sentence compression. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 703–710. AAAI Press.

Chen Li, Yang Liu, Fei Liu, Lin Zhao, and Fuliang Weng. 2014. Improving multi-documents summarization by sentence compression based on expanded constituent parse trees. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 691–701. Association for Computational Linguistics.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint, abs/1611.08562.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Jeffrey Ling and Alexander Rush. 2017. Coarse-to-fine attention models for document summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 33–42. Association for Computational Linguistics.

Annie Louis, Aravind Joshi, and Ani Nenkova. 2010. Discourse indicators for content selection in summarization. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL '10, pages 147–156, Stroudsburg, PA, USA. Association for Computational Linguistics.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Andre F. T. Martins and Noah A. Smith. 2009. Summarization with a joint model for sentence extraction and compression. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, ILP '09, pages 1–9, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In EMNLP.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937, New York, New York, USA. PMLR.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI Conference on Artificial Intelligence.

Ramesh Nallapati, Bowen Zhou, Cicero Nogueira dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In NAACL-HLT.

Eric W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses. Wiley, New York.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In ICLR.

Xian Qian and Yang Liu. 2013. Fast joint compression and summarization via graph cuts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1492–1502, Seattle, Washington, USA. Association for Computational Linguistics.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.


Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083. Association for Computational Linguistics.

Jun Suzuki and Masaaki Nagata. 2016. RNN-based encoder-decoder approach with word frequency estimation. In EACL.

Swabha Swayamdipta, Ankur P. Parikh, and Tom Kwiatkowski. 2017. Multi-mention learning for reading comprehension with neural cascades. arXiv preprint, abs/1711.00894.

Chuanqi Tan, Furu Wei, Nan Yang, Weifeng Lv, and Ming Zhou. 2018. S-Net: From answer extraction to answer generation for machine reading comprehension. In AAAI.

Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In ACL.

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2016. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR).

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.

Xun Wang, Yasuhisa Yoshida, Tsutomu Hirao, Katsuhito Sudoh, and Masaaki Nagata. 2015. Summarization based on task-oriented discourse parsing. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(8):1358–1367.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

David Zajic, Bonnie Dorr, and Richard Schwartz. 2004. BBN/UMD at DUC-2004: Topiary. In HLT-NAACL 2004 Document Understanding Workshop, pages 112–119, Boston, Massachusetts.

Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1095–1104. Association for Computational Linguistics.

