
In Conclusion Not Repetition: Comprehensive Abstractive Summarization With Diversified Attention Based On Determinantal Point Processes

Lei Li1, Wei Liu1, Marina Litvak2, Natalia Vanetik2 and Zuying Huang1

1 Beijing University of Posts and Telecommunications
{leili, thinkwee, zoehuang}@bupt.edu.cn

2 Shamoon College of Engineering
[email protected] [email protected]

Abstract

Various Seq2Seq learning models designed for machine translation have recently been applied to the abstractive summarization task. Although these models achieve high ROUGE scores, they struggle to generate comprehensive summaries with a high level of abstraction because of their degenerate attention distributions. We introduce the Diverse Convolutional Seq2Seq Model (DivCNN Seq2Seq), which uses Determinantal Point Processes methods (Micro DPPs and Macro DPPs) to produce attention distributions that consider both quality and diversity. Without breaking the end-to-end architecture, DivCNN Seq2Seq achieves a higher level of comprehensiveness compared to vanilla models and strong baselines. All the reproducible code and datasets are available online.1

1 Introduction

Given an article, abstractive summarization aims at generating one or several short sentences that cover the main idea of the original article, which is a combination of Natural Language Understanding (NLU) and Natural Language Generation (NLG).

Abstractive summarization uses Seq2Seq models (Sutskever et al., 2014), which consist of an encoder, a decoder and an attention mechanism (Mnih et al., 2014). With the attention mechanism, the decoder can choose a weighted context representation at each generation step, so it can focus on different parts of the encoded information. Seq2Seq with attention achieved remarkable results on machine translation (Bahdanau et al., 2014) and other text generation tasks such as abstractive summarization (Rush et al., 2015).

Unlike machine translation, which emphasizes the attention mechanism as a way of learning word-level alignments between source and target text, attention in summarization should be soft and diverse. Many works have noticed that attention can be overly concentrated in summarization, causing problems such as generating duplicate words or duplicate sentences. Researchers have tried to solve these problems by introducing various attention structures, including local attention (Luong et al., 2015), hierarchical attention (Nallapati et al., 2016), distraction attention (Chen et al., 2016) and the coverage mechanism (See et al., 2017). But all these works ignore another repetition problem, which we call "Original Text Repetition". We define and explain this problem in Section 3.

1 Available at https://github.com/thinkwee/DPP_CNN_Summarization

Article: marseille, france the french prosecutor leading an investigation into the crash of germanwings flight 9525 insisted wednesday that he was not aware of any video footage from on board the plane. marseille prosecutor brice robin told cnn that so far no videos were used in the crash investigation ...... of a cell phone video showing the harrowing final seconds from on board germanwings flight 9525 as it crashed into the french alps. ...... paris match and bild reported that the video was recovered from a phone at the wreckage site. ...... cnn 's frederik pleitgen, pamela boykoff, antonia mortensen, sandrine amiel and anna-maja rappard contributed to this report.

CNN Seq2Seq: french prosecutor UNK robin says he was not aware of any video.

DivCNN Seq2Seq with Micro DPPs: new french prosecutor leading an investigation into the crash of UNK wings flight UNK 25 which crashed into french alps. the video was recovered from a phone at the wreckage site.

DivCNN Seq2Seq with Macro DPPs: french prosecutor says he was not aware of any video footage from on board UNK wings flight UNK 25 as it crashed into french alps.

Table 1: Article-summary sample from the CNN-DM dataset. Colored spans are attentive parts. The Micro DPPs model puts wider attention on the article than the vanilla model does, and Macro DPPs puts the widest attention, covering the former two models' attentive parts.

In this paper we propose a novel Diverse Convolutional Seq2Seq Model (DivCNN Seq2Seq) based on Micro Determinantal Point Processes (Micro DPPs) and Macro Determinantal Point Processes (Macro DPPs). Our contributions are as follows:



• We define and describe the Original Text Repetition problem in abstractive summarization and identify the cause behind it: degenerate attention distributions. We also introduce three article-related metrics for estimating Original Text Repetition and apply them in our experiments.

• We suggest a solution to this problem by introducing DPPs into deep neural network (DNN) attention adjustment and propose DivCNN Seq2Seq. In order to adapt DPPs to large-scale computation, we propose two kinds of methods: Micro DPPs and Macro DPPs. To the best of our knowledge, this is the first attempt to adjust attention distributions considering both quality and diversity.

• We evaluate our models on six open datasets and show their superiority in improving the comprehensiveness of generated summaries without losing much training and inference speed.

2 Convolutional Seq2Seq Learning

Usually the encoder and decoder in a Seq2Seq architecture are recurrent neural networks (RNNs) or variants such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Chung et al., 2014) networks. Recently, a Seq2Seq architecture based entirely on convolutional neural networks (CNN Seq2Seq) (Gehring et al., 2017) was proposed. It builds better hierarchical representations of natural language and can be computed in parallel. In this paper we choose CNN Seq2Seq as our baseline system because it performs better at capturing long-term dependencies, which is important for summarization.

Both the encoder and decoder in CNN Seq2Seq consist of convolutional blocks. Each block contains a one-dimensional convolution (Conv1d), a gated linear unit (GLU) (Dauphin et al., 2017) and several fully connected layers for dimension transformation. Residual connections (He et al., 2016) and batch normalization (Ioffe and Szegedy, 2015) are used in each block. Each block receives an input I of size R^{B*T*C}, where B, T and C are respectively the batch size, the text length and the number of channels (the same as the embedding size). Conv1d pads the sentence first and then generates a tensor [O1, O2] of size R^{B*T*2C}, doubling the channels. The extra channels are used in a simple non-linear gating mechanism:

O1, O2 = Conv1d(I) (1)

GLU([O1, O2]) = O1 ⊗ σ(O2) (2)
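For concreteness, here is a minimal PyTorch sketch of such a block: the channel doubling, the GLU gate of Equations 1-2 and the residual connection follow the description above, while the dimension-transforming linear layers and normalization of the released model are omitted, and all names are ours.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvGLUBlock(nn.Module):
        """One convolutional block: Conv1d that doubles the channels, a GLU gate
        (O1 * sigmoid(O2), Eqs. 1-2) and a residual connection."""
        def __init__(self, channels=256, kernel_size=5):
            super().__init__()
            pad = kernel_size // 2                      # keep the sequence length unchanged
            self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, padding=pad)

        def forward(self, x):                           # x: (B, T, C)
            residual = x
            h = self.conv(x.transpose(1, 2))            # (B, 2C, T)
            h = F.glu(h, dim=1)                         # gate: O1 * sigmoid(O2) -> (B, C, T)
            return h.transpose(1, 2) + residual         # (B, T, C)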

Multi-step attention (Gehring et al., 2017) is used in CNN Seq2Seq. Each convolutional block in the decoder has its own attentive context. Following the query-key-value definition of attention, the queries Q ∈ R^{B*Tg*C} are the outputs of the different decoder blocks, where Tg is the summary length; the keys K ∈ R^{B*Ts*C} are the outputs of the last encoder block, where Ts is the article length; the values are the sum of the encoder input embeddings E ∈ R^{B*Ts*C} and K. Because of the parallel architecture, attention for all decoder time steps can be calculated at once. This architecture speeds up training and is convenient for our DPP calculation. Using the simplest dot-product attention, all calculations can be done with efficient batched matrix multiplication (BMM).

score_attn = BMM(Q, K)  (3)

weight_attn = Softmax(score_attn)  (4)

context = BMM(weight_attn, K + E)  (5)
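A minimal sketch of Equations 3-5 with batched matrix multiplication; tensor shapes follow the notation above, and the function and variable names are ours.

    import torch

    def multistep_attention(Q, K, E):
        """Dot-product attention over all decoder steps at once (Eqs. 3-5).
        Q: (B, Tg, C) decoder block outputs; K: (B, Ts, C) encoder outputs;
        E: (B, Ts, C) encoder input embeddings."""
        score = torch.bmm(Q, K.transpose(1, 2))    # (B, Tg, Ts)
        weight = torch.softmax(score, dim=-1)
        context = torch.bmm(weight, K + E)         # (B, Tg, C)
        return context, weight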

3 Original Text Repetition

The Original Text Repetition (OTR) problem means that the sentences in generated summaries are repetitions of article sentences; abstractive summarization thereby degenerates into extractive summarization. The ROUGE metric cannot detect this problem since it only measures the n-gram co-occurrence between generated summaries and gold summaries without taking the article into consideration. The word repetition problem (See et al., 2017) and the lack-of-abstraction problem (Kryscinski et al., 2018) can be seen as an extreme case or an alternative description of OTR. Behind this phenomenon is the degenerate attention distribution learned by the model, which we define as:

• Narrow Word Attention For each summary word, the attention distribution narrows to a single word position in the article.

• Adjacent Sentence Attention For all words in each summary sentence, the positions of their attention peaks are adjacent or semantically adjacent, meaning that the attended article parts have similar features.

Figure 1: Degenerate attention distribution behind the OTR problem. The generated summary repeats the first sentence of the article. We select the first 16 words of the summary and show their attention over the first 50 words of the article.

As shown in Figure 1, the sentence attention degenerates into several adjacent peaks at the repeated article positions. Usually each sentence in a gold summary considers multiple article sentences and condenses them into one, rather than simply copying a single article sentence. The gap between generated summaries (copy) and gold summaries (condense) means that the model has merely learned to find the article sentences that are most similar to the gold summary, not the relation between article facts and summaries. The degenerate attention mechanism misleads the model.

4 Diverse Convolutional Seq2Seq Model

To prevent the Seq2Seq model from attention degeneration, we introduce DPPs as a regularization method into CNN Seq2Seq and propose DivCNN Seq2Seq.

4.1 Quality and Diversity Decomposition of Determinantal Point Processes

DPPs have been widely used in recommender systems, information retrieval and extractive summarization systems. They can generate subsets with both high quality and high diversity (Kulesza and Taskar, 2011).

Given a discrete, finite point process P over a ground set D, if for every subset A ⊆ D and a random subset Y drawn according to P we have

P(A ⊆ Y) = det(K_A)  (6)

where K is a real, symmetric matrix indexed by the elements of D, then P is a determinantal point process and K is the marginal kernel of the DPP. The marginal kernel only gives the marginal probability that a particular item is selected in a particular sampling process, so we use an L-ensemble (Kulesza and Taskar, 2011) to model atomic probabilities for every possible instantiation of Y:

K = L(L + I)^{-1} = I − (L + I)^{-1}  (7)

P_L(Y = Y) ∝ det(L_Y)  (8)

P_L(Y = Y) = det(L_Y) / det(L + I)  (9)

An L-ensemble is also a kind of DPP and can be constructed directly from the quality (q) and similarity (sim) of the point set:

L_{i,j} = q(i) · sim(i, j) · q(j)  (10)

Equation 9 gives the probability that subset Y is chosen, which is in fact a quantitative score of the subset that considers both its quality and its diversity (QD-score). Summarization follows the same principle: a good summary should consider both information significance and redundancy. In extractive summarization, a set of sentences with high score (quality) and diversity is chosen as the summary using a DPP sampling algorithm (Li et al., 2017).
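To make Equations 9 and 10 concrete, the sketch below builds L from per-item qualities and a similarity matrix and scores a candidate subset by the logarithm of Equation 9; the numbers are toy values, not taken from the paper.

    import numpy as np

    def qd_score(quality, sim, subset):
        """log[det(L_Y) / det(L + I)] with L_ij = q(i) * sim(i, j) * q(j) (Eqs. 9-10)."""
        L = np.outer(quality, quality) * sim
        L_Y = L[np.ix_(subset, subset)]
        _, logdet_sub = np.linalg.slogdet(L_Y)
        _, logdet_full = np.linalg.slogdet(L + np.eye(len(quality)))
        return logdet_sub - logdet_full      # higher = better quality and diversity

    # toy example: items 0 and 1 are high-quality but nearly identical, item 2 is distinct
    q = np.array([0.9, 0.8, 0.5])
    sim = np.array([[1.0, 0.95, 0.1],
                    [0.95, 1.0, 0.1],
                    [0.1, 0.1, 1.0]])
    print(qd_score(q, sim, [0, 1]))   # redundant pair -> lower score
    print(qd_score(q, sim, [0, 2]))   # diverse pair   -> higher score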

In Figure 2 we show the difference between quality-only sampling and DPP sampling. We first generate a simulated attention distribution for testing. Then we use word-position distance as the similarity measure and attention as the quality to construct the L matrix (L-ensemble) for the DPP. A point subset is sampled based on quality only (green) or on the DPP (blue), and then a Gaussian mixture distribution is generated around these points to soften and reweight the attention. Both samplings approximate the original attention distribution (orange), but the DPP approximates it better and has more scattered peaks. Sampling that considers only the attention weight (quality) generates fewer peaks, which means that many adjacent points with low diversity are sampled.

In the actual experiments we choose the attention weight as the quality. The model learns an attention distribution to score different parts of the article, and obviously higher attention means higher quality. In the original CNN Seq2Seq, the sum of the encoder output and the encoder input embeddings forms the encoded feature vectors. We follow this setting and use the feature vectors to calculate cosine similarity.


Figure 2: Comparison of different reweighting methods on a simulated distribution. DPP-sampling reweighting approximates the original distribution better since it catches the high-attention area around position 160. It also samples fewer adjacent points around position 110.


Figure 3: Construction of the L matrix.

Specifically, the encoder output consists of tree-like semantic features extracted by the CNN encoder, while the encoder input embeddings provide point information about a specific input element before encoding (Gehring et al., 2017). Hence the feature vectors contain both highly abstract semantic features and specific grammatical features when calculating diversity. Compared to extractive summarization, DPPs in abstractive summarization use the state of the DNN as quality and diversity, which can be optimized dynamically during training.

The computation of the L matrix is shown in Figure 3. For each sample in a batch (128 in our experiments), the encoder input embeddings E ∈ R^{Ts*C} are multiplied by their transpose to produce the similarity matrix S ∈ R^{Ts*Ts}. The weight vectors of the multi-step attention are averaged over decoder layers and summary length, and the same operation is applied to generate the quality matrix Q ∈ R^{Ts*Ts}. We then use the Hadamard product of Q and S as L ∈ R^{Ts*Ts}.
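A sketch of this construction under our reading of the shapes involved (attn gathers the per-layer multi-step attention weights; the released code may organize these tensors differently):

    import torch

    def build_L(feats, attn):
        """feats: (B, Ts, C) encoder feature vectors; attn: (B, layers, Tg, Ts)
        multi-step attention weights. Returns L: (B, Ts, Ts)."""
        S = torch.bmm(feats, feats.transpose(1, 2))    # similarity: S = F F^T
        q = attn.mean(dim=(1, 2))                      # average over layers and summary length
        Q = q.unsqueeze(2) * q.unsqueeze(1)            # quality matrix q q^T
        return Q * S                                   # Hadamard product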

4.2 Macro DPPs

Figure 4: Conditional sampling in Macro DPPs

The idea of Macro DPPs is to pick subsets under some restriction and evaluate the QD-score of each subset using Equation 9. An ideal attention distribution should contain subsets with high QD-scores.

We do not use DPP sampling, since the purpose of Macro DPPs is to evaluate subsets, not to sample subsets with a high QD-score. The attention is distributed over the ground set, so we introduce conditional sampling to sample a subset that has either high quality or high diversity and then improve the other aspect, as follows:

• Improve Diversity in a High-Quality Subset Select the points with high attention weight to construct the subset, require no gradient for the quality matrix, and optimize only diversity.

• Improve Quality in a High-Diversity Subset Sampling a point subset with high diversity is hard to realize, so we simply perform equidistant sampling (equidistant in word positions) to approximate it. Contrary to the previous method, we require no gradient for the similarity matrix and optimize only quality.

We randomly choose one of these conditions in each batch. After the point subset is chosen, the submatrix L_Y can be built by selecting the elements of L indexed by the point subset. We then calculate the QD-score of the submatrix and add it to the model loss as a regularization term. We compute the logarithmic sum of eigenvalues to prevent numerical underflow.



loss_QD = Σ log λ_{L+I} − Σ log λ_{L_Y}  (11)
        ∝ det(L_Y) / det(L + I)  (12)

loss_model = γ · loss_MLE + (1 − γ) · loss_QD  (13)
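A sketch of the Macro DPPs regularization of Equations 11-13, computed from eigenvalues for numerical stability; using the same subset indices for every example in the batch and clamping the eigenvalues are simplifications of ours.

    import torch

    def macro_dpp_loss(L, subset_idx, eps=1e-6):
        """Eq. 11: sum of log-eigenvalues of (L + I) minus that of the submatrix L_Y."""
        B, Ts, _ = L.shape
        I = torch.eye(Ts, device=L.device).expand(B, Ts, Ts)
        L_Y = L[:, subset_idx][:, :, subset_idx]             # (B, t, t) submatrix
        ev_full = torch.linalg.eigvalsh(L + I)                # symmetric -> real eigenvalues
        ev_sub = torch.linalg.eigvalsh(L_Y)
        loss = (torch.log(ev_full.clamp_min(eps)).sum(-1)
                - torch.log(ev_sub.clamp_min(eps)).sum(-1))
        return loss.mean()

    # Eq. 13 (total loss):
    # loss_model = gamma * loss_mle + (1 - gamma) * macro_dpp_loss(L, subset_idx)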

4.3 Micro DPPs

The idea of Micro DPPs is to sample a subset Y with a large QD-score from all article positions and to use these sampled points as adjusted attention focus points. A Gaussian mixture (GM) distribution around these points is then generated as the ideal attention distribution (weight_ideal). The whole process can be seen as a selection and softening of the attention. The Kullback-Leibler divergence between the ideal distribution and the attention distribution (weight_attn) is then added to the loss function as a regularization term.

P = BFGMInference(L, t)  (14)

weight_ideal = GM_{µ∈P}(µ, σ, π)  (15)

loss_KL = KLdiv(weight_ideal, weight_attn)  (16)

loss_model = γ · loss_MLE + (1 − γ) · loss_KL  (17)
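A sketch of Equations 15-17: building the Gaussian-mixture ideal attention around the sampled focus positions and penalizing the KL divergence to the model's attention. The value of σ, the uniform mixture weights and the single (B, Ts) attention distribution are simplifying assumptions of ours.

    import torch

    def ideal_attention(peaks, Ts, sigma=3.0):
        """Gaussian mixture over article positions centered at the sampled peaks (Eq. 15).
        peaks: (B, t) long tensor of positions; returns (B, Ts) normalized weights."""
        pos = torch.arange(Ts, dtype=torch.float32).view(1, 1, Ts)
        mu = peaks.float().unsqueeze(-1)                              # (B, t, 1)
        mixture = torch.exp(-0.5 * ((pos - mu) / sigma) ** 2).sum(1)  # (B, Ts)
        return mixture / mixture.sum(dim=-1, keepdim=True)

    def micro_dpp_loss(weight_attn, peaks, eps=1e-8):
        """KL divergence between the ideal and the learned attention (Eq. 16)."""
        ideal = ideal_attention(peaks, weight_attn.size(-1))
        kl = ideal * (torch.log(ideal + eps) - torch.log(weight_attn + eps))
        return kl.sum(-1).mean()

    # Eq. 17: loss_model = gamma * loss_mle + (1 - gamma) * micro_dpp_loss(weight_attn, peaks)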

The classic sampling algorithm for DPPs (Kulesza and Taskar, 2011) runs slowly when the L matrix is large, and it cannot be computed in batches. In the DivCNN Seq2Seq model we need to construct an L matrix for every sample and every decoder layer, which is extremely large. To optimize the DPP runtime for this large-scale computation, we introduce a batch-computation version of Fast Greedy Maximum A Posteriori Inference (Chen et al., 2018) (BFGMInference) to sample a subset with a high QD-score.

BFGMInference uses a greedy method to approximate the MAP result Y_map = argmax_{Y⊆D} det(L_Y): at each step we select the item j that yields the maximum improvement in QD-score and add it to Y.

f(Y) = log det(L_Y)  (18)

j = argmax_{i ∈ D\Y} [f(Y ∪ {i}) − f(Y)]  (19)

Algorithm 1 BFGMInference

Input: matrix L ∈ R^{B*Ts*Ts}, size of sampled subset t
Output: sampled subset Y ∈ R^{B*t}

1: Initialize D_i = L_ii; mask = 1^{B*Ts}; J = argmax(log(D * mask)); C ∈ 0^{B*Ts*1}
2: mask_{j∈J} = 0
3: count = 1
4: while count < t do
5:     candidate = {i | mask_i = 1}
6:     c_temp = 0^{B*Ts*1}, d_temp = 0^{B*Ts}
7:     for idx = 0; idx < Ts − count; idx++ do
8:         i = candidate[:, idx], j = J
9:         e_i = (L_{j,i} − ⟨c_j, c_i⟩) / d_j
10:        c_temp_i = e_i, d_temp_i = e_i^2
11:    end for
12:    C = [C, c_temp], D = D − d_temp
13:    J = argmax(log(D * mask))
14:    mask_{j∈J} = 0
15:    count = count + 1
16: end while
17: Y = {i | mask_i = 0}
18: return Y

By using Cholesky decomposition we have:

L_Y = V V^T  (20)

L_{Y∪{i}} = [ V  0 ; c_i  d_i ] [ V  0 ; c_i  d_i ]^T  (21)

V c_i^T = L_{Y,i}  (22)

d_i^2 = L_{ii} − ||c_i||_2^2  (23)

Then we can transform equation 19 into:

j = argmax_{i ∈ D\Y} log(d_i^2)  (24)

The vectors c and the values d can be updated incrementally according to Equations 22 and 23. The complete algorithm is described in Algorithm 1. The BFGMInference algorithm gains significant speed improvements when the L matrix is large, as shown in Figure 5.
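For reference, an unbatched NumPy sketch of this greedy MAP procedure with the incremental Cholesky update of Equations 20-24; the actual BFGMInference additionally vectorizes the inner update over the batch dimension.

    import numpy as np

    def greedy_map_inference(L, t):
        """Greedily pick t indices approximately maximizing det(L_Y) (Eqs. 18-24),
        using the incremental Cholesky update of Eqs. 20-23 (single sample, no batching)."""
        n = L.shape[0]
        C = np.zeros((t, n))                       # incremental Cholesky rows c_i
        d2 = np.array(np.diag(L), dtype=float)     # d_i^2, initialised to L_ii
        selected = [int(np.argmax(d2))]
        while len(selected) < t:
            j = selected[-1]                       # most recently selected item
            m = len(selected) - 1
            # e_i = (L_{j,i} - <c_j, c_i>) / d_j, then d_i^2 <- d_i^2 - e_i^2
            e = (L[j, :] - C[:m, :].T @ C[:m, j]) / np.sqrt(d2[j])
            C[m, :] = e
            d2 = d2 - e ** 2
            d2[selected] = -np.inf                 # restrict the argmax to D \ Y (Eq. 24)
            selected.append(int(np.argmax(d2)))
        return selected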

5 Experimental Setup

Datasets We test the DivCNN Seq2Seq model on the widely used CNN-DM dataset (Hermann et al., 2015) and give a detailed analysis of diversity and quality. We also try our model on five other abstractive summarization datasets: the NEWSROOM corpus (Grusky et al., 2018), TLDR (Volske et al., 2017), BIGPATENT (Sharma et al., 2019), WIKIHOW (Koupaee and Wang, 2018) and REDDIT (Kim et al., 2018). For the CNN-DM corpus we truncate articles to 600 words and summaries to 70 words. For the other corpora we only keep articles and summaries whose lengths are close to the corpus average. In particular, we use only the TIFU-long version of REDDIT and the non-anonymized version of the CNN-DM dataset. If a raw dataset is not already divided into train/dev/test, we divide the shuffled dataset manually. Details of all six datasets are shown in Table 2.

Dataset | # Docs (train/valid/test) | Type | Avg. Document Words | Avg. Summary Words
CNN-DM | 287227/13368/11490 | News | 789/777/768 | 55/59/62
NEWSROOM | 995041/108862/108837 | News | 659/654/652 | 26/26/26
REDDIT | 34000/4000/4000 | Social Media | 418/445/451 | 20/23/25
BIGPATENT | 1207222/67068/67072 | Documentation | 699/699/699 | 116/116/116
TLDR | 960000/20000/20000 | Social Media | 197/195/204 | 19/19/19
WIKIHOW | 180000/10000/20000 | Knowledge Base | 475/488/418 | 62/60/74

Table 2: Dataset overview (train/valid/test).

Figure 5: Speed comparison of classical DPP sampling (blue), FGMInference (red) and BFGMInference (gray) with a batch size of 100.

Hyperparameters and Optimization All the CNN models use a 50,000-word article dictionary and a 20,000-word summary dictionary with byte pair encoding (BPE) (Sennrich et al., 2015). Word embeddings are pretrained on the training corpus using Fasttext (Bojanowski et al., 2017; Joulin et al., 2016). We do not train models with large parameter counts to increase ROUGE results, since what we try to improve is the comprehensiveness of each sentence in the summary. The whole CNN Seq2Seq model has about 38 million parameters, and DivCNN Seq2Seq does not change the parameter count. All models set the embedding dimensionality and the CNN channels to 256. The encoder has 20 blocks with kernel size 5 and the decoder has 4 blocks with kernel size 3. This scale of model parameters is enough for the model to generate fluent summaries. γ is 0.6 for Macro DPPs and 0.7 for Micro DPPs. In Macro DPPs we choose the top 30 points when optimizing diversity and a stride of 20 for equidistant sampling. In Micro DPPs, for each summary we sample 20 points to generate the Gaussian mixture distributions. We train the model with Nesterov's accelerated gradient method using a momentum of 0.99 and renormalize gradients when the norm exceeds 0.1 (Sutskever et al., 2013). The beam search size is 5 and we apply a dropout of 0.1 to the embeddings and linear transform layers. We did not fix the number of training epochs; the model was trained until the average epoch loss could not be lowered any further. The DPP regularization is active only during training and does not bring extra parameters into the model. At test time the model has already learned proper attention, so generation is the same as for the vanilla CNN Seq2Seq model.

Article-Related Metrics It is hard to evaluate a summary since summarization itself is very subjective. ROUGE compares generated summaries with gold summaries by checking the co-occurrence of n-grams, which yields a very limited word-level evaluation. We set three article-related metrics to evaluate the comprehensiveness of summaries (a small sketch of how they can be computed follows the list):

• Jaccard Similarity Upper Bound (JS) For each summary sentence, we compute its Jaccard similarity with every article sentence. The largest Jaccard similarity for each summary sentence is selected as JS. It measures the extent to which summaries copy the article. The worst value is 1.

• Sentence Coverage (SC) We define the article sentences whose Jaccard similarity is higher than the gold summary's JS value as covered sentences. The average count of covered article sentences per summary sentence is then a ratio that measures how much of the article the summary covers. The worst value is less than or equal to 1.

• Novel Bigram Proportion (NOVEL) The percentage of bigrams in the summary that do not appear in the article. It reflects the abstractiveness of the summary. The worst value is 0.


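A minimal sketch of these three metrics on tokenized text, following our reading of the definitions above rather than the authors' evaluation script (the per-sentence values are averaged here; helper names are ours):

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / max(len(a | b), 1)

    def js_upper_bound(summary_sents, article_sents):
        """JS: best Jaccard match in the article for each summary sentence, averaged."""
        best = [max(jaccard(s, a) for a in article_sents) for s in summary_sents]
        return sum(best) / len(best)

    def sentence_coverage(summary_sents, article_sents, gold_js):
        """SC: average number of article sentences whose Jaccard similarity with a
        summary sentence exceeds the gold summary's JS value."""
        counts = [sum(jaccard(s, a) > gold_js for a in article_sents) for s in summary_sents]
        return sum(counts) / len(counts)

    def novel_bigram_proportion(summary_tokens, article_tokens):
        """NOVEL: share of summary bigrams that never appear in the article."""
        bigrams = lambda toks: set(zip(toks, toks[1:]))
        summ, art = bigrams(summary_tokens), bigrams(article_tokens)
        return len(summ - art) / max(len(summ), 1)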

Strong Baselines We chose four strong baselines that reported high ROUGE scores. BottomUp (Gehrmann et al., 2018), sumGAN (Liu et al., 2018) and RL Rerank (Chen and Bansal, 2018) are complicated systems with additional modules or post-processing that partially relieve the OTR problem. The Pointer Generator (See et al., 2017) reaches the best ROUGE results among single end-to-end models but suffers greatly from the repetition problem.

Various Possible Causes of the OTR Problem We considered several other possible reasons for the repetition problem besides attention degeneration, including overfitting, improper use of a translation-style attention mechanism, lack of decoding ability, and a high-variance attention distribution. Accordingly, we designed comparative experiments as follows:

• Overfitting We set the dropout ratio to 0.1 (SMALL) and 0.5 (LARGE) to test for overfitting.

• Direct Attention Remove the encoder input embeddings from the attention values so that the decoder looks at the highly abstract features directly (DIRECT).

• Lack of Decoding Ability We double (DOUBLE) or halve (HALF) the vanilla (VANILLA) number of decoder layers to adjust the decoding ability.

• High-Variance Attention We scale down the attention distribution manually during training (SCALE), lowering the variance of the distribution.

6 Results

Various Possibilities As shown in Table 4, scaled attention has the lowest Jaccard similarity upper bound, which confirms our idea that over-concentrated attention makes the model copy article sentences. As for sentence coverage, the small decoder with a large dropout ratio performs best, suggesting that large and overfitted models may have degenerate attention. Although scaled attention has the best JS score, its SC score is the worst (an SC below 1.0 means duplicate sentences are generated). So we may conclude that directly scaling down attention destroys the value of attention. The ideal attention is not about erasing the peaks or the variance of attention but about having multiple peaks in the sentence attention with high diversity at the same time; neither aggregative nor scattered attention distributions benefit summary generation. The direct attention model has the maximum NOVEL score, which means that point information about a specific input element makes the model prefer copying article words instead of generating new words.

CNN-DM Results With large model parameters and dictionaries, the four models among the strong baselines reach nearly 40 points in ROUGE-1, but they perform poorly on the article-related metrics. Single end-to-end systems like the Pointer Generator perform poorly on the JS value and the NOVEL proportion, which means most of their summaries are copied from the articles. As for the three models with multiple modules or post-processing, the BottomUp model has a relatively good Jaccard similarity upper bound and the best ROUGE result, but its article-related metrics are still far from the gold-summary level. The RL Rerank model has a better JS score and sumGAN a better NOVEL score, but none of these models reaches a balanced, good performance on all three article-related metrics. Compared to vanilla CNN Seq2Seq, DivCNN Seq2Seq improves the JS and NOVEL scores and raises the ROUGE score at the same time, showing that a proper attention distribution can help the model reach a better local optimum. Compared with the strong baselines, DivCNN Seq2Seq achieves the best NOVEL score and the second- and third-best JS and SC, respectively. Empirically, we suggest that γ for both Micro and Macro DPPs should be set so that the average loss changes by less than 10% compared to the vanilla models. We also observed that Micro DPPs is more sensitive to γ than Macro DPPs and converges more easily, but it may degenerate to vanilla CNN Seq2Seq. Macro DPPs usually reaches better results, but it needs more training time, since the eigenvalue calculation is expensive and cannot be accelerated with fp16 tensor computation.

Model | JS | SC | NOVEL | R1 | R2 | RL
gold | 0.326 | – | 0.575 | – | – | –
sumGAN (Liu et al., 2018) | 0.709 | 1.136 | 0.118 | 39.92 | 17.65 | 27.25
BottomUp (Gehrmann et al., 2018) | 0.541 | 1.015 | 0.098 | 41.53 | 18.76 | 27.92
RL Rerank (Chen and Bansal, 2018) | 0.585 | 1.181 | 0.105 | 39.38 | 16.03 | 24.95
Pointer Generator (See et al., 2017) | 0.774 | 1.317 | 0.079 | 39.53 | 17.28 | 26.89
CNN Seq2Seq Vanilla | 0.616 | 1.137 | 0.167 | 30.4 | 11.7 | 23.09
DivCNN Seq2Seq with Micro DPPs | 0.568 | 1.214 | 0.183 | 30.61 | 11.82 | 23.19
DivCNN Seq2Seq with Macro DPPs | 0.587 | 1.265 | 0.177 | 32.28 | 12.75 | 24.32
Extract with Attention | – | – | – | 35.48 | 13.67 | 22.85
Extract with DPPs Diversified Attention | – | – | – | 35.35 | 13.69 | 23.07

Table 3: Results on the CNN-DM dataset.

Model | JS | SC | NOVEL
SCALE | 0.382 | 0.79 | 0.199
DIRECT | 0.567 | 1.14 | 0.207
LARGE DOUBLE | 0.591 | 1.201 | 0.192
LARGE VANILLA | 0.639 | 1.261 | 0.162
LARGE HALF | 0.639 | 1.281 | 0.153
SMALL DOUBLE | 0.631 | 1.259 | 0.167
SMALL VANILLA | 0.616 | 1.137 | 0.167
SMALL HALF | 0.625 | 1.29 | 0.161

Table 4: Exploring various possible causes of OTR.

Novel Bigrams NOVEL is a tricky metric that is used in much research on abstractive summarization. There are several possible explanations for a high NOVEL score: first, the summary gets a high novel-bigram ratio simply because it contains many novel unigrams, which may be good or bad; second, the model may be underfitting and unable to generate fluent sentences; third, the generated summary uses novel bigrams to condense the original text into readable sentences, which is the best case. From Table 3 we can see that BottomUp has the best ROUGE and JS results but the worst NOVEL score, while our DivCNN models show exactly the opposite pattern. These three metrics should not be in conflict, since gold summaries reach a high NOVEL and a low JS at the same time. Based on these facts we make the following conjectures:

• The good summaries the model learns have styles that differ from human-written summaries. The model tends to copy bigrams from the original article and reorganize them into short summary sentences, while humans tend to use brand-new bigrams to paraphrase the facts contained in the original article. The model works in a rewrite (compress and extract) style while human writing is an overwrite style.

• Although we use human-written summaries as gold summaries for the model to learn from, and the MLE loss steadily decreases during training, the model learns summaries with a different style. This implies that the model may not have an "NLU+NLG" process like humans do but is instead restrained in a sentence-level rewrite framework. For BottomUp and RL Rerank this is not a problem, because these two systems are designed to rewrite: they only send parts of the article into the Seq2Seq model. Such a design can achieve high ROUGE scores, but it is not the way humans write gold summaries.

Figure 6: Actual attention distributions learned by the vanilla model and the DPP models.

Extractive Summarization Based on Learned Attention We also extract article sentences based on the sentence attention learned by the DPP models to generate summaries. The attention of a sentence is the sum of the attention weights of the words in the sentence. Table 3 shows that extractive summarization reaches better ROUGE values, implying that both the vanilla and the DivCNN models learned appropriate sentence attention. Extractive summarization uses accumulated sentence attention instead of the specific distribution, so the results of the vanilla models are almost the same as those of DivCNN.
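A small sketch of this sentence-scoring step (sentence_spans marking token ranges is our assumption about the input format):

    def sentence_attention(word_weights, sentence_spans):
        """Score each article sentence by summing the attention weights of its words.
        word_weights: per-word attention values; sentence_spans: (start, end) token ranges."""
        return [sum(word_weights[s:e]) for s, e in sentence_spans]

    # extractive summary: take the top-k article sentences by this score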

Sample Visualization We randomly choose one sample from the test set of the CNN-DM corpus to visualize and analyze. As shown in Table 1, we highlighted the attentive parts of the article for the different models. The vanilla model generates only one sentence, which focuses on a single part of the article. The Micro DPPs model generates two sentences that consider three parts of the article. Macro DPPs considers the article spans that both the vanilla model and the Micro DPPs model paid attention to. We also checked the attention distribution of this sample. As shown in Figure 6, the vanilla model (red) learned only a few peaks over article positions 70 to 90, which suggests that it focuses on one sentence and repeats it in the summary. The attention learned by the Micro DPPs model (green) still narrows to several peaks but explores more positions than the vanilla model. Macro DPPs (blue) has a more natural loss design and optimizes quality and diversity directly, so it has a more scattered attention distribution.

Dataset | Baseline system | Baseline | Micro DPPs | Macro DPPs
NEWSROOM | LEAD-3 Baseline | 30.63/21.41/28.57 | 38.33/25.01/35.3 | 40.10/26.71/37.24
TLDR | - | - | 65.94/56.97/64.65 | 66.93/57.84/65.71
BIGPATENT | (Chen and Bansal, 2018) | 37.12/11.87/32.45 | 33.21/10.47/24.86 | 34.55/11.65/25.96
WIKIHOW | (See et al., 2017) | 28.53/9.23/26.54 | 24.52/6.49/20.56 | 27.58/8.01/22.61
REDDIT | (Kim et al., 2018) | 19/3.7/15.1 | 21.39/4.24/17.11 | 21.57/4.48/17.29

Table 5: ROUGE F1 results (R1/R2/RL) on different datasets.

More Datasets We test our model on five other newly released abstractive summarization datasets, which have various compression ratios, different professional fields and more flexible human-written summaries. Only ROUGE results are collected, since no baseline-generated summaries are available for us to compute the article-related metrics. Table 5 shows that DivCNN performs better than the best baselines on NEWSROOM and REDDIT and reaches remarkable ROUGE scores of more than 60 (but no baseline is reported in the dataset paper, so that result is not comparable). The compression ratio and article length have little impact on the performance of DivCNN. The results show that DivCNN prefers short summaries.

Attention & Representation Degeneration In order to solve attention degeneration, we introduce DPPs to improve the diversity of the features to which the model pays high attention. This solution is consistent with the Representation Degeneration Problem in NLG (Gao et al., 2018). As shown in Figure 7, Macro DPPs yields more diverse embedding representations than the vanilla model. Gao et al. (2018) directly add a diversity regularization loss to increase the representation power of word embeddings, while we aim at generating an attention distribution that considers both quality and diversity, which also results in learning word embeddings with rich representation power.

Figure 7: Representation Degeneration Problem in NLG. We use t-SNE (Maaten and Hinton, 2008) to reduce the dimensionality of the word embeddings learned by the model.

7 Conclusions and Future Works

We have defined the OTR problem, which leads to incomplete summaries, and revealed the cause behind it, namely attention degeneration. We also introduced three article-related metrics to evaluate this problem. DPPs are applied directly to attention generation, and we propose the Macro and Micro DPPs versions of the DivCNN Seq2Seq model to adjust attention considering both quality and diversity. Results on CNN-DM and five other open datasets show that DivCNN Seq2Seq can improve the comprehensiveness of summaries.

Due to hardware limitations we only trained a small-parameter version of DivCNN. We also lose some precision when approximating the L matrix and accelerating the sampling. These drawbacks lead to limited performance improvements. In the future we hope to explore the following directions: quantifiable and controllable quality/diversity in DPPs; better approximations in conditional sampling, such as dynamic adjustment of the sampling stride; and applying DPP-optimized attention to a student model to improve generation.


References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Laming Chen, Guoxin Zhang, and Eric Zhou. 2018. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. In Advances in Neural Information Processing Systems, pages 5622–5633.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for document summarization. arXiv preprint arXiv:1610.08462.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 933–941. JMLR.org.

Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tieyan Liu. 2018. Representation degeneration problem in training natural language generation models.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1243–1252. JMLR.org.

Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve Jegou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Byeongchang Kim, Hyunwoo Kim, and Gunhee Kim. 2018. Abstractive summarization of Reddit posts with multi-level memory networks. arXiv preprint arXiv:1811.00783.

Mahnaz Koupaee and William Yang Wang. 2018. WikiHow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305.

Wojciech Kryscinski, Romain Paulus, Caiming Xiong, and Richard Socher. 2018. Improving abstraction in text summarization. arXiv preprint arXiv:1808.07913.

Alex Kulesza and Ben Taskar. 2011. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1193–1200.

Lei Li, Yazhao Zhang, Junqi Chi, and Zuying Huang. 2017. UIDS: A multilingual document summarization framework based on summary diversity and hierarchical topics. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, pages 343–354. Springer.

Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, and Hongyan Li. 2018. Generative adversarial network for abstractive text summarization. In Thirty-Second AAAI Conference on Artificial Intelligence.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. 2014. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212.


Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. arXiv preprint arXiv:1906.03741.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Michael Volske, Martin Potthast, Shahbaz Syed, and Benno Stein. 2017. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63.


