
Research Article

Deep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation Measures, and Challenges

Hindawi, Mathematical Problems in Engineering, Volume 2020, Article ID 9365340, 29 pages. https://doi.org/10.1155/2020/9365340

Dima Suleiman and Arafat Awajan

Princess Sumaya University for Technology, Amman, Jordan

Correspondence should be addressed to Dima Suleiman; [email protected]

Received 24 April 2020; Revised 1 July 2020; Accepted 25 July 2020; Published 24 August 2020

Academic Editor: Dimitris Mourtzis

Copyright © 2020 Dima Suleiman and Arafat Awajan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In recent years, the volume of textual data has rapidly increased, which has generated a valuable resource for extracting and analysing information. To retrieve useful knowledge within a reasonable time period, this information must be summarised. This paper reviews recent approaches for abstractive text summarisation using deep learning models. In addition, existing datasets for training and validating these approaches are reviewed, and their features and limitations are presented. The Gigaword dataset is commonly employed for single-sentence summary approaches, while the Cable News Network (CNN)/Daily Mail dataset is commonly employed for multisentence summary approaches. Furthermore, the measures that are utilised to evaluate the quality of summarisation are investigated, and Recall-Oriented Understudy for Gisting Evaluation 1 (ROUGE1), ROUGE2, and ROUGE-L are determined to be the most commonly applied metrics. The challenges that are encountered during the summarisation process and the solutions proposed in each approach are analysed. The analysis of the several approaches shows that recurrent neural networks with an attention mechanism and long short-term memory (LSTM) are the most prevalent techniques for abstractive text summarisation. The experimental results show that text summarisation with a pretrained encoder model achieved the highest values for ROUGE1, ROUGE2, and ROUGE-L (43.85, 20.34, and 39.9, respectively). Furthermore, it was determined that most abstractive text summarisation models faced challenges such as the unavailability of a golden token at testing time, out-of-vocabulary (OOV) words, summary sentence repetition, inaccurate sentences, and fake facts.

1. Introduction

Currently, there are vast quantities of textual data available, including online documents, articles, news, and reviews that contain long strings of text that need to be summarised [1]. The importance of text summarisation is due to several reasons, including the retrieval of significant information from a long text within a short period, easy and rapid loading of the most important information, and resolution of the problems associated with the criteria needed for summary evaluation [2]. Due to the evolution and growth of automatic text summarisation methods, which have provided significant results in many languages, these methods need to be reviewed and summarised. Therefore, in this review, we surveyed the most recent methods and focused on the techniques, datasets, evaluation measures, and challenges of each approach, in addition to the manner in which each method addressed these challenges.

Applications such as search engines and news websites use text summarisation [1]. In search engines, previews are produced as snippets, and news websites generate headlines to describe the news to facilitate knowledge retrieval [3, 4]. Text summarisation can be divided into several categories based on function, genre, summary context, type of summarizer, and number of documents [5]; one specific text summarisation classification approach divides the summarisation process into extractive and abstractive categories [6].

Extractive summarisation extracts or copies some parts from the original text based on scores computed using either statistical or linguistic features, while abstractive summarisation rephrases the original text to generate new phrases that may not be in the original text, which is considered a difficult task for a computer. As abstractive text summarisation requires an understanding of the document to generate the summary, advanced machine learning techniques and extensive natural language processing (NLP) are required. Thus, abstractive summarisation is harder than extractive summarisation, since abstractive summarisation requires real-world knowledge and semantic class analysis [7]. However, abstractive summarisation is also better than extractive summarisation, since the summary is an approximate representation of a human-generated summary, which makes it more meaningful [8]. For both types, an acceptable summarisation should have the following: sentences that maintain the order of the main ideas and concepts presented in the original text, minimal to no repetition, sentences that are consistent and coherent, and the ability to remember the meaning of the text even for long sentences [7]. In addition, the generated summary must be compact while conveying important information about the original text [2, 9].

Abstractive text summarisation approaches include structured and semantic-based approaches. Structured approaches encode the crucial features of documents using several types of schemas, including tree, ontology, lead and body phrases, and template and rule-based schemas, while semantic-based approaches are more concerned with the semantics of the text and thus rely on the information representation of the document to summarise the text. Semantic-based approaches include the multimodal semantic method, information item method, and semantic graph-based method [10–17].

Deep learning techniques were employed in abstractive text summarisation for the first time in 2015 [18], and the proposed model was based on the encoder-decoder architecture. For these applications, deep learning techniques have provided excellent results and have been extensively employed in recent years.

Raphal et al. surveyed several abstractive text summarisation processes in general [19]. Their study differentiated between different model architectures, such as reinforcement learning (RL), supervised learning, and attention mechanisms. In addition, comparisons in terms of word embedding, data processing, training, and validation were performed. However, the quality of the summaries generated by the different models was not compared.

Furthermore, both extractive and abstractive summarisation models were summarised in [20, 21]. In [20], the classification of summarisation tasks was based on three factors: input factors, purpose factors, and output factors. Dong and Mahajani et al. surveyed only five abstractive summarisation models each. On the other hand, Mahajani et al. focused on the datasets and training techniques, in addition to the architecture, of several abstractive summarisation models [21]. However, the quality of the summaries generated by the different techniques and the evaluation measures were not discussed.

Shi et al. presented a comprehensive survey of several abstractive text summarisation models that are based on the sequence-to-sequence encoder-decoder architecture, covering both convolutional and RNN seq2seq models. The focus was the structure of the network, the training strategy, and the algorithms employed to generate the summary [22]. Although several papers have analysed abstractive summarisation models, few papers have performed a comprehensive study [23]. Moreover, most of the previous surveys covered the techniques only until 2018, even the surveys published in 2019 and 2020, such as [20, 21]. In this review, we addressed most of the recent deep learning-based RNN abstractive text summarisation models. Furthermore, this survey is the first to address recent techniques applied in abstractive summarisation, such as the Transformer.

This paper provides an overview of the approaches, datasets, evaluation measures, and challenges of deep learning-based abstractive text summarisation, and each topic is discussed and analysed. We classified the approaches based on the output type into single-sentence summary and multisentence summary approaches. Also, within each classification, we compared the approaches in terms of architecture, dataset, dataset preprocessing, evaluation, and results. The remainder of this paper is organised as follows: Section 2 introduces a background of several deep learning models and techniques, such as the recurrent neural network (RNN), bidirectional RNN, attention mechanisms, long short-term memory (LSTM), gated recurrent units (GRU), and sequence-to-sequence models. Section 3 describes the most recent single-sentence summarisation approaches, while the multisentence summarisation approaches are covered in Section 4. Sections 5 and 6 investigate datasets and evaluation measures, respectively. Section 7 discusses the challenges of the summarisation process and solutions to these challenges. Conclusions and discussion are provided in Section 8.

2. Background

Deep learning analyses complex problems to facilitate the decision-making process. Deep learning attempts to imitate what the human brain can achieve by extracting features at different levels of abstraction. Typically, higher-level layers have fewer details than lower-level layers [24]. The output layer produces an output by nonlinearly transforming the input received from the input layer. The hierarchical structure of deep learning can support learning. The level of abstraction of a certain layer determines the level of abstraction of the next layer, since the output of one layer is the input of the next layer. In addition, the number of layers determines the deepness, which affects the level of learning [25].

Deep learning is applied in several NLP tasks since it facilitates the learning of multilevel hierarchical representations of data using several data processing layers of nonlinear units [24, 26–28]. Various deep learning models have been employed for abstractive summarisation, including RNNs, convolutional neural networks (CNNs), and sequence-to-sequence models. We cover these deep learning models in more detail in this section.

2.1. RNN Encoder-Decoder Summarization. The RNN encoder-decoder architecture is based on the sequence-to-sequence model. The sequence-to-sequence model maps the input sequence in the neural network to a similar sequence that consists of characters, words, or phrases. This model is utilised in several NLP applications, such as machine translation and text summarisation. In text summarisation, the input sequence is the document that needs to be summarised, and the output is the summary [29, 30], as shown in Figure 1.

An RNN is a deep learning model that is applied to process data in sequential order, such that the input of a certain state depends on the output of the previous state [31, 32]. For example, in a sentence, the meaning of a word is closely related to the meaning of the previous words. An RNN consists of a set of hidden states that are learned by the neural network. An RNN may consist of several layers of hidden states, where states and layers learn different features. The last state of each layer represents the whole input of the layer, since it accumulates the values of all previous states [5]. For example, the first layer and its states can be employed for part-of-speech tagging, while the second layer learns to create phrases. In text summarisation, the input for the RNN is the embedding of words, phrases, or sentences, and the output is the word embedding of the summary [5].

In the RNN encoder-decoder model, at the encoder side, at a certain hidden state, the vector representation of the current input word and the output of the hidden states of all previous words are combined and fed to the next hidden state. As shown in Figure 1, the vector representation of the word W3 and the output of the hidden states he1 and he2 are combined and fed as input to the hidden state he3. After feeding all the words of the input string, the output generated from the last hidden state of the encoder is fed to the decoder as a vector referred to as the context vector [29]. In addition to the context vector, which is fed to the first hidden state of the decoder, the start-of-sequence symbol ⟨SOS⟩ is fed to generate the first word of the summary from the headline (assume W5, as shown in Figure 1). In this case, W5 is fed as the input to the next decoder hidden state. Each generated word is passed as an input to the next decoder hidden state to generate the next word of the summary. The last generated word is the end-of-sequence symbol ⟨EOS⟩. Before generating the summary, each output from the decoder takes the form of a distributed representation before it is sent to the softmax layer and attention mechanism to generate the next summary word [29].
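To make the encoder-decoder flow above concrete, the following is a minimal NumPy sketch of a sequence-to-sequence model with a single recurrent layer: the encoder folds the embedded input words into a context vector, and the decoder unrolls from ⟨SOS⟩ until ⟨EOS⟩. The toy vocabulary, dimensions, and `step` function are illustrative assumptions, not the architecture of any specific paper surveyed here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<SOS>", "<EOS>", "the", "cat", "sat", "cats", "sit"]  # toy vocabulary (assumption)
word2id = {w: i for i, w in enumerate(vocab)}
V, E, H = len(vocab), 8, 16          # vocab size, embedding size, hidden size

emb = rng.normal(0, 0.1, (V, E))     # word embeddings
W_xh = rng.normal(0, 0.1, (E, H))    # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (H, H))    # hidden-to-hidden weights
W_hy = rng.normal(0, 0.1, (H, V))    # hidden-to-vocabulary projection

def step(x_id, h):
    """One recurrent step: combine the current word embedding with the previous hidden state."""
    return np.tanh(emb[x_id] @ W_xh + h @ W_hh)

def encode(src_ids):
    """Fold the whole input sequence into a context vector (the last hidden state)."""
    h = np.zeros(H)
    for x in src_ids:
        h = step(x, h)
    return h

def greedy_decode(context, max_len=10):
    """Start from <SOS>, feed each generated word back in, and stop at <EOS>."""
    h, y = context, word2id["<SOS>"]
    out = []
    for _ in range(max_len):
        h = step(y, h)
        y = int(np.argmax(h @ W_hy))          # argmax of logits == argmax of softmax
        if y == word2id["<EOS>"]:
            break
        out.append(vocab[y])
    return out

src = [word2id[w] for w in ["the", "cat", "sat"]]
print(greedy_decode(encode(src)))             # untrained weights, so the output is arbitrary
```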

2.2. Bidirectional RNN. A bidirectional RNN consists of forward RNNs and backward RNNs. Forward RNNs generate a sequence of hidden states after reading the input sequence from left to right. On the other hand, the backward RNNs generate a sequence of hidden states after reading the input sequence from right to left. The representation of the input sequence is the concatenation of the forward and backward RNNs [33]. Therefore, the representation of each word depends on the representation of the preceding (past) and following (future) words. In this case, the context will contain the words to the left and the words to the right of the current word [34].

Using a bidirectional RNN enhances performance. For example, suppose we have the input text "Sara ate a delicious pizza at dinner tonight" and we want to predict the representation of the word "dinner" using a bidirectional RNN: the forward LSTM represents "Sara ate a delicious pizza at", while the backward LSTM represents "tonight". Considering the word "tonight" when representing the word "dinner" provides better results.

On the other hand, using the bidirectional RNN at the decoder side minimizes the probability of a wrong prediction. The reason for this is that the unidirectional RNN only considers the previous prediction and reasons only about the past. Therefore, if there is an error in a previous prediction, the error will accumulate in all subsequent predictions; this problem can be addressed using the bidirectional RNN [35].
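A minimal sketch of the forward/backward concatenation described above, using a plain tanh RNN in place of an LSTM; sequence length, sizes, and the random inputs are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
T, E, H = 6, 8, 16                     # sequence length, embedding size, hidden size (assumptions)
x = rng.normal(size=(T, E))            # embedded input sequence, e.g. "Sara ate a ... tonight"

def run_rnn(inputs, W_xh, W_hh):
    """Simple tanh RNN over a sequence; returns the hidden state at every position."""
    h = np.zeros(H)
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        states.append(h)
    return np.stack(states)

# Separate parameters for the forward and backward directions.
fw = run_rnn(x, rng.normal(0, 0.1, (E, H)), rng.normal(0, 0.1, (H, H)))
bw = run_rnn(x[::-1], rng.normal(0, 0.1, (E, H)), rng.normal(0, 0.1, (H, H)))[::-1]

# Each word is represented by its past (forward) and future (backward) context.
bidir = np.concatenate([fw, bw], axis=-1)   # shape (T, 2H)
print(bidir.shape)
```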

2.3. Gated Recurrent Neural Networks (LSTM and GRU). Gated RNNs are employed to solve the problem of vanishing gradients, which occurs when training a long sequence using an RNN. This problem can be solved by allowing the gradients to backpropagate along a linear path using gates, where each gate has a weight and a bias. Gates can control and modify the amount of information that flows between hidden states. During training, the weights and biases of the gates are updated. The most popular gated RNNs are LSTM [36] and GRU [37], which are two variants of an RNN.

2.3.1. Long Short-Term Memory (LSTM). The repeating unit of the LSTM architecture consists of input (read), memory (update), forget, and output gates [5, 7], but the chaining structure is the same as that of an RNN. The four gates share information with each other; thus, information can flow in loops for a long period of time. The four gates of each LSTM unit, which are shown in Figures 2 and 3, are discussed here.

(1) Input Gate. In the first timestep, the input is a vector that is initialised randomly, while in subsequent steps, the input of the current step is the output (content of the memory cell) of the previous step. In all cases, the input is subject to element-wise multiplication with the output of the forget gate. The multiplication result is added to the current memory gate output.

(2) Forget Gate. A forget gate is a neural network with one layer and a sigmoid activation function. The value of the sigmoid function determines whether the information of the previous state should be forgotten or remembered. If the sigmoid value is 1, then the previous state is remembered, but if the sigmoid value is 0, then the previous state is forgotten. In language modelling, for example, the forget gate remembers the gender of the subject to produce the proper pronouns until it finds a new subject. There are four inputs to the forget gate: the output of the previous block, the input vector, the remembered information from the previous block, and the bias.

(3) Memory Gate. The memory gate controls the effect of the remembered information on the new information. The memory gate consists of two neural networks. The first network has the same structure as the forget gate but a different bias, and the second neural network has a tanh activation function and is utilised to generate the new information. The new information is formed by adding the old information to the result of the element-wise multiplication of the outputs of the two memory gate neural networks.

Figure 2: LSTM unit architecture [5].

Figure 1: Sequence-to-sequence: the last hidden state of the encoder is fed as input to the decoder with the symbol ⟨EOS⟩ [51].

(4) Output Gate. The output gate controls the amount of new information that is forwarded to the next LSTM unit. The output gate is a neural network with a sigmoid activation function that takes the input vector, the previous hidden state, the new information, and the bias as input. The output of the sigmoid function is multiplied by the tanh of the new information to produce the output of the current block.
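For reference, the gate interactions described in (1)-(4) correspond to the standard LSTM update equations, written here with σ as the sigmoid function, ⊙ as element-wise multiplication, and [h_{t-1}, x_t] as the concatenation of the previous hidden state and the current input:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{input gate} \\
\tilde{C}_t &= \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right) && \text{candidate (memory) content} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{new cell state} \\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{output of the current block}
\end{aligned}
```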

2.3.2. Gated Recurrent Unit (GRU). A GRU is a simplified LSTM with two gates, a reset gate and an update gate, and there is no explicit memory. The previous hidden state information is forgotten when all the reset gate elements approach zero; then, only the input vector affects the candidate hidden state. In this case, the update gate acts as a forget gate. LSTM and GRU are commonly employed for abstractive summarisation: LSTM has a memory unit that provides extra control, whereas the computation time of the GRU is reduced [38]. In addition, while it is easier to tune the parameters with LSTM, the GRU takes less time to train [30].
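For comparison with the LSTM equations above, the standard GRU update uses only an update gate z_t and a reset gate r_t, with no separate memory cell; when the reset gate elements approach zero, the previous hidden state drops out of the candidate, as described in the text:

```latex
\begin{aligned}
z_t &= \sigma\!\left(W_z [h_{t-1}, x_t]\right) && \text{update gate} \\
r_t &= \sigma\!\left(W_r [h_{t-1}, x_t]\right) && \text{reset gate} \\
\tilde{h}_t &= \tanh\!\left(W_h [r_t \odot h_{t-1}, x_t]\right) && \text{candidate hidden state} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{new hidden state}
\end{aligned}
```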

2.4. Attention Mechanism. The attention mechanism was employed for neural machine translation [33] before being utilised for NLP tasks such as text summarisation [18]. A basic encoder-decoder architecture may fail when given long sentences, since the size of the encoding is fixed for the input string; thus, it cannot consider all the elements of a long input. To remember the input that has a significant impact on the summary, the attention mechanism was introduced [29]. The attention mechanism is employed at each output word to calculate a weight between the output word and every input word; the weights add up to one. The advantage of using weights is to show which input word must receive attention with respect to the output word. The weighted average of the last hidden layers of the decoder in the current step is calculated after passing each input word and fed to the softmax layer along with the last hidden layers [39].
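A minimal sketch of the attention computation described above, assuming a dot-product score between the current decoder state and each encoder hidden state; the weights sum to one, and the context vector is the weighted average of the encoder states. The sizes and random states are placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(encoder_states, decoder_state):
    """Dot-product attention: score each input word against the current decoder state."""
    scores = encoder_states @ decoder_state        # one score per input word, shape (T,)
    weights = softmax(scores)                      # attention weights sum to 1
    context = weights @ encoder_states             # weighted average of encoder states
    return context, weights

rng = np.random.default_rng(2)
enc = rng.normal(size=(7, 16))                     # 7 input words, hidden size 16 (assumptions)
dec = rng.normal(size=16)                          # current decoder hidden state
context, weights = attend(enc, dec)
print(weights.round(3), weights.sum())             # shows which input words receive attention
```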

2.5. Beam Search. Beam search and greedy search are very similar; however, while greedy search considers only the best hypothesis, beam search considers b hypotheses, where b represents the beam width or beam size [5]. In text summarisation tasks, the decoder utilises the final encoder representation to generate the summary from the target vocabulary. In each step, the output of the decoder is a probability distribution over the target words. Thus, to obtain the output word from the learned probability distribution, several methods can be applied, including (1) greedy sampling, which selects the distribution mode; (2) 1-best beam search, which selects the best output; and (3) n-best beam search, which selects several outputs. When n-best beam search is employed, the top b most relevant target words are selected from the distribution and fed to the next decoder state. The decoder keeps only the top k1 of the k words from the different inputs and discards the rest.
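The following sketch illustrates an n-best beam search over a toy step function that returns a probability distribution for the next word; `beam_width` plays the role of b above, and greedy search is the special case `beam_width=1`. The step function is a stand-in for a trained decoder, not part of any surveyed model.

```python
import numpy as np

def beam_search(step_fn, beam_width=3, max_len=5, eos_id=0):
    """Keep the `beam_width` highest-scoring partial summaries at every decoding step."""
    beams = [([], 0.0)]                                  # (token sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos_id:                # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            probs = step_fn(seq)                         # distribution over the target vocabulary
            for w in np.argsort(probs)[-beam_width:]:    # expand only the top-b next words
                candidates.append((seq + [int(w)], score + np.log(probs[w])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

rng = np.random.default_rng(3)

def toy_step(prefix, vocab_size=10):
    """Stand-in decoder: returns a random next-word distribution regardless of the prefix."""
    return rng.dirichlet(np.ones(vocab_size))

print(beam_search(toy_step))
```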

2.6. Distributed Representation (Word Embedding). A word embedding is a distributional vector representation of a word that represents the syntactic and semantic features of words [40]. Words must be converted to vectors to handle various NLP challenges, such that the semantic similarity between words can be calculated using cosine similarity, Euclidean distance, etc. [41–43]. In NLP tasks, the word embeddings of the words are fed as inputs to neural network models. In the recurrent neural network encoder-decoder architecture, which is employed to generate the summaries, the input of the model is the word embedding of the text, and the output is the word embedding of the summary.
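As a concrete example of the similarity computation mentioned above, cosine similarity reduces to a normalised dot product between two word vectors; the vectors below are random placeholders rather than trained embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: close to 1.0 means similar direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(4)
embedding = {w: rng.normal(size=50) for w in ["summary", "abstract", "pizza"]}  # placeholder vectors
print(cosine_similarity(embedding["summary"], embedding["abstract"]))
print(cosine_similarity(embedding["summary"], embedding["pizza"]))
```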

In NLP, there are several word embedding models, such as Word2Vec, GloVe, FastText, and Bidirectional Encoder Representations from Transformers (BERT), which are the most recently employed word embedding models [41, 44–47]. The Word2Vec model consists of two approaches, skip-gram and continuous bag-of-words (CBOW), which both depend on the context window [41]. On the other hand, GloVe represents the global vector, which is based on statistics of the global corpus instead of the context window [44]. FastText extends the skip-gram of the Word2Vec model by using subword internal information to address out-of-vocabulary (OOV) terms [46]. In FastText, the subword components are composed to build the vector representation of the words, which facilitates representation of word morphology and lexical similarity. The BERT word embedding model is based on a multilayer bidirectional transformer encoder [47, 48]. Instead of using sequential recurrence, the transformer neural network utilises parallel attention layers. BERT creates a single large transformer by combining the representations of the words and sentences. Furthermore, BERT is pretrained with an unsupervised objective over a large amount of text.

Figure 3: LSTM unit gates [5]: (a) input gate; (b) forget gate; (c) memory gate; (d) output gate.

2.7. Transformers. The contextual representations of language are learned from large corpora. One of the new language representations, which extends word embedding models, is BERT, mentioned in the previous section [48]. In BERT, two special tokens are inserted into the text. The first token, (CLS), is employed to aggregate the whole text sequence information. The second token is (SEP); this token is inserted at the end of each sentence to represent it. The resulting text consists of tokens, where each token is assigned three types of embeddings: token, segmentation, and position embeddings. The token embedding indicates the meaning of a token, the segmentation embedding identifies the sentences, and the position embedding determines the position of the token. The sum of the three embeddings is fed to the bidirectional transformer as a single vector. Pretrained word embedding vectors are more precise and rich with semantic features. BERT has the advantages of fine-tuning (based on the objectives of certain tasks) and feature-based methods. Moreover, transformers compute the representation of the input and output by using self-attention, where self-attention enables learning the relevance between each "word pair" [47].
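A minimal sketch of how the three BERT input embeddings described above are combined: each token id is mapped to a token, segment, and position embedding, and their element-wise sum is the vector handed to the bidirectional transformer. The vocabulary size, dimensions, and toy token ids are assumptions, not BERT's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(5)
V, S, P, D = 100, 2, 32, 16        # vocab size, segments, max positions, embedding dim (assumptions)
tok_emb = rng.normal(size=(V, D))  # token embeddings (meaning of each token)
seg_emb = rng.normal(size=(S, D))  # segmentation embeddings (which sentence the token belongs to)
pos_emb = rng.normal(size=(P, D))  # position embeddings (where the token occurs)

# Toy input: [CLS] w1 w2 [SEP] w3 [SEP]  -> token ids and the sentence each token belongs to
token_ids   = np.array([1, 7, 9, 2, 13, 2])
segment_ids = np.array([0, 0, 0, 0, 1, 1])
positions   = np.arange(len(token_ids))

# The transformer receives the sum of the three embeddings for each token position.
transformer_input = tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]
print(transformer_input.shape)     # (sequence length, D)
```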

3. Single-Sentence Summary

Recently, the RNN has been employed for abstractive text summarisation and has provided significant results. Therefore, we focus on abstractive text summarisation based on deep learning techniques, especially the RNN [49]. We discuss the approaches that have applied deep learning for abstractive text summarisation since 2015; the RNN with an attention mechanism was the most widely utilised for abstractive text summarisation. We classified the research according to summary type (i.e., single-sentence or multisentence summary), as shown in Figure 4. We also compared the approaches in terms of encoder-decoder architecture, word embedding, dataset and dataset preprocessing, and evaluations and results. This section covers single-sentence summary methods, while Section 4 covers multisentence summary methods. Single-sentence summary methods include a neural attention model for abstractive sentence summarisation [18], abstractive sentence summarisation with attentive RNN (RAS) [39], the quasi-RNN [50], a method for generating news headlines with RNNs [29], abstractive text summarisation using an attentive sequence-to-sequence RNN [38], neural text summarisation [51], selective encoding for abstractive sentence summarisation (SEASS) [52], faithful to the original: fact aware neural abstractive summarization (FTSumg) [53], and the improving transformer with sequential context [54].

3.1. Abstractive Summarization Architecture

3.1.1. Feedforward Architecture. Neural networks were first employed for abstractive text summarisation by Rush et al. in 2015, where a local attention-based model was utilised to generate summary words by conditioning them on the input sentences [18]. Three types of encoders were applied: the bag-of-words encoder, the convolution encoder, and the attention-based encoder. The bag-of-words model of the embedded input was used to distinguish between stop words and content words; however, this model had a limited ability to represent continuous phrases. Thus, a model that utilised a deep convolutional encoder was employed to allow the words to interact locally without the need for context. The convolutional encoder model can alternate between temporal convolution and max-pooling layers using the standard time-delay neural network (TDNN) architecture; however, it is limited to a single output representation. The limitation of the convolutional encoder model was overcome by the attention-based encoder. The attention-based encoder was utilised to exploit the learned soft alignment to weight the input based on the context to construct a representation of the output. Furthermore, the beam-search decoder was applied to limit the number of hypotheses in the summary.

3.1.2. RNN Encoder-Decoder Architecture

(1) LSTM-RNN. An abstractive sentence summarisation model that employed a conditional recurrent neural network (RNN) to generate the summary from the input is referred to as a recurrent attentive summariser (RAS) [39]. The RAS is an extension of the work in [18]. In [18], the model employed a feedforward neural network, while the RAS employed an RNN-LSTM. The encoder and decoder in both models were trained using sentence-summary pair datasets, but the decoder of the RAS improved the performance since it considered the position information of the input words. Furthermore, previously generated words and the input sentence were employed to produce the next word in the summary during the training phase.

Lopyrev [29] proposed a simplified attention mechanism that was utilised in an encoder-decoder RNN to generate headlines for news articles. The news article was fed into the encoder one word at a time and then passed through the embedding layer to generate the word representation. The experiments were conducted using simple and complex attention mechanisms. In the simple attention mechanism, the last layer after processing the input in the encoder was divided into two parts: one part for calculating the attention weight vector and one part for calculating the context vector, as shown in Figure 5(a). However, in the complex attention mechanism, the last layer was employed to calculate the attention weight vector and context vector without fragmentation, as shown in Figure 5(b). In both figures, the solid lines indicate the part of the hidden state of the last layer that is employed to compute the context vector, while the dashed lines indicate the part of the hidden state of the last layer that is applied to compute the attention weight vector. The same difference exists on the decoder side: in the simple attention mechanism, the last layer was divided into two parts (one part was passed to the softmax layer, and the other part was applied to calculate the attention weight), while in the complex attention mechanism, no such division was made. A beam search was performed at the decoder side during testing to extend the most probable partial summaries.

Figure 4: Taxonomy of several approaches that use a recurrent neural network and attention mechanism in abstractive text summarisation, based on the summary type (single-sentence summary versus multisentence summary).

The encoder-decoder RNN and sequence-to-sequence models were utilised in [55], which mapped the inputs to the target sequences; the same approach was also employed in [38, 51]. Three different methods for global attention were proposed for calculating the scoring functions, including dot product scoring, the bilinear form, and the scalar value calculated from the projection of the hidden states of the RNN encoder [38]. The model applied LSTM cells instead of GRU cells (both LSTM and GRU are commonly employed for abstractive summarisation tasks, since LSTM has a memory unit that provides control, but the computation time of GRU is lower). Three models were employed: the first model applied unidirectional LSTM in both the encoder and the decoder, the second model was implemented using bidirectional LSTM in the encoder and unidirectional LSTM in the decoder, and the third model utilised a bidirectional LSTM encoder and an LSTM decoder with global attention. The first hidden state of the decoder is the concatenation of all backward and forward hidden states of the encoder. The use of attention in an encoder-decoder neural network generates a context vector at each timestep. For the local attention mechanism, the context vector is conditioned on a subset of the encoder's hidden states, while for the global attention mechanism, the vector is conditioned on all the encoder's hidden states. After generating the first decoder output, the next decoder input is the word embedding of the output of the previous decoder step. An affine transformation is used to convert the output of the decoder LSTM to a dense vector prediction, due to the long training time needed when the number of hidden states is the same as the number of words in the vocabulary.

Khandelwal [51] employed a sequence-to-sequence model that consists of an LSTM encoder and an LSTM decoder for abstractive summarisation of small datasets. The decoder generated the output summary after reading the hidden representations generated by the encoder and passing them to the softmax layer. The sequence-to-sequence model does not memorize information, so generalization of the model is not possible. Thus, the proposed model utilised imitation learning to determine whether to choose the golden token (i.e., the reference summary token) or the previously generated output at each step.
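The choice between feeding the golden (reference) token and the model's own previous output can be sketched as a scheduled-sampling style coin flip at each decoding step; this is a generic illustration of the idea, not the exact imitation-learning procedure of [51], and `p_golden` is an assumed hyperparameter.

```python
import random

def choose_next_input(reference_token, generated_token, p_golden=0.75):
    """With probability p_golden feed the reference token (teacher forcing),
    otherwise feed the model's own previous prediction."""
    return reference_token if random.random() < p_golden else generated_token

# In practice, p_golden is typically decayed during training so the model gradually
# learns to condition on its own (possibly imperfect) predictions, as at test time.
random.seed(0)
print([choose_next_input("gold", "pred") for _ in range(8)])
```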

(2) GRU-RNN. A combination of the elements of the RNN and convolutional neural network (CNN) was employed in an encoder-decoder model that is referred to as a quasi-recurrent neural network (QRNN) [50]. In the QRNN, the GRU was utilised in addition to the attention mechanism. The QRNN was applied to address the limitation of parallelisation; it aims to obtain the dependencies of the words in previous steps via convolution and "fo-pooling", which are performed in parallel, as shown in Figure 6. The convolution in the QRNN can be either mass convolution (considering previous timesteps only) or centre convolution (considering future timesteps). The encoder-decoder model employed two neural networks: the first network applied the centre convolution of the QRNN and consisted of multiple hidden layers that were fed by the vector representations of the words, and the second network comprised neural attention and took as input the encoder hidden layers to generate one word of a headline. The decoder accepted the previously generated headline word and produced the next word of the headline; this process continued until the headline was completed.

SEASS is an extension of the sequence-to-sequence recurrent neural network that was proposed in [52]. The selective encoding for abstractive sentence summarisation (SEASS) approach includes a selective encoding model that consists of an encoder for sentences, a selective gate network, and a decoder with an attention mechanism, as shown in Figure 7. The encoder uses a bidirectional GRU, while the decoder uses a unidirectional GRU with an attention mechanism. The encoder reads the input words and builds their representations. The meaning of the sentence is applied by the selective gate to choose the word representations used to generate the representation of the sentence. To produce an excellent summary and accelerate the decoding process, a beam search was selected for the decoder.
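A sketch of the selective gate idea, assuming the common formulation in which a sentence vector s (built from the final forward and first backward encoder states) gates each word's bidirectional representation before decoding; the parameter shapes and names are illustrative, not taken from the SEASS implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
T, H = 5, 8                                  # words in the sentence, hidden size per direction
h = rng.normal(size=(T, 2 * H))              # bidirectional GRU states for each word
s = np.concatenate([h[-1, :H], h[0, H:]])    # sentence vector: last forward + first backward state

W_g = rng.normal(0, 0.1, (2 * H, 2 * H))
U_g = rng.normal(0, 0.1, (2 * H, 2 * H))
b_g = np.zeros(2 * H)

gate = sigmoid(h @ W_g + s @ U_g + b_g)      # one gate vector per word, conditioned on the sentence
h_selected = h * gate                        # selected word representations passed to the decoder
print(h_selected.shape)
```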

On the other hand, dual attention was applied in [53]. The proposed dual attention approach consists of three modules: two bidirectional GRU encoders and one dual attention decoder. The decoder has a gate network for context selection, as shown in Figure 8, and employs copying and coverage mechanisms. The outputs of the encoders are two context vectors: one context vector for the sentences and one context vector for the relations, where a relation may be a triple or a tuple. A triple relation consists of the subject, predicate, and object, while a tuple relation consists of either (subject and predicate) or (predicate and subject). Sometimes the triple relation cannot be extracted; in this case, two tuple relations are utilised. The decoder gate merges both context vectors based on their relative association.

Figure 5: (a) Simple attention and (b) complex attention [29].

Figure 6: Comparison of the CNN, LSTM, and QRNN models [50].

Figure 7: Selective encoding for abstractive sentence summarisation (SEASS) [52].

Figure 8: Faithful to the original [53].

(3) Others. The poor semantic representation of long sequences in abstractive text summarisation approaches that are based on an RNN encoder-decoder framework was addressed using the RC-Transformer (RCT) [54]. The RCT is an RNN-based abstractive text summarisation model that is composed of two encoders (an RC encoder and a transformer encoder) and one decoder. The transformer has an advantage in parallel computing, in addition to retrieving the global-context semantic relationships. On the other hand, sequential context representation is achieved by the second encoder of the RC-Transformer. Word ordering is crucial for abstractive text summarisation and cannot be obtained by positional encoding; therefore, the RCT utilised two encoders to address the shortage of sequential information at the word level. A beam search was utilised at the decoder. Furthermore, Cai et al. compared the speed of the RCT model and that of an RNN-based model and concluded that the RCT is 1.4x and 1.2x faster.

3.2. Word Embedding. In the QRNN model, GloVe word embedding, which was pretrained using the Wikipedia and Gigaword datasets, was used to represent the text and summary [50]. In the first model proposed by Jobson et al., the word embeddings were randomly initialised and updated during training, while GloVe word embedding was employed to represent the words in the second and third models [38]. In the study by Cai et al., the Transformer was utilised [54].

3.3. Dataset and Dataset Preprocessing. In the model that was proposed by Rush et al., the datasets were preprocessed via PTB tokenization by using "#" to replace all digits, converting all letters to lowercase, and using "UNK" to replace words that occurred fewer than 5 times [18]. The model was trained with any input-output pairs due to the shortage of constraints for generating the output. The training process was carried out on the Gigaword datasets, while the summarisation evaluation was conducted on DUC2003 and DUC2004 [18]. Furthermore, the model proposed by Chopra et al. was trained using the Gigaword corpus with sentence separation and tokenisation [39]. To form sentence-summary pairs, each headline of an article was paired with the first sentence of the article. The same data preprocessing steps of [18] were performed in [39]. Moreover, the Chopra et al. model was evaluated using the DUC2004 dataset, which consists of 500 pairs.

Gigaword datasets were also employed by the QRNN model [50]. Furthermore, articles that started with sentences that contained more than 50 words or had headlines with more than 25 words were removed. Moreover, the words in the articles and their headlines were converted to lowercase, and the data points were split into short, medium, and long sentences based on the lengths of the sentences to avoid extra padding.

Lopyrev and Jobson et al. trained their models using Gigaword after processing the data. In the Lopyrev model, the most crucial preprocessing steps for both the text and the headline were tokenisation and conversion of characters to lowercase [29]. In addition, only the characters of the first paragraph were retained, and the length of the headline was fixed between 25 and 50 words. Moreover, articles without headlines were disregarded, and the ⟨unk⟩ symbol was used to replace rare words.

Khandelwal employed the Association for Computational Linguistics (ACL) Anthology Reference Corpus, which consists of 16,845 examples for training and 500 examples for testing; these were considered small datasets in the experiments [51]. The abstract included the first three sentences, and the unigram overlap between the title and the abstract was also calculated. There were 25 tokens in the summary and a maximum of 250 tokens in the input text.

The English Gigaword dataset, the DUC2004 corpus, and MSR-ATC were selected to train and test the SEASS model [52]. Moreover, the experiments of the Cao et al. model were conducted using the Gigaword dataset [53]. The same data preprocessing steps of [18] were performed in [52, 53]. Moreover, the RCT also employed the Gigaword and DUC2004 datasets in experiments [54].

3.4. Evaluation and Results. Recall-Oriented Understudy for Gisting Evaluation 1 (ROUGE1), ROUGE2, and ROUGE-L were utilised to evaluate the Rush et al. model, and values of 28.18, 8.49, and 23.81, respectively, were obtained [18]. The experimental results of the Chopra et al. model showed that, although DUC2004 was too complex for the experiments, on the Gigaword corpus the proposed model outperformed state-of-the-art methods in terms of ROUGE1, ROUGE2, and ROUGE-L [39]. The values of ROUGE1, ROUGE2, and ROUGE-L were 28.97, 8.26, and 24.06, respectively. On the other hand, BLEU was employed to evaluate the Lopyrev model [29], while Khandelwal utilised perplexity [51]. The SEASS model was evaluated using ROUGE1, ROUGE2, and ROUGE-L, and the results of the three measures were 36.15, 17.54, and 33.63, respectively [52]. Moreover, ROUGE1, ROUGE2, and ROUGE-L were selected for evaluating the Cao et al. model [53]. The values of ROUGE1, ROUGE2, and ROUGE-L were 37.27, 17.65, and 34.24, respectively, and the results showed that fake summaries were reduced by 80%. In addition, the RCT was evaluated using ROUGE1, ROUGE2, and ROUGE-L, with values of 37.27, 18.19, and 34.62 on the Gigaword dataset. The results showed that the RCT model outperformed other models by generating a high-quality summary that contains salient information [54].
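For reference, ROUGE-N measures n-gram overlap between a candidate summary and a reference summary. The sketch below computes ROUGE-1 and ROUGE-2 recall, precision, and F1 from scratch on toy strings; the official ROUGE toolkit additionally applies stemming and other options that are omitted here.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """ROUGE-N: clipped n-gram overlap between a candidate and a reference summary."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())                 # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * recall * precision / max(recall + precision, 1e-9)
    return recall, precision, f1

reference = "the model generates an abstractive summary of the document"
candidate = "the model generates a summary of the document"
print("ROUGE-1:", rouge_n(candidate, reference, 1))
print("ROUGE-2:", rouge_n(candidate, reference, 2))
```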


4. Multisentence Summary

In this section, multisentence summary and deep learning-based abstractive text summarisation are discussed. Multisentence summary methods include the get to the point method (summarisation with pointer-generator networks) [56], a deep reinforced model for abstractive summarization (RL) [57], the generative adversarial network for abstractive text summarization [58], semantic phrase exploration (ATSDL) [30], the bidirectional attention encoder-decoder model with bidirectional beam search [35], the key information guide network [59], text summarisation abstraction improvement [60], dual encoding for abstractive text summarisation (DEATS) [61], abstractive document summarisation via bidirectional decoder (BiSum) [62], the text abstraction summary model based on BERT word embedding and RL [63], the transformer-based model for single-document neural summarisation [64], text summarisation with pretrained encoders [65], and the text summarisation method based on the double attention pointer network [49]. The pointer-generator [55] covers both single-sentence and multisentence summaries. Additional details are presented in the following sections.

4.1. Abstractive Summarization Architecture

4.1.1. LSTM-RNN. A novel abstractive summarisation method was proposed in [56]; it generated a multisentence summary and addressed sentence repetition and inaccurate information. See et al. proposed a model that consists of a single-layer bidirectional LSTM encoder, a single-layer unidirectional LSTM decoder, and the sequence-to-sequence attention model proposed by [55]. The See et al. model generates a long text summary instead of headlines, which consist of only one or two sentences. Moreover, the attention mechanism was employed, and the attention distribution facilitated the production of the next word in the summary by telling the decoder where to search in the source words, as shown in Figure 9. This mechanism constructs the weighted sum of the hidden states of the encoder, which facilitates the generation of the context vector, where the context vector is the fixed-size representation of the input. The probability (Pvocab) produced by the decoder was employed to generate the final prediction using the context vector and the decoder's last step. Furthermore, the value of Pvocab was equal to zero for OOV words. RL was employed for abstractive text summarisation in [57]. The proposed method in [57], which combined RL with supervised word prediction, was composed of a bidirectional LSTM-RNN encoder and a single LSTM decoder.
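The pointer-generator idea referenced above can be sketched as a soft switch p_gen that mixes the vocabulary distribution with the attention (copy) distribution over source words, so OOV source words receive probability only through the copy part. This is a generic illustration of the mechanism; the shapes, toy inputs, and p_gen value are assumptions.

```python
import numpy as np

def final_distribution(p_vocab, attention, src_ids, p_gen, extended_vocab_size):
    """Mix generation and copying: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass on w."""
    dist = np.zeros(extended_vocab_size)
    dist[:len(p_vocab)] = p_gen * p_vocab            # generate from the fixed vocabulary
    for a, src_id in zip(attention, src_ids):        # copy probability mass onto source words
        dist[src_id] += (1.0 - p_gen) * a            # OOV source words live in the extended ids
    return dist

vocab_size = 6                                       # fixed vocabulary (assumption)
p_vocab = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])   # decoder's vocabulary distribution; 0 for OOVs
attention = np.array([0.5, 0.3, 0.2])                # attention over 3 source words
src_ids = [2, 6, 4]                                  # word id 6 is an OOV copied from the source
dist = final_distribution(p_vocab, attention, src_ids, p_gen=0.8, extended_vocab_size=7)
print(dist, dist.sum())                              # a valid distribution over the extended vocab
```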

Two models, a generative model and a discriminative model, were trained simultaneously to generate abstractive summary text using the adversarial process [58]. The maximum likelihood estimation (MLE) objective function employed in previous sequence-to-sequence models suffers from two problems: the difference between the training loss and the evaluation metric, and the unavailability of a golden token at testing time, which causes errors to accumulate during testing. To address the previous problems, the proposed approach exploited the adversarial framework. In the first step of the adversarial framework, reinforcement learning was employed to optimize the generator, which generates the summary from the original text. In the second step, the discriminator, which acts as a binary classifier, classified the summary as either a ground-truth summary or a machine-generated summary. The bidirectional LSTM encoder and attention mechanism were employed as in [56].

Abstractive text summarisation using the LSTM-CNN model based on exploring semantic phrases (ATSDL) was proposed in [30]. ATSDL is composed of two phases: the first phase extracts the phrases from the sentences, while the second phase learns the collocation of the extracted phrases using the LSTM model. To generate sentences that are general and natural, the input and output of the ATSDL model were phrases instead of words, and the phrases were divided into three main types, i.e., subject, relation, and object phrases, where the relation phrase represents the relation between the input phrase and the output phrase. The phrases were represented using a CNN layer. There are two main reasons for choosing the CNN: first, the CNN is efficient for sentence-level applications, and second, training is efficient since long-term dependency is unnecessary. Furthermore, to obtain several vectors for a phrase, multiple kernels with different widths, which represent the dimensionality of the features, were utilised. Within each kernel, the maximum feature was selected for each row in the kernel via maximum pooling. The resulting values were added to obtain the final value for each word in a phrase. Bidirectional LSTM was employed instead of a GRU on the encoder side, since parameters are easy to tune with LSTM. Moreover, the decoder was divided into two modes: a generate mode and a copy mode. The generate mode generates the next phrase in the summary based on the previously generated phrases and the hidden layers of the input on the encoder side, while the copy mode copies the phrase that follows the current input phrase if the currently generated phrase is not suitable for the previously generated phrases in the summary. Figure 10 provides additional details.

Bidirectional encoder and decoder LSTM-RNNs were employed to generate abstractive multisentence summaries [35]. The proposed approach considered past and future context on the decoder side when making a prediction, as it employed a bidirectional RNN. Using a bidirectional RNN on the decoder side addressed the problem of summary imbalance. An unbalanced summary can occur due to noise in a previous prediction, which reduces the quality of all subsequent summaries. The bidirectional decoder consists of two LSTMs: the forward decoder and the backward decoder. The forward decoder decodes the information from left to right, while the backward decoder decodes the information from right to left. The last hidden state of the forward decoder is fed as the initial input to the backward decoder, and vice versa. Moreover, the researchers proposed a bidirectional beam-search method that generates summaries from the proposed bidirectional model. Bidirectional beam search combines information from the past and future to produce a better summary. Therefore, the output summary was balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence was read in reverse order, based on the conclusion that LSTM learns better when reading the source in reverse order while remembering the order of the target [66, 67]. A softmax layer was employed on the decoder side to obtain the probability of each target word in the summary over the vocabulary distribution, by taking the output of the decoder as input to the softmax layer. The decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. The objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token. However, during testing, the input of the forward decoder is the token generated in the previous step. The same situation holds for the backward decoder, where the input during training is the future token from the summary. Nevertheless, the bidirectional decoder has difficulty during testing, since the complete summary must be known in advance; thus, the full backward decoder output is generated first and fed to the forward decoder using a unidirectional backward beam search.

A combination of abstractive and extractive methods was employed in the guiding generation model proposed by [59]. The extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent the key information. Furthermore, to predict the final summary of the long-term value, the proposed method applied a prediction guide mechanism [68]. A prediction guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. The encoder-decoder architecture baseline of the proposed model is similar to that proposed by Nallapati et al. [55], where both the bidirectional LSTM encoder and the unidirectional LSTM decoder were employed. Both models applied the attention mechanism and a softmax layer. Moreover, the process of generating the summary was improved by proposing KIGN, which takes as input the keywords extracted using the TextRank algorithm. In KIGN, key information is represented by concatenating the last forward hidden state and the first backward hidden state. KIGN employs the attention mechanism and the pointer mechanism. In general, the attention mechanism hardly identifies the keywords; thus, to identify keywords, the output of KIGN is fed to the attention mechanism. As a result, the attention mechanism will be highly affected by the keywords. However, to enable the pointer network to identify the keywords, which are the output of KIGN, the encoder context vector and the hidden state of the decoder are fed to the pointer network, and the output is employed to calculate the soft switch. The soft switch determines whether to copy the target from the original text or generate it from the target vocabulary, as shown in Figure 11.

Figure 9: Baseline sequence-to-sequence model with attention mechanism [56].

Figure 10: Semantic-unit-based LSTM model [30].

The level of abstraction in the generated summary of abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and the use of a novel metric for optimising the overlap between the n-gram summary and the ground-truth summary. The decoder was decomposed into a contextual network and a pretrained language model, as shown in Figure 12. The contextual network uses the source document to extract the relevant parts, and the pretrained language model is generated via prior knowledge. This decomposition method facilitates the addition of an external pretrained language model that is related to several domains. Furthermore, a novel metric was employed to generate an abstractive summary by including words that are not in the source document. Bidirectional LSTM was utilised in the encoder, and the decoder applied a 3-layer unidirectional weight-dropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applied the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combined reinforcement learning and maximum likelihood.

A bidirectional decoder with a sequence-to-sequence architecture, referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing because the input of the decoder is the previously generated summary word; if one of the generated summary words is incorrect, the error propagates through all subsequent summary words. In the bidirectional decoder, there are two decoders: a forward decoder and a backward decoder. The forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left.

The forward decoder uses the output of the backward decoder as a reference. However, there is only a single encoder. The encoder and the decoders employ LSTM units, but while the encoder utilises bidirectional LSTM, the decoders use unidirectional LSTM, as shown in Figure 13. To understand the summary generated by the backward decoder, an attention mechanism is applied between the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, a pointer mechanism is employed in both decoders.

A double attention pointer network, referred to as DAPT, was applied to generate an abstractive text summarisation model [49]. The encoder utilised bidirectional LSTM, while the decoder utilised unidirectional LSTM. The key features of the encoder were extracted using a self-attention mechanism. At the decoder, beam search was employed, and more coherent and accurate summaries were generated. The repetition problem was addressed using an improved coverage mechanism with a truncation parameter. The model was optimised by generating a training model based on RL and scheduled sampling.

4.1.2. GRU-RNN. Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. The dual encoder consists of two levels of encoders, namely, primary and secondary encoders, in addition to one decoder, all of which employ a GRU. The primary encoder performs coarse encoding, while the secondary encoder performs fine encoding. The primary encoder and the decoder are the same as in the standard encoder-decoder model with an attention mechanism, while the secondary encoder generates a new context vector based on the previous output and input. Moreover, this additional context vector provides more meaningful information for the output; thus, the repetition problem of the generated summary that was encountered in previous approaches is addressed. The semantic vector is generated at both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is calculated. The fixed-length output is partially generated at each stage of the decoder since it decodes in stages.

Figure 14 elaborates the DEATS process. The primary encoder produces a hidden state h_j^p for each input j and a content representation c^p. Next, the decoder decodes a fixed-length output, which is referred to as the decoder content representation c^d. The weight α_j can be calculated using the hidden states h_j^p and the content representations c^p and c^d. In this stage, the secondary encoder generates new hidden states, or semantic context vectors, h_m^s, which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator, a copy mechanism, and a coverage mechanism.
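A minimal numerical sketch of this dual-encoding step is given below; the scoring function, the projection W, and the way the secondary states are derived from the weights are simplifying assumptions rather than the exact DEATS formulation.

```python
# Hedged sketch of re-weighting primary encoder states with c^p and c^d.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_encoding_step(h_p, c_p, c_d, v, W):
    """h_p: (T, H) primary encoder states; c_p, c_d: (H,) content vectors of the
    primary encoder and the decoder; v: (H,), W: (H, 3*H) assumed parameters.
    Returns importance weights alpha and re-weighted states for the secondary
    (fine) encoding stage."""
    T, _ = h_p.shape
    scores = np.empty(T)
    for j in range(T):
        features = np.concatenate([h_p[j], c_p, c_d])   # (3H,)
        scores[j] = v @ np.tanh(W @ features)            # scalar relevance score
    alpha = softmax(scores)                               # importance of each input word
    h_s = alpha[:, None] * h_p                            # simplified secondary semantic vectors
    return alpha, h_s

# toy usage with random values
rng = np.random.default_rng(0)
T, H = 5, 8
alpha, h_s = dual_encoding_step(rng.normal(size=(T, H)), rng.normal(size=H),
                                rng.normal(size=H), rng.normal(size=H),
                                rng.normal(size=(H, 3 * H)))
print(alpha.round(3), h_s.shape)
```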

Wang et al. proposed a hybrid extractive-abstractive text summarisation model based on combining reinforcement learning with BERT word embedding [63]. In this hybrid model, a BERT feature-based strategy was used to generate contextualised token embeddings.


This model consists of two submodels, an abstractive agent and an extractive agent, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where the sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In both the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. The training process consists of pretraining and full-training phases.

Egonmwan et al. proposed the use of sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules: an extractive model and an abstractive model. The encoder transformer has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives a sentence-level representation. The architecture of the abstractive model consists of a single-layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder.

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].


The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.

4.1.3. Others. BERT is employed to represent the sentences of the document and to express their semantics [65]. Liu et al. proposed abstractive and extractive summarisation models that are based on the encoder-decoder architecture. The encoder used a BERT pretrained document-level encoder, while the decoder utilised a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and the decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage in extractive summarisation and one stage in abstractive summarisation. At the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed since these two mechanisms need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries using trigram blocking. OOV words rarely appear in the generated summary.
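The trigram-blocking heuristic mentioned above can be illustrated with a short sketch; the function names are ours, and the check is a simplified version of what summarisation decoders typically apply during beam search.

```python
# Hedged sketch of trigram blocking: a candidate extension is rejected when it
# would create a trigram that already occurs in the partial summary.
def has_repeated_trigram(tokens):
    seen = set()
    for i in range(len(tokens) - 2):
        tri = tuple(tokens[i:i + 3])
        if tri in seen:
            return True
        seen.add(tri)
    return False

def allowed_extension(partial_summary, candidate_word):
    """Return False if appending candidate_word repeats an existing trigram."""
    return not has_repeated_trigram(partial_summary + [candidate_word])

print(allowed_extension("the cat sat on the mat and".split(), "the"))  # True
print(allowed_extension("the cat sat on the cat sat".split(), "on"))   # False (repeats "the cat sat")
```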

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, in the Paulus et al. model, both the input and output tokens used the same embedding matrix W_emb, which was generated using the GloVe word embedding model [57]. Another word embedding matrix, referred to as W_out, was applied in the token generation layer. Additionally, a sharing weighting matrix was employed by both the shared embedding matrix W_emb and the W_out matrix. The shared weighting matrices improved the process of generating tokens since they consider the syntactic and semantic information captured by the embedding.
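A small PyTorch-style sketch of sharing one embedding matrix between the input lookup and the output projection (the idea behind sharing W_emb with the token-generation layer) is shown below; the bridge layer and dimensions are illustrative assumptions, not the exact parameterisation of [57].

```python
# Hedged sketch of weight tying between the embedding matrix and the output layer.
import torch.nn as nn

class TiedDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)        # W_emb
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.bridge = nn.Linear(hidden_dim, emb_dim)               # map hidden state to embedding space
        self.out = nn.Linear(emb_dim, vocab_size, bias=False)      # W_out
        self.out.weight = self.embedding.weight                    # tie the two matrices

    def forward(self, token_ids):
        x = self.embedding(token_ids)
        h, _ = self.rnn(x)
        return self.out(self.bridge(h))                            # vocabulary logits
```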

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].


The discriminator input sequence of the Liu et al. model was encoded using a CNN with maximum pooling, and the result was passed to the softmax layer [58]. On the other hand, the word embedding applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets, with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embedding. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of source documents of 781 tokens paired with summaries of 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the documents were preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets: the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consists of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve the coreferences. The second stage of the ATSDL model is phrase extraction, which includes the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP is Stanford NLP parsing, a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing is performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If a child node has children, then the child is considered a new root, and this process continues recursively until the root has no children, at which point the tree structure is complete. Accordingly, compound phrases can be explored via dependency parsing. An important part of phrase extraction is refinement, during which redundant and incorrect phrases are filtered out before training by applying simple rules. First, the phrase triples at the topmost level are exploited since they carry the most semantic information. Second, triple phrases whose subject and object phrases contain no nouns are deleted, since nouns carry a considerable amount of conceptual information. Triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consist of 92,000 text sources and 219,000 text sources, respectively.
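As a rough illustration of dependency-based triple extraction, the sketch below uses spaCy instead of the Stanford parser used by ATSDL, and its extraction rules are heavily simplified assumptions rather than the MOSP procedure itself.

```python
# Hedged sketch: extracting (subject phrase, relational word, object phrase)
# triples from a dependency parse, with spaCy as a stand-in parser.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "pobj", "attr")]
            for subj in subjects:
                for obj in objects:
                    subj_phrase = " ".join(t.text for t in subj.subtree)
                    obj_phrase = " ".join(t.text for t in obj.subtree)
                    triples.append((subj_phrase, token.lemma_, obj_phrase))
    return triples

print(extract_triples("The committee approved the new budget on Monday."))
# e.g. [('The committee', 'approve', 'the new budget')]
```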

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1 of 41.16, ROUGE2 of 15.75, and ROUGE-L of 39.08 [57]. The results of the maximum-likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to assess the quality and readability of the summaries. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]: the value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics for the Al-Sabahi et al. and Li et al. models; values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluation included ROUGE1, ROUGE2, and ROUGE-L, for which values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel n-gram-based score was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model: five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected. The evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied to evaluate DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, in which the WordPiece tokeniser was utilised; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, on the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-answering paradigm, where 20 documents were selected for evaluation. Three values were used to grade the answers: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. The ROUGE1, ROUGE2, and ROUGE-L values of the DAPT model on the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attentional encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and the decoder have the same number of hidden states. Additionally, the proposed model includes a softmax layer for generating words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs, a GRU-RNN for the word level and a GRU-RNN for the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, in which the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering the whole vocabulary, only certain words were added, based on their frequency in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered, in addition to the word embedding of the input words, to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrices for each tag type, similar to word embedding, while the TF-IDF feature was discretised into a fixed number of bins, where a one-hot representation was employed to represent the value of the bins. The one-hot matrix has a number of entries equal to the number of bins, where only one entry is set to one to indicate the TF-IDF value of a certain word. This process permits the TF-IDF to be treated in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16.

The experiments were conducted using the annotated Gigaword corpus with 3.8 million training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation, part-of-speech tagging, and named-entity generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
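The feature-rich input representation described above can be sketched as follows; the tag-set sizes, the number of TF-IDF bins, and the binning rule are illustrative assumptions.

```python
# Hedged sketch: concatenating a word embedding with one-hot POS, NER, and
# discretised TF-IDF bin vectors into one long input vector.
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def build_input_vector(word_emb, pos_id, ner_id, tfidf, n_pos=45, n_ner=10, n_bins=10):
    tfidf_bin = min(int(tfidf * n_bins), n_bins - 1)   # discretise TF-IDF into a bin index
    return np.concatenate([word_emb,
                           one_hot(pos_id, n_pos),
                           one_hot(ner_id, n_ner),
                           one_hot(tfidf_bin, n_bins)])

vec = build_input_vector(np.random.rand(200), pos_id=12, ner_id=3, tfidf=0.37)
print(vec.shape)   # (200 + 45 + 10 + 10,) = (265,)
```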

Finally, for both the single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of several approaches appear in Table 2, while the training, optimisation, mechanisms, and search at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets were produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles each. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of each article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets generated by Hermann et al. [71], which were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (on average 29.74 sentences), while the summaries have 53 words (on average 3.72 sentences) [55].
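For readers who want to experiment with this corpus, one possible way of loading the CNN/Daily Mail version hosted on the Hugging Face Hub is shown below; this tooling is an assumption of ours and is not part of the reviewed papers.

```python
# Hedged example: loading CNN/Daily Mail (non-anonymised version) for inspection.
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)                       # train/validation/test splits
example = dataset["train"][0]
print(example["article"][:200])      # source document (truncated)
print(example["highlights"])         # bullet-point reference summary
```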

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].


Figure 16: Word embedding concatenated with discretised TF-IDF, POS, and NER one-hot embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

| Reference | Year | Encoder | Decoder |
| --- | --- | --- | --- |
| [18] | 2015 | Bag-of-words, convolutional, and attention-based | |
| [29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention |
| [39] | 2016 | RNN-LSTM | Word-based decoder RNN |
| [50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN |
| [38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM | Unidirectional RNN attentive encoder-decoder LSTM |
| | | Bidirectional LSTM | Unidirectional LSTM |
| | | Bidirectional LSTM | Decoder with global attention |
| [51] | 2016 | LSTM-RNN | LSTM-RNN |
| [55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN |
| [52] | 2017 | Bidirectional GRU | Unidirectional GRU |
| [53] | 2017 | Bidirectional GRU | Unidirectional GRU |
| [56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM |
| [57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention |
| [58] | 2018 | Bidirectional LSTM | Unidirectional LSTM |
| [30] | 2018 | Bidirectional LSTM | Unidirectional LSTM |
| [35] | 2018 | Bidirectional LSTM | Bidirectional LSTM |
| [59] | 2018 | Bidirectional LSTM | Unidirectional LSTM |
| [60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM |
| [61] | 2018 | Bidirectional GRU | Unidirectional GRU |
| [62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM |
| [63] | 2019 | Bidirectional GRU | Unidirectional GRU |
| [64] | 2019 | Unidirectional GRU | Unidirectional GRU |
| [49] | 2020 | Bidirectional LSTM | Unidirectional LSTM |


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets: nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study each applied the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

| Reference | Authors | Dataset preprocessing | Input (word embedding) |
| --- | --- | --- | --- |
| [18] | Rush et al. | PTB tokenization: replacing all digits with a placeholder symbol, converting all letters to lower case, and using "UNK" to replace words that occurred fewer than 5 times | Bag-of-words embedding of the input sentence |
| [39] | Chopra et al. | PTB tokenization: replacing all digits with a placeholder symbol, converting all letters to lower case, and using "UNK" to replace words that occurred fewer than 5 times | Encodes the position information of the input words |
| [55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using a Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins with a one-hot representation for the bins; (iv) lookup embeddings for part-of-speech and named-entity tags |
| [52] | Zhou et al. | PTB tokenization: replacing all digits with a placeholder symbol, converting all letters to lower case, and using "UNK" to replace words that occurred fewer than 5 times | Word embedding with size equal to 300 |
| [53] | Cao et al. | Normalization and tokenization: replacing digits with a placeholder symbol, converting the words to lower case, and using "UNK" to replace the least frequent words | GloVe word embedding with dimension size equal to 200 |
| [54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer |
| [50] | Adelson et al. | Converting the articles and their headlines to lower-case letters | GloVe word embedding |
| [29] | Lopyrev | Tokenization, converting the articles and their headlines to lower-case letters, and using the symbol ⟨unk⟩ to replace rare words | The input was represented using a distributed representation |
| [38] | Jobson et al. | | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models |
| [56] | See et al. | | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model |
| [57] | Paulus et al. | The same as in [55] | GloVe |
| [58] | Liu et al. | | A CNN with maximum pooling was used to encode the discriminator input sequence |
| [30] | Song et al. | The words were segmented using the CoreNLP tool, resolving the coreference and reducing the morphology | A convolutional neural network was used to represent the phrases |
| [35] | Al-Sabahi et al. | | The word embedding is learned from scratch during training, with a dimension of 128 |
| [59] | Li et al. | The same as in [55] | Learned from scratch during training |
| [60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400 |
| [61] | Yao et al. | | The word embedding is learned from scratch during training, with a dimension of 128 |
| [62] | Wan et al. | No word segmentation | Embedding layer learned during training |
| [65] | Liu et al. | | BERT |
| [63] | Wang et al. | Using the WordPiece tokenizer | BERT |
| [64] | Egonmwan et al. | | GloVe word embedding with dimension size equal to 300 |


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers reviewed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures for evaluating the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is an n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimization, mechanisms, and search at the decoder.

| Reference | Authors | Training and optimization | Mechanism | Search at decoder (size) |
| --- | --- | --- | --- | --- |
| [18] | Rush et al. | Stochastic gradient descent to minimise the negative log-likelihood | | Beam search |
| [39] | Chopra et al. | Minimising the negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search |
| [55] | Nallapati et al. | Optimise the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5) |
| [52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimising the negative log-likelihood | Attention mechanism | Beam search (12) |
| [53] | Cao et al. | Adam optimizer, optimising the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6) |
| [54] | Cai et al. | Cross-entropy is used as the loss function | Attention mechanism | Beam search (5) |
| [50] | Adelson et al. | Adam | Attention mechanism | |
| [29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanisms | Beam search |
| [38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | |
| [56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4) |
| [57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5) |
| [58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | |
| [30] | Song et al. | | Attention mechanism, copy mechanism | |
| [35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search |
| [59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search |
| [60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search |
| [61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4) |
| [62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4) |
| [65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5) |
| [63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search |
| [64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search at decoding during testing |
| [49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5) |


Since the manual evaluation of automatic text summarisation is a time-consuming process that requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\[
\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}\left(\text{gram}_n\right)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}\left(\text{gram}_n\right)} \qquad (1)
\]

where S is a reference summary, n is the n-gram length, and Count_match(gram_n) is the maximum number of n-grams co-occurring in the reference summary and the generated summary, while Count(gram_n) is the total number of n-grams in the reference summary [73].
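Equation (1) can be implemented directly, as in the following sketch, which assumes whitespace tokenisation; official evaluations normally rely on the ROUGE package itself.

```python
# Hedged sketch of ROUGE-N recall following equation (1).
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference_summaries, generated_summary, n):
    gen = ngrams(generated_summary.split(), n)
    matched, total = 0, 0
    for ref in reference_summaries:
        ref_counts = ngrams(ref.split(), n)
        matched += sum(min(count, gen[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0

print(rouge_n(["ahmed ate the apple"], "the apple ahmed ate", 1))  # 1.0
print(rouge_n(["ahmed ate the apple"], "the apple ahmed ate", 2))  # 2/3
```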

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of common matching words between the reference summary and the generated summary. The LCS calculation does not require the matching words to be consecutive; however, the order of occurrence is important. In addition, no predefined number of matching words is required. LCS considers only the main in-sequence match, which is one of its disadvantages since the final score does not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows.

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets [61].

Table 4: Abstractive summarisation datasets.

| Reference | Training | Evaluation |
| --- | --- | --- |
| [18] | Gigaword | DUC2003 and DUC2004 |
| [39] | Gigaword | DUC2004 |
| [50] | Gigaword | Gigaword |
| [29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes |
| [38] | Gigaword | – |
| [54] | Gigaword and DUC2004 | Gigaword and DUC2004 |
| [51] | ACL Anthology Reference | ACL Anthology Reference |
| [52] | Gigaword and DUC2004 | Gigaword and DUC2004 |
| [53] | Gigaword and DUC2004 | Gigaword and DUC2004 |
| [56] | CNN/Daily Mail | CNN/Daily Mail |
| [57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times |
| [58] | CNN/Daily Mail | CNN/Daily Mail |
| [30] | CNN/Daily Mail | CNN/Daily Mail |
| [35] | CNN/Daily Mail | CNN/Daily Mail |
| [59] | CNN/Daily Mail | CNN/Daily Mail |
| [60] | CNN/Daily Mail | CNN/Daily Mail |
| [61] | CNN/Daily Mail | CNN/Daily Mail |
| [55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail |
| [62] | CNN/Daily Mail | CNN/Daily Mail |
| [65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum |
| [63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002 |
| [64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom |
| [49] | CNN/Daily Mail | CNN/Daily Mail |


R: Ahmed ate the apple.
A: the apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple", but not both, in line with the LCS definition.
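The behaviour shown in this example can be reproduced with a short LCS-based sketch of ROUGE-L recall (whitespace tokenisation assumed):

```python
# Hedged sketch of ROUGE-L recall via the longest common subsequence.
def lcs_length(a, b):
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j],
                                                                     table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_recall(reference, generated):
    ref, gen = reference.split(), generated.split()
    return lcs_length(ref, gen) / len(ref)

R = "Ahmed ate the apple"
A = "the apple Ahmed ate"
print(rouge_l_recall(R, A))   # 0.5: only "Ahmed ate" or "the apple" is counted
```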

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]; these models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively [65]. Even though ROUGE was employed to evaluate abstractive summarisation, it would be better to develop new methods for evaluating the quality of such summaries. New evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text; ROUGE, in contrast, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups: the first group considers single-sentence summary approaches, while the second group considers multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are those that employed BERT word embedding and are based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset because Gigaword is utilised for single-sentence summaries, as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries. Thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the generated summaries of 50 test examples for 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summaries [60]. Five human evaluators evaluated the relevance and readability of 100 randomly selected test examples, using scores between 1 and 10: a value of 1 indicates that the generated summary is less readable and less relevant, while a value of 10 indicates that the generated summary is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are close, with values of 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summaries in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retention of key information, which was achieved by human evaluation [65]. In addition, the qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were selected for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summaries of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are also not enough, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, the previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either a gold token is used, as during training, or the prediction from the previous step is used, as during testing. In this manner, at least part of the training steps receive the same input as testing. In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76].

Moreover, the mass convolution of the QRNN is applied in [50], since the dependency of words generated in the future is difficult to determine.
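The coin-flip idea behind DaD and teacher forcing can be summarised in a few lines; the probability value and function name below are illustrative assumptions.

```python
# Hedged sketch of scheduled sampling / data-as-demonstrator-style input selection.
import random

def choose_decoder_input(gold_token, predicted_token, use_gold_prob=0.9):
    """With probability use_gold_prob feed the reference token (teacher forcing);
    otherwise feed back the model's own previous prediction, as at test time."""
    return gold_token if random.random() < use_gold_prob else predicted_token

# during a training step, for each decoding position t:
# next_input = choose_decoder_input(reference_summary[t], previous_prediction)
```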

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copies it to the memory. When the switch is turned on, the decoder generates a word from the target vocabulary. Conversely, researchers in [56] addressed OOV words via a generation probability Pgen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them. The combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the positions of unseen tokens in the input sequence in order to copy them.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

| Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- |
| [18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81 |
| [39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06 |
| [55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24 |
| [52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63 |
| [53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24 |
| [54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62 |

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

| Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- |
| [55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65 |
| [56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38 |
| [57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08 |
| [57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90 |
| [58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71 |
| [30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | – |
| [35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5 |
| [59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68 |
| [60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52 |
| [61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13 |
| [62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66 |
| [63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49 |
| [64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92 |
| [65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90 |
| [49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35 |

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset.


Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. In addition, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
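A compact PyTorch-style sketch of the generation probability Pgen and the extended-vocabulary distribution is given below; layer names, shapes, and the scatter-based copy step are illustrative assumptions rather than the exact implementation of [56].

```python
# Hedged sketch of a pointer-generator output head with an extended vocabulary.
import torch
import torch.nn as nn

class PointerGeneratorHead(nn.Module):
    def __init__(self, hidden, emb, vocab_size):
        super().__init__()
        self.p_gen_layer = nn.Linear(2 * hidden + hidden + emb, 1)
        self.vocab_layer = nn.Linear(2 * hidden + hidden, vocab_size)

    def forward(self, context, dec_state, dec_input, attention, src_ext_ids, extended_size):
        # context: (B, 2H), dec_state: (B, H), dec_input: (B, emb)
        # attention: (B, src_len), src_ext_ids: LongTensor (B, src_len)
        p_gen = torch.sigmoid(self.p_gen_layer(torch.cat([context, dec_state, dec_input], -1)))
        p_vocab = torch.softmax(self.vocab_layer(torch.cat([context, dec_state], -1)), dim=-1)
        dist = torch.zeros(context.size(0), extended_size)
        dist[:, :p_vocab.size(1)] = p_gen * p_vocab                   # generate from the vocabulary
        dist.scatter_add_(1, src_ext_ids, (1 - p_gen) * attention)    # copy from the source words
        return dist
```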

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges are due to the summarisation of long documents and the production of long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention steps. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and attending to the same parts of the input at different decoder steps. However, the intratemporal attention mechanism at the encoder cannot address all the repetition challenges, especially when a long sequence is generated; thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words. Moreover, the proposed intradecoder attention mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the maximum-likelihood cross-entropy loss and gradient reinforcement learning to minimise the exposure bias. In addition, the trigram probability p(y_t) was proposed to address repetition in the generated summary, where y_t is the trigram sequence: the value of p(y_t) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addresses repetition by exploiting the encoding features generated using a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts. Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary is conditioned on both the input text and the description of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summaries (golden summaries). In the CNN/Daily Mail dataset, the reference summary consists of the highlights of the news article, and every highlight represents a sentence in the summary.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


Therefore, the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all of the crucial points of the article; hence, producing a high-quality dataset requires considerable effort. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset exists, but it is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation; however, in abstractive summarisation, ROUGE is not enough since it depends on exact matching between words. For example, the words "book" and "books" are considered different by any of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words (words that have the same meaning must be considered the same even if they have different surface forms). In this case, we propose the use of METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for flexible-word-order languages, it is better to use ROUGE without considering the order of the words.

The quality of the generated summary can also be improved using linguistic features.

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


For example, we propose the use of dependency parsing at the encoder, in a separate layer on top of the first hidden-state layer, and the use of word embeddings built by considering dependency parsing or part-of-speech tagging. At the decoder side, the quality of the beam search can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the new trends and evaluation results, we think that the most promising direction among all of these features is the use of the BERT pretrained model. The quality of the models that are based on the transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that applied deep learning for abstractive text summarisation, the datasets used, and the measures employed to evaluate these approaches. Moreover, the challenges encountered when applying the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively. The best results were achieved by the models that apply the transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, several issues must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.

[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.

[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.

[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.

[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.

[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.

[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.

[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.

[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.

[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.

[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, India, 2017.

[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.

[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.

[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.

[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.

[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, India, 2018.

[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.

[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.

[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, "Neural abstractive text summarization with sequence-to-sequence models: a survey," 2020, http://arxiv.org/abs/1812.02303.

[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.

[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.

[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.

[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.

[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

[29] K. Lopyrev, "Generating news headlines with recurrent neural networks," p. 9, 2015, https://arxiv.org/abs/1512.01712.

[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.

[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.

[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.

[38] E. Jobson and A. Gutierrez, "Abstractive text summarization using attentive sequence-to-sequence RNNs," p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.

[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.

[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.

[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.

[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.

[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.

[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.

[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.

[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.

[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.

[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, Quasi-Recurrent Neural Networks, 2015, https://arxiv.org/abs/1611.01576.

[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.

[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.

[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.

[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.

[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.

[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.

[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.

[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.

[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.

[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.

[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.

[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.

[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.

[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.

[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.

[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.

[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.

[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.

[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.

[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.

[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.

[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.

[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.



considered a difficult task for a computer. As abstractive text summarisation requires an understanding of the document to generate the summary, advanced machine learning techniques and extensive natural language processing (NLP) are required. Thus, abstractive summarisation is harder than extractive summarisation, since it requires real-world knowledge and semantic class analysis [7]. However, abstractive summarisation is also better than extractive summarisation, since the summary is an approximate representation of a human-generated summary, which makes it more meaningful [8]. For both types, an acceptable summary should have the following properties: sentences that maintain the order of the main ideas and concepts presented in the original text; minimal to no repetition; sentences that are consistent and coherent; and preservation of the meaning of the text, even for long sentences [7]. In addition, the generated summary must be compact while conveying the important information of the original text [2, 9].

Abstractive text summarisation approaches include structured and semantic-based approaches. Structured approaches encode the crucial features of documents using several types of schemas, including tree, ontology, lead and body phrase, and template and rule-based schemas, while semantic-based approaches are more concerned with the semantics of the text and thus rely on the information representation of the document to summarise it. Semantic-based approaches include the multimodal semantic method, the information item method, and the semantic graph-based method [10–17].

Deep learning techniques were employed in abstractive text summarisation for the first time in 2015 [18], and the proposed model was based on the encoder-decoder architecture. For these applications, deep learning techniques have provided excellent results and have been extensively employed in recent years.

Raphal et al. surveyed several abstractive text summarisation processes in general [19]. Their study differentiated between model architectures such as reinforcement learning (RL), supervised learning, and attention mechanisms. In addition, comparisons in terms of word embedding, data processing, training, and validation were performed. However, the quality of the summaries generated by the different models was not compared.

Furthermore, both extractive and abstractive summarisation models were summarised in [20, 21]. In [20], the classification of summarisation tasks was based on three factors: input factors, purpose factors, and output factors. Dong and Mahajani et al. surveyed only five abstractive summarisation models each. On the other hand, Mahajani et al. focused on the datasets and training techniques in addition to the architecture of several abstractive summarisation models [21]. However, the quality of the summaries generated by the different techniques and the evaluation measures were not discussed.

Shi et al. presented a comprehensive survey of several abstractive text summarisation models that are based on the sequence-to-sequence encoder-decoder architecture, covering both convolutional and RNN seq2seq models. The focus was the structure of the network, the training strategy, and the algorithms employed to generate the summary [22]. Although several papers have analysed abstractive summarisation models, few papers have performed a comprehensive study [23]. Moreover, most of the previous surveys covered the techniques only until 2018, even though surveys were published in 2019 and 2020, such as [20, 21]. In this review, we address most of the recent deep learning-based RNN abstractive text summarisation models. Furthermore, this survey is the first to address recent techniques applied in abstractive summarisation, such as the Transformer.

This paper provides an overview of the approaches, datasets, evaluation measures, and challenges of deep learning-based abstractive text summarisation, and each topic is discussed and analysed. We classified the approaches based on the output type into single-sentence summary and multisentence summary approaches. Within each classification, we also compared the approaches in terms of architecture, dataset, dataset preprocessing, evaluation, and results. The remainder of this paper is organised as follows: Section 2 introduces a background of several deep learning models and techniques, such as the recurrent neural network (RNN), bidirectional RNN, attention mechanisms, long short-term memory (LSTM), gated recurrent unit (GRU), and sequence-to-sequence models. Section 3 describes the most recent single-sentence summarisation approaches, while the multisentence summarisation approaches are covered in Section 4. Section 5 and Section 6 investigate datasets and evaluation measures, respectively. Section 7 discusses the challenges of the summarisation process and solutions to these challenges. Conclusions and discussion are provided in Section 8.

2. Background

Deep learning analyses complex problems to facilitate the decision-making process. Deep learning attempts to imitate what the human brain can achieve by extracting features at different levels of abstraction. Typically, higher-level layers have fewer details than lower-level layers [24]. The output layer produces an output by nonlinearly transforming the input received from the input layer. The hierarchical structure of deep learning can support learning. The level of abstraction of a certain layer determines the level of abstraction of the next layer, since the output of one layer is the input of the next layer. In addition, the number of layers determines the depth, which affects the level of learning [25].

Deep learning is applied in several NLP tasks since it facilitates the learning of multilevel hierarchical representations of data using several layers of nonlinear processing units [24, 26–28]. Various deep learning models have been employed for abstractive summarisation, including RNNs, convolutional neural networks (CNNs), and sequence-to-sequence models. We cover these deep learning models in more detail in this section.

2.1. RNN Encoder-Decoder Summarization. The RNN encoder-decoder architecture is based on the sequence-to-sequence model. The sequence-to-sequence model maps the input sequence in the neural network to a target sequence that consists of characters, words, or phrases. This model is utilised in several NLP applications, such as machine translation and text summarisation. In text summarisation, the input sequence is the document that needs to be summarised, and the output is the summary [29, 30], as shown in Figure 1.

An RNN is a deep learning model that is applied to process data in sequential order, such that the input of a certain state depends on the output of the previous state [31, 32]. For example, in a sentence, the meaning of a word is closely related to the meanings of the previous words. An RNN consists of a set of hidden states that are learned by the neural network. An RNN may consist of several layers of hidden states, where different states and layers learn different features. The last state of each layer represents the whole input of the layer, since it accumulates the values of all previous states [5]. For example, the first layer and its states can be employed for part-of-speech tagging, while the second layer learns to create phrases. In text summarisation, the input for the RNN is the embedding of words, phrases, or sentences, and the output is the word embedding of the summary [5].

In the RNN encoder-decoder model, at the encoder side, at a certain hidden state the vector representation of the current input word and the outputs of the hidden states of all previous words are combined and fed to the next hidden state. As shown in Figure 1, the vector representation of the word W3 and the outputs of the hidden states he1 and he2 are combined and fed as input to the hidden state he3. After all the words of the input string have been fed, the output generated from the last hidden state of the encoder is fed to the decoder as a vector, referred to as the context vector [29]. In addition to the context vector, which is fed to the first hidden state of the decoder, the start-of-sequence symbol <SOS> is fed to generate the first word of the summary headline (assume W5, as shown in Figure 1). In this case, W5 is fed as the input to the next decoder hidden state. Each generated word is passed as an input to the next decoder hidden state to generate the next word of the summary. The last generated word is the end-of-sequence symbol <EOS>. Before the summary is generated, each output from the decoder takes the form of a distributed representation before it is sent to the softmax layer and attention mechanism to generate the next summary word [29].
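
As a concrete illustration of this encoder-decoder data flow, the following minimal PyTorch sketch wires an embedding layer, an LSTM encoder, and an LSTM decoder together. All names and sizes (vocab_size, emb_dim, hidden_size) are illustrative assumptions rather than details of any surveyed implementation, and the attention mechanism discussed later is omitted for brevity.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    # Minimal encoder-decoder: the encoder's final hidden state acts as the
    # context vector that initialises the decoder (no attention mechanism here).
    def __init__(self, vocab_size, emb_dim=128, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encode the source document word by word; keep only the final states.
        _, context = self.encoder(self.embed(src_ids))
        # Decode the summary conditioned on the context vector; during training
        # the reference summary tokens are fed in (teacher forcing).
        dec_states, _ = self.decoder(self.embed(tgt_ids), context)
        return self.out(dec_states)  # logits over the target vocabulary

# Toy usage: a batch of 2 source sequences (length 7) and summaries (length 4).
model = Seq2Seq(vocab_size=10000)
src = torch.randint(0, 10000, (2, 7))
tgt = torch.randint(0, 10000, (2, 4))
logits = model(src, tgt)  # shape: (2, 4, 10000)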

2.2. Bidirectional RNN. A bidirectional RNN consists of a forward RNN and a backward RNN. The forward RNN generates a sequence of hidden states after reading the input sequence from left to right, while the backward RNN generates a sequence of hidden states after reading the input sequence from right to left. The representation of the input sequence is the concatenation of the forward and backward hidden states [33]. Therefore, the representation of each word depends on the representations of the preceding (past) and following (future) words. In this case, the context contains both the words to the left and the words to the right of the current word [34].
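
Formally, if the forward and backward hidden states of the i-th input word are denoted by forward h_i and backward h_i, the bidirectional representation is simply their concatenation (a standard formulation consistent with [33, 34]):

h_i = [\overrightarrow{h}_i \, ; \, \overleftarrow{h}_i]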

Using a bidirectional RNN enhances the performance. For example, consider the input text "Sara ate a delicious pizza at dinner tonight" and assume that we want to predict the representation of the word "dinner" using a bidirectional RNN; the forward LSTM represents "Sara ate a delicious pizza at", while the backward LSTM represents "tonight". Considering the word "tonight" when representing the word "dinner" provides better results.

On the other hand, using a bidirectional RNN at the decoder side minimizes the probability of a wrong prediction. The reason is that a unidirectional RNN only considers the previous prediction and reasons only about the past. Therefore, if there is an error in a previous prediction, the error will accumulate in all subsequent predictions; this problem can be addressed using a bidirectional RNN [35].

2.3. Gated Recurrent Neural Networks (LSTM and GRU). Gated RNNs are employed to solve the problem of vanishing gradients, which occurs when training a long sequence using an RNN. This problem can be solved by allowing the gradients to backpropagate along a linear path using gates, where each gate has a weight and a bias. Gates can control and modify the amount of information that flows between hidden states. During training, the weights and biases of the gates are updated. The most popular gated RNNs are LSTM [36] and the GRU [37], which are two variants of the RNN.

2.3.1. Long Short-Term Memory (LSTM). The repeating unit of the LSTM architecture consists of input (read), memory (update), forget, and output gates [5, 7], but the chaining structure is the same as that of an RNN. The four gates share information with each other; thus, information can flow in loops for a long period of time. The four gates of each LSTM unit, which are shown in Figures 2 and 3, are discussed here.

(1) Input Gate. In the first timestep, the input is a vector that is initialised randomly, while in subsequent steps, the input of the current step is the output (content of the memory cell) of the previous step. In all cases, the input is subject to element-wise multiplication with the output of the forget gate. The multiplication result is added to the current memory gate output.

(2) Forget Gate. A forget gate is a neural network with one layer and a sigmoid activation function. The value of the sigmoid function determines whether the information of the previous state should be forgotten or remembered: if the sigmoid value is 1, the previous state is remembered, but if the sigmoid value is 0, the previous state is forgotten. In language modelling, for example, the forget gate remembers the gender of the subject to produce the proper pronouns until it finds a new subject. There are four inputs for the forget gate: the output of the previous block, the input vector, the remembered information from the previous block, and the bias.

(3) Memory Gate. The memory gate controls the effect of the remembered information on the new information. The memory gate consists of two neural networks. The first network has the same structure as the forget gate but a different bias, and the second neural network has a tanh activation function and is utilised to generate the new


Figure 2: LSTM unit architecture [5].

Figure 1: Sequence-to-sequence model; the last hidden state of the encoder is fed as input to the decoder with the symbol <EOS> [51].



information. The new information is formed by adding the old information to the result of the element-wise multiplication of the outputs of the two memory gate neural networks.

(4) Output Gate. The output gate controls the amount of new information that is forwarded to the next LSTM unit. The output gate is a neural network with a sigmoid activation function that takes the input vector, the previous hidden state, the new information, and the bias as input. The output of the sigmoid function is multiplied by the tanh of the new information to produce the output of the current block.
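
The gates described above are usually written as follows (the standard LSTM formulation [36]; the grouping of terms and the naming of gates may differ slightly from the block-diagram description in Figures 2 and 3):

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
h_t = o_t \odot \tanh(c_t)

Here f_t, i_t, and o_t are the forget, input, and output gates, c_t is the memory cell, and \odot denotes element-wise multiplication.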

2.3.2. Gated Recurrent Unit (GRU). A GRU is a simplified LSTM with two gates, a reset gate and an update gate, and there is no explicit memory cell. The previous hidden state information is forgotten when all the reset gate elements approach zero; then, only the input vector affects the candidate hidden state. In this case, the update gate acts as a forget gate. LSTM and the GRU are both commonly employed for abstractive summarisation: LSTM has a memory unit that provides extra control, whereas the computation time of the GRU is lower [38]. In addition, while it is easier to tune the parameters with LSTM, the GRU takes less time to train [30].
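
The reset gate r_t and update gate z_t mentioned above are commonly formulated as follows (standard GRU equations [37]; the roles of z_t and 1 - z_t in the last line are sometimes written with the opposite convention):

z_t = \sigma(W_z x_t + U_z h_{t-1})
r_t = \sigma(W_r x_t + U_r h_{t-1})
\tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1}))
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t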

2.4. Attention Mechanism. The attention mechanism was employed for neural machine translation [33] before being utilised for NLP tasks such as text summarisation [18]. A basic encoder-decoder architecture may fail when given long sentences, since the size of the encoding is fixed for the input string; thus, it cannot consider all the elements of a long input. To remember the input that has a significant impact on the summary, the attention mechanism was introduced [29]. The attention mechanism is employed at each output word to calculate a weight between the output word and every input word; the weights sum to one. The advantage of using weights is to show which input word must receive attention with respect to the output word. The weighted average of the last hidden layers of the decoder in the current step is calculated after passing each input word and is fed to the softmax layer along with the last hidden layers [39].
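
In the additive (Bahdanau-style) formulation [33], the unnormalised score e_{t,i} between the decoder state s_{t-1} and the encoder hidden state h_i, the attention weights \alpha_{t,i}, and the context vector c_t are computed as:

e_{t,i} = v^\top \tanh(W_s s_{t-1} + W_h h_i)
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}
c_t = \sum_i \alpha_{t,i} h_i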

2.5. Beam Search. Beam search and greedy search are very similar; however, while greedy search considers only the best hypothesis, beam search considers b hypotheses, where b represents the beam width or beam size [5]. In text summarisation tasks, the decoder utilises the final encoder representation to generate the summary from the target vocabulary. In each step, the output of the decoder is a probability distribution over the target vocabulary. Thus, to obtain the output word from the learned probability distribution, several methods can be applied, including (1) greedy sampling, which selects the mode of the distribution; (2) 1-best, or beam search, which selects the best output; and (3) n-best, or beam search, which selects several outputs. When n-best beam search is employed, the top b most relevant target words are selected from the distribution and fed to the next decoder state. The decoder keeps only the top k1 of the k words from the different inputs and discards the rest.
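
The following sketch illustrates a generic beam-search decoding loop. Here step_fn stands in for the trained decoder step (it returns log-probabilities over the vocabulary given a partial summary); the beam width, token ids, and the toy scoring function are illustrative assumptions, not details of any surveyed model.

import math

def beam_search(step_fn, start_id, eos_id, beam_width=4, max_len=30):
    # Keep the beam_width highest-scoring partial summaries at each step.
    beams = [([start_id], 0.0)]            # (token sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:          # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            log_probs = step_fn(seq)       # dict: token id -> log P(token | seq)
            for tok, lp in log_probs.items():
                candidates.append((seq + [tok], score + lp))
        # Prune: keep only the top beam_width hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0][0]                     # best-scoring summary

# Toy example: a fake decoder that always prefers token 2, then EOS (id 3).
toy = lambda seq: {2: math.log(0.6), 3: math.log(0.3), 1: math.log(0.1)}
print(beam_search(toy, start_id=0, eos_id=3))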

2.6. Distributed Representation (Word Embedding). A word embedding is a distributional vector representation of a word that captures its syntactic and semantic features [40]. Words must be converted to vectors to handle various NLP challenges, such that the semantic similarity between words can be calculated using cosine similarity, Euclidean distance, etc. [41–43]. In NLP tasks, the word embeddings of the words are fed as inputs to neural network models. In the recurrent neural network encoder-decoder architecture, which is employed to generate the summaries, the input of the model is the word embedding of the text, and the output is the word embedding of the summary.
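
As a small illustration of the similarity computation mentioned above, the cosine similarity between two embedding vectors can be computed as follows; the toy vectors and their dimensionality are purely hypothetical.

import numpy as np

def cosine_similarity(u, v):
    # cos(u, v) = u.v / (||u|| * ||v||); 1.0 means identical direction.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional embeddings (real models use hundreds of dimensions).
king = np.array([0.8, 0.1, 0.7, 0.2])
queen = np.array([0.7, 0.2, 0.8, 0.1])
apple = np.array([0.1, 0.9, 0.0, 0.6])

print(cosine_similarity(king, queen))  # high: semantically related words
print(cosine_similarity(king, apple))  # low: unrelated words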

In NLP, there are several word embedding models, such as Word2Vec, GloVe, FastText, and Bidirectional Encoder Representations from Transformers (BERT), which are the most recently employed word embedding models [41, 44–47]. The Word2Vec model consists of two approaches, skip-gram and continuous bag-of-words (CBOW), which both depend on the context window [41]. On the other hand, GloVe represents the

Figure 3: LSTM unit gates [5]: (a) input gate, (b) forget gate, (c) memory gate, (d) output gate.


global vector, which is based on statistics of the global corpus instead of the context window [44]. FastText extends the skip-gram approach of the Word2Vec model by using subword internal information to address out-of-vocabulary (OOV) terms [46]. In FastText, the subword components are composed to build the vector representation of the words, which facilitates representation of word morphology and lexical similarity. The BERT word embedding model is based on a multilayer bidirectional transformer encoder [47, 48]. Instead of using sequential recurrence, the transformer neural network utilises parallel attention layers. BERT creates a single large transformer by combining the representations of the words and sentences. Furthermore, BERT is pretrained with an unsupervised objective over a large amount of text.

2.7. Transformers. The contextual representations of language are learned from large corpora. One of the new language representations, which extends word embedding models, is BERT, mentioned in the previous section [48]. In BERT, two special tokens are inserted into the text. The first token, (CLS), is employed to aggregate the information of the whole text sequence. The second token is (SEP); this token is inserted at the end of each sentence to represent it. The resultant text consists of tokens, where each token is assigned three types of embeddings: token, segmentation, and position embeddings. The token embedding indicates the meaning of a token, the segmentation embedding identifies the sentences, and the position embedding determines the position of the token. The sum of the three embeddings is fed to the bidirectional transformer as a single vector. Pretrained word embedding vectors are more precise and rich in semantic features. BERT has the advantage of supporting both fine-tuning (based on the objectives of certain tasks) and feature-based methods. Moreover, transformers compute the representations of the input and output by using self-attention, where self-attention enables learning of the relevance between each "word pair" [47].
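
The self-attention underlying the transformer (and hence BERT) is the scaled dot-product attention of [47], in which the queries Q, keys K, and values V are linear projections of the token representations and d_k is the key dimension:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V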

3. Single-Sentence Summary

Recently, the RNN has been employed for abstractive text summarisation and has provided significant results. Therefore, we focus on abstractive text summarisation based on deep learning techniques, especially the RNN [49]. We discuss the approaches that have applied deep learning for abstractive text summarisation since 2015; an RNN with an attention mechanism was most frequently utilised. We classified the research according to summary type (i.e., single-sentence or multisentence summary), as shown in Figure 4. We also compared the approaches in terms of encoder-decoder architecture, word embedding, dataset and dataset preprocessing, and evaluations and results. This section covers single-sentence summary methods, while Section 4 covers multisentence summary methods. Single-sentence summary methods include a neural attention model for abstractive sentence summarisation [18], abstractive sentence summarisation with attentive RNN (RAS) [39], the quasi-RNN [50], a method for generating news headlines with RNNs [29], abstractive text summarisation using an attentive sequence-to-sequence RNN [38], neural text summarisation [51], selective encoding for abstractive sentence summarisation (SEASS) [52], faithful to the original: fact aware neural abstractive summarization (FTSumg) [53], and the improving transformer with sequential context (RCT) [54].

3.1. Abstractive Summarization Architecture

3.1.1. Feedforward Architecture. Neural networks were first employed for abstractive text summarisation by Rush et al. in 2015, where a local attention-based model was utilised to generate summary words by conditioning them on the input sentences [18]. Three types of encoders were applied: the bag-of-words encoder, the convolutional encoder, and the attention-based encoder. The bag-of-words model of the embedded input was used to distinguish between stop words and content words; however, this model had a limited ability to represent continuous phrases. Thus, a model that utilised a deep convolutional encoder was employed to allow the words to interact locally without the need for context. The convolutional encoder model can alternate between temporal convolution and max-pooling layers using the standard time-delay neural network (TDNN) architecture; however, it is limited to a single output representation. This limitation of the convolutional encoder model was overcome by the attention-based encoder. The attention-based encoder exploits the learned soft alignment to weight the input based on the context to construct a representation of the output. Furthermore, a beam-search decoder was applied to limit the number of hypotheses in the summary.

3.1.2. RNN Encoder-Decoder Architecture

(1) LSTM-RNN. An abstractive sentence summarisation model that employs a conditional recurrent neural network (RNN) to generate the summary from the input is referred to as a recurrent attentive summariser (RAS) [39]. The RAS is an extension of the work in [18]. In [18], the model employed a feedforward neural network, while the RAS employed an RNN-LSTM. The encoder and decoder in both models were trained using sentence-summary pair datasets, but the decoder of the RAS improved the performance since it considered the position information of the input words. Furthermore, previous words and input sentences were employed to produce the next word of the summary during the training phase.

Lopyrev [29] proposed a simplified attention mechanism that was utilised in an encoder-decoder RNN to generate headlines for news articles. The news article was fed into the encoder one word at a time and then passed through the embedding layer to generate the word representation. The experiments were conducted using simple and complex attention mechanisms. In the simple attention mechanism, the last layer, after processing the input in the encoder, was divided into two parts: one part for calculating the attention weight vector and one part for calculating the context vector, as shown in Figure 5(a). However, in the complex attention mechanism, the last layer was employed to calculate the attention weight vector and the context vector without fragmentation, as shown in Figure 5(b). In both figures, the solid lines indicate the part of the hidden state of the last layer that is employed to compute the context vector, while the dashed lines indicate the part of the hidden state of the last layer that is applied to compute the attention weight vector. The same difference was seen on the decoder side: in the simple attention mechanism, the last layer was divided into two parts (one part was passed to the softmax layer, and the other part was applied to calculate the attention weight), while in the complex attention mechanism, no such division was made. A beam search was performed at the decoder side during testing to extend the most probable sequences.

Figure 4: Taxonomy of several approaches that use a recurrent neural network and attention mechanism in abstractive text summarisation, based on the summary type (single-sentence summary versus multisentence summary).


The encoder-decoder RNN and sequence-to-sequence models were utilised in [55], which mapped the inputs to the target sequences; the same approach was also employed in [38, 51]. Three different methods were proposed for calculating the global attention scoring function, including dot product scoring, the bilinear form, and a scalar value calculated from a projection of the hidden states of the RNN encoder [38]. The model applied LSTM cells instead of GRU cells (both LSTM and GRU are commonly employed for abstractive summarisation tasks, since LSTM has a memory unit that provides control, but the computation time of the GRU is lower). Three models were employed: the first model applied unidirectional LSTM in both the encoder and the decoder; the second model was implemented using bidirectional LSTM in the encoder and unidirectional LSTM in the decoder; and the third model utilised a bidirectional LSTM encoder and an LSTM decoder with global attention. The first hidden state of the decoder is the concatenation of all backward and forward hidden states of the encoder. The use of attention in an encoder-decoder neural network generates a context vector at each timestep. For the local attention mechanism, the context vector is conditioned on a subset of the encoder's hidden states, while for the global attention mechanism, the vector is conditioned on all the encoder's hidden states. After the first decoder output is generated, the next decoder input is the word embedding of the output of the previous decoder step. An affine transformation is used to convert the output of the decoder LSTM to a dense vector prediction, because of the long training time that would be needed if the number of hidden states were the same as the number of words in the vocabulary.
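
The three global-attention scoring functions mentioned above are commonly written as follows, where s_t is the decoder hidden state, h_i is an encoder hidden state, and W_a and v_a are learned parameters (notation assumed here; [38] may parameterise them slightly differently):

\mathrm{score}(s_t, h_i) = s_t^\top h_i \quad \text{(dot product)}
\mathrm{score}(s_t, h_i) = s_t^\top W_a h_i \quad \text{(bilinear)}
\mathrm{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t ; h_i]) \quad \text{(projection of the hidden states)}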

Khandelwal [51] employed a sequence-to-sequence model that consists of an LSTM encoder and an LSTM decoder for abstractive summarisation of small datasets. The decoder generated the output summary after reading the hidden representations generated by the encoder and passing them to the softmax layer. The sequence-to-sequence model does not memorize information, so generalization of the model is not possible. Thus, the proposed model utilised imitation learning to determine whether to choose the golden token (i.e., the reference summary token) or the previously generated output at each step.

(2) GRU-RNN. A combination of elements of the RNN and the convolutional neural network (CNN) was employed in an encoder-decoder model that is referred to as a quasi-recurrent neural network (QRNN) [50]. In the QRNN, the GRU was utilised in addition to the attention mechanism. The QRNN was applied to address the limitation of parallelisation; it obtains the dependencies of the words in previous steps via convolution and "fo-pooling", which are performed in parallel, as shown in Figure 6. The convolution in the QRNN can be either masked convolution (considering previous timesteps only) or centre convolution (also considering future timesteps). The encoder-decoder model employed two neural networks: the first network applied the centre convolution of the QRNN and consisted of multiple hidden layers that were fed by the vector representations of the words, and the second network comprised neural attention and considered as input the encoder hidden layers to generate one word of a headline. The decoder accepted the previously generated headline word and produced the next word of the headline; this process continued until the headline was completed.

SEASS, an extension of the sequence-to-sequence recurrent neural network, was proposed in [52]. The selective encoding for abstractive sentence summarisation (SEASS) approach includes a selective encoding model that consists of an encoder for sentences, a selective gate network, and a decoder with an attention mechanism, as shown in Figure 7. The encoder uses a bidirectional GRU, while the decoder uses a unidirectional GRU with an attention mechanism. The encoder reads the input words and produces their representations. The meaning of the sentence is applied by the selective gate to filter the word representations used for generating the sentence representation. To produce an excellent summary and accelerate the decoding process, a beam search was selected for the decoder.
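
A sketch of how such a selective gate can be formulated, following the description above and assuming that the sentence representation s is built from the encoder's final forward and backward states (the exact parameterisation in [52] may differ):

\mathrm{sGate}_i = \sigma(W_s h_i + U_s s + b)
h_i' = h_i \odot \mathrm{sGate}_i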

On the other hand, dual attention was applied in [53]. The proposed dual attention approach consists of three modules: two bidirectional GRU encoders and one dual attention decoder. The decoder has a gate network for context selection, as shown in Figure 8, and employs copying and coverage mechanisms. The outputs of the encoders are two context vectors: one context vector for the sentences and one context vector for the relations, where a relation may be

Figure 5: (a) Simple attention and (b) complex attention [29].


Figure 6: Comparison of the CNN, LSTM, and QRNN models [50].

Figure 7: Selective encoding for abstractive sentence summarisation (SEASS) [52].

Figure 8: Faithful to the original [53].


a triple or a tuple relation. A triple relation consists of the subject, predicate, and object, while a tuple relation consists of either (subject, predicate) or (predicate, object). Sometimes the triple relation cannot be extracted; in this case, two tuple relations are utilised. The decoder gate merges both context vectors based on their relative association.

(3) Others. The poor semantic representation of long sequences in abstractive text summarisation approaches that are based on an RNN encoder-decoder framework was addressed using the RC-Transformer (RCT) [54]. The RCT is an RNN-based abstractive text summarisation model that is composed of two encoders (an RC encoder and a transformer encoder) and one decoder. The transformer has an advantage in parallel computing in addition to retrieving global semantic relationships of the context. On the other hand, sequential context representation is achieved by the second encoder of the RC-Transformer. Word ordering is crucial for abstractive text summarisation and cannot be obtained by positional encoding alone; therefore, the RCT utilised two encoders to address the shortage of sequential information at the word level. A beam search was utilised at the decoder. Furthermore, Cai et al. compared the speed of the RCT model and that of RNN-based models and concluded that the RCT is 1.4x and 1.2x faster.

3.2. Word Embedding. In the QRNN model, GloVe word embedding, which was pretrained using the Wikipedia and Gigaword datasets, was employed to represent the text and the summary [50]. In the first model proposed by Jobson et al., the word embedding was randomly initialised and updated during training, while GloVe word embedding was employed to represent the words in the second and third models [38]. In the study by Cai et al., the Transformer was utilised [54].

3.3. Dataset and Dataset Preprocessing. In the model that was proposed by Rush et al., the datasets were preprocessed via PTB tokenization by using a special symbol to replace all digits, converting all letters to lowercase, and using "UNK" to replace words that occurred fewer than 5 times [18]. The model was trained with any input-output pairs due to the shortage of constraints for generating the output. The training process was carried out on the Gigaword dataset, while the summarisation evaluation was conducted on DUC2003 and DUC2004 [18]. Furthermore, the model proposed by Chopra et al. was trained using the Gigaword corpus with sentence separation and tokenisation [39]. To form sentence-summary pairs, each headline of an article was paired with the first sentence of the article. The same data preprocessing steps as in [18] were performed in [39]. Moreover, the Chopra et al. model was evaluated using the DUC2004 dataset, which consists of 500 pairs.

Gigaword datasets were also employed for the QRNN model [50]. Furthermore, articles that started with sentences that contained more than 50 words, or headlines with more than 25 words, were removed. Moreover, the words in the articles and their headlines were converted to lowercase, and the data points were split into short, medium, and long sentences based on the lengths of the sentences to avoid extra padding.

Lopyrev and Jobson et al. trained their models using Gigaword after processing the data. In the Lopyrev model, the most crucial preprocessing steps for both the text and the headline were tokenisation and conversion of characters to lowercase [29]. In addition, only the characters of the first paragraph were retained, and the length of the headline was fixed at between 25 and 50 words. Moreover, articles without headlines were disregarded, and the <unk> symbol was used to replace rare words.

Khandelwal employed the Association for Computational Linguistics (ACL) Anthology Reference Corpus, which consists of 16,845 examples for training and 500 examples for testing; these were considered small datasets in the experiments [51]. The abstract included the first three sentences, and the unigram overlap between the title and the abstract was also calculated. There were 25 tokens in the summary and a maximum of 250 tokens in the input text.

The English Gigaword dataset, the DUC2004 corpus, and MSR-ATC were selected to train and test the SEASS model [52]. Moreover, the experiments of the Cao et al. model were conducted using the Gigaword dataset [53]. The same data preprocessing steps as in [18] were performed in [52, 53]. Moreover, the RCT also employed the Gigaword and DUC2004 datasets in experiments [54].

3.4. Evaluation and Results. Recall-Oriented Understudy for Gisting Evaluation 1 (ROUGE1), ROUGE2, and ROUGE-L were utilised to evaluate the Rush et al. model, and values of 28.18, 8.49, and 23.81, respectively, were obtained [18]. The experimental results of the Chopra et al. model showed that, although DUC2004 was too complex for the experiments, on the Gigaword corpus the proposed model outperformed state-of-the-art methods in terms of ROUGE1, ROUGE2, and ROUGE-L [39]. The values of ROUGE1, ROUGE2, and ROUGE-L were 28.97, 8.26, and 24.06, respectively. On the other hand, BLEU was employed to evaluate the Lopyrev model [29], while Khandelwal utilised perplexity [51]. The SEASS model was evaluated using ROUGE1, ROUGE2, and ROUGE-L, and the results of the three measures were 36.15, 17.54, and 33.63, respectively [52]. Moreover, ROUGE1, ROUGE2, and ROUGE-L were selected for evaluating the Cao et al. model [53]. The values of ROUGE1, ROUGE2, and ROUGE-L were 37.27, 17.65, and 34.24, respectively, and the results showed that fake summaries were reduced by 80%. In addition, the RCT was evaluated using ROUGE1, ROUGE2, and ROUGE-L, with values of 37.27, 18.19, and 34.62 on the Gigaword dataset. The results showed that the RCT model outperformed other models by generating a high-quality summary that contains salient information [54].


4. Multisentence Summary

In this section, multisentence summary and deep learning-based abstractive text summarisation are discussed. Multisentence summary methods include the get to the point method (summarisation with pointer-generator networks) [56], a deep reinforced model for abstractive summarization (RL) [57], the generative adversarial network for abstractive text summarization [58], semantic phrase exploration (ATSDL) [30], the bidirectional attentional encoder-decoder and bidirectional beam search [35], the key information guide network [59], text summarisation abstraction improvement [60], dual encoding for abstractive text summarisation (DEATS) [61], abstractive document summarisation via bidirectional decoder (BiSum) [62], the text abstraction summary model based on BERT word embedding and RL [63], the transformer-based model for single documents neural summarisation [64], text summarisation with pretrained encoders [65], and the text summarisation method based on the double attention pointer network [49]. The pointer-generator [55] covers both single-sentence and multisentence summaries. Additional details are presented in the following sections.

4.1. Abstractive Summarization Architecture

4.1.1. LSTM-RNN. A novel abstractive summarisation method was proposed in [56]; it generates a multisentence summary and addresses sentence repetition and inaccurate information. See et al. proposed a model that consists of a single-layer bidirectional LSTM encoder, a single-layer unidirectional LSTM decoder, and the sequence-to-sequence attention model proposed by [55]. The See et al. model generates a long text summary instead of headlines, which consist of only one or two sentences. Moreover, the attention mechanism was employed, and the attention distribution facilitates the production of the next word of the summary by telling the decoder where to look in the source words, as shown in Figure 9. This mechanism constructs a weighted sum of the hidden states of the encoder that facilitates the generation of the context vector, where the context vector is a fixed-size representation of the input. The probability (Pvocab) produced by the decoder is employed to generate the final prediction using the context vector and the decoder's last step. Furthermore, the value of Pvocab is equal to zero for OOV words. RL was employed for abstractive text summarisation in [57]. The proposed method in [57], which combined RL with supervised word prediction, was composed of a bidirectional LSTM-RNN encoder and a single LSTM decoder.
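
Using the notation of [56], the attention distribution a^t over the encoder states h_i for decoder state s_t, the context vector h_t^*, the vocabulary distribution P_vocab, and the generation probability p_gen that softly switches between generating from the vocabulary and copying from the source are computed as follows:

e_i^t = v^\top \tanh(W_h h_i + W_s s_t + b_{attn})
a^t = \mathrm{softmax}(e^t)
h_t^* = \sum_i a_i^t h_i
P_{vocab} = \mathrm{softmax}(V'(V[s_t, h_t^*] + b) + b')
p_{gen} = \sigma(w_{h^*}^\top h_t^* + w_s^\top s_t + w_x^\top x_t + b_{ptr})
P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t

The last line shows why OOV words can still be produced: even when P_vocab(w) is zero, a word that appears in the source receives probability mass through the copy term.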

Two models, a generative model and a discriminative model, were trained simultaneously to generate abstractive summary text using an adversarial process [58]. The maximum likelihood estimation (MLE) objective function employed in previous sequence-to-sequence models suffers from two problems: the difference between the training loss and the evaluation metric, and the unavailability of a golden token at testing time, which causes errors to accumulate during testing. To address these problems, the proposed approach exploited the adversarial framework. In the first step of the adversarial framework, reinforcement learning was employed to optimize the generator, which generates the summary from the original text. In the second step, the discriminator, which acts as a binary classifier, classified the summary as either a ground-truth summary or a machine-generated summary. The bidirectional LSTM encoder and attention mechanism were employed as in [56].

Abstractive text summarisation using the LSTM-CNN model based on exploring semantic phrases (ATSDL) was proposed in [30]. ATSDL is composed of two phases: the first phase extracts the phrases from the sentences, while the second phase learns the collocation of the extracted phrases using the LSTM model. To generate sentences that are general and natural, the input and output of the ATSDL model were phrases instead of words, and the phrases were divided into three main types, i.e., subject, relation, and object phrases, where the relation phrase represents the relation between the input phrase and the output phrase. Each phrase was represented using a CNN layer. There are two main reasons for choosing the CNN: first, the CNN is efficient for sentence-level applications, and second, training is efficient since long-term dependency is unnecessary. Furthermore, to obtain several vectors for a phrase, multiple kernels with different widths, which represent the dimensionality of the features, were utilised. Within each kernel, the maximum feature was selected for each row in the kernel via maximum pooling. The resulting values were added to obtain the final value for each word in a phrase. Bidirectional LSTM was employed instead of a GRU on the encoder side, since the parameters are easy to tune with LSTM. Moreover, the decoder was divided into two modes: a generate mode and a copy mode. The generate mode generates the next phrase of the summary based on the previously generated phrases and the hidden layers of the input on the encoder side, while the copy mode copies the phrase after the current input phrase if the currently generated phrase is not suitable for the previously generated phrases in the summary. Figure 10 provides additional details.

Bidirectional encoder and decoder LSTM-RNNs were employed to generate abstractive multisentence summaries [35]. The proposed approach considers both past and future context on the decoder side when making a prediction, as it employs a bidirectional RNN. Using a bidirectional RNN on the decoder side addresses the problem of summary imbalance. An unbalanced summary could occur due to noise in a previous prediction, which reduces the quality of all subsequent summary words. The bidirectional decoder consists of two LSTMs: the forward decoder and the backward decoder. The forward decoder decodes the information from left to right, while the backward decoder decodes the information from right to left. The last hidden state of the forward decoder is fed as the initial input to the backward decoder, and vice versa. Moreover, the researchers proposed a bidirectional beam-search method that generates summaries from the proposed bidirectional model. Bidirectional beam search combines information from the past and future to produce a better summary. Therefore, the output summary is balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence was read in reverse order, based on the conclusion that LSTM learns better when reading the source in reverse order while remembering the order of the target [66, 67]. A softmax layer was employed on the decoder side to obtain the probability of each target word of the summary over the vocabulary distribution by taking the output of the decoder as input for the softmax layer. The decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. The objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token. However, during testing, the input of the forward decoder is the token generated in the previous step. The same situation holds for the backward decoder, where the input during training is the future token from the summary. Nevertheless, the bidirectional decoder has difficulty during testing, since the complete summary must be known in advance; thus, the full backward decoder output is generated first and fed to the forward decoder using a unidirectional backward beam search.

A combination of abstractive and extractive methods was employed in the guiding generation model proposed in [59]. The extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent the key information. Furthermore, to predict the long-term value of the final summary, the proposed method applied a prediction guide mechanism [68]. A prediction

Figure 9: Baseline sequence-to-sequence model with attention mechanism [56].

Figure 10: Semantic-unit-based LSTM model [30].


guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. The encoder-decoder architecture baseline of the proposed model is similar to that proposed by Nallapati et al. [55], where both a bidirectional LSTM encoder and a unidirectional LSTM decoder were employed. Both models applied the attention mechanism and a softmax layer. Moreover, the process of generating the summary was improved by proposing KIGN, which takes as input the keywords extracted using the TextRank algorithm. In KIGN, key information is represented by concatenating the last forward hidden state and the first backward hidden state. KIGN employs the attention mechanism and the pointer mechanism. In general, the attention mechanism hardly identifies the keywords; thus, to identify keywords, the output of KIGN is fed to the attention mechanism. As a result, the attention mechanism will be highly affected by the keywords. However, to enable the pointer network to identify the keywords, which are the output of KIGN, the encoder context vector and the hidden state of the decoder are fed to the pointer network, and the output is employed to calculate the soft switch. The soft switch determines whether to copy the target from the original text or generate it from the target vocabulary, as shown in Figure 11.

The level of abstraction in the summaries generated by abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and the use of a novel metric for optimising the overlap between the n-gram summary and the ground-truth summary. The decoder was decomposed into a contextual network and a pretrained language model, as shown in Figure 12. The contextual network uses the source document to extract the relevant parts, and the pretrained language model is generated via prior knowledge. This decomposition method facilitates the addition of an external pretrained language model that is related to several domains. Furthermore, a novel metric was employed to generate an abstractive summary by including words that are not in the source document. Bidirectional LSTM was utilised in the encoder, and the decoder applied a 3-layer unidirectional weight-dropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applied the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combined reinforcement learning and maximum likelihood.

A bidirectional decoder with a sequence-to-sequence architecture, referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing because the input of the decoder is the previously generated summary word; if one of the generated summary words is incorrect, the error propagates through all subsequent summary words. In the bidirectional decoder, there are two decoders, a forward decoder and a backward decoder. The forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left. The

forward decoder considers a reference from the backward decoder. However, there is only a single-layer encoder. The encoder and decoders employ LSTM units; the encoder utilises bidirectional LSTM, while the decoders use unidirectional LSTM, as shown in Figure 13. To capture the summary generated by the backward decoder, the attention mechanism is applied between the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, a pointer mechanism is employed in both decoders.

A double attention pointer network, referred to as DAPT, was applied to build an abstractive text summarisation model [49]. The encoder utilised bidirectional LSTM, while the decoder utilised unidirectional LSTM. The key features of the encoder were extracted using a self-attention mechanism, and a beam search was employed at the decoder. As a result, more coherent and accurate summaries were generated. The repetition problem was addressed using an improved coverage mechanism with a truncation parameter. The model was optimised using a training procedure based on RL and scheduled sampling.

4.1.2 GRU-RNN

Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. The dual encoder consists of two levels of encoders, i.e., a primary and a secondary encoder, in addition to one decoder, all of which employ a GRU. The primary encoder performs coarse encoding, while the secondary encoder performs fine encoding. The primary encoder and decoder are the same as in the standard encoder-decoder model with an attention mechanism, and the secondary encoder generates a new context vector based on the previous output and input. This additional context vector provides meaningful information for the output; thus, the repetition problem of the generated summary that was encountered in previous approaches is addressed. A semantic vector is generated at both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is computed. The fixed-length output is partially generated at each stage in the decoder, since it decodes in stages.

Figure 14 elaborates the DEATS process. The primary encoder produces a hidden state h_j^p for each input j and a content representation c^p. Next, the decoder decodes a fixed-length output, referred to as the decoder content representation c^d. The weight α_j can be calculated from the hidden states h_j^p and the content representations c^p and c^d. In this stage, the secondary encoder generates new hidden states, or semantic context vectors, h_m^s, which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator, a copy mechanism, and a coverage mechanism.

Wang et al. proposed a hybrid extractive-abstractive text summarisation model based on combining reinforcement learning with BERT word embedding [63]. In this hybrid model, a BERT feature-based strategy was used to


generate contextualised token embeddings. This model consists of two submodels, an abstractive agent and an extractive agent, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In both the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU.

The training process consists of pretraining and full training phases.

Egonmwan et al. proposed the use of sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules, an extractive model and an abstractive model. The encoder transformer has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives a sentence-level representation. The architecture of the abstractive model consists of a single-

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].


layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder. The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.

4.1.3 Others

BERT is employed to represent the sentences of the document and express its semantics [65]. Liu et al. proposed abstractive and extractive summarisation models that are based on the encoder-decoder architecture. The encoder used a BERT pretrained document-level encoder, while the decoder utilised a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder, one stage in extractive summarisation and one stage in abstractive summarisation. At the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed, since these two mechanisms

need additional tuning of the hyperparameters. The repetition problem was addressed by producing diverse summaries using trigram blocking. OOV words rarely appear in the generated summary.
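Trigram blocking is commonly implemented inside beam search by discarding candidate words that would repeat an already generated trigram. The snippet below is a minimal, framework-free sketch of that idea, assuming that candidate scoring and tokenisation are handled elsewhere; the function names are ours and not taken from [65].

```python
def has_repeated_trigram(tokens):
    """Return True if the token sequence contains any trigram twice."""
    seen = set()
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        if trigram in seen:
            return True
        seen.add(trigram)
    return False

def filter_beam_candidates(prefix, candidate_words):
    """Keep only the candidate next words that do not re-create a trigram
    already present in the partial summary `prefix`."""
    return [w for w in candidate_words
            if not has_repeated_trigram(prefix + [w])]

# Example: the trigram "the cat sat" already occurred, so "sat" is blocked.
prefix = "the cat sat on the mat the cat".split()
print(filter_beam_candidates(prefix, ["sat", "slept"]))  # ['slept']
```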

4.2 Word Embedding

The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, in the Paulus et al. model, both the input and output tokens applied the same embedding matrix W_emb, which was generated using the GloVe word embedding model [57]. Another word embedding matrix, referred to as W_out, was applied in the token generation layer. Additionally, a shared weighting matrix was employed by both the shared embedding matrix W_emb and the W_out matrix. The shared weighting matrices improved the process of generating tokens, since they capture syntactic and semantic information in the embeddings.
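Sharing the embedding matrix with the output projection (weight tying) can be expressed in a few lines. The following PyTorch-style sketch illustrates the general technique only; the class name, dimensions, and layer layout are our own assumptions, not the exact parameterisation used in [57].

```python
import torch
import torch.nn as nn

class TiedOutputDecoderHead(nn.Module):
    """Toy decoder head whose output projection shares its weights with the
    input embedding matrix (W_out tied to W_emb)."""

    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.output_proj = nn.Linear(emb_dim, vocab_size, bias=False)
        # Tie the two matrices: both now refer to the same parameter tensor.
        self.output_proj.weight = self.embedding.weight

    def forward(self, decoder_hidden):
        # decoder_hidden: (batch, emb_dim) -> vocabulary logits
        return self.output_proj(decoder_hidden)

head = TiedOutputDecoderHead(vocab_size=50_000, emb_dim=128)
logits = head(torch.randn(2, 128))
print(logits.shape)  # torch.Size([2, 50000])
```

Tying the matrices reduces the number of parameters and forces the token generation layer to reuse the syntactic and semantic regularities encoded in the input embeddings.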

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].


The discriminator input sequence of the Liu et al. model was encoded using a max-pooling CNN, and the result was passed to the softmax layer [58]. On the other hand, the word embedding applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets, with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embeddings, while BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3 Dataset and Dataset Preprocessing

Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, in which source documents of 781 tokens on average are paired with summaries of 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the documents were preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets, the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consists of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve the coreference. The second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP was to perform Stanford NLP parsing, a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing was performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If the child node has children, then the child is considered a new root with children; this process continues recursively until the root has no children, at which point the tree structure is complete. Accordingly, compound phrases can be explored via dependency parsing. An important stage of phrase extraction is refinement, during which redundant and incorrect phrases are filtered out before training by applying simple rules. First, the phrase triples at the topmost level are exploited, since they carry the most semantic information. Second, triple phrases whose subject and object phrases contain no nouns are deleted, since nouns carry a considerable amount of conceptual information. Triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consisted of 92,000 and 219,000 source texts, respectively.

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4 Evaluation and Results

The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1 41.16, ROUGE2 15.75, and ROUGE-L 39.08 [57]. The results for the maximum-likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to evaluate the quality and readability of the summaries. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]. The value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics of the Al-Sabahi et al. and Li et al. models; values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluations included ROUGE1, ROUGE2, and ROUGE-L, for which values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel n-gram-based score was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model: five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected. The evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, in which the WordPiece tokeniser was utilised; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L; the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, on the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-answering paradigm, where 20 documents were selected for evaluation. Three values were used for scoring each answer: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. The ROUGE1, ROUGE2, and ROUGE-L values for the DAPT model on the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attentional encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model consists of a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs (a GRU-RNN for the word level and a GRU-RNN for the sentence level), while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering the full vocabulary, only certain words were added, based on their frequency in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrices for each tag type, similar to word embedding, while the TF-IDF feature was discretised into a fixed number of bins, with a one-hot representation used to encode the bin values. The one-hot vector has one entry per bin, and only one entry is set to one to indicate the TF-IDF value of a given word. This process permitted the TF-IDF to be handled in the same way as any other tag by concatenating all the embeddings into one long vector, as

shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8 million training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and named-entity tag generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
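As an illustration of how a continuous feature such as TF-IDF can be discretised into bins and concatenated with other embeddings, consider the following sketch; the bin count, dimensions, and helper names are our own assumptions and not taken from [55].

```python
import numpy as np

def tfidf_to_onehot(tfidf_value, num_bins=10, max_value=1.0):
    """Discretise a TF-IDF score into a fixed number of bins and return a
    one-hot vector marking the bin it falls into."""
    bin_index = min(int(tfidf_value / max_value * num_bins), num_bins - 1)
    onehot = np.zeros(num_bins)
    onehot[bin_index] = 1.0
    return onehot

def build_input_vector(word_emb, pos_emb, ner_emb, tfidf_value):
    """Concatenate the word, POS, and NER embeddings with the discretised
    TF-IDF one-hot vector into one long input vector."""
    return np.concatenate([word_emb, pos_emb, ner_emb,
                           tfidf_to_onehot(tfidf_value)])

# Example with toy dimensions (200-d word, 16-d POS, 16-d NER embeddings).
vec = build_input_vector(np.random.rand(200), np.random.rand(16),
                         np.random.rand(16), tfidf_value=0.37)
print(vec.shape)  # (242,)
```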

Finally, for both the single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of several approaches appear in Table 2, while the training, optimisation, mechanisms, and search strategy at the decoder are presented in Table 3.

5 Datasets for Text Summarization

Various datasets have been selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets are produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets, even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of the article [5]. The CNN/Daily Mail datasets applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets generated by Hermann et al. [71], which were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (on average 29.74 sentences), while the summaries have 53 words (on average 3.72 sentences) [55].
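For readers who want to inspect these splits, the CNN/Daily Mail corpus is distributed through several public channels; the sketch below assumes the Hugging Face `datasets` package and its `cnn_dailymail` corpus (configuration "3.0.0"), which is not part of the surveyed papers, and simply prints the split sizes and one article/summary pair.

```python
from datasets import load_dataset

# Assumes the Hugging Face `datasets` library and network access.
cnn_dm = load_dataset("cnn_dailymail", "3.0.0")

for split in ("train", "validation", "test"):
    print(split, len(cnn_dm[split]))

sample = cnn_dm["train"][0]
print(sample["article"][:300])   # source document (truncated)
print(sample["highlights"])      # multisentence reference summary (bullet points)
```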

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72]. The


Figure 16: Word embedding concatenated with discretised TF-IDF, POS, and NER one-hot embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | –
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | Word-based decoder RNN
[50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM | Unidirectional RNN attentive encoder-decoder LSTM
 | | Bidirectional LSTM | Unidirectional LSTM
 | | Bidirectional LSTM | Decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the

datasets. Nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study applied the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenisation: a placeholder symbol replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occurred fewer than 5 times | Bag-of-words embedding of the input sentence
[39] | Chopra et al. | PTB tokenisation: a placeholder symbol replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occurred fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenisation | (i) Encodes the position information of the input words; (ii) input text represented using the Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF represented using bins with one-hot representation; (iv) lookup embeddings for part-of-speech and named-entity tags
[52] | Zhou et al. | PTB tokenisation: a placeholder symbol replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occurred fewer than 5 times | Word embedding with size equal to 300
[53] | Cao et al. | Normalisation and tokenisation: a placeholder symbol replaces digits, words are converted to lower case, and "UNK" replaces the least frequent words | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lower case | GloVe word embedding
[29] | Lopyrev | Tokenisation, converting the articles and their headlines to lower case, using the symbol ⟨unk⟩ to replace rare words | Distributed representation of the input
[38] | Jobson et al. | – | Word embedding randomly initialised and updated during training in the first model, while GloVe word embedding was used in the second and third models
[56] | See et al. | – | Word embedding of the input learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | – | CNN with maximum pooling used to encode the discriminator input sequence
[30] | Song et al. | Words segmented using the CoreNLP tool, coreference resolution, and morphology reduction | Convolutional neural network used to represent the phrases
[35] | Al-Sabahi et al. | – | Word embedding learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | – | Word embedding learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | – | BERT
[63] | Wang et al. | WordPiece tokeniser | BERT
[64] | Egonmwan et al. | – | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers reviewed in this work.

6 Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated

summary with a manually generated summary [73]. The package consists of several measures to evaluate the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is an n-gram recall measure, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence. Since the manual

Table 3: Training, optimisation, mechanisms, and search at the decoder.

Reference | Authors | Training and optimisation | Mechanism | Search at decoder (beam size)
[18] | Rush et al. | Stochastic gradient descent to minimise negative log-likelihood | – | Beam search
[39] | Chopra et al. | Minimising negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimise the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimising the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimising the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross-entropy used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | –
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanism | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | –
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | –
[30] | Song et al. | – | Attention mechanism, copy mechanism | –
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

ROUGE-N = \frac{\sum_{S \in \{\text{reference summaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}{\sum_{S \in \{\text{reference summaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)}    (1)

where S is a reference summary, n is the n-gram length, Count_match(gram_n) is the maximum number of matching n-grams between the reference summary and the generated summary, and Count(gram_n) is the total number of n-grams in the reference summary [73].
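The following is a minimal sketch of ROUGE-N recall as defined in equation (1), assuming pre-tokenised summaries; it ignores the stemming and stop-word options of the full ROUGE package, and the function name is ours.

```python
from collections import Counter

def rouge_n_recall(reference_tokens, candidate_tokens, n=1):
    """ROUGE-N recall: clipped n-gram overlap divided by the number of
    n-grams in the reference summary (equation (1))."""
    ref_counts = Counter(tuple(reference_tokens[i:i + n])
                         for i in range(len(reference_tokens) - n + 1))
    cand_counts = Counter(tuple(candidate_tokens[i:i + n])
                          for i in range(len(candidate_tokens) - n + 1))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Count_match: each reference n-gram is credited at most as many times
    # as it appears in the candidate summary.
    match = sum(min(count, cand_counts[gram])
                for gram, count in ref_counts.items())
    return match / total

reference = "ahmed ate the apple".split()
candidate = "the apple ahmed ate".split()
print(rouge_n_recall(reference, candidate, n=1))  # 1.0
print(rouge_n_recall(reference, candidate, n=2))  # 0.666...
```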

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum number of matching words, in order, between the reference summary and the generated summary. The LCS calculation does not require the matching words to be consecutive; however, their order of occurrence matters, and no predefined number of matching words is required. LCS considers only the main in-sequence match, which is one of its disadvantages, since the final score will not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets [61].

Table 4: Abstractive summarisation datasets.

Reference | Training | Evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | –
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: The apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" but not both, because it relies on the LCS.
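A minimal dynamic-programming sketch of the LCS computation underlying ROUGE-L is shown below; tokenisation and the recall/precision/F-score wrapping of the full ROUGE-L metric are omitted, and the function name is ours.

```python
def lcs_length(ref_tokens, cand_tokens):
    """Length of the longest common subsequence between two token lists.
    Matching words must keep their relative order but need not be adjacent."""
    m, n = len(ref_tokens), len(cand_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

R = "ahmed ate the apple".split()
A = "the apple ahmed ate".split()
# Only one in-sequence match ("ahmed ate" or "the apple") is counted.
print(lcs_length(R, A))  # 2
```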

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]; these models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, at 43.85, 20.34, and 39.90, respectively [65]. Even though ROUGE is employed to evaluate abstractive summarisation, it would be better to devise new methods to evaluate summary quality. New evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text; ROUGE, by contrast, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups. The first group considers single-sentence summary approaches, while the second group considers multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both single-sentence and multisentence summaries are those that employed BERT word embedding and were based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset: Gigaword is utilised for single-sentence summaries, as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries. Thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the summaries generated for 50 test examples by 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. The results clearly show that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is

not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summaries [60]. Five human evaluators rated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where 1 indicates that the generated summary is less readable and less relevant and 10 indicates that it is readable and highly relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. With respect to relevance, the mean values of the three models are comparable, at 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summaries in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retention of key information, which was achieved through human evaluation [65]. The qualitative evaluation also assessed the output in terms of grammatical mistakes. Three values were used for scoring 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not sufficient for evaluating the generated summaries of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures alone are also not sufficient, due to the small number of test examples and evaluators.

7 Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1 Unavailability of the Golden Token during Testing

Due to the availability of golden tokens (i.e., reference summary tokens) during training, the previous tokens of the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either the gold token is fed, as in standard training, or the prediction from the previous step is fed, as occurs during testing. In this manner, at least the training step receives the same


input as testing. In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, the generated word of the previous step is fed back 10% of the time [75, 76].

Moreover, the mass convolution of the QRNN is applied in [50], since the dependency on words generated in the future is difficult to determine.
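A common way to implement such a coin flip is scheduled sampling; the sketch below illustrates the general idea in PyTorch, and the step function, probability, and tensor shapes are our own assumptions rather than the exact DaD or teacher-forcing setups of [51] or [29].

```python
import random
import torch

def decode_with_scheduled_sampling(decoder_step, start_token, gold_tokens,
                                   use_gold_prob=0.9):
    """Run a decoder for len(gold_tokens) steps, feeding the gold token with
    probability `use_gold_prob` and the model's own previous prediction
    otherwise (mimicking test-time conditions during training)."""
    outputs = []
    prev_token = start_token      # the first decoder input is the start token
    hidden = None
    for gold in gold_tokens:
        logits, hidden = decoder_step(prev_token, hidden)
        predicted = logits.argmax(dim=-1)
        outputs.append(logits)
        # Coin flip: gold token (teacher forcing) vs. previous prediction.
        prev_token = gold if random.random() < use_gold_prob else predicted
    return torch.stack(outputs)

# Toy decoder_step for illustration: embeds the token id and projects to vocab.
emb = torch.nn.Embedding(100, 16)
proj = torch.nn.Linear(16, 100)
def toy_step(token, hidden):
    return proj(emb(token)), hidden

gold = [torch.tensor([i]) for i in (5, 7, 9)]
logits = decode_with_scheduled_sampling(toy_step, torch.tensor([1]), gold)
print(logits.shape)  # torch.Size([3, 1, 100])
```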

7.2 Out-of-Vocabulary (OOV) Words

One of the challenges that may occur during testing is that central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. A switch on the decoder side alternates between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copy it to the output. When the switch is turned on, the decoder generates a word from the target vocabulary. Similarly, researchers in [56] addressed OOV words via a generation probability P_gen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, P_gen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them. The combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the input sequence position for

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | –
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset.


unseen tokens in order to copy them. Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. In addition, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
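The pointer-generator switch can be summarised in a few lines. The sketch below follows the general form of the mechanism (a learned scalar P_gen mixing the vocabulary and copy distributions), but the class name, shapes, and the way attention is supplied are our own illustrative assumptions, not the exact formulation of [56].

```python
import torch
import torch.nn as nn

class PointerGeneratorSwitch(nn.Module):
    """Mix a vocabulary distribution with a copy (attention) distribution
    using a learned scalar p_gen computed from the context vector,
    decoder state, and decoder input."""

    def __init__(self, hidden_dim, emb_dim, vocab_size, max_src_len):
        super().__init__()
        self.p_gen_layer = nn.Linear(2 * hidden_dim + emb_dim, 1)
        self.extended_vocab_size = vocab_size + max_src_len  # slots for copied OOVs

    def forward(self, vocab_dist, attn_dist, context, dec_state, dec_input, src_ids):
        # p_gen in (0, 1): probability of generating from the vocabulary.
        p_gen = torch.sigmoid(
            self.p_gen_layer(torch.cat([context, dec_state, dec_input], dim=-1)))
        final = torch.zeros(vocab_dist.size(0), self.extended_vocab_size)
        final[:, :vocab_dist.size(1)] = p_gen * vocab_dist
        # Copy probability mass is scattered onto the source token ids,
        # which may index extended-vocabulary slots for OOV words.
        final.scatter_add_(1, src_ids, (1 - p_gen) * attn_dist)
        return final

# Toy usage: batch=1, hidden=4, emb=3, vocab=10, source length 5.
switch = PointerGeneratorSwitch(hidden_dim=4, emb_dim=3, vocab_size=10, max_src_len=5)
out = switch(torch.softmax(torch.rand(1, 10), -1), torch.softmax(torch.rand(1, 5), -1),
             torch.rand(1, 4), torch.rand(1, 4), torch.rand(1, 3),
             torch.randint(0, 15, (1, 5)))
print(out.shape, out.sum())  # torch.Size([1, 15]) and a total mass of ~1.0
```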

7.3 Summary Sentence Repetition and Inaccurate Information in the Summary

The repetition of phrases and the generation of incoherent phrases in the output summary are two

challenges that must be considered. Both challenges arise when summarising long documents and producing long summaries with the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism in which, for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and to avoid attending to the same part of the input at different decoder steps. However, the intratemporal encoder attention mechanism cannot address all the repetition challenges, especially when a long sequence is generated. Thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words; this mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy loss, maximum likelihood, and policy-gradient reinforcement learning to minimise exposure bias. In addition, a trigram probability p(y_t) was proposed to address repetition in the generated summary, where y_t is the trigram sequence: the value of p(y_t) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated by a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.
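The coverage idea can be sketched as keeping a running sum of past attention distributions and penalising re-attending to the same positions. The loss form below follows the commonly used coverage penalty, but the shapes and the way it would be combined with the main training loss are illustrative assumptions rather than the exact objective of [56] or [35].

```python
import torch

def coverage_step(coverage, attn_dist):
    """Update the coverage vector (sum of past attention) and compute the
    coverage penalty for the current decoding step."""
    # Penalise attention that overlaps with what has already been covered.
    step_loss = torch.sum(torch.minimum(attn_dist, coverage), dim=-1)
    new_coverage = coverage + attn_dist
    return new_coverage, step_loss

# Toy run over 3 decoding steps with a 5-token source.
coverage = torch.zeros(1, 5)
total_penalty = 0.0
for attn in (torch.tensor([[0.7, 0.1, 0.1, 0.05, 0.05]]),
             torch.tensor([[0.6, 0.2, 0.1, 0.05, 0.05]]),   # re-attends token 0
             torch.tensor([[0.05, 0.05, 0.1, 0.2, 0.6]])):
    coverage, penalty = coverage_step(coverage, attn)
    total_penalty += penalty
print(total_penalty)  # grows when the decoder keeps attending to the same tokens
```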

7.4 Fake Facts

Abstractive summarisation may generate summaries with fake facts; 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, in which the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised the copy and coverage mechanisms.

7.5 Other Challenges

The main issue of abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary is composed of the highlights of the news article. Every highlight represents a sentence in the summary; therefore,

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all the crucial points of the article. Therefore, considerable effort is needed to make a high-quality dataset available. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset exists but is not free.

Another issue of abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not sufficient, as it depends on exact matching between words. For example, the

words "book" and "books" are considered different by any of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words: words that have the same meaning must be considered the same even if they have different surface forms. In this case, we propose the use of METEOR, which has recently been used in evaluating machine translation and automatic summarisation models [77]. METEOR considers stemming, morphological variants, and synonyms. In addition, in languages with flexible word order, it is better to use ROUGE without taking the order of the words into account.
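To illustrate why surface-form matching is too strict, the sketch below compares plain unigram overlap with overlap after stemming, using NLTK's Porter stemmer; it is a toy illustration of the idea behind METEOR-style matching, not the METEOR metric itself, and the function name is ours.

```python
from nltk.stem import PorterStemmer

def unigram_overlap(reference, candidate, stem=False):
    """Fraction of reference unigrams found in the candidate, optionally
    after Porter stemming so that surface variants (book/books) match."""
    stemmer = PorterStemmer()
    norm = (lambda w: stemmer.stem(w)) if stem else (lambda w: w)
    ref = [norm(w) for w in reference.lower().split()]
    cand = {norm(w) for w in candidate.lower().split()}
    return sum(w in cand for w in ref) / len(ref)

ref = "the books were summarised"
cand = "the book was summarised"
print(unigram_overlap(ref, cand))             # exact matching misses "books"
print(unigram_overlap(ref, cand, stem=True))  # stemming lets book/books match
```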

The quality of the generated summary can also be improved using linguistic features. For example, we propose the use

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


of dependency parsing at the encoder, in a separate layer on top of the first hidden-state layer. We also propose the use of word embeddings that are built by considering dependency parsing or part-of-speech tagging. At the decoder side, the quality of the beam search can be improved by considering the part-of-speech tags of the words and their surrounding words.

Based on current trends and the evaluation results, we think that the most promising direction is the use of the BERT pretrained model. The quality of transformer-based models is high and will continue to yield promising results.

8 Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that applied deep learning for abstractive text summarisation, the datasets, and the measures used for evaluating these approaches. Moreover, the challenges encountered when employing various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the

New York Times. Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively. The best results were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, several issues must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.
[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.
[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].


Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.
[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.
[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.
[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.
[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.
[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.
[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.
[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.
[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.
[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.
[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, 2017.
[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.
[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.
[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.
[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.
[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, 2018.
[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.
[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.
[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, http://arxiv.org/abs/1812.02303, 2020.
[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.
[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.
[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.
[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.
[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
[29] K. Lopyrev, Generating News Headlines with Recurrent Neural Networks, p. 9, 2015, https://arxiv.org/abs/1512.01712.
[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.
[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.
[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.
[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.

[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.

[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.

[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.

[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.

[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.

[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.

[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.

[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.

[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.

[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," 2016, https://arxiv.org/abs/1611.01576.

[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.

[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.

[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.

[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.

[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.

[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.

[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.

[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.

[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.

[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.

[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.

[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.

[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.

[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.

[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.

[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.

[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.

[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.

[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.

[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.

[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.

[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.

[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.



utilised in several NLP applications, such as machine translation and text summarisation. In text summarisation, the input sequence is the document that needs to be summarised, and the output is the summary [29, 30], as shown in Figure 1.

An RNN is a deep learning model that processes data in sequential order, such that the input of a certain state depends on the output of the previous state [31, 32]. For example, in a sentence, the meaning of a word is closely related to the meanings of the previous words. An RNN consists of a set of hidden states that are learned by the neural network, and it may consist of several layers of hidden states, where different states and layers learn different features. The last state of each layer represents the whole input of the layer since it accumulates the values of all previous states [5]. For example, the first layer and its states can be employed for part-of-speech tagging, while the second layer learns to create phrases. In text summarisation, the input of the RNN is the embedding of words, phrases, or sentences, and the output is the word embedding of the summary [5].

In the RNN encoder-decoder model, at a given hidden state on the encoder side, the vector representation of the current input word and the outputs of the hidden states of all previous words are combined and fed to the next hidden state. As shown in Figure 1, the vector representation of the word W3 and the outputs of the hidden states he1 and he2 are combined and fed as input to the hidden state he3. After all the words of the input string have been fed, the output generated from the last hidden state of the encoder is passed to the decoder as a vector referred to as the context vector [29]. In addition to the context vector, which is fed to the first hidden state of the decoder, the start-of-sequence symbol ⟨SOS⟩ is fed to generate the first word of the summary from the headline (assume W5, as shown in Figure 1). In this case, W5 is fed as the input to the next decoder hidden state. Each generated word is passed as an input to the next decoder hidden state to generate the next word of the summary, and the last generated word is the end-of-sequence symbol ⟨EOS⟩. Before the summary is generated, each output from the decoder takes the form of a distributed representation before it is sent to the softmax layer and the attention mechanism to generate the next summary word [29].
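To make this flow concrete, the following is a minimal NumPy sketch of a plain (Elman-style) encoder-decoder RNN, not the exact architecture of any surveyed model; the dimensions, the toy vocabulary, the greedy argmax decoding, and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 20, 8, 16          # vocabulary, embedding, and hidden sizes (assumed)
SOS, EOS = 0, 1              # special token ids (assumed)

emb = rng.normal(0, 0.1, (V, E))                       # word embedding table
W_xh, W_hh = rng.normal(0, 0.1, (E, H)), rng.normal(0, 0.1, (H, H))
W_hy = rng.normal(0, 0.1, (H, V))                      # projection to vocabulary logits

def rnn_step(x_id, h):
    """One recurrent step: combine the current word embedding with the previous state."""
    return np.tanh(emb[x_id] @ W_xh + h @ W_hh)

def encode(token_ids):
    h = np.zeros(H)
    for t in token_ids:                                # he1, he2, he3, ... in the text
        h = rnn_step(t, h)
    return h                                           # last state = context vector

def greedy_decode(context, max_len=10):
    h, y, summary = context, SOS, []
    for _ in range(max_len):
        h = rnn_step(y, h)                             # feed the previously generated word
        y = int(np.argmax(h @ W_hy))                   # argmax of logits = argmax of softmax
        if y == EOS:
            break
        summary.append(y)
    return summary

print(greedy_decode(encode([5, 7, 9, 3])))             # untrained weights: random output
```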

2.2. Bidirectional RNN. A bidirectional RNN consists of a forward RNN and a backward RNN. The forward RNN generates a sequence of hidden states after reading the input sequence from left to right, while the backward RNN generates a sequence of hidden states after reading the input sequence from right to left. The representation of the input sequence is the concatenation of the forward and backward hidden states [33]. Therefore, the representation of each word depends on the representations of the preceding (past) and following (future) words; in this case, the context contains the words to the left and the words to the right of the current word [34].

Using a bidirectional RNN enhances performance. For example, given the input text "Sara ate a delicious pizza at dinner tonight", assume that we want to predict the representation of the word "dinner" using a bidirectional RNN: the forward LSTM represents "Sara ate a delicious pizza at", while the backward LSTM represents "tonight". Considering the word "tonight" when representing the word "dinner" provides better results.

On the other hand, using a bidirectional RNN on the decoder side minimises the probability of a wrong prediction. The reason is that a unidirectional RNN considers only the previous prediction and reasons only about the past; therefore, if there is an error in a previous prediction, the error accumulates in all subsequent predictions. This problem can be addressed using a bidirectional RNN [35].
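As a sketch of the encoding idea only (not of any specific surveyed decoder), the bidirectional representation can be formed by running a recurrent step in both directions and concatenating the two state sequences; reusing `rnn_step` from the previous sketch and passing the hidden size explicitly are assumptions made for brevity.

```python
import numpy as np

def bidirectional_encode(token_ids, rnn_step, hidden_size):
    """Concatenate forward and backward hidden states for each input position."""
    n = len(token_ids)
    fwd = [np.zeros(hidden_size)]
    for t in token_ids:                       # left-to-right pass
        fwd.append(rnn_step(t, fwd[-1]))
    bwd = [np.zeros(hidden_size)]
    for t in reversed(token_ids):             # right-to-left pass
        bwd.append(rnn_step(t, bwd[-1]))
    bwd = list(reversed(bwd[1:]))             # align backward states with positions
    # h_i = [h_fwd_i ; h_bwd_i]: each word sees both past and future context
    return [np.concatenate([fwd[i + 1], bwd[i]]) for i in range(n)]
```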

2.3. Gated Recurrent Neural Networks (LSTM and GRU). Gated RNNs are employed to solve the problem of vanishing gradients, which occurs when training a long sequence with an RNN. This problem can be mitigated by allowing the gradients to backpropagate along a linear path through gates, where each gate has a weight and a bias. Gates control and modify the amount of information that flows between hidden states, and the weights and biases of the gates are updated during training. The most popular gated RNNs are LSTM [36] and the GRU [37], which are two variants of the RNN.

2.3.1. Long Short-Term Memory (LSTM). The repeating unit of the LSTM architecture consists of input (read), memory (update), forget, and output gates [5, 7], but the chaining structure is the same as that of an RNN. The four gates share information with each other; thus, information can flow in loops for a long period of time. The four gates of each LSTM unit, which are shown in Figures 2 and 3, are discussed here.

(1) Input Gate. In the first timestep, the input is a randomly initialised vector, while in subsequent steps, the input of the current step is the output (the content of the memory cell) of the previous step. In all cases, the input is subject to element-wise multiplication with the output of the forget gate, and the multiplication result is added to the current memory gate output.

(2) Forget Gate. The forget gate is a single-layer neural network with a sigmoid activation function. The value of the sigmoid function determines whether the information of the previous state should be forgotten or remembered: if the sigmoid value is 1, the previous state is remembered, and if the sigmoid value is 0, the previous state is forgotten. In language modelling, for example, the forget gate remembers the gender of the subject to produce the proper pronouns until it finds a new subject. The forget gate has four inputs: the output of the previous block, the input vector, the remembered information from the previous block, and the bias.

(3) Memory Gate. The memory gate controls the effect of the remembered information on the new information. The memory gate consists of two neural networks: the first network has the same structure as the forget gate but a different bias, and the second neural network has a tanh activation function and is utilised to generate the new information.

Figure 2: LSTM unit architecture [5].

Figure 1: Sequence-to-sequence model; the last hidden state of the encoder is fed as input to the decoder together with the ⟨EOS⟩ symbol [51].


The new information is formed by adding the old information to the result of the element-wise multiplication of the outputs of the two memory gate neural networks.

(4) Output Gate. The output gate controls the amount of new information that is forwarded to the next LSTM unit. The output gate is a neural network with a sigmoid activation function that takes the input vector, the previous hidden state, the new information, and the bias as input. The output of the sigmoid function is multiplied by the tanh of the new information to produce the output of the current block.

2.3.2. Gated Recurrent Unit (GRU). A GRU is a simplified LSTM with two gates, a reset gate and an update gate, and no explicit memory cell. The previous hidden state information is forgotten when all the reset gate elements approach zero; then, only the input vector affects the candidate hidden state, and the update gate acts as a forget gate. LSTM and the GRU are both commonly employed for abstractive summarisation: LSTM has a memory unit that provides extra control, whereas the GRU has a lower computation time [38]. In addition, while it is easier to tune the parameters with LSTM, the GRU takes less time to train [30].
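The gate interactions described above correspond to the standard LSTM and GRU update equations [36, 37]; the NumPy sketch below shows one timestep of each. The weight dictionaries, their shapes, and the sigmoid helper are illustrative assumptions rather than a particular implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the forget/input/candidate/output parameters."""
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate (memory gate, tanh net)
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
    c = f * c_prev + i * g                               # new memory cell content
    h = o * np.tanh(c)                                   # output of the current block
    return h, c

def gru_step(x, h_prev, W, U, b):
    """One GRU step with reset (r) and update (z) gates and no explicit memory cell."""
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h_prev) + b["h"])
    return (1 - z) * h_prev + z * h_tilde
```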

2.4. Attention Mechanism. The attention mechanism was employed for neural machine translation [33] before being utilised for NLP tasks such as text summarisation [18]. A basic encoder-decoder architecture may fail when given long sentences, since the size of the encoding is fixed for the input string and thus cannot represent all the elements of a long input. The attention mechanism was introduced to remember the parts of the input that have a significant impact on the summary [29]. The attention mechanism is employed at each output word to calculate a weight between the output word and every input word, where the weights add up to one. The advantage of using weights is to indicate which input word must receive the most attention with respect to the output word. The weighted average of the last hidden layers of the decoder at the current step is calculated after each input word is passed and is fed to the softmax layer along with the last hidden layers [39].
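A minimal sketch of this weighting, assuming simple dot-product scoring and random toy vectors (the dimensions and states are placeholders, not values from any surveyed model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_product_attention(decoder_state, encoder_states):
    """Return attention weights over the input words and the resulting context vector."""
    scores = np.array([decoder_state @ h for h in encoder_states])  # one score per input word
    weights = softmax(scores)                                       # weights sum to one
    context = sum(w * h for w, h in zip(weights, encoder_states))   # weighted average of states
    return weights, context

rng = np.random.default_rng(1)
enc = [rng.normal(size=16) for _ in range(5)]   # hidden states for 5 input words
dec = rng.normal(size=16)                       # current decoder state
w, c = dot_product_attention(dec, enc)
print(w.round(3), w.sum())                      # the weights add up to 1.0
```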

2.5. Beam Search. Beam search and greedy search are very similar; however, while greedy search considers only the best hypothesis, beam search considers b hypotheses, where b represents the beam width or beam size [5]. In text summarisation tasks, the decoder utilises the final encoder representation to generate the summary from the target vocabulary. In each step, the output of the decoder is a probability distribution over the target words. Thus, to obtain the output word from the learned probability, several methods can be applied, including (1) greedy sampling, which selects the mode of the distribution, (2) 1-best beam search, which selects the single best output, and (3) n-best beam search, which selects several outputs. When n-best beam search is employed, the top b most relevant target words are selected from the distribution and fed to the next decoder state, and the decoder keeps only the top k1 of the k candidate words from the different inputs and discards the rest.
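A generic beam-search sketch is shown below; the `step_fn` interface (returning log-probabilities over the vocabulary plus a new decoder state) is an assumed abstraction over whatever decoder is being used.

```python
import numpy as np

def beam_search(step_fn, start_state, beam_width=3, max_len=10, eos_id=1):
    """Keep the beam_width best partial hypotheses at every decoding step.

    step_fn(tokens, state) must return (log_probs over the vocabulary, new_state);
    its exact form depends on the decoder and is an assumption here.
    """
    beams = [([], 0.0, start_state)]                      # (tokens, log-probability, state)
    for _ in range(max_len):
        candidates = []
        for tokens, score, state in beams:
            if tokens and tokens[-1] == eos_id:           # finished hypothesis: keep as is
                candidates.append((tokens, score, state))
                continue
            log_probs, new_state = step_fn(tokens, state)
            for tok in np.argsort(log_probs)[-beam_width:]:        # expand only the top words
                candidates.append((tokens + [int(tok)],
                                   score + float(log_probs[tok]), new_state))
        # prune: keep only the beam_width best hypotheses overall
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]                                    # highest-scoring summary
```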

2.6. Distributed Representation (Word Embedding). A word embedding is a distributional vector representation of a word that captures its syntactic and semantic features [40]. Words must be converted to vectors to handle various NLP challenges, so that the semantic similarity between words can be calculated using cosine similarity, Euclidean distance, etc. [41–43]. In NLP tasks, the word embeddings of the words are fed as inputs to neural network models. In the recurrent neural network encoder-decoder architecture, which is employed to generate the summaries, the input of the model is the word embedding of the text, and the output is the word embedding of the summary.

In NLP, the most recently employed word embedding models include Word2Vec, GloVe, FastText, and Bidirectional Encoder Representations from Transformers (BERT) [41, 44–47]. The Word2Vec model consists of two approaches, skip-gram and continuous bag-of-words (CBOW), both of which depend on the context window [41].

Figure 3: LSTM unit gates [5]: (a) input gate, (b) forget gate, (c) memory gate, and (d) output gate.

On the other hand, GloVe represents the global vectors, which are based on statistics of the global corpus instead of the context window [44]. FastText extends the skip-gram approach of the Word2Vec model by using subword internal information to address out-of-vocabulary (OOV) terms [46]. In FastText, the subword components are composed to build the vector representation of a word, which facilitates the representation of word morphology and lexical similarity. The BERT word embedding model is based on a multilayer bidirectional transformer encoder [47, 48]. Instead of using sequential recurrence, the transformer neural network utilises parallel attention layers. BERT creates a single large transformer by combining the representations of the words and sentences. Furthermore, BERT is pretrained with an unsupervised objective over a large amount of text.
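As a small illustration of how semantic similarity is computed over such embeddings, the sketch below compares word vectors with cosine similarity; the 4-dimensional vectors are made-up placeholders, not real GloVe or Word2Vec entries.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors (1.0 = identical direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder "embeddings"; real models use hundreds of dimensions.
embeddings = {
    "summary":  np.array([0.8, 0.1, 0.3, 0.5]),
    "abstract": np.array([0.7, 0.2, 0.4, 0.4]),
    "pizza":    np.array([0.0, 0.9, 0.1, 0.2]),
}
print(cosine_similarity(embeddings["summary"], embeddings["abstract"]))  # relatively high
print(cosine_similarity(embeddings["summary"], embeddings["pizza"]))     # relatively low
```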

2.7. Transformers. Contextual representations of language are learned from large corpora. One of the new language representations that extends word embedding models is BERT, mentioned in the previous section [48]. In BERT, two special tokens are inserted into the text: the first token, (CLS), is employed to aggregate the information of the whole text sequence, and the second token, (SEP), is inserted at the end of each sentence to represent it. The resulting text consists of tokens, where each token is assigned three types of embeddings: token, segmentation, and position embeddings. The token embedding indicates the meaning of a token, the segmentation embedding identifies the sentences, and the position embedding determines the position of the token. The sum of the three embeddings is fed to the bidirectional transformer as a single vector. Pretrained word embedding vectors are more precise and rich in semantic features. BERT supports both fine-tuning (based on the objectives of certain tasks) and feature-based methods. Moreover, transformers compute the representation of the input and output by using self-attention, which enables learning the relevance between each word pair [47].
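The following sketch illustrates the three-way embedding sum described above for a BERT-style input; the toy vocabulary, the segment ids, and the tiny dimensions are assumptions and do not reflect the real BERT configuration.

```python
import numpy as np

rng = np.random.default_rng(2)
VOCAB = {"[CLS]": 0, "the": 1, "cat": 2, "sat": 3, "[SEP]": 4}   # toy vocabulary
D = 8                                                             # toy hidden size

token_emb    = rng.normal(size=(len(VOCAB), D))  # meaning of each token
segment_emb  = rng.normal(size=(2, D))           # sentence A vs. sentence B
position_emb = rng.normal(size=(16, D))          # position in the sequence

tokens   = ["[CLS]", "the", "cat", "sat", "[SEP]"]
segments = [0, 0, 0, 0, 0]                       # all tokens belong to sentence A

# Input to the bidirectional transformer: sum of the three embeddings per token.
inputs = np.stack([
    token_emb[VOCAB[t]] + segment_emb[s] + position_emb[i]
    for i, (t, s) in enumerate(zip(tokens, segments))
])
print(inputs.shape)  # (5, 8): one vector per token
```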

3. Single-Sentence Summary

Recently, the RNN has been employed for abstractive text summarisation and has provided significant results; therefore, we focus on abstractive text summarisation based on deep learning techniques, especially the RNN [49]. We discuss the approaches that have applied deep learning to abstractive text summarisation since 2015, among which an RNN with an attention mechanism was the most commonly utilised. We classified the research according to summary type (i.e., single-sentence or multisentence summary), as shown in Figure 4. We also compared the approaches in terms of encoder-decoder architecture, word embedding, dataset and dataset preprocessing, and evaluations and results. This section covers single-sentence summary methods, while Section 4 covers multisentence summary methods. Single-sentence summary methods include a neural attention model for abstractive sentence summarisation [18], abstractive sentence summarisation with an attentive RNN (RAS) [39], the quasi-RNN [50], a method for generating news headlines with RNNs [29], abstractive text summarisation using an attentive sequence-to-sequence RNN [38], neural text summarisation [51], selective encoding for abstractive sentence summarisation (SEASS) [52], faithful to the original fact aware neural abstractive summarisation (FTSumg) [53], and the improving transformer with sequential context (RCT) [54].

3.1. Abstractive Summarization Architecture

3.1.1. Feedforward Architecture. Neural networks were first employed for abstractive text summarisation by Rush et al. in 2015, where a local attention-based model was utilised to generate summary words by conditioning them on the input sentences [18]. Three types of encoders were applied: the bag-of-words encoder, the convolutional encoder, and the attention-based encoder. The bag-of-words model of the embedded input was used to distinguish between stop words and content words; however, this model had a limited ability to represent continuous phrases. Thus, a model that utilised a deep convolutional encoder was employed to allow the words to interact locally without the need for context. The convolutional encoder model can alternate between temporal convolution and max-pooling layers using the standard time-delay neural network (TDNN) architecture; however, it is limited to a single output representation. The limitation of the convolutional encoder model was overcome by the attention-based encoder, which exploits a learned soft alignment to weight the input based on the context to construct a representation of the output. Furthermore, a beam-search decoder was applied to limit the number of hypotheses in the summary.

3.1.2. RNN Encoder-Decoder Architecture

(1) LSTM-RNN. An abstractive sentence summarisation model that employed a conditional recurrent neural network (RNN) to generate the summary from the input is referred to as the recurrent attentive summariser (RAS) [39]. The RAS is an extension of the work in [18]: in [18], the model employed a feedforward neural network, while the RAS employed an RNN-LSTM. The encoder and decoder in both models were trained using sentence-summary pair datasets, but the decoder of the RAS improved the performance since it considered the position information of the input words. Furthermore, the previous words and the input sentence were employed to produce the next word of the summary during the training phase.

Lopyrev [29] proposed a simplified attention mechanism that was utilised in an encoder-decoder RNN to generate headlines for news articles. The news article was fed into the encoder one word at a time and then passed through the embedding layer to generate the word representation. The experiments were conducted using simple and complex attention mechanisms. In the simple attention mechanism, the last layer, after processing the input in the encoder, was divided into two parts: one part for calculating the attention weight vector and one part for calculating the context vector, as shown in Figure 5(a). In the complex attention mechanism, by contrast, the whole last layer was employed to calculate the attention weight vector and the context vector, without fragmentation, as shown in Figure 5(b). In both figures, the solid lines indicate the part of the hidden state of the last layer that is employed to compute the context vector, while the dashed lines indicate the part of the hidden state of the last layer that is applied to compute the attention weight vector. The same difference is seen on the decoder side: in the simple attention mechanism, the last layer was divided into two parts (one part was passed to the softmax layer, and the other part was applied to calculate the attention weight), while in the complex attention mechanism, no such division was made. A beam search was performed on the decoder side during testing to extend the sequence of highest probability.

Figure 4: Taxonomy of several approaches that use a recurrent neural network and attention mechanism in abstractive text summarisation, based on the summary type (single-sentence summary approaches versus multisentence summary approaches).

The encoder-decoder RNN and sequence-to-sequence models were utilised in [55], which mapped the inputs to the target sequences; the same approach was also employed in [38, 51]. Three different methods were proposed for calculating the global attention scoring function, including dot product scoring, the bilinear form, and a scalar value calculated from the projection of the hidden states of the RNN encoder [38]. The model applied LSTM cells instead of GRU cells (both LSTM and the GRU are commonly employed for abstractive summarisation tasks, since LSTM has a memory unit that provides control, but the computation time of the GRU is lower). Three models were employed: the first model applied unidirectional LSTM in both the encoder and the decoder, the second model was implemented using bidirectional LSTM in the encoder and unidirectional LSTM in the decoder, and the third model utilised a bidirectional LSTM encoder and an LSTM decoder with global attention. The first hidden state of the decoder is the concatenation of all backward and forward hidden states of the encoder. The use of attention in an encoder-decoder neural network generates a context vector at each timestep. For the local attention mechanism, the context vector is conditioned on a subset of the encoder's hidden states, while for the global attention mechanism, the vector is conditioned on all the encoder's hidden states. After the first decoder output is generated, the next decoder input is the word embedding of the output of the previous decoder step. An affine transformation is used to convert the output of the decoder LSTM to a dense vector prediction, because of the long training time that would otherwise be needed when the number of hidden states equals the number of words in the vocabulary.

Khandelwal [51] employed a sequence-to-sequence model that consists of an LSTM encoder and an LSTM decoder for abstractive summarisation of small datasets. The decoder generated the output summary after reading the hidden representations generated by the encoder and passing them to the softmax layer. The sequence-to-sequence model does not memorise information, so generalisation of the model is not possible. Thus, the proposed model utilised imitation learning to determine whether to choose the golden token (i.e., the reference summary token) or the previously generated output at each step.

(2) GRU-RNN. A combination of elements of the RNN and the convolutional neural network (CNN) was employed in an encoder-decoder model referred to as the quasi-recurrent neural network (QRNN) [50]. In the QRNN, the GRU was utilised in addition to the attention mechanism. The QRNN was applied to address the limitation of parallelisation: the dependencies of the words on previous steps are obtained via convolution and "fo-pooling", which are performed in parallel, as shown in Figure 6. The convolution in the QRNN can be either masked convolution (considering previous timesteps only) or centre convolution (considering future timesteps). The encoder-decoder model employed two neural networks: the first network applied the centre convolution of the QRNN and consisted of multiple hidden layers that were fed the vector representations of the words, and the second network comprised neural attention and took the encoder hidden layers as input to generate one word of the headline. The decoder accepted the previously generated headline word and produced the next word of the headline; this process continued until the headline was completed.

SEASS is an extension of the sequence-to-sequence recurrent neural network that was proposed in [52]. The selective encoding for abstractive sentence summarisation (SEASS) approach includes a selective encoding model that consists of a sentence encoder, a selective gate network, and a decoder with an attention mechanism, as shown in Figure 7. The encoder uses a bidirectional GRU, while the decoder uses a unidirectional GRU with an attention mechanism. The encoder reads the input words and produces their representations, and the selective gate uses the sentence meaning to filter the word representations and construct the representation of the sentence. To produce a high-quality summary and accelerate the decoding process, a beam search was used at the decoder.
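A gate of this kind can be sketched as follows, based only on the description above rather than a verified reproduction of the SEASS implementation; the weight matrices, the way the sentence vector is assembled from the first and last bidirectional states, and the toy dimensions are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selective_gate(word_states, W_s, U_s, b):
    """Filter word representations with a gate driven by a whole-sentence vector.

    word_states: bidirectional hidden states h_i (one per input word).
    The sentence vector s is assumed to combine the forward half of the last
    state with the backward half of the first state.
    """
    half = len(word_states[0]) // 2
    s = np.concatenate([word_states[-1][:half], word_states[0][half:]])
    gated = []
    for h in word_states:
        gate = sigmoid(W_s @ h + U_s @ s + b)   # element-wise gate in [0, 1]
        gated.append(h * gate)                  # suppress unimportant dimensions
    return gated

rng = np.random.default_rng(3)
states = [rng.normal(size=8) for _ in range(4)]               # 4 words, hidden size 8
out = selective_gate(states, rng.normal(size=(8, 8)),
                     rng.normal(size=(8, 8)), np.zeros(8))
print(len(out), out[0].shape)                                 # 4 gated word vectors
```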

On the other hand, dual attention was applied in [53]. The proposed dual attention approach consists of three modules: two bidirectional GRU encoders and one dual attention decoder. The decoder has a gate network for context selection, as shown in Figure 8, and employs copying and coverage mechanisms. The outputs of the encoders are two context vectors: one context vector for the sentences and one context vector for the relations.

Figure 5: (a) Simple attention and (b) complex attention [29].

Figure 6: Comparison of the CNN, LSTM, and QRNN models [50].

Figure 7: Selective encoding for abstractive sentence summarisation (SEASS) [52].

Figure 8: Faithful to the original [53].

A relation may be a triple or a tuple relation. A triple relation consists of a subject, a predicate, and an object, while a tuple relation consists of either (subject and predicate) or (predicate and object). Sometimes the triple relation cannot be extracted; in this case, two tuple relations are utilised. The decoder gate merges both context vectors based on their relative association.

(3) Others. The poor semantic representation of long sequences in abstractive text summarisation approaches based on an RNN encoder-decoder framework was addressed using the RC-Transformer (RCT) [54]. The RCT is an RNN-based abstractive text summarisation model that is composed of two encoders (an RC encoder and a transformer encoder) and one decoder. The transformer has an advantage in parallel computing in addition to retrieving the global context semantic relationships. On the other hand, sequential context representation was achieved by the second encoder of the RCT. Word ordering is crucial for abstractive text summarisation and cannot be obtained by positional encoding; therefore, the RCT utilised two encoders to address the shortage of sequential information at the word level. A beam search was utilised at the decoder. Furthermore, Cai et al. compared the speed of the RCT model with that of RNN-based models and concluded that the RCT is 1.4x and 1.2x faster.

3.2. Word Embedding. In the QRNN model, GloVe word embedding, which was pretrained using the Wikipedia and Gigaword datasets, was used to represent the text and summary [50]. In the first model proposed by Jobson et al., the word embedding was randomly initialised and updated during training, while GloVe word embedding was employed to represent the words in the second and third models [38]. In the study by Cai et al., the transformer was utilised [54].

3.3. Dataset and Dataset Preprocessing. In the model proposed by Rush et al., the datasets were preprocessed via PTB tokenisation by replacing all digits with a placeholder symbol, converting all letters to lowercase, and using "UNK" to replace words that occurred fewer than 5 times [18]. The model was trained with any input-output pairs due to the shortage of constraints for generating the output. The training process was carried out on the Gigaword dataset, while the summarisation evaluation was conducted on DUC2003 and DUC2004 [18]. Furthermore, the model proposed by Chopra et al. was trained using the Gigaword corpus with sentence separation and tokenisation [39]. To form sentence-summary pairs, each headline of an article was paired with the first sentence of the article. The same data preprocessing steps of [18] were performed in [39]. Moreover, the Chopra et al. model was evaluated using the DUC2004 dataset, which consists of 500 pairs.

The Gigaword dataset was also employed by the QRNN model [50]. Furthermore, articles that started with sentences containing more than 50 words, or with headlines of more than 25 words, were removed. Moreover, the words in the articles and their headlines were converted to lowercase, and the data points were split into short, medium, and long sentences based on the lengths of the sentences to avoid extra padding.

Lopyrev and Jobson et al. trained their models using Gigaword after preprocessing the data. In the Lopyrev model, the most crucial preprocessing steps for both the text and the headline were tokenisation and conversion of characters to lowercase [29]. In addition, only the characters of the first paragraph were retained, and the length of the headline was fixed between 25 and 50 words. Moreover, articles without headlines were disregarded, and the ⟨unk⟩ symbol was used to replace rare words.

Khandelwal employed the Association for Computational Linguistics (ACL) Anthology Reference Corpus, which consists of 16,845 examples for training and 500 examples for testing; these were considered small datasets in the experiments [51]. The abstract included the first three sentences, and the unigram overlap between the title and the abstract was also calculated. There were 25 tokens in the summary and a maximum of 250 tokens in the input text.

The English Gigaword dataset, the DUC2004 corpus, and MSR-ATC were selected to train and test the SEASS model [52]. Moreover, the experiments of the Cao et al. model were conducted using the Gigaword dataset [53]. The same data preprocessing steps of [18] were performed in [52, 53]. The RCT also employed the Gigaword and DUC2004 datasets in its experiments [54].
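The preprocessing pipeline that recurs in these papers (lowercasing, digit masking, and replacing rare words with an unknown token) can be sketched as follows; the threshold of 5 occurrences follows the description above, while the tokeniser, the "#" mask character, and the "<unk>" spelling are simplifying assumptions.

```python
import re
from collections import Counter

def tokenise(text):
    """Simplified tokeniser: lowercase the text and mask every digit with '#'."""
    text = re.sub(r"\d", "#", text.lower())
    return re.findall(r"\S+", text)

def build_vocab(token_lists, min_count=5):
    """Keep only words that occur at least min_count times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, c in counts.items() if c >= min_count}

def replace_rare(tokens, vocab, unk="<unk>"):
    return [tok if tok in vocab else unk for tok in tokens]

articles = ["The DUC2004 corpus has 500 pairs.",
            "Gigaword headlines pair with first sentences."]
token_lists = [tokenise(a) for a in articles]
vocab = build_vocab(token_lists, min_count=1)      # tiny example, so keep every word
print([replace_rare(t, vocab) for t in token_lists])
```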

3.4. Evaluation and Results. Recall-Oriented Understudy for Gisting Evaluation 1 (ROUGE1), ROUGE2, and ROUGE-L were utilised to evaluate the Rush et al. model, and values of 28.18, 8.49, and 23.81, respectively, were obtained [18]. The experimental results of the Chopra et al. model showed that, although DUC2004 was too complex for the experiments on the Gigaword corpus, the proposed model outperformed state-of-the-art methods in terms of ROUGE1, ROUGE2, and ROUGE-L [39]; the values of ROUGE1, ROUGE2, and ROUGE-L were 28.97, 8.26, and 24.06, respectively. On the other hand, BLEU was employed to evaluate the Lopyrev model [29], while Khandelwal utilised perplexity [51]. The SEASS model was evaluated using ROUGE1, ROUGE2, and ROUGE-L, and the results of the three measures were 36.15, 17.54, and 33.63, respectively [52]. Moreover, ROUGE1, ROUGE2, and ROUGE-L were selected for evaluating the Cao et al. model [53]; the values of ROUGE1, ROUGE2, and ROUGE-L were 37.27, 17.65, and 34.24, respectively, and the results showed that fake summaries were reduced by 80%. In addition, the RCT was evaluated using ROUGE1, ROUGE2, and ROUGE-L, with values of 37.27, 18.19, and 34.62 on the Gigaword dataset. The results showed that the RCT model outperformed the other models by generating high-quality summaries that contain salient information [54].
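For reference, ROUGE-N measures the n-gram overlap between a candidate summary and its reference(s) [73]; the sketch below computes unigram ROUGE-1 recall, precision, and F1 against a single reference, as a simplified illustration rather than the official ROUGE toolkit.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram overlap between a candidate summary and a single reference summary."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())   # clipped unigram matches
    recall = overlap / len(ref)
    precision = overlap / len(cand)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1

print(rouge_1("police kill the gunman", "the gunman was killed by police"))
```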


4. Multisentence Summary

In this section, multisentence summary methods for deep learning-based abstractive text summarisation are discussed. Multisentence summary methods include the get to the point method (summarisation with pointer-generator networks) [56], a deep reinforced model for abstractive summarisation (RL) [57], a generative adversarial network for abstractive text summarisation [58], semantic phrase exploration (ATSDL) [30], a bidirectional attentional encoder-decoder model with bidirectional beam search [35], the key information guide network [59], text summarisation abstraction improvement [60], dual encoding for abstractive text summarisation (DEATS) [61], abstractive document summarisation via a bidirectional decoder (BiSum) [62], the text abstraction summary model based on BERT word embedding and RL [63], the transformer-based model for single-document neural summarisation [64], text summarisation with pretrained encoders [65], and the text summarisation method based on the double attention pointer network [49]. The pointer-generator model [55] covers both single-sentence and multisentence summaries. Additional details are presented in the following sections.

4.1. Abstractive Summarization Architecture

4.1.1. LSTM-RNN. A novel abstractive summarisation method was proposed in [56]; it generated a multisentence summary and addressed sentence repetition and inaccurate information. See et al. proposed a model that consists of a single-layer bidirectional LSTM encoder, a single-layer unidirectional LSTM decoder, and the sequence-to-sequence attention model proposed in [55]. The See et al. model generates a long text summary instead of headlines consisting of one or two sentences. Moreover, the attention mechanism was employed, and the attention distribution facilitated the production of the next word of the summary by telling the decoder where to look in the source words, as shown in Figure 9. This mechanism constructed a weighted sum of the encoder hidden states that facilitated the generation of the context vector, where the context vector is a fixed-size representation of the input. The probability (Pvocab) produced by the decoder was employed to generate the final prediction using the context vector and the decoder's last step. Furthermore, the value of Pvocab is equal to zero for OOV words. RL was employed for abstractive text summarisation in [57]. The method proposed in [57], which combined RL with supervised word prediction, was composed of a bidirectional LSTM-RNN encoder and a single LSTM decoder.
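The sketch below shows how a pointer-generator mixes the vocabulary distribution with the attention (copy) distribution, following the general idea of [56]; the extended-vocabulary handling of OOV source words and all concrete numbers are illustrative assumptions, and the generation probability p_gen is taken as given rather than computed from the decoder state.

```python
import numpy as np

def pointer_generator_dist(p_vocab, attention, source_ids, vocab_size, p_gen):
    """Mix generation and copying: P(w) = p_gen*Pvocab(w) + (1-p_gen)*sum of attention on w.

    OOV source words can be given ids >= vocab_size, which extends the distribution,
    so a copied OOV word no longer has zero probability.
    """
    extended_size = max(vocab_size, max(source_ids) + 1)
    final = np.zeros(extended_size)
    final[:vocab_size] = p_gen * p_vocab               # generate from the fixed vocabulary
    for a, idx in zip(attention, source_ids):          # copy from the source text
        final[idx] += (1.0 - p_gen) * a
    return final

# Toy example: vocabulary of 6 words, source contains one OOV word (id 6).
p_vocab = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])
attention = np.array([0.5, 0.3, 0.2])                  # over 3 source positions
dist = pointer_generator_dist(p_vocab, attention, source_ids=[2, 6, 4],
                              vocab_size=6, p_gen=0.7)
print(dist.round(3), dist.sum())                       # a valid distribution (sums to 1.0)
```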

Two models, a generative model and a discriminative model, were trained simultaneously to generate abstractive summary text using an adversarial process [58]. The maximum likelihood estimation (MLE) objective function employed in previous sequence-to-sequence models suffers from two problems: the difference between the training loss and the evaluation metric, and the unavailability of a golden token at testing time, which causes errors to accumulate during testing. To address these problems, the proposed approach exploited the adversarial framework. In the first step of the adversarial framework, reinforcement learning was employed to optimise the generator, which generates the summary from the original text. In the second step, the discriminator, which acts as a binary classifier, classified the summary as either a ground-truth summary or a machine-generated summary. The bidirectional LSTM encoder and attention mechanism were employed, as in [56].

Abstractive text summarisation using the LSTM-CNN model based on exploring semantic phrases (ATSDL) was proposed in [30]. ATSDL is composed of two phases: the first phase extracts the phrases from the sentences, while the second phase learns the collocation of the extracted phrases using the LSTM model. To generate sentences that are general and natural, the input and output of the ATSDL model were phrases instead of words, and the phrases were divided into three main types, i.e., subject, relation, and object phrases, where the relation phrase represents the relation between the input phrase and the output phrase. Each phrase was represented using a CNN layer. There are two main reasons for choosing the CNN: first, the CNN is efficient for sentence-level applications, and second, training is efficient since long-term dependency is unnecessary. Furthermore, to obtain several vectors for a phrase, multiple kernels with different widths, which represent the dimensionality of the features, were utilised. Within each kernel, the maximum feature was selected for each row in the kernel via max pooling. The resulting values were added to obtain the final value for each word in a phrase. Bidirectional LSTM was employed instead of a GRU on the encoder side, since the parameters are easy to tune with LSTM. Moreover, the decoder was divided into two modes, a generate mode and a copy mode. The generate mode generated the next phrase of the summary based on previously generated phrases and the hidden layers of the input on the encoder side, while the copy mode copied the phrase after the current input phrase if the currently generated phrase was not suitable for the previously generated phrases in the summary. Figure 10 provides additional details.

Bidirectional encoder and decoder LSTM-RNNs were employed to generate abstractive multisentence summaries [35]. The proposed approach considered past and future context on the decoder side when making a prediction, as it employed a bidirectional RNN. Using a bidirectional RNN on the decoder side addressed the problem of summary imbalance: an unbalanced summary can occur due to noise in a previous prediction, which reduces the quality of all subsequent summaries. The bidirectional decoder consists of two LSTMs, a forward decoder and a backward decoder. The forward decoder decodes the information from left to right, while the backward decoder decodes the information from right to left. The last hidden state of the forward decoder is fed as the initial input to the backward decoder and vice versa. Moreover, the researchers proposed a bidirectional beam-search method that generates summaries from the proposed bidirectional model. Bidirectional beam search combines information from the past and the future to produce a better summary; therefore, the output summary was balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence was read in reverse order, based on the conclusion that LSTM learns better when reading the source in reverse order while remembering the order of the target [66, 67]. A softmax layer was employed on the decoder side to obtain the probability of each target word of the summary over the vocabulary distribution, by taking the output of the decoder as input to the softmax layer. The decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. The objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token; however, during testing, the input of the forward decoder is the token generated in the previous step. The same situation holds for the backward decoder, where the input during training is the future token of the summary. Nevertheless, the bidirectional decoder has difficulty during testing, since the complete summary must be known in advance; thus, the full backward decoder output was generated first and fed to the forward decoder using a unidirectional backward beam search.

A combination of abstractive and extractive methods was employed in the guiding generation model proposed in [59]. The extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent key information. Furthermore, to predict the long-term value of the final summary, the proposed method applied a prediction guide mechanism [68].

Figure 9: Baseline sequence-to-sequence model with attention mechanism [56].

Figure 10: Semantic-unit-based LSTM model [30].

A prediction guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. The baseline encoder-decoder architecture of the proposed model is similar to that proposed by Nallapati et al. [55], where a bidirectional LSTM encoder and a unidirectional LSTM decoder were employed. Both models applied the attention mechanism and a softmax layer. Moreover, the process of generating the summary was improved by the proposed KIGN, which takes as input the keywords extracted using the TextRank algorithm. In KIGN, key information is represented by concatenating the last forward hidden state and the first backward hidden state. KIGN employs the attention mechanism and the pointer mechanism. In general, the attention mechanism can hardly identify the keywords; thus, to identify keywords, the output of KIGN is fed to the attention mechanism, and as a result, the attention mechanism is highly affected by the keywords. Moreover, to enable the pointer network to identify the keywords output by KIGN, the encoder context vector and the hidden state of the decoder are fed to the pointer network, and the output is employed to calculate the soft switch. The soft switch determines whether to copy the target from the original text or generate it from the target vocabulary, as shown in Figure 11.

The level of abstraction in the summaries generated by abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and the use of a novel metric for optimising the overlap between the n-grams of the summary and the ground-truth summary. The decoder was decomposed into a contextual network and a pretrained language model, as shown in Figure 12. The contextual network uses the source document to extract the relevant parts, while the pretrained language model contributes prior knowledge. This decomposition method facilitates the addition of an external pretrained language model related to several domains. Furthermore, a novel metric was employed to generate an abstractive summary by including words that are not in the source document. Bidirectional LSTM was utilised in the encoder, and the decoder applied a 3-layer unidirectional weight-dropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applied the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combined reinforcement learning and maximum likelihood.

A bidirectional decoder with a sequence-to-sequence architecture, referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing because the input of the decoder is the previously generated summary word, and if one of the generated summary words is incorrect, the error propagates through all subsequent summary words. In the bidirectional decoder, there are two decoders, a forward decoder and a backward decoder. The forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left, and the forward decoder considers a reference from the backward decoder. However, there is only a single-layer encoder. The encoder and decoders employ LSTM units, but while the encoder utilises bidirectional LSTM, the decoders use unidirectional LSTM, as shown in Figure 13. To understand the summary generated by the backward decoder, the attention mechanism is applied between the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, a pointer mechanism is employed in both decoders.

A double attention pointer network, referred to as DAPT, was applied to generate an abstractive text summarisation model [49]. The encoder utilised bidirectional LSTM, while the decoder utilised unidirectional LSTM. The key features of the encoder were extracted using a self-attention mechanism, and a beam search was employed at the decoder. Moreover, more coherent and accurate summaries were generated. The repetition problem was addressed using an improved coverage mechanism with a truncation parameter. The model was optimised by generating a training model based on RL and scheduled sampling.
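The coverage idea mentioned above is sketched below in its common form from [56]: a running sum of past attention distributions penalises re-attending to already covered source words. The optional truncation cap is shown as an assumed stand-in for DAPT's truncation parameter, and the toy distributions are illustrative.

```python
import numpy as np

def coverage_step(attention_history, new_attention, trunc=None):
    """Coverage vector = sum of all previous attention distributions.

    The coverage loss penalises overlap between the new attention and the
    coverage vector, which discourages repetition; `trunc` optionally caps
    each coverage entry (an assumed stand-in for the truncation parameter).
    """
    if attention_history:
        coverage = np.sum(attention_history, axis=0)
    else:
        coverage = np.zeros_like(new_attention)
    if trunc is not None:
        coverage = np.minimum(coverage, trunc)
    cov_loss = float(np.sum(np.minimum(new_attention, coverage)))
    return coverage, cov_loss

history = [np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])]
new_att = np.array([0.8, 0.1, 0.1])       # attends again to the first source word
print(coverage_step(history, new_att))     # high loss -> repetition is penalised
```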

4.1.2. GRU-RNN. Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. The dual encoder consists of two levels of encoders, i.e., a primary encoder and a secondary encoder, in addition to one decoder, and all of them employ a GRU. The primary encoder performs coarse encoding, while the secondary encoder performs fine encoding. The primary encoder and decoder are the same as in the standard encoder-decoder model with an attention mechanism, and the secondary encoder generates a new context vector that is based on the previous output and the input. Moreover, the additional context vector provides meaningful information for the output; thus, the repetition problem of the generated summary that was encountered in previous approaches is addressed. A semantic vector is generated at both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is calculated. The fixed-length output is partially generated at each stage in the decoder, since it decodes in stages.

Figure 14 elaborates the DEATS process. The primary encoder produces a hidden state hpj for each input j and a content representation cp. Next, the decoder decodes a fixed-length output, which is referred to as the decoder content representation cd. The weight αj can be calculated using the hidden states hpj and the content representations cp and cd. In this stage, the secondary encoder generates new hidden states, or semantic context vectors, hsm, which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator, a copy mechanism, and a coverage mechanism.

Wang et al. proposed a hybrid extractive-abstractive text summarisation model that is based on combining reinforcement learning with BERT word embedding [63]. In this hybrid model, a BERT feature-based strategy was used to generate contextualised token embeddings. This model consists of two submodels, an abstractive agent and an extractive agent, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where the sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In both the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. The training process consists of pretraining and full training phases.

Egonmwan et al. proposed the use of sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules: an extractive model and an abstractive model. The transformer encoder has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives a sentence-level representation. The architecture of the abstractive model consists of a single-layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder.

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].

The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.

413 Others BERT is employed to represent the sentencesof the document to express its semantic [65] Liu et alproposed abstractive and extractive summarisation modelsthat are based on encoder-decoder architecturee encoderused a BERT pretrained document-level encoder while thedecoder utilised a transformer that is randomly initialisedand trained from scratch In the abstractive model theoptimisers of the encoder and decoder are separatedMoreover two stages of fine-tuning are utilised at the en-coder one stage in extractive summarisation and one stagein abstractive summarisation At the decoder side a beamsearch was performed however the coverage and copymechanisms were not employed since these twomechanisms

need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries by using trigram blocking. The OOV words rarely appear in the generated summary.
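A minimal sketch of the trigram-blocking heuristic, assuming a beam-search decoder: a candidate token is rejected (given probability 0) if appending it would repeat a trigram that already occurs in the partial summary. The function name is illustrative.

    def creates_repeated_trigram(prev_tokens, candidate):
        """Return True if appending `candidate` would repeat an existing trigram."""
        seq = prev_tokens + [candidate]
        if len(seq) < 3:
            return False
        new_trigram = tuple(seq[-3:])
        existing = {tuple(seq[i:i + 3]) for i in range(len(seq) - 3)}
        return new_trigram in existing

    # inside beam search: skip such candidates, i.e. give them a score of -inf
    # if creates_repeated_trigram(hypothesis_tokens, tok): continue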

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, both the input and output tokens applied the same embedding matrix Wemb, which was generated using the GloVe word embedding model, in the Paulus et al. model [57]. Another word embedding matrix, referred to as Wout, was applied in the token generation layer. Additionally, a sharing weighting matrix was employed by both the shared embedding matrix Wemb and the Wout matrix. The sharing weighting matrices improved the process of generating tokens since they considered the embedding syntax and semantic information.
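A minimal PyTorch sketch of this weight-sharing idea, in which the output projection Wout reuses the input embedding matrix Wemb; the layer names, dimensions, and the intermediate projection are illustrative assumptions rather than the exact architecture of [57].

    import torch
    import torch.nn as nn

    class TiedOutputDecoder(nn.Module):
        """Token-generation layer that reuses the input embedding matrix Wemb as Wout."""
        def __init__(self, vocab_size, emb_dim, hidden_dim):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)        # Wemb
            self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.proj = nn.Linear(hidden_dim, emb_dim)             # map hidden state to embedding space
            self.out = nn.Linear(emb_dim, vocab_size, bias=False)  # Wout
            self.out.weight = self.embed.weight                    # tie Wout to Wemb

        def forward(self, token_ids):
            h, _ = self.rnn(self.embed(token_ids))
            return self.out(self.proj(h))                          # logits over the vocabulary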

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].


The discriminator input sequence of the Liu et al. model was encoded using a maximum pooling CNN, where the result was passed to the softmax layer [58]. On the other hand, the word embedding that was applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embedding. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, in which source documents of 781 tokens are paired with summaries of 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the document was preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets: the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consists of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve the coreference. The second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP was to perform Stanford NLP parsing, which is a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing was performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If the child node has children, then the child is considered a new root with children; this process continues recursively until there are no children for the root, in which case the tree structure is completed. Accordingly, the compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are refined before training by applying simple rules. First, the phrase triples at the topmost level are exploited since they carry the most semantic information. Second, triple phrases whose subject and object phrases contain no nouns are deleted, since the noun carries a considerable amount of conceptual information. Triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consisted of 92,000 text sources and 219,000 text sources, respectively.
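A highly simplified sketch of the subject-relation-object extraction idea behind MOSP, using spaCy as a stand-in for the Stanford parser used by ATSDL (an assumption); the dependency rules shown here are illustrative and much cruder than the refinement rules described above.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # stand-in for the Stanford parser used in ATSDL

    def extract_triples(text):
        """Very simplified phrase extraction: (subject, relational verb, object) per sentence."""
        triples = []
        for sent in nlp(text).sents:
            for token in sent:
                if token.pos_ == "VERB":
                    subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                    objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "pobj", "attr")]
                    if subjects and objects:  # drop triples lacking a subject/object (refinement rule)
                        triples.append((subjects[0].text, token.lemma_, objects[0].text))
        return triples

    print(extract_triples("The committee approved the new budget."))
    # e.g. [('committee', 'approve', 'budget')]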

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1 41.16, ROUGE2 15.75, and ROUGE-L 39.08 [57]. The results for the maximum-likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to evaluate the quality and readability of the summary. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]: the value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics of the Al-Sabahi et al. and Li et al. models; values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluation included ROUGE1, ROUGE2, and ROUGE-L, for which values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel n-gram-based score was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model: five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected, and the evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, where the WordPiece tokeniser was utilised; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-answering paradigm, where 20 documents were selected for evaluation. Three values were chosen for scoring an answer: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attentional encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model consists of a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs (a GRU-RNN for the word level and a GRU-RNN for the sentence level), while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of the source document. Instead of considering every vocabulary word, only certain words were added, based on their frequency in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. Linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrices for each tag type, similar to word embedding, while the TF-IDF feature was discretised into a fixed number of bins, where a one-hot representation was employed to represent the value of the bins. The one-hot vector consisted of the number of bin entries, where only one entry was set to one to indicate the TF-IDF value of a certain word. This process permitted the TF-IDF to be addressed in the same way as any other tag by concatenating all the embeddings into one long vector, as

shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8M training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and named-entity tag generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
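A small numpy sketch of the feature-rich input representation described above (word embedding concatenated with POS and NER tag embeddings and a one-hot vector for the discretised TF-IDF bin, cf. Figure 16); the dimensions, lookup tables, and binning scheme are illustrative assumptions.

    import numpy as np

    EMB_DIM, TAG_DIM, NUM_BINS = 200, 16, 10
    rng = np.random.default_rng(0)
    word_emb = {"police": rng.normal(size=EMB_DIM)}   # e.g. Word2Vec vectors
    pos_emb = {"NOUN": rng.normal(size=TAG_DIM)}      # lookup embedding per POS tag
    ner_emb = {"O": rng.normal(size=TAG_DIM)}         # lookup embedding per NER tag

    def tfidf_one_hot(tfidf, num_bins=NUM_BINS, max_val=1.0):
        """Discretise a continuous TF-IDF score into a fixed number of bins (one-hot)."""
        bin_id = min(int(tfidf / max_val * num_bins), num_bins - 1)
        vec = np.zeros(num_bins)
        vec[bin_id] = 1.0
        return vec

    def encoder_input(word, pos, ner, tfidf):
        """Concatenate all feature embeddings into one long input vector."""
        return np.concatenate([word_emb[word], pos_emb[pos], ner_emb[ner], tfidf_one_hot(tfidf)])

    x = encoder_input("police", "NOUN", "O", 0.37)    # shape: (200 + 16 + 16 + 10,)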

Finally, for both the single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of several approaches appear in Table 2, while the training, optimisation, mechanisms, and search at the decoder are presented in Table 3.

5 Datasets for Text Summarization

Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets are produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of the article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets that were generated by Hermann et al. [71], which were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (on average 29.74 sentences), while the summaries have 53 words (on average 3.72 sentences) [55].

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].


Figure 16: Word embedding concatenated with discretized TF-IDF, POS, and NER one-embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | —
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | Word-based RNN
[50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM | Unidirectional RNN attentive encoder-decoder LSTM
[38] | 2016 | Bidirectional LSTM | Unidirectional LSTM
[38] | 2016 | Bidirectional LSTM | Decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the

datasets. Nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study applied each of the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization: a placeholder symbol replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occur fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenization: a placeholder symbol replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occur fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and name-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions, trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins and a one-hot representation for the bins; (iv) lookup embeddings for part-of-speech and name-entity tags
[52] | Zhou et al. | PTB tokenization: a placeholder symbol replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occur fewer than 5 times | Word embedding with size equal to 300
[53] | Cao et al. | Normalization and tokenization: a placeholder symbol replaces digits, words are converted to lower case, and "UNK" replaces the least frequent words | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lower case letters | GloVe word embedding
[29] | Lopyrev | Tokenization; converting the articles and their headlines to lower case letters; using the symbol ⟨unk⟩ to replace rare words | The input was represented using a distributed representation
[38] | Jobson et al. | — | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | — | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | — | CNN maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving the coreference and reducing the morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | — | The word embedding is learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | — | The word embedding is learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | — | BERT
[63] | Wang et al. | Using the WordPiece tokenizer | BERT
[64] | Egonmwan et al. | — | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers listed in this work.

6 Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated

summary with a manually generated summary [73]. The package consists of several measures to evaluate the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is an n-gram recall such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise the negative log-likelihood | — | Beam search
[39] | Chopra et al. | Minimizing the negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimize the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimizing the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimizing the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross-entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | —
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanisms | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | —
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | —
[30] | Song et al. | — | Attention mechanism, copy mechanism | —
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is

employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}, \quad (1)

where S is the reference summary, n is the n-gram length, and Count_match(gram_n) is the maximum number of matching n-gram words between the reference summary and the generated summary. Count(gram_n) is the total number of n-gram words in the reference summary [73].
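A minimal Python sketch of equation (1) for a single reference summary; Count_match is implemented as the clipped n-gram count.

    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def rouge_n(reference, candidate, n=1):
        """ROUGE-N recall: clipped n-gram matches divided by n-grams in the reference."""
        ref_counts = Counter(ngrams(reference.split(), n))
        cand_counts = Counter(ngrams(candidate.split(), n))
        match = sum(min(count, cand_counts[g]) for g, count in ref_counts.items())
        total = sum(ref_counts.values())
        return match / total if total else 0.0

    print(rouge_n("ahmed ate the apple", "the apple ahmed ate", n=2))  # 2/3 ≈ 0.67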

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of the common matching words between the reference summary and the

generated summary. The LCS calculation does not require the matched words to be consecutive; however, the order of occurrence is important. In addition, no predefined number of matched words is required. LCS considers only the main in-sequence match, which is one of its disadvantages, since the final score will not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets.

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarization evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | —
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL anthology reference | ACL anthology reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple
A: the apple Ahmed ate

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" but not both, similar to LCS.
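A short sketch of ROUGE-L for the example above: the LCS length is computed by dynamic programming and divided by the reference length to obtain recall.

    def lcs_length(a, b):
        """Length of the longest common subsequence of two token lists."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    R = "ahmed ate the apple".split()
    A = "the apple ahmed ate".split()
    print(lcs_length(R, A))              # 2 -> only "ahmed ate" OR "the apple" is counted
    print(lcs_length(R, A) / len(R))     # ROUGE-L recall = 2/4 = 0.5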

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, Perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]. The models were evaluated using various datasets, and the other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L, obtained by text summarisation with a pretrained encoder model, were 43.85, 20.34, and 39.9, respectively [65]. Even though ROUGE was employed to evaluate abstractive summarisation, it would be better to develop new methods to evaluate the quality of summarisation. New evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text. However, ROUGE is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups: the first group considered single-sentence summary approaches, while the second group considered multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods on the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are those that employed BERT word embedding and were based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset: Gigaword is utilised for single-sentence summaries, as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries; thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the generated summaries of 50 test examples for 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is

not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summaries [60]. Five human evaluators rated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where 1 indicates that the generated summary is less readable and less relevant and 10 indicates that the generated summary is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are comparable, with values of 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summaries in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retained key information, which was achieved by human evaluation [65]. In addition, the qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were selected for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summaries of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are also not enough, due to the small number of testing examples and evaluators.

7 Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder will be limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either a gold token is used, as during training, or the output of the previous step is used, as during both testing and training. In this manner, at least the training step receives the same input as testing.


In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76].

Moreover, the mass convolution of the QRNN is applied in [50] since the dependency of words generated in the future is difficult to determine.
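A minimal sketch of the coin-flip idea shared by DaD and scheduled sampling: at each decoder step during training, either the gold token or the model's previous prediction is fed back. The 10% sampling probability follows the description of [29]; the function name is illustrative.

    import random

    def next_decoder_input(gold_prev, model_prev, sampling_prob=0.1):
        """With probability `sampling_prob`, feed back the model's previous prediction
        (as in teacher forcing with 10% sampling); otherwise feed the gold token."""
        return model_prev if random.random() < sampling_prob else gold_prev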

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copy it to the memory. When the switch is turned on, the decoder generates a word from the target vocabularies. Conversely, researchers in [56] addressed OOV words via a generation probability Pgen, whose value is calculated from the context vector and decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them. The combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the input sequence position of unseen tokens in order to copy them.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | —
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.9 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset.


Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. Moreover, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
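A numpy sketch of how the final word distribution of the pointer-generator [56] is formed over the extended vocabulary; the computation of p_gen itself (a sigmoid over the context vector, decoder state, and decoder input) is only indicated in a comment, and all sizes are toy values.

    import numpy as np

    def final_distribution(vocab_dist, attention, src_ids, p_gen, extended_size):
        """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on source copies of w."""
        dist = np.zeros(extended_size)
        dist[: len(vocab_dist)] = p_gen * vocab_dist          # generate from the fixed vocabulary
        for pos, word_id in enumerate(src_ids):               # copy from the source (handles OOV ids)
            dist[word_id] += (1.0 - p_gen) * attention[pos]
        return dist

    # p_gen is typically sigmoid(w_c . context + w_s . decoder_state + w_x . decoder_input + b)
    # toy usage: fixed vocabulary of 6 words plus 2 source-only (OOV) ids -> extended size 8
    vocab_dist = np.full(6, 1 / 6)
    attention = np.array([0.7, 0.2, 0.1])
    print(final_distribution(vocab_dist, attention, src_ids=[6, 2, 7], p_gen=0.8, extended_size=8))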

7.3. Summary Sentence Repetition and Inaccurate Information Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two

challenges that must be considered. Both challenges are due to the summarisation of long documents and the production of long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector by aggregating the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repeatedly attending to the same part of the input at different steps of the decoder. However, the intratemporal encoder attention mechanism cannot address all the repetition challenges, especially when a long sequence is generated; thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words. Moreover, the proposed intradecoder attention mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy maximum-likelihood loss and reinforcement learning gradients to minimise the exposure bias. In addition, the trigram probability p(yt) was proposed to address repetition in the generated summary, where yt is the trigram sequence; the value of p(yt) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated using a secondary encoder to remember the previously generated decoder output, and the coverage mechanism was utilised.
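A minimal numpy sketch of the coverage idea used in [35, 56]: the coverage vector accumulates the attention distributions of previous decoder steps, and the coverage loss sum(min(a_t, c_t)) penalises attending again to already covered source positions. The variable names are illustrative.

    import numpy as np

    def coverage_step(coverage, attention):
        """coverage = sum of attention distributions of all previous decoder steps."""
        loss = np.minimum(attention, coverage).sum()   # penalises re-attending to covered positions
        return coverage + attention, loss

    coverage = np.zeros(4)
    total_loss = 0.0
    for attn in [np.array([0.7, 0.2, 0.1, 0.0]), np.array([0.6, 0.3, 0.1, 0.0])]:
        coverage, step_loss = coverage_step(coverage, attn)
        total_loss += step_loss
    print(coverage, total_loss)   # the second step is penalised for re-attending to position 0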

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary was conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.
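A numpy sketch of the dual-attention idea: the decoder state attends separately over the encoded source text and the encoded fact descriptions, and the two context vectors are merged. The dot-product attention and the gating formulation here are assumptions for illustration, not the exact formulation of [53].

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def dual_attention_context(dec_state, text_states, fact_states, gate_w):
        """Attend separately over source-text and fact encodings, then merge the contexts."""
        c_text = softmax(text_states @ dec_state) @ text_states
        c_fact = softmax(fact_states @ dec_state) @ fact_states
        g = 1.0 / (1.0 + np.exp(-gate_w @ np.concatenate([c_text, c_fact])))  # scalar gate
        return g * c_text + (1.0 - g) * c_fact

    d = 4
    rng = np.random.default_rng(1)
    ctx = dual_attention_context(rng.normal(size=d), rng.normal(size=(6, d)),
                                 rng.normal(size=(3, d)), rng.normal(size=2 * d))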

7.5. Other Challenges. The main issue of the abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary is the set of highlights of the news article; every highlight represents a sentence in the summary, and therefore the number of sentences in the summary is equal to the number of highlights.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


Sometimes the highlights do not address all crucial points of the article. Therefore, producing a high-quality dataset requires a great deal of effort. Moreover, in some languages, such as Arabic, a multisentence dataset for abstractive summarisation is not available; single-sentence abstractive Arabic text summarisation datasets are available but are not free.

Another issue of abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough, as ROUGE depends on exact matching between words. For example, the

words book and books are considered different by any one of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words (words that have the same meaning must be considered the same even if they have a different surface form). In this case, we propose the use of METEOR, which has recently been used in evaluating machine translation and automatic summarisation models [77]. Moreover, METEOR considers stemming, morphological variants, and synonyms. In addition, in flexible-order languages, it is better to use ROUGE without caring about the order of the words.

The quality of the generated summary can be improved using linguistic features.

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


For example, we propose the use of dependency parsing at the encoder in a separate layer on top of the first hidden state layer. We also propose the use of word embeddings that are built by considering dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of a word and its surrounding words.

Based on the new trends and evaluation results, we think that the most promising feature among all the features is the use of the BERT pretrained model. The quality of the models that are based on the transformer is high and will yield promising results.

8 Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that applied deep learning for abstractive text summarisation, the datasets, and the measures used to evaluate these approaches. Moreover, the challenges encountered when employing various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the

New York Times. Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively. The best results were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, there are several issues that must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.
[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.
[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].


[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.
[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.
[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.
[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.
[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.
[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.
[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.
[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.
[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.
[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, 2017.
[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.
[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.
[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.
[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.
[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, 2018.
[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.
[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.
[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, http://arxiv.org/abs/1812.02303, 2020.
[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.
[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.
[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.
[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.
[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
[29] K. Lopyrev, "Generating news headlines with recurrent neural networks," p. 9, 2015, https://arxiv.org/abs/1512.01712.
[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.
[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.
[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.
[38] E. Jobson and A. Gutierrez, "Abstractive text summarization using attentive sequence-to-sequence RNNs," p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.
[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.
[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.
[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.
[44] J. Pennington, R. Socher, and C. Manning, "Glove: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.
[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.
[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.
[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.
[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.
[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," https://arxiv.org/abs/1611.01576, 2015.
[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.
[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.
[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.
[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.
[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.
[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.
[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.
[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.
[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.
[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.
[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.
[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.
[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.
[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.
[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.
[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.
[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.
[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.
[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation, text summarization branches out," Proceedings of the ACL-04 Workshop, vol. 8, 2004.
[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.

[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, New Orleans, LA, USA, pp. 708–719, June 2018.

[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.

[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.

[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.

[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.

[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.



Figure 2: LSTM unit architecture [5].

Figure 1: Sequence-to-sequence model; the last hidden state of the encoder is fed as input to the decoder with the symbol <EOS> [51].


information. The new information is formed by adding the old information to the result of the element-wise multiplication of the output of the two memory gate neural networks.

(4) Output Gate. The output gate controls the amount of new information that is forwarded to the next LSTM unit. The output gate is a neural network with a sigmoid activation function that takes the input vector, the previous hidden state, the new information, and the bias as input. The output of the sigmoid function is multiplied by the tanh of the new information to produce the output of the current block.
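For reference, the gate interactions described above correspond to the standard LSTM update; the following is a common formulation (a sketch with bias terms in every gate, so the exact parameterisation of individual surveyed models may differ):

\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), & f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f),\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), & c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), & h_t &= o_t \odot \tanh(c_t),
\end{aligned}
\]

where $x_t$ is the input vector, $h_{t-1}$ is the previous hidden state, $c_t$ is the memory (cell) state, and $\odot$ denotes element-wise multiplication.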

2.3.2. Gated Recurrent Unit (GRU). A GRU is a simplified LSTM with two gates, a reset gate and an update gate, and no explicit memory. The previous hidden state information is forgotten when all the reset gate elements approach zero; then only the input vector affects the candidate hidden state, and in this case the update gate acts as a forget gate. LSTM and GRU are both commonly employed for abstractive summarisation, since LSTM has a memory unit that provides extra control, whereas the computation time of the GRU is reduced [38]. In addition, while it is easier to tune the parameters with LSTM, the GRU takes less time to train [30].
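In equation form, the GRU update is commonly written as follows (a sketch of the standard formulation; individual surveyed models may parameterise it differently):

\[
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{(update gate)},\\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{(reset gate)},\\
\tilde{h}_t &= \tanh\!\left(W x_t + U (r_t \odot h_{t-1})\right) && \text{(candidate hidden state)},\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}.
\end{aligned}
\]

When all elements of $r_t$ approach zero, the candidate depends only on the current input $x_t$, which is exactly the forgetting behaviour described above.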

2.4. Attention Mechanism. The attention mechanism was employed for neural machine translation [33] before being utilised for NLP tasks such as text summarisation [18]. A basic encoder-decoder architecture may fail when given long sentences, since the size of the encoding is fixed for the input string; thus, it cannot consider all the elements of a long input. To remember the input that has a significant impact on the summary, the attention mechanism was introduced [29]. The attention mechanism is employed at each output word to calculate a weight between the output word and every input word; the weights sum to one. The advantage of using weights is to show which input word must receive attention with respect to the output word. The weighted average of the last hidden layers of the decoder at the current step is calculated after passing each input word and is fed to the softmax layer along with the last hidden layers [39].
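A common way to write this mechanism, given encoder hidden states $h_1, \ldots, h_n$ and the current decoder state $s_t$, is the following sketch (the exact scoring function differs between the surveyed models):

\[
e_{t,i} = \mathrm{score}(s_t, h_i), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})}, \qquad
c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i,
\]

where the weights $\alpha_{t,i}$ sum to one and the context vector $c_t$ is fed, together with the decoder state, to the softmax output layer.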

2.5. Beam Search. Beam search and greedy search are very similar; however, while greedy search considers only the best hypothesis, beam search considers b hypotheses, where b represents the beam width or beam size [5]. In text summarisation tasks, the decoder utilises the final encoder representation to generate the summary from the target vocabulary. In each step, the output of the decoder is a probability distribution over the target words. Thus, to obtain the output word from the learned probability, several methods can be applied, including (1) greedy sampling, which selects the distribution mode; (2) 1-best beam search, which selects the best output; and (3) n-best beam search, which selects several outputs. When n-best beam search is employed, the top b most relevant target words are selected from the distribution and fed to the next decoder state. The decoder keeps only the top k1 of k words from the different inputs and discards the rest.
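The sketch below illustrates this n-best decoding loop in Python. It assumes a hypothetical interface next_token_logprobs(prefix) that stands in for one decoder step and returns a dictionary mapping candidate tokens to log-probabilities; it is a generic illustration rather than the decoder of any particular surveyed model.

import heapq

def beam_search(next_token_logprobs, start_token, end_token, beam_width=4, max_len=30):
    # next_token_logprobs: hypothetical callable, prefix (list of tokens) -> {token: log-probability}
    beams = [(0.0, [start_token])]          # (cumulative log-probability, partial summary)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:        # finished hypotheses are stored and not expanded further
                completed.append((score, seq))
                continue
            for token, logp in next_token_logprobs(seq).items():
                candidates.append((score + logp, seq + [token]))
        if not candidates:
            break
        # keep only the top-b partial summaries for the next decoding step
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    completed.extend(beams)
    return max(completed, key=lambda c: c[0])[1]

Setting beam_width to 1 reduces the procedure to greedy search, which is the relationship described above.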

2.6. Distributed Representation (Word Embedding). A word embedding is a distributional vector representation of a word that captures the syntactic and semantic features of words [40]. Words must be converted to vectors to handle various NLP challenges, so that the semantic similarity between words can be calculated using cosine similarity, Euclidean distance, etc. [41–43]. In NLP tasks, the word embeddings of the words are fed as inputs to neural network models. In the recurrent neural network encoder-decoder architecture that is employed to generate the summaries, the input of the model is the word embedding of the text, and the output is the word embedding of the summary.
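As a minimal illustration of the similarity computation mentioned above, the following snippet computes the cosine similarity between two embedding vectors; the 4-dimensional vectors are toy values for illustration only, not output of a trained model.

import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two embedding vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy embeddings (illustrative values)
king  = np.array([0.8, 0.3, 0.1, 0.5])
queen = np.array([0.7, 0.4, 0.2, 0.5])
print(cosine_similarity(king, queen))   # close to 1.0 for semantically similar words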

In NLP there are several word embedding models, such as Word2Vec, GloVe, FastText, and Bidirectional Encoder Representations from Transformers (BERT), which are the most recently employed word embedding models [41, 44–47]. The Word2Vec model consists of two approaches, skip-gram and continuous bag-of-words (CBOW), both of which depend on the context window [41].

Figure 3: LSTM unit gates [5]: (a) input gate, (b) forget gate, (c) memory gate, (d) output gate.

On the other hand, GloVe represents the global vector, which is based on statistics of the global corpus instead of the context window [44]. FastText extends the skip-gram approach of the Word2Vec model by using subword internal information to address out-of-vocabulary (OOV) terms [46]. In FastText, the subword components are composed to build the vector representation of a word, which facilitates representation of word morphology and lexical similarity. The BERT word embedding model is based on a multilayer bidirectional transformer encoder [47, 48]. Instead of using sequential recurrence, the transformer neural network utilises parallel attention layers. BERT creates a single large transformer by combining the representations of the words and sentences. Furthermore, BERT is pretrained with an unsupervised objective over a large amount of text.
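To make the subword idea concrete, the sketch below enumerates the character n-grams that FastText composes (by summing their embeddings, together with the whole-word unit itself) into a word vector. The function name is ours, and the n-gram sizes are configurable; 3 to 6 is the commonly cited default range.

def fasttext_subwords(word, n_min=3, n_max=6):
    # enumerate the character n-grams FastText composes into a word vector
    marked = f"<{word}>"                      # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams

print(fasttext_subwords("where", 3, 3))       # ['<wh', 'whe', 'her', 'ere', 're>']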

2.7. Transformers. The contextual representations of language are learned from large corpora. One of the new language representations that extends word embedding models is BERT, mentioned in the previous section [48]. In BERT, two tokens are inserted into the text. The first token (CLS) is employed to aggregate the information of the whole text sequence. The second token is (SEP); this token is inserted at the end of each sentence to represent it. The resultant text consists of tokens, where each token is assigned three types of embeddings: token, segmentation, and position embeddings. Token embedding indicates the meaning of a token, segmentation embedding identifies the sentences, and position embedding determines the position of the token. The sum of the three embeddings is fed to the bidirectional transformer as a single vector. Pretrained word embedding vectors are more precise and rich with semantic features. BERT has the advantage of supporting both fine-tuning (based on the objectives of certain tasks) and feature-based methods. Moreover, transformers compute the representation of the input and output by using self-attention, where self-attention enables learning of the relevance between each "word pair" [47].
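The input construction described above (element-wise sum of token, segmentation, and position embeddings) can be sketched as follows. The sizes and token ids are toy values chosen for illustration; BERT-base, for instance, uses a hidden size of 768.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_pos, hidden = 100, 16, 8       # toy sizes for illustration only

token_emb    = rng.normal(size=(vocab_size, hidden))
segment_emb  = rng.normal(size=(2, hidden))            # sentence A / sentence B
position_emb = rng.normal(size=(max_pos, hidden))

# token ids for "[CLS] w1 w2 [SEP] w3 [SEP]" (ids are illustrative)
tokens    = np.array([1, 10, 11, 2, 12, 2])
segments  = np.array([0, 0, 0, 0, 1, 1])
positions = np.arange(len(tokens))

# the transformer input is the sum of the three embeddings for each token
x = token_emb[tokens] + segment_emb[segments] + position_emb[positions]
print(x.shape)    # (6, 8): one summed vector per token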

3. Single-Sentence Summary

Recently, the RNN has been employed for abstractive text summarisation and has provided significant results. Therefore, we focus on abstractive text summarisation based on deep learning techniques, especially the RNN [49]. We discuss the approaches that have applied deep learning for abstractive text summarisation since 2015. An RNN with an attention mechanism was most commonly utilised for abstractive text summarisation. We classified the research according to summary type (i.e., single-sentence or multisentence summary), as shown in Figure 4. We also compared the approaches in terms of encoder-decoder architecture, word embedding, dataset and dataset preprocessing, and evaluations and results. This section covers single-sentence summary methods, while Section 4 covers multisentence summary methods. Single-sentence summary methods include a neural attention model for abstractive sentence summarisation [18], abstractive sentence summarisation with attentive RNN (RAS) [39], the quasi-RNN [50], a method for generating news headlines with RNNs [29], abstractive text summarisation using an attentive sequence-to-sequence RNN [38], neural text summarisation [51], selective encoding for abstractive sentence summarisation (SEASS) [52], faithful to the original fact-aware neural abstractive summarization (FTSumg) [53], and the improving transformer with sequential context (RCT) [54].

3.1. Abstractive Summarization Architecture

3.1.1. Feedforward Architecture. Neural networks were first employed for abstractive text summarisation by Rush et al. in 2015, where a local attention-based model was utilised to generate summary words by conditioning them on the input sentence [18]. Three types of encoders were applied: the bag-of-words encoder, the convolutional encoder, and the attention-based encoder. The bag-of-words model of the embedded input was used to distinguish between stop words and content words; however, this model had a limited ability to represent continuous phrases. Thus, a model that utilised a deep convolutional encoder was employed to allow the words to interact locally without the need for context. The convolutional encoder model can alternate between temporal convolution and max-pooling layers using the standard time-delay neural network (TDNN) architecture; however, it is limited to a single output representation. This limitation of the convolutional encoder model was overcome by the attention-based encoder. The attention-based encoder exploits learned soft alignment to weight the input based on the context in order to construct a representation of the output. Furthermore, a beam-search decoder was applied to limit the number of hypotheses in the summary.

3.1.2. RNN Encoder-Decoder Architecture

(1) LSTM-RNN. An abstractive sentence summarisation model that employs a conditional recurrent neural network (RNN) to generate the summary from the input is referred to as a recurrent attentive summariser (RAS) [39]. The RAS is an extension of the work in [18]. In [18], the model employed a feedforward neural network, while the RAS employed an RNN-LSTM. The encoder and decoder in both models were trained using sentence-summary pair datasets, but the decoder of the RAS improved the performance since it considered the position information of the input words. Furthermore, previous words and input sentences were employed to produce the next word in the summary during the training phase.

Lopyrev [29] proposed a simplified attention mechanism that was utilised in an encoder-decoder RNN to generate headlines for news articles. The news article was fed into the encoder one word at a time and then passed through the embedding layer to generate the word representation. The experiments were conducted using simple and complex attention mechanisms. In the simple attention mechanism, the last layer after processing the input in the encoder was divided into two parts, one part for calculating the attention weight vector and one part for calculating the context vector, as shown in Figure 5(a). However, in the complex attention mechanism, the last layer was employed to calculate the attention weight vector and context vector without fragmentation, as shown in Figure 5(b). In both figures, the solid lines indicate the part of the hidden state of the last layer that is employed to compute the context vector, while the dashed lines indicate the part of the hidden state of the last layer that is applied to compute the attention weight vector. The same difference exists on the decoder side: in the simple attention mechanism, the last layer was divided into two parts (one part was passed to the softmax layer, and the other part was applied to calculate the attention weight), while in the complex attention mechanism, no such division was made. A beam search was performed at the decoder side during testing to extend the sequence of the probability.

Figure 4: Taxonomy of several approaches that use a recurrent neural network and attention mechanism in abstractive text summarisation, based on the summary type (single-sentence summary versus multisentence summary).

The encoder-decoder RNN and sequence-to-sequence models were utilised in [55], which mapped the inputs to the target sequences; the same approach was also employed in [38, 51]. Three different methods for global attention were proposed for calculating the scoring functions, including dot product scoring, the bilinear form, and a scalar value calculated from a projection of the hidden states of the RNN encoder [38]. The model applied LSTM cells instead of GRU cells (both LSTM and GRU are commonly employed for abstractive summarisation tasks, since LSTM has a memory unit that provides control, but the computation time of GRU is lower). Three models were employed: the first model applied unidirectional LSTM in both the encoder and the decoder; the second model was implemented using bidirectional LSTM in the encoder and unidirectional LSTM in the decoder; and the third model utilised a bidirectional LSTM encoder and an LSTM decoder with global attention. The first hidden state of the decoder is the concatenation of all backward and forward hidden states of the encoder. The use of attention in an encoder-decoder neural network generates a context vector at each timestep. For the local attention mechanism, the context vector is conditioned on a subset of the encoder's hidden states, while for the global attention mechanism, the vector is conditioned on all the encoder's hidden states. After generating the first decoder output, the next decoder input is the word embedding of the output of the previous decoder step. An affine transformation is used to convert the output of the decoder LSTM to a dense vector prediction, due to the long training time that would be needed if the number of hidden states were the same as the number of words in the vocabulary.
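The three scoring variants mentioned above are commonly written as follows (a sketch in the style of Luong-type global attention, where $s_t$ is the decoder state, $h_i$ an encoder hidden state, and $W_a$, $v_a$ learned parameters; the exact parameterisation in [38] may differ):

\[
\mathrm{score}(s_t, h_i) =
\begin{cases}
s_t^{\top} h_i, & \text{dot product},\\
s_t^{\top} W_a h_i, & \text{bilinear form},\\
v_a^{\top} \tanh\!\left(W_a [s_t; h_i]\right), & \text{scalar projection of the hidden states}.
\end{cases}
\]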

Khandelwal [51] employed a sequence-to-sequence model that consists of an LSTM encoder and an LSTM decoder for abstractive summarisation of small datasets. The decoder generated the output summary after reading the hidden representations generated by the encoder and passing them to the softmax layer. The sequence-to-sequence model does not memorize information, so generalization of the model is not possible. Thus, the proposed model utilised imitation learning to determine whether to choose the golden token (i.e., the reference summary token) or the previously generated output at each step.
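The choice between the golden token and the model's own previous prediction is often implemented as a scheduled-sampling style decision [76]. The sketch below is a generic illustration of that decision, not the exact policy learned by the imitation-learning approach of [51]; the function name and probability are illustrative.

import random

def next_decoder_input(gold_token, predicted_token, p_gold=0.75):
    # with probability p_gold feed the reference token (teacher forcing);
    # otherwise feed the model's own previous prediction
    return gold_token if random.random() < p_gold else predicted_token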

(2) GRU-RNN. A combination of elements of the RNN and the convolutional neural network (CNN) was employed in an encoder-decoder model that is referred to as a quasi-recurrent neural network (QRNN) [50]. In the QRNN, the GRU was utilised in addition to the attention mechanism. The QRNN was applied to address the limitation of parallelisation: the dependencies of the words in previous steps are obtained via convolution and "fo-pooling", which are performed in parallel, as shown in Figure 6. The convolution in the QRNN can be either masked convolution (considering previous timesteps only) or centre convolution (considering future timesteps). The encoder-decoder model employed two neural networks: the first network applied the centre convolution of the QRNN and consisted of multiple hidden layers that were fed by the vector representations of the words, and the second network comprised neural attention and considered as input the encoder hidden layers to generate one word of a headline. The decoder accepted the previously generated headline word and produced the next word of the headline; this process continued until the headline was completed.

SEASS is an extension of the sequence-to-sequence recurrent neural network that was proposed in [52]. The selective encoding for abstractive sentence summarisation (SEASS) approach includes a selective encoding model that consists of an encoder for sentences, a selective gate network, and a decoder with an attention mechanism, as shown in Figure 7. The encoder uses a bidirectional GRU, while the decoder uses a unidirectional GRU with an attention mechanism. The encoder reads the input words and produces their representations. The sentence meaning is used by the selective gate to filter the word representations and build the representations used to generate the summary. To produce an excellent summary and accelerate the decoding process, a beam search was selected for the decoder.
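The selective gate can be sketched in the following general form (where $h_i$ is the encoder representation of word $i$, $s$ is the sentence representation built from the final forward and backward encoder states, and $W_g$, $U_g$, $b_g$ are learned parameters; the exact parameterisation in [52] may differ):

\[
g_i = \sigma\!\left(W_g h_i + U_g s + b_g\right), \qquad h_i' = h_i \odot g_i,
\]

so each word representation is filtered element-wise by the sentence meaning before being passed to the attentive decoder.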

On the other hand, dual attention was applied in [53]. The proposed dual attention approach consists of three modules: two bidirectional GRU encoders and one dual attention decoder. The decoder has a gate network for context selection, as shown in Figure 8, and employs copying and coverage mechanisms. The outputs of the encoders are two context vectors: one context vector for the sentences and one context vector for the relations.

Figure 5: (a) Simple attention and (b) complex attention [29].

Figure 6: Comparison of the CNN, LSTM, and QRNN models [50].

Figure 7: Selective encoding for abstractive sentence summarisation (SEASS) [52].

Figure 8: Faithful to the original (FTSum) [53].

A relation may be a triple or a tuple relation. A triple relation consists of the subject, predicate, and object, while a tuple relation consists of either (subject and predicate) or (predicate and subject). Sometimes the triple relation cannot be extracted; in this case, two tuple relations are utilised. The decoder gate merges both context vectors based on their relative association.

(3) Others. The poor semantic representation of long sequences in abstractive text summarisation approaches that are based on an RNN encoder-decoder framework was addressed using the RC-Transformer (RCT) [54]. The RCT is an RNN-based abstractive text summarisation model that is composed of two encoders (an RC encoder and a transformer encoder) and one decoder. The transformer has an advantage in parallel computing in addition to retrieving global context semantic relationships. On the other hand, sequential context representation was achieved by the second encoder of the RC-Transformer. Word ordering is crucial for abstractive text summarisation and cannot be captured by positional encoding alone; therefore, the RCT utilised two encoders to address the shortage of sequential information at the word level. A beam search was utilised at the decoder. Furthermore, Cai et al. compared the speed of the RCT model and that of an RNN-based model and concluded that the RCT is 1.4x and 1.2x faster.

3.2. Word Embedding. In the QRNN model, GloVe word embedding, which was pretrained using the Wikipedia and Gigaword datasets, was used to represent the text and summary [50]. In the first model proposed by Jobson et al., the word embedding was randomly initialised and updated during training, while GloVe word embedding was employed to represent the words in the second and third models [38]. In the study by Cai et al., the Transformer was utilised [54].

3.3. Dataset and Dataset Preprocessing. In the model that was proposed by Rush et al., the datasets were preprocessed via PTB tokenization by using "#" to replace all digits, converting all letters to lowercase, and using "UNK" to replace words that occurred fewer than 5 times [18]. The model was trained with any input-output pairs due to the shortage of constraints for generating the output. The training process was carried out on the Gigaword datasets, while the summarisation evaluation was conducted on DUC2003 and DUC2004 [18]. Furthermore, the model proposed by Chopra et al. was trained using the Gigaword corpus with sentence separation and tokenisation [39]. To form sentence-summary pairs, each headline of an article was paired with the first sentence of the article. The same data preprocessing steps as in [18] were performed in [39]. Moreover, the Chopra et al. model was evaluated using the DUC2004 dataset, which consists of 500 pairs.

Gigaword datasets were also employed by the QRNN model [50]. Furthermore, articles that started with sentences that contained more than 50 words, or whose headlines had more than 25 words, were removed. Moreover, the words in the articles and their headlines were converted to lowercase, and the data points were split into short, medium, and long sentences based on sentence length to avoid extra padding.

Lopyrev and Jobson et al. trained their models using Gigaword after processing the data. In the Lopyrev model, the most important preprocessing steps for both the text and the headline were tokenisation and conversion of characters to lowercase [29]. In addition, only the characters of the first paragraph were retained, and the length of the headline was fixed between 25 and 50 words. Moreover, articles without headlines were disregarded, and the <unk> symbol was used to replace rare words.

Khandelwal employed the Association for Computational Linguistics (ACL) Anthology Reference Corpus, which consists of 16845 examples for training and 500 examples for testing; these were considered small datasets in the experiments [51]. The abstract included the first three sentences, and the unigram overlap between the title and the abstract was also calculated. There were 25 tokens in the summary and a maximum of 250 tokens in the input text.

The English Gigaword dataset, the DUC2004 corpus, and MSR-ATC were selected to train and test the SEASS model [52]. Moreover, the experiments on the Cao et al. model were conducted using the Gigaword dataset [53]. The same data preprocessing steps as in [18] were performed in [52, 53]. Moreover, the RCT also employed the Gigaword and DUC2004 datasets in experiments [54].

3.4. Evaluation and Results. Recall-Oriented Understudy for Gisting Evaluation 1 (ROUGE1), ROUGE2, and ROUGE-L were utilised to evaluate the Rush et al. model, and values of 28.18, 8.49, and 23.81, respectively, were obtained [18]. The experimental results of the Chopra et al. model showed that, although DUC2004 was too complex for the experiments, on the Gigaword corpus the proposed model outperformed state-of-the-art methods in terms of ROUGE1, ROUGE2, and ROUGE-L [39]. The values of ROUGE1, ROUGE2, and ROUGE-L were 28.97, 8.26, and 24.06, respectively. On the other hand, BLEU was employed to evaluate the Lopyrev model [29], while Khandelwal utilised perplexity [51]. The SEASS model was evaluated using ROUGE1, ROUGE2, and ROUGE-L, and the results of the three measures were 36.15, 17.54, and 33.63, respectively [52]. Moreover, ROUGE1, ROUGE2, and ROUGE-L were selected for evaluating the Cao et al. model [53]. The values of ROUGE1, ROUGE2, and ROUGE-L were 37.27, 17.65, and 34.24, respectively, and the results showed that fake summaries were reduced by 80%. In addition, the RCT was evaluated using ROUGE1, ROUGE2, and ROUGE-L, with values of 37.27, 18.19, and 34.62 on the Gigaword dataset. The results showed that the RCT model outperformed other models by generating high-quality summaries that contain salient information [54].
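For reference, ROUGE-N is the recall-oriented n-gram overlap between a candidate summary and the reference summaries [73]:

\[
\mathrm{ROUGE\text{-}N} =
\frac{\sum_{S \in \mathrm{References}} \; \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}
     {\sum_{S \in \mathrm{References}} \; \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)},
\]

where Count_match is the maximum number of n-grams co-occurring in the candidate summary and the reference; ROUGE-L is based on the longest common subsequence rather than fixed-length n-grams.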


4. Multisentence Summary

In this section, multisentence summary and deep learning-based abstractive text summarisation are discussed. Multisentence summary methods include the get-to-the-point method (summarisation with pointer-generator networks) [56], a deep reinforced model for abstractive summarization (RL) [57], the generative adversarial network for abstractive text summarization [58], semantic phrase exploration (ATSDL) [30], the bidirectional attentional encoder-decoder model with bidirectional beam search [35], the key information guide network [59], text summarisation abstraction improvement [60], dual encoding for abstractive text summarisation (DEATS) [61], abstractive document summarisation via bidirectional decoder (BiSum) [62], the text abstraction summary model based on BERT word embedding and RL [63], the transformer-based model for single-document neural summarisation [64], text summarisation with pretrained encoders [65], and the text summarisation method based on the double attention pointer network [49]. The pointer-generator approach [55] covers both single-sentence and multisentence summaries. Additional details are presented in the following sections.

4.1. Abstractive Summarization Architecture

4.1.1. LSTM-RNN. A novel abstractive summarisation method was proposed in [56]; it generates a multisentence summary and addresses sentence repetition and inaccurate information. See et al. proposed a model that consists of a single-layer bidirectional LSTM encoder, a single-layer unidirectional LSTM decoder, and the sequence-to-sequence attention model proposed by [55]. The See et al. model generates a long text summary instead of headlines, which consist of only one or two sentences. Moreover, the attention mechanism was employed, and the attention distribution facilitates the production of the next word in the summary by telling the decoder where to look in the source words, as shown in Figure 9. This mechanism constructs a weighted sum of the encoder hidden states that facilitates the generation of the context vector, where the context vector is a fixed-size representation of the input. The probability (Pvocab) produced by the decoder is employed to generate the final prediction using the context vector and the decoder's last step. Furthermore, the value of Pvocab is equal to zero for OOV words. RL was employed for abstractive text summarisation in [57]. The proposed method in [57], which combined RL with supervised word prediction, was composed of a bidirectional LSTM-RNN encoder and a single LSTM decoder.
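In the pointer-generator network [56], the copy mechanism that complements this vocabulary distribution mixes generation and copying through the attention weights, which is how source OOV words (for which P_vocab is zero) can still appear in the summary:

\[
P(w) = p_{\mathrm{gen}} \, P_{\mathrm{vocab}}(w) + \left(1 - p_{\mathrm{gen}}\right) \sum_{i \,:\, w_i = w} a_i,
\]

where $a_i$ is the attention weight of source position $i$ and $p_{\mathrm{gen}} \in [0, 1]$ is a soft switch computed from the context vector, the decoder state, and the decoder input.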

Two models, a generative model and a discriminative model, were trained simultaneously to generate abstractive summary text using an adversarial process [58]. The maximum likelihood estimation (MLE) objective function employed in previous sequence-to-sequence models suffers from two problems: the mismatch between the training loss and the evaluation metric, and the unavailability of a golden token at testing time, which causes errors to accumulate during testing. To address these problems, the proposed approach exploited the adversarial framework. In the first step of the adversarial framework, reinforcement learning was employed to optimize the generator, which generates the summary from the original text. In the second step, the discriminator, which acts as a binary classifier, classified the summary as either a ground-truth summary or a machine-generated summary. The bidirectional LSTM encoder and attention mechanism were employed as in [56].

Abstractive text summarisation using the LSTM-CNN model based on exploring semantic phrases (ATSDL) was proposed in [30]. ATSDL is composed of two phases: the first phase extracts phrases from the sentences, while the second phase learns the collocation of the extracted phrases using the LSTM model. To generate sentences that are general and natural, the input and output of the ATSDL model were phrases instead of words, and the phrases were divided into three main types, i.e., subject, relation, and object phrases, where the relation phrase represents the relation between the input phrase and the output phrase. Each phrase was represented using a CNN layer. There are two main reasons for choosing the CNN: first, the CNN is efficient for sentence-level applications, and second, training is efficient since long-term dependency is unnecessary. Furthermore, to obtain several vectors for a phrase, multiple kernels with different widths that represent the dimensionality of the features were utilised. Within each kernel, the maximum feature was selected for each row in the kernel via max pooling. The resulting values were added to obtain the final value for each word in a phrase. Bidirectional LSTM was employed instead of a GRU on the encoder side, since parameters are easy to tune with LSTM. Moreover, the decoder was divided into two modes, a generate mode and a copy mode. The generate mode generates the next phrase in the summary based on previously generated phrases and the hidden layers of the input on the encoder side, while the copy mode copies the phrase after the current input phrase if the currently generated phrase is not suitable for the previously generated phrases in the summary. Figure 10 provides additional details.

Bidirectional encoder and decoder LSTM-RNNs were employed to generate abstractive multisentence summaries [35]. The proposed approach considered past and future context on the decoder side when making a prediction, as it employed a bidirectional RNN. Using a bidirectional RNN on the decoder side addressed the problem of summary imbalance. An unbalanced summary can occur due to noise in a previous prediction, which reduces the quality of all subsequent summaries. The bidirectional decoder consists of two LSTMs: a forward decoder and a backward decoder. The forward decoder decodes the information from left to right, while the backward decoder decodes the information from right to left. The last hidden state of the forward decoder is fed as the initial input to the backward decoder, and vice versa. Moreover, the researchers proposed a bidirectional beam-search method that generates summaries from the proposed bidirectional model. Bidirectional beam search combines information from the past and future to produce a better summary. Therefore, the output summary is balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence was read in reverse order, based on the conclusion that LSTM learns better when reading the source in reverse order while remembering the order of the target [66, 67]. A softmax layer was employed on the decoder side to obtain the probability of each target word in the summary over the vocabulary distribution by taking the output of the decoder as input to the softmax layer. The decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. The objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token. However, during testing, the input of the forward decoder is the token generated in the previous step. The same situation holds for the backward decoder, where the input during training is the future token from the summary. Nevertheless, the bidirectional decoder has difficulty during testing, since the complete summary must be known in advance; thus, the full backward decoder output is generated and fed to the forward decoder using a unidirectional backward beam search.

A combination of abstractive and extractive methods was employed in the guiding generation model proposed by [59]. The extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent the key information. Furthermore, to predict the long-term value of the final summary, the proposed method applied a prediction guide mechanism [68].

Figure 9: Baseline sequence-to-sequence model with attention mechanism [56].

Figure 10: Semantic-unit-based LSTM model [30].

A prediction guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. The encoder-decoder architecture baseline of the proposed model is similar to that proposed by Nallapati et al. [55], where both a bidirectional LSTM encoder and a unidirectional LSTM decoder were employed. Both models applied the attention mechanism and a softmax layer. Moreover, the process of generating the summary was improved by proposing KIGN, which takes as input the keywords extracted using the TextRank algorithm. In KIGN, key information is represented by concatenating the last forward hidden state and the first backward hidden state. KIGN employs the attention mechanism and the pointer mechanism. In general, the attention mechanism can hardly identify the keywords; thus, to identify keywords, the output of KIGN is fed to the attention mechanism. As a result, the attention mechanism will be highly affected by the keywords. However, to enable the pointer network to identify the keywords, which are the output of KIGN, the encoder context vector and the hidden state of the decoder are fed to the pointer network, and the output is employed to calculate the soft switch. The soft switch determines whether to copy the target from the original text or generate it from the target vocabulary, as shown in Figure 11.

The level of abstraction in the generated summary of abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and the use of a novel metric for optimising the overlap between the n-gram summary and the ground-truth summary. The decoder was decomposed into a contextual network and a pretrained language model, as shown in Figure 12. The contextual network uses the source document to extract the relevant parts, and the pretrained language model is generated from prior knowledge. This decomposition method facilitates the addition of an external pretrained language model that is related to several domains. Furthermore, a novel metric was employed to generate an abstractive summary by including words that are not in the source document. Bidirectional LSTM was utilised in the encoder, and the decoder applied a 3-layer unidirectional weight-dropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applied the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combined reinforcement learning and maximum likelihood.

A bidirectional decoder with a sequence-to-sequence architecture, referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing because the input of the decoder is the previously generated summary word; if one of the generated summary words is incorrect, then the error will propagate through all subsequent summary words. In the bidirectional decoder, there are two decoders: a forward decoder and a backward decoder. The forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left. The forward decoder considers a reference from the backward decoder. However, there is only a single-layer encoder. The encoder and decoders employ LSTM units, but while the encoder utilises bidirectional LSTM, the decoders use unidirectional LSTM, as shown in Figure 13. To understand the summary generated by the backward decoder, the attention mechanism is applied between the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, an attention mechanism is employed in both decoders.

A double attention pointer network, referred to as DAPT, was applied to build an abstractive text summarisation model [49]. The encoder utilised bidirectional LSTM, while the decoder utilised unidirectional LSTM. The key features of the encoder were extracted using a self-attention mechanism. At the decoder, beam search was employed. Moreover, more coherent and accurate summaries were generated. The repetition problem was addressed using an improved coverage mechanism with a truncation parameter. The model was optimised by generating a training model that is based on RL and scheduled sampling.

4.1.2. GRU-RNN. Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. The dual encoder consists of two levels of encoders, i.e., primary and secondary encoders, in addition to one decoder, and all of them employ a GRU. The primary encoder performs coarse encoding, while the secondary encoder performs fine encoding. The primary encoder and decoder are the same as in the standard encoder-decoder model with an attention mechanism, and the secondary encoder generates a new context vector that is based on previous output and input. Moreover, an additional context vector provides more meaningful information for the output. Thus, the repetition problem of the generated summary that was encountered in previous approaches is addressed. A semantic vector is generated at both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is calculated. The fixed-length output is partially generated at each stage in the decoder, since it decodes in stages.

Figure 14 elaborates the DEATS process. The primary encoder produces a hidden state hpj for each input j and a content representation cp. Next, the decoder decodes a fixed-length output, which is referred to as the decoder content representation cd. The weight αj can be calculated using the hidden states hpj and the content representations cp and cd. In this stage, the secondary encoder generates new hidden states, or semantic context vectors, hsm, which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator, a copy mechanism, and a coverage mechanism.

Wang et al. proposed a hybrid extractive-abstractive text summarisation model that is based on combining reinforcement learning with BERT word embedding [63]. In this hybrid model, a BERT feature-based strategy was used to generate contextualised token embeddings. This model consists of two submodels, an abstractive agent and an extractive agent, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where the sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In both the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. The training process consists of pretraining and full training phases.

Egonmwan et al. proposed using sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules, an extractive model and an abstractive model. The encoder transformer has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives a sentence-level representation.

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].

The architecture of the abstractive model consists of a single-layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder. The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.

4.1.3. Others. BERT is employed to represent the sentences of the document to express its semantics [65]. Liu et al. proposed abstractive and extractive summarisation models that are based on an encoder-decoder architecture. The encoder used a BERT pretrained document-level encoder, while the decoder utilised a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage in extractive summarisation and one stage in abstractive summarisation. At the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed, since these two mechanisms need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries using trigram blocking. OOV words rarely appear in the generated summary.

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, both the input and output tokens applied the same embedding matrix Wemb, which was generated using the GloVe word embedding model, in the Paulus et al. model [57]. Another word embedding matrix, referred to as Wout, was applied in the token generation layer. Additionally, a shared weighting matrix was employed by both the shared embedding matrix Wemb and the Wout matrix. The shared weighting matrices improved the process of generating tokens, since they considered the embedding syntax and semantic information.

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].

The discriminator input sequence of the Liu et al. model was encoded using a max-pooling CNN, where the result was passed to the softmax layer [58]. On the other hand, the word embedding that was applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets, with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embedding. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of 781 source tokens paired with 56 summary tokens on average; 287226 pairs, 13368 pairs, and 11490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the documents were preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets, the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consists of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve coreference. The second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP was to perform Stanford NLP parsing, a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing was performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If a child node has children, then the child is considered a new root with children; this process continues recursively until there are no children for the root, in which case the tree structure is complete. Accordingly, compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are refined before training by applying simple rules. First, the phrase triples at the topmost level are exploited, since they carry the most semantic information. Second, triple phrases with subject and object phrases and no nouns are deleted, since the noun contains a considerable amount of conceptual information. Triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To achieve the goal of the previous task and determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consist of 92000 and 219000 source texts, respectively.

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments on DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments on the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1, 41.16; ROUGE2, 15.75; and ROUGE-L, 39.08 [57]. The results for the maximum likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to assess the quality and readability of the summaries. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]. The value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics for the Al-Sabahi et al. and Li et al. models; values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluations included ROUGE1, ROUGE2, and ROUGE-L, and values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel score related to the n-gram was employed to measure the level of abstraction in the summary. The qualitative evaluation involved a manual evaluation of the proposed model. Five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected. The evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, where the WordPiece tokeniser was utilised. The values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-and-answering paradigm, where 20 documents were selected for evaluation. Three values were chosen for scoring the answers: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attention encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model contains a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs, a GRU-RNN for the word level and a GRU-RNN for the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering every word in the vocabulary, only certain words were added, based on the frequency of the word in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embeddings of the input words to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrices for each tag type, similar to word embedding, while the TF-IDF feature was discretised into a fixed number of bins, where a one-hot representation was employed to represent the value of the bins. The one-hot matrix consisted of the number of bin entries, where only one entry was set to one to indicate the TF-IDF value of a certain word. This process permitted the TF-IDF to be addressed in the same way as any other tag, by concatenating all the embeddings into one long vector, as shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8M training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and named-entity tag generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
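The feature-rich encoder input described above can be sketched as follows; the tag-set sizes and number of TF-IDF bins are toy values chosen for illustration, while the 200-dimensional word embedding matches the Word2Vec setting reported for [55].

import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

word_emb  = np.random.randn(200)       # word embedding of the token (200-d Word2Vec as in [55])
pos_tag   = one_hot(3, 45)             # part-of-speech tag id (toy tag-set size)
ner_tag   = one_hot(1, 10)             # named-entity tag id (toy tag-set size)
tfidf_bin = one_hot(7, 20)             # discretised TF-IDF value mapped to one of 20 bins

# the encoder input is the concatenation of all feature vectors into one long vector
encoder_input = np.concatenate([word_emb, pos_tag, ner_tag, tfidf_bin])
print(encoder_input.shape)             # (275,)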

Finally, for both single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of several approaches appear in Table 2, while the training, optimization mechanism, and search at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets are produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of each article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets generated by Hermann et al. [71], which were originally used for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (29.74 sentences on average), while the summaries have 53 words (3.72 sentences on average) [55].
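As a hedged illustration of working with these article-summary pairs, the sketch below loads CNN/Daily Mail with the Hugging Face `datasets` library; the library and configuration name are tooling assumptions and are not prescribed by the surveyed papers.

```python
# Inspect the CNN/Daily Mail pairs; summaries are the concatenated bullet-point
# highlights of each article.
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")          # train/validation/test splits
print({split: len(cnn_dm[split]) for split in cnn_dm})    # roughly 287k / 13k / 11k pairs

example = cnn_dm["train"][0]
article, summary = example["article"], example["highlights"]
print(len(article.split()), "article words |", len(summary.split()), "summary words")
```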

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72]. The


Figure 16: Word embedding concatenated with discretised TF-IDF, POS, and NER one-hot embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based |
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | Decoder RNN, word-based
[50] | 2016 | GRU + QRNN + attention | GRU + RNN (QRNN)
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM | Unidirectional RNN attentive encoder-decoder LSTM
[38] | 2016 | Bidirectional LSTM | Unidirectional LSTM
[38] | 2016 | Bidirectional LSTM | Decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the summarisation styles are diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets. Nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study each applied the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization: all digits replaced with a placeholder symbol, all letters converted to lower case, and "UNK" used to replace words that occur fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenization: all digits replaced with a placeholder symbol, all letters converted to lower case, and "UNK" used to replace words that occur fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using a Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins with a one-hot representation of the bins; (iv) lookup embeddings for part-of-speech and named-entity tags
[52] | Zhou et al. | PTB tokenization: all digits replaced with a placeholder symbol, all letters converted to lower case, and "UNK" used to replace words that occur fewer than 5 times | Word embedding with a size equal to 300
[53] | Cao et al. | Normalization and tokenization: digits replaced with a placeholder symbol, words converted to lower case, and "UNK" used to replace the least frequent words | GloVe word embedding with a dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lower case letters | GloVe word embedding
[29] | Lopyrev | Tokenization; converting the articles and their headlines to lower case letters; using the symbol ⟨unk⟩ to replace rare words | The input was represented using a distributed representation
[38] | Jobson et al. | | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | | CNN maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving coreference and reducing morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | | The word embedding is learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | | The word embedding is learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | | BERT
[63] | Wang et al. | Using the WordPiece tokenizer | BERT
[64] | Egonmwan et al. | | GloVe word embedding with a dimension size equal to 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers reviewed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures for evaluating the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence. Since the manual

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise the negative log-likelihood | | Beam search
[39] | Chopra et al. | Minimizing the negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimizing the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent with the Adam optimizer, optimizing the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimizing the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross-entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism |
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanisms | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism |
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL |
[30] | Song et al. | | Attention mechanism, copy mechanism |
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search, backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\[
\text{ROUGE-N} = \frac{\sum_{S \in \text{Reference Summaries}} \; \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}\left(\text{gram}_n\right)}{\sum_{S \in \text{Reference Summaries}} \; \sum_{\text{gram}_n \in S} \text{Count}\left(\text{gram}_n\right)}, \tag{1}
\]

where S is a reference summary, n is the n-gram length, Count_match(gram_n) is the maximum number of matching n-gram words between the reference summary and the generated summary, and Count(gram_n) is the total number of n-gram words in the reference summary [73].
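A minimal sketch of equation (1) as plain n-gram recall against a single reference is shown below; it is illustrative only, since the official ROUGE package adds stemming options, multiple references, and precision/F-scores.

```python
# ROUGE-N as clipped n-gram recall against one reference summary.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(reference, candidate, n):
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    # Clipped matches: an n-gram counts at most as often as it appears in the reference.
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0
```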

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of common matching words between the reference summary and the generated summary. The LCS calculation does not require the matched words to be consecutive; however, their order of occurrence is important. In addition, no predefined number of matched words is required. LCS considers only the main in-sequence match, which is one of its disadvantages, since the final score will not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets [61].

Table 4: Abstractive summarisation datasets.

Reference | Training | Evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | —
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple
A: the apple Ahmed ate

In this case, ROUGE-L will credit either "Ahmed ate" or "the apple" but not both, since it considers only a single LCS.
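The LCS computation underlying ROUGE-L can be sketched as the classic dynamic-programming recurrence; applied to R and A above, it credits only one of the two matching fragments.

```python
# Length of the longest common subsequence between two token sequences.
def lcs_length(ref_tokens, cand_tokens):
    m, n = len(ref_tokens), len(cand_tokens)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

R = "Ahmed ate the apple".split()
A = "the apple Ahmed ate".split()
print(lcs_length(R, A))  # 2 -> only "Ahmed ate" or "the apple" is credited, not both
```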

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]; these models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by the text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. Even though ROUGE is employed to evaluate abstractive summarisation, it would be better to develop new methods to evaluate the quality of the summaries. Such new evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text; ROUGE, by contrast, is well suited to extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups. The first group considered single-sentence summary approaches, while the second group considered multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both single-sentence and multisentence summaries are those that employed BERT word embeddings and were based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset: Gigaword is utilised for single-sentence summaries, as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries; thus, the summaries in the CNN/Daily Mail datasets are longer than those in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the generated summaries of 50 test examples from 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates a low level of readability. The results show that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summaries [60]. Five human evaluators rated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where 1 indicates that the generated summary is less readable and less relevant, while 10 indicates that it is readable and highly relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are close, with values of 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retained key information, which was achieved by human evaluation [65]. In addition, a qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were used for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summary of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are also not enough, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either a gold token is used, as during training, or the prediction of the previous step is used, as during both testing and training. In this manner, at least the training step receives the same


input as testing. In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of always feeding the expected word from the headline, the generated word of the previous step is fed back 10% of the time [75, 76].

Moreover, the masked convolution of the QRNN is applied in [50], since the dependency on words generated in the future is difficult to determine.
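The coin-flip idea behind DaD and scheduled sampling can be sketched as follows; the function name and the 10% sampling probability (taken from the headline-generation model above) are illustrative assumptions.

```python
# During training, the decoder input at step t is either the gold token or the
# model's own previous prediction, so training better matches the test-time
# condition in which no gold tokens exist.
import random

def choose_decoder_input(gold_token, predicted_token, sampling_prob=0.10):
    # With probability `sampling_prob`, feed back the model's own prediction
    # (e.g., 10% of the time); otherwise use teacher forcing with the gold token.
    return predicted_token if random.random() < sampling_prob else gold_token
```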

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to a word in the source and copies it; when the switch is turned on, the decoder generates a word from the target vocabulary. Conversely, researchers in [56] addressed OOV words via a generation probability, P_gen, whose value is calculated from the context vector and decoder state, as shown in Figure 21. To generate the output word, P_gen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them. The combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder used a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the input sequence position for

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | —
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.9 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset.


unseen tokens and copy them. Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. Furthermore, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
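A hedged sketch of the pointer-generator switch is given below: it mixes a vocabulary distribution with copied attention mass over an extended vocabulary, but the layer shapes and names are assumptions rather than the exact architecture of [56].

```python
# P_gen is computed from the context vector, decoder state, and decoder input;
# the final distribution mixes generation from the vocabulary with copying
# from the source (the extended vocabulary covers source OOV words).
import torch
import torch.nn as nn

class PointerGeneratorSwitch(nn.Module):
    def __init__(self, hidden_dim, emb_dim, vocab_size, extended_size):
        super().__init__()
        self.p_gen_linear = nn.Linear(2 * hidden_dim + emb_dim, 1)
        self.vocab_proj = nn.Linear(2 * hidden_dim, vocab_size)
        self.extended_size = extended_size  # vocabulary plus per-example source OOVs

    def forward(self, context, dec_state, dec_input, attention, src_ext_ids):
        # Generation probability from [context; decoder state; decoder input].
        p_gen = torch.sigmoid(self.p_gen_linear(
            torch.cat([context, dec_state, dec_input], dim=-1)))
        vocab_dist = torch.softmax(
            self.vocab_proj(torch.cat([context, dec_state], dim=-1)), dim=-1)
        final = torch.zeros(context.size(0), self.extended_size)
        final[:, :vocab_dist.size(1)] = p_gen * vocab_dist
        # Copy probability mass: scatter attention weights onto the extended-vocab
        # ids of the source positions (src_ext_ids is a LongTensor of those ids).
        final.scatter_add_(1, src_ext_ids, (1 - p_gen) * attention)
        return final
```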

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges arise when summarising long documents and producing long summaries with the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and to avoid attending to the same parts of the input at different decoder steps. However, the intratemporal attention mechanism at the encoder cannot address all the repetition challenges, especially when a long sequence is generated; thus, an intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words. Moreover, the proposed intradecoder attention mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy loss, maximum likelihood, and gradient reinforcement learning to minimise the exposure bias. In addition, the trigram probability p(y_t) was proposed to address repetition in the generated summary, where y_t is the trigram sequence: the value of p(y_t) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated by a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is also utilised.
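The trigram heuristic of [57] can be sketched as a simple check used during beam search; the function name is illustrative.

```python
# During beam search, a candidate token is ruled out (its probability forced to
# zero) if it would create a trigram that already occurs in the partial summary.
def creates_repeated_trigram(partial_summary_tokens, candidate_token):
    tokens = partial_summary_tokens + [candidate_token]
    if len(tokens) < 3:
        return False
    new_trigram = tuple(tokens[-3:])
    existing = {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 3)}
    return new_trigram in existing

# Usage inside the beam search loop (illustrative):
#     if creates_repeated_trigram(hypothesis, token): score = float("-inf")
```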

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.
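A loose sketch of the dual-attention idea (not the exact FTSum architecture) is shown below: two context vectors, one computed over the source text and one over the encoded fact descriptions, are merged with a learned gate before the next word is predicted. The class and parameter names are assumptions.

```python
# Merge the text-side and fact-side attention contexts with a learned gate.
import torch
import torch.nn as nn

class DualContextMerge(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, text_context, fact_context):
        g = torch.sigmoid(self.gate(torch.cat([text_context, fact_context], dim=-1)))
        # Convex combination of the two attention contexts, fed to the output layer.
        return g * text_context + (1 - g) * fact_context
```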

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary consists of the highlights of the news article. Every highlight represents a sentence in the summary; therefore,

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


the number of sentences in the summary is equal to the number of highlights. Sometimes, the highlights do not address all the crucial points of the article; therefore, producing a high-quality dataset requires substantial effort. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset exists but is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, for abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words; for example, the words "book" and "books" are considered different by every ROUGE metric. Therefore, new evaluation measures that consider the context of the words must be proposed (words that have the same meaning must be treated as the same, even if they have different surface forms). In this case, we propose the use of METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for languages with flexible word order, it is better to use ROUGE without considering the order of the words.
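As a hedged tooling example (an assumption on our part, not part of the surveyed work), METEOR can be computed with NLTK; unlike exact-match ROUGE, it can credit stems and synonyms.

```python
# METEOR scoring via NLTK; requires the WordNet data, and recent NLTK versions
# expect pre-tokenized token lists rather than raw strings.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference = "the government published the new books".split()
candidate = "the government released the new book".split()
# "book"/"books" can still be matched through stemming, unlike in plain ROUGE.
print(meteor_score([reference], candidate))
```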

The quality of the generated summary can be improved using linguistic features. For example, we propose the use

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


of dependency parsing at the encoder, in a separate layer on top of the first hidden state layer. We also propose the use of word embeddings built by considering dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the new trends and evaluation results, we believe that the most promising direction among all these features is the use of the pretrained BERT model. The quality of the models that are based on the transformer is high and will yield promising results.
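As a hedged sketch of this trend (a tooling assumption, not a specific surveyed model), a pretrained BERT encoder can supply contextual states over which a summarisation decoder attends.

```python
# Obtain contextual encoder states from a pretrained BERT model with the
# Hugging Face `transformers` library.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

article = "The volume of textual data on the web has grown rapidly in recent years."
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    states = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
print(states.shape)  # these states would be attended over by a summarisation decoder
```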

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that apply deep learning to abstractive text summarisation, along with the datasets and the measures used to evaluate these approaches. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively, and the best results were achieved by the models that apply transformers. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, several issues must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.
[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.
[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.
[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.
[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.
[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.
[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.
[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.
[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.
[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.
[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, 2017.
[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.
[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.
[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.
[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.
[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, 2018.
[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.
[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.
[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, http://arxiv.org/abs/1812.02303, 2020.
[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.
[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.
[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.
[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.
[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
[29] K. Lopyrev, "Generating news headlines with recurrent neural networks," p. 9, 2015, https://arxiv.org/abs/1512.01712.
[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.
[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.
[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.
[38] E. Jobson and A. Gutierrez, "Abstractive text summarization using attentive sequence-to-sequence RNNs," p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of NAACL-HLT 2016, pp. 93–98, San Diego, CA, USA, 2016.
[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.
[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.
[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.
[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.
[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2Vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.
[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.
[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.
[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.
[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," https://arxiv.org/abs/1611.01576, 2015.
[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.
[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.
[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.
[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.
[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of CoNLL-16, Berlin, Germany, August 2016.
[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.
[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.
[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.
[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.
[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.
[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.
[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.
[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.
[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.
[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.
[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.
[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.
[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.
[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.
[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.
[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.
[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.
[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.
[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.
[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.
[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.


Page 5: Deep Learning Based Abstractive Text Summarization ...downloads.hindawi.com/journals/mpe/2020/9365340.pdfDeep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation

information e new information is formed by adding theold information to the result of the element-wise multi-plication of the output of the two memory gate neuralnetworks

(4) Output Gate e output gates control the amount of newinformation that is forwarded to the next LSTM unit eoutput gate is a neural network with a sigmoid activationfunction that considers the input vector the previous hiddenstate the new information and the bias as input e outputof the sigmoid function is multiplied by the tanh of the newinformation to produce the output of the current block

232 Gated Recurrent Unit (GRU) A GRU is a simplifiedLSTM with two gates a reset gate and an update gate andthere is no explicit memory e previous hidden state in-formation is forgotten when all the reset gate elements ap-proach zero then only the input vector affects the candidatehidden state In this case the update gate acts as a forget gateLSTM and GRU are commonly employed for abstractivesummarisation since LSTM has a memory unit that providesextra control however the computation time of the GRU isreduced [38] In addition while it is easier to tune the pa-rameters with LSTM the GRU takes less time to train [30]

24 Attention Mechanism e attention mechanism wasemployed for neural machine translation [33] before beingutilised for NLP tasks such as text summarisation [18] Abasic encoder-decoder architecture may fail when given longsentences since the size of encoding is fixed for the inputstring thus it cannot consider all the elements of a longinput To remember the input that has a significant impacton the summary the attention mechanism was introduced[29] e attention mechanism is employed at each outputword to calculate the weight between the output word andevery input word the weights add to one e advantage ofusing weights is to show which input word must receiveattention with respect to the output word e weightedaverage of the last hidden layers of the decoder in the current

step is calculated after passing each input word and fed to thesoftmax layer along the last hidden layers [39]

25 Beam Search Beam search and greedy search are verysimilar however while greedy search considers only the besthypothesis beam search considers b hypotheses where brepresents the beam width or beam size [5] In text sum-marisation tasks the decoder utilises the final encoderrepresentation to generate the summary from the targetvocabulary In each step the output of the decoder is aprobability distribution over the target wordus to obtainthe output word from the learned probability severalmethods can be applied including (1) greedy samplingwhich selects the distribution mode (2) 1-best or beamsearch which selects the best output and (3) n-best or beamsearch which select several outputs When n-best beamsearch is employed the top bmost relevant target words areselected from the distribution and fed to the next decoderstate e decoder keeps only the top k1 of k words from thedifferent inputs and discards the rest

26 Distributed Representation (Word Embedding) A wordembedding is a word distributional vector representationthat represents the syntax and semantic features of words[40] Words must be converted to vectors to handle variousNLP challenges such that the semantic similarity betweenwords can be calculated using cosine similarity Euclideandistance etc [41ndash43] In NLP tasks the word embeddings ofthe words are fed as inputs to neural network models In therecurrent neural network encoder-decoder architecturewhich is employed to generate the summaries the input ofthe model is the word embedding of the text and the outputis the word embedding of the summary

In NLP there are several word embedding models such asWord2Vec GloVe FastText and Bidirectional Encoder Rep-resentations from Transformers (BERT) which are the mostrecently employed word embedding models [41 44ndash47] eWord2Vec model consists of two approaches skip-gram andcontinuous bag-of-words (CBOW) which both depend on thecontext window [41] On the other hand GloVe represents the

++++

X

σ σ

Ht

+ CtCt-1

Ht-1

Xt

0 1 2 3

σ Tanh

Tanh

X

Ht

X

(c)

++++

X

σ σ

Ht

+ CtCtndash1

Htndash1

Xt

0 1 2 3

σ Tanh

Tanh

X

Ht

X

(d)

Figure 3 LSTM unit gates [5] (a) input gate (b) forget gate (c) memory gate (d) output gate

Mathematical Problems in Engineering 5

global vector which is based on statistics of the global corpusinstead of the context window [44] FastText extends the skip-gram of the Word2Vec model by using the subword internalinformation to address the out-of-vocabulary (OOV) terms[46] In FastText the subword components are composed tobuild the vector representation of the words which facilitatesrepresentation of the word morphology and lexical similaritye BERT word embedding model is based on a multilayerbidirectional transformer encoder [47 48] Instead of usingsequential recurrence the transformer neural network utilisesparallel attention layers BERTcreates a single large transformerby combining the representations of the words and sentencesFurthermore BERT is pretrained with an unsupervised ob-jective over a large amount of text

27 Transformers e contextual representations of languageare learned from large corpora One of the new languagerepresentations which extend word embedding models isreferred to as BERTmentioned in the previous section [48] InBERT two tokens are inserted to the texte first token (CLS)is employed to aggregate the whole text sequence informatione second token is (SEP) this token is inserted at the end ofeach sentence to represent it e resultant text consists oftokens where each token is assigned three types of embed-dings token segmentation and position embeddings Tokenembedding is applied to indicate the meaning of a tokenSegmentation embedding identifies the sentences and positionembedding determines the position of the token e sum ofthe three embeddings is fed to the bidirectional transformer asa single vector Pretrained word embedding vectors are moreprecise and rich with semantic features BERT has the ad-vantage of fine-tuning (based on the objectives of certain tasks)and feature-based methods Moreover transformers computethe presentation of the input and output by using self-atten-tion where the self-attention enables the learning of the rel-evance between the ldquoword-pairrdquo [47]

3 Single-Sentence Summary

Recently the RNN has been employed for abstractive textsummarisation and has provided significant resultserefore we focus on abstractive text summarisation basedon deep learning techniques especially the RNN [49] Wediscussed the approaches that have applied deep learning forabstractive text summarisation since 2015 RNN with anattention mechanism was mostly utilised for abstractive textsummarisation We classified the research according tosummary type (ie single-sentence or multisentence sum-mary) as shown in Figure 4 We also compared the ap-proaches in terms of encoder-decoder architecture wordembedding dataset and dataset preprocessing and evalu-ations and results is section covers single-sentencesummary methods while Section 4 covers multisentencesummary methods Single-sentence summary methods in-clude a neural attention model for abstractive sentencesummarisation [18] abstractive sentence summarisationwith attentive RNN (RAS) [39] quasi-RNN [50] a methodfor generating news headlines with RNNs [29] abstractivetext summarisation using an attentive sequence-to-sequence

RNN [38] neural text summarisation [51] selectiveencoding for abstractive sentence summarisation (SEASS)[52] faithful to the original fact aware neural abstractivesummarization (FTSumg) [53] and the improving trans-former with sequential context [54]

31 Abstractive Summarization Architecture

311 Feedforward Architecture Neural networks were firstemployed for abstractive text summarisation by Rush et alin 2015 where a local attention-based model was utilised togenerate summary words by conditioning it to input sen-tences [18]ree types of encoders were applied the bag-of-words encoder the convolution encoder and the attention-based encoder e bag-of-words model of the embeddedinput was used to distinguish between stop words andcontent words however this model had a limited ability torepresent continuous phrasesus a model that utilised thedeep convolutional encoder was employed to allow thewords to interact locally without the need for context econvolutional encoder model can alternate between tem-poral convolution and max-pooling layers using the stan-dard time-delay neural network (TDNN) architecturehowever it is limited to a single output representation elimitation of the convolutional encodermodel was overcomeby the attention-based encodere attention-based encoderwas utilised to exploit the learned soft alignment to weightthe input based on the context to construct a representationof the output Furthermore the beam-search decoder wasapplied to limit the number of hypotheses in the summary

312 RNN Encoder-Decoder Architecture

(1) LSTM-RNN An abstractive sentence summarisationmodelthat employed a conditional recurrent neural network (RNN)to generate the summary from the input is referred to as arecurrent attentive summariser (RAS) [39] A RAS is an ex-tension of the work in [18] In [18] the model employed afeedforward neural network while the RAS employed an RNN-LSTM e encoder and decoder in both models were trainedusing sentence-summary pair datasets but the decoder of theRAS improved the performance since it considered the positioninformation of the input words Furthermore previous wordsand input sentences were employed to produce the next wordin the summary during the training phase

Lopyrev [29] proposed a simplified attention mechanismthat was utilised in an encoder-decoder RNN to generateheadlines for news articles e news article was fed into theencoder one word at a time and then passed through theembedding layer to generate the word representation eexperiments were conducted using simple and complexattention mechanisms In the simple attention mechanismthe last layer after processing the input in the encoding wasdivided into two parts one part for calculating the attentionweight vector and one part for calculating the contextvector as shown in Figure 5(a) However in the complexattention mechanism the last layer was employed to cal-culate the attention weight vector and context vector without


In the complex attention mechanism, by contrast, the last layer was employed to calculate the attention weight vector and the context vector without fragmentation, as shown in Figure 5(b). In both figures, the solid lines indicate the part of the hidden state of the last layer that is employed to compute the context vector, while the dashed lines indicate the part of the hidden state of the last layer that is applied to compute the attention weight vector. The same difference appears on the decoder side: in the simple attention mechanism, the last layer was divided into two parts (one part was passed to the softmax layer and the other part was applied to calculate the attention weight), while in the complex attention mechanism, no such division was made. A beam search was performed at the decoder side during testing to extend the sequences with the highest probabilities.
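To make the simple/complex distinction concrete, the following sketch (an illustrative assumption, not Lopyrev's original code; the tensor name last_layer_state and the 50/50 split are hypothetical) shows how the last hidden layer could either be split into an attention part and a context part (simple attention) or be reused in full for both computations (complex attention). The decoder state passed to the attention computation is assumed to have the same dimensionality as the attention slice.

```python
import numpy as np

def attention_parts(last_layer_state: np.ndarray, simple: bool = True):
    """Split (simple) or reuse (complex) the last-layer hidden states.

    last_layer_state: (seq_len, hidden) hidden states of the encoder's last layer.
    Returns (for_attention, for_context), the slices used to compute the
    attention weights and the context vector, respectively.
    """
    if simple:
        # Simple attention: one half is dedicated to the attention weights,
        # the other half to the context vector.
        half = last_layer_state.shape[1] // 2
        return last_layer_state[:, :half], last_layer_state[:, half:]
    # Complex attention: the full state is used for both computations.
    return last_layer_state, last_layer_state

def context_vector(for_attention, for_context, decoder_state):
    """Dot-product attention over the encoder states (illustrative)."""
    scores = for_attention @ decoder_state          # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax attention weights
    return weights @ for_context                    # weighted sum -> context vector
```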

Single-sentence summary approaches: a neural attention model for abstractive sentence summarization (ABS); abstractive sentence summarization with attentive recurrent neural networks (RAS); quasi-recurrent neural network (QRNN) + CNN; generating news headlines with recurrent neural networks; abstractive text summarization using attentive sequence-to-sequence RNNs; selective encoding for abstractive sentence summarization (SEASS); faithful to the original fact aware neural abstractive summarization (FTSumg); improving transformer with sequential context representations for abstractive text summarization (RCT).

Multisentence summary approaches: get to the point: summarization with pointer-generator networks; reinforcement learning (RL); generative adversarial network for abstractive text summarization; exploring semantic phrases (ATSDL) + CNN; bidirectional attentional encoder-decoder model and bidirectional beam search; key information guide network; improving abstraction in text summarization; dual encoding for abstractive text summarization (DEATS); bidirectional decoder (BiSum); text summarization with pretrained encoders; a text abstraction summary model based on BERT word embedding and reinforcement learning; transformer-based model for single-document neural summarization; text summarization method based on double attention pointer network (DAPT).

Figure 4: Taxonomy of several approaches that use a recurrent neural network and attention mechanism in abstractive text summarisation, based on the summary type.


The encoder-decoder RNN and sequence-to-sequence models were utilised in [55], which mapped the inputs to the target sequences; the same approach was also employed in [38, 51]. Three different methods for computing the global attention scoring function were proposed: the dot product, the bilinear form, and a scalar value calculated from a projection of the hidden states of the RNN encoder [38]. The model applied LSTM cells instead of GRU cells (both LSTM and GRU are commonly employed for abstractive summarisation tasks; LSTM has a memory unit that provides more control, while GRU has a lower computation time). Three models were employed: the first model applied unidirectional LSTM in both the encoder and the decoder; the second model was implemented using bidirectional LSTM in the encoder and unidirectional LSTM in the decoder; and the third model utilised a bidirectional LSTM encoder and an LSTM decoder with global attention. The first hidden state of the decoder is the concatenation of all backward and forward hidden states of the encoder. The use of attention in an encoder-decoder neural network generates a context vector at each timestep. For the local attention mechanism, the context vector is conditioned on a subset of the encoder's hidden states, while for the global attention mechanism, the vector is conditioned on all the encoder's hidden states. After generating the first decoder output, the next decoder input is the word embedding of the output of the previous decoder step. An affine transformation is used to convert the output of the decoder LSTM to a dense prediction vector over the vocabulary, since the number of hidden states differs from the number of words in the vocabulary and a direct projection would otherwise require a long training time.
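The three global attention scoring variants mentioned above can be summarised as in the short sketch below. This is a minimal illustration assuming precomputed encoder states H_s and a decoder state h_t; the weight matrices W_bilinear, W_proj, and the vector v are hypothetical trainable parameters, and the "proj" branch follows the common concatenation-projection form rather than the exact formulation in [38].

```python
import numpy as np

def attention_scores(h_t, H_s, W_bilinear=None, W_proj=None, v=None, kind="dot"):
    """Global attention scores between decoder state h_t and encoder states H_s.

    h_t: (d,) decoder hidden state.
    H_s: (n, d) encoder hidden states.
    kind: "dot", "bilinear", or "proj" (scalar from a projection).
    """
    if kind == "dot":                       # score(h_t, h_s) = h_t . h_s
        return H_s @ h_t
    if kind == "bilinear":                  # score(h_t, h_s) = h_t^T W h_s
        return H_s @ (W_bilinear @ h_t)
    if kind == "proj":                      # score(h_t, h_s) = v^T tanh(W [h_t; h_s])
        repeated = np.repeat(h_t[None, :], H_s.shape[0], axis=0)
        concat = np.concatenate([repeated, H_s], axis=1)
        return np.tanh(concat @ W_proj.T) @ v
    raise ValueError(kind)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# The context vector is the attention-weighted sum of the encoder states:
# context = softmax(attention_scores(h_t, H_s, kind="dot")) @ H_s
```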

Khandelwal [51] employed a sequence-to-sequence model that consists of an LSTM encoder and an LSTM decoder for abstractive summarisation of small datasets. The decoder generated the output summary after reading the hidden representations generated by the encoder and passing them to the softmax layer. The sequence-to-sequence model does not memorise information, so generalisation of the model is not possible. Thus, the proposed model utilised imitation learning to determine whether to choose the golden token (i.e., the reference summary token) or the previously generated output at each step.
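A common way to implement this choice between the golden (reference) token and the model's own previous output is a scheduled-sampling-style decision at each decoding step. The sketch below is a generic illustration under that assumption rather than the exact policy learned in [51]; the probability p_golden and the function name are ours.

```python
import random

def next_decoder_input(reference_token, generated_token, p_golden: float):
    """Choose the next decoder input token during training.

    reference_token: the golden token from the reference summary.
    generated_token: the token the decoder produced at the previous step.
    p_golden: probability of feeding the golden token (typically annealed
              from near 1.0 towards lower values over training).
    """
    return reference_token if random.random() < p_golden else generated_token

# Example: early in training mostly feed the reference (teacher forcing);
# later rely more on the model's own predictions.
# token = next_decoder_input("apple", "orange", p_golden=0.9)
```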

(2) GRU-RNN. A combination of elements of the RNN and the convolutional neural network (CNN) was employed in an encoder-decoder model that is referred to as a quasi-recurrent neural network (QRNN) [50]. In the QRNN, the GRU was utilised in addition to the attention mechanism. The QRNN was applied to address the limitation of parallelisation: the dependencies of the words in previous steps are obtained via convolution and "fo-pooling", which are performed in parallel, as shown in Figure 6. The convolution in the QRNN can be either masked convolution (considering previous timesteps only) or centre convolution (considering future timesteps). The encoder-decoder model employed two neural networks: the first network applied the centre convolution of the QRNN and consisted of multiple hidden layers that were fed by the vector representations of the words, and the second network comprised neural attention and took the encoder hidden layers as input to generate one word of a headline. The decoder accepted the previously generated headline word and produced the next word of the headline; this process continued until the headline was completed.
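The "fo-pooling" step can be sketched as below, a simplified NumPy illustration that assumes the convolutional layer has already produced, in parallel, candidate vectors z, forget gates f, and output gates o for every timestep; the gating recurrence follows the QRNN formulation, but the variable and function names are ours.

```python
import numpy as np

def fo_pooling(z, f, o):
    """fo-pooling: a gated, element-wise recurrence over precomputed gates.

    z, f, o: arrays of shape (timesteps, hidden) produced in parallel by the
    convolutional layer (candidate values, forget gates, output gates).
    Returns the hidden states of shape (timesteps, hidden).
    """
    c = np.zeros(z.shape[1])                 # cell state
    hidden_states = []
    for t in range(z.shape[0]):
        c = f[t] * c + (1.0 - f[t]) * z[t]   # recurrence uses cheap element-wise ops only
        hidden_states.append(o[t] * c)       # output gate applied to the cell state
    return np.stack(hidden_states)
```

Because the expensive convolutions that produce z, f, and o run over all timesteps at once, only this lightweight element-wise loop remains sequential, which is the source of the QRNN's speed advantage.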

SEASS is an extension of the sequence-to-sequence recurrent neural network that was proposed in [52]. The selective encoding for abstractive sentence summarisation (SEASS) approach includes a selective encoding model that consists of a sentence encoder, a selective gate network, and a decoder with an attention mechanism, as shown in Figure 7. The encoder uses a bidirectional GRU, while the decoder uses a unidirectional GRU with an attention mechanism. The encoder reads the input words and produces their representations. The selective gate then uses the sentence meaning to filter these word representations and construct a second-level representation of the sentence. To produce a high-quality summary and accelerate the decoding process, a beam search was selected for the decoder.
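The selective gate can be illustrated as follows: a sigmoid gate over each word representation, conditioned on a whole-sentence vector. This is a minimal sketch; the weight matrices W_h and W_s and the bias b are assumed trainable parameters and are not copied from [52].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selective_gate(H, s, W_h, W_s, b):
    """Filter word representations with a sentence-conditioned gate.

    H: (n, d) word representations from the bidirectional GRU encoder.
    s: (d,) sentence representation (e.g., concatenated final hidden states).
    Returns the gated second-level representations of shape (n, d).
    """
    gate = sigmoid(H @ W_h.T + s @ W_s.T + b)   # (n, d) gate values in (0, 1)
    return gate * H                              # element-wise selection of salient content
```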

On the other hand, dual attention was applied in [53]. The proposed dual attention approach consists of three modules: two bidirectional GRU encoders and one dual attention decoder. The decoder has a gate network for context selection, as shown in Figure 8, and employs copying and coverage mechanisms. The outputs of the encoders are two context vectors: one context vector for the sentences and one context vector for the relations.

Figure 5: (a) Simple attention and (b) complex attention [29].


Figure 6: Comparison of the CNN, LSTM, and QRNN models [50].

Figure 7: Selective encoding for abstractive sentence summarisation (SEASS) [52].

Figure 8: Faithful to the original fact aware neural abstractive summarisation (FTSum) [53].


The relation may be a triple or a tuple relation: a triple relation consists of the subject, predicate, and object, while a tuple relation consists of either (subject, predicate) or (predicate, object). Sometimes the triple relation cannot be extracted; in this case, two tuple relations are utilised. The decoder gate merges both context vectors based on their relative association.
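A context-selection gate of this kind can be sketched as below: an illustrative fusion of a sentence context vector and a relation context vector through a learned sigmoid gate. The parameter names and the exact gating form are assumptions for illustration and do not come from [53].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def merge_contexts(c_sent, c_rel, s_t, W_sent, W_rel, W_state, b):
    """Merge sentence and relation context vectors with a context-selection gate.

    c_sent: (d,) context vector from the sentence encoder.
    c_rel:  (d,) context vector from the relation encoder.
    s_t:    (d,) current decoder hidden state.
    Returns a single fused context vector of shape (d,).
    """
    gate = sigmoid(W_sent @ c_sent + W_rel @ c_rel + W_state @ s_t + b)  # (d,)
    return gate * c_sent + (1.0 - gate) * c_rel  # weight each source by its relevance
```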

(3) Others. The poor semantic representation of long sequences in abstractive text summarisation approaches that are based on an RNN encoder-decoder framework was addressed using the RC-Transformer (RCT) [54]. The RCT is an RNN-based abstractive text summarisation model that is composed of two encoders (an RC encoder and a transformer encoder) and one decoder. The transformer has an advantage in parallel computing, in addition to retrieving global semantic relationships. Sequential context representation, on the other hand, was achieved by the second encoder of the RCT. Word ordering is crucial for abstractive text summarisation and cannot be obtained by positional encoding alone; therefore, the RCT utilised two encoders to address the shortage of sequential information at the word level. A beam search was utilised at the decoder. Furthermore, Cai et al. compared the speed of the RCT model with that of an RNN-based model and concluded that the RCT is 1.4x and 1.2x faster.

3.2. Word Embedding. In the QRNN model, GloVe word embedding, which was pretrained using the Wikipedia and Gigaword datasets, was used to represent the text and summary [50]. In the first model proposed by Jobson et al., the word embedding was randomly initialised and updated during training, while GloVe word embedding was employed to represent the words in the second and third models [38]. In the study by Cai et al., the transformer was utilised for the representation [54].

3.3. Dataset and Dataset Preprocessing. In the model that was proposed by Rush et al., the datasets were preprocessed via PTB tokenisation by using "#" to replace all digits, converting all letters to lowercase, and using "UNK" to replace words that occurred fewer than 5 times [18]. The model could be trained with any input-output pairs due to the shortage of constraints on generating the output. The training process was carried out on the Gigaword dataset, while the summarisation evaluation was conducted on DUC2003 and DUC2004 [18]. Furthermore, the model proposed by Chopra et al. was trained using the Gigaword corpus with sentence separation and tokenisation [39]. To form sentence-summary pairs, each headline of an article was paired with the first sentence of the article. The same data preprocessing steps as in [18] were performed in [39]. Moreover, the Chopra et al. model was evaluated using the DUC2004 dataset, which consists of 500 pairs.
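These preprocessing conventions (digit masking, lowercasing, and replacing rare words with an UNK token) can be sketched as below. This is a generic illustration of the described pipeline: the "#" digit symbol and the frequency threshold of 5 follow the description above, while the function name and corpus handling are assumptions.

```python
import re
from collections import Counter

def preprocess(sentences, min_count=5):
    """Lowercase, mask digits with '#', and replace rare words with 'UNK'."""
    tokenised = [re.sub(r"\d", "#", s.lower()).split() for s in sentences]
    counts = Counter(tok for sent in tokenised for tok in sent)
    return [[tok if counts[tok] >= min_count else "UNK" for tok in sent]
            for sent in tokenised]

# Example: on a tiny corpus every token occurs fewer than 5 times,
# so preprocess(["The index rose 42 points today."]) maps every token to "UNK".
```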

Gigaword datasets were also employed by the QRNN model [50]. Articles that started with sentences containing more than 50 words, or whose headlines had more than 25 words, were removed. Moreover, the words in the articles and their headlines were converted to lowercase, and the data points were split into short, medium, and long sentences based on sentence length to avoid extra padding.

Lopyrev and Jobson et al. trained their models using Gigaword after processing the data. In the Lopyrev model, the most important preprocessing steps for both the text and the headline were tokenisation and conversion to lowercase [29]. In addition, only the first paragraph was retained, and the length of the headline was fixed at between 25 and 50 words. Moreover, articles without headlines were disregarded, and the ⟨unk⟩ symbol was used to replace rare words.

Khandelwal employed the Association for Computational Linguistics (ACL) Anthology Reference Corpus, which consists of 16,845 examples for training and 500 examples for testing; these were considered small datasets in the experiments [51]. The abstract comprised the first three sentences, and the unigram overlap between the title and the abstract was also calculated. There were 25 tokens in the summary and a maximum of 250 tokens in the input text.

The English Gigaword dataset, the DUC2004 corpus, and MSR-ATC were selected to train and test the SEASS model [52]. Moreover, the experiments of the Cao et al. model were conducted using the Gigaword dataset [53]. The same data preprocessing steps as in [18] were performed in [52, 53]. Moreover, the RCT also employed the Gigaword and DUC2004 datasets in its experiments [54].

3.4. Evaluation and Results. Recall-Oriented Understudy for Gisting Evaluation 1 (ROUGE1), ROUGE2, and ROUGE-L were utilised to evaluate the Rush et al. model, and values of 28.18, 8.49, and 23.81, respectively, were obtained [18]. The experimental results of the Chopra et al. model showed that, although DUC2004 was too complex for the experiments on the Gigaword corpus, the proposed model outperformed state-of-the-art methods in terms of ROUGE1, ROUGE2, and ROUGE-L [39]. The values of ROUGE1, ROUGE2, and ROUGE-L were 28.97, 8.26, and 24.06, respectively. On the other hand, BLEU was employed to evaluate the Lopyrev model [29], while Khandelwal utilised perplexity [51]. The SEASS model was evaluated using ROUGE1, ROUGE2, and ROUGE-L, and the results of the three measures were 36.15, 17.54, and 33.63, respectively [52]. Moreover, ROUGE1, ROUGE2, and ROUGE-L were selected for evaluating the Cao et al. model [53]. The values of ROUGE1, ROUGE2, and ROUGE-L were 37.27, 17.65, and 34.24, respectively, and the results showed that fake summaries were reduced by 80%. In addition, the RCT was evaluated using ROUGE1, ROUGE2, and ROUGE-L, with values of 37.27, 18.19, and 34.62 on the Gigaword dataset. The results showed that the RCT model outperformed other models by generating high-quality summaries that contain salient information [54].


4. Multisentence Summary

In this section, multisentence summary and deep learning-based abstractive text summarisation are discussed. Multisentence summary methods include the get-to-the-point method (summarisation with pointer-generator networks) [56], a deep reinforced model for abstractive summarization (RL) [57], the generative adversarial network for abstractive text summarization [58], semantic phrase exploration (ATSDL) [30], the bidirectional attentional encoder-decoder model and bidirectional beam search [35], the key information guide network [59], text summarisation abstraction improvement [60], dual encoding for abstractive text summarisation (DEATS) [61], abstractive document summarisation via a bidirectional decoder (BiSum) [62], the text abstraction summary model based on BERT word embedding and RL [63], the transformer-based model for single-document neural summarisation [64], text summarisation with pretrained encoders [65], and the text summarisation method based on the double attention pointer network [49]. The pointer-generator approach [55] covers both single-sentence and multisentence summaries. Additional details are presented in the following sections.

4.1. Abstractive Summarization Architecture

4.1.1. LSTM-RNN. A novel abstractive summarisation method was proposed in [56]; it generated a multisentence summary and addressed sentence repetition and inaccurate information. See et al. proposed a model that consists of a single-layer bidirectional LSTM encoder, a single-layer unidirectional LSTM decoder, and the sequence-to-sequence attention model proposed by [55]. The See et al. model generates a long text summary instead of headlines consisting of one or two sentences. Moreover, the attention mechanism was employed, and the attention distribution facilitated the production of the next word in the summary by telling the decoder where to look in the source words, as shown in Figure 9. This mechanism constructed a weighted sum of the encoder hidden states, yielding the context vector, where the context vector is a fixed-size representation of the input. The probability P_vocab produced by the decoder was employed to generate the final prediction using the context vector and the decoder's last step. Furthermore, the value of P_vocab was equal to zero for OOV words. RL was employed for abstractive text summarisation in [57]. The method proposed in [57], which combined RL with supervised word prediction, was composed of a bidirectional LSTM-RNN encoder and a single LSTM decoder.
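The pointer-generator idea behind this family of models, which allows OOV words to be copied from the source even though P_vocab assigns them zero probability, can be sketched as follows. This is a generic illustration of the final word distribution in the spirit of [56], not the exact implementation; the inputs are assumed to be precomputed and the variable names are ours.

```python
import numpy as np

def final_distribution(p_vocab, attention, source_ids, p_gen, extended_vocab_size):
    """Mix generation and copying into one distribution over an extended vocabulary.

    p_vocab: (V,) decoder softmax over the fixed vocabulary.
    attention: (n,) attention weights over the n source positions.
    source_ids: (n,) ids of the source words in the extended vocabulary
                (OOV source words get ids >= V).
    p_gen: scalar in (0, 1), probability of generating rather than copying.
    """
    dist = np.zeros(extended_vocab_size)
    dist[: len(p_vocab)] = p_gen * p_vocab             # generation part
    for pos, word_id in enumerate(source_ids):         # copy part: scatter attention mass
        dist[word_id] += (1.0 - p_gen) * attention[pos]
    return dist
```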

Two models, a generative model and a discriminative model, were trained simultaneously to generate abstractive summary text using an adversarial process [58]. The maximum likelihood estimation (MLE) objective function employed in previous sequence-to-sequence models suffers from two problems: the difference between the training loss and the evaluation metric, and the unavailability of a golden token at testing time, which causes errors to accumulate during testing. To address these problems, the proposed approach exploited the adversarial framework. In the first step of the adversarial framework, reinforcement learning was employed to optimise the generator, which generates the summary from the original text. In the second step, the discriminator, which acts as a binary classifier, classified the summary as either a ground-truth summary or a machine-generated summary. The bidirectional LSTM encoder and attention mechanism were employed as in [56].

Abstractive text summarisation using the LSTM-CNN model based on exploring semantic phrases (ATSDL) was proposed in [30]. ATSDL is composed of two phases: the first phase extracts phrases from the sentences, while the second phase learns the collocation of the extracted phrases using the LSTM model. To generate sentences that are general and natural, the input and output of the ATSDL model were phrases instead of words, and the phrases were divided into three main types, i.e., subject, relation, and object phrases, where the relation phrase represents the relation between the input phrase and the output phrase. Each phrase was represented using a CNN layer. There are two main reasons for choosing the CNN: first, the CNN is efficient for sentence-level applications, and second, training is efficient since long-term dependency is unnecessary. Furthermore, to obtain several vectors for a phrase, multiple kernels with different widths, which represent the dimensionality of the features, were utilised. Within each kernel, the maximum feature was selected for each row via max pooling. The resulting values were added to obtain the final value for each word in a phrase. Bidirectional LSTM was employed instead of a GRU on the encoder side, since the parameters are easy to tune with LSTM. Moreover, the decoder was divided into two modes: a generate mode and a copy mode. The generate mode generated the next phrase in the summary based on previously generated phrases and the hidden layers of the input on the encoder side, while the copy mode copied the phrase after the current input phrase if the currently generated phrase was not suitable for the previously generated phrases in the summary. Figure 10 provides additional details.

Bidirectional encoder and decoder LSTM-RNNs were employed to generate abstractive multisentence summaries [35]. The proposed approach considered past and future context on the decoder side when making a prediction, as it employed a bidirectional RNN. Using a bidirectional RNN on the decoder side addressed the problem of summary imbalance. An unbalanced summary can occur due to noise in a previous prediction, which reduces the quality of all subsequent summary words. The bidirectional decoder consists of two LSTMs: a forward decoder and a backward decoder. The forward decoder decodes the information from left to right, while the backward decoder decodes the information from right to left. The last hidden state of the forward decoder is fed as the initial input to the backward decoder, and vice versa. Moreover, the researchers proposed a bidirectional beam-search method that generates summaries from the proposed bidirectional model. Bidirectional beam search combines information from the past and future to produce a better summary.


Therefore, the output summary was balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence was read in reverse order, based on the conclusion that LSTM learns better when reading the source in reverse order while remembering the order of the target [66, 67]. A softmax layer was employed on the decoder side to obtain the probability of each target word in the summary over the vocabulary distribution by taking the output of the decoder as input to the softmax layer. The decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. The objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token. However, during testing, the input of the forward decoder is the token generated in the previous step. The same situation holds for the backward decoder, where the input during training is the future token from the summary. Nevertheless, the bidirectional decoder has difficulty during testing, since the complete summary must be known in advance; thus, the full backward decoder output was generated and fed to the forward decoder using a unidirectional backward beam search.

A combination of abstractive and extractive methods was employed in the guiding generation model proposed by [59]. The extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent the key information. Furthermore, to predict the long-term value of the final summary, the proposed method applied a prediction guide mechanism [68].

Figure 9: Baseline sequence-to-sequence model with attention mechanism [56].

Figure 10: Semantic-unit-based LSTM model [30].


A prediction guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. The encoder-decoder architecture baseline of the proposed model is similar to that proposed by Nallapati et al. [55], where a bidirectional LSTM encoder and a unidirectional LSTM decoder were employed. Both models applied the attention mechanism and a softmax layer. Moreover, the process of generating the summary was improved by proposing KIGN, which takes as input the keywords extracted using the TextRank algorithm. In KIGN, key information is represented by concatenating the last forward hidden state and the first backward hidden state. KIGN employs the attention mechanism and the pointer mechanism. In general, the attention mechanism can hardly identify the keywords; thus, the output of KIGN is fed to the attention mechanism, which will consequently be strongly influenced by the keywords. Furthermore, to enable the pointer network to identify the keywords produced by KIGN, the encoder context vector and the hidden state of the decoder are fed to the pointer network, and its output is employed to calculate the soft switch. The soft switch determines whether to copy the target from the original text or generate it from the target vocabulary, as shown in Figure 11.

The level of abstraction in the summaries generated by abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and the use of a novel metric for optimising the overlap between the n-grams of the generated summary and the ground-truth summary. The decoder was decomposed into a contextual network and a pretrained language model, as shown in Figure 12. The contextual network uses the source document to extract the relevant parts, and the pretrained language model is generated via prior knowledge. This decomposition facilitates the addition of an external pretrained language model that is related to several domains. Furthermore, a novel metric was employed to encourage abstractive summaries by rewarding words that are not in the source document. Bidirectional LSTM was utilised in the encoder, and the decoder applied a 3-layer unidirectional weight-dropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applied the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combined reinforcement learning and maximum likelihood.

A bidirectional decoder with a sequence-to-sequence architecture, which is referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing because the input of the decoder is the previously generated summary word; if one generated summary word is incorrect, the error propagates through all subsequent summary words. In the bidirectional decoder, there are two decoders: a forward decoder and a backward decoder. The forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left. The forward decoder takes a reference from the backward decoder. However, there is only a single-layer encoder. The encoder and decoders employ LSTM units, but while the encoder utilises bidirectional LSTM, the decoders use unidirectional LSTM, as shown in Figure 13. To understand the summary generated by the backward decoder, an attention mechanism is applied between the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, a pointer mechanism is employed in both decoders.

A double attention pointer network, referred to as DAPT, was applied to build an abstractive text summarisation model [49]. The encoder utilised bidirectional LSTM, while the decoder utilised unidirectional LSTM. The key features of the encoder were extracted using a self-attention mechanism. At the decoder, a beam search was employed. Moreover, more coherent and accurate summaries were generated. The repetition problem was addressed using an improved coverage mechanism with a truncation parameter. The model was optimised by generating a training model that is based on RL and scheduled sampling.
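The coverage idea referred to here can be sketched as below: a standard coverage vector and coverage penalty as popularised by pointer-generator models. The truncation parameter tau is shown only as an assumed illustration of how the penalty might be capped and does not reproduce the exact formulation of [49].

```python
import numpy as np

def coverage_step(coverage, attention, tau=1.0):
    """Update the coverage vector and compute a (truncated) coverage penalty.

    coverage: (n,) sum of attention distributions from previous decoding steps.
    attention: (n,) attention distribution at the current step.
    tau: truncation threshold capping each position's contribution to the penalty.
    """
    # Penalise attending again to positions that are already well covered.
    penalty = np.sum(np.minimum(np.minimum(coverage, attention), tau))
    new_coverage = coverage + attention
    return new_coverage, penalty
```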

4.1.2. GRU-RNN. Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. The dual encoder consists of two levels of encoders, i.e., a primary and a secondary encoder, in addition to one decoder, and all of them employ a GRU. The primary encoder performs coarse encoding, while the secondary encoder performs fine encoding. The primary encoder and the decoder are the same as in the standard encoder-decoder model with an attention mechanism, and the secondary encoder generates a new context vector that is based on previous outputs and inputs. Moreover, this additional context vector provides meaningful information for the output; thus, the repetition problem of the generated summary that was encountered in previous approaches is addressed. A semantic vector is generated at both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is calculated. The fixed-length output is partially generated at each stage in the decoder, since it decodes in stages.

Figure 14 elaborates the DEATS process. The primary encoder produces a hidden state h_j^p for each input j and a content representation c^p. Next, the decoder decodes a fixed-length output, which is referred to as the decoder content representation c^d. The weight α_j can be calculated using the hidden states h_j^p and the content representations c^p and c^d. In this stage, the secondary encoder generates new hidden states, or semantic context vectors, h_m^s, which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator, a copy mechanism, and a coverage mechanism.

Wang et al. proposed a hybrid extractive-abstractive text summarisation model that is based on combining reinforcement learning with BERT word embedding [63]. In this hybrid model, a BERT feature-based strategy was used to generate contextualised token embeddings.


This model consists of two submodels, an abstractive agent and an extractive agent, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In both the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. The training process consists of pretraining and full training phases.

Egonmwan et al. proposed to use sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules: an extractive model and an abstractive model. The transformer encoder has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives a sentence-level representation.

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].


The architecture of the abstractive model consists of a single-layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder. The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.

4.1.3. Others. BERT is employed to represent the sentences of the document and express its semantics [65]. Liu et al. proposed abstractive and extractive summarisation models that are based on an encoder-decoder architecture. The encoder used a BERT pretrained document-level encoder, while the decoder utilised a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage in extractive summarisation and one stage in abstractive summarisation. At the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed, since these two mechanisms need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries using trigram blocking. OOV words rarely appear in the generated summary.
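Trigram blocking during beam search can be implemented with a simple check like the one below (a generic sketch of the heuristic, not the exact code of [65]): a candidate token is skipped whenever extending the partial summary with it would repeat a trigram that already occurs in that summary.

```python
def has_repeated_trigram(tokens):
    """Return True if the token sequence contains any trigram more than once."""
    seen = set()
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        if trigram in seen:
            return True
        seen.add(trigram)
    return False

def allowed_extension(partial_summary, candidate_token):
    """Block beam candidates that would create a duplicate trigram."""
    return not has_repeated_trigram(partial_summary + [candidate_token])

# Example:
# allowed_extension(["the", "cat", "sat", "the", "cat"], "sat") -> False
```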

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, both the input and output tokens applied the same embedding matrix W_emb, which was generated using the GloVe word embedding model, in the Paulus et al. model [57]. Another word embedding matrix, referred to as W_out, was applied in the token generation layer. Additionally, a shared weighting matrix was employed by both the shared embedding matrix W_emb and the W_out matrix. The shared weighting matrices improved the process of generating tokens, since they considered the embedding syntax and semantic information.

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].


The discriminator input sequence of the Liu et al. model was encoded using a CNN with maximum pooling, and the result was passed to the softmax layer [58]. On the other hand, the word embedding that was applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets, with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embedding. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, in which source documents of 781 tokens are paired with summaries of 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the documents were preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets: the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consists of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve coreference. The second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP is Stanford NLP parsing, a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing is performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If a child node has children, then that child is considered a new root; this process continues recursively until the root has no children, at which point the tree structure is complete. Accordingly, compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are refined before training by applying simple rules. First, the phrase triples at the topmost level are exploited, since they carry the most semantic information. Second, triple phrases whose subject and object phrases contain no nouns are deleted, since nouns carry a considerable amount of conceptual information. Triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consist of 92,000 and 219,000 text sources, respectively.

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1, 41.16; ROUGE2, 15.75; and ROUGE-L, 39.08 [57]. The results for the maximum likelihood model were 39.87, 15.82, and 36.9 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to evaluate the quality and readability of the summaries. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]. The value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics for the Al-Sabahi et al. and Li et al. models: values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluations included ROUGE1, ROUGE2, and ROUGE-L, and values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel n-gram-based score was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model. Five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected. The evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, in which the WordPiece tokeniser was utilised. The values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, which is referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-and-answering paradigm, where 20 documents were selected for evaluation. Three values were chosen for scoring the answers: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attentional encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model consists of a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs, a GRU-RNN for the word level and a GRU-RNN for the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering the whole vocabulary, only certain words were added, based on their frequency in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in embedding matrices for each tag type, similar to word embedding, while the TF-IDF feature was discretised into a fixed number of bins, where a one-hot representation was employed to represent the bin value. The one-hot vector consisted of the number of bin entries, where only one entry was set to one to indicate the TF-IDF value of a certain word. This process permitted the TF-IDF feature to be treated in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8M training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and named-entity tag generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.3, and 32.65, respectively.
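The feature-rich input representation described above can be sketched as a simple concatenation of embeddings. This is an illustrative example only; the embedding sizes, bin count, and lookup tables are assumptions and are not those of [55].

```python
import numpy as np

def encode_word(word_vec, pos_id, ner_id, tfidf, pos_table, ner_table, n_bins=10):
    """Concatenate a word embedding with POS, NER, and discretised TF-IDF features.

    word_vec: (d,) pretrained word embedding (e.g., Word2Vec).
    pos_id, ner_id: integer ids of the part-of-speech and named-entity tags.
    tfidf: TF-IDF score in [0, 1], discretised into n_bins one-hot bins.
    pos_table, ner_table: embedding lookup tables for the tag vocabularies.
    """
    tfidf_onehot = np.zeros(n_bins)
    tfidf_onehot[min(int(tfidf * n_bins), n_bins - 1)] = 1.0   # one-hot bin indicator
    return np.concatenate([word_vec, pos_table[pos_id], ner_table[ner_id], tfidf_onehot])
```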

Finally, for both the single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of the various approaches appear in Table 2, while the training, optimisation, mechanisms, and search at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets have been selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets were produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets, even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies have utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of each article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets that were generated by Hermann et al. [71]; the Hermann et al. datasets were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have on average 766 words (29.74 sentences), while the summaries have on average 53 words (3.72 sentences) [55].

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].


Figure 16: Word embedding concatenated with discretized TF-IDF, POS, and NER one-embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | —
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | Word-based decoder RNN
[50] | 2016 | GRU + QRNN + attention | GRU + RNN (QRNN)
[38] | 2016 | Unidirectional attentive encoder-decoder LSTM-RNN; bidirectional LSTM; bidirectional LSTM | Unidirectional attentive encoder-decoder LSTM-RNN; unidirectional LSTM; decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets: nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number on the list), and one study each applied the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization by using "#" to replace all digits, converting all letters to lowercase, and "UNK" to replace words that occurred fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenization by using "#" to replace all digits, converting all letters to lowercase, and "UNK" to replace words that occurred fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions trained using the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins and a one-hot representation for the bins; (iv) lookup embedding for part-of-speech and named-entity tagging
[52] | Zhou et al. | PTB tokenization by using "#" to replace all digits, converting all letters to lowercase, and "UNK" to replace words that occurred fewer than 5 times | Word embedding with size equal to 300
[53] | Cao et al. | Normalization and tokenization, using "#" to replace digits, converting words to lowercase, and "UNK" to replace the least frequent words | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lowercase letters | GloVe word embedding
[29] | Lopyrev | Tokenization, converting the articles and their headlines to lowercase letters, using the symbol ⟨unk⟩ to replace rare words | The input was represented using a distributed representation
[38] | Jobson et al. | — | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | — | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | — | CNN maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving the coreference and reducing the morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | — | The word embedding is learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | — | The word embedding is learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | — | BERT
[63] | Wang et al. | Using the WordPiece tokenizer | BERT
[64] | Egonmwan et al. | — | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers reviewed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures to evaluate the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is an n-gram recall measure, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise negative log-likelihood | — | Beam search
[39] | Chopra et al. | Minimising negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimise the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimising the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimising the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | —
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanism | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | —
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | —
[30] | Song et al. | — | Attention mechanism, copy mechanism | —
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\text{ROUGE-}N = \frac{\sum_{S \in \{\text{reference summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{reference summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}, \qquad (1)

where S is a reference summary, n is the n-gram length, and Count_match(gram_n) is the maximum number of matching n-grams between the reference summary and the generated summary. Count(gram_n) is the total number of n-grams in the reference summary [73].
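Equation (1) translates directly into a short recall computation. The following sketch is an illustrative implementation of ROUGE-N recall for a single reference summary; it omits the stemming, stop-word, and multi-reference options of the official ROUGE package.

```python
from collections import Counter

def rouge_n(reference_tokens, generated_tokens, n=1):
    """ROUGE-N recall: matched n-grams divided by total n-grams in the reference."""
    ref_ngrams = Counter(tuple(reference_tokens[i:i + n])
                         for i in range(len(reference_tokens) - n + 1))
    gen_ngrams = Counter(tuple(generated_tokens[i:i + n])
                         for i in range(len(generated_tokens) - n + 1))
    matched = sum(min(count, gen_ngrams[gram]) for gram, count in ref_ngrams.items())
    total = sum(ref_ngrams.values())
    return matched / total if total else 0.0

# Example:
# rouge_n("ahmed ate the apple".split(), "the apple ahmed ate".split(), n=2) -> 2/3
```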

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of the common matching words between the reference summary and the generated summary. The LCS calculation does not require the matched words to be consecutive; however, their order of occurrence is important. In addition, no predefined number of matched words is required. LCS considers only the main in-sequence match, which is one of its disadvantages, since the final score will not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets.

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarisation evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | —
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: The apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple", but not both, similar to LCS.
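ROUGE-L itself reduces to a longest-common-subsequence computation over the two token sequences, as in the following sketch (shown as LCS-based recall against the reference length; the official package also reports precision and an F-measure):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if tok_a == tok_b
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[-1][-1]

def rouge_l_recall(reference_tokens, generated_tokens):
    """ROUGE-L recall: LCS length divided by the reference length."""
    return lcs_length(reference_tokens, generated_tokens) / len(reference_tokens)

# Example from the text: the LCS is either "ahmed ate" or "the apple" (length 2),
# so rouge_l_recall("ahmed ate the apple".split(), "the apple ahmed ate".split()) -> 0.5
```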

Tables 5 and 6 present the values of ROUGE1 ROUGE2and ROUGE-L for the text summarisation methods in thevarious studies reviewed in this research In addition Per-plexity was employed in [18 39 51] and BLEU was utilisedin [29] e models were evaluated using various datasetse other models applied ROUGE1 ROUGE2 andROUGE-L for evaluation It can be seen that the highestvalues of ROUGE1 ROUGE2 and ROUGE-L for textsummarisation with the pretrained encoder model were4385 2034 and 399 respectively [65] Even thoughROUGE was employed to evaluate abstractive summa-risation it is better to obtain new methods to evaluate thequality of summarisation e new evaluation metrics mustconsider novel words and semantics since the generatedsummary contains words that do not exist in the originaltext However ROUGE was very suitable for extractive textsummarisation

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups: the first group considered single-sentence summary approaches, while the second group considered multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are those that employed BERT word embedding and were based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset: Gigaword is utilised for single-sentence summaries, as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries. Thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the generated summaries of 50 test examples from 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that it has a low level of readability. The results clearly show that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though it is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation of the generated summaries [60]: five human evaluators rated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where 1 indicates that the generated summary is less readable and less relevant and 10 indicates that it is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. With respect to relevance, the mean values of the three models are close, with 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retained key information, which was achieved by human evaluation [65]. In addition, a qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were used to rate 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summary in abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are also not enough, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either a gold token is utilised during training or the previous step's output is employed during both testing and training. In this manner, at least the training step receives the same


input as testing. In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76]. Moreover, the mass convolution of the QRNN is applied in [50], since the dependency on words generated in the future is difficult to determine.
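A minimal sketch of this coin-flip idea (scheduled sampling / data-as-demonstrator style) is shown below; the function name and the 10% default sampling probability are illustrative assumptions based on the description above, not code from the cited works.

import random

def next_decoder_input(gold_token, prev_generated_token, sampling_prob=0.1):
    # With probability sampling_prob feed back the model's own previous
    # prediction; otherwise feed the gold (reference) token. During testing
    # there is no gold token, so the previous prediction is always used.
    if random.random() < sampling_prob:
        return prev_generated_token
    return gold_token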

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copy it to the memory. When the switch is turned on, the decoder generates a word from the target vocabularies. Conversely, researchers in [56] addressed OOV words via a generation probability Pgen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words and copy them. The combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the input sequence position for

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference  Year  Authors           Model               ROUGE1  ROUGE2  ROUGE-L
[18]       2015  Rush et al.       ABS+                28.18   8.49    23.81
[39]       2016  Chopra et al.     RAS-Elman (k = 10)  28.97   8.26    24.06
[55]       2016  Nallapati et al.  Words-lvt5k-1sent   28.61   9.42    25.24
[52]       2017  Zhou et al.       SEASS               36.15   17.54   33.63
[53]       2018  Cao et al.        FTSumg              37.27   17.65   34.24
[54]       2019  Cai et al.        RCT                 37.27   18.19   34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference  Year  Authors            Model                                          ROUGE1  ROUGE2  ROUGE-L
[55]       2016  Nallapati et al.   Words-lvt2k-temp-att                           35.46   13.30   32.65
[56]       2017  See et al.         Pointer-generator + coverage                   39.53   17.28   36.38
[57]       2017  Paulus et al.      Reinforcement learning with intra-attention    41.16   15.75   39.08
[57]       2017  Paulus et al.      Maximum-likelihood + RL with intra-attention   39.87   15.82   36.90
[58]       2018  Liu et al.         Adversarial network                            39.92   17.65   36.71
[30]       2018  Song et al.        ATSDL                                          34.9    17.8    —
[35]       2018  Al-Sabahi et al.   Bidirectional attentional encoder-decoder      42.6    18.8    38.5
[59]       2018  Li et al.          Key information guide network                  38.95   17.12   35.68
[60]       2018  Kryscinski et al.  ML + RL ROUGE + Novel with LM                  40.19   17.38   37.52
[61]       2018  Yao et al.         DEATS                                          40.85   18.08   37.13
[62]       2018  Wan et al.         BiSum                                          37.01   15.95   33.66
[63]       2019  Wang et al.        BEAR (large + WordPiece)                       41.95   20.26   39.49
[64]       2019  Egonmwan et al.    TRANS-ext + filter + abs                       41.89   18.9    38.92
[65]       2020  Liu et al.         BERTSUMEXT (large)                             43.85   20.34   39.90
[49]       2020  Peng et al.        DAPT + imp-coverage (RL + MLE (ss))            40.72   18.28   37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset (ABS+, RAS-Elman (k = 10), SEASS, Words-lvt5k-1sent, FTSumg, and RCT).


unseen tokens and copy them. Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. In [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
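The switching idea behind these pointer-generator models can be sketched as follows. This is an illustrative reconstruction of the extended-vocabulary mixture described in [56] (final distribution = Pgen times the vocabulary distribution plus (1 − Pgen) times the copy/attention distribution); the variable names are our own, not those of the original systems.

import numpy as np

def final_distribution(p_gen, p_vocab, attention, src_ids, extended_vocab_size):
    # p_vocab: probabilities over the fixed vocabulary (length V)
    # attention: attention weights over the source positions
    # src_ids: ids of the source words in the extended vocabulary (V + in-article OOVs)
    dist = np.zeros(extended_vocab_size)
    dist[:len(p_vocab)] = p_gen * p_vocab
    for pos, word_id in enumerate(src_ids):
        dist[word_id] += (1.0 - p_gen) * attention[pos]  # copy probability
    return dist

An OOV source word has no slot in the fixed vocabulary, so its probability mass comes entirely from the copy term, which is exactly what allows such words to appear in the summary.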

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges arise when summarising long documents and producing long summaries with the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model, which creates a coverage vector by aggregating the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention steps. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition by discouraging attention to the same parts of the input at different decoder steps. However, the intratemporal encoder attention mechanism cannot address all repetition challenges, especially when a long sequence is generated. Thus, an intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words; this mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy (maximum-likelihood) loss and reinforcement learning gradients to minimise the exposure bias. In addition, the trigram probability p(yt) was proposed to address repetition in the generated summary, where yt is the trigram sequence: the value of p(yt) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated by a secondary encoder to remember the previously generated decoder output, and the coverage mechanism was utilised.
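The two repetition remedies mentioned above, i.e., the coverage vector of [35, 56] and the trigram-blocking heuristic of [57, 60], can be sketched as follows; this is illustrative Python with our own function names, not code from the cited systems.

def update_coverage(coverage, attention):
    # Coverage vector = running sum of the attention distributions over all
    # previous decoder timesteps; a coverage loss can then penalise attending
    # again to already-covered source positions.
    return [c + a for c, a in zip(coverage, attention)]

def repeats_trigram(partial_summary_tokens, candidate_token):
    # Trigram blocking: reject a candidate token during beam search if it
    # would create a trigram that already appears in the partial summary.
    tokens = partial_summary_tokens + [candidate_token]
    if len(tokens) < 3:
        return False
    new_trigram = tuple(tokens[-3:])
    seen = {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 3)}
    return new_trigram in seen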

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts; about 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. To address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary consists of the highlights of the news article. Every highlight represents a sentence in the summary; therefore,

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all the crucial points of the article; producing a high-quality dataset therefore requires considerable effort. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset exists but is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words. For example, the words "book" and "books" are considered different by all of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words (words that have the same meaning must be considered the same even if they have different surface forms). In this case, we propose the use of METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, in flexible-order languages, it is better to use ROUGE without considering the order of the words.

The quality of the generated summary can be improved using linguistic features. For example, we propose the use

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


of dependency parsing at the encoder, in a separate layer on top of the first hidden state layer. We also propose the use of word embeddings built by considering dependency parsing or part-of-speech tagging. On the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the new trends and evaluation results, we believe that the most promising feature among all of those reviewed is the use of the BERT pretrained model. The quality of the models that are based on the transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that applied deep learning for abstractive text summarisation, the datasets, and the measures used to evaluate these approaches. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively; the best results were achieved by models that apply the transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, several issues must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.

[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].


[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.

[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.

[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.

[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.

[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.

[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.

[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.

[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.

[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.

[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, 2017.

[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.

[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.

[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.

[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.

[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, 2018.

[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.

[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.

[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, http://arxiv.org/abs/1812.02303, 2020.

[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.

[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.

[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.

[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.

[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

[29] K. Lopyrev, Generating News Headlines with Recurrent Neural Networks, p. 9, 2015, https://arxiv.org/abs/1512.01712.

[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.

[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.

[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.

[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.

[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.

[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.

[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.

[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.

[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.

[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.

[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.

[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.

[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.

[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, Quasi-Recurrent Neural Networks, https://arxiv.org/abs/1611.01576, 2015.

[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.

[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.

[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.

[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.

[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.

[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.

[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.

[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.

[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.

[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.

[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.

[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.

[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.

[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.

[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.

[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation, text summarization branches out," Proceedings of the ACL-04 Workshop, vol. 8, 2004.

[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.

[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, New Orleans, LA, USA, pp. 708–719, June 2018.

[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.

[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.

[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.

[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.

[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.



global vector, which is based on statistics of the global corpus instead of the context window [44]. FastText extends the skip-gram of the Word2Vec model by using subword internal information to address out-of-vocabulary (OOV) terms [46]. In FastText, the subword components are composed to build the vector representation of a word, which facilitates representation of word morphology and lexical similarity. The BERT word embedding model is based on a multilayer bidirectional transformer encoder [47, 48]. Instead of using sequential recurrence, the transformer neural network utilises parallel attention layers. BERT creates a single large transformer by combining the representations of the words and sentences. Furthermore, BERT is pretrained with an unsupervised objective over a large amount of text.
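To make the FastText idea concrete, the following sketch composes a word vector from character n-gram vectors so that OOV words still receive a representation. The subword_vectors dictionary, the n-gram range, and the use of a plain sum are simplifying assumptions rather than the exact FastText implementation, which also includes a vector for the word itself when it is in the vocabulary.

import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    # Character n-grams of the word with boundary markers, FastText-style.
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams

def subword_vector(word, subword_vectors, dim=300):
    # Compose the word representation from its character n-gram vectors,
    # so an out-of-vocabulary word can still be embedded.
    vecs = [subword_vectors[g] for g in char_ngrams(word) if g in subword_vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)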

2.7. Transformers. The contextual representations of language are learned from large corpora. One of the new language representations, which extends word embedding models, is BERT, mentioned in the previous section [48]. In BERT, two special tokens are inserted into the text. The first token, (CLS), is employed to aggregate the information of the whole text sequence. The second token is (SEP), which is inserted at the end of each sentence to represent it. The resulting text consists of tokens, where each token is assigned three types of embeddings: token, segmentation, and position embeddings. Token embedding indicates the meaning of a token, segmentation embedding identifies the sentences, and position embedding determines the position of the token. The sum of the three embeddings is fed to the bidirectional transformer as a single vector. Pretrained word embedding vectors are more precise and rich in semantic features. BERT supports both fine-tuning (based on the objectives of certain tasks) and feature-based methods. Moreover, transformers compute the representation of the input and output using self-attention, which enables learning the relevance between each "word pair" [47].
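A toy sketch of the BERT input representation described above is given below; the table sizes are illustrative and the tables are randomly initialised, so this is not the actual pretrained model, only the three-embedding sum that is fed to the transformer.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, num_segments, dim = 30522, 512, 2, 768
token_table = rng.normal(size=(vocab_size, dim))
segment_table = rng.normal(size=(num_segments, dim))
position_table = rng.normal(size=(max_len, dim))

def bert_input_embeddings(token_ids, segment_ids):
    # Each input token is represented by the sum of its token, segment,
    # and position embeddings before entering the bidirectional transformer.
    positions = np.arange(len(token_ids))
    return (token_table[token_ids]
            + segment_table[segment_ids]
            + position_table[positions])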

3. Single-Sentence Summary

Recently, the RNN has been employed for abstractive text summarisation and has provided significant results. Therefore, we focus on abstractive text summarisation based on deep learning techniques, especially the RNN [49]. We discuss the approaches that have applied deep learning for abstractive text summarisation since 2015; the RNN with an attention mechanism was the architecture most commonly utilised. We classified the research according to summary type (i.e., single-sentence or multisentence summary), as shown in Figure 4. We also compared the approaches in terms of encoder-decoder architecture, word embedding, dataset and dataset preprocessing, and evaluations and results. This section covers single-sentence summary methods, while Section 4 covers multisentence summary methods. Single-sentence summary methods include a neural attention model for abstractive sentence summarisation [18], abstractive sentence summarisation with attentive RNN (RAS) [39], the quasi-RNN [50], a method for generating news headlines with RNNs [29], abstractive text summarisation using an attentive sequence-to-sequence RNN [38], neural text summarisation [51], selective encoding for abstractive sentence summarisation (SEASS) [52], faithful to the original: fact aware neural abstractive summarization (FTSumg) [53], and the improving transformer with sequential context representations (RCT) [54].

3.1. Abstractive Summarization Architecture

3.1.1. Feedforward Architecture. Neural networks were first employed for abstractive text summarisation by Rush et al. in 2015, where a local attention-based model was utilised to generate summary words by conditioning them on the input sentences [18]. Three types of encoders were applied: the bag-of-words encoder, the convolutional encoder, and the attention-based encoder. The bag-of-words model of the embedded input was used to distinguish between stop words and content words; however, this model had a limited ability to represent continuous phrases. Thus, a model that utilised a deep convolutional encoder was employed to allow the words to interact locally without the need for context. The convolutional encoder model can alternate between temporal convolution and max-pooling layers using the standard time-delay neural network (TDNN) architecture; however, it is limited to a single output representation. This limitation of the convolutional encoder model was overcome by the attention-based encoder, which exploits the learned soft alignment to weight the input based on the context in order to construct a representation of the output. Furthermore, a beam-search decoder was applied to limit the number of hypotheses in the summary.

3.1.2. RNN Encoder-Decoder Architecture

(1) LSTM-RNN. An abstractive sentence summarisation model that employs a conditional recurrent neural network (RNN) to generate the summary from the input is referred to as the recurrent attentive summariser (RAS) [39]. The RAS is an extension of the work in [18]: the model in [18] employed a feedforward neural network, while the RAS employed an RNN-LSTM. The encoder and decoder in both models were trained using sentence-summary pair datasets, but the decoder of the RAS improved the performance since it considered the position information of the input words. Furthermore, previous words and input sentences were employed to produce the next word in the summary during the training phase.

Lopyrev [29] proposed a simplified attention mechanism that was utilised in an encoder-decoder RNN to generate headlines for news articles. The news article was fed into the encoder one word at a time and then passed through the embedding layer to generate the word representation. The experiments were conducted using simple and complex attention mechanisms. In the simple attention mechanism, the last layer after processing the input in the encoder was divided into two parts: one part for calculating the attention weight vector and one part for calculating the context vector, as shown in Figure 5(a). However, in the complex attention mechanism, the last layer was employed to calculate the attention weight vector and context vector without


fragmentation, as shown in Figure 5(b). In both figures, the solid lines indicate the part of the hidden state of the last layer that is employed to compute the context vector, while the dashed lines indicate the part of the hidden state of the last layer that is applied to compute the attention weight vector. The same difference appears on the decoder side: in the simple attention mechanism, the last layer was divided into two parts (one part was passed to the softmax layer, and the other part was applied to calculate the attention weight), while in the complex attention mechanism, no such division was made. A beam search was performed on the decoder side during testing to extend the sequence of the probability.

Figure 4: Taxonomy of several approaches that use a recurrent neural network and attention mechanism in abstractive text summarisation, based on the summary type (single-sentence summary approaches and multisentence summary approaches).


The encoder-decoder RNN and sequence-to-sequence models were utilised in [55], which mapped the inputs to the target sequences; the same approach was also employed in [38, 51]. Three different methods for calculating the global attention scoring function were proposed: dot product scoring, the bilinear form, and a scalar value calculated from a projection of the hidden states of the RNN encoder [38]. The model applied LSTM cells instead of GRU cells (both LSTM and GRU are commonly employed for abstractive summarisation tasks; LSTM has a memory unit that provides control, while GRU has a lower computation time). Three models were employed: the first model applied unidirectional LSTM in both the encoder and the decoder, the second model was implemented using bidirectional LSTM in the encoder and unidirectional LSTM in the decoder, and the third model utilised a bidirectional LSTM encoder and an LSTM decoder with global attention. The first hidden state of the decoder is the concatenation of all backward and forward hidden states of the encoder. The use of attention in an encoder-decoder neural network generates a context vector at each timestep. For the local attention mechanism, the context vector is conditioned on a subset of the encoder's hidden states, while for the global attention mechanism, the vector is conditioned on all of the encoder's hidden states. After generating the first decoder output, the next decoder input is the word embedding of the output of the previous decoder step. An affine transformation is used to convert the output of the decoder LSTM into a dense vector prediction, due to the long training time that would be needed if the number of hidden states were equal to the number of words in the vocabulary.
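The three global-attention scoring variants mentioned above can be sketched as follows; the matrix names and the exact form of the projection variant are illustrative assumptions in the spirit of [33, 38] rather than the authors' exact formulation.

import numpy as np

def attention_scores(decoder_state, encoder_states, mode="dot",
                     W=None, v=None, W_query=None, W_key=None):
    # decoder_state: shape (d,); encoder_states: shape (T, d)
    if mode == "dot":           # score = h_enc . h_dec
        return encoder_states @ decoder_state
    if mode == "bilinear":      # score = h_enc W h_dec
        return encoder_states @ (W @ decoder_state)
    if mode == "projection":    # score = v . tanh(W_q h_dec + W_k h_enc)
        return np.tanh(encoder_states @ W_key.T + W_query @ decoder_state) @ v
    raise ValueError(mode)

def attention_weights(scores):
    # Softmax over the encoder timesteps gives the attention distribution,
    # whose weighted sum of encoder states is the context vector.
    e = np.exp(scores - scores.max())
    return e / e.sum()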

Khandelwal [51] employed a sequence-to-sequence model that consists of an LSTM encoder and an LSTM decoder for abstractive summarisation of small datasets. The decoder generated the output summary after reading the hidden representations generated by the encoder and passing them to the softmax layer. The sequence-to-sequence model does not memorise information, so generalisation of the model is not possible. Thus, the proposed model utilised imitation learning to determine whether to choose the golden token (i.e., the reference summary token) or the previously generated output at each step.

(2) GRU-RNN. A combination of elements of the RNN and the convolutional neural network (CNN) was employed in an encoder-decoder model referred to as the quasi-recurrent neural network (QRNN) [50]. In the QRNN, the GRU was utilised in addition to the attention mechanism. The QRNN was applied to address the limitation of parallelisation: the dependencies of the words on previous steps are obtained via convolution and "fo-pooling", allowing most of the computation to be performed in parallel, as shown in Figure 6. The convolution in the QRNN can be either mass convolution (considering previous timesteps only) or centre convolution (also considering future timesteps). The encoder-decoder model employed two neural networks: the first network applied the centre convolution of the QRNN and consisted of multiple hidden layers fed by the vector representations of the words, and the second network comprised neural attention and took the encoder hidden layers as input to generate one word of the headline. The decoder accepted the previously generated headline word and produced the next word of the headline; this process continued until the headline was completed.
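The fo-pooling step can be sketched as below. The candidate and gate sequences are assumed to have already been produced by the QRNN convolutions, which is where the parallelism comes from; only this light element-wise recurrence is sequential. This is an illustrative reconstruction, not the authors' code.

import numpy as np

def fo_pool(z, f, o):
    # z, f, o: arrays of shape (T, d) holding the candidate, forget, and
    # output gate sequences produced in parallel by the QRNN convolutions.
    T, d = z.shape
    c = np.zeros(d)
    h = np.zeros((T, d))
    for t in range(T):
        c = f[t] * c + (1.0 - f[t]) * z[t]   # element-wise recurrent cell update
        h[t] = o[t] * c                      # gated hidden state
    return h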

SEASS, an extension of the sequence-to-sequence recurrent neural network, was proposed in [52]. The selective encoding for abstractive sentence summarisation (SEASS) approach includes a selective encoding model that consists of a sentence encoder, a selective gate network, and a decoder with an attention mechanism, as shown in Figure 7. The encoder uses a bidirectional GRU, while the decoder uses a unidirectional GRU with an attention mechanism. The encoder reads the input words and builds their representations. The meaning of the sentence is applied by the selective gate to select information from the word representations and construct a second-level sentence representation. To produce a good summary and accelerate the decoding process, a beam search was used in the decoder.
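One plausible reading of the selective gate, sketched in Python (the weight names and shapes are our assumptions), is:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selective_gate(hidden_states, sentence_vector, W_h, W_s, b):
    # hidden_states: (T, d) bidirectional GRU outputs for each word
    # sentence_vector: (d,) sentence representation (e.g., a projection of the
    # concatenated last forward / first backward encoder states)
    gates = sigmoid(hidden_states @ W_h.T + W_s @ sentence_vector + b)
    return gates * hidden_states  # gated (selected) word representations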

On the other hand, dual attention was applied in [53]. The proposed dual attention approach consists of three modules: two bidirectional GRU encoders and one dual attention decoder. The decoder has a gate network for context selection, as shown in Figure 8, and employs copying and coverage mechanisms. The outputs of the encoders are two context vectors: one context vector for the sentences and one context vector for the relations, where a relation may be

Figure 5: (a) Simple attention and (b) complex attention [29].


Figure 6: Comparison of the CNN, LSTM, and QRNN models [50].

Figure 7: Selective encoding for abstractive sentence summarisation (SEASS) [52].

Figure 8: Faithful to the original [53].


a triple or a tuple. A triple relation consists of the subject, predicate, and object, while a tuple relation consists of either (subject and predicate) or (predicate and object). Sometimes the triple relation cannot be extracted; in this case, two tuple relations are utilised. The decoder gate merges both context vectors based on their relative associations.

(3) Others. The poor semantic representation of long sequences in abstractive text summarisation approaches based on the RNN encoder-decoder framework was addressed using the RC-Transformer (RCT) [54]. The RCT is an RNN-based abstractive text summarisation model that is composed of two encoders (an RC encoder and a transformer encoder) and one decoder. The transformer has an advantage in parallel computing, in addition to capturing global semantic relationships in the context. Sequential context representation, on the other hand, is provided by the second encoder of the RCT. Word ordering is crucial for abstractive text summarisation and cannot be captured by positional encoding alone; therefore, the RCT utilises two encoders to address the shortage of sequential information at the word level. A beam search was utilised at the decoder. Furthermore, Cai et al. compared the speed of the RCT model with that of the RNN-based model and concluded that the RCT is 14x and 12x faster.

3.2. Word Embedding. In the QRNN model, GloVe word embeddings, pretrained using the Wikipedia and Gigaword datasets, were used to represent the text and summary [50]. In the first model proposed by Jobson et al., the word embeddings were randomly initialised and updated during training, while GloVe word embeddings were employed to represent the words in the second and third models [38]. In the study by Cai et al., the Transformer was utilised [54].

3.3. Dataset and Dataset Preprocessing. In the model proposed by Rush et al., the datasets were preprocessed via PTB tokenisation by using "#" to replace all digits, converting all letters to lowercase, and using "UNK" to replace words that occurred fewer than 5 times [18]. The model was trained with any input-output pairs due to the shortage of constraints for generating the output. The training process was carried out on the Gigaword dataset, while the summarisation evaluation was conducted on DUC2003 and DUC2004 [18]. Furthermore, the model proposed by Chopra et al. was trained using the Gigaword corpus with sentence separation and tokenisation [39]. To form sentence-summary pairs, each headline of an article was paired with the first sentence of the article. The same data preprocessing steps as in [18] were performed in [39]. Moreover, the Chopra et al. model was evaluated using the DUC2004 dataset, which consists of 500 pairs.

Gigaword datasets were also employed by the QRNN model [50]. Furthermore, articles that started with sentences containing more than 50 words, or whose headlines had more than 25 words, were removed. Moreover, the words in the articles and their headlines were converted to lowercase, and the data points were split into short, medium, and long sentences based on sentence length to avoid extra padding.

Lopyrev and Jobson et al. trained their models using Gigaword after preprocessing the data. In the Lopyrev model, the most crucial preprocessing steps for both the text and the headline were tokenisation and conversion of characters to lowercase [29]. In addition, only the characters of the first paragraph were retained, and the length of the headline was fixed between 25 and 50 words. Moreover, articles without headlines were disregarded, and the ⟨unk⟩ symbol was used to replace rare words.

Khandelwal employed the Association for Computational Linguistics (ACL) Anthology Reference Corpus, which consists of 16,845 examples for training and 500 examples for testing; these were considered small datasets in the experiments [51]. The abstract included the first three sentences, and the unigram overlap between the title and the abstract was also calculated. There were 25 tokens in the summary and a maximum of 250 tokens in the input text.

The English Gigaword dataset, the DUC2004 corpus, and MSR-ATC were selected to train and test the SEASS model [52]. Moreover, the experiments on the Cao et al. model were conducted using the Gigaword dataset [53]. The same data preprocessing steps as in [18] were performed in [52, 53]. The RCT also employed the Gigaword and DUC2004 datasets in its experiments [54].

3.4. Evaluation and Results. Recall-Oriented Understudy for Gisting Evaluation 1 (ROUGE1), ROUGE2, and ROUGE-L were utilised to evaluate the Rush et al. model, and values of 28.18, 8.49, and 23.81, respectively, were obtained [18]. The experimental results of the Chopra et al. model showed that, although DUC2004 was too complex for the experiments on the Gigaword corpus, the proposed model outperformed state-of-the-art methods in terms of ROUGE1, ROUGE2, and ROUGE-L [39], with values of 28.97, 8.26, and 24.06, respectively. On the other hand, BLEU was employed to evaluate the Lopyrev model [29], while Khandelwal utilised perplexity [51]. The SEASS model was evaluated using ROUGE1, ROUGE2, and ROUGE-L, and the results of the three measures were 36.15, 17.54, and 33.63, respectively [52]. Moreover, ROUGE1, ROUGE2, and ROUGE-L were selected for evaluating the Cao et al. model [53]; the values were 37.27, 17.65, and 34.24, respectively, and the results showed that fake summaries were reduced by 80%. In addition, the RCT was evaluated using ROUGE1, ROUGE2, and ROUGE-L, with values of 37.27, 18.19, and 34.62 on the Gigaword dataset. The results showed that the RCT model outperformed other models by generating a high-quality summary that contains salient information [54].


4. Multisentence Summary

In this section, multisentence summary approaches based on deep learning for abstractive text summarisation are discussed. Multisentence summary methods include the get-to-the-point method (summarisation with pointer-generator networks) [56], a deep reinforced model for abstractive summarization (RL) [57], the generative adversarial network for abstractive text summarization [58], semantic phrase exploration (ATSDL) [30], the bidirectional attention encoder-decoder with bidirectional beam search [35], the key information guide network [59], text summarisation abstraction improvement [60], dual encoding for abstractive text summarisation (DEATS) [61], abstractive document summarisation via bidirectional decoder (BiSum) [62], the text abstraction summary model based on BERT word embedding and RL [63], the transformer-based model for single-document neural summarisation [64], text summarisation with pretrained encoders [65], and the text summarisation method based on the double attention pointer network [49]. The pointer-generator approach [55] covers both single-sentence and multisentence summaries. Additional details are presented in the following sections.

4.1. Abstractive Summarization Architecture

4.1.1. LSTM-RNN. A novel abstractive summarisation method that generates a multisentence summary and addresses sentence repetition and inaccurate information was proposed in [56]. See et al. proposed a model that consists of a single-layer bidirectional LSTM encoder, a single-layer unidirectional LSTM decoder, and the sequence-to-sequence attention model proposed by [55]. The See et al. model generates a long text summary instead of headlines of one or two sentences. Moreover, the attention mechanism was employed, and the attention distribution facilitates the production of the next word in the summary by telling the decoder where to look in the source words, as shown in Figure 9. This mechanism constructs a weighted sum of the encoder hidden states, which yields the context vector, a fixed-size representation of the input. The probability Pvocab produced by the decoder from the context vector and the decoder's last step is employed to generate the final prediction; the value of Pvocab is zero for OOV words. RL was employed for abstractive text summarisation in [57]. The method proposed in [57], which combined RL with supervised word prediction, was composed of a bidirectional LSTM-RNN encoder and a single LSTM decoder.

Two models, a generative model and a discriminative model, were trained simultaneously to generate abstractive summary text using an adversarial process [58]. The maximum likelihood estimation (MLE) objective function employed in previous sequence-to-sequence models suffers from two problems: the difference between the training loss and the evaluation metric, and the unavailability of the golden token at testing time, which causes errors to accumulate during testing. To address these problems, the proposed approach exploited the adversarial framework. In the first step of the adversarial framework, reinforcement learning was employed to optimise the generator, which generates the summary from the original text. In the second step, the discriminator, which acts as a binary classifier, classified the summary as either a ground-truth summary or a machine-generated summary. A bidirectional LSTM encoder and an attention mechanism were employed, as in [56].

Abstractive text summarisation using the LSTM-CNN model based on exploring semantic phrases (ATSDL) was proposed in [30]. ATSDL is composed of two phases: the first phase extracts phrases from the sentences, while the second phase learns the collocation of the extracted phrases using the LSTM model. To generate sentences that are general and natural, the input and output of the ATSDL model are phrases instead of words, and the phrases are divided into three main types, i.e., subject, relation, and object phrases, where the relation phrase represents the relation between the input phrase and the output phrase. Each phrase was represented using a CNN layer. There are two main reasons for choosing the CNN: first, the CNN is efficient for sentence-level applications, and second, training is efficient since long-term dependency is unnecessary. Furthermore, to obtain several feature vectors for a phrase, multiple kernels with different widths, representing the dimensionality of the features, were utilised; within each kernel, the maximum feature was selected for each row via max pooling, and the resulting values were added to obtain the final value for each word in a phrase (a sketch of this multi-kernel encoding is given below). Bidirectional LSTM was employed instead of a GRU on the encoder side, since LSTM parameters are easier to tune. Moreover, the decoder was divided into two modes: a generate mode and a copy mode. The generate mode generates the next phrase in the summary based on the previously generated phrases and the hidden layers of the input on the encoder side, while the copy mode copies the phrase after the current input phrase if the currently generated phrase is not suitable for the previously generated phrases in the summary; Figure 10 provides additional details.
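A rough sketch of the multi-kernel phrase encoding, under one plausible reading of the description above, follows; the kernel shapes, the pooling choice, and the concatenation of pooled features are our assumptions, not the ATSDL authors' exact design.

import numpy as np

def conv1d_valid(embeddings, kernel):
    # embeddings: (T, d) word vectors of the phrase; kernel: (width, d);
    # assumes T >= width so at least one valid position exists.
    width = kernel.shape[0]
    T = embeddings.shape[0]
    return np.array([np.sum(embeddings[i:i + width] * kernel)
                     for i in range(T - width + 1)])

def phrase_vector(embeddings, kernels):
    # One feature per kernel: convolve over the phrase, then max-pool over
    # positions; the pooled features together form the phrase representation.
    return np.array([conv1d_valid(embeddings, k).max() for k in kernels])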

Bidirectional encoder and decoder LSTM-RNNs were employed to generate abstractive multisentence summaries [35]. The proposed approach considered past and future context on the decoder side when making a prediction, as it employed a bidirectional RNN. Using a bidirectional RNN on the decoder side addressed the problem of summary imbalance. An unbalanced summary could occur due to noise in a previous prediction, which will reduce the quality of all subsequent summaries. The bidirectional decoder consists of two LSTMs: the forward decoder and the backward decoder. The forward decoder decodes the information from left to right, while the backward decoder decodes the information from right to left. The last hidden state of the forward decoder is fed as the initial input to the backward decoder, and vice versa. Moreover, the researchers proposed a bidirectional beam-search method that generates summaries from the proposed bidirectional model. Bidirectional beam search combines information from the past and future to produce a better summary. Therefore, the output summary was balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence was read in reverse order, based on the conclusion that LSTM learns better when reading the source in reverse order while remembering the order of the target [66, 67]. A softmax layer was employed on the decoder side to obtain the probability of each target word in the summary over the vocabulary distribution by taking the output of the decoder as input for the softmax layer. The decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. The objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token. However, during testing, the input of the forward decoder is the token generated in the previous step. The same situation is true for the backward decoder, where the input during training is the future token from the summary. Nevertheless, the bidirectional decoder has difficulty during testing since the complete summary must be known in advance; thus, the full backward decoder output was generated and fed to the forward decoder using a unidirectional backward beam search.

A combination of abstractive and extractive methods was employed in the guiding generation model proposed by [59]. The extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent key information. Furthermore, to predict the final summary of the long-term value, the proposed method applied a prediction guide mechanism [68].

Figure 9: Baseline sequence-to-sequence model with attention mechanism [56].

Figure 10: Semantic-unit-based LSTM model [30].


A prediction guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. The encoder-decoder architecture baseline of the proposed model is similar to that proposed by Nallapati et al. [55], where both a bidirectional LSTM encoder and a unidirectional LSTM decoder were employed. Both models applied the attention mechanism and a softmax layer. Moreover, the process of generating the summary was improved by proposing KIGN, which takes as input the keywords extracted using the TextRank algorithm. In KIGN, key information is represented by concatenating the last forward hidden state and the first backward hidden state. KIGN employs the attention mechanism and the pointer mechanism. In general, the attention mechanism hardly identifies the keywords; thus, to identify keywords, the output of KIGN is fed to the attention mechanism. As a result, the attention mechanism will be highly affected by the keywords. However, to enable the pointer network to identify the keywords, which are the output of KIGN, the encoder context vector and the hidden state of the decoder are fed to the pointer network, and the output is employed to calculate the soft switch. The soft switch determines whether to copy the target from the original text or generate it from the target vocabulary, as shown in Figure 11.
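A minimal sketch of the key-information vector and a keyword-aware attention score is given below; the bilinear scoring form and the matrix names (W_h, W_k) are assumptions used for illustration, not the exact equations of the KIGN model.

```python
import numpy as np

def kign_attention_scores(enc_states, dec_state, kw_forward_last, kw_backward_first,
                          W_h, W_k):
    """Keyword-aware attention: the key vector biases the score of each source word.

    enc_states:        (src_len, h)  encoder hidden states
    dec_state:         (h,)          current decoder hidden state
    kw_forward_last:   (h,)          last forward hidden state of the keyword encoder
    kw_backward_first: (h,)          first backward hidden state of the keyword encoder
    """
    # Key information vector = concatenation of the two boundary states.
    key_vec = np.concatenate([kw_forward_last, kw_backward_first])   # (2h,)
    # Assumed bilinear scores: decoder-state term plus key-information term.
    scores = enc_states @ W_h @ dec_state + enc_states @ W_k @ key_vec
    e = np.exp(scores - scores.max())
    return e / e.sum()                                                # attention over source

rng = np.random.default_rng(0)
h = 4
attn = kign_attention_scores(rng.normal(size=(6, h)), rng.normal(size=h),
                             rng.normal(size=h), rng.normal(size=h),
                             rng.normal(size=(h, h)), rng.normal(size=(h, 2 * h)))
print(attn)  # sums to 1 over the 6 source words
```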

The level of abstraction in the generated summaries of abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and the use of a novel metric for optimising the overlap between the n-gram summary and the ground-truth summary. The decoder was decomposed into a contextual network and a pretrained language model, as shown in Figure 12. The contextual network applies the source document to extract the relevant parts, and the pretrained language model is generated via prior knowledge. This decomposition method facilitates the addition of an external pretrained language model that is related to several domains. Furthermore, a novel metric was employed to generate an abstractive summary by including words that are not in the source document. Bidirectional LSTM was utilised in the encoder, and the decoder applied a 3-layer unidirectional weight-dropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applied the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combined reinforcement learning and maximum likelihood.

A bidirectional decoder with a sequence-to-sequence architecture, referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing as the input of the decoder is the previously generated summary word, and if one of the generated summary words is incorrect, then the error will propagate through all subsequent summary words. In the bidirectional decoder, there are two decoders: a forward decoder and a backward decoder. The forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left. The forward decoder considers a reference from the backward decoder. However, there is only a single-layer encoder. The encoder and decoders employ LSTM units, but while the encoder utilises bidirectional LSTM, the decoders use unidirectional LSTM, as shown in Figure 13. To understand the summary generated by the backward decoder, the attention mechanism is applied in both the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, a pointer mechanism is employed in both decoders.

A double attention pointer network, referred to as DAPT, was applied to generate an abstractive text summarisation model [49]. The encoder utilised bidirectional LSTM, while the decoder utilised unidirectional LSTM. The encoder key features were extracted using a self-attention mechanism. At the decoder, beam search was employed. Moreover, more coherent and accurate summaries were generated. The repetition problem was addressed using an improved coverage mechanism with a truncation parameter. The model was optimised by generating a training model that is based on RL and scheduled sampling.
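The idea of a coverage vector with a truncation parameter can be sketched as follows; the exact penalty form and truncation threshold used by DAPT are not given here, so the clipping value and the min-based loss are illustrative assumptions in the spirit of standard coverage mechanisms.

```python
import numpy as np

def coverage_penalty(attention_history, truncation=1.0):
    """Coverage vector = running sum of past attention, optionally truncated.

    attention_history: (timesteps, src_len) attention distributions so far.
    Returns a loss that discourages re-attending to already covered words.
    """
    coverage = np.zeros(attention_history.shape[1])
    loss = 0.0
    for attn in attention_history:
        # Penalise overlap between the current attention and what is already covered.
        loss += np.minimum(attn, coverage).sum()
        coverage = np.minimum(coverage + attn, truncation)  # truncated accumulation
    return loss

history = np.array([[0.7, 0.2, 0.1],
                    [0.6, 0.3, 0.1],   # re-attends to word 0 -> penalised
                    [0.1, 0.1, 0.8]])
print(coverage_penalty(history))
```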

4.1.2. GRU-RNN. Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. The dual encoder consists of two levels of encoders, i.e., primary and secondary encoders, in addition to one decoder, and all of them employ a GRU. The primary encoder considers coarse encoding, while the secondary encoder considers fine encoding. The primary encoder and decoder are the same as in the standard encoder-decoder model with an attention mechanism, and the secondary encoder generates a new context vector that is based on previous output and input. Moreover, an additional context vector provides meaningful information for the output. Thus, the repetition problem of the generated summary that was encountered in previous approaches is addressed. The semantic vector is generated on both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is calculated. The fixed-length output is partially generated at each stage in the decoder since it decodes in stages.

Figure 14 elaborates the DEATS process. The primary encoder produces a hidden state h^p_j for each input j and a content representation c^p. Next, the decoder decodes a fixed-length output, which is referred to as the decoder content representation c^d. The weight α_j can be calculated using the hidden states h^p_j and the content representations c^p and c^d. In this stage, the secondary encoder generates new hidden states, or semantic context vectors, h^s_m, which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator, a copy mechanism, and a coverage mechanism.
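The dual-encoding idea can be sketched as follows: importance weights computed from the primary encoder states and the two content representations are used to re-encode the input; the dot-product scoring and the simple rescaling of the primary states are illustrative assumptions rather than the exact DEATS equations.

```python
import numpy as np

def secondary_encoding(primary_states, c_p, c_d):
    """Re-weight the primary encoder states using coarse content representations.

    primary_states: (src_len, h) hidden states h^p_j from the primary encoder
    c_p:            (h,)         content representation of the primary encoder
    c_d:            (h,)         content representation decoded so far
    Returns importance weights alpha and the re-weighted (secondary) states h^s.
    """
    # Assumed importance score: how strongly each word relates to both contents.
    scores = primary_states @ (c_p + c_d)            # (src_len,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Secondary states: primary states scaled by their importance.
    secondary_states = primary_states * alpha[:, None]
    return alpha, secondary_states

rng = np.random.default_rng(2)
alpha, h_s = secondary_encoding(rng.normal(size=(5, 4)),
                                rng.normal(size=4), rng.normal(size=4))
print(alpha.round(3), h_s.shape)
```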

Wang et al. proposed a hybrid extractive-abstractive text summarisation model that is based on combining reinforcement learning with BERT word embedding [63]. In this hybrid model, a BERT feature-based strategy was used to generate contextualised token embeddings. This model consists of two submodels, an abstractive agent and an extractive agent, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where the sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. The training process consists of pretraining and full training phases.

Egonmwan et al. proposed the use of sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules: an extractive model and an abstractive model. The encoder transformer has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives a sentence-level representation.

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].


The architecture of the abstractive model consists of a single-layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder. The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.

4.1.3. Others. BERT is employed to represent the sentences of the document to express its semantics [65]. Liu et al. proposed abstractive and extractive summarisation models that are based on the encoder-decoder architecture. The encoder used a BERT pretrained document-level encoder, while the decoder utilised a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage in extractive summarisation and one stage in abstractive summarisation. At the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed since these two mechanisms need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries by using trigram blocking. The OOV words rarely appear in the generated summary.

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, both the input and output tokens applied the same embedding matrix Wemb, which was generated using the GloVe word embedding model, in the Paulus et al. model [57]. Another word embedding matrix, referred to as Wout, was applied in the token generation layer. Additionally, a sharing weighting matrix was employed by both the shared embedding matrix Wemb and the Wout matrix. The sharing weighting matrices improved the process of generating tokens since they considered the embedding syntax and semantic information.
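Sharing the input embedding matrix with the output projection (weight tying) can be sketched as follows; the transposed-matrix form shown here is the common way such sharing is implemented and is an assumption about the exact setup in [57].

```python
import numpy as np

def tied_output_distribution(dec_state, W_emb, W_proj):
    """Compute the output distribution by reusing the (shared) embedding matrix.

    W_emb:  (vocab, emb_dim) embedding matrix shared between input and output
    W_proj: (hidden, emb_dim) maps the decoder state into embedding space
    """
    projected = dec_state @ W_proj            # (emb_dim,)
    logits = W_emb @ projected                # score every vocabulary word
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(3)
W_emb = rng.normal(size=(100, 16))            # vocabulary of 100 words
dist = tied_output_distribution(rng.normal(size=32), W_emb, rng.normal(size=(32, 16)))
print(dist.shape, dist.sum())                 # (100,) ~1.0
```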

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].


The discriminator input sequence of the Liu et al. model was encoded using a maximum pooling CNN, where the result was passed to the softmax layer [58]. On the other hand, the word embedding applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embedding. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of 781 tokens paired with 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the documents were preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets: the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consisted of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve the coreference. The second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP was to perform Stanford NLP parsing, a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing was performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If the child node has children, then the child is considered a new root with children; this process continues recursively until there are no children for the root. In this case, the tree structure is completed. Accordingly, the compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are refined before training by applying simple rules. First, the phrase triples at the topmost level are exploited since they carry the most semantic information. Second, triple phrases with subject and object phrases and no nouns are deleted, since the noun contains a considerable amount of conceptual information. Triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To achieve the goal of the previous task and determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consisted of 92,000 text sources and 219,000 text sources, respectively.

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the Wang et al. proposed model, CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets in its experiments [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1 41.16, ROUGE2 15.75, and ROUGE-L 39.08 [57]. The results for the maximum likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to evaluate the quality and readability of the summary. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]. The value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics for the Al-Sabahi et al. and Li et al. models; values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluations included ROUGE1, ROUGE2, and ROUGE-L, and values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel score related to the n-gram was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model. Five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected. The evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, where the WordPiece tokeniser was utilised. The values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question answering paradigm, where 20 documents were selected for evaluation. Three values were chosen for evaluating the answers: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to single-sentence and multisentence summaries. Attention encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model consists of a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs, a GRU-RNN for the word level and a GRU-RNN for the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of the source document. Instead of considering every vocabulary word, only certain vocabularies were added, based on the frequency of the vocabulary in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. Linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrices for each tag type, similar to word embedding, while the TF-IDF feature was discretised into a fixed number of bins, where one-hot representation was employed to represent the value of the bins. The one-hot matrix consisted of the number of bin entries, where only one entry was set to one to indicate the value of the TF-IDF of a certain word. This process permitted the TF-IDF to be addressed in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8M training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and name-entity generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
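A small sketch of how word embeddings can be concatenated with lookup embeddings for POS/NER tags and a one-hot bin for the discretised TF-IDF value is shown below; the number of bins, the tag sets, and the dimensions are illustrative assumptions rather than the exact configuration of [55].

```python
import numpy as np

POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]
NER_TAGS = ["PERSON", "ORG", "LOC", "O"]

def featurised_embedding(word_vec, pos, ner, tfidf, pos_emb, ner_emb, n_bins=5):
    """Concatenate word embedding, POS/NER lookup embeddings, and a TF-IDF bin."""
    pos_vec = pos_emb[POS_TAGS.index(pos)]
    ner_vec = ner_emb[NER_TAGS.index(ner)]
    # Discretise TF-IDF (assumed to lie in [0, 1]) into a fixed number of bins.
    one_hot = np.zeros(n_bins)
    one_hot[min(int(tfidf * n_bins), n_bins - 1)] = 1.0
    return np.concatenate([word_vec, pos_vec, ner_vec, one_hot])

rng = np.random.default_rng(4)
pos_emb = rng.normal(size=(len(POS_TAGS), 3))   # small lookup tables
ner_emb = rng.normal(size=(len(NER_TAGS), 3))
vec = featurised_embedding(rng.normal(size=8), "NOUN", "PERSON", 0.62, pos_emb, ner_emb)
print(vec.shape)   # 8 + 3 + 3 + 5 = (19,)
```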

Finally, for both single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of several approaches appear in Table 2, while the training, optimisation, mechanisms, and search at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets are produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets, even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of the article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets that were generated by Hermann et al. [71]. The Hermann et al. datasets were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (on average 29.74 sentences), while the summaries have 53 words (on average 3.72 sentences) [55].

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].


Figure 16: Word embedding concatenated with discretized TF-IDF, POS, and NER one-embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based |
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | Decoder RNN, word-based
[50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM | Unidirectional RNN attentive encoder-decoder LSTM
 | | Bidirectional LSTM | Unidirectional LSTM
 | | Bidirectional LSTM | Decoder that had global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets. Nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study applied each of the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenization, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and name-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins and a one-hot representation for bins; (iv) lookup embedding for part-of-speech and name-entity tagging
[52] | Zhou et al. | PTB tokenization, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Word embedding with size equal to 300
[53] | Cao et al. | Normalization and tokenization, using "#" to replace digits, converting the words to lower case, and "UNK" to replace the least frequent words | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lower case letters | GloVe word embedding
[29] | Lopyrev | Tokenization, converting the articles and their headlines to lower case letters, using the symbol ⟨unk⟩ to replace rare words | The input was represented using the distributed representation
[38] | Jobson et al. | | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | | CNN maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving the coreference and reducing the morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | | The word embedding is learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | | The word embedding is learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | | BERT
[63] | Wang et al. | Using the WordPiece tokenizer | BERT
[64] | Egonmwan et al. | | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers listed in this work.

6. Evaluation Measures

The package ROUGE is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures to evaluate the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise negative log-likelihood | | Beam search
[39] | Chopra et al. | Minimizing negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimize the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimizing the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimizing the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism |
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanism | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism |
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL |
[30] | Song et al. | | Attention mechanism, copy mechanism |
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search at decoding during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\[
\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)} \qquad (1)
\]

where S is the reference summary, n is the n-gram length, and Count_match(gram_n) is the maximum number of matching n-gram words between the reference summary and the generated summary. Count(gram_n) is the total number of n-gram words in the reference summary [73].
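For concreteness, a minimal implementation of the n-gram recall defined in equation (1) is given below for a single reference summary; clipping the candidate counts by the reference counts follows the standard ROUGE-N definition.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(reference, candidate, n=1):
    """ROUGE-N recall: matching n-grams / n-grams in the reference."""
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    # Each reference n-gram can be matched at most as many times as it occurs.
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

reference = "ahmed ate the apple"
candidate = "the apple was eaten by ahmed"
print(rouge_n(reference, candidate, n=1))   # 0.75 (3 of 4 unigrams matched)
print(rouge_n(reference, candidate, n=2))   # bigram recall
```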

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of common matching words between the reference summary and the generated summary. The LCS calculation does not require the matching words to be consecutive; however, the order of occurrence is important. In addition, no predefined number of matching words is required. LCS considers only the main in-sequence match, which is one of its disadvantages, since the final score will not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets [61].

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarization evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | -
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL anthology reference | ACL anthology reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: the apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" but not both, similar to LCS.
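A minimal sketch of ROUGE-L recall based on the word-level LCS is shown below; using recall only (LCS length divided by reference length) is a simplification of the full precision/recall/F-measure definition.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j],
                                                                     table[i][j - 1])
    return table[-1][-1]

def rouge_l_recall(reference, candidate):
    ref, cand = reference.lower().split(), candidate.lower().split()
    return lcs_length(ref, cand) / len(ref)

# The example from the text: only one of "Ahmed ate" / "the apple" is counted.
print(rouge_l_recall("Ahmed ate the apple", "the apple Ahmed ate"))  # 0.5
```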

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, Perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]. The models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with the pretrained encoder model, at 43.85, 20.34, and 39.9, respectively [65]. Even though ROUGE was employed to evaluate abstractive summarisation, it is better to devise new methods to evaluate the quality of summarisation. The new evaluation metrics must consider novel words and semantics since the generated summary contains words that do not exist in the original text. However, ROUGE is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups. The first group considered single-sentence summary approaches, while the second group considered multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54]: 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values for abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model: 43.85, 20.34, and 39.9, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are the models that employed BERT word embedding and were based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset, as Gigaword is utilised for single-sentence summaries since it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries. Thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to evaluate the readability of the generated summaries of 50 test examples from 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summary [60]. Five human evaluators evaluated the relevance and readability of 100 randomly selected test examples, where two values are utilised: 1 and 10. A value of 1 indicates that the generated summary is less readable and less relevant, while a value of 10 indicates that the generated summary is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are close, with values of 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retention of key information, which was achieved by human evaluation [65]. In addition, the qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were selected for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summary of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are not enough due to the small number of testing examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder will be limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either a gold token is used, as during training, or the output of the previous step is used, as during both testing and training. In this manner, at least the training step receives the same input as testing. In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76]. Moreover, the mass convolution of the QRNN is applied in [50] since the dependency of words generated in the future is difficult to determine.
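The coin-flip idea behind DaD and the partial teacher forcing used in [29] can be sketched as a scheduled-sampling loop; the 10% sampling probability mirrors the text, while the toy decoder and vocabulary are placeholders.

```python
import random

def decode_with_scheduled_sampling(step_fn, gold_tokens, sample_prob=0.1, seed=0):
    """Run a decoder where, with probability sample_prob, the previously
    *generated* token is fed back instead of the gold (reference) token.

    step_fn(prev_token) -> generated token for this step (toy decoder).
    """
    random.seed(seed)
    prev, outputs = "<EOS>", []          # first decoder input is the start/<EOS> token
    for gold in gold_tokens:
        generated = step_fn(prev)
        outputs.append(generated)
        use_generated = random.random() < sample_prob
        prev = generated if use_generated else gold   # coin flip: model output vs gold
    return outputs

# Toy decoder that just echoes its input with a marker.
toy_step = lambda prev: prev + "*"
print(decode_with_scheduled_sampling(toy_step, ["the", "apple", "was", "eaten"]))
```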

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source to copy it to the memory. When the switch is turned on, the decoder generates a word from the target vocabularies. Conversely, researchers in [56] addressed OOV words via the generation probability Pgen, whose value is calculated from the context vector and decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them. The combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the input sequence position of unseen tokens in order to copy them.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | -
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset.


Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. In addition, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
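The Pgen switch of the pointer-generator can be sketched as mixing two distributions; the sigmoid over a concatenation of the context vector, decoder state, and decoder input is the usual formulation and is assumed here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pointer_generator_dist(p_vocab, attn_dist, src_ids, context, dec_state, dec_input, w):
    """Final distribution over the extended vocabulary.

    p_vocab:   (extended_vocab,) generation distribution (zeros for OOV slots)
    attn_dist: (src_len,)        attention over source positions
    src_ids:   (src_len,)        extended-vocabulary id of each source word
    w:         weight vector for computing p_gen (assumed linear form)
    """
    features = np.concatenate([context, dec_state, dec_input])
    p_gen = sigmoid(w @ features)                 # switch between generate and copy
    final = p_gen * p_vocab
    for pos, word_id in enumerate(src_ids):       # copy probability mass from attention
        final[word_id] += (1.0 - p_gen) * attn_dist[pos]
    return final

rng = np.random.default_rng(5)
p_vocab = np.zeros(12); p_vocab[:10] = rng.dirichlet(np.ones(10))  # ids 10, 11 are OOV copies
attn = rng.dirichlet(np.ones(4))
final = pointer_generator_dist(p_vocab, attn, np.array([3, 10, 11, 5]),
                               rng.normal(size=4), rng.normal(size=4),
                               rng.normal(size=4), rng.normal(size=12))
print(final.sum())   # ~1.0, OOV ids 10 and 11 now have probability mass
```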

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the generated output summary are two challenges that must be considered. Both challenges are due to the summarisation of long documents and the production of long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create the coverage vector by aggregating the attention over all previous timesteps. In [57], repetition was addressed by using the key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and attending to the same sequence of the input at a different step of the decoder. However, the intra-attention encoder mechanism cannot address all the repetition challenges, especially when a long sequence is generated. Thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more previously generated words. Moreover, the proposed intradecoder attention mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy loss, maximum likelihood, and gradient reinforcement learning to minimise exposure bias. In addition, the trigram probability p(yt) was proposed to address repetition in the generated summary, where yt is the trigram sequence. In this case, the value of p(yt) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated using a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.
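Trigram blocking during beam search can be sketched as the simple check below: a candidate next word is rejected if it would create a trigram that already appears in the partial summary; the tokenised-list interface is an illustrative assumption.

```python
def creates_repeated_trigram(partial_summary, next_word):
    """Return True if appending next_word would repeat an existing trigram."""
    tokens = partial_summary + [next_word]
    if len(tokens) < 3:
        return False
    new_trigram = tuple(tokens[-3:])
    existing = {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 3)}
    return new_trigram in existing

summary = "the cat sat on the mat and the cat sat".split()
print(creates_repeated_trigram(summary, "on"))    # True: "the cat sat on" -> blocked
print(creates_repeated_trigram(summary, "down"))  # False: new trigram, allowed
```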

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary was conditioned on both the input text and the description of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary consists of the highlights of the news article. Every highlight represents a sentence in the summary; therefore, the number of sentences in the summary is equal to the number of highlights.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


Sometimes the highlights do not address all crucial points in the summary. Therefore, a high-quality dataset requires considerable effort to become available. Moreover, in some languages, such as Arabic, multisentence datasets for abstractive summarisation are not available. A single-sentence abstractive Arabic text summarisation dataset is available but is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough, as ROUGE depends on exact matching between words. For example, the words book and books are considered different under any of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words (words that have the same meaning must be considered the same even if they have a different surface form). In this case, we propose the use of METEOR, which has recently been used in evaluating machine translation and automatic summarisation models [77]. Moreover, METEOR considers stemming, morphological variants, and synonyms. In addition, in flexible-order languages, it is better to use ROUGE without caring about the order of the words.

The quality of the generated summary can be improved using linguistic features. For example, we propose the use of dependency parsing at the encoder in a separate layer on top of the first hidden state layer.

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


We also propose the use of word embeddings that are built by considering dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of the words and their surrounding words.

Based on the new trends and evaluation results, we think that the most promising direction among all the features is the use of the BERT pretrained model. The quality of the models that are based on the transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that applied deep learning for abstractive text summarisation, the datasets, and the measures used for the evaluation of these approaches. Moreover, the challenges encountered when employing various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times. Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively. The best results were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, there are several issues that must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.

[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.

[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.

[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.

[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.

[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.

[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.

[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.

[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.

[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.

[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, 2017.

[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.

[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.

[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.

[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.

[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, 2018.

[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.

[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.

[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, http://arxiv.org/abs/1812.02303, 2020.

[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.

[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.

[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.

[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.

[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

[29] K. Lopyrev, Generating News Headlines with Recurrent Neural Networks, p. 9, 2015, https://arxiv.org/abs/1512.01712.

[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.

[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.

[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.

[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.

[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.

[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.

[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.

[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.

[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.

[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.

[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.

[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.

[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.

[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, Quasi-Recurrent Neural Networks, https://arxiv.org/abs/1611.01576, 2015.

[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.

[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.

[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.

[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.

[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.

[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.

[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.

[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.

[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.

[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.

[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.

[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.

[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.

[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.

[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.

[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation, text summarization branches out," Proceedings of the ACL-04 Workshop, vol. 8, 2004.

[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.

[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, New Orleans, LA, USA, pp. 708–719, June 2018.

[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.

[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.

[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.

[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.

[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.



fragmentation, as shown in Figure 5(b). In both figures, the solid lines indicate the part of the hidden state of the last layer that is employed to compute the context vector, while the dashed lines indicate the part of the hidden state of the last layer that is applied to compute the attention weight vector. The same difference was seen on the decoder side: in the simple attention mechanism, the last layer was divided into two parts (one part was passed to the softmax layer and the other part was applied to calculate the attention weight), while in the complex attention mechanism no such division was made. A beam search was performed at the decoder side during testing to extend the most probable output sequences.
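
Since many of the reviewed decoders rely on beam search at test time, the following minimal sketch shows the core bookkeeping: at each step every partial summary is extended with every candidate word, and only the k highest-scoring extensions survive. The scoring function `log_prob_next` is a placeholder for a trained decoder and is an assumption, not part of any cited model.

```python
def beam_search(log_prob_next, start_token, end_token, beam_width=4, max_len=30):
    """Generic beam search over a token-level decoder.

    log_prob_next(prefix) must return a dict {token: log_probability} for the
    next position given the prefix; it stands in for a trained decoder here.
    """
    beams = [([start_token], 0.0)]          # (token sequence, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in log_prob_next(seq).items():
                candidates.append((seq + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:   # keep the k best extensions
            (completed if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    completed.extend(beams)
    # Length normalisation avoids favouring overly short summaries
    return max(completed, key=lambda c: c[1] / len(c[0]))[0]
```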

Figure 4: Taxonomy of several approaches that use a recurrent neural network and attention mechanism in abstractive text summarisation, based on the summary type. Single-sentence summary approaches: a neural attention model for abstractive sentence summarization (ABS); abstractive sentence summarization with attentive recurrent neural networks (RAS); quasi-recurrent neural network (QRNN) + CNN; generating news headlines with recurrent neural networks; abstractive text summarization using attentive sequence-to-sequence RNNs; selective encoding for abstractive sentence summarization (SEASS); faithful to the original fact aware neural abstractive summarization (FTSumg); and improving transformer with sequential context representations for abstractive text summarization (RCT). Multisentence summary approaches: get to the point, summarization with pointer-generator networks; reinforcement learning (RL); generative adversarial network for abstractive text summarization; exploring semantic phrases (ATSDL) + CNN; bidirectional attentional encoder-decoder model and bidirectional beam search; key information guide network; improving abstraction in text summarization; dual encoding for abstractive text summarization (DEATS); bidirectional decoder (BiSum); text summarization with pretrained encoders; a text abstraction summary model based on BERT word embedding and reinforcement learning; transformer-based model for single-document neural summarization; and text summarization method based on double attention pointer network (DAPT).


The encoder-decoder RNN and sequence-to-sequence models were utilised in [55], which mapped the inputs to the target sequences; the same approach was also employed in [38, 51]. Three different methods for global attention were proposed for calculating the scoring functions, including dot product scoring, the bilinear form, and the scalar value calculated from the projection of the hidden states of the RNN encoder [38]. The model applied LSTM cells instead of GRU cells (both LSTM and GRU are commonly employed for abstractive summarisation tasks, since LSTM has a memory unit that provides control, but the computation time of GRU is lower). Three models were employed: the first model applied unidirectional LSTM in both the encoder and the decoder, the second model was implemented using bidirectional LSTM in the encoder and unidirectional LSTM in the decoder, and the third model utilised a bidirectional LSTM encoder and an LSTM decoder with global attention. The first hidden state of the decoder is the concatenation of all backward and forward hidden states of the encoder. The use of attention in an encoder-decoder neural network generates a context vector at each timestep. For the local attention mechanism, the context vector is conditioned on a subset of the encoder's hidden states, while for the global attention mechanism, the vector is conditioned on all the encoder's hidden states. After generating the first decoder output, the next decoder input is the word embedding of the output of the previous decoder step. An affine transformation is used to convert the output of the decoder LSTM to a dense vector prediction, since setting the number of hidden states equal to the number of words in the vocabulary would require a long training time.
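
The three global-attention scoring functions mentioned above can be written compactly; the sketch below follows the common Luong-style formulation (dot product, bilinear/general, and a learned scalar projection of the concatenated states), and the tensor shapes and weight names are illustrative assumptions.

```python
import torch
import torch.nn as nn

hid = 256                                            # illustrative hidden size
W_bilinear = nn.Linear(hid, hid, bias=False)         # for the bilinear (general) score
v = nn.Linear(2 * hid, 1, bias=False)                # for the scalar projection score

def score(decoder_state, encoder_states, method="dot"):
    """decoder_state: (hid,); encoder_states: (src_len, hid) -> (src_len,) scores."""
    if method == "dot":
        return encoder_states @ decoder_state
    if method == "bilinear":
        return encoder_states @ W_bilinear(decoder_state)
    if method == "concat":
        expanded = decoder_state.expand(encoder_states.size(0), -1)
        return v(torch.cat([expanded, encoder_states], dim=-1)).squeeze(-1)
    raise ValueError(method)

# Attention weights and context vector for one decoding step
enc = torch.randn(7, hid)        # 7 source positions
dec = torch.randn(hid)
weights = torch.softmax(score(dec, enc, "bilinear"), dim=0)   # (7,)
context = weights @ enc                                       # (hid,)
```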

Khandelwal [51] employed a sequence-to-sequence model that consists of an LSTM encoder and LSTM decoder for abstractive summarisation of small datasets. The decoder generated the output summary after reading the hidden representations generated by the encoder and passing them to the softmax layer. The sequence-to-sequence model does not memorize information, so generalization of the model is not possible. Thus, the proposed model utilised imitation learning to determine whether to choose the golden token (i.e., the reference summary token) or the previously generated output at each step.

(2) GRU-RNN. A combination of the elements of the RNN and convolutional neural network (CNN) was employed in an encoder-decoder model that is referred to as a quasi-recurrent neural network (QRNN) [50]. In the QRNN, the GRU was utilised in addition to the attention mechanism. The QRNN was applied to address the limitation of parallelisation; it obtains the dependencies of the words in previous steps via convolution and "fo-pooling", which are performed in parallel, as shown in Figure 6. The convolution in the QRNN can be either mass convolution (considering previous timesteps only) or centre convolution (considering future timesteps). The encoder-decoder model employed two neural networks: the first network applied the centre convolution of the QRNN and consisted of multiple hidden layers that were fed by the vector representation of the words, and the second network comprised neural attention and considered as input the encoder hidden layers to generate one word of a headline. The decoder accepted the previously generated headline word and produced the next word of the headline; this process continued until the headline was completed.
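
A minimal sketch of the fo-pooling recurrence used by the QRNN: the candidate update z and the gates f and o are produced in parallel by masked convolutions over the input, and only the lightweight element-wise recurrence below runs sequentially. Shapes and names are illustrative assumptions.

```python
import numpy as np

def fo_pool(z, f, o):
    """fo-pooling: c_t = f_t * c_{t-1} + (1 - f_t) * z_t,  h_t = o_t * c_t.

    z, f, o: arrays of shape (seq_len, hidden); z is the candidate update,
    f and o are forget and output gates already squashed into (0, 1).
    """
    seq_len, hidden = z.shape
    c = np.zeros(hidden)
    h = np.zeros((seq_len, hidden))
    for t in range(seq_len):
        c = f[t] * c + (1.0 - f[t]) * z[t]   # element-wise, no matrix multiply
        h[t] = o[t] * c
    return h

# The gates would come from convolutions over word embeddings; random stand-ins here
T, H = 6, 8
z = np.tanh(np.random.randn(T, H))
f = 1 / (1 + np.exp(-np.random.randn(T, H)))   # sigmoid
o = 1 / (1 + np.exp(-np.random.randn(T, H)))
print(fo_pool(z, f, o).shape)   # (6, 8)
```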

SEASS is an extension of the sequence-to-sequence recurrent neural network that was proposed in [52]. The selective encoding for abstractive sentence summarisation (SEASS) approach includes a selective encoding model that consists of an encoder for sentences, a selective gate network, and a decoder with an attention mechanism, as shown in Figure 7. The encoder uses a bidirectional GRU, while the decoder uses a unidirectional GRU with an attention mechanism. The encoder reads the input words and builds their representations. The meaning of the sentence is applied by the selective gate to filter the word representations before they are used to generate the summary. To produce an excellent summary and accelerate the decoding process, a beam search was selected for the decoder.
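
The selective gate can be sketched as follows: a sentence vector s (built from the final forward and backward encoder states) is combined with each word-level hidden state to produce a gate that filters that state before the decoder attends over it. Weight names and sizes below are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selective_gate(H, s, Wg, Ug, b):
    """H: (src_len, 2*hid) bi-GRU states; s: (2*hid,) sentence vector.

    Returns the filtered states h'_i = sigmoid(Wg h_i + Ug s + b) * h_i,
    which the attentive decoder then attends over.
    """
    gate = sigmoid(H @ Wg.T + s @ Ug.T + b)   # (src_len, 2*hid), values in (0, 1)
    return gate * H

# Illustrative sizes: 5 source words, hidden size 4 per direction
hid2 = 8
H = np.random.randn(5, hid2)
s = np.concatenate([H[-1, :4], H[0, 4:]])     # last forward + first backward state
Wg = np.random.randn(hid2, hid2)
Ug = np.random.randn(hid2, hid2)
b = np.zeros(hid2)
print(selective_gate(H, s, Wg, Ug, b).shape)  # (5, 8)
```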

On the other hand, dual attention was applied in [53]. The proposed dual attention approach consists of three modules: two bidirectional GRU encoders and one dual attention decoder. The decoder has a gate network for context selection, as shown in Figure 8, and employs copying and coverage mechanisms. The outputs of the encoders are two context vectors: one context vector for the sentences and one context vector for the relations.

Figure 5: (a) Simple attention and (b) complex attention [29].


Figure 6: Comparison of the CNN, LSTM, and QRNN models [50].

Figure 7: Selective encoding for abstractive sentence summarisation (SEASS) [52].

Figure 8: Faithful to the original [53].


A relation may be a triple or a tuple relation. A triple relation consists of the subject, predicate, and object, while a tuple relation consists of either (subject and predicate) or (predicate and subject). Sometimes the triple relation cannot be extracted; in this case, two tuple relations are utilised. The decoder gate merges both context vectors based on their relative association.

(3) Others. The poor semantic representation of long sequences in abstractive text summarisation approaches that are based on an RNN encoder-decoder framework was addressed using the RC-Transformer (RCT) [54]. An RCT is an RNN-based abstractive text summarisation model that is composed of two encoders (an RC encoder and a transformer encoder) and one decoder. The transformer has an advantage in parallel computing in addition to retrieving the global context semantic relationships. On the other hand, sequential context representation was achieved by the second encoder of the RC-Transformer. Word ordering is very crucial for abstractive text summarisation and cannot be captured by positional encoding alone; therefore, the RCT utilised two encoders to address the shortage of sequential information at the word level. A beam search was utilised at the decoder. Furthermore, Cai et al. compared the speed of the RCT model and that of the RNN-based model and concluded that the RCT is 1.4x and 1.2x faster.

3.2. Word Embedding. In the QRNN model, GloVe word embedding, which was pretrained using the Wikipedia and Gigaword datasets, was used to represent the text and summary [50]. In the first model proposed by Jobson et al., the word embedding was randomly initialised and updated during training, while GloVe word embedding was employed to represent the words in the second and third models [38]. In the study by Cai et al., the Transformer was utilised [54].

3.3. Dataset and Dataset Preprocessing. In the model that was proposed by Rush et al., the datasets were preprocessed via PTB tokenization by using "#" to replace all digits, converting all letters to lowercase, and using "UNK" to replace words that occurred fewer than 5 times [18]. The model was trained with any input-output pairs due to the shortage of constraints for generating the output. The training process was carried out on the Gigaword datasets, while the summarisation evaluation was conducted on DUC2003 and DUC2004 [18]. Furthermore, the proposed model by Chopra et al. was trained using the Gigaword corpus with sentence separation and tokenisation [39]. To form sentence-summary pairs, each headline of an article was paired with the first sentence of the article. The same data preprocessing steps of [18] were performed in [39]. Moreover, the Chopra et al. model was evaluated using the DUC2004 dataset, which consists of 500 pairs.
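
The preprocessing rules described for [18] (lowercasing, replacing digits, and mapping rare words to "UNK") can be sketched as follows; the frequency threshold and replacement tokens follow the description above, and everything else (corpus, function name) is an illustrative assumption.

```python
import re
from collections import Counter

def preprocess(sentences, min_count=5):
    """Lowercase, replace digits with '#', and map rare words to 'UNK'."""
    tokenised = [re.sub(r"\d", "#", s.lower()).split() for s in sentences]
    counts = Counter(tok for sent in tokenised for tok in sent)
    return [[tok if counts[tok] >= min_count else "UNK" for tok in sent]
            for sent in tokenised]

corpus = ["US stocks fell 3 percent on Monday", "Stocks in the US rose 5 percent"]
print(preprocess(corpus, min_count=2))
# words seen fewer than 2 times become 'UNK'; digits become '#'
```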

Gigaword datasets were also employed by the QRNN model [50]. Furthermore, articles that started with sentences that contained more than 50 words or headlines with more than 25 words were removed. Moreover, the words in the articles and their headlines were converted to lowercase, and the data points were split into short, medium, and long sentences based on the lengths of the sentences to avoid extra padding.

Lopyrev and Jobson et al. trained their models using Gigaword after processing the data. In the Lopyrev model, the most crucial preprocessing steps for both the text and the headline were tokenisation and character conversion to lowercase [29]. In addition, only the characters of the first paragraph were retained, and the length of the headline was fixed between 25 and 50 words. Moreover, the no-headline articles were disregarded, and the ⟨unk⟩ symbol was used to replace rare words.

Khandelwal employed the Association for Computational Linguistics (ACL) Anthology Reference Corpus, which consists of 16,845 examples for training and 500 examples for testing; these were considered small datasets in the experiments [51]. The abstract included the first three sentences, and the unigram overlap between the title and the abstract was also calculated. There were 25 tokens in the summary and a maximum of 250 tokens in the input text.

The English Gigaword dataset, DUC2004 corpus, and MSR-ATC were selected to train and test the SEASS model [52]. Moreover, the experiments of the Cao et al. model were conducted using the Gigaword dataset [53]. The same data preprocessing steps of [18] were performed in [52, 53]. Moreover, the RCT also employed the Gigaword and DUC2004 datasets in experiments [54].

3.4. Evaluation and Results. Recall-Oriented Understudy for Gisting Evaluation 1 (ROUGE1), ROUGE2, and ROUGE-L were utilised to evaluate the Rush et al. model, and values of 28.18, 8.49, and 23.81, respectively, were obtained [18]. The experimental results of the Chopra et al. model showed that, although DUC2004 was too complex for the experiments on the Gigaword corpus, the proposed model outperformed state-of-the-art methods in terms of ROUGE1, ROUGE2, and ROUGE-L [39]. The values of ROUGE1, ROUGE2, and ROUGE-L were 28.97, 8.26, and 24.06, respectively. On the other hand, BLEU was employed to evaluate the Lopyrev model [29], while Khandelwal utilised perplexity [51]. The SEASS model was evaluated using ROUGE1, ROUGE2, and ROUGE-L, and the results of the three measures were 36.15, 17.54, and 33.63, respectively [52]. Moreover, ROUGE1, ROUGE2, and ROUGE-L were selected for evaluating the Cao et al. model [53]. The values of ROUGE1, ROUGE2, and ROUGE-L were 37.27, 17.65, and 34.24, respectively, and the results showed that fake summaries were reduced by 80%. In addition, the RCT was evaluated using ROUGE1, ROUGE2, and ROUGE-L, with values of 37.27, 18.19, and 34.62 on the Gigaword dataset. The results showed that the RCT model outperformed other models by generating a high-quality summary that contains salient information [54].
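
For readers who want to reproduce this kind of evaluation, the sketch below assumes the open-source rouge-score Python package (pip install rouge-score); the surveyed papers use their own evaluation scripts, and the example texts here are invented for illustration only.

```python
# Assumes the open-source `rouge-score` package (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "police arrest five anti-nuclear protesters after group breaks into plant"
generated = "police arrest five protesters at nuclear plant"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.3f} recall={s.recall:.3f} f1={s.fmeasure:.3f}")
```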


4. Multisentence Summary

In this section, multisentence summary and deep learning-based abstractive text summarisation are discussed. Multisentence summary methods include the get to the point method (summarisation with pointer-generator networks) [56], a deep reinforced model for abstractive summarization (RL) [57], the generative adversarial network for abstractive text summarization [58], semantic phrase exploration (ATSDL) [30], the bidirectional attention encoder-decoder and bidirectional beam search [35], the key information guide network [59], text summarisation abstraction improvement [60], dual encoding for abstractive text summarisation (DEATS) [61], abstractive document summarisation via bidirectional decoder (BiSum) [62], the text abstraction summary model based on BERT word embedding and RL [63], the transformer-based model for single-document neural summarisation [64], text summarisation with pretrained encoders [65], and the text summarisation method based on the double attention pointer network [49]. The pointer-generator approach [55] covers both single-sentence and multisentence summaries. Additional details are presented in the following sections.

4.1. Abstractive Summarization Architecture

4.1.1. LSTM RNN. A novel abstractive summarisation method was proposed in [56]; it generated a multisentence summary and addressed sentence repetition and inaccurate information. See et al. proposed a model that consists of a single-layer bidirectional LSTM encoder, a single-layer unidirectional LSTM decoder, and the sequence-to-sequence attention model proposed by [55]. The See et al. model generates a long text summary instead of headlines, which consist of one or two sentences. Moreover, the attention mechanism was employed, and the attention distribution facilitated the production of the next word in the summary by telling the decoder where to search in the source words, as shown in Figure 9. This mechanism constructed the weighted sum of the hidden states of the encoder, which facilitated the generation of the context vector, where the context vector is the fixed-size representation of the input. The probability (Pvocab) produced by the decoder was employed to generate the final prediction using the context vector and the decoder's last step. Furthermore, the value of Pvocab was equal to zero for OOV words. RL was employed for abstractive text summarisation in [57]. The proposed method in [57], which combined RL with supervised word prediction, was composed of a bidirectional LSTM-RNN encoder and a single LSTM decoder.
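
The attention distribution, context vector, and final word distribution described above can be written as a small sketch; it follows the standard pointer-generator formulation of [56], with all weight shapes and names as illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_generator_step(h_enc, s_dec, x_dec, params, src_ids):
    """One decoding step of a pointer-generator-style model (illustrative shapes).

    h_enc: (src_len, hid) encoder states; s_dec: (hid,) decoder state;
    x_dec: (emb,) decoder input embedding; src_ids: source token ids (src_len,).
    """
    Wh, Ws, v, Wv, wc, ws, wx = params
    # Additive attention over source positions and the resulting context vector
    attn = softmax(np.tanh(h_enc @ Wh.T + s_dec @ Ws.T) @ v)     # (src_len,)
    context = attn @ h_enc                                       # (hid,)
    # Vocabulary distribution from the decoder state and the context vector
    p_vocab = softmax(np.concatenate([s_dec, context]) @ Wv.T)   # (vocab,)
    # Soft switch between generating from the vocabulary and copying a source word
    p_gen = 1.0 / (1.0 + np.exp(-(wc @ context + ws @ s_dec + wx @ x_dec)))
    p_final = p_gen * p_vocab
    np.add.at(p_final, src_ids, (1.0 - p_gen) * attn)            # add copy probabilities
    return p_final

hid, emb_dim, vocab, src_len = 8, 6, 50, 5
params = (np.random.randn(hid, hid), np.random.randn(hid, hid), np.random.randn(hid),
          np.random.randn(vocab, 2 * hid), np.random.randn(hid), np.random.randn(hid),
          np.random.randn(emb_dim))
dist = pointer_generator_step(np.random.randn(src_len, hid), np.random.randn(hid),
                              np.random.randn(emb_dim), params,
                              np.array([4, 7, 7, 12, 30]))
print(round(dist.sum(), 6))   # 1.0
```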

Two models, a generative model and a discriminative model, were trained simultaneously to generate abstractive summary text using the adversarial process [58]. The maximum likelihood estimation (MLE) objective function employed in previous sequence-to-sequence models suffers from two problems: the difference between the training loss and the evaluation metric, and the unavailability of a golden token at testing time, which causes errors to accumulate during testing. To address the previous problems, the proposed approach exploited the adversarial framework. In the first step of the adversarial framework, reinforcement learning was employed to optimize the generator, which generates the summary from the original text. In the second step, the discriminator, which acts as a binary classifier, classified the summary as either a ground-truth summary or a machine-generated summary. The bidirectional LSTM encoder and attention mechanism were employed as shown in [56].

Abstractive text summarisation using the LSTM-CNN model based on exploring semantic phrases (ATSDL) was proposed in [30]. ATSDL is composed of two phases: the first phase extracts the phrases from the sentences, while the second phase learns the collocation of the extracted phrases using the LSTM model. To generate sentences that are general and natural, the input and output of the ATSDL model were phrases instead of words, and the phrases were divided into three main types, i.e., subject, relation, and object phrases, where the relation phrase represents the relation between the input phrase and the output phrase. The phrase was represented using a CNN layer. There are two main reasons for choosing the CNN: first, the CNN is efficient for sentence-level applications, and second, training is efficient since long-term dependency is unnecessary. Furthermore, to obtain several vectors for a phrase, multiple kernels with different widths that represent the dimensionality of the features were utilised. Within each kernel, the maximum feature was selected for each row in the kernel via maximum pooling. The resulting values were added to obtain the final value for each word in a phrase. Bidirectional LSTM was employed instead of a GRU on the encoder side, since parameters are easy to tune with LSTM. Moreover, the decoder was divided into two modes: a generate mode and a copy mode. The generate mode generated the next phrase in the summary based on previously generated phrases and the hidden layers of the input on the encoder side, while the copy mode copied the phrase after the current input phrase if the currently generated phrase was not suitable for the previously generated phrases in the summary. Figure 10 provides additional details.
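
As a rough illustration of the convolutional phrase encoding idea (not the exact ATSDL wiring), the sketch below applies filters of several widths to a phrase matrix, keeps the strongest response per dimension via max pooling, and sums the results into a single phrase vector; filter values, widths, and dimensions are stand-in assumptions.

```python
import numpy as np

def phrase_vector(emb, kernel_widths=(2, 3)):
    """Encode a phrase matrix (n_words, dim) with convolutions of several widths.

    For each width, every window of words is scored by a filter, max-over-time
    pooling keeps the strongest response per dimension, and the pooled vectors
    of all widths are summed into one phrase vector (random stand-in filters).
    """
    n, d = emb.shape
    pooled = np.zeros(d)
    for w in kernel_widths:
        filt = np.random.randn(w, d) * 0.1
        maps = np.array([(emb[i:i + w] * filt).sum(axis=0)   # one (d,) map per window
                         for i in range(n - w + 1)])
        pooled += maps.max(axis=0)                           # max pooling over windows
    return pooled

print(phrase_vector(np.random.randn(5, 16)).shape)   # (16,)
```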

Bidirectional encoder and decoder LSTM-RNNs were employed to generate abstractive multisentence summaries [35]. The proposed approach considered past and future context on the decoder side when making a prediction, as it employed a bidirectional RNN. Using a bidirectional RNN on the decoder side addressed the problem of summary imbalance. An unbalanced summary could occur due to noise in a previous prediction, which will reduce the quality of all subsequent summaries. The bidirectional decoder consists of two LSTMs: the forward decoder and the backward decoder. The forward decoder decodes the information from left to right, while the backward decoder decodes the information from right to left. The last hidden state of the forward decoder is fed as the initial input to the backward decoder, and vice versa. Moreover, the researchers proposed a bidirectional beam-search method that generates summaries from the proposed bidirectional model. Bidirectional beam search combined information from the past and the future to produce a better summary.


Therefore, the output summary was balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence was read in reverse order, based on the conclusion that LSTM learns better when reading the source in reverse order while remembering the order of the target [66, 67]. A softmax layer was employed on the decoder side to obtain the probability of each target word in the summary over the vocabulary distribution by taking the output of the decoder as input for the softmax layer. The decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. The objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token. However, during testing, the input of the forward decoder is the token generated in the previous step. The same situation is true for the backward decoder, where the input during training is the future token from the summary. Nevertheless, the bidirectional decoder has difficulty during testing, since the complete summary must be known in advance; thus, the full backward decoder output was generated and fed to the forward decoder using a unidirectional backward beam search.

A combination of abstractive and extractive methods was employed in the guiding generation model proposed by [59]. The extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent the key information. Furthermore, to predict the final summary with long-term value, the proposed method applied a prediction guide mechanism [68].

Figure 9: Baseline sequence-to-sequence model with attention mechanism [56].

Figure 10: Semantic-unit-based LSTM model [30].


A prediction guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. The encoder-decoder architecture baseline of the proposed model is similar to that proposed by Nallapati et al. [55], where both a bidirectional LSTM encoder and a unidirectional LSTM decoder were employed. Both models applied the attention mechanism and softmax layer. Moreover, the process of generating the summary was improved by proposing KIGN, which considers as input the keywords extracted using the TextRank algorithm. In KIGN, key information is represented by concatenating the last forward hidden state and the first backward hidden state. KIGN employs the attention mechanism and pointer mechanism. In general, the attention mechanism hardly identifies the keywords; thus, to identify keywords, the output of KIGN is fed to the attention mechanism. As a result, the attention mechanism will be highly affected by the keywords. However, to enable the pointer network to identify the keywords, which are the output of KIGN, the encoder context vector and the hidden state of the decoder are fed to the pointer network, and the output is employed to calculate the soft switch. The soft switch determines whether to copy the target from the original text or generate it from the vocabulary of the target, as shown in Figure 11.

The level of abstraction in the generated summary of the abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and the use of a novel metric for optimising the overlap between the n-gram summary and the ground-truth summary. The decoder was decomposed into a contextual network and a pretrained language model, as shown in Figure 12. The contextual network applies the source document to extract the relevant parts, and the pretrained language model is generated via prior knowledge. This decomposition method facilitates the addition of an external pretrained language model that is related to several domains. Furthermore, a novel metric was employed to generate an abstractive summary by including words that are not in the source document. Bidirectional LSTM was utilised in the encoder, and the decoder applied a 3-layer unidirectional weight-dropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applied the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combined reinforcement learning and maximum likelihood.
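
The temporal (intra-temporal) attention mentioned above can be sketched as follows: raw attention scores are divided by the sum of their own exponentiated values from previous decoding steps, which penalises source positions that have already received attention and thus discourages repetition. This follows the commonly used formulation from [57]; class and variable names are illustrative, and numerical-stability tricks are omitted.

```python
import numpy as np

class TemporalAttention:
    """Intra-temporal attention: scores are divided by their accumulated
    exponentiated values from previous decoding steps, penalising source
    positions that were already attended to (which discourages repetition)."""

    def __init__(self):
        self.acc = None                      # running sum of exp(scores)

    def step(self, scores):
        """scores: (src_len,) raw alignment scores at the current decoder step."""
        exp_scores = np.exp(scores)
        temporal = exp_scores if self.acc is None else exp_scores / self.acc
        self.acc = exp_scores if self.acc is None else self.acc + exp_scores
        return temporal / temporal.sum()     # normalised attention distribution

attn = TemporalAttention()
for _ in range(3):
    print(attn.step(np.random.randn(5)).round(3))
```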

A bidirectional decoder with a sequence-to-sequence architecture, which is referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing since the input of the decoder is the previously generated summary word, and if one of the generated summary words is incorrect, then the error will propagate through all subsequent summary words. In the bidirectional decoder, there are two decoders: a forward decoder and a backward decoder. The forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left. The forward decoder considers a reference from the backward decoder. However, there is only a single-layer encoder. The encoder and decoders employ LSTM units, but while the encoder utilises bidirectional LSTM, the decoders use unidirectional LSTM, as shown in Figure 13. To understand the summary generated by the backward decoder, the attention mechanism is applied in both the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, an attention mechanism is employed in both decoders.

A double attention pointer network, which is referred to as DAPT, was applied to generate an abstractive text summarisation model [49]. The encoder utilised bidirectional LSTM, while the decoder utilised unidirectional LSTM. The encoder key features were extracted using a self-attention mechanism. At the decoder, a beam search was employed. Moreover, more coherent and accurate summaries were generated. The repetition problem was addressed using an improved coverage mechanism with a truncation parameter. The model was optimised by generating a training model that is based on RL and scheduled sampling.

4.1.2. GRU-RNN. Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. The dual encoder consists of two levels of encoders, i.e., primary and secondary encoders, in addition to one decoder, and all of them employ a GRU. The primary encoder considers coarse encoding, while the secondary encoder considers fine encoding. The primary encoder and decoder are the same as in the standard encoder-decoder model with an attention mechanism, and the secondary encoder generates a new context vector that is based on previous output and input. Moreover, an additional context vector provides meaningful information for the output. Thus, the repetition problem of the generated summary that was encountered in previous approaches is addressed. The semantic vector is generated on both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is calculated. The fixed-length output is partially generated at each stage in the decoder, since it decodes in stages.

Figure 14 elaborates the DEATS process. The primary encoder produces a hidden state hpj for each input j and a content representation cp. Next, the decoder decodes a fixed-length output, which is referred to as the decoder content representation cd. The weight αj can be calculated using the hidden states hpj and the content representations cp and cd. In this stage, the secondary encoder generates new hidden states, or semantic context vectors, hsm, which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator copy mechanism and a coverage mechanism.

Wang et al. proposed a hybrid extractive-abstractive text summarisation model, which is based on combining reinforcement learning with BERT word embedding [63].


In this hybrid model, a BERT feature-based strategy was used to generate contextualised token embeddings. This model consists of two submodels, an abstractive agent and an extractive agent, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where the sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. The training process consists of pretraining and full training phases.

Egonmwan et al. proposed the use of sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules: an extractive model and an abstractive model. The encoder transformer has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives a sentence-level representation.

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].


The architecture of the abstractive model consists of a single-layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder. The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.

4.1.3. Others. BERT is employed to represent the sentences of the document and express its semantics [65]. Liu et al. proposed abstractive and extractive summarisation models that are based on an encoder-decoder architecture. The encoder used a BERT pretrained document-level encoder, while the decoder utilised a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage in extractive summarisation and one stage in abstractive summarisation. At the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed, since these two mechanisms need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries using trigram-blocking. OOV words rarely appear in the generated summary.
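
The trigram-blocking heuristic can be sketched as a filter inside beam search: a candidate hypothesis is discarded if appending the next word would create a trigram that already occurs in that hypothesis. A minimal stand-alone check, with example text and names as illustrative assumptions:

```python
def creates_repeated_trigram(tokens, next_token):
    """Return True if appending next_token would repeat an existing trigram."""
    candidate = tokens + [next_token]
    if len(candidate) < 3:
        return False
    new_trigram = tuple(candidate[-3:])
    seen = {tuple(candidate[i:i + 3]) for i in range(len(candidate) - 3)}
    return new_trigram in seen

summary = "the cat sat on the mat and the cat sat".split()
print(creates_repeated_trigram(summary, "on"))    # True: "cat sat on" already occurs
print(creates_repeated_trigram(summary, "down"))  # False: this extension is allowed
```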

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, both the input and output tokens applied the same embedding matrix Wemb, which was generated using the GloVe word embedding model, in the Paulus et al. model [57]. Another word embedding matrix, referred to as Wout, was applied in the token generation layer. Additionally, a sharing weighting matrix was employed by both the shared embedding matrix Wemb and the Wout matrix. The sharing weighting matrixes improved the process of generating tokens, since they considered the embedding syntax and semantic information.

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].


The discriminator input sequence of the Liu et al. model was encoded using a maximum pooling CNN, where the result was passed to the softmax layer [58]. On the other hand, the word embedding that was applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embedding. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of 781 tokens paired with 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the documents were preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets: the CNN/Daily News dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consisted of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve the coreference. The second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP was to perform Stanford NLP parsing, which is a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing was performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If the child node has children, then the child is considered a new root with children; this process continues recursively until there are no children for the root. In this case, the tree structure is completed. Accordingly, the compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are refined before training by applying simple rules. First, the phrase triples at the topmost level are exploited, since they carry the most semantic information. Second, triple phrases with subject and object phrases and no nouns are deleted, since the noun contains a considerable amount of conceptual information. Triple phrases without a verb in the relational phrase are deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To achieve the goal of the previous task and determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consisted of 92,000 text sources and 219,000 text sources, respectively.

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets in experiments [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1, 41.16; ROUGE2, 15.75; and ROUGE-L, 39.08 [57]. The results for the maximum likelihood model were 39.87, 15.82, and 36.9 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to evaluate the quality and readability of the summary. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]. The value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics for the Al-Sabahi et al. and Li et al. models; the values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while the values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluations included ROUGE1, ROUGE2, and ROUGE-L, and the values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel score related to the n-gram was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model. Five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected. The evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and the values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, where the WordPiece tokeniser was utilised. The values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, which is referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by a human via a question-answering paradigm, where 20 documents were selected for evaluation. Three values were chosen for evaluating the answers: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attention encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model consists of a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs, a GRU-RNN for the word level and a GRU-RNN for the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering every vocabulary word, only certain vocabularies were added, based on the frequency of the vocabulary in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrixes for each tag type, similar to word embedding, while the TF-IDF feature was discretised into bins with a fixed number, where a one-hot representation was employed to represent the value of the bins. The one-hot matrix consisted of the number of bin entries, where only one entry was set to one to indicate the value of the TF-IDF of a certain word. This process permitted the TF-IDF to be addressed in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8M training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and named-entity tag generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus. Additionally, the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.3, and 32.65, respectively.
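
The feature-augmented input described above can be sketched as a simple concatenation; for brevity this sketch one-hot encodes all three extra features, whereas in [55] the POS and NER tags are looked up in small embedding matrices. Bin count, dimensions, and tag-set sizes are illustrative assumptions.

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def augmented_word_vector(word_emb, pos_id, ner_id, tfidf,
                          n_pos=45, n_ner=9, n_bins=10, tfidf_max=1.0):
    """Concatenate a word embedding with one-hot POS, NER, and binned TF-IDF features."""
    bin_id = min(int(tfidf / tfidf_max * n_bins), n_bins - 1)   # discretise TF-IDF
    return np.concatenate([word_emb,
                           one_hot(pos_id, n_pos),
                           one_hot(ner_id, n_ner),
                           one_hot(bin_id, n_bins)])

vec = augmented_word_vector(np.random.randn(200), pos_id=12, ner_id=3, tfidf=0.37)
print(vec.shape)   # (264,) = 200 + 45 + 9 + 10
```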

Finally, for both single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of several approaches are presented in Table 2, while the training optimisation, mechanisms, and search at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets have been selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets were produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of each article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets generated by Hermann et al. [71], which were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (on average 29.74 sentences), while the summaries have 53 words (on average 3.72 sentences) [55].

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].


Figure 16: Word embedding concatenated with discretised TF-IDF, POS, and NER one-hot embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | -
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | Word-based decoder RNN
[50] | 2016 | GRU + QRNN + attention | GRU + RNN QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM | Unidirectional RNN attentive encoder-decoder LSTM
[38] | 2016 | Bidirectional LSTM | Unidirectional LSTM
[38] | 2016 | Bidirectional LSTM | Decoder that had global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets: nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study applied the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenisation, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenisation, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenisation | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins with a one-hot representation of the bins; (iv) lookup embeddings for part-of-speech and named-entity tags
[52] | Zhou et al. | PTB tokenisation, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Word embedding with size equal to 300
[53] | Cao et al. | Normalisation and tokenisation, using "#" to replace digits, converting the words to lower case, and "UNK" to replace the least frequent words | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used for segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lowercase letters | GloVe word embedding
[29] | Lopyrev | Tokenisation, converting the articles and their headlines to lowercase letters, and using the symbol ⟨unk⟩ to replace rare words | The input was represented using a distributed representation
[38] | Jobson et al. | - | The word embedding was randomly initialised and updated during training in the first model, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | - | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | - | CNN with maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving coreference and reducing morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | - | The word embedding is learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | - | The word embedding is learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | - | BERT
[63] | Wang et al. | WordPiece tokeniser | BERT
[64] | Egonmwan et al. | - | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers reviewed in this work.
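For readers who wish to experiment with the most common of these corpora, the following is a minimal sketch, assuming the Hugging Face datasets library; the configuration names ("cnn_dailymail", "3.0.0", "gigaword") and field names are properties of that library, not of the surveyed works.

from datasets import load_dataset

# CNN/Daily Mail: multisentence summaries ("article" -> "highlights").
cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
print(cnn_dm["train"][0]["article"][:200])
print(cnn_dm["train"][0]["highlights"])

# Gigaword: single-sentence (headline-style) summaries ("document" -> "summary").
gigaword = load_dataset("gigaword")
print(gigaword["train"][0]["summary"])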

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures to evaluate the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimisation, mechanisms, and search at the decoder.

Reference | Authors | Training and optimisation | Mechanism | Search at decoder (beam size)
[18] | Rush et al. | Stochastic gradient descent to minimise the negative log-likelihood | - | Beam search
[39] | Chopra et al. | Minimising the negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimising the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimiser, optimising the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimiser, optimising the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross-entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | -
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanisms | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | -
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | -
[30] | Song et al. | - | Attention mechanism, copy mechanism | -
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimiser | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\[
\text{ROUGE-N} = \frac{\sum_{S \in \text{Reference Summaries}} \; \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \text{Reference Summaries}} \; \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)} \tag{1}
\]

where S is a reference summary, n is the n-gram length, Count_match(gram_n) is the maximum number of n-grams co-occurring in the reference summary and the generated summary, and Count(gram_n) is the total number of n-grams in the reference summary [73].
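A minimal sketch of ROUGE-N recall following equation (1) for a single reference-candidate pair; whitespace tokenisation is assumed here, whereas the official ROUGE package applies additional preprocessing.

from collections import Counter

def rouge_n_recall(reference, candidate, n=1):
    """ROUGE-N recall: matched n-grams divided by total n-grams in the reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0

print(rouge_n_recall("ahmed ate the apple", "the apple ahmed ate", n=2))  # 2/3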

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of matching words, in order, between the reference summary and the generated summary. The LCS calculation does not require the matching words to be consecutive; however, their order of occurrence is important, and no predefined number of matching words is required. A disadvantage of LCS is that it considers only the main in-sequence words, so the final score does not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets [61].

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarisation evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | -
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, and CNN/Daily Mail | Gigaword, DUC, and CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: the apple Ahmed ate.

In this case, ROUGE-L will count either "Ahmed ate" or "the apple", but not both, because LCS keeps only one in-sequence match.
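A minimal sketch of the LCS computation underlying ROUGE-L, applied to the R and A example above; the tokenisation and dynamic-programming routine are generic and not taken from the ROUGE package.

def lcs_length(ref_tokens, cand_tokens):
    """Length of the longest common subsequence (order matters, gaps allowed)."""
    m, n = len(ref_tokens), len(cand_tokens)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

R = "Ahmed ate the apple".split()
A = "the apple Ahmed ate".split()
# Only one in-sequence match is counted: either "Ahmed ate" or "the apple".
print(lcs_length(R, A))  # 2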

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]. The models were evaluated using various datasets; the other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model: 43.85, 20.34, and 39.90, respectively [65]. Even though ROUGE was employed to evaluate abstractive summarisation, it would be better to develop new methods to evaluate the quality of abstractive summaries. New evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text; ROUGE, in contrast, is well suited to extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups: the first group comprises single-sentence summary approaches, while the second group comprises multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L on the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods on the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both single-sentence and multisentence summaries are those that employed BERT word embedding and were based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset because Gigaword is utilised for single-sentence summaries, as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries; thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the generated summaries of 50 test examples from 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. The results show that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though it is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summaries [60]. Five human evaluators rated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where a value of 1 indicates that the generated summary is less readable and less relevant, while a value of 10 indicates that the generated summary is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are close, at 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summaries in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retained key information, which was achieved by human evaluation [65]. In addition, the qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were used for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summaries of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are also insufficient, due to the small number of testing examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, the previous tokens of the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, the decoder input is either the gold token, as during training, or the output of the previous step, as during testing. In this manner, at least part of training receives the same


input as testing. In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of always feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76].

Moreover, the mass convolution of the QRNN is applied in [50], since the dependency on words generated in the future is difficult to determine.
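A minimal sketch of the coin-flip choice between the gold token and the previously generated token (a scheduled-sampling-style decision, in the spirit of [51, 74, 76]); the function name and the 10% probability are illustrative assumptions.

import random

def next_decoder_input(gold_token, generated_token, sampling_prob=0.1):
    """Choose the next decoder input during training.

    With probability sampling_prob, feed back the model's own previous prediction,
    so that training more closely resembles testing, where no gold tokens exist;
    otherwise use the gold (reference summary) token, i.e., teacher forcing.
    """
    return generated_token if random.random() < sampling_prob else gold_token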

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. A switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copy it to the memory; when the switch is turned on, the decoder generates a word from the target vocabulary. Conversely, researchers in [56] addressed OOV words via a generation probability Pgen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words and copy them; the combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the input sequence position of

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | -
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel, with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset (ABS+, RAS-Elman (k = 10), SEASS, words-lvt5k-1sent, FTSumg, and RCT).


unseen tokens and copy them. Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. In addition, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
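A minimal sketch of the pointer-generator switch described above, assuming PyTorch; the layer names, dimensions, and the scatter-based mixing over the extended vocabulary are illustrative rather than the exact implementation of [56].

import torch
import torch.nn as nn

class GenerationSwitch(nn.Module):
    """Computes p_gen from the context vector, decoder state, and decoder input."""
    def __init__(self, context_dim, state_dim, input_dim):
        super().__init__()
        self.linear = nn.Linear(context_dim + state_dim + input_dim, 1)

    def forward(self, context, state, x):
        # p_gen in (0, 1): probability of generating from the fixed vocabulary;
        # (1 - p_gen): probability of copying a source word via the attention weights.
        return torch.sigmoid(self.linear(torch.cat([context, state, x], dim=-1)))

def final_distribution(p_gen, vocab_dist, attention, src_ids, extended_vocab_size):
    """Mixes the vocabulary and copy distributions over the extended vocabulary."""
    batch = vocab_dist.size(0)
    dist = torch.zeros(batch, extended_vocab_size, device=vocab_dist.device)
    dist[:, :vocab_dist.size(1)] = p_gen * vocab_dist
    # Scatter-add copy probabilities onto the source word ids (OOV ids included).
    dist.scatter_add_(1, src_ids, (1 - p_gen) * attention)
    return dist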

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges are due to the summarisation of long documents and the production of long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention steps. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent the decoder from attending to the same parts of the input sequence at different steps. However, the intratemporal encoder attention mechanism cannot address all repetition challenges, especially when a long sequence is generated; thus, an intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words, and this mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy maximum-likelihood loss with reinforcement learning gradients to minimise exposure bias. In addition, a trigram probability p(y_t) was proposed to address repetition in the generated summary, where y_t is the trigram sequence: the value of p(y_t) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated using a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.
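A minimal sketch of the trigram-blocking heuristic used to suppress repetition during beam search in [57, 60]; the function name is illustrative, and it operates on lists of token ids.

def violates_trigram_constraint(generated_ids, candidate_id):
    """Return True if appending candidate_id would repeat an already generated trigram."""
    if len(generated_ids) < 3:
        return False
    new_trigram = tuple(generated_ids[-2:] + [candidate_id])
    existing = {tuple(generated_ids[i:i + 3]) for i in range(len(generated_ids) - 2)}
    return new_trigram in existing

# During beam search, a candidate whose trigram has already occurred gets probability 0.
print(violates_trigram_constraint([1, 2, 3, 1, 2], 3))  # True: trigram (1, 2, 3) repeats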

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of a predicate. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.
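A minimal sketch of extracting rough (subject, predicate, object) fact tuples from dependency parses; spaCy is used here as an illustrative substitute for the Stanford CoreNLP/OpenIE pipeline employed in [53], and the dependency labels checked are assumptions rather than the exact relation set of that work.

import spacy

nlp = spacy.load("en_core_web_sm")  # illustrative substitute for CoreNLP/OpenIE

def extract_fact_tuples(text):
    """Extract rough (subject, predicate, object) or (subject, predicate) tuples."""
    facts = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
                for s in subjects:
                    if objects:
                        facts.extend((s.text, token.lemma_, o.text) for o in objects)
                    else:
                        facts.append((s.text, token.lemma_))  # tuple relation
    return facts

print(extract_fact_tuples("Ahmed ate the apple."))  # [('Ahmed', 'eat', 'apple')]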

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary is composed of the highlights of the news article. Every highlight represents a sentence in the summary; therefore,

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all of the crucial points of the article; producing a high-quality dataset therefore requires considerable effort. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available, and the existing single-sentence abstractive Arabic text summarisation dataset is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results for extractive summarisation; however, for abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words. For example, the words book and books are considered different by every ROUGE metric. Therefore, a new evaluation measure should be proposed that considers the context of words, so that words with the same meaning are treated as the same even if they have different surface forms. For this purpose, we propose the use of METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for languages with flexible word order, it is better to use ROUGE without considering the order of the words.

The quality of the generated summary can be improved using linguistic features. For example, we propose the use

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


of dependency parsing at the encoder in a separate layer on top of the first hidden-state layer. We also propose the use of word embeddings that are built by considering dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the new trends and evaluation results, we think that the most promising direction is the use of the pretrained BERT model. The quality of the models that are based on the transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that applied deep learning for abstractive text summarisation, the datasets used to train and validate them, and the measures used to evaluate them. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively; the best results were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, several issues must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.
[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.
[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].


[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.
[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.
[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.
[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.
[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.
[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.
[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.
[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.
[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.
[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, India, 2017.
[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.
[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.
[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.
[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.
[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, India, 2018.
[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.
[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.
[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, "Neural abstractive text summarization with sequence-to-sequence models: a survey," 2020, http://arxiv.org/abs/1812.02303.
[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.
[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.
[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.
[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.
[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
[29] K. Lopyrev, "Generating news headlines with recurrent neural networks," p. 9, 2015, https://arxiv.org/abs/1512.01712.
[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.
[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.
[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.
[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.
[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of NAACL-HLT 2016, pp. 93–98, San Diego, CA, USA, 2016.
[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.
[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.
[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.
[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.
[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.
[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.
[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.
[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.
[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," 2015, https://arxiv.org/abs/1611.01576.
[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.
[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.
[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.
[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.
[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of CoNLL 2016, Berlin, Germany, August 2016.
[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.
[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.
[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.
[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.
[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.
[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.
[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.
[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.
[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.
[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.
[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.
[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.
[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.
[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.
[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of AKBC-WEKEX, Montreal, Canada, 2012.
[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.
[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.
[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.
[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.
[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.
[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.
[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.


Page 8: Deep Learning Based Abstractive Text Summarization ...downloads.hindawi.com/journals/mpe/2020/9365340.pdfDeep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation

e encoder-decoder RNN and sequence-to-sequencemodels were utilised in [55] which mapped the inputs to thetarget sequences the same approach was also employed in[38 51] ree different methods for global attention wereproposed for calculating the scoring functions including dotproduct scoring the bilinear form and the scalar valuecalculated from the projection of the hidden states of theRNN encoder [38] e model applied LSTM cells instead ofGRU cells (both LSTM and GRU are commonly employedfor abstractive summarisation tasks since LSTM has amemory unit that provides control but the computation timeof GRU is lower) ree models were employed the firstmodel applied unidirectional LSTM in both the encoder andthe decoder the second model was implemented usingbidirectional LSTM in the encoder and unidirectional LSTMin the decoder and the third model utilised a bidirectionalLSTM encoder and an LSTM decoder with global attentione first hidden state of the decoder is the concatenation ofall backward and forward hidden states of the encoder euse of attention in an encoder-decoder neural networkgenerates a context vector at each timestep For the localattention mechanism the context vector is conditioned on asubset of the encoderrsquos hidden states while for the globalattention mechanism the vector is conditioned on all theencoderrsquos hidden states After generating the first decoderoutput the next decoder input is the word embedding of theoutput of the previous decoder step e affine transfor-mation is used to convert the output of the decoder LSTM toa dense vector prediction due to the long training timeneeded before the number of hidden states is the same as thenumber of words in the vocabulary

Khandelwal [51] employed a sequence-to-sequencemodel that consists of an LSTM encoder and LSTM decoderfor abstractive summarisation of small datasets e decodergenerated the output summary after reading the hiddenrepresentations generated by the encoder and passing themto the softmax layer e sequence-to-sequence model doesnot memorize information so generalization of the model isnot possible us the proposed model utilised imitationlearning to determine whether to choose the golden token(ie reference summary token) or the previously generatedoutput at each step

(2) GRU-RNN A combination of the elements of the RNN andconvolutional neural network (CNN) was employed in anencoder-decoder model that is referred to as a quasi-recurrentneural network (QRNN) [50] In the QRNN the GRU wasutilised in addition to the attention mechanism e QRNNwas applied to address the limitation of parallelisation whichaimed to obtain the dependencies of the words in previoussteps via convolution and ldquofo-poolingrdquo which were performedin parallel as shown in Figure 6e convolution in theQRNNcan be either mass convolution (considering previous time-steps only) or centre convolution (considering future time-steps) e encoder-decoder model employed two neuralnetworks the first network applied the centre convolution ofQRNN and consisted ofmultiple hidden layers that were fed bythe vector representation of the words and the second networkcomprised neural attention and considered as input the en-coder hidden layers to generate one word of a headline edecoder accepted the previously generated headline word andproduced the next word of the headline this process continueduntil the headline was completed

SEASS is an extension of the sequence-to-sequence re-current neural network that was proposed in [52] e se-lective encoding for the abstractive sentence summarisation(SEASS) approach includes a selective encoding model thatconsists of an encoder for sentences a selective gate networkand a decoder with an attention mechanism as shown inFigure 7 e encoder uses a bidirectional GRU while thedecoder uses a unidirectional GRU with an attentionmechanism e encoder reads the input words and theirrepresentations e meaning of the sentences is applied bythe selective gate to choose the word representations forgenerating the word representations of the sentence Toproduce an excellent summary and accelerate the decodingprocess a beam search was selected as the decoder

On the other hand dual attention was applied in [53]e proposed dual attention approach consists of threemodules two bidirectional GRU encoders and one dualattention decoder e decoder has a gate network forcontext selection as shown in Figure 8 and employs copyingand coverage mechanisms e outputs of the encoders aretwo context vectors one context vector for sentences andone context vector for the relation where the relationmay be

Context

Attention weight

Word1 Word2 Word3 ltEOSgt

Headline

(a)

Context

Word1 Word2 Word3 ltEOSgt

Headline

Attention weight

(b)

Figure 5 (a) Simple attention and (b) complex attention [29]

8 Mathematical Problems in Engineering

---- ---- ---- ---- ---- ---- rarr

---- ---- ---- ---- ---- ---- rarr

Convolutional

Max-pool

Convolutional

Max-pool

Convolutional

fo-pool

Convolutional

fo-pool

LSTM CNN QRNNLinear

LSTMlinearLinearLSTMlinear

Figure 6 Comparison of the CNN LSTM and QRNN models [50]

Enco

der

Word3Word1 Word2 Word4 Word5

MLP

hi S

Selective gate network

Attention Ct

Somax

Yt

Maxout

GRU

St

Ctndash1 Stndash1

Ytndash1

Decoder

Figure 7 Selective encoding for abstractive sentence summarisation (SEASS) [52]

Sent

ence

enco

der

Word3Word1 Word2 Word4 Word5

Attention Somax

Yt

GRU

St

Ctndash1 Stndash1

Ctx Ct

r

Context selection

MLP

Rel3Rel1 Rel2 Rel4 Rel5

Attention

Rela

tion

enco

der

Ctx

Ctr Ct

Dual attention decoder

Ytndash1

Figure 8 Faithful to the original [53]

Mathematical Problems in Engineering 9

a triple or tuple relation A triple relation consists of thesubject predicate and object while the tuple relationconsists of either (subject and predicate) or (predicate andsubject) Sometimes the triple relation cannot be extractedin this case two tuple relations are utilised e decoder gatemerges both context vectors based on their relativeassociation

(3) Others e long-sequence poor semantic representa-tion of abstractive text summarisation approaches whichare based on an RNN encoder-decoder framework wasaddressed using the RC-Transformer (RCT) [54] An RCTis an RNN-based abstractive text summarisation modelthat is composed of two encoders (RC encoder andtransformer encoder) and one decoder e transformershows an advantage in parallel computing in addition toretrieving the global context semantic relationships On theother hand sequential context representation was achievedby a second encoder of the RCT-Transformer Word or-dering is very crucial for abstractive text summarisationwhich cannot be obtained by positioning encodingerefore an RCT utilised two encoders to address theproblem of a shortage of sequential information at the wordlevel A beam search was utilised at the decoder Fur-thermore Cai et al compared the speed of the RCTmodeland that of the RNN-based model and concluded that theRCT is 14x and 12x faster

32 Word Embedding In the QRNN model GloVe wordembedding which was pretrained using the Wikipedia andGigaword datasets was performed to represent the text andsummary [50] In the first model the proposed model byJobson et al the word embedding randomly initialisedand updated during training while GloVe word embed-ding was employed to represent the words in the secondand third models [38] In a study by Cai et al Transformerwas utilised [54]

33 Dataset and Dataset Preprocessing In the model thatwas proposed by Rush et al datasets were preprocessed viaPTB tokenization by using ldquordquo to replace all digits con-version of all letters to lowercase letters and the use ofldquoUNKrdquo to replace words that occurred fewer than 5 times[18] e model was trained with any input-output pairs dueto the shortage of constraints for generating the output etraining process was carried out on the Gigaword datasetswhile the summarisation evaluation was conducted onDUC2003 and DUC2004 [18] Furthermore the proposedmodel by Chopra et al was trained using the Gigawordcorpus with sentence separation and tokenisation [39] Toform sentence-summary pairs each headline of the articlewas paired with the first sentence of the article e samepreprocessing steps of the data in [18] were performed in[39] Moreover the Chopra et al model was evaluated usingthe DUC2004 dataset which consists of 500 pairs

Gigaword datasets were also employed by the QRNNmodel [50] Furthermore articles that started with sentences

that contained more than 50 words or headlines with morethan 25 words were removed Moreover the words in thearticles and their headlines were converted to lowercasewords and the data points were split into short mediumand long sentences based on the lengths of the sentences toavoid extra padding

Lopyrev and Jobson et al trained the model usingGigaword after processing the data In the Lopyrev modelthe most crucial preprocessing steps for both the text and theheadline were tokenisation and character conversion tolowercase [29] In addition only the characters of the firstparagraph were retained and the length of the headline wasfixed between 25 and 50 words Moreover the no-headlinearticles were disregarded and the langunkrang symbol was used toreplace rare words

Khandelwal employed the Association for Computational Linguistics (ACL) Anthology Reference Corpus, which consists of 16,845 examples for training and 500 examples for testing; these were considered small datasets in the experiments [51]. The abstract comprised the first three sentences, and the unigram overlap between the title and the abstract was also calculated. The summary contained 25 tokens, and the input text contained a maximum of 250 tokens.

The English Gigaword dataset, the DUC2004 corpus, and MSR-ATC were selected to train and test the SEASS model [52]. Moreover, the experiments on the Cao et al. model were conducted using the Gigaword dataset [53]. The same data preprocessing steps as in [18] were performed in [52, 53]. The RCT also employed the Gigaword and DUC2004 datasets in its experiments [54].

3.4. Evaluation and Results. Recall-Oriented Understudy for Gisting Evaluation 1 (ROUGE1), ROUGE2, and ROUGE-L were utilised to evaluate the Rush et al. model, and values of 28.18, 8.49, and 23.81, respectively, were obtained [18]. The experimental results of the Chopra et al. model showed that, although DUC2004 was too complex for the experiments on the Gigaword corpus, the proposed model outperformed state-of-the-art methods in terms of ROUGE1, ROUGE2, and ROUGE-L, with values of 28.97, 8.26, and 24.06, respectively [39]. BLEU was employed to evaluate the Lopyrev model [29], while Khandelwal utilised perplexity [51]. The SEASS model was evaluated using ROUGE1, ROUGE2, and ROUGE-L, and the results for the three measures were 36.15, 17.54, and 33.63, respectively [52]. Moreover, ROUGE1, ROUGE2, and ROUGE-L were selected for evaluating the Cao et al. model, yielding values of 37.27, 17.65, and 34.24, respectively; the results also showed that fake summaries were reduced by 80% [53]. In addition, the RCT was evaluated using ROUGE1, ROUGE2, and ROUGE-L, with values of 37.27, 18.19, and 34.62 on the Gigaword dataset. The results showed that the RCT model outperformed other models by generating high-quality summaries that contain the salient information [54].


4. Multisentence Summary

In this section, multisentence summary and deep learning-based abstractive text summarisation are discussed. Multisentence summary methods include the get-to-the-point method (summarisation with pointer-generator networks) [56], a deep reinforced model for abstractive summarization (RL) [57], a generative adversarial network for abstractive text summarization [58], semantic phrase exploration (ATSDL) [30], a bidirectional attention encoder-decoder with bidirectional beam search [35], a key information guide network [59], text summarisation abstraction improvement [60], dual encoding for abstractive text summarisation (DEATS) [61], abstractive document summarisation via a bidirectional decoder (BiSum) [62], a text abstraction summary model based on BERT word embedding and RL [63], a transformer-based model for single-document neural summarisation [64], text summarisation with pretrained encoders [65], and a text summarisation method based on the double attention pointer network [49]. The pointer-generator model [55] covers both single-sentence and multisentence summaries. Additional details are presented in the following sections.

4.1. Abstractive Summarization Architecture

4.1.1. LSTM-RNN. A novel abstractive summarisation method was proposed in [56]; it generates a multisentence summary and addresses sentence repetition and inaccurate information. See et al. proposed a model that consists of a single-layer bidirectional LSTM encoder, a single-layer unidirectional LSTM decoder, and the sequence-to-sequence attention model proposed by [55]. The See et al. model generates a long text summary instead of headlines, which consist of only one or two sentences. Moreover, the attention mechanism was employed: the attention distribution facilitates the production of the next word in the summary by telling the decoder where to look in the source words, as shown in Figure 9. This mechanism constructs a weighted sum of the encoder hidden states, which yields the context vector, where the context vector is a fixed-size representation of the input. The probability Pvocab produced by the decoder is used to generate the final prediction from the context vector and the decoder's last step. Furthermore, the value of Pvocab is zero for OOV words. RL was employed for abstractive text summarisation in [57]. The method proposed in [57], which combines RL with supervised word prediction, is composed of a bidirectional LSTM-RNN encoder and a single LSTM decoder.
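The NumPy sketch below illustrates how one such attention step can produce the attention distribution, the context vector, and a vocabulary distribution Pvocab; the weight names (W_h, W_s, v, W_out) and all dimensions are hypothetical and are not taken from [56].

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(enc_states, dec_state, W_h, W_s, v, W_out, b_out):
    """One decoder step: attention scores over encoder states -> attention
    distribution -> context vector (weighted sum) -> P_vocab."""
    # e_i = v^T tanh(W_h h_i + W_s s_t) for every encoder hidden state h_i
    scores = np.array([v @ np.tanh(W_h @ h + W_s @ dec_state) for h in enc_states])
    attn = softmax(scores)                          # attention distribution
    context = (attn[:, None] * enc_states).sum(0)   # weighted sum of encoder states
    # P_vocab from the context vector and the decoder state
    p_vocab = softmax(W_out @ np.concatenate([dec_state, context]) + b_out)
    return attn, context, p_vocab

# Toy dimensions: 4 encoder steps, hidden size 8, vocabulary of 10 words.
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 8)); dec = rng.normal(size=8)
attn, ctx, p = attention_step(enc, dec,
                              rng.normal(size=(8, 8)), rng.normal(size=(8, 8)),
                              rng.normal(size=8), rng.normal(size=(10, 16)),
                              np.zeros(10))
print(attn.shape, ctx.shape, p.sum())   # (4,) (8,) 1.0
```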

Two models, a generative model and a discriminative model, were trained simultaneously to generate abstractive summary text using an adversarial process [58]. The maximum likelihood estimation (MLE) objective function employed in previous sequence-to-sequence models suffers from two problems: the difference between the training loss and the evaluation metric, and the unavailability of a golden token at testing time, which causes errors to accumulate during testing. To address these problems, the proposed approach exploited the adversarial framework. In the first step of the adversarial framework, reinforcement learning was employed to optimise the generator, which generates the summary from the original text. In the second step, the discriminator, which acts as a binary classifier, classified the summary as either a ground-truth summary or a machine-generated summary. The bidirectional LSTM encoder and attention mechanism were employed as in [56].

Abstractive text summarisation using the LSTM-CNN model based on exploring semantic phrases (ATSDL) was proposed in [30]. ATSDL is composed of two phases: the first phase extracts phrases from the sentences, while the second phase learns the collocation of the extracted phrases using the LSTM model. To generate sentences that are general and natural, the input and output of the ATSDL model were phrases instead of words, and the phrases were divided into three main types, i.e., subject, relation, and object phrases, where the relation phrase represents the relation between the input phrase and the output phrase. Each phrase was represented using a CNN layer. There are two main reasons for choosing the CNN: first, the CNN is efficient for sentence-level applications, and second, training is efficient since long-term dependency is unnecessary. Furthermore, to obtain several vectors for a phrase, multiple kernels with different widths, representing the dimensionality of the features, were utilised. Within each kernel, the maximum feature was selected for each row via max pooling, and the resulting values were added to obtain the final value for each word in a phrase. Bidirectional LSTM was employed instead of a GRU on the encoder side since its parameters are easier to tune. Moreover, the decoder was divided into two modes: a generate mode and a copy mode. The generate mode generates the next phrase in the summary based on the previously generated phrases and the hidden layers of the input on the encoder side, while the copy mode copies the phrase that follows the current input phrase if the currently generated phrase is not suitable for the previously generated phrases in the summary. Figure 10 provides additional details.
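A minimal sketch of this phrase-encoding idea is given below, assuming a single scalar feature per kernel for brevity; the actual ATSDL model [30] produces richer feature maps, so the kernel shapes and pooling here are illustrative assumptions only.

```python
import numpy as np

def cnn_phrase_vector(word_vectors, kernels):
    """Convolve a phrase (matrix of word embeddings) with kernels of
    different widths and max-pool each feature map over positions."""
    pooled = []
    for W in kernels:                      # W has shape (width, dim)
        width = W.shape[0]
        n = len(word_vectors) - width + 1
        if n <= 0:
            continue                       # phrase shorter than this kernel
        feats = [np.sum(W * word_vectors[i:i + width]) for i in range(n)]
        pooled.append(max(feats))          # max pooling over positions
    return np.array(pooled)                # one pooled value per kernel

rng = np.random.default_rng(1)
phrase = rng.normal(size=(5, 50))                      # 5 words, 50-dim embeddings
kernels = [rng.normal(size=(w, 50)) for w in (2, 3)]   # two kernel widths
print(cnn_phrase_vector(phrase, kernels).shape)        # (2,)
```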

Bidirectional encoder and decoder LSTM-RNNs were employed to generate abstractive multisentence summaries [35]. The proposed approach considers past and future context on the decoder side when making a prediction, as it employs a bidirectional RNN. Using a bidirectional RNN on the decoder side addresses the problem of summary imbalance: an unbalanced summary can occur due to noise in a previous prediction, which reduces the quality of all subsequent summary words. The bidirectional decoder consists of two LSTMs, a forward decoder and a backward decoder. The forward decoder decodes the information from left to right, while the backward decoder decodes the information from right to left. The last hidden state of the forward decoder is fed as the initial input to the backward decoder, and vice versa. Moreover, the researchers proposed a bidirectional beam-search method that generates summaries from the proposed bidirectional model. Bidirectional beam search combines information from the past and the future to produce a better summary; therefore, the output summary is balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence was read in reverse order, based on the conclusion that an LSTM learns better when reading the source in reverse order while remembering the order of the target [66, 67]. A softmax layer was employed on the decoder side to obtain the probability of each target word in the summary over the vocabulary distribution, taking the output of the decoder as the input of the softmax layer. The decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. The objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token; however, during testing, the input of the forward decoder is the token generated in the previous step. The same holds for the backward decoder, where the input during training is the future token from the summary. Nevertheless, the bidirectional decoder has difficulty during testing since the complete summary must be known in advance; thus, the full backward summary is generated and fed to the forward decoder using a unidirectional backward beam search.

A combination of abstractive and extractive methods was employed in the guided generation model proposed by [59]. The extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent the key information. Furthermore, to predict the long-term value of the final summary, the proposed method applied a prediction guide mechanism [68].

Figure 9: Baseline sequence-to-sequence model with attention mechanism [56].

Figure 10: Semantic-unit-based LSTM model [30].


A prediction guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. The encoder-decoder baseline of the proposed model is similar to that proposed by Nallapati et al. [55], where a bidirectional LSTM encoder and a unidirectional LSTM decoder are employed; both apply the attention mechanism and a softmax layer. Moreover, the process of generating the summary was improved by the proposed KIGN, which takes as input the keywords extracted by the TextRank algorithm. In KIGN, the key information is represented by concatenating the last forward hidden state and the first backward hidden state. KIGN employs the attention mechanism and the pointer mechanism. In general, the attention mechanism can hardly identify the keywords by itself; thus, the output of KIGN is fed to the attention mechanism, so that the attention mechanism is strongly influenced by the keywords. To enable the pointer network to identify the keywords output by KIGN, the encoder context vector and the hidden state of the decoder are fed to the pointer network, and its output is used to calculate the soft switch. The soft switch determines whether to copy the target from the original text or generate it from the target vocabulary, as shown in Figure 11.
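The sketch below illustrates the key-vector idea, assuming hypothetical weight matrices: the keyword encoder's last forward and first backward hidden states are concatenated into a key vector, which then biases the attention scores, roughly in the spirit of KIGN [59].

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kign_attention(enc_states, dec_state, kw_fwd, kw_bwd, W_h, W_s, W_k, v):
    """Key-information-guided attention sketch: the key vector (last forward
    and first backward keyword-encoder states) is an extra term in the
    attention score. All weights are hypothetical."""
    key_vector = np.concatenate([kw_fwd[-1], kw_bwd[0]])   # keyword representation
    scores = np.array([v @ np.tanh(W_h @ h + W_s @ dec_state + W_k @ key_vector)
                       for h in enc_states])
    return key_vector, softmax(scores)

rng = np.random.default_rng(2)
enc = rng.normal(size=(6, 8)); dec = rng.normal(size=8)
kw = rng.normal(size=(3, 8))    # keyword-encoder hidden states
key, attn = kign_attention(enc, dec, kw, kw,
                           rng.normal(size=(8, 8)), rng.normal(size=(8, 8)),
                           rng.normal(size=(8, 16)), rng.normal(size=8))
print(key.shape, attn.sum())    # (16,) 1.0
```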

The level of abstraction in the summaries generated by abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and a novel metric for optimising the n-gram overlap between the generated summary and the ground-truth summary. The decoder was decomposed into a contextual network and a pretrained language model, as shown in Figure 12. The contextual network uses the source document to extract the relevant parts, while the pretrained language model contributes prior knowledge. This decomposition facilitates the addition of an external pretrained language model related to several domains. Furthermore, a novel metric was employed to encourage abstractive summaries that include words not present in the source document. A bidirectional LSTM was utilised in the encoder, and the decoder applied a 3-layer unidirectional weight-dropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applies the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combines maximum likelihood and reinforcement learning.

A bidirectional decoder with a sequence-to-sequence architecture, referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing because the input of the decoder is the previously generated summary word; if one of the generated summary words is incorrect, the error propagates through all subsequent summary words. In the bidirectional decoder, there are two decoders, a forward decoder and a backward decoder. The forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left, and the forward decoder considers a reference from the backward decoder. However, there is only a single-layer encoder. The encoder and decoders employ LSTM units, but while the encoder uses a bidirectional LSTM, the decoders use unidirectional LSTMs, as shown in Figure 13. To understand the summary generated by the backward decoder, the attention mechanism is applied in both the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, a pointer mechanism is employed in both decoders.

A double attention pointer network (DAPT) was applied to build an abstractive text summarisation model [49]. The encoder utilised a bidirectional LSTM, while the decoder utilised a unidirectional LSTM. The key features of the encoder were extracted using a self-attention mechanism, and a beam search was employed at the decoder. The model generates more coherent and accurate summaries. The repetition problem was addressed using an improved coverage mechanism with a truncation parameter, and the model was optimised with a training procedure based on RL and scheduled sampling.

4.1.2. GRU-RNN. Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. The dual encoder consists of two levels of encoders, i.e., primary and secondary encoders, in addition to one decoder, all of which employ a GRU. The primary encoder considers coarse encoding, while the secondary encoder considers fine encoding. The primary encoder and the decoder are the same as in the standard encoder-decoder model with an attention mechanism, and the secondary encoder generates a new context vector based on the previous output and input. Moreover, this additional context vector provides meaningful information for the output; thus, the repetition problem in the generated summaries encountered by previous approaches is addressed. The semantic vector is generated at both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is calculated. The fixed-length output is partially generated at each stage in the decoder, since it decodes in stages.

Figure 14 elaborates the DEATS process. The primary encoder produces a hidden state h^p_j for each input j and a content representation c^p. Next, the decoder decodes a fixed-length output, which yields the decoder content representation c^d. The weight α_j can then be calculated from the hidden states h^p_j and the content representations c^p and c^d. In this stage, the secondary encoder generates new hidden states, or semantic context vectors, h^s_m, which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator, a copy mechanism, and a coverage mechanism.

Wang et al. proposed a hybrid extractive-abstractive text summarisation model based on combining reinforcement learning with BERT word embedding [63]. In this hybrid model, a BERT feature-based strategy was used to generate contextualised token embeddings. The model consists of two submodels, an abstractive agent and an extractive agent, which are bridged by RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In both the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. The training process consists of pretraining and full-training phases.

Egonmwan et al. proposed the use of sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules: an extractive model and an abstractive model. The encoder transformer has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives a sentence-level representation.

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].


The abstractive model consists of a single-layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder. The input of the encoder is the output of the transformer. A beam search was performed at the decoder during inference, while greedy decoding was employed during training and validation.

4.1.3. Others. BERT is employed to represent the sentences of the document and express its semantics [65]. Liu et al. proposed abstractive and extractive summarisation models based on an encoder-decoder architecture. The encoder uses a BERT pretrained document-level encoder, while the decoder uses a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and the decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage in extractive summarisation and one stage in abstractive summarisation. On the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed, since these two mechanisms need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries using trigram blocking. OOV words rarely appear in the generated summaries.
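A minimal sketch of trigram blocking is shown below; the exact integration with beam search in [65] may differ, so the function should be treated as illustrative.

```python
def violates_trigram_blocking(generated_tokens, candidate_token):
    """Return True if appending candidate_token would repeat a trigram
    that already appears in the partially generated summary."""
    if len(generated_tokens) < 2:
        return False
    new_trigram = tuple(generated_tokens[-2:] + [candidate_token])
    seen = {tuple(generated_tokens[i:i + 3])
            for i in range(len(generated_tokens) - 2)}
    return new_trigram in seen

tokens = "the cat sat on the mat and the cat".split()
print(violates_trigram_blocking(tokens, "sat"))   # True: "the cat sat" already exists
print(violates_trigram_blocking(tokens, "ran"))   # False
```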

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. In the Paulus et al. model, both the input and output tokens use the same embedding matrix W_emb, which was generated using the GloVe word embedding model [57]. Another word embedding matrix, referred to as W_out, was applied in the token generation layer, and a shared weight matrix was employed by both the shared embedding matrix W_emb and the W_out matrix. The shared weight matrices improve token generation since they take the syntactic and semantic information captured by the embeddings into account.
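The toy sketch below shows the weight-sharing idea in its simplest form: the output projection reuses the embedding matrix instead of an independent matrix. This is a simplification of the W_emb/W_out scheme described for [57] (the real model applies an additional projection), so the shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, emb_dim = 10, 8
W_emb = rng.normal(size=(vocab_size, emb_dim))   # shared input/output embedding matrix

def embed(token_id):
    # Input lookup uses W_emb.
    return W_emb[token_id]

def output_logits(decoder_hidden):
    # Token generation reuses the same matrix (weight sharing), so the output
    # projection is W_emb rather than a separately learned matrix.
    return W_emb @ decoder_hidden                 # (vocab_size,)

print(output_logits(embed(3)).shape)              # (10,)
```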

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].


The discriminator input sequence of the Liu et al. model was encoded using a CNN with maximum pooling, and the result was passed to a softmax layer [58]. On the other hand, the word embedding applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets, with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embedding, while BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of source documents of 781 tokens paired with summaries of 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the documents were preprocessed using the same method applied in [55], and the model was evaluated using two datasets, the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was also utilised by Liu et al. for training their model [58].

The ATSDL model consists of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve coreference. The second stage of the ATSDL model is phrase extraction, which includes phrase acquisition, refinement, and combination. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP is Stanford NLP parsing, a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing is performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If a child node has children, then that child is considered a new root; this process continues recursively until no node has children, at which point the tree structure is complete. Accordingly, compound phrases can be explored via dependency parsing. An important stage of phrase extraction is refinement, during which redundant and incorrect phrases are filtered out before training by applying simple rules. First, the phrase triples at the topmost level are exploited, since they carry the most semantic information. Second, triple phrases whose subject and object phrases contain no nouns are deleted, since nouns carry a considerable amount of conceptual information, and triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN; to determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consist of 92,000 and 219,000 text sources, respectively.

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments on DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61], and the experiments on the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were used to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two ROUGE points. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1, 41.16; ROUGE2, 15.75; and ROUGE-L, 39.08 [57]. The results for the maximum likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to assess the quality and readability of the summaries: two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets, and each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]; the value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics for the Al-Sabahi et al. and Li et al. models: values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluation included ROUGE1, ROUGE2, and ROUGE-L, with values of 40.19, 17.38, and 37.52, respectively. Additionally, a novel n-gram-based score was employed to measure the level of abstraction in the summary. The qualitative evaluation involved manual evaluation of the proposed model: five participants evaluated 100 full-text summaries in terms of relevance and readability, giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected, and the evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, which utilises the WordPiece tokeniser; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92 [64]. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, on the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-answering paradigm, where 20 documents were selected for evaluation. Three values were used for scoring an answer: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. The ROUGE1, ROUGE2, and ROUGE-L values of the DAPT model on the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attentional encoder-decoder RNNs were employed to model abstractive text summarisation [55]. Both the encoder and the decoder have the same number of hidden states, and the model contains a softmax layer for generating words from the target vocabulary. The encoder and decoder differ in their components: the encoder consists of two bidirectional GRU-RNNs, one at the word level and one at the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering the whole vocabulary, only certain words are added, based on their frequency in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words in order to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in embedding matrices for each tag type, similar to word embedding, while the TF-IDF feature was discretised into a fixed number of bins, with a one-hot representation encoding the bin value. The one-hot vector has as many entries as there are bins, and only one entry is set to one to indicate the TF-IDF bin of a certain word. This process allows the TF-IDF value to be treated in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8M training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation, part-of-speech tagging, and named-entity generation. The Word2Vec model with 200 dimensions, trained on the Gigaword corpus, was applied for word embedding, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
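A small sketch of the TF-IDF discretisation and feature concatenation is given below; the number of bins, the embedding sizes, and the helper name are illustrative assumptions rather than the exact configuration of [55].

```python
import numpy as np

def tfidf_one_hot(tfidf_value, num_bins=10, max_value=1.0):
    """Discretise a TF-IDF score into a fixed number of bins and return the
    one-hot vector encoding the bin index."""
    bin_index = min(int(tfidf_value / max_value * num_bins), num_bins - 1)
    one_hot = np.zeros(num_bins)
    one_hot[bin_index] = 1.0
    return one_hot

# The word representation concatenates the word embedding with the one-hot
# TF-IDF bin and (lookup) embeddings for the POS and NER tags.
word_emb = np.ones(200); pos_emb = np.ones(16); ner_emb = np.ones(16)
features = np.concatenate([word_emb, pos_emb, ner_emb, tfidf_one_hot(0.37)])
print(features.shape)   # (242,)
```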

Finally, for both the single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of the various approaches are presented in Table 2, while the training, optimisation, mechanisms, and search at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets have been selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets were produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles each. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies have utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of each article [5]. The CNN/Daily Mail datasets applied in abstractive summarisation were presented by Nallapati et al. [55]; these datasets were created by modifying the CNN/Daily Mail datasets generated by Hermann et al. [71], which were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were used for testing. In training, the source documents have 766 words (29.74 sentences on average), while the summaries have 53 words (3.72 sentences on average) [55].

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].

Mathematical Problems in Engineering 17

Figure 16: Word embedding concatenated with discretised TF-IDF, POS, and NER one-hot embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | —
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | RNN, word-based
[50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM | Unidirectional RNN attentive encoder-decoder LSTM
[38] | 2016 | Bidirectional LSTM | Unidirectional LSTM
[38] | 2016 | Bidirectional LSTM | Decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of its summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets: nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number on the list), and one study each applied the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenisation: "#" replaces all digits, all letters converted to lowercase, "UNK" replaces words occurring fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenisation: "#" replaces all digits, all letters converted to lowercase, "UNK" replaces words occurring fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenisation | (i) Encodes the position information of the input words; (ii) input text represented with a 200-dimension Word2Vec model trained on the Gigaword corpus; (iii) continuous features such as TF-IDF represented using bins and a one-hot representation of the bins; (iv) lookup embeddings for part-of-speech and named-entity tags
[52] | Zhou et al. | PTB tokenisation: "#" replaces all digits, all letters converted to lowercase, "UNK" replaces words occurring fewer than 5 times | Word embedding of size 300
[53] | Cao et al. | Normalisation and tokenisation: "#" replaces digits, words converted to lowercase, "UNK" replaces the least frequent words | GloVe word embedding with dimension 200
[54] | Cai et al. | Byte pair encoding (BPE) segmentation | Transformer
[50] | Adelson et al. | Articles and their headlines converted to lowercase | GloVe word embedding
[29] | Lopyrev | Tokenisation; articles and headlines converted to lowercase; ⟨unk⟩ replaces rare words | Distributed representation of the input
[38] | Jobson et al. | — | Word embedding randomly initialised and updated during training; GloVe word embedding used in the second and third models
[56] | See et al. | — | Word embedding learned from scratch instead of a pretrained model
[57] | Paulus et al. | Same as [55] | GloVe
[58] | Liu et al. | — | CNN with maximum pooling encodes the discriminator input sequence
[30] | Song et al. | Words segmented using the CoreNLP tool; coreference resolved; morphology reduced | CNN used to represent the phrases
[35] | Al-Sabahi et al. | — | Word embedding learned from scratch during training, dimension 128
[59] | Li et al. | Same as [55] | Learned from scratch during training
[60] | Kryscinski et al. | Same as [55] | Embedding layer with dimension 400
[61] | Yao et al. | — | Word embedding learned from scratch during training, dimension 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | — | BERT
[63] | Wang et al. | WordPiece tokeniser | BERT
[64] | Egonmwan et al. | — | GloVe word embedding with dimension 300


Table 4 lists the datasets that were used to train and evaluate the summarisation methods in the research papers reviewed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually created reference summary [73]. The package consists of several measures for evaluating the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is an n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimisation, mechanism, and search at the decoder.

Reference | Authors | Training and optimisation | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise the negative log-likelihood | — | Beam search
[39] | Chopra et al. | Minimising the negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimise the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimising the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimising the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross entropy used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | —
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanisms | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | —
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | —
[30] | Song et al. | — | Attention mechanism, copy mechanism | —
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process that requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\[
\text{ROUGE-}N = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}, \tag{1}
\]

where S is a reference summary, n is the n-gram length, and Count_match(gram_n) is the maximum number of n-grams co-occurring in the reference summary and the generated summary; Count(gram_n) is the total number of n-grams in the reference summary [73].
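For a single reference summary, equation (1) can be computed with a few lines of Python, as in the sketch below (a simplified recall-only version without the summation over multiple references).

```python
from collections import Counter

def rouge_n(reference_tokens, candidate_tokens, n=1):
    """ROUGE-N recall for one reference: clipped n-gram matches divided by
    the number of n-grams in the reference, following equation (1)."""
    ref_ngrams = Counter(tuple(reference_tokens[i:i + n])
                         for i in range(len(reference_tokens) - n + 1))
    cand_ngrams = Counter(tuple(candidate_tokens[i:i + n])
                          for i in range(len(candidate_tokens) - n + 1))
    overlap = sum(min(count, cand_ngrams[gram]) for gram, count in ref_ngrams.items())
    total = sum(ref_ngrams.values())
    return overlap / total if total else 0.0

ref = "ahmed ate the apple".split()
cand = "the apple ahmed ate".split()
print(rouge_n(ref, cand, n=1))   # 1.0 (all unigrams match)
print(rouge_n(ref, cand, n=2))   # 2/3 ("ahmed ate" and "the apple" match)
```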

ROUGE-L is based on the longest common subsequence (LCS), which is the maximum length of the common matching words between the reference summary and the generated summary. The LCS calculation does not require the matching words to be consecutive; however, their order of occurrence matters, and no predefined number of matching words is required. A disadvantage of ROUGE-L is that it considers only the main in-sequence, so the final score does not include other matches.

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets.

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarisation evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | —
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


For example, assume that the reference summary R and the automatic summary A are as follows:
R: Ahmed ate the apple.
A: The apple Ahmed ate.
In this case, ROUGE-L will count either "Ahmed ate" or "the apple" as the LCS, but not both.
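The LCS underlying ROUGE-L can be computed with standard dynamic programming, as in the following sketch, which reproduces the behaviour of the example above.

```python
def lcs_length(ref_tokens, cand_tokens):
    """Longest common subsequence length used by ROUGE-L: order matters,
    but the matching tokens need not be consecutive."""
    m, n = len(ref_tokens), len(cand_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

R = "ahmed ate the apple".split()
A = "the apple ahmed ate".split()
print(lcs_length(R, A))   # 2 -> either "ahmed ate" or "the apple", not both
```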

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]; these models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model: 43.85, 20.34, and 39.90, respectively [65]. Even though ROUGE is employed to evaluate abstractive summarisation, it would be better to develop new methods to evaluate the quality of such summaries: new evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text. ROUGE, in contrast, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups: the first group considers single-sentence summary approaches, while the second group considers multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L on the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model, with values of 37.27, 18.19, and 34.62, respectively [54].

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods on the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by the text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both single-sentence and multisentence summaries are those that employ BERT word embedding and are based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset, as Gigaword is used for single-sentence summaries (it contains headlines that are treated as summaries), while the CNN/Daily Mail datasets contain multisentence summaries; thus, the summaries in the CNN/Daily Mail datasets are longer than those in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the summaries generated by 5 models for 50 test examples [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates a low level of readability. The results show that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though it was not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summaries [60]: five human evaluators evaluated the relevance and readability of 100 randomly selected test examples, using scores from 1 to 10, where 1 indicates that the generated summary is less readable and less relevant and 10 indicates that it is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. With respect to relevance, the mean values of the three models are close, at 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summaries in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retention of key information, which was assessed by human evaluation [65]. The qualitative evaluation also assessed the output in terms of grammatical mistakes. Three values were used for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summaries of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures alone are also insufficient, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, the previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either a gold token is used, as during training, or the output of the previous step is used, as during both testing and training.


In this manner, at least the training step receives the same input as testing. In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76]. Moreover, the mass convolution of the QRNN is applied in [50], since the dependency of words generated in the future is difficult to determine.

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]: when the switch is turned off, the decoder uses the pointer to point to the word in the source and copy it into the summary, and when the switch is turned on, the decoder generates a word from the target vocabulary. In [56], OOV words were addressed via a generation probability Pgen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them, and the combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the position of unseen tokens in the input sequence in order to copy them.
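The sketch below illustrates the pointer-generator mixture described above: Pgen scales the vocabulary distribution, and (1 - Pgen) scales the copy (attention) distribution over source positions, with OOV source words mapped to extended-vocabulary ids. The weight vectors and dimensions are hypothetical rather than those of [56].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def final_distribution(p_vocab, attention, src_ids, context, dec_state, dec_input,
                       w_c, w_s, w_x, b, extended_vocab_size):
    """Mix the vocabulary distribution and the copy distribution with p_gen."""
    p_gen = sigmoid(w_c @ context + w_s @ dec_state + w_x @ dec_input + b)
    dist = np.zeros(extended_vocab_size)
    dist[: len(p_vocab)] = p_gen * p_vocab          # generate from the vocabulary
    for attn, src_id in zip(attention, src_ids):    # copy from the source text
        dist[src_id] += (1.0 - p_gen) * attn        # OOV words get extended ids
    return dist

rng = np.random.default_rng(4)
p_vocab = np.full(10, 0.1)                 # fixed vocabulary of 10 words
attention = np.array([0.5, 0.3, 0.2])      # attention over 3 source words
src_ids = [2, 7, 11]                       # id 11 is an OOV word (extended vocabulary)
dist = final_distribution(p_vocab, attention, src_ids,
                          rng.normal(size=8), rng.normal(size=8), rng.normal(size=8),
                          rng.normal(size=8), rng.normal(size=8), rng.normal(size=8),
                          0.0, extended_vocab_size=12)
print(dist.sum())   # 1.0 (a proper distribution over the extended vocabulary)
```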

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | —
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset.


Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. In [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both are due to the summarisation of long documents and the production of long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention steps. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and to avoid attending to the same parts of the input at different decoder steps. However, the intra-attention encoder mechanism cannot address all repetition challenges, especially when a long sequence is generated; thus, an intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words. Moreover, the proposed intradecoder attention mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy maximum-likelihood loss and reinforcement learning to minimise exposure bias. In addition, the trigram probability p(yt) was used to address repetition in the generated summary, where yt is a trigram sequence: the value of p(yt) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated by a secondary encoder to remember the previously generated decoder output, and the coverage mechanism was utilised.
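A minimal sketch of the coverage idea is given below: the coverage vector accumulates the attention distributions of previous decoder steps, and a coverage loss penalises attending again to already covered positions (the loss form follows [56]; the toy numbers are purely illustrative).

```python
import numpy as np

def coverage_update(coverage, attention):
    """Return the updated coverage vector and the coverage loss for one step:
    loss = sum_i min(coverage_i, attention_i), computed before accumulation."""
    coverage_loss = np.minimum(coverage, attention).sum()
    return coverage + attention, coverage_loss

coverage = np.zeros(4)                          # 4 source positions
for attn in (np.array([0.7, 0.2, 0.1, 0.0]),    # step 1
             np.array([0.6, 0.3, 0.1, 0.0])):   # step 2 re-attends to position 0
    coverage, loss = coverage_update(coverage, attn)
    print(loss)   # 0.0 at step 1, then 0.9 at step 2
```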

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts; 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. To address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, in which the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised the copying and coverage mechanisms.

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary consists of the highlights of the news article. Every highlight represents a sentence in the summary; therefore,

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets (models compared: Words-lvt2k-temp-att, pointer-generator + coverage, reinforcement learning with intra-attention, maximum-likelihood + RL with intra-attention, adversarial network, ATSDL, bidirectional attentional encoder-decoder, key information guide network, ML + RL ROUGE + Novel with LM, DEATS, BiSum, BERTSUMEXT (large), BEAR (large + WordPiece), TRANS-ext + filter + abs, and DAPT + imp-coverage (RL + MLE (ss))).


the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all of the crucial points of the article; therefore, producing a high-quality dataset requires considerable effort. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset is available but is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation; however, in abstractive summarisation, ROUGE is not sufficient since it depends on exact matching between words. For example, the words "book" and "books" are considered different by every ROUGE metric. Therefore, a new evaluation measure must be proposed that considers the context of the words (words that have the same meaning must be considered the same even if they have different surface forms). In this case, we propose the use of METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for languages with flexible word order, it is better to use ROUGE without considering the order of the words.
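The following small Python sketch illustrates the surface-matching limitation: a unigram-recall score in the spirit of ROUGE-1 treats "book" and "books" as different words unless some normalisation is applied (here a deliberately crude, toy stemmer, not METEOR itself):

    from collections import Counter

    def rouge1_recall(reference, candidate, normalise=lambda w: w):
        """Unigram recall: fraction of reference words matched in the candidate."""
        ref = Counter(normalise(w) for w in reference.lower().split())
        cand = Counter(normalise(w) for w in candidate.lower().split())
        overlap = sum(min(c, cand[w]) for w, c in ref.items())
        return overlap / max(sum(ref.values()), 1)

    crude_stem = lambda w: w[:-1] if w.endswith("s") else w   # toy stemmer, for illustration only

    reference = "the author wrote two books"
    candidate = "the author wrote a book"
    print(rouge1_recall(reference, candidate))              # 0.6: "books" != "book" on the surface
    print(rouge1_recall(reference, candidate, crude_stem))  # 0.8: stemmed forms are counted as equal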

The quality of the generated summary can also be improved using linguistic features.

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


For example, we propose the use of dependency parsing at the encoder in a separate layer on top of the first hidden state layer. We also propose the use of word embeddings that are built by considering dependency parsing or part-of-speech tagging. On the decoder side, the quality of the beam search can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on recent trends and the evaluation results, we think that the most promising direction is the use of the pretrained BERT model. The quality of models that are based on the Transformer is high, and they will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that apply deep learning to abstractive text summarisation, the datasets used, and the measures for evaluating these approaches. Moreover, the challenges encountered when employing various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the

New York Times datasets. Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively. The best results were achieved by models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, there are several issues that must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.

[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.

[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.

[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.

[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.

[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.

[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.

[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.

[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.

[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.

[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, India, 2017.

[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.

[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.

[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.

[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.

[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, India, 2018.

[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.

[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.

[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, "Neural abstractive text summarization with sequence-to-sequence models: a survey," 2020, http://arxiv.org/abs/1812.02303.

[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.

[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.

[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.

[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.

[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

[29] K. Lopyrev, "Generating news headlines with recurrent neural networks," p. 9, 2015, https://arxiv.org/abs/1512.01712.

[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.

[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.

[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.

[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of NAACL-HLT 2016, pp. 93–98, San Diego, CA, USA, 2016.

[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.

[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.

[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.

[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.

[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.

[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.

[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.

[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.

[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.

[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," 2016, https://arxiv.org/abs/1611.01576.

[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.

[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.

[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.

[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of CoNLL-16, Berlin, Germany, August 2016.

[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1073–1083, Vancouver, Canada, 2017.

[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.

[58] L. Liu, Y. Lu, M. Yang, Q. Qu, J. Zhu, and H. Li, "Generative adversarial network for abstractive text summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2018.

[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.

[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.

[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.

[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.

[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.

[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.

[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.

[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.

[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.

[70] C. Napoles, M. Gormley, and B. Van Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.

[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.

[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.

[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.

[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.

[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.

[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.



Figure 6: Comparison of the CNN, LSTM, and QRNN models [50].

Figure 7: Selective encoding for abstractive sentence summarisation (SEASS) [52].

Figure 8: Faithful to the original [53].


a triple or tuple relation. A triple relation consists of the subject, predicate, and object, while a tuple relation consists of either (subject and predicate) or (predicate and object). Sometimes the triple relation cannot be extracted; in this case, two tuple relations are utilised. The decoder gate merges both context vectors based on their relative association.

(3) Others. The poor semantic representation of long sequences in abstractive text summarisation approaches that are based on an RNN encoder-decoder framework was addressed using the RC-Transformer (RCT) [54]. The RCT is an RNN-based abstractive text summarisation model that is composed of two encoders (an RC encoder and a Transformer encoder) and one decoder. The Transformer has an advantage in parallel computing, in addition to retrieving global semantic relationships in the context. On the other hand, sequential context representation is achieved by the second encoder of the RCT. Word ordering is crucial for abstractive text summarisation and cannot be captured by positional encoding alone; therefore, the RCT utilises two encoders to address the shortage of sequential information at the word level. A beam search was utilised at the decoder. Furthermore, Cai et al. compared the speed of the RCT model with that of RNN-based models and concluded that the RCT is 1.4x and 1.2x faster.

3.2. Word Embedding. In the QRNN model, GloVe word embedding, pretrained on the Wikipedia and Gigaword datasets, was used to represent the text and summary [50]. In the first model proposed by Jobson et al., the word embedding was randomly initialised and updated during training, while GloVe word embedding was employed to represent the words in the second and third models [38]. In the study by Cai et al., the Transformer was utilised [54].

3.3. Dataset and Dataset Preprocessing. In the model proposed by Rush et al., the datasets were preprocessed via PTB tokenisation by using "#" to replace all digits, converting all letters to lowercase, and using "UNK" to replace words that occurred fewer than 5 times [18]. The model was trained with any input-output pairs due to the shortage of constraints for generating the output. The training process was carried out on the Gigaword dataset, while the summarisation evaluation was conducted on DUC2003 and DUC2004 [18]. Furthermore, the model proposed by Chopra et al. was trained using the Gigaword corpus with sentence separation and tokenisation [39]. To form sentence-summary pairs, each headline of an article was paired with the first sentence of the article. The same data preprocessing steps of [18] were performed in [39]. Moreover, the Chopra et al. model was evaluated using the DUC2004 dataset, which consists of 500 pairs.
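A minimal sketch of this style of preprocessing is shown below, assuming "#" as the digit placeholder and the 5-occurrence threshold mentioned above; the function and toy corpus are illustrative, not the original pipeline:

    import re
    from collections import Counter

    def preprocess(corpus, min_count=5):
        """Lowercase, replace digits with '#', and map words occurring fewer
        than `min_count` times to 'UNK' (thresholds follow the description above)."""
        tokenised = [re.sub(r"\d", "#", line.lower()).split() for line in corpus]
        counts = Counter(w for line in tokenised for w in line)
        return [[w if counts[w] >= min_count else "UNK" for w in line] for line in tokenised]

    corpus = ["The index rose 120 points on Monday"] * 5 + ["Zyxwv fell sharply"]
    processed = preprocess(corpus)
    print(processed[0])     # ['the', 'index', 'rose', '###', 'points', 'on', 'monday']
    print(processed[-1])    # ['UNK', 'UNK', 'UNK'] - rare words are mapped to UNK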

Gigaword datasets were also employed by the QRNN model [50]. Articles whose first sentence contained more than 50 words or whose headline contained more than 25 words were removed. Moreover, the words in the articles and their headlines were converted to lowercase, and the data points were split into short, medium, and long sentences based on sentence length to avoid extra padding.

Lopyrev and Jobson et al. trained their models using Gigaword after preprocessing the data. In the Lopyrev model, the most crucial preprocessing steps for both the text and the headline were tokenisation and conversion of characters to lowercase [29]. In addition, only the first paragraph of each article was retained, and the length of the headline was fixed at between 25 and 50 words. Moreover, articles without headlines were discarded, and the ⟨unk⟩ symbol was used to replace rare words.

Khandelwal employed the Association for Computational Linguistics (ACL) Anthology Reference Corpus, which consists of 16,845 examples for training and 500 examples for testing; these were considered small datasets in the experiments [51]. The abstract comprised the first three sentences, and the unigram overlap between the title and the abstract was also calculated. There were 25 tokens in the summary and a maximum of 250 tokens in the input text.

The English Gigaword dataset, the DUC2004 corpus, and MSR-ATC were selected to train and test the SEASS model [52]. Moreover, the experiments on the Cao et al. model were conducted using the Gigaword dataset [53]. The same data preprocessing steps of [18] were performed in [52, 53]. Moreover, the RCT also employed the Gigaword and DUC2004 datasets in its experiments [54].

3.4. Evaluation and Results. Recall-Oriented Understudy for Gisting Evaluation 1 (ROUGE1), ROUGE2, and ROUGE-L were utilised to evaluate the Rush et al. model, and values of 28.18, 8.49, and 23.81, respectively, were obtained [18]. The experimental results of the Chopra et al. model showed that, although DUC2004 was too complex for the experiments on the Gigaword corpus, the proposed model outperformed state-of-the-art methods in terms of ROUGE1, ROUGE2, and ROUGE-L [39]; the values of ROUGE1, ROUGE2, and ROUGE-L were 28.97, 8.26, and 24.06, respectively. On the other hand, BLEU was employed to evaluate the Lopyrev model [29], while Khandelwal utilised perplexity [51]. The SEASS model was evaluated using ROUGE1, ROUGE2, and ROUGE-L, and the results of the three measures were 36.15, 17.54, and 33.63, respectively [52]. Moreover, ROUGE1, ROUGE2, and ROUGE-L were selected for evaluating the Cao et al. model [53]; the values were 37.27, 17.65, and 34.24, respectively, and the results showed that fake summaries were reduced by 80%. In addition, the RCT was evaluated using ROUGE1, ROUGE2, and ROUGE-L, with values of 37.27, 18.19, and 34.62 on the Gigaword dataset. The results showed that the RCT model outperformed other models by generating high-quality summaries that contain salient information [54].

10 Mathematical Problems in Engineering

4. Multisentence Summary

In this section, multisentence summarisation with deep learning-based abstractive text summarisation is discussed. Multisentence summary methods include get to the point (summarisation with pointer-generator networks) [56], a deep reinforced model for abstractive summarization (RL) [57], a generative adversarial network for abstractive text summarization [58], semantic phrase exploration (ATSDL) [30], a bidirectional attentional encoder-decoder with bidirectional beam search [35], the key information guide network [59], text summarisation abstraction improvement [60], dual encoding for abstractive text summarisation (DEATS) [61], abstractive document summarisation via a bidirectional decoder (BiSum) [62], the text abstraction summary model based on BERT word embedding and RL [63], the transformer-based model for single-document neural summarisation [64], text summarisation with pretrained encoders [65], and the text summarisation method based on the double attention pointer network [49]. The pointer-generator [55] covers both single-sentence and multisentence summaries. Additional details are presented in the following sections.

4.1. Abstractive Summarization Architecture

4.1.1. LSTM-RNN. A novel abstractive summarisation method was proposed in [56]; it generates a multisentence summary and addresses sentence repetition and inaccurate information. See et al. proposed a model that consists of a single-layer bidirectional LSTM encoder, a single-layer unidirectional LSTM decoder, and the sequence-to-sequence attention model proposed by [55]. The See et al. model generates a long text summary instead of headlines, which consist of only one or two sentences. Moreover, the attention mechanism was employed, and the attention distribution facilitates the production of the next word of the summary by telling the decoder where to look in the source words, as shown in Figure 9. This mechanism constructs a weighted sum of the hidden states of the encoder, which facilitates the generation of the context vector, where the context vector is a fixed-size representation of the input. The probability Pvocab produced by the decoder is employed to generate the final prediction using the context vector and the decoder's last step. Furthermore, the value of Pvocab is equal to zero for OOV words. RL was employed for abstractive text summarisation in [57]. The method proposed in [57], which combines RL with supervised word prediction, is composed of a bidirectional LSTM-RNN encoder and a single LSTM decoder.
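For illustration, a single decoder step of such an attention mechanism can be sketched with NumPy as follows; the weight matrices, dimensions, and scoring function are assumptions made for this sketch rather than the configuration used in [56, 57]:

    import numpy as np

    rng = np.random.default_rng(0)
    T, h, V = 6, 8, 20                       # source length, hidden size, vocabulary size
    enc_states = rng.normal(size=(T, h))     # encoder hidden states
    dec_state = rng.normal(size=h)           # current decoder hidden state
    W_attn = rng.normal(size=(h, h))         # hypothetical attention weights
    W_out = rng.normal(size=(V, 2 * h))      # hypothetical output projection

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    scores = enc_states @ W_attn @ dec_state     # alignment score per source position
    attention = softmax(scores)                  # "where to look" in the source
    context = attention @ enc_states             # weighted sum of encoder states = context vector
    p_vocab = softmax(W_out @ np.concatenate([context, dec_state]))
    print(attention.round(2), p_vocab.argmax())  # next-word prediction for this step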

Two models, a generative model and a discriminative model, were trained simultaneously to generate abstractive summary text using an adversarial process [58]. The maximum likelihood estimation (MLE) objective function employed in previous sequence-to-sequence models suffers from two problems: the difference between the training loss and the evaluation metric, and the unavailability of a golden token at testing time, which causes errors to accumulate during testing. To address these problems, the proposed approach exploited the adversarial framework. In the first step of the adversarial framework, reinforcement learning was employed to optimise the generator, which generates the summary from the original text. In the second step, the discriminator, which acts as a binary classifier, classified the summary as either a ground-truth summary or a machine-generated summary. The bidirectional LSTM encoder and attention mechanism were employed as in [56].

Abstractive text summarisation using the LSTM-CNN model based on exploring semantic phrases (ATSDL) was proposed in [30]. ATSDL is composed of two phases: the first phase extracts phrases from the sentences, while the second phase learns the collocation of the extracted phrases using the LSTM model. To generate sentences that are general and natural, the input and output of the ATSDL model are phrases instead of words, and the phrases are divided into three main types, i.e., subject, relation, and object phrases, where the relation phrase represents the relation between the input phrase and the output phrase. Each phrase is represented using a CNN layer. There are two main reasons for choosing the CNN: first, the CNN is efficient for sentence-level applications, and second, training is efficient since long-term dependency is unnecessary. Furthermore, to obtain several vectors for a phrase, multiple kernels with different widths, which represent the dimensionality of the features, were utilised. Within each kernel, the maximum feature is selected for each row in the kernel via max pooling, and the resulting values are added to obtain the final value for each word in a phrase. Bidirectional LSTM was employed instead of a GRU on the encoder side since the parameters are easy to tune with LSTM. Moreover, the decoder has two modes: a generate mode and a copy mode. The generate mode generates the next phrase of the summary based on previously generated phrases and the hidden layers of the input on the encoder side, while the copy mode copies the phrase that follows the current input phrase if the currently generated phrase is not suitable for the previously generated phrases of the summary. Figure 10 provides additional details.
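The phrase-encoding step (convolution filters of several widths followed by max-over-time pooling) can be sketched as follows; the filter widths, dimensions, and random weights are purely illustrative and not taken from the ATSDL implementation:

    import numpy as np

    rng = np.random.default_rng(1)
    emb_dim, n_filters = 10, 4
    phrase = rng.normal(size=(5, emb_dim))            # 5 words, each an embedding vector

    def conv_max_pool(phrase, width, filters):
        """Slide `filters` (n_filters x width x emb_dim) over the phrase and keep,
        for each filter, the maximum response (max-over-time pooling)."""
        n = phrase.shape[0] - width + 1
        windows = np.stack([phrase[i:i + width].ravel() for i in range(n)])  # (n, width*emb_dim)
        feature_maps = windows @ filters.reshape(filters.shape[0], -1).T     # (n, n_filters)
        return feature_maps.max(axis=0)

    phrase_vec = np.concatenate([
        conv_max_pool(phrase, w, rng.normal(size=(n_filters, w, emb_dim)))
        for w in (2, 3)                                # two kernel widths
    ])
    print(phrase_vec.shape)                            # (8,) = 2 widths x 4 filters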

Bidirectional encoder and decoder LSTM-RNNs were employed to generate abstractive multisentence summaries [35]. The proposed approach considers both past and future context on the decoder side when making a prediction, as it employs a bidirectional RNN. Using a bidirectional RNN on the decoder side addresses the problem of summary imbalance. An unbalanced summary can occur due to noise in a previous prediction, which reduces the quality of all subsequent summary tokens. The bidirectional decoder consists of two LSTMs: the forward decoder and the backward decoder. The forward decoder decodes the information from left to right, while the backward decoder decodes the information from right to left. The last hidden state of the forward decoder is fed as the initial input to the backward decoder, and vice versa. Moreover, the researchers proposed a bidirectional beam-search method that generates summaries from the proposed bidirectional model. Bidirectional beam search combines information from the past and the future to produce a better summary. Therefore, the

Mathematical Problems in Engineering 11

output summary is balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence is read in reverse order, based on the conclusion that an LSTM learns better when reading the source in reverse order while remembering the order of the target [66, 67]. A softmax layer is employed on the decoder side to obtain the probability of each target word of the summary over the vocabulary distribution, by taking the output of the decoder as input to the softmax layer. The decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. The objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token; however, during testing, the input of the forward decoder is the token generated in the previous step. The same situation holds for the backward decoder, where the input during training is the future token from the summary. Nevertheless, the bidirectional decoder has difficulty during testing, since the complete summary must be known in advance; thus, the full backward decoder output is generated first and fed to the forward decoder using a unidirectional backward beam search.

A combination of abstractive and extractive methods was employed in the guided generation model proposed by [59]. The extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent the key information. Furthermore, to predict the final summary with long-term value, the proposed method applies a prediction guide mechanism [68]. A prediction

Figure 9: Baseline sequence-to-sequence model with attention mechanism [56].

Figure 10: Semantic-unit-based LSTM model [30].


guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. The encoder-decoder baseline of the proposed model is similar to that proposed by Nallapati et al. [55], where a bidirectional LSTM encoder and a unidirectional LSTM decoder are employed. Both models apply the attention mechanism and a softmax layer. Moreover, the process of generating the summary is improved by KIGN, which takes as input the keywords extracted using the TextRank algorithm. In KIGN, the key information is represented by concatenating the last forward hidden state and the first backward hidden state. KIGN employs the attention mechanism and the pointer mechanism. In general, the attention mechanism can hardly identify the keywords; thus, to identify keywords, the output of KIGN is fed to the attention mechanism, and as a result, the attention mechanism is strongly affected by the keywords. Furthermore, to enable the pointer network to identify the keywords output by KIGN, the encoder context vector and the hidden state of the decoder are fed to the pointer network, and the output is employed to calculate the soft switch. The soft switch determines whether to copy the target from the original text or to generate it from the target vocabulary, as shown in Figure 11.
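A minimal sketch of the key-information vector and the soft switch is given below; the scoring function and weights are assumptions, and the code only illustrates the idea of mixing the key vector, context vector, and decoder state into a copy/generate decision:

    import numpy as np

    rng = np.random.default_rng(2)
    h = 6
    fwd_states = rng.normal(size=(8, h))     # forward encoder states over the keyword sequence
    bwd_states = rng.normal(size=(8, h))     # backward encoder states over the keyword sequence

    key_info = np.concatenate([fwd_states[-1], bwd_states[0]])   # last forward + first backward state

    context = rng.normal(size=h)             # encoder context vector at this decoder step
    dec_state = rng.normal(size=h)           # decoder hidden state
    w_switch = rng.normal(size=key_info.size + 2 * h)            # hypothetical switch weights

    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    p_copy = sigmoid(w_switch @ np.concatenate([key_info, context, dec_state]))
    print("copy from source" if p_copy > 0.5 else "generate from vocabulary",
          round(float(p_copy), 3))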

The level of abstraction of the summaries generated by abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and the use of a novel metric for optimising the overlap between the n-grams of the summary and the ground-truth summary. The decoder was decomposed into a contextual network and a pretrained language model, as shown in Figure 12. The contextual network uses the source document to extract the relevant parts, and the pretrained language model contributes prior knowledge. This decomposition method facilitates the addition of an external pretrained language model that is related to several domains. Furthermore, a novel metric was employed to generate an abstractive summary by including words that are not in the source document. Bidirectional LSTM was utilised in the encoder, and the decoder applied a 3-layer unidirectional weight-dropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applies the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combines reinforcement learning and maximum likelihood.

A bidirectional decoder with a sequence-to-sequence architecture, referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing because the input of the decoder is the previously generated summary word; if one of the generated summary words is incorrect, the error propagates through all subsequent summary words. In the bidirectional decoder, there are two decoders: a forward decoder and a backward decoder. The forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left, and the forward decoder takes a reference from the backward decoder. However, there is only a single-layer encoder. The encoder and decoders employ LSTM units, but while the encoder utilises bidirectional LSTM, the decoders use unidirectional LSTM, as shown in Figure 13. To understand the summary generated by the backward decoder, the attention mechanism is applied in both the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, an attention mechanism is employed in both decoders.

A double attention pointer network, referred to as DAPT, was applied to build an abstractive text summarisation model [49]. The encoder utilised bidirectional LSTM, while the decoder utilised unidirectional LSTM. The key features of the encoder were extracted using a self-attention mechanism. At the decoder, a beam search was employed, and more coherent and accurate summaries were generated. The repetition problem was addressed using an improved coverage mechanism with a truncation parameter. The model was optimised by generating a training model that is based on RL and scheduled sampling.
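A coverage vector with a truncation cap can be sketched as follows; the cap value and the form of the coverage penalty are assumptions made for illustration and are not taken from [49]:

    import numpy as np

    def update_coverage(coverage, attention, cap=1.0):
        """Accumulate attention into the coverage vector, clipping each entry at `cap`
        (the truncation parameter); the penalty term discourages re-attending."""
        penalty = np.minimum(coverage, attention).sum()   # standard coverage loss term
        coverage = np.minimum(coverage + attention, cap)  # truncated accumulated coverage
        return coverage, penalty

    coverage = np.zeros(4)
    for attention in (np.array([0.7, 0.1, 0.1, 0.1]),
                      np.array([0.6, 0.2, 0.1, 0.1])):    # attends to position 0 twice
        coverage, penalty = update_coverage(coverage, attention)
        print(coverage.round(2), round(float(penalty), 2))   # second step pays a large penalty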

4.1.2. GRU-RNN. Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. The dual encoder consists of two levels of encoders, i.e., a primary encoder and a secondary encoder, in addition to one decoder, and all of them employ a GRU. The primary encoder performs coarse encoding, while the secondary encoder performs fine encoding. The primary encoder and the decoder are the same as in the standard encoder-decoder model with an attention mechanism, and the secondary encoder generates a new context vector that is based on the previous output and the input. Moreover, the additional context vector provides meaningful information for the output; thus, the repetition problem of the generated summary that was encountered in previous approaches is addressed. The semantic vector is generated on both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is calculated. The fixed-length output is partially generated at each stage in the decoder, since it decodes in stages.

Figure 14 elaborates the DEATS process. The primary encoder produces a hidden state h_j^p for each input j and a content representation c^p. Next, the decoder decodes a fixed-length output, which is referred to as the decoder content representation c^d. The weight α_j can be calculated using the hidden states h_j^p and the content representations c^p and c^d. In this stage, the secondary encoder generates new hidden states, or semantic context vectors, h_m^s, which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator, a copy mechanism, and a coverage mechanism.
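The recalculation of the semantic vector can be illustrated with the following NumPy sketch, where the scoring function used to obtain α_j is an assumption; the sketch only mirrors the flow of h_j^p, c^p, and c^d described above:

    import numpy as np

    rng = np.random.default_rng(3)
    T, h = 5, 6
    h_p = rng.normal(size=(T, h))        # primary encoder hidden states h_j^p
    c_p = h_p.mean(axis=0)               # primary content representation c^p
    c_d = rng.normal(size=h)             # decoder content representation c^d (fixed-length partial output)
    W = rng.normal(size=(h, 3 * h))      # hypothetical scoring weights

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # importance weight alpha_j for each input word, computed from h_j^p, c^p, and c^d
    scores = np.array([np.tanh(W @ np.concatenate([h_p[j], c_p, c_d])).sum() for j in range(T)])
    alpha = softmax(scores)

    h_s = alpha[:, None] * h_p           # re-weighted (secondary) hidden states h_m^s fed to the decoder
    print(alpha.round(2), h_s.shape)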

Wang et al. proposed a hybrid extractive-abstractive text summarisation model that is based on combining reinforcement learning with BERT word embedding [63]. In this hybrid model, a BERT feature-based strategy was used to generate contextualised token embeddings. The model consists of two submodels, an abstractive agent and an extractive agent, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In both the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. The training process consists of a pretraining phase and a full training phase.

Egonmwan et al. proposed the use of sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules: an extractive model and an abstractive model. The transformer encoder has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives a sentence-level representation. The architecture of the abstractive model consists of a single-

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].


layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder. The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.

4.1.3. Others. BERT is employed to represent the sentences of the document and to express its semantics [65]. Liu et al. proposed abstractive and extractive summarisation models that are based on an encoder-decoder architecture. The encoder used a BERT pretrained document-level encoder, while the decoder utilised a Transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and the decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage in extractive summarisation and one stage in abstractive summarisation. On the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed, since these two mechanisms need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries using trigram blocking. OOV words rarely appear in the generated summary.

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, in the Paulus et al. model, both the input and output tokens used the same embedding matrix Wemb, which was generated using the GloVe word embedding model [57]. Another word embedding matrix, referred to as Wout, was applied in the token generation layer. Additionally, a shared weighting matrix was employed by both the shared embedding matrix Wemb and the Wout matrix. The shared weighting matrices improved the process of generating tokens, since they consider the syntactic and semantic information captured by the embedding.
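Weight sharing between an embedding matrix and the output projection can be sketched as follows; the dimensions and the stand-in decoder state are illustrative, and this is not the configuration of [57]:

    import numpy as np

    rng = np.random.default_rng(4)
    V, d = 10, 6
    W_emb = rng.normal(size=(V, d))            # one matrix, used in two places

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    token_id = 3
    input_vec = W_emb[token_id]                # 1) input lookup: a row of the shared matrix
    hidden = np.tanh(rng.normal(size=(d, d)) @ input_vec)   # stand-in for the decoder state
    logits = W_emb @ hidden                    # 2) output projection reuses the same matrix
    print(softmax(logits).argmax())            # distribution over the same vocabulary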

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].


The discriminator input sequence of the Liu et al. model was encoded using a max-pooling CNN, and the result was passed to a softmax layer [58]. On the other hand, the word embedding applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets, with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embedding. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of source documents of 781 tokens paired with summaries of 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the documents were preprocessed using the same method applied in [55]; the proposed model was evaluated using two datasets, the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consists of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve coreference. The second stage of the ATSDL model is phrase extraction, which includes the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP is Stanford NLP parsing, a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing is performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If a child node has children, then the child is considered a new root with children; this process continues recursively until there are no children for the root, at which point the tree structure is complete. Accordingly, compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are refined before training by applying simple rules. First, the phrase triples at the topmost level are exploited, since they carry the most semantic information. Second, triple phrases whose subject and object phrases contain no nouns are deleted, since nouns carry a considerable amount of conceptual information. Triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To achieve this and determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consist of 92,000 text sources and 219,000 text sources, respectively.
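The refinement rules can be sketched in plain Python over a toy representation of parser output; the (word, POS tag) pairs stand in for real Stanford parser output, and the rule set simply follows the description above:

    def refine_triples(triples):
        """Apply the simple refinement rules described above to (subject, relation, object)
        triples, where each phrase is a list of (word, pos_tag) pairs from a parser."""
        def has_tag(phrase, prefix):
            return any(tag.startswith(prefix) for _, tag in phrase)

        kept = []
        for subj, rel, obj in triples:
            if not (has_tag(subj, "NN") and has_tag(obj, "NN")):
                continue                  # drop triples whose subject/object contain no noun
            if not has_tag(rel, "VB"):
                continue                  # drop triples whose relational phrase has no verb
            kept.append((subj, rel, obj))
        return kept

    triples = [
        ([("the", "DT"), ("president", "NN")], [("visited", "VBD")], [("paris", "NNP")]),
        ([("quickly", "RB")], [("visited", "VBD")], [("paris", "NNP")]),   # no noun in subject
    ]
    print(len(refine_triples(triples)))   # 1: the second triple is filtered out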

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments on DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments on the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1, 41.16; ROUGE2, 15.75; and ROUGE-L, 39.08 [57]. The results for the maximum-likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to assess the quality and readability of the summaries. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]; the value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics for the Al-Sabahi et al. and Li et al. models: values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluation included ROUGE1, ROUGE2, and ROUGE-L, for which values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel score related to n-gram overlap was employed to measure the level of abstraction of the summary. The qualitative evaluation involved the manual evaluation of the proposed model: five participants evaluated 100 full-text summaries in terms of relevance and readability, giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected, and the evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, which utilises the WordPiece tokeniser; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, on the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-answering paradigm, where 20 documents were selected for evaluation. Three values were used for scoring an answer: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model on the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attention encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and the decoder have the same number of hidden states. Additionally, the proposed model contains a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs, a GRU-RNN for the word level and a GRU-RNN for the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering the whole vocabulary, only certain words were added, based on the frequency of the word in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrices for each tag type, similar to word embedding, while the TF-IDF feature was discretised into a fixed number of bins, where a one-hot representation was employed to represent the bin value. The one-hot matrix consists of the number of bin entries, where only one entry is set to one to indicate the TF-IDF value of a certain word. This process permits the TF-IDF values to be treated in the same way as any other tag, by concatenating all the embeddings into one long vector, as shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus, with 3.8 million training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and named-entity tag generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
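The feature-rich input representation (word embedding concatenated with POS and NER tag embeddings and a one-hot TF-IDF bin vector) can be sketched as follows; the embedding sizes and the number of bins are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(5)
    word_emb = {"summit": rng.normal(size=8)}
    pos_emb = {"NN": rng.normal(size=3)}
    ner_emb = {"O": rng.normal(size=3)}
    n_bins = 5                                   # number of TF-IDF bins

    def tfidf_one_hot(tfidf, n_bins, max_tfidf=1.0):
        one_hot = np.zeros(n_bins)
        one_hot[min(int(tfidf / max_tfidf * n_bins), n_bins - 1)] = 1.0   # discretise into a bin
        return one_hot

    def input_vector(word, pos, ner, tfidf):
        """Concatenate word, POS, and NER embeddings with the TF-IDF one-hot bin vector."""
        return np.concatenate([word_emb[word], pos_emb[pos], ner_emb[ner],
                               tfidf_one_hot(tfidf, n_bins)])

    print(input_vector("summit", "NN", "O", 0.42).shape)   # (8 + 3 + 3 + 5,) = (19,)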

Finally, for both the single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of several approaches are shown in Table 2, while the training and optimisation methods, mechanisms, and decoder search strategies are presented in Table 3.

5. Datasets for Text Summarization

Various datasets have been selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets were produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets, even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of each article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets generated by Hermann et al. [71], which were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (29.74 sentences on average), while the summaries have 53 words (3.72 sentences on average) [55].
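As a small illustration of how the multisentence reference summaries are formed, the sketch below concatenates an article's highlight bullet points into a single summary string; the example highlights are invented placeholders, not taken from the dataset.

def build_summary(highlights):
    """Join article highlights into a multisentence reference summary."""
    return " ".join(h.strip().rstrip(".") + "." for h in highlights)

print(build_summary(["Police arrest two suspects", "Trial begins next month"]))
# -> "Police arrest two suspects. Trial begins next month."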

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].



Figure 16: Word embedding concatenated with discretised TF-IDF, POS, and NER one-hot embedding vectors [55].


Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | —
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | Word-based decoder RNN
[50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM | Unidirectional RNN attentive encoder-decoder LSTM
[38] | 2016 | Bidirectional LSTM | Unidirectional LSTM
[38] | 2016 | Bidirectional LSTM | Decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets: nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study applied each of the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization, replacing all digits with a placeholder symbol, converting all letters to lowercase, and replacing words that occurred fewer than 5 times with "UNK" | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenization, replacing all digits with a placeholder symbol, converting all letters to lowercase, and replacing words that occurred fewer than 5 times with "UNK" | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins and a one-hot representation for the bins; (iv) lookup embedding for part-of-speech and named-entity tags
[52] | Zhou et al. | PTB tokenization, replacing all digits with a placeholder symbol, converting all letters to lowercase, and replacing words that occurred fewer than 5 times with "UNK" | Word embedding with size equal to 300
[53] | Cao et al. | Normalization and tokenization, replacing digits with a placeholder, converting the words to lowercase, and replacing the least frequent words with "UNK" | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lowercase letters | GloVe word embedding
[29] | Lopyrev | Tokenization, converting the articles and their headlines to lowercase letters, and using the symbol ⟨unk⟩ to replace rare words | The input was represented using a distributed representation
[38] | Jobson et al. | — | The word embedding was randomly initialised and updated during training in the first model, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | — | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | — | CNN maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving coreference and reducing morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | — | The word embedding is learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | — | The word embedding is learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | — | BERT
[63] | Wang et al. | Using the WordPiece tokenizer | BERT
[64] | Egonmwan et al. | — | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that were used to train and validate the summarisation methods in the research papers reviewed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures that evaluate the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is an n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise the negative log-likelihood | — | Beam search
[39] | Chopra et al. | Minimizing the negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimize the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimizing the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimizing the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | —
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanisms | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | —
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | —
[30] | Song et al. | — | Attention mechanism, copy mechanism | —
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search, backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\[
\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \; \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \; \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)} \qquad (1)
\]

where S is a reference summary, n is the n-gram length, Count_match(gram_n) is the maximum number of matching n-grams between the reference summary and the generated summary, and Count(gram_n) is the total number of n-grams in the reference summary [73].
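As a concrete illustration of equation (1), the following is a minimal sketch of ROUGE-N recall for a single reference summary; it is a simplified reading of the official ROUGE package [73], with whitespace tokenisation assumed.

from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n):
    """ROUGE-N recall: matched n-grams divided by the n-grams in the reference."""
    ref_counts = ngrams(reference.split(), n)
    cand_counts = ngrams(candidate.split(), n)
    # Clipped counts: an n-gram is matched at most as often as it appears in both summaries.
    matched = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return matched / total if total else 0.0

print(rouge_n("ahmed ate the apple", "the apple ahmed ate", 1))  # 1.0 (all unigrams match)
print(rouge_n("ahmed ate the apple", "the apple ahmed ate", 2))  # 0.666... (2 of 3 bigrams match)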

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of common matching words between the reference summary and the generated summary. The LCS calculation does not require the matched words to be consecutive; however, their order of occurrence is important. In addition, no predefined number of matched words is required. LCS considers only the main in-sequence match, which is one of its disadvantages, since other matches are not reflected in the final score. For example, assume that the reference summary R and the automatic summary A are as follows:


Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets [61].

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarization evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | —
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: The apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" but not both, since LCS keeps only a single in-sequence match.
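To make the LCS-based scoring concrete, the following is a minimal sketch of the longest common subsequence length between tokenised summaries; normalising by the reference length gives a recall-style score and is shown only for illustration, not the full ROUGE-L definition with precision and F-measure.

def lcs_length(ref_tokens, cand_tokens):
    """Classic dynamic-programming LCS over token sequences."""
    m, n = len(ref_tokens), len(cand_tokens)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

ref = "ahmed ate the apple".split()
cand = "the apple ahmed ate".split()
# Only one in-sequence match is kept: either "ahmed ate" or "the apple".
print(lcs_length(ref, cand))             # 2
print(lcs_length(ref, cand) / len(ref))  # 0.5 (recall-style normalisation)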

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]; these models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with the pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. Even though ROUGE is employed to evaluate abstractive summarisation, it would be better to develop new methods for evaluating the quality of such summaries. New evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text; ROUGE, in contrast, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups: the first group considers single-sentence summary approaches, while the second group considers multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are those that employed BERT word embedding and are based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset: Gigaword is utilised for single-sentence summaries, as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries; thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the generated summaries of 50 test examples from 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summaries [60]. Five human evaluators rated the relevance and readability of 100 randomly selected test examples, where scores between 1 and 10 are utilised: a value of 1 indicates that the generated summary is less readable and less relevant, while a value of 10 indicates that the generated summary is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are close, with values of 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summaries in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retention of key information, which was achieved by human evaluation [65]. In addition, a qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were used to evaluate 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summaries of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are also not enough, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, the previous tokens of the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step of the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either the gold token is used (as in standard training) or the prediction from the previous step is used (as at testing time). In this manner, the training step at least partially receives the same input as it does at testing time.

22 Mathematical Problems in Engineering

In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76].

Moreover, the mass convolution of the QRNN is applied in [50], since the dependency of words generated in the future is difficult to determine.
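The coin-flip idea behind DaD and the occasional feedback used with teacher forcing can be sketched as follows; the 10% feedback rate follows [29], and the token values are placeholders.

import random

def next_decoder_input(gold_token, predicted_token, feedback_prob=0.1):
    """Scheduled-sampling style choice of the next decoder input during training.

    With probability feedback_prob, the model's own previous prediction is fed
    back (mimicking test-time conditions); otherwise the gold reference token is used.
    """
    if random.random() < feedback_prob:
        return predicted_token   # behave as at testing time
    return gold_token            # standard teacher forcing

# Usage with placeholder tokens:
print(next_decoder_input("apple", "banana"))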

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copy it to the memory. When the switch is turned on, the decoder generates a word from the target vocabulary. Conversely, the researchers in [56] addressed OOV words via a generation probability P_gen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, P_gen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them. The combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the position of unseen tokens in the input sequence and copy them.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | —
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35


Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset.


Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. Furthermore, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
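A minimal sketch of the pointer-generator mixture of [56] is given below; the tensor shapes and the helper name are illustrative assumptions, and the attention weights are taken as given rather than computed by a full model.

import torch

def final_distribution(vocab_dist, attention, src_ids, extended_vocab_size, p_gen):
    """Mix the generation and copy distributions as in a pointer-generator network.

    vocab_dist: (batch, vocab_size) softmax over the fixed target vocabulary
    attention:  (batch, src_len) attention weights over source positions
    src_ids:    (batch, src_len) source token ids in the extended vocabulary
    p_gen:      (batch, 1) probability of generating from the vocabulary
    """
    batch, vocab_size = vocab_dist.shape
    # Start from the weighted generation distribution, padded for OOV source words.
    final = torch.zeros(batch, extended_vocab_size)
    final[:, :vocab_size] = p_gen * vocab_dist
    # Scatter-add the weighted copy distribution onto the source token ids,
    # so OOV words can still receive probability mass.
    final.scatter_add_(1, src_ids, (1 - p_gen) * attention)
    return final

# Usage with random placeholder tensors: vocabulary of 6 words, extended to 9 with source OOVs.
vocab = torch.softmax(torch.rand(1, 6), dim=-1)
attn = torch.softmax(torch.rand(1, 4), dim=-1)
ids = torch.tensor([[2, 5, 7, 3]])  # id 7 lies beyond the fixed vocabulary (an OOV word)
dist = final_distribution(vocab, attn, ids, extended_vocab_size=9, p_gen=torch.tensor([[0.8]]))
print(dist.sum())  # ~1.0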

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the generated output summary are two challenges that must be considered. Both challenges are due to the summarisation of long documents and the production of long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent the model from attending to the same parts of the input sequence at different decoder steps. However, the intratemporal attention mechanism cannot address all of the repetition challenges, especially when a long sequence is generated; thus, an intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words. Moreover, the proposed intradecoder attention mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy loss, maximum likelihood, and policy-gradient reinforcement learning to minimise the exposure bias. In addition, a trigram probability p(y_t) was proposed to address repetition in the generated summary, where y_t is the trigram sequence: the value of p(y_t) is set to 0 during the beam search at the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated using a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.
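The trigram-repetition rule described above can be sketched as a simple check applied to each beam hypothesis during decoding; this is an illustrative reading of the heuristic in [57], not their exact implementation.

def repeats_trigram(generated_tokens, candidate_token):
    """Return True if appending candidate_token would repeat an already generated trigram."""
    tokens = generated_tokens + [candidate_token]
    if len(tokens) < 3:
        return False
    new_trigram = tuple(tokens[-3:])
    seen = {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 3)}
    return new_trigram in seen

# During beam search, a candidate whose trigram already occurred gets probability 0.
hypothesis = "the cat sat on the mat and the cat sat".split()
print(repeats_trigram(hypothesis, "on"))       # True: "cat sat on" already appeared
print(repeats_trigram(hypothesis, "quietly"))  # False: "cat sat quietly" is new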

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.
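A minimal sketch of how two context vectors (one attending over the source text and one over the extracted fact descriptions) could be merged before prediction is shown below; the gating formulation, layer, and dimensions are illustrative assumptions rather than the exact dual-attention architecture of [53].

import torch
import torch.nn as nn

class DualContextGate(nn.Module):
    """Merge a text context vector and a fact context vector with a learned gate."""

    def __init__(self, hidden_size):
        super().__init__()
        self.gate = nn.Linear(3 * hidden_size, 1)

    def forward(self, text_context, fact_context, decoder_state):
        # g in (0, 1) decides how much to rely on the source text vs. the extracted facts.
        g = torch.sigmoid(self.gate(torch.cat([text_context, fact_context, decoder_state], dim=-1)))
        return g * text_context + (1 - g) * fact_context

# Usage with random placeholder tensors (batch of 2, hidden size 8):
merge = DualContextGate(hidden_size=8)
c = merge(torch.rand(2, 8), torch.rand(2, 8), torch.rand(2, 8))
print(c.shape)  # torch.Size([2, 8])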

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary is composed of the highlights of the news article; every highlight represents a sentence in the summary, and therefore the number of sentences in the summary is equal to the number of highlights.


Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


Sometimes the highlights do not address all of the crucial points of the article; therefore, producing a high-quality dataset requires considerable effort. Moreover, in some languages, such as Arabic, multisentence datasets for abstractive summarisation are not available; single-sentence abstractive Arabic text summarisation datasets are available but are not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation; however, in abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words. For example, the words book and books are considered different by any of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words, so that words with the same meaning are treated as the same even if they have different surface forms. In this case, we propose the use of METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for languages with flexible word order, it is better to use ROUGE without considering the order of the words.

The quality of the generated summary can be improved using linguistic features.


Figure 20: The generator/pointer switching model [55].


Figure 21: Pointer-generator model [56].


For example, we propose the use of dependency parsing at the encoder, in a separate layer on top of the first hidden-state layer, and the use of word embeddings that are built by considering dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the new trends and the evaluation results, we think that the most promising direction among all of these features is the use of the BERT pretrained model. The quality of the models that are based on the transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that applied deep learning for abstractive text summarisation, the datasets, and the measures used to evaluate these approaches. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively. The best results were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, several issues must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.
[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.
[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.


Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].


[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.
[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.
[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.
[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.
[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.
[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.
[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.
[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.
[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.
[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, 2017.
[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.
[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.
[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.
[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.
[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, 2018.
[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.
[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.
[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, http://arxiv.org/abs/1812.02303, 2020.
[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.
[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.
[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.
[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.
[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
[29] K. Lopyrev, Generating News Headlines with Recurrent Neural Networks, p. 9, 2015, https://arxiv.org/abs/1512.01712.
[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.
[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.
[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. Cho, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.
[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.
[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.
[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.
[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.
[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.
[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.
[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.
[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.
[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.
[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, Quasi-Recurrent Neural Networks, https://arxiv.org/abs/1611.01576, 2015.
[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.
[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.
[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.
[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.
[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.
[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.
[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.
[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.
[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.
[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.
[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.
[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.
[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.
[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.
[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.
[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.
[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.
[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.
[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation: text summarization branches out," Proceedings of the ACL-04 Workshop, vol. 8, 2004.
[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.
[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, New Orleans, LA, USA, pp. 708–719, June 2018.
[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.
[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.
[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.
[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.
[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.


Page 10: Deep Learning Based Abstractive Text Summarization ...downloads.hindawi.com/journals/mpe/2020/9365340.pdfDeep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation

a triple or tuple relation A triple relation consists of thesubject predicate and object while the tuple relationconsists of either (subject and predicate) or (predicate andsubject) Sometimes the triple relation cannot be extractedin this case two tuple relations are utilised e decoder gatemerges both context vectors based on their relativeassociation

(3) Others e long-sequence poor semantic representa-tion of abstractive text summarisation approaches whichare based on an RNN encoder-decoder framework wasaddressed using the RC-Transformer (RCT) [54] An RCTis an RNN-based abstractive text summarisation modelthat is composed of two encoders (RC encoder andtransformer encoder) and one decoder e transformershows an advantage in parallel computing in addition toretrieving the global context semantic relationships On theother hand sequential context representation was achievedby a second encoder of the RCT-Transformer Word or-dering is very crucial for abstractive text summarisationwhich cannot be obtained by positioning encodingerefore an RCT utilised two encoders to address theproblem of a shortage of sequential information at the wordlevel A beam search was utilised at the decoder Fur-thermore Cai et al compared the speed of the RCTmodeland that of the RNN-based model and concluded that theRCT is 14x and 12x faster

32 Word Embedding In the QRNN model GloVe wordembedding which was pretrained using the Wikipedia andGigaword datasets was performed to represent the text andsummary [50] In the first model the proposed model byJobson et al the word embedding randomly initialisedand updated during training while GloVe word embed-ding was employed to represent the words in the secondand third models [38] In a study by Cai et al Transformerwas utilised [54]

33 Dataset and Dataset Preprocessing In the model thatwas proposed by Rush et al datasets were preprocessed viaPTB tokenization by using ldquordquo to replace all digits con-version of all letters to lowercase letters and the use ofldquoUNKrdquo to replace words that occurred fewer than 5 times[18] e model was trained with any input-output pairs dueto the shortage of constraints for generating the output etraining process was carried out on the Gigaword datasetswhile the summarisation evaluation was conducted onDUC2003 and DUC2004 [18] Furthermore the proposedmodel by Chopra et al was trained using the Gigawordcorpus with sentence separation and tokenisation [39] Toform sentence-summary pairs each headline of the articlewas paired with the first sentence of the article e samepreprocessing steps of the data in [18] were performed in[39] Moreover the Chopra et al model was evaluated usingthe DUC2004 dataset which consists of 500 pairs

Gigaword datasets were also employed by the QRNNmodel [50] Furthermore articles that started with sentences

that contained more than 50 words or headlines with morethan 25 words were removed Moreover the words in thearticles and their headlines were converted to lowercasewords and the data points were split into short mediumand long sentences based on the lengths of the sentences toavoid extra padding

Lopyrev and Jobson et al trained the model usingGigaword after processing the data In the Lopyrev modelthe most crucial preprocessing steps for both the text and theheadline were tokenisation and character conversion tolowercase [29] In addition only the characters of the firstparagraph were retained and the length of the headline wasfixed between 25 and 50 words Moreover the no-headlinearticles were disregarded and the langunkrang symbol was used toreplace rare words

Khandelwal employed the Association for Computa-tional Linguistics (ACL) Anthology Reference Corpuswhich consists of 16845 examples for training and 500examples for testing and they were considered smalldatasets in experiments [51] e abstract included the firstthree sentences and the unigram that overlaps between thetitle and the abstract was also calculated ere were 25tokens in the summary and there were a maximum of 250tokens in the input text

e English Gigaword dataset DUC2004 corpus andMSR-ATC were selected to train and test the SEASS model[52] Moreover the experiments of the Cao et al model wereconducted using the Gigaword dataset [53] e samepreprocessing steps of the data in [18] were performed in[52 53] Moreover RCT also employed the Gigaword andDUC2004 datasets in experiments [54]

34 Evaluation and Results Recall-Oriented Understudyfor Gisting Evaluation 1 (ROUGE1) ROUGE2 andROUGE-L were utilised to evaluate the Rush et al modeland values of 2818 849 and 2381 respectively wereobtained [18] e experimental results of the Chopra et almodel showed that although DUC2004 was too complexfor the experiments on the Gigaword corpus the proposedmodel outperformed state-of-the-art methods in terms ofROUGE1 ROUGE2 and ROUGE-L [39] e values ofROUGE1 ROUGE2 and ROUGE-L were 2897 826 and2406 respectively On the other hand BLEU wasemployed to evaluate the Lopyrev model [29] whileKhandelwal utilised perplexity [51] e SEASS model wasevaluated using ROUGE1 ROUGE2 and ROUGE-L andthe results of the three measures were 3615 1754 and3363 respectively [52] Moreover ROUGE1 ROUGE2and ROUGE-L were selected for evaluating the Cao et almodel [53] e values of ROUGE1 ROUGE2 andROUGE-L were 3727 1765 and 3424 respectively andthe results showed that fake summaries were reduced by80 In addition the RCT was evaluated using ROUGE1ROUGE2 and ROUGE-L with values 3727 1819 and3462 compared with the Gigaword dataset e resultsshowed that the RCT model outperformed other modelsby generating a high-quality summary that contains silentinformation [54]

10 Mathematical Problems in Engineering

4 Multisentence Summary

In this section multisentence summary and deep learning-based abstractive text summarisation are discussed Multi-sentence summary methods include the get to the pointmethod (summarisation with pointer-generator networks)[56] a deep reinforced model for abstractive summarization(RL) [57] generative adversarial network for abstractive textsummarization [58] semantic phrase exploration (ATSDL)[30] bidirectional attention encoder-decoder and bidirec-tional beam search [35] key information guide network [59]text summarisation abstraction improvement [60] dualencoding for abstractive text summarisation (DEATS) [61]and abstractive document summarisation via bidirectionaldecoder (BiSum) [62] the text abstraction summary modelbased on BERT word embedding and RL [63] transformer-basedmodel for single documents neural summarisation [64]text summarisation with pretrained encoders [65] and textsummarisation method based on the double attention pointernetwork [49] e pointer-generator [55] includes single-sentence andmultisentence summaries Additional details arepresented in the following sections

41 Abstractive Summarization Architecture

411 LSTM RN A novel abstractive summarisationmethodwas proposed in [56] it generated a multisentence summaryand addressed sentence repetition and inaccurate infor-mation See et al proposed a model that consists of a single-layer bidirectional LSTM encoder a single-layer unidirec-tional LSTM decoder and the sequence-to-sequence at-tention model proposed by [55] e See et al modelgenerates a long text summary instead of headlines whichconsists of one or two sentences Moreover the attentionmechanism was employed and the attention distributionfacilitated the production of the next word in the summaryby telling the decoder where to search in the source words asshown in Figure 9 is mechanism constructed theweighted sum of the hidden state of the encoder that fa-cilitated the generation of the context vector where thecontext vector is the fixed size representation of the inpute probability (Pvocab) produced by the decoder wasemployed to generate the final prediction using the contextvector and the decoderrsquos last step Furthermore the value ofPvocab was equal to zero for OOV words RL was employedfor abstractive text summarisation in [57] e proposedmethod in [57] which combined RL with supervised wordprediction was composed of a bidirectional LSTM-RNNencoder and a single LSTM decoder

Two modelsmdashgenerative and discriminative mod-elsmdashwere trained simultaneously to generate abstractivesummary text using the adversarial process [58] emaximum likelihood estimation (MLE) objective functionemployed in previous sequence-sequence models suffersfrom two problems the difference between the training lossand the evaluation metric and the unavailability of a goldentoken at testing time which causes errors to accumulateduring testing To address the previous problems the

proposed approach exploited the adversarial framework Inthe first step of the adversarial framework reinforcementlearning was employed to optimize the generator whichgenerates the summary from the original text In the secondstep the discriminator which acts as a binary classifierclassified the summary as either a ground-truth summary ora machine-generated summary e bidirectional LSTMencoder and attention mechanism were employed as shownin [56]

Abstract text summarisation using the LSTM-CNNmodel based on exploring semantic phrases (ATSDL) wasproposed in [30] ATSDL is composed of two phases thefirst phase extracts the phrases from the sentences while thesecond phase learns the collocation of the extracted phrasesusing the LSTM model To generate sentences that aregeneral and natural the input and output of the ATSDLmodel were phrases instead of words and the phrases weredivided into three main types ie subject relation andobject phrases where the relation phrase represents therelation between the input phrase and the output phraseephrase was represented using a CNN layer ere are twomain reasons for choosing the CNN first the CNN wasefficient for sentence-level applications and second trainingwas efficient since long-term dependency was unnecessaryFurthermore to obtain several vectors for a phrase multiplekernels with different widths that represent the dimen-sionality of the features were utilisedWithin each kernel themaximum feature was selected for each row in the kernel viamaximum pooling e resulting values were added toobtain the final value for each word in a phrase BidirectionalLSTM was employed instead of a GRU on the encoder sidesince parameters are easy to tune with LSTM Moreover thedecoder was divided into two modes a generate mode and acopy mode e generate mode generated the next phrase inthe summary based on previously generated phrases and thehidden layers of the input on the encoder side while thecopy mode copied the phrase after the current input phraseif the current generated phrase was not suitable for thepreviously generated phrases in the summary Figure 10provides additional details

Bidirectional encoder and decoder LSTM-RNNs were employed to generate abstractive multisentence summaries [35]. The proposed approach considered past and future context on the decoder side when making a prediction, as it employed a bidirectional RNN. Using a bidirectional RNN on the decoder side addressed the problem of summary imbalance. An unbalanced summary can occur due to noise in a previous prediction, which reduces the quality of all subsequent summary words. The bidirectional decoder consists of two LSTMs: the forward decoder and the backward decoder. The forward decoder decodes the information from left to right, while the backward decoder decodes the information from right to left. The last hidden state of the forward decoder is fed as the initial input to the backward decoder, and vice versa. Moreover, the researchers proposed a bidirectional beam-search method that generates summaries from the proposed bidirectional model. Bidirectional beam search combines information from the past and the future to produce a better summary. Therefore, the output summary was balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence was read in reverse order, based on the conclusion that LSTM learns better when reading the source in reverse order while remembering the order of the target [66, 67]. A softmax layer was employed on the decoder side to obtain the probability of each target word in the summary over the vocabulary distribution by taking the output of the decoder as input for the softmax layer. The decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. The objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token. However, during testing, the input of the forward decoder is the token generated in the previous step. The same situation is true for the backward decoder, where the input during training is the future token from the summary. Nevertheless, the bidirectional decoder has difficulty during testing since the complete summary must be known in advance; thus, the full backward decoder output is generated first and fed to the forward decoder using a unidirectional backward beam search.

A combination of abstractive and extractive methods was employed in the guiding generation model proposed by [59]. The extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent key information. Furthermore, to predict the long-term value of the final summary, the proposed method applied a prediction guide mechanism [68].

Figure 9: Baseline sequence-to-sequence model with attention mechanism [56].

Figure 10: Semantic-unit-based LSTM model [30].

A prediction guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. The encoder-decoder architecture baseline of the proposed model is similar to that proposed by Nallapati et al. [55], where both the bidirectional LSTM encoder and the unidirectional LSTM decoder were employed. Both models applied the attention mechanism and softmax layer. Moreover, the process of generating the summary was improved by proposing KIGN, which takes as input the keywords extracted using the TextRank algorithm. In KIGN, key information is represented by concatenating the last forward hidden state and the first backward hidden state. KIGN employs the attention mechanism and the pointer mechanism. In general, the attention mechanism can hardly identify the keywords; thus, to identify keywords, the output of KIGN is fed to the attention mechanism. As a result, the attention mechanism will be highly affected by the keywords. However, to enable the pointer network to identify the keywords, which are the output of KIGN, the encoder context vector and the hidden state of the decoder are fed to the pointer network, and the output is employed to calculate the soft switch. The soft switch determines whether to copy the target from the original text or generate it from the target vocabulary, as shown in Figure 11.
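A minimal sketch of the soft switch driven by the key information vector might look as follows; the weight vectors are hypothetical, and the sketch only illustrates how the concatenated keyword-encoder states, the encoder context vector, and the decoder state can be combined into a copy/generate probability in the spirit of KIGN [59].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def copy_probability(key_fwd_last, key_bwd_first, context, dec_state,
                     w_key, w_ctx, w_dec, b):
    """Illustrative soft switch.

    The key information vector is the concatenation of the last forward and
    first backward hidden states of the keyword encoder; together with the
    encoder context vector and the decoder state it yields the probability of
    copying a keyword from the source rather than generating a word. The
    weight vectors w_key, w_ctx, w_dec and bias b are hypothetical.
    """
    key_vector = np.concatenate([key_fwd_last, key_bwd_first])
    score = key_vector @ w_key + context @ w_ctx + dec_state @ w_dec + b
    return sigmoid(score)  # near 1: copy from the source; near 0: generate
```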

The level of abstraction in the generated summary of abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and the use of a novel metric for optimising the overlap between the n-gram summary and the ground-truth summary. The decoder was decomposed into a contextual network and a pretrained language model, as shown in Figure 12. The contextual network uses the source document to extract the relevant parts, and the pretrained language model is generated from prior knowledge. This decomposition facilitates the addition of an external pretrained language model that is related to several domains. Furthermore, a novel metric was employed to generate an abstractive summary by including words that are not in the source document. Bidirectional LSTM was utilised in the encoder, and the decoder applied a 3-layer unidirectional weight-dropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applied the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combined reinforcement learning and maximum likelihood.
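The decomposition idea can be illustrated with a small sketch in which a gate merges the contextual-network state and the language-model state before prediction; the sigmoid gate shown is an assumed form for illustration, not the exact fusion layer of [60].

```python
import numpy as np

def fused_decoder_state(h_context, h_lm, w_gate, b_gate):
    """Illustrative fusion of a contextual-network state and a pretrained
    language-model state before the next-token prediction. The gating
    parametrisation (w_gate, b_gate) is an assumption for the sketch."""
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([h_context, h_lm]) @ w_gate + b_gate)))
    return gate * h_context + (1.0 - gate) * h_lm  # merged state used for prediction
```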

A bidirectional decoder with a sequence-to-sequence architecture, referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing because the input of the decoder is the previously generated summary word; if one of the generated summary words is incorrect, then the error propagates through all subsequent summary words. In the bidirectional decoder, there are two decoders: a forward decoder and a backward decoder. The forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left. The forward decoder considers a reference from the backward decoder. However, there is only a single-layer encoder. The encoder and decoders employ LSTM units, but while the encoder utilises bidirectional LSTM, the decoders use unidirectional LSTM, as shown in Figure 13. To understand the summary generated by the backward decoder, the attention mechanism is applied in both the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, a pointer mechanism is employed in both decoders.

A double attention pointer network, referred to as DAPT, was applied to generate an abstractive text summarisation model [49]. The encoder utilised bidirectional LSTM, while the decoder utilised unidirectional LSTM. The key features of the encoder were extracted using a self-attention mechanism. At the decoder, beam search was employed, and more coherent and accurate summaries were generated. The repetition problem was addressed using an improved coverage mechanism with a truncation parameter. The model was optimised by generating a training model that is based on RL and scheduled sampling.

4.1.2. GRU-RNN. Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. The dual encoder consists of two levels of encoders, i.e., primary and secondary encoders, in addition to one decoder, and all of them employ a GRU. The primary encoder considers coarse encoding, while the secondary encoder considers fine encoding. The primary encoder and decoder are the same as in the standard encoder-decoder model with an attention mechanism, and the secondary encoder generates a new context vector that is based on previous output and input. Moreover, this additional context vector provides meaningful information for the output. Thus, the repetition problem of the generated summary that was encountered in previous approaches is addressed. The semantic vector is generated on both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is calculated. The fixed-length output is partially generated at each stage in the decoder since it decodes in stages.

Figure 14 elaborates the DEATS process. The primary encoder produces a hidden state h^p_j for each input j and a content representation c^p. Next, the decoder decodes a fixed-length output, which is referred to as the decoder content representation c^d. The weight α_j can be calculated using the hidden states h^p_j and the content representations c^p and c^d. In this stage, the secondary encoder generates new hidden states, or semantic context vectors, h^s_m, which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator, a copy mechanism, and a coverage mechanism.
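The following sketch shows, under assumed shapes and a hypothetical weight matrix w_imp, how importance weights over the primary hidden states could be derived from c^p and c^d and used to produce the secondary semantic vectors; it is a simplification of the dual-encoding computation in [61], not the authors' exact formulation.

```python
import numpy as np

def secondary_semantic_vectors(h_primary, c_primary, c_decoder, w_imp):
    """Illustrative recomputation of the semantic context.

    h_primary: (T, H) primary-encoder hidden states h^p_j
    c_primary: (H,)   primary content representation c^p
    c_decoder: (H,)   decoder content representation c^d
    Importance weights alpha_j re-weight the hidden states to form the
    secondary vectors h^s that are fed to the decoder.
    """
    scores = h_primary @ w_imp @ np.concatenate([c_primary, c_decoder])  # (T,)
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                 # importance of each input word
    return alphas[:, None] * h_primary     # (T, H) secondary semantic vectors
```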

Wang et al. proposed a hybrid extractive-abstractive text summarisation model that combines reinforcement learning with BERT word embedding [63]. In this hybrid model, a BERT feature-based strategy was used to generate contextualised token embeddings. The model consists of two submodels, an abstractive agent and an extractive agent, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where the sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. The training process consists of pretraining and full-training phases.

Egonmwan et al. proposed to use sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules: an extractive model and an abstractive model. The encoder transformer has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives a sentence-level representation.

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].

The architecture of the abstractive model consists of a single-layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder. The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.

4.1.3. Others. BERT is employed to represent the sentences of the document to express their semantics [65]. Liu et al. proposed abstractive and extractive summarisation models that are based on the encoder-decoder architecture. The encoder used a BERT pretrained document-level encoder, while the decoder utilised a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage for extractive summarisation and one stage for abstractive summarisation. At the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed, since these two mechanisms need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries by using trigram blocking. OOV words rarely appear in the generated summary.
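Trigram blocking itself is simple to illustrate: during beam search, a candidate is rejected if it would contain the same trigram twice. The helper below (our own naming) is a minimal sketch of that check, not the implementation of [65].

```python
def violates_trigram_blocking(candidate_tokens):
    """Return True if any trigram appears more than once in the candidate,
    in which case the beam-search candidate would be discarded."""
    seen = set()
    for i in range(len(candidate_tokens) - 2):
        trigram = tuple(candidate_tokens[i:i + 3])
        if trigram in seen:
            return True
        seen.add(trigram)
    return False

# Example: the repeated trigram "the economy grew" makes this candidate invalid.
print(violates_trigram_blocking(
    "the economy grew rapidly and the economy grew".split()))  # True
```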

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, in the Paulus et al. model, both the input and output tokens applied the same embedding matrix Wemb, which was generated using the GloVe word embedding model [57]. Another word embedding matrix, referred to as Wout, was applied in the token generation layer. Additionally, a shared weighting matrix was employed by both the shared embedding matrix Wemb and the Wout matrix. The shared weighting matrices improved the process of generating tokens since they considered the embedding syntax and semantic information.

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].

The discriminator input sequence of the Liu et al. model was encoded using a max-pooling CNN, where the result was passed to the softmax layer [58]. On the other hand, the word embedding that was applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embeddings. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of source documents of 781 tokens paired with summaries of 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the documents were preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets: the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consists of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve the coreference. The second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP was to perform Stanford NLP parsing, a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing was performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If the child node has children, then the child is considered a new root with children; this process continues recursively until there are no children for the root, at which point the tree structure is complete. Accordingly, compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are refined before training by applying simple rules. First, the phrase triples at the topmost level are exploited since they carry the most semantic information. Second, triple phrases whose subject and object phrases contain no nouns are deleted, since nouns carry a considerable amount of conceptual information; triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To achieve the goal of the previous task and determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consist of 92,000 text sources and 219,000 text sources, respectively.

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1 of 41.16, ROUGE2 of 15.75, and ROUGE-L of 39.08 [57]. The results for the maximum likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to evaluate the quality and readability of the summary. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]. The value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics for the Al-Sabahi et al. and Li et al. models; values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluation included ROUGE1, ROUGE2, and ROUGE-L, for which values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel score related to the n-gram was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model. Five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected. The evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, where the WordPiece tokeniser was utilised; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-and-answering paradigm, where 20 documents were selected for evaluation. Three values were chosen for scoring the answers: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attention encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model includes a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs, a GRU-RNN for the word level and a GRU-RNN for the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering the full vocabulary, only certain words were added, based on their frequency in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. Linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrices for each tag type, similar to word embedding, while the TF-IDF feature was discretised into a fixed number of bins, where a one-hot representation was employed to represent the value of the bins. The one-hot matrix consists of the number of bin entries, where only one entry is set to one to indicate the TF-IDF value of a certain word. This process permits the TF-IDF to be addressed in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8 M training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and named-entity generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
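The feature concatenation described above can be sketched as follows; the bin count and embedding sizes are illustrative assumptions, and the sketch only shows how a discretised TF-IDF value becomes a one-hot vector that is concatenated with the word, POS, and NER embeddings, in the spirit of [55].

```python
import numpy as np

def build_input_vector(word_emb, pos_emb, ner_emb, tfidf, n_bins=10, max_tfidf=1.0):
    """Illustrative input representation: word embedding concatenated with
    POS and NER tag embeddings and a one-hot vector for the TF-IDF bin.
    The number of bins and the maximum TF-IDF value are assumptions."""
    bin_idx = min(int(tfidf / max_tfidf * n_bins), n_bins - 1)
    tfidf_onehot = np.zeros(n_bins)
    tfidf_onehot[bin_idx] = 1.0
    return np.concatenate([word_emb, pos_emb, ner_emb, tfidf_onehot])
```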

Finally, for both the single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of the various approaches are shown in Table 2, while the training, optimization, mechanisms, and search at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets are produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets, even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of each article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets that were generated by Hermann et al. [71]; the Hermann et al. datasets were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (on average, 29.74 sentences), while the summaries have 53 words (on average, 3.72 sentences) [55].

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles with summaries collected from social media metadata between 1998 and 2017, was produced [72].


Figure 16: Word embedding concatenated with discretized TF-IDF, POS, and NER one-hot embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based |
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | Decoder RNN, word-based
[50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM; bidirectional LSTM; bidirectional LSTM | Unidirectional RNN attentive encoder-decoder LSTM; unidirectional LSTM; decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM

The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets: nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study each applied the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenization, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and name-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins and a one-hot representation for bins; (iv) lookup embedding for part-of-speech and name-entity tagging
[52] | Zhou et al. | PTB tokenization, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Word embedding with size equal to 300
[53] | Cao et al. | Normalization and tokenization, using "#" to replace digits, converting the words to lower case, and "UNK" to replace the least frequent words | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lower case letters | GloVe word embedding
[29] | Lopyrev | Tokenization, converting the articles and their headlines to lower case letters, using the symbol <unk> to replace rare words | The input was represented using a distributed representation
[38] | Jobson et al. | | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | | CNN maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving the coreference and reducing the morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | | The word embedding is learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | | The word embedding is learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | | BERT
[63] | Wang et al. | Using the WordPiece tokenizer | BERT
[64] | Egonmwan et al. | | GloVe word embedding with dimension size equal to 300

Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers reviewed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures to evaluate the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common substring.

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise the negative log-likelihood | | Beam search
[39] | Chopra et al. | Minimising the negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimise the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimising the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimising the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism |
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanisms | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism |
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL |
[30] | Song et al. | | Attention mechanism, copy mechanism |
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search, backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)

Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\[
\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \; \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \; \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}, \tag{1}
\]

where S is a reference summary, n is the n-gram length, and Count_match(gram_n) is the maximum number of matching n-gram words between the reference summary and the generated summary; Count(gram_n) is the total number of n-gram words in the reference summary [73].
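For a single reference summary, the recall in equation (1) can be computed with a few lines of Python; the sketch below is a minimal illustration rather than the official ROUGE package.

```python
from collections import Counter

def rouge_n_recall(reference_tokens, candidate_tokens, n=2):
    """Minimal ROUGE-N recall for a single reference: clipped n-gram matches
    divided by the number of n-grams in the reference."""
    ref_ngrams = Counter(tuple(reference_tokens[i:i + n])
                         for i in range(len(reference_tokens) - n + 1))
    cand_ngrams = Counter(tuple(candidate_tokens[i:i + n])
                          for i in range(len(candidate_tokens) - n + 1))
    matches = sum(min(count, cand_ngrams[gram]) for gram, count in ref_ngrams.items())
    total = sum(ref_ngrams.values())
    return matches / total if total else 0.0
```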

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of the common matching words between the reference summary and the generated summary. The LCS calculation does not require the matched words to be consecutive; however, their order of occurrence is important, and no predefined number of matched words is required. LCS considers only the main in-sequence match, which is one of its disadvantages, since the final score does not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets [61].

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarization evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword |
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: the apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple", but not both, similar to LCS.
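The LCS computation underlying ROUGE-L can be illustrated with a standard dynamic-programming sketch; applied to the example above, it returns a length of 2, matching either "Ahmed ate" or "the apple".

```python
def lcs_length(reference_tokens, candidate_tokens):
    """Longest common subsequence length used by ROUGE-L: words must appear
    in the same order but need not be consecutive."""
    m, n = len(reference_tokens), len(candidate_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference_tokens[i - 1] == candidate_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

print(lcs_length("Ahmed ate the apple".split(), "the apple Ahmed ate".split()))  # 2
```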

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, Perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]; these models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with the pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively [65]. Even though ROUGE is employed to evaluate abstractive summarisation, it would be better to develop new methods for evaluating the quality of such summaries. New evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text; ROUGE, by contrast, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups: the first group considers single-sentence summary approaches, while the second group considers multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively [65]. It can clearly be seen that the best models for both single-sentence and multisentence summaries are those that employed BERT word embedding and were based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset: Gigaword is utilised for single-sentence summaries, since it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries; thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the generated summaries of 50 test examples from 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can clearly be seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summary [60]. Five human evaluators evaluated the relevance and readability of 100 randomly selected test examples, using scores from 1 to 10, where a value of 1 indicates that the generated summary is less readable and less relevant, while a value of 10 indicates that the generated summary is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are close, with values of 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retention of key information, which was achieved by human evaluation [65]. In addition, the qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were selected for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summaries of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are not enough due to the small number of testing examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either a gold token is used, as during training, or the output of the previous step is used, as during both testing and training. In this manner, at least the training step receives the same input as testing. In all cases, the first input of the decoder is the <EOS> token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76].

Moreover, the mass convolution of the QRNN is applied in [50], since the dependency of words generated in the future is difficult to determine.
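The coin-flip idea shared by DaD and teacher forcing can be sketched as a single decision per decoder step; the 0.9 ratio below mirrors the 10% feedback rate mentioned for [29] and is otherwise an illustrative choice.

```python
import random

def next_decoder_input(gold_token, prev_generated, teacher_forcing_ratio=0.9):
    """Sketch of the coin-flip strategy: during training, feed the gold token
    most of the time and the model's previously generated token otherwise, so
    that training better matches testing, where only generated tokens exist."""
    if random.random() < teacher_forcing_ratio:
        return gold_token       # reference (golden) summary token
    return prev_generated       # model's own previous prediction
```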

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copy it to the memory. When the switch is turned on, the decoder generates a word from the target vocabularies. Conversely, researchers in [56] addressed OOV words via a generation probability Pgen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them, and the combination of the words in the input and the vocabulary is referred to as the extended vocabulary.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 |
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset.


In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to an input sequence position for unseen tokens in order to copy them. Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. Furthermore, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
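A minimal sketch of the pointer-generator mixture is given below; it assumes the vocabulary distribution, the attention distribution, and the extended-vocabulary ids are already available, and it only illustrates how probability mass is split between generating and copying, as described for [56].

```python
import numpy as np

def final_distribution(p_gen, vocab_dist, attn_dist, src_ids, extended_vocab_size):
    """Illustrative pointer-generator mixture.

    p_gen:      scalar generation probability derived from the context vector and decoder state
    vocab_dist: probabilities over the fixed output vocabulary
    attn_dist:  attention over source positions (the copy distribution)
    src_ids:    extended-vocabulary id of the source token at each position,
                so OOV source words also receive probability mass
    """
    final = np.zeros(extended_vocab_size)
    final[:len(vocab_dist)] = p_gen * vocab_dist          # generate from the vocabulary
    for pos, token_id in enumerate(src_ids):              # copy from the source text
        final[token_id] += (1.0 - p_gen) * attn_dist[pos]
    return final
```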

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges arise when summarising long documents and producing long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and to avoid attending to the same parts of the input at different decoder steps. However, the intra-attention encoder mechanism cannot address all the repetition challenges, especially when a long sequence is generated; thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more previously generated words, and this mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy loss, maximum likelihood, and gradient reinforcement learning to minimise the exposure bias. In addition, the trigram probability p(yt), where yt is a trigram sequence, was proposed to address repetition in the generated summary: the value of p(yt) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated by a secondary encoder to remember the previously generated decoder output, and the coverage mechanism was utilised.
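The coverage idea can be sketched in two lines: the coverage vector is the running sum of past attention distributions, and the coverage loss penalises re-attending to already-covered positions. This is an illustrative simplification of the mechanism used in [35, 56], not the authors' full training objective.

```python
import numpy as np

def coverage_update(coverage, attn_dist):
    """One decoder step of the coverage mechanism (illustrative): accumulate
    attention over previous steps and penalise overlap with the current
    attention distribution, which discourages repetition."""
    coverage_loss = float(np.sum(np.minimum(coverage, attn_dist)))
    return coverage + attn_dist, coverage_loss
```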

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, the sequence-to-sequence framework with dual attention was proposed, where the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.

7.5. Other Challenges. The main issue of the abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary consists of the highlights of the news article, and every highlight represents a sentence in the summary.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.

Therefore, the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all the crucial points of the article; therefore, a high-quality dataset requires substantial effort to become available. Moreover, in some languages, such as Arabic, multisentence datasets for abstractive summarisation are not available. Single-sentence abstractive Arabic text summarisation datasets are available but are not free.

Another issue of abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words; for example, the words book and books are considered different by any of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words (words that have the same meaning must be considered the same even if they have different surface forms). In this case, we propose to use METEOR, which has recently been used in evaluating machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, in flexible-order languages, it is better to use ROUGE without considering the order of the words.

The quality of the generated summary can be improved using linguistic features.

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].

For example, we proposed the use of dependency parsing at the encoder in a separate layer on top of the first hidden state layer. We also proposed the use of word embeddings that are built by considering dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the new trends and evaluation results, we think that the most promising direction among all the features is the use of the BERT pretrained model. The quality of the models that are based on the transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that applied deep learning for abstractive text summarisation, the datasets, and the measures used for evaluating these approaches. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively. The best results were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, there are several issues that must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.

[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.
[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.
[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.
[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.
[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.
[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.
[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.
[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.
[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.
[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, India, 2017.
[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.
[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.
[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.
[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.
[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, India, 2018.
[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.
[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.
[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, 2020, http://arxiv.org/abs/1812.02303.
[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.
[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.
[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.
[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.
[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
[29] K. Lopyrev, "Generating news headlines with recurrent neural networks," p. 9, 2015, https://arxiv.org/abs/1512.01712.
[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.
[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.
[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. Cho, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.
[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of NAACL-HLT 2016, pp. 93–98, San Diego, CA, USA, 2016.
[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.
[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.
[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.
[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.
[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.
[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.
[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.
[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.
[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, Quasi-Recurrent Neural Networks, 2016, https://arxiv.org/abs/1611.01576.
[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.
[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.
[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.
[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.
[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of CoNLL-16, Berlin, Germany, August 2016.
[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.
[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.
[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.
[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.
[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.
[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.
[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.
[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.
[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.
[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.
[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.
[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.
[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.
[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.
[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.
[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.
[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.
[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.
[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.
[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.
[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.



4. Multisentence Summary

In this section, multisentence summary and deep learning-based abstractive text summarisation are discussed. Multisentence summary methods include the get to the point method (summarisation with pointer-generator networks) [56], a deep reinforced model for abstractive summarization (RL) [57], the generative adversarial network for abstractive text summarization [58], semantic phrase exploration (ATSDL) [30], the bidirectional attention encoder-decoder and bidirectional beam search [35], the key information guide network [59], text summarisation abstraction improvement [60], dual encoding for abstractive text summarisation (DEATS) [61], abstractive document summarisation via a bidirectional decoder (BiSum) [62], the text abstraction summary model based on BERT word embedding and RL [63], the transformer-based model for single-document neural summarisation [64], text summarisation with pretrained encoders [65], and the text summarisation method based on the double attention pointer network [49]. The pointer-generator [55] includes single-sentence and multisentence summaries. Additional details are presented in the following sections.

4.1. Abstractive Summarization Architecture

4.1.1. LSTM-RNN. A novel abstractive summarisation method was proposed in [56]; it generates a multisentence summary and addresses sentence repetition and inaccurate information. See et al. proposed a model that consists of a single-layer bidirectional LSTM encoder, a single-layer unidirectional LSTM decoder, and the sequence-to-sequence attention model proposed by [55]. The See et al. model generates a long text summary instead of headlines, which consist of one or two sentences. Moreover, the attention mechanism was employed, and the attention distribution facilitated the production of the next word in the summary by telling the decoder where to search in the source words, as shown in Figure 9. This mechanism constructed the weighted sum of the hidden states of the encoder, which facilitated the generation of the context vector, where the context vector is the fixed-size representation of the input. The probability (Pvocab) produced by the decoder was employed to generate the final prediction using the context vector and the decoder's last step. Furthermore, the value of Pvocab was equal to zero for OOV words. RL was employed for abstractive text summarisation in [57]. The proposed method in [57], which combined RL with supervised word prediction, was composed of a bidirectional LSTM-RNN encoder and a single LSTM decoder.
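To make the attention, context vector, and copy/generate interaction described above more concrete, the following is a minimal NumPy sketch of a single pointer-generator decoding step. All weight matrices, sizes, and the toy source ids are illustrative assumptions, not the authors' implementation.

```python
# Schematic NumPy sketch of one pointer-generator decoding step (in the spirit of [56]).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hid, src_len, vocab = 8, 5, 20                  # toy sizes
enc_states = rng.normal(size=(src_len, hid))    # encoder hidden states h_i
dec_state  = rng.normal(size=hid)               # decoder state s_t
x_t        = rng.normal(size=hid)               # current decoder input embedding

# Attention distribution over source words (Bahdanau-style scoring)
W_h, W_s, v = rng.normal(size=(hid, hid)), rng.normal(size=(hid, hid)), rng.normal(size=hid)
scores = np.tanh(enc_states @ W_h + dec_state @ W_s) @ v
attn = softmax(scores)                          # tells the decoder where to look

# Context vector: attention-weighted sum of encoder states
context = attn @ enc_states

# Vocabulary distribution from decoder state and context vector
V_out = rng.normal(size=(2 * hid, vocab))
p_vocab = softmax(np.concatenate([dec_state, context]) @ V_out)

# Generation probability p_gen switches between generating and copying
w_c, w_s, w_x = rng.normal(size=hid), rng.normal(size=hid), rng.normal(size=hid)
p_gen = 1 / (1 + np.exp(-(w_c @ context + w_s @ dec_state + w_x @ x_t)))

# Final distribution over an extended vocabulary (a source OOV gets copy mass only)
src_ids = np.array([3, 7, 7, 12, vocab])        # last source token is an OOV (id == vocab)
p_final = np.zeros(vocab + 1)
p_final[:vocab] = p_gen * p_vocab
np.add.at(p_final, src_ids, (1 - p_gen) * attn) # copy distribution comes from attention
print(p_final.argmax(), p_final.sum())          # distribution sums to ~1.0
```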

Two models, a generative model and a discriminative model, were trained simultaneously to generate abstractive summary text using the adversarial process [58]. The maximum likelihood estimation (MLE) objective function employed in previous sequence-to-sequence models suffers from two problems: the difference between the training loss and the evaluation metric, and the unavailability of a golden token at testing time, which causes errors to accumulate during testing. To address the previous problems, the proposed approach exploited the adversarial framework. In the first step of the adversarial framework, reinforcement learning was employed to optimize the generator, which generates the summary from the original text. In the second step, the discriminator, which acts as a binary classifier, classified the summary as either a ground-truth summary or a machine-generated summary. The bidirectional LSTM encoder and attention mechanism were employed as shown in [56].
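The interplay between the two steps can be illustrated with a highly simplified sketch: a toy generator samples a "summary", a stub discriminator scores it, and the score is used as a reward for a policy-gradient update. Every component here is a toy stand-in chosen for brevity, not the architecture of [58].

```python
# Toy adversarial loop: discriminator score used as RL reward for the generator.
import numpy as np

rng = np.random.default_rng(6)

def generator_sample(theta, vocab=5, length=4):
    probs = np.exp(theta) / np.exp(theta).sum()
    return rng.choice(vocab, size=length, p=probs), probs

def discriminator(summary):
    # stand-in binary classifier: pretend token "0" looks machine-generated
    return float(np.mean(summary != 0))

theta = np.zeros(5)                       # toy generator parameters
for _ in range(200):
    summary, probs = generator_sample(theta)
    reward = discriminator(summary)       # step 2: score generated vs ground-truth style
    grad = np.zeros_like(theta)
    for tok in summary:                   # REINFORCE-style gradient estimate
        grad += (np.eye(5)[tok] - probs) * reward
    theta += 0.1 * grad                   # step 1: optimise the generator with RL
print(theta.round(2))                     # the weight of token 0 lags behind the others
```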

Abstractive text summarisation using the LSTM-CNN model based on exploring semantic phrases (ATSDL) was proposed in [30]. ATSDL is composed of two phases: the first phase extracts the phrases from the sentences, while the second phase learns the collocation of the extracted phrases using the LSTM model. To generate sentences that are general and natural, the input and output of the ATSDL model were phrases instead of words, and the phrases were divided into three main types, i.e., subject, relation, and object phrases, where the relation phrase represents the relation between the input phrase and the output phrase. The phrase was represented using a CNN layer. There are two main reasons for choosing the CNN: first, the CNN is efficient for sentence-level applications, and second, training is efficient since long-term dependency is unnecessary. Furthermore, to obtain several vectors for a phrase, multiple kernels with different widths, which represent the dimensionality of the features, were utilised. Within each kernel, the maximum feature was selected for each row in the kernel via maximum pooling. The resulting values were added to obtain the final value for each word in a phrase. Bidirectional LSTM was employed instead of a GRU on the encoder side since parameters are easy to tune with LSTM. Moreover, the decoder was divided into two modes: a generate mode and a copy mode. The generate mode generates the next phrase in the summary based on previously generated phrases and the hidden layers of the input on the encoder side, while the copy mode copies the phrase after the current input phrase if the currently generated phrase is not suitable for the previously generated phrases in the summary. Figure 10 provides additional details.
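The following minimal NumPy sketch illustrates the phrase representation idea described above: several kernel widths produce feature maps over a phrase, and max-pooling keeps the strongest feature per filter. The sizes, number of filters, and pooling details are assumptions made for the demo, not the ATSDL configuration.

```python
# Minimal sketch of a CNN phrase encoder with multiple kernel widths and max-pooling.
import numpy as np

rng = np.random.default_rng(1)
emb_dim, phrase_len, n_filters = 6, 4, 3
phrase = rng.normal(size=(phrase_len, emb_dim))      # word vectors of one phrase

phrase_vec = []
for width in (1, 2, 3):                              # multiple kernel widths
    kernels = rng.normal(size=(n_filters, width, emb_dim))
    feats = []
    for start in range(phrase_len - width + 1):      # slide each kernel over the phrase
        window = phrase[start:start + width]
        feats.append(np.tensordot(kernels, window, axes=([1, 2], [0, 1])))
    feats = np.stack(feats, axis=1)                  # (n_filters, positions)
    phrase_vec.append(feats.max(axis=1))             # max-pooling per filter
phrase_vec = np.concatenate(phrase_vec)              # final phrase vector
print(phrase_vec.shape)                              # (9,)
```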

Bidirectional encoder and decoder LSTM-RNNs were employed to generate abstractive multisentence summaries [35]. The proposed approach considered past and future context on the decoder side when making a prediction, as it employed a bidirectional RNN. Using a bidirectional RNN on the decoder side addressed the problem of summary imbalance. An unbalanced summary can occur due to noise in a previous prediction, which reduces the quality of all subsequent summaries. The bidirectional decoder consists of two LSTMs: the forward decoder and the backward decoder. The forward decoder decodes the information from left to right, while the backward decoder decodes the information from right to left. The last hidden state of the forward decoder is fed as the initial input to the backward decoder, and vice versa. Moreover, the researchers proposed a bidirectional beam-search method that generates summaries from the proposed bidirectional model. Bidirectional beam search combines information from the past and the future to produce a better summary. Therefore, the output summary was balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence was read in reverse order, based on the conclusion that LSTM learns better when reading the source in reverse order while remembering the order of the target [66, 67]. A softmax layer was employed on the decoder side to obtain the probability of each target word in the summary over the vocabulary distribution by taking the output of the decoder as input for the softmax layer. The decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. The objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token. However, during testing, the input of the forward decoder is the token generated in the previous step. The same situation is true for the backward decoder, where the input during training is the future token from the summary. Nevertheless, the bidirectional decoder has difficulty during testing since the complete summary must be known in advance; thus, the full backward decoder output is generated and fed to the forward decoder using a unidirectional backward beam search.
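For readers unfamiliar with beam search itself, the sketch below shows the generic forward variant used by most of the decoders discussed in this survey; the bidirectional scheme of [35] additionally runs a backward pass whose output conditions the forward pass. The scoring model here is a random stand-in.

```python
# Generic beam-search sketch over a toy next-token distribution.
import math
import numpy as np

def beam_search(step_probs, beam_size=3, max_len=5, eos=0):
    """step_probs(prefix) -> probability vector over the vocabulary."""
    beams = [([], 0.0)]                              # (token prefix, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:         # finished hypothesis is kept as-is
                candidates.append((prefix, score))
                continue
            probs = step_probs(prefix)
            for tok in np.argsort(probs)[-beam_size:]:
                candidates.append((prefix + [int(tok)], score + math.log(probs[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

rng = np.random.default_rng(2)
vocab = 6
def toy_model(prefix):
    p = rng.random(vocab) + 1e-9
    return p / p.sum()

print(beam_search(toy_model))
```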

A combination of abstractive and extractive methods was employed in the guiding generation model proposed by [59]. The extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent key information. Furthermore, to predict the long-term value of the final summary, the proposed method applied a prediction guide mechanism [68].

Figure 9: Baseline sequence-to-sequence model with attention mechanism [56].

Figure 10: Semantic-unit-based LSTM model [30].

A prediction guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. The encoder-decoder architecture baseline of the proposed model is similar to that proposed by Nallapati et al. [55], where both a bidirectional LSTM encoder and a unidirectional LSTM decoder were employed. Both models applied the attention mechanism and a softmax layer. Moreover, the process of generating the summary was improved by proposing KIGN, which takes as input the keywords extracted using the TextRank algorithm. In KIGN, key information is represented by concatenating the last forward hidden state and the first backward hidden state. KIGN employs the attention mechanism and the pointer mechanism. In general, the attention mechanism can hardly identify the keywords; thus, to identify keywords, the output of KIGN is fed to the attention mechanism. As a result, the attention mechanism will be highly affected by the keywords. However, to enable the pointer network to identify the keywords, which are the output of KIGN, the encoder context vector and the hidden state of the decoder are fed to the pointer network, and the output is employed to calculate the soft switch. The soft switch determines whether to copy the target from the original text or generate it from the target vocabulary, as shown in Figure 11.
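A rough sketch of how the key-information vector can bias the attention scores is given below. The shapes, weights, and the exact way the key vector enters the attention energy are assumptions made for illustration; they are not taken from the KIGN implementation.

```python
# Sketch (assumed formulation) of keyword-aware attention in the spirit of KIGN [59].
import numpy as np

rng = np.random.default_rng(3)
hid, src_len = 4, 6
fwd_states = rng.normal(size=(5, hid))     # keyword encoder, forward direction
bwd_states = rng.normal(size=(5, hid))     # keyword encoder, backward direction
key_info = np.concatenate([fwd_states[-1], bwd_states[0]])   # key information vector

enc_states = rng.normal(size=(src_len, hid))
dec_state = rng.normal(size=hid)
W_h, W_s, W_k, v = (rng.normal(size=(hid, hid)), rng.normal(size=(hid, hid)),
                    rng.normal(size=(2 * hid, hid)), rng.normal(size=hid))
energy = np.tanh(enc_states @ W_h + dec_state @ W_s + key_info @ W_k) @ v
attn = np.exp(energy - energy.max()); attn /= attn.sum()     # keyword-influenced attention
print(attn.round(3))
```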

The level of abstraction in the generated summaries of abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and the use of a novel metric for optimising the overlap between the n-gram summary and the ground-truth summary. The decoder was decomposed into a contextual network and a pretrained language model, as shown in Figure 12. The contextual network uses the source document to extract the relevant parts, and the pretrained language model is generated via prior knowledge. This decomposition method facilitates the addition of an external pretrained language model that is related to several domains. Furthermore, a novel metric was employed to generate an abstractive summary by including words that are not in the source document. Bidirectional LSTM was utilised in the encoder, and the decoder applied a 3-layer unidirectional weight-dropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applied the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combined reinforcement learning and maximum likelihood.

A bidirectional decoder with a sequence-to-sequence architecture, which is referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing because the input of the decoder is the previously generated summary word, and if one of the generated summary words is incorrect, then the error will propagate through all subsequent summary words. In the bidirectional decoder, there are two decoders: a forward decoder and a backward decoder. The forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left. The forward decoder considers a reference from the backward decoder. However, there is only a single-layer encoder. The encoder and decoders employ LSTM units, but while the encoder utilises bidirectional LSTM, the decoders use unidirectional LSTM, as shown in Figure 13. To understand the summary generated by the backward decoder, the attention mechanism is applied in both the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, an attention mechanism is employed in both decoders.

A double attention pointer network, referred to as DAPT, was applied to build an abstractive text summarisation model [49]. The encoder utilised bidirectional LSTM, while the decoder utilised unidirectional LSTM. The key features of the encoder were extracted using a self-attention mechanism. At the decoder, beam search was employed, and more coherent and accurate summaries were generated. The repetition problem was addressed using an improved coverage mechanism with a truncation parameter. The model was optimised by generating a training model that is based on RL and scheduled sampling.

4.1.2. GRU-RNN. Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. The dual encoder consists of two levels of encoders, i.e., primary and secondary encoders, in addition to one decoder, and all of them employ a GRU. The primary encoder considers coarse encoding, while the secondary encoder considers fine encoding. The primary encoder and decoder are the same as in the standard encoder-decoder model with an attention mechanism, and the secondary encoder generates a new context vector that is based on previous output and input. Moreover, an additional context vector provides meaningful information for the output. Thus, the repetition problem of the generated summary that was encountered in previous approaches is addressed. The semantic vector is generated on both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is calculated. The fixed-length output is partially generated at each stage in the decoder since it decodes in stages.

Figure 14 elaborates the DEATS process. The primary encoder produces a hidden state h_j^p for each input j and a content representation c^p. Next, the decoder decodes a fixed-length output, which is referred to as the decoder content representation c^d. The weight α_j can be calculated using the hidden states h_j^p and the content representations c^p and c^d. In this stage, the secondary encoder generates new hidden states, or semantic context vectors, h_m^s, which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator, a copy mechanism, and a coverage mechanism.
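The dual-encoding step can be pictured with the toy NumPy fragment below: the primary encoder yields hidden states and a coarse content vector, the decoder contributes a partial content vector, and importance weights reweight the inputs into refined semantic vectors. The scoring formula and shapes are assumptions for illustration only.

```python
# Illustrative sketch of the dual-encoding reweighting idea in DEATS [61].
import numpy as np

rng = np.random.default_rng(4)
hid, src_len = 4, 5
h_p = rng.normal(size=(src_len, hid))      # primary encoder states h_j^p
c_p = h_p.mean(axis=0)                     # coarse content representation c^p
c_d = rng.normal(size=hid)                 # decoder content representation c^d

W = rng.normal(size=(3 * hid, 1))
feats = np.concatenate([h_p, np.tile(c_p, (src_len, 1)), np.tile(c_d, (src_len, 1))], axis=1)
scores = feats @ W
alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()   # importance of each input word
h_s = alpha * h_p                          # refined semantic vectors h_m^s for the decoder
print(h_s.shape)                           # (5, 4)
```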

Wang et al. proposed a hybrid extractive-abstractive text summarisation model that is based on combining reinforcement learning with BERT word embedding [63]. In this hybrid model, a BERT feature-based strategy was used to generate contextualised token embeddings. The model consists of two submodels, an abstractive agent and an extractive agent, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where the sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In both the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. The training process consists of pretraining and full-training phases.

Egonmwan et al. proposed the use of sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules: an extractive model and an abstractive model. The encoder transformer has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives a sentence-level representation.

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].

The architecture of the abstractive model consists of a single-layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder. The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.

4.1.3. Others. BERT is employed to represent the sentences of the document and express its semantics [65]. Liu et al. proposed abstractive and extractive summarisation models that are based on the encoder-decoder architecture. The encoder used a BERT pretrained document-level encoder, while the decoder utilised a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage in extractive summarisation and one stage in abstractive summarisation. At the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed since these two mechanisms need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries using trigram blocking. OOV words rarely appear in the generated summary.
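Trigram blocking itself is a simple filter and can be illustrated in a few lines: a candidate sentence is skipped if it shares a trigram with the sentences already selected. The snippet below is a generic illustration of the idea, not the implementation of [65].

```python
# Minimal sketch of trigram blocking as a repetition filter.
def trigrams(text):
    toks = text.lower().split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def block_repetitions(candidates):
    selected, seen = [], set()
    for sent in candidates:
        tg = trigrams(sent)
        if tg & seen:                 # shared trigram -> likely repetition, skip it
            continue
        selected.append(sent)
        seen |= tg
    return selected

cands = ["the cat sat on the mat",
         "a dog barked at the cat",
         "the cat sat on the sofa"]   # repeats the trigram "the cat sat"
print(block_repetitions(cands))
```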

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, in the Paulus et al. model, both the input and output tokens applied the same embedding matrix Wemb, which was generated using the GloVe word embedding model [57]. Another word embedding matrix, referred to as Wout, was applied in the token generation layer. Additionally, a sharing weighting matrix was employed by both the shared embedding matrix Wemb and the Wout matrix. The sharing weighting matrices improved the process of generating tokens since they considered the embedding syntax and semantic information.

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].

The discriminator input sequence of the Liu et al. model was encoded using a maximum-pooling CNN, where the result was passed to the softmax layer [58]. On the other hand, the word embedding that was applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets, with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embeddings. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of 781 tokens paired with 56 tokens on average; 287226 pairs, 13368 pairs, and 11490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the document was preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets: the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consists of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve the coreference. The second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP was to perform Stanford NLP parsing, which is a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing was performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If the child node has children, then the child is considered a new root with children; this process continues recursively until there are no children for the root. In this case, the tree structure is completed. Accordingly, compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are refined before training by applying simple rules. First, the phrase triples at the topmost level are exploited since they carry the most semantic information. Second, triple phrases with subject and object phrases and no nouns are deleted, since the noun contains a considerable amount of conceptual information. Triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To achieve the goal of the previous task and determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consist of 92000 text sources and 219000 text sources, respectively.

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets in its experiments [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1 of 41.16, ROUGE2 of 15.75, and ROUGE-L of 39.08 [57]. The results for the maximum likelihood model were 39.87, 15.82, and 36.9 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to evaluate the quality and readability of the summary. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]. The value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics for the Al-Sabahi et al. and Li et al. models; values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluations included ROUGE1, ROUGE2, and ROUGE-L, and values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel score related to the n-gram was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model. Five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected. The evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, where the WordPiece tokeniser was utilised; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, which is referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-and-answering paradigm, where 20 documents were selected for evaluation. Three values were chosen for scoring the answers: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attention encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model consists of a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs, a GRU-RNN for the word level and a GRU-RNN for the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering every vocabulary word, only certain words were added based on their frequency in the target dictionary, in order to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrices for each tag type, similar to word embedding, while the TF-IDF feature was discretised into a fixed number of bins, where a one-hot representation was employed to represent the value of the bins. The one-hot matrix consisted of the number of bin entries, where only one entry was set to one to indicate the value of the TF-IDF of a certain word. This process permitted the TF-IDF to be handled in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8M training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and named-entity generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.3, and 32.65, respectively.
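The feature concatenation described above can be pictured with a toy NumPy example: the word embedding is joined with lookup embeddings for the POS and NER tags and a one-hot vector for the discretised TF-IDF bin. The vocabulary, tag sets, and dimensions below are made up for the demonstration.

```python
# Toy sketch of the feature-rich encoder input of [55].
import numpy as np

rng = np.random.default_rng(5)
word_emb = rng.normal(size=(100, 16))    # vocabulary x embedding dimension
pos_emb  = rng.normal(size=(10, 4))      # POS tag embeddings
ner_emb  = rng.normal(size=(5, 4))       # NER tag embeddings
n_bins   = 6                             # TF-IDF discretisation bins

def encode_token(word_id, pos_id, ner_id, tfidf_bin):
    one_hot = np.zeros(n_bins)
    one_hot[tfidf_bin] = 1.0             # exactly one entry marks the TF-IDF bin
    return np.concatenate([word_emb[word_id], pos_emb[pos_id], ner_emb[ner_id], one_hot])

vec = encode_token(word_id=42, pos_id=3, ner_id=1, tfidf_bin=2)
print(vec.shape)                         # (16 + 4 + 4 + 6,) = (30,)
```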

Finally, for both the single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of the various approaches appear in Table 2, while the training, optimization, mechanisms, and search method at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets were produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets, even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of the article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets that were generated by Hermann et al. [71]. The Hermann et al. datasets were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286817 pairs for training and 13368 pairs for validation, while 11487 pairs were applied in testing. In training, the source documents have 766 words (on average 29.74 sentences), while the summaries have 53 words (on average 3.72 sentences) [55].

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].


Figure 16: Word embedding concatenated with discretized TF-IDF, POS, and NER one-hot embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.
Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based |
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | RNN, word-based
[50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM | Unidirectional RNN attentive encoder-decoder LSTM
 | | Bidirectional LSTM | Unidirectional LSTM
 | | Bidirectional LSTM | Decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992985 pairs for training and 108612 and 108655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets. Nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study applied each of the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.
Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization, replacing all digits with a placeholder symbol, converting all letters to lower case, and using "UNK" to replace words that occurred fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenization, replacing all digits with a placeholder symbol, converting all letters to lower case, and using "UNK" to replace words that occurred fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins and a one-hot representation for the bins; (iv) lookup embedding for part-of-speech and named-entity tags
[52] | Zhou et al. | PTB tokenization, replacing all digits with a placeholder symbol, converting all letters to lower case, and using "UNK" to replace words that occurred fewer than 5 times | Word embedding with size equal to 300
[53] | Cao et al. | Normalization and tokenization, replacing digits with a placeholder symbol, converting the words to lower case, and using "UNK" to replace the least frequent words | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lowercase letters | GloVe word embedding
[29] | Lopyrev | Tokenization, converting the articles and their headlines to lowercase letters, using the symbol ⟨unk⟩ to replace rare words | The input was represented using the distributed representation
[38] | Jobson et al. | | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | | CNN maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving the coreference and reducing the morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | | The word embedding is learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | | The word embedding is learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | | BERT
[63] | Wang et al. | Using the WordPiece tokenizer | BERT
[64] | Egonmwan et al. | | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers listed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures to evaluate the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimization, mechanism, and search at the decoder.
Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise the negative log-likelihood | | Beam search
[39] | Chopra et al. | Minimizing the negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimize the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimizing the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimizing the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism |
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanism | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism |
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL |
[30] | Song et al. | | Attention mechanism, copy mechanism |
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search, backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \; \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \; \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)} \quad (1)

where S is a reference summary, n is the n-gram length, and Count_match(gram_n) is the maximum number of matching n-gram words between the reference summary and the generated summary. Count(gram_n) is the total number of n-gram words in the reference summary [73].
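Equation (1) can be turned into a few lines of code directly from the definition; the snippet below is a small reference implementation written from that definition, not the official ROUGE package [73].

```python
# ROUGE-N recall computed from equation (1).
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=1):
    cand = ngrams(candidate.lower().split(), n)
    match, total = 0, 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        match += sum(min(cnt, cand[g]) for g, cnt in ref_counts.items())  # clipped matches
        total += sum(ref_counts.values())
    return match / total if total else 0.0

refs = ["ahmed ate the apple"]
print(rouge_n("the apple ahmed ate", refs, n=1))   # 1.0 (all unigrams recalled)
print(rouge_n("the apple ahmed ate", refs, n=2))   # 2/3 of the bigrams recalled
```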

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of the common matching words between the reference summary and the generated summary. The LCS calculation does not require the matching words to be consecutive; however, their order of occurrence is important. In addition, no predefined number of matching words is required. LCS considers only the main in-sequence, which is one of its disadvantages, since the final score does not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:


Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets.

Table 4: Abstractive summarisation datasets.
Reference | Training | Summarization evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword |
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: the apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" but not both, similar to LCS.
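The LCS underlying ROUGE-L can be computed with standard dynamic programming; the sketch below reproduces the R/A example above and reports an LCS-based recall (the recall-only score and the whitespace tokenisation are assumptions for illustration, not the official ROUGE implementation):

def lcs_length(ref_tokens, cand_tokens):
    # Longest common subsequence: order matters, tokens need not be consecutive.
    m, n = len(ref_tokens), len(cand_tokens)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

reference = "ahmed ate the apple".split()
candidate = "the apple ahmed ate".split()
lcs = lcs_length(reference, candidate)   # 2: either "ahmed ate" or "the apple", not both
print(lcs, lcs / len(reference))         # 2 0.5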

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]. The models were evaluated using various datasets; the other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by the text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively [65]. Even though ROUGE was employed to evaluate abstractive summarisation, it would be better to devise new methods to evaluate the quality of the summaries. The new evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text; ROUGE, however, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups. The first group considered single-sentence summary approaches, while the second group considered multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods on the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are those that employed BERT word embedding and were based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset: Gigaword is utilised for single-sentence summaries, as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries. Thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to evaluate the readability of the generated summaries of 50 test examples for 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summaries [60]. Five human evaluators evaluated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where a value of 1 indicates that the generated summary is less readable and less relevant, while a value of 10 indicates that the generated summary is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retention of key information, which was achieved by human evaluation [65]. In addition, a qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were used for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summary of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are not enough, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either a gold token is utilised during training or the previous step is employed during both testing and training. In this manner, at least the training step receives the same


input as testing. In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76]. Moreover, the mass convolution of the QRNN is applied in [50], since the dependency of words generated in the future is difficult to determine.
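The coin-flip idea shared by DaD and scheduled sampling can be sketched as a decoding loop that occasionally feeds the model its own prediction. This is only an illustration of the description above; the decoder_step callable and the 0.1 sampling rate are assumptions, not the cited models' code:

import random

def train_decode(decoder_step, gold_tokens, start_token, sample_prob=0.1):
    """One training pass over a summary: with probability sample_prob, feed back
    the previously generated token instead of the gold (reference) token."""
    inp, outputs = start_token, []
    for gold in gold_tokens:
        predicted = decoder_step(inp)   # hypothetical step function returning the next token
        outputs.append(predicted)
        # Coin flip: usually teacher-force the gold token, occasionally expose the
        # model to its own prediction, as it will be at test time.
        inp = predicted if random.random() < sample_prob else gold
    return outputs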

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copies it to the memory. When the switch is turned on, the decoder generates a word from the target vocabularies. Conversely, researchers in [56] addressed OOV words via the generation probability Pgen, whose value is calculated from the context vector and decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them. The combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the input sequence position for

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | —
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset.


unseen tokens in order to copy them. Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. Furthermore, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
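All of these switching/copy variants reduce to mixing a generation distribution and a copy distribution with the generation probability Pgen; a minimal NumPy sketch of that mixing step (toy sizes and variable names are assumptions, not the cited implementations) is:

import numpy as np

def final_distribution(p_vocab, attention, source_ids, vocab_size, p_gen):
    """Pointer-generator mixing: P(w) = p_gen * P_vocab(w)
    + (1 - p_gen) * sum of attention on the source positions where w occurs."""
    extended = np.zeros(vocab_size)             # extended vocabulary = target vocab + source-only words
    extended[:len(p_vocab)] = p_gen * p_vocab   # generate from the target vocabulary
    for pos, word_id in enumerate(source_ids):  # copy by pointing at source positions
        extended[word_id] += (1.0 - p_gen) * attention[pos]
    return extended

p_vocab = np.array([0.7, 0.2, 0.1])             # toy vocabulary of 3 known words
attention = np.array([0.5, 0.5])                # attention over a 2-word source sentence
source_ids = [1, 3]                             # the second source word is OOV (extended id 3)
dist = final_distribution(p_vocab, attention, source_ids, vocab_size=4, p_gen=0.8)
print(dist, dist.sum())                         # a valid distribution summing to 1.0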

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges are due to the summarisation of long documents and the production of long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create the coverage vector by aggregating the attention over all previous timesteps. In [57], repetition was addressed by using the key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and to attend to the same part of the input at a different step of the decoder. However, the intra-attention encoder mechanism cannot address all the repetition challenges, especially when a long sequence is generated. Thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words; this mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy loss, maximum likelihood, and gradient reinforcement learning to minimise exposure bias. In addition, the trigram probability p(yt) was proposed to address repetition in the generated summary, where yt is the trigram sequence; the value of p(yt) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated by a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.
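The trigram-blocking rule can be sketched as a simple check applied when extending a beam hypothesis; this is an illustrative helper assumed from the description above, not code from the cited papers:

def repeats_trigram(tokens, candidate):
    """Return True if appending `candidate` would recreate a trigram that already
    appears in the partial summary; such hypotheses get probability 0 in the beam."""
    if len(tokens) < 2:
        return False
    new_trigram = (tokens[-2], tokens[-1], candidate)
    seen = {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}
    return new_trigram in seen

summary = "the cat sat on the mat the cat".split()
print(repeats_trigram(summary, "sat"))   # True: "the cat sat" already occurred
print(repeats_trigram(summary, "ran"))   # False: extending with "ran" is allowed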

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary consists of the highlights of the news article. Every highlight represents a sentence in the summary; therefore,

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all the crucial points of the article; therefore, producing a high-quality dataset requires considerable effort. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset exists but is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words. For example, the words book and books are considered different by any of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words: words that have the same meaning must be considered the same even if they have different surface forms. In this case, we propose to use METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]. Moreover, METEOR considers stemming, morphological variants, and synonyms. In addition, in flexible-order languages, it is better to use ROUGE without considering the order of the words.
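One inexpensive way to relax ROUGE's exact-match constraint, in line with the suggestion above, is to normalise tokens with a stemmer before counting overlaps. The sketch below uses NLTK's Porter stemmer; the choice of stemmer and the recall-only unigram score are assumptions for illustration and are not a replacement for METEOR:

from collections import Counter
from nltk.stem import PorterStemmer   # requires the nltk package

stemmer = PorterStemmer()

def stemmed_rouge1_recall(reference, candidate):
    # Count unigram overlap after stemming, so "book" and "books" match.
    ref = Counter(stemmer.stem(w) for w in reference.lower().split())
    cand = Counter(stemmer.stem(w) for w in candidate.lower().split())
    overlap = sum(min(c, cand[w]) for w, c in ref.items())
    return overlap / max(sum(ref.values()), 1)

print(stemmed_rouge1_recall("he reads books", "he read a book"))  # 1.0 instead of a partial score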

The quality of the generated summary can be improved using linguistic features. For example, we proposed the use

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


of dependency parsing at the encoder, in a separate layer on top of the first hidden state layer. We also proposed the use of word embeddings built by considering dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the new trends and evaluation results, we think that the most promising feature among all the features is the use of the BERT pretrained model. The quality of the models that are based on the transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that applied deep learning for abstractive text summarisation, the datasets, and the measures for the evaluation of these approaches. Moreover, the challenges encountered when employing various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times. Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively. The best results were achieved by the models that apply transformers. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, there are several issues that must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.

[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.

[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.

[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.

[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.

[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.

[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.

[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.

[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.

[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.

[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, India, 2017.

[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.

[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.

[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.

[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.

[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, India, 2018.

[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.

[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.

[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, http://arxiv.org/abs/1812.02303, 2020.

[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.

[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.

[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.

[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.

[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

[29] K. Lopyrev, Generating News Headlines with Recurrent Neural Networks, p. 9, 2015, https://arxiv.org/abs/1512.01712.

[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.

[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.

[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] K. Cho, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.

[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.

[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.

[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.

[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.

[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.

[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.

[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.

[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.

[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.

[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.

[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, Quasi-Recurrent Neural Networks, https://arxiv.org/abs/1611.01576, 2015.

[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.

[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.

[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.

[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of CoNLL-16, Berlin, Germany, August 2016.

[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.

[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.

[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.

[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.

[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.

[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.

[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.

[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.

[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.

[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.

[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.

[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation, text summarization branches out," Proceedings of the ACL-04 Workshop, vol. 8, 2004.

[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.

[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, New Orleans, LA, USA, pp. 708–719, June 2018.

[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.

[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.

[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.

[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.

[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.



output summary was balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence was read in reverse order, based on the observation that LSTM learns better when reading the source in reverse order while remembering the order of the target [66, 67]. A softmax layer was employed on the decoder side to obtain the probability of each target word in the summary over the vocabulary distribution, taking the output of the decoder as input to the softmax layer. The decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. The objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token. However, during testing, the input of the forward decoder is the token generated in the previous step. The same situation holds for the backward decoder, where the input during training is the future token from the summary. Nevertheless, the bidirectional decoder has difficulty during testing, since the complete summary must be known in advance; thus, the full backward decoder output is generated and fed to the forward decoder using a unidirectional backward beam search.
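A single step of this kind of attentive decoder, where the context vector and the current decoder hidden state are projected onto the vocabulary through a softmax layer, can be sketched in PyTorch as follows. The layer sizes, names, and dot-product attention form are illustrative assumptions, not the cited model's code:

import torch
import torch.nn.functional as F

def decode_step(dec_hidden, enc_states, W_att, W_out):
    """One decoder step: attend over encoder states, build a context vector, and
    project [hidden; context] onto the vocabulary with a softmax."""
    scores = enc_states @ (W_att @ dec_hidden)          # alignment scores, shape (src_len,)
    attn = F.softmax(scores, dim=0)                     # attention distribution over the source
    context = attn @ enc_states                         # context vector, shape (hidden,)
    logits = W_out @ torch.cat([dec_hidden, context])   # vocabulary logits
    return F.softmax(logits, dim=0), attn

hidden, vocab = 4, 10
enc_states = torch.randn(6, hidden)                     # 6 source positions
dec_hidden = torch.randn(hidden)
W_att = torch.randn(hidden, hidden)
W_out = torch.randn(vocab, 2 * hidden)
p_vocab, attn = decode_step(dec_hidden, enc_states, W_att, W_out)
print(p_vocab.sum())                                    # sums to ~1.0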

A combination of abstractive and extractive methods was employed in the guiding generation model proposed by [59]. The extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent key information. Furthermore, to predict the final summary with a long-term value, the proposed method applied a prediction guide mechanism [68]. A prediction

Figure 9: Baseline sequence-to-sequence model with attention mechanism [56].

Figure 10: Semantic-unit-based LSTM model [30].


guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. The encoder-decoder architecture baseline of the proposed model is similar to that proposed by Nallapati et al. [55], where a bidirectional LSTM encoder and a unidirectional LSTM decoder were employed. Both models applied the attention mechanism and a softmax layer. Moreover, the process of generating the summary was improved by proposing KIGN, which takes as input the keywords extracted using the TextRank algorithm. In KIGN, key information is represented by concatenating the last forward hidden state and the first backward hidden state. KIGN employs the attention mechanism and the pointer mechanism. In general, the attention mechanism hardly identifies keywords; thus, to identify keywords, the output of KIGN is fed to the attention mechanism. As a result, the attention mechanism will be highly affected by the keywords. However, to enable the pointer network to identify the keywords, which are the output of KIGN, the encoder context vector and the hidden state of the decoder are fed to the pointer network, and the output is employed to calculate the soft switch. The soft switch determines whether to copy the target from the original text or generate it from the target vocabulary, as shown in Figure 11.

The level of abstraction in the generated summaries of abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and the use of a novel metric for optimising the overlap between the n-gram summary and the ground-truth summary. The decoder was decomposed into a contextual network and a pretrained language model, as shown in Figure 12. The contextual network uses the source document to extract the relevant parts, and the pretrained language model is generated from prior knowledge. This decomposition method facilitates the addition of an external pretrained language model that is related to several domains. Furthermore, a novel metric was employed to generate an abstractive summary by including words that are not in the source document. Bidirectional LSTM was utilised in the encoder, and the decoder applied a 3-layer unidirectional weight-dropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applied the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combined reinforcement learning and maximum likelihood.

A bidirectional decoder with a sequence-to-sequence architecture, referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing because the input of the decoder is the previously generated summary word; if one of the generated summary words is incorrect, the error propagates through all subsequent summary words. In the bidirectional decoder, there are two decoders, a forward decoder and a backward decoder. The forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left, and the forward decoder considers a reference from the backward decoder. However, there is only a single-layer encoder. The encoder and decoders employ LSTM units, but while the encoder utilises bidirectional LSTM, the decoders use unidirectional LSTM, as shown in Figure 13. To understand the summary generated by the backward decoder, the attention mechanism is applied in both the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, an attention mechanism is employed in both decoders.

A double attention pointer network, referred to as DAPT, was applied to generate an abstractive text summarisation model [49]. The encoder utilised bidirectional LSTM, while the decoder utilised unidirectional LSTM. The encoder key features were extracted using a self-attention mechanism, and a beam search was employed at the decoder; as a result, more coherent and accurate summaries were generated. The repetition problem was addressed using an improved coverage mechanism with a truncation parameter. The model was optimised by generating a training model that is based on RL and scheduled sampling.

4.1.2. GRU-RNN. Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. The dual encoder consists of two levels of encoders, i.e., primary and secondary encoders, in addition to one decoder, and all of them employ a GRU. The primary encoder considers coarse encoding, while the secondary encoder considers fine encoding. The primary encoder and decoder are the same as in the standard encoder-decoder model with an attention mechanism, and the secondary encoder generates a new context vector that is based on previous output and input. Moreover, an additional context vector provides meaningful information for the output; thus, the repetition problem of the generated summary that was encountered in previous approaches is addressed. The semantic vector is generated at both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is calculated. The fixed-length output is partially generated at each stage in the decoder, since it decodes in stages.

Figure 14 elaborates the DEATS process. The primary encoder produces a hidden state h_j^p for each input j and a content representation c^p. Next, the decoder decodes a fixed-length output, which is referred to as the decoder content representation c^d. The weight α_j can be calculated using the hidden states h_j^p and the content representations c^p and c^d. In this stage, the secondary encoder generates new hidden states, or semantic context vectors, h_m^s, which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator, a copy mechanism, and a coverage mechanism.

Wang et al. proposed a hybrid extractive-abstractive text summarisation model that is based on combining reinforcement learning with BERT word embedding [63]. In this hybrid model, a BERT feature-based strategy was used to


generate contextualised token embeddings. This model consists of two submodels, an abstractive agent and an extractive agent, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where the sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In both the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. The training process consists of pretraining and full-training phases.

Egonmwan et al. proposed the use of sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules, an extractive model and an abstractive model. The encoder transformer has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives a sentence-level representation. The architecture of the abstractive model consists of a single-

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].


layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder. The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.

4.1.3. Others. BERT is employed to represent the sentences of the document and to express their semantics [65]. Liu et al. proposed abstractive and extractive summarisation models that are based on the encoder-decoder architecture. The encoder used a BERT pretrained document-level encoder, while the decoder utilised a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage for extractive summarisation and one stage for abstractive summarisation. At the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed, since these two mechanisms need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries using trigram blocking. OOV words rarely appear in the generated summary.

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, in the Paulus et al. model, both the input and output tokens applied the same embedding matrix Wemb, which was generated using the GloVe word embedding model [57]. Another word embedding matrix, referred to as Wout, was applied in the token generation layer. Additionally, a shared weighting matrix was employed by both the shared embedding matrix Wemb and the Wout matrix. The shared weighting matrices improved the process of generating tokens, since they considered the syntactic and semantic information captured by the embeddings.
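Sharing one embedding matrix between the input lookup and the output projection is commonly known as weight tying; a generic PyTorch sketch of the idea described above (illustrative names, not the cited model's code) is:

import torch
import torch.nn as nn

class TiedOutputDecoderHead(nn.Module):
    """Token-generation layer whose output projection reuses the input embedding
    weights, so the syntax/semantics learned by the embeddings shape generation."""
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.proj = nn.Linear(emb_dim, vocab_size, bias=False)
        self.proj.weight = self.embedding.weight   # weight tying: W_out = W_emb

    def forward(self, decoder_state):
        # decoder_state: (batch, emb_dim) -> vocabulary logits: (batch, vocab_size)
        return self.proj(decoder_state)

head = TiedOutputDecoderHead(vocab_size=5000, emb_dim=128)
logits = head(torch.randn(2, 128))
print(logits.shape)   # torch.Size([2, 5000])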

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].


The discriminator input sequence of the Liu et al. model was encoded using a max-pooling CNN, where the result was passed to a softmax layer [58]. On the other hand, the word embedding applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embeddings. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of source documents of 781 tokens paired with summaries of 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the documents were preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets, the CNN/Daily News dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consists of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve coreference. The second stage of the ATSDL model is phrase extraction, which includes the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP is to perform Stanford NLP parsing, a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing is performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If the child node has children, then the child is considered a new root with children; this process continues recursively until there are no children for the root, at which point the tree structure is complete. Accordingly, compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are refined before training by applying simple rules. First, the phrase triples at the topmost level are exploited, since they carry the most semantic information. Second, triple phrases whose subject and object phrases contain no nouns are deleted, since the noun carries a considerable amount of conceptual information; triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To achieve the goal of the previous task and determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consist of 92,000 source texts and 219,000 source texts, respectively.

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1 41.16, ROUGE2 15.75, and ROUGE-L 39.08 [57]. The results for the maximum-likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to assess the quality and readability of the summary. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]: the value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics for the Al-Sabahi et al. and Li et al. models; values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluation included ROUGE1, ROUGE2, and ROUGE-L, for which values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel n-gram-related score was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model: five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected, and the evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, where the WordPiece tokeniser was utilised; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-answering paradigm, where 20 documents were selected for evaluation. Three values were chosen for scoring the answers: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attention encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model contains a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs, a GRU-RNN for the word level and a GRU-RNN for the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering the full vocabulary, only certain words were added, based on their frequency in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in embedding matrices for each tag type, similar to word embedding, while the TF-IDF feature was discretised into a fixed number of bins, with a one-hot representation employed to represent the bin value. The one-hot vector consists of the number of bin entries, where only one entry is set to one to indicate the TF-IDF value of a certain word. This process permits the TF-IDF to be treated in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8M training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and named-entity tag generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
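The feature-rich input amounts to concatenating, per word, the word embedding with small vectors for the POS tag, NER tag, and discretised TF-IDF bin. The NumPy sketch below illustrates the concatenation; the dimensions and bin count are assumptions, and for simplicity the POS and NER tags are shown as one-hot vectors here, whereas the cited model embeds them with learned lookup tables:

import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def feature_rich_input(word_vec, pos_id, ner_id, tfidf, n_pos=45, n_ner=9, n_bins=10):
    """Concatenate the word embedding with POS, NER, and TF-IDF-bin vectors
    into one long input vector for the encoder."""
    tfidf_bin = min(int(tfidf * n_bins), n_bins - 1)   # discretise TF-IDF into fixed bins
    return np.concatenate([
        word_vec,
        one_hot(pos_id, n_pos),
        one_hot(ner_id, n_ner),
        one_hot(tfidf_bin, n_bins),
    ])

x = feature_rich_input(np.random.randn(200), pos_id=12, ner_id=3, tfidf=0.37)
print(x.shape)   # (264,) = 200 + 45 + 9 + 10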

Finally, for both the single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of several approaches are presented in Table 2, while the training and optimisation, the mechanisms used, and the search strategy at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets were produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of each article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets that were generated by Hermann et al. [71], which were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In the training set, the source documents have 766 words (29.74 sentences on average), while the summaries have 53 words (3.72 sentences on average) [55].
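For readers who wish to inspect these pairs, the following sketch loads the abstractive CNN/Daily Mail corpus; it assumes the Hugging Face `datasets` package and version "3.0.0" of the corpus, and it is not part of the surveyed work:

```python
# Minimal sketch: load the abstractive CNN/Daily Mail pairs, where the
# article highlights serve as the multisentence reference summary.
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
print({split: len(cnn_dm[split]) for split in cnn_dm})  # train/validation/test sizes
sample = cnn_dm["train"][0]
print(sample["article"][:300])   # source document (truncated for display)
print(sample["highlights"])      # bullet-point highlights used as the reference summary
```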

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata between 1998 and 2017, was produced [72].


Figure 16: Word embedding concatenated with discretized TF-IDF, POS, and NER one-hot embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | N/A
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | Word-based decoder RNN
[50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM; bidirectional LSTM; bidirectional LSTM | Unidirectional RNN attentive encoder-decoder LSTM; unidirectional LSTM; decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets: nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study each applied the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenization, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions, trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins and a one-hot representation of the bins; (iv) lookup embeddings for part-of-speech and named-entity tags
[52] | Zhou et al. | PTB tokenization, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Word embedding with size equal to 300
[53] | Cao et al. | Normalization and tokenization, using "#" to replace digits, converting the words to lower case, and "UNK" to replace the least frequent words | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lower case letters | GloVe word embedding
[29] | Lopyrev | Tokenization, converting the articles and their headlines to lower case letters, using the symbol <unk> to replace rare words | The input was represented using a distributed representation
[38] | Jobson et al. | N/A | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | N/A | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | N/A | CNN maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving coreference and reducing the morphology | Convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | N/A | The word embedding is learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | N/A | The word embedding is learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | N/A | BERT
[63] | Wang et al. | Using the WordPiece tokenizer | BERT
[64] | Egonmwan et al. | N/A | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers listed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures for evaluating the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training and optimisation, mechanisms, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise the negative log-likelihood | N/A | Beam search
[39] | Chopra et al. | Minimising the negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimise the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimising the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimising the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | N/A
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanism | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | N/A
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | N/A
[30] | Song et al. | N/A | Attention mechanism, copy mechanism | N/A
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search, backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation, beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\[
\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)} \qquad (1)
\]

where S is a reference summary, n is the n-gram length, Count_match(gram_n) is the maximum number of matching n-gram words between the reference summary and the generated summary, and Count(gram_n) is the total number of n-gram words in the reference summary [73].
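A minimal Python sketch of equation (1) for a single reference summary is given below (the official ROUGE package additionally handles multiple references, stemming, and other options):

```python
from collections import Counter

def rouge_n_recall(reference_tokens, generated_tokens, n=2):
    """ROUGE-N recall as in equation (1): overlapping n-grams divided by
    the total number of n-grams in the reference summary (single reference)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, gen = ngrams(reference_tokens), ngrams(generated_tokens)
    overlap = sum(min(count, gen[gram]) for gram, count in ref.items())  # clipped matches
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n_recall("ahmed ate the apple".split(), "the apple ahmed ate".split(), n=1))  # 1.0
print(rouge_n_recall("ahmed ate the apple".split(), "the apple ahmed ate".split(), n=2))  # 0.666...
```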

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of matching words, in order, between the reference summary and the generated summary. The LCS calculation does not require the matching words to be consecutive; however, their order of occurrence is important, and no predefined number of matching words is required. LCS considers only the main in-sequence words, which is one of its disadvantages, since the final score will not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets [61].

Table 4: Abstractive summarisation datasets.

Reference | Training dataset | Evaluation dataset
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | N/A
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: the apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple", but not both, since only a single LCS is counted.
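The following sketch computes the LCS and a simple ROUGE-L F-score for this example; it ignores the beta-weighting used by the official package:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence (order matters, gaps allowed)."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = table[i-1][j-1] + 1 if xi == yj else max(table[i-1][j], table[i][j-1])
    return table[len(x)][len(y)]

R = "ahmed ate the apple".split()
A = "the apple ahmed ate".split()
lcs = lcs_length(R, A)                      # 2 -> either "ahmed ate" or "the apple"
recall, precision = lcs / len(R), lcs / len(A)
rouge_l_f = (2 * recall * precision) / (recall + precision)   # 0.5 for this pair
print(lcs, rouge_l_f)
```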

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]. The models were evaluated using various datasets, and the other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by the text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. Even though ROUGE was employed to evaluate abstractive summarisation, it would be better to devise new methods for evaluating the quality of such summaries. New evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text; ROUGE, in contrast, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups: the first group contains single-sentence summary approaches, while the second group contains multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L on the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods on the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by the text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are those that employed BERT word embedding and are based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset, as Gigaword is utilised for single-sentence summaries, since it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries. Thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the generated summaries of 50 test examples produced by 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summary [60]. Five human evaluators evaluated the relevance and readability of 100 randomly selected test examples, using scores between 1 and 10: a value of 1 indicates that the generated summary is less readable and less relevant, while a value of 10 indicates that the generated summary is highly readable and relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are comparable, with values of 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retention of key information, which was achieved by human evaluation [65]. In addition, a qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were used for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summary of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are also not sufficient, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder will be limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either the gold token is fed, as is usual during training, or the token generated at the previous step is fed, as happens during testing. In this manner, at least part of training receives the same input as testing. In all cases, the first input of the decoder is the <EOS> token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge during training: instead of feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76].

Moreover, the mass convolution of the QRNN is applied in [50], since the dependency of words generated in the future is difficult to determine.
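A minimal sketch of this coin-flip idea (scheduled sampling) is shown below; `decoder_step` is a hypothetical single-step decoder, and the 10% sampling probability follows the setting reported in [29]:

```python
import random
import torch

def decode_with_scheduled_sampling(gold_tokens, decoder_step, hidden,
                                   sample_prob=0.1, start_token_id=0):
    """Sketch of scheduled sampling: with probability `sample_prob` feed the model's
    own previous prediction instead of the gold (reference) token, so that training
    looks more like testing, where no gold tokens are available.
    `decoder_step(prev_token, hidden)` is assumed to return (logits, new_hidden)."""
    prev = torch.full((gold_tokens.size(0),), start_token_id, dtype=torch.long)
    outputs = []
    for t in range(gold_tokens.size(1)):
        logits, hidden = decoder_step(prev, hidden)
        outputs.append(logits)
        predicted = logits.argmax(dim=-1)
        use_prediction = random.random() < sample_prob   # the "coin flip"
        prev = predicted if use_prediction else gold_tokens[:, t]
    return torch.stack(outputs, dim=1), hidden
```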

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copies it to the memory. When the switch is turned on, the decoder generates a word from the target vocabulary. Conversely, researchers in [56] addressed OOV words via a generation probability Pgen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output word from the input sequence and generating it from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them; the combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the position of an unseen token in the input sequence so that it can be copied.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | N/A
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods (ABS+, RAS-Elman (k = 10), SEASS, Words-lvt5k-1sent, FTSumg, and RCT) for the Gigaword dataset.


Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. Furthermore, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
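The mixture controlled by Pgen can be sketched as follows (our illustration of the idea in [56]; tensor shapes and names are assumptions, not the authors' implementation):

```python
import torch

def final_distribution(vocab_dist, attention_dist, p_gen, source_ids, extended_vocab_size):
    """Sketch of the pointer-generator mixture: p_gen scales the generation
    distribution, (1 - p_gen) scales the copy (attention) distribution, and both
    are combined over the extended vocabulary so OOV source words can be copied.
    vocab_dist:     (batch, vocab)    softmax over the fixed vocabulary
    attention_dist: (batch, src_len)  attention over source positions
    p_gen:          (batch, 1)        generation probability
    source_ids:     (batch, src_len)  source token ids in the extended vocabulary"""
    batch, vocab = vocab_dist.size()
    extended = torch.zeros(batch, extended_vocab_size)
    extended[:, :vocab] = p_gen * vocab_dist            # generate from the vocabulary
    copy_probs = (1.0 - p_gen) * attention_dist         # copy from the source
    extended.scatter_add_(1, source_ids, copy_probs)    # add copy mass at source positions
    return extended
```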

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges are due to the summarisation of long documents and the production of long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and to avoid attending to the same part of the input at different decoder steps. However, the intratemporal encoder attention mechanism cannot address all the repetition challenges, especially when a long sequence is generated. Thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words; this mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy loss, maximum likelihood, and gradient reinforcement learning to minimise the exposure bias. In addition, the trigram probability p(y_t) was proposed to address repetition in the generated summary, where y_t is a trigram sequence: the value of p(y_t) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated using a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].
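The trigram heuristic described above can be sketched as a simple check applied to each beam hypothesis (our illustration, not the authors' code):

```python
def violates_trigram_blocking(candidate_tokens):
    """A beam hypothesis is dropped (its probability set to zero)
    if it would contain the same trigram twice."""
    trigrams = [tuple(candidate_tokens[i:i + 3]) for i in range(len(candidate_tokens) - 2)]
    return len(trigrams) != len(set(trigrams))

# During beam search, a candidate extension `hyp + [next_token]` would be skipped
# whenever violates_trigram_blocking(hyp + [next_token]) is True.
print(violates_trigram_blocking("the cat sat on the cat sat".split()))  # True
```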

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. To address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts. Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary is conditioned on both the input text and the description of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised the copy and coverage mechanisms.

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summaries (golden summaries). In the CNN/Daily Mail dataset, the reference summary is composed of the highlights of the news article. Every highlight represents a sentence in the summary; therefore, the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all the crucial points of the article. Producing a high-quality dataset therefore requires considerable effort. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset is available but is not free.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods (Words-lvt2k-temp-att, pointer-generator + coverage, reinforcement learning with intra-attention, maximum-likelihood + RL with intra-attention, adversarial network, ATSDL, bidirectional attentional encoder-decoder, key information guide network, ML + RL ROUGE + Novel with LM, DEATS, BiSum, BEAR (large + WordPiece), TRANS-ext + filter + abs, BERTSUMEXT (large), and DAPT + imp-coverage (RL + MLE (ss))) for the CNN/Daily Mail datasets.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, for abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words; for example, the words "book" and "books" are considered different by any of the ROUGE metrics. Therefore, a new evaluation measure should be proposed that considers the context of the words, so that words with the same meaning are treated as identical even if they have different surface forms. In this case, we propose using METEOR, which has recently been used for evaluating machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for languages with flexible word order, it is better to use ROUGE without considering the order of the words.
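As a hedged illustration of this proposal, METEOR can be computed with NLTK's implementation (assuming WordNet data are available; recent NLTK versions expect pre-tokenised input):

```python
# Illustration only (not from the surveyed papers): METEOR credits stems and
# synonyms, so "book"/"books" are matched where exact n-gram overlap would fail.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference = "the books were summarised accurately".split()
candidate = "the book was summarised accurately".split()
print(meteor_score([reference], candidate))  # rewards book/books via stem matching
```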

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].

The quality of the generated summary can be improved using linguistic features. For example, we propose using dependency parsing at the encoder side, in a separate layer on top of the first hidden-state layer, and using word embeddings that are built by taking dependency parsing or part-of-speech tagging into account. On the decoder side, the quality of the beam search can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the new trends and evaluation results, we believe that the most promising direction is the use of the BERT pretrained model: the quality of the models that are based on the transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that apply deep learning to abstractive text summarisation, the datasets used, and the measures employed to evaluate these approaches. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively. The best results were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, there are several issues that must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.

[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.

[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.

[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.

[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.

[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.

[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.

[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.

[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.

[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.

[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, India, 2017.

[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.

[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.

[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.

[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.

[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, India, 2018.

[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.

[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.

[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, "Neural abstractive text summarization with sequence-to-sequence models: a survey," 2020, http://arxiv.org/abs/1812.02303.

[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.

[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.

[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.

[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.

[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

[29] K. Lopyrev, "Generating news headlines with recurrent neural networks," p. 9, 2015, https://arxiv.org/abs/1512.01712.

[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.

[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.

[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, "Bidirectional attentional encoder-decoder model and bidirectional beam search for abstractive summarization," Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.

[38] E. Jobson and A. Gutierrez, "Abstractive text summarization using attentive sequence-to-sequence RNNs," p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of NAACL-HLT 2016, pp. 93–98, San Diego, CA, USA, 2016.

[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.

[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.

[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.

[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.

[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.

[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.

[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.

[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.

[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.

[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," 2016, https://arxiv.org/abs/1611.01576.

[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.

[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.

[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.

[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of CoNLL-16, Berlin, Germany, August 2016.

[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.

[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.

[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.

[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.

[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.

[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.

[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.

[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.

[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.

[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.

[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.

[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.

[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.

[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.

[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.

[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.

[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.

[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.

[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.

Mathematical Problems in Engineering 29

Page 13: Deep Learning Based Abstractive Text Summarization ...downloads.hindawi.com/journals/mpe/2020/9365340.pdfDeep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation

guide mechanism is a feedforward single-layer neural net-work that predicts the key information of the final summaryduring testing e encoder-decoder architecture baseline ofthe proposed model is similar to that proposed by Nallapatiet al [55] where both the bidirectional LSTM encoder andthe unidirectional LSTM decoder were employed Bothmodels applied the attention mechanism and softmax layerMoreover the process of generating the summary wasimproved by proposing KIGN which considers as input thekeywords extracted using the TextRank algorithm In KIGNkey information is represented by concatenating the lastforward hidden state and first backward hidden state KIGNemploys the attention mechanism and pointer mechanismIn general the attention mechanism hardly identifies thekeywords thus to identify keywords the output of KIGNwill be fed to the attention mechanism As a result theattentionmechanismwill be highly affected by the keywordsHowever to enable the pointer network to identify thekeywords which are the output of KIGN the encodercontext vector and hidden state of the decoder will be fed tothe pointer network and the output will be employed tocalculate the soft switch e soft switch determines whetherto copy the target from the original text or generate it fromthe vocabulary of the target as shown in Figure 11

e level of abstraction in the generated summary of theabstractive summarisation models was enhanced via the twotechniques proposed in [60] decoder decomposition and theuse of a novel metric for optimising the overlap between then-gram summary and the ground-truth summary e de-coder was decomposed into a contextual network andpretrained language model as shown in Figure 12 econtextual network applies the source document to extractthe relevant parts and the pretrained language model isgenerated via prior knowledge is decomposition methodfacilitates the addition of an external pretrained languagemodel that is related to several domains Furthermore anovel metric was employed to generate an abstractivesummary by including words that are not in the sourcedocument Bidirectional LSTM was utilised in the encoderand the decoder applied 3-layer unidirectional weight-dropped LSTM In addition the decoder utilised a temporalattention mechanism which applied the intra-attentionmechanism to consider previous hidden states Further-more a pointer network was introduced to alternate be-tween copying the output from the source document andselecting it from the vocabulary As a result the objectivefunction combined between force learning and maximumlikelihood

A bidirectional decoder with a sequence-to-sequencearchitecture which is referred to as BiSum was employed tominimise error accumulation during testing [62] Errorsaccumulate during testing as the input of the decoder is thepreviously generated summary word and if one of thegenerated word summaries is incorrect then the error willpropagate through all subsequent summary words In thebidirectional decoder there are two decoders a forwarddecoder and a backward decoder e forward decodergenerates the summary from left to right while the backwarddecoder generates the summary from right to left e

forward decoder considers a reference from the backwarddecoder However there is only a single-layer encoder eencoder and decoder employ an LSTM unit but while theencoder utilises bidirectional LSTM the decoders use uni-directional LSTM as shown in Figure 13 To understand thesummary generated by the backward decoder the attentionmechanism is applied in both the backward decoder and theencoder Moreover to address the problem of out-of-vo-cabulary words an attention mechanism is employed inboth decoders

A double attention pointer network which is referred toas (DAPT) was applied to generate an abstractive textsummarisation model [49] e encoder utilised bidirec-tional LSTM while the decoder utilised unidirectionalLSTM e encoder key features were extracted using a self-attention mechanism At the decoder the beam search wasemployed Moreover more coherent and accurate sum-maries were generated e repetition problem wasaddressed using an improved coverage mechanism with atruncation parameter e model was optimised by gener-ating a training model that is based on RL and scheduledsampling

412 GRU-RNN Dual encoding using a sequence-to-se-quence RNN was proposed as the DEATS method [61] edual encoder consists of two levels of encoders ie primaryand secondary encoders in addition to one decoder and allof them employ a GRU e primary encoder considerscoarse encoding while the secondary encoder considers fineencoding e primary encoder and decoder are the same asthe standard encoder-decoder model with an attentionmechanism and the secondary encoder generates a newcontext vector that is based on previous output and inputMoreover an additional context vector provides meaningfulinformation for the output us the repetition problem ofthe generated summary that was encountered in previousapproaches is addressede semantic vector is generated onboth levels of encoding in the primary encoder the semanticvector is generated for each input while in the secondaryencoder the semantic vector is recalculated after the im-portance of each input word is calculated e fixed-lengthoutput is partially generated at each stage in the decodersince it decodes in stages

Figure 14 elaborates the DEATS process e primaryencoder produces a hidden state hpj for each input j andcontent representation cp Next the decoder decodes a fixed-length output which is referred to as the decoder contentrepresentation cd e weight αj can be calculated using thehidden states hpj and the content representations cp and cdIn this stage the secondary encoder generates new hiddenstates or semantic context vectors hsm which are fed to thedecoder Moreover DEATS uses several advanced tech-niques including a pointer-generator copy mechanism andcoverage mechanism

Wang et al proposed a hybrid extractive-abstractive textsummarisation model which is based on combining thereinforcement learning with BERTword embedding [63] Inthis hybrid model a BERTfeature-based strategy was used to

Mathematical Problems in Engineering 13

generate contextualised token embedding is modelconsists of two submodels abstractive agents and extractiveagents which are bridged using RL Important sentences areextracted using the extraction model and rewritten using theabstraction model A pointer-generator network was utilisedto copy some parts of the original text where the sentence-level and word-level attentions are combined In addition abeam search was performed at the decoder In abstractiveand extractive models the encoder consists of a bidirectionalGRU while the decoder consists of a unidirectional GRU

e training process consists of pretraining and full trainingphases

Egonmwan et al proposed to use sequence-to-sequenceand transformer models to generate abstractive summaries[64] e proposed summarisation model consists of twomodules an extractive model and an abstractive model eencoder transformer has the same architecture shown in[48] however instead of receiving the document repre-sentation as input it receives sentence-level representatione architecture of the abstractive model consists of a single-

Word1Word2 Word4Word5Word3 Word6Word7

helliphellip

Enco

der h

idde

nsta

tes

Word_Sum1ltStartgt Word_

Sum2

helliphellip

+

Word_Sum1ltStartgt Word_

Sum2

+

helliphellip

Decoder

Lang

uage

mod

el

Con

text

ual m

odel

htdeccttemp ctint Fusion layer

Figure 12 Decoder decomposed into a contextual model and a language model [60]

Word1Word2 Word4 Word5Word3 Word6 Word7

Attention

helliphellip

Enco

der h

idde

nsta

tes

Word_Sum1ltStartgt

helliphellip

Dec

oder

hid

den

state

s

Key Word1

K

Key Word2

Key Word3

Key Word4

helliphellipKe

y in

form

atio

n gu

ide n

etw

ork

Pointer Somax

St

Ct

Figure 11 Key information guide network [59]

14 Mathematical Problems in Engineering

layer unidirectional GRU at the encoder and single-layerunidirectional GRU at the decoder e input of the encoderis the output of the transformer A beam search was per-formed during inference at the decoder while greedy-decoding was employed during training and validation

413 Others BERT is employed to represent the sentencesof the document to express its semantic [65] Liu et alproposed abstractive and extractive summarisation modelsthat are based on encoder-decoder architecturee encoderused a BERT pretrained document-level encoder while thedecoder utilised a transformer that is randomly initialisedand trained from scratch In the abstractive model theoptimisers of the encoder and decoder are separatedMoreover two stages of fine-tuning are utilised at the en-coder one stage in extractive summarisation and one stagein abstractive summarisation At the decoder side a beamsearch was performed however the coverage and copymechanisms were not employed since these twomechanisms

need additional tuning of the hyperparameters e repe-tition problem was addressed by producing different sum-maries by using trigram-blocking e OOV words rarelyappear in the generated summary

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, both the input and output tokens applied the same embedding matrix W_emb, which was generated using the GloVe word embedding model, in the Paulus et al. model [57]. Another word embedding matrix, referred to as W_out, was applied in the token generation layer. Additionally, a sharing weighting matrix was employed by both the shared embedding matrix W_emb and the W_out matrix. The shared weighting matrices improved the process of generating tokens, since they considered the syntactic and semantic information encoded in the embeddings.
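A common way to realise such sharing is weight tying, where the output projection reuses the input embedding matrix. The PyTorch-style sketch below is only an illustration of this idea under assumed shapes; it is not the implementation of [57].

```python
import torch
import torch.nn as nn

class TiedDecoder(nn.Module):
    """Output layer that reuses the input embedding matrix (W_out tied to W_emb)."""
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # W_emb
        self.out = nn.Linear(emb_dim, vocab_size, bias=False)   # W_out
        self.out.weight = self.embed.weight                     # tie W_out to W_emb

    def forward(self, token_ids):
        hidden = self.embed(token_ids)   # stand-in for the real decoder hidden states
        return self.out(hidden)          # logits over the vocabulary

logits = TiedDecoder(vocab_size=50, emb_dim=16)(torch.tensor([[1, 2, 3]]))
print(logits.shape)  # torch.Size([1, 3, 50])
```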

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].

Figure 14: Dual encoding model [61].

The discriminator input sequence of the Liu et al. model was encoded using a maximum-pooling CNN, where the result was passed to the softmax layer [58]. On the other hand, the word embedding applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets, with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embedding. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of source documents of 781 tokens paired with summaries of 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the document was preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets: the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consists of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve the coreference. The second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP was Stanford NLP parsing, a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing was performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If a child node has children, then that child is considered a new root with its own children; this process continues recursively until the root has no children, at which point the tree structure is complete. Accordingly, compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are refined before training by applying simple rules. First, the phrase triples at the topmost level are exploited, since they carry the most semantic information. Second, triple phrases whose subject and object phrases contain no nouns are deleted, since the noun carries a considerable amount of conceptual information. Triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consist of 92,000 text sources and 219,000 text sources, respectively.
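The dependency-parsing step that underlies MOSP can be approximated with any modern parser. The sketch below uses spaCy (assumed to be installed with its small English model) instead of Stanford CoreNLP, and a deliberately simplified subject-verb-object rule, purely to illustrate how relational triples are read off a parse; it does not reproduce the actual ATSDL rules.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_triples(sentence):
    """Very simplified (subject, relation, object) extraction from a dependency parse."""
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples

print(extract_triples("The committee approved the new budget on Monday."))
# e.g. [('committee', 'approve', 'budget')]
```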

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1 41.16, ROUGE2 15.75, and ROUGE-L 39.08 [57]. The results for the maximum likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to evaluate the quality and readability of the summaries. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]; the value of ROUGE1 was 34.9 and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics of the Al-Sabahi et al. and Li et al. models: values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluations included ROUGE1, ROUGE2, and ROUGE-L, for which values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel n-gram-based score was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model: five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected. The evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, in which the WordPiece tokeniser was utilised; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-answering paradigm, where 20 documents were selected for evaluation. Three values were chosen for scoring each answer: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attention encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model contains a softmax layer for generating words based on the target vocabulary. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs, a GRU-RNN for the word level and a GRU-RNN for the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering the whole vocabulary, only certain words were added, based on their frequency in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered, in addition to the word embedding of the input words, to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrices for each tag type, similar to word embedding, while the TF-IDF feature was discretised into a fixed number of bins, where a one-hot representation was employed to represent the value of the bins. The one-hot matrix consists of the number of bin entries, where only one entry is set to one to indicate the TF-IDF value of a certain word. This process permits the TF-IDF to be addressed in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16.

The experiments were conducted using the annotated Gigaword corpus with 3.8 million training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and named-entity tag generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
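The feature-rich input of [55] can be pictured as a simple concatenation of the word vector with one-hot (or lookup) vectors for the POS tag, the NER tag, and the discretised TF-IDF bin. The sizes in the sketch below are illustrative assumptions, not the ones used in the paper.

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def feature_rich_embedding(word_vec, pos_id, ner_id, tfidf, n_pos=10, n_ner=5, n_bins=4):
    # Discretise the continuous TF-IDF score into a fixed number of bins.
    bin_id = min(int(tfidf * n_bins), n_bins - 1)
    return np.concatenate([
        word_vec,                  # e.g. Word2Vec vector of the token
        one_hot(pos_id, n_pos),    # part-of-speech tag
        one_hot(ner_id, n_ner),    # named-entity tag
        one_hot(bin_id, n_bins),   # discretised TF-IDF
    ])

vec = feature_rich_embedding(np.random.randn(200), pos_id=3, ner_id=1, tfidf=0.62)
print(vec.shape)  # (219,) = 200 + 10 + 5 + 4
```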

Finally, for both the single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of several approaches are presented in Table 2, while the training, optimisation, mechanisms, and search at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets are produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of the article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets generated by Hermann et al. [71], which were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (29.74 sentences on average), while the summaries have 53 words (3.72 sentences on average) [55].

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].


Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Figure 16: Word embedding concatenated with discretised TF-IDF, POS, and NER one-embedding vectors [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | —
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | Word-based RNN
[50] | 2016 | GRU + QRNN + attention | GRU + RNN (QRNN)
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM; bidirectional LSTM; bidirectional LSTM | Unidirectional RNN attentive encoder-decoder LSTM; unidirectional LSTM; decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets. Nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study applied each of the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization: "#" replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occurred fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenization: "#" replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occurred fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and name-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins and a one-hot representation of the bins; (iv) lookup embeddings for part-of-speech and name-entity tags
[52] | Zhou et al. | PTB tokenization: "#" replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occurred fewer than 5 times | Word embedding with size equal to 300
[53] | Cao et al. | Normalization and tokenization: "#" replaces digits, words are converted to lower case, and "UNK" replaces the least frequent words | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lower case letters | GloVe word embedding
[29] | Lopyrev | Tokenization; converting the articles and their headlines to lower case letters; using the symbol ⟨unk⟩ to replace rare words | The input was represented using a distributed representation
[38] | Jobson et al. | — | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | — | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | CNN maximum pooling was used to encode the discriminator input sequence | —
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving the coreference and reducing the morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | — | The word embedding is learned from scratch during training, with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | — | The word embedding is learned from scratch during training, with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | — | BERT
[63] | Wang et al. | Using the WordPiece tokenizer | BERT
[64] | Egonmwan et al. | — | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers listed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures to evaluate the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise the negative log-likelihood | — | Beam search
[39] | Chopra et al. | Minimising the negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimise the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimising the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimising the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross-entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | —
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanisms | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | —
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | —
[30] | Song et al. | — | Attention mechanism, copy mechanism | —
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)
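Nearly every decoder in Table 3 performs a beam search at inference time. The toy implementation below keeps the k highest-scoring partial sequences at each step; the next-token distribution is a dummy stand-in for a trained decoder, so the sketch only illustrates the search procedure itself.

```python
import math

def beam_search(step_probs, k=3, steps=4, eos=0):
    """step_probs(prefix) -> dict {token: probability} for the next token."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:           # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            for tok, p in step_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Dummy next-token distribution standing in for a trained decoder.
dummy = lambda seq: {1: 0.5, 2: 0.3, 0: 0.2}
for seq, score in beam_search(dummy, k=2):
    print(seq, round(score, 2))
```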


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)},   (1)

where S is a reference summary, n is the n-gram length, and Count_match(gram_n) is the maximum number of n-grams co-occurring in the reference summary and the generated summary. Count(gram_n) is the total number of n-grams in the reference summary [73].
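Equation (1) can be implemented directly with n-gram counters. The sketch below computes ROUGE-N recall for a single reference summary, using clipped counts; it is a simplified illustration rather than the official ROUGE package.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n("ahmed ate the apple", "the apple ahmed ate", n=1))  # 1.0
print(rouge_n("ahmed ate the apple", "the apple ahmed ate", n=2))  # 0.666...
```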

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of the common matching words between the reference summary and the generated summary. The LCS calculation does not require the matching words to be consecutive; however, their order of occurrence is important, and no predefined number of matching words is required. LCS considers only the main in-sequence match, which is one of its disadvantages, since the final score does not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets.

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarization evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | —
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: The apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" but not both, similar to LCS.
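The behaviour in this example can be reproduced with a standard dynamic-programming LCS. The sketch below reports the LCS length and the resulting recall against the reference; it is a simplified illustration, not the official ROUGE-L implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence between token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

ref = "ahmed ate the apple".split()
cand = "the apple ahmed ate".split()
lcs = lcs_length(ref, cand)
print(lcs, lcs / len(ref))  # 2 0.5 -> only "ahmed ate" (or "the apple") is credited
```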

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]; these models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with the pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. Even though ROUGE was employed to evaluate abstractive summarisation, new methods are needed to evaluate the quality of such summaries: new evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text. ROUGE, by contrast, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results for ROUGE1, ROUGE2, and ROUGE-L into two groups: the first group contains single-sentence summary approaches, while the second group contains multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are those that employed BERT word embedding and were based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset, as Gigaword is utilised for single-sentence summaries (it contains headlines that are treated as summaries), while the CNN/Daily Mail datasets contain multisentence summaries; thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the summaries generated by 5 models for 50 test examples [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation of the generated summaries [60]: five human evaluators rated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where 1 indicates that the generated summary is less readable and less relevant and 10 indicates that it is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are close, with values of 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summaries in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retained key information, which was achieved by human evaluation [65]. In addition, the qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were selected for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summary of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are also not enough, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either the gold token is fed, as during training, or the prediction from the previous step is fed, as at test time.


In this manner, at least part of the training receives the same input as testing. In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, the generated word of the previous step is fed back 10% of the time [75, 76]. Moreover, the mass convolution of the QRNN is applied in [50], since the dependency of words generated in the future is difficult to determine.
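The coin-flip idea behind DaD and the 10% feedback used in [29] can both be written as a scheduled-sampling step: at each decoding position the model is fed either the gold token or its own previous prediction. The sketch below is a framework-agnostic illustration; `decoder_step` is a hypothetical stand-in for a trained model.

```python
import random

def scheduled_sampling_decode(gold_tokens, decoder_step, p_feed_generated=0.10, start="<s>"):
    """At each step feed the gold token or, with probability p, the model's own output."""
    prev, outputs = start, []
    for gold in gold_tokens:
        predicted = decoder_step(prev)               # model prediction for this step
        outputs.append(predicted)
        use_generated = random.random() < p_feed_generated
        prev = predicted if use_generated else gold  # training-time input for the next step
    return outputs

# Dummy decoder that just echoes its input (stands in for a trained model).
dummy_step = lambda prev_token: prev_token
print(scheduled_sampling_decode(["the", "cat", "sat"], dummy_step))
```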

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side alternates between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copy it to the output. When the switch is turned on, the decoder generates a word from the target vocabulary. Conversely, researchers in [56] addressed OOV words via a generation probability P_gen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, P_gen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them; the combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the position of an unseen token in the input sequence in order to copy it.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | —
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel, with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset.


Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. In [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
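The switching behaviour of [56] can be summarised as a soft mixture controlled by P_gen, computed from the context vector, decoder state, and decoder input, that blends a vocabulary distribution with a copy distribution over source positions. The sketch below uses assumed shapes and randomly initialised parameters and is only an illustration of the idea.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pointer_generator_step(p_vocab, attention, src_ids, context, dec_state, dec_input, params):
    """Extended-vocabulary distribution: p_gen * P_vocab + (1 - p_gen) * copy distribution."""
    w_c, w_s, w_x, b = params
    p_gen = sigmoid(w_c @ context + w_s @ dec_state + w_x @ dec_input + b)
    final = p_gen * p_vocab                      # generation part (in-vocabulary words)
    for pos, src_id in enumerate(src_ids):       # copy part: scatter attention onto source ids
        final[src_id] += (1.0 - p_gen) * attention[pos]
    return p_gen, final

np.random.seed(1)
V, H = 12, 6                                     # extended vocabulary size, hidden size
p_vocab = np.random.dirichlet(np.ones(V))        # toy vocabulary distribution
attention = np.random.dirichlet(np.ones(5))      # attention over 5 source tokens
params = (np.random.randn(H), np.random.randn(H), np.random.randn(H), 0.0)
p_gen, dist = pointer_generator_step(p_vocab, attention, [3, 7, 7, 10, 11],
                                     np.random.randn(H), np.random.randn(H),
                                     np.random.randn(H), params)
print(round(float(p_gen), 3), round(float(dist.sum()), 3))  # dist still sums to 1
```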

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges arise when summarising long documents and producing long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model, which creates a coverage vector by aggregating the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention steps. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repeatedly attending to the same parts of the input at different decoder steps. However, the intra-attention encoder mechanism cannot address all repetition challenges, especially when a long sequence is generated; thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words, and it is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy (maximum likelihood) loss with gradient reinforcement learning to minimise the exposure bias. In addition, the trigram probability p(y_t) was proposed to address repetition in the generated summary, where y_t is the trigram sequence: the value of p(y_t) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated using a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].
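The coverage idea keeps a running sum of past attention distributions and penalises re-attending to already-covered positions. The NumPy sketch below shows the coverage update and the per-step coverage loss in the spirit of [56]; the attention distributions are faked rather than produced by a real decoder.

```python
import numpy as np

def coverage_loss(attention_steps):
    """Sum over steps of sum_i min(attention_t[i], coverage_t[i])."""
    coverage = np.zeros_like(attention_steps[0])
    total_loss = 0.0
    for attn in attention_steps:
        total_loss += np.minimum(attn, coverage).sum()  # penalise repeated attention
        coverage += attn                                 # coverage is the running sum
    return total_loss

sharp_repeat = [np.array([0.9, 0.05, 0.05])] * 3   # keeps attending to the same token
spread_out  = [np.array([0.9, 0.05, 0.05]),
               np.array([0.05, 0.9, 0.05]),
               np.array([0.05, 0.05, 0.9])]
print(coverage_loss(sharp_repeat), coverage_loss(spread_out))  # high vs low penalty
```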

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. To address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts. Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates the extraction of entities and their relations, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summaries (golden summaries). In the CNN/Daily Mail dataset, the reference summary is formed from the highlights of the news article, and every highlight becomes a sentence in the summary.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


Therefore, the number of sentences in the summary is equal to the number of highlights. Sometimes, the highlights do not address all the crucial points of the article; hence, considerable effort is needed to produce a high-quality dataset. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset exists but is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words; for example, the words book and books are considered different by any of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words: words that have the same meaning must be considered the same even if they have different surface forms. In this case, we propose the use of METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for flexible-order languages, it is better to use ROUGE without considering the order of the words.

The quality of the generated summary can also be improved using linguistic features.

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


For example, we propose the use of dependency parsing at the encoder, in a separate layer on top of the first hidden-state layer, and the use of word embeddings built by considering dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the new trends and evaluation results, we think that the most promising direction among all of these is the use of the BERT pretrained model. The quality of the models that are based on the transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that applied deep learning for abstractive text summarisation, the datasets used, and the measures employed to evaluate these approaches. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times: Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively; the best results were achieved by the models that apply the transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, there are several issues that must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.

[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.

[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.

[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.

[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.

[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.

[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.

[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.

[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.

[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.

[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, India, 2017.

[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.

[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.

[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.

[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.

[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, India, 2018.

[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.

[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.

[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, "Neural abstractive text summarization with sequence-to-sequence models: a survey," 2020, http://arxiv.org/abs/1812.02303.

[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.

[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.

[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.

[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.

[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

[29] K. Lopyrev, "Generating news headlines with recurrent neural networks," p. 9, 2015, https://arxiv.org/abs/1512.01712.

[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.

[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.

[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.

[38] E. Jobson and A. Gutierrez, "Abstractive text summarization using attentive sequence-to-sequence RNNs," p. 8, 2016.

[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of NAACL-HLT 2016, pp. 93–98, San Diego, CA, USA, 2016.

[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.

[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.

[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.

[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.

[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.

[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.

[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.

[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.

[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.

[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," 2016, https://arxiv.org/abs/1611.01576.

[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.

[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.

[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.

[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of CoNLL-16, Berlin, Germany, August 2016.

[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.

[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.

[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.

[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.

[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.

[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.

[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.

[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.

[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.

[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.

[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.

[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.

[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX Workshop, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.
[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.
[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.
[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.
[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.
[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.
[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.


generate contextualised token embeddings. This model consists of two submodels, abstractive agents and extractive agents, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where the sentence-level and word-level attentions are combined. In addition, a beam search was performed at the decoder. In both the abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. The training process consists of pretraining and full training phases.

Egonmwan et al. proposed to use sequence-to-sequence and transformer models to generate abstractive summaries [64]. The proposed summarisation model consists of two modules: an extractive model and an abstractive model. The encoder transformer has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives sentence-level representations.

Figure 12: Decoder decomposed into a contextual model and a language model [60].

Figure 11: Key information guide network [59].


The architecture of the abstractive model consists of a single-layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder. The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.
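As an illustration of the second stage described above, the following is a minimal sketch (not the authors' code) of a single-layer unidirectional GRU encoder-decoder that consumes precomputed sentence-level representations from a transformer encoder; all dimensions and the vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GRUSeq2Seq(nn.Module):
    def __init__(self, d_model=512, hidden=512, vocab_size=30000):
        super().__init__()
        self.encoder = nn.GRU(d_model, hidden, batch_first=True)  # reads transformer outputs
        self.embed = nn.Embedding(vocab_size, d_model)            # target-side embeddings
        self.decoder = nn.GRU(d_model, hidden, batch_first=True)  # unidirectional decoder
        self.out = nn.Linear(hidden, vocab_size)                  # projection to the vocabulary

    def forward(self, sent_reprs, target_ids):
        # sent_reprs: (batch, num_sentences, d_model) produced by the transformer encoder
        _, h = self.encoder(sent_reprs)          # final encoder state initialises the decoder
        dec_out, _ = self.decoder(self.embed(target_ids), h)  # teacher-forced decoding
        return self.out(dec_out)                 # (batch, target_len, vocab_size)

model = GRUSeq2Seq()
logits = model(torch.randn(2, 10, 512), torch.randint(0, 30000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 30000])
```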

4.1.3. Others. BERT is employed to represent the sentences of the document to express its semantics [65]. Liu et al. proposed abstractive and extractive summarisation models that are based on the encoder-decoder architecture. The encoder used a BERT pretrained document-level encoder, while the decoder utilised a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage in extractive summarisation and one stage in abstractive summarisation. At the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed, since these two mechanisms need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries by using trigram blocking. The OOV words rarely appear in the generated summary.
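A hedged sketch of this kind of setup follows: a pretrained BERT encoder, a randomly initialised transformer decoder, and two separate optimisers so the fine-tuned encoder and the from-scratch decoder can use different learning rates. The model name, layer counts, and learning rates are assumptions for illustration, not the published configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")           # pretrained document-level encoder

d_model = encoder.config.hidden_size
decoder = nn.TransformerDecoder(                                    # trained from scratch
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)
generator = nn.Linear(d_model, tokenizer.vocab_size)

# Separate optimisers, with a smaller learning rate for the pretrained encoder (illustrative values).
enc_opt = torch.optim.Adam(encoder.parameters(), lr=2e-5)
dec_opt = torch.optim.Adam(list(decoder.parameters()) + list(generator.parameters()), lr=1e-4)

src = tokenizer("a document to summarise", return_tensors="pt")
memory = encoder(**src).last_hidden_state                           # contextual token embeddings
tgt = torch.randn(1, 5, d_model)                                    # embedded summary prefix (placeholder)
logits = generator(decoder(tgt, memory))
print(logits.shape)
```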

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, both the input and output tokens applied the same embedding matrix Wemb, which was generated using the GloVe word embedding model in the Paulus et al. model [57]. Another word embedding matrix, referred to as Wout, was applied in the token generation layer. Additionally, a sharing weighting matrix was employed by both the shared embedding matrix Wemb and the Wout matrix. The sharing weighting matrixes improved the process of generating tokens since they considered the embedding syntax and semantic information.
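A minimal sketch of weight sharing between the embedding matrix Wemb and the output projection Wout is shown below; the exact wiring in the published model is an assumption, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50000, 256
embedding = nn.Embedding(vocab_size, d_model)          # Wemb, shared by input and output tokens
out_proj = nn.Linear(d_model, vocab_size, bias=False)  # Wout in the token-generation layer
out_proj.weight = embedding.weight                     # tie the two matrices (shared parameters)

hidden = torch.randn(4, d_model)                       # decoder hidden states for 4 timesteps
logits = out_proj(hidden)                              # scores over the shared vocabulary
print(logits.shape)  # torch.Size([4, 50000])
```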

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].


The discriminator input sequence of the Liu et al. model was encoded using a maximum pooling CNN, where the result was passed to the softmax layer [58]. On the other hand, the word embedding that was applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embedding. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of 781 tokens paired with 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the document was preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets: the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consisted of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve the coreference. The second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP was to perform Stanford NLP parsing, which is a specialised tool that retrieved the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing was performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If the child node has children, then the child is considered a new root with children; this process continues recursively until there are no children for the root. In this case, the tree structure is completed. Accordingly, the compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are refined before training by applying simple rules. First, the phrase triples at the topmost level are exploited since they carry the most semantic information. Second, triple phrases with subject and object phrases and no nouns are deleted, since the noun contains a considerable amount of conceptual information. Triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To achieve the goal of the previous task and determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consisted of 92,000 text sources and 219,000 text sources, respectively.

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets, including CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].
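For readers who want to reproduce these data splits, the sketch below loads the packaged non-anonymised CNN/Daily Mail corpus with the Hugging Face datasets library and inspects the split sizes; the dataset name, configuration, and field names are assumptions about that packaged release and are not taken from the surveyed papers.

```python
from datasets import load_dataset

# Load the packaged CNN/Daily Mail corpus (articles paired with highlight summaries).
cnn_dm = load_dataset("cnn_dailymail", "3.0.0")

for split in ("train", "validation", "test"):
    n = len(cnn_dm[split])
    first = cnn_dm[split][0]
    # Rough token counts of the first example's article and reference summary.
    print(split, n, len(first["article"].split()), len(first["highlights"].split()))
```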

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1 of 41.16, ROUGE2 of 15.75, and ROUGE-L of 39.08 [57]. The results for the maximum likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to evaluate the quality and readability of the summary. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]; the value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics of the Al-Sabahi et al. and Li et al. models; the values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while the values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluations included ROUGE1, ROUGE2, and ROUGE-L, and the values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel score related to the n-gram was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model. Five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected. The evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and the values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, where the WordPiece tokeniser was utilised; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, which is referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-and-answering paradigm, where 20 documents were selected for evaluation. Three values were chosen for evaluating the answers: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attention encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model consists of a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components. The encoder consists of two bidirectional GRU-RNNs (a GRU-RNN for the word level and a GRU-RNN for the sentence level), while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of the source document. Instead of considering every vocabulary word, only certain vocabularies were added, based on the frequency of the vocabulary in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. Linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrixes for each tag type, similar to word embedding, while the TF-IDF feature was discretised into bins with a fixed number, where a one-hot representation was employed to represent the value of the bins. The one-hot matrix consisted of the number of bin entries, where only one entry was set to one to indicate the value of the TF-IDF of a certain word. This process permitted the TF-IDF to be addressed in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8M training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and name-entity generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
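The following is an illustrative sketch of the feature-rich encoder input described above: the word embedding is concatenated with look-up embeddings for the POS and NER tags and a one-hot vector for the discretised TF-IDF bin. All sizes are assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

word_emb = nn.Embedding(50000, 200)   # Word2Vec-sized word embeddings
pos_emb = nn.Embedding(45, 16)        # part-of-speech tag embedding (assumed size)
ner_emb = nn.Embedding(10, 16)        # named-entity tag embedding (assumed size)
num_bins = 10                         # fixed number of TF-IDF bins

def encode_token(word_id, pos_id, ner_id, tfidf_bin):
    one_hot = F.one_hot(torch.tensor(tfidf_bin), num_classes=num_bins).float()
    return torch.cat([
        word_emb(torch.tensor(word_id)),
        pos_emb(torch.tensor(pos_id)),
        ner_emb(torch.tensor(ner_id)),
        one_hot,                      # discretised TF-IDF treated like any other tag
    ])                                # one long vector per input word

print(encode_token(12, 3, 1, 7).shape)  # torch.Size([242])
```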

Finally, for both single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of several approaches appear in Table 2, while the training, optimisation, mechanisms, and search at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets are produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset from the Stanford University Linguistics Department was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of the article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets that were generated by Hermann et al. [71]; the Hermann et al. datasets were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (on average 29.74 sentences), while the summaries have 53 words (on average 3.72 sentences) [55].

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].


Figure 16: Word embedding concatenated with discretised TF-IDF, POS, and NER one-embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | —
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | RNN word-based
[50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM | Unidirectional RNN attentive encoder-decoder LSTM
[38] | 2016 | Bidirectional LSTM | Unidirectional LSTM
[38] | 2016 | Bidirectional LSTM | Decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets. Nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study each applied the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenization, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and name-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins and a one-hot representation for bins; (iv) lookup embedding for part-of-speech tagging and name-entity tagging
[52] | Zhou et al. | PTB tokenization, using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times | Word embedding with size equal to 300
[53] | Cao et al. | Normalization and tokenization, using "#" to replace digits, converting the words to lower case, and "UNK" to replace the least frequent words | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lower case letters | GloVe word embedding
[29] | Lopyrev | Tokenization, converting the articles and their headlines to lower case letters, using the symbol ⟨unk⟩ to replace rare words | The input was represented using the distributed representation
[38] | Jobson et al. | — | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | — | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | — | CNN maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving the coreference and reducing the morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | — | The word embedding is learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | — | The word embedding is learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | — | BERT
[63] | Wang et al. | Using the WordPiece tokenizer | BERT
[64] | Egonmwan et al. | — | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers covered in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures to evaluate the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise the negative log-likelihood | — | Beam search
[39] | Chopra et al. | Minimising the negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimise the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimising the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimising the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | —
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanism | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | —
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | —
[30] | Song et al. | — | Attention mechanism, copy mechanism | —
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

ROUGE-N = (Σ_{S ∈ Reference Summaries} Σ_{gram_n ∈ S} Count_match(gram_n)) / (Σ_{S ∈ Reference Summaries} Σ_{gram_n ∈ S} Count(gram_n))    (1)

where S is a reference summary, n is the n-gram length, and Count_match(gram_n) is the maximum number of matching n-gram words between the reference summary and the generated summary. Count(gram_n) is the total number of n-gram words in the reference summary [73].
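The following is a small reference implementation of ROUGE-N recall as defined in equation (1) for a single reference summary; it clips each n-gram match to the number of times the n-gram appears in the reference.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(reference, candidate, n=1):
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    # Count_match: clipped n-gram overlap; Count: total reference n-grams.
    match = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return match / total if total else 0.0

print(rouge_n("ahmed ate the apple", "the apple ahmed ate", n=2))  # 0.666...
```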

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of the common matching words between the reference summary and the generated summary. The LCS calculation does not necessarily require the matched words to be consecutive; however, the order of occurrence is important. In addition, no predefined number of matched words is required. LCS considers only the main in-sequence match, which is one of its disadvantages, since the final score will not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets [61].

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarization evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | —
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: the apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" but not both, similar to LCS.
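A short sketch of ROUGE-L recall follows: the length of the longest common subsequence between the reference and the generated summary, divided by the reference length. For the example above, the LCS is either "Ahmed ate" or "the apple" (length 2), but not both.

```python
def lcs_length(a, b):
    # classic dynamic programming over the two token sequences
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

reference = "ahmed ate the apple".split()
candidate = "the apple ahmed ate".split()
print(lcs_length(reference, candidate) / len(reference))  # 0.5 (LCS = 2 of 4 reference words)
```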

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]. The models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with the pretrained encoder model, at 43.85, 20.34, and 39.90, respectively [65]. Even though ROUGE was employed to evaluate abstractive summarisation, it is better to obtain new methods to evaluate the quality of summarisation. The new evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text. However, ROUGE is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups. The first group considered single-sentence summary approaches, while the second group considered multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are those that employed BERT word embedding and were based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset, as Gigaword is utilised for single-sentence summaries (it contains headlines that are treated as summaries), while the CNN/Daily Mail datasets contain multisentence summaries. Thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the generated summaries of 50 test examples from 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summary [60]. Five human evaluators evaluated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where a value of 1 indicates that the generated summary is less readable and less relevant, while a value of 10 indicates that the generated summary is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are comparable, with values of 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retained key information, which was achieved by human evaluation [65]. In addition, the qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were selected for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summary of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are not enough due to the small number of testing examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder will be limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either the gold token is used, as in standard training, or the output of the previous step is used, as at test time.


In this manner, at least part of the training receives the same kind of input as testing. In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76]. Moreover, the masked convolution of the QRNN is applied in [50], since the dependency of words generated in the future is difficult to determine.

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copy it to the memory. When the switch is turned on, the decoder generates a word from the target vocabulary. Conversely, researchers in [56] addressed OOV words via a generation probability Pgen, whose value is calculated from the context vector and decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words to copy them. The combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the input sequence position for unseen tokens in order to copy them.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | —
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset.


Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. In addition, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
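A minimal sketch of the pointer-generator switch described above follows: Pgen is computed from the context vector, decoder state, and decoder input, and the final distribution mixes vocabulary generation with copying attended source words over the extended vocabulary. The dimensions and layer shapes are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class PointerGeneratorSwitch(nn.Module):
    def __init__(self, hidden=256, emb=128, vocab_size=50000, extended_size=50400):
        super().__init__()
        self.p_gen_layer = nn.Linear(hidden * 2 + emb, 1)   # context + state + input -> Pgen
        self.vocab_proj = nn.Linear(hidden * 2, vocab_size)
        self.extended_size = extended_size

    def forward(self, context, dec_state, dec_input, attention, src_ext_ids):
        p_gen = torch.sigmoid(self.p_gen_layer(torch.cat([context, dec_state, dec_input], dim=-1)))
        vocab_dist = torch.softmax(self.vocab_proj(torch.cat([context, dec_state], dim=-1)), dim=-1)
        final = torch.zeros(context.size(0), self.extended_size)
        final[:, :vocab_dist.size(1)] = p_gen * vocab_dist           # generate from the vocabulary
        final.scatter_add_(1, src_ext_ids, (1 - p_gen) * attention)  # copy attended source words
        return final                                                 # distribution over extended vocabulary

switch = PointerGeneratorSwitch()
dist = switch(torch.randn(1, 256), torch.randn(1, 256), torch.randn(1, 128),
              torch.softmax(torch.randn(1, 7), dim=-1), torch.randint(0, 50400, (1, 7)))
print(dist.shape, float(dist.sum()))  # sums to ~1.0
```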

7.3. Summary Sentence Repetition and Inaccurate Information Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges are due to the summarisation of long documents and the production of long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create the coverage vector by aggregating the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism where, for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and attending to the same part of the input at a different step of the decoder. However, the intratemporal encoder attention mechanism cannot address all the repetition challenges, especially when a long sequence is generated. Thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more previously generated words; the proposed intradecoder attention mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy maximum likelihood loss and gradient reinforcement learning to minimise the exposure bias. In addition, the probability of the trigram p(yt) was proposed to address repetition in the generated summary, where yt is the trigram sequence. In this case, the value of p(yt) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated using a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts. Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary was conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.
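The following is a hedged sketch of the dual-attention idea described above: the decoder attends separately to the encoded source text and to the encoded fact descriptions, and the two context vectors are merged with a learned gate. It illustrates the general mechanism under assumed dimensions, not the published fact-aware implementation.

```python
import torch
import torch.nn as nn

def attend(query, keys):
    # simple dot-product attention returning a context vector
    scores = torch.softmax(keys @ query.squeeze(0), dim=0)      # (num_states,)
    return (scores.unsqueeze(1) * keys).sum(dim=0, keepdim=True)

hidden = 128
gate = nn.Linear(hidden * 2, 1)

dec_state = torch.randn(1, hidden)
text_states = torch.randn(30, hidden)   # encoder states of the source sentences
fact_states = torch.randn(8, hidden)    # encoder states of the extracted fact descriptions

ctx_text = attend(dec_state, text_states)
ctx_fact = attend(dec_state, fact_states)
g = torch.sigmoid(gate(torch.cat([ctx_text, ctx_fact], dim=-1)))
context = g * ctx_text + (1 - g) * ctx_fact   # fused context conditioning the next word
print(context.shape)  # torch.Size([1, 128])
```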

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary is the highlights of the news article.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


Every highlight represents a sentence in the summary; therefore, the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all the crucial points of the article. Therefore, producing a high-quality dataset requires considerable effort. Moreover, in some languages, such as Arabic, a multisentence dataset for abstractive summarisation is not available. A single-sentence abstractive Arabic text summarisation dataset is available but is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough, as ROUGE depends on exact matching between words. For example, the words "book" and "books" are considered different by any one of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words (words that have the same meaning must be considered the same even if they have a different surface form). In this case, we propose the use of METEOR, which has recently been used in evaluating machine translation and automatic summarisation models [77]. Moreover, METEOR considers stemming, morphological variants, and synonyms. In addition, in a flexible-order language, it is better to use ROUGE without considering the order of the words.

The quality of the generated summary can also be improved using linguistic features.

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


For example, we propose the use of dependency parsing at the encoder in a separate layer on top of the first hidden state layer. We also propose the use of word embeddings built by considering dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the new trends and evaluation results, we think that the most promising feature among all the features is the use of the pretrained BERT model. The quality of the models that are based on the transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that applied deep learning for abstractive text summarisation, the datasets, and the measures used for evaluation of these approaches. Moreover, the challenges encountered when employing various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times. Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively. The best results were achieved by the models that apply the transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, there are several issues that must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.
[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.
[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].


[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.
[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.
[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.
[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.
[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.
[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.
[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.
[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.
[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, 2017.
[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.
[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.
[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.

[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.
[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, 2018.
[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.
[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.
[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, http://arxiv.org/abs/1812.02303, 2020.
[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.
[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.
[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.
[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.
[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
[29] K. Lopyrev, Generating News Headlines with Recurrent Neural Networks, p. 9, 2015, https://arxiv.org/abs/1512.01712.
[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.
[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.
[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.
[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.
[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.
[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.
[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.
[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.
[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.
[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.
[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.
[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.
[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, Quasi-Recurrent Neural Networks, https://arxiv.org/abs/1611.01576, 2015.
[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q Zhou N Yang F Wei and M Zhou ldquoSelective encodingfor abstractive sentence summarizationrdquo in Proceedings of the55th Annual Meeting of the Association for ComputationalLinguistics pp 1095ndash1104 Vancouver Canada July 2017

[53] Z Cao F Wei W Li and S Li ldquoFaithful to the original factaware neural abstractive summarizationrdquo in Proceedings of theAAAI Conference on Artificial Intelligence (AAAI) NewOrleans LA USA February 2018

[54] T Cai M Shen H Peng L Jiang and Q Dai ldquoImprovingtransformer with sequential context representations for ab-stractive text summarizationrdquo inNatural Language Processingand Chinese Computing J Tang M-Y Kan D Zhao S Liand H Zan Eds pp 512ndash524 Springer International Pub-lishing Cham Switzerland 2019

[55] R Nallapati B Zhou C N dos Santos C Gulcehre andB Xiang ldquoAbstractive text summarization using sequence-to-sequence RNNs and beyondrdquo in Proceedings of the CoNLL-16Berlin Germany August 2016

[56] A See P J Liu and C D Manning ldquoGet to the pointsummarization with pointer-generator networksrdquo in Pro-ceedings of the 55th ACL pp 1073ndash1083 Vancouver Canada2017

[57] R Paulus C Xiong and R Socher ldquoA deep reinforced modelfor abstractive summarizationrdquo 2017 httparxivorgabs170504304

[58] K S Bose R H Sarma M Yang Q Qu J Zhu and H LildquoDelineation of the intimate details of the backbone con-formation of pyridine nucleotide coenzymes in aqueous so-lutionrdquo Biochemical and Biophysical ResearchCommunications vol 66 no 4 1975

[59] C Li W Xu S Li and S Gao ldquoGuiding generation forabstractive text summarization based on key informationguide networkrdquo in Proceedings of the 2018 Conference of theNorth American Chapter of the Association for ComputationalLinguistics Human Language Technologies pp 55ndash60 NewOrleans LA USA 2018

[60] W Kryscinski R Paulus C Xiong and R Socher ldquoIm-proving abstraction in text summarizationrdquo in Proceedings ofthe Conference on Empirical Methods in Natural LanguageProcessing (EMNLP) Brussels Belgium November 2018

[61] K Yao L Zhang D Du T Luo L Tao and Y Wu ldquoDualencoding for abstractive text summarizationrdquo IEEE Trans-actions on Cybernetics pp 1ndash12 2018

[62] X Wan C Li R Wang D Xiao and C Shi ldquoAbstractivedocument summarization via bidirectional decoderrdquo in Ad-vanced DataMining and Applications G Gan B Li X Li andS Wang Eds pp 364ndash377 Springer International Publish-ing Cham Switzerland 2018

[63] Q Wang P Liu Z Zhu H Yin Q Zhang and L Zhang ldquoAtext abstraction summary model based on BERT word em-bedding and reinforcement learningrdquo Applied Sciences vol 9no 21 p 4701 2019

[64] E Egonmwan and Y Chali ldquoTransformer-based model forsingle documents neural summarizationrdquo in Proceedings ofthe 3rd Workshop on Neural Generation and Translationpp 70ndash79 Hong Kong 2019

[65] Y Liu and M Lapata ldquoText summarization with pretrainedencodersrdquo 2019 httparxivorgabs190808345

[66] P Doetsch A Zeyer and H Ney ldquoBidirectional decodernetworks for attention-based end-to-end offline handwritingrecognitionrdquo in Proceedings of the 2016 15th InternationalConference on Frontiers in Handwriting Recognition (ICFHR)pp 361ndash366 Shenzhen China 2016

[67] I Sutskever O Vinyals and Q V Le ldquoSequence to SequenceLearning with Neural Networksrdquo in Proceedings of the Ad-vances in Neural Information Processing Systems (NIPS)Montreal Quebec Canada December 2014

[68] D He H Lu Y Xia T Qin L Wang and T-Y LiuldquoDecoding with value networks for neural machine transla-tionrdquo in Proceedings of the Advances in Neural InformationProcessing Systems Long Beach CA USA December 2017

[69] D Harman and P Over ldquoe effects of human variation inDUC summarization evaluation text summarizationbranches outrdquo Proceedings of the ACL-04 Workshop vol 82004

[70] C Napoles M Gormley and B V Durme ldquoAnnotatedGigawordrdquo in Proceedings of the AKBC-WEKEX MontrealCanada 2012


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.
[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.
[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.
[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.
[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.
[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.
[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.



layer unidirectional GRU at the encoder and a single-layer unidirectional GRU at the decoder. The input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedy decoding was employed during training and validation.

4.1.3. Others. BERT is employed to represent the sentences of the document and express its semantics [65]. Liu et al. proposed abstractive and extractive summarisation models that are based on the encoder-decoder architecture. The encoder is a BERT pretrained document-level encoder, while the decoder is a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage for extractive summarisation and one stage for abstractive summarisation. At the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed, since these two mechanisms need additional tuning of the hyperparameters. The repetition problem was addressed by producing different summaries using trigram blocking. OOV words rarely appear in the generated summary.
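To make this architecture concrete, the following sketch pairs a pretrained BERT document encoder with a randomly initialised transformer decoder and uses separate optimisers for the two parts, in the spirit of the model described above. It assumes PyTorch and the Hugging Face transformers package; the class name, layer sizes, and learning rates are illustrative and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertAbsSummarizer(nn.Module):
    """BERT document encoder + transformer decoder trained from scratch (illustrative)."""
    def __init__(self, vocab_size, d_model=768, num_layers=6, nhead=8):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")   # pretrained document-level encoder
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)  # randomly initialised decoder
        self.embed = nn.Embedding(vocab_size, d_model)
        self.generator = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, src_mask, tgt_ids):
        memory = self.encoder(input_ids=src_ids, attention_mask=src_mask).last_hidden_state
        tgt = self.embed(tgt_ids)
        t = tgt_ids.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)  # block attention to future tokens
        return self.generator(self.decoder(tgt, memory, tgt_mask=causal))

model = BertAbsSummarizer(vocab_size=30522)
# Separate optimisers (and learning rates) for the pretrained encoder and the new decoder.
enc_opt = torch.optim.Adam(model.encoder.parameters(), lr=2e-5)
dec_opt = torch.optim.Adam(
    list(model.decoder.parameters()) + list(model.embed.parameters()) + list(model.generator.parameters()),
    lr=1e-4)
```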

4.2. Word Embedding. The word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, in the Paulus et al. model, both the input and output tokens applied the same embedding matrix Wemb, which was generated using the GloVe word embedding model [57]. Another word embedding matrix, referred to as Wout, was applied in the token generation layer. Additionally, a sharing weighting matrix was employed by both the shared embedding matrix Wemb and the Wout matrix. The shared weighting matrices improved the process of generating tokens, since they capture the syntactic and semantic information carried by the embeddings.
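A common way to realise such sharing is weight tying, where the output projection reuses the embedding matrix. The short sketch below (PyTorch, illustrative sizes) only shows the idea; it is not the exact parameterisation of [57].

```python
import torch.nn as nn

vocab_size, emb_dim = 50000, 100
embedding = nn.Embedding(vocab_size, emb_dim)          # W_emb, e.g. initialised from GloVe vectors
out_proj = nn.Linear(emb_dim, vocab_size, bias=False)  # W_out in the token-generation layer
out_proj.weight = embedding.weight                     # tie the matrices: the same parameters serve both roles
```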

Figure 14: Dual encoding model [61].

Figure 13: Abstractive document summarisation via bidirectional decoder (BiSum) [62].


The discriminator input sequence of the Liu et al. model was encoded using a maximum-pooling CNN, and the result was passed to the softmax layer [58]. On the other hand, the word embedding applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embeddings. BERT word embeddings were utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of source documents of 781 tokens paired with summaries of 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the document was preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets: the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was also utilised by Liu et al. for training their model [58].
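For reference, the CNN/Daily Mail pairs are also distributed through the Hugging Face datasets package; the snippet below is an illustrative way to load and inspect them (assuming the public "cnn_dailymail" configuration "3.0.0") and is unrelated to the preprocessing pipelines of the reviewed models.

```python
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
print({split: len(cnn_dm[split]) for split in cnn_dm})  # train / validation / test pair counts
example = cnn_dm["train"][0]
print(example["article"][:300])   # source news article
print(example["highlights"])      # multisentence reference summary built from the bullet-point highlights
```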

The ATSDL model consists of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve the coreference. The second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP was to perform Stanford NLP parsing, a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing was performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If the child node has children, then the child is considered a new root with children; this process continues recursively until the root has no children, at which point the tree structure is complete. Accordingly, compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are removed before training by applying simple rules. First, the phrase triples at the topmost level are exploited, since they carry the most semantic information. Second, triple phrases whose subject and object phrases contain no nouns are deleted, since nouns carry a considerable amount of conceptual information. Triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To determine whether two phrases can be combined, a set of artificial rules is applied. The experiments were conducted using the CNN and Daily Mail datasets, which consist of 92,000 and 219,000 source texts, respectively.
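The following sketch illustrates the general idea of extracting (subject, relation, object) phrase triples from a dependency parse, as in the MOSP step. It uses spaCy as a stand-in for the Stanford parser employed by the authors, the rules are heavily simplified, and it assumes the "en_core_web_sm" model is installed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed to be installed

def phrase_triples(text):
    """Return simplified (subject, relation, object) triples from the dependency parse."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":                                       # relational phrase (tree root)
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "attr", "dative")]
                if subjects and objects:                                   # drop triples missing an argument
                    triples.append((subjects[0].text, token.lemma_, objects[0].text))
    return triples

print(phrase_triples("The committee approved the new budget on Monday."))
# e.g. [('committee', 'approve', 'budget')]
```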

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1 of 41.16, ROUGE2 of 15.75, and ROUGE-L of 39.08 [57]. The results for the maximum-likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to evaluate the quality and readability of the summary. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]: the value of ROUGE1 was 34.9 and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics for the Al-Sabahi et al. and Li et al. models; values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluations included ROUGE1, ROUGE2, and ROUGE-L, for which the values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel n-gram-based score was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model: five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected. The evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and the values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, in which the WordPiece tokeniser was utilised; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92 [64]. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L, where the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-and-answering paradigm, where 20 documents were selected for evaluation. Three values were chosen for scoring each answer: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attention encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model consists of a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs, one for the word level and one for the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering the whole vocabulary, only certain words were added based on their frequency in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrices for each tag type, similar to word embeddings, while the TF-IDF feature was discretised into a fixed number of bins, and a one-hot representation was employed to represent the value of the bins. The one-hot vector consisted of the number of bin entries, where only one entry was set to one to indicate the TF-IDF value of a certain word. This process permitted the TF-IDF feature to be treated in the same way as any other tag, by concatenating all the embeddings into one long vector, as shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8 million training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and part-of-speech and named-entity tag generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
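The sketch below shows one way to build such a concatenated per-word input vector (word embedding plus POS, NER, and one-hot TF-IDF bin features). The vocabulary and tag-set sizes are illustrative, and the code is not the authors' implementation.

```python
import torch
import torch.nn as nn

word_emb = nn.Embedding(50000, 200)   # Word2Vec-style lookup with 200 dimensions, as in [55]
pos_emb = nn.Embedding(45, 16)        # one embedding per part-of-speech tag type
ner_emb = nn.Embedding(10, 16)        # one embedding per named-entity tag type
num_tfidf_bins = 10

def encoder_input(word_id, pos_id, ner_id, tfidf_bin):
    one_hot = torch.zeros(num_tfidf_bins)
    one_hot[tfidf_bin] = 1.0           # only the entry of this word's TF-IDF bin is set to one
    return torch.cat([word_emb(torch.tensor(word_id)),
                      pos_emb(torch.tensor(pos_id)),
                      ner_emb(torch.tensor(ner_id)),
                      one_hot])        # one long vector per input word

print(encoder_input(7, 3, 1, 4).shape)   # torch.Size([242])
```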

Finally, for both the single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of several approaches are presented in Table 2, while the training, optimisation, mechanisms, and search at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets are produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets, even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of each article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets that were generated by Hermann et al. [71], which were originally utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (on average 29.74 sentences), while the summaries have 53 words (on average 3.72 sentences) [55].

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].


Figure 16: Word embedding concatenated with discretized TF-IDF, POS, and NER one-hot embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | –
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | Word-based decoder RNN
[50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM; bidirectional LSTM; bidirectional LSTM | Unidirectional RNN attentive encoder-decoder LSTM; unidirectional LSTM; decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets. Nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number on the list), and one study applied each of the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization: a special symbol replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occur fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenization: a special symbol replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occur fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins and a one-hot representation of the bins; (iv) lookup embeddings for part-of-speech and named-entity tags
[52] | Zhou et al. | PTB tokenization: a special symbol replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occur fewer than 5 times | Word embedding with size equal to 300
[53] | Cao et al. | Normalization and tokenization: a special symbol replaces digits, words are converted to lower case, and "UNK" replaces the least frequent words | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lower-case letters | GloVe word embedding
[29] | Lopyrev | Tokenization; converting the articles and their headlines to lower-case letters; using the symbol ⟨unk⟩ to replace rare words | The input was represented using a distributed representation
[38] | Jobson et al. | – | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | – | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | – | CNN maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving the coreference and reducing the morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | – | The word embedding is learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | – | The word embedding is learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | – | BERT
[63] | Wang et al. | Using the WordPiece tokenizer | BERT
[64] | Egonmwan et al. | – | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers reviewed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures to evaluate the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise negative log-likelihood | – | Beam search
[39] | Chopra et al. | Minimizing negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimize the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimizing the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimizing the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross-entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | –
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanisms | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | –
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | –
[30] | Song et al. | – | Attention mechanism, copy mechanism | –
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}, \qquad (1)

where S is a reference summary, n is the n-gram length, and Count_match(gram_n) is the maximum number of n-grams co-occurring in the reference summary and the generated summary; Count(gram_n) is the total number of n-grams in the reference summary [73].
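Equation (1) can be implemented directly; the sketch below computes ROUGE-N recall for whitespace-tokenised text and is only an illustration, not a substitute for the official ROUGE package [73].

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    match = sum(min(count, cand[gram]) for gram, count in ref.items())  # Count_match(gram_n)
    total = sum(ref.values())                                           # Count(gram_n) over the reference
    return match / total if total else 0.0

print(rouge_n("ahmed ate the apple", "the apple ahmed ate", n=2))  # 2 of 3 reference bigrams match -> 0.67
```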

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of the common matching words between the reference summary and the generated summary. The LCS calculation does not require the matching words to be consecutive; however, their order of occurrence is important, and no predefined number of matching words is required. LCS considers only the main in-sequence match, which is one of its disadvantages, since the final score will not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets.

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarization evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | –
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple
A: the apple Ahmed ate

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" but not both, since it relies on the LCS.
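The sketch below computes the LCS length underlying ROUGE-L for this example using standard dynamic programming; it is illustrative only.

```python
def lcs_length(ref_tokens, cand_tokens):
    m, n = len(ref_tokens), len(cand_tokens)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

R = "Ahmed ate the apple".split()
A = "the apple Ahmed ate".split()
print(lcs_length(R, A))   # 2: either "Ahmed ate" or "the apple" is counted, never both
```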

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]; these models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L, achieved by text summarisation with a pretrained encoder model, were 43.85, 20.34, and 39.90, respectively [65]. Even though ROUGE was employed to evaluate abstractive summarisation, it would be better to develop new methods to evaluate summary quality. New evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text; ROUGE, by contrast, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups: the first group considers single-sentence summary approaches, while the second group considers multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are those that employ BERT word embedding and are based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset, as Gigaword is used for single-sentence summaries (it contains headlines that are treated as summaries), while the CNN/Daily Mail datasets contain multisentence summaries; thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the summaries generated by 5 models on 50 test examples [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summaries [60]. Five human evaluators evaluated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where a value of 1 indicates that the generated summary is less readable and less relevant, while a value of 10 indicates that the generated summary is readable and highly relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are close, at 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retention of key information, which was achieved by human evaluation [65]. In addition, the qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were selected for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summary of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are also not enough, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, the previous tokens of the headline can be fed into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step of the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either a gold token is utilised, as during training, or the output of the previous step is used, as during both testing and training. In this manner, at least the training step receives the same


input as testing. In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, the generated word of the previous step is fed back 10% of the time [75, 76]. Moreover, the masked convolution of the QRNN is applied in [50], since the dependency on words generated in the future is difficult to determine.
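A minimal sketch of this coin-flip feeding (in the spirit of DaD and scheduled sampling [74, 76]) is shown below; decoder_step is an assumed callable performing one decoding step, and the sampling probability is illustrative.

```python
import random
import torch

def coin_flip_decode(decoder_step, gold_tokens, start_token, sample_prob=0.1):
    """Feed either the gold token or the model's own previous prediction at each step."""
    inputs, state, outputs = start_token, None, []
    for t in range(len(gold_tokens)):
        logits, state = decoder_step(inputs, state)      # one decoder step (assumed interface)
        outputs.append(logits)
        predicted = torch.argmax(logits, dim=-1)
        # with probability sample_prob behave as at test time and feed the generated word back,
        # otherwise feed the gold (reference) token as in standard teacher forcing
        inputs = predicted if random.random() < sample_prob else gold_tokens[t]
    return outputs
```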

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to a word in the source and copy it to the output; when the switch is turned on, the decoder generates a word from the target vocabulary. Conversely, researchers in [56] addressed OOV words via a generation probability Pgen, whose value is calculated from the context vector and decoder state, as shown in Figure 21. To produce the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them. The combination of the words in the input and the vocabulary is referred to as the extended vocabulary.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | –
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset.


In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating a token using the softmax layer and using the pointer mechanism to point to an input sequence position and copy the unseen token. Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. Furthermore, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying a word from the original input text.
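The sketch below shows the core of such a pointer-generator output layer: Pgen mixes the vocabulary distribution with the attention (copy) distribution over source positions, so OOV source words can still be emitted. Tensor shapes and layer names are illustrative and do not reproduce the exact parameterisation of [56].

```python
import torch
import torch.nn as nn

class PointerGeneratorHead(nn.Module):
    def __init__(self, hidden_dim, emb_dim, vocab_size):
        super().__init__()
        self.p_gen_lin = nn.Linear(2 * hidden_dim + emb_dim, 1)
        self.vocab_lin = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, context, dec_state, dec_input, attention, src_ids, extended_vocab_size):
        # context: (B, hidden), dec_state: (B, hidden), dec_input: (B, emb)
        # attention: (B, src_len) over source positions, src_ids: (B, src_len) extended-vocabulary ids
        p_gen = torch.sigmoid(self.p_gen_lin(torch.cat([context, dec_state, dec_input], dim=-1)))
        p_vocab = torch.softmax(self.vocab_lin(torch.cat([context, dec_state], dim=-1)), dim=-1)
        final = torch.zeros(context.size(0), extended_vocab_size)
        final[:, :p_vocab.size(1)] = p_gen * p_vocab                  # generate from the fixed vocabulary
        final.scatter_add_(1, src_ids, (1 - p_gen) * attention)       # copy probability mass to source words
        return final
```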

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges arise when summarising long documents and producing long summaries with the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism in which, for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and to avoid attending to the same part of the input at different decoder steps. However, the intratemporal encoder attention mechanism cannot address all repetition challenges, especially when a long sequence is generated; thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words, and it is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy loss, maximum likelihood, and policy-gradient reinforcement learning to minimise exposure bias. In addition, a trigram probability p(yt) was proposed to address repetition in the generated summary, where yt is the trigram sequence; the value of p(yt) is set to 0 during the beam search at the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated by a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is also utilised.
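The trigram heuristic mentioned above can be sketched as follows: during beam search, a hypothesis is discarded (its probability set to 0) whenever extending it would repeat a trigram it already contains. The function below is a simplified illustration.

```python
def repeats_trigram(generated_tokens, next_token):
    """True if appending next_token would create a trigram already present in the hypothesis."""
    candidate = generated_tokens + [next_token]
    if len(candidate) < 3:
        return False
    new_trigram = tuple(candidate[-3:])
    earlier = {tuple(candidate[i:i + 3]) for i in range(len(candidate) - 3)}
    return new_trigram in earlier

print(repeats_trigram(["the", "cat", "sat", "on", "the", "cat"], "sat"))   # True -> prune this hypothesis
```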

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts; around 30% of summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. To address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, in which the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference (golden) summary. In the CNN/Daily Mail dataset, the reference summary consists of the highlights of the news article.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


Every highlight represents a sentence in the summary; therefore, the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not cover all crucial points; therefore, producing a high-quality dataset requires considerable effort. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset exists but is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words; for example, the words "book" and "books" are considered different by all of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words, so that words with the same meaning are treated as the same even if they have different surface forms. In this case, we propose the use of METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for flexible-order languages, it is better to use ROUGE without considering the order of the words.

The quality of the generated summary can be improved using linguistic features.

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


For example, we propose using dependency parsing at the encoder in a separate layer on top of the first hidden-state layer. We also propose using word embeddings built by considering dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the new trends and evaluation results, we think that the most promising direction is the use of the pretrained BERT model. The quality of the models that are based on the transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that apply deep learning to abstractive text summarisation, along with the datasets and measures used to evaluate these approaches. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively; the best results were achieved by the models that apply transformers. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, several issues must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.
[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.
[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.
[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.
[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.
[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.
[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.
[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.
[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.
[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.
[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, 2017.
[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.
[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.
[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.
[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.
[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, 2018.
[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.
[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.
[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, 2020, http://arxiv.org/abs/1812.02303.
[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.

[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.
[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.
[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.
[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
[29] K. Lopyrev, Generating News Headlines with Recurrent Neural Networks, p. 9, 2015, https://arxiv.org/abs/1512.01712.
[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.
[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.
[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. Cho, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.
[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.
[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.
[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.
[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.
[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.
[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.
[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.
[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.
[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.
[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," 2015, https://arxiv.org/abs/1611.01576.
[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.
[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.
[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.
[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.
[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.
[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.
[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.
[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.
[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.
[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.
[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.
[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.
[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.
[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.
[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.
[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.
[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.
[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.
[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.
[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.
[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.
[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.
[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.
[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.
[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.
[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.



The discriminator input sequence of the Liu et al. model was encoded using a maximum-pooling CNN, where the result was passed to the softmax layer [58]. On the other hand, the word embedding that was applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets, with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embedding, while BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].
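The review above names the embedding sources (learned from scratch, GloVe, BERT) without showing how they are obtained in practice. As a point of reference only, the following minimal sketch, which is not code from any of the reviewed models, shows how contextual BERT token vectors can be produced with the HuggingFace transformers library; the model name and the use of the last hidden layer are illustrative assumptions.

```python
# Minimal sketch: obtaining contextual BERT token embeddings for one sentence.
# The surveyed models feed such vectors into their own encoders/decoders.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "police arrest suspect after downtown robbery"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_vectors = outputs.last_hidden_state  # shape: (1, num_tokens, 768)
print(token_vectors.shape)
```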

4.3. Dataset and Dataset Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using the CNN/Daily Mail datasets, which consist of 781 tokens paired with 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the document was preprocessed using the same method applied in [55]. The proposed model was evaluated using two datasets: the CNN/Daily Mail dataset and the New York Times dataset. The CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].

The ATSDL model consisted of three stages: text preprocessing, phrase extraction, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve the coreference. The second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement, and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. The first step of MOSP was to perform Stanford NLP parsing, a specialised tool that retrieves the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing was performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If a child node has children, then that child is considered a new root, and the process continues recursively until the root has no children, at which point the tree structure is complete. Accordingly, compound phrases can be explored via dependency parsing. One of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are filtered out before training by applying simple rules. First, only the phrase triples at the topmost level are exploited, since they carry the most semantic information. Second, triple phrases whose subject and object phrases contain no nouns are deleted, since nouns carry a considerable amount of conceptual information; triple phrases without a verb in the relational phrase are also deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN; a set of artificial rules is applied to determine whether two phrases can be combined. The experiments were conducted using the CNN and Daily Mail datasets, which consisted of 92,000 and 219,000 source texts, respectively.
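The MOSP step above extracts relational phrase triples from a dependency parse. The snippet below is a loose, illustrative approximation of that idea written with spaCy rather than the Stanford parser used by ATSDL; the dependency labels checked and the filtering are simplifications introduced here, not the paper's actual rules.

```python
# Illustrative sketch (not the ATSDL code): extracting subject-relation-object
# triples from a dependency parse, in the spirit of the MOSP phrase-extraction step.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.text, token.lemma_, o.text))
    return triples

print(extract_triples("The police arrested the suspect. He stole a car."))
# e.g. [('police', 'arrest', 'suspect'), ('He', 'steal', 'car')]
```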

The Kryscinski et al. [60] model was trained using the CNN/Daily Mail dataset, which was preprocessed using the method from [55, 56]. The experiments of DEATS were conducted using the CNN/Daily Mail dataset and the DUC2004 corpus [61]. The experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62]. In the model proposed by Wang et al., CNN/Daily Mail and DUC2002 were employed in the experiments [63], while the Egonmwan et al. model employed the CNN/Daily Mail and Newsroom datasets [64]. Experiments were conducted with the Liu et al. [65] model using three benchmark datasets: CNN/Daily Mail, the New York Times Annotated Corpus (NYT), and XSum. Experiments were also conducted with the DAPT model using the CNN/Daily Mail and LCSTS datasets [49].

4.4. Evaluation and Results. The evaluation metrics ROUGE1, ROUGE2, and ROUGE-L, with values of 39.53, 17.28, and 36.38, respectively, were applied to measure the performance of the See et al. model [56], which outperformed previous approaches by at least two points in terms of the ROUGE metrics. Reinforcement learning with the intra-attention model achieved the following results: ROUGE1 41.16, ROUGE2 15.75, and ROUGE-L 39.08 [57]. The results for the maximum-likelihood model were 39.87, 15.82, and 36.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively. Overall, the proposed approach yielded high-quality generated summaries [57].

ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the Liu et al. model, which obtained values of 39.92, 17.65, and 36.71, respectively [58]. In addition, a manual qualitative evaluation was performed to assess the quality and readability of the summaries. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.

ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]: the value of ROUGE1 was 34.9 and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics of the Al-Sabahi et al. and Li et al. models; values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [59].

The evaluation of the Kryscinski et al. model was conducted using quantitative and qualitative evaluations [60]. The quantitative evaluation included ROUGE1, ROUGE2, and ROUGE-L, for which values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel n-gram-based score was employed to measure the level of abstraction in the summary. The qualitative evaluation involved the manual evaluation of the proposed model: five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, full-text summaries from two previous studies [56, 58] were selected, and the evaluators graded the output summaries without knowing which model generated them.


Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied for evaluating DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, in which the WordPiece tokeniser was utilised; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model, the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L; the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question-answering paradigm, where 20 documents were selected for evaluation. Three values were chosen for scoring the answers: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attentional encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and the decoder have the same number of hidden states. Additionally, the proposed model consists of a softmax layer for generating the words based on the vocabulary of the target. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs (a GRU-RNN for the word level and a GRU-RNN for the sentence level), while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents. Instead of considering the whole vocabulary, only certain words were added, based on their frequency in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrices for each tag type, similar to word embeddings, while the TF-IDF feature was discretised into a fixed number of bins, and a one-hot representation was employed to represent the value of the bins. The one-hot matrix consisted of the number of bin entries, where only one entry was set to one to indicate the TF-IDF value of a certain word. This process permitted the TF-IDF to be addressed in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16.

The experiments were conducted using the annotated Gigaword corpus with 3.8M training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing methods included tokenisation and the generation of part-of-speech and named-entity tags. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
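The feature-rich encoder input described above concatenates the word embedding with POS and NER tag embeddings and a one-hot vector for the discretised TF-IDF bin. The following sketch shows one way such a concatenation can be written in PyTorch; all dimensions, names, and the random inputs are illustrative assumptions rather than the settings of [55].

```python
# Minimal sketch of a feature-rich encoder input: word embedding concatenated
# with POS and NER tag embeddings and a one-hot vector for the TF-IDF bin.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, POS_TAGS, NER_TAGS, TFIDF_BINS = 50_000, 45, 10, 20   # toy sizes

word_emb = nn.Embedding(VOCAB, 200)   # e.g. Word2Vec-initialised, 200-d
pos_emb = nn.Embedding(POS_TAGS, 16)
ner_emb = nn.Embedding(NER_TAGS, 16)

def encoder_input(word_ids, pos_ids, ner_ids, tfidf_bins):
    """All arguments are LongTensors of shape (batch, seq_len)."""
    one_hot = F.one_hot(tfidf_bins, num_classes=TFIDF_BINS).float()
    return torch.cat(
        [word_emb(word_ids), pos_emb(pos_ids), ner_emb(ner_ids), one_hot], dim=-1
    )  # shape: (batch, seq_len, 200 + 16 + 16 + 20)

x = encoder_input(torch.randint(0, VOCAB, (2, 7)),
                  torch.randint(0, POS_TAGS, (2, 7)),
                  torch.randint(0, NER_TAGS, (2, 7)),
                  torch.randint(0, TFIDF_BINS, (2, 7)))
print(x.shape)  # torch.Size([2, 7, 252])
```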

Finally, for both the single-sentence summary and the multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of the various approaches appear in Table 2, while the training, optimisation, mechanisms, and search strategy at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets are produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets, even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of the article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets that were generated by Hermann et al. [71]; the Hermann et al. datasets were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (29.74 sentences on average), while the summaries have 53 words (3.72 sentences on average) [55].
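For readers who want to inspect these article-summary pairs, the multisentence CNN/Daily Mail data can be pulled with the HuggingFace datasets library, as sketched below; the split sizes reported by this public release may differ slightly from the counts quoted above.

```python
# Sketch: loading the CNN/Daily Mail summarisation pairs (config "3.0.0").
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
print(cnn_dm)                      # train / validation / test splits

sample = cnn_dm["train"][0]
print(sample["article"][:300])     # source news article
print(sample["highlights"])        # bullet-point reference summary
```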

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].


Figure 16: Word embedding concatenated with discretized TF-IDF, POS, and NER one-hot embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | —
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | Word-based decoder RNN
[50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM; Bidirectional LSTM; Bidirectional LSTM | Unidirectional RNN attentive encoder-decoder LSTM; Unidirectional LSTM; Decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets: nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study applied the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization: a placeholder symbol replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occur fewer than 5 times | Bag-of-words embedding of the input sentence
[39] | Chopra et al. | PTB tokenization: a placeholder symbol replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occur fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins with a one-hot representation of the bins; (iv) lookup embeddings for part-of-speech and named-entity tags
[52] | Zhou et al. | PTB tokenization: a placeholder symbol replaces all digits, all letters are converted to lower case, and "UNK" replaces words that occur fewer than 5 times | Word embedding of size 300
[53] | Cao et al. | Normalization and tokenization: digits are replaced, words are converted to lower case, and "UNK" replaces the least frequent words | GloVe word embedding with dimension 200
[54] | Cai et al. | Byte pair encoding (BPE) was used for segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lower case | GloVe word embedding
[29] | Lopyrev | Tokenization; converting the articles and their headlines to lower case; the symbol ⟨unk⟩ replaces rare words | The input was represented using a distributed representation
[38] | Jobson et al. | — | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | — | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | — | CNN maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving the coreference and reducing the morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | — | The word embedding is learned from scratch during training, with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | — | The word embedding is learned from scratch during training, with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | — | BERT
[63] | Wang et al. | WordPiece tokenizer | BERT
[64] | Egonmwan et al. | — | GloVe word embedding with dimension 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers listed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures for evaluating the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (beam size)
[18] | Rush et al. | Stochastic gradient descent to minimise the negative log-likelihood | — | Beam search
[39] | Chopra et al. | Minimizing the negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimize the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimizing the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimizing the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross-entropy used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | —
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanisms | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log-probability of the predicted word | Bilinear attention mechanism, pointer mechanism | —
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | —
[30] | Song et al. | — | Attention mechanism, copy mechanism | —
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process that requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\[
\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}, \tag{1}
\]

where S is a reference summary, n is the n-gram length, Count_match(gram_n) is the maximum number of matching n-gram words between the reference summary and the generated summary, and Count(gram_n) is the total number of n-gram words in the reference summary [73].
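Equation (1) can be implemented directly as n-gram recall, as in the short sketch below; this is a simplified illustration, and the official ROUGE package additionally offers options such as stemming and stopword removal.

```python
# A direct implementation of equation (1): n-gram recall of the generated
# summary against one or more reference summaries.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(references, candidate, n=1):
    cand_counts = Counter(ngrams(candidate.split(), n))
    overlap, total = 0, 0
    for ref in references:
        ref_counts = Counter(ngrams(ref.split(), n))
        overlap += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return overlap / total if total else 0.0

print(rouge_n(["ahmed ate the apple"], "the apple ahmed ate", n=1))  # 1.0
print(rouge_n(["ahmed ate the apple"], "the apple ahmed ate", n=2))  # 2/3
```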

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of the common matching words between the reference summary and the generated summary. The LCS calculation does not require the matching words to be consecutive; however, their order of occurrence is important. In addition, no predefined number of matching words is required. LCS considers only the main in-sequence match, which is one of its disadvantages, since the final score will not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets.

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarization evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from the BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | —
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: the apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" but not both, in line with the LCS definition.
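The ROUGE-L behaviour in this example follows from the standard dynamic-programming computation of the LCS length, sketched below for the reference R and the system output A above.

```python
# LCS computation behind ROUGE-L: R and A share an LCS of length 2
# ("Ahmed ate" or "the apple"), never both at once.
def lcs_length(ref_tokens, cand_tokens):
    m, n = len(ref_tokens), len(cand_tokens)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

R = "Ahmed ate the apple".split()
A = "the apple Ahmed ate".split()
print(lcs_length(R, A))            # 2
print(lcs_length(R, A) / len(R))   # ROUGE-L recall = 0.5
```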

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]; these models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L, namely 43.85, 20.34, and 39.9, respectively, were achieved by text summarisation with the pretrained encoder model [65]. Even though ROUGE is employed to evaluate abstractive summarisation, new methods are needed to evaluate the quality of such summaries: the new evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text. ROUGE, by contrast, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups. The first group considers single-sentence summary approaches, while the second group considers multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are those that employed BERT word embedding and were based on Transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset, as Gigaword is utilised for single-sentence summaries (it contains headlines that are treated as summaries), while the CNN/Daily Mail datasets contain multisentence summaries; thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the generated summaries of 50 test examples for 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summaries [60]: five human evaluators rated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where 1 indicates that the generated summary is less readable and less relevant and 10 indicates that the generated summary is readable and highly relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are close, with values of 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retention of key information, which was achieved by human evaluation [65]. In addition, the qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were selected for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summary of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are also not enough, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, the previous tokens of the headline can be fed to the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step of the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either the gold token is fed, as in standard training, or the prediction from the previous step is fed, as is done at test time. In this manner, at least part of the training procedure receives the same kind of input as testing.


In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76]. Moreover, the mass convolution of the QRNN is applied in [50], since the dependency of words generated in the future is difficult to determine.
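The coin-flip idea behind DaD and the 10% teacher-forcing variant can be written as a small decoder training loop in which the input at each step is either the gold token or the model's own previous prediction. The sketch below is a toy PyTorch illustration of that mechanism, not the architecture of [29] or [51]; all sizes and the sampling probability are assumptions.

```python
# Toy sketch of scheduled sampling / DaD-style training: with probability
# `sample_prob`, the decoder input at step t is its own prediction from t-1
# instead of the gold token.
import random
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 100, 32, 64
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.GRUCell(emb_dim, hid_dim)
out_proj = nn.Linear(hid_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

def decode_with_sampling(gold, h, sample_prob=0.1):
    """gold: LongTensor (batch, steps); h: initial decoder state (batch, hid_dim)."""
    inp = gold[:, 0]               # first decoder input, e.g. a start/<EOS> token
    loss = 0.0
    for t in range(1, gold.size(1)):
        h = cell(embed(inp), h)
        logits = out_proj(h)
        loss = loss + loss_fn(logits, gold[:, t])
        # coin flip: feed back the model's prediction or the gold token
        if random.random() < sample_prob:
            inp = logits.argmax(dim=-1).detach()
        else:
            inp = gold[:, t]
    return loss / (gold.size(1) - 1)

loss = decode_with_sampling(torch.randint(0, vocab_size, (4, 12)),
                            torch.zeros(4, hid_dim))
loss.backward()
```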

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copy it; when the switch is turned on, the decoder generates a word from the target vocabulary. Researchers in [56] addressed OOV words via a generation probability Pgen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output word from the input sequence and generating it from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them, and the combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the position of unseen tokens in the input sequence so that they can be copied.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods on the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods on the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | —
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset (ABS+, RAS-Elman (k = 10), SEASS, Words-lvt5k-1sent, FTSumg, and RCT).


Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. In addition, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
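A common way to realise the switching described above is the soft pointer-generator mixture of [56]: a scalar Pgen, computed from the context vector, the decoder state, and the decoder input, mixes the vocabulary distribution with the attention (copy) distribution over an extended vocabulary. The sketch below illustrates only that mixing step; all dimensions, layer names, and the randomly generated inputs are assumptions for illustration.

```python
# Minimal sketch of the pointer-generator mixture over an extended vocabulary.
import torch
import torch.nn as nn

hid, emb, vocab, extra_oov = 64, 32, 100, 5      # toy sizes; extended vocab = 105

w_gen = nn.Linear(hid + hid + emb, 1)            # produces the scalar p_gen
v_proj = nn.Linear(hid + hid, vocab)             # produces the vocabulary logits

def final_distribution(context, state, dec_input, attn, src_ext_ids):
    """attn: (batch, src_len) attention weights; src_ext_ids: extended word ids."""
    p_gen = torch.sigmoid(w_gen(torch.cat([context, state, dec_input], dim=-1)))
    vocab_dist = torch.softmax(v_proj(torch.cat([context, state], dim=-1)), dim=-1)
    extended = torch.zeros(attn.size(0), vocab + extra_oov)
    extended[:, :vocab] = p_gen * vocab_dist
    # copy probability mass goes to the source positions (possibly OOV ids)
    extended.scatter_add_(1, src_ext_ids, (1 - p_gen) * attn)
    return extended

dist = final_distribution(torch.randn(2, hid), torch.randn(2, hid),
                          torch.randn(2, emb),
                          torch.softmax(torch.randn(2, 7), dim=-1),
                          torch.randint(0, vocab + extra_oov, (2, 7)))
print(dist.sum(dim=-1))   # each row sums to 1
```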

7.3. Summary Sentence Repetition and Inaccurate Information. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges arise from the summarisation of long documents and the production of long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention steps. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and to avoid attending to the same part of the input at different decoder steps. However, the intratemporal encoder attention mechanism cannot address all repetition challenges, especially when a long sequence is generated; thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words, and it is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy maximum-likelihood loss and reinforcement learning to minimise exposure bias. In addition, a trigram probability p(yt) was proposed to address repetition in the generated summary, where yt is the trigram sequence: the value of p(yt) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated using a secondary encoder to remember the previously generated decoder outputs, and the coverage mechanism is utilised.
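The trigram rule mentioned above can be enforced at beam-search time with a simple check: a candidate next word is pruned whenever the trigram it would complete has already been produced in the partial summary. A minimal illustration follows; the token sequences are hypothetical and not drawn from any dataset.

```python
# Trigram blocking: discard a beam-search candidate if it would repeat a trigram.
def creates_repeated_trigram(partial_summary, next_word):
    tokens = partial_summary + [next_word]
    if len(tokens) < 3:
        return False
    new_trigram = tuple(tokens[-3:])
    existing = {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 3)}
    return new_trigram in existing

hyp = "police arrested the suspect and police arrested".split()
print(creates_repeated_trigram(hyp, "the"))   # True  -> prune this candidate
print(creates_repeated_trigram(hyp, "him"))   # False -> keep it
```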

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts. A sequence-to-sequence framework with dual attention was therefore proposed, where the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary consists of the highlights of the news article, and every highlight represents a sentence of the summary.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


Therefore, the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all the crucial points of the article; thus, considerable effort is needed to make a high-quality dataset available. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available, and a single-sentence abstractive Arabic text summarisation dataset exists but is not free.

Another issue of abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words; for example, the words "book" and "books" are considered different by every ROUGE metric. Therefore, a new evaluation measure must be proposed that considers the context of the words (words that have the same meaning must be treated as the same even if they have different surface forms). In this case, we propose the use of METEOR, which has recently been used for evaluating machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for languages with flexible word order, it is better to use ROUGE without considering the order of the words.
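The "book"/"books" point can be made concrete with a few lines of Python: exact unigram matching, as in ROUGE, misses the pair, while a stemmed comparison of the kind METEOR-style metrics rely on counts it. The example sentences and the use of NLTK's PorterStemmer are purely illustrative.

```python
# Exact unigram overlap vs. stemmed overlap for a toy reference/candidate pair.
from nltk.stem import PorterStemmer

reference = "the student read two books".split()
candidate = "the student read a book".split()

exact = set(reference) & set(candidate)
stem = PorterStemmer().stem
stemmed = {stem(w) for w in reference} & {stem(w) for w in candidate}

print(exact)     # {'the', 'student', 'read'}      -> 'books'/'book' is missed
print(stemmed)   # additionally contains 'book'    -> counted as a match
```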

The quality of the generated summary can be improved using linguistic features.

Figure 20: The generator/pointer switching model [55].

Figure 21: The pointer-generator model [56].

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].


For example, we propose the use of dependency parsing at the encoder, in a separate layer on top of the first hidden-state layer. We also propose the use of word embeddings that are built by considering dependency parsing or part-of-speech tagging. At the decoder side, the quality of the beam search can be improved by considering the part-of-speech tags of the words and their surrounding words.

Based on the new trends and the evaluation results, we think that the most promising direction among all of these features is the use of the BERT pretrained model. The quality of the models that are based on the Transformer is high, and they will continue to yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that applied deep learning for abstractive text summarisation, the datasets, and the measures used for the evaluation of these approaches. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L, namely 43.85, 20.34, and 39.9, respectively, were obtained by text summarisation with a pretrained encoder model; the best results were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, there are several issues that must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.

[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.


[43] D Suleiman and A Awajan ldquoComparative study of wordembeddings models and their usage in Arabic language ap-plicationsrdquo in Proceedings of the 2018 International ArabConference on Information Technology (ACIT) pp 1ndash7Werdanye Lebanon 2018

[44] J Pennington R Socher and C Manning ldquoGlove globalvectors for word representationrdquo in Proceedings of the 2014Conference on Empirical Methods in Natural Language Pro-cessing (EMNLP) pp 1532ndash1543 Doha Qatar 2014

[45] D Suleiman and A A Awajan ldquoUsing part of speech taggingfor improving Word2vec modelrdquo in Proceedings of the 20192nd International Conference on new Trends in ComputingSciences (ICTCS) pp 1ndash7 Amman Jordan 2019

[46] A Joulin E Grave P Bojanowski M Douze H Jegou andT Mikolov ldquoFastTextzip compressing text classificationmodelsrdquo 2016 httparxivorgabs161203651

[47] A Vaswani N Shazeer N Parmar et al ldquoAttention is all youneedrdquo Advances in Neural Information Processing Systemspp 5998ndash6008 2017

[48] J Devlin M-W Chang K Lee and K Toutanova ldquoPre-training of deep bidirectional transformers for languageunderstandingrdquo in Proceedings of the 2019 Conference of theNorth American Chapter of the Association for ComputationalLinguistics Human Language Technologies pp 4171ndash4186Minneapolis MN USA 2019

[49] Z Li Z Peng S Tang C Zhang and H Ma ldquoText sum-marization method based on double attention pointer net-workrdquo IEEE Access vol 8 pp 11279ndash11288 2020

[50] J Bradbury S Merity C Xiong and R Socher Quasi-re-current neural networks httpsarxivorgabs1611015762015

[51] U Khandelwal P Qi and D Jurafsky Neural Text Sum-marization Stanford University Stanford CA USA 2016

[52] Q Zhou N Yang F Wei and M Zhou ldquoSelective encodingfor abstractive sentence summarizationrdquo in Proceedings of the55th Annual Meeting of the Association for ComputationalLinguistics pp 1095ndash1104 Vancouver Canada July 2017

[53] Z Cao F Wei W Li and S Li ldquoFaithful to the original factaware neural abstractive summarizationrdquo in Proceedings of theAAAI Conference on Artificial Intelligence (AAAI) NewOrleans LA USA February 2018

[54] T Cai M Shen H Peng L Jiang and Q Dai ldquoImprovingtransformer with sequential context representations for ab-stractive text summarizationrdquo inNatural Language Processingand Chinese Computing J Tang M-Y Kan D Zhao S Liand H Zan Eds pp 512ndash524 Springer International Pub-lishing Cham Switzerland 2019

[55] R Nallapati B Zhou C N dos Santos C Gulcehre andB Xiang ldquoAbstractive text summarization using sequence-to-sequence RNNs and beyondrdquo in Proceedings of the CoNLL-16Berlin Germany August 2016

[56] A See P J Liu and C D Manning ldquoGet to the pointsummarization with pointer-generator networksrdquo in Pro-ceedings of the 55th ACL pp 1073ndash1083 Vancouver Canada2017

[57] R Paulus C Xiong and R Socher ldquoA deep reinforced modelfor abstractive summarizationrdquo 2017 httparxivorgabs170504304

[58] K S Bose R H Sarma M Yang Q Qu J Zhu and H LildquoDelineation of the intimate details of the backbone con-formation of pyridine nucleotide coenzymes in aqueous so-lutionrdquo Biochemical and Biophysical ResearchCommunications vol 66 no 4 1975

[59] C Li W Xu S Li and S Gao ldquoGuiding generation forabstractive text summarization based on key informationguide networkrdquo in Proceedings of the 2018 Conference of theNorth American Chapter of the Association for ComputationalLinguistics Human Language Technologies pp 55ndash60 NewOrleans LA USA 2018

[60] W Kryscinski R Paulus C Xiong and R Socher ldquoIm-proving abstraction in text summarizationrdquo in Proceedings ofthe Conference on Empirical Methods in Natural LanguageProcessing (EMNLP) Brussels Belgium November 2018

[61] K Yao L Zhang D Du T Luo L Tao and Y Wu ldquoDualencoding for abstractive text summarizationrdquo IEEE Trans-actions on Cybernetics pp 1ndash12 2018

[62] X Wan C Li R Wang D Xiao and C Shi ldquoAbstractivedocument summarization via bidirectional decoderrdquo in Ad-vanced DataMining and Applications G Gan B Li X Li andS Wang Eds pp 364ndash377 Springer International Publish-ing Cham Switzerland 2018

[63] Q Wang P Liu Z Zhu H Yin Q Zhang and L Zhang ldquoAtext abstraction summary model based on BERT word em-bedding and reinforcement learningrdquo Applied Sciences vol 9no 21 p 4701 2019

[64] E Egonmwan and Y Chali ldquoTransformer-based model forsingle documents neural summarizationrdquo in Proceedings ofthe 3rd Workshop on Neural Generation and Translationpp 70ndash79 Hong Kong 2019

[65] Y Liu and M Lapata ldquoText summarization with pretrainedencodersrdquo 2019 httparxivorgabs190808345

[66] P Doetsch A Zeyer and H Ney ldquoBidirectional decodernetworks for attention-based end-to-end offline handwritingrecognitionrdquo in Proceedings of the 2016 15th InternationalConference on Frontiers in Handwriting Recognition (ICFHR)pp 361ndash366 Shenzhen China 2016

[67] I Sutskever O Vinyals and Q V Le ldquoSequence to SequenceLearning with Neural Networksrdquo in Proceedings of the Ad-vances in Neural Information Processing Systems (NIPS)Montreal Quebec Canada December 2014

[68] D He H Lu Y Xia T Qin L Wang and T-Y LiuldquoDecoding with value networks for neural machine transla-tionrdquo in Proceedings of the Advances in Neural InformationProcessing Systems Long Beach CA USA December 2017

[69] D Harman and P Over ldquoe effects of human variation inDUC summarization evaluation text summarizationbranches outrdquo Proceedings of the ACL-04 Workshop vol 82004

[70] C Napoles M Gormley and B V Durme ldquoAnnotatedGigawordrdquo in Proceedings of the AKBC-WEKEX MontrealCanada 2012

28 Mathematical Problems in Engineering

[71] K M Hermann T Kocisky E Grefenstette et al ldquoMachinesto read and comprehendrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS) MontrealQuebec Canada December 2015

[72] M GruskyM Naaman and Y Artzi ldquoNewsroom a dataset of13 million summaries with diverse extractive strategiesrdquo inProceedings of the 2018 Conference of the North AmericanChapter of the Association for Computational LinguisticsHuman Language Technologies Association for Computa-tional Linguistics New Orleans LA USA pp 708ndash719 June2018

[73] C-Y Lin ldquoROUGE a package for automatic evaluation ofsummariesrdquo in Proceedings of the 2004 ACL WorkshopBarcelona Spain July 2004

[74] A Venkatraman M Hebert and J A Bagnell ldquoImprovingmulti-step prediction of learned time series modelsrdquo inProceedings of the Twenty-Ninth AAAI Conference on Arti-ficial Intelligence pp 3024ndash3030 Austin TX USA 2015

[75] I Goodfellow A Courville and Y Bengio Deep LearningMIT Press Cambridge MA USA 2015

[76] S Bengio O Vinyals N Jaitly and N Shazeer ldquoScheduledsampling for sequence prediction with recurrent neuralnetworksrdquo in Proceedings of the Annual Conference on NeuralInformation Processing Systems pp 1171ndash1179 MontrealQuebec Canada December 2015

[77] A Lavie and M J Denkowski ldquoe Meteor metric for au-tomatic evaluation of machine translationrdquo Machine Trans-lation vol 23 no 2-3 pp 105ndash115 2009

Mathematical Problems in Engineering 29

Page 17: Deep Learning Based Abstractive Text Summarization ...downloads.hindawi.com/journals/mpe/2020/9365340.pdfDeep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation

Moreover, ROUGE1, ROUGE2, and ROUGE-L were applied to evaluate DEATS, and values of 40.85, 18.08, and 37.13, respectively, were obtained for the CNN/Daily Mail dataset [61]. The experimental results of the BiSum model showed that the values of ROUGE1, ROUGE2, and ROUGE-L were 37.01, 15.95, and 33.66, respectively [62].

Several variations of the Wang et al. model were implemented. The best results were achieved by the BEAR (large + WordPiece) model, in which the WordPiece tokeniser was utilised; the values of ROUGE1, ROUGE2, and ROUGE-L were 41.95, 20.26, and 39.49, respectively [63]. In the Egonmwan et al. model [64], the values of ROUGE1 and ROUGE2 were 41.89 and 18.90, respectively, while the value of ROUGE-L was 38.92. Several variations of the Liu et al. [65] model were evaluated using ROUGE1, ROUGE2, and ROUGE-L; the best model, referred to as BERTSUMEXT (large), achieved values of 43.85, 20.34, and 39.90 for ROUGE1, ROUGE2, and ROUGE-L, respectively, over the CNN/Daily Mail datasets. Moreover, the model was evaluated by humans via a question answering paradigm in which 20 documents were selected for evaluation. Three values were used to score the answers: a score of 1 indicates a correct answer, a score of 0.5 indicates a partially correct answer, and a score of 0 indicates a wrong answer. The ROUGE1, ROUGE2, and ROUGE-L values of the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively [49].

Finally, the pointer-generator approach was applied to both single-sentence and multisentence summaries. Attention encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and the decoder have the same number of hidden states, and the proposed model contains a softmax layer that generates words from the target vocabulary. The encoder and decoder differ in terms of their components: the encoder consists of two bidirectional GRU-RNNs, one at the word level and one at the sentence level, while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of source documents; instead of considering the entire vocabulary, only certain words were added, based on their frequency in the target dictionary, to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words in order to identify the key entities of the document. The linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrices for each tag type, similar to word embeddings, while the TF-IDF feature was discretised into a fixed number of bins and a one-hot representation was employed to represent the value of the bins. The one-hot matrix consisted of the number of bin entries, where only one entry was set to one to indicate the TF-IDF value of a certain word. This process permitted the TF-IDF to be addressed in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16. The experiments were conducted using the annotated Gigaword corpus with 3.8 M training examples, the DUC corpus, and the CNN/Daily Mail corpus. The preprocessing steps included tokenisation and part-of-speech and named-entity tag generation. The Word2Vec model with 200 dimensions, trained on the Gigaword corpus, was applied for word embedding, and the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. The values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.30, and 32.65, respectively.
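To make the feature-rich encoder input concrete, the following minimal Python sketch (illustrative only, not the authors' implementation) builds one encoder input vector by concatenating a word embedding with lookup embeddings for the POS and NER tags and a one-hot vector for the discretised TF-IDF bin; all vocabularies, tag sets, and dimensions are assumptions chosen for the example.

```python
# A minimal sketch of the feature-rich encoder input described in [55]:
# word embedding + POS/NER lookup embeddings + one-hot TF-IDF bin.
import numpy as np

VOCAB = {"police": 0, "arrested": 1, "the": 2, "<unk>": 3}   # toy vocabulary
POS_TAGS = ["NOUN", "VERB", "DET", "OTHER"]
NER_TAGS = ["O", "PER", "ORG", "LOC"]
N_TFIDF_BINS = 5

rng = np.random.default_rng(0)
word_emb = rng.normal(size=(len(VOCAB), 200))   # e.g., 200-d Word2Vec-style vectors
pos_emb = np.eye(len(POS_TAGS))                 # lookup embeddings for tag types
ner_emb = np.eye(len(NER_TAGS))

def one_hot_bin(tfidf, n_bins=N_TFIDF_BINS, max_val=1.0):
    """Discretise a continuous TF-IDF score into a fixed number of bins."""
    idx = min(int(tfidf / max_val * n_bins), n_bins - 1)
    vec = np.zeros(n_bins)
    vec[idx] = 1.0
    return vec

def encoder_input(token, pos, ner, tfidf):
    """Concatenate all feature embeddings into one long input vector."""
    w = word_emb[VOCAB.get(token, VOCAB["<unk>"])]
    return np.concatenate([w, pos_emb[POS_TAGS.index(pos)],
                           ner_emb[NER_TAGS.index(ner)], one_hot_bin(tfidf)])

x = encoder_input("police", "NOUN", "ORG", 0.42)
print(x.shape)  # (213,) = 200 + 4 + 4 + 5
```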

Finally, for both the single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, the dataset preprocessing and word embedding of several approaches are shown in Table 2, while the training, optimisation, mechanisms, and search at the decoder are presented in Table 3.

5. Datasets for Text Summarization

Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. The DUC datasets were produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. The DUC2003 and DUC2004 datasets consist of 500 articles. The Gigaword dataset, from the Stanford University Linguistics Department, was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.

Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. The CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of each article [5]. The CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. These datasets were created by modifying the CNN/Daily Mail datasets generated by Hermann et al. [71], which were utilised for extractive summarisation. The abstractive summarisation CNN/Daily Mail datasets have 286,817 pairs for training and 13,368 pairs for validation, while 11,487 pairs were applied in testing. In training, the source documents have 766 words (on average 29.74 sentences), while the summaries have 53 words (on average 3.72 sentences) [55].

In April 2018, NEWSROOM, a summarisation dataset that consists of 1.3 million articles collected from social media metadata from 1998 to 2017, was produced [72].


Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Figure 16: Word embedding concatenated with discretized TF-IDF, POS, and NER one-hot embedding vectors [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | —
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | RNN (word-based)
[50] | 2016 | GRU + QRNN + attention | GRU + RNN, QRNN
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM | Unidirectional RNN attentive encoder-decoder LSTM
[38] | 2016 | Bidirectional LSTM | Unidirectional LSTM
[38] | 2016 | Bidirectional LSTM | Decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNNs | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the style of the summarisation is diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets. Nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number on the list), and one study applied each of the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization: "#" replaces all digits, all letters converted to lower case, and "UNK" replaces words that occurred fewer than 5 times | Bag-of-words of the input sentence embedding
[39] | Chopra et al. | PTB tokenization: "#" replaces all digits, all letters converted to lower case, and "UNK" replaces words that occurred fewer than 5 times | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) the input text was represented using the Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF were represented using bins and a one-hot representation of the bins; (iv) lookup embeddings for part-of-speech and named-entity tags
[52] | Zhou et al. | PTB tokenization: "#" replaces all digits, all letters converted to lower case, and "UNK" replaces words that occurred fewer than 5 times | Word embedding with size equal to 300
[53] | Cao et al. | Normalization and tokenization: "#" replaces digits, words converted to lower case, and "UNK" replaces the least frequent words | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) was used in segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lower case letters | GloVe word embedding
[29] | Lopyrev | Tokenization, converting the articles and their headlines to lower case letters, using the symbol ⟨unk⟩ to replace rare words | The input was represented using a distributed representation
[38] | Jobson et al. | — | The word embedding was randomly initialised and updated during training, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | — | The word embedding of the input was learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | — | CNN maximum pooling was used to encode the discriminator input sequence
[30] | Song et al. | The words were segmented using the CoreNLP tool, resolving coreference and reducing morphology | A convolutional neural network was used to represent the phrases
[35] | Al-Sabahi et al. | — | The word embedding is learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | — | The word embedding is learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | — | BERT
[63] | Wang et al. | Using the WordPiece tokenizer | BERT
[64] | Egonmwan et al. | — | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that were used to train and validate the summarisation methods in the research papers reviewed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures for evaluating the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (beam size)
[18] | Rush et al. | Stochastic gradient descent to minimise negative log-likelihood | — | Beam search
[39] | Chopra et al. | Minimising negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimise the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimising the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimising the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | —
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanism | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | —
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | —
[30] | Song et al. | — | Attention mechanism, copy mechanism | —
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search at decoding during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\[
\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}, \tag{1}
\]

where S is a reference summary, n is the n-gram length, Count_match(gram_n) is the maximum number of n-grams co-occurring in the reference summary and the generated summary, and Count(gram_n) is the total number of n-grams in the reference summary [73].
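As an illustration of equation (1), the following minimal Python sketch computes ROUGE-N recall by clipping candidate n-gram counts with the reference counts; it is a simplified single-reference version, and real evaluations should rely on an official ROUGE implementation.

```python
# A minimal sketch of ROUGE-N recall as defined in equation (1).
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(reference, candidate, n):
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    # Count_match: overlapping n-grams, clipped by their count in the reference
    match = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return match / total if total else 0.0

ref = "ahmed ate the apple"
cand = "the apple ahmed ate"
print(rouge_n(ref, cand, 1))  # 1.0 -- every reference unigram is recalled
print(rouge_n(ref, cand, 2))  # 0.666... -- "ahmed ate" and "the apple" match
```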

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of common matching words between the reference summary and the generated summary. The LCS calculation does not require the matching words to be consecutive; however, their order of occurrence is important, and no predefined number of matching words is required. LCS considers only the main in-sequence, which is one of its disadvantages, since the final score will not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets [61].

Table 4: Abstractive summarisation datasets.

Reference | Training | Evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | —
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: the apple Ahmed ate.

In this case, ROUGE-L will credit either "Ahmed ate" or "the apple" but not both, following the LCS definition.
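The following short sketch (an illustration, not the official ROUGE-L scorer) reproduces this example with a standard dynamic-programming LCS over word sequences, showing that only one of the two matching phrases is credited.

```python
# A minimal sketch of ROUGE-L recall via the longest common subsequence (LCS):
# matched words must keep their order but need not be consecutive, and only the
# main in-sequence is counted, which is the limitation noted above.
def lcs_length(a, b):
    # classic dynamic-programming LCS over word sequences
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if wa == wb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

R = "Ahmed ate the apple".split()
A = "the apple Ahmed ate".split()
lcs = lcs_length(R, A)
print(lcs, lcs / len(R))  # 2 0.5 -> only "Ahmed ate" OR "the apple" is credited
```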

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]. The models were evaluated using various datasets, and the other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively [65]. Even though ROUGE is employed to evaluate abstractive summarisation, new methods for assessing summary quality are needed: new evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text. ROUGE, by contrast, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups. The first group considers single-sentence summary approaches, while the second group considers multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively [65]. It can be clearly seen that the best models for both the single-sentence and the multisentence summaries are those that employ BERT word embedding and are based on the Transformer. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset because Gigaword is utilised for single-sentence summaries, as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries; thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the generated summaries of 50 test examples for 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summaries [60]. Five human evaluators rated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where a value of 1 indicates that the generated summary is less readable and less relevant, while a value of 10 indicates that the generated summary is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are comparable, with values of 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retained key information, which was achieved by human evaluation [65]. In addition, a qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were used for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summaries of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be obtained by manual evaluation, are very important. However, qualitative measures without quantitative measures are also not enough, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either a gold token is used (as during training) or the prediction from the previous step is used (as during both testing and training). In this manner, at least the training step receives the same input as testing.


In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, the generated word of the previous step is fed back 10% of the time [75, 76]. Moreover, the mass convolution of the QRNN is applied in [50], since the dependency on words generated in the future is difficult to determine.
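The coin-flip idea behind DaD and scheduled sampling can be sketched as follows; `decoder_step` is a hypothetical placeholder standing in for one RNN decoder step, and the 10% feedback rate mirrors the setting described above.

```python
# A minimal sketch (illustrative, not the cited implementations) of feeding the
# decoder either the gold token or its own previous prediction at each step.
import random

def decode_with_sampling(decoder_step, state, gold_tokens, use_prev_prob=0.1,
                         start_token="<EOS>"):
    """use_prev_prob=0.1 mimics feeding back the generated word 10% of the time."""
    inp, outputs = start_token, []
    for gold in gold_tokens:
        state, predicted = decoder_step(state, inp)   # one decoder step
        outputs.append(predicted)
        # coin flip: previously generated token vs. gold token (teacher forcing)
        inp = predicted if random.random() < use_prev_prob else gold
    return outputs
```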

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copy it to the memory; when the switch is turned on, the decoder generates a word from the target vocabulary. Conversely, researchers in [56] addressed OOV words via a generation probability Pgen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words and copy them; the combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilises a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the input sequence position of unseen tokens and copy them.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | —
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.9 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods (ABS+, RAS-Elman (k = 10), SEASS, Words-lvt5k-1sent, FTSumg, and RCT) for the Gigaword dataset.


Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. In [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
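A minimal numpy sketch of how the pointer-generator combines generation and copying is shown below; the shapes and the toy numbers are assumptions, and in the real models p_gen, the vocabulary distribution, and the attention weights are computed with learned parameters.

```python
# A minimal sketch of the pointer-generator idea used to handle OOV words
# [55, 56]: the final distribution mixes the vocabulary distribution (weighted by
# p_gen) with the attention distribution copied onto source positions (weighted
# by 1 - p_gen), over an "extended vocabulary" that includes source-only words.
import numpy as np

def final_distribution(p_gen, p_vocab, attention, src_ids, extended_size):
    """p_vocab: distribution over the fixed vocabulary; attention: weights over
    source positions; src_ids: extended-vocabulary id of each source word."""
    dist = np.zeros(extended_size)
    dist[:len(p_vocab)] = p_gen * p_vocab          # generate from the vocabulary
    for pos, word_id in enumerate(src_ids):        # copy from the source text
        dist[word_id] += (1.0 - p_gen) * attention[pos]
    return dist

# toy example: vocabulary of 5 words plus one OOV source word (id 5)
p_vocab = np.array([0.1, 0.4, 0.2, 0.2, 0.1])
attention = np.array([0.7, 0.2, 0.1])
dist = final_distribution(p_gen=0.8, p_vocab=p_vocab,
                          attention=attention, src_ids=[5, 1, 2], extended_size=6)
print(dist.sum())  # 1.0 -- a valid distribution; the OOV word gets mass 0.14
```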

7.3. Summary Sentence Repetition and Inaccurate Information. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges are due to the summarisation of long documents and the production of long summaries using the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism, where, for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and to avoid attending to the same part of the input at different decoder steps. However, the intratemporal encoder attention mechanism cannot address all the repetition challenges, especially when a long sequence is generated; thus, an intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words, and this mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy loss, maximum likelihood, and gradient reinforcement learning to minimise exposure bias. In addition, the probability of a trigram, p(yt), was proposed to address repetition in the generated summary, where yt is the trigram sequence: the value of p(yt) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated using a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.
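The trigram heuristic described above can be sketched as a simple check applied to each beam-search candidate; the function below is illustrative and ignores beam bookkeeping.

```python
# A minimal sketch of trigram blocking [57]: during beam search a candidate word
# is disallowed (its probability forced to 0) if it would create a trigram that
# already occurs in the partial summary.
def creates_repeated_trigram(partial_summary, candidate):
    """partial_summary: list of already generated tokens; candidate: next token."""
    seq = partial_summary + [candidate]
    if len(seq) < 3:
        return False
    new_trigram = tuple(seq[-3:])
    seen = {tuple(seq[i:i + 3]) for i in range(len(seq) - 3)}  # earlier trigrams
    return new_trigram in seen

summary = "the cat sat on the cat".split()
print(creates_repeated_trigram(summary, "sat"))    # True -> "the cat sat" repeats
print(creates_repeated_trigram(summary, "slept"))  # False
```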

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. To address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts. Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary is formed from the highlights of the news article. Every highlight becomes a sentence in the summary; therefore, the number of sentences in the summary is equal to the number of highlights.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


Sometimes the highlights do not address all the crucial points of the article; therefore, producing a high-quality dataset requires considerable effort. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset is available but is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words; for example, the words book and books are considered different by any of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words (words that have the same meaning must be considered the same even if they have a different surface form). In this case, we propose the use of METEOR, which has recently been used for evaluating machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for languages with flexible word order, it is better to use ROUGE without considering the order of the words.
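The following toy sketch illustrates the exact-match limitation and the effect of a stemming step such as the one METEOR applies; the `stem` function is a deliberately crude stand-in for a real stemmer.

```python
# A minimal sketch of why exact-match unigram recall penalises morphological
# variants ("book" vs. "books") and how a stemming step changes the outcome.
def stem(word):
    # deliberately crude plural stripping; a real system would use a proper stemmer
    return word[:-1] if word.endswith("s") else word

def unigram_recall(reference, candidate, normalise=lambda w: w):
    ref = [normalise(w) for w in reference.lower().split()]
    cand = {normalise(w) for w in candidate.lower().split()}
    return sum(1 for w in ref if w in cand) / len(ref)

ref, cand = "he reads books", "he read a book"
print(round(unigram_recall(ref, cand), 2))        # 0.33: "reads"/"books" do not match exactly
print(round(unigram_recall(ref, cand, stem), 2))  # 1.0: all reference words match after stemming
```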

The quality of the generated summary can also be improved using linguistic features. For example, we propose the use of dependency parsing at the encoder, in a separate layer on top of the first hidden state layer.

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


We also propose the use of word embeddings built by considering dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of a word and its surrounding words.

Based on the new trends and the evaluation results, we believe that the most promising direction among all the reviewed features is the use of the BERT pretrained model: the quality of the models that are based on the Transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that apply deep learning to abstractive text summarisation, the datasets used to train and validate them, and the measures used to evaluate them. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively. The best results were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, several issues must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.

[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.

[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.

[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.

[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.

[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.

[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.

[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.

[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.

[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.

[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, India, 2017.

[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.

[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.

[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.

[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.

[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, India, 2018.

[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.

[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.

[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, "Neural abstractive text summarization with sequence-to-sequence models: a survey," 2020, http://arxiv.org/abs/1812.02303.

[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.

[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.

[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.

[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.

[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

[29] K. Lopyrev, "Generating news headlines with recurrent neural networks," p. 9, 2015, https://arxiv.org/abs/1512.01712.

[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.

[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.

[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, "Bidirectional attentional encoder-decoder model and bidirectional beam search for abstractive summarization," Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] K. Cho, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.

[38] E. Jobson and A. Gutierrez, "Abstractive text summarization using attentive sequence-to-sequence RNNs," p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.

[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.

[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.

[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.

[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.

[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.

[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.

[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.

[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.

[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.

[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," 2015, https://arxiv.org/abs/1611.01576.

[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.

[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.

[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.

[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.

[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.

[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.

[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.

[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.

[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.

[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.

[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.

[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.

[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.

[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.

[68] D He H Lu Y Xia T Qin L Wang and T-Y LiuldquoDecoding with value networks for neural machine transla-tionrdquo in Proceedings of the Advances in Neural InformationProcessing Systems Long Beach CA USA December 2017

[69] D Harman and P Over ldquoe effects of human variation inDUC summarization evaluation text summarizationbranches outrdquo Proceedings of the ACL-04 Workshop vol 82004

[70] C Napoles M Gormley and B V Durme ldquoAnnotatedGigawordrdquo in Proceedings of the AKBC-WEKEX MontrealCanada 2012

28 Mathematical Problems in Engineering

[71] K M Hermann T Kocisky E Grefenstette et al ldquoMachinesto read and comprehendrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS) MontrealQuebec Canada December 2015

[72] M GruskyM Naaman and Y Artzi ldquoNewsroom a dataset of13 million summaries with diverse extractive strategiesrdquo inProceedings of the 2018 Conference of the North AmericanChapter of the Association for Computational LinguisticsHuman Language Technologies Association for Computa-tional Linguistics New Orleans LA USA pp 708ndash719 June2018

[73] C-Y Lin ldquoROUGE a package for automatic evaluation ofsummariesrdquo in Proceedings of the 2004 ACL WorkshopBarcelona Spain July 2004

[74] A Venkatraman M Hebert and J A Bagnell ldquoImprovingmulti-step prediction of learned time series modelsrdquo inProceedings of the Twenty-Ninth AAAI Conference on Arti-ficial Intelligence pp 3024ndash3030 Austin TX USA 2015

[75] I Goodfellow A Courville and Y Bengio Deep LearningMIT Press Cambridge MA USA 2015

[76] S Bengio O Vinyals N Jaitly and N Shazeer ldquoScheduledsampling for sequence prediction with recurrent neuralnetworksrdquo in Proceedings of the Annual Conference on NeuralInformation Processing Systems pp 1171ndash1179 MontrealQuebec Canada December 2015

[77] A Lavie and M J Denkowski ldquoe Meteor metric for au-tomatic evaluation of machine translationrdquo Machine Trans-lation vol 23 no 2-3 pp 105ndash115 2009

Mathematical Problems in Engineering 29

Page 18: Deep Learning Based Abstractive Text Summarization ...downloads.hindawi.com/journals/mpe/2020/9365340.pdfDeep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation

Figure 16: Word embedding concatenated with discretized TF-IDF, POS, and NER one-embedding vectors [55].

Figure 15: Word-level and sentence-level bidirectional GRU-RNN [55].

Table 1: Encoder and decoder components.

Reference | Year | Encoder | Decoder
[18] | 2015 | Bag-of-words, convolutional, and attention-based | 
[29] | 2015 | RNN with LSTM units and attention | RNN with LSTM units and attention
[39] | 2016 | RNN-LSTM | RNN, word-based
[50] | 2016 | GRU + QRNN + attention | GRU + RNN (QRNN)
[38] | 2016 | Unidirectional RNN attentive encoder-decoder LSTM | Unidirectional RNN attentive encoder-decoder LSTM
[38] | 2016 | Bidirectional LSTM | Unidirectional LSTM
[38] | 2016 | Bidirectional LSTM | Decoder with global attention
[51] | 2016 | LSTM-RNN | LSTM-RNN
[55] | 2016 | Two bidirectional GRU-RNN | Unidirectional GRU-RNN
[52] | 2017 | Bidirectional GRU | Unidirectional GRU
[53] | 2017 | Bidirectional GRU | Unidirectional GRU
[56] | 2017 | Single-layer bidirectional LSTM + attention | Single-layer unidirectional LSTM
[57] | 2017 | Bidirectional LSTM-RNN + intra-attention | Single LSTM decoder + intra-attention
[58] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[30] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[35] | 2018 | Bidirectional LSTM | Bidirectional LSTM
[59] | 2018 | Bidirectional LSTM | Unidirectional LSTM
[60] | 2018 | Bidirectional LSTM | 3-layer unidirectional LSTM
[61] | 2018 | Bidirectional GRU | Unidirectional GRU
[62] | 2018 | Bidirectional LSTM | Two-decoder unidirectional LSTM
[63] | 2019 | Bidirectional GRU | Unidirectional GRU
[64] | 2019 | Unidirectional GRU | Unidirectional GRU
[49] | 2020 | Bidirectional LSTM | Unidirectional LSTM
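Most rows of Table 1 share one pattern: a bidirectional recurrent encoder feeding a unidirectional recurrent decoder. The following is a minimal PyTorch sketch of that generic pattern; the class name and dimensions are illustrative, and attention, which most of the surveyed models add on top, is omitted.

```python
import torch
import torch.nn as nn

class Seq2SeqSummarizer(nn.Module):
    """Minimal bidirectional-LSTM encoder / unidirectional-LSTM decoder sketch."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Encoder reads the source article in both directions (Table 1, most rows).
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        # Decoder is unidirectional; its input is the previous summary token.
        self.decoder = nn.LSTM(emb_dim, 2 * hid_dim, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # enc_out would be consumed by an attention mechanism in the full models.
        enc_out, (h, c) = self.encoder(self.embed(src_ids))
        # Concatenate the forward/backward final states to initialise the decoder.
        h0 = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)
        c0 = torch.cat([c[0], c[1]], dim=-1).unsqueeze(0)
        dec_out, _ = self.decoder(self.embed(tgt_ids), (h0, c0))
        return self.out(dec_out)  # logits over the target vocabulary
```

In training, the logits are compared against the reference summary tokens with a cross-entropy loss (teacher forcing), which corresponds to the negative log-likelihood objectives listed in Table 3.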


The NEWSROOM dataset consists of 992,985 pairs for training and 108,612 and 108,655 pairs for validation and testing, respectively [22]. The quality of the summaries is high, and the styles of summarisation are diverse. Figure 17 displays the number of surveyed papers that applied each of the datasets. Nine research papers utilised Gigaword, fourteen papers employed the CNN/Daily Mail datasets (the largest number of papers on the list), and one study applied the ACL Anthology Reference, DUC2002, DUC2004, New York Times Annotated Corpus (NYT), and XSum datasets.

Table 2: Dataset preprocessing and word embedding.

Reference | Authors | Dataset preprocessing | Input (word embedding)
[18] | Rush et al. | PTB tokenization: all digits replaced with a placeholder symbol, all letters converted to lower case, and words that occurred fewer than 5 times replaced with "UNK" | Bag-of-words embedding of the input sentence
[39] | Chopra et al. | PTB tokenization: all digits replaced with a placeholder symbol, all letters converted to lower case, and words that occurred fewer than 5 times replaced with "UNK" | Encodes the position information of the input words
[55] | Nallapati et al. | Part-of-speech and named-entity tag generation and tokenization | (i) Encodes the position information of the input words; (ii) input text represented using a Word2Vec model with 200 dimensions trained on the Gigaword corpus; (iii) continuous features such as TF-IDF represented using bins with a one-hot representation of the bins; (iv) lookup embeddings for part-of-speech and named-entity tags
[52] | Zhou et al. | PTB tokenization: all digits replaced with a placeholder symbol, all letters converted to lower case, and words that occurred fewer than 5 times replaced with "UNK" | Word embedding with size equal to 300
[53] | Cao et al. | Normalization and tokenization: digits replaced with a placeholder symbol, words converted to lower case, and the least frequent words replaced with "UNK" | GloVe word embedding with dimension size equal to 200
[54] | Cai et al. | Byte pair encoding (BPE) used for segmentation | Transformer
[50] | Adelson et al. | Converting the articles and their headlines to lower-case letters | GloVe word embedding
[29] | Lopyrev | Tokenization; converting the articles and their headlines to lower-case letters; the symbol ⟨unk⟩ used to replace rare words | Distributed representation of the input
[38] | Jobson et al. | — | Word embedding randomly initialised and updated during training in the first model, while GloVe word embedding was used to represent the words in the second and third models
[56] | See et al. | — | Word embedding of the input learned from scratch instead of using a pretrained word embedding model
[57] | Paulus et al. | The same as in [55] | GloVe
[58] | Liu et al. | — | CNN with maximum pooling used to encode the discriminator input sequence
[30] | Song et al. | Words segmented using the CoreNLP tool; coreference resolution; morphological reduction | Convolutional neural network used to represent the phrases
[35] | Al-Sabahi et al. | — | Word embedding learned from scratch during training with a dimension of 128
[59] | Li et al. | The same as in [55] | Learned from scratch during training
[60] | Kryscinski et al. | The same as in [55] | Embedding layer with a dimension of 400
[61] | Yao et al. | — | Word embedding learned from scratch during training with a dimension of 128
[62] | Wan et al. | No word segmentation | Embedding layer learned during training
[65] | Liu et al. | — | BERT
[63] | Wang et al. | WordPiece tokenizer | BERT
[64] | Egonmwan et al. | — | GloVe word embedding with dimension size equal to 300


Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers listed in this work.

6. Evaluation Measures

The ROUGE package is employed to evaluate text summarisation techniques by comparing the generated summary with a manually generated summary [73]. The package consists of several measures for evaluating the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall, such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common subsequence.

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise negative log-likelihood | — | Beam search
[39] | Chopra et al. | Minimizing negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimize the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimizing the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimizing the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross-entropy used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | —
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanisms | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | —
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | —
[30] | Song et al. | — | Attention mechanism, copy mechanism | —
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search, backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search at decoding during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)
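As the last column of Table 3 shows, almost all of the surveyed models decode with a small beam (between 2 and 12 hypotheses). The sketch below is a minimal, framework-agnostic illustration of the procedure; the `step_probs` callback, which stands in for one decoder step, is a hypothetical placeholder rather than part of any surveyed model.

```python
import math

def beam_search(step_probs, start_id, end_id, beam_size=4, max_len=30):
    """Keep the `beam_size` highest-scoring partial summaries at each step.

    `step_probs(prefix)` is assumed to return {token_id: probability} for the
    next token given the tokens generated so far (one decoder step).
    """
    beams = [([start_id], 0.0)]                       # (token ids, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == end_id:                  # finished hypothesis: keep as-is
                candidates.append((prefix, score))
                continue
            for tok, p in step_probs(prefix).items():
                candidates.append((prefix + [tok], score + math.log(p + 1e-12)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(b[0][-1] == end_id for b in beams):    # every surviving beam finished
            break
    return beams[0][0]                                # best-scoring summary
```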


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \; \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \; \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)} \qquad (1)

where S is a reference summary, n is the n-gram length, Count_match(gram_n) is the maximum number of n-grams co-occurring in the reference summary and the generated summary, and Count(gram_n) is the total number of n-grams in the reference summary [73].
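As an illustration of equation (1), the following simplified implementation computes ROUGE-N recall directly from token lists. It is our own minimal sketch, not the official ROUGE package, which additionally handles stemming, multiple references, and sentence splitting.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference_tokens, generated_tokens, n):
    ref = ngrams(reference_tokens, n)
    gen = ngrams(generated_tokens, n)
    # Clipped overlap: an n-gram counts at most as often as it appears in the reference.
    match = sum(min(count, gen[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return match / total if total else 0.0

reference = "ahmed ate the apple".split()
generated = "the apple ahmed ate".split()
print(rouge_n(reference, generated, 1))   # 1.0  (all unigrams recalled)
print(rouge_n(reference, generated, 2))   # 0.67 (2 of 3 reference bigrams recalled)
```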

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of the common matching words between the reference summary and the generated summary. The LCS calculation does not require the matched words to be consecutive; however, their order of occurrence matters, and no predefined number of matched words is required. LCS considers only the main in-sequence match, which is one of its disadvantages, since the final score does not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets [61].

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarization evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | —
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL Anthology Reference | ACL Anthology Reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


R: Ahmed ate the apple.
A: the apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" as the longest common subsequence, but not both.
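A simplified ROUGE-L recall for this example can be computed with the classic dynamic-programming LCS; note that the official metric also combines recall with precision into an F-measure.

```python
def lcs_length(ref, gen):
    """Classic dynamic-programming longest common subsequence length."""
    table = [[0] * (len(gen) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, g in enumerate(gen, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if r == g else max(table[i - 1][j],
                                                                     table[i][j - 1])
    return table[len(ref)][len(gen)]

reference = "ahmed ate the apple".split()
generated = "the apple ahmed ate".split()
# The LCS is either "ahmed ate" or "the apple", so recall = 2 / 4 = 0.5.
print(lcs_length(reference, generated) / len(reference))
```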

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, Perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]; the models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, at 43.85, 20.34, and 39.90, respectively [65]. Even though ROUGE is employed to evaluate abstractive summarisation, it would be better to develop new methods for evaluating the quality of the summaries. New evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text; ROUGE, by contrast, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups. The first group considers single-sentence summary approaches, while the second group considers multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L on the Gigaword dataset, which consists of single-sentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], at 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods on the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, at 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both single-sentence and multisentence summaries are those that employ BERT word embedding and are based on Transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset: Gigaword is used for single-sentence summaries, as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries. Thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the generated summaries of 50 test examples from 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates a low level of readability. The results show that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though it is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation of the quality of the generated summaries [60]: five human evaluators rated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where 1 indicates that the generated summary is less readable and less relevant and 10 indicates that it is readable and highly relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. With respect to relevance, the mean values of the three models are close, at 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency, in addition to measuring how well key information is retained, through human evaluation [65]. A qualitative evaluation also assessed the output for grammatical mistakes. Three values were used for rating 20 test examples: 1 indicates a correct answer, 0.5 a partially correct answer, and 0 an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summaries of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be obtained by manual evaluation, are very important. However, qualitative measures without quantitative measures are also not enough, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, the previous tokens of the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either the gold token is fed, as in standard training, or the prediction of the previous step is fed, as at test time. In this manner, at least part of the training receives the same


input as testing. In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of always feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76].

Moreover, the mass convolution of the QRNN is applied in [50], since the dependency of words generated in the future is difficult to determine.
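A minimal sketch of the coin-flip idea behind DaD and scheduled sampling follows: at each decoder step, the input is either the gold token (teacher forcing) or the model's own previous prediction. The `decoder_step` callable and the 10% sampling probability are placeholders, not the exact configuration of the surveyed models.

```python
import random

def decode_with_sampling(decoder_step, state, gold_tokens, sample_prob=0.1):
    """Scheduled-sampling-style decoding loop (sketch).

    `decoder_step(token, state)` is assumed to return (predicted_token, new_state).
    With probability `sample_prob` the model's own prediction is fed back,
    otherwise the gold token is fed, as in ordinary teacher forcing.
    """
    current_input, outputs = gold_tokens[0], []
    for t in range(1, len(gold_tokens)):
        prediction, state = decoder_step(current_input, state)  # one decoder step
        outputs.append(prediction)
        use_model_output = random.random() < sample_prob
        current_input = prediction if use_model_output else gold_tokens[t]
    return outputs
```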

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. A switch on the decoder side alternates between generating a word and using a pointer, as shown in Figure 20 [55]: when the switch is turned off, the decoder uses the pointer to point to a word in the source and copies it, and when the switch is turned on, the decoder generates a word from the target vocabulary. Conversely, researchers in [56] addressed OOV words via a generation probability P_gen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, P_gen switches between copying the output word from the input sequence and generating it from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words and copy them; the combination of the words in the input and the vocabulary is referred to as the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilises a switch function at each timestep to alternate between generating a token with the softmax layer and using the pointer mechanism to point to the position of an unseen token in the input sequence so that it can be copied.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods on the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods on the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | —
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.90 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset (ABS+, RAS-Elman (k = 10), SEASS, Words-lvt5k-1sent, FTSumg, and RCT).


Moreover, in [30], rare words were addressed by using the location of the phrase, which made the resulting summary more natural. Furthermore, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
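A minimal sketch of how the final distribution of the pointer-generator in [56] can be assembled is shown below: P_gen weights the vocabulary distribution, and 1 − P_gen weights the copy (attention) distribution scattered over an extended vocabulary. Tensor names and shapes are illustrative, not the reference implementation.

```python
import torch

def final_distribution(p_vocab, attention, src_ids, p_gen, extended_vocab_size):
    """Mix generation and copying (pointer-generator sketch).

    p_vocab:   (batch, vocab)    softmax over the fixed vocabulary
    attention: (batch, src_len)  attention weights over the source tokens
    src_ids:   (batch, src_len)  source token ids in the extended vocabulary
    p_gen:     (batch, 1)        generation probability from context and decoder state
    """
    batch, vocab = p_vocab.shape
    dist = torch.zeros(batch, extended_vocab_size)
    dist[:, :vocab] = p_gen * p_vocab                       # generate from vocabulary
    # Copy: add the attention mass onto the positions of the source tokens.
    dist.scatter_add_(1, src_ids, (1.0 - p_gen) * attention)
    return dist                                             # distribution over extended vocabulary
```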

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both arise when summarising long documents and producing long summaries with the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model, which builds a coverage vector by aggregating the attention over all previous timesteps. In [57], repetition was addressed with a key attention mechanism in which, for each input token, the encoder intratemporal attention records the weights of the previous attention steps. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and to avoid attending to the same parts of the input at different decoder steps. However, the intratemporal attention over the encoder cannot address all repetition challenges, especially when a long sequence is generated; thus, an intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words, and it is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy maximum-likelihood loss with reinforcement learning to reduce exposure bias. In addition, a constraint on the trigram probability p(y_t), where y_t is a trigram sequence, was proposed to address repetition in the generated summary: p(y_t) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], repetition was addressed by exploiting the encoding features generated by a secondary encoder to remember the previously generated decoder output, and the coverage mechanism was also utilised.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].
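A minimal sketch of the trigram heuristic described above (as used in [57, 60]): during beam search, any candidate token that would repeat an already generated trigram receives probability 0.

```python
def blocks_repeated_trigram(prefix, candidate):
    """Return True if appending `candidate` to the generated `prefix`
    would create a trigram that already appears in the prefix."""
    if len(prefix) < 2:
        return False
    new_trigram = tuple(prefix[-2:]) + (candidate,)
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    return new_trigram in seen

# During beam search, the candidate probability is zeroed whenever the check fires:
# scores = {tok: (0.0 if blocks_repeated_trigram(prefix, tok) else p)
#           for tok, p in scores.items()}
```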

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of a predicate. To address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary is conditioned on both the input text and descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.
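A minimal sketch of the dual-attention idea in [53] follows: the decoder attends separately to the source sentence and to the extracted fact descriptions, and the two context vectors are merged before predicting the next word. The dot-product attention and the learned gate used here are illustrative choices, not necessarily those of the original model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    """Merge a sentence context vector and a fact context vector (sketch)."""

    def __init__(self, hid_dim):
        super().__init__()
        self.gate = nn.Linear(3 * hid_dim, 1)

    def attend(self, query, memory):
        # query: (batch, hid); memory: (batch, len, hid)
        scores = torch.bmm(memory, query.unsqueeze(2)).squeeze(2)   # dot-product attention
        weights = F.softmax(scores, dim=1)
        return torch.bmm(weights.unsqueeze(1), memory).squeeze(1)   # context vector

    def forward(self, decoder_state, sentence_memory, fact_memory):
        c_sent = self.attend(decoder_state, sentence_memory)        # attend to the source text
        c_fact = self.attend(decoder_state, fact_memory)            # attend to extracted facts
        g = torch.sigmoid(self.gate(torch.cat([decoder_state, c_sent, c_fact], dim=1)))
        return g * c_sent + (1 - g) * c_fact   # merged context fed to the output layer
```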

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets (Pointer-generator + coverage, reinforcement learning with intra-attention, maximum-likelihood + RL with intra-attention, adversarial network, ATSDL, bidirectional attentional encoder-decoder, key information guide network, ML + RL ROUGE + Novel with LM, DEATS, Words-lvt2k-temp-att, BiSum, BERTSUMEXT (large), BEAR (large + WordPiece), TRANS-ext + filter + abs, and DAPT + imp-coverage (RL + MLE (ss))).

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference (golden) summary. In the CNN/Daily Mail dataset, the reference summary consists of the highlights of the news article; every highlight becomes one sentence of the summary, so the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all of the crucial points of the article; therefore, producing a high-quality dataset requires considerable effort. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available, and the existing single-sentence abstractive Arabic text summarisation dataset is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results for extractive summarisation; however, for abstractive summarisation, ROUGE is not enough, since it depends on exact matching between words. For example, the words "book" and "books" are considered different by every ROUGE metric. Therefore, a new evaluation measure must be proposed that considers the context of words: words that have the same meaning should be counted as matches even if their surface forms differ. In this case, we propose using METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for languages with flexible word order, it is better to use ROUGE without regard to the order of the words.
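METEOR implementations are available in common NLP toolkits; the sketch below assumes NLTK (version 3.6 or later, where the inputs must be pre-tokenized) with the WordNet data downloaded. Unlike exact n-gram matching, it can credit morphological variants such as "book"/"books".

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")   # METEOR's stem/synonym matching relies on WordNet

reference = "ahmed read the books".split()
hypothesis = "ahmed read the book".split()

# Recent NLTK versions expect tokenized input: a list of reference token lists
# and a single hypothesis token list.
print(meteor_score([reference], hypothesis))
```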

Figure 20: The generator/pointer switching model [55].

Figure 21: The pointer-generator model [56].

The quality of the generated summary can be improved using linguistic features. For example, we propose applying dependency parsing at the encoder in a separate layer on top of the first hidden-state layer, and using word embeddings that were built with dependency parsing or part-of-speech tagging taken into account. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the new trends and evaluation results, we think that the most promising direction is the use of the pretrained BERT model. The quality of the models that are based on the Transformer is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that apply deep learning to abstractive text summarisation, the datasets used to train and validate them, and the measures used to evaluate them. Moreover, the challenges encountered by the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively, and the best results overall were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, several issues must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.
[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.
[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.
[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.
[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.
[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.
[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.
[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.
[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.
[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.
[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.
[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.
[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, 2017.
[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.
[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.
[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.
[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.
[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, 2018.
[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.
[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.
[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, 2020, http://arxiv.org/abs/1812.02303.
[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.
[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.
[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.
[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.
[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
[29] K. Lopyrev, "Generating news headlines with recurrent neural networks," p. 9, 2015, https://arxiv.org/abs/1512.01712.
[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.
[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.
[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. Cho, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.
[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.
[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.
[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.
[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.
[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.
[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.
[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.
[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.
[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.
[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," 2016, https://arxiv.org/abs/1611.01576.
[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.
[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.
[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.
[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.
[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.
[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.
[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.
[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.
[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.
[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.
[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.
[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.
[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.
[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.
[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.
[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.
[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.
[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.
[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation, text summarization branches out," Proceedings of the ACL-04 Workshop, vol. 8, 2004.
[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.
[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, New Orleans, LA, USA, pp. 708–719, June 2018.
[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.
[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.
[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.
[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.
[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.

Mathematical Problems in Engineering 29

Page 19: Deep Learning Based Abstractive Text Summarization ...downloads.hindawi.com/journals/mpe/2020/9365340.pdfDeep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation

NEWSROOM dataset consists of 992985 pairs for trainingand 108612 and 108655 pairs for validation and testingrespectively [22] e quality of the summaries is high andthe style of the summarisation is diverse Figure 17 displaysthe number of surveyed papers that applied each of the

datasets Nine research papers utilised Gigaword fourteenpapers employed the CNNDaily Mail datasets (largestnumber of papers on the list) and one study applied the ACLAnthology Reference DUC2002 DUC2004 New YorkTimes Annotated Corpus (NYT) and XSum datasets

Table 2 Dataset preprocessing and word embedding

Reference Authors Dataset preprocessing Input (word embedding)

[18] Rush et alPTB tokenization by using ldquordquo to replace all digitsconverting all letters to lower case and ldquoUNKrdquo toreplace words that occurred fewer than 5 times

Bag-of-words of the input sentence embedding

[39] Chopra et alPTB tokenization by using ldquordquo to replace all digitsconverting all letters to lower case and ldquoUNKrdquo toreplace words that occurred fewer than 5 times

Encodes the position information of the input words

[55] Nallapatiet al

Part-of-speech and name-entity tags generating andtokenization

(i) Encodes the position information of the inputwords

(ii) e input text was represented using theWord2Vec model with 200 dimensions that was

trained using Gigaword corpus(iii) Continuous features such as TF-IDF were

represented using bins and one-hot representation forbins

(iv) Lookup embedding for part-of-speech tagging andname-entity tagging

[52] Zhou et alPTB tokenization by using ldquordquo to replace all digitsconverting all letters to lower case and ldquoUNKrdquo toreplace words that occurred fewer than 5 times

Word embedding with size equal to 300

[53] Cao et alNormalization and tokenization using the ldquordquo toreplace digits convert the words to lower case and

ldquoUNKrdquo to replace the least frequent words

GloVe word embedding with dimension size equal to200

[54] Cai et al Byte pair encoding (BPE) was used in segmentation Transformer

[50] Adelson et al Converting the article and their headlines to lower caseletters GloVe word embedding

[29] LopyrevTokenization converting the article and their

headlines to lower case letters using the symbol langunkrang

to replace rare words

e input was represented using the distributedrepresentation

[38] Jobson et al

e word embedding randomly initialised andupdated during training while GloVe word embeddingwas used to represent the words in the second and

third models

[56] See et ale word embedding of the input for was learned fromscratch instead of using a pretrained word embedding

model[57] Paulus et al e same as in [55] GloVe

[58] Liu et al CNN maximum pooling was used to encode thediscriminator input sequence

[30] Song et ale words were segmented using CoreNLP tool

resolving the coreference and reducing themorphology

Convolutional neural network was used to representthe phrases

[35] Al-Sabahiet al

e word embedding is learned from scratch duringtraining with a dimension of 128

[59] Li et al e same as in [55] Learned from scratch during training

[60] Kryscinskiet al e same as in [55] Embedding layer with a dimension of 400

[61] Yao et al e word embedding is learned from scratch duringtraining with a dimension of 128

[62] Wan et al No word segmentation Embedding layer learned during training[65] Liu et al BERT[63] Wang et al Using WordPiece tokenizer BERT

[64] Egonmwanet al

GloVe word embedding with dimension size equal to300

Mathematical Problems in Engineering 19

Table 4 lists the datasets that are used to train and validatethe summarisation methods in the research papers listed inthis work

6 Evaluation Measures

e package ROUGE is employed to evaluate the textsummarisation techniques by comparing the generated

summary with a manually generated summary [73] epackage consists of several measures to evaluate the per-formance of text summarisation techniques such asROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L whichwere employed in several studies [38] ROUGE-N is n-gramrecall such that ROUGE1 and ROUGE2 are related tounigrams and bigrams respectively while ROUGE-L isrelated to the longest common substring Since the manual

Table 3: Training, optimization, mechanism, and search at the decoder.

Reference | Authors | Training and optimization | Mechanism | Search at decoder (size)
[18] | Rush et al. | Stochastic gradient descent to minimise negative log-likelihood | — | Beam search
[39] | Chopra et al. | Minimizing negative log-likelihood end-to-end using stochastic gradient descent | Encodes the position information of the input words | Beam search
[55] | Nallapati et al. | Optimize the conditional likelihood using Adadelta | Pointer mechanism | Beam search (5)
[52] | Zhou et al. | Stochastic gradient descent, Adam optimizer, optimizing the negative log-likelihood | Attention mechanism | Beam search (12)
[53] | Cao et al. | Adam optimizer, optimizing the negative log-likelihood | Copy mechanism, coverage mechanism, dual-attention decoder | Beam search (6)
[54] | Cai et al. | Cross entropy is used as the loss function | Attention mechanism | Beam search (5)
[50] | Adelson et al. | Adam | Attention mechanism | —
[29] | Lopyrev | RMSProp adaptive gradient method | Simple and complex attention mechanism | Beam search
[38] | Jobson et al. | Adadelta, minimising the negative log probability of the predicted word | Bilinear attention mechanism, pointer mechanism | —
[56] | See et al. | Adadelta | Coverage mechanism, attention mechanism, pointer mechanism | Beam search (4)
[57] | Paulus et al. | Adam, RL | Intradecoder attention mechanism, pointer mechanism, copy mechanism, RL | Beam search (5)
[58] | Liu et al. | Adadelta, stochastic gradient descent | Attention mechanism, pointer mechanism, copy mechanism, RL | —
[30] | Song et al. | — | Attention mechanism, copy mechanism | —
[35] | Al-Sabahi et al. | Adagrad | Pointer mechanism, coverage mechanism, copy mechanism | Bidirectional beam search
[59] | Li et al. | Adadelta | Attention mechanism, pointer mechanism, copy mechanism, prediction guide mechanism | Beam search
[60] | Kryscinski et al. | Asynchronous gradient descent optimizer | Temporal attention and intra-attention, pointer mechanism, RL | Beam search
[61] | Yao et al. | RL, Adagrad | Attention mechanism, pointer mechanism, copy mechanism, coverage mechanism, RL | Beam search (4)
[62] | Wan et al. | Adagrad | Attention mechanism, pointer mechanism | Beam search: backward (2) and forward (4)
[65] | Liu et al. | Adam | Self-attention mechanism | Beam search (5)
[63] | Wang et al. | Gradient of reinforcement learning, Adam, cross-entropy loss function | Attention mechanism, pointer mechanism, copy mechanism, new coverage mechanism | Beam search
[64] | Egonmwan et al. | Adam | Self-attention mechanism | Greedy decoding during training and validation; beam search at decoding during testing
[49] | Peng et al. | Adam, gradient descent, cross-entropy loss | Coverage mechanism, RL, double attention pointer network (DAPT) | Beam search (5)


Since the manual evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation:

\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}, \quad (1)

where S is a reference summary, n is the n-gram length, and Count_match(gram_n) is the maximum number of matching n-gram words between the reference summary and the generated summary. Count(gram_n) is the total number of n-gram words in the reference summary [73].
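As an illustration of equation (1), the following is a minimal Python sketch of ROUGE-N recall for a single reference summary; it assumes whitespace-tokenised, lowercased text and omits the multi-reference summation, stemming, and the other options of the official ROUGE package.

```python
from collections import Counter

def rouge_n_recall(reference_tokens, candidate_tokens, n=1):
    """ROUGE-N recall: overlapping n-grams divided by the n-grams in the reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_ngrams = ngrams(reference_tokens, n)
    cand_ngrams = ngrams(candidate_tokens, n)
    # Clipped overlap: each reference n-gram is matched at most as often
    # as it appears in the candidate summary.
    overlap = sum(min(count, cand_ngrams[gram]) for gram, count in ref_ngrams.items())
    total = sum(ref_ngrams.values())
    return overlap / total if total > 0 else 0.0

reference = "ahmed ate the apple".split()
candidate = "the apple ahmed ate".split()
print(rouge_n_recall(reference, candidate, n=1))  # 1.0: all unigrams recalled
print(rouge_n_recall(reference, candidate, n=2))  # 2/3: "ate the" is not recalled
```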

ROUGE-L is based on the longest common subsequence (LCS), which represents the maximum length of the common matching words between the reference summary and the generated summary. The LCS calculation does not require the matching words to be consecutive; however, their order of occurrence is important. In addition, no predefined number of matching words is required. LCS considers only the main in-sequence, which is one of its disadvantages, since the final score will not include other matches.

Figure 17: The number of research papers that used the Gigaword, CNN/Daily Mail, ACL, DUC2002, DUC2004, NYT, Newsroom, and XSum datasets [61].

Table 4: Abstractive summarisation datasets.

Reference | Training | Summarization evaluation
[18] | Gigaword | DUC2003 and DUC2004
[39] | Gigaword | DUC2004
[50] | Gigaword | Gigaword
[29] | Gigaword | Articles from BBC, The Wall Street Journal, Guardian, Huffington Post, and Forbes
[38] | Gigaword | —
[54] | Gigaword and DUC2004 | Gigaword and DUC2004
[51] | ACL anthology reference | ACL anthology reference
[52] | Gigaword and DUC2004 | Gigaword and DUC2004
[53] | Gigaword and DUC2004 | Gigaword and DUC2004
[56] | CNN/Daily Mail | CNN/Daily Mail
[57] | CNN/Daily Mail and New York Times | CNN/Daily Mail and New York Times
[58] | CNN/Daily Mail | CNN/Daily Mail
[30] | CNN/Daily Mail | CNN/Daily Mail
[35] | CNN/Daily Mail | CNN/Daily Mail
[59] | CNN/Daily Mail | CNN/Daily Mail
[60] | CNN/Daily Mail | CNN/Daily Mail
[61] | CNN/Daily Mail | CNN/Daily Mail
[55] | Gigaword, DUC, CNN/Daily Mail | Gigaword, DUC, CNN/Daily Mail
[62] | CNN/Daily Mail | CNN/Daily Mail
[65] | CNN/Daily Mail, NYT, and XSum | CNN/Daily Mail, NYT, and XSum
[63] | CNN/Daily Mail and DUC2002 | CNN/Daily Mail and DUC2002
[64] | CNN/Daily Mail and Newsroom | CNN/Daily Mail and Newsroom
[49] | CNN/Daily Mail | CNN/Daily Mail


For example, assume that the reference summary R and the automatic summary A are as follows:

R: Ahmed ate the apple.
A: the apple Ahmed ate.

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" but not both, similar to LCS.
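A minimal sketch of this behaviour is given below (assuming whitespace-tokenised text and reporting only the recall against the reference, without ROUGE-L's F-measure weighting):

```python
def lcs_length(ref, cand):
    """Length of the longest common subsequence (order preserved, gaps allowed)."""
    m, n = len(ref), len(cand)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == cand[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l_recall(reference_tokens, candidate_tokens):
    return lcs_length(reference_tokens, candidate_tokens) / len(reference_tokens)

R = "ahmed ate the apple".split()
A = "the apple ahmed ate".split()
# The LCS is either "ahmed ate" or "the apple" (length 2), so only one match counts.
print(rouge_l_recall(R, A))  # 0.5
```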

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, Perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]. The models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with the pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. Even though ROUGE is employed to evaluate abstractive summarisation, it would be better to develop new methods to evaluate the quality of the summaries. New evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text; ROUGE, by contrast, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups. The first group considered single-sentence summary approaches, while the second group considered multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are those that employed BERT word embedding and were based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset: Gigaword is utilised for single-sentence summaries, as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries. Thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the summaries generated for 50 test examples by 5 models [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summary [60]. Five human evaluators rated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where a value of 1 indicates that the generated summary is less readable and less relevant, and a value of 10 indicates that the generated summary is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. On the other hand, with respect to relevance, the mean values of the three models are 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retention of key information, which was achieved by human evaluation [65]. In addition, a qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were used for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summary of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are not enough, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, the previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either the gold token is fed, as during training, or the output of the previous step is fed, as during testing. In this manner, the training step receives, at least some of the time, the same input as testing.


In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76]. Moreover, the mass convolution of the QRNN is applied in [50], since the dependency of words generated in the future is difficult to determine.
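The coin-flip idea behind DaD and scheduled sampling [74, 76] can be sketched in a few lines of Python; the function name, the loop structure in the comment, and the 10% sampling rate used here are illustrative and are not taken from any of the reviewed implementations.

```python
import random

def next_decoder_input(gold_prev, predicted_prev, sampling_prob=0.1):
    """Coin flip: with probability `sampling_prob`, feed the model's own previous
    prediction back in (as at test time); otherwise feed the gold token."""
    return predicted_prev if random.random() < sampling_prob else gold_prev

# Hypothetical use inside a training loop over decoder timesteps:
#   prev = START_TOKEN
#   for t, gold_token in enumerate(target_summary):
#       logits, state = decoder_step(prev, state)      # decoder_step is illustrative
#       loss += cross_entropy(logits, gold_token)
#       prev = next_decoder_input(gold_token, argmax(logits), sampling_prob=0.1)
```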

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copies it to the memory. When the switch is turned on, the decoder generates a word from the target vocabularies. Conversely, researchers in [56] addressed OOV words via the generation probability Pgen, whose value is calculated from the context vector and the decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them. The combination of the words in the input and the vocabulary is referred to as the extended vocabulary.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[18] | 2015 | Rush et al. | ABS+ | 28.18 | 8.49 | 23.81
[39] | 2016 | Chopra et al. | RAS-Elman (k = 10) | 28.97 | 8.26 | 24.06
[55] | 2016 | Nallapati et al. | Words-lvt5k-1sent | 28.61 | 9.42 | 25.24
[52] | 2017 | Zhou et al. | SEASS | 36.15 | 17.54 | 33.63
[53] | 2018 | Cao et al. | FTSumg | 37.27 | 17.65 | 34.24
[54] | 2019 | Cai et al. | RCT | 37.27 | 18.19 | 34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference | Year | Authors | Model | ROUGE1 | ROUGE2 | ROUGE-L
[55] | 2016 | Nallapati et al. | Words-lvt2k-temp-att | 35.46 | 13.30 | 32.65
[56] | 2017 | See et al. | Pointer-generator + coverage | 39.53 | 17.28 | 36.38
[57] | 2017 | Paulus et al. | Reinforcement learning with intra-attention | 41.16 | 15.75 | 39.08
[57] | 2017 | Paulus et al. | Maximum-likelihood + RL with intra-attention | 39.87 | 15.82 | 36.90
[58] | 2018 | Liu et al. | Adversarial network | 39.92 | 17.65 | 36.71
[30] | 2018 | Song et al. | ATSDL | 34.9 | 17.8 | —
[35] | 2018 | Al-Sabahi et al. | Bidirectional attentional encoder-decoder | 42.6 | 18.8 | 38.5
[59] | 2018 | Li et al. | Key information guide network | 38.95 | 17.12 | 35.68
[60] | 2018 | Kryscinski et al. | ML + RL ROUGE + Novel with LM | 40.19 | 17.38 | 37.52
[61] | 2018 | Yao et al. | DEATS | 40.85 | 18.08 | 37.13
[62] | 2018 | Wan et al. | BiSum | 37.01 | 15.95 | 33.66
[63] | 2019 | Wang et al. | BEAR (large + WordPiece) | 41.95 | 20.26 | 39.49
[64] | 2019 | Egonmwan et al. | TRANS-ext + filter + abs | 41.89 | 18.9 | 38.92
[65] | 2020 | Liu et al. | BERTSUMEXT (large) | 43.85 | 20.34 | 39.90
[49] | 2020 | Peng et al. | DAPT + imp-coverage (RL + MLE (ss)) | 40.72 | 18.28 | 37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods (ABS+, RAS-Elman (k = 10), SEASS, Words-lvt5k-1sent, FTSumg, and RCT) for the Gigaword dataset.


In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the position of unseen tokens in the input sequence in order to copy them. Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. Furthermore, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
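The final word distribution of such a pointer-generator decoder can be sketched as follows; this is only a minimal illustration of the mixing step described for [56], with illustrative tensor names and shapes rather than the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def final_distribution(vocab_logits, attention, src_ids, p_gen, extended_vocab_size):
    """Mix generation and copying:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention over source positions of w,
    computed over the extended vocabulary (in-vocabulary words plus source-only words)."""
    batch_size = vocab_logits.size(0)
    p_vocab = F.softmax(vocab_logits, dim=-1)                       # [batch, vocab]
    # Pad the vocabulary distribution with zeros for source-only (OOV) word ids.
    extra = extended_vocab_size - p_vocab.size(1)
    p_vocab_ext = torch.cat([p_vocab, p_vocab.new_zeros(batch_size, extra)], dim=-1)
    # Scatter the attention mass onto the extended-vocabulary ids of the source words.
    copy_dist = p_vocab_ext.new_zeros(batch_size, extended_vocab_size)
    copy_dist.scatter_add_(1, src_ids, attention)                   # src_ids: [batch, src_len]
    return p_gen * p_vocab_ext + (1.0 - p_gen) * copy_dist

# Toy usage: a vocabulary of 6 words, a source of 4 tokens, 2 of them OOV (ids 6 and 7).
vocab_logits = torch.randn(1, 6)
attention = torch.tensor([[0.1, 0.4, 0.3, 0.2]])
src_ids = torch.tensor([[2, 6, 7, 2]])
p_gen = torch.tensor([[0.7]])
print(final_distribution(vocab_logits, attention, src_ids, p_gen, extended_vocab_size=8))
```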

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both challenges arise when summarising long documents and producing long summaries with the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism in which, for each input token, the encoder intratemporal attention records the weights of the previous attention steps. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and to avoid attending to the same part of the input at different decoder steps. However, the intra-attention encoder mechanism cannot address all the repetition challenges, especially when a long sequence is generated. Thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words; the proposed intradecoder attention mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy maximum-likelihood loss and gradient reinforcement learning to minimise exposure bias. In addition, the trigram probability p(y_t) was proposed to address repetition in the generated summary, where y_t is the trigram sequence: p(y_t) is set to 0 during the beam search in the decoder when the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated using a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.
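The trigram constraint described above can be sketched as a simple check applied to each candidate token during beam search; this is a minimal illustration of the idea in [57], not the authors' code.

```python
def creates_repeated_trigram(hypothesis_tokens, candidate_token):
    """Return True if appending candidate_token would repeat a trigram that already
    appears in the partial summary, in which case its probability is set to 0."""
    if len(hypothesis_tokens) < 2:
        return False
    new_trigram = tuple(hypothesis_tokens[-2:] + [candidate_token])
    existing = {tuple(hypothesis_tokens[i:i + 3]) for i in range(len(hypothesis_tokens) - 2)}
    return new_trigram in existing

# Example: the trigram "the cat sat" has already been generated.
hyp = "the cat sat on the cat".split()
print(creates_repeated_trigram(hyp, "sat"))   # True -> block this token in the beam
print(creates_repeated_trigram(hyp, "ran"))   # False
```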

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts: about 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of a predicate. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Accordingly, a sequence-to-sequence framework with dual attention was proposed, in which the generated summary is conditioned on both the input text and the description of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary consists of the highlights of the news article.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets.


Every highlight represents a sentence in the summary; therefore, the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all crucial points in the summary. Therefore, considerable effort is needed to make a high-quality dataset available. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset exists but is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words; for example, the words "book" and "books" are considered different by every ROUGE metric. Therefore, a new evaluation measure should be proposed that considers the context of words, so that words with the same meaning are treated as identical even if they have different surface forms. In this case, we propose the use of METEOR, which has recently been used for evaluating machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, in flexible-order languages, it is better to use ROUGE without considering the order of the words.

The quality of the generated summary can be improved using linguistic features.

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


For example, we propose using dependency parsing at the encoder, in a separate layer on top of the first hidden state layer, and using word embeddings built with dependency parsing or part-of-speech tagging taken into account. On the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.

Based on the recent trends and the evaluation results, we believe that the most promising direction is the use of the BERT pretrained model: the quality of transformer-based models is high and will yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that apply deep learning to abstractive text summarisation, the datasets used to train and validate them, and the measures used to evaluate them. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN with an attention mechanism was the most commonly employed deep learning technique. Some approaches applied LSTM to solve the gradient vanishing problem encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively. The best results were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, several issues must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.

[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.

[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.

[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.

[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.

[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.

[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.

[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.

[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.

[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.

[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, 2017.

[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.

[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.

[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.

[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.

[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, 2018.

[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.

[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.

[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, 2020, http://arxiv.org/abs/1812.02303.

[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.

[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.

[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.

[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.

[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

[29] K. Lopyrev, Generating News Headlines with Recurrent Neural Networks, p. 9, 2015, https://arxiv.org/abs/1512.01712.

[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.

[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.

[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.

[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.

[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.

[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.

[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.

[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.

[44] J. Pennington, R. Socher, and C. Manning, "Glove: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.

[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.

[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.

[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.

[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.

[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, Quasi-Recurrent Neural Networks, 2015, https://arxiv.org/abs/1611.01576.

[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.

[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.

[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.

[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.

[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.

[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.

[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.

[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.

[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.

[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.

[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.

[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.

[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.

[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.

[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.

[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation, text summarization branches out," Proceedings of the ACL-04 Workshop, vol. 8, 2004.

[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.

[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, New Orleans, LA, USA, pp. 708–719, June 2018.

[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.

[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.

[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.

[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.

[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.


Page 20: Deep Learning Based Abstractive Text Summarization ...downloads.hindawi.com/journals/mpe/2020/9365340.pdfDeep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation

Table 4 lists the datasets that are used to train and validatethe summarisation methods in the research papers listed inthis work

6 Evaluation Measures

e package ROUGE is employed to evaluate the textsummarisation techniques by comparing the generated

summary with a manually generated summary [73] epackage consists of several measures to evaluate the per-formance of text summarisation techniques such asROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L whichwere employed in several studies [38] ROUGE-N is n-gramrecall such that ROUGE1 and ROUGE2 are related tounigrams and bigrams respectively while ROUGE-L isrelated to the longest common substring Since the manual

Table 3 Training optimization mechanism and search at the decoder

Reference Authors Training and optimization Mechanism Search at decoder (siz)

[18] Rush et al Stochastic gradient descent tominimise negative log-likelihood Beam search

[39] Chopra et alMinimizing negative log-likelihoodusing end-to-end using stochastic

gradient descent

Encodes the position information ofthe input words Beam search

[55] Nallapatiet al

Optimize the conditional likelihoodusing Adadelta Pointer mechanism Beam search (5)

[52] Zhou et alStochastic gradient descent Adamoptimizer optimizing the negative

log-likelihoodAttention mechanism Beam search (12)

[53] Cao et al Adam optimizer optimizing thenegative log-likelihood

Copy mechanism coveragemechanism dual-attention decoder Beam search (6)

[54] Cai et al Cross entropy is used as the lossfunction Attention mechanism Beam search (5)

[50] Adelson et al Adam Attention mechanism

[29] Lopyrev RMSProp adaptive gradient method Simple and complex attentionmechanism Beam search

[38] Jobson et al Adadelta minimising the negativelog probability of prediction word

Bilinear attention mechanismpointer mechanism

[56] See et al Adadelta Coverage mechanism attentionmechanism pointer mechanism Beam search (4)

[57] Paulus et al Adam RLIntradecoder attention mechanism

pointer mechanism copymechanism RL

Beam search (5)

[58] Liu et al Adadelta stochastic gradientdescent

Attention mechanism pointermechanism copy mechanism RL

[30] Song et al Attention mechanism copymechanism

[35] Al-Sabahiet al Adagrad Pointer mechanism coverage

mechanism copy mechanism Bidirectional beam search

[59] Li et al AdadeltaAttention mechanism pointermechanism copy mechanismprediction guide mechanism

Beam search

[60] Kryscinskiet al

Asynchronous gradient descentoptimizer

Temporal attention and intra-attention pointer mechanism RL Beam search

[61] Yao et al RL AdagradAttention mechanism pointermechanism copy mechanism

coverage mechanism RLBeam search (4)

[62] Wan et al Adagrad Attention mechanism pointermechanism

Beam-search backward (2) andforward (4)

[65] Liu et al Adam Self-attention mechanism Beam search (5)

[63] Wang et al Gradient of reinforcement learningAdam cross-entropy loss function

Attention mechanism pointermechanism copy mechanism new

coverage mechanismBeam search

[64] Egonmwanet al Adam Self-attention mechanism

Greedy-decoding during trainingand validation Beam search at

decoding during testing

[49] Peng et al Adam gradient descent cross-entropy loss

Coverage mechanism RL doubleattention pointer network (DAPT) Beam search (5)

20 Mathematical Problems in Engineering

evaluation of automatic text summarisation is a time-con-suming process and requires extensive effort ROUGE is

employed as a standard for evaluating text summarisationROUGE-N is calculated using the following equation

ROUGE minus N 1113936 S isin REFERENCE SUMMARIES 1113936 gramn isin Countmatch gramn( 1113857

1113936 S isin REFERENCE SUMMARIES 1113936 gramn isin Count gramn( 1113857 (1)

where S is the reference summary n is the n-gram lengthand Countmatch (gramn) is the maximum number ofmatching n-gram words between the reference summaryand the generated summary Count (gramn) is the totalnumber of n-gram words in the reference summary [73]

ROUGE-L is the longest common subsequence (LCS)which represents the maximum length of the commonmatching words between the reference summary and the

generated summary LCS calculation does not necessarilyrequire the match words to be consecutive however theorder of occurrence is important In addition no predefinednumber of match words is required LCS considers only themain in-sequence which is one of its disadvantages since thefinal score will not include other matches For exampleassume that the reference summary R and the automaticsummary A are as follows

Gigaword CNNDailyMail

ACL DUC2004DUC2002 NewsroomNYT XSum

Datasets

Num

ber o

fre

sear

ches

02468

10121416

Datasets

Figure 17 e number of research papers that used the Gigaword CNNDaily Mail ACL DUC2002 DUC2004 NYT Newsroom andXSum datasets [61]

Table 4 Abstractive summarisation datasets

Reference Training Summarization Evaluation[18] Gigaword DUC2003 and DUC2004[39] Gigaword DUC2004[50] Gigaword Gigaword[29] Gigaword Articles from BBC e Wall Street Journal Guardian Huffington Post and Forbes[38] Gigaword mdash[54] Gigaword and DUC2004 Gigaword and DUC2004[51] ACL anthology reference ACL anthology reference[52] Gigaword and DUC2004 Gigaword and DUC2004[53] Gigaword and DUC2004 Gigaword and DUC2004[56] CNNDaily Mail CNNDaily Mail[57] CNNDaily and New York Times CNNDaily and New York Times[58] CNNDaily Mail CNNDaily Mail[30] CNNDaily Mail CNNDaily Mail[35] CNNDaily Mail CNNDaily Mail[59] CNNDaily Mail CNNDaily Mail[60] CNNDaily Mail CNNDaily Mail[61] CNNDaily Mail CNNDaily Mail[55] Gigaword DUC CNNDaily Mail Gigaword DUC CNNDaily Mail[62] CNNDaily Mail CNNDaily Mail[65] CNNDaily Mail NYT and XSum CNNDaily Mail NYT and XSum[63] CNNDaily Mail and DUC2002 CNNDaily Mail and DUC2002[64] CNNDaily Mail and Newsroom CNNDaily Mail and Newsroom[49] CNNDaily Mail CNNDaily Mail

Mathematical Problems in Engineering 21

R Ahmed ate the appleA the apple Ahmed ate

In this case ROUGE-L will consider either ldquoAhmed aterdquoor ldquothe applerdquo but not both similar to LCS

Tables 5 and 6 present the values of ROUGE1 ROUGE2and ROUGE-L for the text summarisation methods in thevarious studies reviewed in this research In addition Per-plexity was employed in [18 39 51] and BLEU was utilisedin [29] e models were evaluated using various datasetse other models applied ROUGE1 ROUGE2 andROUGE-L for evaluation It can be seen that the highestvalues of ROUGE1 ROUGE2 and ROUGE-L for textsummarisation with the pretrained encoder model were4385 2034 and 399 respectively [65] Even thoughROUGE was employed to evaluate abstractive summa-risation it is better to obtain new methods to evaluate thequality of summarisation e new evaluation metrics mustconsider novel words and semantics since the generatedsummary contains words that do not exist in the originaltext However ROUGE was very suitable for extractive textsummarisation

Based on our taxonomy we divided the results ofROUGE1 ROUGE2 and ROUGE-L into two groups efirst group considered single-sentence summary approacheswhile the second group considered multisentence summaryapproaches Figure 18 compares several deep learningtechniques in terms of ROUGE1 ROUGE2 and ROUGE-Lfor the Gigaword datasets consisting of single-sentencesummary documents e highest values for ROUGE1ROUGE2 and ROUGE-L were achieved by the RCTmodel[54] e values for ROUGE1 ROUGE2 and ROUGE-Lwere 3727 1819 and 3462 respectively

Furthermore Figure 19 compares the ROUGE1ROUGE2 and ROUGE-L values for abstractive text sum-marisation methods for the CNNDaily Mail datasets whichconsist of multisentence summary documents e highestvalues of ROUGE1 ROUGE2 and ROUGE-L were achievedfor text summarisation with a pretrained encoder modele values for ROUGE1 ROUGE2 and ROUGE-L were4385 2034 and 399 respectively [65] It can be clearly seenthat the best model in the single-sentence summary andmultisentence summary is the models that employed BERTword embedding and were based on transformers eROUGE values for the CNNDaily Mail datasets are largerthan those for the Gigaword dataset as Gigaword is utilisedfor single-sentence summaries as it contains headlines thatare treated as summaries while the CNNDailyMail datasetsare multisentence summaries us the summaries in theCNNDaily Mail datasets are longer than the summaries inGigaword

Liu et al selected two human elevators to evaluate thereadability of the generated summary of 50 test examples of 5models [58] e value of 5 indicates that the generatedsummary is highly readable while the value of 1 indicatesthat the generated summary has a low level of readability Itcan be clearly seen from the results that the Liu et al modelwas better than the other four models in terms of ROUGE1ROUGE2 and human evaluation even though the model is

not optimal with respect to the ROUGE-L value In additionto quantitative measures qualitative evaluationmeasures areimportant Kryscinski et al also performed qualitativeevaluation to evaluate the quality of the generated summary[60] Five human evaluators evaluated the relevance andreadability of 100 randomly selected test examples wheretwo values are utilised 1 and 10e value of 1 indicates thatthe generated summary is less readable and less relevancewhile the value of 10 indicates that the generated summary isreadable and very relevance e results showed that interms of readability the model proposed by Kryscinski et alis slightly inferior to See et al [56] and Liu et al [58] modelswith mean values of 635 676 and 679 for Kryscinski et alto See et al and Liu et al respectively On the other handwith respect to the relevance the means values of the threemodels are relevance with values of 663 673 and 674 forKryscinski et al to See et al and Liu et al respectivelyHowever the Kryscinski et al model was the best in terms ofROUGE1 ROUGE2 and ROUGE-L

Liu et al evaluated the quality of the generated summaryin terms of succinctness informativeness and fluency inaddition to measuring the level of retaining key informationwhich was achieved by human evaluation [65] In additionqualitative evaluation evaluated the output in terms ofgrammatical mistakes ree values were selected for eval-uating 20 test examples 1 indicates a correct answer 05indicates a partially correct answer and 0 indicates an in-correct answer We can conclude that quantitative evalua-tions which include ROUGE1 ROUGE2 and ROUGE-Lare not enough for evaluating the generated summary ofabstractive text summarisation especially when measuringreadability relevance and fluency erefore qualitativemeasures which can be achieved by manual evaluation arevery important However qualitative measures withoutquantitative measures are not enough due to the smallnumber of testing examples and evaluators

7 Challenges and Solutions

Text summarisation approaches have faced various chal-lenges although some have been solved others still need tobe addressed In this section these challenges and theirpossible solutions are discussed

71 Unavailability of the Golden Token during TestingDue to the availability of golden tokens (ie referencesummary tokens) during training previous tokens in theheadline can be input into the decoder at the next stepHowever during testing the golden tokens are not availablethus the input for the next step in the decoder will be limitedto the previously generated output word To solve this issuewhich becomes more challenging when addressing smalldatasets different solutions have been proposed For ex-ample in reference [51] the data-as-demonstrator (DaD)model [74] is utilised In DaD at each step based on a coinflip either a gold token is utilised during training or theprevious step is employed during both testing and trainingIn this manner at least the training step receives the same

22 Mathematical Problems in Engineering

input as testing In all cases the first input of the decoder isthe langEOSrang token and the same calculations are applied tocompute the loss In [29] teacher forcing is employed toaddress this challenge during training instead of feeding theexpected word from the headline 10 of the time thegenerated word of the previous step is fed back [75 76]

Moreover the mass convolution of the QRNN is applied in[50] since the dependency of words generated in the future isdifficult to determine

72Out-of-Vocabulary (OOV)Words One of the challengesthat may occur during testing is that the central words of thetest document may be rare or unseen during training thesewords are referred to as OOV words In 61[55 61] aswitching decoderpointer was employed to address OOVwords by using pointers to point to their original positions inthe source document e switch on the decoder side is usedto alternate between generating a word and using a pointeras shown in Figure 20 [55] When the switch is turned offthe decoder will use the pointer to point to the word in thesource to copy it to the memory When the switch is turnedon the decoder will generate a word from the target vo-cabularies Conversely researchers in [56] addressed OOVwords via probability generation Pgen where the value iscalculated from the context vector and decoder state asshown in Figure 21 To generate the output word Pgenswitches between copying the output words from the inputsequence and generating them from the vocabulary Fur-thermore the pointer-generator technique is applied topoint to input words to copy them e combination be-tween the words in the input and the vocabulary is referredto the extended vocabulary In addition in [57] to generatethe tokens on the decoder side the decoder utilised theswitch function at each timestep to switch between gener-ating the token using the softmax layer and using the pointermechanism to point to the input sequence position for

Table 5 Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset

Reference Year Authors Model ROUGE1 ROUGE2 ROUGE-L[18] 2015 Rush et al ABS+ 2818 849 2381[39] 2016 Chopra et al RAS-Elman (k 10) 2897 826 2406[55] 2016 Nallapati et al Words-lvt5k-1sent 2861 942 2524[52] 2017 Zhou et al SEASS 3615 1754 3363[53] 2018 Cao et al FTSumg 3727 1765 3424[54] 2019 Cai et al RCT 3727 1819 3462

Table 6 Evaluation measures of several abstractive text summarisation methods over the CNNDaily Mail datasets

Reference Year Authors Model ROUGE1 ROUGE2 ROUGE-L[55] 2016 Nallapati et al Words-lvt2k-temp-att 3546 1330 3265[56] 2017 See et al Pointer-generator + coverage 3953 1728 3638[57] 2017 Paulus et al Reinforcement learning with intra-attention 4116 1575 3908[57] 2017 Paulus et al Maximum-likelihood +RL with intra-attention 3987 1582 3690[58] 2018 Liu et al Adversarial network 3992 1765 3671[30] 2018 Song et al ATSDL 349 178 mdash[35] 2018 Al-Sabahi et al Bidirectional attentional encoder-decoder 426 188 385[59] 2018 Li et al Key information guide network 3895 1712 3568[60] 2018 Kryscinski et al ML+RL ROUGE+Novel with LM 4019 1738 3752[61] 2018 Yao et al DEATS 4085 1808 3713[62] 2018 Wan et al BiSum 3701 1595 3366[63] 2019 Wang et al BEAR (large +WordPiece) 4195 2026 3949[64] 2019 Egonmwan et al TRANS-ext + filter + abs 4189 189 3892[65] 2020 Liu et al BERTSUMEXT (large) 4385 2034 3990[49] 2020 Peng et al DAPT+ imp-coverage (RL+MLE (ss)) 4072 1828 3735

ROUGE1 ROUGE2 ROUGE-L0

5

10

15

20

25

30

35

40

ABS+RAS-Elman (k = 10)SEASS

Words-lvt5k-1sent (Gigaword)FTSumgRCT

Figure 18 ROUGE1 ROUGE2 and ROUGE-L scores of severaldeep learning abstractive text summarisation methods for theGigaword dataset

Mathematical Problems in Engineering 23

unseen tokens to copy them Moreover in [30] rare wordswere addressed by using the location of the phrase and theresulting summary was more natural Moreover in 35[35 58 60] the OOV problem was addressed by usingthe pointer-generator technique employed in [56] whichalternates between generating a new word and coping theword from the original input text

73 Summary Sentence Repetition and Inaccurate InformationSummary e repetition of phrases and generation of in-coherent phrases in the generated output summary are two

challenges that must be considered Both challenges are dueto the summarisation of long documents and the productionof long summaries using the attention-based encoder-de-coder RNN [57] In [35 56] repetition was addressed byusing the coverage model to create the coverage vector byaggregating the attention over all previous timesteps In [57]repetition was addressed by using the key attention mech-anism where for each input token the encoder intra-temporal attention records the weights of the previousattention Furthermore the intratemporal attention uses thehidden states of the decoder at a certain timestep thepreviously generated words and the specific part of theencoded input sequence as shown in Figure 22 to preventrepetition and attend to the same sequence of the input at adifferent step of the decoder However the intra-attentionencoder mechanism cannot address all the repetitionchallenges especially when a long sequence is generatedus the intradecoder attention mechanism was proposedto allow the decoder to consider more previously generatedwords Moreover the proposed intradecoder attentionmechanism is applicable to any type of the RNN decoderRepetition was also addressed by using an objective functionthat combines the cross-entropy loss maximum likelihoodand gradient reinforcement learning to minimise the ex-posure bias In addition the probability of trigram p (yt) wasproposed to address repetition in the generated summarywhere yt is the trigram sequence In this case the value ofp (yt) is 0 during a beam search in the decoder when thesame trigram sequence was already generated in the outputsummary Furthermore in [60] the heuristic proposed by[57] was employed to reduce repetition in the summaryMoreover in [61] the proposed approach addressed repe-tition by exploiting the encoding features generated using asecondary encoder to remember the previously generateddecoder output and the coverage mechanism is utilised

74 Fake Facts Abstractive summarisation may generatesummaries with fake facts and 30 of summaries generatedfrom abstractive text summarisation suffer from thisproblem [53] With fake facts there may be a mismatchbetween the subject and the object of the predicates us toaddress this problem dependency parsing and open in-formation extraction (eg open information extraction(OpenIE)) are performed to extract facts

erefore the sequence-to-sequence framework withdual attention was proposed where the generated summarywas conditioned by the input text and description of theextracted facts OpenIE facilitates entity extraction from arelation and Stanford CoreNLP was employed to providethe proposed approach with OpenIE and the dependencyparser Moreover the decoder utilised copying and coveragemechanisms

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (the golden summary). In the CNN/Daily Mail dataset, the reference summary consists of the highlights of the news article. Every highlight represents a sentence in the summary; therefore,

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets (models: pointer-generator + coverage; reinforcement learning with intra-attention; maximum-likelihood + RL with intra-attention; adversarial network; ATSDL; bidirectional attentional encoder-decoder; key information guide network; ML + RL ROUGE + novel with LM; DEATS; words-lvt2k-temp-att (CNN/Daily Mail); BiSum; BERTSUMEXT (large); BEAR (large + WordPiece); TRANS-ext + filter + abs; DAPT + imp-coverage (RL + MLE (ss))).


the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not cover all of the crucial points of the article; consequently, building a high-quality dataset requires considerable effort. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset exists, but it is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results for extractive summarisation. However, for abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words. For example, the

words "book" and "books" are considered different by any of the ROUGE metrics. Therefore, a new evaluation measure must be proposed that considers the context of the words: words that have the same meaning must be treated as the same even if they have different surface forms. In this case, we propose the use of METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]. Moreover, METEOR considers stemming, morphological variants, and synonyms. In addition, for languages with flexible word order, it is better to use ROUGE without considering the order of the words.
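The exact-matching limitation is easy to see in a small ROUGE-1 recall computation (a simplified sketch in plain Python, without the stemming and synonym handling that METEOR adds):

```python
from collections import Counter


def rouge_n_recall(reference_tokens, generated_tokens, n=1):
    """ROUGE-N recall: clipped overlapping n-grams divided by the number
    of n-grams in the reference summary."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, gen = ngrams(reference_tokens), ngrams(generated_tokens)
    overlap = sum(min(count, gen[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)


# "books" does not match "book", so the score drops even though the
# meaning of the two summaries is essentially the same.
print(rouge_n_recall("ahmed read the book".split(),
                     "ahmed read the books".split()))  # 0.75
```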

The quality of the generated summary can also be improved by using linguistic features. For example, we propose the use

Figure 20: The generator/pointer switching model [55].

Figure 21: Pointer-generator model [56].


of dependency parsing at the encoder, in a separate layer on top of the first hidden-state layer. We also propose the use of word embeddings that are built by taking dependency parsing or part-of-speech tagging into account. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.
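One simple way to realise this proposal is to concatenate a part-of-speech embedding to each word embedding before feeding the encoder. The sketch below is purely illustrative (the randomly initialised lookup tables stand in for embeddings that would be trained jointly with the summarisation model):

```python
import random

WORD_DIM, POS_DIM = 8, 4
word_table = {}   # hypothetical word-embedding lookup table
pos_table = {}    # hypothetical POS-tag embedding lookup table


def embed(word, pos_tag):
    """Concatenate a word embedding with an embedding of its POS tag, so
    that the encoder input carries a lightweight syntactic signal."""
    if word not in word_table:
        word_table[word] = [random.uniform(-1, 1) for _ in range(WORD_DIM)]
    if pos_tag not in pos_table:
        pos_table[pos_tag] = [random.uniform(-1, 1) for _ in range(POS_DIM)]
    return word_table[word] + pos_table[pos_tag]


vector = embed("summarisation", "NOUN")  # 12-dimensional encoder input
```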

Based on the new trends and the evaluation results, we think that the most promising direction is the use of the pretrained BERT model. The quality of the models that are based on the Transformer is high, and they are expected to continue yielding promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that apply deep learning to abstractive text summarisation, the datasets they use, and the measures employed to evaluate them. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the vanishing gradient problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the

New York Times. Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively. The best results were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, there are several issues that must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.

[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.

[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.

[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.

[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.

[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.

[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.

[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.

[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.

[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.

[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, 2017.

[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.

[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.

[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.

[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.

[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, 2018.

[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.

[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.

[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, http://arxiv.org/abs/1812.02303, 2020.

[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.

[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.

[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.

[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.

[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

[29] K. Lopyrev, "Generating news headlines with recurrent neural networks," p. 9, 2015, https://arxiv.org/abs/1512.01712.

[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.

[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.

[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] K. Cho, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.

[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.

[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.

[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.

[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.

[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.

[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.

[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.

[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.

[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.

[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.

[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, Quasi-Recurrent Neural Networks, https://arxiv.org/abs/1611.01576, 2015.

[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.

[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.

[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.

[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.

[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.

[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.

[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.

[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.

[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.

[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.

[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.

[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.

[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.

[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.

[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.

[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.

[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.

[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, New Orleans, LA, USA, pp. 708–719, June 2018.

[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.

[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.

[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.

[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.

[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.

Mathematical Problems in Engineering 29

Page 21: Deep Learning Based Abstractive Text Summarization ...downloads.hindawi.com/journals/mpe/2020/9365340.pdfDeep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation

evaluation of automatic text summarisation is a time-con-suming process and requires extensive effort ROUGE is

employed as a standard for evaluating text summarisationROUGE-N is calculated using the following equation

ROUGE minus N 1113936 S isin REFERENCE SUMMARIES 1113936 gramn isin Countmatch gramn( 1113857

1113936 S isin REFERENCE SUMMARIES 1113936 gramn isin Count gramn( 1113857 (1)

where S is the reference summary n is the n-gram lengthand Countmatch (gramn) is the maximum number ofmatching n-gram words between the reference summaryand the generated summary Count (gramn) is the totalnumber of n-gram words in the reference summary [73]

ROUGE-L is the longest common subsequence (LCS)which represents the maximum length of the commonmatching words between the reference summary and the

generated summary LCS calculation does not necessarilyrequire the match words to be consecutive however theorder of occurrence is important In addition no predefinednumber of match words is required LCS considers only themain in-sequence which is one of its disadvantages since thefinal score will not include other matches For exampleassume that the reference summary R and the automaticsummary A are as follows

Gigaword CNNDailyMail

ACL DUC2004DUC2002 NewsroomNYT XSum

Datasets

Num

ber o

fre

sear

ches

02468

10121416

Datasets

Figure 17 e number of research papers that used the Gigaword CNNDaily Mail ACL DUC2002 DUC2004 NYT Newsroom andXSum datasets [61]

Table 4 Abstractive summarisation datasets

Reference Training Summarization Evaluation[18] Gigaword DUC2003 and DUC2004[39] Gigaword DUC2004[50] Gigaword Gigaword[29] Gigaword Articles from BBC e Wall Street Journal Guardian Huffington Post and Forbes[38] Gigaword mdash[54] Gigaword and DUC2004 Gigaword and DUC2004[51] ACL anthology reference ACL anthology reference[52] Gigaword and DUC2004 Gigaword and DUC2004[53] Gigaword and DUC2004 Gigaword and DUC2004[56] CNNDaily Mail CNNDaily Mail[57] CNNDaily and New York Times CNNDaily and New York Times[58] CNNDaily Mail CNNDaily Mail[30] CNNDaily Mail CNNDaily Mail[35] CNNDaily Mail CNNDaily Mail[59] CNNDaily Mail CNNDaily Mail[60] CNNDaily Mail CNNDaily Mail[61] CNNDaily Mail CNNDaily Mail[55] Gigaword DUC CNNDaily Mail Gigaword DUC CNNDaily Mail[62] CNNDaily Mail CNNDaily Mail[65] CNNDaily Mail NYT and XSum CNNDaily Mail NYT and XSum[63] CNNDaily Mail and DUC2002 CNNDaily Mail and DUC2002[64] CNNDaily Mail and Newsroom CNNDaily Mail and Newsroom[49] CNNDaily Mail CNNDaily Mail

Mathematical Problems in Engineering 21

R Ahmed ate the appleA the apple Ahmed ate

In this case ROUGE-L will consider either ldquoAhmed aterdquoor ldquothe applerdquo but not both similar to LCS

Tables 5 and 6 present the values of ROUGE1 ROUGE2and ROUGE-L for the text summarisation methods in thevarious studies reviewed in this research In addition Per-plexity was employed in [18 39 51] and BLEU was utilisedin [29] e models were evaluated using various datasetse other models applied ROUGE1 ROUGE2 andROUGE-L for evaluation It can be seen that the highestvalues of ROUGE1 ROUGE2 and ROUGE-L for textsummarisation with the pretrained encoder model were4385 2034 and 399 respectively [65] Even thoughROUGE was employed to evaluate abstractive summa-risation it is better to obtain new methods to evaluate thequality of summarisation e new evaluation metrics mustconsider novel words and semantics since the generatedsummary contains words that do not exist in the originaltext However ROUGE was very suitable for extractive textsummarisation

Based on our taxonomy we divided the results ofROUGE1 ROUGE2 and ROUGE-L into two groups efirst group considered single-sentence summary approacheswhile the second group considered multisentence summaryapproaches Figure 18 compares several deep learningtechniques in terms of ROUGE1 ROUGE2 and ROUGE-Lfor the Gigaword datasets consisting of single-sentencesummary documents e highest values for ROUGE1ROUGE2 and ROUGE-L were achieved by the RCTmodel[54] e values for ROUGE1 ROUGE2 and ROUGE-Lwere 3727 1819 and 3462 respectively

Furthermore Figure 19 compares the ROUGE1ROUGE2 and ROUGE-L values for abstractive text sum-marisation methods for the CNNDaily Mail datasets whichconsist of multisentence summary documents e highestvalues of ROUGE1 ROUGE2 and ROUGE-L were achievedfor text summarisation with a pretrained encoder modele values for ROUGE1 ROUGE2 and ROUGE-L were4385 2034 and 399 respectively [65] It can be clearly seenthat the best model in the single-sentence summary andmultisentence summary is the models that employed BERTword embedding and were based on transformers eROUGE values for the CNNDaily Mail datasets are largerthan those for the Gigaword dataset as Gigaword is utilisedfor single-sentence summaries as it contains headlines thatare treated as summaries while the CNNDailyMail datasetsare multisentence summaries us the summaries in theCNNDaily Mail datasets are longer than the summaries inGigaword

Liu et al selected two human elevators to evaluate thereadability of the generated summary of 50 test examples of 5models [58] e value of 5 indicates that the generatedsummary is highly readable while the value of 1 indicatesthat the generated summary has a low level of readability Itcan be clearly seen from the results that the Liu et al modelwas better than the other four models in terms of ROUGE1ROUGE2 and human evaluation even though the model is

not optimal with respect to the ROUGE-L value In additionto quantitative measures qualitative evaluationmeasures areimportant Kryscinski et al also performed qualitativeevaluation to evaluate the quality of the generated summary[60] Five human evaluators evaluated the relevance andreadability of 100 randomly selected test examples wheretwo values are utilised 1 and 10e value of 1 indicates thatthe generated summary is less readable and less relevancewhile the value of 10 indicates that the generated summary isreadable and very relevance e results showed that interms of readability the model proposed by Kryscinski et alis slightly inferior to See et al [56] and Liu et al [58] modelswith mean values of 635 676 and 679 for Kryscinski et alto See et al and Liu et al respectively On the other handwith respect to the relevance the means values of the threemodels are relevance with values of 663 673 and 674 forKryscinski et al to See et al and Liu et al respectivelyHowever the Kryscinski et al model was the best in terms ofROUGE1 ROUGE2 and ROUGE-L

Liu et al evaluated the quality of the generated summaryin terms of succinctness informativeness and fluency inaddition to measuring the level of retaining key informationwhich was achieved by human evaluation [65] In additionqualitative evaluation evaluated the output in terms ofgrammatical mistakes ree values were selected for eval-uating 20 test examples 1 indicates a correct answer 05indicates a partially correct answer and 0 indicates an in-correct answer We can conclude that quantitative evalua-tions which include ROUGE1 ROUGE2 and ROUGE-Lare not enough for evaluating the generated summary ofabstractive text summarisation especially when measuringreadability relevance and fluency erefore qualitativemeasures which can be achieved by manual evaluation arevery important However qualitative measures withoutquantitative measures are not enough due to the smallnumber of testing examples and evaluators

7 Challenges and Solutions

Text summarisation approaches have faced various chal-lenges although some have been solved others still need tobe addressed In this section these challenges and theirpossible solutions are discussed

71 Unavailability of the Golden Token during TestingDue to the availability of golden tokens (ie referencesummary tokens) during training previous tokens in theheadline can be input into the decoder at the next stepHowever during testing the golden tokens are not availablethus the input for the next step in the decoder will be limitedto the previously generated output word To solve this issuewhich becomes more challenging when addressing smalldatasets different solutions have been proposed For ex-ample in reference [51] the data-as-demonstrator (DaD)model [74] is utilised In DaD at each step based on a coinflip either a gold token is utilised during training or theprevious step is employed during both testing and trainingIn this manner at least the training step receives the same

22 Mathematical Problems in Engineering

input as testing In all cases the first input of the decoder isthe langEOSrang token and the same calculations are applied tocompute the loss In [29] teacher forcing is employed toaddress this challenge during training instead of feeding theexpected word from the headline 10 of the time thegenerated word of the previous step is fed back [75 76]

Moreover the mass convolution of the QRNN is applied in[50] since the dependency of words generated in the future isdifficult to determine

72Out-of-Vocabulary (OOV)Words One of the challengesthat may occur during testing is that the central words of thetest document may be rare or unseen during training thesewords are referred to as OOV words In 61[55 61] aswitching decoderpointer was employed to address OOVwords by using pointers to point to their original positions inthe source document e switch on the decoder side is usedto alternate between generating a word and using a pointeras shown in Figure 20 [55] When the switch is turned offthe decoder will use the pointer to point to the word in thesource to copy it to the memory When the switch is turnedon the decoder will generate a word from the target vo-cabularies Conversely researchers in [56] addressed OOVwords via probability generation Pgen where the value iscalculated from the context vector and decoder state asshown in Figure 21 To generate the output word Pgenswitches between copying the output words from the inputsequence and generating them from the vocabulary Fur-thermore the pointer-generator technique is applied topoint to input words to copy them e combination be-tween the words in the input and the vocabulary is referredto the extended vocabulary In addition in [57] to generatethe tokens on the decoder side the decoder utilised theswitch function at each timestep to switch between gener-ating the token using the softmax layer and using the pointermechanism to point to the input sequence position for

Table 5 Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset

Reference Year Authors Model ROUGE1 ROUGE2 ROUGE-L[18] 2015 Rush et al ABS+ 2818 849 2381[39] 2016 Chopra et al RAS-Elman (k 10) 2897 826 2406[55] 2016 Nallapati et al Words-lvt5k-1sent 2861 942 2524[52] 2017 Zhou et al SEASS 3615 1754 3363[53] 2018 Cao et al FTSumg 3727 1765 3424[54] 2019 Cai et al RCT 3727 1819 3462

Table 6 Evaluation measures of several abstractive text summarisation methods over the CNNDaily Mail datasets

Reference Year Authors Model ROUGE1 ROUGE2 ROUGE-L[55] 2016 Nallapati et al Words-lvt2k-temp-att 3546 1330 3265[56] 2017 See et al Pointer-generator + coverage 3953 1728 3638[57] 2017 Paulus et al Reinforcement learning with intra-attention 4116 1575 3908[57] 2017 Paulus et al Maximum-likelihood +RL with intra-attention 3987 1582 3690[58] 2018 Liu et al Adversarial network 3992 1765 3671[30] 2018 Song et al ATSDL 349 178 mdash[35] 2018 Al-Sabahi et al Bidirectional attentional encoder-decoder 426 188 385[59] 2018 Li et al Key information guide network 3895 1712 3568[60] 2018 Kryscinski et al ML+RL ROUGE+Novel with LM 4019 1738 3752[61] 2018 Yao et al DEATS 4085 1808 3713[62] 2018 Wan et al BiSum 3701 1595 3366[63] 2019 Wang et al BEAR (large +WordPiece) 4195 2026 3949[64] 2019 Egonmwan et al TRANS-ext + filter + abs 4189 189 3892[65] 2020 Liu et al BERTSUMEXT (large) 4385 2034 3990[49] 2020 Peng et al DAPT+ imp-coverage (RL+MLE (ss)) 4072 1828 3735

ROUGE1 ROUGE2 ROUGE-L0

5

10

15

20

25

30

35

40

ABS+RAS-Elman (k = 10)SEASS

Words-lvt5k-1sent (Gigaword)FTSumgRCT

Figure 18 ROUGE1 ROUGE2 and ROUGE-L scores of severaldeep learning abstractive text summarisation methods for theGigaword dataset

Mathematical Problems in Engineering 23

unseen tokens to copy them Moreover in [30] rare wordswere addressed by using the location of the phrase and theresulting summary was more natural Moreover in 35[35 58 60] the OOV problem was addressed by usingthe pointer-generator technique employed in [56] whichalternates between generating a new word and coping theword from the original input text

73 Summary Sentence Repetition and Inaccurate InformationSummary e repetition of phrases and generation of in-coherent phrases in the generated output summary are two

challenges that must be considered Both challenges are dueto the summarisation of long documents and the productionof long summaries using the attention-based encoder-de-coder RNN [57] In [35 56] repetition was addressed byusing the coverage model to create the coverage vector byaggregating the attention over all previous timesteps In [57]repetition was addressed by using the key attention mech-anism where for each input token the encoder intra-temporal attention records the weights of the previousattention Furthermore the intratemporal attention uses thehidden states of the decoder at a certain timestep thepreviously generated words and the specific part of theencoded input sequence as shown in Figure 22 to preventrepetition and attend to the same sequence of the input at adifferent step of the decoder However the intra-attentionencoder mechanism cannot address all the repetitionchallenges especially when a long sequence is generatedus the intradecoder attention mechanism was proposedto allow the decoder to consider more previously generatedwords Moreover the proposed intradecoder attentionmechanism is applicable to any type of the RNN decoderRepetition was also addressed by using an objective functionthat combines the cross-entropy loss maximum likelihoodand gradient reinforcement learning to minimise the ex-posure bias In addition the probability of trigram p (yt) wasproposed to address repetition in the generated summarywhere yt is the trigram sequence In this case the value ofp (yt) is 0 during a beam search in the decoder when thesame trigram sequence was already generated in the outputsummary Furthermore in [60] the heuristic proposed by[57] was employed to reduce repetition in the summaryMoreover in [61] the proposed approach addressed repe-tition by exploiting the encoding features generated using asecondary encoder to remember the previously generateddecoder output and the coverage mechanism is utilised

74 Fake Facts Abstractive summarisation may generatesummaries with fake facts and 30 of summaries generatedfrom abstractive text summarisation suffer from thisproblem [53] With fake facts there may be a mismatchbetween the subject and the object of the predicates us toaddress this problem dependency parsing and open in-formation extraction (eg open information extraction(OpenIE)) are performed to extract facts

erefore the sequence-to-sequence framework withdual attention was proposed where the generated summarywas conditioned by the input text and description of theextracted facts OpenIE facilitates entity extraction from arelation and Stanford CoreNLP was employed to providethe proposed approach with OpenIE and the dependencyparser Moreover the decoder utilised copying and coveragemechanisms

75 Other Challenges e main issue of the abstractive textsummarisation dataset is the quality of the reference sum-mary (Golden summary) In the CNNDaily Mail datasetthe reference summary is the highlight of the news Everyhighlight represents a sentence in the summary therefore

0

5

10

15

20

25

30

35

40

45

50

ROUGE1 ROUGE2 ROUGE-L

Pointer-generator + coverageReinforcement learning with intra-attentionMaximum-likelihood + RL with intra-attentionAdversarial networkATSDLBidirectional attentional encoder-decoderKey information guide networkML + RL ROUGE + novel with LMDEATSWords-lvt2k-temp-att (CNNDaily Mail)BiSumBERTSUMEXT (large)BEAR (large + wordPiece)TRANS-ext + filter + absDAPT + imp-coverage (RL + MLE (ss))

Figure 19 ROUGE1 ROUGE2 and ROUGE-L scores of ab-stractive text summarisation methods for the CNNDaily Maildatasets

24 Mathematical Problems in Engineering

the number of sentences in the summary is equal to thenumber of highlights Sometimes the highlights do notaddress all crucial points in the summary erefore a high-quality dataset needs high effort to become availableMoreover in some languages such as Arabic the multi-sentence dataset for abstractive summarisation is notavailable Single-sentence abstractive Arabic text summa-risation is available but is not free

Another issue of abstractive summarisation is the use ofROUGE for evaluation ROUGE provides reasonable resultsin the case of extractive summarisation However in ab-stractive summarisation ROUGE is not enough as ROUGEdepends on exact matching between words For example the

words book and books are considered different using anyone of the ROUGE metrics erefore a new evaluationmeasure must be proposed to consider the context of thewords (words that have the same meaning must be con-sidered the same even if they have a different surface form)In this case we propose to use METEOR which was usedrecently in evaluating machine translation and automaticsummarisation models [77] Moreover METEOR considersstemming morphological variants and synonyms In ad-dition in flexible order language it is better to use ROUGEwithout caring about the order of the words

e quality of the generated summary can be improvedusing linguistic features For example we proposed the use

Encoder hidden states Decoder hidden states

Word1 Word2 Word4 Word5 Word6 Word7Word3

G P P G G

Figure 20 e generatorpointer switching model [55]

Word1 Word2 Word4 Word5Word3 Word6 Word7

Context vector

helliphellip

Atte

ntio

ndi

strib

utio

nEn

code

r hid

den

state

s

Word_Sum1ltStartgt

helliphellip

Dec

oder

hid

den

state

sVo

cabu

lary

distr

ibut

ion

X (1 ndash Pgen) X Pgen

Pgen

Figure 21 Pointer-generator model [56]

Mathematical Problems in Engineering 25

of dependency parsing at the encoder in a separate layer atthe top of the first hidden state layer We proposed the use ofthe word embedding which was built by considering thedependency parsing or part-of-speech tagging At the de-coder side the beam-search quality can be improved byconsidering the part-of-speech tagging of the words and itssurrounding words

Based on the new trends and evaluation results we thinkthat the most promising feature among all the features is theuse of the BERTpretrained model e quality of the modelsthat are based on the transformer is high and will yieldpromising results

8 Conclusion and Discussion

In recent years due to the vast quantity of data available onthe Internet the importance of the text summarisationprocess has increased Text summarisation can be dividedinto extractive and abstractive methods An extractive textsummarisationmethod generates a summary that consists ofwords and phrases from the original text based on linguisticsand statistical features while an abstractive text summa-risation method rephrases the original text to generate asummary that consists of novel phrases is paper reviewedrecent approaches that applied deep learning for abstractivetext summarisation datasets and measures for evaluation ofthese approaches Moreover the challenges encounteredwhen employing various approaches and their solutionswere discussed and analysed e overview of the reviewedapproaches yielded several conclusions e RNN and at-tention mechanism were the most commonly employeddeep learning techniques Some approaches applied LSTMto solve the gradient vanishing problem that was encoun-tered when using an RNN while other approaches applied aGRU Additionally the sequence-to-sequence model wasutilised for abstractive summarisation Several datasets wereemployed including Gigaword CNNDaily Mail and the

New York Times Gigaword was selected for single-sentencesummarisation and CNNDaily Mail was employed formultisentence summarisation Furthermore ROUGE1ROUGE2 and ROUGE-L were utilised to evaluate thequality of the summaries e experiments showed that thehighest values of ROUGE1 ROUGE2 and ROUGE-L wereobtained in text summarisation with a pretrained encodermode with values of 4385 2034 and 399 respectively ebest results were achieved by the models that applyTransformer e most common challenges faced during thesummarisation process were the unavailability of a goldentoken at testing time the presence of OOV words summarysentence repetition sentence inaccuracy and the presence offake facts In addition there are several issues that must beconsidered in abstractive summarisation including thedataset evaluation measures and quality of the generatedsummary

Data Availability

No data were used to support this study

Conflicts of Interest

e authors declare no conflicts of interest

References

[1] M Allahyari S Pouriyeh M Assefi et al ldquoText summari-zation techniques a brief surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 8 no 102017

[2] A B Al-Saleh and M E B Menai ldquoAutomatic Arabic textsummarization a surveyrdquo Artificial Intelligence Reviewvol 45 no 2 pp 203ndash234 2016

[3] A Turpin Y Tsegay D Hawking and H E Williams ldquoFastgeneration of result snippets in web searchrdquo in Proceedings ofthe 30th Annual international ACM SIGIR Conference on

Word1 Word2 Word4 Word5Word3 Word6 Word7helliphellip

Encoder

+

Word_Sum1ltStartgt Word_

Sum2

+

HC C

Decoder

helliphellip

Figure 22 A new word is added to the output sequence by combining the current hidden state ldquoHrdquo of the decoder and the two contextvectors marked as ldquoCrdquo [57]

26 Mathematical Problems in Engineering

Research and Development in information Retrieval-SIGIRrsquo07p 127 Amsterdam e Netherlands 2007

[4] E D Trippe ldquoA vision for health informatics introducing theSKED framework an extensible architecture for scientificknowledge extraction from datardquo 2017 httparxivorgabs170607992

[5] S Syed Abstractive Summarization of Social Media Posts Acase Study using Deep Learning Masterrsquos thesis BauhausUniversity Weimar Germany 2017

[6] D Suleiman and A A Awajan ldquoDeep learning based ex-tractive text summarization approaches datasets and eval-uation measuresrdquo in Proceedings of the 2019 SixthInternational Conference on Social Networks Analysis Man-agement and Security (SNAMS) pp 204ndash210 Granada Spain2019

[7] Q A Al-Radaideh and D Q Bataineh ldquoA hybrid approachfor Arabic text summarization using domain knowledge andgenetic algorithmsrdquo Cognitive Computation vol 10 no 4pp 651ndash669 2018

[8] C Sunitha A Jaya and A Ganesh ldquoA study on abstractivesummarization techniques in Indian languagesrdquo ProcediaComputer Science vol 87 pp 25ndash31 2016

[9] D R Radev E Hovy and K McKeown ldquoIntroduction to thespecial issue on summarizationrdquo Computational Linguisticsvol 28 no 4 pp 399ndash408 2002

[10] A Khan and N Salim ldquoA review on abstractive summari-zation methodsrdquo Journal of eoretical and Applied Infor-mation Technology vol 59 no 1 pp 64ndash72 2014

[11] N Moratanch and S Chitrakala ldquoA survey on abstractive textsummarizationrdquo in Proceedings of the 2016 InternationalConference on Circuit Power and Computing Technologies(ICCPCT) pp 1ndash7 Nagercoil India 2016

[12] S Shimpikar and S Govilkar ldquoA survey of text summarizationtechniques for Indian regional languagesrdquo InternationalJournal of Computer Applications vol 165 no 11 pp 29ndash332017

[13] N R Kasture N Yargal N N Singh N Kulkarni andV Mathur ldquoA survey on methods of abstractive text sum-marizationrdquo International Journal for Research in EmergingScience andTechnology vol 1 no 6 p 5 2014

[14] P Kartheek Rachabathuni ldquoA survey on abstractive sum-marization techniquesrdquo in Proceedings of the 2017 Interna-tional Conference on Inventive Computing and Informatics(ICICI) pp 762ndash765 Coimbatore 2017

[15] S Yeasmin P B Tumpa A M Nitu E Ali and M I AfjalldquoStudy of abstractive text summarization techniquesrdquoAmerican Journal of Engineering Research vol 8 2017

[16] A Khan N Salim H Farman et al ldquoAbstractive textsummarization based on improved semantic graph ap-proachrdquo International Journal of Parallel Programmingvol 46 no 5 pp 992ndash1016 2018

[17] Y Jaafar and K Bouzoubaa ldquoTowards a new hybrid approachfor abstractive summarizationrdquo Procedia Computer Sciencevol 142 pp 286ndash293 2018

[18] A M Rush S Chopra and J Weston ldquoA neural attentionmodel for abstractive sentence summarizationrdquo in Proceed-ings of the 2015 Conference on Empirical Methods in NaturalLanguage Processing Lisbon Portugal 2015

[19] N Raphal H Duwarah and P Daniel ldquoSurvey on abstractivetext summarizationrdquo in Proceedings of the 2018 InternationalConference on Communication and Signal Processing (ICCSP)pp 513ndash517 Chennai 2018

[20] Y Dong ldquoA survey on neural network-based summarizationmethodsrdquo 2018 httparxivorgabs180404589

[21] A Mahajani V Pandya I Maria and D Sharma ldquoA com-prehensive survey on extractive and abstractive techniques fortext summarizationrdquo in Ambient Communications andComputer Systems Y-C Hu S Tiwari K K Mishra andM C Trivedi Eds vol 904 pp 339ndash351 Springer Singapore2019

[22] T Shi Y Keneshloo N Ramakrishnan and C K ReddyNeural Abstractive Text Summarization with Sequence-To-Sequence Models A Survey httparxivorgabs1812023032020

[23] A Joshi E Fidalgo E Alegre and U de Leon ldquoDeep learningbased text summarization approaches databases and eval-uation measuresrdquo in Proceedings of the International Con-ference of Applications of Intelligent Systems Spain 2018

[24] Y LeCun Y Bengio and G Hinton ldquoDeep learningrdquoNaturevol 521 no 7553 pp 436ndash444 2015

[25] D Suleiman A Awajan andW Al Etaiwi ldquoe use of hiddenMarkov model in natural Arabic language processing asurveyrdquo Procedia Computer Science vol 113 pp 240ndash2472017

[26] H Wang and D Zeng ldquoFusing logical relationship infor-mation of text in neural network for text classificationrdquoMathematical Problems in Engineering vol 2020 pp 1ndash162020

[27] J Yi Y Zhang X Zhao and J Wan ldquoA novel text clusteringapproach using deep-learning vocabulary networkrdquo Mathe-matical Problems in Engineering vol 2017 pp 1ndash13 2017

[28] T Young D Hazarika S Poria and E Cambria ldquoRecenttrends in deep learning based natural language processing[review article]rdquo IEEE Computational Intelligence Magazinevol 13 no 3 pp 55ndash75 2018

[29] K Lopyrev Generating news headlines with recurrent neuralnetworks p 9 2015 httpsarxivorgabs151201712

[30] S Song H Huang and T Ruan ldquoAbstractive text summa-rization using LSTM-CNN Based Deep Learningrdquo Multi-media Tools and Applications 2018

[31] C L Giles G M Kuhn and R J Williams ldquoDynamic re-current neural networks theory and applicationsrdquo IEEETransactions on Neural Networks vol 5 no 2 pp 153ndash1561994

[32] A J Robinson ldquoAn application of recurrent nets to phoneprobability estimationrdquo IEEE Transactions on Neural Net-works vol 5 no 2 pp 298ndash305 1994

[33] D Bahdanau K Cho and Y Bengio ldquoNeural machinetranslation by jointly learning to align and translaterdquo inProceedings of the International Conference on LearningRepresentations Canada 2014 httparxivorgabs14090473

[34] M Schuster and K K Paliwal ldquoBidirectional recurrent neuralnetworksrdquo IEEE Transactions on Signal Processing vol 45no 11 pp 2673ndash2681 Nov 1997

[35] K Al-Sabahi Z Zuping and Y Kang Bidirectional Atten-tional Encoder-Decoder Model and Bidirectional Beam Searchfor Abstractive Summarization Cornell University IthacaNY USA 2018 httparxivorgabs180906662

[36] S Hochreiter and J Schmidhuber ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[37] K Cho ldquoLearning phrase representations using RNNencoderndashdecoder for statistical machine translationrdquo inProceedings of the 2014 Conference on Empirical Methods inNatural Language Processing (EMNLP) pp 1724ndash1734 DohaQatar 2014

[38] E Jobson and A Gutierrez Abstractive Text SummarizationUsing Attentive Sequence-To-Sequence RNNs p 8 2016

Mathematical Problems in Engineering 27

[39] S Chopra M Auli and A M Rush ldquoAbstractive sentencesummarization with attentive recurrent neural networksrdquo inProceedings of the NAACL-HLT16 pp 93ndash98 San Diego CAUSA 2016

[40] C Sun L Lv G Tian Q Wang X Zhang and L GuoldquoLeverage label and word embedding for semantic sparse webservice discoveryrdquo Mathematical Problems in Engineeringvol 2020 Article ID 5670215 8 pages 2020

[41] T Mikolov K Chen G Corrado and J Dean ldquoEfficientestimation of word representations in vector spacerdquo 2013httparxivorgabs13013781

[42] D Suleiman A Awajan and N Al-Madi ldquoDeep learningbased technique for Plagiarism detection in Arabic textsrdquo inProceedings of the 2017 International Conference on NewTrends in Computing Sciences (ICTCS) pp 216ndash222 AmmanJordan 2017

[43] D Suleiman and A Awajan ldquoComparative study of wordembeddings models and their usage in Arabic language ap-plicationsrdquo in Proceedings of the 2018 International ArabConference on Information Technology (ACIT) pp 1ndash7Werdanye Lebanon 2018

[44] J Pennington R Socher and C Manning ldquoGlove globalvectors for word representationrdquo in Proceedings of the 2014Conference on Empirical Methods in Natural Language Pro-cessing (EMNLP) pp 1532ndash1543 Doha Qatar 2014

[45] D Suleiman and A A Awajan ldquoUsing part of speech taggingfor improving Word2vec modelrdquo in Proceedings of the 20192nd International Conference on new Trends in ComputingSciences (ICTCS) pp 1ndash7 Amman Jordan 2019

[46] A Joulin E Grave P Bojanowski M Douze H Jegou andT Mikolov ldquoFastTextzip compressing text classificationmodelsrdquo 2016 httparxivorgabs161203651

[47] A Vaswani N Shazeer N Parmar et al ldquoAttention is all youneedrdquo Advances in Neural Information Processing Systemspp 5998ndash6008 2017

[48] J Devlin M-W Chang K Lee and K Toutanova ldquoPre-training of deep bidirectional transformers for languageunderstandingrdquo in Proceedings of the 2019 Conference of theNorth American Chapter of the Association for ComputationalLinguistics Human Language Technologies pp 4171ndash4186Minneapolis MN USA 2019

[49] Z Li Z Peng S Tang C Zhang and H Ma ldquoText sum-marization method based on double attention pointer net-workrdquo IEEE Access vol 8 pp 11279ndash11288 2020

[50] J Bradbury S Merity C Xiong and R Socher Quasi-re-current neural networks httpsarxivorgabs1611015762015

[51] U Khandelwal P Qi and D Jurafsky Neural Text Sum-marization Stanford University Stanford CA USA 2016

[52] Q Zhou N Yang F Wei and M Zhou ldquoSelective encodingfor abstractive sentence summarizationrdquo in Proceedings of the55th Annual Meeting of the Association for ComputationalLinguistics pp 1095ndash1104 Vancouver Canada July 2017

[53] Z Cao F Wei W Li and S Li ldquoFaithful to the original factaware neural abstractive summarizationrdquo in Proceedings of theAAAI Conference on Artificial Intelligence (AAAI) NewOrleans LA USA February 2018

[54] T Cai M Shen H Peng L Jiang and Q Dai ldquoImprovingtransformer with sequential context representations for ab-stractive text summarizationrdquo inNatural Language Processingand Chinese Computing J Tang M-Y Kan D Zhao S Liand H Zan Eds pp 512ndash524 Springer International Pub-lishing Cham Switzerland 2019

[55] R Nallapati B Zhou C N dos Santos C Gulcehre andB Xiang ldquoAbstractive text summarization using sequence-to-sequence RNNs and beyondrdquo in Proceedings of the CoNLL-16Berlin Germany August 2016

[56] A See P J Liu and C D Manning ldquoGet to the pointsummarization with pointer-generator networksrdquo in Pro-ceedings of the 55th ACL pp 1073ndash1083 Vancouver Canada2017


R: Ahmed ate the apple
A: the apple Ahmed ate

In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" but not both, since it is based on the longest common subsequence (LCS).
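The following is a minimal, self-contained Python sketch (the function names lcs_length and rouge_l_f1 are ours, not taken from any ROUGE package) that reproduces this behaviour by computing the longest common subsequence and an LCS-based F-score for the example above.

def lcs_length(ref_tokens, cand_tokens):
    # Dynamic-programming table for the length of the longest common subsequence.
    m, n = len(ref_tokens), len(cand_tokens)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

def rouge_l_f1(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    # The official ROUGE-L uses a recall-weighted F-measure; a plain F1 is
    # shown here to keep the example short.
    return 2 * precision * recall / (precision + recall)

# R is the reference summary and A is the system summary from the example above.
print(rouge_l_f1("Ahmed ate the apple", "the apple Ahmed ate"))  # 0.5, since the LCS has length 2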

Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, Perplexity was employed in [18, 39, 51], and BLEU was utilised in [29]; the models were evaluated using various datasets. The other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with the pretrained encoder model: 43.85, 20.34, and 39.90, respectively [65]. Even though ROUGE was employed to evaluate abstractive summarisation, it would be better to develop new methods for evaluating summary quality. New evaluation metrics must consider novel words and semantics, since the generated summary contains words that do not exist in the original text. ROUGE, however, is very suitable for extractive text summarisation.

Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups. The first group considered single-sentence summary approaches, while the second group considered multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword dataset, which consists of single-sentence summary documents. The highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54], with values of 37.27, 18.19, and 34.62, respectively.

Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values of abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. The highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively [65]. It can be clearly seen that the best models for both the single-sentence summary and the multisentence summary are those that employ BERT word embedding and are based on transformers. The ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset: Gigaword is utilised for single-sentence summaries, as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets contain multisentence summaries. Thus, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.

Liu et al. selected two human evaluators to assess the readability of the summaries generated by 5 models for 50 test examples [58]. A value of 5 indicates that the generated summary is highly readable, while a value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryscinski et al. also performed a qualitative evaluation to assess the quality of the generated summary [60]. Five human evaluators rated the relevance and readability of 100 randomly selected test examples on a scale from 1 to 10, where a value of 1 indicates that the generated summary is less readable and less relevant, while a value of 10 indicates that the generated summary is readable and very relevant. The results showed that, in terms of readability, the model proposed by Kryscinski et al. is slightly inferior to the See et al. [56] and Liu et al. [58] models, with mean values of 6.35, 6.76, and 6.79 for Kryscinski et al., See et al., and Liu et al., respectively. With respect to relevance, the mean values of the three models are 6.63, 6.73, and 6.74 for Kryscinski et al., See et al., and Liu et al., respectively. However, the Kryscinski et al. model was the best in terms of ROUGE1, ROUGE2, and ROUGE-L.

Liu et al. evaluated the quality of the generated summaries in terms of succinctness, informativeness, and fluency, in addition to measuring the level of retained key information, which was achieved by human evaluation [65]. In addition, a qualitative evaluation assessed the output in terms of grammatical mistakes. Three values were used for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summaries of abstractive text summarisation, especially when measuring readability, relevance, and fluency. Therefore, qualitative measures, which can be obtained by manual evaluation, are very important. However, qualitative measures without quantitative measures are also not enough, due to the small number of test examples and evaluators.

7. Challenges and Solutions

Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

7.1. Unavailability of the Golden Token during Testing. Due to the availability of golden tokens (i.e., reference summary tokens) during training, the previous tokens of the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder is limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step and based on a coin flip, either the gold token is fed, as during training, or the output of the previous step is fed, as during both testing and training. In this manner, at least the training step receives the same input as testing.


In all cases, the first input of the decoder is the ⟨EOS⟩ token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of always feeding the expected word from the headline, 10% of the time the generated word of the previous step is fed back [75, 76].
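To make the coin-flip idea concrete, the following is a minimal Python sketch of a scheduled-sampling style decoding loop. It is an illustration of the idea rather than the implementation used in the cited works, and the decoder_step(prev_token, state) interface is an assumption of this sketch.

import random

def decode_with_scheduled_sampling(decoder_step, init_state, gold_tokens,
                                   sampling_prob=0.1, eos_token="<EOS>"):
    """Decoding loop that occasionally feeds back the model's own prediction.

    decoder_step(prev_token, state) -> (predicted_token, new_state) is an
    assumed interface; gold_tokens are the reference-summary tokens that are
    only available at training time.
    """
    state = init_state
    prev_token = eos_token          # the first decoder input, as described above
    outputs = []
    for gold in gold_tokens:
        predicted, state = decoder_step(prev_token, state)
        outputs.append(predicted)
        # Coin flip: with probability sampling_prob feed the model's own
        # prediction (as happens at test time), otherwise feed the gold token.
        prev_token = predicted if random.random() < sampling_prob else gold
    return outputs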

Moreover, the mass convolution of the QRNN is applied in [50], since the dependency of words generated in the future is difficult to determine.

7.2. Out-of-Vocabulary (OOV) Words. One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In [55, 61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. The switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder uses the pointer to point to the word in the source and copy it to the output; when the switch is turned on, the decoder generates a word from the target vocabularies. Conversely, researchers in [56] addressed OOV words via the generation probability Pgen, whose value is calculated from the context vector and decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words in order to copy them. The combination of the words in the input and the vocabulary is referred to as the extended vocabulary.

Table 5: Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset.

Reference  Year  Authors           Model                ROUGE1  ROUGE2  ROUGE-L
[18]       2015  Rush et al.       ABS+                 28.18   8.49    23.81
[39]       2016  Chopra et al.     RAS-Elman (k = 10)   28.97   8.26    24.06
[55]       2016  Nallapati et al.  Words-lvt5k-1sent    28.61   9.42    25.24
[52]       2017  Zhou et al.       SEASS                36.15   17.54   33.63
[53]       2018  Cao et al.        FTSumg               37.27   17.65   34.24
[54]       2019  Cai et al.        RCT                  37.27   18.19   34.62

Table 6: Evaluation measures of several abstractive text summarisation methods over the CNN/Daily Mail datasets.

Reference  Year  Authors            Model                                          ROUGE1  ROUGE2  ROUGE-L
[55]       2016  Nallapati et al.   Words-lvt2k-temp-att                           35.46   13.30   32.65
[56]       2017  See et al.         Pointer-generator + coverage                   39.53   17.28   36.38
[57]       2017  Paulus et al.      Reinforcement learning with intra-attention    41.16   15.75   39.08
[57]       2017  Paulus et al.      Maximum-likelihood + RL with intra-attention   39.87   15.82   36.90
[58]       2018  Liu et al.         Adversarial network                            39.92   17.65   36.71
[30]       2018  Song et al.        ATSDL                                          34.9    17.8    —
[35]       2018  Al-Sabahi et al.   Bidirectional attentional encoder-decoder      42.6    18.8    38.5
[59]       2018  Li et al.          Key information guide network                  38.95   17.12   35.68
[60]       2018  Kryscinski et al.  ML + RL ROUGE + Novel with LM                  40.19   17.38   37.52
[61]       2018  Yao et al.         DEATS                                          40.85   18.08   37.13
[62]       2018  Wan et al.         BiSum                                          37.01   15.95   33.66
[63]       2019  Wang et al.        BEAR (large + WordPiece)                       41.95   20.26   39.49
[64]       2019  Egonmwan et al.    TRANS-ext + filter + abs                       41.89   18.9    38.92
[65]       2020  Liu et al.         BERTSUMEXT (large)                             43.85   20.34   39.90
[49]       2020  Peng et al.        DAPT + imp-coverage (RL + MLE (ss))            40.72   18.28   37.35

Figure 18: ROUGE1, ROUGE2, and ROUGE-L scores of several deep learning abstractive text summarisation methods for the Gigaword dataset (ABS+, RAS-Elman (k = 10), SEASS, Words-lvt5k-1sent, FTSumg, and RCT).


In addition, in [57], to generate the tokens on the decoder side, the decoder utilised a switch function at each timestep to alternate between generating the token using the softmax layer and using the pointer mechanism to point to the position of unseen tokens in the input sequence in order to copy them. Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. Furthermore, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying the word from the original input text.
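A minimal sketch of how the copy and generation distributions are mixed in a pointer-generator, in the spirit of [56], is given below; the array names and toy numbers are ours, and a real model computes Pgen and the attention weights from learned parameters.

import numpy as np

def final_distribution(p_vocab, attention, source_ids, p_gen, extended_size):
    """Mix the generation and copy distributions as in a pointer-generator.

    p_vocab: probabilities over the fixed vocabulary
    attention: attention weights over the source positions (sums to 1)
    source_ids: id of each source token in the extended vocabulary
    p_gen: scalar in [0, 1] computed from the context vector and decoder state
    extended_size: fixed vocabulary size plus the source's OOV words
    """
    p_final = np.zeros(extended_size)
    p_final[: len(p_vocab)] = p_gen * p_vocab
    for pos, token_id in enumerate(source_ids):
        # Copy probability mass flows to the source token's (possibly OOV) slot.
        p_final[token_id] += (1.0 - p_gen) * attention[pos]
    return p_final

# Toy example: a vocabulary of 5 words and a 3-word source whose last word is OOV.
p = final_distribution(p_vocab=np.array([0.4, 0.3, 0.1, 0.1, 0.1]),
                       attention=np.array([0.2, 0.3, 0.5]),
                       source_ids=[1, 4, 5],   # id 5 is the OOV slot
                       p_gen=0.8, extended_size=6)
print(p.sum())  # ≈ 1.0, a valid distribution over the extended vocabulary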

7.3. Summary Sentence Repetition and Inaccurate Information in the Summary. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered.

Both challenges arise when summarising long documents and producing long summaries with the attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model to create a coverage vector that aggregates the attention over all previous timesteps. In [57], repetition was addressed by using a key attention mechanism in which, for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and to avoid attending to the same part of the input at different decoder steps. However, the intra-attention encoder mechanism cannot address all the repetition challenges, especially when a long sequence is generated. Thus, the intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words; this mechanism is applicable to any type of RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy maximum-likelihood loss and policy-gradient reinforcement learning to minimise exposure bias. In addition, the trigram probability p(yt) was proposed to address repetition in the generated summary, where yt is a trigram sequence: p(yt) is set to 0 during the beam search in the decoder whenever the same trigram sequence has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated by a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.
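The trigram rule from [57] can be stated in a few lines. The following Python sketch (the function name is ours) checks whether appending a candidate token to a beam hypothesis would repeat an already generated trigram, in which case the candidate's probability is set to 0, i.e., the candidate is discarded.

def blocks_repeated_trigram(generated, candidate_token):
    """Return True if appending candidate_token would repeat an existing trigram.

    generated is the list of tokens produced so far for one beam hypothesis;
    during beam search, a candidate whose trigram already occurred receives
    probability 0, which is equivalent to discarding it.
    """
    if len(generated) < 2:
        return False
    new_trigram = tuple(generated[-2:]) + (candidate_token,)
    seen = {tuple(generated[i:i + 3]) for i in range(len(generated) - 2)}
    return new_trigram in seen

# Example: extending "the cat sat on the cat" with "sat" would repeat "the cat sat".
print(blocks_repeated_trigram("the cat sat on the cat".split(), "sat"))  # True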

7.4. Fake Facts. Abstractive summarisation may generate summaries with fake facts, and 30% of the summaries generated by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. Thus, to address this problem, dependency parsing and open information extraction (e.g., OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, where the generated summary is conditioned on both the input text and the descriptions of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.
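A rough sketch of the dual-attention idea follows, assuming NumPy arrays for the encoder outputs of the source text and of the fact descriptions; the gate that merges the two context vectors is a learned quantity in the actual model and is fixed here only for illustration.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_attention_context(decoder_state, text_states, fact_states, gate=0.5):
    """Build a merged context vector from the source text and the extracted facts.

    text_states / fact_states: (length, hidden) encoder outputs for the article
    and for the OpenIE-style fact descriptions; gate weights the two context
    vectors (learned in a real model, fixed here for the sketch).
    """
    text_ctx = softmax(text_states @ decoder_state) @ text_states
    fact_ctx = softmax(fact_states @ decoder_state) @ fact_states
    return gate * text_ctx + (1.0 - gate) * fact_ctx

# Toy shapes: 6 source positions, 3 fact tokens, hidden size 4.
ctx = dual_attention_context(np.ones(4),
                             np.random.rand(6, 4),
                             np.random.rand(3, 4))
print(ctx.shape)  # (4,)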

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference summary (golden summary). In the CNN/Daily Mail dataset, the reference summary is composed of the highlights of the news article.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods for the CNN/Daily Mail datasets (pointer-generator + coverage, reinforcement learning with intra-attention, maximum-likelihood + RL with intra-attention, adversarial network, ATSDL, bidirectional attentional encoder-decoder, key information guide network, ML + RL ROUGE + Novel with LM, DEATS, Words-lvt2k-temp-att, BiSum, BERTSUMEXT (large), BEAR (large + WordPiece), TRANS-ext + filter + abs, and DAPT + imp-coverage (RL + MLE (ss))).


Every highlight represents a sentence in the summary; therefore, the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not address all of the crucial points, so considerable effort is needed to make a high-quality dataset available. Moreover, in some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset exists but is not free.

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough, as it depends on exact matching between words.

For example, the words "book" and "books" are considered different by any of the ROUGE metrics. Therefore, a new evaluation measure that considers the context of the words must be proposed: words that have the same meaning must be considered the same even if they have different surface forms. In this case, we propose the use of METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for flexible-word-order languages, it is better to use ROUGE without considering the order of the words.
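The following toy Python example illustrates the point: a ROUGE-1 style unigram recall treats "book" and "books" as a mismatch, while the same computation after a (deliberately naive) stemming step does not. A real evaluation would use a proper stemmer such as the Porter stemmer, which is what METEOR-style matching relies on.

def unigram_recall(reference, candidate, normalise=lambda w: w):
    # ROUGE-1 style recall: fraction of reference unigrams found in the candidate.
    ref = [normalise(w) for w in reference.lower().split()]
    cand = [normalise(w) for w in candidate.lower().split()]
    matches = sum(1 for w in cand if w in ref)
    return matches / len(ref)

naive_stem = lambda w: w[:-1] if w.endswith("s") else w   # toy stemmer for illustration

ref = "he read two books"
cand = "he read two book"
print(unigram_recall(ref, cand))                        # 0.75: "books" vs "book" is a miss
print(unigram_recall(ref, cand, normalise=naive_stem))  # 1.0 once morphology is normalised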

The quality of the generated summary can be improved using linguistic features.

Figure 20: The generator/pointer switching model [55].

Figure 21: The pointer-generator model [56], in which the final distribution combines the vocabulary distribution, weighted by Pgen, with the attention distribution over the encoder hidden states, weighted by (1 − Pgen).


For example, we propose the use of dependency parsing at the encoder in a separate layer on top of the first hidden-state layer, and the use of word embeddings that are built by considering dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tags of each word and its surrounding words.
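As one possible illustration of the decoder-side idea (a sketch under our own assumptions, not an implementation from the reviewed papers), beam candidates could be rescored with a part-of-speech bigram score obtained from an off-the-shelf tagger such as spaCy; the weighting constant and the bigram score table are hypothetical.

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English spaCy model is installed

def pos_rerank(beam_candidates, pos_bigram_scores):
    """Rerank beam-search outputs with a part-of-speech bigram score.

    beam_candidates: list of (summary_text, model_log_prob) pairs
    pos_bigram_scores: dict mapping (tag, tag) -> score, e.g. estimated from
    reference summaries; the dict and the 0.1 weight are illustrative choices.
    """
    def pos_score(text):
        tags = [token.pos_ for token in nlp(text)]
        return sum(pos_bigram_scores.get(bigram, 0.0)
                   for bigram in zip(tags, tags[1:]))

    return sorted(beam_candidates,
                  key=lambda cand: cand[1] + 0.1 * pos_score(cand[0]),
                  reverse=True)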

Based on the new trends and the evaluation results, we think that the most promising direction is the use of the pretrained BERT model; the quality of the models that are based on the transformer is high and should yield promising results.

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that apply deep learning for abstractive text summarisation, the datasets, and the measures used to evaluate these approaches. Moreover, the challenges encountered when employing the various approaches and their solutions were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN and the attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that is encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times; Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.90, respectively; the best results were achieved by the models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, there are several issues that must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M. Allahyari, S. Pouriyeh, M. Assefi et al., "Text summarization techniques: a brief survey," International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, 2017.
[2] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.
[3] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams, "Fast generation of result snippets in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), p. 127, Amsterdam, The Netherlands, 2007.

Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors marked as "C" [57].


[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.
[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.
[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.
[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.
[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.
[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.
[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.
[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.
[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.
[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, India, 2017.
[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.
[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.
[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.
[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.
[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, India, 2018.
[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.
[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.
[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey, 2020, http://arxiv.org/abs/1812.02303.
[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.
[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.
[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.
[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.
[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
[29] K. Lopyrev, Generating News Headlines with Recurrent Neural Networks, p. 9, 2015, https://arxiv.org/abs/1512.01712.
[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.
[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.
[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. Cho, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.
[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT 2016, pp. 93–98, San Diego, CA, USA, 2016.
[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.
[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.
[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.
[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.
[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.
[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.
[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.
[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.
[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, Quasi-Recurrent Neural Networks, 2016, https://arxiv.org/abs/1611.01576.
[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.
[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.
[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.
[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.
[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of CoNLL-16, Berlin, Germany, August 2016.
[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.
[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.
[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.
[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.
[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.
[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.
[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.
[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.
[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.
[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.
[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.
[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.
[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.
[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.
[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.
[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.
[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.
[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.
[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.
[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.
[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.

Mathematical Problems in Engineering 29

Page 23: Deep Learning Based Abstractive Text Summarization ...downloads.hindawi.com/journals/mpe/2020/9365340.pdfDeep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation

input as testing In all cases the first input of the decoder isthe langEOSrang token and the same calculations are applied tocompute the loss In [29] teacher forcing is employed toaddress this challenge during training instead of feeding theexpected word from the headline 10 of the time thegenerated word of the previous step is fed back [75 76]

Moreover the mass convolution of the QRNN is applied in[50] since the dependency of words generated in the future isdifficult to determine

72Out-of-Vocabulary (OOV)Words One of the challengesthat may occur during testing is that the central words of thetest document may be rare or unseen during training thesewords are referred to as OOV words In 61[55 61] aswitching decoderpointer was employed to address OOVwords by using pointers to point to their original positions inthe source document e switch on the decoder side is usedto alternate between generating a word and using a pointeras shown in Figure 20 [55] When the switch is turned offthe decoder will use the pointer to point to the word in thesource to copy it to the memory When the switch is turnedon the decoder will generate a word from the target vo-cabularies Conversely researchers in [56] addressed OOVwords via probability generation Pgen where the value iscalculated from the context vector and decoder state asshown in Figure 21 To generate the output word Pgenswitches between copying the output words from the inputsequence and generating them from the vocabulary Fur-thermore the pointer-generator technique is applied topoint to input words to copy them e combination be-tween the words in the input and the vocabulary is referredto the extended vocabulary In addition in [57] to generatethe tokens on the decoder side the decoder utilised theswitch function at each timestep to switch between gener-ating the token using the softmax layer and using the pointermechanism to point to the input sequence position for

Table 5 Evaluation measures of several deep learning abstractive text summarisation methods over the Gigaword dataset

Reference Year Authors Model ROUGE1 ROUGE2 ROUGE-L[18] 2015 Rush et al ABS+ 2818 849 2381[39] 2016 Chopra et al RAS-Elman (k 10) 2897 826 2406[55] 2016 Nallapati et al Words-lvt5k-1sent 2861 942 2524[52] 2017 Zhou et al SEASS 3615 1754 3363[53] 2018 Cao et al FTSumg 3727 1765 3424[54] 2019 Cai et al RCT 3727 1819 3462

Table 6 Evaluation measures of several abstractive text summarisation methods over the CNNDaily Mail datasets

Reference Year Authors Model ROUGE1 ROUGE2 ROUGE-L[55] 2016 Nallapati et al Words-lvt2k-temp-att 3546 1330 3265[56] 2017 See et al Pointer-generator + coverage 3953 1728 3638[57] 2017 Paulus et al Reinforcement learning with intra-attention 4116 1575 3908[57] 2017 Paulus et al Maximum-likelihood +RL with intra-attention 3987 1582 3690[58] 2018 Liu et al Adversarial network 3992 1765 3671[30] 2018 Song et al ATSDL 349 178 mdash[35] 2018 Al-Sabahi et al Bidirectional attentional encoder-decoder 426 188 385[59] 2018 Li et al Key information guide network 3895 1712 3568[60] 2018 Kryscinski et al ML+RL ROUGE+Novel with LM 4019 1738 3752[61] 2018 Yao et al DEATS 4085 1808 3713[62] 2018 Wan et al BiSum 3701 1595 3366[63] 2019 Wang et al BEAR (large +WordPiece) 4195 2026 3949[64] 2019 Egonmwan et al TRANS-ext + filter + abs 4189 189 3892[65] 2020 Liu et al BERTSUMEXT (large) 4385 2034 3990[49] 2020 Peng et al DAPT+ imp-coverage (RL+MLE (ss)) 4072 1828 3735

ROUGE1 ROUGE2 ROUGE-L0

5

10

15

20

25

30

35

40

ABS+RAS-Elman (k = 10)SEASS

Words-lvt5k-1sent (Gigaword)FTSumgRCT

Figure 18 ROUGE1 ROUGE2 and ROUGE-L scores of severaldeep learning abstractive text summarisation methods for theGigaword dataset

Mathematical Problems in Engineering 23

unseen tokens to copy them Moreover in [30] rare wordswere addressed by using the location of the phrase and theresulting summary was more natural Moreover in 35[35 58 60] the OOV problem was addressed by usingthe pointer-generator technique employed in [56] whichalternates between generating a new word and coping theword from the original input text

73 Summary Sentence Repetition and Inaccurate InformationSummary e repetition of phrases and generation of in-coherent phrases in the generated output summary are two

challenges that must be considered Both challenges are dueto the summarisation of long documents and the productionof long summaries using the attention-based encoder-de-coder RNN [57] In [35 56] repetition was addressed byusing the coverage model to create the coverage vector byaggregating the attention over all previous timesteps In [57]repetition was addressed by using the key attention mech-anism where for each input token the encoder intra-temporal attention records the weights of the previousattention Furthermore the intratemporal attention uses thehidden states of the decoder at a certain timestep thepreviously generated words and the specific part of theencoded input sequence as shown in Figure 22 to preventrepetition and attend to the same sequence of the input at adifferent step of the decoder However the intra-attentionencoder mechanism cannot address all the repetitionchallenges especially when a long sequence is generatedus the intradecoder attention mechanism was proposedto allow the decoder to consider more previously generatedwords Moreover the proposed intradecoder attentionmechanism is applicable to any type of the RNN decoderRepetition was also addressed by using an objective functionthat combines the cross-entropy loss maximum likelihoodand gradient reinforcement learning to minimise the ex-posure bias In addition the probability of trigram p (yt) wasproposed to address repetition in the generated summarywhere yt is the trigram sequence In this case the value ofp (yt) is 0 during a beam search in the decoder when thesame trigram sequence was already generated in the outputsummary Furthermore in [60] the heuristic proposed by[57] was employed to reduce repetition in the summaryMoreover in [61] the proposed approach addressed repe-tition by exploiting the encoding features generated using asecondary encoder to remember the previously generateddecoder output and the coverage mechanism is utilised

74 Fake Facts Abstractive summarisation may generatesummaries with fake facts and 30 of summaries generatedfrom abstractive text summarisation suffer from thisproblem [53] With fake facts there may be a mismatchbetween the subject and the object of the predicates us toaddress this problem dependency parsing and open in-formation extraction (eg open information extraction(OpenIE)) are performed to extract facts

erefore the sequence-to-sequence framework withdual attention was proposed where the generated summarywas conditioned by the input text and description of theextracted facts OpenIE facilitates entity extraction from arelation and Stanford CoreNLP was employed to providethe proposed approach with OpenIE and the dependencyparser Moreover the decoder utilised copying and coveragemechanisms

75 Other Challenges e main issue of the abstractive textsummarisation dataset is the quality of the reference sum-mary (Golden summary) In the CNNDaily Mail datasetthe reference summary is the highlight of the news Everyhighlight represents a sentence in the summary therefore

0

5

10

15

20

25

30

35

40

45

50

ROUGE1 ROUGE2 ROUGE-L

Pointer-generator + coverageReinforcement learning with intra-attentionMaximum-likelihood + RL with intra-attentionAdversarial networkATSDLBidirectional attentional encoder-decoderKey information guide networkML + RL ROUGE + novel with LMDEATSWords-lvt2k-temp-att (CNNDaily Mail)BiSumBERTSUMEXT (large)BEAR (large + wordPiece)TRANS-ext + filter + absDAPT + imp-coverage (RL + MLE (ss))

Figure 19 ROUGE1 ROUGE2 and ROUGE-L scores of ab-stractive text summarisation methods for the CNNDaily Maildatasets

24 Mathematical Problems in Engineering

the number of sentences in the summary is equal to thenumber of highlights Sometimes the highlights do notaddress all crucial points in the summary erefore a high-quality dataset needs high effort to become availableMoreover in some languages such as Arabic the multi-sentence dataset for abstractive summarisation is notavailable Single-sentence abstractive Arabic text summa-risation is available but is not free

Another issue of abstractive summarisation is the use ofROUGE for evaluation ROUGE provides reasonable resultsin the case of extractive summarisation However in ab-stractive summarisation ROUGE is not enough as ROUGEdepends on exact matching between words For example the

words book and books are considered different using anyone of the ROUGE metrics erefore a new evaluationmeasure must be proposed to consider the context of thewords (words that have the same meaning must be con-sidered the same even if they have a different surface form)In this case we propose to use METEOR which was usedrecently in evaluating machine translation and automaticsummarisation models [77] Moreover METEOR considersstemming morphological variants and synonyms In ad-dition in flexible order language it is better to use ROUGEwithout caring about the order of the words

e quality of the generated summary can be improvedusing linguistic features For example we proposed the use

Encoder hidden states Decoder hidden states

Word1 Word2 Word4 Word5 Word6 Word7Word3

G P P G G

Figure 20 e generatorpointer switching model [55]

Word1 Word2 Word4 Word5Word3 Word6 Word7

Context vector

helliphellip

Atte

ntio

ndi

strib

utio

nEn

code

r hid

den

state

s

Word_Sum1ltStartgt

helliphellip

Dec

oder

hid

den

state

sVo

cabu

lary

distr

ibut

ion

X (1 ndash Pgen) X Pgen

Pgen

Figure 21 Pointer-generator model [56]

Mathematical Problems in Engineering 25

of dependency parsing at the encoder in a separate layer atthe top of the first hidden state layer We proposed the use ofthe word embedding which was built by considering thedependency parsing or part-of-speech tagging At the de-coder side the beam-search quality can be improved byconsidering the part-of-speech tagging of the words and itssurrounding words

Based on the new trends and evaluation results we thinkthat the most promising feature among all the features is theuse of the BERTpretrained model e quality of the modelsthat are based on the transformer is high and will yieldpromising results

8 Conclusion and Discussion

In recent years due to the vast quantity of data available onthe Internet the importance of the text summarisationprocess has increased Text summarisation can be dividedinto extractive and abstractive methods An extractive textsummarisationmethod generates a summary that consists ofwords and phrases from the original text based on linguisticsand statistical features while an abstractive text summa-risation method rephrases the original text to generate asummary that consists of novel phrases is paper reviewedrecent approaches that applied deep learning for abstractivetext summarisation datasets and measures for evaluation ofthese approaches Moreover the challenges encounteredwhen employing various approaches and their solutionswere discussed and analysed e overview of the reviewedapproaches yielded several conclusions e RNN and at-tention mechanism were the most commonly employeddeep learning techniques Some approaches applied LSTMto solve the gradient vanishing problem that was encoun-tered when using an RNN while other approaches applied aGRU Additionally the sequence-to-sequence model wasutilised for abstractive summarisation Several datasets wereemployed including Gigaword CNNDaily Mail and the

New York Times Gigaword was selected for single-sentencesummarisation and CNNDaily Mail was employed formultisentence summarisation Furthermore ROUGE1ROUGE2 and ROUGE-L were utilised to evaluate thequality of the summaries e experiments showed that thehighest values of ROUGE1 ROUGE2 and ROUGE-L wereobtained in text summarisation with a pretrained encodermode with values of 4385 2034 and 399 respectively ebest results were achieved by the models that applyTransformer e most common challenges faced during thesummarisation process were the unavailability of a goldentoken at testing time the presence of OOV words summarysentence repetition sentence inaccuracy and the presence offake facts In addition there are several issues that must beconsidered in abstractive summarisation including thedataset evaluation measures and quality of the generatedsummary

Data Availability

No data were used to support this study

Conflicts of Interest

e authors declare no conflicts of interest

References

[1] M Allahyari S Pouriyeh M Assefi et al ldquoText summari-zation techniques a brief surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 8 no 102017

[2] A B Al-Saleh and M E B Menai ldquoAutomatic Arabic textsummarization a surveyrdquo Artificial Intelligence Reviewvol 45 no 2 pp 203ndash234 2016

[3] A Turpin Y Tsegay D Hawking and H E Williams ldquoFastgeneration of result snippets in web searchrdquo in Proceedings ofthe 30th Annual international ACM SIGIR Conference on

Word1 Word2 Word4 Word5Word3 Word6 Word7helliphellip

Encoder

+

Word_Sum1ltStartgt Word_

Sum2

+

HC C

Decoder

helliphellip

Figure 22 A new word is added to the output sequence by combining the current hidden state ldquoHrdquo of the decoder and the two contextvectors marked as ldquoCrdquo [57]

26 Mathematical Problems in Engineering

Research and Development in information Retrieval-SIGIRrsquo07p 127 Amsterdam e Netherlands 2007

[4] E D Trippe ldquoA vision for health informatics introducing theSKED framework an extensible architecture for scientificknowledge extraction from datardquo 2017 httparxivorgabs170607992

[5] S Syed Abstractive Summarization of Social Media Posts Acase Study using Deep Learning Masterrsquos thesis BauhausUniversity Weimar Germany 2017

[6] D Suleiman and A A Awajan ldquoDeep learning based ex-tractive text summarization approaches datasets and eval-uation measuresrdquo in Proceedings of the 2019 SixthInternational Conference on Social Networks Analysis Man-agement and Security (SNAMS) pp 204ndash210 Granada Spain2019

[7] Q A Al-Radaideh and D Q Bataineh ldquoA hybrid approachfor Arabic text summarization using domain knowledge andgenetic algorithmsrdquo Cognitive Computation vol 10 no 4pp 651ndash669 2018

[8] C Sunitha A Jaya and A Ganesh ldquoA study on abstractivesummarization techniques in Indian languagesrdquo ProcediaComputer Science vol 87 pp 25ndash31 2016

[9] D R Radev E Hovy and K McKeown ldquoIntroduction to thespecial issue on summarizationrdquo Computational Linguisticsvol 28 no 4 pp 399ndash408 2002

[10] A Khan and N Salim ldquoA review on abstractive summari-zation methodsrdquo Journal of eoretical and Applied Infor-mation Technology vol 59 no 1 pp 64ndash72 2014

[11] N Moratanch and S Chitrakala ldquoA survey on abstractive textsummarizationrdquo in Proceedings of the 2016 InternationalConference on Circuit Power and Computing Technologies(ICCPCT) pp 1ndash7 Nagercoil India 2016

[12] S Shimpikar and S Govilkar ldquoA survey of text summarizationtechniques for Indian regional languagesrdquo InternationalJournal of Computer Applications vol 165 no 11 pp 29ndash332017

[13] N R Kasture N Yargal N N Singh N Kulkarni andV Mathur ldquoA survey on methods of abstractive text sum-marizationrdquo International Journal for Research in EmergingScience andTechnology vol 1 no 6 p 5 2014

[14] P Kartheek Rachabathuni ldquoA survey on abstractive sum-marization techniquesrdquo in Proceedings of the 2017 Interna-tional Conference on Inventive Computing and Informatics(ICICI) pp 762ndash765 Coimbatore 2017

[15] S Yeasmin P B Tumpa A M Nitu E Ali and M I AfjalldquoStudy of abstractive text summarization techniquesrdquoAmerican Journal of Engineering Research vol 8 2017

[16] A Khan N Salim H Farman et al ldquoAbstractive textsummarization based on improved semantic graph ap-proachrdquo International Journal of Parallel Programmingvol 46 no 5 pp 992ndash1016 2018

[17] Y Jaafar and K Bouzoubaa ldquoTowards a new hybrid approachfor abstractive summarizationrdquo Procedia Computer Sciencevol 142 pp 286ndash293 2018

[18] A M Rush S Chopra and J Weston ldquoA neural attentionmodel for abstractive sentence summarizationrdquo in Proceed-ings of the 2015 Conference on Empirical Methods in NaturalLanguage Processing Lisbon Portugal 2015

[19] N Raphal H Duwarah and P Daniel ldquoSurvey on abstractivetext summarizationrdquo in Proceedings of the 2018 InternationalConference on Communication and Signal Processing (ICCSP)pp 513ndash517 Chennai 2018

[20] Y Dong ldquoA survey on neural network-based summarizationmethodsrdquo 2018 httparxivorgabs180404589

[21] A Mahajani V Pandya I Maria and D Sharma ldquoA com-prehensive survey on extractive and abstractive techniques fortext summarizationrdquo in Ambient Communications andComputer Systems Y-C Hu S Tiwari K K Mishra andM C Trivedi Eds vol 904 pp 339ndash351 Springer Singapore2019

[22] T Shi Y Keneshloo N Ramakrishnan and C K ReddyNeural Abstractive Text Summarization with Sequence-To-Sequence Models A Survey httparxivorgabs1812023032020

[23] A Joshi E Fidalgo E Alegre and U de Leon ldquoDeep learningbased text summarization approaches databases and eval-uation measuresrdquo in Proceedings of the International Con-ference of Applications of Intelligent Systems Spain 2018

[24] Y LeCun Y Bengio and G Hinton ldquoDeep learningrdquoNaturevol 521 no 7553 pp 436ndash444 2015

[25] D Suleiman A Awajan andW Al Etaiwi ldquoe use of hiddenMarkov model in natural Arabic language processing asurveyrdquo Procedia Computer Science vol 113 pp 240ndash2472017

[26] H Wang and D Zeng ldquoFusing logical relationship infor-mation of text in neural network for text classificationrdquoMathematical Problems in Engineering vol 2020 pp 1ndash162020

[27] J Yi Y Zhang X Zhao and J Wan ldquoA novel text clusteringapproach using deep-learning vocabulary networkrdquo Mathe-matical Problems in Engineering vol 2017 pp 1ndash13 2017

[28] T Young D Hazarika S Poria and E Cambria ldquoRecenttrends in deep learning based natural language processing[review article]rdquo IEEE Computational Intelligence Magazinevol 13 no 3 pp 55ndash75 2018

[29] K Lopyrev Generating news headlines with recurrent neuralnetworks p 9 2015 httpsarxivorgabs151201712

[30] S Song H Huang and T Ruan ldquoAbstractive text summa-rization using LSTM-CNN Based Deep Learningrdquo Multi-media Tools and Applications 2018

[31] C L Giles G M Kuhn and R J Williams ldquoDynamic re-current neural networks theory and applicationsrdquo IEEETransactions on Neural Networks vol 5 no 2 pp 153ndash1561994

[32] A J Robinson ldquoAn application of recurrent nets to phoneprobability estimationrdquo IEEE Transactions on Neural Net-works vol 5 no 2 pp 298ndash305 1994

[33] D Bahdanau K Cho and Y Bengio ldquoNeural machinetranslation by jointly learning to align and translaterdquo inProceedings of the International Conference on LearningRepresentations Canada 2014 httparxivorgabs14090473

[34] M Schuster and K K Paliwal ldquoBidirectional recurrent neuralnetworksrdquo IEEE Transactions on Signal Processing vol 45no 11 pp 2673ndash2681 Nov 1997

[35] K Al-Sabahi Z Zuping and Y Kang Bidirectional Atten-tional Encoder-Decoder Model and Bidirectional Beam Searchfor Abstractive Summarization Cornell University IthacaNY USA 2018 httparxivorgabs180906662

[36] S Hochreiter and J Schmidhuber ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[37] K Cho ldquoLearning phrase representations using RNNencoderndashdecoder for statistical machine translationrdquo inProceedings of the 2014 Conference on Empirical Methods inNatural Language Processing (EMNLP) pp 1724ndash1734 DohaQatar 2014

[38] E Jobson and A Gutierrez Abstractive Text SummarizationUsing Attentive Sequence-To-Sequence RNNs p 8 2016

Mathematical Problems in Engineering 27

[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.
[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.
[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.
[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.
[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.
[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.
[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.
[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.
[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.
[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," 2016, https://arxiv.org/abs/1611.01576.
[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.
[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.
[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.
[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.
[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.
[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.
[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.
[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.
[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.
[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.
[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.
[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.
[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.
[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.
[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.
[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.
[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.
[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.
[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.
[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.
[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.
[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.
[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.
[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.
[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.
[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.



unseen tokens to copy them. Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. Furthermore, in [35, 58, 60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and copying a word from the original input text.
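To make the copy mechanism concrete, the following minimal sketch shows how one decoding step of a pointer-generator can mix the vocabulary distribution with the attention (copy) distribution. It is an illustration of the idea in [56] under assumed tensor shapes and parameter names, not the authors' implementation.

```python
import torch

def pointer_generator_step(vocab_logits, attn_weights, src_ids,
                           context, state, x_t,
                           w_c, w_s, w_x, b_gen, extended_vocab_size):
    """One decoding step of a pointer-generator mixture (illustrative sketch).

    Assumed shapes: vocab_logits (B, V), attn_weights (B, L) summing to 1 per row,
    src_ids (B, L) with token ids in an extended vocabulary of size V_ext >= V,
    context/state/x_t (B, d_*), and w_c/w_s/w_x/b_gen learned switch parameters.
    """
    # Soft switch between generating from the vocabulary and copying from the source.
    p_gen = torch.sigmoid(context @ w_c + state @ w_s + x_t @ w_x + b_gen)  # (B, 1)

    p_vocab = torch.softmax(vocab_logits, dim=-1)                           # (B, V)
    batch, vocab = p_vocab.shape

    # Extend the vocabulary distribution with zeros for source-only (OOV) tokens.
    extra = torch.zeros(batch, extended_vocab_size - vocab, device=p_vocab.device)
    final = torch.cat([p_gen * p_vocab, extra], dim=-1)                     # (B, V_ext)

    # Add the copy distribution at the positions of the source tokens, so an OOV
    # source word can still receive probability mass and be emitted verbatim.
    final.scatter_add_(1, src_ids, (1.0 - p_gen) * attn_weights)
    return final  # a valid distribution over the extended vocabulary
```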

7.3. Summary Sentence Repetition and Inaccurate Information. The repetition of phrases and the generation of incoherent phrases in the output summary are two challenges that must be considered. Both arise when long documents are summarised and long summaries are produced with an attention-based encoder-decoder RNN [57]. In [35, 56], repetition was addressed by using the coverage model, which builds a coverage vector by aggregating the attention over all previous timesteps. In [57], repetition was addressed with a key attention mechanism: for each input token, the encoder intratemporal attention records the weights of the previous attention steps. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition by discouraging the decoder from attending to the same part of the input at different decoding steps. However, the intratemporal encoder attention cannot remove all repetition, especially when a long sequence is generated. Thus, an intradecoder attention mechanism was proposed to allow the decoder to consider more of the previously generated words; this mechanism is applicable to any type of RNN decoder. Repetition was also addressed with an objective function that combines the maximum-likelihood cross-entropy loss with policy-gradient reinforcement learning to minimise exposure bias. In addition, a constraint on the trigram probability p(y_t) was proposed to reduce repetition in the generated summary, where y_t is a trigram sequence: during beam search in the decoder, p(y_t) is set to 0 whenever the same trigram has already been generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, the approach proposed in [61] addressed repetition by exploiting the encoding features generated by a secondary encoder to remember the previously generated decoder output, and the coverage mechanism was also utilised.
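The trigram constraint of [57] can be stated compactly in code: during beam search, any candidate whose last three tokens would repeat an already generated trigram receives zero probability (a score of negative infinity). The helper below is a simplified sketch of that filter; the function and variable names are illustrative.

```python
def repeats_trigram(prefix, next_token):
    """True if appending next_token would repeat a trigram already in prefix,
    i.e. the case where the decoder forces p(y_t) = 0 during beam search."""
    if len(prefix) < 2:
        return False
    candidate = (prefix[-2], prefix[-1], next_token)
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    return candidate in seen


def prune_beam_candidates(hypotheses, next_ids, next_scores):
    """Apply the trigram filter to one step of beam search (hypothetical helper).

    hypotheses  : list of token-id lists, one per beam
    next_ids    : list of candidate next-token ids per beam
    next_scores : matching log-probabilities per beam
    """
    pruned = []
    for prefix, ids, scores in zip(hypotheses, next_ids, next_scores):
        for token, score in zip(ids, scores):
            if repeats_trigram(prefix, token):
                score = float("-inf")  # equivalent to setting p(y_t) = 0
            pruned.append((prefix + [token], score))
    return pruned  # the caller keeps the top-k surviving candidates
```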

7.4. Fake Facts. Abstractive summarisation may generate summaries that contain fake facts; approximately 30% of the summaries produced by abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of a predicate. Thus, to address this problem, dependency parsing and open information extraction (OpenIE) are performed to extract facts.

Therefore, a sequence-to-sequence framework with dual attention was proposed, in which the generated summary is conditioned both on the input text and on descriptions of the extracted facts. OpenIE facilitates the extraction of entities and their relations, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised the copying and coverage mechanisms.
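As a rough sketch of the dual-attention idea, the decoder can attend separately over the encoded source text and over the encoded fact descriptions and then merge the two context vectors before predicting the next word. The gating scheme and parameter shapes below are assumptions for illustration, not the exact formulation of [53].

```python
import torch

def dual_attention_context(dec_state, text_enc, fact_enc, w_text, w_fact, w_gate):
    """Merge a text context and a fact context for one decoder step (sketch).

    Assumed shapes: dec_state (B, d); text_enc (B, Lt, d) for the source text;
    fact_enc (B, Lf, d) for the OpenIE fact descriptions; w_text/w_fact (d, d);
    w_gate (2 * d, 1).
    """
    def attend(states, w):
        scores = torch.einsum("bld,bd->bl", states @ w, dec_state)  # (B, L)
        weights = torch.softmax(scores, dim=-1)
        return torch.einsum("bl,bld->bd", weights, states)          # (B, d)

    c_text = attend(text_enc, w_text)   # attention over the source text
    c_fact = attend(fact_enc, w_fact)   # attention over the extracted facts

    # A learned gate decides how much the next prediction relies on the facts.
    gate = torch.sigmoid(torch.cat([c_text, c_fact], dim=-1) @ w_gate)  # (B, 1)
    return gate * c_text + (1.0 - gate) * c_fact
```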

7.5. Other Challenges. The main issue with abstractive text summarisation datasets is the quality of the reference (golden) summary. In the CNN/Daily Mail dataset, the reference summary is composed of the highlights of the news article.

Figure 19: ROUGE1, ROUGE2, and ROUGE-L scores of abstractive text summarisation methods on the CNN/Daily Mail dataset. Methods shown: pointer-generator + coverage; reinforcement learning with intra-attention; maximum-likelihood + RL with intra-attention; adversarial network; ATSDL; bidirectional attentional encoder-decoder; key information guide network; ML + RL ROUGE + novel with LM; DEATS; words-lvt2k-temp-att (CNN/Daily Mail); BiSum; BERTSUMEXT (large); BEAR (large + wordPiece); TRANS-ext + filter + abs; DAPT + imp-coverage (RL + MLE (ss)).

Every highlight corresponds to one sentence of the summary; therefore, the number of sentences in the summary is equal to the number of highlights. Sometimes the highlights do not cover all of the crucial points of the article, so producing a high-quality dataset requires considerable effort. Moreover, for some languages, such as Arabic, no multisentence dataset for abstractive summarisation is available; a single-sentence abstractive Arabic text summarisation dataset exists, but it is not free.
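Readers who want to inspect this property can load the dataset, for example, through the Hugging Face datasets library; the configuration name and the article/highlights field names below are the ones commonly exposed by that hub and may differ in other copies of the corpus.

```python
from datasets import load_dataset

# "3.0.0" and the "article"/"highlights" fields are the configuration commonly
# published on the Hugging Face hub; adjust them if your copy differs.
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="validation[:5]")

for example in cnn_dm:
    highlights = [h.strip() for h in example["highlights"].split("\n") if h.strip()]
    reference = " ".join(highlights)  # one reference sentence per highlight
    print(len(highlights), "highlights ->", len(reference.split()), "reference words")
```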

Another issue in abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results for extractive summarisation. However, for abstractive summarisation, ROUGE is not sufficient because it depends on exact matching between words: for example, the words "book" and "books" are considered different by every ROUGE metric. Therefore, a new evaluation measure that considers the context of the words is needed, so that words with the same meaning are treated as the same even when their surface forms differ. In this case, we propose using METEOR, which has recently been used to evaluate machine translation and automatic summarisation models [77]; METEOR considers stemming, morphological variants, and synonyms. In addition, for languages with flexible word order, it is better to use ROUGE without regard to the order of the words.
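The surface-matching limitation is easy to demonstrate with a small unigram-overlap score: exact matching treats "book" and "books" as different tokens, whereas a stemmed comparison, closer in spirit to METEOR's matching stages, gives partial credit. The helper below is a simplified illustration, not the official ROUGE or METEOR implementation.

```python
from collections import Counter

def rouge1_f1(candidate, reference, normalise=None):
    """Unigram-overlap F1 (a simplified ROUGE-1); `normalise` may stem tokens."""
    cand = [normalise(t) if normalise else t for t in candidate.lower().split()]
    ref = [normalise(t) if normalise else t for t in reference.lower().split()]
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

crude_stem = lambda t: t[:-1] if t.endswith("s") else t  # toy stemmer, illustration only

print(rouge1_f1("the books were long", "the book was long"))              # 0.50, "books" unmatched
print(rouge1_f1("the books were long", "the book was long", crude_stem))  # 0.75 after stemming
```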

The quality of the generated summary can also be improved by using linguistic features.

Figure 20: The generator/pointer switching model [55]; at each output position the model chooses between generating a word (G) and pointing to a source word (P).

Figure 21: The pointer-generator model [56]: the attention distribution over the encoder hidden states yields a context vector, and the final output distribution combines the vocabulary distribution (weighted by Pgen) with the attention (copy) distribution (weighted by 1 − Pgen).

For example, we propose applying dependency parsing at the encoder in a separate layer on top of the first hidden-state layer, and using word embeddings that are built with dependency-parsing or part-of-speech information. On the decoder side, the quality of the beam search can be improved by considering the part-of-speech tags of each word and its surrounding words.
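As a small illustration of the first suggestion, injecting part-of-speech or dependency information into the encoder input can be as simple as concatenating a tag embedding to each word embedding; the vocabulary and dimension sizes below are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TagAugmentedEmbedding(nn.Module):
    """Concatenate a POS (or dependency-relation) tag embedding to each word
    embedding before the encoder; sizes are arbitrary for this sketch."""

    def __init__(self, vocab_size=50000, tagset_size=45, word_dim=256, tag_dim=32):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.tag_emb = nn.Embedding(tagset_size, tag_dim)

    def forward(self, word_ids, tag_ids):
        # word_ids and tag_ids: (batch, seq_len), produced by a tokenizer and a tagger.
        return torch.cat([self.word_emb(word_ids), self.tag_emb(tag_ids)], dim=-1)

embedder = TagAugmentedEmbedding()
words = torch.randint(0, 50000, (2, 7))
tags = torch.randint(0, 45, (2, 7))
print(embedder(words, tags).shape)  # torch.Size([2, 7, 288]); the encoder consumes these vectors
```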

Based on the new trends and the evaluation results, we believe that the most promising direction is the use of a pretrained BERT model. The quality of Transformer-based models is high, and they are expected to yield promising results.
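For instance, a pretrained BERT encoder from the Hugging Face transformers library can supply the contextual token representations over which a summarisation decoder attends. The snippet below sketches only this encoding stage under a standard model name; it is not the BERTSUM architecture of [65].

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

article = "The quick brown fox jumps over the lazy dog. It then runs away."
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = encoder(**inputs)

# (1, seq_len, 768) contextual embeddings that a summarisation decoder can attend over.
print(outputs.last_hidden_state.shape)
```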

8. Conclusion and Discussion

In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases taken from the original text, selected on the basis of linguistic and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. This paper reviewed recent approaches that apply deep learning to abstractive text summarisation, together with the datasets and the measures used to evaluate these approaches. Moreover, the challenges encountered when employing the various approaches, and their solutions, were discussed and analysed. The overview of the reviewed approaches yielded several conclusions. The RNN with an attention mechanism was the most commonly employed deep learning technique. Some approaches applied LSTM to solve the gradient vanishing problem encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times: Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. The experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained by text summarisation with a pretrained encoder model, with values of 43.85, 20.34, and 39.9, respectively; the best results were achieved by models that apply the Transformer. The most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, several issues must be considered in abstractive summarisation, including the dataset, the evaluation measures, and the quality of the generated summary.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] M Allahyari S Pouriyeh M Assefi et al ldquoText summari-zation techniques a brief surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 8 no 102017

[2] A B Al-Saleh and M E B Menai ldquoAutomatic Arabic textsummarization a surveyrdquo Artificial Intelligence Reviewvol 45 no 2 pp 203ndash234 2016

[3] A Turpin Y Tsegay D Hawking and H E Williams ldquoFastgeneration of result snippets in web searchrdquo in Proceedings ofthe 30th Annual international ACM SIGIR Conference on

Word1 Word2 Word4 Word5Word3 Word6 Word7helliphellip

Encoder

+

Word_Sum1ltStartgt Word_

Sum2

+

HC C

Decoder

helliphellip

Figure 22 A new word is added to the output sequence by combining the current hidden state ldquoHrdquo of the decoder and the two contextvectors marked as ldquoCrdquo [57]

26 Mathematical Problems in Engineering

Research and Development in information Retrieval-SIGIRrsquo07p 127 Amsterdam e Netherlands 2007

[4] E D Trippe ldquoA vision for health informatics introducing theSKED framework an extensible architecture for scientificknowledge extraction from datardquo 2017 httparxivorgabs170607992

[5] S Syed Abstractive Summarization of Social Media Posts Acase Study using Deep Learning Masterrsquos thesis BauhausUniversity Weimar Germany 2017

[6] D Suleiman and A A Awajan ldquoDeep learning based ex-tractive text summarization approaches datasets and eval-uation measuresrdquo in Proceedings of the 2019 SixthInternational Conference on Social Networks Analysis Man-agement and Security (SNAMS) pp 204ndash210 Granada Spain2019

[7] Q A Al-Radaideh and D Q Bataineh ldquoA hybrid approachfor Arabic text summarization using domain knowledge andgenetic algorithmsrdquo Cognitive Computation vol 10 no 4pp 651ndash669 2018

[8] C Sunitha A Jaya and A Ganesh ldquoA study on abstractivesummarization techniques in Indian languagesrdquo ProcediaComputer Science vol 87 pp 25ndash31 2016

[9] D R Radev E Hovy and K McKeown ldquoIntroduction to thespecial issue on summarizationrdquo Computational Linguisticsvol 28 no 4 pp 399ndash408 2002

[10] A Khan and N Salim ldquoA review on abstractive summari-zation methodsrdquo Journal of eoretical and Applied Infor-mation Technology vol 59 no 1 pp 64ndash72 2014

[11] N Moratanch and S Chitrakala ldquoA survey on abstractive textsummarizationrdquo in Proceedings of the 2016 InternationalConference on Circuit Power and Computing Technologies(ICCPCT) pp 1ndash7 Nagercoil India 2016

[12] S Shimpikar and S Govilkar ldquoA survey of text summarizationtechniques for Indian regional languagesrdquo InternationalJournal of Computer Applications vol 165 no 11 pp 29ndash332017

[13] N R Kasture N Yargal N N Singh N Kulkarni andV Mathur ldquoA survey on methods of abstractive text sum-marizationrdquo International Journal for Research in EmergingScience andTechnology vol 1 no 6 p 5 2014

[14] P Kartheek Rachabathuni ldquoA survey on abstractive sum-marization techniquesrdquo in Proceedings of the 2017 Interna-tional Conference on Inventive Computing and Informatics(ICICI) pp 762ndash765 Coimbatore 2017

[15] S Yeasmin P B Tumpa A M Nitu E Ali and M I AfjalldquoStudy of abstractive text summarization techniquesrdquoAmerican Journal of Engineering Research vol 8 2017

[16] A Khan N Salim H Farman et al ldquoAbstractive textsummarization based on improved semantic graph ap-proachrdquo International Journal of Parallel Programmingvol 46 no 5 pp 992ndash1016 2018

[17] Y Jaafar and K Bouzoubaa ldquoTowards a new hybrid approachfor abstractive summarizationrdquo Procedia Computer Sciencevol 142 pp 286ndash293 2018

[18] A M Rush S Chopra and J Weston ldquoA neural attentionmodel for abstractive sentence summarizationrdquo in Proceed-ings of the 2015 Conference on Empirical Methods in NaturalLanguage Processing Lisbon Portugal 2015

[19] N Raphal H Duwarah and P Daniel ldquoSurvey on abstractivetext summarizationrdquo in Proceedings of the 2018 InternationalConference on Communication and Signal Processing (ICCSP)pp 513ndash517 Chennai 2018

[20] Y Dong ldquoA survey on neural network-based summarizationmethodsrdquo 2018 httparxivorgabs180404589

[21] A Mahajani V Pandya I Maria and D Sharma ldquoA com-prehensive survey on extractive and abstractive techniques fortext summarizationrdquo in Ambient Communications andComputer Systems Y-C Hu S Tiwari K K Mishra andM C Trivedi Eds vol 904 pp 339ndash351 Springer Singapore2019

[22] T Shi Y Keneshloo N Ramakrishnan and C K ReddyNeural Abstractive Text Summarization with Sequence-To-Sequence Models A Survey httparxivorgabs1812023032020

[23] A Joshi E Fidalgo E Alegre and U de Leon ldquoDeep learningbased text summarization approaches databases and eval-uation measuresrdquo in Proceedings of the International Con-ference of Applications of Intelligent Systems Spain 2018

[24] Y LeCun Y Bengio and G Hinton ldquoDeep learningrdquoNaturevol 521 no 7553 pp 436ndash444 2015

[25] D Suleiman A Awajan andW Al Etaiwi ldquoe use of hiddenMarkov model in natural Arabic language processing asurveyrdquo Procedia Computer Science vol 113 pp 240ndash2472017

[26] H Wang and D Zeng ldquoFusing logical relationship infor-mation of text in neural network for text classificationrdquoMathematical Problems in Engineering vol 2020 pp 1ndash162020

[27] J Yi Y Zhang X Zhao and J Wan ldquoA novel text clusteringapproach using deep-learning vocabulary networkrdquo Mathe-matical Problems in Engineering vol 2017 pp 1ndash13 2017

[28] T Young D Hazarika S Poria and E Cambria ldquoRecenttrends in deep learning based natural language processing[review article]rdquo IEEE Computational Intelligence Magazinevol 13 no 3 pp 55ndash75 2018

[29] K Lopyrev Generating news headlines with recurrent neuralnetworks p 9 2015 httpsarxivorgabs151201712

[30] S Song H Huang and T Ruan ldquoAbstractive text summa-rization using LSTM-CNN Based Deep Learningrdquo Multi-media Tools and Applications 2018

[31] C L Giles G M Kuhn and R J Williams ldquoDynamic re-current neural networks theory and applicationsrdquo IEEETransactions on Neural Networks vol 5 no 2 pp 153ndash1561994

[32] A J Robinson ldquoAn application of recurrent nets to phoneprobability estimationrdquo IEEE Transactions on Neural Net-works vol 5 no 2 pp 298ndash305 1994

[33] D Bahdanau K Cho and Y Bengio ldquoNeural machinetranslation by jointly learning to align and translaterdquo inProceedings of the International Conference on LearningRepresentations Canada 2014 httparxivorgabs14090473

[34] M Schuster and K K Paliwal ldquoBidirectional recurrent neuralnetworksrdquo IEEE Transactions on Signal Processing vol 45no 11 pp 2673ndash2681 Nov 1997

[35] K Al-Sabahi Z Zuping and Y Kang Bidirectional Atten-tional Encoder-Decoder Model and Bidirectional Beam Searchfor Abstractive Summarization Cornell University IthacaNY USA 2018 httparxivorgabs180906662

[36] S Hochreiter and J Schmidhuber ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[37] K Cho ldquoLearning phrase representations using RNNencoderndashdecoder for statistical machine translationrdquo inProceedings of the 2014 Conference on Empirical Methods inNatural Language Processing (EMNLP) pp 1724ndash1734 DohaQatar 2014

[38] E Jobson and A Gutierrez Abstractive Text SummarizationUsing Attentive Sequence-To-Sequence RNNs p 8 2016

Mathematical Problems in Engineering 27

[39] S Chopra M Auli and A M Rush ldquoAbstractive sentencesummarization with attentive recurrent neural networksrdquo inProceedings of the NAACL-HLT16 pp 93ndash98 San Diego CAUSA 2016

[40] C Sun L Lv G Tian Q Wang X Zhang and L GuoldquoLeverage label and word embedding for semantic sparse webservice discoveryrdquo Mathematical Problems in Engineeringvol 2020 Article ID 5670215 8 pages 2020

[41] T Mikolov K Chen G Corrado and J Dean ldquoEfficientestimation of word representations in vector spacerdquo 2013httparxivorgabs13013781

[42] D Suleiman A Awajan and N Al-Madi ldquoDeep learningbased technique for Plagiarism detection in Arabic textsrdquo inProceedings of the 2017 International Conference on NewTrends in Computing Sciences (ICTCS) pp 216ndash222 AmmanJordan 2017

[43] D Suleiman and A Awajan ldquoComparative study of wordembeddings models and their usage in Arabic language ap-plicationsrdquo in Proceedings of the 2018 International ArabConference on Information Technology (ACIT) pp 1ndash7Werdanye Lebanon 2018

[44] J Pennington R Socher and C Manning ldquoGlove globalvectors for word representationrdquo in Proceedings of the 2014Conference on Empirical Methods in Natural Language Pro-cessing (EMNLP) pp 1532ndash1543 Doha Qatar 2014

[45] D Suleiman and A A Awajan ldquoUsing part of speech taggingfor improving Word2vec modelrdquo in Proceedings of the 20192nd International Conference on new Trends in ComputingSciences (ICTCS) pp 1ndash7 Amman Jordan 2019

[46] A Joulin E Grave P Bojanowski M Douze H Jegou andT Mikolov ldquoFastTextzip compressing text classificationmodelsrdquo 2016 httparxivorgabs161203651

[47] A Vaswani N Shazeer N Parmar et al ldquoAttention is all youneedrdquo Advances in Neural Information Processing Systemspp 5998ndash6008 2017

[48] J Devlin M-W Chang K Lee and K Toutanova ldquoPre-training of deep bidirectional transformers for languageunderstandingrdquo in Proceedings of the 2019 Conference of theNorth American Chapter of the Association for ComputationalLinguistics Human Language Technologies pp 4171ndash4186Minneapolis MN USA 2019

[49] Z Li Z Peng S Tang C Zhang and H Ma ldquoText sum-marization method based on double attention pointer net-workrdquo IEEE Access vol 8 pp 11279ndash11288 2020

[50] J Bradbury S Merity C Xiong and R Socher Quasi-re-current neural networks httpsarxivorgabs1611015762015

[51] U Khandelwal P Qi and D Jurafsky Neural Text Sum-marization Stanford University Stanford CA USA 2016

[52] Q Zhou N Yang F Wei and M Zhou ldquoSelective encodingfor abstractive sentence summarizationrdquo in Proceedings of the55th Annual Meeting of the Association for ComputationalLinguistics pp 1095ndash1104 Vancouver Canada July 2017

[53] Z Cao F Wei W Li and S Li ldquoFaithful to the original factaware neural abstractive summarizationrdquo in Proceedings of theAAAI Conference on Artificial Intelligence (AAAI) NewOrleans LA USA February 2018

[54] T Cai M Shen H Peng L Jiang and Q Dai ldquoImprovingtransformer with sequential context representations for ab-stractive text summarizationrdquo inNatural Language Processingand Chinese Computing J Tang M-Y Kan D Zhao S Liand H Zan Eds pp 512ndash524 Springer International Pub-lishing Cham Switzerland 2019

[55] R Nallapati B Zhou C N dos Santos C Gulcehre andB Xiang ldquoAbstractive text summarization using sequence-to-sequence RNNs and beyondrdquo in Proceedings of the CoNLL-16Berlin Germany August 2016

[56] A See P J Liu and C D Manning ldquoGet to the pointsummarization with pointer-generator networksrdquo in Pro-ceedings of the 55th ACL pp 1073ndash1083 Vancouver Canada2017

[57] R Paulus C Xiong and R Socher ldquoA deep reinforced modelfor abstractive summarizationrdquo 2017 httparxivorgabs170504304

[58] K S Bose R H Sarma M Yang Q Qu J Zhu and H LildquoDelineation of the intimate details of the backbone con-formation of pyridine nucleotide coenzymes in aqueous so-lutionrdquo Biochemical and Biophysical ResearchCommunications vol 66 no 4 1975

[59] C Li W Xu S Li and S Gao ldquoGuiding generation forabstractive text summarization based on key informationguide networkrdquo in Proceedings of the 2018 Conference of theNorth American Chapter of the Association for ComputationalLinguistics Human Language Technologies pp 55ndash60 NewOrleans LA USA 2018

[60] W Kryscinski R Paulus C Xiong and R Socher ldquoIm-proving abstraction in text summarizationrdquo in Proceedings ofthe Conference on Empirical Methods in Natural LanguageProcessing (EMNLP) Brussels Belgium November 2018

[61] K Yao L Zhang D Du T Luo L Tao and Y Wu ldquoDualencoding for abstractive text summarizationrdquo IEEE Trans-actions on Cybernetics pp 1ndash12 2018

[62] X Wan C Li R Wang D Xiao and C Shi ldquoAbstractivedocument summarization via bidirectional decoderrdquo in Ad-vanced DataMining and Applications G Gan B Li X Li andS Wang Eds pp 364ndash377 Springer International Publish-ing Cham Switzerland 2018

[63] Q Wang P Liu Z Zhu H Yin Q Zhang and L Zhang ldquoAtext abstraction summary model based on BERT word em-bedding and reinforcement learningrdquo Applied Sciences vol 9no 21 p 4701 2019

[64] E Egonmwan and Y Chali ldquoTransformer-based model forsingle documents neural summarizationrdquo in Proceedings ofthe 3rd Workshop on Neural Generation and Translationpp 70ndash79 Hong Kong 2019

[65] Y Liu and M Lapata ldquoText summarization with pretrainedencodersrdquo 2019 httparxivorgabs190808345

[66] P Doetsch A Zeyer and H Ney ldquoBidirectional decodernetworks for attention-based end-to-end offline handwritingrecognitionrdquo in Proceedings of the 2016 15th InternationalConference on Frontiers in Handwriting Recognition (ICFHR)pp 361ndash366 Shenzhen China 2016

[67] I Sutskever O Vinyals and Q V Le ldquoSequence to SequenceLearning with Neural Networksrdquo in Proceedings of the Ad-vances in Neural Information Processing Systems (NIPS)Montreal Quebec Canada December 2014

[68] D He H Lu Y Xia T Qin L Wang and T-Y LiuldquoDecoding with value networks for neural machine transla-tionrdquo in Proceedings of the Advances in Neural InformationProcessing Systems Long Beach CA USA December 2017

[69] D Harman and P Over ldquoe effects of human variation inDUC summarization evaluation text summarizationbranches outrdquo Proceedings of the ACL-04 Workshop vol 82004

[70] C Napoles M Gormley and B V Durme ldquoAnnotatedGigawordrdquo in Proceedings of the AKBC-WEKEX MontrealCanada 2012

28 Mathematical Problems in Engineering

[71] K M Hermann T Kocisky E Grefenstette et al ldquoMachinesto read and comprehendrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS) MontrealQuebec Canada December 2015

[72] M GruskyM Naaman and Y Artzi ldquoNewsroom a dataset of13 million summaries with diverse extractive strategiesrdquo inProceedings of the 2018 Conference of the North AmericanChapter of the Association for Computational LinguisticsHuman Language Technologies Association for Computa-tional Linguistics New Orleans LA USA pp 708ndash719 June2018

[73] C-Y Lin ldquoROUGE a package for automatic evaluation ofsummariesrdquo in Proceedings of the 2004 ACL WorkshopBarcelona Spain July 2004

[74] A Venkatraman M Hebert and J A Bagnell ldquoImprovingmulti-step prediction of learned time series modelsrdquo inProceedings of the Twenty-Ninth AAAI Conference on Arti-ficial Intelligence pp 3024ndash3030 Austin TX USA 2015

[75] I Goodfellow A Courville and Y Bengio Deep LearningMIT Press Cambridge MA USA 2015

[76] S Bengio O Vinyals N Jaitly and N Shazeer ldquoScheduledsampling for sequence prediction with recurrent neuralnetworksrdquo in Proceedings of the Annual Conference on NeuralInformation Processing Systems pp 1171ndash1179 MontrealQuebec Canada December 2015

[77] A Lavie and M J Denkowski ldquoe Meteor metric for au-tomatic evaluation of machine translationrdquo Machine Trans-lation vol 23 no 2-3 pp 105ndash115 2009

Mathematical Problems in Engineering 29

Page 25: Deep Learning Based Abstractive Text Summarization ...downloads.hindawi.com/journals/mpe/2020/9365340.pdfDeep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation

the number of sentences in the summary is equal to thenumber of highlights Sometimes the highlights do notaddress all crucial points in the summary erefore a high-quality dataset needs high effort to become availableMoreover in some languages such as Arabic the multi-sentence dataset for abstractive summarisation is notavailable Single-sentence abstractive Arabic text summa-risation is available but is not free

Another issue of abstractive summarisation is the use ofROUGE for evaluation ROUGE provides reasonable resultsin the case of extractive summarisation However in ab-stractive summarisation ROUGE is not enough as ROUGEdepends on exact matching between words For example the

words book and books are considered different using anyone of the ROUGE metrics erefore a new evaluationmeasure must be proposed to consider the context of thewords (words that have the same meaning must be con-sidered the same even if they have a different surface form)In this case we propose to use METEOR which was usedrecently in evaluating machine translation and automaticsummarisation models [77] Moreover METEOR considersstemming morphological variants and synonyms In ad-dition in flexible order language it is better to use ROUGEwithout caring about the order of the words

e quality of the generated summary can be improvedusing linguistic features For example we proposed the use

Encoder hidden states Decoder hidden states

Word1 Word2 Word4 Word5 Word6 Word7Word3

G P P G G

Figure 20 e generatorpointer switching model [55]

Word1 Word2 Word4 Word5Word3 Word6 Word7

Context vector

helliphellip

Atte

ntio

ndi

strib

utio

nEn

code

r hid

den

state

s

Word_Sum1ltStartgt

helliphellip

Dec

oder

hid

den

state

sVo

cabu

lary

distr

ibut

ion

X (1 ndash Pgen) X Pgen

Pgen

Figure 21 Pointer-generator model [56]

Mathematical Problems in Engineering 25

of dependency parsing at the encoder in a separate layer atthe top of the first hidden state layer We proposed the use ofthe word embedding which was built by considering thedependency parsing or part-of-speech tagging At the de-coder side the beam-search quality can be improved byconsidering the part-of-speech tagging of the words and itssurrounding words

Based on the new trends and evaluation results we thinkthat the most promising feature among all the features is theuse of the BERTpretrained model e quality of the modelsthat are based on the transformer is high and will yieldpromising results

8 Conclusion and Discussion

In recent years due to the vast quantity of data available onthe Internet the importance of the text summarisationprocess has increased Text summarisation can be dividedinto extractive and abstractive methods An extractive textsummarisationmethod generates a summary that consists ofwords and phrases from the original text based on linguisticsand statistical features while an abstractive text summa-risation method rephrases the original text to generate asummary that consists of novel phrases is paper reviewedrecent approaches that applied deep learning for abstractivetext summarisation datasets and measures for evaluation ofthese approaches Moreover the challenges encounteredwhen employing various approaches and their solutionswere discussed and analysed e overview of the reviewedapproaches yielded several conclusions e RNN and at-tention mechanism were the most commonly employeddeep learning techniques Some approaches applied LSTMto solve the gradient vanishing problem that was encoun-tered when using an RNN while other approaches applied aGRU Additionally the sequence-to-sequence model wasutilised for abstractive summarisation Several datasets wereemployed including Gigaword CNNDaily Mail and the

New York Times Gigaword was selected for single-sentencesummarisation and CNNDaily Mail was employed formultisentence summarisation Furthermore ROUGE1ROUGE2 and ROUGE-L were utilised to evaluate thequality of the summaries e experiments showed that thehighest values of ROUGE1 ROUGE2 and ROUGE-L wereobtained in text summarisation with a pretrained encodermode with values of 4385 2034 and 399 respectively ebest results were achieved by the models that applyTransformer e most common challenges faced during thesummarisation process were the unavailability of a goldentoken at testing time the presence of OOV words summarysentence repetition sentence inaccuracy and the presence offake facts In addition there are several issues that must beconsidered in abstractive summarisation including thedataset evaluation measures and quality of the generatedsummary

Data Availability

No data were used to support this study

Conflicts of Interest

e authors declare no conflicts of interest

References

[1] M Allahyari S Pouriyeh M Assefi et al ldquoText summari-zation techniques a brief surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 8 no 102017

[2] A B Al-Saleh and M E B Menai ldquoAutomatic Arabic textsummarization a surveyrdquo Artificial Intelligence Reviewvol 45 no 2 pp 203ndash234 2016

[3] A Turpin Y Tsegay D Hawking and H E Williams ldquoFastgeneration of result snippets in web searchrdquo in Proceedings ofthe 30th Annual international ACM SIGIR Conference on

Word1 Word2 Word4 Word5Word3 Word6 Word7helliphellip

Encoder

+

Word_Sum1ltStartgt Word_

Sum2

+

HC C

Decoder

helliphellip

Figure 22 A new word is added to the output sequence by combining the current hidden state ldquoHrdquo of the decoder and the two contextvectors marked as ldquoCrdquo [57]

26 Mathematical Problems in Engineering

Research and Development in information Retrieval-SIGIRrsquo07p 127 Amsterdam e Netherlands 2007

[4] E D Trippe ldquoA vision for health informatics introducing theSKED framework an extensible architecture for scientificknowledge extraction from datardquo 2017 httparxivorgabs170607992

[5] S Syed Abstractive Summarization of Social Media Posts Acase Study using Deep Learning Masterrsquos thesis BauhausUniversity Weimar Germany 2017

[6] D Suleiman and A A Awajan ldquoDeep learning based ex-tractive text summarization approaches datasets and eval-uation measuresrdquo in Proceedings of the 2019 SixthInternational Conference on Social Networks Analysis Man-agement and Security (SNAMS) pp 204ndash210 Granada Spain2019

[7] Q A Al-Radaideh and D Q Bataineh ldquoA hybrid approachfor Arabic text summarization using domain knowledge andgenetic algorithmsrdquo Cognitive Computation vol 10 no 4pp 651ndash669 2018

[8] C Sunitha A Jaya and A Ganesh ldquoA study on abstractivesummarization techniques in Indian languagesrdquo ProcediaComputer Science vol 87 pp 25ndash31 2016

[9] D R Radev E Hovy and K McKeown ldquoIntroduction to thespecial issue on summarizationrdquo Computational Linguisticsvol 28 no 4 pp 399ndash408 2002

[10] A Khan and N Salim ldquoA review on abstractive summari-zation methodsrdquo Journal of eoretical and Applied Infor-mation Technology vol 59 no 1 pp 64ndash72 2014

[11] N Moratanch and S Chitrakala ldquoA survey on abstractive textsummarizationrdquo in Proceedings of the 2016 InternationalConference on Circuit Power and Computing Technologies(ICCPCT) pp 1ndash7 Nagercoil India 2016

[12] S Shimpikar and S Govilkar ldquoA survey of text summarizationtechniques for Indian regional languagesrdquo InternationalJournal of Computer Applications vol 165 no 11 pp 29ndash332017

[13] N R Kasture N Yargal N N Singh N Kulkarni andV Mathur ldquoA survey on methods of abstractive text sum-marizationrdquo International Journal for Research in EmergingScience andTechnology vol 1 no 6 p 5 2014

[14] P Kartheek Rachabathuni ldquoA survey on abstractive sum-marization techniquesrdquo in Proceedings of the 2017 Interna-tional Conference on Inventive Computing and Informatics(ICICI) pp 762ndash765 Coimbatore 2017

[15] S Yeasmin P B Tumpa A M Nitu E Ali and M I AfjalldquoStudy of abstractive text summarization techniquesrdquoAmerican Journal of Engineering Research vol 8 2017

[16] A Khan N Salim H Farman et al ldquoAbstractive textsummarization based on improved semantic graph ap-proachrdquo International Journal of Parallel Programmingvol 46 no 5 pp 992ndash1016 2018

[17] Y Jaafar and K Bouzoubaa ldquoTowards a new hybrid approachfor abstractive summarizationrdquo Procedia Computer Sciencevol 142 pp 286ndash293 2018

[18] A M Rush S Chopra and J Weston ldquoA neural attentionmodel for abstractive sentence summarizationrdquo in Proceed-ings of the 2015 Conference on Empirical Methods in NaturalLanguage Processing Lisbon Portugal 2015

[19] N Raphal H Duwarah and P Daniel ldquoSurvey on abstractivetext summarizationrdquo in Proceedings of the 2018 InternationalConference on Communication and Signal Processing (ICCSP)pp 513ndash517 Chennai 2018

[20] Y Dong ldquoA survey on neural network-based summarizationmethodsrdquo 2018 httparxivorgabs180404589

[21] A Mahajani V Pandya I Maria and D Sharma ldquoA com-prehensive survey on extractive and abstractive techniques fortext summarizationrdquo in Ambient Communications andComputer Systems Y-C Hu S Tiwari K K Mishra andM C Trivedi Eds vol 904 pp 339ndash351 Springer Singapore2019

[22] T Shi Y Keneshloo N Ramakrishnan and C K ReddyNeural Abstractive Text Summarization with Sequence-To-Sequence Models A Survey httparxivorgabs1812023032020

[23] A Joshi E Fidalgo E Alegre and U de Leon ldquoDeep learningbased text summarization approaches databases and eval-uation measuresrdquo in Proceedings of the International Con-ference of Applications of Intelligent Systems Spain 2018

[24] Y LeCun Y Bengio and G Hinton ldquoDeep learningrdquoNaturevol 521 no 7553 pp 436ndash444 2015

[25] D Suleiman A Awajan andW Al Etaiwi ldquoe use of hiddenMarkov model in natural Arabic language processing asurveyrdquo Procedia Computer Science vol 113 pp 240ndash2472017

[26] H Wang and D Zeng ldquoFusing logical relationship infor-mation of text in neural network for text classificationrdquoMathematical Problems in Engineering vol 2020 pp 1ndash162020

[27] J Yi Y Zhang X Zhao and J Wan ldquoA novel text clusteringapproach using deep-learning vocabulary networkrdquo Mathe-matical Problems in Engineering vol 2017 pp 1ndash13 2017

[28] T Young D Hazarika S Poria and E Cambria ldquoRecenttrends in deep learning based natural language processing[review article]rdquo IEEE Computational Intelligence Magazinevol 13 no 3 pp 55ndash75 2018

[29] K Lopyrev Generating news headlines with recurrent neuralnetworks p 9 2015 httpsarxivorgabs151201712

[30] S Song H Huang and T Ruan ldquoAbstractive text summa-rization using LSTM-CNN Based Deep Learningrdquo Multi-media Tools and Applications 2018

[31] C L Giles G M Kuhn and R J Williams ldquoDynamic re-current neural networks theory and applicationsrdquo IEEETransactions on Neural Networks vol 5 no 2 pp 153ndash1561994

[32] A J Robinson ldquoAn application of recurrent nets to phoneprobability estimationrdquo IEEE Transactions on Neural Net-works vol 5 no 2 pp 298ndash305 1994

[33] D Bahdanau K Cho and Y Bengio ldquoNeural machinetranslation by jointly learning to align and translaterdquo inProceedings of the International Conference on LearningRepresentations Canada 2014 httparxivorgabs14090473

[34] M Schuster and K K Paliwal ldquoBidirectional recurrent neuralnetworksrdquo IEEE Transactions on Signal Processing vol 45no 11 pp 2673ndash2681 Nov 1997

[35] K Al-Sabahi Z Zuping and Y Kang Bidirectional Atten-tional Encoder-Decoder Model and Bidirectional Beam Searchfor Abstractive Summarization Cornell University IthacaNY USA 2018 httparxivorgabs180906662

[36] S Hochreiter and J Schmidhuber ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[37] K Cho ldquoLearning phrase representations using RNNencoderndashdecoder for statistical machine translationrdquo inProceedings of the 2014 Conference on Empirical Methods inNatural Language Processing (EMNLP) pp 1724ndash1734 DohaQatar 2014

[38] E Jobson and A Gutierrez Abstractive Text SummarizationUsing Attentive Sequence-To-Sequence RNNs p 8 2016

Mathematical Problems in Engineering 27

[39] S Chopra M Auli and A M Rush ldquoAbstractive sentencesummarization with attentive recurrent neural networksrdquo inProceedings of the NAACL-HLT16 pp 93ndash98 San Diego CAUSA 2016

[40] C Sun L Lv G Tian Q Wang X Zhang and L GuoldquoLeverage label and word embedding for semantic sparse webservice discoveryrdquo Mathematical Problems in Engineeringvol 2020 Article ID 5670215 8 pages 2020

[41] T Mikolov K Chen G Corrado and J Dean ldquoEfficientestimation of word representations in vector spacerdquo 2013httparxivorgabs13013781

[42] D Suleiman A Awajan and N Al-Madi ldquoDeep learningbased technique for Plagiarism detection in Arabic textsrdquo inProceedings of the 2017 International Conference on NewTrends in Computing Sciences (ICTCS) pp 216ndash222 AmmanJordan 2017

[43] D Suleiman and A Awajan ldquoComparative study of wordembeddings models and their usage in Arabic language ap-plicationsrdquo in Proceedings of the 2018 International ArabConference on Information Technology (ACIT) pp 1ndash7Werdanye Lebanon 2018

[44] J Pennington R Socher and C Manning ldquoGlove globalvectors for word representationrdquo in Proceedings of the 2014Conference on Empirical Methods in Natural Language Pro-cessing (EMNLP) pp 1532ndash1543 Doha Qatar 2014

[45] D Suleiman and A A Awajan ldquoUsing part of speech taggingfor improving Word2vec modelrdquo in Proceedings of the 20192nd International Conference on new Trends in ComputingSciences (ICTCS) pp 1ndash7 Amman Jordan 2019

[46] A Joulin E Grave P Bojanowski M Douze H Jegou andT Mikolov ldquoFastTextzip compressing text classificationmodelsrdquo 2016 httparxivorgabs161203651

[47] A Vaswani N Shazeer N Parmar et al ldquoAttention is all youneedrdquo Advances in Neural Information Processing Systemspp 5998ndash6008 2017

[48] J Devlin M-W Chang K Lee and K Toutanova ldquoPre-training of deep bidirectional transformers for languageunderstandingrdquo in Proceedings of the 2019 Conference of theNorth American Chapter of the Association for ComputationalLinguistics Human Language Technologies pp 4171ndash4186Minneapolis MN USA 2019

[49] Z Li Z Peng S Tang C Zhang and H Ma ldquoText sum-marization method based on double attention pointer net-workrdquo IEEE Access vol 8 pp 11279ndash11288 2020

[50] J Bradbury S Merity C Xiong and R Socher Quasi-re-current neural networks httpsarxivorgabs1611015762015

[51] U Khandelwal P Qi and D Jurafsky Neural Text Sum-marization Stanford University Stanford CA USA 2016

[52] Q Zhou N Yang F Wei and M Zhou ldquoSelective encodingfor abstractive sentence summarizationrdquo in Proceedings of the55th Annual Meeting of the Association for ComputationalLinguistics pp 1095ndash1104 Vancouver Canada July 2017

[53] Z Cao F Wei W Li and S Li ldquoFaithful to the original factaware neural abstractive summarizationrdquo in Proceedings of theAAAI Conference on Artificial Intelligence (AAAI) NewOrleans LA USA February 2018

[54] T Cai M Shen H Peng L Jiang and Q Dai ldquoImprovingtransformer with sequential context representations for ab-stractive text summarizationrdquo inNatural Language Processingand Chinese Computing J Tang M-Y Kan D Zhao S Liand H Zan Eds pp 512ndash524 Springer International Pub-lishing Cham Switzerland 2019

[55] R Nallapati B Zhou C N dos Santos C Gulcehre andB Xiang ldquoAbstractive text summarization using sequence-to-sequence RNNs and beyondrdquo in Proceedings of the CoNLL-16Berlin Germany August 2016

[56] A See P J Liu and C D Manning ldquoGet to the pointsummarization with pointer-generator networksrdquo in Pro-ceedings of the 55th ACL pp 1073ndash1083 Vancouver Canada2017

[57] R Paulus C Xiong and R Socher ldquoA deep reinforced modelfor abstractive summarizationrdquo 2017 httparxivorgabs170504304

[58] K S Bose R H Sarma M Yang Q Qu J Zhu and H LildquoDelineation of the intimate details of the backbone con-formation of pyridine nucleotide coenzymes in aqueous so-lutionrdquo Biochemical and Biophysical ResearchCommunications vol 66 no 4 1975

[59] C Li W Xu S Li and S Gao ldquoGuiding generation forabstractive text summarization based on key informationguide networkrdquo in Proceedings of the 2018 Conference of theNorth American Chapter of the Association for ComputationalLinguistics Human Language Technologies pp 55ndash60 NewOrleans LA USA 2018

[60] W Kryscinski R Paulus C Xiong and R Socher ldquoIm-proving abstraction in text summarizationrdquo in Proceedings ofthe Conference on Empirical Methods in Natural LanguageProcessing (EMNLP) Brussels Belgium November 2018

[61] K Yao L Zhang D Du T Luo L Tao and Y Wu ldquoDualencoding for abstractive text summarizationrdquo IEEE Trans-actions on Cybernetics pp 1ndash12 2018

[62] X Wan C Li R Wang D Xiao and C Shi ldquoAbstractivedocument summarization via bidirectional decoderrdquo in Ad-vanced DataMining and Applications G Gan B Li X Li andS Wang Eds pp 364ndash377 Springer International Publish-ing Cham Switzerland 2018

[63] Q Wang P Liu Z Zhu H Yin Q Zhang and L Zhang ldquoAtext abstraction summary model based on BERT word em-bedding and reinforcement learningrdquo Applied Sciences vol 9no 21 p 4701 2019

[64] E Egonmwan and Y Chali ldquoTransformer-based model forsingle documents neural summarizationrdquo in Proceedings ofthe 3rd Workshop on Neural Generation and Translationpp 70ndash79 Hong Kong 2019

[65] Y Liu and M Lapata ldquoText summarization with pretrainedencodersrdquo 2019 httparxivorgabs190808345

[66] P Doetsch A Zeyer and H Ney ldquoBidirectional decodernetworks for attention-based end-to-end offline handwritingrecognitionrdquo in Proceedings of the 2016 15th InternationalConference on Frontiers in Handwriting Recognition (ICFHR)pp 361ndash366 Shenzhen China 2016

[67] I Sutskever O Vinyals and Q V Le ldquoSequence to SequenceLearning with Neural Networksrdquo in Proceedings of the Ad-vances in Neural Information Processing Systems (NIPS)Montreal Quebec Canada December 2014

[68] D He H Lu Y Xia T Qin L Wang and T-Y LiuldquoDecoding with value networks for neural machine transla-tionrdquo in Proceedings of the Advances in Neural InformationProcessing Systems Long Beach CA USA December 2017

[69] D Harman and P Over ldquoe effects of human variation inDUC summarization evaluation text summarizationbranches outrdquo Proceedings of the ACL-04 Workshop vol 82004

[70] C Napoles M Gormley and B V Durme ldquoAnnotatedGigawordrdquo in Proceedings of the AKBC-WEKEX MontrealCanada 2012

28 Mathematical Problems in Engineering

[71] K M Hermann T Kocisky E Grefenstette et al ldquoMachinesto read and comprehendrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS) MontrealQuebec Canada December 2015

[72] M GruskyM Naaman and Y Artzi ldquoNewsroom a dataset of13 million summaries with diverse extractive strategiesrdquo inProceedings of the 2018 Conference of the North AmericanChapter of the Association for Computational LinguisticsHuman Language Technologies Association for Computa-tional Linguistics New Orleans LA USA pp 708ndash719 June2018

[73] C-Y Lin ldquoROUGE a package for automatic evaluation ofsummariesrdquo in Proceedings of the 2004 ACL WorkshopBarcelona Spain July 2004

[74] A Venkatraman M Hebert and J A Bagnell ldquoImprovingmulti-step prediction of learned time series modelsrdquo inProceedings of the Twenty-Ninth AAAI Conference on Arti-ficial Intelligence pp 3024ndash3030 Austin TX USA 2015

[75] I Goodfellow A Courville and Y Bengio Deep LearningMIT Press Cambridge MA USA 2015

[76] S Bengio O Vinyals N Jaitly and N Shazeer ldquoScheduledsampling for sequence prediction with recurrent neuralnetworksrdquo in Proceedings of the Annual Conference on NeuralInformation Processing Systems pp 1171ndash1179 MontrealQuebec Canada December 2015

[77] A Lavie and M J Denkowski ldquoe Meteor metric for au-tomatic evaluation of machine translationrdquo Machine Trans-lation vol 23 no 2-3 pp 105ndash115 2009

Mathematical Problems in Engineering 29

Page 26: Deep Learning Based Abstractive Text Summarization ...downloads.hindawi.com/journals/mpe/2020/9365340.pdfDeep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation

of dependency parsing at the encoder in a separate layer atthe top of the first hidden state layer We proposed the use ofthe word embedding which was built by considering thedependency parsing or part-of-speech tagging At the de-coder side the beam-search quality can be improved byconsidering the part-of-speech tagging of the words and itssurrounding words

Based on the new trends and evaluation results we thinkthat the most promising feature among all the features is theuse of the BERTpretrained model e quality of the modelsthat are based on the transformer is high and will yieldpromising results

8 Conclusion and Discussion

In recent years due to the vast quantity of data available onthe Internet the importance of the text summarisationprocess has increased Text summarisation can be dividedinto extractive and abstractive methods An extractive textsummarisationmethod generates a summary that consists ofwords and phrases from the original text based on linguisticsand statistical features while an abstractive text summa-risation method rephrases the original text to generate asummary that consists of novel phrases is paper reviewedrecent approaches that applied deep learning for abstractivetext summarisation datasets and measures for evaluation ofthese approaches Moreover the challenges encounteredwhen employing various approaches and their solutionswere discussed and analysed e overview of the reviewedapproaches yielded several conclusions e RNN and at-tention mechanism were the most commonly employeddeep learning techniques Some approaches applied LSTMto solve the gradient vanishing problem that was encoun-tered when using an RNN while other approaches applied aGRU Additionally the sequence-to-sequence model wasutilised for abstractive summarisation Several datasets wereemployed including Gigaword CNNDaily Mail and the

New York Times Gigaword was selected for single-sentencesummarisation and CNNDaily Mail was employed formultisentence summarisation Furthermore ROUGE1ROUGE2 and ROUGE-L were utilised to evaluate thequality of the summaries e experiments showed that thehighest values of ROUGE1 ROUGE2 and ROUGE-L wereobtained in text summarisation with a pretrained encodermode with values of 4385 2034 and 399 respectively ebest results were achieved by the models that applyTransformer e most common challenges faced during thesummarisation process were the unavailability of a goldentoken at testing time the presence of OOV words summarysentence repetition sentence inaccuracy and the presence offake facts In addition there are several issues that must beconsidered in abstractive summarisation including thedataset evaluation measures and quality of the generatedsummary

Data Availability

No data were used to support this study

Conflicts of Interest

e authors declare no conflicts of interest

References

[1] M Allahyari S Pouriyeh M Assefi et al ldquoText summari-zation techniques a brief surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 8 no 102017

[2] A B Al-Saleh and M E B Menai ldquoAutomatic Arabic textsummarization a surveyrdquo Artificial Intelligence Reviewvol 45 no 2 pp 203ndash234 2016

[3] A Turpin Y Tsegay D Hawking and H E Williams ldquoFastgeneration of result snippets in web searchrdquo in Proceedings ofthe 30th Annual international ACM SIGIR Conference on

Word1 Word2 Word4 Word5Word3 Word6 Word7helliphellip

Encoder

+

Word_Sum1ltStartgt Word_

Sum2

+

HC C

Decoder

helliphellip

Figure 22 A new word is added to the output sequence by combining the current hidden state ldquoHrdquo of the decoder and the two contextvectors marked as ldquoCrdquo [57]

26 Mathematical Problems in Engineering

Research and Development in information Retrieval-SIGIRrsquo07p 127 Amsterdam e Netherlands 2007

[4] E. D. Trippe, "A vision for health informatics: introducing the SKED framework, an extensible architecture for scientific knowledge extraction from data," 2017, http://arxiv.org/abs/1706.07992.

[5] S. Syed, Abstractive Summarization of Social Media Posts: A Case Study Using Deep Learning, Master's thesis, Bauhaus University, Weimar, Germany, 2017.

[6] D. Suleiman and A. A. Awajan, "Deep learning based extractive text summarization: approaches, datasets and evaluation measures," in Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 204–210, Granada, Spain, 2019.

[7] Q. A. Al-Radaideh and D. Q. Bataineh, "A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms," Cognitive Computation, vol. 10, no. 4, pp. 651–669, 2018.

[8] C. Sunitha, A. Jaya, and A. Ganesh, "A study on abstractive summarization techniques in Indian languages," Procedia Computer Science, vol. 87, pp. 25–31, 2016.

[9] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.

[10] A. Khan and N. Salim, "A review on abstractive summarization methods," Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 64–72, 2014.

[11] N. Moratanch and S. Chitrakala, "A survey on abstractive text summarization," in Proceedings of the 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–7, Nagercoil, India, 2016.

[12] S. Shimpikar and S. Govilkar, "A survey of text summarization techniques for Indian regional languages," International Journal of Computer Applications, vol. 165, no. 11, pp. 29–33, 2017.

[13] N. R. Kasture, N. Yargal, N. N. Singh, N. Kulkarni, and V. Mathur, "A survey on methods of abstractive text summarization," International Journal for Research in Emerging Science and Technology, vol. 1, no. 6, p. 5, 2014.

[14] P. Kartheek Rachabathuni, "A survey on abstractive summarization techniques," in Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 762–765, Coimbatore, India, 2017.

[15] S. Yeasmin, P. B. Tumpa, A. M. Nitu, E. Ali, and M. I. Afjal, "Study of abstractive text summarization techniques," American Journal of Engineering Research, vol. 8, 2017.

[16] A. Khan, N. Salim, H. Farman et al., "Abstractive text summarization based on improved semantic graph approach," International Journal of Parallel Programming, vol. 46, no. 5, pp. 992–1016, 2018.

[17] Y. Jaafar and K. Bouzoubaa, "Towards a new hybrid approach for abstractive summarization," Procedia Computer Science, vol. 142, pp. 286–293, 2018.

[18] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015.

[19] N. Raphal, H. Duwarah, and P. Daniel, "Survey on abstractive text summarization," in Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 513–517, Chennai, India, 2018.

[20] Y. Dong, "A survey on neural network-based summarization methods," 2018, http://arxiv.org/abs/1804.04589.

[21] A. Mahajani, V. Pandya, I. Maria, and D. Sharma, "A comprehensive survey on extractive and abstractive techniques for text summarization," in Ambient Communications and Computer Systems, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds., vol. 904, pp. 339–351, Springer, Singapore, 2019.

[22] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, "Neural abstractive text summarization with sequence-to-sequence models: a survey," 2020, http://arxiv.org/abs/1812.02303.

[23] A. Joshi, E. Fidalgo, E. Alegre, and U. de Leon, "Deep learning based text summarization: approaches, databases and evaluation measures," in Proceedings of the International Conference of Applications of Intelligent Systems, Spain, 2018.

[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[25] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The use of hidden Markov model in natural Arabic language processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.

[26] H. Wang and D. Zeng, "Fusing logical relationship information of text in neural network for text classification," Mathematical Problems in Engineering, vol. 2020, pp. 1–16, 2020.

[27] J. Yi, Y. Zhang, X. Zhao, and J. Wan, "A novel text clustering approach using deep-learning vocabulary network," Mathematical Problems in Engineering, vol. 2017, pp. 1–13, 2017.

[28] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

[29] K. Lopyrev, "Generating news headlines with recurrent neural networks," p. 9, 2015, https://arxiv.org/abs/1512.01712.

[30] S. Song, H. Huang, and T. Ruan, "Abstractive text summarization using LSTM-CNN based deep learning," Multimedia Tools and Applications, 2018.

[31] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.

[32] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[33] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the International Conference on Learning Representations, Canada, 2014, http://arxiv.org/abs/1409.0473.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, November 1997.

[35] K. Al-Sabahi, Z. Zuping, and Y. Kang, Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization, Cornell University, Ithaca, NY, USA, 2018, http://arxiv.org/abs/1809.06662.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] K. Cho, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014.

[38] E. Jobson and A. Gutierrez, Abstractive Text Summarization Using Attentive Sequence-to-Sequence RNNs, p. 8, 2016.


[39] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the NAACL-HLT16, pp. 93–98, San Diego, CA, USA, 2016.

[40] C. Sun, L. Lv, G. Tian, Q. Wang, X. Zhang, and L. Guo, "Leverage label and word embedding for semantic sparse web service discovery," Mathematical Problems in Engineering, vol. 2020, Article ID 5670215, 8 pages, 2020.

[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, http://arxiv.org/abs/1301.3781.

[42] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep learning based technique for plagiarism detection in Arabic texts," in Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), pp. 216–222, Amman, Jordan, 2017.

[43] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in Proceedings of the 2018 International Arab Conference on Information Technology (ACIT), pp. 1–7, Werdanye, Lebanon, 2018.

[44] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, 2014.

[45] D. Suleiman and A. A. Awajan, "Using part of speech tagging for improving Word2vec model," in Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), pp. 1–7, Amman, Jordan, 2019.

[46] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov, "FastText.zip: compressing text classification models," 2016, http://arxiv.org/abs/1612.03651.

[47] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, Minneapolis, MN, USA, 2019.

[49] Z. Li, Z. Peng, S. Tang, C. Zhang, and H. Ma, "Text summarization method based on double attention pointer network," IEEE Access, vol. 8, pp. 11279–11288, 2020.

[50] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," 2016, https://arxiv.org/abs/1611.01576.

[51] U. Khandelwal, P. Qi, and D. Jurafsky, Neural Text Summarization, Stanford University, Stanford, CA, USA, 2016.

[52] Q. Zhou, N. Yang, F. Wei, and M. Zhou, "Selective encoding for abstractive sentence summarization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1095–1104, Vancouver, Canada, July 2017.

[53] Z. Cao, F. Wei, W. Li, and S. Li, "Faithful to the original: fact aware neural abstractive summarization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, February 2018.

[54] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Natural Language Processing and Chinese Computing, J. Tang, M.-Y. Kan, D. Zhao, S. Li, and H. Zan, Eds., pp. 512–524, Springer International Publishing, Cham, Switzerland, 2019.

[55] R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the CoNLL-16, Berlin, Germany, August 2016.

[56] A. See, P. J. Liu, and C. D. Manning, "Get to the point: summarization with pointer-generator networks," in Proceedings of the 55th ACL, pp. 1073–1083, Vancouver, Canada, 2017.

[57] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017, http://arxiv.org/abs/1705.04304.

[58] K. S. Bose, R. H. Sarma, M. Yang, Q. Qu, J. Zhu, and H. Li, "Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution," Biochemical and Biophysical Research Communications, vol. 66, no. 4, 1975.

[59] C. Li, W. Xu, S. Li, and S. Gao, "Guiding generation for abstractive text summarization based on key information guide network," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 55–60, New Orleans, LA, USA, 2018.

[60] W. Kryscinski, R. Paulus, C. Xiong, and R. Socher, "Improving abstraction in text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

[61] K. Yao, L. Zhang, D. Du, T. Luo, L. Tao, and Y. Wu, "Dual encoding for abstractive text summarization," IEEE Transactions on Cybernetics, pp. 1–12, 2018.

[62] X. Wan, C. Li, R. Wang, D. Xiao, and C. Shi, "Abstractive document summarization via bidirectional decoder," in Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang, Eds., pp. 364–377, Springer International Publishing, Cham, Switzerland, 2018.

[63] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Applied Sciences, vol. 9, no. 21, p. 4701, 2019.

[64] E. Egonmwan and Y. Chali, "Transformer-based model for single documents neural summarization," in Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 70–79, Hong Kong, 2019.

[65] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019, http://arxiv.org/abs/1908.08345.

[66] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361–366, Shenzhen, China, 2016.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.

[68] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T.-Y. Liu, "Decoding with value networks for neural machine translation," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 2017.

[69] D. Harman and P. Over, "The effects of human variation in DUC summarization evaluation," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, 2004.

[70] C. Napoles, M. Gormley, and B. V. Durme, "Annotated Gigaword," in Proceedings of the AKBC-WEKEX, Montreal, Canada, 2012.


[71] K. M. Hermann, T. Kocisky, E. Grefenstette et al., "Teaching machines to read and comprehend," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2015.

[72] M. Grusky, M. Naaman, and Y. Artzi, "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, Association for Computational Linguistics, New Orleans, LA, USA, June 2018.

[73] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the 2004 ACL Workshop, Barcelona, Spain, July 2004.

[74] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3024–3030, Austin, TX, USA, 2015.

[75] I. Goodfellow, A. Courville, and Y. Bengio, Deep Learning, MIT Press, Cambridge, MA, USA, 2015.

[76] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1171–1179, Montreal, Quebec, Canada, December 2015.

[77] A. Lavie and M. J. Denkowski, "The Meteor metric for automatic evaluation of machine translation," Machine Translation, vol. 23, no. 2-3, pp. 105–115, 2009.
