Page 1:

Abstractive Text Summarization Using Seq2Seq Attention Models

Soumye Singhal Prof. Arnab Bhattacharya

Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur

22nd November, 2017

Page 2:

Outline

- The Problem
  - Why Text Summarization
  - Extractive vs Abstractive

- Baseline Model
  - Vanilla Encoder-Decoder
  - Attention is all you need!

- Metrics and Datasets

- Improvements
  - Hierarchical Attention
  - Pointer-Generator Network
  - Coverage Mechanism
  - Intra-Attention
  - Reinforcement Based Training

- Challenges and Way Forward

Page 4:

Why Text Summarization?

- In the modern Internet age, textual data is ever increasing.

- We need some way to condense this data while preserving the information and meaning.

- Text summarization is a fundamental problem that we need to solve.

- It would help in easy and fast retrieval of information.

Page 6:

Extractive vs Abstractive

- Extractive summarization
  - Copies parts/sentences of the source text and then combines those parts/sentences to render a summary.
  - The importance of a sentence is based on linguistic and statistical features.

- Abstractive summarization
  - These methods try to first understand the text and then rephrase it in a shorter manner, possibly using different words.
  - For a perfect abstractive summary, the model has to first truly understand the document and then express that understanding concisely, possibly using new words and phrases.
  - Much harder than extractive.
  - Requires complex capabilities like generalization, paraphrasing and incorporating real-world knowledge.

Page 7:

Deep Learning

- The majority of the work has traditionally focused on Extractive approaches, because it is easier to define hard-coded rules that select important sentences than to generate new ones.

- But they often don't summarize long and complex texts well, as they are very restrictive.

- Traditional rule-based AI does poorly on Abstractive Text Summarization.

- Inspired by the performance of the Neural Attention Model in the closely related task of Machine Translation, Rush et al. 2015 and Chopra et al. 2016 applied this Neural Attention Model to Abstractive Text Summarization and found that it already performed very well and beat the previous non-Deep-Learning-based approaches.

Page 9:

Recurrent Neural Network

Figure: An unrolled RNN

- wi: input tokens of the source article

- hi: encoder hidden states

- Pvocab = softmax(V hi + b) is the distribution over the vocabulary from which we sample outi

Page 10:

Long Short-Term Memory

- If the context of a word is far away, RNNs struggle to learn.

- Vanishing Gradient Problem.

- LSTMs selectively pass and forget information.

Footnote: Image taken from colah.github.io

Page 11:

Long Short-Term Memory

Forget Gate Layer

- ft = σ(Wf [ht−1, xt] + bf)

- ft scales the previous cell state: ft ⊗ Ct−1

Input Gate Layer

- it = σ(Wi [ht−1, xt] + bi)

- C̃t = tanh(WC [ht−1, xt] + bC)

- Ct = ft ⊗ Ct−1 + it ⊗ C̃t

Output Gate Layer

- ot = σ(Wo [ht−1, xt] + bo)

- ht = ot ⊗ tanh(Ct) (see the sketch below)
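
To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM step following the equations above. The weight shapes and argument names are illustrative assumptions, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM step; each W_* has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # new cell state
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # new hidden state
    return h_t, c_t
```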

Page 12:

Bi-Directional RNN

Figure: A bi-directional RNN with a forward and a backward pass over the input embeddings and a prediction at each position.

- Two passes over the source compute hidden states ←ht and →ht.

- ht = [←ht, →ht] now encodes both past and future information (sketched below).
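
A minimal sketch of how the two passes are combined, assuming the forward and backward RNNs have already produced per-position hidden states as NumPy arrays (these names are assumptions for illustration):

```python
import numpy as np

def bidirectional_states(forward_states, backward_states):
    """forward_states[t] is →h_t; backward_states is produced right-to-left,
    so it is reversed to align positions. Returns h_t = [←h_t, →h_t]."""
    return [np.concatenate([f, b])
            for f, b in zip(forward_states, backward_states[::-1])]
```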

Page 13:

Vanilla Encoder-Decoder

- It consists of an Encoder (a bidirectional LSTM) and a Decoder LSTM network.

- The final hidden state from the Encoder (the thought vector) is passed into the Decoder (see the sketch below).

Footnote: Image taken from colah.github.io
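
The following PyTorch sketch shows one plausible shape of such a vanilla encoder-decoder (no attention yet). The layer sizes, the shared embedding, the single-layer setup and teacher-forced decoding over `tgt_ids` are assumptions for illustration, not the slides' exact configuration.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal vanilla encoder-decoder: a bidirectional LSTM encoder whose
    final state (the thought vector) initializes an LSTM decoder."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True,
                               bidirectional=True)
        self.decoder = nn.LSTM(emb_dim, 2 * hid_dim, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, (h_n, c_n) = self.encoder(self.embed(src_ids))
        # merge the forward/backward final states into the thought vector
        h0 = torch.cat([h_n[0], h_n[1]], dim=-1).unsqueeze(0)
        c0 = torch.cat([c_n[0], c_n[1]], dim=-1).unsqueeze(0)
        dec_out, _ = self.decoder(self.embed(tgt_ids), (h0, c0))
        return self.out(dec_out)   # logits over the vocabulary at each step
```

Training would compare these per-step logits against the shifted target summary with a cross-entropy loss.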

Page 15:

Why do we need Attention?

- The basic encoder-decoder model fails to scale up.

- The main bottleneck is the fixed-size thought vector.

- It is not able to capture all the relevant information of the input sequence as the model sizes up.

- At each generation step, only a part of the input is relevant.

- This is where attention comes in.

- It helps the model decide which part of the input encoding to focus on at each generation step in order to generate novel words.

- At each step, the decoder outputs a hidden state ht, from which we generate the output.

Page 16:

Attention is all you need!

- importance_i^t = V tanh(ei W1 + ht W2 + battn)

- Attention distribution: a^t = softmax(importance^t)

- Context vector: h*_t = Σi a_i^t ei (sketched below)

Footnote: Image stylized from https://talbaumel.github.io/attention/
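
A NumPy sketch of the attention computation on this slide; the matrix shapes are assumptions (ei and ht of dimension d, projected to an attention dimension d_a):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(enc_states, h_t, W1, W2, v, b_attn):
    """enc_states: (n, d) rows e_i; h_t: (d,); W1, W2: (d, d_a); v, b_attn: (d_a,).
    Returns the attention distribution a^t and the context vector h*_t."""
    scores = np.array([v @ np.tanh(e_i @ W1 + h_t @ W2 + b_attn)
                       for e_i in enc_states])          # importance_i^t
    a_t = softmax(scores)                               # attention distribution
    h_star = (a_t[:, None] * enc_states).sum(axis=0)    # context vector
    return a_t, h_star
```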

Page 17:

Training

- The context vector is then fed through two layers to generate a distribution over the vocabulary, from which we sample.

- Pvocab(w) = softmax(V′(V [ht, h*_t] + b) + b′)

- The loss at time step t is losst = − log Pvocab(w*_t), where w*_t is the target summary word.

- LOSS = (1/T) Σt losst

- We then use the Backpropagation Algorithm to get the gradients and learn the parameters (see the sketch below).
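
A sketch of the vocabulary projection and the averaged negative log-likelihood loss described above; variable names follow the slide, the shapes are assumptions:

```python
import numpy as np

def p_vocab(h_t, h_star, V, b, V_prime, b_prime):
    """P_vocab = softmax(V'(V[h_t, h*_t] + b) + b')."""
    hidden = V @ np.concatenate([h_t, h_star]) + b
    logits = V_prime @ hidden + b_prime
    e = np.exp(logits - logits.max())
    return e / e.sum()

def summary_loss(p_vocab_per_step, target_word_ids):
    """LOSS = (1/T) * sum_t -log P_vocab(w*_t)."""
    losses = [-np.log(p[w] + 1e-12)
              for p, w in zip(p_vocab_per_step, target_word_ids)]
    return float(np.mean(losses))
```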

Page 18:

Generating the Summaries

At each step, the decoder outputs a probability distribution over the target vocabulary. To get the output word at that step we can do one of the following:

- Greedy sampling, i.e. choose the mode of the distribution.

- Sample from the distribution.

- Beam search: choose the top k most likely target words and feed them all into the next decoder input. At each time step t the decoder therefore gets k different possible inputs. It computes the top k most likely target words for each of these inputs, keeps only the top k out of the k² candidates and rejects the rest. This process continues until the sequences are complete, and it ensures that each strong candidate word gets a fair shot at appearing in the summary (sketched below).
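
A generic beam-search sketch. The `step_fn(prefix)` interface, which returns the top candidate `(token, log_prob)` continuations for a given decoder prefix, is an assumed abstraction over the decoder, not something defined in the slides.

```python
def beam_search(step_fn, start_token, end_token, beam_size=4, max_len=30):
    """Keeps the top-k of the k^2 expansions at every step."""
    beams = [([start_token], 0.0)]          # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == end_token:     # completed hypotheses are set aside
                finished.append((prefix, score))
                continue
            for token, logp in step_fn(prefix):
                candidates.append((prefix + [token], score + logp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]      # prune k^2 candidates back to k
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```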

Page 19:

Metrics

- If a target summary is not given:
  - We need a similarity measure between the summary and the source document.
  - In a good summary, the topics covered would be similar.
  - Use topic models like Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).

- If the target summary is given:
  - Use metrics like ROUGE (Lin 2004) and METEOR.
  - They are essentially string-matching metrics.
  - ROUGE-N measures the overlap of N-grams between the system and reference summaries (sketched below).
  - ROUGE-L is based on longest common subsequences and takes sentence-level similarity into account.
  - ROUGE-S is the skip-bigram variant.
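
For intuition, a small sketch of ROUGE-N recall as plain n-gram overlap; real evaluations use the official ROUGE toolkit, which also handles stemming, multiple references and precision/recall/F-scores.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(system_tokens, reference_tokens, n=2):
    """Fraction of reference n-grams that also appear in the system summary."""
    sys_c = ngram_counts(system_tokens, n)
    ref_c = ngram_counts(reference_tokens, n)
    overlap = sum(min(count, sys_c[gram]) for gram, count in ref_c.items())
    total = sum(ref_c.values())
    return overlap / total if total else 0.0

# e.g. rouge_n_recall("police kill the gunman".split(),
#                     "police killed the gunman".split(), n=1)  -> 0.75
```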

Page 20:

Dataset

Sentence-level datasets

- DUC-2004

- Gigaword

Large-scale dataset by Nallapati et al. 2016

- CNN/Daily Mail dataset adapted for summarization.

Page 21:

Problems with Baseline

Though the baseline gives decent results, it is clearly plagued by several problems:

- It sometimes reproduces factually incorrect details.

- It struggles with out-of-vocabulary (OOV) words.

- It is also somewhat repetitive, focusing on a word or phrase multiple times.

- It focuses mainly on single-sentence summary tasks like headline generation.

Page 23:

Feature-rich Encoder

- Introduced by Nallapati et al. 2016.

- The aim is to feed more information about the source text into the encoder.

- Apart from word embeddings like word2vec and GloVe, it also incorporates linguistic features like:
  - POS (part-of-speech) tags
  - named-entity tags
  - TF-IDF statistics

- Though it speeds up training, it hurts the abstractive capabilities of the model.

Page 24:

Hierarchical Attention

- Introduced by Nallapati et al. 2016.

- For bigger source documents, they also try to identify key sentences for the summary.

- Two bi-directional RNNs run over the source text:
  - one at the word level,
  - another at the sentence level.

- Word-level attention is then re-weighted by the corresponding sentence-level attention (sketched below):

  P^a(j) = P^a_w(j) P^a_s(s(j)) / Σ_{k=1}^{Nd} P^a_w(k) P^a_s(s(k))

  where s(j) is the sentence containing word position j and Nd is the number of word positions in the document.
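
A small NumPy sketch of this re-weighting, assuming the word-level and sentence-level attention distributions have already been computed:

```python
import numpy as np

def hierarchical_attention(word_attn, sent_attn, sent_of_word):
    """word_attn: (N_d,) word-level attention; sent_attn: (S,) sentence-level
    attention; sent_of_word[j] = index of the sentence containing word j.
    Returns the combined, renormalized attention P^a over word positions."""
    scores = word_attn * sent_attn[np.asarray(sent_of_word)]
    return scores / scores.sum()
```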

Page 26:

Pointer-Generator Network

Introduced by See et al. 2017.

- Helps to solve the challenge of OOV words and factual errors.

- Works better for multi-sentence summaries.

- The idea is to choose, at each step of generation, between generating a word from the fixed vocabulary and copying one from the source document.

- It brings in the power of extractive methods by pointing (Vinyals et al. 2015).

- So for OOV words, simple generation would produce UNK, but here the network can copy the OOV word from the source text.

Page 27:

Pointer-Generator Network

Footnote: Image taken from blog, www.abigailsee.com

Page 28:

Pointer-Generator Network

- At each step we calculate a generation probability pgen:

  pgen = σ(w_h*^T h*_t + w_s^T ht + w_x^T xt + bptr)

- xt is the decoder input.

- The parameters w_h*, w_s, w_x, bptr are learnable.

- pgen is then used as a switch:

  P(w) = pgen Pvocab(w) + (1 − pgen) Σ_{i: wi = w} a_i^t

- Note that for an OOV word Pvocab(w) = 0, so we end up pointing (sketched below).
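
A NumPy sketch of the final mixed distribution, assuming an "extended" vocabulary in which source OOV words get temporary IDs beyond the fixed vocabulary (in the spirit of See et al. 2017); the shapes and names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pointer_generator_dist(p_vocab, attn, src_ids, h_star, h_t, x_t,
                           w_h, w_s, w_x, b_ptr, extended_vocab_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i^t."""
    p_gen = sigmoid(w_h @ h_star + w_s @ h_t + w_x @ x_t + b_ptr)
    p = np.zeros(extended_vocab_size)
    p[:len(p_vocab)] = p_gen * p_vocab        # generation part
    for a_i, src_id in zip(attn, src_ids):    # copy part: scatter-add attention
        p[src_id] += (1.0 - p_gen) * a_i
    return p
```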

Page 30:

Coverage Mechanism

- The repetitiveness of the model can be attributed to increased and continuous attention to a particular word.

- So we can use the Coverage Model of Tu et al. 2016.

- Coverage vector: c^t = Σ_{t'=0}^{t−1} a^{t'}

- Intuitively, by summing the attention over all previous steps we keep track of how much coverage each encoding ei has received.

- This coverage vector is given as an additional input to the attention mechanism:

  importance_i^t = V tanh(ei W1 + ht W2 + Wc c_i^t + battn)

- We also penalize attending to things that have already been covered:

  covlosst = Σi min(a_i^t, c_i^t)

  penalizes overlap between the attention at this step and the coverage so far (see the sketch below).

- losst = − log P(w*_t) + λ covlosst
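
A sketch of the coverage bookkeeping and the coverage loss over a decoded sequence, given the per-step attention distributions as NumPy arrays:

```python
import numpy as np

def coverage_loss(attention_steps):
    """covloss = sum_t sum_i min(a_i^t, c_i^t), where c^t is the sum of all
    attention distributions from the previous steps."""
    coverage = np.zeros_like(attention_steps[0])
    total = 0.0
    for a_t in attention_steps:
        total += np.minimum(a_t, coverage).sum()   # penalize re-attending
        coverage += a_t                            # c^{t+1} = c^t + a^t
    return total
```

The full training loss on the slide adds λ times this term to the negative log-likelihood.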

Page 32:

Intra-Attention

- Traditional approaches attend only over the encoder states.

- But the current word being generated also depends on which words were generated previously.

- So Paulus et al. 2017 used intra-attention over the decoder outputs.

- This approach also helps avoid repetition.

- A decoder context vector c*_t is generated in a way analogous to encoder attention.

- c*_t is passed on to generate Pvocab(w).

Page 34:

How to correct my mistakes?

- During training, we always feed the correct inputs into the decoder, no matter what the output was at the previous step.

- The model therefore doesn't learn to recover from its mistakes.

- It assumes that it will be given the golden token at each step of decoding.

- During testing, if the model produces even one wrong word, recovery is hard.

A naive way to rectify this problem is, during training, to toss a coin with P[heads] = p at each step to decide between the generated output from the previous step and the golden token (sketched below).
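
A sketch of that coin toss (a scheduled-sampling-style choice); the probability value is an arbitrary assumption, not taken from the slides.

```python
import random

def next_decoder_input(gold_token, prev_model_token, p_gold=0.75):
    """With probability p_gold feed the golden token (teacher forcing),
    otherwise feed back the model's own previous prediction."""
    return gold_token if random.random() < p_gold else prev_model_token
```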

Page 35:

Training using Reinforcement Learning

- There are various ways in which a document can be effectively summarized. The reference summary is just one of those possible ways.

- There should be some scope for variation in the summary.

This is the idea behind reinforcement-based training, introduced by Paulus et al. 2017, which gave a significant improvement over the baseline and is the current state of the art.

- During training, we first let the model generate a summary using its own decoder outputs as inputs.

- After the model produces its own summary, we evaluate it against the reference summary using the ROUGE metric.

- We then define a loss based on this score: if the score is high, the summary is good and the loss should be low, and vice versa.

Page 36:

Training using Reinforcement Learning

Figure: The model generates a summary, a scorer compares it with the golden summary and returns a reward, which is used to update the model.

Page 37:

Policy Learning

- We use self-critical policy gradient training.

- We generate two output sequences, y^s and ŷ.

- y^s_t ∼ P(y^s_t | y^s_1, ..., y^s_{t−1}, x), i.e. by sampling, while ŷ is obtained by greedy search.

- y* is the ground truth.

- r(y) is the reward for a sequence y compared against y*.

- Lrl = (r(ŷ) − r(y^s)) Σt log P(y^s_t | y^s_1, ..., y^s_{t−1}, x) (see the sketch below)
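
A sketch of the self-critical loss for one sampled sequence, given its per-step log-probabilities and the ROUGE rewards of the sampled and greedy summaries:

```python
import numpy as np

def self_critical_loss(sampled_log_probs, reward_sampled, reward_greedy):
    """L_rl = (r(y_greedy) - r(y_sampled)) * sum_t log P(y_sampled_t | ...).
    If the sample beats the greedy baseline, minimizing L_rl raises the
    sample's log-probability."""
    return (reward_greedy - reward_sampled) * float(np.sum(sampled_log_probs))
```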

Page 38:

Problems in Training using Reinforcement Learning

- It is possible to achieve a very high ROUGE score without the summary being human-readable.

- This reflects that ROUGE doesn't exactly capture the way humans evaluate summaries.

- Since the above method optimizes for the ROUGE score, it may produce summaries with very high ROUGE scores that are barely human-readable.

- To curb this problem, we train our model in a mixed fashion, using both Reinforcement Learning and Supervised Training.

- We can interpret this as RL training giving global sentence/summary-level supervision and supervised training giving local word-level supervision.

- Lmixed = γ Lrl + (1 − γ) Lml (see below)
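
The mixed objective is a simple interpolation of the two losses above; γ is a hyperparameter (Paulus et al. 2017 use a value close to 1) and is deliberately left unset here.

```python
def mixed_loss(loss_rl, loss_ml, gamma):
    """L_mixed = gamma * L_rl + (1 - gamma) * L_ml."""
    return gamma * loss_rl + (1 - gamma) * loss_ml
```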

Page 39:

Challenges

- As pointed out by Paulus et al. 2017, ROUGE as a metric is deficient.

- Dataset issues:
  - The majority of the available data are news datasets.
  - A good summary can be produced just by looking at the top few sentences.
  - All the models discussed above assume this and look at only the top 5-6 sentences of the source article.
  - We need a richer dataset for multi-sentence text summarization.

- Scalability issues: the multi-sentence problem is largely unsolved.

- We need a lot of data and computational power.

Page 40:

Future Work

- To address the problem of the ROUGE metric in the Reinforcement Learning based training method, we can instead first learn a separate Discriminator which, given a document and a corresponding summary, tells how good the summary is.

- The problem of long-document summarization has two main issues:
  - The Vanishing Gradient Problem.
  - LSTMs help information pass along further, but the errors don't propagate far back in time well; a maximum of around 20-25 steps only.
  - Logarithmic Residual LSTMs.

Page 41:

Logarithmic Residual LSTMs

Figure: Unrolled sketch of a Logarithmic Residual LSTM, with skip connections from earlier states (over inputs x1, x2, ..., xt−1, xt) into the state at step t.

Page 42:

References I

Chopra, Sumit et al. (2016). "Abstractive sentence summarization with attentive recurrent neural networks". In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 93–98.

Lin, Chin-Yew (2004). "ROUGE: A package for automatic evaluation of summaries". In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Vol. 8. Barcelona, Spain.

Nallapati, Ramesh et al. (2016). "Abstractive text summarization using sequence-to-sequence RNNs and beyond". In: arXiv preprint arXiv:1602.06023.

Paulus, Romain et al. (2017). "A Deep Reinforced Model for Abstractive Summarization". In: arXiv preprint arXiv:1705.04304.

Page 43:

References II

Rush, Alexander M. et al. (2015). "A neural attention model for abstractive sentence summarization". In: arXiv preprint arXiv:1509.00685.

See, Abigail et al. (2017). "Get To The Point: Summarization with Pointer-Generator Networks". In: arXiv preprint arXiv:1704.04368.

Tu, Zhaopeng et al. (2016). "Modeling coverage for neural machine translation". In: arXiv preprint arXiv:1601.04811.

Vinyals, Oriol et al. (2015). "Pointer networks". In: Advances in Neural Information Processing Systems, pp. 2692–2700.

