Page 1:

Abstractive Text Summarization Using Seq2Seq Attention Models

Soumye Singhal Prof. Arnab Bhattacharya

Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur

22nd November, 2017

Page 2:

Outline

- The Problem
  - Why Text Summarization
  - Extractive vs Abstractive

- Baseline Model
  - Vanilla Encoder-Decoder
  - Attention is all you need!

- Metrics and Datasets

- Improvements
  - Hierarchical Attention
  - Pointer-Generator Network
  - Coverage Mechanism
  - Intra-Attention
  - Reinforcement Based Training

- Challenges and Way Forward

Page 4:

Why Text Summarization?

- In the modern Internet age, textual data is ever increasing.

- We need some way to condense this data while preserving the information and meaning.

- Text summarization is a fundamental problem that we need to solve.

- It would help in easy and fast retrieval of information.

Page 6:

Extractive vs Abstractive

- Extractive summarization
  - Copies parts/sentences of the source text and then combines those parts/sentences to render a summary.
  - The importance of a sentence is based on linguistic and statistical features.

- Abstractive summarization
  - These methods try to first understand the text and then rephrase it in a shorter manner, possibly using different words.
  - For a perfect abstractive summary, the model has to first truly understand the document and then express that understanding concisely, possibly using new words and phrases.
  - Much harder than extractive.
  - Requires complex capabilities like generalization, paraphrasing and incorporating real-world knowledge.

Page 7:

Deep Learning

- The majority of the work has traditionally focused on Extractive approaches, because it is easier to define hard-coded rules that select important sentences than to generate new ones.

- But they often don't summarize long and complex texts well, as they are very restrictive.

- Traditional rule-based AI does poorly on Abstractive Text Summarization.

- Inspired by the performance of the Neural Attention Model in the closely related task of Machine Translation, Rush et al. 2015 and Chopra et al. 2016 applied this Neural Attention Model to Abstractive Text Summarization and found that it already performed very well and beat the previous non-Deep-Learning-based approaches.

Page 9:

Recurrent Neural Network

Figure: An unrolled RNN

- wi: input tokens of the source article

- hi: encoder hidden states

- Pvocab = softmax(V hi + b) is the distribution over the vocabulary from which we sample outi

Page 10:

Long Short-Term Memory

- If the context of a word is far away, RNNs struggle to learn.

- Vanishing Gradient Problem.

- LSTMs selectively pass and forget information.

Footnote: Image taken from colah.github.io

Page 11:

Long Short-Term Memory

Forget Gate Layer

- ft = σ(Wf [ht−1, xt] + bf)

- ft scales the previous cell state: ft ⊗ Ct−1

Input Gate Layer

- it = σ(Wi [ht−1, xt] + bi)

- C̃t = tanh(WC [ht−1, xt] + bC)

- Ct = ft ⊗ Ct−1 + it ⊗ C̃t

Output Gate Layer

- ot = σ(Wo [ht−1, xt] + bo)

- ht = ot ⊗ tanh(Ct) (see the sketch below)
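
To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM step following the equations above. The weight shapes and argument names are illustrative assumptions, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM step; each W_* has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # new cell state
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # new hidden state
    return h_t, c_t
```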

Page 12:

Bi-Directional RNN

Figure: A bi-directional RNN with a forward and a backward pass over the input embeddings and a prediction at each position.

- Two passes over the source compute hidden states ←ht and →ht.

- ht = [←ht, →ht] now encodes both past and future information (sketched below).
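
A minimal sketch of how the two passes are combined, assuming the forward and backward RNNs have already produced per-position hidden states as NumPy arrays (these names are assumptions for illustration):

```python
import numpy as np

def bidirectional_states(forward_states, backward_states):
    """forward_states[t] is →h_t; backward_states is produced right-to-left,
    so it is reversed to align positions. Returns h_t = [←h_t, →h_t]."""
    return [np.concatenate([f, b])
            for f, b in zip(forward_states, backward_states[::-1])]
```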

Page 13:

Vanilla Encoder-Decoder

- It consists of an Encoder (a bidirectional LSTM) and a Decoder LSTM network.

- The final hidden state from the Encoder (the thought vector) is passed into the Decoder (see the sketch below).

Footnote: Image taken from colah.github.io
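
The following PyTorch sketch shows one plausible shape of such a vanilla encoder-decoder (no attention yet). The layer sizes, the shared embedding, the single-layer setup and teacher-forced decoding over `tgt_ids` are assumptions for illustration, not the slides' exact configuration.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal vanilla encoder-decoder: a bidirectional LSTM encoder whose
    final state (the thought vector) initializes an LSTM decoder."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True,
                               bidirectional=True)
        self.decoder = nn.LSTM(emb_dim, 2 * hid_dim, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, (h_n, c_n) = self.encoder(self.embed(src_ids))
        # merge the forward/backward final states into the thought vector
        h0 = torch.cat([h_n[0], h_n[1]], dim=-1).unsqueeze(0)
        c0 = torch.cat([c_n[0], c_n[1]], dim=-1).unsqueeze(0)
        dec_out, _ = self.decoder(self.embed(tgt_ids), (h0, c0))
        return self.out(dec_out)   # logits over the vocabulary at each step
```

Training would compare these per-step logits against the shifted target summary with a cross-entropy loss.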

Page 15:

Why do we need Attention?

- The basic encoder-decoder model fails to scale up.

- The main bottleneck is the fixed-size thought vector.

- It is not able to capture all the relevant information of the input sequence as the model sizes up.

- At each generation step, only a part of the input is relevant.

- This is where attention comes in.

- It helps the model decide which part of the input encoding to focus on at each generation step in order to generate novel words.

- At each step, the decoder outputs a hidden state ht, from which we generate the output.

Page 16:

Attention is all you need!

- importance_i^t = V tanh(ei W1 + ht W2 + battn)

- Attention distribution: a^t = softmax(importance^t)

- Context vector: h*_t = Σi a_i^t ei (sketched below)

Footnote: Image stylized from https://talbaumel.github.io/attention/
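
A NumPy sketch of the attention computation on this slide; the matrix shapes are assumptions (ei and ht of dimension d, projected to an attention dimension d_a):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(enc_states, h_t, W1, W2, v, b_attn):
    """enc_states: (n, d) rows e_i; h_t: (d,); W1, W2: (d, d_a); v, b_attn: (d_a,).
    Returns the attention distribution a^t and the context vector h*_t."""
    scores = np.array([v @ np.tanh(e_i @ W1 + h_t @ W2 + b_attn)
                       for e_i in enc_states])          # importance_i^t
    a_t = softmax(scores)                               # attention distribution
    h_star = (a_t[:, None] * enc_states).sum(axis=0)    # context vector
    return a_t, h_star
```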

Page 17:

Training

- The context vector is then fed through two layers to generate a distribution over the vocabulary, from which we sample.

- Pvocab(w) = softmax(V′(V [ht, h*_t] + b) + b′)

- The loss at time step t is losst = − log Pvocab(w*_t), where w*_t is the target summary word.

- LOSS = (1/T) Σt losst

- We then use the Backpropagation Algorithm to get the gradients and learn the parameters (see the sketch below).
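
A sketch of the vocabulary projection and the averaged negative log-likelihood loss described above; variable names follow the slide, the shapes are assumptions:

```python
import numpy as np

def p_vocab(h_t, h_star, V, b, V_prime, b_prime):
    """P_vocab = softmax(V'(V[h_t, h*_t] + b) + b')."""
    hidden = V @ np.concatenate([h_t, h_star]) + b
    logits = V_prime @ hidden + b_prime
    e = np.exp(logits - logits.max())
    return e / e.sum()

def summary_loss(p_vocab_per_step, target_word_ids):
    """LOSS = (1/T) * sum_t -log P_vocab(w*_t)."""
    losses = [-np.log(p[w] + 1e-12)
              for p, w in zip(p_vocab_per_step, target_word_ids)]
    return float(np.mean(losses))
```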

Page 18:

Generating the Summaries

At each step, the decoder outputs a probability distribution over the target vocabulary. To get the output word at that step we can do one of the following:

- Greedy sampling, i.e. choose the mode of the distribution.

- Sample from the distribution.

- Beam search: choose the top k most likely target words and feed them all into the next decoder input. At each time step t the decoder therefore gets k different possible inputs. It computes the top k most likely target words for each of these inputs, keeps only the top k out of the k² candidates and rejects the rest. This process continues until the sequences are complete, and it ensures that each strong candidate word gets a fair shot at appearing in the summary (sketched below).
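
A generic beam-search sketch. The `step_fn(prefix)` interface, which returns the top candidate `(token, log_prob)` continuations for a given decoder prefix, is an assumed abstraction over the decoder, not something defined in the slides.

```python
def beam_search(step_fn, start_token, end_token, beam_size=4, max_len=30):
    """Keeps the top-k of the k^2 expansions at every step."""
    beams = [([start_token], 0.0)]          # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == end_token:     # completed hypotheses are set aside
                finished.append((prefix, score))
                continue
            for token, logp in step_fn(prefix):
                candidates.append((prefix + [token], score + logp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]      # prune k^2 candidates back to k
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```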

Page 19:

Metrics

- If a target summary is not given:
  - We need a similarity measure between the summary and the source document.
  - In a good summary, the topics covered would be similar.
  - Use topic models like Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).

- If the target summary is given:
  - Use metrics like ROUGE (Lin 2004) and METEOR.
  - They are essentially string-matching metrics.
  - ROUGE-N measures the overlap of N-grams between the system and reference summaries (sketched below).
  - ROUGE-L is based on longest common subsequences and takes sentence-level similarity into account.
  - ROUGE-S is the skip-bigram variant.
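
For intuition, a small sketch of ROUGE-N recall as plain n-gram overlap; real evaluations use the official ROUGE toolkit, which also handles stemming, multiple references and precision/recall/F-scores.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(system_tokens, reference_tokens, n=2):
    """Fraction of reference n-grams that also appear in the system summary."""
    sys_c = ngram_counts(system_tokens, n)
    ref_c = ngram_counts(reference_tokens, n)
    overlap = sum(min(count, sys_c[gram]) for gram, count in ref_c.items())
    total = sum(ref_c.values())
    return overlap / total if total else 0.0

# e.g. rouge_n_recall("police kill the gunman".split(),
#                     "police killed the gunman".split(), n=1)  -> 0.75
```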

Page 20:

Dataset

Sentence-level datasets

- DUC-2004

- Gigaword

Large-scale dataset by Nallapati et al. 2016

- CNN/Daily Mail dataset adapted for summarization.

Page 21:

Problems with Baseline

Though the baseline gives decent results, it is clearly plagued by several problems:

- It sometimes reproduces factually incorrect details.

- It struggles with out-of-vocabulary (OOV) words.

- It is also somewhat repetitive, focusing on a word or phrase multiple times.

- It focuses mainly on single-sentence summary tasks like headline generation.

Page 23:

Feature-rich Encoder

- Introduced by Nallapati et al. 2016.

- The aim is to feed more information about the source text into the encoder.

- Apart from word embeddings like word2vec and GloVe, it also incorporates linguistic features like:
  - POS (part-of-speech) tags
  - named-entity tags
  - TF-IDF statistics

- Though it speeds up training, it hurts the abstractive capabilities of the model.

Page 24:

Hierarchical Attention

- Introduced by Nallapati et al. 2016.

- For bigger source documents, they also try to identify key sentences for the summary.

- Two bi-directional RNNs run over the source text:
  - one at the word level,
  - another at the sentence level.

- Word-level attention is then re-weighted by the corresponding sentence-level attention (sketched below):

  P^a(j) = P^a_w(j) P^a_s(s(j)) / Σ_{k=1}^{Nd} P^a_w(k) P^a_s(s(k))

  where s(j) is the sentence containing word position j and Nd is the number of word positions in the document.
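
A small NumPy sketch of this re-weighting, assuming the word-level and sentence-level attention distributions have already been computed:

```python
import numpy as np

def hierarchical_attention(word_attn, sent_attn, sent_of_word):
    """word_attn: (N_d,) word-level attention; sent_attn: (S,) sentence-level
    attention; sent_of_word[j] = index of the sentence containing word j.
    Returns the combined, renormalized attention P^a over word positions."""
    scores = word_attn * sent_attn[np.asarray(sent_of_word)]
    return scores / scores.sum()
```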

Page 26:

Pointer-Generator Network

Introduced by See et al. 2017.

- Helps to solve the challenge of OOV words and factual errors.

- Works better for multi-sentence summaries.

- The idea is to choose, at each step of generation, between generating a word from the fixed vocabulary and copying one from the source document.

- It brings in the power of extractive methods by pointing (Vinyals et al. 2015).

- So for OOV words, simple generation would produce UNK, but here the network can copy the OOV word from the source text.

Page 27:

Pointer-Generator Network

Footnote: Image taken from blog, www.abigailsee.com

Page 28:

Pointer-Generator Network

- At each step we calculate a generation probability pgen:

  pgen = σ(w_h*^T h*_t + w_s^T ht + w_x^T xt + bptr)

- xt is the decoder input.

- The parameters w_h*, w_s, w_x, bptr are learnable.

- pgen is then used as a switch:

  P(w) = pgen Pvocab(w) + (1 − pgen) Σ_{i: wi = w} a_i^t

- Note that for an OOV word Pvocab(w) = 0, so we end up pointing (sketched below).
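
A NumPy sketch of the final mixed distribution, assuming an "extended" vocabulary in which source OOV words get temporary IDs beyond the fixed vocabulary (in the spirit of See et al. 2017); the shapes and names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pointer_generator_dist(p_vocab, attn, src_ids, h_star, h_t, x_t,
                           w_h, w_s, w_x, b_ptr, extended_vocab_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i^t."""
    p_gen = sigmoid(w_h @ h_star + w_s @ h_t + w_x @ x_t + b_ptr)
    p = np.zeros(extended_vocab_size)
    p[:len(p_vocab)] = p_gen * p_vocab        # generation part
    for a_i, src_id in zip(attn, src_ids):    # copy part: scatter-add attention
        p[src_id] += (1.0 - p_gen) * a_i
    return p
```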

Page 30:

Coverage Mechanism

- The repetitiveness of the model can be attributed to increased and continuous attention to a particular word.

- So we can use the Coverage Model of Tu et al. 2016.

- Coverage vector: c^t = Σ_{t'=0}^{t−1} a^{t'}

- Intuitively, by summing the attention over all previous steps we keep track of how much coverage each encoding ei has received.

- This coverage vector is given as an additional input to the attention mechanism:

  importance_i^t = V tanh(ei W1 + ht W2 + Wc c_i^t + battn)

- We also penalize attending to things that have already been covered:

  covlosst = Σi min(a_i^t, c_i^t)

  penalizes overlap between the attention at this step and the coverage so far (see the sketch below).

- losst = − log P(w*_t) + λ covlosst
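
A sketch of the coverage bookkeeping and the coverage loss over a decoded sequence, given the per-step attention distributions as NumPy arrays:

```python
import numpy as np

def coverage_loss(attention_steps):
    """covloss = sum_t sum_i min(a_i^t, c_i^t), where c^t is the sum of all
    attention distributions from the previous steps."""
    coverage = np.zeros_like(attention_steps[0])
    total = 0.0
    for a_t in attention_steps:
        total += np.minimum(a_t, coverage).sum()   # penalize re-attending
        coverage += a_t                            # c^{t+1} = c^t + a^t
    return total
```

The full training loss on the slide adds λ times this term to the negative log-likelihood.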

Page 32:

Intra-Attention

- Traditional approaches attend only over the encoder states.

- But the current word being generated also depends on which words were generated previously.

- So Paulus et al. 2017 used intra-attention over the decoder outputs.

- This approach also helps avoid repetition.

- A decoder context vector c*_t is generated in a way analogous to encoder attention.

- c*_t is passed on to generate Pvocab(w).

Page 34:

How to correct my mistakes?

- During training, we always feed the correct inputs into the decoder, no matter what the output was at the previous step.

- The model therefore doesn't learn to recover from its mistakes.

- It assumes that it will be given the golden token at each step of decoding.

- During testing, if the model produces even one wrong word, recovery is hard.

A naive way to rectify this problem is, during training, to toss a coin with P[heads] = p at each step to decide between the generated output from the previous step and the golden token (sketched below).
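
A sketch of that coin toss (a scheduled-sampling-style choice); the probability value is an arbitrary assumption, not taken from the slides.

```python
import random

def next_decoder_input(gold_token, prev_model_token, p_gold=0.75):
    """With probability p_gold feed the golden token (teacher forcing),
    otherwise feed back the model's own previous prediction."""
    return gold_token if random.random() < p_gold else prev_model_token
```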

Page 35:

Training using Reinforcement Learning

- There are various ways in which a document can be effectively summarized. The reference summary is just one of those possible ways.

- There should be some scope for variation in the summary.

This is the idea behind reinforcement-based training, introduced by Paulus et al. 2017, which gave a significant improvement over the baseline and is the current state of the art.

- During training, we first let the model generate a summary using its own decoder outputs as inputs.

- After the model produces its own summary, we evaluate it against the reference summary using the ROUGE metric.

- We then define a loss based on this score: if the score is high, the summary is good and the loss should be low, and vice versa.

Page 36:

Training using Reinforcement Learning

Figure: The model generates a summary, a scorer compares it with the golden summary and returns a reward, which is used to update the model.

Page 37:

Policy Learning

- We use self-critical policy gradient training.

- We generate two output sequences, y^s and ŷ.

- y^s_t ∼ P(y^s_t | y^s_1, ..., y^s_{t−1}, x), i.e. by sampling, while ŷ is obtained by greedy search.

- y* is the ground truth.

- r(y) is the reward for a sequence y compared against y*.

- Lrl = (r(ŷ) − r(y^s)) Σt log P(y^s_t | y^s_1, ..., y^s_{t−1}, x) (see the sketch below)
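
A sketch of the self-critical loss for one sampled sequence, given its per-step log-probabilities and the ROUGE rewards of the sampled and greedy summaries:

```python
import numpy as np

def self_critical_loss(sampled_log_probs, reward_sampled, reward_greedy):
    """L_rl = (r(y_greedy) - r(y_sampled)) * sum_t log P(y_sampled_t | ...).
    If the sample beats the greedy baseline, minimizing L_rl raises the
    sample's log-probability."""
    return (reward_greedy - reward_sampled) * float(np.sum(sampled_log_probs))
```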

Page 38:

Problems in Training using Reinforcement Learning

- It is possible to achieve a very high ROUGE score without the summary being human-readable.

- This reflects that ROUGE doesn't exactly capture the way humans evaluate summaries.

- Since the above method optimizes for the ROUGE score, it may produce summaries with very high ROUGE scores that are barely human-readable.

- To curb this problem, we train our model in a mixed fashion, using both Reinforcement Learning and Supervised Training.

- We can interpret this as RL training giving global sentence/summary-level supervision and supervised training giving local word-level supervision.

- Lmixed = γ Lrl + (1 − γ) Lml (see below)
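
The mixed objective is a simple interpolation of the two losses above; γ is a hyperparameter (Paulus et al. 2017 use a value close to 1) and is deliberately left unset here.

```python
def mixed_loss(loss_rl, loss_ml, gamma):
    """L_mixed = gamma * L_rl + (1 - gamma) * L_ml."""
    return gamma * loss_rl + (1 - gamma) * loss_ml
```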

Page 39:

Challenges

- As pointed out by Paulus et al. 2017, ROUGE as a metric is deficient.

- Dataset issues:
  - The majority of the available data are news datasets.
  - A good summary can be produced just by looking at the top few sentences.
  - All the models discussed above assume this and look at only the top 5-6 sentences of the source article.
  - We need a richer dataset for multi-sentence text summarization.

- Scalability issues: the multi-sentence problem is largely unsolved.

- We need a lot of data and computational power.

Page 40:

Future Work

- To address the problem of the ROUGE metric in the Reinforcement Learning based training method, we can instead first learn a separate Discriminator which, given a document and a corresponding summary, tells how good the summary is.

- The problem of long-document summarization has two main issues:
  - The Vanishing Gradient Problem.
  - LSTMs help information pass along further, but the errors don't propagate far back in time well; a maximum of around 20-25 steps only.
  - Logarithmic Residual LSTMs.

Page 41:

Logarithmic Residual LSTMs

Figure: Unrolled sketch of a Logarithmic Residual LSTM, with skip connections from earlier states (over inputs x1, x2, ..., xt−1, xt) into the state at step t.

Page 42:

References I

Chopra, Sumit et al. (2016). "Abstractive sentence summarization with attentive recurrent neural networks". In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 93–98.

Lin, Chin-Yew (2004). "ROUGE: A package for automatic evaluation of summaries". In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Vol. 8. Barcelona, Spain.

Nallapati, Ramesh et al. (2016). "Abstractive text summarization using sequence-to-sequence RNNs and beyond". In: arXiv preprint arXiv:1602.06023.

Paulus, Romain et al. (2017). "A Deep Reinforced Model for Abstractive Summarization". In: arXiv preprint arXiv:1705.04304.

Page 43:

References II

Rush, Alexander M. et al. (2015). "A neural attention model for abstractive sentence summarization". In: arXiv preprint arXiv:1509.00685.

See, Abigail et al. (2017). "Get To The Point: Summarization with Pointer-Generator Networks". In: arXiv preprint arXiv:1704.04368.

Tu, Zhaopeng et al. (2016). "Modeling coverage for neural machine translation". In: arXiv preprint arXiv:1601.04811.

Vinyals, Oriol et al. (2015). "Pointer networks". In: Advances in Neural Information Processing Systems, pp. 2692–2700.

