Selective Encoding for Abstractive Sentence Summarization
Qingyu Zhou, Nan Yang, Furu Wei and Ming Zhou
ACL 2017
Presenter: Kodaira Tomonori
Task
• Task: abstractive sentence summarization
• Input: a sentence. Output: a sentence.
Figure 1.
Introduction
• This task differs from MT in two ways:
1. There is no explicit alignment relationship between input and output.
2. The task needs to keep the highlights and remove the unnecessary information.
Improvement
• Problem: in previous frameworks there is no explicit alignment relationship between the input sentence and the summary, except for extracted common words.
• Solution: their method does not infer the alignment; instead it selects the highlights while filtering out secondary information in the input.
Problem Formulation
• Input sentence: x = (x1, x2, …, xn), where xi ∈ Vs (the source vocabulary)
• Output summary: y = (y1, y2, …, yl), where l ≤ n
Model
Sentence Encoder
• bidirectional GRU
• The initial states are set to zero vectors.
• After reading the sentence, the forward and backward hidden states at each position are concatenated: hi = [hforward,i ; hbackward,i] (a minimal sketch follows).
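A minimal PyTorch sketch of this encoder; PyTorch and all names here are my own illustration, not the authors' code:

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """BiGRU encoder: reads the sentence in both directions."""
    def __init__(self, vocab_size, emb_size=300, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        # bidirectional=True runs a forward and a backward GRU;
        # initial hidden states default to zero vectors, as in the paper
        self.gru = nn.GRU(emb_size, hidden_size,
                          bidirectional=True, batch_first=True)

    def forward(self, x):
        # x: (batch, n) token ids
        h, _ = self.gru(self.embed(x))   # (batch, n, 2 * hidden_size)
        # h[:, i] is already the concatenation [h_forward_i ; h_backward_i]
        return h
```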
Selective Mechanism
• sentence representation: s = [hbackward,1 ; hforward,n]
• sGatei = σ(Ws hi + Us s + b)
• h'i = hi ∘ sGatei (element-wise product; sketched below)
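A sketch of the selective gate under the same illustrative setup; h is the (batch, n, 2H) output of the BiGRU above, and PyTorch keeps the forward states in the first H dimensions and the backward states in the last H:

```python
import torch
import torch.nn as nn

class SelectiveGate(nn.Module):
    def __init__(self, hidden_size=512):
        super().__init__()
        self.W_s = nn.Linear(2 * hidden_size, 2 * hidden_size, bias=False)
        self.U_s = nn.Linear(2 * hidden_size, 2 * hidden_size)  # its bias plays the role of b

    def forward(self, h):
        H = h.size(-1) // 2
        # sentence representation s = [h_backward,1 ; h_forward,n]
        s = torch.cat([h[:, 0, H:], h[:, -1, :H]], dim=-1)            # (batch, 2H)
        gate = torch.sigmoid(self.W_s(h) + self.U_s(s).unsqueeze(1))  # (batch, n, 2H)
        return h * gate   # h'_i = h_i ∘ sGate_i (element-wise)
```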
Summary Decoder
• GRU: st = GRU(wt-1, ct-1, st-1), initialized with s0 = tanh(Wd hbackward,1 + b)
• Attention: et,i = vaT tanh(Wa st-1 + Ua h'i); αt,i = exp(et,i) / Σi=1..n exp(et,i); ct = Σi=1..n αt,i h'i
• Prediction (maxout): rt = Wr wt-1 + Ur ct + Vr st; mt = [max{rt,2j-1, rt,2j}] for j = 1, …, d; p(yt | y1, …, yt-1) = softmax(Wo mt)
• Notation: w: word embedding; c: context vector; s: decoder hidden state; h'i: selective encoder state; rt: readout state
(a one-step sketch follows)
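A one-step sketch of this decoder with illustrative names; enc_size = 2H is the dimensionality of the selective encoder states h':

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, emb_size=300, hidden_size=512, enc_size=1024, vocab_size=30000):
        super().__init__()
        self.gru = nn.GRUCell(emb_size + enc_size, hidden_size)
        self.W_a = nn.Linear(hidden_size, hidden_size, bias=False)
        self.U_a = nn.Linear(enc_size, hidden_size, bias=False)
        self.v_a = nn.Linear(hidden_size, 1, bias=False)
        self.W_r = nn.Linear(emb_size, 2 * hidden_size, bias=False)
        self.U_r = nn.Linear(enc_size, 2 * hidden_size, bias=False)
        self.V_r = nn.Linear(hidden_size, 2 * hidden_size, bias=False)
        self.W_o = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, w_prev, c_prev, s_prev, h_sel):
        # s_t = GRU(w_{t-1}, c_{t-1}, s_{t-1})
        s_t = self.gru(torch.cat([w_prev, c_prev], dim=-1), s_prev)
        # e_{t,i} = v_a^T tanh(W_a s_{t-1} + U_a h'_i), normalized by softmax
        e = self.v_a(torch.tanh(self.W_a(s_prev).unsqueeze(1)
                                + self.U_a(h_sel))).squeeze(-1)   # (batch, n)
        a = torch.softmax(e, dim=-1)
        c_t = torch.bmm(a.unsqueeze(1), h_sel).squeeze(1)         # context vector c_t
        # maxout readout: pair adjacent units of r_t and keep the max
        r = self.W_r(w_prev) + self.U_r(c_t) + self.V_r(s_t)      # (batch, 2d)
        m = r.view(r.size(0), -1, 2).max(dim=-1).values           # (batch, d)
        return torch.log_softmax(self.W_o(m), dim=-1), s_t, c_t
```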
Objective Function
• J(θ) = -(1/|D|) Σ(x,y)∈D log p(y|x), where D is a set of parallel sentence-summary pairs
• Optimizer: stochastic gradient descent (a sketch of the objective follows)
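A minimal sketch of this objective, assuming per-step log-probabilities from a decoder like the one above; pad_id and the tensor shapes are my own convention:

```python
import torch

def nll_objective(log_probs, targets, pad_id=0):
    # log_probs: (batch, l, vocab) log p(y_t | y_<t, x); targets: (batch, l)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = (targets != pad_id).float()        # ignore padding positions
    seq_ll = (token_ll * mask).sum(dim=1)     # log p(y | x) for each pair
    return -seq_ll.mean()                     # J = -(1/|D|) sum over pairs in the batch
```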
Dataset
• Training set: English Gigaword dataset (Napoles et al., 2012); training: 3.8M sentence-summary pairs; development: 189K pairs
• Test sets: 1. English Gigaword, 2. DUC 2004, 3. MSR Abstractive Text Compression (MSR-ATC)
Data statistics
Table 2
Evaluation Metric
• ROUGE (Lin, 2004): ROUGE-1, ROUGE-2, ROUGE-L (a usage sketch follows)
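For reference, a small usage sketch with the pip-installable rouge-score package; this package is my stand-in assumption, not necessarily the toolkit the authors used:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat was found under the bed",   # reference summary
    "the cat was under the bed",         # system output
)
for name, s in scores.items():
    print(name, round(s.fmeasure, 4))
```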
Implementation Details
• Parameters: embedding size 300; GRU hidden state size 512; dropout (Srivastava et al., 2014) with p = 0.5
• Training: Adam (α = 0.001, β1 = 0.9, β2 = 0.999); gradient clipping to [-5, 5]
• Beam search: beam size 12
(a training-configuration sketch follows)
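These settings translate directly into a PyTorch training configuration; the tiny model and loss below are stand-ins for the full seq2seq model:

```python
import torch

model = torch.nn.Linear(512, 512)   # stand-in for the full seq2seq model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

loss = model(torch.randn(4, 512)).pow(2).mean()   # stand-in loss
loss.backward()
# clip every gradient element to [-5, 5] before the update
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=5.0)
optimizer.step()
optimizer.zero_grad()
```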
Baselines
• ABS (Rush et al., 2015)
• ABS+ (Rush et al., 2015)
• CAs2s (Chopra et al., 2016)
• Feats2s (Nallapati et al., 2016)
• Luong-NMT (Luong et al., 2015)
• s2s+att: their own implementation of a sequence-to-sequence model with attention
English Gigaword
Table 3
DUC 2004
Table 4
MSR-ATC
Table 5
Saliency Heat Map of Selective Gate
• They use the method of Li et al. (2016) to visualize the contribution of the selective gate to the final output.
• They approximate Sy(g) by computing its first-order Taylor expansion.
• They draw the Euclidean norm of the first derivative of the output y with respect to the selective gate g associated with each input word (a sketch follows).
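A sketch of this first-order saliency, assuming the gate tensor is a leaf of the autograd graph; the toy "model" below stands in for the real decoder:

```python
import torch

def gate_saliency(output_score, gate):
    # output_score: scalar score of the emitted word; gate: (n, 2H)
    grad, = torch.autograd.grad(output_score, gate)
    return grad.norm(dim=-1)   # one Euclidean-norm saliency per input word

gate = torch.rand(7, 8, requires_grad=True)   # toy gates for 7 input words
y = (gate * torch.randn(7, 8)).sum()          # stand-in for log p(y_t | ...)
print(gate_saliency(y, gate))                 # tensor of 7 saliency values
```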
Figure 3
Conclusion
• They propose a selective encoding model for abstractive sentence summarization.
• The model greatly improves over the baselines on the English Gigaword, DUC 2004, and MSR-ATC test sets.