Selective Encoding for Abstractive Sentence Summarization
Qingyu Zhou, Nan Yang, Furu Wei and Ming Zhou
ACL 2017
Presenter: Kodaira Tomonori
Task
• Task: abstractive sentence summarization
• Input: a sentence. Output: a sentence.
Figure 1.
Introduction
• This task differs from MT in two ways:
1. There is no explicit alignment relationship between input and output.
2. The task needs to keep the highlights and remove the unnecessary information.
Improvement
• Problem: in previous frameworks there is no explicit alignment relationship between the input sentence and the summary, except for extracted common words.
• Solution: their method does not infer the alignment; instead it selects the highlights while filtering out secondary information in the input.
Problem Formulation
• Input sentence: x = (x1, x2, …, xn), where xi ∈ Vs (the source vocabulary)
• Output summary: y = (y1, y2, …, yl), where l ≤ n
Model
Sentence Encoder
• bidirectional GRU
• The initial states are set to zero vectors.
• After reading the sentence, the forward and backward hidden states at each position are concatenated: hi = [hforward,i ; hbackward,i] (a minimal sketch follows).
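A minimal PyTorch sketch of this encoder; PyTorch and all names here are my own illustration, not the authors' code:

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """BiGRU encoder: reads the sentence in both directions."""
    def __init__(self, vocab_size, emb_size=300, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        # bidirectional=True runs a forward and a backward GRU;
        # initial hidden states default to zero vectors, as in the paper
        self.gru = nn.GRU(emb_size, hidden_size,
                          bidirectional=True, batch_first=True)

    def forward(self, x):
        # x: (batch, n) token ids
        h, _ = self.gru(self.embed(x))   # (batch, n, 2 * hidden_size)
        # h[:, i] is already the concatenation [h_forward_i ; h_backward_i]
        return h
```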
Selective Mechanism
• sentence representation: s = [hbackward,1 ; hforward,n]
• sGatei = σ(Ws hi + Us s + b)
• h'i = hi ∘ sGatei (element-wise product; sketched below)
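A sketch of the selective gate under the same illustrative setup; h is the (batch, n, 2H) output of the BiGRU above, and PyTorch keeps the forward states in the first H dimensions and the backward states in the last H:

```python
import torch
import torch.nn as nn

class SelectiveGate(nn.Module):
    def __init__(self, hidden_size=512):
        super().__init__()
        self.W_s = nn.Linear(2 * hidden_size, 2 * hidden_size, bias=False)
        self.U_s = nn.Linear(2 * hidden_size, 2 * hidden_size)  # its bias plays the role of b

    def forward(self, h):
        H = h.size(-1) // 2
        # sentence representation s = [h_backward,1 ; h_forward,n]
        s = torch.cat([h[:, 0, H:], h[:, -1, :H]], dim=-1)            # (batch, 2H)
        gate = torch.sigmoid(self.W_s(h) + self.U_s(s).unsqueeze(1))  # (batch, n, 2H)
        return h * gate   # h'_i = h_i ∘ sGate_i (element-wise)
```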
Summary Decoder
• GRU: st = GRU(wt-1, ct-1, st-1), initialized with s0 = tanh(Wd hbackward,1 + b)
• Attention: et,i = vaT tanh(Wa st-1 + Ua h'i); αt,i = exp(et,i) / Σi=1..n exp(et,i); ct = Σi=1..n αt,i h'i
• Prediction (maxout): rt = Wr wt-1 + Ur ct + Vr st; mt = [max{rt,2j-1, rt,2j}] for j = 1, …, d; p(yt | y1, …, yt-1) = softmax(Wo mt)
• Notation: w: word embedding; c: context vector; s: decoder hidden state; h'i: selective encoder state; rt: readout state
(a one-step sketch follows)
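A one-step sketch of this decoder with illustrative names; enc_size = 2H is the dimensionality of the selective encoder states h':

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, emb_size=300, hidden_size=512, enc_size=1024, vocab_size=30000):
        super().__init__()
        self.gru = nn.GRUCell(emb_size + enc_size, hidden_size)
        self.W_a = nn.Linear(hidden_size, hidden_size, bias=False)
        self.U_a = nn.Linear(enc_size, hidden_size, bias=False)
        self.v_a = nn.Linear(hidden_size, 1, bias=False)
        self.W_r = nn.Linear(emb_size, 2 * hidden_size, bias=False)
        self.U_r = nn.Linear(enc_size, 2 * hidden_size, bias=False)
        self.V_r = nn.Linear(hidden_size, 2 * hidden_size, bias=False)
        self.W_o = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, w_prev, c_prev, s_prev, h_sel):
        # s_t = GRU(w_{t-1}, c_{t-1}, s_{t-1})
        s_t = self.gru(torch.cat([w_prev, c_prev], dim=-1), s_prev)
        # e_{t,i} = v_a^T tanh(W_a s_{t-1} + U_a h'_i), normalized by softmax
        e = self.v_a(torch.tanh(self.W_a(s_prev).unsqueeze(1)
                                + self.U_a(h_sel))).squeeze(-1)   # (batch, n)
        a = torch.softmax(e, dim=-1)
        c_t = torch.bmm(a.unsqueeze(1), h_sel).squeeze(1)         # context vector c_t
        # maxout readout: pair adjacent units of r_t and keep the max
        r = self.W_r(w_prev) + self.U_r(c_t) + self.V_r(s_t)      # (batch, 2d)
        m = r.view(r.size(0), -1, 2).max(dim=-1).values           # (batch, d)
        return torch.log_softmax(self.W_o(m), dim=-1), s_t, c_t
```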
Objective Function
• J(θ) = -(1/|D|) Σ(x,y)∈D log p(y|x), where D is a set of parallel sentence-summary pairs
• Optimizer: stochastic gradient descent (a sketch of the objective follows)
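A minimal sketch of this objective, assuming per-step log-probabilities from a decoder like the one above; pad_id and the tensor shapes are my own convention:

```python
import torch

def nll_objective(log_probs, targets, pad_id=0):
    # log_probs: (batch, l, vocab) log p(y_t | y_<t, x); targets: (batch, l)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = (targets != pad_id).float()        # ignore padding positions
    seq_ll = (token_ll * mask).sum(dim=1)     # log p(y | x) for each pair
    return -seq_ll.mean()                     # J = -(1/|D|) sum over pairs in the batch
```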
Dataset
• Training set: English Gigaword dataset (Napoles et al., 2012); training: 3.8M sentence-summary pairs; development: 189K pairs
• Test sets: 1. English Gigaword, 2. DUC 2004, 3. MSR Abstractive Text Compression (MSR-ATC)
Data statistics
Table 2
Evaluation Metric
• ROUGE (Lin, 2004): ROUGE-1, ROUGE-2, ROUGE-L (a usage sketch follows)
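For reference, a small usage sketch with the pip-installable rouge-score package; this package is my stand-in assumption, not necessarily the toolkit the authors used:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat was found under the bed",   # reference summary
    "the cat was under the bed",         # system output
)
for name, s in scores.items():
    print(name, round(s.fmeasure, 4))
```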
Implementation Details
• Parameters: embedding size 300; GRU hidden state size 512; dropout (Srivastava et al., 2014) with p = 0.5
• Training: Adam (α = 0.001, β1 = 0.9, β2 = 0.999); gradient clipping to [-5, 5]
• Beam search: beam size 12
(a training-configuration sketch follows)
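These settings translate directly into a PyTorch training configuration; the tiny model and loss below are stand-ins for the full seq2seq model:

```python
import torch

model = torch.nn.Linear(512, 512)   # stand-in for the full seq2seq model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

loss = model(torch.randn(4, 512)).pow(2).mean()   # stand-in loss
loss.backward()
# clip every gradient element to [-5, 5] before the update
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=5.0)
optimizer.step()
optimizer.zero_grad()
```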
Baselines
• ABS (Rush et al., 2015)
• ABS+ (Rush et al., 2015)
• CAs2s (Chopra et al., 2016)
• Feats2s (Nallapati et al., 2016)
• Luong-NMT (Luong et al., 2015)
• s2s+att: their own implementation of a sequence-to-sequence model with attention
English Gigaword
Table 3
DUC 2004
Table 4
MSR-ATC
Table 5
Saliency Heat Map of Selective Gate
• They use the method of Li et al. (2016) to visualize the contribution of the selective gate to the final output.
• They approximate Sy(g) by computing its first-order Taylor expansion.
• They draw the Euclidean norm of the first derivative of the output y with respect to the selective gate g associated with each input word (a sketch follows).
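A sketch of this first-order saliency, assuming the gate tensor is a leaf of the autograd graph; the toy "model" below stands in for the real decoder:

```python
import torch

def gate_saliency(output_score, gate):
    # output_score: scalar score of the emitted word; gate: (n, 2H)
    grad, = torch.autograd.grad(output_score, gate)
    return grad.norm(dim=-1)   # one Euclidean-norm saliency per input word

gate = torch.rand(7, 8, requires_grad=True)   # toy gates for 7 input words
y = (gate * torch.randn(7, 8)).sum()          # stand-in for log p(y_t | ...)
print(gate_saliency(y, gate))                 # tensor of 7 saliency values
```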
Figure 3
Conclusion
• They propose a selective encoding model for abstractive sentence summarization.
• The model greatly improves over the baselines on the English Gigaword, DUC 2004, and MSR-ATC test sets.