Hooks in the Headline: Learning to Generate Headlines with ...click-baity headline generation. Both...

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5082–5093July 5 - 10, 2020. c©2020 Association for Computational Linguistics

5082

Hooks in the Headline: Learning to Generate Headlineswith Controlled Styles

Di Jin,1 Zhijing Jin,2 Joey Tianyi Zhou,3∗ Lisa Orii,4 Peter Szolovits1

1CSAIL, MIT, 2Amazon Web Services, 3A*STAR, Singapore, 4Wellesley College{jindi15,psz}@mit.edu, [email protected]@ihpc.a-star.edu.sg, [email protected]

Abstract

Current summarization systems only produceplain, factual headlines, but do not meet thepractical needs of creating memorable titles toincrease exposure. We propose a new task,Stylistic Headline Generation (SHG), to enrichthe headlines with three style options (humor,romance and clickbait), in order to attract morereaders. With no style-specific article-headlinepair (only a standard headline summarizationdataset and mono-style corpora), our methodTitleStylist generates style-specific headlinesby combining the summarization and recon-struction tasks into a multitasking framework.We also introduced a novel parameter sharingscheme to further disentangle the style fromthe text. Through both automatic and humanevaluation, we demonstrate that TitleStylistcan generate relevant, fluent headlines withthree target styles: humor, romance, and click-bait. The attraction score of our model gen-erated headlines surpasses that of the state-of-the-art summarization model by 9.68%, andeven outperforms human-written references.1

1 Introduction

Every good article needs a good title, which shouldnot only be able to condense the core meaningof the text, but also sound appealing to the read-ers for more exposure and memorableness. How-ever, currently even the best Headline Generation(HG) system can only fulfill the above requirementyet performs poorly on the latter. For example,in Figure 1, the plain headline by an HG model“Summ: Leopard Frog Found in New York City” isless eye-catching than the style-carrying ones suchas “What’s That Chuckle You Hear? It May Be theNew Frog From NYC.”

∗Corresponding author.1Our code is available at https://github.com/

jind11/TitleStylist.

New frog species discovered in New York City area. It has adistinctive croak, scientists find. Leopard frog speciesdoesn't yet have a name.

Ribbit! Frog Species Found in New York City Has a Croakof Its Own

Original Headline:

Article

Summ: Leopard Frog Found inNew York City

HG Model Output:

What's that Chuckle You Hear? It May be theNew Frog from NYCHumorous:

A New Frog with a Croak of Its Own Awaitsits Name in the Roads of NYCRomantic:

3 Facts about the New Frog with a Croak ofIts OwnClick-Baity:

Figure 1: Given a news article, current HG models canonly generate plain, factual headlines, failing to learnfrom the original human reference. It is also much lessattractive than the headlines with humorous, romanticand click-baity styles.

To bridge the gap between the practical needs forattractive headlines and the plain HG by the currentsummarization systems, we propose a new taskof Stylistic Headline Generation (SHG). Given anarticle, it aims to generate a headline with a targetstyle such as humorous, romantic, and click-baity.It has broad applications in reader-adapted titlegeneration, slogan suggestion, auto-fill for onlinepost headlines, and many others.

SHG is a highly skilled creative process, and usu-ally only possessed by expert writers. One of themost famous headlines in American publications,“Sticks Nix Hick Pix,” could be such an example. Incontrast, the current best summarization systemsare at most comparable to novice writers who pro-vide a plain descriptive representation of the textbody as the title (Cao et al., 2018b,a; Lin et al.,2018; Song et al., 2019; Dong et al., 2019). Thesesystems usually use a language generation modelthat mixes styles with other linguistic patterns andinherently lacks a mechanism to control the style

https://github.com/jind11/TitleStylist

https://github.com/jind11/TitleStylist

5083

explicitly. More fundamentally, the training datacomprise of a mixture of styles (e.g., the Gigaworddataset (Rush et al., 2017)), obstructing the modelsfrom learning a distinct style.

In this paper, we propose the new task SHG, toemphasize the explicit control of style in headlinegeneration. We present a novel headline generationmodel, TitleStylist, to produce enticing titles withtarget styles including humorous, romantic, andclick-baity. Our model leverages a multitaskingframework to train both a summarization modelon headline-article pairs, and a Denoising Autoen-coder (DAE) on a style corpus. In particular, basedon the transformer architecture (Vaswani et al.,2017), we use the style-dependent layer normal-ization and the style-guided encoder-attention todisentangle the language style factors from the text.This design enables us to use the shared contentto generate headlines that are more relevant to thearticles, as well as to control the style by pluggingin a set of style-specific parameters. We validatethe model on three tasks: humorous, romantic, andclick-baity headline generation. Both automaticand human evaluations show that TitleStylist cangenerate headlines with the desired styles that ap-peal more to human readers, as in Figure 1.

The main contributions of our paper are listedbelow:

• To the best of our knowledge, it is the firstresearch on the generation of attractive newsheadlines with styles without any supervisedstyle-specific article-headline paired data.

• Through both automatic and human evalua-tion, we demonstrated that our proposed Ti-tleStylist can generate relevant, fluent head-lines with three styles (humor, romance, andclickbait), and they are even more attractivethan human-written ones.

• Our model can flexibly incorporate multiplestyles, thus efficiently and automatically pro-viding humans with various creative headlineoptions for references and inspiring them tothink out of the box.

2 Related Work

Our work is related to summarization and text styletransfer.

Headline Generation as Summarization

Headline generation is a very popular area of re-search. Traditional headline generation methodsmostly focus on the extractive strategies using lin-guistic features and handcrafted rules (Luhn, 1958;Edmundson, 1964; Mathis et al., 1973; Salton et al.,1997; Jing and McKeown, 1999; Radev and McK-eown, 1998; Dorr et al., 2003). To enrich the di-versity of the extractive summarization, abstractivemodels were then proposed. With the help of neu-ral networks, Rush et al. (2015) proposed attention-based summarization (ABS) to make Banko et al.(2000)’s framework of summarization more pow-erful. Many recent works extended ABS by utiliz-ing additional features (Chopra et al., 2016; Takaseet al., 2016; Nallapati et al., 2016; Shen et al., 2016,2017a; Tan et al., 2017; Guo et al., 2017). Othervariants of the standard headline generation set-ting include headlines for community question an-swering (Higurashi et al., 2018), multiple headlinegeneration (Iwama and Kano, 2019), user-specificgeneration using user embeddings in recommenda-tion systems (Liu et al., 2018), bilingual headlinegeneration (Shen et al., 2018) and question-styleheadline generation (Zhang et al., 2018a).

Only a few works have recently started to fo-cus on increasing the attractiveness of generatedheadlines (Fan et al., 2018; Xu et al., 2019). Fanet al. (2018) focuses on controlling several featuresof the summary text such as text length, and thestyle of two different news outlets, CNN and Dai-lyMail. These controls serve as a way to boost themodel performance, and the CNN- and DailyMail-style control shows a negligible improvement. Xuet al. (2019) utilized reinforcement learning to en-courage the headline generation system to generatemore sensational headlines via using the readers’comment rate as the reward, which however cannotexplicitly control or manipulate the styles of head-lines. Shu et al. (2018) proposed a style transferapproach to transfer a non-clickbait headline intoa clickbait one. This method requires paired newsarticles-headlines data for the target style; however,for many styles such as humor and romance, thereare no available headlines. Our model does nothave this limitation, thus enabling transferring tomany more styles.

Text Style Transfer

Our work is also related to text style transfer, whichaims to change the style attribute of the text while

5084

preserving its content. First proposed by Shen et al.(2017b), it has achieved great progress in recentyears (Xu et al., 2018; Lample et al., 2019; Zhanget al., 2018b; Fu et al., 2018; Jin et al., 2019; Yanget al., 2018; Jin et al., 2020). However, all thesemethods demand a text corpus for the target style;however, in our case, it is expensive and technicallychallenging to collect news headlines with humorand romance styles, which makes this category ofmethods not applicable to our problem.

3 Methods

3.1 Problem Formulation

The model is trained on a source dataset Sand target dataset T . The source dataset S ={(a(i),h(i))}Ni=1 consists of pairs of a news articlea and its plain headline h. We assume that thesource corpus has a distribution P (A,H), whereA = {a(i)}Ni=1, and H = {h(i)}Ni=1. The targetcorpus T = {t(i)}Mi=1 comprises of sentences twritten in a specific style (e.g., humor). We assumethat it conforms to the distribution P (T ).

Note that the target corpus T only contains style-carrying sentences, not necessarily headlines — itcan be just book text. Also no sentence t is pairedwith a news article. Overall, our task is to learn theconditional distribution P (T |A) using only S andT . This task is fully unsupervised because there isno sample from the joint distribution P (A, T ).

3.2 Seq2Seq Model Architecture

For summarization, we adopt a sequence-to-sequence (Seq2Seq) model based on the Trans-former architecture (Vaswani et al., 2017). As inFigure 2, it consists of a 6-layer encoder E(·;θE)and a 6-layer decoder G(·;θG) with a hidden sizeof 1024 and a feed-forward filter size of 4096.For better generation quality, we initialize withthe MASS model (Song et al., 2019). MASS ispretrained by masking a sentence fragment in theencoder, and then predicting it in the decoder onlarge-scale English monolingual data. This pre-training is adopted in the current state-of-the-artsystems across various summarization benchmarktasks including HG.

3.3 Multitask Training Scheme

To disentangle the latent style from the text, weadopt a multitask learning framework (Luong et al.,2015), training on summarization and DAE simul-taneously (as shown in Figure 3).

Multi-Head Self-Attention

Layer Norm

MLP

Layer Norm

Emb EmbEmb

Encoder

Decoder

Multi-Head Encoder-Attention

MLP

Multi-Head Self-Attention

Style-Dependent Layer Norm

Style-Dependent QueryTransformation

Style-Dependent Layer Norm

Emb EmbEmb

Figure 2: The Transformer-based architecture of ourmodel.

Figure 3: Training scheme. Multitask training isadopted to combine the summarization and DAE tasks.

Supervised Seq2Seq Training for ES and GS

With the source domain dataset S, based on theencoder-decoder architecture, we can learn the con-ditional distribution P (H|A) by training zS =ES(A) and HS = GS(zS) to solve the supervisedSeq2Seq learning task, where zS is the learned la-tent representation in the source domain. The lossfunction of this task is

LS(θES ,θGS ) = E(a,h)∼S [− log p(h|a;θES ,θGS )],(1)

where θES and θGS are the set of model parame-ters of the encoder and decoder in the source do-main and p(h|a) denotes the overall probability ofgenerating an output sequence h given the inputarticle a, which can be further expanded as follows:

p(h|a;θES ,θGS ) =

L∏t=1

p(ht|{h1, ..., ht−1},zS ;θGS ),

(2)

where L is the sequence length.

DAE Training for θET and θGT For the targetstyle corpus T , since we only have the sentence twithout paired news articles, we train zT = ET (t)and t = GT (zT ) by solving an unsupervised re-

5085

construction learning task, where zT is the learnedlatent representation in the target domain, and t isthe corrupted version of t by randomly deleting orblanking some words and shuffling the word orders.To train the model, we minimize the reconstructionerror LT :

LT (θET ,θGT ) = Et∼T [− log p(t|t)], (3)

where θET and θGT are the set of model param-eters for the encoder and generator in the targetdomain. We train the whole model by jointly min-imizing the supervised Seq2Seq training loss LSand the unsupervised denoised auto-encoding lossLT via multitask learning, so the total loss becomes

L(θES ,θGS ,θET ,θGT ) = λLS(θES ,θGS )

+ (1− λ)LT (θET ,θGT ),(4)

where λ is a hyper-parameter.

3.4 Parameter-Sharing SchemeMore constraints are necessary in the multitasktraining process. We aim to infer the conditionaldistribution as P (T |A) = GT (ES(A)). However,without samples from P (A, T ), this is a challeng-ing or even impossible task if ES and ET , or GS

and GT are completely independent of each other.Hence, we need to add some constraints to thenetwork by relating ES and ET , and GS and GT .The simplest design is to share all parameters be-tween ES and ET , and apply the same strategyto GS and GT . The intuition behind this designis that by exposing the model to both summariza-tion task and style-carrying text reconstruction task,the model would acquire some sense of the targetstyle while summarizing the article. However, toencourage the model to better disentangle the con-tent and style of text and more explicitly learn thestyle contained in the target corpus T , we share allparameters of the encoder between two domains,i.e., between ES and ET , whereas we divide theparameters of the decoder into two types: style-independent parameters θind and style-dependentparameters θdep. This means that only the style-independent parameters are shared between GS

and GT while the style-dependent parameters arenot. More specifically, the parameters of the layernormalization and encoder attention modules aremade style-dependent as detailed below.

Type 1. Style Layer Normalization Inspired byprevious work on image style transfer (Dumoulin

et al., 2016), we make the scaling and shifting pa-rameters for layer normalization in the transformerarchitecture un-shared for each style. This stylelayer normalization approach aims to transform alayer’s activation x into a normalized activation zspecific to the style s:

z = γs(x− µσ

)− βs, (5)

where µ and σ are the mean and standard deviationof the batch of x, and γs and βs are style-specificparameters learned from data.

Specifically, for the transformer decoder archi-tecture, we use a style-specific self-attention layernormalization and final layer normalization for thesource and target domains on all six decoder layers.

Type 2. Style-Guided Encoder Attention Ourmodel architecture contains the attention mecha-nism, where the decoder infers the probability ofthe next word not only conditioned on the previ-ous words but also on the encoded input hiddenstates. The attention patterns should be differentfor the summarization and the reconstruction tasksdue to their different inherent nature. We insertthis thinking into the model by introducing thestyle-guided encoder attention into the multi-headattention module, which is defined as follows:

Q = query ·W sq (6)

K = key ·Wk (7)

V = value ·Wv (8)

Att(Q,K,V ) = Softmax

(QKtr

√dmodel

)V , (9)

where query, key, and value denote the tripleof inputs into the multi-head attention module;W s

q ,Wk, and Wv denote the scaled dot-product matrixfor affine transformation; dmodel is the dimensionof the hidden states. We specialize the dot-productmatrix W s

q of the query for different styles, sothatQ can be different to induce diverse attentionpatterns.

4 Experiments

4.1 Datasets

We compile a rich source dataset by combining theNew York Times (NYT) and CNN, as well as threetarget style corpora on humorous, romantic, andclick-baity text. The average sentence length in

5086

the NYT, CNN, Humor, Romance, and Clickbaitdatasets are 8.8, 9.2, 12.6, 11.6 and 8.7 words,respectively.

4.1.1 Source DatasetThe source dataset contains news articles pairedwith corresponding headlines. To enrich the train-ing corpus, we combine two datasets: the NewYork Times (56K) and CNN (90K). After combin-ing these two datasets, we randomly selected 3,000pairs as the validation set and another 3,000 pairsas the test set.

We first extracted the archival abstracts andheadlines from the New York Times (NYT) cor-pus (Sandhaus, 2008) and treat the abstracts asthe news articles. Following the standard pre-processing procedures (Kedzie et al., 2018),2 wefiltered out advertisement-related articles (as theyare very different from news reports), resulting in56,899 news abstracts-headlines pairs.

We then add into our source set the CNN sum-marization dataset, which is widely used for train-ing abstractive summarization models (Hermannet al., 2015).3 We use the short summaries in theoriginal dataset as the news abstracts and automati-cally parsed the headlines for each news from thedumped news web pages,4 and in total collected90,236 news abstract-headline pairs.

4.1.2 Three Target Style CorporaHumor and Romance For the target styledatasets, we follow (Chen et al., 2019) to use hu-mor and romance novel collections in BookCor-pus (Zhu et al., 2015) as the Humor and Romancedatasets.5 We split the documents into sentences,tokenized the text, and collected 500K sentencesas our datasets.

Clickbait We also tried to learn the writing stylefrom the click-baity headlines since they haveshown superior attraction to readers. Thus we usedThe Examiner - SpamClickBait News dataset, de-noted as the Clickbait dataset.6 We collected 500Kheadlines for our use.

Some examples from each style corpus are listedin Table 1.

2https://github.com/kedz/summarization-datasets

3We use CNN instead of the DailyMail dataset since Dai-lyMail headlines are very long and more like short summaries.

4https://cs.nyu.edu/˜kcho/DMQA/5https://www.smashwords.com/6https://www.kaggle.com/therohk/

examine-the-examiner

Style Examples

Humor

- The crowded beach like houses in the burbsand the line ups at Walmart.- Berthold stormed out of the brewing argu-ment with his violin and bow and went fora walk with it to practice for the much morereceptive polluted air.

Romance

- “I can face it joyously and with all my heart,and soul!” she said.- With bright blue and green buttercreamscales, sparkling eyes, and purple candy meltwings, it sat majestically on a rocky ledgemade from chocolate.

Clickbait

- 11-Year-Old Girl and 15-Year-Old Boy Ac-cused of Attempting to Kill Mother: Who Isthe Adult?- Chilly, Dry Weather Welcomes 2010 toSouth Florida- End Segregation in Alabama-Bryce Hospi-tal Sale Offers a Golden Opportunity

Table 1: Examples of three target style corpora: humor,romance, and clickbait.

4.2 Baselines

We compared the proposed TitleStylist against thefollowing five strong baseline approaches.

Neural Headline Generation (NHG) Wetrain the state-of-the-art summarization model,MASS (Song et al., 2019), on our collected newsabstracts-headlines paired data.

Gigaword-MASS We test an off-the-shelf head-line generation model, MASS from (Song et al.,2019), which is already trained on Gigaword, alarge-scale headline generation dataset with around4 million articles.7

Neural Story Teller (NST) It breaks down thetask into two steps, which first generates headlinesfrom the aforementioned NHG model, then appliesstyle shift techniques to generate style-specificheadlines (Kiros et al., 2015). In brief, this methoduses the Skip-Thought model to encode a sentenceinto a representation vector and then manipulatesits style by a linear transformation. Afterward, thistransformed representation vector is used to initial-ize a language model pretrained on a style-specificcorpus so that a stylistic headline can be generated.More details of this method can refer to the officialwebsite.8

7https://github.com/harvardnlp/sent-summary

8https://github.com/ryankiros/neural-storyteller

https://github.com/kedz/summarization-datasets

https://github.com/kedz/summarization-datasets

https://cs.nyu.edu/~kcho/DMQA/

https://www.smashwords.com/

https://www.kaggle.com/therohk/examine-the-examiner

https://www.kaggle.com/therohk/examine-the-examiner

https://github.com/harvardnlp/sent-summary

https://github.com/harvardnlp/sent-summary

https://github.com/ryankiros/neural-storyteller

https://github.com/ryankiros/neural-storyteller

5087

Fine-Tuned We first train the NHG model asmentioned above, then further fine-tuned it on thetarget style corpus via DAE training.

Multitask We share all parameters between ES

and ET , and between GS and GT , and trained themodel on both the summarization and DAE tasks.The model architecture is the same as NHG.

4.3 Evaluation MetricsTo evaluate the performance of the proposed Ti-tleStylist in generating attractive headlines withstyles, we propose a comprehensive twofold strat-egy of both automatic evaluation and human evalu-ation.

4.3.1 Setup of Human EvaluationWe randomly sampled 50 news abstracts from thetest set and asked three native-speaker annotatorsfor evaluation to score the generated headlines.Specifically, we conduct two tasks to evaluate onfour criteria: (1) relevance, (2) attractiveness, (3)language fluency, and (4) style strength. For thefirst task, the human raters are asked to evaluatethese outputs on the first three aspects, relevance,attractiveness, and language fluency on a Likertscale from 1 to 10 (integer values). For relevance,human annotators are asked to evaluate how seman-tically relevant the headline is to the news body.For attractiveness, annotators are asked how at-tractive the headlines are. For fluency, we ask theannotators to evaluate how fluent and readable thetext is. After the collection of human evaluationresults, we averaged the scores as the final score. Inaddition, we have another independent human eval-uation task about the style strength – we presentthe generated headlines from TitleStylist and base-lines to the human judges and let them choose theone that most conforms to the target style such ashumor. Then we define the style strength score asthe proportion of choices.

4.3.2 Setup of Automatic EvaluationApart from the comprehensive human evaluation,we use automatic evaluation to measure the gen-eration quality through two conventional aspects:summarization quality and language fluency. Notethat the purpose of this two-way automatic eval-uation is to confirm that the performance of ourmodel is in an acceptable range. Good automaticevaluation performances are necessary proofs tocompliment human evaluations on the model effec-tiveness.

Summarization Quality We use the standard au-tomatic evaluation metrics for summarization withthe original headlines as the reference: BLEU (Pa-pineni et al., 2002), METEOR (Denkowskiand Lavie, 2014), ROUGE (Lin, 2004) andCIDEr (Vedantam et al., 2015). For ROUGE, weused the Files2ROUGE9 toolkit, and for other met-rics, we used the pycocoeval toolkit.10

Language Fluency We fine-tuned the GPT-2medium model (Radford et al., 2019) on our col-lected headlines and then used it to measure theperplexity (PPL) on the generated outputs.11

4.4 Experimental DetailsWe used the fairseq code base (Ott et al., 2019).During training, we use Adam optimizer with aninitial learning rate of 5 × 10−4, and the batchsize is set as 3072 tokens for each GPU with theparameters update frequency set as 4. For the ran-dom corruption for DAE training, we follow thestandard practice to randomly delete or blank theword with a uniform probability of 0.2, and ran-domly shuffled the word order within 5 tokens. Alldatasets are lower-cased. λ is set as 0.5 in experi-ments. For each iteration of training, we randomlydraw a batch of data either from the source datasetor from the target style corpus, and the samplingstrategy follows the uniform distribution with theprobability being equal to λ.

5 Results and Discussion

5.1 Human Evaluation ResultsThe human evaluation is to have a comprehensivemeasurement of the performances. We conductexperiments on four criteria, relevance, attraction,fluency, and style strength. We summarize the hu-man evaluation results on the first three criteria inTable 2, and the last criteria in Table 4. Note thatthrough automatic evaluation, the baselines NST,Fine-tuned, and Gigaword-MASS perform poorerthan other methods (in Section 5.2), thereby weremoved them in human evaluation to save unnec-essary work for human raters.

Relevance We first look at the relevance scoresin Table 2. It is interesting but not surprising thatthe pure summarization model NHG achieves thehighest relevance score. The outputs from NHG

9https://github.com/pltrdy/files2rouge10https://github.com/Maluuba/nlg-eval11PPL on the development set is 42.5

https://github.com/pltrdy/files2rouge

https://github.com/Maluuba/nlg-eval

5088

Style Settings Relevance Attraction Fluency

None NHG 6.21 8.47 9.31Human 5.89 8.93 9.33

Humor Multitask 5.51 8.61 9.11TitleStylist 5.87 8.93 9.29

Romance Multitask 5.67 8.54 8.91TitleStylist 5.86 8.87 9.14

Clickbait Multitask 5.67 8.71 9.21TitleStylist 5.83 9.29 9.44

Table 2: Human evaluation on three aspects: relevance,attraction, and fluency. “None” represents the originalheadlines in the dataset.

are usually like an organic reorganization of severalkeywords in the source context (as shown in Ta-ble 3), thus appearing most relevant. It is notewor-thy that the generated headlines of our TitleStylistfor all three styles are close to the original human-written headlines in terms of relevance, validatingthat our generation results are qualified in this as-pect. Another finding is that more attractive ormore stylistic headlines would lose some relevancesince they need to use more words outside the newsbody for improved creativity.

Attraction In terms of attraction scores in Ta-ble 2, we have three findings: (1) The human-written headlines are more attractive than thosefrom NHG, which agrees with our observation inSection 1. (2) Our TitleStylist can generate moreattractive headlines over the NHG and Multitaskbaselines for all three styles, demonstrating thatadapting the model to these styles could improvethe attraction and specialization of some parame-ters in the model for different styles can further en-hance the attraction. (3) Adapting the model to the“Clickbait” style could create the most attractiveheadlines, even out-weighting the original ones,which agrees with the fact that click-baity head-lines are better at drawing readers’ attention. Tobe noted, although we learned the “Clickbait” styleinto our summarization system, we still made surethat we are generating relevant headlines instead oftoo exaggerated ones, which can be verified by ourrelevance scores.

Fluency The human-annotated fluency scores inTable 2 verified that our TitleStylist generated head-lines are comparable or superior to the human-written headlines in terms of readability.

Style Strength We also validated that our Ti-tleStylist can carry more styles compared with the

Multitask and NHG baselines by summarizing thepercentage of choices by humans for the most hu-morous or romantic headlines in Table 4.

5.2 Automatic Evaluation Results

Apart from the human evaluation of the overall gen-eration quality on four criteria, we also conducteda conventional automatic assessment to gauge onlythe summarization quality. This evaluation doesnot take other measures such as the style strengthinto consideration, but it serves as important com-plimentary proof to ensure that the model has anacceptable level of summarization ability.

Table 5 summarizes the automatic evaluationresults of our proposed TitleStylist model and allbaselines. We use the summarization-related eval-uation metrics, i.e., BLEU, ROUGE, CIDEr, andMETEOR, to measure how relevant the generatedheadlines are to the news articles, to some extent,by comparing them to the original human-writtenheadlines. In Table 5, the first row “NHG” showsthe performance of the current state-of-the-art sum-marization model on our data, and Table 3 providestwo examples of its generation output. Our ulti-mate goal is to generate more attractive headlinesthan these while maintaining relevance to the newsbody.

From Table 5, the baseline Gigaword-MASSscored worse than NHG, revealing that directly ap-plying an off-the-shelf headline generation modelto new in-domain data is not feasible, althoughthis model has been trained on more than 20 timeslarger dataset. Both NST and Fine-tuned baselinespresent very poor summarization performance, andthe reason could be that both of them cast the prob-lem into two steps: summarization and style trans-fer, and the latter step is absent of the summariza-tion task, which prevents the model from maintain-ing its summarization capability.

In contrast, the Multitask baseline involves thesummarization and style transfer (via reconstruc-tion training) processes at the same time and showssuperior summarization performance even com-pared with NHG. This reveals that the unsuper-vised reconstruction task can indeed help improvethe supervised summarization task. More impor-tantly, we use two different types of corpora for thereconstruction task: one consists of headlines thatare similar to the news data for the summarizationtask, and the other consists of text from novels thatare entirely different from the news data. However,

5089

NewsAbstract

Turkey’s bitter history with Kurds is figuring promi-nently in its calculations over how to deal with Bushadministration’s request to use Turkey as the base forthousands of combat troops if there is a war with Iraq;Recep Tayyip Erdogan, leader of Turkey’s govern-ing party, says publicly for the first time that futureof Iraq’s Kurdish area, which abuts border region ofTurkey also heavily populated by Kurds, is weighingheavily on negotiations; Hints at what Turkish officialshave been saying privately for weeks: if war comesto Iraq, overriding Turkish objective would be lesshelping Americans topple Saddam Hussein, but ratherpreventing Kurds in Iraq from forming their own state.

Reunified Berlin is commemorating 40th anniversaryof the start of construction of Berlin wall, almost 12years since Germans jubilantly celebrated reopeningbetween east and west and attacked hated structurewith sledgehammers; Some Germans are championingthe preservation of wall at the time when little remainsbeyond few crumbling remnants to remind Berlinersof unhappy division that many have since worked hardto heal and put behind them; What little remains ofphysical wall embodies era that Germans have yet toresolve for themselves; They routinely talk of ’wall inthe mind’ to describe social and cultural differencesthat continue to divide easterners and westerners.

Human Turkey assesses question of Kurds The wall Berlin can’t quite demolishNHG Turkey’s bitter history with Kurds Construction of Berlin wall is commemorated

Humor What if there is a war with Kurds? The Berlin wall, 12 years later, is still there?Romance What if the Kurds say “No” to Iraq? The Berlin wall: from the past to the presentClickbait For Turkey, a long, hard road East vs West, Berlin wall lives on

Table 3: Examples of style-carrying headlines generated by TitleStylist.

Style NHG Multitask TitleStylist

Humor 18.7 35.3 46.0Romance 24.7 34.7 40.6Clickbait 13.8 35.8 50.4

Table 4: Percentage of choices (%) for the most humor-ous or romantic headlines among TitleStylist and twobaselines NHG and Multitask.

unsupervised reconstruction training on both typesof data can contribute to the summarization task,which throws light on the potential future workin summarization by incorporating unsupervisedlearning as augmentation.

We find that in Table 5 TitleStylist-F achieves thebest summarization performance. This implicatesthat, compared with the Multitask baseline wherethe two tasks share all parameters, specialization oflayer normalization and encoder-attention parame-ters can make GS focus more on summarization.

It is noteworthy that the summarization scoresfor TitleStylist are lower than TitleStylist-F but stillcomparable to NHG. This agrees with the fact thatthe GT branch more focuses on bringing in stylis-tic linguistic patterns into the generated summaries,thus the outputs would deviate from the pure sum-marization to some degree. However, the relevancedegree of them remains close to the baseline NHG,which is the starting point we want to improve on.Later in the next section, we will further validatethat these headlines are faithful to the new articlethrough human evaluation.

We also reported the perplexity (PPL) of the gen-erated headlines to evaluate the language fluency,as shown in Table 5. All outputs from baselinesNHG and Multitask and our proposed TitleStylistshow similar PPL compared with the test set (usedin the fine-tuning stage) PPL 42.5, indicating thatthey are all fluent expressions for news headlines.

5.3 Extension to Multi-Style

We progressively expand TitleStylist to include allthree target styles (humor, romance, and clickbait)to demonstrate the flexibility of our model. Thatis, we simultaneously trained the summarizationtask on the headlines data and the DAE task onthe three target style corpora. And we made thelayer normalization and encoder-attention parame-ters specialized for these four styles (fact, humor,romance, and clickbait) and shared the other pa-rameters. We compared this multi-style version,TitleStylist-Versatile, with the previously presentedsingle-style counterpart, as shown in Table 6. Fromthis table, we see that the BLEU and ROUGE-Lscores of TitleStylist-Versatile are comparable toTitleStylist for all three styles. Besides, we con-ducted another human study to determine the betterheadline between the two models in terms of attrac-tion, and we allow human annotators to choose bothoptions if they deem them as equivalent. The resultis presented in the last column of Table 6, whichshows that the attraction of TitleStylist-Versatileoutputs is competitive to TitleStylist. TitleStylist-Versatile thus generates multiple headlines in differ-ent styles altogether, which is a novel and efficient

5090

Style Corpus Model BLEU ROUGE-1 ROUGE-2 ROUGE-L CIDEr METEOR PPL (↓) Len. Ratio (%)

None NHG 12.9 27.7 9.7 24.8 0.821 0.123 40.4 8.9Gigaword-MASS 9.2 22.6 6.4 20.1 0.576 0.102 65.0 9.7

Humor

NST 5.8 17.8 4.3 16.1 0.412 0.078 361.3 9.2Fine-tuned 4.3 15.7 3.4 13.2 0.140 0.093 398.8 3.9Multitask 14.7 28.9 11.6 26.1 0.995 0.134 40.0 9.5TitleStylist 13.3 28.1 10.3 25.4 0.918 0.127 46.2 10.6TitleStylist-F 15.2 29.2 11.6 26.3 1.022 0.135 39.3 9.7

Romance


Clickbait


Table 5: Automatic evaluation results of our TitleStylist and baselines. The test set of each style is the same, butthe training set is different depending on the target style as shown in the “Style Corpus” column. “None” meansno style-specific dataset, and “Humor”, “Romance” and “Clickbait” corresponds to the datasets we introduced inSection 4.1.2. During the inference phase, our TitleStylist can generate two outputs: one from GT and the otherfrom GS . Outputs from GT are style-carrying, so we denote it as “TitleStylist”; outputs from GS are plain andfactual, thus denoted as “TitleStylist-F.” The last column “Len. Ratio” denotes the average ratio of abstract lengthto the generated headline length by the number of words.

Style Model BLEU RG-L Pref. (%)

None TitleStylist-Versatile 14.5 25.8 —

Humor TitleStylist-Versatile 12.3 24.5 42.6TitleStylist 13.3 25.4 57.4

Romance TitleStylist-Versatile 12.0 24.2 46.3TitleStylist 12.0 24.4 53.7

Clickbait TitleStylist-Versatile 13.1 24.9 52.9TitleStylist 11.5 23.7 47.1

Table 6: Comparison between TitleStylist-Versatile andTitleStylist. “RG-L” denotes ROUGE-L, and “Pref.”denotes preference.

feature.

6 Conclusion

We have proposed a new task of Stylistic HeadlineGeneration (SHG) to emphasize explicit controlof styles in headline generation for improved at-traction. To this end, we presented a multitaskframework to induce styles into summarization,and proposed the parameters sharing scheme toenhance both summarization and stylization capa-bilities. Through experiments, we validated ourproposed TitleStylist can generate more attractiveheadlines than state-of-the-art HG models.

Acknowledgement

We appreciate all the volunteer native speakers(Shreya Karpoor, Lisa Orii, Abhishek Mohan,Paloma Quiroga, etc.) for the human evaluation of

our study, and thank the reviewers for their inspir-ing comments. Joey Tianyi Zhou is partially sup-ported by the Agency for Science, Technology andResearch (A*STAR) under its AME ProgrammaticFunding Scheme (Project No. A18A1b0045).

References

Michele Banko, Vibhu O Mittal, and Michael J Wit-brock. 2000. Headline generation based on statisti-cal translation. In Proceedings of the 38th AnnualMeeting on Association for Computational Linguis-tics, pages 318–325. Association for ComputationalLinguistics.

Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei.2018a. Retrieve, rerank and rewrite: Soft templatebased neural summarization. In ACL.

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li.2018b. Faithful to the original: Fact aware neuralabstractive summarization. In Thirty-Second AAAIConference on Artificial Intelligence.

Cheng-Kuan Chen, Zhu Feng Pan, Ming-Yu Liu, andMin Sun. 2019. Unsupervised stylish image de-scription generation via domain layer norm. In TheThirty-Third AAAI Conference on Artificial Intelli-gence, AAAI 2019, The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference, IAAI2019, The Ninth AAAI Symposium on EducationalAdvances in Artificial Intelligence, EAAI 2019, Hon-olulu, Hawaii, USA, January 27 - February 1, 2019,pages 8151–8158. AAAI Press.

https://doi.org/10.1609/aaai.v33i01.33018151

https://doi.org/10.1609/aaai.v33i01.33018151

5091

Sumit Chopra, Michael Auli, and Alexander M Rush.2016. Abstractive sentence summarization with at-tentive recurrent neural networks. In Proceedings ofthe 2016 Conference of the North American Chap-ter of the Association for Computational Linguistics:Human Language Technologies, pages 93–98.

Michael Denkowski and Alon Lavie. 2014. Meteoruniversal: Language specific translation evaluationfor any target language. In Proceedings of the ninthworkshop on statistical machine translation, pages376–380.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei,Xiaodong Liu, Yu Wang, Jianfeng Gao, MingZhou, and Hsiao-Wuen Hon. 2019. Unifiedlanguage model pre-training for natural languageunderstanding and generation. arXiv preprintarXiv:1905.03197.

Bonnie Dorr, David Zajic, and Richard Schwartz.2003. Hedge trimmer: A parse-and-trim approachto headline generation. In Proceedings of the HLT-NAACL 03 on Text summarization workshop-Volume5, pages 1–8. Association for Computational Lin-guistics.

Vincent Dumoulin, Jonathon Shlens, and ManjunathKudlur. 2016. A learned representation for artisticstyle. arXiv preprint arXiv:1610.07629.

HP Edmundson. 1964. Problems in automatic abstract-ing. Communications of the ACM, 7(4):259–263.

Angela Fan, David Grangier, and Michael Auli. 2018.Controllable abstractive summarization. In Pro-ceedings of the 2nd Workshop on Neural MachineTranslation and Generation, NMT@ACL 2018, Mel-bourne, Australia, July 20, 2018, pages 45–54. As-sociation for Computational Linguistics.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao,and Rui Yan. 2018. Style transfer in text: Explo-ration and evaluation. In Thirty-Second AAAI Con-ference on Artificial Intelligence.

Yidi Guo, Heyan Huang, Yang Gao, and Chi Lu. 2017.Conceptual multi-layer neural network model forheadline generation. In Chinese Computational Lin-guistics and Natural Language Processing Basedon Naturally Annotated Big Data, pages 355–367.Springer.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefen-stette, Lasse Espeholt, Will Kay, Mustafa Suleyman,and Phil Blunsom. 2015. Teaching machines to readand comprehend. In Advances in neural informationprocessing systems, pages 1693–1701.

Tatsuru Higurashi, Hayato Kobayashi, Takeshi Ma-suyama, and Kazuma Murao. 2018. Extractive head-line generation based on learning to rank for commu-nity question answering. In Proceedings of the 27thInternational Conference on Computational Linguis-tics, pages 1742–1753.

Kango Iwama and Yoshinobu Kano. 2019. Multiplenews headlines generation using page metadata. InProceedings of the 12th International Conferenceon Natural Language Generation, 2019. Associationfor Computational Linguistics.

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and PeterSzolovits. 2020. Unsupervised domain adaptationfor neural machine translation with iterative backtranslation. arXiv preprint arXiv:2001.08140.

Zhijing Jin, Di Jin, Jonas Mueller, Nicholas Matthews,and Enrico Santus. 2019. Unsupervised text at-tribute transfer via iterative matching and translation.In IJCNLP 2019.

Hongyan Jing and Kathleen McKeown. 1999. The de-composition of human-written summary sentences.

Chris Kedzie, Kathleen McKeown, and Hal Daume III.2018. Content selection in deep learning models ofsummarization. arXiv preprint arXiv:1810.12343.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov,Richard Zemel, Raquel Urtasun, Antonio Torralba,and Sanja Fidler. 2015. Skip-thought vectors. InAdvances in Neural Information Processing Systems,pages 3294–3302.

Guillaume Lample, Sandeep Subramanian,Eric Michael Smith, Ludovic Denoyer,Marc’Aurelio Ranzato, and Y-Lan Boureau.2019. Multiple-attribute text rewriting. In ICLR.

Chin-Yew Lin. 2004. Rouge: A package for automaticevaluation of summaries. In Text summarizationbranches out, pages 74–81.

Junyang Lin, Xu Sun, Shuming Ma, and Qi Su. 2018.Global encoding for abstractive summarization. InACL.

Tianshang Liu, Haoran Li, Junnan Zhu, Jiajun Zhang,and Chengqing Zong. 2018. Review headline gen-eration with user embedding. In Chinese Computa-tional Linguistics and Natural Language ProcessingBased on Naturally Annotated Big Data - 17th ChinaNational Conference, CCL 2018, and 6th Inter-national Symposium, NLP-NABD 2018, Changsha,China, October 19-21, 2018, Proceedings, pages324–334.

Hans Peter Luhn. 1958. The automatic creation of lit-erature abstracts. IBM Journal of research and de-velopment, 2(2):159–165.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever,Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. CoRR,abs/1511.06114.

Betty A Mathis, James E Rush, and Carol E Young.1973. Improvement of automatic abstracts by theuse of structural analysis. Journal of the AmericanSociety for Information Science, 24(2):101–109.

https://doi.org/10.18653/v1/w18-2706

https://www.inlg2019.com/assets/papers/35_Paper.pdf

https://www.inlg2019.com/assets/papers/35_Paper.pdf

https://doi.org/10.1007/978-3-030-01716-3_27

https://doi.org/10.1007/978-3-030-01716-3_27

5092

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre,Bing Xiang, et al. 2016. Abstractive text summariza-tion using sequence-to-sequence rnns and beyond.arXiv preprint arXiv:1602.06023.

Myle Ott, Sergey Edunov, Alexei Baevski, AngelaFan, Sam Gross, Nathan Ng, David Grangier, andMichael Auli. 2019. fairseq: A fast, extensi-ble toolkit for sequence modeling. arXiv preprintarXiv:1904.01038.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic eval-uation of machine translation. In Proceedings of the40th annual meeting of the Association for Compu-tational Linguistics, pages 311–318.

Dragomir R Radev and Kathleen R McKeown. 1998.Generating natural language summaries from mul-tiple on-line sources. Computational Linguistics,24(3):470–500.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan,Dario Amodei, and Ilya Sutskever. 2019. Languagemodels are unsupervised multitask learners.

Alexander M Rush, Sumit Chopra, and Jason We-ston. 2015. A neural attention model for ab-stractive sentence summarization. arXiv preprintarXiv:1509.00685.

Alexander M Rush, SEAS Harvard, Sumit Chopra, andJason Weston. 2017. A neural attention model forsentence summarization. In ACLWeb. Proceedingsof the 2015 Conference on Empirical Methods inNatural Language Processing.

Gerard Salton, Amit Singhal, Mandar Mitra, and ChrisBuckley. 1997. Automatic text structuring and sum-marization. Information processing & management,33(2):193–207.

Evan Sandhaus. 2008. The new york times annotatedcorpus. Linguistic Data Consortium, Philadelphia,6(12):e26752.

Shi-Qi Shen, Yan-Kai Lin, Cun-Chao Tu, Yu Zhao, Zhi-Yuan Liu, Mao-Song Sun, et al. 2017a. Recent ad-vances on neural headline generation. Journal ofcomputer science and technology, 32(4):768–784.

Shiqi Shen, Yun Chen, Cheng Yang, Zhiyuan Liu, andMaosong Sun. 2018. Zero-shot cross-lingual neu-ral headline generation. IEEE/ACM Trans. Audio,Speech & Language Processing, 26(12):2319–2327.

Shiqi Shen, Yu Zhao, Zhiyuan Liu, MaosongSun, et al. 2016. Neural headline generationwith sentence-wise optimization. arXiv preprintarXiv:1604.01904.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi S.Jaakkola. 2017b. Style transfer from non-paralleltext by cross-alignment. In Advances in NeuralInformation Processing Systems 30: Annual Con-ference on Neural Information Processing Systems

2017, 4-9 December 2017, Long Beach, CA, USA,pages 6830–6841.

Kai Shu, Suhang Wang, Thai Le, Dongwon Lee, andHuan Liu. 2018. Deep headline generation for click-bait detection. 2018 IEEE International Conferenceon Data Mining (ICDM), pages 467–476.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequencepre-training for language generation. arXiv preprintarXiv:1905.02450.

Sho Takase, Jun Suzuki, Naoaki Okazaki, TsutomuHirao, and Masaaki Nagata. 2016. Neural head-line generation on abstract meaning representation.In Proceedings of the 2016 conference on empiri-cal methods in natural language processing, pages1054–1059.

Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017.From neural sentence summarization to headlinegeneration: A coarse-to-fine approach. In IJCAI,pages 4109–4115.

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In Advances in neural information pro-cessing systems, pages 5998–6008.

Ramakrishna Vedantam, C Lawrence Zitnick, and DeviParikh. 2015. Cider: Consensus-based image de-scription evaluation. In Proceedings of the IEEEconference on computer vision and pattern recogni-tion, pages 4566–4575.

Jingjing Xu, Xu Sun, Qi Zeng, Xiaodong Zhang, Xu-ancheng Ren, Houfeng Wang, and Wenjie Li. 2018.Unpaired sentiment-to-sentiment translation: A cy-cled reinforcement learning approach. In ACL.

Peng Xu, Chien-Sheng Wu, Andrea Madotto, and Pas-cale Fung. 2019. Clickbait? sensational headlinegeneration with auto-tuned reinforcement learning.ArXiv, abs/1909.03582.

Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, andTaylor Berg-Kirkpatrick. 2018. Unsupervised textstyle transfer using language models as discrimina-tors. In Advances in Neural Information Process-ing Systems 31: Annual Conference on Neural Infor-mation Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montreal, Canada, pages 7298–7309.

Ruqing Zhang, Jiafeng Guo, Yixing Fan, Yanyan Lan,Jun Xu, Huanhuan Cao, and Xueqi Cheng. 2018a.Question headline generation for news articles. InProceedings of the 27th ACM International Confer-ence on Information and Knowledge Management,CIKM 2018, Torino, Italy, October 22-26, 2018,pages 617–626.

https://doi.org/10.1109/TASLP.2018.2842432

https://doi.org/10.1109/TASLP.2018.2842432

http://papers.nips.cc/paper/7259-style-transfer-from-non-parallel-text-by-cross-alignment

http://papers.nips.cc/paper/7259-style-transfer-from-non-parallel-text-by-cross-alignment

http://papers.nips.cc/paper/7959-unsupervised-text-style-transfer-using-language-models-as-discriminators



https://doi.org/10.1145/3269206.3271711

5093

Zhirui Zhang, Shuo Ren, Shujie Liu, Jianyong Wang,Peng Chen, Mu Li, Ming Zhou, and Enhong Chen.2018b. Style transfer as unsupervised machine trans-lation. ArXiv, abs/1808.07894.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhut-dinov, Raquel Urtasun, Antonio Torralba, and SanjaFidler. 2015. Aligning books and movies: Towardsstory-like visual explanations by watching moviesand reading books. In Proceedings of the IEEE inter-national conference on computer vision, pages 19–27.

Date post:	09-Mar-2021
Category:	Documents
Upload:	others
View:	3 times
Download:	0 times

Hooks in the Headline: Learning to Generate Headlines with ...click-baity headline generation. Both...

Documents