CommonGen: A Constrained Text Generation Challenge for ... · sourcing and existing caption...

CommonGen A Constrained Text Generation Challengefor Generative Commonsense Reasoning

Bill Yuchen Linhearts Ming Shenhearts Wangchunshu Zhouhearts Pei Zhouhearts

Chandra Bhagavatulaspades Yejin Choispadesdiams Xiang RenheartsheartsUniversity of Southern California spadesAllen Institute for Artificial Intelligence

diamsPaul G Allen School of Computer Science amp Engineering University of Washington

yuchenlinshemmingpeizxiangrenusceduchandrabyejincallenaiorg

Abstract

Recently large-scale pretrained language mod-els have demonstrated impressive performanceon several commonsense benchmark datasetsHowever building machines with common-sense to compose realistically plausible sen-tences remains challenging In this paperwe present a constrained text generation taskCOMMONGEN

1 associated with a benchmarkdataset to explicitly test machines for theability of generative commonsense reasoningGiven a set of common concepts (eg dogfrisbee catch throw) the task is to generate acoherent sentence describing an everyday sce-nario using these concepts (eg ldquoa man throwsa frisbee and his dog catches itrdquo)

COMMONGEN is challenging because it in-herently requires 1) relational reasoning usingbackground commonsense knowledge and 2)compositional generalization ability to workon unseen concept combinations Our datasetconstructed through a combination of crowd-sourcing and existing caption corpora consistsof 30k concept-sets and 50k sentences Ex-periments show that there is a large gap be-tween state-of-the-art text generation models(eg T5) and human performance (306 vs635 in SPICE metric) The models strug-gle at the task often generating grammaticallysound yet realistically implausible sentences ndashpointing to interesting future research

1 Introduction

Commonsense reasoning the ability to make ac-ceptable and logical assumptions about ordinaryscenes in our daily life has long been acknowl-edged as a critical bottleneck of artificial intelli-gence and natural language processing (Davis andMarcus 2015) Most recent commonsense rea-soning challenges such as CommonsenseQA (Tal-

1Our code and data can be found at httpinklabusceduCommonGen Work in progress

dog | frisbee | catch | throw

- A dog leaps to catch a thrown frisbee- The dog catches the frisbee when the boy throws it- A man throws away his dog s favorite frisbee expecting him to catch it in the air

Expected Output everyday scenarios covering all given concepts

[Humans]

GPT2 A dog throws a frisbee at a football player

UniLM Two dogs are throwing frisbees at each other

BART A dog throws a frisbee and a dog catches it

T5 dog catches a frisbee and throws it to a dog

[Machines]

exercise | rope | wall | tie | wave

- A man in a gym exercises by waving ropes tied to a wall- The gym owner decided to tie a rope to the wall so people could make a wave in it for exercise

Concept-Set

[Humans]

GPT2 A woman is tied up in a rope and swinging a wave at a wall

UniLM A man with a rope and tie is doing some exercise on a wall

BART A man is tied to a rope and is waving his arms and doing

exercises on the wall [Machines]

Concept-Set a collection of objectsactions

Generative Commonsense Reasoning

Figure 1 An example from our COMMONGEN datasetGPT-2 (Radford et al 2019) UniLM (Dong et al2019) BART (Lewis et al 2019) and T5 (Raffel et al2019) are large pre-trained text generation models fine-tuned on the proposed task

mor et al 2019) SocialIQA (Sap et al 2019b)WinoGrande (Sakaguchi et al 2019) and Hel-laSwag (Zellers et al 2019b) have been framedas discriminative tasks ndash ie AI systems are re-quired to choose the correct option from a set ofchoices based on a given context While signifi-cant progress has been made on these discrimina-tive tasks we argue that commonsense reasoningin text generation poses a distinct complementarychallenge In this paper we advance machine com-monsense towards generative reasoning ability

Humans acquire the ability to compose sentencesby learning to understand and use common con-cepts that they recognize in their surrounding envi-ronment (Tincoff and Jusczyk 1999) The acquisi-tion of such an ability is regarded as a significantmilestone of human development (Moore 2013)Can machines acquire such generative common-sense reasoning ability To initiate the invesitag-tion we present COMMONGEN ndash a novel con-strained generation task that requires machines togenerate a sentence describing a day-to-day scene

using concepts from a given concept-set For ex-ample given a set of concepts exercise ropewall tie wave machines are required to generatea sentence such as ldquoa man in a gym exercises bywaving ropes tied to a wallrdquo

To successfully solve the task models need toincorporate two key capabilities a) relational rea-soning and b) compositional generalization Gram-matically sound sentences may not always be real-istic as they might violate our commonsense (eg

ldquoa dog throws a frisbee rdquo in Fig 1) In order tocompose a plausible sentence that describes an ev-eryday scenario models need to construct a gram-matical sentence while adhering to and reasoningover the commonsense relations between the givenconcepts Models additionally need compositionalgeneralization ability to infer about unseen conceptcompounds This encourages models to reasonabout a potentially infinite number of novel combi-nations of familiar concepts ndash an ability believedto be a limitation of current AI systems (Lake andBaroni 2017 Keysers et al 2020)

Therefore in support of the COMMONGEN taskwe present a dataset consisting of 29599 concept-sets associated with 49129 sentences We explic-itly design our dataset collection process to cap-ture the key challenges of relational reasoning andcompositional generalization described above Weestablish comprehensive baseline performance forstate-of-the-art language generation models Thebest model based on T5 (Raffel et al 2019)achieves 3660 with significant gap compared tohuman performance of 6350 in the SPICE met-ric ndash demonstrating the difficulty of the task Ouranalysis shows that state-of-the-art models strug-gle at the task generating implausible sentences ndasheg ldquodog throws a frisbee rdquo ldquogive massage to atablerdquo etc ndash pointing to interesting future researchdirections for the community

2 Task Formulation and Challenges

We formulate the proposed COMMONGEN taskwith mathematical notations and discuss its inher-ent challenges with concrete examples

The input is an unordered set of k concepts x =c1 c2 ck isin X (ie a concept-set) whereeach concept ci isin C is a common object (noun)or action (verb) We use X to denote the spaceof all possible concept-sets and use C to denotethe concept vocabulary (a subset of ConceptNetrsquossingle-word concepts) The expected output is a

simple grammatical sentence y isin Y that describesa common scenario in our daily life using2 allgiven concepts in x A scenario can depict either astatic situation or a short series of actions

The task is to learn a structured predictive func-tion f ∶ X rarr Y which maps a concept-set x toa sentence y The unique challenges of this taskcome from two major aspects as followsRelational Reasoning with Commonsense Ex-pected generative reasoners should prioritize themost plausible scenes over an infinite number ofless plausible scenes Recall the first illustrativeexamples in Figure 1 the underlying knowledgeare implicit and compositional (a) dogs love toperform tricks with humans (b) catching a frisbeeis a trick and (c) humans love to play this gamewith dogs As for the other example in Section 1about exercise rope wall tie wave we alsoneed to compose the following commonsense facts(i) doing exercises is to cost energy (ii) wavinga rope can cost energy and (iii) it is more usefulwhen the rope is tied to a wall

In order to complete a scenario a generativecommonsense reasoner also needs to reasonably as-sociate additional concepts (eg lsquogymrsquo and lsquomanrsquo)as agents or environments for completing a naturaland coherent scenario in our daily life

This not only requires understanding underlyingcommonsense relations between concepts but alsoincrementally composing them towards a globallyoptimal scenario The underlying reasoning chainsare inherently based on a variety of backgroundknowledge such as spatial relations object prop-erties physical rules temporal event knowledgesocial conventions etc However they may not berecorded in any existing knowledge basesCompositional Generalization Humans cancompose a sentence to describe a scenario aboutthe concepts they may never seen them co-occurring For example there is a testing concept-set x =pear basket pick put tree The conceptlsquopearrsquo never appear in the training data and lsquopickrsquonever co-occurs with lsquobasketrsquo Meanwhile thereare some relevant training examples

bull x1 =apple bag putrarr y1 =ldquoa boy puts anapple in a bagrdquo

bull x2 =apple tree pick rarr y2 =ldquoa girl picksan apple from the treerdquo

bull x3 =apple basket wash rarr y3 =ldquoa girltakes an apple from the basket and washes itrdquo

2Note that morphological inflections are allowed

We humans can generalize from these seen sce-narios and infer that a plausible output y =ldquoa girlpicks some pears from a tree and put them into herbasketrdquo This compositionally generalization abil-ity via analogy ie to make ldquoinfinite use of finitemeansrdquo (Chomsky 1965) is challenging for ma-chines This analogical challenge not only requiresinference about similar concepts (eg lsquoapplersquo rarrlsquopearrsquo) but also their latent associations

3 The COMMONGEN Dataset

We now introduce the construction and analysisof the proposed COMMONGEN dataset in this sec-tion To ensure that the concepts in each inputconcept-set are likely to be present together in aeveryday scene we utilize a wide range of existingcaption corpora for sampling frequent concept-sets(Section 31) We also carefully control the over-lap between the training set and developmenttestset such that the task is more challenging in termsof compositional generalization Afterwards weemploy workers on the crowd-sourcing platformAMT for collecting more human-written sentences(Section 32) and thus enrich the diversity of de-velopment and test set Finally we present thestatistics of the COMMONGEN dataset and utilizeConceptNet as an intermediate tool to investigatethe concept connectivity and the distribution ofvarious knowledge types (Section 33)

31 Collecting Concept-Sets from Captions

It is obviously nonsense if we ask a reasoner togenerate a scenario about an arbitrarily concept-setwhich is impossible even for humans The expectedconcept-sets of our task are supposed to be verylikely to co-occur in common daily-life scenesSuch everyday scenarios are ubiquitous in imagesand video clips and this leads us to think aboutusing imagevideo captioning datasets as a naturalresource for collecting concept-sets and sentences

We therefore collect a large amount of captionsentences from all publicly available visual captioncorpora including image captioning datasets suchas Flickr30k (Young et al 2014) MSCOCO (Linet al 2014) Conceptual Captions (Sharma et al2018) and also video captioning datasets such asLSMDC (Rohrbach et al 2017) ActivityNet (Kr-ishna et al 2017) and VATEX (Wang et al 2019)

We first conduct part-of-speech tagging over allsentences in the corpora such that words in sen-tences can be matched to the concept vocabulary of

ConceptNet Then we compute the sentence fre-quency of concept-sets that consist of 3sim5 conceptsThat is for each combination of threefourfive con-cepts in the vocabulary we know how many sen-tences are in the corpora covering all concepts

Towards building a more representative datasetwe expect our selected subset of concept-sets canreflect the distribution in the real world A straight-forward intuition is to directly treat the frequencyas the measure of likelihood of concept-sets andthen conduct probabilistic sampling based on thisdistribution However this method tends to sampleconcept-sets that contain one or two single highlyfrequent concept thus leading to corpus-dependentbias Also merely using the sentence number canbe imprecise to measure the scenario diversity sincemany images and videos were sampled interdepen-dently We therefore design a scoring function toweight a concept-set x to incorporate diversity andpenalty of inverse set frequency

score(x) = ∣S(x)∣∣⋃siisinS(x)w∣w isin si∣sumsiisinS(x) Length(si)

ρ(x)

We denote S(x) as the set of different sentencesthat contain all its concepts c1 c2 ck = xsi as one of the sentences and ∣S(x)∣ to be thenumber of sentences The second term is to di-vide the number of unique words in these sen-tences by the sum of the lengths of all the sen-tences which can roughly represent the diversityof the scenes described in these sentences Thenwe times the result with the last term ρ(x) =

∣X ∣(maxciisinx ∣xprime ∣ ci isin xprime and x

primeisin X ∣)

The idea is to find the concept in x that has themaximum set frequency (ie the number of differ-ent concept-sets (with non-zero weight) containsit) and then take the inverse with normalizationof the number of all concept-sets This penalty ef-fectively controls the bias towards highly frequentconcepts With the distribution of such scores wesample 100k concept-sets as candidate inputs

32 Crowd-Sourcing References via AMTAlthough the human-written sentences in the cap-tion corpora can be seen as quality annotations forthe COMMONGEN task as well they were writtenwith specific visual context (ie an image or a videoclip) Toward better diversity of the scenes aboutsampled concept-sets and more rigorous evalua-tion for systems crowd-sourcing additional humanreferences is necessary that are written with only

Statistics Train Dev Test

Concept-Sets 27069 993 1497-Size = 3 20580 493 --Size = 4 4207 250 747-Size = 5 2282 250 750

Sentences 39069 4018 6042Average Length 1085 1315 1380

Unique Concepts 6643 813 1351 Unique Concept-Pairs 47574 3982 8930 Unique Concept-Triples 38110 3786 9976

Novel Concepts - 250 601 Novel Concept-Pairs - 6488 7545 Novel Concept-Triples - 9553 9849

Table 1 The basic statistics of the COMMONGEN dataWe highlight the ratios of concept compositions that areunseen in training data which assures the challenge incompositional generalization ability

concept-sets as the context We decide to use theAMT platform for collecting such sentences forcovered the top-ranked 2500 concept-sets in thesampled results due to the expensive cost of humanefforts in writing sentences and the difficulty in ver-ifying the quality of collected sentences Each ofthem is assigned to at least three different work-ers To encourage workers to write about everydayscenarios about given concept-sets we ask them towrite rationale sentences as well to explain whatcommonsense facts they have used Examples ofrationales are shown in Figure 4

We use these 2500 concept-sets as the dev andtest set examples for their higher weights and betterdiversity of human-written sentences Furthermorewe use the remaining concept-sets as the trainingexamples for which we use the associated captionsas the target outputs Note that we explicitly controlthe overlap between the training and devtest ex-amples by filtering training concept-sets that havemore than two overlapping concepts with any ex-ample in the devtest set

The basic statistics of the final dataset is shownin Table 1 There are on average four sentences foreach example in dev and test sets which providea richer and more diverse test-bed for further au-tomatic and manual evaluation We highlight theratio of novel concept compositions (ie conceptconcept-pair and concept-triple) in devtest whichnever (co-)occur in training examples This makesCOMMONGEN challenging in terms of composi-tional generalization ability

1XPEHURIampRQFHSW3DLUVZLWKKRSampRQQHFWLRQV

KRS KRS

Figure 2 Connectivity analysis in 5-size concept-sets inthe test set each of which consists of 10 concept pairs Forexample 120 in blue means there are 12 concept-sets thathave 3 concept pairs with one-hop connections on ConceptNet

33 Analysis about Commonsense Knowledge

We here introduce deeper analysis of the datasetby utilizing the largest commonsense knowledgegraph (KG) ConceptNet (Speer et al 2017) as antool to study connectivity and relation types

Connectivity Distribution Obviously if the con-cepts inside a given concept-set is more denselyconnected with each other on the KG then it iseasier to write a scenario about them In each 5-size concept-set (ie a concept-set consists of fiveconcepts) there are 10 unique pairs of conceptsthe connections of which we are interested in Asshown in Figure 2 if we look at the one-hop linkson the KG about 60 of the 5-size concept-sethave less than one link among all concept-pairsOn the other hand if we consider two-hop linksthen nearly 50 of them are almost fully connected(ie each pair of concepts has connections)

These two observations together suggest that theCOMMONGEN has a reasonable difficulty the con-cepts are not too distant or too close and reasoningabout the associated scenes is thus neither too diffi-cult nor too trivial

Relation Distribution Furthermore the relationtypes of such connections can also tell us whatkinds of commonsense knowledge are potentiallyuseful for relational reasoning towards generationWe report the frequency of different relation types3

of the onetwo-hop connections among concept-pairs in the dev and test examples in Fig 3 To bet-ter summarize the distributions we categorize theserelations into five major types and present their dis-tribution in Table 2 respectively for onetwo-hopconnections between concept pairs

3Relation definitions are at httpsgithubcomcommonsenseconceptnet5wikiRelations

(1) One-hop Relation Distribution

(2) Two-hop Relation Distribution

Figure 3 Onetwo-hop relation frequency in the COMMONGEN devamptest sets on ConceptNet

Category Relations 1-hop 2-hop

Spatial knowledge AtLocation LocatedNear 940 3931

Object properties

UsedForCapableOfPartOf ReceivesActionMadeOf

FormOf HasPropertyHasA960 4404

Human behaviors

CausesDesireMotivatedByDesiresNotDesiresManner 460 1959

Temporal knowledge

Subevent PrerequisiteFirstLast-Subevent 150 2403

GeneralRelatedTo SynonymDistinctFrom IsA

HasContextSimilarTo7489 6965

Table 2 The distributions of the relation categorieson onetwo-hop connections

4 Methods

In this section we briefly introduce the adoptedbaseline methods that are tested on the proposedCOMMONGEN task As there is no principledapproach for the proposed setting to the best ofour knowledge we mainly consider it as a condi-tional sentence generation task that can be solvedby many sequence-to-sequence frameworks

Encoder-Decoder Models Bidirectional RNNsand Transformers (Vaswani et al 2017) are twomost popular architectures for seq2seq learningWe use them with the addition of attention mecha-nism (Luong et al 2015) with copying ability (Guet al 2016) which are based on an open-sourceframework OpenNMT-py (Klein et al 2017) Weuse bRNN-CopyNet and Trans-CopyNet de-note them respectively To alleviate the influence

from the concept ordering in such sequential learn-ing methods we randomly permute them multi-ple times for training and decoding and then gettheir average performance To explicitly eliminatethe order-sensitivity of inputs we replace the en-coder with a mean pooling-based MLP network(MeanPooling-CopyNet)

Non-autoregressive generation Recent ad-vances (Lee et al 2018 Stern et al 2019) inconditional sentence generation have an embedinginterest on (edit-based) non-autoregressive genera-tion models which iteratively refine generated se-quences We assume that these models potentiallywould have better performance because of explicitmodeling on iterative refinements and thus studythe most recent such model Levenshtein Trans-former (LevenTrans) by Gu et al (2019)

Pre-trained Language Generation Models Wealso employ various pre-trained language gen-eration models including GPT-2 (Radfordet al 2019) UniLM (Dong et al 2019)UniLM-v2 (Bao et al 2020) BERT-Gen (Baoet al 2020) BART (Lewis et al 2019) andT5 (Raffel et al 2019) to tackle this task andtest their generative commonsense reasoning abil-ity We fine-tuned all the above models on ourtraining data with a seq2seq format

Specifically to use GPT-2 for this sequence-to-sequence task we condition the language modelon the format ldquoc1 c2 ck = yrdquo during fine-tuning where ci is a concept in the given concept-set and connects with other concepts with a blank

Model Metrics ROUGE-2L BLEU-34 METEOR CIDEr SPICE Coverage

bRNN-CopyNet (Gu et al 2016) 290 1925 550 200 1270 399 1060 4225Trans-CopyNet 228 1404 430 200 910 231 750 2419

MeanPooling-CopyNet 330 1935 660 240 1350 434 1300 4405LevenTrans (Gu et al 2019) 574 2124 880 400 1330 372 1400 3680

GPT-2 (Radford et al 2019) 1647 3801 2870 1940 2440 1106 2450 7509BERT-Gen (Bao et al 2020) 1978 4093 3320 2310 2850 1331 2830 8319UniLM (Dong et al 2019) 2157 4196 3830 2750 2940 1492 2990 9013

UniLM-v2 (Bao et al 2020) 2102 4241 3480 2430 2980 1461 3000 9220BART (Lewis et al 2019) 2238 4144 3510 2490 3050 1332 3010 9632

T5 (Raffel et al 2019) 2171 4179 3810 2720 3000 1458 3060 9502

Human Performance 4888 6379 4820 4490 3620 4353 6350 9931

Table 3 Experimental results of different baseline methods on the COMMONGEN test set The first group ofmodels are non-pretrained models while the second group is large pretrained models that we have fine-tuned Thebest models are bold and second best ones are underlined within each metric

y is a target sentence For inference we samplefrom the fine-tuned GPT-2 model after a prompt ofldquoc1 c2 ck =rdquo with beam search and use the firstgenerated sentence as the output sentence

For BERT-Gen we use the s2s-ft package4

to fine-tune them in a sequence-to-sequence fash-ion similar to the sequence-to-sequence LM objec-tive employed by UniLM

As for T5 the state-of-the-art text-to-text pre-trained model which is pre-trained with a multi-task objective by prepending a task descriptionbefore the input text we prepend the input con-cept set with a simple prompt ldquogenerate asentence withrdquo and fine-tune the modelwith the source sentence on the format ldquogenerate asentence with c1 c2 ckrdquo

5 Evaluation

In this section we first introduce our metrics forautomatic evaluation then analyze the performanceof tested systems and finally provide qualitativeanalysis with case studies

51 Metrics

Following other conventional generation taskswe use several widely-used automatic metricsto automatically assess the performance suchas BLEU (Papineni et al 2002) ROUGE (Lin2004) METEOR (Banerjee and Lavie 2005) whichmainly focus on measuring surface similarities Wereport the concept Coverage which is the aver-age percentage of input concepts that are present inlemmatizatized outputs

4httpsgithubcommicrosoftunilm

In addition we argue that it is more suitable touse evaluation metrics specially design for caption-ing task such as CIDEr (Vedantam et al 2015)and SPICE (Anderson et al 2016) They usuallyassume system generations and human referencesuse similar concepts and thus focus on evaluate theassociations between mentioned concepts insteadof n-gram overlap For example the SPICE met-ric use dependency parse trees as proxy of scenegraphs to measure the similarity of scenarios

To estimate human performance within eachmetric we treat each reference sentence in devtestdata as a ldquosystem predictionrdquo to be compared withall other references which is equivalent to com-pute inter-annotator agreement within each metricThus systems that have better generative abilitythan average crowd-workers should exceed this

52 Experimental ResultsTable 3 presents the experimental results of all com-pared methods in different metrics We can seethat all fine-tuned pre-trained models (the lowergroup) outperform non-pretrained models (the up-per group) with a significant margin This is notsurprising because the their pretraining objectivesincluding masked language modeling word order-ing and text infilling which predicts missing wordsor text spans are relevant to our task On the otherhand we find that the key disadvantage of non-pretrained models with CopyNet still falls in thefailure of using all given concepts (ie low cover-age) which results in worse results

Among them UniLM BART and T5 performsthe best which may be due to its inherent sequence-to-sequence pre-training framework We found that

[bRNN-CpNet] Lays massage someone table vertical gives on and the water

[Trans-CpNet] Massage lays on the kitchen

[MP-CpNet] A massage table being calling with an improvisation lay free speaker

[LevenTrans] A man chatting at the table

[GPT-2] A man gives a massage to a table

[BERT-Gen] A woman lays down on a table and gives a massage to a man

[UniLM] A woman lays down a massage on a table and gives a massage

[UniLM-v2] A woman is laying down and giving a massage on a table

[BART] A man lays on a table and gives a massage to a woman laying on the table

[T5] Woman lay on a table and gives a massage

[Machine generations] 1 The man lays down on the massage table and the therapist

gives him a massage[Rationale] The man must lay down to receive a massage

The therapist is the giver of massages The table is a

massage table

2 Lay down on the table and the masseuse will give you a

neck massage[Rationale] A masseuse is a woman who gives massages

professionally Massages are usually done on tables

3 The woman gives the man who lays on the table a massage[Rationale] Some massages are done laying down people

like to get massages tables are used for people to get

massages people lay on tables to get massages

[Human references from AMT][Input concept-set] give lay massage table

Figure 4 A case study with a concept-set give lay massage table for qualitative analysis of machine genera-tions Human references are collected from AMT and the crowd-workers are required to provide rationales Morecase studies are shown in Figure 5 in Appendix

BART has the best concept coverage which is prob-ably due to its comprehensive pretraining tasks thataim to recover text with noise The results suggestthat further modifying over pre-trained models is apromising direction for generative commonsensereasoning This also shows that our dataset wouldbe a good test-bed for comparing the commonsensereasoning ability of different pre-trained languagemodels

Recent work (Lv et al 2020) finds that theOMCS corpus (Singh et al 2002) which has de-rived the ConceptNet is a valuable resource forretrieving relevant commonsense facts for discrim-inative reasoning about questions We follow thesame steps to retrieve related facts by querying in-put concepts Then we concatenate them with theoriginal concept-sets as the final input sequenceto the above-mentioned methods mimicking ab-stractive summarization tasks However we onlyobserve very marginal improvement when usingretrieved OMCS sentences as additional inputs Weargue that imposing commonsense knowledge withadditional graph structures (Lin et al 2019) be-tween input concepts is a more promising futuredirection for the COMMONGEN task as graphs arenaturally order-insensitive

53 Qualitative Analysis with A Case study

Figure 4 shows the top generations of differ-ent models and human references about an inputconcept-set give lay massage table We findthat non-pretrained seq2seq models can success-fully use part of given concepts while the gener-ated sentences are neither grammatical nor coher-ent The vanilla LevenTrans model only uses

one of the given concepts although it aims to mod-eling the edits explicitly and generates syntacticallysound sentences bRNN-CopyNet uses all fourconcepts with the powerful copy mechanism butgenerates nonsensical sentences

The outputs of fine-tuned pre-trained models aresignificantly more grammatical and commonsen-sical Although they are not equipped with an ex-plicit module for enforcing the use of given con-cepts most of them can cover all concepts in theiroutputs We can see that the scenarios in the outputsof GPT-2 UniLM-v12 and T5 only involve asingle person and the other two models associatetheir scenarios with two persons This makes theperson doing two contradictory actions in their out-put scenarios (eg lsquolaying on a tablersquo and lsquogivinga massagersquo) GPT-2 creates an even funny non-sensical composition (lsquogives a massage to a tablersquo)due to this issue Although BERT-Gen indeedincorporates a second person in its output it stillhas the contradiction The model closet to humanreferences is BART within this case study if it didnot generate the lsquolays on a table andrsquo to describethe man This suggests that a second pass to re-move some local optimal generations is necessaryfor assuring plausibility of the scenario

6 Related Work

Commonsense benchmark datasets There aremany emerging datasets for testing machine com-monsense from different angles such as com-monsense extraction (Xu et al 2018 Li et al2016) next situation prediction (SWAG (Zellerset al 2018) CODAH (Chen et al 2019) Hel-

laSWAG (Zellers et al 2019b)) cultural and socialunderstanding (Lin et al 2018 Sap et al 2019ab)visual scene comprehension (Zellers et al 2019a)and general commonsense question answering (Tal-mor et al 2019 Huang et al 2019)

Recent studies have shown that simply fine-tuning large pre-trained language models egRoBERTa (Liu et al 2019) can yield near-humanor even exceeding-human performance in thesediscriminative reasoning scenarios such as theSWAG dataset We argure that the underlyingreasons are two-fold 1) The creation of distrac-tor choices has annotator bias (Geva et al 2019)which can be easily detected by NLU models 2)Self-supervised training objectives in BERT-likemodels (Devlin et al 2019) align well with themulti-choice QA setting the SWAG task sharesalmost the same scenario with the Next SentencePrediction (NSP) task and because the CSQA taskcan be viewed as learning to recover missing wordsthat are masked by ldquowh-wordsrdquo it can be distantlylearned using Masked Language Modeling (MLM)Therefore these success does not necessarily meanmachine reasoners can produce novel assumptionsin an open realistic generative setting

Constrained Text Generation Constrained textgeneration aims to decode sentences with expectedattributes such as sentiment (Luo et al 2019aHu et al 2017) tense (Hu et al 2017) tem-plate (Zhu et al 2019) style (Fu et al 2018Luo et al 2019b Li et al 2018) topics (Fenget al 2018) etc A similar scenario with our taskis lexically constrained encoding which has beenmainly studied in the machine translation commu-nity (Hasler et al 2018 Dinu et al 2019 Hokampand Liu 2017) One recent work in this line is theCGMH (Miao et al 2019) method which aimsto sample sentences with an ordered sequence ofkeywords from language models but cannot befine-tuned and adopted in our case Topical storygeneration (Fan et al 2018 Yao et al 2019) isalso a related direction while it targets generat-ing longer creative stories around the given topicsmaking it hard to directly adopt them to our taskAdditionally the COMMONGEN task brings somemore challenges mentioned in Section 2 Prior con-strained generation methods cannot address theseissues together in a unified model and thus we ex-pect COMMONGEN to be also a benchmark datasetfor future works in this direction

Injecting Commonsense for NLG There are

also a few works that incorporate commonsenseknowledge in language generation tasks such asessay generation (Guan et al 2019 Yang et al2019a) video storytelling (Yang et al 2019b) andconversational systems (Zhang et al 2019) Theseworks suggest that generative commonsense rea-soning has a great potential to benefit downstreamapplications Our proposed COMMONGEN to thebest of our knowledge is the very first constrainedsentence generation dataset for assessing and con-ferring generative machine commonsense and wehope it can benefit such applications

7 Conclusion

Our major contribution in this paper are as follows

1 we present COMMONGEN a novel con-strained generation task for generative com-monsense reasoning and a large-scale dataset

2 we carefully analyze the inherent challengesof the proposed task ie a) relational reason-ing with latent commonsense knowledge andb) compositional generalization

3 our extensive experiments systematically ex-amine recent pre-trained language generationmodels (eg UniLM BART T5) on the task and find that their performance is still far fromhumans generating grammatically sound yetrealistically implausible sentences

Our study points to interesting future research di-rections on modeling commonsense knowledge inlanguage generation process towards conferringmachines with generative commonsense reasoningability We hope COMMONGEN would also benefitdownstream NLG applications such as conversa-tional systems and storytelling models

ReferencesPeter Anderson Basura Fernando Mark Johnson and

Stephen Gould 2016 Spice Semantic propo-sitional image caption evaluation In EuropeanConference on Computer Vision pages 382ndash398Springer

Satanjeev Banerjee and Alon Lavie 2005 METEORAn automatic metric for MT evaluation with im-proved correlation with human judgments In Pro-ceedings of the ACL Workshop on Intrinsic and Ex-trinsic Evaluation Measures for Machine Transla-tion andor Summarization pages 65ndash72 Ann Ar-bor Michigan Association for Computational Lin-guistics

Hangbo Bao Li Dong Furu Wei Wenhui Wang NanYang Xiulei Liu Yu Wang Songhao Piao Jian-feng Gao Ming Zhou and Hsiao-Wuen Hon 2020Unilmv2 Pseudo-masked language models for uni-fied language model pre-training arXiv Computa-tion and Language

Michael Chen Mike DrsquoArcy Alisa Liu Jared Fernan-dez and Doug Downey 2019 Codah An adversar-ially authored question-answer dataset for commonsense ArXiv abs190404365

Noam Chomsky 1965 Aspects of the theory of syntax

Ernest Davis and Gary Marcus 2015 Commonsensereasoning and commonsense knowledge in artificialintelligence Commun ACM 5892ndash103

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Georgiana Dinu Prashant Mathur Marcello Federicoand Yaser Al-Onaizan 2019 Training neural ma-chine translation to apply terminology constraintsIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages3063ndash3068 Florence Italy Association for Compu-tational Linguistics

Li Dong Nan Yang Wenhui Wang Furu Wei Xi-aodong Liu Yu Wang Jianfeng Gao Ming Zhouand Hsiao-Wuen Hon 2019 Unified languagemodel pre-training for natural language understand-ing and generation In Advances in Neural Informa-tion Processing Systems pages 13042ndash13054

Angela Fan Mike Lewis and Yann Dauphin 2018 Hi-erarchical neural story generation In Proceedingsof the 56th Annual Meeting of the Association forComputational Linguistics (Volume 1 Long Papers)pages 889ndash898 Melbourne Australia Associationfor Computational Linguistics

Xiaocheng Feng Ming Liu Jiahao Liu Bing Qin YiboSun and Ting Liu 2018 Topic-to-essay generationwith neural networks In IJCAI pages 4078ndash4084

Zhenxin Fu Xiaoye Tan Nanyun Peng Dongyan Zhaoand Rui Yan 2018 Style transfer in text Explo-ration and evaluation In Thirty-Second AAAI Con-ference on Artificial Intelligence

Mor Geva Yoav Goldberg and Jonathan Berant 2019Are we modeling the task or the annotator an inves-tigation of annotator bias in natural language under-standing datasets In Proceedings of the 2019 Con-ference on Empirical Methods in Natural Language

Processing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP) pages 1161ndash1166 Hong Kong China As-sociation for Computational Linguistics

Jiatao Gu Zhengdong Lu Hang Li and Victor OKLi 2016 Incorporating copying mechanism insequence-to-sequence learning In Proceedings ofthe 54th Annual Meeting of the Association for Com-putational Linguistics (Volume 1 Long Papers)pages 1631ndash1640 Berlin Germany Association forComputational Linguistics

Jiatao Gu Changhan Wang and Junbo Zhao 2019Levenshtein transformer In Advances in Neural In-formation Processing Systems pages 11179ndash11189

Jian Guan Yansen Wang and Minlie Huang 2019Story ending generation with incremental encodingand commonsense knowledge In Proceedings ofthe AAAI Conference on Artificial Intelligence vol-ume 33 pages 6473ndash6480

Eva Hasler Adria de Gispert Gonzalo Iglesias andBill Byrne 2018 Neural machine translation decod-ing with terminology constraints In Proceedings ofthe 2018 Conference of the North American Chap-ter of the Association for Computational LinguisticsHuman Language Technologies Volume 2 (Short Pa-pers) pages 506ndash512 New Orleans Louisiana As-sociation for Computational Linguistics

Chris Hokamp and Qun Liu 2017 Lexically con-strained decoding for sequence generation using gridbeam search In Proceedings of the 55th AnnualMeeting of the Association for Computational Lin-guistics (Volume 1 Long Papers) pages 1535ndash1546Vancouver Canada Association for ComputationalLinguistics

Zhiting Hu Zichao Yang Xiaodan Liang RuslanSalakhutdinov and Eric P Xing 2017 Towardcontrolled generation of text In Proceedingsof the 34th International Conference on MachineLearning-Volume 70 pages 1587ndash1596 JMLR org

Lifu Huang Ronan Le Bras Chandra Bhagavatula andYejin Choi 2019 Cosmos QA Machine readingcomprehension with contextual commonsense rea-soning In Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference on Natu-ral Language Processing (EMNLP-IJCNLP) pages2391ndash2401 Hong Kong China Association forComputational Linguistics

Daniel Keysers Nathanael Scharli Nathan ScalesHylke Buisman Daniel Furrer Sergii KashubinNikola Momchev Danila Sinopalnikov LukaszStafiniak Tibor Tihon Dmitry Tsarkov Xiao WangMarc van Zee and Olivier Bousquet 2020 Measur-ing compositional generalization A comprehensivemethod on realistic data In International Confer-ence on Learning Representations

Guillaume Klein Yoon Kim Yuntian Deng Jean Senel-lart and Alexander Rush 2017 OpenNMT Open-source toolkit for neural machine translation InProceedings of ACL 2017 System Demonstrationspages 67ndash72 Vancouver Canada Association forComputational Linguistics

Ranjay Krishna Kenji Hata Frederic Ren Li Fei-Feiand Juan Carlos Niebles 2017 Dense-captioningevents in videos In Proceedings of the IEEE inter-national conference on computer vision pages 706ndash715

Brenden M Lake and Marco Baroni 2017 General-ization without systematicity On the compositionalskills of sequence-to-sequence recurrent networksIn

Jason Lee Elman Mansimov and Kyunghyun Cho2018 Deterministic non-autoregressive neural se-quence modeling by iterative refinement In Pro-ceedings of the 2018 Conference on Empirical Meth-ods in Natural Language Processing pages 1173ndash1182 Brussels Belgium Association for Computa-tional Linguistics

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Ves Stoyanov and Luke Zettlemoyer 2019Bart Denoising sequence-to-sequence pre-trainingfor natural language generation translation andcomprehension ArXiv abs191013461

Juncen Li Robin Jia He He and Percy Liang 2018Delete retrieve generate a simple approach to sen-timent and style transfer In Proceedings of the 2018Conference of the North American Chapter of theAssociation for Computational Linguistics HumanLanguage Technologies Volume 1 (Long Papers)pages 1865ndash1874 New Orleans Louisiana Associ-ation for Computational Linguistics

Xiang Li Aynaz Taheri Lifu Tu and Kevin Gimpel2016 Commonsense knowledge base completionIn Proceedings of the 54th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 1445ndash1455 Berlin GermanyAssociation for Computational Linguistics

Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics

Bill Yuchen Lin Frank F Xu Kenny Zhu and Seung-won Hwang 2018 Mining cross-cultural differ-ences and similarities in social media In Proceed-ings of the 56th Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Pa-pers) pages 709ndash719 Melbourne Australia Asso-ciation for Computational Linguistics

Chin-Yew Lin 2004 ROUGE A package for auto-matic evaluation of summaries In Text Summariza-tion Branches Out pages 74ndash81 Barcelona SpainAssociation for Computational Linguistics

Tsung-Yi Lin Michael Maire Serge Belongie JamesHays Pietro Perona Deva Ramanan Piotr Dollarand C Lawrence Zitnick 2014 Microsoft cocoCommon objects in context In European confer-ence on computer vision pages 740ndash755 Springer

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019Roberta A robustly optimized bert pretraining ap-proach ArXiv abs190711692

Fuli Luo Peng Li Pengcheng Yang Jie Zhou Yu-tong Tan Baobao Chang Zhifang Sui and Xu Sun2019a Towards fine-grained text sentiment trans-fer In Proceedings of the 57th Annual Meetingof the Association for Computational Linguisticspages 2013ndash2022 Florence Italy Association forComputational Linguistics

Fuli Luo Peng Li Jie Zhou Pengcheng YangBaobao Chang Zhifang Sui and Xu Sun 2019bA dual reinforcement learning framework for un-supervised text style transfer arXiv preprintarXiv190510060

Thang Luong Hieu Pham and Christopher D Man-ning 2015 Effective approaches to attention-basedneural machine translation In Proceedings of the2015 Conference on Empirical Methods in Natu-ral Language Processing pages 1412ndash1421 Lis-bon Portugal Association for Computational Lin-guistics

Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2020 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering ArXivabs190905311

Ning Miao Hao Zhou Lili Mou Rui Yan and LeiLi 2019 Cgmh Constrained sentence generationby metropolis-hastings sampling In Proceedings ofthe AAAI Conference on Artificial Intelligence vol-ume 33 pages 6834ndash6842

Chris Moore 2013 The development of commonsensepsychology Psychology Press

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings ofthe 40th Annual Meeting of the Association for Com-putational Linguistics pages 311ndash318 PhiladelphiaPennsylvania USA Association for ComputationalLinguistics

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former arXiv preprint arXiv191010683

Anna Rohrbach Atousa Torabi Marcus RohrbachNiket Tandon Christopher Pal Hugo LarochelleAaron Courville and Bernt Schiele 2017 Moviedescription International Journal of Computer Vi-sion 123(1)94ndash120

Keisuke Sakaguchi Ronan Le Bras Chandra Bhagavat-ula and Yejin Choi 2019 Winogrande An adver-sarial winograd schema challenge at scale ArXivabs190710641

Maarten Sap Ronan Le Bras Emily Allaway Chan-dra Bhagavatula Nicholas Lourie Hannah RashkinBrendan Roof Noah A Smith and Yejin Choi2019a Atomic An atlas of machine commonsensefor if-then reasoning In Proceedings of the AAAIConference on Artificial Intelligence volume 33pages 3027ndash3035

Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019b Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics

Piyush Sharma Nan Ding Sebastian Goodman andRadu Soricut 2018 Conceptual captions Acleaned hypernymed image alt-text dataset for au-tomatic image captioning In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers) pages2556ndash2565 Melbourne Australia Association forComputational Linguistics

Push Singh Thomas Lin Erik T Mueller Grace LimTravell Perkins and Wan Li Zhu 2002 Open mindcommon sense Knowledge acquisition from thegeneral public In OTM Confederated InternationalConferencesrdquo On the Move to Meaningful InternetSystemsrdquo pages 1223ndash1237 Springer

Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Thirty-First AAAI Conference onArtificial Intelligence

Mitchell Stern William Chan Jamie Kiros and JakobUszkoreit 2019 Insertion transformer Flexible se-quence generation via insertion operations arXivpreprint arXiv190203249

Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conference

of the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics

Ruth Tincoff and Peter W Jusczyk 1999 Some begin-nings of word comprehension in 6-month-olds Psy-chological science 10(2)172ndash175

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in neural information pro-cessing systems pages 5998ndash6008

Ramakrishna Vedantam C Lawrence Zitnick and DeviParikh 2015 Cider Consensus-based image de-scription evaluation In Proceedings of the IEEEconference on computer vision and pattern recogni-tion pages 4566ndash4575

Xin Wang Jiawei Wu Junkun Chen Lei Li Yuan-Fang Wang and William Yang Wang 2019 VatexA large-scale high-quality multilingual dataset forvideo-and-language research In Proceedings of theIEEE International Conference on Computer Visionpages 4581ndash4591

Frank F Xu Bill Yuchen Lin and Kenny Zhu 2018Automatic extraction of commonsense LocatedNearknowledge In Proceedings of the 56th Annual Meet-ing of the Association for Computational Linguis-tics (Volume 2 Short Papers) pages 96ndash101 Mel-bourne Australia Association for ComputationalLinguistics

Pengcheng Yang Lei Li Fuli Luo Tianyu Liu andXu Sun 2019a Enhancing topic-to-essay gener-ation with external commonsense knowledge InProceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2002ndash2012 Florence Italy Association for Compu-tational Linguistics

Pengcheng Yang Fuli Luo Peng Chen Lei Li ZhiyiYin Xiaodong He and Xu Sun 2019b Knowledge-able storyteller a commonsense-driven generativemodel for visual storytelling In Proceedings of theTwenty-Eighth International Joint Conference on Ar-tificial Intelligence IJCAI pages 5356ndash5362

Lili Yao Nanyun Peng Ralph Weischedel KevinKnight Dongyan Zhao and Rui Yan 2019 Plan-and-write Towards better automatic storytelling InProceedings of the AAAI Conference on Artificial In-telligence volume 33 pages 7378ndash7385

Peter Young Alice Lai Micah Hodosh and Julia Hock-enmaier 2014 From image descriptions to visualdenotations New similarity metrics for semantic in-ference over event descriptions Transactions of theAssociation for Computational Linguistics 267ndash78

Rowan Zellers Yonatan Bisk Ali Farhadi and YejinChoi 2019a From recognition to cognition Vi-sual commonsense reasoning In Proceedings of theIEEE Conference on Computer Vision and PatternRecognition pages 6720ndash6731

Rowan Zellers Yonatan Bisk Roy Schwartz andYejin Choi 2018 SWAG A large-scale adversar-ial dataset for grounded commonsense inference InProceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing pages 93ndash104 Brussels Belgium Association for Computa-tional Linguistics

Rowan Zellers Ari Holtzman Yonatan Bisk AliFarhadi and Yejin Choi 2019b HellaSwag Cana machine really finish your sentence In Pro-ceedings of the 57th Annual Meeting of the Asso-ciation for Computational Linguistics pages 4791ndash4800 Florence Italy Association for ComputationalLinguistics

Houyu Zhang Zhenghao Liu Chenyan Xiong andZhiyuan Liu 2019 Grounded conversation genera-tion as guided traverses in commonsense knowledgegraphs ArXiv abs191102707

Wanrong Zhu Zhiting Hu and Eric P Xing 2019 Textinfilling ArXiv abs190100158

[bRNN-CpNet] Someone lowers his horse from the wall and lasso glass by cows

[Trans-CpNet] A horse having lasso in the bridal cows

[MP-CpNet] Cow in a lasso getting the ride

[LevenTrans] A cow rides through a horse

[GPT-2] A horse rides on a lasso

[BERT-Gen] A cow rides a lasso on a horse

[UniLM] A man rides a horse with a lasso at cows

[UniLM-v2] A horse rides a cow with a lasso on it

[BART] A man rides a horse and a cow on a bridle with a lasso

[T5] Lasso to ride a cow on a horse

[Machine generations]

1 When those men ride a horse for the first time and lasso those cows

[Rationale] cowboys ride horses and lasso cows for a living

2 A cowboy can use a lasso to control a horse or cow in order to ride them

[Rationale] I understand the words and I can read and write

English

3 The cowboy will lasso the cow while riding on the horse

[Rationale] Have seen it

[Human references from AMT]

1) [Input concept-set] cow horse lasso ride

[bRNN-CpNet] Process of holds at hands under walk on hours

[Trans-CpNet] Hands with a walk in the water

[MP-CpNet] Walk across the hold to water

[LevenTrans] Hand moored at the water

[GPT-2] A woman holds a water walker and holds a hand

[BERT-Gen] A man walking and holding a hand in water while walking

[UniLM] A man holds hands to walk across the water

[UniLM-v2] A man is walking and holding a hand in the water

[BART] A man walks with a woman holding her hand as they walk through water

[T5] Man holds a bottle of water in his hand as he walks along a river


1 The couple holds hands as they walk by the water

[Rationale]

Couples hold hands when taking walk even by a body of water

2 The girl is walking holding in her hand a bottle of water

[Rationale] I see this reading the words

3 The couple hold hands while they walk by the water

[Rationale] People sometimes hold hands People Like to walk

near water


2) [Input concept-set] hand hold walk water

[bRNN-CpNet] The window stands out a ladder but clean the sun to being squeegee

[Trans-CpNet] A brown leather ladder with green eyes

[MP-CpNet] Window of the zebra are on a tablecloth

[LevenTrans] A man on a a on on the kitchen

[GPT-2] Someone grabs a ladder from a window and squeezes it open

[BERT-Gen] A woman is cleaning a window with a ladder and a squeegee

[UniLM] Someone stands next to a window and stands on a ladder to clean the squeegee

[UniLM-v2] A man is standing on a ladder and using a ladder to clean the window

[BART] A man with a squeegee and a ladder standing on the ledge of a window is cleaning the window

[T5] Squeegee and ladder on a wooden stand to clean windows and windows

[Machine generations]1 The window cleaner stands on the ladder to clean the

window with a squeegee[Rationale] A squeegee is a tool to clean windows A

ladder is something that people use to reach high places

2 The man clean the window on the ladder stand by using

squeegee[Rationale] man need to clean the window by using

squeegee on the ladder stand

3 The man stood beside the ladder and cleaned the window

with a squeegee[Rationale] people can stand next to ladders People

clean windows Squeegees are used to clean windows

[Human references from AMT]3) [Input concept-set] clean ladder squeegee stand window

Figure 5 Three cases for qualitative analysis of machine generations References are collected from AMT crowd-workers and they are required to provide rationales Note that the second one is a positive case showing that somemodels can successfully generate reasonable scenarios However most models perform poorly on the other cases

using concepts from a given concept-set For ex-ample given a set of concepts exercise ropewall tie wave machines are required to generatea sentence such as ldquoa man in a gym exercises bywaving ropes tied to a wallrdquo

To successfully solve the task models need toincorporate two key capabilities a) relational rea-soning and b) compositional generalization Gram-matically sound sentences may not always be real-istic as they might violate our commonsense (eg

ldquoa dog throws a frisbee rdquo in Fig 1) In order tocompose a plausible sentence that describes an ev-eryday scenario models need to construct a gram-matical sentence while adhering to and reasoningover the commonsense relations between the givenconcepts Models additionally need compositionalgeneralization ability to infer about unseen conceptcompounds This encourages models to reasonabout a potentially infinite number of novel combi-nations of familiar concepts ndash an ability believedto be a limitation of current AI systems (Lake andBaroni 2017 Keysers et al 2020)

Therefore in support of the COMMONGEN taskwe present a dataset consisting of 29599 concept-sets associated with 49129 sentences We explic-itly design our dataset collection process to cap-ture the key challenges of relational reasoning andcompositional generalization described above Weestablish comprehensive baseline performance forstate-of-the-art language generation models Thebest model based on T5 (Raffel et al 2019)achieves 3660 with significant gap compared tohuman performance of 6350 in the SPICE met-ric ndash demonstrating the difficulty of the task Ouranalysis shows that state-of-the-art models strug-gle at the task generating implausible sentences ndasheg ldquodog throws a frisbee rdquo ldquogive massage to atablerdquo etc ndash pointing to interesting future researchdirections for the community

2 Task Formulation and Challenges

We formulate the proposed COMMONGEN taskwith mathematical notations and discuss its inher-ent challenges with concrete examples

The input is an unordered set of k concepts x =c1 c2 ck isin X (ie a concept-set) whereeach concept ci isin C is a common object (noun)or action (verb) We use X to denote the spaceof all possible concept-sets and use C to denotethe concept vocabulary (a subset of ConceptNetrsquossingle-word concepts) The expected output is a

simple grammatical sentence y isin Y that describesa common scenario in our daily life using2 allgiven concepts in x A scenario can depict either astatic situation or a short series of actions

The task is to learn a structured predictive func-tion f ∶ X rarr Y which maps a concept-set x toa sentence y The unique challenges of this taskcome from two major aspects as followsRelational Reasoning with Commonsense Ex-pected generative reasoners should prioritize themost plausible scenes over an infinite number ofless plausible scenes Recall the first illustrativeexamples in Figure 1 the underlying knowledgeare implicit and compositional (a) dogs love toperform tricks with humans (b) catching a frisbeeis a trick and (c) humans love to play this gamewith dogs As for the other example in Section 1about exercise rope wall tie wave we alsoneed to compose the following commonsense facts(i) doing exercises is to cost energy (ii) wavinga rope can cost energy and (iii) it is more usefulwhen the rope is tied to a wall

In order to complete a scenario a generativecommonsense reasoner also needs to reasonably as-sociate additional concepts (eg lsquogymrsquo and lsquomanrsquo)as agents or environments for completing a naturaland coherent scenario in our daily life

This not only requires understanding underlyingcommonsense relations between concepts but alsoincrementally composing them towards a globallyoptimal scenario The underlying reasoning chainsare inherently based on a variety of backgroundknowledge such as spatial relations object prop-erties physical rules temporal event knowledgesocial conventions etc However they may not berecorded in any existing knowledge basesCompositional Generalization Humans cancompose a sentence to describe a scenario aboutthe concepts they may never seen them co-occurring For example there is a testing concept-set x =pear basket pick put tree The conceptlsquopearrsquo never appear in the training data and lsquopickrsquonever co-occurs with lsquobasketrsquo Meanwhile thereare some relevant training examples

bull x1 =apple bag putrarr y1 =ldquoa boy puts anapple in a bagrdquo

bull x2 =apple tree pick rarr y2 =ldquoa girl picksan apple from the treerdquo

bull x3 =apple basket wash rarr y3 =ldquoa girltakes an apple from the basket and washes itrdquo

2Note that morphological inflections are allowed











ρ(x)



primeisin X ∣)













KRS KRS














Object properties



Human behaviors


Temporal knowledge





4 Methods












T5 (Raffel et al 2019) 2171 4179 3810 2720 3000 1458 3060 9502







5 Evaluation


51 Metrics




















massage table















6 Related Work







7 Conclusion

























































































English

















[Rationale]






near water


































ρ(x)



primeisin X ∣)













KRS KRS














Object properties



Human behaviors


Temporal knowledge





4 Methods












T5 (Raffel et al 2019) 2171 4179 3810 2720 3000 1458 3060 9502







5 Evaluation


51 Metrics




















massage table















6 Related Work







7 Conclusion

























































































English

















[Rationale]






near water


































KRS KRS














Object properties



Human behaviors


Temporal knowledge





4 Methods












T5 (Raffel et al 2019) 2171 4179 3810 2720 3000 1458 3060 9502







5 Evaluation


51 Metrics




















massage table















6 Related Work







7 Conclusion

























































































English

















[Rationale]






near water





























Object properties



Human behaviors


Temporal knowledge





4 Methods












T5 (Raffel et al 2019) 2171 4179 3810 2720 3000 1458 3060 9502







5 Evaluation


51 Metrics




















massage table















6 Related Work







7 Conclusion

























































































English

















[Rationale]






near water





























T5 (Raffel et al 2019) 2171 4179 3810 2720 3000 1458 3060 9502







5 Evaluation


51 Metrics




















massage table















6 Related Work







7 Conclusion

























































































English

















[Rationale]






near water





































massage table















6 Related Work







7 Conclusion

























































































English

















[Rationale]






near water





























7 Conclusion

























































































English

















[Rationale]






near water








































































































English

















[Rationale]






near water




















































































English

















[Rationale]






near water
































































English

















[Rationale]






near water












































English

















[Rationale]






near water







































English

















[Rationale]






near water
























Date post:	15-Jul-2020
Category:	Documents
Upload:	others
View:	8 times
Download:	0 times

CommonGen: A Constrained Text Generation Challenge for ... · sourcing and existing caption...

Documents