Arindam Mitra Pratyay Banerjee Kuntal Pal Swaroop Mishra Chitta … · 2020. 4. 20. · Arindam...

How Additional Knowledge can Improve Natural LanguageCommonsense Question Answering?

Arindam Mitra ∗ Pratyay Banerjee∗Kuntal Pal∗Swaroop Mishra∗Chitta BaralDepartment of Computer Science, Arizona State University

amitra7,pbanerj6,kkpal,[email protected],[email protected]

Abstract

Recently several datasets have been proposedto encourage research in Question Answer-ing domains where commonsense knowledgeis expected to play an important role. Re-cent language models such as ROBERTA,BERT and GPT that have been pre-trainedon Wikipedia articles and books have shownreasonable performance with little fine-tuningon several such Multiple Choice Question-Answering (MCQ) datasets. Our goal inthis work is to develop methods to incor-porate additional (commonsense) knowledgeinto language model-based approaches for bet-ter question-answering in such domains. Inthis work, we first categorize external knowl-edge sources, and show performance does im-prove on using such sources. We then ex-plore three different strategies for knowledgeincorporation and four different models forquestion-answering using external common-sense knowledge. We analyze our predictionsto explore the scope of further improvements.

Introduction

In recent months language models such as GPT(Radford et al., 2018), BERT (Devlin et al., 2019)and their variants (such as RoBERTa (Liu et al.,2019a)) that have been pre-trained on Wikipediaarticles and books are able to perform very well onmany of the natural language question-answeringtasks. These days they form the defacto baselinefor most new datasets. They even perform at nearhuman level on many recently proposed natural lan-guage QA datasets (Rajpurkar et al., 2016; Zellerset al., 2018). These models do well even on someof the question-answering tasks where question-answering seemingly requires knowledge beyondwhat is given in the QA items. Perhaps it is becausesome of the needed knowledge that may be present

∗ These authors contributed equally to this work.

Figure 1: Above we have two strategies of knowledgeincorporation. Below are sample questions from So-cialIQA and PIQA, with corresponding top retrievedknowledge from our generated knowledge sources.

in textual form is “encapsulated” by the languagemodel-based systems as they are trained on a hugetext corpus. But one may wonder whether morecan be done; i.e., can the performance be improvedby further infusion of the needed knowledge (or aknowledge base containing the needed knowledge),and what are ways of doing such knowledge in-fusion. DARPA and Allen AI upped the ante bydeveloping several question-answering challengeswhere commonsense knowledge and reasoning areexpected to play an important rule. The expectedadditional challenge in these domains is that oftencommonsense knowledge is not readily availablein textual form (Gordon and Van Durme, 2013).To answer the above-mentioned questions we con-sider three of those QA challenges: Abductive NLI(aNLI)(Bhagavatula et al., 2019), PIQA (Bisk et al.,2019), and Social Interaction QA (Social IQA)(Sap et al., 2019b).

In this paper, we explore ways to infuse knowl-edge into any language model to reason and solvemultiple-choice question-answering tasks. Consid-ering a baseline performance of the BERT whole-

arX

iv:1

909.

0885

5v3

[cs

.CL

] 1

7 A

pr 2

020

word-masked model, we improve the performanceon each of the datasets with three strategies. First,in the revision strategy, we fine-tune the BERTmodel on a knowledge-base (KB) which has knowl-edge statements relevant to that of each of thedatasets and then use the model to answer questions.In the second, Open-Book Strategy, we choose acertain number of knowledge statements from theKB that are textually similar to each of the samplesof the datasets. Then we fine-tune the pre-trainedBERT model for the question-answering task tochoose the answer. In the final strategy, we take ad-vantage of both the above-mentioned strategies. Wefirst fine-tune the pre-trained BERT model on theKB and then use additional knowledge extractedfor each sample for the question-answering.

To use the extracted knowledge from the KB, wepropose four models, concat, max, simple sum, andweighted sum. Each of the models uses knowledgein a different way to choose the correct answeramong the options.

Apart from these we created a dataset, Parent andFamily QA (PFQA) to analyze BERT’s memoriz-ing ability and to test BERT’s ability for answeringMCQ questions with necessary information scat-tered over multiple knowledge sentences.

Our contribution in this paper is as follows:

• We develop a framework for solving multiple-choice questions with external knowledge thatcan work on varied datasets.• We propose four novel models representing

four ways knowledge can be used with thelanguage models.• We study the three datasets, aNLI, PIQA, and

Social IQA under three scenarios of externalknowledge source respectively, and show thatexternal knowledge helps. In particular, weidentify the models that takes us to be amongthe top in the leaderboard in the three datasets.• We synthetically create a dataset, PFQA, and

will make it publicly available.

Related Work

Question-Answering Datasets: In datasets suchas SQuAD (Rajpurkar et al., 2016), TriviaQA(Joshi et al., 2017), WikiQA (Yang et al., 2015),CoQA (Reddy et al., 2019) the answers are presentin either the passage or the context. Systems areable to achieve near-human performance on someof them. HotpotQA (Yang et al., 2018) is a chal-lenging dataset where the questions explicitly re-

quire multi-hop reasoning, and supporting knowl-edge passages derived from Wikipedia are provided.Another challenging QA task is when the multiple-choice questions do not have sufficient knowledgeto answer correctly given a passage, context or op-tions, as in ARC (Clark et al., 2018), RACE (Laiet al., 2017), and OpenBook QA (Mihaylov et al.,2018). Recently language models trained on a hugecorpus have been able to perform quite well (De-vlin et al., 2019; Liu et al., 2019b) on them. Ourfocus in this paper is on datasets which not onlyrequire external facts but also need commonsenseknowledge to predict the correct answer, as in aNLI,PIQA and Social IQA.

External Knowledge: Models that integrate ex-ternal knowledge have been introduced in (Mi-haylov and Frank, 2018; Chen et al., 2018; Yanget al., 2019; Wang and Jiang, 2019). These areaiming to utilize relational knowledge of the form(a,R, b), where a and b are words, to modify pre-trained word vectors in both passages and questionsto obtain better inter-word alignments. In our case,knowledge is more complex with a and b beingevent descriptions containing variables and thuscomputing alignment between knowledge passageand question-answer pair is more challenging.

Knowledge Retrieval: Systems for InformationRetrieval, such as Elasticsearch (Gormley andTong, 2015), has been used in prior work of (Khotet al., 2019; Pirtoaca et al., 2019; Yadav et al., 2019;Banerjee et al., 2019; Banerjee, 2019). In our work,we use elasticsearch for retrieval and we have are-ranking algorithm using Spacy (Honnibal andMontani, 2017).

MCQ Datasets

In order to study how to incorporate knowledge, weneed datasets which are designed to need externalknowledge by question-answering systems. Wechoose four datasets to evaluate our models, eachwith a different kind of commonsense knowledge.Three of those are created by Allen AI researchersand one is generated synthetically by us.

Abductive NLI (aNLI): This dataset (Bhagavat-ula et al., 2019) is intended to judge the potential ofan AI system to do abductive reasoning in order toform possible explanations for a given set of obser-vations. Given a pair of observations O1 and O2,the task is to find which of the hypothesis optionsH1 or H2 better explains the observations. There

Figure 2: Example of all four datasets along with retrieved knowledge.

are 169,654 training and 1,532 validation samples.It also has a generation task, but we restrict our-selves to the multiple-choice task.

PIQA (Physical Interaction QA): This datasetis created to evaluate the physics reasoning capabil-ity of an AI system. The dataset requires reasoningabout the use of physical objects and how we usethem in our daily life. Given a goal G and a pair ofchoices C1 and C2, the task is to predict the choicewhich is most relevant to the goal G. There are16,113 training and 1,838 validation samples.

SocialIQA: This dataset is a collection of in-stances about reasoning on social interaction andthe social implications of their statements. Givena context C of a social situation and a question Qabout the situation, the task is to choose the correctanswer option AOi out of three choices. There areseveral question types in this dataset, which arederived from ATOMIC inference dimensions (Sapet al., 2019b,a). In total, there are 33,410 trainingand 1,954 validation samples.

Parent and Family QA: We synthetically createthis dataset to test both the memorizing capabilityof neural language models and the ability to com-bine knowledge spread over multiple sentences.The knowledge retrieved for the three earlier men-tioned datasets may be error-prone and in somecases, absent. This is due to the errors from theInformation Retrieval step. We create this syntheticdataset to have better control over the knowledgeand to ensure that we have the appropriate knowl-edge to answer the questions.

The source of this dataset is DBPedia (Auer et al.,2007) from which we query for people and extracttheir parent information. Using this information,

we generate 3 kinds of questions, which are, Whois the parent of X?, Who is the grandparent of X?and Who is the sibling of X?. The dataset has aquestion Q and 4 answer options AOi. The namesof a parent and their family members have manythings in common, which can be used to answersuch a question. To make the task harder, we re-move the middle and last names from the answeroptions. To select wrong answer options, we selectthose names which are at an edit distance of one ortwo. We also evaluate the alternate strategy of us-ing word-vector’s cosine similarity to find the mostsimilar names, but on evaluation we observe thatthe edit distance strategy creates confusing options.For example, the most similar names to “John” are“Robert”, and “Williams” using word-vectors. Us-ing edit distance, we get “Jon”, and “Johan”. Sinceour models use word-piece tokenization, and cor-responding embeddings, using edit distance makesthe task harder.

We also ensure that all three kinds of questionsfor a particular person are present in the specifictraining and validation set. In total, there are 74,035training, 9,256 validation and 9,254 test questions.

External Knowledge

Knowledge Categorization for Evaluation

Reasoning with data from each of the above-mentioned datasets needs commonsense knowl-edge. We categorize external knowledge sourcesinto three categories.

Directly Derived: Here the commonsense QAtask is directly derived from the knowledge source,and hence using the same knowledge may makethe task trivial. We evaluate this on aNLI

and the knowledge sources, ROCStories Corpus(Mostafazadeh et al., 2016) and Story Cloze Testthat were used in creating aNLI. Our motivation isto see how well the model is able to answer ques-tions when given the “same” knowledge.

Partially Derived: Here the commonsense QAtask is not directly derived from an external knowl-edge source, and considerable human knowledgewas used to generate the question-answers. In thiscase, we use SocialIQA, which uses the ATOMIC(Sap et al., 2019a) knowledge base as the source forsocial events, but has undergone sufficient humanintervention to make the task non-trivial. Duringdataset creation, the human turkers were asked toturn ATOMIC events into sentences and were askedto create question-answers.

Relevant: Here the commonsense task is entirelycreated with the help of human turkers without useof a specific knowledge source. But through ouranalysis of the question-answers, we guess knowl-edge sources that seem relevant. We evaluate thisusing PIQA as the commonsense task and Wik-iHow dataset (Koupaee and Wang, 2018) as the“relevant” external knowledge source.

Knowledge Source PreparationaNLI: For aNLI, we prepare multiple sets ofknowledge sources. To test our first category ofexternal knowledge, we use the entire Story ClozeTest and ROCStories Corpus. We also prepare an-other knowledge base that contains knowledge sen-tences retrieved for the train set of aNLI from thefirst knowledge base. This is done to not trivializethe task with knowledge leakage. We also create aknowledge source from multiple datasets such asMCTest (Richardson et al., 2013), COPA (Roem-mele et al., 2011) and ATOMIC, but not Story ClozeTest and ROCStories Corpus. These sources con-tain commonsense knowledge which might be use-ful for the aNLI task. This Combined Common-sense Corpus belongs to the relevant knowledgecategory as described in section External Knowl-edge.

SocialIQA: We synthetically generate a knowl-edge base from the events and inference dimen-sions provided by the ATOMIC dataset (Sap et al.,2019a). The ATOMIC dataset contains events andeight types of if-then inferences 1. The total num-ber of events is 732,723. Some events are masked,

1More details in Supplemental Materials.

which we fill by using a BERT Large model and theMasked Language Modelling task (Devlin et al.,2019). We extend the knowledge source, and re-place PersonX and PersonY, as present in the origi-nal ATOMIC dataset, using gender-neutral names.These steps may approximate the steps taken byhumans to generate the question-answers.

PIQA: We use the Wikihow dataset for PIQA. Itcontains a large collection of paragraphs (214,544),each having detailed steps or actions to completea task. We extract the title of each paragraph andsplit the paragraphs into sentences. The title isconcatenated to each of the sentences. This is doneto ensure the goal of the task should be present ineach of the sentences.

Parent and Family QA: We already possess thegold knowledge sentences. The knowledge forthese questions is represented with simple sen-tences of the form “The parent of X is Y”. Wedo not provide knowledge sentences for questionsabout grand parents and siblings. To answer suchquestions, the systems need to combine informa-tion spread over multiple sentences. Nearly alllanguage models are trained over Wikipedia, so alllanguage models would have seen this knowledge.

Knowledge RetrievalQuery Generation: For query generation, weconcatenate the question, answer option and thecontext if present, and remove standard Englishstopwords. We use verbs, adjectives, and adverbsfrom the question-answer pairs.

Information Retrieval System: We use Elastic-search to index all knowledge base sentences. Weretrieve the top 50 sentences for each question-answer pair. The retrieved sentences may containthe key search words in any order.

Re-Ranking: We perform Information Gainbased Re-ranking using Spacy as described in(Banerjee et al., 2019). We use sentence similarityand knowledge redundancy to perform the iterativere-ranking. For similarity we use Spacy sentencesimilarity, for knowledge redundancy we find sim-ilarity with the already selected sentences. Afterre-ranking, we select the top ten sentences.

We keep our Information Retrieval systemgeneric as the tasks require varying kinds of com-monsense knowledge; for example, If-then rules inSocialIQA, Scripts or Stories in aNLI, and under-standing of Processes and Tools in PIQA.

Figure 3: An end-to-end view of our approach. From query generation, knowledge retrieval, the different types ofknowledge retrieved along with keywords highlighted in blue, the corresponding learned weights in the Weighted-Sum model and finally to predicted logits.

Standard BERT MCQ Model

After extracting relevant knowledge from therespective KBs, we move onto the task ofQuestion-Answering. In all our experimentswe use BERT’s uncased whole-word-maskedmodel (BERTUWWM ) (Devlin et al., 2019) andRoBERTa (Liu et al., 2019b).

Question-Answering Model: As a baselinemodel, we used pre-trained BERTUWWM for thequestion-answering task with an extra feed-forwardlayer for classification as a fine-tuning step.

Modes of Knowledge Infusion

We experiment with four different models of usingknowledge with the standard BERT architecture forthe open-book strategy. Each of these modules takeas input a problem instance which contains a ques-tion Q, n answer choices a1, ..., an and a list calledpremises of length n. Each element in premisescontains m number of knowledge passages whichmight be useful while answering the question Q.Let Kij denotes the j-th knowledge passage forthe i-th answer option. Each model computes ascore score(i) for each of the n answer choices.The final answer is the answer choice that receivesthe maximum score. We now describe how thedifferent models compute the scores differently.

Concat: In this model, all the m knowledge pas-sages for the i-th choice are joined together to makea single knowledge passage Ki. The sequence oftokens {[CLS] Ki [SEP] Qai [SEP]} is then passedto BERT to pool the [CLS] embedding from thelast layer. This way we get n [CLS] embeddingsfor n answer choices, each of which is projected toa real number (score(i)) using a linear layer.

Parallel-Max: For each answer choice ai,Parallel-Max uses each of the knowledge passageKij to create the sequence {[CLS] Kij [SEP] Qai[SEP]} which is then passed to the BERT modelto obtain the [CLS] embedding from the last layerwhich is then projected to a real number using alinear layer. score(i) is then taken as the maxi-mum of the m scores obtained using each of the mknowledge passage.

Simple Sum: Unlike the previous model, simplesum and the next two models assume that the infor-mation is scattered over multiple knowledge pas-sages and try to aggregate that scattered informa-tion. To do this, the simple sum model, for each an-swer choice ai and each of the knowledge passageKij creates the sequence {[CLS] Kij [SEP] Qai[SEP]} which it then passes to the BERT modelto obtain the [CLS] embedding from the last layer.All of these m vectors are then summed to find thesummary vector, which is then projected to a scalarusing a linear layer to obtain the score(i).

Weighted Sum: The weighted sum model com-putes a weighted sum of the [CLS] embeddings assome of the knowledge passage might be more use-ful than others. It computes the [CLS] embeddingsin a similar way to that of the simple sum model.It computes a scalar weight wij for each of the m[CLS] embedding using a linear projection layerwhich we will call as the weight layer. The weightsare then normalized through a softmax layer andused to compute the weighted sum of the [CLS]embeddings. It then uses (1) a new linear layer or(2) reuses the weight layer (tied version) to com-pute the final score score(i) for the option ai. Weexperiment with both of these options.

Dataset Strategy Concat ↑ Max ↑ Sim-Sum ↑ Wtd-Sum ↑

aNLIONLY OPENBOOK 73.89 73.69 73.50 73.26ONLY REVISION 72.65 NA NA NA

REVISION & OPENBOOK 74.35 74.28 74.02 75.13

PIQAONLY OPENBOOK 67.84 72.41 72.58 72.52ONLY REVISION 74.53 NA NA NA


SocialIQAONLY OPENBOOK 70.12 67.75 70.21 70.22ONLY REVISION 69.45 NA NA NA


Parent & Family QAONLY OPENBOOK 91.21 89.8 92.66 92.86ONLY REVISION 78.30 NA NA NA


Table 1: Validation set accuracy (%) of each of the four models (Concat, Max, Simple sum, Weighted sum) acrossfour datasets for each of the three strategies. The base model is BERT Large whole-word-masked.

Dataset RoBERTA+Knowledge Source Dev

aNLIDIRECTLY DERIVED 86.68

TRAINONLY DIRECTLY DERIVED 85.84RELATED KNOWLEDGE 84.97

SocialIQAPARTIALLY DERIVED 79.53

TRAINONLY PARTIALLY DERIVED 78.85

Table 2: Comparison of Performance of aNLI and So-cialIQA datasets across multiple knowledge sourcesfor Weighted-Sum model. For ANLI, Directly De-rived refers to ROC-Stories and Story Cloze datasets.Related knowledge is Combined Commonsense Cor-pus. For SocialIQA, Partially Derived is the syntheticATOMIC. For both, TrainOnly is the knowledge ex-tracted only from the train set.

Dataset Model Test

aNLIBaseline : BERT 66.75

Baseline : ROBERTA 83.91Our : BERT 74.96

Our : ROBERTA 84.18

PIQABaseline : BERT 69.23


Our : ROBERTA 78.24

SocialIQABaseline : BERT 64.50


Our : ROBERTA 78.00

Parent & Family QABaseline : BERT 76.96


Our : ROBERTA 93.40

Table 3: Performance of the best knowledge infusedmodel on the Test set. Best model scores are in bold.

Experiments

Let D be an MCQ dataset and T be a pre-trainedlanguage model, KD be a knowledge base (a set ofparagraphs or sentences) which is useful for D andlet K be a general knowledge base where T waspre-trained and K might or might not contain KD.We consider three approaches to infuse knowledge.

Revision Strategy: In this strategy, T is fine-tuned on KD with respect to Masked LM andthe next sentence prediction task and then fine-tuned on the dataset D with respect to the Question-Answering task.

Figure 4: Different categories of errors.

Open Book strategy: Here a subset of KD isassigned to each of the training samples on thedataset D and the model T is fine-tuned on themodified dataset D.

Revision along with an Open Book Strategy:In this strategy, T is fine-tuned on KD with respectto Masked LM and the next sentence predictiontask and also a subset of KD is assigned to eachof the training samples on D. The model is thenfine-tuned with respect to the modified dataset as aQuestion-Answering task.

Results

Which Strategy Works? Table 1 and Table 2 sum-marize our experiments on four datasets. We ob-serve that knowledge helps in improving the per-formance. Both the Open Book and the Revisionstrategy perform well. Together the performanceimproves even further. The performance of theRevision strategy is poor for the SocialIQA andPFQA datasets. The reason behind this drop in per-formance may be due to the synthetic nature of thesentences and the unavailability of next sentenceprediction task data. Note that the knowledge in theKB for SocialIQA and PFQA are single sentencesand not paragraphs. The results for PIQA and aNLIdatasets are better due to the presence of naturaland contiguous knowledge sentences.

Figure 5: Categories of knowledge relevance for cor-rect predictions.

Figure 6: Weights learned by the RoBERTa Weighted-Sum Model vs the Normalized Overlap between knowl-edge and concatenated question-answer for all samplesof PIQA validation set.

For PIQA, the BERT model improves withknowledge, whereas RoBERTa model underper-forms, indicating RoBERTa gets distracted by theretrieved knowledge, and the knowledge it possessfrom pre-training is more useful.How did different External Knowledge Cate-gories Perform ? In Directly derived knowledgecategory, the model accuracy with knowledge issignificantly more than the baseline accuracy. How-ever, model is still not able to answer all questionsbecause of two reasons. First, the model fails toreason well. Second, many hypotheses are possi-ble between two observations, and turkers seemto have created hypotheses which are very differ-ent from the source data2. In partially derived andrelevant knowledge category, the model accuracyincreases with addition of knowledge.Which is the best model for Knowledge Infu-sion? Weighted Sum model seems to be the bestmodel for Knowledge Infusion. It is also partiallyexplainable by looking at the weights associated

2More details in Supplemental Materials.

Figure 7: Validation accuracy versus number of re-trieved knowledge sentences, for all three datasets.

with the knowledge sentences.How many Knowledge Sentences do we need?We experiment by varying number of KnowledgeSentences, and Figure 7 illustrates that accuracyfor SocialIQA and PIQA increases with more num-ber of knowledge sentences, whereas accuracy de-creases for aNLI. This is because, we use directlyderived knowledge source for aNLI, so increasingthe number of knowledge sentences acts as noisefor aNLI. For SocialIQA and PIQA more knowl-edge helps.

Is there a correlation among overlap betweenquestion-answer & knowledge and learnedweights? Figure 6 shows the weight versus over-lap between knowledge and question-answer dis-tribution for PIQA. It shows there is a low overlap,but the model learns to give high weights in somecases regardless of the overlap.

Discussion and Error Analysis

We have analyzed 200 samples from each ofour best models, and the results are presentedin Figures 5 and 4. Figure 5 shows that aroundtwo third of correct predictions are because of therelevant knowledge provided in open book format.The figure also shows that many a times knowledgeis acompanied with noise, and models are doinga good job in ignoring noise and attending torelevant knowledge. The Noisy IR category saysmodel is also doing a good job in ignoring thecomplete open book knowledge in case its noisy.In those cases, either knowledge acquired duringrevision phase or the orignial language modeltraining phase helps in answering correctly.Why did the weighted-sum model work best?Weighted sum model provides the flexibility toattend in varying amount to multiple knowledgesentences, which is not true for rest of the models.

How to choose external knowledge? Externalknowledge source needs to be chosen based on thedomain to which the dataset belongs; e.g. there arelots of questions in PIQA which can be benefittedby answering how of various physical processes.Use of Wikihow dataset improves model accuracy.How to choose a strategy? Open book strategyshould always be chosen. We saw in Figure 5 thatmodels are good at ignoring noise. This helps inhandling those cases where knowledge retrivedfrom IR is not relevant. Revision strategy is alsorecommended unless sufficient data is not availablefor next sentence prediction task.How do models perform in PFQA? We observeneural language models are able to memorizeand combine knowledge spread over multiplesentences. Most of the errors are observed in ques-tions regarding grandparents and siblings, whichindicate that there is still scope for improvement inmulti-hop reasoning.Use of RocStories for aNLI and ATOMICfor Social? As discussed in section ExternalKnowledge , in case of aNLI we wanted to checkhow well the model performs with the sameknowledge that is used during the creation of thedataset. Similarily in case of SocialIQA we wantedto see how much is the accuracy boost if weprovide partially derived knowledge. As explainedearlier, for both cases, although the models do well,they do not reach near human accuracy.What are the category of errors models made?We divide the errors into three categories. CategoryI is annotation issue, which is the case when morethan one answer option is correct or incorrectanswer option is labelled correct. Also, questionsfor which information is insufficient to select aspecific answer option falls in this category. Forexample:

Obs1: Alan got a gun for his 18th birthday.Obs2: He now loves to go hunting.(Hyp1) His dad took him hunting.(Hyp2) Alan decided to go hunting

Category II is where the IR output is noisy,and does not have relevant knowledge. Forexample:

Question: Blankets(a) can cover lights (b) can cover candlesKnowledge: How to Pack for Self Storage . Stand sofas onend to save space and cover your sofas with plastic coversand blankets.

Category III is Question-Answering modelissue where relevant knowledge is present, though

the knowledge is not completely exact3. However,some reasoning with the help of this relevantknowledge could have helped the model inpredicting the correct answer. For example:

Obs1: Tim needed a fruit to eat. Obs2: Finally, he foundsome fresh grapes to eat.(Hyp1) He went to the near by super market. (Hyp2) Timlooked for a long time in the messy fridge.Knowledge:Tim needed a fruit to eat. He wanted a fruitthat tasted good. He looked in the kitchen for the fruit. Healmost gave up. Finally he found some fresh grapes to eat.

Future work In future we will work on better rea-soning models and apply semantic IR to reduce QAModel issue and Noisy IR issue, which might leadto better performance. Some of the errors are be-cause of annotation issue. Those might be excludedfor calculating the exact accuracy.

Conclusion

Although this paper is about analyzing differentways to incorporate knowledge into language mod-els in commonsense QA, we note that we areamong the top of the leaderboard in the three tasks,SocialIQA, aNLI, and PIQA. We have providedfour new models for multiple-choice natural lan-guage QA using the knowledge and analyzed theirperformance on these commonsense datasets. Wealso make a synthetic dataset available which mea-sures the memorizing and reasoning ability of lan-guage models. We observe that existing knowledgebases even though do not contain all the knowl-edge that is needed to answer the questions, doprovide a significant amount of knowledge. Lan-guage models utilize some of the knowledge; still,there are areas where the models can be furtherimproved, particularly the ones where the knowl-edge is present but the model could not answer, andwhere it predicted wrong answers with irrelevantknowledge.

3More details with BERT prediction analysis in Supple-mental Materials.

ReferencesSoren Auer, Christian Bizer, Georgi Kobilarov, Jens

Lehmann, Richard Cyganiak, and Zachary Ives.2007. Dbpedia: A nucleus for a web of open data.In Proceedings of the 6th International The Seman-tic Web and 2Nd Asian Conference on Asian Seman-tic Web Conference, ISWC’07/ASWC’07, pages722–735, Berlin, Heidelberg. Springer-Verlag.

Pratyay Banerjee. 2019. Asu at textgraphs 2019 sharedtask: Explanation regeneration using language mod-els and iterative re-ranking. EMNLP-IJCNLP 2019,page 78.

Pratyay Banerjee, Kuntal Kumar Pal, Arindam Mi-tra, and Chitta Baral. 2019. Careful selection ofknowledge to solve open book question answering.In Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics, pages6120–6129, Florence, Italy. Association for Compu-tational Linguistics.

Chandra Bhagavatula, Ronan Le Bras, ChaitanyaMalaviya, Keisuke Sakaguchi, Ari Holtzman, Han-nah Rashkin, Doug Downey, Scott Wen-tau Yih, andYejin Choi. 2019. Abductive commonsense reason-ing. arXiv preprint arXiv:1908.05739.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, JianfengGao, and Yejin Choi. 2019. Piqa: Reasoning aboutphysical commonsense in natural language. arXivpreprint arXiv:1911.11641.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, DianaInkpen, and Si Wei. 2018. Neural natural languageinference models enhanced with external knowledge.In Proceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1:Long Papers), pages 2406–2417.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,Ashish Sabharwal, Carissa Schoenick, and OyvindTafjord. 2018. Think you have solved question an-swering? try arc, the ai2 reasoning challenge. arXivpreprint arXiv:1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofdeep bidirectional transformers for language under-standing. In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers),pages 4171–4186, Minneapolis, Minnesota. Associ-ation for Computational Linguistics.

Jonathan Gordon and Benjamin Van Durme. 2013. Re-porting bias and knowledge acquisition. In Proceed-ings of the 2013 Workshop on Automated KnowledgeBase Construction, AKBC ’13, pages 25–30, NewYork, NY, USA. ACM.

Clinton Gormley and Zachary Tong. 2015. Elastic-search: the definitive guide: a distributed real-timesearch and analytics engine. ” O’Reilly Media,Inc.”.

Matthew Honnibal and Ines Montani. 2017. spacy 2:Natural language understanding with bloom embed-dings, convolutional neural networks and incremen-tal parsing. To appear.

Mandar Joshi, Eunsol Choi, Daniel S Weld, and LukeZettlemoyer. 2017. Triviaqa: A large scale distantlysupervised challenge dataset for reading comprehen-sion. arXiv preprint arXiv:1705.03551.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019.Whats missing: A knowledge gap guided approachfor multi-hop question answering. In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP), pages 2807–2821.

Mahnaz Koupaee and William Yang Wang. 2018. Wik-ihow: A large scale text summarization dataset.arXiv preprint arXiv:1810.09305.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,and Eduard Hovy. 2017. Race: Large-scale readingcomprehension dataset from examinations. In Pro-ceedings of the 2017 Conference on Empirical Meth-ods in Natural Language Processing, pages 785–794.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,Luke Zettlemoyer, and Veselin Stoyanov. 2019a.Roberta: A robustly optimized BERT pretraining ap-proach. CoRR, abs/1907.11692.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,Luke Zettlemoyer, and Veselin Stoyanov. 2019b.Roberta: A robustly optimized bert pretraining ap-proach. arXiv preprint arXiv:1907.11692.

Todor Mihaylov, Peter Clark, Tushar Khot, and AshishSabharwal. 2018. Can a suit of armor conduct elec-tricity? a new dataset for open book question answer-ing. In EMNLP.

Todor Mihaylov and Anette Frank. 2018. Knowledge-able reader: Enhancing cloze-style reading compre-hension with external commonsense knowledge. InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1:Long Papers), pages 821–832.

Nasrin Mostafazadeh, Nathanael Chambers, XiaodongHe, Devi Parikh, Dhruv Batra, Lucy Vanderwende,Pushmeet Kohli, and James Allen. 2016. A cor-pus and evaluation framework for deeper under-standing of commonsense stories. arXiv preprintarXiv:1604.01696.

George Sebastian Pirtoaca, Traian Rebedea, and Ste-fan Ruseti. 2019. Answering questions by learningto rank-learning to rank by answering questions. InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the

http://dl.acm.org/citation.cfm?id=1785162.1785216

https://doi.org/10.18653/v1/P19-1615

https://doi.org/10.18653/v1/P19-1615

https://doi.org/10.18653/v1/N19-1423

https://doi.org/10.18653/v1/N19-1423

https://doi.org/10.18653/v1/N19-1423

https://doi.org/10.1145/2509558.2509563

https://doi.org/10.1145/2509558.2509563

http://arxiv.org/abs/1907.11692

http://arxiv.org/abs/1907.11692

9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP), pages 2531–2540.

Alec Radford, Karthik Narasimhan, Tim Salimans,and Ilya Sutskever. 2018. Improving languageunderstanding by generative pre-training. URLhttps://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/languageunderstanding paper. pdf.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, andPercy Liang. 2016. Squad: 100,000+ questions formachine comprehension of text. In Proceedings ofthe 2016 Conference on Empirical Methods in Natu-ral Language Processing, pages 2383–2392.

Siva Reddy, Danqi Chen, and Christopher D Manning.2019. Coqa: A conversational question answeringchallenge. Transactions of the Association for Com-putational Linguistics, 7:249–266.

Matthew Richardson, Christopher J.C. Burges, andErin Renshaw. 2013. MCTest: A challenge datasetfor the open-domain machine comprehension of text.In Proceedings of the 2013 Conference on Empiri-cal Methods in Natural Language Processing, pages193–203, Seattle, Washington, USA. Association forComputational Linguistics.

Melissa Roemmele, Cosmin Adrian Bejan, and An-drew S Gordon. 2011. Choice of plausible alterna-tives: An evaluation of commonsense causal reason-ing. In 2011 AAAI Spring Symposium Series.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chan-dra Bhagavatula, Nicholas Lourie, Hannah Rashkin,Brendan Roof, Noah A Smith, and Yejin Choi.2019a. Atomic: an atlas of machine commonsensefor if-then reasoning. In Proceedings of the AAAIConference on Artificial Intelligence, volume 33,pages 3027–3035.

Maarten Sap, Hannah Rashkin, Derek Chen, RonanLeBras, and Yejin Choi. 2019b. Socialiqa: Com-monsense reasoning about social interactions. arXivpreprint arXiv:1904.09728.

Chao Wang and Hui Jiang. 2019. Explicit utilizationof general knowledge in machine reading compre-hension. In Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguistics,pages 2263–2272.

Ved Prakash Yadav, Steven Bethard, and Mihai Sur-deanu. 2019. Quick and (not so) dirty: Unsuper-vised selection of justification sentences for multi-hop question answering. In IJCNLP 2019.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu,Hua Wu, Qiaoqiao She, and Sujian Li. 2019. En-hancing pre-trained language representations withrich knowledge for machine reading comprehension.In Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics, pages2346–2357.

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015.Wikiqa: A challenge dataset for open-domain ques-tion answering. In Proceedings of the 2015 Con-ference on Empirical Methods in Natural LanguageProcessing, pages 2013–2018.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio,William Cohen, Ruslan Salakhutdinov, and Christo-pher D Manning. 2018. Hotpotqa: A dataset fordiverse, explainable multi-hop question answering.In Proceedings of the 2018 Conference on Empiri-cal Methods in Natural Language Processing, pages2369–2380.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, andYejin Choi. 2018. Swag: A large-scale adversarialdataset for grounded commonsense inference. arXivpreprint arXiv:1808.05326.

https://www.aclweb.org/anthology/D13-1020

https://www.aclweb.org/anthology/D13-1020

Supplemental Material

We have four sections here. In the first section, wehave analyzed BERT prediction for all 3 datasets.The second section is regarding analysis of ourdataset aNLI, where we have tried to find out whythe model is not able to answer all questions evenwith the help of directly derived knowledge source.In the third section, we have compared ROBERTAand BERT by analyzing their weighted-sum model.Fourth section illustrates some examples in thesedatasets which need knowledge beyond commonsense knowledge.

BERT prediction analysis

To understand how knowledge is used in BERTand whether the knowledge is useful or not, wedo the following analysis: For each of the datasetswe have randomly selected 100 samples where ourbest performing model predicts correctly and 100samples where it has failed. We identified the fol-lowing broad categories of analysis.

For the correct predictions, we check, (1) Exactappropriate knowledge is present, (2) A relatedbut relevant knowledge is present, (3) Knowledgeis present only in the correct option, and (4) Noknowledge is present. Figure 8 shows the countsfor the above categories. All the cases do not occurin all the datasets.

Figure 8: Measure of performance across differentknowledge presence in correct predictions

For the errors (Figure 9), we analyze, (1) Isthe knowledge insufficient, (2) Is the knowledgepresent in the wrong answer, (3) Knowledge isappropriate but model fails, and (4) Gold label isquestionable.

We also analyze given appropriate knowledge,how the model performs. From Figure 8, it canbe seen that BERT can answer quite a number ofquestion without knowledge. Also from Figure 9,

Figure 9: Measure of performance across differentknowledge presence in incorrect predictions.

it is clear that despite of having good knowledge,BERT fails to answer correctly.

In the following subsections, we analyze the dif-ferent dataset specific errors.

SocialIQA

We measure the performance across the eight dif-ferent ATOMIC inference dimensions for the bestknowledge infused model. The six of the inferen-tial dimensions are Needs, Attributes, Reactions,Wants, Motivations, Effects. These are for Per-sonX. There are two more for Others, Reaction andWants.

In figure 10 we can see both with and with-out knowledge the model performs nearly equallyacross all dimensions. There is no considerableimprovement across any particular dimension.

Figure 10: Performance of the model with (MACmodel) and without knowledge (Baseline) across dif-ferent types of ATOMIC inference dimensions.

In some cases the model fails to predict the cor-rect answer despite of the appropriate knowledgebeing present.

Question: Kendall took their dog to the new dogpark in the neighborhood. . What will Kendallwant to do next?(A) walk the dog (B) meet other dog ownersKnowledge: Jody takes Jody’s dog to the dogpark, as a result Jody wants to socialize withother dog owners.

In the above example, the above knowledge wasretrieved but still the model predicted the wrongoption. 341 questions were predicted wrongly afteraddition of knowledge. We also identified out ofthe set of 100 analyzed correct predictions, 29% ofthe questions had partial information relevant tothe question.

Figure 11: Performance of the model across the threedifferent type of questions.

Parent and Family QAIn Figure 11, we see with addition of knowledge,there is a considerable improvement in perfor-mance. Other than questions asking about parents,which just need a look up to answer, the siblingand grandparent questions need models to combineinformation present across multiple sentences. Wecan see the model improves even in this questions,showing knowledge infusion helps. Out of the threetypes of the questions, the performance is loweston the sibling questions, indicating that it is harderfor the models to perform this task. The modelaccuracy is reasonably good on this dataset, whichshows BERT has a strong capability to memorizefactual knowledge. Its performance improves withinfusion of knowledge,

Here also, 1,790 questions which were previ-ously predicted correctly, are predicted wrong withaddition of knowledge.

PIQAOut of the 100 failures that we have analysed, wefound that for 8 samples the goal matches the

knowledge statements but the answers present inthe knowledge is different. For example,

Goal: How can I soothe my tongue if I burn it?(A) Put some salt on it. (B) Put some sugar onit.

Knowledge: How to Soothe a BurntTongue.Chew a menthol chewing gum.

Also, there are 33 samples in the whole trainand dev dataset for which the words in one optionsare a subset of second option. In those cases, theknowledge retrieved is same for both the optionsand this confuses the BERT model.

Goal: What can I drink wine out of if I don’thave a wine glass?(A) Just pour the wine into a regular mug orglass and drink. (B) Just pour the wine into aregular mug or wine glass and drink.Knowledge: How to Serve Foie Gras. Pour aglass of wine.

On addition of knowledge, 359 samples havebecome correctly predicted with our best model forPIQA dataset which were initially incorrect. Butin the process, 166 samples which were correctin our baseline model have now been incorrectlypredicted.

aNLI

In this dataset, we also have some examples wherenegative knowledge is being fed to the model, andit still produces the correct output. There are 8 suchexamples among the 100 samples we analyzed. Forexample:

Obs1: Pablo likes to eat worms.Obs2: Pablo does not enjoy eating worms.(Hyp1) Pablo thought that worms were adelicious source of protein. (Hyp2) Pablo thenlearned what worms really are.

Knowledge: Pablo likes to eat worms. He reada book in school on how to do this. He friesthem in olive oil. He likes to do this at least oncea month. Pablo enjoys worms and views themas a delicacy.

Similarily, we have examples where knowledgefavors incorrect hypothesis, however our systemstill produces correct output. We found 12 suchexamples among the 100 samples we analyzed. Forexample:

Obs1: Dotty was being very grumpy.Obs2: She felt much better afterwards.(Hyp1) Dotty ate something bad. (Hyp2) Dottycall some close friends to chat.Knowledge: Allie felt not so good last night.She ate too much. So she had to sleep it off.Then she woke up. She felt so much better

We have 12 cases among 100 analyzed samples,where both hypothesis are very similar. So,oursystem is unable to produce correct output. Forexample:

Obs1: Bob’s parents grounded him.Obs2: He came back home but his parentsdidn’t even know he left.(Hyp1) Bob got caught sneaking out. (Hyp2)Bob got away with sneaking out.

We also have 34 examples where incorrect hy-pothesis has more word similarity with the obser-vation and knowledge, whereas correct hypothesishas been paraphrased or has less word similarity.The system predicts the wrong answer in such asituation. One such example is:

Obs1: Mary’s mom came home with more ba-nanas than they could possibly eat.Obs2: That was the best way ever to eat a ba-nana!(Hyp1) Mary and her mom decided to makechocolate covered frozen bananas to avoidwaste. (Hyp2) Mary made pineapple splits foreveryone.Knowledge: Mary s mom came home withmore bananas than they could possibly eat. Shewondered why she had bought them all. Thenafter dinner that night she got a surprise. Mommade banana splits for the whole family. Thatwas the best way ever to eat a banana

Another area where the system fails, is wherethe problem seems to be open-ended, and manyhypotheses can explain the pair of observations. Itis tough to find exact knowledge in such a scenario.For example,

Figure 12: Overlap percentage between IRed knowl-edge and aNLI

Obs1: Lisa went for her routine bike ride.Obs2: Some days turn out to be great adven-tures.(Hyp1) Lisa spotted a cat and followed it offtrail (Hyp2) Lisa saw a lot of great food.Knowledge: Lisa went for her routine bikeride.Only this time she noticed an abandonedhouse.She stopped to look in the house.It wasfull of amazing old antiques.Some days turn outto be great adventures.

Why Language model is not able to answer allquestions even with the help of directly derivedknowledge source?We experiment by providing directly derivedknowledge source in aNLI, and find that the lan-guage model is still not able to answer all questions.We analyze and find following insights regardingaNLI and direcly derived knowledge source.

1. 41.90 % of the data has either Obs1 or Obs2or both common with the knowledge.

2. 23.12 % of the data has both Obs1 and Obs2common with the knowledge.

Among those, figure 12 illustrates percentageoverlap of hypothesis with knowledge. Overlap hasbeen calculated by taking set of words in hypothsis,and finding if the same word is there in knowledgeor not. You can see that very few examples havehigh overlap with knowledge. Actual number willbe even lesser, as we have calculated word overlap,not phrase or sentence overlap.

For each of the bins, 13 illustrates how much ofthose are correctly classified. It is interesting to seethat, even though knowledge is very much relevant

Figure 13: Percentage of samples correctly classifiedby best model in aNLI

here (obs1,obs2 same and high hyp overlap), BERThas not been able to classify correctly. You canignore 0-10, 10-20 bins here, since percentage ofdata in those bins are very less as we saw in figure12.

ROBERTA vs BERTHere, both the learnt weights and the percentageoverlap between the question and option versusknowledge is binned with an interval of 0.1. Thefigure 14 shows the difference between the countsof samples for each weight and overlap bin be-tween BERT-weighted-sum model and RoBERTa-weighted-sum model. From the figure, it can beseen that, the samples with percentage overlap be-tween 0.2 to 0.3 which have been assigned lowerweight in the region 0.0 to 0.1 are more in BERTweighted sum model than in RoBERTa weightedsum model.On the other hand, the samples with percentageoverlap between 0.2 to 0.3 which have been as-signed weight in the region 0.1 to 0.2 and in region0.2-0.3 are more in RoBERTA weighted sum modelthan BERT weighted sum model.

This shows the RoBERTa model is able to assignweights to proper knowledge sentences leading toimproved question answering performance.

Examples which need knowledge beyondcommonsense knowledgeThere are some questions which needknowledge beyond common sense knowl-edge to answer. Following are 2 examples.Goal: how do you call the fuzz?(A) dial 911.(B) dial fuzz under contacts.

Figure 14: Difference between weights learned by theRoBERTa Weighted-Sum Model vs BERT-Weighted-Sum Model for the Normalized Overlap betweenknowledge and concatenated question-answer for allsamples of PIQA validation set

Goal: To fight Ivan Drago in Rocky for segamaster system.?(A) Drago isn’t in this game because it was re-leased before Rocky IV.(B) You have to defeat Apollo Creed andClubber Lang first.

Date post:	04-Sep-2020
Category:	Documents
Upload:	others
View:	2 times
Download:	0 times

Arindam Mitra Pratyay Banerjee Kuntal Pal Swaroop Mishra Chitta … · 2020. 4. 20. · Arindam...

Documents