Transformer-Based Open Domain Biomedical Question Answering at BioASQ8 Challenge

    Ashot Kazaryan1,2, Uladzislau Sazanovich1,2, and Vladislav Belyaev1,3

1 JetBrains Research, Russia
    {ashot.kazaryan,uladzislau.sazanovich,vladislav.belyaev}@jetbrains.com

2 ITMO University, Russia
    3 National Research University Higher School of Economics, Russia

    {287371,191872}@niuitmo.ru

Abstract. BioASQ task B focuses on biomedical information retrieval and question answering. This paper describes the participation and proposed solutions of our team. We build a system based on recent advances in the general domain as well as the approaches from previous years of the competition. We adapt a system based on a pretrained BERT for document and snippet retrieval, question answering and summarization. We describe all approaches we experimented with and show that while neural approaches do well, sometimes baseline approaches have high automatic metrics. The proposed system achieves competitive performance while being general so that it can be applied to other domains as well.

Keywords: BioASQ Challenge · Biomedical Question Answering · Open Domain Question Answering · Information Retrieval · Deep Learning

    1 Introduction

BioASQ [27] is a large-scale competition for biomedical research. It provides evaluation measures for various setups like semantic indexing, information retrieval and question answering, all regarding the biomedical domain. The competition takes place annually online, and each year gains more attention from research groups all around the world. BioASQ provides the necessary datasets, evaluation metrics and leaderboards for each of its sub-challenges.

More specifically, the BioASQ challenge consists of two major objectives, which are called "tasks". The first is semantic indexing, whose goal is to construct a search index over a given set of documents such that certain semantic relationships hold between the index terms. The second objective is passage ranking and question answering in various forms: given a question, the system must return a piece of text. The returned text must either answer the question directly or contain enough information to derive the answer. In BioASQ terms, those objectives are called Task A and Task B, respectively.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

In this work, we explore applications of state-of-the-art natural language processing and deep learning models to biomedical question answering. As a result, we develop a system that is capable of providing answers in the form of documents, snippets, exact answers or abstractive text, given biomedical questions from various domains. We evaluate our system on the recent BioASQ 2020 challenge, where it achieves competitive performance.

    1.1 BioASQ Tasks

Our team participated in Task B, which involves information retrieval, question answering, summarization and more. This task uses benchmark datasets containing development and test questions, in English, along with gold standard (reference) answers constructed by a team of biomedical experts. The task is separated into two phases.

Phase A The first phase measures the ability of systems to answer biomedical questions with a list of relevant documents and snippets of text from the retrieved documents. The main metric for documents and snippets is the mean average precision (MAP). The average precision is defined as follows:

AP = \frac{\sum_{r=1}^{|L|} P(r) \cdot rel(r)}{|L_R|}

where |L| is the number of items in the list predicted by the system and |L_R| is the number of relevant items. P(r) is the precision when only the first r returned items are considered, and rel(r) is equal to 1 if the r-th returned item is relevant (and 0 otherwise). MAP and GMAP are the arithmetic and geometric means over all questions in the evaluation set. For snippet retrieval, precision is measured in terms of characters, and rel(r) is equal to 1 if the returned item has non-zero overlap with at least one relevant snippet. Additional metrics are precision, recall and F1 score. A more detailed description is given in the original paper [13].
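To make the metric concrete, here is a minimal sketch of how AP and MAP could be computed for a ranked document list; the function names are ours, and this is not the official BioASQ evaluation code.

```python
from typing import List, Set, Tuple

def average_precision(predicted: List[str], relevant: Set[str]) -> float:
    """AP = sum over r of P(r) * rel(r), divided by |L_R|."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for r, doc_id in enumerate(predicted, start=1):
        if doc_id in relevant:          # rel(r) = 1
            hits += 1
            score += hits / r           # P(r) at this rank
    return score / len(relevant)        # divide by |L_R|

def mean_average_precision(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Arithmetic mean of AP over all questions in the evaluation set."""
    return sum(average_precision(p, g) for p, g in runs) / len(runs)

# Toy usage: two questions with their predicted and gold document ids.
print(mean_average_precision([(["d1", "d2", "d3"], {"d1", "d3"}),
                              (["d4", "d5"], {"d9"})]))
```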

Phase B The second phase evaluates the performance of question answering, given a list of relevant documents and snippets from the previous phase. The questions are of several types: questions where the answer is either yes or no ("yes/no"), questions where the answer is a single term ("factoid"), and questions where the answer is a list of terms ("list"). Additionally, each question has an "ideal" answer, where the aim is to measure the systems' ability to generate a paragraph-sized passage that answers the question.

The metrics of Phase B are macro-averaged F1 for yes/no questions, mean reciprocal rank (MRR) [29] for factoid questions and F1 score for list questions. To evaluate answers in natural language, the ROUGE [16] scores are used. We should note that human experts will additionally evaluate all systems after the contest. However, those results are not available at the time of writing, so we use only the automatic measurements to draw our conclusions.
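For factoid questions, mean reciprocal rank can be sketched as follows; this is a simplified illustration that ignores details of the official scorer, such as answer synonym lists.

```python
from typing import List, Set

def mean_reciprocal_rank(predictions: List[List[str]], gold: List[Set[str]]) -> float:
    """predictions: ranked answer lists per question; gold: accepted answers per question."""
    total = 0.0
    for ranked, accepted in zip(predictions, gold):
        for rank, answer in enumerate(ranked, start=1):
            if answer.lower() in accepted:
                total += 1.0 / rank    # reciprocal rank of the first correct answer
                break
    return total / len(predictions)

print(mean_reciprocal_rank([["erlotinib", "gefitinib"]], [{"gefitinib"}]))  # 0.5
```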

1.2 Related Work

Most contemporary large-scale QA systems attempt to fill the gap between a massive source of knowledge and a complex neural reasoning model. Many of the popular knowledge sources are sets of unstructured or semi-structured texts, like Wikipedia [5], [33]. This is also the case in the biomedical domain, where PubMed [4] is amongst the largest sources of biomedical scientific knowledge.

During document retrieval, a question answering system can benefit from structured knowledge as well. There is a rich set of biomedical ontologies like UMLS [2] or GO [1], whose successful use has been shown in different QA systems, including ones submitted by previous years' BioASQ participants [12]. However, in our work we do not leverage such information and instead explore a more general approach, applicable to any other domain.

Many systems perform re-ranking after initial document retrieval. Specialized neural models like DRMM [8] have been successfully used in previous BioASQ challenges [3]. More recent approaches utilize transformer-based language models [28] like BERT [6] for a wide variety of tasks. Applications of transformers to document re-ranking have set a new state of the art [21], including in last years' BioASQ challenges [22]. There are also systems that do document re-ranking based on snippet extraction [22], but they did not achieve the highest positions.

Some systems solve snippet extraction by utilizing methods that were originally developed for document re-ranking. [22] uses the earlier mentioned DRMM in Task 7B and achieves top results. [23] proposes another neural approach, employing both textual and conceptual information from the question and the candidate text. In our work, we experiment with different methods and show how strong baselines consistently demonstrate high metrics, given a proper document retriever.

Deep learning has shown its superiority in question answering. In Task 5B, [30] achieved top scores by training an RNN-based neural network. However, most of the modern advancements in question answering can be attributed to transformer-based models. Last years' challenges were dominated by systems that used BERT or its task-specific adaptations, like BioBERT [34], [10]. In this work we experiment with a similar approach.

Deep neural models, and transformer-based models in particular, have shown their ability to tackle summarization in different setups [17], including QA summarization [15]. However, as [19] noticed, BioASQ summaries tend to look very similar to the input examples. They exploit this observation, introduce several solutions based on sentence re-ranking, and achieve top automatic and human scores in several batches. There is also an attempt to utilize pointer-generator networks [26] for BioASQ ideal questions [7]. During the competition we extend the snippet re-ranking approach by using transformer models. Moreover, we introduce a fully generative approach, based on transformers as well.

2 Methods

In this section, we describe the system we implemented for document and snippet retrieval, as well as our question answering system. We provide the results of the different approaches we experimented with during the competition. To assess the performance of different methods more accurately, we merge all test batches of the 8B task into one and use the resulting 500 questions as an evaluation set. Here, as during the competition, we use a development set created from 100 questions of the 6B task and 200 questions of the 7B task. Our system evolved from batch to batch, achieving its final shape in batch 5. All the ablation experiments and retrospective evaluations are performed on the system that was used for the 5th batch submission.

    2.1 Document retrieval

For document retrieval, we implement a system conceptually similar to [21]. First, we extract a list of N candidate documents using the Anserini implementation of the BM25 algorithm [32]. Then we use a BERT model to re-rank the candidate documents and output at most ten top-scored documents.

BM25 For initial document retrieval, we used Anserini [32]. We created an index using the 2019 PubMed Baseline Repository [4]. For each paper in the PubMed Baseline, we extracted the PubMed identifier, the title, and the abstract. We stored the title and abstract as separate fields in the index. We applied the default stopword filtering and Porter stemming [31] provided with Anserini to the title and abstract. Overall, the search index contains 19 million documents.
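A rough sketch of this retrieval step using Pyserini, the Python front end to Anserini; the index path and BM25 parameters below are assumptions, and building the Lucene index over the PubMed Baseline is a separate offline step not shown here.

```python
from pyserini.search import SimpleSearcher  # Pyserini wraps the Anserini/Lucene searcher

# Assumed path to a Lucene index built over the 2019 PubMed Baseline (title + abstract fields).
searcher = SimpleSearcher("indexes/pubmed-baseline-2019")
searcher.set_bm25(k1=0.9, b=0.4)  # illustrative parameters; the paper does not report its values

def bm25_candidates(question: str, n: int = 50):
    """Return the top-N (pmid, BM25 score) candidates for a question."""
    hits = searcher.search(question, k=n)
    return [(hit.docid, hit.score) for hit in hits]

print(bm25_candidates("What is the role of sclerostin in bone homeostasis?", n=10))
```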

BERT Re-ranking The initial set of documents obtained with BM25 is passed to the BERT re-ranker, which assigns relevance scores to documents based on the question. We consider all documents with a score higher than a threshold to be relevant and output at most ten papers with the highest scores. To train the BERT re-ranker, we created a binary classification dataset. We obtained positive examples from the gold documents of the BioASQ dataset. We collected negative examples using BM25 by extracting 200 documents with the question as the query and considering all documents from position 100 onwards to be non-relevant if they are not in the gold document set. As the BioASQ question answering dataset contains questions collected from past years' contests, the relevant documents include only papers published before the year of the corresponding competition. Usually, there are several documents relevant to the question that were published after the year of the contest. To exclude such papers from the negative examples, we calculated the maximum publication year over all relevant documents and filtered out all documents published after this year from the negative examples.
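A minimal sketch of the re-ranking stage with the Hugging Face transformers library, assuming a BERT-family checkpoint fine-tuned as a binary question-document relevance classifier; the checkpoint name and threshold are illustrative, not the submission's exact configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "bert-base-uncased"  # placeholder; the paper fine-tunes its own relevance classifier
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2).eval()

def rerank(question, candidates, threshold=0.5, top_k=10):
    """Score (pmid, title+abstract) candidates and keep at most top_k above the threshold."""
    scored = []
    for pmid, text in candidates:
        enc = tokenizer(question, text, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            probs = model(**enc).logits.softmax(dim=-1)
        relevance = probs[0, 1].item()          # probability of the "relevant" class
        if relevance > threshold:
            scored.append((pmid, relevance))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]
```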

Experiments We evaluated several approaches to document retrieval. First, we evaluated the performance of the BM25 algorithm, and then we applied different modifications of the BERT-based re-ranker. We examined the effect of the relevance score threshold as well as the number of documents obtained from the BM25 stage. The results are presented in Table 1. We can see that the BERT-based re-ranker consistently improves the base BM25 performance.

Since our re-ranker is trained to perform logistic regression, we can vary the decision boundary to achieve an appropriate trade-off between precision and recall. However, the MAP metric, which is used as the final ranking measure, does not penalize the system for additional non-relevant documents, which means the system should always output as many documents as possible to achieve the highest score, even though this reduces its practical usefulness. We decided to orient our system towards both precision and recall, and as a result we achieve the highest F1 scores across all batches while maintaining competitive MAP scores.

Table 1. Results of different approaches to document retrieval on the combined test set of 500 questions from 8B. N is the number of documents returned from the BM25 stage. T is the score threshold for relevant documents.

Method                          Precision  Recall  F-Measure  MAP     GMAP
BM25 (N = 10)                   0.1190     0.5022  0.1730     0.3579  0.0128
BM25+BERT (N = 50, T = 0.5)     0.2892     0.5158  0.3334     0.3979  0.0155
BM25+BERT (N = 50, T = 0)       0.1358     0.5481  0.1954     0.4114  0.0221
BM25+BERT (N = 500, T = 0.5)    0.2734     0.5387  0.3249     0.4046  0.0191

    2.2 Snippet Retrieval

Snippet extraction systems extract a continuous span of text from one of the relevant documents for the given question. We observe that snippets from the BioASQ training set are usually one sentence long, thus our system is designed as a sentence retriever and snippet extraction is formulated as a sentence ranking problem. We experiment with both neural and statistical approaches to tackle this challenge.

Baseline We use a simple statistical baseline for sentence ranking, which is based on measuring entity co-occurrence in the question and the candidate sentence. For each question and sentence we extract the sets of entities Q and S, respectively, and compute the relevance score:

relevance(q, s) = \frac{|Q \cap S|}{|Q|}

We use the ScispaCy [20] en_core_web_sm model for extracting entities.
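A sketch of this baseline; the spaCy model name follows the paper, and lower-casing entity mentions for the overlap is our assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # model named in the paper; a ScispaCy model would load the same way

def entities(text: str) -> set:
    """Set of lower-cased entity mentions found in the text."""
    return {ent.text.lower() for ent in nlp(text).ents}

def relevance(question: str, sentence: str) -> float:
    """|Q ∩ S| / |Q| entity overlap between question and candidate sentence."""
    q, s = entities(question), entities(sentence)
    return len(q & s) / len(q) if q else 0.0

print(relevance("Can CD55 deficiency cause thrombosis?",
                "CD55 deficiency is associated with thrombosis."))
```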

Word2Vec Similarity One approach to determining sentence similarity is to map both the query and the candidate into the same vector space and measure the distance between them. For embedding word sequences, we use a Word2Vec model pretrained on PubMed texts [18] and compute the mean of the individual word embeddings. Suppose E_q and E_s are the embeddings of the question and the snippet, respectively. The relevance of a snippet for a given question is the cosine similarity between the embeddings:

relevance(q, s) = \frac{E_q \cdot E_s}{\|E_q\| \, \|E_s\|}
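A sketch of this scorer with gensim; the file name of the PubMed-pretrained vectors is an assumption, and any word2vec-format KeyedVectors file would work the same way.

```python
import numpy as np
from gensim.models import KeyedVectors

# PubMed-trained word2vec vectors as in [18]; the path and binary format are assumed.
wv = KeyedVectors.load_word2vec_format("pubmed_word2vec_200d.bin", binary=True)

def embed(text: str) -> np.ndarray:
    """Mean of the individual word embeddings; out-of-vocabulary words are skipped."""
    vectors = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

def relevance(question: str, snippet: str) -> float:
    eq, es = embed(question), embed(snippet)
    denom = np.linalg.norm(eq) * np.linalg.norm(es)
    return float(eq @ es / denom) if denom else 0.0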

BERT Similarity As a transformer pretrained on the biomedical domain should contain a lot of transferable knowledge, we check the zero-shot performance of the pretrained model. Similar to the Word2Vec approach, we compare the embeddings of the question and the snippet. The embedding of a text span is the contextualized embedding corresponding to the special [CLS] token, which is inserted before the tokenized text. The relevance is the cosine similarity between the embeddings of the question and the snippet.
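A sketch of this zero-shot scorer; the paper does not name the exact checkpoint, so the BioBERT model identifier below is an assumption.

```python
import torch
from transformers import AutoModel, AutoTokenizer

NAME = "dmis-lab/biobert-base-cased-v1.1"  # assumed BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModel.from_pretrained(NAME).eval()

def cls_embedding(text: str) -> torch.Tensor:
    """Contextualized embedding of the [CLS] token prepended to the text."""
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    return hidden[0, 0]

def relevance(question: str, snippet: str) -> float:
    q, s = cls_embedding(question), cls_embedding(snippet)
    return torch.nn.functional.cosine_similarity(q, s, dim=0).item()
```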

BERT Relevance As the task of snippet retrieval is very similar to document retrieval, we test a similar approach. We use the BERT_rel model, trained for document ranking, to assign a relevance score to a question-snippet pair:

relevance(q, s) = BERT_{rel}(q, s)

Document Scores Finally, after assigning each question-sentence pair a relevance score, we scale it by an additional score based on the position, in the list of relevant documents, of the document from which the candidate sentence is extracted. Despite the simplicity of this trick, experiments show considerable improvements in the evaluation metrics, which points to a strong correlation between the rank of the abstracts and the rank of the snippets from those abstracts. For each document d_i from the list of ranked relevant documents D = d_1, d_2, ..., d_n there is a list of sentences S_i = s_{i,1}, s_{i,2}, ..., s_{i,m}, and the similarity score between query q and sentence s_{i,j} is:

score(q, s_{i,j}) = \frac{relevance(q, s_{i,j})}{i}
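Combining the two signals, a sketch of the final snippet scoring; `relevance` stands for any of the scorers above and the function name is ours.

```python
def rank_snippets(question, ranked_documents, relevance, top_k=10):
    """ranked_documents: list of (doc_id, [sentences]) in retrieval order, best first."""
    scored = []
    for i, (doc_id, sentences) in enumerate(ranked_documents, start=1):
        for sentence in sentences:
            # score(q, s_ij) = relevance(q, s_ij) / i, i.e. damped by the source document's rank
            scored.append((doc_id, sentence, relevance(question, sentence) / i))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]
```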

Experiments We evaluated all described approaches to snippet retrieval. The results are presented in Table 2. We can see that the heuristic of incorporating the document score into the snippet score improves MAP significantly for all approaches. In line with document retrieval, the BERT relevance model has higher precision and recall but lower MAP scores. Surprisingly, retrieval based on BioBERT cosine similarity performed well even without training on any BioASQ data. We can consider this approach to be the zero-shot performance of BioBERT on the task of snippet retrieval.

Table 2. Results of different approaches to snippet retrieval on the combined test set of 500 questions from 8B. "Docs" means scaling the snippet score by the position of the source document.

Method                        Precision  Recall  F-Measure  MAP     GMAP
Baseline                      0.1631     0.2871  0.1841     0.6521  0.0036
Baseline + Docs               0.1733     0.2876  0.1934     0.8902  0.0020
Word2Vec Similarity           0.1702     0.2941  0.1904     0.6408  0.0054
Word2Vec Similarity + Docs    0.1727     0.2850  0.1928     0.9350  0.0019
BERT Similarity               0.1607     0.2621  0.1763     0.6338  0.0031
BERT Similarity + Docs        0.1733     0.2847  0.1927     0.9374  0.0019
BERT Relevance                0.1931     0.3383  0.2174     0.6926  0.0102
BERT Relevance + Docs         0.1921     0.3344  0.2161     0.8098  0.0071

    2.3 Exact answers

Factoid and List questions. For factoid and list questions we generate answers with a single extractive question-answering system. Its design follows the classical transformer-based approach described in [6]. As the underlying neural model, we use ALBERT [14] fine-tuned on SQuAD 2.0 [24] and the BioASQ training set. SQuAD is an extractive question answering dataset, so it is well suited for BioASQ tasks. In essence, list and factoid questions can be handled by the same span extraction technique. Thus we can use the same model for both question types, differing only at the postprocessing stage.

Throughout all five batches we experiment mainly with the pre- and postprocessing stages, without substantial changes to the architecture of the system itself. During preprocessing, we convert input questions to the SQuAD format [25], where contexts are built from the relevant snippets that come with each input question. The postprocessing stage is implemented in the same manner as [34]. However, for list questions we additionally split the resulting extracted spans by the "and/or" and "or" conjunctions, which we observed to be frequently used in chemical/gene enumerations in various biomedical abstracts. Table 3 shows the importance of this step.
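A sketch of the conjunction-splitting postprocessing for list questions; the exact splitting rules of the submission are not published, so this regex is an approximation.

```python
import re

def split_list_answer(span: str):
    """Split an extracted span on the ' and/or ' and ' or ' conjunctions into separate list items."""
    parts = re.split(r"\s+and/or\s+|\s+or\s+", span)
    return [p.strip() for p in parts if p.strip()]

print(split_list_answer("erlotinib or gefitinib and/or afatinib"))
# ['erlotinib', 'gefitinib', 'afatinib']
```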

Table 3. The performance of the QA model for list questions with and without splitting of answers by conjunctions as a postprocessing step. The evaluation is performed on the first batch of 8B.

Method                   Mean Precision  Mean Recall  F-Measure
BioBERT                  0.2750          0.2250       0.2305
BioBERT + conj split     0.3884          0.5629       0.4315

Yes/No questions. For yes/no questions we formulate the task as logistic regression over question-snippet pairs and implement a transformer-based approach similar to [34]. We use the ALBERT model and fine-tune it on the SQuAD and BioASQ datasets. In the fifth batch, we additionally use the PubMedQA dataset [11] and replace the model with BioBERT. Although PubMedQA contains more than 200 thousand labelled examples, its average question length is twice that of BioASQ questions. We sampled 2 thousand questions with a distribution similar to the BioASQ questions and incorporated them into the final training set.
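A sketch of how such length-matched sampling could look; the filtering rule below is our assumption, as the paper only states that about 2 thousand questions with a BioASQ-like distribution were sampled.

```python
import random

def sample_pubmedqa(pubmedqa_questions, bioasq_questions, n=2000, seed=0):
    """Keep PubMedQA questions whose token length falls inside the BioASQ length range, then sample n."""
    lengths = [len(q.split()) for q in bioasq_questions]
    lo, hi = min(lengths), max(lengths)
    pool = [q for q in pubmedqa_questions if lo <= len(q.split()) <= hi]
    random.Random(seed).shuffle(pool)
    return pool[:n]
```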

    2.4 Summarization

Phase B also includes a summarization objective, where a participating system has to generate a paragraph-sized text answering the question. We came up with several approaches to tackle this challenge.

Weak baseline BioASQ does not impose any limitations on the source of the summary. We observed that summaries tend to be one or two sentences long, resembling how snippets are composed. A straightforward approach is to use the snippets provided with the question for computing the summary. Our weak baseline selects the first snippet of the question for this purpose.

Snippet Reranking Naturally, the first snippet may not answer the question directly and clearly, despite being marked as the most relevant. A logical improvement over the baseline is to select a more appropriate snippet, potentially in a question-aware manner. To make answers more granular, we split snippets into sentences, and the resulting candidate pool contains both snippets and snippet sentences. Sometimes, however, snippets are absent for a given question. In that case we extract the candidate sentences from the relevant abstracts. For re-ranking, we use BERT_rel trained for document re-ranking, as described in Section 2.2. Overall, we can describe this system as sentence-level extractive summarization.

Abstractive Summarization Our final system performs abstractive summarization over the provided snippets. We use a traditional encoder-decoder transformer architecture [28], where the encoder is based on BioMed-RoBERTa [9], while the decoder is trained from scratch, following BertSUM [17]. First, we pretrain the model on a summarization dataset based on PubMed, where the target is an arbitrary span from the abstract and the source is a piece of text from which the target can be derived. After that, we fine-tune the model to produce summaries given the question and the concatenation of the relevant snippets from the BioASQ training dataset, separated with a special token.
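A sketch of how such a hybrid model could be assembled with the transformers library; the checkpoint name and token handling are assumptions, and the PubMed pretraining data, the BertSUM-style training details, and the actual training loop are omitted.

```python
from transformers import (AutoConfig, AutoModel, AutoTokenizer,
                          EncoderDecoderModel, RobertaForCausalLM)

ENC_NAME = "allenai/biomed_roberta_base"   # assumed checkpoint name for BioMed-RoBERTa
tokenizer = AutoTokenizer.from_pretrained(ENC_NAME)

encoder = AutoModel.from_pretrained(ENC_NAME)      # pretrained encoder
dec_cfg = AutoConfig.from_pretrained(ENC_NAME)
dec_cfg.is_decoder = True
dec_cfg.add_cross_attention = True
decoder = RobertaForCausalLM(dec_cfg)              # randomly initialized decoder, trained from scratch

model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Source: question plus snippets joined with the separator token; target: the ideal answer.
inputs = tokenizer("question </s> snippet one </s> snippet two",
                   return_tensors="pt", truncation=True)
labels = tokenizer("a paragraph-sized answer", return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss   # one training step's loss
```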

    3 Results

In this section, we present the official automatic evaluation of our system, compared to the top competitor system. We denote our system as "PA", which stands for the Paper Analyzer team. We additionally perform a retrospective evaluation of Phase A, where the gold answers are available.

3.1 Document Retrieval

In Table 4, we present the results of our document retrieval system on all batches compared to the top competitor. The final design of our system was implemented only in the fifth batch. So, to evaluate our proposed system against our own and other participants' systems from previous batches, we computed the evaluation metrics over the gold answers provided by BioASQ for Phase B. We were able to fully reproduce the official leaderboard scores for the fifth batch and show that our final system outperforms all our previous submissions. The retrospective evaluation shows that we significantly improved our system during the contest and achieved better results with the final system.

Table 4. The performance of the document and snippet retrieval system on all batches of task 8B. "final" represents the retrospective evaluation of the batch 5 system on previous batches. "Top Competitor" is a top-scoring submission from other teams.

                                Documents                       Snippets
Batch  System                   F-Measure  MAP     GMAP         F-Measure  MAP     GMAP

1      PA final                 0.3389     0.3718  0.0156       0.1951     0.8935  0.0019
       PA batch-1               0.2680     0.3346  0.0078       0.1678     0.5449  0.0028
       Top Competitor           0.1748     0.3398  0.0120       0.1752     0.8575  0.0017

2      PA final                 0.2689     0.3315  0.0141       0.1487     0.7383  0.0008
       PA batch-2               0.2300     0.3304  0.0185       0.1627     0.3374  0.0047
       Top Competitor           0.2205     0.3181  0.0165       0.1773     0.6821  0.0015

3      PA final                 0.3381     0.4303  0.0189       0.1958     0.9422  0.0028
       PA batch-3               0.2978     0.4351  0.0143       0.1967     0.6558  0.0062
       Top Competitor           0.1932     0.4510  0.0187       0.2140     1.0039  0.0056

4      PA final                 0.3239     0.4049  0.0189       0.1753     0.9743  0.0015
       PA batch-4               0.3177     0.3600  0.0163       0.1810     0.7163  0.0056
       Top Competitor           0.1967     0.4163  0.0204       0.2151     1.0244  0.0055

5      PA final                 0.3963     0.4825  0.0254       0.2491     1.1267  0.0038
       PA batch-5 (final)       0.3963     0.4825  0.0254       0.2491     1.1267  0.0038
       Top Competitor           0.1978     0.4842  0.0330       0.2652     1.0831  0.0086

    3.2 Snippet Retrieval

In Table 4, we also present the results of our snippet retrieval system on all batches compared to the top competitor. Similar to document retrieval, we performed a retrospective evaluation on all batches for the final implemented system. The evaluation shows that we significantly improved our system during the contest.

3.3 Question Answering

We submitted only baselines for batches 1 and 2, so we present results only for batches starting with 3. Overall, we achieved moderate results on the question answering task, as we mainly focused on Phase A. We believe this was caused by a poor selection of the training dataset. We will analyze the errors and perform additional experiments in the future. The performance of our system is presented in Tables 5 and 6.

Table 5. The performance of the proposed system on the yes/no questions. "Top Competitor" is a top-scoring submission from other teams.

Batch  System                          Accuracy  F1 yes  F1 no   F1 macro

3      ALBERT (SQuAD, BioASQ)          0.9032    0.9189  0.8800  0.8995
       Top competitor                  0.9032    0.9091  0.8966  0.9028

4      ALBERT (SQuAD, BioASQ)          0.7308    0.7879  0.6316  0.7097
       Top competitor                  0.8462    0.8571  0.8333  0.8452

5      BioBERT (SQuAD, BioASQ, PMQ)    0.8235    0.8333  0.8125  0.8229
       Top competitor                  0.8529    0.8571  0.8485  0.8528

Table 6. The performance of the proposed system on the list and factoid questions. "Top Competitor" is a top-scoring submission from other teams.

Batch  System            SAcc    LAcc    MRR     Mean Prec.  Recall  F-Measure

3      PA                0.2500  0.4643  0.3137  0.5278      0.4778  0.4585
       Top Competitor    0.3214  0.5357  0.3970  0.7361      0.4833  0.5229

4      PA                0.4706  0.5588  0.5098  0.3571      0.3661  0.3030
       Top Competitor    0.5588  0.7353  0.6284  0.5375      0.5089  0.4571

5      PA                0.4375  0.6250  0.5260  0.3075      0.3214  0.3131
       Top Competitor    0.5625  0.7188  0.6354  0.5516      0.5972  0.5618

    3.4 Summarization

We evaluated our systems in all five batches. However, we were able to experiment with only one system per batch. The results are presented in Table 7. We show how a simple snippet re-ranker can achieve top scores in the automatic evaluation. Meanwhile, the abstractive summarizer, while providing readable and coherent responses, achieves lower, though still very competitive, scores. We hope that the human evaluation will show the opposite result. We include a side-by-side comparison of answers provided by both systems in the appendix (Table 8).

Table 7. The performance of the proposed system on the ideal answers. "Top Competitor" is a top-scoring submission from other teams, chosen by R-SU4 (F1).

Batch  System                  R-2 (Rec)  R-2 (F1)  R-SU4 (Rec)  R-SU4 (F1)

1      Baseline                0.1118     0.1118    0.1116       0.1117
       Top competitor          0.6004     0.3660    0.6035       0.3556

2      Baseline                0.0600     0.0655    0.0615       0.0650
       Top competitor          0.5651     0.3451    0.5725       0.3376

3      Snippet Reranking       0.5235     0.3297    0.5303       0.3256
       Top competitor          0.4980     0.3833    0.5045       0.3811

4      Snippet Reranking       0.5470     0.3087    0.5471       0.3001
       Top competitor          0.5281     0.3069    0.5329       0.2987

5      Abstractive Generation  0.3696     0.3006    0.3688       0.2895
       Top competitor          0.3867     0.3668    0.3805       0.3548

    4 Conclusion and Future Work

In this work, we demonstrate several approaches from our first participation in the BioASQ competition. We propose a working system for each of the evaluation objectives in Task 8B, achieving top positions in several batches.

We designate a few areas for improvement. First, a more robust experimentation process is required. After the challenge, we found several errors in the document and snippet retrieval pipelines that could affect the resulting performance. Second, our system does not utilize semantic indexing, which could be implemented using different biomedical ontologies.

Appendix: Side-by-Side Comparison of Snippet Ranking and Generative Models

Table 8. Examples of answers generated by the snippet ranking and generative question answering systems.

Question:        Sclerostin regulates what process?
Gold:            Sclerostin plays a critical role in bone homeostasis and its deficiency or pharmacological neutralization increases bone formation
Snippet Ranking: Sclerostin regulates release of bone mineral by osteocytes by induction of carbonic anhydrase 2.
Generative:      Sclerostin is a locally acting regulator of late-osteoblast/preosteocyte differentiation and regulates mineralization through a local/paracrine regulator of bone metabolism rather than as an endocrine hormone. The osteocyte product sclostin, a secreted inhibitor of Wnt signaling and plays an essential role in the regulation of bone mass.

Question:        Can CD55 deficiency cause thrombosis?
Gold:            Yes, loss of CD55 is associated with thrombosis in patients with Paroxysmal nocturnal hemoglobinuria. CD55 deficiency with hyperactivation of complement, angiopathic thrombosis, and protein-losing enteropathy (the CHAPLE syndrome) is caused by abnormal complement activation due to biallelic loss-of-function mutations in CD55
Snippet Ranking: CD55 Deficiency, Early-Onset Protein-Losing Enteropathy, and Thrombosis
Generative:      Yes. CD55 deficiency with hyperactivation of complement, angiopathic thrombosis, and protein-losing enteropathy (the CHAPLE syndrome) is caused by abnormal complement activation due to biallelic loss-of-function mutations in CD55

Question:        Can prevnar 13 be used in children?
Gold:            Yes, PCV13 is approved for routine vaccination of all infants as a 4-dose series at age 2, 4, 6, and 12-15 months, for children who previously received 1 or more doses of the 7-valent pneumococcal conjugate vaccine (PCV7), and for children with underlying medical conditions that increase their risk for pneumococcal disease or its complications.
Snippet Ranking: PCV13 is approved for routine vaccination of all infants as a 4-dose series at age 2, 4, 6, and 12-15 months for children who previously received 1 or more doses of the 7-valent pneumococcal conjugate vaccine (PCV7), and for children with underlying medical conditions that increase their risk for pneumococcal disease or its complications.
Generative:      PCV13 is approved for routine vaccination of all infants. PCV 13 is a revision of pneumococcal conjugate vaccine that should be included on pharmacy formularies.

References

[1] Michael Ashburner et al. "Gene Ontology: tool for the unification of biology". In: Nature Genetics 25 (2000), pp. 25-29.
[2] Olivier Bodenreider. "The Unified Medical Language System (UMLS): integrating biomedical terminology". In: Nucleic Acids Research 32 Database issue (2004), pp. D267-70.
[3] George Brokos et al. "AUEB at BioASQ 6: Document and Snippet Retrieval". In: Proceedings of the 6th BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering. Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 30-39. doi: 10.18653/v1/W18-5304. url: https://www.aclweb.org/anthology/W18-5304.
[4] Kathi Canese and Sarah Weis. "PubMed: the bibliographic database". In: The NCBI Handbook [Internet]. 2nd edition. National Center for Biotechnology Information (US), 2013.
[5] Danqi Chen et al. "Reading Wikipedia to Answer Open-Domain Questions". In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2017). doi: 10.18653/v1/p17-1171. url: http://dx.doi.org/10.18653/v1/P17-1171.
[6] Jacob Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: (Oct. 2018). arXiv: 1810.04805. url: http://arxiv.org/abs/1810.04805.
[7] Alexios Gidiotis and Grigorios Tsoumakas. "Structured Summarization of Academic Publications". In: PKDD/ECML Workshops. 2019.
[8] Jiafeng Guo et al. "A Deep Relevance Matching Model for Ad-hoc Retrieval". In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (2016).
[9] Suchin Gururangan et al. "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks". In: ACL. 2020.
[10] Stefan Hosein, Daniel Andor, and Ryan T. McDonald. "Measuring Domain Portability and Error Propagation in Biomedical QA". In: PKDD/ECML Workshops. 2019.
[11] Qiao Jin et al. "PubMedQA: A Dataset for Biomedical Research Question Answering". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019). doi: 10.18653/v1/d19-1259. url: http://dx.doi.org/10.18653/v1/D19-1259.
[12] Zan-Xia Jin et al. "A Multi-strategy Query Processing Approach for Biomedical Question Answering: USTB PRIR at BioASQ 2017 Task 5B". In: BioNLP. 2017.
[13] Martin Krallinger et al. "BioASQ at CLEF2020: Large-Scale Biomedical Semantic Indexing and Question Answering". In: European Conference on Information Retrieval. Springer, 2020, pp. 550-556.
[14] Zhenzhong Lan et al. "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". In: (Sept. 2019). arXiv: 1909.11942. url: http://arxiv.org/abs/1909.11942.
[15] Mike Lewis et al. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension". In: ArXiv abs/1910.13461 (2020).
[16] Chin-Yew Lin. "ROUGE: A Package for Automatic Evaluation of Summaries". In: Text Summarization Branches Out. 2004, pp. 74-81.
[17] Yang Liu and Mirella Lapata. "Text Summarization with Pretrained Encoders". In: EMNLP/IJCNLP. 2019.
[18] Ryan McDonald, George Brokos, and Ion Androutsopoulos. "Deep Relevance Ranking Using Enhanced Document-Query Interactions". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018). doi: 10.18653/v1/d18-1211. url: http://dx.doi.org/10.18653/v1/D18-1211.
[19] Diego Mollá and Christopher Jones. "Classification Betters Regression in Query-Based Multi-document Summarisation Techniques for Question Answering". In: Communications in Computer and Information Science (2020), pp. 624-635. issn: 1865-0937. doi: 10.1007/978-3-030-43887-6_56. url: http://dx.doi.org/10.1007/978-3-030-43887-6_56.
[20] Mark Neumann et al. "ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing". In: Proceedings of the 18th BioNLP Workshop and Shared Task. Florence, Italy: Association for Computational Linguistics, Aug. 2019, pp. 319-327. doi: 10.18653/v1/W19-5034. eprint: arXiv:1902.07669. url: https://www.aclweb.org/anthology/W19-5034.
[21] Rodrigo Nogueira and Kyunghyun Cho. "Passage Re-ranking with BERT". In: arXiv e-prints, arXiv:1901.04085 (Jan. 2019). arXiv: 1901.04085 [cs.IR].
[22] Dimitris Pappas et al. "AUEB at BioASQ 7: Document and Snippet Retrieval". In: Machine Learning and Knowledge Discovery in Databases. Ed. by Peggy Cellier and Kurt Driessens. Cham: Springer International Publishing, 2020, pp. 607-623. isbn: 978-3-030-43887-6.
[23] Mónica Pineda-Vargas et al. "A Mixed Information Source Approach for Biomedical Question Answering: MindLab at BioASQ 7B". In: Machine Learning and Knowledge Discovery in Databases. Ed. by Peggy Cellier and Kurt Driessens. Cham: Springer International Publishing, 2020, pp. 595-606. isbn: 978-3-030-43887-6.
[24] Pranav Rajpurkar, Robin Jia, and Percy Liang. "Know What You Don't Know: Unanswerable Questions for SQuAD". In: arXiv e-prints, arXiv:1806.03822 (June 2018). arXiv: 1806.03822 [cs.CL].
[25] Pranav Rajpurkar et al. "SQuAD: 100,000+ Questions for Machine Comprehension of Text". In: arXiv:1606.05250 (June 2016). arXiv: 1606.05250 [cs.CL]. url: http://arxiv.org/abs/1606.05250.
[26] Abigail See, Peter J. Liu, and Christopher D. Manning. "Get To The Point: Summarization with Pointer-Generator Networks". In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2017). doi: 10.18653/v1/p17-1099. url: http://dx.doi.org/10.18653/v1/P17-1099.
[27] George Tsatsaronis et al. "An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition". In: BMC Bioinformatics 16 (Apr. 2015), p. 138. doi: 10.1186/s12859-015-0564-6.
[28] Ashish Vaswani et al. "Attention is All you Need". In: ArXiv abs/1706.03762 (2017).
[29] Ellen M. Voorhees. "The TREC question answering track". In: Natural Language Engineering 7.4 (2001), p. 361.
[30] Georg Wiese, Dirk Weissenborn, and Mariana Neves. "Neural Question Answering at BioASQ 5B". In: BioNLP 2017 (2017). doi: 10.18653/v1/w17-2309. url: http://dx.doi.org/10.18653/v1/W17-2309.
[31] Peter Willett. "The Porter stemming algorithm: then and now". In: Program (2006).
[32] Peilin Yang, Hui Fang, and Jimmy Lin. "Anserini: Reproducible Ranking Baselines Using Lucene". In: J. Data and Information Quality 10.4 (Oct. 2018). issn: 1936-1955. doi: 10.1145/3239571. url: https://doi.org/10.1145/3239571.
[33] Wei Yang et al. "End-to-End Open-Domain Question Answering with BERTserini". In: arXiv preprint arXiv:1902.01718 (2019).
[34] Wonjin Yoon et al. "Pre-trained Language Model for Biomedical Question Answering". In: arXiv:1909.08229 (Sept. 2019). arXiv: 1909.08229 [cs.CL]. url: http://arxiv.org/abs/1909.08229.

