
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

Digital Object Identifier 10.1109/ACCESS.2017.DOI

Incorporating Domain Knowledge into Natural Language Inference on Clinical Texts

MINGMING LU¹, YU FANG¹, FENGQI YAN¹, AND MAOZHEN LI²

¹ Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
² Department of Electronic and Computer Engineering, Brunel University London, Uxbridge, UB8 3PH, UK

Corresponding author: Yu Fang (e-mail: [email protected]).

This work was supported by the Fundamental Research Funds for the Central Universities under Grant 22120180117.

ABSTRACT Making inferences on clinical texts is a task that has not been fully studied. The newly released, expert-annotated MedNLI dataset has given this task a boost. Compared to open-domain data, clinical texts present unique linguistic phenomena, e.g., a large number of medical terms and abbreviations and different written forms for the same medical concept, which make inference much harder. Incorporating domain-specific knowledge is one way to alleviate this problem. In this paper, we add a new Incorporating Medical Concept Definitions module to the classic enhanced sequential inference model (ESIM). The module first extracts the most relevant medical concept for each word, if one exists, then encodes the definition of this medical concept with a bidirectional long short-term memory network (BiLSTM) to obtain domain-specific definition representations, and finally attends the vanilla word embeddings to these definition representations. Empirical evaluations demonstrate that our model improves prediction performance and achieves a high level of accuracy on the MedNLI dataset. In particular, the knowledge-enhanced word representations contribute significantly to the entailment class.

INDEX TERMS Attention mechanism, clinical text, medical domain knowledge, natural language inference, word representation.

I. INTRODUCTION

Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is a task concerning the semantic relationship (entailment, contradiction, or neutral) between a premise and a hypothesis [1]. In recent years, large-scale annotated datasets, represented by the Stanford Natural Language Inference (SNLI) corpus [2] and the Multi-Genre Natural Language Inference (MultiNLI) corpus [3], have been made publicly available, which has pushed the development of this task forward. In addition, many deep neural network models have been proposed to achieve state-of-the-art performance [4]–[6].

In the clinical domain, the newly released MedNLI dataset [7] focuses on the NLI task on clinical texts. Owing to the specialized nature of this domain, clinical texts present linguistic phenomena that differ from open-domain data: (1) the large number of medical terms and abbreviations leads to the out-of-vocabulary (OOV) issue; (2) a medical concept has different written forms in different vocabularies, even though they have the same meaning. Table 1 shows some examples from the MedNLI dataset for illustration. The key words in Example #1 are diaphoresis and sweats, which express the same medical concept but are written in different forms. Examples #2 and #3 contain medical terms (lumbar puncture and coronary artery bypass grafting), as well as standard medical abbreviations (LP and STEMI) and non-standard shorthand (pt, meaning patient). If a system cannot understand these medical terms and abbreviations correctly, it will misclassify the examples. In general, these linguistic phenomena make inference on MedNLI much harder.

Since processing clinical texts requires domain-specific knowledge, in this paper we incorporate such knowledge into the classic open-domain model (ESIM) by encoding the definitions of medical concepts with a bidirectional LSTM [8] (BiLSTM) and attending the vanilla word embeddings to these domain-specific representations. In this way, computers are taught to, on one hand, learn the

VOLUME 4, 2016 1

M. Lu et al.: Incorporating Domain Knowledge into Natural Language Inference on Clinical Texts

TABLE 1. Examples from the MedNLI dataset. P, H and L stand for premise, hypothesis and label, respectively. Domain-specific words for inference are in italics. LP is the abbreviation for lumbar puncture, and STEMI stands for ST segment elevation myocardial infarction.

Example #1
P: It was also associated with diaphoresis.
H: The patient has sweats.
L: entailment

Example #2
P: The pt was transfered to have an LP with neurosurg backup.
H: The patient has no neurological symptoms, or indication for lumbar puncture
L: contradiction

Example #3
P: He presented preoperatively for coronary artery bypass grafting.
H: Patient has had a STEMI
L: neutral

meanings of medical terms and abbreviations and, on the other hand, identify similarities and differences between medical concepts. We conduct experiments on the MedNLI dataset, and the results show that our model outperforms all baselines reported by Romanov and Shivade [7], achieving state-of-the-art performance. In addition, we present an ablation study and a case study to learn how domain knowledge contributes to our model.

Our work has three main contributions:

• We propose a knowledge-enhanced model for natural language inference on clinical texts, which combines a BiLSTM and attention to enhance vanilla word embeddings with definitions of medical concepts.

• We study the effectiveness of our model on the MedNLI dataset and achieve a higher level of accuracy than models without knowledge enhancement.

• Our ablation study and case study reveal some useful insights into the contributions of knowledge-enhanced word representations.

The rest of this paper is organized as follows. Section II reviews the related work on natural language inference. Section III details the design of the proposed model. Sections IV and V present and discuss the experimental settings and results, respectively. Finally, we conclude in Section VI.

II. RELATED WORK

There are two types of approaches to the natural language inference task: encoding-based models and interaction-based models [9]. Encoding-based models [2], [4], [10], [11] use a Siamese architecture [12] to learn vector representations of the premise and hypothesis, and then determine the semantic relationship between the two sentences with a neural network classifier. One representative model is InferSent [4], which is one of the baseline models for the MedNLI dataset.

Interaction-based models [5], [13], [14] utilize some form of word alignment mechanism, e.g., attention [15], and then aggregate inter-sentence interactions. As shown in the SemEval-2016 task on interpretable semantic textual similarity [16], the semantic relations of aligned chunks contribute a lot to sentence pair modeling, so interaction-based models perform better than encoding-based models. Chen et al. [5] proposed an enhanced sequential inference model (ESIM), which contains three main components, i.e., input encoding, co-attention matching, and inference composition. ESIM is another baseline model for the MedNLI dataset.

Unlike previous work [6] that enriches NLI models with lexical-level semantic knowledge about synonymy, antonymy, hypernymy, hyponymy and co-hyponymy between words, we focus on the medical domain and explore the incorporation of extra knowledge for natural language inference on clinical texts. Romanov and Shivade [7] also studied two ways of incorporating domain-specific knowledge into their baseline models. In the first, they modified pre-trained word embeddings by retrofitting [17], so that the input to the models could carry clinical information. However, this approach only degrades the performance, because retrofitting works only on directly related concepts, while medical concepts are more complex and medical inferences require more steps of reasoning. The second is knowledge-directed attention, which is beneficial to the InferSent and ESIM models. Our model is similar to the first approach in that it modifies the model's inputs, but we utilize definition representations to enhance the word embeddings of medical terms and abbreviations, alleviating the OOV issue and bridging the semantic gap between different written forms of a medical concept.

III. MODEL DESIGN

In this section, we explain the NLI task and describe our domain knowledge, i.e., definitions of medical concepts. Then, we study how to incorporate these definitions into the ESIM model for natural language inference on clinical texts.

A. PROBLEM DEFINITION

Given the MedNLI dataset $D$, an example of the dataset can be represented as a triplet $(p, h, y)$ consisting of a premise $p$, a hypothesis $h$, and a ground-truth label $y$. Specifically, the premise is represented as $p = \{a_i\}_{i=1}^{M}$ and the hypothesis as $h = \{b_j\}_{j=1}^{N}$, where $M$ and $N$ are the lengths of the sentences. $y \in \{0, 1, 2\}$ is the label of the given triplet, which takes a value of 0 if the premise entails the hypothesis (entailment), 1 if they contradict each other (contradiction), and 2 if they are unrelated (neutral). Our goal is to learn a predictive distribution $p(y \mid p, h; \theta)$ parameterized by $\theta$ from $D$. That is, given a premise $p$ and a hypothesis $h$, we would like to infer the probability that the pair will be classified as entailment, contradiction, or neutral.

FIGURE 1. An overview of our model. Similar to the ESIM model, our model consists of three layers, i.e., input encoding layer, co-attention matching layer, and inference composition layer. The difference is that we incorporate medical concept definitions in the first layer. $\{a_i\}_{i=1}^{M}$, $\{b_j\}_{j=1}^{N}$, and $\{c_{\cdot,t}\}_{t=1}^{T}$ are the inputs to the model, representing the premise sentence, the hypothesis sentence, and the definitions of the medical concepts extracted from the two sentences, respectively. $y$ is the output. ⊕ means concatenation of vectors.

TABLE 2. Some examples of medical concepts and their definitions from UMLS.

| Word | Medical Concept | Definition |
|---|---|---|
| diaphoresis | Increased sweating | Profuse sweating. |
| LP | Spinal Puncture | Tapping fluid from the subarachnoid space in the lumbar region, usually between the third and fourth lumbar vertebrae. |
| STEMI | ST segment elevation myocardial infarction | A clinical syndrome defined by MYOCARDIAL ISCHEMIA symptoms; persistent elevation in the ST segments of the ELECTROCARDIOGRAM; and release of BIOMARKERS of myocardial NECROSIS (e.g., elevated TROPONIN levels). |

B. DOMAIN KNOWLEDGE

First, we collect the definitions of medical concepts from the Unified Medical Language System (UMLS) [18]. In UMLS, a medical concept may have multiple definitions coming from different source vocabularies. To simplify the model, we choose the shortest one as the only definition of this medical concept. In the end, we collect a total of 198,042 definitions that make up our domain knowledge base, denoted as $K$.
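The shortest-definition rule can be sketched as follows. This is a minimal illustration, not the authors' code: the concept IDs, the second definition, and the choice to measure length in characters are assumptions made here for the example.

```python
def build_knowledge_base(rows):
    """Map each concept ID to its shortest non-empty definition.

    rows: iterable of (concept_id, definition) pairs, possibly with several
    definitions per concept coming from different source vocabularies.
    """
    kb = {}
    for concept_id, definition in rows:
        text = definition.strip()
        if not text:
            continue  # skip empty definitions
        # Assumption: "shortest" is measured in characters, not tokens.
        if concept_id not in kb or len(text) < len(kb[concept_id]):
            kb[concept_id] = text
    return kb


# Toy rows; the IDs and the second definition are made up for illustration.
rows = [
    ("C_SWEATING", "Profuse sweating."),
    ("C_SWEATING", "Excessive production of sweat by the sweat glands."),
    ("C_SPINAL_PUNCTURE", "Tapping fluid from the subarachnoid space in the lumbar region."),
]
kb = build_knowledge_base(rows)
print(kb["C_SWEATING"])
```

Keeping one definition per concept makes the later lookup a plain dictionary access, at the cost of discarding the information in the longer definitions.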

Second, following the previous work [7], we use MetaMap [19] to extract medical concepts from the premise and hypothesis sentences and map them to standard terminologies in the UMLS. For each extracted phrase, there may be more than one related concept; these are sorted by MetaMap Indexing (MMI) score. The higher the score, the greater the relevance of the medical concept to the extracted phrase. In this paper, we only consider the concept with the highest score for each word and discard those with lower scores. As a result, every word has zero or one corresponding medical concept. In this way, we know exactly which concept a medical term or abbreviation stands for, and different written forms can be mapped to the same concept. Finally, we associate words with concept definitions. For example, if a medical concept is extracted for a word $a_i$ in the premise sentence, we search our domain knowledge base $K$ for its definition. We denote the definition associated with word $a_i$ as $\{c_{i,t}\}_{t=1}^{T}$. Table 2 shows the extracted medical concepts of some domain-specific words of Examples #1 to #3, and their definitions from UMLS.
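The selection step above (keep only the top-scoring concept per word, then look up its definition) can be sketched in a few lines. The candidate tuples stand in for already-parsed MetaMap output; the scores, the competing concept for LP, and the helper names are hypothetical and only serve the illustration.

```python
def best_concept_per_word(candidates):
    """Keep the highest-scoring concept for each word.

    candidates: iterable of (word, concept, score) tuples, where a higher
    score means the concept is more relevant to the word.
    """
    best = {}
    for word, concept, score in candidates:
        if word not in best or score > best[word][1]:
            best[word] = (concept, score)
    return {word: concept for word, (concept, _score) in best.items()}


def attach_definitions(tokens, word_to_concept, kb):
    """Return one definition (or None) per token, so every word has zero or one."""
    definitions = []
    for token in tokens:
        concept = word_to_concept.get(token)
        definitions.append(kb.get(concept) if concept is not None else None)
    return definitions


# Hypothetical parsed candidates; scores are invented for the example.
candidates = [
    ("LP", "Spinal Puncture", 14.6),
    ("LP", "Lipoproteins", 3.5),
    ("diaphoresis", "Increased sweating", 9.2),
]
kb = {
    "Spinal Puncture": "Tapping fluid from the subarachnoid space in the lumbar region.",
    "Increased sweating": "Profuse sweating.",
}
word_to_concept = best_concept_per_word(candidates)
definitions = attach_definitions(["The", "pt", "had", "an", "LP"], word_to_concept, kb)
print(definitions[-1])
```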

C. MODEL OVERVIEW

We present here our model for natural language inference on clinical texts. It consists of three layers: the input encoding layer, the co-attention matching layer, and the inference composition layer. Fig. 1 shows an overview of our model.

The model takes the premise sentence, the hypothesis sentence, and the definitions of the medical concepts extracted from the two sentences as inputs, and first constructs the respective word representations with pre-trained word embeddings. These pre-trained word embeddings can be either publicly available open-domain word embeddings or embeddings trained on a domain-specific corpus. Then, each word in the two sentences is attended over its corresponding definition, if one exists, which is done by the Incorporating Medical Concept Definitions module. Finally, the enhanced word embeddings are fed into a parameter-shared BiLSTM to obtain a set of contextualized representations of the premise and hypothesis sentences.

In the co-attention matching layer, we use soft alignment of the contextualized word representations between the premise and hypothesis to obtain aligned representations, followed by a heuristic matching approach [20] to collect local inference vectors for each word. Finally, to determine the overall inference relationship between the premise and hypothesis, another BiLSTM is utilized to compose the collected local inference vectors, which is part of the inference composition layer. The output hidden vectors of the second BiLSTM are converted to fixed-length vectors with max and mean pooling operations and fed into the final multi-layer perceptron (MLP) classifier to determine the inference class.

Details about each layer and the Incorporating Medical Concept Definitions module are provided in the following sections.

D. INPUT ENCODING LAYER

The input encoding layer takes as inputs the premise $\{a_i\}_{i=1}^{M}$, the hypothesis $\{b_j\}_{j=1}^{N}$, and the associated medical concept definitions $\{c_{\cdot,t}\}_{t=1}^{T}$, where $\cdot$ can be replaced with $i$ or $j$. Pre-trained word embeddings $E \in \mathbb{R}^{d_e \times |V|}$ are first used to convert the word inputs to vector sequences $[a^e_1, \ldots, a^e_M]$, $[b^e_1, \ldots, b^e_N]$, and $[c^e_{\cdot,1}, \ldots, c^e_{\cdot,T}]$, where $|V|$ is the vocabulary size and $d_e$ is the dimension of the word embedding. In the experiments, we explore six different word embeddings: one publicly available open-domain word embedding, two trained on domain-specific corpora, and three initialized with open-domain word embeddings and further fine-tuned on one or two domain-specific corpora:

• GloVe[CC]: GloVe embeddings [21], trained on Common Crawl.

• fastText[BioASQ]: fastText embeddings [22], trained on PubMed abstracts from the BioASQ challenge [23].

• fastText[MIMIC-III]: fastText embeddings, trained on patient clinical notes from the MIMIC-III database [24].

• GloVe[CC] → fastText[BioASQ]: GloVe embeddings for initialization, further fine-tuned on the BioASQ data.

• GloVe[CC] → fastText[BioASQ] → fastText[MIMIC-III]: GloVe embeddings for initialization, further fine-tuned on the BioASQ and MIMIC-III data in succession.

• fastText[Wiki] → fastText[MIMIC-III]: fastText Wikipedia embeddings for initialization, further fine-tuned on the MIMIC-III data.

All of the domain-specific word embeddings are downloaded from the MedNLI dataset website (see footnote 1).

1) Incorporating Medical Concept Definitions

Inspired by the work of [25] and [26], we incorporate medical concept definitions into word embeddings, as shown in Fig. 2.

Footnote 1: https://jgc128.github.io/mednli/

FIGURE 2. An illustration of incorporating medical concept definition embeddings $\{c^e_{i,t}\}_{t=1}^{T}$ into the vanilla word embedding $a^e_i$ of one medical term or abbreviation in the premise. From the output, we get the enhanced word representation $\tilde{a}^e_i$.

The bidirectional long short-term memory (BiLSTM) network has been proven to be good at modeling dependencies coming from both the past and the future in sequences, so we employ it to encode definition embeddings in the forward and backward directions. Taking Fig. 2 as an example, $c^e_t$ is the input to the BiLSTM at time step $t$. To simplify notation, we omit the subscript $i$ in this section. The hidden states in the forward direction are updated as follows:

$$i_t = \sigma(W^i c^e_t + U^i \overrightarrow{h}_{t-1} + b^i) \tag{1}$$

$$f_t = \sigma(W^f c^e_t + U^f \overrightarrow{h}_{t-1} + b^f) \tag{2}$$

$$o_t = \sigma(W^o c^e_t + U^o \overrightarrow{h}_{t-1} + b^o) \tag{3}$$

$$q_t = \tanh(W^q c^e_t + U^q \overrightarrow{h}_{t-1} + b^q) \tag{4}$$

$$p_t = f_t \circ p_{t-1} + i_t \circ q_t \tag{5}$$

$$\overrightarrow{h}_t = o_t \circ \tanh(p_t) \tag{6}$$

where $i_t$, $f_t$, $o_t$ are the input gate, forget gate and output gate of the LSTM, respectively, $\sigma$ is the sigmoid function, and $p_t$ is the cell state. Accordingly, in the forward direction, the hidden state $\overrightarrow{h}_t$ at time step $t$ depends on the input word and the preceding hidden state $\overrightarrow{h}_{t-1}$. Similarly, in the backward direction, the hidden state $\overleftarrow{h}_t$ is updated based on the current input and the hidden state from the next time step. At the $t$-th time step, the output of the BiLSTM is obtained by concatenating the hidden states from both directions, formally $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$. The above process can be summarized as a BiLSTM function:

$$h_1, \ldots, h_T = \mathrm{BiLSTM}(c^e_1, \ldots, c^e_T) \tag{7}$$

To obtain definition-enhanced word embeddings, we utilize a multi-layer perceptron attention mechanism [15] to aggregate the outputs of the BiLSTM and then add the result to the vanilla word embedding. In particular, attention first computes the alignment score between $h_t$ and $a^e$ by a function $f(h_t, a^e)$:

$$f(h_t, a^e) = v^{\top} \sigma(W^h h_t + W^e a^e) \tag{8}$$

where $W^h$, $W^e$ are weight matrices and $v$ is a weight vector. This alignment score measures the attention of $a^e$ to $h_t$. Subsequently, a softmax function normalizes the alignment scores to form a vector $z \in \mathbb{R}^{T}$:

$$z_t = \frac{\exp(f(h_t, a^e))}{\sum_{t'=1}^{T} \exp(f(h_{t'}, a^e))} \tag{9}$$

Here, $z_t$ indicates the importance of $h_t$ to $a^e$. The output of attention is therefore a weighted sum of $\{h_t\}_{t=1}^{T}$, where the weights are given by $z$.

By adding the output of attention to the vanilla word embedding, we obtain the definition-enhanced word embedding $\tilde{a}^e$ in the premise:

$$\tilde{a}^e = \sum_{t=1}^{T} z_t h_t + a^e \tag{10}$$

The above approach of incorporating medical concept definitions also applies to the hypothesis.
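As a rough illustration of (8) to (10), the sketch below scores each BiLSTM output against the vanilla embedding, normalizes the scores with a softmax, and adds the weighted sum back to the embedding. It is plain Python with toy numbers, not the authors' implementation; the helper names and the identity weight matrices are assumptions, and the residual sum in (10) presumes that $h_t$ and $a^e$ have the same dimension.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def enhance_embedding(H, a, W_h, W_e, v):
    """H: T definition states h_t; a: vanilla embedding. Returns (enhanced, z)."""
    wa = matvec(W_e, a)
    scores = []
    for h in H:
        wh = matvec(W_h, h)
        hidden = [sigmoid(p + q) for p, q in zip(wh, wa)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))  # alignment score, Eq. (8)
    z = softmax(scores)  # attention weights, Eq. (9)
    context = [sum(z[t] * H[t][k] for t in range(len(H))) for k in range(len(a))]
    return [c + ai for c, ai in zip(context, a)], z  # residual sum, Eq. (10)


# Toy example with T = 2 definition states of dimension 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5]
enhanced, z = enhance_embedding(H, a, identity, identity, [1.0, 1.0])
print(z, enhanced)
```

Because the definition summary is added to, rather than substituted for, the vanilla embedding, a word without a definition can simply keep its original vector.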

2) Sentence Encoding

To represent words in their context, the enhanced word embeddings of the premise and hypothesis are fed into a parameter-shared BiLSTM to obtain contextualized representations $a^s$ and $b^s$:

$$a^s_1, \ldots, a^s_M = \mathrm{BiLSTM}_1(\tilde{a}^e_1, \ldots, \tilde{a}^e_M) \tag{11}$$

$$b^s_1, \ldots, b^s_N = \mathrm{BiLSTM}_1(\tilde{b}^e_1, \ldots, \tilde{b}^e_N) \tag{12}$$

E. CO-ATTENTION MATCHING LAYER

Modeling the interactions is the critical component for deciding the inference relationship between the premise and hypothesis. In this layer, a co-attention matrix is computed using the dot product to produce aligned word representations, and then, by comparing them with the contextualized representations, we collect matching information at the word level.

First, the co-attention score between each representation tuple $(a^s_i, b^s_j)$ is calculated as follows:

$$e_{ij} = (a^s_i)^{\top} b^s_j \tag{13}$$

Then, for the $i$-th word in the premise, its relevant representation carried by the hypothesis is identified and composed using $e_{ij}$ as

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'=1}^{N} \exp(e_{ij'})} \tag{14}$$

$$a^c_i = \sum_{j=1}^{N} \alpha_{ij} b^s_j \tag{15}$$

where $\alpha \in \mathbb{R}^{M \times N}$ is the co-attention matrix normalized w.r.t. the column axis, and $a^c_i$ is a weighted sum of $\{b^s_j\}_{j=1}^{N}$, meaning that the contents related to $a^s_i$ are selected to form $a^c_i$. The same calculation is performed for each word in the hypothesis as

$$\beta_{ij} = \frac{\exp(e_{ij})}{\sum_{i'=1}^{M} \exp(e_{i'j})} \tag{16}$$

$$b^c_j = \sum_{i=1}^{M} \beta_{ij} a^s_i \tag{17}$$

where $\beta \in \mathbb{R}^{M \times N}$ is the co-attention matrix normalized w.r.t. the row axis. We denote $a^c_i$ and $b^c_j$ as aligned word representations.
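The two normalizations in (14) and (16) are easy to mix up, so a small sketch of (13) to (17) may help. It uses plain Python lists and toy vectors; the function names are hypothetical and the code is an illustration under these assumptions, not the authors' implementation.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def co_attention(A, B):
    """A: M premise vectors a^s_i; B: N hypothesis vectors b^s_j."""
    M, N, dim = len(A), len(B), len(A[0])
    E = [[dot(a, b) for b in B] for a in A]  # scores e_ij, Eq. (13)
    # alpha[i] sums to 1 over hypothesis positions j, Eq. (14).
    alpha = [softmax(row) for row in E]
    # beta[j] sums to 1 over premise positions i, Eq. (16); stored per hypothesis word.
    beta = [softmax([E[i][j] for i in range(M)]) for j in range(N)]
    A_c = [[sum(alpha[i][j] * B[j][k] for j in range(N)) for k in range(dim)]
           for i in range(M)]  # aligned premise words, Eq. (15)
    B_c = [[sum(beta[j][i] * A[i][k] for i in range(M)) for k in range(dim)]
           for j in range(N)]  # aligned hypothesis words, Eq. (17)
    return A_c, B_c, alpha, beta


# Toy example: M = 3 premise words, N = 2 hypothesis words, dimension 2.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [[2.0, 0.0], [0.0, 2.0]]
A_c, B_c, alpha, beta = co_attention(A, B)
print(alpha[0])
```

In this toy case the first premise word attends mostly to the first hypothesis word, since their dot product is the largest in that row.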

To further enhance the inference information, following the heuristic matching approach proposed by Mou et al. [20], we concatenate the contextualized and aligned word representations with their difference and element-wise product, resulting in local inference vectors. Formally, the local inference vectors $a^m_i$ and $b^m_j$ are calculated as follows:

$$a^m_i = G([a^s_i; a^c_i; a^s_i - a^c_i; a^s_i \circ a^c_i]) \tag{18}$$

$$b^m_j = G([b^s_j; b^c_j; b^s_j - b^c_j; b^s_j \circ b^c_j]) \tag{19}$$

where $G$ is a one-layer feed-forward neural network with the ReLU [27] activation function, used to reduce dimensionality.
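A small sketch of (18) for a single word, assuming toy two-dimensional vectors: the four parts are concatenated and then passed through a one-layer ReLU projection standing in for $G$. The weights below are arbitrary placeholders chosen for the example, not learned values.

```python
def matching_features(s, c):
    """Concatenate [s; c; s - c; s * c] for one word, as in Eq. (18)/(19)."""
    diff = [si - ci for si, ci in zip(s, c)]
    prod = [si * ci for si, ci in zip(s, c)]
    return s + c + diff + prod


def relu_projection(x, W, b):
    """One-layer feed-forward network with ReLU, standing in for G."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


s = [1.0, 2.0]   # contextualized representation a^s_i (toy values)
c = [0.5, -1.0]  # aligned representation a^c_i (toy values)
features = matching_features(s, c)  # length 4 * 2 = 8

# Placeholder weights projecting 8 dimensions down to 2.
W = [[1.0] * 8, [-1.0] * 8]
b = [0.0, 0.0]
local_inference = relu_projection(features, W, b)
print(features, local_inference)
```

The concatenation quadruples the dimensionality, which is why a projection such as $G$ is applied before the next BiLSTM.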

F. INFERENCE COMPOSITION LAYER

In this layer, a parameter-shared BiLSTM followed by max and mean pooling operations is employed as the aggregation method to compose the local inference vectors collected above:

$$a^v_1, \ldots, a^v_M = \mathrm{BiLSTM}_2(a^m_1, \ldots, a^m_M) \tag{20}$$

$$b^v_1, \ldots, b^v_N = \mathrm{BiLSTM}_2(b^m_1, \ldots, b^m_N) \tag{21}$$

$$a^v_{\max} = \max_{1 \le i \le M} a^v_i \tag{22}$$

$$a^v_{\mathrm{mean}} = \operatorname*{mean}_{1 \le i \le M} a^v_i \tag{23}$$

$$b^v_{\max} = \max_{1 \le j \le N} b^v_j \tag{24}$$

$$b^v_{\mathrm{mean}} = \operatorname*{mean}_{1 \le j \le N} b^v_j \tag{25}$$

Again we use a BiLSTM here, but its role is completely different from that presented in Section III-D2. The BiLSTM here learns to discriminate critical local inference vectors for obtaining the overall sentence-level inference relationship between the premise and hypothesis. The pooled vectors are concatenated and fed into the final multi-layer perceptron (MLP) classifier, which has one hidden layer with tanh activation and a softmax output layer:

$$y = \mathrm{MLP}([a^v_{\max}; a^v_{\mathrm{mean}}; b^v_{\max}; b^v_{\mathrm{mean}}]) \tag{26}$$

The entire model is trained in an end-to-end manner by minimizing the cross-entropy loss.
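The pooling and classification steps in (22) to (26) can be sketched as below. This is an illustration with toy vectors and arbitrary placeholder weights rather than the authors' implementation; the class order (0 = entailment, 1 = contradiction, 2 = neutral) follows Section III-A.

```python
import math


def max_pool(vectors):
    """Element-wise maximum over a sequence of vectors."""
    return [max(col) for col in zip(*vectors)]


def mean_pool(vectors):
    """Element-wise mean over a sequence of vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]


def linear(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, W1, b1, W2, b2):
    """One tanh hidden layer followed by a softmax output layer, as in Eq. (26)."""
    hidden = [math.tanh(u) for u in linear(W1, b1, features)]
    return softmax(linear(W2, b2, hidden))


# Toy BiLSTM outputs: 2 premise positions and 1 hypothesis position, dimension 2.
A_v = [[1.0, -2.0], [3.0, 0.0]]
B_v = [[0.0, 4.0]]
features = max_pool(A_v) + mean_pool(A_v) + max_pool(B_v) + mean_pool(B_v)

# Placeholder weights (not learned): 8 -> 2 hidden units -> 3 classes.
W1, b1 = [[0.1] * 8, [-0.1] * 8], [0.0, 0.0]
W2, b2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0]
probs = classify(features, W1, b1, W2, b2)
print(features, probs)
```

Pooling turns sequences of different lengths $M$ and $N$ into vectors of a fixed size, so the classifier input does not depend on sentence length.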

IV. EXPERIMENTS

In this section, we first briefly introduce the MedNLI dataset, a newly released dataset for natural language inference on clinical texts, followed by detailed training settings.


TABLE 3. Accuracies of our model (ESIM w/ Knowledge) compared to baselines using different word embeddings on MedNLI. Baseline results are directly copied from Romanov and Shivade [7].

| Word Embeddings | InferSent Baselines | ESIM Baselines | ESIM w/ Knowledge |
|---|---|---|---|
| GloVe[CC] | 0.735 | 0.731 | 0.742 |
| fastText[BioASQ] | 0.741 | 0.733 | 0.753 |
| fastText[MIMIC-III] | 0.758 | 0.743 | 0.778 |
| GloVe[CC] → fastText[BioASQ] | 0.742 | 0.745 | 0.765 |
| GloVe[CC] → fastText[BioASQ] → fastText[MIMIC-III] | 0.762 | 0.749 | 0.776 |
| fastText[Wiki] → fastText[MIMIC-III] | 0.766 | 0.748 | 0.771 |

A. MEDNLI DATASET

We evaluated our model on the MedNLI dataset [7], which contains 13k expert-annotated sentence pairs. The premise sentences were drawn from clinical notes contained in the MIMIC-III v1.3 database [24], and the hypothesis sentences were generated by four clinicians. We use the same data split as provided by Romanov and Shivade [7] and classification accuracy as the evaluation metric.

B. TRAINING DETAILS

Following the settings of all baselines on the MedNLI dataset, we set the dimension of the word embeddings and of the hidden states of the BiLSTMs to 300, except for the BiLSTM in the Incorporating Medical Concept Definitions module, whose hidden size was 150. We restricted the premise and hypothesis sentences to a maximum of 50 words, and medical concept definitions to a maximum of 200 words. All word embeddings were fixed during training. Adam [28] was used for optimization with an initial learning rate of 0.001. The mini-batch size was set to 64. We set a dropout rate of 0.5 for the input and output of the hidden layer of the final MLP classifier. We also used variational dropout [29] on the inputs of the BiLSTMs, with a rate that was also set to 0.5. We trained our model for a maximum of 20 epochs. Training was stopped when the development loss had not decreased for 5 consecutive epochs.
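The stopping rule can be sketched as a patience counter over the per-epoch development loss. The function below is a hypothetical stand-in that replays a list of losses instead of training a model, and it reads the rule as "stop once 5 epochs in a row fail to improve on the best loss so far", which is an interpretation rather than a detail stated in the text.

```python
def run_early_stopping(dev_losses, max_epochs=20, patience=5):
    """Replay per-epoch development losses and apply the stopping rule.

    Returns (best_epoch, last_epoch_run); epochs are numbered from 1.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(dev_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, last_epoch


# Invented loss curve: the best value is reached at epoch 3.
losses = [0.90, 0.70, 0.60, 0.65, 0.66, 0.61, 0.70, 0.80, 0.50]
print(run_early_stopping(losses))
```

With this curve, training halts at epoch 8 and never sees the lower loss at epoch 9, which shows how a short patience can stop a run before it has fully converged.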

All hyper-parameters were strictly selected on the development set, and the resulting model was then evaluated on the corresponding test set. We used PyTorch (see footnote 2) and AllenNLP (see footnote 3) to implement our model.

V. RESULTS

In this section, we analyze the performance of our model from three aspects. First, we compare our model with the baseline models for different word embeddings. Then, an ablation study and a case study are conducted to inspect how domain knowledge contributes to the model.

A. COMPARISON AGAINST BASELINES

We compare our model, referred to as ESIM w/ Knowledge, against the InferSent and ESIM baseline models tested by Romanov and Shivade [7] for the six different word embeddings listed in Section III-D. The results are reported in Table 3. Our model outperforms all baseline models and achieves state-of-the-art performance, indicating that incorporating medical concept definitions can significantly improve the performance. Compared to the best baseline (i.e., InferSent using the fastText[Wiki] → fastText[MIMIC-III] embedding), we observed an absolute gain of 0.012, corresponding to a 1.6% relative gain, with the model using the fastText[MIMIC-III] embedding. In fact, three of our results for different word embeddings (the others being the GloVe[CC] → fastText[BioASQ] → fastText[MIMIC-III] embedding and the fastText[Wiki] → fastText[MIMIC-III] embedding) exceed the best baseline.

Footnote 2: https://pytorch.org/
Footnote 3: https://allennlp.org/

FIGURE 3. Confusion matrix without normalization: (a) InferSent baseline using the fastText[Wiki] → fastText[MIMIC-III] embedding (see footnote 4); (b) ESIM w/ Knowledge using the fastText[MIMIC-III] embedding.

Among the baseline models, all results of InferSent except one are better than those of ESIM. However, for each word embedding, our result exceeds both baselines, demonstrating the effectiveness of ESIM integrated with domain knowledge. The greatest gain of our model is for the GloVe[CC] → fastText[BioASQ] embedding (0.765 compared to 0.745), where we obtain an absolute gain of 0.02 and a relative gain of 2.7%.
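The gains quoted above follow directly from Table 3; the short script below reproduces the arithmetic. The dictionary simply transcribes the table, and the helper name `gain` is introduced here for illustration.

```python
# Accuracies from Table 3: (InferSent baseline, ESIM baseline, ESIM w/ Knowledge).
table3 = {
    "GloVe[CC]": (0.735, 0.731, 0.742),
    "fastText[BioASQ]": (0.741, 0.733, 0.753),
    "fastText[MIMIC-III]": (0.758, 0.743, 0.778),
    "GloVe[CC] -> fastText[BioASQ]": (0.742, 0.745, 0.765),
    "GloVe[CC] -> fastText[BioASQ] -> fastText[MIMIC-III]": (0.762, 0.749, 0.776),
    "fastText[Wiki] -> fastText[MIMIC-III]": (0.766, 0.748, 0.771),
}


def gain(ours, reference):
    """Return (absolute gain, relative gain in percent), rounded for reporting."""
    absolute = ours - reference
    return round(absolute, 3), round(100 * absolute / reference, 1)


best_baseline = max(max(infersent, esim) for infersent, esim, _ in table3.values())
best_ours = max(ours for _, _, ours in table3.values())

overall = gain(best_ours, best_baseline)  # best model vs. best baseline
glove_bioasq = gain(0.765, 0.745)         # per-embedding gain over the better baseline
above_best = [name for name, (_, _, ours) in table3.items() if ours > best_baseline]

print(overall, glove_bioasq, len(above_best))
```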

Footnote 4: Results were predicted by model parameters released by Romanov and Shivade [7], which only obtained an accuracy of 0.759, different from the accuracy of 0.766 stated in the paper.

TABLE 4. Ablation study using the fastText[MIMIC-III] embedding. For each entry in the table, the accuracies on the development and test sets are separated by a slash, and the number in parentheses is the best training epoch.

| | LSTM | BiLSTM |
|---|---|---|
| w/o Attention | 0.778 / 0.763 (10) | 0.767 / 0.770 (9) (a) |
| w/ Attention | 0.776 / 0.768 (8) | 0.779 / 0.778 (12) |

(a) This is a group of amended values; the original values were 0.759 / 0.759 (6).

FIGURE 4. Loss and accuracy curve of the development and test set using the fastText[MIMIC-III] embedding.

Besides comparing the overall performance, we also draw confusion matrices to visualize the classification results for the three classes (entailment, contradiction and neutral). Fig. 3 shows two confusion matrices without normalization: the left one belongs to the best baseline (see footnote 4) and the right one belongs to the best result of our model. By comparing these two confusion matrices, the following conclusions can be drawn:

(1) Our model improves the performance in the entailment and neutral classes. It contributes most to the entailment class, where the misclassifications as contradiction and neutral are reduced by 12 and 14, respectively. We think this is because the incorporated domain knowledge enhances the word representations of medical terms and abbreviations and bridges the semantic gap between different written forms of the same medical concept. The incorporated knowledge also reduces the likelihood of the neutral class being mistakenly classified as the contradiction class.

(2) Our model performs worse than the baseline in the contradiction class. After reviewing the misclassified examples, we found that the errors mainly occurred in those requiring numerical reasoning, e.g., a premise such as "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA." Our model tends to mistake such numerical reasoning examples for the entailment class. This is also true for the neutral class. We think ensemble methods using InferSent and ESIM w/ Knowledge would take advantage of the strengths of each model and obtain better predictive performance.

FIGURE 5. Normalized co-attention matrix of Examples #1 to #3. (a) Example #1 with normalization over row-axis. (b) Example #2 with normalization over row-axis. (c) Example #3 with normalization over column-axis.

B. ABLATION STUDY

The main difference between our model and the vanilla ESIM is the newly added Incorporating Medical Concept Definitions module: it uses a bidirectional LSTM to encode the definitions of medical concepts, and an additional attention mechanism to enhance vanilla word embeddings. To analyze the contributions of these two components to the overall performance, we conducted an ablation study using the fastText[MIMIC-III] embedding. Three model variants were studied: one that removed only the attention mechanism, another that changed the bidirectional LSTM to a unidirectional one, and a last one that did both. The results of the study are presented in Table 4. The values of the BiLSTM variant w/o attention are amended, because this variant stopped much earlier than the others. It was only trained for 6 epochs, had not been fully trained, and did not generalize well on either the development set or the test set, as shown in Fig. 4. Based on the loss and accuracy curves, we found that the 9th epoch was the optimal stopping point, with the second-lowest loss and the best generalization performance.

From Table 4, we can conclude that, in terms of test accuracy, models w/ attention are better than those w/o attention, and the bidirectional LSTM is better than the unidirectional LSTM. These findings reflect the importance of the Incorporating Medical Concept Definitions module and show that domain-specific knowledge contributes to natural language inference on clinical texts.

C. CASE STUDY

Finally, we qualitatively inspect the examples listed in Table 1 and visualize their normalized co-attention matrices, computed from the scores in (13). For more examples with attention visualizations, see Appendix A.

The key words of Example #1 for inference are diaphoresis and sweats. By enhancing word embeddings with knowledge, our model learns to focus on these two medical terms and knows that they have the same meaning. As shown in Fig. 5 (a), in the premise, diaphoresis has the highest weight with respect to sweats. In Fig. 5 (b) (corresponding to Example #2), for the abbreviation LP, our model pays attention to its full name, lumbar puncture. In Fig. 5 (c) (corresponding to Example #3), our model learns to make the inference based on the relationship between STEMI and coronary artery bypass grafting. Because the definition of STEMI (i.e., ST segment elevation myocardial infarction) is incorporated, our model learns that they are unrelated, and the prediction is the neutral class.

VI. CONCLUSION

We have presented a novel model for natural language inference on clinical texts that incorporates medical concept definitions into vanilla word embeddings. Our experimental results demonstrate that the model outperforms all baselines, achieving state-of-the-art accuracy, owing to the contribution of domain knowledge.

Further improvement might be made by expanding the dictionary of medical concept definitions to cover more medical terms and abbreviations. For simplicity, we only employed the shortest definition for each concept. However, a concept might have a number of definitions. Therefore, we will study how to encode multiple definitions in the future.


APPENDIX A SUPPLEMENTAL MATERIAL

In this supplemental material, we show more examples (Table 5) with their normalized co-attention matrices (Fig. 6–14). Our model classifies all these examples correctly.

REFERENCES

[1] I. Dagan, O. Glickman, and B. Magnini, “The PASCAL recognising textual entailment challenge,” in Machine Learning Challenges Workshop. Springer, 2005, pp. 177–190.

TABLE 5. More examples from the MedNLI dataset. P, H and L stand for premise, hypothesis and label, respectively. Key words for inference are in italics.

Example #4
P: Paroxysmal atrial fibrillation.
H: Patient has abnormal EKG
L: entailment

Example #5
P: The patient reports having reactive airway disease.
H: The patient has COPD and/or atshma.
L: entailment

Example #6
P: Of note, patient does c/o progressive shortness of breath over the past year.
H: The patient experiences dyspnea
L: entailment

Example #7
P: She does admit to some diarrhea at home.
H: The patient is constipated
L: contradiction

Example #8
P: Moving all four extremities, responding to sternal rub.
H: Patient is unresponsive
L: contradiction

Example #9
P: In the ambulance, per report from the cardiologist, initial rhythm strip showed normal sinus rhythm.
H: The patient had ventricular fibrillation.
L: contradiction

Example #10
P: Liver failure- hx of encephalopathy, no bx seen in records DM type 2- non insulin dependent CHF Elevated PSA Pancreatitis Postive PPD Alcoholic cardiomyopathy
H: The patient is an alcoholic.
L: neutral

Example #11
P: The PDA and posterolateral vessels had 90% ostial lesions.
H: Patient may required CABG
L: neutral

Example #12
P: A recent TEE showed severe aortic stenosos with an aortic valve area of 0.7cm2.
H: Patient has CHF
L: neutral

[2] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotatedcorpus for learning natural language inference,” in Proceedings of the 2015Conference on Empirical Methods in Natural Language Processing, 2015,pp. 632–642.

[3] A. Williams, N. Nangia, and S. Bowman, “A broad-coverage challengecorpus for sentence understanding through inference,” in Proceedings ofthe 2018 Conference of the North American Chapter of the Associationfor Computational Linguistics: Human Language Technologies, Volume 1(Long Papers), vol. 1, 2018, pp. 1112–1122.

[4] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Super-vised learning of universal sentence representations from natural languageinference data,” in Proceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing, 2017, pp. 670–680.

[5] Q. Chen, X. Zhu, Z.-H. Ling, S. Wei, H. Jiang, and D. Inkpen, “Enhancedlstm for natural language inference,” in Proceedings of the 55th AnnualMeeting of the Association for Computational Linguistics (Volume 1:Long Papers), vol. 1, 2017, pp. 1657–1668.

8 VOLUME 4, 2016


FIGURE 6. Co-attention matrix of Example #4 with normalization over column-axis. [Axes: Premise tokens, Hypothesis tokens.]

[Figure: co-attention matrix of Example #5. Premise: "The patient reports having reactive airway disease." Hypothesis: "The patient has COPD and/or atshma."]


[6] Q. Chen, X. Zhu, Z.-H. Ling, D. Inkpen, and S. Wei, “Neural naturallanguage inference models enhanced with external knowledge,” in Pro-ceedings of the 56th Annual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers), vol. 1, 2018, pp. 2406–2417.

[7] A. Romanov and C. Shivade, “Lessons from natural language inference inthe clinical domain,” in Proceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing, 2018, pp. 1586–1596.

[8] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neuralcomputation, vol. 9, no. 8, pp. 1735–1780, 1997.

[9] W. Lan and W. Xu, “Neural network models for paraphrase identification,semantic textual similarity, natural language inference, and question an-swering,” in Proceedings of the 27th International Conference on Compu-tational Linguistics, 2018, pp. 3890–3902.

[10] T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, and C. Zhang, “Disan: Di-rectional self-attention network for rnn/cnn-free language understanding,”in Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp.5446–5455.

[11] T. Shen, T. Zhou, G. Long, J. Jiang, S. Wang, and C. Zhang, “Reinforcedself-attention network: a hybrid of hard and soft attention for sequencemodeling,” in Proceedings of the 27th International Joint Conference onArtificial Intelligence. AAAI Press, 2018, pp. 4345–4352.

[12] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signatureverification using a" siamese" time delay neural network,” in Advances inneural information processing systems, 1994, pp. 737–744.

[13] S. Wang and J. Jiang, “Learning natural language inference with lstm,” inProceedings of the 2016 Conference of the North American Chapter of theAssociation for Computational Linguistics: Human Language Technolo-gies, 2016, pp. 1442–1451.

[14] R. Ghaeini, S. A. Hasan, V. Datla, J. Liu, K. Lee, A. Qadir, Y. Ling,A. Prakash, X. Fern, and O. Farri, “Dr-bilstm: Dependent reading bidi-rectional lstm for natural language inference,” in Proceedings of the

[Figure: co-attention matrix of Example #6. Premise: "Of note, patient does c/o progressive shortness of breath over the past year." Hypothesis: "The patient experiences dyspnea".]


[Figure: co-attention matrix of Example #7. Premise: "She does admit to some diarrhea at home." Hypothesis: "The patient is constipated".]


2018 Conference of the North American Chapter of the Association forComputational Linguistics: Human Language Technologies, Volume 1(Long Papers), vol. 1, 2018, pp. 1460–1469.

[15] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation byjointly learning to align and translate,” arXiv preprint arXiv:1409.0473,2014.

[16] E. Agirre, A. Gonzalez-Agirre, I. Lopez-Gazpio, M. Maritxalar, G. Rigau,and L. Uria, “Semeval-2016 task 2: Interpretable semantic textual sim-ilarity,” Proceedings of the 10th International Workshop on SemanticEvaluation (SemEval 2016), pp. 512–524, 2016.

[17] M. Faruqui, J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith,“Retrofitting word vectors to semantic lexicons,” in Proceedings of the2015 Conference of the North American Chapter of the Association forComputational Linguistics: Human Language Technologies, 2015, pp.1606–1615.

[18] O. Bodenreider, “The unified medical language system (umls): integratingbiomedical terminology,” Nucleic acids research, vol. 32, no. suppl_1, pp.D267–D270, 2004.

[19] A. R. Aronson and F.-M. Lang, “An overview of metamap: historicalperspective and recent advances,” Journal of the American Medical In-formatics Association, vol. 17, no. 3, pp. 229–236, 2010.

[20] L. Mou, R. Men, G. Li, Y. Xu, L. Zhang, R. Yan, and Z. Jin, “Naturallanguage inference by tree-based convolution and heuristic matching,” inProceedings of the 54th Annual Meeting of the Association for Computa-tional Linguistics (Volume 2: Short Papers), vol. 2, 2016, pp. 130–136.

[21] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors forword representation,” in Proceedings of the 2014 conference on empiricalmethods in natural language processing (EMNLP), 2014, pp. 1532–1543.

[22] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching wordvectors with subword information,” Transactions of the Association forComputational Linguistics, vol. 5, pp. 135–146, 2017.

[23] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R.Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos



FIGURE 10. Co-attention matrix of Example #8 with normalization over row-axis. [Axes: Premise tokens, Hypothesis tokens.]

FIGURE 11. Co-attention matrix of Example #9 with normalization over row-axis. [Axes: Premise tokens, Hypothesis tokens.]

et al., “An overview of the bioasq large-scale biomedical semantic indexingand question answering competition,” BMC bioinformatics, vol. 16, no. 1,p. 138, 2015.

[24] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi,B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark, “Mimic-iii, a freelyaccessible critical care database,” Scientific data, vol. 3, p. 160035, 2016.

[25] D. Bahdanau, T. Bosc, S. Jastrzebski, E. Grefenstette, P. Vincent, andY. Bengio, “Learning to compute word embeddings on the fly,” arXivpreprint arXiv:1706.00286, 2017.

[26] D. Chaudhuri, A. Kristiadi, J. Lehmann, and A. Fischer, “Improvingresponse selection in multi-turn dialogue systems by incorporating domainknowledge,” in Proceedings of the 22nd Conference on ComputationalNatural Language Learning, 2018, pp. 497–507.

[27] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neuralnetworks,” in Proceedings of the fourteenth international conference onartificial intelligence and statistics, 2011, pp. 315–323.

[28] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”arXiv preprint arXiv:1412.6980, 2014.

[29] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout andthe local reparameterization trick,” in Advances in Neural InformationProcessing Systems, 2015, pp. 2575–2583.

MINGMING LU was born in 1991 and received the B.S. degree in computer science from China University of Mining and Technology, China, in 2013. Now he is a Ph.D. candidate in the Department of Computer Science and Technology, Tongji University, China. His main research interests include machine learning, natural language processing, and intelligent systems with applications to medicine.

YU FANG is a Professor in the Department of Computer Science and Technology, Tongji University, China. She received the Ph.D. degree from Tongji University in 2006. Her main research interests include big data analytics and intelligent systems with applications to medicine.

FENGQI YAN was born in 1978 and received the M.S. degree from Shandong University of Science and Technology, China, in 2007. Now he is a D.Eng candidate in Electronics and Information, Tongji University, China. His research interests are focused on medical big data and medical information services.

MAOZHEN LI is a Professor in the Department of Electronic and Computer Engineering, Brunel University London, UK. He received the Ph.D. degree from the Institute of Software, Chinese Academy of Sciences in 1997. His main research interests include high performance computing, big data analytics and intelligent systems with applications to smart grid, smart manufacturing and smart cities. He has over 160 research publications in these areas including 4 books. He has served over 30 IEEE conferences and is on the editorial board of a number of journals. He is a Fellow of the British Computer Society and the IET.



FIGURE 12. Co-attention matrix of Example #10 with normalization over column-axis.

[Figure: co-attention matrix of Example #11. Premise: "The PDA and posterolateral vessels had 90% ostial lesions." Hypothesis: "Patient may required CABG".]


[Figure: co-attention matrix. Premise: "A recent TEE showed severe aortic stenosos with an aortic valve area of 0.7cm2." Hypothesis: "Patient has CHF".]

