
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3143–3153, July 5 - 10, 2020. ©2020 Association for Computational Linguistics


Towards Interpretable Clinical Diagnosis with Bayesian Network Ensembles Stacked on Entity-Aware CNNs

Jun Chen, Xiaoya Dai, Quan Yuan, Chao Lu and Haifeng Huang
Baidu Inc, Beijing, China

{chenjun22,daixiaoya,yuanquan02,luchao,huanghaifeng}@baidu.com

Abstract

Automatic text-based diagnosis remains a challenging task for clinical use because it requires an appropriate balance between accuracy and interpretability. In this paper, we propose a novel framework that stacks Bayesian Network Ensembles on top of Entity-Aware Convolutional Neural Networks (CNN) towards building an accurate yet interpretable diagnosis system. The proposed framework takes advantage of the high accuracy and generality of deep neural networks as well as the interpretability of Bayesian Networks, which is critical for AI-empowered healthcare. The evaluation, conducted on real Electronic Medical Record (EMR) documents from hospitals annotated by professional doctors, shows that the proposed framework outperforms previous automatic diagnosis methods in accuracy and that its diagnosis explanations are reasonable.

1 Introduction

The automatic diagnosis of diseases has drawn increasing attention from both research communities and industrial companies in recent years due to the advancement of artificial intelligence (AI) (Liang et al., 2019; Esteva et al., 2019; Liu et al., 2018). As reported in Anandan et al. (2019), “AI-enabled analysis software is helping to guide doctors and other health-care workers through diagnostic processes and questioning to arrive at treatment decisions with greater speed and accuracy.” Although image-based diagnosis has been well studied using PACS (Picture Archiving and Communication Systems) data (Litjens et al., 2017), text-based diagnosis for Clinical Decision Support (CDS) (Berner, 2007) remains difficult due to the limited access to reliable clinical corpora and the difficulty of balancing accuracy and interpretability.

Table 1: A real outpatient EMR from a hospital.

| Section | Content |
|---|---|
| Basic | 男, 30岁 (Male, 30 years old) |
| CC | 咽部不适3天 (Pharyngeal discomfort for 3 days) |
| HPI | 患者于3日前起咽痛伴发热,无呼吸困难、咳嗽、咳痰、嗳气或反酸 (The patient developed pharyngalgia and fever 3 days ago, without dyspnea, cough, sputum, belching or acid reflux) |
| PE | 咽峡稍充血,双侧扁桃体Ⅰ度肿大,无栓塞物及瘢痕 (The hypopharyngeal isthmus is slightly congested. The bilateral tonsils are first-degree enlarged. There is no embolism or scar in the pharynx.) |
| TR | 血常规示白细胞计数升高, WBC 12.5 × 10^9/L. C反应蛋白正常. (The blood test showed elevated white blood cell count, WBC 12.5 × 10^9/L. The C-reactive protein is normal.) |
| Diagnosis | 急性扁桃体炎 (Acute tonsillitis) |

There have been attempts to study automatic text-based diagnosis with Electronic Medical Record (EMR) documents integrated in the Hospital Information System (Mullenbach et al., 2018; Yang et al., 2018; Girardi et al., 2018). Basically, an EMR document is written by a doctor and consists of several sections that describe the illness of the patient. Besides the patient’s basic information like name, age and gender, an EMR document contains Chief Complaint (CC), History of Present Illness (HPI), Physical Examination (PE), Test Reports (TR, e.g. lab test reports and PACS reports), Diagnosis, etc. Table 1 shows a real outpatient EMR document from a hospital. These sections describe the patient’s medical situation from different aspects: CC summarizes the patient’s main discomforts of this visit. HPI extends CC by adding more details and findings from the conversation between doctor and patient. PE shows the findings obtained by physically examining the patient’s body, e.g. by palpation or inspection. TR are the objective findings from the lab test reports or the PACS reports. In hospitals, doctors make a comprehensive analysis mainly based on CC, HPI, PE, TR and the basic information, and then make a diagnosis. However, it is very hard for computers to automatically understand all the diverse sections and capture the key information before making an appropriate diagnosis. Besides, an inpatient EMR document is similar to that in Table 1 except that HPI, PE and TR are usually more lengthy and detailed. The framework proposed in this work can be applied to both outpatient and inpatient EMR documents, and we will not distinguish them later.

In this study, we bring forward a novel framework of automatic diagnosis with EMR documents for CDS.¹ Specifically, we propose to predict the main diagnosis based on the patient’s current illness. Different from the previous works (Yang et al., 2018; Sha and Wang, 2017; Li et al., 2017; Girardi et al., 2018; Mullenbach et al., 2018) that solely rely on end-to-end neural models, we propose to stack Bayesian Network (BN) ensembles on top of Entity-aware Convolutional Neural Networks (ECNN) in automatic diagnosis, where ECNN improves the accuracy of the prediction and the BN ensembles explain the prediction. The proposed framework attempts to bring some interpretability to the predictions by incorporating the knowledge encoded in the BN ensembles. The main contributions of this work are as follows:

• We propose a novel framework that stacks Bayesian network ensembles on top of entity-aware convolutional neural networks to bring interpretability into automatic diagnosis without compromising the accuracy of deep learning. Interpretability is very important in AI-empowered healthcare studies.

• We bring forward three variants of Bayesian Networks for disease inference that provide interpretability. Moreover, we ensemble these BNs towards more robust diagnosis results.

• The evaluation conducted on real EMR documents from hospitals shows that the proposed framework outperforms the previous automatic diagnosis methods with EMRs. The proposed framework has been used as a critical component in the clinical decision support system developed by Baidu, which assists physicians in diagnosis in hundreds of primary healthcare facilities in China.

• We publish the Chinese medical knowledge graph of Gynaecology and Respiration used in our Bayesian Network for disease inference with this paper for reproducibility. The data set can be downloaded from GitHub.²

¹ Different from the Electronic Health Record (EHR), where the illnesses of a patient’s multiple visits are combined together, an EMR only contains the patient’s illness of this particular visit. EMRs are more generally used in the hospitals in China.

2 Related Work

Due to the rapid advancement of machine intelligence, text-based automatic diagnosis has become one of the most important applications of machine learning and natural language processing in recent years (Anandan et al., 2019; Koleck et al., 2019). Different from diagnosis or question answering on the Web (Chen et al., 2019), diagnosis for CDS takes place in hospitals and clinics, and the predictive algorithm is integrated into the Hospital Information System to assist doctors and physicians in the diagnosis.

Liang et al. (2019) propose a top-down hierarchical classification method towards diagnosing pediatric diseases. From the root to the leaf, each level of the diagnostic hierarchy is a logistic regression model that performs classification on labels from coarse to fine granularity, e.g. from organ systems down to respiratory systems and to upper respiratory systems. This method requires heavy manual annotation of training samples at different levels of the hierarchy.

Zhang et al. (2017) combine the variational auto-encoder and the variational recurrent neural network to make diagnoses based on laboratory test data. However, laboratory test data are not the only resources considered in this paper.

Prakash et al. (2017) introduce memory networks into diagnostic inference based on free-text clinical records, with Wikipedia as an external knowledge source.

Sha and Wang (2017) propose a hierarchical GRU-based neural network to predict clinical outcomes based on the medical code sequences of the patient’s previous visits. It deals with the sequential disease forecasting problem with EHR data rather than the diagnosis problem for the current visit with an EMR document. Similarly, Choi et al. (2016a) study an RNN-based model for clinical event prediction. Baumel et al. (2017) investigate the multi-label classification problem for discharge summaries of EHR with a hierarchical attention-bidirectional GRU.

The works most similar to ours are Yang et al. (2018) and Li et al. (2017), which train an end-to-end convolutional network model to predict diagnosis based on EMRs. Besides, Girardi et al. (2018) improve the CNN model with the attention mechanism in automatic diagnosis. Moreover, Mullenbach et al. (2018) study a label-wise attention model to further improve the accuracy of diagnosis at the cost of more computation time. Choi et al. (2016b) propose a reverse time attention mechanism for interpretable healthcare studies.

² https://github.com/PaddlePaddle/Research/tree/master/KG/ACL2020_SignOrSymptom_Relationship

Different from the previous studies, the novelty of this paper is to bring interpretability into automatic diagnosis by stacking the ensembles of Bayesian networks on top of the entity-aware convolutional neural networks.

3 The Proposed Framework

Automatic diagnosis can be formally considered as a classification problem where the proposed method outputs a probability distribution Pr(d|S) over all diseases d ∈ D based on the illness description S. In this study, S corresponds to the patient’s EMR document, i.e. S consists of several sections of text and some structured data like age, gender and medical department.

We bring forward a new framework that com-bines the black-box deep learning and the white-box knowledge inference to diagnose disease withEMR documents. Figure 1 shows the architectureof the proposed framework. Firstly, the medical en-tities are extracted from the EMR contents. Then,the EMR document is fed into the entity-awareconvolutional networks to generate disease priorprobability. Next, the Bayesian network ensem-bles perform disease inference based on the priorprobability and the probabilistic graphical mod-els (PGMs) before ensembling the final predictions.

3.1 Named Entity Recognition

Before introducing the convolutional and the Bayesian networks, we first discuss a basic component of this framework: named entity recognition (NER). NER extracts the entities as well as their types from text sentences, which is very important for capturing the key information of the texts. In our experiments, we used Baidu’s enterprise Chinese medical NER system, which integrates advanced NER models (Dai et al., 2019; Jia et al., 2019) and extracts entities of symptoms, vital signs, diseases and test report findings.

The F1 score of the NER system we use is 91% in a separate evaluation conducted on 1000 deduplicated sentences from real EMR documents by 10 certified physicians in China.³ Meanwhile, the polarity (positive (+), negative (-) or unknown (?)) of entities is also recognized. The polarity in this work objectively means the presence or absence of a finding in a given EMR. It is recognized by combining a rule-based method that uses a vocabulary of negative Chinese words with a polarity detection model. Table 2 shows the NER results of the EMR in Table 1. Please note that the disease (acute tonsillitis) from the diagnosis section is the ground-truth label to predict, and it is not included in the input to the predictive model in the evaluation.

Table 2: The NER results of the EMR document shown in Table 1. TR Finding: test result finding. (+) for positive, (-) for negative and (?) for unknown.

| Word | Section | Type | Polarity |
|---|---|---|---|
| 咽部不适 (pharyngeal discomfort) | CC | Symptom | (+) |
| 咽痛 (pharyngalgia) | HPI | Symptom | (+) |
| 发热 (fever) | HPI | Symptom | (+) |
| 呼吸困难 (dyspnea) | HPI | Symptom | (-) |
| 咳嗽 (cough) | HPI | Symptom | (-) |
| 咳痰 (sputum) | HPI | Symptom | (-) |
| 嗳气 (belching) | HPI | Symptom | (-) |
| 反酸 (acid reflux) | HPI | Symptom | (-) |
| 咽峡充血 (congested hypopharyngeal isthmus) | PE | Vital Sign | (+) |
| 双侧扁桃体肿大 (enlarged bilateral tonsils) | PE | Vital Sign | (+) |
| 咽部栓塞物 (pharyngeal embolism) | PE | Vital Sign | (-) |
| 咽部瘢痕 (pharyngeal scar) | PE | Vital Sign | (-) |
| 白细胞计数升高 (elevated WBC) | TR | TR Finding | (+) |
| C反应蛋白异常 (abnormal C-reactive protein) | TR | TR Finding | (-) |
| 急性扁桃体炎 (acute tonsillitis) | Diagnosis | Disease | (+) |

In the offline processing of the EMR corpus, we preserved the Top-K most frequent entities of all types as the entity vocabulary. In later experiments, we empirically set K = 10,000. The entity vocabulary is used to construct the one-hot feature for each EMR document, which will be introduced later. Since NER is not the focus of this study, readers can use the public Chinese NER API⁴ from Baidu for fast experiments. We will focus on the major contributions of the proposed framework in the next sections.

³ Two senior physicians beyond the attending doctor level and eight junior physicians contributed to the annotation tasks here and later.

⁴ http://ai.baidu.com/tech/cognitive/entity_annotation


Figure 1: The architecture of the proposed framework. [Diagram labels: CNN towers (embedding, convolution, max pooling & flatten) over CC, HPI, PE, ...; an MLP over the other features; dropout and softmax producing the disease priors; parallel, universal and cascade PGMs; final predictions.]

3.2 ECNN for Prior Generation

The convolutional networks take as input the list of texts w.r.t. the sections of an EMR document as well as the medical entities extracted from them, and output the probability distribution of the diseases. To distinguish it from the previous CNN models without medical entities (Yang et al., 2018; Li et al., 2017), we use ECNN to denote the entity-aware CNN model proposed in this paper, where another branch of fully connected layers processes the medical entities and outputs the corresponding feature representation. Let N denote the number of sections (CC, HPI, PE, TR, etc.) selected from the EMR document to construct ECNN. ECNN consists of two parts: (1) N convolutional towers, each of which reads a unique section, and (2) one multi-layer perceptron (MLP) branch that reads a high-dimensional hand-crafted feature.

Similar to the previous CNN method for text classification (Kim, 2014), each convolutional tower processes the input sequence with three kernels of various lengths, resulting in a multi-channel feature output. The three kernels process the input with 3-grams, 4-grams and 5-grams, respectively, and their outputs are concatenated as the output of a convolutional tower. Each kernel in the convolutional networks has 100 filters with a stride of 1. The input uses valid padding (i.e. no padding is added) and the output is activated by ReLU.
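To make the tower concrete, the following is a minimal pure-Python sketch of one convolutional tower: kernels of width 3, 4 and 5, stride 1, valid padding, ReLU, then max pooling over time (the pooling step is taken from Figure 1). The function name `conv_tower`, the omission of bias terms, the random weights and the small dimensions in the demo are assumptions made for illustration; the paper uses 100 filters per kernel on 100-dimensional embeddings.

```python
import random

def conv_tower(tokens, kernels):
    """tokens: list of embedding vectors (one per word of a section).
    kernels: {width: [filter, ...]}, each filter a width x emb_dim nested list.
    Returns the concatenated max-pooled ReLU activations of all filters."""
    pooled = []
    emb_dim = len(tokens[0])
    for width, filters in sorted(kernels.items()):
        for filt in filters:
            best = 0.0  # ReLU output is never below 0, so 0 is a safe floor
            # Valid padding with stride 1: slide over complete windows only.
            for start in range(len(tokens) - width + 1):
                act = sum(
                    filt[k][d] * tokens[start + k][d]
                    for k in range(width)
                    for d in range(emb_dim)
                )
                best = max(best, act)  # folds ReLU and max pooling together
            pooled.append(best)
    return pooled

random.seed(0)
emb_dim, n_filters = 8, 4  # toy sizes; the paper uses 100 and 100
tokens = [[random.gauss(0, 1) for _ in range(emb_dim)] for _ in range(12)]
kernels = {
    width: [[[random.gauss(0, 0.1) for _ in range(emb_dim)] for _ in range(width)]
            for _ in range(n_filters)]
    for width in (3, 4, 5)
}
tower_output = conv_tower(tokens, kernels)
print(len(tower_output))  # 3 kernels x 4 filters = 12 features
```

With the paper's settings, the same computation would yield 300 features per tower (three kernels of 100 filters each).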

For the input of the MLP, we create the entity vocabulary that consists of the top-K most frequent entities. Then, each EMR document is transformed into a K-dimensional one-hot feature f. That is, if the i-th entity in the entity vocabulary appears as a positive finding in the input EMR, then the i-th dimension of f is set to 1; otherwise, it is set to 0. Moreover, the patient’s age and gender are appended to f to get the hand-crafted feature for the MLP. The MLP contains one dense layer of 128 hidden units activated by the sigmoid function.
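The construction of the hand-crafted feature can be sketched as follows. The function name `build_entity_feature`, the age scaling (division by 100) and the 0/1 gender encoding are assumptions for illustration, since the paper does not specify how age and gender are encoded.

```python
def build_entity_feature(positive_entities, entity_vocab, age, gender):
    """Return the K-dimensional entity indicator vector f with age and gender appended."""
    index = {entity: i for i, entity in enumerate(entity_vocab)}
    f = [0.0] * len(entity_vocab)
    for entity in positive_entities:
        if entity in index:           # entities outside the top-K vocabulary are ignored
            f[index[entity]] = 1.0    # dimension i is 1 if entity i is a positive finding
    # Assumed encodings: age scaled by 100, gender as 1.0 (male) or 0.0 (otherwise).
    return f + [age / 100.0, 1.0 if gender == "male" else 0.0]

entity_vocab = ["fever", "cough", "pharyngalgia", "dyspnea"]  # toy vocabulary, K = 4
feature = build_entity_feature({"fever", "pharyngalgia", "rare finding"},
                               entity_vocab, age=30, gender="male")
print(feature)
```

Negative findings such as the absent cough in Table 2 leave their dimension at 0, exactly like entities that are not mentioned at all.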

ECNN is trained with the Adam optimizer (learning rate 0.001) for 20 epochs with a batch size of 32. The output of each convolutional tower and the output of the MLP are concatenated before passing through the dropout and the softmax layers. Similar to Kim (2014), the dropout rate is empirically set to 0.5. ECNN outputs a |D|-dimensional feature, where D is the disease set, as the disease priors for the inference in the next step.

In ECNN, the CNNs are supposed to capture the sequential signals in the section texts and the MLP is supposed to encode the features of the critical entities. By jointly modeling with the CNNs and the MLP, the proposed ECNN is expected to outperform either of them alone.

3.3 Bayesian Network Ensembles

Although ECNN also outputs a probability distribution over all diseases, the result is not interpretable due to its end-to-end nature. However, interpretability is very important in CDS to explain how the diagnosis is generated by machines. Thus, we propose the Bayesian network ensembles on top of the output of ECNN to explicitly infer diseases with PGMs. There are three steps:

3.3.1 Relation Extraction

We extract the relations between disease and other types of entities (disease, finding), where finding can be symptom, vital sign, test report finding, etc. The rest of this paper will use finding to denote any type of entity other than disease. Relation extraction is performed by combining (disease, finding) co-occurrence mining with the deep extraction model (Shi et al., 2019) on the EMR documents and the textbooks.⁵ Then, the pairs whose co-occurrence counts are larger than a support threshold (e.g. 5) are preserved. The extracted relations are reviewed by 10 certified physicians. Invalid extracted relations, which result from issues such as incorrect recognition of entities or polarities by NER or a symptom caused by the secondary diagnosis being incorrectly paired with the first diagnosis, are removed before the relations are added to the medical knowledge graph. Therefore, the relation (disease, finding) in the medical knowledge graph can, to some extent, be interpreted as: disease causes finding.
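The co-occurrence part of this step can be sketched as below. It covers only the counting and the support filter; the deep extraction model and the physician review are outside its scope. The function name `mine_relations`, the record layout, and the choice to count a finding at most once per document are assumptions for illustration.

```python
from collections import Counter

def mine_relations(records, min_support=5):
    """records: iterable of (disease, positive_findings) pairs, one per EMR.
    Returns {(disease, finding): count} for pairs seen more than min_support times."""
    counts = Counter()
    for disease, findings in records:
        for finding in set(findings):      # assumed: count each finding once per document
            counts[(disease, finding)] += 1
    # "larger than a support" is read here as a strict inequality.
    return {pair: n for pair, n in counts.items() if n > min_support}

records = (
    [("URI", ["pharyngalgia", "fever"])] * 6   # toy corpus, not real EMR data
    + [("URI", ["hemoptysis"])] * 2
)
relations = mine_relations(records, min_support=5)
print(relations)
```

In the toy corpus the pair (URI, hemoptysis) occurs only twice and is dropped, which is the kind of low-support pair the threshold is meant to filter before manual review.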

In our study, the pairs are mined from 275,797 EMR documents of two medical departments (Gynaecology and Respiration). On average, each disease of Gynaecology in our experiments is associated with 24 findings and that of Respiration is associated with 42. For Gynaecology, there are 33 diseases, 305 symptoms, 143 vital signs and 25 test report findings in the PGMs. For Respiration, there are 21 diseases, 263 symptoms, 187 vital signs and 31 test report findings in the PGMs.

3.3.2 Relation Weights Estimation

We experiment with six classical text features as the relation weights in this study.

(1) Occurrence. The weight of finding i given disease j is:

$$w(i; j) = \frac{n(i, j)}{\sum_k n(k, j)}, \tag{1}$$

where n(i, j) is the number of co-occurrences of finding i and disease j. w(i; j) is computed separately for each type of finding.

(2) TF-IDF Feature. Similar to the TF-IDF feature in information retrieval, the weight of finding i given disease j is:

$$w(i; j) = n(i, j) \cdot \left(\log \frac{|D| + 1}{n_i + 1} + 1\right), \tag{2}$$

where $n_i$ is the number of diseases whose EMR documents contain finding i.

(3) TFC Feature. The TFC feature (Salton and Buckley, 1988) is a variant of TF-IDF and it estimates the weight of finding i given disease j as:

$$w(i; j) = \frac{n(i, j) \cdot \log \frac{|D|}{n_i}}{\sqrt{\sum_k \left(n(k, j) \cdot \log \frac{|D|}{n_k}\right)^2}}. \tag{3}$$

⁵ The undergraduate teaching materials in most of the medical schools in China, authorized by the publisher.

(4) TF-IWF Feature. The Term-Frequency Inverse-Word-Frequency (TF-IWF) feature (Basili et al., 1999) estimates the weight of finding i given disease j as:

$$w(i; j) = n(i, j) \cdot \left(\log \frac{\sum_k t_k}{t_i}\right)^2, \tag{4}$$

where $t_i$ represents the number of occurrences of word i in the whole training corpus.

(5) CHI Feature. The CHI feature (χ² test) measures how much a term is associated with a class from a statistical view. The CHI feature of finding i given disease j is (Yang and Pedersen, 1997):

$$w(i; j) = \frac{N \cdot (A \cdot D - C \cdot B)^2}{(A + C) \cdot (B + D) \cdot (A + B) \cdot (C + D)}, \tag{5}$$

where N, A, B, C and D are the number of all documents, the number of documents containing finding i and belonging to disease j, the number of documents containing i but not belonging to j, the number of documents belonging to j but not containing i, and the number of documents neither containing i nor belonging to j, respectively.

(6) Mutual Information. This feature assumes that the stronger the association between a finding and a disease, the higher their mutual information will be. Using the same notation as in the CHI feature, this feature is defined as:

$$w(i; j) \approx \log \frac{A \cdot N}{(A + C) \cdot (A + B)}. \tag{6}$$

The above features are normalized by disease before being applied to the diagnosis inference. By default, the average of the six features is used as the connection weight.
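The six features of Eqs. (1)-(6) and their averaging can be sketched for a single finding type as follows. Several choices here are assumptions rather than details from the paper: the function names, approximating $t_i$ by the number of documents containing finding i, and min-max scaling as the "normalization by disease" (the paper does not state which normalization is used).

```python
import math

def relation_weights(docs):
    """docs: list of (disease, set_of_positive_findings), one entry per EMR.
    Returns {feature_name: {(finding, disease): weight}} following Eqs. (1)-(6)."""
    diseases = sorted({d for d, _ in docs})
    N, num_d = len(docs), len(diseases)
    n = {j: {} for j in diseases}              # n[j][i]: co-occurrences of finding i and disease j
    for d, findings in docs:
        for i in findings:
            n[d][i] = n[d].get(i, 0) + 1
    all_findings = sorted({i for j in diseases for i in n[j]})
    n_i = {i: sum(1 for j in diseases if i in n[j]) for i in all_findings}
    # Assumption: t_i is approximated by the number of documents containing finding i.
    t = {i: sum(n[j].get(i, 0) for j in diseases) for i in all_findings}
    t_total = sum(t.values())
    docs_of = {j: sum(1 for d, _ in docs if d == j) for j in diseases}
    feats = {name: {} for name in ("occ", "tfidf", "tfc", "tfiwf", "chi", "mi")}
    for j in diseases:
        row_sum = sum(n[j].values())
        tfc_norm = math.sqrt(sum((c * math.log(num_d / n_i[k])) ** 2 for k, c in n[j].items()))
        for i, c in n[j].items():
            A, B = c, t[i] - c                 # documents with i: inside / outside disease j
            C, D = docs_of[j] - A, N - docs_of[j] - B
            chi_den = (A + C) * (B + D) * (A + B) * (C + D)
            key = (i, j)
            feats["occ"][key] = c / row_sum
            feats["tfidf"][key] = c * (math.log((num_d + 1) / (n_i[i] + 1)) + 1)
            feats["tfc"][key] = c * math.log(num_d / n_i[i]) / tfc_norm if tfc_norm else 0.0
            feats["tfiwf"][key] = c * math.log(t_total / t[i]) ** 2
            feats["chi"][key] = N * (A * D - C * B) ** 2 / chi_den if chi_den else 0.0
            feats["mi"][key] = math.log(A * N / ((A + C) * (A + B)))
    return feats

def connection_weights(feats):
    """Min-max scale each feature within a disease (an assumption), then average the six."""
    scaled = []
    for table in feats.values():
        out = {}
        for (i, j), v in table.items():
            vals = [w for (_, jj), w in table.items() if jj == j]
            lo, hi = min(vals), max(vals)
            out[(i, j)] = (v - lo) / (hi - lo) if hi > lo else 1.0
        scaled.append(out)
    return {k: sum(s[k] for s in scaled) / len(scaled) for k in scaled[0]}

docs = (  # toy corpus, not real EMR data
    [("URI", {"pharyngalgia"})] * 4 + [("URI", {"fever"})]
    + [("phthisis", {"pharyngalgia"})] + [("phthisis", {"hemoptysis"})] * 4
)
feats = relation_weights(docs)
weights = connection_weights(feats)
print(feats["chi"][("pharyngalgia", "URI")], weights[("pharyngalgia", "URI")])
```

In the toy corpus, pharyngalgia occurs with both diseases, so the IDF-style features (TFC, TF-IWF) rank it below fever for URI while the occurrence and CHI features rank it above; averaging the six features balances these views.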

3.3.3 Diagnosis Inference

We propose the Bayesian network ensembles for the diagnosis inference. Specifically, a group of PGMs with the extracted relations and weights are ensembled towards the final predictions.

Firstly, multiple bipartite graphs between disease nodes and each type of finding nodes are derived from the medical knowledge graph. For M types of findings, there will be M bipartite graphs. In later experiments, M = 3, i.e. (disease, symptom), (disease, vital sign) and (disease, test result finding). Based on the findings extracted from the EMR document, each bipartite graph can be independently used to infer the disease distribution.


For Bayesian inference, we compute the posterior probability of diseases given the findings in the EMR document extracted by NER:

$$\Pr(d \mid F^+, F^-) = \frac{\Pr(d, F^+, F^-)}{\Pr(F^+, F^-)}, \quad d \in D, \tag{7}$$

where $F^+$ and $F^-$ are the sets of the positive and the negative findings in the given EMR document, respectively. Following Eq. (7), it is straightforward to get $\Pr(d \mid F^+_{sym}, F^-_{sym})$, $\Pr(d \mid F^+_{sign}, F^-_{sign})$ and $\Pr(d \mid F^+_{test}, F^-_{test})$, i.e. the predictions based on symptoms alone, vital signs alone and test report findings alone. To compute the joint probabilities $\Pr(d, F^+, F^-)$ and $\Pr(F^+, F^-)$, we refer the readers to the QuickScore method (Heckerman, 1990) and the derivation therein. To speed up computation when a disease is associated with too many positive findings, the variational method on the PGMs is applied (Jordan et al., 1999).
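For very small networks, the posterior of Eq. (7) can be computed by brute-force enumeration, which makes the noisy-OR semantics explicit. The sketch below is not QuickScore and not the variational method; it enumerates every disease configuration and is therefore exponential in |D|. The function name, the leak probability, the treatment of priors as independent per-disease probabilities, and all numbers in the demo are assumptions for illustration.

```python
from itertools import product

def noisy_or_posteriors(priors, weights, positive, negative, leak=1e-3):
    """priors: {disease: Pr(disease present)}, treated as independent (assumption).
    weights: {(finding, disease): Pr(disease alone causes finding)}.
    Returns {disease: Pr(disease | F+, F-)} under a noisy-OR bipartite network."""
    diseases = list(priors)
    joint = {d: 0.0 for d in diseases}   # accumulates Pr(d, F+, F-)
    evidence = 0.0                       # accumulates Pr(F+, F-)
    for state in product([0, 1], repeat=len(diseases)):
        present = [d for d, s in zip(diseases, state) if s]
        p = 1.0
        for d, s in zip(diseases, state):
            p *= priors[d] if s else 1.0 - priors[d]
        for f in list(positive) + list(negative):
            # Noisy-OR: the finding is absent only if the leak and every present cause fail.
            absent = 1.0 - leak
            for d in present:
                absent *= 1.0 - weights.get((f, d), 0.0)
            p *= (1.0 - absent) if f in positive else absent
        evidence += p
        for d in present:
            joint[d] += p
    return {d: joint[d] / evidence for d in diseases}

priors = {"URI": 0.3, "pneumonia": 0.2, "phthisis": 0.1}   # made-up priors
weights = {("pharyngalgia", "URI"): 0.8, ("cough", "pneumonia"): 0.7,
           ("hemoptysis", "phthisis"): 0.6}                # made-up link weights
posterior = noisy_or_posteriors(priors, weights, positive={"pharyngalgia"}, negative=set())
print(max(posterior, key=posterior.get))  # URI
```

In this sketch a disease with no link to the observed evidence simply keeps its prior, while a disease linked to a positive finding is pushed up sharply. QuickScore computes the same quantities without enumerating all configurations.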

Next, we assemble these bipartite graphs in different ways to get three variants of PGMs (Fig. 1).

(1) Parallel. This method independently performs inference with each type of finding and averages the results:

$$\Pr(d \mid F^+, F^-) = \mathrm{avg}\big(\Pr(d \mid F^+_{sym}, F^-_{sym}),\ \Pr(d \mid F^+_{sign}, F^-_{sign}),\ \Pr(d \mid F^+_{test}, F^-_{test})\big). \tag{8}$$

Parallel assumes that different types of entities support diagnosis in different ways, and that their predictions can complement each other. An extension of Parallel is to perform a weighted sum of the three predictions. For simplicity, we experiment with equal weights in this paper.

(2) Universal. This method mixes all types of findings together into a single network:

$$\Pr(d \mid F^+, F^-) = \Pr(d \mid F^+_{sym}, F^-_{sym}, F^+_{sign}, F^-_{sign}, F^+_{test}, F^-_{test}). \tag{9}$$

It means that Universal does not distinguish the types of entities and performs type-free Bayesian inference. Compared with the other two PGM variants, the connections between diseases and findings in Universal are much denser. It assumes that the prediction benefits from the joint inference by seeing more findings of multiple types at the same time.

(3) Cascade. This method constructs multi-layer Bayesian networks with finding types as layers and uses the output of the previous layer as the prior probability for the current layer:

$$
\begin{aligned}
\Pr(d_{sym}) &= \Pr(d \mid F^+_{sym}, F^-_{sym}) && \text{s.t. } d \sim \Pr(d_{CNN}),\\
\Pr(d_{sign}) &= \Pr(d \mid F^+_{sign}, F^-_{sign}) && \text{s.t. } d \sim \Pr(d_{sym}),\\
\Pr(d_{BN}) = \Pr(d_{test}) &= \Pr(d \mid F^+_{test}, F^-_{test}) && \text{s.t. } d \sim \Pr(d_{sign}),
\end{aligned} \tag{10}
$$

where $\Pr(d_{CNN})$ is the disease probability distribution computed by the convolutional networks in Sec. 3.2 and $d \sim \Pr(d_x)$ means that variable d follows the prior probability distribution $\Pr(d_x)$. Cascade first infers disease with symptoms alone and uses the disease probability from ECNN as priors. Then, it infers disease with vital signs alone and uses the disease probability from the symptom-based inference as priors. Finally, it infers disease with test report findings alone and uses the disease probability from the previous output as priors. We present the cascade approach in this order because it shows the best results compared with the other orders in our experiments. Cascade assumes that each type of entity can be used to refine the previous predictions by incorporating additional information.

The outputs of the above three PGMs are ensembled, e.g. by a weighted sum, as the final predictions. In all, the proposed framework takes the raw EMR document and the NER results as input, and outputs the diagnosis predictions.
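The way the three variants and the ensemble are wired together can be sketched as follows. The inference routine `toy_infer` is a deliberately simplified stand-in, not the paper's QuickScore-based inference: it applies Bayes' rule under the assumption that exactly one disease is present, with an assumed leak probability. All function names, weights and priors are hypothetical.

```python
def toy_infer(priors, positive, negative, weights, leak=0.01):
    """Stand-in inference (assumption): single-disease Bayes' rule with a leak term."""
    scores = {}
    for d, prior in priors.items():
        likelihood = 1.0
        for f in positive:
            likelihood *= 1.0 - (1.0 - leak) * (1.0 - weights.get((f, d), 0.0))
        for f in negative:
            likelihood *= (1.0 - leak) * (1.0 - weights.get((f, d), 0.0))
        scores[d] = prior * likelihood
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}

def parallel(priors, evidence, weights):
    """Eq. (8): infer with each finding type independently and average the results."""
    posts = [toy_infer(priors, pos, neg, weights) for pos, neg in evidence.values()]
    return {d: sum(p[d] for p in posts) / len(posts) for d in priors}

def universal(priors, evidence, weights):
    """Eq. (9): pool the findings of all types into a single inference."""
    pos = set().union(*(p for p, _ in evidence.values()))
    neg = set().union(*(n for _, n in evidence.values()))
    return toy_infer(priors, pos, neg, weights)

def cascade(priors, evidence, weights, order=("sym", "sign", "test")):
    """Eq. (10): the posterior of one layer becomes the prior of the next layer."""
    current = priors
    for finding_type in order:
        pos, neg = evidence[finding_type]
        current = toy_infer(current, pos, neg, weights)
    return current

def ensemble(outputs):
    """Unweighted average of the PGM outputs."""
    return {d: sum(o[d] for o in outputs) / len(outputs) for d in outputs[0]}

cnn_priors = {"URI": 0.5, "pneumonia": 0.3, "phthisis": 0.2}   # stands in for ECNN output
weights = {("pharyngalgia", "URI"): 0.8, ("enlarged tonsils", "URI"): 0.6,
           ("cough", "pneumonia"): 0.7, ("elevated WBC", "URI"): 0.4,
           ("elevated WBC", "pneumonia"): 0.5, ("hemoptysis", "phthisis"): 0.6}
evidence = {"sym": ({"pharyngalgia"}, {"cough"}),
            "sign": ({"enlarged tonsils"}, set()),
            "test": ({"elevated WBC"}, set())}
outputs = [parallel(cnn_priors, evidence, weights),
           universal(cnn_priors, evidence, weights),
           cascade(cnn_priors, evidence, weights)]
final = ensemble(outputs)
print(max(final, key=final.get))  # URI
```

With this toy stand-in, Cascade and Universal coincide up to rounding, because a chain of Bayesian updates under a single-disease model equals one joint update. The paper reports different accuracies for the two variants (Table 4), so its networks do not reduce to this simplification; the sketch only shows how the variants consume priors and evidence.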

Although we experiment with three types of entities in this paper, the proposed Bayesian network ensemble method is not limited to these types of entities. It is easy to add more entity types to the proposed method when applicable.

3.4 The Interpretability of BN Ensembles

One of the major contributions of this work is to bring interpretability into automatic diagnosis by stacking the Bayesian network ensembles on top of the convolutional networks. We use Fig. 2 to illustrate how the BN explains the predictions. For simplicity, we use the symptom-based bipartite graph as the example; the other types of entities explain the predictions in the same way.

In Fig. 2, if only pharyngalgia is extracted from a patient’s EMR, then upper respiratory infection (URI) will be predicted with high probability but the probability of pneumonia and phthisis will be set to the minimum because neither of them is likely to cause pharyngalgia based on their co-occurrences in the corpus. The proposed method can explain the prediction of URI with the symptom pharyngalgia and their co-occurrence count, in addition to the prediction probability.

Figure 2: The example of the interpretability of the Bayesian network. The connection from disease d to symptom s represents that d has some probability to cause s to be present. If d is diagnosed, the detected symptoms from the EMR that are connected with d can be used to explain the diagnosis.

If pharyngalgia and hemoptysis are both extracted from a patient’s EMR, then URI as well as phthisis will be predicted with some positive probability (their rankings depend on both their prior probability and their connection weights to pharyngalgia and hemoptysis), but pneumonia will be predicted with the minimum probability. This is because the noisy-OR gate is used in the Bayesian inference (Heckerman, 1990). The proposed method explains the prediction of URI with the positive finding of the symptom pharyngalgia and explains the prediction of phthisis with the positive finding of the symptom hemoptysis, together with their co-occurrences.
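The explanation step described above amounts to a lookup in the knowledge graph: for a predicted disease, list the positive findings in the EMR that are connected to it, together with their co-occurrence counts. A minimal sketch follows; the function name `explain` and all counts are hypothetical and are not taken from the paper's knowledge graph.

```python
def explain(disease, positive_findings, cooccurrence):
    """Return the detected findings linked to `disease`, most frequent first.
    cooccurrence: {(disease, finding): co-occurrence count in the corpus}."""
    linked = [(f, cooccurrence[(disease, f)])
              for f in positive_findings
              if (disease, f) in cooccurrence]
    return sorted(linked, key=lambda item: -item[1])

cooccurrence = {                     # made-up counts for illustration
    ("URI", "pharyngalgia"): 120,
    ("pneumonia", "cough"): 80,
    ("phthisis", "hemoptysis"): 45,
}
detected = ["pharyngalgia", "hemoptysis"]
for disease in ("URI", "phthisis", "pneumonia"):
    print(disease, explain(disease, detected, cooccurrence))
```

Here URI is explained by pharyngalgia and phthisis by hemoptysis, while pneumonia has no supporting finding among the detected symptoms, which mirrors the second scenario in the text.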

4 Experiments and Results

In this section, we introduce the data sets we experiment with and the evaluation results.

4.1 Data Sets

The proposed framework is evaluated on real EMR documents (mostly admission records). We have collaborated with several top hospitals in China and are authorized to conduct experiments with 275,797 EMR documents of two medical departments for the evaluation (see Table 3).⁶

⁶ Unfortunately, we have not yet obtained permission from the hospitals to make the evaluation data sets public because EMR documents are legally protected by Chinese law and contain much sensitive information about the patients and the doctors. We are currently working with the hospitals on contributing benchmark EMR data sets for automatic diagnosis, but it takes time due to the legal issues. We suggest that readers focus on the contribution of the novel automatic diagnosis framework in this paper.

Figure 3: The long-tail distribution of diagnosis. The x-axis indexes the names of diagnosis. The y-axis counts the occurrences of diagnosis in the log scale. [Plot: two series, Gynaecology and Respiration; x-axis ticks from 1 to 121; y-axis ticks from 1 to 100,000.]

Table 3: The statistics of the data sets. The table represents the document counts by source. # means the number of. “# collected” is the number of the collected EMR documents in our experiments.

| Departments | # collected | # test | # disease |
|---|---|---|---|
| Gynaecology | 191,645 | 606 | 33 |
| Respiration | 84,152 | 214 | 21 |

The collected EMR documents are processed as follows. The main diagnosis in each EMR document is extracted as its disease label. Then, we select the top diseases from the collected EMR documents, which results in 33 diseases from Gynaecology (including Salpingitis, Cervical Carcinoma, Endometritis, Fibroid, etc.) and 21 diseases from Respiration (including Upper Respiratory Infection, Chronic Bronchitis, Pneumonia, Asthma, Lung Cancer, etc.) that cover over 90% of all EMR documents. There is a long-tail distribution of EMR documents by disease as shown in Fig. 3, and each of the selected diseases has over 100 EMR documents for training. The other diseases are discarded in the experiments because there are not enough EMR documents to train a trustworthy model. Next, in order to ensure the validity of the disease labels in the test set, we recruit 10 professional physicians to review the labels by evenly sampling EMR documents under each disease. In this way, we collected 606 reviewed EMR documents for Gynaecology and 214 for Respiration as the test set (see the disease distribution in the supplemental files). The remaining EMR documents are used for training. Since we are not given the identity of the patient w.r.t. each EMR, the training and the testing sets are considered disjoint. In later experiments, we separately report the performance under both departments. It is more important and difficult to distinguish diseases within the same department than across departments due to the overlapping symptoms, signs and test report findings among similar diseases.
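The disease selection described above can be sketched as a frequency cut on the label distribution. The exact stopping rule is not given in the paper; the rule below (take diseases in descending frequency until 90% of the documents are covered, and never keep a disease with 100 documents or fewer) is an assumption, as are the function name and the toy counts.

```python
from collections import Counter

def select_diseases(labels, coverage=0.9, min_docs=100):
    """labels: one main-diagnosis label per EMR document.
    Returns the most frequent diseases, stopping once `coverage` is reached
    or the next disease has no more than `min_docs` documents (assumed rule)."""
    counts = Counter(labels)
    total = sum(counts.values())
    selected, covered = [], 0
    for disease, n in counts.most_common():
        if covered / total >= coverage or n <= min_docs:
            break
        selected.append(disease)
        covered += n
    return selected

labels = ["A"] * 600 + ["B"] * 300 + ["C"] * 150 + ["D"] * 50   # toy long-tail distribution
selected = select_diseases(labels)
print(selected)  # ['A', 'B', 'C']
```

In the toy distribution the three most frequent labels cover about 95% of the documents, so the rare label D is discarded, analogous to the long tail in Fig. 3.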


Table 4: The accuracy of the different diagnosis methods on two medical departments. Top-k sensitivity is used as the accuracy measurement.

| Methods | Gynaecology Top-1 | Gynaecology Top-3 | Respiration Top-1 | Respiration Top-3 |
|---|---|---|---|---|
| CAML (2018) | 58.6% | 76.3% | 60.7% | 82.7% |
| CNN (2018) | 61.0% | 82.8% | 61.7% | 80.8% |
| ACNN (2018) | 62.1% | 83.3% | 60.7% | 84.6% |
| PGM-C | 50.8% | 64.6% | 26.6% | 47.6% |
| PGM-P | 56.1% | 69.3% | 31.3% | 45.3% |
| PGM-U | 56.2% | 69.6% | 33.6% | 57.9% |
| PGM-E | 53.9% | 70.2% | 28.0% | 48.1% |
| ECNN | 68.9% | 86.7% | 65.8% | 81.7% |
| ECNN-PGM-C | 71.4% | 88.6% | 52.8% | 82.7% |
| ECNN-PGM-U | 72.9% | 88.6% | 59.3% | 87.8% |
| ECNN-PGM-P | 73.2% | 88.4% | 68.2% | 87.3% |
| ECNN-PGM-E | 73.4% | 88.8% | 64.0% | 88.3% |

4.2 Experimental Results

We conduct experiments on the collected data sets to evaluate the performance of the framework.

4.2.1 Experimental Settings

In the experiments, we used four CNN towers (N = 4) w.r.t. CC, HPI, PE and TR, and each tower has three channels with kernel lengths 3, 4 and 5 (representing 3-grams, 4-grams and 5-grams).

We use the Jieba package⁷ to perform Chinese word segmentation on the training set and remove the punctuation from the segmentation results. The segmented word corpus is used to train the 100-dimensional word embeddings using the Word2Vec (Mikolov et al., 2013) method (window of 5, minimum support of 5) implemented in the gensim package.⁸ The top 100,000 most frequent segmented words constitute the word vocabulary in the embedding layer of ECNN. Thus, the size of the embedding layer is (100000, 100).

Besides, the top 10,000 most frequent entities (not segmented words) as well as age and gender are used to construct the one-hot feature fed into the MLP, which consists of one hidden dense layer (128 sigmoid units) for efficiency. Similar to Kim (2014), the dropout rate is empirically set to 0.5. By default, we use the average of all six relation weights in the experiments. The final predictions are the average of the three PGM variants. ECNN and the PGMs are trained separately offline.

⁷ https://github.com/fxsjy/jieba
⁸ https://radimrehurek.com/gensim/

4.2.2 Performance Accuracy

Table 4 shows the Top-k sensitivity under two departments, i.e., the micro average of the per-disease Top-k sensitivity, which is commonly used as the accuracy measurement in healthcare studies (Liang et al., 2019). Sensitivity is usually used in binary classification (mostly outputting yes or no). For multi-class classification, it extends as follows: the proposed automatic diagnosis model outputs a probability distribution over K diseases (classes) for a given EMR. Suppose that, among the $n_i$ EMRs of disease $d_i$, there are $l_i$ cases in which $d_i$ is included in the Top-k predictions (ranked by probability). The Top-k sensitivity of the proposed model on disease $d_i$ is $l_i / n_i$. Furthermore, in the overall evaluation of the proposed model on all diseases, we use the micro average over all classes as the overall Top-k sensitivity:

$$\text{sensitivity} = \frac{\sum_i l_i}{\sum_i n_i}. \tag{11}$$
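As a worked illustration of the per-disease ratio and of Eq. (11), the sketch below computes both from a list of labeled predictions. The function name `top_k_sensitivity`, the record layout and the toy probabilities are assumptions made for the example, not part of the original evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def top_k_sensitivity(records: Iterable[Tuple[str, Dict[str, float]]],
                      k: int) -> Tuple[Dict[str, float], float]:
    """records: (true_disease, {disease: predicted probability}) pairs."""
    hits = defaultdict(int)    # l_i: EMRs of disease i whose label is in the Top-k
    totals = defaultdict(int)  # n_i: all EMRs of disease i
    for true_disease, probs in records:
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        totals[true_disease] += 1
        hits[true_disease] += int(true_disease in top_k)
    per_disease = {d: hits[d] / totals[d] for d in totals}
    micro = sum(hits.values()) / sum(totals.values())  # Eq. (11)
    return per_disease, micro


# Four made-up EMRs over three respiratory diseases.
records = [
    ("URI", {"URI": 0.7, "Asthma": 0.2, "COPD": 0.1}),
    ("URI", {"URI": 0.3, "Asthma": 0.6, "COPD": 0.1}),
    ("Asthma", {"URI": 0.5, "Asthma": 0.4, "COPD": 0.1}),
    ("COPD", {"URI": 0.2, "Asthma": 0.1, "COPD": 0.7}),
]
per_disease_top1, micro_top1 = top_k_sensitivity(records, k=1)
per_disease_top2, micro_top2 = top_k_sensitivity(records, k=2)
print(per_disease_top1, micro_top1)
print(per_disease_top2, micro_top2)
```

The micro average pools hits and counts over all diseases before dividing, so diseases with more EMRs weigh more than they would in an unweighted mean of the per-disease values.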

Among the baselines, CAML (Mullenbach et al., 2018) performs label-wise attention on top of a CNN model. CNN (Yang et al., 2018) concatenates CC, HPI and TR together before sending them to the multi-channel CNN model. ACNN (Girardi et al., 2018) incorporates gram-level attention into a CNN model. The hyperparameter settings of the baselines are taken from the original papers. Besides, they share the same training set, training epochs, learning rate and batch size with the proposed methods.

Among the proposed methods, PGM-* (-C, -P, -U and -E represent Cascade, Parallel, Universal and Ensemble, respectively) are the methods that rely solely on the Bayesian networks, which use the disease distribution in the training set as the prior probability. ECNN is the proposed method without the BN ensembles. ECNN-PGM-* are the combined methods, and ECNN-PGM-E is the proposed method with ECNN and the Bayesian network ensembles shown in Figure 1. According to the results: (1) Most of the proposed ECNN-PGM-* methods outperform the previous automatic diagnosis methods, which shows the effectiveness of the proposed methods. (2) ECNN outperforms CNN due to the incorporation of medical entities. Jointly modeling free texts and medical entities yields higher accuracy than modeling either one alone. (3) Stacking Bayesian networks on top of the neural networks is very likely to further improve the performance, especially with the ensemble of the predictions from multiple PGMs.


Figure 4: Top-1 sensitivity by diseases (vertical axis from 0 to 1). (a) Gynaecology: Salpingitis, Fibroid, Pelvic Infection, Vulvitis, Mole, Cervical Polyp, Cervical Carcinoma, Ovarian Tumor, Female Infertility, Endometriosis. (b) Respiration: URI, Chronic Bronchitis, Pneumonia, Asthma, Lung Cancer, COPD, Pulmonary Abscess, Bronchiectasia, Pulmonary Embolism, Respiratory Failure.

4.2.3 Error Analysis

Fig. 4 shows the Top-1 sensitivity on some diseases. The performance varies considerably across diseases. For example, the Top-1 sensitivity of Salpingitis is 100% but that of Endometriosis is 29% in the evaluation. Salpingitis can be identified by combining general symptoms and ultrasonic exam results. However, from the perspective of physicians, Endometriosis is difficult to diagnose by nature because it shares common symptoms like dysmenorrhea and irregular menstruation with other gynecologic diseases. These shared findings misguide the classifier towards other similar diseases. Similarly, among the respiratory diseases, patients with Pulmonary Embolism, Respiratory Failure and Bronchiectasia share the symptom of dyspnea, which makes it difficult to distinguish between them. In contrast, Upper Respiratory Infection (URI) is easy to diagnose because, unlike the other respiratory diseases, it causes throat pain and rhinorrhea.

Based on the analysis, the diagnosis performance on a disease is higher if it shares fewer findings with other diseases or has more specific findings.

4.2.4 Interpretability

The interpretability is reflected in the observed findings in the EMR that connect to the predicted disease in the medical knowledge graph, as well as in their co-occurrences. We generate the prediction explanation with the following template: The patient is diagnosed as disease $d$ because (s)he is suffering from symptom $s_i$, and (s)he has the vital sign of $v_j$, and the lab test (or PACS report) shows (s)he has $t_k$. Besides, $s_i$, $v_j$ and $t_k$ have been found on the patients of $d$ for $n_i$, $n_j$, $n_k$ times, respectively, in the previous EMR documents that support this diagnosis.
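Filling the template can be sketched with a small function. The function name `render_explanation`, the restriction to one finding of each type, and all findings and co-occurrence counts below are made up for illustration; in the framework they would come from the EMR, the medical knowledge graph and the training corpus.

```python
from typing import Dict


def render_explanation(disease: str, symptom: str, vital_sign: str,
                       test_finding: str, counts: Dict[str, int]) -> str:
    """Fill the explanation template with one finding of each type."""
    return (
        f"The patient is diagnosed as {disease} because (s)he is suffering from "
        f"{symptom}, and (s)he has the vital sign of {vital_sign}, and the lab test "
        f"(or PACS report) shows (s)he has {test_finding}. Besides, {symptom}, "
        f"{vital_sign} and {test_finding} have been found on the patients of "
        f"{disease} for {counts[symptom]}, {counts[vital_sign]}, "
        f"{counts[test_finding]} times, respectively, in the previous EMR "
        f"documents that support this diagnosis."
    )


# Hypothetical co-occurrence counts of each finding with the predicted disease.
support = {"cough": 120, "fever": 95, "lung infiltrate": 60}
text = render_explanation("pneumonia", "cough", "fever", "lung infiltrate", support)
print(text)
```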

Since the extracted relations in the medical knowledge graph are reviewed by certified physicians, the validity of the explanation is guaranteed from the clinical perspective. We randomly select 50 testing samples per department whose Top-1 diagnosis prediction is correct and generate the explanation for the diagnosis prediction with the above template. The explanation is evaluated by three certified physicians. The evaluation is subjective, but all of them agree that the prediction is well supported by the generated explanation.

Figure 5: The accuracy of ECNN-PGM-E using different types of features (horizontal axis: feature type, one of MI, Occ, TFC, TF-IDF, TF-IWF, CHI and All; vertical axis: accuracy, from 0.55 to 0.90; series: Gyn-Top1, Res-Top1, Gyn-Top3, Res-Top3). Gyn and Res represent gynaecology and respiration, respectively. MI and Occ are mutual information and occurrence, respectively.

4.2.5 Feature Importance

Figure 5 shows the accuracy using different types of features. In this evaluation, TFC, TF-IDF and the average of all features tend to yield higher accuracy than the other features, with the Top-3 accuracy exceeding 88%.

In all, the above experiments show that the proposed framework can improve the accuracy of automatic diagnosis and bring reasonable interpretability into the predictions at the same time.

5 Conclusion

In this paper, we investigate the problem of automatic diagnosis with EMR documents for clinical decision support. We propose a novel framework that stacks Bayesian Network ensembles on top of Entity-aware Convolutional Neural Networks. The proposed design brings interpretability into the predictions, which is very important for AI-empowered healthcare, without compromising the accuracy of convolutional networks. The evaluation conducted on real EMR documents from hospitals validates the effectiveness of the proposed framework compared to the baselines in automatic diagnosis with EMR.

Acknowledgement

We thank all the professional physicians led by Dr. Shi and Dr. Hu who have contributed to the annotation tasks in our experiments.


References

Padmanabhan Anandan, Yan Huang, Kazumi Nishikawa, BBorie Park, Eric S. Sullivan, Jingyu Wang, and Xu Shan. 2019. AI in health care: Capacity, capability, and a future of active health in Asia. MIT Technology Review Insights, pages 1–25. https://mittrinsights.s3.amazonaws.com/ai-healthcare-asia.pdf

Roberto Basili, Alessandro Moschitti, and Maria Teresa Pazienza. 1999. A text classifier based on linguistic processing. In IJCAI Workshop on Machine Learning and Information Filtering, Stockholm, Sweden.

Tal Baumel, Jumana Nassour-Kassis, Raphael Cohen, Michael Elhadad, and Noemie Elhadad. 2017. Multi-label classification of patient notes: a case study on ICD code assignment. In AAAI Workshops, pages 409–416.

Eta S. Berner. 2007. Clinical Decision Support Systems. Springer. https://doi.org/10.1007/978-0-387-38319-4

Jun Chen, Jingbo Zhou, Zhenhui Shi, Bin Fan, and Chengliang Luo. 2019. Knowledge abstraction matching for medical question answering. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 342–347.

Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F Stewart, and Jimeng Sun. 2016a. Doctor AI: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pages 301–318.

Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. 2016b. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In NeurIPS, pages 3504–3512.

Dai Dai, Xinyan Xiao, Yajuan Lyu, Shan Dou, Qiaoqiao She, and Haifeng Wang. 2019. Joint extraction of entities and overlapping relations using position-attentive sequence labeling. In AAAI, Honolulu, Hawaii, USA.

Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. 2019. A guide to deep learning in healthcare. Nature Medicine, 25:24–29. https://doi.org/10.1038/s41591-018-0316-z

Ivan Girardi, Pengfei Ji, An phi Nguyen, Nora Hollenstein, Adam Ivankay, Lorenz Kuhn, Chiara Marchiori, and Ce Zhang. 2018. Patient risk assessment and warning symptom detection using deep attention-based neural networks. In EMNLP Workshop, pages 139–148, Brussels, Belgium.

David Heckerman. 1990. A tractable inference algorithm for diagnosing multiple diseases. Machine Intelligence and Pattern Recognition, 10:163–171. https://doi.org/10.1016/B978-0-444-88738-2.50020-8

Wei Jia, Dai Dai, Xinyan Xiao, and Hua Wu. 2019. ARNOR: Attention regularization based noise reduction for distant supervision relation classification. In ACL, pages 1399–1408, Florence, Italy.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37:183–233. https://doi.org/10.1023/A:1007665907178

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP, pages 1746–1751, Doha, Qatar.

Theresa A Koleck, Caitlin Dreisbach, Philip E Bourne, and Suzanne Bakken. 2019. Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review. Journal of the American Medical Informatics Association, pages 364–379. https://doi.org/10.1093/jamia/ocy173

Christy Li, Dimitris Konomis, Graham Neubig, Pengtao Xie, Carol Cheng, and Eric Xing. 2017. Convolutional neural networks for medical diagnosis from admission notes. In arXiv.

Huiying Liang, Brian Y. Tsui, Hao Ni, Carolina C. S. Valentim, Sally L. Baxter, et al. 2019. Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence. Nature Medicine, 25:433–438. https://doi.org/10.1038/s41591-018-0335-9

Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. van der Laak, Bram van Ginneken, and Clara I. Sanchez. 2017. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88. https://doi.org/10.1016/j.media.2017.07.005

Qianlong Liu, Zhongyu Wei, Baolin Peng, Xiangying Dai, Huaixiao Tou, Ting Chen, Xuanjing Huang, and Kam fai Wong. 2018. Task-oriented dialogue system for automatic diagnosis. In ACL, pages 201–207, Melbourne, Australia.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representation of words and phrases and their compositionality. In NeurIPS, pages 3111–3119.

James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. In NAACL, pages 1101–1111, New Orleans, Louisiana, USA.

Aaditya Prakash, Siyuan Zhao, Sadid A. Hasan, Vivek Datla, Kathy Lee, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2017. Condensed memory networks for clinical diagnostic inferencing. In AAAI, pages 3274–3280.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523. https://doi.org/10.1016/0306-4573(88)90021-0


Ying Sha and May D. Wang. 2017. Interpretable predictions of clinical outcomes with an attention-based recurrent neural network. In ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 233–240, Boston, MA, USA.

Xue Shi, Yingping Yi, Ying Xiong, Buzhou Tang, Qingcai Chen, Xiaolong Wang, Zongcheng Ji, Yaoyun Zhang, and Hua Xu. 2019. Extracting entities with attributes in clinical text via joint deep learning. Journal of the American Medical Informatics Association, pages 1584–1591. https://doi.org/10.1093/jamia/ocz158

Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In ICML, pages 412–420, Nashville, TN, USA.

Zhongliang Yang, Yongfeng Huang, Yiran Jiang, Yuxi Sun, Yu-Jin Zhang, and Pengcheng Luo. 2018. Clinical assistant diagnosis for electronic medical record based on convolutional neural network. Scientific Reports, 8(6329). https://doi.org/10.1038/s41598-018-24389-w

Shiyue Zhang, Pengtao Xie, Dong Wang, and Eric P. Xing. 2017. Medical diagnosis from laboratory tests by combining generative and discriminative learning. In arXiv.
