Deepr: A Convolutional Net for Medical Records
Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, Svetha Venkatesh
Abstract: Feature engineering remains a major bottleneck when creating predictive systems from electronic medical records. At present, an important missing element is detecting predictive regular clinical motifs from irregular episodic records. We present Deepr (short for Deep record), a new end-to-end deep learning system that learns to extract features from medical records and predicts future risk automatically. Deepr transforms a record into a sequence of discrete elements separated by coded time gaps and hospital transfers. On top of the sequence is a convolutional neural net that detects and combines predictive local clinical motifs to stratify the risk. Deepr permits transparent inspection and visualization of its inner working. We validate Deepr on hospital data to predict unplanned readmission after discharge. Deepr achieves superior accuracy compared to traditional techniques, detects meaningful clinical motifs, and uncovers the underlying structure of the disease and intervention space.
I. INTRODUCTION
A major theme in modern medicine is prospective healthcare, which refers to the capability to estimate the future medical risks for individuals. These risks can include readmission after discharge, the onset of specific diseases, and worsening from a condition [42]. Such capability would facilitate timely prevention or intervention for maximum health impact, and provide a major step toward personalized medicine. An important data resource in aiding this process is the electronic medical record (EMR) [20]. EMRs contain a wealth of patient information over time. Central to EMR-driven risk prediction is patient representation, also known as feature engineering. Representing an EMR amounts to extracting relevant historical signals to form a feature vector.
However, feature extraction in EMR is challenging [44]. An EMR typically consists of a sequence of time-stamped visit episodes, each of which has a subset of coded diagnoses, a subset of procedures, lab tests and textual narratives. The data is irregular at patient level. EMR is episodic: events are only recorded when patients visit clinics, and the time gap between two visits is largely random. Representing irregular timing poses a major challenge. EMR varies greatly in length: young patients usually have just one visit for an acute condition, but old patients with chronic conditions may have hundreds of visits. At the same time, the data is regular at local episode level. Diseases tend to form clusters (comorbidity) [41] and the disease progression may be dictated by the underlying biological processes [49]. Likewise, treatments may follow a certain protocol or best practice guideline [17], and there are well-defined disease-treatment interactions [39]. These regularities can be thought of as clinical motifs. Thus an effective EMR representation should be able to identify regular clinical motifs out of irregular data.
Existing EMR-driven predictive work often relies on high-dimensional sparse feature representation, where features are engineered to capture certain regularities of the data [11], [20]. This feature engineering practice is effort intensive and non-adaptive to varying medical records systems. Automated feature representation based on bag-of-words (BoW) is scalable, but it breaks collocation relations between words and ignores the temporal nature of the EMR, and thus fails to properly address the aforementioned challenges.
In this work we present a new prediction framework called Deepr that does not require manual feature engineering. The technology is based on deep learning, an approach that aims to build a multilayered neural learning system like a brain [25]. When fed with a large amount of raw data, the system learns to recognize patterns with little help from domain experts. Deep learning now powers speech recognition in Google Voice, self-driving cars at Google and Baidu, the question answering system at IBM (Watson), and smart assistants at Facebook. It already has a great impact on hundreds of millions (if not billions) of people. But healthcare has largely been ignored. We hypothesize that a key to applying deep learning to healthcare is patient representation, which requires proper handling of the irregular nature of episodes mentioned above [37]. Deepr fills the gap by offering an end-to-end technology that learns to represent patients from scratch. It reads medical records, learns the local patterns, adapts to irregular timing, and predicts personalized risk.
The architecture of Deepr is multilayered and is inspired by recent convolutional neural nets (CNNs) in natural languages [9], [21], [25], [30], [51]. The most crucial operation occurs at the bottom level, where Deepr transforms an EMR into a “sentence” of multiple phrases separated by special “words” that represent time gaps. Each phrase is a visit episode. As with syntactical grammars and collocation patterns in NLP, there might exist “health grammars” and “clinical patterns” in healthcare. Health grammars refer to latent biological and environmental laws that dictate the global evolution of one’s health over time, e.g., probable progression from “diabetes type II” to “renal failure”. To handle irregular timing, time gaps and transfers are treated as special words. With this representation, an EMR is transformed into a sentence of variable length that retains all important events. The other layers of Deepr constitute a CNN, which is similar to those in [9], [21], [51]. First, words are embedded into a continuous vector space. Next, words in the sentence are passed through a convolution operation which detects local motifs. Local motifs are then pooled to form a global feature vector, which is passed into a classifier, which predicts the future risk. All components are learned at the same time from data: the data signals are passed from the data to the output, and the training signals are propagated back from the labels to the motif detectors. Hence Deepr is end-to-end.
We validate Deepr on a large database of 300K patients collected from a hospital chain in Australia. We focus on predicting unplanned readmission within 6 months after discharge. Compared to existing bag-of-words representation, Deepr demonstrates a superior accuracy as well as the capacity to learn predictive clinical motifs, and to uncover the underlying structure of the space of diseases and interventions.
arXiv:1607.07519v1 [stat.ML] 26 Jul 2016
To summarize, we claim the following contributions:
• A novel representation of irregular-time EMR as a sentence with time gaps and transfers as special words.
• A novel deep learning architecture called Deepr that (i) uncovers the structure of the disease/treatment space, (ii) discovers clinical motifs, (iii) predicts future risk and (iv) explains the prediction by identifying motifs with strong responses in each record. The system is end-to-end, and its inner working can be inspected and visualized, allowing interpretability and transparency.
• An evaluation of these claimed capabilities on a large-scale dataset of 300K patients.
II. BACKGROUND
a) Medical records: An electronic medical record (EMR) contains information about patient demographics and a sequence of hospital visits for a patient. Admission information may include admission time, discharge time, lab tests, diagnoses, procedures, medications and clinical narratives. Diagnoses, procedures and medications are discrete entities. Diagnoses may be represented using ICD-10 coding schemes¹. For example, in ICD-10, E10 refers to Type 1 diabetes mellitus and E11 to Type 2 diabetes mellitus. The procedures are typically coded in CPT (Current Procedural Terminology) or ICHI (International Classification of Health Interventions) schemes². One of the most important secondary uses of EMR is building predictive models [20], [31], [44], [46].
Most existing prediction methods on EMRs either rely on manual feature engineering [31] or simplistic extraction [44]. They either ignore long-term dependencies or do not adequately capture variable length [2], [31], [44]. Neither are they able to model temporal irregularity [18], [29], [44], [49]. Capturing disease progression has been of great interest [19], [29], and much effort has been spent on Markov models [14], [18], [49]. As Markov processes are memoryless, Markov models forget severe conditions of the past when they see an admission due to a common cold. This is undesirable. A proper model, therefore, must be non-Markovian and able to capture long-term dependencies.
b) Deep learning: Deep learning is an approach in machine learning, aiming at producing end-to-end systems that learn from raw data and perform desired tasks without manual feature engineering. The current wave of deep learning was initiated by the seminal work of [15] in 2006, but deep learning has been developed for decades [40]. Over the past few years, deep learning has broken records in cognitive domains such as vision, speech and natural languages [25]. Current deep learning is mostly based on multilayered neural networks [40]. All the networks share a common unit, the neuron, which is a simple computational device that applies a nonlinear transform to a linear function of inputs, i.e., $f(x) = \sigma\left(b + \sum_i w_i x_i\right)$. Almost all networks thus far are trained using back-propagation [50], thus enabling end-to-end learning.

¹ http://apps.who.int/classifications/icd10/browse/2016/en
² http://www.who.int/classifications/ichi/en/
There are three main deep neural architectures in practice: feedforward, recurrent and convolutional. Feedforward nets (FFN) pass unstructured information from one end to the other, usually from an input to an output, hence they act as a universal function approximator [16]. Recurrent nets (RNN) model dynamics over time (and space) using self-replicated units. They maintain some degree of memory, and thus have the potential to capture long-term dependencies. RNNs are powerful computational machines: they can approximate any program [27]. Convolutional nets (CNN) exploit the repeated local motifs across time and space, and thus are translation-invariant, a capacity often seen in the human visual cortex [24]. Local motifs are small pieces of data, usually of pre-defined sizes, e.g., a patch of pixels, or an n-gram of words. A CNN is often equipped with pooling operations to reduce the resolution and enlarge the motifs.
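As a small illustration of the neuron $f(x) = \sigma(b + \sum_i w_i x_i)$ described above, the following sketch evaluates a single unit. The choice of a logistic function for $\sigma$ and the name `neuron` are assumptions made for the example; the text only requires that $\sigma$ be nonlinear.

```python
import math

def neuron(x, w, b):
    """Single unit: a nonlinearity applied to a linear function of the inputs.

    Assumption: sigma is the logistic function; any nonlinear transform fits
    the description in the text.
    """
    pre_activation = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-pre_activation))

# With zero weights and zero bias the logistic unit is exactly at its midpoint.
print(neuron([1.0, 2.0], [0.0, 0.0], 0.0))
```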
III. Deepr: A DEEP NET FOR MEDICAL RECORDS
In this section, we describe our deep neural net named Deepr (short for Deep net for medical Record) for representing Electronic Medical Records (EMR) and predicting the future risk.
A. Deepr Overview
Deepr is a multilayered architecture based on convolutional neural nets (CNNs). The information flow is summarized in Fig. 1. At the bottom level, Deepr sequences the EMR into a “sentence”, or equivalently, a sequence of “words”. Each word represents a discrete object or event such as a diagnosis, a procedure, or any derived object such as a time interval or hospital transfer. The next layer embeds words into a Euclidean space. On top of the embedding layer is a CNN that reads a small chunk of words in a sliding window to identify local motifs. The local motifs are transformed by a Rectified Linear Unit (ReLU), which is a nonlinear function. All the transformed motifs are then max-pooled across the sentence to derive an EMR-level feature vector. Finally, a linear classifier is placed at the top layer for prediction. The entire architecture of Deepr can be summarized as a function $f(r)$ for record $r$:

$$f(r) \leftarrow \mathrm{Class}\left(\mathrm{Pool}\left\{\mathrm{ReLU}\left(\mathrm{Conv}\left[\mathrm{Embed}\left\{\mathrm{Seq}(r)\right\}\right]\right)\right\}\right) \tag{1}$$
The CNN plays a crucial role as it detects clinical motifs that are predictive. Clinical motifs are co-occurrences of diseases (also known as comorbidity), disease progression, patterns of disease/treatment, and patterns of collocating treatments [21]. However, as the CNN is supervised, it requires labels, which may not always be available (e.g., new patients with short history). A possible enhancement is to pretrain the embedding layer with a powerful tool known as word2vec [34]. As word2vec is unsupervised and relies on local collocation patterns, clinical motifs can be pre-detected, and then further refined through the CNN with supervising signals.
Figure 1. Overview of Deepr for predicting future risk from medical record. Top-left box depicts an example of a medical record with multiple visits, each of which has multiple coded objects (diagnosis & procedure). The future risk is unknown (question mark (?)). Steps from left to right: (1) Medical record is sequenced into phrases separated by coded time-gaps/transfers; then from bottom to top: (2) Words are embedded into continuous vectors, (3) local word vectors are convolved to detect local motifs, (4) max-pooling to derive a record-level vector, (5) classifier is applied to predict an output, which is a future event. Best viewed in color.
B. Sequencing EMR
This task refers to transforming an EMR into a sentence, which is essentially a sequence of words. We present here how the words are defined and arranged in the sentence.
Recall that an EMR is a sequence of time-stamped visit episodes. Each episode may contain many pieces of information, but for the purpose of this work, we focus mainly on diagnoses and treatments (which involve clinical procedures and medications). For simplicity, we do not assume perfect timing of each piece, and thus an episode is a finite set of discrete words (diagnoses and treatments). The episode is then sequenced into a phrase. The order of the elements in the phrase may follow the pre-defined ordering by the EMR system, for example, primary diagnosis is placed first, followed by secondary diagnoses, followed by procedures. In the absence of this information, we may randomize the elements.
Within an episode, occasionally, there are one or more transfers between care providers, for example, between separate departments of the same hospital, or between hospitals. In these cases, an admission is a phrase, and an episode is a subset of phrases separated by a transfer event. We create a special word TRANSFER for this event. Between two consecutive episodes, there is a time gap, whose duration is generally randomly distributed. We discretize the time gap into five intervals, measured in months: (0-1], (1-3], (3-6], (6-12], 12+. Each interval is assigned a unique identifier, which is treated as a word. For example, 0-1m is a word for the (0-1] interval gap. With these treatments, an EMR is a sentence of phrases separated by words for transfers or time gaps. The phrases are ordered by their natural time-stamps. For robustness, infrequent words are coded as RAREWORD.
The following is an example of a sentence, where diagnoses are in ICD-10 format (a character followed by digits), and procedures are in digits:

1910 Z83 911 1008 D12 K31 1-3m R94 RAREWORD H53 Y83 M62 Y92 E87 T81 RAREWORD RAREWORD 1893 D12 S14 738 1910 1916 Z83 0-1m T91 RAREWORD Y83 Y92 K91 M10 E86 6-12m K31 1008 1910 Z13 Z83.

Here the phrases are: [1910 Z83 911 1008 D12 K31], [R94 RAREWORD H53 Y83 M62 Y92 E87 T81 RAREWORD RAREWORD 1893 D12 S14 738 1910 1916 Z83], [T91 RAREWORD Y83 Y92 K91 M10 E86], and [K31 1008 1910 Z13 Z83]. The time separators are: [1-3m], [0-1m], and [6-12m]. Note that within each phrase, the ordering of words has been randomized.
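The sequencing step described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the input layout (a list of episodes, each carrying the gap in months since the previous episode and a list of admissions) and the function names are assumptions, as are the identifiers `3-6m` and `12+m`, which are formed by analogy with the `0-1m`, `1-3m` and `6-12m` words that appear in the text.

```python
def gap_word(months):
    """Map a time gap in months to one of the five interval words.

    Intervals are right-closed: (0-1], (1-3], (3-6], (6-12], 12+.
    """
    if months <= 1:
        return "0-1m"
    if months <= 3:
        return "1-3m"
    if months <= 6:
        return "3-6m"   # assumed spelling
    if months <= 12:
        return "6-12m"
    return "12+m"       # assumed spelling

def sequence_record(episodes):
    """Turn a record into a flat sentence of words.

    episodes: time-ordered list of (gap_months, admissions) pairs, where
    gap_months is None for the first episode and admissions is a list of
    phrases (each a list of code words). Admissions within one episode are
    separated by the special word TRANSFER.
    """
    sentence = []
    for gap_months, admissions in episodes:
        if gap_months is not None:
            sentence.append(gap_word(gap_months))
        for k, phrase in enumerate(admissions):
            if k > 0:
                sentence.append("TRANSFER")
            sentence.extend(phrase)
    return sentence

record = [
    (None, [["1910", "Z83", "D12"]]),
    (2.0, [["R94", "H53"], ["Y83"]]),   # one transfer inside this episode
]
print(" ".join(sequence_record(record)))
```

Randomizing the word order within a phrase, when the EMR system gives no ordering, could be done by shuffling each phrase before it is appended.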
C. Convolutional Net
c) Embedding: The first step when applying convolutional nets on a sentence is to represent discrete words as continuous vectors. One way is to use the so-called one-hot coding, that is, each word is a binary vector of all zeros, except for just one position indexed by the word. However, this representation creates a high-dimensional vector, which may lead to overfitting and expensive computation. Alternatively, we can use word embedding, which refers to assigning a dense continuous vector to a discrete word. For example, the second word [Z83] in the example above may be assigned the 3D vector (0.1, -2.3, 0.5). In practice, we maintain a look-up table indexed by words, i.e., $E(w) \in \mathbb{R}^m$ is the vector for word $w$. The embedding table $E$ is learnable. Applying word embedding to the sentence yields a sequence of vectors, where the vector at position $t$ is $x_t = E(w_t)$.
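The look-up table $E$ can be pictured as a dictionary from words to vectors. The sketch below is illustrative only: the random initialization range, the function names, and the fallback of unseen words to the RAREWORD vector are assumptions for the example.

```python
import random

def make_embedding(vocabulary, m, seed=0):
    """Build a look-up table E: word -> vector in R^m (random start, learnable in training)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.1, 0.1) for _ in range(m)] for w in vocabulary}

def embed(sentence, E):
    """Apply x_t = E(w_t) to every word; unseen words fall back to RAREWORD (an assumption)."""
    return [E.get(w, E["RAREWORD"]) for w in sentence]

E = make_embedding(["1910", "Z83", "1-3m", "TRANSFER", "RAREWORD"], m=3)
xs = embed(["1910", "Z83", "1-3m"], E)
print(len(xs), len(xs[0]))
```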
d) Convolution: On top of the word embedding layer is a convolutional layer. Each convolution operation reads a sliding window of size $2d+1$ and produces $p$ filter responses as follows:

$$z_t = \mathrm{ReLU}\left(b + \sum_{j=-d}^{d} W_j x_{t+j}\right) \tag{2}$$

where $z_t \in \mathbb{R}^p$ is the filter response vector at position $t$, $W_j \in \mathbb{R}^{p \times m}$ is the convolution kernel at relative position $j$ (hence, $W \in \mathbb{R}^{p \times m \times (2d+1)}$), $b$ is the bias, and $\mathrm{ReLU}(x) = \max\{0, x\}$ (element-wise). When it is clear from the context, we use “filter” to refer to the learnable device that detects motifs, which are manifestations of filters in real data. The rectified linear function enhances strong signals and eliminates weak ones. The bias $b$ and the kernel tensor $W$ are learnable.
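Eq. (2) can be written out directly in a few lines. The sketch below is a plain-Python illustration under stated assumptions: the kernel is stored as a dictionary from relative position $j$ to a $p \times m$ matrix, and only windows that fit entirely inside the sentence are evaluated (the text does not say how sentence boundaries are handled).

```python
def conv_relu(xs, W, b):
    """Compute z_t = ReLU(b + sum_{j=-d..d} W_j x_{t+j}) for each full window.

    xs: list of T word vectors in R^m
    W:  dict mapping j in {-d, ..., d} to a p x m matrix (list of p rows)
    b:  bias vector in R^p
    """
    d = max(W)
    p = len(b)
    responses = []
    for t in range(d, len(xs) - d):          # assumption: no padding at the ends
        z_t = []
        for k in range(p):                   # one response per filter
            s = b[k]
            for j in range(-d, d + 1):
                s += sum(w * x for w, x in zip(W[j][k], xs[t + j]))
            z_t.append(max(0.0, s))          # rectification
        responses.append(z_t)
    return responses

# Toy case with m = 1, p = 1, d = 1.
W = {-1: [[1.0]], 0: [[1.0]], 1: [[1.0]]}
print(conv_relu([[1.0], [2.0], [3.0], [-10.0]], W, [-2.0]))
```

In the toy case the first window sums to 4 after the bias and passes through unchanged, while the second sums to a negative value and is set to zero, which is the "enhance strong signals, eliminate weak ones" behavior described above.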
e) Pooling: Once the local filter responses are computed by the convolutional layer, we need to pool all the responses to derive a global sentence-level vector. We apply here the max-pooling operator:

$$z = \max_t \{z_t\} \tag{3}$$

where the max is element-wise. Thus the pooled vector $z$ lives in the same space $\mathbb{R}^p$ as the filter responses $\{z_t\}$. Like the rectifier used in Eq. (2), this max-pooling further enhances strong signals across the words in the sentence.
f) Classifier: The final layer of Deepr is a classifier that takes the pooled information and predicts the outcome: $f(r) = \mathrm{classifier}(z(r))$ for record $r$. The main requirement is that the classifier must allow gradients to propagate down to lower layers. Examples include a linear classifier (e.g., logistic regression) or a non-linear parametric classifier (e.g., a neural network).
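The pooling of Eq. (3) and a logistic classifier on top of it can be sketched as below. The function names and the toy numbers are illustrative assumptions; logistic regression is one of the classifier examples named in the text.

```python
import math

def max_pool(responses):
    """Element-wise max over positions: z = max_t z_t, so z stays in R^p."""
    return [max(column) for column in zip(*responses)]

def logistic_classifier(z, weights, bias):
    """Probability of the positive outcome given the pooled vector z."""
    score = bias + sum(w * zi for w, zi in zip(weights, z))
    return 1.0 / (1.0 + math.exp(-score))

responses = [[0.0, 2.0], [1.5, 0.5]]      # two positions, p = 2 filters
z = max_pool(responses)
print(z, logistic_classifier(z, [0.0, 0.0], 0.0))
```

Because the max keeps, for each filter, only its strongest response in the sentence, records of different lengths all map to a vector of the same size $p$.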
D. Training
Deepr has multiple trainable parameters: embedding matrix, biases, convolution kernels, and classifier-specific parameters. As the number of trainable parameters is often large, it necessitates regularizers such as weight shrinkage (e.g., via the $\ell_2$ norm) or dropout [43]. For training we also need to specify a loss function, which depends on the nature of the classifier. For example, for a binary outcome (e.g., readmission), a logistic classifier is usually trained on cross-entropy loss. Training starts with (random) initialization of parameters, which are then refined through back-propagation and stochastic gradient descent (SGD). This requires gradients with respect to trainable parameters. Gradient computation is often tedious and error-prone, but it is now fully automated in modern deep learning frameworks such as Theano [3] and TensorFlow [1]. For SGD, parameters are updated after every mini-batch of records (or sentences). Training is stopped after a pre-defined number of epochs (iterations), or on convergence.
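To make the loss and the update rule concrete, the sketch below computes a cross-entropy loss with an $\ell_2$ penalty and performs one mini-batch SGD step for the logistic classifier only. It is a simplified illustration: in Deepr the gradients also flow back to the kernels and the embedding table, which a framework would handle automatically. The penalty form $(\lambda/2)\|w\|^2$, the learning rate, and all names are assumptions.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss(w, b, batch, lam):
    """Mean cross-entropy over the mini-batch plus (lam / 2) * ||w||^2."""
    total = 0.0
    for z, y in batch:
        p = sigmoid(b + sum(wi * zi for wi, zi in zip(w, z)))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(batch) + 0.5 * lam * sum(wi * wi for wi in w)

def sgd_step(w, b, batch, lam, lr):
    """One gradient step on the classifier weights and bias."""
    grad_w = [lam * wi for wi in w]          # gradient of the l2 penalty
    grad_b = 0.0
    for z, y in batch:
        err = sigmoid(b + sum(wi * zi for wi, zi in zip(w, z))) - y
        grad_w = [g + err * zi / len(batch) for g, zi in zip(grad_w, z)]
        grad_b += err / len(batch)
    return [wi - lr * g for wi, g in zip(w, grad_w)], b - lr * grad_b

# Toy mini-batch of (pooled vector, label) pairs.
batch = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
w0, b0 = [0.0, 0.0], 0.0
w1, b1 = sgd_step(w0, b0, batch, lam=0.01, lr=0.5)
print(loss(w0, b0, batch, 0.01), loss(w1, b1, batch, 0.01))
```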
g) Pretraining with word2vec: As mentioned in Sec. III-A, the embedding matrix can be pretrained using word2vec. Here we do not need labels, and thus we can exploit a large set of unlabeled data.
E. Model Inspection and Visualization
Deepr facilitates intuitive model inspection and visualization for better understanding:
h) Identifying motif responses in a sequence: For each motif detector, the motif response at position $t$ (e.g., $z_t \in \mathbb{R}^p$) can be used to identify and visualize strong motifs. For size-3 motifs ($d = 1$), the response weight to a size-3 sub-sequence $(x_{t-1}, x_t, x_{t+1})$ of a sequence $x$ is the term $\sum_{j=-d}^{d} W_j x_{t+j}$ in Eq. (2), which is the dot product of the sub-sequence and the kernel $W$.
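The response weight can be computed per position and used to rank sub-sequences. The sketch below does this for a single filter; the storage of the kernel as a list of $2d+1$ weight vectors, the toy one-dimensional embeddings, and the function name are assumptions made for illustration.

```python
def motif_responses(words, xs, kernel):
    """Score every full window of a sentence against one filter.

    words:  the sentence as a list of words
    xs:     the corresponding word vectors
    kernel: list of 2d+1 weight vectors for one filter, ordered j = -d..d
    Returns (response weight, sub-sequence) pairs, strongest first.
    """
    d = len(kernel) // 2
    scored = []
    for t in range(d, len(xs) - d):
        weight = sum(
            w * x
            for j in range(-d, d + 1)
            for w, x in zip(kernel[j + d], xs[t + j])
        )
        scored.append((weight, tuple(words[t - d : t + d + 1])))
    return sorted(scored, reverse=True)

# Toy 1-D embeddings; the values are made up for the example.
words = ["Z08", "Z85", "1163", "1910"]
xs = [[0.0], [1.0], [2.0], [1.0]]
ranked = motif_responses(words, xs, kernel=[[1.0], [1.0], [1.0]])
print(ranked[0])
```

Collecting the top-ranked sub-sequences over many sentences, and counting how often each occurs per outcome class, gives one way to assemble a list of frequent and strong motifs.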
i) Identifying frequent and strong motifs: Motifs with large responses in sequences are collected. From this collection, we keep frequent motifs representative for each outcome class.
j) Computing word similarity: Through embedding $x_w = E(w)$, word similarity can be computed easily, e.g., through the cosine $S(w, v) = x_w^\top x_v \left(\|x_w\| \, \|x_v\|\right)^{-1}$.
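A minimal sketch of the cosine similarity and a nearest-word query over an embedding table follows. The helper `most_similar` and the toy vectors are hypothetical and serve only to illustrate the formula.

```python
import math

def cosine(u, v):
    """S = u.v / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, E, k=3):
    """Return the k words whose vectors have the highest cosine with E[word]."""
    others = [(cosine(E[word], vec), w) for w, vec in E.items() if w != word]
    return [w for _, w in sorted(others, reverse=True)[:k]]

# Made-up 2-D vectors for three codes.
E = {"E10": [1.0, 0.1], "E11": [0.9, 0.2], "S72": [-0.2, 1.0]}
print(most_similar("E10", E, k=1))
```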
k) Visualization of similar patients: Patient vectors from Eq. (3) can be used to compute patient similarity. This enables retrieving patients who have similar history and similar future risk likelihood. This is unlike existing methods that compute only similar history, which does not necessarily guarantee similar future. Further, the similarity is not heuristic, and it does not require a heuristic combination of multiple data types (such as diseases and interventions). Fig. 2, for example, shows the distribution of positive and negative classes, in which patient vectors are projected onto 2D using t-SNE [47]. Patients who have similar history and future will stay close together.
l) Visualization in disease/intervention space: Since words are embedded into vectors, visualization in 2D is through dimensionality reduction tools such as PCA or t-SNE [47].
IV. IMPLEMENTATION
In this section, we document implementation details of Deepr on a typical EMR system. For ease of exposition, we assume that diseases are coded in ICD-10 format, but other versions are also applicable with minimal changes.
A. Data and Evaluation
Data was collected from a large private hospital chain in Australia in the period of July 2011 – December 2015. The data is coded according to the Australian Coding Standard (ACS). The ACS dictates that diagnosis coding is based on ICD-10-AM³, an Australian adaptation of WHO’s ICD-10 system. Likewise, procedure coding follows ACHI (Australian Classification of Health Interventions). The data consists of 590,546 records (300K unique patients), each of which corresponds to an admission (defined by an admission time and a discharge time).
The data subset for testing Deepr was selected as follows. First we identified 4,993 patients who had at least one unplanned readmission within 6 months from a discharge, regardless of the admitting diagnosis. This constituted the risk group. For each risk case, we then randomly picked a control case from the remaining patients. For each risk/control group, we used 830 patients for model tuning, 830 for testing and the rest for training. A discharge (except for the last one in the risk group) is randomly selected as the prediction point, from which the future risk will be predicted. See also Fig. 1 for a graphical illustration.
³ https://www.accd.net.au/Icd10.aspx
B. Implementation Details of Deepr
m) Episode definition: Deepr assumes that episodes are well-defined with an admission time and discharge time. However, this is not always the case due to intra-hospital or inter-hospital transfers. Our implementation links two admissions into an episode if they are separated by less than 12 hours, or by 12-24 hours but with a documented transfer.
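The linking rule can be sketched as below. The record layout (dictionaries with `admit`, `discharge` and a boolean `transfer` flag marking a documented transfer into the admission) is an assumption for the example, as is measuring the gap from the previous discharge to the next admission.

```python
from datetime import datetime, timedelta

def link_episodes(admissions):
    """Group time-ordered admissions into episodes.

    Two consecutive admissions are linked if the gap is under 12 hours,
    or between 12 and 24 hours with a documented transfer.
    """
    episodes = []
    for adm in admissions:
        if episodes:
            gap = adm["admit"] - episodes[-1][-1]["discharge"]
            linked = gap < timedelta(hours=12) or (
                gap <= timedelta(hours=24) and adm["transfer"]
            )
            if linked:
                episodes[-1].append(adm)
                continue
        episodes.append([adm])
    return episodes

t0 = datetime(2015, 1, 1, 8)
admissions = [
    {"admit": t0, "discharge": t0 + timedelta(days=2), "transfer": False},
    # 6 hours after the previous discharge: linked.
    {"admit": t0 + timedelta(days=2, hours=6), "discharge": t0 + timedelta(days=3), "transfer": False},
    # 18 hours later with a documented transfer: linked.
    {"admit": t0 + timedelta(days=3, hours=18), "discharge": t0 + timedelta(days=4), "transfer": True},
    # 18 hours later without a transfer: starts a new episode.
    {"admit": t0 + timedelta(days=4, hours=18), "discharge": t0 + timedelta(days=5), "transfer": False},
]
print([len(e) for e in link_episodes(admissions)])
```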
n) Words: For robustness, only level 3 ICD-10-AM codes are used. For example, F20.0 (paranoid schizophrenia) would be converted into F20 (schizophrenia). Similarly, the procedures are converted into procedure blocks. Rare words are those occurring fewer than 100 times in the database.
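The two word-level normalizations for diagnoses can be sketched as follows. The function names are hypothetical, the mapping of procedures to procedure blocks is not shown because it depends on the ACHI block table, and `to_level3` is meant for diagnosis codes only (not for time-gap or TRANSFER words).

```python
from collections import Counter

def to_level3(code):
    """Keep the three-character ICD-10 category, e.g. 'F20.0' -> 'F20'."""
    return code.split(".")[0][:3]

def replace_rare(sentences, min_count=100):
    """Replace words seen fewer than min_count times in the corpus with RAREWORD."""
    counts = Counter(w for s in sentences for w in s)
    return [[w if counts[w] >= min_count else "RAREWORD" for w in s] for s in sentences]

print(to_level3("F20.0"))
# A tiny corpus with a lowered threshold, just to show the effect.
print(replace_rare([["E11", "E11", "H53"]], min_count=2))
```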
o) Word order randomization: For motif detection, randomization is necessary to generate many potential motifs. We also test a special case where words in a phrase are ordered starting with the primary diagnosis followed by other secondary diagnoses, then by procedures in their natural ordering as defined by the EMR system.
p) Sentence length: For the CNN, the sentences are trimmed to keep the last min(100, len(sentence)) words. This is to avoid the effects of some patients who have very long sentences, which severely skew the data distribution. In a typical EMR, this is equivalent to accounting for up to 10 visits per patient, which covers more than 95% of patients.
q) Hyper-parameter tuning: Deepr has a number of hyper-parameters pre-specified by model users: embedding dimension $m$, kernel window size $2d+1$, motif size, number of motifs $n$ per size, number of epochs, mini-batch size, and other classifier-specific settings. Some hyper-parameters can be found through grid search, which finds the best configuration with respect to the accuracy on the development set.
We searched for the best parameters using the training and development data. Then we used the model with the best parameters to predict the unseen test data. The best parameter settings were $m = 100$, $d = 1$, motif size = 3, 4 and 5, $n = 100$, number of epochs = 10, mini-batch size = 64, $\ell_2$ regularization $\lambda = 1.0$.
C. Baselines
We implemented the bag-of-words representation and regularized logistic regression (BoW+LR). LR has a parameter C that helps control overfitting. We searched for the best parameter C using the development data. We used the model with the best parameter to predict the unseen test data. We found the best parameter C = 0.1, which is equivalent to a Gaussian prior of mean 0 and standard deviation of 0.333.
V. RESULTS
A. Risk Prediction
We predict unplanned readmission within 6 months after a random index discharge. Table I reports the prediction accuracy for all methods, when trained on data with and without coded time-gaps. Time-gap coding improves the BoW-based prediction, suggesting the importance of proper sequential handling. However, time-gaps do not affect the accuracy of Deepr. This might be due to the convolution, rectification and max-pooling operations (see Sec. III-C), which pick the most powerful convolved signals in the sequence. The use of word2vec to initialize the embedding matrix also contributes little to the accuracy. This could be because word2vec looks only for local collocations in both directions (past and future), whereas the prediction in Deepr is more global and of longer time horizon only in the future direction. In either case, with or without word2vec, Deepr is superior to the baseline BoW+LR.

Table I. Accuracy on 6-month unplanned readmission prediction following a random index discharge with and without time-gaps. Rand init refers to random initialization of the embedding matrix. Word2vec init refers to pretraining the embedding matrix using the word2vec algorithm [34].

Method                   W/o time   With time
BoW + LR                 0.727      0.741
Deepr (rand init)        0.754      0.753
Deepr (word2vec init)    0.750      0.756

Figure 3. Distribution in the disease space, projected into 2D using t-SNE. Distribution of interventions is omitted for clarity. Best viewed in color. [Scatter plot of ICD-10 code labels, with regions annotated as: pregnancy related, birth related, injuries, sport related, musculoskeletal system, heart related, respiratory system, blood related, nervous system, genitourinary system, digestive system, mental health.]
Fig. 2 shows how Deepr groups similar patients and creates a more linear decision boundary, while BoW+LR scatters the patient distribution and has a more complicated decision boundary. Recall that Deepr creates the feature vectors using element-wise max-pooling over all the motif responses, as in Eq. (3). This demonstrates that the motifs, not just individual words, are important to computing similarity between patients. This also suggests that given a new patient, Deepr is better at querying similar patients in the database when future risk is needed.
Figure 2. 2D projections of classification on the unseen test set of two methods, BoW+LR and Deepr (one panel each). White points and blue background are the negative class, black points and yellow region are the positive class. The figure shows Deepr groups similar patients and creates a more linear decision boundary while BoW+LR scatters the patient distribution and has a more complicated decision boundary. The decision boundary is approximated by an exhaustive contouring method, where fine lattice points of the background grid are labeled with the predicted label of their nearest data point, and then the boundary is computed by the contouring algorithm. Best viewed in color.
B. Disease/Procedure Semantics
Recall that Deepr first embeds words into a vector space. This offers a simple but powerful way to uncover and visualize the underlying structure of the word space (see Sec. III-E). Fig. 3 plots the distribution of diseases in 2D. Deepr discovers disease clusters which partly correspond to nodes in the ICD-10 hierarchy. Apart from pregnancy, child birth issues and injuries, the conditions are not totally separated, suggesting complex dependencies in the disease space. The main block of the disease space has conditions related to heart, blood, metabolic system, respiratory system, nervous system and mental health. A closer examination of the conditions most similar to a given disease is given in Table II. For example, similar to cesarean section delivery of baby are those related to pregnancy complications (disproportion, failed induction of labor, or diabetes) and corresponding delivery procedures (cesarean section, manipulating fetal presentation, forceps).
We note in passing that we also obtained a similar visualization using only word2vec as in [34], which is known to detect hidden semantic relationships between words. Deepr trained on the embedding matrix initialized by word2vec did not significantly change the relative positions of words. This suggests that Deepr also captures the semantic relationship between words.
C. Filter Responses and Motifs
While the semantics in the previous sub-section reveal the global relative relations between diseases and procedures, they do not explain local interactions (e.g., motifs). Here we compute the local filter responses per sentence, and from these derive a collection of strong and frequent motifs.
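A local filter response can be sketched as a sliding window over the embedded sentence. The example assumes a filter of width 3 (matching the `filter_size_3` entries in Table III) and a rectified linear activation; the function name `filter_responses`, the 2D embeddings, and the filter weights are all hypothetical values chosen for illustration.

```python
import numpy as np

def filter_responses(tokens, embed, filt, bias=0.0):
    """Return (start, window, response) for every window the filter covers."""
    width = filt.shape[0]
    out = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        x = np.stack([embed[t] for t in window])         # shape: (width, dim)
        resp = max(0.0, float((filt * x).sum() + bias))  # rectified linear response
        out.append((start, tuple(window), resp))
    return out

# Hypothetical embeddings and filter weights, for illustration only.
embed = {
    "01m": np.array([0.0, 0.0]),
    "Z85": np.array([1.0, 0.0]),
    "1163": np.array([0.0, 1.0]),
    "1910": np.array([1.0, 1.0]),
}
filt = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tokens = ["01m", "Z85", "1163", "1910"]
responses = filter_responses(tokens, embed, filt)
best = max(responses, key=lambda r: r[2])  # window with the strongest response
```

In this toy setting the window `Z85 . 1163 . 1910` gives the strongest response because the filter weights were chosen to align with it.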
Table III shows some sentences with strong responses for Filters 1 and 4, for both the risk and no-risk classes. It can be seen that the sub-sequences Z85.1163.1910 and 1066.1067.I21 respond strongly for the positive class and contribute to the classification result. The first sub-sequence is about cancer history (Z85),
Filter ID Response within a (sub) sentence
1 (readmit)
filter_0, filter_size_3, weight_0.954588, positive_618, 1620 . 1649 . 1910 . 1645 . D03 . 1299m . 1744 . N62 .1910 . D24
filter_0, filter_size_3, weight_0.906711, positive_520, Z08 . Z85 . 1163 . 1910 . 1089
filter_0, filter_size_3, weight_0.902545, positive_816, 1910 . Z08 . 1089 . Z85 . 1299m . 1910 . 1089 .Z08 . Z85 . 1299m . Z08 . 1089 . Z85 . 1910
filter_0, filter_size_3, weight_0.892010, positive_1446, Z86 . Z85 . 1089 . 1910 . Z08 . 1299m . 1089 .Z86 . 1910 . Z08 . Z85 . 1299m . 1089 . 1910 . Z86 . Z85 . Z08
filter_0, filter_size_3, weight_0.874444, positive_26, 1089 . Z86 . Z08 . Z85 . 612m . Z85 . 1089 . Z08
filter_1, filter_size_3, weight_2.002083, positive_4396, 1098 . RAREWORD . Z85 . Z08 . 13m . G47 .K59 . R31 . 01m . R33 . E11 . Z86 . Z92 . 612m . 1108 . 1916 . 1092 . 1910 . E11 . N30 . 1566 . Z86 . Y84 . Y92 . N32
filter_1, filter_size_3, weight_1.936797, positive_4160, 1089 . Z85 . Z08 . 1299m . Z08 . 1089 .Z85 . 612m . Z85 . Z08 . 1089 . 1299m . 1089 . Z08 . Z85
filter_1, filter_size_3, weight_1.894471, positive_448, Z85 . 1089 . 1910 . Z08 . 1299m . Z08 . Z85 .1910 . 1089 . 1299m . 1910 . Z85 . Z08 . 1089 . 01m . M54 . J45 . K21 . 612m . 1089 . Z85 . 1910 . Z08
filter_1, filter_size_3, weight_1.888126, positive_1490, 1910 . Z85 . 911 . Z08 . 612m . Z08. 911 . Z85 . 1910 . 36m . E11 . K62 . 908 . 1910
filter_1, filter_size_3, weight_1.883164, positive_4304, 1089 . 1910 . Z85 . Z08 . 1299m . Z08 .1910 . Z72 . 1089 . Z85 . 1299m . Z85 . Z08 . 1910 . 1089 . Z72
filter_2, filter_size_3, weight_1.894100, positive_1314, 1008 . K22 . 1910 . Z09 . 905 . Z87 . 1299m . 958 . 984 .Z03 . 1910 . 963 . 01m . F41 . 1916
filter_2, filter_size_3, weight_1.698204, positive_3834, 1569 . 1910 . K07 . 1706 . K07 . 1702
filter_2, filter_size_3, weight_1.631166, positive_1164, Z86 . 1534 . M20 . 1916 . 1910 . 1528 . 1909 . 1547 . 1299m . 727 . I83 . 1910 . 727 . Z86 . E11
filter_2, filter_size_3, weight_1.593942, positive_2409, 1258 . 1259 . 984 . 1265 . N80 . 1910
filter_2, filter_size_3, weight_1.508270, positive_3755, 905 . 1910 . Z86 . Z03 . 1008 . R19 . K64
filter_3, filter_size_3, weight_1.908521, positive_1392, 1910 . Z86 . N97 . 1259 . 612m . D05 . 1744 . 1910 . Z86 . 13m . RAREWORD . 1747 . 1916 . Z86 . 1893 . 1910 . D05 . 1756 . D64 . 1754 . 1299m . 1910 . Z42 . Z86 . 1660
filter_3, filter_size_3, weight_1.797218, positive_2780, Z53 . Z08 . Z85 . Z86 . I20 . 01m . 1910 . Z85 .1089 . Z08 . Z86 . 1299m . 1098 . Z86 . 1910 . 1165 . D09 . 01m . 1067 . 1066 . RAREWORD . 1910 . Z86 .D09
1 (no-risk)
filter_98, filter_size_3, weight_2.951619, positive_1649, 1474 . 1561 . M67 . 1910 .1651 . 13m . 1620 . 1910 . 1651 . D04
filter_98, filter_size_3, weight_2.737271, positive_3896, H65 . 309 . 01m . 1668 . 1756 . 1668 . Z86 . Z42 . 1754 . Z85 . 1910 . 1757
filter_98, filter_size_3, weight_2.545380, positive_3089, 1910 . N84 . 1276 . N84 .1266
filter_98, filter_size_3, weight_2.377592, positive_1558, S01 . 1910 . 406 . X59 . U73 . Y92 . 1299m . Z30 . 1910 . 1183
filter_98, filter_size_3, weight_2.363580, positive_3368, 1916 . N17 . D64 . 1165 . 1916 . E83 . N13 . E83 . Z72 . 1910 . I10 . R07 . 1093 . R33 . E87 . 1893 . 01m . Z72 . K59 . 01m . Z46 . 1092 . 612m . RAREWORD . Y65 . 1910 . Z03 . I97 . Y92 . T88 . I95
filter_99, filter_size_3, weight_2.046401, positive_688, 1183 . 1910 . Z30 . 612m . I48 . K92 . 01m . 911 .K64 . K92 . 1910
filter_99, filter_size_3, weight_2.015343, positive_1273, 1098 . Y84 . N30 . Y92 . 1096 . 1910 . 01m . Y84 . N30 . Y92 . 1096 . 1092 . E11 . R31 . 1910 . 13m . Y84 . K66 . 1916 . 1910 . RAREWORD . T81 . R32 . RAREWORD . N30 . 986 . E10 . 1916 . 1893 . Y92 . 1909 . Y92 . 1183 . Y83 . 1299m . 1910 . 1067 . E10 . N20
filter_99, filter_size_3, weight_1.981209, positive_4017, 458 . K08 . 400 . 1910 . 400
filter_99, filter_size_3, weight_1.908346, positive_2368, 1910 . Z30 . 1183
filter_99, filter_size_3, weight_1.908346, positive_1558, S01 . 1910 . 406 . X59 . U73 . Y92 . 1299m . Z30 . 1910 . 1183
filter_0, filter_size_3, weight_0.954588, negative_2158, Z86 . 1744 . 1910 . D24
filter_0, filter_size_3, weight_0.954588, negative_3125, 1744 . 1910 . D24
filter_0, filter_size_3, weight_0.954588, negative_3071, 1744 . 1910 . D24
filter_0, filter_size_3, weight_0.954588, negative_2582, 1744 . 1910 . D24
filter_0, filter_size_3, weight_0.954588, negative_2245, 1910 . 1744 . D24
filter_1, filter_size_3, weight_2.170072, negative_4199, 1435 . Y92 . T81 . R11 . U73 . S61 . 1427 . Y60 . S51 . 1916 . S52 . 1910 . 1429 . Y92 . W19 . 1910 . 1557 . 13m . K57 . M13 . 01m . 1008 . R10 . R14 . M43 . 1910 . R11 . 1916 . 1299m . 1910 . 309 . U83 . H65 . U86
4 (readmit)
filter_3, filter_size_3, weight_1.734465, positive_1490, 1910 . Z85 . 911 . Z08 . 612m . Z08 . 911 .Z85 . 1910 . 36m . E11 . K62 . 908 . 1910
filter_3, filter_size_3, weight_1.716212, positive_269, R19 . 911 . Z87 . 1008 . Z80 . 1910 . 01m . D37 . 911 . 1910
filter_3, filter_size_3, weight_1.712104, positive_4349, Z08 . Z85 . 1089 . 1299m . Z08 . Z85 . 1089 N18 . 668 . I25 . N13 . 1910 . 905 . 1066 . 1067 . I21 . N17 . 607 . 1910 . 1910 . 1067 . 671 . R19
filter_4, filter_size_3, weight_2.025432, positive_2169, K55 . K22 . R07 . 1008 . 1916 . 1910 . 905 . 1299m . 1910 . 1008 . K31 . K22 . K44 . Z86 . 13m . K56 . Z86 . 01m . 607 . I10 . 671 . I25 . 668 . Z86 . I20 . 01m . R10 . 01m . K57 . 01m . 1910 . K57 . 905 . 36m . 1089 . 1916 . Z86 . 1910 . R31 . D09 . 1916 . J44
filter_4, filter_size_3, weight_1.942211, positive_1166, K63 . Z86 . I84 . 905 . 1910 . 13m . 668 . I21 . 1910 . Z95 . N18 . N17 . J44 . I48 . I50 . 668 . E11 . E11 . 607 . 1910 . I25 . 1916
filter_4, filter_size_3, weight_1.925787, positive_1056, E11 . 1916 . 1916 . 1916 . E11 . I50 . G47 . I10 . J96 . E66 . 570 . E11 . N17 . E11 . 01m . I10 . E66 . E11 . N17 . N18 . Z92 . 1893 . 1916 . E11 . E11 . Y52 . Y92 . 612m . E11 . E11 . Z92 . R15 . D68 . Y92 . I50 . N18 . I48 . Y44
filter_4, filter_size_3, weight_1.925539, positive_391, 1910 . N47 . 1196 . Z86 . 13m . 1916 . Z86 . M47 . 1299m . J18 . 1916 . R33 . Z86 . 1916 . R32 . E87 . I50
filter_5, filter_size_3, weight_1.428182, positive_2661, Z86 . 1910 . N20 . 1126 . 36m . 1910 . N40 . 1165 . 1909 . Z86 . N13
filter_5, filter_size_3, weight_1.387222, positive_2211, 1910 . N20 . 1126 . 01m . 1126 . 1910 . 1067 . N20
filter_5, filter_size_3, weight_1.168228, positive_369, 1910 . 1909 . H26 . 197 . 173 . Z86 . H52 . 01m . 197 .Z86 . H26 . 1910 . 1909
filter_5, filter_size_3, weight_1.111798, positive_3563, R33 . Z46 . N39 . N41 . 1902
filter_5, filter_size_3, weight_1.091392, positive_2924, 1046 . Z86 . 1067 . 1910 . 1074 . N20 . 1066 . 01m . N20 . 1910 . 1067 . Z86 . 1041 . 36m . 1089 . Z08 . Z85 . 1910 . 13m . 1066 . 1910 . Z86 . 1067 . N20 . 1046 . 01m . 1089 . 1910 . Z86 . 1067 . Z46 . 612m . Z85 . Z87 . 1910 . 1066 . Z86 . Z08 . 612m . 1066 . Z08 . 1910 . Z85 . Z86
filter_6, filter_size_3, weight_1.514249, positive_2392, Z72 . 990 . 1910 . K40 . 1299m . 1910 . I97 . Z72 . I72 . 700 . Y92 . I70 . I95 . 715 . Y83 . 36m . Z53
4 (no-risk)
filter_1, filter_size_3, weight_2.011279, negative_959, I84 . Z87 . 905 . K92 . E87 . 1910 . Z09 . E87 . 905 . K92 . K57. Z09 . 1910 . Z87 . 1299m . Z45 . Z53 . 13m . Z45 . 655
filter_1, filter_size_3, weight_1.895455, negative_2085, Y92 . 1758 . 1758 . 1910 . 1294 . Y83 . T85 . Z41 . 36m .1657 . 1910 . Z42 . 612m . Z41 . 1910 . RAREWORD . 1294 . Z42 . 36m . H26 . 1910 . 197 . 13m . 1909 . H26 .197 . 1910 . Z86 . 13m . H26 . 197 . 1910 . 1909 . 01m . H53 . 1909 . 193 . 1910 . 13m . H53 . RAREWORD . 1909 . 1910 . 01m .H53 . 1909 . 193 . 1910 . 01m . 1758 . Y92 . 1758 . Y83 . T85 . 1910
filter_1, filter_size_3, weight_1.848119, negative_1649, 309 . 1910 . H65
filter_1, filter_size_3, weight_1.818123, negative_2285, O68 . 1333 . 1338 . 1343 . O81 . Z37 . 1299m .O02 . 1265 . 1910 . O09 . 612m . A09 . O98 . 01m . Z29 . O92 . 1333 . 1338 . 1343 . O81 . O36 . 1334 . Z22 . Z37
filter_2, filter_size_3, weight_2.600813, negative_2359, L50 . 1299m . Z88 . Z03 . 1864
filter_2, filter_size_3, weight_2.600812, negative_4317, Z88 . Z03 . 1864
filter_2, filter_size_3, weight_2.372393, negative_2355, 1864 . Z03 . Z88
filter_2, filter_size_3, weight_2.372393, negative_2931, 1864 . Z03 . Z88 . 36m . Z03 . 1864. Z88
filter_2, filter_size_3, weight_2.195103, negative_3789, Z88 . Z41 . 13m . Z88 . Z41 .1864
filter_3, filter_size_3, weight_1.948234, negative_1422, 895 . K56 . 986 . K66 . RAREWORD . 899 . Y92 . Y83 . K91 . Z43 . 1910 . 36m . 911 . Z08 . 1910 . Z85 . D12 . 1299m . Z08 . K57 . Z85 . 1910 . 911 . K63
filter_3, filter_size_3, weight_1.571307, negative_2948, Z86 . D05 . 1744 . 1910 . N62
filter_3, filter_size_3, weight_1.497793, negative_878, Z85 . 1756 . 808 . 1910 . 1747 . Z40 . 1756 . D05 . 1916 .36m . N80 . 1758 . 1252 . Z40 . 1910 . Z80 . 1299m . 1910 . N60 . 1744
filter_3, filter_size_3, weight_1.465393, negative_2181, R20 . E04 . 114 . 1910 . R20
1089 . Z86 . Z85 . 1910 . Z08 . 36m . 1910 . 1089 . Z08 . Z85 . Z86 . 612m . Z08 . 1910 . Z85 . Z86 . E11 . 1089 . 1299m . 1910 . Z08 . Z86 . 1089 . Z85 . E11 . 1299m . Z85 . 1089 . E11 . Z86 . 1910 . Z08
filter_4, filter_size_3, weight_1.986171, negative_517, N39 . R41 . Y92 . RAREWORD . F05 . M25 . E87 . F41 . TRANSFER . Y92 . I95 . Y42 . M31 . R45 . Y44
filter_4, filter_size_3, weight_1.981100, negative_1221, I95 . J96 . 1916 . J84 . J96 . 1916
filter_4, filter_size_3, weight_1.935187, negative_1051, 1828 . G47 . 01m . 1828 . G47 . 570 . 13m . T81 . Y92 . J47 . Y84
Table III
SOME SENTENCES WITH STRONG RESPONSES FOR FILTERS 1 AND 4. A CODE BEGINNING WITH A LETTER IS A DIAGNOSIS, AN ALL-NUMERIC CODE IS A PROCEDURE, AND A CODE ENDING WITH “M” IS A TIME-GAP. THE HEIGHTS OF THE CODES ARE PROPORTIONAL TO THEIR RESPONSE WEIGHTS. THE SUB-SEQUENCES Z85.1163.1910 AND 1066.1067.I21 RESPOND STRONGLY TO THE POSITIVE CLASS.
biopsy procedure (1163) and cerebral anesthesia (1910). The other sub-sequence is about heart attack (I21) and kidney-related procedures (1066 and 1067).
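The code conventions stated in the caption of Table III can be sketched as a small token classifier. This is an illustrative reading of those conventions rather than the authors' preprocessing code: the function name `token_type` and the handling of `RAREWORD` and `TRANSFER` as a separate "special" type are assumptions.

```python
import re

# Non-code tokens that appear in the sentences of Table III (treated as special here).
SPECIAL_TOKENS = {"RAREWORD", "TRANSFER"}

def token_type(token):
    """Classify a token as diagnosis, procedure, time-gap, or special."""
    if token in SPECIAL_TOKENS:
        return "special"
    # Time gaps end with "m", e.g. "01m" or "1299m" (Table III) and "1-3m" (Table IV).
    if re.fullmatch(r"\d+(-\d+)?m", token):
        return "time-gap"
    if token.isdigit():
        return "procedure"   # all-numeric code
    if token[0].isalpha():
        return "diagnosis"   # code beginning with a letter
    return "unknown"

sentence = "Z85 . 1163 . 1910 . 1299m . RAREWORD"
types = [token_type(t) for t in sentence.split(" . ")]
```

Applied to the sub-sequence discussed above, `Z85` is tagged as a diagnosis while `1163` and `1910` are tagged as procedures.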
From the strong and frequent filter responses across all sentences, we derive a list of motifs. Table IV lists the motifs with the largest weights and highest frequency of occurrence for code chapters E, I and O. The first motif of Filter 45 shows that treatment removing toxic substances from the blood co-occurs with care involving dialysis and readmission within 1 month. The second motif in the same row shows that patients with type 1 diabetes receive education on diabetes information and management. The third motif in the same row shows patients with type 2 diabetes being readmitted within 1-3 months. Filter 26 demonstrates the co-occurrence of diseases related to diabetes: its three motifs show that patients with type 2 diabetes can have complications such as heart failure, vitamin D deficiency and kidney failure. Filters 10 and 35 show diseases and treatments related to the circulatory system, whereas pregnancy and birth related motifs are shown in Filters 2 and 33 in the last two rows.
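One way to turn per-window responses into a ranked motif list is sketched below. The text does not specify the exact selection rule, so the response threshold, the ranking by frequency and then mean response, and the function name `derive_motifs` are assumptions of this example; the response values are made up for illustration.

```python
from collections import defaultdict

def derive_motifs(responses, min_response=1.0, top_n=3):
    """Rank motifs per filter by frequency, then by mean response.

    responses: iterable of (filter_id, motif_tuple, response).
    """
    stats = defaultdict(lambda: [0, 0.0])  # (filter, motif) -> [count, total response]
    for filter_id, motif, resp in responses:
        if resp >= min_response:           # keep only strong responses
            entry = stats[(filter_id, motif)]
            entry[0] += 1
            entry[1] += resp
    by_filter = defaultdict(list)
    for (filter_id, motif), (count, total) in stats.items():
        by_filter[filter_id].append((motif, count, total / count))
    return {
        f: sorted(items, key=lambda m: (-m[1], -m[2]))[:top_n]
        for f, items in by_filter.items()
    }

# Hypothetical responses collected over several sentences.
responses = [
    (26, ("E11", "I48", "I50"), 1.8),
    (26, ("E11", "I48", "I50"), 2.0),
    (26, ("E11", "I50", "N17"), 1.5),
    (26, ("E11", "Z86", "1910"), 0.2),  # weak response, filtered out
]
motifs = derive_motifs(responses)
```

Sorting on frequency first favors motifs that recur across patients; a different trade-off between weight and frequency would only change the sort key.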
Single delivery by cesarean section
  Diagnoses: Maternal care for disproportion; Placenta praevia; Complications of puerperium; Failed induction of labor; Diabetes mellitus in pregnancy
  Procedures: Cesarean section; Medical or surgical induction of labour; Manipulation of fetal presentation; Other procedures associated with delivery; Forceps delivery

Type 2 diabetes mellitus
  Diagnoses: Personal history of medical treatment; Presence of cardiac/vascular implants; Personal history of certain other diseases; Unspecified diabetes mellitus; Problems related to lifestyle
  Procedures: Cerebral anesthesia; Other digital subtraction angiography; Examination procedures on uterus; Medical or surgical induction of labour; Coronary angiography

Atrial fibrillation and flutter
  Diagnoses: Paroxysmal tachycardia; Unspecified kidney failure; Cardiomyopathy; Shock, not elsewhere classified; Other conduction disorders
  Procedures: Insertion or removal procedures on aorta; Electrophysiological studies [EPS]; Other procedures on atrium; Coronary artery bypass - other graft; Coronary artery bypass - saphenous vein

Table II
RETRIEVING TOP 5 SIMILAR DIAGNOSES AND PROCEDURES.
Filter ID  Motifs (code triple: description of each code)

45  0-1m 1060 Z49: Time-gap; Haemoperfusion; Care involving dialysis
    1916 E10 Z45: Allied health intervention, diabetes education; Type 1 diabetes mellitus; Adjustment and management of drug delivery or implanted device
    1-3m E11 Z45: Time-gap; Type 2 diabetes mellitus; Adjustment and management of drug delivery or implanted device

26  E11 I48 I50: Type 2 diabetes mellitus; Atrial fibrillation and flutter; Heart failure
    E11 E55 I48: Type 2 diabetes mellitus; Vitamin D deficiency; Atrial fibrillation and flutter
    E11 I50 N17: Type 2 diabetes mellitus; Heart failure; Acute kidney failure

10  1893 I48 K35: Exchange transfusion; Atrial fibrillation and flutter; Acute appendicitis
    1005 A41 I48: Panendoscopy to ileum with administration of tattooing agent; Other sepsis; Atrial fibrillation and flutter
    1-3m I48 Z45: Time-gap; Atrial fibrillation and flutter; Adjustment and management of drug delivery or implanted device

35  1909 727 I83: Intravenous regional anesthesia; Interruption of sapheno-femoral and sapheno-popliteal junction varicose veins; Varicose veins of lower extremities
    1620 I83 L57: Excision of lesion(s) of skin and subcutaneous tissue of foot; Varicose veins of lower extremities; Skin changes due to chronic exposure to nonionising radiation
    1910 768 I83: Sedation; Transcatheter embolisation of other blood vessels; Varicose veins of lower extremities

2   D68 O80 Z37: Other coagulation defects; Single spontaneous delivery; Outcome of delivery
    1344 O75 O80: Other suture of current obstetric laceration or rupture without perineal involvement; Other complications of labor and delivery; Single spontaneous delivery
    1344 O75 O82: Other suture of current obstetric laceration or rupture without perineal involvement; Other complications of labour and delivery; Single delivery by caesarean section

33  1333 1340 O09: Neuraxial block during labour and delivery procedure; Emergency lower segment caesarean section; Duration of pregnancy
    1340 O14 Z37: Emergency lower segment caesarean section; Gestational [pregnancy-induced] hypertension with significant proteinuria; Outcome of delivery
    1340 3-6m O34: Emergency lower segment caesarean section; Time-gap; Maternal care for known or suspected abnormality of pelvic organs

Table IV
RETRIEVING 3 MOTIFS FOR EACH OF THE 6 FILTERS WHICH HAVE LARGEST WEIGHTS AND MOST FREQUENT WITH CODE CHAPTER O, I AND E.
VI. DISCUSSION
We have presented Deepr, a new deep learning architecture that provides end-to-end predictive analytics in healthcare services. Deepr reads directly from raw medical records and predicts future outcomes. This departs from traditional machine learning, which relies on expensive manual feature extraction. Deepr learns to extract meaningful features by itself without expert supervision. This translates to uncovering the predictive local motifs in the space of diseases and interventions. These capacities are not seen in existing methods.
Significance: Deepr contributes to the growing literature of predictive medicine in multiple ways. First, it is able to uncover the underlying space of diseases and interventions, showing the relationships between them. The largest disease cluster in Fig. 3 suggests that diseases may interact in a complex way, and that current representations of disease hierarchies, such as those in ICD-10, may not reflect the true nature of medical disorders. Second, Deepr detects predictive motifs of comorbidity, care patterns and disease progression. The motifs suggest a new look into the complex interactions between diseases, and between diseases and care. Third, similar patients can be retrieved not just using past history, but using the likelihood of future risks as well. This would, for example, help to quickly identify an effective treatment regime based on similar patients who responded well to the treatment, or to alert the care team to a potential risk based on similar patients who previously experienced that risk. Finally, Deepr predicts the future risk for a patient and explains why (by means of motif responses), which is the core of modern prospective healthcare.
With these capabilities, Deepr can enable targeted monitoring, treatments and care packaging. This is highly important for chronic disease management, which requires ongoing care and evaluation. For health services, high predictive accuracy of risk will lead to better resource prioritization and allocation. For patients, accurate risk estimation is an important step toward personalized care. Patients and their families will be encouraged to become more aware of their conditions and risks, leading to proactive health management and help seeking. Deepr is generic and can be implemented on existing EMR systems. This will enable innovative healthcare practices for better efficiency and outcomes. For example, doctors, when seeing a patient, may consult the machine for a second opinion with transparent, evidence-based reasoning. Because the machine does not miss any piece of information in the database, it is less likely to overlook important signals.
Comparison to recent work on medical records: Deep learning in healthcare has recently attracted great interest. The most popular application is medical imaging using CNNs [8], motivated by the recent successes in cognitive vision [12], [22], [25]. However, there has been limited work on non-cognitive modalities. On time-series data (e.g., ICU measurements), the main difficulty is handling missing data, which is addressed in the recent work of [4], [23], [28], [38]. In [23], time series are modeled using autoencoders (an unsupervised feedforward net) to discover meaningful phenotypes. In [4], [28], recurrent nets are used, and in [38], a convolutional net is employed. Deepr can be applied to these data after discretizing the continuous signals into discrete words (e.g., through cut-points).
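The discretization of a continuous signal into words through cut-points can be sketched as below. The cut-points, the `HR` prefix, and the function name `discretize` are hypothetical choices for illustration and are not clinically validated thresholds.

```python
import numpy as np

def discretize(values, cut_points, prefix):
    """Map each continuous value to a discrete word such as 'HR_1'."""
    # np.digitize returns the bin index: 0 for values below the first cut-point,
    # len(cut_points) for values at or above the last one.
    bins = np.digitize(values, cut_points)
    return [f"{prefix}_{b}" for b in bins]

# Hypothetical heart-rate readings and cut-points.
heart_rate = [55, 72, 130]
words = discretize(heart_rate, cut_points=[60, 100], prefix="HR")
```

The resulting words could then be placed into the sequence in the same way as diagnosis and procedure codes.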
On routine medical records, Deepr is the only method that employs convolutional nets, but alternative architectures exist. Feedforward nets have been used in [26], [10], [35]. Recurrent neural networks (RNNs) on medical records include Doctor AI [6] and DeepCare [37]. Doctor AI is an RNN adapted for medical events, where both next events and time gaps are predicted. DeepCare is a sophisticated model that represents time gaps using a parametric model. Similar to our observation, the authors of DeepCare also noticed an interesting analogy between natural languages and EMRs, where an EMR is similar to a sentence, and diagnoses and interventions play the role of nouns and modifiers. While DeepCare is powerful on long records, it is less effective on short records, e.g., those with only one or two admissions. Deepr, on the other hand, does not suffer from this limitation. Stochastic deep neural nets such as deep Boltzmann machines are used in [32]. Deep non-neural nets have also been suggested in [13]. These methods are likely to be expensive both to train and to produce predictions.
Embedding of medical concepts has been proposed in contemporary work [5], [7], [37], [45]. In [7], medical concepts are embedded using word2vec [33], ignoring time gaps. Med2Vec [5] extends word2vec to embed visits. Both word2vec and Med2Vec model local collocations, but do not explicitly model motifs (with precise relative positions). In [45], a global model known as eNRBM embeds patients into vectors via regularized nonnegative restricted Boltzmann machines [36]. Local motifs are not modeled, and variable record lengths and time gaps are not properly handled. Discovering local motifs by means of convolutions has been suggested in [48] through matrix factorization. However, that work does not perform prediction.
Limitations and future work: There is room for future work. First, long-term dependencies are simply captured through a max-pooling operation. This is rather simplistic given the complex dynamics between care processes and disease processes [37]. A better model should pool information in a time-sensitive way (e.g., recent events are more important than distant ones). At present, Deepr works exclusively on recorded events such as diagnoses and interventions. Integration with clinical narratives would be highly useful because rich information is buried in unstructured text. This can be done in the same framework as Deepr because of the sequential nature of text. Our evaluation has been limited to a common risk known as unplanned readmission. However, Deepr is not limited to any specific type of future risk. It can equally be applied to predicting the onset or progression of a disease.
REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] Ognjen Arandjelovic. Discovering hospital admission patterns using models learnt from electronic hospital records. Bioinformatics, page btv508, 2015.
[3] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf, pages 1–7, 2010.
[4] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. arXiv preprint arXiv:1606.01865, 2016.
[5] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, and Jimeng Sun. Multi-layer representation learning for medical concepts. KDD, 2016.
[6] Edward Choi, Mohammad Taha Bahadori, and Jimeng Sun. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. arXiv preprint arXiv:1511.05942, 2015.
[7] Youngduck Choi. Learning low-dimensional representations of medical concepts. Proceedings of the AMIA Summit on Clinical Research Informatics (CRI), 2016.
[8] Dan C Ciresan, Alessandro Giusti, Luca M Gambardella, and Jürgen Schmidhuber. Mitosis detection in breast cancer histology images with deep neural networks. In International Conference on Medical Image Computing and Computer-assisted Intervention, pages 411–418. Springer, 2013.
[9] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.
[10] Joseph Futoma, Jonathan Morris, and Joseph Lucas. A comparison of models for predicting early hospital readmissions. Journal of biomedical informatics, 56:229–238, 2015.
[11] Danning He, Simon C Mathews, Anthony N Kalloo, and Susan Hutfless. Mining high-dimensional administrative claims data to predict early hospital readmissions. Journal of the American Medical Informatics Association, 21(2):272–279, 2014.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[13] Ricardo Henao, James T Lu, Joseph E Lucas, Jeffrey Ferranti, and Lawrence Carin. Electronic Health Record Analysis via Deep Poisson Factor Models. JMLR, 2016.
[14] Rui Henriques, Cláudia Antunes, and Sara C Madeira. Generative modeling of repositories of health records for predictive tasks. Data Mining and Knowledge Discovery, pages 1–34, 2014.
[15] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[16] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
[17] Zhengxing Huang, Xudong Lu, and Huilong Duan. Latent treatment pattern discovery for clinical processes. Journal of medical systems, 37(2):1–10, 2013.
[18] Christopher H Jackson, Linda D Sharples, Simon G Thompson, Stephen W Duffy, and Elisabeth Couto. Multistate Markov models for disease progression with classification error. Journal of the Royal Statistical Society: Series D (The Statistician), 52(2):193–209, 2003.
[19] Anders Boeck Jensen, Pope L Moseley, Tudor I Oprea, Sabrina Gade Ellesøe, Robert Eriksson, Henriette Schmock, Peter Bjødstrup Jensen, Lars Juhl Jensen, and Søren Brunak. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nature communications, 5, 2014.
[20] Peter B Jensen, Lars J Jensen, and Søren Brunak. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395–405, 2012.
[21] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[22] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
[23] Thomas A Lasko, Joshua C Denny, and Mia A Levy. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PloS one, 8(6):e66341, 2013.
[24] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
[25] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[26] Zhaohui Liang, Gang Zhang, Jimmy Xiangji Huang, and Qmming Vivian Hu. Deep learning for healthcare decision making with EMRs. In Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on, pages 556–559. IEEE, 2014.
[27] Tsungnan Lin, Bill G Horne, Peter Tino, and C Lee Giles. Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338, 1996.
[28] Zachary C Lipton, David C Kale, and Randall Wetzel. Directly Modeling Missing Data in Sequences with RNNs: Improved Classification of Clinical Time Series. arXiv preprint arXiv:1606.04130, 2016.
[29] Chuanren Liu, Fei Wang, Jianying Hu, and Hui Xiong. Temporal phenotyping from longitudinal electronic health records: A graph based framework. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 705–714. ACM, 2015.
[30] Christopher D Manning. Computational linguistics and deep learning. Computational Linguistics, 2015.
[31] Jason Scott Mathias, Ankit Agrawal, Joe Feinglass, Andrew J Cooper, David William Baker, and Alok Choudhary. Development of a 5 year life expectancy index in older adults using predictive mining of electronic health record data. Journal of the American Medical Informatics Association, 20(e1):e118–e124, 2013.
[32] Saaed Mehrabi, Sunghwan Sohn, Dingheng Li, Joshua J Pankratz, Terry Therneau, Jennifer L St Sauver, Hongfang Liu, and Mathew Palakal. Temporal pattern and association discovery of diagnosis codes using deep learning. In Healthcare Informatics (ICHI), 2015 International Conference on, pages 408–416. IEEE, 2015.
[33] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. word2vec, 2014.
[34] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[35] Riccardo Miotto, Li Li, Brian A Kidd, and Joel T Dudley. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records. Scientific reports, 6, 2016.
[36] T.D. Nguyen, T. Tran, D. Phung, and S. Venkatesh. Learning Parts-based Representations with Nonnegative Restricted Boltzmann Machine. In Proc. of 5th Asian Conference on Machine Learning (ACML), Canberra, Australia, Nov 2013.
[37] Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh. DeepCare: A Deep Dynamic Memory Model for Predictive Medicine. arXiv preprint arXiv:1602.00357, 2016.
[38] Narges Razavian and David Sontag. Temporal convolutional neural networks for diagnosis from lab tests. arXiv preprint arXiv:1511.07938, 2015.
[39] Patrick Royston and Willi Sauerbrei. Interactions between treatment and continuous covariates: a step toward individualizing therapy. Journal of Clinical Oncology, 26(9):1397–1399, 2008.
[40] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[41] Mansour TA Sharabiani, Paul Aylin, and Alex Bottle. Systematic review of comorbidity indices for administrative data. Medical care, 50(12):1109–1118, 2012.
[42] Ralph Snyderman and R Sanders Williams. Prospective medicine: the next health care transformation. Academic Medicine, 78(11):1079–1084, 2003.
[43] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[44] Truyen Tran, Wei Luo, Dinh Phung, Sunil Gupta, Santu Rana, Richard L Kennedy, Ann Larkins, and Svetha Venkatesh. A framework for feature extraction from hospital medical data with applications in risk prediction. BMC bioinformatics, 15(1):6596, 2014.
[45] Truyen Tran, Tu Dinh Nguyen, Dinh Phung, and Svetha Venkatesh. Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM). Journal of biomedical informatics, 54:96–105, 2015.
[46] Truyen Tran, Dinh Phung, Wei Luo, and Svetha Venkatesh. Stabilized sparse ordinal regression for medical risk stratification. Knowledge and Information Systems, 2014. DOI: 10.1007/s10115-014-0740-4.
[47] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[48] F. Wang, N. Lee, J. Hu, J. Sun, and S. Ebadollahi. Towards heterogeneous temporal clinical event pattern discovery: a convolutional approach. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 453–461. ACM, 2012.
[49] Xiang Wang, David Sontag, and Fei Wang. Unsupervised learning of disease progression models. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 85–94. ACM, 2014.
[50] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[51] Xiang Zhang and Yann LeCun. Text understanding from scratch. arXiv preprint arXiv:1502.01710, 2015.