Title
A systematic review of Natural Language Processing for
classification tasks in the field of incident reporting and adverse event
analysis.
Authors: Ian James Bruce Young a, Saturnino Luz b, Nazir Lone c
a Department of Anaesthesia, Critical Care and Pain Medicine, Edinburgh Royal Infirmary, 51 Little France Crescent, Edinburgh, Scotland, EH16 4SA. [email protected].
b Usher Institute of Population Health Sciences & Informatics, The University of Edinburgh, 9 Little France Rd, Edinburgh, Scotland EH16 4UX. [email protected]
c Usher Institute of Population Health Sciences and Informatics, The University of Edinburgh, Teviot Place, Edinburgh, EH8 9AG. [email protected]
Corresponding Author
Ian James Bruce Young
ABSTRACT
Context: Adverse events in healthcare are often collated in incident reports which contain
unstructured free text. Learning from these events may improve patient safety. Natural
language processing (NLP) uses computational techniques to interrogate free text, reducing
the human workload associated with its analysis. There is growing interest in applying NLP
to patient safety, but the evidence in the field has not been summarised and evaluated to date.
Objective: To perform a systematic literature review and narrative synthesis to describe and
evaluate NLP methods for classification of incident reports and adverse events in healthcare.
Methods: Data sources included Medline, Embase, The Cochrane Library, CINAHL,
MIDIRS, ISI Web of Science, SciELO, Google Scholar, PROSPERO, hand searching of key
articles, and OpenGrey. Data items were manually abstracted to a standardised extraction
form.
Results: From 428 articles screened for eligibility, 35 met the inclusion criteria of using NLP
to perform a classification task on incident reports, or with the aim of detecting adverse
events. The majority of studies used free text from incident reporting systems or electronic
health records. Models were typically designed to classify by type of incident, type of
medication error, or harm severity. A broad range of NLP techniques are demonstrated to
perform these classification tasks with favourable performance outcomes. There are
methodological challenges in how these results can be interpreted in a broader context.
Conclusion: NLP can generate meaningful information from unstructured data in the specific
domain of the classification of incident reports and adverse events. Understanding what
incidents are occurring, and why, is important in adverse event analysis. If NLP enables these
insights to be drawn from larger datasets it may improve the learning from adverse events in
healthcare.
Keywords
Natural language processing; Machine learning; Text classification; Incident reporting; Adverse event analysis; Patient safety
Abbreviations
Adverse Drug Event (ADE); Electronic Health Record (EHR); Area Under Receiver Operating Characteristic curve (AUC); Support Vector Machine (SVM)
1.0 INTRODUCTION
1.1 RATIONALE
Incident reports are tools to collect data about adverse events and errors in healthcare[1].
Their use in healthcare was adopted from other high-reliability industries, which have
recognised the importance of reporting potential and actual harm for improving
safety[2]. A culture which promotes the reporting and analysis of incidents, errors, and
adverse events is now considered a central tenet of patient safety[3,4].
Ultimately the utility of this system is predicated on the reporting of incidents reducing the
risk to future patients. This could be either by understanding what incidents are occurring, or
why incidents are occurring, and then taking actions based on this understanding. In the
investigation and analysis of incident reports, one component is classification[2][5]. This may
be classification of incident type, of the type and severity of harm that resulted, or of the
factors that contributed to the incident occurring.
There are problems with the current workflow for processing incident reports that make it
difficult to translate reports into better outcomes[4]. Firstly, the system is neither reliable nor
robust. It does not give the same consideration to all reports, and it is often unclear what
factors determine the review course of a particular report.
Secondly, issues of data validity within incident reports make analysis harder. Proprietary
incident reporting systems typically record a combination of structured data entry fields and
free text responses[6]. Free text responses provide the initiator of the report with freedom to
describe the incident as they saw it, but the completeness and accuracy of reporting can limit
data validity. Structured responses do not necessarily increase data validity. Classification of
incident type is often entered by the initiator of the report and chosen from a structured list of
options. This may worsen data validity if the structured choices do not allow the initiator to
adequately summarise the incident, or if they are missing important contextual information to
inform the classification[7].
Lastly, incident reports are produced at a volume and velocity that make thorough and
timely human review impossible[8][9][10]. Where data volume and velocity are the
problem, the efficiency of automation offered by data science solutions may be
beneficial.
Within data science, natural language processing (NLP) is a field which seeks to understand,
process, and interpret human language[11]. Keeping pace with all the clinical free text
generated every day would be a monumental task for human readers. NLP aims to create
structure from this unstructured data. These structured data then provide a substrate for
training machine learning models to analyse the text. As such, set in the context of other
methods for analysing incident reports, NLP may confer benefit by allowing all incident
reports to be processed in a reliable way, and by dramatically reducing the time associated
with analysis.
In the last decade, NLP has demonstrated potential utility in healthcare data beyond the field
of clinical incident reporting and adverse event analysis. Using free text radiology reports,
NLP has been used to automate the detection of venous thromboembolism diagnoses[12]
[13][14] and malignancy diagnoses[15][16], and to detect critical follow-up recommendations[17].
In the domain of preventing adverse drug events (ADEs), NLP has been used to provide
patient individualised ADE information[18], detect and provide real-time clinician feedback
of drug errors in a neonatal intensive care unit [19][20], and look for incidences of known
drug side effects within EHR data[21].
NLP may have utility as a method for detecting clinically important outcomes, in contrast to
traditional methods such as manual chart review or diagnostic coding. NLP has been used on
EHR data to identify hypoglycaemic episodes[22], inpatient falls[23], healthcare associated
urinary tract infections[24], and cancer recurrence[25].
The potential of NLP as a clinical predictive tool has also been explored. NLP has been used
to predict clinical complications for cancer patients using EHR data[26].
1.2 OBJECTIVES
There is growing interest in applying NLP to patient safety but the evidence in the field has
not been summarised and evaluated to date. For this reason, we conducted a systematic
review and narrative synthesis to understand the published work on using natural language
processing for classification tasks in the field of incident reports and adverse event analysis.
Our specific objectives were to understand what NLP has been shown to achieve, the
techniques employed, and to highlight areas of future research in this field.
2.0 METHODS
2.1 ELIGIBILITY CRITERIA
We followed international Preferred Reporting Items for Systematic Reviews and Meta-
Analyses (PRISMA) guidelines in conducting this review. Study characteristics for eligibility
were: published research, of experimental or methodological studies, reviews and conference
abstracts, published after 2004, written in English, using NLP techniques on free
text for the purposes of classification. Studies which trained machine learning models on
structured text fields were excluded. The application of NLP techniques should be in the field
of human healthcare. The source of free text should be incident reporting systems or if not,
the classification should relate to detection of adverse events or medical error.
2.2 INFORMATION SOURCES
Articles were identified for this review through a search of the online databases: Medline via
Ovid, Embase via Ovid, The Cochrane Library, CINAHL, MIDIRS, ISI Web of Science,
SciELO, Google Scholar, and PROSPERO. A pearling strategy was applied to bibliographies
of key articles. A grey literature search was conducted via OpenGrey. The search was last
conducted on May 8th, 2018.
2.3 SEARCH
The full electronic search strategy, including limits used, is presented in appendix A.
2.4 STUDY SELECTION
Studies identified through electronic search were initially de-duplicated in their online
databases before being stored in Mendeley reference manager (Mendeley Ltd, version 1.19.2,
2018)[27]. Mendeley’s desktop application was used to de-duplicate the combined search
output. Initial screening was at the level of title and abstract. Full text review was then
conducted for remaining articles. University of Edinburgh library requests were submitted for
those studies whose full text could not be accessed initially. Included studies were those that:
(a) used NLP (b) to perform a classification task (c) either of incident reports, or other source
of clinical free text where the aim of classification was to detect adverse events or medical
errors.
2.5 DATA COLLECTION PROCESS
Data extraction from included studies was performed within Mendeley’s desktop application
and stored in an Excel spreadsheet (Microsoft, version 16.21.1, 2018). Data extraction was
performed independently, using presented data. Rejected studies were classified under a
standard set of explanations and are fully detailed in the PRISMA flow diagram in figure 1.
2.6 DATA ITEMS
The variables for which data were sought were mapped to the PICOS statement, stored in a
data extraction form, and are described below:
Participants – Study title as study ID. From what dataset was the free text extracted?
Interventions – What classification task was being performed? What type or types of
NLP were being used?
Comparisons – What alternative non-automated classification technique was used as a
comparator to the classifier?
Outcomes – What statistical analysis of classifier performance was performed?
Study design – What was the type of study?
2.7 SUMMARY MEASURES
The summary is a narrative synthesis of the eligible studies, based principally on the
extracted data items.
3.0 RESULTS
3.1 STUDY SELECTION
The numbers of studies screened, assessed for eligibility, and included in the review are
presented in a PRISMA flow diagram in figure 1. This also details the number of studies
rejected at each stage with accompanying reasons for rejection.
Figure 1: PRISMA 2009 Flow Diagram
3.2 STUDY CHARACTERISTICS
Data were sought for the variables mapped to the PICOS statement as described above. A
summary of the data extraction form is presented in figure 2. The full citation list for studies
included in the review is presented in the bibliography under references[7,11,36–45,28,46–
55,29,56–60,30–35].
3.3 NARRATIVE SYNTHESIS OF RESULTS
Thirty-five studies were included in this review. Twenty-one studies used free text extracted
from incident reporting systems[7,11,46,47,49–55,58,28,59,29,30,33,36,37,40,42], nine from
EHRs[34,35,38,43–45,48,56,61], three from ADE reporting systems[39,57,60], one from
morbidity and mortality records[41], and one from discharge summaries[31].
3.3.1 Type of classification
A range of classification tasks were performed. The total number described exceeded 35 as
some studies performed more than one type of classification task. Twenty studies developed
NLP classifiers for “type of incident”[28,29,49,50,52,53,55,56,58,59,61,30,31,36,38,40–
42,44], seven for “type of medication error”[35,39,45,47,54,57,60], six for “severity or
presence of harm”[7,11,33,46,51,59], three for “type of postoperative
complication”[34,43,48], and one for “type of contributory factor”[7].
Studies using NLP for the classification of incident types have taken various approaches.
Some have defined a single incident type and modelled various NLP techniques to optimise
this binary classification[38,42,52,61]. Others have imposed a predefined ontology of
incident types and developed either multi-label text classification[7] or multiple binary
classifiers[30,31]. Network analysis has also been used to identify incident types from free
text rather than imposing a known ontology[49].
Of the studies which focused on ADEs, four developed classifiers to identify the presence of
generalised medication events[39][47][60][45], while three looked specifically at identifying
bleeding events[35], anaphylaxis [57], and “look-alike sound-alike” errors[54].
3.3.2 Classification performance
With respect to reporting performance outcomes for NLP models, studies consistently utilised
measures that can be calculated from an error matrix[62]. The most commonly reported
performance metrics were Sensitivity (Recall), Positive Predictive Value (Precision),
Accuracy, and F Score (Harmonic mean of Precision and Recall). Multiple studies also
presented error matrix outcomes as area under receiver operating characteristic curves
(AUC). There was however no overall consistency as to which specific measures were
reported.
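For illustration, each of these metrics can be derived from the four cells of such a matrix. The following sketch uses invented counts, not taken from any included study, to show the standard calculations:

```python
# Illustrative confusion-matrix metrics for a binary incident classifier.
# The counts below are hypothetical, not drawn from any included study.
tp, fp, fn, tn = 80, 10, 20, 890  # true/false positives and negatives

sensitivity = tp / (tp + fn)            # recall: events correctly flagged
precision   = tp / (tp + fp)            # positive predictive value
specificity = tn / (tn + fp)
accuracy    = (tp + tn) / (tp + fp + fn + tn)
f_score     = 2 * precision * sensitivity / (precision + sensitivity)

print(f"sensitivity={sensitivity:.2f} precision={precision:.2f} "
      f"specificity={specificity:.2f} accuracy={accuracy:.2f} f={f_score:.2f}")
```

AUC, by contrast, summarises the trade-off between sensitivity and specificity over all possible decision thresholds, which is one reason studies report it alongside threshold-specific metrics.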
3.3.3 Machine Learning Models
Most studies reported outcomes for more than one NLP technique. The models developed
most frequently were variants of machine learning classifiers, including Support Vector
Machines (SVM) (20 studies), Naïve Bayes (11 studies), Logistic Regression (7 studies), and
K-Nearest Neighbours (5 studies). Decision Trees and Random Forests were each used in 4
studies. A number of machine learning techniques appeared in 2 or fewer studies: Topic
Modelling, Decision Rules, Neural Networks, Boosted Trees, and Active Learning.
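As a concrete illustration of the kind of classifier these studies trained, the sketch below implements a minimal multinomial Naïve Bayes text classifier with add-one smoothing. The miniature corpus, labels, and helper names (`train_nb`, `predict_nb`) are invented for this example and do not come from any included study:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Multinomial Naive Bayes over bag-of-words counts."""
    vocab = {w for d in docs for w in d.lower().split()}
    word_counts = defaultdict(Counter)   # per-class word frequencies
    for doc, lab in zip(docs, labels):
        word_counts[lab].update(doc.lower().split())
    priors = {c: math.log(n / len(docs)) for c, n in Counter(labels).items()}
    return vocab, word_counts, priors

def predict_nb(model, doc):
    vocab, word_counts, priors = model
    scores = {}
    for c in priors:
        total = sum(word_counts[c].values())
        score = priors[c]
        for w in doc.lower().split():
            # add-one smoothing so unseen words do not zero the likelihood
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Invented miniature corpus of incident-report snippets.
docs = ["patient fell from bed overnight",
        "fall on ward bathroom floor",
        "wrong dose of insulin administered",
        "medication given to wrong patient"]
labels = ["fall", "fall", "medication", "medication"]
model = train_nb(docs, labels)
print(predict_nb(model, "unwitnessed fall beside bed"))        # fall
print(predict_nb(model, "incorrect medication dose charted"))  # medication
```

Real studies operate on far larger corpora and typically add pre-processing such as stop-word removal and stemming, but the underlying probabilistic scoring is as above.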
While many studies developed their own NLP models, several used proprietary NLP software
such as MedLEE or SAS Text Miner[63][64]. Five- or ten-fold cross-validation was used either
to split data into training, validation, and testing sets, or to optimise model parameters, in 10 of
the 35 studies[7,11,33,39,40,44,47,48,54,59].
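The cross-validation procedure referred to here can be sketched as a simple partition of record indices into k folds, each held out once for testing. This illustrative helper (`k_fold_indices`, a name assumed for this example) omits the shuffling and stratification a real study might apply:

```python
def k_fold_indices(n_records, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n_records, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

# Each of the 25 records appears in exactly one test fold across 5 folds.
splits = list(k_fold_indices(25, k=5))
all_test = sorted(i for _, test in splits for i in test)
print(len(splits), all_test == list(range(25)))  # 5 True
```

Averaging a performance metric over the k held-out folds gives a less optimistic estimate than evaluating on the training data itself, which is why the included studies used it for internal validation or parameter tuning.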
The vast majority of studies used a manually annotated corpus of free text documents as the
comparator to their NLP model. Broadly, studies managed to develop an NLP classifier
whose performance approached that of the comparator. Fong et al. demonstrated AUC
performance of 0.96 using an SVM classifier to identify ADEs in incident reports[47]. Ong et al.
demonstrated AUC performance of 0.97 using a Naïve Bayes classifier to identify patient
identification and handover events in incident reports[29].
Figure 2: Summary of included studies
Figure 2 Legend:
Incident Reporting System (IRS), Electronic Health Record (EHR), Morbidity & Mortality Record (M&M), Adverse Drug Event Reporting System (ADE), Discharge Summary (DIS)
Incident Type (IT), Medication Error Type (MET), Harm Severity (HS), Contributory Factors (CF), Postoperative Complication (POC)
Topic Modelling (TM), Proprietary Software (PS), Neural Networks (NN), Decision Trees (DT), Logistic Regression (LR), K-Nearest Neighbours (K-NN), Naïve Bayes (NB), Support Vector Machines (SVM), Random Forests (RF)
Manually Annotated Corpus (MAC), Initial Reporter Classification (IRC)
Area Under Receiver Operating Characteristic Curves (AUROC), Confusion Matrix (e.g. Sensitivity, Specificity, Positive Predictive Value, Negative Predictive Value, Accuracy, Precision, Recall, F-Score) (Confusion Matrix)
3.3.4 Quality Assessment
A critical evaluation of study quality was conducted using the TRIPOD reporting guidelines
as a framework[65]. Broadly, studies clearly identified the data source and the nature of the
classification task. Datasets consisted of a mixture of free text and structured data entry
fields. These were extracted, in all cases, from internal databases affiliated either to a hospital
or public institution. The number of records extracted for use ranged from five to over 20
million, with a median of 2974[44,45]. Studies were clear that NLP was being used to
develop rather than validate predictive models. Studies were clear about the method for
internal validation, which was typically a manually annotated corpus. Studies often described
multiple NLP models and performance metrics. It was typically not clear which of these had
been decided a priori, and whether actions were taken to blind the assessment of individual
models. The specifics of NLP model development were not made clear in all cases.
Classification performance was reported heterogeneously amongst the included studies.
Studies discussed limitations and provided an overall interpretation of their results and
potential clinical applications of their models. Studies provided conflict of interest and
funding statements.
4.0 DISCUSSION
4.1 SUMMARY OF EVIDENCE
There are now a number of studies demonstrating that NLP models can be developed to
classify the unstructured free text contained in incident reports and EHRs according to
incident type and the severity of associated harm. Published work has explored binary
classification techniques more widely than multi-labelled classification problems. The type of
NLP that has been found to perform best has varied between datasets and classification tasks.
A wide variety of model performance metrics are reported, reflecting different priorities in
the use of the model. In general, studies have developed NLP models which can perform
classification tasks in this domain with performance outcomes which approach manual
human classifiers.
4.2 LIMITATIONS
In conducting this systematic review, resource limitations did not allow for the search to be
performed in duplicate with two independent reviewers. We limited the search to English
language articles due to lack of funding for translation facilities. We also limited the search to
articles published since 2005 to ensure relevance to current practice as the fields of adverse
event analysis and NLP have evolved significantly over the past decade. As figure 3 shows,
frequency of publications in this field appears to be increasing, with the majority of studies
published in the last decade. As such, this review will have captured most relevant
publications. Syntactic and ontological differences between languages may limit the
applicability of the NLP models used in this review to other languages, particularly in the text
pre-processing techniques described[37][49].
Figure 3.
Some aspects of the internal validity of these studies have been explored, such as the
difficulty in assessing the quality of comparative classification technique, and the effects of
multiple testing due to the publication of outcomes for multiple NLP models and
classification tasks. Both of these factors could bias in favour of the NLP model performance.
Sixteen of the 35 included studies were published in two journals: Studies in
Health Technology and Informatics, and the Journal of the American Medical Informatics
Association. Our search strategy included a grey literature search to minimise the possibility
of publication bias.
At outcome level, a challenge in this review was how best to summarise the performance of
NLP classifiers. In this review a narrative synthesis rather than a quantitative approach was
chosen due to studies reporting outcomes for multiple binary classification and multiple NLP
models, a lack of assurance of data homogeneity, and a lack of a uniform outcome
performance metric.
4.3 THE LACK OF DATA HOMOGENEITY
Although the majority of studies used free text from proprietary incident reporting systems,
this does not mean we can assume data homogeneity between these studies. It is recognised
that the performance of NLP classifiers is very data dependent[61]. This is one explanation
for why a range of NLP models were found to perform best across studies in this review.
Furthermore, this makes it difficult to infer which NLP model would demonstrate the best classifier
performance on a future data set.
4.4 THE USE OF MULTIPLE PERFORMANCE METRICS
When reporting model performance, a metric should be chosen that best represents the
association between model classification and "true" classification[66]. There is however
acceptance that this relationship is complex and multifactorial. As such, there is an argument
for reporting all performance metrics such that the most information possible is available for
those who might wish to develop the model further or for a different use case. Fong et al. and
Ong et al. are good examples, presenting 6 and 5 performance metrics respectively[66][29].
It is known that efforts to improve one NLP model performance measure can detrimentally
affect another[66]. For example, increasing model sensitivity can decrease model precision: if
the model is adjusted to make it more likely to predict a positive occurrence, more positive
occurrences will be correctly recorded as positive, but more negative occurrences will also be
incorrectly recorded as positive.
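A toy example makes this trade-off concrete: with invented classifier scores and labels, lowering the decision threshold raises sensitivity while lowering precision:

```python
def recall_precision(scores, labels, threshold):
    """Sensitivity (recall) and precision of a score-threshold classifier."""
    tp = sum(s >= threshold and y for s, y in zip(scores, labels))
    fp = sum(s >= threshold and not y for s, y in zip(scores, labels))
    fn = sum(s < threshold and y for s, y in zip(scores, labels))
    return tp / (tp + fn), tp / (tp + fp)

# Invented model scores; True marks genuine adverse events.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [True, True, True, False, True, False, False, True]

for t in (0.75, 0.35):
    r, p = recall_precision(scores, labels, t)
    print(f"threshold={t}: sensitivity={r:.2f} precision={p:.2f}")
```

With the higher threshold this invented classifier is precise but misses events; with the lower threshold it catches more events at the cost of extra false positives, mirroring the trade-off described above.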
Because of this, in model development one performance metric may have to be prioritised
above another. Appropriate prioritisation depends on the intended use of the model[66].
In the domain of using NLP for classification of incidents and adverse events, it is important
the model does not miss an important event, thus high sensitivity is important[66]. This could
result in a decreased specificity due to an increased false positive rate. In this case, it is likely
an important event would require some supplemental human confirmation, which could
manage the additional false positives.
4.4.1 The impact of multiple testing
As most studies report performance outcomes for multiple NLP models, they are framed as
methodological exploratory studies as much as experimental studies. The problem with
interpreting the use of multiple models might be considered similarly to interpreting results
from multiple testing[67].
4.5 COMPARING PERFORMANCE AGAINST MANUAL ANNOTATION
Model performance is typically reported as compared to a manually annotated corpus. This
presumes manual annotation to be a gold standard. Thus, the validity of the accuracy
measurements depends on the validity of the manual annotation. The use of multiple
annotators and calculation of inter-rater agreement can improve the validity of manual
annotation, but it has limitations. For example, inter-rater agreement would be unaffected if
both raters missed a classification. Similarly, in most cases there is no way to be certain what
proportion of outcomes are unclassified by both automated and manual systems[66].
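Inter-rater agreement of this kind is often quantified with Cohen's kappa, which discounts the agreement expected by chance. A minimal computation, with invented annotations, might look like this; note that the metric says nothing about events both raters missed:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two annotators beyond chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # chance agreement: probability both raters pick a label independently
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Invented annotations of ten incident reports (F = fall, M = medication).
a = ["F", "F", "M", "M", "F", "M", "F", "F", "M", "F"]
b = ["F", "F", "M", "F", "F", "M", "F", "M", "M", "F"]
print(round(cohens_kappa(a, b), 2))  # 0.58
```

Here raw agreement is 0.80, but kappa falls to roughly 0.58 once chance agreement between the two label distributions is removed, illustrating why kappa is preferred to raw agreement when validating a manually annotated corpus.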
5.0 FUTURE RESEARCH
The majority of studies focus on binary classification, e.g. “drug error or no drug error”, “fall
or no fall”[54][38]. It is recognised that incidents which lead to harm are often the result of
multiple interacting factors[2]. Moving forwards, NLP interrogation of incident reports
should look to achieve high performing multi-class models[7].
Understanding why incidents occur may be more important for effecting change than
understanding what incidents have occurred. Further studies exploring the ability of NLP to
classify incident reports by contributory factors could offer more learning opportunities from
adverse events.
Clinical free text represents a massive data set which has been largely underutilised because
of its size, unstructured nature, and until recently, inability to be electronically
searched[68,69]. A wealth of new knowledge may be generated if computational techniques
such as NLP can make these data suitable for analysis.
6.0 CONCLUSIONS
This systematic review presents evidence that NLP can generate meaningful information
from unstructured data in the specific domain of the classification of incident reports and
adverse events. Understanding what incidents are occurring or why they are occurring is
important in adverse event analysis. NLP has the potential to allow such classification tasks
to be performed at scale, for example between hospitals within a geographic region, between
regions, or across an entire healthcare system. This has the potential to improve learning from
adverse events in healthcare, which may ultimately reduce the risk to future patients.
One of the roles of data science in healthcare is to reduce the human burden of data
acquisition and analysis. The hope, in doing so, is to give healthcare professionals the time to
think creatively and effect change[70]. In a broader context, understanding how to interrogate
this unstructured data offers opportunities in a range of healthcare settings.
CONTRIBUTORSHIP STATEMENT
Ian Young, Saturnino Luz, and Nazir Lone all qualify for authorship according to the
International Committee of Medical Journal Editors (ICMJE). Each shares responsibility for
the conception and design of this study, interpretation of this review, and drafting and critical
revision of the manuscript. Ian Young is the corresponding author and is further responsible
for the acquisition of the data used in this review.
STATEMENT ON CONFLICT OF INTEREST
The authors have no conflicts of interest to declare.
FUNDING
This work was supported by the department of Anaesthesia, Critical Care, and Pain Medicine
at the Royal Infirmary of Edinburgh, via monies from the Trustees of the Edinburgh
Anaesthesia Festival.
SUMMARY TABLE
What was already known
Analysis of incident reports and adverse events in healthcare is considered an important part of quality improvement and patient safety.
NLP can provide structured information from unstructured free text, including performing classification tasks.
What this study adds
Within the domain of incident reporting and adverse event analysis, NLP has been shown to perform favourably compared to manual annotation in a wide range of classification tasks.
Studies in this domain have focussed on binary classification of incident types. Exploring multi-class problems and contributory factor analysis of incident reports could have clinical utility.
No single NLP technique shows superiority in this domain and training multiple models may be required to optimise classifier performance.
Word Count: 3492
REFERENCES
[1] R. Lawton, R.R.C. McEachan, S.J. Giles et al. Development of an evidence-based
framework of factors contributing to patient safety incidents in hospital settings: a
systematic review, BMJ Qual. Saf. 21 (2012) 369–380.
[2] S. Taylor-Adams, C. Vincent, D. Hewett et al. Systems Analysis of Clinical Incidents:
The London Protocol, (n.d.).
[3] To Err Is Human, National Academies Press, Washington, D.C., 2000.
[4] K.E. Wood, D.B. Nash, Mandatory State-Based Error-Reporting Systems: Current and
Future Prospects, Am. J. Med. Qual. 20 (2005) 297–303.
[5] P.J. Pronovost, D.A. Thompson et al. Toward learning from patient safety
reporting systems, (n.d.).
[6] DATIX, (2018). https://www.datix.co.uk/en/.
[7] C. Liang, Y. Gong, Automated Classification of Multi-Labeled Patient Safety Reports:
A Shift from Quantity to Quality Measure., Stud. Health Technol. Inform. 245 (2017)
1070–1074.
[8] M. Govindan, Automated detection of harm in healthcare with information technology:
a systematic review, Qual. Saf. Health Care. 19 (2010) e11.
[9] D.S. Carrell, R.E. Schoen, D.A. Leffler et al. Challenges in adapting existing clinical
natural language processing systems to multiple, diverse health care settings, J. Am.
Med. Informatics Assoc. 24 (2017) 986–991.
[10] R. Pivovarov, N. Elhadad, Automated methods for the summarization of
electronic health records, (n.d.).
[11] R. Jacobsson, Extraction of adverse event severity information from clinical narratives
using natural language processing, Pharmacoepidemiol. Drug Saf. 26 (2017) 37.
[12] C.M. Rochefort, A.D. Verma, T. Eguale et al. A novel method of adverse event
detection can accurately identify venous thromboembolisms (VTEs) from narrative
electronic health record data, (2014).
[13] Z. Tian, S. Sun, T. Eguale et al. Automated extraction of VTE events from narrative
radiology reports in electronic health records: A validation study, Med. Care. 55
(2017) e73–e80.
[14] J.W. Galvez, J.M. Pappas, L. Ahumada et al. The use of natural language processing
on pediatric diagnostic radiology reports in the electronic health record to identify deep
venous thrombosis in children, J. Thromb. Thrombolysis. 44 (2017) 281–290.
[15] W. Yim, S.W. Kwan, M. Yetisgen, Classifying tumor event attributes in radiology
reports, J. Assoc. Inf. Sci. Technol. 68 (2017) 2662–2674.
[16] C.R. Moore, A. Farrag, E. Ashkin, Using Natural Language Processing to Extract
Abnormal Results From Cancer Screening Reports., J. Patient Saf. 13 (2017) 138–143.
[17] M. Yetisgen-Yildiz, M.L. Gunn, F. Xia et al. Automatic identification of critical
follow-up recommendation sentences in radiology reports, AMIA Annu. Symp. Proc.
2011 (2011) 1593–1602.
[18] J.D. Duke, ADESSA: A Real-Time Decision Support Service for Delivery of
Semantically Coded Adverse Drug Event Data, AMIA Annu. Symp. Proc. 2010 (2010)
177–181.
[19] Q. Li, E.S. Kirkendall, E.S. Hall et al. Automated detection of medication
administration errors in neonatal intensive care, Journal of Biomedical Informatics. 57
(2015) 124-133.
[20] Y. Ni, T. Lingren, E.S. Hall et al. Designing and evaluating an automated system for
real-time medication administration error detection in a neonatal intensive care unit., J.
Am. Med. Inform. Assoc. 25 (2018) 555–563.
[21] T. Cai, Natural language processing to rapidly identify potential signals for adverse
events using electronic medical record data: Example of arthralgias and vedolizumab,
Arthritis Rheumatol. 68 (2016) 2802–2804.
[22] A.P. Nunes, J. Yang, L. Radican et al. Assessing occurrence of hypoglycemia and its
severity from electronic health records of patients with type 2 diabetes mellitus,
Diabetes Res. Clin. Pract. 121 (2016) 192–203.
[23] S. Toyabe, Characteristics of Inpatient Falls not Reported in an Incident Reporting
System, Glob. J. Health Sci. 8 (2015) 17–25.
[24] H. Tanushi, Detection of healthcare-associated urinary tract infection in Swedish
electronic health records, Stud. Health Technol. Inform. 207 (2014) 330–339.
[25] D.S. Carrell, S. Halgrim, D.-T. Tran et al. Using Natural Language Processing to
Improve Efficiency of Manual Chart Abstraction in Research: The Case of Breast
Cancer Recurrence, (n.d.).
[26] K. Jensen, C. Soguero-Ruiz, K. Oyvind Mikalsen et al. Analysis of free text in
electronic health records for identification of cancer patient trajectories, Sci. Rep. 7
(2017) 46226.
[27] MENDELEY, (n.d.). https://www.mendeley.com (accessed June 15, 2018).
[28] A. Fong, An Evaluation of Patient Safety Event Report Categories Using Unsupervised
Topic Modeling, Methods Inf. Med. 54 (2015) 338–345.
[29] M.-S. Ong, F. Magrabi, E. Coiera, Automated categorisation of clinical incident
reports using statistical text classification, Qual. Saf. Health Care. 19 (2010) e55.
[30] J. Gupta, I. Koprinska, J. Patrick, Automated Classification of Clinical Incident Types,
Stud. Health Technol. Inform. 214 (2015) 87–93.
[31] G.B. Melton, G. Hripcsak, Automated detection of adverse events using natural
language processing of discharge summaries, J. Am. Med. Informatics Assoc. 12
(2005) 448–457.
[32] J.F.E. Penz, A.B. Wilcox, J.F. Hurdle, Automated identification of adverse events
related to central venous catheters, J. Biomed. Inform. 40 (2007) 174–182.
[33] M.-S. Ong, F. Magrabi, E. Coiera, Automated identification of extreme-risk events in
clinical incident reports, J. Am. Med. Informatics Assoc. 19 (2012) e110–e118.
[34] H.J. Murff, F. FitzHenry, M.E. Matheny et al. Automated Identification of
Postoperative Complications Within an Electronic Medical Record Using Natural
Language Processing, Jama. 306 (2011) 848–855.
[35] R.D. Boyce, J. Jao, T. Miller et al. Automated Screening of Emergency Department
Notes for Drug-Associated Bleeding Adverse Events Occurring in Older Adults.,
Appl. Clin. Inform. 8 (2017) 1022–1030.
[36] J. Gupta, Automated validation of patient safety clinical incident classification: macro
analysis, Stud. Health Technol. Inform. 188 (2013) 52–57.
[37] K. Fujita, M. Akiyama, N. Toyama et al. Detecting effective classes of medical
incident reports based on linguistic analysis for common reporting system in Japan,
Stud. Health Technol. Inform. 192 (2013) 137–141.
[38] S. Toyabe, Detecting inpatient falls by using natural language processing of electronic
medical records, BMC Health Serv. Res. 12 (2012) 448.
[39] L. Han, R. Ball, C.A. Pamer et al. Development of an automated assessment tool for
MedWatch reports in the FDA adverse event reporting system, J. Am. Med.
Informatics Assoc. 24 (2017) 913–920.
[40] A.L. Benin, S.J. Fodeh, K. Lee et al. Electronic approaches to making sense of the text
in the adverse event reporting system, J. Healthc. Risk Manag. 36 (2016) 10–20.
[41] C. Liang, Y. Gong, Enhancing Patient Safety Event Reporting by K-nearest Neighbor
Classifier, Stud. Health Technol. Inform. 218 (2015) 40603.
[42] A. Fong, A.Z. Hettinger, R.M. Ratwani, Exploring methods for identifying related
patient safety events using structured and unstructured data, J. Biomed. Inform. 58
(2015) 89–95.
[43] T. Speroff, Exploring the frontier of electronic health record surveillance: the case of
postoperative complications, Med. Care. 51 (2013) 509–516.
[44] J. Gaebel, T. Kolter, F. Arlt et al. Extraction Of Adverse Events From Clinical
Documents To Support Decision Making Using Semantic Preprocessing, Stud. Health
Technol. Inform. 216 (2015) 1030.
[45] E. Iqbal, R. Mallah, R.G. Jackson et al. Identification of Adverse Drug Events from
Free Text Electronic Patient Records and Information in a Large Mental Health Case
Register, PLoS One. 10 (2015) e0134208.
[46] A. Cohan, A. Fong, R.M. Ratwani et al. Identifying Harm Events in Clinical Care
through Medical Narratives, in: Proc. 8th ACM Int. Conf. Bioinformatics, Comput.
Biol. Heal. Informatics - ACM-BCB ’17, ACM Press, New York, 2017: pp. 52–59.
[47] A. Fong, N. Harriott, D.M. Walters et al. Integrating natural language processing
expertise with patient safety event review committees to improve the analysis of
medication events, Int. J. Med. Inform. 104 (2017) 120–125.
[48] G.B. Weller, J. Lovely, D.W. Larson et al. Leveraging electronic health records for
predictive modeling of post-surgical complications, Stat. Methods Med. Res. 0(0)
(2017) 1-15.
[49] K. Fujita, M. Akiyama, K. Park et al. Linguistic analysis of large-scale medical
incident reports for patient safety, Stud. Health Technol. Inform. 180 (2012) 250–254.
[50] P.A. Ravindranath, S. Bruschi, K. Ernstrom et al. Machine learning in automated
classification of adverse events in clinical studies of Alzheimer’s disease, Alzheimer’s
Dement. 13 (2017) P1256.
[51] C. Liang, Y. Gong, Predicting Harm Scores from Patient Safety Event Reports., Stud.
Health Technol. Inform. 245 (2017) 1075–1079.
[52] W.M. Marella et al. Screening Electronic Health Record-Related Patient Safety
Reports Using Machine Learning, J. Patient Saf. 13 (2017) 31–36.
[53] S.D. McKnight, Semi-Supervised Classification of Patient Safety Event Reports, J.
Patient Saf. 8 (2012) 60–64.
[54] Z.S.Y. Wong, Statistical classification of drug incidents due to look-alike sound-alike
mix-ups, Health Informatics J. 22 (2016) 276–292.
[55] Z.S.Y. Wong, M. Akiyama, Statistical text classifier to detect specific type of medical
incidents, Stud. Health Technol. Inform. 192 (2013) 1053.
[56] L.U. Gerdes, C. Hardahl, Text mining electronic health records to identify hospital
adverse events, Stud. Health Technol. Inform. 192 (2013) 1145.
[57] T. Botsis, M.D. Nguyen, E.J. Woo et al. Text mining for the vaccine adverse event
reporting system: Medical text classification using informative feature selection, J.
Am. Med. Informatics Assoc. 18 (2011) 631–638.
[58] A. Fong, J. Howe, K.T. Adams et al. Using Active Learning to Identify Health
Information Technology Related Patient Safety Events, Appl. Clin. Inform. 8 (2017)
35–46.
[59] Y. Wang, E. Coiera, W. Runciman et al. Using multiclass classification to automate
the identification of patient safety incident reports by type and severity, BMC Med.
Inform. Decis. Mak. 17 (2017) 84.
[60] T. Botsis, T. Buttolph, M. Nguyen et al. Vaccine adverse event text mining system for
extracting features from vaccine safety reports, J. Am. Med. Informatics Assoc. 19
(2012) 1011–1018.
[61] J.F.E. Penz, A.B. Wilcox, J.F. Hurdle, Automated identification of adverse events
related to central venous catheters, J. Biomed. Inform. 40 (2007) 174–182.
[62] S. V. Stehman, Selecting and interpreting measures of thematic classification accuracy,
Remote Sens. Environ. 62 (1997) 77–89.
[63] MedLEE, Medical Language Extraction and Encoding System, (n.d.).
[64] SAS Text Miner, SAS Institute, (n.d.). https://www.sas.com/en_gb/software/text-miner.html.
[65] EQUATOR Network, TRIPOD Checklist: Prediction Model Development and Validation,
(2016).
[66] J. Chubak, G. Pocobelli, N.S. Weiss, Tradeoffs between accuracy measures for
electronic health care data algorithms, J. Clin. Epidemiol. 65 (2012) 343–349.
[67] Y. Benjamini, Y. Hochberg, Controlling the False Discovery Rate: A Practical and
Powerful Approach to Multiple Testing, J. R. Stat. Soc. Ser. B. 57 (1995) 289–300.
[68] T. Murdoch, A. Detsky, The Inevitable Application of Big Data to Health Care, JAMA
309 (2013) 1351–1352.
[69] N.R. Adam, R. Wieder, D. Ghosh, Data science, learning, and applications to
biomedical and health sciences, Ann. N. Y. Acad. Sci. 1387 (2017) 5–11.
[70] B. Young, Getting the measure of diabetes: the evolution of the National Diabetes
Audit, Practical Diabetes 35 (2018) 1–7.
APPENDIX 1
Full electronic search strategy
Limits placed on all searches were English-language articles, in humans only, published between 2005 and 2018. When this review was revised, any article with a publication date after May 8th, 2018 was excluded, as this was the last date on which the search was run for the original review.
Ovid MEDLINE® ALL
1 (natural language processing or NLP or text mining or machine learning or artificial intelligence or information technology or classifier or document classification or semantic similarity or ontology).mp ⇒ 57960
2 (Natural Language Processing/ or Data Mining/ or Artificial Intelligence/ or Machine Learning/) ⇒ 36778
3 (event report or adverse event or medication event or incident report or medication error or medical error or error report or patient safety event).mp ⇒ 23367
4 (Adverse Drug Reaction Reporting System/ or Medical Errors/ or Patient Safety/) ⇒ 38438
5 (*Algorithms/ and 4) ⇒ 144
6 1 or 2 ⇒ 63120
7 3 or 4 ⇒ 58720
8 (6 and 7) or 5 ⇒ 924
MIDIRS: Maternity and Infant Care; Embase
1 (natural language processing or NLP or text mining or text classification or machine learning).af ⇒ 29540 (MIDIRS: Maternity and Infant Care 29; Embase 29511)
2 (event reports or adverse events or medication events or incident report* or patient safety).af ⇒ 152722 (MIDIRS: Maternity and Infant Care 150897; Embase 1825)
3 1 and 2 ⇒ 301 (MIDIRS: Maternity and Infant Care 0; Embase 301)
4 1 and 2, deduplicated ⇒ 290 (MIDIRS: Maternity and Infant Care 0; Embase 290)
Title and Abstract screen: 277 removed due to "wrong topic" or "irrelevant".
CINAHL
Initial search identical to MEDLINE ⇒ 0 results. Then, using CINAHL suggested search terms:
1 Natural Language Processing and Adverse Health Care Event ⇒ 1
Title and Abstract screen: 1 removed for “wrong topic”
Cochrane Library
MeSH "Natural Language Processing", subheading "Classification" ⇒ 11
Title and Abstract screen: 11 excluded for "wrong topic" or "irrelevant".
SciELO
1 Event report* or adverse events or incident reporting AND Natural language processing or NLP ⇒ 3
Title and Abstract screen: 3 excluded for “not English language” or “irrelevant”.
ISI Web of Science
1 Event report* or adverse events or incident reporting AND Natural language processing or NLP
Filtered by Research Domain of "Science Technology" ⇒ 209
Title and Abstract screen: 198 excluded as duplicates, "wrong topic" or "irrelevant"; 11 retained.
Google Scholar (scholar.google.com)
1 "natural language processing" and "medical" and "incident reports" and "classification" ⇒ 228
Title and Abstract screen: 226 excluded as duplicates, "wrong topic" or "irrelevant".
PROSPERO Systematic Reviews (crd.york.ac.uk)
Separate searches for: "natural language processing" ⇒ 2; "NLP" ⇒ 4; "Machine Learning" ⇒ 36
Title and Abstract screen: all excluded for "wrong topic" or "irrelevant".
Hand Search / Reference List / Pearling
Articles identified ⇒ 13