MITRE 2002 The MITRE Corporation. ALL RIGHTS RESERVED.
Text Mining for Surveillance II: Extracting Epidemiological Information from Free Text
Lynette Hirschman, Chief Scientist, Information Technology Center, MITRE
Slide 2
Outline
* Why text mining for surveillance?
- What kinds of text to mine?
- What is text mining?
- Some examples
  - Prodromic binning in RODS
  - Processing patient records: MedLEE
  - Tracking outbreaks in the news: MiTAP
- Open research issues and conclusions
Slide 3
Text Mining for Surveillance
[Diagram labels: Data Streams; Document Classes (respiratory, diarrheal); Extracted Information, Summary Views (ICD9: 465.9 upper respiratory infection)]
- Text Classification: key words to document classes
- Information Extraction: documents to entities, relations
Documents contain useful information for tracking outbreaks if free text can be converted into structured data
Slide 4
Outline
- Why text mining for surveillance?
* What kinds of text to mine?
- What is text mining?
- Some examples
  - Prodromic binning in RODS
  - Processing patient records: MedLEE
  - Tracking outbreaks in the news: MiTAP
- Open research issues
Slide 5
Patient Encounter Data
- Useful information is contained in patient records
  - Clinic visits, emergency room visits, hot lines
  - Data usually occurs as stylized free text
- What to extract?
  - Information useful for prodromic or syndromic surveillance
    - Without text mining, systems often just track fluctuations in the number of admissions
    - New systems (e.g., RODS) can bin text data into prodromes or syndromes
- Time is critical in detecting an outbreak
  - Delays in collecting, processing and aggregating information lead to delays in response
  - Moral: grab what you can (Chief Complaint)
Slide 6
Example of Clinical Data: Triage Chief Complaint (TCC)*
Examples: NVD COUGH SOB DIZZY NAUSEA VOMITITING
TCC is
- short (20-50 characters, 1-10 words)
- timely (available upon patient admission)
- errorful (typos and abbreviations)
*From R. Olszewski, Bayesian Classification of Triage Diagnoses for the Early Detection of Epidemics, Recent Advances in Artificial Intelligence: Proc. of the 16th International FLAIRS Conf., pp. 412-416, AAAI Press, 2003.
Slide 7
Clinical Data: Radiology Report*
- Telegraphic style
- Extensive jargon (sublanguage)
- Fields vary depending on report type
- Mention of a condition (hilar adenopathy) is not equivalent to assertion: "Patient does not have hilar adenopathy"
*http://cat.cpmc.columbia.edu/MedLEExml/demo/
Slide 8
Global Disease Tracking from News
- Capture of global outbreak information
  - The recent SARS outbreak underscores the importance of global monitoring for outbreaks
  - These are often first reported in (local) news media or by informal communication (web chat rooms)
- Global outreach to capture local news is critical
  - Local news sources tend to be in local languages (requiring translation)
  - Local news may be by radio, requiring capture from broadcast news sources (radio, TV)
  - This requires more advanced text processing technology (speech transcription, translation)
Slide 9
Example: Global Disease Tracking
Message from Feb. 10 in ProMED* on SARS, displayed in MiTAP
*ProMED: Program for Monitoring Emerging Diseases: http://www.promedmail.org
Slide 10
Outline
- Why text mining for surveillance?
- What kinds of text to mine?
* What is text mining?
- Some examples
  - Prodromic binning in RODS
  - Processing patient records: MedLEE
  - Tracking outbreaks in the news: MiTAP
- Open research issues
Slide 11
The Components of Text Mining
Inputs (collections): News Reports, Patient Records, MEDLINE
Components:
- Information Retrieval, Text Classification: key words to document classes
- Information Extraction: documents to entities, relations
- Question Answering: question to answer
Outputs: Document Classes; Summaries, Tables; Facts (e.g., "SARS traced to civet cat")
Slide 12
Text Mining Modules
- Binning documents into coherent sets (e.g., prodromes)
- Extracting key entities (symptoms, diseases, locations) and relations (time, severity, frequency) from narrative text
- Summarizing the findings (in a single record or across multiple records)
- Visualizing the data: create tables from textual data for display (e.g., charts, maps)
- Finding answers to natural language questions (the nuggets in a collection of documents)
Text mining takes free text as input and distills value-added information from it
Slide 13
Text Mining (1)
- Binning technology (classification)
  - Shallow and fast
  - Usually uses a bag-of-words approach
  - Must be trained to classify free text into the desired set of bins
- Extraction
  - Relies on words in context (statistical or linguistic)
  - Is designed to get at content/details, including negated or qualified conditions (deeper, slower)
  - Requires either hand-tailored rules or machine learning algorithms trained on extensive annotated data
Slide 14
Text Mining (2)
- Summarization
  - Provides distillation of information across multiple records
  - Relies on a mix of pattern recognition and semantic analysis
- Visualization
  - Takes as input values extracted from free text
  - Useful for interpreting complex spatio-temporal data (graphs, maps)
- Question answering
  - Can return nuggets of information
  - Works by analyzing the question, locating specific documents and extracting the right type of fact
Slide 15
Outline
- Why text mining for surveillance?
- What kinds of text to mine?
- What is text mining?
- Some examples
  * Prodromic binning in RODS
  - Processing patient records: MedLEE
  - Tracking outbreaks in the news: MiTAP
- Open research issues
Slide 16
Example #1: RODS (Real-Time Outbreak and Disease Surveillance)*
[Architecture diagram. Labels: Emergency Departments; Admission Records from Emergency Departments; RODS System; Preprocessor; CoCo; Database; Detection Algorithms; Geographic Information System; Web Server; Graphs and Maps]
*Slide courtesy of Wendy Chapman, U. Pittsburgh
Slide 17
Example System #1: RODS*
- RODS created at the University of Pittsburgh
  - Used in multiple deployments, including health monitoring during the 2002 Winter Olympics and Western Pennsylvania health surveillance
- RODS captures electronic medical records
  - Applies natural language processing to bin chief complaints into a set of syndromes, e.g., respiratory, diarrheal, rash
  - Detects out-of-the-ordinary occurrences
  - Provides temporal and geospatial visualization
*http://www.health.pitt.edu/rods/
Slide 18
Text Mining: Binning Into Syndromes
- Naive Bayesian classifier used to bin triage diagnoses into 8 syndromes
  - Unigram model gave good results: 0.80 to 0.97 area under the ROC curve, depending on syndrome, compared to human experts
  - Bigram and mixture models did worse than the unigram model (due to sparse data)
- Correcting spelling mistakes and expanding abbreviations improved results by about a percentage point
- Some syndromes are harder than others (botulinic syndrome was only around 78% AUC)
*R. Olszewski, Bayesian Classification of Triage Diagnoses for the Early Detection of Epidemics, Recent Advances in Artificial Intelligence: Proc. of the 16th International FLAIRS Conf., pp. 412-416, AAAI Press, 2003.
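Note: the unigram (bag-of-words) naive Bayes binning described on this slide can be illustrated with a minimal sketch. This is not the RODS or CoCo implementation; the class name, the add-one smoothing choice, the two syndrome labels and the toy training complaints are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class UnigramNaiveBayes:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # complaints seen per syndrome
        self.word_counts = defaultdict(Counter)  # token counts per syndrome
        self.vocab = set()

    def fit(self, complaints, labels):
        for text, label in zip(complaints, labels):
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        """Return log P(syndrome) + sum of log P(token | syndrome) per syndrome."""
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                count = self.word_counts[label][tok]  # 0 for unseen tokens
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy training data (not from the cited study).
TRAIN = [
    ("cough sob", "respiratory"),
    ("sob wheezing", "respiratory"),
    ("cough fever", "respiratory"),
    ("nvd", "diarrheal"),
    ("nausea vomiting", "diarrheal"),
    ("diarrhea abd pain", "diarrheal"),
]
model = UnigramNaiveBayes().fit(*zip(*TRAIN))
print(model.predict("cough wheezing"))
```

Because each token contributes independently, the model needs no syntax, which suits short chief complaints; it also means misspelled tokens are simply unseen words unless they are corrected first.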
Slide 19
Text Mining for Chief Complaint: Conclusions*
- Naive Bayes works well enough for Chief Complaint
- Triage complaints are well-suited to bag of words
  - They are short with little syntax, few modifiers
  - They present positive complaints (no negation)
- Some issues
  - Entries may be too brief to provide adequate data to separate certain syndromes
  - There may be insufficient training data for some bins (botulinic bin had the fewest cases)
  - Triage complaints may lack sufficient detail for certain applications
More advanced linguistic techniques may be overkill for Chief Complaint reports
*Ivanov, O., et al. Accuracy of Three Classifiers of Acute Gastrointestinal Syndrome for Syndromic Surveillance. AMIA 2002 Ann. Symp. Proc., 345-349.
Slide 20
RODS
Mining syndromic information with time and geospatial coordinates allows effective monitoring for anomalous events
Slide 21
Outline
- Why text mining for surveillance?
- What kinds of text to mine?
- What is text mining?
- Some examples
  - Prodromic binning in RODS
  * Processing patient records: MedLEE
  - Tracking outbreaks in the news: MiTAP
- Open research issues
Slide 22
Example System #2: MedLEE* (Carol Friedman, Columbia)
[Diagram: Medical Language Processor]
- Domains: Radiology, Discharge Summaries, Pathology
- Institutions
- Applications: Surveillance, Error Tracking, Patient Record Access
- Sample inputs:
  - "lungs are clear; no gallops or rubs"
  - "this echo shows some thickening of mitral valve"
  - "no mass or calcification noted on this xray"
*Friedman, C., et al. A General Natural-Language Text Processor for Clinical Radiology. JAMIA 1(2): 161-174, 1994.
Slide 23
MedLEE Processes Complex Narrative*
[Screenshot: output in HL7 format, with codes]
*http://cat.cpmc.columbia.edu/MedLEExml/demo/
Slide 24
MedLEE Provides Multiple Output Formats
- Mark-up version, showing only positive findings: conditions are in RED, procedures in GREEN
- Indented version, with findings and modifiers
*http://cat.cpmc.columbia.edu/MedLEExml/demo/
Slide 25
MedLEE Architecture [1]
[Diagram: Text to Structured form]
- Processing components: Pre-Processor, Parser, Phrase Regularization, Encoder, Error Recovery
- Knowledge components: Lexicon, Abbreviations, WSD Rules*, Grammar, Mappings, Coding Table
*Word sense disambiguation
[1] Slide courtesy of Carol Friedman
Slide 26
MedLEE Processing Pipeline
- Pre-processor does lexical look-up to assign semantic classes, abbreviation expansion and other disambiguation:
  - HR = heart rate or hour?
  - Discharge = patient status or sign/symptom?
- Parsing is done with a semantic grammar, e.g.,
  - DEGREE + CHANGE + FINDING handles: "mild increase in congestion", "mildly increased congestion"
- Phrase regularization maps semantically equivalent phrases into a structured controlled vocabulary:
  - "Heart appears to be slightly enlarged" => enlarged heart
- Encoding maps phrases into the appropriate format
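Note: the DEGREE + CHANGE + FINDING rule above can be illustrated with a toy sketch. This is not MedLEE's grammar or lexicon; the mini-lexicon, the function names and the output dictionary layout are hypothetical, chosen only to show how two surface variants can be mapped to one structured form.

```python
# Hypothetical mini-lexicon: surface word -> (semantic class, canonical form).
LEXICON = {
    "mild": ("DEGREE", "mild"),
    "mildly": ("DEGREE", "mild"),
    "increase": ("CHANGE", "increase"),
    "increased": ("CHANGE", "increase"),
    "congestion": ("FINDING", "congestion"),
}


def tag(phrase):
    """Lexical look-up; words missing from the lexicon (e.g. 'in') are dropped."""
    words = phrase.lower().replace(",", " ").split()
    return [LEXICON[w] for w in words if w in LEXICON]


def parse_degree_change_finding(phrase):
    """Return a structured form if the phrase matches DEGREE + CHANGE + FINDING."""
    tagged = tag(phrase)
    if [sem_class for sem_class, _ in tagged] != ["DEGREE", "CHANGE", "FINDING"]:
        return None
    degree, change, finding = (canonical for _, canonical in tagged)
    return {"finding": finding, "change": change, "degree": degree}


for text in ("mild increase in congestion", "mildly increased congestion"):
    print(text, "=>", parse_degree_change_finding(text))
```

Matching on semantic classes rather than on literal words is what lets one rule cover both phrasings; a real system would also need rules for word order, negation and unknown words.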
Slide 27
MedLEE Captures Language Complexity [1]
- Synonyms and ambiguities
- Modification and predication relations
  - enlarged heart = cardiac enlargement = heart appears to be enlarged
- Mapping into (several) standard forms via coding tables:
  - abdominal pain with no modifiers: C0000737 (UMLS code)
  - abdominal pain modified by "no": C0423651 (no abdominal pain), C0518732 (abdominal pain not present)
[1] Slide courtesy of Carol Friedman
Slide 28
MedLEE Applications
- Framework is general enough that it has been applied to different medical domains
  - Discharge summaries, radiology, mammography, pathology
  - And to the biological domain as well (GENIES)
- It has been evaluated in applications to:
  - Generate alerts to isolate patients suspicious for tuberculosis
  - Detect patients who have positive mammograms
  - Compare comorbidities for community-acquired pneumonia using MedLEE encoding vs administrative data (ICD-9 codes)
Slide 29
Extending MedLEE [1]
- Type of expertise required
- Grammar: requires NLP expertise
  - 450 rules (CXR)
  - 730 rules (DSUM)
  - Little change for remaining domains
- Lexicon, abbreviations, WSD, coding table: domain expertise
- Compositional mappings: automated
[Chart: Domain Specific Lexical Entries]
[1] Slide courtesy of Carol Friedman
Slide 30
MedLEE Evaluations
- Chest radiographic reports assessed for 24 conditions [1]
  - System processed ~900,000 reports on 250,000 patients
  - 150 reports compared to manual coding, with sensitivity of 0.88, specificity of 0.99
- Data extracted by MedLEE used to calculate severity scores for community-acquired pneumonia [2]
  - Discharge summaries: sensitivity 92%; specificity 93% compared to human coders
  - Chest x-rays: sensitivity 87%; specificity 96%
[1] Hripcsak et al., Use of Natural Language Processing to Translate Clinical Information from a Database of 889,921 Chest Radiographic Reports. Radiology, 157-163, July 2002.
[2] Friedman et al., Automating a Severity Score Guideline for Community-Acquired Pneumonia Employing Medical Language Processing of Discharge Summaries. AMIA, 256-260, 1999.
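Note: sensitivity and specificity figures like those above come from comparing system output with manual coding, report by report. A minimal sketch of that computation follows; the function name and the ten toy labels are invented for illustration and are not data from the cited studies.

```python
def sensitivity_specificity(system, reference):
    """Compare binary system output with reference (manual) coding.

    Both arguments are equal-length sequences of booleans, one per report:
    True means the condition is coded as present.
    """
    tp = fp = tn = fn = 0
    for sys_label, ref_label in zip(system, reference):
        if ref_label and sys_label:
            tp += 1
        elif ref_label and not sys_label:
            fn += 1
        elif not ref_label and sys_label:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference label")
    sensitivity = tp / (tp + fn)  # fraction of true cases the system found
    specificity = tn / (tn + fp)  # fraction of non-cases the system left alone
    return sensitivity, specificity


# Invented toy labels for ten reports.
reference = [True, True, True, True, False, False, False, False, False, False]
system = [True, True, True, False, False, False, False, False, False, True]
print(sensitivity_specificity(system, reference))
```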
Slide 31
Information Extraction: Conclusions
- MedLEE has been demonstrated to provide automated extraction of detailed information, with high correlation to human coding
- NLP can extract more detail and finer-grained information than, e.g., ICD-9 codings
- System must be manually tailored to new reports
  - E.g., radiology has a different vocabulary and style, compared to pathology or chief complaint or discharge summary
  - One hospital may differ from another in its report format and even vocabulary
With tailoring, Information Extraction can be used to extract data for retrospective studies and detailed tracking of the course of disease
Slide 32
Outline
- Why text mining for surveillance?
- What kinds of text to mine?
- What is text mining?
- Some examples
  - Prodromic binning in RODS
  - Processing patient records: MedLEE
  * Tracking outbreaks in the news: MiTAP
- Open research issues
Slide 33
MITRE Text & Audio Processing [1]
- Prototype for monitoring infectious disease outbreaks & other global threats
- Delivers information on demand
  - In real time, 24x7
  - From live, on-line sources
  - Global news, at local level
  - In multiple languages
- Part of DARPA* TIDES* program
- Available to qualified users via registration at the MiTAP web site: http://mitap.sdsu.edu/p/p/
*DARPA: Defense Advanced Research Projects Agency; TIDES: Translingual Information Detection, Extraction & Summarization
[1] Damianos et al., "MiTAP, Text and Audio Processing for Bio-Security: A Case Study." In Proc. of IAAI-2002: The 14th Innovative Applications of Artificial Intelligence Conf., 2002.
Slide 34
System Overview
[Diagram. Stages: capture, analysis, browsing, searching]
- 90+ sources, ~4K msgs/day
- 8 languages, with MT
- Grouped into newsgroups by source, disease, person, region, organization
Slide 35
SARS: Severe Acute Respiratory Syndrome
- First record in MiTAP
  - ProMED, Feb 10, 9 PM
- MiTAP finds in US press
  - Miami Herald, Feb 11
  - Other countries: Jakarta Post, Feb 12
Slide 36
Tracing the Record for SARS
- Searching the MiTAP archives for
  - SARS OR pneumonia OR acute respiratory infection
- 655 hits from Feb 1 to March 22
- Sample from search page from http://mitap.sdsu.edu/search
Slide 37
News Reader Interface
- System is accessible via standard news reader or web-based search engine
- News is categorized by disease, source, region, and custom categories
- Messages are cross-posted to relevant newsgroups
- Stories can be sorted by subject, source, date
Slide 38
MiTAP Interface
- Each article is indexed for search
- Routed to one or more newsgroups
- Tagged (via color code) for relevant entities, e.g.,
  - Disease
  - Location
  - Time
- Translated if appropriate (currently disabled)
[Screenshot callout: Top locations pop up]
Slide 39
Summarization: Daily Top 10 Diseases
- Diseases in today's news, ranked by # articles
- Click to view extracts
- Compare to yesterday's news
- # MiTAP articles today
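Note: ranking diseases by number of articles, as in this daily summary, can be sketched as a simple count over the disease tags attached to each article. This is an illustration only, not MiTAP code; the function name and the toy tag lists are assumptions.

```python
from collections import Counter


def top_diseases(articles, n=10):
    """Rank diseases by the number of articles that mention them.

    `articles` is a list with one entry per article; each entry is the list of
    disease tags found in that article. A disease is counted once per article,
    however often it is tagged inside that article.
    """
    counts = Counter()
    for tags in articles:
        counts.update(set(tags))
    return counts.most_common(n)


# Invented toy input: three articles with their disease tags.
todays_articles = [
    ["SARS", "influenza"],
    ["SARS"],
    ["cholera", "SARS", "SARS"],
]
print(top_diseases(todays_articles))
```

Running the same function over the previous day's articles would give the counts needed for the "compare to yesterday" view.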
Slide 40
Thumbnail of March 20 Top Stories on SARS
Slide 41
Multi-Document Summarization
- Summary of clustered documents
- Links to MiTAP docs
Slide 42
TIDES World Press Update (WPU):
- Daily newsletter prepared by consultant
- Collated from ~50 mostly foreign news sources in MiTAP
- Review of 800-1000 articles in
MITRE 2002 The MITRE Corporation. ALL RIGHTS RESERVED.
Extracting Meaning from Language is Hard
- Meaning may depend on context, e.g.,
    - Discharge of patient vs. bloody discharge
    - Chest negative => chest [x-ray] negative
- One meaning can be expressed in many ways:
  - Enlarged heart = cardiomegaly
- Complex syntactic relations
  - Enlarged heart = heart is enlarged
  - Severe pain and fever = (severe pain) + fever
  - Pain in left arm and wrist = left (arm and wrist)
- Language varies from domain to domain
  - New vocabulary and phrases required for every new specialty
  - If a disease isn't named, it is hard to find it!
Slide 52
Language is Hard: Negation
- Chapman is developing a negation detection tool, NegEx*, to detect negation and its scope
- Distinguish:
  - "Patient denies pain and shortness of breath" => pain (negated); sob (negated)
  - "Patient denies pain but has shortness of breath" => pain (negated); sob (positive)
- Negative expressions include:
  - No, not, deny, ruled out, no complaint of, absence of, free of, without, fails to reveal, ...
*http://omega.cbmi.upmc.ed/~chapman/NegEx.html
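Note: the two "denies" examples above can be reproduced with a much-simplified trigger-and-scope sketch. This is not the NegEx algorithm itself: it assumes only forward-looking triggers (so "ruled out" is omitted), a single scope terminator ("but"), and a caller-supplied list of findings; all names are hypothetical.

```python
import re

# Forward-looking negation triggers taken from the slide, longest first.
TRIGGERS = sorted(
    (t.split() for t in [
        "no", "not", "deny", "denies", "without",
        "no complaint of", "absence of", "free of", "fails to reveal",
    ]),
    key=len,
    reverse=True,
)
TERMINATORS = {"but"}  # assumed: words that end the scope of a negation


def _match(tokens, i, phrases):
    """Return the first phrase (a token list) found in tokens at position i."""
    for phrase in phrases:
        if tokens[i:i + len(phrase)] == phrase:
            return phrase
    return None


def negation_status(sentence, findings):
    """Label each finding mentioned in the sentence as 'negated' or 'positive'."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    targets = sorted((f.lower().split() for f in findings), key=len, reverse=True)
    status, negated, i = {}, False, 0
    while i < len(tokens):
        found = _match(tokens, i, targets)
        trigger = _match(tokens, i, TRIGGERS)
        if found:
            status[" ".join(found)] = "negated" if negated else "positive"
            i += len(found)
        elif trigger:
            negated = True  # negation scope opens after the trigger
            i += len(trigger)
        else:
            if tokens[i] in TERMINATORS:
                negated = False  # scope closes at the terminator
            i += 1
    return status


FINDINGS = ["pain", "shortness of breath"]
print(negation_status("Patient denies pain and shortness of breath", FINDINGS))
print(negation_status("Patient denies pain but has shortness of breath", FINDINGS))
```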
Slide 53
Key Research Issues: Some Problems
- Input format for medical records is irregular
  - No uniform medical record format
  - Records may be truncated, with idiosyncratic abbreviations and typos
- Desired output format is also non-standard:
  - Multiple nomenclatures and encodings, e.g.,
    - ICD9
    - SNOMED
    - UMLS
- There are many subdomains:
  - Tools needed to help automatic tailoring to new domains
  - Training data and resources (lexicons with synonym lists) are key
Slide 54
Conclusions
- Text mining and information extraction have been successfully applied in several relevant domains
  - Binning into syndromes
  - Extraction of complex information from patient records
  - Capture, binning and mark-up of global news on infectious disease
- There are still major challenges:
  - There is no standardized input or output
  - Systems are cumbersome to port to new subdomains and tasks
  - Evaluation is difficult: hard to evaluate quality of extraction given noisy data, complex tasks
    - Standard benchmark test sets would help
Slide 55
Acknowledgements
- I would like to thank
  - Wendy Chapman for materials on RODS
  - Carol Friedman for materials on MedLEE
  - Cmd Eric Rasmussen, US Navy, whose vision has guided MiTAP
  - Mark Prutsalis, MiTAP's most productive user
  - Bob Younger and Sue Ellen Moore for MiTAP technology transfer to SPAWAR Systems Center
  - And my MITRE colleagues for the work on the MiTAP system:
    - Laurie Damianos (PI), Steve Wohlever, George Wilson, Marc Ubaldino, Andy Chisholm, Janet Hitzeman, Conrad Chang, Andy Shen
  - DARPA for its funding of the MiTAP work
  - NSF for its funding of our recent work in disease modeling and prediction
Slide 56
Backup
Slide 57
Information Extraction Evaluations for Newswire
- Relation extraction now at over 80%
- Event extraction less than 60%, improving slowly
- Name extraction > 90% in English, Japanese; improving in Chinese
- Commercial name taggers exist for news reports in multiple languages
Results show best of show each year
Slide 58
Question Answering (MITRE's QANDA System)
Where did Dylan Thomas die?
✖ 1. Swansea: In Dylan: the Nine Lives of Dylan Thomas, Fryer makes a virtue of not coming from Swansea
✖ 2. Italy: Dylan Thomas's widow Caitlin, who died last week in Italy aged 81,
✓ 3. New York: Dylan Thomas died in New York 40 years ago next Tuesday
What diseases are caused by prions?
✓ 1. Both CJD and BSE are caused by mysterious particles of infectious protein called prions
✓ 2. Scientists trying to understand the epidemic face an unusual problem: BSE, scrapie, and CJD are caused by a bizarre infectious agent, the prion, which does not follow the normal rules of microbiology.
✖ 3. These diseases are caused by a prion, an abnormal version of a naturally-occurring protein, but researchers have recognized different strains of prions that differ in incubation times, symptoms, and severity of illness....
Slide 59
Question Answering
- Stage 1: Question analysis
  - Find the type of object that answers the question: "when" needs a time, "which proteins" needs a protein
- Stage 2: Document retrieval
  - Using the (augmented) question, retrieve a set of possibly relevant documents via information retrieval
- Stage 3: Document processing
  - Search documents for entities of the desired type using information extraction
  - Search for entities in appropriate relations
- Stage 4: Rank answer candidates
- Stage 5: Present the answer (N bytes, or a phrase or a sentence or a summary)
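Note: the five stages above can be sketched end to end in a few lines. This toy is not MITRE's QANDA system; the answer-type rules, the tiny place gazetteer, the stopword list, the word-overlap retrieval score and the three documents are all assumptions chosen to keep the example self-contained. Stage 3's relation check is omitted, and a question that does not start with a known question word raises KeyError.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "did", "was", "is", "in",
             "what", "when", "where", "who"}

# Stage 1 rules (assumed): question word -> expected answer type.
ANSWER_TYPES = {"when": "YEAR", "where": "PLACE"}

# Toy "information extraction": a year regex and a tiny invented gazetteer.
EXTRACTORS = {
    "YEAR": lambda text: re.findall(r"\b(?:1[0-9]{3}|20[0-9]{2})\b", text),
    "PLACE": lambda text: [p for p in ("New York", "Swansea", "Italy") if p in text],
}


def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def answer(question, documents, top_docs=2):
    # Stage 1: question analysis -> expected answer type.
    answer_type = ANSWER_TYPES[question.lower().split()[0]]
    query = content_words(question)
    # Stage 2: document retrieval by word overlap with the question.
    ranked_docs = sorted(documents,
                         key=lambda d: len(query & content_words(d)),
                         reverse=True)
    # Stage 3: document processing -> entities of the desired type.
    candidates = Counter()
    for doc in ranked_docs[:top_docs]:
        overlap = len(query & content_words(doc))
        for entity in EXTRACTORS[answer_type](doc):
            candidates[entity] += overlap
    # Stage 4: rank answer candidates by accumulated score.
    ranked = candidates.most_common()
    # Stage 5: present the best answer as a short phrase.
    return ranked[0][0] if ranked else None


# Toy document collection written for this sketch.
DOCS = [
    "Swansea is a city in Wales.",
    "Dylan Thomas died in New York in 1953.",
    "The poet was born in Swansea.",
]
print(answer("Where did Dylan Thomas die?", DOCS))
print(answer("When did Dylan Thomas die?", DOCS))
```

With only word overlap and entity typing, a sentence that mentions the right person next to the wrong place can outrank the correct one, which is why Stage 3 also looks for entities in the appropriate relation.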
Slide 60
TREC Q&A 2000 Results (250-byte)
Harabagiu and Moldovan, Southern Methodist University
- Mean Reciprocal Rank: 76%
- First Answer Correct: 69%
- Correct Answer in Top 5: 86%
Lessons: question answering works, at least for simple factual questions
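Note: the three figures on this slide are related summaries of the same ranked answer lists. A minimal sketch of how they are computed follows; the function name and the four toy ranks are invented, and the input is assumed to be, for each question, the rank (1 to 5) of the first correct answer, or None if none of the top five is correct.

```python
def qa_metrics(first_correct_ranks):
    """Compute MRR, first-answer accuracy and top-5 accuracy.

    Each entry is the rank of the first correct answer for one question,
    or None when no correct answer was returned; such questions score 0.
    """
    n = len(first_correct_ranks)
    found = [r for r in first_correct_ranks if r is not None]
    mrr = sum(1.0 / r for r in found) / n
    first_correct = sum(1 for r in found if r == 1) / n
    in_top_5 = sum(1 for r in found if r <= 5) / n
    return mrr, first_correct, in_top_5


# Invented ranks for four questions.
ranks = [1, 3, None, 1]
print(qa_metrics(ranks))
```

Mean reciprocal rank always lies between the first-answer accuracy and the top-5 accuracy, which matches the ordering of the three percentages reported above.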
Slide 61
State of NLP: Metrics
- Automated systems exist now that can:
  - Return classes of documents relevant to a subject (information retrieval: IR)
  - Identify entities (90-95% accuracy) or relations among entities (70-80% accuracy) in text (information extraction: IE)
  - Answer factual questions using large document collections at 75-85% accuracy (question answering: QA)
  - Provide translations good enough for skimming
- But... these results are for news stories
- How do these results translate to medical data?
Slide 62
Negatives from Chapman's NegEx site
Slide 63
Example of Clinical Data (1): Discharge Summary
Medical records
- have internal structure
- rely heavily on specialized terminology and abbreviations
Slide 64
MedLEE: Discharge Summary
Slide 65
MedLEE: Discharge Summary with Mark-Up
Slide 66
MedLEE: Discharge Summary in HL7
Slide 69
First SARS Message in MiTAP: Feb 10, 2003