2010 HDWA Annual Conference
Data Warehousing – Adding Value to Healthcare
Pathology Reports Information Extraction: An
OHNLP and UMLS Powered Approach
Naveen Ashish, Research Associate Professor
September 14th 2010
HDWA 2010, Durham, NC
All Rights Reserved, Duke Medicine 2007
Agenda
• Introduce institution, research group, and project
• Outline automated information extraction problem
• Solution
– Using open frameworks
– Open ontology resources
• Current Status
• Domain experts engagement
• Conclusions
University of California, Irvine
Medical Center
University of California, Irvine Medical Center is a 422-
bed tertiary teaching hospital with a commitment to
education, research and quality patient care. UCI
Medical Center is a Magnet Designated facility with a
Level 1 Trauma Center, Burn Center and Level II
Neonatal Care Center.
• Not-for-Profit
• # Employees
• # ER Visits
• # Admissions
Data Warehouse Profile
• UCI Clinical Informatics Team
– Director
– Informatics Solutions Architect
– Principal Statistician/Advisor
– Informatics Outreach Architect (future)
– Clinical Practice Engineer
– Clinical Research Informatics Lead
– Business Intelligence Developer (2)
– Clinical Informatics Specialist
– NLP Specialist
Project Team
• Supported by UCI Medical Center Clinical Informatics
Department
• Collaboration between UCI Medical Center (Clinical
Informatics) and Calit2/Computer Science
• Members
– Naveen Ashish (NLP and CS Researcher)
– Lisa Dahm (Director, Biomedical Informatics)
– Charles Boicey (Informatics Architect)
Vision
UCI QUP
Quest
Text Reports
Analysis
(UCI) Pathology Report
Reports
• Pathology reports
– Free text but “semi-structured” as well
– Nuggets of information in the text
What do we want to ask?
Sample (retrieval) “questions”
Surgical Pathology
Patients with a surgical pathology report containing undifferentiated
lymphoepithelioma-like gastric carcinoma.
Patients with a surgical pathology report containing spindle cell carcinoma of
the breast, grade 3, margin(s) positive, node(s) positive.
Discharge Note
Patients with a discharge note containing a diagnosis of cerebrovascular
accident and diabetes mellitus type II discharged in stable condition to home.
Female patients with a discharge diagnosis of Ewing sarcoma, hypertension
and obesity discharged in stable condition to home.
Thus we need
• Sections and sub-sections
• Associations
• Terms
• Dimensions
• …
In Text
FINAL DIAGNOSIS AFTER MICROSCOPY:
LUNG, LEFT LOWER LOBE, WEDGE RESECTION:
POORLY DIFFERENTIATED ADENOCARCINOMA OF PULMONARY ORIGIN
SIZE: 1.5 CM
STAPLED RESECTION MARGIN: NEGATIVE
5 NECROSIS
EXTENSIVE FIBROSIS IS NOT PRESENT
PLEASE SEE COMMENT
FINAL DIAGNOSIS AFTER MICROSCOPY:
A. DEEP TRICEPS MARGIN, EXCISION:
POSITIVE FOR SARCOMA
B. LATERAL SUPERIOR MARGIN, EXCISION:
POSITIVE FOR SARCOMA
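The section and sub-section structure visible in the sample text above can be recovered with a few regular expressions. The following is a minimal sketch, not the project's actual code: it assumes that headings are upper-case lines ending in a colon, that specimen labels look like "A." or "B.", and that the set `SECTION_TITLES` (a hypothetical name) lists the known top-level sections.

```python
import re

# Assumption: top-level section titles are known in advance.
SECTION_TITLES = {"FINAL DIAGNOSIS AFTER MICROSCOPY"}

# A heading is an upper-case line ending in a colon, optionally
# preceded by a specimen label such as "A." or "B.".
HEADING = re.compile(r"^(?:(?P<label>[A-Z])\.\s+)?(?P<title>[A-Z][A-Z ,]*[A-Z]):$")

def segment(report):
    """Split report text into sub-section records with their findings."""
    records = []
    section = None
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match and match.group("title") in SECTION_TITLES:
            # New top-level section; sub-sections that follow belong to it.
            section, current = match.group("title"), None
        elif match:
            current = {
                "section": section,
                "label": match.group("label"),
                "subsection": match.group("title"),
                "findings": [],
            }
            records.append(current)
        elif current is not None:
            # Any other line is a finding of the current sub-section.
            current["findings"].append(line)
    return records

REPORT = """FINAL DIAGNOSIS AFTER MICROSCOPY:
A. DEEP TRICEPS MARGIN, EXCISION:
POSITIVE FOR SARCOMA
B. LATERAL SUPERIOR MARGIN, EXCISION:
POSITIVE FOR SARCOMA"""

records = segment(REPORT)
```

Each record keeps the section, the specimen label, the sub-section heading, and the lines that follow it, which is the shape needed to answer questions such as "margin(s) positive".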
System
• Unstructured → Structured → Analysis
– OHNLP, UCI-QUP application (rules, code)
– Database (warehouse)
– GUI, Tableau, i2b2
Related Work
• Computerized Extraction of Information on the Quality of Diabetes Care from Free Text in Electronic Patient Records of General Practitioners. Jaco Voorham, Petra Denig. JAMIA 2007;14:349-354. doi:10.1197/jamia.M2128
• Application of information technology: MedEx: a medication information extraction system for clinical narratives. Hua Xu, Shane P Stenner, Son Doan, Kevin B Johnson, Lemuel R Waitman, Joshua C Denny. JAMIA 2010;17:19-24. doi:10.1197/jamia.M3378
• Identifying Smokers with a Medical Extraction System. Cheryl Clark, Kathleen Good, Lesley Jezierny, Melissa Macpherson, Brian Wilson, Urszula Chajewska. JAMIA 2008;15:36-39. doi:10.1197/jamia.M2442
• Automated evaluation of electronic discharge notes to assess quality of care for cardiovascular diseases using Medical Language Extraction and Encoding System (MedLEE). Jung-Hsien Chiang, Jou-Wei Lin, Chen-Wei Yang. JAMIA 2010;17:245-252. doi:10.1136/jamia.2009.000182
• Using Regular Expressions to Abstract Blood Pressure and Treatment Intensification Information from the Text of Physician Notes. Alexander Turchin, Nikheel S Kolatkar, Richard W Grant, Eric C Makhni, Merri L Pendergrass, Jonathan S Einbinder. JAMIA 2006;13:691-695. doi:10.1197/jamia.M2078
• Natural Language Processing Framework to Assess Clinical Conditions. Henry Ware, Charles J Mullett, V Jagannathan. JAMIA 2009;16:585-589. doi:10.1197/jamia.M3091
• A General Natural-language Text Processor for Clinical Radiology. Carol Friedman, Philip O Alderson, John H M Austin, James J Cimino, Stephen B Johnson. JAMIA 1994;1:161-174. doi:10.1136/jamia.1994.95236146
• Improved Identification of Noun Phrases in Clinical Radiology Reports Using a High-Performance Statistical Natural Language Parser Augmented with the UMLS Specialist Lexicon. Yang Huang, Henry J Lowe, Dan Klein, Russell J Cucina. JAMIA 2005;12:275-285. doi:10.1197/jamia.M1695
• Automated Encoding of Clinical Documents Based on Natural Language Processing. Carol Friedman, Lyudmila Shagina, Yves Lussier, George Hripcsak. JAMIA 2004;11:392-402. doi:10.1197/jamia.M1552
• Description of a Rule-based System for the i2b2 Challenge in Natural Language Processing for Clinical Data. Lois C Childs, Robert Enelow, Lone Simonsen, Norris H Heintzelman, Kimberly M Kowalski, Robert J Taylor. JAMIA 2009;16:571-575. doi:10.1197/jamia.M3083
• Automated Detection of Adverse Events Using Natural Language Processing of Discharge Summaries. Genevieve B Melton, George Hripcsak. JAMIA 2005;12:448-457. doi:10.1197/jamia.M1794
Related Work
• Documents
– Discharge summaries, patient notes, EMR sections, path or radiology reports, …
• Processing
– Identify noun phrases, section headings, numerical values, negations, …
– Extract blood pressure, medications, …
• Analysis
– Quality of care, smoker status, adverse events, other diagnoses, …
Systems
Columbia
Carol Friedman et al.
MedLEE
“Black art”
Systems from defense and intelligence companies, etc.
Open Software and Tools
Medical Informatics
OHNLP
Open Health Natural Language Processing
IBM, Mayo Clinic, (NCI)
General
UIMA, GATE
Variety of lexical tools, named-entity recognizers, parsers, etc.
XAR
http://zellig.cpmc.columbia.edu/medlee/
http://incubator.apache.org/uima/
http://gate.ac.uk/
http://nlp.stanford.edu/software/lex-parser.shtml
Extraction Techniques
What do we employ to achieve automated extraction?
Broad paradigms
Rule driven (expert)
Machine-learning based (trained)
Combined (most recent systems)
Multiple levels
Semi-structured data extraction
Named entity extraction
POS tagging, NE identification
Ontology driven (domain terms)
“Deep” relation level extraction
Associations
Natural Language Parsing
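At the semi-structured level, many report lines already have a "NAME: VALUE" shape, so a rule-driven extractor can be very small. The sketch below is illustrative only: the patterns `FIELD` and `DIMENSION` and the function `extract_fields` are hypothetical names, and the unit list (CM, MM) is an assumption.

```python
import re

# "NAME: VALUE" lines, where NAME is upper-case words.
FIELD = re.compile(r"^(?P<name>[A-Z][A-Z ]*[A-Z]):\s*(?P<value>\S.*)$")

# A dimension is a number followed by a length unit (assumed: CM or MM).
DIMENSION = re.compile(r"^(?P<number>\d+(?:\.\d+)?)\s*(?P<unit>CM|MM)$", re.IGNORECASE)

def extract_fields(lines):
    """Return {field name: value}; dimensions become (number, unit) pairs."""
    fields = {}
    for line in lines:
        match = FIELD.match(line.strip())
        if not match:
            continue  # free-text line, left for term spotting or parsing
        value = match.group("value").strip()
        dimension = DIMENSION.match(value)
        if dimension:
            value = (float(dimension.group("number")), dimension.group("unit").lower())
        fields[match.group("name")] = value
    return fields

LINES = [
    "POORLY DIFFERENTIATED ADENOCARCINOMA OF PULMONARY ORIGIN",
    "SIZE: 1.5 CM",
    "STAPLED RESECTION MARGIN: NEGATIVE",
]

fields = extract_fields(LINES)
```

Lines that do not match the pattern are deliberately skipped, since they belong to the named-entity and relation levels rather than to this one.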
NL Parse Illustration
(ROOT
  (S
    (NP
      (NP (NNP Tissue))
      (PP (IN between)
        (NP (DT the) (CD two) (JJ surgical) (NNS clips))))
    (VP (VBZ contains)
      (NP
        (NP (NNS foci))
        (PP (IN of)
          (NP
            (NP (JJ ductal) (NN carcinoma))
            (ADJP (FW in) (FW situ)))))
      (PP
        (PP (IN within)
          (NP (DT a) (NN papilloma)))
        (, ,)
        (CONJP (RB as) (RB well) (IN as))
        (PP (IN within)
          (NP (NNS ducts)))))
    (. .)))
nsubj(contains-7, Tissue-1)
det(clips-6, the-3)
num(clips-6, two-4)
amod(clips-6, surgical-5)
prep_between(Tissue-1, clips-6)
dobj(contains-7, foci-8)
amod(carcinoma-11, ductal-10)
prep_of(foci-8, carcinoma-11)
amod(carcinoma-11, in-12)
dep(in-12, situ-13)
det(papilloma-16, a-15)
prep_within(contains-7, papilloma-16)
prep_within(contains-7, ducts-22)
conj_and(papilloma-16, ducts-22)
“Tissue between the two surgical clips contains foci of
ductal carcinoma in situ within a papilloma, as well as
within ducts.”
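Once a parser has produced typed dependencies like those above, associations can be read off with simple lookups. The sketch below only consumes the dependency lines shown on this slide; it does not run a parser, and the function names are hypothetical. For brevity it identifies words by their text alone, which works here because no word that matters is repeated.

```python
import re

DEPENDENCIES = """nsubj(contains-7, Tissue-1)
det(clips-6, the-3)
num(clips-6, two-4)
amod(clips-6, surgical-5)
prep_between(Tissue-1, clips-6)
dobj(contains-7, foci-8)
amod(carcinoma-11, ductal-10)
prep_of(foci-8, carcinoma-11)
amod(carcinoma-11, in-12)
dep(in-12, situ-13)
det(papilloma-16, a-15)
prep_within(contains-7, papilloma-16)
prep_within(contains-7, ducts-22)
conj_and(papilloma-16, ducts-22)"""

# relation(governor-index, dependent-index)
DEP = re.compile(r"^(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)$")

def parse_dependencies(text):
    """Turn dependency lines into (relation, governor, dependent) triples."""
    triples = []
    for line in text.splitlines():
        match = DEP.match(line.strip())
        if match:
            relation, governor, _, dependent, _ = match.groups()
            triples.append((relation, governor, dependent))
    return triples

def associations(triples):
    """Link the verb's object to what it is 'of' and where it is 'within'."""
    obj = next(dep for rel, gov, dep in triples if rel == "dobj")
    finding = [dep for rel, gov, dep in triples if rel == "prep_of" and gov == obj]
    sites = [dep for rel, gov, dep in triples if rel == "prep_within"]
    return {"object": obj, "finding": finding, "sites": sites}

triples = parse_dependencies(DEPENDENCIES)
result = associations(triples)
```

Here the result associates "foci" of "carcinoma" with the sites "papilloma" and "ducts", which a purely term-based approach would not connect.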
MedLEE Illustration
OHNLP
• OHNLP
– Open Health Natural Language Processing Consortium
• IBM and Mayo Clinic are founding partners
• caBIG/NCI supported
– Open-source consortium promoting the use of UIMA
• Features
– Built upon Apache UIMA
• Annotators, Pipelines
– Medical domain
• MedKAT/P (IBM)
– Pathology reports extraction
• cTAKES (Mayo Clinic)
– Clinical data
Rationale for OHNLP
• Based on UIMA
– Open source
– Community of developers
• OHNLP itself
– NCI
• IBM, Mayo
– MedKAT/P and cTAKES
– Two way benefits
• Adopt
• Contribute back
MedKAT Annotations
“Programming” UIMA
• OHNLP based on UIMA
• UIMA composed of “Analysis Engines”
– Primitive
– Aggregate
Primitive Engine
(section headings)
Primitive Engine
(numerical)
Primitive Engine
(dict terms)
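The primitive/aggregate idea can be illustrated outside UIMA itself. UIMA is a Java framework with its own descriptors and type system; the Python sketch below is only an analogy and uses none of the UIMA API. Every name in it (`Annotation`, `regex_engine`, `aggregate`) is hypothetical. Each primitive engine adds one kind of annotation, and the aggregate runs them in sequence over the same document.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    kind: str
    begin: int
    end: int
    text: str

def regex_engine(kind, pattern):
    """Build a primitive engine that annotates every match of one pattern."""
    compiled = re.compile(pattern, re.MULTILINE)
    def engine(document):
        return [Annotation(kind, m.start(), m.end(), m.group())
                for m in compiled.finditer(document)]
    return engine

def aggregate(*engines):
    """Build an aggregate engine that runs primitive engines in order."""
    def engine(document):
        found = []
        for primitive in engines:
            found.extend(primitive(document))
        return sorted(found, key=lambda a: (a.begin, a.end))
    return engine

# Three primitive engines mirroring the boxes on this slide.
section_headings = regex_engine("heading", r"^[A-Z][A-Z ,]*:$")
numerical = regex_engine("number", r"\d+(?:\.\d+)?")
dict_terms = regex_engine("term", r"\b(?:ADENOCARCINOMA|SARCOMA)\b")

pipeline = aggregate(section_headings, numerical, dict_terms)

DOC = ("FINAL DIAGNOSIS AFTER MICROSCOPY:\n"
       "POORLY DIFFERENTIATED ADENOCARCINOMA OF PULMONARY ORIGIN\n"
       "SIZE: 1.5 CM")
annotations = pipeline(DOC)
```

The point of the design is that each engine stays small and independently testable, and new engines can be added to the aggregate without touching the existing ones.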
Descriptors, Resources
Analysis Engines
• Developing “UCI-QUP”
– UCI Quest UIMA Pipeline
• Analysis Engines
– Recognize sections and sub-sections
• Regular expressions
– Significant terms
• Medical terms
– Existing dictionary in MedKAT/P
• Useful, not complete
– Integrate additional terminology
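Term spotting against a dictionary can be sketched as a longest-match lookup over the token stream. This is a simplified illustration, not the MedKAT/P dictionary annotator: the four-entry `DICTIONARY` is made up for the example, and a real resource would be far larger and would carry concept codes.

```python
import re

# Hypothetical mini-dictionary: lower-case surface form -> canonical term.
DICTIONARY = {
    "adenocarcinoma": "Adenocarcinoma",
    "ductal carcinoma in situ": "Ductal Carcinoma In Situ",
    "carcinoma": "Carcinoma",
    "papilloma": "Papilloma",
}
MAX_WORDS = max(len(entry.split()) for entry in DICTIONARY)

def spot_terms(text):
    """Return canonical terms found in text, preferring the longest match."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in DICTIONARY:
                hits.append(DICTIONARY[candidate])
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return hits

SENTENCE = ("Tissue between the two surgical clips contains foci of ductal "
            "carcinoma in situ within a papilloma, as well as within ducts.")
hits = spot_terms(SENTENCE)
```

Preferring the longest match is what keeps "ductal carcinoma in situ" from being reported as the less specific "carcinoma".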
AE Terms
• Good resources
– NCI Thesaurus
• Cancer related
• > 500,000 terms/concepts
– NCI Metathesaurus
• Several million concepts
• Developed
– Converter
• NCI Thesaurus → UIMA Dictionary Resource
– Application
– Database
• MySQL
Architecture
Pathology Reports (Unstructured)
OHNLP
Extracted Data (Structured)
UCI Quest UIMA Pipeline
Knowledge Sources
UMLS and Metathesaurus
UMLS
• UMLS
– Obtained system from NLM
– Installed successfully on informatics-nlp
– Features
• Browse concepts and relationships
• Flat files
• DB import
– Being integrated into UCI-QUP
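One plausible route from the UMLS flat files to a term dictionary is to read the concept names file directly. The sketch below assumes the files are in Rich Release Format, where MRCONSO.RRF is pipe-delimited with the column order listed in `MRCONSO_COLUMNS`. The sample rows are invented placeholders with made-up identifiers, not taken from a UMLS release, and `load_terms` is a hypothetical helper.

```python
import io

# Column order of MRCONSO.RRF (each row also ends with a trailing pipe).
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]

# Invented placeholder rows, for illustration only.
SAMPLE = (
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||NCI|PT|X0001|Example Carcinoma|0|N||\n"
    "C0000001|ENG|S|L0000002|PF|S0000002|N|A0000002||||NCI|SY|X0001|Example Cancer|0|N||\n"
    "C0000001|SPA|P|L0000003|PF|S0000003|Y|A0000003||||XXX|PT|X0001|Carcinoma de ejemplo|0|N||\n"
)

def load_terms(handle, language="ENG"):
    """Map lower-cased term strings to concept identifiers (CUIs)."""
    terms = {}
    for line in handle:
        fields = line.rstrip("\n").split("|")
        row = dict(zip(MRCONSO_COLUMNS, fields))  # drops the trailing empty field
        if row["LAT"] == language and row["SUPPRESS"] == "N":
            terms.setdefault(row["STR"].lower(), row["CUI"])
    return terms

terms = load_terms(io.StringIO(SAMPLE))
```

With a real file, the `io.StringIO` object would be replaced by an opened MRCONSO.RRF handle; synonyms then collapse onto one concept identifier, which is what makes retrieval by concept possible.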
Our Contribution to OHNLP
• We indeed adopted
– Framework
– Relevant “resources”
• Contribute to overall OHNLP effort
– Specific analysis engines
• Sections and sub-sections in pathology reports
• Significant items
• Dictionary terms (UMLS integration)
• …
• Contribute as a project back to OHNLP
Database Schema
Demo
SQL Queries
• Example (possible) queries
SELECT reportid
FROM collection
WHERE
(sectioncontent LIKE '%carcinoma%') AND (heading LIKE '%tumor%')
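A query of this kind can be tried end to end with an in-memory database. The sketch below uses SQLite from the Python standard library as a stand-in for the project's MySQL database; the table and column names (`collection`, `reportid`, `heading`, `sectioncontent`) come from the query on this slide, while the three rows and the helper `find_reports` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collection (reportid INTEGER, heading TEXT, sectioncontent TEXT)"
)
# Invented rows: one row per extracted section of a report.
conn.executemany(
    "INSERT INTO collection VALUES (?, ?, ?)",
    [
        (1, "TUMOR TYPE", "Invasive ductal carcinoma"),
        (2, "TUMOR SIZE", "1.5 cm"),
        (3, "SPECIMEN", "Ductal carcinoma in situ"),
    ],
)

def find_reports(conn, content_term, heading_term):
    """Report ids whose section content and heading both contain the terms."""
    rows = conn.execute(
        "SELECT DISTINCT reportid FROM collection "
        "WHERE sectioncontent LIKE ? AND heading LIKE ? "
        "ORDER BY reportid",
        (f"%{content_term}%", f"%{heading_term}%"),
    )
    return [row[0] for row in rows]

matches = find_reports(conn, "carcinoma", "tumor")
```

Bound parameters keep the search terms out of the SQL string. Note that SQLite's `LIKE` ignores case for ASCII letters by default; in other databases this depends on the collation in use.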
Interfaces
• i2b2
• Tableau
Guide For Fields
• College of American Pathologists (CAP)
– Detailed protocols
Specimen (Note A)
___ Partial breast
___ Total breast (including nipple and skin)
___ Other (specify): ____________________________
___ Not specified
Procedure (Note A)
___ Excision without wire-guided localization
___ Excision with wire-guided localization
___ Total mastectomy (including nipple and skin)
___ Other (specify): ____________________________
___ Not specified
Lymph Node Sampling (select all that apply) (Note B)
___ No lymph nodes present
___ Sentinel lymph node(s)
___ Axillary dissection (partial or complete dissection)
___ Lymph nodes present within the breast specimen (ie, intramammary lymph nodes)
___ Other lymph nodes (eg, supraclavicular or location not identified)
Specify location, if provided: _________________________
Specimen Integrity
___ Single intact specimen (margins can be evaluated)
___ Multiple designated specimens (eg, main excisions and identified margins)
___ Fragmented (margins cannot be evaluated with certainty)
___ Other (specify): __________________________________
Specimen Size (for excisions less than total mastectomy)
Implications
• Multiple specific extraction and distillation techniques
• Section, sub-section segmentation
• Term spotting
• Associations
• Negation (Absence) and Assertion (Presence)
• Dimensions
• Expressions
• Full NL Parse where required
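Negation and assertion can often be handled with cue phrases before resorting to a full parse. The following is a minimal cue-based sketch; the cue lists are assumptions chosen to cover the sample report lines shown earlier in this talk, and a production system would need a much richer list and scope handling.

```python
import re

# Assumed cue phrases; negation is checked first so that
# "NOT PRESENT" is not mistaken for "PRESENT".
NEGATION_CUES = re.compile(
    r"\b(?:NOT PRESENT|NOT IDENTIFIED|NEGATIVE(?: FOR)?|NO EVIDENCE OF|ABSENT)\b"
)
ASSERTION_CUES = re.compile(r"\b(?:POSITIVE(?: FOR)?|PRESENT|IDENTIFIED)\b")

def assertion_status(line):
    """Classify a report line as 'absent', 'present', or 'unknown'."""
    text = line.upper()
    if NEGATION_CUES.search(text):
        return "absent"
    if ASSERTION_CUES.search(text):
        return "present"
    return "unknown"

statuses = {
    line: assertion_status(line)
    for line in [
        "EXTENSIVE FIBROSIS IS NOT PRESENT",
        "POSITIVE FOR SARCOMA",
        "STAPLED RESECTION MARGIN: NEGATIVE",
        "PLEASE SEE COMMENT",
    ]
}
```

Lines that come back as "unknown" are candidates for the full natural language parse mentioned in the last bullet.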
Current Status
• System first version
– Creation of database for data warehouse
• QUEST “compliant”
– Meta-thesaurus integration
– Retrieval
• SQL and UI
• Tableau
• i2b2
– Star schema
Presentation Content Continued
• Direction
– Demonstrate value to researchers
– CTSA Investigators
• Lessons learned
– Open source frameworks very useful!
• Reuse external solutions, resources
• Our solutions can be adopted
– Approach appears scalable
– Domain expert engagement essential
Content
• What went well
– UIMA and MedKAT choice
– UMLS integration
• What we would do differently
– Project is in early stage
– Technical and framework choices seem right
– Will learn more as we engage domain experts
• What will provide value to investigators?
Summary
• Comprehensive approach to detailed information
extraction from Pathology reports
• Exploiting open source and programmable
frameworks (UIMA)
• Integration of UMLS
• Contribution of pipeline
• Engagement of domain experts
Presenter(s) Contact Information
• Contact information
– Naveen Ashish
– http://www.ics.uci.edu/~ashish