TREC 2011 Medical Track Rich Medlin LAIR
Page 1

TREC 2011 Medical Track

Rich Medlin, LAIR

Page 2

TREC = Text Retrieval Conference

• Originally DARPA sponsored
  – Scan newspapers for intelligence-gathering purposes

• Now NIST – several tracks, including Microblog (aka Twitter) and Session (www.trec.nist.gov)

• These are REAL WORLD TASKS

Page 3

TREC Format

• Batch Information Retrieval
• “A competition”
• Given a fixed corpus and a fixed set of topics, find all of the documents that are relevant to the topic
  – Compare outcomes – Precision, Recall, MAP, bpref (see the sketch below)
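These metrics are simple to state in code. Below is a minimal sketch of precision, recall, and average precision (MAP is AP averaged over all topics), assuming binary relevance judgments; the toy run and qrels are invented for illustration, and this is not trec_eval itself.

```python
# Minimal sketch of TREC-style batch metrics under binary relevance.
# The run and qrels below are toy data, not track output.

def precision_recall(ranked_ids, relevant_ids):
    """Precision and recall over one topic's full ranked run."""
    hits = sum(1 for d in ranked_ids if d in relevant_ids)
    precision = hits / len(ranked_ids) if ranked_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k at each rank k holding a relevant doc.
    Averaging this over all topics gives MAP."""
    hits, score = 0, 0.0
    for k, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            score += hits / k
    return score / len(relevant_ids) if relevant_ids else 0.0

run = ["d3", "d1", "d7", "d2"]        # system ranking for one topic
qrels = {"d1", "d2"}                  # judged-relevant documents
print(precision_recall(run, qrels))   # (0.5, 1.0)
print(average_precision(run, qrels))  # (1/2 + 2/4) / 2 = 0.5
```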

Page 4

TREC Medical Track - Motivation

• Comparative Effectiveness Research:
  – “In a patient with a high-grade obstruction of the carotid artery, is carotid endarterectomy effective?”
  – Design a case-control study:
  – 1) Find all the patients with CEA
  – 2) Randomly select a group of patients EXACTLY the same in all respects EXCEPT they didn’t get a CEA
  – 3) Compare outcomes

Page 5

TREC Medical Task

• Given a corpus, find a patient (retrieve the RECORD, not the document – see the sketch after this list)
  – patients with atrial fibrillation treated with ablation
  – patients admitted for a reason directly related to medication non-compliance
  – patients younger than 50 with hearing loss
  – patients admitted for injuries s/p fall
  – patients with GERD who had esophageal adenocarcinoma diagnosed by endoscopy
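Because the unit of retrieval is the visit, a common pattern is to score individual reports with any engine and then fuse scores up to the visit level. A minimal sketch of that step; the report-to-visit mapping, the scores, and the choice of max fusion are all illustrative assumptions, not any team's actual method.

```python
# Hedged sketch: per-report scores rolled up to per-visit scores, because
# the task ranks RECORDS (visits), not documents. Toy data throughout.
from collections import defaultdict

report_to_visit = {"r1": "v1", "r2": "v1", "r3": "v2"}  # hypothetical mapping
report_scores = {"r1": 2.1, "r2": 0.4, "r3": 1.7}       # from any engine

visit_scores = defaultdict(float)
for report, score in report_scores.items():
    visit = report_to_visit[report]
    visit_scores[visit] = max(visit_scores[visit], score)  # max fusion

ranking = sorted(visit_scores.items(), key=lambda kv: -kv[1])
print(ranking)  # [('v1', 2.1), ('v2', 1.7)]
```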

Page 6

TREC Medical Task

• Completely Automatic
• “Manual” – Do anything you want with your system to find the records, e.g. you can repeatedly query your system and refine the queries

Page 7

Strategies

“patients with CEA”

Manual
  Clinicians: NLM – Essie**; OHSU – Lucene
  Non-Clinicians: (none listed)
Automatic
  Clinicians: (none listed)
  Non-Clinicians: IR (Glasgow, UMass); NLP (UT Dallas)**; Cengage (Data Mining)

Page 8

Overview

• The Corpus: Clinical Records from multiple hospitals in Pittsburgh (but within the same medical system, so pretty homogeneous language – same residents, single radiology and path department, but some clinical doc notes from Shadyside)
• Not in corpus:
  – Structured lab values, vitals, I’s and O’s
  – Demographics
  – Nursing, Case Management, Social Work, etc. notes

Page 9

Corpus Stats

100,172 reports
About 17,000 visits
Most visits have fewer than 5 reports
23 visits with more than 100 reports
Max docs for a single visit: 415

Page 10

Page 11

Report Types

• Radiology Reports (strongly templated)
• History and Physicals (Surgery vs. Medical)
• Consultation Reports (not very templated)
• Emergency Department Reports (Resident vs. Attending)
• Progress Notes (ICU vs. floor)
• Discharge Summaries (not too copy-and-pasted)
• Operative Reports (may include interventional radiology)
• Surgical Pathology Reports (useful for cancer diagnosis)
• Cardiology Reports (very computer-generated – echo, cath, ?doppler)

Page 12

Teams

Team                          Clinicians?   Background                        Outcomes
Terrier – U. Glasgow          No            IR                                Better
External Collections – UDel   No            IR                                Worse, but interesting
Dutch Hat Trick – Erasmus     Yes           NLP                               Better
NLM – well, NLM               Yes           NLP/Practical                     Best Manual
Cohort Shepherd – UT Dallas   No            NLP                               2nd best Automatic
OHSU                          Yes           Practical/IR (Lucene baseline)    Horrible
Merck                         Yes           Data Mining/Practical             Meh
Mayo                          Yes           Practical/IR                      Better than average
Cengage                       No            Data Mining                       Best automatic

Page 13

bpref results
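bpref is the headline metric here because many submitted documents went unjudged, and bpref scores only judged documents. A sketch of the trec_eval definition, bpref = (1/R) Σ_r (1 − min(#judged-nonrelevant above r, R) / min(R, N)), with invented toy data:

```python
# Sketch of bpref as defined for trec_eval: only judged documents count,
# so unjudged ones (common in this track) neither help nor hurt a run.

def bpref(ranked_ids, relevant, nonrelevant):
    """relevant/nonrelevant are the *judged* sets; R = |relevant|, N = |nonrelevant|."""
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N)
    nonrel_seen = 0
    total = 0.0
    for doc in ranked_ids:
        if doc in relevant:
            penalty = min(nonrel_seen, R) / denom if denom else 0.0
            total += 1.0 - penalty
        elif doc in nonrelevant:
            nonrel_seen += 1
        # unjudged documents are simply skipped
    return total / R

run = ["u1", "d1", "n1", "d2"]           # u1 is unjudged
print(bpref(run, {"d1", "d2"}, {"n1"}))  # 0.5
```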

Page 14

External Query Expansion

• NLM Products (Obvious)
  – MetaMap, SNOMED CT, ICD-9-CM, MeSH, SPECIALIST Lexicon, SemRep
  – (CPT)
• Wikipedia, DBpedia (UpToDate, MDConsult)
• RxNorm, DrugBank, Google (FirstDataBank)
• TREC 2007 Genomics, ImageCLEF 2009, BioScope
• Other: MedDRA (medical terms for ADRs)

Page 15

The Corpus

• Hospital Records (vs. ambulatory)
  – “visit” = “Episode of Care” = “Account Number” = “Bill”
  – Teaching Hospital => Two or more of each type of physician (resident, fellow, attending) notes with different, possibly conflicting content
  – Tertiary Care => Very complicated, multiple consults, lots of comorbidities
  – Multiple departments => Radiology, Path, Vascular Lab

Page 16

The Corpus

• Mostly free text, but some computer-generated text for cath lab, vascular US
• Duplicate note content – a carotid stent gets an Operative Note (“op note”) + a Radiology Report

Page 17

Some sample records

• http://skynet.ohsu.edu

Page 18

Spurious Questions Generated

• Who is Richard Tong and why is he chairing this track?
• Why is there no article about judging in the medical track?
• Are they really going to get the data out this year?
• What is up with OHSU’s submission?
• Why is HongFen dissing the major Mayo-generated medical IR/NLP product (cTAKES)?
• Why didn’t Pitt have an entry? (They wrote caTIES, which is a major, production system for cohort finding, AND were the source of the data for this track.)

Page 19

Problems

• Corpus was available, then it wasn’t, then it was
• No judgments for many of the best runs
• No QA regarding judgments

Page 20

Key Questions

• Does domain knowledge help (NLM) or hurt (OHSU) IR tasks when used for searching medical records?
• Linguistics (NLM) vs. Statistics (UT Dallas) vs. IR (UDel) vs. Hybrid (U. Glasgow)
• Recall vs. Precision for the given task of assembling a clinical cohort for research?
  – Do current IR strategies introduce bias that would affect the analysis of outcomes?

Page 21

Key Questions (cont’d)

• How did the commercial submissions fare?
  – Cengage (really well) – secret sauce: build a network of terms from UMLS and expand based on connectedness and distance (sketched below)
  – Merck/LUXID (slightly better than average)
  – These are data miners
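A hedged guess at the shape of that secret sauce: breadth-first expansion over a concept graph, down-weighting candidates by distance from the query term. The toy graph, depth limit, and decay factor are invented for illustration; this is not Cengage's actual code or parameters.

```python
# Illustrative only: expanding a query term over a UMLS-like term graph,
# weighting neighbors by graph distance. Graph and constants are invented.
from collections import deque

GRAPH = {  # hypothetical "related concept" edges
    "carotid endarterectomy": ["CEA", "carotid surgery"],
    "CEA": ["endarterectomy"],
    "carotid surgery": ["carotid stent"],
}

def expand(term, max_depth=2, decay=0.5):
    weights = {term: 1.0}
    queue = deque([(term, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for neighbor in GRAPH.get(node, []):
            if neighbor not in weights:          # first (closest) path wins
                weights[neighbor] = weights[node] * decay
                queue.append((neighbor, depth + 1))
    return weights

print(expand("carotid endarterectomy"))
# {'carotid endarterectomy': 1.0, 'CEA': 0.5, 'carotid surgery': 0.5,
#  'endarterectomy': 0.25, 'carotid stent': 0.25}
```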

Page 22

Historical Performance of published systems

• LSP (1980s-1990s) – Sager (NYU)
  – R > .9, P > .9
  – One sentence took 15 seconds to process…
• MedLEE (1990s-2000s) – Friedman (Columbia)
  – Now a private company
• caTIES (current) ?
• cTAKES (current) ?
• NLM (current) ?
• ARC (current) ?
• Really all about index creation

Page 23

cTAKES and caTIES

Page 24

Typical Clinical Text Mining Workflow

Page 25

Medical Text Peculiarities

• Not curated knowledge sources
• Rarely proofread
• Contradictory information within a single visit
• Negation (30-35% of terms)
• Local abbreviations and jargon
• Written to satisfy billing, compliance and patient care (in that order)
• May be computer generated from templates

Page 26

Medical Corpus Peculiarities

• Privacy, obviously (unknown territory, legally)
• Delayed malpractice liability? (e.g. documentation of sub-optimal care or errors)
• Delayed insurance fraud liability (ICD-9s assigned but not supported by documentation; CPTs omitted)
• Proprietary information?
• What else?

Page 27

More about the Corpus

• Clinical Records that are “physician-o-centric”
  – No nursing notes or allied health notes (dietary, social work, case management)
• No structured data (e.g. lab values, vital signs, I&Os; BUT labs are likely included in the reports as free text or copy-and-paste)

Page 28

Overview (cont’d)

• Completely automated runs
• Manual runs (anything you want to do to find matching records) – how things are done in the real world

Page 29

ICD-9 vs. ICD-10

• The rest of the world moved on to ICD-10 a while ago. ICD-10 codes are wildly (some would say overly) more specific.
• ICD-9 codes are usually not assigned by physicians; they are assigned by billers. So it is unsurprising that they are not accurate.
• They didn’t include CPT (Current Procedural Terminology) codes, which are quite a bit more accurate. I can imagine this is a liability issue for Medicare fraud and for proprietary reasons.

Page 30

ICD-9 Code Example Document

• Malignant neoplasm of breast (female), unspecified
• A primary or metastatic malignant neoplasm involving the breast. The vast majority of cases are carcinomas arising from the breast parenchyma or the nipple. Malignant breast neoplasms occur more frequently in females than in males. -- 2003
• Short description: Malign neopl breast NOS.
• ICD-9-CM 174.9 is a billable medical code that can be used to specify a diagnosis on a reimbursement claim.
• ICD-9-CM Volume 2 Index entries containing back-references to 174.9:

  Adenocarcinoma (M8140/3) - see also Neoplasm, by site, malignant
    duct (infiltrating) (M8500/3)
      with Paget's disease (M8541/3) - see Neoplasm, breast, malignant
      specified site - see Neoplasm, by site, malignant
      unspecified site 174.9
    infiltrating duct (M8500/3)
      with Paget's disease (M8541/3) - see Neoplasm, breast, malignant
      specified site - see Neoplasm, by site, malignant
      unspecified site 174.9
    inflammatory (M8530/3)
      specified site - see Neoplasm, by site, malignant
      unspecified site 174.9
    lobular (M8520/3)
      specified site - see Neoplasm, by site, malignant
      unspecified site 174.9

  Carcinoma (M8010/3) - see also Neoplasm, by site, malignant
    duct (cell) (M8500/3)
      with Paget's disease (M8541/3) - see Neoplasm, breast, malignant
      infiltrating (M8500/3)
        specified site - see Neoplasm, by site, malignant
        unspecified site 174.9
    infiltrating duct (M8500/3)
      with Paget's disease (M8541/3) - see Neoplasm, breast, malignant
      specified site - see Neoplasm, by site, malignant
      unspecified site 174.9
    inflammatory (M8530/3)
      specified site - see Neoplasm, by site, malignant
      unspecified site 174.9
    lobular (infiltrating) (M8520/3)
      noninfiltrating (M8520/3)
        specified site - see Neoplasm, by site, in situ
        unspecified site 233.0
      specified site - see Neoplasm, by site, malignant
      unspecified site 174.9
    medullary (M8510/3)
      with
        amyloid stroma (M8511/3)
          specified site - see Neoplasm, by site, malignant
          unspecified site 193
        lymphoid stroma (M8512/3)
          specified site - see Neoplasm, by site, malignant
          unspecified site 174.9


Page 31

Query Workflow

• Query Normalization (cTAKES, GATE)
• Query Expansion (knowledge-based; sketched below)
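In miniature, the two steps look like this. The abbreviation and synonym tables are stand-ins for what cTAKES/GATE and the knowledge sources would actually supply:

```python
# Sketch of the two-stage query workflow. The tables below stand in for
# cTAKES/GATE normalization output and UMLS-style synonym sources.
ABBREVIATIONS = {"s/p": "status post", "afib": "atrial fibrillation"}  # assumed
SYNONYMS = {"atrial fibrillation": ["AF", "auricular fibrillation"]}   # assumed

def normalize(query):
    """Lowercase and expand local abbreviations token by token."""
    return " ".join(ABBREVIATIONS.get(t.lower(), t.lower()) for t in query.split())

def expand(query):
    """Add synonym variants for any known concept found in the query."""
    variants = [query]
    for concept, alts in SYNONYMS.items():
        if concept in query:
            variants.extend(alts)
    return variants

q = normalize("patients with afib treated with ablation")
print(expand(q))
# ['patients with atrial fibrillation treated with ablation',
#  'AF', 'auricular fibrillation']
```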

Page 32

Retrieval Workflow – Search Engine Choices

• Lucene on XML, SQL, Postgres, MySQL (Standard) (cosine sim + “more like these” – sketched below)
• Indri (LM, QL)
• Terrier (LM, QL)
• Data mining tools: KNIME, ProMiner, Weka, RapidMiner (tf-idf, probably) – these guys love semantics and subdocument retrieval
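The cosine-similarity scoring these engines share is easy to sketch from scratch; the smoothed idf and the two-visit toy corpus are illustrative choices, not any engine's internals.

```python
# From-scratch tf-idf + cosine similarity, the scoring family behind the
# Lucene-style runs. A sketch, not any engine's actual implementation.
import math
from collections import Counter

def tfidf(tokens, df, n_docs):
    counts = Counter(tokens)
    return {t: c * math.log((1 + n_docs) / (1 + df.get(t, 0))) for t, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {  # toy per-visit text
    "v1": "carotid endarterectomy performed without complication".split(),
    "v2": "ablation performed for atrial fibrillation".split(),
}
df = Counter(t for toks in docs.values() for t in set(toks))
vectors = {d: tfidf(toks, df, len(docs)) for d, toks in docs.items()}
query = tfidf("carotid endarterectomy".split(), df, len(docs))
print(sorted(((cosine(query, v), d) for d, v in vectors.items()), reverse=True))
# v1 outranks v2
```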

Page 33

Document Indexing

• Lumpers – merge all of a visit’s documents into a single document and index it (preferred by IR people)
• Splitters – parse out all of the document subsections and index each separately (preferred by clinicians and NLP people); the contrast is sketched below
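In code the difference is just where the indexing unit is drawn; a schematic sketch over an invented three-report corpus:

```python
# Schematic contrast of the two indexing choices. Same input, different unit.
reports = [  # (visit, report type, text) – invented examples
    ("v1", "H&P", "chest pain, rule out MI"),
    ("v1", "Discharge", "discharged on aspirin"),
    ("v2", "Radiology", "no acute infiltrate"),
]

# Lumpers: one index entry per visit, all report text concatenated.
lumped = {}
for visit, kind, text in reports:
    lumped[visit] = (lumped.get(visit, "") + " " + text).strip()

# Splitters: one index entry per report (or subsection), keyed back to the visit.
split = [{"visit": v, "type": k, "text": t} for v, k, t in reports]

print(lumped["v1"])  # 'chest pain, rule out MI discharged on aspirin'
print(split[0])      # {'visit': 'v1', 'type': 'H&P', 'text': 'chest pain, rule out MI'}
```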

Page 34

Document Indexing II: Negation, hedging and scoping

• NegEx (the original) – rules-based; sketched below
• ConText (the successor: negation + FH, PMH, hedges)
• LingScope – CRF + NegEx + hedges
• ScopeFinder ? – can’t find anything
• Custom negation – KNIME, SAS Text Miner, etc.
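A minimal NegEx-flavored sketch: a few pre-negation triggers negate terms inside a fixed token window. The trigger list and five-token window are simplifications of the published rules, which also terminate scope at conjunctions like "but":

```python
# Simplified NegEx-style negation: pre-negation triggers mark the next few
# tokens as negated. Triggers and window size are illustrative, not the
# full published rule set.
TRIGGERS = {"no", "denies", "without", "negative"}
WINDOW = 5  # tokens after a trigger considered inside the negation scope

def negated_tokens(sentence):
    tokens = sentence.split()
    negated = set()
    for i, tok in enumerate(tokens):
        if tok.lower().strip(",.") in TRIGGERS:
            negated.update(range(i + 1, min(i + 1 + WINDOW, len(tokens))))
    return [(i, tokens[i]) for i in sorted(negated)]

print(negated_tokens("Patient denies chest pain but reports dyspnea on exertion"))
# [(2, 'chest'), (3, 'pain'), (4, 'but'), (5, 'reports'), (6, 'dyspnea')]
# Note the scope overrun past "but" – real NegEx ends scope at conjunctions.
```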

Page 35

NegEx-type Rules – 97% accurate

Page 36

Document Vocabulary Normalization

• cTAKES (bombed on this corpus; is Mayo-specific)
• UMLS and related services:
  – MetaMap, SNOMED CT, ICD-9-CM, MeSH, SPECIALIST Lexicon, Semantic MEDLINE
  – (CPT)
• Wikipedia, Google, Healthline.com (UpToDate, MDConsult)
• RxNorm, DrugBank, Google (FirstDataBank)

Page 37

udel

• Indri + pseudo-relevance feedback using UMLS
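Pseudo-relevance feedback in miniature: retrieve, assume the top k results are relevant, mine expansion terms from them, and re-query. UDel's runs used Indri's built-in mechanism with UMLS-derived terms; this toy loop just shows the shape.

```python
# Shape of pseudo-relevance feedback over a toy corpus. Not Indri; the
# corpus, scoring, and term selection are all invented for illustration.
from collections import Counter

CORPUS = {
    "d1": "gerd with esophageal adenocarcinoma on endoscopy",
    "d2": "endoscopy showed barrett esophagus and gerd",
    "d3": "knee arthroscopy without complication",
}

def search(terms):
    """Rank docs by count of matching query terms (toy scorer)."""
    scored = [(sum(t in text.split() for t in terms), d) for d, text in CORPUS.items()]
    return [d for s, d in sorted(scored, reverse=True) if s > 0]

def prf(query, k=2, n_terms=2):
    top = search(query)[:k]                       # assume top k are relevant
    counts = Counter(t for d in top for t in CORPUS[d].split())
    expansion = [t for t, _ in counts.most_common() if t not in query][:n_terms]
    return search(query + expansion), expansion   # one feedback round

print(prf(["gerd", "endoscopy"]))
```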

Page 38

Glasgow

• Interesting thing – they used only the admitting-diagnosis ICD-9 codes for expansion (a good idea clinically); sketched below
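The idea in code form, for concreteness; the record layout and the code-to-text table are invented, and the point is simply restricting expansion input to the admit diagnosis rather than the billing-era discharge codes.

```python
# Sketch of Glasgow's idea: expand queries only from the admit-diagnosis
# ICD-9 codes, ignoring later billing-driven codes. Data below is invented.
ICD9_TEXT = {"433.10": "occlusion and stenosis of carotid artery"}  # assumed lookup

visit = {
    "admit_diagnosis": ["433.10"],                          # assigned at admission
    "discharge_diagnoses": ["433.10", "401.9", "250.00"],   # billing-era codes
}

expansion_terms = [ICD9_TEXT[c] for c in visit["admit_diagnosis"] if c in ICD9_TEXT]
print(expansion_terms)  # ['occlusion and stenosis of carotid artery']
```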

