AMIA CRI Summit 2011
CRI-09: Cross-Institutional Systems to Support Phenotyping in Biomedical Research:
Experiences from the eMERGE Network
Luke Rasmussen, Marshfield Clinic
David Carrell, PhD, Group Health Research Institute
William Thompson, PhD, Northwestern University
Hua Xu, PhD, Vanderbilt University
Jyoti Pathak, PhD, Mayo Clinic
eMERGE Consortium
• Principal sponsor: NHGRI with additional funding from NIGMS
• NIH-funded consortium (CTSA awardee institutions)
• DNA Biobanks linked to EHR data
• Consortium members
– Group Health of Puget Sound
– Marshfield Clinic
– Mayo Clinic
– Northwestern University
– Vanderbilt University
Marshfield Clinic Biobank Population
• Geographically defined cohort
• Stable population
• Minimal selection bias
• Over 95% of medical events captured in EMR
Data
• All levels of inpatient and outpatient care
• 5 decades of retrospective clinical data
• Prospective & continuous data collection via EHR
• Event, testing, treatment and outcomes represented
• High utilization of primary care to classify controls
• Clinical, financial and environment data
Health Events
eMERGE Contributors
• NHGRI
– Rongling Li
– Heather Junkins
– Teri Manolio
– Jim Ostell
• Group Health
– Eric Larson
– Gail Jarvik
– Chris Carlson
– Wylie Burke
– Gene Jart
– David Carrell
– Malia Fullerton
– Walter Kukull
– Paul Crane
– Noah Weston
• Northwestern
– Rex Chisholm
– Bill Lowe
– Phil Greenland
– Wendy Wolf
– Maureen Smith
– Geoff Hayes
– Pedro Avila
– Joel Humowiecki
– Jen Allen-Pacheco
– Amy Lemke
– Will Thompson
• Marshfield
– Cathy McCarty
– Peggy Peissig
– Luke Rasmussen
– Marilyn Ritchie
– Justin Starren
– Russ Wilke
– Dick Berg
– Jim Linneman
• Mayo Clinic
– Christopher G. Chute
– Iftikhar J. Kullo
– Barbara Koenig
– Suzette Bielinski
– Mariza de Andrade
• Vanderbilt
– Dan Roden
– Dan Masys
– Josh Denny
– Brad Malin
– Ellen Wright Clayton
– Dana Crawford
– Jonathan Haines
– Jonathan Schildcrout
– Jill Pulley
– Melissa Basford
– Marilyn Ritchie
RFA HG-07-005: Genome-Wide Studies in Biorepositories with Electronic Medical Record Data
• 2007 NIH Request for Applications from the National Human Genome Research Institute
“The purpose of this funding opportunity is to provide support for investigative groups affiliated with existing biorepositories to develop necessary methods and procedures for, and then to perform, if feasible, genome-wide studies in participants with phenotypes and environmental exposures derived from electronic medical records, with the aim of widespread sharing of the resulting individual genotype-phenotype data to accelerate the discovery of genes related to complex diseases.” (Emphasis added)
Development and Growth
Idea
Develop
Disseminate
More Ideas
Issues
• Pre-existing and new systems/methods
• Applied to common (yet different) tasks
• Different locations/environments
Tools and Methods
Presenter | Topic
Luke Rasmussen, Marshfield Clinic | Reusable phenotype algorithms: techniques to facilitate future reuse of phenotype algorithms.
David Carrell, Group Health | Clinical Text Explorer Search Interface: facilitates exploration of EHR for rapid phenotyping and algorithm refinement.
William Thompson, Northwestern University | clinical Text Analysis and Knowledge Extraction System (cTAKES): natural language processing (NLP) system utilized for multiple phenotypes, including PAD.
Hua Xu, Vanderbilt University | MedEx: NLP system utilized within eMERGE with additional applications to pharmacogenomic research.
Jyoti Pathak, Mayo Clinic | eleMAP: facilitates harmonization and standardization of phenotype variables across sites.
Reusable Phenotype Algorithms
Luke Rasmussen
Senior Programmer/Analyst
Marshfield Clinic Research Foundation
Biomedical Informatics Research Center
Phenotype Development
• Multi-disciplinary teams
• Multiple sites
• Iterative
• Intangible → Tangible
EMR-based Phenotype Algorithms
• Typical components
– Billing and diagnosis codes
– Procedure codes
– Labs
– Medications
– Phenotype-specific covariates (e.g., Demographics, Vitals, Smoking Status, CASI scores)
– Pathology
– Imaging?
• Organized into inclusion and exclusion criteria
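One way to picture an algorithm organized into inclusion and exclusion criteria is the minimal Python sketch below. Everything in it is an assumption for illustration: the `Patient` fields, the `PhenotypeAlgorithm` class, and the codes `DX_A`, `DX_B`, and `LAB_X` are hypothetical placeholders, not an actual eMERGE case definition.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Patient:
    # Hypothetical, simplified view of an EMR record.
    diagnosis_codes: Set[str] = field(default_factory=set)
    medications: Set[str] = field(default_factory=set)
    labs: Dict[str, float] = field(default_factory=dict)


# A criterion is any function that answers yes/no for one patient.
Criterion = Callable[[Patient], bool]


@dataclass
class PhenotypeAlgorithm:
    name: str
    inclusion: List[Criterion]
    exclusion: List[Criterion]

    def is_case(self, patient: Patient) -> bool:
        # A case meets every inclusion criterion and no exclusion criterion.
        meets_all = all(rule(patient) for rule in self.inclusion)
        excluded = any(rule(patient) for rule in self.exclusion)
        return meets_all and not excluded


# Placeholder criteria; the code names and threshold are illustrative only.
example = PhenotypeAlgorithm(
    name="example phenotype",
    inclusion=[
        lambda p: "DX_A" in p.diagnosis_codes,
        lambda p: p.labs.get("LAB_X", 0.0) >= 6.5,
    ],
    exclusion=[lambda p: "DX_B" in p.diagnosis_codes],
)
```

Keeping each criterion as a separate function means a single component (for example, one exclusion rule) can be revised during iterative refinement without touching the rest of the definition.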
EMR-based Phenotype Algorithms
• Iteratively refine case definitions through partial manual review to achieve ~PPV ≥ 95%
• For controls, exclude all potentially overlapping syndromes and possible matches; iteratively refine such that ~NPV ≥ 98%
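As a rough illustration of how the PPV and NPV targets can be checked against a manually reviewed sample, the sketch below computes both from pairs of (algorithm label, reviewer label). The function name `validation_metrics` and the review counts are hypothetical; the numbers are made up for the example and are not eMERGE results.

```python
from typing import Iterable, Optional, Tuple


def validation_metrics(
    reviews: Iterable[Tuple[bool, bool]]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (PPV, NPV) from (algorithm_says_case, reviewer_confirms_case) pairs.

    A metric is None when the sample holds no patients of that predicted class.
    """
    tp = fp = tn = fn = 0
    for predicted_case, confirmed_case in reviews:
        if predicted_case and confirmed_case:
            tp += 1
        elif predicted_case and not confirmed_case:
            fp += 1
        elif not predicted_case and not confirmed_case:
            tn += 1
        else:
            fn += 1
    ppv = tp / (tp + fp) if (tp + fp) else None
    npv = tn / (tn + fn) if (tn + fn) else None
    return ppv, npv


# Illustrative review sample: 20 algorithm-identified cases (19 confirmed)
# and 50 algorithm-identified controls (49 confirmed).
sample = (
    [(True, True)] * 19
    + [(True, False)] * 1
    + [(False, False)] * 49
    + [(False, True)] * 1
)
ppv, npv = validation_metrics(sample)
meets_targets = ppv is not None and npv is not None and ppv >= 0.95 and npv >= 0.98
```

In an iterative workflow, a sample that misses either target would send the case or control definition back for another round of refinement and review.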
Primary Phenotypes
Site | Phenotype | Validation (PPV/NPV)
Group Health | Dementia | 73% / 92%
Marshfield Clinic | Cataracts | 98% / 98%
Marshfield Clinic | Low HDL | 82% / 96%
Mayo Clinic | PAD | 94% / 99%
Northwestern University | Type 2 DM | 98% / 100%
Vanderbilt University | QRS Duration | 97% / 100%
Supplemental Phenotypes
Site | Phenotype | Validation (PPV/NPV)
Group Health | WBC | *
Marshfield Clinic | Diabetic Retinopathy | 80% / 98%
Mayo Clinic | RBC | 98% / 94%
Northwestern University | Lipids | 92% / 100%
Northwestern University | Height | 95% / 100%
Vanderbilt University | PheWAS | *
* Not available at this time
Phenotype Reuse
• T2DM → Diabetic Retinopathy
– Identification of DM
– T2DM algorithm included T1DM identification for exclusion
• Low HDL → Lipids
Iterative Refinement for Reuse
[Diagram: two combined algorithms, "Condition - Subtype A" and "Condition - Subtype B", shown alongside separate "Condition", "Subtype A", and "Subtype B" components]
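The refinement can be read as factoring the shared "Condition" logic out of each subtype algorithm so that it is written once and reused. The Python sketch below shows that reading under stated assumptions: the record keys (`condition_evidence`, `subtype_a_evidence`, `subtype_b_evidence`) and the helper `with_base` are hypothetical names, not part of any eMERGE implementation.

```python
from typing import Callable, Dict

Record = Dict[str, bool]
Rule = Callable[[Record], bool]


def has_condition(record: Record) -> bool:
    # Shared component: evidence of the general condition.
    return record.get("condition_evidence", False)


def with_base(base: Rule, specific: Rule) -> Rule:
    # Build a subtype rule that requires the shared base rule
    # plus the subtype-specific rule.
    def combined(record: Record) -> bool:
        return base(record) and specific(record)

    return combined


# Both subtypes reuse the same base component instead of duplicating it.
subtype_a = with_base(has_condition, lambda r: r.get("subtype_a_evidence", False))
subtype_b = with_base(has_condition, lambda r: r.get("subtype_b_evidence", False))
```

With this layout, a fix to the shared condition logic reaches every subtype that depends on it, which is one plausible reason to restructure an algorithm with reuse in mind.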
Formalizing Reuse
• Identified potential for reuse
• Leverage significant work
• Phenotypes available: www.gwas.org
• Limitations
– Site-specific implementations