Next-generation Phenotyping Using Interoperable Big Data
George Hripcsak, Chunhua Weng
Columbia University Medical Center
In collaboration with Mount Sinai Medical Center
Biomedical Informatics: discovery and impact
Introducing OHDSI
Observational Health Data Sciences and Informatics
International network of researchers and observational health databases with a central coordinating center housed at Columbia University
Mission: Large-scale analysis of observational health databases for population-level estimation and patient-level predictions
Vision: Patients and clinicians use OHDSI tools every day to access evidence based on 1 billion patients
http://ohdsi.org
Clinical researcher, provider, patient
Tools and algorithms
Data nodes
Infrastructure, models, ontologies
OHDSI’s global research community
• >120 collaborators from 11 different countries
• Experts in informatics, statistics, epidemiology, clinical sciences
• Active participation from academia, government, industry, providers
http://ohdsi.org/who-we-are/collaborators/
Global reach of ohdsi.org
• >4600 distinct users from 96 countries in 2015
Why large-scale analysis is needed in healthcare
[Figure: matrix of all drugs by all health outcomes of interest]
What is large-scale?
• Millions of observations
• Millions of covariates
• Millions of questions
No analytics software in the world can fit a regression with >1m observations and >1m covariates on typical hardware… but CYCLOPS can!
Performance matters when handling a relational structure with millions of patients and billions of clinical observations; optimization should focus on analytical use cases.
Systematic solutions with massive parallelization should be designed to run efficiently for both one-at-a-time AND all-by-all analyses.
[Diagram: OMOP Common Data Model tables, by group]
Standardized clinical data: Person, Observation_period, Specimen, Death, Visit_occurrence, Procedure_occurrence, Drug_exposure, Device_exposure, Condition_occurrence, Measurement, Note, Observation, Fact_relationship
Standardized health system data: Location, Care_site, Provider
Standardized health economics: Payer_plan_period, Visit_cost, Procedure_cost, Drug_cost, Device_cost
Standardized derived elements: Cohort, Cohort_attribute, Condition_era, Drug_era, Dose_era
Standardized meta-data: CDM_source
Standardized vocabularies: Concept, Vocabulary, Domain, Concept_class, Concept_relationship, Relationship, Concept_synonym, Concept_ancestor, Source_to_concept_map, Drug_strength, Cohort_definition, Attribute_definition
Use cases: drug safety surveillance, device safety surveillance, vaccine safety surveillance, comparative effectiveness, health economics, quality of care
Preparing your data for analysis
Patient-level data in source system/schema → ETL design → ETL implement → ETL test → Patient-level data in OMOP CDM
OHDSI tools built to help:
• WhiteRabbit: profile your source data
• RabbitInAHat: map your source structure to CDM tables and fields
• Usagi: map your source codes to CDM vocabulary
• ATHENA: standardized vocabularies for all CDM domains
• CDM: DDL, index, constraints for Oracle, SQL Server, PostgreSQL; vocabulary tables with loading scripts
• ACHILLES: profile your CDM data; review data quality assessment; explore population-level summaries
• OHDSI Forums: public discussions for OMOP CDM implementers/developers
http://github.com/OHDSI
Data Evidence sharing paradigms
Patient-level data in OMOP CDM → evidence
Spectrum: one-time → repeated
• Single study: Write protocol → Develop code → Execute analysis → Compile result
• Real-time query: Develop app → Design query → Submit job → Review result
• Large-scale analytics: Develop app → Execute script → Explore results
Standardized large-scale analytics tools under development within OHDSI
Patient-level data in OMOP CDM
http://github.com/OHDSI
• ACHILLES: Database profiling
• CIRCE: Cohort definition
• HERACLES: Cohort characterization
• OHDSI Methods Library: CYCLOPS, CohortMethod, SelfControlledCaseSeries, SelfControlledCohort, TemporalPatternDiscovery, Empirical Calibration
• HERMES: Vocabulary exploration
• LAERTES: Drug-AE evidence base
• HOMER: Population-level causality assessment
• PLATO: Patient-level predictive modeling
• CALYPSO: Feasibility assessment
CIRCE for cohort definition
• CIRCE (Cohort Inclusion and Restriction Criteria Expression)
• User interface to define and review cohort definitions:
– A COHORT is a set of persons satisfying one or more criteria for a duration of time
– Disease phenotype is a typical use case for cohort definition
• Interface translates a human-readable form into a standardized JSON representation for network-based analysis interoperability, and compiles the JSON into a platform-specific SQL dialect for direct execution against any OMOP CDM-compliant dataset
• Open-source, freely available source code: https://github.com/OHDSI/Circe
One interface allows definition of criteria across all tables and all fields of the OMOP Common Data Model. The user interface translates this human-readable form into JSON, which is compiled into SQL dialects for 5 platforms.
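A minimal sketch of the JSON-to-SQL idea, for illustration only. The JSON layout, the `compile_expression` and `add_days` helpers, and the concept IDs are hypothetical and far simpler than the real CIRCE representation; only the two date-arithmetic idioms (SQL Server `DATEADD`, PostgreSQL `INTERVAL`) are standard SQL features of those platforms.

```python
import json

# Hypothetical, heavily simplified cohort expression (NOT the real CIRCE schema).
# Concept IDs 101 and 102 are made up for this example.
EXPRESSION_JSON = """
{
  "name": "Diagnosis with 365 days of follow-up",
  "table": "condition_occurrence",
  "concept_column": "condition_concept_id",
  "date_column": "condition_start_date",
  "concept_ids": [101, 102],
  "follow_up_days": 365
}
"""

def add_days(dialect, column, days):
    """Render 'column + N days' in the requested SQL dialect."""
    if dialect == "sql server":
        return f"DATEADD(day, {days}, {column})"
    if dialect == "postgresql":
        return f"({column} + {days} * INTERVAL '1 day')"
    raise ValueError(f"unsupported dialect: {dialect}")

def compile_expression(expression, dialect):
    """Turn the toy expression into a SELECT that yields cohort rows."""
    ids = ", ".join(str(int(c)) for c in expression["concept_ids"])
    end_date = add_days(dialect, expression["date_column"],
                        int(expression["follow_up_days"]))
    return (
        f"SELECT person_id AS subject_id,\n"
        f"       {expression['date_column']} AS cohort_start_date,\n"
        f"       {end_date} AS cohort_end_date\n"
        f"FROM {expression['table']}\n"
        f"WHERE {expression['concept_column']} IN ({ids})"
    )

expression = json.loads(EXPRESSION_JSON)
sql_by_dialect = {d: compile_expression(expression, d)
                  for d in ("sql server", "postgresql")}
for dialect, sql in sql_by_dialect.items():
    print(f"-- {dialect}\n{sql}\n")
```

The point of the design is that the shared artifact is the JSON, not the SQL: each site renders the same definition into whatever dialect its database speaks.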
Each expression can be defined by one or more standard concept sets, using OHDSI’s standardized vocabularies
OHDSI standardized vocabularies allow consistent definitions to be applied across disparate source vocabularies:
Selecting descendants of the SNOMED concept ‘Attention deficit hyperactivity disorder’ maps all ICD9, ICD10, and READ codes, so the analysis can be executed across OHDSI’s international data network
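A minimal sketch of how descendant selection can pull in source codes, assuming the CDM vocabulary tables `concept`, `concept_ancestor`, and `concept_relationship` (with the 'Maps to' relationship from source to standard concepts). The tables are cut down to a few columns, and every concept ID, name, and code below is made up for illustration; real values come from the OHDSI vocabularies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY, concept_name TEXT,
    vocabulary_id TEXT, concept_code TEXT);
CREATE TABLE concept_ancestor (
    ancestor_concept_id INTEGER, descendant_concept_id INTEGER);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT);
""")

# Toy rows: IDs and codes are invented, not real vocabulary content.
conn.executemany("INSERT INTO concept VALUES (?, ?, ?, ?)", [
    (1, "Parent disorder", "SNOMED", "toy-snomed-1"),
    (2, "Child disorder subtype", "SNOMED", "toy-snomed-2"),
    (3, "Unrelated disorder", "SNOMED", "toy-snomed-3"),
    (10, "Source code A", "ICD9CM", "toy-icd9-a"),
    (11, "Source code B", "ICD10", "toy-icd10-b"),
    (12, "Source code C", "Read", "toy-read-c"),
    (13, "Source code Z", "ICD9CM", "toy-icd9-z"),
])
# concept_ancestor also holds each concept as its own descendant.
conn.executemany("INSERT INTO concept_ancestor VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2), (3, 3)])
conn.executemany("INSERT INTO concept_relationship VALUES (?, ?, 'Maps to')",
                 [(10, 1), (11, 2), (12, 2), (13, 3)])

def source_codes_for(ancestor_concept_id):
    """Source codes that map to the ancestor concept or any of its descendants."""
    return conn.execute("""
        SELECT src.vocabulary_id, src.concept_code
        FROM concept_ancestor ca
        JOIN concept_relationship cr
          ON cr.concept_id_2 = ca.descendant_concept_id
         AND cr.relationship_id = 'Maps to'
        JOIN concept src ON src.concept_id = cr.concept_id_1
        WHERE ca.ancestor_concept_id = ?
        ORDER BY src.vocabulary_id, src.concept_code
    """, (ancestor_concept_id,)).fetchall()

codes = source_codes_for(1)
print(codes)
```

One standard concept plus its descendants is enough to reach codes from three source vocabularies, while the unrelated code stays out.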
HERMES for vocabulary exploration
Concept sets can define one or more entities. Here, the PheKB list of ‘ADHD inclusionary medications’ has been represented by 21 RxNorm ingredient concepts; all brands, doses, and forms are subsumed.
The human-readable Expression form is translated into JSON in real time. This JSON object can be shared across partners to materialize the definition consistently and reproducibly without any programming required.
Each expression is compiled into SQL. OHDSI supports rendering SQL into platform-specific dialects for SQL Server, Oracle, Postgres, RedShift, MS APS.
This code can be copied and executed in your favorite SQL UI tool, or…
Patient-level observational databases that are converted to the OMOP Common Data Model and exposed to the OHDSI WebAPI (either a local install or any public network version) can have the cohort definition executed directly within the database to produce a COHORT. The COHORT is then available for all subsequent research within the OHDSI environment…
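A minimal sketch of what a materialized COHORT looks like as data, assuming the CDM cohort columns `cohort_definition_id`, `subject_id`, `cohort_start_date`, and `cohort_end_date`. The entry rule (first qualifying diagnosis until the end of observation), the helper names, and the toy patients are assumptions for illustration, not CIRCE's logic.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class CohortEntry:
    cohort_definition_id: int
    subject_id: int
    cohort_start_date: date
    cohort_end_date: date

def build_cohort(definition_id, diagnoses, observation_end):
    """One entry per person: first qualifying diagnosis until end of observation."""
    first_dx = {}
    for person_id, dx_date in diagnoses:
        if person_id not in first_dx or dx_date < first_dx[person_id]:
            first_dx[person_id] = dx_date
    return [
        CohortEntry(definition_id, pid, start, observation_end[pid])
        for pid, start in sorted(first_dx.items())
        if pid in observation_end and start <= observation_end[pid]
    ]

def members_on(cohort, day):
    """Persons who are in the cohort on a given day."""
    return {e.subject_id for e in cohort
            if e.cohort_start_date <= day <= e.cohort_end_date}

# Toy data: (person_id, diagnosis date) pairs and each person's observation end.
diagnoses = [(1, date(2014, 3, 1)), (1, date(2013, 6, 15)), (2, date(2015, 1, 10))]
observation_end = {1: date(2015, 12, 31), 2: date(2015, 6, 30)}

cohort = build_cohort(1, diagnoses, observation_end)
for entry in cohort:
    print(entry)
print(members_on(cohort, date(2015, 3, 1)))
```

Because membership has a start and an end, the same person can be in the cohort on one date and out of it on another, which is what distinguishes a cohort from a plain patient list.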
Proof of concept
• Treatment pathways around the world
• Diabetes, hypertension, depression
• (Submitted to PNAS)
Cohort
Databases (255M) and definitions
Diabetes
Opportunities for collaboration
• Implement the PheKB library in CIRCE, so that all organizations with patient-level data (translated to OMOP common data model) can take the work from eMERGE and directly apply the logic to their own data and participate in eMERGE’s research
Phenotyping hard challenges
• Quality of the data
– Ambiguous or unknown meaning
– Accuracy
  • 50-100% accuracy [Hogan JAMIA 1997]
– Completeness
  • mostly missing
– Complexity
  • disease ontologies
• Bias
[Diagram]
Truth (health status of the patient) → observe & interpret → Concept (clinician or patient’s conception) → author → Record (EHR/PHR) → read → Concept (2nd clinician’s conception of the patient, or self, lawyer, compliance, ...) → process → Model (computable representation)
Annotations: Error (at three steps); Implicit; Biased
[Diagram: Patient state, Electronic health record, Care team, Therapy, Objective tests, Environment]
Inpatient mortality for community acquired pneumonia
[Figure: mortality (%), 0 to 35, by Fine class (1 to 5); series: 18715 cohort, 1935 cohort, Fine]
18715 cohort: +CXR, +fdg, -recent pneu, -recent visit
1935 cohort: above plus +DSUM exist, +ICD9 (pneu not sepsis)
Hripcsak ... Comput Biol Med 2007;37:296-304
EHR-derived phenotype
• Clinically relevant feature derived from EHR
– Patient has (a diagnosis of) type II diabetes
– Recent rash and fever
– Drug-induced liver injury
• Then use the phenotype in correlation studies, etc.
Raw data Phenotype Experiment Query
“Physics” of the medical record
1. Study EHR as if it were a natural object
– Use EHR to learn about EHR
– Not studying patient, but recording of patient
2. Aggregate across units and model
3. Borrow methods from non-linear time series
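One method from non-linear time series analysis is time-delayed mutual information: how much a value tells you about the value tau steps later. The sketch below is a minimal illustration under stated assumptions: a regularly sampled series, a simple equal-width histogram estimator, and a synthetic autoregressive signal. The function names are hypothetical and this is not the estimator used in the cited work.

```python
import math
import random
from collections import Counter

def mutual_information(pairs, bins=8):
    """Plug-in (histogram) estimate of mutual information, in bits."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]

    def make_binner(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # avoid zero width for constant input
        return lambda v: min(int((v - lo) / width), bins - 1)

    bin_x, bin_y = make_binner(xs), make_binner(ys)
    joint = Counter((bin_x(x), bin_y(y)) for x, y in pairs)
    count_x = Counter(bin_x(x) for x in xs)
    count_y = Counter(bin_y(y) for y in ys)
    n = len(pairs)
    # Sum over occupied cells of p(i,j) * log2(p(i,j) / (p(i) * p(j))).
    return sum(
        (c / n) * math.log2(c * n / (count_x[i] * count_y[j]))
        for (i, j), c in joint.items()
    )

def time_delayed_mi(series, tau, bins=8):
    """Mutual information between series[t] and series[t + tau]."""
    pairs = [(series[t], series[t + tau]) for t in range(len(series) - tau)]
    return mutual_information(pairs, bins)

# Synthetic stand-in for a lab series: AR(1) with coefficient 0.9 (an assumption).
random.seed(0)
series = [0.0]
for _ in range(4999):
    series.append(0.9 * series[-1] + random.gauss(0.0, 1.0))

mi_by_tau = {tau: time_delayed_mi(series, tau) for tau in (1, 5, 50)}
for tau, mi in mi_by_tau.items():
    print(f"tau={tau:>2}  MI={mi:.3f} bits")
```

For a signal with short memory, the estimate should fall toward the estimator's small positive bias as tau grows. Real EHR series are irregularly sampled, so the spacing between measurements has to be handled as well.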
Glucose by Δt and tau
[Figure: glucose, MI (about −0.1 to 0.45) as a function of delta-t (days) and tau]
Albers ... Translational Bioinformatics 2009
Correlate lab tests and concepts
• 22 years of data on 3 million patients
• 21 laboratory tests
– sodium, potassium, bicarbonate, creatinine, urea nitrogen, glucose, and hemoglobin
• 60 concepts derived from signout notes
– residents caring for inpatients to facilitate the transfer of care for overnight coverage
– concepts likely to have an association + controls
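A minimal sketch of the kind of lagged correlation that relates a lab value to a concept mention at different time offsets. It uses a single synthetic daily series and plain Pearson correlation; the function names, the toy potassium values, and the one-day delay between lab and note are assumptions for illustration, not the study's data or method.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def lagged_correlation(lab, concept, lag):
    """Correlate concept[t] with lab[t + lag].

    A negative lag pairs each note with a lab value measured before it.
    """
    n = len(lab)
    if lag >= 0:
        return pearson(lab[lag:], concept[:n - lag])
    return pearson(lab[:n + lag], concept[-lag:])

# Toy series: a high lab value on days 10, 25, 40, each followed by a
# concept mention in the next day's note.
lab = [4.0] * 60
concept = [0] * 60
for day in (10, 25, 40):
    lab[day] = 6.0
    concept[day + 1] = 1

correlations = {lag: lagged_correlation(lab, concept, lag) for lag in range(-5, 6)}
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 3))
```

The lag at which the correlation peaks indicates whether the lab abnormality tends to precede or follow the mention.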
Intentional and physiologic associations
[Figure: potassium vs. concepts aldactone, dialysis, hyperkalemia, hypokalemia, hypomagnesemia; x-axis −60 to 60, y-axis −0.15 to 0.15]
Timing of cause in disease vs. treatment
[Figure: glucose vs. concepts hyperglycemia, hypernatremia, hypoglycemia, insulin, metformin, pancreatitis; x-axis −60 to 40, y-axis −0.04 to 0.1]
Specificity of the concept
[Figure: creatinine vs. concepts aldactone, dialysis, diarrhea, diuretic, hctz, hyperglycemia, hypernatremia, vomiting; x-axis −60 to 60, y-axis −0.04 to 0.14]
Health care process model
Hripcsak ... JAMIA 2013
inpatient admit ambulatory surgery
Interpreting time
Hripcsak JAMIA 2009
Deviation by stated unit
[Figure: number of occurrences (0 to 50) by proportional deviation (−1 to 1), by stated unit: day, week, month, year; timeline labels: Now, Stated time]
Interpreting time
Variable | Definition | Coefficient | Significance
value | stated numeric value in the temporal assertion (1 to 30 in this sample) | 0.0414 | <0.001
round number | true if value is a multiple of 5 (any unit) or 6 (with months) | –0.0218 | 0.002
ln(duration) | logarithm of stated duration in days, which equals the product of unit and value | 0.150 | 0.023
gt 18 years | true if duration ≥ 18 years, so the event should not be in the database | 0.816 | <0.001
intercept | | 0.406 | 0.416
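A minimal sketch of how the four covariates defined above could be computed from a stated time such as "6 months ago". The days-per-unit conversions (month = 30, year = 365) and the function name are assumptions; the source does not give the exact conversions, and the sketch deliberately does not apply the coefficients, since the outcome variable is not stated here.

```python
import math

# Assumed conversions from stated unit to days (not given in the source).
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def temporal_covariates(value, unit):
    """Covariates for a temporal assertion like '<value> <unit>s ago'."""
    duration_days = value * DAYS_PER_UNIT[unit]
    return {
        "value": value,
        # Multiple of 5 for any unit, or multiple of 6 when the unit is months.
        "round_number": value % 5 == 0 or (unit == "month" and value % 6 == 0),
        "ln_duration": math.log(duration_days),
        "gt_18_years": duration_days >= 18 * DAYS_PER_UNIT["year"],
    }

examples = [(6, "month"), (3, "week"), (20, "year")]
covariates = {ex: temporal_covariates(*ex) for ex in examples}
for ex, row in covariates.items():
    print(ex, row)
```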
Patient variability and sampling
Parameterizing Time
Parameterizing Time(Non-stationarity)
[Figure: coefficient of variation (0 to 2.5) for creatinine, glucose, sodium, potassium; label: rate of change; series: clock, warped, sequence]
Hripcsak JAMIA 2015
Parameterizing Time
Vector autoregression to decipher associations
Noisy training sets (with Nigam Shah; David Sontag)
Summary
• OHDSI international collaboration could dovetail with eMERGE
• Next-generation phenotyping requires understanding the EHR