Post on 11-Mar-2018

Next-generation Phenotyping Using Interoperable Big Data

George Hripcsak, Chunhua Weng

Columbia University Medical Center

Collab with Mount Sinai Medical Center

Biomedical Informatics: discovery and impact

Introducing OHDSI

Observational Health Data Sciences and Informatics

International network of researchers and observational health databases with a central coordinating center housed at Columbia University

Mission: Large-scale analysis of observational health databases for population-level estimation and patient-level predictions

Vision: Patients and clinicians use OHDSI tools every day to access evidence based on 1 billion patients

http://ohdsi.org

Clinical researcher, provider, patient

Tools and algorithms

Data nodes

Infrastructure, models, ontologies

OHDSI’s global research community

• >120 collaborators from 11 different countries
• Experts in informatics, statistics, epidemiology, clinical sciences
• Active participation from academia, government, industry, providers

http://ohdsi.org/who-we-are/collaborators/

Global reach of ohdsi.org

• >4600 distinct users from 96 countries in 2015

Why large-scale analysis is needed in healthcare

[Figure: all drugs versus all health outcomes of interest]

What is large-scale?

• Millions of observations

• Millions of covariates

• Millions of questions

No analytics software in the world can fit a regression with >1m observations and >1m covariates on typical hardware… but CYCLOPS can!

Performance is needed when handling a relational structure with millions of patients and billions of clinical observations, with a focus on optimization for analytical use cases.

Systematic solutions with massive parallelization should be designed to run efficiently for both one-at-a-time AND all-by-all analyses.

[Figure: OMOP Common Data Model tables, grouped by category]

• Standardized vocabularies: Concept, Vocabulary, Domain, Concept_class, Concept_relationship, Relationship, Concept_synonym, Concept_ancestor, Source_to_concept_map, Drug_strength, Cohort_definition, Attribute_definition
• Standardized derived elements: Cohort, Cohort_attribute, Condition_era, Drug_era, Dose_era
• Standardized clinical data: Person, Observation_period, Specimen, Death, Visit_occurrence, Procedure_occurrence, Drug_exposure, Device_exposure, Condition_occurrence, Measurement, Note, Observation, Fact_relationship
• Standardized health system data: Location, Care_site, Provider
• Standardized health economics: Payer_plan_period, Visit_cost, Procedure_cost, Drug_cost, Device_cost
• Standardized meta-data: CDM_source
• Use cases: Drug safety surveillance, Device safety surveillance, Vaccine safety surveillance, Comparative effectiveness, Health economics, Quality of care

Preparing your data for analysis

Patient-level data in source system/schema → ETL design → ETL implement → ETL test → Patient-level data in OMOP CDM

OHDSI tools built to help:

• WhiteRabbit: profile your source data
• RabbitInAHat: map your source structure to CDM tables and fields
• ATHENA: standardized vocabularies for all CDM domains
• Usagi: map your source codes to CDM vocabulary
• CDM: DDL, index, constraints for Oracle, SQL Server, PostgreSQL; vocabulary tables with loading scripts
• ACHILLES: profile your CDM data; review data quality assessment; explore population-level summaries
• OHDSI Forums: public discussions for OMOP CDM implementers/developers

http://github.com/OHDSI

Data Evidence sharing paradigms

Patient-level data in OMOP CDM → evidence

• Single study: Write protocol → Develop code → Execute analysis → Compile result
• Real-time query: Develop app → Design query → Submit job → Review result
• Large-scale analytics: Develop app → Execute script → Explore results

Steps in the figure are marked as one-time or repeated.

Standardized large-scale analytics tools under development within OHDSI

Patient-level data in OMOP CDM

• ACHILLES: Database profiling
• CIRCE: Cohort definition
• HERACLES: Cohort characterization
• OHDSI Methods Library: CYCLOPS, CohortMethod, SelfControlledCaseSeries, SelfControlledCohort, TemporalPatternDiscovery, Empirical Calibration
• HERMES: Vocabulary exploration
• LAERTES: Drug-AE evidence base
• HOMER: Population-level causality assessment
• PLATO: Patient-level predictive modeling
• CALYPSO: Feasibility assessment

http://github.com/OHDSI

CIRCE for cohort definition

• CIRCE (Cohort Inclusion and Restriction Criteria Expression)

• User interface to define and review cohort definitions:

– A COHORT is a set of persons satisfying one or more criteria for a duration of time

– A disease phenotype is a typical use case for cohort definition

• The interface translates a human-readable form into a standardized JSON representation for interoperability in network-based analysis, and compiles the JSON into a platform-specific SQL dialect for direct execution against any OMOP CDM-compliant dataset

• Open-source, freely available source code: https://github.com/OHDSI/Circe
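
The JSON-to-SQL step described above can be illustrated with a minimal sketch. The JSON layout, the concept IDs, and the helper names `date_add` and `compile_cohort` are hypothetical and are not CIRCE's actual schema or API; only the OMOP CDM table and column names are real.

```python
import json

# Hypothetical, simplified cohort expression. This is NOT the real CIRCE JSON
# schema, and the concept IDs are made up for illustration.
COHORT_JSON = """
{
  "name": "Condition with prior observation",
  "concept_ids": [1001, 1002],
  "prior_observation_days": 365
}
"""


def date_add(dialect, column, days):
    """Render 'column + days' in the requested SQL dialect."""
    if dialect == "sql server":
        return f"DATEADD(day, {days}, {column})"
    if dialect == "postgresql":
        return f"({column} + {days} * INTERVAL '1 day')"
    raise ValueError(f"unsupported dialect: {dialect}")


def compile_cohort(expression, dialect):
    """Compile the toy expression into SQL against OMOP CDM tables."""
    ids = ", ".join(str(int(c)) for c in expression["concept_ids"])
    earliest = date_add(
        dialect,
        "op.observation_period_start_date",
        int(expression["prior_observation_days"]),
    )
    return (
        "SELECT co.person_id, MIN(co.condition_start_date) AS cohort_start_date\n"
        "FROM condition_occurrence co\n"
        "JOIN observation_period op ON op.person_id = co.person_id\n"
        f"WHERE co.condition_concept_id IN ({ids})\n"
        f"  AND co.condition_start_date >= {earliest}\n"
        "  AND co.condition_start_date <= op.observation_period_end_date\n"
        "GROUP BY co.person_id"
    )


expression = json.loads(COHORT_JSON)
for dialect in ("sql server", "postgresql"):
    print(f"-- {dialect}")
    print(compile_cohort(expression, dialect))
```

Because the shared artifact is the JSON rather than the SQL, each site can compile the same definition for its own database platform.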

One interface allows definition of criteria across all tables and all fields of the OMOP Common Data Model. The user interface translates this human-readable form into JSON, which is compiled into SQL dialects for 5 platforms.

Each expression can be defined by one or more standard concept sets, using OHDSI’s standardized vocabularies

OHDSI standardized vocabularies allow consistent definitions to be applied across disparate source vocabularies:

Selecting the descendants of the SNOMED concept ‘Attention deficit hyperactivity disorder’ maps all the corresponding ICD9, ICD10, and READ codes, so the analysis can be executed across OHDSI’s international data network
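
As a rough illustration of how descendant selection reaches source codes, the sketch below joins two OMOP vocabulary tables in an in-memory SQLite database. The table and column names and the 'Maps to' relationship come from the OMOP CDM; the concept IDs and rows are made up, and `source_concepts_for` is a hypothetical helper, not an OHDSI function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE concept_ancestor (
        ancestor_concept_id INTEGER,
        descendant_concept_id INTEGER
    );
    CREATE TABLE concept_relationship (
        concept_id_1 INTEGER,
        concept_id_2 INTEGER,
        relationship_id TEXT
    );
    -- Toy rows with made-up IDs: 100 stands in for the standard parent
    -- concept, 101 and 102 for its descendants (the table also holds a
    -- self row), and 9001-9004 for source codes from other vocabularies.
    INSERT INTO concept_ancestor VALUES (100, 100), (100, 101), (100, 102);
    INSERT INTO concept_relationship VALUES
        (9001, 101, 'Maps to'),
        (9002, 102, 'Maps to'),
        (9003, 100, 'Maps to'),
        (9004, 555, 'Maps to');
    """
)


def source_concepts_for(connection, ancestor_id):
    """Return source concepts that map to the ancestor or any descendant."""
    rows = connection.execute(
        """
        SELECT DISTINCT cr.concept_id_1
        FROM concept_ancestor ca
        JOIN concept_relationship cr
          ON cr.concept_id_2 = ca.descendant_concept_id
         AND cr.relationship_id = 'Maps to'
        WHERE ca.ancestor_concept_id = ?
        ORDER BY cr.concept_id_1
        """,
        (ancestor_id,),
    ).fetchall()
    return [row[0] for row in rows]


mapped = source_concepts_for(conn, 100)
print(mapped)  # source concept 9004 maps elsewhere and is excluded
```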

HERMES for vocabulary exploration

Concept sets can define one or more entities. Here, the PheKB list of ‘ADHD inclusionary medications’ is represented by 21 RxNorm ingredient concepts; all brands, doses, and forms are subsumed.

The human-readable Expression form is translated into JSON in real-time. This JSON object can be shared across partners to materialize the definition consistently and reproducibly without any programming required

Each expression is compiled into SQL. OHDSI supports rendering SQL into platform-specific dialects for SQL Server, Oracle, Postgres, RedShift, MS APS.

This code can be copied and executed in your favorite SQL UI tool, or…

Patient-level observational databases that are converted to the OMOP Common Data Model and exposed to the OHDSI WebAPI (either a local install or any public network version) can have the cohort definition executed directly within the database to produce a COHORT. The COHORT is then available for all subsequent research within the OHDSI environment…

Try it yourself

http://www.ohdsi.org/web/circe/#/146

Proof of concept

• Treatment pathways around the world

• Diabetes, hypertension, depression

• (Submitted to PNAS)

Cohort

Databases (255M) and definitions

Diabetes

Opportunities for collaboration

• Implement the PheKB library in CIRCE, so that all organizations with patient-level data (translated to OMOP common data model) can take the work from eMERGE and directly apply the logic to their own data and participate in eMERGE’s research

Phenotyping hard challenges

• Quality of the data
– Ambiguous or unknown meaning
– Accuracy
  • 50-100% accuracy [Hogan JAMIA 1997]
– Completeness
  • mostly missing
– Complexity
  • disease ontologies
• Bias

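[Figure: from the patient's true state to a computable model]

• Truth: health status of the patient
• Concept: clinician’s or patient’s conception
• Record: EHR/PHR
• Concept: 2nd clinician’s conception of the patient (or self, lawyer, compliance, ...)
• Model: computable representation

Arrows in the figure are labeled observe & interpret, author, read, and process; annotations mark Error (three times), Implicit, and Biased.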

[Figure: relationships among patient state, care team, therapy, objective tests, environment, and the electronic health record]

Inpatient mortality for community acquired pneumonia

[Figure: inpatient mortality (%, 0 to 35) by Fine class (1 to 5) for three series: 18715 cohort, 1935 cohort, Fine]

18715 cohort: +CXR+fdg-recent pneu-recent visit

1935 cohort: above plus +DSUM exist +ICD9 (pneu not sepsis)

Hripcsak ... Comput Biol Med 2007;37:296-304

EHR-derived phenotype

• Clinically relevant feature derived from EHR

– Patient has (a diagnosis of) type II diabetes

– Recent rash and fever

– Drug-induced liver injury

• Then use the phenotype in correlation studies, etc.

Raw data → Phenotype → Experiment (figure label: Query)

“Physics” of the medical record

1. Study EHR as if it were a natural object

– Use EHR to learn about EHR

– Not studying patient, but recording of patient

2. Aggregate across units and model

3. Borrow methods from non-linear time series

Glucose by Δt and tau

[Figure: surface plot of mutual information (MI) for glucose as a function of tau and delta-t (days); MI legend bins run from -0.1 to 0.45]

Albers ... Translational Bioinformatics 2009
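
A minimal sketch of time-delayed mutual information, the kind of non-linear time series measure plotted above. It assumes a regularly sampled series and a simple equal-width histogram estimator, so it ignores the delta-t dimension of the figure; the function names, bin count, and synthetic series are illustrative assumptions, not the authors' method.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys, bins=8):
    """Histogram (plug-in) estimate of mutual information, in nats."""

    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # avoid zero width for constant data
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    # Sum of p(i,j) * log(p(i,j) / (p(i) * p(j))) over occupied cells.
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


def delayed_mi(series, tau, bins=8):
    """Mutual information between x(t) and x(t + tau), for tau >= 1."""
    if tau < 1:
        raise ValueError("tau must be at least 1")
    return mutual_information(series[:-tau], series[tau:], bins)


# Synthetic autocorrelated series standing in for repeated lab measurements.
random.seed(0)
series = [0.0]
for _ in range(4999):
    series.append(0.9 * series[-1] + random.gauss(0.0, 1.0))

for tau in (1, 5, 50):
    print(tau, round(delayed_mi(series, tau), 3))
```

On this synthetic series the estimate falls as tau grows, because values far apart in time share little information.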

Correlate lab tests and concepts

• 22 years of data on 3 million patients

• 21 laboratory tests

– sodium, potassium, bicarbonate, creatinine, urea nitrogen, glucose, and hemoglobin

• 60 concepts derived from signout notes

– residents caring for inpatients to facilitate the transfer of care for overnight coverage

– concepts likely to have an association + controls
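
One way to relate a lab test to a note concept over time is a lagged correlation, sketched below. The daily alignment, the toy series, and the helper names `pearson` and `lagged_correlation` are assumptions made for illustration; the slides do not specify the exact procedure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(lab, concept, lag):
    """Correlate concept[t] with lab[t + lag]; lag > 0 means the lab follows."""
    if lag >= 0:
        a, b = concept[: len(concept) - lag], lab[lag:]
    else:
        a, b = concept[-lag:], lab[: len(lab) + lag]
    return pearson(a, b)


# Toy daily series: 1 marks a day on which the concept appears in a note.
concept = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
# The lab value rises two days after each mention (constructed, not real data).
lab = [4.0] * len(concept)
for t, mentioned in enumerate(concept):
    if mentioned and t + 2 < len(lab):
        lab[t + 2] += 1.0

correlations = {lag: lagged_correlation(lab, concept, lag) for lag in range(-3, 4)}
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 3))
```

Plotting the correlation against the lag gives curves whose peak position indicates whether the lab change precedes or follows the documented concept.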

Intentional and physiologic associations

[Figure: association with potassium (y-axis -0.15 to 0.15) versus time offset (x-axis -60 to 60) for the concepts aldactone, dialysis, hyperkalemia, hypokalemia, hypomagnesemia]

Timing of cause in disease vs. treatment

[Figure: association with glucose (y-axis -0.04 to 0.1) versus time offset (x-axis -60 to 40) for the concepts hyperglycemia, hypernatremia, hypoglycemia, insulin, metformin, pancreatitis]

Specificity of the concept

[Figure: association with creatinine (y-axis -0.04 to 0.14) versus time offset (x-axis -60 to 60) for the concepts aldactone, dialysis, diarrhea, diuretic, hctz, hyperglycemia, hypernatremia, vomiting]

Health care process model

Hripcsak ... JAMIA 2013

[Figure panels: inpatient admit; ambulatory surgery]

Interpreting time

Hripcsak JAMIA 2009

Deviation by stated unit

[Figure: number of occurrences (0 to 50) versus proportional deviation (-1 to 1), by stated unit: day, week, month, year. Figure labels: Now, Stated time]

Interpreting time

Variable | Definition | Coefficient | Significance
value | stated numeric value in the temporal assertion (1 to 30 in this sample) | 0.0414 | <0.001
round number | true if value is a multiple of 5 (any unit) or 6 (with months) | -0.0218 | 0.002
ln(duration) | logarithm of stated duration in days, which equals the product of unit and value | 0.150 | 0.023
gt 18 years | true if duration ≥ 18 years, so the event should not be in the database | 0.816 | <0.001
intercept | | 0.406 | 0.416
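
To make the variable definitions concrete, the sketch below evaluates the fitted terms for one stated time, assuming the table reports an ordinary linear model. The slide does not name the response variable or any link function, so the output is only the linear combination of the listed coefficients. The day-per-unit conversions and the function names are assumptions.

```python
import math

# Assumed conversions from stated unit to days (not given on the slide).
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

# Coefficients copied from the table.
COEF = {
    "intercept": 0.406,
    "value": 0.0414,
    "round_number": -0.0218,
    "ln_duration": 0.150,
    "gt_18_years": 0.816,
}


def features(value, unit):
    """Build the four predictors for a stated time such as '2 weeks ago'."""
    duration_days = value * DAYS_PER_UNIT[unit]
    is_round = value % 5 == 0 or (unit == "month" and value % 6 == 0)
    return {
        "value": float(value),
        "round_number": 1.0 if is_round else 0.0,
        "ln_duration": math.log(duration_days),
        "gt_18_years": 1.0 if duration_days >= 18 * 365 else 0.0,
    }


def linear_predictor(value, unit):
    """Intercept plus the coefficient-weighted sum of the predictors."""
    feats = features(value, unit)
    return COEF["intercept"] + sum(COEF[name] * x for name, x in feats.items())


for value, unit in ((2, "week"), (6, "month"), (20, "year")):
    print(value, unit, round(linear_predictor(value, unit), 3))
```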

Patient variability and sampling

Parameterizing Time

Parameterizing Time (Non-stationarity)

[Figure: coefficient of variation of rate of change (0 to 2.5) for creatinine, glucose, sodium, and potassium under three time parameterizations: clock, warped, sequence]

Hripcsak JAMIA 2015
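
The idea of comparing time parameterizations can be sketched by computing the coefficient of variation of the rate of change under clock time and under sequence time (the measurement index). The data below are made up, taking the coefficient of variation over absolute rates is an assumption, and the warped parameterization is omitted because the slides do not define it.

```python
import statistics


def rates_of_change(times, values):
    """Slope between consecutive measurements under a given time axis."""
    return [
        (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        for i in range(len(values) - 1)
    ]


def coefficient_of_variation(rates):
    """Population standard deviation over mean of the absolute rates."""
    magnitudes = [abs(r) for r in rates]
    return statistics.pstdev(magnitudes) / statistics.mean(magnitudes)


# Made-up glucose values: sampled daily while changing fast, then monthly.
clock_days = [0, 1, 2, 3, 33, 63, 93]
glucose = [100, 140, 180, 150, 120, 100, 90]
sequence_index = list(range(len(glucose)))  # sequence time: 0, 1, 2, ...

cov_clock = coefficient_of_variation(rates_of_change(clock_days, glucose))
cov_sequence = coefficient_of_variation(rates_of_change(sequence_index, glucose))
print(round(cov_clock, 3), round(cov_sequence, 3))
```

In this toy series, measurements are dense when values change quickly, so indexing by sequence evens out the rate of change relative to clock time.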

Parameterizing Time

Vector autoregression to decipher associations

Noisy training sets, with Nigam Shah; David Sontag

Summary

• OHDSI international collaboration could dovetail with eMERGE

• Next-generation phenotyping requires understanding the EHR