Post on 05-Dec-2014
Joel Saltz MD, PhD
Chair, Department of Biomedical Informatics; Director, Center for Comprehensive Informatics
Emory University
Adjunct Professor, CSE, CS, College of Computing, Georgia Tech
Data and Computational Challenges in Integrative Biomedical Informatics
Center for Comprehensive Informatics
INTEGRATIVE DATA ANALYTICS
Integrative Biomedical Informatics Analytics
• Anatomic/functional characterization at fine level (Pathology) and gross level (Radiology)
• High throughput multi-scale image segmentation, feature extraction, analysis of features
• Integration of anatomic/functional characterization with multiple types of “omic” information
[Diagram: Radiology Imaging, Pathologic Features, “Omic” Data, Patient Outcome]
Integrative Spatio-Temporal Molecular Analytics
• Aka Big Data
Quantitative Feature Analysis in Pathology: Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD= Joel Saltz)
Using TCGA Data to Study Glioblastoma
• Diagnostic Improvement
• Molecular Classification
• Predictors of Progression
Millions of Nuclei Defined by n Features
• Top-down analysis: use the features with existing diagnostic constructs
• Bottom-up analysis: let features define and drive the analysis
TCGA Whole Slide Images (Jun Kong)
Nuclear Analysis Workflow
Step 1: Nuclei Segmentation
• Identify individual nuclei and their boundaries
Step 2: Feature Extraction
• Describe individual nuclei in terms of size, shape, and texture
Step 3: Nuclei Classification
[Figure: Nuclear Qualities, scored 1 to 10, Oligodendroglioma and Astrocytoma]
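The feature extraction step can be illustrated with a minimal sketch. It assumes that segmentation has already produced each nucleus boundary as an ordered list of (x, y) vertices; the function names and the two sample boundaries are hypothetical and are not taken from the slides. It computes size (area) and one shape descriptor (circularity, 4πA/P², which is 1 for a perfect circle and smaller for elongated shapes).

```python
import math

def polygon_area(boundary):
    """Area by the shoelace formula; boundary is an ordered list of (x, y)."""
    n = len(boundary)
    twice_area = 0.0
    for i in range(n):
        x0, y0 = boundary[i]
        x1, y1 = boundary[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0

def polygon_perimeter(boundary):
    """Sum of edge lengths around the closed boundary."""
    n = len(boundary)
    return sum(math.dist(boundary[i], boundary[(i + 1) % n]) for i in range(n))

def shape_features(boundary):
    """Size and shape descriptors for one segmented nucleus."""
    area = polygon_area(boundary)
    perimeter = polygon_perimeter(boundary)
    circularity = 4.0 * math.pi * area / perimeter ** 2
    return {"area": area, "perimeter": perimeter, "circularity": circularity}

# Hypothetical boundaries: a near-circular nucleus and an elongated one.
round_nucleus = [(10 * math.cos(2 * math.pi * k / 64),
                  10 * math.sin(2 * math.pi * k / 64)) for k in range(64)]
elongated_nucleus = [(0, 0), (20, 0), (20, 4), (0, 4)]

round_features = shape_features(round_nucleus)
elongated_features = shape_features(elongated_nucleus)
```

Texture features would need the pixel intensities inside each boundary and are left out of this sketch.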
Comparison of Machine-based Classification to Human-based Classification
Separation of GBM, Oligo1, Oligo2 as Designated by Neuropathologists
Separation of GBM, Oligo1 and Oligo2 as Designated by Machine
[Figure: Survival Analysis, panels: Human, Machine]
Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification
Oligo-Related Genes
• Myelin Basic Protein
• Proteolipoprotein
• HoxD1
Nuclear features most associated with Oligo signature genes:
• Circularity (high)
• Eccentricity (low)
Millions of Nuclei Defined by n Features
• Top-down analysis: analyze features in context of existing diagnostic constructs
• Bottom-up analysis: let nuclear features define and drive the analysis
Direct Study of Relationship Between Image Features vs Clinical Outcome, Response to Treatment, Molecular Information
Lee Cooper, Carlos Moreno
Clustering identifies three morphological groups
• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
• Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)
• Prognostically-significant (logrank p=4.5e-4)
[Figure: heat map of Feature Indices (10 to 50) by cluster: CC, CM, PB]
[Figure: survival curves, Survival (0 to 1) vs. Days (0 to 3000), for CC, CM, PB]
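Survival curves of this kind are commonly drawn with the Kaplan-Meier estimator. The slides do not say how the curves were computed, so the sketch below is an assumption, and the follow-up times and event flags are made up for illustration. At each observed death time, the running survival probability is multiplied by (1 − deaths / number still at risk); censored patients leave the risk set without lowering the curve.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct death time.

    durations: follow-up time in days for each patient.
    events: True if death was observed, False if the patient was censored.
    """
    samples = sorted(zip(durations, events))
    at_risk = len(samples)
    survival = 1.0
    curve = []
    i = 0
    while i < len(samples):
        t = samples[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(samples) and samples[i][0] == t:
            deaths += 1 if samples[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up data for one morphological group.
days = [100, 200, 200, 300, 400]
died = [True, True, False, True, False]
curve = kaplan_meier(days, died)
```

Computing one such curve per group (CC, CM, PB) gives the inputs a logrank test would compare.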
Associations
HEALTHCARE DATA ANALYTICS
• Example Project: Find hot spots in readmissions within 30 days
– What fraction of patients with a given principal diagnosis will be readmitted within 30 days?
– What fraction of patients with a given set of diseases will be readmitted within 30 days?
– How do severity and time course of co-morbidities affect readmissions?
– Geographic analyses
• Compare and contrast with UHC Clinical Data Base
– Repeat analyses across all UHC hospitals
– Are we performing the same?
– How are UHC-curated groupings of patients (e.g., product lines) useful?
• Need a repeatable process that we can apply identically to both local and UHC data
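The first question above (what fraction of encounters with a given principal diagnosis are followed by a readmission within 30 days) can be sketched as follows. The record layout, field names, and sample encounters are hypothetical and do not reflect the actual warehouse schema. The sketch also assumes a readmission is counted when the same patient's next admission starts within 30 days of the earlier discharge.

```python
from collections import defaultdict
from datetime import date

def readmission_fraction_by_diagnosis(encounters, window_days=30):
    """Fraction of encounters per principal diagnosis followed by a readmit."""
    by_patient = defaultdict(list)
    for enc in encounters:
        by_patient[enc["patient_id"]].append(enc)

    totals = defaultdict(int)
    readmits = defaultdict(int)
    for visits in by_patient.values():
        visits.sort(key=lambda e: e["admit"])
        for index, enc in enumerate(visits):
            totals[enc["principal_dx"]] += 1
            following = visits[index + 1:index + 2]  # next encounter, if any
            if following:
                gap = (following[0]["admit"] - enc["discharge"]).days
                if 0 <= gap <= window_days:
                    readmits[enc["principal_dx"]] += 1
    return {dx: readmits[dx] / totals[dx] for dx in totals}

# Hypothetical encounters (ICD-9 codes used only as example labels).
encounters = [
    {"patient_id": 1, "admit": date(2012, 1, 1), "discharge": date(2012, 1, 5),
     "principal_dx": "428.0"},
    {"patient_id": 1, "admit": date(2012, 1, 20), "discharge": date(2012, 1, 25),
     "principal_dx": "428.0"},
    {"patient_id": 2, "admit": date(2012, 2, 1), "discharge": date(2012, 2, 3),
     "principal_dx": "428.0"},
    {"patient_id": 2, "admit": date(2012, 4, 1), "discharge": date(2012, 4, 4),
     "principal_dx": "486"},
]
rates = readmission_fraction_by_diagnosis(encounters)
```

Because the function takes plain encounter records, the same code could run unchanged on local and UHC extracts, which is the repeatability requirement stated in the last bullet.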
Clinical Phenotype Characterization and the Emory Analytic Information Warehouse
Overall System
[Diagram components: Source data (multiple), Database Mapper, Data Processing, Metadata Manager, Metadata Repository, Query Specification, I2b2 Database, I2b2 Web Server, Query tools, Study-specific Database; roles: Investigator, Data Analyst, Data Modeler]
5-year Datasets from Emory and University HealthSystem Consortium (UHC)
• EUH, EUHM and WW (inpatient encounters)
• Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data)
• Encounter location (down to unit for Emory)
• Providers (Emory only)
• Discharge disposition
• Primary and secondary ICD9 codes
• Procedure codes
• DRGs
• Medication orders (Emory only)
• Labs (Emory only)
• Vitals (Emory only)
• Geographic information (CDW only + US Census and American Community Survey)
Analytic Information Warehouse
Using Emory & UHC Data to Find Associations With 30-day Readmits
• Problem: “Raw” clinical and administrative variables are difficult to use for associative data mining
– Too many diagnosis codes, procedure codes
– Continuous variables (e.g., labs) require interpretation
– Temporal relationships between variables are implicit
• Solution: Transform the data into a much smaller set of variables using heuristic knowledge
– Categorize diagnosis and procedure codes using code hierarchies
– Classify continuous variables using standard interpretations (e.g., high, normal, low)
– Identify temporal patterns (e.g., frequency, duration, sequence)
– Apply standard data mining techniques
Analytic Information Warehouse
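Two of the transformations in the solution can be sketched in a few lines. This is an illustration under stated assumptions, not the warehouse's actual rules: the code hierarchy is approximated by rolling an ICD-9 code up to the part before its decimal point, and the potassium reference range below is an illustrative value rather than one from the slides.

```python
def icd9_category(code):
    """Roll a full ICD-9 code up to its parent category (text before the dot)."""
    return code.split(".")[0]

def classify_lab(value, low, high):
    """Map a continuous lab value onto low / normal / high."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

# Illustrative reference range (mmol/L); real ranges depend on the laboratory.
POTASSIUM_RANGE = (3.5, 5.0)

# Hypothetical encounter record.
encounter = {"dx_codes": ["250.02", "250.60", "428.0"], "potassium": 5.6}

derived = {
    "dx_categories": sorted({icd9_category(c) for c in encounter["dx_codes"]}),
    "potassium_level": classify_lab(encounter["potassium"], *POTASSIUM_RANGE),
}
```

Three diagnosis codes collapse to two categories and a numeric lab becomes a three-valued label, which is the kind of reduction that makes associative mining tractable.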
Derived Variables
• 30-day readmit
• The 9 Emory Enhanced Risk Assessment Tool diagnosis categories
• UHC product lines
• Variables derived from a combination of codes and/or laboratory test results
– Obesity
– Diabetes/uncontrolled diabetes
– End-stage renal disease (ESRD)
– Pressure ulcer
– Sickle cell disease/sickle cell crisis
• Temporal variables derived over multiple encounters
– Multiple MI
– Multiple 30-day readmissions
– Chemotherapy within 180 (or 365) days before surgery
– Previous encounter within the last 90 (or 180) days
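The temporal variables can be sketched as simple predicates over a patient's encounter history. The function names and sample dates are hypothetical. The sketch assumes the look-back window is measured from admission date to admission date; the slides do not specify whether discharge dates are used instead.

```python
from datetime import date, timedelta

def previous_encounter_within(prior_admits, index_admit, days=90):
    """True if any earlier admission falls in the look-back window."""
    window_start = index_admit - timedelta(days=days)
    return any(window_start <= d < index_admit for d in prior_admits)

def occurs_multiple_times(dx_categories, category, minimum=2):
    """True if a diagnosis category appears in at least `minimum` encounters."""
    return sum(1 for c in dx_categories if c == category) >= minimum

# Hypothetical history for one patient.
prior_admits = [date(2012, 1, 10), date(2012, 3, 1)]
index_admit = date(2012, 4, 15)

seen_within_90 = previous_encounter_within(prior_admits, index_admit, days=90)
seen_within_30 = previous_encounter_within(prior_admits, index_admit, days=30)

# ICD-9 category 410 (acute myocardial infarction) as a "Multiple MI" example.
multiple_mi = occurs_multiple_times(["410", "428", "410"], "410")
```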
30-Day Readmission Rates for Derived Variables
Emory Health Care
Geographic Analyses
UHC Medicine General Product Line (#15)
Analytic Information Warehouse
Predictive Modeling for Readmission
• Random forests (ensemble of decision trees)
– Create a decision tree using a random subset of the variables in the dataset
– Generate a large number of such trees
– All trees vote to classify each test example in a training dataset
– Generate a patient-specific readmission risk for each encounter
• Rank the encounters by risk for a subsequent 30-day readmission
Analytic Information Warehouse
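The voting scheme above can be sketched in plain Python. This is a deliberately simplified stand-in and not the model used in the study: it assumes binary (0/1) derived variables, picks one random variable subset per tree as the slide describes (standard random forests usually resample variables at every split), trains each tree on a bootstrap sample, and reports risk as the fraction of trees voting "readmit". All names and the synthetic data are hypothetical.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def build_tree(rows, labels, features, depth):
    """Return a nested dict tree, or a leaf holding the readmission rate."""
    rate = sum(labels) / len(labels)
    if depth == 0 or not features or rate in (0.0, 1.0):
        return rate
    best = None
    for f in features:
        left = [y for r, y in zip(rows, labels) if r[f] == 0]
        right = [y for r, y in zip(rows, labels) if r[f] == 1]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, f)
    if best is None:
        return rate
    f = best[1]
    rest = [g for g in features if g != f]
    split = {0: ([], []), 1: ([], [])}
    for r, y in zip(rows, labels):
        split[r[f]][0].append(r)
        split[r[f]][1].append(y)
    return {"feature": f,
            0: build_tree(split[0][0], split[0][1], rest, depth - 1),
            1: build_tree(split[1][0], split[1][1], rest, depth - 1)}

def tree_predict(tree, row):
    """Walk from the root to a leaf and return the leaf's readmission rate."""
    while isinstance(tree, dict):
        tree = tree[row[tree["feature"]]]
    return tree

def train_forest(rows, labels, n_trees=50, n_features=2, depth=3, seed=0):
    rng = random.Random(seed)
    names = sorted(rows[0])
    forest = []
    for _ in range(n_trees):
        picked = rng.sample(names, n_features)          # random variable subset
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], picked, depth))
    return forest

def readmission_risk(forest, row):
    """Fraction of trees that vote for a 30-day readmission."""
    votes = sum(1 for tree in forest if tree_predict(tree, row) >= 0.5)
    return votes / len(forest)

# Synthetic encounters: the outcome is driven entirely by one variable.
rows = [{"prior_admit_90d": a, "esrd": b, "obesity": c}
        for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 5
labels = [r["prior_admit_90d"] for r in rows]
forest = train_forest(rows, labels)
high = readmission_risk(forest, {"prior_admit_90d": 1, "esrd": 1, "obesity": 1})
low = readmission_risk(forest, {"prior_admit_90d": 0, "esrd": 1, "obesity": 1})
```

Sorting encounters by the returned risk, highest first, gives the ranking described in the final bullet.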
Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest
Predictive Modeling Applied to 180 UHC Hospitals
Readmission fraction of top 10% high risk patients
[Figure: readmission fraction (0 to 0.9) by hospital (1 to 177); series: All Hospital Model, Individual Hospital Model]
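The plotted metric, the readmission fraction among the 10% of patients ranked highest by predicted risk, can be computed as in this sketch. The function name and the sample scores are hypothetical; the exact ranking and tie-breaking used for the figure are not given in the slides.

```python
def top_fraction_readmission_rate(risks, readmitted, fraction=0.10):
    """Observed readmission rate among the highest-risk `fraction` of encounters."""
    ranked = sorted(zip(risks, readmitted), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top = ranked[:top_n]
    return sum(1 for _, outcome in top if outcome) / top_n

# Hypothetical scores for 20 encounters at one hospital.
risks = [i / 20 for i in range(20)]
readmitted = [i % 2 == 1 for i in range(20)]
rate = top_fraction_readmission_rate(risks, readmitted)
```

Computing this value once per hospital, for a model trained on all hospitals and for a model trained on that hospital alone, yields the two series compared in the figure.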
Status of Healthcare Data Analytics
• Integrative dataset analysis can leverage patient information gathered over many encounters
• Temporal analyses can generate derived variables that appear to correlate with readmissions
• Predictive modeling has promise of providing decision support
• Data Analytics arm of the Emory New Care Model Initiative led by Greg Esper
• Ongoing analyses involve characterization of clinical phenotype in GWAS, biomarker and quality improvement efforts
• Co-lead (with Bill Hersh) of CTSA CER Informatics taskforce dedicated to this issue
HIGH END AND LARGE DATA COMPUTING
Supercomputing – Collaboration with ORNL: Titan – Peak Speed 30,000,000,000,000,000 floating point operations per second (30 petaflops)!
Core Transformations for multi-scale pipelines
• Data Cleaning and Low Level Transformations
• Data Subsetting, Filtering, Subsampling
• Spatio-temporal Mapping and Registration
• Object Segmentation
• Feature Extraction, Object Classification
• Spatio-temporal Aggregation
• Change Detection, Comparison, and Quantification
Extreme DataCutter – Two Level Model
Coarse Grained Level
Node Level Work Scheduling
Fine Grained Level
VLDB 2012
Change Detection, Comparison, and Quantification
Thanks!