
Post on 21-Feb-2020


#CMIMI18

Data Engineering and Imaging Informatics for Precision Oncology
Ashish Sharma, PhD
Assistant Professor, Biomedical Informatics
Emory University School of Medicine

@_AshishSharma


Disclosures

None


Cancer has been progressively redefined over the past 20 years

Global Oncology Trends 2017. Report by the QuintilesIMS Institute


Increase In Number And Complexity Of Treatment

Global Oncology Trends 2017. Report by the QuintilesIMS Institute


How do Data Science and Data Engineering enable Precision Oncology?

Data Science

Algorithms

Data Engineering


Outline

Data Engineering
- Data for AI Development
- Processing Pipelines
- Scale (Cloudy Medicine)

Going from Bench to Bedside*

Good Algorithms


Big Data is not helpful for developing algorithms if the data is not FAIR

• Findable
• Accessible
• Interoperable
• Reusable

"The FAIR Guiding Principles for Scientific Data Management and Stewardship." Scientific Data 3 (2016)


FAIR Data: The Cancer Imaging Archive

TCIA encourages and supports the cancer imaging open science community by hosting and managing Findable, Accessible, Interoperable, and Reusable (FAIR) images and associated/derived data.

Clark et al. J Digital Imaging 26.6 (2013): 1045-1057

~0.75 PB downloaded over a rolling 12-month window


TCIA is Not Just an Image Repository

• Radiology
• Digital Pathology
• Radiotherapy data
• Imaging features
• Labels, Segmentations, Features…
• Clinical data
• Links to genomic data


Hard to be FAIR

TCIA

This is where the electronic medical record gets a little complicated

Sadly, TCIA has multiple ways to store non-image data

• Often non-image data is difficult to reuse
• In some cases (e.g., NLST) it is used to create data cohorts
• Often it is difficult to conduct studies that make use of non-image data in an integrative manner


How do you build a FAIR repo: Requirements and Challenges

Clinical Data
• One uniform management strategy for all non-image (clinical) data
• Enhance data exploration, cohort identification, visual analytics

Imaging Features
• Featurebase for Radiomics and Pathomics features
• One data representation

Enhanced and automated data curation
• Non-image data, pathology data, feature sets

Enable efficient deployment and support cloud deployments
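The "one data representation" requirement can be illustrated with a small sketch. It assumes a hypothetical record shape (`FeatureRecord`) and a hypothetical helper (`select_cohort`); the field names, algorithm labels, and values are invented for illustration and do not describe the actual TCIA or PRISM schema. The point is only that once radiology and pathology features share one shape, cohort identification becomes a single query.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FeatureRecord:
    """One imaging feature in a single, modality-agnostic shape (hypothetical schema)."""
    subject_id: str    # de-identified subject identifier
    modality: str      # e.g. "radiology" or "pathology"
    feature_name: str  # e.g. "tumor_volume_mm3"
    value: float
    algorithm: str     # label of the algorithm that produced the value


def select_cohort(records: Iterable[FeatureRecord], feature_name: str,
                  min_value: float) -> List[str]:
    """Return sorted subject IDs having the named feature at or above min_value."""
    return sorted({r.subject_id for r in records
                   if r.feature_name == feature_name and r.value >= min_value})


# Made-up records, for illustration only.
records = [
    FeatureRecord("S001", "radiology", "tumor_volume_mm3", 1200.0, "seg-v1"),
    FeatureRecord("S002", "radiology", "tumor_volume_mm3", 300.0, "seg-v1"),
    FeatureRecord("S002", "pathology", "til_fraction", 0.42, "til-cnn"),
    FeatureRecord("S003", "pathology", "til_fraction", 0.05, "til-cnn"),
]

large_tumor_cohort = select_cohort(records, "tumor_volume_mm3", 1000.0)
high_til_cohort = select_cohort(records, "til_fraction", 0.4)
```

Because both modalities use the same record type, the same function serves radiomics and pathomics queries without modality-specific code.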


Platform for Imaging in Precision Medicine (PRISM)

• PRISM will evolve and containerize the TCIA technology stack to streamline its deployment and incorporate new tools for analysis and management of images and imaging features with clinical context to enrich TCIA’s datasets.

• Semantic integration of TCIA non-image data
• Tools for Pathology image data analysis and management
• Some new functionality will go into both TCIA and PRISM
• Freely available as containerized microservices and OSS


PRISM Architecture


Building upon PRISM at Emory

GOAL: Streamline access to imaging for research and quality studies

Joint effort between DBMI and Radiology

• Near-real-time replication of the PACS (ongoing)
• Extract metadata for research and quality studies (ongoing)
• Integrate with orders and reports
• Simplify access to images for research studies
• Secure storage, processing and de-identification (when required)
• Link with Data Warehouse, EMR…
• Co-located computing and storage


Imaging != Rad + RT

Hello Digital Pathology


Digital Pathology for Precision Oncology

• Image analysis and DL methods to extract features from images
• Link Rad/Path features to “omics”, outcome, biological phenomena
• Identify trillions of objects: nuclei, glands, ducts, tumor niches…
• Support queries against ensembles of features (multiple algorithms/datasets)
• Analysis of integrated spatially mapped structural/“omic” information to gain insight into cancer mechanism and to choose best intervention


● Deep learning based computational stain for staining tumor infiltrating lymphocytes (TILs)

● Computationally stained TILs correlate with pathologist eye and molecular estimates

● TIL patterns linked to tumor and immune molecular features, cancer type, and favorable outcomes

● Potentially guide treatment selection

● 4,759 subjects (TCGA) == 5,202 H&E slides; 13 cancer types

Saltz et al. Cell Reports 2018 doi.org/10.1016/j.celrep.2018.03.086


Quantitative Imaging Pathology - QuIP


Data Processing Pipelines


Challenges

Model Development, Training
• TensorFlow, Keras, PyTorch, MATLAB…
• Notebooks, IDEs…

Deployment (going to bedside)
• Data Wrangling (w/o Human in the loop)
• On-demand deployment of Algorithms
• Scalability
• Performance and Latency
• Monitoring, Testing and Reliability
• User Interfaces


Containers, Microservices and APIs

• No monoliths: think stages (preprocessing, segmentation, feature selection, classification, CNNs…)
• Stages are independent and, if possible, stateless
• Helps in scaling, deployment and redundancy

Containers (an easy way to do it)
+ Encapsulate the code and immediate dependencies
+ Easy to share, adopt, deploy
- Security implications (Docker vs. Singularity)

• Situation gets better if using K8s

Check out Grunt from Panos, Brad Erickson and the Mayo team
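The "think stages, stateless if possible" idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the speaker's code: the stage functions (`preprocess`, `segment`, `extract_features`), the dict-based message shape, and the 0.5 threshold are all hypothetical. Each stage takes an item and returns a new item without keeping state, which is what lets a stage be containerized, replicated, or swapped independently.

```python
from functools import reduce
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]


def preprocess(item: Dict) -> Dict:
    """Scale raw pixel values to [0, 1]; returns a new dict and keeps no state."""
    peak = max(item["pixels"]) or 1
    return {**item, "pixels": [p / peak for p in item["pixels"]]}


def segment(item: Dict) -> Dict:
    """Toy segmentation: mark pixels above a fixed, made-up threshold."""
    return {**item, "mask": [p > 0.5 for p in item["pixels"]]}


def extract_features(item: Dict) -> Dict:
    """Toy feature: fraction of pixels inside the mask."""
    mask = item["mask"]
    return {**item, "mask_fraction": sum(mask) / len(mask)}


def run_pipeline(stages: List[Stage], item: Dict) -> Dict:
    """Apply the stages in order; any stage can be replaced without touching the others."""
    return reduce(lambda acc, stage: stage(acc), stages, item)


result = run_pipeline([preprocess, segment, extract_features],
                      {"id": "img-1", "pixels": [0, 50, 150, 200]})
```

In a real deployment each function would sit behind its own container or API endpoint; the composition logic stays the same.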


Simple, Effective Data Processing Design

Patient Data

PROs
• Real-time Data Streams
• Easy to test and maintain
• Easier to upgrade algorithm
• Easy to build dashboards and visual analytic tools
• Secure

CONs
• Data and Processing are tightly coupled
• Hard to deploy multiple algorithms
• Reengineer similar systems for each new algorithm
• Deployments are not elastic
• No automatic failovers


Streaming Architectures

Modular design achieved by decoupling data and processing

Data is streamed into Kafka Cluster

Algorithmic pipelines subscribe to topics and process data

Enables rapid prototyping and deployment of algorithms

Preserves the scalability and reliability gains
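The decoupling described on this slide can be sketched with an in-memory stand-in for the streaming cluster. This is not Kafka and does not use a Kafka client; `InMemoryBroker`, the topic names, the message fields, and the threshold of 100 are all hypothetical. It only shows the pattern: producers publish to a topic, and any number of algorithmic pipelines subscribe without the producer knowing about them.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class InMemoryBroker:
    """Tiny stand-in for a streaming cluster: named topics with subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Dict) -> None:
        # Producers only know the topic name, never the consumers.
        for handler in list(self._subscribers[topic]):
            handler(message)


broker = InMemoryBroker()
results: List[Dict] = []
archived: List[str] = []


def flagging_pipeline(message: Dict) -> None:
    """Hypothetical algorithm: flag measurements above a made-up threshold."""
    if message["measurement"] > 100:
        broker.publish("results", {"patient": message["patient"], "flag": "high"})


# Two independent consumers of the same stream; adding a third needs no producer change.
broker.subscribe("patient-data", flagging_pipeline)
broker.subscribe("patient-data", lambda m: archived.append(m["patient"]))
broker.subscribe("results", results.append)

broker.publish("patient-data", {"patient": "P1", "measurement": 82})
broker.publish("patient-data", {"patient": "P2", "measurement": 131})
```

A new algorithm is deployed by adding one more subscriber, which is the rapid-prototyping benefit the slide refers to.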


Scale and Integrate via Cloudy Pipelines


Why Cloud First

What about Local infrastructure?

Hybrid Infrastructure?

Scalable and Affordable Computing
- On Demand Computing (lower capital expenditures)

Managed services that enable new design patterns for computing
- BigQuery/Redshift
- Serverless Computing
- Data wrangling tools, e.g. Dataflow

Lower/Different barriers to adoption
- Work with APIs, not Servers
- Local IT has to become cloud aware


Cloudy for Scalability & Redundancy

Patient Data

○ Leverage Vendor Services for Scalability and Redundancy
○ Deployed AISE on AWS Lambda and ML Engine
○ Deployment Time < 1 day
○ Improves model development by allowing one to test, during development, with real-world scale and constraints

Ver 1.0

Ver 2.0
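For readers unfamiliar with serverless deployment, the sketch below shows the general shape of a stateless scoring function written in the `(event, context)` form that AWS Lambda uses for Python handlers. It is not the AISE model or its deployment: the weights, feature names, and event layout (a JSON string under `"body"`) are assumptions made for illustration.

```python
import json
import math
from typing import Any, Dict

# Hypothetical model parameters: loaded once at import time and never mutated.
WEIGHTS = {"feature_a": 0.02, "feature_b": 0.5}
BIAS = -3.0


def score(features: Dict[str, float]) -> float:
    """Logistic score from a fixed linear model (illustrative only)."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Entry point taking (event, context); the event layout here is an assumption."""
    features = json.loads(event["body"])["features"]
    body = json.dumps({"risk": round(score(features), 4)})
    return {"statusCode": 200, "body": body}


# Local invocation with a made-up request; no cloud account is needed to test the logic.
request = {"body": json.dumps({"features": {"feature_a": 50, "feature_b": 4.0}})}
response = handler(request, None)
```

Because the handler holds no per-request state, the platform can run as many copies as the load requires, which is where the scalability and redundancy come from.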


Processing at Scale

Hint: Docker is not the silver bullet

Cloudy Pipelines can Work (e.g. Google Genomics, DNAnexus, NCI Cloud Resources, Globus Genomics…)

1. Think multi-stage pipelines not standalone executables/apps

2. Stages containerized or API endpoints

3. Workflow languages to author pipelines (CWL, WDL…)

4. Rely on orchestrators capable of running pipelines on local/cloud/hybrid

Lessons from Genomics
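The four points above can be sketched as a toy orchestrator: stages are declared with their dependencies, and a runner works out the execution order. This is a minimal illustration using the standard library's `graphlib` (Python 3.9+), not CWL or WDL; the stage names and the stage functions are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

# Each stage maps to the stages whose outputs it needs (hypothetical names).
dependencies: Dict[str, List[str]] = {
    "load": [],
    "normalize": ["load"],
    "segment": ["normalize"],
    "features": ["segment", "load"],
}

# Toy stage implementations; in practice each would be a container or an API call.
stages: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    "load": lambda inputs: [3, 1, 2],
    "normalize": lambda inputs: [x / max(inputs["load"]) for x in inputs["load"]],
    "segment": lambda inputs: [x > 0.5 for x in inputs["normalize"]],
    "features": lambda inputs: {"count": sum(inputs["segment"]),
                                "n": len(inputs["load"])},
}


def run_workflow(deps: Dict[str, List[str]],
                 impl: Dict[str, Callable[[Dict[str, Any]], Any]]
                 ) -> Tuple[List[str], Dict[str, Any]]:
    """Run every stage after its dependencies and pass it only their outputs."""
    outputs: Dict[str, Any] = {}
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        outputs[name] = impl[name]({dep: outputs[dep] for dep in deps[name]})
    return order, outputs


order, outputs = run_workflow(dependencies, stages)
```

A workflow language plays the role of the `dependencies` table here, and a production orchestrator replaces `run_workflow` with scheduling across local, cloud, or hybrid resources.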


Processing at Scale

Reproducibility
• Share tools (via Code or Containers): GitHub + Docker Hub + Dockstore…
• Share and Publish Models: TensorFlow Hub; ModelHub.AI; …

Scale
• Early stages: Docker Swarm; Kubernetes etc.
• Technically getting there but hard to adopt
• Serverless: AWS Lambda; Google Cloud Functions; pywren

Integration w/ EMR
• FHIR, DICOMweb

Examples are for illustrative purposes and not endorsements
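As one concrete view of what DICOMweb integration looks like, the sketch below builds a QIDO-RS "search for studies" URL, which has the form `{base}/studies?{query}`. The helper name `qido_studies_url`, the server address, and the patient identifier are hypothetical; only the URL is constructed, and no request is sent.

```python
from urllib.parse import urlencode


def qido_studies_url(base_url: str, **filters: str) -> str:
    """Build a QIDO-RS study search URL from a base URL and query filters."""
    query = urlencode(filters)
    return f"{base_url.rstrip('/')}/studies" + (f"?{query}" if query else "")


# Hypothetical server and identifier, for illustration only.
url = qido_studies_url("https://dicomweb.example.org/dicom-web",
                       PatientID="ANON-001", ModalitiesInStudy="CT", limit="10")
```

A client would typically issue a GET on this URL with an `Accept: application/dicom+json` header and receive matching studies as JSON, which is what makes images addressable from the same tooling used for other web APIs.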


National Cancer Data Ecosystem Recommendations

Warren Kibbe “Data: Where Precision Oncology and Learning Health Meet”. SAMSI Workshop on Precision Medicine, August 16, 2018


https://gdc.cancer.gov

NCI Cancer Research Data Commons

https://cbiit.cancer.gov/ncip/cancer-data-commons


Final Words

• Need Big, FAIR Data that is representative of the population

• Develop algorithms but think about deployment and scale

• Partnerships and Teams of Techies and MDs; Academic and Industry

TEAM SCIENCE AT ITS BEST

Cloud computing, HPC and AI can have a transformative effect on medicine


Acknowledgements

● Emory DBMI Engineers, PostDocs and Students

● Fred Prior, PhD
  TCIA Team, Dept. of Biomedical Informatics
  Univ. of Arkansas for Medical Sciences

● Joel Saltz, MD PhD
  Dept. of Biomedical Informatics
  Stony Brook University

U01CA187013-06
UG3CA225021-01
14X138
U24CA215109-02
U24CA180924-05