Analyze Genomes: A Federated In-Memory Database System For Life Sciences
Dr. Matthieu-P. Schapranow HPI Future SOC Lab Day, Potsdam, Germany
Nov 4, 2015
Generously supported by
Important things first: Where do you find additional information?
■ Online: Visit we.analyzegenomes.com for latest research results, tools, and news
■ Offline: Read more about it, e.g. High-Performance In-Memory Genome Data Analysis: How In-Memory Database Technology Accelerates Personalized Medicine, In-Memory Data Management Research, Springer, ISBN: 978-3-319-03034-0, 2014
■ In Person: Join us for “Festival of Genomics” Jan 19-21, 2016 in London, UK
Schapranow/Perscheid, FSOC Lab Day, Nov 4, 2015
A Federated In-Memory Database System For Life Sciences
The Setting: Actors in Oncology
■ Patients
□ Individual anamnesis, family history, and background
□ Require fast access to individualized therapy
■ Clinicians
□ Identify root and extent of disease using laboratory tests
□ Evaluate therapy alternatives, adapt existing therapy
■ Researchers
□ Conduct laboratory work, e.g. analyze patient samples
□ Create new research findings and come up with treatment alternatives
IT Challenges: Distributed Heterogeneous Data Sources
■ Human genome/biological data: 600 GB per full genome, 15 PB+ in databases of leading institutes
■ Prescription data: 1.5B records from 10,000 doctors and 10M patients (100 GB)
■ Clinical trials: currently more than 30k recruiting on ClinicalTrials.gov
■ Human proteome: 160M data points (2.4 GB) per sample, >3 TB raw proteome data in ProteomicsDB
■ PubMed database: >24M articles
■ Hospital information systems: often more than 50 GB
■ Medical sensor data: scan of a single organ in 1 s creates 10 GB of raw data
■ Cancer patient records: >160k records at NCT
Software Requirements in Life Sciences
■ Requirements
□ Real-time data analysis
□ Maintained software
■ Restrictions
□ Data privacy
□ Data locality
□ Volume of “big medical data”
■ Solution?
□ Federated In-Memory Database System vs. Cloud Computing
Where Are All Those Clouds Going?
Gartner's 2014 Hype Cycle for Emerging Technologies
Multiple Cloud Service Providers
Schapranow, BIRTE/VLDB 2015, Aug 31, 2015
[Figure: block diagram with the areas Site A, Site B, Cloud Provider Site A, and Cloud Provider Site B. Each site shows a Local System, Local Storage, and a Local Synchronization Service; each cloud provider shows a Cloud Synchronization Service and Shared Cloud Storage.]
Federated In-Memory Database (FIMDB) Incorporating Local Compute Resources
[Figure: federated in-memory database landscape with the areas Site A, Site B, and Cloud Service Provider. Instances are labeled FIMDB A.1 to A.5, FIMDB B.1 to B.3, and FIMDB C.1. Captions: “Federated In-Memory Database Instance, Algorithms, and Applications Managed by Service Provider” and “Federated In-Memory Database Instances: Master Data Managed by Service Provider, Sensitive Data reside at Site”.]
■ Aim: Provision of managed Analyze Genomes services while sensitive data remain on site
■ Process steps
□ Connect existing resources to join federated database landscape
□ Install Workers on local nodes to process sensitive data and store results in local DB instances
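The process steps above can be pictured with a minimal Python sketch. It is not the Analyze Genomes implementation: the class `LocalWorker`, the function `federated_summary`, the table layout, and the sample records are all hypothetical, and SQLite merely stands in for a local in-memory database instance. The assumption illustrated is that raw records stay at the site and only aggregated results are shared.

```python
import sqlite3


class LocalWorker:
    """Hypothetical worker running at one site; raw records never leave it."""

    def __init__(self, site):
        self.site = site
        # SQLite in-memory database as a stand-in for the local DB instance.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE results (gene TEXT, variant_count INTEGER)")

    def process(self, sensitive_records):
        """Count variants per gene locally and store only the counts."""
        counts = {}
        for record in sensitive_records:
            counts[record["gene"]] = counts.get(record["gene"], 0) + 1
        self.db.executemany("INSERT INTO results VALUES (?, ?)", counts.items())
        self.db.commit()

    def summary(self):
        """Return aggregated results without patient identifiers."""
        rows = self.db.execute(
            "SELECT gene, SUM(variant_count) FROM results GROUP BY gene"
        )
        return dict(rows)


def federated_summary(workers):
    """Combine per-site aggregates (assumed to be the only shared data)."""
    total = {}
    for worker in workers:
        for gene, count in worker.summary().items():
            total[gene] = total.get(gene, 0) + count
    return total


# Made-up example records for two sites.
site_a = LocalWorker("Site A")
site_a.process([
    {"patient": "p1", "gene": "EGFR"},
    {"patient": "p2", "gene": "KRAS"},
    {"patient": "p3", "gene": "EGFR"},
])
site_b = LocalWorker("Site B")
site_b.process([{"patient": "p4", "gene": "EGFR"}])

print(federated_summary([site_a, site_b]))
```

Only the per-gene counts cross the site boundary in this sketch; which results may actually be shared depends on the data privacy rules of each site.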
Analyze Genomes: Real-time Analysis of Big Medical Data
[Figure: layered platform architecture; the labels recovered from the diagram are listed below]
■ Drug Response Analysis, Pathway Topology Analysis, Medical Knowledge Cockpit, Oncolyzer, Clinical Trial Recruitment, Cohort Analysis, ...
■ In-Memory Database
□ Extensions for Life Sciences: Data Exchange, App Store; Access Control, Data Protection; Fair Use; Statistical Tools; Real-time Analysis; App-spanning User Profiles
□ Combined and Linked Data: Genome Data, Cellular Pathways, Genome Metadata, Research Publications, Pipeline and Analysis Models, Drugs and Interactions
■ Indexed Sources
Use Case: Identification of Best Treatment Option for Cancer Patient
■ Patient: 48 years, female, non-smoker, smoke-free environment
■ Diagnosis: Non-Small Cell Lung Cancer (NSCLC), stage IV
1. Surgery to remove tumor
2. Tumor sample is sent to laboratory to extract DNA
3. DNA is sequenced resulting in up to 750 GB of raw data per sample
4. Processing of raw data to perform analysis
5. Identification of relevant driver mutations using international medical knowledge
6. Informed decision making
From Raw Genome Data to Analysis
■ Sequencing: Acquire digital DNA data
■ Alignment: Reconstruction of complete genome with snippets
■ Variant Calling: Identification of genetic variants
■ Data Annotation: Linking genetic variants with research findings
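To make the steps from alignment to annotation concrete, here is a toy Python sketch. It assumes sequencing has already produced short reads; the reference string, the reads, and the annotation entry are invented for illustration, and the brute-force aligner is a deliberately simple stand-in, not the method used by the platform.

```python
from collections import Counter, defaultdict


def align(read, reference):
    """Toy alignment: return the offset with the fewest mismatches."""
    best_pos, best_mismatches = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos


def call_variants(reads, reference):
    """Toy variant calling: majority base per position vs. the reference."""
    pileup = defaultdict(Counter)
    for read in reads:
        pos = align(read, reference)
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1
    variants = []
    for pos in sorted(pileup):
        base, _ = pileup[pos].most_common(1)[0]
        if base != reference[pos]:
            variants.append((pos, reference[pos], base))
    return variants


def annotate(variants, knowledge):
    """Toy annotation: look up each variant in a knowledge dictionary."""
    return [
        (pos, ref, alt, knowledge.get((pos, alt), "no known finding"))
        for pos, ref, alt in variants
    ]


# Invented example data: the sample differs from the reference at position 8.
REFERENCE = "GATTACAGGCTCAT"
READS = ["GATTACAG", "TACAGACT", "AGACTCAT"]
KNOWLEDGE = {(8, "A"): "hypothetical research finding"}

variants = call_variants(READS, REFERENCE)
print(annotate(variants, KNOWLEDGE))
# [(8, 'G', 'A', 'hypothetical research finding')]
```

Real genomes require indexed aligners and statistical variant callers; the sketch only shows how the output of each step feeds the next.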
Standardized Modeling of Genome Data Analysis Pipelines
■ Graphical modeling of analysis pipelines
□ Supports reproducible research
□ BPMN-2.0-compliant
■ Extension of modeling notation by
□ Modular structure
□ Degree of parallelization
□ Parameters/variables
■ Pipelines stored in IMDB and executed through our worker framework
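A pipeline model with the extensions listed above (modules, degree of parallelization, parameters) might be represented roughly as follows. This Python sketch is an assumption for illustration only: the classes `Step` and `Pipeline`, their fields, and the parameter names are hypothetical, and JSON is used here instead of the BPMN 2.0 XML format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    """Hypothetical pipeline module with the modeled extensions."""
    name: str
    parallelism: int = 1                      # degree of parallelization
    parameters: dict = field(default_factory=dict)
    depends_on: list = field(default_factory=list)


@dataclass
class Pipeline:
    name: str
    steps: list

    def to_json(self):
        """Serialize the model, e.g. for storage in a database column."""
        return json.dumps(asdict(self), sort_keys=True)

    def subtasks(self):
        """Expand each step into one subtask per parallel instance."""
        return [
            f"{step.name}[{i}]"
            for step in self.steps
            for i in range(step.parallelism)
        ]


# Invented example model; parameter names are placeholders.
pipeline = Pipeline(
    name="example_pipeline",
    steps=[
        Step("alignment", parallelism=2, parameters={"reference": "placeholder"}),
        Step("variant_calling", depends_on=["alignment"]),
    ],
)

print(pipeline.subtasks())
# ['alignment[0]', 'alignment[1]', 'variant_calling[0]']
```

Storing the serialized model alongside its parameters is one way to keep a run reproducible, since the exact configuration can be reloaded later.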
Execution of Genome Data Analysis Pipelines
■ Dedicated scheduler for optimized pipeline execution
□ Assigns tasks to workers
□ Recovery of pipeline status
■ Scheduler uses IMDB logs for workload estimation
■ Different scheduling algorithms available, e.g.
□ High Throughput
□ Priority First
□ User-/Group-based
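The idea of interchangeable scheduling algorithms that draw on logged runtimes can be sketched in Python as below. All names (`Task`, `estimate_runtime`, `priority_first`, `high_throughput`, `schedule`) and the runtime log are hypothetical, and "high throughput" is interpreted here as shortest-estimated-task-first, which is an assumption rather than the documented behavior.

```python
import heapq
from dataclasses import dataclass
from statistics import mean


@dataclass
class Task:
    name: str
    kind: str
    priority: int  # lower value = more urgent


def estimate_runtime(kind, log, default=1.0):
    """Estimate runtime as the mean of logged runtimes for this task kind."""
    runs = log.get(kind)
    return mean(runs) if runs else default


def priority_first(task, log):
    return task.priority


def high_throughput(task, log):
    # Assumed interpretation: run the shortest estimated tasks first.
    return estimate_runtime(task.kind, log)


def schedule(tasks, workers, policy, log):
    """Order tasks by policy, then give each to the earliest-free worker."""
    queue = [(policy(task, log), index, task) for index, task in enumerate(tasks)]
    heapq.heapify(queue)
    load = [(0.0, worker) for worker in workers]
    heapq.heapify(load)
    assignment = {worker: [] for worker in workers}
    while queue:
        _, _, task = heapq.heappop(queue)
        busy_until, worker = heapq.heappop(load)
        assignment[worker].append(task.name)
        finish = busy_until + estimate_runtime(task.kind, log)
        heapq.heappush(load, (finish, worker))
    return assignment


# Invented runtime log (seconds per past run) and task list.
LOG = {"alignment": [60, 80], "variant_calling": [20], "annotation": [5, 7]}
TASKS = [
    Task("align", "alignment", priority=2),
    Task("call", "variant_calling", priority=1),
    Task("annotate", "annotation", priority=3),
]

print(schedule(TASKS, ["w1", "w2"], priority_first, LOG))
print(schedule(TASKS, ["w1", "w2"], high_throughput, LOG))
```

Passing the policy as a function keeps the assignment loop unchanged when another strategy, such as a user- or group-based one, is plugged in.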
[Figure: scheduler architecture with the labels IMDB, Pipeline Tasks, Scheduler, Worker (four instances), Pipeline Subtasks, Events, and Data.]
Real-time Analysis of Genetic Variants
■ Genome Browser enables detailed exploration of genome loci and their associations
■ Ranks variants according to known diseases
■ Integrates latest international medical knowledge, annotations, and literature
■ Provides links back to primary data sources, e.g. EBI, NCBI, dbSNP, and UCSC
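How the ranking is computed is not stated on the slide. As one possible reading, the Python sketch below orders variants by the number of known disease associations; the function `rank_variants`, the variant identifiers, and the disease links are all made up for illustration.

```python
def rank_variants(variants, disease_links):
    """Sort variants by number of known disease links, most links first."""
    def score(variant):
        return len(disease_links.get(variant, ()))
    # sorted() is stable, so variants with equal scores keep their input order.
    return sorted(variants, key=score, reverse=True)


# Invented identifiers and associations.
VARIANTS = ["var_a", "var_b", "var_c"]
DISEASE_LINKS = {
    "var_b": ["disease_1", "disease_2"],
    "var_c": ["disease_1"],
}

print(rank_variants(VARIANTS, DISEASE_LINKS))
# ['var_b', 'var_c', 'var_a']
```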
Medical Knowledge Cockpit
■ Uses patient specifics to provide more adequate results
■ Immediate exploration of relevant information, e.g.
□ Gene descriptions
□ Molecular impact and related pathways
□ Scientific publications
□ Suitable clinical trials
■ Translates manual searching for hours or days into finding
Drug Response Analysis
■ Incorporate knowledge about historic cases to optimize treatment of current cases
■ Enables real-time exploration of Xenograft experiments
■ Configurable medical model to predict drug response
Join us for upcoming projects!
■ Global Medical Knowledge (Master’s project)
■ Detect cardiovascular diseases and evaluate treatment options (DHZB)
■ Use health insurance data to improve health care research (AOK)
■ Pharmacogenetics (Bayer)
■ Generously supported by
Interdisciplinary Design Thinking Teams
You?
What to Take Home? Test it Yourself: AnalyzeGenomes.com
■ For patients
□ Identify relevant clinical trials and medical experts
□ Become an informed patient
■ For clinicians
□ Identify pharmacokinetic correlations
□ Scan for similar patient cases, e.g. to evaluate therapy efficiency
■ For researchers
□ Enable real-time analysis of medical data, e.g. assess pathways to identify impact of detected variants
□ Combined mining in structured and unstructured data, e.g. publications, diagnosis, and EMR data
Keep in contact with us!
Hasso Plattner Institute Enterprise Platform & Integration Concepts (EPIC)
August-Bebel-Str. 88 14482 Potsdam, Germany
Dr. Matthieu-P. Schapranow Program Manager E-Health
Cindy Perscheid Research Assistant [email protected]