Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines
Cindy Perscheid
Festival of Genomics
London, Jan 19, 2016
Perscheid, Schapranow
Recap Analyze Genomes: Real-time Analysis of Big Medical Data
[Figure: Analyze Genomes overview diagram. Labels: In-Memory Database; Extensions for Life Sciences; Data Exchange, App Store; Access Control, Data Protection; Fair Use; Statistical Tools; Real-time Analysis; App-spanning User Profiles; Combined and Linked Data; Genome Data; Cellular Pathways; Genome Metadata; Research Publications; Pipeline and Analysis Models; Drugs and Interactions; Drug Response Analysis; Pathway Topology Analysis; Medical Knowledge Cockpit; Oncolyzer; Clinical Trial Recruitment; Cohort Analysis; ...; Indexed Sources]
From Raw Genome Data to Analysis Results
■ Sequencing: Acquire digital DNA data
■ Alignment: Reconstruction of the complete genome from snippets
■ Variant Calling: Identification of genetic variants
■ Data Annotation: Linking genetic variants with research findings
Genome Data Processing Pipelines: State of the Art
■ Not standardized
■ Not exchangeable
■ Concatenation of bash scripts reading from and writing to files
■ Requires IT expertise for
□ Setup
□ Error handling, and
□ Efficient processing and parallelization
■ Objective: Model, configure, and execute pipelines without involving IT experts
bwa aln ref.fa sample.fastq | bwa samse ref.fa - sample.fastq | samtools view -Su - | samtools sort …
Agenda
■ Modeling Genome Data Processing Pipelines
■ Pipeline Execution in the Worker Framework
Modeling Genome Data Processing Pipelines: BPMN 2.0 Example
[Figure: example pipeline model in BPMN 2.0. Annotated elements: Start Event; End Event; Annotated Data Object; Parallel Gateway; Collapsed Subtask; Task; Multiple Task Instances Executed in Parallel]
Modeling Genome Data Processing Pipelines: BPMN 2.0 Extensions
■ Graphical modeling notation
■ Compliant with BPMN 2.0, extended by
□ Modular structure
□ Degree of parallelization
□ Parameters and variables
■ Model descriptions (XPDL) are stored in the in-memory database (IMDB)
■ Model instances are transformed into a graph structure that is executed by our worker framework
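The transformation of a model instance into an executable graph could look roughly like the sketch below. It is an illustration only, not the worker framework's code: the step names, the `PIPELINE` dictionary, and the `execution_waves` helper are assumptions made for this example. It uses Python's standard `graphlib` module (Python 3.9+) to group steps into waves. Steps in the same wave do not depend on each other and could run in parallel, as after a parallel gateway.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
# Two alignment instances can run in parallel after the split.
PIPELINE = {
    "split": set(),
    "align_chunk_1": {"split"},
    "align_chunk_2": {"split"},
    "merge": {"align_chunk_1", "align_chunk_2"},
    "variant_calling": {"merge"},
    "annotation": {"variant_calling"},
}


def execution_waves(graph):
    """Return lists of steps; all steps within one list may run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the model contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(PIPELINE), start=1):
        print(number, wave)
```

Representing the model as "step -> prerequisites" keeps the degree of parallelization implicit in the graph: a wave with two entries corresponds to two task instances that may be handed to different workers.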
Modeling Genome Data Processing Pipelines: From Model to Execution
1. Design time (researcher, process expert)
□ Definition of parameterized process model
□ Uses graphical editor and available jobs
2. Configuration time (researcher, lab assistant)
□ Select model and specify parameters
□ Results in model instance stored in database
3. Execution time (researcher)
□ Select model instance
□ Specify execution parameters, e.g. input files
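The three phases can be sketched as three small data-handling steps. This is a hedged illustration: the class names `ProcessModel` and `ModelInstance`, the functions `configure` and `execute`, and the parameter names are all hypothetical and chosen for this example. Nothing is actually run; `execute` only returns task descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessModel:
    """Design time: a parameterized model with named, still unset parameters."""
    name: str
    jobs: tuple
    parameters: tuple


@dataclass(frozen=True)
class ModelInstance:
    """Configuration time: a model together with concrete parameter values."""
    model: ProcessModel
    values: dict


def configure(model, **values):
    """Create a model instance; every model parameter must be given exactly."""
    missing = set(model.parameters) - set(values)
    unknown = set(values) - set(model.parameters)
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)}, unknown={sorted(unknown)}")
    return ModelInstance(model, dict(values))


def execute(instance, input_files):
    """Execution time: combine an instance with execution parameters.

    Returns one task description per input file instead of running anything.
    """
    return [
        {"pipeline": instance.model.name, "input": path, **instance.values}
        for path in input_files
    ]


if __name__ == "__main__":
    model = ProcessModel("alignment", ("split", "align", "merge"), ("aligner", "threads"))
    instance = configure(model, aligner="bwa", threads=8)
    print(execute(instance, ["xyz.fastq", "abc.fastq"]))
```

Separating the instance from the execution parameters means a lab assistant can fix the settings once, and a researcher can reuse the same instance for many input files.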
Modeling Genome Data Processing Pipelines: Traditional vs. Optimized Approach
■ Results are imported into IMDB
■ Optimization reduced execution time by >50%
Agenda
■ Modeling of Genome Data Processing Pipelines
■ Worker Framework
Pipeline Execution: Overview
[Figure components: In-Memory Database, Scheduler, Worker, Worker, Webservice, ...]

Tasks
ID  Pipeline  Params
12  BWA       xyz.fastq
13  BWAmem    abc.fastq
14  Bowtie2   xyz.fastq

Subtasks
Task  ID  Job     Status  Params
12    97  Split   done    xyz.fastq
12    98  Import  todo    abc.vcf
12    98  Import  done    abc.vcf

1. Trigger task execution
2. Schedule subtasks
3. Execute subtasks
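The three steps above can be sketched against a pair of database tables. The sketch assumes a lot: Python's built-in `sqlite3` in-memory database stands in for the IMDB, the schema is guessed from the column names on the slide, and the function names and the job list are hypothetical. A real worker would run the job rather than just flip the status.

```python
import sqlite3


def create_schema(db):
    """Create simplified 'tasks' and 'subtasks' tables (assumed schema)."""
    db.executescript(
        """
        CREATE TABLE tasks (id INTEGER PRIMARY KEY, pipeline TEXT, params TEXT);
        CREATE TABLE subtasks (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            task INTEGER REFERENCES tasks(id),
            job TEXT, status TEXT, params TEXT
        );
        """
    )


def trigger_task(db, task_id, pipeline, params):
    """Step 1: a client (e.g. the web service) registers a task."""
    db.execute("INSERT INTO tasks VALUES (?, ?, ?)", (task_id, pipeline, params))


def schedule_subtasks(db, task_id, jobs):
    """Step 2: the scheduler breaks a task into subtasks marked 'todo'."""
    (params,) = db.execute(
        "SELECT params FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()
    db.executemany(
        "INSERT INTO subtasks (task, job, status, params) VALUES (?, ?, 'todo', ?)",
        [(task_id, job, params) for job in jobs],
    )


def execute_next_subtask(db):
    """Step 3: a worker picks the oldest 'todo' subtask and marks it 'done'."""
    row = db.execute(
        "SELECT id, job FROM subtasks WHERE status = 'todo' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    subtask_id, job = row
    # A real worker would run the job here; this sketch only updates the status.
    db.execute("UPDATE subtasks SET status = 'done' WHERE id = ?", (subtask_id,))
    return job


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    create_schema(db)
    trigger_task(db, 12, "BWA", "xyz.fastq")
    schedule_subtasks(db, 12, ["Split", "Import"])
    print(execute_next_subtask(db))
    print(db.execute("SELECT task, id, job, status FROM subtasks").fetchall())
```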
Pipeline Execution: Software Components and Communication
[Figure: Site I and Site II, connected via VPN. Labels per site: Shared File System; Nodes, each with Worker, Worker, Worker and IMDB. Further labels: Scheduler; Transmitter; UDP; TCP]
Pipeline Execution: Runtime Layer - Worker
■ Workers execute jobs one by one
■ Subtask execution status in IMDB:
□ Ready (0),
□ In Progress (1),
□ Done (2), or
□ Erroneous (3).
■ Jobs implemented as Python modules/classes
□ Can contain arbitrary code
□ Have access to IMDB
□ Can read/write to shared working directory
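A worker that executes jobs one by one and records the four status codes might be sketched as follows. Only the status values (0 to 3) come from the slide. The `Job` base class, the two toy jobs, the dictionary layout of a subtask, and the `work` function are assumptions for illustration; the sketch keeps state in memory instead of the IMDB.

```python
from enum import IntEnum


class Status(IntEnum):
    """Subtask status codes as listed on the slide."""
    READY = 0
    IN_PROGRESS = 1
    DONE = 2
    ERRONEOUS = 3


class Job:
    """Hypothetical base class; concrete jobs override run()."""

    def run(self, params):
        raise NotImplementedError


class CountReads(Job):
    """Toy job: count records in FASTQ text (four lines per read)."""

    def run(self, params):
        lines = params["fastq_text"].splitlines()
        return len(lines) // 4


class FailingJob(Job):
    """Toy job that always fails, to show the error path."""

    def run(self, params):
        raise RuntimeError("simulated failure")


def work(subtasks):
    """Process READY subtasks one by one, updating their status in place."""
    for subtask in subtasks:
        if subtask["status"] != Status.READY:
            continue
        subtask["status"] = Status.IN_PROGRESS
        try:
            subtask["result"] = subtask["job"].run(subtask["params"])
            subtask["status"] = Status.DONE
        except Exception as error:  # any job failure marks the subtask erroneous
            subtask["error"] = str(error)
            subtask["status"] = Status.ERRONEOUS
    return subtasks
```

Because a job is just a class with a `run` method, it can contain arbitrary code, which matches the slide's point; catching the exception in the worker rather than in the job keeps error handling out of the job author's hands.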
Pipeline Execution: Coordination Layer - Scheduler
[Figure: Scheduler with Scheduling Algorithm, Pipeline Executor, and Resource Allocator; subtasks labeled i, k, m, ...]
Pipeline Execution: Scheduling Algorithms
■ Scheduling algorithms are plug-in software modules
□ “User-/Group-based” to let users execute their tasks on their local site only
□ “Priority First” to prefer important users
□ “High Throughput”, i.e. “shortest task first”, to deal with high load
■ Scheduling algorithms can also be composed hierarchically
[Figure: example of hierarchically composed scheduling algorithms. Labels: Group-based (Site I, Site II); Priority-based (Prio A, Prio *); High-throughput]
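One plausible way to make scheduling algorithms pluggable and composable is to treat each one as a function over a list of subtasks. The sketch below is an assumption-laden illustration, not the framework's interface: the subtask fields `site`, `priority`, and `estimated_seconds`, and the exact nesting of the three policies, are made up for this example.

```python
def high_throughput(subtasks):
    """'Shortest task first': order by estimated runtime."""
    return sorted(subtasks, key=lambda s: s["estimated_seconds"])


def priority_first(child):
    """Order by priority (higher first); apply `child` within each priority."""
    def policy(subtasks):
        ordered = []
        for priority in sorted({s["priority"] for s in subtasks}, reverse=True):
            ordered += child([s for s in subtasks if s["priority"] == priority])
        return ordered
    return policy


def group_based(policy_per_site):
    """Keep subtasks on their own site; each site uses its own child policy.

    Returns a dict mapping each site to its ordered queue.
    """
    def policy(subtasks):
        return {
            site: site_policy([s for s in subtasks if s["site"] == site])
            for site, site_policy in policy_per_site.items()
        }
    return policy


# Hierarchical composition: group-based at the top, different policies per site.
scheduler = group_based({
    "Site I": priority_first(high_throughput),
    "Site II": high_throughput,
})

if __name__ == "__main__":
    demo = [
        {"id": 1, "site": "Site I", "priority": 1, "estimated_seconds": 50},
        {"id": 2, "site": "Site I", "priority": 5, "estimated_seconds": 900},
        {"id": 3, "site": "Site II", "priority": 1, "estimated_seconds": 400},
    ]
    for site, queue in scheduler(demo).items():
        print(site, [s["id"] for s in queue])
```

Writing policies as higher-order functions means a new algorithm can be added without touching the existing ones, which is the property the "plug-in" bullet describes.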
Pipeline Execution: Resource Allocator
■ Maintains lists of running and idle nodes
■ An idle worker requests a new subtask for its assigned groups
■ If there is no matching subtask, it sleeps until a new subtask becomes ready
[Figure: Site 1 and Site 2 with Nodes, each with Worker, Worker, Worker and IMDB; worker states: Working, Working, Waiting]
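The request-and-sleep behavior can be sketched with a condition variable. This is a simplified, single-process illustration under stated assumptions: the class `ResourceAllocator` and its methods are hypothetical, it tracks idle and running workers rather than nodes, and it uses `threading.Condition` where the real system communicates across machines.

```python
import threading
from collections import deque


class ResourceAllocator:
    """Hands ready subtasks to idle workers, matched by group."""

    def __init__(self):
        self._ready = deque()  # ready subtasks as (group, name) pairs
        self._condition = threading.Condition()
        self.idle = set()      # workers currently waiting for a subtask
        self.running = set()   # workers currently holding a subtask

    def submit(self, group, name):
        """Mark a subtask as ready and wake up sleeping workers."""
        with self._condition:
            self._ready.append((group, name))
            self._condition.notify_all()

    def request(self, worker, groups, timeout=None):
        """Sleep until a subtask for one of `groups` is ready, or time out."""
        with self._condition:
            self.running.discard(worker)
            self.idle.add(worker)
            found = self._condition.wait_for(
                lambda: any(group in groups for group, _ in self._ready), timeout
            )
            if not found:
                return None  # timed out; the worker stays idle
            subtask = next(s for s in self._ready if s[0] in groups)
            self._ready.remove(subtask)
            self.idle.discard(worker)
            self.running.add(worker)
            return subtask
```

A worker would call `request` in a loop; subtasks for groups it is not assigned to never wake it into useful work, so they remain queued for a matching worker.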
Pipeline Execution: Flexibility
[Figure labels: Nodes with Worker, Worker, Worker and IMDB; Scheduler; File System Share; UDP; TCP]
Pipeline Execution: Recoverability
■ All execution data is stored in IMDB
■ Temporary files on a shared file system
■ In case of any failure, the system-wide state can be restored
[Figure labels: IMDB; Pipeline Tasks; Pipeline Subtasks; Events; Data; Scheduler; Worker (four times)]
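Because all execution state lives in the database, a restart only needs to inspect the subtask table. The sketch below shows one possible recovery rule, which is an assumption and not taken from the slides: subtasks left "In Progress" by a crash are reset to "Ready" so they are scheduled again. `sqlite3` again stands in for the IMDB, and the table layout and the `recover` function are hypothetical.

```python
import sqlite3

# Status codes as listed on the worker slide.
READY, IN_PROGRESS, DONE, ERRONEOUS = 0, 1, 2, 3


def recover(db):
    """Reset subtasks that were interrupted mid-run so they get rescheduled.

    Returns the ids of the subtasks that were reset.
    """
    interrupted = [
        row[0]
        for row in db.execute(
            "SELECT id FROM subtasks WHERE status = ? ORDER BY id", (IN_PROGRESS,)
        )
    ]
    db.execute(
        "UPDATE subtasks SET status = ? WHERE status = ?", (READY, IN_PROGRESS)
    )
    db.commit()
    return interrupted


def demo():
    """Simulate a crash while subtask 98 was running, then recover."""
    db = sqlite3.connect(":memory:")  # stand-in for the IMDB
    db.execute(
        "CREATE TABLE subtasks (id INTEGER PRIMARY KEY, job TEXT, status INTEGER)"
    )
    db.executemany(
        "INSERT INTO subtasks VALUES (?, ?, ?)",
        [(97, "Split", DONE), (98, "Align", IN_PROGRESS), (99, "Import", READY)],
    )
    reset = recover(db)
    return reset, db.execute("SELECT id, status FROM subtasks ORDER BY id").fetchall()


if __name__ == "__main__":
    print(demo())
```

Rerunning an interrupted subtask is only safe if jobs tolerate being executed twice, for example by overwriting their temporary files on the shared file system.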
Thanks!
Hasso Plattner Institute Enterprise Platform & Integration Concepts
August-Bebel-Str. 88 14482 Potsdam, Germany
Dr. Matthieu-P. Schapranow schapranow@hpi.de
http://we.analyzegenomes.com/
Cindy Perscheid, M. Sc. cindy.perscheid@hpi.de