Date post: 02-Aug-2020
bioKepler: A Comprehensive Bioinformatics Scientific Workflow Module for Distributed Analysis of Large-Scale Biological Data

WorDS.sdsc.edu

Ilkay Altintas (1), Jianwu Wang (2), Daniel Crawl (1), Shweta Purawat (1)

(1) San Diego Supercomputer Center, UC San Diego; (2) UMBC

Page 2

A Toolbox with Many Tools

Need expertise to identify which tool to use when and how! Require computation models to schedule and optimize execution!

•  Data: search, database access, I/O operations, streaming data in real time…
•  Compute: data-parallel patterns, external execution, …
•  Network operations
•  Provenance and fault tolerance

Page 3

Workflows are Used in These Diverse Scenarios in Biological Sciences

Scenarios: Data Acquisition / Generation, Data Analysis, Data Publication / Archival

•  Sequencers
•  Sensor networks
•  Medical imaging

•  Often for data reduction
•  In real-time or offline

Many forms
•  Data-intensive
•  HPC
•  Local exploratory

•  From analysis to searchable results
•  Standardization
•  Auto generation of methods and materials

Workflows foster collaborations!
•  Flexibility and synergy
•  Optimization of resources
•  Increasing reuse
•  Standards compliance

Page 4

CAMERA Example:

Using Scientific Workflows and Related Provenance for Collaborative Metagenomics Research

Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA)

http://camera.calit2.net

Page 5

CAMERA is a Collaborative Environment

•  Data Cart: multiple available; mixed collections of CAMERA data (e.g., projects, samples)
•  User Workspace: single workspace with access to all data and results (private and shared)
•  Group Workspace: share specified User Workspace data with collaborators
•  Data Discovery: GIS and advanced query options
•  Data Analysis: workflow-based analysis

Page 6

Workflows are a Central Part of CAMERA

•  CAMERA-supported
   –  28 existing workflows
•  Workflows under development
   –  Fragment Recruitment Viewer
   –  Next Generation Sequencing
   –  VIROME Pipeline
   –  Standalone bioinformatics tools
   –  National Center for Genome Research
   –  Joint Genome Institute
•  User built
   –  Currently running in a sandbox
   –  Will be ported to a virtual cloud environment

All can be reached through the CAMERA portal at: http://portal.camera.calit2.net

•  Inputs: from local or CAMERA file systems; user-supplied parameters

•  Outputs: sharable with a group of users and links to the semantic database

[Figure: workflow categories. Labels: QC filter; Taxonomy Binning; BLAST; Assembly; Metagenomic Annotation and Clustering; Duplicate filtering; Comparison, Statistical analysis, and more.]

More than 1500 workflow submissions monthly!

Page 7

CAMERA Portal - Workflows

Page 8

CAMERA Workflows

RAMMCAP

Page 9

CAMERA Workflows

Page 10

CAMERA Workflows

Page 11

CAMERA Job Status

Page 12

CAMERA Workflow Results

Page 13

Pushing the boundaries of existing infrastructure and workflow system capabilities

Page 14

New Requirements from the User Community

•  Increase reuse
   –  best development practices by the scientific community
   –  other bio packages
•  Increase programmability by end users
   –  users with various skill levels
   –  to formulate actual domain specific workflows
•  Increase resource utilization
   –  optimize execution across available computing resources
   –  in an efficient, transparent and intuitive manner
•  Make analysis a part of the end-to-end scientific model from data generation to publication

Page 15

RAMMCAP – Rapid Clustering and Functional Annotation for Metagenomic Sequences

Annotation features:
•  tRNA prediction (tRNAscan)
•  rRNA prediction (meta_RNA, BLAST)
•  ORF call (ORF_finder, Metagene)
•  RPS-BLAST against COG etc.
•  HMMER against Pfam / Tigrfam

•  Clustering of reads
•  Multi-step clustering of ORFs
•  GO assignment
•  EC number assignment

Page 16

A number of bioinformatics tools are used in RAMMCAP

Tool            Description
BLAST           Scalable parallel database search with blastn, blastp, tblastn, blastx, tblastx
MegaBLAST       Fast database search with MegaBLAST
Diversity       Diversity analysis for viral metagenome
QC              Quality control for 454 raw reads
CD-HIT-454      Identify artificial duplicates from 454 reads
RAMMCAP         Metagenome annotation
                - rRNA, tRNA, ORF prediction
                - reads and ORF clustering
                - reads and ORF information
                - family and function annotation (Pfam, TIGRfam, COG)
                - Gene Ontology and Enzyme Classification annotation
                - Combined annotation summary
FRV             Fragment Recruitment Viewer
Assembly        Consensus-based meta-assembler for 454 reads
KEGG            Pathway annotation by searching the KEGG database with blastp
RDP binning     Taxonomy binning of rRNA sequences using RDP classifier
BLAST binning   Taxonomy binning by querying ref. rRNA DB using blastn
tRNA            Identification of tRNAs from fragments using tRNA-scan
Meta-RNA        Identification of rRNAs from fragments using HMM
BLAST-RNA       Identification of rRNAs by querying ref. rRNA DB using blastn
ORF_finder      ORF call by six reading frame translation
Metagene        ORF call by Metagene
FragGeneScan    ORF call with FragGeneScan from 454 reads
Pfam            Protein family annotation against Pfam using HMMER
TIGRfam         Protein family annotation against TIGRfam using HMMER
COG             Protein family annotation against NCBI COG using rps-blast
KOG             Protein family annotation against NCBI KOG using rps-blast
PRK             Protein family annotation against NCBI PRK using rps-blast
CD-HIT-EST      Clustering of reads
CD-HIT          Clustering of ORFs
H-CD-HIT        Multiple level clustering of ORFs into ORF family
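The tool list describes ORF_finder as calling ORFs by six reading frame translation. The Python sketch below illustrates that idea only; it is not the ORF_finder code used in RAMMCAP. The function names, the `min_codons` threshold, and the choice to report any stop-free run of codons (without requiring a start codon) are assumptions made for this example.

```python
# Illustrative sketch only; not the ORF_finder implementation.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Yield (frame_label, codons) for three forward and three reverse frames."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Only complete codons are kept; a trailing 1-2 bases are dropped.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            yield f"{strand}{offset + 1}", codons


def call_orfs(seq, min_codons=2):
    """Return (frame, dna) for every stop-free run of at least min_codons codons."""
    orfs = []
    for frame, codons in six_frames(seq.upper()):
        run = []
        for codon in codons + ["TAA"]:  # sentinel stop flushes the final run
            if codon in STOP_CODONS:
                if len(run) >= min_codons:
                    orfs.append((frame, "".join(run)))
                run = []
            else:
                run.append(codon)
    return orfs


orfs = call_orfs("ATGAAATAGCCC")
```

Each reported ORF could then be translated and passed to the protein-level steps in the list (for example the Pfam or COG searches); the translation table is left out here to keep the sketch short.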

Page 17

Original implementation of the annotation workflow in Kepler

A green box is called an 'actor', which performs a task.

This special actor represents an annotation component, such as BLAST search.

Workflow parameters, which can be specified by users in the portal, are passed to workflow components.

Data flow is divided.
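The notes on this slide (actors perform tasks, parameters are passed to components, data flow is divided) can be illustrated with a small conceptual sketch. Kepler itself is a Java system; the Python class `Actor`, the function `run_workflow`, and the three toy tasks below are hypothetical names invented for this example and do not correspond to Kepler's API.

```python
# Conceptual sketch of actor-style dataflow; all names are hypothetical.
class Actor:
    """A workflow step: takes one input token and returns one output token."""

    def __init__(self, name, task, **params):
        self.name = name
        self.task = task
        self.params = params  # workflow parameters passed to this component

    def fire(self, token):
        return self.task(token, **self.params)


def run_workflow(source, branches, token):
    """Fire the source actor, then send its output down every branch."""
    shared = source.fire(token)
    results = {}
    for branch in branches:
        value = shared  # the data flow is divided: each branch gets the output
        for actor in branch:
            value = actor.fire(value)
        results[branch[-1].name] = value
    return results


def length_filter(reads, min_len):
    return [r for r in reads if len(r) >= min_len]


def count_reads(reads):
    return len(reads)


def gc_fraction(reads):
    bases = "".join(reads)
    return sum(b in "GC" for b in bases) / len(bases) if bases else 0.0


qc = Actor("qc", length_filter, min_len=4)  # min_len plays the role of a user parameter
branches = [[Actor("count", count_reads)], [Actor("gc", gc_fraction)]]
results = run_workflow(qc, branches, ["ACGT", "GG", "GGCCAT"])
```

Here the filtered reads feed two independent branches, mirroring how one upstream step in the annotation workflow supplies several annotation components.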

Page 18

Each actor was a wrapper to a web service!

Customized web services!

Page 19

RAMMCAP

[Figure: RAMMCAP tools (QC, tRNA, cd-hit, hmmer, metagene, blast) positioned along four axes. Data size: KB, MB, GB, TB. CPU time: Second, Hour, Day, Month, Year. Memory: GB, 10GB, 100GB. Parallel: No need, No, Multi threading, MPI, Map Reduce.]

Page 20

RAMMCAP – Rapid Clustering and Functional Annotation for Metagenomic Sequences

[Figure: two panels, the second labeled NGS, each positioning the RAMMCAP tools (QC, tRNA, cd-hit, hmmer, metagene, blast) along four axes. Data size: KB, MB, GB, TB. CPU time: Minute, Hour, Day, Month, Year. Memory: GB, 10GB, 100GB. Parallel: No need, No, Multi threading, MPI, Map Reduce.]

Page 21

Other cases: RNA-seq / genomic / metagenomic

[Figure: pipeline diagram. Labels: Raw reads; Reads QC; HQ reads; Assemble (Velvet, SOAPdenovo, Abyss, Oases, Trinity); Contigs; mapping (BWA, Bowtie, BLAST); Alignments; Further analysis.]

[Figure: NGS steps (QC, mapping, assembly) positioned along four axes. Data size: KB, MB, GB, TB. CPU time: Minute, Hour, Day, Month. Memory: GB, 10GB, 100GB. Parallel: No need, No, Multi threading, MPI, Map Reduce.]

Page 22

bioKepler implementation: Using bioActors instead of wrapper actors

[Figure: workflow diagram whose components are labeled "bio".]

Wrapper Actors
•  Need implementation of underlying computational tools

bioActors
•  Reusable
•  Multiple execution modes
•  Built-in parallel execution capabilities

Page 23

Gateways and other user environments

bioKepler

Kepler and Provenance Framework

BioLinux, Galaxy, Clovr, Hadoop

CLOUD and OTHER COMPUTING RESOURCES, e.g., SGE, Amazon, FutureGrid, XSEDE

www.bioKepler.org

May 22nd, 2014, Scalable Bioinformatics Boot Camp

A coordinated ecosystem of biological and technological packages for bioinformatics!

Page 24

The bioKepler Approach

•  Parallel Computation Framework
   –  Use Distributed Data-Parallel (DDP) frameworks, e.g., MapReduce, and other parallelization methods to execute subworkflows
•  bioActors
   –  Configurable and reusable higher-order components for bioinformatics and computational biology
•  Transparent support for different execution engines and computational environments
•  Deployment on diverse environments

Page 25

Reuse, Programmability, Execution

•  Funded by NSF ABI & CI Reuse programs - Altintas (PI) and Li (Co-PI)
•  Development of a comprehensive bioinformatics scientific workflow module for distributed analysis of large-scale biological data

Big improvement on usability and programmability by end users!

www.bioKepler.org

[Figure: software stack. Labels: Galaxy, bioKepler, Kepler, Bio-Linux, CloudBioLinux.]

Kepler
•  CORE
•  Distributed Data Parallel
•  Provenance
•  Reporting
•  Run Manager
•  …

Kepler supports
•  Workflows
•  Other third party programming tools, e.g., R, Matlab, KNIME
•  Extensible task and data parallelization
•  Service orientation
•  Execution on multiple engines, e.g., SDF, SGE, Hadoop

Page 26

bioKepler’s Conceptual Framework

[Figure: conceptual framework diagram. Labels recovered from the figure, grouped by heading:]
•  Layers: Kepler, bioKepler
•  Compute: Amazon EC2, FutureGrid, Sun Grid Engine, Adhoc Network, Triton Resource
•  Data: CAMERA, Ensembl, Genbank
•  Bioinformatics Tools: Clustering, Mapping, Assembly
•  bioActors: BLAST, HMMER, CD-HIT
•  Data-Parallel Execution Patterns: Map-Reduce, Master-Slave, All-Pairs
•  Provenance: Execution History, Data Lineage
•  Reporting: PDF Generation, Report Designer
•  Fault-Tolerance: Error Handling, Alternatives
•  Run Manager: Tag, Search
•  Director, Executable Workflow Plan, Scheduler, Execution Engine
•  Deploy & Execute, Transfer, Customize & Integrate
•  Bioinformatician, Workflow

Page 27

bioActors

•  Set of steps to execute a bioinformatics tool locally or in an external environment
   –  Locally executable
   –  Parallelized external execution
•  Customizable by the user based on external packages
   –  Tools imported from CloudBioLinux
•  Tools are evaluated on their computational requirements
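The idea of one bioActor offering both local and parallelized external execution can be sketched as a small dispatcher. This is a hypothetical Python illustration, not bioKepler code: the class `BioActor`, the mode names, and the job dictionaries are all invented for the example, and the planners only describe jobs rather than launching any tool.

```python
# Hypothetical sketch: one component, several execution modes behind one interface.
class BioActor:
    def __init__(self, tool, default_mode="local"):
        self.tool = tool
        self.default_mode = default_mode
        self._modes = {}

    def register(self, mode, planner):
        """Attach a function that plans how the tool runs in the given mode."""
        self._modes[mode] = planner

    def plan(self, args, mode=None):
        """Return a list of job descriptions for the chosen (or default) mode."""
        mode = mode or self.default_mode
        if mode not in self._modes:
            raise ValueError(f"{self.tool}: no execution mode named {mode!r}")
        return self._modes[mode](self.tool, args)


def plan_local(tool, args):
    # A single job on the machine that hosts the workflow.
    return [{"tool": tool, "args": list(args), "partition": None}]


def plan_partitioned(tool, args, parts=3):
    # One job per input partition; the partition count is an arbitrary default.
    return [{"tool": tool, "args": list(args), "partition": i} for i in range(parts)]


blast = BioActor("blast")
blast.register("local", plan_local)
blast.register("data-parallel", plan_partitioned)

local_jobs = blast.plan(["query.fa"])
parallel_jobs = blast.plan(["query.fa"], mode="data-parallel")
```

Keeping the mode choice outside the tool-specific code is what lets the same workflow step be reused unchanged when the execution environment changes.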

Page 28

Transparent Execution includes Parallelization Solutions in Distributed Environments

•  Traditional parallel programming interfaces
   –  Examples: MPI and OpenMP
   –  Hard to implement
   –  Original sequential tools cannot be reused
•  Parallel job execution
   –  Examples: SGE and Condor
   –  Original sequential tools can be reused
   –  Create small jobs by splitting data or tasks
   –  Hard to achieve data locality for each job
•  Data parallel job execution
   –  Examples: Hadoop and Stratosphere
   –  Original sequential tools can be reused
   –  Support customized and automatic data partition and distribution
   –  Support data locality for each job through a special distributed file system, HDFS
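The "create small jobs by splitting data" approach, in which the original sequential tool is reused unchanged, can be sketched in a few lines of Python. Everything here is an assumption for illustration: `sequential_tool` is a stand-in that only measures sequence lengths, and a local thread pool stands in for a scheduler such as SGE or a framework such as Hadoop, so data locality is not modeled.

```python
# Sketch of split / run / join; all names are illustrative.
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Return a list of (header, sequence) records from FASTA text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records, n_parts):
    """Deal records round-robin into up to n_parts lists, keeping records whole."""
    parts = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        parts[i % n_parts].append(record)
    return [p for p in parts if p]


def sequential_tool(records):
    """Stand-in for an unmodified sequential tool: sequence length per record."""
    return {header: len(seq) for header, seq in records}


def run_split_jobs(text, n_parts=2):
    """Split the input, run the tool on each part concurrently, merge the outputs."""
    parts = split_records(parse_fasta(text), n_parts)
    merged = {}
    with ThreadPoolExecutor(max_workers=len(parts) or 1) as pool:
        for partial in pool.map(sequential_tool, parts):
            merged.update(partial)  # join step
    return merged


lengths = run_split_jobs(">a\nACGT\nAC\n>b\nGG\n>c\nTTT\n", n_parts=2)
```

The split must respect record boundaries (a FASTA record is never cut in half), which is why the data format has to tell the framework how to split and join.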

Page 29

Distributed Data-Parallel bioActors

•  Set of steps to execute a bioinformatics tool in DDP environment

•  Customized from the ExecutionChoice actor

•  Includes:
   –  Data-parallel patterns, e.g., Map, Reduce, Cross, All-Pairs, etc., to specify data grouping
   –  I/O to interface with storage
   –  Data format specifying how to split and join
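To make the data grouping behind these pattern names concrete, here is a single-process Python sketch. It shows only the commonly used meaning of each pattern (Map per item, Reduce per key, Cross over the Cartesian product of two inputs, All-Pairs as a result matrix over two sets); the function names are invented for this example, and the exact semantics in bioKepler or in a specific DDP engine may differ.

```python
# Single-process sketch of pattern semantics; not the bioKepler or Hadoop API.
from collections import defaultdict
from itertools import product


def map_pattern(fn, items):
    """Map: call fn independently on each item; fn returns (key, value) pairs."""
    return [kv for item in items for kv in fn(item)]


def reduce_pattern(fn, pairs):
    """Reduce: group values by key, then call fn once per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: fn(key, values) for key, values in groups.items()}


def cross_pattern(fn, left, right):
    """Cross: call fn on every combination of one left and one right element."""
    return [fn(a, b) for a, b in product(left, right)]


def all_pairs_pattern(fn, set_a, set_b):
    """All-Pairs: build the matrix M[i][j] = fn(set_a[i], set_b[j])."""
    return [[fn(a, b) for b in set_b] for a in set_a]


def emit_2mers(read):
    """Toy user function: emit (2-mer, 1) for each position in a read."""
    return [(read[i:i + 2], 1) for i in range(len(read) - 1)]


counts = reduce_pattern(lambda key, values: sum(values),
                        map_pattern(emit_2mers, ["ACGA", "CGA"]))
```

Because each pattern fixes how inputs are grouped, a framework can partition and distribute the data automatically while the user function stays a plain sequential routine.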

Page 30

DDP bioActor Usage Model

[Figure: usage model diagram. Labels: User: Workflow Developer; bioActor Library; A1, A2, An; DDP Blast; DDP Generic; Workflow; DDP Director; Results.]

Steps shown in the figure:
1. Search
2a. Choose Specific
2b. Choose Generic
2b. Create Sub-Workflow
3. Add to Workflow
4a. Execute
4b. Add to Larger Workflow
4c. Save in Library

Page 31

Status of bioActors

500+ bioActors are listed under the current bioKepler release; ~40 of them are parallelized.

Page 32

Example bioActors

•  Alignment: BLAST, BLAT
•  Profile-Sequence Alignment: PSI-BLAST
•  Hidden Markov Model: HMMER
•  Mapping: Bowtie, BWA, Samtools
•  Multiple Alignment: ClustalW, Muscle
•  Clustering: CD-HIT, Blastclust
•  Gene Prediction: Glimmer, Genescan, Fraggenescan
•  tRNA prediction: tRNA-scan, Meta-RNA
•  Phylogeny: FastTree, RAxML

Page 33

Example Workflows

Page 34

Current Release

•  A bioKepler VM executable on Amazon EC2, FutureGrid and SDSC Cloud
   –  Builds upon CloudBioLinux including Bio-Linux and Galaxy
•  A bioActor template that can be customized for different execution choices
   –  e.g., local vs. Map/Reduce on a specific environment
•  Example use cases

Downloadable as a package at: http://www.biokepler.org/releases

Page 35

Demo and Questions

WorDS Director: Ilkay Altintas, Ph.D.

Email: altintas@sdsc.edu

