Bioinformatics workflow management
Thoughts and case studies from industry.
Mark Schreiber, Bioinformatics Research Investigator
WWWFG, 5-7 June 2007
2 | Bioinformatics workflow management | Mark Schreiber
Outline
Integration and workflows
Early attempts
Case studies and examples
What does the future hold?
Conclusions
Bioinformatics at NITD
Data Integration
• Ontologies, Standards, DBs
Knowledge Discovery
• Algorithms, Informatics, Machine Learning
Modelling
• Pathways, Circuits, Abstraction
Infrastructure
Support
Research
Bioinformatics at NITD
BI combines data gathering, data storage and knowledge management with analytical tools to present complex and competitive information to planners and decision makers.
Hypothesis Generation and Validation. Providing the right information at the right time.
Decision Support.
Data Sources
Heterogeneity
The most significant research is done when heterogeneous data sources can be combined in one analysis.
Data sources and how each is accessed:
• Webpages / Services: Scrapers, CGI-Bin, WS-Clients
• Flatfiles: Parsers (one per format), BioJava / BioPerl
• XML: Parser Frameworks
• Images / Video: Image analysis
• Relational DB: SQL, JDBC/ODBC, J2EE, .NET
• Instrument: API
Applications (Services)
Yet more heterogeneity
RDBMS
• Oracle, MySQL, PostgreSQL, etc.
Open Source
• Usually just a command line interface
Commercial software
• API, scripting engine, web service
Web services and Web resources
Integration is rarely seamless
Productivity vs. Innovation
Finding a balance
Development and manufacturing prioritize productivity
Research requires more innovation
Standardization increases productivity
Standardization limits innovation
• At the level it is applied
Standardization promotes innovation
• At higher levels
Workflows give a nice balance
What is a workflow?
In Bioinformatics
A data-driven procedure consisting of one or more transformation processes (nodes).
Can be represented as a directed graph.
• Direction is time – the order of transformations.
• A set of transformation rules.
A flow of data from its source to a destination (or result) via a series of merges, joins, manipulations and interconnected tools (services).
A specification designed in a Workflow Design System (modeling component) and run by a Workflow Management System (execution component).
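The directed-graph view above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the toy sequences and the `run_workflow` helper are hypothetical, and a real workflow management system adds far more than ordering.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical nodes: each receives a dict of upstream results.
def load_sequences(inputs):
    return ["ATGC", "GGCC", "ATAT"]

def gc_content(inputs):
    seqs = inputs["load_sequences"]
    return [(s.count("G") + s.count("C")) / len(s) for s in seqs]

def keep_gc_rich(inputs):
    seqs = inputs["load_sequences"]
    gc = inputs["gc_content"]
    return [s for s, g in zip(seqs, gc) if g >= 0.5]

# The specification: node name -> (function, names of upstream nodes).
WORKFLOW = {
    "load_sequences": (load_sequences, []),
    "gc_content": (gc_content, ["load_sequences"]),
    "keep_gc_rich": (keep_gc_rich, ["load_sequences", "gc_content"]),
}

def run_workflow(workflow):
    """Run every node after its upstream nodes (the graph's direction)."""
    graph = {name: set(deps) for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func({d: results[d] for d in deps})
    return results

results = run_workflow(WORKFLOW)
print(results["keep_gc_rich"])
```

Here the `WORKFLOW` dictionary plays the role of the design-time specification and `run_workflow` the role of the execution component.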
The UNIX Philosophy
Analogy to workflows
Write programs that do one thing and do it well
Write programs that work together
Write programs to handle text streams, because that is the universal interface
• Text formatted as XML
Do one thing and do it well
A workflow is made up of nodes that do one thing and do it well
• So is a Service Oriented Architecture (SOA)
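As a rough illustration of the analogy, the sketch below chains small single-purpose steps the way a shell pipe would. The step names, the `pipe` helper and the FASTA snippet are all hypothetical, chosen only to show the composition pattern.

```python
from functools import reduce

def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def min_length(n):
    """Build a step that keeps only records with at least n residues."""
    def step(records):
        return (r for r in records if len(r[1]) >= n)
    return step

def to_tab(records):
    """Format each record as 'header<TAB>length'."""
    return (f"{h}\t{len(s)}" for h, s in records)

def pipe(source, *steps):
    """Feed the output of each step into the next, like `a | b | c`."""
    return reduce(lambda data, step: step(data), steps, source)

fasta = [">seq1", "ATGC", "GG", ">seq2", "AT"]
output = list(pipe(fasta, read_fasta, min_length(4), to_tab))
print(output)
```

Each step knows nothing about its neighbours, so steps can be re-ordered or swapped without touching the others.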
An early attempt: Polymer
Unix shell scripts + BioJava objects
BioJava is a large API of Java objects that are useful for bioinformatics.
BioJava objects can be assembled into mini-programs that ‘do one thing and do it well’.
Polymer combines these mini-programs into a very simple workflow using Unix shell scripts.
• Much like Unix piping.
Unfortunately it instantiates multiple JVMs
Lacks management and logging systems
How could Polymer have been better?
Provide an execution class and allow it to execute a script.
• This would mean only one JVM is launched and could allow for threading of branches in the script.
Use Groovy script instead of Unix shell script.
• But Groovy hadn’t been invented at the time.
At the same time, workflow management systems were emerging, which made Polymer redundant.
A production example: Drug Target Identification
Rational bioinformatics prioritization
In collaboration with biologists, identify desirable characteristics of a drug target
Integrate relevant data from large datasets
Combine data and score each target based on the presence or absence of desirable characteristics
Prioritize targets based on their overall score
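The scoring and prioritization steps above might look like the following minimal sketch. The criteria names, weights and gene evidence are invented for illustration and are not taken from the published study; the weighted-sum rule is an assumption about how "presence or absence" could be combined.

```python
# Hypothetical weights for desirable characteristics (agreed with biologists).
WEIGHTS = {"essential": 3.0, "druggable_domain": 2.0, "no_human_homologue": 1.0}

# Hypothetical integrated evidence: gene -> characteristics that are present.
EVIDENCE = {
    "geneA": {"essential", "druggable_domain"},
    "geneB": {"essential", "no_human_homologue"},
    "geneC": {"druggable_domain"},
}

def score(present, weights):
    """Sum the weights of the characteristics present for one target."""
    return sum(w for name, w in weights.items() if name in present)

def prioritize(evidence, weights):
    """Return (gene, score) pairs, highest score first."""
    scored = {gene: score(present, weights) for gene, present in evidence.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

ranking = prioritize(EVIDENCE, WEIGHTS)
for gene, total in ranking:
    print(gene, total)
```

Changing a weight and re-running is cheap, which is what lets the scientist experiment with different definitions of a desirable target.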
A production example: Drug Target Identification
Rational bioinformatics prioritization
[Figure: the AssessDrugTarget process. Data and criteria shown: Homology, Essentiality, Expression, Druggable domains, Structure, Pathways, Literature, Epidemiology, Assayability, Competitive advantage, Legal position, Biological feasibility, DB. Steps shown: scientist defines desirable criteria; assign weights; produce a score for each gene; select targets for promotion to D1.]
Hasan S, Daugelat S, Rao PSS, Schreiber M (2006) Prioritizing genomic drug targets in pathogens: Application to Mycobacterium tuberculosis. PLoS Comput Biol 2(6):e61
Workflow Management System
Controlling the workflow
A WMS should provide a means to execute a workflow in a controlled way.
Ideally it will also provide:
• Logging
• Messaging
• Security and provenance management
• Scheduling and load balancing
• Exception handling
• Resource pooling (e.g. DB connections)
Much of the above is easily accessible from a JEE / .NET application server
• JBoss, Glassfish
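To make "controlled execution" concrete, here is a minimal sketch covering just two of the items above, logging and exception handling. The `run_steps` function and `NodeFailure` class are hypothetical names; an application server or a real WMS would supply these facilities rather than hand-written code.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("wms")

class NodeFailure(Exception):
    """Raised when a node fails, carrying the name of the failing step."""

def run_steps(steps, data):
    """Run named steps in order, logging each one and wrapping any failure."""
    for name, func in steps:
        log.info("starting %s", name)
        try:
            data = func(data)
        except Exception as exc:
            log.error("%s failed: %s", name, exc)
            raise NodeFailure(name) from exc
        log.info("finished %s", name)
    return data

steps = [
    ("uppercase", lambda seqs: [s.upper() for s in seqs]),
    ("drop_empty", lambda seqs: [s for s in seqs if s]),
]
result = run_steps(steps, ["atg", "", "ggc"])
print(result)
```

Because every failure is re-raised as `NodeFailure`, a caller can tell which step broke without inspecting the underlying error.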
Workflow Design System
Building the workflow
Many WMSs are also a WDS
• E.g. Taverna, Pipeline Pilot, InforSense
A GUI that allows rapid workflow development
• Increases productivity and encourages experimentation
• Drag and drop assembly of a workflow
Provides an API or scripting interface to allow the design of new nodes
A simple scripting interface would also be an alternative to using a GUI for design
Simple Data Mining Workflow
Each node has a discrete function.
Internally the processing can be complex (e.g. Decision Tree) but input and output are simple and generic.
Self-documenting.
Can be run by other users.
Annotation
Finding malaria kinases
Semi-automated annotation
Advanced annotation
Combining multiple services
Workflows become nodes
Standing on the shoulders of giants
Elements of workflows that are frequently re-used should become nodes.
Workflow re-use, object-oriented workflows
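One way to picture a workflow becoming a node is function composition, sketched below. The `make_workflow` helper and the toy nodes are hypothetical; the point is only that the packaged workflow has the same one-input, one-output shape as any single node.

```python
def make_workflow(*nodes):
    """Chain nodes into a workflow that is itself usable as a node."""
    def workflow(data):
        for node in nodes:
            data = node(data)
        return data
    return workflow

def normalize(values):
    """Scale values so the largest becomes 1.0."""
    top = max(values)
    return [v / top for v in values]

def threshold(cutoff):
    """Build a node that keeps values at or above the cutoff."""
    return lambda values: [v for v in values if v >= cutoff]

# A frequently re-used fragment, packaged once...
preprocess = make_workflow(normalize, threshold(0.5))
# ...then dropped into a larger workflow like any other node.
analysis = make_workflow(preprocess, sorted)

result = analysis([2.0, 8.0, 4.0, 6.0])
print(result)
```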
Example: From Arrays to Pathways
Using whole workflows as nodes
Process an array and find the over-represented KEGG pathways and NCBI processes.
Workflow design systems promote rapid development
Finding orthologues and paralogues using whole genome pairwise blast.
Development of the workflow took about 5 minutes.
Workflow design systems promote experimentation
Mind map data analysis
Integration Via Ontology
Workflows in bioinformatics typically do a lot of integration before and/or after analysis.
Integration is normally done using joins and filters.
• Using equality and Boolean operations.
- E.g. type = protease OR type = serine protease …
Joins and filters should be able to be evaluated using ontology.
• E.g. filtering for proteases would include all subconcepts automatically.
Data sets could be quickly mapped using custom ontologies.
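A filter that reasons over an ontology, rather than enumerating every subtype with OR, could be sketched as follows. The tiny is-a hierarchy and the record layout are hypothetical, and the lookup assumes a single-parent, acyclic ontology; real ontologies usually need a proper reasoner or graph library.

```python
# Hypothetical is-a ontology: child concept -> parent concept.
IS_A = {
    "serine protease": "protease",
    "cysteine protease": "protease",
    "protease": "hydrolase",
    "kinase": "transferase",
}

def is_a(concept, ancestor, ontology):
    """True if concept equals ancestor or descends from it (acyclic ontology assumed)."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ontology.get(concept)
    return False

def filter_by_concept(records, ancestor, ontology):
    """Keep records whose 'type' is the ancestor concept or any subconcept."""
    return [r for r in records if is_a(r["type"], ancestor, ontology)]

records = [
    {"gene": "g1", "type": "serine protease"},
    {"gene": "g2", "type": "kinase"},
    {"gene": "g3", "type": "protease"},
]
proteases = filter_by_concept(records, "protease", IS_A)
print([r["gene"] for r in proteases])
```

Adding a new subtype to the ontology changes the filter's result without anyone editing the filter expression.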
Simplifying Service Integration
Expose an API
All programs likely to be called by a workflow management system should publish a web service or expose a scripting API.
Easier to learn than a full Java or C API.
Should be based on an existing scripting language, not a new one.
• Python, Groovy, Ruby or Perl
While you are at it, expose your stack via the scripting language.
• Imagine what could be done with BLAST if the stack could be manipulated via scripting.
Web Services and Service Oriented Architecture
‘Outsourcing your processing’
Web services
• Services can reside on different servers
• Platform independent HTTP protocol
• CGI, REST, XML-RPC, SOAP
• SOAP is the easiest to generically connect to and parse
• Results are available as XML
Service Oriented Architecture
• Usually implies web services
• SOA promotes re-use and simplifies maintenance
• Bottleneck shifts from CPU time to network availability
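Because results arrive as XML, a workflow node that consumes a service usually starts with a small parsing step. The sketch below uses Python's standard `xml.etree.ElementTree`; the response document, its element names and the `parse_hits` helper are invented for illustration, since every real service defines its own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; no network call is made in this sketch.
RESPONSE = """
<hits query="P12345">
  <hit id="Q1" score="210.5"/>
  <hit id="Q2" score="87.0"/>
</hits>
"""

def parse_hits(xml_text, min_score=0.0):
    """Return (id, score) pairs for hits scoring at least min_score."""
    root = ET.fromstring(xml_text.strip())
    return [
        (hit.get("id"), float(hit.get("score")))
        for hit in root.iter("hit")
        if float(hit.get("score")) >= min_score
    ]

hits = parse_hits(RESPONSE, min_score=100.0)
print(hits)
```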
Resource Oriented Architecture
Outsourcing your data warehouse
Bioinformatics is very resource intensive
ROA simplifies maintenance and removes the need for synchronization.
Many resources are now accessible via web services in XML format
Resource Oriented Architecture
The challenges
Network latency can become a major problem
• Intelligent caching and increased network speed are a must
Requires resource discovery and cross referencing
• RDF and Ontology will play an increasingly important role
• Workflow management systems will need to understand these
Increasingly, workflows will make use of loosely coupled, interoperable resources and services.
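The simplest form of caching is to remember the answer to each remote lookup. The sketch below uses `functools.lru_cache` around a stand-in for a slow call; `fetch_remote` and the accession strings are hypothetical and no network access takes place. "Intelligent" caching would also need expiry and invalidation, which this does not attempt.

```python
from functools import lru_cache

CALLS = []  # records each simulated network round trip

def fetch_remote(accession):
    """Stand-in for a slow remote resource lookup (hypothetical)."""
    CALLS.append(accession)
    return f"record for {accession}"

@lru_cache(maxsize=1024)
def get_record(accession):
    """Return the record, going to the 'network' only on a cache miss."""
    return fetch_remote(accession)

for acc in ["P1", "P2", "P1", "P1"]:
    get_record(acc)

print(CALLS)
print(get_record.cache_info())
```

Four lookups cost only two round trips here; the repeated requests for the same accession are served from memory.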
Business Processes
From proactive to reactive
Business processes are long running, asynchronous processes
• Typically they react to events, e.g. a change in a stock price.
- ‘Push’ vs ‘Pull’ model of data access.
• Known as ‘programming in the large’
• Defined using BPEL with very heavy use of SOA and ROA
Currently, most workflows are explicitly executed, ‘short running’, synchronous processes
Bioinformatics will increasingly use business processes
• React to streaming machine data
• Continuously process literature or database updates
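The 'push' model can be illustrated with a tiny publish/subscribe sketch: the analysis is registered once and then runs whenever an update event arrives, instead of being launched explicitly. The `EventBus` class, the topic name and the payloads are hypothetical, and the sketch is synchronous and in-process, whereas real business processes are long running and asynchronous.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process publish/subscribe hub."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self._handlers[topic]:
            handler(payload)

annotated = []

def annotate_new_entry(entry):
    """React to a database update by annotating the new entry."""
    annotated.append(f"{entry['id']}: length {len(entry['seq'])}")

bus = EventBus()
bus.subscribe("database.update", annotate_new_entry)

# Events are pushed to the workflow as they happen.
bus.publish("database.update", {"id": "X1", "seq": "ATGCC"})
bus.publish("database.update", {"id": "X2", "seq": "GG"})
print(annotated)
```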
Web Service Choreography
Will it be relevant to bioinformatics?
Business processes and workflows are ‘orchestrations’
• Scope is limited to one participant
• The BP or the Workflow talks to other participants but doesn’t care how they do their job or how they are managed.
Choreography involves the management of several loosely coupled BPs
• A network of long running asynchronous BPs that react to the behavior of their peers.
• Choreography of workflows would require a standard workflow description or exposure of a workflow as a business process
[Figure: two parallel hierarchies, each level related one-to-many to the next. Web Service → BP → Choreography; Node → Workflow → ???]
Conclusions
Design and management
Workflows are created using a workflow design system and executed on a workflow management system
A well-designed workflow management system can considerably increase productivity
Promotes workflow re-use and helps organize a multi-user environment
A good design system allows rapid development of a workflow
A good design system promotes experimentation and data exploration
Conclusions
The future
Ontology will play an increasing role in data integration
• Join and Filter operations that can reason over an ontology model
Business processes and web choreography will become more relevant to bioinformatics
• ‘Live’ data favors programming ‘in the large’
• Workflows exposed as business processes
• Network speed and optimal caching are key
All of these approaches have been used before
• Used and proven in business intelligence
• Bioinformatics needs to acquaint itself with modern IT practice and stop re-inventing technology