Workflows, Semantics & future eScience
Integrative Bioinformatics Workshop, Tom Oinn – [email protected],
6th September 2006
Workflows
Data driven workflow system
– Graph of operations (nodes) and data transfer (edges)
– Operations are services, databases, command line tools, scripts…
– Workflow engine software (enactor) responsible for coordination of operations
– Enactor is data agnostic, apart from collection structures (for Taverna)
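A minimal sketch of the data driven model described on this slide, not Taverna's actual enactor. The function name `enact`, the edge format and the two example operations are assumptions made for illustration. The enactor only forwards values along edges and never inspects them; collection handling is not modelled.

```python
from collections import deque


def enact(operations, edges, inputs):
    """Run each operation once all of its upstream data has arrived.

    operations: name -> callable taking upstream values as keyword arguments
    edges: list of (source, target, port); a source is an operation name
           or a key of inputs
    inputs: name -> workflow input value
    """
    values = dict(inputs)                      # data produced so far
    pending = {name: {} for name in operations}
    needed = {name: 0 for name in operations}
    for _source, target, _port in edges:
        needed[target] += 1
    ready = deque(inputs)                      # sources whose data can be forwarded
    while ready:
        source = ready.popleft()
        for src, target, port in edges:
            if src != source:
                continue
            pending[target][port] = values[source]
            if len(pending[target]) == needed[target]:
                # All inputs present: invoke the operation, treating data as opaque.
                values[target] = operations[target](**pending[target])
                ready.append(target)
    return values


# Hypothetical two-step workflow: fetch a sequence, then measure its length.
operations = {
    "fetch_sequence": lambda identifier: "sequence of " + identifier,
    "count_chars": lambda text: len(text),
}
edges = [
    ("protein_id", "fetch_sequence", "identifier"),
    ("fetch_sequence", "count_chars", "text"),
]
results = enact(operations, edges, {"protein_id": "P12345"})
```

Execution order falls out of data availability alone, which is the sense in which the workflow is data driven.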
Taverna 1.4
Rewritten generic web service client
Rewritten BioMart client
Provenance capture system
Performance and usability enhancements
Groundwork for new architecture and build system
Web Service Support
Enhanced support for document / literal style services
– e.g. NCBI eUtils services
More robust invocation
– Copes better with various broken service types
Support for wsdl:documentation tags
– Now shows free text service docs
– Not ideal but it’s all web services give us
BioMart Support
UI changed to reflect current website
– Ease ‘techno-shock’ to users
Supports all Mart features
– Data set linking and federation
Uses Mart service
– Connects over HTTP reducing firewall issues
– Mart providers no longer need to open JDBC access ports; fewer open ports means better security for service providers
Provenance Capture & Browse
Observes events from the workflow engine
Populates a triple store with information from these events
Presents a simple browse interface over this metadata
– Replicates Taverna’s existing result and status browser
Allows for more complex query interfaces in the future
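The observe-then-populate pattern on this slide could look roughly like the sketch below. The event dictionaries, the predicate names (`invoked`, `produced_by`, `output_port`, `failed_at`) and both classes are hypothetical; the real Taverna event model and triple store are not shown in the slides.

```python
class TripleStore:
    """Tiny in-memory stand-in for a triple store."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples in sorted order; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if subject in (None, t[0])
            and predicate in (None, t[1])
            and obj in (None, t[2])
        )


class ProvenanceObserver:
    """Receives workflow engine events and records them as triples."""

    def __init__(self, store):
        self.store = store

    def on_event(self, event):
        run = event["run"]
        if event["kind"] == "invocation_completed":
            self.store.add(run, "invoked", event["processor"])
            for port, data_id in event["outputs"].items():
                self.store.add(data_id, "produced_by", event["processor"])
                self.store.add(data_id, "output_port", port)
        elif event["kind"] == "invocation_failed":
            self.store.add(run, "failed_at", event["processor"])


store = TripleStore()
observer = ProvenanceObserver(store)
observer.on_event({
    "kind": "invocation_completed",
    "run": "run1",
    "processor": "Fetch Sequence",
    "outputs": {"sequence": "data1"},
})
observer.on_event({
    "kind": "invocation_failed",
    "run": "run1",
    "processor": "InterproScan",
})
```

A browse interface then reduces to wildcard queries such as `store.match(subject="data1")`, and richer query interfaces can be layered on the same triples later.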
Taverna is now part of OMII-UK
Taverna 1.4 production target : Sept 2006
– Packaging, Installation, Deployment, Maintenance, Testing
– GridSAM, GRIMOIRES, BioMOBY registry integration
– Semantic content for registry
– Integration of discovery and metadata management
– Security AA for KAVE data and metadata management
Taverna 2.0 : Spring 2007
– Redevelopment of the plug in and enactor framework, improved iteration events, data management
Close collaboration with pioneers
Incremental rollouts to early adopters
[Diagram: release and engineering cycle. Labels: Ingest; Pioneers, Early adopters, Conservatives; myGrid Pre-release, myGrid Release, OMII-UK Release; Software Engineering (XP); OMII Software Engineering; Quality & Test; Evaluation; Prioritise & Plan; Production; Applications & Professional Services; myGrid Alliance; Source-forge community]
Evolving challenges
Long running data intensive workflows
Manipulation of confidential or otherwise protected information
Use with classical grid systems
Interaction with users during workflows
Workflow authoring, service discovery and composition
Fine grained runtime updates
Data comprehension, provenance and visualization – the rest of this talk!
[Diagram: stages plotted against two axes, ‘Increasing Automation’ and ‘Better Semantics and Understanding’]
Manual use of tools, web pages
Scripted tool invocation
Naïve workflow systems
Basic ‘discovery’ style service annotations
Guided workflow construction
Workflow design with annotation overlays
Automated hypothesis generation (really!)
Knowledge driven visualization
Hypothesis validation
‘Data playground’ exploratory tool
And now, the future…
Service Annotations
Immediate problem – too many services!
– At workflow construction time users cannot isolate the services they need
Multiple levels of annotation
– Interface and syntactic definitions, e.g. WSDL
– Free text descriptions
– Semantic annotation of operations
Service annotations
[Diagram: the same ‘Increasing Automation’ / ‘Better Semantics and Understanding’ chart, overlaid with example annotations for one service. Side labels: ‘Natural language’ and ‘Requires type ontology or ontologies!’]
‘myalignscript.pl’
‘A tool to compare multiple protein structures’
performs_task : alignment
input_type{seq_a} : sequence … output_type{score} : d_value
output{score} is_distance_between pair {input{sequence a}, input{sequence b}}
Also needs workflow level annotation!
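To make the annotation levels concrete, here is a rough sketch of how a registry entry might carry them and how they could narrow a service search at workflow construction time. The `ServiceAnnotation` class, the `find_services` function and the second registry entry (`fetch_sequence`) are hypothetical; only the `myalignscript.pl` values come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ServiceAnnotation:
    name: str                                   # interface level
    description: str = ""                       # free text level
    performs_task: Optional[str] = None         # semantic level: task
    input_types: Dict[str, str] = field(default_factory=dict)   # port -> type
    output_types: Dict[str, str] = field(default_factory=dict)  # port -> type


def find_services(registry, task=None, consumes=None, produces=None):
    """Return names of services matching every criterion that was given."""
    found = []
    for service in registry:
        if task is not None and service.performs_task != task:
            continue
        if consumes is not None and consumes not in service.input_types.values():
            continue
        if produces is not None and produces not in service.output_types.values():
            continue
        found.append(service.name)
    return found


registry = [
    ServiceAnnotation(
        name="myalignscript.pl",
        description="A tool to compare multiple protein structures",
        performs_task="alignment",
        input_types={"seq_a": "sequence"},
        output_types={"score": "d_value"},
    ),
    # Hypothetical second service, added only so the filter has a choice.
    ServiceAnnotation(
        name="fetch_sequence",
        performs_task="retrieval",
        input_types={"identifier": "protein_identifier"},
        output_types={"seq": "sequence"},
    ),
]
alignment_services = find_services(registry, task="alignment")
sequence_producers = find_services(registry, produces="sequence")
```

Plain string comparison stands in for the type ontology the slide calls for; with an ontology, `consumes` and `produces` would also match subtypes.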
Building the semantic network
Workflow engine uses service annotations to annotate the results of invocations of those services.
For example :
[Diagram: example workflow starting from an ID, with services Fetch Structure, Fetch Sequence, InterproScan, GetGO (cellular location) and ExtractMotifRanges]
No service annotations
[Diagram: graph of 12 ID nodes. Labels: Fetch Structure, Fetch Sequence, InterproScan, GetMotifRanges, GetGO]
Input / Output type annotation
[Diagram: the same graph of 12 ID nodes and service labels, with type labels added: protein_identifier, 3d_structure, protein_sequence, ipro_identifier, go_term, range_set]
Full static semantics
[Diagram: the same typed graph of 12 ID nodes, with the service labels replaced by relations: has_structure, has_sequence, has_ipro_hit, contains_domain, has_go]
Dynamic semantics
[Diagram: reduced graph of 7 ID nodes with the static types and relations, plus has_evidence, predicts_location and location_prediction]
(nodes omitted to prevent further insanity)
Driven by workflow level annotation
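A rough sketch of the static part of this process: the engine looks up the annotation of the service it has just invoked and uses it to type the data items and relate them. The relation and output type for each service follow the slide labels; the input types, the `has_type` predicate and the function name are assumptions. The dynamic, workflow level annotations are not modelled.

```python
# service -> (input type, relation, output type); input types are assumed.
ANNOTATIONS = {
    "Fetch Structure": ("protein_identifier", "has_structure", "3d_structure"),
    "Fetch Sequence": ("protein_identifier", "has_sequence", "protein_sequence"),
    "InterproScan": ("protein_sequence", "has_ipro_hit", "ipro_identifier"),
    "GetMotifRanges": ("ipro_identifier", "contains_domain", "range_set"),
    "GetGO": ("ipro_identifier", "has_go", "go_term"),
}


def record_invocation(network, service, input_id, output_ids):
    """Annotate one service invocation's data items using the service annotation."""
    in_type, relation, out_type = ANNOTATIONS[service]
    network.add((input_id, "has_type", in_type))
    for output_id in output_ids:
        network.add((output_id, "has_type", out_type))
        network.add((input_id, relation, output_id))


network = set()
record_invocation(network, "Fetch Sequence", "id1", ["id2"])
record_invocation(network, "InterproScan", "id2", ["id3", "id4"])
```

Without `ANNOTATIONS` the engine could only record that `id2` came out of `Fetch Sequence`; with it, the same invocation yields typed nodes and a meaningful edge.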
Visualization
Naïve rendering of the graph isn’t good enough
Any scientific domain already has visualization mechanisms
Create an ecosystem of visualization agents
– Iteratively consume the semantic network
– Replace node(s) with markers into the visualizer’s space
– Render any remaining edges using graph layout
Rendering Agents
[Diagram: the dynamic semantics graph (has_structure, has_sequence, has_ipro_hit, contains_domain, has_go, has_evidence, predicts_location) divided among three rendering agents]
3D Structure Renderer
Sequence + Feature Renderer
Gene Ontology Subgraph Renderer
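One way to read the agent idea, as a sketch only: each agent claims the nodes whose types it can draw, edges that fall wholly inside one agent's view are absorbed by it, and whatever is left goes to generic graph layout. The renderer names are from the slide; which types each renderer handles, the node identifiers and the `render` function are assumptions.

```python
def render(network, node_types, agents):
    """Assign nodes to agents and return the edges left for graph layout.

    network: list of (subject, relation, object) edges
    node_types: node -> type
    agents: list of (agent name, set of node types it can draw), in priority order
    """
    markers = {}                                 # node -> agent that claimed it
    for agent_name, handled_types in agents:
        for node, node_type in node_types.items():
            if node not in markers and node_type in handled_types:
                markers[node] = agent_name
    # An edge is absorbed only when a single agent claimed both of its ends.
    remaining = [
        (s, p, o) for (s, p, o) in network
        if markers.get(s) is None or markers.get(s) != markers.get(o)
    ]
    return markers, remaining


node_types = {
    "p1": "protein_identifier",
    "st1": "3d_structure",
    "s1": "protein_sequence",
    "r1": "range_set",
    "g1": "go_term",
}
network = [
    ("p1", "has_structure", "st1"),
    ("p1", "has_sequence", "s1"),
    ("s1", "contains_domain", "r1"),
    ("p1", "has_go", "g1"),
]
agents = [
    ("3D Structure Renderer", {"3d_structure"}),
    ("Sequence + Feature Renderer", {"protein_sequence", "range_set"}),
    ("Gene Ontology Subgraph Renderer", {"go_term"}),
]
markers, remaining = render(network, node_types, agents)
```

Here the sequence and its domain ranges collapse into one sequence view, while the edges from the unclaimed protein identifier remain for graph layout to connect the three views.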
Hypothesis Validation
Express hypothesis as a pattern that can match the semantic network topology
Combination of structure and node values
– Need to use a rich graph aware query language, various options
For each object of a certain class test whether the structure around it matches
Link back to the visualization to show exceptions
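As an illustration of the per-object test, the sketch below treats a hypothesis as a set of relation paths that must exist from every node of a given class, and reports the nodes where they do not. This is a deliberately simple stand-in for a graph aware query language; the function names and sample data are assumptions, and node values are not checked.

```python
def follow(network, start, path):
    """Return the nodes reached from start by following the relations in path."""
    frontier = {start}
    for relation in path:
        frontier = {o for (s, p, o) in network if s in frontier and p == relation}
    return frontier


def validate(network, node_types, target_class, required_paths):
    """Split nodes of target_class into matches and exceptions."""
    matches, exceptions = [], []
    for node in sorted(node_types):
        if node_types[node] != target_class:
            continue
        ok = all(follow(network, node, path) for path in required_paths)
        (matches if ok else exceptions).append(node)
    return matches, exceptions


# Hypothesis: every protein identifier has a sequence with an InterPro hit.
network = {
    ("p1", "has_sequence", "s1"),
    ("s1", "has_ipro_hit", "i1"),
    ("p2", "has_sequence", "s2"),
}
node_types = {
    "p1": "protein_identifier",
    "p2": "protein_identifier",
    "s1": "protein_sequence",
    "s2": "protein_sequence",
    "i1": "ipro_identifier",
}
matches, exceptions = validate(
    network, node_types, "protein_identifier", [["has_sequence", "has_ipro_hit"]]
)
```

The `exceptions` list is what would be linked back to the visualization.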
Hypothesis Generation (!)
Use genetic algorithms to ‘evolve’ a suitable match for the previous stage
Relatively easy to create a fitness function (precision, specificity, match percentage)
Easy to ‘mutate’ patterns
‘Tell me anything interesting you’ve noticed about protein structures in this workflow’ capability
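A toy sketch of the evolutionary idea, under heavy assumptions: patterns are relation paths, the fitness is match percentage weighted by path length (an invented stand-in for the precision and specificity measures mentioned above), and the loop uses mutation and elitist selection only, with no crossover. None of the names below come from Taverna.

```python
import random


def follow(network, start, path):
    """Return the nodes reached from start by following the relations in path."""
    frontier = {start}
    for relation in path:
        frontier = {o for (s, p, o) in network if s in frontier and p == relation}
    return frontier


def fitness(pattern, network, starts):
    """Fraction of start nodes the pattern matches, weighted by pattern length."""
    if not pattern or not starts:
        return 0.0
    matched = sum(1 for node in starts if follow(network, node, pattern))
    return (matched / len(starts)) * len(pattern)


def mutate(pattern, relations, rng, max_len=4):
    """Return a copy of pattern with one relation added, dropped or swapped."""
    pattern = list(pattern)
    choice = rng.choice(["add", "drop", "swap"])
    if choice == "add" and len(pattern) < max_len:
        pattern.insert(rng.randrange(len(pattern) + 1), rng.choice(relations))
    elif choice == "drop" and len(pattern) > 1:
        del pattern[rng.randrange(len(pattern))]
    else:
        pattern[rng.randrange(len(pattern))] = rng.choice(relations)
    return tuple(pattern)


def evolve(network, starts, relations, generations=30, size=12, seed=0):
    """Evolve a relation path; the fittest half always survives (elitism)."""
    rng = random.Random(seed)
    # Start from single-relation patterns, cycling through every relation.
    population = [(relations[i % len(relations)],) for i in range(size)]

    def score(pattern):
        return fitness(pattern, network, starts)

    for _ in range(generations):
        survivors = sorted(population, key=score, reverse=True)[: size // 2]
        children = [mutate(rng.choice(survivors), relations, rng) for _ in survivors]
        population = survivors + children
    return max(population, key=score)


network = [
    ("p1", "has_sequence", "s1"),
    ("s1", "has_ipro_hit", "i1"),
    ("i1", "has_go", "g1"),
    ("p2", "has_sequence", "s2"),
]
relations = ["has_sequence", "has_ipro_hit", "has_go"]
best = evolve(network, ["p1", "p2"], relations)
```

The surviving pattern can be handed to the validation step as a candidate hypothesis. A realistic fitness function would need labelled or background data to avoid rewarding trivial patterns.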
Obtaining Taverna
Taverna is available under the LGPL from our project site on Sourceforge.net
– http://taverna.sourceforge.net
Release 1.4 as of May 2006
Win32, Solaris / Linux & OS-X
Includes online and downloadable user manual, examples etc.
Support via project mailing lists
myGrid acknowledgements
Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer
OMII-UK: Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble.
Research: Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.
Current contributors: Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.
User Advocates and their bosses: Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell.
Past Contributors: Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.
Industrial: Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.
Funding: EPSRC, Wellcome Trust.