

Taverna
the story from up-above
Antoon Goderis

The University of Manchester, UK
http://www.mygrid.org.uk/taverna
http://www.omii.ac.uk

DART workshop, Brisbane, Australia, 14 December 2006

2

Overview
The situation in -omics
Creating new biology using Taverna
Taverna
  Key traits
  Features on the OMII roadmap
    Including today’s release

3

Bioinformaticians & co.

4

Open environment
Data, Data, Data

EBI
SeqHound
SRS
National Center for Biotechnology Information (USA)
Cambridge, UK
Tokyo, Japan

5

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

6

The situation in {genomics, transcriptomics, proteomics, metabolomics ..}
Lots of data
Lots of parameters to choose
An analysis takes a long time
The analysis services are unreliable
Lots of analysis steps
Need to record and explain your steps

7

Enter workflows
Lots of data [high throughput]
Lots of parameters to choose [best practice]
An analysis takes a long time [long running]
The analysis services are unreliable [fault tolerance]
Lots of analysis steps [data and control flow]
Need to record and explain your steps [provenance]

8

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg

Workflow-based middleware

9

myGrid
http://www.mygrid.org.uk
UK e-Science pilot project since 2001
Part of the Open Middleware Infrastructure Institute UK
Builds middleware for life scientists that enables them to undertake in silico experiments and share those experiments and their results.
Individual scientists, in under-resourced labs, who use other people’s applications.
Open source.
Workflows & Semantic Technologies for metadata management.
Data flows.
Ad hoc & exploratory.

10

Overview
The situation in -omics
Creating new biology using Taverna
Taverna
  Key traits
  Features on the OMII roadmap
    Including today’s release

11

?200

Microarray + QTL

Genes captured in microarray experiment and present in QTL region

Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping

Genotype Phenotype

[Andy Brass, Steve Kemp, Paul Fisher, 2006]
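The combination described on this slide, genes that are both captured in the microarray experiment and present in the QTL region, amounts to a set intersection. The sketch below is only an illustration of that idea: the gene identifiers are invented placeholders, and `candidate_genes` is a hypothetical helper, not part of the workflows shown in the talk.

```python
# Hypothetical gene identifiers, invented purely for illustration.
qtl_region_genes = {"geneA", "geneB", "geneC", "geneD"}
microarray_genes = {"geneC", "geneD", "geneE"}


def candidate_genes(qtl_genes, expressed_genes):
    """Return genes found in the QTL region that were also captured by the microarray."""
    return sorted(set(qtl_genes) & set(expressed_genes))


candidates = candidate_genes(qtl_region_genes, microarray_genes)
print(candidates)
```

Sorting the result only makes the output stable; the real workflows go on to annotate each candidate and look up its pathways.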

12

Key:

A – Retrieve genes in QTL region

B – Annotate genes with external database IDs

C – Cross-reference IDs with KEGG gene IDs

D – Retrieve microarray data from MaxD database

E – For each KEGG gene get the pathways it’s involved in

F – For each pathway get a description of what it does

G – For each KEGG gene get a description of what it does

[Andy Brass, Steve Kemp, Paul Fisher, 2006]

13

Result
Captured the pathways returned by QTL and Microarray workflows over the MaxD microarray database.

Identified a pathway whose correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance.

Manual analysis of the microarray and QTL data had failed to identify this gene as a candidate.

[Andy Brass, Steve Kemp, Paul Fisher, 2006]

14

Trichuris muris (mouse whipworm) infection

Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.

Manual experimentation: Two year study of candidate genes, processes unidentified

Workflows: the workflow from the trypanosomiasis cattle experiment was reused without change.

Analysis of the data by a biologist found the processes in a couple of days.

[Joanne Pennock, Paul Fisher, 2006]

15

Changing scientific practice
Systematic and comprehensive automation.
Eliminated user bias and premature filtering of datasets and results, which lead to single-sided, expert-driven hypotheses.
Dry people hypothesise, wet people validate.
“make sense of this data” -> “does this make sense?”
Workflow factories.
Different dataset, different result.
Accurate provenance.

16

Overview
The situation in -omics
Creating new biology using Taverna
Taverna
  Key traits
  Features on the OMII roadmap
    Including today’s release

17

User Uptake
~25000 downloads
Systems biology
Proteomics
Gene/protein annotation
Microarray data analysis
Medical image analysis
Heart simulations
High throughput screening
Phenotypical studies
Plants, Mouse, Human
Astronomy
Dilbert Cartoons

18

[Architecture diagram. Labels: Clients; Taverna Workbench; 3rd Party Applications and Portals; Finding and Sharing Tools; Feta; myExperiment; Utopia; DAS; Workflow enactor; Service Management; Results Management; Provenance log; Metadata; LSIDs; KAVE; BAKLAVA; Default Data Store; Custom Store]

19

Taverna workbench

20

3000+ services
Open domain services and resources.
Third party.
Enforce NO common data model.
No common typing.
Missing metadata.

Soaplab
InstantSoap

21

Services Landscape

22

User Interaction
Allows a workflow to call out to an expert human user

E.g. Used to embed the Artemis annotation editor within an otherwise automated genome annotation pipeline

[University of Bergen]

23

Tools, Tools, Tools

Feta Search tool

Pedro Annotation tool

24

Capture and Curation Effort

Ontology and Annotation Curation Team

Franck Tanoh and Katy Wolstencroft

Community Service Providers

Community Scientists

25

Scufl Model
Simple Conceptual Unified Flow Language

[Diagram. Labels: Taverna Workbench; Application; Workflow Execution; Workflow enactor; Shielding & Extensible plug-ins. Processors: Plain Web Service, Soaplab, Local Java App, WF Enactor, BioMOBY, SeqHound, BioMART, WSRF, Beanshell]

Nested workflows, Automatic iterations, Best guess data type handling
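One way to picture "automatic iterations" is a processor written for a single item being applied element by element when it is handed a list. The following is a minimal sketch of that idea under stated assumptions: `run_processor` and `to_upper` are hypothetical names, and this is not Taverna's enactor code.

```python
def run_processor(func, value):
    """Apply a single-item processor; if given a (nested) list, iterate over it."""
    if isinstance(value, list):
        # Recurse so that lists of lists keep their shape in the output.
        return [run_processor(func, item) for item in value]
    return func(value)


def to_upper(sequence):
    """Hypothetical processor that only knows how to handle one sequence string."""
    return sequence.upper()


single = run_processor(to_upper, "acgt")
many = run_processor(to_upper, ["acgt", ["ttga", "ccat"]])
print(single, many)
```

The workflow author wires `to_upper` once; whether it runs once or many times is decided by the shape of the data that arrives.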

26

Service incompatibility
Fix up the services to be compatible or….
Shims – libraries of adapters.
Automated data type matching using reasoning over a mismatch and service ontology

Duncan Hull, myGrid
Khalid Belhajjame, ISPIDER
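A shim library can be thought of as a table of small adapters keyed by the output type of one service and the input type of the next. The sketch below assumes a plain dictionary lookup in place of the ontology-based reasoning mentioned on the slide; the type names, the `shim` decorator and `connect` are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# Registry of adapters: (source type, target type) -> conversion function.
SHIMS: Dict[Tuple[str, str], Callable[[str], str]] = {}


def shim(source_type: str, target_type: str):
    """Decorator that registers an adapter in the shim library."""
    def register(func):
        SHIMS[(source_type, target_type)] = func
        return func
    return register


@shim("fasta", "raw_sequence")
def fasta_to_raw(text: str) -> str:
    """Drop FASTA header lines and join the remaining sequence lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln and not ln.startswith(">")]
    return "".join(lines)


def connect(output_type: str, input_type: str, value: str) -> str:
    """Pass a value between two services, inserting a shim when the types differ."""
    if output_type == input_type:
        return value
    try:
        adapter = SHIMS[(output_type, input_type)]
    except KeyError:
        raise TypeError(f"no shim from {output_type} to {input_type}") from None
    return adapter(value)


raw = connect("fasta", "raw_sequence", ">seq1 example\nacat\nttct\n")
print(raw)
```

A missing entry in the table corresponds to a detected mismatch for which no shim has been identified yet.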

27

Shim identification

Mismatch detection

28

Service failure?
Most services are owned by other people
No control over service failure
Some are research level
Workflows only as good as the services they connect.
Notify failures
Instigate retries
Set criticality
Substitute services
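The four responses listed above (notify, retry, criticality, substitution) can be combined in one small wrapper. This is a hedged sketch rather than Taverna's actual fault-tolerance mechanism: `invoke_with_fallback`, its parameters, and the two example services are hypothetical.

```python
def invoke_with_fallback(services, payload, retries=2, critical=True, notify=print):
    """Try each service in order, retrying on failure.

    services: primary service first, substitutes after it.
    retries:  extra attempts per service after the first one.
    critical: if True, total failure raises; otherwise None is returned.
    """
    for service in services:
        for attempt in range(1, retries + 2):
            try:
                return service(payload)
            except Exception as exc:
                notify(f"{service.__name__} failed (attempt {attempt}): {exc}")
    if critical:
        raise RuntimeError("all services failed for a critical step")
    return None


calls = {"flaky": 0}


def flaky_service(sequence):
    """Hypothetical third-party service that is currently down."""
    calls["flaky"] += 1
    raise ConnectionError("service unavailable")


def backup_service(sequence):
    """Hypothetical substitute service offering the same operation."""
    return sequence[::-1]


messages = []
result = invoke_with_fallback(
    [flaky_service, backup_service], "acgt", retries=1, notify=messages.append
)
print(result, messages)
```

Marking a step as non-critical (`critical=False`) lets the rest of a workflow continue with a missing value instead of aborting.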

29

Provenance
Collection
  Observes events from the workflow engine
  Populates an RDF triple store with information from these events
Browse interface
  Simple browser replicates Taverna’s existing result and status browser
  Graphical browser
ProQA Query API

[Figure: RDF provenance graph. Legend: Data generated by services/workflows; Concepts; Services; literals; Properties. Node labels include urn:data1, urn:data2, urn:data:3, urn:data12, urn:data:f1, urn:data:f2, urn:hit1… to urn:hit50…, urn:BlastNInvocation3, urn:compareinvocation3, urn:invocation5, Blast_report, SwissProt_seq, Sequence_hit, Find similar sequence, New sequence, Missed sequence, DatumCollection, LSDatum. Edge labels include [input], [output], [instanceOf], [type], [hasName], [hasHits], [contains], [performsTask], [similar_sequence_to], [directlyDerivedFrom], [distantlyDerivedFrom].]

[Zhao et al 07 provenance challenge paper]
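The collection step, observing enactor events and writing them into a triple store, can be sketched with plain Python tuples standing in for RDF triples. Everything here is an assumption made for illustration: the `ProvenanceStore` class, the `urn:example:` identifiers and the event shape are invented, and only the predicate names `input`, `output` and `directlyDerivedFrom` are borrowed from the figure labels.

```python
class ProvenanceStore:
    """Tiny in-memory stand-in for an RDF triple store."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def on_invocation(self, invocation, inputs, outputs):
        """Record one (hypothetical) 'service invoked' event from the enactor."""
        for item in inputs:
            self.add(invocation, "input", item)
        for result in outputs:
            self.add(invocation, "output", result)
            for item in inputs:
                # Each output is recorded as derived from every input of the call.
                self.add(result, "directlyDerivedFrom", item)

    def objects(self, subject, predicate):
        """Return all objects for a given subject and predicate, sorted."""
        return sorted(o for s, p, o in self.triples if s == subject and p == predicate)


store = ProvenanceStore()
store.on_invocation(
    "urn:example:invocation1",
    inputs=["urn:example:data1", "urn:example:data2"],
    outputs=["urn:example:data3"],
)
print(store.objects("urn:example:data3", "directlyDerivedFrom"))
```

A real deployment would use an RDF library and proper URIs; the point of the sketch is only that each event becomes a handful of subject, predicate, object statements that can be queried later.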

30

31

Provenance Tracking

From which Ensembl gene does pathway mmu004620 come?

32

[Diagram: Pathway_id, KEGG_id, Uniprot, Entrez and Ensembl_gene_id connected by edges labelled dF]

Workflows over Results

Automatically backtrack through the data provenance graph
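Backtracking through a provenance graph is a graph traversal that follows derivation edges from a result back to its sources. The sketch below assumes a small, invented derivation chain: apart from the pathway id quoted on the previous slide, every identifier and every edge is hypothetical, and `backtrack` is an illustrative helper rather than the ProQA API.

```python
from collections import deque

# Hypothetical "derived from" edges: item -> the items it was derived from.
derived_from = {
    "pathway:mmu004620": ["kegg:example_gene"],
    "kegg:example_gene": ["uniprot:example_protein", "entrez:example_id"],
    "uniprot:example_protein": ["ensembl:example_gene"],
    "entrez:example_id": ["ensembl:example_gene"],
}


def backtrack(item, edges):
    """Breadth-first walk over derivation edges, returning every ancestor once."""
    ancestors = []
    visited = {item}
    queue = deque([item])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in visited:
                visited.add(parent)
                ancestors.append(parent)
                queue.append(parent)
    return ancestors


ancestors = backtrack("pathway:mmu004620", derived_from)
ensembl_sources = [a for a in ancestors if a.startswith("ensembl:")]
print(ensembl_sources)
```

Filtering the ancestors by identifier type answers the question "which Ensembl gene does this pathway come from?" without the user retracing the workflow by hand.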

33

A workflow marketplace

34

webTaverna GUI - main

35

Overview
The situation in -omics
Creating new biology using Taverna
Taverna
  Key traits
  Features on the OMII roadmap
    Including today’s release

36

[Release process diagram. Labels: Ingest; Pioneers; Early adopters; Conservatives; myGrid Pre-release; myGrid Release; OMII-UK Release; Software Engineering; XP; Quality & Test; Evaluation; OMII Software Engineering; Prioritise & Plan; Production; Applications & Professional Services; myGrid Alliance; Source-forge community]

37

Who are the OMII Users?

Increasing variation in requirements with the scientific domain.

Different scientific/research domains

End Users

Application Developers

Service and Middleware Developers

Middleware Deployers

Different activities

Systems Administrators

38

Taverna is now part of OMII-UK
Taverna 1.5 – Today!
Taverna 1.6
myExperiment

39

Integrated provenance
Raven release mechanism to simplify updates for the user
+/- 300 semantic annotations for core services
Patterns for using proxies for bulk data transactions
Redeveloped plug-in and enactor framework, improved iteration events, data management

Taverna 1.5

40

Integrated provenance

Taverna 1.5

41

Integrated provenance
Raven release mechanism to simplify updates for the user

Taverna 1.5

42

Integrated provenance
Raven release mechanism to simplify updates for the user
+/- 300 semantic annotations for core services

Add_ncbi_to_string: Beanshell script, need to ask Paul for more details
  Input:
  Output:

Kegg_gene_ids_all_species (bconv): converts external IDs to KEGG IDs [mapping]
  string: External ID, e.g. NCBI ID [Genebank_GI]
  return: KEGG gene ID [KEGG_record_id]

Get_pathways_by_genes: Search all pathways which include all the given genes [Searching]
  Input: List of KEGG genes id [KEGG_gene_id]
  Output: Return a list of pathway_id of specified KEGG genes_id

Merge_pathways
  String list
  Concatenated

This workflow takes in Entrez gene ids then adds the string "ncbi-geneid:" to the start of each gene id. These gene ids are then cross-referenced to KEGG gene ids. Each KEGG gene id is then sent to the KEGG pathway database and its relevant pathways returned.

Taverna 1.5
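The annotated workflow on this slide (prefix each Entrez gene id, map it to a KEGG gene id, fetch its pathways, merge the lists) can be sketched as four small functions chained together. The lookup tables below are invented stand-ins for the remote KEGG services, the gene and pathway identifiers are placeholders, and the function bodies are assumptions; only the step names and the "ncbi-geneid:" prefix come from the slide.

```python
# Hypothetical lookup tables standing in for the remote KEGG services.
ENTREZ_TO_KEGG = {
    "ncbi-geneid:1001": "kegg:gene_a",
    "ncbi-geneid:1002": "kegg:gene_b",
}
KEGG_TO_PATHWAYS = {
    "kegg:gene_a": ["path:example1", "path:example2"],
    "kegg:gene_b": ["path:example2"],
}


def add_ncbi_to_string(gene_id):
    """Prefix an Entrez gene id with the string 'ncbi-geneid:'."""
    return "ncbi-geneid:" + gene_id


def kegg_gene_id(external_id):
    """Cross-reference an external id to a KEGG gene id (None if unknown)."""
    return ENTREZ_TO_KEGG.get(external_id)


def get_pathways_by_gene(kegg_id):
    """Return the pathway ids recorded for one KEGG gene id."""
    return KEGG_TO_PATHWAYS.get(kegg_id, [])


def merge_pathways(pathway_lists):
    """Concatenate the per-gene pathway lists into one flat list."""
    return [pathway for plist in pathway_lists for pathway in plist]


def run_workflow(entrez_ids):
    prefixed = [add_ncbi_to_string(g) for g in entrez_ids]
    kegg_ids = [k for k in map(kegg_gene_id, prefixed) if k is not None]
    return merge_pathways(get_pathways_by_gene(k) for k in kegg_ids)


pathways = run_workflow(["1001", "1002", "9999"])
print(pathways)
```

Ids with no KEGG cross-reference are simply dropped here; how the real workflow treats them is not stated on the slide.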

43

Integrated provenance
Raven release mechanism to simplify updates for the user
+/- 300 semantic annotations for core services
Patterns for using proxies for bulk data transactions
Redeveloped plug-in and enactor framework, improved iteration events, data management

Taverna 1.5

44

Taverna 1.6
Due out Summer 2007
Revised enactment core
Native support for long running workflows
Data proxy to deal with bulk data transactions
Improved service discovery and provenance management

46

Obtaining Taverna
Taverna is available under the LGPL from our project site on Sourceforge.net
http://taverna.sourceforge.net
Win32, Solaris / Linux & OS-X
Includes online and downloadable user manual, examples etc.
Support via project mailing lists

47

Conclusions
See plans for Taverna 2.0 on myGrid wiki
Taverna development is user-driven
Please keep in touch and tell us what you would like to see via the myGrid mailing lists: Taverna Users, Taverna Hackers

Taverna http://taverna.sourceforge.net
myGrid http://www.mygrid.org.uk
OMII-UK http://www.omii.ac.uk

48

Phase1 myGrid researchers, Phase2 OMII-UK, myGrid Research Team

Peter Li, Paul Fisher, Andy Brass, Robert Stevens, Mark Wilkinson

EPSRC, Wellcome Foundation, EU

Acknowledgements