Post on 20-Jan-2016

1

A myGrid Project Tutorial

Dr Mark Greenwood

University of Manchester

With considerable help from Justin Ferris, Peter Li, Phil Lord, Chris Wroe, Carole Goble and the rest of the myGrid team.

2

• Open Source Upper Middleware for Bioinformatics

• (Web) Service-based architecture

• Targeted at Tool Developers, Bioinformaticians and Service Providers

Sites (map labels): Newcastle, Nottingham, Manchester, Southampton, Hinxton, Sheffield

3

myGrid People

Core
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.

Users
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK

Postgraduates
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire

Industrial
• Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
• Robin McEntire (GSK)

Collaborators
• Keith Decker

4

Roadmap - start

services

data

6

Tenet I
• High level Middleware services for data intensive resource interoperation for Bioinformatics
– Information Grid not computational Grid
• Exploratory, ad hoc
• For individuals
• In silico experiment as workflow
• Distributed query processing
• Information Management

7

Tenet II
• High level services for e-Science experimental management:
– Provenance
– Event notification
– Personalisation
• Sharing knowledge and sharing components
– Scientific discovery is personal & global.
– Federated third party registries for workflows and services
– Workflow and service discovery for reuse and repurposing

Registry (diagram labels): Register, Find, Annotate
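The three registry operations named on this slide (register, find, annotate) can be illustrated with a toy in-memory sketch. This is not the myGrid registry interface: the class, the method names, the annotation terms and the example.org endpoints are all assumptions made for illustration. Keeping annotation separate from registration is meant to echo the idea that third parties can add metadata to services they do not own.

```python
class Registry:
    """Toy in-memory registry (hypothetical, not the myGrid registry API)."""

    def __init__(self):
        # name -> {"endpoint": str, "annotations": set of terms}
        self._entries = {}

    def register(self, name, endpoint):
        """Publish a service or workflow under a name."""
        self._entries[name] = {"endpoint": endpoint, "annotations": set()}

    def annotate(self, name, term):
        """Attach a descriptive term to an already registered entry."""
        self._entries[name]["annotations"].add(term)

    def find(self, term):
        """Return the names of all entries annotated with the given term."""
        return sorted(
            name for name, entry in self._entries.items()
            if term in entry["annotations"]
        )


# Hypothetical services and endpoints, for illustration only.
registry = Registry()
registry.register("blast_wrapper", "http://example.org/blast")
registry.register("repeat_masker", "http://example.org/repeatmasker")
registry.annotate("blast_wrapper", "similarity_search")
registry.annotate("repeat_masker", "sequence_masking")
matches = registry.find("similarity_search")
```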

8

Tenet III

• Open Source and Open Services
– No control or influence over service providers
• Open to third party metadata and services
• Open extensible architecture
– Assemble your own components
– Designed to work together
– Toolkit

Components (diagram labels): Freefluo WfEE, Taverna, View, UDDI registry, Event Notification, mIR, Pedro, Semantic Discovery, Info. Model, Soaplab, Gateway & Portal, LSID, Haystack Provenance Browser

9

Tenet IV
• (Web) Service architecture
– Publication, discovery, interoperation, composition, decommissioning of myGrid services
– WS-I -> OGSA / WSRF
• Metadata driven
– Ontologies
– Common information model
– Semantic Web technologies: RDF, OWL

10

Tenet V

Middleware for
• Tool Developers
• Bioinformaticians
• Service Providers

• Biologists are indirectly supported by the portals and apps these develop.

11

Roadmap

run workflows

services

workflows

data

discover services

data management

workflows

12

Data-intensive bioinformatics

ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
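To show why a record like this is awkward to handle by hand, here is a minimal sketch of reading it programmatically. It assumes the usual flat-file layout in which each line starts with a two-letter line code (ID, DE, OS, KW and so on) and sequence lines are indented. The function name and the "SEQ" key are illustrative choices, not part of any myGrid tool or of the database format.

```python
from collections import defaultdict


def parse_flat_record(text):
    """Group the lines of a flat-file record by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue  # skip blank lines and the record terminator
        if line[0].isspace():
            # Indented lines carry sequence data; "SEQ" is our own key.
            fields["SEQ"].append(line.replace(" ", ""))
            continue
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return dict(fields)


# A shortened copy of the record shown on the slide.
record = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG
"""

fields = parse_flat_record(record)
entry_name = fields["ID"][0].split()[0]
keywords = [k.strip() for k in fields["KW"][0].rstrip(".").split(";")]
sequence = "".join(fields["SEQ"])
```

Even this small parser has to know the line codes, the continuation rules and the punctuation conventions, which is the kind of per-format effort that service-based middleware aims to hide.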

13

Use Scenarios

Graves’ Disease
• Autoimmune disease of the thyroid
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle
• Discover all you can about a gene
• Annotation pipelines and Gene expression analysis
• Services from Japan, Hong Kong, various sites in UK

Williams-Beuren Syndrome
• Microdeletion of 1.55 Mbases on Chromosome 7
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
• Characterise an unknown gene
• Annotation pipelines and Gene expression analysis
• Services from USA, Japan, various sites in UK

14

Manually filling a genomic gap

Two major steps:
• Extend into the gap: Similarity searches; RepeatMasker, BLAST
• Characterise the new sequence: NIX, Interpro, etc.

• Numerous web-based services (e.g. BLAST, RepeatMasker)
• Cutting and pasting between screens
• Large number of steps
• Frequently repeated – info now rapidly added to public databases
• Don’t always get results
• Time consuming
• Huge amount of interrelated data is produced – handled in lab book and files saved to local hard drive
• Mundane
• Much knowledge remains undocumented
• Bioinformatician does the analysis
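The manual procedure above is what a workflow automates: run the steps in order and keep every intermediate result, instead of cutting, pasting and writing in a lab book. The sketch below illustrates only that idea. The two step functions are toy stand-ins invented here; they do not call RepeatMasker, BLAST or any other real service.

```python
def run_pipeline(steps, initial):
    """Apply each (name, function) step in order, keeping every intermediate result."""
    results = [("input", initial)]
    value = initial
    for name, func in steps:
        value = func(value)
        results.append((name, value))
    return results


def mask_lowercase(seq):
    """Toy stand-in for repeat masking: replace lower-case bases with N."""
    return "".join("N" if base.islower() else base for base in seq)


def count_unmasked(seq):
    """Toy stand-in for a downstream analysis: count bases that were not masked."""
    return sum(1 for base in seq if base != "N")


log = run_pipeline(
    [("mask", mask_lowercase), ("count", count_unmasked)],
    "ACGTacgtACGT",
)
```

The returned list plays the role of the lab book: each entry names the step and records what it produced, so the run can be repeated or inspected later.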

15

WBS Workflows (workflow diagram; labels recovered from the figure)

Input labels: GenBank Accession No; URL inc GB identifier; GenBank Entry; Seqret; Nucleotide seq (Fasta); Query nucleotide sequence

Services and the outputs they are labelled with:
• GenScan: coding sequence; ORFs
• prettyseq: translation/sequence file, good for records and publications
• restrict: restriction enzyme map
• cpgreport: CpG island locations and %
• RepeatMasker: repetitive elements
• ncbiBlastWrapper: Blastn vs nr, est databases
• sixpack: 6 ORFs
• transeq: amino acid translation
• epestfind: identifies PEST seq
• pepcoil: predicts coiled-coil regions
• pepstats: MW, length, charge, pI, etc.
• pscan: identifies FingerPRINTS
• SignalP, TargetP, PSORTII: predicts cellular location
• InterPro, PFAM, Prosite, Smart: identifies functional and structural domains/motifs
• Pepwindow? Octanol?: hydrophobic regions
• ncbiBlastWrapper: tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr
• RepeatMasker, then ncbiBlastWrapper: sort for appropriate sequences only

Legend:
• Pink: outputs/inputs of a service
• Purple: tailor-made services
• Green: Emboss soaplab services
• Yellow: Manchester soaplab services
• Grey: unknowns

16

Graves’ Disease Bioinformatics (diagram; labels recovered from the figure)

Labels: Candidate gene pool; Gene ID; SNP; Talisman

Annotation Pipeline: What is known about my candidate gene?
• Medline, OMIM, GO, BLAST, EMBL
• DQP Query

Genotype Assay Design System: Is this SNP present in my samples?
• Primer Design: Emboss Eprimer application in SoapLab
• Selection of restriction enzyme: Emboss Restrict in SoapLab
• Restriction Fragment Length Polymorphism experiment: use primers designed by myGrid to amplify region flanking SNP on the gene

3D Protein Structure: What is the structure of the protein product encoded by my candidate gene?
• PDB: query PDB & display protein structure
• Swiss-Prot, AMBIT, Interpro: obtain information about protein & extract information about active site
• AMBIT: determine whether coding SNP affects the active site of the protein

Peter Li (1), Claire Jennings (2), Simon Pearce (2) and Anil Wipat (1), (2003). (1) School of Computing Science and (2) Institute of Human Genetics, University of Newcastle-upon-Tyne.

17

Experiment life cycle
• Discovering and reusing experiments and resources
• Managing lifecycle, provenance and results of experiments
• Sharing services & experiments
• Personalisation
• Forming experiments
• Executing and monitoring experiments

18

(e-)Scientists…
• …Experiment
– Can workflow be used as an experimental method?
– How many times has this experiment been run?
• …Analyze
– How do we manage the results to draw conclusions from them?
– How reliable are these results?
• …Collaborate
– Can we share workflows, results, metadata etc?
• …Publish
– Can we link to these workflows and results from our papers?
• …Review
– Can I find, comprehend and review your work?
– How was that result derived?

19

Collections of Tasks (diagram labels)
• Tasks: Finding, Description, Service Discovery, Enactment, Building Workflow, Provenance, Storage, Data Management, Querying, Domain Tasks
• Roles: Service Providers, Bioinformaticians, Scientists, Annotation providers

20

Architecture overview (diagram labels)
• Components: Registry, mIR, Discovery View, Haystack Provenance Browser, FreeFluo Enactor, Taverna WF Builder, Pedro Annotation tool, Ontology Store, Others
• Service interfaces: WSDL, Soaplab
• Interactions: Interface Description; Annotation/description; Query & Retrieve; Workflow Execution; Store data/knowledge; invoking; Querying/sharing/federating/registering; Data descriptions; Vocabulary
• Roles: Annotation providers, Scientists, Bioinformaticians, Service Providers

21

myGrid Service Stack (diagram labels)
• Roles: Bioinformaticians, Tool Providers, Service Providers
• Layers: Applications, Core services, External services
• Fabric: Web Service (Grid Service) communication fabric
• Components: AMBIT Text Extraction Service, Provenance, Personalisation, Event Notification, Gateway, Service and Workflow Discovery, myGrid Information Repository, Ontology Mgt, Metadata Mgt, Work bench, Taverna, Talisman, Views, Native Web Services, SoapLab, GowLab, Web Portal, Legacy apps, Registries, Ontologies, FreeFluo Workflow Enactment Engine, OGSA-DQP Distributed Query Processor

22

Two+ Paths

Core functionality
• Services – Soaplab and Gowlab
• Workflow enactment engine – Freefluo
• Workflow workbench – Taverna
• Data integration – OGSA-DQP
• Information model & management

Innovative work
• Service and workflow registration
• Semantic discovery
• Provenance management
• Text mining

In between
• Event notification
• Gateway

23

24

25

Run the Workflow

Viewing intermediate results

26

Run the Workflow

27

Drilling Down: myGrid and Semantics

• Workflow and service discovery
– Prior to and during enactment
– Semantic registration
• Workflow assembly
– Semantic service typing of inputs and outputs
• Provenance of workflows and other entities
• Experimental metadata glue
• Use of RDF, RDFS, DAML+OIL/OWL
– Instance store, ontology server, reasoner
– Materialised vs at point of delivery reasoning.
• myGrid Information Model
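Semantic typing of inputs and outputs can be pictured as a subsumption check: an output may feed an input if the output's concept is the same as, or a specialisation of, the concept the input expects. The sketch below uses a hand-written parent table as a stand-in for an ontology and reasoner. The concept names and function names are invented for illustration and are not taken from the myGrid ontology.

```python
# Hypothetical toy ontology: child concept -> parent concept.
PARENT = {
    "protein_sequence": "sequence",
    "nucleotide_sequence": "sequence",
    "sequence": "data",
}


def is_a(concept, ancestor):
    """Return True if concept equals ancestor or is a descendant of it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)  # None once we walk past the root
    return False


def can_connect(output_type, input_type):
    """An output may feed an input when it is at least as specific as required."""
    return is_a(output_type, input_type)


specific_to_general = can_connect("protein_sequence", "sequence")
general_to_specific = can_connect("sequence", "protein_sequence")
```

A service that accepts any sequence can take a protein sequence, but a service that needs a protein sequence cannot safely take an arbitrary sequence. A description-logic reasoner over DAML+OIL/OWL would answer the same kind of question over a much richer model.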

28

Semantic Discovery

View annotations on workflow

Pedro data capture tool

Drag a workflow entry into the explorer pane and the workflow loads.

Drag a service/workflow to the scavenger window for inclusion into the workflow.

29

Tutorial focus

Core functionality
• Services – Soaplab and Gowlab
• Workflow enactment engine – Freefluo
• Workflow workbench – Taverna
• Data integration – OGSA-DQP
• Information model & management

Innovative work
• Service and workflow registration
• Semantic discovery
• Provenance management
• Text mining

In between
• Event notification
• Gateway

30

Roadmap

Diagram labels: LSID authorities, Taverna workbench, Registry, services, workflows, data

1. Describe services
2. Discover services
3. Write & run workflows
4. Provenance & data management
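The roadmap mentions LSID authorities. A Life Science Identifier is a URN of the form urn:lsid:authority:namespace:object, with an optional trailing revision, and the authority part names who issued the identifier. The sketch below splits an LSID into those parts. The class and function names are illustrative, and the example identifiers use a made-up example.org authority rather than a real myGrid one.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID URN into authority, namespace, object and optional revision."""
    parts = text.split(":")
    if (
        len(parts) not in (5, 6)
        or parts[0].lower() != "urn"
        or parts[1].lower() != "lsid"
    ):
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifiers, for illustration only.
with_revision = parse_lsid("urn:lsid:example.org:workflow:42:1")
without_revision = parse_lsid("urn:lsid:example.org:data:abc")
```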

31

Sessions on Details
• Workflows - hands on with Taverna
• Semantics
• Timetable – split sessions
– Session 1
  • Group 1 – hands on (Swanson)
  • Group 2 – semantics (Newhaven)
– Teabreak (short)
– Session 2
  • Group 1 – semantics (Newhaven)
  • Group 2 – hands on (Swanson)
– Discussions and Conclusions

– Discussions and Conclusions