transcript
- Slide 1
- Provenance in myGrid and beyond www.mygrid.org.uk Luc Moreau,
University of Southampton, UK
- Slide 2
- or the Provenance of my interest for Provenance Luc Moreau,
University of Southampton, UK
- Slide 3
- Overview Bioinformatics background myGrid facts Services and
Workflows Provenance in myGrid Beyond myGrid Provenance
Architectural vision Conclusions
- Slide 4
- Overview Bioinformatics background myGrid facts Services and
Workflows Provenance in myGrid Beyond myGrid Provenance
Architectural vision Conclusions
- Slide 5
- Large amounts of data EMBL July 2001 150 Gbytes Microarray 1
Petabyte per annum Sanger Centre 20 terabytes of data Genome
sequences increase 4x per annum
http://www3.ebi.ac.uk/Services/DBStats/
- Slide 6
- Complexity Diversity Heterogeneity Disease Drug Disease
Clinical trial Phenotype Protein Protein Structure Protein Sequence
P-P interactions Proteome Gene sequence Genome sequence
Gene expression homology Genomic, proteomic, transcriptomic,
metabolomic, protein-protein interactions, regulatory
bio-networks, alignments, disease, patterns & motifs, protein
structure, protein classifications, specialist proteins (enzymes,
receptors),
- Slide 7
- Heterogeneity Data types & forms Community Autonomy Over
500 different databases Different formats, structure, schemas,
coverage Web interfaces, flat file distribution,
- Slide 8
- Heterogeneous Data Multimedia Images & Video Text
annotations & literature Descriptive as well as numeric
Knowledge-based Text Extraction
- Slide 9
- Bioinformatics Analysis Different algorithms BLAST, FASTA, pSW
Different implementations WU-BLAST, NCBI-BLAST Different service
providers NCBI, EBI, DDBJ
- Slide 10
- Drug Discovery
- Slide 11
- In silico experimentation Discovery of resources and tools,
staging of operations, sharing of results Process is as important
as outcome Science is dynamic: change happens Scientific discovery
is personal & global Provenance and history
- Slide 12
- Overview Bioinformatics background myGrid facts Services and
Workflows Provenance in myGrid Beyond myGrid Provenance
Architectural vision Conclusions
- Slide 13
- myGrid EPSRC funded pilot project Generic middleware within
application setting 36 months in a 42-month performance period Start
1st October 2001 16 full-time post docs altogether 6 DTA
studentships 1 technical project manager
- Slide 14
- myGrid consortium Scientific Team Biologists and
Bioinformaticians GSK, AZ, Merck KGaA, Manchester, EBI Technical
Team Manchester, Southampton, Newcastle, Sheffield, EBI, Nottingham
IBM, SUN GeneticXchange Network Inference, Epistemics Ltd
- Slide 15
- myGrid outcomes e-Scientists Bioinformatics demonstrator
(Graves disease and Williams syndrome) Developers myGrid-in-a-Box
developers kit (currently myGrid 0.4) Integrating some existing
bioinformatics tools with myGrid (EBI services)
- Slide 16
- Overview Bioinformatics background myGrid facts Services and
Workflows Provenance in myGrid Beyond myGrid Provenance
Architectural vision Conclusions
- Slide 17
- Graves disease Autoimmune disease of the thyroid in which the
immune system of an individual attacks cells in the thyroid gland
resulting in hyperthyroidism Weight loss, trembling, muscle
weakness, increased pulse rate, increased sweating and heat
intolerance, goitre, exophthalmos
- Slide 18
- The Biology GD caused by the stimulation of the thyrotrophin
receptor by thyroid-stimulating autoantibodies secreted by
lymphocytes of the immune system. Why is the lymphocyte causing the
antibodies that attack the thyroid cell?
- Slide 19
- Graves Disease Experimental Process What genes are associated
with Graves Disease? Candidate Gene What is known about my
candidate gene? Literature Previous Research Databases Experimental
Annotation Pipeline What SNPs (single nucleotide polymorphisms) in
my candidate gene might be relevant? Verify relevance of SNPs. How
can I visualise annotations to my candidate gene? Genotype Assay
Design System 3D Protein Structure & SNP Visualisation
- Slide 20
- Experiment life cycle Executing experiments Workflow enactment
Distributed Query processing Job execution Provenance generation
Single sign-on authorisation Event notification Resource &
service discovery Repository creation Workflow creation Database
query formation Discovering and reusing experiments and resources
Workflow discovery & refinement Resource & service
discovery Repository creation Provenance Managing experiments
Information repository Metadata management Provenance management
Workflow evolution Event notification Providing services &
experiments Service registration Workflow deposition Metadata
Annotation Third party registration Personalisation Personalised
registries Personalised workflows Info repository views
Personalised annotations Personalised metadata Security Forming
experiments
- Slide 21
- A work bench for demonstrating services myView on the mIR
Workflow Metadata about workflow note about workflow
- Slide 22
- A workflow represents an experiment that can be run on the
Grid. A workflow takes data as input. It performs activities, which
are steps involved in analysing the data, including using tools and
services, querying databases and running other workflows. A
workflow can be run on the user's local machine, or remotely, taking
advantage of resources that are distributed. Data intensive grid
having to deal with heterogeneity of the data and processes.
Workflows
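A minimal sketch of the workflow idea described on this slide: input data passes through a sequence of activities, and a workflow can itself be used as an activity inside another workflow. The names `Activity`, `run_workflow`, `as_activity`, `clean` and `reverse` are hypothetical and are not part of myGrid.

```python
from typing import Callable, List

# An activity is any step that transforms data: a tool, a service call,
# a database query, or another workflow (hypothetical type alias).
Activity = Callable[[str], str]


def run_workflow(activities: List[Activity], data: str) -> str:
    """Feed the input through each activity in order and return the result."""
    for activity in activities:
        data = activity(data)
    return data


def as_activity(activities: List[Activity]) -> Activity:
    """Wrap a whole workflow so it can be used as a step in another one."""
    return lambda data: run_workflow(activities, data)


# Hypothetical stand-ins for real tools and services.
def clean(seq: str) -> str:
    return seq.strip().upper()


def reverse(seq: str) -> str:
    return seq[::-1]


inner = as_activity([clean])
result = run_workflow([inner, reverse], "  acgt ")
```

Running activities locally or remotely would only change how each `Activity` is implemented, not the shape of the loop.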
- Slide 23
- myGrid schematic Graves disease scenario Workbench Workflow
editor Event Notification Workflow Enactment Information repository
Service Registry Knowledge management Text services Bio services
Distributed query processing Services Core components Generic
Applications Exemplars Talisman SoapLab Gateway
- Slide 24
- Notification Service Knowledge Services DB2 Registry Service
Oriented Architecture Semantic registration Service Structural
registration Knowledge Service Ontology Server Reasoner Matcher
Registry DB2 Workflow templates DataProvenance mInfo Repository
Workflow enactment engine Workflow instances Build/Edit Workflow
Service Discovery Test Data Notification Service WSFL JMS
Distributed Query Processor Information Extraction PASTA Job
Execution SoapLab mIR Provenance service Component Discovery
MetadataConcepts Registry View UDDI UDDI-M
- Slide 25
- myGrid Deployment
- Slide 26
- myGrid 0.4 (Nov 2003) Describer (MAN): A tool for attaching
semantic descriptions to WS and workflows Find Service (MAN): A
component for classifying and discovering services and workflows
via their semantic descriptions Ontology Server (MAN): The DAML+OIL
reasoner Workbench (NOT): a NetBeans module for examining and
updating the MIR and submitting workflows for enactment e-Science
Gateway (NOT): An API giving access to myGrid core services MIR
(myGrid information repository) (MAN/NEW): A Web Service accessing
a repository that can hold data for an individual scientist or a
team of scientists. Notification Service (IAM): A general-purpose
Web Service that supports a publish/subscribe model of event
notification, based on JMS Registry View service (IAM): A Web
Service supporting a registry of published Web Services and
workflows annotated with metadata, including semantic descriptions
Freefluo (ITI): workflow enactment engine Taverna (EBI): workflow
editing environment
- Slide 27
- Overview Bioinformatics background myGrid facts Services and
Workflows Provenance in myGrid Beyond myGrid Provenance
Architectural vision Conclusions
- Slide 28
- Provenance: definition Main Entry: provenance Pronunciation:
'präv-n&n(t)s, 'prä-v&-"nän(t)s Function: noun Etymology:
French, from provenir to come forth, originate, from Latin
provenire, from pro- forth + venire to come -- more at PRO-, COME
Date: 1785 1 : ORIGIN, SOURCE 2 : the history of ownership of a
valued object or work of art or literature
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Provenance Experiment is repeatable, if not reproducible, and
explained by provenance records Who, what, where, why, when,
(w)how? The traceability of knowledge as it evolves and as it is
derived. Immutable metadata Migration travels with its data but may
not be stored with it. Private vs Shared provenance records.
Credit. Provenance is related to:
- Slide 34
- A full provenance record is linked with the results. It's a log
of execution. Early Provenance Capture
- Slide 35
- Kinds of Provenance Backward Derivation An explanation of when,
by whom, how something was produced. Linking items, usually in a
directed graph. Execution Process-centric To be contrasted with
forward derivation, which is a path like a workflow, script or
query. [Figure: nodes labelled with parameter settings (mass, decay,
stability, event, plot, LowPt, HighPt)]
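A backward derivation can be sketched as a walk over a directed graph, from an item back to the items it was produced from. This is only an illustration: the graph contents and the names `derived_from` and `backward_derivation` are hypothetical placeholders, not taken from the slide's figure.

```python
from typing import Dict, List, Tuple

# item -> (process that produced it, items it was derived from).
# All names are hypothetical placeholders.
derived_from: Dict[str, Tuple[str, List[str]]] = {
    "plot": ("render", ["events"]),
    "events": ("simulate", ["parameters"]),
}


def backward_derivation(item: str) -> List[Tuple[str, str, List[str]]]:
    """Walk from an item back to its sources, listing each production step."""
    steps = []
    pending = [item]
    seen = set()
    while pending:
        current = pending.pop()
        if current in seen or current not in derived_from:
            continue  # already visited, or a source with no recorded origin
        seen.add(current)
        process, inputs = derived_from[current]
        steps.append((current, process, inputs))
        pending.extend(inputs)
    return steps
```

A forward derivation would read the same edges in the other direction, from inputs towards the results a workflow, script or query is expected to produce.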
- Slide 36
- Kinds of Provenance Annotations Attached to items or
collections of items, in a structured, semi-structured or free
text form. Annotations on one item or linking items. An explanation
of why, when, where, who, what, how. Data-centric
- Slide 37
- Kinds of Provenance in myGrid Derivations Workflow Enactment
Engine provides a detailed provenance record stored in the myGrid
Information Repository (mIR) describing what was done, with what
services and when XML document, soon to be an RDF model Annotations
Every mIR object has Dublin Core provenance properties described in
an attribute value model
- Slide 38
- Provenance of data Operational execution trail process start
time end time lsid:HGVBase_retrieve input by_service urn: Claire
Jennings run_for output Gene:AC005412.6SNP:000010197
- Slide 39
- From Provenance to Knowledge Gene:AC005412.6SNP:000010197
process start time end time lsid:HGVBase_retrieve input by_service
urn: Claire Jennings run_for output
contains_single_nucleotide_polymorphism as stated by Declarative
semantic execution trail
- Slide 40
- From Provenance to Knowledge Gene:AC005412.6SNP:000010197
process start time end time lsid:HGVBase_retrieve input by_service
urn: Claire Jennings run_for output
contains_single_nucleotide_polymorphism as stated by Trust and
attribution urn: Carole Goble disputed by
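The three layers shown on these slides (operational trail, declarative knowledge, trust and attribution) can be pictured as subject-predicate-object statements. The sketch below uses plain Python tuples rather than an RDF library. Only the predicate names echo the slides; every identifier is a placeholder, and the way the statements connect is an assumption, since the diagram cannot be recovered from the transcript.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Placeholder identifiers; only the predicate names come from the slides.
trail: List[Triple] = [
    # Operational execution trail.
    ("process:1", "by_service", "service:retrieve"),
    ("process:1", "run_for", "person:A"),
    ("process:1", "input", "gene:X"),
    ("process:1", "output", "snp:Y"),
    # Knowledge layered on top of the trail.
    ("gene:X", "contains_single_nucleotide_polymorphism", "snp:Y"),
    # Trust and attribution (assumed to attach to a statement node).
    ("statement:1", "as_stated_by", "process:1"),
    ("statement:1", "disputed_by", "person:B"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]
```

For example, `match(trail, subject="process:1")` returns the operational facts about one process, while `match(trail, predicate="disputed_by")` finds statements that someone has contested.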
- Slide 41
- Provenance vs Provenance vs Annotation Provenance of an
annotation Annotation of Provenance Provenance vs Workflow
Provenance describes past execution A workflow is a script for
future execution
- Slide 42
- What is Provenance? Annotations may be subject of
interpretation (e.g. Alice believes annotation X, whereas Bob does
not). Provenance should aim at recording an undisputed view of an
execution.
- Slide 43
- What is Provenance? Provenance traces execution Provenance must
be generated automatically Annotations can be either generated
automatically or created by the user Annotations can contain
semantic augmentation, which can be derived automatically or
supplied manually.
- Slide 44
- Generating provenance RDF registry mIR FreeFluo WFEE Workflow
execution Template OWL descriptions Identify workflow Execution
Provenance log Data and metadata from the run Input data &
parameters startTime, endTime, service instances invoked Bind
services Knowledge Provenance log Workflow knowledge template
RDF+OWL Knowledge arising from workflow Scufl
- Slide 45
- Overview Bioinformatics background myGrid facts Services and
Workflows Provenance in myGrid Beyond myGrid Provenance
Architectural vision Conclusions
- Slide 46
- Provenance in a Bioinformatics Grid myGrid builds a
personalised problem-solving environment that helps
bioinformaticians find, adapt, construct and execute in silico
experiments Provenance in Drug Discovery process: FDA requirement
on drug companies to keep a record of provenance of drug discovery
as long as the drug is in use (up to 50 years sometimes).
- Slide 47
- Provenance in Aerospace Engineering Provenance requirement: to
maintain a historical record of outputs from each sub-system
involved in simulations. Aircraft provenance data need to be kept
for up to 99 years when sold to some countries. Currently, little
direct support is available for this.
- Slide 48
- Provenance in Organ Transplant Management Decision support
systems for organ and tissue transplant rely on a wide range of
data sources, patient data, and doctors' and surgeons' knowledge
Heavily regulated domain: European, national, regional and site
specific rules govern how decisions are made. Application of these
rules must be ensured, be auditable and may change over time
Provenance allows tracking previous decisions: crucial to maximise
the efficiency in matching and recovery rate of patients
- Slide 49
- The Grid and Virtual Organisations The Grid problem is defined
as coordinated resource sharing and problem solving in dynamic,
multi-institutional virtual organisations [FKT01]. Effort is
required to allow users to place their trust in the data produced
by such virtual organisations Understanding how a given service is
likely to modify data flowing into it, and how this data has been
generated is crucial.
- Slide 50
- Provenance and Virtual Organisations Given a set of services in
an open grid environment that decide to form a virtual organisation
with the aim of producing a given result, how can we determine the
process that generated the result, especially after the virtual
organisation has been disbanded? The lack of information about the
origin of results does not help users to trust such open
environments.
- Slide 51
- Provenance and Workflows Workflow enactment has become popular
in the Grid and Web Services communities Workflow enactment can be
seen as a scripted form of virtual organisation. The problem is
similar: how can we determine the origin of enactment results?
- Slide 52
- Provenance: Definition Provenance is some data able to explain
how a particular result has been derived. In a service-oriented
architecture, provenance identifies what data is passed between
services, what services are available, and what results are
generated for particular sets of input values, etc. Using
provenance, a user can trace the process that led to the
aggregation of services producing a particular output.
- Slide 53
- Overview Bioinformatics background myGrid facts Services and
Workflows Provenance in myGrid Beyond myGrid Provenance
Architectural vision Conclusions
- Slide 54
- What is the problem? Provenance recording should be part of the
infrastructure, so that users can elect to enable it when they
execute their complex tasks over the Grid or in Web Services
environments. Currently, the Web Services protocol stack and the
Open Grid Services Architecture do not provide any support for
recording provenance.
- Slide 55
- Architectural Vision
- Slide 56
- Provenance gathering is a collaborative process that involves
multiple entities, including the workflow enactment engine, the
enactment engine's client, the service directory, and the invoked
services. Provenance data will be submitted to one or more
provenance repositories acting as storage for provenance data. Upon
user request, some analysis, navigation and reasoning over
provenance data can be undertaken.
- Slide 57
- Architectural Vision Storage could be achieved by a provenance
service. Provenance service would provide support for analysis,
navigation or reasoning over provenance Client side support for
submitting provenance data to the provenance service.
- Slide 58
- A First Prototype (Szomszor, Moreau 03) A service-oriented
architecture for provenance support in Grid and Web Services
environments, based on the idea of a provenance service; A
client-side API for recording provenance data for Web Service
invocation; A data model for storing provenance data; A server-side
interface for querying provenance data; Two components making use
of provenance: provenance browsing and provenance validation.
- Slide 59
- Prototype Overview
- Slide 60
- Prototype Sequence Diagram
- Slide 61
- To identify the interactions between provenance service, client
side library and enactment engine Creation of a session Need to be
able to support the most complex workflows including conditional
branching, iteration, recursion and parallel execution. Support
asynchronous submission of provenance data so that provenance
submission does not delay workflow execution.
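One way to meet the asynchronous-submission requirement is for the client-side library to queue provenance entries and let a background worker deliver them. The sketch below is an assumption about how this could look, not the prototype's actual design. `ProvenanceStore` and `AsyncRecorder` are hypothetical names, and the store is an in-memory stand-in for a remote provenance service.

```python
import queue
import threading
from typing import Dict, List


class ProvenanceStore:
    """In-memory stand-in for a provenance service (hypothetical)."""

    def __init__(self) -> None:
        self.sessions: Dict[str, List[dict]] = {}
        self._lock = threading.Lock()

    def record(self, session_id: str, entry: dict) -> None:
        with self._lock:
            self.sessions.setdefault(session_id, []).append(entry)


class AsyncRecorder:
    """Client-side helper: submit() returns at once, a worker does the sending."""

    def __init__(self, store: ProvenanceStore, session_id: str) -> None:
        self._store = store
        self._session_id = session_id
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, entry: dict) -> None:
        self._queue.put(entry)  # does not wait for the store

    def _drain(self) -> None:
        while True:
            entry = self._queue.get()
            try:
                if entry is None:  # sentinel: stop the worker
                    return
                self._store.record(self._session_id, entry)
            finally:
                self._queue.task_done()

    def close(self) -> None:
        """Flush everything still queued, then stop the worker."""
        self._queue.put(None)
        self._queue.join()
        self._worker.join()


store = ProvenanceStore()
recorder = AsyncRecorder(store, session_id="session-1")
for step in range(3):
    recorder.submit({"step": step})
recorder.close()
```

A single worker keeps entries in submission order within a session. Parallel workflow branches could each use their own recorder against the same session.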
- Slide 62
- Prototype Provenance Data Model
- Slide 63
- Must support recording of all information necessary to replay
execution Must support all complex forms of workflows (recursion,
iterations, parallel execution).
- Slide 64
- Prototype Provenance Browser
- Slide 65
- Discussion In order for provenance data to be useful, we expect
such a protocol to support some classical properties of distributed
algorithms. Using mutual authentication, an invoked service can
ensure that it submits data to a specific provenance server, and
vice-versa, a provenance server can ensure that it receives data
from a given service. With non-repudiation, we can retain evidence
of the fact that a service has committed to executing a particular
invocation and has produced a given result. We anticipate that
cryptographic techniques will be useful to ensure such
properties
- Slide 66
- Towards Trust
- Slide 67
- Using the provenance of data, trust metrics of the data can be
derived from: Trust the user places in invoked services Trust the
user places in the input data Trust the user places in the enacted
workflow Trust the user places in the enactor Trust the user places
in the provenance service.
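The slide lists the inputs to a trust metric but not how to combine them. As a purely illustrative assumption, the sketch below multiplies five per-factor values in the range 0 to 1. The names `FACTORS` and `trust_in_result` are hypothetical.

```python
from typing import Dict

# The five sources of trust listed on the slide (key names are assumptions).
FACTORS = ("services", "input_data", "workflow", "enactor", "provenance_service")


def trust_in_result(trust: Dict[str, float]) -> float:
    """Combine per-factor trust values in [0, 1] by multiplying them."""
    score = 1.0
    for factor in FACTORS:
        value = trust[factor]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"trust in {factor} must be between 0 and 1")
        score *= value
    return score
```

With a product, a single untrusted factor drives the overall score to zero. Taking the minimum, or a weighted combination, would be equally plausible choices.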
- Slide 68
- The purpose of project PASOA is to investigate provenance in Grid
architectures Funded by EPSRC under the fundamental computer
science for e-Science call In collaboration with Cardiff
www.pasoa.org
- Slide 69
- Conclusion Provenance is a rather unexplored domain Strategic
to bring trust in open environment Necessity to design a
configurable architecture capable of supporting multiple requirements
from very different application domains. Need to further
investigate the algorithmic foundations of provenance, which will
lead to scalable and secure industrial solutions.
- Slide 70
- Publications [SM03] Martin Szomszor and Luc Moreau. Recording
and reasoning over data provenance in web and grid services. In
International Conference on Ontologies, Databases and Applications
of SEmantics (ODBASE'03), volume 2888 of Lecture Notes in Computer
Science, pages 603-620, Catania, Sicily, Italy, November 2003. [MCS+03]
Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf Hempel, Omer
Rana, Lazslo Varga, Ulises Cortes, and Steven Willmott.
Provenance-based trust for grid computing - position paper. 2003.
[GGS+03] Mark Greenwood, Carole Goble, Robert Stevens, Jun Zhao,
Matthew Addis, Darren Marvin, Luc Moreau, and Tom Oinn. Provenance
of e-science experiments - experience from bioinformatics. In
Proceedings of the UK OST e-Science second All Hands Meeting 2003
(AHM'03), pages 223-226, Nottingham, UK, September 2003.
- Slide 71
- Acknowledgements The myGrid Southampton Team: Simon Miles, Juri
Papay, Ananth Krishna, Michael Luck, David De Roure, Terry Payne
Mark Greenwood, Carole Goble, Manchester Martin Szomszor,
Southampton Syd Chapman, IBM Omer Rana, Cardiff Andreas Schreiber
and Rolf Hempel, DLR Lazslo Varga, SZTAKI Ulises Cortes and Steven
Willmott, UPC
- Slide 72
- www.mygrid.org.uk