Provenance in the Dynamic, Collaborative New Science
Dr Jun Zhao
Department of Zoology
University of [email protected]
Technological infrastructure for the preservation and efficient retrieval and reuse of scientific workflows in a range of disciplines
Packaging, preserving and publishing
Astronomy Use Case: A Repeater's Story
● Dealing with large amounts of tabular data
● Many small scripts, to avoid creating a black-box process
● Local resource sharing; public access only after publication
● Data must be frequently updated from external data repositories
● Data updates must be tested before being executed
● Data must be stored locally with versioning
● “... we don't like to spread [the tasks] and lose controls who is doing what ...”
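The astronomy requirements that data updates are tested before being executed and that data is stored locally with versioning can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `VersionedTable`, the check helpers, and the column names `object_id` and `magnitude` are hypothetical and do not come from the Wf4Ever project.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


@dataclass
class VersionedTable:
    """Local copy of an external table; each accepted update is kept as a new version."""
    versions: List[List[Row]] = field(default_factory=list)

    def propose_update(self, rows: List[Row], checks: List[Check]) -> bool:
        """Test the candidate rows first; store them only if every check passes."""
        if not all(check(rows) for check in checks):
            return False  # rejected: the stored versions are left untouched
        self.versions.append([dict(row) for row in rows])  # copy to freeze this version
        return True

    def latest(self) -> List[Row]:
        """Return the most recently accepted version (empty list if none yet)."""
        return self.versions[-1] if self.versions else []


def not_empty(rows: List[Row]) -> bool:
    """Check that the update actually contains rows."""
    return len(rows) > 0


def has_columns(*names: str) -> Check:
    """Build a check that every row carries all of the named columns."""
    return lambda rows: all(all(n in row for n in names) for row in rows)


# Hypothetical column names, used only for illustration.
table = VersionedTable()
checks = [not_empty, has_columns("object_id", "magnitude")]
accepted = table.propose_update([{"object_id": "A1", "magnitude": 12.3}], checks)
rejected = table.propose_update([{"object_id": "A2"}], checks)  # missing a column
```

Keeping every accepted version, rather than overwriting, is what lets a repeater go back to the exact data a previous run used.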
Research Objects
● Aggregation – Pointers or literals of internal and external content;
● Identity – Equivalence, equality;
● Metadata – A reusable object;
● Lifecycle – Stages of development. Impacts on available functionality;
● Versioning – Recording changes;
● Security – Access, authentication, ownership, trust;
● Graceful Degradation of Understanding – Opaque RO domain content.
● Mixed stewardship
● Provenance
  ● Of compound objects
  ● Of evolutions
  ● Of dynamic objects and static objects
ROs are Content Aware Objects that bundle things together
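As a rough illustration of the aggregation, lifecycle, versioning and graceful-degradation ideas listed for Research Objects, the following Python sketch bundles pointers and literals under one identifier. It is not the Wf4Ever Research Object model: the class names, fields, the `"draft"` stage label and the `example.org` URI are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Resource:
    """One aggregated item: a pointer (URI) to content or an inline literal."""
    name: str
    uri: Optional[str] = None
    literal: Optional[str] = None


@dataclass
class ResearchObject:
    identifier: str                                          # identity
    stage: str = "draft"                                     # lifecycle (hypothetical label)
    resources: List[Resource] = field(default_factory=list)  # aggregation
    metadata: Dict[str, str] = field(default_factory=dict)   # annotations
    history: List[str] = field(default_factory=list)         # versioning: recorded changes

    def aggregate(self, resource: Resource, note: str) -> None:
        """Add a resource and record the change in the history."""
        self.resources.append(resource)
        self.history.append(note)

    def describe(self) -> List[str]:
        """Graceful degradation: list what is bundled without interpreting its content."""
        return [f"{r.name} ({'pointer' if r.uri else 'literal'})" for r in self.resources]


ro = ResearchObject("ro-1")
ro.aggregate(Resource("workflow", uri="http://example.org/workflow/1"), "added workflow")
ro.aggregate(Resource("hypothesis", literal="placeholder hypothesis text"), "added hypothesis")
summary = ro.describe()
```

A consumer that does not understand the domain content can still call `describe()` and see what the object contains, which is the point of treating the content as opaque.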
http://www.wf4ever-project.org
Biology Use Case: A Reuser's Story
● Take a set of genes from the results of gene experiments performed by others, as reported in a scientific paper
● Perform 'dry' analysis to understand which genes and which biological processes were disturbed by which chemical compounds
● Basic Affymetrix data processing
● Statistical analysis to identify genes that are significantly differentially expressed under different conditions (with/without the compounds)
● Find the pathways that are most prominent among the filtered genes
Biology Use Case: A Reuser's Story
● Search for existing experiments from myExperiment (http://myexperiment.org)
● Challenge: Understand the workflow
● Perform test runs with test data and his own data
● Read others' logs
● Read annotations to workflows
● Reuse scripts from colleagues and perform tests that his colleagues are familiar with
How Can It be Supported?
● A reference to the source of the data and the people to acknowledge for it
● The initial hypothesis
● The conceptual workflow or a summary of the experiment plan
● References to workflows that were tested, with comments on their application for the user's use case
● The user's own workflow, possibly with a backlog of previous versions that the user wishes to keep for reference (with notes and comments)
● The runs of the user's own workflow, the results and the recorded steps that led to the results, in some cases with comments for later reference (e.g. 'here I used parameter A, next time I may try B')
● The final hypothesis, with comments.
● A reference to the results of the workflow
● Design logs that record the user's considerations while making the workflow
● Run logs that record the user's considerations while running and interpreting the workflow
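One way to make the list above checkable is to treat it as a manifest and report which items a bundle still lacks. The sketch below assumes a plain dictionary as the bundle; the key names and the function `missing_items` are hypothetical and serve only to mirror the bullets.

```python
from typing import Dict, List

# Hypothetical keys, one per item in the list above.
REQUIRED_ITEMS = [
    "data_source",          # source of the data and people to acknowledge
    "initial_hypothesis",
    "conceptual_workflow",  # or a summary of the experiment plan
    "tested_workflows",     # references, with comments on applicability
    "own_workflow",         # possibly with previous versions
    "runs",                 # results and recorded steps
    "final_hypothesis",
    "results",
    "design_log",
    "run_log",
]


def missing_items(bundle: Dict[str, object]) -> List[str]:
    """Return the required items that are absent or empty in the bundle."""
    return [key for key in REQUIRED_ITEMS if not bundle.get(key)]


bundle = {
    "data_source": "reference to the paper",
    "initial_hypothesis": "placeholder hypothesis text",
    "own_workflow": ["v1", "v2"],
    "runs": [],  # present but empty, so still reported as missing
}
todo = missing_items(bundle)
```

An empty entry counts as missing here, so a workflow with no recorded runs is flagged in the same way as one with no run section at all.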
Where is Linked Data?
The Role of Linked Data in Wf4Ever
● Collaborative science
● Dynamic science
● Open science
Provenance Challenge
● Identity
● Context
● Storage
● Retrieval
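The storage and retrieval challenges can be made a little more concrete with a toy example that stores provenance as subject-predicate-object triples, in the spirit of Linked Data, and walks them to recover the lineage of a result. Everything here is an assumption for illustration: the file names are invented, and the predicate names only echo W3C PROV vocabulary rather than using a real RDF library.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical provenance statements about a small analysis.
triples: List[Triple] = [
    ("results.csv", "wasDerivedFrom", "filtered_genes.csv"),
    ("filtered_genes.csv", "wasDerivedFrom", "raw_expression.csv"),
    ("results.csv", "wasAttributedTo", "reuser"),
]


def lineage(item: str, store: List[Triple]) -> List[str]:
    """Follow wasDerivedFrom links from item back to every earlier source."""
    found: List[str] = []
    frontier = [item]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in store:
            if subject == current and predicate == "wasDerivedFrom" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


sources = lineage("results.csv", triples)
```

Because each statement is a separate triple, new provenance can be added by different people without restructuring what is already stored, which suits mixed stewardship.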
Take home
● Provenance should be user-driven
● Linked Data should be a means to an end
● http://www.wf4ever-project.org
Acknowledgement
● Marco Roos of Leiden University (NL) and Jose Enrique Ruiz of Instituto de Astrofísica de Andalucía (Spain)
● Carole Goble of University of Manchester (UK) and Jose Manuel Gomez of iSOCO (Spain)
● Hui Hua and Jenny Molly of University of Oxford (UK)