Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Date post: 10-May-2015
Upload: carole-goble
Description:
Workflow systems support the design, configuration and execution of repetitive, multi-step pipelines and analytics. They are well established in many disciplines, notably biology and chemistry, but less so in biodiversity and ecology. From an experimental perspective, workflows are a means to handle the work of accessing an ecosystem of software and platforms, manage data and security, and handle errors. From a reporting perspective, they are a means to accurately document methodology for reproducibility, comparison, exchange and reuse, and to trace the provenance of results for review, credit, workflow interoperability and impact analysis. Workflows operate in an evolving ecosystem and are assemblages of components in that ecosystem; their provenance trails are snapshots of intermediate and final results. Taking a lifecycle perspective, what are the challenges in workflow design and use with different stakeholders? What needs to be tackled in evolution, resilience, and preservation? And what are the “mitigate or adapt” strategies adopted by workflow systems in the face of changes in the ecosystem/environment, for example when tools are deprecated or datasets become inaccessible in the face of funding shortfalls?
Transcript
Page 1: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Workflows, Provenance & Reporting A Lifecycle Perspective

Professor Carole Goble FREng FBCS

The University of Manchester, UK

3rd – 6th September 2013, Rome, Italy

Page 2: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

The Scientific and Technical Ecosystem

Mobilising Big and Broad Data
• Streaming
• Sweeps through models
• Integrative analysis
• Results synthesis
• Heavy compute

Interoperability, plugging together
• Multi step chains, Multi software / data
• Mixed resources / platforms
• Incompatibility smoothing
• Trans-disciplinary, Alien processes

[DataONE]

Page 3: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

BioSTIF

inputs: data, parameters, configurations

outputs

Workflow nutshell
• A series of automated / interactive data analysis steps
• Process data at scale
• Import data / codes from one’s own research and/or from existing libraries
• Pipelines & analytic and synthesis procedures
• Chains of components
• Bridges between resources
• Shield from change and operational complexity
• Releasing capacity
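
The "chain of components" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_workflow` and the three component functions are hypothetical names, not part of any workflow system mentioned in the talk, and real components would wrap services or codes rather than list operations.

```python
from typing import Any, Callable, Dict, List

# A step takes the current data plus the shared parameters and returns new data.
Step = Callable[[Any, Dict[str, Any]], Any]


def run_workflow(steps: List[Step], data: Any, params: Dict[str, Any]) -> Any:
    """Run each step in order, feeding the output of one step into the next."""
    for step in steps:
        data = step(data, params)
    return data


# Hypothetical components, standing in for real services or codes.
def drop_missing(records, params):
    """Remove records that carry no value."""
    return [r for r in records if r is not None]


def scale(records, params):
    """Multiply every record by the configured factor."""
    return [r * params["factor"] for r in records]


def total(records, params):
    """Reduce the records to a single summary value."""
    return sum(records)


# inputs: data and parameters; output: a single result
result = run_workflow([drop_missing, scale, total], [1, None, 2, 3], {"factor": 10})
print(result)
```

Swapping, adding or re-ordering entries in the step list changes the analysis without touching the components themselves, which is the sense in which a workflow shields its user from operational detail.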

Services

Resources

Page 4: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

[Figure labels: Users, Workflows, Appln Service, Provisioning, Composition, Incorporation, Invocation]

Applications
• Applications components of workflows
• Compose applications into workflows
• Incorporate workflows into applications

Infrastructure
• Provision physical resources to support application workflows
• Coordinate resources through workflows
• Optimise and adapt to change

[Foster 2005]

Workflows

Wfms

Page 5: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Assembly of Components

Interoperability
Covering up incompatibility

Page 6: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Flexible variation

Stabilising
Optimising

Page 7: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Workflows: maturing approach

Underpin integrative platforms.

Established in many disciplines, notably chemistry and biology, esp. ‘omics: assembly, synthesis, annotation, analytics.

Overlaps with metagenomics, phylogenetics and genetic ecology

Powering service based science and science as a service http://www.globus.org/genomics/solution

Sandve, Nekrutenko, Taylor, Hovig Ten simple rules for reproducible in silico research, PLoS Comp Bio submitted

Page 8: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Ecological Niche modelling, population modelling, Metagenomics and Phylogenetics ‘omics pipelines and analytic workflows http://www.biovel.eu

Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis http://camera.calit2.net/index.shtm

Combine species occurrence data with global climate, terrain and land cover information, to identify environmental correlates of species ranges. http://www.lifemapper.org/species

BioDiversity

Page 9: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Taxonomic Data Refinement

www.biovel.eu

• Synonym expansion
• Taxonomic name resolution
• Occurrence retrieval
• Spell checking
• Geographic and taxonomic cleaning
• Temporal refinement
• Data processing log
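
A minimal sketch of how refinement steps like these might be chained while keeping a data processing log. It is not the BioVeL implementation: the function names, the one-entry synonym table and the three occurrence records are invented for illustration, and a real workflow would call name resolution and occurrence services instead.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]

# Hypothetical lookup table; a real workflow would query a name resolution service.
SYNONYMS = {"Bupalus piniaria": "Bupalus piniarius"}


def resolve_names(records: List[Record]) -> List[Record]:
    """Replace known synonyms with the accepted name."""
    return [{**r, "name": SYNONYMS.get(r["name"], r["name"])} for r in records]


def geographic_clean(records: List[Record]) -> List[Record]:
    """Drop records whose coordinates fall outside valid ranges."""
    return [r for r in records if -90 <= r["lat"] <= 90 and -180 <= r["lon"] <= 180]


def temporal_refine(records: List[Record]) -> List[Record]:
    """Keep records from 1950 onwards (an arbitrary cut-off chosen for this example)."""
    return [r for r in records if r["year"] >= 1950]


def refine(
    records: List[Record], steps: List[Callable[[List[Record]], List[Record]]]
) -> Tuple[List[Record], List[Tuple[str, int, int]]]:
    """Apply each step in order and log (step name, records in, records out)."""
    log = []
    for step in steps:
        before = len(records)
        records = step(records)
        log.append((step.__name__, before, len(records)))
    return records, log


# Invented occurrence records, for illustration only.
occurrences = [
    {"name": "Bupalus piniaria", "lat": 60.2, "lon": 24.9, "year": 1998},
    {"name": "Tortrix viridana", "lat": 120.0, "lon": 10.0, "year": 2005},  # invalid latitude
    {"name": "Lymantria monacha", "lat": 52.5, "lon": 13.4, "year": 1931},  # before cut-off
]

cleaned, log = refine(occurrences, [resolve_names, geographic_clean, temporal_refine])
for entry in log:
    print(entry)
```

The log records how many records each step received and passed on, so a reader of the results can see where data was dropped.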

[Matthias Obst, INTECOL 2013]

Page 10: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Data Operations in Workflows in the Wild

Analysis of 260 publicly available workflows in Taverna, WINGS, Galaxy and Vistrails

Garijo et al Common Motifs in Scientific Workflows: An Empirical Analysis, in press, FGCS

Page 11: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Large Scale Ecological Niche Modeling Workflow

Step 1: Explorative modeling
- Use unfiltered data
- Use fixed parameters: Mahalanobis distance (Farber and Kadmon 2003)
- Native projections
- Test the model, distribution of points, number of points

Step 2: Deep modeling
- Filtering environmentally unique points with BioClim algorithm (Nix 1986)
- ENM with Support Vector Machine (Cristianini & Shawe-Taylor 2000) and Maximum Entropy (Phillips 2004)
- Parameter optimization (if necessary) on the model test results
- 2 masks (model generate, model project)
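
Step 1 names the Mahalanobis distance as its fixed modelling parameter. For reference, the standard definition is given below. The notation is assumed here rather than taken from the slides: x is the vector of environmental values at a candidate location, and μ and Σ are the mean vector and covariance matrix of those variables over the known occurrence points.

```latex
d_M(x) = \sqrt{(x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)}
```

A small distance means the location's environment resembles that of the observed occurrences, scaled by how much each variable varies and co-varies.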

Analytical cycle
• Data discovery
• Data assembly, cleaning, and refinement
• Ecological Niche Modeling
• Statistical analysis

Pilumnus hirtellus

Enclosed sea problem (Ready et al., 2010)

[Matthias Obst, INTECOL 2013]

Page 12: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Workflow-enabled science
• Common Templates
• Prepared components
• Systematic assembly
• (Steered) automation

• Hybrid combinations
• Variations
• Extensibility
• Customisation
• Parameterisation

• Repeats
• Cross-run synthesis
• Routine, pooled methods
• Tracking

Page 13: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Repeated model sweeps

Ten insect species were modelled:
• European spruce bark beetle - Ips typographus L.
• Bordered white moth (syn. pine looper) - Bupalus piniarius L. (syn. B. piniaria L.)
• Pine-tree lappet - Dendrolimus pini L.
• Mottled umber - Erannis defoliaria Clerck
• Nun moth - Lymantria monacha L.
• Winter moth - Operophtera brumata L.
• Pine beauty moth - Panolis flammea Den. & Schiff
• Green oak tortrix - Tortrix viridana L.
• European pine sawfly - Neodiprion sertifer Geoffr.
• Common pine sawfly - Diprion pini L.

[Images: Tortrix viridana; Lymantria monacha. Image by Kimmo & Seppo Silvonen]

[Figure labels: data, configuration, parameters, steps]

[Päivi Lyytikäinen-Saarenmaa presentation, INTECOL 2013]
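
A repeated model sweep amounts to running the same steps over every combination of data and configuration. The sketch below shows only that pattern: `run_model` is a hypothetical placeholder that returns its own configuration instead of fitting anything, the threshold values are invented, and only three of the ten species are listed to keep the example short.

```python
from itertools import product

# Three of the modelled species; the algorithm names are those mentioned for the ENM workflow.
species = ["Ips typographus", "Lymantria monacha", "Tortrix viridana"]
algorithms = ["SVM", "MaxEnt"]
thresholds = [0.5, 0.8]  # hypothetical parameter values


def run_model(species_name: str, algorithm: str, threshold: float) -> dict:
    """Placeholder for one model run; returns the configuration it was called with."""
    return {"species": species_name, "algorithm": algorithm, "threshold": threshold}


# One run per (species, algorithm, threshold) combination.
runs = [run_model(s, a, t) for s, a, t in product(species, algorithms, thresholds)]
print(len(runs))
```

Keeping the configuration alongside each result is what later makes cross-run synthesis and tracking possible.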

Page 14: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

http://www.jisc.ac.uk/whatwedo/campaigns/res3/jischelp.aspx

Workflows

workflows
results

provenance
• process (log)
• results (origin)

Reporting
• Record of science
• Reproducibility
• Transparent process

Integrate with reporting systems

Know how
• Training

See Penev presentation

Page 15: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Provenance the link between computation and results

W3C PROV model standard
• record for reporting
• compare diffs/discrepancies
• provenance analytics
• track changes, adapt
• partial repeat/reproduce
• carry attributions
• compute credits
• compute data quality/trust
• select data to keep/release
• optimisation and debugging

PDIFF: comparing provenance traces to diagnose divergence across experimental results [Woodman et al, 2011]
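
To make "comparing provenance traces" concrete, here is a small sketch that records what each step used and generated, then finds the first step at which two runs diverge. It borrows the PROV terms activity, used and generated as plain dictionary keys only; it is neither a W3C PROV serialisation nor the PDIFF algorithm, and every function name in it is hypothetical.

```python
import hashlib
import json


def digest(value) -> str:
    """Short, stable fingerprint of a JSON-serialisable value."""
    blob = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def run_with_trace(steps, data, params):
    """Run steps in order and record, per step, what was used and what was generated."""
    trace = []
    for step in steps:
        output = step(data, params)
        trace.append({
            "activity": step.__name__,
            "used": digest(data),
            "params": dict(params),
            "generated": digest(output),
        })
        data = output
    return data, trace


def first_divergence(trace_a, trace_b):
    """Return the name of the first activity whose record differs, or None.

    Only the steps present in both traces are compared.
    """
    for a, b in zip(trace_a, trace_b):
        if a != b:
            return a["activity"]
    return None


# Hypothetical analysis steps.
def filter_points(points, params):
    return [p for p in points if p >= params["min_value"]]


def mean(points, params):
    return sum(points) / len(points)


data = [1, 4, 6, 9]
result_a, trace_a = run_with_trace([filter_points, mean], data, {"min_value": 2})
result_b, trace_b = run_with_trace([filter_points, mean], data, {"min_value": 5})
print(result_a, result_b, first_divergence(trace_a, trace_b))
```

The two runs give different results, and the traces point to the filtering step, where the parameter differed, as the place the runs first diverge.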

Page 16: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

[Freire] http://www.aosabook.org/en/vistrails.html

Collecting -> Using Provenance
Instrumenting, cross-tool interoperability
Reporting at different scales

Page 17: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Publishing with Provenance

Page 18: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Summary: Infrastructure Productivity

Environment: Legacy, others and your own software, datasets, services, codes, and platforms. Optimise and manage use of computing infrastructure, HPC, clouds and platforms.

WFMS middleware: Support the design, config. and execution of workflows. Manage utility actions for data, logging, security, compute, errors… Shield incompatibilities / complexity / change.

Workflow: Parameterised, integrative, multi-step (data) pipelines, analytics, computational protocols that can be repetitively reused. Dependency-rich interoperability.

Apps: Domain/task specific apps that incorporate (an ecosystem of) workflows. Integrate.

Page 19: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Summary: User Productivity: Capability Raising

Access: Framework to access and leverage heterogeneous legacy applications, services, datasets and codes and combine with yours. Shielding from complexity.

Customise: Rapid development: Flexibility, Extensibility, Adaptability, Reuse. Reusable Workflow Components.

Process: Integration, reusable workflows/components. Automated plumbing + Interaction. Systematic, repetitive and unbiased analysis and processing and error handling. Ensembles, comparisons, “what ifs”.

Record: Process reporting. Citation tracking. Reproducibility, Provenance, Audit. Quality Control. Standard Operating Procedures.

Page 20: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Workflow Commodities
building cohorts, capturing traits, explicit reporting, clear instructions

• Workflow templates
• Workflow sets
• Libraries of sub workflow parts
• Design practices for mix, match and reuse
• Future proofed design predicting need to adapt
• Discovery and exchange
• Workflow engineers
• Workflow custodians

Page 21: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Seeding a workflow library

Page 22: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Workflow Commodities exchanging, curating, preserving, packaging, life cycle management

http://www.researchobject.org
http://www.dcc.ac.uk

Page 23: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Katy’s student’s 200 hours
Tracking where data went

Workflow Commodities
getting credit, capability, engineers and custodians

Page 24: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Application Building
user variety, outcome focused

• Right apps, right users.
• Commodity apps:
  – Web. Spreadsheets. R.
• Customisation
• Mixed workflow / scripting
• Deployment / Portability
  – Web based / desktop
  – Virtualised deployments
  – Cloud hosted service
  – A cloud-enabled local host
• Local ownership
• Capability building

[Figure labels: Workflow Visibility (Low to High); Versatility (Low to High); Concept Knowledge, from BioDiversity to Technology/Infrastructure; Policy makers, Domain Scientist, Computational Scientist, Technical specialists; Custom Specific Apps, General Toolkits]

Page 25: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Who are the users?
• Policy makers?
• Biodiversity researcher?
• Computational scientist?
• Tool developer?
• Service provider?
• Infrastructure provider?
• Digital custodian?

Page 26: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Workflow management systems
• Integrated into community frameworks, coupled into tools
• Virtualised (Web) Services
• Scaling, Optimisation
• Interoperability, Using provenance
• No one workflow language/system
• Specialisation & its cost
• Plug-ins for common community platforms and resources
• Mitigating and adapting to changes in infrastructures and resources.
• Sustainability and engineering

Generic

Specific

http://www.erflow.eu/

Page 27: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Population dynamics
The life cycle of infrastructures

• Dynamics: Mitigate, Adapt, Disperse, Die
• Standard and maintained prog. interfaces (APIs)
• Standard formats and ids
• Stability, reliability, repair
• Interoperability
• Semantic descriptions
• Sustainability of services and infrastructure
• Instrument resources for citation & microattribution
• Coupled services and infrastructure.

Page 28: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Impact of dependencies

[Zhao et al. Why workflows break e-Science 2012]
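
One way to reason about the impact of dependencies is to declare what each workflow step relies on and check that list against what is still available. The sketch below assumes exactly that and nothing more: the step names, resource names and the `broken_steps` helper are hypothetical, and it does not reproduce the analysis in the cited study.

```python
from typing import Dict, List, Set


def broken_steps(dependencies: Dict[str, List[str]], available: Set[str]) -> Dict[str, List[str]]:
    """Map each step to the dependencies it needs that are no longer available."""
    report = {}
    for step, needs in dependencies.items():
        missing = [d for d in needs if d not in available]
        if missing:
            report[step] = missing
    return report


# Hypothetical workflow: each step lists the external resources it relies on.
dependencies = {
    "retrieve_occurrences": ["occurrence_service"],
    "resolve_names": ["name_service", "synonym_dataset"],
    "build_model": ["modelling_service"],
}

# Suppose one dataset has been withdrawn since the workflow was written.
available = {"occurrence_service", "name_service", "modelling_service"}

report = broken_steps(dependencies, available)
print(report)
```

A report like this tells a workflow custodian which steps need a substitute resource (adapt) or a cached copy (mitigate) before the workflow can be rerun.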

Page 29: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Summary

Scale. Standards: data formats, programmatic interfaces. Governance.

Workflow commodities. Design practices. Credit.

A seamless, pluggable service. Scale. Adaptability. Specific-Generic tension. Putting provenance to use for data credit.

Embedding workflows in common applications Integration into reporting and publishing lifecycles

Page 30: Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

BioDiversity Virtual e-Laboratory: www.biovel.eu

Wf4Ever: www.wf4ever-project.org

SysMO: www.sysmo-db.org

SCaleable Preservation Environments: http://www.scape-project.eu
