Post on 10-May-2015
Workflows, Provenance & Reporting: A Lifecycle Perspective
Professor Carole Goble FREng FBCS
The University of Manchester, UK
carole.goble@manchester.ac.uk
3rd – 6th September 2013, Rome, Italy
The Scientific and Technical Ecosystem
Mobilising Big and Broad Data
• Streaming
• Sweeps through models
• Integrative analysis
• Results synthesis
• Heavy compute
Interoperability, plugging together
• Multi-step chains, multi software / data
• Mixed resources / platforms
• Incompatibility smoothing
• Trans-disciplinary, alien processes
[DataONE]
BioSTIF
inputs: data, parameters, configurations
outputs
Workflow nutshell
• A series of automated / interactive data analysis steps
• Process data at scale
• Import data / codes from one’s own research and/or from existing libraries
• Pipelines & analytic and synthesis procedures
• Chains of components
• Bridges between resources
• Shield from change and operational complexity
• Releasing capacity
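The "series of steps" and "chains of components" idea can be sketched as a list of named functions run in order, each consuming the previous output. This is only a minimal illustration: the step names (`load_records`, `drop_missing`, `summarise`) and the toy data are assumptions, not taken from the talk or from any particular workflow system.

```python
from typing import Any, Callable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]

def run_workflow(steps: List[Step], data: Any) -> Tuple[Any, List[str]]:
    """Run each step on the output of the previous one and keep a simple log."""
    log = []
    for name, func in steps:
        data = func(data)
        log.append(f"{name}: ok")
    return data, log

# Hypothetical components; a real workflow would wrap services, codes or libraries.
def load_records(_: Any) -> List[Any]:
    return [3.0, None, 5.0, 4.0]

def drop_missing(values):
    return [v for v in values if v is not None]

def summarise(values):
    return {"n": len(values), "mean": sum(values) / len(values)}

steps = [("load", load_records), ("clean", drop_missing), ("summarise", summarise)]
result, log = run_workflow(steps, None)
print(result)
print(log)
```

Swapping one entry in `steps` changes a component without touching the rest of the chain, which is one reading of "shield from change".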
[Diagram labels: Users; Workflows; Application Services; Services; Resources; Provisioning; Composition; Incorporation; Invocation]
Applications
• Applications as components of workflows
• Compose applications into workflows
• Incorporate workflows into applications
Infrastructure
• Provision physical resources to support application workflows
• Coordinate resources through workflows
• Optimise and adapt to change
[Foster 2005]
Workflows
WFMS
Assembly of Components
Interoperability
Covering up incompatibility
Flexible variation
Stabilising
Optimising
Workflows: maturing approach
Underpin integrative platforms.
Established in many disciplines, notably chemistry and biology, esp. ‘omics: assembly, synthesis, annotation, analytics.
Overlaps with metagenomics, phylogenetics and genetic ecology
Powering service based science and science as a service http://www.globus.org/genomics/solution
Sandve, Nekrutenko, Taylor, Hovig. Ten simple rules for reproducible in silico research. PLoS Comp Bio, submitted.
Ecological Niche modelling, population modelling, Metagenomics and Phylogenetics ‘omics pipelines and analytic workflows http://www.biovel.eu
Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis http://camera.calit2.net/index.shtm
Combine species occurrence data with global climate, terrain and land cover information, to identify environmental correlates of species ranges. http://www.lifemapper.org/species
BioDiversity
Taxonomic Data Refinement
www.biovel.eu
• Synonym expansion
• Taxonomic name resolution
• Occurrence retrieval
• Spell checking
• Geographic and taxonomic cleaning
• Temporal refinement
• Data processing log
[Matthias Obst, INTECOL 2013]
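A few of the refinement steps listed above (name resolution, geographic cleaning, temporal refinement, processing log) can be sketched as a single pass over occurrence records. This is an illustration only, not the BioVeL implementation: the record fields, the coordinates, the years and the `min_year` cut-off are all made up for the example, and the synonym table holds a single pair.

```python
SYNONYMS = {"Bupalus piniaria": "Bupalus piniarius"}  # illustrative lookup table

def refine(records, min_year):
    """Resolve names, drop invalid coordinates and old records; log every decision."""
    kept, log = [], []
    for rec in records:
        name = SYNONYMS.get(rec["name"], rec["name"])
        if name != rec["name"]:
            log.append(f"resolved '{rec['name']}' to '{name}'")
        if not (-90 <= rec["lat"] <= 90 and -180 <= rec["lon"] <= 180):
            log.append(f"dropped {name}: coordinates out of range")
            continue
        if rec["year"] < min_year:
            log.append(f"dropped {name}: year {rec['year']} before {min_year}")
            continue
        kept.append({**rec, "name": name})
    return kept, log

# Hypothetical occurrence records, invented for this sketch.
records = [
    {"name": "Bupalus piniaria", "lat": 60.2, "lon": 24.9, "year": 2005},
    {"name": "Ips typographus", "lat": 95.0, "lon": 10.0, "year": 2010},
    {"name": "Tortrix viridana", "lat": 48.1, "lon": 11.6, "year": 1950},
]
cleaned, log = refine(records, min_year=1980)
for line in log:
    print(line)
```

Keeping the log next to the cleaned data means every dropped or renamed record can be traced afterwards.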
Data Operations in Workflows in the Wild
Analysis of 260 publicly available workflows in Taverna, WINGS, Galaxy and Vistrails
Garijo et al. Common Motifs in Scientific Workflows: An Empirical Analysis. FGCS, in press.
Large Scale Ecological Niche Modeling Workflow
Step 1: Explorative modeling
- Use unfiltered data
- Use fixed parameters: Mahalanobis distance (Farber and Kadmon 2003)
- Native projections
- Test the model, distribution of points, number of points
Step 2: Deep modeling
- Filtering environmentally unique points with BioClim algorithm (Nix 1986)
- ENM with Support Vector Machine (Cristianini & Shawe-Taylor 2000) and Maximum Entropy (Phillips 2004)
- Parameter optimization (if necessary) on the model test results
- 2 masks (model generate, model project)
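Step 1 names the Mahalanobis distance, which measures how far a point lies from the mean of a set of points, scaled by their covariance. The sketch below computes it for two environmental variables using the explicit 2x2 matrix inverse. It is a generic textbook calculation, not the niche-modelling code used in the workflow, and the sample points are invented.

```python
import math

def mean_and_cov_2d(points):
    """Return the mean vector and 2x2 sample covariance of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance of `point` from `mean` under covariance `cov`."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse written out.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical environmental values at four occurrence points.
occurrences = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = mean_and_cov_2d(occurrences)
print(mahalanobis_2d((3.0, 1.0), mean, cov))
```

A small distance suggests conditions similar to those where the species was observed; how that is turned into a suitability score is left to the modelling tool.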
Data discovery
Data assembly, cleaning, and refinement
Ecological Niche Modeling
Statistical analysis
Analytical cycle
Pilumnus hirtellus
Enclosed sea problem (Ready et al., 2010)
[Matthias Obst, INTECOL 2013]
Workflow-enabled science
• Common Templates
• Prepared components
• Systematic assembly
• (Steered) automation
• Hybrid combinations
• Variations
• Extensibility
• Customisation
• Parameterisation
• Repeats
• Cross-run synthesis
• Routine, pooled methods
• Tracking
Repeated model sweeps
Ten insect species were modelled:
• European spruce bark beetle - Ips typographus L.
• Bordered white moth (syn. pine looper) - Bupalus piniarius L. (syn. B. piniaria L.)
• Pine-tree lappet - Dendrolimus pini L.
• Mottled umber - Erannis defoliaria Clerck
• Nun moth - Lymantria monacha L.
• Winter moth - Operophtera brumata L.
• Pine beauty moth - Panolis flammea Den. & Schiff
• Green oak tortrix - Tortrix viridana L.
• European pine sawfly - Neodiprion sertifer Geoffr.
• Common pine sawfly - Diprion pini L.
[Images: Tortrix viridana; Lymantria monacha. Image by Kimmo & Seppo Silvonen]
data
configuration
parameters
steps
[Päivi Lyytikäinen-Saarenmaa presentation, INTECOL 2013]
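A repeated model sweep amounts to running the same steps for every species under every combination of configuration parameters. The sketch below shows that loop. `run_model` is a placeholder that does no modelling, and the parameter names and values in `PARAMETER_GRID` are assumptions chosen for illustration.

```python
from itertools import product

SPECIES = ["Ips typographus", "Lymantria monacha", "Tortrix viridana"]
# Hypothetical parameter grid; names and values are illustrative only.
PARAMETER_GRID = {"algorithm": ["svm", "maxent"], "background_points": [1000, 10000]}

def run_model(species, algorithm, background_points):
    """Placeholder for one workflow run; returns a record, not a real model."""
    return {"species": species, "algorithm": algorithm,
            "background_points": background_points, "status": "done"}

def sweep(species_list, grid):
    """Run every species against every combination of parameter values."""
    keys = sorted(grid)
    runs = []
    for species in species_list:
        for values in product(*(grid[k] for k in keys)):
            runs.append(run_model(species, **dict(zip(keys, values))))
    return runs

runs = sweep(SPECIES, PARAMETER_GRID)
print(len(runs))
```

Because every run is recorded with its species and parameters, results can later be pooled or compared across runs.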
http://www.jisc.ac.uk/whatwedo/campaigns/res3/jischelp.aspx
Workflows
workflows
results
provenance
process (log)
results (origin)
Reporting
Record of science
Reproducibility
Transparent process
Integrate with reporting systems
Know how
Training
See Penev presentation
Provenance: the link between computation and results
• W3C PROV model standard
• record for reporting
• compare diffs/discrepancies
• provenance analytics
• track changes, adapt
• partial repeat/reproduce
• carry attributions
• compute credits
• compute data quality/trust
• select data to keep/release
• optimisation and debugging
PDIFF: comparing provenance traces to diagnose divergence across experimental results [Woodman et al, 2011]
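The idea of comparing provenance traces can be illustrated by recording each run as a set of statements and taking set differences. The relation names `used` and `wasGeneratedBy` come from W3C PROV; everything else (the activity name, file names, parameter encoding) is hypothetical. This is a plain set comparison, not the PDIFF algorithm.

```python
def make_trace(activity, inputs, params, output):
    """Describe one run as PROV-style (relation, subject, object) statements."""
    statements = set()
    for item in inputs:
        statements.add(("used", activity, item))
    for key, value in params.items():
        statements.add(("used", activity, f"param:{key}={value}"))
    statements.add(("wasGeneratedBy", output, activity))
    return statements

def diff_traces(a, b):
    """Return the statements that appear in only one of the two traces."""
    return {"only_in_a": sorted(a - b), "only_in_b": sorted(b - a)}

# Two hypothetical runs that differ only in the input data version.
run_a = make_trace("enm_fit", ["occurrences_v1.csv"], {"algorithm": "svm"}, "map.tif")
run_b = make_trace("enm_fit", ["occurrences_v2.csv"], {"algorithm": "svm"}, "map.tif")
delta = diff_traces(run_a, run_b)
print(delta)
```

If two runs give different results, the statements left in `delta` are the first place to look for the cause.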
[Freire] http://www.aosabook.org/en/vistrails.html
Collecting -> Using Provenance
Instrumenting, cross-tool interoperability
Reporting at different scales
Publishing with Provenance
Summary: Infrastructure Productivity
Environment: Legacy, others and your own software, datasets, services, codes, and platforms. Optimise and manage use of computing infrastructure, HPC, clouds and platforms.
WFMS middleware: Support the design, config. and execution of workflows. Manage utility actions for data, logging, security, compute, errors… Shield incompatibilities / complexity / change.
Workflow: Parameterised, integrative, multi-step (data) pipelines, analytics, computational protocols that can be repetitively reused. Dependency-rich interoperability.
Apps: Domain/task specific apps that incorporate (an ecosystem of) workflows.
Integrate
Customise
Process
Summary: User Productivity: Capability Raising
Access: Framework to access and leverage heterogeneous legacy applications, services, datasets and codes and combine with yours. Shielding from complexity.
Customise: Rapid development: Flexibility, Extensibility, Adaptability, Reuse. Reusable Workflow Components.
Process: Integration, reusable workflows/components. Automated plumbing + Interaction. Systematic, repetitive and unbiased analysis and processing and error handling. Ensembles, comparisons, “what ifs”.
Record: Process reporting. Citation tracking. Reproducibility, Provenance, Audit. Quality Control. Standard Operating Procedures.
Workflow Commodities
building cohorts, capturing traits, explicit reporting, clear instructions
• Workflow templates
• Workflow sets
• Libraries of sub workflow parts
• Design practices for mix, match and reuse
• Future proofed design predicting need to adapt
• Discovery and exchange
• Workflow engineers
• Workflow custodians
Seeding a workflow library
Workflow Commodities
exchanging, curating, preserving, packaging, life cycle management
http://www.researchobject.org
http://www.dcc.ac.uk
Katy’s student’s 200 hours
Tracking where data went
Workflow Commodities
getting credit, capability, engineers and custodians
Application Building
user variety, outcome focused
• Right apps, right users.
• Commodity apps:
– Web. Spreadsheets. R.
• Customisation
• Mixed workflow / scripting
• Deployment / Portability
– Web based / desktop
– Virtualised deployments
– Cloud hosted service
– A cloud-enabled local host
• Local ownership
• Capability building
[Figure: Workflow Visibility, BioDiversity. Axis labels: Concept Knowledge (Low to High); Technology/Infrastructure; Versatility (Low to High). Plotted labels: Policy makers; Domain Scientist; Computational Scientist; Technical specialists; Custom Specific Apps; General Toolkits]
Who are the users?
• Policy makers?
• Biodiversity researcher?
• Computational scientist?
• Tool developer?
• Service provider?
• Infrastructure provider?
• Digital custodian?
Workflow management systems
• Integrated into community frameworks, coupled into tools
• Virtualised (Web) Services
• Scaling, Optimisation
• Interoperability, Using provenance
• No one workflow language/system
• Specialisation & its cost
• Plug-ins for common community platforms and resources
• Mitigating and adapting to changes in infrastructures and resources.
• Sustainability and engineering
Generic
Specific
http://www.erflow.eu/
Population dynamics
The life cycle of infrastructures
• Dynamics: Mitigate, Adapt, Disperse, Die
• Standard and maintained prog. interfaces (APIs)
• Standard formats and ids
• Stability, reliability, repair
• Interoperability
• Semantic descriptions
• Sustainability of services and infrastructure
• Instrument resources for citation & microattribution
• Coupled services and infrastructure.
Impact of dependencies
[Zhao et al. Why workflows break, e-Science 2012]
Summary
Scale. Standards: data formats, programmatic interfaces. Governance.
Workflow commodities
Design practices
Credit
A seamless, pluggable service. Scale. Adaptability. Specific-Generic tension. Putting provenance to use for data credit.
Embedding workflows in common applications Integration into reporting and publishing lifecycles
BioDiversity Virtual e-Laboratory www.biovel.eu
Wf4Ever www.wf4ever-project.org
SysMO www.sysmo-db.org
SCaleable Preservation Environments http://www.scape-project.eu