Date post: | 30-Dec-2015 |
Using explicit control processes in distributed workflows to gather provenance
Sergio M. S. Cruz, Fernando Seabra Chirigati, Rafael Dahis, Maria Luiza M. Campos, Marta Mattoso
Federal University of Rio de Janeiro, Brazil
UFRJ
Agenda
• Introduction
• Motivation: Control flow in data-centric workflows
• Objective: Provenance gathering in distributed workflows with explicit control flows
• Use case: Control flow on VisTrails
• Conclusion
Distribution & Heterogeneity in Workflows
• Scientific workflows enable data-intensive analyses
  - Use of grid vs. remote parallel machines
  - Use of different WfMS
    - Different provenance capture mechanisms
  - Use of centralized vs. distributed WfMS
    - Often offer disjoint sets of capabilities
How to obtain a homogeneous provenance representation and capture mechanism?
Control flow matters in data-centric workflows
• Scientific workflows also need control structures to specify how the data flow should be directed
• Goderis et al. [6] stress the importance of combining different models of computation in one scientific workflow
• Bowers et al. [5] say that “modeling control-flow using only dataflow constructs can quickly lead to overly complex workflows that are hard to understand, reuse, reconfigure, maintain, and schedule”
• Tudruj et al. [7] state the importance of general dynamic control flow, but focus on synchronization of parallel execution
Presented a set of generic control structures and proposed the use of a monitoring middleware
A real example: OrthoSearch workflow
Detect distant homologies on five parasites associated with tropical neglected diseases
[Figure: OrthoSearch workflow. Task labels: BLAST, MAFFT/HMMER packages, Best Hits Finder, FormatDB, InterPRO]
OrthoSearch specification in Kepler
• Some lightweight tasks can run locally
• Suppose we need to execute MAFFT/HMMER in a High Performance Environment
• Just send it to a grid !
[Figure: OrthoSearch workflow in Kepler with a "Time-consuming tasks" callout. Task labels: BLAST, MAFFT/HMMER packages, Best Hits Finder, FormatDB, InterPRO]
OrthoSearch - loops, choice, …
How to map this to the grid language ?
[Figure: OrthoSearch workflow with a "LOCAL" annotation. Task labels: BLAST, MAFFT/HMMER packages, Best Hits Finder, FormatDB, InterPRO]
OrthoSearch - loops, choice, …
Alternatively, send one job at a time to execute remotely
Can be very inefficient !
OrthoSearch - loops, choice, …
Rewrite this in the grid language. E.g., Triana supports loops!
But, how to bring provenance data back to Kepler ?
How to register loop iterations ?
OrthoSearch - loops, choice: other issues
What if my available grid does not have a WfMS ?
What if my available grid supports another WfMS ?
What if the grid WfMS does not support loops ?
Generic control flow modules with remote provenance gathering!
Motivation
• Workflow design
  - Different WfMS present their own control structures, parallel execution models, etc.
  - Expose different modeling semantics to the users!
• Provenance gathering
  - WfMS register provenance in their own schema
  - Often encompassing specific grid features
  - Based on application domain attributes
A lot of mappings and conversions!
Many challenges in changing WfMS for the same workflow
Objective
• Diminish the dependence of the workflow definition on the WfMS
  - uncoupling the provenance gathering system from the WfMS
  - having some control flow of execution independent of the WfMS workflow specification language
• Plugging control flow and provenance gathering modules alongside the workflow's original tasks
  - the workflow specification can be executed almost independently of the current WfMS
  - provenance can be gathered uniformly
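The idea of gathering provenance uniformly, without relying on the WfMS, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `gather_provenance` decorator, the `PROVENANCE_LOG` list, the record fields, and the `format_db` placeholder task are all hypothetical names chosen for illustration.

```python
import time
from functools import wraps

# Hypothetical in-memory provenance store; a real system would persist records.
PROVENANCE_LOG = []


def gather_provenance(task_name):
    """Wrap a task so every run appends one uniform record to PROVENANCE_LOG."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            started = time.time()
            result = func(*args, **kwargs)
            PROVENANCE_LOG.append({
                "task": task_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "started": started,
                "finished": time.time(),
            })
            return result
        return wrapper
    return decorator


@gather_provenance("format_db")
def format_db(fasta_path):
    # Placeholder for a real FormatDB invocation (assumption for illustration).
    return fasta_path + ".db"


format_db("genes.fasta")
```

Because the record layout is fixed by the wrapper rather than by the engine that runs the task, the same records would be produced regardless of which WfMS invokes the wrapped task.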
Scientific Workflow Control Flows
• A small set of generic workflow-level control modules
• Based on workflow patterns by Van der Aalst et al.
Workflow Pattern                             Module
Structured Discriminator                     Mux
Exclusive Choice                             Demux
Deferred Choice                              String Control
Multiple Instances without Synchronization   Number Control
Synchronization                              Number Compare
Exclusive Choice                             If
Scientific Workflow Control Flows
[Figure: OrthoSearch workflow with implicit control flow. Labels: COGsDB, PtnDB, fastacmd, formatdb, BLAST, Reciprocals, Best Hits Finder, MAFFT, hmmbuild, hmmcalibrate, hmmsearch, hmmpfam, HMMER, InterPRO, Reannotated genes. Annotations: Implicit DECISION, Implicit LOOP]
Scientific Workflow with Explicit Control Flows
[Figure: HMMER sub-workflow with an explicit LOOP. Labels: MAFFT, hmmbuild, hmmcalibrate, hmmsearch, hmmpfam, HMMER, Initial condition, MUX, IF (T/F branches)]
• All these modules can be sent to execute in any HPC environment
• Provenance gathering mechanisms can be inserted in the control flow modules or other specific modules
Explicit DECISION
A meta-workflow eases migration of a workflow from one WfMS to another!
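One way to picture an explicit loop assembled from a MUX and an IF module, with each iteration registered as provenance, is the Python sketch below. The wiring is an assumption inferred from the slide labels (initial condition, MUX, IF with T/F branches); the function names, the two-port form of `if_module`, and the record fields are hypothetical.

```python
def mux(*ports):
    """MUX: converge several input ports into one branch (first port carrying data)."""
    for value in ports:
        if value is not None:
            return value
    raise ValueError("no input port carries data")


def if_module(port_a, port_b, expression):
    """IF: evaluate a user-supplied logical expression over two input ports."""
    return "T" if expression(port_a, port_b) else "F"


def explicit_loop(initial, body, expression, provenance):
    """Iterate body while the IF module answers 'T', registering every iteration."""
    initial_port, feedback_port, iteration = initial, None, 0
    while True:
        current = mux(initial_port, feedback_port)
        branch = if_module(current, iteration, expression)
        provenance.append({"iteration": iteration, "value": current, "branch": branch})
        if branch == "F":
            return current
        # Feed the body's output back into the MUX for the next iteration.
        initial_port, feedback_port = None, body(current)
        iteration += 1


log = []
result = explicit_loop(
    8,
    lambda value: value // 2,
    lambda value, iteration: value > 1 and iteration < 10,
    log,
)
```

Since the loop is expressed with ordinary modules, the iteration records in `log` are produced by the workflow itself and do not depend on loop support in the engine that executes it.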
Control flow modules on VisTrails
• All these control flow modules were made available on VisTrails
• More explicit control is now available
• Remote execution can keep the specified control
• Remote execution can bring provenance data back to VisTrails with a compatible structure
Advantages
Orthosearch on VisTrails
• All these inner modules (a sub-workflow) can be sent to execute in a grid or HPC environment
• Provenance gathering mechanisms can be inserted in the control flow modules or other specific modules
• In VisTrails the loop could not be implemented inside the workflow because VisTrails is a DAG-based WfMS
External LOOP (parameter exploration)
Explicit DECISION
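When the engine only accepts acyclic workflows, the loop can be moved outside the workflow and driven as a parameter exploration. The Python sketch below illustrates that general idea only; it does not use the VisTrails API, and `run_subworkflow`, `parameter_exploration`, and the result fields are hypothetical.

```python
def run_subworkflow(parameter):
    """Hypothetical acyclic sub-workflow: one run with one parameter value."""
    # Placeholder result; a real run would invoke the workflow's tasks.
    return {"hits": 10 * parameter}


def parameter_exploration(values, accept):
    """Drive a DAG-only sub-workflow from outside, one run per parameter value."""
    runs = []
    for value in values:
        result = run_subworkflow(value)
        runs.append({"parameter": value, "result": result})
        if accept(result):
            break  # the external driver, not the DAG, decides when to stop
    return runs


runs = parameter_exploration([1, 2, 3, 4], lambda result: result["hits"] >= 30)
```

Each entry in `runs` corresponds to one complete execution of the acyclic sub-workflow, so the iterations remain visible as separate runs.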
Scientific Workflow - Heterogeneity
[Figure: OrthoSearch workflow with a "Time consuming" annotation. Labels: COGsDB, PtnDB, fastacmd, formatdb, BLAST, Reciprocals, Best Hits Finder, MAFFT, hmmbuild, hmmcalibrate, hmmsearch, hmmpfam, HMMER, InterPRO, Reannotated genes]
Orthosearch on VisTrails
• BLAST modules should be sent to execute on a PC cluster
• Provenance gathering mechanisms can be inserted in the control flow modules to be sent to the parallel environment
• In VisTrails this can be achieved using the MidMon modules
REMOTE PARALLEL EXECUTION
BLAST
MidMon on VisTrails
Monitoring tool that checks scientific processes running on distributed environments
• Message exchange-based tool
• Decoupled, with a modular infrastructure
• Support for legacy applications on distributed resources
Implementation
Data Modules
Control Modules
BLAST
Concluding
• We share the same motivation of Bowers et al., Goderis et al. and Tudruj et al.
• And the same as Groth et al.
• We propose:
  - A set of generic control-flow structures independent of WfMS
• Our implementation has shown that:
  - Control-flow structures can allow generic sub-workflow remote execution
  - Remote process provenance can be captured in the same representation as the workflow
  - Workflow refactoring is facilitated
  - Control-flow structures can be coupled to monitoring middleware
Using explicit control flow
Provenance independent of a WfMS
Conclusion
• Distribution & Heterogeneity are inevitable in scientific workflows
• Adding control-flow modules to the scientific workflow specification can help execution by heterogeneous WfMS running on distributed environments
  - Acts as documentation of the execution control of the workflow
  - Allows evaluating and monitoring the activities of the workflow
  - Helps to gather provenance from heterogeneous and independent environments with low programming effort
• MidMon on top of VisTrails
  - Enables scientists to monitor the status of submitted jobs on their desktops
  - Preserves the workflows’ original features
Future work
• Use workflow views, e.g. ZOOM*
  - Our solution makes the workflow very verbose
• Use software component reuse and refactoring techniques to help the automatic incorporation of these modules
  - “Using Provenance to Improve Workflow Design”, Tosta et al.
• Work with other workflows from bioinformatics and oil industry
Using explicit control processes in distributed workflows to gather provenance
Sergio M. S. da Cruz, Fernando Seabra Chirigati, Rafael Dahis, Maria Luiza M. Campos, Marta Mattoso
Federal University of Rio de Janeiro, Brazil
Thanks !
Scientific Workflow Control Flows
MUX: Describes a convergence between two or more input ports, resulting in just one branch
DEMUX: Represents an incoming branch that diverges into two or more parts. Just one of the outgoing branches is enabled, depending on an associated condition
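The MUX and DEMUX descriptions can be sketched in Python as follows. This is an illustrative reading of the two descriptions, not the modules' actual code: ports are modeled as plain values with `None` meaning "no data", and the `selector` callable standing in for the associated condition is a hypothetical parameter.

```python
def mux(*ports):
    """MUX: two or more input ports converge into a single outgoing branch."""
    for value in ports:
        if value is not None:
            return value  # forward the first port that carries data
    raise ValueError("no input port carries data")


def demux(value, selector, n_outputs):
    """DEMUX: enable exactly one of n_outputs branches, chosen by selector(value)."""
    index = selector(value)
    if not 0 <= index < n_outputs:
        raise ValueError("selector chose a branch that does not exist")
    outputs = [None] * n_outputs
    outputs[index] = value  # all other branches stay disabled
    return outputs


merged = mux(None, "hits.txt")
routed = demux(7, lambda v: 0 if v < 5 else 1, 2)
```

Here `routed` carries the value only on the second branch, which mirrors the statement that just one outgoing branch is enabled.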
STRING CONTROL: The workflow is divided into two or more branches, and just one of them can be enabled; the other outgoing branches are withdrawn
NUMBER CONTROL: All output data are originated simultaneously
NUMBER COMPARE: Two or more incoming branches become one outgoing branch, which is enabled only after the complete activation of all the input data.
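The synchronization behavior described for Number Compare can be sketched in Python as a join that stays silent until every input has arrived. The class name `SynchronizationJoin`, the port names, and the `deliver` method are hypothetical and only illustrate the pattern.

```python
class SynchronizationJoin:
    """Enable the single outgoing branch only after every input port has data."""

    def __init__(self, port_names):
        self.expected = set(port_names)
        self.received = {}

    def deliver(self, port, value):
        """Record one input; return all inputs once complete, else None."""
        if port not in self.expected:
            raise KeyError(f"unknown input port: {port}")
        self.received[port] = value
        if set(self.received) == self.expected:
            return dict(self.received)  # all inputs present: enable the output
        return None  # still waiting for the remaining branches


join = SynchronizationJoin(["blast", "hmmer"])
first = join.deliver("blast", 3)
second = join.deliver("hmmer", 5)
```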
IF: Same pattern as the Demux, but with two differences: If has only two input ports, and it has a logical expression in which scientists can create any condition they need.
MidMon
• Offers a generic and lightweight monitoring tool that checks scientific processes running on distributed environments
  - Message exchange-based, two-layered modular infrastructure
  - Decoupled and lightweight, crossing different network boundaries
  - Easy to deploy and manage
  - Support for legacy applications on distributed resources
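A message exchange-based monitor with two layers might look roughly like the Python sketch below. It does not reproduce MidMon's protocol, which the slides do not specify: the JSON message fields, the function names, and the use of an in-process `queue.Queue` in place of a network transport are all assumptions.

```python
import json
import queue


def make_status_message(job_id, state, node):
    """Remote-side layer: encode a job status as a JSON message."""
    return json.dumps({"job_id": job_id, "state": state, "node": node})


def drain_messages(channel):
    """Desktop-side layer: decode pending messages, keeping the latest state per job."""
    latest = {}
    while True:
        try:
            raw = channel.get_nowait()
        except queue.Empty:
            return latest
        message = json.loads(raw)
        latest[message["job_id"]] = message["state"]


channel = queue.Queue()  # stands in for the network transport (assumption)
channel.put(make_status_message("blast-1", "RUNNING", "node-a"))
channel.put(make_status_message("blast-1", "DONE", "node-a"))
channel.put(make_status_message("hmmer-1", "RUNNING", "node-b"))
status = drain_messages(channel)
```

Keeping the sender and the collector connected only through messages is what lets the monitored application stay unchanged, which fits the stated support for legacy applications.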
MidMon Monitoring Data
• It may be possible to monitor task state data
• It may be possible to monitor the state of the environment
• It may be possible to monitor service availability
MidMon – State Data
• List of task state data that it may be possible to monitor:
  - Progress of a service: relies on checkpoints within the service, or a service may be able to provide an estimate of its progress
  - Completion of a service: this could be a simple event indicating that a service has produced all of its output files
  - Data consumption rate of a service: a measure of the rate at which a service consumes data from its input file
  - Data production rate of a service: a measure of the rate at which a service generates data for its output file
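As a rough illustration of two of these measures, the Python sketch below derives a data production rate and a progress estimate from sampled output file sizes. The sampling scheme, the function names, and the assumption that the expected output size is known in advance are hypothetical; in practice the sizes could come from periodic calls such as `os.path.getsize`.

```python
def production_rate(samples):
    """Average bytes per second between the first and last (timestamp, size) sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, size0), (t1, size1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must increase")
    return (size1 - size0) / (t1 - t0)


def estimate_progress(current_size, expected_size):
    """Fraction of the expected output produced so far, capped at 1.0."""
    if expected_size <= 0:
        raise ValueError("expected_size must be positive")
    return min(current_size / expected_size, 1.0)


# Output file grew from 0 to 5000 bytes over 10 seconds.
rate = production_rate([(0, 0), (5, 2000), (10, 5000)])
progress = estimate_progress(5000, 20000)
```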
MidMon – State of the environment
• A list of useful data that it may be possible to monitor about the state of the environment:
  - Available execution nodes: this could be a list of changes in the available execution nodes in the environment
  - Load on an execution node: a measure of the load on an execution node. It could be a single value, a tuple, or a composite, e.g., the CPU load, the number of processes, and the free resources of the execution node
  - Load on a network link: a measure of the usage of a network link, in terms of the available bandwidth
  - Memory usage on an execution node: a measure of the usage of memory on an execution node
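The composite load measure mentioned above can be sketched in Python as a small record per node. The field names, the units, and the tie-breaking rule in `least_loaded` are assumptions made for illustration, not part of MidMon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeLoad:
    """Composite load of one execution node (hypothetical fields and units)."""
    cpu_load: float       # fraction of CPU in use, 0.0 to 1.0
    process_count: int    # number of running processes
    free_memory_mb: int   # free memory in megabytes


def least_loaded(nodes):
    """Pick the node with the lowest CPU load, breaking ties by process count."""
    return min(nodes, key=lambda name: (nodes[name].cpu_load, nodes[name].process_count))


nodes = {
    "node-a": NodeLoad(cpu_load=0.9, process_count=120, free_memory_mb=512),
    "node-b": NodeLoad(cpu_load=0.2, process_count=80, free_memory_mb=4096),
    "node-c": NodeLoad(cpu_load=0.2, process_count=95, free_memory_mb=2048),
}
target = least_loaded(nodes)
```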
MidMon – Service availability
• A list of useful data that it may be possible to monitor about service availability:
  - Available services: this could be a list of the services available as mapping targets for tasks in a workflow. The data could also include, e.g., the status of services currently deployed
  - Available data resources: this could be a list of the data resources available as mapping targets for inputs and outputs in a workflow
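Using a list of available services as mapping targets for workflow tasks could be sketched in Python as below. The service record fields (`name`, `host`, `status`), the status value `"deployed"`, and the first-match policy are hypothetical choices for illustration.

```python
def map_tasks(tasks, available_services):
    """Map each task to the first deployed service offering it; None if unavailable."""
    mapping = {}
    for task in tasks:
        candidates = [
            service["host"]
            for service in available_services
            if service["name"] == task and service["status"] == "deployed"
        ]
        mapping[task] = candidates[0] if candidates else None
    return mapping


services = [
    {"name": "blast", "host": "cluster-1", "status": "deployed"},
    {"name": "hmmer", "host": "grid-7", "status": "stopped"},
    {"name": "hmmer", "host": "grid-9", "status": "deployed"},
]
mapping = map_tasks(["blast", "hmmer", "interpro"], services)
```

A task mapped to `None` signals that no deployed service currently offers it, which a scheduler could treat as a reason to run the task locally or wait.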