Automated Provenance Capture Architecture and Implementation
Karen Schuchardt, Eric Stephan, Tara Gibson, George Chin
PNNL
Methodology
• Simulated using the Kepler workflow system. We did not attempt to leverage looping.
• Programmed stub actors for each step with the proper inputs, outputs, and user-controlled parameters.
• Implemented an execution event listener that optionally records workflow provenance (see the sketch after this list). No changes to core Kepler were made.
• Applied/extended our existing content management/provenance system to see how far we could go with it.
• Implemented actors/workflows for queries and visual analysis using XSLT/graphviz.
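
A minimal sketch of the listener idea, in Python for brevity; the class and callback names below are invented, and the real listener hooks Kepler’s execution events:

    from datetime import datetime, timezone

    def now():
        return datetime.now(timezone.utc).isoformat()

    class ProvCaptureService:
        """Illustrative stand-in for the capture service: collects triples."""
        def __init__(self):
            self.triples = []

        def assert_triple(self, subject, predicate, obj):
            self.triples.append((subject, predicate, obj))

    class ProvenanceExecutionListener:
        """Receives engine events; requires no changes to the engine core."""
        def __init__(self, capture, enabled_actors=None):
            self.capture = capture
            # User control over which actors to capture provenance on.
            self.enabled_actors = enabled_actors

        def _enabled(self, actor_id):
            return self.enabled_actors is None or actor_id in self.enabled_actors

        def actor_started(self, workflow_id, actor_id):
            if self._enabled(actor_id):
                self.capture.assert_triple(actor_id, "isPartOf", workflow_id)
                self.capture.assert_triple(actor_id, "startedExecution", now())

        def actor_finished(self, workflow_id, actor_id, status):
            if self._enabled(actor_id):
                self.capture.assert_triple(actor_id, "finishedExecution", now())
                self.capture.assert_triple(actor_id, "hasStatus", status)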
Provenance Capture Architecture

[Architecture diagram. A Provenance/Content System (Naming Service, Content Store, Prov Capture Service, Prov Service, Query Service, Triple Store, Metadata Extraction, Translation, Indexing) sits between the Workflow Engine, which feeds it events, and the client side: Client Tools (Analysis, Query, Browsing, Annotation, and Harvesting Tools) and Workflow Tools (Workflow UI).]
SDG Provenance Capture Implementation

[Implementation diagram mapping the architecture to concrete components: naming via URL/LSID; the Kepler Engine (Kepler workflows and the Kepler UI, all within Kepler) sending events to the Prov Capture Service; a Content Store and Triple Store accessed through GML/WebDAV and URIQA (RDF)/WebDAV; SAM translation; Defuddle for metadata extraction and Lucene for indexing; a SEDASL query processor and prov processor; triple harvesters; Node x Node Comparison, the Batagelj/Mrvar algorithm, and Nettool for analysis; DavExplorer/Ecce as client tools.]
Physical Model

[Diagram: a Named Thing has 0..n Properties and optional associated Content; a Property can be a link or a value.]

Any “thing” for which we want to capture some information is given a unique id with which properties and relationships can be associated. Additionally, content can be associated with these “things”.
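
A minimal sketch of this model in Python, with invented class names (the actual system exposes these through WebDAV/RDF):

    from dataclasses import dataclass, field
    from typing import List, Optional, Union

    @dataclass
    class Property:
        name: str
        # A property is either a literal value or a link (another thing's uid).
        value: Union[str, "NamedThing"]

    @dataclass
    class NamedThing:
        uid: str                          # unique id for any "thing"
        properties: List[Property] = field(default_factory=list)
        content: Optional[bytes] = None   # optional associated content

    # Example: a port value with a format property.
    warp = NamedThing(uid="urn:sdg:thing/42")
    warp.properties.append(Property("format", "application/x-warp"))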
Logical Overlay

[Model diagram layered over the physical model; every entity carries a uid and can take arbitrary additional triples. A Workflow Instance (startedExecution, finishedExecution, hasStatus, creator, wasRunBy, owningInstitution, created, createdWith, title) contains 0..n Actor Instances via isPartOf (startedExecution, finishedExecution, title). Actor Instances connect to 0..n Port Values via isInput and hasOutput (title, format, created, hasSource, hasValue OR hasHashOfValue) and to 0..n Parameters via hasParameter (title, format, hasValue OR hasHashOfValue).]
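
As an illustration, part of a run might be captured as triples like these (all uids and values are invented; the predicates are those from the overlay above, and the isInput direction is one plausible reading of the diagram):

    triples = [
        ("urn:wf/1",     "title",            "fMRI Challenge Workflow"),
        ("urn:wf/1",     "startedExecution", "2006-08-01T10:00:00Z"),
        ("urn:actor/a4", "isPartOf",         "urn:wf/1"),
        ("urn:actor/a4", "title",            "AlignWarp4"),
        ("urn:port/p9",  "isInput",          "urn:actor/a4"),
        ("urn:port/p9",  "format",           "image/x-analyze"),
        ("urn:actor/a4", "hasOutput",        "urn:port/p10"),
        ("urn:port/p10", "hasHashOfValue",   "sha1:..."),
    ]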
Semantically Extended DASL Queries
• Select: all properties or a specific list; output format (GXL, RDF, WebDAV)
• Scope: a URL or a query (i.e. two-phase); names of properties to follow (and direction); stop conditions (property/value comparisons, depth)
• Where: property name/value comparisons; content search
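
As a hedged illustration of issuing such a query over WebDAV SEARCH: the Select/Scope/Where structure follows the slide, but the XML element names and namespace are invented, since the real SEDASL vocabulary is not shown here.

    import requests  # any HTTP client that supports custom methods works

    body = """<?xml version="1.0"?>
    <searchrequest xmlns="DAV:" xmlns:s="urn:example:sedasl">
      <s:select><s:format>rdf</s:format><s:allprop/></s:select>
      <s:scope>
        <s:href>http://host/prov/urn:wf/1</s:href>
        <s:follow><s:property>hasOutput</s:property>
                  <s:direction>forward</s:direction></s:follow>
        <s:stop><s:depth>5</s:depth></s:stop>
      </s:scope>
      <s:where>
        <s:eq><s:property>format</s:property><s:value>image/gif</s:value></s:eq>
      </s:where>
    </searchrequest>"""

    resp = requests.request("SEARCH", "http://host/prov/", data=body,
                            headers={"Content-Type": "text/xml"})
    print(resp.status_code)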
Workflow Comparisons
• Node-by-node comparisons (a sketch follows the sample output below)
  – Nodes match if all node attributes and incoming and outgoing edges match
  – Nodes are similar if attributes and edges match to some specified threshold percentage
• After node comparisons, edges are compared
  – Edges match if their connecting nodes were found to be exactly matching or similar, and the edge attributes match
  – Edges are similar if attributes match to some specified threshold percentage
• Outputs include: matching or similar nodes; matching or similar edges; nodes only in the first or second graph; edges only in the first or second graph
[Comparison example figure. Node attributes compared: title, instantiationOf, source, value, format. Edge properties followed: isPartOf, isInput, hasOutput, and their reverses.]

Sample output (nodes only in first graph, count 9): node52 (atlas-z.gif), node14 (imageformat), node53 (convertyimage), node57 (atlas-y.gif), node36 (convertzimage), node34 (imageformat), node15 (imageformat), node78 (atlas-x.gif), node26 (convertximage)
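
A minimal sketch of the node-matching step; the graph encoding and threshold are invented, and the real comparison runs over provenance graphs pulled from the triple store:

    # Hypothetical graph encoding: attrs maps node -> {attribute: value},
    # in_edges/out_edges map node -> [(neighbor, edge_label), ...].
    def features(graph, node):
        """A node's attributes plus its incoming and outgoing edge labels."""
        feats = set(graph["attrs"][node].items())
        feats |= {("in", lbl) for _, lbl in graph["in_edges"].get(node, [])}
        feats |= {("out", lbl) for _, lbl in graph["out_edges"].get(node, [])}
        return feats

    def compare_nodes(g1, g2, n1, n2, threshold=0.8):
        """'match' if everything agrees, 'similar' above the threshold."""
        f1, f2 = features(g1, n1), features(g2, n2)
        if f1 == f2:
            return "match"
        overlap = len(f1 & f2) / max(len(f1 | f2), 1)
        return "similar" if overlap >= threshold else "different"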
Workflow Graph Distances

Implements a social network algorithm based on the triad census (Batagelj and Mrvar, 2001; Chin, Whitney, Powers, and Johnson, 2004).

• Examines every possible combination of three nodes in a workflow graph
• Each three-node combination falls into 1 of 64 possible triad states
• The census is the count of triads in each state, which may be used to summarize or profile overall graph structure
• Distance is computed as the Euclidean distance between two triad censuses, normalized to a 0.0..1.0 value

Most useful for assessing similarity across large, complex workflows. Example: for two workflow graphs with censuses (0, 4, 6, 1, …) and (0, 4, 6, 1, …), the computed distance is 0.095888.
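
A minimal sketch of the census and distance computation, following the 64-state formulation above; the normalization shown is one plausible choice, since the slides do not specify it:

    from itertools import combinations
    from math import sqrt

    def triad_census(nodes, edges):
        """64-bin census: one bin per configuration of the 6 possible
        directed edges among a node triple (raw states, not the canonical
        isomorphism classes of the published census)."""
        e = set(edges)                    # edges as (u, v) pairs
        census = [0] * 64
        for a, b, c in combinations(sorted(nodes), 3):
            pairs = [(a, b), (b, a), (a, c), (c, a), (b, c), (c, b)]
            state = sum(1 << i for i, p in enumerate(pairs) if p in e)
            census[state] += 1
        return census

    def census_distance(c1, c2):
        """Euclidean distance between censuses, scaled into 0.0..1.0."""
        d = sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)))
        norm = sqrt(sum(x * x for x in c1)) + sqrt(sum(y * y for y in c2))
        return d / norm if norm else 0.0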
What’s Cool
• Combined RDF assertions with scientific content management
  – Flexible capabilities for metadata extraction (e.g. Defuddle to extract data from a warp file)
  – Existing RDF harvesters could be plugged in through the same mechanism
  – Extensible translation mechanism (browse tools can provide views of raw data, such as a table of warp parameters)
• Conceptually simple model that can apply to much more than workflow execution
• Readily adaptable to alternative models, constructs, and relationships
• Indexing and query of content or metadata
• All relationships are reverse indexed automatically: you can search up or down, and even mix directions on specific properties
• Flexible event-based model to minimize connections into the workflow engine
• Actors can contribute their own metadata easily through events
• User control over which actors to capture provenance on
• Automatic content type determination
• Multiple output formats
• Capability to capture hashes instead of values (see the sketch after this list)
• Leveraged DASL extension mechanisms
• Based on existing standards (HTTP), so existing tools can be leveraged
• Pluggable authentication model based on JAAS
• Everything is open source
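
A minimal illustration of the hash-instead-of-value option; the helper below and the choice of SHA-1 are invented, and only the hasValue OR hasHashOfValue predicates come from the model above:

    import hashlib

    def value_assertion(uid, value, hash_only=False):
        """Triple for a port value; stores a digest instead of the raw
        bytes when requested (useful for large values such as images)."""
        if hash_only:
            digest = hashlib.sha1(value).hexdigest()
            return (uid, "hasHashOfValue", "sha1:" + digest)
        return (uid, "hasValue", value.decode("utf-8", errors="replace"))

    print(value_assertion("urn:port/p10", b"...image bytes...", hash_only=True))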
Limitations
• Prov capture is slow. We do one assertion at a time currently, but they could all be packaged up into one request (see the sketch after this list).
• RDF predicates can’t contain special characters, but things like parameters often have these characters.
• SAM can be made to work, but the current implementation based on WebDAV ties resources to metadata. We had to create dummy resources.
• SAM is not RDF based.
• Big files (reference images) are duplicated as part of provenance tracking because they are data inputs to multiple actors.
• Did not get to the LSID service, but it would be nice if this weren’t a separate protocol to deal with.
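
A sketch of the batching idea; the endpoint URL and payload format are invented, as the slides only note that assertions could be packaged into one request:

    import requests

    def post_assertions_batched(triples, url="http://host/prov/batch"):
        """Send many (subject, predicate, object) assertions in one request
        instead of one HTTP round trip per triple."""
        payload = "\n".join("\t".join(t) for t in triples)
        return requests.post(url, data=payload,
                             headers={"Content-Type": "text/plain"})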
Kepler/Workflow Comments
• Decided to stay with the brute force model instead of a loop-based model. A loop-based model would probably introduce controller actors that would obscure the provenance capture.
• Issue of what to capture provenance on for more general workflows.
• Coding actors for each thing you want to do doesn’t scale and is a barrier to adoption by scientists.
• Can’t control actor firing order, which resulted in things like AlignWarp4 producing warp1.warp.
• We used string constant actors to supply input files, but it makes more sense for Kepler to support the concept of a data source.
• We could not tell if a port value was a file except by using File.exists().
• Would like to see events be external for complete separation from the workflow engine.
Out of (Current) Scope
• Dynamically changing and continuing workflows (i.e. evolving workflows)
• Pointing back to provenance on actors. A real system would do this, and the actors themselves would have global ids that could be referenced.
• Capturing provenance on workflow descriptors and pointing back to them (same as above for actors)
• Use of LSIDs: we have the service running but never got to the point of inserting it. Instead we used a URL name generation service.
• Signing results
Brainstorming Categorization
• How data was generated
  – User-set parameter values
  – Workflow structure/execution capture
  – Outside tools
  – Auto-generated metadata/content
• The structure of the query
  – Two-phase query (or recursive)
  – Specifying what to include
  – Specifying what to exclude
• What it will be used for
  – Exploratory analysis
  – Directed query to answer a specific question
  – Debugging
  – Verification
  – Comparison