7th Biennial Ptolemy Miniconference
Berkeley, CA, February 13, 2007
Provenance Framework in Kepler
Ilkay Altintas, Norbert Podhorszki
Contributors:
S. Bowers, B. Ludäscher, T. McPhillips (UC Davis)
O. Barney (U Utah), E. Jäger-Frank (SDSC)
Outline
Provenance? What is it?
Framework in Kepler to record provenance data
RWS: A provenance model suitable for Kepler's different
computational models.
Possible Applications of Provenance
What to track and why
Do we need some tracking of what is happening?
Recreate results and rebuild workflows using the evolution
information (see repeatable experiments)
Associate the workflow with the results it produced
Create links between generated data in different runs, and compare
different runs
Recover from a system failure
• Checkpoint a workflow
Debug and explain results (via lineage tracing, …)
Smart Reruns
• Avoid re-generating the same data all the time
Model of Provenance
Core feature
capture the processing history (trace) leading to a data product
Model of Computation (MoC)
Well-defined in terms of input/output relations and the (partial) order of actions
• M_MoC(P_Program, I_Input) → O_Output
• DAG, SDF, DDF, PN, etc.
Different ways of specification
• see Ptolemy-related papers, Kahn-McQueen paper, etc.
• give abstract/high-level pseudo code
• Practically it is defined through the implementation of the execution system
(including the scheduling). In Kepler/Ptolemy it is the Director.
There are legal (possible) runs under a given MoC
Model of Provenance (MoP)
The starting point is a MoC and its particular implementation
• Observables: e.g. a single fired(x, A, y) event, or reads, writes and actions recorded separately
• Trace: recorded assertions (about observable events) during a legal run
A MoP is a MoC, except that “legal run” is replaced with “legal trace”
There is a default MoP for a MoC: the total trace of all observable events
• Turing machine: moves of the head, data read and written
A MoP may add further information or omit some (“T=R-I+M”)
• Trace = Run – Ignored things + Modelled additional things
• M: Add real timestamps of actions, execution host information
• I: Omit the input for each action if this can be inferred unambiguously later (DAG)
• Depends on the application of the trace
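The “T=R-I+M” rule can be illustrated with a minimal sketch. It assumes a run is a list of simple event records; the names `Event`, `make_trace`, `ignored_kinds` and `annotate` are hypothetical and not part of Kepler. A logical step number stands in for the real timestamp mentioned under M.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str   # "read", "write" or "fire" (illustrative observables)
    actor: str
    data: str


def make_trace(run, ignored_kinds, annotate):
    """Build a trace as T = R - I + M from a recorded run R."""
    trace = []
    for step, event in enumerate(run):
        if event.kind in ignored_kinds:          # I: ignored observables
            continue
        record = {"kind": event.kind, "actor": event.actor, "data": event.data}
        record.update(annotate(step, event))     # M: modelled additions
        trace.append(record)
    return trace


run = [
    Event("read", "A", "x1"),
    Event("fire", "A", ""),
    Event("write", "A", "y1"),
]

# DAG-style MoP: inputs can be inferred later, so reads are ignored.
trace = make_trace(run, ignored_kinds={"read"},
                   annotate=lambda step, event: {"step": step})
```

Which kinds go into `ignored_kinds` and what `annotate` adds would depend on the intended application of the trace.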
MoP Examples
DAG workflow
Record: Output data generated by the actions
Inference: Execution of actions and inputs to them can be inferred from the
DAG itself
Smart-rerun
Record: The output of each action and the parameters for that action
Inference: If an action’s parameters are unchanged and the actions on which
it depends (inferred from the workflow graph) are also unchanged, the
action’s output will be the same in a future run.
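The smart-rerun inference can be sketched as follows. This is an illustration under stated assumptions, not the Kepler implementation: the workflow is assumed to be a DAG given as a dictionary `depends_on`, and the function name `actions_to_rerun` is hypothetical.

```python
def actions_to_rerun(depends_on, changed_params):
    """Return the actions whose recorded output cannot be reused.

    depends_on: dict mapping each action to the actions it reads from
                (taken from the workflow graph, assumed acyclic).
    changed_params: set of actions whose parameters were modified.
    """
    dirty = set()

    def is_dirty(action):
        if action in dirty:
            return True
        upstream = depends_on.get(action, ())
        # An action must rerun if its own parameters changed or if any
        # action it depends on must rerun.
        if action in changed_params or any(is_dirty(up) for up in upstream):
            dirty.add(action)
            return True
        return False

    for action in depends_on:
        is_dirty(action)
    return dirty


workflow = {"load": [], "filter": ["load"], "plot": ["filter"], "stats": ["load"]}
rerun = actions_to_rerun(workflow, changed_params={"filter"})
```

Every action outside the returned set can reuse its recorded output, which is why the MoP for smart-rerun needs to record outputs and parameters but not firings.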
Kitchen definition
A MoP is “good” if it can handle the intended questions & use cases.
Kepler: Streaming actors
Stateful actors
• An output depends on all inputs in the past, e.g. AddSubtract
Stateless actors
• An output depends only on inputs read in the current firing, e.g.
Expression, RecordAssembler
Non-conformist actors
• Filter, Running average, Daily average (an output depends on some of the past inputs)
• How do you determine correctly which inputs a given output depends on?
Kepler: Data dependent routing (branches and loops)
The firing history of the actors cannot be inferred from the static
workflow graph
• Something should be recorded (e.g. firings)
RWS: A Model of Provenance for Kepler Directors
RWS: Read − Write − State-reset
What about actor state? What about “real” dependencies?
r, r … r, w, w, … w, r, … r, w, … w …
State-reset event s defines when an actor “cuts off” dependencies
• A semantic notion, known to the actor [developer] (or part of a higher-order scheme)
r, r … r, w, w, … w, s!, r, … r, w, ... w, …
[Figure: events of actor A (r … r, w … w, s!) along a time axis, grouped into firings]
Reference: IPAW’06, Bowers et al.
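One simplified reading of the state-reset event is that a written token depends on every token the actor has read since its last s. The sketch below implements only that reading; the function name `rws_dependencies` and the `(kind, token)` event tuples are assumptions made for illustration.

```python
def rws_dependencies(events):
    """Infer token dependencies for one actor from its RWS event sequence.

    events: list of (kind, token) pairs, kind in {"r", "w", "s"}.
    Returns a dict mapping each written token to the tokens read
    since the last state-reset.
    """
    reads_since_reset = []
    deps = {}
    for kind, token in events:
        if kind == "r":
            reads_since_reset.append(token)
        elif kind == "w":
            deps[token] = list(reads_since_reset)
        elif kind == "s":
            reads_since_reset = []   # the actor "cuts off" dependencies
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return deps


stateless = [("r", "a"), ("w", "x"), ("s", None), ("r", "b"), ("w", "y"), ("s", None)]
stateful = [("r", "a"), ("w", "x"), ("r", "b"), ("w", "y")]
```

With the reset events, `y` depends only on `b`; without them, `y` depends on both `a` and `b`.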
RWS trace of some actors
Stateless actor (r+ w+ s)* : r … r w… w s r … r w… w s …
Stateful actor (r+ w+)*
Simple filter actor (conditional depends only on current token)
(r w? s)* : either it emits a token or not
Daily average of hourly measurements ((r w)^24 s)*
Generally: RWS firing is defined in terms of r and w events
r+ w+ defines one RWS firing (most Kepler actors behave similarly)
More general: definition of the RWS firing round
(r+ w+)* s : dependencies among several firings
…
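Because the patterns above are regular expressions over the alphabet {r, w, s}, a recorded event sequence can be checked against them directly. The sketch below assumes a trace is encoded as a string with one character per event; the dictionary keys and the function `conforms` are hypothetical names.

```python
import re

# One pattern per actor class, written over the event alphabet r, w, s.
PATTERNS = {
    "stateless": re.compile(r"(r+w+s)*"),
    "stateful": re.compile(r"(r+w+)*"),
    "simple_filter": re.compile(r"(rw?s)*"),
    "daily_average": re.compile(r"((rw){24}s)*"),
}


def conforms(trace, actor_kind):
    """Return True if the whole event string matches the actor's pattern."""
    return PATTERNS[actor_kind].fullmatch(trace) is not None
```

For example, `conforms("rrwsrws", "stateless")` holds, while the same sequence without the s events only fits the stateful pattern.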
Kepler Provenance Framework
Provenance Framework in Kepler
Modeled as a separate concern in the system
Optional drag and drop feature
Listens to the execution and saves information (customizable):
Context: who, what, where, and when that is associated with the run
Input data and its associated metadata
Workflow outputs and intermediate data products
Workflow definition (entities, parameters, connections): a specification of
what exists in the workflow and can have a context of its own
Information about the workflow evolution (the workflow trail)
Kepler System Architecture
[Figure: Kepler system architecture diagram (IPAW’06, Altintas et al.). Labeled components: Authentication; GUI (Vergil); SMS; Kepler Core Extensions; Ptolemy; Kepler GUI Extensions; Actor & Data Search; Type System Ext; Provenance Recorder; Kepler Object Manager; Documentation; Smart Re-run / Failure Recovery]
Kepler Provenance Recorder (IPAW’06, Altintas et al)
• Parametric and customizable
– Different report formats
– Variable levels of verbosity
   • all, some, medium, on error
– Multiple cache destinations
• Saves information on
– User name, Date, Run, etc…
Implementation details
The Provenance Recorder
Extends the Ptolemy AbstractSettableAttribute
Listens to the Director for
• Changes in the workflow graph
• Initialization, workflow execution and stop
• Actor firing
Listens to all IOPorts for
• Token emissions on output ports to record output data
That is, we could say it is a Ptolemy Provenance Framework
Builds an internal representation of the workflow graph
Ptolemy’s DirectedGraph
Nodes: IOPorts, Edges: port connections
Used for
• Recording workflow structure (dependencies among ports)
• Subscribing at all ports (listening for input/output)
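The port graph can be pictured with a small stand-in. The sketch below does not use Ptolemy’s DirectedGraph; it is a plain Python illustration in which nodes are port names and edges are port connections, and `PortGraph`, `connect` and `downstream` are hypothetical names.

```python
from collections import defaultdict


class PortGraph:
    """Nodes are ports, edges are port-to-port connections."""

    def __init__(self):
        self.ports = set()
        self.edges = defaultdict(set)   # source port -> connected ports

    def connect(self, src, dst):
        self.ports.update((src, dst))
        self.edges[src].add(dst)

    def downstream(self, port):
        """All ports reachable from `port` by following connections."""
        seen, stack = set(), [port]
        while stack:
            for nxt in self.edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


graph = PortGraph()
graph.connect("A.out", "B.in")
graph.connect("A.out", "Comp.in")      # port of a composite actor
graph.connect("Comp.in", "Inner.in")   # relayed to an actor inside it
```

Iterating over `graph.ports` gives the set of ports at which a recorder would subscribe, and `downstream` gives the structural dependencies among ports.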
Application: smart-rerun
Implementation of RWS in Kepler
Data model
i.e. observables in all MoC implementations in Kepler
Port-actor relationship
• portTable(Port, Actor, type)
• type is a for atomic and c for composite actors (transparent)
Token-object relationship
• tokenTable(Token, Object)
Object-value relationship
• objectTable(Object, Value, Type)
• type is currently not recorded
RWS trace
• traceTable(Port, Event, Token, FiringCounter)
• event: r as read, w as write or s as state-reset
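The four relations can be tried out as tables in an in-memory SQLite database. The table and column names follow the data model above; the column types, the sample rows and the query helper `values_written_by` are assumptions made for this sketch, not the storage format used by Kepler.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE portTable   (Port TEXT PRIMARY KEY, Actor TEXT,
                          type TEXT CHECK (type IN ('a', 'c')));
CREATE TABLE tokenTable  (Token TEXT PRIMARY KEY, Object TEXT);
CREATE TABLE objectTable (Object TEXT PRIMARY KEY, Value TEXT, Type TEXT);
CREATE TABLE traceTable  (Port TEXT,
                          Event TEXT CHECK (Event IN ('r', 'w', 's')),
                          Token TEXT, FiringCounter INTEGER);
""")

# Illustrative rows: actor A writes token t1, actor B reads it and writes t2.
conn.executemany("INSERT INTO portTable VALUES (?, ?, ?)",
                 [("A.out", "A", "a"), ("B.in", "B", "a"), ("B.out", "B", "a")])
conn.executemany("INSERT INTO tokenTable VALUES (?, ?)",
                 [("t1", "o1"), ("t2", "o2")])
conn.executemany("INSERT INTO objectTable VALUES (?, ?, ?)",
                 [("o1", "3", None), ("o2", "9", None)])
conn.executemany("INSERT INTO traceTable VALUES (?, ?, ?, ?)",
                 [("A.out", "w", "t1", 1), ("B.in", "r", "t1", 1),
                  ("B.out", "w", "t2", 1)])


def values_written_by(conn, actor):
    """Values of all tokens the given actor wrote, in firing order."""
    rows = conn.execute(
        """SELECT o.Value
           FROM traceTable t
           JOIN portTable p   ON p.Port = t.Port
           JOIN tokenTable k  ON k.Token = t.Token
           JOIN objectTable o ON o.Object = k.Object
           WHERE p.Actor = ? AND t.Event = 'w'
           ORDER BY t.FiringCounter""",
        (actor,),
    )
    return [row[0] for row in rows]
```

Keeping tokens, objects and values in separate relations lets the trace refer to the same data object from several tokens without copying its value.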
Extending the framework
1. Initialization (initialize())
Framework traverses the workflow graph (ports and
connections)
RWS: generate specific data structures (port, actor and
connection details)
2. Just before start (validate())
Framework subscribes its event listeners
RWS: subscribes an additional listener for TokenGetEvent
3. When workflow is modified (changeExecuted())
Framework traverses the workflow graph (ports and
connections)
RWS: re-generate data structures
4. During execution when an event occurs
TokenSendEvent() and TokenGetEvent() listeners are
extended to generate RWS trace events
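Step 4 can be sketched as a listener that turns token get and send callbacks into trace rows. The class below is a hypothetical Python stand-in, not the Ptolemy listener interface: `token_get` and `token_sent` play the role of the TokenGetEvent and TokenSendEvent handlers, and the firing counter is derived from the rule that r+ w+ defines one RWS firing.

```python
class RwsRecorder:
    """Collects rows shaped like traceTable(Port, Event, Token, FiringCounter)."""

    def __init__(self):
        self.trace = []
        self._firing = {}       # actor -> current firing counter
        self._last_event = {}   # actor -> kind of the last recorded event

    def _record(self, actor, port, event, token):
        if actor not in self._firing:
            self._firing[actor] = 1
        elif event == "r" and self._last_event[actor] == "w":
            # A read that follows a write starts the next r+ w+ firing.
            self._firing[actor] += 1
        self._last_event[actor] = event
        self.trace.append((port, event, token, self._firing[actor]))

    def token_get(self, actor, port, token):
        """Stand-in for the handler of a token read on an input port."""
        self._record(actor, port, "r", token)

    def token_sent(self, actor, port, token):
        """Stand-in for the handler of a token emission on an output port."""
        self._record(actor, port, "w", token)


recorder = RwsRecorder()
recorder.token_get("A", "A.in", "t1")
recorder.token_get("A", "A.in", "t2")
recorder.token_sent("A", "A.out", "t3")
recorder.token_get("A", "A.in", "t4")
recorder.token_sent("A", "A.out", "t5")
```

This sketch does not emit s events; where a state-reset belongs is known to the actor or its developer, so it would have to be supplied separately.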
Possible applications of Provenance
Smart-rerun
Monitoring/debugging of a workflow
see LiDAR poster today by Efrat Jäger-Frank
Answering questions about processing history and data
Participated at the First Provenance Challenge with
Kepler-RWS http://twiki.ipaw.info/bin/view/Challenge/RWS
Reporting/documentation of workflows and data products
• “Generate my publication”
Acknowledgement
RWS model
Shawn Bowers and Timothy McPhillips, UC Davis
Formalization of the MoPs
Bertram Ludäscher, UC Davis
Kepler Provenance Framework implementation
Oscar Barney, Univ. of Utah, Salt Lake City
Efrat Jäger-Frank, SDSC, San Diego
References
RWS model
S. Bowers, T. McPhillips, B. Ludäscher, S. Cohen and S. B. Davidson
A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows
Intl. Provenance and Annotation Workshop (IPAW), Chicago, 2006
B. Ludäscher, N. Podhorszki, I. Altintas, S. Bowers, T. McPhillips
From Computation Models to Models of Provenance and the RWS Model
To appear in 2007 in Journal of Concurrency and Computation: Practice and Experience
Provenance framework
I. Altintas, O. Barney, E. Jäger-Frank
Provenance Collection Support in the Kepler Scientific Workflow System
Intl. Provenance and Annotation Workshop (IPAW), Chicago, 2006