Instant Karma: Collecting Provenance for AMSR-E
Beth Plale, Director, Data to Insight Center, Indiana University
Helen Conover, Information Technology and Systems Center, University of Alabama in Huntsville
Joint AMSR-E Science Team Meeting, June 2-3, 2010, Huntsville, AL
Objective
Approach
Co-Is/Partners
Key Milestones
Instant Karma: Applying a Proven Provenance Tool to NASA’s AMSR-E Data Production Stream
PI: Michael Goodman, NASA MSFC
• Improve the collection, preservation, utility and dissemination of provenance information within the NASA Earth Science community
• Customize and integrate Karma, a proven provenance tool, into NASA data production
• Collect and disseminate provenance of AMSR-E (Advanced Microwave Scanning Radiometer – Earth Observing System) standard data products, initially focusing on Sea Ice
• Engage the Sea Ice science team and user community
• Adhere to the Open Provenance Model (OPM)
• Apply Karma to Sea Ice data production workflows
• Customize Karma’s provenance dissemination user interface
• Evaluate usefulness of provenance collected
- Measure traffic to Karma Provenance Browser
- Collect user feedback
- Expand use of Karma to other AMSR-E data production streams
Thorsten Markus, NASA GSFC; Beth Plale, Indiana University; Rahul Ramachandran, Helen Conover, UAHuntsville
TRLin = 7, TRLcurrent = 7
11/09
• Evaluate current AMSR-E SIPS product generation
06/10
• Extend Karma provenance collection tools for SIPS
09/10
• Enhance Karma Provenance Browser interface
10/10
• Instrument AMSR-E Sea Ice production in Testbed
12/10
• Evaluate with Sea Ice science team
03/11
• Introduce Provenance Browser to NSIDC DAAC
06/11
• Instrument AMSR-E Sea Ice production in Ops
09/11
• Evaluate with AMSR-E Sea Ice user community
02/12
• Instrument other AMSR-E data streams
02/12
Types of Provenance Information
• Lots of information already available, but scattered across multiple locations
– Processing system configuration
– Dataset and file level metadata
– Processing history information
– Quality assurance information
– Software documentation (e.g., algorithm theoretical basis documents, release notes)
– Data documentation (e.g., guide documents, README files)
• Instant Karma project aims to collate and organize information from multiple sources
Sea Ice Processing Flow and Dependencies
[Figure: processing flow diagram. Labels: One day’s worth of Level-2A Tbs; Delivered Algorithm Package (Sea Ice); Daily Processing Script; Sea Ice Products 6.25 km; Sea Ice Products 12.5 km; Sea Ice Products 25 km; Sea Ice Concentration; Snow Depth on Sea Ice Product; Snow Melt Mask; Multi-year Ice Mask; Default.]
Snow Melt Mask is a 5-day running average that is updated and replaced daily. Masks generated yesterday are used for today’s products.
[Figure: system overview diagram. Labels: Karma provenance collection and representation; Karma analysis tool suite and portal. Legend: marked components are optionally installed in future.]
AMSR-E daily processing workflow
• Workflow executes once per day of input files received
• Uses configuration files, data files, mask files
• Invokes processes, programs, algorithms
• Generates data files, images
Karma 3.0 architecture
[Figure: architecture diagram. Clients: Instrumented apps, Query client, Graph Viz client, Preserv client (Preservation object), each using the Client Toolkit and Prov Track lib. Ingest side: Subscriber Interface (provenance listener), Notification Ingester Interface, Ingester Implementer Interface, Synchronous ingest Web service (Axis 2), RESTful Service, WS messenger Bus (WSM, future). Storage and query: Relational store, Database Setup script, Query Service. Formats: OPM 1.0 XML events, OPM 1.0 RDF XML. Knowledge discovery: inferencing, quality, completeness. Optional components: Xregistry, XMC Cat metadata catalog.]
Karma Architecture
• Service Core
– Bridge pattern for independent Ingester and IngesterImplementer implementation
– Core components for ingesting notifications
– Asynchronously shredding raw notifications to populate tables
• Axis2 Web Service Layer
– API layer to ingest notifications from clients’ push
– Also allows another layer to ingest notifications by pulling from message bus
• Axis2 Handlers
– Gather information by intercepting SOAP message from host services
– Minimal intrusiveness and lightweight instrumentation
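The Bridge pattern named under Service Core can be illustrated with a minimal sketch. This is not Karma's actual code: the class bodies, the method names `ingest` and `store_raw`, and the in-memory store are assumptions made for illustration. Only the split between an Ingester abstraction and an IngesterImplementer comes from the slide.

```python
from abc import ABC, abstractmethod


class IngesterImplementer(ABC):
    """Implementation side of the bridge: how a raw notification is stored."""

    @abstractmethod
    def store_raw(self, notification: str) -> int:
        """Persist the raw notification and return an identifier."""


class InMemoryImplementer(IngesterImplementer):
    """Hypothetical stand-in for a relational store."""

    def __init__(self) -> None:
        self.rows: list[str] = []

    def store_raw(self, notification: str) -> int:
        self.rows.append(notification)
        return len(self.rows) - 1


class Ingester:
    """Abstraction side of the bridge: what clients call."""

    def __init__(self, implementer: IngesterImplementer) -> None:
        self._impl = implementer

    def ingest(self, notification: str) -> int:
        if not notification.strip():
            raise ValueError("empty notification")
        # Delegation keeps the two sides free to vary independently.
        return self._impl.store_raw(notification)


store = InMemoryImplementer()
ingester = Ingester(store)
first_id = ingester.ingest("<workflowInvoked/>")
second_id = ingester.ingest("<dataProduced/>")
```

Because `Ingester` only depends on the abstract interface, a different storage backend could be swapped in without touching the code that receives notifications.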
Scavenging: for Stand-alone Provenance Collection
• Collects provenance using scavenging
– Use existing collection mechanisms
• e.g., logging tool, auditing tool
– Low burden on both users and programmers

Approach             | Application Burden | Human Burden | Information Quality
User Annotation      | Low                | High         | Error rates and omissions lead to incomplete information
Scavenging           | Low                | Low          | Could have incompleteness
Full Instrumentation | High               | Low          | Complete
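As a rough illustration of scavenging, the sketch below derives provenance events from an existing log instead of instrumenting the application. The log format, program name, and file names are invented for this example; real processing logs will look different.

```python
import re

# Hypothetical log format: "<time> <program> READ|WRITE <path>".
LOG_LINE = re.compile(
    r"^(?P<time>\S+) (?P<program>\S+) (?P<action>READ|WRITE) (?P<path>\S+)$"
)


def scavenge(log_text):
    """Turn READ/WRITE log lines into (program, relation, file, time) tuples."""
    events = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # lines the pattern does not cover yield no provenance
        relation = "used" if match["action"] == "READ" else "generated"
        events.append((match["program"], relation, match["path"], match["time"]))
    return events


sample_log = """\
2010-07-14T00:05:00 seaice READ tb_20100714.hdf
2010-07-14T00:05:01 seaice READ snow_melt_mask.dat
starting sea ice concentration step
2010-07-14T00:09:30 seaice WRITE seaice_12km_20100714.hdf
"""
events = scavenge(sample_log)
```

The skipped free-text line shows why the table rates scavenged information as possibly incomplete: only what the existing tool happened to log can be recovered.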
Open Provenance Model (OPM)
• Karma is generic and stand-alone
– Not coupled to any particular system
• Karma 3.0 utilizes OPM v1.01 to represent provenance graph
– OPM is a standard: http://eprints.ecs.soton.ac.uk/16148/1/opm-v1.01.pdf
– Enables provenance information exchange with other OPM-compliant tools
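To make the OPM graph concrete, here is a small sketch of its node kinds (artifact, process, agent) and causal edge types. The `ProvenanceGraph` class and the node names are hypothetical and are not Karma's data model. The edge types and the node kinds they connect follow the OPM specification.

```python
from dataclasses import dataclass, field

# OPM edge types mapped to the (effect, cause) node kinds they connect.
EDGE_KINDS = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}


@dataclass
class ProvenanceGraph:
    nodes: dict = field(default_factory=dict)  # node id -> kind
    edges: list = field(default_factory=list)  # (effect, edge type, cause)

    def add_node(self, node_id, kind):
        if kind not in {"artifact", "process", "agent"}:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, effect, edge_type, cause):
        expected = EDGE_KINDS[edge_type]
        actual = (self.nodes[effect], self.nodes[cause])
        if actual != expected:
            raise ValueError(f"{edge_type} cannot link {actual}")
        self.edges.append((effect, edge_type, cause))


# Illustrative node names for one daily sea ice run.
graph = ProvenanceGraph()
graph.add_node("brightness_file", "artifact")
graph.add_node("daily_script", "process")
graph.add_node("operator", "agent")
graph.add_node("seaice_12km", "artifact")
graph.add_edge("daily_script", "used", "brightness_file")
graph.add_edge("daily_script", "wasControlledBy", "operator")
graph.add_edge("seaice_12km", "wasGeneratedBy", "daily_script")
```

Checking node kinds on every edge is one simple way to keep a graph exchangeable with other OPM-compliant tools.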
Types of Provenance Information
Types of Provenance Info (2)
• [1] launches
– Whom: user ID or name
– What: service e.g., service URI
– When: launch context, time
• [2] consumes and [3] produces
– File (e.g., file URL, owner)
– Service: program, algorithm
• version
• [4] invokes
– Invoking service
– Invoked service
– Parameters
– Results/faults
Additional Types of Provenance Information Captured by Karma
• Execution Status
– Terminated or Failed
• Transfer of Data
– Sending of results
– Receiving of results
• Workflow and Program Lifecycles
• Unknown Notifications
– Stored as raw notifications
• Forthcoming: Spatial and temporal information, simple and complex data values, quality information
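The handling of unknown notifications could look roughly like the sketch below: recognized notification types are shredded into per-type tables, and anything else is kept verbatim. The element names in `KNOWN_TYPES`, the attribute names, and the dictionary-based tables are assumptions for illustration, not Karma's notification schema. The sketch is also synchronous, unlike the asynchronous shredding described for the service core.

```python
import xml.etree.ElementTree as ET

# Hypothetical notification element names.
KNOWN_TYPES = {
    "workflowInvoked", "serviceInvoked", "dataConsumed", "dataProduced",
    "dataSent", "dataReceived", "terminated", "failed",
}


def shred(raw, tables, raw_store):
    """Route one notification: known types go to a table, the rest stay raw."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError:
        raw_store.append(raw)  # unparseable input is still preserved
        return "raw"
    if root.tag not in KNOWN_TYPES:
        raw_store.append(raw)
        return "raw"
    tables.setdefault(root.tag, []).append(dict(root.attrib))
    return root.tag


tables, raw_store = {}, []
results = [
    shred('<dataProduced file="seaice_25km.hdf"/>', tables, raw_store),
    shred('<failed service="seaice" reason="missing input"/>', tables, raw_store),
    shred('<heartbeat host="sips01"/>', tables, raw_store),
    shred("not xml at all", tables, raw_store),
]
```

Keeping the raw text means a notification type that is unknown today can be reprocessed later once a parser for it exists.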
Partial provenance graph for sea ice product run of 14 July 2010 – attribute data is incomplete
[Figure: provenance graph. Nodes: Execution start date; Execution end date; Execution date (Value = 14 July 2010); Product name; Brightness files; Mask file; Sea ice file; Input files; 25km sea ice product; 12km sea ice product; 6km sea ice product. Edges: used; WasControlledBy; WasGeneratedBy. Example attributes: Processing_type = sea ice; URI = qqqq; Generation time = xxxx; file name = yyyy; Service URI = qqqq; Execution time = xxxx; version no. = yyyy; URI = gggg; Filename = yyyy.]
Provenance graph for sea ice product
Provenance used to explain the difference between the images of 1/28/2010 and 2/09/2010 as a change in the sea mask due to missing data (underlined in blue in the lower graph)
Example
• The provenance visualization is obtained using a simulated Karma provenance database. In this use case, its aim is to help the scientist identify the mask file being used and the provenance information about that mask file.
• The provenance graph gives the user annotated lineage about a sea ice data product: the inputs required for its creation and the files created as a result of processing.
• Provenance visualization in this form allows for deeper examination.
– e.g., for a recurring error, the scientist can view all related provenance information to get to the source of the error.
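The mask-file use case amounts to walking the provenance graph backward from a product. The sketch below does this with a breadth-first traversal over (effect, relation, cause) edges. The node names and dates are invented for illustration and do not come from the Karma database.

```python
from collections import deque

# (effect, relation, cause) edges; names are illustrative only.
EDGES = [
    ("seaice_12km_20100209", "wasGeneratedBy", "daily_script_run_20100209"),
    ("daily_script_run_20100209", "used", "brightness_file_20100209"),
    ("daily_script_run_20100209", "used", "snow_melt_mask_20100208"),
    ("snow_melt_mask_20100208", "wasGeneratedBy", "daily_script_run_20100208"),
    ("daily_script_run_20100208", "used", "brightness_file_20100208"),
]


def lineage(product, edges):
    """Return every node the product depends on, nearest first (BFS)."""
    causes = {}
    for effect, _relation, cause in edges:
        causes.setdefault(effect, []).append(cause)
    seen, order, queue = {product}, [], deque([product])
    while queue:
        node = queue.popleft()
        for cause in causes.get(node, []):
            if cause not in seen:
                seen.add(cause)
                order.append(cause)
                queue.append(cause)
    return order


ancestors = lineage("seaice_12km_20100209", EDGES)
masks = [node for node in ancestors if "mask" in node]
```

Here `masks` picks out the mask file in the product's lineage, and continuing the walk reaches the earlier run and input that produced that mask, which is the path a scientist would follow to the source of a recurring error.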
Ongoing work
• Better graph layout with detail for each data product and process used in generating a sea ice product.
• Give nodes different shapes and colors depending on whether they are input nodes, generated output nodes, etc.
• The user will be able to add annotations to edges by simply right-clicking on them, thus capturing semantic annotations to the existing causal dependencies.
• Forthcoming: Spatial and temporal information, simple and complex data values, quality information
• Provenance bundle archived with data or embedded in HDF file, in addition to Karma database
AMSR-E Provenance Use Cases
• Browse provenance graphs: convey rich information about final data product details
– Spatial location, time of observation, algorithms employed, quality propagation
• Answer “Something isn’t right” question
– Example illustrated earlier: did not receive data for several days so mask can be inaccurate.
• Provenance “bundle” includes relevant science papers
• New communication satellites interfere with NASA satellites for certain channels
– Identify channels affected by RFI and channels used to generate each product
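The RFI use case reduces to intersecting two sets once provenance records which channels each product used. The sketch below assumes such a record exists. The product names and their channel lists are placeholders and not the actual AMSR-E algorithm inputs.

```python
# Illustrative data only: the per-product channel lists are assumptions.
CHANNELS_USED = {
    "product_a": {"18.7V", "36.5V", "36.5H"},
    "product_b": {"10.65V", "18.7V"},
    "product_c": {"89.0V", "89.0H"},
}


def products_affected(channels_used, rfi_channels):
    """Map each product to the RFI-affected channels it depends on."""
    affected = {}
    for product, channels in channels_used.items():
        overlap = channels & rfi_channels
        if overlap:
            affected[product] = sorted(overlap)
    return affected


affected = products_affected(CHANNELS_USED, {"10.65V", "18.7V"})
```

With channel usage captured as provenance, a newly reported interference source can be checked against every product without rereading the algorithm documentation.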