Date posted: 18-Feb-2017
Category: Technology
Uploaded by: tanu-malik
A Reproducible Framework Powered By Globus
Tanu Malik, Kyle Chard, Ian Foster
Computation Institute, University of Chicago and Argonne National Laboratory
GeoDataspace
Share and Reproduce
Alice wants to share her models and simulation output with Bob, and Bob wants to re-execute Alice’s application to validate her inputs and outputs.
Alice’s Options
1. A tar and gzip
2. Build a website with model code, parameters, and data
3. Submit to a repository
4. Create a virtual machine
Bob’s Frustration
1. I do not find the lib.so required for building the model.
2. How do I?
Lack of easy and efficient methods for sharing and reproducibility
[Figure axes: amount of pain Bob suffers; amount of pain Alice suffers]
Some Reproducibility Requirements
• Automatically solve the “dependency hell” problem
  • “I have an incompatible version of the library”
• Connect programs with data and capture dataflows
  • “Which version of my program produced this data?”
• Allow easy annotation of human knowledge
  • “Insufficient documentation to install or run the program”
• Enable reproducibility efficiently and with minimal intervention
  • “No change of programming or authoring environments”
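The question "Which version of my program produced this data?" can be illustrated with a small sketch. This is not the framework's implementation: it assumes that a content hash stands in for a "version", and the names `ProvenanceLog`, `record_run`, and `program_for` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def digest(content: bytes) -> str:
    """Content hash used here as a stand-in for a 'version' (an assumption)."""
    return hashlib.sha256(content).hexdigest()


@dataclass
class ProvenanceLog:
    """Hypothetical in-memory log: output digest -> program and input digests."""
    records: Dict[str, dict] = field(default_factory=dict)

    def record_run(self, program: bytes, inputs: List[bytes], output: bytes) -> None:
        # Link the produced data to the exact program and inputs that made it.
        self.records[digest(output)] = {
            "program": digest(program),
            "inputs": [digest(item) for item in inputs],
        }

    def program_for(self, output: bytes) -> Optional[str]:
        # Answer: which program version produced this data?
        record = self.records.get(digest(output))
        return record["program"] if record else None


log = ProvenanceLog()
model_v1 = b"def run(x): return x * 2\n"
model_v2 = b"def run(x): return x * 3\n"
log.record_run(model_v1, [b"4"], b"8")
log.record_run(model_v2, [b"4"], b"12")

# The output b"12" traces back to the second version of the model.
print(log.program_for(b"12") == digest(model_v2))
```

A real system would persist these records and capture them automatically while the program runs; the sketch only shows the shape of the link between program and data.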
Reproducible Framework
[Figure: Reproducible Framework diagram. Labels:]
Machine A: Application (data; system files such as /bin, /lib, ...; source code; parameters; configuration; network connections)
SciUnits (Docker Hub, GitHub, DataHub); Globus Catalog; Globus Publish
Machine B: SciUnits; Execution Platform (Docker, PTU, chroot)
Share/Transfer; steps numbered 1 to 4
1. Capture the scientific activity
• Capture the source code, the data, the environment, including the flows of data from process to process (local or distributed)
2. Preserve as SciUnits
• Preserve the captured information as physical files or as detailed metadata (annotations and provenance)
3. Share & Distribute
• Share the sciunits with others including detailed metadata
4. Re-execute and Re-analyze
• Users can run the complete package without installation or configuration.
• Queries for detailed provenance of data and versions.
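The capture and preserve steps above can be sketched as a file manifest that is checked again before re-execution. This is a minimal illustration under stated assumptions, not the SciUnit format: the manifest layout and the functions `build_manifest` and `verify_manifest` are hypothetical, and a real capture would also cover the environment and dataflows.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict:
    """Capture step (sketch): record every file under root with its SHA-256."""
    files = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            files[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    # Annotations are left empty here; they would hold human-provided metadata.
    return {"files": files, "annotations": {}}


def verify_manifest(root: Path, manifest: dict) -> list:
    """Pre-check before re-execution: report missing or changed files."""
    problems = []
    for rel, expected in manifest["files"].items():
        target = root / rel
        if not target.is_file():
            problems.append(f"missing: {rel}")
        elif hashlib.sha256(target.read_bytes()).hexdigest() != expected:
            problems.append(f"changed: {rel}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    (work / "model.py").write_text("print('run model')\n")
    (work / "params.cfg").write_text("steps=10\n")

    manifest = build_manifest(work)
    manifest_json = json.dumps(manifest, indent=2)  # preserved form of the capture

    ok_report = verify_manifest(work, json.loads(manifest_json))

    # Simulate a parameter file that drifted after the capture.
    (work / "params.cfg").write_text("steps=20\n")
    changed_report = verify_manifest(work, manifest)

print(ok_report, changed_report)
```

The check tells Bob whether the files he received match what Alice captured before he tries to run anything.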
CI Components
• SciUnits
  • Units of scientific activity/research output
• Metadata Catalog
  • A scalable, flexible catalog for annotations conforming to the open-world assumption
• Globus services for sharing, transferring, and publishing sciunits
  • Share/Publish sciunits for others to use
• Replay capability through native re-execution, Docker, or Vagrant
  • Run sciunits without installation or configuration and metadata information
Simplifying Data Management for Geoscience Models
Tanu Malik, Ian Foster, Kyle Chard, Joseph Baker, Mike Gurnis, Jonathan Goodall, Scott Peckham
Science Drivers
Solid Earth
Space Science
Hydrology
CSDMS
• http://workspace.earthcube.org/geodataspace
• Software, Source code, Science Usecases, Reports, Presentations, News
Project Goals
• GeoDataspace Project Goals:
• Establish the reproducible framework with Globus
• Enable three use cases for establishing geounits in Space Science, Hydrology, and Seismology
• Make the geounit Client widely accessible to the EarthCube community
• Connect with a model and data repository (CSDMS)
Science Use Cases
• Seismology: geounits of 2D and 3D kinematic geoscience models, visualized through GPlates and modifying GPML data files
  • End Goal: Sharing, preserving, and publishing visualization sessions with data
• Space Science: geounits on SuperDARN data with analysis tools available from the Baker Laboratory at Virginia Tech
  • End Goal: Sharing and publishing geounits
• Hydrology: geounits of iRODS workflows on hydrology VIC models
  • End Goal: Demonstrating end-to-end reproducibility with iRODS, which does not support data provenance or data publishing
Acknowledgements
Funders:
Community:
Reproducible Framework Client
[Figure: geounit Client architecture. Labels:]
Core: Provenance Data; Annotations; Application Virtualization (Source Code, Data, Environment, Library Dependencies)
Plugins: Clipboard Events; Commands; Web Browsing History; Metadata from Files; Ontologies, Dictionaries, Vocabularies
Globus Services (Catalog, Transfer, Share, Publish)
Provenance Services (PROV Data Management)
1. Support for application virtualization.
2. Provenance collection:
   audit <program name>, exec <program name> [activity]
   Specific version information is collected if part of a VMS
3. Annotation: addannotation <file|dir|g> <key:value>
4. Create packages (Docker or Vagrant compliant)
5. Queries: why, what, where
6. Visualizers
PROVaaS
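To make the `<key:value>` annotation form in item 3 concrete, here is a minimal sketch of how such an annotation could be parsed and stored. It is an illustration only and does not reproduce the geounit Client: `parse_annotation` and `AnnotationCatalog` are hypothetical names, and the choice to accept any key (in the spirit of an open-world catalog) and to split on the first colon are assumptions.

```python
def parse_annotation(text: str) -> tuple:
    """Split 'key:value' on the first colon so values may contain colons."""
    key, sep, value = text.partition(":")
    if not sep or not key.strip():
        raise ValueError(f"expected key:value, got {text!r}")
    return key.strip(), value.strip()


class AnnotationCatalog:
    """Hypothetical store: target path -> {key: [values]}; any key is accepted."""

    def __init__(self):
        self._entries = {}

    def add(self, target: str, annotation: str) -> None:
        key, value = parse_annotation(annotation)
        # No fixed schema: new keys are created on first use.
        self._entries.setdefault(target, {}).setdefault(key, []).append(value)

    def get(self, target: str, key: str) -> list:
        # Unknown targets or keys yield an empty list rather than an error.
        return list(self._entries.get(target, {}).get(key, []))


catalog = AnnotationCatalog()
catalog.add("output/run1.nc", "model:VIC")
catalog.add("output/run1.nc", "source:http://example.org/data")

print(catalog.get("output/run1.nc", "model"))
print(catalog.get("output/run1.nc", "source"))
```

Keeping a list of values per key lets the same file carry several annotations under one key without overwriting earlier ones.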