Coding Provenance in Software and Matching Tools to Data
OPeNDAP Provenance Project
And
ESIP ToolMatch Project
Patrick West, Tetherless World Constellation, Rensselaer Polytechnic Institute
What is Provenance
• Provenance is information about entities, activities, and people involved in producing a piece of data or thing.
• In Data Science we’re interested in keeping track of, or being able to trace back, how a data product was generated and from what.
• For example, the Ecosystem Status Report has an interesting plot in one of its chapters that I’m interested in learning more about.
Generating a Plot
How did I get there?
I know how it was generated
• Because I’m the one who added the plot to the document
• I know how the plot was generated
• I wrote parts of the software in OPeNDAP Hyrax that’s doing the data access, manipulation, and transformation
• So I know: a plot is generated by accessing a set of data through OPeNDAP Hyrax, which produces a DAP DataDDS object by reading in a set of NetCDF files, constraining and projecting the data, running a server-side function or two, and doing an aggregation. That data product is then used to generate the plot.
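The data access step above starts from a request URL. As a minimal sketch, assuming DAP2-style URLs of the form dataset, response suffix, then a constraint expression, the request could be assembled as below. The helper name and the variable name `temperature` are hypothetical; server-side functions and aggregation are not shown.

```python
from urllib.parse import quote


def dap_request_url(dataset_url, response="dods", projections=()):
    """Build a DAP2-style request URL: <dataset>.<response>?<constraint>."""
    url = f"{dataset_url}.{response}"
    constraint = ",".join(projections)
    if constraint:
        # Keep the brackets, colons, and commas of the constraint readable.
        url += "?" + quote(constraint, safe="[]:,_.")
    return url


# Ask for the structure (DDS) of a hypothetical variable, indices 0..9, stride 1.
url = dap_request_url(
    "http://test.opendap.org/dap/data/h5/monday.h5",
    response="dds",
    projections=["temperature[0:1:9]"],  # hypothetical variable name
)
print(url)
```

The constraint and projection happen on the server, so the client only has to describe what it wants in the URL.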
Generating a Plot
[Diagram: an OPeNDAP request URL goes to OPeNDAP Hyrax, which reads in data and spits out data. An IPython Notebook, made up of cells, uses the data and generates the plot. Badda bing, badda boom.]
BUT I WANT TO KNOW MORE
Some information I WANT to know
• How was that plot generated?
• What software was used to generate the plot and any intermediary data?
• What data files were read in to generate the plot, what was done to the data, and by what?
• Where did those data files come from? What parameters are in there? What sensors measured those parameters? Tell me information about the measuring of the data.
Generating a Plot
[Diagram: an OPeNDAP request URL goes to OPeNDAP Hyrax, which reads in data and spits out data. An IPython Notebook, made up of cells, uses the data and generates the plot. Open question on the diagram: where did the data files come from?]
Linked Data
• I’m also interested in who develops and publishes the software, how it is licensed, and how I could use it.
• I’m interested in what IPython Notebooks are, what they can do, and whether I could use them for other projects.
• And I want to be able to let the “owner” of the data files know that I’ve used the results of an access in a publication, presentation, article, or whatever.
What the project focuses on
[Diagram: a request URL enters OPeNDAP Hyrax (OLFS and BES). Components shown: NetCDF, dap, ServerSideFunctions, aggregate, transform.]
W3C Prov
Prov-O
:dds_of_reading a prov:Entity ;
    dcterms:format opendap:DataDDS ;
    prov:wasGeneratedBy [
        a prov:Activity ;
        prov:used <http://test.opendap.org/dap/data/h5/monday.h5> ,
                  <http://test.opendap.org/dap/data/h5/tuesday.h5> ;
        prov:wasAssociatedWith <opendapi:software/hdf5_handler/2.1.1>
    ] .

<http://test.opendap.org/dap/data/h5/monday.h5>
    a vsto:Dataset, prov:Entity, toolmatch:DataCollection ;
    toolmatch:hasAccessURL <http://test.opendap.org/dap/data/h5/monday.h5> .

<http://test.opendap.org/dap/data/h5/tuesday.h5>
    a vsto:Dataset, prov:Entity, toolmatch:DataCollection ;
    toolmatch:hasAccessURL <http://test.opendap.org/dap/data/h5/tuesday.h5> .
:aggregated_dds a prov:Entity ;
    dcterms:format opendap:DataDDS ;
    prov:wasGeneratedBy [
        a prov:Activity ;
        prov:used :constrained_dds ;
        prov:wasAssociatedWith <opendapi:software/ncml_module/1.2.2>
    ] .

:result a foaf:Document ;
    nfo:fileName "thursday.h5" ;
    dcterms:format "netcdf" ;
    prov:wasGeneratedBy [
        a prov:Activity ;
        prov:used :aggregated_dds ;
        prov:wasAssociatedWith <opendapi:software/fileout_netcdf/1.2.1>
    ] .

:constrained_dds a prov:Entity ;
    dcterms:format opendap:DataDDS ;
    prov:wasGeneratedBy [
        a prov:Activity ;
        prov:used :dds_of_reading ;
        prov:wasAssociatedWith <opendapi:software/BES/3.12.0>
    ] .
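With statements like these, the earlier questions (what files were read, what was done, and by what) come down to walking the chain backwards. The following is a rough sketch only: it hand-copies the Prov-O statements above into a plain Python dictionary rather than querying a triple store, and the function name `trace` is hypothetical.

```python
# Each entity maps to the activity that generated it: what the activity used
# and which software it was associated with (copied from the Prov-O above).
GENERATED_BY = {
    ":result": {"used": [":aggregated_dds"], "agent": "fileout_netcdf/1.2.1"},
    ":aggregated_dds": {"used": [":constrained_dds"], "agent": "ncml_module/1.2.2"},
    ":constrained_dds": {"used": [":dds_of_reading"], "agent": "BES/3.12.0"},
    ":dds_of_reading": {
        "used": [
            "http://test.opendap.org/dap/data/h5/monday.h5",
            "http://test.opendap.org/dap/data/h5/tuesday.h5",
        ],
        "agent": "hdf5_handler/2.1.1",
    },
}


def trace(entity, generated_by=GENERATED_BY):
    """Follow wasGeneratedBy/used links back from an entity to its sources."""
    steps, sources, stack, seen = [], [], [entity], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        activity = generated_by.get(current)
        if activity is None:
            # Nothing generated it within this graph, so treat it as a source.
            sources.append(current)
            continue
        steps.append((current, activity["agent"]))
        stack.extend(reversed(activity["used"]))
    return steps, sources


steps, sources = trace(":result")
for produced, software in steps:
    print(f"{produced} was produced with {software}")
print("Source files:", sources)
```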
DOAP – Description of a Project
<http://opendap.tw.rpi.edu/instances/software/BES>
    a doap:Project, prov:Entity ;
    doap:name "OPeNDAP Back-End Server (BES)" ;
    doap:developer <http://tw.rpi.edu/instances/PatrickWest> ;
    doap:developer <http://tw.rpi.edu/instances/DanHalloway> ;
    doap:developer <http://tw.rpi.edu/instances/James_Gallagher> ;
    doap:developer <http://tw.rpi.edu/instances/NathanPotter> ;
    doap:homepage <http://opendap.org/download/hyrax?q=BES_software> ;
    doap:vendor <http://tw.rpi.edu/instances/OPeNDAP> ;
    doap:repository <http://opendap.tw.rpi.edu/instances/Repository> ;
    doap:bug-database <http://scm.opendap.org/trac/> ;
    doap:release <http://opendap.tw.rpi.edu/instances/software/BES/3.12.0> ;
    doap:description "BES is a high-performance back-end server software framework that allows data providers more flexibility in providing end users views of their data." ;
    doap:license <http://opendap.tw.rpi.edu/instances/License> .

<http://opendap.tw.rpi.edu/instances/software/BES/3.12.0>
    a doap:Version, prov:Entity ;
    prov:specializationOf <http://opendap.tw.rpi.edu/instances/software/BES> ;
    doap:name "BES-3.12.0" ;
    doap:revision "3.12.0" ;
    doap:download-page <http://opendap.org/download/hyrax/1.9> ;
    doap:repository <http://scm.opendap.org/svn/tags/bes/3.12.0> ;
    doap:license <http://opendap.tw.rpi.edu/instances/License> ;
    doap:created "2013-08-27" .
<http://opendap.tw.rpi.edu/instances/Repository>
    a doap:SVNRepository ;
    doap:location <http://scm.opendap.org/svn/> ;
    doap:browse <http://scm.opendap.org/svn/> .

<http://opendap.tw.rpi.edu/instances/License>
    dc:description "This software is distributed under the GNU Lesser General Public License <http://www.gnu.org/licenses/gpl.html>" ;
    doap:name "GNU LESSER GENERAL PUBLIC LICENSE" ;
    rdfs:seeAlso <http://www.gnu.org/licenses/gpl.html> .

# The hash in the IRI below is: HASH(config file, BES version that read it)
<http://opendap.tw.rpi.edu/id/opendap/D9IH6677D3I6HDIHD36IHDI7DH>
    a prov:Agent ;
    prov:wasDerivedFrom
        <http://opendap.tw.rpi.edu/instances/software/hdf5_handler/2.1.1> ,
        <http://opendap.tw.rpi.edu/instances/software/BES/3.12.0> ,
        <http://opendap.tw.rpi.edu/instances/software/ncml_module/1.2.2/> ,
        <http://opendap.tw.rpi.edu/instances/software/fileout_netcdf/1.2.1> ;
    prov:wasDerivedFrom :config_file_hash ;
    # b/c BES set it up:
    prov:wasAttributedTo <http://scm.opendap.org/svn/tags/bes/3.9.2> .
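The agent IRI above is built from HASH(config file, BES version that read it). The slides do not say which hash function is used, so the sketch below assumes SHA-256 purely for illustration; the function names and the sample configuration text are hypothetical as well.

```python
import hashlib

BASE = "http://opendap.tw.rpi.edu/id/opendap/"


def configuration_id(config_text, bes_version):
    """Hash the config file contents together with the BES version."""
    digest = hashlib.sha256()  # assumption: the real hash function is unspecified
    digest.update(config_text.encode("utf-8"))
    digest.update(b"\0")  # separator so the two inputs cannot run together
    digest.update(bes_version.encode("utf-8"))
    return digest.hexdigest()


def agent_iri(config_text, bes_version, base=BASE):
    """Build an IRI for one running configuration of the server."""
    return base + configuration_id(config_text, bes_version)


config = "modules=h5,ncml,fileout_netcdf\n"  # hypothetical config contents
print(agent_iri(config, "3.12.0"))
```

The same configuration read by the same BES version always maps to the same agent, while a change to either one yields a new agent.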
What We’re Trying
• The BES loads shared modules at startup that handle specific tasks
• Our first attempt used something called a Reporter, which reports on the completion of a request, but that is too late: it runs after the fact.
• Our second thought was to have the modules themselves add provenance information on the fly, which to me is ideal but unrealistic.
• The probable implementation is to track provenance in the BES itself, the software framework that communicates with the modules.
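To make the third option concrete, here is a toy sketch of a framework that dispatches to modules and records provenance itself, so the modules carry no provenance code. It is not the BES API: every class, method, and module name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceLog:
    records: list = field(default_factory=list)

    def record(self, entity, used, agent):
        self.records.append({"entity": entity, "used": list(used), "agent": agent})


class Framework:
    """Stand-in for the framework: dispatches to modules and logs each step."""

    def __init__(self):
        self.modules = {}
        self.log = ProvenanceLog()

    def register(self, name, version, func):
        self.modules[name] = (version, func)

    def run(self, name, output_label, inputs):
        """Run a module on (label, value) inputs and record what it used."""
        version, func = self.modules[name]
        result = func(*[value for _, value in inputs])
        self.log.record(output_label, [label for label, _ in inputs], f"{name}/{version}")
        return output_label, result


framework = Framework()
# The modules are plain functions that know nothing about provenance.
framework.register("reader", "0.1", lambda path: [1, 2, 3])
framework.register("aggregate", "0.1", lambda *parts: [x for p in parts for x in p])

a = framework.run("reader", ":dds_a", [("a.h5", "a.h5")])
b = framework.run("reader", ":dds_b", [("b.h5", "b.h5")])
out = framework.run("aggregate", ":aggregated", [a, b])

for rec in framework.log.records:
    print(rec)
```

Because the framework sees every input and output, it can record each step as it happens rather than after the request completes.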
What’s next
• Get more use cases about what types of information we want to collect
• Write the story about what we’re trying to do
• Come up with software use cases for the implementation
• Continue discussing provenance with the core OPeNDAP group
• Continue to work with the original Prov group (Tim, Jim, and Stephan) in discussions
Questions
ToolMatch Use Case
• "I need data for carbon dioxide (CO2) concentrations, a climate change indicator, for the summer of 2012, that can be accessed via OPeNDAP Hyrax and plotted as a time series."
• "I need data with measurements of atmospheric aerosol optical depth sliced along latitude and longitude, returned as NetCDF data, and accessible in MATLAB."
Using SADL
Inferencing
* Equivalent Class
    DataCollection <Aqua_AIRS_Level2_Plus_AMSU>
    and (isAccessedBy value OPeNDAP) or (hasDataStorageFormat value NetCDF)
    and (usesGridType value AuxiliaryLatLonGrid) or (usesGridType value RegularLatLonGrid)
    and usesConvention value ClimateForecast_CF
* Subclass Of (inferred)
    mappedBy value IDV
    and mappedBy value McIDAS-V
    and mappedBy value Panoply
* Equivalent Class
    DataCollection
    and (isAccessedBy value OPeNDAP) or (hasDataFormat value NetCDF)
    and usesConvention value CF1Convention
    and usesConvention value RegularLatLonGrid
* Subclass Of (inferred)
    mappedBy value Ferret
    and mappedBy value GrADS
* Equivalent Class
    DataCollection
    and (isAccessedBy value GrADSDataServer) or (isAccessedBy value Hyrax) or (isAccessedBy value ThreddsDataServer) or (isAccessedBy value erddap)
* Subclass Of (inferred)
    isAccessedBy value OPeNDAP
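The effect of these class definitions can be imitated with ordinary set logic. The sketch below is not how a reasoner works internally; it hard-codes the last two rule sets above and assumes the "or" terms are grouped before the "and" terms. The function name `infer` is hypothetical.

```python
OPENDAP_SERVERS = {"GrADSDataServer", "Hyrax", "ThreddsDataServer", "erddap"}


def infer(collection):
    """Apply two of the rules above to a dict of property -> set of values."""
    facts = {prop: set(values) for prop, values in collection.items()}
    accessed = facts.setdefault("isAccessedBy", set())

    # Access through any of the listed servers implies access through OPeNDAP.
    if accessed & OPENDAP_SERVERS:
        accessed.add("OPeNDAP")

    # (OPeNDAP access or NetCDF format) and both conventions: Ferret and GrADS apply.
    reachable = "OPeNDAP" in accessed or "NetCDF" in facts.get("hasDataFormat", set())
    conventions = facts.get("usesConvention", set())
    if reachable and {"CF1Convention", "RegularLatLonGrid"} <= conventions:
        facts.setdefault("mappedBy", set()).update({"Ferret", "GrADS"})
    return facts


served_by_hyrax = infer({
    "isAccessedBy": {"Hyrax"},
    "usesConvention": {"CF1Convention", "RegularLatLonGrid"},
})
print(sorted(served_by_hyrax["mappedBy"]))
```

Note how the rules chain: the collection only states that it is served by Hyrax, yet it still ends up matched to tools that require OPeNDAP access.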
Resulting Query
The resulting query to find the set of tools available to visualize a data collection becomes very simple:
DESCRIBE ?tool
WHERE {
    <data_collection> toolmatch:visualizedBy ?tool .
    ?tool rdf:type toolmatch:Tool .
}
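As a sketch of how a client might issue this query, the code below fills in the data collection IRI and builds a GET request URL with the query in the `query` parameter, as the SPARQL protocol allows. The `toolmatch:` namespace IRI and the data collection IRI are placeholders, not the project's real identifiers, and no request is actually sent.

```python
from urllib.parse import urlencode

TOOLMATCH_NS = "http://example.org/toolmatch#"  # placeholder namespace IRI
ENDPOINT = "http://toolmatch.tw.rpi.edu/sparql"


def tools_query(collection_iri):
    """Fill the data collection IRI into the DESCRIBE query."""
    return (
        f"PREFIX toolmatch: <{TOOLMATCH_NS}>\n"
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "DESCRIBE ?tool\n"
        "WHERE {\n"
        f"  <{collection_iri}> toolmatch:visualizedBy ?tool .\n"
        "  ?tool rdf:type toolmatch:Tool .\n"
        "}"
    )


def request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


query = tools_query("http://example.org/data/my_collection")  # placeholder IRI
url = request_url(ENDPOINT, query)
print(query)
print(url)
```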
The Result
[Screenshot: the result, with Description and Tools sections]
Where are we and what’s next
• We’ve got part of the ontology done
• We’ve got stuff in the triple store
• We need to complete the dataset ontology piece
• We need to verify the ontology and rules
• We need crowd sourcing for more tools and information about tools
• Patrick needs to understand rules better
Questions
References
OPeNDAP Provenance Project
• Prov Overview - http://www.w3.org/TR/prov-overview/
• OPeNDAP Prov - https://github.com/tetherless-world/opendap/
• OPeNDAP LODSPeaKr - http://opendap.tw.rpi.edu/index.html
• OPeNDAP Endpoint - http://opendap.tw.rpi.edu/virtuoso/sparql
• OPeNDAP - http://opendap.org
ToolMatch Project
• ToolMatch - http://wiki.esipfed.org/index.php/ToolMatch
• ToolMatch Virtual Server - http://toolmatch.tw.rpi.edu/
• ToolMatch Schema - http://toolmatch.tw.rpi.edu/docs/index
• ToolMatch Endpoint - http://toolmatch.tw.rpi.edu/sparql