NASA Distributed Active Archive Center for Biogeochemical Dynamics
Geoscience Data Repository in Digital Object Model and Open-Source Frameworks: Provenance Applications (ESDORA Project)
Jerry Pan, Christopher Lenhardt, Biva Shrestha, Yaxing Wei, Giri Palanisamy, Robert Cook
NASA ORNL DAAC, Environmental Science Division, Oak Ridge National Laboratory
Agenda
1. Geoscience Data Curation
2. System Components & Digital Object Model
3. Capabilities & OAIS Mapping
4. Provenance Applications
5. Concluding Remarks
Digital Data Curation
Maintaining and adding value to a trusted body of digital information for current and future use throughout its lifecycle.
Important Aspects of Data Curation
• Auditing: What changed and when (contextual environment, status).
• Lineage and provenance: The derivation history of data is formally recorded and is both machine- and human-understandable, now and in the future.
• Versioning: Keep earlier versions of a data stream in the data system so that it can be reverted to an earlier version if needed.
• Identifier: Data are identifiable and citable using a standardized scheme, e.g., the Digital Object Identifier (DOI) system.
• Integrity: The integrity of data files is verifiable at any time in their lifecycle.
• Interoperable/accessible for the long term: Easily accessible by users and software.
The Challenge
• A tremendous amount of geoscience data is being generated; digital curation needs to be in place for preservation and reuse.
• Yet there is no generic, interoperable system to manage, preserve, and deliver relevant metadata and data processing lineage information along with the actual content.
ESDORA: A Complete Data System Built on Fedora Digital Object Model
• Archive Management: Fedora Repository
• User Interface: Drupal & Islandora
• Search & Discovery: Apache Solr & Fedora Semantic Store
http://esdora2.ornl.gov/
Fedora Digital Object Model
[Figure: a digital object, XML-encoded, containing Object Info (ID), Semantics, Audit, Metadata 1, Metadata 2, …, Content 1, Content 2, …]
(Payette, S. and C. Lagoze, 1998)
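The parts of the digital object listed in the diagram can be sketched as a small data structure. This is only an illustration of the idea that metadata, content, semantics, and audit records live in one unit. It is not Fedora's API: the class names, field names, and the identifier `demo:1` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Datastream:
    """One named stream inside a digital object (hypothetical structure)."""
    dsid: str       # stream name, e.g. "Metadata 1" or "Content 1"
    kind: str       # "metadata" or "content"
    payload: bytes  # the stored bytes


@dataclass
class DigitalObject:
    """Metadata, content, semantics, and audit trail kept in one unit."""
    pid: str  # object identifier (the "ID" box in the diagram)
    datastreams: Dict[str, Datastream] = field(default_factory=dict)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)
    audit: List[str] = field(default_factory=list)

    def add_datastream(self, ds: Datastream) -> None:
        # Store the stream and record the change in the audit trail.
        self.datastreams[ds.dsid] = ds
        self.audit.append(f"added datastream {ds.dsid}")

    def relate(self, predicate: str, target_pid: str) -> None:
        # Store a semantic relation as a (subject, predicate, object) triple.
        self.relations.append((self.pid, predicate, target_pid))
        self.audit.append(f"added relation {predicate} -> {target_pid}")


obj = DigitalObject(pid="demo:1")  # hypothetical identifier
obj.add_datastream(Datastream("Metadata 1", "metadata", b"<metadata/>"))
obj.add_datastream(Datastream("Content 1", "content", b"\x00\x01"))
obj.relate("DerivedFrom", "demo:0")
```

Every change goes through the object itself, so the audit list grows alongside the data it describes.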
ESDORA Capabilities:
• Metadata and data managed together in one logical unit
• Integrity checks, versions, and audit trails
• Machine-readable semantics for provenance knowledge
• XML encoding for long-term storage, access, and recovery
• Search, discovery, metadata publishing
• Multiple standards (FGDC, ISO, EML, etc.) accommodated (we use FGDC)
OAIS Reference Architecture
Information Unit
Logical information units (packages) for ingestion, management, dissemination
OAIS – SIP: Submission Information Package
OAIS – AIP: Archival Information Package
OAIS – DIP: Dissemination Information Package
ESDORA SIP
Data set (folder)
-- Metadata (folder with structured and non-structured metadata files)
-- Data (folder with actual data files)
[Figure: the submitted folder maps to a digital object containing Object Info (ID), Semantics, Audit, FGDC Metadata, Free Text Metadata, …, Policy, Data Content]
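A pre-ingest check of the SIP folder layout might look like the sketch below. It assumes the two subfolders are literally named `Metadata` and `Data`; the slide gives their roles, not their exact names. The function `check_sip` and the sample file names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List

# Assumed folder names; the slide describes their roles only.
REQUIRED_SUBFOLDERS = ("Metadata", "Data")


def check_sip(dataset_dir: Path) -> List[str]:
    """Return a list of layout problems; an empty list means the SIP looks complete."""
    problems = []
    for name in REQUIRED_SUBFOLDERS:
        sub = dataset_dir / name
        if not sub.is_dir():
            problems.append(f"missing folder: {name}")
        elif not any(p.is_file() for p in sub.iterdir()):
            problems.append(f"empty folder: {name}")
    return problems


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset"
    (dataset / "Metadata").mkdir(parents=True)
    (dataset / "Metadata" / "fgdc.xml").write_text("<metadata/>")
    incomplete = check_sip(dataset)   # the Data folder does not exist yet

    (dataset / "Data").mkdir()
    (dataset / "Data" / "granule.nc").write_bytes(b"\x00")
    complete = check_sip(dataset)     # both folders now hold a file
```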
ESDORA AIP (data/metadata coexist)
[Figure: a digital object containing Object Info (ID), Semantics, Audit, FGDC Metadata, Free Text Metadata, …, Policy, Data Content]
Inline Metadata Editor
ESDORA DIP
• REST Web Services
• Data Objects
• Collection Objects
• Datastreams
• Metadata in OAI-PMH
• Indexing & Search
http://esdora2.ornl.gov/oaiprovider/?verb=ListRecords&metadataPrefix=fgdc
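The OAI-PMH request above is an ordinary HTTP GET with two query arguments, `verb` and `metadataPrefix`. The sketch below only builds the URL and sends no request, since the server may no longer be reachable. The helper name `list_records_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the slide.
BASE_URL = "http://esdora2.ornl.gov/oaiprovider/"


def list_records_url(metadata_prefix: str, base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH ListRecords request URL for the given metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


url = list_records_url("fgdc")
print(url)
```

A harvester would fetch this URL and parse the XML response.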
Solr-Enabled Indexing & Search
• Simple Keyword Search
• Faceted Search
• Spatial/Temporal Search
• Result linked to data objects
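As a rough illustration of how keyword, faceted, and temporal search could combine in one Solr request, the sketch below assembles standard Solr query parameters (`q`, `fq`, `facet`, `facet.field`, `wt`). The field names `keyword`, `project`, and `begin_year` are assumptions; the actual ESDORA index schema is not given in the slides.

```python
from urllib.parse import urlencode, parse_qs


def solr_params(keyword, facet_fields=(), year_range=None):
    """Assemble a Solr query string from a keyword, facet fields, and a year filter."""
    params = [("q", keyword), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may repeat, once per field to facet on.
        params.extend(("facet.field", f) for f in facet_fields)
    if year_range is not None:
        start, end = year_range
        # Range filter on a hypothetical integer field.
        params.append(("fq", f"begin_year:[{start} TO {end}]"))
    return urlencode(params)


query = solr_params(
    "land cover",
    facet_fields=["keyword", "project"],
    year_range=(2000, 2010),
)
parsed = parse_qs(query)  # decode again to inspect what would be sent
```

The query string would be appended to the select handler URL of the Solr core in use.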
2011 NASA ESDSWG Meeting, Newport News, VA
Provenance in ESDORA
Where should provenance be stored?
[Figure labels: users; Application; internal metadata sources (often file system); structured metadata stores (database or indexing engine); external metadata sources on the Web]
In software applications: BAD
In accompanying files: BAD
In structured metadata records: BAD if not linked to data
Semantically a part of the content system: GOOD
ESDORA: Metadata & semantic relations are stored in the same digital object as the data content
[Figure: a digital object holding DOI, FGDC, ISO, Read me, Guide docs, Semantics, Datastream 1, …, Datastream x]
Application uses semantic queries for knowledge stored in objects
Synthetic Land Cover Data Chain (SYNMAP)
(Modeling and Synthesis Thematic Data Center, MAST-DC)
[Figure: AVHRR_CFTC, MODIS_GLC, GLCC, and GLC2000 feed Original_SYNMAP, which feeds Analyzed_SYNMAP and Analyzed Potential_SYNMAP]
To provide a standardized land cover map for the Multi-scale Synthesis and Terrestrial Model Intercomparison Project, the Original SYNMAP is assembled from four independent products. It is in turn reprocessed (common resolution, extent, CF-compliant NetCDF) to produce the Analyzed SYNMAP and Potential SYNMAP at global and North American scales.
Provenance: Data derivation history
Data derivation history information is recorded and stored in the Fedora RDF semantic store. The semantic store is indexed and can be queried using SPARQL and iTQL.
Object: Analyzed_SYNMAP
Processing info…
Semantics (RDF): This object is “DerivedFrom” Original_SYNMAP
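To make the derivation chain concrete, the sketch below walks "DerivedFrom" relations held as plain (subject, predicate, object) triples in memory. It stands in for a query against the semantic store and is not SPARQL or iTQL. Only the Analyzed_SYNMAP to Original_SYNMAP relation is stated on the slide; the four relations from Original_SYNMAP to its source products are an assumption based on the SYNMAP description, and the function name `ancestors` is hypothetical.

```python
from collections import deque

DERIVED_FROM = "DerivedFrom"  # predicate label as shown on the slide

TRIPLES = [
    ("Analyzed_SYNMAP", DERIVED_FROM, "Original_SYNMAP"),  # stated on the slide
    # Assumed relations: the original map is assembled from four products.
    ("Original_SYNMAP", DERIVED_FROM, "AVHRR_CFTC"),
    ("Original_SYNMAP", DERIVED_FROM, "MODIS_GLC"),
    ("Original_SYNMAP", DERIVED_FROM, "GLCC"),
    ("Original_SYNMAP", DERIVED_FROM, "GLC2000"),
]


def ancestors(obj, triples):
    """Return every object that obj derives from, nearest first (breadth-first)."""
    parents = {}
    for subject, predicate, target in triples:
        if predicate == DERIVED_FROM:
            parents.setdefault(subject, []).append(target)

    seen, order, queue = set(), [], deque([obj])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:  # guard against repeats and cycles
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


lineage = ancestors("Analyzed_SYNMAP", TRIPLES)
print(lineage)
```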
Provenance: Granule checksums
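A minimal sketch of checksum-based integrity verification for a granule follows. The slides do not say which digest algorithm ESDORA uses, so SHA-256 is an assumption here. The function names and the sample bytes are hypothetical.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of the data under the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def verify(data: bytes, expected: str, algorithm: str = "sha256") -> bool:
    """Check that the data still matches the checksum recorded at ingest."""
    return checksum(data, algorithm) == expected


granule = b"example granule bytes"         # stand-in for a data file's content
recorded = checksum(granule)               # stored with the object at ingest
intact = verify(granule, recorded)         # unchanged data
tampered = verify(granule + b"x", recorded)  # altered data
```

Recording the digest next to the data lets the check be repeated at any point in the lifecycle.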
Provenance: Auditing trail and versioning history
Concluding Remarks
The digital object model abstraction reduces the complexity of data curation.
Object semantics and XML encoding can be used to preserve provenance knowledge as well as descriptive metadata.
The integrated system addresses many metadata and provenance issues and can be used as an archive system for Geoscience data content.
• http://esdora2.ornl.gov/
• Acknowledgement: This work is funded by NASA ACCESS Grant # 09-ACCESS09-8
• The team would like to thank Stephen Berrick for progress reviews and guidance
• Contact: Jerry Pan, [email protected]