Emerging Provenance/Context Content Standard
Discussion at Data Preservation and Stewardship Cluster Session at ESIP Federation Meeting
July 14, 2011
H. K. “Rama” Ramapriyan and John Moses
ESDIS Project
NASA Goddard Space Flight Center
[email protected]; [email protected]
Outline
• Why do we need a standard? (reaffirm)
• Background
• Status
• Content Matrix
• Next steps
– Draft introductory text material
– Send matrix for broader review – NASA mission teams (satellite and aircraft investigations), MEaSUREs teams, DAAC UWGs, USGS, NOAA, EPA(?), ESA(?)
– Decide on IEEE or ISO route to proceed further
Why Standard?
• Need to understand and document what content is essential for long-term preservation of information from Earth observations
• A common approach and consistency across organizations (national and international) will help ensure that future long-term archives preserve the necessary content
– Data needed for long-term science studies come from multiple organizations
• A standard will enable new missions to plan for preservation of required content
– Less expensive to plan ahead than to retrofit (retrofitting may be impossible)
• Assessing compliance with the standard helps determine the utility of datasets for long-term science studies
– The standard is worth having even if compliance is not 100% (depends on capabilities, requirements, and budgets of organizations)
• The standard should provide priorities for content – e.g., critical, essential, desirable
• Other reasons??
Background
• Proposed initiating standard activity at the January 2011 ESIP Meeting
– Attendees were very supportive of the idea
– Agreed we should develop a matrix – candidate columns defined
– Don’t know of an existing standard that covers enumeration of all content that must be preserved
• May be some overlaps with ISO 19115, but need to identify them item by item
• We are focused on “what,” not “how”
• Current title of standard: “Provenance and Context Content Standard for Data Supporting Global Change Research”
– We propose simplifying to: “Preservation Content Standard for Data Supporting Global Change Research”
Status (1 of 2)
• Received inputs from Ted Habermann et al. (NOAA)
• Merged with NASA’s inputs based on the USGCRP Workshop (1998) Report and discussions (2010–2011) with EOS instrument teams (GLAS, HIRDLS) and the TOMS instrument PI
– Note: the USGCRP workshop, jointly sponsored by NASA and NOAA, identified a number of scenarios from which content recommendations were derived
• Content matrix was developed and posted on the ESIP Data Stewardship and Preservation Cluster wiki
– Initial version – March 1, 2011
– Latest version (incorporates comments from cluster members) – June 8, 2011
– Focused mostly on satellite remote sensing data; need to ensure we cover other types of data (aircraft, in situ)
Status (2 of 2)
• ESA/NASA Earth Science Framework for Cooperation
– NASA and ESA are discussing collaboration and/or coordination in various areas, including data systems
– Subgroup 3 – Ground Segments & Data
• Goal: collaborate between ESA and NASA ground segments and data systems to enhance mission return and to enable efficient development and generation of multi-mission/multi-agency data products
– ESA is very interested in coordinating the Content Standard with us; ESA has provided us with their documents related to Long-Term Data Preservation (LTDP)
– Provides an opportunity to compare notes, be more comprehensive, and avoid duplication – could ease the way to an international standard
– We have compared our matrix with the ESA documents and identified only a few differences in content
– We will share the matrix with ESA after this meeting
Summary: Earth Science Content Information
• OAIS Reference Model terms: the Content Information Object includes the Content Data Object with Representation Information for the Designated Community – Earth Science researchers
– Preservation Description Information types: Provenance (source of info), Context (relation to other info), Reference (identifiers), Fixity (e.g., checksum)
– PCCS is not covering Reference and Fixity explicitly – should it?
• PCCS categories are defined to highlight the important information needed for Earth Science research
– To identify the flow of information through an instrument’s life cycle
– To develop a user’s sense of what should be included as provenance and context information about a particular instrument’s dataset
– Extendable to classes of instruments and measurement systems
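As a minimal illustration of what the OAIS Fixity information mentioned above amounts to in practice, the sketch below computes a checksum over an archived granule file. This is not part of PCCS; the function name and the choice of SHA-256 are assumptions for illustration only.

```python
# Illustrative sketch only: a checksum usable as OAIS-style Fixity
# information for an archived data file. Reads the file in chunks so
# large granules do not need to fit in memory.
import hashlib


def fixity_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return a hex digest over the file at `path` (Fixity information)."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read 1 MiB at a time until EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

An archive would record this digest alongside the granule's Reference identifier and re-verify it periodically to detect bit-level corruption.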
Content Matrix – Introduction (1 of 3)
• Using column headings discussed at the January 2011 ESIP meeting (mostly)
• Each row corresponds to a content item and provides details
• Content items are mapped into 8 categories (see later chart)
• One or more content items are defined in each of the categories
• Column headings
– Item Number (C.N – category and number within category)
– Category
– Content Item Name
– Definition / Description
– Rationale (why content is needed)
– Criteria (how good content should be)
– Priority (H, M, L or critical, essential, desirable)
Content Matrix – Introduction (2 of 3)
• Column headings (cont.)
– Source (who should provide the content item)
– Project phase for capture
– User community (who would be most likely to need the content item – this column is mostly blank in this version; needs group inputs)
– Representation (while the focus is on “what,” brief comments are included here on whether items are word files, numeric files, pointers, etc.)
– Distribution restrictions (potential proprietary or ITAR concerns associated with the content item)
– Source identifying item (where the content item came from – NASA, NOAA, or both)
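The column headings above can be sketched as a single record type; this is an illustrative data-structure sketch, not part of the standard, and the field names and example values are hypothetical.

```python
# Illustrative sketch: one content-matrix row as a record with the
# column headings listed above. All example values are hypothetical.
from dataclasses import dataclass


@dataclass
class ContentItem:
    item_number: str          # "C.N" - category and number within category
    category: str
    name: str
    definition: str
    rationale: str            # why the content is needed
    criteria: str             # how good the content should be
    priority: str             # "H", "M", "L" (or critical/essential/desirable)
    source: str               # who should provide the content item
    capture_phase: str        # project phase for capture
    user_community: str = ""  # mostly blank in this version
    representation: str = ""  # word files, numeric files, pointers, ...
    distribution_restrictions: str = ""  # proprietary / ITAR concerns
    identified_by: str = ""   # NASA, NOAA, or both


item = ContentItem(
    item_number="1.1",
    category="Preflight/Pre-Operations",
    name="Instrument Description",
    definition="Instrument/sensor characteristics incl. pre-flight measurements",
    rationale="Needed to interpret the long-term record",
    criteria="Complete enough to characterize instrument behavior",
    priority="H",
    source="Instrument team",
    capture_phase="Pre-launch",
)
```

A matrix is then just a list of such records, which makes compliance checks (e.g., "are all H-priority items present?") straightforward to script.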
Content Matrix – Introduction (3 of 3)
• A further level of detail is possible – components within content items
– Different subsets of components are needed under different circumstances
– Discriminators that determine such subsets:
• Platform type (satellite, aircraft, flux tower, buoy, etc.)
• Instrument type
• Measurement type
• Product level
Categories
1. Preflight/Pre-Operations: Instrument/sensor characteristics including pre-flight/pre-operations performance measurements; calibration method; radiometric and spectral response; noise characteristics; detector offsets
2. Products (Data): Raw instrument data, Level 0 through Level 4 data products, and associated metadata
3. Product Documentation: Structure and format with definitions of all parameters and metadata fields; algorithm theoretical basis; processing history and product version history; quality assessment information
4. Mission Calibration: Instrument/sensor calibration method (in operation) and data; calibration software used to generate lookup tables; instrument and platform events and maneuvers
5. Product Software: Product generation software and software documentation
6. Algorithm Input: Any ancillary data or other data sets used in generation or calibration of the data or derived product; ancillary data description and documentation
7. Validation: Record and data sets
8. Software Tools: Product access (reader) tools
1. Pre-Operational

Instrument Description (Priority: H)
– Description: Documentation of instrument/sensor characteristics including pre-flight or pre-operational performance measurements.
– Discriminator/Components: Instrument Type – Remote sensors (specifications; platform geometry; spectral response; radiometric response; noise characteristics); In-Situ

Calibration Data (Priority: L–M)
– Description: Numeric (digital) files of instrument/sensor characteristics including pre-flight or pre-operational performance measurements.
– Discriminator/Components: Instrument Type – measured spectral response data; measured radiometric data; measured ambient noise
2. Earth Science Data Products (1)

Raw data or Level 0 data products (Priority: L–M*)
– Description: Raw data as measured by a spaceborne, airborne, or in situ instrument; Level 0 data is the reconstructed, unprocessed instrument and payload data at full resolution, with any and all communications artifacts removed.
– Discriminator/Components: Instrument Type – observations, platform ephemeris, and metadata in multiple files

Level 1A data products (Priority: L–H*)
– Description: Reconstructed, unprocessed instrument data at full resolution, time-referenced, and annotated with ancillary information, including radiometric and geometric calibration coefficients and georeferencing parameters computed and appended but not applied to Level 0 data.
– Discriminator/Components: Instrument Type

Level 1B data (Priority: H)
– Description: Level 1A data that have been processed to sensor units.
– Discriminator/Components: Instrument Type

* Priorities depend on choice to keep L0 or L1A
2. Earth Science Data Products (2)

Level 2 data (Priority: H)
– Description: Derived geophysical variables at the same resolution and location as the Level 1 source data.
– Discriminator/Components: Instrument Type – Remote Sensor; In-Situ

Level 3 data (Priority: H)
– Description: Variables mapped on uniform space-time grid scales, usually with some completeness and consistency.
– Discriminator/Components: Assimilation Type – single Level 2 input; multiple Level 2 inputs

Level 4 data (Priority: H)
– Description: Model output or results from analyses of lower-level data.
– Discriminator/Components: Model Input Type – single instrument; multiple instruments; remote sensor + in-situ

Metadata (Priority: H)
– Description: Information about data to facilitate discovery, search, access, understanding, and usage, associated with each of the data products. Links product to algorithm version.
– Discriminator/Components: Product Data Level; Instrument Type; Assimilation Type; Model Input Type
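To make the Metadata item concrete, the sketch below shows what a granule-level metadata record linking a Level 2 product to its algorithm version might look like. This is illustrative only: the field names, identifiers, and values are hypothetical and not drawn from any actual mission.

```python
# Illustrative sketch: granule-level metadata linking a Level 2 product
# to the algorithm/software version that produced it, plus its input
# granules (provenance). All identifiers and values are hypothetical.
import json

granule_metadata = {
    "granule_id": "EXAMPLE_L2_20110714_001",   # Reference-style identifier
    "product_level": "L2",
    "instrument_type": "remote sensor",
    "algorithm_version": "2.4",                # links product to algorithm
    "processing_software_version": "pge_v2.4.1",
    "inputs": ["EXAMPLE_L1B_20110714_001"],    # provenance: input granules
}

# Serialize for storage alongside the granule in the archive.
record = json.dumps(granule_metadata, indent=2)
```

Capturing the algorithm version per granule is what later allows an archive user to tell which granules need reprocessing when an algorithm is revised.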
3. Product Documentation (1)

Product Team (Priority: H)
– Description: States the product team members’ roles, contact information, and period of responsibility.
– Discriminator/Components: Team Type – Investigator; Development; Help Desk; Operations

Product Requirements (Priority: M)
– Description: Project’s requirements for each product, either explicitly or by reference to the project’s requirements document, if available.
– Discriminator/Components: Requirements Type – Content; Format; Latency; Accuracy; Quality

Product Development History (Priority: M)
– Description: Major product development steps and milestones, with links to other relevant items that are part of the preserved provenance and context contents.
– Discriminator/Components: Instrument Type; Assimilation Type; Model Input Type
3. Product Documentation (2)

Processing History (Priority: H)
– Description: Documentation of processing history and production version history, indicating which versions were used when, why different versions came about, and what the improvements were from version to version.
– Discriminator/Components: Instrument Type; Product Level

Algorithm Version History (Priority: H)
– Description: Processing history including versions of processing source code corresponding to versions of the data set or derived product held in the archive. Granule-level metadata should indicate which version of the software was used for producing a given granule. Links product to algorithm version.
– Discriminator/Components: Instrument Type; Product Level
3. Product Documentation (3)

Maintenance History (Priority: M)
– Description: Excerpts and/or references to maintenance documentation deemed of value to product users. Maintenance reports.

Operations History (Priority: H)
– Description: Excerpts and/or references to operations documentation deemed of value to product users. Operations event logs.

Product Generation Algorithms (Priority: H)
– Description: Processing algorithms and their scientific and mathematical basis, including a complete description of any sampling or mapping algorithm used in creation of the product – geo-location, radiometric calibration, geophysical parameters; algorithm software documentation; ATBD and high-level data flow diagrams.
– Discriminator/Components: Product Level – Algorithm Output; Algorithm Performance Assumptions; Error Budget; Numerical Computation Considerations
3. Product Documentation (4)

Product Quality (Priority: H)
– Description: Documentation of product quality assessment (methods used, assessment summaries for each version of the datasets). Description of embedded data at the granule level, including quality flags, product data uncertainty fields, data issues logs, etc.
– Discriminator/Components: Instrument Type; Product Level – Product Accuracy; Sensor Effects

Quality Assessment and Potential Algorithm Improvements (Priority: H)
– Description: Describes potential future enhancements to the algorithm, the limitations they will mitigate, and all possible and useful related information and links.
– Discriminator/Components: Instrument Type

References (Priority: H)
– Description: A bibliography of pertinent technical notes and articles, including refereed publications reporting on research using the data set.

User Feedback (Priority: L)
– Description: Information received back from users of the data set or product.
Product Generation Algorithms – Components

Algorithm Output (Priority: H)
– Description: Describes the output data products – not format – at a level of detail sufficient to determine whether the product meets user requirements.

Algorithm Performance Assumptions (Priority: H)
– Description: Describes all assumptions that have been made concerning the algorithm performance estimates. Notes any limitations that apply to the algorithms (e.g., conditions where retrievals cannot be made or where performance may be significantly degraded). To the extent possible, the potential for degraded performance should be explored, along with mitigating strategies.

Error Budget (Priority: H)
– Description: Organizes the various error estimates into an error budget. Error budget limitations should be explained. Describes prospects for overcoming error budget limitations with future maturation of the algorithm, test data, and error analysis methodology.

Numerical Computation Considerations (Priority: M)
– Description: Describes how the algorithm is numerically implemented, including possible issues with computationally intensive operations (e.g., large matrix inversions, truncation, and rounding).
Product Quality – Components

Product Accuracy (Priority: H)
– Description: Accuracy of products, as measured by validation testing and compared to accuracy requirements. References to relevant test reports.

Sensor Effects (Priority: H)
– Description: Flowed-through effects of sensor noise, calibration errors, spatial and spectral errors, and/or un-modeled or neglected geophysical phenomena on the quality of products.
4. Mission Calibration (1)

Instrument/Sensor Calibration During Mission (Priority: H)
– Description: Instrument/sensor calibration method – radiometric calibration; spectral response/calibration; noise characteristics; geo-location.
– Discriminator/Components: Instrument Type

In-Situ Measurement Environment (Priority: H)
– Description: In the case of Earth-based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use, and other factors that could influence the long-term record.
– Discriminator/Components: Platform Type – Aircraft; Balloon; Station/Tower/Buoy

Mission Platform History (Priority: H)
– Description: Instrument events and maneuvers; attitude and ephemeris; aircraft position; event logs.
– Discriminator/Components: Platform Type – Satellite; Aircraft
4. Mission Calibration (2)

Mission Calibration Data (Priority: M)
– Description: Instrument/sensor calibration data – radiometric calibration; spectral response/calibration; noise characteristics; geo-location.
– Discriminator/Components: Instrument Type

Calibration Software (Priority: M)
– Description: Source code used in applying calibration to generate look-up tables and/or parameters needed for producing calibrated products.
– Discriminator/Components: Instrument Type
5. Product Software (1)

Product Generation Algorithms (Priority: H)
– Description: Source code used to generate products at all levels.
– Discriminator/Components: Instrument Type

Output Dataset Description (Priority: H)
– Description: For each output data file, details on the data product’s structure, format/type, range of values, and special error values. Includes data volume and file size, and all information needed to verify that the required output data is created by a run and that all expected datasets are produced in the expected format.

Programming & Procedural (Priority: M–H)
– Description: Describes any important programming and procedural aspects related to implementing the algorithm in operating code.

Exception Handling (Priority: H)
– Description: Lists the complete set of expected exceptions and describes how they are identified, trapped, and handled.
5. Product Software (2)

Test Data Description (Priority: L–M)
– Description: Description of data sets used for software verification and validation, including unit tests and system tests, either explicitly or by reference to the developer’s test plans, if available. This will be updated during operations to describe test data for maintenance.

Unit Test Plans (Priority: L–M)
– Description: Description of all test plans that were produced during development, including links or references to the artifacts.

Test Results (Priority: L–M)
– Description: Description of testing performed and test results obtained during development, either explicitly or by references to test reports. If test reports are not available to external users, provide a summary of the test results in sufficient detail to give external users a good sense of how the test results indicate that the products meet requirements.
6. Algorithm Input

Algorithm Input Documentation (Priority: H)
– Description: Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product, either explicitly or by reference to appropriate documents. Information should include a full description of the input data and their attributes covering all input data used by the algorithm, including primary sensor data, ancillary data, forward models (e.g., radiative transfer models, optical models, or other models that relate sensor observables to geophysical phenomena), and look-up tables.

Algorithm Input Data (Priority: H)
– Description: At the granule level, include information on all inputs (including ancillary or other data granules, calibration files, look-up tables, etc.) that were used to generate the product. At the appropriate level (granule or dataset), include calibration parameters; precision orbit and attitude data; climatological norms; geophysical masks or first-guess fields; spectrum and transmittance information; numerical weather or climate model inputs.
7. Validation

Validation Datasets (Priority: L–H)
– Description: Description of the validation process, including identification of validation data sets; Cal/Val plans and status; detailed history of validation activities.
– Discriminator/Components: Product Level

Validation Record (Priority: H)
– Description: Validation data sets along with metadata.
– Discriminator/Components: Product Level
8. Software Tools

Tools for Users (Priority: H)
– Description: Readers and data analysis tools.
– Discriminator/Components: Product Level
Next Steps (1 of 2)
• Review content matrix in this meeting – consider appropriateness of level of detail
• Form small team (4 or 5 volunteers willing to review drafts in detail) to carry forward; incorporate results from this meeting on use cases
• Look at ISO 19115, SensorML, policy changes needed(?)
• Draft introductory text material to include
– Need for content standard
– ESIP Data Stewardship and Preservation Cluster – definition/mission/charter
– Cluster’s motivation to promote standard
– Scope of standard
– Description of matrix
– Request for review
• Include how review inputs should be provided
• Reviewers should look at the list from the points of view of providing data as well as using someone else’s data
Next Steps (2 of 2)
• “User test” on a small group of data providers
• Send matrix with introductory text for broader review – NASA mission teams (satellite and aircraft investigations), MEaSUREs teams, DAAC UWGs, USGS, NOAA Data Centers, NOAA CDR Program PIs, EPA(?), ESA(?)
• Provide same material to NASA ESDSWG for consideration with its “best practices” submission
• Decide on IEEE or ISO route to proceed further – may need to augment small team at this point
Proposal from NASA ESDSWG Technology Infusion Working Group (TIWG)
• Concept of a provenance and context standard was discussed at the ESDSWG meeting in October
• Tentatively agreed to develop a “Best Practice” Technical Note for NASA Standards Process Group (SPG) consideration
– Contents to include
• Items from the USGCRP 1998 Workshop Report
• 1–2 paragraph justification and best practice for each item
– Process
• Develop Technical Note
• Submit RFC to SPG
• SPG initial screening
• Public evaluation
– Process expected to complete in about 1 year
IEEE Standard Development Process (~2 years)*
• Form nucleus of a working group (core group of people interested in developing the standard)
• Submit Project Authorization Request (PAR) to the New Standards Committee (NesCom)
– Relatively short form specifying the scope, purpose, and contact points for the new project
– NesCom meets 4 times a year, but can start the review of the PAR shortly after it is submitted
• Mobilize working group
– Individual-based IEEE working groups (one individual, one vote) are open to anyone to participate – participants don’t have to be IEEE or IEEE-SA members
– Entity-based IEEE working groups (one entity, one vote) have specific membership requirements for an entity to observe or attain membership and voting rights. There are legal issues related to patents and such that must be dealt with, but the IEEE Standards Association (SA) provides staff support to help with this
• Draft the standard
– IEEE-SA staff provides many tools and direct support in drafting standards
– Involvement of the Quantities, Units and Letter Symbols Committee and the IEEE-SA Staff Editor is mandatory
• Ballot the standard
– When the working group has determined that a draft is mature enough, it submits the draft to the IEEE sponsor (SCC40, chaired by SJS Khalsa)
– 75% affirmative ballots are needed to proceed to the approval stage
• Approve the standard
– Final approval of an IEEE standard is achieved by submitting the document and supporting material to the IEEE-SA Standards Board Standards Review Committee (RevCom), which issues a recommendation to the IEEE-SA Standards Board
• Maintain the standard
– A standard has a validity period of five years from the date of approval by the IEEE-SA Standards Board. At the end of the five-year validity period, one of three things has to happen: revision, reaffirmation, or withdrawal
– During this five-year validity period, amendments and corrigenda may need to be developed that offer minor additions/revisions to the standard
*Thanks to Siri Jodha Singh Khalsa
ISO Standard Development Process (~4 years)*
• An INCITS/L1** member or members prepares a new work item proposal (NWIP)
• The NWIP is voted on by INCITS/L1 members during a one-month voting period, during which clarification questions need to be answered
• If the NWIP is approved by INCITS/L1, it is voted on by the INCITS Executive Board; as a U.S. contribution it is then sent by INCITS/L1 to the ISO TC 211 secretariat. TC 211 sends the proposal to member countries for a one-month informal review
• If more than half of the voting member countries support the project and at least 5 members agree to participate in the project by nominating experts to the project team, the project is approved
• The project chair, editor, and the nominated experts form the project team to draft the working draft (WD) of the standard. There are normally multiple versions of the WD
• When the project team is satisfied with the WD, the final version of the WD is sent out by TC 211 to member countries for Committee Draft (CD) voting
• If the CD vote is approved, the project team is dissolved and an editing committee (EC) is formed. The project chair joins the EC, and the ISO TC 211 Working Group 6 convener chairs the EC
• The edited CD is put up for vote again to advance to the Draft International Standard (DIS) stage. If approved, the EC holds a final meeting to edit the document to DIS based on voting comments. If not approved, a new version of the CD needs to be produced
• After it becomes a DIS, the next step is the Final Draft International Standard (FDIS)
• FDIS to International Standard (IS): the document automatically advances from FDIS to IS after 6 months without major objections. Only minor editorial changes are allowed from FDIS to IS
**International Committee for Information Technology Standards – Geographic Information Systems
*Thanks to Liping Di