Post on 01-Jan-2016
Darwin Core Archive (DwC-A) validation: A New Collaborative Effort
Christian Gendreau, Université de Montréal / Canadensys
David P. Shorthouse, Université de Montréal / Canadensys
Marie-Élise Lecoq, GBIF France
Tim Robertson, GBIF
Darwin Core Archive (DwC-A)
The DarwinCore standard does not impose strong rules on the content associated with DarwinCore terms.
Current GBIF DwC-A Validator
Original goal
“… test Darwin Core Archives as specified in the Darwin Core Text Guide.”
http://tools.gbif.org/dwca-validator/
Current GBIF DwC-A Validator
Original target
DwC-As are simple and can be created using simple custom scripts.
“… make sure GBIF and others can read the information as expected.”
Current GBIF DwC-A Validator
• Validates archive structure
• Offers web presence
  – Report viewer
  – API
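A structural check of this kind can be sketched in a few lines. This is a minimal illustration, not the GBIF validator's code: it assumes a zipped archive that ships a meta.xml descriptor in the Darwin Core Text Guide namespace, and the function name `check_archive_structure` is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the Darwin Core Text Guide descriptor (meta.xml).
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def check_archive_structure(archive_bytes: bytes) -> list[str]:
    """Return the structural problems found in a zipped archive (hypothetical helper)."""
    try:
        archive = zipfile.ZipFile(io.BytesIO(archive_bytes))
    except zipfile.BadZipFile:
        return ["not a readable zip file"]
    problems = []
    with archive:
        names = set(archive.namelist())
        # Assumption: the archive is described by a meta.xml file.
        if "meta.xml" not in names:
            return ["meta.xml descriptor is missing"]
        try:
            root = ET.fromstring(archive.read("meta.xml"))
        except ET.ParseError:
            return ["meta.xml is not well-formed XML"]
        if root.find(f"{DWC_TEXT_NS}core") is None:
            problems.append("meta.xml has no <core> element")
        # Every data file referenced by the descriptor must be in the archive.
        for location in root.iter(f"{DWC_TEXT_NS}location"):
            path = (location.text or "").strip()
            if path and path not in names:
                problems.append(f"referenced file not found: {path}")
    return problems
```

An empty list means no structural problem was detected; nothing here looks at the content of the data files.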
Next GBIF DwC-A Validator?
New goal
Extend validation to the content of the archive
https://github.com/gbif/dwca-validator
Current content validators
• Atlas of Living Australia sandbox
• VertNet – Spatial quality
• GBIF Spain – Darwin Test
• Encyclopedia of Life – dwc-validator
• Scratchpads – dwca-validator
• GlobalNames – dwc-archive ruby gem
• … many more
See Appendix 1 for links
What do we need?
• Accommodate different scopes
• Configuration/customizations
  – Use more knowledge when available
• Web access (page and API)
Scopes
• Data entry
• Desktop software
  – Scientific workflow
  – Statistical software
• Integrated Publishing Toolkit (IPT)
• National nodes
• Aggregators
Configuration/Customization
• Where will the validator be used?
• Can we provide more information?
  – e.g. I know all the dates in my file should be ISO 8601
Components
• Library
• Web
• Extension Support
Library
• Define structure for validation process
• Provide a validation framework enabling sharing
• Close to DarwinCore specification
Web
• Web page to submit archive or URL
• Report viewer
• API
Extension Support
• Include domain knowledge
• Propose interpreted data
Internals
• Validation types
  – Structure
    • Metadata
  – Records : Rows
    • Fields data (e.g. date, coordinates)
  – Records : Columns
    • ID uniqueness
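The difference between the two record-level types can be illustrated with a short sketch: a row check looks at one record at a time, while a column check needs every record. The function names are hypothetical, and records are assumed to be plain dictionaries keyed by term name.

```python
from collections import Counter
from datetime import date


def check_row(record: dict) -> list[str]:
    """Row-level check: validate field values inside a single record."""
    problems = []
    event_date = record.get("eventDate", "")
    if event_date:
        try:
            date.fromisoformat(event_date)  # accepts YYYY-MM-DD
        except ValueError:
            problems.append(f"eventDate is not an ISO date: {event_date!r}")
    return problems


def check_id_column(records: list[dict], id_field: str = "id") -> list[str]:
    """Column-level check: all records are needed to detect duplicated identifiers."""
    counts = Counter(r.get(id_field) for r in records)
    return [f"duplicate {id_field}: {value!r}" for value, n in counts.items() if n > 1]
```

Row checks can run on records independently of each other, which is what makes parallel processing possible; a uniqueness check has to see the whole column.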
Internals – Record level
• Validation chain
  – Composed of chain elements
  – Possible parallelism
Internals – Record level
• Immutable Chain element
  – Self contained
    • Never relies on another chain element
  – Ordering independent
    • Same behaviour wherever the element is used in the chain
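A minimal sketch of such a chain, assuming records are dictionaries of strings. `ChainElement`, `run_chain` and the two sample elements are hypothetical names chosen for illustration, not the project's classes.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Record = dict


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class ChainElement:
    name: str
    check: Callable[[Record], Optional[str]]  # returns a message, or None if valid

    def evaluate(self, record: Record) -> Optional[str]:
        return self.check(record)


def run_chain(chain: list[ChainElement], record: Record) -> dict[str, str]:
    """Apply every element to the record; no element sees another element's result."""
    results = {}
    for element in chain:
        message = element.evaluate(record)
        if message is not None:
            results[element.name] = message
    return results


def _name_present(record: Record) -> Optional[str]:
    return None if record.get("scientificName", "").strip() else "scientificName is empty"


def _latitude_in_range(record: Record) -> Optional[str]:
    try:
        valid = -90 <= float(record.get("decimalLatitude", "")) <= 90
    except ValueError:
        valid = False
    return None if valid else "decimalLatitude is not a number in [-90, 90]"


NAME_PRESENT = ChainElement("scientificName", _name_present)
LATITUDE_IN_RANGE = ChainElement("decimalLatitude", _latitude_in_range)
```

Because each element reads only the record, `run_chain([NAME_PRESENT, LATITUDE_IN_RANGE], record)` and the reversed chain report the same problems.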
But what if I really need ordering?
Internals - Composition
• Composed chain element
• Exposed as one chain element
Composition example
• Mandatory Latitude/Longitude
  – Check record completion on lat/long
  – Check decimal lat/long value
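One way to picture this composition: the ordering lives inside the composed element, and the rest of the chain sees a single check. The helper `compose` and the two step functions are hypothetical, and the coordinate ranges used are the usual [-90, 90] and [-180, 180].

```python
from typing import Callable, Optional

Check = Callable[[dict], Optional[str]]


def compose(*steps: Check) -> Check:
    """Build one check that runs its steps in order and stops at the first failure."""
    def composed(record: dict) -> Optional[str]:
        for step in steps:
            message = step(record)
            if message is not None:
                return message
        return None
    return composed


def lat_long_present(record: dict) -> Optional[str]:
    terms = ("decimalLatitude", "decimalLongitude")
    if all(str(record.get(term, "")).strip() for term in terms):
        return None
    return "decimalLatitude and decimalLongitude are both required"


def lat_long_decimal(record: dict) -> Optional[str]:
    # Relies on lat_long_present having run first: missing keys are not handled here.
    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
    except ValueError:
        return "decimalLatitude and decimalLongitude must be decimal numbers"
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return "coordinates are out of range"
    return None


# Exposed to the rest of the chain as a single element.
mandatory_lat_long = compose(lat_long_present, lat_long_decimal)
```

The second step is only safe because the first one ran before it, which is exactly the kind of dependency that is kept out of the top-level chain.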
Configuration example
• Select mandatory DarwinCore terms
  – scientificName must be provided
• Restrict bounding box
  – decimalLatitude and decimalLongitude must be between
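As a rough sketch, both settings could be carried by a small configuration object. `ValidatorConfig` and `validate` are hypothetical names, and the bounding box values in the usage note are illustrative only; the slide does not state the actual limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatorConfig:
    mandatory_terms: tuple[str, ...] = ("scientificName",)
    # Bounding box as (min_lat, min_lon, max_lat, max_lon); the default is the whole globe.
    bounding_box: tuple[float, float, float, float] = (-90.0, -180.0, 90.0, 180.0)


def validate(record: dict, config: ValidatorConfig) -> list[str]:
    problems = [
        f"{term} must be provided"
        for term in config.mandatory_terms
        if not record.get(term)
    ]
    min_lat, min_lon, max_lat, max_lon = config.bounding_box
    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
    except (KeyError, ValueError):
        return problems  # coordinates absent or unparsable: the box check is skipped
    if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
        problems.append("coordinates fall outside the configured bounding box")
    return problems
```

A national node might, for instance, pass `ValidatorConfig(bounding_box=(41.0, -141.0, 84.0, -52.0))` (made-up limits) while an aggregator keeps the default.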
Customization example
• Apply your own controlled vocabulary
  – Use your own dictionary for a term
  – ControlledVocabularyEvaluationRule
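The idea can be sketched as follows. The class below is a hypothetical Python stand-in written for illustration; it is not the project's ControlledVocabularyEvaluationRule and makes no claim about that rule's interface. The case-insensitive comparison is also an assumption.

```python
from typing import Iterable, Optional


class ControlledVocabularyCheck:
    """Hypothetical check: a term's value must belong to a user-supplied dictionary."""

    def __init__(self, term: str, allowed: Iterable[str]):
        self.term = term
        # Assumption: values are compared case-insensitively.
        self._allowed = frozenset(value.casefold() for value in allowed)

    def evaluate(self, record: dict) -> Optional[str]:
        value = record.get(self.term, "").strip()
        if not value:
            return None  # emptiness is left to a separate completeness check
        if value.casefold() in self._allowed:
            return None
        return f"{self.term}: {value!r} is not in the controlled vocabulary"


basis_check = ControlledVocabularyCheck(
    "basisOfRecord", ["PreservedSpecimen", "HumanObservation"]
)
```

Here `basis_check.evaluate({"basisOfRecord": "Specimen"})` returns a message, while `"preservedspecimen"` passes because of the case-insensitive match.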
Extension Example
• Suggester, link to narwhal-processor
  – Suède –> ISO 3166-2:SE
  – URI –> http://sws.geonames.org/2661886
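A suggester of this kind can be pictured as a lookup that proposes an interpreted value without altering the original. The sketch below does not use narwhal-processor; the dictionary, the `Suggestion` type and the accent-stripping normalisation are all assumptions made for illustration, with the code and URI taken from the example above.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Suggestion:
    code: str
    uri: str


def normalize(value: str) -> str:
    """Strip accents and case so that 'Suède' and 'SUEDE' map to the same key."""
    decomposed = unicodedata.normalize("NFKD", value.strip())
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.casefold()


SWEDEN = Suggestion(code="ISO 3166-2:SE", uri="http://sws.geonames.org/2661886")

# Hypothetical dictionary; a real deployment would load a much larger one.
COUNTRY_DICTIONARY = {"suede": SWEDEN, "sweden": SWEDEN}


def suggest_country(raw_value: str) -> Optional[Suggestion]:
    """Return a proposed interpretation, or None when nothing is known."""
    return COUNTRY_DICTIONARY.get(normalize(raw_value))
```

The raw value stays in the record; the suggestion is reported alongside it so that the data owner decides whether to accept it.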
Collaborative
• Share configuration
• Share customization (dictionary)
• Implement new reusable component
  – e.g. validation on specific DwC-A extension
Collaboration
• Where to go?
  – https://github.com/gbif/dwca-validator
• Who can contribute?
  – Everyone
• What is needed?
  – Ideas, constructive comments
  – Code review, feedback
Project status
• Not yet released
• Command line interface available
Follow the project on GitHub
Acknowledgments
Special thanks
• SiB Colombia
• SiB Brazil
• Peter Desmet
• John Wieczorek
• Dag Endresen
• …
Appendix 1
DwC Content validators
Atlas of Living Australia sandbox
http://sandbox.ala.org.au/datacheck/
VertNet – Spatial quality
Displayed on occurrence pages at http://portal.vertnet.org/search
GBIF Spain – Darwin Test
http://www.gbif.es/darwin_test/Darwin_Test_in.php
Encyclopedia of Life – dwc-validator
http://services.eol.org/dwc_validator/
Appendix 1 (continued)
Scratchpads – dwca-validator
https://github.com/edwbaker/dwca_validator/
GlobalNames – dwc-archive ruby gem
https://github.com/GlobalNamesArchitecture/dwc-archive