Data Quality: Towards a Common Validator

Date post: 15-Jun-2015
SiBBr 4th Workshop, Petrópolis, Rio de Janeiro, Brazil
Authors: Christian Gendreau, Anne Bruneau, David Shorthouse
Data quality: Towards a common validator

Christian Gendreau, Anne Bruneau, David Shorthouse
Université de Montréal, Biodiversity Centre

What is data quality?
● Relative
● Fitness for use
    o Coordinate precision for distribution
    o Hierarchy not provided
● What – When – Where

Examples
● 2008 VI 13, 2008-06-13, 13-06-2008, June 13 2008, 13 junho 2008
● Canada, Québec, Montréal, -73.55399 45.508669
● Narwalus microcephalus => Monodon monoceros

Public Domain: Freshwater and Marine Image Bank, University of Washington Libraries Digital Collections
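
The five date strings in the first example all denote the same day. The following is a minimal Python sketch of how such variants might be normalised to ISO 8601. The function name `normalise_date`, the month tables and the day-before-month reading of all-numeric dates are assumptions made for illustration; they are not taken from narwhal-processor or dwca-validator.

```python
import re
from datetime import date

# Hypothetical lookup tables; a real tool would cover many more languages.
ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}
MONTH_NAMES = {"june": 6, "junho": 6}  # English and Portuguese, sample only


def _month(token):
    """Map a month token (number, Roman numeral or known name) to an int, else None."""
    if token.isdigit():
        return int(token)
    return ROMAN_MONTHS.get(token.upper()) or MONTH_NAMES.get(token.lower())


def normalise_date(text):
    """Return an ISO 8601 date string, or None when the text is not understood."""
    parts = re.split(r"[\s\-/.,]+", text.strip())
    if len(parts) != 3:
        return None
    a, b, c = parts
    if len(a) == 4 and a.isdigit():        # year first: 2008-06-13, 2008 VI 13
        year, month, day = a, _month(b), c
    elif len(c) == 4 and c.isdigit():      # year last
        year = c
        if a.isdigit():                    # 13-06-2008, 13 junho 2008 (day first, by assumption)
            day, month = a, _month(b)
        else:                              # June 13 2008
            day, month = b, _month(a)
    else:
        return None
    if month is None or not day.isdigit():
        return None
    try:
        return date(int(year), month, int(day)).isoformat()
    except ValueError:                     # e.g. month 13 or day 32
        return None
```

Returning `None` instead of guessing keeps the uninterpretable values visible, so they can be reported as issues rather than silently altered.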

Data Quality Information Chain

Courtesy of Arthur D. Chapman

Brief History
● First Canadensys Explorer/Harvester (2012)
● narwhal-processor (2013)
● TDWG 2013
    o DQ Interest Group
    o Presentation about our plan
    o Discussion with GBIF

Why do we need a validator?
● Identify and quantify potential issues
● DarwinCore is permissive
    o DarwinCore itself can change
● Records and technologies will always evolve

What should a validator allow?
● Define a validation scope
    o at the source (e.g. collection)
    o national node
    o aggregator
    o GBIF

Validator - Expected design
● Modular, scalable, reusable
● Customizable
    o per configuration/extension
    o use user defined dictionary
● Validation Chain

Current Options
● GBIF validator
● CRIA tools
● ALA tools

Probably all organisations have their own tools.

dwca-validator
● Starting from previous GBIF validator
● Building a community project
● Provide framework for Biodiversity Data Quality Interest Group (TDWG)

https://github.com/gbif/dwca-validator

Vision
• Library
    o Core module, reusable (e.g. IPT)
• Web
    o Send archive, view report
• narwhal-processor
    o Suggest interpreted value
• Extensions
    o Domain knowledge / Quality index

Validation chain
● Chain element
    o Self contained (never relies on another chain element)
    o Ordering independent
● Composed chain element (narwhal and extensions)
    o Wrap chain elements under a new element
    o Ordering possible between wrapped elements
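
The two kinds of element described above can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions: the class names `ChainElement` and `ComposedChainElement`, the `run_chain` helper and the two sample checks are hypothetical and do not reproduce the dwca-validator API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Record = dict  # one row of an archive, keyed by term name (assumption)


@dataclass
class ChainElement:
    """Self-contained check: it looks only at the record it is given."""
    name: str
    check: Callable[[Record], Optional[str]]  # returns a message, or None if valid

    def validate(self, record: Record) -> List[str]:
        message = self.check(record)
        return [] if message is None else [f"{self.name}: {message}"]


@dataclass
class ComposedChainElement:
    """Wraps several elements under one name; wrapped elements run in list order."""
    name: str
    elements: List[ChainElement]

    def validate(self, record: Record) -> List[str]:
        messages = []
        for element in self.elements:
            messages.extend(element.validate(record))
        return [f"{self.name}/{m}" for m in messages]


def run_chain(chain, record):
    """Apply every element; each one sees the same, unmodified record."""
    messages = []
    for element in chain:
        messages.extend(element.validate(record))
    return messages


# Two hypothetical checks using Darwin Core term names.
def has_scientific_name(record):
    return None if record.get("scientificName") else "scientificName is empty"


def latitude_in_range(record):
    try:
        ok = -90 <= float(record.get("decimalLatitude", "")) <= 90
    except (TypeError, ValueError):
        return "decimalLatitude is not a number"
    return None if ok else "decimalLatitude out of range"
```

Because no element reads another element's output, reordering the chain changes only the order of the messages, not which issues are found.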

Chain element example

Validation types
● Structure
    o metadata
    o organization of data
● Rows
    o dates, coordinates, ...
● Columns
    o ID uniqueness
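
The difference between row and column validation is mainly one of scope: a row check needs a single record, while a column check needs all of them. A minimal Python sketch, assuming records are dictionaries keyed by Darwin Core term names and that the identifier column is called `id` (both assumptions; the function names are hypothetical):

```python
from collections import Counter


def check_row_coordinates(row):
    """Row-level check: needs only the values of a single record."""
    try:
        lat = float(row["decimalLatitude"])
        lon = float(row["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        return "coordinates missing or not numeric"
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return "coordinates out of range"
    return None  # no issue found


def check_column_unique(rows, column="id"):
    """Column-level check: compares values across all records.

    Returns the values that occur more than once, in first-seen order.
    """
    counts = Counter(row.get(column) for row in rows)
    return [value for value, n in counts.items() if n > 1]
```

A range check of this kind cannot detect swapped latitude and longitude when both values happen to fall within the valid latitude range; that would need a more complex validation.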

Result Accumulator
● Records validation results as they occur
    o ID/Validator/Context/ValidationType/Result/Message
● Allows different views of the results
    o Web view
    o Feed another application
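
As a rough illustration of the accumulator idea, the Python sketch below stores one entry per validation result with the six fields listed above and offers two views of the same data. The class and method names are hypothetical, and the summary and JSON views are just examples of what a web view or another application might consume.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ValidationResult:
    """One entry, mirroring ID/Validator/Context/ValidationType/Result/Message."""
    id: str
    validator: str
    context: str
    validation_type: str
    result: str
    message: str


class ResultAccumulator:
    """Collects results as validators emit them; views are derived afterwards."""

    def __init__(self):
        self._results: List[ValidationResult] = []

    def accumulate(self, result: ValidationResult) -> None:
        self._results.append(result)

    def summary(self) -> Counter:
        """Count results per (validator, result) pair, e.g. for a report page."""
        return Counter((r.validator, r.result) for r in self._results)

    def to_json(self) -> str:
        """Serialise all results so another application can consume them."""
        return json.dumps([asdict(r) for r in self._results])
```

Keeping the stored results separate from their presentation means a new view can be added without touching the validators.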

Current Status
● Library with CLI (command line interface)
● Basic evaluators and rules
● Ready for contributions

Demo
• Darwin Core Archive, Taxon Checklist
    o Invalid characters
    o Broken link from synonym to accepted taxon

Lynx canadensis, http://www.animalgalleries.org/
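
The two checklist issues shown in the demo can be approximated as follows. This is a sketch, not the code that was demonstrated: the function names are hypothetical, `taxonID` and `acceptedNameUsageID` are Darwin Core terms, and treating the Unicode replacement character and control characters as "invalid" is an assumption about what the check covers.

```python
def has_invalid_characters(value):
    """Flag the Unicode replacement character and unexpected control characters."""
    return "\ufffd" in value or any(
        ord(ch) < 32 and ch not in "\t\n\r" for ch in value
    )


def find_broken_accepted_links(taxa):
    """Return (taxonID, acceptedNameUsageID) pairs whose target is not in the checklist."""
    known_ids = {taxon["taxonID"] for taxon in taxa}
    broken = []
    for taxon in taxa:
        accepted = taxon.get("acceptedNameUsageID")
        if accepted and accepted not in known_ids:
            broken.append((taxon["taxonID"], accepted))
    return broken
```

The link check needs the whole checklist in memory (or an index of its identifiers), which makes it a column-style validation rather than a row-style one.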

Future validations
● Use semantic web (e.g. GeoNames)
● Use external resolver (e.g. CoL)
● Use more complex validation (e.g. climate layer)

Future validations
Accommodate localisation vs misspellings
● Brésil (fr)
● Brazil (en)
● Brasil (pt)
● Brasilien (se)
● Brézil (??)
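
One possible way to separate a legitimate localised name from a likely misspelling is to look the value up in a user-defined dictionary first and fall back to fuzzy matching only when the lookup fails. The Python sketch below uses `difflib` from the standard library; the dictionary contents, the function name `classify_country` and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
import difflib

# Hypothetical user-defined dictionary: accepted spelling -> ISO 3166-1 alpha-2 code.
COUNTRY_NAMES = {
    "Brésil": "BR",
    "Brazil": "BR",
    "Brasil": "BR",
    "Brasilien": "BR",
}


def classify_country(value, cutoff=0.8):
    """Return (status, candidates) for a country string.

    status is "known" for a dictionary entry, "possible misspelling" when the
    value is close to one or more entries, and "unknown" otherwise.
    """
    if value in COUNTRY_NAMES:
        return ("known", [value])
    close = difflib.get_close_matches(value, list(COUNTRY_NAMES), n=3, cutoff=cutoff)
    if close:
        return ("possible misspelling", close)
    return ("unknown", [])
```

Returning every close candidate, rather than a single "correction", leaves the decision to a person or to a later chain element, since a value such as "Brézil" is equally close to more than one accepted spelling.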

Questions?

Public Domain: robynm

Acknowledgements

Contact
http://www.canadensys.net

http://github.com/Canadensys

@Canadensys

