Data Quality Resources in Species Occurrence Digitization Allan Koch Veiga Etienne Americo Cartolano...

Post on 21-Jan-2016


Data Quality Resources in Species Occurrence Digitization

Allan Koch Veiga

Etienne Americo Cartolano Jr

Antonio Mauro Saraiva

Agricultural Automation Laboratory – LAA

Computing Engineering Dept., Engineering School

Universidade de São Paulo, Brazil

Outline

Background

Biodiversity Data Digitizer (BDD) & IABIN

Data Quality Methodology

Data Quality Tools: BDD Geo Tool, BDD Taxon Tool

Conclusion

Background

Importance of Species Occurrence Data: GBIF Portal, IABIN Portal

Data quality impacts the uses of data

In the location and taxonomic data domains, georeferencing and identification are two major causes of error in species occurrence data

Need to improve Data Quality (DQ)

Data quality & IABIN-PTN

Inter-American Biodiversity Information Network (IABIN)

Pollinators Thematic Network (PTN)

GEF-funded project (2006-2011) (~$180k)

11 countries in Latin America

~400,000 records

Responsibilities:

Development of tools for data digitization and integration

Data digitization

Training and support

Reviewing proposals, reports, data

Close contact with data owners / providers

Data Quality & IABIN-PTN

Opportunities & needs:

Discuss digitization issues with the grantees

Standards: importance and role (TDWG)

Data quality: concepts

Improve data quality

Provide mechanisms integrated into digitization tools versus isolated tools

Biodiversity Data Digitizer (BDD)

Designed for easy digitization, manipulation, and publication

Rich data content

FAO-GEF pollinator project

Darwin Core

EOL/Plinian Core

Interaction Extension

FAO Deficit Protocol

FAO Monitoring Protocol

MRTG Schema

Dublin Core

Demo: Thu

Location Data Domain

DQ Assessment Methodology

What is Data Quality?

Completeness Consistency Credibility Accuracy Precision

Data Domain (context)

Dimension (aspect)

Problem (error patterns): missing value, incorrect value, nonatomic value, inconsistent value, information contamination
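As an illustration of how error patterns could be detected for a single field in the location data domain, here is a minimal Python sketch. It is not the BDD implementation. The function name `classify_latitude` is hypothetical, and the readings of the patterns are assumptions: "nonatomic value" is taken to mean several values packed into one field, and "incorrect value" to mean a value that cannot be a latitude. The slides do not define the patterns, so treat these as one plausible interpretation.

```python
from typing import Optional


def classify_latitude(raw: Optional[str]) -> str:
    """Return an error pattern for a raw latitude string, or "ok".

    Hypothetical helper: the pattern definitions are assumptions,
    not taken from the slides.
    """
    # Missing value: nothing was entered in the field.
    if raw is None or raw.strip() == "":
        return "missing value"
    text = raw.strip()
    # Nonatomic value (assumed meaning): more than one value in one field,
    # for example latitude and longitude typed together.
    if "," in text or ";" in text:
        return "nonatomic value"
    # Incorrect value: not a number at all.
    try:
        value = float(text)
    except ValueError:
        return "incorrect value"
    # Incorrect value: a number outside the valid latitude range.
    if not -90.0 <= value <= 90.0:
        return "incorrect value"
    return "ok"
```

A data-entry form could run such a check on each field as it is typed, which fits the prevention approach better than cleaning the data afterwards.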

DQ Management Methodology

How to improve the DQ?

Reducing Errors

Detection and Correction

Prevention

Error prevention is considered superior to error detection

Resources to Improve DQ on BDD

Tools to prevent errors on occurrence data digitization

Integrated to BDD species occurrence data-entry interface

BDD Geo Tool: prevents location data digitization errors

BDD Taxon Tool: prevents taxonomic data digitization errors

BDD Geo Tool: Step 1 of 3 – Primary Data

BDD Geo Tool: Step 2 of 3 – Data Source

BDD Geo Tool: Step 3 of 3 – Uncertainty

BDD Geo Tool: Location data form is filled

BDD Geo Tool

Improved

Completeness: adds data not available before (ex. lat/long, municipality)

Consistency: consistent data obtained from a consistent source (avoiding errors like lat: 0, long: 0, municipality: New Orleans)

Credibility: associates data with a credible source (BioGeomancer, Google, GeoNames)

Accuracy: better than using the center of mass of a region

Precision: uncertainty indicator increases data fitness for use
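The improvements listed above can be pictured as checks on a location record that carries its source and uncertainty alongside the coordinates. The following Python sketch is only an illustration under stated assumptions: the `Location` class, its field names, and `location_problems` are hypothetical and do not describe how the BDD Geo Tool is built.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Location:
    """Hypothetical location record; field names are illustrative."""
    latitude: float
    longitude: float
    source: str           # service that supplied the point, e.g. "GeoNames"
    uncertainty_m: float  # radius in metres around the point


def location_problems(loc: Location) -> List[str]:
    """List the data quality problems found in a location record."""
    problems = []
    # Accuracy: coordinates must fall inside the valid ranges.
    if not -90 <= loc.latitude <= 90 or not -180 <= loc.longitude <= 180:
        problems.append("coordinates out of range")
    # Consistency: lat 0, long 0 is usually an unfilled placeholder.
    if loc.latitude == 0 and loc.longitude == 0:
        problems.append("lat 0, long 0 placeholder")
    # Credibility: the record should name where the point came from.
    if not loc.source.strip():
        problems.append("no data source")
    # Precision: an uncertainty radius tells users how far to trust the point.
    if loc.uncertainty_m <= 0:
        problems.append("no uncertainty")
    return problems
```

For example, `location_problems(Location(-23.55, -46.63, "GeoNames", 1000.0))` returns an empty list, while a record with zero coordinates, no source, and no uncertainty is flagged three times.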

BDD Taxon Tool: Step 1 of 2 – Taxonomic Name Selection

BDD Taxon Tool: Step 2 of 2 – Taxonomic Hierarchy Selection

BDD Taxon Tool: Taxonomic data form filled

BDD Taxon Tool

Improved

Completeness: taxonomic hierarchy is filled from a taxon name

Consistency: consistent data are obtained from a consistent source (Catalog of Life)

Credibility: data are associated with a credible source (Catalog of Life)

Accuracy: avoids spelling mistakes and incorrect taxonomic hierarchies

Precision: suggests complete scientific names
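The two steps of the Taxon Tool (pick a name, then fill the hierarchy) can be sketched in Python as follows. This is an illustration only. The `REFERENCE` table is a tiny in-memory stand-in for a checklist such as Catalog of Life, and `suggest_names` and `fill_hierarchy` are hypothetical names. The real tool presumably queries the checklist itself, and the slides do not say how it does so.

```python
import difflib
from typing import Dict, List

# Tiny stand-in for a reference checklist (assumption: the real tool
# looks names up in Catalog of Life rather than a local table).
REFERENCE: Dict[str, Dict[str, str]] = {
    "Apis mellifera": {
        "kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta",
        "order": "Hymenoptera", "family": "Apidae", "genus": "Apis",
    },
    "Bombus terrestris": {
        "kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta",
        "order": "Hymenoptera", "family": "Apidae", "genus": "Bombus",
    },
}


def suggest_names(typed: str, limit: int = 3) -> List[str]:
    """Step 1: suggest complete scientific names close to what was typed."""
    return difflib.get_close_matches(typed, list(REFERENCE), n=limit, cutoff=0.6)


def fill_hierarchy(name: str) -> Dict[str, str]:
    """Step 2: fill the taxonomic hierarchy from the selected name."""
    record = dict(REFERENCE[name])  # copy so the reference table is not modified
    record["scientificName"] = name
    return record
```

A misspelling such as "Apis melifera" yields "Apis mellifera" as the first suggestion, and selecting it fills the hierarchy fields from the reference source instead of from manual typing.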

Conclusion

Integrated existing techniques, tools, and credible data sources into a species occurrence data-entry tool

Improved completeness, consistency, accuracy and precision of species occurrence data

Error prevention in taxonomic and location data

Tools usable by an audience with little experience in data digitization and DQ

Conclusion

Next steps

Other tools, techniques, dimensions and error patterns and domains of data quality in biodiversity are yet to be explored and added

Work on error correction on existing data

Spreadsheet based data correction

Suggestions and collaboration are welcome!

Acknowledgements

IABIN – PTN

Laurie Adams (P2), Mike Ruggiero (ITIS), Mike Frame, Liz Sellers and Ben Wheeler (USGS)

Pedro Correa (University of São Paulo)

All data grantees

FAO-UNEP-GEF Pollinator project in Brazil

Barbara Gemmill-Herren (FAO)

Ministry of the Environment - Brazil

All data grantees

Thank you

Allan Koch Veiga allan.kv@gmail.com

Etienne Americo Cartolano Jr etienne.cartolano@gmail.com

Antonio Mauro Saraiva saraiva@usp.br

Agricultural Automation Laboratory – LAA

Computing Engineering Dept., Engineering School

Universidade de São Paulo, Brazil