GBIF Checklist Bank: Indexing & Backbone
Checklist Scope
• 1,846 datasets registered
• 18 million name records
• Plazi (1,131), Pensoft (178), CoL GSDs (156)
Denormalized Checklist
Normalized Checklist
Checklist Challenges
• Highly relational taxonomic data: almost all records are linked via the tree and basionym relations
• Wrong or missing records destroy dataset integrity, not just a single record!
• Different from flat, unrelated occurrence records
• Data quality
  • broken referential integrity
  • bad names or placeholders (e.g. «Unallocated Family»)
  • missing or unused controlled vocabularies, e.g. «art» for rank species
• Name strings can be published in several ways
  • ScientificName
  • ScientificName + Authorship
  • Genus + SpeciesEpitheton + Rank + InfraspecificEpitheton + Authorship
• Classifications can be published in several ways
  • Normalised via parentNameUsageID
  • Normalised via parentNameUsage
  • Denormalised via Kingdom, Phylum, Class, Order, Family, Genus
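The three ways of publishing a name string can be reduced to one full name. The following is a minimal sketch, not the Checklist Bank code: the field names (`scientificName`, `scientificNameAuthorship`, `genus`, `specificEpithet`, `infraspecificEpithet`, `taxonRank`) and the rank marker table are assumptions chosen for illustration.

```python
from typing import Mapping

# Rank markers used when assembling infraspecific names (illustrative subset).
RANK_MARKERS = {"subspecies": "subsp.", "variety": "var.", "form": "f."}


def full_name(rec: Mapping[str, str]) -> str:
    """Return one name string, whichever of the three styles the record uses."""
    author = rec.get("scientificNameAuthorship", "").strip()
    name = rec.get("scientificName", "").strip()
    if not name:
        # Atomized style: assemble the name from its parts.
        parts = [rec.get("genus", ""), rec.get("specificEpithet", "")]
        infra = rec.get("infraspecificEpithet", "").strip()
        if infra:
            marker = RANK_MARKERS.get(rec.get("taxonRank", "").lower(), "")
            parts += [marker, infra]
        name = " ".join(p.strip() for p in parts if p.strip())
    # Authorship in a separate field: append it unless it is already there.
    if author and not name.endswith(author):
        name = f"{name} {author}"
    return name
```

All three styles then yield comparable strings, for example `full_name({"scientificName": "Poa pratensis", "scientificNameAuthorship": "L."})` gives `"Poa pratensis L."`.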
Checklist Indexing
• Basic archive validation
  • unique ids
• Checklist Normalizer
  • resolve relations
  • create implicit taxa from denormalised classification
  • interpret controlled vocabularies, e.g. rank
  • match to backbone
  • match to previous version to keep GBIF ids stable
• Checklist Importer
  • inserts data into the Postgres database and the Solr index for searches
• Checklist Analyser
  • generates dataset metrics
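The normalizer step "create implicit taxa from denormalised classification" can be sketched as follows. This is an illustration under simplifying assumptions, not the actual normalizer: taxa are keyed by a hypothetical `(rank, name)` pair, so homonyms in different branches would collide, and every record is treated as a species.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]


def normalize(records):
    """Map each taxon key to its parent key, creating implicit higher taxa."""
    parents = {}  # (rank, name) -> parent (rank, name), or None for a root
    for rec in records:
        parent = None
        for rank in RANKS:
            name = rec.get(rank)
            if not name:
                continue  # this rank was not supplied for the record
            key = (rank, name)
            # Create the higher taxon only once; the first parent seen wins.
            parents.setdefault(key, parent)
            parent = key
        parents[("species", rec["scientificName"])] = parent
    return parents


records = [
    {"kingdom": "Plantae", "family": "Asteraceae", "genus": "Helianthus",
     "scientificName": "Helianthus annuus L."},
    {"kingdom": "Plantae", "family": "Asteraceae", "genus": "Cichorium",
     "scientificName": "Cichorium intybus L."},
]
tree = normalize(records)
```

Two flat records produce six linked usages here: one kingdom, one family, two genera and two species.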
Organizing Occurrences
• GBIF needs a single, consistent taxonomy
  • for metrics, search, maps
  • considerable variation in higher taxa
  • synonymies can be very large
• Catalog of Life is the largest single source
  • ~90% of GBIF occurrence records (thanks to birds)
  • ~50% of GBIF occurrence names (35% in 2010)
• GBIF needs to assemble a taxonomy
  • originally merged the (noisy) names found in occurrences, which resulted in lots of duplicates
  • improved by stitching together checklist datasets
Cronquist classification
  Mimosaceae: 3,200 species
  Caesalpiniaceae: 2,000 species
  Fabaceae: 14,000 species
“Modern” classification
  Fabaceae: 19,200 species
    Mimosoideae: 3,200 species
    Caesalpinioideae: 2,000 species
    Faboideae: 14,000 species
Current Backbone Issues
• Far too many accepted species (acc/syn)
  • Cactaceae: GBIF 12,062 (342 syn), TPL 2,233 (5,422 syn) + 5,500 unknown
  • Genus Weingartia: GBIF 129 (0 syn), TPL 8 (26 syn) + 68 unknown
• Many accepted names based on the same basionym
  • Sulcorebutia breviflora Backeb.
  • Weingartia breviflora (Backeb.) Hentzschel & K.Augustin
• No synonyms with different authors possible
  • Poa pubescens R.Br. synonym of Eragrostis pubescens (R.Br.) Steud.
  • Poa pubescens Lej. synonym of Poa pratensis L.
  • merged all names with exact same canonical name
  • list of known homonym genera (IRMNG) used to disambiguate between larger groups
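Why merging on the canonical name alone loses synonyms can be shown with the two Poa pubescens names from the slide. This is only a sketch of the effect; the dictionary layout is an assumption and not how the backbone stores names.

```python
synonyms = [
    # (canonical name, authorship, accepted name)
    ("Poa pubescens", "R.Br.", "Eragrostis pubescens (R.Br.) Steud."),
    ("Poa pubescens", "Lej.", "Poa pratensis L."),
]

# Keyed by canonical name only: the second record overwrites the first.
merged_by_canonical = {}
for canonical, author, accepted in synonyms:
    merged_by_canonical[canonical] = accepted

# Keyed by canonical name and authorship: both homonyms survive.
kept_by_author = {
    (canonical, author): accepted for canonical, author, accepted in synonyms
}
```

With the first key only one synonym relation remains; with the second key both do.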
Backbone Building
• Overlay ordered sources
  • Start with Catalog of Life
  • Primary source defines status
  • Create new name if kingdom, canonical name & authorship do not exist in current nub
• Ignore source name if …
  • not a major Linnean rank (infraspecific ranks are included)
  • higher ranks above family (configurable per source)
  • status conflicts with already existing status
  • hybrid formula, cultivar, candidatus or placeholder names
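The overlay of ordered sources could look roughly like this. It is a simplified sketch under stated assumptions: the `Usage` class, the `ALLOWED_RANKS` set and the exact-match key are hypothetical, and only two of the ignore rules (rank filter, first source defines status) are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    kingdom: str
    canonical: str
    authorship: str
    rank: str
    status: str  # "accepted" or "synonym"


# Major Linnean ranks plus infraspecific ranks (illustrative list).
ALLOWED_RANKS = {"kingdom", "phylum", "class", "order", "family", "genus",
                 "species", "subspecies", "variety", "form"}


def build_nub(sources):
    """Overlay sources in order of trust; the first source to supply a name wins."""
    nub = {}
    for source in sources:  # most trusted source first
        for usage in source:
            if usage.rank not in ALLOWED_RANKS:
                continue  # e.g. tribes are ignored
            key = (usage.kingdom, usage.canonical, usage.authorship)
            if key not in nub:
                nub[key] = usage  # new name; an existing status is never overridden
    return nub


col = [Usage("Plantae", "Cichorium intybus", "L.", "species", "accepted")]
other = [
    Usage("Plantae", "Cichorium intybus", "L.", "species", "synonym"),
    Usage("Plantae", "Cichorieae", "Lam & DC.", "tribe", "accepted"),
    Usage("Plantae", "Agoseris maritima", "E. Sheld.", "species", "accepted"),
]
nub = build_nub([col, other])
```

The conflicting synonym status from the second source is dropped, the tribe is skipped, and only the genuinely new species is added.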
[Diagram: 10s of taxonomic resources (Catalogue of Life, Fauna Europaea, GRIN, Mammal Species of the World) are merged into the NUB; observations, specimens and 8000 species lists (5M+ names in the Primary Data Index) are matched against it.]
Backbone Assembling
Animalia, Archaea, Bacteria, Chromista, Fungi, Plantae, Protozoa, Viruses, incertae sedis
• Nub build starts with 8 kingdoms
Backbone Assembling
Nub:
Plantae
  Magnoliophyta
    Magnoliopsida
      Asterales
        Asteraceae
          Helianthus L.
            Helianthus annuus L.
• Catalog of Life is added
• Defines higher classification
Source:
Plantae
  Magnoliophyta
    Magnoliopsida
      Asterales
        Asteraceae
          Helianthus L.
            Helianthus annuus L.
Backbone Assembling
Nub:
Plantae
  Magnoliophyta
    Magnoliopsida
      Asterales
        Asteraceae
          Helianthus L.
            Helianthus annuus L.
          Cichorium
            Cichorium intybus L.
• Missing genera are created
• Tribe is ignored
Source:
Asteraceae
  Cichorieae Lam & DC. [tribe]
    Cichorium intybus L.
Backbone Assembling
Nub:
Plantae
  Magnoliophyta
    Magnoliopsida
      Asterales
        Asteraceae
          Helianthus L.
            Helianthus annuus L.
          Cichorium Linneaus
            Cichorium intybus L.
              = C. balearicum Porta
              = C. byzantinum Clementi
• Synonyms respect authors
• Author match very loose
• Existing genus author updated
Source:
Plantae
  Asteraceae
    Cichorium Linneaus
      Cichorium intybus Linneaus
        = Cichorium balearicum Porta
        = Cichorium byzantinum Clem.
        = Cichorium byzantinum Clementi
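A "very loose" author match that treats «Clem.» and «Clementi», or «L.» and «Linneaus», as the same author might be sketched as a prefix comparison on normalized author strings. The rule below is an assumption for illustration, not the matching actually used for the nub; note that such a loose rule would also equate «L.» with «Lej.», so it is only safe between names that already share a canonical name.

```python
def author_key(authorship: str) -> str:
    """Lowercase the authorship and keep letters only, e.g. 'Clem.' -> 'clem'."""
    return "".join(ch for ch in authorship.lower() if ch.isalpha())


def authors_match(a: str, b: str) -> bool:
    """Loose comparison: one normalized author string is a prefix of the other."""
    ka, kb = author_key(a), author_key(b)
    if not ka or not kb:
        # Assumption: a missing author matches anything, which lets an
        # existing author-less genus pick up the author from a source.
        return True
    return ka.startswith(kb) or kb.startswith(ka)
```

With this rule `authors_match("Clem.", "Clementi")` holds, while `authors_match("Porta", "Clementi")` does not.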
Backbone Assembling
Nub:
Plantae
  Magnoliophyta
    Magnoliopsida
      Asterales
        Asteraceae
          Helianthus L.
            Helianthus annuus L.
          Cichorium L.
            Cichorium intybus L.
              = C. balearicum Porta
              = C. byzantinum Clem.
• Prefer authors from nomenclators
Source:
Asteraceae
  Cichorium L.
    Cichorium byzantinum Clem.
Backbone Assembling
Nub:
Asteraceae
  Helianthus L.
    Helianthus annuus L.
  Agoseris
    Agoseris apargioides (Less.) Greene
      = A. maritima Eastw.
      A. a. var. eastwoodiae (Fedde) Munz
      A. a. var. maritima (E. Sheld.) Baird
  Cichorium L.
    Cichorium intybus L.
      = C. balearicum Porta
      = C. byzantinum Clem.
• Infraspecifics are included
Source:
Asteraceae
  Agoseris apargioides (Less.) Greene
    = A. maritima Eastw.
    A. a. var. eastwoodiae (Fedde) Munz
    A. a. var. maritima (E. Sheld.) Baird
Backbone Assembling
Nub:
Asteraceae
  Helianthus L.
    Helianthus annuus L.
  Agoseris
    Agoseris apargioides (Less.) Greene
      = A. maritima Eastw.
      A. a. var. eastwoodiae (Fedde) Munz
      A. a. var. maritima (E. Sheld.) Baird
    Agoseris eastwoodiae Fedde
    Agoseris maritima E. Sheld.
  Cichorium L.
    Cichorium intybus L.
      = C. balearicum Porta
      = C. byzantinum Clem.
• Other source treats them as species
• Same canonical name (Agoseris maritima) allowed twice because the authors differ
Source:
Asteraceae
  Agoseris eastwoodiae Fedde
  Agoseris maritima E. Sheld.
Final Cleanup - Basionyms
Asteraceae
  Helianthus L.
    Helianthus annuus L.
  Agoseris
    Agoseris apargioides (Less.) Greene
      = A. maritima Eastw.
      A. a. var. eastwoodiae (Fedde) Munz
        = Agoseris eastwoodiae Fedde
      A. a. var. maritima (E. Sheld.) Baird
        = Agoseris maritima E. Sheld.
  Cichorium L.
    Cichorium intybus L.
      = C. balearicum Porta
      = C. byzantinum Clem.
• Finally basionyms are detected
  • by terminal epithet & author within a family
• Only 1 accepted per group
  • the most trusted first stays
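Grouping names by terminal epithet and original author can be sketched like this. It is an illustration only: the function names are hypothetical, the input is assumed to be a single family listed in order of trust, and authors are compared exactly rather than loosely.

```python
import re


def basionym_key(canonical: str, authorship: str):
    """Terminal epithet plus the original (bracketed, if present) author."""
    epithet = canonical.split()[-1]
    bracketed = re.match(r"\((.+?)\)", authorship.strip())
    original_author = bracketed.group(1) if bracketed else authorship.strip()
    return (epithet, original_author)


def resolve(names):
    """Keep the first name of each group; map every later one to it.

    Returns full name -> accepted full name, or None if the name is kept as is.
    """
    first_in_group = {}
    accepted_of = {}
    for canonical, authorship in names:  # most trusted first
        full = f"{canonical} {authorship}"
        key = basionym_key(canonical, authorship)
        if key in first_in_group:
            accepted_of[full] = first_in_group[key]
        else:
            first_in_group[key] = full
            accepted_of[full] = None
    return accepted_of


names = [
    ("Agoseris apargioides var. eastwoodiae", "(Fedde) Munz"),
    ("Agoseris apargioides var. maritima", "(E. Sheld.) Baird"),
    ("Agoseris eastwoodiae", "Fedde"),
    ("Agoseris maritima", "E. Sheld."),
    ("Agoseris maritima", "Eastw."),
]
result = resolve(names)
```

Agoseris eastwoodiae Fedde and Agoseris maritima E. Sheld. end up as synonyms of the two varieties, while Agoseris maritima Eastw. forms its own group because its author differs.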
Final Cleanup - Autonyms
Asteraceae
  Helianthus L.
    Helianthus annuus L.
  Agoseris
    Agoseris apargioides (Less.) Greene
      = A. maritima Eastw.
      A. a. var. apargioides
      A. a. var. eastwoodiae (Fedde) Munz
        = Agoseris eastwoodiae Fedde
      A. a. var. maritima (E. Sheld.) Baird
        = Agoseris maritima E. Sheld.
  Cichorium L.
    Cichorium intybus L.
      = C. balearicum Porta
      = C. byzantinum Clem.
• Create missing autonyms
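Creating the missing autonym for the example above might look as follows. This sketch assumes botanical canonical names of the form "Genus species marker epithet" and a hypothetical helper name; it ignores authorship, since autonyms carry none.

```python
def missing_autonyms(canonicals):
    """Return autonyms implied by infraspecific names but not yet present."""
    existing = set(canonicals)
    created = []
    for name in canonicals:
        parts = name.split()
        if len(parts) != 4:
            continue  # only names of the form "Genus species marker epithet"
        genus, species, marker, _ = parts
        # The autonym repeats the species epithet at the infraspecific rank.
        autonym = f"{genus} {species} {marker} {species}"
        if autonym not in existing:
            existing.add(autonym)
            created.append(autonym)
    return created


created = missing_autonyms([
    "Agoseris apargioides",
    "Agoseris apargioides var. eastwoodiae",
    "Agoseris apargioides var. maritima",
])
```

Both varieties imply the same autonym, Agoseris apargioides var. apargioides, which is created only once.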
Backbone Building Rules
• Create missing genus or species in classification
  • only for accepted taxa
• Create missing autonyms for infraspecific taxa
• Detect basionyms based on terminal epithet & authorship
  • Assumes epithet & authorship in family is unique
  • Converts all but one accepted to synonyms
• Flag taxa as doubtful
  • genus or higher taxon without any species (IRMNG)
  • species (or infrasp.) with a parent genus (or species) considered to be a synonym
    • moved to newly accepted genus (or species)
    • the case for potential children of synonymised basionym combination
Backbone Sources
• GBIF Backbone Patch
• Catalogue of Life
• World Register of Marine Species
• Dyntaxa - Svensk taxonomisk databas
• GRIN Taxonomy
• Fauna Europaea
• Integrated Taxonomic Information System
• Euro+Med Plantbase
• Interim Register of Marine and Nonmarine Genera
• The Clements Checklist
• IOC World Bird Names
• Mammal Species of the World
• Paleobiology Database
• Nomenclators
• International Plant Names Index
• Index Fungorum
• ZooBank
• Prokaryotic Nomenclature Up-to-date
• ICTV Master Species List
• Organisations
• Species Files
• Biodiversity Data Journal (Pensoft)
• ZooKeys (Pensoft)
• PhytoKeys (Pensoft)
• Plazi ???
Backbone Matching
• Occurrence
  • fuzzy name match
  • classification match
  • allow higher rank matches
• Checklist
  • match kingdom
  • require straight canonical match
  • incl. authorship comparison
  • no webservice yet, only embedded
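The difference between the two matching modes can be illustrated with a toy backbone. This is a rough sketch using Python's standard `difflib`, not the GBIF matching service: the `BACKBONE` table, the 0.9 cutoff and both function names are assumptions, and the kingdom, classification and authorship checks are left out.

```python
from difflib import get_close_matches

# Toy backbone: canonical name -> rank (illustrative data only).
BACKBONE = {
    "Helianthus": "genus",
    "Helianthus annuus": "species",
    "Cichorium intybus": "species",
}


def match_checklist(canonical: str):
    """Checklist mode: require a straight canonical match."""
    return canonical if canonical in BACKBONE else None


def match_occurrence(canonical: str, cutoff: float = 0.9):
    """Occurrence mode: fuzzy match first, then fall back to the genus."""
    hits = get_close_matches(canonical, list(BACKBONE), n=1, cutoff=cutoff)
    if hits:
        return hits[0]
    words = canonical.split()
    genus = words[0] if words else ""
    return genus if genus in BACKBONE else None  # higher rank match
```

A misspelled occurrence name such as "Helianthus anuus" still finds its species in occurrence mode, whereas checklist mode rejects it; an unknown species of a known genus falls back to the genus.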
NameUsage
Parsed Name
Backbone Match
Citation
Dataset Metrics
Verbatim Record
Metrics
Extensions
• Checklists & Nub: same structure
• Parent-child hierarchy
  • normalized classification
  • flexible ranks
• Synonym-to-accepted relation
• Dataset metrics as timeseries
• Basionym relation
Schema
CLB Supported Extensions
• Description: human paragraphs about some topic
• Distribution: area ranges with statuses
• Identifier: additional identifier for the record
• Multimedia: image, video, sound
• Literature references: bibliography
• Occurrence (indexed via occurrence workflows)
• Species Profile: extinct, marine, freshwater, terrestrial flags
• Types and specimens (overlaps with Occurrence)
• Vernacular names: name with language & region
http://rs.gbif.org/extension/gbif/1.0/
Normalizing Classifications