10 years of global biodiversity databases: are we there yet?

Post on 14-Apr-2017


Tony Rees

Independent data consultant, Northern Rivers region, New South Wales, Australia

Previously: CSIRO Marine & Atmospheric Research, Hobart, Tasmania

Global ocean bio-records in OBIS, 2015

The vision: “Biodiversity information on every desktop” [ / device]…

• A global taxon inventory: up-to-date species lists, synonymies, etc. (for all groups)
• Citations, links to primary literature: direct access to the primary taxonomic literature (for all described taxa), including full text (preferably…)
• “All” georeferenced records accessible, for all species: no need for individuals to do the data aggregation; map local / regional / global records, show details for any data item
• Indexes of taxon traits: e.g. to support sort / filter / group by…
• Predictive mapping / computed range maps for all taxa: fill sampling gaps via niche modelling, to produce comprehensive global species maps
• Plus more (phylogenies, illustrations, genetics, descriptions, keys…)

A standardised approach for this talk

Rationale for each component/activity (why do we care?)

Size of the problem (or sub-problem)

Who is addressing it (and what they are doing)

How far have they (“we”) got, and how much is still to be done…

Some other points to consider:
• open vs. closed access to relevant content (who can access?)
• machine vs. human retrievability ( -> services, not just pages to view)
• degree of consolidation available (saves querying multiple resources)
• web query only, or are the base data available for export/user upload

A global taxon inventory

From presentation by Quentin Wheeler, International Institute for Species Exploration (IISE):

A global taxon inventory – why do we care?

Useful to know with what organisms we share the planet

History of life as its own study area, also key to understanding present life forms and their relationships

Ensure taxonomic names do not get accidentally re-used

Construct list once, use many times (no need to re-create from scratch)

Reconcile old names / synonyms to current taxon concepts (important for data integration)

Provide “taxonomic backbone” underpinning other biodiversity activities / projects.
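
As a rough illustration of the "reconcile old names / synonyms" point, the sketch below merges records filed under different names into one accepted name. The lookup table, the record counts and the name "Aus bus" are invented for the example; Helix lactea = Otala lactea is taken from later in this talk, and Physeter catodon is treated here as a synonym of Physeter macrocephalus.

```python
# Hypothetical synonym table: every known name points to the accepted name.
# Accepted names point to themselves.
ACCEPTED_NAME = {
    "Otala lactea": "Otala lactea",
    "Helix lactea": "Otala lactea",              # older combination
    "Physeter macrocephalus": "Physeter macrocephalus",
    "Physeter catodon": "Physeter macrocephalus",
}


def reconcile(records):
    """Sum record counts under accepted names; return names that could not be matched."""
    totals = {}
    unmatched = []
    for name, count in records:
        accepted = ACCEPTED_NAME.get(name)
        if accepted is None:
            unmatched.append(name)      # not in the backbone: needs manual review
            continue
        totals[accepted] = totals.get(accepted, 0) + count
    return totals, unmatched


# Made-up input: (name as supplied by a data provider, number of records)
records = [("Helix lactea", 3), ("Otala lactea", 5),
           ("Physeter catodon", 2), ("Aus bus", 1)]
totals, unmatched = reconcile(records)
print(totals)
print(unmatched)
```

Without the backbone, the two snail names would be counted as two species; with it, their records are pooled and the unknown name is set aside rather than silently dropped.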

How many species? Arthur Chapman asked the experts:

From Chapman’s summary document (2009 edition):

NB:
• “Others” includes non-green algae, Protista, prokaryotes and viruses (refer document for details)
• “Invertebrates” includes 1m insects, 360k others (incl. 102k arachnids, 85k molluscs, 47k Crustacea)
• “Estimated” total spp. for world is 11.3m, i.e. only 17% of estimated global biodiversity yet named (!)

• …add another 200k-300k(?) for known fossil species, maybe multiply 2x-3x to include synonyms…
• …gives upwards of 5m species names to catalogue/organise (+0.5m genera…)
• + new species descriptions (~20k/year) and higher taxa, also new combinations (??/year)

5m+ names!
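
The "5m+ names" figure can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted above; the described-species count is back-calculated here from "17% of 11.3m" rather than read from Chapman's table.

```python
# Figures quoted in the talk (Chapman 2009 summary)
estimated_world_species = 11.3e6
fraction_named = 0.17

# Back-calculated number of described extant species (about 1.9m)
described = estimated_world_species * fraction_named

# Known fossil species: 200k-300k(?)
fossil_low, fossil_high = 0.2e6, 0.3e6

# Multiplier to allow for synonyms: 2x-3x
synonym_factor_low, synonym_factor_high = 2, 3

names_low = (described + fossil_low) * synonym_factor_low
names_high = (described + fossil_high) * synonym_factor_high

print(f"described extant species: {described / 1e6:.2f}m")
print(f"species names to catalogue: {names_low / 1e6:.1f}m to {names_high / 1e6:.1f}m")
```

The range brackets 5m, so "upwards of 5m" is a reasonable mid-range working figure.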

From group estimates to global lists

ION – Index to Organism Names
www.organismnames.com

2005: 1.8m names, all ranks (incl. synonyms), from 2.8m documents
2015: 5.2m names, all ranks (incl. synonyms), from 3.5m documents
- Animal names only (+ protists), cf. other resources for plants etc.

Likes:
• Comprehensive coverage (“most” zoological names held)
• Includes fossils as well as extant taxa
• Cites original publications for most post-1860 names
• Low latency (c. 6 months to name appearance in index)
• Some tax. hierarchy for all names
• ION ID minted for every name (usefulness varies)

Dislikes:
• Many more names than taxa (spelling + authority variations, synonyms, bad data) – needs deduplication before use
• Hard to work out which is “correct” name or what names are synonyms, etc.
• Some quirks in citations as given (including author spellings)
• Detailed publication and taxon info is behind paywall

Newly published names (all ranks) in ION, Nov 2015 (total 2.01m)

names from Index Animalium (1758-1850)

names from Zoological Record (1864-current)

Catalogue of Life
www.catalogueoflife.org

2005: 530k valid species names (+ ?? synonyms), from 23 databases
2015: 1.6m valid species names (+ 1.3m synonyms), from 151 databases
- All taxonomic groups, extant only (a few fossils starting 2015)

Likes:
• Name quality high (all expert-supplied), synonymies explicit, no (few) duplicates
• Internally consistent taxonomic hierarchy, kingdom -> family
• Coverage increasing over time (claims currently 84% of all extant species)

Dislikes:
• Some groups not yet covered (also no fossils)
• Synonymies not always complete (some old names not listed)
• No author, synonym information at ranks above species
• No links to original literature (although these may be traceable via source databases)
• More latency than ION (takes a while for new names to appear)
• No stable IDs for names (cannot use for linking to current edition)

Partial ION listing – search for “Physeter”


Note:
1. ION often includes the same name in multiple variants (mix of “good” and “bad” content), giving over-representation of number of “real” names
2. This is a list of names, not taxa (single taxon can have multiple names, e.g. valid name plus synonyms – not distinguished in ION).
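
A minimal sketch of the kind of deduplication point 1 calls for: collapse spelling and authority variants onto a crude comparison key. The variant strings are invented for illustration (they are not copied from ION), and the key is deliberately simple; real matching would need something more tolerant.

```python
import re
from collections import defaultdict


def name_key(raw):
    """Crude comparison key: genus (+ species epithet), lowercased,
    with authority, year and punctuation dropped."""
    cleaned = re.sub(r"[(),.\[\]]", " ", raw)
    tokens = cleaned.split()
    parts = [tokens[0].lower()] if tokens else []
    # A second token is kept only if it is all lowercase,
    # i.e. looks like a species epithet rather than an author name or year.
    if len(tokens) > 1 and tokens[1].islower():
        parts.append(tokens[1])
    return " ".join(parts)


# Invented name variants of the sort an index might hold
variants = [
    "Physeter macrocephalus Linnaeus, 1758",
    "Physeter macrocephalus Linn.",
    "PHYSETER macrocephalus",
    "Physeter Linnaeus, 1758",
    "Physeter catodon Linnaeus 1758",
]

groups = defaultdict(list)
for v in variants:
    groups[name_key(v)].append(v)

for key, members in groups.items():
    print(key, "->", len(members), "variant(s)")
```

Five strings reduce to three distinct names. This addresses point 1 only: the key cannot tell that two different names are synonyms for the same taxon (point 2), which needs expert-supplied synonymy.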

Equivalent Cat. of Life search result for “Physeter”:

Selected other names/taxon databases of note

PaleoBioDB (fossils) – formerly PaleoDB
• 2005: 60k names, all ranks
• 2015: 320k names, all ranks (incl. synonyms)
– Good coverage of many fossil taxa (most groups)

World Register of Marine Species (WoRMS)
• 2007: first release, 75k valid species + ?? syns
• 2015: 230k valid species + 96k synonyms
– Excellent coverage of marine taxa (almost all groups), incl. some fossils

Interim Register of Marine and Nonmarine Genera (IRMNG) – Tony Rees / OBIS project
• 2006: first release, 159k genus names incl. synonyms (the latter partly known, part not)
• 2015: 488k genus names incl. synonyms (also 1.9m species names incl. synonyms)
– Comprehensive genus level coverage of all groups, extant + fossil, not all assigned to family as yet

For other groups (examples, NB completeness varies)

(etc., etc.)

Linking names to the literature

Linking names to the literature – why do we care?

Initial publication / description / designated type is “anchor” for every taxonomic name and concept

Use to verify “indexing” details (taxon name + author, year) are correctly represented

Included text details (title < abstract < full text) can be “mined” to extract information useful for indexing (or just reading)

Entry point to wider literature via refs. list, subsequent citations, etc.

E.g. from this… (taxon name + author, year)

to this (article full text)

Genus #1 in IRMNG: example “minimal” citation styles

Genus: Aa
Authority: Aa Baker, 1940
Microcitation: Aa Baker, 1940, Bull. Bishop Mus., 165, 107

Genus #1 in IRMNG: example “better” (=standard) citation style

Genus: Aa
Authority: Aa Baker, 1940
Microcitation: Aa Baker, 1940, Bull. Bishop Mus., 165, 107
Full citation: Aa Baker, 1940. Baker, H.B., 1940. Zonitid snails from Pacific Islands. Part 2.-Hawaiian genera of Microcystinae. Bulletin Bishop Museum Honolulu, 165: 105-201.

ION has a subset of these (article title, citation only)

Genus #1 in IRMNG: example “best” citation style with online links

Genus: Aa
Authority: Aa Baker, 1940
Microcitation: Aa Baker, 1940, Bull. Bishop Mus., 165, 107
Full citation: Aa Baker, 1940. Baker, H.B., 1940. Zonitid snails from Pacific Islands. Part 2.-Hawaiian genera of Microcystinae. Bulletin Bishop Museum Honolulu, 165: 105-201.
Online link (abstract): Aa Baker, 1940. Baker, H.B., 1940. (etc.) http://... (or DOI)
Online link (full text): Aa Baker, 1940. Baker, H.B., 1940. (etc.) http://... (or DOI)

ION has a subset of these (article title, citation only)

BioNames (R. Page project) has some of these
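
One way to hold these citation levels in a database is a single record with optional fields, so that a name can start "minimal" and be upgraded as more detail is found. The sketch below is an assumption about how such a record might look; the class and field names are invented, and the online links are left empty because no real URL is given in the slides.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class NameCitation:
    genus: str
    authority: Optional[str] = None        # author + year
    microcitation: Optional[str] = None    # abbreviated journal, volume, page
    full_citation: Optional[str] = None    # full bibliographic reference
    abstract_url: Optional[str] = None     # online link (abstract)
    full_text_url: Optional[str] = None    # online link (full text)

    def level(self) -> str:
        """Report which of the three citation styles this record reaches."""
        if self.abstract_url or self.full_text_url:
            return "best"
        if self.full_citation:
            return "better"
        if self.authority or self.microcitation:
            return "minimal"
        return "name only"


minimal = NameCitation(
    genus="Aa",
    authority="Baker, 1940",
    microcitation="Bull. Bishop Mus., 165, 107",
)
better = replace(
    minimal,
    full_citation=("Baker, H.B., 1940. Zonitid snails from Pacific Islands. "
                   "Part 2.-Hawaiian genera of Microcystinae. "
                   "Bulletin Bishop Museum Honolulu, 165: 105-201."),
)
print(minimal.level(), "->", better.level())
```

Counting records by `level()` would give a simple measure of how far a names database has got with linking to the literature.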

Online access to scientific literature – 1

Q.: How many articles in the “scientific literature”?

A.: Guesstimate might be 180m total “scholarly articles”, 120m in all sciences, 20m in biology over past 250 years

• Google Scholar: ~160m citations (all disciplines)
• Web of Science: 90m items indexed (1900 onwards)
• PubMed: 24m records (mostly 1966 onwards)
• Biological Abstracts: 12m records, 1926 onwards (includes some non-journal material)
• Ideally would like single master list, unique ID/hyperlink for each work (article/chapter/book etc.)
• DOI (Digital Object Identifier) system / CrossRef introduced in 2000, good for newly published work; currently used for 114m “objects” (incl. some retrospective allocation; NB not all are scientific literature)
• “Publishers use CrossRef's tools to convert citations from dumb strings to useful links” (quote from R. Page discussion post)

Online access to scientific literature – 2

Zoological Record has indexed 3.5m works in zoology 1864-current (increasing at 70k/year, 1.5k/week), but individual records are behind paywall

Online access to scientific literature – 3

Biodiversity Heritage Library (BHL) is scanning older literature (esp. pre-1923) and placing online

• limited subset indexed by article title, otherwise (all) indexed by journal and page no. (then has BHL page ID – can link to that)
• search can be initiated by journal title, volume + page (if already known)
• can also search by taxon scientific name – but some instances will be missed (BHL OCR [optical character recognition] is less than 100% reliable)
• this author’s experience looking for initial publication instances of older names – success in around 1/3 of cases (not too bad), however requires manual search (time consuming)
• ideally, original description page links should be compiled somewhere for others to re-use (not currently done on any scale)
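
The OCR problem can be illustrated with a small fuzzy search: an exact string match misses a name with one misread letter, while an approximate match still finds it. This is a sketch only. The OCR sentence is made up, the 0.85 threshold is an arbitrary choice, and nothing here reflects how BHL itself indexes names.

```python
from difflib import SequenceMatcher


def find_name_in_ocr(name, ocr_text, threshold=0.85):
    """Return (matched_text, similarity) for word windows that resemble `name`."""
    words = ocr_text.split()
    n = len(name.split())               # compare windows with the same word count
    hits = []
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n]).strip(".,;:()")
        score = SequenceMatcher(None, name.lower(), window.lower()).ratio()
        if score >= threshold:
            hits.append((window, round(score, 2)))
    return hits


# Invented OCR output in which the "i" of Megablattina was misread as "l"
ocr_page = "The genus Megablattlna is here proposed for a large cockroach."

print("Megablattina" in ocr_page)                   # exact search misses it
hits = find_name_in_ocr("Megablattina", ocr_page)   # fuzzy search finds it
print(hits)
```

Lowering the threshold finds more damaged names but also more false matches, so hits from a search like this would still need a human check.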

BHL sample page: American Journal of Science s4 v15 (1903) p. 312 (original description of Megablattina Sellards, 1903, a cockroach)


Online access to scientific literature – 4

More recent literature – mix of publisher websites and operations like JSTOR, often behind paywalls (though abstracts typically not so) – but not all yet available digitally (BHL also has some post-1922 content)

Subscription/abstracting services (Zoological Record, Web of Science, etc.) have better coverage, but are often not open access for viewing or external linking purposes (although PubMed is)

Some tools constructed around planned all-encompassing “Bibliography of Life” project (from Europe, http://biblife.org/), but progress difficult to gauge as yet (claims 215k references held); another European project: GRIB (Global References Index to Biodiversity), however development appears to have stopped…


In summary: online [open] access available to subsets of article titles > abstracts > full text in decreasing proportions

No single comprehensive source of online refs. available at this time, users must “mix and match” sources as available

Few direct links in current tax. databases to literature that is online (some noteworthy exceptions)

Over 95% of taxonomic literature pre-dates year 2000 starting point for DOIs

Most comprehensive indexes are currently commercial products (behind paywalls), not much traction in “community / open access” equivalents as yet.

Taxon traits

Machine-readable sets of taxon traits – why do we care?

Powerful tools for automated subsetting / filtering out sets of interest

Useful for data quality assurance (e.g. flag suspect data, fix logical inconsistencies)

Can form the basis of auto-response “expert systems” / keys e.g. as already available for specialised groups

Need for standardised vocabularies/ semantics for indexing terms, units used, etc.

Operations like OBIS (Ocean Biogeographic Information System) want to display only (e.g.) marine + extant taxa, suppress others

No “trait bank” systems existed at that time; IRMNG was created to fill this need: flag taxa as extant/fossil, marine/nonmarine

IRMNG data & flags subsequently incorporated into other systems e.g. WoRMS, ALA, OTOL, EOL, more…
– IRMNG flags are ~70% complete at genus level, 95%+ for species

EOL (Encyclopedia of Life) is establishing “TraitBank” (2014 on) to capture similar traits + more
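
A minimal sketch of the "display only marine + extant taxa" filter, assuming each taxon carries two flags that may be true, false or not yet assessed. The class, the flag names and the genus "Exemplum" are invented for illustration and are not IRMNG's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaxonFlags:
    name: str
    is_marine: Optional[bool]   # None = flag not yet assessed
    is_extant: Optional[bool]


def marine_extant(taxa, include_unknown=False):
    """Names of taxa that are marine and extant.

    Taxa with a flag explicitly False are always dropped; taxa with an
    unassessed flag are kept only if include_unknown is True.
    """
    keep = []
    for t in taxa:
        flags = (t.is_marine, t.is_extant)
        if False in flags:
            continue
        if None in flags and not include_unknown:
            continue
        keep.append(t.name)
    return keep


taxa = [
    TaxonFlags("Physeter", is_marine=True, is_extant=True),    # sperm whale
    TaxonFlags("Otala", is_marine=False, is_extant=True),      # land snail
    TaxonFlags("Exemplum", is_marine=True, is_extant=None),    # hypothetical, part-assessed
]

print(marine_extant(taxa))
print(marine_extant(taxa, include_unknown=True))
```

Because the flags are incomplete (~70% at genus level), how unassessed taxa are treated is a real design choice: hiding them loses valid records, showing them risks displaying fossils or nonmarine taxa.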

EOL TraitBank most populous content (Oct 2015)

Note, EOL is an aggregator, not an original content generator (relies on content supplied by third parties)


EOL traits recently added to Google search, Nov 2015

Room for further development in this area…

e.g. TDWG (Taxonomic Databases Working Group) had active interest in development of “SPM” (Species Profile Model) around 2007-8, seems a bit quiet since

character matrices stored in computer-based keys e.g. Lucid, DELTA, etc. could presumably be leveraged in some cases

some domains already well covered in standard manner (e.g. FishBase for 33k fishes, SeaLifeBase for 71k non-fish marine taxa)

SeaLifeBase example shown in next slide…

Displayed range data is from AquaMaps holdings, see later in this presentation

Assembling georeferenced species data

Assembling georeferenced species data – why do we care?

• “Where” is as important as “what” in biodiversity studies
• Central repository much easier point of access than thousands/millions of distributed sources
• See gaps in existing data holdings / state of current data sampling, digitisation and mobilisation
• Overlay spatial distributions with other layers e.g. country boundaries, habitats, environmental variables – generate regional lists, understand controlling factors
• Spot bad data (appearing in unlikely places on the map)
• Use for spatial analysis (geography as computable data).

Distributed data networks

• First data networks in USA, late 1990s – VertNET, HerpNET, ORNIS – connecting museum data (vertebrate specimen records) in participating agencies (also in Australia: Australian Virtual Herbarium)
• OBIS (2002 on) and GBIF (2004 on) provide gateways to both specimen and observation data from multiple agencies worldwide
• OBIS (marine species records only):
  2005: 5.6m records from 38 data sources (40,700 species)
  2015: 44.9m records from 1,916 data sources (147,000 species)
• GBIF (all habitats):
  2005: 45m records from 334 data sources (?? species)
  2015: 577m records from 15,196 data sources (?? species)
• OBIS data flows into GBIF (though with some issues), also into local networks e.g. ALA (Atlas of Living Australia)

Building OBIS – 2002-5 (trying to make a working system, and provide a good user experience)

“OBIS v2” front page / spatial search interface, 2005

Current OBIS sample map & data

OBIS records for Physeter macrocephalus (sperm whale) in Australian region, Oct 2015 (51,756 global records)

GBIF sample map & data

GBIF records for Physeter macrocephalus in Australian region, Oct 2015 (34,436 global records)

ALA (Atlas of Living Australia) presentation of records for Physeter macrocephalus in Australian region, Oct 2015


How complete are holdings of GBIF, OBIS, etc.?

From Hill et al., 2012 paper: at least 1 bn – 2 bn specimens in biological collections worldwide (not all currently digitised)

Observations probably outnumber specimens by 100x - 1000x

Gives maybe 500 bn potential records +/- ; GBIF has 0.5 bn to date (0.1%)…

Not all records are of equal importance for initial studies of distributions (much redundancy), maybe OBIS/GBIF have <5% of most useful records at this time…

Existing holdings presently heavily skewed towards better sampled/accessible areas, also regions where digitisation is more advanced

True “target numbers” difficult to assess (every individual of every species, or what?)
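
The completeness estimate above is simple arithmetic, reproduced here as a sketch using only the figures quoted on this slide.

```python
# Specimens in collections worldwide (Hill et al., 2012): at least 1-2 bn
specimens_low, specimens_high = 1e9, 2e9

# Observations probably outnumber specimens by 100x-1000x
obs_factor_low, obs_factor_high = 100, 1000

potential_low = specimens_low * obs_factor_low       # 100 bn
potential_high = specimens_high * obs_factor_high    # 2,000 bn

# Working figure used in the talk, and GBIF holdings to date
working_estimate = 500e9
gbif_records = 0.5e9

share = gbif_records / working_estimate

print(f"potential records: {potential_low / 1e9:.0f} bn to {potential_high / 1e9:.0f} bn")
print(f"GBIF share of working estimate: {share:.1%}")
```

The 500 bn working figure sits inside a range spanning more than an order of magnitude, so the 0.1% share is indicative only.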

Predicted distributions (environmental niche modelling)

From AquaMaps poster for e-biosphere conference, London, 2009

Predicted distributions (environmental niche modelling) – why do we care?

Available georeferenced data are always incomplete, need a mechanism to intelligently fill in data gaps, produce more complete biodiversity maps & atlases

Move from hand drawn maps / non-digital “expert knowledge” to computable data

Model potential spread of invasives into new areas (show suitable habitat)

Model potential changes in species range in response to changing climate or other factors

Facilitate better understanding of broad- (and fine-) scale factors controlling species distributions.

Niche modelling concept

Range of methodologies available including MAXENT, GARP, simple niche models e.g. Relative Environmental Suitability (RES)

Ready et al., 2010 (incl. Tony Rees) contend that simple methods work as well as more complex ones:

Source: A. Guisan group web page, Université de Lausanne, Switzerland
http://www.unil.ch/idyst/en/home/menuinst/research-poles/geoinformatics-and-spatial-m/predictive-biogeography/advancing-the-science-of-eco.html

J. Ready et al., Ecological Modelling, 2010

Global niche modelling/mapping projects

Lifemapper: Kansas University, c. 2003 onwards
• Models terrestrial niches (?only)
• No. of maps unclear (claims >100,000 species with data, perhaps only a subset with maps)
• Uses GARP modelling (computationally intensive, several hours per species map?), no expert review

AquaMaps: Kiel Marine Lab (+ co-developers), 2006 onwards
• Models marine niches only (plus some freshwater)
• 22,000 species mapped by Nov 2015 (incl. ~600 FW), mainly fishes
• Uses RES modelling (6*/8** environmental variables, <2 mins per species map) plus geographic partitioning and expert review

* Marine variables: bottom depth, water temperature (SST/bottom), salinity, primary production, sea ice concentration, distance to land

** FW variables: elevation, surface temperature, net primary productivity, soil pH, soil moisture, soil organic carbon, precipitation, compound topographic index
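
To give a feel for why a simple envelope model runs in minutes rather than hours, here is a sketch of one common form of relative environmental suitability: a trapezoidal response per variable, multiplied across variables. The slides name only "RES", so the trapezoid shape and the multiplication are assumptions about the general approach, and every number in the envelope is invented.

```python
def trapezoid(value, minimum, pref_min, pref_max, maximum):
    """Suitability in [0, 1]: 1 inside the preferred range,
    falling linearly to 0 at the absolute limits."""
    if value < minimum or value > maximum:
        return 0.0
    if value < pref_min:
        return (value - minimum) / (pref_min - minimum)
    if value > pref_max:
        return (maximum - value) / (maximum - pref_max)
    return 1.0


def suitability(cell, envelope):
    """Multiply the per-variable scores; any unsuitable variable gives 0."""
    score = 1.0
    for variable, limits in envelope.items():
        score *= trapezoid(cell[variable], *limits)
    return score


# Hypothetical species envelope: (min, preferred min, preferred max, max)
envelope = {
    "depth_m": (0, 10, 200, 1000),
    "sst_c": (5, 10, 20, 25),
    "salinity": (30, 33, 36, 38),
}

# Hypothetical grid cells with their environmental values
cool_shelf = {"depth_m": 100, "sst_c": 7.5, "salinity": 34}
deep_ocean = {"depth_m": 1500, "sst_c": 15, "salinity": 35}

print(suitability(cool_shelf, envelope))
print(suitability(deep_ocean, envelope))
```

Each cell needs only a handful of comparisons and multiplications, which is consistent with the per-species run times quoted above. Geographic partitioning and expert review are separate steps not shown here.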

Lifemapper team at KU (Kansas University, USA)

Lifemapper example map for milk or Spanish snail (Helix lactea, now = Otala lactea) (yellow dots are data points, red is potential habitat)

Building AquaMaps – 2005 (trying to make a working system, and the models fit the data…)

AquaMaps example maps for New Zealand sea lion (Phocarctos hookeri) (without expert review):
• Data points used
• Computed AquaMap
• All suitable habitat
• All suitable habitat (detail; square size = 50 km nominal for global coverage)
• Current (2010) vs. computed year 2100 range

From hand drawn map to computable data… (this example: Physeter macrocephalus)

Hand drawn map: from http://us.whales.org/species-guide/sperm-whale

AquaMaps map (including probability of occurrence by 0.5 degree square)
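
"Computable data" here means values attached to grid cells. As a sketch of the gridding step, the code below assigns points to 0.5 degree squares and counts them per square. The coordinates are made up, and this is a plain latitude/longitude grid rather than the c-squares notation.

```python
import math
from collections import Counter

CELL_SIZE = 0.5   # degrees


def cell_of(lat, lon, size=CELL_SIZE):
    """South-west corner (lat, lon) of the grid square containing the point."""
    return (math.floor(lat / size) * size, math.floor(lon / size) * size)


# Invented occurrence points (latitude, longitude)
points = [(-33.4, 151.6), (-33.3, 151.7), (-33.6, 151.4), (-42.9, 147.3)]

counts = Counter(cell_of(lat, lon) for lat, lon in points)
for cell, n in sorted(counts.items()):
    print(cell, n)
```

Once records and model outputs share the same grid, a predicted probability and the observed records can be compared square by square, which a hand drawn outline does not allow.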

Pooling all the range data

AquaMaps summary plot – all marine species overlaid (22,800 as at Nov 2015)


Click on the map to generate “what lives here” queries…

AquaMaps spatial query result, November 2015:


Any primary data to back this up? E.g. search in Atlas of Living Australia…

Yes, there ARE whale sharks on NSW central coast…

Niche modelling is a very current discussion topic: e.g. GBIF/AMNH symposium 4 November 2015 (no proceedings as yet)

So – “are we there yet?”

A report card to date…

Component | 2005 | 2015 | Status (/5)
Global taxon inventory – all species names (with synonyms) | 25%? | 60%+ | ■■■(■)
All names linked to the literature (original descriptions), at least minimally | 5%? | 10-20%? | ■(■)
Taxon traits databased, in machine-addressable form | 0 | 10%+? | ■(■)
Distribution data (specimens, observations) in online systems | <1% | 5%+? | ■
Predicted distributions/global range maps for all taxa | 0? | 5%+? (fishes 60%+) | ■(■)

Take home message: progress is definitely being made, however plenty still to do:

• Complete master names lists, release as open data (also deal with inflow of new names and taxonomic dynamism)
• Improve online access to tax. literature (plus embedded links from relevant databases)
• More data into OBIS & GBIF (including datasets not yet digitised)
• More progress on predictive mapping (algorithms, base data, habitat factors, species covered).


Thank you!

Tony Rees
Tony.Rees@marinespecies.org

◦ CSIRO Marine Research applications developer 1998-2014 including CAAB (Codes for Australian Aquatic Biota), c-squares and Taxamatch
◦ OBIS steering committees (various) / system developer 2002-2005
◦ AquaMaps project co-developer 2004-current
◦ IRMNG developer 2006-current
◦ OBIS Australia Node manager 2006-2014
◦ Global Names Project collaborator 2006-current
◦ WoRMS contributor 2007-current
◦ GBIF & Open Tree of Life collaborator 2010-current
◦ iPlant collaborator 2010-2013
◦ Atlas of Living Australia consultant 2010-2012
◦ Catalogue of Life global team member 2010-2012
◦ GBIF Ebbe Nielsen Prize (for excellence in Biodiversity Informatics) winner 2014.

This talk available at: www.slideshare.net/tony1212/presentations