Date posted: 10-May-2015
Category: Education
Uploaded by: roderic-page
Biodiversity informatics: why aren’t we there yet?
@rdmpage
http://iphylo.blogspot.com
I’ve often said I want a Google for biodiversity data…
…turns out what I should have asked for was an NSA for biodiversity
• There are known knowns, things we know that we know
• There are known unknowns, things we now know we don’t know
• But there are also unknown unknowns, things we do not know we don't know
[Diagram: 2×2 grid of known/unknown against knowns/unknowns]
What do these diagrams tell us?
Implications
• Sequencing is cheap
• The flood of sequences is only going to increase
• How much of this is relevant to biodiversity?
Numbers of new animal names
[Chart markers: 1923; WWI; WWII]
Implications
• Rate of new taxa being described is relatively constant
• Suggests taxonomists are working at capacity
• Most taxonomic work is in the past
• Compare this to the exponential growth of sequencing
Mammals in GenBank
[Chart labels: proper Linnaean names; “Aus sp.”]
Mammals
[Chart labels: proper Linnaean names; “Aus sp.”]
“Invertebrates”
[Chart label: BOLD]
Dark taxa
• Disconnect between taxonomy and genomics
• How much of this comprises taxa we already know about versus new diversity?
• Do we need taxonomic names?
100,000 articles from http://biostor.org (BHL)
[Timeline markers: 1923; today]
Scanned legacy
• BHL is more than pre-1923 literature
• The real gap is post-1923 to pre-open access (2003)
• Most of the 20th century taxonomic literature is “dark”
Size of Wikipedia articles on mammals
Few, large articles
Many, small articles “long tail”
Power law
• We know a lot about a few species
• For most species we know very little (even in well-known groups)
• For poorly known species we need to go to the legacy literature
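One way to check the "long tail" claim on a set of article sizes is to rank them and look at the slope on a log-log scale. The sketch below is only an illustration: the `article_sizes` list is synthetic (sizes proportional to 1/rank), not the Wikipedia mammal data from the slide, and the function names are made up for this example.

```python
import math


def rank_size_slope(sizes):
    """Least-squares slope of log(size) against log(rank), largest first."""
    ordered = sorted((s for s in sizes if s > 0), reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two positive sizes")
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(size) for size in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def share_held_by_top(sizes, fraction=0.1):
    """Fraction of the total size held by the largest `fraction` of items."""
    ordered = sorted(sizes, reverse=True)
    top_n = max(1, int(len(ordered) * fraction))
    return sum(ordered[:top_n]) / sum(ordered)


# Synthetic, hypothetical data: 1000 "articles" whose size falls off as 1/rank.
article_sizes = [100000 / rank for rank in range(1, 1001)]

slope = rank_size_slope(article_sizes)
top_share = share_held_by_top(article_sizes, fraction=0.1)
print(f"log-log slope: {slope:.2f}")
print(f"share of all text in the largest 10% of articles: {top_share:.0%}")
```

A roughly straight line on the log-log plot (a near-constant slope) is the usual informal signature of a power law; a few large articles holding most of the text is the same pattern seen from the other side.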
PanTHERIA (2009)
[Timeline markers: 1923; 2003]
Legacy literature
• Legacy literature matters (even for well-studied taxa)
• Much of this will be in the digitally “dark” period
Publishers of taxonomy (# articles)
http://bionames.org
Publishers
• BioStor (BHL) is the single largest source of taxonomic literature
• Lots of tiny publishers (long tail)
• Commercial publishers are important (Magnolia Press, Springer, Informa, Wiley, Elsevier, BioOne)
• Who do we talk to about data mining?
Taxonomic journals (articles/decade)
Implications
• Zootaxa is indeed a “mega journal”
• If we had to pick one journal to data mine, it would be Zootaxa
GBIF
• The Global Biodiversity Information Facility is not evenly “global”
• Tells us as much about sampling as about the distribution of diversity
Flickr EOL group
Crowd sourcing
• Where is the “crowd”?
• It’s where the iPhones are…
GenBank animal sequences
GenBank host records
Implications
• GenBank is about more than genes
• GenBank has a wealth of information on location and ecological interactions
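To make the point concrete, here is a minimal sketch of pulling host and locality out of the `source` feature of a GenBank flat-file record. The `/host`, `/country` and `/lat_lon` qualifiers are part of the GenBank feature table; everything else is an assumption for illustration: the record excerpt is invented (placeholder names, not a real accession), and the helper names are hypothetical. A real pipeline would more likely use an existing parser than regular expressions.

```python
import re

# Hypothetical excerpt of a GenBank "source" feature; the organism, host and
# locality values are placeholders, not taken from a real record.
SAMPLE_SOURCE_FEATURE = '''
     source          1..658
                     /organism="Aus bus"
                     /host="Cus dus"
                     /country="Brazil: Amazonas"
                     /lat_lon="3.10 S 60.02 W"
'''

# Matches qualifiers of the form /name="value".
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def parse_qualifiers(feature_text):
    """Return a dict of qualifier name -> value for one feature block."""
    return {name: value for name, value in QUALIFIER.findall(feature_text)}


def parse_lat_lon(text):
    """Convert a 'D.DD N|S D.DD E|W' string to signed decimal degrees."""
    match = re.fullmatch(r'([\d.]+) ([NS]) ([\d.]+) ([EW])', text.strip())
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) * (1 if ns == 'N' else -1)
    longitude = float(lon) * (1 if ew == 'E' else -1)
    return latitude, longitude


qualifiers = parse_qualifiers(SAMPLE_SOURCE_FEATURE)
coordinates = parse_lat_lon(qualifiers.get("lat_lon", ""))
# An ecological interaction expressed as a simple (subject, relation, object) triple.
interaction = (qualifiers.get("organism"), "has host", qualifiers.get("host"))

print(interaction)
print(qualifiers.get("country"), coordinates)
```

Each record handled this way yields a point on a map and, where a host is given, one edge in an interaction network, which is the sense in which GenBank is "about more than genes".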
Implications
• Phylogenetic data is not being archived (why not?)
• Makes it hard to reproduce studies
• Does data matter?
• What level of granularity should be citable?
What do these diagrams tell us?