Donat Agosti, Plazi, http://plazi.org
Systematics Association, Oxford, 28 August 2015
Nothing in taxonomy makes sense except in the light of Open Access
I want to be able at anytime, anywhere to access, mine and analyse a
significant body of published and digitized taxonomic knowledge.
I want to build by machine the catalogue of life.
I hope taxonomic communication arrives in the 21st century.
Vision and hope
1. The demand
Before antbase.org, Harvard's Museum of Comparative Zoology could claim to be the only location with a complete set of ant systematics publications from 1758 to the present.
Through antbase.org's digital library, access to this body of literature is worldwide, and it is actively used (>10,000 visits in a single month).
2004
2. The corpus of taxonomic literature
Build and establish a TreatmentBank, such as Plazi, as a basis for content mining of, and linking to, the taxonomic literature
3. The core corpus of taxonomic knowledge: Treatments
4. Make use of the semantic linked WWW
Avoid all the wasteful current publishing!
• Publish structured data
• Publish open access
• Make taxonomic literature first class literature by minting DOIs and making digital copies accessible
• Add links to names, treatments, articles, DNA sequences, digital objects
• Help by building your own public corpus of citable data
Pensoft journals (e.g. Biodiversity Data Journal, Zookeys, Phytokeys) are the gold standard.
Surfing or the seduction of science (for a young kid)
Surfing or the seduction of science (for an adult)
Get a copy of the Cyclothone paper
Surfing or the imperative for science
Linking treatments and data with external resources
NCBI
Establish Plazi as, or use Plazi to build, a TreatmentBank as a source for content mining of the taxonomic literature
TreatmentBank
What are the species in Amazonia?
TreatmentBank
Countries (Region): Australia (Queensland)
Export species materials citations (DwC)
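A materials citation exported as Darwin Core (DwC) can be pictured as one row of a flat table. The sketch below is illustrative only and is not TreatmentBank's export code: the record is hypothetical (the species name is a placeholder), and only the three Darwin Core terms scientificName, country and stateProvince are used.

```python
import csv
import io

# A small subset of Darwin Core terms, used here as column headers.
DWC_FIELDS = ["scientificName", "country", "stateProvince"]

def to_dwc_csv(records):
    """Serialize material-citation dicts to CSV text with DwC column names."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=DWC_FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        # Missing terms become empty cells instead of raising an error.
        writer.writerow({field: rec.get(field, "") for field in DWC_FIELDS})
    return buf.getvalue()

# Hypothetical record; the region values follow the example on the slide.
example = [{
    "scientificName": "Genus species",
    "country": "Australia",
    "stateProvince": "Queensland",
}]
print(to_dwc_csv(example))
```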
Text mining tools: Visualization of treatment content
Summary of the content of 37 Zootaxa spider publications and 8 from Biodiversity Data Journal (Miller et al., 2015).
Pseudomyrmex ants and Vachellia ant-acacias are a classic example of mutualism in biology.
allenii
melanoceras
ruddiae
chiapensis
collinsii
cookii
cornigera
globulifera
hindsii
janzenii
mayana
sphaerocephala
boopis
flavicornis
hesperius
ita
janzeni
kuenckeli
mixtecus
nigrocinctus
nigropilosus
opaciceps
particeps
peperi
reconditus
satanicus
simulans
spinicola
subtilissimus
veneficus
ferrugineus
gentlei
gracilis
Transbiotic link network: associated species linked through references in taxonomic treatments
Acacia-ant species: Pseudomyrmex gracilis
Treatment: redescription
Associated ant-acacia: Acacia gentlei
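Such a link network can be thought of as a bipartite graph: ant species on one side, plant species on the other, with an edge wherever a treatment mentions the association. The following is a minimal sketch, assuming the associations have already been extracted as (ant, plant, treatment type) tuples. The tuple layout and the function name are hypothetical; the single record is the one shown on the slide.

```python
from collections import defaultdict

def build_network(associations):
    """Index ant-plant associations in both directions."""
    ants_to_plants = defaultdict(set)
    plants_to_ants = defaultdict(set)
    for ant, plant, _treatment_type in associations:
        ants_to_plants[ant].add(plant)
        plants_to_ants[plant].add(ant)
    return ants_to_plants, plants_to_ants

# One association taken from the slide; real input would come from treatments.
links = [("Pseudomyrmex gracilis", "Acacia gentlei", "redescription")]
ants, plants = build_network(links)

# Which ants are linked to a given ant-acacia?
print(sorted(plants["Acacia gentlei"]))
```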
Ants Plants
Photo credits: Alex Wild
Treatment
Treatments linked through citations
Text mining tools: Visualization of treatment content
What does this mean?
The Linking Open Data cloud diagram
Linked Open Data Cloud
The demand: scientists and citizen scientists
Online catalogue, open access, online library, 2004
Online catalogue
The interest of big science
2004
2005
The scientific challenge: Bridging the gap
  1 tnntttccca cgaataaata atataagatt ttgattatta cctccttctt taattttatt
 61 attatcaaga agattagttt ataaaggagt aggaacagga tgaactgttt atcctccttt
121 atctaataat ttatatcata atggattttc aactgattta gcaatttttt ctttacatat
181 tgcaggaata tcatcaatta taggagcaat taattttatt tcaacaattt taaatataca
241 tcataaaaat ttatcattag ataaaattcc attgttagtt tgatcaattt taattacagc
301 tattttatta ttattatctt tacctgtatt agcaggtgca attactatat tattaactga
361 tcgaaatcta aatacaactt tttttgatcc ttcgggtgga ggagatccaa ttttatatca
421 acatttattt
Where do we stand?
The bristlemouths are a rapacious family of deep-sea fishes that include the wildly successful genus Cyclothone
In contrast, ichthyologists put the likely figure for bristlemouths at hundreds of trillions — and perhaps quadrillions, or thousands of trillions.
Taxonomy? Source?
Issue: USD 266.00; article: USD 48.00
Get a copy of the Cyclothone paper
Our contribution for a better understanding of biodiversity
Access to ant taxonomic publications through antbase.org / Smithsonian Institution, currently including the entire body of non-copyrighted publications since 1758 (>4,000 publications or 85,000 pages). Source: Agosti 2005.
Access
• Limited access (copyright)
• Limited discoverability of content
• Research results cannot be cited
• Data mining does not work
Issues of access
Provide an open access, linked corpus of taxonomic literature
A solution
Surfing at breakfast table
article
treatment
Cites (httpURI)
cites (DOI)
Scientific name
https://www.wikidata.org/wiki/Property:P1992
Feed Wikipedia with taxonomic data
Surfing or the imperative for science
LOD
PDF
HNS
HNS
Surfing or the imperative for science: Use of name services
The goal
Create a citable open corpus of taxonomic publications
Biodiversity Literature Repository: Record
Treatment
Illustration
http://plazi.org/wiki/Blue_List
Patterson et al., 2014: http://dx.doi.org/10.1186/1756-0500-7-79
Legal issues
Workflow
Plazi SRS
find → scan → «OCR» → markup → store + access
Text
<tax:treatment>
  <tax:nomenclature>
    <tax:name>
      <tax:xid source="HNS" identifier="193329"/>
      <tax:xmldata>
        <dc:Genus>Mystrium</dc:Genus>
        <dc:Species>leonie</dc:Species>
      </tax:xmldata>
      Mystrium leonie
    </tax:name>
    <tax:status>n. sp.</tax:status>
    Fig 1 D - F
  </tax:nomenclature>
  <tax:div type="description">
    <tax:p>HOLOTYPE WORKER: TL 3.95, HL 1.02, HW 0.95, CI 93, SL
      1.30, SI 137, PW 0.73, ML 0.38. Mandible outer margin
      to a sharp apical tooth, the apex parallel to the anterior
      (Holotype with material in mandibles, so mandibles and
      $ described below from paratypes.) Median clypeus
      ....
    </tax:p>
  </tax:div>
</tax:treatment>
Semantically enhanced text (TaxonX)
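Once a treatment is marked up this way, a machine can read the name and its identifier directly. The sketch below uses Python's standard xml.etree.ElementTree on a shortened copy of the excerpt. The namespace URIs are placeholders invented for this example (the slide shows only the tax: and dc: prefixes), and read_name is a hypothetical helper, not part of any Plazi tool.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs: assumptions for this sketch, not the official ones.
NS = {"tax": "http://example.org/taxonx", "dc": "http://example.org/dc"}

SAMPLE = """<tax:treatment xmlns:tax="http://example.org/taxonx"
               xmlns:dc="http://example.org/dc">
  <tax:nomenclature>
    <tax:name>
      <tax:xid source="HNS" identifier="193329"/>
      <tax:xmldata>
        <dc:Genus>Mystrium</dc:Genus>
        <dc:Species>leonie</dc:Species>
      </tax:xmldata>
      Mystrium leonie
    </tax:name>
    <tax:status>n. sp.</tax:status>
  </tax:nomenclature>
</tax:treatment>"""

def read_name(xml_text):
    """Return genus, species, status and external identifier of a treatment."""
    root = ET.fromstring(xml_text)
    name = root.find("tax:nomenclature/tax:name", NS)
    xid = name.find("tax:xid", NS)
    return {
        "genus": name.findtext("tax:xmldata/dc:Genus", namespaces=NS),
        "species": name.findtext("tax:xmldata/dc:Species", namespaces=NS),
        "status": root.findtext("tax:nomenclature/tax:status", namespaces=NS),
        "xid_source": xid.get("source"),
        "xid": xid.get("identifier"),
    }

print(read_name(SAMPLE))
```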
… alternatives: From human to machine readable text
RDF
Plazi tools: table extraction
«Treatment», scientific species name, distribution record, bibliographic records
Cataglyphis tartessica workers
Variable          mean ± SD
Head length       11.23 ± 0.12
Head width        11.15 ± 0.12
Scape length      11.47 ± 0.12
Mesosoma length   11.94 ± 0.16
Femur length      12.03 ± 0.14
Cephalic index    0 93.60 ± 3.940
Scape index       128.10 ± 7.660
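Rows of the form "variable, mean ± SD" lend themselves to simple pattern matching once they have been separated. The following is a rough sketch of that idea using a regular expression. It is not how the Plazi tools work, parse_rows is a hypothetical name, and the two sample rows are copied as they appear in the extracted text.

```python
import re

# variable name, then a number, then "±", then a number
ROW = re.compile(
    r"^(?P<variable>.+?)\s+(?P<mean>\d+(?:\.\d+)?)\s*±\s*(?P<sd>\d+(?:\.\d+)?)$"
)

def parse_rows(lines):
    """Turn 'name mean ± sd' lines into (name, mean, sd) tuples."""
    rows = []
    for line in lines:
        match = ROW.match(line.strip())
        if match:  # header and caption lines do not match and are skipped
            rows.append(
                (match["variable"], float(match["mean"]), float(match["sd"]))
            )
    return rows

table = [
    "Cataglyphis tartessica workers",
    "Variable mean ± SD",
    "Head length 11.23 ± 0.12",
    "Head width 11.15 ± 0.12",
]
print(parse_rows(table))
```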
Plazi tools: discovering of scientific names
Plazi tools: discovering and parsing of bibliographic references
Plazi tools: discovering and parsing of observation data
Plazi tools: discovering of treatments
Treatment: a well-defined part of an article that defines the particular usage of a scientific name by an authority at a given time (one or more pages in a publication).
Treatment
The special case of taxonomic literature: the cited elements are treatments, not articles
Formica obsoleta Linnaeus, 1758: 580
Treatment
Original combinations
Reference to an original combination
Subsequent usages of names cite the referenced treatment
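A treatment citation such as "Formica obsoleta Linnaeus, 1758: 580" packs name, authority, year and page into one string. The sketch below shows one way to split it with a regular expression. It assumes the simple "Genus species Author, year: page" layout of this one example; real citations vary far more, and parse_citation is a hypothetical helper.

```python
import re

CITATION = re.compile(
    r"^(?P<genus>[A-Z][a-z]+) (?P<species>[a-z]+) "
    r"(?P<authority>.+?), (?P<year>\d{4}): (?P<page>\d+)$"
)

def parse_citation(text):
    """Split a simple treatment citation into its parts, or return None."""
    match = CITATION.match(text.strip())
    if match is None:
        return None
    parts = match.groupdict()
    parts["year"] = int(parts["year"])
    parts["page"] = int(parts["page"])
    return parts

print(parse_citation("Formica obsoleta Linnaeus, 1758: 580"))
```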
What is a treatment?
Treatment and treatment reference and citation
Treatment citation
Treatment references
Treatment
Citing treatments, or linking treatments to treatments
By minting persistent httpURIs for treatments, treatments can be cited like a bibliographic reference
http://treatment.plazi.org/id/A9FFD1FC-4629-FFB4-968F-AD38386521BA
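Because the httpURI is just a fixed prefix plus an identifier, building and checking such links is easy to automate. A small sketch follows, assuming every treatment identifier has the same 8-4-4-4-12 hexadecimal shape as the one example above. That shape is an assumption drawn from this single URI, and make_uri and extract_id are hypothetical helper names.

```python
import re

BASE = "http://treatment.plazi.org/id/"

# 8-4-4-4-12 hex groups, matching the identifier shown on the slide.
ID_RE = re.compile(r"^[0-9A-F]{8}(?:-[0-9A-F]{4}){3}-[0-9A-F]{12}$")

def make_uri(identifier):
    """Build a treatment httpURI from an identifier."""
    ident = identifier.upper()
    if not ID_RE.match(ident):
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    return BASE + ident

def extract_id(uri):
    """Recover the identifier from a treatment httpURI."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a treatment URI: {uri!r}")
    ident = uri[len(BASE):]
    if not ID_RE.match(ident):
        raise ValueError(f"unexpected identifier format: {ident!r}")
    return ident

uri = make_uri("A9FFD1FC-4629-FFB4-968F-AD38386521BA")
print(uri)
print(extract_id(uri))
```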
Status quo
• 50,000+ treatments live, growing daily
• RDF in beta version
• GoldenGate Imagine (PDF and text mining tool) in beta version
• Data provider for NCBI, Wikidata, GBIF, EOL, antweb
• Biodiversity Literature Repository functional
Next steps
• Collaborate with ContentMine to extract >50 treatments/day
Next steps
Planned collaboration with ContentMine to extract treatments on a daily basis
http://www.slideshare.net/petermurrayrust/?
BioDiv
Next steps
• Collaborate with ContentMine to extract 50 treatments/day
• 1 million treatments live
• RDF version accessible
• GoldenGate Imagine (text mining tool)
• Data provider for NCBI, GBIF, EOL, antweb
• Biodiversity Literature Repository with 100,000 bibliographic references and digital copies (PDF, images, etc.)
BUT
Avoid all this waste (our next generation will have to clean up)!
• Publish structured data
• Publish open access
• Publish in journals with DOI
• Add links to names, treatments, articles, DNA sequences, digital objects
• Help build your own corpus of citable data
Pensoft journals (e.g. Biodiversity Data Journal, Zookeys, Phytokeys) are the gold standard.
Thanks!
Donat Agosti
Acknowledgment: Pensoft, Zenodo/CERN, NCBI, Wikidata, ContentMine