ViBRANT: Virtual Biodiversity
SEVENTH FRAMEWORK PROGRAMME e-infrastructure
Community web sites: small pieces loosely joined
Dave Roberts, David King, Simon Rycroft, David Morse, Lyubomir Penev, Donat Agosti & Vince Smith
Small pieces loosely joined
Has many potential meanings:
Joining contributors together to form communities
Joining the data together that go towards forming a Scratchpad
Joining Scratchpad content with the landscape of biodiversity informatics data on the web
Addressing the challenges of taxonomy
Goal:
Inventory the Earth’s species
Document their relationships
“Publish” & apply these data
Data set:
1.8M described spp. (10M names)
300M pages (over last 250 years)
1.5-3B specimens
People:
4-6,000 taxonomists
30-40,000 “pro-amateurs”
Many more citizen scientists?
I. The technology must largely embody the cause–effect relationship connecting problem to solution.
II. The effects of the technological fix must be assessable using relatively unambiguous or uncontroversial criteria.
III. Research and development is most likely to contribute decisively to solving a social problem when it focuses on improving a standardized technical core that already exists.
Sarewitz and Nelson (2008) Three rules for technological fixes. Nature, 456: 871-872
15 October 2010
Biodiversity - a kind of washing powder?
When 2010 was named as the "year of biodiversity" by the UN, it began with a plea to save the world's ecosystems.
UN Secretary-General Ban Ki-moon said: "Biological diversity underpins ecosystem functioning... its continued loss, therefore, has major implications for current and future human well-being."
Recently, members of the public were asked what biodiversity is. The most common answer was "some kind of washing powder".
http://www.bbc.co.uk/news/science-environment-11546289
Addressing the challenges of biodiversity informatics
“…the field [of biodiversity informatics] appears to be growing in a void of overarching, motivating questions, effectively making it a set of technologies in search of questions to address.”
Peterson et al, Syst. & Biodiv. 2010
Scratchpads
http://scratchpads.eu
Hosted websites for taxonomists
Research & publication platform
Modular (Drupal) & flexible
Supports the taxonomic workflow
Bottom-up design, agile dev.
Ecosystem of communities (185)
2,350+ users (unpaid) from 2007
ViBRANT follow on, €4.75M
Taxonomy & Literature
eBooks
Image Galleries
Societies & Organizations
eJournals
DNA, Phylogeny & Specimens
2.3k users, 58 countries, 268k pages
185 "Virtual Research Communities"
EDIT, GBIF, NHM, & EOL
Platform for biodiversity research & data publication
Changing the nature of collaboration
Expanding opportunities to participate in science
Magic
Your data Your web site
A website for you & your community
Taxonomy import, management and navigation
Reference manager / EndNote support for bibliographies
Image galleries, image upload & annotation
Nexus / Newick import for visualizing phylogenies
Molecular & morphological character matrices (discrete, morphometric and text characters)
Presence / absence country maps
Specimen & location records (DwC)
Web fora with e-mail integration
User blogs
Static web pages
Newsletters with e-mail integration
Import from CSV text file to any content type
ViBRANT Products
A Virtual Research Environment (Scratchpads) where users can safely store, share and manage their research information.
Analytical services for users to build identification keys and phylogenetic trees.
A publication platform for users to automatically compile taxonomic manuscripts from their research database.
A portal giving users central access to publicly available biodiversity research information and literature.
Training, support & sociological study, helping research communities to use these tools and services.
A standards-compliant technical architecture that can be sustained by the biodiversity research community.
The “chromosome” (diagram of ViBRANT activities and work packages)
Activities shown:
Scratchpads Virtual Research Environment
Phylogenetic analysis
Bioclimatic modelling & metrics
Identification tools
Matrix data editor
Biodiversity data publishing
Scholarly manuscript publishing
Distributed Scratchpad hosting
Software module integration
Sustainability plan
Communal biodiversity literature
Biodiversity literature markup
Biodiversity data mining
Citizen science programme
Field recording support
User sociology study
User feedback systems
Training & outreach programme
Biodiversity data standards
Data aggregation portal
GBIF integration activities
Biodiversity visualisation layers
Controlled vocabulary platform
Work packages shown:
Networking: WP3. Training; WP4. Standards; WP8. Mobilisation
Research: WP2. Architecture; WP7. Literature
Service: WP5. Data; WP6. Publishing
Biodiversity literature looks like this
Cues: indented text, UPPER CASE TEXT, bold text, italic text, Latin, keywords, symbols
Adobe Reader has this:
M BRITISH MUSEUM (NATURAL HiSi 26JU PRESENTED GENERAL UC.-lARYBulletin ofthe BritishMuseum (Natural History) The ichneumon-fly genus Banchus in the OldWorld(Hymenoptera) M. G. Fitton seriesEntomology Vol51 Nol 25 July 1985
Lura (BHL) has this:
MBRITISH MUSEUM(NATURAL HiSi26 JUPRESENTEDGENERAL UC.-lARYBulletin of theBritish Museum (Natural History)The ichneumon-fly genus Banchus(Hymenoptera) in the Old WorldM. G. FittonEntomology seriesVol51 Nol 25 July 1985
But choice of XML schema is important
ABBYY XML is very detailed
This line of text has 202 bytes:
The Bulletin of the British Museum (Natural History), instituted in 1949, is issued in four
scientific series, Botany, Entomology, Geology (incorporating Mineralogy) and Zoology,
and an Historical series.
To encode in ABBYY XML format this line requires 45,533 bytes.
There are 84,263 lines in the document from which this example was taken.
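To put the two byte counts side by side, a small sketch follows. It only divides the figures quoted on the slide; the constant and function names are illustrative, and nothing is measured independently.

```python
# Figures quoted on the slide; nothing here is measured independently.
PLAIN_TEXT_BYTES = 202      # the example line as plain text
ABBYY_XML_BYTES = 45_533    # the same line encoded as ABBYY XML


def overhead_ratio(encoded_bytes: int, plain_bytes: int) -> float:
    """Return how many encoded bytes are spent per byte of plain text."""
    return encoded_bytes / plain_bytes


ratio = overhead_ratio(ABBYY_XML_BYTES, PLAIN_TEXT_BYTES)
print(f"ABBYY XML uses roughly {ratio:.0f} bytes per byte of text")
```

On these figures the markup outweighs the text by a factor of about 225, which is why the choice of schema matters for a document of many thousands of lines.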
Look for taxon names
Used the uBio FindIT web service
Overall excellent, especially as it adds the NameBank ID
But still some oddities:
Genus = ‘The’: "The scutellum", "The primitive"
Species or Author = ‘and’: "Exetastes and", "B[anchus] falcatorius and"
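One possible way to catch oddities like these is a post-filter over the names returned by the service. The sketch below is an assumption, not the workflow used in the project: the word list is hypothetical apart from "The" and "and", and `looks_like_false_positive` is an illustrative helper name.

```python
import re

# Ordinary English words that should not appear as part of a taxon name.
# "the" and "and" come from the slide; "of" and "with" are illustrative additions.
NOT_NAME_PARTS = {"the", "and", "of", "with"}


def looks_like_false_positive(candidate: str) -> bool:
    """Flag a candidate name if any of its tokens is a common English word."""
    tokens = re.findall(r"[A-Za-z]+", candidate)
    return any(token.lower() in NOT_NAME_PARTS for token in tokens)


candidates = [
    "The scutellum",
    "The primitive",
    "Exetastes and",
    "B[anchus] falcatorius and",
    "Banchus falcatorius",
]
kept = [c for c in candidates if not looks_like_false_positive(c)]
flagged = [c for c in candidates if looks_like_false_positive(c)]
```

Flagged candidates are better sent for review than discarded, since "Exetastes and" still contains a valid genus name.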
Look for paragraph types
Simple keyword matching
Surprisingly effective!
Issue – can identify start, but not end…
Follow up work
Punctuation
Concepts
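A minimal sketch of keyword matching for paragraph types, under stated assumptions: the slides do not list the keywords used, so the table below is hypothetical, as are the function names. The label is carried forward until the next keyword, which shows the limitation noted above: the start of a section can be found, its end cannot.

```python
from typing import List, Optional, Tuple

# Hypothetical keyword table; the real keywords are not given in the slides.
PARAGRAPH_KEYWORDS = {
    "description": "description",
    "diagnosis": "diagnosis",
    "material examined": "material",
    "distribution": "distribution",
    "remarks": "remarks",
}


def classify_paragraph(text: str) -> Optional[str]:
    """Return a paragraph type if the text opens with a known keyword."""
    opening = text.lstrip().lower()
    for keyword, label in PARAGRAPH_KEYWORDS.items():
        if opening.startswith(keyword):
            return label
    return None


def label_paragraphs(paragraphs: List[str]) -> List[Tuple[Optional[str], str]]:
    """Carry the last label forward, because only section starts are detectable."""
    current = None
    labelled = []
    for paragraph in paragraphs:
        current = classify_paragraph(paragraph) or current
        labelled.append((current, paragraph))
    return labelled


# Made-up paragraphs, for illustration only.
sample = [
    "A heading line with no keyword",
    "DESCRIPTION. First paragraph of a description.",
    "A continuation paragraph with no keyword.",
    "Remarks. A closing comment.",
]
labels = [label for label, _ in label_paragraphs(sample)]
```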
Look for other proper names
Biologia Centrali-Americana has a gazetteer
Most journals do not
Generic solution = OpenCalais
Good accuracy
Old countries: D.D.R., West Germany
Continents: America
Ambiguities and Mis-identifications
New York: City, State
Washington: City, State
Lake George: City
Lake Victoria: City
Other Oddities
Persons: surname only; two-part names (Van Veen, van Veen)
Regions and Continents: East Africa, Africa
Negative spell checking
Go beyond stop words
Remove everything not in a spell dictionary
Check:
Minor
Vulgar
Bulletin 27 from the Zoology Series reduced from 139,034 to 5,219 words
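Negative spell checking can be sketched in a few lines. This is an illustration under assumptions, not the project's code: the dictionary here is a tiny stand-in set, whereas a real run would load a full spelling word list, and `negative_spell_check` is a hypothetical name.

```python
import re
from typing import Iterable, List


def negative_spell_check(text: str, dictionary: Iterable[str]) -> List[str]:
    """Keep only the words that are NOT in the spelling dictionary."""
    known = {word.lower() for word in dictionary}
    # [^\W\d_]+ matches runs of letters, including accented ones and ligatures.
    words = re.findall(r"[^\W\d_]+", text)
    return [word for word in words if word.lower() not in known]


# Stand-in dictionary and sentence, for illustration only.
tiny_dictionary = {"the", "are", "and", "is", "dark"}
sentence = "The elytra are punctate and the prothorax is dark"
survivors = negative_spell_check(sentence, tiny_dictionary)
```

What survives is mostly specialist vocabulary, names and OCR errors, which is the material the later steps need.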
Ligatures
INTRODUCTION.
Volume, one of five required for the enumeration of the Rhynchophora, was
THIS
commenced by Dr. Sharp in 1889 and is now concluded by myself. The study of the " Otiorhynchinœ Alatse " has unfortunately been delayed for many years, during the publication of Vol. IV. parts 4, 5, and 7, all of which are devoted to the Family Curculionidœ. The present Volume, IV. part 3, includes the Subfamilies Attelabinae, Pterocolinœ, Allocoryninee, Apioninœ, Thecesterninae, and Otiorhynchinre. The Attelabinae are represented by 104 (88 new), the Pterocolinse by three (all new), the Allocoryninse (a new subfamily) and Thecesterninse each by one, the Apioninae by 88 (84 new), and the Otiorhynchinae by 419 (340 new) species respectively; the total number for the six subfamilies being 616 species, with 516 new, and forty new genera. Amongst the 419 Otiorhynchinae, the apterous and winged forms are almost equal in number, there being a preponderance of apterous terrestrial species (Eupagoderes, Epicœrus, Epayriopsis, &c.) in the arid portions of Mexico and the winged forms ÇExophthalmuS) &c.) becoming relatively more numerous in the forest regions southward. Taking the Curculionidœ as a whole—the subfamilies Curculioninae and Calandrinse, in addition to those worked out in the present Volume,—the number of species enumerated altogether from Central America is as follows :— Vol. IV. part 3, 616; IV. part 4, 1365; IV. part 5, 908; IV. part 7, 344 : total 3233. The three other families of Rhynchophora—the Brenthidae, Scolytidae, and
Ligatures
For the 24 æ there are: 11 ae; 5 œ; 5 se; 1 ee; 1 re; 1 a?;
So not a single correct rendering of the ligature, æ.
By contrast, the only example of œ in the page, Epicœrus, was correctly rendered.
Printed => OCR
Otiorhynchinæ => Otiorhynchinœ
Alatæ => Alatse
Curculionidæ => Curculionidœ
Attelabinæ => Attelabinae
Pterocolinæ => Pterocolinœ
Allocoryninæ => Allocoryninee
Apioninæ => Apioninœ
Thecesterninæ => Thecesterninae
Otiorhynchinæ => Otiorhynchinre
Attelabinæ => Attelabinae
Pterocolinæ => Pterocolinse
Allocoryninæ => Allocoryninse
Thecesterninæ => Thecesterninse
Apioninæ => Apioninae
Otiorhynchinæ => Otiorhynchinae
Otiorhynchinæ => Otiorhynchinae
Curculionidæ => Curculionidœ
Curculioninæ => Curculioninae
Calandrinæ => Calandrinse
Brenthidæ => Brenthidae
Scolytidæ => Scolytidae
Anthribidæ => Anthribidae
Hispidæ => Hispida
Cassididæ => Cassididae
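The OCR renderings listed above (ae, œ, se, ee, re) suggest a rule-based repair for family and subfamily names. The sketch below is an assumption about how that could be done, not the method used in the project: it touches only capitalised words ending in -id or -in plus one of those renderings, so "Alatse" and the incomplete "Hispida" are left alone, as is a genuine œ inside a word such as Epicœrus.

```python
import re

# Capitalised stem ending in "id" or "in", followed by an OCR rendering of
# the terminal ligature. The list of renderings is taken from the slide.
FAMILY_SUFFIX = re.compile(r"\b([A-Z][a-z]+(?:id|in))(?:ae|œ|se|ee|re)\b")


def restore_ae_ligature(text: str) -> str:
    """Rewrite -idae/-inae style endings, however mangled, with the ligature."""
    return FAMILY_SUFFIX.sub(lambda match: match.group(1) + "æ", text)


ocr = "Otiorhynchinœ Alatse Curculionidœ Allocoryninee Otiorhynchinre Calandrinse"
repaired = restore_ae_ligature(ocr)
```

A rule this narrow will miss some cases, but it should not damage ordinary English words, which rarely combine a capital initial with these endings.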
Soundex
Count  Word  Soundex code
831  elytra  E436
639  prothorax  P636
637  Hab  H100
616  punctate  P523
578  millim  M450

Count  Variant  Soundex code
831  elytra  E436
509  Elytra  E436
294  elytris  E436
125  elytral  E436
36  elytron  E436
12  elytrisque  E436
9  elytrorumque  E436
8  Elytral  E436
7  elytrorum  E436
2  elytro  E436
1  Elytrorum  E436
1  Elytris  E436
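The codes on this slide can be reproduced with the standard American Soundex rules. The sketch below is a plain implementation of those rules, offered as an illustration; the slides do not say which Soundex library or variant was used.

```python
# Letter groups of standard American Soundex; vowels, H, W and Y have no code.
SOUNDEX_CODES = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word: str) -> str:
    """Return the four-character Soundex code, or "" if there are no A-Z letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate two letters with the same code
        digit = SOUNDEX_CODES.get(letter, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets the run, so a repeated code counts again
    return (code + "000")[:4]


variants = ["elytra", "Elytra", "elytris", "elytral", "elytron",
            "elytrisque", "elytrorumque", "elytrorum", "elytro"]
codes = {soundex(v) for v in variants}
```

Because the code keeps only the first letter and three consonant classes, every inflected form of elytra collapses to one key, which is what makes it useful for grouping variants.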
Similar words?
denticulate => denticulata
Levenshtein distances of 1: 0,0,1
denticulate => reticulate
Levenshtein distances of 2: 3,2,0
denticulate => geniculate
Levenshtein distances of 2: 2,2,0
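The distances quoted for these word pairs can be checked with the textbook dynamic-programming form of Levenshtein distance. This sketch is illustrative only; it does not reproduce the three comma-separated figures shown after each distance, whose meaning the slide does not explain.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


pairs = [("denticulate", "denticulata"),
         ("denticulate", "reticulate"),
         ("denticulate", "geniculate")]
distances = [levenshtein(a, b) for a, b in pairs]
```

Edit distance alone cannot tell a spelling variant (denticulata) from a different word (reticulate, geniculate), which is the point of the slide's question.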
What did we achieve?
Marked up 11 volumes, i.e. 4,504 pages
Have a robust workflow: can mark up a Bulletin in about 10-15 minutes. The choke point is the call to the OpenCalais web service
No manual intervention or review required: workflow is scalable
Recognising taxon names:
uBio gives us a good start, and we have techniques to cluster ALL misspellings and variants with a valid taxon; but it is not perfect, e.g. Banchus Fabricius ends up in more than one cluster
“making the Scratchpads better”
More reliable (e.g., distribute the servers)
More functional (e.g., phylogenetic & publication services)
Easier to use (better workflows)
Prettier (better graphical design - more intuitive)
More integrated (for data stored inside & outside the Scratchpad framework)
More sustainable (simple administration, distribute developers, development sandbox)
“making natural history better”
Easier to compile, manage and reuse your data
Easier to find and reuse other people's data
Promoting your data inside & outside the taxonomic community
Getting people to work for you (crowdsourcing)
Manuscript preparation on a Scratchpad (workflow diagram)
Steps shown: Submit as XML; Produce PDF; Enhanced XML; Register with ZooBank, GBIF, EoL etc.; Printed paper; Enhanced HTML; Send to reviewers
Roles shown: Author, Publisher, Public
Thank you for your attention.
Any questions?