Date post: | 26-May-2015 |
Category: | Technology |
Upload: | chris-freeland |
Botanicus.org: Applying emerging technology to historic scientific literature
Chris Freeland
Doug Holland
Missouri Botanical Garden
Published literature is the foundation on which biological science is based
Botany & systematics are sciences built on accumulated knowledge
Taxonomic Literature
• Over 250 years of systematic description of life
• Systema naturae (10th ed. 1758) by Carl von Linné
The cited half-life of publications in taxonomy is longer than in any other scientific discipline
The decay rate is longer than in any scientific discipline
- Macro-economic case for open access, Tom Moritz
Taxonomic Literature
How historic literature is used
Taxonomic Impediment
• Specimen collections
• Databases
• Publications
• Observations
• ‘Gray’ literature
• Index cards
• Field notebooks
www.botanicus.org
A freely accessible, Web-based encyclopedia of digitized botanical literature, sponsored by the Missouri Botanical Garden Library
• 650,000+ pages of text
• 1,300 volumes, 200 titles
• 145,000 linked protologues
• ~10TB of data
Workflow
• Selection
• Preparation
• Conservation
• Digitization
• Post Production
• Metadata Enhancement
• Publication
Selection
Preparation
• Review bibliographic metadata in MOBOT library catalog
  – Clean up, if needed
• Extract MARC
  – Transform to MARCXML
  – Parse into Botanicus DB
• Review title & determine which scanning device to use
  – Possible trip through Conservation
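The "Transform to MARCXML, parse into Botanicus DB" step can be illustrated with a minimal sketch. It assumes the catalog export is already MARCXML in the standard MARC21 slim namespace; the sample record, the `parse_records` and `subfield` helpers, and the output dictionary keys are hypothetical and do not reflect the actual Botanicus schema.

```python
import xml.etree.ElementTree as ET

MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}

# Hypothetical sample record, for illustration only.
SAMPLE_MARCXML = """
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 a 4500</leader>
    <controlfield tag="001">example-0001</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Species plantarum /</subfield>
      <subfield code="c">Carl von Linne.</subfield>
    </datafield>
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="c">1753.</subfield>
    </datafield>
  </record>
</collection>
"""


def subfield(record, tag, code):
    """Return the first matching subfield value, or None if absent."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    node = record.find(path, MARC_NS)
    return node.text.strip() if node is not None and node.text else None


def parse_records(marcxml):
    """Turn MARCXML into plain dicts that could be loaded into a database."""
    root = ET.fromstring(marcxml.strip())
    rows = []
    for record in root.iter(f"{{{MARC_URI}}}record"):
        control = record.find("marc:controlfield[@tag='001']", MARC_NS)
        title = subfield(record, "245", "a")
        rows.append({
            "control_number": control.text if control is not None else None,
            # Drop the trailing ISBD punctuation (" /") from the title.
            "title": title.rstrip(" /") if title else None,
            "date": subfield(record, "260", "c"),
        })
    return rows


records = parse_records(SAMPLE_MARCXML)
```

Each dictionary in `records` corresponds to one bibliographic record; a real loader would map these fields onto whatever title and item tables the database defines.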
Digitization
5 Full time scanners
3 Indus 5002 book scanners
1 Kodak i280 Sheet feed scanner
Post Production – Custom Apps
• PageConvert
  – JPEG2000 (*.jp2) creation
  – Thumbnail creation
  – Moves derivative images to server
  – Updates item records to prepare for publishing
  – Runs on each scanning workstation
• PagePublish
  – Looks for items ready to publish
  – Creates or updates page records
  – Guesses page “types” (text or illustration)
  – Triggers OCR generation and PDF creation
  – Updates titles and item records to “publish ready”
  – Runs centrally
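The PagePublish steps above can be pictured as a status-driven pass over scanned items. The following is only a sketch under stated assumptions: the `Item` class, the `"converted"` status value, and the `guess_type` callback are hypothetical, and the real application's data model and page-type heuristic are not described in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    item_id: str
    image_files: list
    status: str = "converted"  # hypothetical status set after image conversion
    pages: dict = field(default_factory=dict)


def publish_pass(items, guess_type):
    """Create or update page records for every item that is ready to publish."""
    published = []
    for item in items:
        if item.status != "converted":
            continue  # not ready yet, or already handled by an earlier pass
        for sequence, name in enumerate(sorted(item.image_files), start=1):
            page = item.pages.setdefault(name, {})  # create or update
            page["sequence"] = sequence
            page["type"] = guess_type(name)  # "text" or "illustration"
        item.status = "publish ready"
        published.append(item.item_id)
    return published


items = [
    Item("item-1", ["0002.jp2", "0001.jp2"]),
    Item("item-2", ["0001.jp2"], status="scanning"),
]
first = publish_pass(items, guess_type=lambda name: "text")
second = publish_pass(items, guess_type=lambda name: "text")
```

Because the pass keys on status, running it a second time does nothing for items already marked "publish ready", which suits a process that runs centrally and repeatedly.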
Post Production – Packaged Apps
• PrimeOCR
  – 6 voting engines
  – Multi-language support
  – Character coordinates
  – Outputs ASCII text, other formats
• LuraTech PDF Compressor
  – 2GB of TIF page images -> 30MB PDF
  – PDF/A
  – OCR (ABBYY FineReader)
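The idea behind "voting engines" can be shown with a simple per-character majority vote. This is an illustrative sketch, not PrimeOCR's actual algorithm: it assumes every engine returns a reading of the same length that is already aligned character by character, and the `vote` function and sample readings are hypothetical.

```python
from collections import Counter


def vote(readings):
    """Pick the most common character at each position across engine outputs."""
    if not readings:
        raise ValueError("at least one reading is required")
    length = len(readings[0])
    if any(len(reading) != length for reading in readings):
        raise ValueError("this sketch assumes aligned, equal-length readings")
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*readings)
    )


# Six hypothetical engine outputs for the same line of text.
readings = [
    "Quercus alba",
    "Quercus a1ba",
    "Quercns alba",
    "Quercus alba",
    "Qucrcus alba",
    "Quercus alba",
]
consensus = vote(readings)
```

Each engine makes a different mistake, so the majority at every position recovers the correct line even though half of the individual readings contain an error.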
Enhancement - Paginator
View
Web 2.0 Features
• AJAX interface – JPEG2000 (Image compression with zoom)
• Web Services – uBio TaxonFinder and NameBank Taxonomic Intelligence
• RSS feeds – Volumes added and news
• Mash Ups – Geocoded subject headings plotted on Google Maps
• Tag Clouds
Page View
• Distributed taxonomic indexing
  – Public-resource computing application that identifies name-like strings in OCR text
  – Bundles of text pages sent to volunteer computers for indexing & results reporting
• Runs as a screensaver
• Open source framework behind SETI@Home
• TIF image from scanner
• Converted to text via PrimeOCR
• Name finding via bTaxonGrab
• Extract names
• Submit to TaxonFinder
• SOAP response
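To make "identifies name-like strings in OCR text" concrete, here is a deliberately naive sketch of the name-finding step. It is not the algorithm used by the tools named above: the regular expression, the stopword list, and the `find_name_like` function are assumptions for illustration. Submitting candidates to TaxonFinder is left out because the service interface is not described here.

```python
import re

# A capitalized word followed by a lowercase word, e.g. "Quercus alba".
CANDIDATE = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")

# Hypothetical stopword list to drop obvious English false positives.
STOPWORDS = {"the", "this", "that", "these", "with", "from", "and", "for"}


def find_name_like(text):
    """Return unique binomial-looking strings in order of first appearance."""
    names = []
    for genus, epithet in CANDIDATE.findall(text):
        if genus.lower() in STOPWORDS or epithet in STOPWORDS:
            continue
        candidate = f"{genus} {epithet}"
        if candidate not in names:
            names.append(candidate)
    return names


ocr_text = (
    "Quercus alba was collected near the river. "
    "The specimen of Acer rubrum is similar to Quercus alba."
)
found = find_name_like(ocr_text)
```

A pattern like this over-matches on ordinary prose, which is one reason candidate strings would still need to be checked against a names service before being indexed.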
SciLINC in action…
Prof. Newton wrote me that he is extremely excited about your digitization project. At the moment he and his graduate botany students in Kenya have access to very few resources. He spends his summer terms at Kew doing his research for the next year's teaching and writing, but he tells me that now, because of what is already on your site, he will not have to carry so much back to Kenya for his research and his students but can download and work with your resources right there.
-- excerpt re: Botanicus from email August 2006
Taxonomic Impedectomy
The Future
• User Accounts
  – User defined views
  – MyBookshelf – favoriting & sharing
• Wiki-type editing & tagging
  – Metadata enrichment
  – OCR correction by users
• Bibliographic Intelligence
  – Improved “click through” citations
  – Citation finding & linking
• Increased geospatial extraction and visualization
Biodiversity Heritage Library
• American Museum of Natural History (New York)
• Field Museum (Chicago)
• Natural History Museum (London)
• Smithsonian Institution (Washington)
• Missouri Botanical Garden
• New York Botanical Garden
• Royal Botanic Garden, Kew
• Botany Libraries, Harvard University
• Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University
• Marine Biological Laboratory / Woods Hole Oceanographic Institution
• Core literature pre-1923: 400,000 (80 million pages)
• All pre-1923: 600-750,000 (120-150 million pages)
• All literature: 1.4-1.6 million (280-320 million pages)
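The three estimates above can be cross-checked with simple arithmetic. The sketch below assumes the first figure in each line counts volumes (the slides do not state the unit) and shows that every estimate implies the same average of 200 pages per volume. The Botanicus page count is taken from the "650,000+ pages" figure earlier in the deck; the variable names are illustrative.

```python
# (low volumes, high volumes, low pages, high pages) for each estimate
estimates = {
    "Core literature pre-1923": (400_000, 400_000, 80_000_000, 80_000_000),
    "All pre-1923": (600_000, 750_000, 120_000_000, 150_000_000),
    "All literature": (1_400_000, 1_600_000, 280_000_000, 320_000_000),
}

# Implied average pages per volume at the low and high end of each range.
pages_per_volume = {
    name: (low_pages / low_vols, high_pages / high_vols)
    for name, (low_vols, high_vols, low_pages, high_pages) in estimates.items()
}

# Botanicus page count as a share of the core pre-1923 estimate.
botanicus_pages = 650_000
core_share = botanicus_pages / estimates["Core literature pre-1923"][2]

for name, (low, high) in pages_per_volume.items():
    print(f"{name}: {low:.0f} to {high:.0f} pages per volume")
print(f"Botanicus share of core pre-1923 pages: {core_share:.2%}")
```

On these figures, the pages already in Botanicus amount to a bit under one percent of the core pre-1923 estimate, which gives a sense of the scale of the larger effort.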
Biodiversity Heritage Library
www.biodiversitylibrary.org
Botanicus.org brought to you by:
Andrew W. Mellon Foundation, 2000-2004
W. M. Keck Foundation, 2005-2007
Institute of Museum and Library Services (IMLS), 2006-2008
Botanicus.org
Please comment and send questions and suggestions to: