Rapid digitization of P Herbarium
Switching to the fast track: Rapid digitization of the
world's largest herbarium
TDWG 2011, New Orleans
Simon Chagnoux, Henri Michiels
The French Museum
An old institution
• Founded in 1635 (at that time the Royal garden of medicinal plants)
• In 1793, the French Revolution turned the garden into the national Museum
• Now: 15 locations in France, 2000 people
20 Oct. 2011 TDWG - Orleans 3
Renovating the Herbarium
An opportunity to digitize the entire collection
The Paris Herbarium
The Renovation Project (1)
• Two main drivers for this project:
  – the herbarium, designed for 6 million specimens, was packed with 10 million sheets and fitted with old storage
  – raising the storage density required reinforcing the floors
The Renovation Project (2)
• The only way to do this was to move the entire collection out and put it back in the renovated space once the works were finished
• An opportunity for:
  – New sorting, from geographic to phylogenetic (APG3)
  – Reconditioning
  – Digitizing
Renovation Calendar
2006 – Start of the project
2009 – Start of the works
2010 (June) – Start of digitization
2011 (Nov) – Opening of the first rearranged spaces to researchers
2012 – End of the project
Budget
• Overall project cost: 24.5 million €
  – Building renovation: 12 000 000
  – Movers: 900 000
  – Attaching specimens: 3 200 000
  – Reconditioning, digitization and sorting: 6 700 000
  – Supplies: 1 600 000
  – Storage: 100 000
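As a quick consistency check, the budget lines can be summed and compared with the stated overall cost. This is only an illustrative sketch: the figures are the ones on the slide, while the variable names are ours.

```python
# Budget lines as given on the slide, in euros.
budget_eur = {
    "Building renovation": 12_000_000,
    "Movers": 900_000,
    "Attaching specimens": 3_200_000,
    "Reconditioning, digitization and sorting": 6_700_000,
    "Supplies": 1_600_000,
    "Storage": 100_000,
}

# Sum of the individual lines, to compare with the stated 24.5 million.
total_eur = sum(budget_eur.values())

# Share of the line that includes digitization (it also covers
# reconditioning and sorting, so it is an upper bound for digitization).
digitization_share = budget_eur["Reconditioning, digitization and sorting"] / total_eur

print(f"Total: {total_eur:,} EUR")
print(f"Reconditioning, digitization and sorting: {digitization_share:.1%}")
```

The lines add up to the stated 24.5 million €, and the line that contains digitization is a bit over a quarter of the total.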
The renovation cycle
[Diagram: Herbarium, Warehouse, Industrial Partner (Digitization, Reconditioning, Sorting); floor by floor renovation]
Before ....
... And after
Why digitize?
• Because every specimen has to be handled in the course of the project anyway
• Digitization gives us:
  – a virtual copy of the specimens
  – the possibility to share and study specimens without touching them
• More than an electronic copy of the collection catalog, we will have a collaborative tool for managing scientific knowledge both inside and outside the institution
2D digitization is cheap
• The cost of digitization is marginal compared to the full project:
  – full specimen processing (moving, sorting, reconditioning, new furniture): $1.5
  – digitization and name processing: $0.1
• Digitization is appealing to funders
A new paradigm
• For 15 years we entered the full information for selected specimens:
  – 1 million entries in the database (rich information)
  – one fifth (200 000 images) was photographed
• Since summer 2010, we have used a mass approach in which digitization precedes data entry:
  – 2 million records digitized in one year
  – limited information in the database (name and geographic area)
  – the scientific information can be added later without handling the specimens themselves
The workflow
Digitizing, reconditioning and sorting
An industrial process (1)
• We chose a contractor with industrial know-how
• A dedicated site had to be set up and equipped by the contractor
• Two teams of 20 workers, in two shifts, working from 6am to 9pm
• The process had to align with the schedule of the renovation works, floor by floor
An industrial process (2)
• Planned production rate: 17 000 sheets per day over 24 months (ca. 15 seconds / sheet)
• At this rate, a variation of ± 1 second per specimen has an impact of ± 300 k€ on the project cost
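The sensitivity figure can be unpacked with a few lines of arithmetic. The sketch below assumes that the whole collection of about 10 million sheets goes through the process (the figure quoted on the renovation slides) and that the ± 300 k€ scales linearly with handling time; the implied hourly cost is our own back-of-the-envelope derivation, not a number given by the authors.

```python
SHEETS_PER_DAY = 17_000         # planned production rate (from the slide)
TOTAL_SHEETS = 10_000_000       # assumption: the whole collection is processed
COST_PER_EXTRA_SECOND = 300_000 # euros, for +1 second per sheet (from the slide)

# Number of production days needed at the planned rate.
working_days = TOTAL_SHEETS / SHEETS_PER_DAY

# One extra second on every sheet, expressed in cumulative hours.
extra_hours = TOTAL_SHEETS * 1 / 3600

# Cost of one cumulative hour of handling implied by the two figures above.
implied_cost_per_hour = COST_PER_EXTRA_SECOND / extra_hours

print(f"Production days needed: {working_days:.0f}")
print(f"Extra hours for +1 s per sheet: {extra_hours:.0f}")
print(f"Implied cost per hour: {implied_cost_per_hour:.0f} EUR")
```

Under these assumptions, one extra second per sheet adds roughly 2 800 cumulative hours of handling, which is why small per-sheet variations weigh so heavily on the budget.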
The Bussy-St-Georges site
Workflow overview
[Diagram: from Herbarium → unpacking and adding barcode → data entry → image capture (300 DPI) → reconditioning and sorting → to Herbarium]
How to alleviate data entry
• We take advantage of the physical ordering of specimens
• We provide a name list to the contractor (APG 3 classification)
• The contractor enriches the list with the information generated during the process and provides us with a table containing consolidated information (image number, barcode numbers, classification,…)
1 – Delivery (1)
A carting company transports the specimens to the facility, where they arrive in clearly labeled boxes. Each box receives a tracking barcode.
1 – Delivery (2)
• The Museum provides two files:
  1. a “logistics” file
    – number of boxes
    – family name and number
    – genus name and number
    – geographic area
  2. a “taxonomy” file
    – list of available taxon names with family, genus, species, authors, ID (taxon number)
1 – Delivery (3)
• This information is loaded into the contractor’s information system and used throughout the industrial process (labeling, sorting, quality assurance)
2 – Folder processing
For each folder, the operator:
1. replaces the jacket (color according to region)
2. reads the species name and types its first letters on the computer
3. selects the name from a list
4. prints a label with barcode and identification information, and sticks it on the folder
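The "type the first letters, then pick from a list" step is essentially a prefix lookup over the taxonomy file. The following is a minimal sketch of how such a lookup could work, not the contractor's actual software: the function names and the four sample names are illustrative assumptions.

```python
from bisect import bisect_left

def build_index(names):
    """Sort names by a case-insensitive key so prefixes form contiguous runs."""
    return sorted((name.casefold(), name) for name in names)

def suggest(index, typed, limit=10):
    """Return up to `limit` names whose beginning matches the typed letters."""
    key = typed.casefold()
    # First position whose folded name is >= the typed prefix.
    start = bisect_left(index, (key,))
    matches = []
    for folded, name in index[start:]:
        if not folded.startswith(key) or len(matches) == limit:
            break
        matches.append(name)
    return matches

# Illustrative sample only; the real list comes from the "taxonomy" file.
taxon_index = build_index([
    "Abrus aureus",
    "Abrus precatorius",
    "Abutilon indicum",
    "Acacia dealbata",
])

print(suggest(taxon_index, "abr"))
```

Because the list is sorted once, each lookup is a binary search followed by a short scan, which stays fast even for a name list covering the whole collection.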
3 – Specimen Digitization (1)
• A Datamatrix and a barcode are stuck on each sheet
  – Datamatrix: for tracking purposes
  – Barcode: specific to the Muséum and to the international herbarium standard
• The specimens are placed three by three on a tray
3 – Specimen Digitization (2)
• The tray is placed on a conveyor belt
• The sheet is scanned
• The scan is checked (framing and focus)
• At the end of the chain, the barcode is read to check if all specimens are back in the folder
The Digitization Bench
4 – Reconditioning
• After scanning, each sheet is inserted in a sulfurized paper liner
• The barcode of each specimen is read, allowing the system to check if all specimens are back in the right folder
• The folders are stored in a “cut box” before sorting
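The "all specimens are back in the right folder" check amounts to comparing the barcodes registered for a folder before scanning with those read afterwards. Here is a minimal sketch, assuming the system keeps a list of barcodes per folder; the function name and the barcode values are hypothetical.

```python
def check_folder(expected_barcodes, scanned_barcodes):
    """Compare barcodes registered for a folder with those read on the way back."""
    expected = set(expected_barcodes)
    scanned = set(scanned_barcodes)
    return {
        "missing": sorted(expected - scanned),      # sheets that did not come back
        "unexpected": sorted(scanned - expected),   # sheets from another folder
        "complete": expected == scanned,
    }

# Hypothetical barcodes, for illustration only.
registered = ["P00000001", "P00000002", "P00000003"]
read_back = ["P00000001", "P00000003", "P00000042"]

report = check_folder(registered, read_back)
print(report)
```

A folder is released to sorting only when `complete` is true; otherwise the two lists tell the operator which sheet to look for and which one to return to its own folder.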
5 – Sorting 1 (by genus)
• This first sort groups the specimens by family and genus
• The operator puts the jackets in boxes and places them on shelves according to the family and genus numbers (the shelves are labelled in advance by the contractor)
6 – Sorting 2 (by species)
• The operator takes a box and reads the barcode on each jacket
• The system displays the species name and assigns a number, which is printed on a label
• The label is stuck on the folder, which is then stored on the shelf with the same number
7 – Packing, transport and final storage
• The folders are put in boxes and sent to the Museum
• The contractor stores the folders in the Museum’s herbarium
How to ensure quality in mass digitization?
1. 60 000 images produced each week
2. 1% of the production checked (ca. 600 images)
3. Samples are distributed among botanical staff
4. Checking:
  • Focus
  • Data quality
  • Barcode number
  • Barcode location
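The weekly control can be pictured as drawing a 1% random sample and splitting it among reviewers. The sketch below is one possible way to do this, under the assumption that the sample is uniform and shared round-robin; the function name, reviewer names and image identifiers are placeholders, not part of the actual system.

```python
import random

def weekly_sample(image_ids, reviewers, fraction=0.01, seed=None):
    """Draw a random sample of the week's images and split it among reviewers."""
    rng = random.Random(seed)
    sample_size = round(len(image_ids) * fraction)
    sample = rng.sample(image_ids, sample_size)
    # Round-robin split: reviewer i gets every len(reviewers)-th image.
    return {
        reviewer: sample[i::len(reviewers)]
        for i, reviewer in enumerate(reviewers)
    }

# Placeholder identifiers for one week of production (60 000 images).
week_images = [f"img_{n:06d}" for n in range(60_000)]
assignments = weekly_sample(
    week_images,
    reviewers=["botanist_a", "botanist_b", "botanist_c", "botanist_d"],
    seed=2011,
)

for reviewer, images in assignments.items():
    print(reviewer, len(images))
```

With four reviewers, each one receives 150 of the 600 sampled images to check for focus, data quality, barcode number and barcode location.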
Scanning Resolution and Image Format
Production of images
• The conveyor belt passes the specimens under a bidirectional scanner which produces 11x17” (A3), 300 dpi, 5000 x 3300 pixel images
• TIFF files are saved offline (one 1 TB disk per production day)
• JPEGs are made for online use
Scanning resolution and image size
• One TIFF image is 50 MB
• One JPEG is 5 MB. This compression level was chosen to keep the same level of detail as the TIFF (only colour is slightly changed)
• This choice is a trade-off between technical quality and cost
• For 10 million images:
  – TIFF represents 500 TB
  – JPEG represents 50 TB
  – Data represents <100 GB
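These storage figures can be reproduced from the image dimensions. The sketch below assumes uncompressed 24-bit RGB TIFFs (3 bytes per pixel) and decimal units (1 MB = 10^6 bytes); both are our assumptions, chosen because they match the 50 MB figure.

```python
WIDTH_PX, HEIGHT_PX = 5000, 3300   # image size quoted for the scanner
BYTES_PER_PIXEL = 3                # assumption: 24-bit RGB, uncompressed
TIFF_MB, JPEG_MB = 50, 5           # per-image sizes from the slide
TOTAL_IMAGES = 10_000_000
SHEETS_PER_DAY = 17_000            # planned production rate

# Raw size of one scan in MB (decimal units).
raw_tiff_mb = WIDTH_PX * HEIGHT_PX * BYTES_PER_PIXEL / 1e6

# Collection-wide totals in TB.
tiff_total_tb = TIFF_MB * TOTAL_IMAGES / 1e6
jpeg_total_tb = JPEG_MB * TOTAL_IMAGES / 1e6

# TIFF volume of one production day in GB, to compare with a 1 TB disk.
daily_tiff_gb = SHEETS_PER_DAY * TIFF_MB / 1e3

print(f"One raw scan: {raw_tiff_mb} MB")
print(f"All TIFFs: {tiff_total_tb} TB, all JPEGs: {jpeg_total_tb} TB")
print(f"One production day of TIFFs: {daily_tiff_gb} GB")
```

Under these assumptions a raw scan weighs 49.5 MB, and a full day at the planned rate produces about 850 GB of TIFF, which is consistent with archiving one production day per 1 TB disk.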
Why do we keep TIFF?
• Partners seek lossless data (Reflora, Mellon)
• Standard for physical publishing
• Native scan output, which can be used for any future use or transformation
Handling TIFF data
• We cannot afford “live” storage of 500 TB
• … let alone 1 PB with redundancy! $$$
• It would also mean a lot of energy consumption and heat dissipation for rarely accessed images
• We are planning to start using tape storage next year, with HSM software
• For the time being, USB disks are stored in the collection warehouse
Exception for the types
• The types are not part of this industrial process
• They are manually digitized on-premises at 600 dpi (200 MB in compressed TIFF)
• This process was initiated by the Mellon Foundation in 2004
• We now have about 100 000 type images
What we’ve achieved and learned …
… after 12 months of collaboration between scientists and industry (out of an anticipated duration of 24 months)
Achievements
• 2.1 million specimens processed between June 2010 and August 2011
• Images and data are of good quality
• The new premises comply with today’s standards (space, safety, light, air-conditioning, …)
Fast but ... not fast enough
Reasons for being behind schedule
• The logisticians underestimated the sorting work
• Only two digitization chains are operational instead of three (due to lack of staff)
Software and quality assurance
• More software is needed to ensure traceability and detect failures than for data acquisition
• Fast web publication of images allows a broader audience to perform quality control
• Continuous control is mandatory
People
• Working under constant time pressure for two years is really difficult in an academic context
• The contractor must be considered as a service provider and not just the team next-door (not obvious in an academic context)
Working with a contractor
[Diagram labels: ROI, speed, robustness, quality, exhaustivity, specificity]
• Culture clash
• Many parameters were not known at the beginning of the project (processes, numbers, ...)
• Quality control is a key point to make sure that scientific excellence governs the industrial throughput (to be defined upfront)
• Write everything down and always refer to the contract
Digitizing other objects
• Digitizing herbarium sheets is “easy”:
  – same dimensions for all objects
  – easy handling and scanning
  – the plant itself is not touched, only the paper
• Digitizing 3D objects is a lot more complex and generally requires handling the specimen itself
Is it over?
Digitization is just a very first step…
Virtual herbarium
• The amount of information available on-line will lower the number of physical visits to the Herbarium
• … but visitors leave post-it notes on the sheets. How to replace this?
  – Annotation systems
  – “virtual visit” website
Spot the differences …
AFM
FABACEAE
Abrus aureus R. Vig.
?
Differences are
• Occurrence
  – occurrenceID | catalogNumber | occurrenceDetails | occurrenceRemarks | recordNumber | recordedBy | individualID | individualCount | sex | lifeStage | reproductiveCondition | behavior | establishmentMeans | occurrenceStatus | preparations | disposition | otherCatalogNumbers | previousIdentifications | associatedMedia | associatedReferences | associatedOccurrences | associatedSequences | associatedTaxa
• Event
  – eventID | samplingProtocol | samplingEffort | eventDate | eventTime | startDayOfYear | endDayOfYear | year | month | day | verbatimEventDate | habitat | fieldNumber | fieldNotes | eventRemarks
• Location
  – locationID | higherGeographyID | higherGeography | continent | waterBody | islandGroup | island | country | countryCode | stateProvince | county | municipality | locality | verbatimLocality | verbatimElevation | minimumElevationInMeters | maximumElevationInMeters | verbatimDepth | minimumDepthInMeters | maximumDepthInMeters | minimumDistanceAboveSurfaceInMeters | maximumDistanceAboveSurfaceInMeters | locationAccordingTo | locationRemarks | verbatimCoordinates | verbatimLatitude | verbatimLongitude | verbatimCoordinateSystem | verbatimSRS | decimalLatitude | decimalLongitude | geodeticDatum | coordinateUncertaintyInMeters | coordinatePrecision | pointRadiusSpatialFit | footprintWKT | footprintSRS | footprintSpatialFit | georeferencedBy | georeferenceProtocol | georeferenceSources | georeferenceVerificationStatus | georeferenceRemarks
• Identification
  – identificationID | identifiedBy | dateIdentified | identificationReferences | identificationRemarks | identificationQualifier | typeStatus
• Taxon
  – taxonID | scientificNameID | acceptedNameUsageID | parentNameUsageID | originalNameUsageID | nameAccordingToID | namePublishedInID | taxonConceptID | scientificName | acceptedNameUsage | parentNameUsage | originalNameUsage | nameAccordingTo | namePublishedIn | higherClassification | kingdom | phylum | class | order | family | genus | subgenus | specificEpithet | infraspecificEpithet | taxonRank | verbatimTaxonRank | scientificNameAuthorship | vernacularName | nomenclaturalCode | taxonomicStatus | nomenclaturalStatus | taxonRemarks
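The field names listed above are Darwin Core terms, and the gap between a mass-digitized record and a fully databased one can be made concrete by checking which terms are still empty. The sketch below is illustrative only: the barcode value, the mapping of "AFM" to `higherGeography`, and the choice of terms considered desirable are all assumptions.

```python
# What the mass-digitization workflow captures: name and geographic area.
captured_record = {
    "catalogNumber": "P00000000",          # hypothetical barcode
    "family": "Fabaceae",
    "scientificName": "Abrus aureus R. Vig.",
    "higherGeography": "AFM",              # assumption: area code from the folder label
}

# A small, arbitrary selection of Darwin Core terms a researcher may expect.
desired_terms = [
    "catalogNumber", "family", "scientificName", "higherGeography",
    "recordedBy", "eventDate", "country", "locality",
    "decimalLatitude", "decimalLongitude", "identifiedBy", "typeStatus",
]

# Terms that still have to be transcribed from the image of the sheet.
missing_terms = [term for term in desired_terms if term not in captured_record]

print(f"{len(missing_terms)} of {len(desired_terms)} terms still to fill:")
print(", ".join(missing_terms))
```

Everything in `missing_terms` is written on the sheet itself, so it can be transcribed later from the image without handling the specimen.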
OCR / NLP?
Projects to fill the gap
• Remote taxonomists
  – Yack web tool
• Citizen science / crowdsourcing
  – “les collecteurs” project
• Repatriation project
  – Reflora (Brazil)
Thank you!
A project managed by:
• Direction of Collections
  – Michel Guiraud, mguiraud (at) mnhn (.) fr
  – Pascale Joannot, joannot (at) mnhn (.) fr
• DSI (Information Systems)
  – Henri Michiels, michiels (at) mnhn (.) fr
  – Simon Chagnoux, chagnoux (at) mnhn (.) fr