Post on 06-Jan-2018
The way from pdf-documents to xml-files
A brief overview of the OCR process and the XML markup
Christiana Klingenberg & Donat Agosti
workflow
document processing
1) OCR (ABBYY FineReader)
• reading the pdf document, dividing the text into blocks
• building training files
• orthography check
2) XML markup (GoldenGATE)
• workflow (level 1)
• FAT / LSID
• treatments
OCR – ABBYY FineReader
Considerations
- building training files for each type face pattern (e.g. for each journal)
- marking the blocks in logical reading order
- recognizing special characters [[worker]], [[queen]], [[male]], [[soldier]]
- orthography check
- saving options
- problems
type face pattern
1804. Carolum Reichard, Brunsviga.
1861. Journal of the Proceedings of the Linnean Society of London, Zoology
1921. Annales de la Société Entomologique de Belgique
2005. Proceedings of the California Academy of Sciences
marking the blocks
[figure: scanned pages with the text blocks numbered in reading order]
marking the blocks in a logical order to get a readable xml document
Vespa. 263emargina-ta.50. V. nigra thorace maculata, abdomine fasciis quinque prima antice emarginata, Vespa emarginata. Ent.
Syst. 2. 267. 51. * Habitat in Germania Dom Smidt.simplex51. V. nigra clypeo thoracis margine antico ab-dominisque fasciis quinque simplicibus flavis. Ent. Syst. 2,
267. 52. * Habitat Kiliae.parietina.52. V. nigra clypeo thoraceque maculatis, abdomi-ne fasciis supra quinque, subtus duabus flavis. Ent. Syst,
2. 268. 53. *Panz. Fn. Germ. 49. tab. 24.Habitat Kiliae.
Vespa. 263
50. V. nigra thorace maculata, abdomine fasciis emargina-quinque prima antice emarginata, ta.
Vespa emarginata. Ent. Syst. 2. 267. 51. * Habitat in Germania Dom Smidt.
51. V. nigra clypeo thoracis margine antico ab- simplex. dominisque fasciis quinque simplicibus flavis. Ent. Syst. 2. 267. 52. * Habitat Kiliae. 52. V. nigra clypeo thoraceque maculatis, abdomi- parietina. ne fasciis supra quinque, fubtus duabus flavis. Ent. Syst, 2. 268. 53. Panz. Fn. Germ. 49. tab. 24. Habitat Kiliae.
blocks marked in a logical sequence, „clean“ html
whole text marked in one block, „dirty“ html
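The difference between the two results above can be sketched in a few lines of Python. This is only an illustration of why block order matters, not a description of how ABBYY FineReader works internally: the data structure (one tuple per printed line, holding main text and margin note) and the function names are assumptions made for this example.

```python
# Each tuple is one printed line: (main column text, margin note fragment).
# The text is taken from the Vespa example above; the layout model is assumed.
page_lines = [
    ("50. V. nigra thorace maculata, abdomine fasciis", "emargina-"),
    ("quinque prima antice emarginata,", "ta."),
]


def join_fragments(fragments):
    """Join text fragments, gluing words that were hyphenated at a line end."""
    out = ""
    for frag in fragments:
        if out.endswith("-"):
            out = out[:-1] + frag      # "emargina-" + "ta." -> "emarginata."
        elif out:
            out += " " + frag
        else:
            out = frag
    return out


def read_as_one_block(lines):
    """Whole page marked as one block: every printed line is read straight
    across, so the margin note ends up in the middle of the description."""
    return " ".join(f"{main} {margin}".strip() for main, margin in lines)


def read_as_marked_blocks(lines):
    """Margin note and main text marked as separate blocks and read one
    after the other."""
    margin_block = join_fragments(margin for _, margin in lines if margin)
    main_block = " ".join(main for main, _ in lines)
    return [margin_block, main_block]


dirty = read_as_one_block(page_lines)
clean = read_as_marked_blocks(page_lines)
```

Here `dirty` contains the split margin note inside the sentence, while `clean` keeps the margin note and the description as two separate, readable strings.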
special characters
it is not possible to force the ABBYY pattern editor to re-read certain characters!
[[worker]][[soldier]][[queen]][[male]][[…]] = not recognizable
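A minimal sketch of how such placeholders could be handled after OCR, assuming the caste symbols are written into the text as `[[worker]]`, `[[queen]]` and so on. The Unicode symbols in the mapping are an assumption for illustration (symbols commonly used for ant castes) and are not taken from the slides; the function names are hypothetical.

```python
import re

# ASSUMPTION: Unicode symbols for the castes; not part of the original slides.
CASTE_SYMBOLS = {
    "worker": "\u263f",   # mercury sign
    "queen": "\u2640",    # female sign
    "male": "\u2642",     # male sign
    "soldier": "\u2643",  # jupiter sign
}

# Matches placeholders of the form [[name]].
PLACEHOLDER = re.compile(r"\[\[(\w+)\]\]")


def find_placeholders(text):
    """Return the names of all [[...]] placeholders in the text, in order."""
    return PLACEHOLDER.findall(text)


def placeholders_to_symbols(text, mapping=CASTE_SYMBOLS):
    """Replace known placeholders by their symbol; leave unknown ones as-is."""
    return PLACEHOLDER.sub(
        lambda m: mapping.get(m.group(1), m.group(0)), text
    )


line = "(T) bicuspis Emery 1900:268 [[worker]] [[male]]"
found = find_placeholders(line)
converted = placeholders_to_symbols(line)
```

Keeping the placeholders in the text and converting them only at the end means that nothing is lost if a symbol is missing from the mapping.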
orthography check / problems
• additional dictionaries: “anty_species”, “anty_glossary”, (“anty_Chris”)
• Latin dictionary?
• geographic names dictionary?
• misspelled taxa (incl. species names beginning with CAPITALS)
• available training files for different type patterns for ABBYY (community)
• species dictionaries for different groups (e.g. plants, beetles, birds, etc.) (community) (could be used as lexicon in GoldenGATE)
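The idea of checking taxa against a species dictionary can be sketched as follows. This is an illustration only and does not reproduce the ABBYY or GoldenGATE dictionary mechanism: the word list is a small hypothetical sample built from names on these slides, and `check_epithet` is a made-up helper that uses fuzzy matching from Python's standard `difflib` module.

```python
import difflib

# HYPOTHETICAL sample; a real species dictionary would be much larger.
SPECIES_DICTIONARY = [
    "parallela", "schultzei", "turneri",
    "cribrinodis", "punctata", "sinuata",
]


def check_epithet(word, dictionary=SPECIES_DICTIONARY, cutoff=0.8):
    """Classify a species epithet found in the OCR text.

    Returns (status, suggestion) where status is one of
    "ok", "capitalized", "misspelled" or "unknown".
    """
    normalized = word.lower()
    if normalized in dictionary:
        # Known name; flag it if it was written with capital letters.
        status = "ok" if word == normalized else "capitalized"
        return status, normalized
    # Not found: look for the closest dictionary entry.
    matches = difflib.get_close_matches(normalized, dictionary, n=1, cutoff=cutoff)
    if matches:
        return "misspelled", matches[0]
    return "unknown", None


results = [check_epithet(w) for w in ("parallela", "Punctata", "cribrinodls", "xyz")]
```

The `cutoff` value decides how similar a word must be to count as a misspelling; 0.8 is an arbitrary choice for this example and would need tuning on real OCR output.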
saving options
(T) australis Forel = parallela(T) bequaerti Forel = schultzei(T) bicolor (Clark) * = turneri(T) bidentata Brown n. sp. [[worker]] Philippines [13](T) bicuspis Emery 1900:268 [[worker]] [[male]]
Madagascar [15]boliviana Santschi = sinuata(P) brevidentata Wheeler — cribrinodis(T) brevinodis Santschi = cribrinodis(?) brunnipes (Clark) * 1938:361 [[worker]] S Australia:
Reevesby I. [16](T) cephalotes Viehmeyer = parallela(T) ceylonensis Donisthorpe = parallelacineracea Forel = punctata
(T) australis Forel = parallela(T) bequaerti Forel = schultzei(T) bicolor (Clark) * = turneri(T) bidentata Brown n. sp. [[worker]] Philippines [13](T) bicuspis Emery 1900:268 [[worker]] [[male]] Madagascar [15]boliviana Santschi = sinuata (P) brevidentata Wheeler — cribrinodis (T) brevinodis Santschi = cribrinodis(?) brunnipes (Clark) * 1938:361 [[worker]] S Australia: Reevesby I. [16] (T) cephalotes Viehmeyer = parallela (T) ceylonensis Donisthorpe = parallelacineracea Forel = punctata
workflow
GoldenGATE: xml mark up
• FAT / attribute taxon names
– editing species names (beginning with lower case letters, if not recognized as a genus)
– marking of additional, not recognized taxa (without the author, the author will be given during LSID referencing)
– edit annotations (improving the tool)
GoldenGATE: xml mark up
• LSID referencing
– upload of new taxonomic names (quality control?)
– same taxon described by two authors? In case of doubt, which one?
Establishing “taxon format” rules in accordance with the ICZN for taxon upload: “Genus (SubGenus) species subspecies variety” (in most cases this requires prior editing of the taxa, during the OCR process or in GoldenGATE)
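A “taxon format” rule of this kind could be checked automatically before upload. The sketch below assumes a simplified reading of the pattern “Genus (SubGenus) species subspecies variety”: genus and subgenus start with a capital letter, all lower ranks are written in lower case, and every part after the genus is optional. The regular expression and the function `parse_taxon` are illustrations, not part of GoldenGATE, and hyphenated epithets or author names are not handled.

```python
import re

# Simplified, assumed pattern for "Genus (SubGenus) species subspecies variety".
TAXON_FORMAT = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)"
    r"(?: \((?P<subgenus>[A-Z][a-z]+)\))?"
    r"(?: (?P<species>[a-z]+))?"
    r"(?: (?P<subspecies>[a-z]+))?"
    r"(?: (?P<variety>[a-z]+))?$"
)


def parse_taxon(name):
    """Return the ranks found in a taxon name, or None if the name does not
    follow the assumed format (e.g. a species name starting with a capital)."""
    match = TAXON_FORMAT.match(name)
    if match is None:
        return None
    return {rank: value for rank, value in match.groupdict().items() if value}


valid = parse_taxon("Vespa emarginata")
# "Aus (Bus) cus dus" is a made-up placeholder name, not a real taxon.
with_subgenus = parse_taxon("Aus (Bus) cus dus")
rejected = parse_taxon("Vespa Emarginata")
```

Names that return `None` are candidates for manual editing before the LSID referencing step.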
GoldenGATE: treatment mark up
• definitions of treatment options, especially: catalogue entry, synopsis, citation, reference group
• suggestions for simplifying the treatment mark up: journal-specific analyzers?
• treatment mark-up during “paginator” step and subSubSection mark up afterwards?
GoldenGATE: TaxonX
• TaxonX validation: in GoldenGATE (no need for Oxygen or XMLSpy)
• TaxonX – MODS: what about books?
GoldenGATE: considerations
• new definitions of mark up levels
• LSIDs, citations (DOIs)
• community: “mark up server”, integrating specialists for special groups or mark up levels
Error prevention:
• in case of doubt consult the original pdf (taxa), especially when working with “dirty” html
expenditure of time
• OCR: average (x̄) of 5.63 min / page
– depends on type face pattern and availability of a training file for the type face pattern
• GoldenGATE: average (x̄) of 8.18 min / page (tx1)
– the average also includes time spent on debugging and error search
– depends on number of taxa and treatments
– time will decrease as GoldenGATE is constantly improved and helpful tools are developed
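The two averages can be combined into a rough estimate for a whole document. The sketch below simply adds the per-page values given above and multiplies by the page count; the function names are made up for this example, and the result is only as reliable as the averages, which depend on type face pattern, number of taxa and treatments.

```python
# Average processing times from the slides, in minutes per page.
OCR_MIN_PER_PAGE = 5.63
MARKUP_MIN_PER_PAGE = 8.18


def estimate_minutes(pages):
    """Estimated total time for OCR plus GoldenGATE mark up, in minutes."""
    return pages * (OCR_MIN_PER_PAGE + MARKUP_MIN_PER_PAGE)


def format_hours(minutes):
    """Format a duration in minutes as hours and minutes."""
    hours, rest = divmod(round(minutes), 60)
    return f"{hours} h {rest:02d} min"


total = estimate_minutes(100)      # about 1381 minutes for 100 pages
readable = format_hours(total)     # "23 h 01 min"
```

At 13.81 min per page in total, a 100-page publication therefore takes roughly 23 hours.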
Time development GoldenGATE