Evaluation and refinement of an enhanced OCR process for mass digitisation
An infrastructure project in collaboration with Språkbanken (SB, the Swedish Language Bank) and Kungliga biblioteket (KB, the National Library of Sweden)
Dana Dannélls
SB Kickoff
6 September 2018
Newspaper digitisation at KB
KB’s collection holds more than 19 million pages of historical newspapers accessible in digital format.
Unfortunately, a large portion of the texts contains errors resulting from the Optical Character Recognition (OCR) process. There is a need for an infrastructure for the production and dissemination of reliable text.
The OCR-module
• A platform was developed by KB in cooperation with the Norwegian software company Zissor in 2017.
• The underlying principle is to process an image with two OCR-systems, compare the results (on word level) and choose the output that has the highest validity.
• A scoring model was implemented, based on the dictionaries of the two OCR-systems. Each word is either verified (if confirmed by both dictionaries) or falsified (if rejected by one or both dictionaries).
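The word-level selection described above can be illustrated with a minimal Python sketch. It is not the Zissor/KB implementation: the function names, the assumption that the two outputs are already aligned word by word, and the fallback to the first system when the dictionaries do not separate the candidates are all hypothetical.

```python
def is_verified(word, dict_a, dict_b):
    """A word counts as verified only if both dictionaries confirm it."""
    key = word.lower()
    return key in dict_a and key in dict_b


def choose_output(words_a, words_b, dict_a, dict_b):
    """Pick, word by word, between two aligned OCR outputs.

    Assumption: words_a and words_b are already aligned one-to-one.
    Assumption: when both or neither candidate is verified, keep system A.
    """
    chosen = []
    for word_a, word_b in zip(words_a, words_b):
        if word_a == word_b:
            # Both systems agree, nothing to decide.
            chosen.append(word_a)
        elif is_verified(word_b, dict_a, dict_b) and not is_verified(word_a, dict_a, dict_b):
            # Only system B produced a word that both dictionaries confirm.
            chosen.append(word_b)
        else:
            chosen.append(word_a)
    return chosen


# Toy dictionaries and outputs, invented for illustration only.
dictionary_a = {"en", "ny", "tidning"}
dictionary_b = {"en", "ny", "tidning"}
output_a = ["en", "nv", "tidning"]
output_b = ["cn", "ny", "tidning"]
result = choose_output(output_a, output_b, dictionary_a, dictionary_b)
```

Here each system misreads a different word, and the combined output keeps the verified candidate in both positions.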
Comparison between two OCR-systems
[Figure: two bar charts, Aftonbladet 1831 and Aftonbladet 1945, comparing ABBYY and Tesseract; vertical axis in percent.]
The purpose of the project
To evaluate and improve the OCR-module through systematic text analyses, dictionaries and word lists, and implement the module in KB’s mass digitisation process.
The specific aims of the project
• evaluate and refine the OCR-module based on typical features of the source document (image);
• develop a methodological approach to relate features of the source documents to specific types of dictionaries and linguistic rule sets;
• produce reference material and make it freely available;
• construct a framework for defining quality levels and methods for quality declaration of the processed texts;
• formalise the cooperation between KB and other research institutions;
• enhance the quality of the OCRed text in KB’s mass digitisation of newspapers, and consequently increase the usability of these resources in research applications.
Reference material (ground truth)
• 400 pages covering the years 1831–2017 will be manually transcribed (double keyed).
• The selected pages will be carefully chosen to reflect typical variations in layout, typography, and language over time.
• We will use both statistical and manual selection procedures to ensure that the material is broad and representative.
Analysis and characterisation of the material
• Quantitative analysis, performed automatically by machines.
  + Accurate; quickly produces generalisations and summaries of the input data.
  - Provides essentially only statistical information about the data.
• Qualitative analysis, performed manually by human annotators.
  + Provides in-depth explanatory data.
  - Requires human expertise and is time-consuming.
Analysis method
Levenshtein distance operations
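As an illustration of what Levenshtein distance operations look like in practice, the sketch below computes the distance between an OCR word and its transcription and recovers one minimal sequence of edit operations. The function name and the operation labels are hypothetical, and when several minimal sequences exist the backtrace simply returns one of them.

```python
def edit_operations(source, target):
    """Return (distance, operations) turning source into target.

    Operations are (name, source_char, target_char) tuples, where name is
    "substitute", "delete" or "insert" (labels chosen for this sketch).
    """
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    # Walk back from the bottom-right cell to list the operations.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", target[j - 1]))
            j -= 1
    ops.reverse()
    return dist[m][n], ops


# Invented example: an OCR misreading of "tidning" (Swedish for "newspaper").
distance, operations = edit_operations("tidniug", "tidning")
```

For this pair the distance is 1 and the only operation is the substitution of "u" by "n".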
Dictionaries and wordlists
SALDO and Dalin, covering late-modern Swedish (1831–1905) and modern Swedish (1906–2017)
Some word lists produced in the project En fri molntjänst för OCR (‘A free cloud service for OCR’)
38 historical and modern corpora with 980 million annotated tokens, from 1800 until today
Evaluation of the OCR-module
Case 1: based on comparison with a ground truth
→ identify the most frequent edit operations
→ improve the dictionaries and word lists for different time periods
Case 2: based on comparison with previously processed material
→ evaluate the usefulness of the final module
Quality assessment of results will be carried out throughout the project.
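The first step of Case 1, identifying the most frequent edit operations, could be sketched as follows. This is an assumption-laden illustration rather than the project's tooling: it uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner (its alignment is not guaranteed to be a minimal Levenshtein alignment), and the function name and the word pairs are invented.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_edit_operations(pairs):
    """Count character-level differences over (ocr_word, truth_word) pairs.

    Keys are (tag, ocr_fragment, truth_fragment), where tag is one of
    difflib's opcode tags: "replace", "delete" or "insert".
    """
    counts = Counter()
    for ocr_word, truth_word in pairs:
        matcher = SequenceMatcher(None, ocr_word, truth_word, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                counts[(tag, ocr_word[i1:i2], truth_word[j1:j2])] += 1
    return counts


# Invented OCR / ground-truth word pairs, for illustration only.
word_pairs = [("tidniug", "tidning"), ("uy", "ny"), ("hns", "hus")]
operation_counts = count_edit_operations(word_pairs)
most_frequent = operation_counts.most_common(1)[0]
```

A ranking of this kind shows which character confusions dominate in a given period, which is the information the second step needs when extending dictionaries and word lists.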
Project plan and timeline
January 2019 – December 2020
2019
• Selection and transcription of the reference material
• OCR processing of the transcribed material
• Evaluation of the functionality of the OCR-module and of the processed text
• Analysis of the OCR processing results
2020
• Refinement and further development of the OCR-module
• Re-processing of the test material and analysis
• Final evaluation and documentation