
Evaluation and refinement of an enhanced OCR process for mass digitisation

An infrastructure project in collaboration with Språkbanken (SB, the Swedish Language Bank) and Kungliga biblioteket (KB, the National Library of Sweden)

Dana Dannells

SB Kickoff

6 September 2018

Project members

Lars Bjork (KB), Project Coordinator

Torsten Johansson (KB)

Dana Dannells (SB)

Newspaper digitisation at KB

KB’s collection holds more than 19 million pages of historical newspapers accessible in digital format.

Unfortunately, a large portion of the texts contains errors resulting from the Optical Character Recognition (OCR) process. There is a need for an infrastructure for the production and dissemination of reliable text.

The OCR-module

• A platform was developed by KB in cooperation with the Norwegian software company Zissor in 2017.

• The underlying principle is to process an image with two OCR systems, compare the results (on word level) and choose the output that has the highest validity.

• A scoring model was implemented, based on the dictionaries of the two OCR systems. Each word is either verified (if confirmed by both dictionaries) or falsified (if rejected by one or both dictionaries).
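The principle above can be illustrated with a minimal sketch. It is not the KB/Zissor implementation: the function names, the assumption that both OCR outputs are already aligned word by word, and the tie-breaking rule (prefer the first system) are all hypothetical choices made for illustration.

```python
from typing import List, Set


def is_verified(word: str, dict_a: Set[str], dict_b: Set[str]) -> bool:
    """A word is verified only if both dictionaries confirm it; otherwise it is falsified."""
    key = word.lower()
    return key in dict_a and key in dict_b


def choose_output(words_a: List[str], words_b: List[str],
                  dict_a: Set[str], dict_b: Set[str]) -> List[str]:
    """Pick one candidate per position from two word-aligned OCR outputs.

    Assumption: the two outputs have the same number of words.
    Tie-breaking (both verified or both falsified) keeps system A's word.
    """
    chosen = []
    for word_a, word_b in zip(words_a, words_b):
        a_ok = is_verified(word_a, dict_a, dict_b)
        b_ok = is_verified(word_b, dict_a, dict_b)
        chosen.append(word_b if (b_ok and not a_ok) else word_a)
    return chosen


# Toy dictionaries standing in for the dictionaries of the two OCR systems.
dict_a = {"en", "stor", "stad"}
dict_b = {"en", "stor", "stad", "och"}
result = choose_output(["En", "slor", "stad"], ["En", "stor", "slad"], dict_a, dict_b)
```

Here `result` takes "stor" from the second system and "stad" from the first, since only those candidates are confirmed by both dictionaries.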

Comparison between two OCR-systems

[Figure: two bar charts, Aftonbladet 1831 and Aftonbladet 1945, comparing ABBYY and Tesseract; vertical axis in percent.]

The purpose of the project

To evaluate and improve the OCR-module through systematic text analyses, dictionaries and word lists, and implement the module in KB’s mass digitisation process.

The specific aims of the project

• evaluate and refine the OCR-module based on typical features of the source document (image);

• develop a methodological approach to relate features of the source documents and specific types of dictionaries and linguistic rule sets;

• produce reference material and make it freely available;

• construct a framework for defining quality levels and methods for quality declaration of the processed texts;

• formalise the cooperation between KB and other research institutions;

• enhance the quality of the OCRed text in KB’s mass digitisation of newspapers, and consequently increase the usability of these resources in research applications.

Reference material (ground truth)

• 400 pages covering the years 1831 – 2017 will be manually transcribed (double keyed).

• The selected pages will be carefully chosen to reflect typical variations in layout, typography, and language over time.

• We will use both statistical and manual selection procedures to ensure that the material is broad and representative.

Analysis and characterisation of the material

• Quantitative analysis, performed automatically by machines.

+ accurate; quickly produces generalisations and summaries of the input data.

- basically only provides statistical information about the data.

• Qualitative analysis, performed manually by human annotators.

+ provides in-depth explanatory data.

- requires human expertise and is time-consuming.

Analysis method

Levenshtein distance operations
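As an illustration of what is meant by edit operations, the following is a minimal sketch of the standard Levenshtein dynamic programme with a backtrace that lists the insertions, deletions and substitutions needed to turn an OCR string into its reference. The function name and the tuple format are assumptions made for this example, not the project's tooling.

```python
from typing import List, Tuple


def edit_operations(source: str, target: str) -> List[Tuple[str, str, str]]:
    """Return one minimal list of (operation, source_char, target_char) tuples."""
    m, n = len(source), len(target)
    # dist[i][j] = Levenshtein distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution

    # Walk back from the bottom-right cell, recording non-matching steps.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


ops_insert = edit_operations("kat", "katt")  # OCR dropped one character
ops_mixed = edit_operations("tlie", "the")   # "h" misread as "li"
```

The number of operations returned equals the Levenshtein distance; when several minimal edit scripts exist, the backtrace order above decides which one is reported.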

Dictionaries and wordlists

SALDO and Dalin, covering late-modern Swedish (1831–1905) and modern Swedish (1906–2017)

Some word lists that were produced in the project En fri molntjänst för OCR (‘A free cloud service for OCR’)

38 historical and modern corpora comprising 980 million annotated tokens from the period of 1800 until today

Evaluation of the OCR-module

Case 1 based on comparison with a ground truth

→ identify the most frequent edit operations

→ improve the dictionaries and word lists for different time periods

Case 2 based on comparison with previously processed material

→ evaluate the usefulness of the final module

Quality assessment of results will be carried out throughout the project.
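The first step of Case 1, identifying the most frequent edit operations against a ground truth, could look roughly like the sketch below. It assumes the OCR output and the ground truth are already paired word by word, and it uses Python's standard `difflib.SequenceMatcher` as a stand-in for a true Levenshtein alignment, so the reported operations are an approximation. The function name and the sample pairs are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Iterable, Tuple


def count_edit_operations(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count (operation, ocr_text, truth_text) triples over (OCR word, ground-truth word) pairs."""
    counts: Counter = Counter()
    for ocr_word, truth_word in pairs:
        matcher = SequenceMatcher(None, ocr_word, truth_word, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":  # keep only replace / delete / insert
                counts[(tag, ocr_word[i1:i2], truth_word[j1:j2])] += 1
    return counts


# Invented sample pairs: (OCR output, ground truth).
pairs = [("tbe", "the"), ("tbat", "that"), ("och", "och")]
counts = count_edit_operations(pairs)
most_common = counts.most_common(1)
```

A ranking of this kind, computed per time period, indicates which character confusions a dictionary or word list would need to cover.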

Project plan and timeline

January 2019 – December 2020

2019

• Selection and transcription of the reference material

• OCR processing of the transcribed material

• Evaluation of the functionality of the OCR-module and of the processed text

• Analysis of the OCR processing results

2020

• Refinement and further development of the OCR-module

• Re-processing of the test material and analysis

• Final evaluation and documentation

Thank you!