06 traub


Estimating the Impact of OCR Quality on Research Tasks in the Digital Humanities

Myriam C. Traub, Jacco van Ossenbruggen
Centrum Wiskunde & Informatica

24/03/2015


Use of digital archives for…

✤ selection for close reading

✤ first occurrence of specific words

✤ word frequency patterns over
    ✤ time
    ✤ different sources

It all depends on OCR quality!

Two different perspectives of OCR quality

We care about average performance on representative subsets for generic cases.

I care about actual performance on my non-representative subset for my specific query.

Specific case: “amsterdam”

✤ American English 2009, 1700 - 1850

How reliable are these numbers?





Tool maker: recall (average) = 90%

Is “amsterdam” in the 10% or in the 90%?
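The gap between the two figures can be made concrete with a small sketch. It assumes a hypothetical ground-truth transcription that is aligned token by token with the OCR output; the function name `recall_per_term` and the toy data are illustrative only and do not come from any real corpus.

```python
from collections import Counter


def recall_per_term(ground_truth_tokens, ocr_tokens):
    """For each ground-truth term, return the fraction of its occurrences
    that the OCR reproduced exactly (assumes one-to-one token alignment)."""
    total = Counter()
    hits = Counter()
    for truth, ocr in zip(ground_truth_tokens, ocr_tokens):
        total[truth] += 1
        if truth == ocr:
            hits[truth] += 1
    return {term: hits[term] / total[term] for term in total}


# Toy data (assumption): ten tokens, one of them misrecognised.
truth = ["the"] * 8 + ["amsterdam"] * 2
ocr = ["the"] * 8 + ["amfterdam", "amsterdam"]

average_recall = sum(t == o for t, o in zip(truth, ocr)) / len(truth)
per_term = recall_per_term(truth, ocr)
print(average_recall, per_term["amsterdam"])
```

In this toy case the average over all tokens is 90%, while the recall for the one term the researcher cares about is only 50%.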


“amsterdam” and “amfterdam”
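One way a researcher might compensate for this particular error is to expand the query with spellings in which a long s was read as "f". The sketch below is only an illustration: the helper `long_s_variants` is hypothetical, and it assumes that the long s occurs only in non-final positions and is the only confusion worth covering.

```python
from itertools import product


def long_s_variants(word):
    """Return every spelling of `word` in which each non-final 's'
    may also appear as 'f' (a common OCR misreading of the long s)."""
    last = len(word) - 1
    options = []
    for i, ch in enumerate(word):
        if ch == "s" and i != last:
            options.append(("s", "f"))
        else:
            options.append((ch,))
    return sorted("".join(chars) for chars in product(*options))


print(long_s_variants("amsterdam"))
```

Summing the frequencies of all returned variants gives a less biased count for the term, but only for this one known error pattern.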

How do I know?

What else?

Different versions / pipelines

✤ Current American English, 1700 - 1850

How do I know that they fixed it?

What else?

Research questions

✤ How can we give humanities researchers the information they need to understand how the OCR limitations influence their research tasks?

✤ What is a good way of estimating uncertainty for a specific case?

✤ How to get (the data for) better estimates?
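As one possible illustration of the second question, the uncertainty about the recall of a single term could be expressed as a confidence interval computed from a small, manually checked sample. The sketch below uses the Wilson score interval; this is a technique chosen here for illustration, not a method proposed in the talk, and the sample numbers are invented.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion, e.g. the recall of one term
    estimated from n manually checked occurrences (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Assumption: 18 of 20 checked occurrences of the term were recognised.
low, high = wilson_interval(18, 20)
print(round(low, 2), round(high, 2))
```

With only 20 checked occurrences, an observed recall of 90% is compatible with a true recall anywhere from roughly 0.70 to 0.97, which shows why the third question, getting more data, matters.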


Understanding potential sources of bias

[Diagram: digitization pipeline with the stages scanning, pre-processing, OCR, post-processing, ingestion]

✤ some details difficult to reconstruct

✤ essential to understand overall impact

Potential starting point:

✤ Confidence in the pipeline?

Example for OCR confidence values

What does 0.732 mean?

Example for post-processing


OCR confidence values useful?

✤ available for all items in the collection: page, word, character

✤ calibration based on limited ground truth?

✤ only for highest ranked words / characters, other candidates missing
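A value such as 0.732 only becomes interpretable once it is compared with how often words at that confidence level are actually correct. A minimal sketch of such a calibration check follows, assuming a small hypothetical ground-truth sample of (confidence, correct?) pairs; the function name, bin count, and data are illustrative.

```python
def calibration_table(samples, n_bins=5):
    """Group (confidence, is_correct) pairs into equal-width confidence bins
    and return (bin_low, bin_high, count, observed_accuracy) per non-empty bin."""
    counts = [0] * n_bins
    correct = [0] * n_bins
    for conf, ok in samples:
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        counts[idx] += 1
        correct[idx] += bool(ok)
    return [
        (i / n_bins, (i + 1) / n_bins, counts[i], correct[i] / counts[i])
        for i in range(n_bins)
        if counts[i]
    ]


# Invented sample: reported confidence and whether the word matched ground truth.
samples = [(0.732, True), (0.70, False), (0.65, True),
           (0.95, True), (0.91, True), (0.10, False)]
for row in calibration_table(samples):
    print(row)
```

If the observed accuracy in a bin is far from the confidence values in that bin, the reported confidences are poorly calibrated. With limited ground truth the bins contain few samples, so the observed accuracies are themselves uncertain.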


Future work

✤ How to estimate impact of OCR errors on use cases outside the ground truth?

✤ Can we crowdsource part of this estimation problem?

✤ How to convey the estimated impact to researchers using the corpus?


How does OCR quality affect your research?

[email protected]

[email protected]
