
A Voting System for Automatic Correction of OCR...

Date post: 08-Aug-2020
Page 1: A Voting System for Automatic Correction of OCR Output (boston.lti.cs.cmu.edu/callan/Workshops/IR-OCR-02/tklein-slides.pdf)

A Voting System for Automatic Correction of OCR Output

Page 2

Introduction

OCR = Optical Character Recognition

Page 4

Known Techniques for Spelling Correction

Edit Distance: The minimum number of editing operations (i.e., insertion, deletion, and substitution of letters) required to transform one string into another.
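
As a minimal sketch of the definition above, the following Python function computes the edit distance with the standard dynamic-programming recurrence. The function name and the row-by-row layout are illustrative assumptions; the slides give only the definition, not an implementation.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn `source` into `target` (Levenshtein distance)."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" costs i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]
```

For instance, `edit_distance("school", "schooi")` is 1 (one substitution), while `edit_distance("little", "lithe")` is 2.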

Page 5

Known Techniques for Spelling Correction (cont...)

• Hashing:
  • Skeleton Key: communication → cmnctouia

Page 6

So, what is the problem?

• Most of the techniques are relevant only for typing errors.

• Designed for isolated words.

• We are interested in a fully automatic system.

Page 7

The Algorithm

Page 8

The Algorithm (cont...)

Page 9

The Algorithm (cont...)

$w_j^1$, $w_j^2$

Page 10

The Algorithm (cont...)

• If words are identical: Accept the word.

Page 11

The Algorithm (cont...)

If the words mismatch:

• If only one of them is valid, then: Accept the valid one.

$$\mathrm{dictionary}(w_j^i) = 0.6 \cdot \mathrm{freq}(w_j^i, \mathrm{local\_dictionary}) + 0.4 \cdot \mathrm{freq}(w_j^i, \mathrm{global\_dictionary})$$
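
The weighted dictionary score above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not define `freq`, so it is taken here to be a relative frequency over word counts, and the names `relative_freq`, `local_counts` and `global_counts` are hypothetical.

```python
from typing import Mapping

LOCAL_WEIGHT = 0.6   # weight of the local dictionary, from the slide
GLOBAL_WEIGHT = 0.4  # weight of the global dictionary, from the slide


def relative_freq(word: str, counts: Mapping[str, int]) -> float:
    """Assumed meaning of freq(w, dictionary): count of w over total count."""
    total = sum(counts.values())
    return counts.get(word, 0) / total if total else 0.0


def dictionary_score(word: str,
                     local_counts: Mapping[str, int],
                     global_counts: Mapping[str, int]) -> float:
    """dictionary(w) = 0.6 * freq(w, local) + 0.4 * freq(w, global)."""
    return (LOCAL_WEIGHT * relative_freq(word, local_counts)
            + GLOBAL_WEIGHT * relative_freq(word, global_counts))


# Hypothetical counts, for illustration only.
local_counts = {"school": 3, "the": 7}
global_counts = {"school": 1, "the": 9}
score = dictionary_score("school", local_counts, global_counts)  # 0.6*0.3 + 0.4*0.1
```

A word missing from both dictionaries scores 0, which is one possible way to treat it as invalid.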

Page 12

The Algorithm (cont...)

• If both words are invalid:

Candidate words are generated for $w_j^1$ and $w_j^2$.

Page 13

The Algorithm (cont...)

$$\mathrm{mark}(w_{j_k}^i) = 0.6 \cdot \mathrm{error\_freq}(w_j^i, w_{j_k}^i, OCR_i) + 0.4 \cdot \bigl(\mathrm{word\_gram}(w_{j-1}^i, w_{j_k}^i, w_{j+1}^i) + \mathrm{dictionary}(w_{j_k}^i) + \mathrm{is\_close}(w_{j_k}^i, w_j^i)\bigr)$$

$$\mathrm{is\_close}(w_{j_k}^i, w_j^i) = \begin{cases} 1 & \text{if } \mathrm{edit\_distance}(w_{j_k}^i, w_j^i) = 1, \\ 0 & \text{otherwise.} \end{cases}$$
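
The candidate mark can be sketched as follows. This is an illustrative reading, not the authors' code: the three component scores (`error_freq`, `word_gram`, `dictionary`) are passed in as plain numbers because the slides do not say how they are computed, and the example values are hypothetical. `is_close` checks for an edit distance of exactly 1 directly, without building the full distance table.

```python
def is_close(candidate: str, ocr_word: str) -> int:
    """Return 1 if the two strings are exactly one edit apart, else 0."""
    if candidate == ocr_word:
        return 0
    short, long_ = sorted((candidate, ocr_word), key=len)
    if len(long_) - len(short) > 1:
        return 0
    # Skip the common prefix; i is the first position where they differ.
    i = 0
    while i < len(short) and short[i] == long_[i]:
        i += 1
    if len(short) == len(long_):
        # Same length: a single substitution, so the tails must match.
        return int(short[i + 1:] == long_[i + 1:])
    # Lengths differ by one: a single insertion or deletion.
    return int(short[i:] == long_[i + 1:])


def candidate_mark(error_freq: float, word_gram: float,
                   dictionary: float, close: int) -> float:
    """mark = 0.6 * error_freq + 0.4 * (word_gram + dictionary + is_close)."""
    return 0.6 * error_freq + 0.4 * (word_gram + dictionary + close)


# Hypothetical component scores (error_freq, word_gram, dictionary)
# for two candidates of the OCR word "schooi".
ocr_word = "schooi"
components = {"school": (0.30, 0.10, 0.20), "schools": (0.05, 0.02, 0.10)}
marks = {
    cand: candidate_mark(e, g, d, is_close(cand, ocr_word))
    for cand, (e, g, d) in components.items()
}
best = max(marks, key=marks.get)  # candidate with the highest mark
```

With these made-up numbers, "school" wins because it is one substitution away from "schooi", whereas "schools" is two edits away.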

Page 14

The Algorithm (cont...)

• If both words are valid:

$$\mathrm{context}(w_j^i) = \mathrm{freq}(w_j^i, \mathrm{local\_dictionary}) + \mathrm{word\_gram}(w_{j-1}^i, w_j^i, w_{j+1}^i)$$

$$\mathrm{mark}(w_j^i) = \mathrm{accuracy}(OCR_i) \cdot [0.6 \cdot \mathrm{context}(w_j^i) + 0.4 \cdot \mathrm{freq}(w_j^i, \mathrm{global\_dictionary})]$$
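
The mark for two valid words, together with the case analysis of the preceding slides, can be sketched as one voting function. Everything beyond the two formulas is an assumption made for illustration: the callables `is_valid`, `mark_valid` and `best_candidate` are hypothetical hooks, ties go to OCR 1, the component scores are invented, and the accuracies are taken as 1 minus the error rates reported for the raw OCR outputs.

```python
from typing import Callable


def valid_word_mark(ocr_accuracy: float, local_freq: float,
                    word_gram: float, global_freq: float) -> float:
    """mark(w) = accuracy(OCR_i) * [0.6 * context(w) + 0.4 * freq(w, global)]."""
    context = local_freq + word_gram  # context(w) = freq(w, local) + word_gram
    return ocr_accuracy * (0.6 * context + 0.4 * global_freq)


def vote(word1: str, word2: str,
         is_valid: Callable[[str], bool],
         mark_valid: Callable[[str, int], float],
         best_candidate: Callable[[str, str], str]) -> str:
    """Choose the output word given the words read by OCR 1 and OCR 2."""
    if word1 == word2:                 # identical: accept the word
        return word1
    valid1, valid2 = is_valid(word1), is_valid(word2)
    if valid1 != valid2:               # only one is valid: accept it
        return word1 if valid1 else word2
    if valid1 and valid2:              # both valid: compare marks
        # Assumption: a tie is resolved in favour of OCR 1.
        return word1 if mark_valid(word1, 1) >= mark_valid(word2, 2) else word2
    return best_candidate(word1, word2)  # both invalid: use candidates


# Hypothetical lexicon and scores (local_freq, word_gram, global_freq).
LEXICON = {"going", "details", "survivors", "surveyors"}
ACCURACY = {1: 0.860, 2: 0.879}  # assumed: 1 - error rate of each raw OCR
SCORES = {"surveyors": (0.00, 0.01, 0.02), "survivors": (0.03, 0.02, 0.02)}


def is_valid(word: str) -> bool:
    return word in LEXICON


def mark_valid(word: str, ocr_index: int) -> float:
    return valid_word_mark(ACCURACY[ocr_index], *SCORES[word])


def best_candidate(word1: str, word2: str) -> str:
    return word1  # placeholder; a real system would score generated candidates


accepted = vote("surveyors", "survivors", is_valid, mark_valid, best_candidate)
```

Keeping the scoring functions as parameters lets the same dispatch be reused with different dictionaries or candidate generators.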

Page 15

The Algorithm (cont...)

Page 16

The Experiments

Page 17

The Environment of the Experiments

Source String | Error String | OCR 1 | OCR 2
i | 1 | 1.58% | 0.48%
0 | o | 3.54% | 0.77%
da | d | 0.75% | 0.10%
2 | z | 0.30% | 0.72%
f | l | 0.15% | 0.72%
e | g | 1.18% | 0.10%

Page 18

Examples of successful corrections

Original Word | OCR 1 | OCR 2 | Accepted Word
going | going | goring | going
survivors | surveyors | survivors | survivors
we’re | we’te | we’e | we’re
thankfully | thankfvlly | thankfiily | thankfully
school | schooi | sciiool | school
neighborhood | ne~hborhood | nei&iborho~ | neighborhood
precisent | pr,cisent | preci~ent | precisent
details | d,tails | detaHs | details

Page 19

Examples of erroneous decisions

Page 20

Original Word | OCR 1 | OCR 2 | Accepted Word | Error type
5,600 | 5,6oo | 5,6oo | 5,6oo | 1
Ian | lan | lan | lan | 1
class | dass | dass | dass | 1
100 | 100 | too | too | 2
true | true | toe | toe | 2
little | lithe | lisle | lithe | 2
We | the | We | the | 2
circles | circ.les | circus | circus | 3
says | savs | sas | san | 4
2001 | 2ooi | zooi | zoo | 4

Page 21

Analysis of the Results

Page 22

Analysis of the Results (cont…)

Page 23

Analysis of the Results (cont…)

Model | Configuration | Error Rate
1 | OCR 1 – no post-processing | 14.0%
2 | OCR 2 – no post-processing | 12.1%
3 | OCR 1 – dictionary + candidates generation | 8.5%
4 | OCR 2 – dictionary + candidates generation | 8.0%
5 | Both – comparison + simple dictionary lookup | 7.2%
6 | Both – comparison + dictionary’s frequencies | 5.1%
7 | Both – full correction system | 3.6%

Page 24

Analysis of the Results (cont…)

 | Accept words from OCR as is | Do not accept the OCR word, and try to suggest candidates
Correct word written to output | (A) true positive | (B) true negative
Incorrect word written to output | (C) false positive | (D) false negative

$$\mathrm{Recall} = \frac{\#\ \text{correct words accepted directly from OCR}}{\#\ \text{correct words accepted}} = \frac{A}{A+B}$$

$$\mathrm{Precision} = \frac{\#\ \text{correct words accepted directly from OCR}}{\#\ \text{total words accepted directly from OCR}} = \frac{A}{A+C}$$
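
The two measures can be computed directly from the cell counts of the table. The sketch below follows the slide's own labelling of the cells (A, B, C); the counts used in the example are hypothetical and are not results from the experiments.

```python
def recall(a: int, b: int) -> float:
    """Recall = A / (A + B): the share of correct output words that were
    accepted directly from the OCR."""
    return a / (a + b)


def precision(a: int, c: int) -> float:
    """Precision = A / (A + C): the share of directly accepted OCR words
    that were correct."""
    return a / (a + c)


# Hypothetical counts, for illustration only.
A, B, C = 900, 60, 100
example_recall = recall(A, B)        # 900 / 960
example_precision = precision(A, C)  # 900 / 1000
```

Both functions raise `ZeroDivisionError` when their denominator is zero, which seems a reasonable signal that no words fell into the relevant cells.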

Page 25

Analysis of the Results (cont…)

[Figure: Recall vs. Precision. x-axis: Recall (90 to 100); y-axis: Precision (84 to 100); legend: model 1 to model 7]

Page 26

Further Work

• More OCR devices.
• Context: NLP techniques, word classes.
• Specifications for a particular language.

