Date posted: 14-Jun-2015
Category: Education
Uploaded by: impact-centre-of-competence
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Computer Lexica in OCR and Retrieval
Katrien Depuydt (Instituut voor Nederlandse Lexicologie, Leiden)
IMPACT <Demo Day BL, 12 July 2011> 2
Overview
What is a computer lexicon
Lexica in IMPACT
Tools for lexicon building and applying lexica
Some results
Searching Demonstration
What is a computer lexicon?
Computer lexicon vs electronic dictionary (1)
An electronic dictionary is:
- Digitised full text (no pictures)
- For human use
- Ideally: searchable, with explicitly coded material (XML), such as lemma, part of speech (PoS), meaning, quotes, etc.
- Examples: OED online, WNT online
Dictionary XML (example)
Computer Lexicon vs Electronic Dictionary (2)
A computer lexicon is:
- Always in a structured digital format (XML, relational database)
- Main purpose: computer application
- Explicitly coded information (e.g. lemma, part of speech, morphology, syntax)
Examples of use:
- Linguistic enrichment of text material
- ‘Advanced’ searching (words with all spelling variants and inflections)
- Automatic summarization, keyword extraction…
Lexica in IMPACT
The OCR lexicon
An OCR lexicon is:
- A checked list of words in a language
- Based on a corpus (collection) of dated texts (selection!)
- Preferably with frequency information
- Preferably from the same time period or of the same text type as the texts you wish to digitize
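The core of such a word list can be sketched in a few lines of Python. This is only an illustration: the mini-corpus, the function name `build_ocr_lexicon` and the year range are assumptions made for the example, and the manual checking step that the slide mentions is not modelled.

```python
from collections import Counter
import re

# Hypothetical mini-corpus: (year, text) pairs standing in for dated source texts.
corpus = [
    (1637, "de werelt ende de aerde"),
    (1642, "de weerelt was woest"),
    (1901, "de wereld en de aarde"),
]

def build_ocr_lexicon(docs, start_year, end_year):
    """Count word forms in documents dated within [start_year, end_year]."""
    counts = Counter()
    for year, text in docs:
        if start_year <= year <= end_year:
            counts.update(re.findall(r"\w+", text.lower()))
    return counts

# Word list with frequencies for the 17th-century part of the corpus only.
lexicon_17th = build_ocr_lexicon(corpus, 1600, 1699)
for word, freq in lexicon_17th.most_common(3):
    print(word, freq)
```

Restricting the count to a period keeps modern forms such as "wereld" out of a lexicon meant for 17th-century material.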
OCR lexicon: example
1550-1750: song 820, rihte 818, theire 818, manye 818, sume 815, Do 814, Whiche 811, fyrst 811, while 811, Water 810, wt 809, shalbe 808, thingis 807, again 806, sona 806, wa 805, mode 804, work 802, between 801, law 799, moder 798, mis 798, softe 798
> 1900: television 418, electronic 375, video 194, hormone 176, jazz 162, eco 142, software 136, vitamin 128, movie 121, taxi 113, isotopic 108, electronics 95, radar 86, basically 71, sabotage 71, homozygote 70, psychedelic 67, phonemic 66, insulin 64, zap 64, antibody 61, fungicidal 61
The IR lexicon
IR lexicon: most important information categories
Word forms (lists of words) +
- frequency information
- quotes (dated sources) from corpora or electronic dictionaries
- MODERN LEMMA (comparable to a dictionary headword) linked to spelling variants and inflected forms of the same word
The modern lemma is used for searching in texts
Standard use in corpus linguistics and modern historical lexicography
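Searching through the modern lemma can be illustrated with a small sketch. The dictionary `ir_lexicon`, the functions `expand_query` and `search`, and the three sample documents are hypothetical; the word forms are taken from the variation lists elsewhere in this presentation. The sketch shows the idea of query expansion only, not the IMPACT implementation.

```python
# Hypothetical IR-lexicon fragment: modern lemma -> attested historical word forms.
ir_lexicon = {
    "wereld": {"werelt", "weerelt", "wereld", "waereld", "werlt"},
    "uiterlijk": {"uyterlijck", "wterlijck", "uiterlyk", "uiterlijk"},
}

def expand_query(lemma, lexicon):
    """Return all word forms to search for, falling back to the query itself."""
    return sorted(lexicon.get(lemma, {lemma}))

def search(lemma, documents, lexicon):
    """Return ids of documents containing any form linked to the modern lemma."""
    forms = set(expand_query(lemma, lexicon))
    return [doc_id for doc_id, text in documents.items()
            if forms & set(text.lower().split())]

# Hypothetical documents with historical spellings.
documents = {
    1: "de gantsche weerelt",
    2: "het uyterlijck vertoon",
    3: "de waereld is groot",
}
print(search("wereld", documents, ir_lexicon))  # [1, 3]
```

The user types only the modern form; the lexicon supplies the historical spellings and inflected forms.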
<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
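An entry in this format can be read with Python's standard `xml.etree.ElementTree` module. The sketch below uses a shortened copy of the entry; the closing tags are added here as an assumption so that the fragment is well-formed, and `lemma_to_forms` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the entry shown on the slide; the closing tags are added
# here (an assumption) so that the fragment is well-formed and can be parsed.
xml_fragment = """
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>
"""

def lemma_to_forms(xml_text):
    """Map each modern lemma to the written forms attested under it."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for entry in root.iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma")
        forms = [w.text for w in entry.iter("written_form")]
        mapping.setdefault(lemma, []).extend(forms)
    return mapping

print(lemma_to_forms(xml_fragment))  # {'aantuilen': ['tuyld']}
```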
Tools for lexicon building and application of lexica
Types variation (spelling, inflection…)
uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
I
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
II
(patterns to predict variation)
(some variants are predictable with patterns; others need to be taken from a lexicon)
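The split between predictable and unpredictable variants can be sketched as follows. The three rewrite patterns, the set `MODERN_FORMS` and the `EXCEPTIONS` table are hypothetical and chosen only to fit a few forms from the lists above; they are not the patterns used in IMPACT.

```python
import re

# Hypothetical rewrite patterns (historical spelling -> modern spelling).
PATTERNS = [(r"uy", "ui"), (r"ck", "k"), (r"lt$", "ld")]

# Hypothetical set of known modern forms.
MODERN_FORMS = {"uiterlijk", "wereld"}

# Attested forms that the patterns cannot reach; these have to be listed
# explicitly, i.e. taken from a lexicon.
EXCEPTIONS = {"weurlt": "wereld", "wterlijck": "uiterlijk"}

def normalise(form):
    """Return the modern form for a historical form, or None if unknown."""
    candidate = form
    for pattern, replacement in PATTERNS:
        candidate = re.sub(pattern, replacement, candidate)
    if candidate in MODERN_FORMS:
        return candidate          # predicted by the patterns
    return EXCEPTIONS.get(form)   # looked up in the lexicon, else None

print(normalise("uyterlijck"))  # uiterlijk (via uy -> ui and ck -> k)
print(normalise("werelt"))      # wereld    (via lt -> ld at word end)
print(normalise("weurlt"))      # wereld    (only via the exception list)
```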
Computer lexica
- For OCR and OCR post-correction
- Improving searchability of historical text material by building a lexicon with variants and by using a modern lemma as a search entry
- Tools for lexicon building
- Tools for application of lexicon in search engines
- Lexicon cookbook
- Guidelines and tools to use the lexica in OCR
Tools (more specific)
- Lexicon building from corpus material and dictionaries
- Use of lexica in search engines
- Tool to extract spelling variation patterns from historical material
- Tool to relate previously unrecognised spelling variations to their standard form
- Tool to reduce previously unrecognised inflected forms to their base form
Ordinary words vs Names (NEs)
- Tools for the automatic recognition, classification and finding of variant names
- Wish of the libraries
- Separate regular vocabulary from names
- Reduce unpleasant results: Abimelech matched with apemelk! (b/p; i/e; e/0; k/ch) (apemelk means "monkey milk")
- NE lexica
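Why the Abimelech / apemelk match happens, and how a name lexicon can prevent it, can be shown with a deliberately crude sketch. The function `fuzzy_key` collapses the four equivalences listed on the slide into one matching key; this collapsing approach, the set `NE_LEXICON` and the function `may_match` are assumptions for illustration, not the IMPACT tools.

```python
# Hypothetical named-entity lexicon (lower-cased names).
NE_LEXICON = {"abimelech"}

def fuzzy_key(word):
    """Collapse the slide's equivalences b/p, i/e, e/0 and k/ch into one key."""
    key = word.lower()
    key = key.replace("ch", "k")  # k/ch
    key = key.replace("b", "p")   # b/p
    key = key.replace("i", "e")   # i/e
    key = key.replace("e", "")    # e/0: an 'e' may be absent
    return key

def may_match(query, token):
    """Match names exactly; allow fuzzy matching for ordinary vocabulary only."""
    if query.lower() in NE_LEXICON or token.lower() in NE_LEXICON:
        return query.lower() == token.lower()
    return fuzzy_key(query) == fuzzy_key(token)

# Without the name check, both words collapse to the same key.
print(fuzzy_key("Abimelech"), fuzzy_key("apemelk"))  # apmlk apmlk
# With the name check, the unwanted match is rejected.
print(may_match("apemelk", "Abimelech"))  # False
```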
A number of results for Dutch and German
Ground truth data: Dutch
Type and genre                               # words
Gold Standard Book                           300k
Random Set Books                             340k
Random Set Staten Generaal (Legal Papers)    2.5M
Gold Standard Staten Generaal                500k
Gold Standard Newspapers 1                   3.4M
Gold Standard Newspapers 2                   170k
Random Set Newspapers                        3.2M
total                                        13.1M
Lexicon coverage (1: ground truth books)
Lexicon                           Type coverage   Token coverage
1. Modern lexicon (e-Lex)         46%             76%
2. Core general lexicon           56%             84%
1 + 2                             63%             89%
Expansion with corpus material    78%             95%
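The difference between the two columns: type coverage counts each distinct word form once, while token coverage counts every occurrence, so frequent words weigh more heavily. A minimal sketch, with a hypothetical token list and lexicon:

```python
from collections import Counter

def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) of a lexicon over a token list."""
    counts = Counter(tokens)
    covered_types = [t for t in counts if t in lexicon]
    type_cov = len(covered_types) / len(counts)
    token_cov = sum(counts[t] for t in covered_types) / sum(counts.values())
    return type_cov, token_cov

# Hypothetical text of 8 tokens and 5 distinct types.
tokens = "de werelt is de werelt ende de aerde".split()
lexicon = {"de", "is", "werelt"}

type_cov, token_cov = coverage(tokens, lexicon)
print(f"type coverage {type_cov:.0%}, token coverage {token_cov:.0%}")
# type coverage 60%, token coverage 75%
```

Token coverage is higher here because the covered words are the frequent ones, which is the same pattern as in the tables.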
Lexicon coverage (2: GT newspapers 18th-19th C.)
Lexicon                           Type coverage   Token coverage
1. Modern lexicon (e-Lex)         40%             83%
2. Core general lexicon           41%             84%
1 + 2                             51%             89%
Expansion with corpus material    62%             95%
Lexicon coverage (3: GT Staten Generaal 19th C.)
Lexicon                           Type coverage   Token coverage
1. Modern lexicon (e-Lex)         51%             89%
2. Core general lexicon           47%             88%
1 + 2                             58%             93%
Expansion with corpus material    68%             97%
Lexicon coverage (4: GT Staten Generaal 20th C.)
Lexicon                           Type coverage   Token coverage
1. Modern lexicon (e-Lex)         70%             93%
2. Core general lexicon           66%             93%
1 + 2                             76%             96%
Expansion with corpus material    81%             98%
Lexicon coverage (5: Genesis, 1637 bible)
Lexicon                           Type coverage   Token coverage
1. Modern lexicon (e-Lex)         31%             61%
2. Core lexicon                   62%             83%
1 + 2                             65%             89%
Expansion with corpus material    87%             98.6%
Lexicon coverage (6: P.C. Hooft, histories)
Lexicon                           Type coverage   Token coverage
1. Modern lexicon (e-Lex)         26%             67%
2. Core lexicon                   47%             88%
1 + 2                             50%             90%
Expansion with corpus material    58%             96%
Evaluation of OCR
- Finereader SDK (version 9, 10)
- External dictionary interface (implementation module)
Challenge
- Translation of corpus frequencies to weights 0-100
- Broken words, case-sensitivity, …
- Problem with long ‘s’ (workaround)
Lexicon Data
- IMPACT OCR-lexicon for Dutch
- Finereader internal lexicon
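The slides do not say how corpus frequencies were translated to weights. One plausible approach, shown here purely as an assumption, is a logarithmic scale, so that a handful of very frequent words does not push all other words towards zero. The function name `frequency_to_weight` is hypothetical and is not part of the Finereader SDK.

```python
import math

def frequency_to_weight(freq, max_freq):
    """Map a corpus frequency (>= 1) to an integer weight in 0..100 (log scale)."""
    if freq < 1 or max_freq < 1:
        raise ValueError("frequencies must be at least 1")
    if max_freq == 1:
        return 100
    weight = round(100 * math.log10(freq) / math.log10(max_freq))
    return min(100, weight)  # clamp in case freq exceeds max_freq

# Hypothetical frequencies, with 10000 as the highest frequency in the lexicon.
for freq in (1, 100, 10000):
    print(freq, frequency_to_weight(freq, 10000))
# 1 0
# 100 50
# 10000 100
```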
OCR results: word recognition rate
Dataset: DPO35
- With ABBYY internal Dutch lexicon: 88.8%
- With IMPACT lexicon for Dutch (case hyphenation): 90.9%
- With IMPACT lexicon for Dutch (case hyphenation + long S problem): 93.5%
An example:
OCR at the beginning of the project:
A. De eerde was de gevaarlykflti om de verlei¬ding aan 't Hof; de tweede de ftillie en veiligde; de derde de zwaarde, daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest.
Results:
A. De eerste was de gevaarlykste om de verlei-ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.
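Several errors in the first version come from the long 's' being read as 'f' (onbefchaafde, Menfchen, beftieren). A lexicon can help repair these after OCR. The sketch below is a simple post-correction idea under stated assumptions, not the workaround used in IMPACT: it tries every reading of 'f' as 's' and accepts a candidate only if exactly one is in a (hypothetical) lexicon.

```python
from itertools import product

# Hypothetical lexicon of lower-cased word forms.
LEXICON = {"onbeschaafde", "menschen", "bestieren", "hof"}

def long_s_candidates(word):
    """Yield every spelling obtained by reading some 'f' characters as 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)

def correct_long_s(word, lexicon):
    """Replace 'f' by 's' where that yields exactly one known word."""
    if word.lower() in lexicon:
        return word  # already a known word, e.g. "Hof"
    matches = [c for c in long_s_candidates(word) if c.lower() in lexicon]
    return matches[0] if len(matches) == 1 else word

for token in ["onbefchaafde", "Menfchen", "beftieren", "Hof", "gevaarlykflti"]:
    print(token, "->", correct_long_s(token, LEXICON))
```

The number of candidates doubles with every 'f' in the word, which is acceptable for ordinary word lengths. Errors that are not f/s confusions, such as gevaarlykflti, are left unchanged.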
Dictionary               16th century        18th century        19th century
                         Errors  Reduction   Errors  Reduction   Errors  Reduction
No Lexicon               1306    -           827     -           2074    -
Optimal Lexicon          756     42%         395     52%         612     70%
Modern Lexicon           1096    16%         501     39%         888     57%
W.Historical Lexicon     938     28%         481     42%         856     59%
Modern + Virtual H.L.    1011    25%         480     42%         849     59%
(Errors = no. of word errors; Reduction = reduction of error rate)
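The reduction figures appear to be relative to the "No Lexicon" baseline of the same century. A small check for the Optimal Lexicon row, with `error_reduction` as a hypothetical helper name:

```python
def error_reduction(baseline_errors, errors):
    """Percentage reduction in word errors relative to the baseline."""
    return 100 * (baseline_errors - errors) / baseline_errors

# (No Lexicon errors, Optimal Lexicon errors) per century, from the table.
optimal = {"16th": (1306, 756), "18th": (827, 395), "19th": (2074, 612)}
for century, (baseline, errors) in optimal.items():
    print(century, f"{error_reduction(baseline, errors):.0f}%")
# 16th 42%
# 18th 52%
# 19th 70%
```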
Languages in IMPACT
Dutch, German, English, Spanish, French
Polish, Czech, Slovene and Bulgarian
- Cross language perspective paper
- Parallel OCR and IR experiments
- GT datasets
- Language tools: language independent
- Except for the 3 core languages: proof-of-concept lexica
English in IMPACT
Lexicon building using OED
- OCR lexicon from quotations full text, possibly supplemented with corpus material
- IR lexicon from headword variants in quotations (small demo)
Named Entity Recognition on newspaper material
- NE lexicon
- Gold standard corpus NE recognition (CONLL)
  (Named Entity Recognition Task Definition, by N. Chinchor, E. Brown, L. Ferro, and P. Robinson, Version 1.4 (1999))
- PER, LOC, ORG
Research into the possible benefits from exclusion of modern words from the OCR lexicon
An indemnity shall be granted to the surfer….
… bikini …
Retrieval demonstrator
Indexing and retrieval library (Java) implemented on the Lucene search engine
Lexicon in MySQL database
OCR with Finereader SDK and external dictionary interface of about 2000 images of the Dutch Ground Truth selection
Page XML output [in framework]
NE tagging
Indexing and retrieval while using lexicon and NE tagging