Computer Lexica in OCR and Retrieval

Date post: 15-Jun-2015
Upload: biblioteca-nacional-de-espana
Description:
Presented at the "IMPACT demonstration session at the BNE". October. Biblioteca Nacional de España
Page 1: Computer Lexica in OCR and Retrieval

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Computer Lexica in OCR and Retrieval

Katrien Depuydt, Jesse de Does (Instituut voor Nederlandse Lexicologie, Leiden)

Page 2: Computer Lexica in OCR and Retrieval


4 March 2009 presentation The Hague 2

Can we handle ‘de wereld’ (‘the world’)?

werreid

Page 3: Computer Lexica in OCR and Retrieval


IMPACT <Demo Day BL, 12 July 2011> 3

OCR: ABBYY FineReader SDK with built-in standard Dutch dictionary

OCR: ABBYY FineReader SDK combining the built-in modern Dutch dictionary with the IMPACT external historical lexicon of Dutch:

werreld

Page 4: Computer Lexica in OCR and Retrieval


werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled

RETRIEVAL: key in the modern form WERELD and find all of these variants
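The retrieval idea on this slide can be sketched as a query expansion step. The following is a minimal Python illustration, not the IMPACT retrieval library: the table layout, the function names and the three sample documents are assumptions, and only a handful of the variants listed above are included.

```python
from collections import defaultdict

# Toy lexicon: (historical word form, modern lemma) pairs.
# The forms come from the slide; the data structure is an assumption.
LEXICON_ROWS = [
    ("werelt", "wereld"),
    ("weerelt", "wereld"),
    ("waereld", "wereld"),
    ("werreld", "wereld"),
    ("vvereld", "wereld"),
    ("wereld", "wereld"),
]


def build_variant_index(rows):
    """Map each modern lemma to the set of word forms recorded for it."""
    index = defaultdict(set)
    for form, lemma in rows:
        index[lemma.lower()].add(form.lower())
    return index


def expand_query(term, index):
    """Return all forms to search for; fall back to the term itself."""
    forms = index.get(term.lower())
    return sorted(forms) if forms else [term.lower()]


def search(term, documents, index):
    """Return ids of documents containing any variant of the query term."""
    wanted = set(expand_query(term, index))
    return [doc_id for doc_id, text in documents.items()
            if wanted & set(text.lower().split())]


index = build_variant_index(LEXICON_ROWS)
docs = {1: "de gantsche werelt", 2: "een nieuwe waereld", 3: "het huis"}
print(search("WERELD", docs, index))
```

The user types only the modern lemma; the lexicon supplies the historical forms, so no fuzzy matching is needed at query time.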

Page 5: Computer Lexica in OCR and Retrieval


IMPACT workshop, Bratislava, May 7, 2010 5

The long s problem: an example

OCR at start of project

A. De eerde was de gevaarlykflti om de verlei¬ding aan 't Hof; de tweede de ftillie en veiligde;de derde de zwaarde, daar hy byna drie millioenenharde en onbefchaafde Menfchen beftieren moest.


Page 6: Computer Lexica in OCR and Retrieval


Results April 2010

A. De eerste was de gevaarlykste om de verlei-ding aan 't Hof; de tweede de stilste en veiligste;de derde de zwaarste, daar hy byna drie millioenenharde en onbeschaafde Menschen bestieren moest.

Page 7: Computer Lexica in OCR and Retrieval


Workaround: “integrated postcorrection”. Tell the OCR engine that “eerfte” is acceptable, then postcorrect it afterwards with the lexicon.

In this way we keep the engine from turning the word into “eerde” (earth) instead of “eerste” (first).
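The postcorrection step of this workaround can be sketched as follows. This is a minimal Python illustration under stated assumptions: the mini-lexicon, the function names and the rule "correct only when exactly one lexicon word is reachable" are hypothetical and not taken from the IMPACT tools.

```python
from itertools import product

# Hypothetical mini-lexicon of accepted word forms (an assumption).
LEXICON = {"eerste", "gevaarlykste", "stilste", "onbeschaafde", "hof"}


def long_s_candidates(token):
    """Yield every reading of token in which some 'f' is read as 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def postcorrect(token, lexicon=LEXICON):
    """Return the lexicon word reachable by f->s swaps, else the token."""
    if token in lexicon:
        return token
    matches = [c for c in long_s_candidates(token) if c in lexicon]
    # Only correct when the lexicon leaves exactly one reading.
    return matches[0] if len(matches) == 1 else token


print(postcorrect("eerfte"))
print(postcorrect("onbefchaafde"))
```

The candidate set doubles with every "f" in the token, which is acceptable for single words but would need pruning for long strings.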

Page 8: Computer Lexica in OCR and Retrieval


Overview

- What is a computer lexicon
- Lexica in IMPACT
- Tools for lexicon building and applying lexica
- Some results
- Searching
- Demonstration

Page 9: Computer Lexica in OCR and Retrieval


What is a computer lexicon?

Page 10: Computer Lexica in OCR and Retrieval


Computer lexicon vs electronic dictionary (1)

An electronic dictionary is:

- Digitised full text (no pictures)
- For human use
- Ideally: searchable, with explicitly coded material (XML), such as a lemma, part of speech (PoS), meaning, quotes etc.
- Examples: OED online, WNT online

Page 11: Computer Lexica in OCR and Retrieval


Dictionary XML (example)


Page 13: Computer Lexica in OCR and Retrieval


Computer Lexicon vs Electronic Dictionary (2)

A computer lexicon is:

- Always in a structured digital format (XML, relational database)
- Main purpose: computer application
- Explicitly coded information (e.g. lemma wereld, part of speech noun, morphology werelden, werelds …, syntax)

Examples of use:

- Linguistic enrichment of text material
- ‘Advanced’ searching (words with all spelling variants and inflections)
- Automatic summarization, keyword extraction…


Page 15: Computer Lexica in OCR and Retrieval


Lexica in IMPACT

Page 16: Computer Lexica in OCR and Retrieval


The OCR lexicon

An OCR lexicon is:

- A checked list of words in a language
- Based on a corpus (collection) of dated texts (selection!)
- Preferably with frequency information
- Preferably from the same time period or of the same text type as the texts you wish to digitize

Page 17: Computer Lexica in OCR and Retrieval


OCR lexicon: example

1550-1750           > 1900
song      820       television   418
rihte     818       electronic   375
theire    818       video        194
manye     818       hormone      176
sume      815       jazz         162
Do        814       eco          142
Whiche    811       software     136
fyrst     811       vitamin      128
while     811       movie        121
Water     810       taxi         113
wt        809       isotopic     108
shalbe    808       electronics   95
thingis   807       radar         86
again     806       basically     71
sona      806       sabotage      71
wa        805       homozygote    70
mode      804       psychedelic   67
work      802       phonemic      66
between   801       insulin       64
law       799       zap           64
moder     798       antibody      61
mis       798       fungicidal    61
softe     798
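A period-specific frequency list like the one above could be derived from a dated corpus roughly as follows. This is a Python sketch only: the three corpus texts are made up, the tokenizer is deliberately naive, and the manual checking that makes an OCR lexicon a "checked list" is not modelled.

```python
import re
from collections import Counter

# Hypothetical dated corpus: (year, text) pairs. The texts are made up.
CORPUS = [
    (1602, "the fyrst law of the water"),
    (1688, "the law shalbe fyrst"),
    (1955, "television and video software"),
]

TOKEN = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode-aware


def frequency_lexicon(corpus, start, end):
    """Count word forms in texts dated within [start, end]."""
    counts = Counter()
    for year, text in corpus:
        if start <= year <= end:
            counts.update(TOKEN.findall(text.lower()))
    return counts


early = frequency_lexicon(CORPUS, 1550, 1750)
for word, freq in early.most_common(3):
    print(word, freq)
```

Restricting the count to a date range is what keeps words such as "television" out of a lexicon meant for early modern texts.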

Page 18: Computer Lexica in OCR and Retrieval


The IR lexicon

IR lexicon: most important information categories

Word forms (lists of words), plus:

- frequency information
- quotes (dated sources) from corpora or electronic dictionaries
- MODERN LEMMA (comparable to a dictionary entry) linked to spelling variants and inflected forms of the same word

The modern lemma is used for searching in texts.
This is standard practice in corpus linguistics and modern historical lexicography.

Page 19: Computer Lexica in OCR and Retrieval


<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>
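Reading such an entry programmatically might look like the sketch below, which uses Python's standard xml.etree.ElementTree. The element names are those of the example above; the embedded copy is abridged, its quote is shortened, and the closing tags are supplied here as an assumption so that the fragment parses.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the entry shown above; closing tags are supplied here
# so that the fragment parses, and the quote is shortened.
ENTRY_XML = """
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <quote>Sy acht het boertery, en tuyld daer weer op an,</quote>
          <document_id>204</document_id>
        </attestation>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>
"""


def read_entries(xml_text):
    """Yield (modern lemma, part of speech, written forms) per entry."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma")
        pos = entry.findtext("POS")
        forms = [el.text for el in entry.iter("written_form")]
        yield lemma, pos, forms


for lemma, pos, forms in read_entries(ENTRY_XML):
    print(lemma, pos, forms)
```

The pairing of modern_lemma with every written_form below it is what lets a search for the modern lemma reach a historical form such as "tuyld".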

Page 20: Computer Lexica in OCR and Retrieval


Tools for lexicon building and application of lexica

Page 21: Computer Lexica in OCR and Retrieval


Types of variation (spelling, inflection…)

I (patterns to predict variation):

uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk

II (a number are predictable with patterns, others need to be taken from a lexicon):

werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled

Page 22: Computer Lexica in OCR and Retrieval


Neil Fitzgerald, 7th July 2011 22

Page 23: Computer Lexica in OCR and Retrieval


Computer lexica

- For OCR and OCR post-correction
- For improving the searchability of historical text material, by building a lexicon with variants and using a modern lemma as the search entry

- Tools for lexicon building
- Tools for application of lexica in search engines
- Lexicon cookbook

Page 24: Computer Lexica in OCR and Retrieval


Tools (more specific)

- Lexicon building from corpus material and dictionaries
- Use of lexica in search engines
- Tool to extract spelling variation patterns from historical material
- Tool to relate previously unrecognised spelling variations to their standard form
- Tool to reduce previously unrecognised inflected forms to their base form

Page 25: Computer Lexica in OCR and Retrieval


Spelling variation tools (pattern-based)

Language-independent approach:

- Supervised rule (pattern) induction from pairs (“modern” word, historical word), yielding patterns like aa/ae, s/z, …
- Pattern weights are computed from example material

Additional approaches possible, e.g.:

- Use of aligned data (parallel historical text and modern version)
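Pattern induction from word pairs can be illustrated with a small Python sketch. The slides do not say how the IMPACT tool aligns the two words, so this version simply uses difflib.SequenceMatcher from the standard library as a stand-in; the training pairs and function names are assumptions. Because it diffs single characters without context, it yields a/e where the slide writes aa/ae.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical training pairs: (modern word, historical word).
PAIRS = [
    ("wereld", "waereld"),
    ("jaar", "jaer"),
    ("daar", "daer"),
    ("zee", "see"),
    ("zon", "son"),
]


def induce_patterns(pairs):
    """Count character-level rewrite patterns (modern, historical)."""
    counts = Counter()
    for modern, historical in pairs:
        matcher = SequenceMatcher(None, modern, historical, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                counts[(modern[i1:i2], historical[j1:j2])] += 1
    return counts


def pattern_weights(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}


counts = induce_patterns(PAIRS)
weights = pattern_weights(counts)
for pattern, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(pattern, round(weight, 2))
```

Relative frequency is only one possible weighting; a production tool would likely also condition each pattern on its left and right context.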

Page 26: Computer Lexica in OCR and Retrieval


Lemmatization

Reduction of historical word forms to modern lemma:

Historical word → (1) pattern matching → standard (“modern”) spelling → (2) lemmatizer → lemma form

Dystels → (1) → distels → (2) → distel

When we have a perfect or near-perfect modern full form lexicon, the second step is simply lexicon lookup.

But:

1) We will not have full form information for many lemmata (especially the historical ones)
2) Even lemmata present in modern language may have historical inflected forms different from the present-day paradigm
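The two steps can be sketched in Python as below, for the easy case where the modern full form lexicon already contains the form. The pattern list, the toy lexicon and the function names are assumptions made for illustration.

```python
# Hypothetical spelling patterns (historical, modern) and a toy modern
# full-form lexicon mapping inflected forms to lemmata. Both are assumptions.
PATTERNS = [("y", "i"), ("ae", "aa"), ("ck", "k")]
FULL_FORMS = {"distels": "distel", "distel": "distel", "jaren": "jaar"}


def modern_spellings(word, patterns=PATTERNS):
    """Step 1: generate candidate modern spellings by applying patterns."""
    candidates = {word}
    for old, new in patterns:
        # Apply each pattern to every candidate found so far.
        candidates |= {c.replace(old, new) for c in candidates}
    return candidates


def lemmatize(historical_word, full_forms=FULL_FORMS):
    """Step 2: look the candidates up in the modern full-form lexicon."""
    hits = {}
    for candidate in modern_spellings(historical_word.lower()):
        if candidate in full_forms:
            hits[candidate] = full_forms[candidate]
    return hits


print(lemmatize("Dystels"))
```

Note that str.replace rewrites every occurrence of a pattern at once; a fuller version would also try each occurrence separately and rank candidates by pattern weight.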

Page 27: Computer Lexica in OCR and Retrieval


Lemmatization and reverse lemmatization

We also need a lemmatization process for these situations.

A typical lemmatizer assigns some standard form (infinitive, nominative, stem) to inflected forms. It is usually based on patterns relating the inflected form to the standard form.

But:

- Matching these patterns can be hard to combine with matching both spelling variation patterns and OCR errors (bok/bokken/bokkeu)
- We adopt the solution of actually constructing a “hypothetical modern full form lexicon” containing the most plausible paradigmatic expansions of lemmata
- This construction is carried out by means of a statistical reverse lemmatizer
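The slides do not describe the statistical model behind the reverse lemmatizer. As a stand-in, the Python sketch below learns suffix rules from known (lemma, form) pairs and ranks the expansions of a new lemma by relative frequency; the training pairs, the one-character context and all names are assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (lemma, inflected form) pairs.
KNOWN_PAIRS = [
    ("bok", "bokken"),
    ("rok", "rokken"),
    ("stok", "stokken"),
    ("distel", "distels"),
    ("lepel", "lepels"),
    ("bal", "ballen"),
]


def learn_rules(pairs, context=1):
    """Count suffix rewrite rules, keyed by the lemma's final characters."""
    rules = defaultdict(Counter)
    for lemma, form in pairs:
        if form.startswith(lemma):
            key = lemma[-context:]
            rules[key][form[len(lemma):]] += 1
    return rules


def expand(lemma, rules, context=1):
    """Return hypothetical inflected forms, most frequent rule first."""
    endings = rules.get(lemma[-context:], Counter())
    total = sum(endings.values())
    return [(lemma + ending, count / total)
            for ending, count in endings.most_common()]


rules = learn_rules(KNOWN_PAIRS)
print(expand("sok", rules))
print(expand("tafel", rules))
```

For "tafel" the sketch proposes both "tafels" and the implausible "tafellen", with the latter ranked lower; such hypothetical forms are exactly what the attestation step later has to confirm or reject.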

Page 28: Computer Lexica in OCR and Retrieval


Attestation

From hypothetical (non-witnessed) lexicon content to attested word forms in “real” text:

- Automatic selection of candidate attestations
- Manual work: verification and correction

Two approaches:

- Dictionary based (INL): Woordenboek der Nederlandsche Taal
- Corpus based (LMU, INL): Dutch DBNL corpus

Page 29: Computer Lexica in OCR and Retrieval


IMPACT Dictionary Attestation Tool

Lexicon building at work: verifying attestations in historical dictionaries

Task: find the variants of a headword as they occur in the quotations

Headword: work

Quotations (containing the variants):

• We are working on what works.

• Depart from me, ye that worke iniquity.

• She worcketh knittinge of stockings.

Page 30: Computer Lexica in OCR and Retrieval


IMPACT Dictionary Attestation Tool

Task: find the variants of a headword as they occur in the quotations

Automatically (preprocessing)

• match literally, e.g. work → work, Work

• match using existing lexica and lists, e.g. work → works, worked, wrought

• approximate matching, e.g. work → worke

By hand (using the tool)

• correct automatic mismatches, e.g. works → words, worms

• find missed matches, e.g. work → worketh, wrowght

Input: electronic historical dictionary; database with lemmata and quotations
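The automatic preprocessing stages can be sketched in Python as follows. This is not the Attestation Tool's code: the list of known forms, the similarity threshold of 0.8 and the use of difflib.SequenceMatcher for approximate matching are all assumptions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of known forms per headword (an assumption).
KNOWN_FORMS = {"work": {"works", "worked", "wrought", "working"}}


def match_type(headword, token, threshold=0.8):
    """Classify how a quotation token relates to the headword, if at all."""
    low = token.lower()
    if low == headword:
        return "literal"
    if low in KNOWN_FORMS.get(headword, set()):
        return "lexicon"
    if SequenceMatcher(None, headword, low).ratio() >= threshold:
        return "approximate"
    return None


def candidate_attestations(headword, quotation):
    """Return (token, match type) pairs for a human to verify."""
    tokens = re.findall(r"[^\W\d_]+", quotation)
    return [(t, kind) for t in tokens
            if (kind := match_type(headword, t)) is not None]


print(candidate_attestations("work", "Depart from me, ye that worke iniquity."))
print(candidate_attestations("work", "She worcketh knittinge of stockings."))
```

With this threshold "worke" is found automatically while "worcketh" is missed, which is the kind of case left for manual correction in the tool.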

Page 31: Computer Lexica in OCR and Retrieval


IMPACT Attestation Tool

Screenshot of the tool, with these parts labelled:

- Lemma (headword)
- Quotations, sorted by uncertainty
- Up-to-date overview of what is done and what needs to be done
- Done by this user so far

Page 32: Computer Lexica in OCR and Retrieval


IMPACT Lexicon Tool

Task: find and verify attestations in a historical corpus

Automatically (preprocessing = apply lemmatizer)

• match literally, e.g. work → work, Work

• match using existing lexica and lists, e.g. work → works, worked, wrought

• matching using spelling variation module, e.g. uiterlijk → uyterlick

By hand (using the tool)

• assign correct lemma, e.g. was (N) vs. zijn (V)

• group tokens belonging together, e.g. konings zoon → koningszoon

• select attestations

Page 33: Computer Lexica in OCR and Retrieval


Corpus-based lexicon building: Impact Lexicon Tool

Page 34: Computer Lexica in OCR and Retrieval


General vocabulary vs. named entities

- Tools for lexicon building described so far: applicable to general lexicon
- Tools for NE recognition, classification and variant matching:
  - library requirement
  - distinguish general vocabulary from NEs
  - avoid unpleasant mix-ups like Abimelech → apemelk! (b/p; i/e; e/0; k/ch)
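The mix-up can be replayed step by step in Python. In this sketch "e/0" is read as deleting one "e", which is an interpretation; the pattern chain, the named-entity list and the guard function are hypothetical illustrations of why named entities are kept apart from general vocabulary.

```python
# The four patterns from the slide as (name side, word side) rewrites.
# "e/0" is read here as deleting one "e" (an interpretation).
CHAIN = [("b", "p"), ("i", "e"), ("lech", "lch"), ("ch", "k")]

NAMED_ENTITIES = {"abimelech"}  # hypothetical NE list


def replay(word, chain=CHAIN):
    """Apply each rewrite once, in order, and record every step."""
    steps = [word]
    for old, new in chain:
        word = word.replace(old, new, 1)
        steps.append(word)
    return steps


def allow_variant_matching(token, named_entities=NAMED_ENTITIES):
    """Skip general-vocabulary variant matching for known named entities."""
    return token.lower() not in named_entities


print(" -> ".join(replay("abimelech")))
print(allow_variant_matching("Abimelech"))
```

Each single pattern is plausible on its own; only the combination produces the unwanted match, so the guard checks the NE list before any pattern is applied.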

Page 35: Computer Lexica in OCR and Retrieval


Improvement of state of the art / innovation

- We use existing computational linguistic approaches, but figure out how to apply them to historical language
- We develop a workflow to deal with the problems posed by historical language, figuring out how all pieces fit together:
  - Data selection and acquisition
  - Manual work
  - Computational linguistics tools

Page 36: Computer Lexica in OCR and Retrieval


Languages in IMPACT

Dutch, German, English, Spanish, French
Polish, Czech, Slovene and Bulgarian

- Cross-language perspective paper
- Parallel OCR and IR experiments
- Ground truth (GT) datasets

- Language tools: language independent
- Apart from the 3 core languages: proof-of-concept lexica


Page 37: Computer Lexica in OCR and Retrieval


OCR evaluation results (preliminary!)

Page 38: Computer Lexica in OCR and Retrieval


1. Czech

- Co jest konstituce?, čili, Krátký, prostonárodní wýklad hlawnějších zásad konstitucí ewropejských, 1848
- Ferina Lišák z Kuliferdy a na Klukově, čili, Kratičká historye zlopověstných kousků starého Reinecke, 1848
- Homerowa Iliada, 1802
- Na den narození neimocněišího, a neijasněišího cysare rímského, téz dědičného rakauského a krále ceského, Frantiska II., w Praze 12. den mesyce Unora, léta 1805, 1805
- Plody sborů učenců řeči českoslowanské prešporského, 1836
- Rozprawy o gmenách, počátkách i starožitnostech národu Slawského a geho kmeni /, 1830
- Sokol, 1872
- Základowé pitwy (Anatomie), čili, Soustawnj rozbor a popis těla lidského a gednotliwých geho částek, 1840


Page 40: Computer Lexica in OCR and Retrieval


2. Dutch

18th and 19th century books, newspapers, parliamentary papers

- Provinciale Overijsselsche en Zwolsche courant : staats-, handels-, nieuws- en advertentieblad, 1852-1852
- Rechtsgeleerd advis in de zaak van den gewezen stadhouder, en over deszelfs schryven aan de gouverneurs van de Oost- en West-Indische bezittingen van den staat [...]. Ingelevert [...] op den 7 january 1796. / By B. Voorda et al, 1796-1796
- Verhaal van het levensgevaar, waar in zig drie Rotterdamsche burgers [...] bevonden hebben, te Utrecht, 1784-1784
- Vrijmoedige aanmerkingen, over de uitsluiting van allen die door publieke armkassen bedeeld worden, als stemgerechtigden [...] bij eene oproeping van het Nederlandsche volk tot eene Nationaale Conventie, 1795-1795

Page 41: Computer Lexica in OCR and Retrieval


Precision: 0.8432889410216431, Recall: 0.843331934927516


Page 43: Computer Lexica in OCR and Retrieval


English

16th-19th century material

Sources for lexicon building: OED, ECCO


Page 45: Computer Lexica in OCR and Retrieval


French

17th century books

- Conduite du jugement naturel où tous les bons esprits de l'un et l'autre sexe pourront facilement puiser la pureté de la science, par M. Jacques Forton, sieur de S. Ange,..., 1653
- Dissertation de la philosophie en général, 1668
- La Dialectique du sieur de Launay, contenant l'art de raisonner juste sur toute sorte de matières..., 1673
- Lettre de M. Gadroys à M. de La Grange Trianon,... pour servir de réponse à celle que M. de Castelet a écrite contre les raisons de M. Descartes touchant le flux et le reflux de la mer. - Seconde lettre de M. Gadroys... [au même, sur le même sujet.], 1677
- Traitez de métaphysique démontrée selon la méthode des géomètres. [Par le sieur de La Coudraye.], 1693


Page 47: Computer Lexica in OCR and Retrieval


German

- Das Buch des heyligen Römischen Reichs unnderhalltunge, 1501
- Die Poesie ihr Wesen und ihre Formen mit Grundzügen der vergleichenden Literaturgeschichte, 1884
- Echo Deß Hochzeitlichen Te Deum Laudamus, 1722
- Ergebnisse der Erhebungen über die Beschäftigung gewerblicher Arbeiter an Sonn- und Festtagen, Bd.:1, Gruppe I bis VII der Gewerbestatistik, Berlin, 1887, 1887
- Quedlinburgisches Kreis-Tags-Memorial, 1673
- Von der Regierung der Kirche und den unterschiedlichen Würden der Geistlichkeit *(full title in comments), 1779
- Warhaffter und grundlicher Bericht uß was Ursachen Martinus du Voysin (zu Basel verburgerter Krämer) inn der Statt Surseew im Aargöw, ..., den 13. Tag Octobris deß 1608. Jars erstlich enthauptet, und volgends verbrennt worden, 1609


Page 49: Computer Lexica in OCR and Retrieval


Polish

- Adwersaria, albo terminata sprawy wojennej, która się toczyła w wołoskiej ziemi z tureckim cesarzem, 1621
- Chorągiew Sarmacka w Wołoszech, to jest pospolite ruszenie i szczęśliwy powrót Polaków z Wołoch w roku 1621, 1621
- Diariusz wiadomości od wyjazdu króla z Wilna do Smoleńska, 1610
- Discurs o cenie pieniedzy teraznieyszey y o niektorych skutkach iey…, 1632
- Nowe Ateny, albo Akademia wszelkiey scyencyi pełna, na różne tytuły iak na classes podzielona, mądrym dla memoryału, idiotom dla nauki, politykom dla praktyki, melancholikom dla rozrywki erygowana ... . Część 3 albo Supplement., 1746
- Pasja żołnierzy obojga narodów w stolicy moskiewskiej krótko opisana, 1613
- Powodzenia niebezpiecznego ale szczęśliwego wojska j. k. m. w Multanach opisanie, 1601
- Relacja chwalebnej ekspedycji Jana Kazimierza, króla polskiego i szwedzkiego, 1650
- Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej, 1634
- Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej_BW, 1634
- Żałosne opisanie upadku króla hiszpańskiego na morzu i na lądzie, 1589


Page 51: Computer Lexica in OCR and Retrieval


Slovene

- Genovefa, 1841
- Gosp. Krištofa Šmida korarja avgustanskiga, zgodBe S. Pisma za mlade ljud..., 1850
- Kmetijske in rokodelske novice, 1844
- Kratkozhasne uganke, 1788
- Kuharske Bukve, 1799
- Marianske Kempensar, ali Dvoje bukuvze, 1769
- Novice kmetijskih, rokodelnih in narodskih reči, 1851
- Sgodbe svetiga pisma za mlade ljudi, 1830
- Ta male katechismus, 1768
- Vezhna pratika od gospodarstva, 1789
- Zerkviza na skali, 1855


Page 53: Computer Lexica in OCR and Retrieval


Retrieval demonstrator

- Indexing and retrieval library (Java) built on the Lucene search engine
- Lexicon in a MySQL database

- OCR with FineReader SDK and external dictionary interface of about 2000 images of the Dutch Ground Truth selection
- Page XML output [in framework]
- NE tagging
- Indexing and retrieval while using lexicon and NE tagging
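The last step, indexing with the lexicon and NE tags, can be sketched independently of Lucene. The Python below is a conceptual illustration only, not the Java library mentioned above; the in-memory index, the lookup table, the NE list and the sample pages are all assumptions.

```python
from collections import defaultdict

# Hypothetical lexicon lookup: historical form -> modern lemma.
FORM_TO_LEMMA = {"werelt": "wereld", "waereld": "wereld", "wereld": "wereld"}
NAMED_ENTITIES = {"utrecht"}  # hypothetical NE tagger output


def build_index(pages):
    """Index each token under its surface form, its lemma and an NE flag."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            index[("form", token)].add(page_id)
            lemma = FORM_TO_LEMMA.get(token)
            if lemma:
                index[("lemma", lemma)].add(page_id)
            if token in NAMED_ENTITIES:
                index[("ne", token)].add(page_id)
    return index


def lookup(index, field, term):
    """Return the sorted page ids stored under (field, term)."""
    return sorted(index.get((field, term.lower()), set()))


pages = {"p1": "de werelt is groot", "p2": "te utrecht in de waereld"}
index = build_index(pages)
print(lookup(index, "lemma", "wereld"))
print(lookup(index, "form", "wereld"))
print(lookup(index, "ne", "Utrecht"))
```

Storing the lemma at indexing time is an alternative to expanding the query at search time; which of the two the demonstrator uses is not stated on the slide.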

