Celi @Clic2014: OCR Errors & Named Entity Recognition in La Stampa Historical Archive

Date post: 08-Jul-2015
Upload: celi
Description:
Here is the first of our presentations at CLIC 2014: OCR errors and named entity recognition in La Stampa's Historical Archive
Transcript
Page 1: Celi @Clic2014: OCR Errors & Named Entity Recognition in La Stampa Historical Archive

OCR ERRORS & NAMED ENTITY RECOGNITION IN LA STAMPA’S HISTORICAL ARCHIVE

Andrea Bolioli, Eleonora Marchioni, Raffaella Ventaglio


Page 2

Project pipeline: Microfilm Scan and OCR, then Full text indexing, then NER and infographics.

The project was carried out in 2010-2011. Public website: www.archiviolastampa.it

5 million newspaper articles, 1910-2005

Page 3

We annotated about 16,000 errors and their corrections in a sample of 894 front-page articles.

A few examples (OCR output, with the correction in parentheses): dustin hoflman, hoftman, holfman, hollman, hotfman, hotlman (dustin hoffmann); pohtica (politica); de (dc); pei (pci); doc um e nto (documento); re- latore (relatore); ima (una); gh (gli).

OCR errors were classified by type: segmentation, hyphenation, character misrecognition, punctuation, graphics, etc.
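The character-misrecognition examples above are all within a few single-character edits of their corrections. As an illustration only (the slides describe manual annotation, not an automatic corrector), here is a minimal sketch that maps an OCR variant to the nearest entry of a word list by Levenshtein distance. The function names, the small lexicon, and the `max_dist` threshold are assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest(token, lexicon, max_dist=2):
    """Return the lexicon entry nearest to token, or None if even the
    nearest entry is more than max_dist edits away."""
    best = min(lexicon, key=lambda w: edit_distance(token, w), default=None)
    if best is not None and edit_distance(token, best) <= max_dist:
        return best
    return None


# Hypothetical lexicon built from the corrections listed on the slide.
LEXICON = ["hoffmann", "politica", "documento", "relatore"]

for variant in ["hoflman", "hollman", "pohtica"]:
    print(variant, "->", closest(variant, LEXICON, max_dist=3))
```

The threshold matters: "hollman" and "hotlman" are three edits away from "hoffmann", so a limit of 2 would leave them unmatched, while a looser limit risks wrong matches on very short tokens such as "ima" or "gh". Segmentation and hyphenation errors ("doc um e nto", "re- latore") need token joining first and are not handled by this sketch.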

Type and percentage of errors (errors / number of tokens)
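The ratio in this caption can be sketched as follows: count the annotated errors of each type and divide by the total number of tokens. The function name and the toy input are assumptions for illustration; they are not the figures from the chart.

```python
from collections import Counter


def error_rates(error_types, num_tokens):
    """Percentage of errors per type, computed as errors / number of tokens.

    error_types: one type label per annotated error.
    num_tokens:  total number of tokens in the annotated sample.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    counts = Counter(error_types)
    return {etype: 100.0 * n / num_tokens for etype, n in counts.items()}


# Toy input (hypothetical labels and token count, not the slide's data).
labels = ["Segmentation", "Hyphenation", "Segmentation", "Punctuation"]
for etype, pct in error_rates(labels, num_tokens=200).items():
    print(f"{etype}: {pct:.2f}%")
```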

OCR Errors in the Historical Archive

Page 4

People recognized in the front pages of the newspaper (number of articles).

Measures over a test corpus of 500 documents.

NEs that occur more than 10 times, extracted from 4.8M documents.
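The frequency cut-off mentioned here can be sketched as a simple count-and-filter step. This is a minimal illustration assuming the entity mentions are already available as a flat list of strings; the function name and the sample mentions are hypothetical.

```python
from collections import Counter


def frequent_entities(mentions, min_count=10):
    """Keep only the entities that occur more than min_count times."""
    counts = Counter(mentions)
    return {entity: n for entity, n in counts.items() if n > min_count}


# Hypothetical mentions: only "A" occurs more than 10 times.
mentions = ["A"] * 11 + ["B"] * 10 + ["C"]
print(frequent_entities(mentions))
```

Because OCR variants of the same name are counted as separate strings, a threshold like this also drops many misrecognized forms that occur only a handful of times.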

Named Entity Recognition

