Post on 15-Jun-2015
Europeana Newspapers:
The Gateway to European Newspapers Online
IFLA 2013 SATELLITE MEETING ON NEWSPAPER & GENLOC SECTIONS
Singapore, 14 August 2013
Clemens Neudecker
@cneudecker
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
Overview
• Objectives
• Overview of Dataset
• Workflows & Technologies
• Questions & Answers
Image: Nationaal Archief The Netherlands
Objectives
• Refinement of 10 million pages with OCR, OLR, NER
• Ingestion of metadata for 18 million pages in Europeana
• Create a full text content browser for newspapers
• Create a unified METS/ALTO profile (ENMAP)
• Produce tools to ease the creation of ENMAP objects
• Share best practices and provide recommendations
Who
12 content providers
2 networking partners
4 technology providers
1 aggregator
Recently associated
The data
Europeana Newspapers Dataset (1)
Europeana Newspapers Dataset (2)
Europeana Newspapers Dataset (3)
Europeana Newspapers Dataset (4)
The workflow
OCR @ UIBK
• OCR = Optical Character Recognition
• Technologies: ABBYY FineReader SDK
• State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts
• METS/ALTO package containing images, metadata & full text
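As an illustration of how the full text in such a package can be read back, the following is a minimal Python sketch that pulls the text lines out of an ALTO file. It is not the project's tooling: the sample document, the helper names `local_name` and `alto_text_lines`, and the choice to ignore hyphenation elements are assumptions made for this example. Only the standard ALTO element names (`TextLine`, `String`, `SP`) and the `CONTENT` attribute come from the ALTO schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened ALTO page used only for this example.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Europeana"/>
            <SP/>
            <String CONTENT="Newspapers"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Full"/>
            <SP/>
            <String CONTENT="text"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(xml_text):
    """Return one string per ALTO TextLine, joining its String elements."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_text_lines(SAMPLE_ALTO):
        print(line)
```

Matching on local element names rather than a fixed namespace is a deliberate simplification, since different ALTO versions use different namespace URIs.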
OLR @ CCS
• OLR = Optical Layout Recognition
• Technologies: docWorks
• Separation of columns, articles, headlines, page classes
• METS/ALTO package containing images, metadata & full text
NER @ KB
• NER = Named Entity Recognition
• Technologies: Stanford CRF-NER
• Open source: https://github.com/KBNLresearch/europeananp-ner
• Detection of named entities: Person, Location, Organization
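A CRF tagger of this kind typically assigns one label per token, so consecutive tokens with the same label still have to be merged into entity mentions. The Python sketch below shows that post-processing step only. It assumes the tagger output is already available as (token, label) pairs with `O` marking non-entity tokens; the function name `group_entities` and the sample sentence are hypothetical and not taken from the europeananp-ner code.

```python
def group_entities(tagged_tokens, outside_label="O"):
    """Merge runs of identically labelled tokens into (text, label) entities."""
    entities = []
    current_tokens, current_label = [], None
    for token, label in tagged_tokens:
        # Same entity label as the previous token: extend the current mention.
        if label == current_label and label != outside_label:
            current_tokens.append(token)
            continue
        # Label changed: close the mention that was being built, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))
        # Start a new mention, or reset when the token is outside any entity.
        if label != outside_label:
            current_tokens, current_label = [token], label
        else:
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


if __name__ == "__main__":
    # Hypothetical tagger output for one sentence.
    tagged = [
        ("Clemens", "PERSON"), ("Neudecker", "PERSON"), ("spoke", "O"),
        ("in", "O"), ("Singapore", "LOCATION"), ("for", "O"),
        ("Europeana", "ORGANIZATION"),
    ]
    for text, label in group_entities(tagged):
        print(label, text)
```

One limitation of this simple grouping: two distinct entities of the same class that appear directly next to each other are merged into a single mention.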
QA @ PRImA
• Layout and OCR evaluation
• Technologies: Ground truth + Evaluation Tools (IMPACT)
• In-depth, scenario-driven evaluation using profiles with more than 600 metrics
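To make the idea of comparing OCR output against ground truth concrete, here is a minimal Python sketch of one widely used measure, the character error rate based on edit distance. It is a simplified illustration and not the PRImA or IMPACT evaluation tools, which the slide describes as covering far more metrics; the function names and the sample strings are assumptions for this example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance normalised by the length of the ground-truth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical example: one misrecognised character in a seven-letter word.
    print(character_error_rate("Zeitung", "Zeitnng"))
```

Normalising by the ground-truth length means the rate can exceed 1.0 when the OCR output contains many spurious characters.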
Full-text search @ TEL
Blog: www.europeana-newspapers.eu
Workshop 16 Sept. 2013 (Amsterdam)
Thank you for your attention!
clemens.neudecker@kb.nl