Europeana Newspapers - the Gateway to European Newspapers Online

Europeana Newspapers - the Gateway to European Newspapers Online. IFLA 2013 Satellite Meeting on Newspaper & Genloc Sections, Science Centre Singapore, 14-15 August 2013, Singapore.

Europeana Newspapers:

The Gateway to European Newspapers Online

IFLA 2013 SATELLITE MEETING ON NEWSPAPER & GENLOC SECTIONS

Singapore, 14 August 2013

Clemens Neudecker

@cneudecker

This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp

Overview

• Objectives

• Overview of Dataset

• Workflows & Technologies

• Questions & Answers

Image: Nationaal Archief The Netherlands

Objectives

• Refinement of 10 million pages with OCR, OLR and NER

• Ingestion of metadata for 18 million pages into Europeana

• Create a full-text content browser for newspapers

• Create a unified METS/ALTO profile (ENMAP)

• Produce tools to ease the creation of ENMAP objects

• Share best practices and provide recommendations

Who

12 content providers

2 networking partners

4 technology providers

1 aggregator

Recently associated

The data

Europeana Newspapers Dataset (1)

Europeana Newspapers Dataset (2)

Europeana Newspapers Dataset (3)

Europeana Newspapers Dataset (4)

The workflow

OCR @ UIBK

• OCR = Optical Character Recognition

• Technologies: ABBYY FineReader SDK

• State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts

• METS/ALTO package containing images, metadata & full text
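
As an illustration of how the full text in such a package can be read back, here is a minimal Python sketch that extracts text lines from an ALTO file. The sample XML is hand-written for this example and is not project data; the helper names `local_name` and `alto_lines` are hypothetical. It assumes only the standard ALTO elements `TextLine` and `String` with a `CONTENT` attribute, and matches tags by local name because the ALTO namespace version can differ between files.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO-like sample for illustration only (not project data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Europeana" WC="0.98"/>
            <SP/>
            <String CONTENT="Newspapers" WC="0.95"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Online" WC="0.91"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining its String/@CONTENT values."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

Matching by local name keeps the sketch independent of the ALTO schema version; a production reader would more likely validate against the specific profile in use.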

OLR @ CCS

• OLR = Optical Layout Recognition

• Technologies: docWorks

• Separation of columns, articles, headlines, page classes

• METS/ALTO package containing images, metadata & full text
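
To give a feel for how article separation can surface in a METS file, the following Python sketch lists the labels of article-level divisions in a logical structure map. It assumes the standard METS namespace and a `structMap` with `TYPE="LOGICAL"`. The `TYPE` values on the `div` elements (`ARTICLE`, `ADVERTISEMENT`) and the function name `logical_units` are assumptions made for this example and are not taken from the ENMAP profile.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hand-written sample; the div TYPE values are assumptions, not ENMAP's.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="Newspaper">
      <mets:div TYPE="Issue">
        <mets:div TYPE="ARTICLE" LABEL="Harbour news"/>
        <mets:div TYPE="ARTICLE" LABEL="Weather report"/>
        <mets:div TYPE="ADVERTISEMENT" LABEL="Soap"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def logical_units(xml_text, unit_type="ARTICLE"):
    """Return the LABELs of all logical divs of the given TYPE, in document order."""
    root = ET.fromstring(xml_text)
    xpath = ".//mets:structMap[@TYPE='LOGICAL']//mets:div"
    return [
        div.get("LABEL", "")
        for div in root.findall(xpath, METS_NS)
        if div.get("TYPE") == unit_type
    ]


if __name__ == "__main__":
    print(logical_units(SAMPLE_METS))
```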

NER @ KB

• NER = Named Entity Recognition

• Technologies: Stanford CRF-NER

• Open source: https://github.com/KBNLresearch/europeananp-ner

• Detection of named entities: Person, Location, Organization
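
A sequence tagger of this kind typically assigns one label per token. The Python sketch below shows one simple way to merge consecutive tokens with the same label into entities. It does not call the Stanford tagger or the project's tool; the token/label pairs are hand-written, and the label names (`PERSON`, `LOCATION`, `ORGANIZATION`, `O` for "outside") and the function `group_entities` are assumptions for illustration.

```python
def group_entities(tagged_tokens, outside="O"):
    """Merge runs of adjacent tokens that share a non-outside label."""
    entities = []
    current_words, current_label = [], None
    for word, label in tagged_tokens:
        if label == current_label and label != outside:
            # Same entity continues.
            current_words.append(word)
            continue
        if current_words:
            # Label changed: close the entity collected so far.
            entities.append((" ".join(current_words), current_label))
        if label != outside:
            current_words, current_label = [word], label
        else:
            current_words, current_label = [], None
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


# Hand-written example input (hypothetical tagger output).
TAGGED = [
    ("Anna", "PERSON"), ("Visser", "PERSON"), ("visited", "O"),
    ("the", "O"), ("Rijksmuseum", "ORGANIZATION"), ("in", "O"),
    ("Amsterdam", "LOCATION"),
]

if __name__ == "__main__":
    for text, label in group_entities(TAGGED):
        print(label, text)
```

Note that this simple grouping cannot tell two adjacent entities of the same class apart; tag schemes with explicit begin markers exist for that case.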

QA @ PRImA

• Layout and OCR evaluation

• Technologies: Ground truth + Evaluation Tools (IMPACT)

• In-depth, scenario-driven evaluation using profiles with more than 600 metrics
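
To illustrate the basic idea of comparing OCR output against ground truth, here is a minimal Python sketch of a character accuracy measure based on edit distance. This is a generic textbook metric, not one of the evaluation profiles or tools named above; the function names and the sample strings are assumptions for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_text):
    """Share of ground-truth characters recognised correctly, clamped at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # One substituted character in a seven-character word.
    print(round(character_accuracy("Zeitung", "Zeitnng"), 3))
```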

Full-text search @ TEL

Blog: www.europeana-newspapers.eu

Workshop: 16 Sept. 2013 (Amsterdam)

Thank you for your attention!

clemens.neudecker@kb.nl