Europeana Newspapers German infoday - Processing of Digital Newspapers

Post on 11-May-2015


Digital Newspapers: Processing in Europeana Newspapers

Information Day SBB

Berlin, 27 February 2014

Clemens Neudecker, KB, Twitter: @cneudecker

This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp

Overview

• Goals & challenges

• Newspapers in the project

• Workflow & technologies

• Questions & answers


Goals

• Process 8 million newspaper pages with OCR (UIBK)

• Process 2 million newspaper pages with OLR (CCS)

• Build software for NER in 3 languages (KB)

• Develop tools that automate the workflow

• Produce guidelines and recommendations ("best practices")


Challenges

• Quality vs. throughput

• Complexity of newspaper layouts (columns, advertisements, illustrations)

• Widely varying quality of the digitised images (microfilm, bitonal)

• Different file formats, languages, alphabets

• Historical spelling variants

• A clearly structured and largely automated workflow


The Newspapers


Europeana Newspapers Dataset (1)


Europeana Newspapers Dataset (2)


Europeana Newspapers Dataset (3)


Europeana Newspapers Dataset (4)


Workflow


OCR @ UIBK

• OCR = Optical Character Recognition

• Technology: ABBYY FineReader SDK

• State-of-the-art OCR software; supports Fraktur, Latin and Cyrillic scripts out of the box

• Export as a METS/ALTO package consisting of images, metadata and full text
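To illustrate what the full-text part of a METS/ALTO package looks like to a consumer, here is a minimal sketch that reads the recognised words from an ALTO file. The sample XML is invented for illustration and heavily shortened; it only relies on the standard ALTO elements `TextLine` and `String` and the `CONTENT` attribute. The function names are hypothetical and not part of any project tool.

```python
import xml.etree.ElementTree as ET

# Invented, shortened sample; real ALTO files carry far more metadata.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Berliner" WC="0.93"/>
            <SP/>
            <String CONTENT="Zeitung" WC="0.88"/>
          </TextLine>
          <TextLine>
            <String CONTENT="1914" WC="0.71"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any ALTO version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the text of every TextLine as one string per line."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

Matching on the local element name rather than a fixed namespace is a design choice made here so that the sketch does not depend on which ALTO schema version a given package uses.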


Tools (BCT)

• BCT = Binarisation and Colour Reduction Tool

• Goal: convert colour/greyscale scans to 1-bit using a method optimised for OCR (GPP) + JP2k

• Background: reduce the file size of the images to make the data volume manageable (hundreds of TBs)
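The general idea of reducing a greyscale scan to 1-bit can be sketched as follows. Note that this is not the GPP method named on the slide: as a stand-in it uses a plain global Otsu threshold, which is simpler and usually less robust on degraded newspaper scans. The function names and the tiny pixel list are invented for illustration.

```python
def otsu_threshold(pixels):
    """Pick the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    weight_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0         # sum of their grey levels
    best_t, best_var = 0, -1.0

    for t in range(256):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarise(pixels, threshold):
    """Map dark pixels (ink) to 0 and light pixels (paper) to 1."""
    return [0 if p <= threshold else 1 for p in pixels]


if __name__ == "__main__":
    scan = [10, 12, 11, 200, 210, 205]  # invented greyscale values
    t = otsu_threshold(scan)
    print(t, binarise(scan, t))
```

Going from 8-bit greyscale to 1-bit cuts the uncompressed size by a factor of eight before any further compression, which is the kind of saving the slide refers to.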


Tools (FRT)

• FRT = File Rename Tool

• Goal: support the libraries with data delivery by renaming files and folders

• Background: prepare the data in the structure required for automated processing


Tools (FAT)

• FAT = File Analyzer Tool

• Goal: check and validate the data structure before delivery for processing

• Background: assurance for all parties that the data is in a form suitable for further processing
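A pre-delivery structure check of this kind might look roughly like the sketch below. The slides do not describe the directory layout the project actually requires, so the naming convention used here (`<title-id>/<YYYY>/<MM>/<DD>/<4-digit page>.<tif|jp2>`) is purely hypothetical, as are the function name and the sample paths.

```python
import re

# Hypothetical convention, invented for illustration only:
#   <title-id>/<YYYY>/<MM>/<DD>/<4-digit page number>.<tif|jp2>
EXPECTED_PATH = re.compile(r"^[a-z0-9_]+/\d{4}/\d{2}/\d{2}/\d{4}\.(tif|jp2)$")


def find_violations(relative_paths):
    """Return every path that does not follow the assumed convention."""
    return [p for p in relative_paths if not EXPECTED_PATH.match(p)]


if __name__ == "__main__":
    delivery = [
        "example_title/1914/08/01/0001.tif",
        "example_title/1914/08/01/0002.jp2",
        "example_title/1914/08/01/page3.tiff",  # wrong name and extension
    ]
    for path in find_violations(delivery):
        print("not in expected structure:", path)
```

Working on a list of relative paths rather than the file system keeps the check easy to test; a real tool would also need to inspect the files themselves.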


OLR @ CCS

• OLR = Optical Layout Recognition

• Technology: docWorks

• Segmentation of the page into columns, articles, headings, "page types" (advertisements)

• Export as a METS/ALTO package consisting of images, metadata and full text


OLR → Article recognition


NER @ KB

• NER = Named Entity Recognition

• Technology: Stanford CRF-NER

• 3 languages: German, Dutch, French

• Open source: https://github.com/KBNLresearch/europeananp-ner

• Recognition of 3 classes: person, location, organisation
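A sequence tagger such as a CRF assigns one label per token, so multi-word names have to be reassembled afterwards. The sketch below shows one simple way to do that. The label names (`PER`, `LOC`, `ORG`, `O`), the function name and the example sentence are assumptions made for illustration; they are not taken from the europeananp-ner code.

```python
def group_entities(tagged_tokens, outside="O"):
    """Merge runs of consecutive tokens with the same label into entities.

    tagged_tokens: iterable of (word, label) pairs, one per token.
    Returns a list of (entity_text, label) tuples.
    """
    entities = []
    current_words, current_label = [], None

    for word, label in tagged_tokens:
        # Same entity label as the previous token: extend the current run.
        if label == current_label and label != outside:
            current_words.append(word)
            continue
        # Label changed: close the previous run if it was an entity.
        if current_label not in (None, outside):
            entities.append((" ".join(current_words), current_label))
        current_words, current_label = [word], label

    # Close a run that reaches the end of the input.
    if current_label not in (None, outside):
        entities.append((" ".join(current_words), current_label))
    return entities


if __name__ == "__main__":
    sentence = [
        ("Anna", "PER"), ("Schmidt", "PER"), ("reist", "O"),
        ("nach", "O"), ("Den", "LOC"), ("Haag", "LOC"),
    ]
    print(group_entities(sentence))
```

One known limitation of this approach: two different entities of the same class that directly follow each other are merged into one, which is why some tag sets mark entity beginnings explicitly.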


Results for NL

Model trained on manually tagged newspaper pages from 1618 to 1900.

100 pages with a total of 183,421 tokens ("words")

* K-fold cross-validation = 1/4 of the training data is used for evaluation only
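Holding out 1/4 of the data for evaluation is consistent with 4 folds. A minimal sketch of such a split follows, assuming k = 4 and assuming the split is made over the 100 pages; the slides state neither explicitly, and the function name is hypothetical.

```python
def k_fold_splits(n_items, k=4):
    """Yield (train_indices, test_indices) for each of the k folds.

    Every item lands in the test set of exactly one fold.
    """
    indices = list(range(n_items))
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    for fold, (train, test) in enumerate(k_fold_splits(100, k=4), start=1):
        print(f"fold {fold}: {len(train)} training pages, {len(test)} evaluation pages")
```

In practice the items are usually shuffled before splitting; that step is left out here to keep the sketch deterministic.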

            Persons   Locations   Organisations

Precision   0.940     0.950       0.942

Recall      0.588     0.760       0.559

F-measure   0.689     0.838       0.671
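For reference, the balanced F-measure is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values in the table. The recomputed values come out somewhat higher than the reported F-measures; one possible explanation, which the slides do not confirm, is that the reported figures are averages over the cross-validation folds rather than values derived from the averaged precision and recall.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall, reported F-measure) as given in the table
REPORTED = {
    "Persons": (0.940, 0.588, 0.689),
    "Locations": (0.950, 0.760, 0.838),
    "Organisations": (0.942, 0.559, 0.671),
}

if __name__ == "__main__":
    for name, (p, r, reported_f) in REPORTED.items():
        print(f"{name}: recomputed {f_measure(p, r):.3f}, reported {reported_f:.3f}")
```

The pattern in the table, high precision with much lower recall, means the model rarely labels something wrongly but misses a sizeable share of the entities.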


NER vs. OCR

[Chart labelled NER and OCR; scale from 0.25 to 0.95]

Thank you for your attention!

Any questions?

clemens.neudecker@kb.nl