THE DIATHESIS NEWSPAPER DIGITIZATION SUITE

Foundation of Research and Technology, Institute of Computer Science

Centre for Cultural Informatics

Martin Doerr, Georgios Markakis, Maria Theodoridou

Heraklion, Crete, Greece

About DIATHESIS

Diathesis is a newspaper digitization suite whose primary purpose is the digitization, classification and dissemination of archival newspaper material.

It was originally used for the digitization of the Vikelaia Municipal Library’s newspaper collection (1890-1960) at Heraclion, Crete. It has since evolved into an independent digitization suite.

It has been used in other projects as well (Filekpedeytiki Etairia, Athens, Greece: the “AYGHI” newspaper).

The Problem

Historical newspapers are one of the most significant sources of information for researchers, owing to the wealth of information they provide on every aspect of everyday political, social and intellectual life.

Access to this type of archival material is usually obstructed by the following factors:

To protect the archival material from potential damage, some archives prohibit access to the largest part of their collection.

Direct contact with the original archival material constitutes a potential health hazard (due to dust and fungi).

The lack of indexes to newspapers, combined with the vastness of the information they contain, makes research a very time-consuming task.

Many archives adopted digitization of newspapers as a straightforward method to deal with the above problems. Digitized material is easier to preserve and much easier to distribute via the Web.

However, conversion of archival material into a digital image format (e.g. JPEG, TIFF, PDF or DJVU) does not solve the problem of rapid access to this material.

Digitization by itself is inadequate unless it also provides a means of accessing the digitized material quickly and accurately (also known as the searchability issue).

Current State of the Art Newspaper Digitization Practices

Currently there are three main approaches for rendering newspaper archival material searchable:

1. The Physical Features Based Approach.

2. The OCR Based Full Text Indexing Approach.

3. The Conceptual Classification (Ontology Based) Approach.

The physical features based classification approach.

Newspapers are classified using a basic set of metadata regarding physical features of the original material (number of issue, date of publication, newspaper name, number of pages etc).

Advantages:

Simple to implement.

Disadvantages:

The final user is unable to conduct full-text searches on an article or issue level basis.

The final outcome of the digitization effort resembles more a browsing mechanism.

There is no explicitly defined conceptual structure of the archive.

Institutions:

Anno: Austrian newspapers online project (http://deposit.ddb.de/online/exil/exil.htm).

“Exilpresse digital. deutsche exilzeitschriften 1933-1945" project (http://deposit.ddb.de/online/exil/exil.htm).

Denmark: Digitaliserede danske aviser 1759-1865 (http://www.statsbiblioteket.dk).
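As a minimal sketch of what the physical features approach amounts to, the snippet below stores only issue-level metadata and offers a browse function over it. The class name, field names and sample records are hypothetical and do not come from any of the projects listed above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IssueRecord:
    """Physical-feature metadata for one digitized newspaper issue."""
    newspaper_name: str
    issue_number: int
    publication_date: date
    page_count: int


def browse(issues, newspaper_name, start, end):
    """Return the issues of one newspaper published within [start, end], oldest first."""
    hits = [
        issue for issue in issues
        if issue.newspaper_name == newspaper_name
        and start <= issue.publication_date <= end
    ]
    return sorted(hits, key=lambda issue: issue.publication_date)


# Hypothetical sample records, not taken from any real archive.
ISSUES = [
    IssueRecord("Example Daily", 13, date(1921, 3, 5), 4),
    IssueRecord("Example Daily", 12, date(1921, 3, 4), 4),
    IssueRecord("Other Gazette", 7, date(1921, 3, 4), 8),
]

march_1921 = browse(ISSUES, "Example Daily", date(1921, 3, 1), date(1921, 3, 31))
```

Nothing in such a record describes what the articles say, which is why the result behaves like a browsing mechanism rather than a search.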

The OCR based Full Text Indexing Approach.

This covers automatic digitization approaches that apply OCR analysis to digitized newspapers. Full text indexing techniques are currently considered the state of the art in newspaper digitization, mainly for the following reasons:

Creation of a searchable full-text index via OCR is a much faster process than the manual creation of metadata.

Separation of searchability and readability.

It is possible to conduct searches on a page/issue/article level basis.

The search is conducted via keywords in a manner that is familiar to the average user of contemporary Web search engines.

Efficient content dissemination over the Web.

Disadvantages:

Well known precision/recall issues. Newspaper archives are not as chaotic as the Web.

The search of information in OCR based information retrieval systems is conceptually blind.

The import process is a computationally expensive procedure.

Institutions adopting this approach:

British Library online newspaper archive (http://www.uk.olivesoftware.com/).

The Brooklyn Daily Eagle online (http://www.brooklynpubliclibrary.org/eagle/).

Northern New York historical newspapers (http://news.nnyln.net/).

Utah Digital Newspapers (http://www.lib.utah.edu/digital/unews/).

Historical newspapers in Washington (http://www.secstate.wa.gov/history/newspapersname.aspx).

To mention just a few…
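The keyword search behind this approach can be illustrated with a toy inverted index. This is a sketch under stated assumptions, not the indexing engine of any of the archives above: the function names and the sample OCR text (including the simulated recognition error "harb0ur") are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(articles):
    """Map every token to the set of article ids whose OCR text contains it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for token in tokenize(text):
            index[token].add(article_id)
    return index


def search(index, query):
    """Return the ids of articles that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical OCR output; "harb0ur" simulates a recognition error.
ARTICLES = {
    "a1": "The new harbour of Heraklion opened yesterday",
    "a2": "Works at the harb0ur were delayed by storms",
    "a3": "The municipal library received new books",
}

INDEX = build_index(ARTICLES)
harbour_hits = search(INDEX, "harbour")   # misses a2 because of the OCR error
new_hits = search(INDEX, "new")           # matches unrelated articles a1 and a3
```

The two queries show both weaknesses named above: the misrecognized word lowers recall, and a keyword match says nothing about what an article is actually about.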

The conceptual classification approach.

The conceptual classification approach overcomes many of the above weaknesses by enabling the user to perform a knowledge engineering task upon the already digitized material via the use of ontologies.

An ontology: "the specification of one's conceptualization of a knowledge domain".

Advantages:

Ontologies are used to express a specific conceptual view over the digitized material.

The use of top level ontologies guarantees to a certain extent the semantic interoperability among different archives.

The user may classify a document with concepts that are not contained in the document itself.

Disadvantages:

Given the density of information in a newspaper, production of metadata is a notoriously time-consuming task (knowledge engineering bottleneck).

It is almost impossible to manually define all the semantic relations or entities contained even in a single article in a timely manner.
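A small sketch may help show why concept-based classification finds documents that keyword search cannot. It assumes a thesaurus stored as a mapping from each term to its narrower terms, plus manually assigned annotations; both data sets and all names are hypothetical.

```python
# Hypothetical thesaurus: each term maps to its direct narrower terms.
NARROWER = {
    "Politics": ["Elections", "Diplomacy"],
    "Diplomacy": ["Treaties"],
}

# Hypothetical manual annotations: article id -> assigned thesaurus terms.
ANNOTATIONS = {
    "a1": {"Treaties"},
    "a2": {"Elections"},
    "a3": {"Sports"},
}


def expand(term, narrower):
    """Return the term together with all of its direct and indirect narrower terms."""
    found = {term}
    stack = [term]
    while stack:
        for child in narrower.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def search_by_concept(term, narrower, annotations):
    """Return the ids of articles annotated with the term or any narrower term."""
    wanted = expand(term, narrower)
    return {aid for aid, terms in annotations.items() if terms & wanted}


politics_hits = search_by_concept("Politics", NARROWER, ANNOTATIONS)
```

An article annotated only with "Treaties" is returned for "Politics" even if neither word occurs in its text. The cost is that every annotation has to be produced by hand.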

The DIATHESIS Approach: a hybrid approach

This system attempts to implement a realistic conceptual classification approach by combining the best elements from the three approaches mentioned above:

1. It permits searches on a newspaper issue basis (newspaper issue name, number, publication date) in a similar manner to the physical features based approach.

2. It permits searches on an article level basis via the use of full text queries in a similar manner to the OCR based Full Text Indexing Approach.

3. It permits searches on an article level basis via the semantic relationships assigned to each segment.

4. It permits searches that combine all of the above elements.

The system does not attempt to create a complete semantic structure that includes all the semantic relationships and entities (Actors, Places) described in the text. Instead, it focuses on creating a coherent semantic backbone that can easily be enriched with further semantic relations.

DIATHESIS uses CIDOC CRM as its underlying ontology.
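The four kinds of search listed above can be sketched as one function with optional criteria. This is an illustrative sketch only: the `Segment` class, its field names and the naive substring match (standing in for a real full text index) are assumptions, not the DIATHESIS implementation.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Segment:
    """One annotated article segment (all field names are hypothetical)."""
    segment_id: str
    newspaper_name: str
    publication_date: date
    ocr_text: str
    types: set = field(default_factory=set)  # thesaurus terms assigned by annotators


def hybrid_search(segments, newspaper_name=None, year=None, words=(), has_type=None):
    """Keep the segments that satisfy every criterion that was supplied."""
    hits = []
    for seg in segments:
        # Issue-level criteria (physical features).
        if newspaper_name is not None and seg.newspaper_name != newspaper_name:
            continue
        if year is not None and seg.publication_date.year != year:
            continue
        # Full text criterion: naive substring match on the OCR text.
        text = seg.ocr_text.lower()
        if any(word.lower() not in text for word in words):
            continue
        # Semantic criterion: a term assigned during annotation.
        if has_type is not None and has_type not in seg.types:
            continue
        hits.append(seg.segment_id)
    return hits


# Hypothetical sample data.
SEGMENTS = [
    Segment("s1", "Example Daily", date(1921, 3, 4), "Elections were held in Heraklion", {"Politics"}),
    Segment("s2", "Example Daily", date(1921, 3, 5), "The football club of Heraklion won", {"Sports"}),
    Segment("s3", "Other Gazette", date(1935, 6, 1), "Elections postponed in Athens", {"Politics"}),
]

text_only = hybrid_search(SEGMENTS, words=["heraklion"])
text_and_type = hybrid_search(SEGMENTS, words=["heraklion"], has_type="Politics")
```

Adding the semantic criterion narrows the full text result from two segments to the one that is actually about politics.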

Aims of DIATHESIS

To render the digitized newspapers searchable on a document/article level basis.

To exploit the use of OCR technology in order to enable full text search in a newspaper collection.

To combine full text search with user-defined metadata based search on a document and article level basis in order to enhance the overall precision factor of the system.

To provide visualization facilities and an ergonomic interface for:

The timely completion of metadata according to a set of predefined thesauri hierarchies.

The browsing of the digitized newspaper collection given a set of predefined thesauri hierarchies.

To deal with issues of semantic interoperability of digitized material (conformance to international standards).

To create a robust semantic backbone that will allow the full implementation of the CIDOC CRM Model.
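The aim of enhancing precision by combining full text and metadata search can be made concrete with the standard definitions of precision and recall. The result sets below are invented purely to illustrate the arithmetic; they are not measurements from DIATHESIS.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved ids against the relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result sets for one query.
RELEVANT = {"a1", "a2", "a5"}             # articles a researcher actually wants
FULL_TEXT = {"a1", "a2", "a3", "a4"}      # keyword hits in the OCR text
TYPED = {"a1", "a2", "a5"}                # articles annotated with the wanted term

full_text_scores = precision_recall(FULL_TEXT, RELEVANT)
combined_scores = precision_recall(FULL_TEXT & TYPED, RELEVANT)
```

In this made-up case, intersecting the keyword hits with the annotated set removes the two false positives, so precision rises while recall stays the same.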

About CIDOC

What is the CIDOC Conceptual Reference Model?

An Object Oriented Ontology of about 80 classes and 130 properties for cultural and natural history.

CRM instances can be encoded in many forms: RDBMS, ooDBMS, XML, RDF(S), OWL.

Accepted as ISO-21127 in June 2005.

The CRM is not a metadata standard

It is meant to become our language for semantic interoperability.

It is a Conceptual Reference Model for analyzing and designing cultural information systems.

It is limited to the underlying semantics of database schemata and document structures used in cultural heritage and museum documentation.

It does not define the terminology used to document these data structures.

It does not say what cultural institutions should document.

It aims to explain the logic of what they actually do document.

An Example Hierarchy: E70 Stuff (Thing)

CIDOC Example (1): Modeling an Activity

[Figure: CRM graph of the Crimea Conference. Entities shown: E7 Activity “Crimea Conference”; E65 Creation Event; E31 Document “Yalta Agreement”; three E39 Actor nodes; E53 Place 7012124; E52 Time-Span “February 1945”; E38 Image. Properties shown: P14 performed, P11 participated in, P94 has created, P86 falls within, P7 took place at, P67 is referred to by, P81 ongoing throughout, P82 at some time within.]
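One way to read such a CRM graph is as a list of (subject, property, object) triples. The sketch below encodes part of the Crimea Conference example this way. The flattened diagram preserves the property names but not their endpoints, so the three edges are an assumed reading, and the identifier "Yalta Agreement creation" is a hypothetical label for the E65 Creation Event node.

```python
# Each triple is (subject, property, object); "type" assigns a CRM class.
TRIPLES = [
    ("Crimea Conference", "type", "E7 Activity"),
    ("Yalta Agreement", "type", "E31 Document"),
    ("Yalta Agreement creation", "type", "E65 Creation Event"),
    ("7012124", "type", "E53 Place"),
    # Assumed edges: the figure lists these properties but not their endpoints.
    ("Crimea Conference", "P7 took place at", "7012124"),
    ("Yalta Agreement creation", "P86 falls within", "Crimea Conference"),
    ("Yalta Agreement creation", "P94 has created", "Yalta Agreement"),
]


def objects(triples, subject, prop):
    """Return every object linked from the subject by the property."""
    return [o for s, p, o in triples if s == subject and p == prop]


def subjects(triples, prop, obj):
    """Return every subject linked to the object by the property."""
    return [s for s, p, o in triples if p == prop and o == obj]


where = objects(TRIPLES, "Crimea Conference", "P7 took place at")
sub_events = subjects(TRIPLES, "P86 falls within", "Crimea Conference")
```

The point of the model is that the document is reached through the event that created it, not through a flat "author" or "place" field.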

CIDOC Example (2): Describing a composite artifact

CIDOC-CRM DIATHESIS implementation: Issue/Segments Relationships

[Figure: nodes E31.Document (Newspaper Issue), E73.Information_Object (Newspaper Page) and a series of E7.Activity nodes labelled (Newspaper Segment) and (News); properties P106F.is_composed_of and P67F.refers_to.]
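The issue, page and segment structure can be walked programmatically once it is stored as triples. The sketch below assumes that an issue is composed of pages (P106F.is_composed_of) and that each page refers to its news activities (P67F.refers_to); this is one plausible reading of the diagram, and all identifiers are hypothetical.

```python
from collections import defaultdict

P106 = "P106F.is_composed_of"
P67 = "P67F.refers_to"

# Hypothetical identifiers; the edge directions are an assumed reading of the figure.
TRIPLES = [
    ("issue-1", P106, "issue-1/page-1"),
    ("issue-1", P106, "issue-1/page-2"),
    ("issue-1/page-1", P67, "news-A"),
    ("issue-1/page-1", P67, "news-B"),
    ("issue-1/page-2", P67, "news-C"),
]


def build_graph(triples):
    """Index the triples as graph[subject][property] -> list of objects."""
    graph = defaultdict(lambda: defaultdict(list))
    for subject, prop, obj in triples:
        graph[subject][prop].append(obj)
    return graph


def news_of_issue(graph, issue_id):
    """Follow issue -> pages -> referenced activities, preserving page order."""
    news = []
    for page in graph[issue_id][P106]:
        news.extend(graph[page][P67])
    return news


GRAPH = build_graph(TRIPLES)
all_news = news_of_issue(GRAPH, "issue-1")
```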

CIDOC-CRM DIATHESIS implementation: Issue Physical Features

[Figure: physical features of a newspaper issue. P102F.has_title with E35.Title (Newspaper Title); P92B.was_brought_into_existence_by with E63.Beginning_of_Existence (Newspaper Publication Date); P43F.has_dimension with E54.Dimension (Number of pages). The E7.Activity (News) node and P67F.refers_to also appear.]
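To show how flat issue metadata maps onto these CRM properties, the sketch below turns a title, publication date and page count into triples. The property and class names come from the figure; the function name, the identifier scheme and the "publication:" and "pages:" value prefixes are hypothetical.

```python
from datetime import date


def issue_to_triples(issue_id, title, publication_date, page_count):
    """Express an issue's physical features as (subject, property, object) triples."""
    return [
        (issue_id, "type", "E31.Document"),
        (issue_id, "P102F.has_title", title),
        (issue_id, "P92B.was_brought_into_existence_by",
         f"publication:{publication_date.isoformat()}"),
        (issue_id, "P43F.has_dimension", f"pages:{page_count}"),
    ]


# Hypothetical issue.
triples = issue_to_triples("issue-1", "Example Daily", date(1921, 3, 4), 4)
```

Note that the publication date is attached to an event that brought the issue into existence rather than stored as a plain attribute of the document, which is typical of the event-centred CRM style.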

CIDOC-CRM DIATHESIS implementation: Activity References

[Figure: E31.Document (Newspaper Issue), P67F.refers_to, E7.Activity (News). Properties of the activity: P14F.carried_out_by, P16F.used_specific_object, P7F.took_place_at, P4F.has_time-span, P2F.has_type, P3F.has_note. Targets shown: E39.Actor (literal), E70.Stuff (literal), E53.Place (literal), E2.Temporal_Entity, E55.Type (literal) backed by the SIS-TMS Controlled Vocabulary, and (Article Full text).]
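The metadata attached to each news activity can be pictured as a small record with one attribute per CRM property. The sketch below is an assumption about how such a record might look, not the DIATHESIS schema: the class, the helper and the sample values are hypothetical, and only the property codes in the comments come from the figure.

```python
from dataclasses import dataclass, field


@dataclass
class NewsActivity:
    """One E7.Activity (News) record; attribute names mirror the CRM properties."""
    activity_id: str
    carried_out_by: list = field(default_factory=list)        # P14F, actors as literals
    used_specific_object: list = field(default_factory=list)  # P16F, things as literals
    took_place_at: list = field(default_factory=list)         # P7F, places as literals
    has_time_span: str = ""                                    # P4F
    has_type: list = field(default_factory=list)               # P2F, controlled vocabulary terms
    has_note: str = ""                                         # P3F, article full text


def find_by(activities, attribute, value):
    """Return ids of activities whose list-valued attribute contains the value."""
    return [a.activity_id for a in activities if value in getattr(a, attribute)]


# Hypothetical annotations.
ACTIVITIES = [
    NewsActivity("n1", carried_out_by=["Municipal Council"],
                 took_place_at=["Heraklion"], has_type=["Politics"]),
    NewsActivity("n2", carried_out_by=["Harbour Authority"],
                 took_place_at=["Heraklion"], has_type=["Commerce"]),
]

in_heraklion = find_by(ACTIVITIES, "took_place_at", "Heraklion")
political = find_by(ACTIVITIES, "has_type", "Politics")
```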

Thesauri Hierarchies

CIDOC based newspaper annotation

Integration by Factual Relations

[Figure labels: CIDOC CRM Core Ontology; Documents in Digital Libraries; real world nodes (KOS). Example nodes: Ethiopia, Hadar, Johanson's Expedition, Discovery of Lucy, Lucy, Donald Johanson, Benaki Museum.]

The System Architecture: Software Components

[Figure: components shown are the Apache Tomcat Application Server (http://tomcat.apache.org/), the Diathesis Newspaper Digitization Suite (Diathesis Administrator, Diathesis Annotation Mechanism, DIATHESIS Web Search), a Database and the SIS-TMS Thesaurus Management System. The diagram is divided into Server Side and Client Side.]


The System Architecture: Workflow View

The user interface

FEATURES:

Fully Web Based.

Simple to use / Easy to learn.

Intelligent Upload / Download Mechanism.

Workflow Control.

Data Loss Prevention Mechanism (Temporary Local Storage and Data Recovery).

Flexible and Ergonomic Completion of Metadata Fields.

Automatic Highlighting of keywords in OCR Text (Actors, Places).

Use of SVG thesauri hierarchies for the timely completion of Vocabulary Reserved Metadata fields.
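Automatic highlighting of known actors and places in OCR text can be approximated with a regular expression built from a term list. This is a hedged sketch, not the mechanism used in the DIATHESIS interface: the function name, the `<mark>` tags and the sample terms are assumptions.

```python
import re


def highlight(text, terms, open_tag="<mark>", close_tag="</mark>"):
    """Wrap every whole-word, case-insensitive occurrence of a term in tags."""
    if not terms:
        return text
    # Longest terms first, so a longer term wins over a shorter one that is its prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: f"{open_tag}{match.group(0)}{close_tag}", text)


# Hypothetical vocabulary of places.
PLACES = ["Heraklion", "Athens"]
marked = highlight("The mayor of Heraklion visited Athens.", PLACES)
```

In practice the term list would come from the controlled vocabulary, so the annotator sees candidate metadata values directly in the article text.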

The user interface

[Figure: DIATHESIS interface map. Entries shown: Annotation Mechanism; End User Search Mechanism; Administrator; Usage Stats; Mass Import; System Configuration; Search for Subjects; Search for Issues.]

Demonstration: Annotation Interface

Demonstration: End User Search Mechanism

Future Directions

Enrich the metadata creation process with Information Extraction Techniques.

Expand the suite with complementary Deep Semantic Annotation Capabilities (Semantic Wiki).

[Figure: three-phase roadmap. PHASE 1: Material Preprocessing Phase. PHASE 2: Shallow Semantic Annotation, the metadata production phase. PHASE 3: Deep Semantic Annotation, the full CIDOC implementation phase. Other labels shown: DIATHESIS, Semantic Wiki, Information Extraction Techniques (twice).]

Conclusions

OCR is a popular new technology in newspaper digitization practice. However, it cannot deal with a number of issues.

Deep semantic annotation via Semantic Web technologies is a promising future trend. CIDOC CRM provides the theoretical means to achieve this. The problem is how to implement it. Creating the deep semantic relationships that exist within the boundaries of a single newspaper issue is a time-consuming, and therefore expensive, task.

The DIATHESIS digitization suite encapsulates a digitization strategy towards the creation of a vast semantic network of factual relationships between CIDOC entities while effectively dealing with the following issues:

Digitization and storage of newspaper material.

Rendering digitized material searchable on an issue/article level basis via the use of metadata, thesauri hierarchies and full text queries.

Creating a semantic backbone that can be used by future implementations.

The next step: Link the DIATHESIS semantic backbone with a Semantic Wiki.

Thank You!

[email protected]@ics.forth.gr

Date post:	18-Dec-2015