Metadata and identifiers for e-journals

Copenhagen 13.-14.3.2000

Juha Hakala

Helsinki University Library

[email protected]

Contents

• Introduction

• Traditional cataloguing

• Full-text indexing

• Embedded metadata + Dublin Core

• DIEPER choices

• Identification of e-journals

Introduction

• Metadata = structured description of a resource

• Structure of metadata is defined in a format
  – simple formats (AltaVista)
  – complex formats (MARC)
  – structured formats (Dublin Core)

• Choices have important cost and quality implications (good is not free)

Traditional cataloguing

• Routinely done for journals (ISSN DB)

• Articles indexed only selectively
  – Finnish article index Arto: 1100 journals; 65000 articles + 10 man years annually, 40 libraries co-operate in production

• Extending MARC cataloguing to all digitised articles is too expensive

• Any selection criteria for “good material”?

Full-text indexing

• Will not replace cataloguing...
  – In large databases precision is still bad

• ...but we should follow what is happening
  – RDBMSs become document-literate (Oracle Intermedia)
  – new search techniques (e.g. fuzzy searching)
  – efficient use of language technologies
  – knowledge management

Embedded metadata (1)

• Three issues to solve:
  – semantics: in which metadata format should my metadata be?
  – syntax: is it possible / feasible to embed metadata into this document (does the document format allow inclusion of metadata)?
  – tools: once issues 1 & 2 have been solved, are there tools for creating / harvesting / indexing my metadata?

Embedded metadata - syntax

• It must be possible to include metadata in non-compromised form & specify each data element separately

• Most document formats do not allow efficient metadata usage
  – “flat files”, image formats, Word97

• “This is Dublin Core identifier element, and there is an ISBN in it”

Embedded metadata - syntax (2)

• HTML 4.0
  – META tag enables sophisticated metadata
  – Explicit specification for how to embed Dublin Core-based metadata (RFC 2731)

• XML/RDF
  – “Resource Description Framework makes data machine understandable”
  – very versatile, but may be tough to implement
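As an illustration of the HTML 4.0 option, the following is a minimal sketch of how Dublin Core elements might be rendered as META tags in the style of RFC 2731 (a `schema.DC` link followed by `DC.`-prefixed names). The function name `dc_meta_tags` and the sample record are assumptions made for this example, not part of any DIEPER tool.

```python
from html import escape

# Namespace URI for the 15-element Dublin Core set (DC 1.1).
DC_SCHEMA = "http://purl.org/dc/elements/1.1/"


def dc_meta_tags(record):
    """Render (element, value) pairs as HTML META tags, RFC 2731 style.

    Hypothetical helper for illustration; each data element is kept
    separate so that an indexer can pick it up individually.
    """
    lines = [f'<link rel="schema.DC" href="{escape(DC_SCHEMA)}">']
    for element, value in record:
        # escape() protects quotes and angle brackets inside attribute values.
        lines.append(
            f'<meta name="DC.{escape(element)}" content="{escape(value)}">'
        )
    return "\n".join(lines)


# Illustrative sample record; the values are assumptions, not real data.
sample = [
    ("Title", "Metadata and identifiers for e-journals"),
    ("Creator", "Hakala, Juha"),
    ("Language", "en"),
]
print(dc_meta_tags(sample))
```

The output is meant to be pasted into the HEAD of an HTML 4.0 document; repeating an element (e.g. several Creator entries) simply produces several META tags.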

Embedded metadata - semantics

• Metadata formats tend to be domain specific, complex and hard to learn

• Dublin Core as an alternative:
  – simple (in its basic form)
  – generic (no domain dependency)
  – extensible (local elements possible)

• Is there any competition left?

Status of Dublin Core Initiative

• maintenance in reliable hands

• 15 elements stable (DC 1.1)

• syntax for HTML 4.0 stable

• core qualifiers under development
  – proposals published in December 1999
  – agreement in DC-AC in March 2000
  – will result in 50-60 qualifiers

Tools for Dublin Core

• Metadata support in Web indexes becoming more popular

• Metadata creation emerging in document management systems

• Text editors: XML support in place, RDF yet to come

DIEPER choices

• Document format will be XML/RDF
  – extensible and open document format that will become very popular in the future

• Metadata format will be based on DC
  – DC tags: Identifier, Title, Creator, Contributor, Publisher, Language, Subject
  – Local tags: e.g. SerialsNumbering, PlaceOfPublication, SizeSourcePrint

Identifiers for e-journals

• Two different issues:
  – how to identify journals themselves
  – how to identify articles and possibly sections of articles (table of contents etc.)

• Do we need a resolution mechanism (based on DOI or URN)?

E-journals

• ISSN must be used, also for digitised journals
  – digitised version may have the same ISSN as the original paper version

• ISSN should not be embedded on issues / articles, since this enhances recall too much

• Broadened scope: serials + integrating resources

Issues & articles

• SICI (Serial Item and Contribution Identifier) should be used

• ANSI/NISO standard (1996)
  – http://sunsite.berkeley.edu/SICI/

• Not widely supported yet; e-commerce is likely to change this
  – need to identify whatever can be sold

• SICI generator available

Properties of SICI

• Extensible: can identify issue/article/section within article

• Can be created automatically (from structured source document)

• Complex
  – 0002-8231(1929)30:1<ZBDMSU>2.0.CO;2-Z

• Can be used as URN or DOI
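To make the "complex" structure more concrete, here is a rough sketch that splits a SICI string such as the example above into its visible parts: the ISSN, the chronology in parentheses, the enumeration, the contribution segment in angle brackets, and the trailing control segment. The regular expression and the name `split_sici` are assumptions for illustration; the sketch does not verify the check character or cover every variant permitted by the standard.

```python
import re

# Loose pattern for the overall SICI layout:
#   ISSN(chronology)enumeration<contribution>control-segment
SICI_PATTERN = re.compile(
    r"^(?P<issn>\d{4}-\d{3}[\dX])"   # ISSN of the serial
    r"\((?P<chronology>[^)]*)\)"     # date of the issue
    r"(?P<enumeration>[^<]*)"        # e.g. volume:issue
    r"<(?P<contribution>[^>]*)>"     # article-level part (may be empty)
    r"(?P<control>.+)$"              # control segment incl. check character
)


def split_sici(sici):
    """Split a SICI string into named parts (hypothetical helper)."""
    match = SICI_PATTERN.match(sici)
    if match is None:
        raise ValueError(f"not recognised as a SICI: {sici!r}")
    return match.groupdict()


parts = split_sici("0002-8231(1929)30:1<ZBDMSU>2.0.CO;2-Z")
for name, value in parts.items():
    print(f"{name:13} {value}")
```

Because every part is derived from data that a structured source document already carries, generating the identifier automatically is largely the reverse of this split.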

URN & DOI

• Umbrella systems that provide e.g. persistent linkage between a reference and the resource via a resolution service

• DOI is a publisher-driven initiative, URN comes from the Internet community

• DOIs can be used as URNs, not vice versa

Digital object identifier

• Consists of prefix and suffix, separated by a slash
  – 10.1045/february2000-risher

• Suffix may be anything, there is no hint on its content

• Prefix identifies the publisher + indicates where to find a resolution service
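The prefix/suffix structure can be sketched in a few lines. The function name `split_doi` is an assumption for this example; the split is made at the first slash only, since the suffix is opaque and may contain anything, including further slashes.

```python
def split_doi(doi):
    """Split a DOI into (prefix, suffix) at the first slash.

    Hypothetical helper for illustration. The prefix identifies the
    publisher; the suffix is treated as an opaque string.
    """
    prefix, slash, suffix = doi.partition("/")
    if not slash or not prefix or not suffix:
        raise ValueError(f"expected 'prefix/suffix', got {doi!r}")
    return prefix, suffix


prefix, suffix = split_doi("10.1045/february2000-risher")
print("prefix:", prefix)
print("suffix:", suffix)
```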

Uniform resource name

• Consists of three parts:
  – string urn:
  – Namespace identifier (NID)
  – Namespace specific string (NSS)

• When NID is known, creating URNs from existing identifiers is trivially easy

• No hint on where to find resolution service
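A minimal sketch of how an existing identifier might be wrapped as a URN, assuming the RFC 2141 syntax `urn:<NID>:<NSS>`. The namespace identifiers `issn` and `sici` are used here purely as illustrative assumptions, and `make_urn` is a hypothetical helper. Characters outside the set RFC 2141 allows unencoded in the NSS (for example the angle brackets of a SICI) are %-encoded.

```python
import string

# Characters RFC 2141 permits unencoded in the namespace specific string.
ALLOWED = set(string.ascii_letters + string.digits + "()+,-.:=@;$_!*'")


def make_urn(nid, nss):
    """Build 'urn:<NID>:<NSS>' from an existing identifier (illustrative)."""
    encoded = "".join(
        ch if ch in ALLOWED
        # %-encode each UTF-8 byte of any character outside the allowed set.
        else "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        for ch in nss
    )
    return f"urn:{nid}:{encoded}"


print(make_urn("issn", "0002-8231"))
print(make_urn("sici", "0002-8231(1929)30:1<ZBDMSU>2.0.CO;2-Z"))
```

Note that nothing in the resulting string says where a resolver lives; that information has to come from elsewhere.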

Business models

• DOI: annual payment for each DOI assigned
  – no decision yet on the size of the payment
  – flat fee for publisher ID

• URN: no price at all
  – but someone has to pay for the resolution services

DIEPER policy

• URNs will be used, in order to enable URN-based resolution services

• ISSN/SICI will be used

• ISSN International Centre will assist in creation of URN resolution services
  – ISSN database will be contacted first, in order to get the address of the resolution service
