New Digital Library Possibilities Using the Open Archives Initiative Protocol for
Metadata Harvesting (OAI-PMH)
Michael L. Nelson
Old Dominion University
Norfolk, Virginia, [email protected]
http://www.cs.odu.edu/~mln/icsep/
International Conference on Scientific Electronic Publishing in Developing Countries
Valparaiso, Chile
October 2, 2002
Several Slides Also from Van de Sompel & Warner
Random Thoughts
1. Thanks to the Organizing Committee for inviting me
2. I wish I had paid attention in my high school Spanish classes…
3. Publishers & Editors: if you want increased coverage, exposure and readership, you must “do” OAI…
Outline
• OAI-PMH history and technical highlights
  – a full technical review is out of the scope of this presentation
• Example data provider use
• Example service provider uses
• Implications for authors and editors
• Looking to the future
Open Archives Initiative
The protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management can still apply!)
Archive defined as a “collection of stuff” -- not the archivist’s definition of “archive”. “Repository” used in most OAI documents.
OAI is happening at break-neck speed...
The Rise and Fall of Distributed Searching
• wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice
  – Davis & Lagoze, JASIS 51(3), pp. 273-80
  – Powell & French, Proc 5th ACM DL, pp. 264-265
• distributed searching of N nodes still viable, but only for small values of N
  • NCSTRL: N > 100; bad
  • NTRS/NIX: N <= 20; ok (but could be better)
The Rise and Fall of Distributed Searching
• Other problems of distributed searching (from STARTS)
  – source-metadata problem
    • how do you know which nodes to search?
  – query-language problem
    • syntax varies and drifts over time between the various nodes
  – rank-merging problem
    • how do you meaningfully merge multiple result sets?
• Temptations:
  – centralize all functions
    • “everything will be done at X”
  – standardize on a single product
    • “everyone will use system Y”
Santa Fe Convention [02/2000]
• goal: optimize discovery of e-prints
  http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html
• input:
  • the UPS prototype
    http://www.dlib.org/dlib/february00/vandesompel-ups/02vandesompel-ups.html
  • RePEc / SODA “data provider / service provider model”
  • Dienst protocol
  • deliberations at Santa Fe meeting [10/99]
• Data Providers
  – publishing into an archive
  – providing methods for metadata “harvesting”
    • provide non-technical context for sharing information also
• Service Providers
  – harvest metadata from providers
  – implement user interface to data
• Self-describing archives
  – Much of the learning about the constituent UPS archives occurred out of band…
Data and Service Providers
Even if these are done by the same DL, these are distinct roles
Metadata Harvesting
• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
  – data remains at remote repositories
[Figure: a user searches for “cfd applications” against a local copy of metadata that was harvested offline from several nodes. Each node is independently maintained; all searching, browsing, etc. is performed on the local metadata; individual nodes can still support direct user interaction.]
• low-barrier interoperability specification
• metadata harvesting model: data provider / service provider
• focus on document-like objects
• autonomous protocol
• HTTP based
• XML responses
• unqualified Dublin Core
• experimental: 12-18 months
OAI-PMH v.1.0 [01/2001]
             Santa Fe convention    OAI-PMH v.1.0/1.1          OAI-PMH v.2.0
about        eprints                document-like objects      resources
metadata     OAMS                   unqualified Dublin Core    unqualified Dublin Core
transport    HTTP                   HTTP                       HTTP
responses    XML                    XML                        XML
requests     HTTP GET/POST          HTTP GET/POST              HTTP GET/POST
verbs        Dienst                 OAI-PMH                    OAI-PMH
nature       experimental           experimental               stable
model        metadata harvesting    metadata harvesting        metadata harvesting
OAI-PMH 2.0
• Good news: OAI-PMH is still Six Verbs + Dublin Core
• Incremental improvements
  – single XML schema
  – ambiguities removed
  – more expressive options
  – cleaner separation of roles & responsibilities
• Bad news: not backwards compatible with 1.1
Dublin Core
• Dublin Core Metadata Initiative
  – http://www.dublincore.org/
  – from 1994-1995, recognizing the need for simple, interoperable metadata for resource discovery
  – good overview of metadata & DC: http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
  – 15 elements (qualifiers possible)
    Title        Creator   Subject   Description  Publisher
    Contributor  Date      Type      Format       Identifier
    Source       Language  Relation  Coverage     Rights
OAI Mechanics
Request is encoded in http
Response is encoded in XML
XML Schemas for the responses are defined in the OAI-PMH document
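To make "request is encoded in http" concrete, here is a minimal sketch of how a harvester might turn a verb and its arguments into an HTTP GET URL. The helper name build_request is hypothetical; the base URL and identifier are the arXiv examples that appear elsewhere in these slides, and no network call is made.

```python
from urllib.parse import urlencode


def build_request(base_url, verb, **arguments):
    """Encode an OAI-PMH request as an HTTP GET URL (hypothetical helper)."""
    params = {"verb": verb}
    # Skip arguments left as None so they are not sent as the string "None".
    params.update({name: value for name, value in arguments.items() if value is not None})
    return base_url + "?" + urlencode(params)


# "from" is a Python keyword; pass it as build_request(..., **{"from": "2002-01-01"}).
url = build_request(
    "http://arXiv.org/oai2",
    "GetRecord",
    identifier="oai:arXiv:cs/0112017",
    metadataPrefix="oai_dc",
)
print(url)
```

The reserved characters in the identifier (":" and "/") are percent-encoded by urlencode, which is the job the protocol leaves to the HTTP layer.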
Overview of OAI-PMH Verbs
Verb                  Function                                Group
Identify              description of archive                  metadata about the repository
ListMetadataFormats   metadata formats supported by archive   metadata about the repository
ListSets              sets defined by archive                 metadata about the repository
ListIdentifiers       OAI unique ids contained in archive     harvesting verbs
ListRecords           listing of N records                    harvesting verbs
GetRecord             listing of a single record              harvesting verbs
most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
protocol vs periphery
• clear distinction between protocol and
periphery
• fixed protocol document
• extensible implementation guidelines:
• e.g. sample metadata formats, description containers, about containers
• allows for OAI guidelines and community guidelines
OAI-PMH vs HTTP
• clear separation of OAI-PMH and HTTP
• OAI-PMH error handling
• all OK at HTTP level? => 200 OK
• something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb)
• http codes 302, 503, etc. still available to implementers, but no longer represent OAI-PMH events
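The separation above can be sketched from the harvester's side: check the HTTP status first, then look inside the XML for an OAI-PMH error element. This is an illustration only. The names OAIError and check_response are hypothetical, the sample body is modeled on the badVerb response in these slides, and the namespace is the one OAI-PMH 2.0 responses are assumed to carry.

```python
import xml.etree.ElementTree as ET

# Assumption: responses are in the OAI-PMH 2.0 namespace.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


class OAIError(Exception):
    """Protocol-level failure reported inside an HTTP 200 response."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def check_response(http_status, body):
    """Return the parsed root element, or raise for HTTP or OAI-PMH errors."""
    if http_status != 200:
        # Transport problem (302, 503, ...): not an OAI-PMH event.
        raise RuntimeError(f"HTTP status {http_status}")
    root = ET.fromstring(body)
    error = root.find(OAI_NS + "error")
    if error is not None:
        raise OAIError(error.get("code"), (error.text or "").strip())
    return root


BAD_VERB = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<responseDate>2002-02-08T08:55:46Z</responseDate>"
    "<request>http://arXiv.org/oai2</request>"
    '<error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>'
    "</OAI-PMH>"
)

try:
    check_response(200, BAD_VERB)
except OAIError as exc:
    print("OAI-PMH error:", exc.code)
```

Only the first error element is reported here; a fuller client would collect all of them.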
resource – item – record
[Figure: a resource (“David”) is represented by an item holding all available metadata about David; the item is disseminated as Dublin Core, MARC and SPECTRUM metadata records.]
item = identifier
record = identifier + metadata format + datestamp
set-membership is item-level property
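One way to picture the item / record distinction is as two small data types. This is a sketch under assumptions, not a structure defined by the protocol: the class and field names are hypothetical, and the "marc" prefix is used purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp"""
    identifier: str
    metadata_prefix: str
    datestamp: str


@dataclass
class Item:
    """item = identifier; set membership lives here, not on the record."""
    identifier: str
    set_specs: list = field(default_factory=list)
    # Hypothetical bookkeeping: metadata prefix -> datestamp of that format.
    datestamps: dict = field(default_factory=dict)

    def records(self):
        """One record per metadata format the item can be disseminated in."""
        return [Record(self.identifier, prefix, stamp)
                for prefix, stamp in self.datestamps.items()]


item = Item(
    "oai:arXiv:cs/0112017",
    set_specs=["cs", "math"],
    datestamps={"oai_dc": "2001-12-14", "marc": "2001-12-14"},
)
for record in item.records():
    print(record)
```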
other general changes
• better definitions of harvester,
repository, item, unique identifier, record,
set, selective harvesting
• oai_dc schema builds on DCMI XML
Schema for unqualified Dublin Core
• usage of must, must not etc. as in
RFC2119
• wording on response compression
other general changes
• all protocol responses can be validated
with a single XML Schema
• easier for data providers
• no redundancy in type definitions
• SOAP-ready
• clean for error handling
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord" … …>http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        …..
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
response no errors
note no http encoding of the OAI-PMH request
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
response with error
with errors, only the correct attributes are echoed in <request>
resumptionToken
scenario: harvesting 2770 records in 3 separate 1000 record “chunks”
[Figure: exchange between a harvester and a repository backed by an RDBMS]
harvester:  ListRecords
repository: Records 1-1000, resumptionToken=AXad31
harvester:  ListRecords, resumptionToken=AXad31
repository: Records 1001-2000, resumptionToken=pQ22-x
harvester:  ListRecords, resumptionToken=pQ22-x
repository: Records 2001-2770
• idempotency of resumptionToken: return same incomplete list when rT is reissued
  • while no changes occur in the repo: strict
  • while changes occur in the repo: all items with unchanged datestamp
• new, optional attributes for the resumptionToken:
  • expirationDate
  • completeListSize
  • cursor
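The 2770-record scenario can be sketched as a simple loop that keeps reissuing ListRecords until no token comes back. Everything here is simulated in memory: fake_list_records is a hypothetical stand-in for the repository, the token strings are the ones from the slide, and "no token" is modeled as None rather than as an empty XML element.

```python
def harvest(list_records):
    """Follow resumptionTokens until the repository stops returning one."""
    records, token = [], None
    while True:
        chunk, token = list_records(token)
        records.extend(chunk)
        if not token:
            return records


# Hypothetical in-memory repository: 2770 records served 1000 at a time.
ALL_RECORDS = list(range(1, 2771))
CHUNK = 1000
START_OF = {None: 0, "AXad31": 1000, "pQ22-x": 2000}   # token -> start offset
NEXT_TOKEN = {0: "AXad31", 1000: "pQ22-x", 2000: None}  # offset -> next token


def fake_list_records(token):
    """Return (records, next resumptionToken) for the given token."""
    start = START_OF[token]
    return ALL_RECORDS[start:start + CHUNK], NEXT_TOKEN[start]


harvested = harvest(fake_list_records)
print(len(harvested), harvested[0], harvested[-1])
```

Because the fake repository is stateless, reissuing a token returns the same chunk, which is the strict form of idempotency described above.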
• harvesting granularity
• mandatory support of YYYY-MM-DD
• optional support of YYYY-MM-DDThh:mm:ssZ
• other granularities considered, but ultimately rejected
• granularity of from and until must be the same
harvesting granularity
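The granularity rules lend themselves to a small check a harvester could run before sending from and until. This is a sketch under assumptions: the function names are hypothetical, and the regular expressions test only the shape of the value, not whether it is a real calendar date.

```python
import re

DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"

_PATTERNS = {
    DAY: re.compile(r"\d{4}-\d{2}-\d{2}"),
    SECONDS: re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"),
}


def granularity(value):
    """Return the granularity a date string is written in, or None."""
    for name, pattern in _PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return None


def dates_ok(from_=None, until=None, repo_granularity=DAY):
    """True when from/until are well formed, match each other, and are supported."""
    used = [granularity(v) for v in (from_, until) if v is not None]
    if None in used:
        return False              # not one of the two legal formats
    if len(set(used)) > 1:
        return False              # from and until must share a granularity
    if SECONDS in used and repo_granularity == DAY:
        return False              # repository supports day granularity only
    return True


print(dates_ok("2002-01-01", "2002-02-01"))
print(dates_ok("2002-01-01", "2002-02-01T00:00:00Z"))
```

The repo_granularity argument would come from the granularity element of the repository's Identify response.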
• Identify more expressive
Identify
<Identify>
<repositoryName>Library of Congress 1</repositoryName>
<baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
<protocolVersion>2.0</protocolVersion>
<adminEmail>[email protected]</adminEmail>
<adminEmail>[email protected]</adminEmail>
<deletedRecord>transient</deletedRecord>
<earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
<granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
<compression>deflate</compression>
• header contains set membership of item
header
<record>
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata>
    …..
  </metadata>
</record>
eliminates the need for the “double harvest” 1.x required to get all records and all set information
• ListIdentifiers returns headers
ListIdentifiers
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="…" …>http://arXiv.org/oai2</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/9801001</identifier>
      <datestamp>1999-02-23</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/9801002</identifier>
      <datestamp>1999-03-20</datestamp>
      <setSpec>physic:hep</setSpec>
      <setSpec>physic:exp</setSpec>
    </header>
    ……
• introduction of provenance container to
facilitate tracing of harvesting history
provenance
<about>
  <provenance>
    <originDescription>
      <baseURL>http://an.oa.org</baseURL>
      <identifier>oai:r1:plog/9801001</identifier>
      <datestamp>2001-08-13T13:00:02Z</datestamp>
      <metadataPrefix>oai_dc</metadataPrefix>
      <harvestDate>2001-08-15T12:01:30Z</harvestDate>
      <originDescription>
        … … …
      </originDescription>
    </originDescription>
  </provenance>
</about>
• introduction of friends container to
facilitate discovery of repositories
friends
<description>
<friends>
<baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
<baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
<baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
<baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
</friends>
</description>
NASA <friends> example (1)
• A light weight, DP-centric method to communicate the existence of “others”
http://techreports.larc.nasa.gov/ltrs/oai2.0/?verb=Identify
..
<description>
  <friends ..namespace stuff..>
    <baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
    <baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
    <baseURL>http://horus.riacs.edu/perl/oai/</baseURL>
    <baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
  </friends>
</description>
..
NASA <friends> example (2)
[Figure: a harvester sends Identify to http://techreports.larc.nasa.gov/ltrs/oai2.0/, whose <friends>…</friends> container points to:]
http://naca.larc.nasa.gov/oai2.0/
http://ntrs.nasa.gov/oai2.0/
http://ston.jsc.nasa.gov/collections/TRS/oai/
http://horus.riacs.edu/perl/oai/
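Following friends links from one Identify response to the next amounts to a breadth-first walk over repositories. The sketch below assumes a caller-supplied fetch_identify function instead of real HTTP, matches elements by local name so the exact friends namespace does not matter, and uses a made-up four-node network on example hostnames rather than the NASA repositories.

```python
import xml.etree.ElementTree as ET
from collections import deque


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def friends_of(identify_xml):
    """Return the baseURLs listed in any <friends> container of a response."""
    urls = []
    for element in ET.fromstring(identify_xml).iter():
        if local_name(element.tag) == "friends":
            urls.extend(child.text.strip() for child in element
                        if local_name(child.tag) == "baseURL" and child.text)
    return urls


def discover(start_url, fetch_identify):
    """Breadth-first discovery of repositories reachable via friends links."""
    seen, queue = {start_url}, deque([start_url])
    while queue:
        for friend in friends_of(fetch_identify(queue.popleft())):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return seen


def identify_with(friends):
    """Build a toy Identify response (hypothetical, heavily abbreviated)."""
    links = "".join(f"<baseURL>{url}</baseURL>" for url in friends)
    return f"<Identify><description><friends>{links}</friends></description></Identify>"


# Hypothetical network: who lists whom as a friend.
NETWORK = {
    "http://a.example/oai": ["http://b.example/oai", "http://c.example/oai"],
    "http://b.example/oai": ["http://a.example/oai"],
    "http://c.example/oai": ["http://d.example/oai"],
    "http://d.example/oai": [],
}

found = discover("http://a.example/oai", lambda url: identify_with(NETWORK[url]))
print(sorted(found))
```

The seen set keeps mutual friends (a lists b, b lists a) from causing an endless loop. URLs are compared as plain strings, so a real crawler would want to normalize trailing slashes first.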
• introduction of branding container for DPs to suggest rendering & association hints
branding
<branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/ http://www.openarchives.org/OAI/2.0/branding.xsd">
  <collectionIcon>
    <url>http://my.site/icon.png</url>
    <link>http://my.site/homepage.html</link>
    <title>MySite(tm)</title>
    <width>88</width>
    <height>31</height>
  </collectionIcon>
  <metadataRendering metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/" mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering>
  <metadataRendering metadataNamespace="http://another.place/MARC" mimeType="text/css">http://another.place/MARCrender.css</metadataRendering>
</branding>
• revision of oai-identifier
oai-identifier
<description>
  <oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
    <scheme>oai</scheme>
    <repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier>
    <delimiter>:</delimiter>
    <sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier>
  </oai-identifier>
</description>
domain based repository names
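As a rough illustration of the identifier layout, the sample identifier can be split on the declared delimiter into its three parts. The function name and the "localIdentifier" key are hypothetical labels chosen for this sketch.

```python
def parse_oai_identifier(identifier, delimiter=":"):
    """Split scheme, repositoryIdentifier and the repository-local part.

    Only the first two delimiters are treated as separators, so the local
    part may itself contain the delimiter. Raises ValueError when fewer
    than three parts are present.
    """
    scheme, repository, local = identifier.split(delimiter, 2)
    return {
        "scheme": scheme,
        "repositoryIdentifier": repository,
        "localIdentifier": local,
    }


print(parse_oai_identifier("oai:oai-stuff.foo.org:5324"))
print(parse_oai_identifier("oai:arXiv:cs/0112017"))
```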
• SOAP implementation
• Result set filtering
• Multiple / “best” metadata
• GetRecord -> GetRecords
• Machine readable rights management
• XML format for “mini-archives”
did not make it into OAI-PMH v.2.0
• Resources on DL projects are typically spent in 2 areas:
  – creating & maintaining the collection
    • data provider
  – developing access services for the collection (searching, browsing, etc.)
    • service provider
• OAI-PMH allows for specialization based on resources / interest
So What Does OAI-PMH Mean for Your Digital Library?
Scientific Communication
• With only some exceptions, which interface is used for discovery is not as important as the fact that discovery occurred in the first place…
  – “control” of the discovered objects is not “lost” by data providers
    • however, higher level mirroring services can be built on top of OAI (cf. NACA & ARC mirroring between NASA LaRC and MAGiC)
• The real power of OAI-PMH derives as much from what it does not do as what it actually does
What Does OAI-PMH Mean for Authors?
• On the surface, absolutely nothing!
  – the ideal OAI deployment should be absolutely invisible to normal DL operations
  – uninterested users should not even notice or care
• Indirectly, they should enjoy the benefits of the critical mass of current and developing DL tools & systems
  – personal, institutional data providers
  – proliferation of targeted, value-added service providers
What Does OAI-PMH Mean For Editors?
• Absolutely everything…
• The decoupling of SPs and DPs will have significant and profound implications for scientific and technical information exchange
  – OAI-PMH is actually just one component in a larger engineering effort for scholarly communication (e.g. OpenURL)
• Service and resource integration will be the focus of journals, professional societies, universities, etc.
  – OAI-PMH will be as basic and core a technology for scientific publishing as http & XML
Field of Dreams
• It should be easy to be a data provider, even if it makes more work for the service provider.
  – if enough data providers exist, the service providers will come (DPs >> SPs)
• Open-source / freely available tools
  – “drop-in” data providers:
    • industrial strength: http://www.eprints.org/
    • personal size: http://kepler.cs.odu.edu/
  – tools to make your existing DL a data provider:
    • http://www.openarchives.org/tools/tools.htm
    • also: OAI-implementers mailing list / mail archive!
  – service providers:
    • Arc: http://sourceforge.net/projects/oaiarc/
OAI Observation: Front-End Only
• No input/registry mechanism
  – OAI harvesting protocol is always a front-end for something else
    • filesystem, Dienst, RDBMS, LDAP, etc.
  – convenient for pre-existing DLs, but does not address “new” DLs
    • e.g., “we want to do OAI”
• Bounds the scope of OAI
  – responsibilities and domain of OAI are still being discussed
  – tension between functionality and simplicity
OAI Observation: No T&C
• Possible to use multiple OAI servers in a DMZ-like configuration…
[Figure: a source database feeds a private OAI server, which answers OAI requests from trusted hosts, and a public OAI server, which answers OAI requests from arbitrary hosts; the public server could even use a separate copy of the database…]
OAI Observation: No T&C
• Possible to use OAI harvesting protocol in closed, restricted systems
[Figure: four DLs labeled OAI 1, OAI 2, OAI 3 and OAI 4; all OAI requests originate from these 4 DLs]
Metadata
– Q: “Which format should I use?”
  • A: any/all of them…
– lowest common denominator: unqualified Dublin Core
– Again, little known about actual behavior
  • will DC actually be useful? or too lossy?
  • will communities create/adopt specific formats?
  • will native (presumably richer) formats be harvested?
we very much want this to happen...
“The Return of MARC” ?!
The Future: Community Building
• Ultimately, protocols and metadata formats are not what makes a difference
• Rather, the critical mass afforded by a common set of utilities (cf. http, Dublin Core, XML)
• The best current example: The Open Language Archives Community – http://www.language-archives.org
• OAI-PMH provides the basis for communication between strangers, but allows even richer communication between friends
ListMetadataFormats
            1.1                      2.0
Arguments   identifier (OPTIONAL)    identifier (OPTIONAL)
Errors      id does not exist        badArgument
                                     noMetadataFormats
                                     idDoesNotExist
ListSets
            1.1                            2.0
Arguments   resumptionToken (EXCLUSIVE)    resumptionToken (EXCLUSIVE)
Errors      no set hierarchy               badArgument
                                           badResumptionToken
                                           noSetHierarchy
ListIdentifiers
            1.1                            2.0
Arguments   from (OPTIONAL)                from (OPTIONAL)
            until (OPTIONAL)               until (OPTIONAL)
            set (OPTIONAL)                 set (OPTIONAL)
            resumptionToken (EXCLUSIVE)    resumptionToken (EXCLUSIVE)
                                           metadataPrefix (REQUIRED)
Errors      no records match               badArgument
                                           cannotDisseminateFormat
                                           badResumptionToken
                                           noSetHierarchy
                                           noRecordsMatch
ListRecords
            1.1                                       2.0
Arguments   from (OPTIONAL)                           from (OPTIONAL)
            until (OPTIONAL)                          until (OPTIONAL)
            set (OPTIONAL)                            set (OPTIONAL)
            resumptionToken (EXCLUSIVE)               resumptionToken (EXCLUSIVE)
            metadataPrefix (REQUIRED)                 metadataPrefix (REQUIRED)
Errors      no records match                          noRecordsMatch
            metadata format cannot be disseminated    cannotDisseminateFormat
                                                      badResumptionToken
                                                      noSetHierarchy
                                                      badArgument
GetRecord
            1.1                                       2.0
Arguments   identifier (REQUIRED)                     identifier (REQUIRED)
            metadataPrefix (REQUIRED)                 metadataPrefix (REQUIRED)
Errors      id does not exist                         badArgument
            metadata format cannot be disseminated    cannotDisseminateFormat
                                                      idDoesNotExist
Argument Summary
                      metadataPrefix  from      until     set       resumptionToken  identifier
Identify              -               -         -         -         -                -
ListMetadataFormats   -               -         -         -         -                optional
ListSets              -               -         -         -         exclusive        -
ListIdentifiers       required        optional  optional  optional  exclusive        -
ListRecords           required        optional  optional  optional  exclusive        -
GetRecord             required        -         -         -         -                required
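Error Summary
Identify              BA
ListMetadataFormats   BA  NMF  IDDNE
ListSets              BA  BRT  NSH
ListIdentifiers       BA  BRT  CDF  NRM  NSH
ListRecords           BA  BRT  CDF  NRM  NSH
GetRecord             BA  CDF  IDDNE
BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch, NSH = noSetHierarchy
Generate badVerb on any input not matching the 6 defined verbs.
This is an inversion of the table in section 3.6 of the OAI-PMH specification.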
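The argument and error summaries can be combined into a table-driven check that a data provider might run before touching its database. This is a sketch, not the specification's algorithm: the ARGUMENTS table encodes the 2.0 columns of the verb slides, validate is a hypothetical name, and only the purely syntactic errors (badVerb, badArgument) are covered, since the others depend on repository contents.

```python
HARVEST_ARGS = {
    "metadataPrefix": "required",
    "from": "optional",
    "until": "optional",
    "set": "optional",
    "resumptionToken": "exclusive",
}

ARGUMENTS = {
    "Identify": {},
    "ListMetadataFormats": {"identifier": "optional"},
    "ListSets": {"resumptionToken": "exclusive"},
    "ListIdentifiers": HARVEST_ARGS,
    "ListRecords": HARVEST_ARGS,
    "GetRecord": {"identifier": "required", "metadataPrefix": "required"},
}


def validate(params):
    """Return None for a well-formed request, else 'badVerb' or 'badArgument'."""
    params = dict(params)
    verb = params.pop("verb", None)
    if verb not in ARGUMENTS:
        return "badVerb"
    allowed = ARGUMENTS[verb]
    if any(name not in allowed for name in params):
        return "badArgument"          # argument not defined for this verb
    if any(allowed[name] == "exclusive" for name in params):
        # An exclusive argument must be the only one besides the verb.
        return None if len(params) == 1 else "badArgument"
    required = [name for name, kind in allowed.items() if kind == "required"]
    if any(name not in params for name in required):
        return "badArgument"          # a required argument is missing
    return None


print(validate({"verb": "ShowMe"}))
print(validate({"verb": "ListRecords", "resumptionToken": "AXad31"}))
print(validate({"verb": "GetRecord", "identifier": "oai:arXiv:cs/0112017"}))
```

Keeping the rules in a dictionary mirrors the summary table, so a change to one verb's arguments is a one-line edit.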