Excellent XML – systems interoperability at the Wellcome Library
EIUG 11th Conference, Stirling University
1 & 2 September 2005
Margaret Savage-Jones
Wellcome Library Systems
Millennium - Innovative Interfaces Inc.
http://catalogue.wellcome.ac.uk Includes online requesting
from closed stack since mid 2003
Calm - Archive system – DS Ltd http://archives.wellcome.ac.uk
Online access to archive & mss holdings
Miro/MedPhoto image system – System Simulation Ltd
http://medphoto.wellcome.ac.uk
Online access to over 100,000 images, image retrieval & delivery
Underlying protocol: OAI-PMH
Open Archives Initiative Protocol for Metadata Harvesting - protocol for sharing and harvesting metadata between different OAI-compliant systems
Based on XML and HTTP
One system (CALM or MedPhoto) exposes metadata via an OAI repository. This metadata is harvested by the other system (Millennium) and then loaded
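As a rough illustration of what the harvesting side sees, the sketch below reads the record headers out of an OAI-PMH ListRecords response using only the Python standard library. The sample response is invented for this example (the identifiers and the example.org address are hypothetical); only the namespace and element names come from the OAI-PMH 2.0 protocol. It is not a description of how XML Harvester itself is implemented.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response for illustration; identifiers are hypothetical.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2005-06-01T19:27:47Z</responseDate>
  <request verb="ListRecords" metadataPrefix="marc21">http://example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:rec-1</identifier>
        <datestamp>2005-05-03</datestamp>
      </header>
      <metadata/>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:rec-2</identifier>
        <datestamp>2005-05-17</datestamp>
      </header>
      <metadata/>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_headers(response_xml):
    """Return (identifier, datestamp) pairs for every record in the response."""
    root = ET.fromstring(response_xml)
    headers = []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        headers.append((identifier, datestamp))
    return headers


for identifier, datestamp in list_headers(SAMPLE_RESPONSE):
    print(identifier, datestamp)
```

The datestamp in each header is what makes date-range (selective) harvesting possible.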
Motivation
With a MARC21 library system, an ISAD(G) archive system and a bespoke image repository, it was a strategic objective to make these systems interoperate
Phase II of the Closed Stack project - Western Manuscripts and Archives had to be requestable online by summer 2004
XML Harvester development by Innovative with Michigan State University 2001-02. Wellcome placed an order for XML Harvester in January 2003
With CALM ver 4 it was possible to export EAD XML
Benefits
Online requesting - Western MSS & Archives collections
One circulation system to manage and one set of circ stats
Same interface for all online requests from stack
Archives & manuscripts like other collections
Image sets for library objects displayed in Web OPAC
User can jump from one system to another
No need to rekey user search in other system
Selective harvesting for onward record updating
Example: archive record (from Crick Coll.)
Harvested archive record in Web OPAC
Image of the archive item
Encoded Archival Description (EAD)
Initially XML Harvester dealt only with EAD and needed
encodinganalogs for parsing. It was developed with Michigan
State University (MSU), whose EAD finding aids had
MARC encodinganalogs. The Harvester parser read these tags.
Encodinganalogs are attributes in XML records indicating the
field, subfield, indicators etc. in another descriptive encoding
system, e.g. the MARC21 equivalent of an EAD tagged element
Archive system metadata
Hierarchical, tree structure with collection and component item
level records catalogued in General International Standard Archival
Description, ISAD(G)
Export from CALM using the default subset of the EAD DTD left some
fields empty – had to export as “DServe Natural” XML, which
includes field tags. Catalog.xml output with catalog.DTD
Pilot – used “Haddad” catalogue XML
Used a small set of 87 Arabic XML records (a local variant
of the ‘MASTER’ XML DTD) as a pilot to test XML Harvester
Used stylesheets to filter unwanted fields, add encodinganalogs
and put 87 .xml files in a web server directory ready to be
harvested
Web crawler
Harvester reaches the XML files through port 80.
We added a page to the Millennium screens directory
listing files with redirections to the web server folder.
Harvester opened the page, scanned for ‘HREF’ strings
which directed it to the XML records (file.xml)
The XML Harvester parser read tags from encodinganalogs
to create MARC21 records, writing to a file for loading
Redirection screen
<html>
<head>
<title> Harvester Test</title>
</head>
<body>
<em>Mss Files</em><br>
<strong> Sample Screen # 2</strong>
<PRE>
Test to confirm if harvester can crawl files deposited on wtcalm01
</pre>
<A HREF=http://wtcalm01.wellcome.ac.uk/xml/002.xml>002</A>
<A HREF=http://wtcalm01.wellcome.ac.uk/xml/83.xml>83</A>
<A HREF=http://wtcalm01.wellcome.ac.uk/xml/82.xml>82</A>
</body>
</html>
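The crawl step can be pictured with a short Python sketch that scans a page for HREF strings and keeps the ones pointing at .xml files. The page text is a trimmed copy of the redirection screen; the regular expression and the function name are assumptions made for this illustration, not the Harvester's actual logic.

```python
import re

# Trimmed copy of the redirection screen (unquoted HREF values, as on the slide).
PAGE = """<html>
<body>
<A HREF=http://wtcalm01.wellcome.ac.uk/xml/002.xml>002</A>
<A HREF=http://wtcalm01.wellcome.ac.uk/xml/83.xml>83</A>
<A HREF=http://wtcalm01.wellcome.ac.uk/xml/82.xml>82</A>
</body>
</html>"""

# Matches href=value with or without quotes, in any letter case.
HREF_PATTERN = re.compile(r"""href\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)


def find_xml_links(page):
    """Return every HREF target on the page that ends in .xml."""
    return [url for url in HREF_PATTERN.findall(page) if url.lower().endswith(".xml")]


for url in find_xml_links(PAGE):
    print(url)
```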
Example – encodinganalogs for 856
<hyperlink>
  <url ENCODINGANALOG="85607$u">
    <xsl:text>http://wisdom.wellcome.ac.uk/xml/</xsl:text>
    <xsl:value-of select="substring-after(//idno,'WMS Arabic')"/>
    <xsl:text>.html</xsl:text>
  </url>
  <text ENCODINGANALOG="85607$z">View full manuscript record</text>
</hyperlink>
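To show how encodinganalogs can drive the creation of MARC fields, here is a minimal Python sketch. It assumes the attribute value is laid out as a three-digit tag, two indicator characters, then "$" and a subfield code; that layout is inferred from the "85607$u" example and is not taken from XML Harvester documentation. The sample input stands for the stylesheet's output, and its URL is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed layout: tag (3 digits), indicator 1, indicator 2, "$", subfield code.
ANALOG = re.compile(r"^(\d{3})(.)(.)\$(\w)$")

# Stands for the stylesheet output; the URL is invented for illustration.
SAMPLE = """<hyperlink>
  <url ENCODINGANALOG="85607$u">http://example.org/xml/1.html</url>
  <text ENCODINGANALOG="85607$z">View full manuscript record</text>
</hyperlink>"""


def fields_from_encodinganalogs(xml_text):
    """Group element text into MARC-like fields keyed by (tag, ind1, ind2)."""
    fields = {}
    for element in ET.fromstring(xml_text).iter():
        analog = element.get("ENCODINGANALOG")
        if analog is None:
            continue
        match = ANALOG.match(analog)
        if match is None:
            continue  # skip values that do not fit the assumed layout
        tag, ind1, ind2, code = match.groups()
        value = (element.text or "").strip()
        fields.setdefault((tag, ind1, ind2), []).append((code, value))
    return fields


print(fields_from_encodinganalogs(SAMPLE))
```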
Harvested MARC21 “Haddad” record
Links: to PDF and Request button
Lessons
Arabic records would be loaded only once but records from
CALM would need regular reharvesting/overlay
Need a more sophisticated approach than crawling a web
directory – XML Harvester can harvest from OAI Repository and
use datestamps in OAI to harvest records created, or modified
in specified date range
XSLT could be used to transform records to MARC21 OAI
without using encodinganalogs.
Archives OAI repository
Built on CALM server using freeware University of Illinois
Provider service tool (Runs under Windows IIS)
Other Requirements:
Microsoft Windows 2000 Server
Microsoft IIS ver 4 or higher
Microsoft ASP
Microsoft XML Parser (MSXML) 4.0
Microsoft ActiveX Data Objects and ODBC compliant datasource, i.e. MS Access 97+ database
Firewall access on port 80
Key decisions
Metadata export – chose full CALM record XML DTD (not EAD)
Matchpoint – decided to load contents of Calm RefNo field to Millennium 001, indexed in ‘o’
Also had to consider:
Hierarchical record level to harvest
Navigation between the two systems
Millennium parameters
Decision: Record level to harvest
A “Collection” could consist of more than 40 boxes. Must have
1:1 record relationship to make requesting and retrieval work
Decision to exclude archives Collection records & use Component
level records. Each of these represents 1 item (box, folder, piece)
and links to a single bib record with attached item for circulation
in Millennium
Decision: Navigation
Archivists wanted the archives (CALM) interface to offer
the main search route for Western Archives & MSS
User is taken from CALM record into Millennium to place
their request then back to their CALM record to continue
browsing their hit list – two links were needed
Forward: runs cgi script to search Millennium for
corresponding bib record
Back: 856 with URL link (can be inserted by Harvester)
Example: Links
Forward: cgi script runs search of Millennium ‘o’ index for
match on CALM RefNo value
http://catalogue.wellcome.ac.uk/search/o?SEARCH=PPCRI%2FA%2F1%2F2%2F8
Back: RefNo PP/CRI/A/1/2/8 built into OAI record URL linking
to CALM web front end - RefNo value built into search string
http://archives.wellcome.ac.uk/DServe/dserve.exe?&dsqIni=Dserve.ini&dsqApp=Archive&dsqCmd=show.tcl&dsqDb=Catalog&dsqPos=0&dsqSearch=((text)='PP/CRI/A/1/2/8')
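The two link patterns can be sketched in Python as follows. The URL templates are copied from the examples; the function names are made up for this illustration. Note that the forward example searches for PPCRI/A/1/2/8 while the RefNo is PP/CRI/A/1/2/8: the slides do not state the rule that relates the two, so the sketch takes the index value as given rather than guessing at it.

```python
from urllib.parse import quote

OPAC_SEARCH = "http://catalogue.wellcome.ac.uk/search/o?SEARCH="
CALM_SHOW = (
    "http://archives.wellcome.ac.uk/DServe/dserve.exe?&dsqIni=Dserve.ini"
    "&dsqApp=Archive&dsqCmd=show.tcl&dsqDb=Catalog&dsqPos=0"
    "&dsqSearch=((text)='{refno}')"
)


def forward_link(index_value):
    """CALM -> Millennium: search the 'o' index, percent-encoding every slash."""
    return OPAC_SEARCH + quote(index_value, safe="")


def back_link(refno):
    """Millennium -> CALM: build the RefNo into the DServe search string."""
    return CALM_SHOW.format(refno=refno)


print(forward_link("PPCRI/A/1/2/8"))
print(back_link("PP/CRI/A/1/2/8"))
```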
Calm XML export file
<?xml version="1.0" encoding="utf-8" ?>
<record>
  <DScribeRecord>
    <RecordType>Component</RecordType>
    <IDENTITY />
    <RefNo>MS4385/4404</RefNo>
    <AltRefNo>MS.4404</AltRefNo>
    <PreviousNumbers />
    <Title>Notes and extracts on Chemistry, Volumetric Analysis, (etc.)</Title>
    <Date>c. 1865</Date>
    <Level>Item</Level>
    <Extent>1 volume</Extent>
    <UserText5>Bentley House</UserText5>
    <Location />
    <UserText3>Western MSS series 3 - Requestable</UserText3>
    <UserWrapped9 />
    <UserText6 />
    <UserText7 />
Mapping Calm XML to Marc21
Field tags used: 001, 008, 245, 260, 500, 506, 655, 856
And 949 to make the item. Harvester inserts a 99x tag with load
identification code e.g. CALM20040820225128
Found that Component records do not have an ‘author’, which is
only held at Collection level – but not a problem
‘Mock’ bib and item records were keyed into Millennium to:
- demonstrate navigation & agree content with team
- act as a benchmark when harvested records loaded
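A mapping of this kind can be pictured as a small lookup table. In the Python sketch below only RefNo to 001 is stated in the slides; the Title to 245 |a and Date to 260 |c pairings are guesses added for illustration, and the real work was done in XSLT rather than Python. The sample record reuses values from the Calm export example, closed off so that it parses.

```python
import xml.etree.ElementTree as ET

# (Calm element, MARC tag, subfield code). Only RefNo -> 001 comes from the
# slides; the other two rows are hypothetical pairings for illustration.
FIELD_MAP = [
    ("RefNo", "001", None),
    ("Title", "245", "a"),
    ("Date", "260", "c"),
]

SAMPLE = """<DScribeRecord>
  <RecordType>Component</RecordType>
  <RefNo>MS4385/4404</RefNo>
  <Title>Notes and extracts on Chemistry, Volumetric Analysis, (etc.)</Title>
  <Date>c. 1865</Date>
</DScribeRecord>"""


def map_component(record_xml):
    """Return (tag, content) pairs for the mapped, non-empty Calm fields."""
    record = ET.fromstring(record_xml)
    fields = []
    for element_name, tag, subfield in FIELD_MAP:
        value = (record.findtext(element_name) or "").strip()
        if not value:
            continue  # empty Calm fields produce no MARC field
        content = value if subfield is None else "|" + subfield + value
        fields.append((tag, content))
    return fields


for tag, content in map_component(SAMPLE):
    print(tag, content)
```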
XSLT – Extensible Stylesheet Language Transformations
Used XSLT to split the single XML output file into 48,000 component
.xml records using <DScribeRecord> as the record delimiter
and then transform them to MARC21 OAI records listed to
XML Harvester by our OAI repository
The OAI repository installed on the CALM staging server
uses the University of Illinois Provider service tool - freeware
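The splitting step can be illustrated in Python, although the project itself used XSLT and VBScript for this. The sketch assumes the export wraps many DScribeRecord elements in one root element and that Collection-level records are dropped at this point; the second sample record is invented to show the filter.

```python
import xml.etree.ElementTree as ET

# Two-record stand-in for the export file; the Collection record is invented.
EXPORT = """<record>
  <DScribeRecord>
    <RecordType>Component</RecordType>
    <RefNo>MS4385/4404</RefNo>
    <Title>Notes and extracts on Chemistry, Volumetric Analysis, (etc.)</Title>
  </DScribeRecord>
  <DScribeRecord>
    <RecordType>Collection</RecordType>
    <RefNo>HYPOTHETICAL</RefNo>
    <Title>Invented collection-level record</Title>
  </DScribeRecord>
</record>"""


def split_components(export_xml):
    """Return {RefNo: serialised record} for Component-level records only."""
    components = {}
    for record in ET.fromstring(export_xml).iter("DScribeRecord"):
        if record.findtext("RecordType") != "Component":
            continue  # Collection records are excluded from harvesting
        refno = record.findtext("RefNo")
        components[refno] = ET.tostring(record, encoding="unicode")
    return components


for refno, xml_text in split_components(EXPORT).items():
    print(refno, len(xml_text))
```

Each serialised record could then be written to its own .xml file before transformation.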
Millennium parameters
To cope with ‘open’ v ‘closed’ archive collections
– new codes were added to archives records and mapped to
new Millennium branch codes which would trigger Millcirc rules
New branch codes added to Request Rules, Determiner Table,
WWWOPTIONS, Locations served
New MATTYPE to exclude Western Mss and archives from the
Asian Mss scope
Config file for archives record harvest
@LOGLEVEL=CONFIG
@DBNAME=CALM
@URL=http://wtcalm02/oai/oai.asp
@CREATEOVERLAYFROMURI=true
@9XXMARCTAG=991
@USEOAI=true
@DATE=20000606000000
@SHOWMETADATA=true
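The config file is a flat list of @KEY=value lines, so it can be read with a few lines of Python when checking or generating harvest settings. This is a convenience sketch only; the function name is made up, and nothing here reflects how XML Harvester reads the file internally.

```python
CONFIG_TEXT = """@LOGLEVEL=CONFIG
@DBNAME=CALM
@URL=http://wtcalm02/oai/oai.asp
@CREATEOVERLAYFROMURI=true
@9XXMARCTAG=991
@USEOAI=true
@DATE=20000606000000
@SHOWMETADATA=true"""


def parse_harvester_config(text):
    """Turn @KEY=value lines into a dict, ignoring anything else."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("@") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, value = line[1:].split("=", 1)
        settings[key] = value
    return settings


settings = parse_harvester_config(CONFIG_TEXT)
print(settings["DBNAME"], settings["URL"])
```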
Management interface for XML Harvester
Archive record: Request link to Web OPAC
Harvested archive record in Millennium
Patron login screen to place request
Confirmation of request
Interoperation sought with image system
To integrate MedPhoto, a bespoke photo library system,
and Millennium for seamless display and ordering of images
MedPhoto holds images and records for more than 60,000 items
catalogued in Millennium – Iconographic collection, archives &
manuscripts, rare books etc.
Specific need for Millennium User to see images associated with
library objects
Media management interface
Config file for image URL harvest
@LOGLEVEL=CONFIG
@DBNAME=MEDPHOTO
@URL=http://aquarius.wellcome.ac.uk:6969/ixbin/hixserv
@RECID_MARCTAG=001
@CREATEOVERLAYFROMURI=true
@9XXMARCTAG=991
@USEOAI=true
@REQUIRE_EADID=false
@DATE=20000606000000
@OAIFROMDATE=20050701000000
@OAIUNTILDATE=20050731000000
@OAISET=bib
Selective Harvesting – images
Harvest full “bib” set and load to Millennium populating 962s
then each month request list of all new image URLs created since
the last harvest with a Millennium .b number in their record.
<http://medphoto.wellcome.ac.uk:6969/ixbin/hixserv?verb=ListRecords&metadataPrefix=marc21&set=bib&from=2005-05-01&until=2005-05-31>
(for records in May)
<http://medphoto.wellcome.ac.uk:6969/ixbin/hixserv?verb=ListRecords&metadataPrefix=marc21&set=bib&from=2005-06-01&until=2005-06-30>
(for records in June and so on)
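The monthly request URLs follow a fixed pattern, so they are easy to generate. A minimal Python sketch, assuming one request per calendar month and the parameter values shown in the examples (the function name is made up):

```python
import calendar
from urllib.parse import urlencode

BASE_URL = "http://medphoto.wellcome.ac.uk:6969/ixbin/hixserv"


def monthly_harvest_url(year, month, base_url=BASE_URL):
    """Build a ListRecords request covering one whole calendar month."""
    last_day = calendar.monthrange(year, month)[1]  # handles 28/29/30/31
    params = [
        ("verb", "ListRecords"),
        ("metadataPrefix", "marc21"),
        ("set", "bib"),
        ("from", f"{year:04d}-{month:02d}-01"),
        ("until", f"{year:04d}-{month:02d}-{last_day:02d}"),
    ]
    return base_url + "?" + urlencode(params)


print(monthly_harvest_url(2005, 5))
print(monthly_harvest_url(2005, 6))
```

The same from and until dates could be supplied to the Harvester through the @OAIFROMDATE and @OAIUNTILDATE settings.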
Harvesting: Image OAI repository
OAI repository built by SSL on MedPhoto server
Metadata matchpoint: the .b bib record no. is the common element
between Millennium and MedPhoto
XML Harvester selectively requests record set “bib”, all of which
have .b nos, parses the returned list of MARC21 OAI records
and creates a file of MARC records for loading
Matches on .b and overlays, inserting a 962 for each image
962 |u holds the URL for the thumbnail and |e holds the ‘launchpad’ URL
MARC21 record ready to load
File Name: DONE-MEDPHOTO_20050601192747.marc (411,392 bytes) Offset: 256 Blocks: 1 - 2
LEADER 00403nam a2200085uu 4500
DIRECTORY
001000900000 035001500009 856008000024 962018500104 991002800289
TAGS
1 000 00403nam a2200085uu 4500@
2 001 L0027751@
3 035 |a.b12857890@
4 856 4 1 |uhttp://medphoto.wellcome.ac.uk/ixbin/imageserv?MIDMIRO=L0027751|zView image@
5 962 |a000:000:URL:b0000000:000000:0:0:0:0:0:0|tImage|vn|uhttp://medphoto.wellcome.ac.uk/ixbin/hixclient.exe?MIROPAC=L0027751|ehttp://medphoto.wellcome.ac.uk/ixbin/imageserv?MIRO=L0027751@
6 991 |aMEDPHOTO{228}20050601192747@
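The leader and directory in this display can be checked by hand or with a few lines of Python. In the MARC 21 record structure each directory entry is 12 characters (3-character tag, 4-digit field length, 5-digit starting position), leader positions 0-4 hold the record length and positions 12-16 the base address of the data. The sketch below applies those rules to the values shown above; the function names are made up.

```python
LEADER = "00403nam a2200085uu 4500"
# Directory as displayed, with the spaces between entries removed.
DIRECTORY = (
    "001000900000 035001500009 856008000024 962018500104 991002800289"
).replace(" ", "")


def parse_directory(directory):
    """Split the directory into (tag, field_length, start) entries."""
    entries = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        entries.append((entry[0:3], int(entry[3:7]), int(entry[7:12])))
    return entries


def is_consistent(leader, directory):
    """Check leader lengths against the directory entries."""
    record_length = int(leader[0:5])
    base_address = int(leader[12:17])
    # Leader (24) + directory + one field terminator precede the data.
    if base_address != 24 + len(directory) + 1:
        return False
    offset = 0
    for _tag, length, start in parse_directory(directory):
        if start != offset:  # fields must follow each other without gaps
            return False
        offset += length
    # One record terminator follows the last field.
    return base_address + offset + 1 == record_length


print(parse_directory(DIRECTORY))
print(is_consistent(LEADER, DIRECTORY))
```

For this record the arithmetic is 24 + 60 + 1 = 85 for the base address, and 85 + (9 + 15 + 80 + 185 + 28) + 1 = 403 for the record length, matching the leader.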
Example: with |t default
“Launch pad”
We saw an opportunity for further integration: an
intermediate screen, with its URL delivered by the MedPhoto repository and
loaded to 962 |e
User can hotlink from this “launch pad” into image system
to register, use a light box, email, download or order the
image online from the image system before returning to
Web OPAC
What we used
XML Harvester product (III)
OAI repository software
VBScript – for file splitting operation
Instant Saxon (command line XSLT processor)
Microsoft MSXML core services (e.g. ver 5)
Media Management for 962 (or load URLs to 856)
Three OAI-PMH compliant library systems
Shared Record IDs as matchpoints
Some experience of working with stylesheets
Some experience of load tables and record loading
Work in progress
Harvesting legacy catalogues/XML for other Asian MSS
e.g. Iskander and Jain project (with Oxford University)
Complete testing and batch loading of 60,000 thumbnail and
“launchpad” URLs to 962s
Establish routines to manage updates for new, deleted
or amended records – utilise OAI-PMH selective harvesting
Further automation of routines where practicable
Wish List/Enhancements
Global edit for 962 tag
More documentation for XML Harvester
Access to underlying harvester parameters e.g. for XSLT
processor and XML parser
Automation of selective harvesting for maintenance
Useful links
XML http://www.w3.org/XML
EAD http://www.loc.gov/ead/
OAI software http://oai.grainger.uiuc.edu/projectinfo.htm
XSLT http://saxon.sourceforge.net/saxon6.4.3/instant.html
OAI-PMH http://www.openarchives.org/OAI/openarchivesprotocol.html
OAI MARCXML guidelines http://www.openarchives.org/OAI/2.0/guidelines-marcxml.htm
OAI tutorial http://www.oaiforum.org/tutorial
OAI repository testing http://re.cs.uct.ac.za/
Some example records
http://catalogue.wellcome.ac.uk/record=b1465521
http://catalogue.wellcome.ac.uk/record=b1580232
http://catalogue.wellcome.ac.uk/record=b1313568
http://catalogue.wellcome.ac.uk/record=b1613633
http://catalogue.wellcome.ac.uk/search/o?SEARCH=PPCRI%2FA%2F1%2F2%2F8
Thanks for your attention
Margaret Savage-Jones
Library Systems Administrator