Page 1: PANDORA Australia’s Web Archive Library Science Talks SNL/CERN, September 2004 Paul Koerbin Digital Archiving Branch National Library of Australia pkoerbin@nla.gov.au.

PANDORA
Australia’s Web Archive

Library Science Talks
SNL/CERN, September 2004

Paul Koerbin
Digital Archiving Branch

National Library of Australia
pkoerbin@nla.gov.au

1. Background and approach to web archiving
2. The management system (PANDAS)
3. Workflows and procedures
4. Issues and future directions

1. Background and approach to web archiving in Australia - PANDORA

Beginnings

• Name originally an acronym for: ‘Preserving and Accessing Networked Documentary Resources of Australia’

• Now: ‘Australia’s Web Archive’
• Began in mid-1996 (selecting)
• Began archiving in late 1996-early 1997

Approach

• Practical and pragmatic
• Began as: Proof-of-concept project
• Now: Routine National Library activity
• Achieving outcomes while continuing to develop and extend processes and systems
• Best use of available resources and infrastructure

Resources

• Existing technical services staff - librarians
• Digital Archiving Branch has the business responsibility
• Information technology staff from within the Library for development and support
• PANDORA partner institutions (10 including the NLA)

Mandate and responsibilities

• National Library of Australia’s statutory responsibilities
• National Library Act, 1960
• Maintain and develop a national collection of ‘library material’
• Comprehensive collection relating to Australia and the Australian people

Mandate and responsibilities

• National Library has a leadership role for the Australian library community

Legal deposit

• Legal deposit in the federal jurisdiction in Australia does not cover electronic resources

Some key characteristics

• Selective approach to archiving online resources
• Scalable to available resources and do-able
• Negotiate permission to archive
• Apply manual quality assurance processes to harvested resources
• Provide access to the archived resources

Shortcomings of selective approach

• Can’t collect everything that future researchers may want

• Labour intensive tasks
• Does not retain the full complexity of the linking structure of the Internet

Indicative statistics as at August 2004

• 6,500+ titles
• 13,000+ archived instances
• 21 million files*
• 680 gigabytes*

*These figures are for the display copy only. Two more preservation copies plus preservation metadata are maintained.

2. The management system: PANDAS

• PANDAS – PANDORA Digital Archiving System
• Integrated web based system
• Workflow management system
• Developed specifically to manage the web archiving processes at the National Library of Australia
• Used by PANDORA’s partners located throughout Australia

• Developed in-house at the NLA
• Replaced multiple non-integrated systems used between 1996 and 2001
• Written in Java on Apple WebObjects application development platform
• First version released in June 2001
• Second version released August 2002
• Ongoing enhancement and development program

PANDAS system architecture consists of 4 layers

• 1) Presentation layer – client applications for visual presentation to the end user

• 2) Application layer – the core application functionality such as PANDAS and PANDORA

PANDAS system architecture consists of 4 layers

• 3) Business layer – application access to the data storage and communication infrastructure

• 4) Data layer – third party infrastructure products, e.g. Oracle database and WebDAV accessible files servers

Nomenclature

• PANDORA – the whole enterprise
• PANDAS – the whole management system
• PANDAS – the system component providing a web-based user application to manage workflows
• PANDORA – the system component that creates the public interface

PANDAS is used to:

• Record administrative metadata about titles selected (or rejected or monitored) for archiving

• Schedule and initiate harvesting
• Manage quality assurance checking and problem fixing

PANDAS is used to:

• Prepare items for public display through the PANDORA home page

• Manage access restrictions
• Generate management reports

PANDAS is a workflow system that:

• Connects with and utilises other software and protocols for specific functions

• Provides an interface to the harvesting software – currently this is HTTrack (http://www.httrack.com)

PANDAS is a workflow system that:

• Uses WebDAV protocol to provide content managers with remote access to the harvested files

• Uses Z39.50 protocol to access the National Bibliographic Database to extract metadata from the MARC record

PANDORA public interface component

• Title and subject listings and title entry pages are generated ‘on-the-fly’ from PANDAS metadata

• Some static web pages (documents, information)
• Search engine

Persistent identifiers and URLs

• Running number generated by PANDAS
• Persistent URL applied to title entry page

http://nla.gov.au/nla.arc-21220

• Logically extended to any resource in the Archive

http://nla.gov.au/nla.arc-21220-20030822-www.ipjp.org/september2002/schweitzer-ed.html

• Citation generator on public interface
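
The two example URLs above suggest a simple pattern: a resolver prefix, the title's running number, and optionally the instance date and the original host and path. The following is a minimal sketch of that pattern, inferred only from those two examples; the function names, the `ArchiveId` type and the regular expression are illustrative assumptions, not part of PANDAS.

```python
import re
from typing import NamedTuple, Optional

# Resolver prefix as it appears in the two example URLs on the slide.
RESOLVER = "http://nla.gov.au/"


class ArchiveId(NamedTuple):
    """Hypothetical container for the parts of a persistent identifier."""
    title_number: int              # running number generated by PANDAS
    instance_date: Optional[str]   # YYYYMMDD of the archived instance, if present
    resource_path: Optional[str]   # original host and path, if present


def build_persistent_url(title_number, instance_date=None, resource_path=None):
    """Compose a URL shaped like the slide's examples (assumed pattern)."""
    url = f"{RESOLVER}nla.arc-{title_number}"
    if instance_date and resource_path:
        # Extended form: identifier-date-original host and path.
        url += f"-{instance_date}-{resource_path}"
    return url


# Title-level form, or extended form with an 8-digit date and a path.
_PATTERN = re.compile(r"^http://nla\.gov\.au/nla\.arc-(\d+)(?:-(\d{8})-(.+))?$")


def parse_persistent_url(url):
    """Split a persistent URL back into its assumed components."""
    match = _PATTERN.match(url)
    if match is None:
        raise ValueError(f"not a recognised archive identifier: {url}")
    number, instance_date, resource_path = match.groups()
    return ArchiveId(int(number), instance_date, resource_path)
```

Calling `build_persistent_url(21220)` reproduces the title entry URL shown above, and passing `"20030822"` with the original host and path reproduces the extended form.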

3. Workflows and procedures

• Identifying and selecting
• Recording administrative metadata
• Harvesting
• Quality assurance processing
• Archiving
• Preparing for public display
• Creating resource discovery metadata
• Reporting

Identifying and selecting

• Selection guidelines – each partner has their own guidelines
• Just guidelines … not rules nor ideology
• Selection priorities in guidelines (NLA)
• Notification networks – indexing agencies, staff, publishers, public
• NLA selection guidelines available at:

http://pandora.nla.gov.au/selectionguidelines.html

Selection – what sort of publications?

• Titles – the entities to be archived
• Defined during the selection process
• Document-like publications, e.g. PDF
• Whole web sites
• Parts of web sites

Selection – what sort of publications?

• Focus on content – substantial, unique
• Special events or issues
• Format or potential technical problems are not, in principle, a selection consideration
• One-off archiving
• Scheduled archiving – whole entity, not an update

Recording administrative metadata

• Four types of records
  – Title
  – Publisher
  – Indexer
  – Collection
• Selection status
• Additional details associated with status (standing)

Administrative metadata

• Publisher details
• Archiving permission status
• Access restrictions
• Notes
• Assigning ownership of titles
• Transfer titles between agencies

Harvesting

• Mostly harvesting from the Web
• Also able to upload from local drives (WebDAV protocol)
• Third party software – HTTrack
• PANDAS interface to set up harvesting rules

Harvesting

• Define extent of selected resource to be archived

• Set gather filters and gather settings
• Set gather schedule
• Initiate harvesting

Scheduling harvesting

• Significant function of PANDAS
• Regular schedules, e.g. weekly, monthly, annual
• Specific dates
• Harvest now
• Combination of scheduling options
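
Combining the scheduling options listed above can be sketched as merging a regular series of dates with any specific dates and an immediate harvest. This is an illustrative sketch only: the function names, the frequency labels and the month-end clamping rule are assumptions, and the slides do not describe how PANDAS computes its schedules.

```python
from datetime import date, timedelta


def add_months(start, months):
    """Move a date forward by whole months, clamping to the month's last day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    # First day of the following month, minus one day, gives the last day.
    first_of_next = date(year + month // 12, month % 12 + 1, 1)
    last_day = (first_of_next - timedelta(days=1)).day
    return date(year, month, min(start.day, last_day))


def regular_dates(start, end, frequency):
    """Yield dates from start to end (inclusive) at the given frequency."""
    step = 0
    while True:
        if frequency == "weekly":
            current = start + timedelta(weeks=step)
        elif frequency == "monthly":
            current = add_months(start, step)
        elif frequency == "annual":
            current = add_months(start, 12 * step)
        else:
            raise ValueError(f"unknown frequency: {frequency}")
        if current > end:
            return
        yield current
        step += 1


def gather_dates(start, end, frequency=None, specific_dates=(), harvest_now=None):
    """Merge regular, specific and immediate harvest dates into one sorted list."""
    dates = set()
    if frequency is not None:
        dates.update(regular_dates(start, end, frequency))
    dates.update(d for d in specific_dates if start <= d <= end)
    if harvest_now is not None:
        dates.add(harvest_now)  # an immediate harvest is added unconditionally
    return sorted(dates)
```

Each regular date is computed from the start date rather than from the previous date, so a monthly schedule that starts on the 31st returns to the 31st after passing through a shorter month.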

Harvesting - filters and settings

• Default settings
• Ignore robots.txt rules because permission to archive has been obtained from publisher
• Gather sub-directories
• Gather ‘near files’, e.g. linked images
• Limit on depth – sufficient for any web site, but set to prevent abuse of the host server
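
As a rough illustration of how the default settings above might map onto an HTTrack command line, the sketch below builds an argument list. The flags used (`-O`, `-s0`, `-D`, `-n`, `-rN` and `+`/`-` filter patterns) are taken from HTTrack's documented command-line options; how PANDAS actually invokes HTTrack is not shown in the slides, and the function name and the depth value of 20 are assumptions.

```python
def build_httrack_command(seed_url, output_dir, max_depth=20, include=(), exclude=()):
    """Build an HTTrack argument list reflecting the default gather settings.

    Illustrative only: the depth default is hypothetical, and the real
    PANDAS invocation is not documented in the slides.
    """
    command = [
        "httrack", seed_url,
        "-O", output_dir,   # directory where the mirror is written
        "-s0",              # never follow robots.txt rules
        "-D",               # only go down into sub-directories
        "-n",               # also fetch 'near' files such as linked images
        f"-r{max_depth}",   # limit the mirror depth
    ]
    # Gather filters: '+pattern' includes, '-pattern' excludes.
    command += [f"+{pattern}" for pattern in include]
    command += [f"-{pattern}" for pattern in exclude]
    return command
```

The resulting list could be passed to `subprocess.run` on a machine with HTTrack installed; building a list rather than a shell string avoids quoting problems with wildcard filters.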

Harvesting - filters and settings

• Gather filters are critical
• Selection based on specific content
• Archiving permission for specific content
• Efficient use of resources (bandwidth, storage)

Quality assurance

• Important process for PANDORA
• Owner of title notified when harvest is complete
• Visual, manual checking process
• Check for completeness and functionality
• Check that content is new (if previously archived)
• Check that there is no extraneous material

Quality assurance

• Harvested files in a working area – not ‘archived’ at this stage
• WebDAV (protocol) access to the working area
• Problem analysis and fixing
• Missing files, broken links
• Complex problems referred to IT support through PANDAS error reporting module

Quality assurance

• Problems due to limitations of harvesting software
• Excessive use of JavaScript
• Deep web resources
• Traps such as metafiles, absolute links
• Other methods of acquisition (CD, FTP)
• Business decision whether or not to accept the harvested instance

Archiving

• Harvested instance is accepted
• One-click process for PANDAS user
• Transfers instance from working area to Digital Object Storage System
• Creates preservation and display copies
• Perl scripts – e.g. re-write external links

Archiving – preservation master copies

• Preservation master – incl. harvest log files
• Display master – includes changes made to the harvested instance (manual and scripts)
• Metadata master – http header responses
• Gzip compressed TARball (Tape Archive format) on Digital Object Storage System (DOSS)
• Access (display) copy on web server
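
Packaging a harvested instance as a gzip-compressed tarball, as described above, can be sketched with Python's standard `tarfile` module. The function name and directory layout here are hypothetical; the slides do not say which tools or naming scheme the National Library uses to build the master copies.

```python
import tarfile
from pathlib import Path


def package_instance(instance_dir, tarball_path):
    """Write a directory tree to a gzip-compressed tar file.

    Returns the sorted names of the files stored, read back from the
    finished tarball as a basic completeness check.
    """
    instance_dir = Path(instance_dir)
    with tarfile.open(tarball_path, "w:gz") as archive:
        # Store paths relative to the instance directory's own name.
        archive.add(instance_dir, arcname=instance_dir.name)
    with tarfile.open(tarball_path, "r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())
```

Reading the member list back after writing gives a cheap confirmation that every harvested file reached the tarball before the working copy is discarded.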

Preparing for public access – title entry pages

• Generated ‘on-the-fly’ from content of PANDAS database
• Partner branding
• Link to publisher’s site
• Links to dated archived instances
• Manual additions – notes, links to serial issues, copyright statement

Preparing for public access – listings and collections

• Subject listings
• Title listings
• Partner views
• Collections – events, sampling over specific time period

Public access – restrictions

• Period
• Date
• Authentication
• IP addresses/subnet mask (i.e. physical locations such as a single PC in the NLA main reading room)
• PANDAS manages automatically – can be manually enabled/disabled
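
A date-based restriction combined with an IP/subnet allow-list, as listed above, can be sketched with Python's standard `ipaddress` module. The semantics here are an assumption: while a restriction is in force, only clients on the listed networks may view the resource, and once the date passes it is open to all. The function name and parameters are illustrative and do not reflect the PANDAS implementation.

```python
from datetime import date
from ipaddress import ip_address, ip_network


def is_access_allowed(client_ip, today, restricted_until=None, allowed_networks=()):
    """Decide whether a client may view a restricted archived instance.

    Assumed rule: before 'restricted_until', only clients inside one of
    'allowed_networks' (CIDR strings) are admitted; afterwards, everyone is.
    """
    if restricted_until is None or today >= restricted_until:
        return True  # no restriction, or the restriction has lapsed
    address = ip_address(client_ip)
    return any(address in ip_network(network) for network in allowed_networks)
```

A single reading-room PC corresponds to a `/32` network such as `"192.0.2.17/32"` (a documentation-range address used here purely as a placeholder), while a wider subnet mask would admit a whole room or building.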

Creating resource discovery metadata

• MARC record for each title
• National Library of Australia OPAC
• National Bibliographic Database
• Metadata derived from the catalogue record is embedded in the title entry pages
• Indexing/abstracting services’ citations

Reporting

• Pre-defined reports from PANDAS UI
• Statistical and data reports
• SQL query on Oracle database (not through PANDAS interface)
• ProClarity for user defined data cube reporting and analysis
• LinkScan for broken publisher URL links

4. Issues and future directions

Current issues

• Commitment to selective, quality assessed, accessible web archiving

• Efficient identification – automated selection
• Legal deposit (when?)
• Blanket permission – government agencies

Current issues

• Ongoing development and enhancement of PANDAS
• Improve robustness of system
• Re-engineer PANDAS software
• Need to achieve greater efficiencies and increase scale of web archiving activity

Future directions

• Automatically ingest and process larger volume of online publications and associated metadata – batches

• Comply with international standards and adopt standard tools – IIPC

• Incorporate other collection methods – domain harvesting, deep web, deposit

Future directions

• Automate collection of more preservation metadata and develop metadata management interface

• Improve access and discovery paths to the Archive’s resources as it continues to grow

More information

• PANDORA home page
http://pandora.nla.gov.au/

• Key documents (background, technical, PIs)
http://pandora.nla.gov.au/documents.html

• PANDAS manual
http://pandora.nla.gov.au/manual/pandas

• Papers and presentations
http://pandora.nla.gov.au/papers.html

Questions?

http://pandora.nla.gov.au

