A survey of Web preservation initiatives
Michael Day
UKOLN, University of Bath
7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2003),
Trondheim, Norway, 17-22 August 2003
ECDL 2003, Trondheim, Norway, 17-22 August
2003
Presentation overview
• The importance of the Web
• Challenges:
– Technical, legal, and organisational challenges
• Approaches to collection:
– Harvesting based, selective, and deposit; combined approaches
• Discussion:
– Collection and access policies, software, costs, long-term preservation
Importance of the Web
An all-pervasive communication medium:
• In research:
– Scientists are 'increasingly reliant' on the Web for supporting research (Hendler, 2003)
• Wider societal role:
– Personal communication, e-commerce, etc.
– "… the information source of first resort for millions of readers" (Lyman, 2002)
The UKOLN study
Feasibility study produced for:
– Joint Information Systems Committee (JISC)
– Wellcome Library
– A survey of initiatives
– Recommendations for the JISC and Wellcome Library
– Supplementary legal study (Charlesworth)
– Published February 2003
http://library.wellcome.ac.uk/projects/archiving_reports.shtml
Technical challenges (1)
Size of Web:
– Surface web > 50 Tb (2000) … and still growing
– The 'deep Web'
– Scale of task means that Web-archiving needs to be a collaborative activity
Technical challenges (2)
Dynamic nature of Web:
– Web pages disappear on average after 75 days
– Many leave no trace
Evolution of Web-based technologies:
– Increasing reliance on databases, scripts, plug-ins, etc.
– A 'moving target'
Legal challenges
Copyright
Content liability, e.g.:
– Defamation
– Data protection
In the UK:
– Selective approach would be the safest solution (unless law changes)
See: Charlesworth (2003)
http://library.wellcome.ac.uk/projects/archiving_reports.shtml
Organisational challenges
Decentralised organisation:
– Web-archiving initiatives focus on defined sub-sets of the Web, e.g.:
– National domain, subject, organisation type
– Need for co-operation between initiatives
Quality:
– Much on Web is low-quality (or worse)
– Is there a need to preserve all of this?
Initiatives (1)
The Internet Archive
– Largest initiative, running since 1996
– Co-operates on special collections and with other repositories
National Libraries:
– Pioneer archives in Sweden (Kulturarw3) and Australia (PANDORA)
– Now many, many more
– Changes to legal deposit legislation in some countries
Initiatives (2)
National archives:
– Focus on government Web-sites (however defined)
– Guidance for Web-site managers:
– e.g., UK and Australia
– Snapshots:
– e.g., USA and UK
Other:
– Universities and scholarly societies:
– e.g., Archipol, Occasio archive, Political Communications Web Archiving (Cornell)
Approaches (1)
Automatic harvesting:
– Use of Web crawler technologies
– Crawler follows links and downloads content
– Pioneered by Internet Archive and Kulturarw3 project
– Also used for the gathering of the Finnish and Austrian Web
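The "follows links and downloads content" behaviour can be sketched as a breadth-first traversal. This is only an illustration, not the software used by any of the initiatives named here: the in-memory SITE dictionary stands in for real HTTP fetching, and every name and URL in it is hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "Web" (URL -> HTML). A real harvester would fetch
# pages over HTTP and respect politeness rules; that is omitted here.
SITE = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="/c.html">C</a>',
    "http://example.org/b.html": "no links here",
    "http://example.org/c.html": '<a href="/missing.html">gone</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seed, fetch=SITE.get):
    """Breadth-first crawl from seed; returns {url: html} for pages found."""
    archive = {}
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page has disappeared; nothing to store
            continue
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return archive
```

Calling harvest("http://example.org/") stores the four reachable pages and silently skips the dead link, which is also why pages that vanish between crawls 'leave no trace' in a harvested archive.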
Approaches (2)
Selective approaches:
– Selection of individual Web sites
– Negotiate rights with site owners
– Collection using gathering or mirroring software, ftp, or e-mail
– Pioneered in PANDORA project
– Experimented with by Library of Congress and British Library
Deposit approaches:
– Site owners/administrators deposit site in repositories
Approaches (3)
Combined approaches:
– Combines the advantages of the harvesting and selective approaches
– Pioneered by the Bibliothèque nationale de France
– Experimented with enhancements to the harvesting approach
• e.g., noting the change frequency of sites, and their 'importance'
• Uses the selective approach for the 'deep Web'
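One way to picture a harvester that takes change frequency and 'importance' into account is a revisit queue ordered by a combined score. The sketch below is an assumption made for illustration, not the Bibliothèque nationale de France's actual method: the scoring formula, the SiteStats fields, and the example sites are all hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class SiteStats:
    url: str
    changes_seen: int  # visits on which the content had changed
    visits: int        # total visits so far
    importance: float  # hypothetical score in [0, 1], e.g. from inbound links


def priority(site):
    """Hypothetical score: estimated change rate weighted by importance."""
    # Smoothing (+1 / +2) avoids division by zero for sites never visited.
    change_rate = (site.changes_seen + 1) / (site.visits + 2)
    return change_rate * site.importance


def crawl_order(sites):
    """Return URLs ordered from highest to lowest revisit priority."""
    # heapq is a min-heap, so the score is negated to pop the largest first.
    heap = [(-priority(s), s.url) for s in sites]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order


sites = [
    SiteStats("http://static.example/", changes_seen=0, visits=10, importance=0.9),
    SiteStats("http://news.example/", changes_seen=9, visits=10, importance=0.8),
    SiteStats("http://blog.example/", changes_seen=5, visits=10, importance=0.3),
]
print(crawl_order(sites))
```

With these made-up figures the frequently changing news site is revisited first and the important but static site last; a different weighting would give a different order.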
Collection policies
Dependent on technical approach chosen
– National domain ++ (for harvesting-based approaches)
– Collection guidelines (for selective approaches)
– Based on relevance, provenance, quality, etc.
– Frequency of capture
– Possible overlap with subject gateway initiatives, e.g. the Resource Discovery Network (RDN) in the UK
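For a harvesting-based collection policy, the scope decision can be reduced to a test on each URL's host name. The sketch below reads 'National domain ++' as the national top-level domain plus a hand-maintained list of extra hosts; that reading, the function name, and the example hosts are assumptions, not details given in the presentation.

```python
from urllib.parse import urlsplit


def in_scope(url, national_tld, extra_hosts=frozenset()):
    """True if url belongs to the national domain or to a listed extra host."""
    host = urlsplit(url).hostname
    if host is None:  # not an absolute URL, so it cannot be placed in scope
        return False
    host = host.lower()
    return host.endswith("." + national_tld) or host in extra_hosts


# Hypothetical extra host: a site of national interest hosted under .com
extras = {"www.example.com"}
print(in_scope("http://www.kb.se/page", "se"))             # True
print(in_scope("http://www.example.com/", "se"))           # False
print(in_scope("http://www.example.com/", "se", extras))   # True
```

A selective approach would replace this mechanical test with collection guidelines applied by staff, which is one reason the two approaches differ so much in cost.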
Approximate size (2002)
Country    Initiative          Type  Size (Gb.)     No. Sites
USA        Internet Archive    H     >150,000.00
Sweden     Kulturarw3          H     4,500.00
France     BnF                 C     <1,000.00
Austria    AOLA                H     448.00
Australia  PANDORA             S     405.00         3,300
Finland    HUL                 H     401.00
UK         Britain on the Web  S     0.03           100
USA        MINERVA             S     *              35
Type: H = harvesting, C = combined, S = selective
Source: Day (2003)
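As a rough reading aid, the two rows that report both a size and a number of sites allow an average size per site to be worked out. The figures are copied from the table; the conversion of 1 Gb. to 1,000 Mb. is an assumption, since the slide does not say which convention it uses.

```python
# Rows from the table that give both a size (Gb., 2002) and a site count.
ROWS = [
    ("PANDORA", 405.00, 3300),
    ("Britain on the Web", 0.03, 100),
]


def mean_size_per_site_mb(size_gb, sites):
    # Assumes 1 Gb. = 1,000 Mb. (not stated in the source).
    return size_gb * 1000 / sites


for name, size_gb, sites in ROWS:
    print(f"{name}: {mean_size_per_site_mb(size_gb, sites):.1f} Mb. per site")
```

This gives about 122.7 Mb. per site for PANDORA and 0.3 Mb. per site for Britain on the Web, which suggests the two selective collections captured sites at very different depths.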
Access policies
Access policies differ:
– Internet Archive and the PANDORA archive make data available
– e.g., the Wayback Machine
– Other collections effectively closed (for legal reasons or because experimental)
– Need for specialised Web indexes that can search and navigate large collections of Web material
– e.g., Nordic Web Archive (NWA) Toolset
Software
Various software in use:
– Harvesting:
– Adapted Combine harvester, NEDLIB harvester, Xyleme, Alexa
– Selective:
– HTTrack (popular), etc.
– PANDAS (PANDORA Digital Archiving System): helps with managing the process, adding metadata, etc.
Costs
Costs vary widely:
– Selective approach much more expensive (per Tb.) than bulk harvesting
– But resulting archives are more widely accessible
– Significant costs in undertaking rights clearance
Long-term preservation
Many initiatives until now mainly focused on the collection of resources:
– Need to consider the longer-term
– Descriptive and technical metadata
– Migration needs (e.g. for complex sites)
– Need for Web archiving initiatives to become trusted repositories
– Need to be embedded into the 'core activities' of their host organisation
Summing up
• Much experimentation to date, but now moving into implementation phase
• Co-operation and collaboration are important
• Combined technical approaches offer best way forward
• Legal challenges still problematic
• Long-term preservation issues still to be explored in detail
Acknowledgements
UKOLN is funded by Resource: the Council for Museums, Archives and Libraries, the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, as well as by project funding from the JISC and the European Union. UKOLN also receives support from the University of Bath, where it is based.
http://www.ukoln.ac.uk/