Hiberlink – Towards Time Travel for the Scholarly Web July 25th 2013, Indianapolis, IN, USA
Hiberlink – Towards Time Travel for the Scholarly Web
Martin Klein, @mart1nkle1n
Robert Sanderson, @azaroth42
Herbert Van de Sompel, @hvdsomp
http://www.hiberlink.org/
http://www.mementoweb.org/
The Hiberlink Project is supported by the Andrew W. Mellon Foundation
Hiberlink Project and Partners
Two-year project funded by the Andrew W. Mellon Foundation
LANL
• Herbert Van de Sompel
• Rob Sanderson
• Martin Klein
U. Edinburgh
• Claire Grover
• Beatrix Alex
• Richard Tobin
• Adam Zhou
EDINA
• Peter Burnhill
• Christine Rees
• Muriel Mewissen
• Tim Strickland
• Neil Mayo
Problem Statement
Preservation of formal scholarly output is (relatively) well understood.
Preservation of the resources that make up the context for that research is not:
• Datasets
• Software
• Workflows
• Videos, Slides
• Project and Demonstration web sites
• AJAX
• …
Pilot Study
To what extent are web resources that are referenced from works in repositories still available at their original URL … or from archives of web resources?
Participants: LANL, UNT, arXiv
Paper: http://arxiv.org/abs/1105.3459
Contributions:
• Much larger scale than any previous study: 162,052 unique URLs
• Automatically searched multiple archives for all URLs, rather than manually for a small subset
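The pilot study's question combines two checks for every URL: is it still live, and does any archive hold a copy? A minimal sketch of the tallying step follows, assuming those two checks have already been run elsewhere. The function names, category labels, and toy URLs are illustrative assumptions, not the study's actual code or data.

```python
from collections import Counter


def classify(is_live, is_archived):
    """Place one URL into one of four availability categories."""
    if is_live and is_archived:
        return "live and archived"
    if is_live:
        return "live, not archived"
    if is_archived:
        return "archived only"
    return "lost"


def summarize(observations):
    """Return the percentage of URLs per category.

    `observations` maps URL -> (is_live, is_archived).
    """
    counts = Counter(classify(live, archived)
                     for live, archived in observations.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Made-up observations, for illustration only (not results from the study).
toy_observations = {
    "http://example.org/a": (True, True),
    "http://example.org/b": (True, False),
    "http://example.org/c": (False, True),
    "http://example.org/d": (False, False),
}
toy_summary = summarize(toy_observations)
```

The "in archives and/or still exist" share is then everything except the "lost" category.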
Pilot Study: Results
UNT
• 72% in archives and/or still exist
• High proportion of archived URLs, possibly due to academic level and general disciplines
arXiv
• 78% in archives and/or still exist
• 45% still exist, but not archived! Possibly due to high value, but very discipline-specific references
Hiberlink: Quantify Full Extent of the Problem
To what extent are web resources that are referenced from works in repositories still available at their original URL … or from archives of web resources?
Redo the same experiment with…
• Even larger dataset with millions of papers and URLs
• Text mining processes for URL extraction
• Track location of URL (citations, footnote, text, etc.)
• Evaluation of extraction via gold standard dataset
• Determine type of resource referenced
• Track type of publication (journal, thesis, report, etc.)
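As a rough illustration of the URL extraction step, the sketch below pulls http(s) URLs out of plain text with a regular expression and strips trailing sentence punctuation. The pattern and helper name are assumptions made for this example. The project's actual text mining pipeline is not described here and is likely more elaborate (for instance, handling URLs broken across lines in PDFs).

```python
import re

# Simplified pattern: a scheme followed by anything that is not whitespace,
# an angle bracket, or a quote character.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Characters that usually belong to the sentence, not to the URL.
TRAILING_PUNCTUATION = ".,;:!?)]}"


def extract_urls(text):
    """Return unique URLs found in `text`, in order of first appearance."""
    urls = []
    seen = set()
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample_text = (
    "See http://www.hiberlink.org/ and the pilot study "
    "(http://arxiv.org/abs/1105.3459). Also http://www.hiberlink.org/."
)
sample_urls = extract_urls(sample_text)
```

One known limitation of this sketch: stripping a trailing ")" also damages URLs that legitimately end in a closing parenthesis. That is the kind of error a gold standard dataset would help measure.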
Hiberlink: Propose Solutions (1)
We propose two active archiving solutions for resources referenced from scholarly papers, to ensure that the scholarly record remains unbroken
1. Active Crawling:
• Run extraction routines at repositories, publishers, or third parties via text mining agreements or open access publications
• Feed the URL seed list to existing web crawlers, such as the Internet Archive
• IA (and others) already Memento compliant
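Before extracted URLs are handed to a crawler, they would typically be normalized and deduplicated into a seed list. The sketch below shows one plausible way to do that. The normalization rules (lowercase scheme and host, drop fragments, keep only http and https) and the function names are assumptions for illustration, and no particular crawler's input format is implied.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        "",  # fragments are client-side only, so they are dropped
    ))


def build_seed_list(urls):
    """Return a sorted, deduplicated list of crawlable (http/https) URLs."""
    seeds = set()
    for url in urls:
        if urlsplit(url.strip()).scheme.lower() in ("http", "https"):
            seeds.add(normalize(url))
    return sorted(seeds)


example_seeds = build_seed_list([
    "HTTP://Www.Hiberlink.org",
    "http://www.hiberlink.org/#about",
    "mailto:someone@example.org",
    "http://www.mementoweb.org/",
])
```

Writing `"\n".join(example_seeds)` to a file gives a plain one-URL-per-line list.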
Hiberlink: Propose Solutions (2)
2. Transactional Archiving:
• A willing server forks each response for a resource and sends it to both the browser and an archive for preservation
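The "fork the response" idea can be sketched as WSGI middleware that passes each response on to the client unchanged and hands a copy to an archive. This is a minimal illustration under stated assumptions: `archive_sink` is a hypothetical callable standing in for whatever archive interface a real deployment would use, and the whole body is buffered in memory, which would not suit large or streaming responses.

```python
from wsgiref.util import request_uri, setup_testing_defaults


class TransactionalArchiver:
    """WSGI middleware that copies every response to an archive sink."""

    def __init__(self, app, archive_sink):
        self.app = app
        # Hypothetical interface: archive_sink(url, status, headers, body)
        self.archive_sink = archive_sink

    def __call__(self, environ, start_response):
        captured = {}

        def capturing_start_response(status, headers, exc_info=None):
            # Remember status and headers, then pass them on unchanged.
            captured["status"] = status
            captured["headers"] = list(headers)
            return start_response(status, headers, exc_info)

        result = self.app(environ, capturing_start_response)
        try:
            body = b"".join(result)  # buffers the full response body
        finally:
            if hasattr(result, "close"):
                result.close()

        # Fork: one copy to the archive, one copy back to the client.
        self.archive_sink(request_uri(environ), captured["status"],
                          captured["headers"], body)
        return [body]


# Small in-process demonstration with a toy application and a list as archive.
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello ", b"world"]


archived = []
wrapped = TransactionalArchiver(demo_app, lambda *record: archived.append(record))

demo_environ = {"PATH_INFO": "/paper.html"}
setup_testing_defaults(demo_environ)
client_statuses = []
client_body = b"".join(
    wrapped(demo_environ,
            lambda status, headers, exc_info=None: client_statuses.append(status))
)
```

After the call, `client_body` holds what the browser would receive and `archived[0]` holds the copy sent for preservation.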
Summary
2011 pilot study showed:
• Significant problem!
• Random archiving by web crawlers is not enough
Hiberlink project will:
• Fully quantify the extent to which web resources that form the context of scholarly output are available and archived
• Propose active solutions to prevent the loss of further resources
• Use Memento for both research and access
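Memento access works through datetime negotiation: a client asks a TimeGate for a resource as it existed around a given moment by sending an `Accept-Datetime` request header. The sketch below only builds that header. The helper name is an assumption for this example, and no TimeGate address is included because none is given in these slides.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(moment):
    """Build the Accept-Datetime header for a timezone-aware datetime.

    The value is an HTTP-date expressed in GMT, as Memento expects.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc_moment = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_moment, usegmt=True)}


# The date of this talk, used as the desired point in time.
talk_day_header = accept_datetime_header(
    datetime(2013, 7, 25, tzinfo=timezone.utc)
)
# -> {"Accept-Datetime": "Thu, 25 Jul 2013 00:00:00 GMT"}
```

The same header serves both uses named above: checking archives in bulk for research, and letting a reader reach the archived version of a referenced page.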