Web Capture Team, Office of Strategic Initiatives
February 27, 2006
Selecting Content from the Web: Challenges and Experiences of the Library of Congress
Abbie Grotke
Web Capture Team
Office of Strategic Initiatives
CRL Workshop, February 27, 2006
Agenda
• Why Collect the Web?
• Web Collections at the Library of Congress
• Policy Issues and Technical Activities
• Project: Selecting and Managing Content Captured from the Web
• Our Partnerships and International Collaborations
Why Collect the Web? Digital Preservation Goals of the Library of Congress
• Preserve our nation’s history and culture
• Identify and preserve at-risk digital content
• Support development of tools, models, and methods for digital preservation
The Early Days
• Feb 2000: “how do we collect the Web?” led to MINERVA prototype (www.loc.gov/minerva)
• Special project team initially formed: cataloging, legal, public services, technology services staff
• Early partnerships:
– Internet Archive (www.archive.org)
– WebArchivist.org
• From project to program…
– 2003: Web Capture team formed
– 2004: Some of MINERVA team joined Web Capture
Web Collections 2000-2006
*Election 2000: 767 seed URLs
*September 11th: 30,000+
2002 Winter Olympics: 70
Sept 11 Remembrance: 1,800
*Election 2002: 3,000
107th-109th Congress: 588
Iraq War: 300
Election 2004: 2,000
Papal Transition: 200
Katrina: 818
Supreme Court: 285
*public access available through www.loc.gov/minerva
Current Library Collecting Efforts
• Iraq War (ongoing)
• 109th Congress (ongoing)
• Darfur
Over 40 TB of data collected to date!
The Web Capture Process at LC
Collection Planning
Selection
Notification/Permissions
Technical Review
Crawl & QA
Cataloging
Interface Development
Legal Review
Access
Store & Manage
Technical Activities
• Current activity in areas of:
– Selection and permission gathering
  • Web Collection Management System
– Acquisition: crawling and collection
  • Heritrix
– Access and display
  • Full text searching, Wayback replacement
– Collection analysis and preservation
Policy Issues
• Need to seek clear and consistent intellectual property protocols for crawling
– Section 108 Study Group may provide hope
http://www.loc.gov/section108/
• What content should we now be collecting? How long should we collect it?
• Once we collect it, how do we make it available to our staff and public users?
• Do we share collecting efforts (costs, time) with partners? If so, how?
Various Web Collection Strategies
• Entire Web domain -- Internet Archive
• National domain (.se) -- Sweden, France, others
• Selective (individual URLs) and thematic -- Australia
• Thematic or event based -- Library of Congress
Other strategies LC is exploring
• Acquire collections gathered by others
• Establish relationships with producers to acquire their content
Selection
• LC’s Collection Policy Statements
• Collection planning defines:
– Collection scope
  • Description
  • Types of sites
  • Frequency
– Categories of sites
  • X category of site gets Y type permission
  • Reporting
  • Possible other uses: cataloging, access points
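The rule "X category of site gets Y type permission" can be pictured as a simple lookup. The sketch below is illustrative only: the category names, the permission labels, and the `permission_for` helper are hypothetical placeholders and are not taken from LC's collection plans.

```python
# Hypothetical category -> permission-type table, for illustration only.
# A real collection plan would define its own categories and rules.
PERMISSION_BY_CATEGORY = {
    "government": "notification",
    "news": "permission",
    "advocacy": "permission",
}


def permission_for(category: str, default: str = "needs_review") -> str:
    """Return the permission type assigned to a site category.

    Categories missing from the table fall back to `default`, so an
    unplanned category is flagged for review instead of silently passing.
    """
    return PERMISSION_BY_CATEGORY.get(category, default)
```

Keeping the rule in one table means a change to a category's permission type is made once at planning time, and the same table can drive reporting by category.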
Other considerations
• What does the recommender want?
– complete site
– single document, page, or section
• Can we get it and provide access to it?
– crawler and access tool limitations
– deep web
– scoping
– permission
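The difference between wanting a complete site and wanting a single page or section comes down to a scoping decision for each URL the crawler discovers. The sketch below is one way to express that decision under simplifying assumptions; the `in_scope` function and its `mode` values are hypothetical and do not describe how Heritrix or any LC tool is configured.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, candidate: str, mode: str) -> bool:
    """Decide whether `candidate` belongs to the capture defined by `seed`.

    mode (hypothetical labels):
      "site"    - any URL on the seed's host
      "section" - URLs under the seed's directory on the same host
      "page"    - only the seed's own path and query string
    """
    s, c = urlsplit(seed), urlsplit(candidate)
    if s.hostname != c.hostname:
        return False
    if mode == "site":
        return True
    if mode == "section":
        # Use the seed's directory as the prefix, e.g. /reports/index.html -> /reports/
        prefix = s.path if s.path.endswith("/") else s.path.rsplit("/", 1)[0] + "/"
        return c.path.startswith(prefix)
    if mode == "page":
        return c.path == s.path and c.query == s.query
    raise ValueError(f"unknown scope mode: {mode}")
```

A check like this only covers URLs the crawler can discover by following links; deep web content behind forms or queries is a separate limitation.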
Selecting and Managing Content Captured from the Web
• One-year project to address:
– Roles and responsibilities for lifecycle management of archived Web content
– Single-site collecting vs. thematic collecting
– Copyright permissions and notifications
– Exploring how technical aspects of Web sites affect selection criteria
– Expanding staff participation
Additional Objectives
• Learn by doing
– Practical experience is key
– Collection planning
– Permissions planning
– Content collection
– Quality review: did we get what was wanted?
• Further document resource requirements and workflow (staff/time)
• Inform and educate other Library staff
Project Participants
• Four Content Groups
– Darfur
– Visual Image
– Manuscript Organizations
– Single Site
• Bibliographic and Lifecycle Subgroups
• Management Oversight Committee
Training
• Workshops
– Selection
– Technology of Web Capture
– Copyright and Permissions
– Access tools overview
• Tools training
– For Recommenders: How to nominate a URL for archiving
– For Selection Coordinators: How to use the tool to move through selection and permissions process
• Ongoing support, refreshers as needed
Some Big Challenges
• Defining new roles and responsibilities (and actually doing them)
• Resource limitations: everyone is busy and selection and permissions take a lot of time
• Finding the geek balance: too much vs. too little technical information
• Do LC’s traditional selection policies fit Web content?
Crisis in Darfur, Sudan
• Approximately 200 seed URLs selected
– Sampling of news reports
– Scholarly reports and studies
– Responses of
  • Government
  • Public (Web logs, etc.)
  • Key organizations and their Web sites, some formed in response to crisis
– About 25 sites in other languages, mostly Arabic
• Started crawling February 20, 2006
– Crawl frequency: weekly, monthly, or one time
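Assigning each seed a weekly, monthly, or one-time frequency implies some bookkeeping about which seeds are due for another crawl. The sketch below shows one way to model that; the `Seed` record, the `is_due` helper, and the 30-day approximation of "monthly" are assumptions made for illustration, not a description of LC's collection management system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumed intervals; "monthly" is approximated as 30 days for simplicity.
INTERVALS = {
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=30),
}


@dataclass
class Seed:
    url: str
    frequency: str  # "weekly", "monthly", or "one_time"
    last_crawled: Optional[date] = None  # None means never crawled


def is_due(seed: Seed, today: date) -> bool:
    """Return True if the seed should be included in a crawl on `today`."""
    if seed.last_crawled is None:
        return True  # every seed is crawled at least once
    if seed.frequency == "one_time":
        return False  # already captured; never repeat
    return today - seed.last_crawled >= INTERVALS[seed.frequency]
```

For example, a weekly seed last crawled on February 20, 2006 is due again on February 27, 2006, while a monthly seed crawled the same day is not.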
Upcoming tasks
• Review results of crawl
– Technical Team Quality Review
– Curator QA Quality Review
• Initiate permissions and collecting of Manuscript, Visual Image, and Single Site collections
• Full-text indexing search testing
• Further explore lifecycle management issues
A Web of Archiving Initiatives
[Diagram: LC shown among related organizations and countries: CDL, UNT, IA, UIUC, IIPC, BL, OCLC, RLG, Archive-it, NLA, BNF, UKWAC, NYU, NARA, Norway, Finland, Denmark, Sweden. Group labels in the diagram: Partners, Collecting Partners.]
National Partnerships and Collaborations
• University of California Digital Library
– The Web at Risk: A Distributed Approach to Preserving our Nation’s Political Cultural Heritage
• Internet Archive
– Testing the storage, data maintenance and access of collected Web content
• Information sharing with other US government agencies
– Government Printing Office
– National Archives and Records Administration
International Collaborations
• International Internet Preservation Consortium (IIPC)
– Collect and preserve a rich body of Internet content from around the world
– Foster the development and use of common tools, techniques and standards
– Encourage and support national libraries everywhere to address Internet collecting and preservation
– Share experience and best practices
IIPC Members
• France (lead)
• Italy
• Denmark
• Finland
• Iceland
• Canada
• Norway
• Australia
• Sweden
• United Kingdom
• Internet Archive, USA
• Library of Congress, USA
http://netpreserve.org/
Upcoming Directions
• Better tools for supporting selection
• Improving access tools
• Better crawl management
• Large-scale collection storage approach: Repository
Questions?
Abbie Grotke