Progress Made and Lessons Learned
through Collaborative Web Archiving Projects
Anna Perricci
Columbia University Libraries
Archive-It Partner Meeting 2014
November 18, 2014
Web Resources Archiving Collaboration
• Many thanks to the Mellon Foundation
• Building collaborations among
– The web archiving community
– Other research libraries
– Users and potential users of web archives
– Site creators
Incentive awards projects to advance web archiving tools
Warcbase: Building a Scalable Web Archiving Platform on HBase and Hadoop. (Jimmy Lin, University of Maryland)
Archiving Transactions Towards Uninterruptible Web Service (Zhiwu Xie and Edward A. Fox, Virginia Tech)
Visualizing Digital Collections of Web Archives (Michele Weigle, Old Dominion University)
Tools for Managing Seed URLs (Michael Nelson, Old Dominion University)
Perma.cc: Mitigating the Pervasive Problem of Link Rot in Scholarly Works and Preserving Online Content (Kim Dulin, The Harvard Library Innovation Lab)
Free Law Project
Providing free access to primary legal materials, developing legal research tools, and supporting academic research on legal corpora
Building an efficient, coherent, and scalable national framework for collecting web content
https://archive-it.org/home/borrowdirect
Program Components
• Communication and coordination
• Seed management and harvest
• Supplemental quality review (QA testing)
• MARC Metadata
• Local preservation storage (seeking solutions)
The first 18 months of collaborative collecting
• Planning, needs assessment (interviews with stakeholders including Associate University Librarians for collection development at each Borrow Direct institution in 2013), timelines created
• Group communication (spreadsheets, Basecamp), cultivating dialogs
• Coordinating seed URL nominations for pilot collections (CCWA, CAUSEWAY), QA testing, and creation of MARC records
• Trying out workflows for optimal balance of involvement and efficient forward motion on projects
• In planning stages for sharing costs & 5-year plan for Borrow Direct/Ivy Plus collaborations
Collaboration with music librarians
Contemporary Composers Web Archive
Selectors
• Borrow Direct Music Librarians Group: music librarians at Brown, Columbia, Cornell, Dartmouth, Harvard, Johns Hopkins, Princeton, and Yale universities, MIT, and the universities of Chicago and Pennsylvania
Cataloging expertise
• Russell Merritt (cataloger specializing in music resources)
• Kate Harcourt (Director of Original and Special Materials Cataloging)
• Alex Thurman (Web Resources Collection Coordinator)
CCWA
Progress on CCWA & lessons learned so far
By the numbers:
• 11 curators participating
• 56 sites currently available in Archive-It – 23 additional sites for follow-up
• 27 GB of content archived (268,519 URLs)
• 50 MARC records in WorldCat as of 11/18/14 – Russell Merritt (music cataloger) collaboratively developed MARC records for composers’ websites; further cataloging of available sites through 2CUL
Outreach
• SAA presentation on MARC records for CCWA http://www.slideshare.net/annaperricci/lightning-talk-for-session-703-of-society-of-american-archivists
• Over 30 sites tested for quality by five music librarians; bibliographic assistant on the grant tested all sites in collection
CCWA Permissions
77 Composers
Yes (37)
No (0)
Did not respond (35)
No contact info (2)
Recently died/did not contact (3)
Quality Assurance with music librarians
Creating MARC records for web archives
• Creating MARC records for archived websites is standard practice at CUL
– MARC records make web archives discoverable in CLIO (Columbia Libraries Information Online)
• Collection level and seed level records
• Will use Archive-It interface to add Dublin Core metadata
Anticipating wider use of MARC records
• Records have been regularly released to WorldCat
• Collaborators on cataloging were attentive to which fields will ordinarily be stripped out when a MARC record is imported to another institution’s OPAC
MARC records
Patron view of record in CLIO
Cataloger’s view of record in CLIO
Progress on CAUSEWAY & lessons learned
• Curators from 9 Borrow Direct institutions (Ivies Plus Art & Architecture Group) – Lead advisors: Carole Ann Fabian and Chris Sala
• 137 seed URLs (over 100 harvested and being released as sites are tested, cataloged and assigned metadata in Archive-It)
• 51 GB of content archived (1,006,114 URLs)
• Over 60 sites available in Archive-It with DC metadata (also all 60+ have MARC records in CLIO)
Outreach
• Update sent to IVAAG soliciting feedback
• Gave update and got feedback at semiannual IVAAG meeting
• Presentation scheduled for ARLIS/NA 2015
CAUSEWAY Permissions
137 Site owners
Yes (74)
No (3)
Later (2)
No contact info (2)
Did not respond (56)
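As a quick arithmetic check on the two permissions tallies (CCWA: 77 composers; CAUSEWAY: 137 site owners), the sketch below converts the counts into percentage shares. The counts come from the slides; the function name and dictionary layout are illustrative assumptions only.

```python
from typing import Dict


def response_shares(tallies: Dict[str, int]) -> Dict[str, float]:
    """Convert raw permission counts into percentages of the total, rounded to 0.1."""
    total = sum(tallies.values())
    return {label: round(100 * count / total, 1) for label, count in tallies.items()}


# Counts as reported on the CCWA and CAUSEWAY permissions slides.
ccwa = {
    "Yes": 37,
    "No": 0,
    "Did not respond": 35,
    "No contact info": 2,
    "Recently died/did not contact": 3,
}
causeway = {
    "Yes": 74,
    "No": 3,
    "Later": 2,
    "No contact info": 2,
    "Did not respond": 56,
}

for name, tallies in (("CCWA", ccwa), ("CAUSEWAY", causeway)):
    print(name, sum(tallies.values()), response_shares(tallies))
```

In both pilots the categories add up to the stated totals, and nearly all of the site owners who did not grant permission simply did not respond; explicit refusals were rare.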
CAUSEWAY
Cataloging expertise brought to CAUSEWAY
• Alex’s expertise in cataloging architecture and urban planning sites (built through collaboration with Chris Sala on the Avery collecting of web archives) equips him to make more specific MARC records for sites in CAUSEWAY
• Columbia University art and architecture librarians encourage users to find resources via records in the OPAC, so access to CAUSEWAY sites will likely be via the MARC records, which point to the calendar page for archived sites
• Alex is working with our Bibliographic Assistant, Naeema Akter (position funded by the grant as well), to add appropriate metadata for better browsing in the Archive-It interface
Early start on facets in Archive-It
CAUSEWAY goals for the remainder of the grant
• Collect all nominated sites in scope, test for quality, create a MARC record for each archived website (by early 2015)
• Evaluate quality and solicit feedback (ongoing)
• Meet at ARLIS/NA and discuss progress (March 2015)
– Anna will also give a presentation on collaborative web archiving projects at ARLIS/NA
• Establish ongoing workflows and goals (2015 and onward)
• End of pilot phase: December 2015
Project tracking: Basecamp & many, many spreadsheets
Pilot climate change collecting & lessons learned so far
• 25 selectors from 5 institutions
Great range of fields:
– Wide variety of area studies (9)
– Social science (5)
– Science and environmental science (4)
– Medical (1), Law (1), Special Collections (1)
– Collection Development AUL (3), Preservation (1)
• 127 seed websites nominated (some duplication)
• A lot of enthusiasm for topic
What we’ve learned about workflows and scale
• Distributing work does not reduce costs
• Collaborative effort builds the project and new tasks promote professional growth
• Quality Assurance and cataloging integral to process of creating high quality collections of web archives
#webarchivinghappenshere
Use cases
Image credit: Flickr user: Nicky Jurd (CC BY 2.0)
Using the Human Rights Web Archive & learning from human rights scholars’ work (publications, citations)
Citations scraped from articles published in 2010 in select scholarly journals
Isolating URLs from list of citations using Open Refine (approximately 10% of citations scraped have URLs in them)
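The project did this step in Open Refine. As a rough stand-in only, the same idea can be sketched with a regular expression in Python. The pattern, the function names, and the two sample citations below are invented for illustration and are not the project's actual rules or data.

```python
import re
from typing import List

# Assumed pattern: an http(s) URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_urls(citation: str) -> List[str]:
    """Return the URLs found in one citation, trimming trailing punctuation.

    Trimming ")" and "]" can clip a URL that legitimately ends with one,
    which is an accepted limitation of this sketch.
    """
    return [match.rstrip(".,;:)]") for match in URL_PATTERN.findall(citation)]


def share_with_urls(citations: List[str]) -> float:
    """Fraction of citations that contain at least one URL."""
    if not citations:
        return 0.0
    with_urls = sum(1 for citation in citations if extract_urls(citation))
    return with_urls / len(citations)


# Made-up citations, for demonstration only.
sample_citations = [
    "Doe, J. Annual Report 2009. http://example.org/report.pdf.",
    "Smith, A. A Printed Monograph. New York, 2008.",
]
print(extract_urls(sample_citations[0]))
print(share_with_urls(sample_citations))
```

A figure like the roughly 10% reported above would come from running something like share_with_urls over the full scraped citation list.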
Querying Internet Archive collection (via API)
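The slide does not say which API was queried. The sketch below assumes the public Wayback Machine Availability endpoint (https://archive.org/wayback/available), which returns the closest capture for a URL; the project may instead have queried its Archive-It collection some other way. The helper names are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the general Wayback Machine Availability API, not a collection-specific one.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the request URL; timestamp is an optional YYYYMMDD-style string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the archived URL of the closest capture, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url: str, timestamp: Optional[str] = None, timeout: float = 10.0) -> Optional[str]:
    """Query the endpoint over the network and return the closest capture, if any."""
    with urlopen(build_query(url, timestamp), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Calling lookup on each cited URL, with a timestamp near the citing article's publication date, gives a first estimate of how many cited web resources have a capture.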
Columbia University web resources: creating best practices for site creators
Wider reach with guidelines rather than suggesting changes on a case-by-case basis
Web archiving initiatives focusing on art resources
An initiative designed to address the “urgent need to document the dynamic web-based versions of auction catalogues, catalogues raisonnés, and scholarly research projects, as well as artist, gallery, and museum websites” (http://www.nyarc.org/content/web-archiving)
Artist files Special Interest Group
What do you want to learn about web archiving?
Do you have any suggestions on how the SAA Web Archiving Roundtable can help you develop your knowledge of web archiving?
Categories we identified based on the 33 responses:
– Description
– Preservation
– Access/Use
– Project Management/Collaboration
– Appraisal/Collection Dev/Policy
– Technology/Capture/Tools
– Business Case/Costs/Best Practices
Some presentations, papers, panels & posters during grant
• Moderated: “Web Archiving: Experiences, Perspectives and Possibilities” held at METRO on 10/20/14
• Presentation (lightning talk): “MARC Records for the Contemporary Composers Web Archive” for the Society of American Archivists annual conference on 8/16/14
URL (via Academic Commons): http://dx.doi.org/10.7916/D8028Q3S
• Presentation: “SAA Web Archiving Roundtable Education Needs Assessment Survey Results” for the SAA Web Archiving Roundtable meeting at Society of American Archivists annual conference (co-presented with John Bence) on 8/14/14
• Presentation: “How Collaboration Can Save [More of] the Web: Recent Progress in Collaborative Web Archiving Initiatives” for the METRO Conference 2014 on 1/15/14
• Poster session: “Assessment of the Effectiveness of the Human Rights Web Archive @Columbia University” (co-presented with Pamela Graham) at the ACRL/NY Symposium on 12/6/13
URL (via Academic Commons): http://dx.doi.org/10.7916/D8BG2KZ9
• Presentation: “How Collaboration Can Save [More of] the Web: Recent Progress in Collaborative Web Archiving Initiatives” for the Best Practices Exchange on 11/14/13 (with Scott Reed)
URL (via Academic Commons): http://dx.doi.org/10.7916/D8G73BNK
• Presentation: “Web Archiving Resource Collaboration” at CrawlCamp held at METRO on 7/17/13
Are project elements on schedule & within budget?
• So far yes, though we have plenty of challenges and work ahead of us
• Steady progress on citation analysis but it’s been much harder than we thought it’d be
• Lots of room for engagement and teamwork, including maintenance and coordination of cooperative efforts
Refining building materials
Modest gains
The next 12.5 months
• Complete remainder of work called for in grant
• Establish shared cost model for collaborative collection building (e.g. CCWA and CAUSEWAY)
• Plan for scaling (maintenance and growth)
• Codify roles for meaningful involvement in web archiving efforts
• Contribute to professional organizations to strengthen web archiving efforts nationally and internationally
Credits to some of many collaborators
• Bob Wolven, Alex Thurman, Naeema Akter
• Pamela Graham, Kate Harcourt, Christina Harlow
• Talia Jimenez, Stephen Davis, incentive awards oversight panel: Kris Carpenter, Mark Phillips, Rob Sanderson & Perry Willett
• Elizabeth Davis, Russell Merritt & Borrow Direct music librarians
• Carole Ann Fabian, Chris Sala, Ivies Plus Art & Architecture Group
• Borrow Direct Associate University Librarians for Collection Development group
• Climate change selectors at Borrow Direct institutions
• Archive-It staff
• Community for discussion and participation, including: NYARC, METRO, International Internet Preservation Consortium (IIPC), SAA Web Archiving Roundtable, ARLIS/NA Artist Files SIG
Growing web archives
Thanks!
Anna Perricci
@AnnaPerricci
Columbia University Libraries