SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program on Information Science ...

Jefferson Bailey, Director, Web Archiving, Internet Archive

MIT Program on IS, Brownbag Series | Cambridge MA 2017 | @jefferson_bail | [email protected]

SAFETY NETS: RESCUE & REVIVAL FOR ENDANGERED WEB RECORDS


Outline

● About Internet Archive

● Timeline of Web Archiving at IA

● Web as Historical & Archival Record

● Collecting & Collections

● Technologies & Challenges

● Research & Services

● Conclusion

The Internet Archive

Non-Profit Library

Founded in 1996 by Brewster Kahle

Universal Access to All Knowledge

35,000,000,000,000,000 Bytes Archived (35 Petabytes)

Books Digitization

Music Digitization

TV Collection

Software Preservation and Emulation

25,000 | 2,000,000 | 2,300,000 | 2,400,000 | 3,000,000 | 4,000,000 | 570,000,000,000+

Software Titles | Moving Images | Book Archive | Audio Recordings | Hours of Television | eBooks | URLs

20 Years of Archiving the Web

1996 US Presidential Campaigns with Smithsonian

218,342,520 Web Captures

1997 First Full Crawl

525,362,846 Web Captures

1998 Donation of Crawl to the Library of Congress

1,166,891,826 Web Captures

2000 US Presidential Campaigns with the Library of Congress; Started Collecting Television; Started Digitizing Movies

2001 Launch of the Wayback Machine

2002 Book Digitization Begins

2003 International Internet Preservation Consortium Founded

2006 Archive-It Started

2007 Ireland

2008 National Archives and Records Administration (NARA) Congressional Harvests (https://webharvest.gov)

2009 Archive-It Adds its 100th Partner; 7 National Library Partners

2010 Broad and Survey Web-Scale Crawls

2014 Emulation of Software Archive in the Browser

2016 Archive-It Adds its 500th Partner

467,195,419,069 Web Captures

The Web as Historical Resource



https://web.archive.org/details/http://web.mit.edu/


● The web is the primary publishing platform of our generation

● The web has consumed all media

● The web is distributed in access, but centralized in publication

● One cannot study contemporary society (or even recent history) without studying the web

The Web as Archival Resource

● WARC format

○ Obtuse, technical

● Packaging

○ URL-centric clicking & browsing through Wayback

○ Query-based retrieval and search functionality

○ Silos

● Born-Digital is same
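
To make the "obtuse, technical" WARC format a little more concrete, here is a minimal Python sketch that writes and then reads uncompressed WARC response records using only the standard library. It is an illustration, not how the Wayback Machine or Archive-It process WARCs: the helper names `make_record` and `read_records`, the sample URIs, the dates, and the payloads are all hypothetical. Real WARC files are usually gzip-compressed and carry more headers and record types than shown here.

```python
import io
import uuid


def make_record(uri, date, payload):
    """Build one uncompressed WARC 'response' record around an HTTP payload."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"WARC-Date: {date}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record ends with two CRLF pairs after the content block.
    return header + payload + b"\r\n\r\n"


def read_records(stream):
    """Yield (version, headers, block) for each record in a binary stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length gives the exact size of the content block in bytes.
        block = stream.read(int(headers["Content-Length"]))
        yield version.strip().decode("ascii"), headers, block


# Hypothetical captures, invented for this example.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
data = (
    make_record("http://web.mit.edu/", "2017-01-01T00:00:00Z", payload)
    + make_record("http://example.org/", "2017-01-02T00:00:00Z", b"HTTP/1.1 404 Not Found\r\n\r\n")
)

for version, headers, block in read_records(io.BytesIO(data)):
    print(version, headers["WARC-Type"], headers["WARC-Target-URI"], len(block))
```

Even this toy reader shows why packaging matters for access: the captured content only becomes usable after the record headers are parsed and the HTTP payload is unpacked.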

The Web as Archival Resource

● The web as “lived experience”

● Plurality of representation

● Diversity of media

● Unrivaled scope, scale, extent

● Access can be native (in the browser, full-text search, etc)

Collecting & Collections

Web Scale

Curated

Collaborative

Hundreds of crawls per day | 1 Billion web documents per week | 15 PB total (3 PB / year)

Global Scale Harvesting

Archive-It

Curated, Selective Web Archiving

Topical or Thematic Collections

Community Webs Program

• LB21 grant, National Digital Platform

• Continuing Education, Curating Collections

• 2-year project (Jun 2017 - May 2019)

• Additional funding from Kahle Austin Foundation

Community Webs Goals

Education & Training

● Establish a cohort network of public librarians doing web archiving to preserve local history

● Support further cohort building, professional development activities, and outreach

Collection Development

● Create open, dedicated training and OER materials on community memory web archiving

● Seed innovative local programming and partnerships

Expanding National Capacity

● Provide web archiving services and infrastructure and ongoing storage and access

● Build an extensible program model that can be scaled and applied to other domains

Community Webs Applications

110 applications from public libraries across the country

A cohort of 28 small, medium, and large public libraries

15 IMLS Participants

13 Kahle Austin Participants

Participants

★ Athens Regional Library System
★ Birmingham Public Library
★ Brooklyn Collection, Brooklyn Public Library
★ Buffalo & Erie County Public Library
★ Cleveland Public Library
★ Columbus Metropolitan Library
★ DC Public Library
★ East Baton Rouge Parish Library
★ Forbes Library (MA)
★ Grand Rapids Public Library
★ Henderson District Public Libraries (HDPL)
★ Kansas City Public Library
★ County of Los Angeles Public Library
★ Marshall Lyon County Library (MN)
★ Metropolitan Library System (OK)
★ New Brunswick Free Public Library
★ Patagonia Library (AZ)
★ Pollard Memorial Library (Lowell, MA)
★ Queens Public Library
★ Lawrence Public Library
★ San Diego Public Library
★ San Francisco Public Library
★ NYPL - Schomburg Center for Research in Black Culture
★ The Urbana Free Library
★ West Hartford Public Library
★ Westborough Public Library
★ Denver Public Library, Western History and African American Research Library

Collecting: With Researchers

News Measures Research Project

● 663 local news sites representing 100 communities

● 7 crawls for a composite week (July - September 2016)

● 2.3TB & 17 million URLs captured

● Post-project ongoing monthly crawls

● Access to the collection: https://archive-it.org/collections/7520

● Research datasets publicly available: soon! (watch IA blog)

● Work with us to save news for research & posterity!
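
The slide does not say how the seven crawl dates were chosen. As an illustration only, the sketch below shows one common way to construct a composite week: pick one random Monday, one random Tuesday, and so on from the study period, so that each weekday is represented once. The function name `composite_week`, the exact start and end dates, and the seed are assumptions made for this example.

```python
import random
from datetime import date, timedelta


def composite_week(start, end, seed=0):
    """Pick one random date per weekday between start and end (inclusive)."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    by_weekday = {weekday: [] for weekday in range(7)}  # 0 = Monday
    day = start
    while day <= end:
        by_weekday[day.weekday()].append(day)
        day += timedelta(days=1)
    # One draw per weekday gives seven crawl dates spread over the period.
    return sorted(rng.choice(days) for days in by_weekday.values())


# Assumed study window: the slide only says "July - September 2016".
for crawl_date in composite_week(date(2016, 7, 1), date(2016, 9, 30), seed=1):
    print(crawl_date.isoformat(), crawl_date.strftime("%A"))
```

Sampling this way spreads the crawls across the quarter while still covering every day of the week, which reduces the influence of any single unusual news week.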


Collecting: Kids, Scholars, Ourselves

● K-12 Web Archiving Program

● Katrina Blogs

○ WBM as research tool

○ AIT as archival citation

○ Retroactive special collections

○ http://bit.ly/katrina-blogs

● Open-access Scholarship

○ PDFs in WBM (1.6 billion)

○ Cross-referencing against OA registries, repos, ISSNs, DOIs, lists, etc


Archiving .gov

The End of Term Web Archive

defining the “government web presence”

Stanford WebBase Project

2004 crawl list of URLs

eot 2016: more partners!

Federal Government Web Archiving Working Group

End of Term Web Archive 2016

2008: 457 from 26 nominators

2012: 1476 from 31 nominators

2016: 15,000+ from 400+ nominators (via UNT form)

Plus!: Over 150,000 from DataRescue/EDGI events/tools


End of Term Web Archive http://eotarchive.cdlib.org/


• Started with:

  • 9,000+ social media accounts (scrape of gov SM registry API): 44% FB, 37% TW, 10% YT

  • ~190K total domain, subdomains, gov/non-gov sites

  • more crowdsourced, curatorial nominations

• Ended with:

  • 200+ terabytes of data

  • 350,000,000+ docs/files

  • 70,000,000+ html files

  • 40,000,000+ PDFs

  • 100,000 public nominations

  • LOC, GPO, NARA, GSA, NASA, EPA, others

Special Web Sub Collections

https://archive.org/details/MilitaryIndustrialPowerpointComplex

Every PowerPoint from the .mil web domain (~50K) converted to PDF and with FTS


http://archive.org/~vinay/20th-century-gov-groupshots.html


GifCities! https://gifcities.org | Project done for Internet Archive’s 20th Anniversary, October 2016

Project Team: Vinay Goel, Richard Caceres, Jefferson Bailey + IA archivists!

https://drive.google.com/file/d/0B6C41kTNBjwqSWZRVFNwVUZEZVk/view


http://www.youtube.com/watch?v=Vy3ws3TQtwk



(coming soon!)

Technologies & Challenges

Technical Challenges

● Variations in acquisition

● Complex format

● Tons o’ data +++++++

● Storage infrastructure

● Computational infrastructure

● Diversity of contained data

Conceptual Challenges

● Acquisition & provenance opacity

● The Never-ending Web

● Crawl configurations

● Moving target; volatility

● Traditional finding aids inadequate

● Access is technologically dependent

● Lure of evermore data; more data not “better” data

● Attestation issues and a higher sensitivity to elision

Community Challenges

● Web archiving + born-digital is still a somewhat niche collecting activity

● Lack of coordinated efforts on shared tooling

● Little familiarity with formats, software, or processes

● Few on-ramps for non-developer and developer participation

● Web archives can answer any question

WASAPI: Web Archiving Systems APIs

• “Systems Interoperability and Collaborative Development for Web Archives”

• National Leadership Grant, National Digital Platform, R&D

• IA/AIT (PI), Stanford, UNT, Rutgers

• 2-year project started January 2016

• National Symposium Feb 2017

WASAPI: Web Archiving Systems APIs

Three Key Areas of R&D:

1) What are the attributes of a community model that can support sustainable and broad-based collaborative web archiving technology development?

2) What are the community needs and downstream uses for the planned Export APIs (by AIT & LOCKSS) to facilitate transfer of web archive data between distributed systems and what other prospective APIs does it point to?

3) How can better interoperability of web archiving systems support new forms of access and research use?

You can now search the

shall we...

https://web-beta.archive.org/

Searching: WBM (keywords)

Searching: WBM (keywords)

● How it works:

○ Index is built on anchor text of all in-bound links to a homepage

○ Index text covers 443 million homepages and is drawn from 900B in-bound links from other cross-domain websites
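
The idea behind the keyword index can be sketched in a few lines of Python. This is a toy illustration of indexing homepages by the anchor text of cross-domain in-bound links, not the Wayback Machine's actual implementation. The link records, the helper names (`host`, `is_homepage`, `build_index`, `search`), the simple count-based ranking, and the use of hostnames as a stand-in for "cross-domain" are all assumptions made for the example.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical link records: (source page, link target, anchor text).
LINKS = [
    ("http://example.org/schools", "http://web.mit.edu/", "Massachusetts Institute of Technology"),
    ("http://example.com/blog/1", "http://web.mit.edu/", "MIT homepage"),
    ("http://web.mit.edu/about", "http://web.mit.edu/", "home"),  # same host, skipped
    ("http://example.net/list", "https://archive.org/", "Internet Archive Wayback Machine"),
    ("http://example.net/list", "https://archive.org/details/texts", "books"),  # not a homepage
]


def host(url):
    return urlsplit(url).hostname or ""


def is_homepage(url):
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(links):
    """Map each anchor-text term to the homepages it points at, with counts."""
    index = defaultdict(lambda: defaultdict(int))
    for source, target, anchor in links:
        # Keep only in-bound links to homepages that come from another host.
        if not is_homepage(target) or host(source) == host(target):
            continue
        for term in tokenize(anchor):
            index[term][target] += 1
    return index


def search(index, query):
    """Rank homepages by how often the query terms appear in their anchors."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for homepage, count in index.get(term, {}).items():
            scores[homepage] += count
    return sorted(scores, key=lambda homepage: (-scores[homepage], homepage))


index = build_index(LINKS)
print(search(index, "mit"))
print(search(index, "internet archive"))
```

A consequence of this design is that a site is findable by the words other sites use to describe it, even when the homepage itself never contains those words.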

You can now search the details

shall we...

https://web-beta.archive.org/__wb/search/metadata?q=host:mit.edu

Searching: WBM (profile)

BROZZLER!

“browser” | “crawler” = BROZZLER

Research & Services

Advancing New Uses

● Web Archives

○ rich in content

○ rich in longitudinal value

○ rich sources for data mining

● Current Access Models

○ URL-centric clicking & browsing through Wayback

○ Query-based retrieval and search / discovery

Researchers

Want data

Interested in change over time

Study semantics, entities, locations

Multidisciplinary

Value info about collection origins

Web Archives

Have a lot of data

Have data segmented over time

Have a rich diversity of content

Multidisciplinary

Chock full of provenance information

Goal of Research Services

● Expand access models for web archives + born-digital

● Enable new insights into collections

● Leverage IA (or other) infrastructure for large-scale processing to produce datasets for research

● Facilitate computational analysis and new use cases

● Increase use, visibility, and value of Archive-It partner collections

Flexible Research Services

Researchers do not necessarily need huge sets of data to do interesting work… they do need flexible data delivery services…. Different formats based on different searches for different kinds of research at different times.

V.E. Varvel Jr. & A. Thomer, Google Digital Humanities Awards recipient interviews report

Archive-It Research Services

http://bit.ly/ait_ars

Web Archive Derivative Datasets


APIs, Notebooks, Interfaces

APIs, Notebooks, Interfaces

Historical ccTLD Wayback Machines! Built on IA global crawls + added historic web data

With keyword and mime/format search, embed linkback, domain stats, and special features

Accessing: Data/Datathons

● White House Datathon

○ Worked with White House & hosted event

● Archives Unleashed - http://archivesunleashed.com/

○ AU 3.0 -- At IA as part of WASAPI symposium

● WARCshop

○ PSU workshop for archivists to support research

● Webinar on using APIs (for SAA, videos soon)

● Online workshop + notebooks

○ https://github.com/vinaygoel/ars-workshop


“The .GOV Internet Archive: A Big Data Resource for Political Science” (Gade, et al., The Political Methodologist)

Colors of a (disappeared) National Web (Anat Ben-David, Digital Soci/PoliSci)

Exploring the Canadian Political Parties + Geocities (Ian Milligan, Digital Historian)

[Diagram labels: web collections; custodian hardware/cloud; commercial cloud (AWS, Goog, Azure, Wolfram, etc); academic HPC; local + tooling/analytics; APIs; disks; derivs; tools; platforms; Seeds, WARCs, Derivative Datasets, Publications, Research Data]

Research Services Approaches

● Datasets to researchers, patrons, developers

● Minimize need for dedicated infrastructure

○ leverage custodian computing power and archival expertise

● Hide complexities and volume of datasets and of born-digital collections through derivative formats

● Ongoing development of platforms and APIs for research data analysis and manipulation

Conclusion I

● Web archives are the present and future of historical records and include all media types

● No future scholarship can study our era without considering materials published (only) on the web

● Web archives will unsettle prior methodological approaches

● But web archives will offer new potential in research, both scholarly inquiry and public recreation

Conclusion II

Advancing research use of web archives and born-digital historical records (i.e. the archives of now) will require greater comfort, by archivists and by historians, with technical mediation at multiple levels and the increasing distance between the granularity and totality of the objects and subjects of study.

Conclusion Last

● Building ‘safety nets’ for born-digital resources will depend on the adoption of new technologies, new practices, new collections, and new research services/methods.

● The results will be the ongoing primacy and utility of the archive record and the continued vitality and resiliency of historical scholarship.

THANKS!

Jefferson Bailey, Director of Web Archiving

[email protected] | @jefferson_bail

Internet Archive, https://archive.org

Archive-It, https://archive-it.org


Date post:	29-Jan-2018
Category:	Technology
Upload:	micah-altman