Date post: | 29-Jan-2018 |
Category: | Technology |
Upload: | micah-altman |
Jefferson Bailey, Director, Web Archiving, Internet Archive
MIT Program on IS, Brownbag Series | Cambridge MA 2017 | @jefferson_bail | [email protected]
SAFETY NETS: RESCUE & REVIVAL FOR ENDANGERED WEB RECORDS
Outline
● About Internet Archive
● Timeline of Web Archiving at IA
● Web as Historical & Archival Record
● Collecting & Collections
● Technologies & Challenges
● Research & Services
● Conclusion
The Internet Archive
Non-Profit Library
Founded in 1996 by Brewster Kahle
Universal Access to All Knowledge
35,000,000,000,000,000 Bytes Archived (35 Petabytes)
Books Digitization
Music Digitization
TV Collection
Software Preservation and Emulation
25,000 | 2,000,000 | 2,300,000 | 2,400,000 | 3,000,000 | 4,000,000 | 570,000,000,000+
Software Titles | Moving Images | Book Archive | Audio Recordings | Hours of Television | eBooks | URLs
20 Years of Archiving the Web
1996 US Presidential Campaigns with Smithsonian
218,342,520 Web Captures
1997 First Full Crawl
525,362,846 Web Captures
1998 Donation of Crawl to the Library of Congress
1,166,891,826 Web Captures
2000 US Presidential Campaigns with the Library of Congress | Started Collecting Television | Started Digitizing Movies
6,153,042,235 Web Captures
2001 Launch of the WayBack Machine
12,082,859,018 Web Captures
2002 Book Digitization Begins
22,277,788,816 Web Captures
2003 International Internet Preservation Consortium Founded
38,868,116,181 Web Captures
2006 Archive-It Started
103,943,903,726 Web Captures
2007 Ireland
184,277,909,308 Web Captures
2008 National Archives & Records Administration (NARA) Congressional Harvests (https://webharvest.gov)
209,160,715,829 Web Captures
2009 Archive-It Adds its 100th Partner | 7 National Library Partners
225,658,093,516 Web Captures
2010 Broad and Survey Web-Scale Crawls
246,744,306,660 Web Captures
2014 Emulation of Software Archive in the Browser
452,769,236,649 Web Captures
2016 Archive-It Adds its 500th Partner
467,195,419,069 Web Captures
The Web as Historical Resource
https://web.archive.org/details/http://web.mit.edu/
The Web as Historical Resource
● The web is the primary publishing platform of our generation
● The web has consumed all media
● The web is distributed in access, but centralized in publication
● One cannot study contemporary society (or even recent history) without studying the web
The Web as Archival Resource
● WARC format
○ Obtuse, technical
● Packaging
○ URL-centric clicking & browsing through Wayback
○ Query-based retrieval and search functionality
○ Silos
● Born-Digital is same
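As a rough illustration of why the WARC format reads as technical, the sketch below parses one uncompressed WARC record held in memory. It assumes the WARC/1.0 layout (a version line, named header fields, a blank line, then a content block of `Content-Length` bytes). The sample record and the function name `parse_warc_record` are invented for illustration; real WARC files are usually gzip-compressed and are better read with a dedicated library.

```python
def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, payload)."""
    # The header block ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the size of the content block in bytes.
    length = int(headers["Content-Length"])
    payload = rest[:length]
    return version, headers, payload


# Illustrative record only; not taken from a real crawl.
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2016-10-26T00:00:00Z\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

version, headers, payload = parse_warc_record(SAMPLE)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Even this toy record nests an HTTP response inside the archival envelope, which is part of what makes the raw format hard to browse without replay tools.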
The Web as Archival Resource
● The web as “lived experience”
● Plurality of representation
● Diversity of media
● Unrivaled scope, scale, extent
● Access can be native (in the browser, full-text search, etc)
Collecting & Collections
Web Scale | Curated | Collaborative
Hundreds of crawls per day | 1 Billion web documents per week | 15 PB total (3 PB / year)
Global Scale Harvesting
Archive-It
Curated, Selective Web Archiving
Topical or Thematic Collections
Community Webs Program
• LB21 grant, National Digital Platform
• Continuing Education, Curating Collections
• 2-year project (Jun 2017 - May 2019)
• Additional funding from Kahle Austin Foundation
Community Webs Goals
Education & Training
● Establish a cohort network of public librarians doing web archiving to preserve local history
● Support further cohort building, professional development activities, and outreach
Collection Development
● Create open, dedicated training and OER materials on community memory web archiving
● Seed innovative local programming and partnerships
Expanding National Capacity
● Provide web archiving services and infrastructure and ongoing storage and access
● Build an extensible program model that can be scaled and applied to other domains
Community Webs Applications
110 applications from public libraries across the country
A cohort of 28 small, medium, and large public libraries
15 IMLS Participants
13 Kahle Austin Participants
Participants
★ Athens Regional Library System
★ Birmingham Public Library
★ Brooklyn Collection, Brooklyn Public Library
★ Buffalo & Erie County Public Library
★ Cleveland Public Library
★ Columbus Metropolitan Library
★ DC Public Library
★ East Baton Rouge Parish Library
★ Forbes Library (MA)
★ Grand Rapids Public Library
★ Henderson District Public Libraries (HDPL)
★ Kansas City Public Library
★ County of Los Angeles Public Library
★ Marshall Lyon County Library (MN)
★ Metropolitan Library System (OK)
★ New Brunswick Free Public Library
★ Patagonia Library (AZ)
★ Pollard Memorial Library (Lowell, MA)
★ Queens Public Library
★ Lawrence Public Library
★ San Diego Public Library
★ San Francisco Public Library
★ NYPL - Schomburg Center for Research in Black Culture
★ The Urbana Free Library
★ West Hartford Public Library
★ Westborough Public Library
★ Denver Public Library, Western History and African American Research Library
Collecting: With Researchers
News Measures Research Project
● 663 local news sites representing 100 communities
● 7 crawls for a composite week (July - September 2016)
● 2.3TB & 17 million URLs captured
● Post-project ongoing monthly crawls
● Access to the collection, https://archive-it.org/collections/7520
● Research datasets publicly available: soon! (watch IA blog)
● Work with us to save news for research & posterity!
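One way to read the "composite week" in this project is as a constructed-week sample: one crawl date per weekday, each drawn from somewhere in the study window. The sketch below is an assumption about that sampling idea, not the project's documented procedure; the function name, the random seed, and the exact window boundaries are illustrative.

```python
import random
from datetime import date, timedelta


def composite_week(start: date, end: date, seed: int = 0):
    """Pick one random date for each weekday (Monday..Sunday) in [start, end]."""
    rng = random.Random(seed)
    by_weekday = {d: [] for d in range(7)}
    day = start
    while day <= end:
        by_weekday[day.weekday()].append(day)
        day += timedelta(days=1)
    # One date per weekday gives seven crawl dates spread across the window.
    return [rng.choice(by_weekday[d]) for d in range(7)]


# Assumed window: the slide says only "July - September 2016".
sample = composite_week(date(2016, 7, 1), date(2016, 9, 30))
for d in sample:
    print(d.isoformat(), d.strftime("%A"))
```

Sampling each weekday from a different part of the window helps avoid a single unusual news week dominating the collection.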
Collecting: Kids, Scholars, Ourselves
● K-12 Web Archiving Program
● Katrina Blogs
○ WBM as research tool
○ AIT as archival citation
○ Retroactive special collections
○ http://bit.ly/katrina-blogs
● Open-access Scholarship
○ PDFs in WBM (1.6 billion)
○ Cross-referencing against OA registries, repos, ISSNs, DOIs, lists, etc.
Archiving .gov
The End of Term Web Archive
defining the “government web presence”
Stanford WebBase Project
2004 crawl list of URLs
eot 2016: more partners!
Federal Government Web Archiving Working Group
End of Term Web Archive 2016
2008: 457 from 26 nominators
2012: 1476 from 31 nominators
2016: 15,000+ from 400+ nominators (via UNT form)
Plus!: Over 150,000 from DataRescue/EDGI events/tools
End of Term Web Archive 2016
• Started with:
○ 9,000+ social media accounts (scrape of gov SM registry API): 44% FB, 37% TW, 10% YT
○ ~190K total domains, subdomains, gov/non-gov sites
○ more crowdsourced, curatorial nominations
End of Term Web Archive 2016
• Ended with:
○ 200+ terabytes of data
○ 350,000,000+ docs/files
○ 70,000,000+ html files
○ 40,000,000+ PDFs
○ 100,000 public nominations
○ LOC, GPO, NARA, GSA, NASA, EPA, others
https://archive.org/details/MilitaryIndustrialPowerpointComplex
Every PowerPoint from the .mil web domain (~50K) converted to PDF and with FTS
Special Web Sub Collections
http://archive.org/~vinay/20th-century-gov-groupshots.html
Special Web Sub Collections
GifCities! https://gifcities.org | Project done for Internet Archive’s 20th Anniversary, October 2016
Project Team: Vinay Goel, Richard Caceres, Jefferson Bailey + IA archivists!
Special Web Sub Collections
(coming soon!)
Technologies & Challenges
Technical Challenges
● Variations in acquisition
● Complex format
● Tons o’ data +++++++
● Storage infrastructure
● Computational infrastructure
● Diversity of contained data
Conceptual Challenges
● Acquisition & provenance opacity
● The Never-ending Web
● Crawl configurations
● Moving target; volatility
● Traditional finding aids inadequate
● Access is technologically dependent
● Lure of evermore data; more data not “better” data
● Attestation issues and a higher sensitivity to elision
Community Challenges
● Web archiving + born-digital is still a somewhat niche collecting activity
● Lack of coordinated efforts on shared tooling
● Little familiarity with formats, software, or processes
● Few on-ramps for non-developer and developer participation
● Web archives can answer any question
WASAPI: Web Archiving Systems APIs
• “Systems Interoperability and Collaborative Development for Web Archives”
• National Leadership Grant, National Digital Platform, R&D
• IA/AIT (PI), Stanford, UNT, Rutgers
• 2-year project started January 2016
• National Symposium Feb 2017
WASAPI: Web Archiving Systems APIs
Three Key Areas of R&D:
1) What are the attributes of a community model that can support sustainable and broad-based collaborative web archiving technology development?
2) What are the community needs and downstream uses for the planned Export APIs (by AIT & LOCKSS) to facilitate transfer of web archive data between distributed systems, and what other prospective APIs do they point to?
3) How can better interoperability of web archiving systems support new forms of access and research use?
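To make the export-API idea in area 2 more concrete, here is a minimal sketch of what a client might do with a file listing returned by such an API: check each transferred file against a published checksum. The JSON shape (`files`, `filename`, `checksums`, `locations`) is an assumption for illustration and is not presented as the actual WASAPI response format; the URL is a placeholder, and the "downloaded" bytes are a stand-in so the sketch runs without network access.

```python
import hashlib
import json

# Hypothetical listing; every field name here is an illustrative assumption.
LISTING = json.loads("""
{"count": 1,
 "files": [{"filename": "example-00000.warc.gz",
            "checksums": {"sha1": "PLACEHOLDER"},
            "locations": ["https://example.org/files/example-00000.warc.gz"]}]}
""")


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as a hex string."""
    return hashlib.sha1(data).hexdigest()


def verify(entry: dict, data: bytes) -> bool:
    """True if the transferred bytes match the checksum in the listing entry."""
    return sha1_hex(data) == entry["checksums"]["sha1"]


# Stand-in for bytes fetched from entry["locations"][0].
downloaded = b"not a real WARC file"
entry = LISTING["files"][0]
# Pretend the remote system published this checksum alongside the file.
entry["checksums"]["sha1"] = sha1_hex(downloaded)

print(entry["filename"], "ok" if verify(entry, downloaded) else "MISMATCH")
```

Fixity checks of this kind are one reason a shared transfer specification helps: the receiving system can confirm it holds the same bytes as the sender.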
You can now search the
shall we...
https://web-beta.archive.org/
Searching: WBM (keywords)
● How it works:
○ Index is built on anchor text of all inbound links to a homepage
○ Index text covers 443 million homepages and is drawn from 900B inbound links from other cross-domain websites
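The indexing idea on this slide can be sketched in a few lines: collect the anchor text of cross-domain inbound links to each homepage, then match query keywords against that text. This is a toy illustration with invented links and a naive token count; it is not the Wayback Machine's actual indexing or ranking code, and comparing hostnames is a simplification of "cross-domain".

```python
from collections import defaultdict
from urllib.parse import urlsplit

# (source page, target homepage, anchor text); invented sample data.
LINKS = [
    ("http://news.example.com/a", "http://web.mit.edu/", "MIT home page"),
    ("http://blog.example.net/b", "http://web.mit.edu/", "Massachusetts Institute of Technology"),
    ("http://web.mit.edu/about", "http://web.mit.edu/", "home"),  # same host: skipped
    ("http://blog.example.net/c", "http://archive.org/", "Internet Archive"),
]


def build_index(links):
    """Map each lowercase anchor token to a per-homepage count of inbound links."""
    index = defaultdict(lambda: defaultdict(int))
    for source, target, anchor in links:
        if urlsplit(source).hostname == urlsplit(target).hostname:
            continue  # keep only links arriving from a different host
        for token in anchor.lower().split():
            index[token][target] += 1
    return index


def search(index, query):
    """Rank homepages by the number of query-token matches in their inbound anchors."""
    scores = defaultdict(int)
    for token in query.lower().split():
        for target, count in index.get(token, {}).items():
            scores[target] += count
    return sorted(scores, key=lambda t: (-scores[t], t))


index = build_index(LINKS)
print(search(index, "mit"))
print(search(index, "internet archive"))
```

Because the index describes a homepage through the words other sites use to link to it, a site can be found by keywords that never appear on the page itself.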
You can now search the details
shall we...
https://web-beta.archive.org/__wb/search/metadata?q=host:mit.edu
Searching: WBM (profile)
BROZZLER!
“browser” | “crawler” = BROZZLER
Research & Services
Advancing New Uses
● Web Archives
○ rich in content
○ rich in longitudinal value
○ rich sources for data mining
● Current Access Models
○ URL-centric clicking & browsing through Wayback
○ Query-based retrieval and search / discovery
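The URL-centric access model can be illustrated with the Wayback Machine's replay URL pattern, in which a capture is addressed by a 14-digit timestamp followed by the original URL. The helper names below are illustrative, and the sketch only builds and parses URLs; it does not contact any service.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web/"
STAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit capture timestamp


def replay_url(original_url: str, when: datetime) -> str:
    """Build a replay URL asking for the capture nearest to `when`."""
    return f"{WAYBACK_PREFIX}{when.strftime(STAMP_FORMAT)}/{original_url}"


def parse_replay_url(url: str):
    """Split a replay URL back into (capture datetime, original URL)."""
    if not url.startswith(WAYBACK_PREFIX):
        raise ValueError("not a Wayback replay URL")
    stamp, _, original = url[len(WAYBACK_PREFIX):].partition("/")
    return datetime.strptime(stamp, STAMP_FORMAT), original


url = replay_url("http://web.mit.edu/", datetime(2001, 10, 24, 12, 0, 0))
print(url)
print(parse_replay_url(url))
```

The pattern shows the limitation named above: a reader must already know which URL and roughly which date to ask for, which is why query-based discovery matters.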
Researchers
Want data
Interested in change over time
Study semantics, entities, locations
Multidisciplinary
Value info about collection origins
Web Archives
Have a lot of data
Have data segmented over time
Have a rich diversity of content
Multidisciplinary
Chock full of provenance information
Goal of Research Services
● Expand access models for web archives + born-digital
● Enable new insights into collections
● Leverage IA (or other) infrastructure for large-scale processing to produce datasets for research
● Facilitate computational analysis and new use cases
● Increase use, visibility, and value of Archive-It partner collections
Flexible Research Services
Researchers do not necessarily need huge sets of data to do interesting work… they do need flexible data delivery services…. Different formats based on different searches for different kinds of research at different times.
V.E. Varvel Jr. & A. Thomer, Google Digital Humanities Awards recipient interviews report
Archive-It Research Services
http://bit.ly/ait_ars
Web Archive Derivative Datasets
APIs, Notebooks, Interfaces
Historical ccTLD Wayback Machines! Built on IA global crawls + added historic web data
With keyword and mime/format search, embed linkback, domain stats, and special features
Accessing: Data/Datathons
● White House Datathon
○ Worked with White House & hosted event
● Archives Unleashed - http://archivesunleashed.com/
○ AU 3.0 -- At IA as part of WASAPI symposium
● WARCshop
○ PSU workshop for archivists to support research
● Webinar on using APIs (for SAA, videos soon)
● Online workshop + notebooks
○ https://github.com/vinaygoel/ars-workshop
“The .GOV Internet Archive: A Big Data Resource for Political Science”
Gade, et al., The Political Methodologist
Colors of a (disappeared) National Web
Anat Ben-David (Digital Soci/PoliSci)
Exploring the Canadian Political Parties + Geocities
(Ian Milligan, Digital Historian)
[Diagram labels: web collections; Custodian hardware/cloud; Comm Cloud (AWS, Goog, Azure, Wolfram, etc); Academic HPC; Local + tooling/analytics; APIs; disks; derivs; tools; platforms; Seeds, WARCs, Derivative Datasets, Publications, Research Data]
Research Services Approaches
● Datasets to researchers, patrons, developers
● Minimize need for dedicated infrastructure
○ leverage custodian computing power and archival expertise
● Hide complexities and volume of datasets and of born-digital collections through derivative formats
● Ongoing development of platforms and APIs for research data analysis and manipulation
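As a hedged illustration of a derivative format that hides the volume of the underlying collection, the sketch below reduces per-capture link records to a small host-level link graph per year. The input records and their layout are invented for the example; they are not an Archive-It Research Services schema.

```python
from collections import Counter
from urllib.parse import urlsplit

# (capture timestamp, source URL, linked URL); made-up sample records.
CAPTURES = [
    ("20120105000000", "http://a.example.org/page1", "http://b.example.com/x"),
    ("20120301000000", "http://a.example.org/page2", "http://b.example.com/y"),
    ("20160710000000", "http://a.example.org/page1", "http://c.example.net/"),
]


def host_graph_by_year(captures):
    """Count host-to-host links, keyed by (year, source host, target host)."""
    edges = Counter()
    for timestamp, source, target in captures:
        year = int(timestamp[:4])  # first four digits of the capture timestamp
        key = (year, urlsplit(source).hostname, urlsplit(target).hostname)
        edges[key] += 1
    return edges


graph = host_graph_by_year(CAPTURES)
for (year, src, dst), n in sorted(graph.items()):
    print(year, src, dst, n)
```

A researcher interested in change over time can work from a table like this on a laptop, without handling the full archival files.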
Conclusion I
● Web archives are the present and future of historical records and include all media types
● No future scholarship can study our era without considering materials published (only) on the web
● Web archives will unsettle prior methodological approaches
● But web archives will offer new potential in research, both scholarly inquiry and public recreation
Conclusion II
Advancing research use of web archives and born-digital historical records (i.e., the archives of now) will require greater comfort, by archivists and by historians, with technical mediation at multiple levels and the increasing distance between the granularity and totality of the objects and subjects of study.
Conclusion Last
● Building ‘safety nets’ for born-digital resources will depend on the adoption of new technologies, new practices, new collections, and new research services/methods.
● The results will be the ongoing primacy and utility of the archival record and the continued vitality and resiliency of historical scholarship.
THANKS!
Jefferson Bailey, Director of Web Archiving
[email protected] | @jefferson_bail
Internet Archive, https://archive.org
Archive-It, https://archive-it.org