Collecting and exploring the “now” and the “flow”: A Case Study on Paris Attacks Archives
Marie Chouleur (BnF), Zeynep Pehlivan (Ina) and Valérie Schafer (Iscc, Cnrs)
IIPC, London Web Archiving Week, 16 June 2017
London 7/7 Archives
A special call for applications from the French CNRS and a one-year project dedicated to the born-digital heritage of the Paris attacks.
Challenges and opportunities related to the collaboration between Web archivists and social science researchers within the project ASAP (From #jesuischarlie to #offenturen: the born digital heritage and its archiving during the events).
https://www.paris.fr/actualites/grandformat/13-novembre
WHO, WHY, WHEN, WHAT FOR ?
https://archive-it.org/collections/5190
Legal deposit extended to the internet in 2006
“Signs, signals, writings, images, sounds or messages of any kind communicated to the public by electronic means”
Limited to “French” internet
Shared by Ina for websites related to audiovisual communications (radio, TV channels, blogs, etc.) and BnF for all other websites
On-site access only, in order to preserve copyright and privacy (BnF, Ina and regional libraries in charge of legal deposit)
Web Archiving in France
20 Years of Collections @laBnF
Encyclopedic collections, aiming at representativeness : websites, blogs and social media (Twitter, Facebook, Instagram, etc.), online video platforms...
December 31, 2016 : 793 TB (after deduplication and compression)
Annual broad crawls :
- 4.5 million seed domains, limited budget
- Seed lists provided by registrars
Selective crawls :
- Thematic Collections, Special Events, Emergency
- About 20,000 seed domains, more frequent (daily, weekly, monthly, twice a year, annual) and in-depth crawls
- More than 100 people involved (curators, researchers)
Emergency Crawls @laBnF
On demand, for one-off operations (e.g. websites about to be closed) : seeds added to weekly/monthly crawls by curators
For unforeseen events or medium-scale operations that do not fit the regular schedule (e.g. the Nuit Debout social movement in 2016): extra crawls may be planned
Complementary to daily News crawls (around 200 free news websites - national, regional, online - and 25 subscription news sites - 260 local editions) and other crawls (especially “Official Websites” and “Social Movement”)
Archiving Web @Ina
Search by URL
Search by catalog
Full-text search
Search videos
Search Twitter
Archiving video providers (YouTube, Dailymotion, etc.)
Archiving web radios
Archiving Twitter
Archiving web sites
• Sources
Web sites
Social media accounts
Video providers accounts
Twitter and radio streams
Iterative crawling of the archive
Archiving Web Sites @Ina
• 14,389 web sites archived daily
• Number of URL versions : 58.1 billion
• Archive size : 4.46 PB, 750 TB (compressed and deduplicated)
Archiving Video Providers @Ina
Archiving Twitter @ina
• Since February 2014
• 13 000 users (timelines)
• 400 hashtags
• More than 500 million tweets
• Paris attacks
HOW AND HOW MANY ?
2015 Terrorist Attacks @laBnF
Charlie Hebdo Attacks (January 2015)
- First crawl the day after the attacks
- Appeal to curators and IIPC partners for seed URLs documenting reactions on the web and Twitter
- Crawls maintained for a month
- Seeds also sent for crawl by Archive-It
Paris and St-Denis Attacks (November 2015)
- First crawl the Monday following the attacks
- Smaller scale, partly due to resources allocated to other activities (broad crawl underway at the same time)
- Crawls maintained for one week
Nomination tool : BnF Collecte du web (BCWeb)
Scheduling tool : NetarchiveSuite (NAS)
Crawler : Heritrix version 1 / version 3 since March 2017
No remote access to NAS and Heritrix
Archiving Twitter with Heritrix : max. 20 tweets per account, captured two or four times a day, 2,000 accounts in total; form of publication and context preserved (timeline, pictures, links); access from the same application
Archiving the web @laBnF
2015 Terrorist Attacks @laBnF
Charlie Hebdo Attacks (January 2015)
Around 1,500 domains
13.5 million URLs, 0.5 TB
Various contents, most of them in homage to the victims, notably websites or blogs with drawings, some others critical or hostile
Paris and St-Denis Attacks (November 2015)
61 seed URLs
1.2 million URLs, 65 GB
Mainly social media (popular hashtags on Twitter)
Archiving Twitter @ina
• Archiving with Twitter APIs (data oriented)
• A tweet : 140 characters of text and more than 30 metadata fields
• Between 3,000 and 4,000 characters in total
• The text of a tweet represents about 5% of the information
• Geolocation, user details etc.
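The share of a tweet record taken up by its text can be checked with a short calculation. The sketch below is only illustrative: it assumes each archived tweet is stored as one JSON string with a `text` field, and the sample record is hypothetical rather than taken from the Ina archive.

```python
import json


def text_share(raw_tweet: str) -> float:
    """Fraction of the serialized record taken up by the 'text' field."""
    tweet = json.loads(raw_tweet)
    return len(tweet.get("text", "")) / len(raw_tweet)


# Figures from the slide: 140 characters of text in a record of
# 3,000 to 4,000 characters.
low, high = 140 / 4000, 140 / 3000
print(f"{low:.1%} to {high:.1%}")  # 3.5% to 4.7%

# Hypothetical, heavily shortened record (real records carry far more metadata).
sample = json.dumps({
    "text": "Solidarity with Paris",
    "user": {"screen_name": "example_user", "followers_count": 10},
    "lang": "en",
    "coordinates": None,
})
print(f"{text_share(sample):.1%}")
```

On a real record the second figure should land near the slide's estimate; on this shortened sample it is much higher because most metadata fields are left out.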
Restrictions
• Streaming API : 400 hashtags, 5000 users
• 1% of tweets published at time t
• REST API : at most 3,200 past tweets per user
• Search API : 180 requests per 15-minute window per user, 450 per app
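To give a sense of what the Search API quota means in practice, here is a minimal pacing sketch. It only uses the figures quoted above (180 or 450 requests per 15-minute window); the function names are hypothetical and nothing here calls the Twitter API.

```python
import math

WINDOW_SECONDS = 15 * 60  # length of one rate-limit window


def min_interval(requests_per_window: int) -> float:
    """Seconds to wait between requests to stay within one window's quota."""
    return WINDOW_SECONDS / requests_per_window


def windows_needed(total_requests: int, requests_per_window: int) -> int:
    """Number of 15-minute windows needed to issue total_requests."""
    return math.ceil(total_requests / requests_per_window)


print(min_interval(180))          # 5.0 seconds between requests (user quota)
print(min_interval(450))          # 2.0 seconds between requests (app quota)
print(windows_needed(1000, 180))  # 6 windows for 1,000 requests
```

Spacing requests evenly is one simple choice; a collector could equally spend the whole quota at the start of each window and then wait.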
Charlie Hebdo
• More than 12 million tweets
• The next day, 8 January
#JeSuisCharlie, #CharlieHebdo, #Charlie, #IamCharlie, #jenesuispasCharlie
Charlie Hebdo
[Chart: tweet volume, 07-01-2015 to 14-01-2015; annotated point: 480K tweets at 21:00 on 07 January 2015]
13 November
• More than 20 million tweets
• From 23:00
#Bataclan, #JeSuisParis, #PrayForParis, #ParisAttacks, #Fusillade, #bombes,
#PorteOuverte, #rechercheparis, #bombesparis, #solidarité …
13 November
[Chart: tweet volume, 13-11-2015 to 20-11-2015; annotated point: 984K tweets at 00:00 on 14 November 2015]
Nice
• More than 8 million tweets
• The next day, also with Search API
#Nice06, #RechercheNice, #Nice, #prayfornice,
#NiceAttentat, #NousSommesUnis, #NiceAttack
Nice
[Chart: tweet volume, 14-07-2016 to 21-07-2016; annotated point: 615K tweets at 01:00 on 15 July 2016]
Emergency collections as disruptive processes? Issues and limits? Tools and methods? Openness and closeness? Governance and material, technical and human issues?
June 2016, ASAP meeting at ISCC, brainstorming on Emergency Crawls
3 interviews with :
- Jefferson Bailey & Sylvie Rollason-Cass (Archive-It team), March 2016 https://asap.hypotheses.org/125
- Annick Le Follic (BnF), 21 March 2016 https://asap.hypotheses.org/168
- Thomas Drugeon (Ina), 21 March 2016 https://asap.hypotheses.org/173
WHERE ?
Access to collections @laBnF
“Archives de l’internet”
Main application, open to all researchers : URL Search, News, Guided Tours, Full-Text Search (prototype)
“Archives de l’internet Labs”
Experimental platform for full-text searching and data mining, open to partner researchers
A pilot project, part of a four-year BnF program exploring services to offer digital corpora to researchers (“CORPUS”)
Full-text search - Advanced search
Full-text search - Expert search
Full-text search - List of results
Saved searches
Saved corpora
Download of the results (metadata)
Statistics
Metadata - CDX files, WAT files
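For researchers who receive CDX metadata, a small parser is often the first step. The sketch below assumes the common 11-column layout announced by the header line `CDX N b a m s k r M S V g`; whether the BnF exports use exactly this layout is an assumption, so the header of each file should be checked first. The sample line and field names are hypothetical.

```python
from typing import Dict

# Readable names for the columns N b a m s k r M S V g (assumed layout).
CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "length", "offset", "filename",
]


def parse_cdx_line(line: str) -> Dict[str, str]:
    """Split one space-separated CDX line into named fields."""
    parts = line.strip().split(" ")
    if len(parts) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CDX_FIELDS, parts))


# Hypothetical capture record, not taken from any real archive.
sample = ("fr,example)/ 20151114120000 http://example.fr/ text/html 200 "
          "ABCDEF - - 1234 5678 example.warc.gz")
record = parse_cdx_line(sample)
print(record["timestamp"], record["original"], record["statuscode"])
```

Keeping every value as a string avoids guessing at types; the timestamp can be converted with `datetime.strptime(..., "%Y%m%d%H%M%S")` when needed.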
Completeness of our archives
• Archiving embedded sources in tweets like images, videos etc.
• External sources : N.Ruest, M.Klein, Linkfluence etc.
• How to estimate?
• For hashtags that we track
• Using rate limit information (moving average, savgol filter etc.)
{"follow":[],"track":["#PrayForParis","#fusillade", …
{"limit":{"track":2058,"timestamp_ms":"1447456084764"}} …
{"limit":{"track":2159,"timestamp_ms":"1447456084783"}}
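The limit notices above can be turned into an estimate of missed tweets. The sketch below is one possible reading, not necessarily the method used at Ina: it assumes the `track` value in each notice is a running total of undelivered tweets since the connection opened, so the difference between consecutive notices gives the tweets missed in between. A plain trailing moving average stands in for the smoothing step; a Savitzky-Golay filter could be substituted. The function names are hypothetical.

```python
import json
from typing import Iterable, List, Tuple


def missed_per_notice(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (timestamp_ms, newly_missed) pairs from stream 'limit' notices."""
    result = []
    previous = None
    for line in lines:
        message = json.loads(line)
        limit = message.get("limit")
        if not limit:
            continue  # ordinary tweet or other message, not a limit notice
        total = limit["track"]
        if previous is not None and total >= previous:
            result.append((int(limit["timestamp_ms"]), total - previous))
        previous = total
    return result


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing moving average; early points use the values seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# The two notices shown on the slide, plus one hypothetical ordinary message.
stream = [
    '{"limit":{"track":2058,"timestamp_ms":"1447456084764"}}',
    '{"text":"example tweet"}',
    '{"limit":{"track":2159,"timestamp_ms":"1447456084783"}}',
]
print(missed_per_notice(stream))           # [(1447456084783, 101)]
print(moving_average([1, 2, 3, 4, 5], 3))  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

Adding the missed counts to the number of tweets actually received for the same period gives a rough total for the tracked hashtags.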
DEMO INA (video)
FROM SEARCH TO RESEARCH
From hands-on to finger on … Data Deluge
False Moves
Raw Data is an oxymoron
Digital Literacy
A “third generation” phenomenon
Tweet, tweet, retweet …
Close and distant reading
Trans-archives and trans-media
https://www.franceculture.fr/emissions/le-numerique-et-nous/archives-comment-est-conservee-la-memoire-du-web
A trading zone under construction
« The metaphor of a trading zone is being applied to collaborations in S & T. The basis of the metaphor is anthropological studies of how different cultures are able to exchange goods, despite differences in language and culture ». https://en.wikipedia.org/wiki/Trading_zones
Thank you for your attention.
BnF
Website : www.bnf.fr
http://www.bnf.fr/fr/professionnels/innov_num_dl_internet.html
http://www.bnf.fr/fr/collections_et_services/livre_presse_medias/a.archives_internet.html
Blog : http://webcorpora.hypotheses.org/
Email : [email protected]
Twitter : @DLWebBnF
KEEP IN TOUCH