Building Event Collections from Crawling Web Archives
@mart1nkle1n
IIPC WAC 2018, 11/13/2018, Wellington, NZL
Martin Klein (1), Lyudmila Balakireva (1), Herbert Van de Sompel (2)
(1) Research Library, Los Alamos National Laboratory
(2) Data Archiving and Networked Services, The Netherlands
Inspiration from Previous Work
https://doi.org/10.1007/978-3-319-67008-9_10
Published at WebSci 2018
https://doi.org/10.1145/3201064.3201085
Questions
1. Can we create event collections by focused crawling of web archives that are available online?
2. How do event collections created from the archived web compare to those created from the live web?
3. How does the amount of time passed since the event affect the collections built from the live and the archived web?
4. How do event collections built from the archived web compare to manually curated collections?
Background – Event Collection Building
• Often orchestrated by subject matter experts, archivists, special collection librarians, technicians
• Potentially with guidance from institutional collection policy
• Results in a list of seeds (URIs, social media accounts, etc.)
• Utilization of crawling services such as Archive-It, Social Feed Manager
Problems and our Approach
• Temporal: time passed since event is of concern
  Use of web archives via Memento infrastructure
• Selection: seeds often picked manually
  Use of references from Wikipedia pages
• Relevance: seed assessment often done by humans
  Use of focused crawling with content and temporal relevance assessment
Focused Crawling
[Figure: crawl tree with the Seed at the root, its children Child 1, Child 2, and Child 3, and their children (Child 1.1, Child 1.2, Child 2.1, Child 3.1, Child 3.2). Legend: not crawled; crawled and not relevant; crawled and relevant.]
Content Relevance
1. Content of Wikipedia page + random 60% of page’s references
   • Generate topic vector (TF-IDF of 1-grams + 2-grams)
2. Content of remaining 40% of Wikipedia page’s references
   • Generate topic vector (TF-IDF of 1-grams + 2-grams)
• Compute cosine similarity value between vectors 1 and 2
• Run 10 times
• Take average cosine similarity value as content threshold
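The threshold procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: whitespace tokenization, the smoothed IDF over a two-document corpus, and the names `ngrams`, `tfidf_vectors`, `cosine`, and `content_threshold` are all hypothetical choices made for the example.

```python
import math
import random
from collections import Counter


def ngrams(text):
    """Return lowercase 1-grams and 2-grams of a text (whitespace tokens)."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def tfidf_vectors(group_a, group_b):
    """Build one TF-IDF vector per group, treating each group as one document."""
    tf_a = Counter(ngrams(" ".join(group_a)))
    tf_b = Counter(ngrams(" ".join(group_b)))
    vectors = []
    for tf in (tf_a, tf_b):
        vec = {}
        for term, count in tf.items():
            df = (term in tf_a) + (term in tf_b)
            # Assumption: smoothed IDF, so shared terms keep a non-zero weight.
            vec[term] = count * (math.log(3 / (1 + df)) + 1)
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def content_threshold(page_text, reference_texts, runs=10, share=0.6, seed=0):
    """Average cosine similarity over `runs` random 60/40 splits of the references."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        refs = list(reference_texts)
        rng.shuffle(refs)
        cut = round(len(refs) * share)
        vec_1, vec_2 = tfidf_vectors([page_text] + refs[:cut], refs[cut:])
        scores.append(cosine(vec_1, vec_2))
    return sum(scores) / len(scores)
```

Fixing the random seed keeps the threshold reproducible between runs; a real crawl would pass the extracted text of the Wikipedia page and of its references.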
Temporal Relevance
• Define temporal interval for which crawled pages are considered relevant
• Event date extracted from Wikipedia event page
[Figure: timeline marked Event Date, Change Point, and Today, with relevance values 1 and 0.]
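One way to read the temporal interval is as a step function over a page's datetime. The sketch below assumes that a page scores 1 if its date falls between the event date and the change point (inclusive) and 0 otherwise; that reading, the function name `temporal_relevance`, and the change point date in the usage lines are assumptions made for illustration.

```python
from datetime import date


def temporal_relevance(page_date, event_date, change_point):
    """Return 1 inside [event_date, change_point], else 0 (assumed step function)."""
    return 1 if event_date <= page_date <= change_point else 0


# Usage with the Orlando event date and a hypothetical change point.
event = date(2016, 6, 12)
change = date(2016, 7, 28)  # hypothetical value, for illustration only
inside = temporal_relevance(date(2016, 6, 20), event, change)
outside = temporal_relevance(date(2017, 1, 1), event, change)
```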
Change Point Detection
• Plot number of Wikipedia page edits per day
• Run R’s changepoint algorithm
• Detect significant change in curve
https://cran.r-project.org/web/packages/changepoint/index.html
[Figure: Wikipedia page edits over time; x-axis "Edit Dates" (2016-06-12 to 2017-08-24), y-axis "Percentage" (0 to 100), annotated with the value 46.]
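To make the idea concrete without R, here is a minimal Python sketch of single change point detection in the mean. It is a simple stand-in technique and not the R changepoint package: it picks the split that minimizes the squared deviation of each segment from its own mean, and it does not test whether the change is statistically significant. The function name `single_change_point` and the edit counts in the usage lines are made up for illustration.

```python
def single_change_point(values):
    """Return the index where the second segment starts.

    The split minimizes the summed squared deviation of each segment
    from its own mean (at most one change point is assumed).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(values)
    total_sq = sum(v * v for v in values)
    best_index, best_cost = None, float("inf")
    left_sum = left_sq = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        left_sq += values[i - 1] ** 2
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Within-segment squared error for values[:i] and values[i:].
        cost = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index


# Usage: invented daily edit counts, busy right after the event and quiet later.
edits_per_day = [40, 40, 40, 40, 40, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
change_index = single_change_point(edits_per_day)
```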
Datetime Extraction
• Extract datetime from pages via:
  • URI: http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/
  • Meta tags: <meta property="article:published" itemprop="datePublished" content="2017-12-09T10:14:50-05:00" />
  • ODU’s Carbondate tool: http://carbondate.cs.odu.edu/
  • Memento datetime
  • X-Header
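Three of the sources listed above can be illustrated with the Python standard library alone. The sketch below is an assumption about how such extraction might look, not the tool used in the study: the regular expression only covers URIs with a /YYYY/MM/DD/ path, the meta tag parser only looks at itemprop="datePublished", and the helper names are hypothetical. The Memento datetime is assumed to arrive as the value of a Memento-Datetime response header.

```python
import re
from datetime import datetime
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser

# Matches paths such as /2017/12/09/ (assumed URI date pattern).
URI_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def datetime_from_uri(uri):
    """Return a datetime parsed from a /YYYY/MM/DD/ path segment, or None."""
    match = URI_DATE.search(uri)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:  # e.g. month 13 or day 32
        return None


class _MetaDateParser(HTMLParser):
    """Collects the content of the first <meta itemprop="datePublished"> tag."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and self.value is None:
            if attributes.get("itemprop") == "datePublished":
                self.value = attributes.get("content")


def datetime_from_meta(html):
    """Return the datePublished meta value as a datetime, or None."""
    parser = _MetaDateParser()
    parser.feed(html)
    return datetime.fromisoformat(parser.value) if parser.value else None


def datetime_from_memento_header(value):
    """Parse a Memento-Datetime header value (an HTTP date string)."""
    return parsedate_to_datetime(value)


# Usage with the examples from the slide.
from_uri = datetime_from_uri("http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/")
from_meta = datetime_from_meta(
    '<meta property="article:published" itemprop="datePublished" '
    'content="2017-12-09T10:14:50-05:00" />'
)
```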
Experiment Details
• Topics limited to terror attacks and mass shootings in the U.S.
  • From different times in the past
• Take content and temporal relevance into account
  • Equally weighted
• Use events’ Wikipedia page as input for focused crawler
  • Version that was live at change point
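Equal weighting of the two signals could be expressed as a plain average. The slides do not state the exact combination formula, so the weighted sum below and the name `combined_relevance` are assumptions made for illustration.

```python
def combined_relevance(content_score, temporal_score, content_weight=0.5):
    """Weighted sum of both scores; 0.5 gives them equal weight (assumed formula)."""
    return content_weight * content_score + (1 - content_weight) * temporal_score


# Usage: a page with cosine similarity 0.8 that falls inside the temporal interval.
score = combined_relevance(0.8, 1)
```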
Crawl Details
• Focused crawl of:
  • 22 archives, simultaneously, via Memento infrastructure
  • The live web
• Seeds
  • Memento of Wikipedia page references closest to and after event time
  • Subject to temporal and contextual relevance assessment
• Crawled outlinks
  • Memento of outlinks closest to and after event time
  • Subject to temporal and contextual relevance assessment
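Selecting the Memento "closest to and after event time" comes down to a filter and a minimum over the known captures of a URI. The sketch below assumes the captures are already available as (datetime, URI) pairs, for example after reading a TimeMap; the function name `closest_memento_after` and the example.org URIs are hypothetical.

```python
from datetime import datetime


def closest_memento_after(mementos, event_time):
    """Return the earliest (datetime, uri) pair at or after event_time, or None."""
    candidates = [m for m in mementos if m[0] >= event_time]
    return min(candidates, key=lambda m: m[0]) if candidates else None


# Usage with invented captures of one reference.
captures = [
    (datetime(2010, 12, 1), "http://archive.example.org/20101201/http://example.org/news"),
    (datetime(2011, 1, 9), "http://archive.example.org/20110109/http://example.org/news"),
    (datetime(2011, 3, 5), "http://archive.example.org/20110305/http://example.org/news"),
]
chosen = closest_memento_after(captures, datetime(2011, 1, 8))
```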
Crawl Details
• Crawl stop conditions:
  • No more relevant documents left
  • 5 levels deep
• Utilized crawl priority queue
[Figure: crawl tree by depth. Level 0: Seed; Level 1: Child 1, Child 2, Child 3; Level 2: their children (Child 1.1, Child 1.2, Child 2.1, Child 3.1, Child 3.2).]
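The stop conditions and the priority queue can be sketched over an in-memory link graph. This is a simplified illustration, not the crawler used in the study: it assumes a URI is scored when it is discovered, that outlinks of non-relevant pages are not followed, and that the link graph and scores are given up front. The names `focused_crawl`, `outlinks`, and `relevance` are hypothetical.

```python
import heapq


def focused_crawl(seeds, outlinks, relevance, threshold, max_depth=5):
    """Return relevant URIs in crawl order, most relevant candidates first.

    outlinks:  dict mapping a URI to the list of URIs it links to
    relevance: callable returning a score for a URI
    """
    # heapq is a min-heap, so scores are negated to pop the best URI first.
    queue = [(-relevance(uri), 0, uri) for uri in seeds]
    heapq.heapify(queue)
    seen = set(seeds)
    collection = []
    while queue:  # stops once no candidates are left
        negative_score, depth, uri = heapq.heappop(queue)
        if -negative_score < threshold:
            continue  # crawled and not relevant: its outlinks are not followed
        collection.append(uri)
        if depth == max_depth:
            continue  # depth limit reached
        for child in outlinks.get(uri, []):
            if child not in seen:
                seen.add(child)
                heapq.heappush(queue, (-relevance(child), depth + 1, child))
    return collection


# Usage on a toy graph with invented scores.
graph = {"seed": ["a", "b"], "a": ["a1"], "b": ["b1"]}
scores = {"seed": 0.9, "a": 0.8, "b": 0.1, "a1": 0.7, "b1": 0.9}
crawled = focused_crawl(["seed"], graph, scores.get, threshold=0.5)
```

In the toy graph, "b1" is never reached because its only parent "b" falls below the threshold, which mirrors the pruning shown in the crawl tree.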
Collections Crawled (in November 2017)
• New York City, October 31st 2017
• Las Vegas, October 1st 2017
• Orlando, June 12th 2016
• San Bernardino, December 2nd 2015
• Tucson, January 8th 2011
• Binghamton, April 3rd 2009
NYC, 10/31/2017 – URIs per Level
[Figure: two panels, "Web Archive Crawl" and "Live Web Crawl". x-axis: crawl depth (0 to 5); left y-axis: number of URIs (0 to 2000); right y-axis: percent (0 to 100). Series: All URIs, Relevant URIs.]
TUC, 01/08/2011 – URIs per Level
[Figure: two panels, "Web Archive Crawl" and "Live Web Crawl". x-axis: crawl depth (0 to 5); left y-axis: number of URIs (0 to 80000); right y-axis: percent (0 to 100). Series: All URIs, Relevant URIs.]
NYC, 10/31/2017 – Relevance over…
[Figure: two panels, relevance over "Crawled Documents" and over "Crawl Time".]
TUC, 01/08/2011 – Relevance over…
[Figure: two panels, relevance over "Crawled Documents" and over "Crawl Time".]
TUC, 01/08/2011 – Comparison to Archive-It
[Figure: accumulated relevance (0 to 15000) over documents (0 to 15000). Series: Web Archive Crawl, Archive-It Crawl.]
TUC, 01/08/2011 – Web Archive Contributions
• web.archive.org 75%
• wayback.archive-it.org 14%
• webarchive.loc.gov 7%
• web.archive.bibalex.org 2%
• archive.is 2%
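A per-archive breakdown like the one above can be derived from the hostnames of the collected Memento URIs. The sketch below is one hypothetical way to compute it; the function name `archive_shares` and the sample URIs are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def archive_shares(memento_uris):
    """Return the percentage of Mementos contributed by each archive host."""
    hosts = Counter(urlsplit(uri).hostname for uri in memento_uris)
    total = sum(hosts.values())
    return {host: 100 * count / total for host, count in hosts.most_common()}


# Usage with invented Memento URIs.
sample = [
    "https://web.archive.org/web/20110109000000/http://example.org/a",
    "https://web.archive.org/web/20110110000000/http://example.org/b",
    "https://web.archive.org/web/20110111000000/http://example.org/c",
    "http://archive.is/20110112/http://example.org/d",
]
shares = archive_shares(sample)
```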
Takeaways
• Web archives are great resources to build event collections of web resources
• Crawling web archives is much slower than crawling the live web
• Collections about very recent events benefit more from the live web than the archived web
but
• Collections about events from the distant past benefit more from the archived web than the live web
• Utilizing multiple web archives is beneficial for the collection
• Focused crawls have the potential to outperform manual collection building
https://web.archive.org/web/20171206181955/https:/twitter.com/TVNewsArchive/status/938466726190096384