Building Event Collections from Crawling Web Archives

@mart1nkle1n

IIPC WAC 2018, 11/13/2018, Wellington, NZL

Martin Klein (1), Lyudmila Balakireva (1), Herbert Van de Sompel (2)

(1) Research Library, Los Alamos National Laboratory
(2) Data Archiving and Networked Services, The Netherlands


Inspiration from Previous Work

https://doi.org/10.1007/978-3-319-67008-9_10


Published at WebSci 2018

https://doi.org/10.1145/3201064.3201085


Questions

1. Can we create event collections by focused crawling online-available web archives?
2. How do event collections created from the archived web compare to those created from the live web?
3. How does the amount of time passed since the event affect the collections built from the live and the archived web?
4. How do event collections built from the archived web compare to manually curated collections?


Background – Event Collection Building

• Often orchestrated by subject matter experts, archivists, special collection librarians, technicians
• Potentially with guidance from institutional collection policy
• Results in a list of seeds (URIs, social media accounts, etc.)
• Utilization of crawling services such as Archive-It, Social Feed Manager


Problems and our Approach

• Temporal: time passed since event is of concern
  Use of web archives via Memento infrastructure
• Selection: seeds often picked manually
  Use of references from Wikipedia pages
• Relevance: seed assessment often done by humans
  Use of focused crawling with content and temporal relevance assessment


Focused Crawling

[Figure: crawl tree rooted at a Seed with children Child 1, Child 2, Child 3 and grandchildren Child 1.1, Child 1.2, Child 2.1, Child 3.1, Child 3.2. Legend: not crawled; crawled and not relevant; crawled and relevant.]


Content Relevance

1. Content of Wikipedia page + random 60% of page’s references
   • Generate topic vector (TF-IDF of 1-grams + 2-grams)
2. Content of remaining 40% of Wikipedia page’s references
   • Generate topic vector (TF-IDF of 1-grams + 2-grams)
• Compute cosine similarity value between vectors 1 and 2
• Run 10 times
• Take average cosine similarity value as content threshold
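The threshold procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenizer, the smoothed IDF computed over just the two compared vectors, and the fixed random seed are all assumptions made here.

```python
import math
import random
from collections import Counter


def ngram_counts(text):
    """Count the 1-grams and 2-grams of one whitespace-tokenized document."""
    words = text.lower().split()
    return Counter(words + [" ".join(pair) for pair in zip(words, words[1:])])


def tfidf_pair(counts_1, counts_2):
    """TF-IDF vectors of two term-count tables over their joint vocabulary."""
    tables = (counts_1, counts_2)
    vectors = ({}, {})
    for term in set(counts_1) | set(counts_2):
        df = sum(1 for table in tables if term in table)
        # Smoothed IDF for a 2-document corpus (an assumption of this sketch).
        idf = math.log(3 / (1 + df)) + 1
        for table, vector in zip(tables, vectors):
            vector[term] = table[term] * idf
    return vectors


def cosine(u, v):
    """Cosine similarity of two vectors stored as dicts with the same keys."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def content_threshold(wiki_text, reference_texts, runs=10, share=0.6, seed=0):
    """Average similarity between (page + 60% of references) and the other 40%."""
    rng = random.Random(seed)
    similarities = []
    for _ in range(runs):
        refs = list(reference_texts)
        rng.shuffle(refs)
        cut = round(len(refs) * share)
        counts_1 = sum((ngram_counts(t) for t in [wiki_text] + refs[:cut]), Counter())
        counts_2 = sum((ngram_counts(t) for t in refs[cut:]), Counter())
        vector_1, vector_2 = tfidf_pair(counts_1, counts_2)
        similarities.append(cosine(vector_1, vector_2))
    return sum(similarities) / len(similarities)


# Made-up texts standing in for the Wikipedia page and its references.
wiki = "mass shooting at a nightclub in orlando"
refs = [
    "orlando nightclub shooting victims named",
    "police respond to orlando shooting",
    "orlando shooting investigation continues",
    "vigil held for orlando victims",
    "orlando mayor speaks after shooting",
]
threshold = content_threshold(wiki, refs)
```

A crawled page would then count as content-relevant when the cosine similarity of its own vector to the topic vector reaches `threshold`.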


Temporal Relevance

• Define temporal interval for which crawled pages are considered relevant
• Event date extracted from Wikipedia event page

[Figure: temporal relevance score (0 to 1) along a timeline marked Event Date, Change Point, and Today.]
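One simple reading of the interval idea is a binary score: a page is temporally relevant if its datetime falls between the event date and the change point. The sketch below assumes exactly that; the slides do not state the scoring function, and the function name and the example change point are hypothetical.

```python
from datetime import date


def temporal_relevance(page_date, event_date, change_point):
    """Return 1.0 inside [event_date, change_point] and 0.0 outside.

    The binary shape and the inclusive bounds are assumptions of this sketch.
    """
    return 1.0 if event_date <= page_date <= change_point else 0.0


event = date(2016, 6, 12)    # Orlando event date, as listed in the deck
change = date(2016, 6, 20)   # hypothetical change point, for illustration only

inside = temporal_relevance(date(2016, 6, 14), event, change)
before = temporal_relevance(date(2016, 6, 1), event, change)
after = temporal_relevance(date(2017, 1, 1), event, change)
```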


Change Point Detection

• Plot number of Wikipedia page edits per day
• Run R’s changepoint algorithm
• Detect significant change in curve

[Figure: Wikipedia page edits by edit date. x-axis: Edit Dates (2016-06-12 to 2017-08-24); y-axis: Percentage (0 to 100).]

https://cran.r-project.org/web/packages/changepoint/index.html
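To make the idea concrete, here is a small Python sketch of single change-point detection in the mean of a daily edit series. It is not the R `changepoint` package and does not reproduce its significance test; it only finds the split that minimizes the squared error around the two segment means. The function name and the edit counts are made up for illustration.

```python
def single_change_point(values):
    """Return the index that best splits the series into two constant-mean parts.

    The returned index is the first element of the second segment.
    """
    def squared_error(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    best_index, best_cost = None, float("inf")
    for i in range(1, len(values)):
        cost = squared_error(values[:i]) + squared_error(values[i:])
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index


# Made-up edits per day: heavy editing right after the event, then a quiet tail.
edits_per_day = [40, 55, 48, 52, 45, 3, 2, 4, 1, 2, 3, 1]
change_index = single_change_point(edits_per_day)  # 5: first quiet day
```

A production setup would add a penalty or test so that a split is only reported when the change is significant, which is what the third bullet on the slide calls for.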


Datetime Extraction

• Extract datetime from pages via:
  • URI: http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/
  • Meta tags: <meta property="article:published" itemprop="datePublished" content="2017-12-09T10:14:50-05:00" />
  • ODU’s Carbondate tool: http://carbondate.cs.odu.edu/
  • Memento datetime
  • X-Header
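The first two extraction routes can be sketched with the Python standard library alone. The helper names, the URI date pattern, and the order in which the routes are tried are assumptions of this sketch; the Carbondate, Memento datetime, and header routes are left out because they need network access.

```python
import re
from datetime import datetime
from html.parser import HTMLParser

# Matches a /YYYY/MM/DD/ path segment such as the one in the CNN URI above.
URI_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def datetime_from_uri(uri):
    """Return the date embedded in the URI path, or None."""
    match = URI_DATE.search(uri)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:  # digits that do not form a valid calendar date
        return None


class _MetaDateParser(HTMLParser):
    """Remember the content of the first publication-date meta tag."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_date = (attrs.get("itemprop") == "datePublished"
                   or attrs.get("property") == "article:published")
        if tag == "meta" and self.value is None and is_date:
            self.value = attrs.get("content")


def datetime_from_meta(html):
    """Return the datetime from a publication-date meta tag, or None."""
    parser = _MetaDateParser()
    parser.feed(html)
    return datetime.fromisoformat(parser.value) if parser.value else None


def extract_datetime(uri, html):
    """Try the URI first, then the meta tags (this order is an assumption)."""
    return datetime_from_uri(uri) or datetime_from_meta(html)


uri = "http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/"
html = ('<meta property="article:published" itemprop="datePublished" '
        'content="2017-12-09T10:14:50-05:00" />')
from_uri = datetime_from_uri(uri)
from_meta = datetime_from_meta(html)
```

Note that the URI route yields a date without a time zone, while the meta tag route keeps the offset given in the page.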


Experiment Details

• Topics limited to terror attacks and mass shootings in the U.S.
  • From different times in the past
• Take content and temporal relevance into account
  • Equally weighted
• Use events’ Wikipedia page as input for focused crawler
  • Version that was live at change point


Crawl Details

• Focused crawl of:
  • 22 archives, simultaneously, via Memento infrastructure
  • The live web
• Seeds
  • Memento of Wikipedia page references closest to and after event time
  • Subject to temporal and contextual relevance assessment
• Crawled outlinks
  • Memento of outlinks closest to and after event time
  • Subject to temporal and contextual relevance assessment
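Picking the Memento "closest to and after event time" amounts to a small selection over the archived copies of a URI. The sketch below assumes the list of copies has already been gathered (for example from a Memento TimeMap); the function name, the inclusive comparison, and the example URIs are hypothetical.

```python
from datetime import datetime


def first_memento_after(mementos, event_time):
    """Return the (datetime, uri) pair closest to, and not before, event_time.

    `mementos` is a list of (memento_datetime, memento_uri) pairs.
    Returns None when no copy exists at or after the event time.
    """
    candidates = [m for m in mementos if m[0] >= event_time]
    return min(candidates, key=lambda m: m[0]) if candidates else None


# Hypothetical archived copies of one reference URI.
copies = [
    (datetime(2010, 12, 1), "https://archive.example/20101201/http://example.org/story"),
    (datetime(2011, 1, 9), "https://archive.example/20110109/http://example.org/story"),
    (datetime(2011, 3, 2), "https://archive.example/20110302/http://example.org/story"),
]
chosen = first_memento_after(copies, datetime(2011, 1, 8))
```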


Crawl Details

• Crawl stop conditions:
  • No more relevant documents left
  • 5 levels deep
• Utilized crawl priority queue

[Figure: crawl tree with levels labeled Level 0, Level 1, Level 2 and nodes Seed, Child 1, Child 2, Child 3.]
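The stop conditions and the priority queue can be illustrated with a toy crawler over an in-memory link graph. This is a sketch under stated assumptions, not the crawler used in the study: `focused_crawl`, `outlinks`, `score_of`, and `threshold` are hypothetical names, fetching is replaced by dictionary lookups, and a URI's queue priority is taken to be its parent's score.

```python
import heapq


def combined_score(content, temporal):
    """Equally weighted content and temporal relevance."""
    return 0.5 * content + 0.5 * temporal


def focused_crawl(seeds, outlinks, score_of, threshold, max_depth=5):
    """Crawl best-first, following outlinks of relevant pages only.

    seeds:     list of seed URIs (level 0)
    outlinks:  dict mapping a URI to the URIs it links to
    score_of:  callable returning the relevance score of a crawled URI
    """
    # heapq is a min-heap, so priorities are stored negated.
    queue = [(-1.0, 0, seed) for seed in seeds]
    heapq.heapify(queue)
    seen = set(seeds)
    collection = []
    while queue:  # stops when no candidate documents are left
        _, depth, uri = heapq.heappop(queue)
        score = score_of(uri)
        if score < threshold:
            continue  # crawled and not relevant: its outlinks stay uncrawled
        collection.append(uri)
        if depth < max_depth:  # stop descending below the deepest level
            for child in outlinks.get(uri, []):
                if child not in seen:
                    seen.add(child)
                    heapq.heappush(queue, (-score, depth + 1, child))
    return collection


# Toy link graph and made-up (content, temporal) scores.
links = {"seed": ["a", "b"], "a": ["a1"], "b": ["b1"]}
scores = {
    "seed": combined_score(0.8, 1.0),
    "a": combined_score(0.6, 1.0),
    "b": combined_score(0.4, 0.0),
    "a1": combined_score(0.4, 1.0),
    "b1": combined_score(0.9, 1.0),
}
result = focused_crawl(["seed"], links, scores.get, threshold=0.5)
```

In this run `b` is crawled but scores below the threshold, so `b1` is never visited even though it would have been relevant, which mirrors the "not crawled" nodes in the focused crawling diagram.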



Collections Crawled (in November 2017)

• New York City, October 31st, 2017
• Las Vegas, October 1st, 2017
• Orlando, June 12th, 2016
• San Bernardino, December 2nd, 2015
• Tucson, January 8th, 2011
• Binghamton, April 3rd, 2009


NYC, 10/31/2017 – URIs per Level

[Figure: two panels, Web Archive Crawl and Live Web Crawl. x-axis: Crawl depth (0 to 5); left y-axis: Number of URIs; right y-axis: Percent (0 to 100). Series: All URIs, Relevant URIs.]


TUC, 01/08/2011 – URIs per Level

[Figure: two panels, Web Archive Crawl and Live Web Crawl. x-axis: Crawl depth (0 to 5); left y-axis: Number of URIs; right y-axis: Percent (0 to 100). Series: All URIs, Relevant URIs.]


NYC, 10/31/2017 – Relevance over…

[Figure: two panels, Crawled Documents and Crawl Time.]


TUC, 01/08/2011 – Relevance over…

[Figure: two panels, Crawled Documents and Crawl Time.]


TUC, 01/08/2011 – Comparison to Archive-It

[Figure: Accumulated Relevance (y-axis, 0 to 15000) over Documents (x-axis, 0 to 15000). Series: Web Archive Crawl, Archive-It Crawl.]


TUC, 01/08/2011 – Web Archive Contributions

• web.archive.org: 75%
• wayback.archive-it.org: 14%
• webarchive.loc.gov: 7%
• web.archive.bibalex.org: 2%
• archive.is: 2%


Takeaways

• Web archives are great resources to build event collections of web resources
• Crawling web archives is much slower than crawling the live web
• Collections about very recent events benefit more from the live web than the archived web
  but
• Collections about events from the distant past benefit more from the archived web than the live web
• Utilizing multiple web archives is beneficial for the collection
• Focused crawls have the potential to outperform manual collection building


https://web.archive.org/web/20171206181955/https:/twitter.com/TVNewsArchive/status/938466726190096384

