Page 1: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

1

WEB ARCHIVE SERVICES FRAMEWORK

FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB

Ahmed AlSum

PhD Defense

February 2014

Committee Members:
• Michael L. Nelson
• Michele C. Weigle
• Hussein M. Abdel-Wahab
• M’Hammad Abdous
• Herbert Van de Sompel

Old Dominion University Computer Science Department

Page 2: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

2

Domain

Contribution

Goal

WEB ARCHIVE SERVICES FRAMEWORK

FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB

Ahmed AlSum

PhD Defense

February 2014

Committee Members:
• Michael L. Nelson
• Michele C. Weigle
• Hussein M. Abdel-Wahab
• M’Hammad Abdous
• Herbert Van de Sompel

Old Dominion University Computer Science Department

Page 3: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

3

Outline
• Introduction
• Web Archiving Services Framework
  • Content Service
  • Metadata Service
  • URI Service
  • Archive Service
• Conclusions

Page 4: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

4

INTRODUCTION

Motivation and Research Questions

Page 5: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

5

What is a Web Archive?

Introduction Motivation

http://www.cs.odu.edu

Page 6: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

6

Who is using Web Archives, and how?
• Politicians
• Journalists
• Web designers
• Historians
• Researchers
• Social scientists
• Curious users

Introduction Motivation

*IIPC Access Working Group 2006, Costa 2010, Dougherty 2010, Stirling 2011, Smith 2009

Page 7: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

7

Web Archives interfaces are limited

Introduction Motivation

Page 8: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

8

Web Archiving Use Cases
• Ponguru asked on the Internet Archive forum on May 17, 2010*:

"Hi All - I am new to Archive.org. A few quick questions (1) Is there any API or tools available to access the Archive.org contents programmatically? (2) Are there any research papers where Archive.org was used for data collection / analysis (e.g. studying a particular topic over time, etc.)? I digged a little bit, could not find much, so checking with the group."

Introduction Motivation

*http://archive.org/post/306799/api-or-tools-to-access-research-publications-on-archiveorg

Page 9: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

9

Lack of APIs
• Popular websites provide APIs to third-party developers.

Introduction Motivation

Page 10: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

10

Limited and non-standard APIs
• Current web archives offer a limited set of APIs that do not cover users' needs.

Introduction Motivation

Page 11: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

11

Wayback Machine API

Introduction Motivation

• It returns a JSON response listing the available Mementos.
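A client for a JSON interface of this kind can be sketched as below. This is only an illustration: the endpoint `http://archive.org/wayback/available` and the `archived_snapshots.closest` response layout are assumptions based on the Internet Archive availability API and may not be the exact interface shown on the slide. The sample response is hand-written, not captured output.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback availability API; not taken from the slides.
AVAILABILITY_ENDPOINT = "http://archive.org/wayback/available"

def availability_query(uri_r, timestamp=None):
    """Build the query URL for an original resource (URI-R)."""
    params = {"url": uri_r}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDDhhmmss, may be truncated
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)

def closest_memento(response_text):
    """Return (URI-M, timestamp) of the closest snapshot, or None."""
    snapshots = json.loads(response_text).get("archived_snapshots", {})
    closest = snapshots.get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]

# Hand-written sample response, for illustration only.
SAMPLE = json.dumps({"archived_snapshots": {"closest": {
    "available": True, "status": "200", "timestamp": "20110411070244",
    "url": "http://web.archive.org/web/20110411070244/http://amazon.com"}}})
```

The parsing is kept separate from any network call so it can be exercised offline.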

Page 12: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

12

Croatian Web Archive

Introduction Motivation

Full-text search web interface Full-text search APIs in JSON

Page 13: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

13

Memento

Introduction Motivation

• Memento provides TimeMaps in the application/link-format (CoRE Link Format) serialization.
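To make the TimeMap format concrete, here is a minimal sketch of reading a link-format TimeMap and keeping the entries whose `rel` includes `memento`. The sample TimeMap and the `archive.example` host are made up for illustration, and the regular expressions cover only the simple quoted-parameter case, not the whole link-format grammar.

```python
import re
from datetime import datetime

# One link-value: <URI>; param="value"; param="value"
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')

def parse_timemap(text):
    """Return (uri_m, datetime) pairs for every link whose rel mentions 'memento'."""
    mementos = []
    for uri, raw_params in LINK_RE.findall(text):
        params = dict(PARAM_RE.findall(raw_params))
        rels = params.get("rel", "").split()
        if "memento" in rels and "datetime" in params:
            when = datetime.strptime(params["datetime"], "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((uri, when))
    return mementos

# Hypothetical TimeMap for illustration only.
SAMPLE_TIMEMAP = '''<http://www.amazon.com>; rel="original",
<http://archive.example/timemap/http://www.amazon.com>; rel="self"; type="application/link-format",
<http://archive.example/web/20110411070244/http://amazon.com>; rel="first memento"; datetime="Mon, 11 Apr 2011 07:02:44 GMT",
<http://archive.example/web/20120101000000/http://amazon.com>; rel="last memento"; datetime="Sun, 01 Jan 2012 00:00:00 GMT"'''
```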

Page 14: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

14

Memento Terminology

Introduction Motivation

Original Resource (URI-R, R): http://www.amazon.com

Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com

TimeMap (URI-T, TM)

Van de Sompel, H., Nelson, M. L., & Sanderson, R. (2013). RFC 7089 - HTTP framework for time-based access to resource states -- Memento. Internet Engineering Task Force (IETF). Retrieved from http://tools.ietf.org/html/rfc7089

Page 15: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

15

Memento Aggregator
• Merges TimeMaps from various archives.

Introduction Motivation
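The merging step can be pictured with the short sketch below. It assumes each archive's TimeMap has already been parsed into a time-ordered list of `(datetime, URI-M)` pairs; the archive hosts are hypothetical and this is not the aggregator's actual implementation.

```python
import heapq
from datetime import datetime

def merge_timemaps(*timemaps):
    """Merge per-archive lists of (datetime, uri_m) into one time-ordered list.

    Each input list is assumed to be sorted by datetime already; exact
    duplicates (same datetime and URI-M) are dropped.
    """
    merged = []
    for entry in heapq.merge(*timemaps):
        if not merged or merged[-1] != entry:
            merged.append(entry)
    return merged

# Hypothetical archives and URI-Ms, for illustration only.
archive_a = [(datetime(2009, 11, 5), "http://archive-a.example/20091105/http://example.org/"),
             (datetime(2010, 2, 12), "http://archive-a.example/20100212/http://example.org/")]
archive_b = [(datetime(2010, 1, 20), "http://archive-b.example/20100120/http://example.org/")]
combined = merge_timemaps(archive_a, archive_b)
```

`heapq.merge` streams the already-sorted inputs, so the whole result never needs re-sorting.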

Page 16: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

16

Web Archiving as Big Data
• The Internet Archive corpus has reached 5 petabytes.
• Bibliotheca Alexandrina needs one year to recompute the checksums for its corpus.
• Tools
  • Apache Pig

Introduction Motivation

Page 17: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

17

Research Question

How can we enrich the web archive access interface in conjunction with the live web?

Introduction Research Questions

Page 18: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

18

Research Questions
• What are the required services for the web archiving user community?
• Shall we work on the web archive collection as one entity or on different levels?
• How can we use the web archive content beyond full-text search?
• What are the metadata fields that could enhance user browsing?
• How can we develop an access interface to the temporal web graph?
• How can we optimize the creation of thumbnails?
• How can we use HTTP redirection to enhance the URI-lookup query?
• How can we optimize the query routing mechanism across the web archives?

Introduction Research Questions

Page 19: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

19

WEB ARCHIVE SERVICE FRAMEWORK

Levels and Datasets

Page 20: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

20

Web Archive Service Framework

Page 21: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

21

• Archive level
  • Web archive profiling to optimize the query routing.
• URI level
  • URI HTTP redirection in the web archive URI-lookup.
• Metadata level
  • ArcLink
  • ArcThumb
• Content level
  • ArcContent

Web Archive Service Framework

ArcSys

Page 22: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

22

IIPC 2010 Winter Olympics

Web Archive Service Framework Datasets

* http://olympics.us.archive.org/olympics2010/

Size: 700+ GB
From: Nov 2009
To: Mar 2010
#URI-R: 6.4M
#URI-M: 23.7M

Page 23: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

23

Fortune 500
• 499,540 mementos from 488 TimeMaps.
• For each Memento, we download the HTML and capture the thumbnail using PhantomJS.

Web Archive Service Framework Datasets

Page 24: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

24

DMOZ

Web Archive Service Framework Datasets

• An open directory of URIs based on user submissions.

Page 25: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

25

CONTENT SERVICE

ArcContent

Archive

URI

Metadata

Content

Page 26: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

26

Wayback Machine URI Rewriting

Original Rewritten

Content Service

Page 27: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

27

Response Types

Raw Response

Modified Response

Extracted Response

Content Service

Page 28: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

28

ArcContent Architecture Diagram

Content Service

Page 29: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

29

Extracted Response Filters

Content Service

TextContent

TFContent
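The two filter names suggest a plain-text extraction and a term-frequency extraction. The sketch below shows one way such filters could work on a memento's HTML. The function names `text_content` and `tf_content` only mirror the slide labels and are hypothetical, and the tokenization rule is an assumption, not ArcContent's documented behavior.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def text_content(html):
    """Plain text of the page with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())

def tf_content(html):
    """Term frequencies over lowercased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text_content(html).lower()))
```

A term-frequency map like this is also the natural input for the tag clouds mentioned under ArcContent applications.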

Page 30: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

30

Extracted Response Formats

Content Service

XML

JSON

Page 31: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

31

ArcContent Applications

Content Service

TFContent

TagClouds

Page 32: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

32

METADATA SERVICE

ArcLink & ArcThumb

Archive

URI

Metadata

Content

Page 33: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

33

Metadata Access Service
• Metadata is data about data.
• The metadata layer is data about mementos.

Metadata Service

Type | Field | Description | Example
Technical | Content-type | Entity MIME type. | text/html
Technical | Content-length | Size of the entity-body. | 90883
Extracted | Title | Title of the page. | Egypt rejoices at Mubarak departure
Extracted | Description | Description of the content of the entity-body. | The BBC World Affairs Editor John Simpson reflects on how Egypt brought about the overthrow of President Hosni Mubarak.
Extracted | Outgoing Links | A list of all the outlinks that the page points to. |
Derived | Thumbnail | Thumbnail of the representation of the web page. |
Derived | Incoming Links | A list of all the inlinks that point to the page. |

Page 34: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

34

ArcLink

Motivation, Stages, Cost Model, Applications

Page 35: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

35

ArcLink: optimization techniques to build and retrieve the temporal web graph

A. AlSum and M. L. Nelson.

In Proceedings of the 13th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '13, Indianapolis, Indiana, 2013

See also: http://arxiv.org/abs/1305.5959

Page 36: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

36

Easily Solved Questions

Q: What are the available mementos for www.vancouver2010.com?

Metadata Service ArcLink Motivation

Page 37: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

37

Solved Questions, but hard

Q. What are the HTML titles for www.vancouver2010.com through time?

A. Page scraping for all mementos

Metadata Service ArcLink Motivation

Page 38: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

38

Impossible Questions

Q. What are the anchor texts that pointed to www.vancouver2010.com through time?

Metadata Service ArcLink Motivation

…<a href=www.vancouver2010.com >Vancouver Olympics</a>….

…<a href=www.vancouver2010.com >Winter Olympics</a>…

…<a href=www.vancouver2010.com >Vancouver 2010</a>…

Page 39: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

39

Outlinks

Metadata Service ArcLink Motivation

Page 40: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

40

ArcLink and Temporal Web Graph

What is ArcLink?
• ArcLink is a complete system to extract, preserve, and access the Temporal Web Graph.

What is the Temporal Web Graph?
• The link structure through time, including inlinks and outlinks.

Metadata Service ArcLink Motivation

[Figure: web graphs WG @t1 and WG @t2 combined into the TWG]
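One way to picture a temporal web graph is as one link graph per capture time, indexed in both directions so that inlink questions (such as anchor text through time) become lookups. The class below is a minimal in-memory sketch with hypothetical names; it says nothing about how ArcLink actually stores the graph.

```python
from collections import defaultdict

class TemporalWebGraph:
    """Link structure keyed by capture time: one web graph (WG) per timestamp."""

    def __init__(self):
        self._out = defaultdict(lambda: defaultdict(set))  # t -> source -> {(target, anchor)}
        self._in = defaultdict(lambda: defaultdict(set))   # t -> target -> {(source, anchor)}

    def add_link(self, timestamp, source, target, anchor_text=""):
        self._out[timestamp][source].add((target, anchor_text))
        self._in[timestamp][target].add((source, anchor_text))

    def outlinks(self, uri, timestamp):
        return sorted(self._out.get(timestamp, {}).get(uri, ()))

    def inlinks(self, uri, timestamp):
        return sorted(self._in.get(timestamp, {}).get(uri, ()))

    def anchor_texts_over_time(self, uri):
        """Map each timestamp to the anchor texts of links pointing at uri."""
        return {t: sorted({a for _, a in by_target[uri]})
                for t, by_target in sorted(self._in.items()) if uri in by_target}
```

Keeping the inlink index alongside the outlink index is what turns the "impossible" anchor-text question into a direct query.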

Page 41: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

41

System Stages

Metadata Service ArcLink Stages

Page 42: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

42

Filtering
• Use CDX files to filter the URIs and select the mementos that will contribute to the Web Graph.
• For example:
  • Exclude non-200 HTTP status codes
  • Exclude images, style sheets, videos, etc.
  • Exclude duplicate mementos
• Technique: a Pig Latin script over the CDX files
• Results: the CDX was reduced to 25% of the original size, from 23.8M mementos to 6.7M mementos.

Metadata Service ArcLink Stages
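The filtering rules can be sketched in a few lines. The slides used a Pig Latin script; the Python below is a stand-in for illustration only. It assumes space-separated CDX fields in the order urlkey, timestamp, original, mimetype, statuscode, digest, treats "same urlkey and digest" as a duplicate, and keeps only `text/html`, all of which are assumptions rather than the author's exact rules.

```python
def filter_cdx(lines):
    """Yield CDX lines for mementos worth parsing for links.

    Assumes space-separated fields in the order
    urlkey, timestamp, original, mimetype, statuscode, digest, ...
    """
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 6 or fields[0] == "CDX":
            continue                      # header or malformed line
        urlkey, _ts, _orig, mimetype, status, digest = fields[:6]
        if status != "200":
            continue                      # exclude non-200 responses
        if not mimetype.startswith("text/html"):
            continue                      # exclude images, style sheets, videos, etc.
        if (urlkey, digest) in seen:
            continue                      # exclude duplicate captures
        seen.add((urlkey, digest))
        yield line
```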

Page 43: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

43

Extraction
• Technique: Hadoop
• Step 1: URI-ID generation
  • Canonicalize the URI into SURT format
  • Hash the canonicalized format using SimHash
  • Completely distributed
• Step 2: Define data sources
  • WARC
  • Web archive UI

Metadata Service ArcLink Stages

www.example.org/foo, example.org/foo, www1.example.org/foo → org,example)/foo

org,example)/foo → ABCD11

Tasks | Input Source | Map (sec) | Reduce (sec) | Total (sec)
2 Tasks | Wayback | 21,422 | 4,194 | 25,616
2 Tasks | WARC | 13,327 | 2,770 | 16,098 (62%)
5 Tasks | Wayback | 13,721 | 2,257 | 15,978
5 Tasks | WARC | 8,304 | 1,746 | 10,051 (62%)
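The URI-ID step can be illustrated as follows. This sketch uses a simplified SURT-style canonicalization (it only strips a leading `www`/`wwwN` label and reverses the host labels) and, as a plainly named substitution, a truncated MD5 instead of the SimHash named on the slide. Because the ID depends only on the URI itself, any worker can compute it without coordination, which is one reading of "completely distributed".

```python
import hashlib
import re
from urllib.parse import urlsplit

def surt(uri):
    """Simplified SURT form: reversed host labels, then ')' and the path."""
    if "://" not in uri:
        uri = "http://" + uri
    parts = urlsplit(uri)
    host = re.sub(r"^www\d*\.", "", parts.hostname or "")
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return ",".join(reversed(host.split("."))) + ")" + path + query

def uri_id(uri, digits=16):
    """Stand-in URI-ID: a truncated MD5 of the SURT form (not SimHash)."""
    return hashlib.md5(surt(uri).encode("utf-8")).hexdigest()[:digits]
```

All three spellings from the slide map to the same key, so they also share one ID.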

Page 44: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

44

Storage
• ArcLink used a database to save the web graph.

Metadata Service ArcLink Stages

Insertion Performance Update Performance

Page 45: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

45

ArcLink Response

Metadata Service ArcLink Stages

Page 46: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

46

ArcLink Response

Metadata Service ArcLink Stages

Page 47: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

47

ArcLink Response

Metadata Service ArcLink Stages

Page 48: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

48

Impossible Questions

Q. What are the anchor texts that pointed to www.vancouver2010.com through time?

Metadata Service ArcLink Applications

Page 49: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

49

Temporal Page Rank

Metadata Service ArcLink Applications

Rank | Nov-2009 | Dec-2009 | Jan-2010
1 | vancouver2010.com/code | - | topsport.com/sportch/liveticker/
2 | vancouver2010.com/en/langpolicy | - | vancouver2010.com/code
3 | vancouver2010.com/forgotpassword | - | canadacode.vancouver2010.com/user/register
4 | vancouver2010.com/store | - | canadacode.vancouver2010.com
5 | vancouver2010.com/store/index.html | - | canadacode.vancouver2010.com/explore
6 | vancouver2010.com/ | - | canadacode.vancouver2010.com/user/login?destination=node/add/image
7 | canadacode.vancouver2010.com | - | canadacode.vancouver2010.com/pulse
8 | canadacode.vancouver2010.com/nfb-onf | - | canadacode.vancouver2010.com/challenge
9 | canadacode.vancouver2010.com/contact | - | i-credible.nl
10 | canadacode.vancouver2010.com/resources | - | vpzschaatsteam.nl

Rank | Feb-2010 | Mar-2010 | Collection (Nov-09 to Mar-10)
1 | monlibe.liberation.fr | monlibe.liberation.fr | monlibe.liberation.fr
2 | topsport.com/sportch/liveticker/ | laprovence.com/la-provence-le-faq-de-la-moderation | vancouver2010.com/code
3 | lefigaro.fr | get.adobe.com/flashplayer | lefigaro.fr
4 | laprovence.com/la-provence-le-faq-de-la-moderation | vancouver2010.teamgb.com/teamgb/team-behind-team-gb/filenotfound.aspx | laprovence.com/la-provence-le-faq-de-la-moderation
5 | lefigaro.fr/sport | ledauphine.com | lefigaro.fr/sport
6 | get.adobe.com/flashplayer | lefigaro.fr/economie | get.adobe.com/flashplayer
7 | lefigaro.fr/meteo | lefigaro.fr/sport | lefigaro.fr/meteo
8 | lefigaro.fr/le-talk | lefigaro.fr/actualites-a-la-une | lefigaro.fr/le-talk
9 | dosb.de/de/vancouver-2010/vancouver-ticker/detail/printer.html | lemonde.fr/cgv | topsport.com/sportch/liveticker/
10 | ledauphine.com | ffs.fr/index.php | vancouver2010.com/en/langpolicy
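A ranking of this kind can be approximated by running PageRank separately on each monthly slice of the link graph. The sketch below uses textbook power-iteration PageRank with a damping factor of 0.85; the slides do not state which variant or parameters were used, so treat both as assumptions. The function names are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[n] for n in nodes if not out[n])
        nxt = {n: (1 - damping) / len(nodes) + damping * dangling / len(nodes)
               for n in nodes}
        for src in nodes:
            for dst in out[src]:
                nxt[dst] += damping * rank[src] / len(out[src])
        rank = nxt
    return rank

def temporal_top_k(links_by_month, k=10):
    """Rank each monthly slice independently and keep its top-k URIs."""
    top = {}
    for month, edges in links_by_month.items():
        scores = pagerank(edges)
        top[month] = sorted(scores, key=lambda n: (-scores[n], n))[:k]
    return top
```

Running the same function on the union of all slices would give the whole-collection column.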

Page 50: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

50

ArcThumb

Motivation, Feature Exploration, Selection Algorithm

Page 51: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

51

Thumbnail Summarization Techniques For Web Archives

A. AlSum and M. L. Nelson.

In Proceedings of the 36th European Conference on Information Retrieval, ECIR 2014, Amsterdam, Netherlands, 2014

Page 52: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

52

Thumbnails in Web Archive

Metadata Service ArcThumb Motivation

Internet Archive UK Web Archive

Page 53: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

53

Thumbnail Creation Challenges
• Scalability in time
  • IA may need 361 years to create a thumbnail for each memento using one hundred machines.
• Scalability in space
  • IA will need 355 TB to store one thumbnail per memento.
• Page quality

Metadata Service ArcThumb Motivation

Page 54: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

54

Thumbnails Usage Challenges

Metadata Service ArcThumb Motivation

• This is a partial view of 700 thumbnails out of the 10,500 available mementos for www.apple.com

Page 55: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

55

From 10,500 Mementos to 69 Thumbnails.

Metadata Service ArcThumb Motivation

Page 56: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

56

How many thumbnails do we need?

Metadata Service ArcThumb Methodology

www.unfi.com on the live Web

Page 57: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

57

How many thumbnails do we need?

Metadata Service ArcThumb Methodology

www.unfi.com on the live Web

Page 58: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

58

40 Thumbnails are good.

Metadata Service ArcThumb Methodology

Page 59: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

59

Visual Similarity and Text Similarity

Metadata Service ArcThumb Methodology

[Figure: example thumbnails labeled Similar and Different, compared with their HTML text]

Page 60: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

60

Correlation between Visual Similarity and Text Similarity

Metadata Service ArcThumb Feature Exploration

SimHash DOM tree

Embedded resources Memento Datetime

SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
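For the SimHash feature, a minimal Charikar-style fingerprint and the Hamming distance between two fingerprints can be sketched as below. Tokenizing on word characters and deriving per-token bits from MD5 are illustrative choices, not necessarily what the experiments used.

```python
import hashlib
import re

def simhash(text, bits=64):
    """Charikar-style fingerprint: similar token sets give nearby bit strings."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

A small Hamming distance between the fingerprints of two mementos then serves as a cheap proxy for "the HTML text barely changed".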

Page 61: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

61

Threshold Grouping

Metadata Service ArcThumb Selection Algorithms

Page 62: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

62

Threshold Grouping

Metadata Service ArcThumb Selection Algorithms
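The slides do not spell out the threshold grouping procedure, so the following is only one plausible reading: walk the TimeMap in Memento-Datetime order and start a new group, selecting that memento for a thumbnail, whenever the distance from the current group's first memento exceeds a threshold. The function names and the bit-distance helper are hypothetical.

```python
def threshold_grouping(fingerprints, threshold, distance):
    """Return indices of mementos selected for thumbnails.

    fingerprints: per-memento feature values in Memento-Datetime order.
    A memento starts a new group (and is selected) when its distance from
    the current group's first memento exceeds the threshold.
    """
    selected = []
    for i, fp in enumerate(fingerprints):
        if not selected or distance(fingerprints[selected[-1]], fp) > threshold:
            selected.append(i)
    return selected

def bit_distance(a, b):
    """Hamming distance between two integer fingerprints."""
    return bin(a ^ b).count("1")
```

Because each decision looks only at the last selected memento, newly archived mementos can be appended without reprocessing earlier ones, which fits the "Incremental: Yes" entry in the comparison table.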

Page 63: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

63

Clustering technique

Metadata Service ArcThumb Selection Algorithms

SimHash Feature SimHash and Datetime Features

Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.

Page 64: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

64

Time Normalization

Metadata Service ArcThumb Selection Algorithms

Page 65: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

65

Selection Algorithms Comparison

  Threshold Grouping K clustering Time Normalization

TimeMap Reduction 27% 9% to 12% 23% Image Loss 28 78 - 101 109

# Features 1 feature 1 or more 1 feature

Preprocessing required Yes Yes No

Efficient processing Medium Extensive Light

Incremental Yes No Yes

Online/offline Both Both Both

Metadata Service ArcThumb Selection Algorithms

Page 66: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

66

URI SERVICE

Archive

URI

Metadata

Content

Page 67: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

67

ARCHIVAL HTTP REDIRECTION RETRIEVAL POLICIES

A. AlSum, M. L. Nelson, R. Sanderson, and H. Van de Sompel

In Proceedings of 3rd Temporal Web Analytics Workshop.

TempWeb 2013, Rio de Janeiro, Brazil

Page 68: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

68

Live Web Redirect

http://bit.ly/r9kIfC redirects to http://www.cs.odu.edu

URI Service

% curl -I http://bit.ly/r9kIfC
HTTP/1.1 301 Moved
….
Location: http://www.cs.odu.edu/…
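The same check can be done in code by reading the status line and the `Location` header of a response head. The helper below is a small offline sketch (the name `parse_redirect` is hypothetical) that works on header text such as the `curl -I` output above.

```python
def parse_redirect(raw_headers):
    """Return (status_code, location) from the text of an HTTP response head.

    location is None unless the status is 3xx and a Location header exists.
    """
    lines = raw_headers.strip().splitlines()
    status = int(lines[0].split()[1])          # e.g. "HTTP/1.1 301 Moved"
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    location = headers.get("location") if 300 <= status < 400 else None
    return status, location
```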

Page 69: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

69

Live Web Redirect

URI Service

R: http://bit.ly/r9kIfC redirects to R': http://www.cs.odu.edu

Page 70: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

70

Archived Web Redirect

Live web: R1 www.draculathemusical.co.uk redirects to R2 www.mosaicstudio.co.uk

Web archive: R1 has Memento http://web.archive.org/web/20020212194020/http://www.draculathemusical.co.uk/, which redirects to the Memento of R3, http://web.archive.org/web/20020212194020/http://www.geocities.com/draculathemusical

URI Service

Page 71: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

71

Experiment
• Dataset: 10,000 sample URIs from
• Dataset does not include bit.ly nor doi.
• Experiment focused on the root page (no embedded resources)

URI Service Experiment and Results

URIs Live HTTP status code (10,000 URI-R)
OK (200) | 82.83%
Redirection (3xx) | 14.71%
  Redirection (301) | 8.4%
  Redirection (302) | 6.1%
  Redirection (others) | 0.2%
Not-Found (4xx) | 1.18%
Others | 1.28%

Memento HTTP status code (894,717 URI-M)
OK (200) | 93.46%
Redirection (3xx) | 5.69%
Not-Found (4xx) | 0.26%
Others | 0.59%

Page 72: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

72

URI Stability
• A URI's stability counts the changes in HTTP response status across time (200, 3xx, or 4xx) and the number of different URIs in the “Location” header of 3xx responses.

High Stability = 1, No Stability = 0

URI Service

Page 73: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

73

Abstract Model• TimeMap for R

URI Service

TimeMap(R): M1, M2, M3

Page 74: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

74

Timemap Redirection Categories

URI Service

• All Mementos have 200 HTTP status code. (Stability = 1)
• All Mementos have redirection to the same URI. (Stability = 1)
• All Mementos have redirection to different URIs. (Stability ≈ 0)
• Mementos have different HTTP status codes.

Page 75: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

75

URI Stability

URI Service Experiment and Results

TimeMap Category | Percentage | Stability
All Mementos have OK | 52% | 1
Mementos have mixed status codes | 36% | 0.91
All Mementos have Redirection | 0.92% | 0.85
  Redirection to the same URI | 0.62% |
  Redirection to different URIs | 0.30% |
URI has no Mementos at all | 10.97% | 0

[Figures: Stability in semi-log scale; Stability for |TM(R)| < 300]

Page 76: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

76

Current Wayback Machine Policy
• Live Redirect: The Wayback Machine ignores live redirects. It uses R instead of R'.
• Archived Redirect: The Wayback Machine follows the redirection.

URI Service Retrieval Policies

Page 77: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

77

Policy one: URI-R with HTTP redirection
• Scope: Selection between R and R' on the live web.
• Example: http://bit.ly/r9kIfC redirects to http://www.cs.odu.edu
• Algorithm:
  1. Retrieve the memento M for R.
  2. If Status(M) = 200: stop.
  3. If Status(M) = 3xx: go to Policy 2.
  4. If Status(M) = 4xx and R has a live redirect to R': use R' instead of R.
  5. Otherwise: stop.

URI Service Retrieval Policies
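The decision flow of policy one can be written down as a small function. This is a sketch of one reading of the flowchart: the two callables are hypothetical stand-ins for an archive lookup and a live-web check, and treating "no memento at all" the same as a 4xx memento is an assumption drawn from the evaluation slide.

```python
def policy_one(uri_r, memento_status, live_redirect_target):
    """Decide which URI to look up in the archive for original resource R.

    memento_status: callable returning the HTTP status of the memento
        retrieved for a URI-R, or None when the archive has no memento.
    live_redirect_target: callable returning R' when R redirects on the
        live web, else None.
    Both callables are hypothetical stand-ins for real lookups.
    """
    status = memento_status(uri_r)
    if status == 200:
        return ("stop", uri_r)
    if status is not None and 300 <= status < 400:
        return ("policy_two", uri_r)
    if status is None or 400 <= status < 500:
        target = live_redirect_target(uri_r)
        if target is not None:
            return ("retry_with", target)      # use R' instead of R
    return ("stop", uri_r)
```

Returning an action tag instead of performing the lookup keeps the decision logic testable without network access.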

Page 78: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

78

Policy one: URI-R with HTTP redirection
• Evaluation:
  • Policy scope: 1,471 URIs (those that have a live redirection)
  • 77 out of 1,471 have no mementos at all.
  • For 17 out of those 77, mementos were retrieved based on the live redirection.

URI Service Retrieval Policies

Page 79: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

79

Policy two: URI-M with HTTP redirection
• Scope: Selection between M and M' in the web archive.
• Example: http://api.wayback.archive.org/memento/20101109032705/http://bit.ly/2EEjBl redirects to http://api.wayback.archive.org/memento/20101109032705/http://www.cnn.com/
• Algorithm:
  1. M → M': the memento M redirects to M'.
  2. Extract the original URI from M'.
  3. Repeat content negotiation in datetime for original(M').

URI Service Retrieval Policies

In the example, the original extracted from http://api.wayback.archive.org/memento/20101109032705/http://www.cnn.com/ is http://www.cnn.com/, which is then requested again with:

Accept-Datetime: Sun, 13 May 2006 http://www.cnn.com/
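The "extract original, then renegotiate" step can be sketched as below. It assumes Wayback-style URI-Ms of the form prefix/14-digit-timestamp/original, as in the example, and takes the datetime negotiation itself as a hypothetical callable rather than implementing a Memento client.

```python
import re

# Wayback-style URI-M: <prefix>/<14-digit timestamp>/<original URI>
URI_M_RE = re.compile(r"^(?P<prefix>.+?)/(?P<ts>\d{14})/(?P<original>.+)$")

def original_of(uri_m):
    """Extract (timestamp, original URI) from a Wayback-style URI-M."""
    m = URI_M_RE.match(uri_m)
    if not m:
        raise ValueError("not a Wayback-style URI-M: " + uri_m)
    return m.group("ts"), m.group("original")

def policy_two(uri_m, location, negotiate):
    """When memento M redirects to `location`, renegotiate for its original.

    negotiate(uri_r, timestamp) is a hypothetical callable that performs
    Memento datetime negotiation and returns the selected URI-M.
    """
    timestamp, _ = original_of(uri_m)          # datetime the user asked for
    _, redirected_original = original_of(location)
    return negotiate(redirected_original, timestamp)
```

Renegotiating for the redirect target, rather than reusing the timestamp embedded in the `Location` value, lets the archive pick the memento of the target that is actually closest to the requested datetime.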

Page 80: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

80

Policy two: URI-M with HTTP redirection
• Evaluation:
  • Policy scope: 2,980 TimeMaps (those that showed an HTTP redirection status code in at least one memento)
  • Success criterion: using policy two contributed to the original TimeMap
  • Success percentage: 58% of the cases

URI Service Retrieval Policies

Page 81: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

81

ARCHIVE SERVICE

Percentage and Distribution

Archive

URI

Metadata

Content

Page 82: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

82

How Much Of The Web Is Archived?

S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson

In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries

JCDL '11, Ottawa, Canada 2011

See also: http://arxiv.org/abs/1212.6177

Page 83: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

83

Experiment
• 4 sample sets, 1,000 URIs each
• For each URI, we used the Memento Aggregator to record its TimeMap.

Archive Service Percentage Experiment

Page 84: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

84

Archives Under Experiment

[Figure: archive logos grouped as 2010, 2010 and 2013, and 2013]

Archive Service Percentage Experiment

UK

Page 85: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

85

How Much of the Web is Archived?
• It depends on which Web…

Archive Service Percentage Results

2010, including SE cache | 2010, excluding SE cache | 2013, general
90% | 79% | 90%
97% | 68% | 95%
88% | 19% | 52%
35% | 16% | 33%

Page 86: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

86

Profiling Web Archive Coverage For Top-level Domain And Content Language

A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel

In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries

TPDL 2013, Valletta, Malta, 2013

An extended version has been invited to a special issue of IJDL.

See also: http://arxiv.org/abs/1309.4008

Page 87: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

87

Memento Aggregator

Archive Service Distribution

Page 88: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

88

Where can you find?

Archive Service Distribution

http://www.google.com/

Page 89: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

89

Where can you find?

Archive Service Distribution

http://www.google.com/

Page 90: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

90

Where can you find?

Archive Service Distribution

http://www.japantimes.co.jp/

Page 91: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

91

Where can you find?

Archive Service Distribution

http://www.japantimes.co.jp/

Page 92: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

92

Research Question

Problem
• We need to profile the web archives around the world by these characteristics:
  • Age
  • Top-level domains
  • Languages
  • Growth rate

Goal
• To optimize the query routing for the Memento Aggregator.
• To determine the missing parts of the web.

Archive Service Distribution
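To show how a profile could drive query routing, the sketch below picks the archives most likely to hold a URI based on its top-level domain. The profile numbers, the `route` function, and the cutoff parameters are all made up for illustration; they are not measured coverage values or the aggregator's real routing rule.

```python
from urllib.parse import urlsplit

# Hypothetical profile: archive -> fraction of sampled URIs it held per TLD.
PROFILE = {
    "IA": {"com": 0.9, "uk": 0.8, "jp": 0.7},
    "UK": {"uk": 0.6},
    "IC": {"is": 0.5},
}

def route(uri, profile, top_n=2, min_coverage=0.05):
    """Pick the archives most likely to hold mementos for uri, by TLD."""
    host = urlsplit(uri if "://" in uri else "http://" + uri).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    scored = [(cov.get(tld, 0.0), name) for name, cov in profile.items()]
    ranked = sorted((s for s in scored if s[0] >= min_coverage),
                    key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:top_n]]
```

With such a profile, the aggregator would query only the returned archives instead of broadcasting every request to all of them.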

Page 93: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

93

URIs Samples Sources

Archive Service Distribution

Web
1. DMOZ – Random sample
2. DMOZ – TLD: 200 URIs for each TLD from DMOZ (80 TLDs)
3. DMOZ – Languages: 100 URIs for each language (40 languages)

Web Archives
4. Top 1-Gram from Bing
5. Top 1000 query terms by Yahoo in 9 languages

User requests
6. IA Wayback Machine log files
7. Memento aggregator log files

* We used hostnames only

Page 94: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

94

TLD Coverage

Archive Service Distribution

Legend: IA = Internet Archive; AIT = Archive-It; LoC = Library of Congress; UK = UK National Library; BL = British Library; SG = Singapore; CAT = Web Archive of Catalonia; CR = Croatian Web Archive; CZ = Archive of the Czech Web; TW = National Taiwan University; IC = Icelandic Web Archive; CAN = Library and Archives Canada; SI = Slovenia; WEB = WebCite; AIS = Archive.is

Page 95: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

95

Language Coverage

Archive Service Distribution

Legend: IA = Internet Archive; CAN = Library and Archives Canada; PO = Portuguese Web Archive; CZ = Archive of the Czech Web; LoC = Library of Congress; BL = British Library; CAT = Web Archive of Catalonia; TW = National Taiwan University; IC = Icelandic Web Archive; UK = UK National Library; CR = Croatian Web Archive; AIT = Archive-It

Legend: IA = Internet Archive; AIT = Archive-It; LoC = Library of Congress; UK = UK National Library; BL = British Library; SG = Singapore; CAT = Web Archive of Catalonia; CR = Croatian Web Archive; CZ = Archive of the Czech Web; TW = National Taiwan University; IC = Icelandic Web Archive; CAN = Library and Archives Canada; SI = Slovenia; WEB = WebCite; AIS = Archive.is

Page 96: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

96

Growth Rate

Archive Service Distribution

Legend: IA = Internet Archive; CAN = Library and Archives Canada; PO = Portuguese Web Archive; CZ = Archive of the Czech Web; LoC = Library of Congress; BL = British Library; CAT = Web Archive of Catalonia; TW = National Taiwan University; IC = Icelandic Web Archive; UK = UK National Library; CR = Croatian Web Archive; AIT = Archive-It

Stopped archiving in 2008

Steady growth

Stopped getting new URIs, but still crawling

Legend: IA = Internet Archive; AIT = Archive-It; LoC = Library of Congress; UK = UK National Library; BL = British Library; SG = Singapore; CAT = Web Archive of Catalonia; CR = Croatian Web Archive; CZ = Archive of the Czech Web; TW = National Taiwan University; IC = Icelandic Web Archive; CAN = Library and Archives Canada; SI = Slovenia; WEB = WebCite; AIS = Archive.is

Page 97: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

97

Building Web Archive Profile

Archive Service Distribution

{"Profile":{
  "Name" : "Taiwan Web Archive",
  "URI" : "http://webarchive.lib.ntu.edu.tw",
  "TimeGate" : "http://mementoproxy.cs.odu.edu/tw/timegate/",
  "Code" : "TW",
  "Age" : "Tue, 15 Jul 1997 00:00:00 GMT",
  "TLD" : [ {"tw":0.6},{"cn":0.08},{"hk":0.04}, {"eg":0.04},{"gov":0.04}, {"my":0.04},{"jp":0.04},{"kr":0.02}],
  "Language" : [{"zh-TW":0.5},{"zh-CN":0.25},{"id":0.08},{"ar":0.08}],
  "GrowthRate" : [
    {"199707":[4,4]},{"200202":[1,1]},
    {"200607":[30,62]},{"200608":[20,80]},
    {"200609":[5,9]},{"200612":[77,129]},
    ... // other values truncated
    {"201308":[7,94]},{"201309":[2,94]}]
  }
}
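A minimal sketch of how TLD scores from such a profile might drive query routing in an aggregator. It is illustrative only and not the routing logic used in the thesis: the function names, the `top_n` parameter, and the second archive `XX` with its scores are hypothetical, and only the `TW` scores are copied from the profile on this slide.

```python
from urllib.parse import urlparse

# Only the "TW" scores come from the slide; "XX" is a made-up archive
# added so that the ranking has something to compare against.
PROFILES = {
    "TW": {"tw": 0.6, "cn": 0.08, "hk": 0.04, "eg": 0.04,
           "gov": 0.04, "my": 0.04, "jp": 0.04, "kr": 0.02},
    "XX": {"jp": 0.5, "tw": 0.1},
}


def tld_of(uri):
    """Return the last label of the URI's hostname, e.g. 'jp'."""
    host = urlparse(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def rank_archives(uri, profiles, top_n=3):
    """Rank archive codes by their profile score for the URI's TLD.

    Archives whose profile has no score for that TLD are skipped.
    """
    tld = tld_of(uri)
    scored = [(profile.get(tld, 0.0), code) for code, profile in profiles.items()]
    # Highest score first; ties broken alphabetically by archive code.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [code for score, code in scored[:top_n] if score > 0]
```

For example, `rank_archives("http://www.japantimes.co.jp/", PROFILES)` would query the hypothetical `XX` archive before `TW`, because `XX` has the higher `jp` score.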

Page 98: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

98

Web Archive Selection Evaluation

$$\mathrm{Recall}_{TM}@n = \frac{|TM|\ \text{using } n \text{ archives}}{|TM|\ \text{using } N \text{ archives}}$$

TM(R)
Archive   Mementos
A1        M1, M2, M3
A2        M4, M5
A3        M6
A4        M7
A5        M8

• RecallTM@1 = 3/8 = 0.375
• RecallTM@2 = 5/8 = 0.625

Archive Service Distribution
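The RecallTM@n example on this slide can be reproduced with a short sketch. The function name, the dictionary layout, and the ranking `A1..A5` are assumptions made for illustration; the sketch also assumes that each memento is held by exactly one archive, as in the slide's example.

```python
def recall_tm_at_n(mementos_by_archive, ranked_archives, n):
    """Fraction of the full TimeMap recovered by querying only the top-n archives."""
    total = sum(len(m) for m in mementos_by_archive.values())
    if total == 0:
        return 0.0
    found = sum(len(mementos_by_archive.get(code, []))
                for code in ranked_archives[:n])
    return found / total


# TimeMap from the slide: eight mementos spread over five archives.
TM_R = {
    "A1": ["M1", "M2", "M3"],
    "A2": ["M4", "M5"],
    "A3": ["M6"],
    "A4": ["M7"],
    "A5": ["M8"],
}
RANKING = ["A1", "A2", "A3", "A4", "A5"]

print(recall_tm_at_n(TM_R, RANKING, 1))  # 0.375
print(recall_tm_at_n(TM_R, RANKING, 2))  # 0.625
```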

Page 99: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

99

Web Archive Selection Evaluation

Archive Service Distribution

Number of archives   Including IA   Excluding IA
RecallTM@3           0.96           0.647
RecallTM@6           0.98           0.83
RecallTM@9           0.998          0.983
RecallTM@12          0.999          0.987

• Total number of archives N = 15

$$\mathrm{Recall}_{TM}@n = \frac{|TM|\ \text{using } n \text{ archives}}{|TM|\ \text{using } N \text{ archives}}$$

Page 100: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

100

CONCLUSIONS

Page 101: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

101

Conclusions
• We proposed a new service framework that divides the web archive corpus into four levels: Content, Metadata, URI, and Archive.
• We developed ArcContent, which supports the web archive interface with extracted versions of the mementos based on a set of predefined filters.
• We studied the challenges of building the temporal web graph and developed ArcLink, a distributed system to extract, preserve, and expose the temporal web graph.
• We studied optimization and summarization techniques to create thumbnails for the web graph collections based on SimHash fingerprints.
• We extended the concept of URI-lookup in the web archive to include the HTTP redirection status code.
• We defined the concept of “Web Archive Profile” to characterize the web archive corpus, with an application to distributed search in the Memento Aggregator.

Page 102: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

102

Publications
• S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. “How much of the Web is Archived?” In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL ’11, 2011.
• A. AlSum, M. L. Nelson, R. Sanderson, and H. Van de Sompel. “Archival HTTP Redirection Retrieval Policies.” In Proceedings of the 3rd Temporal Web Analytics Workshop, TempWeb ’13, 2013.
• A. AlSum and M. L. Nelson. “ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph.” In Proceedings of the 13th annual international ACM/IEEE joint conference on Digital libraries, JCDL ’13, 2013.
• A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel. “Profiling Web Archive Coverage for Top-Level Domain and Content Language.” In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013.
• A. AlSum and M. L. Nelson. “Thumbnail Summarization Techniques for Web Archives.” In Proceedings of the 36th European Conference on Information Retrieval, ECIR ’14, 2014.

Page 103: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

103

What’s next?
• Web Archiving Engineer at Stanford University.

Page 104: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

104

WEB ARCHIVE SERVICES FRAMEWORK

FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB

Ahmed AlSum

PhD Defense

February 2014

Old Dominion University Computer Science Department

@aalsum

Page 105: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

105

BACKUP

Page 106: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

106

Memento
• Memento is an HTTP extension to integrate the Past and the Current Web

I. Jacobs and N. Walsh. Architecture of the World Wide Web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/

Now

T1

T2

T3
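As a rough illustration of Memento as an HTTP extension, the sketch below builds a datetime-negotiation request for a TimeGate using the `Accept-Datetime` header defined by the Memento protocol. The TimeGate address `timegate.example.org` and the helper name are hypothetical, and appending the original URI to the TimeGate prefix is an assumption about the URI layout. No request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_request(timegate_prefix, uri_r, when):
    """Return (url, headers) asking a TimeGate for the memento of uri_r nearest to `when`.

    `when` should be a timezone-aware datetime; it is converted to GMT
    because Accept-Datetime uses the RFC 1123 date format.
    """
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return timegate_prefix + uri_r, {"Accept-Datetime": accept_datetime}


url, headers = timegate_request(
    "http://timegate.example.org/timegate/",  # hypothetical TimeGate
    "http://www.google.com/",
    datetime(2010, 1, 1, tzinfo=timezone.utc),
)
```

A TimeGate that supports the protocol would typically answer such a request with a redirect to the selected memento.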

Page 107: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

107

Memento

• Developer and administrator for Memento aggregator and proxies

Page 108: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

108

Memento Clients

• Memento is now an RFC (RFC 7089).

Page 109: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

109

Lack of APIs
• Famous websites provide APIs to third-party developers.

• Introduction Motivation

Page 110: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

110

Lack of APIs
• US agencies have started to provide APIs for data access.

• Introduction Motivation

Page 111: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

111

Web Archiving Use Cases
• Temporal navigation.
• Full text search.
• Use language filters.
• Provide raw WARC.
• Import of metadata records into other repositories.

• Introduction Motivation

*IIPC Access Working Group. Use cases for Access to Internet Archives. International Internet Preservation Consortium Publications, http://www.netpreserve.org/resources/use-cases-access-internet-archives, 2006.

Page 112: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

112

Related Projects

Data analysis for the web data

Tools and Methods to access the web archive

Enable the user to do experiments on the raw crawled data on Amazon S3

Enable the user to browse the present and the past web

• Introduction

Page 113: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

113

Selection
• Decide what to capture

Everything, any domain

National domains

Delegate selection to partners

Users’ favorites

• We studied what is already captured

Page 114: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

114

URI-Based

WayBack Machine

• Web Archiving Trends Accessing Web Archive

• Textbox to enter the requested URI.

• BubbleMap to show you the available mementos.

Page 115: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

115

Collection-Based

• Web Archiving Trends Accessing Web Archive

• In addition to browsing the collection, you can browse the URIs in this collection.

Page 116: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

116

Full-text search

• Web Archiving Trends Accessing Web Archive

• BL interface provides different filtering techniques for the results.

Page 117: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

117

Past Web Browser

• Web Archiving Trends Accessing Web Archive

• You can replay the pages with different controls to forward, backward, pause and stop.

Page 118: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

118

Zoetrope

• Web Archiving Trends Accessing Web Archive

• Different Views
• Comparison between different Mementos
• Not feasible on the current web archiving infrastructure

Page 119: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

119

DiffIE

• Web Archiving Trends Accessing Web Archive

• A browser plug-in that caches the pages a person visits and highlights how those pages have changed when the person returns to them.
• It is feasible for personal archiving.

Page 120: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

120

Synchronicity

• Web Archiving Trends Accessing Web Archive

• A Mozilla Firefox add-on that supports internet users in (re-)discovering missing web pages in real time

Page 121: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

121

Warrick

• Web Archiving Trends Accessing Web Archive

• It’s a utility for reconstructing or recovering a website when a back-up is not available

Page 122: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

122

ArcSys Architecture Diagram

Web Archive Service Framework

Page 123: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

123

WAT files
• WAT files are metadata files for WARC files
• WAT files are used to create data analysis reports based on large datasets.

Metadata Service

Page 124: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

124

It’s More than WAT files

WAT                                                          ArcLink
Batch process on a set of WARCs                              Batch process on a set of URIs
For internal use                                             For public use
No way to integrate with other WAT files in other locations  It could be aggregated with other graphs
No incremental update                                        Supports incremental update
Access on WAT file level using Pig                           Access on URI level using Web service

Metadata Service ArcLink Motivation

Page 125: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

125

Cost of Scaling Up

Step         Cost model                                                   Internet Archive
Filtering                                                                 88 hrs, 108 * 10^9 mementos
Extraction   $\mathit{Time} = \frac{n}{10^6} \cdot \frac{5.5}{m}$ (hrs)   247 days
Storage      $\mathit{Size} = n \cdot 10\%$                               500 TB

Metadata Service ArcLink Cost model

*Numbers based on Wayback Machine published statistics on Oct 2013 of 360B mementos with total size 5PB
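The slide's figures can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions. It reads the extraction model as Time = n/10^6 × 5.5/m hours, with m taken to be the number of machines working in parallel, and the storage model as 10% of the corpus size. The value m = 100 does not appear on the slide and was chosen here because it reproduces the 247 days shown. All function and parameter names are hypothetical.

```python
def extraction_days(n_mementos, machines, hours_per_million=5.5):
    """Extraction time in days, assuming 5.5 hours per million mementos per machine."""
    hours = n_mementos / 10**6 * hours_per_million / machines
    return hours / 24


def storage_tb(corpus_tb, percent=10):
    """Storage estimate as a percentage of the corpus size, in TB."""
    return corpus_tb * percent / 100


# 108 * 10^9 mementos remain after filtering; m = 100 is an assumption.
print(extraction_days(108 * 10**9, machines=100))  # 247.5
# 5 PB corpus, taken as 5000 TB.
print(storage_tb(5000))  # 500.0
```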

Page 126: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

126

Time-Indexed Inlinks Information

Metadata Service ArcLink Applications

Date Anchor Text

04-Nov-09 vancouver2010.com

11-Nov-09 vancouver2010.com

18-Nov-09 vancouver2010.com

16-Jan-10 Vancouver 2010 Olympic Games

16-Jan-10 Vancouver 2010 Olympic Games

23-Jan-10 vancouver2010.com

23-Jan-10 2010 Vancouver Olympic Games Medals Results Schedule Sports

30-Jan-10 2010 Vancouver Olympic Games Medals Results Schedule Sports

30-Jan-10 vancouver2010.com

30-Jan-10 Vancouver 2010 Olympic Games

13-Feb-10 Vancouver 2010 Olympic Winter Games

15-Feb-10 Vancouver 2010 Olympic Games

18-Feb-10 Official Vancouver Games site

19-Feb-10 vancouver2010.com

20-Feb-10 Official Vancouver Games site

21-Feb-10 VANOC 2010

Page 127: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

127

HTTP Redirection Relationship between URI-R & URI-M

URI Service Experiment and Results

                                Live Web URI-R
                                OK        Redirection
Web Archive URI-M  OK           Case 1    Case 5
                   Redirection  Case 2    Case 3, 4

Figure: share of URIs per case (Case 1 to Case 5); values shown: 80.8%, 2.74%, 1.34%, 1.33%, 13.7%

Page 128: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

128

TimeMap Redirection Categories
• Category 1

URI Service

All Mementos have 200 HTTP status code

Stability =1

Page 129: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

129

TimeMap Redirection Categories
• Category 2

URI Service

All Mementos have redirection to the same URI.

Stability =1

Page 130: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

130

TimeMap Redirection Categories
• Category 3

URI Service

All Mementos have redirection to different URIs.

Stability ≈ 0

Page 131: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

131

TimeMap Redirection Categories
• Category 4

URI Service

Mementos have different HTTP status code.

Stability

Page 132: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

132

HTTP Redirection Relationship between URI-R & URI-M

URI Service

                                Live Web URI-R
                                OK        Redirection
Web Archive URI-M  OK           Case 1    Case 5
                   Redirection  Case 2    Case 3, 4

Figure: Case 1 to Case 5

Page 133: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

133

URI Reliability

URI Service

[Figure: a TimeMap for R with mementos M1, M2, M3, each returning 3xx; each redirect target is labeled rel=original R`; Stability = 1; the final status of each target is marked "?" (200, 404, or 3xx)]

Page 134: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

134

Summary
• Quantitative study with 10,000 URIs.
• 48% were not fully stable through time.
• 27% were not perfectly reliable through time.
• New archival retrieval policy:
  • Policy one: successfully retrieved mementos for 17 out of 77.
  • Policy two: expanded the TimeMap for 58% of cases.

URI Service Retrieval Policies

Page 135: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

135

URI Reliability
• 23% of the mementos did not lead to a successful memento at the end.

URI Service Experiment and Results

Figure panels: Reliability in semi-log scale; Reliability for |TM(R)| < 300

Page 136: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

136

Experiment

Archive Service Percentage Experiment

• For each sample set, we used Memento Aggregator to get all the possible archived copies (Mementos).

• For each URI, Memento Aggregator responded with TimeMap for this URI.

Example
<http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>; rel="first memento"; datetime="Sun, 19 Aug 2001 19:42:33 GMT",
<http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>; rel="memento"; datetime="Sun, 16 Dec 2001 22:02:48 GMT",
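A minimal sketch of how a link-format TimeMap like the example might be turned into a list of mementos. The regular expressions handle only the simple entry shape shown on the slide (a URI in angle brackets followed by quoted `rel` and `datetime` attributes) and are not a complete link-format parser. The function name is hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

# One link entry: <uri> followed by any number of ;name="value" attributes.
LINK_RE = re.compile(r'<([^>]*)>\s*((?:;\s*\w+="[^"]*"\s*)*)')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_timemap(text):
    """Return (memento URI, datetime) pairs for entries whose rel includes 'memento'."""
    entries = []
    for uri, attrs in LINK_RE.findall(text):
        params = dict(ATTR_RE.findall(attrs))
        rels = params.get("rel", "").split()
        if "memento" in rels and "datetime" in params:
            entries.append((uri, parsedate_to_datetime(params["datetime"])))
    return entries


SAMPLE = (
    '<http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>'
    ';rel="first memento";datetime="Sun, 19 Aug 2001 19:42:33 GMT", '
    '<http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>'
    '; rel="memento"; datetime="Sun, 16 Dec 2001 22:02:48 GMT",'
)

mementos = parse_timemap(SAMPLE)
```

Counting the entries returned per archive host would give the per-archive memento counts that a coverage study of this kind relies on.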

Page 137: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

137

1000 URIs Ordered by First Observation Date

Archive Service Percentage Results

See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html

Page 138: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

138

2010

Archive Service Percentage Results

2013

Page 139: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

139

Archive Service Percentage Results

2010 2013

Page 140: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

140

Archive Service Percentage Results

2010 2013

Page 141: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

141

Archive Service Percentage Results

2010 2013

Page 142: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

142

URIs Samples Sources – Live Web
1. DMOZ – Random sample
  • 10,000 URIs randomly sampled from the DMOZ directory (~5M URIs).
2. DMOZ – TLD: 200 URIs for each TLD
  • 80 TLDs.
3. DMOZ – Languages: 100 URIs for each language
  • 40 languages.

Archive Service Distribution

Page 143: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

143

URIs Samples Sources – Web Archive
• Query the full-text search interface of the web archives with two sets of query terms.
4. Top 1-Gram from Bing
  • Most of them are English
5. Top 1000 query terms by Yahoo in 9 languages
  • We excluded general keywords such as Obama and Facebook.

Archive Service Distribution

Page 144: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

144

URIs Samples Sources – User requests
• Sampling from user requests for web archived materials
6. Sample from IA Wayback Machine log files
  • 10,000 URIs randomly sampled from Feb 22, 2012 to Feb 26, 2012.
7. Sample from Memento aggregator log files
  • 1,000 URIs randomly sampled from the LANL Memento Aggregator between 2011 and 2013.

Archive Service Distribution

Page 145: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

145

General Coverage

Archive Service Distribution

Legend: IA = Internet Archive; AIT = Archive-It; LoC = Library of Congress; UK = UK National Library; BL = British Library; SG = Singapore; CAT = Web Archive of Catalonia; CR = Croatian Web Archive; CZ = Archive of the Czech Web; TW = National Taiwan University; IC = Icelandic Web Archive; CAN = Library and Archives Canada; SI = Slovenia; WEB = WebCite; AIS = Archive.is

Page 146: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

146

Web Archive Selection Evaluation

Archive Service Distribution

Page 147: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

147

Web Archive Selection Evaluation

Archive Service Distribution

Page 148: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

148

Future Works

Page 149: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

149

iTunes cover application

Metadata Service ArcThumb Motivation

