1
WEB ARCHIVE SERVICES FRAMEWORK
FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB
Ahmed AlSum
PhD Defense
February 2014
Committee Members:
• Michael L. Nelson
• Michele C. Weigle
• Hussein M. Abdel-Wahab
• M’Hammad Abdous
• Herbert Van de Sompel
Old Dominion University Computer Science Department
2
Domain
Contribution
Goal
WEB ARCHIVE SERVICES FRAMEWORK
FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB
Ahmed AlSum
PhD Defense
February 2014
Committee Members:
• Michael L. Nelson
• Michele C. Weigle
• Hussein M. Abdel-Wahab
• M’Hammad Abdous
• Herbert Van de Sompel
Old Dominion University Computer Science Department
3
Outline
• Introduction
• Web Archiving Services Framework
  • Content Service
  • Metadata Service
  • URI Service
  • Archive Service
• Conclusions
4
INTRODUCTION
Motivation and Research Questions
5
What is a Web Archive?
Introduction Motivation
http://www.cs.odu.edu
6
Who is using web archives, and how?
• Politicians
• Journalists
• Web designers
• Historians
• Researchers
• Social scientists
• Curious users
Introduction Motivation
*IIPC Access Working Group 2006, Costa 2010, Dougherty 2010, Stirling 2011, Smith 2009
7
Web archive interfaces are limited
Introduction Motivation
8
Web Archiving Use Cases
• Ponguru asked on the Internet Archive forum on May 17, 2010*:
"Hi All - I am new to Archive.org. A few quick questions
(1) Is there any API or tools available to access the Archive.org contents programmatically?
(2) Are there any research papers where Archive.org was used for data collection / analysis (e.g. studying a particular topic over time, etc.)? I digged a little bit, could not find much, so checking with the group."
Introduction Motivation
*http://archive.org/post/306799/api-or-tools-to-access-research-publications-on-archiveorg
9
Lack of APIs
• Popular websites provide APIs to third-party developers.
Introduction Motivation
10
Limited and non-standard APIs
• Current web archives have a limited set of APIs that do not cover the users’ needs.
Introduction Motivation
11
Wayback Machine API
Introduction Motivation
• It returns the list of available Mementos in JSON.
12
Croatian Web Archive
Introduction Motivation
Full-text search web interface Full-text search APIs in JSON
13
Memento
Introduction Motivation
• Memento provides the TimeMap in the CoRE Link Format (application/link-format).
14
Memento Terminology
Introduction Motivation
URI-R, R: Original Resource, e.g. http://www.amazon.com
URI-M, M: Memento, e.g. http://web.archive.org/web/20110411070244/http://amazon.com
URI-T, TM: TimeMap
Van de Sompel, H., Nelson, M. L., & Sanderson, R. (2013). RFC 7089 - HTTP framework for time-based access to resource states -- Memento. Internet Engineering Task Force (IETF). Retrieved from http://tools.ietf.org/html/rfc7089
15
Memento Aggregator• Merges TimeMaps from various archives.
Introduction Motivation
16
Web Archiving as Big Data
• The Internet Archive corpus has reached 5 petabytes.
• Bibliotheca Alexandrina needs one year to recompute the checksums for its corpus.
• Tools: Apache Pig
Introduction Motivation
17
Research Question
How can we enrich the web archive access interface in conjunction with the live web?
Introduction Research Questions
18
Research Questions
• What are the required services for the web archiving user community?
• Shall we work on the web archive collection as one entity or on different levels?
• How can we use the web archive content beyond full-text search?
• What are the metadata fields that could enhance user browsing?
• How can we develop an access interface to the temporal web graph?
• How can we optimize the creation of thumbnails?
• How can we use HTTP redirection to enhance the URI-lookup query?
• How can we optimize the query routing mechanism across the web archives?
Introduction Research Questions
19
WEB ARCHIVE SERVICE FRAMEWORK
Levels and Datasets
20
Web Archive Service Framework
21
• Archive level
  • Web archive profiling to optimize the query routing.
• URI level
  • HTTP redirection in the web archive URI-lookup.
• Metadata level
  • ArcLink
  • ArcThumb
• Content level
  • ArcContent
Web Archive Service Framework
ArcSys
22
IIPC 2010 Winter Olympics
Web Archive Service Framework Datasets
* http://olympics.us.archive.org/olympics2010/
Size 700+GB
From Nov 2009
To Mar 2010
#URI-R 6.4M
#URI-M 23.7M
23
Fortune 500
• 499,540 mementos from 488 TimeMaps.
• For each Memento, we downloaded the HTML and captured the thumbnail using PhantomJS.
Web Archive Service Framework Datasets
24
DMOZ
Web Archive Service Framework Datasets
• Open directory of URIs based on user submissions.
25
CONTENT SERVICE
ArcContent
Archive
URI
Metadata
Content
26
Wayback Machine URI Rewriting
Original Rewritten
Content Service
27
Response Types
Raw Response
Modified Response
Extracted Response
Content Service
28
ArcContent Architecture Diagram
Content Service
29
Extracted Response Filters
Content Service
TextContent
TFContent
30
Extracted Response Formats
Content Service
XML
JSON
31
ArcContent Applications
Content Service
TFContent
TagClouds
32
METADATA SERVICE
ArcLink & ArcThumb
Archive
URI
Metadata
Content
33
Metadata Access Service
Metadata Service
• Metadata is data about data.
• The metadata layer is data about mementos.
Type | Field | Description | Example
Technical | Content-type | Entity MIME type. | text/html
Technical | Content-length | Size of the entity-body. | 90883
Extracted | Title | Title of the page. | Egypt rejoices at Mubarak departure
Extracted | Description | Description of the content of the entity-body. | The BBC World Affairs Editor John Simpson reflects on how Egypt brought about the overthrow of President Hosni Mubarak.
Extracted | Outgoing Links | A list of all the outlinks that the page points to. |
Derived | Thumbnail | Thumbnail of the representation of the web page. |
Derived | Incoming Links | A list of all the inlinks that point to the page. |
34
ArcLink
Motivation, Stages, Cost Model, Applications
35
ArcLink: optimization techniques to build and retrieve the temporal web graph
A. AlSum and M. L. Nelson.
In Proceedings of the 13th annual international ACM/IEEE joint conference on Digital libraries
JCDL ‘13, Indianapolis, Indiana, 2013
See also: http://arxiv.org/abs/1305.5959
36
Easily Solved Questions
Q: What are the available mementos for www.vancouver2010.com?
Metadata Service ArcLink Motivation
37
Solved Questions, but hard
Q. What are the HTML titles for www.vancouver2010.com through time?
A. Page scraping for all mementos
Metadata Service ArcLink Motivation
38
Impossible Questions
Q. What anchor texts pointed to www.vancouver2010.com through time?
Metadata Service ArcLink Motivation
…<a href=www.vancouver2010.com >Vancouver Olympics</a>….
…<a href=www.vancouver2010.com >Winter Olympics</a>…
…<a href=www.vancouver2010.com >Vancouver 2010</a>…
39
Outlinks
Metadata Service ArcLink Motivation
40
ArcLink and Temporal Web Graph
What is ArcLink?
• ArcLink is a complete system to extract, preserve, and access the Temporal Web Graph.
What is the Temporal Web Graph?
• The link structure through time, including inlinks and outlinks.
Metadata Service ArcLink Motivation
[Figure: web graphs WG @t1 and WG @t2 combined into the temporal web graph (TWG)]
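A minimal sketch of the temporal web graph as a data structure, assuming each link is stored with the date it was observed. The class name, method names, and the example source URIs are hypothetical and are not ArcLink's actual API or storage schema.

```python
from collections import defaultdict
from datetime import date


class TemporalWebGraph:
    """Link structure through time: every edge carries its observation date."""

    def __init__(self):
        self._inlinks = defaultdict(list)   # target -> [(date, source, anchor)]
        self._outlinks = defaultdict(list)  # source -> [(date, target, anchor)]

    def add_link(self, source, target, anchor_text, observed):
        """Record that `source` linked to `target` on date `observed`."""
        self._outlinks[source].append((observed, target, anchor_text))
        self._inlinks[target].append((observed, source, anchor_text))

    def inlinks(self, target, start=date.min, end=date.max):
        """Inlinks to `target` observed in [start, end], oldest first."""
        edges = self._inlinks.get(target, [])
        return sorted(e for e in edges if start <= e[0] <= end)

    def anchor_texts(self, target, start=date.min, end=date.max):
        """Anchor texts that pointed to `target` through time."""
        return [(d, text) for d, _src, text in self.inlinks(target, start, end)]


graph = TemporalWebGraph()
# The source URIs below are placeholders for illustration only.
graph.add_link("http://example.org/a", "www.vancouver2010.com",
               "Vancouver 2010 Olympic Games", date(2010, 1, 16))
graph.add_link("http://example.org/b", "www.vancouver2010.com",
               "vancouver2010.com", date(2009, 11, 4))
for day, text in graph.anchor_texts("www.vancouver2010.com"):
    print(day.isoformat(), text)
```

Indexing the edges by target as well as by source is what makes the anchor-text question answerable without scraping every memento.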
41
System Stages
Metadata Service ArcLink Stages
42
Filtering
• Use CDX files to filter the URIs and select the mementos that will contribute to the web graph.
• For example:
  • Exclude non-200 HTTP status codes
  • Exclude images, style sheets, videos, etc.
  • Exclude duplicate mementos
• Technique: Pig Latin script on CDX files
• Results: CDX was reduced to 25% of the original size, from 23.8M mementos to 6.7M mementos.
Metadata Service ArcLink Stages
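The three filtering rules can be sketched in a few lines. The slides used a Pig Latin script; the Python below is a stand-in for illustration only. It assumes space-separated CDX lines whose first six columns are urlkey, timestamp, original URI, MIME type, status code, and digest; that column order and the sample records are assumptions, not taken from the slides.

```python
EXCLUDED_MIME_PREFIXES = ("image/", "video/", "audio/", "text/css")


def filter_cdx(lines):
    """Yield only the CDX lines that should contribute to the web graph."""
    seen = set()  # (urlkey, digest) pairs already emitted
    for line in lines:
        line = line.strip()
        if not line or line.startswith("CDX"):
            continue  # skip blank lines and the CDX header
        fields = line.split()
        if len(fields) < 6:
            continue  # malformed record
        urlkey, _timestamp, _original, mimetype, status, digest = fields[:6]
        if status != "200":
            continue  # exclude non-200 HTTP status codes
        if mimetype.startswith(EXCLUDED_MIME_PREFIXES):
            continue  # exclude images, style sheets, videos, etc.
        if (urlkey, digest) in seen:
            continue  # exclude duplicate mementos (same URI, same content)
        seen.add((urlkey, digest))
        yield line


# Hypothetical CDX records for illustration.
sample = [
    " CDX N b a m s k r M S V g",
    "org,example)/ 20091101000000 http://example.org/ text/html 200 AAAA - - 10 0 f.warc.gz",
    "org,example)/ 20091108000000 http://example.org/ text/html 200 AAAA - - 10 9 f.warc.gz",
    "org,example)/logo.png 20091101000000 http://example.org/logo.png image/png 200 BBBB - - 5 20 f.warc.gz",
    "org,example)/old 20091101000000 http://example.org/old text/html 404 CCCC - - 5 30 f.warc.gz",
]
kept = list(filter_cdx(sample))
print(len(kept))
```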
43
Extraction
• Technique: Hadoop
• Step 1: URI-ID generation
  • Canonicalize the URI into SURT format
  • Hash the canonicalized form using SimHash
  • Completely distributed
• Step 2: Define data sources
  • WARC
  • Web archive UI
Metadata Service ArcLink Stages
Example: www.example.org/foo, example.org/foo, and www1.example.org/foo all canonicalize to org,example)/foo, which hashes to ABCD11.
Tasks | Input Source | Map (sec) | Reduce (sec) | Total (sec)
2 | Wayback | 21,422 | 4,194 | 25,616
2 | WARC | 13,327 | 2,770 | 16,098 (62%)
5 | Wayback | 13,721 | 2,257 | 15,978
5 | WARC | 8,304 | 1,746 | 10,051 (62%)
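The canonicalization in Step 1 can be illustrated with a deliberately simplified SURT-style function. It handles only the cases in the example above (dropping the scheme and a leading www or wwwN label, reversing the host labels); real SURT canonicalizers apply more rules, and the function name `simple_surt` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def simple_surt(uri):
    """Return a simplified SURT-style key such as 'org,example)/foo'."""
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit needs a scheme to find the host
    parts = urlsplit(uri)
    host = parts.hostname or ""
    host = re.sub(r"^www\d*\.", "", host)           # www., www1., ... are dropped
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return f"{reversed_host}){path}{query}"


for candidate in ("www.example.org/foo", "example.org/foo", "www1.example.org/foo"):
    print(candidate, "->", simple_surt(candidate))
```

Because all three variants map to the same key, hashing the key gives them one shared URI-ID.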
44
Storage
• ArcLink uses a database to store the web graph.
Metadata Service ArcLink Stages
Insertion Performance Update Performance
45
ArcLink Response
Metadata Service ArcLink Stages
46
ArcLink Response
Metadata Service ArcLink Stages
47
ArcLink Response
Metadata Service ArcLink Stages
48
Impossible Questions
Q. What anchor texts pointed to www.vancouver2010.com through time?
Metadata Service ArcLink Applications
49
Temporal Page Rank
Rank | Nov-2009 | Dec-2009 | Jan-2010
1 | vancouver2010.com/code | - | topsport.com/sportch/liveticker/
2 | vancouver2010.com/en/langpolicy | - | vancouver2010.com/code
3 | vancouver2010.com/forgotpassword | - | canadacode.vancouver2010.com/user/register
4 | vancouver2010.com/store | - | canadacode.vancouver2010.com
5 | vancouver2010.com/store/index.html | - | canadacode.vancouver2010.com/explore
6 | vancouver2010.com/ | - | canadacode.vancouver2010.com/user/login?destination=node/add/image
7 | canadacode.vancouver2010.com | - | canadacode.vancouver2010.com/pulse
8 | canadacode.vancouver2010.com/nfb-onf | - | canadacode.vancouver2010.com/challenge
9 | canadacode.vancouver2010.com/contact | - | i-credible.nl
10 | canadacode.vancouver2010.com/resources | - | vpzschaatsteam.nl
Rank | Feb-2010 | Mar-2010 | Collection (Nov-09 to Mar-10)
1 | monlibe.liberation.fr | monlibe.liberation.fr | monlibe.liberation.fr
2 | topsport.com/sportch/liveticker/ | laprovence.com/la-provence-le-faq-de-la-moderation | vancouver2010.com/code
3 | lefigaro.fr | get.adobe.com/flashplayer | lefigaro.fr
4 | laprovence.com/la-provence-le-faq-de-la-moderation | vancouver2010.teamgb.com/teamgb/team-behind-team-gb/filenotfound.aspx | laprovence.com/la-provence-le-faq-de-la-moderation
5 | lefigaro.fr/sport | ledauphine.com | lefigaro.fr/sport
6 | get.adobe.com/flashplayer | lefigaro.fr/economie | get.adobe.com/flashplayer
7 | lefigaro.fr/meteo | lefigaro.fr/sport | lefigaro.fr/meteo
8 | lefigaro.fr/le-talk | lefigaro.fr/actualites-a-la-une | lefigaro.fr/le-talk
9 | dosb.de/de/vancouver-2010/vancouver-ticker/detail/printer.html | lemonde.fr/cgv | topsport.com/sportch/liveticker/
10 | ledauphine.com | ffs.fr/index.php | vancouver2010.com/en/langpolicy
Metadata Service ArcLink Applications
50
ArcThumb
Motivation, Feature Exploration, Selection Algorithm
51
Thumbnail Summarization Techniques For Web Archives
A. AlSum and M. L. Nelson.
In Proceedings of the 36th European Conference on Information Retrieval.
ECIR 2014, Amsterdam, Netherlands, 2014
52
Thumbnails in Web Archive
Metadata Service ArcThumb Motivation
Internet Archive UK Web Archive
53
Thumbnail Creation Challenges
• Scalability in time
  • IA may need 361 years to create a thumbnail for each memento using one hundred machines.
• Scalability in space
  • IA will need 355 TB to store one thumbnail per memento.
• Page quality
Metadata Service ArcThumb Motivation
54
Thumbnails Usage Challenges
Metadata Service ArcThumb Motivation
• This is a partial view of 700 thumbnails out of 10,500 available mementos for www.apple.com
55
From 10,500 Mementos to 69 Thumbnails.
Metadata Service ArcThumb Motivation
56
How many thumbnails do we need?
Metadata Service ArcThumb Methodology
www.unfi.com on the live Web
57
How many thumbnails do we need?
Metadata Service ArcThumb Methodology
www.unfi.com on the live Web
58
40 Thumbnails are good.
Metadata Service ArcThumb Methodology
59
Visual Similarity and Text Similarity
Metadata Service ArcThumb Methodology
[Figure: pairs of mementos labeled Similar or Different, compared visually and by HTML text]
60
Correlation between Visual Similarity and Text Similarity
Metadata Service ArcThumb Feature Exploration
SimHash DOM tree
Embedded resources Memento Datetime
SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
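A rough sketch of the SimHash feature and the Hamming distance used to compare two fingerprints. The tokenization (lowercased word tokens) and the use of MD5 as the per-token hash are assumptions made for illustration; the slides do not specify either.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Charikar-style SimHash: similar texts yield fingerprints with few differing bits."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    return sum(1 << i for i, weight in enumerate(weights) if weight > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page_v1 = "Vancouver 2010 Olympic Winter Games official site schedule results"
page_v2 = "Vancouver 2010 Olympic Winter Games official site schedule medals"
print(hamming_distance(simhash(page_v1), simhash(page_v2)))
```

A small distance between the fingerprints of two mementos suggests that their HTML text, and likely their thumbnails, changed little.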
61
Threshold Grouping
Metadata Service ArcThumb Selection Algorithms
62
Threshold Grouping
Metadata Service ArcThumb Selection Algorithms
63
Clustering technique
Metadata Service ArcThumb Selection Algorithms
SimHash Feature SimHash and Datetime Features
Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
64
Time Normalization
Metadata Service ArcThumb Selection Algorithms
65
Selection Algorithms Comparison
Criterion | Threshold Grouping | K clustering | Time Normalization
TimeMap Reduction | 27% | 9% to 12% | 23%
Image Loss | 28 | 78 - 101 | 109
# Features | 1 feature | 1 or more | 1 feature
Preprocessing required | Yes | Yes | No
Efficient processing | Medium | Extensive | Light
Incremental | Yes | No | Yes
Online/offline | Both | Both | Both
Metadata Service ArcThumb Selection Algorithms
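One plausible reading of threshold grouping, sketched under stated assumptions: walk the TimeMap in datetime order and start a new group (that is, select a memento for a thumbnail) whenever its fingerprint differs from the last selected one by more than a threshold. The slides do not give the exact rule or threshold, so the function below is illustrative only.

```python
def hamming_distance(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def threshold_grouping(fingerprints, threshold):
    """Return the indexes of the mementos selected for thumbnails.

    `fingerprints` holds one SimHash value per memento, in datetime order.
    """
    selected = []
    last = None
    for index, fingerprint in enumerate(fingerprints):
        if last is None or hamming_distance(fingerprint, last) > threshold:
            selected.append(index)  # this memento opens a new group
            last = fingerprint
    return selected


# Toy 4-bit fingerprints; real SimHash fingerprints would be much longer.
print(threshold_grouping([0b0000, 0b0001, 0b1111, 0b1110, 0b0000], threshold=2))
```

This reading is consistent with the comparison table: it uses one feature and works incrementally, since a new memento only needs to be compared with the last selected one.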
66
URI SERVICE
Archive
URI
Metadata
Content
67
ARCHIVAL HTTP REDIRECTION RETRIEVAL POLICIES
A. AlSum, M. L. Nelson, R. Sanderson, and H. Van de Sompel
In Proceedings of 3rd Temporal Web Analytics Workshop.
TempWeb 2013, Rio de Janeiro, Brazil
68
Live Web Redirect
http://bit.ly/r9kIfC redirects to http://www.cs.odu.edu
URI Service
% curl -I http://bit.ly/r9kIfC
HTTP/1.1 301 Moved
….
Location: http://www.cs.odu.edu/
…
69
Live Web Redirect
URI Service
R http://bit.ly/r9kIfC redirects to R http://www.cs.odu.edu
70
Live web: R1 www.draculathemusical.co.uk redirects to R2 www.mosaicstudio.co.uk
Web archive: R1 has Memento http://web.archive.org/web/20020212194020/http://www.draculathemusical.co.uk/, which redirects to the Memento of R3, http://web.archive.org/web/20020212194020/http://www.geocities.com/draculathemusical
Archived Web Redirect
URI Service
71
Experiment
• Dataset: 10,000 sample URIs from
• The dataset includes neither bit.ly nor DOI URIs.
• The experiment focused on the root page (no embedded resources).
URI Service Experiment and Results
Live HTTP status code (10,000 URI-R)
HTTP Status/Code | Percentage
OK (200) | 82.83%
Redirection (3xx) | 14.71%
  Redirection (301) | 8.4%
  Redirection (302) | 6.1%
  Redirection (others) | 0.2%
Not-Found (4xx) | 1.18%
Others | 1.28%
Memento HTTP status code (894,717 URI-M)
HTTP Status/Code | Percentage
OK (200) | 93.46%
Redirection (3xx) | 5.69%
Not-Found (4xx) | 0.26%
Others | 0.59%
72
URI Stability
• A URI’s stability measures how much its HTTP responses change across time (200, 3xx, or 4xx) and how many different URIs appear in the “Location” header for 3xx status codes.
High Stability = 1, No Stability = 0
URI Service
73
Abstract Model• TimeMap for R
URI Service
M1 M2 M3TimeMapR
74
Timemap Redirection Categories
URI Service
• All Mementos have 200 HTTP status code: Stability = 1
• All Mementos have redirection to the same URI: Stability = 1
• All Mementos have redirection to different URIs: Stability ≈ 0
• Mementos have different HTTP status codes: Stability
75
URI Stability
URI Service Experiment and Results
TimeMap Category | Percentage | Stability
All Mementos have OK | 52% | 1
Mementos have mixed status codes | 36% | 0.91
All Mementos have Redirection | 0.92% | 0.85
  Redirection to the same URI | 0.62% |
  Redirection to different URIs | 0.30% |
URI has no Mementos at all | 10.97% | 0
[Figures: Stability in semi-log scale; Stability for |TM(R)| < 300]
76
Current Wayback Machine Policy
• Live Redirect: Wayback Machine ignores the live redirects. Use instead of
• Archived Redirect: Wayback Machine follows the redirection.
URI Service Retrieval Policies
77
Policy one: URI-R with HTTP redirection
• Scope: Selection between on the live web.
• Example: http://bit.ly/r9kIfC redirects to http://www.cs.odu.edu
• Algorithm:
URI Service Retrieval Policies
1. Retrieve the memento M for R.
2. Status(M) = 200? Yes: Stop. No: continue.
3. Status(M) = 3xx? Yes: Go to Policy 2. No: continue.
4. Status(M) = 4xx && R has? Yes: Use instead of R. No: Stop.
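The Policy one flowchart can be sketched as a small decision function. Two things are assumptions here: the elided condition in the last step is read as "R has a live redirect", and a missing memento is treated the same as a 4xx status. The function and parameter names are hypothetical; the two lookups are passed in so the sketch does not depend on any particular archive API.

```python
def policy_one(uri_r, memento_status, live_redirect_target):
    """Decide how to retrieve a memento for `uri_r`.

    memento_status(uri) returns the HTTP status of the memento for `uri`,
    or None if the archive holds no memento (assumption).
    live_redirect_target(uri) returns the URI that `uri` redirects to on
    the live web, or None if it does not redirect.
    """
    status = memento_status(uri_r)
    if status == 200:
        return ("stop", uri_r)              # the memento is usable as is
    if status is not None and 300 <= status < 400:
        return ("policy_two", uri_r)        # archived redirect: Policy two applies
    target = live_redirect_target(uri_r)
    if target is not None:
        return ("use_instead", target)      # look up the live redirect target
    return ("stop", uri_r)                  # nothing more can be done


# Hypothetical lookup tables standing in for the archive and the live web.
statuses = {"http://bit.ly/r9kIfC": None}
redirects = {"http://bit.ly/r9kIfC": "http://www.cs.odu.edu"}
print(policy_one("http://bit.ly/r9kIfC", statuses.get, redirects.get))
```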
78
Policy one: URI-R with HTTP redirection
• Evaluation:
  • Policy scope: 1,471 URIs (that have live redirection)
  • 77 out of 1,471 have no mementos at all
  • For 17 out of those 77, mementos were retrieved based on the live redirection
URI Service Retrieval Policies
79
Policy two: URI-M with HTTP redirection
• Scope: Selection between in web archive.
• Example: http://api.wayback.archive.org/memento/20101109032705/http://bit.ly/2EEjBl redirects to http://api.wayback.archive.org/memento/20101109032705/http://www.cnn.com/
• Algorithm:
URI Service Retrieval Policies
M → M
1. Extract original from
2. Repeat content negotiation in datetime for original()
Example: http://api.wayback.archive.org/memento/20101109032705/http://bit.ly/2EEjBl → http://api.wayback.archive.org/memento/20101109032705/http://www.cnn.com → http://www.cnn.com/
Accept-Datetime: Sun, 13 May 2006 http://www.cnn.com/
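The two steps of Policy two can be sketched as follows: split the redirect target (a URI-M) into its 14-digit timestamp and its original URI, then build the Accept-Datetime value for a new datetime negotiation on that original URI. The URI-M layout `<prefix>/<YYYYMMDDhhmmss>/<original>` is taken from the examples above; the function name and the choice to reuse the memento's own timestamp as the requested datetime are assumptions.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

URI_M_PATTERN = re.compile(
    r"^(?P<prefix>https?://.+?/)(?P<timestamp>\d{14})/(?P<original>https?://.+)$"
)


def renegotiation_request(uri_m):
    """Return (original URI, Accept-Datetime value) for a redirect target URI-M."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError(f"not a recognizable URI-M: {uri_m}")
    when = datetime.strptime(match.group("timestamp"), "%Y%m%d%H%M%S")
    when = when.replace(tzinfo=timezone.utc)
    # usegmt=True renders the HTTP-date form ending in 'GMT'.
    return match.group("original"), format_datetime(when, usegmt=True)


original, accept_datetime = renegotiation_request(
    "http://api.wayback.archive.org/memento/20101109032705/http://www.cnn.com/"
)
print(original)
print("Accept-Datetime:", accept_datetime)
```

The returned pair would then be sent to a TimeGate for the original URI, and any mementos found that way can be added to the TimeMap of the URI that was first requested.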
80
Policy two: URI-M with HTTP redirection
• Evaluation:
  • Policy scope: 2,980 TimeMaps (that showed an HTTP redirection status code in at least one memento)
  • Success criterion: Using policy two contributed to the original TimeMap
  • Success percentage: 58% of the cases
URI Service Retrieval Policies
81
ARCHIVE SERVICE
Percentage and Distribution
Archive
URI
Metadata
Content
82
How Much Of The Web Is Archived?
S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson
In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
JCDL '11, Ottawa, Canada 2011
See also: http://arxiv.org/abs/1212.6177
83
Experiment
• 4 sample sets, 1,000 URIs each
• For each URI, we used the Memento Aggregator to record the TimeMap for this URI.
Archive Service Percentage Experiment
84
Archives Under Experiment
2010 | 2010 and 2013 | 2013
Archive Service Percentage Experiment
UK
85
How Much of the Web is Archived?
• It depends on which Web…
Archive Service Percentage Results
2010 2013Including SE cache
Excluding SE Cache General
90% 79% 90%
97% 68% 95%
88% 19% 52%
35% 16% 33%
86
Profiling Web Archive Coverage For Top-level Domain And Content Language
A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel
In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries
TPDL 2013, Valletta, Malta, 2013
An extended version was invited to a special issue of IJDL.
See also: http://arxiv.org/abs/1309.4008
87
Memento Aggregator
Archive Service Distribution
88
Where can you find?
Archive Service Distribution
http://www.google.com/
89
Where can you find?
Archive Service Distribution
http://www.google.com/
90
Where can you find?
Archive Service Distribution
http://www.japantimes.co.jp/
91
Where can you find?
Archive Service Distribution
http://www.japantimes.co.jp/
92
Research Question
Problem
• We need to profile the web archives around the world with these characteristics:
  • Age
  • Top-level domains
  • Languages
  • Growth rate
Goal
• To optimize the query routing for the Memento Aggregator.
• To determine the missing parts of the web.
Archive Service Distribution
93
URIs Samples Sources
Archive Service Distribution
Web
1. DMOZ – Random sample
2. DMOZ – TLD: 200 URIs for each TLD from DMOZ (80 TLDs)
3. DMOZ – Languages: 100 URIs for each language (40 languages)
Web Archives
4. Top 1-grams from Bing
5. Top 1,000 query terms by Yahoo in 9 languages
User requests
6. IA Wayback Machine log files
7. Memento aggregator log files
* We used hostnames only
94
TLD Coverage
Archive Service Distribution
IA Internet Archive AIT Archive It LoC Library of Congress UK UK National Library BL British Library
SG Singapore CAT Web Archive of Catalonia CR Croatian Web Archive CZ Archive of the Czech Web TW National Taiwan University
IC Icelandic Web Archive CAN Library and Archives Canada SI Slovenia WEB Web Cite AIS Archive.is
95
Language Coverage
Archive Service Distribution
IA Internet Archive CAN Library and Archives Canada PO Portuguese Web Archive CZ Archive of the Czech Web
LoC Library of Congress BL British Library CAT Web Archive of Catalonia TW National Taiwan University
IC Icelandic Web Archive UK UK National Library CR Croatian Web Archive AIT Archive It
IA Internet Archive AIT Archive It LoC Library of Congress UK UK National Library BL British Library
SG Singapore CAT Web Archive of Catalonia CR Croatian Web Archive CZ Archive of the Czech Web TW National Taiwan University
IC Icelandic Web Archive CAN Library and Archives Canada SI Slovenia WEB Web Cite AIS Archive.is
96
Growth Rate
Archive Service Distribution
IA Internet Archive CAN Library and Archives Canada PO Portuguese Web Archive CZ Archive of the Czech Web
LoC Library of Congress BL British Library CAT Web Archive of Catalonia TW National Taiwan University
IC Icelandic Web Archive UK UK National Library CR Croatian Web Archive AIT Archive It
Stopped archiving in 2008
Steady growth
Stopped getting new URIs, but still crawling
IA Internet Archive AIT Archive It LoC Library of Congress UK UK National Library BL British Library
SG Singapore CAT Web Archive of Catalonia CR Croatian Web Archive CZ Archive of the Czech Web TW National Taiwan University
IC Icelandic Web Archive CAN Library and Archives Canada SI Slovenia WEB Web Cite AIS Archive.is
97
Building Web Archive Profile
Archive Service Distribution
{"Profile":{
"Name" : "Taiwan Web Archive",
"URI" : "http://webarchive.lib.ntu.edu.tw",
"TimeGate" : "http://mementoproxy.cs.odu.edu/tw/timegate/",
"Code" : "TW",
"Age" : "Tue, 15 Jul 1997 00:00:00 GMT",
"TLD" : [ {"tw":0.6},{"cn":0.08},{"hk":0.04}, {"eg":0.04},{"gov":0.04}, {"my":0.04},{"jp":0.04},{"kr":0.02}],
"Language" : [{"zh-TW":0.5},{"zh-CN":0.25},{"id":0.08},{"ar":0.08}],
"GrowthRate" : [
{"199707":[4,4]},{"200202":[1,1]},
{"200607":[30,62]},{"200608":[20,80]},
{"200609":[5,9]},{"200612":[77,129]},
... // other values truncated
{"201308":[7,94]},{"201309":[2,94]}]
}
}
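A rough sketch of how such a profile could drive query routing: score each archive by the weight its profile gives to the TLD of the requested URI, then query only the top-ranked archives. The scoring rule and function names are assumptions for illustration; the second profile ("XX") is hypothetical, while the TW weights are copied from the profile above.

```python
from urllib.parse import urlsplit


def tld_weight(profile, uri):
    """Weight that the archive profile assigns to the TLD of `uri` (0.0 if absent)."""
    host = urlsplit(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    for entry in profile["TLD"]:
        if tld in entry:
            return entry[tld]
    return 0.0


def rank_archives(profiles, uri, top_n):
    """Return the codes of the `top_n` archives most likely to hold `uri`."""
    ranked = sorted(profiles, key=lambda p: tld_weight(p, uri), reverse=True)
    return [profile["Code"] for profile in ranked[:top_n]]


profiles = [
    {"Code": "TW", "TLD": [{"tw": 0.6}, {"cn": 0.08}, {"hk": 0.04}, {"jp": 0.04}]},
    {"Code": "XX", "TLD": [{"jp": 0.5}]},  # hypothetical archive profile
]
print(rank_archives(profiles, "http://www.japantimes.co.jp/", top_n=1))
```

A fuller version would presumably combine the TLD weight with the Language and Age fields, which this sketch ignores.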
98
Web Archive Selection Evaluation
Archive Service Distribution
$$\mathrm{Recall}_{TM}@n = \frac{|TM|_{\text{using } n \text{ archives}}}{|TM|_{\text{using } N \text{ archives}}}$$
Example TM(R):
• A1: M1, M2, M3
• A2: M4, M5
• A3: M6
• A4: M7
• A5: M8
• RecallTM@1 = 3/8 = 0.375
• RecallTM@2 = 5/8 = 0.625
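The RecallTM@n arithmetic for this example can be checked with a short function: the mementos returned by the selected archives divided by the mementos returned by all archives. The function name and the dictionary layout are illustrative choices.

```python
def recall_tm_at_n(mementos_per_archive, selected_archives):
    """|TM| using the selected archives divided by |TM| using all archives."""
    total = sum(mementos_per_archive.values())
    if total == 0:
        return 0.0  # an empty TimeMap has nothing to recall
    found = sum(mementos_per_archive.get(code, 0) for code in selected_archives)
    return found / total


# Memento counts from the example TM(R): A1 holds M1-M3, A2 holds M4-M5, and so on.
counts = {"A1": 3, "A2": 2, "A3": 1, "A4": 1, "A5": 1}
print(recall_tm_at_n(counts, ["A1"]))
print(recall_tm_at_n(counts, ["A1", "A2"]))
```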
99
Web Archive Selection Evaluation
Archive Service Distribution
Number of Archives | Including IA | Excluding IA
RecallTM@3 | 0.96 | 0.647
RecallTM@6 | 0.98 | 0.83
RecallTM@9 | 0.998 | 0.983
RecallTM@12 | 0.999 | 0.987
• Total number of archives N = 15
$$\mathrm{Recall}_{TM}@n = \frac{|TM|_{\text{using } n \text{ archives}}}{|TM|_{\text{using } N \text{ archives}}}$$
100
CONCLUSIONS
101
Conclusions
• We proposed a new service framework that divides the web archive corpus into four levels: Content, Metadata, URI, and Archive.
• We developed ArcContent, which supports the web archive interface with extracted versions of the mementos based on a set of predefined filters.
• We studied the challenges of building the temporal web graph and developed ArcLink, a distributed system to extract, preserve, and expose the temporal web graph.
• We studied optimization and summarization techniques to create thumbnails for the web graph collections based on SimHash fingerprints.
• We extended the concept of URI-lookup in the web archive to include the HTTP redirection status code.
• We defined the concept of a “Web Archive Profile” to characterize the web archive corpus, with an application to distributed search in the Memento Aggregator.
102
Publications
• S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. “How Much of the Web is Archived?” In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, 2011.
• A. AlSum, M. L. Nelson, R. Sanderson, and H. Van de Sompel. “Archival HTTP Redirection Retrieval Policies.” In Proceedings of 3rd Temporal Web Analytics Workshop, TempWeb ’13, 2013.
• A. AlSum and M. L. Nelson. “ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph.” In Proceedings of the 13th annual international ACM/IEEE joint conference on Digital libraries, JCDL '13, 2013.
• A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel. “Profiling Web Archive Coverage for Top-Level Domain and Content Language.” In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013.
• A. AlSum and M. L. Nelson. “Thumbnail Summarization Techniques for Web Archives.” In Proceedings of the 36th European Conference on Information Retrieval, ECIR ’14, 2014.
103
What’s next?
• Web Archiving Engineer at Stanford University.
104
WEB ARCHIVE SERVICES FRAMEWORK
FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB
Ahmed AlSum
PhD Defense
February 2014
Old Dominion University Computer Science Department
@aalsum
105
BACKUP
106
Memento
• Memento is an HTTP extension to integrate the past and the current Web.
I. Jacobs and N. Walsh. Architecture of the World Wide Web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/
Now
T1
T2
T3
107
Memento
• Developer and administrator for Memento aggregator and proxies
108
Memento Clients
• Memento is now an RFC (RFC 7089).
109
Lack of APIs
• Popular websites provide APIs to third-party developers.
Introduction Motivation
110
Lack of APIs
• US agencies have started to offer APIs for data access.
Introduction Motivation
111
Web Archiving Use Cases
• Temporal navigation.
• Full-text search.
• Use language filters.
• Provide raw WARC.
• Import of metadata records into other repositories.
Introduction Motivation
*IIPC Access Working Group. Use cases for Access to Internet Archives. International Internet Preservation Consortium Publications, http://www.netpreserve.org/resources/use-cases-access-internet-archives, 2006.
112
Related Projects
Data analysis for the web data
Tools and Methods to access the web archive
Enable the user to do experiments on the raw crawled data on Amazon S3
Enable the user to browse the present and the past web
• Introduction
113
Selection• Decide what to capture
Everything, any domain
National domains
Delegate selection to partners
Users’ favorites
• We studied what is already captured
114
URI-Based
Wayback Machine
Web Archiving Trends Accessing Web Archive
• Textbox to enter the requested URI.
• BubbleMap to show you the available mementos.
115
Collection-Based
Web Archiving Trends Accessing Web Archive
• In addition to browsing the collection, you can browse the URIs in this collection.
116
Full-text search
Web Archiving Trends Accessing Web Archive
• BL interface provides different filtering techniques for the results.
117
Past Web Browser
Web Archiving Trends Accessing Web Archive
• You can replay the pages with different controls to forward, backward, pause and stop.
118
Zoetrope
Web Archiving Trends Accessing Web Archive
• Different views
• Comparison between different Mementos
• Not feasible on the current web archiving infrastructure
119
DiffIE
Web Archiving Trends Accessing Web Archive
• A browser plug-in that caches the pages a person visits and highlights how those pages have changed when the person returns to them.
• It is feasible for personal archiving.
120
Synchronicity
Web Archiving Trends Accessing Web Archive
• A Mozilla Firefox add-on that supports Internet users in (re-)discovering missing web pages in real time.
121
Warrick
Web Archiving Trends Accessing Web Archive
• It’s a utility for reconstructing or recovering a website when a back-up is not available
122
ArcSys Architecture Diagram
Web Archive Service Framework
123
WAT files
• WAT files are metadata files for WARC files.
• WAT files are used to create data analysis reports based on large datasets.
Metadata Service
124
It’s More than WAT files
WAT | ArcLink
Batch process on a set of WARCs | Batch process on a set of URIs
For internal use | For public use
No way to integrate with other WAT files in other locations | It could be aggregated with other graphs
No incremental update | Supports incremental update
Access on WAT file level using Pig | Access on URI level using Web service
Metadata Service ArcLink Motivation
125
Cost of Scaling Up
Metadata Service ArcLink Cost model
Estimates for the Internet Archive*:
• Filtering: 88 hrs, 108 × 10^9 mementos
• Extraction: $\text{Time} = \frac{n}{10^{6}} \cdot \frac{5.5}{m}$ (hrs), giving 247 days
• Storage: $\text{Size} = n \cdot 10\%$, giving 500 TB
*Numbers based on Wayback Machine published statistics on Oct 2013 of 360B mementos with total size 5PB
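The extraction and storage estimates can be reproduced with a few lines of arithmetic. The reading of the garbled formulas as Time = (n / 10^6) × (5.5 / m) hours and Size = n × 10% is an assumption, as is taking m = 100 machines; with n = 108 × 10^9 mementos that reading gives the 247 days quoted on the slide.

```python
def extraction_hours(n_mementos, machines, hours_per_million=5.5):
    """Estimated extraction time in hours: (n / 10^6) * (5.5 / m)."""
    return n_mementos / 1e6 * hours_per_million / machines


def storage_size(corpus_size, ratio=0.10):
    """Estimated storage: 10% of the corpus size, in the same unit."""
    return corpus_size * ratio


# n = 108 * 10^9 mementos after filtering; m = 100 machines is an assumption.
hours = extraction_hours(108e9, machines=100)
print(round(hours / 24, 1), "days")
# 5 PB corpus expressed in TB.
print(storage_size(5000), "TB")
```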
126
Time-Indexed Inlinks Information
Metadata Service ArcLink Applications
Date Anchor Text
04-Nov-09 vancouver2010.com
11-Nov-09 vancouver2010.com
18-Nov-09 vancouver2010.com
16-Jan-10 Vancouver 2010 Olympic Games
16-Jan-10 Vancouver 2010 Olympic Games
23-Jan-10 vancouver2010.com
23-Jan-10 2010 Vancouver Olympic Games Medals Results Schedule Sports
30-Jan-10 2010 Vancouver Olympic Games Medals Results Schedule Sports
30-Jan-10 vancouver2010.com
30-Jan-10 Vancouver 2010 Olympic Games
13-Feb-10 Vancouver 2010 Olympic Winter Games
15-Feb-10 Vancouver 2010 Olympic Games
18-Feb-10 Official Vancouver Games site
19-Feb-10 vancouver2010.com
20-Feb-10 Official Vancouver Games site
21-Feb-10 VANOC 2010
127
HTTP Redirection Relationship between URI-R & URI-M
URI Service Experiment and Results
Web Archive URI-M \ Live Web URI-R | OK | Redirection
OK | Case 1 | Case 5
Redirection | Case 2 | Cases 3, 4
Case | Percentage
Case 1 | 80.8%
Case 2 | 2.74%
Case 3 | 1.34%
Case 4 | 1.33%
Case 5 | 13.7%
Timemap Redirection Categories• Category 1
URI Service
All Mementos have 200 HTTP status code
Stability =1
129
Timemap Redirection Categories• Category 2
URI Service
All Mementos have redirection to the same URI.
Stability =1
130
Timemap Redirection Categories• Category 3
URI Service
All Mementos have redirection to different URIs.
Stability ≈ 0
131
Timemap Redirection Categories• Category 4
URI Service
Mementos have different HTTP status code.
Stability
132
HTTP Redirection Relationship between URI-R & URI-M
URI Service
Live Web URI − R
OK Redirection
Web ArchiveURI-M
OK Case 1 5
Redirection 2 3,4Case 1
Case 2 Case 3 Case 4 Case 5
133
URI Reliability
URI Service
[Figure: TimeMap for R with Stability = 1 in which M1, M2, and M3 all return 3xx; each redirect target is labeled with rel=original and its outcome is unknown (200, 404, or 3xx)]
134
Summary
• Quantitative study with 10,000 URIs.
• 48% were not fully stable through time.
• 27% were not perfectly reliable through time.
• New archival retrieval policies:
  • Policy one: Successfully retrieved mementos for 17 out of 77.
  • Policy two: Expanded the TimeMap for 58% of cases.
URI Service Retrieval Policies
135
URI Reliability
• 23% of the mementos did not lead to a successful memento at the end.
URI Service Experiment and Results
Reliabilityin semi-log scale Reliabilityfor |TM(R)| < 300
136
Experiment
Archive Service Percentage Experiment
• For each sample set, we used Memento Aggregator to get all the possible archived copies (Mementos).
• For each URI, Memento Aggregator responded with TimeMap for this URI.
Example
<http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>;rel="first memento";datetime="Sun, 19 Aug 2001 19:42:33 GMT",
<http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>; rel="memento"; datetime="Sun, 16 Dec 2001 22:02:48 GMT",
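A minimal sketch of reading such a TimeMap, assuming entries of the form `<uri>; rel="..."; datetime="..."` as in the example above. The regular expressions cover only this simple shape, not the full link-format grammar, and the function name is hypothetical.

```python
import re

# One entry: <URI> followed by ;name="value" parameters.
ENTRY = re.compile(r'<(?P<uri>[^>]+)>\s*;(?P<params>(?:\s*[\w-]+="[^"]*"\s*;?)+)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a list of {'uri', 'rel', 'datetime'} dicts, one per TimeMap entry."""
    entries = []
    for match in ENTRY.finditer(text):
        params = dict(PARAM.findall(match.group("params")))
        entries.append({
            "uri": match.group("uri"),
            "rel": params.get("rel", "").split(),  # rel may hold several values
            "datetime": params.get("datetime"),
        })
    return entries


timemap = (
    '<http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>'
    ';rel="first memento";datetime="Sun, 19 Aug 2001 19:42:33 GMT", '
    '<http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>'
    '; rel="memento"; datetime="Sun, 16 Dec 2001 22:02:48 GMT",'
)
for entry in parse_timemap(timemap):
    print(entry["datetime"], entry["uri"])
```

Counting the entries whose rel contains "memento" gives |TM(R)|, the quantity used in the coverage results.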
137
1000 URIs Ordered by First Observation Date
Archive Service Percentage Results
See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
138
2010
Archive Service Percentage Results
2013
139
Archive Service Percentage Results
2010 2013
140
Archive Service Percentage Results
2010 2013
141
Archive Service Percentage Results
2010 2013
142
URIs Samples Sources – Live Web
1. DMOZ – Random sample
  • 10,000 URIs randomly sampled from the DMOZ directory (~5M URIs).
2. DMOZ – TLD: 200 URIs for each TLD
  • 80 TLDs.
3. DMOZ – Languages: 100 URIs for each language
  • 40 languages.
Archive Service Distribution
143
URIs Samples Sources – Web Archive
• Query the full-text search interface of the web archives with two sets of query terms.
4. Top 1-grams from Bing
  • Most of them are English.
5. Top 1,000 query terms by Yahoo in 9 languages
  • We excluded general keywords such as Obama and Facebook.
Archive Service Distribution
144
URIs Samples Sources – User requests
• Sampling from the users’ requests to the web archived materials
6. Sample from IA Wayback Machine log files
  • 10,000 URIs randomly sampled from Feb 22, 2012 to Feb 26, 2012.
7. Sample from Memento aggregator log files
  • 1,000 URIs randomly sampled from the LANL Memento Aggregator between 2011 and 2013.
Archive Service Distribution
145
General Coverage
Archive Service Distribution
IA Internet Archive AIT Archive It LoC Library of Congress UK UK National Library BL British Library
SG Singapore CAT Web Archive of Catalonia CR Croatian Web Archive CZ Archive of the Czech Web TW National Taiwan University
IC Icelandic Web Archive CAN Library and Archives Canada SI Slovenia WEB Web Cite AIS Archive.is
146
Web Archive Selection Evaluation
Archive Service Distribution
147
Web Archive Selection Evaluation
Archive Service Distribution
148
Future Works
149
iTunes cover application
Metadata Service ArcThumb Motivation