Website Reconstruction using the Web Infrastructure, Frank McCown, http://www.cs.odu.edu/~fmccown/, Doctoral Consortium, June 11, 2006

Website Reconstruction using the Web Infrastructure

Frank McCown
http://www.cs.odu.edu/~fmccown/

Doctoral Consortium

June 11, 2006


Web Infrastructure


HTTP 404


Cost of Preservation

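[Figure: preservation approaches plotted on two axes, publisher's cost (time, equipment, knowledge) and coverage of the Web, each marked low (L) to high (H), with client-view and server-view regions. Approaches shown: LOCKSS, browser cache, TTApache, iPROXY, Furl/Spurl, InfoMonitor, filesystem backups, web archives, SE caches, Hanzo:web.]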


Research Questions

How much digital preservation of websites is afforded by lazy preservation?
Can we reconstruct entire websites from the Web Infrastructure (WI)?
What factors contribute to the success of website reconstruction?
Can we predict how much of a lost website can be recovered?
How can the WI be utilized to provide preservation of server-side components?


Prior Work

Is website reconstruction from the WI feasible?
  Web repositories: Google (G), MSN (M), Yahoo (Y), Internet Archive (IA)
  Web-repository crawler: Warrick
  Reconstructed 24 websites
How long do search engines keep cached content after it is removed?


Timeline of SE Resource Acquisition and Release

Vulnerable resource – not yet cached (t_ca is not defined)

Replicated resource – available on web server and SE cache (t_ca < current time < t_r)

Endangered resource – removed from web server but still cached (t_ca < t_r < current time < t_cr)

Unrecoverable resource – missing from web server and cache (t_ca < t_cr < current time)
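
The four states above can be sketched as a small classifier. This is only an illustration: the slide text does not spell out the symbols, so the code assumes t_ca is the time the resource was cached, t_r the time it was removed from the web server, and t_cr the time it was removed from the cache. The function name and the use of None for "has not happened" are hypothetical.

```python
from typing import Optional


def resource_state(now: float,
                   t_ca: Optional[float] = None,
                   t_r: Optional[float] = None,
                   t_cr: Optional[float] = None) -> str:
    """Classify a resource at time `now`.

    Assumed meanings (not stated on the slide):
      t_ca: time the resource entered the search engine cache
      t_r:  time the resource was removed from the web server
      t_cr: time the resource was removed from the cache
    None means the event has not happened.
    """
    on_server = t_r is None or now < t_r
    cached = (t_ca is not None and t_ca <= now
              and (t_cr is None or now < t_cr))
    if on_server:
        return "replicated" if cached else "vulnerable"
    return "endangered" if cached else "unrecoverable"
```

For example, with t_ca=2, t_r=10, t_cr=20 the resource is replicated at time 5, endangered at time 12, and unrecoverable at time 25.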

Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.


How Much Did We Reconstruct?

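[Figure: two link trees side by side. "Lost" web site: A at the top, B and C below it, D, E, and F at the bottom. Reconstructed web site: A at the top, B' and C' below it, G and E at the bottom, with F marked as not found.]

Missing link to D; points to old resource G

F can't be found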


Reconstruction Diagram

[Figure: reconstruction diagram with four categories: identical 50%, changed 33%, missing 17%, added 20%.]
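
One way to arrive at the four categories in the diagram is to compare the lost site with its reconstruction URL by URL. The sketch below is a minimal illustration under assumptions of my own: each site is modeled as a dictionary from URL path to content bytes, "changed" means the bytes differ, and the function name is hypothetical. It is not meant to reproduce the percentages in the diagram.

```python
from typing import Dict, List


def compare_sites(lost: Dict[str, bytes],
                  recon: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Sort URLs into identical / changed / missing / added."""
    report: Dict[str, List[str]] = {
        "identical": [], "changed": [], "missing": [], "added": []
    }
    for url, body in lost.items():
        if url not in recon:
            report["missing"].append(url)      # not recovered at all
        elif recon[url] == body:
            report["identical"].append(url)    # recovered byte for byte
        else:
            report["changed"].append(url)      # recovered, but an older or altered copy
    # Resources in the reconstruction that the lost site no longer had.
    report["added"] = [url for url in recon if url not in lost]
    return report
```

A byte-for-byte comparison is the strictest possible test; a real evaluation might prefer a similarity measure so that trivial differences do not count as changes.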


Results

Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.


Warrick Milestones

www2006.org – first lost website reconstructed (Nov 2005)

DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)

www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)

Internet Archive officially “blesses” Warrick (mid Mar 2006) [1]

[1] http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html


Proposed Work

How lazy can we afford to be?
  Find factors influencing success of website reconstruction from the WI
  Perform search engine cache characterization
Inject server-side components into the WI for complete website reconstruction
Improve the Warrick crawler
  Evaluate different crawling policies
  Develop a web-repository API for inclusion in Warrick


Factors Influencing Website Recoverability from the WI

Previous study did not find a statistically significant relationship between recoverability and website size or PageRank
Methodology
  Sample a large number of websites from dmoz.org
  Perform several reconstructions over time using the same policy
  Download sites several times over time to capture change rates


Evaluation

Use statistical analysis to test for the following factors:
  Size
  Makeup
  Path depth
  PageRank
  Change rate

Create a predictive model – how much of my lost website do I expect to get back?


SE Cache Characterization

Web characterization is an active field
Search engine caches have never been characterized
Methodology
  Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask
  Access cached version if present
  Download live version from the Web
  Examine HTTP headers and page content
  Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache


Evaluation

Compute the ratio of indexed to cached resources
Find types, size, age of resources
Do HTTP Cache-Control directives ‘no-cache’ and ‘no-store’ stop resources from being cached?
Compare different SE caches
How prevalent is the use of NOARCHIVE meta tags to keep HTML pages from being cached?
How much of the Web is cached by SEs?
What is the overlap with the Internet Archive?
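
Two of the questions above depend on detecting Cache-Control directives and NOARCHIVE meta tags in a downloaded page. The following is a minimal sketch of that detection step using only the Python standard library; the function and class names are hypothetical, and it only reports what the page asks for. Whether a search engine actually honors the request is the open question the study proposes to measure.

```python
from html.parser import HTMLParser


def has_no_cache_directive(cache_control: str) -> bool:
    """True if a Cache-Control header value contains no-cache or no-store."""
    names = {part.split("=", 1)[0].strip().lower()
             for part in cache_control.split(",")}
    return bool(names & {"no-cache", "no-store"})


class _RobotsMetaParser(HTMLParser):
    """Looks for <meta name="robots" content="... noarchive ...">."""

    def __init__(self) -> None:
        super().__init__()
        self.noarchive = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases tag and attribute names; values may be None.
        attr = {k: (v or "") for k, v in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        tokens = {t.strip().lower() for t in attr.get("content", "").split(",")}
        if "noarchive" in tokens:
            self.noarchive = True


def has_noarchive(html: str) -> bool:
    """True if the HTML contains a robots meta tag with NOARCHIVE."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.noarchive
```

The sketch checks only the generic "robots" meta name; engine-specific names would need to be added if the study covers them.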


Marshall TR Server – running EPrints


We can recover the missing page and PDF, but what about the services?


Recovery of Web Server Components

Recovering the client-side representation is not enough to reconstruct a dynamically-produced website

How can we inject the server-side functionality into the WI?

Web repositories like HTML
  Canonical versions stored by all web repos
  Text-based
  Comments can be inserted without changing appearance of page


Injection Techniques

Inject entire server file into HTML comments
Divide server file into parts and insert parts into HTML comments
Use erasure codes to break a server file into chunks and insert the chunks into HTML comments of different pages
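
The first technique, placing an entire server file inside an HTML comment, might look like the sketch below. Everything specific here is an assumption made for illustration: the "WI-INJECT" marker, the comment layout, the Base64 encoding, and the function names are hypothetical and are not taken from the proposal. Base64 is chosen because its alphabet cannot contain "--" or ">", so the payload cannot end the comment early.

```python
import base64
import re
from typing import Dict

MARKER = "WI-INJECT"  # hypothetical marker used to find injected comments


def inject(html: str, name: str, data: bytes) -> str:
    """Embed `data` in an HTML comment; `name` must not contain whitespace or '--'."""
    payload = base64.b64encode(data).decode("ascii")
    comment = f"<!-- {MARKER} {name} {payload} -->"
    if "</body>" in html:
        # Place the comment just before the closing body tag.
        return html.replace("</body>", comment + "\n</body>", 1)
    return html + "\n" + comment


def recover(html: str) -> Dict[str, bytes]:
    """Extract every injected file found in the page."""
    pattern = re.compile(r"<!-- " + MARKER + r" (\S+) ([A-Za-z0-9+/=]*) -->")
    return {name: base64.b64decode(payload)
            for name, payload in pattern.findall(html)}
```

The second technique would follow the same pattern with an added part index in each comment so the pieces can be reassembled in order.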


Recover Server File from WI


Evaluation

Find the most efficient values for n and r (chunks created/recovered)

Security
  Develop simple mechanism for selecting files that can be injected into the WI
  Address encryption issues

Reconstruct an EPrints website with a few hundred resources
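
To make the n and r parameters in the first item concrete, here is a sketch of the simplest possible erasure code: r data chunks plus one XOR parity chunk, so n = r + 1 and any r of the n chunks suffice to rebuild the file. This single-parity scheme is a stand-in chosen for brevity, not the code the proposal intends to use; a practical system would more likely use a code that tolerates more than one lost chunk. Function names are hypothetical.

```python
from typing import List, Optional


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, r: int) -> List[bytes]:
    """Split `data` into r zero-padded chunks and append one XOR parity chunk."""
    if r < 1:
        raise ValueError("r must be at least 1")
    size = -(-len(data) // r)                 # ceiling division
    padded = data.ljust(size * r, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(r)]
    parity = bytes(size)
    for chunk in chunks:
        parity = _xor(parity, chunk)
    return chunks + [parity]                  # n = r + 1 chunks


def decode(chunks: List[Optional[bytes]], length: int) -> bytes:
    """Rebuild the original bytes from n slots, at most one of which is None."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("single-parity code can repair only one lost chunk")
    chunks = list(chunks)
    if missing:
        present = [c for c in chunks if c is not None]
        rebuilt = bytes(len(present[0]))
        for c in present:
            rebuilt = _xor(rebuilt, c)        # XOR of the survivors is the lost chunk
        chunks[missing[0]] = rebuilt
    return b"".join(chunks[:-1])[:length]     # drop parity and padding
```

The original length has to be stored alongside the chunks so that the zero padding can be stripped on recovery.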


Recent Work

URL canonicalization
Crawling policies
  Naïve policy
  Knowledgeable policy
  Exhaustive policy
Reconstruct 24 websites with each policy
Found that the exhaustive and knowledgeable policies are significantly more efficient at recovering websites
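
URL canonicalization matters here because different repositories may store the same resource under slightly different URLs. The sketch below shows a few common normalizations using Python's urllib.parse. The particular rules (lowercasing scheme and host, dropping default ports and fragments, using "/" for an empty path) are assumptions chosen for illustration; the slides do not say which rules Warrick applies.

```python
from urllib.parse import urlsplit, urlunsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return a normalized form of `url` (illustrative rules only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()     # hostname drops any userinfo
    port = parts.port
    if port is None or port == _DEFAULT_PORTS.get(scheme):
        netloc = host                          # omit default or absent port
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"                   # empty path becomes "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))  # no fragment
```

With these rules, "HTTP://WWW.Example.COM:80/index.html#top" and "http://www.example.com/index.html" map to the same string.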

[Figure: frequency (0% to 100%) of reconstructions per efficiency ratio bin (0.1 to 1.0) for the naïve, knowledgeable, and exhaustive policies.]

Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006, To appear.


Warrick API

API should provide a clear and flexible interface for web repositories

Goals:
  Shield Warrick from changes to WI
  Facilitate inclusion of new web repositories
  Minimize implementation and maintenance costs
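
An interface in the spirit of these goals could hide each repository behind one abstract class, so that the crawler depends only on the interface and a new repository is added by writing one subclass. The sketch below is purely hypothetical: the class names, the fetch method, and the in-memory stand-in are invented for illustration and do not describe the actual Warrick API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class WebRepository(ABC):
    """Hypothetical interface that every web repository adapter implements."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def fetch(self, url: str) -> Optional[bytes]:
        """Return the stored copy of `url`, or None if the repository has none."""


class InMemoryRepository(WebRepository):
    """Stand-in repository backed by a dictionary, useful for testing."""

    def __init__(self, name: str, store: Dict[str, bytes]) -> None:
        super().__init__(name)
        self._store = store

    def fetch(self, url: str) -> Optional[bytes]:
        return self._store.get(url)


def recover(url: str,
            repos: List[WebRepository]) -> Optional[Tuple[str, bytes]]:
    """Ask each repository in turn; return (repository name, content) or None."""
    for repo in repos:
        body = repo.fetch(url)
        if body is not None:
            return repo.name, body
    return None
```

Taking the first hit is the simplest selection rule; a crawler could instead collect all copies and choose among them, for example by date.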


Evaluation

Internet Archive has endorsed use of Warrick
Make Warrick available on SourceForge
Measure the community adoption & modification


Risks and Threats

Time for enough resources to be cached
Search engine caching behavior may change at any time
Repository antagonism
  Spam
  Cloaking


Timetable

Timeline


Summary

When this work is completed, I will have…
  demonstrated and evaluated the lazy preservation technique
  provided a reference implementation
  characterized SE caching behavior
  provided a layer of abstraction on top of SE behavior (API)
  explored how much we store in the WI (server-side vs. client-side representations)


Thank You

Questions?

