The Theory and Practice of Website Archivability

Vangelis Banos¹, Yunhyong Kim², Seamus Ross², Yannis Manolopoulos¹

¹ Department of Informatics, Aristotle University, Thessaloniki, Greece
² University of Glasgow, United Kingdom

FROM CLEAR TO ARCHIVEREADY.COM

Table of Contents

1. Problem definition,
2. CLEAR: A Credible Live Method to Evaluate Website Archivability,
3. Demo: http://archiveready.com/,
4. Future Work.

Problem definition

• Web content acquisition is a critical step in the process of web archiving,
• Web bots face increasing difficulties in harvesting websites,
• After web harvesting, archive administrators manually review the content and endorse or reject the harvested material,
• Key Problem: Web harvesting is automated while Quality Assurance (QA) is manual.

What is Website Archivability?

Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy.

Attention! It must not be confused with website dependability, reliability, availability, safety, security, survivability or maintainability.

CLEAR: A Credible Live Method to Evaluate Website Archivability

• An approach to producing a credible on-the-fly measurement of Website Archivability, by:
  • Using standard HTTP to get website elements,
  • Evaluating information such as file types, content encoding and transfer errors,
  • Combining this information with an evaluation of the website's compliance with recognised practices in digital curation,
  • Using adopted standards, validating formats, assigning metadata,
  • Calculating a Website Archivability Score (0 – 100%).
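The first two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ArchiveReady.com implementation: the function name `observe` and the shape of the returned record are hypothetical, and only the standard library is used.

```python
import urllib.request


def observe(url, opener=urllib.request.urlopen, timeout=10):
    """Fetch one website element and record basic facts about the transfer.

    The record layout is a hypothetical choice made for this sketch.
    """
    record = {
        "url": url,
        "content_type": None,      # file type as declared by the server
        "content_encoding": None,  # e.g. "gzip", or None when the header is absent
        "transfer_error": None,    # text of the error if the fetch failed
    }
    try:
        with opener(url, timeout=timeout) as response:
            record["content_type"] = response.headers.get_content_type()
            record["content_encoding"] = response.headers.get("Content-Encoding")
    except OSError as exc:
        # urllib's URLError and HTTPError, as well as socket errors,
        # are all subclasses of OSError.
        record["transfer_error"] = str(exc)
    return record
```

The `opener` parameter exists so that the function can be exercised without network access, for example with a stub that raises `OSError`.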

CLEAR: A Credible Live Method to Evaluate Website Archivability

Archivability Facets: Accessibility, Cohesion, Standards Compliance, Performance, Metadata

[Figure: Website attributes evaluated using CLEAR]

CLEAR

• The method can be summarised as follows:
1. Perform specific Evaluations on Website Attributes,
2. Use the results to calculate each Archivability Facet's score,
   • Scores range from 0 to 100%,
   • Not all evaluations are equal: if an important evaluation fails, its score is 0; if a minor evaluation fails, its score is 50%,
3. Produce the final Website Archivability as the equally weighted sum of all Facets' scores.
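Steps 1 and 2 can be illustrated with a minimal Python sketch. The 100/50/0 ratings come from the slide; averaging the ratings to obtain a Facet's score is an assumption made here, and the names `rate` and `facet_score` are hypothetical.

```python
from statistics import mean


def rate(passed, important):
    """Rating of one evaluation: 100 on success, 0 if an important
    evaluation fails, 50 if a minor evaluation fails."""
    if passed:
        return 100
    return 0 if important else 50


def facet_score(ratings):
    """Score of one Facet, taken here as the plain average of its
    evaluation ratings (an assumption, not stated on the slide)."""
    return mean(ratings)


# Hypothetical Facet with two minor failures, one important failure
# and one successful evaluation.
ratings = [
    rate(passed=False, important=False),
    rate(passed=False, important=False),
    rate(passed=False, important=True),
    rate(passed=True, important=True),
]
print(ratings, facet_score(ratings))
```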

Accessibility

Are web archiving crawlers able to discover all content using standard protocols and best practices?

Accessibility evaluation

Facet: Accessibility (Total: 50%)

Evaluation              Rating
No RSS feed             50%
No robots.txt           50%
No sitemap.xml          0%
6 links, all valid      100%

http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
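As a rough sketch of how the first three checks could locate what they look for: the helper names below are hypothetical, `/sitemap.xml` at the site root is a common convention rather than a requirement, and feed discovery is assumed to rely on a `<link rel="alternate" type="application/rss+xml">` element in the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def well_known_urls(site_url):
    """URLs where a crawler would conventionally look for crawl hints."""
    return {
        "robots.txt": urljoin(site_url, "/robots.txt"),
        "sitemap.xml": urljoin(site_url, "/sitemap.xml"),  # conventional location only
    }


class FeedFinder(HTMLParser):
    """Collect the href of every RSS feed advertised in an HTML page."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        if (tag == "link" and "alternate" in rel
                and a.get("type") == "application/rss+xml" and a.get("href")):
            self.feeds.append(a["href"])
```

Fetching the two URLs and feeding the page HTML to `FeedFinder` would then tell whether each item is present; the fetch itself is left out so the sketch stays runnable offline.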

Cohesion

• Dependencies are a major issue in digital curation.
• If a website is dispersed across different web locations (images, JavaScript, CSS, CDNs, etc.), acquisition and ingest are at risk if one or more of those locations fail or change.
• Web bots may have trouble accessing many different web locations due to configuration issues.

Cohesion evaluation

Facet: Cohesion (Total: 70%)

Evaluation                                    Rating
1 external and no internal scripts            0%
4 local and 1 external images                 80%
No proprietary (Quicktime & Flash) files      100%
1 local CSS file                              100%

http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
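The script and image rows (0% and 80%) are consistent with rating each resource type by the share of its files served from the page's own host. That reading is an assumption, as is the definition of "local" as "same hostname"; the function name and the example paths below are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def local_share(page_url, resource_urls):
    """Percentage of resources hosted on the same hostname as the page.

    Returns None when there are no resources of this type to rate.
    """
    if not resource_urls:
        return None
    page_host = urlsplit(page_url).hostname
    local = sum(
        1 for ref in resource_urls
        # urljoin resolves relative references against the page URL
        if urlsplit(urljoin(page_url, ref)).hostname == page_host
    )
    return 100 * local / len(resource_urls)


page = "http://ipres2013.ist.utl.pt/"
images = ["img/a.png", "img/b.png", "/img/c.png",
          "http://ipres2013.ist.utl.pt/img/d.png",
          "http://cdn.example.com/e.png"]   # 4 local, 1 external
scripts = ["http://cdn.example.com/lib.js"]  # 1 external, none local
print(local_share(page, images), local_share(page, scripts))
```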

Metadata

• Metadata are necessary for digital curation and archiving.
• Lack of metadata impairs the ability to manage, organise, retrieve and interact with content.
• Web content metadata may be:
  • Syntactic (e.g. content encoding, character set),
  • Semantic (e.g. description, keywords, dates),
  • Pragmatic (e.g. FOAF, RDF, Dublin Core).

Metadata evaluation

Facet: Metadata (Total: 87%)

Evaluation                          Rating
Meta description found              100%
HTTP Content type                   100%
HTTP Page expiration not found      50%
HTTP Last-modified found            100%

http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
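A minimal sketch of how these four checks might be run against one fetched page. It assumes that "page expiration" refers to the HTTP `Expires` header and that the description lives in a `<meta name="description">` element; the class and function names, and the sample header values, are hypothetical.

```python
from html.parser import HTMLParser


class MetaDescription(HTMLParser):
    """Remember the content of <meta name="description">, if any."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "description":
            self.description = a.get("content")


def metadata_checks(headers, html):
    """Return which of the four metadata items were found."""
    finder = MetaDescription()
    finder.feed(html)
    names = {name.lower() for name in headers}  # header names are case-insensitive
    return {
        "Meta description": bool(finder.description),
        "HTTP Content type": "content-type" in names,
        "HTTP Page expiration": "expires" in names,  # assumed header
        "HTTP Last-modified": "last-modified" in names,
    }


sample_headers = {"Content-Type": "text/html; charset=utf-8",
                  "Last-Modified": "Mon, 22 Apr 2013 10:00:00 GMT"}
sample_html = '<html><head><meta name="description" content="Conference site"></head></html>'
print(metadata_checks(sample_headers, sample_html))
```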

Performance

• Calculate the average network response time for all website content.
• The throughput of web spider data acquisition affects the number and complexity of the web sources it can process.
• Performance evaluation:

Facet: Performance (Total: 100%)

Evaluation                                      Rating
Average network response time is 0.546ms        100%

http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
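The averaging step might look like the sketch below. The function name and the injected `fetch` callable are hypothetical; how the resulting time maps to a rating is not given on the slide, so no threshold is assumed here.

```python
import time
from statistics import mean


def average_response_time(urls, fetch):
    """Average wall-clock time, in seconds, that `fetch(url)` takes per URL.

    `fetch` is any callable that retrieves one URL, for example
    lambda u: urllib.request.urlopen(u, timeout=10).read()
    """
    if not urls:
        raise ValueError("at least one URL is needed to compute an average")
    timings = []
    for url in urls:
        start = time.perf_counter()
        fetch(url)
        timings.append(time.perf_counter() - start)
    return mean(timings)
```

Passing the fetch function in keeps the timing logic independent of the HTTP client and lets it be checked with a stub.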

Standards Compliance

• Digital curation best practices recommend that web resources be represented in known and transparent standards in order to be preserved.

Standards Compliance evaluation

Facet: Standards Compliance (Total: 87%)

Evaluation                                      Rating
1 Invalid CSS file                              0%
Invalid HTML file                               0%
Meta description found                          100%
No HTTP Content encoding                        50%
HTTP Content Type found                         100%
HTTP Page expiration found                      100%
HTTP Last-modified found                        100%
No Quicktime or Flash objects                   100%
5 images found and validated with JHOVE         100%

http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013

iPRES 2013 Website Archivability Evaluation

Facet                   Rating
Accessibility           50%
Cohesion                70%
Standards Compliance    77%
Metadata                87%
Performance             100%

Website Archivability: 77%
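The overall figure can be checked from the five Facet ratings in this table: weighting every Facet equally gives 76.8, which rounds to the reported 77%. The snippet below only reproduces that arithmetic; rounding to the nearest integer is an assumption about how the figure was displayed.

```python
from statistics import mean

# Facet ratings copied from the table above.
facets = {
    "Accessibility": 50,
    "Cohesion": 70,
    "Standards Compliance": 77,
    "Metadata": 87,
    "Performance": 100,
}

overall = mean(list(facets.values()))  # every Facet counts equally
print(overall, round(overall))         # 76.8 77
```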

ArchiveReady.com Demonstration

- Web application implementing CLEAR,
- Web interface & also Web API in JSON,
- Running on Linux, Python, Nginx, Redis, MySQL.

Impact

1. Web professionals
- evaluate the archivability of their websites in an easy but thorough way,
- become aware of web preservation concepts,
- embrace preservation-friendly practices.

2. Web archive operators
- make informed decisions on archiving websites,
- perform large scale website evaluations with ease,
- automate web archiving Quality Assurance,
- minimise wasted resources on problematic websites.

Future Work

1. Treating all Archivability Facets as equal is not optimal.
2. The evaluation covers a single page of a website, on the assumption that pages from the same website share the same components and standards. Sampling would be necessary.
3. Certain classes and specific types of errors create lesser or greater obstacles to website acquisition and ingest than others. Differential valuing of error classes and types is necessary.
4. Cross validation with web archive data is under way.
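Items 1 and 3 point towards weighted scoring. One possible shape for it is a weighted average, sketched below; the function name is hypothetical and the slides do not propose any particular weights, so the unequal weights in the example are purely illustrative.

```python
def weighted_archivability(facet_scores, weights):
    """Weighted average of Facet scores; `weights` maps Facet name to weight."""
    total_weight = sum(weights[name] for name in facet_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(score * weights[name] for name, score in facet_scores.items())
    return weighted / total_weight


# Facet ratings from the iPRES 2013 evaluation.
scores = {"Accessibility": 50, "Cohesion": 70, "Standards Compliance": 77,
          "Metadata": 87, "Performance": 100}

# Equal weights reproduce the current behaviour.
equal = {name: 1 for name in scores}

# Illustrative only: count Accessibility twice as much as the other Facets.
skewed = dict(equal, Accessibility=2)

print(weighted_archivability(scores, equal))
print(weighted_archivability(scores, skewed))
```

The same scheme could be applied one level down, giving each evaluation inside a Facet its own weight instead of the fixed 0% and 50% failure ratings.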

THANK YOU

Vangelis Banos
Web: http://vbanos.gr/
Email: [email protected]

ANY QUESTIONS?

The research leading to these results has received funding from the European Commission Framework Programme 7 (FP7), BlogForever project, grant agreement No. 269963.