
1

Statistics on web archives using ISO metrics

Annick Le Follic
Bibliothèque nationale de France

Tallinn, 2015-01-29

2

BnF needs

Objectives
  Characterize BnF web collections
  Manage the activity of the digital legal deposit team
  Describe BnF web data to be preserved

Two main kinds of metrics: harvesting and preservation

From experimentation…
  Scripts and Heritrix reports from Internet Archive engineers
  A dedicated application developed by BnF engineers

3

International environment

… to the standardization of the metrics
  Definition of concepts and standards with an ISO working group
  Dedicated statistics have to be included in the Library general performance statistics

Experience sharing within the IIPC
  Many libraries have changed from ARC to WARC
  BnF has developed a specific tool (NAS_qual) for its first internal broad crawl in 2010

4

Benefits of ISO report for BnF

Adoption of strict definitions of terms

Main metrics chosen for collection development

Statistic                       Purpose                                    Example
Number of targets               Objectives of the collection               8,000 targets
Total number of URLs            Quantity of information in web archive     14 billion URLs
Total compressed size stored    Overall size of web archive                200 terabytes
Number of container files       Number of conservation units in archive    18,000 WARC files

At a more refined level, collection characterisation

Statistic                           Purpose                            Example
Distribution by top level domain    Geographic distribution            70 % of collection in .fr TLD
Distribution by format types        Document type characterisation     60 % of collection in text/html

5

BnF method

A table lists and characterizes all possible metrics
  A code and a name for each one
  The source report
  The calculation method
  The needed tool (scripts, NAS_qual…)

Main difficulties
  Difference of scale between broad and selective crawl
  Compressed or uncompressed size
  Collected or processed URLs

6

                    Production                        Preservation
Sources             NetarchiveSuite (and Heritrix)    SPAR
Statistics tools    NAS_qual                          SPAR indicators
Exploitation        Excel files                       Excel files

7

Description of top level domain

Metrics description
  Name: Number of TLD
  Description: Number of unique top level domains harvested

Metrics elaboration
  Data source: hosts-report.txt (N files)
  Calculation: Extract the TLD from the name of each host. Count the distinct TLDs with at least one URL harvested. Be careful: the same TLD can occur in several hosts reports and must be counted only once.

To characterize a collection in terms of geographic distribution (e.g. France)
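The calculation described above can be sketched in Python. This is only an illustration, not the NAS_qual implementation: it assumes that each data line of a hosts report is whitespace-separated, with the URL count in the first column and the host name in the third, and that header lines start with "[". The function name count_tlds and the sample lines are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_tlds(report_lines: Iterable[str]) -> Counter:
    """Sum harvested URLs per TLD over the lines of one or several hosts reports."""
    urls_per_tld: Counter = Counter()
    for line in report_lines:
        fields = line.split()
        # Skip blank lines and header lines (assumed to start with "[").
        if len(fields) < 3 or fields[0].startswith("["):
            continue
        url_count, host = int(fields[0]), fields[2].lower()
        if url_count < 1:
            continue  # only TLDs with at least one URL harvested are counted
        host = host.split(":")[0].rstrip(".")  # drop an optional port and trailing dot
        if "." not in host:
            continue  # pseudo-hosts without a dot carry no TLD
        tld = host.rsplit(".", 1)[1]
        if tld.isdigit():
            continue  # the last label of an IPv4 address is not a TLD
        urls_per_tld[tld] += url_count
    return urls_per_tld


# Two hypothetical reports concatenated: "fr" occurs in both but is one TLD.
lines = [
    "[#urls] [#bytes] [host]",
    "120 4096 www.example.fr",
    "30 1024 example.com",
    "[#urls] [#bytes] [host]",
    "50 2048 blog.example.fr",
    "0 0 unreachable.example.org",
]
tlds = count_tlds(lines)
print(len(tlds), dict(tlds))
```

Keying the counter on the TLD is what prevents double counting across reports: the number of TLDs is the number of keys, and the per-TLD URL totals can feed a distribution table.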

8

Statistics on top level domains

Starting with a seed list of .fr domains, we can see that the French scope also includes a large share of .com domains, as well as European domains

TLD    Number of URLs    %
fr     1,050,488,163     43.3 %
com    952,199,484       39.3 %
net    105,871,664       4.4 %
org    104,451,350       4.3 %
eu     29,396,613        1.2 %

9

Description of MIME types

Metrics description
  Name: Number of MIME types
  Description: Number of unique MIME types harvested

Metrics elaboration
  Data source: mimetype-report.txt (N files)
  Calculation: Count the distinct MIME types. Be careful: the same MIME type can occur in several MIME type reports and must be counted only once.

Get a distribution by content types comparable to other documents in a library

Help preservation tasks
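A minimal Python sketch of this calculation follows. It assumes that each data line of a MIME type report is whitespace-separated, with the URL count in the first column and the MIME type in the third, and that header lines start with "[". The function name merge_mime_reports and the sample lines are hypothetical, not taken from the BnF tooling.

```python
from collections import Counter
from typing import Iterable, Tuple


def merge_mime_reports(report_lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return URL counts per MIME type and per category (text, image, ...)."""
    per_type: Counter = Counter()
    per_category: Counter = Counter()
    for line in report_lines:
        fields = line.split()
        # Skip blank lines and header lines (assumed to start with "[").
        if len(fields) < 3 or fields[0].startswith("["):
            continue
        url_count = int(fields[0])
        mime = fields[2].lower()
        per_type[mime] += url_count
        # The category is the part before the slash, e.g. "text" in "text/html".
        per_category[mime.split("/", 1)[0]] += url_count
    return per_type, per_category


# Two hypothetical reports concatenated: text/html appears in both.
lines = [
    "[#urls] [#bytes] [mime-types]",
    "900 50000 text/html",
    "300 90000 image/jpeg",
    "[#urls] [#bytes] [mime-types]",
    "100 7000 text/html",
    "5 800000 video/mp4",
]
per_type, per_category = merge_mime_reports(lines)
total = sum(per_category.values())
shares = {cat: round(100 * n / total, 1) for cat, n in per_category.items()}
print(len(per_type), shares)
```

The number of MIME types is the number of keys in per_type; grouping by category gives the kind of distribution that helps single out audio and video content for preservation.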

10

Statistics on MIME types

We can note that around 1 million audio and video files will need special attention to be preserved

MIME type (by categories)    Number of URLs    %
text                         1,34, 647,190     55.7 %
image                        947,101,138       39.3 %
application                  114,668,770       4.8 %
video                        2,120,719         0.1 %
audio                        1,837,207         0.1 %

11

Description of WARC files volume

Metrics description
  Name: WARC files volume (compressed, with WARC metadata)
  Description: Size in bytes (and in Go) of all the conservation units produced / data harvested

Metrics elaboration
  Data source: Manifest of storage servers
  Calculation: Add the sizes of all the WARC files of one or several harvests (configurable).

Manage the storage space
Help preservation tasks
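The sum over one or several harvests can be sketched as follows. The slides do not show the layout of the storage manifest, so the CSV columns used here (harvest_id, file_name, size_bytes), the harvest names and the function name warc_volume_bytes are all assumptions made for illustration.

```python
import csv
import io
from typing import Iterable, Mapping, Optional, Set


def warc_volume_bytes(rows: Iterable[Mapping[str, str]],
                      harvests: Optional[Set[str]] = None) -> int:
    """Add the sizes of the WARC files of the selected harvests (all if None)."""
    total = 0
    for row in rows:
        if harvests is None or row["harvest_id"] in harvests:
            total += int(row["size_bytes"])
    return total


# Hypothetical manifest; a real one would be read from the storage servers.
manifest = io.StringIO(
    "harvest_id,file_name,size_bytes\n"
    "broad-2014,0001.warc.gz,1000000000\n"
    "broad-2014,0002.warc.gz,950000000\n"
    "selective-2014,0001.warc.gz,120000000\n"
)
rows = list(csv.DictReader(manifest))

print(warc_volume_bytes(rows))                  # every harvest
print(warc_volume_bytes(rows, {"broad-2014"}))  # one harvest only
```

Keeping the result in bytes and converting to a larger unit only at display time avoids accumulating rounding errors.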

12

Statistics on WARC files volume

BnF counts ARC and WARC files in a similar way
  99.35 Tio in 2014
  567.38 Tio for the entire BnF web collections

Question: BnF is still undecided whether to convert bytes to Go or Gio, To or Tio
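The stakes of this question can be shown with a short conversion sketch. Go and To are the French decimal units (10^9 and 10^12 bytes), Gio and Tio the binary ones (2^30 and 2^40 bytes). The function name convert is hypothetical, and the byte count is reconstructed from the rounded figure above, so it is approximate.

```python
# Bytes per unit: decimal (SI) and binary (IEC) prefixes applied to the octet.
UNIT_BYTES = {
    "Go": 10**9,
    "To": 10**12,
    "Gio": 2**30,
    "Tio": 2**40,
}


def convert(n_bytes: int, unit: str) -> float:
    """Express a byte count in the given unit."""
    return n_bytes / UNIT_BYTES[unit]


# Approximate byte count behind "567.38 Tio" (reconstructed, not an exact figure).
collection_bytes = round(567.38 * 2**40)

print(f"{convert(collection_bytes, 'Tio'):.2f} Tio")
print(f"{convert(collection_bytes, 'To'):.2f} To")
```

At this scale the two conventions differ by about 10 %, so a published figure is only comparable with other institutions if the unit is stated explicitly.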

13

Communication to users

Comments by the digital legal deposit team (15 metrics)
  Describe the web archive

Discussion with the IT team (5 metrics)
  Define annual storage volume
  Define number of crawlers

Content librarians network (4 metrics)
  Cooperate on selective crawls

BnF managers and readers (2 metrics or more)
  Disseminate figures in the annual report, on the BnF website and in the legal deposit observatory

14

Conclusion

Some usage limitations
  Unused metrics
  Bugs and errors
  Lack of analysis

Some changes in the environment
  Adaptation to Heritrix 3
  Options for new tools
  Evolution of standards

Even so, we are able to compare our production and our collections with other institutions

