Pandora

Post on 20-Jun-2015

Trends in Use of Pandora Archive

Presentation at IIPC Open Day “The Broad Value of Web Archives”

30th April, 2012, Library of Congress

Monica Omodei, Director, Web Archiving and Digital Preservation, National Library of Australia

momodei@nla.gov.au

About the Pandora Archive

•  Selective, collaborative approach
   –  high value, discrete, timely collecting
   –  a number of partners contribute to Pandora
•  Targeted Australian content
   –  selection policy, nominations are reviewed
•  Historical – started 1996
•  Bibliocentric approach
   –  archived sites/publications are fully catalogued
•  Publicly accessible
   –  full content keyword search through national resource discovery service trove.nla.gov.au
   –  browse is of reconstituted version of original site
   –  metadata indexed in Google

Pandora Archive Stats

•  Size – 6.32 TB
•  Number of files > 140 million
•  Number of ‘titles’ > 30.5K
•  Number of title instances > 73.5K

Whole domain archive

• We have also commissioned the IA to crawl the .au domain for us annually since 2005
• Legislation prevents us from making this accessible yet
• Hopefully soon we will be able to allow access to researchers

Australian web domain crawls

Year           2005         2006         2007         2008         2009         2011
Files          185 million  596 million  516 million  1 billion    765 million  660 million
Hosts crawled  811,523      1,046,038    1,247,614    3,038,658    1,074,645    1,346,549
Size (TB)      6.69         19.04        18.47        34.55        24.29        30.71

The Bad News

•  We have no legal deposit legislation for electronic publications, so permission to archive must be obtained
   –  significant content missed because permission to copy refused
•  QA and fixing process can be labour intensive
   –  technical infrastructure ten years old
•  Selection guidelines outdated and don't align
•  Significant content missed because of resourcing constraints and high labour cost
•  Search and browse functionality very limited
   –  no URL search, no time-based searching
•  Current infrastructure doesn't scale for broader themed collections with multiple sites or for domain-scale archiving

Glass half full

•  Situation will improve markedly if Legal Deposit provisions extended to digital publications
   –  The Australian Attorney-General has released a consultation paper with a model for this extension
•  Broader coverage will be achieved when infrastructure is upgraded, improving scalability and reducing labour costs for QA/fixing
   –  We have commenced a multi-year Digital Library Infrastructure Replacement Project which includes upgrading our web archiving tools
   –  We are currently trialling Heritrix for collaborative thematic collecting, and Wayback for access to our commissioned .gov.au sub-domain archive

DLIR Project

• Digital Library Infrastructure Replacement
• RFP was followed by RFT for components where reasonable solutions had been proposed (including core repository)
• The RFT evaluation recommended proceeding to contract negotiations with the selected tenderer for each component
• Currently preparing a submission for ministerial approval prior to contract negotiations with vendors

Patterns of Use

•  Which archived sites are popular and why ?
•  Is use of our archive growing ?
•  What is the relative interest in older vs more recent captures ?
•  Who is using our archives ?
•  And what for ?

Which archived sites are popular ?

•  Data source – filtered, aggregated web access log data which counts access to “titles”
•  Examined top 30 archived titles (# of accesses) for each year 2009 to 2012
•  Selected some to examine and speculate as to why they might be popular
•  Included consistently high ranking, and ones that were very variable between years
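
The counting step described in this slide can be sketched in Python. This is a minimal illustration, not the Library's actual tooling: the record shape (a year plus a requested path), the `/pan/<title id>/` path pattern, the sample data, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Assumed path shape for an archived title: /pan/<title id>/<instance>/...
# (a hypothetical pattern chosen for this sketch)
TITLE_PATH = re.compile(r"^/pan/(\d+)/")


def count_title_accesses(records):
    """Aggregate accesses per title per year.

    `records` is an iterable of (year, path) pairs, assumed to be already
    filtered (for example, with robot traffic removed).
    """
    per_year = defaultdict(Counter)
    for year, path in records:
        match = TITLE_PATH.match(path)
        if match:  # ignore requests that are not for an archived title
            per_year[year][match.group(1)] += 1
    return per_year


def top_titles(per_year, year, n=30):
    """Return the n most accessed (title id, count) pairs for one year."""
    return per_year[year].most_common(n)


# Invented sample records, purely for illustration.
sample = [
    (2009, "/pan/10421/20040913-0000/index.html"),
    (2009, "/pan/10421/20040913-0000/about.html"),
    (2009, "/pan/22222/20080101-0000/index.html"),
    (2010, "/pan/22222/20080101-0000/index.html"),
    (2009, "/search?q=election"),
]
counts = count_title_accesses(sample)
print(top_titles(counts, 2009, n=2))
```

Comparing the `top_titles` lists for successive years is one way to separate consistently high-ranking titles from those that vary a lot between years.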

Reasons for popularity of archived version

•  Were once popular and are now decommissioned, particularly if domain name continues to exist and redirects to the archive
•  May not be that popular as live sites but their live site links prominently to Pandora as an archive for their content
•  Popular referencing sources cite the archive as well as the live site (if it still exists)

Conclusions

•  Be more proactive in identifying unresponsive domains
•  Market automatic redirect services to web site owners/managers
•  Allow Google to index archive content for sites which are no longer ‘live’
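
As a rough illustration of what an automatic redirect service might do, the sketch below maps a request for a decommissioned site onto the matching page in an archived copy. The slides do not describe an implementation, so the host names, the archive URL prefix, and the function name are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from a decommissioned host to the URL prefix of its
# archived copy. Both values are invented for illustration.
ARCHIVE_PREFIX = {
    "www.example.org.au": "http://archive.example/pan/12345/20110101-0000/www.example.org.au",
}


def redirect_target(requested_url):
    """Return the archive URL to redirect to, or None if the host is unknown."""
    parts = urlsplit(requested_url)
    prefix = ARCHIVE_PREFIX.get(parts.hostname)
    if prefix is None:
        return None
    target = prefix + (parts.path or "/")  # keep the originally requested path
    if parts.query:
        target += "?" + parts.query
    return target


print(redirect_target("http://www.example.org.au/policies/index.html"))
```

A site owner who keeps the old domain name registered could answer each request with an HTTP redirect to the URL this function returns.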

Is use of Pandora growing ?

Annual access figures for Pandora Web Site and Archive

NB:
• robots.txt was not introduced on the site until 2005
• Web site design change in 2008 affected measure downward

Interest in older vs recent content

• Filtered access logs by reference from the entry page to the archived instance
• Aggregated accesses by age (year) of archived instance
• Added number of instances of that age in the archive as a reference
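
The aggregation described in this slide can be sketched as follows. It is an illustration under stated assumptions: the inputs are plain lists of capture years, the numbers are made up, and the accesses-per-instance ratio is an added convenience that the slides do not mention.

```python
from collections import Counter


def accesses_per_instance(access_years, instance_years):
    """Compare accesses with holdings for each capture year.

    access_years: capture year of the instance behind each logged access.
    instance_years: capture year of every instance held in the archive.
    Returns {year: (accesses, instances, accesses per instance)}.
    """
    accesses = Counter(access_years)
    holdings = Counter(instance_years)
    summary = {}
    for year in sorted(holdings):
        held = holdings[year]
        hits = accesses.get(year, 0)
        summary[year] = (hits, held, hits / held)
    return summary


# Made-up numbers, purely for illustration.
access_years = [1999, 1999, 2005, 2005, 2005, 2005, 2010]
instance_years = [1999, 2005, 2005, 2010, 2010]
for year, row in accesses_per_instance(access_years, instance_years).items():
    print(year, row)
```

Showing the instance count next to the access count matters because a year with many captures will attract more accesses simply by being larger.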

Age of instances accessed

Who is using the archive ?

• Online survey linked to from search service - approx 450 respondents

• Age, gender, location, education

• How did they arrive

• What type of information and for what purpose

•  Is it still available on the live web ?

But first an anecdote

Article in major newspaper – quote:

WE at Spring Loaded are no conspiracy theorists, but the disappearance of Liberal Party policies is curious. First went the policy documents. A recent revamp of the website saw the pre-election press releases go. But thanks to the National Library of Australia’s Internet archive, many of the policies can be seen at http://pandora.nla.gov.au When Spring Loaded asked about the missing policies, the Liberal Party said there was “nothing untoward”.

Examples of lost web sites

• Qantas's own special web site presenting their case during the major dispute with pilots, engineers and cabin crew unions that grounded the airline in 2011

• Jeff Kennett's campaign web site in the 1999 Victorian State election - the first use of the web by a politician during a campaign in Australia

About the respondents

How did they arrive ?

What information was sought ?

What for ?

Other questions

• Did you realise that you were going to enter an archived version of a web site, not the live one ? (60% yes, 40% no)

• Was the resource you were looking for no longer available on the live web ? (50-50)

• Have you visited other web archives ? (60% yes, 40% no)

Conclusions

• We need to market our archive better
• Promote redirects for closing, unsupported web sites
• Convert archives to ARC/WARC so Memento API will find content
• Allow Google indexing of content for archived web sites where live version is extinct or substantially altered
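
For readers unfamiliar with Memento: a client asks a "TimeGate" for the capture of a URL closest to a given date by sending an `Accept-Datetime` request header. The sketch below only builds such a request and does not send it. The TimeGate address is hypothetical and the function name is invented for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate endpoint; a real archive would publish its own.
TIMEGATE = "http://archive.example/timegate/"


def timegate_request(original_url, when):
    """Build (but do not send) a datetime-negotiation request.

    `when` must be a timezone-aware datetime in UTC; it is rendered as an
    HTTP date ending in 'GMT', as the Accept-Datetime header requires.
    Returns the request URL and the headers to send with it.
    """
    headers = {"Accept-Datetime": format_datetime(when, usegmt=True)}
    return TIMEGATE + original_url, headers


url, headers = timegate_request(
    "http://www.example.org.au/",
    datetime(2012, 4, 30, tzinfo=timezone.utc),
)
print(url)
print(headers)
```

A Memento-aware access tool answers such a request by redirecting to the capture nearest the requested date.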