
A New Model for Web Resource Harvesting

Page 1: A New Model for Web Resource Harvesting

OAI-PMH for Resource Harvesting Tutorial, OAI4, October 20th 2005, CERN, Geneva, Switzerland

A New Model for Web Resource Harvesting

This work supported in part by the Andrew Mellon Foundation & Library of Congress

Michael Nelson

Computer Science Department

Old Dominion University

Herbert Van de Sompel

Digital Library Research & Prototyping Team

Research Library, Los Alamos National Laboratory

Page 2: A New Model for Web Resource Harvesting

Outline

(0) The Problem

(1) mod_oai

(2) Future Research

Page 3: A New Model for Web Resource Harvesting

WWW and DL: Separated at Birth

1994: DL and WWW

Today:

WWW
The Good: XML, BitTorrent, Web Services
The Bad: RSS
The Ugly: Semantic Web

DL
The Good: OAIS, DOI, OAI-PMH
The Bad: Dublin Core
The Ugly: SRU/W

The problem is not that the WWW doesn’t work; it clearly does. The problem is that our expectations have been lowered.

Page 4: A New Model for Web Resource Harvesting

Web Robots

Site: www.getty.edu

doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc100; last mod 2003-09-11

Robot asks of the site: what documents have been modified since 2003-11-15?

Robot asks of each file: what is this file? what are its relationships to other files? how often does it change?

robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG
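The robot's first question amounts to a datestamp filter, which a server can answer directly if it exposes last-modified dates. The sketch below is only an illustration: the `DOCS` listing is built from the three documents named on the slide, and the helper `modified_since` is a hypothetical name, not part of mod_oai or any crawler.

```python
from datetime import date

# Hypothetical listing built from the three documents named on the slide.
DOCS = {
    "doc1": date(2003, 3, 12),
    "doc2": date(2002, 7, 19),
    "doc100": date(2003, 9, 11),
}

def modified_since(docs, since):
    """Return the names of documents last modified on or after `since`."""
    return sorted(name for name, last_mod in docs.items() if last_mod >= since)

if __name__ == "__main__":
    # The robot's question from the slide.
    print(modified_since(DOCS, date(2003, 11, 15)))
    # An earlier cut-off date, for comparison.
    print(modified_since(DOCS, date(2003, 1, 1)))
```

With the dates on the slide, nothing has changed since 2003-11-15, so a robot without access to datestamps would re-fetch every document only to learn that.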

Page 5: A New Model for Web Resource Harvesting

A More Efficient Way

what documents have been modified since 2003-11-15?

www.getty.edu with mod_oai

doc1; last mod 2003-03-12

doc2; last mod 2002-07-19

doc100; last mod 2003-09-11

<co> <metadata/> <link/> <link/> <change/> …</co>

Page 7: A New Model for Web Resource Harvesting

mod_oai approach

• Goal: integrate OAI-PMH functionality into the web server itself
• mod_oai: an Apache 2.0 module that automatically answers OAI-PMH requests for an HTTP server
  o written in C
  o respects values in .htaccess, httpd.conf
• compile mod_oai on http://www.foo.edu/
• baseURL is now http://www.foo.edu/modoai
  o Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
    - http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg
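As a rough illustration of how a harvester might assemble the request shown on this slide, the sketch below joins an OAI-PMH verb and its arguments onto a baseURL. The helper name `build_request` is hypothetical; the baseURL and argument values are the ones from the slide.

```python
from urllib.parse import urlencode

def build_request(base_url, verb, **args):
    """Join an OAI-PMH verb and its arguments onto a baseURL as a query string."""
    params = {"verb": verb}
    params.update(args)
    # safe=":" leaves the colons in the set name unescaped, as written on the slide.
    return base_url + "?" + urlencode(params, safe=":")

if __name__ == "__main__":
    url = build_request(
        "http://www.foo.edu/modoai",
        "ListIdentifiers",
        metadataPrefix="oai_dc",
        # "from" is a Python keyword, so it is passed through a dict.
        **{"from": "2004-09-15", "set": "mime:video:mpeg"},
    )
    print(url)
```

Passing `from` through a dictionary is only a workaround for the Python keyword; the argument name on the wire is unchanged.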

Page 8: A New Model for Web Resource Harvesting

OAI-PMH data model in mod_oai

resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf

item: OAI-PMH identifier = entry point to all records pertaining to the resource

records (metadata pertaining to the resource): Dublin Core metadata, MPEG-21 DIDL, HTTP header metadata

OAI-PMH sets: MIME type

Page 9: A New Model for Web Resource Harvesting

OAI-PMH concepts: typical repository

OAI-PMH Entity      value            description
Resource            URL              PDF, PS, XML, HTML or other file
Item
  identifier        OAI Identifier   DNS-based name of metadata about resource
  set membership    LCSH             Library of Congress Subject Heading
Record
  metadataPrefix    oai_dc           bibliographic metadata in Dublin Core
  datestamp         2004-10-18       modification date of DC record
Record
  metadataPrefix    oai_marc         bibliographic metadata in MARC
  datestamp         2004-07-31       modification date of MARC record

Page 10: A New Model for Web Resource Harvesting

OAI-PMH concepts: mod_oai

OAI-PMH Entity      value         description
Resource            URL           HTML, GIF, PDF or other web file
Item
  identifier        URL           same URL as the resource
  set membership    MIME type     MIME type of the resource
Record
  metadataPrefix    http_header   the HTTP headers that would have been returned via HTTP GET/HEAD
  datestamp         2004-07-31    modification date of resource
Record
  metadataPrefix    oai_dc        a subset of http_header in DC
  datestamp         2004-07-31    modification date of resource
Record
  metadataPrefix    oai_didl      MPEG-21 DIDL: base64 encoded resource + http_header metadata
  datestamp         2004-07-31    modification date of resource
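One way to picture set membership by MIME type is as a simple rewrite of the MIME type into a colon-separated set name, matching the `mime:video:mpeg` form used in the earlier request example. The sketch below is an assumption about how that could look in Python; mod_oai itself is written in C, and the helper names `mime_to_set` and `set_for_path` are hypothetical.

```python
import mimetypes

def mime_to_set(mime_type):
    """Turn a MIME type such as 'video/mpeg' into a set name such as 'mime:video:mpeg'."""
    return "mime:" + mime_type.replace("/", ":")

def set_for_path(path, default="application/octet-stream"):
    """Guess a file's MIME type from its name and return the matching set name."""
    guessed, _encoding = mimetypes.guess_type(path)
    # Fall back to a generic type when the extension is unknown (an assumption).
    return mime_to_set(guessed or default)

if __name__ == "__main__":
    print(mime_to_set("video/mpeg"))
    print(set_for_path("ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf"))
```

The guess from the file name depends on the MIME table of the local Python installation, whereas a web server would use its own configured type map.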

Page 11: A New Model for Web Resource Harvesting

Resource Discovery: ListIdentifiers

harvester
• issues a ListIdentifiers
• finds URLs of updated resources
• does HTTP GETs (updates only)
• can get URLs of resources with specified MIME types
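The discovery step can be sketched as parsing a ListIdentifiers response and collecting the identifiers, which in mod_oai are the resource URLs. The response below is hand-written for illustration (its URLs and dates are made up), and `updated_urls` is a hypothetical helper; only the OAI-PMH 2.0 namespace and element names come from the protocol.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response; URLs and datestamps are illustrative only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2005-10-20T12:00:00Z</responseDate>
  <request verb="ListIdentifiers">http://www.foo.edu/modoai</request>
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/movies/a.mpeg</identifier>
      <datestamp>2004-09-20</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
    <header>
      <identifier>http://www.foo.edu/movies/b.mpeg</identifier>
      <datestamp>2004-10-01</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

def updated_urls(response_xml):
    """Return (identifier, datestamp) pairs from a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    return [
        (
            header.findtext("oai:identifier", namespaces=OAI_NS),
            header.findtext("oai:datestamp", namespaces=OAI_NS),
        )
        for header in root.iterfind(".//oai:header", OAI_NS)
    ]

if __name__ == "__main__":
    for url, stamp in updated_urls(SAMPLE):
        # A real harvester would now issue an HTTP GET for each URL.
        print(stamp, url)
```

A complete harvester would also follow resumptionToken elements for long lists; that is left out here to keep the sketch short.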

Page 12: A New Model for Web Resource Harvesting

Preservation: ListRecords

harvester
• issues a ListRecords
• gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference)
• can get resources with specified MIME types
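The difference between By Value and By Reference can be illustrated without the full DIDL syntax. The sketch below uses a plain dictionary as a stand-in for a DIDL resource entry; the field names and both helper functions are assumptions made for this illustration and are not the MPEG-21 DIDL schema or mod_oai's output format.

```python
import base64

def pack_resource(url, mime_type, content=None):
    """Describe a resource By Value (base64 content embedded) or By Reference (URL only)."""
    entry = {"mimeType": mime_type}
    if content is not None:
        # By Value: the bytes travel inside the record.
        entry["encoding"] = "base64"
        entry["data"] = base64.b64encode(content).decode("ascii")
    else:
        # By Reference: the harvester must fetch the URL itself.
        entry["ref"] = url
    return entry

def unpack_resource(entry):
    """Return the embedded bytes, or None when the entry only carries a reference."""
    if entry.get("encoding") == "base64":
        return base64.b64decode(entry["data"])
    return None

if __name__ == "__main__":
    by_value = pack_resource("http://www.foo.edu/a.html", "text/html", b"<html></html>")
    by_ref = pack_resource("http://www.foo.edu/a.mpeg", "video/mpeg")
    print(unpack_resource(by_value))
    print(by_ref["ref"])
```

By Value suits preservation because one response carries everything, at the cost of larger records; By Reference keeps responses small but needs a second request per resource.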

Page 13: A New Model for Web Resource Harvesting

performance of mod_oai and wget on www.cs.odu.edu

Tools compared: wget, mod_oai

                             index.html as seed   "find . -type f" as seed   files
# of files in baseline       709                  5739                       5268
# of files in update (25%)   114                  1318                       1335

Page 14: A New Model for Web Resource Harvesting

Readings

• Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Terry L. Harrison, Nathan McFarland. mod_oai: An Apache Module for Metadata Harvesting. http://arxiv.org/abs/cs.DL/0503069

Page 16: A New Model for Web Resource Harvesting

Issues and Future Work

• For a given server, there is a set of URLs, U, and a set of files, F
  o Apache maps U → F
  o mod_oai maps F → U
• Neither function is 1-1 nor onto
  o We can easily check if a single u maps to F, but given F we cannot (easily) generate U
• Short-term issues:
  o dynamic files
    - exporting unprocessed server-side files would be a security hole
  o IndexIgnore
    - httpd will “hide” valid URLs
  o File permissions
    - httpd will advertise files it cannot read
• Long-term issues:
  o Alias, Location
    - files can be covered up by the httpd
  o UserDir
    - interactions between the httpd and the filesystem
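The claim that the URL-to-file mapping is neither 1-1 nor onto can be checked on a toy example. Everything in the sketch below is hypothetical: the three URLs, the three files, and the helper names are invented for illustration and do not describe any real server.

```python
# Hypothetical URL -> file mapping for a tiny server.
URL_TO_FILE = {
    "/": "/htdocs/index.html",
    "/index.html": "/htdocs/index.html",   # two URLs share one file
    "/docs/a.pdf": "/htdocs/docs/a.pdf",
}

# Hypothetical set of files on disk; .htaccess has no URL at all.
FILES = {"/htdocs/index.html", "/htdocs/docs/a.pdf", "/htdocs/.htaccess"}

def is_one_to_one(mapping):
    """True if no two keys map to the same value."""
    return len(set(mapping.values())) == len(mapping)

def is_onto(mapping, codomain):
    """True if every element of the codomain is the image of some key."""
    return set(mapping.values()) == set(codomain)

def invert(mapping):
    """Group keys by value; a file may come back with several URLs."""
    inverse = {}
    for url, path in mapping.items():
        inverse.setdefault(path, []).append(url)
    return inverse

if __name__ == "__main__":
    print(is_one_to_one(URL_TO_FILE))
    print(is_onto(URL_TO_FILE, FILES))
    print(invert(URL_TO_FILE))
```

Inverting the mapping is only possible here because the URL list is given; a module that starts from the files has no such list, which is the difficulty the slide describes.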

Page 17: A New Model for Web Resource Harvesting

IndexIgnore & File Permissions

Page 18: A New Model for Web Resource Harvesting

Alias: Covering Up Files

httpd.conf:
Alias /A /usr/local/web/htdocs/B
Alias /B /usr/local/web/htdocs/A

the files “A” and “B” will be different from the URLs
http://server/A
http://server/B
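The effect of the two Alias lines can be imitated with a small prefix-matching resolver. This is a deliberately simplified model and not Apache's actual matching logic; the document root is assumed to be /usr/local/web/htdocs based on the paths on the slide, and `url_path_to_file` is a hypothetical helper.

```python
# Assumed document root, inferred from the paths on the slide.
DOCUMENT_ROOT = "/usr/local/web/htdocs"

# The two Alias directives from the slide, as (URL prefix, filesystem target).
ALIASES = [
    ("/A", "/usr/local/web/htdocs/B"),
    ("/B", "/usr/local/web/htdocs/A"),
]

def url_path_to_file(url_path):
    """Map a URL path to a filesystem path, trying aliases before the document root."""
    for prefix, target in ALIASES:
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    return DOCUMENT_ROOT + url_path

if __name__ == "__main__":
    for path in ("/A", "/B", "/B/page.html", "/C"):
        print(path, "->", url_path_to_file(path))
```

A module that walks the filesystem and assumes each file under the document root is served at the same relative URL would report the wrong URL for both A and B in this configuration.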

Page 19: A New Model for Web Resource Harvesting

UserDir: “Just in Time” mounting of directories

whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/ tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf%

Page 20: A New Model for Web Resource Harvesting

Looking Further Down the Road for mod_oai

• “Reverse” the method of URL discovery
  o cannot look to the files;
  o listen to incoming requests and build a list of valid URLs
    - could be seeded with files at start
    - also the method for handling server processed files / URLs
• Plug-ins for descriptive metadata
  o DC tags in HTML
  o MS Office formats, PDF
  o Tags from JPEG, TIFF, MP3, etc.
• Additional metadata in the DIDL
  o technical metadata from JHOVE
  o estimated change rate
    - cf. Cho & Garcia-Molina, ACM TOIT 28(4)
• http log access as separate metadata formats
  - cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)
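The "listen to incoming requests" idea can be approximated offline by scanning an access log for requests that succeeded. The sketch below assumes log lines in Common Log Format and treats only status 200 as proof of a valid URL; the regular expression, the helper `valid_urls`, and the sample lines are all illustrative assumptions rather than anything mod_oai does today.

```python
import re

# Picks the request path and status out of a Common Log Format line.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

def valid_urls(log_lines, seed=()):
    """Collect URL paths that were served successfully, starting from an optional seed."""
    urls = set(seed)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "200":
            urls.add(match.group("path"))
    return urls

if __name__ == "__main__":
    # Made-up log lines for illustration.
    sample = [
        '127.0.0.1 - - [20/Oct/2005:10:00:00 +0200] "GET /index.html HTTP/1.0" 200 2326',
        '127.0.0.1 - - [20/Oct/2005:10:00:05 +0200] "GET /missing.html HTTP/1.0" 404 209',
        '127.0.0.1 - - [20/Oct/2005:10:00:09 +0200] "GET /cgi-bin/search?q=oai HTTP/1.0" 200 512',
    ]
    # The seed stands in for URLs derived from files at start-up.
    print(sorted(valid_urls(sample, seed={"/docs/a.pdf"})))
```

Because the list is built from what clients actually requested, it also picks up server-processed URLs such as the query above, which a filesystem walk would never find.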

Page 21: A New Model for Web Resource Harvesting

Expanding OAI-PMH / Complex Object Access

• OAI-PMH / CO access for:
  o blogs
  o message boards
  o native file systems
    - e.g. Mac OS X “Spotlight”
• More aggressive use of OAI-PMH / CO for preservation
  o recently funded NSF DIGARCH program
  o use for preservation:
    - Usenet
    - Email
    - Multicasting

Page 22: A New Model for Web Resource Harvesting

OAI-PMH + Complex Objects: A New Model for Web Resource Harvesting

• Better web harvesting can be achieved through:
  o OAI-PMH: structured access to updates
  o Complex object formats: modeled representation of digital objects
• Use cases:
  o Preservation (ListRecords)
  o Web crawling (ListIdentifiers)
• mod_oai: reference implementation
  o Better performance than wget
  o static files only; dynamic files in the future
  o not a replacement for DSpace, Fedora, eprints.org, etc.
• More info:
  o http://www.modoai.org/
  o http://whiskey.cs.odu.edu/

