
A New Model for Web Resource Harvesting

Page 1: A New Model for Web Resource Harvesting

A New Model for Web Resource Harvesting

Michael L. Nelson

Old Dominion University

joint work with:

Herbert Van de Sompel

Xiaoming Liu

Carl Lagoze

Simeon Warner

Terry Harrison

This work supported in part by the Andrew Mellon Foundation & Library of Congress

Page 2: A New Model for Web Resource Harvesting

Texas A&M University, April 25, 2005

My Research Interests

• Digital Objects and Repositories
  o Interaction between them: roles, responsibilities, architecture, scalability
  o Bumper sticker: Free the Object from the Tyranny of the Repository

• Digital Preservation
  o Shared infrastructure models, automation, large-scale best-effort strategies
  o Bumper sticker: We Need Fewer Heroes

• User / System Co-Evolution
  o Discerning intent from large-scale access patterns
  o Bumper sticker: Freedom From Choice

Page 3: A New Model for Web Resource Harvesting

Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research

Page 4: A New Model for Web Resource Harvesting

WWW and DL: Separated at Birth

[Diagram: DL and WWW, from 1994 to today]

more on DL/WWW, from the NSF Post-DL Workshop:
http://www.sis.pitt.edu/~dlwkshop/paper_sompel.html
http://www.sis.pitt.edu/~dlwkshop/paper_lagoze.html

WWW
The Good: XML, BitTorrent, Web Services
The Bad: RSS
The Ugly: Semantic Web

DL
The Good: OAIS, DOI, OAI-PMH
The Bad: Dublin Core
The Ugly: SRU/W

The problem is not that the WWW doesn’t work; it clearly does. The problem is that our expectations have been lowered.

Page 5: A New Model for Web Resource Harvesting

Web Robots

www.getty.edu
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc100; last mod 2003-09-11

what documents have been modified since 2003-11-15?

what is this file?
what are its relationships to other files?
how often does it change?

robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG

Page 6: A New Model for Web Resource Harvesting

A More Efficient Way

what documents have been modified since 2003-11-15?

www.getty.edu with mod_oai
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc100; last mod 2003-09-11

<co> <metadata/> <link/> <link/> <change/> …</co>
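
The question on this slide is a date-selective query that the server can answer itself if it keeps track of modification dates. A minimal sketch of that selection, assuming a hypothetical in-memory index (the names `LAST_MODIFIED` and `modified_since` are illustrative and not part of mod_oai):

```python
from datetime import date

# Hypothetical index of last-modified dates; the values are the ones on the slide.
LAST_MODIFIED = {
    "doc1": date(2003, 3, 12),
    "doc2": date(2002, 7, 19),
    "doc100": date(2003, 9, 11),
}

def modified_since(index, since):
    """Return the names of documents last modified on or after `since`."""
    return sorted(name for name, mod in index.items() if mod >= since)

print(modified_since(LAST_MODIFIED, date(2003, 11, 15)))
```

With the slide's dates nothing qualifies, so a robot that has to visit every document learns only that nothing changed, while a single date-selective request returns the empty answer directly.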

Page 7: A New Model for Web Resource Harvesting

Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research

Page 8: A New Model for Web Resource Harvesting

A Very Brief OAI-PMH History

• Universal Preprint Service
  o A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
    - not distributed searching
  o Demonstrated at Santa Fe NM, October 21-22, 1999
    - D-Lib Magazine, 6(2) 2000 (2 articles)
    - http://www.dlib.org/dlib/february00/02contents.html
  o UPS was soon renamed the Open Archives Initiative (OAI)
    - http://www.openarchives.org/

• The OAI has authored, among other things, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
  o in use around the world, 600+ known instances
    - registration not required; many unknown instances
    - http://gita.grainger.uiuc.edu/registry/
    - http://celestial.eprints.org/cgi-bin/status
  o used by Google ca. late 2004
    - http://www.nla.gov.au/digicoll/oai/

Page 9: A New Model for Web Resource Harvesting

“A repository is a network accessible server that can process the 6 OAI-PMH requests … A repository is managed by a data provider to expose metadata to harvesters.” 

“A harvester is a client application that issues OAI-PMH requests.  A harvester is operated by a service provider as a means of collecting metadata from repositories.”

Data Providers / Repositories

Service Providers / Harvesters

Page 10: A New Model for Web Resource Harvesting

Aggregators

data providers (repositories)

service providers (harvesters)

aggregator

aggregators allow for:
• scalability for OAI-PMH
• load balancing
• community building
• discovery

Page 11: A New Model for Web Resource Harvesting

OAI-PMH data model

resource

item: entry point to all records pertaining to the resource
  o OAI-PMH identifier
  o OAI-PMH sets

records: metadata pertaining to the resource
  o Dublin Core metadata
  o MARCXML metadata
  o OAI-PMH identifier, metadataPrefix, datestamp

Page 12: A New Model for Web Resource Harvesting

Overview of OAI-PMH Verbs

Verb                  Function
Identify              description of repository
ListMetadataFormats   metadata formats supported by repository
ListSets              sets defined by repository
ListIdentifiers       OAI unique ids contained in repository
ListRecords           listing of N records
GetRecord             listing of a single record

Identify, ListMetadataFormats, ListSets: metadata about the repository
ListIdentifiers, ListRecords, GetRecord: harvesting verbs

most verbs take arguments: dates, sets, ids, metadata formats, and resumption token (for flow control)
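
The resumption token mentioned above is what lets a harvester page through a long list. A minimal sketch of that loop in Python, assuming a caller-supplied `fetch` function (hypothetical) that takes a dict of OAI-PMH arguments and returns the response body as an XML string; no network code is shown:

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest_identifiers(fetch, metadata_prefix="oai_dc"):
    """Collect identifiers from ListIdentifiers, following resumption tokens.

    `fetch` is a hypothetical callable: dict of arguments -> XML response string.
    """
    args = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    identifiers = []
    while True:
        root = ET.fromstring(fetch(args))
        for header in root.iter(OAI_NS + "header"):
            identifiers.append(header.findtext(OAI_NS + "identifier"))
        token = root.find(".//" + OAI_NS + "resumptionToken")
        # A missing or empty resumptionToken means the list is complete.
        if token is None or not (token.text or "").strip():
            break
        # The token is an exclusive argument: it is sent with the verb only.
        args = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}
    return identifiers
```

Passing `fetch` in keeps the sketch independent of any particular HTTP library; a real harvester would also handle OAI-PMH error responses, which are omitted here.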

Page 13: A New Model for Web Resource Harvesting

Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research

Page 14: A New Model for Web Resource Harvesting

Resource Harvesting: Use cases

• Discovery: use content itself in the creation of services
  o search engines that make full-text searchable
  o citation indexing systems that extract references from the full-text content
  o browsing interfaces that include thumbnail versions of high-quality images from cultural heritage collections

• Preservation:
  o periodically transfer digital content from a data repository to one or more trusted digital repositories
  o trusted digital repositories need a mechanism to automatically synchronize with the originating data repository

Ideas first presented in Van de Sompel, Nelson, Lagoze & Warner, http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html

Page 15: A New Model for Web Resource Harvesting

Existing OAI-PMH based approaches

Typical scenario:

1. An OAI-PMH harvester harvests Dublin Core records from the OAI-PMH repository.

2. The harvester analyzes each Dublin Core record, extracting dc.identifier information in order to determine the network location of the described resource.

3. A separate process, out-of-band from the OAI-PMH, collects the described resource from its network location.

Page 16: A New Model for Web Resource Harvesting

Existing OAI-PMH based approaches : Issue 1

Locating the resource based on information provided in dc.identifier

• dc.identifier is used to convey a variety of identifiers (simultaneously): URL, DOI, bibliographic citation, …
  o Not expressive enough to distinguish between identifier and locator
  o Several dereferencing attempts required

• URI provided in dc.identifier is commonly that of a bibliographic “splash page”
  o How to know it is a bibliographic “splash page”, not the resource?
  o If it is a bibliographic “splash page”, where is the resource?

Page 17: A New Model for Web Resource Harvesting

Existing OAI-PMH based approaches : Issue 2

Using the OAI-PMH datestamp of the Dublin Core record to trigger incremental harvesting:

Datestamp of DC record does not necessarily change when resource changes

                      no DC datestamp change     DC datestamp change
no resource update    OK                         unnecessary resource download
resource update       missed resource update     OK
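
The four cases on this slide can be written as a small decision function. This is only an illustration of the table; the function name and its boolean arguments are made up for the example:

```python
def harvest_outcome(resource_updated, dc_datestamp_changed):
    """Classify what a harvester triggered by the DC datestamp ends up doing."""
    if resource_updated == dc_datestamp_changed:
        return "OK"  # harvester and resource agree: nothing missed, nothing wasted
    if dc_datestamp_changed:
        return "unnecessary resource download"
    return "missed resource update"

for updated in (False, True):
    for changed in (False, True):
        print(updated, changed, harvest_outcome(updated, changed))
```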

Page 18: A New Model for Web Resource Harvesting

Existing OAI-PMH based approaches : Conventions

Cannot really address issue 2 (datestamps) with metadata conventions

Issue 1 (identifier & locator of the resource) is currently addressed with a range of conventions

First dc.identifier is locator of the resource
- what if the resource is not digital?

Use of dc.format and/or dc.relation to convey locator

Page 19: A New Model for Web Resource Harvesting

Existing OAI-PMH based approaches : Conventions

<oai_dc:dc>
  <dc:title>A Simple Parallel-Plate Resonator Technique for Microwave. Characterization of Thin Resistive Films</dc:title>
  <dc:creator>Vorobiev, A.</dc:creator>
  <dc:subject>ING-INF/01 Elettronica</dc:subject>
  <dc:description>A parallel-plate resonator method is proposed for non-destructive characterisation of resistive films used in microwave integrated circuits. A slot made in one ... </dc:description>
  <dc:publisher>Microwave engineering Europe</dc:publisher>
  <dc:date>2002</dc:date>
  <dc:type>Documento relativo ad una Conferenza o altro Evento</dc:type>
  <dc:type>PeerReviewed</dc:type>
  <dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
  <dc:format>pdf http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf </dc:format>
</oai_dc:dc>

Annotations: dc:identifier = splash page; dc:format = locator of resource
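
A harvester that relies on this dc:format convention has to guess which token is the locator. A minimal sketch in Python of that guesswork, using a shortened copy of the record; the namespace declarations are added here because the slide omits them, and `guess_locators` is a made-up helper name:

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Shortened copy of the slide's record, with namespace declarations added.
RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
  <dc:format>pdf http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf </dc:format>
</oai_dc:dc>"""

def guess_locators(record_xml):
    """Apply the dc:format convention: any token that looks like a URL is a locator."""
    root = ET.fromstring(record_xml)
    identifiers = [e.text.strip() for e in root.iter(DC + "identifier") if e.text]
    locators = [tok for e in root.iter(DC + "format") if e.text
                for tok in e.text.split() if tok.startswith(("http://", "https://"))]
    return {"identifiers": identifiers, "locators": locators}

print(guess_locators(RECORD))
```

The heuristic works only for repositories that follow this particular convention, which is the weakness the slide is pointing at.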

Page 20: A New Model for Web Resource Harvesting

Existing OAI-PMH based approaches : Conventions

<dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>

<dc:relation>http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf</dc:relation>

Annotations: dc:identifier = splash page; dc:relation = locator of resource

Page 21: A New Model for Web Resource Harvesting

Existing OAI-PMH based approaches : Conventions

<dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>

<dc:relation>http://resolver.unibo.it/00000014/</dc:relation>

<dc:relation>http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf</dc:relation>

Annotations (three labels on the slide): splash page; splash page; locator of resource

Page 22: A New Model for Web Resource Harvesting

Existing OAI-PMH based approaches : Other attempts

• dc.identifier leads to splash page & splash page contains special purpose XHTML link to resource(s)
  o What if there is no splash page?
  o How does a harvester recognize this situation?

• OA-X: protocol extension
  o OK in local context
  o Strategic problem to generalize
  o How to consolidate with OAI-PMH data model

• Qualified Dublin Core
  o Could bring expressiveness to distinguish between locator & identifier
  o But what about the datestamp issue?

Page 23: A New Model for Web Resource Harvesting

Proposed OAI-PMH based approach

Use metadata formats that were specifically created for representation of digital objects:

• Complex Object Formats as OAI-PMH metadata formats
  o MPEG-21 DIDL, METS, ..

Page 24: A New Model for Web Resource Harvesting

OAI-PMH data model

resource

item: OAI-PMH identifier = entry point to all records pertaining to the resource

records: metadata pertaining to the resource
  o Dublin Core metadata: simple
  o MARCXML metadata: more expressive
  o METS: highly expressive
  o MPEG-21 DIDL: highly expressive

Page 25: A New Model for Web Resource Harvesting

Complex Object Formats : characteristics

• Representation of a digital object by means of a wrapper XML document.

• Represented resource can be:
  o simple digital object (consisting of a single datastream)
  o compound digital object (consisting of multiple datastreams)

• Unambiguous approach to convey identifiers of the digital object and its constituent datastreams.

• Include datastream:
  o By-Value: embedding of base64-encoded datastream
  o By-Reference: embedding network location of the datastream
  o not mutually exclusive; equivalent

• Include a variety of secondary information
  o By-Value
  o By-Reference
  o Descriptive metadata, rights information, technical metadata, …

Page 26: A New Model for Web Resource Harvesting

<didl:DIDL>
 <didl:Item>
  <didl:Descriptor>
   <didl:Statement mimeType="text/xml; charset=UTF-8">
    <dii:Identifier>http://amsacta.cib.unibo.it/archive/00000014/</dii:Identifier>
   </didl:Statement>
  </didl:Descriptor>
  <didl:Descriptor>
   <didl:Statement mimeType="text/xml; charset=UTF-8">
    <oai_dc:dc>
     <dc:title>A Simple Parallel-Plate Resonator Technique for Microwave. Characterization of Thin Resistive Films</dc:title>
     <dc:creator>Vorobiev, A.</dc:creator>
     <dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
     <dc:format>application/pdf</dc:format>
     …
    </oai_dc:dc>
   </didl:Statement>
  </didl:Descriptor>
  <didl:Component>
   <didl:Resource mimeType="application/pdf" ref="http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf"/>
  </didl:Component>
 </didl:Item>
</didl:DIDL>
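
In a DIDL document the identifier and the locator sit in different, dedicated places, so a parser does not have to guess. A minimal sketch in Python, using a shortened copy of the slide's document; the namespace URIs are an assumption (the slide omits the declarations) and `identifier_and_locators` is a made-up helper name:

```python
import xml.etree.ElementTree as ET

NS = {
    "didl": "urn:mpeg:mpeg21:2002:02-DIDL-NS",  # assumed; not shown on the slide
    "dii": "urn:mpeg:mpeg21:2002:01-DII-NS",    # assumed; not shown on the slide
}

# Shortened copy of the slide's document, with namespace declarations added.
DIDL_DOC = """<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS"
    xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS">
 <didl:Item>
  <didl:Descriptor><didl:Statement mimeType="text/xml; charset=UTF-8">
   <dii:Identifier>http://amsacta.cib.unibo.it/archive/00000014/</dii:Identifier>
  </didl:Statement></didl:Descriptor>
  <didl:Component>
   <didl:Resource mimeType="application/pdf"
     ref="http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf"/>
  </didl:Component>
 </didl:Item>
</didl:DIDL>"""

def identifier_and_locators(didl_xml):
    """Return the item identifier and a list of (mimeType, ref) pairs."""
    root = ET.fromstring(didl_xml)
    identifier = root.findtext(".//dii:Identifier", default="", namespaces=NS)
    locators = [(r.get("mimeType"), r.get("ref"))
                for r in root.iterfind(".//didl:Resource", NS) if r.get("ref")]
    return identifier.strip(), locators

print(identifier_and_locators(DIDL_DOC))
```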

Page 27: A New Model for Web Resource Harvesting

Complex Object Formats & OAI-PMH

• Resource represented via XML wrapper => OAI-PMH <metadata>

• Uniform solution for simple & compound objects
• Unambiguous expression of locator of datastream
• Disambiguation between locators & identifiers
• OAI-PMH datestamp changes whenever the resource (datastreams & secondary information) changes
• OAI-PMH semantics apply: “about” containers, set membership

Page 28: A New Model for Web Resource Harvesting

OAI-PMH based approach using Complex Object Format

Typical scenario:

1. An OAI-PMH harvester checks for support of a locally understood complex object format using the ListMetadataFormats verb

2. The harvester harvests the complex object metadata. Semantics of the OAI-PMH datestamp guarantee that new and modified resources are detected.

3. A parser at the end of the harvesting application analyzes each harvested complex object record:

- The parser extracts the bitstreams that were delivered By-Value.

- The parser extracts the unambiguous references to the network location of bitstreams delivered By-Reference.

4. A separate process, out-of-band from the OAI-PMH, collects the bitstreams delivered By-Reference from the extracted network locations.
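
Step 3 of this scenario can be sketched as a parser that separates By-Value bitstreams from By-Reference locations. The sketch below is in Python and assumes that a By-Value resource carries its base64 payload as the text of `didl:Resource` and that a By-Reference resource carries a `ref` attribute; the namespace URI and the function name are assumptions as well:

```python
import base64
import xml.etree.ElementTree as ET

DIDL = "{urn:mpeg:mpeg21:2002:02-DIDL-NS}"  # assumed namespace URI

def split_resources(didl_xml):
    """Return (by_value, by_reference) for every didl:Resource in one record.

    by_value holds decoded bytes; by_reference holds URLs that the separate,
    out-of-band process of step 4 would still have to fetch.
    """
    by_value, by_reference = [], []
    for res in ET.fromstring(didl_xml).iter(DIDL + "Resource"):
        if res.get("ref"):
            by_reference.append(res.get("ref"))
        elif (res.text or "").strip():
            by_value.append(base64.b64decode(res.text.strip()))
    return by_value, by_reference
```

Only the By-Reference list needs further network traffic; everything delivered By-Value is already in hand once the OAI-PMH response has been parsed.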

Page 29: A New Model for Web Resource Harvesting

Complex Object Formats & OAI-PMH : issues

• Which Complex Object Format(s)
• How to Profile Complex Object Format(s) for OAI-PMH Harvesting
• Large records
• Making resources re-harvestable
• Because the resource is represented as <metadata>, can rights pertaining to the resource be expressed according to the “rights for metadata” OAI-rights guideline?
• Tools:
  o Software library to write compliant complex objects
  o Integration of this library with repository systems (Fedora, DSpace, eprints.org, ….)

Page 30: A New Model for Web Resource Harvesting

Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research

Page 31: A New Model for Web Resource Harvesting

mod_oai approach

• Goal: integrate OAI-PMH functionality into the web server itself…

• mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
  o written in C
  o respects values in .htaccess, httpd.conf

• compile mod_oai on http://www.foo.edu/

• baseURL is now http://www.foo.edu/modoai
  o Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
    - http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg
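
The example request on this slide is an ordinary HTTP GET whose query string carries the OAI-PMH arguments. A minimal sketch in Python of assembling it; the helper `oai_request_url` and its `from_` keyword are illustrative names, not part of mod_oai:

```python
from urllib.parse import urlencode

BASE_URL = "http://www.foo.edu/modoai"  # example baseURL from the slide

def oai_request_url(base_url, verb, **args):
    """Build an OAI-PMH GET URL; `from_` stands in for the reserved word `from`."""
    params = {"verb": verb}
    for key, value in args.items():
        params[key.rstrip("_")] = value
    # safe=":" keeps set specs such as mime:video:mpeg readable in the URL.
    return base_url + "?" + urlencode(params, safe=":")

print(oai_request_url(BASE_URL, "ListIdentifiers",
                      metadataPrefix="oai_dc",
                      from_="2004-09-15",
                      set="mime:video:mpeg"))
```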

Page 32: A New Model for Web Resource Harvesting

OAI-PMH data model in mod_oai

resource
  o example: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf

item: OAI-PMH identifier = entry point to all records pertaining to the resource
  o OAI-PMH sets: MIME type

records: metadata pertaining to the resource
  o HTTP header metadata
  o Dublin Core metadata
  o MPEG-21 DIDL

Page 33: A New Model for Web Resource Harvesting

OAI-PMH concepts : typical repository

OAI-PMH Entity      value             description
Resource            URL               PDF, PS, XML, HTML or other file
Item
  identifier        OAI Identifier    DNS-based name of metadata about resource
  set membership    LCSH              Library of Congress Subject Heading
Record
  metadataPrefix    oai_dc            bibliographic metadata in Dublin Core
  datestamp         2004-10-18        modification date of DC record
Record
  metadataPrefix    oai_marc          bibliographic metadata in MARC
  datestamp         2004-07-31        modification date of MARC record

Page 34: A New Model for Web Resource Harvesting

OAI-PMH concepts : mod_oai

OAI-PMH Entity      value          description
Resource            URL            HTML, GIF, PDF or other web file
Item
  identifier        URL            same URL as the resource
  set membership    MIME type      MIME type of the resource
Record
  metadataPrefix    http_header    the http headers that would have been returned via HTTP GET/HEAD
  datestamp         2004-07-31     modification date of resource
Record
  metadataPrefix    oai_dc         a subset of http_header in DC
  datestamp         2004-07-31     modification date of resource
Record
  metadataPrefix    oai_didl       MPEG-21 DIDL: base64 encoded resource + http_header metadata
  datestamp         2004-07-31     modification date of resource
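
The mod_oai mapping in this table can be illustrated with a short Python function that derives the item-level values from a URL and a file modification time. This is a sketch of the mapping only, not the module's C implementation; `item_for` is a made-up name, and the "mime:type:subtype" spelling of the set is taken from the request example earlier in the talk:

```python
import mimetypes
from datetime import datetime, timezone

def item_for(url, mtime):
    """Describe a web file in the terms of the mod_oai table.

    `mtime` is the file's modification time in seconds since the epoch.
    """
    mime, _ = mimetypes.guess_type(url)
    mime = mime or "application/octet-stream"  # fallback chosen for this sketch
    return {
        "identifier": url,                            # same URL as the resource
        "setSpec": "mime:" + mime.replace("/", ":"),  # set membership = MIME type
        "datestamp": datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d"),
    }
```

Because the datestamp is the modification date of the resource itself, a change to the file and a change to the record cannot drift apart as they can in a typical repository.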

Page 35: A New Model for Web Resource Harvesting

Resource Discovery: ListIdentifiers

harvester
• issues a ListIdentifiers,
• finds URLs of updated resources
• does HTTP GETs (updates only)
• can get URLs of resources with specified MIME types

Page 36: A New Model for Web Resource Harvesting

Preservation: ListRecords

harvester
• issues a ListRecords,
• gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference)
• can get resources with specified MIME types

Page 37: A New Model for Web Resource Harvesting

performance of mod_oai and wget on www.cs.odu.edu

column labels (as extracted from the slide): wget; mod_oai; index.html as seed; "find . -type f" as seed; files

# of files in baseline        709     5739    5268
# of files in update (25%)    114     1318    1335

for more detail: “mod_oai: An Apache Module for Metadata Harvesting”, http://arxiv.org/abs/cs.DL/0503069

Page 38: A New Model for Web Resource Harvesting

Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research

Page 39: A New Model for Web Resource Harvesting

Issues and Future Work

• For a given server, there is a set of URLs, U, and a set of files, F
  o Apache maps U → F
  o mod_oai maps F → U

• Neither function is 1-1 nor onto
  o We can easily check if a single u maps to F, but given F we cannot (easily) generate U

• Short-term issues:
  o dynamic files
    - exporting unprocessed server-side files would be a security hole
  o IndexIgnore
    - httpd will “hide” valid URLs
  o File permissions
    - httpd will advertise files it cannot read

• Long-term issues
  o Alias, Location
    - files can be covered up by the httpd
  o UserDir
    - interactions between the httpd and the filesystem

Page 40: A New Model for Web Resource Harvesting

IndexIgnore & File Permissions

Page 41: A New Model for Web Resource Harvesting

Alias: Covering Up Files

httpd.conf:
Alias /A /usr/local/web/htdocs/B
Alias /B /usr/local/web/htdocs/A

the files “A” and “B” will be different from the URLs
http://server/A
http://server/B
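
Why these two Alias lines break a file-to-URL inversion can be shown with a toy model in Python. It is a simplification for illustration, not Apache's actual resolution logic; the DocumentRoot value is an assumption inferred from the Alias targets, and both function names are made up:

```python
ALIASES = {"/A": "/usr/local/web/htdocs/B", "/B": "/usr/local/web/htdocs/A"}
DOCROOT = "/usr/local/web/htdocs"  # assumed DocumentRoot, implied by the Alias targets

def url_to_file(path):
    """Map a URL path to a file path: Alias entries win over the DocumentRoot."""
    for prefix, target in ALIASES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return target + path[len(prefix):]
    return DOCROOT + path

def naive_file_to_url(file_path):
    """Invert the mapping by stripping DOCROOT, ignoring Alias (the tempting shortcut)."""
    return file_path[len(DOCROOT):]

file_a = "/usr/local/web/htdocs/A"
print(naive_file_to_url(file_a))               # /A
print(url_to_file(naive_file_to_url(file_a)))  # /usr/local/web/htdocs/B
```

The round trip lands on file B, so a module that walks the filesystem and derives URLs from paths would advertise each file under the other one's URL.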

Page 42: A New Model for Web Resource Harvesting

UserDir: “Just in Time” mounting of directories

whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/ tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf%

Page 43: A New Model for Web Resource Harvesting

Looking Further Down the Road for mod_oai

• “Reverse” the method of URL discovery
  o cannot look to the files;
  o listen to incoming requests and build a list of valid URLs
    - could be seeded with files at start
    - also the method for handling server processed files / URLs

• Plug-ins for descriptive metadata
  o DC tags in HTML
  o MS Office formats, PDF
  o Tags from JPEG, TIFF, MP3, etc.

• Additional metadata in the DIDL
  o technical metadata from JHOVE
  o estimated change rate
    - cf. Cho & Garcia-Molina, ACM TOIT 28(4)

• http log access as separate metadata formats
  - cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)
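
The "listen to incoming requests" idea from the first bullet can be approximated offline by reading an access log. A minimal sketch in Python, assuming log lines in Common Log Format and counting only requests answered with status 200; the function name and the regular expression are illustrative, and a module inside the server would observe requests directly instead:

```python
import re

# Request line and status in Common Log Format, e.g. "GET /a.html HTTP/1.0" 200
_REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3}) ')

def valid_urls(log_lines, seed=()):
    """Build the set of URL paths that the server has answered with 200."""
    urls = set(seed)  # optionally seeded with URLs derived from files at start
    for line in log_lines:
        match = _REQUEST.search(line)
        if match and match.group(2) == "200":
            urls.add(match.group(1))
    return urls
```

This also picks up URLs of server-processed content, which never appear as plain files on disk.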

Page 44: A New Model for Web Resource Harvesting

Expanding OAI-PMH / Complex Object Access

• OAI-PMH / CO access for:
  o blogs
  o message boards
  o native file systems
    - e.g. Mac OS X “Spotlight”

• More aggressive use of OAI-PMH / CO for preservation
  o recently funded NSF DIGARCH program
  o use for preservation:
    - Usenet
    - Email
    - Multicasting

Page 45: A New Model for Web Resource Harvesting

OAI-PMH + Complex Objects: A New Model for Web Resource Harvesting

• Better web harvesting can be achieved through:
  o OAI-PMH: structured access to updates
  o Complex object formats: modeled representation of digital objects

• Use cases:
  o Preservation (ListRecords)
  o Web crawling (ListIdentifiers)

• mod_oai: reference implementation
  o Better performance than wget
  o static files only; dynamic files in the future
  o not a replacement for DSpace, Fedora, eprints.org, etc.

• More info:
  o http://www.modoai.org/
  o http://whiskey.cs.odu.edu/
