Date post: | 23-Jun-2015 |
Upload: | jarcherumd |
Problems and Issues in Selecting, Harvesting, and Cataloging Web Resources
Joanne Archer and John Schalow
University of Maryland Libraries
Jargon
Crawler
Web Harvesting
Seed
Harvest
Crawl
Wayback Machine
Options for Web Harvesting
In-House Program
• e.g. Pandora, Web Curator Tool
• Pro: flexibility
• Con: $$$
Off-the-Shelf Software
• e.g. HTTrack, Adobe Web Capture
• Pro: inexpensive
• Con: not scalable
Third-Party Subscription
• e.g. Web Archiving Service, Archive-It
• Pro: ease of use
• Con: $
Key Questions for Harvesting Projects
uniqueness
ephemerality
research value
harvest frequency
scope
Maryland’s Pilot Harvests (2008-2010)
Historic Preservation
Maryland State Documents
Why harvest these areas?
• Collections are unique
• Builds on existing strengths in print collections
• Large amount of material migrating to the web
Harvesting
Harvesting Challenges:
• JavaScript
• Streaming media
• Form- and database-driven content
• Password-protected sites
• robots.txt files
• Multiple hosts/subdomains
Single host = www.preservemd.org
Multiple hosts = www.umd.edu, www.lib.umd.edu
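Two of the challenges above, robots.txt files and multiple hosts/subdomains, come down to simple checks a crawler makes for each URL it discovers. The following is a minimal Python sketch of those checks, not a description of how any of the tools named earlier work. The function names, the example-harvester user agent, and the sample robots.txt rules are assumptions made for illustration; only the host names come from the slides.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def host_of(url):
    """Return the lowercase host name of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def in_scope(url, seed, include_subdomains=False):
    """Decide whether a discovered URL falls inside the seed's crawl scope.

    By default only the seed's own host is in scope (the single-host case).
    With include_subdomains=True, other hosts under the same domain are
    accepted too (the multiple-host case).
    """
    host, seed_host = host_of(url), host_of(seed)
    if not host or not seed_host:
        return False
    if host == seed_host:
        return True
    if include_subdomains:
        # Assumption: a leading 'www.' is dropped so that a www.umd.edu
        # seed also admits hosts such as www.lib.umd.edu.
        base = seed_host[4:] if seed_host.startswith("www.") else seed_host
        return host == base or host.endswith("." + base)
    return False


def allowed_by_robots(robots_lines, agent, url):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS = ["User-agent: *", "Disallow: /private/"]
```

With a seed of http://www.umd.edu/, a link to http://www.lib.umd.edu/ is out of scope unless include_subdomains is set, which is why multi-host sites need explicit scoping decisions before a harvest. A path blocked by robots.txt is skipped by a crawler that honors the file, even when the host itself is in scope.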
End-User Access
collection note
subject heading
general material designation
URLs
uniform title
Conclusions
Challenges
• Start-up costs
• What to collect
• Metadata creation
But we are well prepared to meet the challenges.