Greg Janée • DCC seminar • 2005-09-27 2
A misadventure in preservation
• 1976
  – Viking probes go to Mars
  – soil data is analyzed for evidence of life
• 1999
  – USC neurobiologist Joseph Miller asks for data
  – NASA has data on tape!
• But...
  – tapes coded “in a format so old that the programmers who knew it had died”
Paradox of preservation
• Is the data valuable?
  – yes: had to travel to another planet to get it
• Is the data being used?
  – no
  – perhaps never again
• How much am I willing to pay for its preservation?
  – as close to zero as possible
Is it worth preserving?
• Keith’s equation*:
  – (current value) = (intrinsic value) - (cost to use)
• Greg’s equation:
  – item is worth preserving for time duration T if:
    • (intrinsic value) * Prob_T(usage) > T(preservation costs) + (cost to use)
*apologies to Keith Johnson, Stanford libraries
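
Greg's inequality can be tried out with a few numbers. The sketch below is one possible reading, not the author's formulation: it assumes that "T(preservation costs)" means the duration T multiplied by a constant per-year preservation cost, and that Prob_T(usage) is a single probability that the item is used at least once during T. The function name, parameter names, and all figures are hypothetical.

```python
def worth_preserving(intrinsic_value, prob_usage, duration_years,
                     annual_preservation_cost, cost_to_use):
    """Evaluate Greg's inequality under the assumptions stated above.

    Left side:  (intrinsic value) * Prob_T(usage)
    Right side: T * (per-year preservation cost) + (cost to use)
    """
    expected_benefit = intrinsic_value * prob_usage
    total_cost = duration_years * annual_preservation_cost + cost_to_use
    return expected_benefit > total_cost


# Hypothetical figures: a valuable item that is rarely used.
cheap = worth_preserving(1_000_000, 0.05, 25, 1_000, 10_000)
costly = worth_preserving(1_000_000, 0.05, 25, 2_000, 10_000)
print(cheap, costly)
```

With a low probability of use, the outcome is driven almost entirely by the recurring preservation cost, which matches the slide's point that archival has to cost "as close to zero as possible".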
Project genesis
• NDIIPP
  – Library of Congress, 2000
  – $100M
  – http://www.digitalpreservation.gov/
• NGDA
  – UCSB (MIL) & Stanford (Branner Library)
  – $2.6M, 3 years
  – geospatial data
  – http://www.ngda.org/
Project goal
• “How can we preserve geospatial data on a national scale and make it available to future generations?”
• No focus on a particular collection
• Geospatial data
  – discrete chunks
  – relatively highly structured, well-defined
  – but 90% of our work is generic
Idea #1
• Archival has to be cheap & easy
  – must be distributed
  – little incentive, no funding
  – not sexy
NGDA approach
• Compromise: define cheap archive
  – fundamental approach: preservation by co-archival of object semantics
  – ingest: one step up from crawling
  – web access
  – notable for what’s missing: discovery, usability
• Foundation for additional functionality
  – e.g., migration
  – prototype archives will offer ADL, OAI access
Idea #2
• Archival systems must be designed with their own demise in mind
  – archival objects will long outlive any system that manages them
  – system-level migrations will occur
  – at inopportune times
Typical repository architecture
[Diagram. Labels: system, database, storage, handle resolver; annotation: fragile]
NGDA architecture
[Diagram. Labels: storage subsystem (standard, public data model); archival system (databases, caches, etc.); ingest: bulk loader; access: ADL, OAI, Web]
Post-NGDA architecture
[Diagram. Labels: storage subsystem (standard, public data model); Web]
Storage system requirements
• Req’s:
  – associate UUIDs/RIDs with bitstreams
  – retrieve global/local bitstream by UUID/RID
  – determine (parent) UUID of any bitstream
  – list all UUIDs
• Satisfied by:
  – any filesystem
  – any kind of UUIDs
    • tag:library.ucsb.edu,2005:identifier
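
To illustrate how an ordinary filesystem could satisfy the four requirements above, here is a minimal sketch. It is not the NGDA implementation: the class name, method names, and the directory layout (one directory per UUID, one file per RID) are all assumptions made for illustration, and identifiers are percent-encoded only so that a tag-style identifier is a safe directory name.

```python
import tempfile
from pathlib import Path
from urllib.parse import quote, unquote


class FilesystemStore:
    """Hypothetical store: one directory per object, one file per bitstream."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _object_dir(self, uuid):
        # Percent-encode so any identifier scheme yields a usable name.
        return self.root / quote(uuid, safe="")

    def put(self, uuid, rid, data):
        """Requirement 1: associate a UUID/RID pair with a bitstream."""
        path = self._object_dir(uuid) / quote(rid, safe="")
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return path

    def get(self, uuid, rid):
        """Requirement 2: retrieve a bitstream by UUID/RID."""
        return (self._object_dir(uuid) / quote(rid, safe="")).read_bytes()

    def parent_uuid(self, path):
        """Requirement 3: determine the parent UUID of a stored bitstream."""
        return unquote(Path(path).parent.name)

    def list_uuids(self):
        """Requirement 4: list all UUIDs."""
        return sorted(unquote(p.name) for p in self.root.iterdir() if p.is_dir())


# Usage with the tag-style identifier from the slide and a made-up RID.
with tempfile.TemporaryDirectory() as tmp:
    store = FilesystemStore(tmp)
    uid = "tag:library.ucsb.edu,2005:identifier"
    stored = store.put(uid, "manifest.xml", b"<manifest/>")
    print(store.parent_uuid(stored), store.list_uuids())
```

Because the layout is nothing more than directories and files, the stored objects stay readable without the class itself, which is in the spirit of designing for the archival system's own demise.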
Archival object representation
• Components are files
• Manifest is an XML document
• Other approaches
  – OAIS: archival information packages (AIPs)
  – XMLtape
Ingest
• Ingest template defines
  – common structure of objects to be ingested
  – necessary validations
  – associations to other objects
    • assumes pre-loading of semantic definitions
      – policies, rights, etc.
• Represents choke point
  – requires human evaluation
Format registry
• We’re developing one
  – who isn’t?
• Serves as archive of format specifications
• How broadly to interpret “format”?
  – traditional file format
  – product
  – series, collection, arbitrary set
Format dependencies
• Consider dependency graph induced by format specifications
• Def: a format is recoverable if the format of its specification is recoverable
• Axioms: plain text, HTML are recoverable
[Diagram: format dependency graph. Nodes: HTML, GIF, GeoTIFF, CSS, plain text, TIFF; annotation: “desiccated” version]
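
The recursive definition of recoverability can be sketched as a walk over the dependency graph. The code below is an illustration under stated assumptions, not the project's registry logic: it generalizes the definition so that a specification may depend on several formats (all of which must be recoverable), it treats a circular chain of specifications as unrecoverable, and the edges in `SPEC_DEPENDS_ON` are invented for the example rather than read off the slide's diagram. `FormatX` and `FormatY` are made-up names.

```python
# Axioms from the slide: these formats are taken to be recoverable.
AXIOMS = {"plain text", "HTML"}

# Illustrative edges only: format -> formats its specification is written in.
SPEC_DEPENDS_ON = {
    "CSS": ["HTML"],
    "GIF": ["plain text"],
    "TIFF": ["plain text"],
    "GeoTIFF": ["HTML", "TIFF"],
    "FormatX": ["FormatY"],  # hypothetical pair whose specifications
    "FormatY": ["FormatX"],  # are only readable via each other
}


def is_recoverable(fmt, deps=SPEC_DEPENDS_ON, axioms=AXIOMS, visiting=None):
    """Return True if fmt is an axiom or all of its spec's formats are recoverable."""
    if fmt in axioms:
        return True
    visiting = set() if visiting is None else visiting
    if fmt in visiting or fmt not in deps:
        # Circular chain, or no archived specification at all.
        return False
    visiting.add(fmt)
    ok = bool(deps[fmt]) and all(
        is_recoverable(d, deps, axioms, visiting) for d in deps[fmt]
    )
    visiting.discard(fmt)
    return ok


for name in ("GeoTIFF", "FormatX", "unregistered format"):
    print(name, is_recoverable(name))
```

A registry built this way can flag formats whose specifications bottom out in something other than an axiom, before the people who could still read them are gone.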
Challenges
• Making ingest easy, easier, easier-er, ...
• GIS formats
  – very complex: topology, layer, coverage, project
  – proprietary
• MODIS
  – multiple petabytes
  – format (HDF) is not well-defined
  – moving to on-demand computation of products
  – lineage important
  – copious additional semantics
Misadventure, redux
• What if there had been an NGDA-like solution?
  – format specification would have been archived
• Limitations
  – data not necessarily immediately usable
  – format specification itself not necessarily viewable
• But limitations can be addressed according to usage, available resources