Digital Preservation
UN FAO
May 23-24, 2011
Tom Cramer
Perry Willett
Stephen Abrams
Sheila Morrissey
Agenda, Day 1
10:30 – 10:45 Welcome, introductions, and review of objectives and agenda
10:45 – 12:30 Preservation goals and concepts: long-term usability, risk, trust
12:30 – 13:30 Lunch
13:30 – 14:15 Policy frameworks and business/technical sustainability
14:15 – 15:00 Strategies: redundancy/replication, migration, emulation
15:00 – 15:30 Afternoon break
15:30 – 16:30 Standards and best practices: reformatting, OAIS, PREMIS, Drambora, TRAC
16:30 – 17:00 Questions and discussion
Agenda, Day 2
08:30 – 08:35 Review of objectives and agenda
08:35 – 09:30 Infrastructure and tools
09:30 – 10:30 Case study: preservation activities at CDL
10:30 – 11:00 Morning break
11:00 – 12:00 Case study: preservation activities at Portico
12:00 – 12:30 Preservation initiatives and organizations: DataNet, DCC, DPC, IIPC, NDSA, OPF
12:30 – 13:30 Lunch
13:30 – 14:00 Case study: preservation activities at Stanford
14:00 – 15:00 Other preservation resources
15:00 – 15:30 Afternoon break
15:30 – 16:00 Format characterization
16:00 – 16:30 Characterization in preservation workflows
16:30 – 17:00 Questions and discussion
Introduction
• Libraries, archives and digital preservation
Policies for:
• Designated community
• Content acquisition
• Ownership
• Access
• Security
• Privacy
• Takedown challenges
• Legal agreements
• Copyright, intellectual property
Planning, I
• Perfection is impossible.
• Organizations need to constantly improve their preservation policies and activities.
• The goal of digital preservation is to ensure continued access to important content.
Planning II
• Designated community: who is the audience/clientele?
• Expectations, requirements
• Roles
• What content will you preserve? Legacy data? Projected growth? Metadata? Access? Archival retention period?
Planning III
• Do you have adequate resources to provide service?
• Staff, administration, equipment, training
http://www.flickr.com/photos/nationallibrarynz_commons/3326203787/
Planning IV
• Roles/staff: system administrators, developers, archivists, business analysts, metadata specialists, product managers, administrators
http://digital.nls.uk/74548646
Planning V
• Do you have authority/permission to archive content? Do the organizations have the rights to submit it?
• Legal environment: legislative mandates, copyright restrictions, agreements with rightsholders
http://arcweb.archives.gov/arc/action/ExternalIdSearch?id=1696015
Planning VI
• Disaster planning
• Business planning
• Risk mitigation
http://commons.wikimedia.org/wiki/File:I40_Bridge_disaster.jpg
Plato
• The Preservation Planning Tool:
http://www.ifs.tuwien.ac.at/dp/plato
• Planets Project of the Digital Preservation Lab at the Vienna University of Technology
– Define requirements
– Evaluate alternatives
– Analyze results
– Build preservation plan
SLAs
• Service level agreements make explicit the terms of the service
– Define roles, terms, rights, permitted uses
– Define service period
– When is the service available?
– When are maintenance outages scheduled?
– When do you respond to support requests?
– Who is notified for unscheduled outages?
– Process to end agreement
Sustainability
• Funding for the long term
– What’s the total cost of preservation?
– What resources are available?
– What’s possible given resource constraints?
BRTF
• Blue Ribbon Task Force on Sustainable Digital Preservation. Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digital Information. February 2010.
http://brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf
BRTF
• Barriers to sustainable digital access and preservation include:
– Inadequacy of funding models
– Confusion and/or lack of alignment between stakeholders, roles, and responsibilities
– Inadequate incentives to support collaboration
– Complacency that current practices are good enough
– Fear that digital access and preservation is too big.
BRTF
• Conditions for sustainable digital preservation:
– Recognition of the benefits by decision makers
– Process for selecting materials with long-term value
– Incentives for decision makers to preserve in the public interest
– Appropriate organization and governance
– Mechanisms to secure adequate resources
LIFE
• Life Cycle Information for E-Literature
http://www.life.ac.uk/
• University College London and the British Library, funded by JISC
• Developed a model and a tool for predicting the costs of preserving digital content
LIFE
Strategies: Replication
• Data redundancy/data replication: geographical (Japan) (New Zealand) (California)
• System manufacturer/OS heterogeneity: not reliant on single manufacturer (Sun bought by Oracle, Isilon bought by EMC)
• Issues with replication: how much data? How quickly can it be moved? How often does it change? How to monitor data on multiple systems?
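The questions "how much data?" and "how quickly can it be moved?" reduce to simple arithmetic. A minimal sketch, assuming decimal terabytes and a hypothetical `efficiency` factor for protocol overhead (both are illustrative assumptions, not figures from the presenters):

```python
def transfer_days(size_tb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate the days needed to copy size_tb terabytes over a link_mbps link.

    efficiency is an assumed fraction of nominal bandwidth actually achieved.
    """
    bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86400                         # seconds per day


# Example: a 26 TB collection over a 1 Gbps link at 70% efficiency
print(round(transfer_days(26, 1000), 1), "days")
```

An estimate like this helps decide whether a replica can be seeded over the network or should be shipped on physical media.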
Strategies: Identifiers
• Uniquely identify objects within repository
• Bind identifiers to target object
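One way to picture "uniquely identify" and "bind" is a registry that mints an identifier and records the digest of the object it names. This is a sketch only: `IdentifierRegistry` is a hypothetical class, and UUIDs stand in for whatever scheme a repository actually uses.

```python
import hashlib
import uuid


class IdentifierRegistry:
    """Hypothetical in-memory registry binding identifiers to object content."""

    def __init__(self):
        self._bindings = {}

    def mint(self, content: bytes) -> str:
        # Each call yields a new identifier, even for identical content.
        identifier = f"urn:uuid:{uuid.uuid4()}"
        self._bindings[identifier] = hashlib.sha256(content).hexdigest()
        return identifier

    def is_bound_to(self, identifier: str, content: bytes) -> bool:
        # True only if the identifier exists and still names this exact content.
        return self._bindings.get(identifier) == hashlib.sha256(content).hexdigest()


registry = IdentifierRegistry()
object_id = registry.mint(b"annual report 2010")
print(object_id, registry.is_bound_to(object_id, b"annual report 2010"))
```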
Strategies: Format Migration
Why:
• To manage a limited number of formats in the repository
• To create an access copy
• To convert to a preservation-ready format
• To take advantage of additional functionality
• In response to identified risk of format failure
Strategies: Format Migration
When:
• At time of deposit, as part of ingest (TIFF->JPEG2000)
• As a batch migration (SGML->XML)
• Lossless/lossy
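The "when" of migration can be recorded as a simple policy table. A minimal sketch, using the two format pairs named on this slide; the table layout and the `migration_target` helper are hypothetical:

```python
# Source format -> (target format, stage at which migration happens)
MIGRATION_POLICY = {
    "TIFF": ("JPEG2000", "ingest"),
    "SGML": ("XML", "batch"),
}


def migration_target(source_format: str, stage: str):
    """Return the target format if the policy migrates this format at this stage."""
    target, when = MIGRATION_POLICY.get(source_format, (None, None))
    return target if when == stage else None


print(migration_target("TIFF", "ingest"))   # JPEG2000
print(migration_target("TIFF", "batch"))    # None
```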
Strategies: Emulation
• Increasingly robust option, particularly for operating systems and firmware.
• Intellectual property and copyright complexities
• Continued challenges for maintenance and reliance.
Strategies: Metadata
• Sufficient metadata to understand object
• Descriptive, technical, administrative, preservation
• How much is enough?
Strategies: Fixity I
• Fixity: test for corruption
– Corruption could happen during submission or storage
• Process:
– Calculate digest at ingest
– Recalculate later
– Compare
Mike Johnson: TheBusyBrain.com
http://www.flickr.com/photos/thebusybrain/2492945625/
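The fixity process (calculate a digest at ingest, recalculate later, compare) can be sketched with Python's standard `hashlib`. The function names and the choice of SHA-256 are illustrative assumptions, not a prescribed implementation:

```python
import hashlib
import io


def compute_digest(stream, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Read a binary stream in chunks and return its hex digest."""
    h = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


def is_intact(stream, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """Recalculate the digest and compare it with the one recorded at ingest."""
    return compute_digest(stream, algorithm) == recorded_digest


# At ingest: record the digest. Later: recalculate and compare.
original = b"archival master"
recorded = compute_digest(io.BytesIO(original))
assert is_intact(io.BytesIO(original), recorded)
assert not is_intact(io.BytesIO(b"archival mastes"), recorded)   # one byte changed
```

For files on disk, the same functions work with a file opened in binary mode.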
Strategies: Fixity II
• Need to have a clear policy on how to respond in cases of corruption
• In conjunction with replication, fixity checking provides a robust infrastructure
http://www.flickr.com/photos/library_of_congress/2179131683/
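One possible response policy when fixity and replication are combined: compare the digests reported by each replica and treat the majority value as good. This is a sketch of one hypothetical policy, not a recommendation from the presenters; `audit_replicas` and the site names are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def audit_replicas(digests: Dict[str, str]) -> Tuple[Optional[str], List[str]]:
    """Return (trusted digest, replicas to repair).

    If no digest is held by a strict majority, nothing can be trusted
    automatically: return (None, all replicas) for manual review.
    """
    if not digests:
        return None, []
    value, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(digests):
        return None, sorted(digests)
    return value, sorted(name for name, d in digests.items() if d != value)


trusted, to_repair = audit_replicas({"site-a": "9f2c", "site-b": "9f2c", "site-c": "0000"})
print(trusted, to_repair)   # 9f2c ['site-c']
```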
Strategies: Backup
• Digital backups:
– Tape, disc, cloud
– Online, nearline, offline
– Frequency, retention period
• Physical backups: print, microfilm
Wikimedia Commons
http://commons.wikimedia.org/wiki/File:Backup_Backup_Backup_-_And_Test_Restores.jpg
Strategies: Access
• Access is the goal of preservation
• Users who are actively using digital content will be the first to spot problems
Cornell University
http://hdl.handle.net/1813.001/5skm
Best practices: Reformatting
Reformatting: best practices for digitization
• NISO: A Framework of Guidance for Building Good Digital Collections
http://www.niso.org/publications/rp/framework3.pdf
• JISC: Advice
http://www.jiscdigitalmedia.ac.uk/advice/
• NARA: Technical Information
http://www.archives.gov/preservation/technical/
Best practices: Reformatting
• Goal is to use documented, open standards.
• Many image, audio and video formats may have patented software for creation (encoding) and reading (decoding) but use an open metadata standard (e.g. MPEG-4, JPEG2000)
Standards: PREMIS
• PREMIS: Preservation Metadata: Implementation Strategies
• http://www.loc.gov/standards/premis
• A core set of preservation metadata: “the information a repository uses to support the digital preservation process.”
Standards: PREMIS
Basic data model includes:
• Intellectual entity
• Digital Object
• Event
• Agent
• Rights
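The five entities can be pictured as linked records. This is a simplified sketch: the field names are hypothetical and far smaller than the semantic units the PREMIS Data Dictionary actually defines.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    agent_id: str
    name: str


@dataclass
class Event:
    event_type: str      # e.g. "ingestion", "fixity check"
    object_id: str       # the digital object the event acted on
    agent_id: str        # who or what performed it
    outcome: str


@dataclass
class Rights:
    object_id: str
    basis: str           # e.g. "copyright", "license"


@dataclass
class DigitalObject:
    object_id: str
    format_name: str


@dataclass
class IntellectualEntity:
    title: str
    objects: List[DigitalObject] = field(default_factory=list)


def history(object_id: str, events: List[Event]) -> List[str]:
    """List the event types recorded against one digital object, in order."""
    return [e.event_type for e in events if e.object_id == object_id]


events = [
    Event("ingestion", "obj-1", "agent-1", "success"),
    Event("fixity check", "obj-1", "agent-1", "success"),
    Event("ingestion", "obj-2", "agent-1", "success"),
]
print(history("obj-1", events))
```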
Standards: OAIS
• OAIS: Open Archival Information System
• Model for digital preservation
– ISO 14721:2003
– Defines key terms: archival storage, content object, designated community, representation information, information package (content information and
– Defines process: Submission, Archive, Dissemination
OAIS Functional Entities
OAIS: Functions of Ingest
OAIS concepts
• SIP: Submission Information Package
• AIP: Archival Information Package
• DIP: Dissemination Information Package
• Information Object=Data object + Representation Information (Rep Info)
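The package flow (SIP in, AIP stored, DIP out) and the Information Object composition can be sketched as plain data classes. OAIS is a reference model, not an API, so every name below is a hypothetical illustration:

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InformationObject:
    data_object: bytes          # the bits themselves
    representation_info: str    # what is needed to interpret them, e.g. a format name


@dataclass(frozen=True)
class InformationPackage:
    kind: str                   # "SIP", "AIP", or "DIP"
    content: InformationObject
    metadata: dict = field(default_factory=dict)


def ingest(sip: InformationPackage) -> InformationPackage:
    """Turn a submission package into an archival package, recording a digest."""
    if sip.kind != "SIP":
        raise ValueError("ingest expects a SIP")
    digest = hashlib.sha256(sip.content.data_object).hexdigest()
    return InformationPackage("AIP", sip.content, {**sip.metadata, "sha256": digest})


def disseminate(aip: InformationPackage) -> InformationPackage:
    """Derive a dissemination package from an archival package."""
    if aip.kind != "AIP":
        raise ValueError("disseminate expects an AIP")
    return InformationPackage("DIP", aip.content, dict(aip.metadata))


sip = InformationPackage("SIP", InformationObject(b"abc", "text/plain"))
dip = disseminate(ingest(sip))
print(dip.kind, dip.metadata["sha256"][:8])
```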
Uses of OAIS
• It’s a reference model, not a blueprint
• Used to perform gap analysis
• Provides a way to measure business and administrative practices, succession plans, budgets, staffing
• A framework for audits
• TRAC, Drambora, Data Asset Framework, ISO (which one?)
• Self audit vs external audit
• CRL acting as external auditor; has completed audits for Portico, HathiTrust
TRAC
• Trustworthy Repository Audit and Certification Checklist
• Developed by OCLC, RLG and NARA, now managed by CRL
• Under consideration as an ISO standard
• http://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-and-certifying-0
TRAC
• A tool to guide auditors to discover gaps
• Three main categories:
– Organization infrastructure
– Digital Object management
– Technologies, Technical Infrastructure and Security
TRAC
• Can be used to conduct a self-audit, or used by external auditors
• Center for Research Libraries serves as an external auditor in the US
– Completed audits on Portico and HathiTrust
– Working on others
• Requires extensive documentation of policies, procedures, staffing, budgets, systems, and technology
Audits
• Quis custodiet ipsos custodes? Can audits be trusted?
– Not such a great track record in financial industry
– What are the benchmarks?
• Proposed ISO standard http://wiki.digitalrepositoryauditandcertification.org/bin/view
– Metrics for Audits
– Requirements for Auditing Organizations
DRAMBORA
• Digital Repository Audit Method Based on Risk Assessment (DCC and DPE)
• http://www.repositoryaudit.eu/
• A toolkit for administrators to conduct a self-assessment of their digital repositories
• Identify gaps in current policies, systems, staffing
• Measure likelihood and severity of the risk
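Measuring likelihood and severity lends itself to a simple ranking. A minimal sketch, assuming a 1 to 5 scale and a score of likelihood times severity; the scale, the example risks, and the names are illustrative assumptions rather than DRAMBORA's own worksheets:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int   # assumed scale: 1 (rare) to 5 (almost certain)
    severity: int     # assumed scale: 1 (minor) to 5 (catastrophic)

    @property
    def score(self) -> int:
        return self.likelihood * self.severity


def rank(risks: List[Risk]) -> List[Risk]:
    """Order risks from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


register = [
    Risk("Loss of key staff", 3, 3),
    Risk("Storage media failure", 4, 2),
    Risk("Loss of funding", 2, 5),
]
for risk in rank(register):
    print(risk.score, risk.name)
```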
DAF
• Data Asset Framework (formerly Data Audit Framework) from the DCC
• http://www.data-audit.eu/
• An online tool to help administrators “identify, locate, describe and assess” research data within their organizations
• Once they’ve acquired a better understanding, the tool will help assess the policies, practices and systems to manage research data
Technical Infrastructure
• Technical infrastructure: is it adequate to respond to user expectations?
– High availability?
– Backup copy of online versions?
– Can it scale up to meet future demand?
• Seemingly simple question: can you put data in, and get it back out, the same as it went in?
• Bigger picture: does the infrastructure adhere to the OAIS framework?
Support infrastructure
• Do you have adequate staffing to provide the service described in the Service Level Agreement?
Systems and tools
• JHOVE2 characterization: http://jhove2.org
• PRONOM: http://www.nationalarchives.gov.uk/PRONOM/Default.aspx
• Unified Digital Format Registry: http://udfr.org
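Format identification, the first step of characterization, often starts from a file's leading bytes. A minimal sketch using a handful of well-known signatures; it does not use JHOVE2 or PRONOM, and the `identify` helper and fallback type are illustrative choices:

```python
# (leading bytes, MIME type) pairs for a few common formats
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),    # little-endian TIFF
    (b"MM\x00*", "image/tiff"),    # big-endian TIFF
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def identify(header: bytes) -> str:
    """Guess a MIME type from the first bytes of a file."""
    for magic, mime_type in SIGNATURES:
        if header.startswith(magic):
            return mime_type
    return "application/octet-stream"   # unknown


print(identify(b"%PDF-1.4\n"))
```

Registries and characterization tools go much further (versions, validity, internal properties); a signature match is only a first guess.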
Systems and tools
• Local systems: Fedora, DSpace, ePrints, Ex Libris Rosetta
• Hosted systems: Merritt (CDL), Chronopolis (UCSD), Tessella Safety Deposit Box, DuraSpace, MetaArchive
• Local/Hosted: LOCKSS
CDL UC3: Who are we?
Case study: CDL
• Founded in 1997 by the University of California (UC)
• Provide service to 10 UC campuses:
– 222,000 students
– 121,000 faculty and staff members
• Work closely with libraries
• Services:
– Joint journal licensing and purchasing
– Union catalog (Melvyl)
– Digitization of special collections
– Scholarly communications and publishing
– Digital preservation
Case study: CDL UC3
• Digital Preservation Group (DPG)
– History: DPR, METS metadata, largely library clientele
• DPG is now the University of California Curation Center (UC3)
• New clientele, new requirements, new mandates
http://www.flickr.com/photos/65328860@N00/14320717
By Felix Burton (Flickr) [CC-BY-2.0 (www.creativecommons.org/licenses/by/2.0)], via Wikimedia Commons
Case study: CDL UC3
• Changing funding: Contracting state funding
• Changing clientele: Expanding to include a wide range of groups, researchers, museums, as well as libraries.
• Changing requirements and needs: data management and sharing requirements
• Challenge: CDL must work more efficiently, collaboratively, and find new revenue sources
Case study: CDL UC3
• Main UC3 services:
– EZID: http://n2t.net/ezid
– Web Archiving Service: http://was.cdlib.org
– Merritt Repository Service: http://merritt.cdlib.org
Case Study: UC3 EZID
• Easy identifiers for the long term
• A service to make and manage actionable ids
– User interface
– Programming interface for bulk ops (Dryad)
• Ids for anything: digital, physical, living, abstract
• Can manage identifiers under different schemes:
– ARKs, DOIs, and more to come (LSIDs, ...)
• Visit EZID at http://n2t.net/ezid
Case Study: UC3 Web Archiving Service
Build unique archives for local research communities
Geographically focused archives support local research
• Los Angeles
• Monterey Bay
• Orange County
• San Diego
• Santa Barbara
Topical archives support special research collections
• Guantanamo Bay / Tamiment Library
• California Water Districts / Water Resources Center Archives
Search Across all Sites in Archive
Easy to Use : Simple Workflow
Analyze Site Change
Are there new documents on this site?
Are there documents in my archive that have been removed from the live site?
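Both questions are set differences between what the archive holds and what the live site currently serves. A minimal sketch, assuming each capture is reduced to a set of URLs; `compare_captures` is a hypothetical helper, not part of the Web Archiving Service:

```python
from typing import Dict, Iterable, List


def compare_captures(archived: Iterable[str], live: Iterable[str]) -> Dict[str, List[str]]:
    """Report documents new on the live site and documents no longer there."""
    archived_set, live_set = set(archived), set(live)
    return {
        "new": sorted(live_set - archived_set),        # on the site, not yet archived
        "removed": sorted(archived_set - live_set),    # archived, gone from the site
    }


report = compare_captures(
    archived=["/index.html", "/reports/2009.pdf"],
    live=["/index.html", "/reports/2010.pdf"],
)
print(report)
```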
WAS Snapshot: Spring 2011
Stats:
• 22 organizations using service; several more joining shortly
• 4725 sites captured
• 26 terabytes of content stored
• 100+ archives under construction
• 35 archives published
Case study: Curation Micro-services
• Curation micro-services: work of Stephen Abrams, John Kunze and Patricia Cruse
• http://www.cdlib.org/uc3/curation/
• https://confluence.ucop.edu/display/Curation
Case Study: Micro-services
• Complex emergent behavior
• Low barrier, low maintenance, low commitment
• Policy neutral, protocol/platform independent
• The file system is the database
http://www.flickr.com/photos/oskay/265899811
The Unix philosophy
“Make each program do one thing well”
“To do a new job, build afresh rather than complicate old programs by adding new features”
“Expect the output of every program to become the input to another, as yet unknown, program”
“Design and build software … to be tried early”
“Don't hesitate to throw away the clumsy parts and rebuild them”
– M. D. McIlroy et al., “UNIX Time-Sharing System: Foreword,” Bell System Technical Journal 57:6, part 2 (1978): 1902
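The same idea carries over to code: small single-purpose functions whose output feeds the next. A minimal sketch with hypothetical stage names, showing composition rather than any actual micro-service:

```python
from functools import reduce


def strip_blank(lines):
    """Drop empty or whitespace-only lines."""
    return [line for line in lines if line.strip()]


def normalize(lines):
    """Trim whitespace and lowercase each line."""
    return [line.strip().lower() for line in lines]


def dedupe(lines):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(lines))


def pipeline(*stages):
    """Compose stages so each one's output becomes the next one's input."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


clean = pipeline(strip_blank, normalize, dedupe)
print(clean([" A ", "", "a", "B"]))   # ['a', 'b']
```

Each stage can be replaced or rebuilt without touching the others, which is the property the micro-services approach aims for.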
Curation micro-services
[Figure: matrix of micro-services with columns Mode, Focus, Value, Service, Valence, Visibility; UI / Access control / Message queuing span all services]
Value and service pairs, by mode:
• Curation
– Accretion: Annotation
– Visibility: Notification
– Accessibility: Access
– Derivation: Transformation
– Selectivity: Search
– Actionability: Index
– Stewardship: Ingest
• Preservation
– Epistemology: Characterization
– Ontology: Inventory
– Reliability: Replication
– Fixity: Fixity
– Stability: Storage
– Identity: Identity
• Grouping labels in the figure: Value, Utility, Application, Context, State, Interoperation, Interpretation, Protection, User-facing, Provider-facing
Merritt micro-services
• Merritt is built from a micro-services toolkit
– IdM/Authn/Authz: LDAP
– Persistent identifiers: EZID
– Persistent storage: CAN/Pairtree/Dflat/Checkm/ReDD
– Fixity: Fixity
– Replication: Replication
– Catalog: Inventory/4store
– Ingest: Ingest/Zookeeper
– Characterization: JHOVE2
– Discovery: XTF
– Transformation
– Notification
– Annotation
Version 2
GhOST/Shibboleth
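Pairtree, named in the persistent storage row, maps an identifier to a directory path two characters at a time, which is one concrete form of "the file system is the database". A sketch of the identifier cleaning and path mapping as described in the Pairtree specification; the function names are hypothetical and the `pairtree_root` prefix is omitted:

```python
def pairtree_clean(identifier: str) -> str:
    """Apply Pairtree's two-step character cleaning to an identifier."""
    out = []
    for ch in identifier:
        # Step 1: hex-encode reserved characters and anything outside visible ASCII.
        if ch in '"*+,<=>?\\^|' or not (0x21 <= ord(ch) <= 0x7E):
            out.append("".join(f"^{byte:02x}" for byte in ch.encode("utf-8")))
        else:
            out.append(ch)
    # Step 2: single-character substitutions for common identifier punctuation.
    return "".join(out).translate(str.maketrans("/:.", "=+,"))


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character directory names."""
    cleaned = pairtree_clean(identifier)
    return "/".join(cleaned[i:i + 2] for i in range(0, len(cleaned), 2))


print(pairtree_path("ark:/13030/xt12t3"))   # ar/k+/=1/30/30/=x/t1/2t/3
```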
Design goals
Principle of least surprise
Multiple interface modalities
– RESTful HTTP
– Command line
– Procedural (Java, Perl, Ruby, …)
Linked data
Stable URL references
The file system is the database
http://example-store/state/default/1234/3/xyz
– http://example-store/: Storage service
– state: State or content
– default: Storage node
– 1234: Object
– 3: Version
– xyz: File
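The labels on this slide (storage service, state or content, storage node, object, version, file) suggest a hierarchical address. A minimal sketch, assuming the segments appear in that order; `storage_url` is a hypothetical helper and not part of any Merritt client library:

```python
from urllib.parse import quote


def storage_url(service: str, kind: str, node: str, object_id: str,
                version: int, file_name: str) -> str:
    """Build an address of the assumed form service/kind/node/object/version/file."""
    if kind not in ("state", "content"):
        raise ValueError("kind must be 'state' or 'content'")
    segments = [kind, node, object_id, str(version), file_name]
    # Percent-encode each segment so identifiers containing '/' stay in one segment.
    return service.rstrip("/") + "/" + "/".join(quote(s, safe="") for s in segments)


print(storage_url("http://example-store/", "state", "default", "1234", 3, "xyz"))
```

Because the address encodes the whole hierarchy, a reference to a specific version of a file stays stable as new versions are added.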
Merritt repository
“How can I meet the data management requirements of my grant?”
“I know my desktop content is at risk; what should I do?”
“What’s a good way to share the data underlying a recent publication?”
“How can I ensure persistent availability?”
Model free
application/msword 342.5 KB
Strongly versioned
Easy submission
More info on CDL
• CDL UC3 home page: http://www.cdlib.org/uc3
• Curation Micro Services: https://confluence.ucop.edu/display/curation/
• Contact:
Other services/projects
• JHOVE2: http://jhove2.org
• Unified Digital Format Registry (UDFR)
https://bitbucket.org/udfr/main/
• DataONE
• DCXL (Digital Curation Excel)
• Data Management Plan Tool
https://bitbucket.org/dmptool/main/