OGF-22 www.ogf.org PERG
Preservation Environments Research Group
• Organizers: Reagan Moore, Richard Marciano
• Goals:
  - Analyze capabilities required by a preservation environment
  - Define rule-based preservation environment - iRODS
  - RLG/NARA assessment criteria for a Trusted Digital Repository
    - CASPAR - representation information
    - SHAMAN - migration micro-services
  - Demonstrate creation of a preservation environment based on data grid technology
  - Demonstrate creation of preservation rules controlling a preservation environment
• Participants:
  - CASPAR - Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval
  - SHAMAN - Sustaining Heritage Access through Multivalent ArchiviNg
  - NCRIS - National Collaborative Research Infrastructure Strategy
  - PLANETS - Preservation and Long-term Access through Networked Services
  - MIT - DSpace digital library
  - NARA Transcontinental Persistent Archive Prototype
  - U Md - Producer Archive Workflow Network
  - UK Digital Curation Centre
  - Taiwan National Archives

Intellectual Property Policy
• I acknowledge that participation in OGF22 is subject to the OGF Intellectual Property Policy.
• Intellectual Property Notices Note Well: All statements related to the activities of the OGF and addressed to the OGF are subject to all provisions of Section 17 of GFD-C.1, which grants to the OGF and its participants certain licenses and rights in such statements. Such statements include verbal statements in OGF meetings, as well as written and electronic communications made at any time or place, which are addressed to:
  - the OGF plenary session,
  - any OGF working group or portion thereof,
  - the GFSG, or any member thereof on behalf of the GFSG,
  - the GFAC, or any member thereof on behalf of the GFAC,
  - any OGF mailing list, including any working group or research group list, or any other list functioning under OGF auspices,
  - the GFD Editor or the GWD process
• Statements made outside of a OGF meeting, mailing list or other function, that are clearly not intended to be input to an OGF activity, group or function, are not subject to these provisions.
• Excerpt from Section 17 of GFD-C.1: Where the GFSG knows of rights, or claimed rights, the OGF secretariat shall attempt to obtain from the claimant of such rights, a written assurance that upon approval by the GFSG of the relevant OGF document(s), any party will be able to obtain the right to implement, use and distribute the technology or works when implementing, using or distributing technology based upon the specific specification(s) under openly specified, reasonable, non-discriminatory terms. The working group or research group proposing the use of the technology with respect to which the proprietary rights are claimed may assist the OGF secretariat in this effort. The results of this procedure shall not affect advancement of document, except that the GFSG may defer approval where a delay may facilitate the obtaining of such assurances. The results will, however, be recorded by the OGF Secretariat, and made available. The GFSG may also direct that a summary of the results be included in any GFD published containing the specification.
• OGF Intellectual Property Policies are adapted from the IETF Intellectual Property Policies that support the Internet Standards Process.

Data Management Applications
• Data grids
  - Share data - organize distributed data as a collection
• Digital libraries
  - Publish data - support browsing and discovery
• Persistent archives
  - Preserve data - manage technology evolution
• Real-time sensor systems
  - Federate sensor data - integrate across sensor streams
• Workflow systems
  - Analyze data - integrate client- & server-side workflows
• Coalescence of requirements into generic infrastructure

Generic Infrastructure
• Data grids organize distributed data into shared collections
  - Persistent name spaces for files, users, storage
  - Collection attributes
    - Provenance, descriptive, system metadata
• Data grids manage heterogeneous storage systems
  - Standard operations across file systems, tape archives, object ring buffers
  - Enable management of technology evolution
    - At the point in time when new technology is available, both the old and new systems can be integrated

Preservation Requirements
• Authenticity
  - Maintain information about provenance of data
    - Assertions made about the file at the time of ingestion
• Integrity
  - Maintain information about the management of the data
    - Assertions made by the archivist
    - Access controls, audit trails, checksums, replication, synchronization, federation
• Infrastructure independence
  - Management of properties of records independently of choice of storage system
• Scalability
  - Management of large collections (billions of records, petabytes of data, thousands of attributes)
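The integrity requirement above (checksums plus replication) can be illustrated with a minimal Python sketch. It is not SRB or iRODS code: the in-memory `catalog` dictionary, the record identifier, and the site names are all hypothetical, and SHA-256 is simply an assumed choice of checksum.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Checksum used as the integrity assertion for a record."""
    return hashlib.sha256(data).hexdigest()


def ingest(record_id: str, data: bytes, catalog: dict) -> None:
    """Record the checksum at ingestion time (hypothetical in-memory catalog)."""
    catalog[record_id] = {"checksum": sha256_hex(data), "size": len(data)}


def verify_replicas(record_id: str, replicas: dict, catalog: dict) -> list:
    """Return the storage locations whose copy no longer matches the catalog."""
    expected = catalog[record_id]["checksum"]
    return [site for site, data in replicas.items()
            if sha256_hex(data) != expected]


catalog = {}
ingest("rec-001", b"original record bytes", catalog)

# Two replicas; the second one has been corrupted.
replicas = {
    "site-a": b"original record bytes",
    "site-b": b"original record bytez",
}
damaged = verify_replicas("rec-001", replicas, catalog)
print(damaged)
```

Because the checksum lives in the catalog rather than with the file, the same check works regardless of which storage system holds each replica.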

National Archives and Records Administration Transcontinental Persistent Archive Prototype
Federation of Seven Independent Data Grids
Extensible environment; can federate with additional research and education sites. Each data grid uses different vendor products.
[Figure: seven federated data grids, each with its own MCAT catalog: U Md, SDSC, Georgia Tech, NARA II, NARA I, Rocket Center, U NC]

Extremely Successful
• Storage Resource Broker (SRB) manages 2 PB of data in internationally shared collections
• Data collections for NSF, NARA, NASA, DOE, DOD, NIH, LC, NHPRC, IMLS; APAC, UK e-Science, IN2P3, KEK, …
  - Astronomy: Data grid
  - Bio-informatics: Digital library
  - Earth Sciences: Data grid
  - Ecology: Collection
  - Education: Persistent archive
  - Engineering: Digital library
  - Environmental science: Data grid
  - High energy physics: Data grid
  - Humanities: Data grid
  - Medical community: Digital library
  - Oceanography: Real-time sensor data, persistent archive
  - Seismology: Digital library, real-time sensor data
• Goal has been generic infrastructure for distributed data

Project | 5/17/02 GBs of data stored | 5/17/02 1000's of files | 6/30/04 GBs of data stored | 6/30/04 1000's of files | 6/30/04 # Curators | 11/29/07 GBs of data stored | 11/29/07 1000's of files | 11/29/07 # Curators
Data Grid | | | | | | | |
NSF / NVO | 17,800 | 5,139 | 51,380 | 8,690 | 80 | 88,216 | 14,550 | 100
NSF / NPACI | 1,972 | 1,083 | 17,578 | 4,694 | 380 | 39,697 | 7,590 | 380
Hayden | 6,800 | 41 | 7,201 | 113 | 178 | 8,013 | 161 | 227
Pzone | 438 | 31 | 812 | 47 | 49 | 28,799 | 17,640 | 68
NSF / LDAS-SALK | 239 | 1 | 4,562 | 16 | 66 | 207,018 | 169 | 67
NSF / SLAC-JCSG | 514 | 77 | 4,317 | 563 | 47 | 23,854 | 2,493 | 55
NSF / TeraGrid | | | 80,354 | 685 | 2,962 | 282,536 | 7,257 | 3,267
NIH / BIRN | | | 5,416 | 3,366 | 148 | 20,400 | 40,747 | 445
NCAR | | | | | | 70,334 | 325 | 2
LCA | | | | | | 3,787 | 77 | 2
Digital Library | | | | | | | |
NSF / LTER | 158 | 3 | 233 | 6 | 35 | 260 | 42 | 36
NSF / Portal | 33 | 5 | 1,745 | 48 | 384 | 2,620 | 53 | 460
NIH / AfCS | 27 | 4 | 462 | 49 | 21 | 733 | 94 | 21
NSF / SIO Explorer | 19 | 1 | 1,734 | 601 | 27 | 2,750 | 1,202 | 27
NSF / SCEC | | | 15,246 | 1,737 | 52 | 168,931 | 3,545 | 73
LLNL | | | | | | 18,934 | 2,338 | 5
CHRON | | | | | | 12,863 | 6,443 | 5
Persistent Archive | | | | | | | |
NARA | 7 | 2 | 63 | 81 | 58 | 5,023 | 6,430 | 58
NSF / NSDL | | | 2,785 | 20,054 | 119 | 7,499 | 84,984 | 136
UCSD Libraries | | | 127 | 202 | 29 | 5,205 | 1,328 | 29
NHPRC / PAT | | | | | | 2,576 | 966 | 28
RoadNet | | | | | | 3,557 | 1,569 | 30
UCTV | | | | | | 7,140 | 2 | 5
LOC | | | | | | 6,644 | 192 | 8
Earth Sci | | | | | | 6,136 | 652 | 5
TOTAL | 28 TB | 6 mil | 194 TB | 40 mil | 4,635 | 1,023 TB | 200 mil | 5,539

Data Grid Evolution
• Data grids
  - Management of preservation environment properties
  - Data and trust virtualization
  - Infrastructure independence
  - SRB - Storage Resource Broker
• Rule-based data grids
  - Automation of management policies
  - Management virtualization
  - Open source software
  - iRODS - integrated Rule-Oriented Data System
    http://irods.sdsc.edu

Using a Data Grid - Details
• User asks for data
• Data request goes to iRODS Server
• Server looks up information in catalog
• Catalog tells which iRODS server has data
• 1st server asks 2nd for data
• The 2nd iRODS server applies rules
[Figure labels: two iRODS Servers, each with a Rule Engine; Metadata Catalog; Rule Base; DB]
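The request flow on this slide can be sketched in a few lines of Python. This is an illustration only, not the iRODS API: the `Catalog` and `Server` classes, the `deny_guests` rule, and the path are hypothetical names invented for the example.

```python
class Catalog:
    """Metadata catalog: records which server holds each logical path."""

    def __init__(self):
        self._location = {}

    def register(self, path, server_name):
        self._location[path] = server_name

    def locate(self, path):
        return self._location[path]


class Server:
    """A data grid server with a local rule list (hypothetical model)."""

    def __init__(self, name, catalog, rules=()):
        self.name = name
        self.catalog = catalog
        self.rules = list(rules)
        self.files = {}
        self.peers = {}

    def store(self, path, data):
        self.files[path] = data
        self.catalog.register(path, self.name)

    def request(self, path, user):
        # Look up the holder in the catalog, then serve locally
        # or forward the request to the server that has the data.
        holder = self.catalog.locate(path)
        target = self if holder == self.name else self.peers[holder]
        return target.serve(path, user)

    def serve(self, path, user):
        # The server that holds the data applies its rules before answering.
        for rule in self.rules:
            rule(path, user)
        return self.files[path]


def deny_guests(path, user):
    if user == "guest":
        raise PermissionError(f"{user} may not read {path}")


catalog = Catalog()
first = Server("server-1", catalog)
second = Server("server-2", catalog, rules=[deny_guests])
first.peers["server-2"] = second

second.store("/zone/home/report.txt", b"survey data")
data = first.request("/zone/home/report.txt", "alice")
print(data)
```

The client only ever talks to the first server; the rules that govern the data travel with the server that stores it.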

Requirements Driving Evolution
• Observe that as the size of the shared collections grows, the administrative tasks can become onerous.
  - Data grids provide mechanisms to manage recovery from all errors that occur in the distributed environment
• Need to minimize labor support through automation of administrative functions
  - File ingestion tasks
  - Verification of desired collection properties
  - Integrity checks and replica management

Requirements Driving Evolution
• Observe that each preservation environment has unique management policies
  - User administration
  - File retention & deletion
  - Time-dependent access controls
  - Data distribution and replication
  - File update (versions, backups)
  - Descriptive metadata

Requirements Driving Evolution
• Socialization of collections
  - The archivists have specific properties that they assert the collection will possess
    - Completeness
    - Authoritative sources
    - Authenticity
  - The creators of the records have their own criteria for the properties they expect
• Socialization is the mapping from creator assertions to archivist expectations
  - Extract records from the environment in which they were created and migrate into the preservation environment
  - Extract records from the preservation environment and deliver to users of the archive
  - Maintain assertions about the records during both extraction processes

Data Management
Data Management Environment | Conserved Properties | Control Mechanisms | Remote Operations
Management Functions | Assessment Criteria | Management Policies | Capabilities
  (Data grid – Management virtualization)
Data Management Infrastructure | Persistent State | Rules | Micro-services
  (Data grid – Data and trust virtualization)
Physical Infrastructure | Database | Rule Engine | Storage System
iRODS - integrated Rule-Oriented Data System

Rules
• Rule classes
  - System enforced rules
  - Administrator controlled rules
  - User defined rules
• Rule execution
  - Atomic rules - executed on each operation invoked by a client
  - Deferred rules - executed at a future time
  - Periodic rules - executed to validate assessment criteria and enforce desired properties (integrity)

iRODS Rule Syntax
• Event | Condition | Action-set | Recovery-set
  - Event - triggered by operation or queued rule
  - Condition - composed of tests on any attributes in the persistent state information
  - Action-set - composed from both micro-services and rules
  - Recovery-set - used to ensure transaction semantics and consistent state information
• Executed by a rule engine installed at each storage location - server side workflows
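The Event | Condition | Action-set | Recovery-set structure can be illustrated with a small Python model. This is a sketch, not the iRODS rule language or its engine: the `Rule` dataclass and `fire` function are hypothetical, and the sketch assumes each action is paired with one recovery step and that only the actions that already completed are undone when a later action fails.

```python
from dataclasses import dataclass
from typing import Callable, List

State = dict


@dataclass
class Rule:
    event: str                               # operation that triggers the rule
    condition: Callable[[State], bool]       # test on persistent state
    actions: List[Callable[[State], None]]   # action set, run in order
    recoveries: List[Callable[[State], None]]  # one recovery step per action


def fire(rules: List[Rule], event: str, state: State) -> bool:
    """Run the first rule matching the event and condition.

    Returns True if the whole action set completed. On failure, the
    recovery steps of the completed actions run in reverse order
    (an assumed simplification of transaction semantics).
    """
    for rule in rules:
        if rule.event != event or not rule.condition(state):
            continue
        done = 0
        try:
            for action in rule.actions:
                action(state)
                done += 1
            return True
        except Exception:
            for recover in reversed(rule.recoveries[:done]):
                recover(state)
            return False
    return False


def make_replica(state):
    state["replicas"] += 1


def drop_replica(state):
    state["replicas"] -= 1


def failing_checksum(state):
    raise OSError("storage offline")


def noop(state):
    pass


rules = [Rule("put", lambda s: s["size"] > 0,
              [make_replica, failing_checksum],
              [drop_replica, noop])]

state = {"size": 10, "replicas": 1}
ok = fire(rules, "put", state)
print(ok, state["replicas"])
```

In the example the second action fails, so the replica created by the first action is removed again and the state information stays consistent with what is actually stored.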

Micro-Services
• Challenge is that storage systems do not provide desired processes
  - Have “minimal” set of standard operations that are performed at the storage system
  - Have actions required by clients such as replication, metadata extraction, format migration
  - Create standard micro-services that aggregate storage operations into modules that can be used to implement desired processes.
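A minimal Python sketch of this idea follows: a storage system offers only a few standard operations, and micro-services compose them into the processes clients need. The `Storage` protocol, `MemoryStore` class, and `ms_*` function names are hypothetical and do not correspond to actual iRODS micro-services.

```python
import hashlib
from typing import Protocol


class Storage(Protocol):
    """The 'minimal' standard operations a storage system must offer."""

    def read(self, name: str) -> bytes: ...
    def write(self, name: str, data: bytes) -> None: ...


class MemoryStore:
    """Stand-in for one storage system (file system, tape archive, ...)."""

    def __init__(self):
        self._blobs = {}

    def read(self, name: str) -> bytes:
        return self._blobs[name]

    def write(self, name: str, data: bytes) -> None:
        self._blobs[name] = bytes(data)


def ms_replicate(src: Storage, dst: Storage, name: str) -> None:
    """Micro-service: copy a file between two storage systems."""
    dst.write(name, src.read(name))


def ms_checksum(store: Storage, name: str) -> str:
    """Micro-service: compute a checksum using only read()."""
    return hashlib.sha256(store.read(name)).hexdigest()


def ms_verified_replicate(src: Storage, dst: Storage, name: str) -> bool:
    """Composite process built from the two micro-services above."""
    ms_replicate(src, dst, name)
    return ms_checksum(src, name) == ms_checksum(dst, name)


disk, archive = MemoryStore(), MemoryStore()
disk.write("record.dat", b"preserved content")
verified = ms_verified_replicate(disk, archive, "record.dat")
print(verified)
```

Since the micro-services depend only on `read` and `write`, adding a new storage system means writing one new driver class, not rewriting the processes.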

Data Virtualization
[Figure: layered stack with labels Access Interface, Standard Micro-services, Data Grid, Standard Operations, Storage Protocol, Storage System]
Map from the actions requested by the access method to a standard set of micro-services. The standard micro-services are mapped to the operations supported by the storage system.

integrated Rule-Oriented Data System
[Figure: iRODS architecture. Labels: Client Interface, Admin Interface, Current State, Rule Invoker, Rule Engine, Service Manager, Rule Base, Confs, Metadata Persistent Repository, Resources, Micro Service Modules (Metadata-based Services), Micro Service Modules (Resource-based Services), Rule Modifier Module, Config Modifier Module, Metadata Modifier Module, Consistency Check Module (three instances)]

Distributed Management System
[Figure labels: Rule Engine, Data Transport, Metadata Catalog, Execution Control, Messaging System, Execution Engine, Virtualization, Server Side Workflow, Persistent State information, Scheduling, Policy Management]

Digital Preservation
• Preservation community is defining the rules needed to assert trustworthiness of a digital repository
  - RLG/NARA - Trustworthy Repositories Audit & Certification: Criteria and Checklist.
    http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/trac.pdf
• Defined 105 rules that are being implemented in iRODS

RLG/NARA Assessment
• Example TRAC assessment criteria
90 Verify descriptive metadata and source against SIP template and set SIP compliance flag
91 Verify descriptive metadata against semantic term list
92 Verify status of metadata catalog backup (create a snapshot of metadata catalog)
93 Verify consistency of preservation metadata after hardware change or error
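Criteria 90 and 91 above lend themselves to a short illustration. The Python sketch below is not one of the rules being implemented in iRODS; the function names, the template fields, and the term list are hypothetical values chosen for the example.

```python
def check_sip(metadata: dict, template_fields: list) -> dict:
    """Criterion 90 (sketch): compare metadata with a SIP template
    and set a compliance flag."""
    missing = [name for name in template_fields if not metadata.get(name)]
    return {"sip_compliant": not missing, "missing": missing}


def unknown_terms(metadata: dict, term_list: set) -> list:
    """Criterion 91 (sketch): list attribute names that are not on the
    approved semantic term list."""
    return sorted(set(metadata) - set(term_list))


# Hypothetical template and term list.
TEMPLATE = ["title", "creator", "date"]
TERMS = {"title", "creator", "date", "subject"}

record = {"title": "Survey 12", "creator": "J. Doe", "colour": "blue"}

sip_result = check_sip(record, TEMPLATE)
bad_terms = unknown_terms(record, TERMS)
print(sip_result, bad_terms)
```

Run periodically over a collection, checks of this kind turn an assessment criterion into state information (the compliance flag) that can be queried later.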

Classes of Assessment Criteria
• Collection properties
  - List properties of associated name spaces
  - Verify properties
  - Compare properties with assertions
• Collection operations
  - Transform file formats
  - Migrate data
  - Generate audit trails
• Structured information
  - Parse audit trails to generate compliance reports
  - Apply templates to extract information
  - Apply templates to format state information
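As an illustration of parsing audit trails to generate compliance reports, the Python sketch below summarizes checksum events. The pipe-delimited audit format (`timestamp|user|operation|record|status`) and the sample entries are assumptions made for the example; they are not the iRODS audit trail format.

```python
from collections import Counter

# Hypothetical audit trail: timestamp|user|operation|record|status
AUDIT = """\
2008-01-23T10:00:00|archivist1|checksum|rec-001|ok
2008-01-23T10:05:00|archivist1|checksum|rec-002|failed
2008-01-23T10:09:00|archivist2|replicate|rec-002|ok
"""


def compliance_report(trail: str) -> dict:
    """Count the records that were checksum-verified and the failures per record."""
    checked = set()
    failures = Counter()
    for line in trail.splitlines():
        if not line.strip():
            continue
        _timestamp, _user, operation, record, status = line.split("|")
        if operation == "checksum":
            checked.add(record)
            if status != "ok":
                failures[record] += 1
    return {"records_checked": len(checked), "failures": dict(failures)}


report = compliance_report(AUDIT)
print(report)
```

The report is derived entirely from the recorded events, so an assertion about integrity can be traced back to the operations that support it.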

Which Comes First?
• Specification of required provenance metadata
  - PREMIS - defines metadata that should be maintained about events associated with record
  - Definition of the procedures left to each preservation environment
• Specification of required management policies
  - Define explicitly the management procedures
  - Derive the required state information needed to track outcomes
  - Implies provenance metadata is defined by management policies
  - Observe this leads to multiple classes of preservation metadata associated with each preserved name space

Persistent State Information
• User name space
  - Identity of archivists
  - Qualifications of archivists
• Record (file) name space
  - Provenance metadata
  - Transformative migrations
  - Chain of custody (storage locations)
  - Integrity
  - Representation information (OAIS)
• Storage resource name space
  - Archival properties
  - Error rates

Persistent State Information
• Representation information for preservation environment
• Rule name space
  - Management policies that control operations within preservation environment
  - Versions of rules
  - Verification criteria
• Micro-service name space
  - Management procedures that quantify operations on records
  - Versions of micro-services
  - Verification criteria
• Persistent State name space
  - State information created by each version of a micro-service

Preservation Requirements
• What are your required preservation management policies?
• What are your required preservation processes?
• What are your required preservation assessment criteria?
• What preservation systems are you using, and how can the preservation systems interoperate?
• Can a set of records be migrated from your preservation environment into another system while maintaining authenticity, integrity, and chain of custody?

Theory of Digital Preservation
• Given the set of preservation policies
• Given the set of preservation procedures
• Given the set of persistent state information
• Does the system have demonstrable closure and consistency properties?
  - Is the required persistent state information generated that is needed to make assertions about trustworthiness, authenticity, integrity?
  - Can assertions be made about the set of preservation procedures that have been applied to the records (no missing steps)?
  - Do the applied preservation procedures enforce all preservation policies?
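One way to read the closure questions above is as set comparisons between policies, procedures, and state information. The Python sketch below takes that reading under stated assumptions: each procedure declares which policies it enforces and which state attributes it produces. All policy, procedure, and attribute names are hypothetical, and this is an interpretation of the slide rather than a formal theory.

```python
def closure_report(policies, procedures, required_state):
    """Report policies no procedure enforces and required state
    information that no procedure generates."""
    enforced = set().union(*(p["enforces"] for p in procedures.values()))
    produced = set().union(*(p["produces"] for p in procedures.values()))
    return {
        "unenforced_policies": sorted(set(policies) - enforced),
        "missing_state": sorted(set(required_state) - produced),
    }


# Hypothetical policies, procedures, and state attributes.
policies = ["two_replicas", "fixity_checked", "retention_10y"]
procedures = {
    "replicate": {"enforces": {"two_replicas"},
                  "produces": {"replica_count"}},
    "checksum": {"enforces": {"fixity_checked"},
                 "produces": {"checksum", "checked_at"}},
}
required_state = ["replica_count", "checksum", "checked_at",
                  "retention_expiry"]

report = closure_report(policies, procedures, required_state)
print(report)
```

Under this reading, an environment is closed when both lists in the report are empty: every policy is enforced by some procedure, and every assertion is backed by generated state information.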

iRODS Application
• NSF - SDCI grant “Adaptive Middleware for Community Shared Collections”
  - iRODS development, SRB maintenance
• NARA - Transcontinental Persistent Archive Prototype
  - Trusted repository assessment criteria
• NSF - Ocean Research Interactive Observatory Network (ORION)
  - Real-time sensor data stream management
• NSF - Temporal Dynamics of Learning Center data grid
  - Management of Institutional Review Board approval

iRODS Development Status
• Current release is version 1.0
  - January 23, 2008
  - http://irods.sdsc.edu
• International collaborations
  - SHAMAN - University of Liverpool
    - Sustaining Heritage Access through Multivalent ArchiviNg
  - CASPAR
    - Representation information, TRAC assessment criteria
  - UK e-Science data grid
  - IN2P3 (Lyon, France) data grid migration
  - DSpace policy management integration
  - Fedora user middleware integration
  - LStore distributed metadata catalog integration

Planned Development
• In progress:
  - GSI support
  - Audit trails - mechanisms to record and track iRODS persistent state changes
  - Structured information interface based on mounted collection driver (tar file)
  - GUI Browser (AJAX)
  - Driver for HPSS
  - Porting to additional versions of Unix/Linux (Ubuntu completed)
• Planned:
  - Time-limited sessions via a one-way hash authentication
  - Python Client library
  - Driver for SAM-QFS
  - Porting to Windows
  - Support for MySQL as the metadata catalog
  - MCAT to ICAT migration tools
  - Extensible Metadata including Databases Access Interface
  - Zones/Federation
  - Cheshire / Multivalent Browser micro-service

For More Information
Reagan W. Moore
San Diego Supercomputer Center
http://www.sdsc.edu/srb/
http://irods.sdsc.edu/