Persistent Archives: Long-term sustainability of data based on
policy and data virtualization
Arcot (Raja) Rajasekar
University of North Carolina at Chapel Hill
[email protected]
http://irods.diceresearch.org
NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype” (2008-2012)
NSF SDCI 0721400 “Data Grids for Community Driven Applications” (2007-2010)
Topics
• Data Grids for Preservation & Sharing
– Brief Intro
– Why are they suitable for deploying scalable persistent archives?
– iRODS as an exemplar Data Grid
• Two Examples:
– DIGARCH: Preservation of Multi-media Collection
– TPAP: NARA Testbed of Persistent Archives
Data Preservation Challenges
• Data driven research generates massive data collections
– Data sources are remote and distributed
– Collaborators are remote
– Wide variety of data types: observational data, experimental data, simulation data, real-time data, office products, web pages, multi-media
• Collections contain millions of files
– Logical arrangement is needed for distributed data
– Discovery requires the addition of descriptive metadata
• Long-term retention requires migration of output into a reference collection
– Automation of administrative functions is essential to minimize long-term labor support costs
– Creation of representation information for describing file context
– Validation of assessment criteria (authenticity, integrity)
What is a Data Grid?
• Geographically distributed heterogeneous resources that are managed autonomously
• Active, with data resources being added and removed
• Users like to share/discover data using contextual information
What is a Data Grid?
• Data Grid – a network of data resources that is presented as a single, accessible collection of data.
• Data Grid – provisions for associating metadata & annotations
• Data Grid – enables discovery, access & server-side processing
• Metadata-based data virtualization
• Policy Virtualization
Why Data Grids?
• Data Virtualization: Shared Collections Concept
– Common Abstract Name Spaces: physical-independence
    • Data objects and collections: logical names
    • Users/collaborators: global user name space
    • Shared resources & uniform access: location & protocol transparency
    • Common typing conventions for objects & actions
– Provide technology independence
    • Platform & vendor independence
    • High scalability
– Need discovery metadata
    • Descriptive attributes for each name space
    • System & domain-specific information
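The idea of a logical name space with discovery metadata can be illustrated with a minimal sketch. This is not iRODS code: the `Catalog` class, the resource names, and the paths are all hypothetical, chosen only to show one logical name resolving to several physical replicas and being discovered through its attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # hypothetical storage resource name
    physical_path: str  # where the bytes live on that resource


@dataclass
class DataObject:
    logical_name: str
    replicas: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


class Catalog:
    """Toy metadata catalog: logical names and attributes map to replicas."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_name, resource, physical_path, **metadata):
        obj = self._objects.setdefault(logical_name, DataObject(logical_name))
        obj.replicas.append(Replica(resource, physical_path))
        obj.metadata.update(metadata)
        return obj

    def resolve(self, logical_name):
        """Return every physical location behind one logical name."""
        return [(r.resource, r.physical_path)
                for r in self._objects[logical_name].replicas]

    def find(self, **query):
        """Return logical names whose metadata matches all query attributes."""
        return sorted(name for name, obj in self._objects.items()
                      if all(obj.metadata.get(k) == v for k, v in query.items()))


# Hypothetical zone, resources, and paths, for illustration only.
catalog = Catalog()
catalog.register("/zone/home/project/scan01.tif", "disk-unc", "/vault/a/0001", type="image")
catalog.register("/zone/home/project/scan01.tif", "tape-ucsd", "/hpss/x/9f3", type="image")
catalog.register("/zone/home/project/notes.txt", "disk-unc", "/vault/a/0002", type="text")

print(catalog.resolve("/zone/home/project/scan01.tif"))
print(catalog.find(type="image"))
```

Users only ever see the logical name; moving a replica to a new storage system changes an entry in the catalog, not the name that collaborators use.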
Why Data Grids?
• Policy Virtualization: Automate Operations
– System-centric Policies & Obligations:
    • Manage retention, disposition, distribution, replication, integrity, authenticity, chain of custody, access controls, representation information, descriptive information requirement, logical arrangement, audit trails, authorization, authentication
– Domain-specific Policies:
    • Identification & Extraction of Metadata
    • Ingestion Control for Provenance Attribution
    • Processing of Data on Ingestion
        – Creation of multi-resolution images, type-identification, anonymization, …
    • Processing of Data on Access
        – IRB Approval for data access, Data sub-setting, Merging of multiple images, conversion, redaction, …
Preservation is an Integral Part of the Data Life Cycle
• Organize project data into a shared collection
• Publish data in a digital library for use by other researchers
• Enable data-discovery & data-driven analyses
• Preserve reference collection for use by future research initiatives
• Associate new collection against prior state-of-the-art data
• Define & Enforce Policies for long-term management and curation
Exemplar Data Grid: iRODS
• Integrated Rule-Oriented Data System
• It is a data grid system – data virtualization
– A distributed file system, based on a client-server architecture.
– Allows users to access files seamlessly across a distributed environment, based upon their attributes & GUID rather than locations
– It replicates, syncs and archives data, connecting heterogeneous resources in a logical and abstracted manner.
• It is a server-side workflow system – policy virtualization
– Actions are coded as functions/scripts (micro-services)
– Micro-services can be chained into Policies (rules)
– Rules are interpreted by a distributed rule engine
– The chains can be triggered on an event and condition (rules)
– Micro-services communicate through parameters, shared contexts, and out-of-band message queues.
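The relationship between micro-services, rules, events, and conditions can be sketched in a few lines. This is a toy model, not the iRODS rule engine or its rule language: `RuleEngine`, the event name `on_ingest`, and both micro-services are hypothetical, and the shared context is modeled as a plain dictionary.

```python
import hashlib


class RuleEngine:
    """Toy rule engine: a rule is (event, condition, chain of micro-services)."""

    def __init__(self):
        self.micro_services = {}
        self.rules = []

    def micro_service(self, func):
        # Register a function under its own name so rules can refer to it.
        self.micro_services[func.__name__] = func
        return func

    def add_rule(self, event, condition, chain):
        self.rules.append((event, condition, chain))

    def fire(self, event, context):
        # Run every chain whose event matches and whose condition holds.
        for rule_event, condition, chain in self.rules:
            if rule_event == event and condition(context):
                for name in chain:
                    self.micro_services[name](context)
        return context


engine = RuleEngine()


@engine.micro_service
def compute_checksum(ctx):
    ctx["checksum"] = hashlib.sha256(ctx["data"]).hexdigest()


@engine.micro_service
def tag_as_image(ctx):
    ctx.setdefault("metadata", {})["type"] = "image"


# Rule: on ingest, if the file is a TIFF, checksum it and tag it.
engine.add_rule("on_ingest",
                lambda ctx: ctx["name"].endswith(".tif"),
                ["compute_checksum", "tag_as_image"])

result = engine.fire("on_ingest", {"name": "scan01.tif", "data": b"pixels"})
skipped = engine.fire("on_ingest", {"name": "notes.txt", "data": b"text"})
print(result["metadata"], "checksum" in skipped)
```

The micro-services here communicate only through the shared context; the out-of-band message queues mentioned above are left out of the sketch.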
Open Policy and Uniform Access
Policy/Rule Examples
• Automatically extract metadata for files of certain types and store it in a domain-centric metadata catalog
• Notify the owner if a file's metadata is missing N days after ingestion
• Automatically “audit” derived datasets – provenance gathering
• Periodically check for integrity of files in a collection and repair them if needed/possible
• Allow only users with certificate-based login to access files from a collection – multi-lock control
• Automatically migrate a file to “slow” storage location after N days of non-use – storage management
• Automatically replicate a file that falls into a collection into 3 geo-distributed sites – replication strategies
• When too many users from site A are using a file from site B, keep a copy in site A – data placement strategies
• Send a notification when a file with a certain type of data is ingested.
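As an illustration of the integrity policy in the list above, the following sketch checks each replica of a file against a recorded checksum and repairs damaged copies from an intact one. It assumes replicas are held in memory as bytes keyed by a hypothetical site name; a real data grid would read from storage resources and record the outcome in its catalog.

```python
import hashlib


def sha256(data):
    return hashlib.sha256(data).hexdigest()


def check_and_repair(replicas, expected_checksum):
    """Verify each replica; overwrite damaged ones from an intact copy.

    replicas: dict mapping a (hypothetical) site name to the stored bytes.
    Returns the list of sites that were repaired.
    """
    good = [site for site, data in replicas.items()
            if sha256(data) == expected_checksum]
    if not good:
        raise ValueError("no intact replica left; repair is not possible")
    source = replicas[good[0]]
    repaired = []
    for site in list(replicas):
        if sha256(replicas[site]) != expected_checksum:
            replicas[site] = source
            repaired.append(site)
    return repaired


# Checksum recorded at ingestion time, then one replica is damaged.
recorded = sha256(b"record")
replicas = {"site-a": b"record", "site-b": b"rec0rd", "site-c": b"record"}
repaired = check_and_repair(replicas, recorded)
print(repaired)
```

Running such a check periodically, rather than only on access, is what turns it into the kind of policy the slide describes.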
Overview of iRODS Architecture
User: can search, access, add, and manage data & metadata*
*Access data with a Web-based browser, the iRODS GUI, or command-line clients.
Overview of iRODS Data System
• iRODS Data Server: Disk, Tape, etc.
• iRODS Metadata Catalog: Track data
• iRODS Rule Engine: Track policies
[Figure: integrated Rule-Oriented Data System. Labels: Client Interface, Admin Interface, Current State, Rule Invoker, Micro Service Modules (Metadata-based Services), Micro Service Modules (Resource-based Services), Resources, Service Manager, Consistency Check Modules, Rule Modifier Module, Config Modifier Module, Metadata Modifier Module, Engine, Rule Base, Confs, Metadata Persistent Repository]
iRODS Components
• Rule Engine
• Execution Control
• Messaging System
• Execution Engine
• Virtualization
• Server-Side Workflow
• Persistent State Information
• Scheduling
• Policy Management
• Data Transport
• Metadata Catalog
iRODS Applications
• Institutional repositories
– Carolina Digital Repository at University of North Carolina
– Duke Medical Archive
• Regional data grids
– RENCI data grid linking 7 engagement centers in North Carolina
– HASTAC data grid linking humanities collections across 9 UC campuses
• National data grids
– NARA Transcontinental Persistent Archive Prototype
– NSF Temporal Dynamics of Learning Center data grid
– NSF Ocean Observatories Initiative data grid
– NASA Center for Computational Sciences archive
– JPL Planetary Data System data grid
• International data grids
– Australian Research Collaboration Service - ARCS
– French National Library
User Interfaces
• C library calls - Application level
• Unix shell commands - Scripting languages
• Java I/O class library (JARGON) - Web services
• SAGA - Grid API
• Web browser (Java-python) - Web interface
• Windows browser - Windows interface
• WebDAV - iPhone interface
• Fedora digital library middleware - Digital library middleware
• DSpace digital library - Digital library services
• Parrot - Unification interface
• Kepler workflow - Scientific workflow
• FUSE user-level file system - Unix file system
Case 1: NARA TPAP
• National Archives Electronic Records Administration Research Program (funded through NSF)
• Transcontinental Persistent Archive Prototype
– Use federation of data grid technology to build a preservation environment
– Conduct research on preservation concepts
    • Infrastructure independence
    • Enforcement of preservation properties
    • Validation of assessment criteria
    • Automation of administrative processes
    • Show technology migration
– Demonstrate preservation on selected NARA digital holdings
National Archives and Records Administration Transcontinental Persistent Archive Prototype
Federation of Seven Independent Data Grids, each with its own MCAT: NARA I, NARA II, U Md, UCSD, Georgia Tech, Rocket Center, U NC
Extensible Environment, can federate with additional research and education sites. Each data grid uses different vendor products.
ISO MOIMS-repository assessment criteria
• We are developing 150 rules that implement the assessment criteria
• Examples:
90 Verify descriptive metadata and source against SIP template and set SIP compliance flag
91 Verify descriptive metadata against semantic term list
92 Verify status of metadata catalog backup (create a snapshot of metadata catalog)
93 Verify consistency of preservation metadata after hardware change or error
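Rules 90 and 91 can be pictured with a small sketch that checks a metadata record against a template and a term list, then sets a compliance flag. This is not one of the actual assessment rules: the field names, the template layout, and the `sip_compliant` flag name are assumptions made for illustration.

```python
def verify_against_template(metadata, template):
    """Check required fields and controlled terms.

    template maps each required field to a set of allowed terms,
    or to None when any non-empty value is acceptable.
    Returns (compliant, problems).
    """
    problems = []
    for field_name, allowed in template.items():
        value = metadata.get(field_name)
        if value in (None, ""):
            problems.append(f"missing: {field_name}")
        elif allowed is not None and value not in allowed:
            problems.append(f"term not in list: {field_name}={value}")
    return (not problems, problems)


# Hypothetical SIP template with one controlled vocabulary.
sip_template = {
    "title": None,
    "source": None,
    "record_type": {"video", "audio", "text"},
}

good_record = {"title": "Interview 12", "source": "UCTV", "record_type": "video"}
bad_record = {"title": "Interview 13", "record_type": "film"}

for record in (good_record, bad_record):
    compliant, problems = verify_against_template(record, sip_template)
    record["sip_compliant"] = compliant  # the compliance flag from rule 90
    print(record["title"], compliant, problems)
```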
Case Study 2: DIGARCH
• Preservation of Video Files
– By Integrating a Video Production Pipeline
– With a Preservation Workflow
Digital Preservation Lifecycle Management
Building a demonstration prototype for the preservation of large-scale multi-media collections
San Diego Supercomputer Center, Univ. of California, San Diego
Arcot Rajasekar (PI), Richard Marciano, Reagan Moore, Chien-Yi Hou, Francine Berman (co-PI)
UCSD-TV, Univ. of California, San Diego
Lynn Burstan (co-PI), Steve Anderson, Mellisa McEwen, Bee Bornheimer
UCTV-Berkeley
Harry Kreisler
UCSD Libraries, Univ. of California, San Diego
Brian Schottlaender (co-PI), Luc DeClerck, Brad Westbrook, Arwen Hutt, Ardys Kozbial, Chris Frymann, Vivian Chu
Our Proposal
• Design and Development of a Prototype for Preserving Digital Video Collections
– Management of Authenticity, Integrity, and Infrastructure Independence
– Preservation Life-cycle meshing seamlessly with the content production
    • Minimal impact to production life-cycle
– Workflow system that automates accession, description, organization and preservation of video and associated contents
    • Metadata definition, extraction and ingestion
    • Long-term retention and technology migration
– At risk Collection: ‘Conversation with History’ video collection
    • Video, audio, text transcripts, web-based material
    • Databases of administrative and descriptive metadata
    • Derived products
Exemplar Collection
• Conversation with History - UCTV - from 1982
– Hour-long interviews with internationally prominent individuals
– Institute of International Affairs, UC Berkeley
– Available in 15 million homes nationwide via UCTV
– 40 program segments annually
– Web-site for downloading older segments
– Among UCTV's most accessed on-line programs
– Programs used in educational material
[Figure: production and preservation workflows]
• Workflows: TV production Lifecycle; Metadata Definition & Capture Workflow; Persistent Archival Workflow
• Stages shown: Pre-Interview, Interview, Transcription, Post-Interview, Metadata Analysis, Schema Generation, SIP/AIP Definitions, Capture Scripts, Metadata DB Capture, Interview Metadata Capture, Make SIPS, Aggregate AIP/Verify, Store/Replicate/Preserve, Broadcast/Transfer, Metadata Validation
Preservation Processes
• Generation of a Globally Unique Identifier (GUID) for each interview session
• Retrieval of the original video session
• Retrieval of each of the segments associated with a video session
• Retrieval of the transmission scripts for each video segment
• Retrieval of the material published on the Web page for each segment
• Processing of each Web page to redirect internal URLs into handles within the preservation logical name space for digital entities
• Retrieval of the rights statement for each session
• Retrieval of the header associated with each video segment
• Retrieval of the trailer associated with each video segment
• Retrieval of the administrative, structural, and descriptive metadata stored in the Filemaker Pro database
• Retrieval of the annotations stored with the Web pages
• Specification of Preservation Metadata for AIP
• Creation of AIPs for the above material
• Creation of containers for physically aggregating material for storage in the preservation environment
• Storage of containers within the preservation environment
• Specification of preservation management metadata such as access controls, storage location, and replication
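Three of the steps above (GUID generation, redirecting internal URLs into handles, and aggregating material into a container) can be sketched as follows. This is an illustrative assumption, not the project's workflow: the host `tv.example.org`, the handle prefix `hdl:/archive`, the manifest layout, and the use of a tar archive as the container are all hypothetical.

```python
import io
import json
import re
import tarfile
import uuid


def rewrite_internal_urls(html, internal_host, handle_prefix):
    """Redirect links on the internal host to handles in the logical name space."""
    pattern = re.compile(r'href="https?://' + re.escape(internal_host) + r'/([^"]*)"')
    return pattern.sub(lambda m: f'href="{handle_prefix}/{m.group(1)}"', html)


def build_aip(files, metadata):
    """Aggregate files plus a manifest into one tar container under a new GUID.

    files: dict mapping member names to bytes.
    Returns (guid, container_bytes).
    """
    guid = str(uuid.uuid4())
    manifest = dict(metadata, guid=guid, members=sorted(files))
    members = {**files, "manifest.json": json.dumps(manifest).encode()}
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=f"{guid}/{name}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return guid, buffer.getvalue()


# Hypothetical web page with one internal and one external link.
page = ('<a href="http://tv.example.org/seg/12.html">x</a> '
        '<a href="http://other.org/a">y</a>')
rewritten = rewrite_internal_urls(page, "tv.example.org", "hdl:/archive")

guid, container = build_aip(
    {"video.mp4": b"...", "rights.txt": b"All rights reserved"},
    {"session": "interview-12"},
)
print(rewritten)
print(guid, len(container))
```

Only links to the internal host are rewritten; external links are left untouched, so the preserved page still points outward where it did originally.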
Utility of Data Grids
• Logical Universal Identifier
• Uniform Access to Distributed Data
• Data Discovery through Metadata
• Open Policy for Provenance & Data Management
• Rule-based workflow for Metadata Extraction & Analysis
• Audit trails to capture provenance
• Provides a Platform for Data Publication
• Provides a means to Uniquely identify datasets
• Can be used to enforce metadata requirement - policy
• Cross-referencing provenance through workflow-derived data
• Provides a means to perform data attribution