Data Preservation

Date post: 14-Dec-2014
Page 1: Data Preservation

National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center

Preservation and Long Term Access to Data and Records in a Knowledge-based Society

Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE/

Page 2: Data Preservation

Data and Knowledge Systems Group

Staff
• Reagan Moore
• Ilkai Altintas
• Chaitan Baru
• Sheau Yen Chen
• Charles Cowart
• Amarnath Gupta
• George Kremenek
• M. Kulrul
• Bertram Ludäscher
• Richard Marciano
• A. Memon
• XuFei Qian
• Roman Olshanowsky
• Arcot Rajasekar
• Abe Singer
• Michael Wan
• Ilya Zaslavsky
• Bing Zhu

Graduate Students
• A. Bagchi
• S. Bansal
• A. Behere
• R. Bharath
• S. Bharath
• L. Sui

Undergraduate Interns
• N. Cotofana
• D. Le
• J. Trang
• L. Yin
• +/- NN

Page 3: Data Preservation

Topics

• Building persistent archives

• Data grids

• Authenticity mechanisms

• Managing technology evolution

• Knowledge-based access

Page 4: Data Preservation

Archival Processes

• Appraisal - determine the archivable content

• Accession - determine the initial physical location for the data, and the relationship of the new collection to existing collections

• Arrangement - add administrative control, describe the information content (provenance, authenticity, structure, administrative), and decompose digital objects into their components as needed

• Description - complete the definition of collection attributes by iterating between arrangement, reformatting, and representation

• Preservation - build an archivable form of the digital entities, characterize the collection context, and manage their storage

• Access - provide query mechanisms for discovering, retrieving, and presenting the digital entities

Page 5: Data Preservation

ERA Concept model

Page 6: Data Preservation

Common Approach (digital library, persistent archive, data grid)

• Logical name space used to organize digital entities, and associate attributes

• Separation of information management from data storage management

• Definition of abstraction mechanisms for dealing with repositories

• Emergence of need for knowledge management

Page 7: Data Preservation

SDSC Storage Resource Broker & Meta-data Catalog: Levels of Abstraction

[Figure: layered architecture diagram. Labels, top to bottom:]

• Application

• Clients: C, C++, Libraries; Unix Shell; Java, NT Browsers; Web, WSDL; Prolog Predicate; Linux I/O; DLL / Python

• Servers: Consistency Management / Authorization-Authentication; Prime Server; Logical Name Space; Latency Management; Data Transport; Metadata Transport

• Catalog Abstraction: Databases (DB2, Oracle, Sybase)

• Storage Abstraction: Archives (HPSS, ADSM, UniTree, DMF); Databases (DB2, Oracle, Postgres); File Systems (Unix, NT, Mac OSX); HRM

Page 8: Data Preservation

Authenticity

• Guarantee that the data has not been changed
  – Collection-owned data, only accessible through the data handling system
  – Support roles defining access (curation, owner, annotation, read)
  – Support access controls mapping users to roles

• Audit trails that record all operations on files

• Digital signatures - cryptographic checksums
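
The three mechanisms above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not the SRB implementation: the `Collection` class, the permission sets attached to each role, and the sample user names are all hypothetical. It also uses a plain SHA-256 digest, which detects change but is not a keyed digital signature.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Role names come from the slide; the permissions per role are assumptions.
ROLE_PERMISSIONS = {
    "curation": {"read", "annotate", "write"},
    "owner": {"read", "annotate", "write"},
    "annotation": {"read", "annotate"},
    "read": {"read"},
}


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Collection:
    """Hypothetical collection that owns its data and mediates all access."""
    user_roles: dict                                  # user name -> role name
    objects: dict = field(default_factory=dict)       # logical name -> bytes
    checksums: dict = field(default_factory=dict)     # logical name -> digest at ingest
    audit_trail: list = field(default_factory=list)   # one tuple per attempted operation

    def _authorize(self, user, operation, name):
        role = self.user_roles.get(user)
        allowed = operation in ROLE_PERMISSIONS.get(role, set())
        # Record every attempted operation, whether or not it is allowed.
        timestamp = datetime.now(timezone.utc).isoformat()
        self.audit_trail.append((timestamp, user, operation, name, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not {operation} {name}")

    def put(self, user, name, data: bytes):
        self._authorize(user, "write", name)
        self.objects[name] = data
        self.checksums[name] = checksum(data)

    def get(self, user, name) -> bytes:
        self._authorize(user, "read", name)
        return self.objects[name]

    def verify(self, name) -> bool:
        """True if the stored bytes still match the digest recorded at ingest."""
        return checksum(self.objects[name]) == self.checksums[name]


archive = Collection(user_roles={"alice": "owner", "bob": "read"})
archive.put("alice", "report.txt", b"original record")
print(archive.verify("report.txt"))
```

Because every call goes through `_authorize`, refused operations appear in the audit trail as well as successful ones.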

Page 9: Data Preservation

Managing Technology Evolution

• Data grids provide interoperability mechanisms for accessing data across multiple administrative domains and multiple types of storage systems

• Persistent archives migrate collections from old technology to new technology to support presentation on new systems

• Both require the ability to access heterogeneous systems
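
One way to picture the shared requirement is a common driver interface over unlike storage systems, with migration written once against that interface. The sketch below is a hedged illustration only: `StorageDriver`, `MemoryStore`, `DirectoryStore`, and `migrate` are hypothetical names, and an in-memory dictionary stands in for the "old" system.

```python
import hashlib
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageDriver(ABC):
    """Hypothetical uniform interface over heterogeneous storage systems."""

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def keys(self) -> list: ...


class MemoryStore(StorageDriver):
    """Stand-in for an old storage system: objects held in a dictionary."""

    def __init__(self):
        self._blobs = {}

    def read(self, key):
        return self._blobs[key]

    def write(self, key, data):
        self._blobs[key] = data

    def keys(self):
        return sorted(self._blobs)


class DirectoryStore(StorageDriver):
    """Stand-in for a new storage system: one file per object (flat keys only)."""

    def __init__(self, root):
        self._root = Path(root)

    def read(self, key):
        return (self._root / key).read_bytes()

    def write(self, key, data):
        (self._root / key).write_bytes(data)

    def keys(self):
        return sorted(p.name for p in self._root.iterdir() if p.is_file())


def migrate(source: StorageDriver, target: StorageDriver) -> list:
    """Copy every object from source to target, checking each copy by digest."""
    migrated = []
    for key in source.keys():
        data = source.read(key)
        target.write(key, data)
        if hashlib.sha256(target.read(key)).digest() != hashlib.sha256(data).digest():
            raise IOError(f"copy of {key} does not match the source")
        migrated.append(key)
    return migrated


old = MemoryStore()
old.write("a.dat", b"alpha")
old.write("b.dat", b"beta")
new = DirectoryStore(tempfile.mkdtemp())
moved = migrate(old, new)
print(moved)
```

The same `migrate` function would serve any pair of drivers, which is the point of placing the abstraction between the collection and the repository.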

Page 10: Data Preservation

Presentation of Digital Objects

Storage System

Operating System

Application

Digital Object

Display System

Page 11: Data Preservation

Technology Management - Emulation

New Storage System

New Operating System

Old Application

Digital Object

New Display System

Wrap Application

Page 12: Data Preservation

Technology Management

New Storage System

New Operating System

Old Application

Digital Object

New Display System

Add Operating System Call

Page 13: Data Preservation

Technology Management

Old Storage System

New Operating System

Old Application

Digital Object

Old Display System

Add Operating System Call

Add Operating System Call

Page 14: Data Preservation

Technology Management Migration

New Storage System

New Operating System

New Application

Digital Object

New Display System

Migrate Encoding Format
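
As a small, hedged illustration of what "migrate encoding format" can mean in practice, the sketch below re-encodes a text object from Latin-1 to UTF-8. The choice of encodings and the function name `migrate_text_encoding` are assumptions made for the example; the slides do not name specific formats.

```python
def migrate_text_encoding(data: bytes, old: str = "latin-1", new: str = "utf-8") -> bytes:
    """Re-encode a text object: the characters are unchanged, the bytes are not."""
    text = data.decode(old)   # interpret the bytes using the old encoding
    return text.encode(new)   # write the same characters in the new encoding


old_bytes = "naïve café".encode("latin-1")
new_bytes = migrate_text_encoding(old_bytes)
print(len(old_bytes), len(new_bytes))
```

The byte lengths differ because the two accented characters take one byte each in Latin-1 and two in UTF-8, so a checksum taken before migration will not match one taken after; authenticity metadata has to be updated as part of the migration.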

Page 15: Data Preservation

Technology Management - SDSC

Old Storage System

New Operating System

New Application

Digital Object

Old Display System

Wrap Storage System

Wrap Display System

Migrate Encoding Format

Page 16: Data Preservation

Accessing Archived Data

• Name transparency
  – Access data without knowing the file name
  – Map from attributes to a local file name

• Location transparency
  – Access data without knowing where it is stored
  – Map from global file name to local file name

• Collection transparency
  – Access data without knowing the collection attributes
  – Map from concept space to collection attributes
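
The three mappings can be read as a chain of lookups: concept to attributes, attributes to a global (logical) name, and global name to a local file name. The following sketch shows that chain with plain dictionaries. Every catalog entry, path, and function name here is made up for illustration.

```python
# Collection transparency: concept -> collection attributes (hypothetical entries).
CONCEPT_TO_ATTRIBUTES = {
    "rainfall": {"variable": "precipitation"},
}

# Name transparency: attributes are stored against each global (logical) name.
ATTRIBUTE_CATALOG = {
    "/archive/climate/run42": {"variable": "precipitation", "year": "1999"},
    "/archive/climate/run43": {"variable": "temperature", "year": "1999"},
}

# Location transparency: global name -> (storage resource, local file name).
REPLICA_CATALOG = {
    "/archive/climate/run42": [("tape-archive", "/vol7/0001.dat"),
                               ("disk-cache", "/cache/r42.dat")],
    "/archive/climate/run43": [("tape-archive", "/vol7/0002.dat")],
}


def find_by_attributes(query: dict) -> list:
    """Return the global names whose attributes match every key in the query."""
    return sorted(
        name for name, attrs in ATTRIBUTE_CATALOG.items()
        if all(attrs.get(key) == value for key, value in query.items())
    )


def locate(global_name: str) -> list:
    """Return the known replicas of a global name as (resource, local name) pairs."""
    return REPLICA_CATALOG[global_name]


def find_by_concept(concept: str) -> list:
    """Translate a concept into attributes, then search by those attributes."""
    return find_by_attributes(CONCEPT_TO_ATTRIBUTES[concept])


for name in find_by_concept("rainfall"):
    print(name, locate(name))
```

A caller who asks for "rainfall" never sees the attribute names, the global name, or the storage location unless they choose to.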

Page 17: Data Preservation

Information Management - Logical Name Space

• Set of attributes to describe digital entities that are registered into the logical name space

• SRB metadata - Unix file system semantics

• Provenance metadata - Dublin Core

• Resource metadata - User access control lists

• Discipline metadata - User defined attributes

• Each digital entity may have unique attributes
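
A rough sketch of how the four attribute sets might hang off one registered entity is given below. The `DigitalEntity` and `LogicalNameSpace` classes, the field names, and the sample values are hypothetical; only the four metadata categories come from the slide. The provenance keys `title` and `creator` are Dublin Core element names.

```python
from dataclasses import dataclass, field

CATEGORIES = ("system", "provenance", "resource", "discipline")


@dataclass
class DigitalEntity:
    """One registered entity with the four attribute sets named on the slide."""
    logical_name: str
    system: dict = field(default_factory=dict)       # file-system-style attributes
    provenance: dict = field(default_factory=dict)   # Dublin Core elements
    resource: dict = field(default_factory=dict)     # access control lists
    discipline: dict = field(default_factory=dict)   # user-defined attributes


class LogicalNameSpace:
    """Hypothetical registry keyed by logical name."""

    def __init__(self):
        self._entities = {}

    def register(self, entity: DigitalEntity) -> None:
        if entity.logical_name in self._entities:
            raise ValueError(f"{entity.logical_name} is already registered")
        self._entities[entity.logical_name] = entity

    def query(self, category: str, key: str, value) -> list:
        """Return logical names whose attribute set `category` has key == value."""
        if category not in CATEGORIES:
            raise KeyError(category)
        return sorted(
            name for name, entity in self._entities.items()
            if getattr(entity, category).get(key) == value
        )


space = LogicalNameSpace()
space.register(DigitalEntity(
    logical_name="/survey/image001",
    system={"size": 2048, "owner": "curator"},
    provenance={"title": "Sky survey image 1", "creator": "Example Observatory"},
    resource={"acl": {"curator": "write", "public": "read"}},
    discipline={"band": "infrared"},
))
space.register(DigitalEntity(
    logical_name="/survey/notes.txt",
    system={"size": 120, "owner": "curator"},
    discipline={"language": "en"},
))
print(space.query("discipline", "band", "infrared"))
```

The two entities carry different discipline attributes, which reflects the last bullet: attribute sets are per entity, not fixed per collection.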

Page 18: Data Preservation

Knowledge Management - Discovery across Collections

• Mapping from collection attributes to discipline concepts
  – Make queries based on discipline concepts

• Characterization of relationships between attributes
  – Semantic / logical - cross-walks
  – Procedural / temporal - records management
  – Structural / spatial - GIS
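
The semantic cross-walk case is the easiest of the three relationship types to sketch: a table records which attribute in each collection expresses a shared concept, and a query phrased in concepts is rewritten per collection. The collections, attribute names, and records below are invented for the example.

```python
# Cross-walk: concept -> attribute name used by each collection (hypothetical).
CROSSWALK = {
    "author": {"library_a": "creator", "library_b": "written_by"},
    "created": {"library_a": "date", "library_b": "year_written"},
}

# Two collections that describe the same kind of record with different attributes.
COLLECTIONS = {
    "library_a": [
        {"id": "a1", "creator": "A. Author", "date": "2001"},
        {"id": "a2", "creator": "B. Writer", "date": "1999"},
    ],
    "library_b": [
        {"id": "b7", "written_by": "A. Author", "year_written": "2000"},
    ],
}


def query_by_concept(concept: str, value: str) -> list:
    """Search every collection for records whose mapped attribute equals value."""
    hits = []
    for collection, records in COLLECTIONS.items():
        attribute = CROSSWALK[concept].get(collection)
        if attribute is None:
            continue  # this collection has no attribute for the concept
        hits.extend(
            (collection, record["id"])
            for record in records
            if record.get(attribute) == value
        )
    return hits


print(query_by_concept("author", "A. Author"))
```

Temporal and spatial relationships would need richer rules than a lookup table; this sketch covers only the name-to-name mapping.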

Page 19: Data Preservation

Knowledge Based Data Grids

[Figure: three-layer grid diagram. Rows: Knowledge, Information, Data. Columns: Ingest Services, Management, Access Services.]

• Knowledge: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse

• Information: Attributes, Semantics; Information Repository; Attribute-based Query

• Data: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Feature-based Query

• Interface labels: MCAT/HDF; Grids; XML DTD; SDLIP; XTM DTD; Rules - KQL

• Annotations: (Model-based Access); (Data Handling System - SRB)

Page 20: Data Preservation

Further Information

http://www.npaci.edu/DICE

