
Trusted Datagrids: Library of Congress Projects with UCSD

Date post: 11-Jan-2016
Transcript
Page 1: Trusted Datagrids: Library of Congress Projects with UCSD

Trusted Datagrids: Library of Congress Projects with UCSD

Ardys Kozbial – UCSD Libraries

David Minor - SDSC

Page 2: Trusted Datagrids: Library of Congress Projects with UCSD

Building Trust in a 3rd Party Repository: A Pilot Project

David Minor, San Diego Supercomputer Center

Page 6: Trusted Datagrids: Library of Congress Projects with UCSD

How can the LC trust someone they can’t control?

Page 8: Trusted Datagrids: Library of Congress Projects with UCSD

Moving forward in the right direction requires more than fuzzy promises

Page 9: Trusted Datagrids: Library of Congress Projects with UCSD

… it takes a combination of experts and tools.

Cyberinfrastructure

Page 10: Trusted Datagrids: Library of Congress Projects with UCSD

Cyberinfrastructure is the collection of ...

Resources: computers, data storage, networks, scientific instruments, experts, etc.

+

Glue: integrating software, systems, and organizations

Page 11: Trusted Datagrids: Library of Congress Projects with UCSD

“Effective cyberinfrastructure for the humanities and social sciences will allow scholars to focus their intellectual and scholarly energies on the issues that engage them, and to be effective users of new media and new technologies, rather than having to invent them.”

- ACLS Commission on Cyberinfrastructure for the Humanities & Social Sciences

Page 12: Trusted Datagrids: Library of Congress Projects with UCSD

“The mission of the San Diego Supercomputer Center (SDSC) is to empower communities in data-oriented research, education, and practice through the innovation and provision of Cyberinfrastructure.”

Page 13: Trusted Datagrids: Library of Congress Projects with UCSD

SDSC ...

• Is one of the original NSF supercomputer centers

• Supports high performance computing systems

• Supports data applications for science, engineering, social sciences, cultural heritage institutions

• Has LARGE data capabilities

• 3+ PB Disk Storage

• 25+ PB Tape Storage

Page 14: Trusted Datagrids: Library of Congress Projects with UCSD

UCSD Libraries

• 3.5+ million volumes

• Digital Access Management System (in development)

• 250,000+ objects

• 15+ TB

• Shared collections with UC

• California Digital Library

• Digital Preservation Repository

• eScholarship repository

Page 15: Trusted Datagrids: Library of Congress Projects with UCSD

Partnerships and Collaborations

LC Pilot Project – Building Trust in a 3rd Party Repository

– Using test image collections/web crawls, ingest content to SDSC repository
– Allow access for content audit
– Track usage of content over time
– Deliver content back to LC at end of project

Library of Congress NDIIPP Chronopolis Program
– Build Production Capable Chronopolis Grid (50 TB x 3)
– Further define transmission packaging for archival communities
– Investigate best network transfer models for I2 and TeraGrid networks

California Digital Library (CDL) Mass Transit Program
– Enable UC System Libraries to transfer high-speed mass digitization collections across CENIC/I2
– Develop transmission packaging for CDL content

UCSD Libraries’ Digital Asset Management System
– RDF System with data managed in SRB at SDSC

Page 16: Trusted Datagrids: Library of Congress Projects with UCSD

SDSC DPI Group

Digital Preservation Initiatives Group
– Charged with Developing and Supporting Digital Preservation Services within the Production Systems Division of SDSC.
– http://dpi.sdsc.edu
– Cross-Organizational Group
• SDSC Personnel/UCSD Libraries Personnel
– Libraries
– Archives
– Technology
– Information Science

Page 17: Trusted Datagrids: Library of Congress Projects with UCSD

Cyberinfrastructure

Trust

Page 18: Trusted Datagrids: Library of Congress Projects with UCSD

For Example:

Page 19: Trusted Datagrids: Library of Congress Projects with UCSD

We worked together to set up high-speed data replication services

Checksums

Achieved 200 Mb/s = 2 TB/day

Highly reliable

Internet2
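The conversion from 200 Mb/s to roughly 2 TB/day can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 Mb = 10^6 bits, 1 TB = 10^12 bytes) and a link that stays busy for the full 24 hours; the function name is made up for illustration.

```python
def tb_per_day(megabits_per_second: float) -> float:
    """Convert a sustained rate in Mb/s to terabytes per day (decimal units)."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8  # bits -> bytes
    seconds_per_day = 24 * 60 * 60
    return bytes_per_second * seconds_per_day / 1e12


print(f"{tb_per_day(200):.2f} TB/day")  # prints "2.16 TB/day"
```

The slide's "2 TB/day" is this figure rounded down, which leaves a little headroom for protocol overhead and idle time.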

Page 20: Trusted Datagrids: Library of Congress Projects with UCSD

Network setup involved …

LC and SDSC staff working together

Configurations on networks and computers

Resolving different security environments

Network monitoring

Page 21: Trusted Datagrids: Library of Congress Projects with UCSD

Lessons Learned

Networking is hard!

Can’t forget it once it’s set up

It’s not magic - there’s always a reason

It highlights collaborative nature of work

Page 22: Trusted Datagrids: Library of Congress Projects with UCSD

Trust Elements

Has a long-term solution been found?

Have multi-institutional issues been solved?

Does new infrastructure improve process?

Is solution useful for other organizations?

Page 24: Trusted Datagrids: Library of Congress Projects with UCSD

SDSC created a robust storage environment for this data

Multiple replications …

… at SDSC

… and geographically diverse locations

Page 25: Trusted Datagrids: Library of Congress Projects with UCSD

(a process with several characteristics)

Needed to replicate structure exactly

This had to be done for 5+ replications

Complex environment had to be transparent

Data had to be available for manipulation

Page 26: Trusted Datagrids: Library of Congress Projects with UCSD

The Storage Resource Broker provided replication services ...

Page 27: Trusted Datagrids: Library of Congress Projects with UCSD

... and extensive monitoring, logging and reporting functions (which led to many conversations)

Page 28: Trusted Datagrids: Library of Congress Projects with UCSD

Logging and monitoring procedures

Scripts which compared the files within the system with a master list – checked changes on either side … fairly straightforward

But …

What is the master list and who maintains it?

Who decides what is a legitimate change?

Do you want a dark archive or an active remote data center?
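The comparison scripts themselves are not shown in the slides. As a rough illustration of the idea only, the sketch below checks a directory tree against a master list of SHA-256 checksums and reports differences on either side. The function names, the choice of SHA-256, and the in-memory dictionary form of the master list are all assumptions, not the project's actual tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_to_master(root: Path, master: dict[str, str]) -> dict[str, list[str]]:
    """Compare files under root with a master list of {relative path: checksum}."""
    actual = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in root.rglob("*")
        if p.is_file()
    }
    return {
        # listed in the master list but absent from storage
        "missing": sorted(set(master) - set(actual)),
        # present in storage but not in the master list
        "unexpected": sorted(set(actual) - set(master)),
        # present on both sides with different content
        "changed": sorted(k for k in master.keys() & actual.keys() if master[k] != actual[k]),
    }
```

A report like this only says that the two sides differ. Whether a "changed" or "unexpected" entry is a legitimate update or damage is exactly the policy question the slide raises.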

Page 29: Trusted Datagrids: Library of Congress Projects with UCSD

We tested a new Front-End

Page 30: Trusted Datagrids: Library of Congress Projects with UCSD

… and explored an important issue

“Reliability”

Versus

“Accessibility”

Page 31: Trusted Datagrids: Library of Congress Projects with UCSD

Lessons Learned

Always keep expectations aligned

Don’t confuse accessibility and reliability

Duplication of structure is complicated

Communication highlights communication

Page 32: Trusted Datagrids: Library of Congress Projects with UCSD

Trust Elements

Can remote data be accessed?

Can remote data be retrieved and re-used?

Can remote data be verified?

Can ownership be clearly defined?

Page 33: Trusted Datagrids: Library of Congress Projects with UCSD

SDSC and LC explored a new approach to working with web archives

50,000 ARC files

6 Terabytes of data

Short processing time

Parallel indexing and display system

Looked “default” to the user

Page 34: Trusted Datagrids: Library of Congress Projects with UCSD

Using default tools, our initial indexing rate was 1000 files per day …

This was over our time budget … more than 6 weeks of constant computing to index the entire collection.

Page 35: Trusted Datagrids: Library of Congress Projects with UCSD

We ran 18 parallel indexing instances – reduced processing to a week

We modified the Wayback source code to create a new access infrastructure
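The slides do not say how the collection was divided among the 18 instances. One simple way to picture it is to deal the file list into 18 shards and index each shard independently, as in the sketch below. The `index_file` function is a hypothetical stand-in for the real indexer, the file names are invented, and threads are used only to keep the example self-contained; a real run would more likely use separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def make_shards(files: list[str], workers: int) -> list[list[str]]:
    """Deal files round-robin into one shard per worker."""
    return [files[i::workers] for i in range(workers)]


def index_file(name: str) -> str:
    """Hypothetical stand-in for indexing a single ARC file."""
    return f"indexed:{name}"


def index_shard(shard: list[str]) -> list[str]:
    """Index every file in one shard sequentially."""
    return [index_file(name) for name in shard]


def index_all(files: list[str], workers: int = 18) -> list[str]:
    """Index all shards concurrently and flatten the per-shard results."""
    shards = make_shards(files, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(index_shard, shards)
    return [entry for shard_result in results for entry in shard_result]


files = [f"crawl-{n:05d}.arc" for n in range(50_000)]  # invented names
done = index_all(files)
print(len(done))  # prints 50000

# Back-of-envelope timing, assuming each instance sustains 1000 files/day:
serial_days = 50_000 / 1000           # 50 days, i.e. more than 6 weeks
ideal_days = 50_000 / (1000 * 18)     # under 3 days with perfect scaling
```

Perfect scaling would finish in under three days. The slide reports about a week, which is still roughly a sevenfold improvement over the serial estimate.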

Page 36: Trusted Datagrids: Library of Congress Projects with UCSD

Lessons Learned

Sometimes you need to start over

Default setup isn’t always easiest

Time is a wonderful motivator

Experts are often interested in your work

Page 37: Trusted Datagrids: Library of Congress Projects with UCSD

Trust Elements

Can a new organization bring new expertise?

Are the final results the same?

Can the results be reached in a better way?

Can a new organization work with your partners?

Page 38: Trusted Datagrids: Library of Congress Projects with UCSD

Next steps ….

Chronopolis!

Page 39: Trusted Datagrids: Library of Congress Projects with UCSD

Chronopolis: A Partnership

Chronopolis is being developed by a national consortium led by SDSC and the UCSD Libraries.

Initial Chronopolis provider sites include:

SDSC and UCSD Libraries at UC San Diego

University of Maryland

National Center for Atmospheric Research (NCAR) in Boulder, CO

UCSD Libraries

Page 40: Trusted Datagrids: Library of Congress Projects with UCSD

Institutions and Roles - UCSD

SDSC
– Storage and networking services
– SRB support
– Transmission Packaging Modules

UCSD Libraries
– Metadata services (PREMIS)
– DIPs (Dissemination Information Packages)
– Other advanced data services as needed

Page 41: Trusted Datagrids: Library of Congress Projects with UCSD

Institutions and Roles - NCAR

National Center for Atmospheric Research

– Archives: Complete copy of all data

– Storage and network support

– Network testing

Page 42: Trusted Datagrids: Library of Congress Projects with UCSD

Institutions and Roles - UMIACS

University of Maryland – Institute for Advanced Computer Studies

– Archives: Complete copy of all data

– Advanced data services

• PAWN: Producer – Archive Workflow Network in Support of Digital Preservation

• ACE: Auditing Control Environment to Ensure the Long Term Integrity of Digital Archives

– Other advanced data services as needed

Page 43: Trusted Datagrids: Library of Congress Projects with UCSD

SDSC Chronopolis Program

Page 44: Trusted Datagrids: Library of Congress Projects with UCSD

Chronopolis Vocabulary

Partners – UCSD Libraries, National Center for Atmospheric Research, University of Maryland Institute for Advanced Computer Studies all provide grid enabled storage nodes for Chronopolis services.

Clients – ICPSR, CDL – contribute content to the Chronopolis preservation network.

SRB – Storage Resource Broker – datagrid software.

iRODS – integrated Rule Oriented Data System – datagrid software.

ACE – Auditing Control Environment – part of the ADAPT project at UMD.

PAWN – Producer Archive Workflow Network – part of the ADAPT project at UMD.

INCA – user level grid monitoring - executes periodic, automated, user-level testing of Grid software and services – grid middleware.

BagIt – Transfer specification developed by CDL and the Library of Congress.

GridFTP – parallel transfer technology - moves large collections within a grid wide-area network.

Page 45: Trusted Datagrids: Library of Congress Projects with UCSD

Chronopolis: Inside

Linked by main staging grid where data is verified for integrity, and quarantined for security purposes.

Collections are independently pulled into each system.

Manifest layer provides added security for database management and data integrity validation.

Benefits
– 3 independently managed copies of the collection
– High availability
– High reliability

[Diagram labels: Chron Clients (CDL, ICPSR); Push; SDSC Staging Grid; Pull; SDSC Core Center Archive; NCAR; UMD; Copy 1, Copy 2, Copy 3; Manifest Management; MCAT DB; Multiple Hash Verifications; Grid Brick Disks; HPSS Tape]

Page 46: Trusted Datagrids: Library of Congress Projects with UCSD

SDSC Leveraged Infrastructure Serves Both HPC & Digital Preservation

Archive
– 25 PB capacity
– Both HPSS & SAM-QFS

Online disk
– ~3 PB total
– HPC parallel file systems
– Collections
– Databases
– Access Tools

Adapted from Richard Moore (SDSC)

Page 47: Trusted Datagrids: Library of Congress Projects with UCSD

Chronopolis Demonstration Project

Demonstration Project 2006-2007
– Demonstration Collections Ingested within Chronopolis

• National Virtual Observatory (NVO)
– 3 TB Hyperatlas Images (partial collection)

• Library of Congress PG Image Collection
– 600 GB Prokudin-Gorskii Image Collection

• Interuniversity Consortium for Political and Social Research (ICPSR)
– 2 TB Web Accessible Data

• NCAR Observational Data
– 3 TB Observational Re-Analysis Data

Page 48: Trusted Datagrids: Library of Congress Projects with UCSD

NDIIPP Chronopolis Project

• Creating a 3-node federated data grid at SDSC, NCAR and UMD – up to 50 TB data from CDL and ICPSR

• Installing and testing a suite of monitoring tools using ACE, PAWN, INCA

• Creating Appropriate Transmission Information Packages

• Generating PREMIS definitions for data

• Writing Best Practices documents for clients and partners

Page 49: Trusted Datagrids: Library of Congress Projects with UCSD

Chronopolis Grid Framework

[Diagram labels: SRB D-Broker; SRB MCAT; Sun 6140 62 TB; Sun SAM-QFS; Apple Xsan; Tape Silos; Chronopolis Data 12 TB; Chronopolis Data 12-25 TB; CDL Server; ICPSR Server; UC Berkeley Network; ICPSR Network; SDSC Network; NCAR Network; Maryland (UMD) Network]

Adapted from Bryan Banister (SDSC)

Page 50: Trusted Datagrids: Library of Congress Projects with UCSD

NDIIPP Chronopolis Clients - CDL

California Digital Library

– A part of UCOP, supports the University of California libraries

– Providing up to 25 TB of data: Web-At-Risk project
• Five years of political and governmental websites
• ARC files created from web crawls
• Using BagIt Transfer Structure
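For readers unfamiliar with the BagIt transfer structure, the sketch below packs a few files into a minimal bag and checks it again on the receiving side. It follows the layout in the published BagIt specification (a `bagit.txt` declaration, a `data/` payload directory, and a `manifest-<algorithm>.txt` listing one checksum and path per line). The version string, the use of SHA-256, the function names, and the sample file are assumptions for illustration; the project would have used an earlier draft of the specification and its own tooling.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write payload files under data/ plus bagit.txt and a SHA-256 manifest."""
    manifest_lines = []
    for rel_name, content in sorted(payload.items()):
        target = bag_dir / "data" / rel_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def validate_bag(bag_dir: Path) -> list[str]:
    """Return a list of problems found when re-checking the manifest."""
    problems = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif hashlib.sha256(target.read_bytes()).hexdigest() != expected:
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

Because the manifest travels with the payload, the receiving site can confirm a transfer without any side channel back to the sender.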

Page 51: Trusted Datagrids: Library of Congress Projects with UCSD

Diagram of CDL Data Transfer

[Diagram labels: CDL Virtual Machine at UCB; BagIt Manifest; File 1 … File n; Wget BagIt; Wget files 1-10, 11-20; Parallel Wget Xfer; Possible SRB/BagIt Module; SDSC Network; Chron Staging; Chron Repository; UMIACS; UMIACS Network; NCAR; NCAR Network]

Adapted from Bryan Banister (SDSC)

Page 52: Trusted Datagrids: Library of Congress Projects with UCSD

NDIIPP Chronopolis Clients-ICPSR

Inter-University Consortium for Political and Social Research, University of Michigan

– Providing ~12 TB of data: Wide variety of types

– Already working with SDSC using SRB

Page 53: Trusted Datagrids: Library of Congress Projects with UCSD

Diagram of ICPSR Transfer

[Diagram labels: ICPSR SRB Repository (UMich); EMC SAN; File 1 … File n; Sput/Srsync Files; Sput tar files; Parallel Sput/Srsync Xfer; SDSC Network; Chron SRB MCAT; Chron Staging; Chron Repository; UMIACS; UMIACS Network; NCAR; NCAR Network]

Adapted from Bryan Banister (SDSC)

Page 54: Trusted Datagrids: Library of Congress Projects with UCSD

Ongoing and Future Initiatives

• Migration of Chronopolis from SRB to iRODS

• Develop Interoperability with Community Based Archival Systems/Standards

• TRAC compliance for SDSC Production Preservation Services/Chronopolis Consortium

Page 55: Trusted Datagrids: Library of Congress Projects with UCSD

Looking for Partnerships

• Repositories interested in moving large digital collections among heterogeneous repository systems.

• Fedora, DSpace or E-Prints sites interested in managed datagrid storage.

• Institutions interested in personnel swaps to conduct TRAC audit assessment compliance.

• Community Needs for Mass-Scale Data Transmission and Storage.

Page 56: Trusted Datagrids: Library of Congress Projects with UCSD

Chronopolis Credits

SDSC
– Fran Berman
– Richard Moore
– David Minor
– Chris Jordan
– Jim D’Aoust
– Robert McDonald
– Don Sutton
– Brian Banister
– Phong Dinh
– Jay Dombrowski
– Emilio Valente

UCSD Libraries
– Brian Schottlaender
– Luc Declerck
– Ardys Kozbial
– Brad Westbrook
– Arwen Hutt

NCAR
– Don Middleton
– Michael Burek
– Linda McGinley

UMIACS
– Joseph JaJa
– Mike Smorul
– Mike McGann

Library of Congress
– Martha Anderson
– Lisa Hoppis

CACI
– Mike Ivey

Page 61: Trusted Datagrids: Library of Congress Projects with UCSD

Chronopolis is ...

• a geographically distributed preservation environment that supports long-term management and stewardship of digital collections

• implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure.

• technology forecasting and migration in support of long-term life-cycle management of the dedicated preservation environment.

Page 62: Trusted Datagrids: Library of Congress Projects with UCSD

Chronopolis focuses on ...

• Assessment of the needs of potential user communities and development of appropriate service models

• Development of Memoranda of Understanding (MOUs), Service Level Agreements (SLAs), etc. to formalize trust relationships and manage expectations

• Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.

• Development of cost and risk models for long-term preservation

• Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure

Page 63: Trusted Datagrids: Library of Congress Projects with UCSD

The people of Chronopolis are ...

UCSD Libraries

Page 64: Trusted Datagrids: Library of Congress Projects with UCSD

In conclusion …

Organizations need ways to validate trust in 3rd parties

Page 66: Trusted Datagrids: Library of Congress Projects with UCSD

SDSC and the Library of Congress explored one way to do this …

by working with Cyberinfrastructure

… and demonstrating trust.

Page 67: Trusted Datagrids: Library of Congress Projects with UCSD

With a trusted relationship, many journeys become possible

