Applied CyberInfrastructure Concepts
ISTA 420/520 Fall 2014
Nirav Merchant ([email protected]), Bio Computing & iPlant Collaborative
Eric Lyons ([email protected]), Plant Sciences & iPlant Collaborative
University of Arizona
http://goo.gl/p4j3m or https://sites.google.com/site/appliedciconcepts/
Will Computers Crash Genomics? Science Vol 331 Feb 2011
Topic Coverage
• Lifecycle Issues (example from MIT)
• Why DM (Data Management)
• iRODS Introduction
• Scaling the Infrastructure for Data Management (Chapter 3 from FiMDA)
• Group homework
Reality of data
“We are drowning in data, but starving of information” - Attribution unknown
Data Life Cycle
http://www.data-archive.ac.uk/create-manage/life-cycle
iRODS Background and Evolution
• integrated Rule-Oriented Data System (iRODS) http://www.irods.org
• Originated at SDSC, developed by the DICE (Data Intensive Cyber Environments) group
• Based on decade-long SRB development experience for managing distributed data
• Community-driven
• Most of the group migrated to UNC Chapel Hill in 2008-2009
– The group is bi-coastal: DICE-UNC, DICE-UCSD
• First release of iRODS in 2009
• iRODS picked up where SRB left off
• Modular, extensible, customizable
• Open source (BSD license)
• Supported at UNC with complementary activities by DICE and RENCI, a research unit of UNC Chapel Hill
• https://github.com/irods/irods
iRODS
I. Data grid middleware
II. Data management infrastructure
III. A framework for procedural implementation of data management policy (policy-driven data management)
iRODS is all of these.
iRODS View of Distributed Data
• iRODS installs over heterogeneous data resources
• Users can share & manage distributed data as a single collection
[Figure: a User Client sees a single collection, the iRODS Unified Virtual Collection, spanning My Data (disk, filesystem, site-specific storage, ...), My Data (tape, database, filesystem, ...), and Partner’s Data (remote disk, tape, filesystem, site-specific storage, ...)]
iRODS as a Data Grid
• Sharing data across:
– geographic and institutional boundaries
– heterogeneous resources (hardware/software)
• Virtual (logical) collections of distributed data
• Global name spaces
– data: files and collections
– users: single sign on
– storage: virtual resources
• Metadata catalogue (iCAT) manages mappings between logical and physical name spaces
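The logical-to-physical mapping can be pictured with a toy catalogue. This is a minimal sketch in plain Python, not the real iCAT schema or any iRODS API; the class names, zone name, resource names, and paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Replica:
    resource: str       # hypothetical storage resource name
    physical_path: str  # where the bytes live on that resource


@dataclass
class ToyCatalog:
    """Toy stand-in for a metadata catalogue (assumption, not the iCAT)."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_path, replica):
        # One logical name may map to several physical copies.
        self.entries.setdefault(logical_path, []).append(replica)

    def resolve(self, logical_path):
        # Return every known physical location for a logical name.
        return list(self.entries.get(logical_path, []))


catalog = ToyCatalog()
logical = "/demoZone/home/alice/reads.fastq"
catalog.register(logical, Replica("diskResc", "/vault/a1/reads.fastq"))
catalog.register(logical, Replica("tapeResc", "/archive/0042/reads.fastq"))
replicas = catalog.resolve(logical)
```

The user only ever names the logical path; which disk or tape holds the bytes is a lookup detail.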
A RENCI Data Grid
[Figure: several iRODS servers, one of them with the Metadata Catalog (iCAT)]
• Client asks for data – request goes to an iRODS server
• Server contacts the iCAT-enabled server
• Information (location, access rights, etc) is retrieved from the iCAT
• Server containing data is signaled to send data to authorized client
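The four steps above can be sketched as a small simulation. This is an illustration only, assuming in-memory objects in place of networked servers; `CatalogServer`, `DataServer`, the server names, and the path are hypothetical and do not correspond to iRODS classes or calls.

```python
class CatalogServer:
    """Hypothetical stand-in for the iCAT-enabled server."""

    def __init__(self):
        self.locations = {}  # logical path -> name of the server holding the data
        self.readers = {}    # logical path -> set of users allowed to read

    def lookup(self, path, user):
        # Retrieve location and check access rights for this user.
        if path not in self.locations:
            raise KeyError(path)
        if user not in self.readers.get(path, set()):
            raise PermissionError(f"{user} may not read {path}")
        return self.locations[path]


class DataServer:
    def __init__(self, name, catalog, grid):
        self.name = name
        self.catalog = catalog
        self.grid = grid   # name -> DataServer, shared by all servers
        self.store = {}    # logical path -> bytes held locally
        grid[name] = self

    def handle_request(self, path, user):
        # Steps 2-3: contact the catalog server for location and access rights.
        holder = self.catalog.lookup(path, user)
        # Step 4: the server that holds the data sends it to the client.
        return self.grid[holder].store[path]


catalog = CatalogServer()
grid = {}
server_a = DataServer("serverA", catalog, grid)
server_b = DataServer("serverB", catalog, grid)

path = "/demoZone/data/genome.fa"
server_b.store[path] = b"ACGT"
catalog.locations[path] = "serverB"
catalog.readers[path] = {"alice"}

# Step 1: the client sends its request to any server, here serverA.
result = server_a.handle_request(path, "alice")
```

The client never needs to know that serverB holds the file; an unauthorized user is rejected at the catalog lookup, before any data moves.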
[Figure labels: iRODS servers; sites iPlant, NCSU, UNC-A, Duke, UNC-CH, and RENCI (Europa Center)]
A complete data grid (zone) has one metadata catalogue (iCAT)
TUCASI Infrastructure Project (TIP) Federated Data Grids
Independent data grids (zones), each with its own iCAT, can be federated
18 September 2012
Federation of Data Grids
• NASA
– Disparate data collections: satellite data, model data, remote sensing data
– Manage the collections separately (technically and administratively) with separate data grids
– Federate the data grids to give users an overall view onto NASA data
• Collaboration between consortia
– DataNet Federation Consortium: 6 science domain partners, federating their data grids to share data, users
– Users authenticate to home data grid, access federated data grids
• For geographically distributed replication, evolution in data life cycle
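The idea that users authenticate to their home data grid and then access a federated one can be sketched with two toy zones. This is an assumption-laden illustration, not how iRODS implements federation; the `Zone` class, the zone names `satZone` and `modelZone`, the user, and the path are hypothetical.

```python
class Zone:
    """Toy data grid: its own users, its own catalogue, its own access lists."""

    def __init__(self, name):
        self.name = name
        self.users = set()   # users whose home is this zone
        self.catalog = {}    # logical path -> data
        self.acl = {}        # logical path -> set of (user, home_zone)
        self.trusted = {}    # federated zone name -> Zone

    def federate(self, other):
        # Federation is mutual; each zone keeps its own catalogue.
        self.trusted[other.name] = other
        other.trusted[self.name] = self

    def authenticate(self, user):
        return user in self.users

    def get(self, path, user, home_zone):
        # The home zone vouches for the user; this zone decides access.
        home = self if home_zone == self.name else self.trusted.get(home_zone)
        if home is None or not home.authenticate(user):
            raise PermissionError("unknown zone or unauthenticated user")
        if (user, home_zone) not in self.acl.get(path, set()):
            raise PermissionError("access not granted")
        return self.catalog[path]


sat = Zone("satZone")
model = Zone("modelZone")
sat.federate(model)

sat.users.add("alice")
run = "/modelZone/runs/r1.nc"
model.catalog[run] = b"model output"
model.acl[run] = {("alice", "satZone")}

data = model.get(run, "alice", "satZone")
```

Each zone stays administratively separate: modelZone never stores alice's credentials, it only records that alice from satZone may read the file.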
iPlant Data Store
Free Your Data
Different Users, Different Access Needs: One Data Store