Date post: | 10-May-2015 |
Category: |
Technology |
Upload: | intelhealthcare |
Policy-based Data Management
Integrated Rule Oriented Data Grid (iRODS)
Reagan W Moore (DICE-UNC)
Arcot Rajasekar (DICE-UNC)
http://irods.diceresearch.org
What is the Opportunity Play for iRODS?
At a high level ...
The management of Big Data is the #1 concern for IT
• Life cycle management
• Useful (actionable) and searchable metadata
• Integrity
• Collaboration (federation of immutable data)
iRODS provides policy-based data management
• Next-generation data management cyber-infrastructure
• System that enables a flexible, adaptive, customizable data management architecture
• Tool for large collections (petabytes, hundreds of millions of files)
Properties of policy-based data management systems
• Management of the data life cycle (project collection, digital library, persistent archive, processing pipeline)
Applications of iRODS
• LifeTime™ Library (digital library for students)
• Genomics data grid
• Carolina Digital Repository (institutional repository)
• French National Library (IT automation)
• DataNet Federation Consortium (data and workflow sharing for collaborative research)
We will touch on ...
1. What iRODS is and what problems it is solving today and tomorrow
2. Different use cases (there will be many companies attending, representing many departments with different opportunities/problems)
   a. Digitization of university assets: library archive
   b. Genomic pipeline automation
   c. IT service automation
Topics
• Principles behind policy-based data management
  - Enable collaborative research
  - Enable reproducible science
  - Enable creation of reference collections
• Integrated Rule-Oriented Data System (iRODS)
  - Enforce management policies
  - Automate administrative functions
  - Validate assessment criteria
Shared Collections - Data Grid
[Figure: clients (50 clients: web browser, Unix shell command, ...) reach a file system and a tape archive through the data grid]
Data grid middleware provides global naming, single sign-on, policy enforcement, metadata, and replication
Multiple types of systems can be used to store data
Policy-based Data Management
[Figure: a client connects to two iRODS servers, each with its own rule engine, rule base, workflows, and storage; together they present one logical collection (data grid)]
Consensus on policies and procedures controls the data collection
Policy-Based Data Environments
Purpose: reason a collection is assembled
Properties: attributes needed to ensure the purpose
Policies: controls for enforcing desired properties
• Mapped to computer-actionable rules
Procedures: functions that implement the policies
• Mapped to computer-actionable workflows
Persistent state information: results of applying the procedures
• Mapped to system metadata
Property verification: validation that state information conforms to the desired purpose
• Mapped to periodically executed policies
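The mapping on this slide can be illustrated with a small Python model. This is only a sketch of the concepts, not iRODS code: the class names, the `record_checksum` procedure, and the `"checksum"` state key are all hypothetical, chosen to show how a policy ties a procedure to persistent state and to a verification check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Policy:
    """A control that enforces one desired property (hypothetical model)."""
    name: str
    procedure: Callable[[Dict[str, str]], None]  # function that implements the policy
    verify: Callable[[Dict[str, str]], bool]     # check applied to the state information


@dataclass
class Collection:
    purpose: str                                         # reason the collection is assembled
    policies: List[Policy] = field(default_factory=list)
    state: Dict[str, str] = field(default_factory=dict)  # persistent state (system metadata)

    def apply_policies(self) -> None:
        # Running a procedure leaves its result behind as state information.
        for policy in self.policies:
            policy.procedure(self.state)

    def verify_properties(self) -> Dict[str, bool]:
        # Verification only reads the state; it would be run periodically.
        return {p.name: p.verify(self.state) for p in self.policies}


def record_checksum(state: Dict[str, str]) -> None:
    state["checksum"] = "recorded"


integrity = Policy("integrity", record_checksum,
                   lambda s: s.get("checksum") == "recorded")
collection = Collection(purpose="preserve project data", policies=[integrity])

before = collection.verify_properties()
collection.apply_policies()
after = collection.verify_properties()
print(before, after)
```

The point of the split is that verification never re-runs a procedure; it only inspects the state the procedure left behind.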
Community-based Collection Life Cycle
Collection type | Data state | Policy added
Project Collection | Private | Local Policy
Data Grid | Shared | Distribution Policy
Digital Library | Published | Description Policy
Data Processing Pipeline | Analyzed | Service Policy
Reference Collection | Preserved | Representation Policy
Federation | Sustained | Re-purposing Policy
• Stages correspond to the addition of new policies for a broader community
• Virtualize the stages of the collection life cycle through policy evolution
• The driving purpose changes at each stage of the data life cycle
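A minimal Python sketch of "policy evolution" follows. It assumes that policies accumulate from one stage to the next, which is one reading of "addition of new policies"; the function name `policies_in_force` is hypothetical.

```python
# (collection type, data state, policy added) for each life-cycle stage
LIFE_CYCLE = [
    ("Project Collection", "Private", "Local Policy"),
    ("Data Grid", "Shared", "Distribution Policy"),
    ("Digital Library", "Published", "Description Policy"),
    ("Data Processing Pipeline", "Analyzed", "Service Policy"),
    ("Reference Collection", "Preserved", "Representation Policy"),
    ("Federation", "Sustained", "Re-purposing Policy"),
]


def policies_in_force(stage_name: str) -> list:
    """Return every policy added up to and including the given stage.

    Assumption: a later stage keeps the policies of the earlier stages.
    """
    active = []
    for stage, _state, policy in LIFE_CYCLE:
        active.append(policy)
        if stage == stage_name:
            return active
    raise ValueError(f"unknown stage: {stage_name}")


print(policies_in_force("Digital Library"))
```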
Applications
Data Grids (data sharing)
• Ocean Observatories Initiative
• The iPlant Collaborative
• National Optical Astronomy Observatory
• BaBar High Energy Physics
• Broad Institute genomics data grid
• Wellcome Trust Sanger Institute genomics data grid
Digital Libraries (data publication)
• Texas Digital Library
• French National Library
• UNC-CH SILS LifeTime Library
Repositories, Archives (data preservation)
• NASA Center for Climate Simulation
• Carolina Digital Repository
Sequencing Work - an Infrastructure View
[Figure: infrastructure diagram with these labeled groups]
• National resources: RENCI Science Portal, Open Science Grid, TeraGrid, UNC BASS, ...
• RENCI infrastructure (test/development): pipelines, genome databases; distributed ad-hoc processing and iRODS data-grid managed processing
• Data production: UNC HTSF, third-party vendors
• Clinical data systems: NCGenes, Secure Medical Workspace
• Production: pipelines, archive, genome databases
• Local (TUCASI) data sharing: NIH, other institutions
• Reference data: RefSeq, genome annotations, dbSNP, HGMD, 1000 Genomes
• VarDB, Hadoop
Legend: RC, RENCI, LCCC, HTSF (UNC High Throughput Sequencing Facility); infrastructure hardware; ITS software
Managing several hundred TBs of genomic data
Managing Data on the Research Side
[Figure: the storage, compute, and people that touch research data]
• Storage: RENCI storage (tape drives), UNC storage (tape drives), genomics storage
• Compute: UNC HPC, RENCI HPC, RENCI Hadoop, genomics HPC, genomics Hadoop, IT machines, lab machines
• External compute: Open Science Grid, Clemson, clouds
• Data providers: NIH, external partners
• People: researchers, students, external collaborators, IT staff
iRODS gracefully allows for introducing control
• Data movement and replication
• Metadata standards
• Archival, deletion, and retention
• Integration with workflows, Hadoop, databases
• Hiding complexities
• Automation
• ... all policy driven
• ... without breaking the in-place systems
From "Wild West" to "Managed"
SILS LifeTime Library
Student digital libraries
Enable students to build collections of
• Photographs
• MP3 audio files
• Class documents
• Video
• Web site archive
Resources provided by the School of Information and Library Science at UNC-CH
Student collections range from 2 GB to 150 GB
Number of files ranges from 2,000 to 12,000
SILS LifeTime Library Policies
Library management
• Replication
• Checksums
• Versioning
• Strict access controls
• Quotas
• Metadata catalog replication
• Installation environment archiving
Ingestion
• Automated synchronization of student directory with LifeTime Library
• Automated loading of MP3 metadata
Policy-Driven Repository Infrastructure project, funded by the Institute of Museum and Library Services
Carolina Digital Repository
Ingest Workflow
iRODS Data Grid
More than 50 different clients have been used to interact with the data grid
• Web browsers (iDrop-web, Rich Web client)
• Web services (VOSpace)
• Load libraries (Python, Java)
• I/O libraries (C, C++, Fortran)
• File systems (FUSE, WebDAV, Parrot)
• Synchronization interfaces (iDrop)
• Unix tools, Grid tools (icommands, SAGA, SRM, GriPhyN)
• Workflows (Kepler, Taverna)
• Digital libraries (Fedora, DSpace)
• Portals (EnginFrame)
Managing Information & Knowledge
Concepts
• Data: objects
• Information: names
• Knowledge: relationships between names
• Wisdom: relationships between relationships
Implementation
• Data (bytes): storage system
• Information (metadata): relational database
• Knowledge (policies, procedures): rule base, rule engine
• Wisdom (policy enforcement point): data grid
Data Virtualization
Layers: Access Interface, Policy Enforcement Points, Standard Micro-services, Standard I/O Operations, Storage Protocol, Storage System
• Map from the actions requested by the client to multiple policy enforcement points
• Map from policy to standard micro-services
• Map from micro-services to standard POSIX I/O operations
• Map standard I/O operations to the protocol supported by the storage system
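The four mappings can be pictured as chained lookups. The Python sketch below is illustrative only: every action, enforcement point, micro-service, and protocol name in it is hypothetical and does not correspond to real iRODS identifiers.

```python
# Layer 1: client action -> policy enforcement points (hypothetical names)
PEPS_FOR_ACTION = {"put": ["pre_put", "post_put"]}

# Layer 2: policy enforcement point -> micro-services
MICROSERVICES_FOR_PEP = {
    "pre_put": ["check_quota"],
    "post_put": ["checksum", "replicate"],
}

# Layer 3: micro-service -> standard POSIX-style I/O operations
IO_FOR_MICROSERVICE = {
    "check_quota": ["stat"],
    "checksum": ["open", "read", "close"],
    "replicate": ["open", "read", "write", "close"],
}

# Layer 4: storage system -> protocol label its driver would speak
PROTOCOL_FOR_STORAGE = {"unix": "posix", "object_store": "http"}


def resolve(action: str, storage: str) -> list:
    """Expand one client action into protocol-level calls for one storage system."""
    protocol = PROTOCOL_FOR_STORAGE[storage]
    calls = []
    for pep in PEPS_FOR_ACTION[action]:
        for service in MICROSERVICES_FOR_PEP[pep]:
            for op in IO_FOR_MICROSERVICE[service]:
                calls.append(f"{protocol}:{op}")
    return calls


print(resolve("put", "unix"))
```

Because only the last table depends on the storage system, swapping `"unix"` for `"object_store"` changes the protocol label and nothing above it, which is the benefit the layering is meant to deliver.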
System and User-driven Rules
The data grid automatically applies rules defined in the rule base core.re
You can define rules that are applied interactively or that are deferred for later execution
    irule -F "rule-file.r"
Example Rule
Write "Hello World"
Create a rule file called ruleHello.r
    myTestRule {
      writeLine("stdout", "Hello World");
    }
    INPUT null
    OUTPUT ruleExecOut
Run the rule
    irule -F ruleHello.r
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and document their lack
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
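A few of these requirements (batches of 256, replica repair, a repair log, and restart from the last checked batch) can be sketched in Python. This is a simplified in-memory illustration, not the production rule: the function and variable names are hypothetical, SHA-256 is an assumed checksum algorithm, and the deadline scheduler, load leveling, and statistics are left out.

```python
import hashlib
from typing import Dict, List

BATCH_SIZE = 256  # batch size quoted in the requirements above


def checksum(data: bytes) -> str:
    # Assumption: SHA-256 stands in for whatever checksum the grid records.
    return hashlib.sha256(data).hexdigest()


def verify_collection(files: Dict[str, List[bytes]], expected: Dict[str, str],
                      checkpoint: Dict[str, int], log: List[str]) -> None:
    """Check every replica of every file, repairing bad replicas from a good one.

    files maps a logical name to the contents of its physical replicas.
    checkpoint["next_index"] is updated after each batch so that a restart
    resumes from the last completed batch.
    """
    names = sorted(files)
    start = checkpoint.get("next_index", 0)
    for begin in range(start, len(names), BATCH_SIZE):
        for name in names[begin:begin + BATCH_SIZE]:
            good = [r for r in files[name] if checksum(r) == expected[name]]
            if not good:
                log.append(f"unrecoverable: {name}")
                continue
            bad_count = len(files[name]) - len(good)
            if bad_count:
                # Replace every replica with a copy of a verified one.
                files[name] = [good[0]] * len(files[name])
                log.append(f"repaired {bad_count} replica(s) of {name}")
        checkpoint["next_index"] = min(begin + BATCH_SIZE, len(names))


files = {"a": [b"x", b"x"], "b": [b"y", b"corrupt"]}
expected = {"a": checksum(b"x"), "b": checksum(b"y")}
checkpoint: Dict[str, int] = {}
log: List[str] = []
verify_collection(files, expected, checkpoint, log)
print(log, checkpoint)
```

Keeping the checkpoint outside the function mirrors the slide's point that progress must survive a system halt; in the rule itself that state lives in the metadata catalog.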
Workflow Management & Registration
• eCWkflow.mss: workflow file
• /earthCube/eCWkflow: directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file)
• eCWkflow.mpf: input parameter file; lists parameters and input and output file names
• eCWkflow.run: automatically generated run file for executing each input file
• /earthCube/eCWkflow/eCWkflow.runDir0: directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented
• Outfile: output file created for eCWkflow.mpf
• Second set shown in the figure: eCWkflow2.mpf, eCWkflow2.run, /earthCube/eCWkflow/eCWkflow2.runDir0, Newfile
Publications
• Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, "iRODS Primer: Integrated Rule-Oriented Data System", Morgan & Claypool, 2010.
• Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, "The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook", DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com.
Reagan W. Moore
rwmoore@renci.org
http://irods.diceresearch.org
NSF OCI-0940841 "DataNet Federation Consortium"
NSF OCI-1032732 "Improvement of iRODS for Multi-Disciplinary Applications"
NSF OCI-0848296 "NARA Transcontinental Persistent Archives Prototype"
NSF SDCI-0721400 "Data Grids for Community Driven Applications"
iRODS - Open Source Software
iRODS Distributed Data Management
Initializing Workflow Parameters
    *Val = "0";
    msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
    foreach (*GenQOut2) {
      msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
    }
    if (int(*Val) == 0) {
      # first execution: attach the progress attribute to the collection
      *Str1 = "TEST_DATA_ID=0";
      msiString2KeyValPair(*Str1, *kvp);
      msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
      writeLine("*Lfile", "added TEST_DATA_ID attribute to collection *Coll");
    }
    # on a restart TEST_DATA_ID will be greater than 0
    msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
    msiExecGenQuery(*GenQInp2, *GenQOut2);
    foreach (*GenQOut2) {
      msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
    }
Workflow Operations Used
• Arithmetic (+, -, *, /)
• Boolean tests (==, !=, &&, ||, >, <, >=)
• Conditional statements: if, then, else
• Control: break, fail
• Loops: for, foreach, while
• List manipulation: initialization, list addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)
• Variable manipulation: initialization, type conversion (int, double, str)
Micro-services Used
Metadata catalog manipulation
• msiGetValByKey: get metadata from structure
• msiExecStrCondQuery: execute string conditional query
• msiString2KeyValPair: convert string to key-value pair
• msiAssociateKeyValuePairsToObj: add metadata
• msiMakeGenQuery: create a query
• msiExecGenQuery: execute a query
• msiCloseGenQuery: release query buffers
• msiGetContInxFromGenQueryOut: check for more rows
• msiRemoveKeyValuePairsFromObj: remove metadata
• msiGetMoreRows: get more rows from query
Data and directory manipulation
• msiIsColl: check whether name is a collection
• msiCollCreate: create a collection
• msiDataObjCreate: create a file
• msiDataObjRepl: replicate a file
• msiDataObjChksum: checksum a file
• msiDataObjUnlink: delete a file
System functions
• msiGetSystemTime: get the system time
• writeLine: write a line to a file or standard out
• msiSleep: sleep
Performance at RENCI
• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs
Data Analysis Use Cases
• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be a demonstration of invoking multiple workflow systems within the same analysis. An example is the integration of a Cyber-integrator workflow with collaboration environments to support drought prediction.
Eco-Hydrology
RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state
[Figure: workflow diagram with these labels]
• Starting points: choose gauge or outlet (HIS); extract drainage area (NHDPlus)
• Terrain inputs: Digital Elevation Model (DEM), slope, aspect, streams (NHD), roads (DOT)
• Other inputs and sources: land use (NLCD, EPA), leaf area index (Landsat TM), phenology (MODIS), soil data (USDA)
• Nested watershed structure: basin, stream network, hillslope, patch, strata
• Outputs: soil and vegetation parameter files, worldfile, flowtable, RHESSys
iRODS Rule for RHESSys
    main {
      getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
      convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
      extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
      importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
      delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
    }
Modular workflow composed by chaining basic transformations
• Define input variables
• Call functions to apply each transformation step
• Store results in shared collection
    extractTileFromNHD_DEM(*extentCoords) {
      # Split path to object into collection and name
      msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
      writeLine("serverLog", *nhdDEMObjColl);
      writeLine("serverLog", *nhdDEMObjName);
      # Build query to discover physical path
      msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
      msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
      msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
      msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
      # Run query
      msiExecGenQuery(*genQInp, *genQOut);
      # Extract path from query result
      foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
      writeLine("serverLog", *filePath);
      # Determine physical path of input directory
      msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
      # Generate physical path of output file
      msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
      *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
      *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
      # Generate iRODS path of output
      msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
      *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
      # Run gdal_translate on the server to cut the tile
      *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
      writeLine("serverLog", *args);
      msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
      writeLine("serverLog", *cmd_out);
      # Register tile file with iRODS
      msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
    }
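The path handling inside extractTileFromNHD_DEM is easy to get wrong, so here is the same derivation as a stand-alone Python sketch. It assumes "/" as the path separator and an ".img" extension for the tile; the function names and the example paths are hypothetical, and nothing here talks to iRODS or runs GDAL.

```python
import posixpath


def build_tile_paths(file_path: str, dem_obj_coll: str):
    """Derive the tile's physical path and logical (iRODS) path.

    file_path is the physical path of one file inside the raster dataset
    directory; dem_obj_coll is the logical collection holding that dataset.
    """
    in_file_dir, _header_file = posixpath.split(file_path)
    in_file_parent_dir, raster_dataset_name = posixpath.split(in_file_dir)
    tile_file_name = "SUBSET-" + raster_dataset_name + ".img"
    tile_file_path = posixpath.join(in_file_parent_dir, tile_file_name)
    dem_coll_parent, _ = posixpath.split(dem_obj_coll)
    tile_obj_path = posixpath.join(dem_coll_parent, tile_file_name)
    return in_file_dir, tile_file_path, tile_obj_path


def gdal_translate_args(extent_coords: str, in_file_dir: str, tile_file_path: str) -> list:
    """Build the argument list that the rule passes as a single string."""
    return ["-of", "HFA", "-projwin", *extent_coords.split(), in_file_dir, tile_file_path]


# Hypothetical example paths and extent
in_dir, tile_path, obj_path = build_tile_paths(
    "/vault/dem/ned30/hdr.adf", "/zone/home/dem/ned30")
args = gdal_translate_args("-80.0 36.0 -79.0 35.0", in_dir, tile_path)
print(tile_path, obj_path, args)
```

Building the arguments as a list rather than one concatenated string avoids quoting problems if a path ever contains a space.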
Summary
iRODS is a powerful policy-based engine for managing next-generation Big Data cyber-infrastructures
Enables a flexible, adaptive, and customizable data management architecture
"Canned" scripts (policies) can be created to standardize and automate users' processes
• Simple menu-driven interface
• No CS degree needed
iRODS is the middleware for distributed data management
Thank you
Questions?
What is the Opportunity Play for iRODS
At a high level hellip
The Management of Big Data is the 1 concern for IT
bull Life Cycle Management
bull Useful (actionable) and searchable metadata
bull Integrity
bull Collaboration (Federation of Immutable data)
iRODS Provides Policy-Based data management
bull Next Generation data management cyber-infrastructure
bull System that enables a flexible adaptive customizable
data management architecture
bull Tool for large collections (Petabytes hundreds of millions of files)
Properties of policy-based data management systems
Management of the data life cycle (project collection digital library persistent
archive processing pipeline)
Applications of iRODS
LifeTimetrade Library (digital library for students)
Genomics data grid
Carolina Digital Repository (institution repository)
French National Library (IT automation)
DataNet Federation Consortium (data and workflow sharing for collaborative
research)
1 What iRODS is and what problems it is solving today and tomorrow
2 Speak to different use cases (there will be many companies attending
representing many departments with different opportunitiesproblems)
a Digitization of University Assets- Library archive
b Genomic pipeline automation
c IT service automation
We will touch on hellip
Topics
bull Principles behind policy-based data management
ndash Enable collaborative research
ndash Enable reproducible science
ndash Enable creation of reference collections
bull Integrated Rule-Oriented Data System (iRODS)
ndash Enforce management policies
ndash Automate administrative functions
ndash Validate assessment criteria
Shared Collections ndash Data Grid
File
System
Client 50 clients web browser
unix shell command hellip
Data grid middleware
provides global name
single sign-on policy
enforcement metadata
replication Tape
Archive
Data Grid
Multiple types of systems
can be used to store data
Policy-based Data Management
Client
iRODS-server
Rule-engine
Rule base
Workflows
iRODS-server
Rule Engine
Rule base
Workflows
Storage Storage
Logical
Collection
(data grid)
Consensus on Policies and Procedures
controls the Data Collection
7
Policy-Based Data Environments
Purpose Reason a collection is assembled
Properties Attributes needed to ensure the purpose
Policies Controls for enforcing desired properties
bull mapped to computer actionable rules
Procedures Functions that implement the policies
bull Mapped to computer actionable workflows
Persistent state information Results of applying the procedures
bull mapped to system metadata
Property verification Validation that state information conforms to the desired purpose
bull mapped to periodically executed policies
Community-based Collection Life Cycle
Project
Collection
Private
Local
Policy
Data
Grid
Shared
Distribution
Policy
Digital
Library
Published
Description
Policy
Data
Processing
Pipeline
Analyzed
Service
Policy
Reference
Collection
Preserved
Representation
Policy
Federation
Sustained
Re-purposing
Policy
Stages correspond to addition of new policies for a broader community
Virtualize the stages of the collection life cycle through policy evolution
The driving purpose changes at each stage of the data life cycle
Applications
Data Grids (data sharing)
Ocean Observatories Initiative
The iPlant Collaborative
National Optical Astronomy Observatory
Babar High Energy Physics
Broad Institute genomics data grid
WellCome Trust Sanger Institute genomics data grid
Digital Libraries (data publication)
Texas Digital Library
French National Library
UNC-CH SILS LifeTime Library
Repositories Archives (data preservation)
NASA Center for Climate Simulation
Carolina Digital Repository
Sequencing Work ndash an Infrastructure View
RENCI
Science
Portal
Open
Science
Grid
TeraGrid
UNC BASS
hellip
National Resources
Pipelines Genome
Databases
RENCI Infrastructure TestDevelopment
Distributed ad-hoc
processing
iRODS data-grid managed
processing
Data Production
UNC HTFS
Third Party
Vendors
Clinical Data Systems
NCGenes
Secure Medical
Workspace
Production
Pipelines
Archive Genome
Databases
RC RENCI LCCC HTSF Infrastructure hardware ITS software LCCC UNC High
Throughput Sequencing facility
Local
(TUCASI)
Data Sharing
NIH
Other
Institutions
Ref
Se
q
Genome
Annotations
dbSN
P HGMD
1000
Genomes
Managing several hundred TBs of genomic data
VarDB Hadoop
Managing Data on the Research Side
RENCI
STORAGE
(Tape Drives)
UNC
STORAGE
(Tape Drives)
UNC HPC RENCI HPC
External
Compute
Open Science
Grid
Clemson
Clouds
IT Machines
RENCI Hadoop
Genomics
Storage
Lab
Machines
NIH
External
Partners
Genomics HPC
Genomics
Hadoop
Data
Providers
Researchers
Students
External
Collaborators
IT Staff
iRODS gracefully allows for introducing control
bullData movement and replication
bullMetadata standards
bullArchival deletion and retention
bullIntegration with workflows hadoop databases
bullHiding complexities
bullAutomation
bullhellip all policy driven
bullhellip without breaking the in-place systems
Wild West Managed
SILS LifeTime Library
Student digital libraries
Enable students to build collections of
Photographs
MP3 audio files
Class documents
Video
Web site archive
Resources provided by School of Information and
Library Science at UNC-CH
Student collections range from 2 GBytes to 150 Gbytes
Number of files from 2000 to 12000
SILS LifeTime Library Policies
Library management
Replication
Checksums
Versioning
Strict access controls
Quotas
Metadata catalog replication
Installation environment archiving
Ingestion
Automated synchronization of student directory
with LifeTime Library
Automated loading of MP3 metadata
Policy-Driven Repository Infrastructure project
funded by the Institute for Museum and Library Services
Carolina Digital Repository
Carolina Digital Repository
Ingest Workflow
iRODS Data Grid
More than 50 different clients have been used to
interact with the data grid
Web browsers (iDrop-web Rich Web client)
Web services (VOSpace)
Load libraries (Python Java)
IO libraries (C C++ Fortran)
File systems (FUSE WebDav Parrot)
Synchronization interfaces (iDrop)
Unix tools Grid tools (icommands SAGA SRM Griphyn)
Workflows (Kepler Taverna)
Digital Libraries (Fedora DSpace)
Portals (EnginFrame)
Managing Information amp Knowledge
Concepts
Data objects
Information names
Knowledge relationships between names
Wisdom relationships between relationships
Implementation
Data bytes Storage system
Information metadata Relational database
Knowledge policies procedures Rule base Rule engine
Wisdom policy enforcement point Data Grid
Data Virtualization
Storage System
Storage Protocol
Access Interface
Policy Enforcement Points
Standard Micro-services
Standard IO Operations
bull Map from the actions
requested by the client to
multiple policy
enforcement points
bull Map from policy to
standard micro-services
bull Map from micro-services
to standard Posix IO
operations
bull Map standard IO
operations to the
protocol supported by
the storage system
System and User-driven Rules
The data grid automatically applies rules
defined in the rule base corere
You can define rules that are applied
interactively or that are deferred for later
execution
irule ndashF ldquorule-filerrdquo
Example Rule
Write ldquoHello Worldrdquo
Create rule file call ruleHellor
myTestRule
writeLine(stdoutrdquo Hello World)
INPUT null
OUTPUT ruleExecOut
irule ndashF ruleHellor
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and document their lack
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
Workflow Management amp Registration
Workflow file
Directory holding all input and output files
associated with workflow file (mounted
collection that is linked to the workflow file)
Input parameter file lists parameters
and input and output file names
Directory holding all output
files generated for invocation
of eCWkflowrun the version
number is incremented
Automatically generated run file for
Executing each input file
Output file created for
eCWKflowmpf
eCWkflowmss
earthCubeeCWkflow
eCWkflowmpf
earthCubeeCWkfloweCWkflowrunDir0
eCWkflowrun
Outfile
eCWkflow2run
eCWkflow2mpf
earthCubeeCWkfloweCWkflow2runDir
0
Newfile
Publications
Rajasekar R M Wan R Moore W Schroeder S-Y Chen L
Gilbert C-Y Hou C Lee R Marciano P Tooby A de Torcy B
Zhu ldquoiRODS Primer Integrated Rule-Oriented Data Systemrdquo
Morgan amp Claypool 2010
Ward R M Wan W Schroeder A Rajasekar A de Torcy T
Russell H Xu R Moore ldquoThe integrated Rule-Oriented Data
System (iRODS 30) Micro-service Workbookrdquo DICE
Foundation November 2011 ISBN 9781466469129
Amazoncom
Reagan W Moore
rwmoorerenciorg
httpirodsdiceresearchorg
NSF OCI-0940841 ldquoDataNet Federation Consortiumrdquo
NSF OCI-1032732 ldquoImprovement of iRODS for Multi-Disciplinary Applicationsrdquo
NSF OCI-0848296 ldquoNARA Transcontinental Persistent Archives Prototyperdquo
NSF SDCI-0721400 ldquoData Grids for Community Driven Applicationsrdquo
iRODS - Open Source Software
iRODS Distributed Data Management
Val = 0rdquo
msiExecStrCondQuery(SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME =
Coll and META_COLL_ATTR_NAME = TEST_DATA_ID GenQOut2)
foreach (GenQOut2)
msiGetValByKey(GenQOut2 META_COLL_ATTR_NAME Val)
if(int(Val) == 0)
Str1 = TEST_DATA_ID=0rdquo
msiString2KeyValPair(Str1kvp)
msiAssociateKeyValuePairsToObj(kvpColl-C)
writeLine(Lfileadded TEST_DATA_ID attribute to collection Coll)
on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery(META_COLL_ATTR_VALUE COLL_NAME = Coll and
META_COLL_ATTR_NAME = TEST_DATA_ID GenQInp2)
msiExecGenQuery(GenQInp2GenQOut2)
foreach(GenQOut2)
msiGetValByKey(GenQOut2 META_COLL_ATTR_VALUE colldataID)
Initializing Workflow Parameters
Workflow Operations Used
Arithmetic (+ - )
Boolean tests (== = ampamp || gt lt gt=)
Conditional statements
if then else
Control
break fail
Loops
for foreach while
List manipulation
initialization list addition (cons) extracting an element from a
list (elem) updating an element in a list (setelem)
Variable manipulation
initialization type conversion (int double str)
Micro-services Used
Metadata catalog manipulation
msiGetValByKey get metadata from structure
msiExecStrCondQuery execute string conditional query
msiString2KeyValPair convert string to key-value pair
msiAssociateKeyValuePairsToObj add metadata
msiMakeGenQuery create a query
msiExecGenQuery execute a query
msiCloseGenQuery release query buffers
msiGetContInxFromGenQueryOut check for more rows
msiRemoveKeyValuePairsFromObj remove metadata
msiGetMoreRows get more rows from query
Micro-services Used
Data and directory manipulation
msiIsColl check whether name is a collection
msiCollCreate create a collection
msiDataObjCreate create a file
msiDataObjRepl replicate a file
msiDataObjChksum checksum a file
msiDataObjUnlink delete a file
System functions
msiGetSystemTime get the system time
writeLine write a line to a file or standard out
msiSleep sleep
Performance at rencirsquo
bull Execute call to rule engine 18 msecs
bull Execute metadata query 714 msecs
bull Disk seek latency 5 msecs
bull Disk rotational latency 11 msecs
bull Production loop logic 63 msecs
bull Checksum verification 21 msecs
Data Analysis Use Cases
bull Demonstrate reproducible science A use case could include the
registration storage sharing and re-execution of a workflow The hypoxia
use case from the Cross-Domain and Brokering Concept groups could be
used as an example
bull Automate data retrieval A use case could demonstrate remote access to a
data collection retrieval of desired data sets transformation and use in
an analysis workflow An eco-hydrology example that automates access
to digital elevation maps and land use coverage is being built
bull Integrate community resources with collaboration environments An
example would be use of the DAB protocol to identify and cache local
copies of relevant data sets for local analysis
bull Integrate multiple community resources A use case could be
demonstration of invocation of multiple workflow systems within the
same analysis An example is the integration of Cyber-integrator
workflow with collaboration environments to support drought prediction
RHESSys workflow to develop a nested
watershed parameter file (worldfile)
containing a nested ecogeomorphic object
framework and full initial system state
Choose gauge
or outlet (HIS)
Extract
drainage area
(NHDPlus)
Digital
Elevation
Model (DEM)
Worldfile Flowtable
RHESSys
Slope
Aspect
Streams (NHD)
Roads (DOT) Strata
Hillslope
Patch
Basin
Stream network
Nested watershed
structure
Land Use
Leaf Area
Index
Phenology
Soil Data
NLCD (EPA)
Landsat TM
MODIS
USDA
Soil and vegetation
parameter files
Eco-Hydrology
iRODS Rule for RHESSys
main
getExtentForGageReachcode(gageReachcode extentInNHD_Vect_Coords)
convertExtentToNHD_DEM(extentInNHD_Vect_Coords extentInNHD_DEM_Coords)
extractTileFromNHD_DEM(trimr(extentInNHD_DEM_Coords n))
importDEMTileIntoNewGRASSLocationAsUTM(extentInNHD_Vect_Coords newLocPhysPath
newLocObjPath)
delineateWatershedForNHDGage(nhdStreamGageID newLocPhysPath newLocObjPath)
Modular workflow composed by chaining basic transformation
Define input variables
Call functions to apply each transformation step
Store results in shared collection
extractTileFromNHD_DEM(extentCoords)
Split path to object into collection and name
msiSplitPath(nhdDEMObjPath nhdDEMObjColl nhdDEMObjName)
writeLine(serverLog nhdDEMObjColl)
writeLine(serverLog nhdDEMObjName)
Build query to discover physical path
msiAddSelectFieldToGenQuery(DATA_PATH null genQInp)
msiAddConditionToGenQuery(DATA_NAME = nhdDEMObjName genQInp)
msiAddConditionToGenQuery(COLL_NAME = nhdDEMObjColl genQInp)
msiAddConditionToGenQuery(DATA_RESC_NAME = rescName genQInp)
Run query
msiExecGenQuery(genQInp genQOut)
Extract path from query result
foreach (genQOut) msiGetValByKey(genQOut DATA_PATH filePath)
writeLine(serverLog filePath)
Determine physical path of input directory
msiSplitPath(filePath inFileDir headerFileIgnore)
Generate physical path of output file
msiSplitPath(inFileDir inFileParentDir rasterDatasetName)
tileFileName = SUBSET-++rasterDatasetName++img
tileFilePath = inFileParentDir++++tileFileName
Generate iRODS path of output
msiSplitPath(nhdDEMObjColl nhdDEMObjCollParent junk)
tileObjPath = nhdDEMObjCollParent++++tileFileName
args = -of HFA -projwin ++extentCoords++ ++inFileDir++ ++tileFilePath
writeLine(serverLog args)
msiExecCmd(gdal_translate args irenrenciorg null null cmd_out)
writeLine(serverLog cmd_out)
Register tile file with iRODS
msiPhyPathReg(tileObjPath rescName tileFilePath null status)
Summary
iRODS is a powerful policy-based engine for managing
next-generation Big Data cyber-infrastructures
Enables a flexible, adaptive, and customizable
data management architecture
"Canned" scripts (policies) can be created to
standardize and automate users' processes
Simple menu-driven interface
No CS degree needed
iRODS is the middleware for
distributed data management
Thank you
Questions
Shared Collections – Data Grid
[Figure: a client (50 clients: web browser, Unix shell command, …) reaches a file system and a tape archive through the data grid.]
Data grid middleware provides global names, single sign-on, policy enforcement, metadata, and replication
Multiple types of systems can be used to store data
Policy-based Data Management
[Figure: a client connects to two iRODS servers, each with a rule engine, rule base, workflows, and storage; together they form one logical collection (data grid).]
Consensus on policies and procedures controls the data collection
Policy-Based Data Environments
Purpose: reason a collection is assembled
Properties: attributes needed to ensure the purpose
Policies: controls for enforcing desired properties
    • Mapped to computer-actionable rules
Procedures: functions that implement the policies
    • Mapped to computer-actionable workflows
Persistent state information: results of applying the procedures
    • Mapped to system metadata
Property verification: validation that state information conforms to the desired purpose
    • Mapped to periodically executed policies
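The chain from property to policy, procedure, state information, and verification can be illustrated with a small model. This is a minimal Python sketch under stated assumptions: the `Policy` and `Collection` classes and the checksum policy are hypothetical illustrations, not iRODS APIs.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Policy:
    name: str                          # property the policy enforces
    procedure: Callable[[dict], None]  # function that implements the policy
    verify: Callable[[dict], bool]     # check that state conforms to the property


@dataclass
class Collection:
    purpose: str
    policies: List[Policy] = field(default_factory=list)
    state: Dict[str, dict] = field(default_factory=dict)  # persistent state per file

    def ingest(self, file_name: str, data: bytes) -> None:
        """Apply every procedure on ingest and keep the resulting state information."""
        meta = {"data": data}
        for policy in self.policies:
            policy.procedure(meta)
        self.state[file_name] = meta

    def verify(self) -> Dict[str, List[str]]:
        """For each policy, list the files whose state fails verification."""
        return {
            p.name: [name for name, meta in self.state.items() if not p.verify(meta)]
            for p in self.policies
        }


def checksum_policy() -> Policy:
    """Integrity property: the stored checksum must match the stored bytes."""
    def record(meta: dict) -> None:
        meta["sha256"] = hashlib.sha256(meta["data"]).hexdigest()

    def check(meta: dict) -> bool:
        return meta.get("sha256") == hashlib.sha256(meta["data"]).hexdigest()

    return Policy("checksum", record, check)
```

Calling `verify()` on a schedule corresponds to the "periodically executed policies" in the list above; the state dictionary plays the role of system metadata.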
Community-based Collection Life Cycle
Project Collection (private): Local Policy
Data Grid (shared): Distribution Policy
Digital Library (published): Description Policy
Data Processing Pipeline (analyzed): Service Policy
Reference Collection (preserved): Representation Policy
Federation (sustained): Re-purposing Policy
Stages correspond to addition of new policies for a broader community
Virtualize the stages of the collection life cycle through policy evolution
The driving purpose changes at each stage of the data life cycle
Applications
Data Grids (data sharing)
    Ocean Observatories Initiative
    The iPlant Collaborative
    National Optical Astronomy Observatory
    BaBar High Energy Physics
    Broad Institute genomics data grid
    Wellcome Trust Sanger Institute genomics data grid
Digital Libraries (data publication)
    Texas Digital Library
    French National Library
    UNC-CH SILS LifeTime Library
Repositories, Archives (data preservation)
    NASA Center for Climate Simulation
    Carolina Digital Repository
Sequencing Work – an Infrastructure View
[Figure: infrastructure diagram. Labeled components: Data Production (UNC HTSF, third-party vendors); iRODS data-grid managed processing (production pipelines, archive, genome databases); distributed ad-hoc processing (RENCI infrastructure, test/development pipelines, genome databases); National Resources (RENCI Science Portal, Open Science Grid, TeraGrid, UNC BASS, …); Clinical Data Systems (NCGenes, Secure Medical Workspace); Data Sharing (local (TUCASI), NIH, other institutions); reference data (RefSeq, genome annotations, dbSNP, HGMD, 1000 Genomes); VarDB; Hadoop.]
Legend: RC, RENCI, LCCC, HTSF (UNC High Throughput Sequencing Facility); infrastructure hardware; ITS software
Managing several hundred TBs of genomic data
Managing Data on the Research Side
[Figure: resources and users on the research side. Resources: RENCI storage (tape drives), UNC storage (tape drives), UNC HPC, RENCI HPC, external compute (Open Science Grid, Clemson, clouds), IT machines, RENCI Hadoop, genomics storage, lab machines, genomics HPC, genomics Hadoop. Users: NIH, external partners, data providers, researchers, students, external collaborators, IT staff.]
iRODS gracefully allows for introducing control
• Data movement and replication
• Metadata standards
• Archival, deletion, and retention
• Integration with workflows, Hadoop, databases
• Hiding complexities
• Automation
• … all policy driven
• … without breaking the in-place systems
Wild West → Managed
SILS LifeTime Library
Student digital libraries
Enable students to build collections of
    Photographs
    MP3 audio files
    Class documents
    Video
    Web site archive
Resources provided by the School of Information and Library Science at UNC-CH
Student collections range from 2 GBytes to 150 GBytes
Number of files from 2,000 to 12,000
SILS LifeTime Library Policies
Library management
    Replication
    Checksums
    Versioning
    Strict access controls
    Quotas
    Metadata catalog replication
    Installation environment archiving
Ingestion
    Automated synchronization of student directory with LifeTime Library
    Automated loading of MP3 metadata
Policy-Driven Repository Infrastructure project, funded by the Institute of Museum and Library Services
Carolina Digital Repository
[Figure: Carolina Digital Repository ingest workflow]
iRODS Data Grid
More than 50 different clients have been used to interact with the data grid
    Web browsers (iDrop-web, Rich Web client)
    Web services (VOSpace)
    Load libraries (Python, Java)
    I/O libraries (C, C++, Fortran)
    File systems (FUSE, WebDAV, Parrot)
    Synchronization interfaces (iDrop)
    Unix tools, Grid tools (icommands, SAGA, SRM, GriPhyN)
    Workflows (Kepler, Taverna)
    Digital Libraries (Fedora, DSpace)
    Portals (EnginFrame)
Managing Information & Knowledge
Concepts
    Data: objects
    Information: names
    Knowledge: relationships between names
    Wisdom: relationships between relationships
Implementation
    Data: bytes (storage system)
    Information: metadata (relational database)
    Knowledge: policies, procedures (rule base, rule engine)
    Wisdom: policy enforcement points (data grid)
Data Virtualization
[Figure: layered stack: Access Interface, Policy Enforcement Points, Standard Micro-services, Standard I/O Operations, Storage Protocol, Storage System]
• Map from the actions requested by the client to multiple policy enforcement points
• Map from policy to standard micro-services
• Map from micro-services to standard POSIX I/O operations
• Map standard I/O operations to the protocol supported by the storage system
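The four mappings can be pictured as a chain of lookup tables. The following Python sketch is illustrative only: every table entry (the enforcement point names, micro-service names, and the two storage back ends) is a hypothetical placeholder rather than an actual iRODS identifier.

```python
# Client action -> policy enforcement points (hypothetical names)
ACTION_TO_PEPS = {"put": ["pre_put", "post_put"]}

# Policy enforcement point -> micro-services that implement the policy
PEP_TO_MICROSERVICES = {"pre_put": ["check_quota"], "post_put": ["checksum"]}

# Micro-service -> standard POSIX-style I/O operations
MICROSERVICE_TO_IO = {
    "check_quota": ["stat"],
    "checksum": ["open", "read", "close"],
}

# Standard I/O operation -> call in the protocol of each storage system
IO_TO_PROTOCOL = {
    "posix_disk": {"stat": "stat", "open": "open", "read": "read", "close": "close"},
    "tape_archive": {
        "stat": "query_catalog",
        "open": "stage_and_open",
        "read": "read",
        "close": "close",
    },
}


def resolve(action, storage):
    """Expand one client action into (enforcement point, micro-service, I/O op, protocol call)."""
    calls = []
    for pep in ACTION_TO_PEPS[action]:
        for micro_service in PEP_TO_MICROSERVICES[pep]:
            for io_op in MICROSERVICE_TO_IO[micro_service]:
                calls.append((pep, micro_service, io_op, IO_TO_PROTOCOL[storage][io_op]))
    return calls
```

Because only the last table depends on the storage system, swapping "posix_disk" for "tape_archive" changes the protocol calls while the policies and micro-services stay the same, which is the point of the layering.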
System and User-driven Rules
The data grid automatically applies rules defined in the rule base, core.re
You can define rules that are applied interactively or that are deferred for later execution
irule -F rule-file.r
Example Rule
Write "Hello World"
Create a rule file called ruleHello.r
myTestRule {
    writeLine("stdout", "Hello World");
}
INPUT null
OUTPUT ruleExecOut
irule -F ruleHello.r
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and document their lack
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
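Two of the requirements above, processing files in batches of 256 and restarting from the last checked batch, can be sketched compactly. This is a minimal Python illustration under stated assumptions, not the production rule: the checkpoint dictionary, the `read_file` callable, and the use of SHA-256 are hypothetical choices.

```python
import hashlib

BATCH_SIZE = 256  # batch size named in the requirements


def verify_collection(files, expected, checkpoint, read_file, batch_size=BATCH_SIZE):
    """Check files in batches, recording progress so a restart resumes where it stopped.

    files:      ordered list of logical file names
    expected:   mapping of file name to expected SHA-256 hex digest
    checkpoint: dict holding 'next', the index of the first unchecked file
    read_file:  callable returning the file's bytes, or None if the replica is missing
    """
    problems = []
    start = checkpoint.get("next", 0)
    for begin in range(start, len(files), batch_size):
        batch = files[begin:begin + batch_size]
        for name in batch:
            data = read_file(name)
            if data is None:
                problems.append((name, "missing"))
            elif hashlib.sha256(data).hexdigest() != expected[name]:
                problems.append((name, "corrupt"))
        # Persisting this value after each batch is what makes a restart cheap
        checkpoint["next"] = begin + len(batch)
    return problems
```

In the rule described on this slide the checkpoint would live in the metadata catalog and the problem list would drive replica repair and logging; the sketch leaves both out.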
Workflow Management & Registration
[Figure: workflow objects registered in the data grid]
eCWkflow.mss: workflow file
/earthCube/eCWkflow: directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file)
eCWkflow.mpf: input parameter file; lists parameters and input and output file names
eCWkflow.run: automatically generated run file for executing each input file
/earthCube/eCWkflow/eCWkflow.runDir0: directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented
Outfile: output file created for eCWkflow.mpf
Also shown: eCWkflow2.mpf, eCWkflow2.run, /earthCube/eCWkflow/eCWkflow2.runDir0, Newfile
Publications
Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, "iRODS Primer: Integrated Rule-Oriented Data System", Morgan & Claypool, 2010
Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, "The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook", DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com
Reagan W. Moore
rwmoore@renci.org
http://irods.diceresearch.org
NSF OCI-0940841 "DataNet Federation Consortium"
NSF OCI-1032732 "Improvement of iRODS for Multi-Disciplinary Applications"
NSF OCI-0848296 "NARA Transcontinental Persistent Archives Prototype"
NSF SDCI-0721400 "Data Grids for Community Driven Applications"
iRODS - Open Source Software
iRODS Distributed Data Management
Initializing Workflow Parameters
*Val = "0";
msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
}
if (int(*Val) == 0) {
    *Str1 = "TEST_DATA_ID=0";
    msiString2KeyValPair(*Str1, *kvp);
    msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
    writeLine(*Lfile, "added TEST_DATA_ID attribute to collection *Coll");
}
# on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
msiExecGenQuery(*GenQInp2, *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
}
Workflow Operations Used
Arithmetic (+ - )
Boolean tests (==, !=, &&, ||, >, <, >=)
Conditional statements
    if, then, else
Control
    break, fail
Loops
    for, foreach, while
List manipulation
    initialization, list addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)
Variable manipulation
    initialization, type conversion (int, double, str)
Micro-services Used
Metadata catalog manipulation
    msiGetValByKey: get metadata from structure
    msiExecStrCondQuery: execute string conditional query
    msiString2KeyValPair: convert string to key-value pair
    msiAssociateKeyValuePairsToObj: add metadata
    msiMakeGenQuery: create a query
    msiExecGenQuery: execute a query
    msiCloseGenQuery: release query buffers
    msiGetContInxFromGenQueryOut: check for more rows
    msiRemoveKeyValuePairsFromObj: remove metadata
    msiGetMoreRows: get more rows from query
Micro-services Used
Data and directory manipulation
    msiIsColl: check whether name is a collection
    msiCollCreate: create a collection
    msiDataObjCreate: create a file
    msiDataObjRepl: replicate a file
    msiDataObjChksum: checksum a file
    msiDataObjUnlink: delete a file
System functions
    msiGetSystemTime: get the system time
    writeLine: write a line to a file or standard out
    msiSleep: sleep
Performance at RENCI
• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs
Data Analysis Use Cases
• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be demonstration of invocation of multiple workflow systems within the same analysis. An example is the integration of the Cyber-integrator workflow with collaboration environments to support drought prediction.
RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state
[Figure: Eco-Hydrology workflow. Steps: choose gauge or outlet (HIS); extract drainage area (NHDPlus). Inputs: Digital Elevation Model (DEM), slope, aspect, streams (NHD), roads (DOT), land use, leaf area index, phenology, soil data; sources: NLCD (EPA), Landsat TM, MODIS, USDA. Derived: nested watershed structure (basin, hillslope, patch, strata), stream network, soil and vegetation parameter files. Outputs: worldfile and flowtable for RHESSys.]
iRODS Rule for RHESSys
main {
    getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
    convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
    extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
    importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
    delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
}
Topics
bull Principles behind policy-based data management
ndash Enable collaborative research
ndash Enable reproducible science
ndash Enable creation of reference collections
bull Integrated Rule-Oriented Data System (iRODS)
ndash Enforce management policies
ndash Automate administrative functions
ndash Validate assessment criteria
Shared Collections ndash Data Grid
File
System
Client 50 clients web browser
unix shell command hellip
Data grid middleware
provides global name
single sign-on policy
enforcement metadata
replication Tape
Archive
Data Grid
Multiple types of systems
can be used to store data
Policy-based Data Management
Client
iRODS-server
Rule-engine
Rule base
Workflows
iRODS-server
Rule Engine
Rule base
Workflows
Storage Storage
Logical
Collection
(data grid)
Consensus on Policies and Procedures
controls the Data Collection
7
Policy-Based Data Environments
Purpose Reason a collection is assembled
Properties Attributes needed to ensure the purpose
Policies Controls for enforcing desired properties
bull mapped to computer actionable rules
Procedures Functions that implement the policies
bull Mapped to computer actionable workflows
Persistent state information Results of applying the procedures
bull mapped to system metadata
Property verification Validation that state information conforms to the desired purpose
bull mapped to periodically executed policies
Community-based Collection Life Cycle
Project
Collection
Private
Local
Policy
Data
Grid
Shared
Distribution
Policy
Digital
Library
Published
Description
Policy
Data
Processing
Pipeline
Analyzed
Service
Policy
Reference
Collection
Preserved
Representation
Policy
Federation
Sustained
Re-purposing
Policy
Stages correspond to addition of new policies for a broader community
Virtualize the stages of the collection life cycle through policy evolution
The driving purpose changes at each stage of the data life cycle
Applications
Data Grids (data sharing)
Ocean Observatories Initiative
The iPlant Collaborative
National Optical Astronomy Observatory
Babar High Energy Physics
Broad Institute genomics data grid
WellCome Trust Sanger Institute genomics data grid
Digital Libraries (data publication)
Texas Digital Library
French National Library
UNC-CH SILS LifeTime Library
Repositories Archives (data preservation)
NASA Center for Climate Simulation
Carolina Digital Repository
Sequencing Work ndash an Infrastructure View
RENCI
Science
Portal
Open
Science
Grid
TeraGrid
UNC BASS
hellip
National Resources
Pipelines Genome
Databases
RENCI Infrastructure TestDevelopment
Distributed ad-hoc
processing
iRODS data-grid managed
processing
Data Production
UNC HTFS
Third Party
Vendors
Clinical Data Systems
NCGenes
Secure Medical
Workspace
Production
Pipelines
Archive Genome
Databases
RC RENCI LCCC HTSF Infrastructure hardware ITS software LCCC UNC High
Throughput Sequencing facility
Local
(TUCASI)
Data Sharing
NIH
Other
Institutions
Ref
Se
q
Genome
Annotations
dbSN
P HGMD
1000
Genomes
Managing several hundred TBs of genomic data
VarDB Hadoop
Managing Data on the Research Side
RENCI
STORAGE
(Tape Drives)
UNC
STORAGE
(Tape Drives)
UNC HPC RENCI HPC
External
Compute
Open Science
Grid
Clemson
Clouds
IT Machines
RENCI Hadoop
Genomics
Storage
Lab
Machines
NIH
External
Partners
Genomics HPC
Genomics
Hadoop
Data
Providers
Researchers
Students
External
Collaborators
IT Staff
iRODS gracefully allows for introducing control
bullData movement and replication
bullMetadata standards
bullArchival deletion and retention
bullIntegration with workflows hadoop databases
bullHiding complexities
bullAutomation
bullhellip all policy driven
bullhellip without breaking the in-place systems
Wild West Managed
SILS LifeTime Library
Student digital libraries
Enable students to build collections of
Photographs
MP3 audio files
Class documents
Video
Web site archive
Resources provided by School of Information and
Library Science at UNC-CH
Student collections range from 2 GBytes to 150 Gbytes
Number of files from 2000 to 12000
SILS LifeTime Library Policies
Library management
Replication
Checksums
Versioning
Strict access controls
Quotas
Metadata catalog replication
Installation environment archiving
Ingestion
Automated synchronization of student directory
with LifeTime Library
Automated loading of MP3 metadata
Policy-Driven Repository Infrastructure project
funded by the Institute for Museum and Library Services
Carolina Digital Repository
Carolina Digital Repository
Ingest Workflow
iRODS Data Grid
More than 50 different clients have been used to
interact with the data grid
Web browsers (iDrop-web Rich Web client)
Web services (VOSpace)
Load libraries (Python Java)
IO libraries (C C++ Fortran)
File systems (FUSE WebDav Parrot)
Synchronization interfaces (iDrop)
Unix tools Grid tools (icommands SAGA SRM Griphyn)
Workflows (Kepler Taverna)
Digital Libraries (Fedora DSpace)
Portals (EnginFrame)
Managing Information amp Knowledge
Concepts
Data objects
Information names
Knowledge relationships between names
Wisdom relationships between relationships
Implementation
Data bytes Storage system
Information metadata Relational database
Knowledge policies procedures Rule base Rule engine
Wisdom policy enforcement point Data Grid
Data Virtualization
Storage System
Storage Protocol
Access Interface
Policy Enforcement Points
Standard Micro-services
Standard IO Operations
bull Map from the actions
requested by the client to
multiple policy
enforcement points
bull Map from policy to
standard micro-services
bull Map from micro-services
to standard Posix IO
operations
bull Map standard IO
operations to the
protocol supported by
the storage system
System and User-driven Rules
The data grid automatically applies rules
defined in the rule base corere
You can define rules that are applied
interactively or that are deferred for later
execution
irule ndashF ldquorule-filerrdquo
Example Rule
Write ldquoHello Worldrdquo
Create rule file call ruleHellor
myTestRule
writeLine(stdoutrdquo Hello World)
INPUT null
OUTPUT ruleExecOut
irule ndashF ruleHellor
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and document their lack
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
Workflow Management amp Registration
Workflow file
Directory holding all input and output files
associated with workflow file (mounted
collection that is linked to the workflow file)
Input parameter file lists parameters
and input and output file names
Directory holding all output
files generated for invocation
of eCWkflowrun the version
number is incremented
Automatically generated run file for
Executing each input file
Output file created for
eCWKflowmpf
eCWkflowmss
earthCubeeCWkflow
eCWkflowmpf
earthCubeeCWkfloweCWkflowrunDir0
eCWkflowrun
Outfile
eCWkflow2run
eCWkflow2mpf
earthCubeeCWkfloweCWkflow2runDir
0
Newfile
Publications
Rajasekar R M Wan R Moore W Schroeder S-Y Chen L
Gilbert C-Y Hou C Lee R Marciano P Tooby A de Torcy B
Zhu ldquoiRODS Primer Integrated Rule-Oriented Data Systemrdquo
Morgan amp Claypool 2010
Ward R M Wan W Schroeder A Rajasekar A de Torcy T
Russell H Xu R Moore ldquoThe integrated Rule-Oriented Data
System (iRODS 30) Micro-service Workbookrdquo DICE
Foundation November 2011 ISBN 9781466469129
Amazoncom
Reagan W Moore
rwmoorerenciorg
httpirodsdiceresearchorg
NSF OCI-0940841 ldquoDataNet Federation Consortiumrdquo
NSF OCI-1032732 ldquoImprovement of iRODS for Multi-Disciplinary Applicationsrdquo
NSF OCI-0848296 ldquoNARA Transcontinental Persistent Archives Prototyperdquo
NSF SDCI-0721400 ldquoData Grids for Community Driven Applicationsrdquo
iRODS - Open Source Software
iRODS Distributed Data Management
Val = 0rdquo
msiExecStrCondQuery(SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME =
Coll and META_COLL_ATTR_NAME = TEST_DATA_ID GenQOut2)
foreach (GenQOut2)
msiGetValByKey(GenQOut2 META_COLL_ATTR_NAME Val)
if(int(Val) == 0)
Str1 = TEST_DATA_ID=0rdquo
msiString2KeyValPair(Str1kvp)
msiAssociateKeyValuePairsToObj(kvpColl-C)
writeLine(Lfileadded TEST_DATA_ID attribute to collection Coll)
on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery(META_COLL_ATTR_VALUE COLL_NAME = Coll and
META_COLL_ATTR_NAME = TEST_DATA_ID GenQInp2)
msiExecGenQuery(GenQInp2GenQOut2)
foreach(GenQOut2)
msiGetValByKey(GenQOut2 META_COLL_ATTR_VALUE colldataID)
Initializing Workflow Parameters
Workflow Operations Used
Arithmetic (+ - )
Boolean tests (== = ampamp || gt lt gt=)
Conditional statements
if then else
Control
break fail
Loops
for foreach while
List manipulation
initialization list addition (cons) extracting an element from a
list (elem) updating an element in a list (setelem)
Variable manipulation
initialization type conversion (int double str)
Micro-services Used
Metadata catalog manipulation
msiGetValByKey get metadata from structure
msiExecStrCondQuery execute string conditional query
msiString2KeyValPair convert string to key-value pair
msiAssociateKeyValuePairsToObj add metadata
msiMakeGenQuery create a query
msiExecGenQuery execute a query
msiCloseGenQuery release query buffers
msiGetContInxFromGenQueryOut check for more rows
msiRemoveKeyValuePairsFromObj remove metadata
msiGetMoreRows get more rows from query
Micro-services Used
Data and directory manipulation
msiIsColl check whether name is a collection
msiCollCreate create a collection
msiDataObjCreate create a file
msiDataObjRepl replicate a file
msiDataObjChksum checksum a file
msiDataObjUnlink delete a file
System functions
msiGetSystemTime get the system time
writeLine write a line to a file or standard out
msiSleep sleep
Performance at rencirsquo
bull Execute call to rule engine 18 msecs
bull Execute metadata query 714 msecs
bull Disk seek latency 5 msecs
bull Disk rotational latency 11 msecs
bull Production loop logic 63 msecs
bull Checksum verification 21 msecs
Data Analysis Use Cases
bull Demonstrate reproducible science A use case could include the
registration storage sharing and re-execution of a workflow The hypoxia
use case from the Cross-Domain and Brokering Concept groups could be
used as an example
bull Automate data retrieval A use case could demonstrate remote access to a
data collection retrieval of desired data sets transformation and use in
an analysis workflow An eco-hydrology example that automates access
to digital elevation maps and land use coverage is being built
bull Integrate community resources with collaboration environments An
example would be use of the DAB protocol to identify and cache local
copies of relevant data sets for local analysis
bull Integrate multiple community resources A use case could be
demonstration of invocation of multiple workflow systems within the
same analysis An example is the integration of Cyber-integrator
workflow with collaboration environments to support drought prediction
RHESSys workflow to develop a nested
watershed parameter file (worldfile)
containing a nested ecogeomorphic object
framework and full initial system state
Choose gauge
or outlet (HIS)
Extract
drainage area
(NHDPlus)
Digital
Elevation
Model (DEM)
Worldfile Flowtable
RHESSys
Slope
Aspect
Streams (NHD)
Roads (DOT) Strata
Hillslope
Patch
Basin
Stream network
Nested watershed
structure
Land Use
Leaf Area
Index
Phenology
Soil Data
NLCD (EPA)
Landsat TM
MODIS
USDA
Soil and vegetation
parameter files
Eco-Hydrology
iRODS Rule for RHESSys
main
getExtentForGageReachcode(gageReachcode extentInNHD_Vect_Coords)
convertExtentToNHD_DEM(extentInNHD_Vect_Coords extentInNHD_DEM_Coords)
extractTileFromNHD_DEM(trimr(extentInNHD_DEM_Coords n))
importDEMTileIntoNewGRASSLocationAsUTM(extentInNHD_Vect_Coords newLocPhysPath
newLocObjPath)
delineateWatershedForNHDGage(nhdStreamGageID newLocPhysPath newLocObjPath)
Modular workflow composed by chaining basic transformation
Define input variables
Call functions to apply each transformation step
Store results in shared collection
extractTileFromNHD_DEM(extentCoords)
Split path to object into collection and name
msiSplitPath(nhdDEMObjPath nhdDEMObjColl nhdDEMObjName)
writeLine(serverLog nhdDEMObjColl)
writeLine(serverLog nhdDEMObjName)
Build query to discover physical path
msiAddSelectFieldToGenQuery(DATA_PATH null genQInp)
msiAddConditionToGenQuery(DATA_NAME = nhdDEMObjName genQInp)
msiAddConditionToGenQuery(COLL_NAME = nhdDEMObjColl genQInp)
msiAddConditionToGenQuery(DATA_RESC_NAME = rescName genQInp)
Run query
msiExecGenQuery(genQInp genQOut)
Extract path from query result
foreach (genQOut) msiGetValByKey(genQOut DATA_PATH filePath)
writeLine(serverLog filePath)
Determine physical path of input directory
msiSplitPath(filePath inFileDir headerFileIgnore)
Generate physical path of output file
msiSplitPath(inFileDir inFileParentDir rasterDatasetName)
tileFileName = SUBSET-++rasterDatasetName++img
tileFilePath = inFileParentDir++++tileFileName
Generate iRODS path of output
msiSplitPath(nhdDEMObjColl nhdDEMObjCollParent junk)
tileObjPath = nhdDEMObjCollParent++++tileFileName
args = -of HFA -projwin ++extentCoords++ ++inFileDir++ ++tileFilePath
writeLine(serverLog args)
msiExecCmd(gdal_translate args irenrenciorg null null cmd_out)
writeLine(serverLog cmd_out)
Register tile file with iRODS
msiPhyPathReg(tileObjPath rescName tileFilePath null status)
Summary
iRODS is a power Policy-Based engine for Managing
NextGen Big Data Cyber-Infrastructures
Enables a Flexible Adaptive and Customizable
Data Management Architecture
ldquoCannedrdquo scripts (policies) can be created to
standardize and automated users processes
Simple menu driven interface
No CS Degree needed
iRODS is the middleware for
Distributed Data Management
Thank you
Questions
Shared Collections ndash Data Grid
File
System
Client 50 clients web browser
unix shell command hellip
Data grid middleware
provides global name
single sign-on policy
enforcement metadata
replication Tape
Archive
Data Grid
Multiple types of systems
can be used to store data
Policy-based Data Management
Client
iRODS-server
Rule-engine
Rule base
Workflows
iRODS-server
Rule Engine
Rule base
Workflows
Storage Storage
Logical
Collection
(data grid)
Consensus on Policies and Procedures
controls the Data Collection
7
Policy-Based Data Environments
Purpose Reason a collection is assembled
Properties Attributes needed to ensure the purpose
Policies Controls for enforcing desired properties
bull mapped to computer actionable rules
Procedures Functions that implement the policies
bull Mapped to computer actionable workflows
Persistent state information Results of applying the procedures
bull mapped to system metadata
Property verification Validation that state information conforms to the desired purpose
bull mapped to periodically executed policies
Community-based Collection Life Cycle
Project
Collection
Private
Local
Policy
Data
Grid
Shared
Distribution
Policy
Digital
Library
Published
Description
Policy
Data
Processing
Pipeline
Analyzed
Service
Policy
Reference
Collection
Preserved
Representation
Policy
Federation
Sustained
Re-purposing
Policy
Stages correspond to addition of new policies for a broader community
Virtualize the stages of the collection life cycle through policy evolution
The driving purpose changes at each stage of the data life cycle
Applications
Data Grids (data sharing)
Ocean Observatories Initiative
The iPlant Collaborative
National Optical Astronomy Observatory
Babar High Energy Physics
Broad Institute genomics data grid
WellCome Trust Sanger Institute genomics data grid
Digital Libraries (data publication)
Texas Digital Library
French National Library
UNC-CH SILS LifeTime Library
Repositories Archives (data preservation)
NASA Center for Climate Simulation
Carolina Digital Repository
Sequencing Work – an Infrastructure View
[Diagram labels:
National resources: RENCI Science Portal, Open Science Grid, TeraGrid, UNC BASS, …
Data production: UNC HTSF, third-party vendors, clinical data systems
RENCI infrastructure: test/development and production pipelines, archive, genome databases; distributed ad hoc processing; iRODS data-grid managed processing; NCGenes; Secure Medical Workspace
Data sharing: local (TUCASI), NIH, other institutions
Genome databases: RefSeq, genome annotations, dbSNP, HGMD, 1000 Genomes, VarDB, Hadoop]
RC RENCI LCCC HTSF Infrastructure hardware ITS software LCCC UNC High Throughput Sequencing facility
Managing several hundred TBs of genomic data
Managing Data on the Research Side
[Diagram labels:
Storage: RENCI storage (tape drives), UNC storage (tape drives), genomics storage
Compute: UNC HPC, RENCI HPC, genomics HPC, RENCI Hadoop, genomics Hadoop, IT machines, lab machines
External compute: Open Science Grid, Clemson, clouds
People and partners: NIH, external partners, data providers, researchers, students, external collaborators, IT staff]
iRODS gracefully allows for introducing control
• Data movement and replication
• Metadata standards
• Archival, deletion, and retention
• Integration with workflows, Hadoop, databases
• Hiding complexities
• Automation
• … all policy driven
• … without breaking the in-place systems
Wild West → Managed
SILS LifeTime Library
Student digital libraries
Enable students to build collections of
Photographs
MP3 audio files
Class documents
Video
Web site archive
Resources provided by School of Information and
Library Science at UNC-CH
Student collections range from 2 GBytes to 150 GBytes
Number of files ranges from 2,000 to 12,000
SILS LifeTime Library Policies
Library management
Replication
Checksums
Versioning
Strict access controls
Quotas
Metadata catalog replication
Installation environment archiving
Ingestion
Automated synchronization of student directory
with LifeTime Library
Automated loading of MP3 metadata
Policy-Driven Repository Infrastructure project
funded by the Institute for Museum and Library Services
Carolina Digital Repository
Carolina Digital Repository
Ingest Workflow
iRODS Data Grid
More than 50 different clients have been used to
interact with the data grid
Web browsers (iDrop-web, Rich Web client)
Web services (VOSpace)
Load libraries (Python, Java)
I/O libraries (C, C++, Fortran)
File systems (FUSE, WebDAV, Parrot)
Synchronization interfaces (iDrop)
Unix tools, Grid tools (icommands, SAGA, SRM, GriPhyN)
Workflows (Kepler, Taverna)
Digital Libraries (Fedora, DSpace)
Portals (EnginFrame)
Managing Information & Knowledge
Concepts
Data: objects
Information: names
Knowledge: relationships between names
Wisdom: relationships between relationships
Implementation
Data: bytes (storage system)
Information: metadata (relational database)
Knowledge: policies, procedures (rule base, rule engine)
Wisdom: policy enforcement point (data grid)
Data Virtualization
Layers: Access Interface, Policy Enforcement Points, Standard Micro-services, Standard I/O Operations, Storage Protocol, Storage System
• Map from the actions requested by the client to multiple policy enforcement points
• Map from policy to standard micro-services
• Map from micro-services to standard POSIX I/O operations
• Map standard I/O operations to the protocol supported by the storage system
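The four mappings above can be pictured as a chain of lookups. The following is a toy sketch under stated assumptions: the table contents (`pre_put`, `check_quota`, and so on) are hypothetical placeholders, not the names of real iRODS policy enforcement points or micro-services.

```python
from typing import Dict, List

# Hypothetical lookup tables, one per mapping on the slide.
ACTION_TO_PEPS: Dict[str, List[str]] = {
    "put": ["pre_put", "post_put"],
}
PEP_TO_MICROSERVICES: Dict[str, List[str]] = {
    "pre_put": ["check_quota"],
    "post_put": ["checksum", "replicate"],
}
MICROSERVICE_TO_IO: Dict[str, List[str]] = {
    "check_quota": [],  # consults the catalog only, no file I/O
    "checksum": ["open", "read", "close"],
    "replicate": ["open", "read", "write", "close"],
}


def io_to_protocol(op: str, protocol: str) -> str:
    """Map a standard I/O operation to a storage-specific call name."""
    return f"{protocol}:{op}"


def expand(action: str, protocol: str) -> List[str]:
    """Expand a client action into the storage-level calls it triggers."""
    calls: List[str] = []
    for pep in ACTION_TO_PEPS[action]:
        for microservice in PEP_TO_MICROSERVICES[pep]:
            for op in MICROSERVICE_TO_IO[microservice]:
                calls.append(io_to_protocol(op, protocol))
    return calls


calls = expand("put", "tape")
print(calls)
```

Because only the last function knows about the storage protocol, swapping `"tape"` for another back end leaves the upper three tables untouched, which is the point of the layering.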
System and User-driven Rules
The data grid automatically applies rules defined in the rule base core.re
You can define rules that are applied interactively or that are deferred for later execution
    irule -F "rule-file.r"
Example Rule
Write "Hello World"
Create a rule file called ruleHello.r:
    myTestRule {
        # Write one line to the client's standard output
        writeLine("stdout", "Hello World");
    }
    INPUT null
    OUTPUT ruleExecOut
Run the rule:
    irule -F ruleHello.r
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and document their lack
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
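A few of the requirements above (integrity checks, replica repair, batches of 256, a repair log, and restart from the last checked batch) can be sketched in Python. This is a simplified in-memory model, not the production rule: replicas are byte strings in a dictionary, SHA-256 is an assumed checksum algorithm, and the deadline scheduler and load leveling are left out. All function and variable names are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional

BATCH_SIZE = 256  # batch size named in the list above


def checksum(data: bytes) -> str:
    """Assumed checksum; the slide does not name the algorithm."""
    return hashlib.sha256(data).hexdigest()


def verify_collection(
    names: List[str],
    replicas: Dict[str, List[Optional[bytes]]],  # None marks a missing replica
    expected: Dict[str, str],
    state: Dict[str, int],
    log: List[str],
) -> None:
    """Check every replica, repair bad or missing ones, and checkpoint per batch."""
    start = state.get("last_checked", 0)  # resume point after a halt
    for i in range(start, len(names), BATCH_SIZE):
        batch = names[i:i + BATCH_SIZE]
        for name in batch:
            copies = replicas[name]
            good = next(
                (c for c in copies if c is not None and checksum(c) == expected[name]),
                None,
            )
            if good is None:
                log.append(f"unrecoverable {name}")
                continue
            for idx, copy in enumerate(copies):
                if copy is None or checksum(copy) != expected[name]:
                    copies[idx] = good
                    log.append(f"repaired {name} replica {idx}")
        state["last_checked"] = i + len(batch)  # persist progress


names = [f"file{i}" for i in range(600)]
content = {n: n.encode() for n in names}
expected = {n: checksum(content[n]) for n in names}
replicas = {n: [content[n], content[n]] for n in names}
replicas["file3"][1] = None          # simulate a missing replica
replicas["file300"][0] = b"corrupt"  # simulate a damaged replica
state: Dict[str, int] = {}
log: List[str] = []
verify_collection(names, replicas, expected, state, log)
print(log, state)
```

Calling `verify_collection` again with the same `state` does no work, because the checkpoint already points past the last file; in the real rule the equivalent counter is kept as metadata on the collection.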
Workflow Management & Registration
eCWkflow.mss: workflow file
/earthCube/eCWkflow: directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file)
eCWkflow.mpf: input parameter file; lists parameters and input and output file names
eCWkflow.run: automatically generated run file for executing each input file
/earthCube/eCWkflow/eCWkflow.runDir0: directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented
Outfile: output file created for eCWkflow.mpf
Second parameter file: eCWkflow2.mpf, eCWkflow2.run, /earthCube/eCWkflow/eCWkflow2.runDir0, Newfile
Publications
Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, "iRODS Primer: Integrated Rule-Oriented Data System", Morgan & Claypool, 2010.
Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, "The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook", DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com.
Reagan W. Moore
rwmoore@renci.org
http://irods.diceresearch.org
NSF OCI-0940841 "DataNet Federation Consortium"
NSF OCI-1032732 "Improvement of iRODS for Multi-Disciplinary Applications"
NSF OCI-0848296 "NARA Transcontinental Persistent Archives Prototype"
NSF SDCI-0721400 "Data Grids for Community Driven Applications"
iRODS - Open Source Software
iRODS Distributed Data Management
Initializing Workflow Parameters
    *Val = "0";
    # Count the TEST_DATA_ID attributes already attached to the collection
    msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
    foreach (*GenQOut2) {
        msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
    }
    if (int(*Val) == 0) {
        # First execution: attach TEST_DATA_ID=0 to the collection
        *Str1 = "TEST_DATA_ID=0";
        msiString2KeyValPair(*Str1, *kvp);
        msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
        writeLine(*Lfile, "added TEST_DATA_ID attribute to collection *Coll");
    }
    # On a restart TEST_DATA_ID will be greater than 0
    msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
    msiExecGenQuery(*GenQInp2, *GenQOut2);
    foreach (*GenQOut2) {
        msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
    }
Workflow Operations Used
Arithmetic (+, -)
Boolean tests (==, !=, &&, ||, >, <, >=)
Conditional statements
if, then, else
Control
break, fail
Loops
for, foreach, while
List manipulation
initialization, list addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)
Variable manipulation
initialization, type conversion (int, double, str)
Micro-services Used
Metadata catalog manipulation
msiGetValByKey: get metadata from structure
msiExecStrCondQuery: execute string conditional query
msiString2KeyValPair: convert string to key-value pair
msiAssociateKeyValuePairsToObj: add metadata
msiMakeGenQuery: create a query
msiExecGenQuery: execute a query
msiCloseGenQuery: release query buffers
msiGetContInxFromGenQueryOut: check for more rows
msiRemoveKeyValuePairsFromObj: remove metadata
msiGetMoreRows: get more rows from query
Micro-services Used
Data and directory manipulation
msiIsColl: check whether name is a collection
msiCollCreate: create a collection
msiDataObjCreate: create a file
msiDataObjRepl: replicate a file
msiDataObjChksum: checksum a file
msiDataObjUnlink: delete a file
System functions
msiGetSystemTime: get the system time
writeLine: write a line to a file or standard out
msiSleep: sleep
Performance at RENCI
• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs
Data Analysis Use Cases
• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be a demonstration of the invocation of multiple workflow systems within the same analysis. An example is the integration of a Cyber-integrator workflow with collaboration environments to support drought prediction.
Eco-Hydrology
RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state
[Diagram labels:
Steps: choose gauge or outlet (HIS); extract drainage area (NHDPlus)
Terrain and network inputs: Digital Elevation Model (DEM), slope, aspect, streams (NHD), roads (DOT)
Other inputs: land use, leaf area index, phenology, soil data
Sources: NLCD (EPA), Landsat TM, MODIS, USDA
Nested watershed structure: basin, hillslope, patch, strata, stream network
Outputs: soil and vegetation parameter files, worldfile, flowtable, RHESSys]
iRODS Rule for RHESSys
    main {
        getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
        convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
        extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
        importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
        delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
    }
Modular workflow composed by chaining basic transformations
• Define input variables
• Call functions to apply each transformation step
• Store results in shared collection
    extractTileFromNHD_DEM(*extentCoords) {
        # Split path to object into collection and name
        msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
        writeLine("serverLog", *nhdDEMObjColl);
        writeLine("serverLog", *nhdDEMObjName);
        # Build query to discover physical path
        msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
        msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
        msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
        msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
        # Run query
        msiExecGenQuery(*genQInp, *genQOut);
        # Extract path from query result
        foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
        writeLine("serverLog", *filePath);
        # Determine physical path of input directory
        msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
        # Generate physical path of output file
        msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
        *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
        *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
        # Generate iRODS path of output
        msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
        *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
        # Extract the tile with gdal_translate on the remote host
        *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
        writeLine("serverLog", *args);
        msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
        writeLine("serverLog", *cmd_out);
        # Register tile file with iRODS
        msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
    }
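The path arithmetic in this rule is easy to get wrong, so here is the same derivation as a small Python sketch. It only builds the `gdal_translate` argument list and does not run the command. The function name, the sample paths, and the sample extent are hypothetical, and the `.img` suffix and `/` separator are assumptions about punctuation that the slide text lost. The `-of HFA` and `-projwin` options are the ones named in the rule.

```python
import posixpath
from typing import List


def build_tile_command(extent_coords: str, in_file_dir: str) -> List[str]:
    """Build the gdal_translate argument list for extracting one DEM tile.

    extent_coords: four whitespace-separated numbers for -projwin
    in_file_dir:   physical path of the input raster dataset directory
    """
    parent_dir, raster_name = posixpath.split(in_file_dir)
    tile_name = f"SUBSET-{raster_name}.img"  # assumed naming, see lead-in
    tile_path = posixpath.join(parent_dir, tile_name)
    return [
        "gdal_translate",
        "-of", "HFA",
        "-projwin", *extent_coords.split(),
        in_file_dir,
        tile_path,
    ]


cmd = build_tile_command("100 200 300 50", "/data/nhd/dem_raster")
print(" ".join(cmd))
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems if a path ever contains spaces; the rule itself concatenates because `msiExecCmd` takes a single argument string.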
Summary
iRODS is a powerful policy-based engine for managing next-generation Big Data cyber-infrastructures
Enables a flexible, adaptive, and customizable data management architecture
"Canned" scripts (policies) can be created to standardize and automate users' processes
Simple menu-driven interface
No CS degree needed
iRODS is the middleware for Distributed Data Management
Thank you
Questions
7
Policy-Based Data Environments
Purpose Reason a collection is assembled
Properties Attributes needed to ensure the purpose
Policies Controls for enforcing desired properties
bull mapped to computer actionable rules
Procedures Functions that implement the policies
bull Mapped to computer actionable workflows
Persistent state information Results of applying the procedures
bull mapped to system metadata
Property verification Validation that state information conforms to the desired purpose
bull mapped to periodically executed policies
Community-based Collection Life Cycle
Project
Collection
Private
Local
Policy
Data
Grid
Shared
Distribution
Policy
Digital
Library
Published
Description
Policy
Data
Processing
Pipeline
Analyzed
Service
Policy
Reference
Collection
Preserved
Representation
Policy
Federation
Sustained
Re-purposing
Policy
Stages correspond to addition of new policies for a broader community
Virtualize the stages of the collection life cycle through policy evolution
The driving purpose changes at each stage of the data life cycle
Applications
Data Grids (data sharing)
Ocean Observatories Initiative
The iPlant Collaborative
National Optical Astronomy Observatory
Babar High Energy Physics
Broad Institute genomics data grid
WellCome Trust Sanger Institute genomics data grid
Digital Libraries (data publication)
Texas Digital Library
French National Library
UNC-CH SILS LifeTime Library
Repositories Archives (data preservation)
NASA Center for Climate Simulation
Carolina Digital Repository
Sequencing Work ndash an Infrastructure View
RENCI
Science
Portal
Open
Science
Grid
TeraGrid
UNC BASS
hellip
National Resources
Pipelines Genome
Databases
RENCI Infrastructure TestDevelopment
Distributed ad-hoc
processing
iRODS data-grid managed
processing
Data Production
UNC HTFS
Third Party
Vendors
Clinical Data Systems
NCGenes
Secure Medical
Workspace
Production
Pipelines
Archive Genome
Databases
RC RENCI LCCC HTSF Infrastructure hardware ITS software LCCC UNC High
Throughput Sequencing facility
Local
(TUCASI)
Data Sharing
NIH
Other
Institutions
Ref
Se
q
Genome
Annotations
dbSN
P HGMD
1000
Genomes
Managing several hundred TBs of genomic data
VarDB Hadoop
Managing Data on the Research Side
RENCI
STORAGE
(Tape Drives)
UNC
STORAGE
(Tape Drives)
UNC HPC RENCI HPC
External
Compute
Open Science
Grid
Clemson
Clouds
IT Machines
RENCI Hadoop
Genomics
Storage
Lab
Machines
NIH
External
Partners
Genomics HPC
Genomics
Hadoop
Data
Providers
Researchers
Students
External
Collaborators
IT Staff
iRODS gracefully allows for introducing control
bullData movement and replication
bullMetadata standards
bullArchival deletion and retention
bullIntegration with workflows hadoop databases
bullHiding complexities
bullAutomation
bullhellip all policy driven
bullhellip without breaking the in-place systems
Wild West Managed
SILS LifeTime Library
Student digital libraries
Enable students to build collections of
Photographs
MP3 audio files
Class documents
Video
Web site archive
Resources provided by School of Information and
Library Science at UNC-CH
Student collections range from 2 GBytes to 150 Gbytes
Number of files from 2000 to 12000
SILS LifeTime Library Policies
Library management
Replication
Checksums
Versioning
Strict access controls
Quotas
Metadata catalog replication
Installation environment archiving
Ingestion
Automated synchronization of student directory
with LifeTime Library
Automated loading of MP3 metadata
Policy-Driven Repository Infrastructure project
funded by the Institute for Museum and Library Services
Carolina Digital Repository
Carolina Digital Repository
Ingest Workflow
iRODS Data Grid
More than 50 different clients have been used to
interact with the data grid
Web browsers (iDrop-web Rich Web client)
Web services (VOSpace)
Load libraries (Python Java)
IO libraries (C C++ Fortran)
File systems (FUSE WebDav Parrot)
Synchronization interfaces (iDrop)
Unix tools Grid tools (icommands SAGA SRM Griphyn)
Workflows (Kepler Taverna)
Digital Libraries (Fedora DSpace)
Portals (EnginFrame)
Managing Information amp Knowledge
Concepts
Data objects
Information names
Knowledge relationships between names
Wisdom relationships between relationships
Implementation
Data bytes Storage system
Information metadata Relational database
Knowledge policies procedures Rule base Rule engine
Wisdom policy enforcement point Data Grid
Data Virtualization
Storage System
Storage Protocol
Access Interface
Policy Enforcement Points
Standard Micro-services
Standard IO Operations
bull Map from the actions
requested by the client to
multiple policy
enforcement points
bull Map from policy to
standard micro-services
bull Map from micro-services
to standard Posix IO
operations
bull Map standard IO
operations to the
protocol supported by
the storage system
System and User-driven Rules
The data grid automatically applies rules
defined in the rule base corere
You can define rules that are applied
interactively or that are deferred for later
execution
irule ndashF ldquorule-filerrdquo
Example Rule
Write ldquoHello Worldrdquo
Create rule file call ruleHellor
myTestRule
writeLine(stdoutrdquo Hello World)
INPUT null
OUTPUT ruleExecOut
irule ndashF ruleHellor
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and document their lack
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
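Several of the requirements above (batches of 256 files, checksum verification, replacement of bad replicas, a repair log, and restart from the last checked batch) can be sketched in a few lines of Python. This is an illustrative model, not the production rule: the in-memory `catalog` and `replicas` dictionaries stand in for the metadata catalog and storage systems, SHA-256 is an assumed checksum algorithm, and `max_batches` is a hypothetical parameter used only to simulate a halt.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BATCH_SIZE = 256  # batch size quoted in the requirements list


def checksum(data: bytes) -> str:
    """Checksum of a replica's bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify_collection(
    names: List[str],
    catalog: Dict[str, str],
    replicas: Dict[str, List[bytes]],
    start: int = 0,
    max_batches: Optional[int] = None,
) -> Tuple[int, List[str]]:
    """Verify files batch by batch; return (checkpoint, repair log).

    The returned checkpoint is the index of the first unchecked file,
    so a later call can resume there after a halt.
    """
    log: List[str] = []
    checkpoint = start
    batches_done = 0
    while checkpoint < len(names):
        if max_batches is not None and batches_done >= max_batches:
            break  # simulated halt
        batch = names[checkpoint:checkpoint + BATCH_SIZE]
        for name in batch:
            expected = catalog[name]
            copies = replicas[name]
            good = [c for c in copies if checksum(c) == expected]
            if not good:
                log.append(f"{name}: no valid replica, cannot repair")
                continue
            for i, copy in enumerate(copies):
                if checksum(copy) != expected:
                    copies[i] = good[0]  # overwrite bad replica
                    log.append(f"{name}: replaced replica {i}")
        checkpoint += len(batch)
        batches_done += 1
    return checkpoint, log
```

Persisting the returned checkpoint after every batch is what makes a restart cheap: only files at or beyond that index are examined again.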
Workflow Management & Registration
[Figure: annotated listing of the files in a registered workflow]
Workflow file: eCWkflow.mss
Directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file): /earthCube/eCWkflow
Input parameter file, which lists parameters and input and output file names: eCWkflow.mpf
Automatically generated run file for executing each input file: eCWkflow.run
Directory holding all output files generated for an invocation of eCWkflow.run (the version number is incremented): /earthCube/eCWkflow/eCWkflow.runDir0
Output file created for eCWkflow.mpf: Outfile
Other entries shown in the figure: eCWkflow2.mpf, eCWkflow2.run, /earthCube/eCWkflow/eCWkflow2.runDir0, Newfile
Publications
Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, "iRODS Primer: Integrated Rule-Oriented Data System", Morgan & Claypool, 2010.
Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, "The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook", DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com.
Reagan W. Moore
rwmoore@renci.org
http://irods.diceresearch.org
NSF OCI-0940841 "DataNet Federation Consortium"
NSF OCI-1032732 "Improvement of iRODS for Multi-Disciplinary Applications"
NSF OCI-0848296 "NARA Transcontinental Persistent Archives Prototype"
NSF SDCI-0721400 "Data Grids for Community Driven Applications"
iRODS - Open Source Software
iRODS Distributed Data Management
Initializing Workflow Parameters
    *Val = "0";
    msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
    foreach (*GenQOut2) {
      msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
    }
    # first execution: attach the TEST_DATA_ID attribute to the collection
    if (int(*Val) == 0) {
      *Str1 = "TEST_DATA_ID=0";
      msiString2KeyValPair(*Str1, *kvp);
      msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
      writeLine("*Lfile", "added TEST_DATA_ID attribute to collection *Coll");
    }
    # on a restart TEST_DATA_ID will be greater than 0
    msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
    msiExecGenQuery(*GenQInp2, *GenQOut2);
    foreach (*GenQOut2) {
      msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
    }
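The initialization logic above follows a common pattern: create a progress marker the first time the rule runs, then read it back so that a restart resumes where the previous run stopped. A minimal Python sketch of that pattern follows, assuming a plain dictionary stands in for the collection's metadata; the function name `init_checkpoint` is hypothetical, while the attribute name `TEST_DATA_ID` comes from the slide.

```python
from typing import Dict


def init_checkpoint(coll_meta: Dict[str, str], attr: str = "TEST_DATA_ID") -> int:
    """Return the stored progress marker, creating it as "0" if absent."""
    if attr not in coll_meta:
        # First execution: attach the marker to the collection metadata.
        coll_meta[attr] = "0"
    # On a restart the stored value is greater than 0.
    return int(coll_meta[attr])
```

Keeping the marker in the collection's own metadata, rather than in the rule's local variables, is what lets it survive a system halt.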
Workflow Operations Used
Arithmetic: +, -
Boolean tests: ==, !=, &&, ||, >, <, >=
Conditional statements: if, then, else
Control: break, fail
Loops: for, foreach, while
List manipulation: initialization, list addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)
Variable manipulation: initialization, type conversion (int, double, str)
Micro-services Used
Metadata catalog manipulation
  msiGetValByKey: get metadata from structure
  msiExecStrCondQuery: execute string conditional query
  msiString2KeyValPair: convert string to key-value pair
  msiAssociateKeyValuePairsToObj: add metadata
  msiMakeGenQuery: create a query
  msiExecGenQuery: execute a query
  msiCloseGenQuery: release query buffers
  msiGetContInxFromGenQueryOut: check for more rows
  msiRemoveKeyValuePairsFromObj: remove metadata
  msiGetMoreRows: get more rows from query
Micro-services Used (continued)
Data and directory manipulation
  msiIsColl: check whether name is a collection
  msiCollCreate: create a collection
  msiDataObjCreate: create a file
  msiDataObjRepl: replicate a file
  msiDataObjChksum: checksum a file
  msiDataObjUnlink: delete a file
System functions
  msiGetSystemTime: get the system time
  writeLine: write a line to a file or standard out
  msiSleep: sleep
Performance at RENCI
• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs
Data Analysis Use Cases
• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be a demonstration of the invocation of multiple workflow systems within the same analysis. An example is the integration of the Cyber-integrator workflow with collaboration environments to support drought prediction.
Eco-Hydrology
RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state
[Figure: RHESSys workflow diagram. Steps: choose gauge or outlet (HIS); extract drainage area (NHDPlus). Terrain inputs: Digital Elevation Model (DEM), slope, aspect, streams (NHD), roads (DOT). Data layers: land use, leaf area index, phenology, soil data. Data sources: NLCD (EPA), Landsat TM, MODIS, USDA. Nested watershed structure: basin, hillslope, patch, strata, stream network. Outputs: soil and vegetation parameter files, worldfile, flowtable.]
iRODS Rule for RHESSys
    main {
      getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
      convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
      extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
      importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
      delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
    }
Modular workflow composed by chaining basic transformations:
• Define input variables
• Call functions to apply each transformation step
• Store results in a shared collection
    extractTileFromNHD_DEM(*extentCoords) {
      # Split path to object into collection and name
      msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
      writeLine("serverLog", *nhdDEMObjColl);
      writeLine("serverLog", *nhdDEMObjName);
      # Build query to discover physical path
      msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
      msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
      msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
      msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
      # Run query
      msiExecGenQuery(*genQInp, *genQOut);
      # Extract path from query result
      foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
      writeLine("serverLog", *filePath);
      # Determine physical path of input directory
      msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
      # Generate physical path of output file
      msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
      *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
      *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
      # Generate iRODS path of output
      msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
      *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
      # Extract the tile with gdal_translate on the remote host
      *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
      writeLine("serverLog", *args);
      msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
      writeLine("serverLog", *cmd_out);
      # Register tile file with iRODS
      msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
    }
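The path arithmetic in the tile-extraction rule is easy to get wrong, so here is a small Python sketch of the same derivation: from the physical path of a file in the raster dataset and the logical collection that holds it, compute the `gdal_translate` command line, the physical path of the output tile, and the logical path under which the tile would be registered. The function name `plan_tile` and the example paths are hypothetical; the `-of HFA` and `-projwin` options are the ones used in the rule. The sketch only builds the command and does not run it.

```python
import posixpath
from typing import List, Tuple


def plan_tile(extent: str, data_path: str, obj_coll: str) -> Tuple[List[str], str, str]:
    """Derive the gdal_translate command and output paths for one tile.

    extent    -- window coordinates as a space-separated string
    data_path -- physical path of a file inside the raster dataset directory
    obj_coll  -- logical collection that holds the dataset
    """
    # Directory that holds the raster dataset (input to gdal_translate).
    in_dir = posixpath.dirname(data_path)
    # Parent directory and dataset name, used to name the output tile.
    parent_dir, dataset = posixpath.split(in_dir)
    tile_name = f"SUBSET-{dataset}.img"
    tile_file = posixpath.join(parent_dir, tile_name)
    # The tile is registered next to the dataset's collection.
    tile_obj = posixpath.join(posixpath.dirname(obj_coll), tile_name)
    cmd = ["gdal_translate", "-of", "HFA", "-projwin",
           *extent.split(), in_dir, tile_file]
    return cmd, tile_file, tile_obj
```

Building the command as a list of arguments, rather than one concatenated string as in the rule, avoids quoting problems if a path ever contains spaces.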
Summary
iRODS is a powerful policy-based engine for managing next-generation Big Data cyber-infrastructures
Enables a flexible, adaptive, and customizable data management architecture
"Canned" scripts (policies) can be created to standardize and automate users' processes
Simple menu-driven interface
No CS degree needed
iRODS is the middleware for distributed data management
Thank you
Questions
Community-based Collection Life Cycle
Stage                       Data state    Policy added
Project Collection          Private       Local Policy
Data Grid                   Shared        Distribution Policy
Digital Library             Published     Description Policy
Data Processing Pipeline    Analyzed      Service Policy
Reference Collection        Preserved     Representation Policy
Federation                  Sustained     Re-purposing Policy
Stages correspond to the addition of new policies for a broader community
Virtualize the stages of the collection life cycle through policy evolution
The driving purpose changes at each stage of the data life cycle
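The life cycle above can be read as an ordered list in which each stage adds one policy to those already in force. The short Python sketch below encodes the stage names, data states, and policies from the slide; the `Stage` type and the `policies_in_force` helper are hypothetical, and the assumption that earlier policies remain active at later stages is an interpretation of "addition of new policies".

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    collection: str   # type of collection at this stage
    data_state: str   # state of the data at this stage
    policy: str       # policy added at this stage


LIFE_CYCLE: List[Stage] = [
    Stage("Project Collection", "Private", "Local Policy"),
    Stage("Data Grid", "Shared", "Distribution Policy"),
    Stage("Digital Library", "Published", "Description Policy"),
    Stage("Data Processing Pipeline", "Analyzed", "Service Policy"),
    Stage("Reference Collection", "Preserved", "Representation Policy"),
    Stage("Federation", "Sustained", "Re-purposing Policy"),
]


def policies_in_force(collection: str) -> List[str]:
    """Policies accumulated up to and including the named stage."""
    active: List[str] = []
    for stage in LIFE_CYCLE:
        active.append(stage.policy)
        if stage.collection == collection:
            return active
    raise KeyError(collection)
```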
Applications
Data Grids (data sharing)
  Ocean Observatories Initiative
  The iPlant Collaborative
  National Optical Astronomy Observatory
  BaBar High Energy Physics
  Broad Institute genomics data grid
  Wellcome Trust Sanger Institute genomics data grid
Digital Libraries (data publication)
  Texas Digital Library
  French National Library
  UNC-CH SILS LifeTime Library
Repositories, Archives (data preservation)
  NASA Center for Climate Simulation
  Carolina Digital Repository
Sequencing Work: an Infrastructure View
[Figure: infrastructure diagram. National resources: RENCI Science Portal, Open Science Grid, TeraGrid, UNC BASS, ... RENCI infrastructure (test/development): pipelines, genome databases, distributed ad-hoc processing, iRODS data-grid managed processing. Data production: UNC HTSF, third-party vendors. Clinical data systems: NCGenes, Secure Medical Workspace. Production: pipelines, archive, genome databases. Data sharing: local (TUCASI), NIH, other institutions. Reference data: RefSeq, genome annotations, dbSNP, HGMD, 1000 Genomes, VarDB, Hadoop.]
Legend: RC, RENCI, LCCC, HTSF, ITS; infrastructure hardware and software; HTSF is the UNC High Throughput Sequencing Facility
Managing several hundred TBs of genomic data
Managing Data on the Research Side
[Figure: research-side data flows, contrasting a "Wild West" state with a "Managed" state. Storage: RENCI storage (tape drives), UNC storage (tape drives), genomics storage. Compute: UNC HPC, RENCI HPC, genomics HPC, RENCI Hadoop, genomics Hadoop, IT machines, lab machines. External compute: Open Science Grid, Clemson, clouds. External partners: NIH. People: data providers, researchers, students, external collaborators, IT staff.]
iRODS gracefully allows for introducing control:
• Data movement and replication
• Metadata standards
• Archival, deletion, and retention
• Integration with workflows, Hadoop, databases
• Hiding complexities
• Automation
• ... all policy driven
• ... without breaking the in-place systems
SILS LifeTime Library
Student digital libraries
Enable students to build collections of:
• Photographs
• MP3 audio files
• Class documents
• Video
• Web site archive
Resources provided by the School of Information and Library Science at UNC-CH
Student collections range from 2 GBytes to 150 GBytes
Number of files ranges from 2,000 to 12,000
SILS LifeTime Library Policies
Library management
• Replication
• Checksums
• Versioning
• Strict access controls
• Quotas
• Metadata catalog replication
• Installation environment archiving
Ingestion
• Automated synchronization of student directory with LifeTime Library
• Automated loading of MP3 metadata
Policy-Driven Repository Infrastructure project, funded by the Institute of Museum and Library Services
Carolina Digital Repository
Ingest Workflow
iRODS Data Grid
More than 50 different clients have been used to interact with the data grid:
Web browsers (iDrop-web, Rich Web client)
Web services (VOSpace)
Load libraries (Python, Java)
I/O libraries (C, C++, Fortran)
File systems (FUSE, WebDAV, Parrot)
Synchronization interfaces (iDrop)
Unix tools and Grid tools (icommands, SAGA, SRM, GriPhyN)
Workflows (Kepler, Taverna)
Digital Libraries (Fedora, DSpace)
Portals (EnginFrame)
Applications
Data Grids (data sharing)
Ocean Observatories Initiative
The iPlant Collaborative
National Optical Astronomy Observatory
Babar High Energy Physics
Broad Institute genomics data grid
WellCome Trust Sanger Institute genomics data grid
Digital Libraries (data publication)
Texas Digital Library
French National Library
UNC-CH SILS LifeTime Library
Repositories Archives (data preservation)
NASA Center for Climate Simulation
Carolina Digital Repository
Sequencing Work ndash an Infrastructure View
RENCI
Science
Portal
Open
Science
Grid
TeraGrid
UNC BASS
hellip
National Resources
Pipelines Genome
Databases
RENCI Infrastructure TestDevelopment
Distributed ad-hoc
processing
iRODS data-grid managed
processing
Data Production
UNC HTFS
Third Party
Vendors
Clinical Data Systems
NCGenes
Secure Medical
Workspace
Production
Pipelines
Archive Genome
Databases
RC RENCI LCCC HTSF Infrastructure hardware ITS software LCCC UNC High
Throughput Sequencing facility
Local
(TUCASI)
Data Sharing
NIH
Other
Institutions
Ref
Se
q
Genome
Annotations
dbSN
P HGMD
1000
Genomes
Managing several hundred TBs of genomic data
VarDB Hadoop
Managing Data on the Research Side
RENCI
STORAGE
(Tape Drives)
UNC
STORAGE
(Tape Drives)
UNC HPC RENCI HPC
External
Compute
Open Science
Grid
Clemson
Clouds
IT Machines
RENCI Hadoop
Genomics
Storage
Lab
Machines
NIH
External
Partners
Genomics HPC
Genomics
Hadoop
Data
Providers
Researchers
Students
External
Collaborators
IT Staff
iRODS gracefully allows for introducing control
bullData movement and replication
bullMetadata standards
bullArchival deletion and retention
bullIntegration with workflows hadoop databases
bullHiding complexities
bullAutomation
bullhellip all policy driven
bullhellip without breaking the in-place systems
Wild West Managed
SILS LifeTime Library
Student digital libraries
Enable students to build collections of
Photographs
MP3 audio files
Class documents
Video
Web site archive
Resources provided by School of Information and
Library Science at UNC-CH
Student collections range from 2 GBytes to 150 Gbytes
Number of files from 2000 to 12000
SILS LifeTime Library Policies
Library management
Replication
Checksums
Versioning
Strict access controls
Quotas
Metadata catalog replication
Installation environment archiving
Ingestion
Automated synchronization of student directory
with LifeTime Library
Automated loading of MP3 metadata
Policy-Driven Repository Infrastructure project
funded by the Institute for Museum and Library Services
Carolina Digital Repository
Carolina Digital Repository
Ingest Workflow
iRODS Data Grid
More than 50 different clients have been used to
interact with the data grid
Web browsers (iDrop-web Rich Web client)
Web services (VOSpace)
Load libraries (Python Java)
IO libraries (C C++ Fortran)
File systems (FUSE WebDav Parrot)
Synchronization interfaces (iDrop)
Unix tools Grid tools (icommands SAGA SRM Griphyn)
Workflows (Kepler Taverna)
Digital Libraries (Fedora DSpace)
Portals (EnginFrame)
Managing Information amp Knowledge
Concepts
Data objects
Information names
Knowledge relationships between names
Wisdom relationships between relationships
Implementation
Data bytes Storage system
Information metadata Relational database
Knowledge policies procedures Rule base Rule engine
Wisdom policy enforcement point Data Grid
Data Virtualization
Storage System
Storage Protocol
Access Interface
Policy Enforcement Points
Standard Micro-services
Standard IO Operations
bull Map from the actions
requested by the client to
multiple policy
enforcement points
bull Map from policy to
standard micro-services
bull Map from micro-services
to standard Posix IO
operations
bull Map standard IO
operations to the
protocol supported by
the storage system
System and User-driven Rules
The data grid automatically applies rules
defined in the rule base corere
You can define rules that are applied
interactively or that are deferred for later
execution
irule ndashF ldquorule-filerrdquo
Example Rule
Write ldquoHello Worldrdquo
Create rule file call ruleHellor
myTestRule
writeLine(stdoutrdquo Hello World)
INPUT null
OUTPUT ruleExecOut
irule ndashF ruleHellor
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and document their lack
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
Workflow Management amp Registration
Workflow file
Directory holding all input and output files
associated with workflow file (mounted
collection that is linked to the workflow file)
Input parameter file lists parameters
and input and output file names
Directory holding all output
files generated for invocation
of eCWkflowrun the version
number is incremented
Automatically generated run file for
Executing each input file
Output file created for
eCWKflowmpf
eCWkflowmss
earthCubeeCWkflow
eCWkflowmpf
earthCubeeCWkfloweCWkflowrunDir0
eCWkflowrun
Outfile
eCWkflow2run
eCWkflow2mpf
earthCubeeCWkfloweCWkflow2runDir
0
Newfile
Publications
Rajasekar R M Wan R Moore W Schroeder S-Y Chen L
Gilbert C-Y Hou C Lee R Marciano P Tooby A de Torcy B
Zhu ldquoiRODS Primer Integrated Rule-Oriented Data Systemrdquo
Morgan amp Claypool 2010
Ward R M Wan W Schroeder A Rajasekar A de Torcy T
Russell H Xu R Moore ldquoThe integrated Rule-Oriented Data
System (iRODS 30) Micro-service Workbookrdquo DICE
Foundation November 2011 ISBN 9781466469129
Amazoncom
Reagan W Moore
rwmoorerenciorg
httpirodsdiceresearchorg
NSF OCI-0940841 ldquoDataNet Federation Consortiumrdquo
NSF OCI-1032732 ldquoImprovement of iRODS for Multi-Disciplinary Applicationsrdquo
NSF OCI-0848296 ldquoNARA Transcontinental Persistent Archives Prototyperdquo
NSF SDCI-0721400 ldquoData Grids for Community Driven Applicationsrdquo
iRODS - Open Source Software
iRODS Distributed Data Management
Val = 0rdquo
msiExecStrCondQuery(SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME =
Coll and META_COLL_ATTR_NAME = TEST_DATA_ID GenQOut2)
foreach (GenQOut2)
msiGetValByKey(GenQOut2 META_COLL_ATTR_NAME Val)
if(int(Val) == 0)
Str1 = TEST_DATA_ID=0rdquo
msiString2KeyValPair(Str1kvp)
msiAssociateKeyValuePairsToObj(kvpColl-C)
writeLine(Lfileadded TEST_DATA_ID attribute to collection Coll)
on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery(META_COLL_ATTR_VALUE COLL_NAME = Coll and
META_COLL_ATTR_NAME = TEST_DATA_ID GenQInp2)
msiExecGenQuery(GenQInp2GenQOut2)
foreach(GenQOut2)
msiGetValByKey(GenQOut2 META_COLL_ATTR_VALUE colldataID)
Initializing Workflow Parameters
Workflow Operations Used
Arithmetic (+ - )
Boolean tests (== = ampamp || gt lt gt=)
Conditional statements
if then else
Control
break fail
Loops
for foreach while
List manipulation
initialization list addition (cons) extracting an element from a
list (elem) updating an element in a list (setelem)
Variable manipulation
initialization type conversion (int double str)
Micro-services Used
Metadata catalog manipulation
msiGetValByKey get metadata from structure
msiExecStrCondQuery execute string conditional query
msiString2KeyValPair convert string to key-value pair
msiAssociateKeyValuePairsToObj add metadata
msiMakeGenQuery create a query
msiExecGenQuery execute a query
msiCloseGenQuery release query buffers
msiGetContInxFromGenQueryOut check for more rows
msiRemoveKeyValuePairsFromObj remove metadata
msiGetMoreRows get more rows from query
Micro-services Used
Data and directory manipulation
msiIsColl check whether name is a collection
msiCollCreate create a collection
msiDataObjCreate create a file
msiDataObjRepl replicate a file
msiDataObjChksum checksum a file
msiDataObjUnlink delete a file
System functions
msiGetSystemTime get the system time
writeLine write a line to a file or standard out
msiSleep sleep
Performance at rencirsquo
bull Execute call to rule engine 18 msecs
bull Execute metadata query 714 msecs
bull Disk seek latency 5 msecs
bull Disk rotational latency 11 msecs
bull Production loop logic 63 msecs
bull Checksum verification 21 msecs
Data Analysis Use Cases
bull Demonstrate reproducible science A use case could include the
registration storage sharing and re-execution of a workflow The hypoxia
use case from the Cross-Domain and Brokering Concept groups could be
used as an example
bull Automate data retrieval A use case could demonstrate remote access to a
data collection retrieval of desired data sets transformation and use in
an analysis workflow An eco-hydrology example that automates access
to digital elevation maps and land use coverage is being built
bull Integrate community resources with collaboration environments An
example would be use of the DAB protocol to identify and cache local
copies of relevant data sets for local analysis
bull Integrate multiple community resources A use case could be
demonstration of invocation of multiple workflow systems within the
same analysis An example is the integration of Cyber-integrator
workflow with collaboration environments to support drought prediction
RHESSys workflow to develop a nested
watershed parameter file (worldfile)
containing a nested ecogeomorphic object
framework and full initial system state
Choose gauge
or outlet (HIS)
Extract
drainage area
(NHDPlus)
Digital
Elevation
Model (DEM)
Worldfile Flowtable
RHESSys
Slope
Aspect
Streams (NHD)
Roads (DOT) Strata
Hillslope
Patch
Basin
Stream network
Nested watershed
structure
Land Use
Leaf Area
Index
Phenology
Soil Data
NLCD (EPA)
Landsat TM
MODIS
USDA
Soil and vegetation
parameter files
Eco-Hydrology
iRODS Rule for RHESSys
Modular workflow composed by chaining basic transformations:
• Define input variables
• Call functions to apply each transformation step
• Store results in shared collection
main {
    getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
    convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
    extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
    importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
    delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
}
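The chaining pattern can be sketched outside iRODS as well. The Python below is a minimal illustration, not the RHESSys rule itself: the step functions and the keys `gage_reachcode`, `extent_vect`, `extent_dem`, and `tile` are hypothetical stand-ins, and real steps would call GIS tools rather than build strings.

```python
from typing import Callable, Dict, List

# A step reads the shared state and returns the new values it produces.
Step = Callable[[Dict[str, str]], Dict[str, str]]


def run_pipeline(state: Dict[str, str], steps: List[Step]) -> Dict[str, str]:
    """Apply each step in order, merging its outputs into a copy of the state."""
    result = dict(state)
    for step in steps:
        result.update(step(result))
    return result


# Hypothetical stand-ins for the transformation steps.
def get_extent(state: Dict[str, str]) -> Dict[str, str]:
    return {"extent_vect": "extent-for-" + state["gage_reachcode"]}


def convert_extent(state: Dict[str, str]) -> Dict[str, str]:
    return {"extent_dem": state["extent_vect"] + "-in-dem-coords"}


def extract_tile(state: Dict[str, str]) -> Dict[str, str]:
    return {"tile": "tile(" + state["extent_dem"] + ")"}


final = run_pipeline(
    {"gage_reachcode": "R1"},  # input variable
    [get_extent, convert_extent, extract_tile],
)
print(final["tile"])
```

Each step only names the values it consumes and produces, so steps can be reordered or replaced without touching the driver, which is the property the modular rule relies on.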
extractTileFromNHD_DEM(*extentCoords) {
    # Split path to object into collection and name
    msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
    writeLine("serverLog", *nhdDEMObjColl);
    writeLine("serverLog", *nhdDEMObjName);
    # Build query to discover physical path
    msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
    msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
    msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
    msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
    # Run query
    msiExecGenQuery(*genQInp, *genQOut);
    # Extract path from query result
    foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
    writeLine("serverLog", *filePath);
    # Determine physical path of input directory
    msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
    # Generate physical path of output file
    msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
    *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
    *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
    # Generate iRODS path of output
    msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
    *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
    *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
    writeLine("serverLog", *args);
    msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
    writeLine("serverLog", *cmd_out);
    # Register tile file with iRODS
    msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
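The path arithmetic in the rule above is easy to get wrong, so here is a small Python sketch of just that part. It assumes POSIX-style paths and an extent given as four space-separated numbers; the example paths and extent are hypothetical, and nothing is executed or registered.

```python
import posixpath
from typing import List, Tuple


def split_path(path: str) -> Tuple[str, str]:
    """Split a path into (parent, last component), like the msiSplitPath calls."""
    return posixpath.split(path.rstrip("/"))


def plan_tile_extraction(
    phys_header_path: str, dem_coll: str, extent: str
) -> Tuple[List[str], str, str]:
    """Return (gdal_translate argv, physical tile path, logical tile path)."""
    in_dir, _header = split_path(phys_header_path)   # directory holding the raster
    parent_dir, dataset = split_path(in_dir)          # where the tile is written
    tile_name = "SUBSET-" + dataset + ".img"
    tile_phys = posixpath.join(parent_dir, tile_name)
    coll_parent, _ = split_path(dem_coll)
    tile_obj = posixpath.join(coll_parent, tile_name)
    # -projwin takes four numbers: ulx uly lrx lry
    argv = ["gdal_translate", "-of", "HFA", "-projwin",
            *extent.split(), in_dir, tile_phys]
    return argv, tile_phys, tile_obj


argv, tile_phys, tile_obj = plan_tile_extraction(
    "/vault/nhd/dem01/hdr.adf",   # hypothetical physical path of the DEM object
    "/zone/home/nhd/dem01",       # hypothetical iRODS collection of the DEM
    "100 200 300 50",             # hypothetical extent
)
print(argv)
print(tile_phys)
print(tile_obj)
```

The tile is written next to the source raster directory on disk and registered next to the source collection in the logical namespace, so the physical and logical layouts stay parallel.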
Summary
iRODS is a powerful Policy-Based engine for Managing
NextGen Big Data Cyber-Infrastructures
Enables a Flexible, Adaptive, and Customizable
Data Management Architecture
"Canned" scripts (policies) can be created to
standardize and automate users' processes
Simple menu-driven interface
No CS Degree needed
iRODS is the middleware for
Distributed Data Management
Thank you
Questions
Sequencing Work - an Infrastructure View
Managing several hundred TBs of genomic data
[Figure: infrastructure diagram. Labels shown: national resources (RENCI Science
Portal, Open Science Grid, TeraGrid, UNC BASS, ...); RENCI infrastructure,
test/development (pipelines, genome databases, distributed ad-hoc processing,
iRODS data-grid managed processing); data production (UNC HTSF, third-party
vendors); clinical data systems (NCGenes, Secure Medical Workspace); production
pipelines; archive; genome databases (RefSeq, genome annotations, dbSNP, HGMD,
1000 Genomes); VarDB; Hadoop; local (TUCASI); data sharing (NIH, other
institutions). Legend entries: RC, RENCI, LCCC, HTSF, infrastructure hardware,
ITS, software; HTSF is the UNC High Throughput Sequencing Facility.]
Managing Data on the Research Side
[Figure: "Wild West" versus "Managed" views of research data. Labels shown:
RENCI storage (tape drives); UNC storage (tape drives); UNC HPC; RENCI HPC;
external compute (Open Science Grid, Clemson, clouds); IT machines; RENCI Hadoop;
genomics storage; lab machines; NIH; external partners; genomics HPC; genomics
Hadoop. People shown: data providers, researchers, students, external
collaborators, IT staff.]
iRODS gracefully allows for introducing control:
• Data movement and replication
• Metadata standards
• Archival, deletion, and retention
• Integration with workflows, Hadoop, databases
• Hiding complexities
• Automation
• ... all policy driven
• ... without breaking the in-place systems
SILS LifeTime Library
Student digital libraries
Enable students to build collections of:
• Photographs
• MP3 audio files
• Class documents
• Video
• Web site archive
Resources provided by the School of Information and
Library Science at UNC-CH
Student collections range from 2 GBytes to 150 GBytes
Number of files ranges from 2,000 to 12,000
SILS LifeTime Library Policies
Library management
• Replication
• Checksums
• Versioning
• Strict access controls
• Quotas
• Metadata catalog replication
• Installation environment archiving
Ingestion
• Automated synchronization of student directory
  with LifeTime Library
• Automated loading of MP3 metadata
Policy-Driven Repository Infrastructure project
funded by the Institute of Museum and Library Services
Carolina Digital Repository
Carolina Digital Repository: Ingest Workflow
iRODS Data Grid
More than 50 different clients have been used to
interact with the data grid:
• Web browsers (iDrop-web, Rich Web client)
• Web services (VOSpace)
• Load libraries (Python, Java)
• I/O libraries (C, C++, Fortran)
• File systems (FUSE, WebDAV, Parrot)
• Synchronization interfaces (iDrop)
• Unix tools, Grid tools (icommands, SAGA, SRM, GriPhyN)
• Workflows (Kepler, Taverna)
• Digital Libraries (Fedora, DSpace)
• Portals (EnginFrame)
Managing Information & Knowledge
Concepts
• Data: objects
• Information: names
• Knowledge: relationships between names
• Wisdom: relationships between relationships
Implementation
• Data: bytes; Storage system
• Information: metadata; Relational database
• Knowledge: policies, procedures; Rule base, Rule engine
• Wisdom: policy enforcement point; Data Grid
Data Virtualization
[Figure: layer stack, from client to storage: Access Interface, Policy
Enforcement Points, Standard Micro-services, Standard I/O Operations, Storage
Protocol, Storage System.]
• Map from the actions requested by the client to multiple policy
  enforcement points
• Map from policy to standard micro-services
• Map from micro-services to standard POSIX I/O operations
• Map standard I/O operations to the protocol supported by
  the storage system
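The four mappings can be pictured as lookup tables walked in order. The Python below is a toy model under stated assumptions: every table entry, enforcement point name, and driver is hypothetical and chosen only to show the layering, not taken from iRODS.

```python
from typing import Callable, Dict, List

# Hypothetical tables; the names are illustrative, not iRODS identifiers.
ACTION_TO_PEPS: Dict[str, List[str]] = {
    "put": ["check_access", "select_resource", "post_put"],
}
PEP_TO_MICROSERVICES: Dict[str, List[str]] = {
    "check_access": [],
    "select_resource": ["choose_storage"],
    "post_put": ["checksum", "replicate"],
}
MICROSERVICE_TO_IO: Dict[str, List[str]] = {
    "choose_storage": ["stat"],
    "checksum": ["open", "read", "close"],
    "replicate": ["open", "read", "write", "close"],
}


def posix_driver(op: str) -> str:
    """Storage that speaks POSIX I/O directly."""
    return "posix:" + op


def object_store_driver(op: str) -> str:
    """Toy object store: no open/close, reads and writes become GET and PUT."""
    return {"read": "http:GET", "write": "http:PUT", "stat": "http:HEAD"}.get(op, "noop")


def resolve(action: str, driver: Callable[[str], str]) -> List[str]:
    """Walk the four mappings and list the storage-protocol calls for one action."""
    calls: List[str] = []
    for pep in ACTION_TO_PEPS[action]:
        for microservice in PEP_TO_MICROSERVICES[pep]:
            for op in MICROSERVICE_TO_IO[microservice]:
                calls.append(driver(op))
    return calls


print(resolve("put", posix_driver))
print(resolve("put", object_store_driver))
```

Only the last table changes when a new storage system is added, which is why policies written against the upper layers keep working across storage technologies.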
System and User-driven Rules
The data grid automatically applies rules
defined in the rule base core.re
You can define rules that are applied
interactively or that are deferred for later
execution:
irule -F "rule-file.r"
Example Rule
Write "Hello World"
Create a rule file called ruleHello.r:
myTestRule {
    writeLine("stdout", "Hello World");
}
INPUT null
OUTPUT ruleExecOut
Run it with:
irule -F ruleHello.r
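For scripted use, the same two steps (write the rule file, build the `irule -F` command line) can be prepared from Python. This is only a sketch: the helper names `write_rule` and `irule_command` are hypothetical, and the command is printed rather than executed because running it needs a configured iRODS client.

```python
import tempfile
from pathlib import Path
from typing import List

RULE_TEXT = """myTestRule {
    writeLine("stdout", "Hello World");
}
INPUT null
OUTPUT ruleExecOut
"""


def write_rule(directory: Path, name: str = "ruleHello.r") -> Path:
    """Write the rule text to <directory>/<name> and return the path."""
    path = directory / name
    path.write_text(RULE_TEXT)
    return path


def irule_command(rule_path: Path) -> List[str]:
    """Build the argument list for submitting the rule file."""
    return ["irule", "-F", str(rule_path)]


with tempfile.TemporaryDirectory() as tmp:
    rule_path = write_rule(Path(tmp))
    command = irule_command(rule_path)
    print(rule_path.read_text())
    print(" ".join(command))
```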
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and document their lack
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
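A few of these requirements (batches of 256, restart from the last checked batch, checksum verification, replica repair, a repair log) fit in a short in-memory model. The Python below is a simplified sketch, not the production rule: the catalog, storage names, and checkpoint dictionary are hypothetical, every storage is assumed to hold a replica of every file, and the deadline scheduler, load leveling, and handling of newly added files are left out.

```python
import hashlib
from typing import Dict, List

BATCH_SIZE = 256  # batch size named in the requirements above


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_collection(
    catalog: Dict[str, str],                # logical name -> expected checksum
    replicas: Dict[str, Dict[str, bytes]],  # storage name -> {logical name -> bytes}
    state: Dict[str, int],                  # persistent checkpoint, e.g. {"next": 0}
    log: List[str],
) -> None:
    """Check every file in batches, repairing bad or missing replicas from a good copy."""
    names = sorted(catalog)
    while state["next"] < len(names):
        start = state["next"]
        for name in names[start:start + BATCH_SIZE]:
            good = [s for s, files in replicas.items()
                    if name in files and checksum(files[name]) == catalog[name]]
            bad = [s for s in replicas if s not in good]
            if not good:
                log.append("UNRECOVERABLE " + name)
                continue
            for storage in bad:
                replicas[storage][name] = replicas[good[0]][name]
                log.append("REPAIRED " + name + " on " + storage)
        # Record progress after each batch so a restart resumes from here.
        state["next"] = min(start + BATCH_SIZE, len(names))


files = {"f%04d" % i: ("data-%d" % i).encode() for i in range(600)}
catalog = {name: checksum(data) for name, data in files.items()}
replicas = {"disk": dict(files), "tape": dict(files)}
del replicas["tape"]["f0003"]              # simulate a missing replica
replicas["disk"]["f0300"] = b"corrupted"   # simulate a damaged replica
state: Dict[str, int] = {"next": 0}
log: List[str] = []
verify_collection(catalog, replicas, state, log)
print(state["next"])
print(log)
```

Because the checkpoint is only advanced after a whole batch, an interrupted run repeats at most one batch, and repeating a batch is harmless since repairs are idempotent.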
Workflow Management & Registration
• Workflow file: eCWkflow.mss
• Directory holding all input and output files associated with the workflow
  file (mounted collection that is linked to the workflow file):
  /earthCube/eCWkflow
• Input parameter file, which lists parameters and input and output file
  names: eCWkflow.mpf
• Directory holding all output files generated for an invocation of
  eCWkflow.run (the version number is incremented):
  /earthCube/eCWkflow/eCWkflow.runDir0
• Automatically generated run file for executing each input file:
  eCWkflow.run
• Output file created for eCWkflow.mpf: Outfile
Also shown: eCWkflow2.run, eCWkflow2.mpf,
/earthCube/eCWkflow/eCWkflow2.runDir0, Newfile
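The incrementing version number on the run directory can be illustrated with a short helper. This sketch assumes run directories are named `<workflow>.runDir<N>` and that each invocation takes the next unused N; both the naming pattern and the function `next_run_dir` are assumptions for illustration, not confirmed iRODS behavior.

```python
import re
from typing import Iterable


def next_run_dir(workflow: str, existing: Iterable[str]) -> str:
    """Return the run directory name for the next invocation of <workflow>.run."""
    pattern = re.compile(re.escape(workflow) + r"\.runDir(\d+)$")
    # Collect the version numbers of directories that match the pattern.
    versions = [int(m.group(1)) for m in map(pattern.match, existing) if m]
    return "%s.runDir%d" % (workflow, max(versions) + 1 if versions else 0)


print(next_run_dir("eCWkflow", []))
print(next_run_dir("eCWkflow", ["eCWkflow.runDir0", "eCWkflow.runDir1", "Outfile"]))
```

Keeping each invocation's outputs in its own numbered directory means earlier results are never overwritten, which supports re-execution and comparison of workflow runs.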
Publications
Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert,
C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu,
"iRODS Primer: Integrated Rule-Oriented Data System",
Morgan & Claypool, 2010.
Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell,
H. Xu, R. Moore, "The integrated Rule-Oriented Data System (iRODS 3.0)
Micro-service Workbook", DICE Foundation, November 2011,
ISBN 9781466469129, Amazon.com.
Reagan W. Moore
rwmoore@renci.org
http://irods.diceresearch.org
NSF OCI-0940841 "DataNet Federation Consortium"
NSF OCI-1032732 "Improvement of iRODS for Multi-Disciplinary Applications"
NSF OCI-0848296 "NARA Transcontinental Persistent Archives Prototype"
NSF SDCI-0721400 "Data Grids for Community Driven Applications"
iRODS - Open Source Software
iRODS Distributed Data Management
Initializing Workflow Parameters
    *Val = "0";
    msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
    foreach (*GenQOut2) {
        msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
    }
    if (int(*Val) == 0) {
        *Str1 = "TEST_DATA_ID=0";
        msiString2KeyValPair(*Str1, *kvp);
        msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
        writeLine("*Lfile", "added TEST_DATA_ID attribute to collection *Coll");
    }
    # on a restart TEST_DATA_ID will be greater than 0
    msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
    msiExecGenQuery(*GenQInp2, *GenQOut2);
    foreach (*GenQOut2) {
        msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
    }
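The logic of this initialization step is: create the progress attribute with value 0 if the collection does not have it yet, then read back whatever value is stored. A minimal Python model of that logic follows, with the collection's metadata represented as a plain dictionary (an assumption for illustration; the real rule goes through the metadata catalog).

```python
from typing import Dict

ATTRIBUTE = "TEST_DATA_ID"


def init_checkpoint(collection_metadata: Dict[str, str]) -> int:
    """Create the progress attribute on first run; return its current value."""
    if ATTRIBUTE not in collection_metadata:
        collection_metadata[ATTRIBUTE] = "0"  # first execution
    return int(collection_metadata[ATTRIBUTE])


fresh: Dict[str, str] = {}
first_value = init_checkpoint(fresh)        # first run creates the attribute
resumed = {"TEST_DATA_ID": "512"}
restart_value = init_checkpoint(resumed)    # restart keeps the stored value
print(first_value, restart_value)
```

Storing the checkpoint as metadata on the collection itself means any server that picks up the restarted rule can find where the previous run stopped.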
Workflow Operations Used
• Arithmetic: +, -
• Boolean tests: ==, !=, &&, ||, >, <, >=
• Conditional statements: if, then, else
• Control: break, fail
• Loops: for, foreach, while
• List manipulation: initialization, list addition (cons), extracting an
  element from a list (elem), updating an element in a list (setelem)
• Variable manipulation: initialization, type conversion (int, double, str)
Micro-services Used
Metadata catalog manipulation
• msiGetValByKey: get metadata from structure
• msiExecStrCondQuery: execute string conditional query
• msiString2KeyValPair: convert string to key-value pair
• msiAssociateKeyValuePairsToObj: add metadata
• msiMakeGenQuery: create a query
• msiExecGenQuery: execute a query
• msiCloseGenQuery: release query buffers
• msiGetContInxFromGenQueryOut: check for more rows
• msiRemoveKeyValuePairsFromObj: remove metadata
• msiGetMoreRows: get more rows from query
Micro-services Used
Data and directory manipulation
• msiIsColl: check whether name is a collection
• msiCollCreate: create a collection
• msiDataObjCreate: create a file
• msiDataObjRepl: replicate a file
• msiDataObjChksum: checksum a file
• msiDataObjUnlink: delete a file
System functions
• msiGetSystemTime: get the system time
• writeLine: write a line to a file or standard out
• msiSleep: sleep
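The "check for more rows" and "get more rows" entries describe a paging loop: run a query, process the page, and keep fetching while a continuation index says more rows remain. The Python below models that loop with a toy in-memory query; `run_query`, the page size of 256, and the convention that a continuation index of 0 means "done" are assumptions made for the sketch.

```python
from typing import Iterator, List, Tuple

PAGE_SIZE = 256  # assumed page size for this sketch


def run_query(rows: List[str], offset: int) -> Tuple[List[str], int]:
    """Toy catalog query: return one page and a continuation index (0 means done)."""
    page = rows[offset:offset + PAGE_SIZE]
    next_offset = offset + len(page)
    return page, (next_offset if next_offset < len(rows) else 0)


def all_rows(rows: List[str]) -> Iterator[str]:
    """Yield every row, fetching further pages while the continuation index is set."""
    page, continuation = run_query(rows, 0)
    while True:
        yield from page
        if continuation == 0:
            break
        page, continuation = run_query(rows, continuation)


names = ["file%d" % i for i in range(600)]
total = sum(1 for _ in all_rows(names))
print(total)
```

Paging keeps memory use bounded no matter how large the collection is, which is what lets the same rule handle collections of arbitrary size.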
Performance at RENCI
• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs
SILS LifeTime Library
Student digital libraries
Enable students to build collections of
Photographs
MP3 audio files
Class documents
Video
Web site archive
Resources provided by School of Information and
Library Science at UNC-CH
Student collections range from 2 GBytes to 150 Gbytes
Number of files from 2000 to 12000
SILS LifeTime Library Policies
Library management
Replication
Checksums
Versioning
Strict access controls
Quotas
Metadata catalog replication
Installation environment archiving
Ingestion
Automated synchronization of student directory
with LifeTime Library
Automated loading of MP3 metadata
Policy-Driven Repository Infrastructure project
funded by the Institute for Museum and Library Services
Carolina Digital Repository
Carolina Digital Repository
Ingest Workflow
iRODS Data Grid
More than 50 different clients have been used to
interact with the data grid
Web browsers (iDrop-web Rich Web client)
Web services (VOSpace)
Load libraries (Python Java)
IO libraries (C C++ Fortran)
File systems (FUSE WebDav Parrot)
Synchronization interfaces (iDrop)
Unix tools Grid tools (icommands SAGA SRM Griphyn)
Workflows (Kepler Taverna)
Digital Libraries (Fedora DSpace)
Portals (EnginFrame)
Managing Information amp Knowledge
Concepts
Data objects
Information names
Knowledge relationships between names
Wisdom relationships between relationships
Implementation
Data bytes Storage system
Information metadata Relational database
Knowledge policies procedures Rule base Rule engine
Wisdom policy enforcement point Data Grid
Data Virtualization
Storage System
Storage Protocol
Access Interface
Policy Enforcement Points
Standard Micro-services
Standard IO Operations
bull Map from the actions
requested by the client to
multiple policy
enforcement points
bull Map from policy to
standard micro-services
bull Map from micro-services
to standard Posix IO
operations
bull Map standard IO
operations to the
protocol supported by
the storage system
System and User-driven Rules
The data grid automatically applies rules
defined in the rule base corere
You can define rules that are applied
interactively or that are deferred for later
execution
irule ndashF ldquorule-filerrdquo
Example Rule
Write ldquoHello Worldrdquo
Create rule file call ruleHellor
myTestRule
writeLine(stdoutrdquo Hello World)
INPUT null
OUTPUT ruleExecOut
irule ndashF ruleHellor
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and document their lack
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
Workflow Management amp Registration
Workflow file
Directory holding all input and output files
associated with workflow file (mounted
collection that is linked to the workflow file)
Input parameter file lists parameters
and input and output file names
Directory holding all output
files generated for invocation
of eCWkflowrun the version
number is incremented
Automatically generated run file for
Executing each input file
Output file created for
eCWKflowmpf
eCWkflowmss
earthCubeeCWkflow
eCWkflowmpf
earthCubeeCWkfloweCWkflowrunDir0
eCWkflowrun
Outfile
eCWkflow2run
eCWkflow2mpf
earthCubeeCWkfloweCWkflow2runDir
0
Newfile
Publications
Rajasekar R M Wan R Moore W Schroeder S-Y Chen L
Gilbert C-Y Hou C Lee R Marciano P Tooby A de Torcy B
Zhu ldquoiRODS Primer Integrated Rule-Oriented Data Systemrdquo
Morgan amp Claypool 2010
Ward R M Wan W Schroeder A Rajasekar A de Torcy T
Russell H Xu R Moore ldquoThe integrated Rule-Oriented Data
System (iRODS 30) Micro-service Workbookrdquo DICE
Foundation November 2011 ISBN 9781466469129
Amazoncom
Reagan W Moore
rwmoorerenciorg
httpirodsdiceresearchorg
NSF OCI-0940841 ldquoDataNet Federation Consortiumrdquo
NSF OCI-1032732 ldquoImprovement of iRODS for Multi-Disciplinary Applicationsrdquo
NSF OCI-0848296 ldquoNARA Transcontinental Persistent Archives Prototyperdquo
NSF SDCI-0721400 ldquoData Grids for Community Driven Applicationsrdquo
iRODS - Open Source Software
iRODS Distributed Data Management
Val = 0rdquo
msiExecStrCondQuery(SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME =
Coll and META_COLL_ATTR_NAME = TEST_DATA_ID GenQOut2)
foreach (GenQOut2)
msiGetValByKey(GenQOut2 META_COLL_ATTR_NAME Val)
if(int(Val) == 0)
Str1 = TEST_DATA_ID=0rdquo
msiString2KeyValPair(Str1kvp)
msiAssociateKeyValuePairsToObj(kvpColl-C)
writeLine(Lfileadded TEST_DATA_ID attribute to collection Coll)
on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery(META_COLL_ATTR_VALUE COLL_NAME = Coll and
META_COLL_ATTR_NAME = TEST_DATA_ID GenQInp2)
msiExecGenQuery(GenQInp2GenQOut2)
foreach(GenQOut2)
msiGetValByKey(GenQOut2 META_COLL_ATTR_VALUE colldataID)
Initializing Workflow Parameters
Workflow Operations Used
Arithmetic (+ - )
Boolean tests (== = ampamp || gt lt gt=)
Conditional statements
if then else
Control
break fail
Loops
for foreach while
List manipulation
initialization list addition (cons) extracting an element from a
list (elem) updating an element in a list (setelem)
Variable manipulation
initialization type conversion (int double str)
Micro-services Used
Metadata catalog manipulation
    msiGetValByKey: get metadata from structure
    msiExecStrCondQuery: execute string conditional query
    msiString2KeyValPair: convert string to key-value pair
    msiAssociateKeyValuePairsToObj: add metadata
    msiMakeGenQuery: create a query
    msiExecGenQuery: execute a query
    msiCloseGenQuery: release query buffers
    msiGetContInxFromGenQueryOut: check for more rows
    msiRemoveKeyValuePairsFromObj: remove metadata
    msiGetMoreRows: get more rows from query
Data and directory manipulation
    msiIsColl: check whether name is a collection
    msiCollCreate: create a collection
    msiDataObjCreate: create a file
    msiDataObjRepl: replicate a file
    msiDataObjChksum: checksum a file
    msiDataObjUnlink: delete a file
System functions
    msiGetSystemTime: get the system time
    writeLine: write a line to a file or standard out
    msiSleep: sleep
Performance at RENCI
• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs
Data Analysis Use Cases
• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be a demonstration of the invocation of multiple workflow systems within the same analysis. An example is the integration of the Cyber-integrator workflow with collaboration environments to support drought prediction.
Eco-Hydrology
[Figure: RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state. Diagram labels: choose gauge or outlet (HIS); extract drainage area (NHDPlus); Digital Elevation Model (DEM); slope; aspect; streams (NHD); roads (DOT); strata; hillslope; patch; basin; stream network; nested watershed structure; land use; leaf area index; phenology; soil data; NLCD (EPA); Landsat TM; MODIS; USDA; soil and vegetation parameter files; worldfile; flowtable; RHESSys.]
iRODS Rule for RHESSys
    main {
        getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
        convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
        extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
        importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
        delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
    }
Modular workflow composed by chaining basic transformations
    Define input variables
    Call functions to apply each transformation step
    Store results in shared collection
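The chaining pattern described here (define inputs, apply each transformation in turn, keep the results in shared state) can be sketched in Python. The step functions below are hypothetical placeholders that only manipulate strings; they do not perform any GIS processing and are not the RHESSys rule functions.

```python
from typing import Callable, Dict, List

# A step reads the shared state and returns the new values it produced.
Step = Callable[[Dict[str, str]], Dict[str, str]]


def run_workflow(steps: List[Step], inputs: Dict[str, str]) -> Dict[str, str]:
    """Apply each step in order, merging its outputs into the shared state."""
    state = dict(inputs)
    for step in steps:
        state.update(step(state))
    return state


def get_extent(state):
    # Placeholder for looking up the extent of a stream gage reach.
    return {"extent_vect": f"extent-for-{state['gage_reachcode']}"}


def convert_extent(state):
    # Placeholder for converting vector coordinates to DEM coordinates.
    return {"extent_dem": state["extent_vect"].replace("extent", "dem-extent")}


def extract_tile(state):
    # Placeholder for cutting a tile out of the DEM.
    return {"tile": f"tile[{state['extent_dem']}]"}


result = run_workflow(
    [get_extent, convert_extent, extract_tile],
    {"gage_reachcode": "R42"},
)
print(result["tile"])
```

Because every step only depends on named values in the shared state, steps can be reordered, replaced, or reused in other workflows.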
extractTileFromNHD_DEM(*extentCoords) {
    # Split path to object into collection and name
    msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
    writeLine("serverLog", *nhdDEMObjColl);
    writeLine("serverLog", *nhdDEMObjName);
    # Build query to discover physical path
    msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
    msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
    msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
    msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
    # Run query
    msiExecGenQuery(*genQInp, *genQOut);
    # Extract path from query result
    foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
    writeLine("serverLog", *filePath);
    # Determine physical path of input directory
    msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
    # Generate physical path of output file
    msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
    *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
    *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
    # Generate iRODS path of output
    msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
    *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
    *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
    writeLine("serverLog", *args);
    msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
    writeLine("serverLog", *cmd_out);
    # Register tile file with iRODS
    msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
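The path arithmetic in this rule is easier to follow with concrete values. The Python sketch below reproduces only the path splitting and the construction of the gdal_translate argument list; it does not query a catalog, run GDAL, or register anything. All paths and coordinates are made up for illustration, and `split_path` is a hypothetical helper that splits at the last "/", which is how the rule appears to use msiSplitPath.

```python
import posixpath


def split_path(path):
    """Split a path into (parent, name) at the last '/'."""
    return posixpath.split(path)


def plan_tile(extent_coords, file_path, dem_obj_coll):
    """Derive the output locations and command arguments for one tile."""
    # Physical directory that holds the raster dataset.
    in_file_dir, _header_file = split_path(file_path)
    # Parent directory and dataset name give the physical output path.
    in_parent_dir, raster_name = split_path(in_file_dir)
    tile_file_name = "SUBSET-" + raster_name + ".img"
    tile_file_path = in_parent_dir + "/" + tile_file_name
    # The logical (data grid) path is built the same way from the collection.
    coll_parent, _ = split_path(dem_obj_coll)
    tile_obj_path = coll_parent + "/" + tile_file_name
    # Arguments for gdal_translate: output format, window, source, target.
    args = ["-of", "HFA", "-projwin", *extent_coords.split(),
            in_file_dir, tile_file_path]
    return tile_file_path, tile_obj_path, args


tile_file_path, tile_obj_path, args = plan_tile(
    "100 200 300 50",                     # hypothetical window coordinates
    "/data/vault/dem/ned30m/hdr.adf",     # hypothetical physical file path
    "/zone/home/demo/dem/ned30m",         # hypothetical logical collection
)
print(tile_file_path)
print(tile_obj_path)
print(args)
```

Keeping the physical path and the logical path in step is what allows the generated tile to be registered in the data grid under a predictable name.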
Summary
iRODS is a powerful Policy-Based engine for managing NextGen Big Data Cyber-Infrastructures
Enables a flexible, adaptive, and customizable data management architecture
“Canned” scripts (policies) can be created to standardize and automate users' processes
Simple menu-driven interface
No CS degree needed
iRODS is the middleware for Distributed Data Management
Thank you
Questions?
SILS LifeTime Library Policies
Library management
    Replication
    Checksums
    Versioning
    Strict access controls
    Quotas
    Metadata catalog replication
    Installation environment archiving
Ingestion
    Automated synchronization of student directory with LifeTime Library
    Automated loading of MP3 metadata
Policy-Driven Repository Infrastructure project, funded by the Institute of Museum and Library Services
Carolina Digital Repository
Carolina Digital Repository Ingest Workflow
iRODS Data Grid
More than 50 different clients have been used to interact with the data grid
    Web browsers (iDrop-web, Rich Web client)
    Web services (VOSpace)
    Load libraries (Python, Java)
    I/O libraries (C, C++, Fortran)
    File systems (FUSE, WebDAV, Parrot)
    Synchronization interfaces (iDrop)
    Unix tools, Grid tools (icommands, SAGA, SRM, GriPhyN)
    Workflows (Kepler, Taverna)
    Digital Libraries (Fedora, DSpace)
    Portals (EnginFrame)
Managing Information & Knowledge
Concepts
    Data: objects
    Information: names
    Knowledge: relationships between names
    Wisdom: relationships between relationships
Implementation
    Data: bytes (storage system)
    Information: metadata (relational database)
    Knowledge: policies, procedures (rule base, rule engine)
    Wisdom: policy enforcement point (data grid)
Data Virtualization
Layers: Access Interface, Policy Enforcement Points, Standard Micro-services, Standard I/O Operations, Storage Protocol, Storage System
• Map from the actions requested by the client to multiple policy enforcement points
• Map from policy to standard micro-services
• Map from micro-services to standard POSIX I/O operations
• Map standard I/O operations to the protocol supported by the storage system
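The four mappings listed here can be pictured as a chain of lookup tables. The Python sketch below is illustrative only: the policy point names (`pre_put`, `post_put`), the table contents, and the `resolve` function are assumptions made for this example and do not describe the actual iRODS dispatch tables. The micro-service names are taken from the lists elsewhere in this deck.

```python
# Client action -> policy enforcement points (hypothetical names).
POLICY_POINTS = {"put": ["pre_put", "post_put"]}

# Policy enforcement point -> micro-services invoked by its policy.
MICRO_SERVICES = {
    "pre_put": [],
    "post_put": ["msiDataObjChksum", "msiDataObjRepl"],
}

# Micro-service -> standard POSIX-style I/O operations (assumed for the example).
IO_OPERATIONS = {
    "msiDataObjChksum": ["open", "read", "close"],
    "msiDataObjRepl": ["open", "read", "write", "close"],
}

# I/O operation -> call in the protocol of a given storage system.
STORAGE_PROTOCOL = {
    "unix": {"open": "open(2)", "read": "read(2)",
             "write": "write(2)", "close": "close(2)"},
}


def resolve(action, storage):
    """Expand one client action into the storage-level calls it triggers."""
    calls = []
    for point in POLICY_POINTS[action]:
        for service in MICRO_SERVICES[point]:
            for operation in IO_OPERATIONS[service]:
                calls.append((point, service, STORAGE_PROTOCOL[storage][operation]))
    return calls


for call in resolve("put", "unix"):
    print(call)
```

Supporting a new storage system in this picture means adding one entry to the last table; the policies and micro-services above it stay unchanged, which is the point of the virtualization layers.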
System and User-driven Rules
The data grid automatically applies rules defined in the rule base core.re
You can define rules that are applied interactively or that are deferred for later execution
    irule -F "rule-file.r"
Example Rule
Write “Hello World”
Create a rule file called ruleHello.r:
    myTestRule {
        writeLine("stdout", "Hello World");
    }
    INPUT null
    OUTPUT ruleExecOut
Run the rule:
    irule -F ruleHello.r
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and document their lack
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
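A few of the requirements listed here (verify each file, recreate missing or damaged replicas, work in batches of 256, log repairs, and keep a restart point) can be sketched in Python. This is a simplified illustration under stated assumptions, not the production rule: files and replicas live in an in-memory dictionary, SHA-256 is an arbitrary choice of checksum, and the deadline scheduler, load leveling, and statistics are left out.

```python
import hashlib
from itertools import islice

BATCH_SIZE = 256  # batch size named in the requirements


def checksum(data: bytes) -> str:
    # SHA-256 is an assumption; the slides do not name the algorithm.
    return hashlib.sha256(data).hexdigest()


def batches(names, size=BATCH_SIZE):
    """Yield successive lists of at most `size` names."""
    iterator = iter(names)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def verify_collection(files, resources, start=0, log=None):
    """Check every file from position `start`, repairing replicas as needed.

    `files` maps a logical name to {"checksum": str, "replicas": {resource: bytes}}.
    Returns the number of files checked so far (the restart point) and the log.
    """
    log = [] if log is None else log
    names = sorted(files)
    checked = start
    for batch in batches(names[start:]):
        for name in batch:
            entry = files[name]
            good = [resource for resource, data in entry["replicas"].items()
                    if checksum(data) == entry["checksum"]]
            if not good:
                log.append(f"{name}: no valid replica, cannot repair")
                continue
            source = entry["replicas"][good[0]]
            for resource in resources:
                if resource not in good:
                    # Missing or corrupted replica: recreate it from a good copy.
                    entry["replicas"][resource] = source
                    log.append(f"{name}: replica recreated on {resource}")
        checked += len(batch)  # a restart can resume from this position
    return checked, log


payload = b"example data"
files = {
    "a.dat": {"checksum": checksum(payload),
              "replicas": {"disk1": payload, "disk2": b"corrupted"}},
    "b.dat": {"checksum": checksum(payload),
              "replicas": {"disk1": payload}},
}
checked, log = verify_collection(files, ["disk1", "disk2"])
print(checked, log)
```

Note how the sketch keeps the logical name (`a.dat`) separate from the physical replica locations (`disk1`, `disk2`), mirroring the distinction the requirements call for.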
Workflow Management & Registration
Diagram annotations:
    Workflow file
    Directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file)
    Input parameter file: lists parameters and input and output file names
    Directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented
    Automatically generated run file for executing each input file
    Output file created for eCWkflow.mpf
Files and collections shown:
    eCWkflow.mss
    /earthCube/eCWkflow
    eCWkflow.mpf
    /earthCube/eCWkflow/eCWkflow.runDir0
    eCWkflow.run
    Outfile
    eCWkflow2.run
    eCWkflow2.mpf
    /earthCube/eCWkflow/eCWkflow2.runDir0
    Newfile
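The slide states that the version number of the run directory is incremented for each invocation. A minimal Python sketch of that bookkeeping follows, assuming run directories are named `<workflow>.runDir<N>` with N starting at 0; the naming pattern and the function `next_run_dir` are assumptions for illustration, not the data grid's implementation.

```python
import re


def next_run_dir(existing, workflow):
    """Return the name of the next run directory for a workflow.

    `existing` is the list of directory names already present in the
    mounted collection.
    """
    pattern = re.compile(re.escape(workflow) + r"\.runDir(\d+)$")
    versions = []
    for name in existing:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)))
    number = max(versions) + 1 if versions else 0
    return f"{workflow}.runDir{number}"


first = next_run_dir([], "eCWkflow")
later = next_run_dir(
    ["eCWkflow.runDir0", "eCWkflow.runDir1", "eCWkflow2.runDir0"],
    "eCWkflow",
)
print(first, later)
```

Keeping one directory per invocation means earlier outputs are never overwritten, which supports re-execution and comparison of workflow runs.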
Carolina Digital Repository
Ingest Workflow
iRODS Data Grid
More than 50 different clients have been used to
interact with the data grid
Web browsers (iDrop-web Rich Web client)
Web services (VOSpace)
Load libraries (Python Java)
IO libraries (C C++ Fortran)
File systems (FUSE WebDav Parrot)
Synchronization interfaces (iDrop)
Unix tools Grid tools (icommands SAGA SRM Griphyn)
Workflows (Kepler Taverna)
Digital Libraries (Fedora DSpace)
Portals (EnginFrame)
Managing Information amp Knowledge
Concepts
Data objects
Information names
Knowledge relationships between names
Wisdom relationships between relationships
Implementation
Data bytes Storage system
Information metadata Relational database
Knowledge policies procedures Rule base Rule engine
Wisdom policy enforcement point Data Grid
Data Virtualization
Storage System
Storage Protocol
Access Interface
Policy Enforcement Points
Standard Micro-services
Standard IO Operations
bull Map from the actions
requested by the client to
multiple policy
enforcement points
bull Map from policy to
standard micro-services
bull Map from micro-services
to standard Posix IO
operations
bull Map standard IO
operations to the
protocol supported by
the storage system
System and User-driven Rules
The data grid automatically applies rules
defined in the rule base corere
You can define rules that are applied
interactively or that are deferred for later
execution
irule ndashF ldquorule-filerrdquo
Example Rule
Write ldquoHello Worldrdquo
Create rule file call ruleHellor
myTestRule
writeLine(stdoutrdquo Hello World)
INPUT null
OUTPUT ruleExecOut
irule ndashF ruleHellor
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and document their lack
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
Workflow Management amp Registration
Workflow file
Directory holding all input and output files
associated with workflow file (mounted
collection that is linked to the workflow file)
Input parameter file lists parameters
and input and output file names
Directory holding all output
files generated for invocation
of eCWkflowrun the version
number is incremented
Automatically generated run file for
Executing each input file
Output file created for
eCWKflowmpf
eCWkflowmss
earthCubeeCWkflow
eCWkflowmpf
earthCubeeCWkfloweCWkflowrunDir0
eCWkflowrun
Outfile
eCWkflow2run
eCWkflow2mpf
earthCubeeCWkfloweCWkflow2runDir
0
Newfile
Publications
Rajasekar R M Wan R Moore W Schroeder S-Y Chen L
Gilbert C-Y Hou C Lee R Marciano P Tooby A de Torcy B
Zhu ldquoiRODS Primer Integrated Rule-Oriented Data Systemrdquo
Morgan amp Claypool 2010
Ward R M Wan W Schroeder A Rajasekar A de Torcy T
Russell H Xu R Moore ldquoThe integrated Rule-Oriented Data
System (iRODS 30) Micro-service Workbookrdquo DICE
Foundation November 2011 ISBN 9781466469129
Amazoncom
Reagan W Moore
rwmoorerenciorg
httpirodsdiceresearchorg
NSF OCI-0940841 ldquoDataNet Federation Consortiumrdquo
NSF OCI-1032732 ldquoImprovement of iRODS for Multi-Disciplinary Applicationsrdquo
NSF OCI-0848296 ldquoNARA Transcontinental Persistent Archives Prototyperdquo
NSF SDCI-0721400 ldquoData Grids for Community Driven Applicationsrdquo
iRODS - Open Source Software
iRODS Distributed Data Management
Val = 0rdquo
msiExecStrCondQuery(SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME =
Coll and META_COLL_ATTR_NAME = TEST_DATA_ID GenQOut2)
foreach (GenQOut2)
msiGetValByKey(GenQOut2 META_COLL_ATTR_NAME Val)
if(int(Val) == 0)
Str1 = TEST_DATA_ID=0rdquo
msiString2KeyValPair(Str1kvp)
msiAssociateKeyValuePairsToObj(kvpColl-C)
writeLine(Lfileadded TEST_DATA_ID attribute to collection Coll)
on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery(META_COLL_ATTR_VALUE COLL_NAME = Coll and
META_COLL_ATTR_NAME = TEST_DATA_ID GenQInp2)
msiExecGenQuery(GenQInp2GenQOut2)
foreach(GenQOut2)
msiGetValByKey(GenQOut2 META_COLL_ATTR_VALUE colldataID)
Initializing Workflow Parameters
Workflow Operations Used
Arithmetic (+ - )
Boolean tests (== = ampamp || gt lt gt=)
Conditional statements
if then else
Control
break fail
Loops
for foreach while
List manipulation
initialization list addition (cons) extracting an element from a
list (elem) updating an element in a list (setelem)
Variable manipulation
initialization type conversion (int double str)
Micro-services Used
Metadata catalog manipulation
msiGetValByKey get metadata from structure
msiExecStrCondQuery execute string conditional query
msiString2KeyValPair convert string to key-value pair
msiAssociateKeyValuePairsToObj add metadata
msiMakeGenQuery create a query
msiExecGenQuery execute a query
msiCloseGenQuery release query buffers
msiGetContInxFromGenQueryOut check for more rows
msiRemoveKeyValuePairsFromObj remove metadata
msiGetMoreRows get more rows from query
Micro-services Used
Data and directory manipulation
msiIsColl check whether name is a collection
msiCollCreate create a collection
msiDataObjCreate create a file
msiDataObjRepl replicate a file
msiDataObjChksum checksum a file
msiDataObjUnlink delete a file
System functions
msiGetSystemTime get the system time
writeLine write a line to a file or standard out
msiSleep sleep
Performance at rencirsquo
bull Execute call to rule engine 18 msecs
bull Execute metadata query 714 msecs
bull Disk seek latency 5 msecs
bull Disk rotational latency 11 msecs
bull Production loop logic 63 msecs
bull Checksum verification 21 msecs
Data Analysis Use Cases
• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be a demonstration of the invocation of multiple workflow systems within the same analysis. An example is the integration of the Cyber-integrator workflow with collaboration environments to support drought prediction.
Eco-Hydrology
RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state.
[Figure: RHESSys workflow diagram. Diagram labels: Choose gauge or outlet (HIS); Extract drainage area (NHDPlus); Digital Elevation Model (DEM); Slope; Aspect; Streams (NHD); Roads (DOT); Strata; Hillslope; Patch; Basin; Stream network; Nested watershed structure; Land Use; Leaf Area Index; Phenology; Soil Data; NLCD (EPA); Landsat TM; MODIS; USDA; Soil and vegetation parameter files; Worldfile; Flowtable; RHESSys.]
iRODS Rule for RHESSys
main {
  getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
  convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
  extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
  importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
  delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
}
Modular workflow composed by chaining basic transformations
• Define input variables
• Call functions to apply each transformation step
• Store results in shared collection
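The chaining pattern in the rule above, where each step reads values produced by earlier steps and adds its own, can be sketched generically in Python. Everything here is hypothetical: the step functions, the key names, and the placeholder strings stand in for the real transformations, and a plain dictionary stands in for the shared collection.

```python
from typing import Callable, Dict, List

State = Dict[str, str]
Step = Callable[[State], State]


def run_workflow(inputs: State, steps: List[Step]) -> State:
    """Apply each step in order; every step sees all values produced so far."""
    state = dict(inputs)
    for step in steps:
        state.update(step(state))
    return state


# Hypothetical steps: each returns only the new values it contributes.
def get_extent(state: State) -> State:
    return {"extent": "extent(" + state["gage"] + ")"}


def extract_tile(state: State) -> State:
    return {"tile": "tile(" + state["extent"] + ")"}


def delineate_watershed(state: State) -> State:
    return {"watershed": "watershed(" + state["tile"] + ")"}


result = run_workflow({"gage": "G1"}, [get_extent, extract_tile, delineate_watershed])
print(result["watershed"])
```

Keeping each step small and passing results through a shared state makes it possible to swap or reorder transformations without touching the driver.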
extractTileFromNHD_DEM(*extentCoords) {
  # Split path to object into collection and name
  msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
  writeLine("serverLog", *nhdDEMObjColl);
  writeLine("serverLog", *nhdDEMObjName);
  # Build query to discover physical path
  msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
  msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
  msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
  msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
  # Run query
  msiExecGenQuery(*genQInp, *genQOut);
  # Extract path from query result
  foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
  writeLine("serverLog", *filePath);
  # Determine physical path of input directory
  msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
  # Generate physical path of output file
  msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
  *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
  *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
  # Generate iRODS path of output
  msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
  *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
  *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
  writeLine("serverLog", *args);
  msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
  writeLine("serverLog", *cmd_out);
  # Register tile file with iRODS
  msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
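The path arithmetic in the rule above is easy to get wrong, so it may help to see it in isolation. The Python sketch below reproduces only the string handling: it derives the tile's physical path, its logical path, and the gdal_translate argument list. It does not query a catalog or run any command. The example paths and extent coordinates are invented for illustration, and `posixpath.split` is used as a stand-in for the path-splitting micro-service.

```python
import posixpath
from typing import List, Tuple


def build_tile_command(file_path: str, obj_coll: str,
                       extent_coords: str) -> Tuple[str, str, List[str]]:
    """Derive the output paths and the gdal_translate argument list.

    file_path: physical path of a file inside the raster dataset directory.
    obj_coll: logical collection that holds the raster dataset.
    extent_coords: the four -projwin values as one space-separated string.
    """
    # Directory that holds the raster dataset (the file name is ignored).
    in_file_dir, _header_file = posixpath.split(file_path)
    # Parent directory and dataset name, used to name the output tile.
    in_parent_dir, raster_name = posixpath.split(in_file_dir)
    tile_file_name = "SUBSET-" + raster_name + ".img"
    tile_file_path = in_parent_dir + "/" + tile_file_name
    # Logical path of the tile, placed next to the source collection.
    obj_coll_parent, _ = posixpath.split(obj_coll)
    tile_obj_path = obj_coll_parent + "/" + tile_file_name
    args = (["gdal_translate", "-of", "HFA", "-projwin"]
            + extent_coords.split()
            + [in_file_dir, tile_file_path])
    return tile_file_path, tile_obj_path, args


# Hypothetical inputs, chosen only to show the resulting names.
physical, logical, command = build_tile_command(
    "/vault/nhd/dem01/hdr.adf", "/zone/home/nhd/dem01", "-79.2 36.1 -79.0 35.9")
print(physical)
print(logical)
print(" ".join(command))
```

Building the command as a list rather than one string avoids quoting problems if a directory name ever contains spaces.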
Summary
• iRODS is a powerful Policy-Based engine for managing NextGen Big Data Cyber-Infrastructures
• Enables a Flexible, Adaptive, and Customizable Data Management Architecture
• “Canned” scripts (policies) can be created to standardize and automate users' processes
• Simple menu-driven interface
• No CS degree needed
• iRODS is the middleware for Distributed Data Management
Thank you
Questions
iRODS Data Grid
More than 50 different clients have been used to interact with the data grid:
• Web browsers (iDrop-web, Rich Web client)
• Web services (VOSpace)
• Load libraries (Python, Java)
• I/O libraries (C, C++, Fortran)
• File systems (FUSE, WebDAV, Parrot)
• Synchronization interfaces (iDrop)
• Unix tools, Grid tools (icommands, SAGA, SRM, GriPhyN)
• Workflows (Kepler, Taverna)
• Digital Libraries (Fedora, DSpace)
• Portals (EnginFrame)
Managing Information & Knowledge
Concepts
• Data: objects
• Information: names
• Knowledge: relationships between names
• Wisdom: relationships between relationships
Implementation
• Data (bytes): storage system
• Information (metadata): relational database
• Knowledge (policies, procedures): rule base, rule engine
• Wisdom (policy enforcement point): data grid
Data Virtualization
Layers, from client to storage: Access Interface; Policy Enforcement Points; Standard Micro-services; Standard I/O Operations; Storage Protocol; Storage System
• Map from the actions requested by the client to multiple policy enforcement points
• Map from policy to standard micro-services
• Map from micro-services to standard POSIX I/O operations
• Map standard I/O operations to the protocol supported by the storage system
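The four mappings above can be pictured as a chain of lookup tables. The Python sketch below is purely illustrative: the action, policy enforcement point, micro-service, and operation names are all invented, and a real data grid resolves these mappings with far more context. It only shows how one client action fans out into storage-protocol calls, and how swapping the last table changes the storage back end without touching the layers above.

```python
from typing import Dict, List

# All names below are hypothetical placeholders for the four mappings.
ACTION_TO_PEPS: Dict[str, List[str]] = {
    "put": ["pre_put", "post_put"],
}
PEP_TO_MICROSERVICES: Dict[str, List[str]] = {
    "pre_put": ["check_quota"],
    "post_put": ["checksum", "replicate"],
}
MICROSERVICE_TO_IO: Dict[str, List[str]] = {
    "check_quota": ["stat"],
    "checksum": ["open", "read", "close"],
    "replicate": ["open", "read", "write", "close"],
}


def make_protocol(name: str) -> Dict[str, str]:
    """Build a made-up driver table that maps each I/O operation to a protocol call."""
    return {op: name + ":" + op for op in ("stat", "open", "read", "write", "close")}


def resolve(action: str, protocol: Dict[str, str]) -> List[str]:
    """Expand one client action into the protocol calls it ultimately triggers."""
    calls = []
    for pep in ACTION_TO_PEPS[action]:
        for service in PEP_TO_MICROSERVICES[pep]:
            for op in MICROSERVICE_TO_IO[service]:
                calls.append(protocol[op])
    return calls


disk_calls = resolve("put", make_protocol("disk"))
tape_calls = resolve("put", make_protocol("tape"))
print(disk_calls)
```

Only `make_protocol` differs between the two runs, which is the point of the layering: policies and micro-services stay the same when the storage system changes.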
System and User-driven Rules
The data grid automatically applies rules defined in the rule base core.re.
You can define rules that are applied interactively or that are deferred for later execution:
irule -F "rule-file.r"
Example Rule
Write “Hello World”
Create a rule file called ruleHello.r:
myTestRule {
  writeLine("stdout", "Hello World");
}
INPUT null
OUTPUT ruleExecOut
Run it with:
irule -F ruleHello.r
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and document their lack
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
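Several items in this list (fixed-size batches, restart from the last checked files, integrity verification) combine into one loop. The Python sketch below models that loop in memory. It is a simplified illustration under stated assumptions: files are byte strings in a dictionary, SHA-256 is an arbitrary choice of checksum, the checkpoint is a plain dictionary rather than catalog metadata, and the deadline scheduler, replica repair, and logging are left out.

```python
import hashlib
from typing import Dict, List

BATCH_SIZE = 256  # batch size named in the list above


def checksum(data: bytes) -> str:
    """Checksum used by this sketch (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify_collection(files: Dict[str, bytes], expected: Dict[str, str],
                      checkpoint: Dict[str, int], max_batches: int) -> List[str]:
    """Check up to max_batches batches, resuming from checkpoint["next"].

    Returns the names whose checksum does not match. The checkpoint is
    advanced after every batch, so an interrupted run restarts at the
    first unchecked batch instead of at the beginning.
    """
    names = sorted(files)
    bad = []
    for _ in range(max_batches):
        start = checkpoint.get("next", 0)
        batch = names[start:start + BATCH_SIZE]
        if not batch:
            break
        for name in batch:
            if checksum(files[name]) != expected[name]:
                bad.append(name)
        checkpoint["next"] = start + len(batch)
    return bad


# Made-up collection of 600 files with one deliberately wrong checksum.
files = {f"file{i:04d}": str(i).encode() for i in range(600)}
expected = {name: checksum(data) for name, data in files.items()}
expected["file0300"] = "corrupted"

checkpoint: Dict[str, int] = {}
first_run = verify_collection(files, expected, checkpoint, max_batches=1)
after_first = checkpoint["next"]
second_run = verify_collection(files, expected, checkpoint, max_batches=10)
print(first_run, after_first, second_run, checkpoint["next"])
```

The first call stops after one batch, as if the system had halted; the second call picks up at file 256 and finds the damaged entry.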
Workflow Management & Registration
[Figure: layout of a registered workflow in the collection earthCube/eCWkflow]
• eCWkflow.mss: workflow file
• earthCube/eCWkflow: directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file)
• eCWkflow.mpf: input parameter file; lists parameters and input and output file names
• eCWkflow.run: automatically generated run file for executing each input file
• earthCube/eCWkflow/eCWkflow.runDir0: directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented
• Outfile: output file created for eCWkflow.mpf
• Second input file: eCWkflow2.mpf, eCWkflow2.run, earthCube/eCWkflow/eCWkflow2.runDir0, Newfile
Publications
Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, “iRODS Primer: Integrated Rule-Oriented Data System”, Morgan & Claypool, 2010.
Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, “The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook”, DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com.
Reagan W. Moore
rwmoore@renci.org
http://irods.diceresearch.org
NSF OCI-0940841 “DataNet Federation Consortium”
NSF OCI-1032732 “Improvement of iRODS for Multi-Disciplinary Applications”
NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype”
NSF SDCI-0721400 “Data Grids for Community Driven Applications”
iRODS - Open Source Software
iRODS Distributed Data Management
*Val = "0";
msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
foreach (*GenQOut2) {
  msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
}
if (int(*Val) == 0) {
  *Str1 = "TEST_DATA_ID=0";
  msiString2KeyValPair(*Str1, *kvp);
  msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
  writeLine("*Lfile", "added TEST_DATA_ID attribute to collection *Coll");
}
# on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
msiExecGenQuery(*GenQInp2, *GenQOut2);
foreach(*GenQOut2) {
  msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
}
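The initialization above follows an "add the attribute only if it is missing, then read it back" pattern, which is what lets the rule tell a first execution from a restart. A minimal Python model of that pattern is sketched below. The nested dictionary standing in for the metadata catalog, the list standing in for the log file, and the collection name are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical in-memory catalog: collection name -> attribute name -> value.
Catalog = Dict[str, Dict[str, str]]


def init_progress(catalog: Catalog, coll: str, log: List[str]) -> int:
    """Create TEST_DATA_ID with value 0 if absent, then return its current value."""
    attrs = catalog.setdefault(coll, {})
    if "TEST_DATA_ID" not in attrs:
        attrs["TEST_DATA_ID"] = "0"
        log.append("added TEST_DATA_ID attribute to collection " + coll)
    return int(attrs["TEST_DATA_ID"])


catalog: Catalog = {}
log: List[str] = []
coll = "/zone/home/project"  # made-up collection name

first = init_progress(catalog, coll, log)   # first execution: attribute is created
catalog[coll]["TEST_DATA_ID"] = "512"       # progress recorded by a later step
restart = init_progress(catalog, coll, log) # restart: existing value is kept
print(first, restart, len(log))
```

Because the attribute is written only when it is missing, running the initialization again after a halt never resets recorded progress.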
Initializing Workflow Parameters
Workflow Operations Used
Arithmetic (+ - )
Boolean tests (== = ampamp || gt lt gt=)
Conditional statements
if then else
Control
break fail
Loops
for foreach while
List manipulation
initialization list addition (cons) extracting an element from a
list (elem) updating an element in a list (setelem)
Variable manipulation
initialization type conversion (int double str)
Micro-services Used
Metadata catalog manipulation
msiGetValByKey get metadata from structure
msiExecStrCondQuery execute string conditional query
msiString2KeyValPair convert string to key-value pair
msiAssociateKeyValuePairsToObj add metadata
msiMakeGenQuery create a query
msiExecGenQuery execute a query
msiCloseGenQuery release query buffers
msiGetContInxFromGenQueryOut check for more rows
msiRemoveKeyValuePairsFromObj remove metadata
msiGetMoreRows get more rows from query
Micro-services Used
Data and directory manipulation
msiIsColl check whether name is a collection
msiCollCreate create a collection
msiDataObjCreate create a file
msiDataObjRepl replicate a file
msiDataObjChksum checksum a file
msiDataObjUnlink delete a file
System functions
msiGetSystemTime get the system time
writeLine write a line to a file or standard out
msiSleep sleep
Performance at rencirsquo
bull Execute call to rule engine 18 msecs
bull Execute metadata query 714 msecs
bull Disk seek latency 5 msecs
bull Disk rotational latency 11 msecs
bull Production loop logic 63 msecs
bull Checksum verification 21 msecs
Data Analysis Use Cases
bull Demonstrate reproducible science A use case could include the
registration storage sharing and re-execution of a workflow The hypoxia
use case from the Cross-Domain and Brokering Concept groups could be
used as an example
bull Automate data retrieval A use case could demonstrate remote access to a
data collection retrieval of desired data sets transformation and use in
an analysis workflow An eco-hydrology example that automates access
to digital elevation maps and land use coverage is being built
bull Integrate community resources with collaboration environments An
example would be use of the DAB protocol to identify and cache local
copies of relevant data sets for local analysis
bull Integrate multiple community resources A use case could be
demonstration of invocation of multiple workflow systems within the
same analysis An example is the integration of Cyber-integrator
workflow with collaboration environments to support drought prediction
RHESSys workflow to develop a nested
watershed parameter file (worldfile)
containing a nested ecogeomorphic object
framework and full initial system state
Choose gauge
or outlet (HIS)
Extract
drainage area
(NHDPlus)
Digital
Elevation
Model (DEM)
Worldfile Flowtable
RHESSys
Slope
Aspect
Streams (NHD)
Roads (DOT) Strata
Hillslope
Patch
Basin
Stream network
Nested watershed
structure
Land Use
Leaf Area
Index
Phenology
Soil Data
NLCD (EPA)
Landsat TM
MODIS
USDA
Soil and vegetation
parameter files
Eco-Hydrology
iRODS Rule for RHESSys
main
getExtentForGageReachcode(gageReachcode extentInNHD_Vect_Coords)
convertExtentToNHD_DEM(extentInNHD_Vect_Coords extentInNHD_DEM_Coords)
extractTileFromNHD_DEM(trimr(extentInNHD_DEM_Coords n))
importDEMTileIntoNewGRASSLocationAsUTM(extentInNHD_Vect_Coords newLocPhysPath
newLocObjPath)
delineateWatershedForNHDGage(nhdStreamGageID newLocPhysPath newLocObjPath)
Modular workflow composed by chaining basic transformation
Define input variables
Call functions to apply each transformation step
Store results in shared collection
extractTileFromNHD_DEM(*extentCoords) {
  # Split path to object into collection and name
  msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
  writeLine("serverLog", *nhdDEMObjColl);
  writeLine("serverLog", *nhdDEMObjName);
  # Build query to discover physical path
  msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
  msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
  msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
  msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
  # Run query
  msiExecGenQuery(*genQInp, *genQOut);
  # Extract path from query result
  foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
  writeLine("serverLog", *filePath);
  # Determine physical path of input directory
  msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
  # Generate physical path of output file
  msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
  *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
  *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
  # Generate iRODS path of output
  msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
  *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
  *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
  writeLine("serverLog", *args);
  msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
  writeLine("serverLog", *cmd_out);
  # Register tile file with iRODS
  msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
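The path arithmetic in extractTileFromNHD_DEM can be tried outside iRODS. The following is a minimal Python sketch, not the rule itself: split_path is a hypothetical stand-in for the way msiSplitPath is used above, plan_tile is an illustrative name, and the ".img" suffix and "/" separator are assumptions about characters that appear to have been lost from the slide text.

```python
import posixpath


def split_path(path):
    """Split a path into (parent, last component), as msiSplitPath is used in the rule."""
    parent, name = posixpath.split(path.rstrip("/"))
    return parent, name


def plan_tile(file_path, dem_obj_coll, extent_coords):
    """Derive the output tile paths and the gdal_translate argument string."""
    # Physical directory that holds the raster (the header file name is ignored).
    in_file_dir, _header = split_path(file_path)
    # Parent directory and dataset name, used to name the output tile.
    in_file_parent_dir, raster_dataset_name = split_path(in_file_dir)
    tile_file_name = "SUBSET-" + raster_dataset_name + ".img"
    tile_file_path = in_file_parent_dir + "/" + tile_file_name
    # Logical (iRODS) path of the tile, placed beside the source collection.
    coll_parent, _junk = split_path(dem_obj_coll)
    tile_obj_path = coll_parent + "/" + tile_file_name
    # Same argument layout as the rule: output format, window, input, output.
    args = "-of HFA -projwin " + extent_coords + " " + in_file_dir + " " + tile_file_path
    return tile_file_path, tile_obj_path, args
```

Calling plan_tile with a made-up physical path, collection, and extent returns the physical tile path, the logical tile path, and the argument string that the rule would hand to gdal_translate.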
Summary
iRODS is a powerful Policy-Based engine for Managing
NextGen Big Data Cyber-Infrastructures
Enables a Flexible, Adaptive, and Customizable
Data Management Architecture
"Canned" scripts (policies) can be created to
standardize and automate users' processes
Simple menu-driven interface
No CS Degree needed
iRODS is the middleware for
Distributed Data Management
Thank you
Questions?
System and User-driven Rules
The data grid automatically applies rules
defined in the rule base core.re
You can define rules that are applied
interactively or that are deferred for later
execution
irule -F "rule-file.r"
Example Rule
Write "Hello World"
Create a rule file called ruleHello.r:
myTestRule {
  writeLine("stdout", "Hello World");
}
INPUT null
OUTPUT ruleExecOut
Run the rule:
irule -F ruleHello.r
Production Integrity Rule
Verify all input parameters for consistency
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations
Identify all missing replicas and record which ones are absent
Create new replicas to replace missing replicas
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection
Track progress of the policy execution
Initialize the rule for the first execution
Enable restart of the process from the last set of checked files in case of a system halt
Process files in batches of 256 to handle arbitrarily large collections
Minimize the number of sleep periods used by the deadline scheduler
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate and the number of files checked
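Several of the requirements above (batches of 256, integrity checks, replica repair, restart after a halt) can be illustrated together. This is a minimal Python sketch under stated assumptions, not the production rule: the in-memory dictionaries stand in for the metadata catalog and the storage systems, SHA-256 is chosen only for illustration, and verify_collection and its parameters are hypothetical names.

```python
import hashlib

BATCH_SIZE = 256  # batch size named in the requirements above


def checksum(data):
    """Return a hex digest for a replica's bytes (SHA-256 chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()


def verify_collection(files, expected, start=0, batch_size=BATCH_SIZE):
    """Check files in batches; yield (next_start, repairs) after each batch.

    files:    list of (logical_name, {location: bytes or None}) pairs
    expected: {logical_name: expected hex digest}
    start:    index to resume from after a halt
    """
    for begin in range(start, len(files), batch_size):
        repairs = []
        for name, replicas in files[begin:begin + batch_size]:
            # A replica is good if it exists and matches the expected checksum.
            good = [loc for loc, data in replicas.items()
                    if data is not None and checksum(data) == expected[name]]
            bad = [loc for loc in replicas if loc not in good]
            if good:
                for loc in bad:
                    # Replace a missing or corrupted replica from a good copy.
                    replicas[loc] = replicas[good[0]]
                    repairs.append((name, loc, "replaced from " + good[0]))
            else:
                repairs.append((name, None, "no valid replica"))
        # The caller persists next_start so a restart resumes from here.
        yield min(begin + batch_size, len(files)), repairs
```

The logical name keys the expected checksum while each replica is addressed by its location, which mirrors the distinction between logical names and physical replica locations. The returned repairs list is what a log file would record.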
Workflow Management & Registration
Workflow file: eCWkflow.mss
Directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file): earthCube/eCWkflow
Input parameter file (lists parameters and input and output file names): eCWkflow.mpf
Directory holding all output files generated for an invocation of eCWkflow.run (the version number is incremented): earthCube/eCWkflow/eCWkflow.runDir0
Automatically generated run file for executing each input file: eCWkflow.run
Output file created for eCWkflow.mpf: Outfile
Other entries shown: eCWkflow2.run, eCWkflow2.mpf, earthCube/eCWkflow/eCWkflow2.runDir0, Newfile
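The slide states that the run-directory version number is incremented on each invocation. A minimal Python sketch of that naming scheme follows; the function name next_run_dir and the exact "<workflow>.runDir<N>" pattern are assumptions inferred from the directory names on the slide, not a documented iRODS interface.

```python
import re


def next_run_dir(workflow_name, existing):
    """Return the next run-directory name, e.g. eCWkflow.runDir0, eCWkflow.runDir1, ..."""
    # Match only directories that belong to this workflow.
    pattern = re.compile(re.escape(workflow_name) + r"\.runDir(\d+)$")
    versions = [int(m.group(1)) for m in map(pattern.match, existing) if m]
    # Start at 0 when no run directory exists yet, otherwise increment the highest.
    return "%s.runDir%d" % (workflow_name, max(versions) + 1 if versions else 0)
```

Keeping one directory per invocation means earlier outputs are never overwritten, which supports re-execution and comparison of workflow runs.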
Publications
Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, "iRODS Primer: Integrated Rule-Oriented Data System", Morgan & Claypool, 2010.
Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, "The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook", DICE Foundation, November 2011, ISBN 9781466469129, Amazon.com
Reagan W. Moore
rwmoore@renci.org
http://irods.diceresearch.org
NSF OCI-0940841 "DataNet Federation Consortium"
NSF OCI-1032732 "Improvement of iRODS for Multi-Disciplinary Applications"
NSF OCI-0848296 "NARA Transcontinental Persistent Archives Prototype"
NSF SDCI-0721400 "Data Grids for Community Driven Applications"
iRODS - Open Source Software
iRODS Distributed Data Management
Initializing Workflow Parameters
*Val = "0";
msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
foreach (*GenQOut2) {
  msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
}
if (int(*Val) == 0) {
  *Str1 = "TEST_DATA_ID=0";
  msiString2KeyValPair(*Str1, *kvp);
  msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
  writeLine(*Lfile, "added TEST_DATA_ID attribute to collection *Coll");
}
# on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
msiExecGenQuery(*GenQInp2, *GenQOut2);
foreach (*GenQOut2) {
  msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
}
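The initialization logic above follows a create-if-absent pattern: add the TEST_DATA_ID attribute with value 0 on the first run, then read back whatever value is stored so that a restart resumes from it. A minimal Python sketch of that pattern follows; the nested dictionary is a hypothetical stand-in for the metadata catalog, and init_checkpoint is an illustrative name rather than an iRODS call.

```python
def init_checkpoint(metadata, coll, attr="TEST_DATA_ID"):
    """Create the checkpoint attribute on first run; return its current value.

    metadata: dict mapping collection name -> {attribute: value}, a stand-in
    for the metadata catalog.
    """
    attrs = metadata.setdefault(coll, {})
    if attr not in attrs:
        # First execution: nothing has been checked yet.
        attrs[attr] = "0"
    # On a restart this is the stored value, which is greater than 0
    # once the policy has made progress.
    return int(attrs[attr])
```

Storing the checkpoint as metadata on the collection itself keeps the restart state with the data, so any server that runs the policy later can pick up where the last run stopped.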
Workflow Operations Used
Arithmetic (+, -)
Boolean tests (==, !=, &&, ||, >, <, >=)
Conditional statements
  if, then, else
Control
  break, fail
Loops
  for, foreach, while
List manipulation
  initialization, list addition (cons), extracting an element from a list (elem), updating an element in a list (setelem)
Variable manipulation
  initialization, type conversion (int, double, str)
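For readers unfamiliar with the list operations named above, here is a minimal Python sketch of what cons, elem, and setelem do. These are Python analogues written for illustration only; the assumption that setelem returns an updated copy rather than changing the list in place should be checked against the rule language documentation.

```python
def cons(item, items):
    """Return a new list with item added at the front."""
    return [item] + list(items)


def elem(items, index):
    """Return the element at position index."""
    return items[index]


def setelem(items, index, value):
    """Return a copy of items with position index replaced by value."""
    copy = list(items)
    copy[index] = value
    return copy
```

For example, cons(1, [2, 3]) gives [1, 2, 3], and setelem applied to that result leaves the original list unchanged.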
Micro-services Used
Metadata catalog manipulation
  msiGetValByKey: get metadata from structure
  msiExecStrCondQuery: execute string conditional query
  msiString2KeyValPair: convert string to key-value pair
  msiAssociateKeyValuePairsToObj: add metadata
  msiMakeGenQuery: create a query
  msiExecGenQuery: execute a query
  msiCloseGenQuery: release query buffers
  msiGetContInxFromGenQueryOut: check for more rows
  msiRemoveKeyValuePairsFromObj: remove metadata
  msiGetMoreRows: get more rows from query
Micro-services Used
Data and directory manipulation
  msiIsColl: check whether name is a collection
  msiCollCreate: create a collection
  msiDataObjCreate: create a file
  msiDataObjRepl: replicate a file
  msiDataObjChksum: checksum a file
  msiDataObjUnlink: delete a file
System functions
  msiGetSystemTime: get the system time
  writeLine: write a line to a file or standard out
  msiSleep: sleep
Performance at RENCI
• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs
Data Analysis Use Cases
• Demonstrate reproducible science. A use case could include the
registration, storage, sharing, and re-execution of a workflow. The hypoxia
use case from the Cross-Domain and Brokering Concept groups could be
used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a
data collection, retrieval of desired data sets, transformation, and use in
an analysis workflow. An eco-hydrology example that automates access
to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An
example would be use of the DAB protocol to identify and cache local
copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be
demonstration of invocation of multiple workflow systems within the
same analysis. An example is the integration of Cyber-integrator
workflow with collaboration environments to support drought prediction.
Workflow Management amp Registration
Workflow file
Directory holding all input and output files
associated with workflow file (mounted
collection that is linked to the workflow file)
Input parameter file lists parameters
and input and output file names
Directory holding all output
files generated for invocation
of eCWkflowrun the version
number is incremented
Automatically generated run file for
Executing each input file
Output file created for
eCWKflowmpf
eCWkflowmss
earthCubeeCWkflow
eCWkflowmpf
earthCubeeCWkfloweCWkflowrunDir0
eCWkflowrun
Outfile
eCWkflow2run
eCWkflow2mpf
earthCubeeCWkfloweCWkflow2runDir
0
Newfile
Publications
Rajasekar R M Wan R Moore W Schroeder S-Y Chen L
Gilbert C-Y Hou C Lee R Marciano P Tooby A de Torcy B
Zhu ldquoiRODS Primer Integrated Rule-Oriented Data Systemrdquo
Morgan amp Claypool 2010
Ward R M Wan W Schroeder A Rajasekar A de Torcy T
Russell H Xu R Moore ldquoThe integrated Rule-Oriented Data
System (iRODS 30) Micro-service Workbookrdquo DICE
Foundation November 2011 ISBN 9781466469129
Amazoncom
Reagan W Moore
rwmoorerenciorg
httpirodsdiceresearchorg
NSF OCI-0940841 ldquoDataNet Federation Consortiumrdquo
NSF OCI-1032732 ldquoImprovement of iRODS for Multi-Disciplinary Applicationsrdquo
NSF OCI-0848296 ldquoNARA Transcontinental Persistent Archives Prototyperdquo
NSF SDCI-0721400 ldquoData Grids for Community Driven Applicationsrdquo
iRODS - Open Source Software
iRODS Distributed Data Management
Val = 0rdquo
msiExecStrCondQuery(SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME =
Coll and META_COLL_ATTR_NAME = TEST_DATA_ID GenQOut2)
foreach (GenQOut2)
msiGetValByKey(GenQOut2 META_COLL_ATTR_NAME Val)
if(int(Val) == 0)
Str1 = TEST_DATA_ID=0rdquo
msiString2KeyValPair(Str1kvp)
msiAssociateKeyValuePairsToObj(kvpColl-C)
writeLine(Lfileadded TEST_DATA_ID attribute to collection Coll)
on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery(META_COLL_ATTR_VALUE COLL_NAME = Coll and
META_COLL_ATTR_NAME = TEST_DATA_ID GenQInp2)
msiExecGenQuery(GenQInp2GenQOut2)
foreach(GenQOut2)
msiGetValByKey(GenQOut2 META_COLL_ATTR_VALUE colldataID)
Initializing Workflow Parameters
Workflow Operations Used
Arithmetic (+ - )
Boolean tests (== = ampamp || gt lt gt=)
Conditional statements
if then else
Control
break fail
Loops
for foreach while
List manipulation
initialization list addition (cons) extracting an element from a
list (elem) updating an element in a list (setelem)
Variable manipulation
initialization type conversion (int double str)
Micro-services Used
Metadata catalog manipulation
msiGetValByKey get metadata from structure
msiExecStrCondQuery execute string conditional query
msiString2KeyValPair convert string to key-value pair
msiAssociateKeyValuePairsToObj add metadata
msiMakeGenQuery create a query
msiExecGenQuery execute a query
msiCloseGenQuery release query buffers
msiGetContInxFromGenQueryOut check for more rows
msiRemoveKeyValuePairsFromObj remove metadata
msiGetMoreRows get more rows from query
Micro-services Used
Data and directory manipulation
msiIsColl check whether name is a collection
msiCollCreate create a collection
msiDataObjCreate create a file
msiDataObjRepl replicate a file
msiDataObjChksum checksum a file
msiDataObjUnlink delete a file
System functions
msiGetSystemTime get the system time
writeLine write a line to a file or standard out
msiSleep sleep
Performance at rencirsquo
bull Execute call to rule engine 18 msecs
bull Execute metadata query 714 msecs
bull Disk seek latency 5 msecs
bull Disk rotational latency 11 msecs
bull Production loop logic 63 msecs
bull Checksum verification 21 msecs
Data Analysis Use Cases
bull Demonstrate reproducible science A use case could include the
registration storage sharing and re-execution of a workflow The hypoxia
use case from the Cross-Domain and Brokering Concept groups could be
used as an example
bull Automate data retrieval A use case could demonstrate remote access to a
data collection retrieval of desired data sets transformation and use in
an analysis workflow An eco-hydrology example that automates access
to digital elevation maps and land use coverage is being built
bull Integrate community resources with collaboration environments An
example would be use of the DAB protocol to identify and cache local
copies of relevant data sets for local analysis
bull Integrate multiple community resources A use case could be
demonstration of invocation of multiple workflow systems within the
same analysis An example is the integration of Cyber-integrator
workflow with collaboration environments to support drought prediction
[Figure: RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full initial system state. Diagram labels: Choose gauge or outlet (HIS); Extract drainage area (NHDPlus); Digital Elevation Model (DEM); Slope; Aspect; Streams (NHD); Roads (DOT); Land Use; Leaf Area Index; Phenology; Soil Data; NLCD (EPA); Landsat TM; MODIS; USDA; Soil and vegetation parameter files; Strata; Patch; Hillslope; Basin; Stream network; Nested watershed structure; Worldfile; Flowtable; RHESSys.]
Eco-Hydrology
iRODS Rule for RHESSys
main {
    getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
    convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
    extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "\n"));
    importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
    delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
}
Modular workflow composed by chaining basic transformations
Define input variables
Call functions to apply each transformation step
Store results in shared collection
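The chaining idea can be sketched in Python, assuming each step reads a shared dictionary of variables and returns only the new variables it produces. The step functions, variable names, and toy outputs below are hypothetical stand-ins, not the RHESSys rules themselves.

```python
from typing import Callable, Dict, Iterable

State = Dict[str, str]
Step = Callable[[State], State]


def run_workflow(steps: Iterable[Step], inputs: State) -> State:
    """Apply each step in order, merging its outputs into the shared state."""
    state = dict(inputs)
    for step in steps:
        state = {**state, **step(state)}
    return state


# Hypothetical transformation steps; each derives one value from earlier ones.
def get_extent(state: State) -> State:
    return {"extent_vect": f"extent-of-{state['gage_reachcode']}"}


def convert_extent(state: State) -> State:
    return {"extent_dem": state["extent_vect"].upper()}


def extract_tile(state: State) -> State:
    return {"tile": f"SUBSET-{state['extent_dem']}.img"}


result = run_workflow(
    [get_extent, convert_extent, extract_tile],
    {"gage_reachcode": "r1"},  # input variable defined up front
)
```

Because every step only adds keys, the final `result` keeps the inputs and every intermediate value, which is the information a shared collection would need in order to re-execute the workflow.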
extractTileFromNHD_DEM(*extentCoords) {
    # Split path to object into collection and name
    msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
    writeLine("serverLog", *nhdDEMObjColl);
    writeLine("serverLog", *nhdDEMObjName);
    # Build query to discover physical path
    msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
    msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
    msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
    msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
    # Run query
    msiExecGenQuery(*genQInp, *genQOut);
    # Extract path from query result
    foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
    writeLine("serverLog", *filePath);
    # Determine physical path of input directory
    msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
    # Generate physical path of output file
    msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
    *tileFileName = "SUBSET-"++*rasterDatasetName++".img";
    *tileFilePath = *inFileParentDir++"/"++*tileFileName;
    # Generate iRODS path of output
    msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
    *tileObjPath = *nhdDEMObjCollParent++"/"++*tileFileName;
    *args = "-of HFA -projwin "++*extentCoords++" "++*inFileDir++" "++*tileFilePath;
    writeLine("serverLog", *args);
    msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
    writeLine("serverLog", *cmd_out);
    # Register tile file with iRODS
    msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
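The path bookkeeping in this rule is easy to get wrong, so here is a hedged Python sketch of just that part: deriving the output tile's physical and logical paths and the gdal_translate argument list. It assumes msiSplitPath behaves like a POSIX parent/name split. The `plan_tile` helper, the `TilePlan` type, and the example paths are hypothetical, and nothing is executed or registered.

```python
import posixpath
from typing import List, NamedTuple


class TilePlan(NamedTuple):
    tile_file_path: str  # physical path of the output tile
    tile_obj_path: str   # logical (iRODS) path of the output tile
    command: List[str]   # gdal_translate argument vector


def plan_tile(obj_path: str, phys_path: str, extent_coords: str) -> TilePlan:
    """Derive output paths and the gdal_translate call for one DEM tile."""
    obj_coll, _obj_name = posixpath.split(obj_path)
    # The raster data set is the directory that holds the physical file.
    in_file_dir, _header_file = posixpath.split(phys_path)
    in_parent_dir, raster_name = posixpath.split(in_file_dir)
    tile_name = f"SUBSET-{raster_name}.img"
    tile_file_path = posixpath.join(in_parent_dir, tile_name)
    obj_coll_parent, _ = posixpath.split(obj_coll)
    tile_obj_path = posixpath.join(obj_coll_parent, tile_name)
    command = [
        "gdal_translate", "-of", "HFA", "-projwin",
        *extent_coords.split(),  # ulx uly lrx lry
        in_file_dir, tile_file_path,
    ]
    return TilePlan(tile_file_path, tile_obj_path, command)


plan = plan_tile(
    "/zone/home/dem/elev_cm/hdr.adf",  # hypothetical logical path
    "/vault/dem/elev_cm/hdr.adf",      # hypothetical physical path
    "-80 36 -79 35",
)
```

Building the command as a list rather than one concatenated string avoids quoting problems if a path ever contains a space.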
Summary
iRODS is a powerful Policy-Based engine for Managing
NextGen Big Data Cyber-Infrastructures
Enables a Flexible, Adaptive, and Customizable
Data Management Architecture
“Canned” scripts (policies) can be created to
standardize and automate users' processes
Simple menu-driven interface
No CS degree needed
iRODS is the middleware for
Distributed Data Management
Thank you
Questions
Reagan W Moore
rwmoore@renci.org
http://irods.diceresearch.org
NSF OCI-0940841 “DataNet Federation Consortium”
NSF OCI-1032732 “Improvement of iRODS for Multi-Disciplinary Applications”
NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype”
NSF SDCI-0721400 “Data Grids for Community Driven Applications”
iRODS - Open Source Software
iRODS Distributed Data Management
*Val = "0";
msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
}
if (int(*Val) == 0) {
    *Str1 = "TEST_DATA_ID=0";
    msiString2KeyValPair(*Str1, *kvp);
    msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
    writeLine(*Lfile, "added TEST_DATA_ID attribute to collection *Coll");
}
# on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
msiExecGenQuery(*GenQInp2, *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
}
Initializing Workflow Parameters
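The restart-safe initialization above can be modeled in a few lines of Python. This is a sketch only: the metadata catalog is assumed to be a plain in-memory dictionary, and `init_test_data_id` and the collection path are hypothetical names. The point is that the attribute is created only when it is missing, so a restarted workflow resumes from the stored value.

```python
from typing import Dict

# Assumed stand-in for the metadata catalog: collection -> attribute -> value.
Catalog = Dict[str, Dict[str, str]]


def init_test_data_id(catalog: Catalog, coll: str) -> int:
    """Create TEST_DATA_ID=0 on first use; otherwise return the stored value."""
    attrs = catalog.setdefault(coll, {})
    if "TEST_DATA_ID" not in attrs:  # corresponds to the COUNT(...) == 0 test
        attrs["TEST_DATA_ID"] = "0"
    return int(attrs["TEST_DATA_ID"])


catalog: Catalog = {}
first_run = init_test_data_id(catalog, "/zone/home/test")  # attribute created
catalog["/zone/home/test"]["TEST_DATA_ID"] = "7"           # workflow progresses
after_restart = init_test_data_id(catalog, "/zone/home/test")  # value preserved
```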
Workflow Operations Used
Arithmetic (+, -)
Boolean tests (==, !=, &&, ||, >, <, >=)
Conditional statements
    if, then, else
Control
    break, fail
Loops
    for, foreach, while
List manipulation
    initialization, list addition (cons), extracting an element from a
    list (elem), updating an element in a list (setelem)
Variable manipulation
    initialization, type conversion (int, double, str)
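For readers unfamiliar with the list operations named above, here is a Python analogue. It assumes that cons prepends an element and that setelem returns an updated copy rather than modifying the list in place. The functions below are illustrative re-implementations, not the rule-engine built-ins.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def cons(item: T, items: Sequence[T]) -> List[T]:
    """Return a new list with item added at the front."""
    return [item, *items]


def elem(items: Sequence[T], index: int) -> T:
    """Return the element at the given zero-based index."""
    return items[index]


def setelem(items: Sequence[T], index: int, value: T) -> List[T]:
    """Return a copy of items with one element replaced."""
    updated = list(items)
    updated[index] = value
    return updated


sizes = cons(10, [20, 30])       # list addition
second = elem(sizes, 1)          # extract an element
patched = setelem(sizes, 2, 99)  # update an element; sizes is unchanged

# Type conversion counterparts of int, double, and str.
count = int("3")
ratio = float("2.5")
label = str(7)
```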
Micro-services Used
Metadata catalog manipulation
msiGetValByKey: get metadata from structure
msiExecStrCondQuery: execute string conditional query
msiString2KeyValPair: convert string to key-value pair
msiAssociateKeyValuePairsToObj: add metadata
msiMakeGenQuery: create a query
msiExecGenQuery: execute a query
msiCloseGenQuery: release query buffers
msiGetContInxFromGenQueryOut: check for more rows
msiRemoveKeyValuePairsFromObj: remove metadata
msiGetMoreRows: get more rows from query
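Several of these micro-services work together to page through large query results: execute the query, check the continuation index, and fetch more rows until none remain. The Python sketch below models that loop under simplifying assumptions. `exec_query` is a hypothetical stand-in that slices an in-memory list, and its continuation index is modeled as a row offset, with 0 meaning no more rows.

```python
from typing import Iterator, List, Tuple


def exec_query(rows: List[str], start: int, page_size: int) -> Tuple[List[str], int]:
    """Hypothetical query call: return one page and a continuation index."""
    page = rows[start:start + page_size]
    next_start = start + len(page)
    # A continuation index of 0 signals that the result set is exhausted.
    return page, (next_start if next_start < len(rows) else 0)


def all_rows(rows: List[str], page_size: int = 2) -> Iterator[str]:
    """Yield every row, fetching further pages while the index is non-zero."""
    page, cont = exec_query(rows, 0, page_size)
    while True:
        yield from page
        if cont == 0:
            break
        page, cont = exec_query(rows, cont, page_size)


collected = list(all_rows(["a", "b", "c", "d", "e"], page_size=2))
```

Processing one page at a time keeps memory use bounded, which matters for collections with hundreds of millions of files.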
Workflow Operations Used
Arithmetic (+ - )
Boolean tests (== = ampamp || gt lt gt=)
Conditional statements
if then else
Control
break fail
Loops
for foreach while
List manipulation
initialization list addition (cons) extracting an element from a
list (elem) updating an element in a list (setelem)
Variable manipulation
initialization type conversion (int double str)
Micro-services Used
Metadata catalog manipulation
msiGetValByKey get metadata from structure
msiExecStrCondQuery execute string conditional query
msiString2KeyValPair convert string to key-value pair
msiAssociateKeyValuePairsToObj add metadata
msiMakeGenQuery create a query
msiExecGenQuery execute a query
msiCloseGenQuery release query buffers
msiGetContInxFromGenQueryOut check for more rows
msiRemoveKeyValuePairsFromObj remove metadata
msiGetMoreRows get more rows from query
Micro-services Used
Data and directory manipulation
msiIsColl check whether name is a collection
msiCollCreate create a collection
msiDataObjCreate create a file
msiDataObjRepl replicate a file
msiDataObjChksum checksum a file
msiDataObjUnlink delete a file
System functions
msiGetSystemTime get the system time
writeLine write a line to a file or standard out
msiSleep sleep
Performance at rencirsquo
bull Execute call to rule engine 18 msecs
bull Execute metadata query 714 msecs
bull Disk seek latency 5 msecs
bull Disk rotational latency 11 msecs
bull Production loop logic 63 msecs
bull Checksum verification 21 msecs
Data Analysis Use Cases
bull Demonstrate reproducible science A use case could include the
registration storage sharing and re-execution of a workflow The hypoxia
use case from the Cross-Domain and Brokering Concept groups could be
used as an example
bull Automate data retrieval A use case could demonstrate remote access to a
data collection retrieval of desired data sets transformation and use in
an analysis workflow An eco-hydrology example that automates access
to digital elevation maps and land use coverage is being built
bull Integrate community resources with collaboration environments An
example would be use of the DAB protocol to identify and cache local
copies of relevant data sets for local analysis
bull Integrate multiple community resources A use case could be
demonstration of invocation of multiple workflow systems within the
same analysis An example is the integration of Cyber-integrator
workflow with collaboration environments to support drought prediction
RHESSys workflow to develop a nested
watershed parameter file (worldfile)
containing a nested ecogeomorphic object
framework and full initial system state
Choose gauge
or outlet (HIS)
Extract
drainage area
(NHDPlus)
Digital
Elevation
Model (DEM)
Worldfile Flowtable
RHESSys
Slope
Aspect
Streams (NHD)
Roads (DOT) Strata
Hillslope
Patch
Basin
Stream network
Nested watershed
structure
Land Use
Leaf Area
Index
Phenology
Soil Data
NLCD (EPA)
Landsat TM
MODIS
USDA
Soil and vegetation
parameter files
Eco-Hydrology
iRODS Rule for RHESSys
main
getExtentForGageReachcode(gageReachcode extentInNHD_Vect_Coords)
convertExtentToNHD_DEM(extentInNHD_Vect_Coords extentInNHD_DEM_Coords)
extractTileFromNHD_DEM(trimr(extentInNHD_DEM_Coords n))
importDEMTileIntoNewGRASSLocationAsUTM(extentInNHD_Vect_Coords newLocPhysPath
newLocObjPath)
delineateWatershedForNHDGage(nhdStreamGageID newLocPhysPath newLocObjPath)
Modular workflow composed by chaining basic transformation
Define input variables
Call functions to apply each transformation step
Store results in shared collection
extractTileFromNHD_DEM(*extentCoords) {
    # Split path to object into collection and name
    msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
    writeLine("serverLog", *nhdDEMObjColl);
    writeLine("serverLog", *nhdDEMObjName);
    # Build query to discover physical path
    msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
    msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
    msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
    msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
    # Run query
    msiExecGenQuery(*genQInp, *genQOut);
    # Extract path from query result
    foreach (*genQOut) { msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
    writeLine("serverLog", *filePath);
    # Determine physical path of input directory
    msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
    # Generate physical path of output file
    msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
    *tileFileName = "SUBSET-" ++ *rasterDatasetName ++ ".img";
    *tileFilePath = *inFileParentDir ++ "/" ++ *tileFileName;
    # Generate iRODS path of output
    msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
    *tileObjPath = *nhdDEMObjCollParent ++ "/" ++ *tileFileName;
    *args = "-of HFA -projwin " ++ *extentCoords ++ " " ++ *inFileDir ++ " " ++ *tileFilePath;
    writeLine("serverLog", *args);
    msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
    writeLine("serverLog", *cmd_out);
    # Register tile file with iRODS
    msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
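The path bookkeeping in this rule is easy to get wrong, so here is a small Python sketch of just that part. It assumes POSIX-style paths and mirrors the rule's three path splits; `plan_tile_extraction` and the sample paths are hypothetical, and the function only builds the `gdal_translate` argument list without running the command or touching iRODS.

```python
import posixpath


def plan_tile_extraction(dem_obj_path, header_phys_path, extent_coords):
    """Derive output paths and gdal_translate arguments for a DEM tile.

    dem_obj_path:     logical (iRODS) path of the DEM header object
    header_phys_path: physical path returned by the catalog query
    extent_coords:    "ulx uly lrx lry" string for -projwin
    """
    # Logical collection that holds the DEM object
    obj_coll, _obj_name = posixpath.split(dem_obj_path)
    # Physical directory of the raster dataset, then its parent and name
    in_file_dir, _header = posixpath.split(header_phys_path)
    in_file_parent, raster_name = posixpath.split(in_file_dir)
    # Output tile sits next to the dataset directory, physically and logically
    tile_file_name = "SUBSET-" + raster_name + ".img"
    tile_file_path = posixpath.join(in_file_parent, tile_file_name)
    obj_coll_parent, _ = posixpath.split(obj_coll)
    tile_obj_path = posixpath.join(obj_coll_parent, tile_file_name)
    args = ["-of", "HFA", "-projwin", *extent_coords.split(),
            in_file_dir, tile_file_path]
    return tile_file_path, tile_obj_path, args


tile_file, tile_obj, gdal_args = plan_tile_extraction(
    "/zone/home/nhd/dem/hdr.adf", "/vault/nhd/dem/hdr.adf", "0 10 10 0")
print(tile_file)
print(tile_obj)
print(gdal_args)
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems if a path ever contains spaces.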
Summary
iRODS is a powerful Policy-Based engine for Managing
NextGen Big Data Cyber-Infrastructures
Enables a Flexible, Adaptive, and Customizable
Data Management Architecture
"Canned" scripts (policies) can be created to
standardize and automate user processes
Simple menu-driven interface
No CS degree needed
iRODS is the middleware for
Distributed Data Management
Thank you
Questions
Micro-services Used
Metadata catalog manipulation
msiGetValByKey                     get metadata from structure
msiExecStrCondQuery                execute string conditional query
msiString2KeyValPair               convert string to key-value pair
msiAssociateKeyValuePairsToObj     add metadata
msiMakeGenQuery                    create a query
msiExecGenQuery                    execute a query
msiCloseGenQuery                   release query buffers
msiGetContInxFromGenQueryOut       check for more rows
msiRemoveKeyValuePairsFromObj      remove metadata
msiGetMoreRows                     get more rows from query
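To make the "convert string to key-value pair" step concrete, here is a hedged Python analogue. It is not the micro-service itself: `parse_key_value_pairs` is a hypothetical helper, and the `%` pair separator and `=` key-value separator are assumptions about the input convention, exposed as parameters so they can be changed.

```python
def parse_key_value_pairs(text, pair_sep="%", kv_sep="="):
    """Split 'k1=v1%k2=v2' into a dict; separators are assumed conventions."""
    pairs = {}
    for item in text.split(pair_sep):
        if not item:
            continue  # tolerate empty segments such as a trailing separator
        key, sep, value = item.partition(kv_sep)
        if not sep:
            raise ValueError("missing %r in %r" % (kv_sep, item))
        pairs[key.strip()] = value.strip()
    return pairs


metadata = parse_key_value_pairs("source=NHD%format=HFA")
print(metadata)
```

The resulting dictionary plays the role of the key-value structure that would then be attached to an object as metadata.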
Micro-services Used
Data and directory manipulation
msiIsColl            check whether name is a collection
msiCollCreate        create a collection
msiDataObjCreate     create a file
msiDataObjRepl       replicate a file
msiDataObjChksum     checksum a file
msiDataObjUnlink     delete a file
System functions
msiGetSystemTime     get the system time
writeLine            write a line to a file or standard out
msiSleep             sleep
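Replication followed by a checksum is the usual integrity pattern behind these operations. The sketch below shows the idea in plain Python on local files, as an illustration only: `file_checksum` and `replicas_match` are hypothetical helpers, and SHA-256 is an assumed algorithm choice, not necessarily what a given iRODS installation uses.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files never load into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicas_match(path_a, path_b):
    """Return True when two copies of a file have identical checksums."""
    return file_checksum(path_a) == file_checksum(path_b)
```

A periodic policy could call `replicas_match` on each pair of replicas and flag any pair that disagrees for repair.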
Performance at RENCI
• Execute call to rule engine: 18 msecs
• Execute metadata query: 714 msecs
• Disk seek latency: 5 msecs
• Disk rotational latency: 11 msecs
• Production loop logic: 63 msecs
• Checksum verification: 21 msecs
Data Analysis Use Cases
• Demonstrate reproducible science. A use case could include the
registration, storage, sharing, and re-execution of a workflow. The hypoxia
use case from the Cross-Domain and Brokering Concept groups could be
used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a
data collection, retrieval of desired data sets, transformation, and use in
an analysis workflow. An eco-hydrology example that automates access
to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An
example would be use of the DAB protocol to identify and cache local
copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be a
demonstration of invoking multiple workflow systems within the
same analysis. An example is the integration of the Cyber-integrator
workflow with collaboration environments to support drought prediction.
Eco-Hydrology
RHESSys workflow to develop a nested watershed parameter file (worldfile)
containing a nested ecogeomorphic object framework and full initial system state
[Figure: workflow diagram. Labels: Choose gauge or outlet (HIS); Extract
drainage area (NHDPlus); Digital Elevation Model (DEM); Slope; Aspect;
Streams (NHD); Roads (DOT); Strata; Hillslope; Patch; Basin; Stream network;
Nested watershed structure; Land Use; Leaf Area Index; Phenology; Soil Data;
NLCD (EPA); Landsat TM; MODIS; USDA; Soil and vegetation parameter files;
Worldfile; Flowtable; RHESSys]