Date post: | 31-Dec-2015 |
File and Object Replication in Data Grids
Chin-Yi Tsai
Outline
Introduction
Background and Related Work
Globus Data Grid Tools
File Replication Tool: GDMP
Object Replication
Experimental Results with GridFTP
Conclusion
Introduction
Data replication is a key issue in Data Grids
  File and object replication
Distributed analysis of experimental data
  High Energy Physics (HEP) community
  CERN: ATLAS, CMS
  The CMS experiment is a high energy physics experiment located at CERN that will start taking data in 2006
Computing, storage, network
  There is a natural mapping to a Grid environment
The GDMP architecture uses Globus Data Grid tools as middleware
[Figure: a file containing many objects]
File Replication and Object Replication
[Figure: two panels, each showing Grid site 1 (source) and Grid site 2 (destination); labels: local storage, application, object copier tool]
Major focus of the European DataGrid Project is on High Energy Physics
Object data stores are used for next-generation experiments
  objects are important for data handling
  Grid software mainly deals with file replication issues
A single file (about 1 or 2 GB in size; in total a few PB) contains many objects
  most objects are read-only
Related Data Grid Projects
Earth System Grid (ESG)
  management of climate data
Particle Physics Data Grid (PPDG)
  HEP applications
Grid Physics Network (GriPhyN)
  realizing the concept of Virtual Data
Globus Data Grid Tools
The Globus Toolkit is an open-source middleware toolkit used for building grids
Four main components of Globus
  The Grid Security Infrastructure (GSI)
  The Globus Resource Management architecture
  The Globus Information Management architecture
  The Data Management architecture, or Data Grid
    GridFTP, Replica Management
GridFTP
[Figure: GridFTP as a uniform client interface to the storage systems SRB, HPSS, DFS, and DPSS]
Features of GridFTP
GSI and Kerberos support
  GSS API
Third-party control of data transfer
  adds GSS API
Parallel data transfer
  multiple TCP streams, single host
Striped data transfer
  multiple TCP streams, multiple hosts/servers
Partial file transfer
Automatic negotiation of TCP buffer/window sizes
Support for reliable and restartable data transfer
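Parallel and partial transfer both depend on addressing a file by byte range. The sketch below only illustrates that arithmetic: it splits a file into contiguous (offset, length) ranges, one per stream. The function name split_ranges is hypothetical and not part of GridFTP or the Globus libraries.

```python
def split_ranges(file_size, num_streams):
    """Split [0, file_size) into num_streams contiguous (offset, length) ranges."""
    if file_size < 0 or num_streams < 1:
        raise ValueError("file_size must be >= 0 and num_streams >= 1")
    # Every stream gets `base` bytes; the first `extra` streams get one more.
    base, extra = divmod(file_size, num_streams)
    ranges = []
    offset = 0
    for i in range(num_streams):
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    # A 100 MB file split over 4 streams (illustrative values only).
    for offset, length in split_ranges(100 * 1024 * 1024, 4):
        print(offset, length)
```

Each range could then be requested as a partial transfer on its own TCP stream; how a real client schedules the streams is outside this sketch.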
[Figure: users at Site A and Site B, each behind a local security infrastructure, connected through GSI]
The GridFTP Protocol Implementation
The two main libraries
  globus_ftp_control_library
  globus_ftp_client_library
Replica Catalog
Mapping between logical names for files or collections and one or more copies of the objects on physical storage systems
Three types of entries
  logical collections
  locations (physical)
  logical files
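The three entry types can be pictured as a small in-memory data model. This is a minimal sketch, assuming a location is described by a URL prefix plus the set of filenames it holds; the class names, fields, and example.org hosts are hypothetical and do not reflect the Globus Replica Catalog schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    """A physical location holding copies of some files of a collection."""
    name: str
    url_prefix: str
    filenames: set = field(default_factory=set)


@dataclass
class LogicalCollection:
    """A named group of logical files and the locations that store them."""
    name: str
    filenames: set = field(default_factory=set)
    locations: dict = field(default_factory=dict)    # location name -> Location
    attributes: dict = field(default_factory=dict)   # logical file -> e.g. size

    def add_location(self, location):
        # A location may only list files that belong to the collection.
        unknown = location.filenames - self.filenames
        if unknown:
            raise ValueError(f"files not in collection: {sorted(unknown)}")
        self.locations[location.name] = location

    def replicas(self, filename):
        """Return the physical URLs of every copy of a logical file."""
        return sorted(
            f"{loc.url_prefix}/{filename}"
            for loc in self.locations.values()
            if filename in loc.filenames
        )


if __name__ == "__main__":
    coll = LogicalCollection("demo", {"f1", "f2"})
    coll.add_location(Location("A", "gridftp://site-a.example.org/data", {"f1"}))
    coll.add_location(Location("B", "gridftp://site-b.example.org/data", {"f1", "f2"}))
    print(coll.replicas("f1"))
```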
One Application Model
[Figure: Replica Catalog tree]
Replica Catalog
  Logical Collection: Weather measurement 2003
    filenames: Jan 2003, Feb 2003, …, Dec 2003
  Logical Collection: Weather measurement 2002
  Locations: cwb.gov.tw, ntu.edu.tw, fcu.edu.tw
  Location file lists shown in the figure:
    filenames: Jan 2003, Feb 2003; Protocol: GridFTP; Hostname: cwb.gov.tw; Path: nfs/weather/
    filenames: Jan 2003, Feb 2003, Oct 2003
    filenames: Jan 2003, Sep 2003
  Logical File Parent
    Logical File: Jan 2003
    Logical File: Jan 2003
    …
An Example Replication Scenario
File sizes
  File1: 100 MB
  File2: 200 MB
  File3: 300 MB
  File4: 400 MB
  File5: 500 MB
Site A holds File1, File2, File3, File4
Site B holds File2, File3, File5
namesToSearchFile
  filename:File4
  filename:File5
listCollectionNamesFile
  File1
  File2
  File3
  File4
  File5
listANamesFile
  filename:File1
  filename:File2
  filename:File3
  filename:File4
listBNamesFile
  filename:File2
  filename:File3
  filename:File5
Location entry corresponding to site A
  uc: gridftp://Ahost.isi.edu:2222/nfs/path/on/A
Location entry corresponding to site B
  uc: gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B
Implementing This Scenario with the Command-Line Tool
Registering the collection
  globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -collection -create listCollectionNamesFile
Registering location A
  globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -location locationA -create gridftp://Ahost.isi.edu:2222/nfs/path/on/A listANamesFile
Registering location B
  globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -location locationB -create gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B listBNamesFile
Registering the logical files File1, File2, File3, File4, File5 (shown for File1)
  globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -logicalfile File1 -create 104857600
Searching for the uc (URL constructor) attribute of all locations that contain File4 and File5
  globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -collection -find-locations namesToSearchFile -attributes uc
Listing the value of the size attribute of File2
  globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -logicalfile File2 -list-attributes size
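To make the two queries concrete, the sketch below mirrors the scenario in plain Python dictionaries. It is an in-memory stand-in, not the LDAP-backed catalog and not the output format of globus-replica-catalog; the function name find_locations and the dictionary layout are assumptions. The sizes assume 1 MB = 1024 * 1024 bytes, which matches the 104857600 registered for File1.

```python
MB = 1024 * 1024

# Size attribute of each logical file, in bytes.
sizes = {
    "File1": 100 * MB,
    "File2": 200 * MB,
    "File3": 300 * MB,
    "File4": 400 * MB,
    "File5": 500 * MB,
}

# Location entries with their uc (URL constructor) attribute and file lists.
locations = {
    "locationA": {
        "uc": "gridftp://Ahost.isi.edu:2222/nfs/path/on/A",
        "files": {"File1", "File2", "File3", "File4"},
    },
    "locationB": {
        "uc": "gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B",
        "files": {"File2", "File3", "File5"},
    },
}


def find_locations(names):
    """Return {filename: [uc, ...]} for each requested logical file."""
    return {
        name: sorted(
            loc["uc"] for loc in locations.values() if name in loc["files"]
        )
        for name in names
    }


if __name__ == "__main__":
    for name, ucs in find_locations(["File4", "File5"]).items():
        print(name, ucs)
    print("File2 size:", sizes["File2"])
```

In this model File4 resolves only to site A and File5 only to site B, while File2 and File3 would resolve to both sites.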
GDMP Architecture
The GDMP client-server software system is a generic file replication tool
[Figure: GDMP components: Request Manager, Security Layer, Replica Catalog Service, Data Mover Service, Storage Manager Service]
Replica Catalog Service
Maintains a global file name space of replicas
For each new file
  logical file name
  meta-information
  physical location
Client sites query the Replica Catalog Service
Implementation
  LDAP and the Globus library (replica catalog)
  high-level API
[Figure: layers labeled Application, API, Replica Catalog Service, Globus Replica Catalog]
Data Mover Service
Layered design
  high-level API and low-level service
Data transfer
  security, performance, robustness
Uses GridFTP as GDMP's underlying file transfer mechanism
Handles network failures and performs additional checks for corruption
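The failure handling described above can be sketched as a retry loop around a transfer call, followed by a checksum comparison. This is a minimal sketch under stated assumptions: the slides do not say which check GDMP performs, so CRC32 is used here as a stand-in, and transfer is any callable taking (source, destination). The helper names crc32_of and transfer_with_retry are hypothetical.

```python
import os
import shutil
import tempfile
import zlib


def crc32_of(path, chunk_size=1 << 20):
    """Compute the CRC32 of a file, reading it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def transfer_with_retry(transfer, source, destination, expected_crc, max_attempts=3):
    """Retry transfer(source, destination) until the copy verifies.

    Returns the number of attempts used; raises RuntimeError if every
    attempt fails or produces a corrupted copy.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            transfer(source, destination)
        except OSError as err:      # treated here as a network failure
            last_error = err
            continue
        if crc32_of(destination) == expected_crc:
            return attempt
        last_error = OSError("checksum mismatch after transfer")
    raise RuntimeError(f"transfer failed after {max_attempts} attempts") from last_error


if __name__ == "__main__":
    # shutil.copyfile stands in for a GridFTP transfer in this demo.
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "src.bin")
    dst = os.path.join(workdir, "dst.bin")
    with open(src, "wb") as f:
        f.write(b"example payload" * 1000)
    print(transfer_with_retry(shutil.copyfile, src, dst, crc32_of(src)))
```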
Storage Management Service
Uses external tools for staging (different for each MSS)
  assumes that each site has a local disk pool, which serves as a data transfer cache
GDMP triggers file staging to the disk pool
  if a file is requested by a remote site but is not located on the disk pool, GDMP triggers staging and initiates a disk-to-disk file transfer
GDMP has a plug-in for the Hierarchical Resource Manager (HRM) APIs, which provide a common interface for accessing different Mass Storage Systems
  the implementation is based on CORBA
[Figure: two sites, each running GDMP with a local disk pool]
Object Replication Motivation
File replication works well for many kinds of applications
However, it is too inefficient for physics analysis:
  only a few objects of a file are requested
  physicists want to have replicas on specific sites with sufficient CPU power
  they do not want the entire file but only a few objects
  file replication: overhead in terms of data to be transferred
Use an object copier to copy objects to a file and then replicate the "new" file
One object per file is inefficient, since object sizes are between 100 bytes and 1 MB: too many files
[Figure: object copier tool between Grid site 1 (source) and Grid site 2 (destination), serving the application]
Object Replication Architecture Choices
Large, world-wide distributed databases are not considered very attractive in HEP
Significant parts of GDMP and Globus are used
Object replication cycle:
  objects are identified by the application
  objects not present at the location are identified
  "missing" objects are copied into new files and then transferred to the application
Copy and file transfer are pipelined to achieve a better response time
Index files are used for locating objects
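The replication cycle above can be sketched in a few lines. This is a minimal illustration, not the prototype's implementation: object identifiers, the size limit, and the function names missing_objects and pack_objects are hypothetical. The generator hands over each new file as soon as it is full, which is one simple way to let the transfer of a finished file overlap with the copying of the next one.

```python
def missing_objects(requested_ids, local_ids):
    """Identify requested objects that are not present at the local site."""
    return sorted(set(requested_ids) - set(local_ids))


def pack_objects(object_sizes, max_file_size):
    """Group (object_id, size) pairs into files of at most max_file_size bytes.

    Yields each file's list of object ids as soon as the file is full, so a
    consumer can start transferring it while later objects are still copied.
    An object larger than max_file_size ends up alone in its own file.
    """
    batch, batch_size = [], 0
    for obj_id, size in object_sizes:
        if batch and batch_size + size > max_file_size:
            yield batch
            batch, batch_size = [], 0
        batch.append(obj_id)
        batch_size += size
    if batch:
        yield batch


if __name__ == "__main__":
    wanted = missing_objects(requested_ids=[1, 2, 3, 4, 5], local_ids=[2, 4])
    sizes = {1: 400, 3: 700, 5: 300}   # bytes, illustrative values only
    for new_file in pack_objects([(i, sizes[i]) for i in wanted], max_file_size=1000):
        print("transfer file with objects", new_file)
```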
Object Replication Prototyping Experience
Most current next-generation experiments do not do analysis yet:
  object replication is still a prototype
  file replication based on GDMP is in production use
The machine where the object copier runs needs to be powerful (CPU and I/O)
Experimental Results with GridFTP
Main motivation
  study the impact of TCP socket buffer size tuning on parallel data transfers
  understand the throughput that can be achieved in realistic settings
To get maximal throughput
  it is critical to use optimal TCP send and receive socket buffer sizes (neither too small nor too large)
Test server
  WU-ftpd server 0.4b6
Test programs
  extended_get
  extended_put
Experimental Results with GridFTP (cont’d)
[Figure: transfer rate (Mbps, axis 0 to 25) versus number of streams (1 to 10) for 1 MB, 25 MB, 50 MB, and 100 MB files]
Experimental Results with GridFTP (cont’d)
Optimal TCP buffer size = RTT * (speed of bottleneck link)
  RTT measured with the Unix ping tool
  bottleneck link speed: pipechar (new tool from LBNL)
A simple method to determine the optimal number of parallel streams is not yet known
  too many streams may overload the receiving host
  usually, 4 to 8 parallel streams are optimal
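The buffer size formula is the bandwidth-delay product and can be worked through directly. The sketch below assumes the RTT is given in seconds and the bottleneck speed in bits per second, hence the division by 8 to get bytes; the function name and the example values are illustrative and are not measurements from the experiments.

```python
def optimal_tcp_buffer_bytes(rtt_seconds, bottleneck_bits_per_second):
    """Return RTT * bottleneck link speed, converted from bits to bytes."""
    if rtt_seconds < 0 or bottleneck_bits_per_second < 0:
        raise ValueError("inputs must be non-negative")
    return round(rtt_seconds * bottleneck_bits_per_second / 8)


if __name__ == "__main__":
    # Hypothetical wide-area path: 125 ms RTT over a 100 Mbps bottleneck.
    print(optimal_tcp_buffer_bytes(0.125, 100_000_000))  # 1562500 bytes
```

For these example values the result is about 1.5 MB, far above typical default socket buffers of the period, which is why tuning matters on long, fast paths.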
Conclusion
The GDMP replication service has been enhanced with more advanced data management features
  namespace and file catalog management
  efficient file transfer (GridFTP)
Object-based replication
Experimental analysis