File and Object Replication in Data Grids

Date post: 31-Dec-2015
Page 1: File and Object Replication in Data Grids

File and Object Replication in Data Grids

Chin-Yi Tsai

Page 2: File and Object Replication in Data Grids

Outline

Introduction

Background and Related Work

Globus Data Grid Tools

File Replication Tool : GDMP

Object Replication

Experimental Results with GridFTP

Conclusion

Page 4: File and Object Replication in Data Grids

Introduction

Data replication is a key issue in Data Grids
  File and object replication

Distributed analysis of experimental data
  High Energy Physics (HEP) community
  CERN: ATLAS, CMS
  The CMS experiment is a high energy physics experiment located at CERN that will start data taking in the year 2006
  Computing, storage, network
  There is a natural mapping to a Grid environment

GDMP architecture uses Globus Data Grid tools as middleware

[Figure: a file containing many objects]

Page 5: File and Object Replication in Data Grids

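File Replication and Object Replication

[Figure: file replication between Grid site 1 (source) and Grid site 2 (destination); labels: local storage, application]

[Figure: object replication between Grid site 1 (source) and Grid site 2 (destination); labels: application, object copier tool]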

Page 6: File and Object Replication in Data Grids

Major focus of European DataGrid Project on High Energy Physics

Object data stores are used for next-generation experiments
  objects are important for data handling
  Grid software mainly deals with file replication issues

A single file (about 1 or 2 GB in size; a few PB in total) contains many objects
  most objects are read-only

Page 7: File and Object Replication in Data Grids

Related Data Grid Projects

Earth System Grid (ESG): management of climate data

Particle Physics Data Grid (PPDG): HEP applications

Grid Physics Network (GriPhyN): realizing the concept of Virtual Data

Page 8: File and Object Replication in Data Grids

Globus Data Grid Tools

The Globus Toolkit is an open source software toolkit used for building Grid middleware

Four main components of Globus
  The Grid Security Infrastructure (GSI)
  The Globus Resource Management
  The Globus Information Management architecture
  Data Management architecture, or Data Grid
    GridFTP, Replica Management

Page 9: File and Object Replication in Data Grids

GridFTP

[Figure: GridFTP as a uniform client interface to storage systems such as SRB, HPSS, DFS, and DPSS]

Page 10: File and Object Replication in Data Grids

Features of GridFTP

GSI and Kerberos support
  GSS API
Third-party control of data transfer
  adds GSS API
Parallel data transfer
  Multiple TCP streams, single host
Striped data transfer
  Multiple TCP streams, multiple hosts/servers
Partial file transfer
Automatic negotiation of TCP buffer/window sizes
Support for reliable and restartable data transfer
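The parallel and partial transfer features can be pictured as splitting a file into byte ranges, one per TCP stream. The Python sketch below only illustrates that idea: `split_ranges` is a hypothetical helper, not part of GridFTP or the Globus libraries, and the real protocol negotiates its own block layout.

```python
def split_ranges(file_size, num_streams):
    """Split [0, file_size) into num_streams contiguous byte ranges.

    Hypothetical helper for illustration; returns (offset, length) pairs.
    """
    if file_size < 0 or num_streams < 1:
        raise ValueError("file_size must be >= 0 and num_streams >= 1")
    base, extra = divmod(file_size, num_streams)
    ranges = []
    offset = 0
    for i in range(num_streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges
```

Each (offset, length) pair could be handed to its own stream; requesting a single pair corresponds to a partial file transfer.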

[Figure: users at Site A and Site B, each with a local security infrastructure, connected through GSI]

Page 11: File and Object Replication in Data Grids

The GridFTP Protocol Implementation

The two main libraries
  globus_ftp_control_library
  globus_ftp_client_library

Page 12: File and Object Replication in Data Grids

Replica Catalog

Maps logical names for files or collections to one or more copies of the objects on physical storage systems

Three types of entries
  logical collections
  locations (physical)
  logical files
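The three entry types can be modelled with a small data structure. This is a minimal Python sketch under assumed names: the classes, fields, and the `physical_urls` method are hypothetical and do not reflect the Globus Replica Catalog's LDAP schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    """A physical location holding some of a collection's files."""
    name: str
    url_prefix: str  # e.g. "gridftp://host:port/path"
    filenames: set = field(default_factory=set)


@dataclass
class LogicalCollection:
    """A named group of logical files, their locations, and attributes."""
    name: str
    filenames: set = field(default_factory=set)
    locations: dict = field(default_factory=dict)        # name -> Location
    file_attributes: dict = field(default_factory=dict)  # logical file -> attrs

    def add_location(self, location):
        # A location may only list files that belong to the collection.
        unknown = location.filenames - self.filenames
        if unknown:
            raise ValueError(f"files not in collection: {sorted(unknown)}")
        self.locations[location.name] = location

    def physical_urls(self, filename):
        """Return one URL per location that holds a copy of filename."""
        return sorted(
            f"{loc.url_prefix}/{filename}"
            for loc in self.locations.values()
            if filename in loc.filenames
        )
```

A lookup then goes from a logical file name to every physical copy, which is the mapping the catalog exists to provide.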

Page 13: File and Object Replication in Data Grids

One Application Model

[Figure: example Replica Catalog. Logical collections: Weather measurement 2003, Weather measurement 2002. Locations: cwb.gov.tw, ntu.edu.tw, fcu.edu.tw. The collection lists filenames Jan 2003, Feb 2003, …, Dec 2003. One location entry lists filenames Jan 2003 and Feb 2003 with Protocol: GridFTP, Hostname: cwb.gov.tw, Path: nfs/weather/. The other two location entries list Jan 2003, Feb 2003, Oct 2003 and Jan 2003, Sep 2003. Logical file entries such as Jan 2003 appear under a "Logical File Parent" label.]

Page 14: File and Object Replication in Data Grids

An Example Replication Scenario

Logical files in the collection
  File1: 100 MB
  File2: 200 MB
  File3: 300 MB
  File4: 400 MB
  File5: 500 MB

Site A holds File1, File2, File3, File4
Site B holds File2, File3, File5

Input files
  listCollectionNamesFile: filename:File1, filename:File2, filename:File3, filename:File4, filename:File5
  listANamesFile: filename:File1, filename:File2, filename:File3, filename:File4
  listBNamesFile: filename:File2, filename:File3, filename:File5
  namesToSearchFile: filename:File4, filename:File5

Location entry corresponding to site A
  uc: gridftp://Ahost.isi.edu:2222/nfs/path/on/A

Location entry corresponding to site B
  uc: gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B
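The scenario is small enough to model directly. The Python sketch below uses the file lists and uc values from this slide; the function names are hypothetical, and whether the real tool reports locations per file or only locations holding every listed file is not stated here, so both variants are shown.

```python
# Files held at each site, as in the scenario above.
SITE_FILES = {
    "A": {"File1", "File2", "File3", "File4"},
    "B": {"File2", "File3", "File5"},
}

# uc (URL constructor) attribute of each location entry.
SITE_UC = {
    "A": "gridftp://Ahost.isi.edu:2222/nfs/path/on/A",
    "B": "gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B",
}


def locations_for(filename):
    """uc values of all locations that hold the given logical file."""
    return sorted(SITE_UC[s] for s, files in SITE_FILES.items() if filename in files)


def locations_with_all(filenames):
    """uc values of locations that hold every one of the given files."""
    wanted = set(filenames)
    return sorted(SITE_UC[s] for s, files in SITE_FILES.items() if wanted <= files)
```

In this model File4 resolves only to site A and File5 only to site B, so no single location holds both.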

Page 15: File and Object Replication in Data Grids

Implementing This Scenario with the Command Line Tool

Registering the collection
  globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -collection -create listCollectionNamesFile

Registering location A
  globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -location locationA -create gridftp://Ahost.isi.edu:2222/nfs/path/on/A listANamesFile

Registering location B
  globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -location locationB -create gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B listBNamesFile

Page 16: File and Object Replication in Data Grids

Registering logical files File1, File2, File3, File4, File5 (shown for File1)
  globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -logicalfile File1 -create 104857600

Searching for the uc (URL constructor) attribute of all locations that contain File4 and File5
  globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -collection -find-locations namesToSearchFile -attributes uc

Listing the value of the size attribute of File2
  globus-replica-catalog -host <ldap url> -manager <ldap DN> -password <> -logicalfile File2 -list-attributes size
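The value 104857600 passed when registering File1 is its 100 MB size in bytes, assuming binary megabytes. A short Python sketch of that arithmetic follows; the helper names are hypothetical, and the command string only repeats the flags shown on these slides with the placeholders left unfilled.

```python
MB = 1024 * 1024  # binary megabyte, consistent with 100 MB -> 104857600

# File sizes in MB, from the example replication scenario.
SIZES_MB = {"File1": 100, "File2": 200, "File3": 300, "File4": 400, "File5": 500}


def size_in_bytes(size_mb):
    """Convert a size in MB to the byte count used as the size attribute."""
    return size_mb * MB


def register_command(name, size_mb):
    """Build the registration command line, keeping placeholders as-is."""
    return (
        "globus-replica-catalog -host <ldap url> -manager <ldap DN> "
        f"-password <> -logicalfile {name} -create {size_in_bytes(size_mb)}"
    )
```

Looping over `SIZES_MB` with `register_command` yields one registration line per logical file.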

Page 17: File and Object Replication in Data Grids

GDMP Architecture

The GDMP client-server software system is a generic file replication tool

[Figure: GDMP architecture; components: Request Manager, Security Layer, Replica Catalog Service, Data Mover Service, Storage Manager Service]

Page 18: File and Object Replication in Data Grids

Replica Catalog Service

Maintains a global file name space of replicas

New file
  logical file name
  meta-information
  physical location

Client sites query the Replica Catalog Service

Implementation
  LDAP and Globus library (replica catalog)
  High-level API

[Figure: layers labelled Application, API, Replica Catalog Service, and Globus Replica Catalog]

Page 19: File and Object Replication in Data Grids

Data Mover Service

Layered design
  high-level API and low-level service
Data transfer
  security, performance, robustness
Uses GridFTP as GDMP’s underlying file transfer mechanism
Handles network failures and performs additional checks for corruption
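Handling network failures and checking for corruption can be sketched as a retry loop around a transfer, followed by a checksum comparison. The Python below is an assumption-laden illustration: `transfer_with_retry` and the `fetch` callable are hypothetical, and SHA-256 is chosen here for convenience since the slide does not say which check GDMP performs.

```python
import hashlib


def sha256_of(data):
    """Hex digest used here as the corruption check (an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def transfer_with_retry(fetch, expected_digest, max_attempts=3):
    """Call fetch() until it returns bytes whose digest matches.

    fetch is any zero-argument callable returning bytes; it may raise
    ConnectionError to signal a network failure.
    Returns (data, number_of_attempts_used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            data = fetch()
        except ConnectionError as exc:
            last_error = exc  # network failure: try again
            continue
        if sha256_of(data) == expected_digest:
            return data, attempt
        last_error = ValueError("checksum mismatch")  # corrupted copy: try again
    raise RuntimeError(f"transfer failed after {max_attempts} attempts") from last_error
```

In a real deployment `fetch` would wrap a GridFTP transfer, and a restartable transfer would resume rather than start over.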

Page 20: File and Object Replication in Data Grids

Storage Management Service

Uses external tools for staging (different for each MSS)
Assumes that each site has a local disk pool, which serves as a data transfer cache

GDMP triggers file staging to the disk pool
  If a file is requested by a remote site but is not located on the disk pool, GDMP stages it to the disk pool and then initiates a disk-to-disk file transfer

GDMP has a plug-in for the Hierarchical Resource Manager (HRM) APIs, which provide a common interface for accessing different Mass Storage Systems. The implementation is based on CORBA

[Figure: two sites, each running GDMP with a local disk pool]

Page 21: File and Object Replication in Data Grids

Object Replication Motivation

File replication works well for many kinds of applications

However, it is too inefficient for physics analysis:
  only a few objects of a file are requested
  physicists want to have replicas on specific sites with sufficient CPU power
  they don’t want the entire file but only a few objects
  file replication: overhead in terms of data to be transferred

Use an object copier to copy objects to a file and then replicate the “new” file

One object per file is inefficient since object size is between 100 bytes and 1 MB: too many files

[Figure: object replication between Grid site 1 (source) and Grid site 2 (destination); labels: application, object copier tool]

Page 22: File and Object Replication in Data Grids

Object Replication Architecture Choices

Large, world-wide distributed databases are not considered very attractive in HEP

Significant parts of GDMP and Globus are used

Object replication cycle:
  objects are identified by the application
  objects not present at the location are identified
  “missing” objects are copied into new files and then transferred to the application

Copy and file transfer are pipelined to achieve a better response time

Index files are used for locating objects
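The replication cycle above can be sketched in Python as two steps: find the requested objects that the local index does not list, then group them into new files. All names here are hypothetical, and the greedy packing rule is an assumption, not the object copier's actual policy. Writing the packer as a generator hints at the pipelining: each batch can be handed to the file transfer while the next is still being assembled.

```python
def missing_objects(requested, local_index):
    """Objects the application asked for that the local index does not list."""
    return [obj_id for obj_id in requested if obj_id not in local_index]


def pack_into_files(object_ids, sizes, max_file_bytes):
    """Greedily group objects into new files of at most max_file_bytes each.

    An object larger than the limit ends up in a file of its own.
    Yields one list of object ids per new file.
    """
    batch, batch_bytes = [], 0
    for obj_id in object_ids:
        size = sizes[obj_id]
        if batch and batch_bytes + size > max_file_bytes:
            yield batch  # this file is complete and can be transferred
            batch, batch_bytes = [], 0
        batch.append(obj_id)
        batch_bytes += size
    if batch:
        yield batch
```

Packing many small objects per file avoids the one-object-per-file case, which would create far too many files.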

Page 23: File and Object Replication in Data Grids

Object Replication Prototyping Experience

Most current next-generation experiments do not do analysis yet:
  object replication is still a prototype
  file replication based on GDMP is in production use

The machine where the object copier is running needs to be powerful (CPU and I/O)

Page 24: File and Object Replication in Data Grids

Experimental Results with GridFTP

Main motivation
  study the impact of TCP socket buffer size tuning on parallel data transfers
  understand the throughput that can be achieved in realistic settings

To get maximal throughput
  it is critical to use optimal TCP send and receive socket buffer sizes (neither too small nor too large)

Test server
  WU-ftpd server 0.4b6

Test programs
  extended_get
  extended_put

Page 25: File and Object Replication in Data Grids

Experimental Results with GridFTP (cont’d)

[Figure: transfer rate (Mbps, axis from 0 to 25) versus number of streams (1 to 10) for 1 MB, 25 MB, 50 MB, and 100 MB files]

Page 26: File and Object Replication in Data Grids

Experimental Results with GridFTP (cont’d)

Optimal TCP buffer size = RTT * (speed of bottleneck link)
  RTT measured with the Unix ping tool
  bottleneck link speed measured with pipechar (new tool from LBNL)

A simple method to determine the optimal number of parallel streams is not known yet
  too many streams may overload the receiving host
  usually, 4 to 8 parallel streams are optimal
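The buffer size formula is the bandwidth-delay product, and a worked example makes the units concrete. In the Python sketch below the function name and the sample values (100 ms RTT, 45 Mbps bottleneck) are illustrative assumptions, not measurements from these experiments.

```python
def optimal_tcp_buffer_bytes(rtt_ms, bottleneck_mbps):
    """Bandwidth-delay product: RTT * bottleneck link speed, in bytes.

    rtt_ms is the round-trip time in milliseconds and bottleneck_mbps the
    link speed in megabits per second (10**6 bits/s). One millisecond at
    1 Mbps carries 1000 bits, i.e. 125 bytes.
    """
    return round(rtt_ms * bottleneck_mbps * 125)


# Hypothetical link: 100 ms RTT over a 45 Mbps bottleneck.
print(optimal_tcp_buffer_bytes(100, 45))  # 562500 bytes, roughly 550 KB
```

A buffer smaller than this leaves the link idle while waiting for acknowledgements, which is one reason several parallel streams can help when buffers are not tuned.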

Page 27: File and Object Replication in Data Grids

Conclusion

The GDMP replication service has been enhanced with more advanced data management features
  namespace and file catalog management
  efficient file transfer (GridFTP)

Object-based replication

Experimental analysis

