Page 1

Author: Andrew C. Smith

Abstract: LHCb's participation in LCG's Service Challenge 3 involves testing the bulk data transfer infrastructure developed to allow high-bandwidth distribution of data across the grid in accordance with the computing model. To enable reliable bulk replication of data, LHCb's DIRAC system has been integrated with gLite's File Transfer Service middleware component to make use of dedicated network links between LHCb computing centres. DIRAC's Data Management tools previously allowed the replication, registration and deletion of files on the grid. For SC3, supplementary functionality has been added to allow bulk replication of data (using FTS) and efficient mass registration to the LFC replica catalog.

Provisional performance results have shown that the system developed can meet the expected data replication rate required by the computing model in 2007. This paper details the experience and results of integration and utilisation of DIRAC with the SC3 transfer machinery.

Introduction to DIRAC Data Management Architecture

The DIRAC architecture is split into three main component types:

•Services - independent functionalities deployed and administered centrally on machines accessible by all other DIRAC components

•Resources - GRID compute and storage resources at remote sites

•Agents - lightweight software components that request jobs from the central Services for a specific purpose.

The DIRAC Data Management System is made up of an assortment of these components.
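To illustrate the Agent/Service pattern, here is a minimal Python sketch (DIRAC itself is written in Python) of a lightweight agent that polls a central service for waiting work. The RequestService and Agent classes and their method names are purely illustrative, not the actual DIRAC interfaces.

    import time

    class RequestService:
        # Hypothetical stand-in for a central DIRAC Service holding waiting work.
        def __init__(self, requests):
            self._queue = list(requests)

        def get_waiting_request(self):
            # Return the next waiting request, or None when nothing is queued.
            return self._queue.pop(0) if self._queue else None

    class Agent:
        # Lightweight component that periodically asks the central Service for work.
        def __init__(self, service, poll_interval=60):
            self.service = service
            self.poll_interval = poll_interval

        def execute(self):
            # One polling cycle: fetch a request and act on it if one is waiting.
            request = self.service.get_waiting_request()
            if request is not None:
                print("processing request:", request)

        def run(self):
            while True:
                self.execute()
                time.sleep(self.poll_interval)

    if __name__ == "__main__":
        Agent(RequestService(["replicate-0001"]), poll_interval=1).execute()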

[Figure: DIRAC Data Management architecture. Data Management Clients (UserInterface, WMS, TransferAgent) use the ReplicaManager, which talks to the File Catalogs (FileCatalogA, FileCatalogB, FileCatalogC) and to the SE Service/StorageElement, whose protocol plug-ins (SRMStorage, GridFTPStorage, HTTPStorage) access the physical storage.]

DIRAC Data Management Components

Main components of the DIRAC Data Management System:

•Storage Element (a plug-in sketch follows this list)
  •abstraction of GRID storage resources
  •actual access by specific plug-ins: srm, gridftp, bbftp, sftp, http supported
  •namespace management, file up/download, deletion etc.

•Replica Manager
  •provides an API for the available data management operations
  •point of contact for users of data management systems
  •removes direct operation with Storage Element and File Catalogs
  •uploading/downloading files to/from GRID SEs, replication of files, file registration, file removal

•File Catalog
  •standard API exposed for a variety of available catalogs
  •allows redundancy across several catalogs
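The plug-in idea behind the Storage Element can be sketched as below. The class and method names are illustrative, not the actual DIRAC implementation, and only two of the supported protocols are shown.

    class StorageBase:
        # Illustrative plug-in interface; not the actual DIRAC class names.
        def put_file(self, local_path, remote_path):
            raise NotImplementedError

        def remove_file(self, remote_path):
            raise NotImplementedError

    class SRMStorage(StorageBase):
        def put_file(self, local_path, remote_path):
            print("would copy %s to %s via an SRM client" % (local_path, remote_path))

        def remove_file(self, remote_path):
            print("would remove %s via an SRM client" % remote_path)

    class GridFTPStorage(StorageBase):
        def put_file(self, local_path, remote_path):
            print("would copy %s to %s via GridFTP" % (local_path, remote_path))

        def remove_file(self, remote_path):
            print("would remove %s via GridFTP" % remote_path)

    class StorageElement:
        # Abstraction of a grid storage resource; the access protocol is a plug-in.
        _plugins = {"srm": SRMStorage, "gridftp": GridFTPStorage}

        def __init__(self, name, protocol):
            self.name = name
            self._backend = self._plugins[protocol]()

        def put_file(self, local_path, remote_path):
            return self._backend.put_file(local_path, remote_path)

        def remove_file(self, remote_path):
            return self._backend.remove_file(remote_path)

A caller would then write, for example, StorageElement("SomeSE", "srm").put_file("local.dst", "/lhcb/data/file.dst") without needing to know which protocol sits behind the SE.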

LHCb Transfer Aims During SC3

The extended Service Phase of SC3 was intended to allow the experiments to test their specific software and validate their computing models using the platform of machinery provided. LHCb’s Data Replication goals during SC3 can be summarised as:

•Replication of ~1 TB of stripped DST data from CERN to all Tier1s.

•Replication of 8 TB of digitised data from CERN/Tier-0 to the participating LHCb Tier1 centres in parallel.

•Removal of 50k replicas (via LFN) from all Tier-1 centres

•Moving 4 TB of data from Tier1 centres to the Tier0 and to other participating Tier1 centres.

Integration of DIRAC with FTS

The SC3 replication machinery utilised gLite’s File Transfer Service (FTS):

•lowest-level data movement service defined in the gLite architecture

•offers reliable point-to-point bulk file transfers

•physical files (SURLs) between SRM managed SEs

•accepts source-destination SURL pairs

•assigns file transfers to dedicated transfer channel

•takes advantage of dedicated networking between CERN and the Tier1s

•routing of transfers is not provided

A higher-level service is required to resolve SURLs and hence decide on routing; the DIRAC Data Management System is employed for these tasks.
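Since FTS only accepts source-destination SURL pairs, the routing decision reduces to choosing a source replica and constructing the target SURL. A minimal sketch, assuming a hypothetical endpoint configuration (the hosts and paths below are invented for illustration):

    # Hypothetical SE endpoint configuration; hosts and paths are illustrative only.
    SE_ENDPOINTS = {
        "CERN_Castor": "srm://srm-cern.example/castor/lhcb",
        "RAL_dCache": "srm://srm-ral.example/pnfs/lhcb",
    }

    def resolve_surl_pair(lfn, replicas, source_se, target_se):
        # replicas: SE name -> existing SURL for this LFN (from a catalogue lookup).
        # The target SURL is built from the target SE endpoint plus the LFN,
        # so the remote namespace mirrors the logical one.
        if source_se not in replicas:
            raise ValueError("no replica of %s at %s" % (lfn, source_se))
        return replicas[source_se], SE_ENDPOINTS[target_se] + lfn

    # Example: one source-destination pair of the kind handed to FTS.
    pair = resolve_surl_pair(
        "/lhcb/sc3/00000001.dst",
        {"CERN_Castor": "srm://srm-cern.example/castor/lhcb/lhcb/sc3/00000001.dst"},
        "CERN_Castor",
        "RAL_dCache",
    )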

Integration requirements:

•new methods developed in Replica Manager

•previous Data Management operations were single-file and blocking

•bulk operation functionality added to the Transfer Agent/Request

•monitoring of asynchronous FTS jobs required

•information for monitoring stored within Request DB entry
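A hedged sketch of what a Request DB entry might carry to support this asynchronous monitoring; the field names below are assumptions for illustration, not the actual DIRAC schema.

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    @dataclass
    class ReplicationRequest:
        # Illustrative Request DB entry for one bulk replication request.
        request_id: str
        lfns: List[str]
        source_se: str
        target_se: str
        fts_guid: Optional[str] = None      # set once the FTS job has been submitted
        file_status: Dict[str, str] = field(default_factory=dict)  # LFN -> Waiting/Active/Done/Failed
        status: str = "Waiting"             # overall request state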

[Figure: Integration of the LHCb DIRAC DMS with the LCG SC3 machinery. The DIRAC side comprises the Request DB, File Catalog Interface, Transfer Manager Interface, Replica Manager and Transfer Agent; the LCG side comprises the LCG File Catalog and the File Transfer Service, which drives the transfer network between the Tier0 SE and Tier1 SEs A, B and C.]

Page 2

Once the Transfer Agent obtains the Request XML file:

•replica information for the LFNs obtained
•replicas matched against source SE and target SE
•SURL pairs resolved using endpoint information
•SURL pairs are then submitted via the FTS Client
•FTS GUID and other information on the job stored in the XML file
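A sketch of this submission step, assuming hypothetical wrappers: catalog.get_replicas(lfn) is taken to return a mapping of SE name to SURL, and fts_client.submit(pairs) to return the GUID of the submitted FTS job.

    def submit_replication_request(lfns, source_se, target_se, target_endpoint,
                                   catalog, fts_client):
        # Resolve one source-destination SURL pair per LFN and submit them
        # to FTS as a single bulk job; the returned GUID is what the agent
        # later uses to monitor the asynchronous transfer.
        surl_pairs, skipped = [], []
        for lfn in lfns:
            replicas = catalog.get_replicas(lfn)
            if source_se not in replicas:
                skipped.append(lfn)              # no replica at the requested source
                continue
            source_surl = replicas[source_se]
            target_surl = target_endpoint + lfn  # target namespace mirrors the LFN
            surl_pairs.append((source_surl, target_surl))

        fts_guid = fts_client.submit(surl_pairs)
        return fts_guid, skipped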

[Figure: Transfer Agent submission cycle. Steps: Obtain Job Parameters, Resolve PFNs, Resolve SURL Pairs, Submit FTS Job, Update DB with Job Info, Update Monitoring with Job Info. Components involved: Request DB, Transfer Agent, Replica Manager, FTS Client, LCG File Catalog, DIRAC Config Svc, DIRAC Monitoring.]

[Figure: Transfer Agent monitoring cycle. Steps: Obtain Job Info, Get Job Status, Resolve Failed or Succeeded, Update Request and Monitoring; if the job is terminal: Register Completed Files, Resubmit Failed Files to Request, Send Accounting Data. Components involved: Request DB, Transfer Agent, Replica Manager, FTS Client, LCG File Catalog, DIRAC Accounting, DIRAC Monitoring.]

The Transfer Agent is executed periodically using ‘runit’ daemon scripts:

•replication request information retrieved from the Request DB
•status of the FTS job obtained via the FTS Client
•status (active, done, failed) of individual files obtained
•Request XML file updated
•monitoring information sent to allow web-based tracking

If the FTS job has reached a terminal state:

•completed files are registered in the file catalogs
•failed files constructed into a new replication request
•accounting information sent to allow bandwidth measurements
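A sketch of this monitoring step, under the same assumptions as the submission sketch above; the fts_client, catalog, monitoring and accounting objects, and the state names, are illustrative stand-ins for the real components.

    TERMINAL_STATES = {"Done", "Failed"}          # illustrative terminal FTS states

    def monitor_fts_job(request, fts_client, catalog, monitoring, accounting):
        # Poll the FTS job referenced by this request and update the bookkeeping.
        job_status, file_states = fts_client.status(request["fts_guid"])
        request["file_status"].update(file_states)     # per-file active/done/failed
        monitoring.send(request["request_id"], job_status, file_states)

        if job_status not in TERMINAL_STATES:
            return None                                # check again next cycle

        done = [lfn for lfn, state in file_states.items() if state == "Done"]
        failed = [lfn for lfn, state in file_states.items() if state == "Failed"]

        for lfn in done:
            catalog.register_replica(lfn, request["target_se"])  # register completed files
        accounting.send(request["request_id"], len(done), len(failed))

        if failed:
            # Failed files are packaged into a fresh replication request.
            return {"lfns": failed,
                    "source_se": request["source_se"],
                    "target_se": request["target_se"]}
        return None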

Performance Obtained During T0-T1 Replication

[Figure: T0-T1 replication rate (MB/s) per transfer channel against date, 9/10/05 to 6/11/05, for CERN_Castor -> RAL_dCache-SC3, PIC_Castor-SC3, SARA_dCache-SC3, IN2P3_HPSS-SC3, GRIDKA_dCache-SC3 and CNAF_Castor-SC3. The required rate is indicated; annotated periods mark many Castor 2 problems, a service intervention and SARA problems.]

A combined rate of 40 MB/s from CERN to the 6 LHCb Tier1s was required to meet the SC3 goals:

•aggregated daily rate was obtained
•overall SC3 machinery not completely stable
•target rate not sustained over the required period
•peak rates of 100 MB/s were observed over several hours
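For scale (a rough figure assuming decimal units and ignoring overheads), sustaining the combined 40 MB/s target corresponds to moving the 8 TB of digitised data in about

\[ \frac{8\ \mathrm{TB}}{40\ \mathrm{MB/s}} = \frac{8 \times 10^{6}\ \mathrm{MB}}{40\ \mathrm{MB/s}} = 2 \times 10^{5}\ \mathrm{s} \approx 2.3\ \mathrm{days}. \]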

A rerun of the exercise is planned to demonstrate the required rates.

Tier1–Tier1 Replication Activity Ongoing

During T0-T1 replication FTS was found to be most efficient when replicating files pre-staged on disk.

•dedicated disk pools set up at T1 sites for seed files
•1.5 TB of seed files transferred to dedicated disk
•FTS servers were installed by T1 sites
•channels set up directly between sites

Replication activity is ongoing with this exercise. The current status of this setup is shown below.

•PIC: no FTS server; channels managed by the source SE
•IN2P3: FTS server; manages incoming channels
•CNAF: FTS server; manages incoming channels
•FZK: FTS server; manages incoming channels
•RAL: FTS server; manages incoming channels
•SARA: FTS server; manages incoming channels

Bulk File Removal Operations

Bulk removal of files was performed on completion of the T0-T1 replication.

•bulk operation of ‘srm-advisory-delete’ used
•takes a list of SURLs and ‘removes’ the physical files
•functionality added to the Replica Manager and Storage Element
•additions required for the SRM Storage Element plug-in
•Replica Manager SURL resolution tools reused
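A minimal sketch of driving such a bulk removal from Python by batching SURLs into calls to the SRM v1 client. The exact srm-advisory-delete command-line options varied between SRM client distributions, so treat this invocation as an assumption.

    import subprocess

    def bulk_remove(surls, batch_size=100):
        # Remove physical files by passing batches of SURLs to the SRM v1
        # 'srm-advisory-delete' client; one call per batch keeps the per-file
        # overhead (process start-up, authentication) low.
        for start in range(0, len(surls), batch_size):
            batch = surls[start:start + batch_size]
            subprocess.check_call(["srm-advisory-delete"] + batch)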

Different interpretations of the SRM standard have led to different underlying behaviour between SRM solutions.

Initially, bulk removal operations were executed by a single central agent:

•SC3 goal of 50K replicas in 24 hours shown to be unattainable

Several parallel agents were instantiated:

•each performing physical and catalog removal for a specific SE
•10k replicas were removed from 5 sites in 28 hours
•performance loss observed in replica deletion on the LCG FC (see below)
•unnecessary SSL authentications are CPU intensive
•remedied by ‘sessions’ when performing multiple catalog operations
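A sketch of the parallel-agent arrangement, with one agent per SE and the catalogue removals wrapped in a session. The storage and catalog interfaces are illustrative stand-ins; start_session/end_session stand in for the LFC session calls that avoid a separate SSL handshake per operation.

    from threading import Thread

    def removal_agent(se_name, replicas, storage, catalog):
        # One removal agent per Storage Element (interfaces are illustrative).
        catalog.start_session()                      # stand-in for an LFC session start
        try:
            for lfn, surl in replicas:
                storage.remove_file(surl)            # physical removal at this SE
                catalog.remove_replica(lfn, se_name) # remove the catalogue entry
        finally:
            catalog.end_session()                    # stand-in for an LFC session end

    def run_removal_agents(per_se_replicas, storage_for, catalog_for):
        # per_se_replicas: SE name -> list of (LFN, SURL) pairs to remove.
        agents = [Thread(target=removal_agent,
                         args=(se, reps, storage_for(se), catalog_for(se)))
                  for se, reps in per_se_replicas.items()]
        for agent in agents:
            agent.start()
        for agent in agents:
            agent.join()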

[Figure: Time for 100 replica removals (s) against removal phase. Phase 1: RAL; Phase 2: GRIDKA, IN2P3; Phase 3: GRIDKA, IN2P3, CNAF, PIC; Phase 4: GRIDKA, IN2P3, CNAF, PIC, RAL.]

Operation of DIRAC Bulk Transfer Mechanics

The DIRAC integration with FTS was deployed centrally:

•managed machine at CERN
•services all data replication jobs for SC3

Lifetime of a bulk replication job:

•bulk replication requests submitted to the DIRAC WMS
•JDL file with an input sandbox of an XML file
•the XML contains the important parameters, e.g. LFNs, source/target SE
•the DIRAC WMS populates the Request DB of the central machine with the XML
•the Transfer Agent polls the Request DB periodically for ‘waiting’ requests
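A sketch of how such a request body might be built before being shipped in the job's input sandbox; the element and attribute names are assumptions for illustration, not the actual DIRAC request schema.

    import xml.etree.ElementTree as ET

    def build_request_xml(lfns, source_se, target_se):
        # Illustrative request body carried in the job's input sandbox.
        request = ET.Element("REQUEST", Type="transfer", Status="waiting",
                             SourceSE=source_se, TargetSE=target_se)
        for lfn in lfns:
            ET.SubElement(request, "LFN").text = lfn
        return ET.tostring(request)

    # The DIRAC WMS would place this XML into the Request DB, and the Transfer
    # Agent would later pick it up by polling for requests with Status="waiting".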

