LHCb: CCRC’08 Post Mortem

Transcript
Page 1: LHCb: CCRC’08 Post Mortem

LHCb: CCRC’08 Post Mortem
Nick Brook

Page 2: LHCb: CCRC’08 Post Mortem

Use of computing centres
• Main user analysis supported at CERN + 6 “Tier-1” centres
• Tier-2 centres essentially Monte Carlo production facilities
• Plan to make use of the LHCb online farm for re-processing during shutdown

Page 3: LHCb: CCRC’08 Post Mortem

Planned tasks
• Raw data distribution from pit → T0 centre
  • Use of rfcp into CASTOR from pit - T1D0
• Raw data distribution from T0 → T1 centres
  • Use of FTS - T1D0
• Recons of raw data at CERN & T1 centres
  • Production of rDST data - T1D0
  • Use of SRM 2.2
• Stripping of data at CERN & T1 centres
  • Input data: RAW & rDST - T1D0
  • Output data: DST - T1D1
  • Use of SRM 2.2
• Distribution of DST data to all other centres
  • Use of FTS - T0D1 (except CERN, T1D1)

All tasks envisaged during data taking in 2008

Page 4: LHCb: CCRC’08 Post Mortem

Planned tasks
• May activities
  • Maintain the equivalent of 1 month of data taking
  • All production activities
  • Assuming a 50% machine cycle efficiency
• Run fake analysis activity in parallel to production-type activities
  • Analysis-type jobs were used for debugging throughout the period
  • GANGA testing ran for the last weeks at a low level

Page 5: LHCb: CCRC’08 Post Mortem

SRM space tokens
• Used SRM 2.2 SEs in CCRC08
• LHCb space tokens are (mapping sketched below):
  • LHCb_RAW (T1D0)
  • LHCb_RDST (T1D0)
  • LHCb_M-DST (T1D1)
  • LHCb_DST (T0D1)
  • LHCb_MC_M-DST (T1D1)
  • LHCb_MC_DST (T0D1)
  • LHCb_FAILOVER (T0D1)
  • LHCb_USER (T0D1)
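The TxDy labels encode the SRM storage class: Tx is the number of tape copies, Dy the number of guaranteed disk copies, so T1D0 means tape-backed with only a volatile disk cache and T0D1 means disk-only. A minimal sketch of how a data-management script might encode this mapping (illustrative only, not LHCb's actual configuration format):

```python
# Illustrative mapping of the LHCb space tokens listed above to their SRM
# storage classes. TxDy: x = tape copies, y = guaranteed disk copies
# (T1D0 = tape-backed with volatile disk cache; T0D1 = disk-only).
SPACE_TOKENS = {
    "LHCb_RAW":      "T1D0",
    "LHCb_RDST":     "T1D0",
    "LHCb_M-DST":    "T1D1",
    "LHCb_DST":      "T0D1",
    "LHCb_MC_M-DST": "T1D1",
    "LHCb_MC_DST":   "T0D1",
    "LHCb_FAILOVER": "T0D1",
    "LHCb_USER":     "T0D1",
}

def needs_tape(token: str) -> bool:
    """True if files written under this space token acquire a tape copy."""
    return SPACE_TOKENS[token].startswith("T1")
```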

Page 6: LHCb: CCRC’08 Post Mortem

Activities across the sites
• Planned breakdown of processing activities (CPU needs) based on resource pledges:

  Site     Fraction (%)
  CERN     14
  CNAF      9
  GridKa   11
  IN2P3    25
  NL-T1    26
  PIC       4
  RAL      11

Page 7: LHCb: CCRC’08 Post Mortem

Pit → Tier 0
• Use of rfcp to copy data from pit to CASTOR (LHCb_RAW - T1D0)
• rfcp is the recommended approach from IT
• A file sent every ~30 sec
• Data remain on online disk until successful CASTOR migration (loop sketched below)
• Rate to CASTOR - ~70 MB/s

In general ran smoothly:
• Stability problems with the online storage area - solved with a firmware update during CCRC
• Internal issues with sending bk-keeping info
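A hedged sketch of the copy-and-verify loop this slide describes (not the actual LHCb online transfer agent): rfcp is the real CASTOR copy command, but the migration check below is a hypothetical placeholder for a query to the CASTOR stager.

```python
# Sketch of the pit → CASTOR loop described above; not LHCb's real agent.
import os
import subprocess
import time

def migrated_to_tape(castor_path: str) -> bool:
    """Hypothetical placeholder: in practice one would query the CASTOR
    stager for the file's tape-migration status."""
    raise NotImplementedError

def ship_raw_file(online_path: str, castor_path: str) -> None:
    # rfcp copies the RAW file into the LHCb_RAW (T1D0) area in CASTOR.
    subprocess.check_call(["rfcp", online_path, castor_path])
    # Keep the online-disk copy until CASTOR confirms migration to tape.
    while not migrated_to_tape(castor_path):
        time.sleep(60)
    os.remove(online_path)
```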

Page 8: LHCb: CCRC’08 Post Mortem

Tier 0 → Tier 1
• FTS from CERN to Tier-1 centres
• Destination: LHCb_RAW (T1D0)
• Transfer of RAW will only occur once data has migrated to tape & the checksum is verified (submission sketched below)
  • No occurrence of checksum problems…
  • … moved to transferring immediately, then cleaning up at the remote sites if the integrity check failed
  • This proved problematic - stuck with the original approach
• Rate out of CERN - ~35 MB/s averaged over the period
• Peak rate far in excess of requirement
• In smooth running, sites matched LHCb requirements
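A sketch of the submission step under the original approach (transfer only after tape migration and checksum verification). glite-transfer-submit was the FTS client of that era; the endpoint URL and the surrounding checks are assumptions for illustration.

```python
# Illustrative FTS submission for the T0 → T1 RAW replication; the
# endpoint URL is a placeholder, and the tape/checksum checks are assumed
# to have already passed before this is called (the "original approach").
import subprocess

FTS_ENDPOINT = "https://fts.example.cern.ch:8443/"  # placeholder value

def replicate_raw(source_surl: str, dest_surl: str) -> str:
    """Submit a single-file FTS transfer; the CLI prints the job ID."""
    out = subprocess.check_output(
        ["glite-transfer-submit", "-s", FTS_ENDPOINT, source_surl, dest_surl])
    return out.decode().strip()
```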

Page 9: LHCb: CCRC’08 Post Mortem

(Figure only - no transcript text for this slide.)

Page 10: LHCb: CCRC’08 Post Mortem

Tier 0 → Tier 1
• To first order, all transfers eventually succeeded
• Plot shows efficiency on 1st attempt…
• (Plot annotations: issue with UK certificates; CERN outage; CERN FTS/SRM endpoint problems; restart of IN2P3 SRM endpoint)

Page 11: LHCb: CCRC’08 Post Mortem

Reconstruction
• Input 1 RAW file & output 1 rDST file
• Used SRM 2.2 SEs in CCRC08
  • LHCb space tokens: LHCb_RAW (T1D0), LHCb_RDST (T1D0)
• Data shares need to be preserved
  • Important for resource planning
• Reduced number of events per recons job from 50k to 25k (job ~12 hour duration on a 2.8 kSI2k machine)
  • High-lumi & low-lumi files
  • The longer 50k-event jobs caused stalled jobs in the Feb phase - queue length issues

Page 12: LHCb: CCRC’08 Post Mortem

Reconstruction Issues
• Job created immediately after file transferred, so the file should be online, but…
• LHCb pre-stages files (srm_bringonline) & then checks the status of the file (srm_ls) via GFAL before submitting the pilot job (logic sketched below)
  • Pre-stage should ensure access availability from cache
  • Issues at NL-T1 with reporting of file status
  • Problem developed at IN2P3 right at the end of CCRC08 - 31st May
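The control flow, as a hedged sketch: the gfal_* helpers below are hypothetical stand-ins for the GFAL calls named on the slide (srm_bringonline, srm_ls), not a real GFAL v1 API. The NL-T1 and IN2P3 failures were precisely in the status-reporting step of this loop.

```python
# Hypothetical sketch of LHCb's pre-stage-then-check logic; the gfal_*
# helpers stand in for the real GFAL srm_bringonline / srm_ls calls.
import time

def gfal_bringonline(surl: str) -> None:
    """Hypothetical: ask the SRM to recall the file from tape to disk cache."""
    raise NotImplementedError

def gfal_locality(surl: str) -> str:
    """Hypothetical: return the SRM file locality, e.g. 'ONLINE' or 'NEARLINE'."""
    raise NotImplementedError

def wait_until_staged(surl: str, poll: int = 300, timeout: int = 86400) -> bool:
    """Pre-stage a file, then poll until the SRM reports it ONLINE; only
    then is the pilot job submitted. A storage element that reports the
    wrong locality (the NL-T1 dCache bug) defeats exactly this check."""
    gfal_bringonline(surl)
    deadline = time.time() + timeout
    while time.time() < deadline:
        if gfal_locality(surl) == "ONLINE":
            return True
        time.sleep(poll)
    return False
```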

Page 13: LHCb: CCRC’08 Post Mortem

Reconstruction
• 41.2k reconstruction jobs created
• 27.6k jobs proceeded to the done state
• Done/created ~67%

  Site     Created jobs   Done jobs    Done/created
  CERN     6.1k (14%)     5.3k (13%)   86%
  CNAF     3.9k (9%)      2.8k (7%)    72% (80%)
  GridKa   4.1k (11%)     3.1k (7%)    76%
  IN2P3    10.3k (25%)    6.1k (14%)   56% (70%)
  NIKHEF   10.3k (26%)    2.3k (6%)    23%
  PIC      1.8k (4%)      1.6k (4%)    89%
  RAL      4.7k (11%)     3.5k (8%)    74%

Page 14: LHCb: CCRC’08 Post Mortem

Reconstruction
• 27.6k reconstruction jobs in the done state
• 21.2k jobs processed 25k events
• Done → 25k events ~77%
• 3.0k jobs failed to upload rDST to the local SE (only 1 attempt before trying failover)
• Failover / 25k events ~13%

  Site     25k events    Fail upload   Success/Created
  CERN     5.2k (100%)   0.7k (14%)    76%
  CNAF     2.6k (95%)    0.0k (1%)     67% (75%)
  GridKa   3.0k (99%)    0.7k (22%)    58%
  IN2P3    5.1k (90%)    0.7k (14%)    43% (54%)
  NIKHEF   1.2k (53%)    0.9k (70%)    4%
  PIC      1.6k (99%)    0.0k (0%)     89%
  RAL      3.1k (89%)    0.0k (1%)     68%

Page 15: LHCb: CCRC’08 Post Mortem

Reconstruction - CERN
Success/Created = 76%

Once RAW data available at the site, very efficient running
• 99% efficiency - limited to 300 jobs?
• Issues with uploading data to the local SE (LHCb_rDST) - 13% failure rate
  • 73% associated with the file registration procedure in LFC
  • Only spotted in post-CCRC08 analysis of log files - under further study

Page 16: LHCb: CCRC’08 Post Mortem

Reconstruction - CNAF
Success/Created = 67% (75%)

Very efficient at start of CCRC08
Some issues in accessing the s/w area due to load
• Leads to jobs stalling/crashing
• Primarily due to a faulty switch
• Switch not reporting the problem…
• Latency still observed - moving to a VO-independent GPFS file system
Some stability issues associated with the SRM CASTOR endpoint (1.2.21)
• Old version of SRM (1.3.21) - planning to upgrade after CCRC

Page 17: LHCb: CCRC’08 Post Mortem

Reconstruction - GridKa
Success/Created = 58%

Issues with dCache servers at the beginning of CCRC
• Slow response from gfal_ls - reoptimisation needed
Issue with “tmpwatch” cleaning running jobs’ working directories (a generic guard is sketched below)
Since then, site very efficient running
• 99% efficiency
• GridKa had issues with uploading data to the local SE (LHCb_rDST) - 22% failure rate
  • ~31% due to “no space left on device” - seemed cured when dcap doors & file handles were increased & dcap doors restarted on ~27th May
  • Also issues with checking metadata to ensure file integrity (gfal_ls)
Backlog catch-up!
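A common generic guard against this class of problem (not necessarily the fix GridKa or NL-T1 applied) is to keep the job's working directory "fresh" so an atime/mtime-based cleaner such as tmpwatch does not reap it mid-run; a minimal sketch:

```python
# Generic keep-alive against tmpwatch-style cleaners: periodically touch
# everything under the job's working directory so it never looks stale.
# Illustrative only; not taken from the LHCb or site tooling.
import os
import threading
import time

def keep_alive(workdir: str, interval: int = 1800) -> threading.Thread:
    """Start a daemon thread that refreshes timestamps under workdir."""
    def touch_loop():
        while True:
            for root, _dirs, files in os.walk(workdir):
                os.utime(root, None)  # bump the directory itself
                for name in files:
                    os.utime(os.path.join(root, name), None)
            time.sleep(interval)
    t = threading.Thread(target=touch_loop, daemon=True)
    t.start()
    return t
```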

Page 18: LHCb: CCRC’08 Post Mortem

Reconstruction - IN2P3
Success/Created = 43% (54%)

IN2P3 originally stable - from 5th-14th May
From 14th May, many problems…
• LHCb stager agent began to fail - resolved by moving to a newer version of GFAL from AA
• Also “core dump” in file access on the WN
• No longer able to access data directly from the application via (gsi)dcap servers for the rest of CCRC08
Fallback solution: copy i/p data to the WN (from IN2P3) via GridFTP (sketched below)
• Worked successfully… but not a long-term solution
(Plot annotation: i/p dataset download start)
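The fallback amounts to replacing remote POSIX-like access through the dcap door with an up-front copy to the worker node. A minimal sketch using globus-url-copy, the standard GridFTP client (URLs and paths illustrative):

```python
# Sketch of the GridFTP fallback: copy the input file to the worker node
# instead of opening it remotely via (gsi)dcap. URLs are illustrative.
import subprocess

def download_input(gsiftp_url: str, local_path: str) -> None:
    """Stage the input file onto local WN disk before the job opens it."""
    subprocess.check_call(
        ["globus-url-copy", gsiftp_url, "file://" + local_path])
```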

Page 19: LHCb: CCRC’08 Post Mortem

Reconstruction - NL-T1
Success/Created = 4%

Issue with “tmpwatch” cleaning running jobs’ working directories
Unable to access data from NL-T1 for large periods
• Original problem: unable to upload output data to the SE - dCache upgrade
• → Instability of gsidcap servers (12th May)
• LHCb stager agent failed to access the status of files
  • A week to convince the site it was not an LHCb issue!
  • dCache reporting incorrect status of files
  • Bug introduced in the latest patch of dCache
No fallback solution attempted
• Fundamental problem
• Abandoning the stager approach leads to very inefficient jobs

Page 20: LHCb: CCRC’08 Post Mortem

Reconstruction - PIC
Success/Created = 89%

Once RAW data available at the site, very efficient running
• 99% efficiency
• Small problem with a misconfigured WN
  • 1 WN could not access data - solved within 24 hours

Page 21: LHCb: CCRC’08 Post Mortem

Reconstruction - RAL
Success/Created = 68%

RAL originally stable - from 5th-20th May
From 20th May:
• Application crashing with a seg fault in RFIO
• Issue resolved May 30th - not understood
Fallback solution: copy i/p data to the WN (from RAL) via GridFTP
• Worked successfully - not a long-term solution
(Plot annotation: i/p dataset download start (+catch-up))

Page 22: LHCb: CCRC’08 Post Mortem

Reconstruction
• Low efficiency at CNAF due to:
  • s/w area access
  • more jobs than cores on a WN…
• Low efficiency at RAL & IN2P3 due to data download
  • Resolved through tuning timeouts

CPU efficiency based on the ratio of CPU time to wallclock time on “done” jobs (see the sketch below)
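For concreteness, the metric quoted here as a one-line sketch: the CPU efficiency of a "done" job is its consumed CPU time divided by its elapsed wall-clock time, so stalls while downloading input data show up directly as a low ratio.

```python
def cpu_efficiency(cpu_seconds: float, wallclock_seconds: float) -> float:
    """CPU/wall-clock ratio of a finished job: e.g. 10 h of CPU over a
    12 h wall-clock duration gives ~0.83; download stalls lower it."""
    return cpu_seconds / wallclock_seconds
```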

Page 23: LHCb: CCRC’08 Post Mortem

dCache Observations
Official LCG recommendation - 1.8.0-15p3

LHCb ran smoothly at half of the T1 dCache sites:
• PIC OK - version 1.8.0-12p6 (unsecure)
• GridKa OK - version 1.8.0-15p2 (unsecure)
• IN2P3 problematic - version 1.8.0-12p6 (secure)
  • “core dumps” - needed to ship a version of GFAL to run
  • Could explain the CGSI-gSOAP problem?
• NL-T1 problematic (secure)
  • Many versions during CCRC to solve a number of issues
  • 1.8.0-14 → 1.8.0-15p3 → 1.8.0-15p4
  • “Failure to put data - empty file” → “missing space token” problem → incorrect metadata returned (NEARLINE issue)

Page 24: LHCb: CCRC’08 Post Mortem

Databases
• Conditions DB used at CERN & Tier-1 centres
  • No replication tests of the Conditions DB pit ↔ Tier-0 (and beyond)
  • Switched to using the Conditions DB for reconstruction on 15th May
    • Issues observed accessing LFC via COOL
    • Still under investigation with LFC experts
• LFC
  • Use “streaming” to populate the read-only instances at T1s from CERN
  • A problem with the CERN instance revealed local instances were not being used by LHCb!
  • Testing underway now

Page 25: LHCb: CCRC’08 Post Mortem

Stripping
• Stripping on rDST files
  • Input - 1 rDST file & the associated RAW file
  • Space tokens: LHCb_RAW (T1D0) & LHCb_RDST (T1D0)
• DST files & ETC produced during the process stored locally on T1D1 (additional storage class)
  • Space token: LHCb_M-DST
• DST & ETC files then distributed to all other computing centres on T0D1 (except CERN, T1D1)
  • Space tokens: LHCb_DST (LHCb_M-DST)

Page 26: LHCb: CCRC’08 Post Mortem

Stripping
• 31.8k stripping jobs were submitted
• 17.0k failed to resolve datasets
• 9.3k jobs ran to “Done”
• Major issues with LHCb bk-keeping
• Detailed analysis of logs still ongoing

  Site     Submitted (datasets resolved)   Done
  CERN     2.4k                            2.3k
  CNAF     2.3k                            2.0k
  GridKa   2.0k                            2.0k
  IN2P3    4.5k                            0.2k
  NIKHEF   0.3k                            <0.1k
  PIC      1.1k                            1.1k
  RAL      2.2k                            1.6k

Page 27: LHCb: CCRC’08 Post Mortem

Stripping: T1-T1 transfers

Stripping limited to 4 T1 centres: CNAF, GridKa, PIC, RAL

(Plot annotations: stripping reduction factor too small; issues with SRM tokens at PIC)

Page 28: LHCb: CCRC’08 Post Mortem

Lessons Learnt for DIRAC3
• Improved error reporting in workflow & pilot logs
  • Careful checking of log files was required for detailed analysis
• Full failover mechanism is in place but not yet deployed
  • Only CERN was used for CCRC08
• Alternative forms of data access
  • Minor tuning of the timeouts for downloading input data was required (see the sketch after this list)
  • 2 timeouts needed: time of copy & activity timeout
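A hedged sketch of the two-timeout scheme named above (not DIRAC3's actual code): one limit on the total copy time, plus an activity timeout that aborts when the partial file stops growing.

```python
# Illustrative two-timeout download wrapper: an overall limit on the whole
# copy, plus an "activity" timeout that fires if no bytes arrive for a while.
import os
import subprocess
import time

def download_with_timeouts(cmd, dest, copy_timeout=3600, activity_timeout=300):
    """Run a copy command (e.g. a GridFTP client invocation) writing to
    dest; abort on either the total-time or the inactivity timeout."""
    proc = subprocess.Popen(cmd)
    start = last_change = time.time()
    last_size = -1
    while proc.poll() is None:
        time.sleep(10)
        size = os.path.getsize(dest) if os.path.exists(dest) else 0
        if size != last_size:  # progress: reset the activity clock
            last_size, last_change = size, time.time()
        if time.time() - start > copy_timeout:
            proc.kill()
            raise TimeoutError("copy exceeded total time limit")
        if time.time() - last_change > activity_timeout:
            proc.kill()
            raise TimeoutError("stalled copy: no data received recently")
    if proc.returncode != 0:
        raise RuntimeError("copy command failed")
```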

Page 29: LHCb: CCRC’08 Post Mortem

Summary
• Data transfer during CCRC08 using FTS was successful
• Still plagued with many issues associated with data access
  • Issues improved since Feb CCRC08, but…
  • 2 sites problematic for large chunks of CCRC08 - 50% of LHCb resources!
  • Problems mainly associated with access via dCache
  • Commencing tests with xrootd
• DIRAC3 tools improved significantly since Feb
  • Still need improved reporting of problems
• LHCb bk-keeping remains a major concern
  • New version due prior to data taking
• LHCb needs to implement better interrogation of log files

