Page 1: Review of 2012 Distributed Computing Operations

Review of 2012 Distributed Computing Operations

Stefan Roiser
66th LHCb Week – NCB Meeting

26 November, 2012

Page 2: Review of 2012 Distributed Computing Operations


Content

• 2012 data processing activities
• Current issues and work in progress
• Processing outlook for 2013 and beyond
• Conclusion


Page 3: Review of 2012 Distributed Computing Operations


2012 ACTIVITIES


Page 4: Review of 2012 Distributed Computing Operations


2012 Processing Overview (all plots in this talk cover the period since 1 Jan)


[Plots: all successful jobs of 2012; all successful work grouped by activity (Simulation, Reconstruction, Reprocessing, Stripping, User); CPU efficiency of successful jobs (Total, User, Simulation, Stripping, Reprocessing, Prompt Processing)]

LHCb was making good use of the provided resources

Page 5: Review of 2012 Distributed Computing Operations


Successful Work in 2012 by Site


Total 3.5 M CPU days
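As a cross-check of scale, the 3.5 M CPU days can be turned into an average number of concurrently busy job slots. The sketch below is a back-of-the-envelope Python estimate assuming roughly 330 elapsed days between 1 January and the date of the talk; that assumption is mine, not from the slides.

```python
# Back-of-the-envelope: average number of concurrently busy job slots
# implied by the total successful work shown on this slide.
# Assumption (not from the slides): ~330 days elapsed between 1 Jan
# and 26 Nov 2012.

total_cpu_days = 3.5e6      # "Total 3.5 M CPU days" (from the slide)
elapsed_days = 330          # 1 Jan -> 26 Nov 2012, approximate

avg_concurrent_jobs = total_cpu_days / elapsed_days
print(f"average concurrently running jobs ~ {avg_concurrent_jobs:,.0f}")
# -> roughly 10,600 single-core job slots busy on average over the year
```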

Page 6: Review of 2012 Distributed Computing Operations


MC Simulation


• Mostly running at Tier2 sites
– But also at Tier0/1 when resources are available
– Usually running with lowest priority
– Low error rate, as is also true for other production activities

Job Characteristics
• No input data
• Uploading output to the “closest” storage element

Page 7: Review of 2012 Distributed Computing Operations


Prompt Reconstruction


• First-pass reconstruction of detector data
– Usually 100% of RAW files are processed. Since reprocessing started, only partial (~30%) reconstruction at CERN + “attached T2s”

Job Characteristics
• At Tier0/1 sites
• ~3 GB of input downloaded from tape
• ~5 GB of output written to tape
• Job duration ~36 hours

Page 8: Review of 2012 Distributed Computing Operations


Data Reprocessing

• Reprocessing of 2012 data started mid September
• Pushing the system to its limits
– Running up to 15k reconstruction jobs was a very good stress test for post-LS1 prompt processing
– Data processing was smooth
– Hitting limits with data movement, e.g. staging from tape (see Philippe’s talk; a rough bandwidth estimate follows below)

Job Characteristics
• Same as prompt reconstruction
• + also running at Tier2 sites
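Combining the per-job figures quoted on the previous slide (~3 GB input, ~5 GB output, ~36 hours) with the peak of ~15k concurrent reconstruction jobs gives a feel for the data-movement load behind the limits mentioned above. The sketch below is a rough estimate only; the per-job numbers are averages and the real transfer pattern is burstier.

```python
# Rough estimate of the aggregate I/O implied by the reprocessing peak,
# using the per-job characteristics quoted for (prompt) reconstruction.

gb_in_per_job = 3.0       # ~3 GB RAW input downloaded from tape
gb_out_per_job = 5.0      # ~5 GB output written to tape
job_duration_h = 36.0     # ~36 hours per job
concurrent_jobs = 15000   # peak number of reconstruction jobs

gb_per_job = gb_in_per_job + gb_out_per_job
mbit_s_per_job = gb_per_job * 8e3 / (job_duration_h * 3600)   # 1 GB = 1e9 B
aggregate_gbit_s = mbit_s_per_job * concurrent_jobs / 1e3

print(f"per job: ~{mbit_s_per_job:.2f} Mbit/s sustained")
print(f"at 15k jobs: ~{aggregate_gbit_s:.1f} Gbit/s aggregate")
# -> ~0.5 Mbit/s per job, ~7.4 Gbit/s in aggregate at the peak
```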

Page 9: Review of 2012 Distributed Computing Operations


2012 Data Reprocessing


41 Tier 2 sites were involved in the activity, downloading RAW files from Tier 1 storage and providing ~50% of the work for the data reconstruction jobs.

Page 10: Review of 2012 Distributed Computing Operations


“Attached” Tier 2 Sites

• A page providing the updated status of “attached T2 sites” and their storage is available at http://lhcbproject.web.cern.ch/lhcbproject/Reprocessing/sites.html
– Useful for Tier2 sites to know from/to where they receive/provide the data for reprocessing
– Since this year we have the possibility (and have used it) to re-attach sites to another storage element when processing power was needed (a small illustration follows below)
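To illustrate the re-attach operation, the sketch below models the attachment of Tier2 sites to T1 storage elements as a plain mapping and moves one site to a different storage element. All site and SE names are hypothetical placeholders; the real attachments are listed on the page above.

```python
# Illustrative sketch only: model "attached" Tier2 sites as a mapping
# from T2 site name to the T1 storage element it downloads RAW from and
# uploads output to. All names below are hypothetical placeholders.

attachments = {
    "T2-Site-A": "T1-SE-1",
    "T2-Site-B": "T1-SE-1",
    "T2-Site-C": "T1-SE-2",
}

def reattach(site: str, new_se: str) -> None:
    """Point a T2 site at a different T1 storage element,
    e.g. when extra processing power is needed behind that SE."""
    old_se = attachments.get(site)
    attachments[site] = new_se
    print(f"{site}: {old_se} -> {new_se}")

reattach("T2-Site-B", "T1-SE-2")
```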


Page 11: Review of 2012 Distributed Computing Operations


User Activities


• Constant “background” of failed jobs
• Higher activity until the summer (around ICHEP); fewer running jobs since then
• Fewer submissions during weekends

Job Characteristics
• T0/1 sites
• Remote input data access
• Duration ??

Page 12: Review of 2012 Distributed Computing Operations


ISSUES AND WORK IN PROGRESS


Page 13: Review of 2012 Distributed Computing Operations


Issues at Sites


• 7 Nov: power cut at RAL. The site managed to recover within 24 hours
• Mid Oct: disk server failure at CNAF, storage out for several days. After recovery CNAF allotted double the amount of job slots in order to recover
• Mostly “business as usual”
– Pilots aborted, memory consumption by jobs, …
• Jobs cannot find their input data, mostly at IN2P3 and GRIDKA, i.e. sites with several “attached” T2s, due to overload of the SRM

Page 14: Review of 2012 Distributed Computing Operations


Queue Info

• Page providing information on queues as seen from LHCb via the BDII
– Some sites seem to provide wrong information
– As a consequence LHCb submits jobs to queues it shouldn’t, and the local batch system kills those jobs once the maximum CPU time is used
• We had to temporarily remove some sites from reprocessing because of this
– A campaign to clean up these wrong values is currently ongoing (the sketch below shows how the published values can be inspected)

http://lhcbproject.web.cern.ch/lhcbproject/Operations/queues.html
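The published queue values can also be inspected directly in the BDII. The sketch below is a minimal example using the python-ldap package against the top-level BDII, assuming the GLUE 1.3 schema in which GlueCEPolicyMaxCPUTime is published in minutes; it only illustrates the kind of check behind the page above and is not the actual LHCb tooling.

```python
# Minimal sketch: list queues (CEs) and their published max CPU time
# from a top-level BDII, using the GLUE 1.3 schema.
# Assumes Python 3 with the python-ldap package; not the LHCb/DIRAC code.
import ldap

TOP_BDII = "ldap://lcg-bdii.cern.ch:2170"

con = ldap.initialize(TOP_BDII)
results = con.search_s(
    "o=grid", ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",
    ["GlueCEUniqueID", "GlueCEPolicyMaxCPUTime"],
)

for dn, attrs in results:
    ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
    max_cpu = attrs.get("GlueCEPolicyMaxCPUTime", [b"?"])[0].decode()
    # GlueCEPolicyMaxCPUTime is in minutes; implausible values are
    # candidates for the cleanup campaign mentioned above.
    print(f"{ce:60s} max CPU time: {max_cpu} min")
```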

Page 15: Review of 2012 Distributed Computing Operations


Interaction with Sites

• Several sites inform us about major downtimes well in advance
– Very welcome, as it facilitates mid-term planning
• How to reach Tier2 sites?
– LHCb does not have the infrastructure to interact constantly with Tier2s
– Can we involve some (wo)men in the middle for this interaction?
• E.g. to pass on info about processing plans, the BDII issue, …


Page 16: Review of 2012 Distributed Computing Operations


CVMFS Deployment

• CVMFS deployment is a high priority for LHCb (+WLCG)
– Once we reach 100% it will facilitate the software distribution process
– All sites supporting LHCb are highly encouraged to install CVMFS (a simple availability check is sketched below)
– Currently 45 out of 96 sites have CVMFS deployed
– Status info available at https://maps.google.com/maps?q=http://cern.ch/lhcbproject/CVMFS-map/cvmfs-lhcb.kml
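A simple way to verify on a worker node that the LHCb CVMFS repository is actually usable is sketched below: it checks that the mount point resolves and that cvmfs_config probe succeeds. This is a generic illustration, not the mechanism behind the status map above.

```python
# Minimal sketch: check whether the LHCb CVMFS repository is usable on
# this node. Generic illustration, not the official deployment check.
import os
import subprocess

REPO = "lhcb.cern.ch"
MOUNT = f"/cvmfs/{REPO}"

def cvmfs_ok() -> bool:
    # Listing the mount point triggers autofs to mount the repository.
    try:
        if not os.path.isdir(MOUNT) or not os.listdir(MOUNT):
            return False
    except OSError:
        return False
    # 'cvmfs_config probe <repo>' verifies the repository end to end.
    probe = subprocess.run(
        ["cvmfs_config", "probe", REPO],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return probe.returncode == 0

if __name__ == "__main__":
    print(f"{REPO}: {'OK' if cvmfs_ok() else 'NOT available'}")
```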


Page 17: Review of 2012 Distributed Computing Operations


OUTLOOK


Page 18: Review of 2012 Distributed Computing Operations


Next Processing Activities


Activity                     Approx. time + duration
2012 data reprocessing       Sep ’12 – Jan ’13
2011 data reprocessing       Beginning of ’13 (~1½ months)
Incremental stripping        ~2x / year in 2013 (~2 months)
2011/12 data reprocessing    During 2014 (~5 months)

Loads on sites’ storage systems
• Reprocessing: Reconstruction + Stripping + Merging
– Reconstruction runs on “attached T2 sites”
– Staging of all RAW data from tape
– Reco output (FULL.DST) migrated to tape (via disk buffer)
– Replication of Merging output (DST) to multiple sites
• Incremental Stripping: Stripping + Merging
– Staging of all FULL.DST files
– Produces up to ~20% additional DST files (see the rough disk estimate below)
– Replication of DST to multiple sites
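To give a feel for what the ~20% figure means in terms of disk, the sketch below parameterises the extra space from one incremental stripping pass by the existing DST volume and the number of disk replicas. Both input values are hypothetical placeholders, not numbers from the talk.

```python
# Illustrative estimate of the extra disk needed by one incremental
# stripping pass. Both inputs below are hypothetical placeholders.

existing_dst_tb = 1000.0    # current DST volume on disk (hypothetical)
extra_fraction = 0.20       # "up to ~20% additional DST files" (slide)
disk_replicas = 3           # number of disk copies kept (hypothetical)

extra_per_copy_tb = existing_dst_tb * extra_fraction
total_extra_tb = extra_per_copy_tb * disk_replicas
print(f"~{extra_per_copy_tb:.0f} TB of new DST, "
      f"~{total_extra_tb:.0f} TB of disk including {disk_replicas} replicas")
```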

Page 19: Review of 2012 Distributed Computing Operations


Conclusions

• Very good support by sites for LHCb operations
– Very good interaction with Tier 1 sites
– Improvements possible for Tier2s
• LHCb has made good use of the provided resources in 2012
– Upcoming reviews of the computing model and tools will have an impact on processes next year
• The 2012 reprocessing was a good stress test for future operations
– Changes in site infrastructures are necessary for post-LS1


Page 20: Review of 2012 Distributed Computing Operations


BACKUP


Page 21: Review of 2012 Distributed Computing Operations

Data Processing Workflow

FULL.DST = reconstructed physics quantities; UNM.DST = temporary output for a physics stream; PHY.DST = file ready for physics user analysis

[Workflow diagram: the data processing steps Reconstruction, Stripping and Merging and the data management steps Replication and Destroy, acting on RAW, FULL.DST, UNM.DST and PHY.DST files placed on tape (D0T1), disk buffer and disk-only (D1T0) storage; summarised in the sketch below.]
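The chain shown in the diagram can also be summarised in a few lines of code. The sketch below lists the processing steps together with the file types they consume and produce, based on the legend above and the storage-load slide in the outlook section; it is a readable summary only, not a real DIRAC workflow definition.

```python
# Readable summary of the workflow in the diagram: each data processing
# step with the file type it reads, the file type it writes, and where
# the output lives. Not a real DIRAC/LHCbDIRAC workflow definition.

WORKFLOW = [
    # (step,            input,      output,     output storage)
    ("Reconstruction",  "RAW",      "FULL.DST", "tape, via disk buffer"),
    ("Stripping",       "FULL.DST", "UNM.DST",  "temporary, removed after Merging"),
    ("Merging",         "UNM.DST",  "PHY.DST",  "disk-only, replicated to multiple sites"),
]

for step, fin, fout, storage in WORKFLOW:
    print(f"{step:15s} {fin:9s} -> {fout:9s} ({storage})")
```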

Page 22: Review of 2012 Distributed Computing Operations


Additional Info

• DIRAC job status plots by final minor status don’t include the statuses “Input Data Resolution” and “Pending Requests” because these are not final statuses
• Yandex is not included in the CPU pie plots because it sometimes provides wrong (too high) information on running jobs and would dominate all plots
