Future of Batch Processing at CERN - HEPiX Fall...

Date post: 23-Oct-2019
PES, CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it. Future of Batch Processing at CERN, HEPiX Fall 2013. Jérôme Belleman, Daniel Pek, CERN IT. October 2013.
Transcript
Page 1: Future of Batch Processing at CERN - HEPiX Fall 2013 (jeromebelleman.gitlab.io/talks/belleman-condor.pdf)

PES

CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it

Future of Batch Processing at CERN
HEPiX Fall 2013

Jérôme Belleman, Daniel Pek
CERN IT

October 2013

Outline

1 The Future, LSF and its Shortcomings

2 Alternative Batch Systems

3 Initial Results

Section 1

The Future, LSF and its Shortcomings

Current Setup

IBM LSF 7.0.6

4 000 nodes

SLC5 → SLC6

Physical → virtual (≈ 1000 virtual so far)

> 65 000 cores

400 000 jobs/day

≈ 70 000 running jobs

A Large, Busy Cluster

[Figure: number of pending and running jobs over time, 15 June to 7 September; y-axis (#jobs) from 100k to 400k]

Goals and Concerns

Goals                         Concerns with LSF
30 000 to 50 000 nodes        6 500 nodes max
Cluster dynamism              Adding/removing nodes requires reconfiguration
10 to 100 Hz dispatch rate    Transient dispatch problems
100 Hz query scaling          Slow query/submission response times
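
To put the goals in perspective, the figures on the Current Setup and Goals slides can be turned into comparable units. The sketch below is only back-of-the-envelope arithmetic: it assumes that the 400 000 jobs/day figure can be read as an average rate spread evenly over the day, which says nothing about peak dispatch rates. The function name is made up for illustration.

```python
# Back-of-the-envelope comparison of today's load with the stated goals.
# Assumption: 400 000 jobs/day is spread evenly over 24 hours.
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_hz(jobs_per_day):
    """Average job rate in Hz for a given daily job count."""
    return jobs_per_day / SECONDS_PER_DAY


current_hz = average_rate_hz(400_000)
print(f"current average rate: {current_hz:.2f} Hz")

# What the 10 to 100 Hz dispatch goal would mean in jobs per day.
for goal_hz in (10, 100):
    print(f"{goal_hz} Hz sustained = {goal_hz * SECONDS_PER_DAY:,} jobs/day")

# Node-count goal relative to the 6 500 node maximum quoted for LSF.
for goal_nodes in (30_000, 50_000):
    print(f"{goal_nodes} nodes = {goal_nodes / 6_500:.1f} x the LSF maximum")
```

Under that assumption today's average is a little under 5 Hz, so the lower end of the dispatch goal is roughly twice the current average.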

Addressing Operational Issues

Hired an LSF consultant

Suggested a few minor enhancements:

Kernel parameters

Memory limits

CPU binding

Logging and accounting file handling

No miraculous improvements

Taught us a lot

Section 2

Alternative Batch Systems

Current Candidates

SLURM 2.5.7

HTCondor 8.1.0

Son of Grid Engine 8.1.3

LSF 8/9

Batch Testing Framework

[Diagram: JobGenerator, JobSubmitter, Harvester, BatchSystem, Poller, DB]

[
  { "cmd": "tests.d/psleep.py 120",
    "count": 1000,
    "instance": "batch",
    "user": "atlas",
    "queue": "debug" }
]
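
The slides do not show how the framework consumes such a description, so the following is only a minimal sketch of one plausible reading: each entry is expanded into `count` identical jobs that a submitter could then hand to the batch system. The function `expand_jobs` and the shape of the returned dictionaries are assumptions made for illustration, not the framework's actual code.

```python
import json
import shlex

# Test description modelled on the one shown on the slide.
DESCRIPTION = """
[
  { "cmd": "tests.d/psleep.py 120",
    "count": 1000,
    "instance": "batch",
    "user": "atlas",
    "queue": "debug" }
]
"""


def expand_jobs(description):
    """Expand a JSON test description into one dict per job to submit.

    Hypothetical helper: the real JobGenerator may work differently.
    """
    jobs = []
    for entry in json.loads(description):
        argv = shlex.split(entry["cmd"])  # command plus its arguments
        for index in range(entry.get("count", 1)):
            jobs.append({
                "index": index,
                "argv": argv,
                "instance": entry["instance"],
                "user": entry.get("user"),
                "queue": entry.get("queue"),
            })
    return jobs


jobs = expand_jobs(DESCRIPTION)
print(len(jobs), jobs[0]["argv"], jobs[0]["queue"])
```

Keeping the expansion separate from the submission step would let the same description drive any of the candidate batch systems, with only the submitter being system-specific.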

Batch Testing Framework

[Diagram: QueryGenerator, QuerySubmitter, Harvester, BatchSystem, Poller, DB]

[
  { "cmd": "tests.d/bjobs.py",
    "count": 1,
    "instance": "batch" }
]
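
A query test of this kind presumably boils down to running a query command and recording how long it takes. The sketch below shows that idea in isolation, under stated assumptions: `time_queries` is a hypothetical helper, and `fake_query` is a stand-in that merely sleeps. In a real test it might be replaced by a call that runs a script such as `tests.d/bjobs.py` against the batch system.

```python
import time


def time_queries(run_query, count):
    """Call run_query `count` times and return each latency in seconds."""
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        run_query()
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_query():
    """Stand-in for a real batch system query (assumption: about 10 ms)."""
    time.sleep(0.01)


latencies = time_queries(fake_query, count=5)
mean = sum(latencies) / len(latencies)
print(f"{len(latencies)} queries, mean latency {mean:.3f} s")
print(f"sequential query rate: about {1 / mean:.0f} Hz")
```

Sequential timing like this only bounds the rate a single client can reach; approaching a 100 Hz query load would need many such clients running in parallel.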

Section 3

Initial Results

Submission Rates

[Figure: job scaling test. Submission rate in Hz (y-axis, 0 to 350) against time in secs (x-axis, 500 to 2000) for SLURM, HTCondor and Son of Grid Engine]
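
A submission rate in Hz, as plotted here, can be derived from per-job submission timestamps by counting submissions in fixed time bins. The slides do not say how the rates were actually computed, so this is only a sketch of one common approach; `submission_rate` and the 10-second bin width are assumptions.

```python
from collections import Counter


def submission_rate(timestamps, bin_seconds=10):
    """Return (bin_start_seconds, rate_hz) pairs for every non-empty bin.

    timestamps: submission times in seconds since the start of the test.
    """
    if bin_seconds <= 0:
        raise ValueError("bin_seconds must be positive")
    counts = Counter(int(t // bin_seconds) for t in timestamps)
    return [(b * bin_seconds, counts[b] / bin_seconds) for b in sorted(counts)]


# Seven made-up submission times, purely for illustration.
timestamps = [0.5, 1.0, 2.5, 9.9, 10.1, 15.0, 31.0]
print(submission_rate(timestamps, bin_seconds=10))
# prints [(0, 0.4), (10, 0.2), (30, 0.1)]
```

Empty bins are omitted rather than reported as 0 Hz, which matters when plotting: a gap then shows as missing data instead of a drop to zero.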

Scalability

Requirements for measurement:

Piggyback on existing resources

Need to simulate many queries

Experience:

Slow SLURM startup with too many hosts?

HTCondor made to scale out

CPU load on the master(s)

General Impressions

                    SLURM                 HTCondor                               Son of Grid Engine        Current LSF
Config (Puppet)     Struggle              Comfortable 1st time, updates harder   Needs shared filesystem   Complex
Maturity            Easy to freeze        Trustworthy                            Rough around the edges    Trustworthy
Dynamic             Not really            Yes                                    Yes                       Not really
Doc                 Poor                  Solid                                  Solid                     Solid
Community support   Rather uninterested   Very enthusiastic                      Enthusiastic              Commercial

Required Features

Grid support

Kerberos/AFS

Accounting

Host normalisation

Fairshare scheduling

Support for commercial applications

IPv6?

Conclusions

Replacement candidates:

SLURM feels too young

HTCondor mature and promising

Son of Grid Engine fast, a bit rough

What’s next:

Host scalability

Query load

Features

Thanks!

Questions?

