Future of Batch Processing at CERN
Jerome Belleman, Ulrich Schwickerath, Iain Steers – IT-PES-PS

HEPiX Spring 2015 Future of Batch Processing at CERN 2

Outline

Context

For Now: Pilot Service

Next Up: Local Jobs

Context

Goals and Concerns

Goals                         Concerns with LSF
30 000 to 50 000 nodes        6 500 nodes max
Cluster dynamism              Adding/removing nodes requires reconfiguration
10 to 100 Hz dispatch rate    Transient dispatch problems
100 Hz query scaling          Slow query/submission response times

Evaluating Alternatives to LSF

After HEPiX Fall 2013 – Ann Arbor:

• LSF 8/9 claims to scale only marginally higher
• SLURM showed scalability problems too
• Son of Grid Engine only briefly reviewed, as...
• ...HTCondor looked promising

Settling on Condor

After HEPiX Spring 2014 – Annecy:

• Condor scaled encouragingly
• Focus on functions (grid, fairshare, auth, AFS)
• Pleasant experience

Pilot Service

After HEPiX Fall 2014 – Lincoln:

• Grid submissions only
• Setting up a CREAM CE
• Reviewing security

→ Consolidating pilot service

For Now: Pilot Service

Setting Up an ARC CE (I)

• CREAM heavy, opaque
• Heard good things about ARC
• Simple config, single file

Setting Up an ARC CE (II)

Now we have:

• Condor setup accepting and running jobs
• User-to-VO/role mapping
• Static/dynamic information published to BDII
• HEPSPEC06 normalisation
• Puppetised configuration
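
The idea behind HEPSPEC06 normalisation can be sketched in a few lines. This is only an illustration, assuming that normalised time is raw time multiplied by the node's per-core HS06 score over a reference score; the function names, the reference value of 10.0 and the example scores are hypothetical and not taken from the CERN setup.

```python
# Hypothetical reference score; not a value from the slides.
REFERENCE_HS06_PER_CORE = 10.0


def normalise_seconds(raw_seconds: float, node_hs06_per_core: float,
                      reference: float = REFERENCE_HS06_PER_CORE) -> float:
    """Convert time measured on a node into reference-machine seconds."""
    if node_hs06_per_core <= 0 or reference <= 0:
        raise ValueError("HS06 scores must be positive")
    return raw_seconds * node_hs06_per_core / reference


def local_wall_limit(reference_limit_seconds: float, node_hs06_per_core: float,
                     reference: float = REFERENCE_HS06_PER_CORE) -> float:
    """Inverse mapping: wall-time limit on this node for a limit
    expressed in reference-machine seconds."""
    if node_hs06_per_core <= 0 or reference <= 0:
        raise ValueError("HS06 scores must be positive")
    return reference_limit_seconds * reference / node_hs06_per_core


if __name__ == "__main__":
    # One hour on a node rated 12.5 HS06 per core (hypothetical score).
    print(normalise_seconds(3600, 12.5))
    print(local_wall_limit(3600, 12.5))
```

A faster node is credited with more normalised seconds for the same wall time, and by the inverse mapping it gets a shorter local limit for the same reference limit.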

Setting Up an ARC CE (III)

TODO:

• GLUE validation fails with ARC, waiting for fix
• Scale job wall time by worker node attributes
• Single queue to accept jobs
• Security review

And then evaluate HTCondor-CE?

On the Condor Front (I)

• Fairshare groups and quotas
• Accounting group injected into job submission

TODO:

• Accounting (How do you store it?)
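
Injecting an accounting group at submission time could look roughly like the sketch below. It treats a submit description as a plain dictionary and sets the HTCondor submit commands `accounting_group` and `accounting_group_user`. The VO/role mapping, the group names and the function name are assumptions made for illustration; the slides do not say how the injection is actually implemented.

```python
# Hypothetical mapping from (VO, role) to a fairshare accounting group.
GROUP_BY_VO_ROLE = {
    ("atlas", "production"): "group_atlas.prod",
    ("atlas", None): "group_atlas.users",
    ("cms", None): "group_cms.users",
}


def inject_accounting_group(submit: dict, user: str, vo: str, role=None) -> dict:
    """Return a copy of a submit description with accounting attributes set.

    Falls back to the VO's default group when the role has no own entry.
    """
    group = GROUP_BY_VO_ROLE.get((vo, role)) or GROUP_BY_VO_ROLE.get((vo, None))
    if group is None:
        raise KeyError(f"no accounting group configured for VO {vo!r}")
    patched = dict(submit)  # leave the caller's description untouched
    patched["accounting_group"] = group
    patched["accounting_group_user"] = user
    return patched


if __name__ == "__main__":
    job = {"executable": "/bin/true"}
    print(inject_accounting_group(job, "alice", "atlas", "production"))
```

Setting the group on the service side, rather than trusting what the user wrote, is what keeps the fairshare quotas meaningful.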

Monitoring

• Still our Ganglia instance, but also...
• ...central manager, schedds, workers in Kibana

Next Up: Local Jobs

AFS Token Management

• There is Kerberos ticket passing
• Forging valid AFS tokens from expired ones
• Risk of credential theft
• Independence from AFS

Job Submissions and Queries

Query job no matter where it’s submitted from

• A schedd to answer all queries?
• Protection against heavy query loads

Or

• <username>.condor.cern.ch aliases
• Job IDs hashed to schedds
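
The second option, mapping a user name or job ID deterministically to a schedd, might be sketched as follows. The schedd host names and the function name are hypothetical, and plain modulo hashing is just one possible choice; the slides do not specify the scheme.

```python
import hashlib

# Hypothetical schedd hosts, for illustration only.
SCHEDDS = [
    "schedd01.example.org",
    "schedd02.example.org",
    "schedd03.example.org",
]


def schedd_for(key: str, schedds=SCHEDDS) -> str:
    """Map a key (user name or job ID) to one schedd, deterministically."""
    if not schedds:
        raise ValueError("no schedds configured")
    # Use a stable hash: Python's built-in hash() is salted per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(schedds)
    return schedds[index]


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, "->", schedd_for(user))
```

Any client can then compute which schedd to query without a central lookup. With modulo hashing, changing the number of schedds remaps most keys, so a consistent-hashing scheme may be preferable if the schedd list changes often.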

Group Membership Enforcement

• Submit on behalf of the group you belong to
• Post-submission checks?
• There might be plans upstream
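
A post-submission check could be as simple as the sketch below, which scans queued jobs and reports those whose owner is not a member of the group they claim. The membership table, the job record layout and the function names are all hypothetical; in practice the membership data would come from a site's identity service.

```python
# Hypothetical group membership table.
MEMBERS = {
    "group_atlas": {"alice", "bob"},
    "group_cms": {"carol"},
}


def may_submit_as(user: str, group: str, members=MEMBERS) -> bool:
    """True if the user belongs to the group they claim to submit for."""
    return user in members.get(group, set())


def violations(jobs, members=MEMBERS):
    """Post-submission check: return the jobs whose owner is not in the
    claimed group, so that they can be held or removed."""
    return [job for job in jobs
            if not may_submit_as(job["owner"], job["group"], members)]


if __name__ == "__main__":
    queue = [
        {"id": "1.0", "owner": "alice", "group": "group_atlas"},
        {"id": "2.0", "owner": "carol", "group": "group_atlas"},
    ]
    print(violations(queue))
```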

Replacement for LSF Queues

• Interface between users and resources
• Opportunity to review what users should see
• ClassAds
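
One way to picture ClassAds replacing queues: instead of picking a named queue, a job carries attributes that are matched against machine attributes. The toy model below uses plain dictionaries, not the ClassAd language itself, and the queue names, limits and attribute names are invented for illustration.

```python
# Hypothetical queue names and wall-time limits in seconds.
QUEUE_WALL_LIMIT = {
    "short": 3600,
    "medium": 8 * 3600,
    "long": 7 * 24 * 3600,
}


def job_ad_for_queue(queue: str, owner: str) -> dict:
    """Translate a queue-style request into attributes on a job ad."""
    return {"Owner": owner, "MaxWallTime": QUEUE_WALL_LIMIT[queue]}


def matches(job_ad: dict, machine_ad: dict) -> bool:
    """A job matches a machine that allows at least its wall time."""
    return job_ad["MaxWallTime"] <= machine_ad["MaxJobWallTime"]


if __name__ == "__main__":
    job = job_ad_for_queue("medium", "alice")
    print(matches(job, {"Name": "node1", "MaxJobWallTime": 24 * 3600}))
    print(matches(job, {"Name": "node2", "MaxJobWallTime": 3600}))
```

What users should see then becomes a question of which attributes are exposed, rather than which queues exist.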

Worker Node Admission

• Explicit machine list for now...
• ...with the aim of becoming more dynamic
• Adding nodes easy, removing them not so much

High Availability/Scalability

• How many schedulers?
• Multiple pools?
• Hierarchical collectors?

Conclusion

Collaboration

• European HTCondor Site Admins Meeting 2014
• Enthusiastic chats with lead developers
• HTCondor Week
• Help from RAL
• Sharing with PIC too

Outlook

• Became interested in other CE solutions
• So did experiments
• Some progress in implementing fairshare
• Grid submissions for early adopters
