Condor Usage at Brookhaven National Lab

Alexander Withers (talk given by Tony Chan)

RHIC Computing Facility

Condor Week - March 15, 2005

About Brookhaven National Lab

● One of a handful of Laboratories supported and managed by the U.S. gov’t through DOE.

● Multi-disciplinary Lab with 2,700+ employees, Physics being the largest department.

● Physics Dept. has its own computing division (30+ FTEs) to support physics (HEP) projects.

● RHIC (nuclear) and ATLAS (HEP) are largest projects currently being supported.

Computing Facility Resources

● Full service facility: central/distributed storage capacity, large Linux Farm, robotic system for data storage, data backup, etc.

● 6+ PB permanent tape storage capacity.

● 500+ TB central/distributed disk storage capacity.

● 1.4 million SpecInt2000 aggregate computing power in Linux Farm.

History of Condor at Brookhaven

● First looked at Condor in 2003 as a replacement for LSF and in-house batch software.

● Installed 6.4.7 in August 2003.

● Upgraded to 6.6.0 in February 2004.

● Upgraded to 6.6.6 (with 6.7.0 startd binary) in August 2004.

● User base grew from 12 (April 2004) to 50+ (March 2005).

The Rise in Condor Usage

[Chart: kCPU-hours per month for ACF/RCF, Aug. through Mar. (est.); vertical axis 0 to 1400.]

The Rise in Condor Usage

[Chart: avg. # of running jobs per month for ACF/RCF, Aug. through Mar. (est.); vertical axis 0 to 1800.]

Condor Cluster Usage

[Chart: avg. cluster usage (%) per month for ACF/RCF, Aug. through Mar. (est.); vertical axis 0 to 35.]

BNL’s modified Condorview

Overview of Computing Resources

● Total of 2750 CPUs (growing to 3400+ in 2005).

● Two central managers with one acting as a backup.

● Three specialized submit machines which handle ~600 simultaneous jobs each on average.

● 131 of the execute nodes can also act as submission nodes.

● One monitoring/Condorview server.

Overview of Computing Resources, cont.

● Six GLOBUS gateway machines for remote job submission.

● Most machines run SL-3.0.2 on the x86 platform, some still using RH 7.3.

● Running 6.6.6 with 6.7.0 startd binary to take advantage of multiple VM feature.

Overview of Configuration

● Computing resources divided into 6 pools.

● Two configuration models:

– Split pool resources into two parts and restrict which jobs can run in each part.

– More complex version of the Bologna Batch System.

– A pool uses one or both of these models.

● Some pools employ user priority preemption.

● Use “drop queue” method to fill fast machines first.

● Have tools to easily reconfigure nodes.

● All jobs use vanilla universe (no checkpointing).

Two Part Model

● Nodes are assigned one of two tasks irrespective of Condor: analysis or reconstruction.

● Within Condor, a node advertises itself as either an analysis node or a reconstruction node.

● A job must advertise itself in the same manner to match with an appropriate node.

● Only certain users may run reconstruction jobs but anyone can run an analysis job.
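The matching rule on this slide can be illustrated with a minimal Python sketch. It is not Condor's ClassAd language or API: the `Node` and `Job` classes, the `matches` function, and the `RECONSTRUCTION_USERS` allow-list are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

# Hypothetical allow-list; the slides do not say how privileged users are defined.
RECONSTRUCTION_USERS = {"reco_operator"}


@dataclass(frozen=True)
class Node:
    name: str
    node_type: str  # "analysis" or "reconstruction", as advertised by the node


@dataclass(frozen=True)
class Job:
    owner: str
    job_type: str  # "analysis" or "reconstruction", as advertised by the job


def may_submit(job: Job) -> bool:
    """Anyone may run analysis; only listed users may run reconstruction."""
    if job.job_type == "analysis":
        return True
    return job.job_type == "reconstruction" and job.owner in RECONSTRUCTION_USERS


def matches(job: Job, node: Node) -> bool:
    """A job matches a node only when both advertise the same type."""
    return may_submit(job) and job.job_type == node.node_type
```

In a real pool the same idea would be expressed through attributes advertised by the machine and by the job and compared during matchmaking; the sketch only shows the logic.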

Analysis/Reconstruction

[Diagram: node groups ordered from Fast to Slow (Groups 1 to 5 and Groups 1 to 3), each node with vm1 and vm2. Example shown: Reconstruction Job: wants group <= 2.]

● No suspension

● No preemption

● Will start a job if CPU is free
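The slides do not spell out how the "drop queue" method is implemented, so the following Python sketch only illustrates its stated effect: a job that wants `group <= N` lands on the fastest eligible free node. The function `pick_node`, its arguments, and the convention that group 1 is the fastest are assumptions made for illustration.

```python
from typing import Dict, Optional


def pick_node(free_nodes: Dict[str, int], max_group: int) -> Optional[str]:
    """Return the fastest free node whose group number is <= max_group.

    free_nodes maps a node name to its group number. Group 1 is assumed
    to be the fastest group (an assumption, not stated on the slides).
    """
    eligible = [(group, name) for name, group in free_nodes.items()
                if group <= max_group]
    if not eligible:
        return None  # no eligible node is free; the job keeps waiting
    # min() orders by group number first, so faster groups are filled first.
    return min(eligible)[1]
```

For example, with free nodes in groups 1, 2 and 4, a job that wants `group <= 2` would be placed in group 1; once group 1 is full it would fall to group 2, and it would never run in group 4.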

A More Complex Version of the Bologna Model

● Dual-CPU nodes, each with 8 VMs.

● 2 VMs per CPU.

● Only two jobs running at a time.

● Four job categories, each with its own priority.

● A high priority VM will suspend a random VM of lower priority.

● The random aspect is to prevent the same VM from always getting suspended.
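The suspension rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pool's actual Condor configuration: the function names, the integer priority scale (larger means higher priority), and the `"start"`/`"suspend"`/`"wait"` return values are all hypothetical.

```python
import random
from typing import Dict, Optional, Tuple

MAX_RUNNING = 2  # "Only two jobs running at a time"


def choose_vm_to_suspend(running: Dict[str, int], incoming_priority: int,
                         rng: random.Random) -> Optional[str]:
    """Pick a random VM that is running strictly lower-priority work.

    running maps a VM name to the priority of the job on it
    (larger number = higher priority, an assumed convention).
    """
    candidates = sorted(vm for vm, prio in running.items()
                        if prio < incoming_priority)
    if not candidates:
        return None
    # Random choice so that the same VM is not always the one suspended.
    return rng.choice(candidates)


def admit(running: Dict[str, int], incoming_priority: int,
          rng: random.Random) -> Tuple[str, Optional[str]]:
    """Decide what happens when a job arrives on a node."""
    if len(running) < MAX_RUNNING:
        return ("start", None)      # a CPU is free
    victim = choose_vm_to_suspend(running, incoming_priority, rng)
    if victim is None:
        return ("wait", None)       # nothing of lower priority to suspend
    return ("suspend", victim)      # suspend the victim, then start the job
```

Passing the random generator in as an argument keeps the sketch easy to test with a fixed seed.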

Analysis/Reconstruction

[Diagram: node groups ordered from Fast to Slow (Groups 1 to 5 and Groups 1 to 3). VM categories shown along a High Prio / Low Prio axis: MC (vm1/vm2), Low (vm3/vm4), Med (vm5/vm6), High (vm7/vm8). Example shown: Reconstruction Job: wants group == 3, Med. Priority (vm5/vm6).]

● Low priority VMs suspended

● No preemption

● Will start a job if CPU is free or is of higher priority

Issues We've Had to Deal With

● Tune parameters to alleviate scalability problems.

– MATCH_TIMEOUT

– MAX_CLAIM_ALIVES_MISSED

● Panasas (proprietary file system) creates kernel threads with whitespace in the process name. This breaks an fscanf in procapi.C. Panasas fixed the bug.

● High-volume users can dominate a pool; partially solved with PREEMPTION_REQUIREMENTS.
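The whitespace problem mentioned above is easy to reproduce. On Linux, a `/proc/<pid>/stat` line has the form `pid (comm) state ...`, and the process name inside the parentheses may itself contain spaces. The Python sketch below assumes the failing code was reading that file (the slide names only the fscanf in procapi.C, not the file); `parse_stat_line` and the sample line are hypothetical.

```python
from typing import List, Tuple


def parse_stat_line(line: str) -> Tuple[int, str, List[str]]:
    """Split a /proc/<pid>/stat line into (pid, comm, remaining fields).

    The process name is delimited by the first "(" and the LAST ")",
    so names containing spaces or parentheses are handled correctly.
    """
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren].strip())
    comm = line[open_paren + 1:close_paren]
    rest = line[close_paren + 1:].split()
    return pid, comm, rest


# Made-up sample line with whitespace in the process name.
sample = "1234 (my kernel thread) S 1 0 0"

# Naive whitespace splitting puts part of the name where the state should be.
naive_state = sample.split()[2]

pid, comm, rest = parse_stat_line(sample)
```

Here `naive_state` ends up holding a fragment of the process name instead of the state field, which is the same class of failure as a whitespace-delimited `%s` conversion in fscanf.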

Issues We’ve Had to Deal With, cont.

● DAGMan problems (latency, termination): switched from DAGMan to plain Condor.

● Created our own ClassAds and JobAds to create batch queues and handy management tools (i.e., our version of condor_off).

● Modified Condorview to meet our accounting & monitoring requirements.

Issues Not Yet Resolved

● Need a job ClassAd that gives the user's primary group, for better control over cluster usage.

● Transfer output files for debugging when job is evicted.

● Need option to force the schedd to release its claim after each job.

● Allow the schedd to set a mandatory periodic_remove policy to avoid manual cleanup.

Issues Not Yet Resolved, cont.

● The shadow seems to make a large number of NIS calls. Possible problem with address caching by shadows in the vanilla universe?

● Need Kerberos support to comply with security mandates.

● Interested in Condor on Demand (COD), but lack of functionality prevents more usage.

● Need more (and effective) cluster management tools (does condor_off work?).

Near-Term Plans & Summary

● Waiting for 6.8.x series (late 2005?) to upgrade.

● Scalability concerns as usage rises.

● High availability more critical as usage rises.

● Integration of BNL Condor pools with external pools, but concerned about security.

● Need some functionalities listed above for a meaningful upgrade and to improve cluster management capability.