HTCondor at the RAL Tier-1
Andrew Lahiff
Overview
• Current status at RAL
• Multi-core jobs
• Interesting features we’re using
• Some interesting features & commands
• Future plans
Current status at RAL
Background
• Computing resources
  – 784 worker nodes, over 14K cores
  – Generally 40-60K jobs submitted per day
• Torque / Maui had been used for many years
  – Many issues
  – Severity & number of problems increased as the size of the farm increased
  – Doesn’t like dynamic resources
    • A problem if we want to extend the batch system into the cloud
• In 2012 decided it was time to start investigating moving to a new batch system
Choosing a new batch system
• Considered, tested & eventually rejected the following:
  – LSF, Univa Grid Engine*
    • Requirement: avoid commercial products unless absolutely necessary
  – Open source Grid Engines
    • Competing products; not sure which has a long-term future
    • Communities appear less active than HTCondor & SLURM
    • Existing Tier-1s running Grid Engine use the commercial version
  – Torque 4 / Maui
    • Maui problematic
    • Torque 4 seems less scalable than the alternatives (but better than Torque 2)
  – SLURM
    • Carried out extensive testing & comparison with HTCondor
    • Found that for our use case it was:
      – Very fragile, easy to break
      – Unable to get it working reliably above 6,000 running jobs

* Only tested open source Grid Engine, not Univa Grid Engine
Choosing a new batch system
• HTCondor chosen as the replacement for Torque/Maui
  – Has the features we require
  – Seems very stable
  – Easily able to run 16,000 simultaneous jobs
    • Didn’t do any tuning – it “just worked”
    • Have since tested > 30,000 running jobs
  – Is more customizable than the other batch systems we considered
The story so far
• History of HTCondor at RAL
  – Jun 2013: started testing with real ATLAS & CMS jobs
  – Sep 2013: 50% of pledged resources moved to HTCondor
  – Nov 2013: fully migrated to HTCondor
• Experience
  – Very stable operation
  – No changes needed as the HTCondor pool grew from ~1,000 to ~14,000 cores
  – Job start rate much higher than Torque / Maui, even when throttled
  – Very good support
Architecture

[Diagram: the CEs run condor_schedd; the central manager runs condor_collector & condor_negotiator; the worker nodes run condor_startd]
Our setup
• 2 central managers
  – condor_master
  – condor_collector
  – condor_had (responsible for high availability)
  – condor_replication (responsible for high availability)
  – condor_negotiator (running on at most 1 machine at a time)
• 8 submission hosts (4 ARC CEs, 2 CREAM CEs, 2 UIs)
  – condor_master
  – condor_schedd
• Lots of worker nodes
  – condor_master
  – condor_startd
• Monitoring box (runs 8.1.x, which includes Ganglia integration)
  – condor_master
  – condor_gangliad
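A minimal sketch of the high-availability configuration on the two central managers, following the standard HTCondor HA setup (hostnames & port numbers are placeholders, not RAL's actual config):

CONDOR_HOST = condor01.example.com, condor02.example.com
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION

HAD_LIST = condor01.example.com:51450, condor02.example.com:51450
REPLICATION_LIST = condor01.example.com:41450, condor02.example.com:41450
HAD_USE_REPLICATION = True

## condor_had tells the master when to start/stop the negotiator,
## so that it runs on at most one central manager at a time
MASTER_NEGOTIATOR_CONTROLLER = HAD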
Computing elements
• ARC experience so far
  – Have run over 9.4 million jobs across our ARC CEs
  – Generally ignore them & they “just work”
• VO status
  – ATLAS & CMS
    • Fine from the beginning
  – LHCb
    • Added the ability for DIRAC to submit to ARC
  – ALICE
    • Not yet able to submit to ARC; have said they will work on this
  – Non-LHC VOs
    • Some use DIRAC, which can now submit to ARC
    • Some use EMI WMS, which can submit to ARC
Computing elements
• ARC 4.1.0 released recently
  – Will be in UMD very soon
    • Has just passed through staged rollout
  – Contains all of our fixes to the HTCondor backend scripts
• Plans
  – When VOs start using RFC proxies we could enable the web service interface
    • Doesn’t affect ATLAS/CMS
    • VOs using NorduGrid client commands (e.g. LHCb) can get job status information more quickly
Computing elements
• Alternative: HTCondor-CE
  – A special configuration of HTCondor, not a brand-new service
  – Some sites in the US are starting to use this
  – Note: contains no BDII (!)

[Diagram: ATLAS & CMS pilot factories submit via HTCondor-G to the schedd of the HTCondor-CE; its job router passes jobs to the site schedd, with the site's central manager(s) running the collector(s) & negotiator and the worker nodes running startds]
Multi-core jobs
Getting multi-core jobs to work
• Job submission
  – Haven’t set up dedicated queues
  – VOs have to request how many cores they want in their JDL
    • Fine for ATLAS & CMS; not sure yet for LHCb/DIRAC…
    • Could set up additional queues if necessary
• Did 5 things on the HTCondor side to support multi-core jobs…
Getting multi-core jobs to work
• Worker nodes configured to use partitionable slots
  – WN resources divided up as necessary amongst jobs
  – We had this configuration anyway
• Set up multi-core accounting groups & associated quotas
  – Configured so that multi-core jobs are automatically assigned to the appropriate groups
  – Specified group quotas (fairshares) for the multi-core groups
• Adjusted the order in which the negotiator considers groups
  – Consider multi-core groups before single-core groups
    • 8 free slots are “expensive” to obtain, so try not to lose them too quickly
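A minimal configuration sketch covering these steps, using standard HTCondor knobs (the group names, quota values & sort expression are illustrative, not RAL's actual config):

## Worker nodes: a single partitionable slot owning all resources
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%, mem=100%, auto
SLOT_TYPE_1_PARTITIONABLE = True

## Central manager: accounting groups & quotas (fairshares)
GROUP_NAMES = group_atlas, group_atlas_multicore, group_cms, group_cms_multicore
GROUP_QUOTA_DYNAMIC_group_atlas_multicore = 0.1
GROUP_QUOTA_DYNAMIC_group_cms_multicore = 0.1

## Consider multi-core groups before single-core groups
GROUP_SORT_EXPR = ifThenElse(AccountingGroup =?= "<none>", 3.4e+38, \
                    ifThenElse(regexp("multicore", AccountingGroup), 1.0, 2.0))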
Getting multi-core jobs to work
• Set up the condor_defrag daemon
  – Finds WNs to drain, triggers draining & cancels draining as required
  – Picks WNs to drain based on how many cores can be freed up
    • E.g. getting 8 free cores by draining a full 32-core WN is generally faster than draining a full 8-core WN
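A sketch of the relevant condor_defrag knobs (the values are illustrative, not our production settings):

## Central manager: run the defrag daemon
DAEMON_LIST = $(DAEMON_LIST), DEFRAG

## Limits on how much draining can happen at once
DEFRAG_MAX_CONCURRENT_DRAINING = 4
DEFRAG_DRAINING_MACHINES_PER_HOUR = 2.0

## Which slots are candidates for draining
DEFRAG_REQUIREMENTS = PartitionableSlot && TotalCpus >= 8

## Stop counting a WN as draining once 8 cores are free
DEFRAG_WHOLE_MACHINE_EXPR = PartitionableSlot && Cpus >= 8

## Prefer draining the machines where the least CPU time would be wasted
DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput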
Getting multi-core jobs to work
• Improvement to the condor_defrag daemon
  – Demand for multi-core jobs is not known by condor_defrag
  – Set up a simple cron which adjusts the number of concurrently draining WNs based on demand (see the sketch below)
    • If many idle multi-core jobs but few running, drain aggressively
    • Otherwise very little draining
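A minimal sketch of such a cron job (the thresholds, central manager name & use of condor_config_val -rset are illustrative):

#!/bin/bash
## Scale condor_defrag activity with multi-core demand
CM=condor01.example.com

idle=$(condor_q -global -constraint 'RequestCpus > 1 && JobStatus == 1' -autoformat ClusterId | wc -l)
running=$(condor_q -global -constraint 'RequestCpus > 1 && JobStatus == 2' -autoformat ClusterId | wc -l)

if [ "$idle" -gt 50 ] && [ "$running" -lt 50 ]; then
    max=10    # many idle multi-core jobs but few running: drain aggressively
else
    max=1     # otherwise very little draining
fi

## Push the new limit to the defrag daemon at runtime
condor_config_val -name $CM -rset "DEFRAG_MAX_CONCURRENT_DRAINING = $max"
condor_reconfig -name $CM -daemon defrag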
Results

[Plot: running & idle multi-core jobs. Gaps in submission by ATLAS result in loss of multi-core slots.]

[Plot: number of WNs running multi-core jobs & draining WNs]

• CPU wastage significantly reduced due to the cron
  – Aggressive draining: 3% waste
  – Less-aggressive draining: < 1% waste
Multi-core jobs
• Current status
  – Haven’t made any changes over the past few months
  – Now only CMS is running multi-core jobs
    • Waiting for ATLAS to start up again
    • Will be interesting to see what happens when multiple VOs run multi-core jobs
  – May look at making improvements if necessary
• Details about our configuration here:
  https://www.gridpp.ac.uk/wiki/RAL_HTCondor_Multicore_Jobs_Configuration
Interesting features we’re using (or are about to use)
Startd cron
• Worker node health-check script
  – Script run on the WN at regular intervals by HTCondor
  – Can place custom information into the WN's ClassAds. In our case:
    • NODE_IS_HEALTHY (should the WN start jobs or not)
    • NODE_STATUS (list of any problems)

STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WN_HEALTHCHECK
STARTD_CRON_WN_HEALTHCHECK_EXECUTABLE = /usr/local/bin/healthcheck_wn_condor
STARTD_CRON_WN_HEALTHCHECK_PERIOD = 10m
STARTD_CRON_WN_HEALTHCHECK_MODE = periodic
STARTD_CRON_WN_HEALTHCHECK_RECONFIG = false
STARTD_CRON_WN_HEALTHCHECK_KILL = true

## When is this node willing to run jobs?
START = (NODE_IS_HEALTHY =?= True)
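For illustration, a minimal sketch of what such a health-check script might look like (the checks & repository list are examples, not our actual script); a startd cron script advertises attributes by printing them to stdout:

#!/bin/bash
## Sketch of /usr/local/bin/healthcheck_wn_condor
HEALTHY=True
STATUS="All_OK"

## CVMFS repositories reachable?
for repo in atlas.cern.ch cms.cern.ch lhcb.cern.ch; do
    if ! timeout 60 ls /cvmfs/$repo > /dev/null 2>&1; then
        HEALTHY=False
        STATUS="Problem: CVMFS for $repo"
    fi
done

## Filesystem writable?
if ! touch /tmp/.wn_healthcheck 2> /dev/null; then
    HEALTHY=False
    STATUS="Problem: read-only filesystem"
fi
rm -f /tmp/.wn_healthcheck

echo "NODE_IS_HEALTHY = $HEALTHY"
echo "NODE_STATUS = \"$STATUS\""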
Startd cron
• Current checks
  – CVMFS
  – Filesystem problems (e.g. read-only)
  – Swap usage
• Plans
  – May add more checks, e.g. CPU load
  – More thorough tests run only when HTCondor first starts up?
    • E.g. checks for grid middleware, other essential software & configuration, …
    • Want to be sure that jobs will never be started unless the WN is correctly set up
    • Will be more important for dynamic virtual WNs
Startd cron
• Easily list any WNs with problems, e.g.

# condor_status -constraint 'PartitionableSlot == True && NODE_STATUS != "All_OK"' -autoformat Machine NODE_STATUS
lcg1209.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1248.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1309.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1340.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
[Plot: Ganglia plots produced by a gmetric script using the HTCondor Python API]
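As an illustration, a sketch of such a gmetric script built on the HTCondor Python bindings (the metric name & constraint here are hypothetical):

#!/usr/bin/env python
## Count WNs failing the health check & push the value to Ganglia
import subprocess
import htcondor

coll = htcondor.Collector()
ads = coll.query(htcondor.AdTypes.Startd,
                 'PartitionableSlot == True && NODE_STATUS != "All_OK"',
                 ['Machine'])

subprocess.call(['gmetric', '--name', 'wns_with_problems',
                 '--value', str(len(ads)), '--type', 'int32'])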
PID namespaces
• We have USE_PID_NAMESPACES = True on WNs
  – Jobs can’t see any system processes or processes associated with other jobs on the WN
  – Example stdout of a job running “ps -e”:

  PID TTY          TIME CMD
    1 ?        00:00:00 condor_exec.exe
    3 ?        00:00:00 ps
MOUNT_UNDER_SCRATCH
• Each job sees a different /tmp, /var/tmp
  – Uses bind mounts to directories inside the job scratch area
    • No more junk left behind in /tmp
    • Jobs can’t fill /tmp & cause problems
    • Jobs can’t see what other jobs have written into /tmp
• For glexec jobs to work, a special lcmaps plugin needs to be enabled
  – lcmaps-plugins-mount-under-scratch
  – Minor tweak to lcmaps.db
• We have tested this but it’s not yet rolled out
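The corresponding WN configuration is a single knob, e.g.:

MOUNT_UNDER_SCRATCH = /tmp,/var/tmp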
CPU affinity
• Can set ASSIGN_CPU_AFFINITY = True on WNs
• Jobs locked to specific cores
• Problem:
  – When PID namespaces are also used, CPU affinity doesn’t work
Control groups
• Cgroups: a mechanism for managing sets of processes
• We’re starting with the most basic option: CGROUP_MEMORY_LIMIT_POLICY = none
  – No memory limits applied
  – Cgroups used for:
    • Process tracking
    • Memory accounting
    • CPU usage assigned proportionally to the number of CPUs in the slot
• Currently configured on 2 WNs for initial testing with production jobs
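A sketch of the corresponding WN configuration (the cgroup name is illustrative):

## Track job processes under a dedicated cgroup, with no memory limits
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = none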
Some interesting features & commands
Upgrades
• Central managers, CEs:
  – Update the RPMs
  – condor_master will notice the binaries have changed & restart daemons as required
• Worker nodes
  – Make sure the WN config contains MASTER_NEW_BINARY_RESTART = PEACEFUL
  – Update the RPMs
  – condor_master will notice the binaries have changed, drain running jobs, then restart daemons as required
Dealing with held jobs
• Held jobs have failed in some way & remain in the queue waiting for user intervention
  – E.g. input file(s) missing from the CE
• Can configure HTCondor to deal with them automatically
  – Try re-running held jobs once only, after waiting 30 minutes:

SYSTEM_PERIODIC_RELEASE = ((CurrentTime - EnteredCurrentStatus > 30 * 60) && (JobRunCount < 2))

  – Remove held jobs after 24 hours:

SYSTEM_PERIODIC_REMOVE = ((CurrentTime - EnteredCurrentStatus > 24 * 60 * 60) && (JobStatus == 5))
condor_who
• See what jobs are running on a worker node

[root@lcg0975 ~]# condor_who
OWNER                     CLIENT                    SLOT  JOB        RUNTIME     PID    PROGRAM
patls053@gridpp.rl.ac.uk  arc-ce03.gridpp.rl.ac.uk  1_5   2730014.0  0+03:30:12  30184  /pool/condor/dir_30180/condor_exec.exe
patls053@gridpp.rl.ac.uk  arc-ce01.gridpp.rl.ac.uk  1_7   3189534.0  0+03:38:19  21266  /pool/condor/dir_21262/condor_exec.exe
tatls011@gridpp.rl.ac.uk  arc-ce03.gridpp.rl.ac.uk  1_6   2729613.0  0+04:40:21  6942   /pool/condor/dir_6938/condor_exec.exe
tlhcb005@gridpp.rl.ac.uk  arc-ce02.gridpp.rl.ac.uk  1_4   2977866.0  0+08:42:13  26669  /pool/condor/dir_26665/condor_exec.exe
tatls001@gridpp.rl.ac.uk  arc-ce01.gridpp.rl.ac.uk  1_1   3186150.0  0+10:57:05  12401  /pool/condor/dir_12342/condor_exec.exe
tlhcb005@gridpp.rl.ac.uk  arc-ce01.gridpp.rl.ac.uk  1_3   3174829.0  0+16:03:57  26418  /pool/condor/dir_26331/condor_exec.exe
pcms054@gridpp.rl.ac.uk   arc-ce01.gridpp.rl.ac.uk  1_2   3149655.0  1+23:55:51  31281  /pool/condor/dir_31268/condor_exec.exe
condor_q -analyze (1)
• Why isn’t a job running?

-bash-4.1$ condor_q -analyze 16244

-- Submitter: lcgui03.gridpp.rl.ac.uk : <130.246.180.41:45033> : lcgui03.gridpp.rl.ac.uk
User priority for alahiff@gridpp.rl.ac.uk is not available, attempting to analyze without it.
---
16244.000:  Run analysis summary.  Of 13180 machines,
  13180 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job

WARNING: Be advised: No resources matched request's constraints
condor_q -analyze (2)

The Requirements expression for your job is:

( ( Ceph is true ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )

Suggestions:

    Condition                        Machines Matched   Suggestion
    ---------                        ----------------   ----------
1   ( TARGET.Cpus >= 16 )            0                  MODIFY TO 8
2   ( ( target.Ceph is true ) )      139
3   ( TARGET.Memory >= 1 )           13150
4   ( TARGET.Arch == "X86_64" )      13180
5   ( TARGET.OpSys == "LINUX" )      13180
6   ( TARGET.Disk >= 1 )             13180
7   ( TARGET.HasFileTransfer )       13180
condor_ssh_to_job
• ssh into a job from a CE

[root@arc-ce01 ~]# condor_ssh_to_job 3147487.0
Welcome to slot1@lcg1554.gridpp.rl.ac.uk!
Your condor job is running with pid(s) 2402.
[pcms054@lcg1554 dir_2393]$ hostname
lcg1554.gridpp.rl.ac.uk
[pcms054@lcg1554 dir_2393]$ ls
condor_exec.exe
_condor_stderr
_condor_stderr.schedd_glideins4_vocms32.cern.ch_1497953.4_1400484359
_condor_stdout
_condor_stdout.schedd_glideins4_vocms32.cern.ch_1497953.4_1400484359
glide_adXvcQ
glidein_startup.sh
job.MCJMDmk8W6jnCIXDjqiBL5XqABFKDmABFKDmb5JKDmEBFKDmQ8FM6m.proxy
condor_fetchlog
• Retrieve log files from daemons on other machines

[root@condor01 ~]# condor_fetchlog arc-ce04.gridpp.rl.ac.uk SCHEDD
05/20/14 12:39:09 (pid:2388) ******************************************************
05/20/14 12:39:09 (pid:2388) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
05/20/14 12:39:09 (pid:2388) ** /usr/sbin/condor_schedd
05/20/14 12:39:09 (pid:2388) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
05/20/14 12:39:09 (pid:2388) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
05/20/14 12:39:09 (pid:2388) ** $CondorVersion: 8.0.6 Feb 01 2014 BuildID: 225363 $
05/20/14 12:39:09 (pid:2388) ** $CondorPlatform: x86_RedHat6 $
05/20/14 12:39:09 (pid:2388) ** PID = 2388
05/20/14 12:39:09 (pid:2388) ** Log last touched time unavailable (No such file or directory)
05/20/14 12:39:09 (pid:2388) ******************************************************
...
condor_gather_info
• Gathers information about a job
  – Including log files from the schedd, startd

[root@arc-ce01 ~]# condor_gather_info --jobid 3288142.0
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/ShadowLog.old
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/ShadowLog
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/StartLog
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/MasterLog
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/MasterLog.lcg1043.gridpp.rl.ac.uk
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/condor-profile.txt
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/job_q
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/job_userlog_lines
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/job_ad_analysis
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/job_ad
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/SchedLog.old
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/StarterLog.slot1
Job ClassAds
• Lots of useful information in job ClassAds
  – Including the email address from the proxy
    • Easy to contact the users of problematic jobs

# condor_q 3279852.0 -autoformat x509UserProxyEmail
andrew.lahiff@stfc.ac.uk
condor_chirp
• Jobs can put custom information into their job ClassAds
• Example: lcmaps-plugin-condor-update
  – Puts information into the job ClassAd about the glexec payload user & DN
  – Can then use condor_q to see this information
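For illustration, a job can set an arbitrary attribute on itself with condor_chirp (the attribute name & value here are hypothetical):

# From inside the running job:
condor_chirp set_job_attr PayloadUser '"pilot_payload_01"'

# From the schedd:
condor_q 3279852.0 -autoformat PayloadUser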
Job router
• The job router daemon transforms jobs from one type to another according to configurable policies
  – E.g. submit jobs to a different batch system or a CE
• Example: sending excess jobs to the GridPP Cloud using glideinWMS

JOB_ROUTER_DEFAULTS = \
  [ \
    MaxIdleJobs = 10; \
    MaxJobs = 50; \
  ]

JOB_ROUTER_ENTRIES = \
  [ \
    Requirements = true; \
    GridResource = "condor lcggwms02.gridpp.rl.ac.uk lcggwms02.gridpp.rl.ac.uk:9618"; \
    name = "GridPP_Cloud"; \
  ]
Job router
• Example: initially have 5 idle jobs

-bash-4.1$ condor_q

-- Submitter: lcgvm21.gridpp.rl.ac.uk : <130.246.181.102:40557> : lcgvm21.gridpp.rl.ac.uk
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
2249.0   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.1   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.2   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.3   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.4   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )

5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended
Job router
• Routed copies of the jobs soon appear
  – The original job mirrors the status of the routed copy

-bash-4.1$ condor_q

-- Submitter: lcgvm21.gridpp.rl.ac.uk : <130.246.181.102:40557> : lcgvm21.gridpp.rl.ac.uk
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
2249.0   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.1   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.2   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.3   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.4   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2250.0   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2251.0   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2252.0   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2253.0   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2254.0   alahiff  6/2  20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )

10 jobs; 0 completed, 0 removed, 10 idle, 0 running, 0 held, 0 suspended
Job router
• Can check that the new jobs have been sent to a remote resource
  – Here the remote resource is the glideinWMS HTCondor pool; the GRID_JOB_ID column shows the job IDs on the remote resource

[root@lcgvm21 ~]# condor_q -grid

-- Submitter: lcgvm21.gridpp.rl.ac.uk : <130.246.181.102:40557> : lcgvm21.gridpp.rl.ac.uk
 ID      OWNER    STATUS  GRID->MANAGER                HOST  GRID_JOB_ID
2250.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        20.0
2251.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        21.0
2252.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        17.0
2253.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        18.0
2254.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        19.0
Future plans
• Setting up an ARC CE & some WNs using Ceph as a shared storage system (CephFS)
  – ATLAS testing with arcControlTower
    • Pulls jobs from PanDA, pushes jobs to ARC CEs
  – Input files pre-staged & cached on Ceph by ARC
  – Currently in progress…
Future plans
• Test power management features
  – HTCondor can power down idle WNs and wake them as required
• Batch system expanding into the cloud
  – Make use of idle private cloud resources
  – We have tested that condor_rooster can be used to dynamically provision VMs as they are needed
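A sketch of the power-management knobs involved (the expressions & values are illustrative; for cloud WNs the wake-up command would be replaced by a script that provisions a VM):

## WN: power off after 2 hours unclaimed
HIBERNATE_CHECK_INTERVAL = 300
StateTimer = (time() - EnteredCurrentState)
HIBERNATE = ifThenElse( (State == "Unclaimed") && ($(StateTimer) > 7200), "S5", "NONE" )

## Central manager: condor_rooster wakes offline WNs when there is demand
DAEMON_LIST = $(DAEMON_LIST), ROOSTER
ROOSTER_UNHIBERNATE = Offline && Unhibernate
ROOSTER_WAKEUP_CMD = "$(BIN)/condor_power -d -i"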
Questions?
Backup slides
Monitoring
Overview
• Batch system monitoring
  – Mimic
  – Jobview
  – Ganglia
  – Elasticsearch
Mimic
• Overview of the state of worker nodes
Jobview
http://sarkar.web.cern.ch/sarkar/doc/condor_jobview.html
Ganglia
• Custom gmetric scripts + condor_gangliad
Ganglia
• Aim to move to using metrics from condor_gangliad only, as much as possible
  – Easy to share with other sites
Jobs monitoring
• The CASTOR team at RAL have been testing Elasticsearch
  – Why not try using it with HTCondor?
• Elasticsearch ELK stack
  – Logstash: parses log files
  – Elasticsearch: search & analyze data in real time
  – Kibana: data visualization
• Hardware setup
  – Test cluster of 13 servers (old disk servers & worker nodes)
    • But 3 servers could handle 16 GB of CASTOR logs per day
• Adding HTCondor
  – Early testing phase only
  – Wrote a config file for Logstash to enable history files to be parsed
  – Add Logstash to machines running schedds
Jobs monitoring
• Full job ClassAds visible & can be queried

Jobs monitoring
• Can make custom plots & dashboards