Running Jobs at Wang Hall
Daniel Udwary, NERSC Data Science Engagement Group
February 3, 2016

Outline

•  Genepool move logistics
•  Differences between Crays and Genepool
•  Cori and Edison architecture and configurations
•  Intro to SLURM


Why am I running this training session?

•  During the Mendel move (next week!), we will have a period of reduced Genepool compute availability
•  We want to encourage more JGI compute work on NERSC's flagship supercomputers, when it makes sense
    –  Last year, used less than half of CPU-hour allocation
•  NERSC wants to know what it can do to better enable bioinformatics work on those machines, and identify where future problems might lie
•  Genepool may move to SLURM in the future


NERSC has moved to a new building

•  All systems must move from Oakland to Berkeley


Resources at Wang Hall (aka CRT)

•  New Mendel+ nodes
•  New login nodes (genepool13 and genepool14)
•  All filesystems (almost…)
•  Cori
•  Edison

Still at OSF:
    –  Old Mendel nodes: moving starting Feb 8
    –  Legacy Genepool nodes: to be shut down ~Feb 22
    –  Tape archive: no plan to move (yet)


Move Schedule – Current Plan

[Figure: move timeline. Labels: Feb 8; Feb 22?; Mendel+; Mendel; Legacy Computes; Filesystems; SeqFS; Scheduler; Outage @CRT / @OSF; Downtime for power work and networking maintenance.]


Key Differences Between Cori/Edison and Genepool

Cori and Edison
•  Generally large, multi-node jobs
•  Jobs are charged
•  Wait time until job start measured in days
•  Users generally compile and install their own software; few modules
•  SLURM

Genepool
•  Many small, single node (or even single-CPU) jobs
•  No job charging
•  Wait time measured in hours, if not minutes
•  Awesome JGI consultants manage bioinformatics software as modules
•  UGE


Basics of NERSC Cray architecture

•  Cori Phase I
    –  Cray XC
    –  1630 nodes
    –  128 GB memory per node
    –  32 cores per node (2 x 16 core 2.3 GHz Haswell)

•  Cori Phase II
    –  >9300 nodes
    –  Knights Landing CPUs

•  Edison
    –  Cray XC30
    –  5576 nodes
    –  64 GB memory per node
    –  24 cores per node (2 x 12 core 2.4 GHz Ivy Bridge)


Edison Queue Structure

https://www.nersc.gov/users/computational-systems/edison/running-jobs/queues-and-policies/

So, use Edison for large parallel jobs using >682 nodes


Cori Queue Structure

https://www.nersc.gov/users/computational-systems/cori/running-jobs/queues-and-policies/


What is SLURM?

•  In simple terms, SLURM is a workload manager, or a batch scheduler.
•  SLURM stands for Simple Linux Utility for Resource Management.
•  SLURM unites cluster resource management (such as Torque) and job scheduling (such as Moab) into one system. This avoids inter-tool complexity.
•  As of June 2015, SLURM is used in 6 of the top 10 computers, including the #1 system, Tianhe-2, with over 3M cores.
•  Cori was installed with SLURM, and Edison switched last Nov, after its move.


Advantages of Using SLURM

•  Fully open source.
•  SLURM is extensible (plugin architecture)
•  Low latency scheduling. Highly scalable.
•  Integrated "serial" or "shared" queue
•  Integrated Burst Buffer support
•  Good memory management
•  Built-in accounting and database support
•  "Native" SLURM runs without Cray ALPS (Application Level Placement Scheduler)
    –  Batch script runs on the head compute node directly
    –  Easier to use. Less chance for contention compared to shared MOM node.


SLURM User Commands

SLURM              UGE       Description
sbatch             qsub      submit a batch script
salloc             qlogin    request an interactive session
scancel            qdel      delete a batch job
scontrol hold      qhold     hold a job
scontrol release   qrls      release a job
sacct              qacct     display job accounting data
sqs                qs        NERSC custom queue display
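The command mapping above can be turned into a small lookup for people whose fingers still type UGE commands. This is a minimal sketch: `uge_to_slurm` is a hypothetical helper name, not something SLURM, UGE, or NERSC provides, and it only knows the pairs listed on this slide.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM or UGE): print the SLURM
# command that corresponds to a UGE command, using only the pairs
# listed on the "SLURM User Commands" slide.
uge_to_slurm() {
    case "$1" in
        qsub)   echo "sbatch" ;;
        qlogin) echo "salloc" ;;
        qdel)   echo "scancel" ;;
        qhold)  echo "scontrol hold" ;;
        qrls)   echo "scontrol release" ;;
        qacct)  echo "sacct" ;;
        qs)     echo "sqs" ;;
        *)      echo "no SLURM equivalent listed for: $1" >&2
                return 1 ;;
    esac
}

uge_to_slurm qhold   # prints: scontrol hold
```

Anything not in the table returns a non-zero status rather than guessing, since flags and semantics differ between the two schedulers.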


Running with SLURM

•  Use "sbatch" (as "qsub" in UGE) to submit a batch script, or "salloc" (as "qlogin" in UGE) to request an interactive batch session.
•  Need to specify which shell to use for the batch script.
•  Environment is automatically imported (as "qsub -V" in UGE)
•  Lands on the submit directory
•  Batch script runs on the head compute node
•  No need to repeat flags in the srun command if already defined in SBATCH keywords.
•  Hyperthreading is enabled by default. Jobs requesting more than 32 cores (MPI tasks * OpenMP threads) per node will use hyperthreads automatically.
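The hyperthreading rule in the last bullet is simple arithmetic, sketched below. `uses_hyperthreads` is a hypothetical helper, and the default of 32 physical cores per node is an assumption taken from the Cori Phase I figures in this deck; Edison nodes have 24 cores, so pass that value explicitly there.

```shell
#!/bin/bash
# Hypothetical helper: report whether a per-node request exceeds the
# physical core count and would therefore land on hyperthreads.
# Arguments: MPI tasks per node, OpenMP threads per task,
#            physical cores per node (assumed default: 32).
uses_hyperthreads() {
    local tasks_per_node=$1
    local threads_per_task=$2
    local cores_per_node=${3:-32}
    local requested=$(( tasks_per_node * threads_per_task ))
    if (( requested > cores_per_node )); then
        echo "yes"
    else
        echo "no"
    fi
}

uses_hyperthreads 4 8      # 32 requested on 32 cores: prints no
uses_hyperthreads 8 8      # 64 requested on 32 cores: prints yes
uses_hyperthreads 4 8 24   # 32 requested on 24 cores: prints yes
```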


Running with SLURM continued

•  Use "srun" to launch parallel jobs (as with "aprun" with Torque/Moab)
•  srun flags override SBATCH keywords
•  srun does most of the optimal process and thread binding automatically. Only flags such as "-n" and "-c", along with OMP_NUM_THREADS, are needed for most applications. Advanced users can experiment with more options such as --ntasks-per-socket, --cpu_bind, --mem, etc.
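As a sketch of how "-n", "-c", and OMP_NUM_THREADS fit together for a hybrid MPI/OpenMP run, the hypothetical helper below only prints the launch line; it does not call srun. Setting "-c" equal to the thread count mirrors the sample batch script later in this deck and is an assumption, not a general rule.

```shell
#!/bin/bash
# Hypothetical helper: print (do not run) a hybrid MPI/OpenMP launch
# line. Assumes -c is set to the OpenMP thread count, as in the
# sample batch script in this deck.
# Arguments: number of MPI tasks, OpenMP threads per task, executable.
hybrid_srun_cmd() {
    local ntasks=$1
    local threads=$2
    local exe=$3
    echo "OMP_NUM_THREADS=${threads} srun -n ${ntasks} -c ${threads} ${exe}"
}

hybrid_srun_cmd 8 8 ./xthi
# prints: OMP_NUM_THREADS=8 srun -n 8 -c 8 ./xthi
```

Printing the line first makes it easy to check the task and thread counts before pasting it into a batch script.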


http://slurm.schedmd.com/rosetta.pdf


SLURM Task arrays

Task arrays work similarly to UGE

•  sbatch --array=1-100
    –  Would start a 100 task job array
•  Job arrays will have two additional environment variables set:
    –  $SLURM_ARRAY_JOB_ID will be set to the first job ID of the array.
    –  $SLURM_ARRAY_TASK_ID will be set to the job array index value.
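A common use of $SLURM_ARRAY_TASK_ID is to pick one input file per array task. The sketch below assumes a hypothetical naming scheme (input_001.fasta through input_100.fasta) and a hypothetical helper `input_for_task`; outside SLURM the variables are unset, so the script falls back to index 1 and can be dry-run on a login node.

```shell
#!/bin/bash -l
#SBATCH --array=1-100

# Hypothetical naming scheme: map an array index to an input file
# name such as input_007.fasta.
input_for_task() {
    printf 'input_%03d.fasta\n' "$1"
}

# SLURM sets these inside an array job; the fallbacks are only for
# dry runs outside the scheduler.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
array_job="${SLURM_ARRAY_JOB_ID:-none}"

echo "array job ${array_job}, task ${task_id}: $(input_for_task "$task_id")"
```

In a real job, the echo would be replaced by the command that processes the selected file.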


Sample SLURM Batch Script

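Long command options:

    #!/bin/bash -l
    #SBATCH --partition=regular
    #SBATCH --job-name=test
    #SBATCH --account=mpccc
    #SBATCH --nodes=2
    #SBATCH --time=00:30:00
    # Pure MPI run: 16 tasks
    srun -n 16 ./mpi-hello
    # Hybrid run: 8 MPI tasks, 8 CPUs (OpenMP threads) per task
    export OMP_NUM_THREADS=8
    srun -n 8 -c 8 ./xthi

Short command options:

    #!/bin/bash -l
    #SBATCH -p regular
    #SBATCH -J test
    #SBATCH -A mpccc
    #SBATCH -N 2
    #SBATCH -t 00:30:00
    # Pure MPI run: 16 tasks
    srun -n 16 ./mpi-hello
    # Hybrid run: 8 MPI tasks, 8 CPUs (OpenMP threads) per task
    export OMP_NUM_THREADS=8
    srun -n 8 -c 8 ./xthi

To submit a batch job:

    % sbatch mytest.sl
    Submitted batch job 15400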
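Workflow scripts often need the job ID that sbatch prints on submission. The following is a minimal sketch that extracts it from the "Submitted batch job N" message shown above; `parse_job_id` is a hypothetical helper, and the sbatch call is left commented out so the example runs without a scheduler.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from sbatch's confirmation
# line, e.g. "Submitted batch job 15400". Returns non-zero if the
# line does not have that form.
parse_job_id() {
    local line=$1
    case "$line" in
        "Submitted batch job "*) echo "${line##* }" ;;
        *) return 1 ;;
    esac
}

parse_job_id "Submitted batch job 15400"   # prints: 15400

# On a system with SLURM this could be used as:
#   jobid=$(parse_job_id "$(sbatch mytest.sl)")
#   scancel "$jobid"
```

sbatch also offers a --parsable option that prints the job ID in a machine-readable form; check the sbatch man page on the target system before relying on its exact output.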


SLURMmary

•  SLURM provides equivalent or similar functionality with Torque/Moab and UGE.
•  srun provides equivalent or similar process and thread affinity with aprun.
•  Please let us know if you have an advanced or complicated workflow, and anticipate potential porting issues. We can work with you to migrate your scripts.
•  Batch configurations are still subject to tunings and modifications before the system is in full production.


Documentation

•  SchedMD web page:
    –  http://www.schedmd.com/
•  Running Jobs on Cori
    –  https://www.nersc.gov/users/computational-systems/cori/running-jobs/
•  Man pages for slurm, sbatch, salloc, squeue, sinfo, sacct, scontrol, scancel, etc.
•  Torque/Moab vs. SLURM Comparisons
    –  https://www.nersc.gov/users/computational-systems/cori/running-jobs/for-edison-users/torque-moab-to-slurm-transition-guide/
•  Running jobs on Babbage using SLURM:
    –  https://www.nersc.gov/users/computational-systems/testbeds/babbage/running-jobs-under-slurm-on-babbage/
•  Running jobs on Edison's test system (Alva) with native SLURM
    –  https://www.nersc.gov/users/computational-systems/edison/alva-test-and-development-system-for-edison/#toc-anchor-7
