What MARCC Does
Maryland Advanced Research Computing Center
Jaime E. Combariza, PhD, Director
Slides available online
• www.marcc.jhu.edu/training
• MARCC HELP: [email protected]. Include as much information as possible, for example:
  • The job ID of the job with problems
  • Full path to the batch submission script
  • Any specific error messages
  • If possible, a snapshot with errors
• Frequently Asked Questions: https://www.marcc.jhu.edu/getting-started/faqs/
High Performance Computing Analogy
• Ingredients/Recipe: Scientific Applications
• Oven: MARCC
• Users
• Research Project
Model & Funding
• Grant from the State of Maryland to Johns Hopkins University to build an HPC/big-data facility
  • Building, IT stack and networking
• Operational cost covered by 5 schools:
  • Krieger School of Arts & Sciences (JHU)
  • Whiting School of Engineering (JHU)
  • School of Medicine (JHU)
  • Bloomberg School of Public Health (JHU)
  • University of Maryland at College Park (UMCP)
Definitions
• Cluster (High Performance Compute Cluster): an aggregation of servers with high-speed connectivity and file systems attached to the servers
• CPU: Central Processing Unit, aka "processor"
• GPU: Graphics Processing Unit
• Node (login/compute/management/others): a server with some amount of memory, cores, and CPUs/GPUs
• Core: a "processing unit" within a node (24 or 28 cores per node)
• Memory/RAM: amount of memory per core and per node (128/96 GB)
• Filesystem: storage attached to the cluster
• Network: connectivity between nodes and file systems/communication
• Software: scientific applications (Python, MATLAB, SAMtools, Gaussian)
8
Nodes Type Descrip8on TotalNocores TFLOPs
648 Regularcomputenodes Haswell24-core128GBRAM 15,552 622
50 LargeMemnodes IvyBridge48-core1024GBRAM 2,400 57.648 GPUnodes Haswell24-core,2NvidiaK80s 1,152 225.5
- Lustre 2PetaByteFilesystem
- ZFS 13TB(forma>ed)Originalsystem 19,104cores 905
48 Regularcomputenodes Broadwell,28-core128GBRAM 1,344 55.91
24 GPUnodes Broadwell,28-core2NvidiaK80s 672 117.75
4 GPUP100 Broadwell,28-coreplus2P100pernode 112 4.65+4/7/gpu2 GPUV100 64-coreand28-core
28 Condo Haswell24-core128GBRAM 672 26.88
8 Condo Broadwell,28-core128GBRAM 224 9.32
52 Condo SkylakeGold612624cores 1152TotalResourcesasof2/22/2019 +23,300cores 1.5+PFLOPs
Compute Nodes
HPC Resources & Model
• Approx. 21,120 cores and 15 PB storage
• Allocations per quarter:
  • KSAS: 13.4 M
  • WSE: 13.4 M
  • SOM: 6.8 M
  • BSPH: 2.6 M
  • UMCP: 6.4 M
  • Reserve: 2.0 M
Allocations
• Deans requested applications from all faculty members
• Allocations granted according to available resources
• http://marcc.jhu.edu/request-access/marcc-allocation-request/
Remarks
• MARCC is free of cost to PIs; the schools pay for operations
• Authentication: via two-factor authentication
• Open data or any kind of confidential data: dbGaP is fine, in most cases
• Secure Research Environment (MSE) for HIPAA data
• If additional resources (allocation) are needed, plan to add a condo (compute nodes)
Storage

Directory  | Quota                                                            | Backup (cost)   | Additional storage
$HOME      | 20 GBytes on ZFS                                                 | YES (no cost)   | NO
~/scratch  | 1 TB per group on Lustre, user access                            | NO              | YES (>10 TB, Vice Dean)
~/work     | shared quota with ~/scratch, group access                        |                 |
~/data     | 1 TB default per PI, up to 10 TB per group on ZFS, request MARCC | YES (no cost)   | N/A
~/work-zfs | 50 TB per group on ZFS, request Vice Dean                        | YES ($40/TB/yr) | $40/TB/yr + backup cost
~/project  | <6 months, upon request (Vice Dean)                              | NO              | N/A
Temporary files
• Temporary files go in ~/scratch
• Please do not use /tmp or /dev/shm
• If needed, please clean files after the job is completed
• If at all possible, please do not do heavy I/O to "data"; use scratch/work
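The per-job cleanup pattern above can be sketched in plain bash (a hedged illustration; the tmp-$JOBID path layout is an assumption, not a MARCC convention):

```shell
#!/bin/bash
# Sketch: keep temporaries in a per-job directory under ~/scratch, not /tmp.
# SLURM sets $SLURM_JOBID inside a batch job; we fall back to the shell PID
# so this snippet also runs outside SLURM.
JOBID="${SLURM_JOBID:-$$}"
TMPDIR_JOB="$HOME/scratch/tmp-$JOBID"   # illustrative path layout
mkdir -p "$TMPDIR_JOB"

# ... run your application, writing temporaries into "$TMPDIR_JOB" ...
echo "scratch dir: $TMPDIR_JOB"

# Clean up after the job completes, as the slide requests.
rm -rf "$TMPDIR_JOB"
```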
Connecting
• Windows: PuTTY, bash, XSHELL
• Mac: Terminal, XQuartz
• VNC (limited); Open OnDemand (OOD)
• ssh [-YX] gateway2.marcc.jhu.edu -l userid
• ssh -Y login.marcc.jhu.edu -l userid
File transfer
1. Use FileZilla: https://www.marcc.jhu.edu/getting-started/faqs/
2. ssh dtn.marcc.jhu.edu -l userid
3. Use "aspera" (FAQ)
4. Use Globus Connect:
   • Download client or use website
   • Create endpoint (if needed)
   • Authenticate using JHU single sign-on
sshfs Mounts (Basic)
• Fuse/sshfs is now enabled on the cluster
• It follows the two-factor authentication protocol
• Check with your local IT person to find out how to mount different file systems, or see MARCC's website for an example:
• https://www.marcc.jhu.edu/getting-started/basic/
Software & Modules
• MARCC manages software availability using "environment modules" (Lmod, from TACC)
• module --help
• module avail
• module spider python
• module list (ml)
• module load gaussian
More on Modules
• "module load" changes the user's path, prepending the package being loaded to the user's environment
• Example: python
  > which python
  /usr/bin/python
  (the Python version that comes with the OS)
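What the module system does to $PATH can be sketched in plain shell (a hedged illustration; in practice `module load python` performs this prepend for you, using the /software path shown later in the `module show` slide):

```shell
#!/bin/bash
# Illustration only: the effect of "module load python" on $PATH.
# Lmod prepends the package's bin directory, so its python is found
# before the OS one in /usr/bin.
echo "before: $(command -v python || echo 'not found')"

# This is the kind of prepend the modulefile performs (see "module show"):
export PATH="/software/apps/python/2.7/bin:$PATH"

# The first entry in $PATH is now the module's directory.
case "$PATH" in
  /software/apps/python/2.7/bin:*) echo "module path is first in PATH" ;;
esac
```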
module spider

[jcombar1@login-node03 ~]$ ml spider python

python:
------------------------------------------------------------------
  Versions:
    python/2.7-anaconda
    python/2.7-anaconda53
    python/2.7
    python/3.6-anaconda
    python/3.6
    python/3.7-anaconda
    python/3.7
  Other possible module matches:
    biopython
------------------------------------------------------------------
To find other possible module matches execute:
  $ module -r spider '.*python.*'
------------------------------------------------------------------
For detailed information about a specific "python" module (including
how to load the modules) use the module's full name. For example:
  $ module spider python/3.7
module show

[jcombar1@login-node03 ~]$ ml show python/2.7
------------------------------------------------------------------
/software/lmod/modulefiles/apps/python/2.7.lua:
------------------------------------------------------------------
help([[anaconda - loads the anaconda software & application environment

This adds /software/apps/python/2.7 to several of the environment variables.

After loading, to see system-installed packages for this Python installation please type: conda list

See MARCC's help page at: https://www.marcc.jhu.edu/getting-started/local-python-packages/]])
whatis("loads the Python 2.7 package")
always_load("centos7")
prepend_path("PATH", "/software/apps/python/2.7/bin")
prepend_path("LD_LIBRARY_PATH", "/software/apps/python/2.7/lib")
prepend_path("LD_LIBRARY_PATH", "/software/apps/python/2.7/lib/python2.7")
prepend_path("LD_LIBRARY_PATH", "/software/apps/python/2.7/lib/python2.7/site-packages")
Modules Examples

[jcombar1@login-node01 ~]$ module load python
[jcombar1@login-node01 ~]$ module list
Currently Loaded Modules:
  1) gcc/4.8.2   2) slurm/14.11.03   3) python/2.7.10
[jcombar1@login-node01 ~]$ which python
/software/apps/python/2.7
[jcombar1@login-node01 ~]$ echo $PATH
/software/apps/python/2.7/bin:/software/apps/marcc/bin:/software/centos7/bin:/software/apps/slurm/current/sbin:/software/apps/slurm/current/bin:.:/software/apps/mpi/openmpi/3.1.3/intel/18.0/bin:/software/apps/compilers/intel/itac/2018.3.022/intel64/bin:/software/apps/compilers/intel/clck/2018.3/bin/intel64:/software/apps/compilers/intel/compilers_and_libraries_2018.3.222/linux/bin/intel64:/software/apps/compilers/intel/compilers_and_libraries_2018.3.222/linux/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/software/apps/compilers/intel/parallel_studio_xe_2018.3.051/bin:/home-0/jcombar1/pdsh/bin:/home-0/jcombar1/bin
module help application

• module help gaussian
-----------------------------------------
Module Specific Help for "gaussian/g09"
-----------------------------------------
This is a computational chemistry application.
************************************************
* Email [email protected]
* Request access to the g09 group
************************************************
website: http://www.gaussian.com
Manual online: http://www.gaussian.com/g_tech/g_ur/g09_help.htm
------------------------------------------------
- On MARCC, Gaussian09 runs using threads. It does not use Linda libraries.
- Please do not run Gaussian over more than one node. Follow the example below.
------------------------------------------------
To run it in batch mode use a script like this one:

#!/bin/bash -l
#SBATCH --time=1:0:0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --partition=shared
#SBATCH --mem=120000MB
#### THE ABOVE REQUESTS 120 GB RAM
module load gaussian
gscratch=/scratch/users/$USER
mkdir -p $gscratch/$SLURM_JOBID
export GAUSS_SCRDIR=$gscratch/$SLURM_JOBID
date
time g09 water   # input file: water.com
Queuing system
• MARCC allocates resources to users using a transparent and fair process by means of a "queueing system"
• SLURM (Simple Linux Utility for Resource Management)
• Open source, adopted at many HPC centers and HHPC (similar environments)
SLURM Commands
• man slurm
• http://marcc.jhu.edu/getting-started/running-jobs/
Queues/Partitions
• Link to website

Partition               | Default/Max Time | Default/Max Cores | Default/Max Mem | Serial/Parallel | Backfilling
SHARED                  | 1 hr / 72 hr     | 1/24              | 5 GB / 120 GB   | Serial/Parallel | Shared
UNLIMITED               | Unlimited        | 1/24/48           | 5 GB / 120 GB   | Serial/Parallel | Shared
PARALLEL                | 1 hr / 72 hr     |                   | 5 GB / 120 GB   | Parallel        | Exclusive
GPUK80/gpup100/gpuv100  | 1 hr / 72 hr     | 6/24              | 5 GB / 120 GB   | Serial/Parallel | Shared
LRGMEM                  | 1 hr / 72 hr     | 48                | 21 GB / 1024 GB | Serial/Parallel | Shared
Scavenger               | Max 6 hr         | 1/24              | 5 GB / 120 GB   | Serial/Parallel | Shared
Partitions/Shared
• May share compute nodes between jobs
• Serial or parallel jobs
• Time limit 1 hr to 100 hours
• 1-24 cores
• one node

• #SBATCH -N 1
• #SBATCH --ntasks-per-node=12
• #SBATCH --partition=shared
Partitions/Unlimited
• Unlimited time!!!
• Jobs that need to run for more than 100 hours
• If the system/node crashes the job will be killed
• one to 24 cores, one or multi-node
• serial or parallel

• #SBATCH -N n   (n = 1 or more)
• #SBATCH --partition=unlimited
• #SBATCH --time=15-00:00:00   (fifteen days)
• #SBATCH --ntasks-per-node=m
• #SBATCH --mem=0   (important!)
Partitions/Parallel
• Dedicated queue
• exclusive nodes
• single and multi-node jobs
• 1 hr to 100 hours
• Parallel jobs only

• #SBATCH -N 4
• #SBATCH --ntasks-per-node=24   (96 cores)
• #SBATCH --partition=parallel
• #SBATCH --mem=0
Partitions/scavenger
• Must use with qos=scavenger
• #SBATCH --qos=scavenger
• #SBATCH --partition=scavenger

• Low priority jobs
• Time maximum 6 hours
• Use only if your allocation ran out
SLURM Flags

Description               | Flag
Script directive          | #SBATCH
Job name                  | #SBATCH --job-name=Any-name
Requested time            | #SBATCH -t minutes  or  #SBATCH -t [days-hrs:min:sec]
Nodes requested           | #SBATCH -N min-Max  or  #SBATCH --nodes=Number
Number of cores per node  | #SBATCH --ntasks-per-node=12
Number of cores per task  | #SBATCH --cpus-per-task=2
Mail                      | #SBATCH --mail-type=end
User's email address      | #SBATCH [email protected]
Memory size               | #SBATCH --mem=[mem|M|G|T]
Job arrays                | #SBATCH --array=[array_spec]
Request specific resource | #SBATCH --constraint="XXX"
SLURM Env variables

Description      | Variable
Job ID           | $SLURM_JOBID
Submit Directory | $SLURM_SUBMIT_DIR
Submit Host      | $SLURM_SUBMIT_HOST
Node List        | $SLURM_JOB_NODELIST
Job Array Index  | $SLURM_ARRAY_TASK_ID

> printenv | grep SLURM
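A hedged sketch of how these variables are typically used inside a batch script (the log-file naming is illustrative; the variables themselves are set by SLURM when the job starts, so the sketch provides fallbacks for running it outside a job):

```shell
#!/bin/bash
# Illustrative use of SLURM environment variables inside a job script.
# Outside a real job they are unset, hence the demo fallbacks.
JOBID="${SLURM_JOBID:-demo}"
SUBMIT_DIR="${SLURM_SUBMIT_DIR:-$PWD}"

# Run from the directory the job was submitted from, and label output by job ID.
cd "$SUBMIT_DIR"
LOGFILE="run-$JOBID.log"
echo "job $JOBID running on ${SLURM_JOB_NODELIST:-localhost}" > "$LOGFILE"
cat "$LOGFILE"
rm -f "$LOGFILE"   # tidy up the demo file
```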
SLURM Scripts

cp -r /scratch/public/scripts .   (copy directory)

#!/bin/bash -l
#SBATCH --job-name=MyJob
#SBATCH --time=8:0:0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --mail-type=end
#[email protected]
#SBATCH --partition=shared
module load mvapich2/gcc/64/2.0b   #### load mvapich2 module
time mpiexec ./code-mvapich.x > OUT-24log
Running Jobs
• sbatch (qsub) script-name
• squeue (qstat -a) -u userid (sqme)
login-vmnode01.cm.cluster:
                                                    Req'd   Req'd     Elap
Job id  Username  Queue   Name   SessID  NDS  TSK  Memory   Time   S  Time
------  --------  ------  -----  ------  ---  ---  ------  ------  -  ------
300     jcombar1  shared  MyJob  --      1    12   --       08:00  R  00:00
Interactive work

• interact -usage
  usage: interact [-n cores] [-t walltime] [-m memory] [-p queue]
                  [-o outfile] [-X] [-f featurelist] [-h hostname] [-g ngpus]
  Interactive session on a compute node
  options:
    -n cores        (default: 1)
    -t walltime     as hh:mm:ss (default: 30:00)
    -m memory       as #[k|m|g] (default: 5GB)
    -p partition    (default: 'def')
    -o outfile      save a copy of the session's output to outfile (default: off)
    -X              enable X forwarding (default: no)
    -f featurelist  CCV-defined node features (e.g., 'e5-2600'),
                    combined with '&' and '|' (default: none)
    -h hostname     only run on the specific node 'hostname'
                    (default: none, use any available node)
Job arrays

#!/bin/bash -l
#SBATCH --job-name=job-array
#SBATCH --time=1:0:0
#SBATCH --array=1-240
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=shared
#SBATCH --mem=4.9GB
#SBATCH --mail-type=end
#[email protected]

# run your job
echo "Start Job $SLURM_ARRAY_TASK_ID on $HOSTNAME"
...
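A common job-array pattern is to use the array index to select a per-task input file; a hedged sketch (the input_N.txt naming scheme is an assumption for illustration, not a MARCC convention):

```shell
#!/bin/bash
# Sketch: map the array task ID to a per-task input file.
# SLURM sets $SLURM_ARRAY_TASK_ID for each element of --array=1-240;
# we default it to 1 so the snippet also runs outside a job.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"
INPUT="input_${TASK_ID}.txt"     # hypothetical naming scheme
echo "task $TASK_ID would process $INPUT"
```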
GPUs and Interactive Jobs
• #SBATCH -p gpu --gres=gpu:4
• #SBATCH --ntasks-per-node=4
• #SBATCH --cpus-per-task=6

• interact -p debug -g 1 -n 1 -c 6
Compilers/Compiling
• Intel compilers
  • module list
  • ifort -O3 -openmp -o my.exe my.f90
  • icc -g -o myc.x myc.c
• GNU
  • gfortran -O4 -o myg.x myg.f90
  • gcc -O4 -o myc.x myc.c
• PGI compilers (ml pgi)
  • pgcc -help
MPI jobs
• module spider mpi
• module load mvapich2
• mpif90 or mpicc code (.f90 or .c)
• mpiexec code.x (within a compute node)
• Use mpiicc or mpif90 (Intel MPI)
• mpif90 and mpicc (use gcc)
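Putting the commands above together, a minimal multi-node MPI batch script might look like this (a hedged sketch: the module name follows the slides, but the exact version and the executable name code.x are placeholders, not a MARCC-specified setup):

```shell
#!/bin/bash -l
# Hedged sketch of a multi-node MPI job on the parallel partition.
#SBATCH --job-name=mpi-example
#SBATCH --time=1:0:0
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --partition=parallel
#SBATCH --mem=0               # request all memory on each node

module load mvapich2          # or an Intel MPI module; version is site-specific

# mpiexec launches one rank per task (2 nodes x 24 tasks = 48 ranks here).
mpiexec ./code.x > OUT.log    # ./code.x is a placeholder executable name
```

This is a configuration fragment to adapt, not a runnable example: submit it with `sbatch` after replacing the module and executable with your own.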
Information
• Website: marcc.jhu.edu