
An Introduction to High Performance Computing

Stuart Rankin [email protected]

Research Computing Services (http://www.hpc.cam.ac.uk/)
University Information Services (http://www.uis.cam.ac.uk/)

21st March 2019 / UIS Training

UIS: Research Computing Services

Your trainers for today will be:

I Stuart Rankin, Research Computing User Services

I Chris Hadjigeorgiou, Research Software Engineering

2 of 86

Welcome

I Please sign in on the attendance sheet.

I Please give your online feedback at the end of the course: http://feedback.training.cam.ac.uk/ucs/form.php

I Keep your belongings with you.

I Please ask questions and let us know if you need assistance.

3 of 86

Plan of the Course

Part 1: Basics

Part 2: Research Computing Services HPC

Part 3: Using HPC

4 of 86


Part I: Basics

Basics: Training accounts

I For our practical exercises we will use HPC training accounts. These are distinct from the MCS desktop training accounts.

I You will find HPC training account details on your desk.

I Your HPC training account is valid only for today.

I The name of the HPC account will be the same as your MCS desktop account: z4XY (where XY is the station number).

I Please check your MCS workstation is booted into Ubuntu Linux and logged in; ask if you need help with this.

I PDFs of the course notes and the exercises can be found in your MCS filespace.

6 of 86

Basics: Login nodes

I For our practical exercises we will use the login nodes login.hpc.cam.ac.uk.

I We will be using the skylake nodes which are part of the CPU cluster.

I We also have knl (specialised CPU) and pascal (GPU) nodes.

7 of 86

Basics: About names

I Earlier versions of this course referred to the Darwin and Wilkes clusters, but these have been retired. In 2017 they were superseded by the new clusters Peta4 and Wilkes2 (collectively part of CSD3).

I After the most recent hardware upgrade in October 2018, the facility is being re-branded Cumulus. This is a cosmetic change only and affects none of the details of actually using the system.

I To submit jobs, the important choice is skylake, knl or pascal.

8 of 86


Basics: Security

I Boring but very, very important . . .

I Cambridge IT is under constant attack by would-be intruders.

I Your data and research career are potentially threatened by intruders.

I Cambridge systems are high profile and popular targets (be paranoid, because they are out to get you).

I Don’t be the weakest link.

9 of 86

Basics: Security

I Keep your password (or private key passphrase) safe.

I Always choose strong passwords.

I Your UIS password is used for multiple systems so keep it secure!

I Keep the software on your laptops/tablets/PCs up to date — this includes home computers.

I Check out and install free anti-malware software available for work and home: https://help.uis.cam.ac.uk/service/security/stay-safe-online/malware

I Don’t share accounts (this is against the rules anyway, and your friends can get their own).

10 of 86

Prerequisites

I Basic Unix/Linux command line experience: Unix: Introduction to the Command Line Interface (self-paced), https://www.training.cam.ac.uk/ucs/Course/ucs-unixintro1

I Shell scripting experience is desirable: Unix: Simple Shell Scripting for Scientists, https://www.training.cam.ac.uk/ucs/Course/ucs-scriptsci

11 of 86

Basics: Why Buy a Big Computer?

What types of big problem might require a “Big Computer”?

Compute Intensive: A single problem requiring a large amount of computation.

Data Intensive: A single problem operating on a large amount of data.

Memory Intensive: A single problem requiring a large amount of memory.

High Throughput: Many unrelated problems to be executed in bulk.

12 of 86


Basics: Compute Intensive Problems

I Distribute the work for a single problem across multiple CPUs to reduce the execution time as far as possible.

I Program workload must be parallelised:

Parallel programs split into copies (processes or threads).
Each process/thread performs a part of the work on its own CPU, concurrently with the others.
A well-parallelised program will fully exercise as many CPUs as there are processes/threads.

I The CPUs typically need to exchange information rapidly, requiring specialized communication hardware.

I Many use cases from Physics, Chemistry, Engineering, Astronomy, Biology...

I The traditional domain of HPC and the Supercomputer.

13 of 86

Basics: Scaling & Amdahl’s Law

I Using more CPUs is not necessarily faster.
I Typically parallel codes have a scaling limit.
I Partly due to the system overhead of managing more copies, but also to more basic constraints:
I Amdahl’s Law (idealized):

S(N) = 1 / ((1 − p) + p/N)

where
S(N) is the factor by which the program has sped up relative to N = 1,
p is the fraction of the program which can be parallelized,
N is the number of CPUs.
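
I For example, a code that is 95% parallelisable (p = 0.95) run on N = 16 CPUs gives

S(16) = 1 / ((1 − 0.95) + 0.95/16) ≈ 9.1

and no matter how many CPUs are added, the speedup can never exceed 1/(1 − p) = 20.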

14 of 86

Basics: Amdahl’s Law

http://en.wikipedia.org/wiki/File:AmdahlsLaw.svg

15 of 86

The Bottom Line

I Parallelisation requires effort:
I There are libraries to help (e.g. OpenMP, MPI).
I Aim to make both p and performance per CPU as large as possible.

I The scaling limit: eventually using more CPUs becomes detrimental instead of helpful.

16 of 86


Basics: Data Intensive Problems

I Distribute the data for a single problem across multiple CPUs to reduce the overall execution time.

I The same work may be done on each data segment.

I Rapid movement of data to and from disk is more important than inter-CPU communication.

I Big Data problems of great current interest -

I Hadoop/MapReduce

I Life Sciences (genomics) and elsewhere.

17 of 86

Basics: High Throughput

I Distribute independent, multiple problems across multiple CPUs to reduce the overall execution time.

I Workload is trivially (or embarrassingly) parallel:

∗ Workload breaks up naturally into independent pieces.
∗ Each piece is performed by a separate process/thread on a separate CPU (concurrently).
∗ Little or no inter-CPU communication.

I Emphasis is on throughput over a period, rather than on performance on a single problem.

I Compute intensive capable ⇒ high throughput capable (not conversely).

I If you are using lots of R or Python, you are probably high throughput, and possibly data intensive or compute intensive.

18 of 86

Basics: Memory Intensive Problems

I Require aggregation of large memory, rather than many CPUs.

NB Memory (fast, volatile) not disk (slow, non-volatile).

I Performance optimisation is harder (memory layout tends to be highly nonuniform).

I More technically difficult and expensive to scale beyond a single box.

I If you think you have a memory intensive problem, are you sure it needs to be?

19 of 86

Basics: Putting it All Together

I Each of these types of problem requires combining many CPUs and memory modules.

I Nowadays, there can be many CPUs and memory modules inside a single commodity PC or server.

I HPC involves combining many times more than this.

20 of 86


Basics: Inside a Modern Computer

I Today’s commodity servers already aggregate both CPUs and memory to make a single system image in a single box.

I Even small computers now have multiple cores (fully functional CPUs) per socket.

I Larger computers have multiple sockets (each with their own local memory):
All CPUs (unequally) share the node memory =⇒ the node is a shared memory multiprocessor with Non-Uniform Memory Architecture (NUMA), but users still see a single computer (single system image).

21 of 86

Basics: Inside a Modern Computer

22 of 86

Basics: How to Build a Supercomputer

I A supercomputer aggregates contemporary CPUs and memory to obtain increased computing power.

I Usually today these are clusters.

1. Take some (multicore) processors plus some memory.
I Could be an off-the-shelf server, or something more special.
I A NUMA, shared memory, multiprocessor building block: a node.

23 of 86

Basics: How to Build a Supercomputer

2. Connect the nodes with one or more networks. E.g.

Gbit Ethernet: 100 MB/sec
Omni-Path: 10 GB/sec

The faster network is for inter-CPU communication across nodes.
The slower network is for management and provisioning.
Storage may use either.

24 of 86


Basics: How to Build a Supercomputer

3. Logically bind the nodes
I Clusters consist of distinct nodes (i.e. separate Linux computers) on common private network(s) and controlled centrally.
∗ Private networks allow CPUs in different nodes to communicate.
∗ Clusters are distributed memory machines: each process/thread sees only its local node’s CPUs and memory (without help).
∗ Each process/thread must fit within a single node’s memory.

I More expensive machines logically bind nodes into a single system, i.e. CPUs and memory.
∗ E.g. SGI UV.
∗ Private networks allow CPUs to see CPUs and memory in other nodes, transparently to the user.
∗ These are shared memory machines, but very NUMA.
∗ Logically a single system - 1 big node.
∗ A single process can span the entire system.

25 of 86

Basics: Programming a Multiprocessor Machine

I Non-parallel (serial) code
∗ For a single node as for a workstation.
∗ Typically run as many copies per node as CPUs, assuming node memory is sufficient (see the sketch after this list).
∗ Replicate across multiple nodes.

I Parallel code
∗ Shared memory methods within a node. E.g. pthreads, OpenMP. Intra-node only.
∗ Distributed memory methods spanning one or more nodes. Message Passing Interface (MPI). Both intra- and inter-node.
∗ Some codes use both forms of parallel programming (hybrid).
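
I A minimal sketch of the serial case (assuming a hypothetical program ./serial_prog, its input files, and a 32-CPU node):

for i in $(seq 1 32); do
  ./serial_prog input_$i.dat > output_$i.log &   # one copy per CPU, run in the background
done
wait                                             # wait for all 32 copies to finish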

26 of 86

Basics: Summary

I Why have a supercomputer?
I Single problems requiring great time or big data; many problems.

I Most current supercomputers are clusters of separate nodes.

I Each node has multiple CPUs and non-uniform shared memory.

I Parallel code uses shared memory (pthreads/OpenMP/MPI) within a node, distributed memory (MPI) across multiple nodes.

I Non-parallel code uses the memory of one node, but may be copied across many.

27 of 86


Part II: Research Computing Services HPC


Early History: EDSAC (1949–1958)

29 of 86

Early History: EDSAC (1949–1958)

I Electronic Delay Storage Automatic Calculator

I The second general use, electronic digital (Turing complete) stored program computer

I 3,000 valves

I 650 instructions per second

I 2KB memory in mercury ultrasonic delay lines

I One program at a time!

I Used in meteorology, genetics, theoretical chemistry, numerical analysis, radioastronomy.

I “On a few occasions it worked for more than 24 hours.”

30 of 86

Early History: Mainframes (1958–1995)

EDSAC 2 (1958–1965) Complete redesign in-house: 10x faster, 80KB memory.

TITAN (1964–1973) Multiuser system, designed with Ferranti.

Phoenix (1971–1995) IBM mainframes, general purpose (including email).

Mainframe service morphs into distributed research computing support with central services.

Specialised research computing needs remain!

31 of 86

Central HPC in Cambridge

Created: 1996 (as the HPCF).

Mission: Delivery and support of a large HPC resource for use by the University of Cambridge research community.

Self-funding: Paying and non-paying service levels.

User base: Includes external STFC & EPSRC plus industrial users.

Plus: Dedicated group nodes and research projects.

2017 Research Computing Service (within the UIS).

32 of 86


History of Performance

http://www.top500.org

1997 76.8 Gflop/s

2002 1.4 Tflop/s

2006 18.27 Tflop/s

2010 30 Tflop/s

2012 183.38 Tflop/s

2013 183.38 CPU + 239.90 GPU Tflop/s

2017 1.697 CPU + 1.193 GPU Pflop/s

2018 2.271 CPU + 1.193 GPU Pflop/s

33 of 86

Darwin1 (2006–2012)

34 of 86

Darwin3 (2012–2018)(b) & Wilkes (2013–2018)(f)

35 of 86

Peta4 (2017) Cumulus (2018)

36 of 86


Skylake

I Each compute node:

∗ 2x16 cores, Intel Skylake 2.6 GHz (32 CPUs)
∗ 192 GB or 384 GB RAM (6 GB or 12 GB per CPU)
∗ 100 Gb/sec Omni-Path: 10 GB/sec (for MPI and storage)

I 1152 compute nodes.

I 8 login nodes (login-cpu.hpc.cam.ac.uk).

37 of 86

Coprocessors — GPUs etc

I CPUs are general purpose
I Some types of parallel workload fit vector processing well:
I Single Instruction, Multiple Data (SIMD)
I Think pixels on a screen
I GPUs specialise in this type of work
I Also competitor many-core architectures such as the Intel Phi

38 of 86

Pascal

I Each compute node:

∗ 4× NVIDIA P100 GPUs
∗ 1x12 cores, Intel Broadwell 2.2 GHz (12 CPUs)
∗ 96 GB RAM
∗ 100 Gb/sec (4X EDR) Infiniband: 10 GB/sec (for MPI and storage)

I 90 compute nodes.

I 8 login nodes (login-gpu.hpc.cam.ac.uk).

39 of 86

KNL (Intel Phi)

I Each compute node:

∗ 64 cores, Intel Phi 7210 (256 CPUs)
∗ 96 GB RAM
∗ 100 Gb/sec Omni-Path: 10 GB/sec (for MPI and storage)

I 342 compute nodes

I Shared login nodes with Skylake

40 of 86


Cluster Storage

I Lustre cluster filesystem:

∗ Very scalable, high bandwidth.
∗ Multiple RAID6 back-end disk volumes.
∗ Multiple object storage servers.
∗ Single metadata server.
∗ Tape-backed HSM on newest filesystems.
∗ 12 GB/sec overall read or write.
∗ Prefers big reads/writes over small.
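∗ Since Lustre prefers large transfers, one option is to bundle many small files into one archive before moving or storing them (the directory name is illustrative only):

tar -cf bundle.tar many_small_files/   # one large sequential write instead of many small ones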

41 of 86

Obtaining an Account and Support

I https://www.hpc.cam.ac.uk/applications-access-research-computing-services

I Email [email protected]

42 of 86


Part III: Using HPC

Using HPC: Connecting to the RCS Clusters

I SSH secure protocol only. Supports login, file transfer, remote desktop...

I SSH access is allowed from anywhere. Fail2Ban will ban repeatedly failing clients for 20 minutes.

I Policies for other clusters may differ.

44 of 86


Connecting: Windows Clients

I putty, pscp, psftp: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

I WinSCP: http://winscp.net/eng/download.php

I TurboVNC (remote desktop, 3D optional)

http://sourceforge.net/projects/turbovnc/files/

I Cygwin (provides an application environment similar to Linux)

http://cygwin.com/install.html

Includes X server for displaying graphical applications running remotely.

I MobaXterm: http://mobaxterm.mobatek.net/

45 of 86

Connecting: Linux/MacOSX/UNIX Clients

I ssh, scp, sftp, rsync: installed (or installable).

I TurboVNC (remote desktop, 3D optional)

http://sourceforge.net/projects/turbovnc/files/

I On MacOSX, install XQuartz to display remote graphical applications: http://xquartz.macosforge.org/landing/

46 of 86

Connecting: Login

I From Linux/MacOSX/UNIX (or Cygwin):
ssh -Y abc123@login-cpu.hpc.cam.ac.uk

I From graphical clients:
Host: login-cpu.hpc.cam.ac.uk
Username: abc123 (your UCAM account name)

I login-cpu.hpc will map to a random login node, i.e. one of login-e-9, login-e-10, ..., login-e-16

47 of 86

Connecting: First time login

I The first connection to a particular hostname produces the following:

The authenticity of host ’login-cpu (128.232.224.50)’ can’t be established.

ECDSA key fingerprint is SHA256:HsiY1Oe0M8tS6JwR76PeQQA/VB7r8675BzG5OYQ4h34.

ECDSA key fingerprint is MD5:34:9b:f2:d2:c6:b3:5c:63:99:b7:27:da:5b:c8:16:fe.

Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added ’login-cpu,128.232.224.50’ (ECDSA) to the list of known hosts.

I One should always check the fingerprint before typing “yes”.

I Graphical SSH clients should ask a similar question.

I Designed to detect fraudulent servers.

48 of 86


Connecting: First time login

I Exercise 1 - Log into your RCS training account.

I Exercise 2 - Simple command line operations.

49 of 86

Connecting: File Transfer

I With graphical clients, connect as before and drag and drop.

I From Linux/MacOSX/UNIX (or Cygwin):

rsync -av old_directory/ abc123@login-cpu.hpc.cam.ac.uk:rds/hpc-work/new_directory

copies the contents of old_directory to ~/rds/hpc-work/new_directory.

rsync -av old_directory abc123@login-cpu.hpc.cam.ac.uk:rds/hpc-work/new_directory

copies old_directory (and contents) to ~/rds/hpc-work/new_directory/old_directory.

∗ Rerun to update or resume after interruption.
∗ All transfers are checksummed.
∗ For transfers in the opposite direction, place the remote machine as the first argument.
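∗ For example, to pull results back to your own machine (results_directory and local_directory are illustrative names):

rsync -av abc123@login-cpu.hpc.cam.ac.uk:rds/hpc-work/results_directory local_directory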

I Exercise 3 - File transfer.

50 of 86

Connecting: Remote Desktop

I First time starting a remote desktop:

[sjr20@login-e-1 ~]$ vncserver

You will require a password to access your desktops.

Password:

Verify:

Would you like to enter a view-only password (y/n)? n

New ’login-e-1:99 (sjr20)’ desktop is login-e-1:99

Starting applications specified in /home/sjr20/.vnc/xstartup

Log file is /home/sjr20/.vnc/login-e-1:99.log

I NB Choose a different password for VNC.
I The VNC password protects your desktop from other users.
I Remember the unique host and display number (login-e-1 and 99 here) of your desktop.

51 of 86

Connecting: Remote Desktop

I Remote desktop already running:

[sjr20@login-e-1 ~]$ vncserver -list

TigerVNC server sessions:

X DISPLAY # PROCESS ID

:99 130655

I Kill it:

[sjr20@login-e-1 ~]$ vncserver -kill :99

Killing Xvnc process ID 130655

I Typically you only need one remote desktop.

I Keeps running until killed, or the node reboots.

52 of 86


Connecting: Remote Desktop

I To connect to the desktop from Linux:

vncviewer -via abc123@login-e-1.hpc.cam.ac.uk localhost:99

I The display number 99 will be different in general and unique to each desktop.

I You will be asked firstly for your cluster login password, and secondly for your VNC password.

I Press F8 to bring up the control panel.

53 of 86

Using HPC: User Environment

I Scientific Linux 7.x (Red Hat Enterprise Linux 7.4 rebuild)
I bash shell
I Gnome or XFCE4 desktop (if you want)
I GCC, Intel, PGI compilers and other development software.

I But you don’t need to know that.

I NOT Ubuntu or Debian!

I CentOS 7 is OK.

54 of 86

User Environment: Filesystems

I /home/abc123
I 40GB quota.
I Visible equally from all nodes.
I Single storage server.
I Hourly, daily, weekly snapshots copied to tape.
I Not intended for job outputs or large/many input files.

I /rds/user/abc123/hpc-work a.k.a. /home/abc123/rds/hpc-work
I Visible equally from all nodes.
I Larger and faster (1TB initial quota).
I Intended for job inputs and outputs.
I Not backed up.
I Research Data Storage: https://www.hpc.cam.ac.uk/research-data-storage-services

55 of 86

Filesystems: Quotas

I quota

[abc123@login-e-1 ~]$ quota

Filesystem GiBytes quota limit grace files quota limit grace User/group

/home 10.6 40.0 40.0 0 ----- No ZFS File Quotas ----- U:abc123

/rds-d2 1.0 1024.0 1126.4 - 8 1048576 1048576 - G:abc123

I Aim to stay below the soft limit (quota).

I Once over the soft limit, you have 7 days grace to return below.

I When the grace period expires, or you reach the hard limit (limit),no more data can be written.

I It is important to rectify an out of quota condition ASAP.
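
I For example, to see what is using the space (the path is illustrative):

du -sh ~/rds/hpc-work/* | sort -h   # per-item usage, largest last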

56 of 86


Filesystems: Automounter

I Directories under /rds/user and /rds/project are automounted.

I They only appear when explicitly referenced.

I Thus when browsing, these directories may appear to be empty — use ls or cd to reference /rds/user/abc123 explicitly.

I We create convenience symlinks (shortcuts) under ~/rds.
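
I For example:

ls /rds/user/abc123    # referencing the path triggers the automount
cd ~/rds/hpc-work      # or follow the convenience symlink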

57 of 86

Filesystems: Permissions

I Be careful and if unsure, please ask support.
I Can lead to accidental destruction of your data or account compromise.

I Avoid changing the permissions on your home directory.
I Files under /home are particularly security sensitive.
I Easy to break passwordless communication between nodes.

58 of 86

User Environment: Software

I Free software accompanying Red Hat Enterprise Linux is (or can be) provided.

I Other software (free and non-free) is available via modules.

I Some proprietary software may not be generally accessible.

I New software may be possible to provide on request.

I Self-installed software should be properly licensed.

I sudo will not work. (You should be worried if it did.)

I Docker-compatible containers can now be downloaded and used via singularity.
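
I For example, a minimal singularity sketch (the image is illustrative; the output filename can differ between singularity versions):

singularity pull docker://ubuntu:18.04                   # fetch and convert a Docker image
singularity exec ubuntu_18.04.sif cat /etc/os-release    # run a command inside the container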

59 of 86


User Environment: Environment Modules

I Modules load or unload additional software packages.

I Some are required and automatically loaded on login.

I Others are optional extras, or possible replacements for othermodules.

I Beware unloading default modules in ~/.bashrc.

I Beware overwriting environment variables such as PATH and LD_LIBRARY_PATH in ~/.bashrc. If necessary append or prepend.
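
I For example, in ~/.bashrc append or prepend instead of overwriting (the directory is illustrative):

export PATH="$HOME/my_tools/bin:$PATH"                         # keeps the existing PATH
export LD_LIBRARY_PATH="$HOME/my_tools/lib:$LD_LIBRARY_PATH"   # likewise for libraries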

60 of 86

User Environment: Environment Modules

I Currently loaded:

module list

Currently Loaded Modulefiles:

1) dot 9) intel/impi/2017.4/intel

2) slurm 10) intel/libs/idb/2017.4

3) turbovnc/2.0.1 11) intel/libs/tbb/2017.4

4) vgl/2.5.1/64 12) intel/libs/ipp/2017.4

5) singularity/current 13) intel/libs/daal/2017.4

6) rhel7/global 14) intel/bundles/complib/2017.4

7) intel/compilers/2017.4 15) rhel7/default-peta4

8) intel/mkl/2017.4

I Available:

module av

61 of 86

User Environment: Environment Modules

I Whatis:

module whatis openmpi-1.10.7-gcc-5.4.0-jdc7f4f

openmpi-1.10.7-gcc-5.4.0-jdc7f4f: The Open MPI Project is an open source...

I Load:

module load openmpi-1.10.7-gcc-5.4.0-jdc7f4f

I Unload:

module unload openmpi-1.10.7-gcc-5.4.0-jdc7f4f

62 of 86

User Environment: Environment Modules

I Matlab

module load matlab/r2017b

I Invoking matlab in batch mode:

matlab -nodisplay -nojvm -nosplash < command.m

where the file command.m contains your matlab code.

I The University site license contains the Parallel Computing Toolbox.

I MATLAB Parallel Server coming soon!

63 of 86


User Environment: Environment Modules

I Purge:

module purge

I Defaults loaded on login (vary by cluster):

module show rhel7/default-peta4

-------------------------------------------------------------------

/usr/local/Cluster-Config/modulefiles/rhel7/default-peta4:

module-whatis default user environment for Peta4 nodes with Intel MPI

setenv OMP_NUM_THREADS 1

module add dot slurm turbovnc vgl singularity

module add rhel7/global

module add intel/bundles/complib/2017.4

-------------------------------------------------------------------

module load rhel7/default-peta4

I Run time environment must match compile time environment.

64 of 86

User Environment: Compilers

Intel: icc, icpc, ifort (recommended)

icc -O3 -xHOST -ip code.c -o prog

mpicc -O3 -xHOST -ip mpi_code.c -o mpi_prog

GCC: gcc, g++, gfortran

gcc -O3 -mtune=native code.c -o prog

mpicc -cc=gcc -O3 -mtune=native mpi_code.c -o mpi_prog

PGI: pgcc, pgCC, pgf90

pgcc -O3 -tp=skylake code.c -o prog

mpicc -cc=pgcc -O3 -tp=skylake mpi_code.c -o mpi_prog

Exercise 4: Modules and Compilers

65 of 86

Using HPC: Job Submission

66 of 86

Using HPC: Job Submission

I Compute resources are managed by a scheduler: SLURM/PBS/SGE/LSF/...

I Jobs are submitted to the scheduler
— analogous to submitting jobs to a print queue
— a file (submission script) is copied and queued for processing.

67 of 86


Using HPC: Job Submission

I Jobs are submitted from the login node— not itself managed by the scheduler.

I Jobs may be either non-interactive (batch) or interactive.

I Batch jobs run a shell script on the first of a list of allocated nodes.

I Interactive jobs provide a command line on the first of a list of allocated nodes.

68 of 86

Using HPC: Job Submission

I Jobs may use part or all of one or more nodes
— the owner can specify --exclusive to force exclusive node access (automatic on KNL).

I Template submission scripts are available under /usr/local/Cluster-Docs/SLURM.

69 of 86

Job Submission: Using SLURM

I Prepare a shell script and submit it to SLURM:

[abc123@login-e-1]$ sbatch slurm_submission_script

Submitted batch job 790299

70 of 86

Job Submission: Show Queue

I Submitted job scripts are copied and stored in a queue:

[abc123@login-e-1]$ squeue -u abc123

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

790299 skylake Test3 abc123 PD 0:00 2 (PriorityResourcesAssocGrpCPUMinsLimit)

790290 skylake Test2 abc123 R 27:56:10 2 cpu-e-[1,10]

71 of 86


Job Submission: Monitor Job

I Examine a particular job:

[abc123@login-e-1]$ scontrol show job=790290

72 of 86

Job Submission: Cancel Job

I Cancel a particular job:

[abc123@login-e-1]$ scancel 790290

73 of 86

Job Submission: Scripts

I SLURM: In /usr/local/Cluster-Docs/SLURM, see examples: slurm_submit.peta4-skylake, slurm_submit.wilkes2.

#!/bin/bash

#! Name of the job:

#SBATCH -J myjob

#! Which project should be charged:

#SBATCH -A CHANGEME

#! How many whole nodes should be allocated?

#SBATCH --nodes=1

#! How many tasks will there be in total? (<= nodes*32)

#SBATCH --ntasks=16

#! How much wallclock time will be required?

#SBATCH --time=02:00:00

#! Select partition:

#SBATCH -p skylake

...

I #SBATCH lines are structured comments— correspond to sbatch command line options.

I The above job will be given 16 cpus on 1 node for 2 hours (by default there is 1 task per node, and 1 cpu per task).

74 of 86

Job Submission: Accounting Commands

I How many core hours available do I have?

mybalance

User Usage | Account Usage | Account Limit Available (hours)

---------- --------- + -------------- --------- + ------------- ---------

sjr20 3 | SUPPORT-CPU 2,929 | 22,425,600 22,422,671

sjr20 0 | SUPPORT-GPU 0 | 87,600 87,600

I How many core hours does some other project or user have?

gbalance -p SUPPORT-CPU

User Usage | Account Usage | Account Limit Available (hours)

---------- --------- + -------------- --------- + ------------- ---------

pfb29 2,925 | SUPPORT-CPU 2,929 | 22,425,600 22,422,671

sjr20 * 3 | SUPPORT-CPU 2,929 | 22,425,600 22,422,671

...

(Use -u for user.)

I List all jobs charged to a project/user between certain times:gstatement -p SUPPORT-CPU -u xyz10 -s "2018-04-01-00:00:00" -e "2018-04-30-00:00:00"

JobID        User      Account    JobName    Partition  End                  ExitCode State      CompHrs
------------ --------- ---------- ---------- ---------- ------------------- -------- ---------- --------
263          xyz10     support-c+ _interact+ skylake    2018-04-18T19:44:40 0:0      TIMEOUT    1.0
264          xyz10     support-c+ _interact+ skylake    2018-04-18T19:48:07 0:0      CANCELLED+ 0.1
275          xyz10     support-c+ _interact+ skylake    Unknown             0:0      RUNNING    0.3
...

75 of 86


Job Submission: Single Node Jobs

I Serial jobs requiring large memory, or OpenMP codes.

#!/bin/bash

...

#SBATCH --nodes=1

#SBATCH --ntasks=1

# Default is 1 task per node

#SBATCH --cpus-per-task=

#SBATCH --mem=5990

# Memory per node in MB - default is pro rata by cpu number

# Increasing --mem or --cpus-per-task implicitly increases the other

...

export OMP_NUM_THREADS= # For OpenMP across cores

$application $options

...

76 of 86

Job Submission: Single Node Jobs

I Serial jobs requiring large memory, or OpenMP codes.

#!/bin/bash

...

#SBATCH --nodes=1

#SBATCH --ntasks=1

# Default is 1 task per node

#SBATCH --cpus-per-task=1

# Default is 1 cpu (core) per task

#SBATCH --mem=5990

# Memory per node in MB - default is pro rata by cpu number

# Increasing --mem or --cpus-per-task implicitly increases the other

...

export OMP_NUM_THREADS= # For OpenMP across cores

$application $options

...

76 of 86

Job Submission: Single Node Jobs

I Serial jobs requiring large memory, or OpenMP codes.

#!/bin/bash

...

#SBATCH --nodes=1

#SBATCH --ntasks=1

# Default is 1 task per node

#SBATCH --cpus-per-task=32 # Whole node

#SBATCH --mem=5990

# Memory per node in MB - default is pro rata by cpu number

# Increasing --mem or --cpus-per-task implicitly increases the other

...

export OMP_NUM_THREADS=32 # For OpenMP across 32 cores

$application $options

...

76 of 86

Job Submission: Single Node Jobs

I Serial jobs requiring large memory, or OpenMP codes.

#!/bin/bash

...

#SBATCH --nodes=1

#SBATCH --ntasks=1

# Default is 1 task per node

#SBATCH --cpus-per-task=16 # Half node

#SBATCH --mem=5990

# Memory per node in MB - default is pro rata by cpu number

# Increasing --mem or --cpus-per-task implicitly increases the other

...

export OMP_NUM_THREADS=16 # For OpenMP across 16 cores

$application $options

...

76 of 86


Job Submission: Single Node Jobs

I Serial jobs requiring large memory, or OpenMP codes.

#!/bin/bash

...

#SBATCH --nodes=1

#SBATCH --ntasks=1

# Default is 1 task per node

#SBATCH --cpus-per-task=32 # Whole node

#SBATCH --mem=5990

# Memory per node in MB - default is pro rata by cpu number

# Increasing --mem or --cpus-per-task implicitly increases the other

...

export OMP_NUM_THREADS=16 # For OpenMP across 16 cores (using all memory)

$application $options

...

76 of 86

Job Submission: MPI Jobs

I Parallel job across multiple nodes.

#!/bin/bash

...

#SBATCH --nodes=4

#SBATCH --ntasks=128 # i.e. 32x4 MPI tasks in total.

#SBATCH --cpus-per-task=1

...

mpirun -np 128 $application $options

...

I SLURM-aware MPI launches remote tasks via SLURM (doesn’t need a

list of nodes).

77 of 86

Job Submission: MPI Jobs

I Parallel job across multiple nodes.

#!/bin/bash

...

#SBATCH --nodes=4

#SBATCH --ntasks=64 # i.e. 16x4 MPI tasks in total.

#SBATCH --cpus-per-task=2

...

mpirun -ppn 16 -np 64 $application $options

...

I SLURM-aware MPI launches remote tasks via SLURM (doesn’t need a

list of nodes).

77 of 86

Job Submission: Hybrid Jobs

I Parallel jobs using both MPI and OpenMP.

#!/bin/bash

...

#SBATCH --nodes=4

#SBATCH --ntasks=64 # i.e. 16x4 MPI tasks in total.

#SBATCH --cpus-per-task=2

...

export OMP_NUM_THREADS=2 # i.e. 2 threads per MPI task.

mpirun -ppn 16 -np 64 $application $options

...

I This job uses 128 CPUs (each MPI task splits into 2 OpenMP threads).

78 of 86


Job Submission: High Throughput Jobs

I Multiple serial jobs across multiple nodes.

I Use srun to launch tasks (job steps) within a job.

#!/bin/bash

...

#SBATCH --nodes=2

...

cd directory_for_job1

srun --exclusive -N 1 -n 1 $application $options_for_job1 > output 2> err &

cd directory_for_job2

srun --exclusive -N 1 -n 1 $application $options_for_job2 > output 2> err &

...

cd directory_for_job64

srun --exclusive -N 1 -n 1 $application $options_for_job64 > output 2> err &

wait

I Exercise 5 - Submitting Jobs.

79 of 86

Job Submission: Interactive

I Compute nodes are accessible via SSH while you have a job running on them.

I Alternatively, submit an interactive job:

sintr -A TRAINING-CPU -N1 -n8 -t 2:0:0

I Within the window (screen session):

∗ Launches a shell on the first node (when the job starts).
∗ Graphical applications should display correctly (if they did from the login node).
∗ Create new shells with ctrl-a c, navigate with ctrl-a n and ctrl-a p.
∗ ssh or srun can be used to start processes on any nodes in the job.
∗ SLURM-aware MPI will do this automatically.

80 of 86

Job Submission: Array Jobs

I http://slurm.schedmd.com/job_array.html

I Used for submitting and managing large sets of similar jobs.

I Each job in the array has the same initial options.

I SLURM

[abc123@login-e-1]$ sbatch --array=1-7:2 -A TRAINING-CPU submit_script

Submitted batch job 791609

[abc123@login-e-1]$ squeue -u abc123

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

791609_1 skylake hpl abc123 R 0:06 1 cpu-a-6

791609_3 skylake hpl abc123 R 0:06 1 cpu-a-16

791609_5 skylake hpl abc123 R 0:06 1 cpu-a-7

791609_7 skylake hpl abc123 R 0:06 1 cpu-a-7

791609_1, 791609_3, 791609_5, 791609_7

i.e. ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}. SLURM_ARRAY_JOB_ID = SLURM_JOBID for the first element.
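
I A minimal sketch of a submission script for such an array (the application and file names are illustrative):

#!/bin/bash
#SBATCH -J myarray
#SBATCH -A TRAINING-CPU
#SBATCH -p skylake
#SBATCH --time=00:10:00
#SBATCH --array=1-7:2
# Each element picks its own input via the task index (1, 3, 5, 7 here).
./my_program input_${SLURM_ARRAY_TASK_ID}.dat > output_${SLURM_ARRAY_TASK_ID}.log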

81 of 86

Job Submission: Array Jobs (ctd)

I Updates can be applied to specific array elements using ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.

I Alternatively operate on the entire array via ${SLURM_ARRAY_JOB_ID}.

I Some commands still require the SLURM_JOB_ID (sacct, sreport, sshare, sstat and a few others).

I Exercise 7 - Array Jobs.

82 of 86


Scheduling

I SLURM scheduling is multifactor:
I QoS — payer or non-payer?
I Age — how long has the job waited?
Don’t cancel jobs that seem to wait too long.
I Fair Share — how much recent usage?
Payers with little recent usage receive a boost.
I sprio -j jobid

I Backfilling
I Promote lower priority jobs into gaps left by higher priority jobs.
I Demands that the higher priority jobs not be delayed.
I Relies on reasonably accurate wall time requests for this to work.
I Jobs of default length will not backfill readily.

83 of 86

Wait Times

I 36 hour job walltimes are permitted.

I This sets the timescale at busy times (without backfilling).

I Use backfilling when possible.

I Short (1 hour or less) jobs have higher throughput.

84 of 86

Checkpointing

I Insurance against failures during long jobs.

I Restart from checkpoints to work around finite job length.

I Application native methods are best. Failing that, one can try DMTCP:

http://dmtcp.sourceforge.net/index.html
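
I One minimal approach is to chain jobs with dependencies so each run resumes from the latest checkpoint (restart_job is an illustrative submission script):

jobid=$(sbatch --parsable restart_job)            # first chunk
sbatch --dependency=afterany:$jobid restart_job   # next chunk runs only after the first ends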

85 of 86

Job Submission: Scheduling Top Dos & Don’ts

I Do . . .
I Give reasonably accurate wall times (allows backfilling).
I Check your balance occasionally (mybalance).
I Test on a small scale first.
I Implement checkpointing if possible (reduces resource wastage).

I Don’t . . .
I Request more than you need — you will wait longer and use more credits.
I Cancel jobs unnecessarily — priority increases over time.

86 of 86

