Compiling applications for the Cray XC


Compiler Driver Wrappers (1)

●  All applications that will run in parallel on the Cray XC should be compiled with the standard language wrappers. The compiler drivers for each language are:
●  cc – wrapper around the C compiler
●  CC – wrapper around the C++ compiler
●  ftn – wrapper around the Fortran compiler

●  These scripts automatically choose the required compiler version, target-architecture options, and scientific libraries (and their include files) based on the currently loaded module environment. Use the -craype-verbose flag to see the default options.

●  Use them exactly as you would the original compiler, e.g. to compile prog1.f90:

> ftn -c <any_other_flags> prog1.f90
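Extending that example, a sketch of a two-file build with the wrapper (a hypothetical session on a Cray system; utils.f90 and prog1.exe are invented names):

```
> ftn -c prog1.f90
> ftn -c utils.f90
> ftn -o prog1.exe prog1.o utils.o
```

The same wrapper handles both compile and link steps, so the module environment supplies the library paths at link time as well.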


Compiler Driver Wrappers (2)

●  The scripts choose which compiler to use from the PrgEnv module loaded

●  Use module swap to change PrgEnv, e.g.
> module swap PrgEnv-cray PrgEnv-intel

●  PrgEnv-cray is loaded by default at login. This may differ on other Cray systems.
●  Use module list to check what is currently loaded.

●  The Cray MPI module is loaded by default (cray-mpich).
●  To use SHMEM, load the cray-shmem module.

PrgEnv         Description                    Real Compilers
PrgEnv-cray    Cray Compilation Environment   crayftn, craycc, crayCC
PrgEnv-intel   Intel Composer Suite           ifort, icc, icpc
PrgEnv-gnu     GNU Compiler Collection        gfortran, gcc, g++
PrgEnv-pgi     Portland Group Compilers       pgf90, pgcc, pgCC


Compiler Versions

●  There are usually multiple versions of each compiler available to users.
●  The most recent version is usually the default and will be loaded when swapping the PrgEnv.
●  To change the version of the compiler in use, swap the compiler module, e.g.
> module swap cce cce/8.3.10

PrgEnv         Compiler Module
PrgEnv-cray    cce
PrgEnv-intel   intel
PrgEnv-gnu     gcc
PrgEnv-pgi     pgi
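As a sketch of the workflow (a hypothetical session; the installed versions differ per site):

```
> module avail cce            # list the installed CCE versions
> module swap cce cce/8.3.10  # select a specific one
> ftn -V                      # confirm which version the wrapper now uses
```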


EXCEPTION: Cross Compiling Environment

●  The wrapper scripts ftn, cc, and CC create a highly optimized executable tuned for the Cray XC's compute nodes (cross compilation).

●  This executable may not run on the login nodes
●  Login nodes do not support running distributed-memory applications
●  Some Cray architectures may have different processors in the login and compute nodes. The typical error is '... Illegal instruction ...'

●  If you are compiling for the login nodes
●  You should use the original direct compiler commands, e.g. ifort, pgcc, crayftn, gcc, ... PATH will change with modules. All libraries will have to be linked in manually.

●  Alternatively, you can use the compiler wrappers {cc,CC,ftn} with the -target-cpu= option, choosing among {abudhabi, haswell, interlagos, istanbul, ivybridge, mc12, mc8, sandybridge, shanghai, x86_64}. x86_64 is the most compatible, but also the least specific.


About the -I, -L and -l flags

●  For libraries and include files triggered by module files, you should NOT add anything to your Makefile
●  No additional MPI flags are needed (they are included by the wrappers)
●  You do not need to add any -I, -l or -L flags for the Cray-provided libraries

●  If your Makefile needs an input for -L to work correctly, try using '.'

●  If you really, really need a specific path, try checking 'module show <X>' for some environment variables
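A minimal Makefile consistent with these rules (a sketch; prog1.f90 and utils.f90 are invented file names, and recipe lines must begin with a tab). The wrapper replaces the usual compiler variable, and no -I/-L/-l entries are needed for the Cray-provided libraries:

```makefile
# ftn pulls in MPI and LibSci paths from the loaded modules;
# only application-specific flags belong here.
FC      = ftn
FCFLAGS = -O2

prog1.exe: prog1.o utils.o
	$(FC) -o $@ $^

%.o: %.f90
	$(FC) $(FCFLAGS) -c $<

clean:
	rm -f *.o prog1.exe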


Dynamic vs Static linking

●  Currently static linking is the default
●  May change in the future
●  Already changed when linking for GPUs (XK6/XK7 nodes)

●  To decide how to link, either:
1.  set CRAYPE_LINK_TYPE to "static" or "dynamic", or
2.  pass the -static or -dynamic option to the linking wrapper (cc, CC or ftn).

●  Features of dynamic linking:
●  smaller executable, automatic use of new libs
●  might need a longer startup time to load and find the libs
●  the environment (loaded modules) should be the same between your compile setup and your batch script (e.g. when switching to PrgEnv-intel)

●  Features of static linking:
●  larger executable (usually not a problem)
●  faster startup
●  the application runs the same code every time it runs (independent of environment)

●  If you want to hardcode the rpath into the executable:
●  set CRAY_ADD_RPATH=yes during compilation
●  this will always load the same version of the lib when running, independent of the version loaded by modules
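The mechanisms above can be sketched as follows (a hypothetical session; CRAYPE_LINK_TYPE, CRAY_ADD_RPATH and the wrappers exist only in the Cray environment, and simulation.exe is an invented name):

```
# 1. via the environment, for a whole build:
export CRAYPE_LINK_TYPE=dynamic

# 2. per invocation, on the link step:
ftn -dynamic -o simulation.exe simulation.o

# hardcode the rpath into the executable:
export CRAY_ADD_RPATH=yes
ftn -dynamic -o simulation.exe simulation.o
```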


OpenMP

●  OpenMP is supported by all of the PrgEnvs.
●  CCE (PrgEnv-cray) recognizes and interprets OpenMP directives by default. If you have OpenMP directives in your application but do not wish to use them, disable OpenMP recognition with -hnoomp.

●  The Intel OpenMP runtime spawns an extra helper thread, which may cause oversubscription. Hints on that will follow.

PrgEnv         Enable OpenMP   Disable OpenMP
PrgEnv-cray    -homp           -hnoomp
PrgEnv-intel   -openmp
PrgEnv-gnu     -fopenmp
PrgEnv-pgi     -mp


Compiler man Pages

●  For more information on individual compilers, consult the man pages:

●  To verify that you are using the correct version of a compiler, use:
●  the -V option on a cc, CC, or ftn command with PGI, Intel and Cray
●  the --version option on a cc, CC, or ftn command with GNU

PrgEnv         C            C++          Fortran
PrgEnv-cray    man craycc   man crayCC   man crayftn
PrgEnv-intel   man icc      man icpc     man ifort
PrgEnv-gnu     man gcc      man g++      man gfortran
PrgEnv-pgi     man pgcc     man pgCC     man pgf90
Wrappers       man cc       man CC       man ftn


Using Compilers

Quick Overview


Using Compiler Feedback

●  Compilers can generate an annotated listing of your source code indicating important optimizations. Useful for targeted use of compiler flags.

●  CCE
●  ftn -rm
●  {cc,CC} -hlist=a

●  Intel
●  ftn/cc -opt-report 3 -vec-report6
●  If you want this in a file: add -opt-report-file=filename
●  See ifort --help reports

●  GNU
●  -ftree-vectorizer-verbose=9

●  PGI
●  -Minfo=<…>


Compiler feedback: Loopmark

●  For example, with the Cray compiler:

%%%    L o o p m a r k   L e g e n d    %%%

Primary Loop Type        Modifiers
-----------------        ---------
A - Pattern matched      a - vector atomic memory operation
                         b - blocked
C - Collapsed            f - fused
D - Deleted              i - interchanged
E - Cloned               m - streamed but not partitioned
I - Inlined              p - conditional, partial and/or computed
M - Multithreaded        r - unrolled
P - Parallel/Tasked      s - shortloop
V - Vectorized           t - array syntax temp used
                         w - unwound


Compiler feedback: Loopmark (cont.)

29.  b-------<    do i3=2,n3-1
30.  b b-----<      do i2=2,n2-1
31.  b b Vr--<        do i1=1,n1
32.  b b Vr             u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33.  b b Vr    *             + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34.  b b Vr             u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35.  b b Vr    *             + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36.  b b Vr-->        enddo
37.  b b Vr--<        do i1=2,n1-1
38.  b b Vr             r(i1,i2,i3) = v(i1,i2,i3)
39.  b b Vr    *           - a(0) * u(i1,i2,i3)
40.  b b Vr    *           - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41.  b b Vr    *           - a(3) * ( u2(i1-1) + u2(i1+1) )
42.  b b Vr-->        enddo
43.  b b----->      enddo
44.  b------->    enddo


Compiler Feedback: Loopmark (cont.)

ftn-6289 ftn: VECTOR File = resid.f, Line = 29
  A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
  A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
  A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
  A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
  A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
  A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
  A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
  A loop starting at line 37 was vectorized.


Recommended compiler optimization levels

●  Cray compiler
●  The default optimization level (i.e. no flags) is equivalent to -O3 of most other compilers. CCE optimizes rather aggressively by default, but this is also the most thoroughly tested configuration.
●  Try with -O3 -hfp3 (also thoroughly tested)
●  -hfp3 gives you a lot more floating-point optimization, especially for 32-bit
●  In case of precision errors, try a lower -hfp<number> (-hfp1 first; only -hfp0 if absolutely necessary)

●  GNU compiler
●  Almost all HPC applications compile correctly with -O3, so use that instead of the cautious default.
●  -ffast-math may give some extra performance

●  Intel compiler
●  The default optimization level (equal to -O2) is safe.
●  Try with -O3. If that still works, you may try -Ofast -fp-model fast=2
●  Pass the -craype-verbose flag to {cc,CC,ftn} to show the options being used.


Inlining & inter-procedural optimization

●  Cray compiler
●  Inlining within a file is enabled by default.
●  The command-line options -OipaN (ftn) and -hipaN (cc/CC), where N=0..4, provide a set of choices for inlining behavior
●  0 disables inlining, 3 is the default, 4 is even more elaborate

●  The -Oipafrom= (ftn) or -hipafrom= (cc/CC) option instructs the compiler to look for inlining candidates in other source files, or in a directory of source files.

●  -hwp combined with -h pl=… enables whole-program automatic inlining.

●  GNU compiler
●  Quite elaborate inlining is enabled by -O3

●  Intel compiler
●  Inlining within a file is enabled by default
●  Multi-file inlining is enabled by the flag -ipo


Loop transformations

●  Cray compiler
●  The most useful techniques are already in their aggressive state by default
●  One may try to improve loop restructuring for better vectorization with -h vector3

●  GNU compiler
●  Loop blocking (aka tiling) with -floop-block
●  Loop unrolling with -funroll-loops or -funroll-all-loops

●  Intel compiler
●  Loop unrolling with -funroll-loops or -unroll-aggressive


Directives for the Cray Compiler

●  If you see from the compiler feedback that a loop has not been blocked, unrolled, or vectorized, but you are convinced that it should be, you can use compiler directives instead of raising the optimization level -O…

●  The Cray compiler supports a full and growing set of directives and pragmas, e.g.
●  !dir$ concurrent
●  !dir$ ivdep
●  !dir$ interchange
●  !dir$ unroll
●  !dir$ loop_info [max_trips] [cache_na]
●  !dir$ blockable

●  More information is given in

●  man directives ●  man loop_info

     

!dir$ blockable(j,k)
!dir$ blockingsize(16)
do k = 6, nz-5
  do j = 6, ny-5
    do i = 6, nx-5
      ! stencil
    end do
  end do
end do


Summary

●  Four compiler environments available on the XC40:
●  Cray (PrgEnv-cray is the default)
●  Intel (PrgEnv-intel)
●  GNU (PrgEnv-gnu)
●  PGI (PrgEnv-pgi)

●  All of them are accessed through the wrappers ftn, cc and CC – just do a module swap to change a compiler or a version.

●  There is no universally fastest compiler

●  Performance strongly depends on the application (even on the input)
●  We try, however, to excel with the Cray Compiler Environment
●  If you see a case where some other compiler yields better performance, let us know!

●  Compiler flags do matter
●  Be ready to spend some effort on finding the best ones for your application.
●  More information is given at the end of this presentation.


Cray Scientific Libraries

Overview


Cray Scientific Libraries

[Diagram: the Cray scientific library stack]

FFT:     FFTW
Dense:   BLAS, LAPACK, ScaLAPACK, IRT, CASE
Sparse:  CASK, PETSc, Trilinos

IRT – Iterative Refinement Toolkit
CASK – Cray Adaptive Sparse Kernels
CASE – Cray Adaptive Simplified Eigensolver


●  A large variety of standard libraries is available via modules
●  Optimized for Cray hardware and also for the Haswell processor.


What makes Cray libraries special

1.  Node performance
●  Highly tuned routines at the low level (e.g. BLAS)

2.  Network performance
●  Optimized for network performance
●  Overlap between communication and computation
●  Use the best available low-level mechanism
●  Use adaptive parallel algorithms

3.  Highly adaptive software
●  Use auto-tuning and adaptation to give the user the known-best (or very good) code at runtime

4.  Productivity features
●  Simple interfaces into complex software


Library Usage Overview

●  LibSci
●  Includes BLAS, CBLAS, BLACS, LAPACK, ScaLAPACK
●  Module is loaded by default (man libsci)
●  Threading is used within LibSci (OMP_NUM_THREADS). If you call it from within a parallel region, a single thread is used. More info later on.

●  FFTW
●  module load fftw and man fftw

●  PETSc
●  module load cray-petsc{-complex} and man intro_petsc

●  Trilinos
●  module load cray-trilinos and man intro_trilinos

●  Third Party Scientific Libraries
●  module load cray-tpsl (use the online documentation)

●  Iterative Refinement Toolkit (IRT), through LibSci
●  man intro_irt

●  Cray Adaptive Sparse Kernels (CASK) are used in cray-petsc and cray-trilinos (transparently to the developer).


Third party Scientific Libraries (cray-tpsl)

●  TPSL (Third Party Scientific Libraries) contains a collection of outside mathematical libraries that can be used with PETSc and Trilinos.

●  This module will increase the flexibility of PETSc and Trilinos by providing users with multiple options for solving problems in dense and sparse linear algebra.

●  The cray-tpsl module is automatically loaded when PETSc or Trilinos is loaded. The libraries included are MUMPS, SuperLU, SuperLU_dist, ParMetis, Hypre, Sundials, and Scotch.


Check you got the right library!

●  Add options to the linker to make sure you have the correct library loaded.

●  -Wl,<option> passes a command from the driver to the linker
●  You can ask the linker to tell you where an object was resolved from using the -y option.
●  E.g. -Wl,-ydgemm_ (notice the '_' at the end of the name)

Note: do not explicitly link "-lsci". This will not be found with libsci 11+, and with 10.x it means a single-core library.

./main.o: reference to dgemm_
/opt/xt-libsci/11.0.05.2/cray/73/mc12/lib/libsci_cray_mp.a(dgemm.o): definition of dgemm_


Threading for BLAS and LAPACK

●  LibSci is compatible with OpenMP
●  Control the number of threads to be used in your program using OMP_NUM_THREADS
●  e.g., in the job script: export OMP_NUM_THREADS=16
●  Then run with srun with --cpus-per-task=16

●  What behavior you get from the library depends on your code:
1.  No threading in code
●  The BLAS call will use OMP_NUM_THREADS threads
2.  Threaded code, outside parallel regions
●  The BLAS call will use OMP_NUM_THREADS threads
3.  Threaded code, inside parallel regions
●  The BLAS call will use a single thread

●  Threaded LAPACK works exactly the same as threaded BLAS
●  Anywhere LAPACK uses BLAS, those BLAS calls can be threaded.
●  Some LAPACK routines are threaded at the higher level
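Putting those settings together, a sketch of a job-script fragment for a threaded LibSci run (the executable name dgemm_bench.exe is invented; partition and account options are omitted and vary per site):

```
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 30

# 16 OpenMP threads for the single task: LibSci BLAS/LAPACK calls
# made outside parallel regions will use all 16 threads.
export OMP_NUM_THREADS=16
srun -n 1 --cpus-per-task=16 ./dgemm_bench.exe
```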


Intel MKL

●  The Intel Math Kernel Library (MKL) is an alternative to LibSci
●  Features tuned performance for Intel CPUs as well

●  Linking is quite complicated, but the Intel MKL Link Line Advisor can tell you what to add to your link line
●  http://software.intel.com/sites/products/mkl/

●  Using MKL together with the Intel compilers (PrgEnv-intel) is usually straightforward: simply add -mkl to your compile and linker options


Running applications on the Cray XC

With Native SLURM


How applications are generally run on a XC

●  Most Cray XCs are batch systems.
●  Users submit batch job scripts to a scheduler (e.g. PBS, MOAB, SLURM) from a login node for execution at some point in the future. Each job requests resources and predicts how long it will run.

●  The scheduler (running on an external server) chooses which jobs to run and allocates appropriate resources.

●  The batch system then executes the user's job script on a different node than the login node.

●  The scheduler monitors the job and kills any that overrun their runtime prediction.

●  User job scripts typically contain two types of statements:
1.  Serial commands that are executed by the MOM node, e.g.
●  quick setup and post-processing commands
●  e.g. rm, cd, mkdir etc.

2.  Parallel executables that run on compute nodes,
●  launched using the srun command.


SLURM on the XC40 (Beginner Guide)

●  The main Cray system uses the Simple Linux Utility for Resource Management (SLURM)
●  Plenty of documentation can be found at http://slurm.schedmd.com/documentation.html

●  In your daily work you will mainly encounter the following commands:
●  sbatch – submit a batch script to SLURM
●  srun – run parallel jobs
●  scancel – signal jobs under the control of SLURM
●  squeue – information about running jobs

●  All the information about your simulation run is contained in a batch script which is submitted via sbatch.

●  The batch script contains one or more parallel job runs executed via srun (job steps). Nodes are used exclusively.

●  The simulations have to be executed on /scratch/…


Lifecycle of a batch script

[Diagram: sbatch job.sl is submitted from the CDL (login) nodes to the SLURM gateway node, which schedules resources on the Cray XC compute nodes. Serial commands run on the gateway side; srun steps run in parallel on the compute nodes.]

Example batch job script – job.sl:

#!/bin/bash
#SBATCH -p <your_workq>
#SBATCH -A <your_account>
#SBATCH -t 30
#SBATCH -N 100

cd <some_working_directory>        # serial
srun -n 640 ./simulation.exe       # parallel
rm -r <my_work_dir>/<tmp_files>    # serial


The script will start by default in the directory where sbatch has been executed. This directory is available in the environment variable SLURM_SUBMIT_DIR


Useful SLURM options (Native)

●  srun is the application launcher
●  It must be used to run applications on the XC compute nodes, interactively or in a batch job.
●  If srun is not used, the application is launched on the gateway node (and will most likely fail).
●  srun launches groups of Processing Elements (PEs) or tasks on the compute nodes. (PE == MPI rank || Coarray image || UPC thread || …)

●  Some important parameters to set are:
●  There is no need to give all of -N, -c, -n and --ntasks-per-node, but they must be consistent
●  They can also be specified via #SBATCH in the batch script.

Description                        Option
Total number of tasks              -n, --ntasks
Number of tasks per compute node   --ntasks-per-node
Number of threads per task         -c, --cpus-per-task
Number of nodes                    -N, --nodes
Walltime                           -t, --time


XC40 MPI-Job Examples

Single node, single task – run a job with one task on one node with full memory:

…
#SBATCH -N 1

srun -n 1 ./<exe>

Single node – run a pure MPI job with 64 ranks on one node (the user can request a value for -n smaller than 64, but not larger):

…
#SBATCH -N 1

srun -n 64 ./<exe>
#srun -n 32 ./<exe>
#srun -n 16 ./<exe>

Multi-node, fully packed – run a pure MPI job on 4 nodes with 64 MPI ranks on each node:

…
#SBATCH -N 4

srun -n 256 ./<exe>


XC40 MPI-Job Examples

Multi-node, partially filled – run a pure MPI job on 4 nodes with fewer than 64 tasks per node. If you specify the number of nodes (-N), you can give either the total number of tasks (-n) or --ntasks-per-node:

#!<your_shell>
…
#SBATCH -N 4

srun --ntasks-per-node=32 ./<exe>   # equivalent: srun -n 128 ./<exe>
srun --ntasks-per-node=16 ./<exe>   # equivalent: srun -n 64 ./<exe>

Hybrid MPI/OpenMP – run a hybrid application on 4 nodes with 16 tasks per node and 4 OpenMP threads per task, using the --cpus-per-task (-c) parameter:

#!<your_shell>
…
#SBATCH -N 4

export OMP_NUM_THREADS=4
srun --ntasks-per-node=16 -c 4 ./<exe>   # equivalent: srun -n 64 -c 4 ./<exe>


Hyperthreads on the XC40 with SLURM

●  Intel Hyper-Threading is a method of improving the throughput of a CPU by allowing two independent program threads to share the execution resources of one CPU
●  When one thread stalls, the processor can execute instructions from the second thread instead of sitting idle
●  Because only the thread context state and a few other resources are replicated (unlike replicating entire processor cores), the throughput improvement depends on whether the shared execution resources are a bottleneck
●  Typically much less than 2x with two hyperthreads.

●  With srun, hyper-threading is turned off with --hint=nomultithread
●  Simply try it; if it does not help, switch back.

#SBATCH -N 4

export OMP_NUM_THREADS=4
srun --ntasks-per-node=8 -c 4 \
     --hint=nomultithread ./<exe>


SLURM Output and Error


•  SLURM redirects stdout and stderr to two separate files specified by the user.
•  By default, the script output will be written to files of the form slurm-<num>.out in your submit directory, where <num> is your SLURM batch job number.

•  Output is written immediately to the files, so please do not move or delete them.
•  To collect stderr and stdout in a single file, specify the same file name for --output and --error:

#SBATCH --output=<my_output_file_name>.out
#SBATCH --error=<my_output_file_name>.err

•  You can use %j to add the SLURM batch job number to your output file names:

#SBATCH --output=<my_output_file_name>-%j.all.out
#SBATCH --error=<my_output_file_name>-%j.all.out

•  Finally, you can specify a job name, which will appear in the squeue output:

#SBATCH --job-name=<my_job_name>


Monitoring your SLURM Job


•  Start your job from the shell with sbatch.
•  You will see the corresponding job id right away:

> sbatch <your_job>.slurm
Submitted batch job <JOBID>

•  While it is running, you can inspect your job with squeue.
•  In order to inspect only your own jobs, use the -u option of squeue.
•  Always check that the reported resources are what you expect.
•  For more information you can use > scontrol show job <JOBID> or > sstat <JOBID> from an interactive session to get the job steps.

> squeue -u <username>
 JOBID      USER  ACCOUNT  NAME  ST  REASON  START_TIME            TIME  TIME_LEFT  NODES  CPUS
 74914  esposito     cray  job3   R  None    2015-06-02T13:12:37   0:08      29:52      2   128

•  Only if you think that your job is not running properly after inspecting your output files should you cancel it with scancel.
•  If your job exceeds the time limit specified with #SBATCH -t, it will be automatically canceled by SLURM.

> scancel <JOBID>


> ssh gateway<num>
> salloc <your_slurm_parameters>

More on SLURM

●  Behavior in specific cases:
●  If you do not specify anything, you can run a single task on one node for one hour.
●  Specifying -n without --ntasks-per-node still spreads the tasks evenly among nodes.
●  The node memory limit is currently set to 32 GB. You can use --mem=131072 to access the full memory of the node.
●  If -c is specified without -n, then enough nodes are allocated and filled to satisfy -c and -n.
●  Be careful when you specify SLURM parameters both in the batch script via #SBATCH and on the srun line inside the script; conflicting parameters may not produce an abort.
●  More information on core binding and NUMA affinity is given later on.
●  The user is responsible for choosing the right partition and account! Use sinfo.
●  For debugging and other diagnostics you can request an interactive session.
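One way to avoid such conflicts is to state the job geometry only once, in the #SBATCH header, and let srun inherit it from the allocation. A hypothetical sketch of such a batch script (job name, memory value, and ./my_app are placeholders):

```shell
#!/bin/bash
#SBATCH --job-name=job3
#SBATCH --ntasks=128           # total tasks (-n)
#SBATCH --ntasks-per-node=64   # spread over 2 nodes
#SBATCH --mem=131072           # full node memory, in MB
#SBATCH --time=00:30:00

# srun inherits the task count and node layout from the allocation;
# repeating -n/-N here could silently conflict with the header above.
srun ./my_app
```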

Page 39: Compiling applications for the Cray XC

Summary of SLURM commands and variables

Slurm Workload Manager

Job Submission

salloc - Obtain a job allocation.
sbatch - Submit a batch script for later execution.
srun - Obtain a job allocation (as needed) and execute an application.

--array=<indexes> (e.g. "--array=1-10") Job array specification. (sbatch command only)
--account=<name> Account to be charged for resources used.
--begin=<time> (e.g. "--begin=18:00:00") Initiate job after specified time.
--clusters=<name> Cluster(s) to run the job. (sbatch command only)
--constraint=<features> Required node features.
--cpus-per-task=<count> Number of CPUs required per task.
--dependency=<state:jobid> Defer job until specified jobs reach specified state.
--error=<filename> File in which to store job error messages.
--exclude=<names> Specific host names to exclude from job allocation.
--exclusive[=user] Allocated nodes can not be shared with other jobs/users.
--export=<name[=value]> Export identified environment variables.
--gres=<name[:count]> Generic resources required per node.
--input=<name> File from which to read job input data.
--job-name=<name> Job name.
--label Prepend task ID to output. (srun command only)
--licenses=<name[:count]> License resources required for entire job.
--mem=<MB> Memory required per node.
--mem-per-cpu=<MB> Memory required per allocated CPU.
-N <minnodes[-maxnodes]> Node count required for the job.
-n <count> Number of tasks to be launched.
--nodelist=<names> Specific host names to include in job allocation.
--output=<name> File in which to store job output.
--partition=<names> Partition/queue in which to run the job.
--qos=<name> Quality Of Service.
--signal=[B:]<num>[@time] Signal job when approaching time limit.
--time=<time> Wall clock time limit.
--wrap=<command_string> Wrap specified command in a simple "sh" shell. (sbatch command only)
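The --time limit accepts several formats, including plain minutes and days-hh:mm:ss. A small bash sketch converting the latter form into total minutes (pure arithmetic on an example value, no SLURM involved):

```shell
# Convert a walltime like "1-02:30:00" (days-hh:mm:ss) into minutes
walltime="1-02:30:00"
days=${walltime%%-*}        # "1"
hms=${walltime#*-}          # "02:30:00"
IFS=: read -r h m s <<<"$hms"
# 10# forces base 10 so "08"/"09" are not parsed as octal
minutes=$(( days * 1440 + 10#$h * 60 + 10#$m ))
echo "$walltime = $minutes minutes"
```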

Accounting

sacct - Display accounting data.
--allusers Display all users' jobs.
--accounts=<name> Display jobs with specified accounts.
--endtime=<time> End of reporting period.
--format=<spec> Format output.
--name=<jobname> Display jobs that have any of these name(s).
--partition=<names> Comma-separated list of partitions to select jobs and job steps from.
--state=<state_list> Display jobs with specified states.
--starttime=<time> Start of reporting period.

sacctmgr - View and modify account information.
Options:
--immediate Commit changes immediately.
--parseable Output delimited by '|'.
Commands:
add <ENTITY> <SPECS> / create <ENTITY> <SPECS> Add an entity (the two commands are identical).
delete <ENTITY> where <SPECS> Delete the specified entities.
list <ENTITY> [<SPECS>] Display information about the specified entity.
modify <ENTITY> where <SPECS> set <SPECS> Modify an entity.
Entities:
account Account associated with job.
association Group information for job.
cluster ClusterName parameter in the slurm.conf.
qos Quality of Service.
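The --parseable flag makes sacctmgr emit '|'-delimited records, which are easy to split in a script. A sketch on a made-up record (the field values here are invented for illustration):

```shell
# A made-up '|'-delimited record: Account|Description|Organization|
record='cray|benchmarking|hlrs|'
IFS='|' read -r account descr org rest <<<"$record"
echo "account=$account description=$descr organization=$org"
```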

Job Management

sbcast - Transfer file to a job's compute nodes.
sbcast [options] SOURCE DESTINATION
--force Replace previously existing file.
--preserve Preserve modification times, access times, and access permissions.

scancel - Signal jobs, job arrays, and/or job steps.
--account=<name> Operate only on jobs charging the specified account.
--name=<name> Operate only on jobs with specified name.
--partition=<names> Operate only on jobs in the specified partition/queue.
--qos=<name> Operate only on jobs using the specified quality of service.

http://slurm.schedmd.com/documentation.html

Page 40: Compiling applications for the Cray XC

Summary of SLURM commands and variables

scancel (continued):
--reservation=<name> Operate only on jobs using the specified reservation.
--state=<names> Operate only on jobs in the specified state.
--user=<name> Operate only on jobs from the specified user.
--nodelist=<names> Operate only on jobs using the specified compute nodes.

squeue - View information about jobs.
--account=<name> View only jobs with specified accounts.
--clusters=<name> View jobs on specified clusters.
--format=<spec> (e.g. "--format=%i %j") Output format to display; specify fields, size, order, etc.
--jobs=<job_id_list> Comma-separated list of job IDs to display.
--name=<name> View only jobs with specified names.
--partition=<names> View only jobs in specified partitions.
--priority Sort jobs by priority.
--qos=<name> View only jobs with specified Qualities Of Service.
--start Report the expected start time and resources to be allocated for pending jobs, in order of increasing start time.
--state=<names> View only jobs with specified states.
--users=<names> View only jobs for specified users.

sinfo - View information about nodes and partitions.
--all Display information about all partitions.
--dead If set, only report state information for non-responding (dead) nodes.
--format=<spec> Output format to display.
--iterate=<seconds> Print the state at the specified interval.
--long Print more detailed information.
--Node Print information in a node-oriented format.
--partition=<names> View only specified partitions.
--reservation Display information about advanced reservations.
-R Display reasons nodes are in the down, drained, fail or failing state.
--state=<names> View only nodes in specified states.

scontrol - Used to view and modify configuration and state.
Also see the sview graphical user interface version.
--details Make the show command print more details.
--oneliner Print information on one line.
Commands:
create SPECIFICATION Create a new partition or reservation.
delete SPECIFICATION Delete the entry with the specified SPECIFICATION.
reconfigure All Slurm daemons will re-read the configuration file.
requeue JOB_LIST Requeue a running, suspended or completed batch job.
show ENTITY ID Display the state of the specified entity with the specified identification.
update SPECIFICATION Update job, step, node, partition, or reservation configuration per the supplied specification.

Environment Variables

SLURM_ARRAY_JOB_ID Set to the job ID if part of a job array.
SLURM_ARRAY_TASK_ID Set to the task ID if part of a job array.
SLURM_CLUSTER_NAME Name of the cluster executing the job.
SLURM_CPUS_PER_TASK Number of CPUs requested per task.
SLURM_JOB_ACCOUNT Account name.
SLURM_JOB_ID Job ID.
SLURM_JOB_NAME Job name.
SLURM_JOB_NODELIST Names of nodes allocated to the job.
SLURM_JOB_NUM_NODES Number of nodes allocated to the job.
SLURM_JOB_PARTITION Partition/queue running the job.
SLURM_JOB_UID User ID of the job's owner.
SLURM_JOB_USER User name of the job's owner.
SLURM_RESTART_COUNT Number of times the job has restarted.
SLURM_PROCID Task ID (MPI rank).
SLURM_STEP_ID Job step ID.
SLURM_STEP_NUM_TASKS Task count (number of MPI ranks).
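Inside a job these are ordinary environment variables; a sketch that reports the basics, with fallbacks so the snippet also behaves outside a SLURM allocation:

```shell
# Use placeholder values when the SLURM_* variables are not set
jobid=${SLURM_JOB_ID:-none}
nodes=${SLURM_JOB_NUM_NODES:-0}
rank=${SLURM_PROCID:-0}
echo "job=$jobid nodes=$nodes rank=$rank"
```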

Daemons

slurmctld Executes on the cluster's "head" node to manage workload.
slurmd Executes on each compute node to locally manage resources.
slurmdbd Manages database of resource limits, licenses, and archived accounting records.

Copyright 2015 SchedMD LLC. All rights reserved.

http://www.schedmd.com

Last Update: 3 April 2015


Page 41: Compiling applications for the Cray XC

SLURM compared to others

(last updated 28-Apr-2013; columns in each row: PBS/Torque · Slurm · LSF · SGE · LoadLeveler; (n/a) = no direct equivalent)

User Commands:
Job submission: qsub [script_file] · sbatch [script_file] · bsub [script_file] · qsub [script_file] · llsubmit [script_file]
Job deletion: qdel [job_id] · scancel [job_id] · bkill [job_id] · qdel [job_id] · llcancel [job_id]
Job status (by job): qstat [job_id] · squeue [job_id] · bjobs [job_id] · qstat -u \* [-j job_id] · llq -u [username]
Job status (by user): qstat -u [user_name] · squeue -u [user_name] · bjobs -u [user_name] · qstat [-u user_name] · llq -u [user_name]
Job hold: qhold [job_id] · scontrol hold [job_id] · bstop [job_id] · qhold [job_id] · llhold -r [job_id]
Job release: qrls [job_id] · scontrol release [job_id] · bresume [job_id] · qrls [job_id] · llhold -r [job_id]
Queue list: qstat -Q · squeue · bqueues · qconf -sql · llclass
Node list: pbsnodes -l · sinfo -N OR scontrol show nodes · bhosts · qhost · llstatus -L machine
Cluster status: qstat -a · sinfo · bqueues · qhost -q · llstatus -L cluster
GUI: xpbsmon · sview · xlsf OR xlsbatch · qmon · xload

Environment:
Job ID: $PBS_JOBID · $SLURM_JOBID · $LSB_JOBID · $JOB_ID · $LOAD_STEP_ID
Submit Directory: $PBS_O_WORKDIR · $SLURM_SUBMIT_DIR · $LSB_SUBCWD · $SGE_O_WORKDIR · $LOADL_STEP_INITDIR
Submit Host: $PBS_O_HOST · $SLURM_SUBMIT_HOST · $LSB_SUB_HOST · $SGE_O_HOST · (n/a)
Node List: $PBS_NODEFILE · $SLURM_JOB_NODELIST · $LSB_HOSTS/LSB_MCPU_HOST · $PE_HOSTFILE · $LOADL_PROCESSOR_LIST
Job Array Index: $PBS_ARRAYID · $SLURM_ARRAY_TASK_ID · $LSB_JOBINDEX · $SGE_TASK_ID · (n/a)

Job Specification:
Script directive: #PBS · #SBATCH · #BSUB · #$ · #@
Queue: -q [queue] · -p [queue] · -q [queue] · -q [queue] · class=[queue]
Node Count: -l nodes=[count] · -N [min[-max]] · -n [count] · N/A · node=[count]
CPU Count: -l ppn=[count] OR -l mppwidth=[PE_count] · -n [count] · -n [count] · -pe [PE] [count] · (n/a)
Wall Clock Limit: -l walltime=[hh:mm:ss] · -t [min] OR -t [days-hh:mm:ss] · -W [hh:mm:ss] · -l h_rt=[seconds] · wall_clock_limit=[hh:mm:ss]
Standard Output File: -o [file_name] · -o [file_name] · -o [file_name] · -o [file_name] · output=[file_name]
Standard Error File: -e [file_name] · -e [file_name] · -e [file_name] · -e [file_name] · error=[file_name]
Combine stdout/err: -j oe (both to stdout) OR -j eo (both to stderr) · (use -o without -e) · (use -o without -e) · -j yes · (n/a)
Copy Environment: -V · --export=[ALL | NONE | variables] · -V · (n/a) · environment=COPY_ALL
Event Notification: -m abe · --mail-type=[events] · -B or -N · -m abe · notification=start|error|complete|never|always
Email Address: -M [address] · --mail-user=[address] · -u [address] · -M [address] · notify_user=[address]
Job Name: -N [name] · --job-name=[name] · -J [name] · -N [name] · job_name=[name]
Job Restart: -r [y|n] · --requeue OR --no-requeue (NOTE: configurable default) · -r · -r [yes|no] · restart=[yes|no]
Working Directory: N/A · --workdir=[dir_name] · (submission directory) · -wd [directory] · initialdir=[directory]
Resource Sharing: -l naccesspolicy=singlejob · --exclusive OR --shared · -x · -l exclusive · node_usage=not_shared
Memory Size: -l mem=[MB] · --mem=[mem][M|G|T] OR --mem-per-cpu=[mem][M|G|T] · -M [MB] · -l mem_free=[memory][K|M|G] · requirements=(Memory >= [MB])
Account to charge: -W group_list=[account] · --account=[account] · -P [account] · -A [account] · (n/a)
Tasks Per Node: -l mppnppn=[PEs_per_node] · --tasks-per-node=[count] · (n/a) · (Fixed allocation_rule in PE) · tasks_per_node=[count]
CPUs Per Task: (n/a) · --cpus-per-task=[count] · (n/a) · (n/a) · (n/a)
Job Dependency: -d [job_id] · --depend=[state:job_id] · -w [done | exit | finish] · -hold_jid [job_id | job_name] · (n/a)
Job Project: (n/a) · --wckey=[name] · -P [name] · -P [name] · (n/a)
Job host preference: (n/a) · --nodelist=[nodes] AND/OR --exclude=[nodes] · -m [nodes] · -q [queue]@[node] OR -q [queue]@@[hostgroup] · (n/a)
Quality Of Service: -l qos=[name] · --qos=[name] · (n/a) · (n/a) · (n/a)
Job Arrays: -t [array_spec] · --array=[array_spec] (Slurm version 2.6+) · -J "name[array_spec]" · -t [array_spec] · (n/a)
Generic Resources: -l other=[resource_spec] · --gres=[resource_spec] · (n/a) · -l [resource]=[value] · (n/a)
Licenses: (n/a) · --licenses=[license_spec] · -R "rusage[license_spec]" · -l [license]=[count] · (n/a)
Begin Time: -A "YYYY-MM-DD HH:MM:SS" · --begin=YYYY-MM-DD[THH:MM[:SS]] · -b [[year:][month:]day:]hour:minute · -a [YYMMDDhhmm] · (n/a)
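As a worked example of the table above, the same simple request written first as PBS/Torque directives and then as the SLURM equivalent (job name and file names are placeholders):

```shell
# PBS/Torque
#PBS -N job3
#PBS -l nodes=2
#PBS -l walltime=00:30:00
#PBS -o job3.out

# SLURM equivalent
#SBATCH --job-name=job3
#SBATCH -N 2
#SBATCH --time=00:30:00
#SBATCH --output=job3.out
```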
