Managing and Deploying GPU Accelerators
ADAC17 - Resource Management
Stephane Thiell and Kilian Cavalotti
Stanford Research Computing Center
OUTLINE
GPU resources at the SRCC
Slurm and GPUs
Slurm and GPU P2P
Running Amber with GPU P2P (intranode)
Running TensorFlow with Singularity
GPU RESOURCES AT THE SRCC
STANFORD SHERLOCK
▸ Shared compute cluster
▸ Open to the Stanford community as a resource to support sponsored research
▸ Condo cluster, nodes are ordered quarterly (currently 1000+ nodes)
▸ Heterogeneous cluster with 73 GPU nodes / 500 GPU cards / Tesla and GeForce
▸ Total of ~2500 users (~420 are faculty) and 64 owners
▸ Supermicro GPU SuperServer 4028GR-TRT, 4U with 8 x Nvidia GeForce consumer-grade cards
▸ Dell C4130 1U with 4 x Nvidia Tesla cards
STANFORD XSTREAM
▸ Multi-GPU HPC cluster
▸ 520 x Nvidia K80, 1040 GPUs, 16 GPUs / node
▸ 24 x Nvidia P100E, 8 GPUs / node
▸ Energy efficient: 1 PFlops (LINPACK peak) for only 190 kW!
Top500 / Green500 rankings:
June 2015 (ISC15): 87th Top500, 6th Green500
Nov 2015 (SC15): 102nd Top500, 5th Green500
June 2016 (ISC16): 122nd Top500, 6th Green500
Nov 2016 (SC16): 162nd Top500, 8th Green500
June 2017 (ISC17): 214th Top500, 24th Green500
XStream’s Cray CS-Storm 2626X8N node specs:
▸ 20 CPU cores (12 GB RAM/core)
▸ 256 GB of DDR3 RAM
▸ 16 Nvidia K80 GPUs, each with 12 GB of GDDR5 with ECC support
▸ Balanced PCIe bandwidth across GPUs (dual Root Complex)
▸ 1 InfiniBand FDR
XSTREAM SYSTEM CHARACTERISTICS (NODE LAYOUT)
SLURM AND GPUs
SLURM WITH GENERIC RESOURCES (GRES)
SLURM is the resource manager on both Sherlock and XStream
▸ Open source and full-featured (including GPU support)
Generic Resource (GRES) scheduling
$ srun [...] --gres gpu:2 [...]
or
#SBATCH --gres gpu:2
This defines the number of GPUs per node.
GPU Compute Mode selection
#SBATCH -C gpu_shared
Custom option. Sets the GPU Compute Mode to DEFAULT (shared) instead of EXCLUSIVE_PROCESS in the Slurm Prolog.
With -C gpu_shared, multiple processes can access the same GPU.
In general NOT recommended, but sometimes required for multi-GPU jobs, for instance when running Amber or LAMMPS (see the Prolog sketch below).
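What the Prolog does here can be sketched as follows; the feature-detection logic is a site-specific assumption, only the nvidia-smi compute-mode calls are standard:

#!/bin/bash
# Slurm Prolog sketch: pick the GPU compute mode based on job features.
# Assumption: requested features are retrieved via scontrol; only the
# nvidia-smi calls are standard.
if scontrol show job "$SLURM_JOB_ID" | grep -q "gpu_shared"; then
    nvidia-smi -c 0   # DEFAULT: multiple processes may share each GPU
else
    nvidia-smi -c 3   # EXCLUSIVE_PROCESS: one process per GPU
fi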
SHERLOCK SLURM GPU QOS SETTINGS
Simple enforcement of GPU usage on Sherlock’s GPU QoS
▸ MinTRES set to cpu=1,gres/gpu=1
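Such a limit can be attached to a QoS with sacctmgr; a sketch, assuming the QoS is simply named gpu:

# require every job on the gpu QoS to request at least 1 CPU and 1 GPU
$ sacctmgr modify qos gpu set MinTRESPerJob=cpu=1,gres/gpu=1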
Example of error if the above rule is not respected
$ srun -p gpu --pty bash
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
$ srun -p gpu --gres gpu:1 --pty bash
srun: job 16281961 queued and waiting for resources
SHERLOCK SLURM GPU FEATURES
Sherlock has many different types of Nvidia GPUs
We use Slurm FEATURES (-C) for GPU type selection
▸ GRES “gpu:n” is used for GPU allocation
▸ We used to specify the GPU type in GRES as in “gpu:tesla:2” but using features is more flexible!
▸ Features advertised on GPU nodes (see the sinfo output below):
▸ CPU_GEN: CPU generation (e.g. BDW)
▸ CPU_SKU: CPU model (e.g. E5-2640v4)
▸ CPU_FRQ: CPU frequency (e.g. 2.40GHz)
▸ GPU_GEN: GPU generation (e.g. PSC)
▸ GPU_SKU: GPU model (e.g. TESLA_P100_PCIE)
▸ GPU_MEM: GPU memory (e.g. 16GB)
▸ Example of GPU type constraint: -C GPU_SKU:TITAN_XP
# sinfo -o "%.10R %.8D %.10m %.5c %7z %8G %100f %N"
PARTITION NODES MEMORY CPUS S:C:T GRES AVAIL_FEATURES
test 1 257674 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TESLA_P100_PCIE,GPU_MEM:16GB sh-112-04
test 1 257674 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TESLA_P40,GPU_MEM:24GB sh-112-05
ownerxx 2 128658 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TITAN_XP,GPU_MEM:12GB sh-113-[06-07]
ownerxy 2 515578 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TESLA_P100_PCIE,GPU_MEM:16GB sh-112-[06-07]
ownerxy 4 515578 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TITAN_XP,GPU_MEM:12GB sh-112-[08-11]
ownerxz 1 128658 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TITAN_XP,GPU_MEM:12GB sh-112-12
owners 2 515578 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TESLA_P100_PCIE,GPU_MEM:16GB sh-112-[06-07]
owners 5 128658+ 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TITAN_XP,GPU_MEM:12GB sh-112-[08-12]
...
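Putting GRES and features together, a request for two Titan Xp cards could look like this (hypothetical example combining the options above):

$ srun -p gpu --gres gpu:2 -C GPU_SKU:TITAN_XP --pty bash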
XSTREAM SLURM CPU/GPU RATIOS
Job submission rules are enforced to maximize GPU efficiency
Max CPU/GPU ratio | Default memory per CPU | Max memory per CPU | Max (system) memory per GPU | Min GPU count
5/4 (20/16)       | 12,000 MB              | 12,800 MB          | 16,000 MB (*)               | 1

(*) Unlike memory/CPU, the number of GPUs is NOT automatically updated when you request more memory.

CPU/GPU ratio enforcement is implemented using a job_submit.lua plugin.

Example of error if the above CPU/GPU ratio rule is not respected:
$ srun -c 5 --gres gpu:1 command
srun: error: CPUs requested per node (5) not allowed with only 1 GPU(s); increase the number of GPUs to 4 or reduce the number of CPUs
srun: error: Unable to allocate resources: More processors requested than permitted
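For comparison, requests that respect the 5/4 CPU/GPU ratio are accepted; two hypothetical examples:

$ srun -c 1 --gres gpu:1 command   # 1 CPU : 1 GPU, within the 5/4 ratio
$ srun -c 5 --gres gpu:4 command   # 5 CPUs : 4 GPUs, exactly the max ratio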
XSTREAM MULTI-GPU RESOURCE ALLOCATION
GPU devices cgroups
Slurm on XStream uses the Linux cgroup devices subsystem so that a job is only allowed to access its allocated GPU devices.
$ srun --gres gpu:3 nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-fdae33a3-6287-68e8-1aef-4fa8a745ad07)
GPU 1: Tesla K80 (UUID: GPU-f4735a45-ea34-55f1-35ba-50a84c4b462c)
GPU 2: Tesla K80 (UUID: GPU-b1d8438e-7f05-9bc3-8d3e-67893931896f)
Consequence: the GPU IDs above and the $CUDA_VISIBLE_DEVICES IDs within a job always start at 0, as the quick check below illustrates.
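A quick way to verify this from within an allocation (the output shown assumes a 3-GPU request):

$ srun --gres gpu:3 bash -c 'echo $CUDA_VISIBLE_DEVICES'
0,1,2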
Display full node GPU Direct communication matrix and CPU affinity
$ srun -c 20 --gres gpu:16 nvidia-smi topo -m
SLURM AND GPU PEER-TO-PEER COMMUNICATION
GPU PEER-TO-PEER WITH SLURM ON XSTREAM (1/5)
#SBATCH --gres-flags=enforce-binding
Standard Slurm option. Enforces GRES/CPU binding, i.e. all CPUs and GRES (here, GPUs) will be allocated within the same CPU socket(s). Required for GPU P2P.
Sufficient when used with 1 CPU (-c 1), but USELESS when CPUs are allocated across different CPU sockets: tasks (and their CPUs) also need to be bound to a specific CPU socket.
#SBATCH --cpus-per-task=n (1 to 10), for example: -c 8
#SBATCH --ntasks-per-socket=n, with n == ntasks
Standard Slurm option. CPU masks are automatically generated to bind the tasks to specific sockets (see the combined job header below).
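Putting the three options together, a P2P-friendly job header looks like this (a sketch mirroring the srun example shown later in this section):

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8            # 8 CPUs for the task
#SBATCH --ntasks-per-socket=1        # bind the task (and its CPUs) to one socket
#SBATCH --gres gpu:8                 # 8 GPUs...
#SBATCH --gres-flags=enforce-binding # ...on the same socket as the CPUs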
GPU PEER-TO-PEER WITH SLURM ON XSTREAM (2/5)
Example of bad resource allocation on XStream for GPU P2P
GPU PEER-TO-PEER WITH SLURM ON XSTREAM (3/5)
Bad case of GPU P2P allocation:
▸ 1 CPU (core 0) and 6 GPUs are already allocated on the first CPU socket
▸ by requesting 8 CPUs and 8 GPUs, we may get:
▹ 8 CPUs not on the same CPU socket
▹ 8 GPUs not on the same PCIe Root Complex (CPU socket): 2 + 6
$ srun -c8 --gres gpu:8 nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx4_0 CPU Affinity
GPU0 X PIX SOC SOC SOC SOC SOC SOC PHB 1-8
GPU1 PIX X SOC SOC SOC SOC SOC SOC PHB 1-8
GPU2 SOC SOC X PIX PXB PXB PHB PHB SOC 1-8
GPU3 SOC SOC PIX X PXB PXB PHB PHB SOC 1-8
GPU4 SOC SOC PXB PXB X PIX PHB PHB SOC 1-8
GPU5 SOC SOC PXB PXB PIX X PHB PHB SOC 1-8
GPU6 SOC SOC PHB PHB PHB PHB X PIX SOC 1-8
GPU7 SOC SOC PHB PHB PHB PHB PIX X SOC 1-8
mlx4_0 PHB PHB SOC SOC SOC SOC SOC SOC X
GPU PEER-TO-PEER WITH SLURM ON XSTREAM (4/5)
Fix issue by using the correct Slurm options for GPU P2P
▸ Group CPUs on the same socket with --ntasks-per-socket=n
▸ Group GPUs on the same socket with --gres-flags=enforce-binding
$ srun -c8 --gres gpu:8 --gres-flags=enforce-binding --ntasks-per-socket=1 \
    nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx4_0 CPU Affinity
GPU0 X PIX PXB PXB PHB PHB PHB PHB SOC 10-17
GPU1 PIX X PXB PXB PHB PHB PHB PHB SOC 10-17
GPU2 PXB PXB X PIX PHB PHB PHB PHB SOC 10-17
GPU3 PXB PXB PIX X PHB PHB PHB PHB SOC 10-17
GPU4 PHB PHB PHB PHB X PIX PXB PXB SOC 10-17
GPU5 PHB PHB PHB PHB PIX X PXB PXB SOC 10-17
GPU6 PHB PHB PHB PHB PXB PXB X PIX SOC 10-17
GPU7 PHB PHB PHB PHB PXB PXB PIX X SOC 10-17
mlx4_0 SOC SOC SOC SOC SOC SOC SOC SOC X
GPU PEER-TO-PEER WITH SLURM ON XSTREAM (5/5)
Example of proper resource allocation on XStream for GPU P2P
RUNNING AMBER WITH GPU P2P (INTRANODE)
AMBER (http://ambermd.org/) is a popular molecular dynamics simulation software used on both Sherlock and XStream.
AMBER is interesting to study here because it can use GPUs to massively accelerate PMEMD for both explicit solvent PME (Particle Mesh Ewald) and implicit solvent GB (Generalized Born). Key new features include (...) peer-to-peer support for multi-GPU runs, providing enhanced multi-GPU scaling.
XSTREAM RUNNING AMBER WITH GPU P2P (1/3)
AMBER Benchmark: DHFR NPT HMR 4fs = 23,558 atoms
XSTREAM RUNNING AMBER WITH GPU P2P (2/3)
#!/bin/bash
#SBATCH -o slurm_amber_pme_p2p_4gpus.%j.out
#SBATCH --ntasks=4
#SBATCH --ntasks-per-socket=4
#SBATCH --gres gpu:4
#SBATCH --gres-flags=enforce-binding
#SBATCH -C gpu_shared
#SBATCH -t 1:00:00
echo ""echo "JAC_PRODUCTION_NPT - 23,558 atoms PME 4fs"echo "-----------------------------------------"echo ""
module load intel/2015.5.223 Amber/14cd PME/JAC_production_NPT_4fs
srun $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin.GPU -o mdout.4GPU -inf mdinfo.4GPU -x mdcrd.4GPU -r restrt.4GPU
amber_bench_pme_p2p_4gpus.sbatch
XSTREAM RUNNING AMBER WITH GPU P2P (3/3)
Overview of CPU / GPU / multi-GPU / multi-GPU P2P performance
Other results are available at http://ambermd.org/gpus14/benchmarks.htm#Benchmarks (without ECC)
GPU RESOURCE MANAGEMENT WITH CONTAINERS: RUNNING TENSORFLOW WITH SINGULARITY
Software like TensorFlow evolves so quickly that installing and upgrading it on an HPC cluster on a regular basis is too painful, especially on an old OS (XStream is still running RHEL 6.9).
Singularity is a container technology developed for HPC. It is also installed on the major XSEDE computing systems.
Thanks to the new Nvidia GPU support in Singularity 2.3 (through the --nv option), we can now use Singularity with GPUs on both Sherlock and XStream!
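A quick sanity check that --nv exposes the allocated GPU inside a container (a sketch, using the image name as pulled in Step 1 below):

$ srun --gres gpu:1 singularity exec --nv tensorflow-latest-gpu.img nvidia-smi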
TENSORFLOW MODEL TRAINING (1/4)
Step 1: get the latest TensorFlow image
Create a new Singularity image by importing the Docker image:
$ module load singularity
$ singularity pull docker://tensorflow/tensorflow:latest-gpu
Step 2: TensorFlow container quick test
$ singularity shell --home $WORK:/home --nv tensorflow-latest-gpu.img
Singularity tensorflow.img:~> python
>>> import tensorflow as tf
TENSORFLOW MODEL TRAINING (2/4)
Step 3: run CIFAR10 training on a single GPU
CIFAR10 tutorial and dataset:
https://www.tensorflow.org/tutorials/deep_cnn
http://www.cs.toronto.edu/~kriz/cifar.html
TENSORFLOW MODEL TRAINING (3/4)
#!/bin/bash
#SBATCH --job-name=cifar10_1gpu
#SBATCH --output=slurm_cifar10_1gpu_%j.out
#SBATCH --cpus-per-task=1
#SBATCH --gres gpu:1
#SBATCH --time=1:00:00

TENSORFLOW_IMG=tensorflow-latest-gpu.img
CIFAR10_DIR=PEARC17_ECSS/tensorflow/cifar10

mkdir $LSTOR/cifar10_data
cp -v cifar-10-binary.tar.gz $LSTOR/cifar10_data/
module load singularity
srun singularity exec --home $WORK:/home --bind $LSTOR:/tmp --nv $TENSORFLOW_IMG \
    python $CIFAR10_DIR/cifar10_train.py --batch_size=128 \
        --log_device_placement=false \
        --max_steps=100000
extract of cifar10_1gpu.sbatch for XStream
Step 4: run CIFAR10 training on multiple GPUs
TENSORFLOW MODEL TRAINING (4/4)
#!/bin/bash
#SBATCH --job-name=cifar10_2gpu
#SBATCH --output=slurm_cifar10_2gpu_%j.out
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-socket=1
#SBATCH --gres gpu:2
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=1:00:00

TENSORFLOW_IMG=tensorflow-latest-gpu.img
CIFAR10_DIR=PEARC17_ECSS/tensorflow/cifar10

mkdir $LSTOR/cifar10_data
cp -v cifar-10-binary.tar.gz $LSTOR/cifar10_data/
module load singularity
srun singularity exec --home $WORK:/home --bind $LSTOR:/tmp --nv $TENSORFLOW_IMG \
    python $CIFAR10_DIR/cifar10_multi_gpu_train.py --num_gpus=2 \
        --batch_size=64 \
        --log_device_placement=false \
        --max_steps=100000
extract of cifar10_2gpu.sbatch for XStream
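Note how the multi-GPU script reuses --ntasks-per-socket=1 and --gres-flags=enforce-binding from the GPU P2P section: both GPUs land on the same PCIe Root Complex, so multi-GPU training can benefit from peer-to-peer transfers.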