Managing and Deploying GPU Accelerators
ADAC17 - Resource Management
Stephane Thiell and Kilian Cavalotti
Stanford Research Computing Center
Transcript
Page 1

Managing and Deploying GPU Accelerators
ADAC17 - Resource Management
Stephane Thiell and Kilian Cavalotti
Stanford Research Computing Center

Page 2

GPU resources at the SRCC

Slurm and GPUs

Slurm and GPU P2P

Running Amber with GPU P2P (intranode)

Running TensorFlow with Singularity

OUTLINE

Page 3

GPU RESOURCES AT THE SRCC

Page 4

STANFORD SHERLOCK

Shared compute cluster

Open to the Stanford community as a resource to support sponsored research

Condo cluster, nodes are ordered quarterly (currently 1000+ nodes)

Heterogeneous cluster with 73 GPU nodes / 500 GPU cards / Tesla and GeForce

Total of ~2500 users (~420 are faculty) and 64 owners

Supermicro GPU SuperServer 4028GR-TRT (4U) with 8 x Nvidia GeForce consumer-grade cards

Dell C4130 1U with 4 x Nvidia Tesla cards

Page 5

STANFORD XSTREAM

Multi-GPU HPC cluster

520 x Nvidia K80, 1040 GPUs, 16 GPUs / node

24 x Nvidia P100E, 8 GPUs / node

Energy efficient

1 PFlops (LINPACK peak) for only 190kW!

Top500 / Green500 rankings:

June 2015 (ISC15): 87th / 6th
Nov 2015 (SC15): 102nd / 5th
June 2016 (ISC16): 122nd / 6th
Nov 2016 (SC16): 162nd / 8th
June 2017 (ISC17): 214th / 24th

Page 6

XStream’s Cray CS-Storm 2626X8N node specs:

▸ 20 CPU cores (12 GB RAM/core)
▸ 256 GB of DDR3 RAM
▸ 16 Nvidia K80 GPUs, each with 12 GB of GDDR5 with ECC support
▸ Balanced PCIe bandwidth across GPUs (dual Root Complex)
▸ 1 x InfiniBand FDR

XSTREAM SYSTEM CHARACTERISTICS (NODE LAYOUT)

Page 7

SLURM AND GPUs

Page 8

SLURM WITH GENERIC RESOURCES (GRES)

SLURM is the resource manager on both Sherlock and XStream

▸ Open source and full-featured (including GPU support)

Generic Resource (GRES) scheduling

$ srun [...] --gres gpu:2 [...] or #SBATCH --gres gpu:2

This defines the number of GPUs per node.

GPU Compute Mode selection

#SBATCH -C gpu_shared

Custom option. Sets the GPU Compute Mode to DEFAULT (shared) instead of EXCLUSIVE_PROCESS in the Slurm prolog.

With -C gpu_shared, multiple processes are able to access a GPU.

In general NOT recommended but sometimes required for multi-GPU jobs, for instance when running Amber or LAMMPS.
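A minimal batch-script sketch combining the two options above (the application name is a placeholder; -C gpu_shared is the site-specific feature described above):

#!/bin/bash
#SBATCH --gres gpu:2
#SBATCH -C gpu_shared
#SBATCH -t 1:00:00

srun ./my_multi_gpu_app    # placeholder: any multi-GPU application, e.g. Amber's pmemd.cuda.MPI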

Page 9

SHERLOCK SLURM GPU QOS SETTINGS

Simple enforcement of GPU usage on Sherlock’s GPU QoS

▸ MinTRES set to cpu=1,gres/gpu=1

Example of error if the above rule is not respected

$ srun -p gpu --pty bash

srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

$ srun -p gpu --gres gpu:1 --pty bash

srun: job 16281961 queued and waiting for resources
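On the admin side, a limit like the MinTRES rule above could be applied to the QoS with sacctmgr (a sketch, assuming the QoS is named "gpu"; the exact option name may vary across Slurm versions):

$ sacctmgr modify qos gpu set MinTRESPerJob=cpu=1,gres/gpu=1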

Page 10

SHERLOCK SLURM GPU FEATURES

Sherlock has many different types of Nvidia GPUs

We use Slurm FEATURES (-C) for GPU type selection

▸ GRES “gpu:n” is used for GPU allocation

▸ We used to specify the GPU type in GRES, as in "gpu:tesla:2", but using features is more flexible!

▸ Example of GPU type constraint: -C GPU_SKU:TITAN_XP (see the job request example after the sinfo output below)

# sinfo -o "%.10R %.8D %.10m %.5c %7z %8G %100f %N"

PARTITION NODES MEMORY CPUS S:C:T GRES AVAIL_FEATURES NODELIST

test 1 257674 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TESLA_P100_PCIE,GPU_MEM:16GB sh-112-04

test 1 257674 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TESLA_P40,GPU_MEM:24GB sh-112-05

ownerxx 2 128658 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TITAN_XP,GPU_MEM:12GB sh-113-[06-07]

ownerxy 2 515578 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TESLA_P100_PCIE,GPU_MEM:16GB sh-112-[06-07]

ownerxy 4 515578 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TITAN_XP,GPU_MEM:12GB sh-112-[08-11]

ownerxz 1 128658 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TITAN_XP,GPU_MEM:12GB sh-112-12

owners 2 515578 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TESLA_P100_PCIE,GPU_MEM:16GB sh-112-[06-07]

owners 5 128658+ 20 2:10:1 gpu:4 CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TITAN_XP,GPU_MEM:12GB sh-112-[08-12]

...
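For example, combining the GRES request with a feature constraint to ask for two Tesla P100 GPUs on Sherlock (feature names taken from the sinfo output above):

$ srun -p gpu --gres gpu:2 -C GPU_SKU:TESLA_P100_PCIE --pty bash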

Page 11

XSTREAM SLURM CPU/GPU RATIOS

Job submission rules are enforced to maximize GPU efficiency:

Max CPU/GPU ratio: 5/4 (20/16)
Default memory per CPU: 12,000 MB
Max memory per CPU: 12,800 MB
Max (system) memory per GPU: 16,000 MB (*)
Min GPU count: 1

(*) Unlike memory/CPU, the number of GPUs is NOT automatically updated when you request more memory.

CPU/GPU ratio enforcement is implemented using a job_submit.lua plugin.

Example of error if the above CPU/GPU ratio rule is not respected:

$ srun -c 5 --gres gpu:1 command

srun: error: CPUs requested per node (5) not allowed with only 1 GPU(s); increase the number of GPUs to 4 or reduce the number of CPUs

srun: error: Unable to allocate resources: More processors requested than permitted
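Given the limits above, requests like these stay within the 5/4 CPU/GPU ratio ("command" is a placeholder, as in the error example):

$ srun -c 1 --gres gpu:1 command     # 1 CPU per GPU
$ srun -c 5 --gres gpu:4 command     # 5 CPUs per 4 GPUs, the maximum ratio
$ srun -c 20 --gres gpu:16 command   # full node: 20 CPUs, 16 GPUs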

Page 12

XSTREAM MULTI-GPU RESOURCE ALLOCATION

GPU device cgroups

Slurm on XStream uses the Linux cgroup devices subsystem so that a job is only allowed to access its allocated GPU devices.

$ srun --gres gpu:3 nvidia-smi -L

GPU 0: Tesla K80 (UUID: GPU-fdae33a3-6287-68e8-1aef-4fa8a745ad07)

GPU 1: Tesla K80 (UUID: GPU-f4735a45-ea34-55f1-35ba-50a84c4b462c)

GPU 2: Tesla K80 (UUID: GPU-b1d8438e-7f05-9bc3-8d3e-67893931896f)

Consequence: the GPU IDs above and $CUDA_VISIBLE_DEVICES IDs within a job always start at 0.
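A quick way to check this from inside an allocation (a sketch; the expected output reflects the relabeling described above):

$ srun --gres gpu:3 bash -c 'echo $CUDA_VISIBLE_DEVICES'
0,1,2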

Display full node GPU Direct communication matrix and CPU affinity

$ srun -c 20 --gres gpu:16 nvidia-smi topo -m

Page 13

SLURM AND GPU PEER TO PEER COMMUNICATION

Page 14

GPU PEER-TO-PEER WITH SLURM ON XSTREAM (1/5)

#SBATCH --gres-flags=enforce-binding

Standard Slurm option. Enforces GRES/CPU binding, i.e., all CPUs and GRES (here, GPUs) will be allocated within the same CPU socket(s). Required for GPU P2P.

Sufficient on its own when used with a single CPU (-c 1), but useless when the allocated CPUs span multiple CPU sockets: the tasks (and their CPUs) also need to be bound to a specific CPU socket.

#SBATCH --cpus-per-task=1 to 10, for example: -c 8

#SBATCH --ntasks-per-socket=n, with n == ntasks

Standard Slurm option. Masks will automatically be generated to bind the tasks to specific sockets.
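In batch form, the same binding recipe might look like this (a sketch consistent with the srun example on the following slides):

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-socket=1
#SBATCH --gres gpu:8
#SBATCH --gres-flags=enforce-binding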

Page 15

GPU PEER-TO-PEER WITH SLURM ON XSTREAM (2/5)

Example of bad resource allocation on XStream for GPU P2P

Page 16

GPU PEER-TO-PEER WITH SLURM ON XSTREAM (3/5)

Bad case of GPU P2P allocation:

▸ 1 CPU (core 0) and 6 GPUs are already allocated on the first CPU socket
▸ by requesting 8 CPUs and 8 GPUs, we may get:

▹ 8 CPUs not on the same CPU socket
▹ 8 GPUs not on the same PCIe Root Complex (CPU socket): 2 + 6

$ srun -c8 --gres gpu:8 nvidia-smi topo -m

GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx4_0 CPU Affinity

GPU0 X PIX SOC SOC SOC SOC SOC SOC PHB 1-8

GPU1 PIX X SOC SOC SOC SOC SOC SOC PHB 1-8

GPU2 SOC SOC X PIX PXB PXB PHB PHB SOC 1-8

GPU3 SOC SOC PIX X PXB PXB PHB PHB SOC 1-8

GPU4 SOC SOC PXB PXB X PIX PHB PHB SOC 1-8

GPU5 SOC SOC PXB PXB PIX X PHB PHB SOC 1-8

GPU6 SOC SOC PHB PHB PHB PHB X PIX SOC 1-8

GPU7 SOC SOC PHB PHB PHB PHB PIX X SOC 1-8

mlx4_0 PHB PHB SOC SOC SOC SOC SOC SOC X

Page 17

GPU PEER-TO-PEER WITH SLURM ON XSTREAM (4/5)

Fix issue by using the correct Slurm options for GPU P2P

▸ Group CPUs on the same socket with --ntasks-per-socket=n
▸ Group GPUs on the same socket with --gres-flags=enforce-binding

$ srun -c8 --gres gpu:8 --gres-flags=enforce-binding --ntasks-per-socket=1 \
    nvidia-smi topo -m

GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx4_0 CPU Affinity

GPU0 X PIX PXB PXB PHB PHB PHB PHB SOC 10-17

GPU1 PIX X PXB PXB PHB PHB PHB PHB SOC 10-17

GPU2 PXB PXB X PIX PHB PHB PHB PHB SOC 10-17

GPU3 PXB PXB PIX X PHB PHB PHB PHB SOC 10-17

GPU4 PHB PHB PHB PHB X PIX PXB PXB SOC 10-17

GPU5 PHB PHB PHB PHB PIX X PXB PXB SOC 10-17

GPU6 PHB PHB PHB PHB PXB PXB X PIX SOC 10-17

GPU7 PHB PHB PHB PHB PXB PXB PIX X SOC 10-17

mlx4_0 SOC SOC SOC SOC SOC SOC SOC SOC X

Page 18

GPU PEER-TO-PEER WITH SLURM ON XSTREAM (5/5)

Example of proper resource allocation on XStream for GPU P2P

Page 19

RUNNING AMBER WITH GPU P2P (INTRANODE)

Page 20

AMBER (http://ambermd.org/) is a popular molecular dynamics simulation package used on both Sherlock and XStream.

AMBER is interesting to study because it can use GPUs to massively accelerate PMEMD for both explicit solvent PME (Particle Mesh Ewald) and implicit solvent GB (Generalized Born). "Key new features include (...) peer to peer support for multi-GPU runs providing enhanced multi-GPU scaling."


XSTREAM RUNNING AMBER WITH GPU P2P (1/3)

Page 21

AMBER Benchmark: DHFR NPT HMR 4fs = 23,558 atoms

XSTREAM RUNNING AMBER WITH GPU P2P (2/3)

#!/bin/bash
#SBATCH -o slurm_amber_pme_p2p_4gpus.%j.out
#SBATCH --ntasks=4
#SBATCH --ntasks-per-socket=4
#SBATCH --gres gpu:4
#SBATCH --gres-flags=enforce-binding
#SBATCH -C gpu_shared
#SBATCH -t 1:00:00

echo ""
echo "JAC_PRODUCTION_NPT - 23,558 atoms PME 4fs"
echo "-----------------------------------------"
echo ""

module load intel/2015.5.223 Amber/14
cd PME/JAC_production_NPT_4fs

srun $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin.GPU -o mdout.4GPU -inf mdinfo.4GPU -x mdcrd.4GPU -r restrt.4GPU

amber_bench_pme_p2p_4gpus.sbatch

Page 22

Overview of CPU / GPU / multiGPU / multiGPU P2P performance

Other results are available at http://ambermd.org/gpus14/benchmarks.htm#Benchmarks (without ECC)

XSTREAM RUNNING AMBER WITH GPU P2P (3/3)

Page 23

GPU RESOURCE MANAGEMENT WITH CONTAINERS: RUNNING TENSORFLOW WITH SINGULARITY

Page 24

Software like TensorFlow evolves so quickly that installing and upgrading it on an HPC cluster on a regular basis is too painful, especially on an older OS (XStream is still running RHEL 6.9).

Singularity is a container technology developed for HPC. It is also installed on the major XSEDE computing systems.

Thanks to the new Nvidia GPU support in Singularity 2.3 (through the --nv option), we can now use Singularity with GPUs on both Sherlock and XStream!

TENSORFLOW MODEL TRAINING (1/4)

Page 25

Step 1: get the latest TensorFlow image

Create a new Singularity image by importing the Docker image.

$ module load singularity

$ singularity pull docker://tensorflow/tensorflow:latest-gpu

Step 2: TensorFlow container quick test

$ singularity shell --home $WORK:/home --nv tensorflow-latest-gpu.img

Singularity tensorflow.img:~> python

>>> import tensorflow as tf
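To confirm that the allocated GPU is visible from inside the container, a quick check (a sketch; the --nv option binds the host NVIDIA driver tools, including nvidia-smi, into the container):

$ srun --gres gpu:1 singularity exec --nv tensorflow-latest-gpu.img nvidia-smi -L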

TENSORFLOW MODEL TRAINING (2/4)

Page 26

Step 3: run CIFAR10 training on single GPU
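This example follows the TensorFlow CNN tutorial (https://www.tensorflow.org/tutorials/deep_cnn), which trains a model on the CIFAR-10 dataset (http://www.cs.toronto.edu/~kriz/cifar.html).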

TENSORFLOW MODEL TRAINING (3/4)

#!/bin/bash
#SBATCH --job-name=cifar10_1gpu
#SBATCH --output=slurm_cifar10_1gpu_%j.out
#SBATCH --cpus-per-task=1
#SBATCH --gres gpu:1
#SBATCH --time=1:00:00

TENSORFLOW_IMG=tensorflow-latest-gpu.img
CIFAR10_DIR=PEARC17_ECSS/tensorflow/cifar10

mkdir $LSTOR/cifar10_data
cp -v cifar-10-binary.tar.gz $LSTOR/cifar10_data/

module load singularity

srun singularity exec --home $WORK:/home --bind $LSTOR:/tmp --nv $TENSORFLOW_IMG \
    python $CIFAR10_DIR/cifar10_train.py --batch_size=128 \
        --log_device_placement=false \
        --max_steps=100000

extract of cifar10_1gpu.sbatch for XStream

Page 27

Step 4: run CIFAR10 training on multiple GPUs

TENSORFLOW MODEL TRAINING (4/4)

#!/bin/bash
#SBATCH --job-name=cifar10_2gpu
#SBATCH --output=slurm_cifar10_2gpu_%j.out
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-socket=1
#SBATCH --gres gpu:2
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=1:00:00

TENSORFLOW_IMG=tensorflow-latest-gpu.img
CIFAR10_DIR=PEARC17_ECSS/tensorflow/cifar10

mkdir $LSTOR/cifar10_data
cp -v cifar-10-binary.tar.gz $LSTOR/cifar10_data/

module load singularity

srun singularity exec --home $WORK:/home --bind $LSTOR:/tmp --nv $TENSORFLOW_IMG \
    python $CIFAR10_DIR/cifar10_multi_gpu_train.py --num_gpus=2 \
        --batch_size=64 \
        --log_device_placement=false \
        --max_steps=100000

extract of cifar10_2gpu.sbatch for XStream
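Submitting and following the multi-GPU run (the log name follows the #SBATCH --output pattern above; <jobid> is the ID reported by sbatch):

$ sbatch cifar10_2gpu.sbatch
$ tail -f slurm_cifar10_2gpu_<jobid>.out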

Page 28

CONTACT

[email protected]

Page 29
