Page 1: CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters

Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda

Speaker: Sourav Chakraborty

Network-Based Computing Laboratory
Department of Computer Science and Engineering, The Ohio State University
CCGrid 2016

Page 2: Outline

• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion

Page 3: Drivers of Modern HPC Cluster Architectures

• Multi-core processors are ubiquitous
• InfiniBand very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing

[Figure: multi-core processors; high-performance interconnects (InfiniBand, <1 us latency, >100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop/s double precision on a chip); example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]

Page 4: Accelerators in HPC Systems

• Growth of accelerator-enabled clusters in the last 3 years
  – 22% of Top 50 clusters are boosted by NVIDIA GPUs in Nov '15
  – From the Top500 list (http://www.top500.org)

[Chart: number of systems using NVIDIA Kepler, NVIDIA Fermi, and Intel Xeon Phi accelerators, June 2013 through Nov 2015]

Page 5: Motivation – Collectives in Applications

• Scientific parallel applications spend a considerable amount of time in collective communication operations
  – E.g., deep learning frameworks such as Caffe

[Figure: GPU Node 1 through GPU Node N exchange data via MPI_Bcast/MPI_Scatter and MPI_Gather/MPI_Reduce around GPU computations]

Page 6: Motivation – Collective Reduction Operations

• Scientific parallel applications spend a considerable amount of time in collective communication operations
  – Pure communication collectives: Broadcast, Gather, Scatter…
  – Compute-oriented collectives: Reduce, Allreduce, Scan
  – The communication part is highly optimized
• Compute-oriented collective operations are not fully optimized for GPU clusters
  – The CPU is doing all the work
  – GPU resources are not fully utilized

Page 7: Motivation – Powerful GPU Resources

• Fast computation
  – Massive parallelism
• Efficient communication
  – NVIDIA GPUDirect RDMA
• GPU features are not being utilized for all collectives
• Can we leverage these features to further optimize the compute-oriented collectives on GPU clusters?

http://www.nvidia.com/object/gpu-accelerated-computing.html
https://developer.nvidia.com/gpudirect

Page 8: Problem Statement

• How to eliminate explicit data movements between host and GPU memories?
  – The cudaMemcpy call is expensive!
• How to handle the GPU-to-GPU communication after the computation finishes?
• When to use the GPU for compute-oriented collective operations?
  – Launching kernels brings overhead; how can this be minimized? (A small timing sketch follows.)
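
The kernel-launch overhead mentioned above (~10 us per launch in the slides) can be estimated with a microbenchmark. The following is a minimal sketch, not taken from the paper; the exact figure depends on the GPU, driver, and CUDA version.

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main() {
    const int iters = 1000;
    empty_kernel<<<1, 1>>>();          // warm-up so context creation is not timed
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        empty_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();       // include completion, as a reduction step would
    }
    auto end = std::chrono::steady_clock::now();

    double avg_us =
        std::chrono::duration<double, std::micro>(end - start).count() / iters;
    printf("average launch + sync time: %.2f us\n", avg_us);
    return 0;
}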

Page 9: Overview

• Design a framework that exploits CUDA kernels to efficiently handle compute-oriented collectives
• Propose extensions to the existing collective algorithms to be GPU-aware compute-oriented algorithms
  – MPI_Reduce, MPI_Allreduce and MPI_Scan
• Detailed analysis and evaluation using different GPU systems, including a Cray CS-Storm system

Page 10: Outline

• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion

Page 11: Design Consideration

• Existing designs
  1. Explicitly copy the data from GPU to host memory
  2. Host-to-host communication to remote processes
  3. Perform the computation on the CPU
  4. Explicitly copy the data from host to GPU memory
• Proposed designs
  1. GPU-to-GPU communication
     • NVIDIA GPUDirect RDMA (GDR)
     • Pipeline through the host for large messages
  2. Perform the computation on the GPU
     • Efficient CUDA kernels
(A sketch of both paths follows the figure below.)

[Figure: Node A and Node B, each with a CPU, host memory, a GPU, PCIe, and an IB adapter; numbered arrows mark the four steps of the existing host-staged path and the two steps of the proposed GPU-to-GPU path]
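
To make the two paths concrete, here is a minimal sketch (not the MVAPICH2-GDR implementation) of one reduction step in each design. It assumes a CUDA-aware MPI so MPI_Recv can take device pointers on the GDR path, and sum_kernel is a stand-in for the vector_addition kernel shown on Page 29.

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void sum_kernel(float *dst, const float *src, size_t count) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < count; i += (size_t)blockDim.x * gridDim.x)
        dst[i] += src[i];
}

// Existing design: stage through host memory and reduce on the CPU.
void reduce_step_host(float *d_local, float *h_local, float *h_remote,
                      size_t count, int src, MPI_Comm comm) {
    cudaMemcpy(h_local, d_local, count * sizeof(float),
               cudaMemcpyDeviceToHost);                          // (1) GPU -> host
    MPI_Recv(h_remote, (int)count, MPI_FLOAT, src, 0, comm,
             MPI_STATUS_IGNORE);                                 // (2) host <- host
    for (size_t i = 0; i < count; ++i)                           // (3) compute on CPU
        h_local[i] += h_remote[i];
    cudaMemcpy(d_local, h_local, count * sizeof(float),
               cudaMemcpyHostToDevice);                          // (4) host -> GPU
}

// Proposed design: GDR-based GPU-to-GPU communication plus a CUDA kernel.
void reduce_step_gpu(float *d_local, float *d_remote,
                     size_t count, int src, MPI_Comm comm) {
    MPI_Recv(d_remote, (int)count, MPI_FLOAT, src, 0, comm,
             MPI_STATUS_IGNORE);                                 // (1) GPU <- GPU via GDR
    sum_kernel<<<256, 256>>>(d_local, d_remote, count);          // (2) compute on GPU
    cudaDeviceSynchronize();
}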

Page 12: K-nomial MPI_Reduce

• Tree-based k-nomial algorithm
  – Only the non-leaf nodes perform the reduction operation
• Pros & Cons
  – Load balance, efficient/scalable communication
  – Higher average latency

[Figure: k-nomial (k = 2, i.e. binomial) reduction tree over ranks 0-7, reaching the root in three steps [1], [2], [3]]
(A sketch of the binomial case follows.)
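
A sketch of the k = 2 (binomial) case, reusing the hypothetical reduce_step_gpu helper from the Page 11 sketch (assumed to be in the same file); it is not the library's actual code. Ranks whose lowest set bit matches the current step send their partial result and drop out, so only the non-leaf ranks ever execute a reduction.

#include <mpi.h>

// From the Page 11 sketch: receive into d_tmp (GDR) and reduce into d_buf on the GPU.
void reduce_step_gpu(float *d_buf, float *d_tmp, size_t count, int src, MPI_Comm comm);

// Binomial reduce to root rank 0 over GPU buffers.
void binomial_reduce(float *d_buf, float *d_tmp, size_t count,
                     int rank, int nprocs, MPI_Comm comm) {
    for (int mask = 1; mask < nprocs; mask <<= 1) {
        if (rank & mask) {
            // Leaf at this step: send the partial result toward the root and stop.
            MPI_Send(d_buf, (int)count, MPI_FLOAT, rank - mask, 0, comm);
            return;
        } else if (rank + mask < nprocs) {
            // Non-leaf: receive a peer's partial result and reduce it on the GPU.
            reduce_step_gpu(d_buf, d_tmp, count, rank + mask, comm);
        }
    }
    // Only rank 0 reaches this point holding the fully reduced result.
}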

Page 13: Cost Analysis

• Host-based Binomial-Reduce (Default):
  log2(n) × ε × Comm_Host(M) + Comp_Host(M) + 2 × Copy(M)
  – Fast host-to-host communication, but relatively slow computation on the CPU and expensive cudaMemcpy before/after the reduction operation
• GPU-based Binomial-Reduce (BR-DD):
  log2(n) × ε × Comm_GDR(M) + Overhead_GPU(M) + Comp_GPU(M)
  – GDR-based GPU-to-GPU communication and fast, highly parallelized computation on the GPU, plus the overhead of launching CUDA kernels (~10 us)

(M is the message size, n the number of processes, and ε a constant from the tree initialization.)
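
As an illustration only, the two models can be evaluated side by side to see where the crossover falls as the message size M grows. The cost terms below are made-up linear functions, not measured values from the paper; with real measurements of Comm, Comp, Copy and the launch overhead they would be replaced by fitted parameters.

#include <cstdio>
#include <cmath>

// Hypothetical cost terms, all in microseconds, M in bytes.
double comm_host(double M) { return 1.0 + M / 6000.0; }   // host-host over IB
double comm_gdr(double M)  { return 1.5 + M / 5000.0; }   // GPU-GPU via GDR
double comp_host(double M) { return M / 1000.0; }         // CPU reduction
double comp_gpu(double M)  { return M / 20000.0; }        // GPU reduction kernel
double copy_h_d(double M)  { return 2.0 + M / 8000.0; }   // one cudaMemcpy
const double kernel_overhead = 10.0;                      // ~10 us per launch

double t_default(double M, int n, double eps) {
    return std::log2((double)n) * eps * comm_host(M) + comp_host(M) + 2.0 * copy_h_d(M);
}
double t_br_dd(double M, int n, double eps) {
    return std::log2((double)n) * eps * comm_gdr(M) + kernel_overhead + comp_gpu(M);
}

int main() {
    const int n = 32; const double eps = 1.0;
    for (double M = 4; M <= 4 * 1024 * 1024; M *= 4)
        printf("M=%9.0f B  Default=%9.1f us  BR-DD=%9.1f us\n",
               M, t_default(M, n, eps), t_br_dd(M, n, eps));
    return 0;
}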

Page 14: Gather-first MPI_Reduce

• Gather-first algorithm
  – The root gathers all the data and performs the computation
    • Since only the root needs the final result
• Pros & Cons
  – Low computation overhead
  – Poor scalability

[Figure: ranks 1-7 send their data directly to root rank 0]
(A sketch follows.)
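
A sketch of the GR-DD flavor of this idea (again assuming a CUDA-aware MPI and the hypothetical sum_kernel from the Page 11 sketch in the same file; not the library's actual code). The root gathers every contribution into one device buffer and folds the chunks together on the GPU, so the other ranks perform no computation.

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void sum_kernel(float *dst, const float *src, size_t count);  // Page 11 sketch

void gather_first_reduce(const float *d_sendbuf, float *d_result, size_t count,
                         int root, MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    float *d_gathered = nullptr;
    if (rank == root)
        cudaMalloc(&d_gathered, (size_t)nprocs * count * sizeof(float));

    // Every rank ships its data to the root (device-to-device with GDR).
    MPI_Gather(d_sendbuf, (int)count, MPI_FLOAT,
               d_gathered, (int)count, MPI_FLOAT, root, comm);

    if (rank == root) {
        // Start from rank 0's chunk, then fold in the remaining chunks on the GPU.
        cudaMemcpy(d_result, d_gathered, count * sizeof(float),
                   cudaMemcpyDeviceToDevice);
        for (int r = 1; r < nprocs; ++r)
            sum_kernel<<<256, 256>>>(d_result, d_gathered + (size_t)r * count, count);
        cudaDeviceSynchronize();
        cudaFree(d_gathered);
    }
}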

Page 15: Cost Analysis

• Host-based Gather and Reduce (GR-H-HH):
  (n − 1) × Comm_Host(M) + Comp_Host(M) + 2 × Copy(M)
• Host-based Gather, GPU-based Reduce (GR-HH):
  (n − 1) × ( Comm_Host(M) + Overhead_GPU(M) + Comp_GPU(M) + 2 × Copy(M) )
• GPU-based Gather and Reduce (GR-DD):
  (n − 1) × Comm_GDR(M) + Overhead_GPU(M) + Comp_GPU(M)

– Could suffer scalability issues -> good for small messages or small scale
– Less affected by the CUDA kernel launch overhead -> good for small messages

Page 16: GPU-based MPI_Allreduce and MPI_Scan

• Recursive doubling algorithm
  – Every process needs to perform computation
• Pros & Cons
  – Load balance, efficient/scalable communication
  – Higher average latency

[Figure: recursive doubling among ranks 0-7 completes in three steps [1], [2], [3]]
(A sketch of the recursive-doubling loop follows.)
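
A sketch of the recursive-doubling structure behind RD-DD (power-of-two process counts only; same CUDA-aware MPI and sum_kernel assumptions as the earlier sketches, and not the library's actual code). Unlike the binomial reduce, every rank exchanges data and reduces at every step, so all ranks finish with the complete result, which is what MPI_Allreduce requires.

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void sum_kernel(float *dst, const float *src, size_t count);  // Page 11 sketch

void recursive_doubling_allreduce(float *d_buf, float *d_tmp, size_t count,
                                  int rank, int nprocs, MPI_Comm comm) {
    for (int mask = 1; mask < nprocs; mask <<= 1) {
        int partner = rank ^ mask;     // exchange partner for this step
        // Swap partial results directly between GPU buffers (GDR path).
        MPI_Sendrecv(d_buf, (int)count, MPI_FLOAT, partner, 0,
                     d_tmp, (int)count, MPI_FLOAT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        // Every rank reduces on its GPU at every step.
        sum_kernel<<<256, 256>>>(d_buf, d_tmp, count);
        cudaDeviceSynchronize();
    }
}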

Page 17: Cost Analysis

• GPU-based Recursive Doubling (RD-DD):
  log2(n) × ε × Comm_GDR(M) + Overhead_GPU(M) + Comp_GPU(M)
  – Same as BR-DD for MPI_Reduce
• GPU-based Binomial-Reduce-Broadcast (BRB-DD):
  log2(n) × 2 × ε × Comm_GDR(M) + Overhead_GPU(M) + Comp_GPU(M)

Page 18: Alternative and Extended Designs

Communication            Computation   Design              Algorithm              Benefit
Host <-> Host            CPU           BR-H-HH (Default)   Binomial-Reduce        Large scale, small messages
Host <-> Host            CPU           RD-H-HH (Default)   Recursive doubling     Large scale, small messages
Host <-> Host            CPU           GR-H-HH             Gather-Reduce          Small scale, small messages
Host <-> Host            GPU           GR-HH               Gather-Reduce          Small scale, small messages
Host <-> Device (GDR)    GPU           GR-HD / GR-DH       Gather-Reduce          Small scale, small messages
Device <-> Device (GDR)  GPU           GR-DD               Gather-Reduce          Small scale, small messages
Device <-> Device (GDR)  GPU           BR-DD               Binomial-Reduce        Large messages for any scale
Device <-> Device (GDR)  GPU           BRB-DD              Binomial-Reduce-Bcast  Large messages for any scale
Device <-> Device (GDR)  GPU           RD-DD               Recursive doubling     Large messages for any scale
Host <-> Device (GDR)    GPU           RD-HD / RD-DH       Recursive doubling     Large messages for any scale

Page 19: Outline

• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion

Page 20: Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, 10-40 Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,550 organizations in 79 countries
  – More than 360,000 (>0.36 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '15 ranking)
    • 10th-ranked 519,640-core cluster (Stampede) at TACC
    • 13th-ranked 185,344-core cluster (Pleiades) at NASA
    • 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)

Page 21: Experimental Environments

1. Wilkes cluster @ University of Cambridge
   – 2 NVIDIA K20c GPUs per node
   – Used for inter-node experiments
     • Up to 32 GPU nodes
2. CSCS cluster @ Swiss National Supercomputing Centre
   – Cray CS-Storm system
   – 8 NVIDIA K80 GPUs per node (= 16 NVIDIA K40 GPUs per node)
   – Used for intra-node experiments
     • Up to 96 GPUs over 11 nodes

Page 22: Evaluation – MPI_Reduce @ Wilkes (32 GPUs)

[Charts: MPI_Reduce latency (us) vs. message size for small and large message ranges, comparing Default, BR-DD, GR-DD, GR-HD, GR-HH, and GR-H-HH]

• Gather-first approaches win for small messages
• The k-nomial GPU-based approach wins for large messages

Page 23: Evaluation – MPI_Reduce @ CSCS (96 GPUs)

[Charts: MPI_Reduce latency (us) vs. message size (4 B to 4 KB and 16 KB to 4 MB), comparing Default, BR-DD, GR-DD, GR-HD, GR-HH, and GR-H-HH]

• Gather-first approaches win for small messages
• The k-nomial GPU-based approach wins for large messages

Page 24: Evaluation – MPI_Allreduce

[Charts: MPI_Allreduce latency (ms) vs. message size (96 GPUs @ CSCS) and latency (ms) vs. system size (2-32 nodes, 32 GPUs @ Wilkes), comparing Default, RD-DD, and BRB-DD; the proposed designs show good scalability]

Page 25: Evaluation – MPI_Scan

[Charts: MPI_Scan latency (ms) vs. message size (64 KB to 4 MB, 96 GPUs @ CSCS) and latency (ms) vs. system size (2-32 nodes, 2 MB message, 32 GPUs @ Wilkes), comparing Default, RD-DD, and RD-HD; the proposed designs show good scalability]

Page 26: Prediction

• Use the proposed analytical models to predict the performance for large-scale GPU clusters

[Charts: latency (us) vs. message size (4 B to 4 MB): model-based estimation vs. experimental results for 32 GPUs on the Wilkes cluster, and a prediction for 1024 GPUs comparing Default with RD-DD/BR-DD]

Page 27: Conclusion

• CUDA kernel based designs significantly improve the performance of compute-oriented collective operations
  – MPI_Reduce, MPI_Allreduce and MPI_Scan
  – CUDA kernel based reduction operations -> fast computation
  – GPUDirect feature -> efficient GPU-to-GPU communication
• Future work
  – Performing application-level evaluation
    • Deep learning frameworks such as Caffe
  – Will be included in a future release of the MVAPICH2-GDR library

Page 28: Thank You!

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/

Page 29: Reduction Operations on GPU

• CUDA kernels – example: vector addition for the MPI_SUM operation

template <class T>
__global__ void vector_addition(T *dst, T *src, size_t count)
{
    // Grid-stride loop: each thread accumulates multiple elements of src into dst.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < count; i += blockDim.x * gridDim.x)
        dst[i] += src[i];
}
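
A small usage sketch (an assumption, not part of the slides) of how this kernel might be launched for a device buffer of count floats; the grid is capped because the grid-stride loop inside the kernel covers any remaining elements.

#include <algorithm>
#include <cuda_runtime.h>

// Hypothetical helper: accumulate d_src into d_dst using the kernel above.
void launch_vector_addition(float *d_dst, float *d_src, size_t count) {
    unsigned int threads = 256;
    unsigned int blocks  = (unsigned int)std::min<size_t>(
        (count + threads - 1) / threads, 1024);
    vector_addition<<<blocks, threads>>>(d_dst, d_src, count);
    cudaDeviceSynchronize();   // make the result visible before it is used or sent
}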

More information about optimizing your CUDA kernels:
http://developer.download.nvidia.com/books/cuda-by-example/cuda-by-example-sample.pdf
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

