Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning
Ching-Hsiang Chu 1, Xiaoyi Lu 1, Ammar A. Awan 1, Hari Subramoni 1, Jahanzeb Hashmi 1, Bracy Elton 2 and Dhabaleswar K. (DK) Panda 1
1 Department of Computer Science and Engineering, The Ohio State University
2 Engility Corporation
ICPP 2017 — Network-Based Computing Laboratory
Transcript
Page 1: Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning

Ching-Hsiang Chu 1, Xiaoyi Lu 1, Ammar A. Awan 1, Hari Subramoni 1, Jahanzeb Hashmi 1, Bracy Elton 2 and Dhabaleswar K. (DK) Panda 1
1 Department of Computer Science and Engineering, The Ohio State University
2 Engility Corporation

Page 2: Outline

• Introduction
  – Deep Learning on GPU and InfiniBand (IB) Clusters
  – Multi-source Broadcast-type Operation for Deep Learning
• Analysis
• Proposed Design
  – Streaming-based Design with IB multicast and NVIDIA GPUDirect features
• Performance Evaluation
• Conclusion and Future Work

Page 3: Trends in Modern HPC Architecture

• Multi-core/many-core technologies
• High Performance Interconnects
• Accelerators/Coprocessors are becoming common in high-end systems
• High Performance Storage and Compute devices

[Figure: representative HPC building blocks — Multi-core Processors; Accelerators/Coprocessors (high compute density, high performance/watt, >1 Tflop/s DP on a chip); High Performance Interconnects such as InfiniBand (IB) and Omni-Path (<1 μsec latency, 100 Gbps bandwidth); SSD, NVMe-SSD, NVRAM — as deployed in systems like Tianhe-2, Titan, K-Computer, and Sunway TaihuLight]

Page 4: GPU in HPC Systems

[Chart: system count of NVIDIA GPU-accelerated systems in the Top500 from June 2014 to June 2017, broken down by GPU generation (Fermi, Kepler, Pascal)]

• Growth of GPU clusters in the last 3 years
  – NVIDIA GPUs boost many Top500 and Green500 systems
• “Top 13 systems on the latest Green500 are all equipped with the P100 hardware”*

* Data collected from http://top500.org

Page 5: Architectures for Deep Learning (DL)

[Figure: DL system architectures, connected via IB networks]
• Past and Current Trend
  – Multi-core CPUs within a node
  – Multi-core CPUs across nodes
  – Multi-core CPUs + Single GPU across nodes
  – Multi-core CPUs + Multi-GPU within a node (e.g., NVIDIA DGX-1 systems)
• Near-future
  – Multi-core CPUs + Multi-GPU across nodes

Page 6: High-performance Deep Learning

• Computation using GPU
• Communication using MPI
  – Exchanging partial gradients after each minibatch
  – All-to-all (multi-source) communications
    Ø E.g., MPI_Bcast (a minimal sketch of this pattern follows below)
• Challenges
  – High computation-communication overlap
  – Good scalability for upcoming large-scale GPU clusters
  – No application-level modification

[Figure: four GPU nodes exchanging partial gradients with one another]
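To make the communication pattern concrete, the following is a minimal C/MPI sketch (not taken from the paper) of a multi-source broadcast: every rank broadcasts its own partial gradients with an MPI_Bcast rooted at itself. Buffer names and sizes are illustrative; with a CUDA-aware MPI such as MVAPICH2-GDR, the buffers could be GPU-resident instead of host memory.

#include <mpi.h>
#include <stdlib.h>

/* Each rank owns one slice of the gradient buffer and broadcasts it to
 * everyone else; the other n-1 MPI_Bcast calls receive the peers' slices. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int grad_len = 1 << 20;                 /* illustrative gradient length */
    float *grads = malloc((size_t)nprocs * grad_len * sizeof(float));

    /* ... compute local gradients into grads + rank * grad_len ... */

    /* Multi-source broadcast: one MPI_Bcast rooted at each source rank. */
    for (int root = 0; root < nprocs; root++)
        MPI_Bcast(grads + (size_t)root * grad_len, grad_len,
                  MPI_FLOAT, root, MPI_COMM_WORLD);

    free(grads);
    MPI_Finalize();
    return 0;
}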

Page 7: Outline

• Introduction
• Analysis
  – Existing Designs
  – Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion and Future Work

Page 8: Evaluation Parameters

[Figure: a node with CPU, GPU, and IB HCA, annotated with the bandwidths B_H, B_G, and B_PCIe (B_H >> B_G)]

Notation | Meaning | Unit
n        | Number of processes | N/A
m        | Number of broadcast sources | N/A
t_s      | Setup time for sending data | sec
t_o(n)   | Overhead for issuing an IB-MCAST packet | sec
M        | Original message size | bytes
C        | Size of a data chunk | bytes
U        | Maximum Transmission Unit for IB-MCAST, provided by the hardware manufacturer | bytes
B_H      | Bandwidth of reading Host memory | bytes/sec
B_G      | Bandwidth of reading GPU memory (NVIDIA GPUDirect RDMA) | bytes/sec
B_PCIe   | PCIe bandwidth between Host and GPU memory | bytes/sec

[Figure: a message of size M divided into chunks of size C, each transmitted as IB-MCAST packets of MTU size U]
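As a convenience for the cost-model sketches later in this transcript, the notation above can be collected into a small C struct; this struct is an illustration added here, not part of the paper.

typedef struct {
    int    n;        /* number of processes */
    int    m;        /* number of broadcast sources */
    double t_s;      /* setup time for sending data (sec) */
    double t_o;      /* t_o(n): overhead for issuing an IB-MCAST packet (sec) */
    double M;        /* original message size (bytes) */
    double C;        /* size of a data chunk (bytes) */
    double U;        /* IB-MCAST Maximum Transmission Unit (bytes) */
    double B_H;      /* bandwidth of reading host memory (bytes/sec) */
    double B_G;      /* bandwidth of reading GPU memory via GDR (bytes/sec) */
    double B_PCIe;   /* PCIe bandwidth between host and GPU memory (bytes/sec) */
} bcast_params;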

Page 9: Ring-based Broadcast

[Figure: ring-based broadcast from a source GPU to destinations 1–3, with data moved by GDR Read, GDR Write, and network transfers through each node's IB HCA — poor scalability]

• Direct:   (n − 1) × (t_s + M / B_G)
• Pipeline: (M / C + (n − 2)) × (t_s + C / B_G)
• Staging:  M / B_PCIe + (n − 1) × (t_s + M / B_H)

Page 10: K-nomial-based Broadcast

[Figure: k-nomial-tree broadcast from a source GPU to destinations 1–3 via GDR Read, GDR Write, and network transfers — non-optimized scalability]

• Direct:   log_k(n) × (t_s + M / B_G)
• Pipeline: (M / C) × log_k(n) × (t_s + C / B_G)
• Staging:  M / B_PCIe + log_k(n) × (t_s + M / B_H)

Page 11: Hardware Multicast-based Broadcast*

[Figure: source GPU broadcasting through an IB switch to destinations 1–N in three steps: 1. IB Gather + GDR Read, 2. IB Hardware Multicast, 3. IB Scatter + GDR Write]

• For GPU-resident data, using
  – GPUDirect RDMA (GDR)
  – InfiniBand Hardware Multicast (IB-MCAST)
• Overhead
  – IB UD limit
  – GDR limit

Latency: (M / U) × (t_s + t_o(n) + U / B_G)

* A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters,” in HiPC 2014, Dec 2014.
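A small illustrative set of C helpers (not the authors' code) that evaluates the latency models summarized on the last three slides, using the notation from the evaluation-parameters table; t_o_n stands for t_o(n), and k is the k-nomial radix.

#include <math.h>

/* Ring-based, direct scheme: (n - 1) * (t_s + M / B_G) */
static double ring_direct(int n, double t_s, double M, double B_G)
{
    return (n - 1) * (t_s + M / B_G);
}

/* K-nomial-based, direct scheme: log_k(n) * (t_s + M / B_G) */
static double knomial_direct(int n, int k, double t_s, double M, double B_G)
{
    return (log((double)n) / log((double)k)) * (t_s + M / B_G);
}

/* IB-MCAST with GDR: (M / U) * (t_s + t_o(n) + U / B_G) */
static double mcast_gdr(double M, double U, double t_s, double t_o_n, double B_G)
{
    return (M / U) * (t_s + t_o_n + U / B_G);
}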

Page 12: Problem Statement

• How to determine techniques that leverage IB-MCAST and other advanced GPU features, such as GDR, to design efficient and scalable broadcast operations for large messages on GPU clusters?
• How to achieve high overlap and scalability for multi-source broadcast operations?
• How to determine attainable theoretical and practical performance benefits for deep learning applications?

Page 13: Outline

• Introduction
• Analysis
• Proposed Design
  – Streaming-based Design with IB multicast and NVIDIA GPUDirect features
• Performance Evaluation
• Conclusion and Future Work

Page 14: Overview of Proposed Streaming Design

• Optimized broadcast send operation
  – Streaming the GPU-resident data through host memory
  – Leveraging InfiniBand hardware multicast
    Ø Low latency: avoiding the GDR Read limit
    Ø Overlapping data transfers within and across nodes
• Optimized broadcast receive operation
  – Zero-copy scheme by leveraging the GDR feature
    Ø Low latency: avoiding unnecessary data transfers

Page 15: Optimized Broadcast Send

• Preparing intermediate buffer (im_buf)
  – Page-locked (pinned) host buffer
    Ø Fast Device-to-Host data movement
  – Allocated at initialization phase
    Ø Low overhead
• Streaming data through the host
  – Fine-tuned chunked data
  – Asynchronous copy operations
    Ø Three-stage pipeline (a simplified sketch follows below)

[Figure: sender side of MPI_Bcast(d_out, …) — 1. Data Preparation (GPU buffer d_out copied into pinned im_buf), 2. IB Gather, 3. IB Hardware Multicast through the IB switch]
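A simplified sketch of how such a chunked, double-buffered send pipeline can look. mcast_send() is a hypothetical placeholder for posting one chunk to IB hardware multicast; packetization to the IB-MCAST MTU, reliability handling, and memory registration are omitted, and the real design lives inside MVAPICH2-GDR's MPI_Bcast.

#include <cuda_runtime.h>
#include <stddef.h>

#define CHUNK_SIZE (512 * 1024)          /* fine-tuned chunk size C (illustrative) */

/* Hypothetical stand-in for the IB gather + hardware multicast of one chunk. */
static void mcast_send(const void *host_chunk, size_t len)
{
    (void)host_chunk; (void)len;         /* post the chunk to IB-MCAST here */
}

void streaming_bcast_send(const void *d_out, size_t M, cudaStream_t stream)
{
    static void *im_buf[2];              /* pinned intermediate buffers (double-buffered) */
    if (!im_buf[0]) {                    /* allocated once, e.g., at initialization phase */
        cudaMallocHost(&im_buf[0], CHUNK_SIZE);
        cudaMallocHost(&im_buf[1], CHUNK_SIZE);
    }

    size_t nchunks = (M + CHUNK_SIZE - 1) / CHUNK_SIZE;

    /* Prime the pipeline: start copying chunk 0 from device to host. */
    cudaMemcpyAsync(im_buf[0], d_out, (M < CHUNK_SIZE) ? M : CHUNK_SIZE,
                    cudaMemcpyDeviceToHost, stream);

    for (size_t i = 0; i < nchunks; i++) {
        size_t off = i * CHUNK_SIZE;
        size_t len = (M - off < CHUNK_SIZE) ? (M - off) : CHUNK_SIZE;

        cudaStreamSynchronize(stream);   /* wait for chunk i's D2H copy */

        if (i + 1 < nchunks) {           /* overlap: copy chunk i+1 ...   */
            size_t noff = off + CHUNK_SIZE;
            size_t nlen = (M - noff < CHUNK_SIZE) ? (M - noff) : CHUNK_SIZE;
            cudaMemcpyAsync(im_buf[(i + 1) & 1], (const char *)d_out + noff,
                            nlen, cudaMemcpyDeviceToHost, stream);
        }

        mcast_send(im_buf[i & 1], len);  /* ... while multicasting chunk i */
    }
}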

Page 16: Optimized Broadcast Receive

• Zero-copy broadcast receive
  – Pre-posted user buffer (d_in), registered for GDR (see the sketch below)
  – Avoids additional data movement
  – Leverages IB Scatter and GDR features
    Ø Low latency
    Ø Frees up PCIe resources for applications

[Figure: receiver side of MPI_Bcast(d_in, …) — IB Hardware Multicast through the IB switch, then IB Scatter (GDR Write) directly into the GPU buffer d_in on each destination]
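A conceptual sketch of the mechanism that enables this zero-copy receive: a GPU-resident buffer is registered with the HCA so that incoming multicast packets can be scattered into it by GDR writes, with no staging through host memory. This assumes the GPUDirect RDMA kernel module (nv_peer_mem / nvidia-peermem) is loaded; multicast group join, pre-posting of receive work requests, and error handling, all handled inside MVAPICH2-GDR, are omitted here.

#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Register a GPU-resident receive buffer (d_in) with the HCA. */
struct ibv_mr *register_gpu_recv_buf(struct ibv_pd *pd, size_t bytes, void **d_in_out)
{
    void *d_in = NULL;
    cudaMalloc(&d_in, bytes);            /* pre-posted user buffer lives in GPU memory */

    /* With GPUDirect RDMA, a device pointer can be registered like host
     * memory; the HCA then DMAs received data directly into the GPU. */
    struct ibv_mr *mr = ibv_reg_mr(pd, d_in, bytes,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

    *d_in_out = d_in;
    return mr;    /* receive work requests are then pre-posted against this MR */
}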

Page 17: Overlap Opportunities

[Figure: timeline of broadcasts from Nodes A, B, and C across HCA, CPU, and GPU, showing cudaMemcpyAsync, IB Hardware Multicast, cudaStreamSynchronize, and GDR Write operations overlapping both within a node and across nodes]

Latency of the proposed streaming design:
C / B_PCIe + (M / U) × (t_s + t_o(n) + U / B_H)
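Expressed in the same style as the earlier model helpers, the proposed design's latency model can be sketched as follows (illustrative, not the authors' code): only the first chunk's PCIe copy is exposed, the remaining copies are overlapped with IB-MCAST, and the multicast packets are read from host memory (B_H) rather than through GDR.

/* Proposed streaming design: C / B_PCIe + (M / U) * (t_s + t_o(n) + U / B_H) */
static double mcast_gdr_opt(double M, double C, double U,
                            double t_s, double t_o_n,
                            double B_H, double B_PCIe)
{
    return C / B_PCIe + (M / U) * (t_s + t_o_n + U / B_H);
}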

Page 18: Outline

• Introduction
• Analysis
• Proposed Design
• Performance Evaluation
  – OSU Micro-Benchmark (OMB)
  – Deep Learning Framework
• Conclusion and Future Work

Page 19: Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,775 organizations in 85 countries
  – More than 420,000 (>0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June ’17 ranking)
    • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
    • 15th, 241,108-core (Pleiades) at NASA
    • 20th, 462,462-core (Stampede) at TACC
    • 44th, 74,520-core (Tsubame 2.5) at Tokyo Institute of Technology
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Sunway TaihuLight (1st in Jun ’16, 10M cores, 100 PFlops)

Page 20: Experimental Environments

• RI2 cluster @ The Ohio State University
  – Two 14-core Intel (Broadwell) Xeon E5-2680 V4 processors
  – 1 NVIDIA K80 GPU per node; used up to 16 GPU nodes
  – One single-port InfiniBand EDR HCA
  – Mellanox SB7790 and SB7800 InfiniBand switches
• Ohio State University (OSU) Micro-Benchmark (OMB), http://mvapich.cse.ohio-state.edu/benchmarks/
  – osu_bcast - MPI_Bcast Latency Test
• Deep learning framework: CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK)*
  – AlexNet and VGG models with the ImageNet dataset

* D. S. Banerjee, K. Hamidouche and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," IEEE CloudCom, Luxembourg City, 2016, pp. 144-151.

Page 21: Evaluation: Benchmark Evaluation

• @ RI2 cluster, 16 GPUs, 1 GPU/node

[Charts (lower is better): MPI_Bcast latency (μs) vs. message size from 4 KB to 16 MB, and latency for a 2 MB message vs. number of GPU nodes (2–16), comparing MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, and MCAST-GDR-Opt; MCAST-GDR hits the GDR read limit, while MCAST-GDR-Opt stays near-constant]

• Provides near-constant latency over the system sizes
• Reduces latency by up to 65% for large messages

Page 22: Evaluation: Deep Learning Frameworks

• @ RI2 cluster, 16 GPUs, 1 GPU/node:
  – CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) without modification

[Charts (lower is better): training time (s) on 8 and 16 GPU nodes for the AlexNet and VGG models, comparing MV2-GDR-Knomial, MV2-GDR-Ring, and MCAST-GDR-Opt; annotated improvements of 15% and 24% (AlexNet) and 6% and 15% (VGG)]

• Reduces training time by up to 24% for the AlexNet model and up to 15% for the VGG model
• Higher improvement is expected for larger system sizes

Page 23: Performance Prediction

• Based on the architecture of the RI2 cluster

[Charts: latency (s) vs. number of broadcast sources — model-based estimations for the K-nomial-based, Ring-based, and MCAST-GDR-Opt designs extrapolated to large numbers of sources, and model vs. experiment for 2–16 sources, agreeing within 10% error]

Model parameters: M = 2 MB; C = 512 KB; U = 4 KB; B_H ≈ 100 Gbps; B_PCIe = 8 Gbps; t_o(n) ≈ (1/α) × ln(n), 15 ≤ α ≤ 20

Page 24: Outline

• Introduction
• Analysis
• Proposed Design
• Performance Evaluation
• Conclusion and Future Work

Page 25: Conclusion

• Proposed efficient broadcast schemes that leverage the GDR and MCAST features for deep learning applications
  – Optimized streaming design for large-message transfers
• Provided and evaluated analytical models to capture the essential performance behavior of alternative broadcast schemes on GPU clusters
  Ø These features are included in the latest release of the MVAPICH2-GDR library

Page 26: Future Work

• Extend the design to other broadcast-based collective algorithms as well as non-blocking operations
  – Allreduce, Allgather, and so on
• Evaluate the proposed design on upcoming larger-scale GPU clusters

Page 27: Thank You!

Thank You!
Ching-Hsiang Chu, Xiaoyi Lu, Ammar A. Awan, Hari Subramoni, Jahanzeb Hashmi, Bracy Elton and Dhabaleswar K. (DK) Panda
{chu.368, lu.932, awan.10, subramoni.1, hashmi.29}@[email protected], [email protected]

Network-Based Computing Laboratory, http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project, http://mvapich.cse.ohio-state.edu/

This project is supported under the United States Department of Defense (DOD) High Performance Computing Modernization Program (HPCMP) User Productivity Enhancement and Technology Transfer (PETTT) activity (Contract No. GS04T09DBC0017, Engility Corporation). The opinions expressed herein are those of the authors and do not necessarily reflect the views of the DOD or the employer of the author.

Page 28: MCAST-based Broadcast

• NVIDIA GPUDirect [1]
  – Remote direct memory access (RDMA) transfers between GPUs and other PCIe devices ⇒ GDR
  – and more…
• InfiniBand (IB) hardware multicast (IB-MCAST) [2]
  – Enables efficient designs of broadcast operations
    • Host-based [3]
    • GPU-based [4]

[1] https://developer.nvidia.com/gpudirect
[2] Pfister G. F., “An Introduction to the InfiniBand Architecture.” High Performance Mass Storage and Parallel I/O, Chapter 42, pp. 617-632, Jun 2001.
[3] J. Liu, A. R. Mamidala, and D. K. Panda, “Fast and Scalable MPI-level Broadcast using InfiniBand’s Hardware Multicast Support,” in IPDPS 2004, p. 10, April 2004.
[4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters,” in HiPC 2014, Dec 2014.

