Page 1: Accelerating Deep Learning with MVAPICH

Accelerating Deep Learning with MVAPICH

Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. Panda

Network Based Computing Laboratory
Dept. of Computer Science and Engineering
The Ohio State University

OSU Booth Talk (SC '17)

Page 2: Accelerating Deep Learning with MVAPICH

Agenda

• Introduction
  – Deep Learning Trends
  – CPUs and GPUs for Deep Learning
  – Message Passing Interface (MPI)
• Co-design Efforts
  – OSU-Caffe
  – NCCL-augmented MPI Broadcast
  – Large-message CUDA-Aware MPI Collectives
• Characterization of Deep Learning Workloads
  – CPUs vs. GPUs for Deep Learning with Caffe

Page 3: Accelerating Deep Learning with MVAPICH

DL Frameworks and Trends

• Caffe, TensorFlow, CNTK, and many more...
• Most frameworks are exploiting GPUs to accelerate training
• Diverse applications – Image Recognition, Cancer Detection, Self-Driving Cars, Speech Processing, etc.

https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/

Page 4: Accelerating Deep Learning with MVAPICH

GPUs are great for Deep Learning

• NVIDIA GPUs have been the main driving force for faster training of Deep Neural Networks (DNNs)
  – The ImageNet Challenge (ILSVRC)
  – 90% of the ImageNet teams used GPUs in 2014*
  – DL models like AlexNet, GoogLeNet, and VGG
  – A natural fit for DL due to the throughput-oriented nature of GPUs
  – GPUs are also growing in the HPC arena!

* https://blogs.nvidia.com/blog/2014/09/07/imagenet/
https://www.top500.org/statistics/list/

Page 5: Accelerating Deep Learning with MVAPICH

And CPUs are catching up fast

• Intel CPUs are everywhere, and many-core CPUs are emerging according to Top500.org
• Host CPUs exist even on the GPU nodes
  – Many-core Xeon Phis are increasing
  – The 1st-generation Xeon Phi was a co-processor, unlike the 2nd generation, which is a self-hosted processor!
• We usually hear that CPUs are 10x–100x slower than GPUs [1-3] – but can we do better?

[Figure: system count for Xeon Phi, from https://www.top500.org/statistics/list/]

1. https://dl.acm.org/citation.cfm?id=1993516
2. http://ieeexplore.ieee.org/abstract/document/5762730/
3. https://dspace.mit.edu/bitstream/handle/1721.1/51839/MIT-CSAIL-TR-2010-013.pdf?sequence=1

Page 6: Accelerating Deep Learning with MVAPICH

What to use for scale-out? (Distributed training of Neural Nets)

• What is Message Passing Interface (MPI)?
  – a de-facto standard for expressing distributed-memory parallel programming
  – used for communication between processes in multi-process applications
• MVAPICH2 is a high-performance implementation of the MPI standard
• What can MPI do for Deep Learning?
  – MPI has been used for large-scale scientific applications
  – Deep Learning can also exploit MPI to perform high-performance communication
• Why do I need communication in Deep Learning?
  – If you use one GPU or one CPU, you do not need communication
  – But one GPU or CPU is not enough!
  – DL wants as many compute elements as it can get!
  – MPI is a great fit – Broadcast, Reduce, and Allreduce are what most DL workloads require (see the sketch below)
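As a concrete illustration of the last bullet, here is a minimal, hypothetical sketch (not OSU-Caffe or MVAPICH2 code) of how a data-parallel training loop can use MPI_Allreduce to sum gradients across ranks each iteration; the buffer name and model size are invented for the example.

```c
/* Sketch: data-parallel gradient averaging with MPI (illustrative only). */
#include <mpi.h>
#include <stdlib.h>

#define NUM_PARAMS (1 << 20)   /* illustrative model size: 1M floats (~4 MB message) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float *grads = calloc(NUM_PARAMS, sizeof(float));  /* local gradients */

    for (int iter = 0; iter < 100; iter++) {
        /* ... forward/backward pass on this rank's mini-batch fills 'grads' ... */

        /* Sum gradients from all ranks in place; every rank gets the global sum. */
        MPI_Allreduce(MPI_IN_PLACE, grads, NUM_PARAMS, MPI_FLOAT,
                      MPI_SUM, MPI_COMM_WORLD);

        /* Average and apply the update locally (identical on every rank). */
        for (int i = 0; i < NUM_PARAMS; i++)
            grads[i] /= size;
        /* ... weights[i] -= lr * grads[i] ... */
    }

    free(grads);
    MPI_Finalize();
    return 0;
}
```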

Page 7: Accelerating Deep Learning with MVAPICH

Overview of the MVAPICH2 Project

• High-performance, open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,825 organizations in 85 countries
  – More than 432,000 (>0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '17 ranking)
    • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 15th: 241,108-core Pleiades at NASA
    • 20th: 462,462-core Stampede at TACC
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)

Page 8: Accelerating Deep Learning with MVAPICH

Deep Learning Frameworks – CPUs or GPUs?

• There are several Deep Learning (DL) or DNN training frameworks – Caffe, Cognitive Toolkit, TensorFlow, MXNet, and counting...
• Almost every framework has been optimized for NVIDIA GPUs – cuBLAS and cuDNN have led to significant performance gains!
• But every framework is able to execute on a CPU as well
  – So why are we not using them?
  – Performance has been "terrible", and several studies have reported significant degradation when using CPUs (see nvidia.qwiklab.com)
• But there is hope; in fact, a lot of great progress here!
  – MKL-DNN, just like cuDNN, has definitely rekindled this!
  – Coupled with Intel Xeon Phi (Knights Landing or KNL) and MCDRAM, the landscape for CPU-based DL looks promising.

Page 9: Accelerating Deep Learning with MVAPICH

The Key Question!

How to efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources like GPUs and Xeon Phi(s)?

Page 10: Accelerating Deep Learning with MVAPICH

Research Challenges: Let us bring HPC and DL "together"!

• Computation and communication characteristics of DL workloads
• Various datasets and networks handled differently in DL frameworks
• Possible strategies to evaluate the performance of DL frameworks
• Performance trends that can be observed for a single node
• Scale-out of DNN training for CPU-based and GPU-based DNN training
• Performance behavior for hardware features

Page 11: Accelerating Deep Learning with MVAPICH

Agenda

• Introduction
  – Deep Learning Trends
  – CPUs and GPUs for Deep Learning
  – Message Passing Interface (MPI)
• Co-design Efforts
  – OSU-Caffe
  – NCCL-augmented MPI Broadcast
  – Large-message CUDA-Aware MPI Collectives
• Characterization of Deep Learning Workloads
  – CPUs vs. GPUs for Deep Learning with Caffe

Page 12: Accelerating Deep Learning with MVAPICH

Caffe Architecture

[Figure: multi-GPU Caffe training loop. Each iteration: 1. Data Propagation – the packed parameters (packed_comm_buff) are broadcast from GPU 0 to GPUs 1–3 (Bcast on GPU 0); 2. Forward/Backward Pass – each GPU runs the forward (F) and backward (B) passes over layers L1..Ln; 3. Gradient Aggregation – the packed gradients (packed_reduce_buff) are reduced to GPU 0 (Reduce on GPU 0), updates are applied, and the loop repeats.]

http://hidl.cse.ohio-state.edu
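The loop in the figure can be summarized in code. Below is a simplified sketch, assuming one MPI rank per GPU and reusing the buffer names from the figure; it mirrors the three steps (Bcast of parameters from rank/GPU 0, forward/backward pass, Reduce of gradients to rank/GPU 0) but is not the actual Caffe or OSU-Caffe source.

```c
/* Sketch of the training loop in the figure (hypothetical, not OSU-Caffe source).
 * One MPI rank drives one GPU; rank 0 owns the master copy of the parameters. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n_params = 1 << 20;          /* illustrative parameter count */
    float *packed_comm_buff   = calloc(n_params, sizeof(float)); /* parameters */
    float *packed_reduce_buff = calloc(n_params, sizeof(float)); /* gradients  */
    float *global_grads       = calloc(n_params, sizeof(float)); /* used on rank 0 */

    for (int it = 0; it < 100; it++) {
        /* 1. Data propagation: rank 0 broadcasts the current parameters
              (packed_comm_buff) to all other ranks/GPUs. */
        MPI_Bcast(packed_comm_buff, n_params, MPI_FLOAT, 0, MPI_COMM_WORLD);

        /* 2. Forward/backward pass over layers L1..Ln on this rank's GPU,
              producing local gradients in packed_reduce_buff (omitted). */

        /* 3. Gradient aggregation: sum all local gradients onto rank 0. */
        MPI_Reduce(packed_reduce_buff, global_grads, n_params, MPI_FLOAT,
                   MPI_SUM, 0, MPI_COMM_WORLD);

        /* Rank 0 applies the updates to the parameters (omitted); the next
           iteration's Bcast redistributes the updated parameters. */
    }

    free(packed_comm_buff); free(packed_reduce_buff); free(global_grads);
    MPI_Finalize();
    return 0;
}
```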

Page 13: Accelerating Deep Learning with MVAPICH

OSU-Caffe: Co-design to Tackle New Challenges for MPI Runtimes

• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (order of megabytes)
  – Most communication based on GPU buffers
• Existing state-of-the-art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – CUDA-Aware MPI --> scale-out performance (for small and medium message sizes only!)
• Proposed: Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication (see the sketch below)
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?

[Figure: scale-up performance vs. scale-out performance. cuDNN, cuBLAS, and NCCL provide high scale-up but limited scale-out; MPI, gRPC, and Hadoop provide scale-out but limited scale-up; the proposed co-designs target both.]

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).
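One of the co-design ideas above, overlapping communication with computation, can be sketched with MPI-3 non-blocking collectives: the reduction of one layer's gradients is started as soon as its backward pass finishes, while the backward pass of the next (earlier) layer proceeds. The sketch below is a hypothetical illustration with invented layer sizes, not the S-Caffe design itself.

```c
/* Sketch: overlapping gradient reductions with backward computation using
 * MPI-3 non-blocking collectives (illustrative layer sizes, not S-Caffe code). */
#include <mpi.h>
#include <stdlib.h>

#define NUM_LAYERS 8
#define LAYER_SIZE (1 << 18)   /* illustrative: 256K floats per layer */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    float *grads[NUM_LAYERS], *summed[NUM_LAYERS];
    MPI_Request reqs[NUM_LAYERS];
    for (int l = 0; l < NUM_LAYERS; l++) {
        grads[l]  = calloc(LAYER_SIZE, sizeof(float));
        summed[l] = calloc(LAYER_SIZE, sizeof(float));
    }

    /* Backward pass runs from the last layer to the first. */
    for (int l = NUM_LAYERS - 1; l >= 0; l--) {
        /* ... compute the backward pass for layer l, filling grads[l] ... */

        /* Start reducing this layer's gradients immediately; the next
           layer's backward computation overlaps with this communication. */
        MPI_Iallreduce(grads[l], summed[l], LAYER_SIZE, MPI_FLOAT,
                       MPI_SUM, MPI_COMM_WORLD, &reqs[l]);
    }

    /* Wait for all outstanding reductions before applying weight updates. */
    MPI_Waitall(NUM_LAYERS, reqs, MPI_STATUSES_IGNORE);

    for (int l = 0; l < NUM_LAYERS; l++) { free(grads[l]); free(summed[l]); }
    MPI_Finalize();
    return 0;
}
```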

Page 14: Accelerating Deep Learning with MVAPICH

MVAPICH2-GDR: Scale-out for GPU-based Distributed Training

[Figure: OSU micro-benchmark results for GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth (small to medium message sizes), comparing MV2 (NO-GDR) and MV2-GDR 2.3a. MV2-GDR reaches 1.88 us latency (up to 11X better) and roughly 9x–10x higher bandwidth/bi-bandwidth.]

Platform: MVAPICH2-GDR 2.3a, Intel Haswell (E5-2687W) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct RDMA.

MVAPICH2-GDR: performance that meets Deep Learning requirements!
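The numbers above come from the OSU micro-benchmarks. A minimal sketch of the underlying idea is shown below, assuming a CUDA-aware MPI such as MVAPICH2-GDR: device pointers obtained from cudaMalloc are passed directly to MPI calls, with no explicit host staging. The ping-pong structure, message size, and device selection are illustrative, not the actual osu_latency code.

```c
/* Sketch: GPU-to-GPU ping-pong with a CUDA-aware MPI (e.g., MVAPICH2-GDR).
 * Device buffers are handed directly to MPI_Send/MPI_Recv. Illustrative only. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 8192;            /* illustrative message size in bytes */
    const int iters = 1000;
    char *d_buf;
    cudaSetDevice(0);                  /* one GPU per rank assumed here */
    cudaMalloc((void **)&d_buf, count);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(d_buf, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(d_buf, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(d_buf, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(d_buf, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double avg_us = (MPI_Wtime() - t0) * 1e6 / (2.0 * iters);
    if (rank == 0) printf("avg one-way latency: %.2f us\n", avg_us);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```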

Page 15: Accelerating Deep Learning with MVAPICH

OSU-Caffe 0.9: Scalable Deep Learning on GPU Clusters

• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset

[Figure: GoogLeNet (ImageNet) training time (seconds) vs. number of GPUs (8–128) for Caffe, OSU-Caffe (batch 1024), and OSU-Caffe (batch 2048); some configurations are marked as invalid use cases.]

OSU-Caffe 0.9 is available from the HiDL site.

Page 16: Accelerating Deep Learning with MVAPICH

Efficient Broadcast for MVAPICH2-GDR using NVIDIA NCCL

• NCCL has some limitations
  – Only works within a single node; thus, no scale-out on multiple nodes
  – Degradation across the IOH (socket) for scale-up within a node
• We propose an optimized MPI_Bcast
  – Communication of very large GPU buffers (order of megabytes)
  – Scale-out on a large number of dense multi-GPU nodes
• Hierarchical communication that efficiently exploits (see the sketch below):
  – CUDA-Aware MPI_Bcast in MV2-GDR
  – the NCCL broadcast primitive

[Figure: Performance benefits. OSU micro-benchmarks: broadcast latency (log scale) for message sizes from 1 byte to 128 MB, MV2-GDR vs. MV2-GDR-Opt, up to 100x improvement. Microsoft CNTK DL framework: training time (seconds) on 2–64 GPUs, 25% average improvement.]

"Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning," A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016. [Best Paper Runner-Up]
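The hierarchical scheme above can be sketched as an inter-node MPI_Bcast among node leaders over a CUDA-aware MPI, followed by an intra-node NCCL broadcast among each node's GPUs. The code below is a simplified illustration under assumed setup (one MPI rank per GPU, the root's data already resident on its GPU, a CUDA-aware MPI library); it is not the actual MVAPICH2-GDR design.

```c
/* Sketch: hierarchical broadcast of a large GPU buffer (illustrative only).
 * Step 1: CUDA-aware MPI_Bcast among node leaders (inter-node).
 * Step 2: NCCL broadcast among GPUs within each node (intra-node). */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split into per-node communicators; local rank 0 is the node leader. */
    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* One NCCL communicator per node (one GPU per local rank assumed). */
    ncclUniqueId id;
    if (node_rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, node_comm);
    cudaSetDevice(node_rank);
    ncclComm_t nccl_comm;
    ncclCommInitRank(&nccl_comm, node_size, id, node_rank);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const int count = 16 * 1024 * 1024;   /* illustrative: 16M floats (64 MB) */
    float *d_buf;
    cudaMalloc((void **)&d_buf, (size_t)count * sizeof(float));

    /* Step 1: inter-node broadcast among node leaders; the device buffer is
       passed directly to MPI, relying on a CUDA-aware MPI library. */
    if (node_rank == 0)
        MPI_Bcast(d_buf, count, MPI_FLOAT, 0, leader_comm);

    /* Step 2: intra-node broadcast from each node's leader GPU via NCCL. */
    ncclBcast(d_buf, count, ncclFloat, 0, nccl_comm, stream);
    cudaStreamSynchronize(stream);

    ncclCommDestroy(nccl_comm);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```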

Page 17: Accelerating Deep Learning with MVAPICH

Pure MPI Large Message Broadcast

• MPI_Bcast: design and performance tuning for DL workloads
  – Design ring-based algorithms for large messages (a sketch follows below)
  – Harness a multitude of algorithms and techniques for best performance across the full range of message sizes and process/GPU counts
• Performance benefits
  – Performance comparable to or better than NCCL-augmented approaches for large messages
  – Up to 10X improvement for small/medium message sizes with micro-benchmarks
  – Up to 7% improvement for VGG training

[Figure: MPI_Bcast benchmark on 128 GPUs (8 nodes): latency (ms, log scale) for message sizes from 1 byte to 128 MB, MV2-GDR-NCCL vs. MV2-GDR-Opt. VGG training with CNTK: training time (seconds) on 2–128 GPUs, MV2-GDR-NCCL vs. MV2-GDR-Opt.]

A. A. Awan, C.-H. Chu, H. Subramoni, and D. K. Panda, "Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?," arXiv '17 (https://arxiv.org/abs/1707.09414)
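A common ring-based scheme for large-message broadcast is "scatter + ring-allgather": the root scatters the message so each rank owns one chunk, then the chunks circulate around a ring until every rank has the full message. The sketch below is a generic textbook-style illustration of that idea (root fixed at rank 0, message size assumed divisible by the number of ranks), not the tuned MVAPICH2-GDR algorithm.

```c
/* Sketch: scatter + ring-allgather broadcast for large messages
 * (generic illustration, not the MVAPICH2-GDR implementation). */
#include <mpi.h>

void ring_bcast(float *buf, int count, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int chunk = count / size;              /* assume count % size == 0 */
    int prev = (rank - 1 + size) % size;
    int next = (rank + 1) % size;

    /* Phase 1: root (rank 0) scatters the message so each rank holds one chunk. */
    if (rank == 0)
        MPI_Scatter(buf, chunk, MPI_FLOAT, MPI_IN_PLACE, chunk, MPI_FLOAT, 0, comm);
    else
        MPI_Scatter(NULL, chunk, MPI_FLOAT, buf + rank * chunk, chunk, MPI_FLOAT,
                    0, comm);

    /* Phase 2: ring allgather. In step s, every rank forwards the chunk it
       obtained s steps ago to its right neighbor while receiving a new chunk
       from its left neighbor, until all ranks hold the full message. */
    for (int s = 0; s < size - 1; s++) {
        int send_idx = (rank - s + size) % size;
        int recv_idx = (rank - s - 1 + size) % size;
        MPI_Sendrecv(buf + send_idx * chunk, chunk, MPI_FLOAT, next, 0,
                     buf + recv_idx * chunk, chunk, MPI_FLOAT, prev, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```

Usage: every rank allocates the full buffer and calls ring_bcast; only rank 0 needs valid input data. The attraction of this family of algorithms for megabyte-scale DL messages is that the per-step message size is count/size, so link bandwidth is used in a pipelined fashion instead of serializing the whole message through a tree.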

Page 18: Accelerating Deep Learning with MVAPICH

Large Message Allreduce: MVAPICH2-GDR vs. Baidu-allreduce

• Performance gains for MVAPICH2-GDR 2.3a* compared to Baidu-allreduce

[Figure: Allreduce latency (us, log scale) vs. message size (bytes) on 8 GPUs (4 nodes), Baidu-allreduce vs. MVAPICH2-GDR; annotated gains of ~30X and ~11% at different message sizes.]

* Available with MVAPICH2-GDR 2.3a

Page 19: Accelerating Deep Learning with MVAPICH

Large Message Optimized Collectives for Deep Learning

• MVAPICH2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available in MVAPICH2-GDR 2.2 and higher

[Figure: latency (ms) of large-message collectives. Reduce on 192 GPUs, Allreduce on 64 GPUs, and Bcast on 64 GPUs for message sizes 2–128 MB; Reduce at 64 MB on 128–192 GPUs; Allreduce at 128 MB and Bcast at 128 MB on 16–64 GPUs.]

Page 20: Accelerating Deep Learning with MVAPICH

Agenda

• Introduction
  – Deep Learning Trends
  – CPUs and GPUs for Deep Learning
  – Message Passing Interface (MPI)
• Co-design Efforts
  – OSU-Caffe
  – NCCL-augmented MPI Broadcast
  – Large-message CUDA-Aware MPI Collectives
• Characterization of Deep Learning Workloads
  – CPUs vs. GPUs for Deep Learning with Caffe

Page 21: Accelerating Deep Learning with MVAPICH

Understanding the Impact of Execution Environments

• Performance depends on many factors
• Hardware architectures
  – GPUs
  – Multi-/many-core CPUs
  – Software libraries: cuDNN (for GPUs), MKL-DNN/MKL 2017 (for CPUs)
• Hardware and software co-design
  – Software libraries optimized for one platform will not help the other!
  – cuDNN vs. MKL-DNN

[Figure: software/hardware stack for DL. DL applications (image recognition, speech processing, etc.) run on DL frameworks (Caffe, TensorFlow, etc.), which use either a generic convolution layer backed by BLAS libraries (OpenBLAS, ATLAS) on other processors, an MKL-optimized convolution layer backed by MKL 2017 on multi-/many-core CPUs (Xeon, Xeon Phi), or a cuDNN-optimized convolution layer backed by cuDNN/cuBLAS on many-core GPUs (Pascal P100).]

A. A. Awan, H. Subramoni, and D. K. Panda, "An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures," 3rd Workshop on Machine Learning in High Performance Computing Environments (MLHPC), held in conjunction with SC17, Nov 2017.

Page 22: Accelerating Deep Learning with MVAPICH

Impact of the MKL engine and MCDRAM for Intel-Caffe

• We use MCDRAM as Cache for all the subsequent results
• On average, DDR-All is up to 1.5X slower than MCDRAM
• The MKL engine is up to 3X better than the default Caffe engine
• Biggest gains are for the Intel Xeon Phi (many-core) architecture
• Both Haswell and Broadwell architectures get significant speedups (up to 1.5X)

[Figure: forward and backward training time (ms) across CPU architectures, and across memory configurations (DDR-All, MCDRAM-All, MCDRAM as Cache).]

Page 23: Accelerating Deep Learning with MVAPICH

The Full Landscape for AlexNet Training

• Convolutions in the forward and backward pass
• Faster convolutions -> faster training
• Most performance gains come from conv2 and conv3

[Figure: per-layer time (ms) for conv1–conv5 in the forward and backward passes.]

Page 24: Accelerating Deep Learning with MVAPICH

Multi-node Results: ResNet-50

• All results are weak scaling
  – The batch size remains constant per solver but increases overall by:
  – Batch-size * #nodes, or
  – Batch-size * #gpus
• Images/second is a derived metric, but it is more meaningful for understanding scalability (see the sketch after this slide's figure)
• Efficiency is another story [1]
  – Larger DNN architectures -> less scalability due to communication overhead

[Figure: ResNet-50 with Intel-Caffe: training time (seconds) and images/second vs. number of nodes (2–32).]

1. Experiences of Scaling TensorFlow On Up to 512 Nodes On CORI Supercomputer, Intel HPC Dev. Con., https://www.intel.com/content/www/us/en/events/hpcdevcon/overview.html
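To make the weak-scaling bullets concrete, the small sketch below shows how the effective (global) batch size and the derived images/second metric can be computed; the numbers used are illustrative placeholders, not the measured ResNet-50 results above.

```c
/* Sketch: deriving the effective batch size and images/second under weak
 * scaling (illustrative numbers, not the measured ResNet-50 data). */
#include <stdio.h>

int main(void) {
    int batch_per_solver = 32;     /* constant per solver (per node or per GPU) */
    int num_solvers      = 16;     /* number of nodes (or GPUs) */
    int iterations       = 100;
    double elapsed_sec   = 250.0;  /* measured time for those iterations */

    /* Weak scaling: the global batch grows with the number of solvers. */
    int global_batch = batch_per_solver * num_solvers;

    /* Derived throughput metric: images processed per second. */
    double images_per_sec = (double)global_batch * iterations / elapsed_sec;

    printf("global batch = %d, throughput = %.1f images/second\n",
           global_batch, images_per_sec);
    return 0;
}
```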

Page 25: Accelerating Deep Learning with MVAPICH

Summary

• Deep Learning is on the rise
  – Rapid advances in software and hardware and the availability of large datasets are driving it
• A single node or a single GPU is not enough for Deep Learning workloads
• We need to focus on distributed Deep Learning, but there are many challenges
• MPI offers a great abstraction for communication in DL training tasks
• A co-design of Deep Learning frameworks and communication runtimes will be required to make DNN training scalable

Page 26: Accelerating Deep Learning with MVAPICH

Thank You!

[email protected]
http://web.cse.ohio-state.edu/~awan.10

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/

Page 27: Accelerating Deep Learning with MVAPICH

Please join us for other events at SC '17

• Workshops
  – ESPM2 2017: Third International Workshop on Extreme Scale Programming Models and Middleware
• Tutorials
  – InfiniBand, Omni-Path, and High-Speed Ethernet for Dummies
  – InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems, and Usage
• BoFs
  – MPICH BoF: MVAPICH2 Project: Latest Status and Future Plans
• ACM SRC Posters
  – Co-designing MPI Runtimes and Deep Learning Frameworks for Scalable Distributed Training on GPU Clusters
  – High-Performance and Scalable Broadcast Schemes for Deep Learning on GPU Clusters
• Booth Talks
  – The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
  – Exploiting Latest Networking and Accelerator Technologies for MPI, Streaming, and Deep Learning: An MVAPICH2-Based Approach
  – Accelerating Deep Learning with MVAPICH
  – MVAPICH2-GDR Library: Pushing the Frontier of HPC and Deep Learning

Please refer to http://mvapich.cse.ohio-state.edu/talks/ for more details.

