CLIC GAN : S. Vallecorsa, G. Khattak, F. Carminati, M. Pierini
SurfSara : V. Codreanu, D. Podareanu
Cray : D. Moise
Intel : H. Pabst, V. Saletore
mpi-opt : V. Loncar, F. Pantaleo, T. Nguyen, M. Pierini, J-R. Vlimant, A. Zlokapa
Large Scale Training and Optimization of Neural Networks and Generative Adversarial Networks over Distributed Resources
07/12/18 Deep Learning Training & Optimization, J-R Vlimant, CHEP18
Outline
● Grounding : Deep Learning in HEP
● Generative adversarial networks
● Training workload parallelization
● Hyper-parameters optimization
● Summary and Future work
Based on software (mpi-opt) we are developing, with references to other studies performed by collaborators
Some performance plots and studies are stalled due to limited access to resources (allocation pending, job priority, …)
Deep Learning in High Energy Physics
Artificial Neural Network
See all developing applications in David's talk https://indico.cern.ch/event/587955/contributions/3012266/
http://www.asimovinstitute.org/neural-network-zoo
● Large number of parameters
● Efficiently adjusted with stochastic gradient descent
● The more parameters, the more data required
● Training to convergence can take minutes to several days, ...
Generative Adversarial Network
A Forger's Game
See many other contributions this week
https://indico.cern.ch/event/587955/contributions/2937509/ Steve F.
https://indico.cern.ch/event/587955/contributions/2937595/ Sofia V.
https://indico.cern.ch/event/587955/contributions/2937612/ Viktoriaa C.
https://indico.cern.ch/event/587955/contributions/2937515/ Tomasz P.
● Constructed from two artificial neural networks
● Two concurrent gradient descent procedures
● Training to convergence can take minutes to several days, ...
GAN for CLIC
Generative adversarial network architecture for CLIC 3D dataset
“Images” are 3D : energy deposition in a highly granular calorimeter http://cds.cern.ch/record/2254048
Training Artificial Neural Networks
● ANN and associated loss function have a fully analytical formulation and are differentiable with respect to model parameters
● Gradient evaluated over a batch of data
➢ Too small : very noisy, scattered gradient estimates
➢ Too large : information dilution and slow convergence
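The batch-size trade-off can be illustrated with a minimal sketch of mini-batch stochastic gradient descent on a one-parameter linear model. The toy dataset, the learning rate and the function names are assumptions made for illustration only; they do not come from the talk. With a fixed number of epochs, the small batch makes many noisy updates while the full batch makes few smooth ones.

```python
import random

def batch_gradient(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over one batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, lr=0.1, epochs=20, seed=0):
    """Plain mini-batch SGD on the single weight w, starting from w = 0."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            w -= lr * batch_gradient(w, batch)
    return w

# Hypothetical toy dataset: y = 3 * x plus a little gaussian noise.
data_rng = random.Random(1)
data = []
for _ in range(256):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + data_rng.gauss(0.0, 0.1)))

w_small = train(data, batch_size=4)    # 64 noisy updates per epoch
w_large = train(data, batch_size=256)  # 1 smooth update per epoch
```

In this setting `w_small` ends close to the true slope of 3, while `w_large` is still far from it after the same number of passes over the data.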
Distributed Training
Parallelism Overview
➔ Data distribution
Compute the gradients on several batches independently and update the model synchronously or asynchronously. Applicable to large datasets
➔ Gradient distribution
Compute the gradient of one batch in parallel and update the model with the aggregated gradient. Applicable to large sample ≡ large event
➔ Model distribution
Compute the gradients and updates of parts of the model separately, in a chain. Applicable to large models
Data Distribution
Data Distribution
https://arxiv.org/abs/1712.05878
● Master node operates as parameter server
● Worker nodes compute gradients
● Master handles gradients to update the central model
➔ Downpour SGD https://tinyurl.com/ycfpwec5
➔ Elastic averaging SGD https://arxiv.org/abs/1412.6651
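As an illustration of the master/worker scheme, here is a minimal single-process simulation of the synchronous elastic averaging update described in the cited paper: each worker follows its own gradient and is pulled towards the central model, while the central model moves towards the workers. It is a sketch under stated assumptions, not the mpi-opt implementation: the toy model, the data shards, the values of `lr` and `alpha`, and all function names are hypothetical, and no MPI communication or asynchrony is modelled.

```python
import random

def grad(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over one batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def easgd(shards, lr=0.1, alpha=0.05, steps=400, batch_size=8, seed=0):
    """Synchronous elastic averaging SGD with one local model per data shard."""
    rng = random.Random(seed)
    center = 0.0                      # central model held by the master
    workers = [0.0 for _ in shards]   # local models held by the workers
    for _ in range(steps):
        # Elastic differences use the state from before this round of updates.
        diffs = [w - center for w in workers]
        for i, shard in enumerate(shards):
            batch = rng.sample(shard, batch_size)
            workers[i] -= lr * grad(workers[i], batch) + alpha * diffs[i]
        center += alpha * sum(diffs)  # master moves towards the workers
    return center, workers

# Hypothetical toy data: four shards drawn from y = 3 * x plus noise.
data_rng = random.Random(1)
shards = []
for _shard in range(4):
    shard = []
    for _sample in range(64):
        x = data_rng.uniform(-1.0, 1.0)
        shard.append((x, 3.0 * x + data_rng.gauss(0.0, 0.1)))
    shards.append(shard)

center, workers = easgd(shards)
```

The elastic coefficient `alpha` sets how strongly workers are tied to the central model: larger values keep them closer together, smaller values let each worker explore more on its own.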
Performance on ANN
● Speed up in training recurrent neural networks on the Piz Daint CSCS supercomputer
➔ Linear speed up with up to ~20 nodes. Bottlenecks to be identified
➔ Needs to compensate for staleness of gradients
● Similar scaling on servers with 8 GPUs
➔ x7 speed up with students' work
https://github.com/duanders/mpi_learn
Performance on GAN
● Speed up in training generative adversarial networks on Piz Daint CSCS and Titan ORNL supercomputers
➔ Using the EASGD algorithm with RMSprop
➔ Speed up is not fully efficient. Bottlenecks to be identified
NVIDIA K20 at Titan, ORNL
NVIDIA P100 on Piz Daint, CSCS
[Diagram: parameter-set group 0 with a training master (group 0, subrank 0) and sub-masters TM1, TM2 ... TMNM, each with workers TW0 ... TWNW]
● Putting workers in several groups
● Aim at spreading communication to the main master
● Need to strike a balance between staleness and update frequency
Sub-master Layout
Sofia V. @ https://sites.google.com/nvidia.com/ai-hpc
Not mpi-opt
Gradient Distribution
[Diagram: parameter-set group 0 with a training master (group 0, subrank 0) and logical workers TW1, TW2 ... TWNW, each spread over processes GPU1, GPU2 ... GPUNGPU]
● A logical worker is spawned over multiple MPI processes
● Communicator passed to horovod https://github.com/uber/horovod
● Private horovod branch to allow for group initialization/reset
● Nvidia NCCL enabled for fast GPU-GPU communication
“all-reduce” Layout
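The idea behind the all-reduce layout can be sketched without any MPI or horovod dependency: the processes of one logical worker each compute the gradient contribution of their slice of the batch, and an all-reduce gives every process the same aggregated gradient. In the sketch below, `allreduce_sum` is a hypothetical in-process stand-in for the collective operation, and the toy model and batch are assumptions for illustration.

```python
def local_gradient_sum(w, shard):
    """Sum (not mean) of per-sample gradients of 0.5 * (w * x - y)**2 on one slice."""
    return sum((w * x - y) * x for x, y in shard)

def allreduce_sum(values):
    """In-process stand-in for an all-reduce: every process gets the same total."""
    total = sum(values)
    return [total for _ in values]

def distributed_gradient(w, batch, n_procs):
    """Gradient of one batch computed by n_procs processes of a logical worker."""
    shards = [batch[rank::n_procs] for rank in range(n_procs)]
    partial_sums = [local_gradient_sum(w, shard) for shard in shards]
    reduced = allreduce_sum(partial_sums)
    # Dividing by the full batch size turns the summed gradient into the mean.
    return [total / len(batch) for total in reduced]

# Hypothetical batch of 12 samples split over 4 processes.
batch = [(0.1 * i, 0.3 * i + 0.05) for i in range(12)]
per_process = distributed_gradient(w=1.0, batch=batch, n_procs=4)
single_process = local_gradient_sum(1.0, batch) / len(batch)
```

Every entry of `per_process` matches `single_process` up to floating-point rounding, which is why the logical worker behaves like a single worker processing the whole batch.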
Sofia V. @ https://sites.google.com/nvidia.com/ai-hpc
Not mpi-opt
Model Distribution
Intra-Node Model Parallelism
See T. Kurth et al. @ https://pasc18.pasc-conference.org for node to node model parallelism considerations
[Diagram: GPU1 and GPU2]
● Perform the forward and backward pass of sets of layers on different devices
● Requires good device to device communication
● Utilizes the native TensorFlow multi-device manager
● Aiming for machines with multi-GPU per node topology (Summit)
Hyper-Parameters Optimization
Hyper-Parameters
● Various parameters of the model cannot be learned by gradient descent
➢ Learning rate, batch size, number of layers, size of kernels, …
● Tuning to the right architecture is an “art”. Can easily spend a lot of time scanning many directions
● Full parameter scan is resource/time consuming.
➔ Hence looking for a way to reach the optimum hyper-parameter set for a provided figure of merit (the loss by default, but any other fom can work)
➔ Two optimization engines integrated
➢ Bayesian optimization with gaussian processes prior
➢ Evolutionary algorithm
[Diagram: H-opt master (rank 0) driving parameter-set groups 0, 1 ... NG; each group holds a training master (subrank 0) and training workers (subranks 1, 2 ... NW)]
● One master process drives the hyper-parameter optimization
● NG groups of nodes, each training on one parameter-set, simultaneously
● One training master
● NW training workers
Basic Layout
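One way to picture the basic layout is as a mapping from a global process rank to a role. The sketch below assumes that rank 0 is the hyper-parameter optimization master and that each parameter-set group occupies a contiguous block of 1 + NW ranks (one training master followed by its workers). This contiguous assignment and the function name `role_of` are hypothetical; the talk does not state how mpi-opt assigns ranks.

```python
def role_of(rank, n_groups, n_workers):
    """Map a global rank to (role, group, subrank) for the basic layout."""
    group_size = 1 + n_workers                 # training master plus its workers
    world_size = 1 + n_groups * group_size     # plus the H-opt master at rank 0
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside of world size {world_size}")
    if rank == 0:
        return ("h-opt master", None, None)
    group, subrank = divmod(rank - 1, group_size)
    role = "training master" if subrank == 0 else "training worker"
    return (role, group, subrank)

# Example: 2 parameter-set groups with 3 workers each, so 9 processes in total.
layout = [role_of(rank, n_groups=2, n_workers=3) for rank in range(9)]
```

With these numbers, ranks 1 and 5 are the two training masters and every other non-zero rank is a training worker of group 0 or group 1.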
Bayesian Optimization
● Objective function is approximated as a multivariate gaussian
● Measurements provided one by one to improve knowledge of the objective function
● Next best parameter to test is determined from the acquisition function
● Using the python implementation from https://scikit-optimize.github.io
https://tinyurl.com/yc2phuaj
Evolutionary Algorithm
● Chromosomes are represented by the hyper-parameters
● Initial population taken at random in the parameter space
● Population is stepped through generations
● Select the 20% fittest solutions
● Parents of offspring selected by binary tournament based on fitness function
● Crossover and mutate to breed offspring
● Alternative to bayesian opt. Indications that it works better for large number of parameters and non-smooth objective function
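The steps listed above can be sketched in a few lines of standard-library Python. This is a toy illustration, not the mpi-opt engine: the search space, the `figure_of_merit` function (a stand-in for a validation loss, lower is better), the mutation scale and the choice to carry the 20% fittest individuals over unchanged while drawing parents from the whole population are all assumptions.

```python
import random

# Hypothetical search space: (low, high) bounds for each hyper-parameter.
SPACE = {"log10_lr": (-5.0, -1.0), "n_layers": (1.0, 8.0)}

def figure_of_merit(params):
    """Toy stand-in for a validation loss, minimal at lr = 1e-3 and 4 layers."""
    return (params["log10_lr"] + 3.0) ** 2 + 0.1 * (params["n_layers"] - 4.0) ** 2

def random_individual(rng):
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in SPACE.items()}

def tournament(population, rng):
    """Binary tournament: draw two individuals and keep the fitter one."""
    a, b = rng.sample(population, 2)
    return a if figure_of_merit(a) < figure_of_merit(b) else b

def breed(mother, father, rng, mutation_rate=0.2):
    child = {}
    for name, (lo, hi) in SPACE.items():
        value = rng.choice((mother[name], father[name]))   # uniform crossover
        if rng.random() < mutation_rate:                    # gaussian mutation
            value += rng.gauss(0.0, 0.1 * (hi - lo))
        child[name] = min(hi, max(lo, value))               # stay inside the space
    return child

def evolve(pop_size=20, generations=15, elite_fraction=0.2, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=figure_of_merit)
        elite = population[: max(2, int(elite_fraction * pop_size))]
        children = [
            breed(tournament(population, rng), tournament(population, rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return min(population, key=figure_of_merit)

best = evolve()
```

In a distributed setting each call to the figure of merit would be a full training run on one parameter-set group, so all individuals of a generation can be evaluated in parallel.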
K-Folding Cross Validation
● Estimate the performance of multiple model trainings over different validation parts of the training dataset
● Allows taking into account variance from multiple sources (choice of validation set, choice of random initialization, ...)
● Crucial when comparing model performance
● Training on folds can proceed in parallel
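A minimal sketch of the fold bookkeeping, assuming contiguous folds over sample indices: each fold serves once as the validation part while the rest is used for training, and the per-fold scores are combined into a mean and a variance. The function names and the `train_and_score` callback are hypothetical; in a distributed run each fold would be handled by its own group of nodes.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (training_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

def cross_validate(train_and_score, n_samples, n_folds):
    """Return mean and variance of the figure of merit over the folds."""
    scores = [train_and_score(training, validation)
              for training, validation in k_fold_indices(n_samples, n_folds)]
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, variance

folds = list(k_fold_indices(n_samples=10, n_folds=3))
```

The mean over folds is the single figure of merit that can be fed back to the hyper-parameter optimization, and the variance indicates how much a comparison between two models can be trusted.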
[Diagram: H-opt master (rank 0) driving parameter-set groups 0, 1 ... NG; within group 0, folds F0, F1 ... FNF each have a training master TM0 and workers TW1, TW2 ... TWNW]
● One master running the optimization. Receiving the average figure of merit over NF folds of the data
➢ NG groups of nodes, each training on one parameter-set, simultaneously
➢ NF groups of nodes running one fold each
K-folding Layout
Summary and Outlook
$N_{nodes} = 1 + N_G \times N_F \times (N_M \times N_W \times N_{GPU})$
$N_G$ : # of concurrent hyper-parameter sets tested
$N_F$ : # of folds
$N_M$ : # of masters
$N_W$ : # of workers per master
$N_{GPU}$ : # of nodes per worker (1 node = 1 GPU)
Speed up and optimize models using thousand(s) of GPUs
Putting all Features Together
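As a quick worked example of the node count given on this slide, the helper below evaluates the formula as stated. The function name and the example values are illustrative assumptions; only the formula itself comes from the slide.

```python
def n_nodes(n_g, n_f, n_m, n_w, n_gpu):
    """Total node count: 1 H-opt master plus NG x NF x (NM x NW x NGPU)."""
    return 1 + n_g * n_f * (n_m * n_w * n_gpu)

# Hypothetical configuration: 10 hyper-parameter sets, 5 folds,
# 2 masters, 5 workers per master, 2 nodes (GPUs) per worker.
total = n_nodes(n_g=10, n_f=5, n_m=2, n_w=5, n_gpu=2)
```

Here `total` is 1001, which shows how modest per-level counts multiply up to about a thousand GPUs.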
● Training large/complex models over large datasets can take up to several days
● Several ways to distribute the training, with their own advantages and drawbacks
● Scaling with promising efficiencies and model performance
● Several solutions proposed for training generative adversarial networks, with mixed stabilities
● Turn-key dockerized mpi-opt solution for ANN & GAN distributed training and optimization with keras and pytorch
● More development to come on mpi-opt
➢ New gradients managing algorithms
➢ Address bottlenecks
Acknowledgements
● Part of this work was conducted on TACC under an allocation thanks to the Intel IPCC program.
● Part of this work was conducted on Titan at OLCF under the allocation csc291 (2018).
● Part of this work was conducted on Piz Daint at CSCS under the allocations d59 (2016) and cn01 (2018).
● Part of this work was conducted at "iBanks", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "iBanks".
● Part of the team is funded by ERC H2020 grant number 772369
Extra Slides
[Diagram: H-opt master (rank 0) driving parameter-set groups 0, 1 ... NG; group 0 holds a training master (subrank 0) and sub-masters TM1, TM2 ... TMNM, each with workers TW0 ... TWNW]
● One master running the bayesian optimization
● NG groups of nodes, each training on one parameter-set, simultaneously
● One training master
● NM training sub-masters
● NW training workers
Sub-Master Layout
mpi-opt
[Diagram: H-opt master (rank 0) driving parameter-set groups 0, 1 ... NG; group 0 holds a training master (subrank 0) and worker groups TW1, TW2 ... TWNW, each spread over GPU1, GPU2 ... GPUNGPU]
● One master running the bayesian optimization
● NG groups of nodes, each training on one parameter-set, simultaneously
● One training master
● NW training worker groups
● NGPU devices used for each worker group (either nodes or GPUs)
all-reduce Layout
mpi-opt
[Diagram: communication master (rank 0) and skopt workers serving parameter-set groups 0, 1 ... NG; group 0 holds a training master (subrank 0) and processes at subranks 1, 2 ... NW]
● One master running communication of parameter sets
● NSK workers running the bayesian optimization
● NG groups of nodes, each training on one parameter-set, simultaneously
● One training master
● NW training workers
mpi-skopt Setup
[Diagram labels: H-opt worker 1 ... H-opt worker NSK]
mpi-opt