
Large Scale Training and Optimization of Neural Networks and Generative Adversarial Networks over Distributed Resource

CLIC GAN : S. Vallecorsa, G. Khattak, F. Carminati, M. Pierini
SurfSara : V. Codreanu, D. Podareanu
Cray : D. Moise
Intel : H. Pabst, V. Saletore
mpi-opt : V. Loncar, F. Pantaleo, T. Nguyen, M. Pierini, J-R. Vlimant, A. Zlokapa

07/12/18, Deep Learning Training & Optimization, J-R Vlimant, CHEP18

Outline

● Grounding : Deep Learning in HEP
● Generative adversarial networks
● Training workload parallelization
● Hyper-parameters optimization
● Summary and Future work

Based on software (mpi-opt) we are developing, with references to other studies performed by collaborators

Some performance plots and studies are stalled due to limited access to resources (allocation pending, job priority, …)



Deep Learning in High Energy Physics



Artificial Neural Network

See all developing applications in David's talk: https://indico.cern.ch/event/587955/contributions/3012266/

http://www.asimovinstitute.org/neural-network-zoo

● Large number of parameters
● Efficiently adjusted with stochastic gradient descent
● The more parameters, the more data required
● Training to convergence can take minutes to several days, ...



Generative Adversarial Network



A Forger's Game

See many other contributions this week
● https://indico.cern.ch/event/587955/contributions/2937509/ Steve F.
● https://indico.cern.ch/event/587955/contributions/2937595/ Sofia V.
● https://indico.cern.ch/event/587955/contributions/2937612/ Viktoriaa C.
● https://indico.cern.ch/event/587955/contributions/2937515/ Tomasz P.

● Constructed from two artificial neural networks
● Two concurrent gradient descent procedures
● Training to convergence can take minutes to several days, ...







GAN for CLIC

Generative adversarial network architecture for CLIC 3D dataset

“Images” are 3D : energy deposition in a highly granular calorimeter
http://cds.cern.ch/record/2254048



Training Artificial Neural Networks

● The ANN and its associated loss function have a fully analytical formulation and are differentiable with respect to the model parameters
● Gradient evaluated over a batch of data
➢ Too small : very noisy and scattering
➢ Too large : information dilution and slow convergence
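The batch-size trade-off above can be illustrated with a minimal sketch of mini-batch stochastic gradient descent. Everything in it is an assumption made for illustration and is not taken from mpi-opt: a one-parameter linear model, a squared-error loss, synthetic data, and the names `make_data`, `batch_gradient` and `train`.

```python
import random

def make_data(n=256, true_w=3.0, noise=0.1, seed=0):
    """Synthetic dataset y = true_w * x + gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_w * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def batch_gradient(w, xs, ys):
    """Gradient w.r.t. w of the mean of 0.5 * (w*x - y)^2 over one batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, batch_size, lr=0.5, epochs=20, seed=1):
    """Plain mini-batch SGD: one parameter update per batch."""
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            g = batch_gradient(w, [xs[i] for i in batch], [ys[i] for i in batch])
            w -= lr * g
    return w

xs, ys = make_data()
w_small = train(xs, ys, batch_size=1)    # many noisy updates per epoch
w_mid = train(xs, ys, batch_size=32)     # fewer, smoother updates per epoch
```

With `batch_size=1` each update follows a single sample and the parameter jitters around the solution. With a large batch each update is smoother, but there are fewer of them per pass over the data.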



Distributed Training



Parallelism Overview

➔ Data distribution
Compute the gradients on several batches independently and update the model synchronously or not. Applicable to large datasets

➔ Gradient distribution
Compute the gradient of one batch in parallel and update the model with the aggregated gradient. Applicable to large samples ≡ large events

➔ Model distribution
Compute the gradients and updates of parts of the model separately, in a chain. Applicable to large models



Data Distribution

https://arxiv.org/abs/1712.05878

● Master node operates as parameter server
● Worker nodes compute gradients
● Master handles gradients to update the central model

➔ Downpour SGD https://tinyurl.com/ycfpwec5
➔ Elastic averaging SGD https://arxiv.org/abs/1412.6651
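As a rough illustration of the master/worker scheme, the sketch below simulates the synchronous form of elastic averaging SGD in a single process. It is a toy under stated assumptions, not the mpi-opt implementation: the loss is a one-dimensional quadratic, the gradient noise is artificial, and the function name `easgd` and all its parameters are hypothetical.

```python
import random

def easgd(n_workers=4, steps=200, lr=0.1, rho=0.5, target=2.0, seed=0):
    """Synchronous elastic averaging SGD on the toy loss 0.5 * (x - target)^2."""
    rng = random.Random(seed)
    center = 0.0                                   # central model held by the master
    workers = [rng.uniform(-1.0, 1.0) for _ in range(n_workers)]
    alpha = lr * rho                               # strength of the elastic coupling
    for _ in range(steps):
        # Each worker evaluates a noisy gradient at its own copy of the model.
        grads = [(x - target) + rng.gauss(0.0, 0.1) for x in workers]
        diffs = [x - center for x in workers]
        # Workers follow their gradient and are pulled toward the center.
        workers = [x - lr * g - alpha * d for x, g, d in zip(workers, grads, diffs)]
        # The center moves toward the workers (using the pre-update differences).
        center += alpha * sum(diffs)
    return center, workers

center, workers = easgd()
```

The elastic term lets workers explore away from the central model while keeping them loosely tied to it. In a real asynchronous setup the exchange with the master would happen only every few local steps.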




Performance on ANN

● Speed up in training recurrent neural networks on the Piz Daint CSCS supercomputer
➔ Linear speed up with up to ~20 nodes. Bottlenecks to be identified
➔ Needs to compensate for staleness of gradients
● Similar scaling on servers with 8 GPUs
➔ x7 speed up with students' work

https://github.com/duanders/mpi_learn



Performance on GAN

● Speed up in training generative adversarial networks on Piz Daint CSCS and Titan ORNL supercomputers
➔ Using easgd algorithm with rmsprop
➔ Speed up is not fully efficient. Bottlenecks to be identified

[Plots: NVIDIA K20 on Titan, ORNL; NVIDIA P100 on Piz Daint, CSCS]



Sub-master Layout

[Diagram: parameter-set group 0: training master (group 0, subrank 0); sub-masters TM1, TM2, …, TMNM; training workers TW0 … TWNW under each sub-master]

● Putting workers in several groups
● Aim at spreading communication to the main master
● Need to strike a balance between staleness and update frequency



Sofia V. @ https://sites.google.com/nvidia.com/ai-hpc (not mpi-opt)



Gradient Distribution



“all-reduce” Layout

[Diagram: parameter-set group 0: each training worker TW1, TW2, …, TWNW spans GPU1, GPU2, …, GPUNGPU]

● A logical worker is spawned over multiple MPI processes
● Communicator passed to horovod https://github.com/uber/horovod
● Private horovod branch to allow for group initialization/reset
● Nvidia NCCL enabled for fast GPU-GPU communication
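The aggregation step behind this layout can be sketched as a ring all-reduce, simulated here in one process with plain lists. This is an illustrative assumption about how the reduction may be carried out. It does not call horovod or NCCL, and `ring_allreduce` is a hypothetical helper name.

```python
def ring_allreduce(buffers):
    """Sum equal-length buffers so that every rank ends with the full sum.

    buffers[r] is the gradient held by rank r; its length must be divisible
    by the number of ranks. Input buffers are left untouched.
    """
    n = len(buffers)
    bufs = [list(b) for b in buffers]
    size = len(bufs[0])
    assert size % n == 0, "buffer length must be divisible by the number of ranks"
    chunk = size // n

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, scatter-reduce: after n-1 steps rank r owns the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        msgs = [((r - step) % n, None) for r in range(n)]
        msgs = [(c, [bufs[r][k] for k in span(c)]) for r, (c, _) in enumerate(msgs)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n
            for k, v in zip(span(c), payload):
                bufs[dst][k] += v

    # Phase 2, all-gather: completed chunks are passed around the ring.
    for step in range(n - 1):
        msgs = [((r + 1 - step) % n, None) for r in range(n)]
        msgs = [(c, [bufs[r][k] for k in span(c)]) for r, (c, _) in enumerate(msgs)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n
            for k, v in zip(span(c), payload):
                bufs[dst][k] = v
    return bufs

grads = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60], [100, 200, 300, 400, 500, 600]]
summed = ring_allreduce(grads)
averaged = [[v / len(grads) for v in b] for b in summed]  # every rank applies the same mean gradient
```

Each rank only talks to its neighbour, and the amount of data sent per rank stays roughly constant as ranks are added, which is why this pattern suits a logical worker spread over several GPUs.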



Sofia V. @ https://sites.google.com/nvidia.com/ai-hpc (not mpi-opt)



Model Distribution



Intra-Node Model Parallelism

[Diagram: GPU1, GPU2]

● Perform the forward and backward pass of sets of layers on different devices
● Requires good device-to-device communication
● Utilizes the native tensorflow multi-device manager
● Aiming for machines with a multi-GPU per node topology (Summit)

See T. Kurth et al. @ https://pasc18.pasc-conference.org for node-to-node model parallelism considerations



Hyper-Parameters Optimization



Hyper-Parameters

● Various parameters of the model cannot be learned by gradient descent
➢ Learning rate, batch size, number of layers, size of kernels, …
● Tuning to the right architecture is an “art”. One can easily spend a lot of time scanning many directions
● Full parameter scan is resource/time consuming

➔ Hence looking for a way to reach the optimum hyper-parameter set for a provided figure of merit (the loss by default, but any other FOM can work)
➔ Two optimization engines integrated
➢ Bayesian optimization with Gaussian process prior
➢ Evolutionary algorithm



Basic Layout

[Diagram: H-opt master (rank 0) driving parameter-set groups 0, 1, …, NG; within group 0, training workers at subranks 1, 2, …, NW]

● One master process drives the hyper-parameter optimization
● NG groups of nodes, each training on a parameter-set simultaneously
● One training master
● NW training workers



Bayesian Optimization

● Objective function is approximated as a multivariate Gaussian
● Measurements provided one by one to improve knowledge of the objective function
● Next best parameter to test is determined from the acquisition function
● Using the python implementation from https://scikit-optimize.github.io

https://tinyurl.com/yc2phuaj
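The loop described above (fit a surrogate, maximize an acquisition function, measure, repeat) can be sketched in a few lines. The version below is a hand-rolled, one-dimensional stand-in written for illustration. It does not use scikit-optimize, and the kernel, its length scale, the expected-improvement acquisition, the candidate grid and all function names are assumptions.

```python
import math

def rbf(a, b, length=0.3):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(xs, ys, x_new, jitter=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 1e-12))

def expected_improvement(mean, std, best):
    """Expected improvement over the best value seen so far (minimization)."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf

def bayes_opt(objective, low, high, n_init=3, n_iter=10, n_grid=101):
    """Measure one point at a time, each chosen by maximizing the acquisition."""
    grid = [low + (high - low) * i / (n_grid - 1) for i in range(n_grid)]
    xs = [low + (high - low) * (i + 0.5) / n_init for i in range(n_init)]
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        best = min(ys)
        cands = [g for g in grid if all(abs(g - x) > 1e-9 for x in xs)]
        scores = [expected_improvement(*gp_posterior(xs, ys, g), best) for g in cands]
        x_next = cands[scores.index(max(scores))]
        xs.append(x_next)
        ys.append(objective(x_next))
    i = ys.index(min(ys))
    return xs[i], ys[i]

# Hypothetical figure of merit standing in for a validation loss.
best_x, best_y = bayes_opt(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
```

In practice the objective would be a full training run returning the figure of merit, which is why each new measurement is chosen carefully rather than scanned on a grid.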



Evolutionary Algorithm

● Chromosomes are represented by the hyper-parameters
● Initial population taken at random in the parameter space
● Population is stepped through generations
➢ Select the 20% fittest solutions
➢ Parents of offspring selected by binary tournament based on fitness function
➢ Crossover and mutate to breed offspring
● Alternative to Bayesian optimization. Indications that it works better for a large number of parameters and non-smooth objective functions
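The generation step listed above can be sketched as follows. The slide does not specify the crossover or mutation operators, nor whether the tournament draws from the whole population or only from the fittest 20%. The sketch assumes uniform crossover, gaussian mutation and a tournament among the retained 20%. The names `evolve` and `toy_loss` are hypothetical.

```python
import random

def evolve(loss, bounds, pop_size=20, generations=15,
           elite_frac=0.2, mut_sigma=0.1, seed=0):
    """Minimize `loss` over box-bounded real-valued hyper-parameters."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def tournament(pool):
        # Binary tournament: draw two candidates, keep the fitter one.
        a, b = rng.choice(pool), rng.choice(pool)
        return a if loss(a) <= loss(b) else b

    def breed(p1, p2):
        child = []
        for (lo, hi), g1, g2 in zip(bounds, p1, p2):
            gene = g1 if rng.random() < 0.5 else g2        # uniform crossover
            gene += rng.gauss(0.0, mut_sigma * (hi - lo))  # gaussian mutation
            child.append(min(hi, max(lo, gene)))           # stay inside the bounds
        return child

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=loss)
        n_elite = max(2, int(elite_frac * pop_size))
        elite = population[:n_elite]                       # the 20% fittest survive
        children = [breed(tournament(elite), tournament(elite))
                    for _ in range(pop_size - n_elite)]
        population = elite + children
    return min(population, key=loss)

def toy_loss(p):
    """Hypothetical figure of merit with its minimum at (0.3, -0.2)."""
    return (p[0] - 0.3) ** 2 + (p[1] + 0.2) ** 2

best = evolve(toy_loss, bounds=[(-1.0, 1.0), (-1.0, 1.0)])
```

Because every individual of a generation can be evaluated independently, the population maps naturally onto groups of nodes training in parallel.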



K-Folding Cross Validation

● Estimate the performance from multiple model trainings, each validated on a different part of the training dataset
● Takes into account variance from multiple sources (choice of validation set, choice of random initialization, ...)
● Crucial when comparing the performance of models
● Training on folds can proceed in parallel
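A minimal sketch of the fold bookkeeping is given below. The helper names `k_fold_indices` and `cross_validate` are hypothetical, and `train_and_score` stands for any user-supplied function that trains on one index set and returns a figure of merit on the other.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Return a list of (train_indices, validation_indices), one pair per fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        val = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        splits.append((train, val))
    return splits

def cross_validate(train_and_score, n_samples, n_folds):
    """Mean and spread of the figure of merit over the folds.

    The calls are independent of each other, so they could be dispatched
    to separate groups of nodes instead of being run in this loop.
    """
    scores = [train_and_score(tr, va) for tr, va in k_fold_indices(n_samples, n_folds)]
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return mean, std

splits = k_fold_indices(10, 5)
```

Reporting the spread alongside the mean is what makes a comparison between two hyper-parameter sets meaningful.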



K-folding Layout

[Diagram: H-opt master (rank 0) driving parameter-set groups 0, 1, …, NG; within group 0, one sub-group per fold F0, F1, …, FNF, each with a training master TM0 and training workers TW1, TW2, …, TWNW]

● One master running the optimization, receiving the average figure of merit over NF folds of the data
➢ NG groups of nodes, each training on a parameter-set simultaneously
➢ NF groups of nodes running one fold each



Summary and Outlook

Putting all Features Together

$N_{\mathrm{nodes}} = 1 + N_G \times N_F \times (N_M \times N_W \times N_{\mathrm{GPU}})$

$N_G$ : # of concurrent hyper-parameter sets tested
$N_F$ : # of folds
$N_M$ : # of masters
$N_W$ : # of workers per master
$N_{\mathrm{GPU}}$ : # of nodes per worker (1 node = 1 GPU)

Speed up and optimize models using thousand(s) of GPUs
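As a quick arithmetic check of the node count above, the formula can be evaluated directly. The function name and the example values are hypothetical and chosen only to show the order of magnitude.

```python
def total_nodes(n_groups, n_folds, n_masters, n_workers, n_gpu):
    """Nnodes = 1 + NG x NF x (NM x NW x NGPU), as written on the slide."""
    return 1 + n_groups * n_folds * (n_masters * n_workers * n_gpu)

# Illustrative configuration: 4 hyper-parameter sets, 5 folds,
# 2 masters, 8 workers per master, 3 nodes per worker.
example = total_nodes(n_groups=4, n_folds=5, n_masters=2, n_workers=8, n_gpu=3)
```

Here the example evaluates to 1 + 4 x 5 x (2 x 8 x 3) = 961 nodes, so even modest settings for each feature reach the thousand-GPU scale once combined.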



● Training large/complex models over large datasets can take up to several days
● Several ways to distribute the training, each with its own advantages and drawbacks
● Scaling with promising efficiencies and model performance
● Several solutions proposed for training generative adversarial networks, with mixed stabilities
● Turn-key dockerized mpi-opt solution for ANN & GAN distributed training and optimization with keras and pytorch
● More development to come on mpi-opt
➢ New gradient-managing algorithms
➢ Address bottlenecks



Acknowledgements

● Part of this work was conducted on TACC under an allocation thanks to the Intel IPCC program.
● Part of this work was conducted on Titan at OLCF under the allocation csc291 (2018).
● Part of this work was conducted on Piz Daint at CSCS under the allocations d59 (2016) and cn01 (2018).
● Part of this work was conducted at "iBanks", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "iBanks".
● Part of the team is funded by ERC H2020 grant number 772369



Extra Slides



Sub-Master Layout (mpi-opt)

[Diagram: H-opt master (rank 0) driving parameter-set groups 0, 1, …, NG; within a group, sub-masters TM1, TM2, …, TMNM, each with training workers TW0 … TWNW]

● One master running the Bayesian optimization
● NM training sub-masters
● NW training workers



all-reduce Layout (mpi-opt)

[Diagram: H-opt master (rank 0) driving parameter-set groups 0, 1, …, NG; within a group, each training worker TW1, TW2, …, TWNW spans GPU1, GPU2, …, GPUNGPU]

● One master running the Bayesian optimization
● NW training worker groups
● NGPU used for each worker group (either nodes or GPUs)



mpi-skopt Setup (mpi-opt)

[Diagram: communication master (rank 0) with H-opt (skopt) workers 1, 2, …, NSK, driving parameter-set groups 0, 1, …, NG; within group 0, a training master and training workers at subranks 1, 2, …, NW]

● One master running communication of the parameter set
● NSK workers running the Bayesian optimization
● NG groups of nodes, each training on a parameter-set simultaneously
● NW training workers
