Page 1:

Large Scale Training and Optimization of Neural Networks and Generative Adversarial Networks over Distributed Resources

CLIC GAN: S. Vallecorsa, G. Khattak, F. Carminati, M. Pierini
SurfSara: V. Codreanu, D. Podareanu
Cray: D. Moise
Intel: H. Pabst, V. Saletore
mpi-opt: V. Loncar, F. Pantaleo, T. Nguyen, M. Pierini, J-R. Vlimant, A. Zlokapa

Page 2:

Outline

Grounding: Deep Learning in HEP
Generative adversarial networks
Training workload parallelization
Hyper-parameter optimization
Summary and future work

Based on software (mpi-opt) we are developing, with references to other studies performed by collaborators.

Some performance plots and studies are stalled due to limited access to resources (allocation pending, job priority, …).

Page 3:

Deep Learning in High Energy Physics

Page 4:

Artificial Neural Network

See all developing applications in David's talk https://indico.cern.ch/event/587955/contributions/3012266/

http://www.asimovinstitute.org/neural-network-zoo

● Large number of parameters
● Efficiently adjusted with stochastic gradient descent
● The more parameters, the more data required
● Training to convergence can take minutes to several days, ...

Page 5:

Generative Adversarial Network

Page 6:

A Forger's Game

See many other contributions this week:
https://indico.cern.ch/event/587955/contributions/2937509/ Steve F.
https://indico.cern.ch/event/587955/contributions/2937595/ Sofia V.
https://indico.cern.ch/event/587955/contributions/2937612/ Viktoriaa C.
https://indico.cern.ch/event/587955/contributions/2937515/ Tomasz P.

● Constructed from two artificial neural networks
● Two concurrent gradient descent procedures
● Training to convergence can take minutes to several days, ...

Page 7:

GAN for CLIC

Generative adversarial network architecture for the CLIC 3D dataset

“Images” are 3D: energy deposition in a highly granular calorimeter
http://cds.cern.ch/record/2254048

Page 8:

Training Artificial Neural Networks

● ANN and associated loss function have a fully analytical formulation and are differentiable with respect to the model parameters

● Gradient evaluated over a batch of data (see the sketch below)
➢ Too small: very noisy, scattered updates
➢ Too large: information dilution and slow convergence
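As an illustration of the update being described (not part of the original slides), a minimal numpy sketch of one epoch of mini-batch gradient descent; the linear model, mean-squared-error loss, batch size and learning rate are placeholders standing in for a real network and its analytical gradient.

```python
import numpy as np

def sgd_epoch(w, X, y, batch_size=64, lr=1e-3):
    """One epoch of mini-batch gradient descent on a toy linear model."""
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        # Analytical gradient of the mean-squared-error loss w.r.t. w
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(batch)
        w = w - lr * grad  # one stochastic gradient descent step
    return w
```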

Page 9:

Distributed Training

Page 10:

Parallelism Overview

➔ Data distribution: compute the gradients on several batches independently and update the model, synchronously or not. Applicable to large datasets. (A minimal synchronous sketch follows below.)

➔ Gradient distribution: compute the gradient of one batch in parallel and update the model with the aggregated gradient. Applicable to large samples ≡ large events.

➔ Model distribution: compute the gradients and updates of parts of the model separately, in a chain. Applicable to large models.
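To make the synchronous data-distribution case concrete, a hedged mpi4py sketch (not from the slides): every rank computes a gradient on its own batch and the gradients are averaged with an all-reduce; `local_grad` and the learning rate are placeholders.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def synchronous_data_parallel_step(w, local_grad, lr=1e-3):
    """Average the gradients computed on each rank's own batch, then update.

    The model weights `w` stay identical on every rank because all ranks
    apply the same averaged gradient."""
    avg_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)  # sum over all ranks
    avg_grad /= comm.Get_size()                       # turn the sum into a mean
    return w - lr * avg_grad
```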

Page 11:

Data Distribution

Page 12:

Data Distribution

https://arxiv.org/abs/1712.05878

● Master node operates as a parameter server
● Worker nodes compute gradients
● Master handles the gradients to update the central model

➔ Downpour SGD https://tinyurl.com/ycfpwec5
➔ Elastic averaging SGD https://arxiv.org/abs/1412.6651 (update rule sketched below)
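For reference (an illustration, not mpi-opt's code), a minimal numpy-style sketch of the elastic-averaging SGD update rule from arXiv:1412.6651: each worker is pulled toward the central model while the central model drifts toward the workers. The learning rate and elastic coefficient are placeholder values.

```python
def easgd_worker_step(x_i, grad_i, x_center, lr=1e-3, alpha=0.01):
    """Worker-side update: plain SGD plus an elastic pull toward the center."""
    return x_i - lr * grad_i - alpha * (x_i - x_center)

def easgd_center_step(x_center, worker_models, alpha=0.01):
    """Master-side update: the central model moves toward each worker's model."""
    for x_i in worker_models:
        x_center = x_center + alpha * (x_i - x_center)
    return x_center
```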

Page 13:

Performance on ANN

● Speed-up in training recurrent neural networks on the Piz Daint CSCS supercomputer

➔ Linear speed-up with up to ~20 nodes. Bottlenecks to be identified
➔ Needs to compensate for the staleness of gradients

● Similar scaling on servers with 8 GPUs
➔ 7× speed-up with students' work

https://github.com/duanders/mpi_learn

Page 14:

Performance on GAN

● Speed-up in training generative adversarial networks on the Piz Daint CSCS and Titan ORNL supercomputers

➔ Using the EASGD algorithm with RMSprop
➔ Speed-up is not fully efficient. Bottlenecks to be identified

Plots: NVIDIA K20 at Titan, ORNL; NVIDIA P100 on Piz Daint, CSCS

Page 15:

[Diagram, sub-master layout: parameter-set group 0 with a training master (group 0, subrank 0), sub-masters TM1…TMNM, each with workers TW0…TWNW]

● Putting workers in several groups
● Aims at spreading out the communication to the main master
● Need to strike a balance between staleness and update frequency

Sub-master Layout

Page 16:

Sofia V. @ https://sites.google.com/nvidia.com/ai-hpc

Not mpi-opt

Page 17:

Gradient Distribution

Page 18:

[Diagram, “all-reduce” layout: parameter-set group 0 with a training master (group 0, subrank 0) and logical workers TW1…TWNW, each spanning GPU1…GPUNGPU]

● A logical worker is spawned over multiple MPI processes
● Communicator passed to Horovod https://github.com/uber/horovod (see the sketch below)
● Private Horovod branch to allow for group initialization/reset
● Nvidia NCCL enabled for fast GPU-GPU communication

“all-reduce” Layout
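A rough illustration of this layout, as a hedged sketch assuming standalone Keras and a stock Horovod install; mpi-opt's private branch with group initialization/reset and its communicator handling are not reproduced here. A logical worker trains with ring all-reduce as follows:

```python
import numpy as np
import keras
import horovod.keras as hvd

hvd.init()  # in mpi-opt, a per-group MPI communicator would be used instead

# Toy data standing in for the real training set
x_train = np.random.rand(1024, 20).astype('float32')
y_train = np.random.rand(1024, 1).astype('float32')

model = keras.models.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    keras.layers.Dense(1),
])

# Wrap the optimizer so gradients are averaged across processes with
# ring all-reduce (NCCL handles GPU-GPU transfers when available).
opt = hvd.DistributedOptimizer(keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(loss='mse', optimizer=opt)

# Broadcast rank 0's initial weights so every replica starts identical.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x_train, y_train, batch_size=128, epochs=2,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```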

Page 19:

Sofia V. @ https://sites.google.com/nvidia.com/ai-hpc

Not mpi-opt

Page 20:

Model Distribution

Page 21:

Intra-Node Model Parallelism

See T. Kurth et al. @ https://pasc18.pasc-conference.org for node-to-node model parallelism considerations

[Diagram: layers split between GPU1 and GPU2]

● Perform the forward and backward pass of sets of layers on different devices (see the sketch below)
● Requires good device-to-device communication
● Utilizes the native TensorFlow multi-device manager
● Aiming for machines with a multi-GPU-per-node topology (Summit)
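A minimal sketch of the idea using TensorFlow 1.x device placement (an illustration of the technique, not mpi-opt's actual graph): the first block of layers is pinned to GPU 0, the second to GPU 1, and TensorFlow inserts the device-to-device copies between them.

```python
import tensorflow as tf  # TensorFlow 1.x style API

x = tf.placeholder(tf.float32, shape=(None, 20))

with tf.device('/gpu:0'):
    # First set of layers: forward/backward passes run on GPU 0
    h = tf.layers.dense(x, 256, activation=tf.nn.relu)

with tf.device('/gpu:1'):
    # Second set of layers: activations are copied over and processed on GPU 1
    y = tf.layers.dense(h, 1)
```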

Page 22:

Hyper-Parameter Optimization

Page 23:

Hyper-Parameters

● Various parameters of the model cannot be learned by gradient descent
➢ Learning rate, batch size, number of layers, size of kernels, …

● Tuning the right architecture is an “art”. One can easily spend a lot of time scanning many directions

● A full parameter scan is resource/time consuming

➔ Hence looking for a way to reach the optimum hyper-parameter set for a provided figure of merit (the loss by default, but any other fom can work)

➔ Two optimization engines integrated
➢ Bayesian optimization with a Gaussian-process prior
➢ Evolutionary algorithm

Page 24:

[Diagram, basic layout: H-opt master (rank 0) driving parameter-set groups 0…NG; within group 0, a training master (subrank 0) and training workers (subranks 1…NW)]

● One master process drives the hyper-parameter optimization
● NG groups of nodes, each training on a parameter set simultaneously
● One training master
● NW training workers

Basic Layout

Page 25:

Bayesian Optimization

● Objective function is approximated as a multivariate Gaussian

● Measurements are provided one by one to improve the knowledge of the objective function

● The next best parameter to test is determined from the acquisition function

● Using the Python implementation from https://scikit-optimize.github.io (see the ask/tell sketch below)

https://tinyurl.com/yc2phuaj
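A minimal ask/tell sketch with scikit-optimize (illustrative only; the search dimensions and the toy figure of merit below are placeholders for a real training run's hyper-parameters and returned loss):

```python
from skopt import Optimizer
from skopt.space import Real, Integer

opt = Optimizer(
    dimensions=[Real(1e-5, 1e-1, prior='log-uniform', name='learning_rate'),
                Integer(16, 512, name='batch_size')],
    base_estimator='GP',   # Gaussian-process surrogate of the objective
    acq_func='EI',         # expected-improvement acquisition function
)

for _ in range(20):
    lr, batch_size = opt.ask()            # next point from the acquisition function
    # In mpi-opt this would be the figure of merit returned by a training group;
    # here a smooth toy function stands in for it.
    fom = (lr - 1e-3) ** 2 + 1e-6 * (batch_size - 128) ** 2
    res = opt.tell([lr, batch_size], fom)  # feed the measurement back

print('best parameters:', res.x, 'figure of merit:', res.fun)
```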

Page 26:

Evolutionary Algorithm

● Chromosomes are represented by the hyper-parameters
● Initial population taken at random in the parameter space
● Population is stepped through generations
● Select the 20% fittest solutions
● Parents of offspring selected by binary tournament based on the fitness function
● Crossover and mutate to breed offspring

● Alternative to Bayesian optimization. Indications that it works better for a large number of parameters and non-smooth objective functions (see the sketch below)
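A compact sketch of this loop (illustrative; mpi-opt's actual operators, encodings and selection details may differ), with chromosomes represented as lists of at least two real-valued hyper-parameters:

```python
import random

def evolve(population, fitness, n_generations=20, keep_frac=0.2, mut_sigma=0.1):
    """Truncation selection, binary-tournament parents, one-point crossover, Gaussian mutation."""
    for _ in range(n_generations):
        ranked = sorted(population, key=fitness)                  # lower figure of merit = fitter
        survivors = ranked[:max(2, int(keep_frac * len(ranked)))]
        offspring = []
        while len(survivors) + len(offspring) < len(population):
            p1 = min(random.sample(survivors, 2), key=fitness)    # binary tournament
            p2 = min(random.sample(survivors, 2), key=fitness)
            cut = random.randrange(1, len(p1))                    # one-point crossover
            child = [g + random.gauss(0.0, mut_sigma)             # Gaussian mutation
                     for g in p1[:cut] + p2[cut:]]
            offspring.append(child)
        population = survivors + offspring
    return min(population, key=fitness)
```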

Page 27:

K-Folding Cross Validation

● Estimate the performance of multiple model trainings over different validation parts of the training dataset

● Allows taking into account the variance from multiple sources (choice of validation set, choice of random initialization, ...)

● Crucial when comparing model performance
● Training on folds can proceed in parallel (sketch below)
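For illustration (not mpi-opt's code), a short scikit-learn sketch of k-fold evaluation; `train_model` and `evaluate` are placeholder callables standing in for one fold's training job and its figure of merit.

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_figure_of_merit(X, y, train_model, evaluate, n_folds=5, seed=0):
    """Train one model per fold and report mean and spread of the figure of merit."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    foms = []
    for train_idx, val_idx in kf.split(X):
        model = train_model(X[train_idx], y[train_idx])
        foms.append(evaluate(model, X[val_idx], y[val_idx]))
    # The spread over folds captures the variance from the choice of validation set
    return float(np.mean(foms)), float(np.std(foms))
```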

Page 28:

[Diagram, k-folding layout: H-opt master (rank 0) driving parameter-set groups 0…NG; within group 0, folds F0…FNF, each with a training master TM0 and workers TW1…TWNW]

● One master running the optimization, receiving the average figure of merit over NF folds of the data
➢ NG groups of nodes training on a parameter set simultaneously
➢ NF groups of nodes running one fold each

K-folding Layout

Page 29:

Summary and Outlook

Page 30:

Nnodes = 1 + NG × NF × (NM × NW × NGPU)

NG : # of concurrent hyper-parameter sets tested
NF : # of folds
NM : # of masters
NW : # of workers per master
NGPU : # of nodes per worker (1 node = 1 GPU)
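As an illustrative worked example (numbers chosen here for illustration, not from the slides): testing NG = 4 hyper-parameter sets with NF = 3 folds each, where every fold uses NM = 1 master, NW = 8 workers and NGPU = 1 node per worker, requires Nnodes = 1 + 4 × 3 × (1 × 8 × 1) = 97 nodes.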

Speed up and optimize models using thousand(s) of GPUs

Putting all Features Together

Page 31:

● Training large/complex models over large datasets can take up to several days

● Several ways to distribute the training, each with its own advantages and drawbacks

● Scaling with promising efficiencies and model performance

● Several solutions proposed for training generative adversarial networks, with mixed stabilities

● Turn-key dockerized mpi-opt solution for ANN & GAN distributed training and optimization with Keras and PyTorch

● More development to come on mpi-opt
➢ New gradient-management algorithms
➢ Address bottlenecks

Page 32:

Acknowledgements

● Part of this work was conducted on TACC under an allocation thanks to the Intel IPCC program.

● Part of this work was conducted on Titan at OLCF under the allocation csc291 (2018).

● Part of this work was conducted on Piz Daint at CSCS under the allocations d59 (2016) and cn01 (2018).

● Part of this work was conducted at "iBanks", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "iBanks".

● Part of the team is funded by ERC H2020 grant number 772369.

Page 33:

Extra Slides

Page 34:

[Diagram, sub-master layout: H-opt master (rank 0) driving parameter-set groups 0…NG; within group 0, a training master (subrank 0), sub-masters TM1…TMNM, each with workers TW0…TWNW]

● One master running the Bayesian optimization
● NG groups of nodes training on a parameter set simultaneously
● One training master
● NM training sub-masters
● NW training workers

Sub-Master Layout (mpi-opt)

Page 35:

[Diagram, all-reduce layout: H-opt master (rank 0) driving parameter-set groups 0…NG; within group 0, a training master (subrank 0) and worker groups TW1…TWNW, each spanning GPU1…GPUNGPU]

● One master running the Bayesian optimization
● NG groups of nodes training on a parameter set simultaneously
● One training master
● NW training worker groups
● NGPU used for each worker group (either nodes or GPUs)

all-reduce Layout (mpi-opt)

Page 36:

[Diagram, mpi-skopt setup: communication master (rank 0) with H-opt/skopt workers 1…NSK, driving parameter-set groups 0…NG, each with a training master (subrank 0) and training workers]

● One master running the communication of parameter sets
● NSK workers running the Bayesian optimization
● NG groups of nodes training on a parameter set simultaneously
● One training master
● NW training workers

mpi-skopt Setup


