Large Scale Parallelism

Carlo Cavazzoni, HPC department, CINECA


Parallel Architectures

Two basic architectural schemes:

Distributed Memory

Shared Memory

Now most computers have a mixed architecture

+ accelerators -> hybrid architectures


Distributed Memory

[Diagram: several nodes, each with its own CPU and local memory, connected by a network]


Shared Memory

[Diagram: several CPUs attached to a single shared memory]


Mixed Architectures

[Diagram: nodes, each with multiple CPUs sharing the node's memory, connected by a network]


Most Common Networks

Cube, hypercube, n-cube

Torus in 1, 2, ..., N dimensions

Switched networks, Fat Tree


Roadmap to Exascale (architectural trends)


Exascale architecture

[Diagram: an exascale node combining main memory, two multi-core CPUs and two GPUs, each GPU with its own memory; many such nodes (numbered 00-31 in the figure) linked by the interconnect]

CPU ~ 16 cores / 16 threads

Co-processor ~ 128 cores / 512 threads

GPU ~ 1024 cores / 10^4 threads


Dennard scaling law

Old VLSI generation: L' = L / 2, V' = V / 2, F' = 2 F, D' = 1 / L² = 4 D, P' = P

It does not hold anymore: the power crisis!

New VLSI generation: L' = L / 2, V' ≈ V, F' ≈ 2 F, D' = 1 / L² = 4 D, P' = 4 P

The core frequency and performance no longer grow following Moore's law. The number of cores is increased instead, to keep the evolution of architectures on Moore's law: the programming crisis!
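The factor of 4 in power follows from the usual dynamic-power relation (a textbook argument, not stated on the slide): P ∝ C V² F per transistor. Under Dennard scaling, C' = C/2, V' = V/2, F' = 2F gives P' = P/4 per transistor, so with 4x the transistor density the chip power stays constant. Once the voltage no longer scales (V' ≈ V), per-transistor power stays roughly constant, and 4x the density means roughly 4x the chip power.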


MPI inter-process communications: MPI on multi-core CPUs

[Diagram: an MPI_BCAST reaching every core of several nodes across the network]

With 1 MPI process per core, collective communications stress both the network and the OS.

Many MPI codes (QE included) are based on ALLTOALL: messages = processes * processes.

We need to exploit the hierarchy: re-design applications to mix message passing and multi-threading.
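A minimal sketch of that mix (generic MPI + OpenMP Fortran, not QE code): one MPI rank per node or socket, with OpenMP threads for the cores the rank owns.

   program hybrid_sketch
      ! one MPI rank per node/socket, OpenMP threads for the cores within it
      use mpi
      use omp_lib
      implicit none
      integer :: ierr, provided, rank, nranks, nthreads
      ! ask for a threading level that lets the master thread call MPI
      call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
      nthreads = omp_get_max_threads()   ! set via OMP_NUM_THREADS
      if (rank == 0) print *, 'MPI ranks:', nranks, ' threads per rank:', nthreads
      !$omp parallel
      ! ... threaded work on the data owned by this rank ...
      !$omp end parallel
      call MPI_Finalize(ierr)
   end program hybrid_sketch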


What about Applications?

In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).

The maximum speedup tends to 1 / (1 − P), where P is the parallel fraction.

To exploit 1,000,000 cores you need P = 0.999999, i.e. a serial fraction of 0.000001.
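Plugging these numbers into the full Amdahl expression, S(N) = 1 / ((1 − P) + P/N) (a standard formula, not shown on the slide): with P = 0.999999 and N = 10^6, S ≈ 1 / (10^-6 + 10^-6) ≈ 500,000, i.e. only about half of the ideal 10^6 speedup, even with a serial fraction of one part per million.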


What about QE?


CP flow chart

FORM NLRH: pseudopotential form factors

RHOOFR: charge density, ρ(r) = Σ_i |ψ_i(r)|²

VOFRHO / PRESS: potential, V(R, ψ) = V_r^DFT(R, ρ(r)) + V_G^DFT(R, ρ(G))

FORCE: forces on the electrons, Fψ

ORTHO: orthogonalize the wave functions ψ


PW flow chart


Main Algorithms in QE

§ 3D FFT
§ Linear Algebra
   – Matrix-Matrix multiplication
   – less Matrix-Vector and Vector-Vector
   – Eigenvalues and Eigenvectors computation
§ Space integrals
§ Point function evaluations


Programming Models in QE

Message Passing (MPI)
Shared Memory (OpenMP)
Languages and paradigms for hardware accelerators (CUDA kernels)
Hybrid: MPI + OpenMP + CUDA


OpenMP

!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do

call fw_scatter ( . . . )

!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel


Improve scalability with OpenMP

[Plot: speed-up relative to 64-core pure MPI, for 128, 256, 512 and 1024 cores; curves for pure MPI, MPI+OpenMP with 2 threads, and MPI+OpenMP with 4 threads]

We observe the same behaviour, but at a higher number of cores:
pure MPI saturates at 128 cores;
MPI + 2 OpenMP threads saturates at 256 cores;
MPI + 4 OpenMP threads saturates at 512 cores.

CP simulation of 256 water molecules.


When should I use OpenMP?

[Plot: speed-up vs number of cores for pure MPI and for MPI+OpenMP]


When should I use task groups?

[Plot: speed-up vs number of cores for -ntg 1, -ntg 2 and -ntg 4, with reference marks at nproc = nr3 and nproc = 2*nr3]


Task Groups

parallel 3D FFT:
do i = 1, n
   compute parallel 3D FFT( psi(i) )
end do

[Diagram: one psi(i) distributed across processors P0-P3]

The parallelization is limited by the number of planes in the 3D FFT (NX x NY x NZ grid): there is little gain in using more than NZ processors.


Task Groups II

The goal is to use more processors than NZ. The solution is to perform the FFTs not one by one but in groups of NG.

redistribute the n FFTs:
do i = 1, nb, ng
   compute ng parallel 3D FFTs (at the same time)
end do

We can then scale up to NZ x NG processors. This costs an additional ALLTOALL and extra memory (NG times the size of the 3D array), but the loop now has NG times fewer iterations (half as many in the NG = 2 case shown below).
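For example (illustrative numbers, not from the slide): with NZ = 128 z-planes, a single distributed 3D FFT cannot usefully employ more than about 128 tasks; with NG = 4 task groups, up to 128 x 4 = 512 tasks can take part, at the price of one extra ALLTOALL and roughly 4x the FFT buffer memory.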

[Diagram: eight processors P0-P7 split into two task groups, computing 2 3D FFTs in one shot]


Diagonalization: how to set -ndiag

[Plot: diagonalization time vs -ndiag = 1, 4, 9, ..., Nopt, ...]

Nopt depends on the number of electrons and on the communication performance.


Diagonalization/Orthogonalization Group

When increasing the number of cores, not all parts of the code scale with the same efficiency.

Hermitian matrices are square, and a square grid of processors gives the best performance (communication/computation balance).

In a run with 10 processors, the diagonalization group uses 4 of them (a 2x2 grid).

Matrices are block-distributed (Nb x Nb blocks) over the diagonalization group. In this case it is also possible to use mixed MPI+OpenMP parallelization through a multithreaded (SMP) linear algebra library.


QE parallelization hierarchy


Parallelization Strategy

§ 3D FFT -> ad hoc MPI & OpenMP driver
§ Linear Algebra -> ScaLAPACK + multithreaded BLAS
§ Space integrals -> MPI & OpenMP loop parallelization and reduction (see the sketch below)
§ Point function evaluations -> MPI & OpenMP loop parallelization
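A minimal sketch of the loop-plus-reduction pattern used for the space integrals (generic code with made-up names, not QE source): each MPI task reduces its local slice of the real-space grid with OpenMP, then the partial sums are combined with an MPI reduction.

   program space_integral_sketch
      use mpi
      implicit none
      integer, parameter :: nloc = 100000   ! grid points owned by this task (made up)
      real(8) :: f_loc(nloc), dv, s
      integer :: i, ierr, rank
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      f_loc = 1.0d0        ! dummy integrand on the local portion of the grid
      dv = 1.0d-3          ! volume element
      s = 0.0d0
      !$omp parallel do reduction(+:s)
      do i = 1, nloc
         s = s + f_loc(i)
      end do
      !$omp end parallel do
      s = s * dv
      ! combine the per-task partial integrals
      call MPI_Allreduce(MPI_IN_PLACE, s, 1, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
      if (rank == 0) print *, 'integral =', s
      call MPI_Finalize(ierr)
   end program space_integral_sketch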


How to deal with extreme parallelism

Example: a run on 10240 cores.

-nimage 10: each of the 10 replicas gets 1024 cores; everything is replicated, but the positions differ.

-nbgrp 4: each band group gets 256 cores; G-vectors are replicated, but the bands differ.

OMP_NUM_THREADS=4: within each band group, G-vectors and FFTs are distributed across 64 MPI tasks; each task manages 4 cores through shared-memory/OpenMP parallelism.

-ndiag 16: only 16 tasks take part in the diagonalization of the KS Hamiltonian (48 stay idle); for most of this computation, using all tasks would be overkill.

-ntg 2: the FFT computation is reorganized and distributed over two groups of 32 tasks each. This helps overcome the limit on parallelization set by the number of grid points in the z direction.


Parallelization over images: energy barrier evaluation. Loosely coupled; G-vectors are replicated across groups.

Parallelization over pools: k-point sampling of the Brillouin zone; electronic band structure.

Parallelization of the FFT: real- and reciprocal-space decomposition of the charge density and Hamiltonian; one FFT for each electronic state; 1x1xN processor grid.

Parallelization over bands and matrix diagonalization: at least one state for each electron; √N x √N processor grid. Tightly coupled; G-vectors are distributed within each group.


Bands parallelization scaling

[Plot: CNT10POR8, CP on BGQ; seconds/step (0-400) broken down by routine (calphi, dforce, rhoofr, updatc, ortho) for 4096-65536 virtual cores, 2048-32768 real cores, 1-16 band groups]


[Plot: CdSe 1214, CP on BGQ; seconds/step (0-500) broken down by routine (prefor, nlfl, nlfq, vofrho, calphi, dforce, rhoofr, updatc, ortho) for 8192, 32768 and 65536 virtual cores (4096, 16384, 32768 real cores; 4, 4, 16 band groups)]

CdSe 1214 - FERMI

virtual cores | real cores | MPI tasks | OpenMP threads | band groups | task groups | ortho procs | time/step (s)
8192          | 4096       | 1024      | 8              | 4           | 4           | 256         | 472
32768         | 16384      | 4096      | 8              | 4           | 4           | 1024        | 241
65536         | 32768      | 8192      | 8              | 16          | 4           | 512         | 148.835


export WORKDIR=`pwd`/.
export TASKS=16384
export PREFIX=cp
export TASK_PER_NODE=8
export THREADS=4
export NTG=4
export NDIAG=512
export NBG=16
export RUNID=4
export INPUT_FILE=$WORKDIR/$PREFIX.in
export OUTPUT_FILE=$WORKDIR/$PREFIX.${TASKS}t.${TASK_PER_NODE}tpn.${THREADS}omp.${NBG}bg.${NTG}tg.${NDIAG}d.${RUNID}.out

runjob --np $TASKS --ranks-per-node $TASK_PER_NODE --envs OMP_NUM_THREADS=$THREADS : \
   $WORKDIR/cp.x -nbgrp $NBG -ntask_groups $NTG -ndiag $NDIAG < $INPUT_FILE > $OUTPUT_FILE

Typical CP command line on massively parallel supercomputers
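Reading the knobs in this particular script (plain arithmetic on the values above): 16384 MPI tasks with 4 OpenMP threads each occupy 65536 hardware threads, 8 tasks per node; the 16 band groups have 1024 tasks each; within a band group the FFTs are split into 4 task groups; 512 tasks handle the diagonalization.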


&control
   title='Prace bench',
   calculation = 'cp',
   restart_mode='restart',
   ndr=53, ndw=52,
   prefix='nano',
   nstep=500, iprint=10, isave=200,
   dt=5.0d0,
   etot_conv_thr = 1.d-8,
   pseudo_dir = './'
   outdir = './'
   tstress = .false.
   tprnfor = .true.
   wf_collect=.false.
   saverho=.false.
   memory="small"
/

Input parameters: some of them can be critical for performance; use them only when really needed.


Program CP v.5.0.1 (svn rev. 9250M) starts on 7Aug2012 at 23: 8:40

This program is part of the open-source Quantum ESPRESSO suite
for quantum simulation of materials; please cite
"P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
URL http://www.quantum-espresso.org",
in publications or presentations arising from this work. More details at
http://www.quantum-espresso.org/quote.php

Parallel version (MPI & OpenMP), running on 131072 processor cores
Number of MPI processes:    16384
Threads/MPI process:            8
band groups division:  nbgrp     =   16
R & G space division:  proc/pool = 16384
wavefunctions fft division:  fft/group = 4
…
Matrix Multiplication Performances
ortho mmul, time for parallel driver = 0.02369 with 1024 procs
Constraints matrixes will be distributed block like on
ortho sub-group = 32*32 procs

Reading the output… CNT10POR8
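The header echoes the parallelization actually in use: 16384 MPI processes with 8 threads each account for the 131072 processor cores; nbgrp = 16 band groups; proc/pool = 16384 tasks share the R & G space distribution; fft/group = 4 is the task-group setting; and the constraints (ortho) matrices are block-distributed over a 32 x 32 = 1024-task sub-group, the same 1024 procs reported for the parallel matrix-multiplication driver.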


Basic Data Type

Wave functions: 1D arrays of plane-wave coefficients (reciprocal space)

Charge density: 3D arrays on the grid (real space)


Reciprocal Space Representation

Wave Functions:

ψ_i(r) = (1/√Ω) Σ_G C_i(G) exp(iG·r),  truncated with G²/2 ≤ E_cut (to truncate the infinite sum)

Charge Density:

ρ(r) = Σ_i f_i |ψ_i(r)|²

ρ(G) = (1/Ω) Σ_i f_i Σ_G' C_i*(G'−G) C_i(G'),  with G²/2 ≤ 4 E_cut (to retain the same accuracy as the wave functions)
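The factor of 4 in the density cutoff is just the square of the doubled G range (standard plane-wave reasoning, not spelled out on the slide): ρ is built from products of two wave-function components, so its Fourier components extend up to 2 G_max, and (2 G_max)²/2 = 4 (G_max²/2) = 4 E_cut.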


FFTs

Reciprocal space <-> Real space:

ψ_i(G), with |G|²/2 < E_cut    <-- FFT -->   ψ_i(r)
ρ(G),   with |G|²/2 < 4 E_cut  <-- FFT -->   ρ(r) = Σ_i f_i |ψ_i(r)|²


Reciprocal Space distribution

[Diagram: the G components of ρ(G) and ψ_i(G) distributed across processors P0-P4]


Understanding QE 3DFFT, Parallelization of Plane Waves

[Diagram: the FFT grid, with x, y, z axes and the E_cut and 4 E_cut spheres, distributed column-wise over PE 0-3; about Nx Ny / 5 1D FFTs are performed along z]

ρ(G): charge density (4 E_cut sphere)
ψ_i(G): single-state electronic wave function (E_cut sphere)
Reciprocal space: G <-> plane-wave vectors

Similar 3D FFTs are present in most ab initio codes, e.g. CPMD.


Conclusion

The number of cores doubles every two years -> parallel vs serial

Memory per core decreases -> parallelism at all levels

Multi/many-core nodes -> MPI and OpenMP

Communicator hierarchy -> tune the command-line parameters

I/O will be critical -> avoid it when not required

Power consumption will drive CPU/computer design

