Page 1: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

«What I cannot compute, I do not understand.» (adapted from Richard P. Feynman)

Achievements and challenges running GPU-accelerated Quantum ESPRESSO

on heterogeneous clusters

Filippo Spiga 1,2 <[email protected]>

1 HPCS, University of Cambridge
2 Quantum ESPRESSO Foundation

Page 2: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

What is Quantum ESPRESSO?

• Quantum ESPRESSO is an integrated suite of computer codes for atomistic simulations based on DFT, pseudo-potentials, and plane waves

• "ESPRESSO" stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimization

• Quantum ESPRESSO is an initiative of SISSA, EPFL, and ICTP, with many partners in Europe and worldwide

• Quantum ESPRESSO is free software that can be freely downloaded. Everybody is free to use it and welcome to contribute to its development

2

Page 3: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

What can Quantum ESPRESSO do?

3

• ground-state calculations
– Kohn-Sham orbitals and energies, total energies and atomic forces
– finite as well as infinite systems
– any crystal structure or supercell
– insulators and metals (different schemes of BZ integration)
– structural optimization (many minimization schemes available)
– transition states and minimum-energy paths (via NEB or string dynamics)
– electronic polarization via Berry's phase
– finite electric fields via saw-tooth potential or electric enthalpy

• norm-conserving as well as ultra-soft and PAW pseudo-potentials

• many different energy functionals, including meta-GGA, DFT+U, and hybrids (van der Waals soon to be available)

• scalar-relativistic as well as fully relativistic (spin-orbit) calculations

• magnetic systems, including non-collinear magnetism

• Wannier interpolations

• ab-initio molecular dynamics
– Car-Parrinello (many ensembles and flavors)
– Born-Oppenheimer (many ensembles and flavors)
– QM-MM (interface with LAMMPS)

• linear response and vibrational dynamics
– phonon dispersions, real-space interatomic force constants
– electron-phonon interactions and superconductivity
– effective charges and dielectric tensors
– third-order anharmonicities and phonon lifetimes
– infrared and (off-resonance) Raman cross sections
– thermal properties via the quasi-harmonic approximation

• electronic excited states
– TDDFT for very large systems (both real-time and "turbo-Lanczos")
– MBPT for very large systems (GW, BSE)

.... plus several post-processing tools!

Page 4: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Quantum ESPRESSO in numbers

• 350,000+ lines of FORTRAN/C code

• 46 registered developers

• 1600+ registered users

• 5700+ downloads of the latest 5.x.x version

• 2 web-sites (quantum-espresso.org & qe-forge.org)

• 1 official user mailing-list, 1 official developer mailing-list

• 24 international schools and training courses (1000+ participants)

4

Page 5: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

PWscf in a nutshell
program flow

5

[Program-flow diagram: each step of the PWscf flow is annotated with its dominant kernels: 3D-FFT + GEMM + LAPACK, 3D-FFT, or 3D-FFT + GEMM.]

Page 6: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Spoiler!

• Only PWscf has been ported to GPU

• Serial performance (full socket vs full socket + GPU): 3x ~ 4x

• Parallel performance (best MPI+OpenMP vs ... + GPU): 2x ~ 3x

• Designed to run best at low node counts (parallel efficiency is not high)

• Spin magnetization and non-collinear magnetism not yet ported (work in progress)

• I/O deliberately kept low

• NVIDIA Kepler GPUs not yet exploited at their best (work in progress)

6

Page 7: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Achievement: smart and selective BLAS

7

[Diagram: CPU and GPU timelines for a split GEMM. The product is divided into two sub-products, A1 × B + C1 and A2 × B + C2, one executed on the CPU and one on the GPU; H2D and D2H copies bracket the GPU part, and a wrong split shows up as load unbalance.]

phiGEMM: CPU+GPU GEMM operations

•a drop-in library won't work as expected; explicit control is needed

•overcomes the limit of the GPU memory

•flexible interface (C on the HOST, C on the DEVICE)

•dynamic workload adjustment (SPLIT), heuristic-based

•call-by-call profiling capabilities
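
The SPLIT idea can be illustrated with a minimal sketch, assuming a host CBLAS plus cuBLAS and column-major storage; the function and parameter names below are illustrative, not phiGEMM's actual API. The columns of B and C are divided between GPU and CPU according to a tunable ratio, and the two GEMMs run concurrently.

```cuda
/* Hypothetical sketch of a CPU+GPU split GEMM, C = alpha*A*B + beta*C:
 * the GPU takes the first n_gpu columns of B and C, the CPU the rest. */
#include <cblas.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

void split_dgemm(int m, int n, int k, double alpha,
                 const double *A, const double *B, double beta, double *C,
                 double split /* fraction of columns sent to the GPU */)
{
    int n_gpu = (int)(n * split);   /* columns computed on the GPU  */
    int n_cpu = n - n_gpu;          /* remaining columns on the CPU */

    cublasHandle_t h;  cublasCreate(&h);
    double *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(double) * (size_t)m * k);
    cudaMalloc(&dB, sizeof(double) * (size_t)k * n_gpu);
    cudaMalloc(&dC, sizeof(double) * (size_t)m * n_gpu);

    /* H2D: A entirely, plus the GPU's share of B and C (column-major) */
    cudaMemcpy(dA, A, sizeof(double) * (size_t)m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(double) * (size_t)k * n_gpu, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(double) * (size_t)m * n_gpu, cudaMemcpyHostToDevice);

    /* GPU part: the launch is asynchronous w.r.t. the host BLAS call below */
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n_gpu, k,
                &alpha, dA, m, dB, k, &beta, dC, m);

    /* CPU part runs concurrently on the remaining columns */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n_cpu, k,
                alpha, A, m, B + (size_t)k * n_gpu, k,
                beta, C + (size_t)m * n_gpu, m);

    /* D2H: a bad split ratio shows up here as idle time on one side
     * (the "unbalance" in the diagram above) */
    cudaMemcpy(C, dC, sizeof(double) * (size_t)m * n_gpu, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC); cublasDestroy(h);
}
```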

Page 8: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Challenge: rectangular GEMM
bad shape, poor performance

8

Issues:
•A and B can be larger than the GPU memory
•A and B matrices are "badly" rectangular (one dominant dimension)

Solutions (~ +15% performance):
•tiling approach
– tiles not too big, not too small
– the GEMM computation must exceed the copies (H2D, D2H), especially for small tiles
•handling the "SPECIAL-K" case (see the sketch below)
– beta × C added once
– alpha × Ai × Bi accumulated over the k tiles

Optimizations included in phiGEMM (version > 1.9)

[Diagram: A is m × k and B is k × n with k much larger than m and n, a common case due to the data distribution.]
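
A minimal sketch of the "SPECIAL-K" tiling, assuming cuBLAS and column-major storage (names are illustrative, not phiGEMM's internals): beta × C is applied on the first k-tile only, and every later tile accumulates alpha × Ai × Bi with beta = 1.

```cuda
/* Sketch: A (m x k) and B (k x n) exceed device memory because k dominates,
 * so the product is accumulated over k-tiles on a C that stays on the GPU. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void special_k_dgemm(cublasHandle_t h, int m, int n, int k, int k_tile,
                     double alpha, const double *A, const double *B,
                     double beta, double *dC /* m x n, resident on the GPU */)
{
    double *dA, *dB;
    cudaMalloc(&dA, sizeof(double) * (size_t)m * k_tile);
    cudaMalloc(&dB, sizeof(double) * (size_t)k_tile * n);

    for (int k0 = 0; k0 < k; k0 += k_tile) {
        int kb = (k - k0 < k_tile) ? (k - k0) : k_tile;

        /* copy the slabs A(:, k0:k0+kb) and B(k0:k0+kb, :); in column-major
         * A's slab is contiguous, B's slab needs a strided 2D copy */
        cudaMemcpy(dA, A + (size_t)m * k0, sizeof(double) * (size_t)m * kb,
                   cudaMemcpyHostToDevice);
        cudaMemcpy2D(dB, sizeof(double) * kb, B + k0, sizeof(double) * k,
                     sizeof(double) * kb, n, cudaMemcpyHostToDevice);

        /* beta * C is applied only on the first tile; afterwards accumulate */
        double b = (k0 == 0) ? beta : 1.0;
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, kb,
                    &alpha, dA, m, dB, kb, &b, dC, m);
    }
    cudaFree(dA); cudaFree(dB);
}
```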

Page 9: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Challenge: parallel 3D-FFT

• 3D-FFT burns up to 40%~45% of the total SCF run-time

• ~90% of the 3D-FFTs in PWscf are inside vloc_psi (the "Wave" grid)

• 3D-FFTs are "small": < 300³ COMPLEX DP

• a 3D-FFT grid need not be a cube

• In serial a 3D-FFT is called as it is; in parallel, 3D-FFT = Σ 1D-FFT (see the sketch below)

• In serial the data layout is straightforward; in parallel it is not*

• MPI communication becomes a big issue for many-node runs

• GPU FFT is mainly memory-bound → grouping & batching 3D-FFTs

9
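
As noted above, in parallel a 3D-FFT becomes batches of 1D-FFTs along one axis at a time. A minimal cuFFT sketch of the z-stage, under the assumption that each rank's sticks are already packed contiguously on the GPU (the helper name is illustrative):

```cuda
/* One rank owns nsticks sticks of length nz; the z-stage is then a single
 * batched 1D transform.  The y and x stages are done the same way with
 * their own plans, after the parallel transpose. */
#include <cufft.h>

cufftHandle make_z_stage_plan(int nz, int nsticks)
{
    cufftHandle plan;
    int n[1] = { nz };
    /* contiguous sticks of length nz, nsticks of them back to back */
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, nz,   /* input  layout: stride 1, distance nz */
                  NULL, 1, nz,   /* output layout: same                  */
                  CUFFT_Z2Z, nsticks);
    return plan;
}

/* usage: cufftExecZ2Z(plan, d_sticks, d_sticks, CUFFT_FORWARD); */
```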

Page 10: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Challenge: FFT data layout
it is all about sticks & planes

10

[Diagram: distributed 3D-FFT over 4 PEs. Transform along Z (~ Nx Ny / 5 1D-FFTs along z), parallel transpose (~ Nx Ny Nz / (5 Np) data exchanged per PE), transform along Y (Nx Nz / 2 1D-FFTs along y), transform along X (Ny Nz 1D-FFTs along x).]

Data are not contiguous and not "trivially" distributed across processors.

A single 3D-FFT is divided into independent 1D-FFTs. There are two "FFT grid" representations in reciprocal space: wave functions (Ecut) and charge density (4 Ecut).

Zeros are not transformed; the different cut-offs preserve accuracy.
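
A hypothetical sketch of the parallel transpose between FFT stages, assuming the sticks destined for each rank have already been packed contiguously (the function name and the count/displacement bookkeeping are illustrative):

```cuda
/* Each rank swaps its packed stick data with every other rank. Counts and
 * displacements are in units of MPI_DOUBLE (2 per complex value) and depend
 * on the stick distribution, which is not shown here. */
#include <mpi.h>
#include <cufft.h>

void fft_transpose(const cufftDoubleComplex *packed_send,
                   cufftDoubleComplex *recv,
                   const int *sendcounts, const int *sdispls,
                   const int *recvcounts, const int *rdispls, MPI_Comm comm)
{
    MPI_Alltoallv(packed_send, sendcounts, sdispls, MPI_DOUBLE,
                  recv,        recvcounts, rdispls, MPI_DOUBLE, comm);
}
```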

Page 11: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Challenge: parallel 3D-FFT

11

Optimization #1

•CUDA-enabled MPI for P2P (within socket)

•Overlap FFT computation with MPI communication

•for many nodes, MPI communication time far exceeds FFT computation time
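
A sketch of how Optimization #1 can look, assuming a CUDA-aware MPI (so device pointers can be handed to MPI directly) and that part of the FFT work does not depend on the data in flight; all names are illustrative:

```cuda
/* Post the GPU-to-GPU exchange, then transform independent data on a
 * separate CUDA stream while the messages are in flight. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <cufft.h>

void overlap_stage(cufftHandle plan, cufftDoubleComplex *d_work,
                   cufftDoubleComplex *d_sendbuf, cufftDoubleComplex *d_recvbuf,
                   int count, int peer, cudaStream_t fft_stream, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* device buffers passed straight to MPI (CUDA-aware MPI required) */
    MPI_Irecv(d_recvbuf, 2 * count, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
    MPI_Isend(d_sendbuf, 2 * count, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

    /* meanwhile, transform the data that does not depend on the exchange */
    cufftSetStream(plan, fft_stream);
    cufftExecZ2Z(plan, d_work, d_work, CUFFT_FORWARD);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    cudaStreamSynchronize(fft_stream);
}
```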

Page 12: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Challenge: parallel 3D-FFT

Optimization #2

Observation: overlapping the D2H copy is limited by MPI communication

•pinned (page-locked) host memory is needed (!!!)

•stream the D2H copy so it overlaps the CPU copy and the FFT computation
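
A minimal sketch of the pinned-memory requirement behind Optimization #2 (buffer and function names are illustrative): without page-locked host memory, cudaMemcpyAsync degrades to an effectively synchronous copy and nothing overlaps.

```cuda
/* Stage device data out to a pinned host buffer on a dedicated stream. */
#include <cuda_runtime.h>
#include <stddef.h>

void stage_out_async(const void *d_data, void **h_staging, size_t bytes,
                     cudaStream_t copy_stream)
{
    if (*h_staging == NULL)
        cudaMallocHost(h_staging, bytes);          /* pinned host allocation */

    /* D2H copy posted on its own stream, so it can overlap FFT kernels and
     * host-side work queued elsewhere */
    cudaMemcpyAsync(*h_staging, d_data, bytes, cudaMemcpyDeviceToHost,
                    copy_stream);
}
```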

Optimization #3

Observation: MPI "packets" become small when many nodes are used

•re-order data before communication

•batch the MPI_Alltoallv communications

Optimization #4

Idea: reduce the data transmitted (risky...)

•perform FFTs and GEMM in DP, truncate the data to SP before communication (sketched below)

12
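
A hypothetical sketch of Optimization #4: the transforms stay in double precision, but the transpose payload is truncated to single precision just before MPI_Alltoallv and promoted back afterwards, halving the traffic at some cost in accuracy. Counts and displacements are in elements; names are illustrative.

```cuda
#include <mpi.h>
#include <stdlib.h>

void alltoallv_sp(const double *send_dp, double *recv_dp,
                  int sendtotal, int recvtotal,   /* sums of the counts */
                  const int *scounts, const int *sdispls,
                  const int *rcounts, const int *rdispls, MPI_Comm comm)
{
    float *send_sp = (float *)malloc(sizeof(float) * sendtotal);
    float *recv_sp = (float *)malloc(sizeof(float) * recvtotal);

    for (int i = 0; i < sendtotal; ++i)       /* DP -> SP truncation */
        send_sp[i] = (float)send_dp[i];

    MPI_Alltoallv(send_sp, scounts, sdispls, MPI_FLOAT,
                  recv_sp, rcounts, rdispls, MPI_FLOAT, comm);

    for (int i = 0; i < recvtotal; ++i)       /* SP -> DP promotion  */
        recv_dp[i] = (double)recv_sp[i];

    free(send_sp);
    free(recv_sp);
}
```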

Page 13: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Achievements: parallel 3D-FFT
miniDFT 1.6 (k-point calculations, ultra-soft pseudo-potentials)

13

Optimization #1: +37% improvement in communication

Optimization #2:

Optimization #3: +10% improvement in communication

Optimization #4: +52% (!!!) improvement in communication SP vs DP

Lower gain in PWscf !!!

Page 14: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Challenge: parallel 3D-FFT

14

[Profiler timelines: (1) all data of all FFTs copied back to host memory after computation; (2) data reordering performed before the GPU-GPU communication.]

Page 15: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Challenge: H*psi

15

compute/update H * psi:

• compute kinetic and non-local terms (in G space)
  complexity: Ni × (N × Ng + Ng × N × Np)

• loop over (not converged) bands:
  – FFT (psi) to R space            complexity: Ni × Nb × FFT(Nr)
  – compute V * psi                 complexity: Ni × Nb × Nr
  – FFT (V * psi) back to G space   complexity: Ni × Nb × FFT(Nr)

• compute Vexx
  complexity: Ni × Nc × Nq × Nb × (5 × Nr + 2 × FFT(Nr))

N = 2 × Nb (where Nb = number of valence bands)
Ng = number of G vectors
Ni = number of Davidson iterations
Np = number of PP projectors
Nr = size of the 3D FFT grid
Nq = number of q-points (may be different from Nk)
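
A schematic of the per-band step in the loop above (not QE's actual code; kernel and function names are illustrative): each band costs a G→R transform, an Nr-sized pointwise product with the local potential, and an R→G transform.

```cuda
#include <cufft.h>

/* multiply the real-space wavefunction by the (real) local potential */
__global__ void apply_vloc(cufftDoubleComplex *psic, const double *vloc, int nr)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nr) {
        psic[i].x *= vloc[i];
        psic[i].y *= vloc[i];
    }
}

void vloc_psi_band(cufftHandle plan3d, cufftDoubleComplex *d_psic,
                   const double *d_vloc, int nr)
{
    cufftExecZ2Z(plan3d, d_psic, d_psic, CUFFT_INVERSE);   /* G -> R */
    apply_vloc<<<(nr + 255) / 256, 256>>>(d_psic, d_vloc, nr);
    cufftExecZ2Z(plan3d, d_psic, d_psic, CUFFT_FORWARD);   /* R -> G */
}
```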

Page 16: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Challenge: H*psi
non-converged electronic bands dilemma

16

Non-predictable number of FFTs across the SCF iterations

Page 17: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Challenge: parallel 3D-FFT
the orthogonal approach

17

[Diagram: two versions of the pipeline PSI → PSIC → FFT G→R → products → FFT R→G → HPSI. In the DISTRIBUTED scheme every rank works on a slice of each grid; in the orthogonal scheme "MPI_Allgatherv"/"MPI_Allscatterv"-style exchanges give each GPU multiple complete LOCAL grids to transform with CUFFT, so overlapping is possible.]

Considerations:

•memory on the GPU → K40 (12 GByte)

•(still) too much communication → GPU Direct capability needed

•whether there are enough 3D-FFTs is not predictable in advance

•would benefit the CPU-only code too! Not ready for production yet
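
A rough sketch of the orthogonal idea for a single band, under the assumption of a CUDA-aware MPI and a simple band → owner map; the slide's quoted "MPI_Allgatherv"/"MPI_Allscatterv" are rendered here with a plain MPI_Gatherv per owning rank, and all names are illustrative.

```cuda
#include <mpi.h>
#include <cufft.h>

/* Gather the distributed slices of one band onto its owner, transform the
 * complete grid there, then (elided) apply V(r), FFT back and scatter.
 * counts/displs are in MPI_DOUBLE units (2 per complex value). */
void fft_one_band_local(cufftHandle plan3d, int owner, int myrank,
                        cufftDoubleComplex *my_slice, int slice_len,
                        cufftDoubleComplex *d_full_grid, /* used on owner only */
                        const int *counts, const int *displs, MPI_Comm comm)
{
    MPI_Gatherv(my_slice, 2 * slice_len, MPI_DOUBLE,
                d_full_grid, counts, displs, MPI_DOUBLE, owner, comm);

    if (myrank == owner)                    /* one complete, local 3D-FFT */
        cufftExecZ2Z(plan3d, d_full_grid, d_full_grid, CUFFT_INVERSE);

    /* ... apply V(r) on the owner, FFT back, then MPI_Scatterv the slices ... */
}
```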

Page 18: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Challenge: eigen-solvers
which library?

• LAPACK → MAGMA (ICL, University of Tennessee)
– hybridization approach (CPU + GPU), dynamic scheduling based on DLA (QUARK)
– single and multi-GPU, not memory-distributed (yet)
– some (inevitable) numerical "discrepancies"

• ScaLAPACK → ELPA → ELPA + GPU (RZG + NVIDIA)

– ELPA (Eigenvalue SoLvers for Petaflop Applications) improves on ScaLAPACK

– ELPA-GPU proof-of-concept based on CUDA FORTRAN

– effective results below expectations

• Lanczos diagonalization with a tridiagonal QR algorithm (Penn State)

– simple (too simple?) and designed to be GPU friendly

– takes advantage of GPU Direct

– experimental, needs testing and validation

18

Page 19: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

HPC Machines

19

WILKES (HPCS) [DELL]
• 128 dual-socket nodes
• dual 6-core Intel Ivy Bridge
• dual NVIDIA K20c per node
• dual Mellanox Connect-IB FDR
#2 Green500 Nov 2013 (~3632 MFlops/W)

TITAN (ORNL) [CRAY]
• 18688 single-socket nodes
• single 16-core AMD Opteron
• one NVIDIA K20x per node
• Gemini interconnect
#2 Top500 Jun 2013 (~17.59 PFlops Rmax)

Page 20: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Achievement: Save Power
serial multi-threaded, single GPU, NVIDIA Fermi generation

20

[Chart: speed-ups of 3.67x, 3.1x, and 3.2x, alongside reductions of 57%, 58%, and 54% in power consumption.]

Tests run early 2012 @ ICHEC

Page 21: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Achievement: improved time-to-solution

21

[Charts: speed-ups of 2.4x, ~2.1x, 2.5x, ~3.4x, ~2.9x, 2.4x, ~3.4x, and ~3.5x across serial and parallel benchmarks. Parallel tests run on Wilkes; serial tests run on the SBN machine.]

Page 22: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Challenge: running on CRAY XK7

Key differences...

•AMD Bulldozer architecture, 2 cores share the same FPU pipeline → aprun -j 1

•NUMA locality matters a lot, for both CPU-only and CPU+GPU → aprun -cc numa_node

•GPU Direct over RDMA is not supported (yet?) → the GPU 3D-FFT path does not work

•scheduling policy is "unfriendly" → the input has to be really big

Performance below expectation (<2x)

Tricks: many-pw.x, __USE_3D_FFT

22

Page 23: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Challenge: educate users

• Performance portability myth

• "configure, compile, run" same as the CPU version

• All dependencies (MAGMA, phiGEMM) compiled by QE-GPU

• No more than 2 MPI processes per GPU

– Hyper-Q does not work automatically; an additional daemon (the CUDA MPS server) must be running

• Forget about 1:1 output comparison

• QE-GPU can run on every GPU, but some GPUs are better than others...

23

Page 24: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Lessons learnt
being "heterogeneous" today and tomorrow

• GPUs do not really improve code scalability, only time-to-solution

• Re-think data distribution for massively parallel architectures

• Deal with uncontrolled "numerical fluctuations" (the GPU magnifies them)

• The "data movement" constraint will soon disappear → new Intel Xeon Phi "Knights Landing" and NVIDIA Project Denver expected by 2015

• Looking for true alternatives, new algorithms
– not easy: extensive validation _plus_ module dependencies

• Performance is a function of human effort

• Follow the mantra «Do what you are good at.»

24

Page 25: Achievements and challenges running  GPU-accelerated Quantum ESPRESSO  on heterogeneous clusters

Thank you for your attention!

Links:
• http://hpc.cam.ac.uk
• http://www.quantum-espresso.org/
• http://foundation.quantum-espresso.org/
• http://qe-forge.org/gf/project/q-e/
• http://qe-forge.org/gf/project/q-e-gpu/

