
D4.4 “Report on the profiling, the optimisation and the benchmarking of a subset of applications suited for performance and energy”

D4.5 “Report on the efficiency and performance evaluation of the applications ported and best practices”

D4.6 “Final list of ported and optimized applications”

Version 1.0

Document Information

Contract Number 288777

Project Website www.montblanc-project.eu

Contractual Deadline M45

Dissemination Level PU

Nature Report

Author E. Boyer (GENCI)

Contributors

B. Videau (CNRS), D. Brayford (LRZ), M. Allalen (LRZ), P. Lanucara (CINECA), N. Sanna (CINECA), Filippo Mantovani (BSC), R. Halver (JSC), D. Broemmel (JSC), JH. Meinke (JSC), Kevin Pouget (CNRS), Jean-François Méhaut (CNRS), Luigi Genovese (CEA), Constan Gomez (BSC), Alejandro Rico (BSC)

Reviewer Name: Alejandro Rico (BSC), Filippo Mantovani (BSC), Jesús Labarta (BSC)

Keywords Exascale, scientific applications, porting, profiling, optimisation, energy

Notices:


The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.

© 2015 Mont-Blanc Consortium Partners. All rights reserved.

Change Log

Version  Description of Change
V0.1     Initial draft released to the WP4 contributors
V0.2     Merge of D4.4 and D4.5 with agreement of the project manager
V0.3     Updated results, presentation normalization ongoing
V0.4     More updates and charts
V0.5     Table of figures added + corrections
V0.6     Late contributions added
V0.7     Idem
V0.8     Idem
V1.0     Final version for the European Commission


Table of Contents

1 Introduction .................................................. 6
1.1 Platform used by WP4 ........................................ 7
2 Report on the WP4 applications ................................ 8
2.1 BigDFT ...................................................... 8
2.1.1 Description of the code ................................... 8
2.1.2 Code version .............................................. 8
2.1.3 Problem sizes ............................................. 8
2.1.4 Weak Scaling .............................................. 9
2.1.5 Strong Scaling ............................................ 9
2.1.6 Energy profiling .......................................... 11
2.1.7 Report on scaling and energy profiling .................... 14
2.1.8 Synthesis & Best practices ................................ 14
2.2 BQCD ........................................................ 16
2.2.1 Description of the code ................................... 16
2.2.2 Code version .............................................. 16
2.2.3 Problem sizes ............................................. 16
2.2.4 Weak Scaling .............................................. 16
2.2.5 Strong Scaling ............................................ 19
2.2.6 Energy profiling .......................................... 20
2.2.7 Report on scaling and energy profiling .................... 21
2.2.8 Synthesis & Best practices ................................ 21
2.3 COSMO ....................................................... 22
2.3.1 Description of the code ................................... 22
2.3.2 Code versions ............................................. 23
2.3.3 Problem sizes ............................................. 26
2.3.4 Weak Scaling .............................................. 26
2.3.5 Strong Scaling ............................................ 26
2.3.6 Energy profiling .......................................... 29
2.3.7 Report on scaling and energy profiling .................... 30
2.3.8 Synthesis & Best practices ................................ 30
References ...................................................... 30
2.4 MP2C ........................................................ 31
2.4.1 Description of the code ................................... 31
2.4.2 Code versions ............................................. 31
2.4.3 Problem sizes ............................................. 32
2.4.4 Weak Scaling .............................................. 32
2.4.5 Strong Scaling ............................................ 33
2.4.6 Energy profiling .......................................... 35
2.4.7 Report on scaling and energy profiling .................... 37
2.4.8 Synthesis & Best practices ................................ 38
2.5 PEPC ........................................................ 39
2.5.1 Description of the code ................................... 39
2.5.2 Code versions ............................................. 40
2.5.3 Problem sizes ............................................. 41
2.5.4 Weak Scaling .............................................. 42
2.5.5 Strong Scaling ............................................ 43
2.5.6 Energy profiling .......................................... 44
2.5.7 Report on scaling and energy profiling .................... 46
2.5.8 Synthesis & Best practices ................................ 48
References ...................................................... 48
2.6 Quantum Espresso ............................................ 49
2.6.1 Description of the code ................................... 49
2.6.2 Code version .............................................. 50
2.6.3 Problem sizes ............................................. 52
2.6.4 Weak Scaling .............................................. 52
2.6.5 Strong Scaling ............................................ 54
2.6.6 Energy profiling .......................................... 55
2.6.7 Report on scaling and energy profiling .................... 55
2.6.8 Synthesis and best practices .............................. 55
References ...................................................... 56
2.7 SMMP ........................................................ 57
2.7.1 Description of the code ................................... 57
2.7.2 Code versions ............................................. 57
2.7.3 Problem sizes ............................................. 57
2.7.4 Weak Scaling .............................................. 57
2.7.5 Strong Scaling ............................................ 58
2.7.6 Energy profiling .......................................... 59
2.7.7 Report on scaling and energy profiling .................... 60
2.7.8 Synthesis & Best practices ................................ 60
2.8 SPECFEM3D ................................................... 62
2.8.1 Description of the code ................................... 62
2.8.2 Code version .............................................. 62
2.8.3 Problem sizes ............................................. 62
2.8.4 Weak Scaling .............................................. 63
2.8.5 Strong Scaling ............................................ 65
2.8.6 Energy profiling .......................................... 67
2.8.7 Report on scaling and energy profiling .................... 73
2.8.8 Synthesis & Best practices ................................ 77
2.9 Alya RED .................................................... 78
2.9.1 Description of the code ................................... 78
2.9.2 Code version .............................................. 78
2.9.3 Problem sizes ............................................. 78
2.9.4 Weak Scaling .............................................. 79
2.9.5 Strong Scaling ............................................ 79
2.9.6 Energy profiling .......................................... 81
2.9.7 Report on scaling and energy profiling .................... 82
2.9.8 Synthesis & Best practices ................................ 83
2.10 ExMatEx Proxy Applications ................................. 84
2.10.1 Description of the code .................................. 84
2.10.2 Code version ............................................. 84
2.10.3 Problem sizes ............................................ 85
2.10.4 Weak Scaling ............................................. 86
2.10.5 Strong Scaling ........................................... 89
2.10.6 Energy profiling ......................................... 91
2.10.7 Report on scaling and energy profiling ................... 94
2.10.8 Synthesis & Best practices ............................... 96

List of figures.....................................................................................................................99

List of tables.................................................................................................................... 101

Acronyms and Abbreviations............................................................................................... 101


Executive Summary

The Mont-Blanc project aims to assess the potential of clusters based on low-power embedded components to address future Exascale HPC needs.

The role of work package 4 (WP4, “Exascale applications”) is to port, co-design and optimise up to 11 real exascale-class scientific applications on the different generations of platforms available, in order to assess the global programmability and the performance of such systems.

After a first phase in which all 11 applications were ported to the various ARM-based low power architectures provided by the project, the experts from WP4 selected a subset of 8 applications to be ported and optimised on the Mont-Blanc prototype using appropriate programming models.

This report presents the results of the profiling, optimisation and benchmarking activity on this subset of scientific applications. Due to the close relationship and rich cross-references between the deliverables:

D4.4 “Report on the profiling, the optimisation and the benchmarking of a subset of applications suited for performance and energy”;

D4.5 “Report on the efficiency and performance evaluation of the applications ported and best practices”;

D4.6 “Final list of ported and optimized applications”;

the decision has been taken to merge D4.4, D4.5 and D4.6 into a single physical document, for a better reading flow and logical sequence and to avoid redundancy. Nevertheless, the sections remain clearly identified:

Section Relevance

Description of the code D4.6

Code versions D4.6

Problem sizes D4.4

Weak Scaling D4.4

Strong Scaling D4.4

Energy profiling D4.4

Report on scaling and energy profiling D4.5

Synthesis & Best practices D4.5


1 Introduction

As a complement to the activities of work package 3 (WP3, “Optimized application kernels”), part of the activities of Mont-Blanc is to assess, on the different generations of platforms made available by the project, the behaviour of exascale-class “real” scientific applications. The objective of work package 4 (WP4, “Exascale applications”) is to evaluate the global programmability and the performance (in terms of time and energy to solution) of the architecture, and to assess the efficiency of the hybrid OmpSs/MPI programming model.

These real scientific applications, used by academia and industry and running daily in production on existing European (PRACE Tier-0 systems) or national HPC facilities, have been selected by the different partners in order to cover a wide range of scientific domains. This deliverable focuses on a subset of the applications ported to the Mont-Blanc platforms, according to their scaling ability. The list of the 8 applications is the following:

Code              Scientific Domain                Contact                         Institution
BigDFT            Electronic Structure             Brice Videau, Luigi Genovese    CNRS/CEA
BQCD              Quantum Chromodynamics           Momme Allalen, David Brayford   LRZ
COSMO             Weather Forecast                 Piero Lanucara                  CINECA
Quantum Espresso  Electronic Structure             Nico Sanna                      CINECA
MP2C              Multi-particle collision         Rene Halver                     JSC
PEPC              N-Body Coulomb & gravitational   Dirk Broemmel                   JSC
SMMP              Molecular Simulation             Jan H. Meinke                   JSC
SPECFEM3D         Geophysics                       Kevin Pouget, Brice Videau      CNRS

Table 1 - Final list of the WP4 scientific applications

Apart from these applications, this deliverable also reports on Alya Red and two of the ExMatEx Proxy Applications, LULESH and CoMD.

Code      Scientific Domain                   Contact                          Institution
Alya Red  Computational Biomechanics          Constan Gomez, Alejandro Rico    BSC
ExMatEx   Hydrodynamics, Molecular Dynamics   Alejandro Rico                   BSC


This report refers to the activities planned in WP4 under Task 4.2 and Task 4.3:

T4.2. Profiling, benchmarking and optimization (m18:m45)

Following the work performed in task 4.1, a subset of applications which offers the best potential for exploiting the hardware and software characteristics of these prototypes will be elaborated. This selection will take into account the results of the WP3 and WP5 activities in terms of kernel and software library availability/performance, as well as all the other components of the software stack. On this subset of applications, dedicated optimisation efforts will be focused on the effective usage of SIMD vector units, or on hybridisation with potential accelerators using portable programming models like OpenCL, since some of the proposed codes already have OpenCL versions.

T4.3. Profiling, benchmarking and optimization (m24:m45)

Finally, on these optimized applications, additional benchmarks will be conducted in order to compare power consumption and computing performance on the Mont-Blanc platform and the Tier-0 platforms of the PRACE European infrastructure. A best practice document will describe the productivity evaluation in terms of time to port and optimize the selected codes, the impact of the programming models on the rewriting of source codes, portability, an assessment of the programming models' ease of use for “regular programmers”, and the choice of metrics to evaluate the efficiency of the system.

1.1 Platform used by WP4

Mont-Blanc Prototype (CHAPEL): Please refer to Deliverable 7.8 for a complete description of the platform.


2 Report on the WP4 applications

2.1 BigDFT

2.1.1 Description of the code

BigDFT [1] is an ab-initio simulation software based on the Daubechies wavelets family. Among other quantities, the software computes the electronic orbital occupations and energies. Several execution modes are available, depending on the problem under investigation. Cases can be periodic in various dimensions and use k-points to increase the accuracy along periodic dimensions. Test cases can also be isolated.

Four institutions were involved in the BigDFT project at that time:

1. Commissariat à l'Énergie Atomique (T. Deutsch),

2. University of Basel (S. Goedecker),

3. Université catholique de Louvain (X. Gonze),

4. Christian-Albrechts-Universität zu Kiel (R. Schneider).

BigDFT is an Open Source project and the code is available at: www.bigdft.org

Since 2010, four laboratories have been contributing to the development of BigDFT: L_Sim (CEA), UNIBAS, LIG and ESRF. BigDFT is mainly used by academics.

The code is written mainly in FORTRAN (121k lines) with parts in C/C++ (20k lines); it is parallelized using MPI and OpenMP and has OpenCL support. It also uses the BLAS and LAPACK libraries.

BigDFT scales well: multiple runs have used more than 16384 IBM BG/Q cores, and hybrid runs using up to 288 GPUs of CURIE have also been realized.

2.1.2 Code version

The present benchmarks have been done with the stable BigDFT version 1.7.6, released in November 2014.

2.1.3 Problem sizes

Benchmarks cover the range from one to 500 nodes of the Mont-Blanc prototype. The input files have been chosen to use either Free, Periodic or Surfaces boundary conditions (BC). The Free BC system is a small run consisting of a single Carbon atom, the Periodic BC system is a 4-atom cell of Ag, and the Surfaces BC system is a 4-atom orthorhombic supercell of Graphene. Most of the big sized runs have been performed with the latter system.

The memory footprints of the various systems are different:


- The Free BC system is rather small (80 MB).

- The Periodic BC system is a little bigger (200 MB).

- The Surfaces BC calculation treats more than 16 GB of data; it therefore represents a workload big enough to be easily partitioned among the different nodes (medium).

- The Surfaces BC big calculation treats 10 times as much data as the medium one, with a memory footprint of over 160 GB (big).

2.1.4 Weak Scaling

Weak scaling (Figure 2.1.1) has been considered together with strong scaling by increasing the workload between the medium sized system and the big system. In both cases, optimal efficiencies are obtained, provided that the granularity of the job is not too fine. When the job size is big enough, the entire prototype can be used effectively.

Figure 2.1.1 BigDFT Weak Scaling.

The data size ratio between the big and medium system is 10.
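The weak-scaling efficiency reported in figures like the one above can be computed directly from the measured run times: since the workload grows with the node count, the ideal behaviour is a constant run time. A minimal sketch, with illustrative timings rather than the measured Mont-Blanc data:

```python
def weak_scaling_efficiency(t_ref, t_scaled):
    """Weak scaling: the workload grows with the node count, so ideal
    behaviour is a constant run time. Efficiency = T_ref / T_scaled."""
    return t_ref / t_scaled

# Hypothetical timings (illustrative only): medium system on N nodes
# vs. big system (10x the data) on 10*N nodes.
t_medium = 120.0
t_big = 130.0
print(f"weak-scaling efficiency: {weak_scaling_efficiency(t_medium, t_big):.2%}")
```

An efficiency close to 100% means the run time barely grew despite the 10x larger dataset.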

2.1.5 Strong Scaling

The strong scaling has been performed for the Surfaces BC system using 2 MPI processes per node of the prototype. For the medium sized system (see Figure 2.1.2) the results are of excellent quality up to 164 MPI tasks. Starting from this size, the

[Chart for Figure 2.1.1: "Scaling of BigDFT on MontBlanc Prototype"; x-axis: Number of Cores (10 to 1000); left y-axis: speedup wrt 16 cores (medium system); right y-axis: speedup wrt 200 cores (big system).]


behaviour of the networking of the system changes: the MPI AllReduce collective communication stops scaling, and its cost even increases at 656 cores. This effect seems to be due to the reduction in granularity and the increase in message count. For the big system (see Figure 2.1.3) the communications do not scale, so the communication cost per node stays constant; the efficiency thus decreases according to Amdahl's law.
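The strong-scaling metrics used throughout this report (speedup relative to a baseline core count, and parallel efficiency), together with the Amdahl bound just mentioned, can be sketched as follows. The numbers below are illustrative assumptions, not the measured prototype data:

```python
def speedup(t_base, t_n):
    """Speedup of a run on n cores relative to the baseline run."""
    return t_base / t_n

def parallel_efficiency(t_base, n_base, t_n, n):
    """Efficiency relative to the baseline core count
    (16 or 200 cores in the figures of this section)."""
    return speedup(t_base, t_n) * n_base / n

def amdahl_speedup(serial_fraction, n):
    """Amdahl's-law bound when a fixed fraction of the work (for example a
    constant per-node communication cost) does not parallelize."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# Illustrative numbers: 100 s on 16 cores, 30 s on 64 cores.
print(parallel_efficiency(100.0, 16, 30.0, 64))   # fraction of ideal scaling
print(amdahl_speedup(0.05, 64))                   # bound with 5% serial work
```

With a constant per-node communication cost, the serial fraction stays fixed as cores are added, which is why the efficiency curve flattens out below 100%.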

Figure 2.1.2 BigDFT Strong scaling of the medium sized system

At 16 cores each node uses 2 GB of data, whereas at 656 cores each node uses only 48 MB.

Figure 2.1.3 BigDFT Strong scaling of the big sized system

[Chart for Figure 2.1.2: "Strong Scaling of Medium system"; x-axis: Number of Cores (0 to 700); left y-axis: speedup wrt 16 cores; right y-axis: parallel efficiency (%).]

[Chart for Figure 2.1.3: "Strong Scaling of Big system"; x-axis: Number of Cores (200 to 1000); left y-axis: speedup wrt 200 cores; right y-axis: parallel efficiency (%).]


At 200 cores each node uses 2 GB of data, whereas at 656 cores each node uses only 1.6 GB, and only 320 MB at 1000 cores.

2.1.6 Energy profiling

For each run of the big system (the medium experiments were conducted before the tool was deployed on the prototype), the energy consumption has been profiled by the tool installed at BSC and developed at LRZ. The average consumption per node has been found to be around 10 W (see Figures 2.1.4 and 2.1.5). Care has been taken to consider the actual dispersion of the power consumption across nodes, as shown by the histogram. When communications take a larger share of the time, the average consumption decreases, which limits the loss of energy efficiency of the run, as shown by the strong scaling Figures 2.1.6 and 2.1.7. The overall behaviour is very stable for all the runs of the big system series.
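Energy-to-solution figures of this kind are obtained by integrating the per-node power samples over the run. A minimal sketch, assuming a fixed sampling interval; the trace values are illustrative, not the prototype's actual monitoring output:

```python
def energy_joules(power_mw, interval_s):
    """Integrate per-node power samples (in mW, the unit reported in
    Figure 2.1.4) over a fixed sampling interval; returns joules."""
    return sum(p / 1000.0 for p in power_mw) * interval_s

# Illustrative trace: a node near the ~10 W average with a compute burst.
samples_mw = [9800, 10100, 11500, 11900, 10200]
print(f"{energy_joules(samples_mw, interval_s=1.0):.1f} J")
```

Summing this quantity over all nodes of a run gives the total energy to solution, which can then be normalized per iteration as in Figure 2.1.6.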

Figure 2.1.4 Energy consumption per node during a BigDFT run. Unit is mW.

The black curve shows a significant (and confirmed) clock skew; the results therefore have to be rid of these nodes and their energy consumption extrapolated.


Figure 2.1.5 BigDFT: Fraction of nodes per 50 mW interval for two runs (big system)

As more processes are blocked in communications, the energy consumption per node decreases. We can also see that the prototype nodes have different individual energy consumptions, as BigDFT is perfectly balanced on these runs.

Figure 2.1.6 BigDFT: Energy per iteration of the big system

The error bars represent the variability of the prototype node power consumption rather than measurement error. With perfect scaling, the energy cost would be constant; as this is not the case, the energy efficiency decreases as the node count rises.
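The energy-per-iteration metric plotted in Figure 2.1.6, and the energy efficiency discussed with Figure 2.1.7, can be expressed as simple ratios. A sketch with illustrative numbers (the function names and values are assumptions for clarity, not the report's tooling):

```python
def energy_per_iteration(total_energy_j, iterations):
    """Energy-to-solution normalized by iteration count, as in Figure 2.1.6."""
    return total_energy_j / iterations

def energy_efficiency(e_base, e_n):
    """Energy efficiency relative to the baseline run: 1.0 means the
    energy cost per iteration stays constant as nodes are added."""
    return e_base / e_n

# Illustrative: the larger run spends 25% more energy per iteration.
print(energy_efficiency(energy_per_iteration(1000.0, 10),
                        energy_per_iteration(1250.0, 10)))
```

Because idling nodes draw less power than busy ones, this ratio can decrease more slowly than the parallel efficiency, which is exactly the effect shown in Figure 2.1.7.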


Figure 2.1.7 Same as Figure 2.1.3, with the energy efficiency added

We can see that the energy efficiency decreases less than the parallel efficiency. This is due to the fact that processes blocked in communication idle rather than poll.


2.1.7 Report on scaling and energy profiling

The main difficulty in using the Mont-Blanc prototype for the BigDFT code has been the instabilities of the machine. The BigDFT code has its own profiling tool, which is able to detect load imbalance among the different nodes for some of the operations. Thanks to this profiling tool, it was revealed that some nodes were busy with parasitic jobs. The enclosed figure shows the behaviour of the different machines for a single BigDFT run. Although the workload is perfectly balanced, some nodes exhibited instabilities.

Figure 2.1.8 Task distribution by core in a BigDFT run

The orange task is very short and is intentionally unbalanced. The blue task, by contrast, should be perfectly balanced but is not, so some cores experience perturbations. This problem was fixed by the system administrators (see D5.11 Section 10).

2.1.8 Synthesis & Best practices

The BigDFT code shows better efficiency on the Mont-Blanc prototype with single-level MPI parallelization, due to the poor efficiency of shared-memory parallelization and of double-precision OpenCL runs; see the table below. However, the results are very promising, as the parallel efficiency always stays above 85% when the granularity of the job is large enough. For the energy consumption, the situation is even better, as the machine spends slightly less energy when communications are more important.

Figure 2.1.9 BigDFT influence of the programming model


The table reports the execution times of the two small systems, showing the influence of the programming model used. OpenCL is disappointing due to the poor double-precision support of the Mali GPU. MPI shows better performance than hybrid parallelization, most likely due to poor tuning of the OpenMP runtime for this architecture.

These conclusions might, however, be affected by the idle energy consumption of the network, which has not been taken into account in these experiments. As another practice that might be useful when communication is limiting efficiency, some of the collective operations could be overlapped with computation, which is possible thanks to the use of thread-based MPI operations within BigDFT. In this case, a thread-overloading policy might also be considered to hide the thread creation latency.

Figure 2.1.10 Mean time distribution inside a BigDFT process running the medium and big systems

Communications are dominated by AllReduce. Comparing the medium system at 656 cores with the big system at 650 cores, a factor of 10 in the size of the data communicated implies only twice the communication time. This seems to indicate that the medium system's communication is limited by latency.


2.2 BQCD

2.2.1 Description of the code

BQCD is a program that simulates QCD with the Hybrid Monte-Carlo algorithm. It is used in benchmarks for supercomputer procurement at LRZ as well as in the DEISA and PRACE projects, and it is a code basis in the QPACE project. The benchmark code is written in Fortran 90.

QCD is the theory of strongly interacting elementary particles. The theory describes particle properties like masses and decay constants from first principles. The starting point of QCD is an infinite-dimensional integral. In order to study the theory on a computer, the space-time continuum is replaced by a four-dimensional regular finite lattice with (anti-)periodic boundary conditions. After this discretisation, the integral is finite-dimensional but still rather high-dimensional. The high-dimensional integral is solved by Monte-Carlo methods.

The kernel of the program is a standard conjugate gradient solver with even/odd preconditioning. In a typical Hybrid Monte-Carlo run, more than 80% of the total computer time is spent in the multiplication of a very large sparse matrix (the “hopping matrix”) with a vector. At the single-CPU level, QCD programs benefit from the fact that the basic operation is the multiplication of small complex matrices.

Setup:

We measure the performance of the standard conjugate gradient (CG) solver implemented in BQCD. The dominant operation in the solver is the matrix-vector multiplication. This operation involves the communication of boundary elements from neighboring processes. We go to a regime where the exchange of boundary elements takes 50 percent of the execution time or even more. The second type of communication needed in the solver is global summation. For the hybrid parallelization, all loops were parallelized with the parallel do directive. MPI is called in sequential regions, i.e. only the master threads are involved in MPI communication. We used one or two MPI processes per node and one OpenMP thread per core, and look at weak and strong scaling for 12³ × 24, 24³ × 48 and 48³ × 96 lattices.

2.2.2 Code version

MPI, MPI + OpenMP (hybrid), OmpSs + MPI

2.2.3 Problem sizes

The input lattice sizes are: 12³ × 24, 24³ × 48 and 48³ × 96.

2.2.4 Weak Scaling

Weak scaling of the OmpSs + MPI version of BQCD on Chapel (ARM) with lattice sizes of 12³ × 24 (16 & 32 cores), 24³ × 48 (64 & 128 cores) and 48³ × 96 (256 & 512 cores).


Figure 2.2.1 Performance comparison for the MPI+OmpSs version of BQCD on Mont-Blanc
Input lattice sizes of 12³ × 24, 24³ × 48 and 48³ × 96

processors  region  #calls  time (s)   mean (MFLOP/s)  min (MFLOP/s)  max (MFLOP/s)  total (GFLOP/s)
16          CG      215       767.98       150.02          150.02         150.03          2.4
32          CG      215       571.49       100.8           100.8          100.81          3.23
64          CG      215      3918.9        117.6           117.6          117.6           7.53
128         CG      215      3374.18        68.29           68.29          68.3           8.74
256         CG      215     13371.66       137.86          137.86         137.87         35.29
512         CG      215     10833.85        85.08           85.07          85.1          43.56

[Chart: Weak Scaling: CG GFLOP/s vs. cores, 16 to 512]


Figure 2.2.2 BQCD performance comparison MPI/Hybrid/OmpSs
Lattices: 12³ × 24, 24³ × 48 and 48³ × 96

The strong scaling results show a drop in performance between 64 and 256 ranks. A first possible reason is the placement of the jobs and the network interconnect; a second is that some nodes lower their frequency due to overheating.

SuperMUC Phase 1. Various measurements were made on SuperMUC Phase 1 at LRZ and during the friendly user access period of Phase 2. The outcome is that the hybrid version is a few percent faster than the pure MPI version on SuperMUC. In Figure 2.2.3, we have plotted only the hybrid strong and weak scaling results. However, on the Chapel ARM HPC prototype system the pure MPI version is a small percentage faster than the hybrid version of BQCD. This could be a result of the OpenMP overhead being greater on the ARM HPC prototype than on an x86 HPC system.

[Chart: Weak Scaling BQCD (lattice sizes 12³×24, 24³×48, 48³×96): CG Gflop/s vs. cores; series: hyb, mpi, ompss]


Figure 2.2.3 Performance comparison of the hybrid version of BQCD on SuperMUC Phase 1
Lattices: 48³ × 96, 64³ × 128 and 96³ × 192

2.2.5 Strong Scaling

Strong scaling of the OmpSs + MPI version of BQCD with an input lattice size of 12³ × 24 on the Mont-Blanc prototype Chapel (ARM).

Figure 2.2.4 BQCD Strong scaling on Mont-Blanc

[Chart: Weak and strong scaling of BQCD on SuperMUC (Figure 2.2.3): CG Tflop/s vs. cores/1024; lattices 48³×96, 64³×128, 96³×192]

[Chart: Strong Scaling (Figure 2.2.4): CG GFLOP/s vs. cores, 8 to 256]


processors  region  #calls  time (s)  mean (MFLOP/s)  min (MFLOP/s)  max (MFLOP/s)  total (GFLOP/s)
8           CG      215      921.55       250.05          250.05         250.05          2
16          CG      215      767.98       150.02          150.02         150.03          2.4
32          CG      215      571.49       100.8           100.8          100.81          3.23
64          CG      215     1137.21        25.33           25.33          25.33          1.62
128         CG      215     1150.28        12.52           12.52          12.52          1.6
256         CG      215      648.85        11.1            11.1           11.1           2.84

2.2.6 Energy profiling

Processors  Total values  Total nodes  Total watts  Average watts  Total energy consumed (J)
8                  5507            4        55722          10.12                      63590
16                 6943            8        69823          10.06                      78120
32                12479           16       121277           9.72                     135592
64               164818           32      1656655          10.05                    1854930
128              250175           64      2510740          10.04                    2792088


Figure 2.2.5 Total energy consumed for weak scaling of the OmpSs + MPI version of BQCD
Input lattice sizes of 48³ × 96, 64³ × 128 and 96³ × 192 on Chapel.

2.2.7 Report on scaling and energy profiling

The energy profiling results (Figure 2.2.5) show unusual behavior of the system for certain runs, which could be the result of processors running at lower clock frequencies than the default. The reason could be the load on the system and overheating of the processors, resulting in CPU frequency scaling.

2.2.8 Synthesis & Best practices

On the Mont-Blanc prototype cluster (Figure 2.2.2) we have plotted results for pure MPI and for 2 threads per MPI process. For the three different lattice sizes, up to 512 cores, pure MPI is faster for smaller numbers of cores. We also plotted the results of the MPI+OmpSs version using the same lattice sizes as the pure MPI and hybrid (MPI+OpenMP) versions. The performance of the OmpSs version is lower than both the pure MPI and hybrid versions, but the scaling is good, especially since the task overhead is significant due to the fine-grained parallelism in BQCD.



2.3 COSMO

2.3.1 Description of the code

The COnsortium for Small-scale MOdeling (COSMO [1]) was formed in October 1998 with the objective to develop, improve and maintain a non-hydrostatic limited-area atmospheric model, to be used both for operational and research applications by the members of the consortium. To meet the computational requirements of the model, the program has been coded in standard Fortran 90 and parallelized using the MPI library for message passing on distributed-memory machines.

Several codes are part of the general model (COSMO-ART, COSMO-CCLM, etc.) suitable for specific purposes. Among these, COSMO RAPS is a reduced version of the COSMO code and is used mainly for benchmarking purposes by vendors, research communities and consortium members. The release first used on this project was COSMO_RAPS 5.0.

Together with RAPS, in order to better address the “operational” environment of the COSMO code, and in the framework of the Swiss HP2C project, the OPCODE testbed was established in 2011. The OPCODE [2] project (and testbed) is a sort of “demonstrator” of the entire operational suite of MeteoSwiss, ranging from IFS (Integrated Forecast System) boundary conditions to post-processing and presentation of results. The numerical core is based on COSMO operational release 4.19.2.2.

Both RAPS and OPCODE have been selected for the initial porting to the ARM architecture. The main reason for this choice is that both are of interest to the COSMO community (RAPS for benchmarking purposes and OPCODE for “operational” use) and are the versions best suited to be implemented on the Mont-Blanc prototype architecture. The RAPS code was the first ported to the Tibidabo machine during T4.1. While the complete porting of RAPS to the ARM prototype has been completed, some problems occurred during the simulation step. Thus, in order to fix those problems and to finalize the activity within WP3 and WP4, the OPCODE toolchain was considered for the initial porting instead of RAPS. Among the advantages of OPCODE with respect to RAPS we would highlight the following:

- The chance to easily change the COSMO code structure, for example the “dynamical core” or the communication library (“stencil”), with minimal effort.
- The advanced status of implementation of the COSMO model on GPU architectures.

The GPU implementation of OPCODE is still at a prototype level. Nevertheless, the main computational parts (Dynamical core and Physics) have already been ported to GPU. In particular, the porting of Physics to GPU was carried out using the OpenACC tool, and this aspect could represent a great advantage for the porting to the OmpSs toolkit (see the next section for details).


2.3.2 Code versions

The COSMO RAPS code was first ported to Tibidabo using the GNU toolchain, and no particular issues were encountered in the process. However, as of M18 the reference version in WP4 was switched to OPCODE, which was successfully ported to the Tibidabo cluster. The code was built using the GNU 4.6.2 gfortran compiler with the options:

-ftree-vectorize -mcpu=cortex-a9 -mtune=cortex-a9

During the preliminary activities in WP4 a comprehensive set of benchmarks was carried out on the ARM version of the code using the Tibidabo prototype; then the tests were extended to a cluster of Jetson TK1 boards and, more importantly, to the final Mont-Blanc prototype. A bug encountered with asynchronous I/O (nprocio>0) was solved on all the architectures tested, and now OPCODE is able to use synchronous as well as asynchronous I/O.

Concerning the porting to the OmpSs toolchain, in order to benefit from the overlapping of the computation and communication phases of the run, we first proceeded with a comprehensive profiling of OPCODE using the EXTRAE tool, both on the EURORA system at CINECA and on the Mont-Blanc prototype. As a whole, the Dynamical core and the Physics are responsible for the most computationally demanding parts of OPCODE. The Dynamical core porting to GPU involved a complete rewrite in C++ carried out by the COSMO consortium and the use of the so-called “stencil library”. This part seems to be at a more prototypal (and “experimental”) stage with respect to the Physics part. On the other hand, Physics takes approximately 20% of the complete execution time of an OPCODE simulation and is more compute-bound than the other code parts.

Dynamical core Porting to OmpSs

We recall that the Dynamical core of OPCODE, together with Physics, is responsible for most of the computing time of a given simulation. YUTIMING analysis shows that within the Dynamical core most of the computing time is spent in the fast_wave solver, which is used in OPCODE to compute the fast-wave terms related to the prognostic variables update. For the sake of simplicity in benchmarking, simulations were carried out using this routine in the simplest way, leaving numerical discussions to further studies.

Porting to OmpSs was done by working on the source code directly. For example, a typical loop in fast_wave:

DO k = 2, ke
  DO j = jstart, jend
    DO i = istart, iend
      zrofac = ( rho0(i,j,k)*dp0(i,j,k-1) + rho0(i,j,k-1)*dp0(i,j,k) ) &
             / ( rho(i,j,k)*dp0(i,j,k-1) + rho(i,j,k-1)*dp0(i,j,k) )
      zcw1(i,j,k) = 2.0_ireals*g*dts*zrofac/(dp0(i,j,k-1)+dp0(i,j,k))
    ENDDO
  ENDDO
ENDDO

has been analyzed and taskified, identifying the scope of each variable involved in the task and the task dependencies:


!$OMP TARGET DEVICE(OPENCL) NDRANGE(3,MII,MJJ,MKK,32,8,4) FILE(fast_waves_1.cl) COPY_DEPS
!$OMP TASK IN(T1,T2,T3) OUT(T4)
SUBROUTINE FAST_WAVES_1(MII,MJJ,MKK,K1,K2,J1,J2,I1,I2,A,B,T1,T2,T3,T4)
  INTEGER :: MII,MJJ,MKK,K1,K2,J1,J2,I1,I2
  REAL*8, DIMENSION(MII,MJJ,MKK) :: T1,T2,T3,T4
  REAL*8 :: A,B
END SUBROUTINE FAST_WAVES_1

After that, since OmpSs greatly simplifies the memory allocation, data copies to/from the device, etc., the work was to create the kernels (CUDA and then OpenCL) starting from the corresponding F90 function(s):

#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#elif defined(cl_amd_fp64)
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#else
#error "Double precision floating point not supported by OpenCL implementation."
#endif

__kernel void fast_waves_1(int n, int p, int q, int lk, int uk, int lj,
                           int uj, int li, int ui, double g, double dts,
                           __global double* a, __global double* b,
                           __global double* c, __global double* d)
{
#define idxyz(I,J,K) ((I)+n*((J)+p*(K)))
    const int I = get_global_id(0);
    const int J = get_global_id(1);
    const int K = get_global_id(2);
    if (I>=li-1 && I<=ui-1 && J>=lj-1 && J<=uj-1 && K>=lk-1 && K<=uk-1) {
        double zrofac = ( a[idxyz(I,J,K)]*b[idxyz(I,J,K-1)]
                        + a[idxyz(I,J,K-1)]*b[idxyz(I,J,K)] )
                      / ( c[idxyz(I,J,K)]*b[idxyz(I,J,K-1)]
                        + c[idxyz(I,J,K-1)]*b[idxyz(I,J,K)] );
        d[idxyz(I,J,K)] = 2.0*g*dts*zrofac/(b[idxyz(I,J,K-1)]+b[idxyz(I,J,K)]);
    }
}

At the final stage of development we finalized the inclusion of the new OmpSs structure into the fast_wave solver routine. The porting of fast_wave was carried out by inserting into the DYN part of OPCODE the code structure developed for the Himeno porting to OmpSs:

Figure 2.3.1 COSMO OPCODE structure (Himeno solver, Dynamical core, Physics)


At the end of T4.2 we implemented a preliminary version of the new DYN part of OPCODE, suitable to begin the numerical assessment of this part of the code; it was then refined up to the last version used on the Mont-Blanc prototype.

Physics Porting to OmpSs+OpenCL

Physics has already been ported to GPU using OpenACC directives by the COSMO consortium, and the use of directives leaves the code almost unchanged. In fact, we recall that a typical loop in PHY:

do j=1, niter
  do i=1, nwork
    c(i) = a(i) + b(i) * ( a(i+1) - 2.0d0*a(i) + a(i-1) )
  end do
end do

can be almost straightforwardly ported to OpenACC as:

!$acc update device(a,b)
do j=1, niter
  !$acc region do kernel
  do i=1, nwork
    c(i) = a(i) + b(i) * ( a(i+1) - 2.0d0*a(i) + a(i-1) )
  end do
end do
!$acc update host(c)

Thus, since OmpSs greatly simplifies the memory allocation, data copies to/from the device, etc., the work was to create the kernels (CUDA and then OpenCL) starting from the corresponding F90 function(s) and to translate the OpenACC directives to OmpSs. This activity was also accomplished with the support of tools able to translate OpenACC directives into CUDA and/or OpenCL kernels, like the CAPS [4] compiler we used in the first stage of the porting of the PHYS part of OPCODE.

At present, the PHY part of the OPCODE version of the COSMO code is completely ported to GPU with OpenACC and already taskified for the OmpSs version. Nonetheless, we have to underline that, in contrast to the DYN part of OPCODE, the PHYS code is still at a preliminary stage of development, so the performance evaluation of this code section has been rather limited in the following. At the end of the porting activities in WP4 we have three code versions of COSMO OPCODE (all using the NetCDF library for I/O), which we refer to in the following:

1. The original version of COSMO OPCODE ported to ARM using MPI as available on the target system;
2. The MPI version plus OmpSs directives to run on ARM cores in SMP mode;
3. The MPI version plus OmpSs directives and OpenCL kernels (DYN part only) to run on ARM cores and the Mali GPU.


2.3.3 Problem sizes

As usual for this benchmark version of COSMO, the input dataset used is the “artificial” configuration, where the same atmospheric parameters are replicated along the x,y plane of the grid map using a fixed number of 60 z points in the third dimension. This typical input setup is customarily selected in order to guarantee an easy setup of the simulation with minimal data movement during the benchmark, and to facilitate the generation of a set of inputs able to fit in memory during weak and strong scalability measurements.

2.3.4 Weak Scaling

In the figure below we report the weak scalability of COSMO OPCODE runs on 256 nodes (Samsung Development Boards, or SDBs) of the Mont-Blanc prototype. The benchmarks were carried out by varying the x,y data size (at a fixed number of 60 z points), and we report the elapsed time (without diagnostic and I/O timing) normalized per unit of 64 x,y data set.

Figure 2.3.2 COSMO OpCode weak scalability on 256 SDBs

2.3.5 Strong Scaling

We begin the report on the benchmarking of our version of COSMO OPCODE by showing in the figure below the strong scalability behaviour on the Tibidabo cluster, using an artificial (x,y,z) grid of 256x256x60 points:

[Chart: COSMO OPCODE weak scalability at 256 SDBs (Figure 2.3.2): normalized elapsed time (s) vs. XY size at 60 Z points; series: Total, DYN, PHYS, with linear references]


Figure 2.3.3 COSMO OpCode run on Tibidabo

where the CPU-bound part of the code (DYN: Dynamical + PHY: Physics) is reported in blue, while the communication and I/O timing is given in light green. It is clearly evident from the figure that the almost linear scalability of the CPU-bound code sections has a steady counterpart that remains constant as the number of processors increases. This is quite a common pattern for this heavily input-dependent code, as remarked by most of the end users of the COSMO consortium. In fact, it is customary in COSMO benchmarking to switch off the I/O during the tests in order to focus the measurement on the most CPU-bound part of the code, the constant I/O part either becoming less important as the number of MPI processes increases, or being limited by properly managing the nprocio parameter to distribute the I/O over all the nodes available at runtime. Just to give an example of this technique for benchmarking COSMO OPCODE, we report in the figure below the strong scaling performance of its parent COSMO RAPS V5.1 with I/O OFF:

Figure 2.3.4 COSMO RAPS Benchmark Strong Scaling on Cray XC40


With linear scaling in red, the blue line shows the trend of the (super-linear) speedup of RAPS on the Cray XC40 at CSCS, measured by the MeteoSwiss team [5]. Thus, in the following, whenever applicable, we will switch ON/OFF the I/O part as well as the PHYS (experimental) part of COSMO OPCODE.

The artificial benchmarks of the new COSMO OPCODE version were carried out on the Mont-Blanc prototype up to 512 SDBs using the reference version (MPI). Strong scaling results are presented in Figures 2.3.5 (timing) and 2.3.6 (speedup).

Figure 2.3.5 COSMO OpCode Benchmark elapsed time

[Chart: COSMO OPCODE strong scalability (Figure 2.3.5): elapsed time (s) vs. number of SDBs, log scale; series: TOTAL, DYN, PHYS]


Figure 2.3.6 COSMO OpCode benchmark strong scaling

The scalability is really satisfactory (in particular, the Physics part scaling is really good) up to 512 SDBs. The diagnostic part (which is mainly related to I/O and numerical error checking) was removed from the timings.

2.3.6 Energy profiling

The power-to-solution benchmark of COSMO OPCODE, running on 256 SDBs without a dedicated node for I/O (nprocio=0) and on 257 SDBs with a dedicated node for I/O (nprocio=1), is reported for a single SDB in Figure 2.3.7:

Figure 2.3.7 COSMO OpCode power consumption profile

[Chart: COSMO OPCODE speedup, strong scalability (Figure 2.3.6): speedup vs. number of SDBs; series: TOTAL, DYN, PHYS, LINEAR]

[Chart: COSMO OPCODE power consumption of one SDB (Figure 2.3.7): power (mW) vs. execution time (s); series: NPROCIO=0, NPROCIO=1]


From the reported data it is evident that the power consumption profile is more balanced when COSMO OPCODE has a dedicated node for I/O, whereas the run with nprocio=0 shows several drops due to CPU inactivity while the code dumps results and output files at runtime. In fact, if one removes the parts of the blue line where the process is inactive, the two curves behave almost the same way, confirming a power consumption of about 8.0 W per SDB during execution.

2.3.7 Report on scaling and energy profiling

On the Mont-Blanc prototype cluster, the weak scalability (Figure 2.3.2) and strong scalability (Figures 2.3.5 and 2.3.6) of the COSMO OPCODE MPI reference version are quite good. Nonetheless, it is not easy to explain the different behavior of the PHYS and DYN parts in the weak and strong scaling benchmarks (for example, the PHYS part speed-up is superlinear in strong scaling while it is sublinear in weak scaling). This could be explained by the fact that the weak scaling input dataset parameters are not the same as those of the strong scaling. In any case, different qualitative performance behavior could be obtained by varying the number of SDBs, the size of the artificial domain and the activation/deactivation of different code paths, and this will be explored in the future.

2.3.8 Synthesis & Best practices

Despite the effort spent in the preliminary porting to OmpSs (and OpenCL), no performance gains were obtained for OPCODE. Even applying the Himeno scheme for the OmpSs coding of the dynamic part of OPCODE, the resulting performance of the new MPI+OmpSs+OpenCL version was really unsatisfactory, because of the poor efficiency of the Mali GPU kernels and the OmpSs overhead from unnecessary data transfers between host and GPU in the case of OpenCL kernels. Nonetheless, the MPI reference version of OPCODE has shown extremely good scaling on the Mont-Blanc prototype, at least for the artificial dataset used for the benchmark.

References

[1] A. Montani, D. Cesari, C. Marsigli, and T. Paccagnella. Seven years of activity in the field of mesoscale ensemble forecasting by the COSMO-LEPS system: main achievements and open challenges. Technical report, Deutscher Wetterdienst, 2010.
[2] http://www.hp2c.ch/projects/opcode/
[3] Joint CRESTA/DEEP/Mont-Blanc Workshop, 10th–11th June 2013 (BSC, Barcelona, Spain)
[4] E.g., the CAPS compiler: http://www.caps-entreprise.com/products/caps-compilers/
[5] Private communication from Mario Mattia and Ben Cumming of Cray Research Inc.


2.4 MP2C

2.4.1 Description of the code

MP2C is a multi-level simulation program coupling multi-particle collision dynamics (MPC) to molecular dynamics (MD), which enables it to include hydrodynamic interactions in fluid and solvent simulations. For the project, the MPC part was considered for analysis and optimisation and was ported to OmpSs (see Code versions). The basic MPC algorithm relies on sorting the particles describing the fluid into a shifted grid of cells, then calculating the centre-of-mass (com) velocity within each cell and the relative velocities of the particles with respect to that com velocity. Afterwards, a random rotation vector is determined uniquely for each cell. The new particle velocities are then assigned as the sum of the com velocity and the rotated relative velocity of the particle. In order to utilise more than one compute node in a parallel environment, domain decomposition is employed as the basic parallelisation strategy. Since particles are free to move across domain boundaries, information must be exchanged across the edges of each domain in order to update particle data between time steps and to keep the calculated com velocity and rotation vector unique for cells that overlap from one domain to the next.
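The collision step described above can be sketched as follows. This is an illustrative, serial Python sketch, not MP2C's actual Fortran implementation; all function and variable names are hypothetical:

```python
import math
import random
from collections import defaultdict

def rotate(v, axis, alpha):
    """Rodrigues rotation of vector v about the unit vector `axis` by angle alpha."""
    c, s = math.cos(alpha), math.sin(alpha)
    kx, ky, kz = axis
    dot = kx * v[0] + ky * v[1] + kz * v[2]
    cross = (ky * v[2] - kz * v[1], kz * v[0] - kx * v[2], kx * v[1] - ky * v[0])
    return tuple(v[i] * c + cross[i] * s + axis[i] * dot * (1 - c) for i in range(3))

def mpc_collision_step(pos, vel, box, cell_size, alpha=math.pi / 2, rng=random):
    """One MPC collision step: sort particles into a randomly shifted grid of
    cells, compute the centre-of-mass (com) velocity per cell, and rotate each
    particle's relative velocity about one random axis chosen per cell."""
    # Random grid shift (restores Galilean invariance in MPC).
    shift = [rng.uniform(0.0, cell_size) for _ in range(3)]
    cells = defaultdict(list)
    for i, p in enumerate(pos):
        key = tuple(int(((p[d] + shift[d]) % box[d]) // cell_size) for d in range(3))
        cells[key].append(i)
    for members in cells.values():
        n = len(members)
        com = [sum(vel[i][d] for i in members) / n for d in range(3)]
        # One random unit rotation axis, unique to this cell.
        axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(a * a for a in axis))
        axis = [a / norm for a in axis]
        # New velocity = com velocity + rotated relative velocity.
        for i in members:
            rel = tuple(vel[i][d] - com[d] for d in range(3))
            rot = rotate(rel, axis, alpha)
            vel[i] = tuple(com[d] + rot[d] for d in range(3))
    return vel
```

The step conserves momentum and kinetic energy per cell, which provides a convenient correctness check for any ported (e.g. OpenCL) version.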

2.4.2 Code versions

The MPC part, which is the computationally most intensive part of MP2C, was ported to MPI, OmpSs and OpenCL. The original version of the code uses a non-hybrid, MPI-only parallelization. Both versions were benchmarked and tested on the Mont-Blanc prototype.

In the MPI+OmpSs+OpenCL version, the sorting of the particles, the calculation of the com, the calculation of the rotation vectors and the rotation of the particle velocities were implemented as OpenCL kernels. The corresponding parts in the thermostat routine, the boundary conditions and the update of particle velocities and positions were also ported to OpenCL. This means that the construction of the linked-cell lists needed for the collision cells and the cell structure remains on the CPU, as does the inter-domain communication based on MPI.

Figure 2.4.1 Trace of MP2C running on one node MPI+OmpSs+OpenCL


The trace in the figure above shows that the computation on the CPU generally takes longer than the corresponding parts on the GPU. The noticeable reddish parts on the first thread (upper-most line) represent the cell construction within the thermostat routine, while the olive-greyish parts represent the cell construction of the collision-cell algorithm; these parts consume more time than the complete time spent on the GPU. It can also be noticed that the first invocation of the thermostat on the GPU takes noticeably longer than in subsequent steps. This seems to be due to the compilation of the OpenCL kernel, as the time spent stays constant for increasing numbers of MPI ranks.

2.4.3 Problem sizes

For the benchmark, two different test cases were considered. The first corresponds to a weak-scaling run with 5,120,000 MPC particles on each MPI rank. The other, smaller case investigates the strong-scaling behaviour of 5,120,000 MPC particles in total. For the energy consumption test on a single node, a slightly larger system of 7,290,000 MPC particles was used as a stress test.

2.4.4 Weak Scaling

Full Application:

For the weak-scaling benchmarks, the MPI-only and the OmpSs+OpenCL versions were compared for trajectory calculations of 500 time steps, including application of the thermostat at every tenth time step. The system size was fixed at 5,120,000 particles per MPI rank. For technical reasons (explained in the problems section), the OpenCL version was tested up to 128 ranks, while the MPI version was tested up to 512 ranks. In order to obtain comparable results, the MPI version was run with one MPI rank per node, like the OpenCL version.


Figure 2.4.2 Weak scaling efficiency benchmark of the full application

The benchmark revealed that the OpenCL version performs worse overall than the MPI version. The scaling behaviour of the OpenCL version at this point is also not as good as that of the MPI version. This seems to be connected with the behaviour of the first thermostat step (see Fig. 2.4.1 and Fig. 2.4.4), which does not appear to scale at all with increasing numbers of MPI ranks.

2.4.5 Strong Scaling

Full Application:

In order to benchmark the strong-scaling behaviour of the two implementations (MPI vs MPI+OmpSs+OpenCL), a system consisting of 5,120,000 particles was set up and run on an increasing number of MPI ranks. For both versions, tests were performed on up to 16 nodes. As is apparent from Figure 2.4.3, the single-rank performance already differs between the versions, which reflects the overhead due to CPU-GPU data transfer times. Overall, the OmpSs+OpenCL version showed a visible decrease in parallel efficiency for larger numbers of ranks.


Figure 2.4.3 Strong scaling benchmark for MP2C full application

One reason for the different scaling behaviour of the full application is the parallel exchange of data between domains. Data have to be (i) transferred from GPU to CPU; (ii) sent via MPI to neighbouring domains; and (iii) transferred from CPU back to GPU in order to continue the calculations. For strong-scaling problems, the amount of computation performed on the GPU decreases with the shrinking domain size, while the amount of communicated data relative to the data stored on the domain increases. With a decreasing number of particles per process, the performance advantage of the GPU becomes less important due to the relatively high transfer time between CPU and GPU. This results in a strong degradation of the parallel efficiency of the OmpSs+OpenCL code. To a certain extent this behaviour can be seen in Fig. 2.4.4, where the greyish parts indicate the communication, while the olive-greyish and reddish-brownish parts again represent the CPU parts of the cell algorithms. In contrast to Fig. 2.4.1, the communication parts are more visible.
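The growing communication-to-computation ratio can be illustrated with a simple surface-to-volume estimate. This is a sketch assuming cubic domains and a one-cell-wide halo; the default parameter values are illustrative, not measured MP2C values:

```python
def halo_to_bulk_ratio(n_particles, n_ranks, cell_size=1.0, density=10.0):
    """Illustrative surface-to-volume estimate for a cubic domain
    decomposition: ratio of the one-cell-wide halo layer (data exchanged
    with neighbour domains) to the bulk of a domain (data kept locally)."""
    # Edge length of one cubic domain holding its share of the particles.
    volume = n_particles / (n_ranks * density)
    edge = volume ** (1.0 / 3.0)
    bulk = edge ** 3
    # Halo: outer shell of thickness `cell_size` around the domain.
    halo = (edge + 2.0 * cell_size) ** 3 - edge ** 3
    return halo / bulk
```

For a fixed total particle count, the ratio grows as ranks are added, mirroring the observed loss of parallel efficiency in the strong-scaling runs.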

Another factor that massively decreases the scalability is the behaviour of the thermostat during the first step, when the OpenCL kernel for the calculation of the cell properties is called for the first time. The compilation of the kernel appears to take a relatively long time compared to the subsequent calculations. Given the comparatively short total runtime in the trace in Fig. 2.4.4, the problem should become less important for long runs; but since the compilation time seems to be constant, it will limit the achievable scaling.
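A constant, non-scaling overhead such as the one-time kernel compilation bounds the achievable speedup in the same way as a serial fraction in Amdahl's law. A minimal model (the numbers used below are purely illustrative):

```python
def speedup_with_fixed_overhead(t_parallel_1, t_overhead, n):
    """Speedup of a perfectly parallel workload of single-node duration
    t_parallel_1, burdened by a constant, non-scaling overhead t_overhead
    (e.g. a one-time OpenCL kernel compilation) on n nodes."""
    t_one = t_parallel_1 + t_overhead
    t_n = t_parallel_1 / n + t_overhead
    return t_one / t_n
```

The speedup saturates at (t_parallel_1 + t_overhead) / t_overhead, so even a few seconds of constant compile time cap the achievable scaling regardless of node count.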


Figure 2.4.4 Trace of the MPI+OmpSs+OpenCL version with 8 MPI ranks, i.e. 8 nodes, for 20 time steps.

2.4.6 Energy profiling

Fig. 2.4.5 shows the total energy consumption for each of the benchmarks during the strong-scaling tests. The OpenCL version consumes more energy than the MPI version at every core count. This has two reasons: first, the OpenCL version has a longer runtime than the MPI version; and secondly, for higher numbers of MPI ranks the OpenCL version shows worse scaling behaviour than the MPI version.


Figure 2.4.5 Power consumption of MP2C for the strong scaling benchmark (using one rank per core)

In small tests it was seen that for weak-scaling problems the energy consumption correlates with the number of ranks or nodes used. It was therefore investigated how the energy consumption changes when the pure MPI version runs with one or two MPI ranks on a single node, compared to a run of the pure MPI version with two ranks on two nodes, and to runs of the MPI+OmpSs+OpenCL version with one and two ranks. The results are listed in Fig. 2.4.6, which gives the total energy needed to complete the calculations.

version                        energy needed [J]   runtime [s]
MPI, 1 rank/node, 1 rank                  1632.4        206.11
MPI, 1 rank/node, 2 ranks                 1736.1        106.72
MPI, 2 ranks/node, 2 ranks                2213.7        113.42
MPI+OmpSs+OpenCL, 1 rank                  2055.5        243.85
MPI+OmpSs+OpenCL, 2 ranks                 2496.4        142.49

Figure 2.4.6 Energy needed by MP2C to compute 20 time steps with 7,290,000 particles


The table shows that the energy required for the calculation is lowest for the pure MPI version when using only a single rank on a single node. Compared to the MPI versions, the OpenCL versions are not very energy-efficient at the moment. Fig. 2.4.7 shows periodic power peaks in the node power profile over time. For the OpenCL version running with one rank, these are easily explained by looking at the traces in Fig. 2.4.1 and Fig. 2.4.4: whenever the code runs thread-parallel, the power consumption equals that of the MPI version running with two ranks on one node. The other peaks are explained by differences in the computational intensity of the code: some parts of the code are more computationally demanding than others, and when these parts are reached, more power is consumed, as the peaks show.

Figure 2.4.7 Comparison of MP2C energy consumption using one and two MPI ranks on a single node

2.4.7 Report on scaling and energy profiling

The benchmarks showed that MP2C scales on the Mont-Blanc prototype. The original version shows better results here than the OpenCL port. This is due to the fact that the sorting of the particles into the collision cells cannot be fully implemented on the GPU, and to the restricted scalability of the thermostat caused by the first time step.


For the cell-list creation, in order to sort the particles into cells, one can either distribute the particles between threads, risking collisions whenever two or more threads sort their particles into the same cell, or distribute the cells between the work items, which requires every thread to check every particle and effectively serialises the work. In the current implementation, the particles are therefore distributed between threads; for each particle the correct cell is calculated, and the result is transferred back to the host. There, the particles are collected into a single list per cell, and these lists are then distributed over the threads in order to calculate the com velocity and rotation vector for each cell and the velocity rotation for each particle.
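The two-stage scheme described above — a per-particle "kernel" that computes cell indices without thread collisions, followed by a host-side gather into per-cell lists — can be sketched as follows. This is a serial Python stand-in for the OpenCL kernel; the function names are illustrative, not MP2C's:

```python
from collections import defaultdict

def cell_index_kernel(positions, cell_size, ncells):
    """Stand-in for the GPU kernel: one 'work item' per particle computes
    that particle's linear cell index. Each particle writes only its own
    entry, so there are no thread collisions."""
    indices = []
    for x, y, z in positions:
        ix = int(x // cell_size) % ncells
        iy = int(y // cell_size) % ncells
        iz = int(z // cell_size) % ncells
        indices.append((ix * ncells + iy) * ncells + iz)
    return indices

def gather_cell_lists(indices):
    """Host side: collect particle ids into one list per cell, ready to be
    redistributed onto GPU threads with one cell per thread."""
    cells = defaultdict(list)
    for particle_id, cell in enumerate(indices):
        cells[cell].append(particle_id)
    return cells
```

The gather step is inherently sequential per cell, which is why it remains on the CPU in the scheme described above.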

Additionally, the scaling behaviour is impacted by the fact that, in order to exchange particle information at the edges of the domains, the particle data have to be retrieved from the GPU, sent to the corresponding process, and uploaded to the GPU again. This procedure is applied three times per time step: first to exchange particles that moved from one domain to the next, second to exchange particle information at the domain borders so as to fill the collision cells consistently, and last to send the correctly rotated velocities back to the origin domain. For each of these exchanges, the particle data need to be on the CPU, requiring the aforementioned transfer procedure.

As mentioned before, the seemingly sequential influence of the first thermostat step prevents scaling to a high number of ranks, as in the strong-scaling case: the calculations become faster on a high number of ranks, but the first step always takes as long as on a low number of ranks.

2.4.8 Synthesis & Best practices

During the porting effort from the MPI to the OpenCL version, it became clear that it is crucial to avoid unnecessary data transfers between host and GPU. It was also seen that, in order to achieve the best possible energy efficiency, it is advisable to rely on hybrid programming, e.g. MPI+OmpSs or MPI+OpenMP, in order to utilise both cores of each node while reducing the runtime of the code sufficiently to counter the increased energy demand.


2.5 PEPC

2.5.1 Description of the code

PEPC is an N-body solver for Coulomb or gravitational systems. Diverse user communities in areas such as warm dense matter, magnetic fusion, astrophysics, complex atomistic systems and vortex fluid dynamics use it. Current projects use PEPC for laser- or particle-beam-plasma interactions, plasma-vacuum and plasma-wall interactions in tokamaks, simulating fluid turbulence using the vortex particle method, and investigating planet formation in circumstellar discs consisting of gas and dust.

PEPC is based on the generic Barnes-Hut tree algorithm and exploits multipole expansions to create hierarchical groupings of more distant particles. This reduces the computational effort of the force calculation from the generally unaffordable O(n²) operations needed for brute-force summation to a more amenable O(n log n) complexity.
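For illustration, the classic Barnes-Hut acceptance test that underlies this reduction can be sketched as follows. This is the generic s/d < θ criterion; PEPC's actual multipole acceptance criterion may differ in detail:

```python
import math

def accept_multipole(cell_size, cell_com, particle_pos, theta=0.5):
    """Classic Barnes-Hut acceptance test: a cell of edge length s at
    distance d from the particle is treated as a single multipole if
    s / d < theta; otherwise the tree walk descends into its children."""
    d = math.dist(cell_com, particle_pos)
    return d > 0.0 and cell_size / d < theta
```

Distant cells pass the test and contribute one multipole interaction each, which is what turns the quadratic pairwise sum into an O(n log n) tree walk.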

The code is open source, developed at Juelich [1] within the Simulation Laboratory Plasma Physics [2] under the GPL license. PEPC is written in Fortran 2003 with C wrappers to enable pthreads support, thus making use of a hybrid MPI/pthreads programming model [3]. A version supporting OmpSs and OpenCL kernels has also been created for the Mont-Blanc project. The only external dependency is a library for parallel sorting written in C, which is included in the source tree. The different applications of PEPC are split into different front-ends and can be combined with different interaction-specific modules.

Figure 2.5.1 PEPC Structure of the Treecode Framework


Currently, PEPC supports three interaction-specific modules and several respective front-ends.

The vanilla MPI/pthreads implementation of PEPC allows for excellent scaling of the code on different architectures. The prime example is a run on the IBM Blue Gene/Q JUQUEEN with more than 64 billion particles across up to 458,752 cores and 1,668,196 parallel threads, which earned PEPC its High-Q Club status [4].

Figure 2.5.2 Weak scaling of PEPC on JUQUEEN.

2.5.2 Code versions

As mentioned above, PEPC comes in two flavours for the Mont-Blanc project:

'Vanilla' PEPC: this is the main version of the code; it is very portable and performs well on any system. It makes use of a hybrid programming model via MPI and pthreads and is tuned towards many threads. In each time step, the main thread spawns two threads for handling communication and a user-given number of worker threads that perform the tree-walk and compute individual interactions on the fly. On Blue Gene/Q, the optimum is 60 worker threads, filling all available hardware threads (including the threads for communication and MPI). The Mont-Blanc prototype, with only two cores per node, poses some limitations, but we nevertheless found four worker threads to perform best for high particle loads per node, with thread placement left to the OS.

'Mont-Blanc' PEPC: the initial porting of PEPC for Mont-Blanc started with an OpenACC add-on to explore the GPU capabilities of PEPC (see previous deliverables). This has now been fully converted to an MPI/OmpSs version where the functions previously running within pthreads are used as tasks for OmpSs and the OpenACC GPU kernel is reflected in an OpenCL kernel. The former accelerator thread controlling the OpenACC kernels is now a continuously running task that is active during the execution of PEPC. Its role is to submit OpenCL kernels as tasks to the GPU and initiate transfers of the interaction


lists collected by the worker threads during their tree-walks. Those lists are then computed by the GPU kernel, which also performs most of the reductions over the previously individual interactions (now work items on the GPU). This version has been used to experiment with shared memory via OmpSs API calls to avoid data copies. Making use of shared memory did not provide an immediate positive effect on the execution time but resulted in wrong computations instead; the reasons remain unclear for now. Since the number of threads/tasks is increased by one compared to 'Vanilla' PEPC, and the OmpSs runtime requires an additional thread to manage GPU kernel executions, the two cores of the Mont-Blanc prototype are heavily oversubscribed by this version. This is believed to increase the runtimes and was used to experiment with different runtime libraries provided by BSC. The starting point was a runtime switch to disable the CPU binding of tasks, allowing OmpSs tasks to also use the core otherwise exclusively claimed by the OmpSs GPU runtime thread. Although this runtime switch proved very beneficial for the execution times, performance remained behind that of the vanilla version. We also followed BSC's advice and reduced the number of concurrent tasks by abandoning the continuously active task controlling PEPC's GPU data and kernel executions; instead, the worker tasks submit GPU tasks straight from the tree-walk and block if no buffers on the GPU are available. Unfortunately, this version also exhibits wrong computations and is still under investigation by BSC. We therefore use the better-performing 'Vanilla' PEPC for the scaling and energy profiling reports below.

2.5.3 Problem sizes

PEPC has a modular structure for different physics applications and several front-ends to suit specific problems. We chose the Coulomb interaction and a benchmarking front-end called 'pepc-essential' for our tests. This setup initialises PEPC with a random cubic particle cloud filling the simulation box. The number of particles as well as the number of time steps and worker threads is user-specified. We picked 10 time steps to keep executions short and disabled I/O to focus on the computational kernels. The number of worker threads was varied between 1 and 5 to find the most efficient and best-scaling combination. Experience with JUQUEEN shows that several thousand particles per worker thread are necessary to achieve good scalability. We therefore performed tests with 20,000, 40,000, 80,000 and 160,000 particles per rank, scaling the total number of particles for the weak-scaling tests while keeping it fixed at 2,560,000 particles for the strong-scaling test.


2.5.4 Weak Scaling

The weak-scaling results are shown in Figures 2.5.3 and 2.5.4. As expected from the JUQUEEN results, the scalability increases with the number of particles per node. The Mont-Blanc prototype allowed tests on up to 640 nodes, although most of the time the stability of the cluster did not permit tests larger than 512 nodes.

Figure 2.5.3 Weak scalability of PEPC on the Mont-Blanc prototype

With 4 worker threads (plus communication and main threads) per node and the given number of particles per MPI rank. The solid line shows the O(N log N) behaviour, i.e. the theoretical ideal case.

Figure 2.5.4 Efficiency of PEPC from previous figure
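The efficiency relative to the O(N log N) ideal can be computed from the measured runtimes as in the sketch below (illustrative; it assumes runtimes keyed by node count and a fixed particle count per node, which is not PEPC's actual analysis script):

```python
import math

def weak_scaling_efficiency(runtimes, particles_per_node):
    """Efficiency of weak-scaling runs relative to the O(N log N) ideal.
    runtimes: {n_nodes: seconds}. Work per node grows as log(N) with the
    total particle count N, so the ideal runtime is the smallest run's
    time scaled by log(N) / log(N_base)."""
    base_nodes = min(runtimes)
    base_n = base_nodes * particles_per_node
    efficiency = {}
    for nodes, measured in runtimes.items():
        n_total = nodes * particles_per_node
        ideal = runtimes[base_nodes] * math.log(n_total) / math.log(base_n)
        efficiency[nodes] = ideal / measured
    return efficiency
```

An efficiency of 1.0 means the run matches the O(N log N) ideal exactly; values below 1.0 indicate parallel overheads beyond the tree algorithm's intrinsic growth.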


2.5.5 Strong Scaling

The strong-scaling example shown below connects to some of the weak-scaling measurements from the previous section. We use a fixed total of 2.56 million particles, which coincides with runs of 20, 40, 80 and 160 thousand particles per rank on 128, 64, 32 and 16 nodes, respectively. Since PEPC only runs efficiently above a certain minimum number of particles per thread, strong scaling must eventually break down; this is clearly visible in Figure 2.5.5 beyond about 100 nodes.

Figure 2.5.5 Strong scaling of PEPC on the Mont-Blanc prototype

The dotted line represents ideal scaling; the solid green line shows the energy consumed by the runs. These runs used 2 worker threads only, with thread pinning enabled (see below).

When evaluating the best number of threads to use, we kept the number of particles and nodes fixed, so we report it here with the other strong-scaling results. Figure 2.5.6 shows the runtimes of PEPC for two particle numbers on 32 and 128 nodes with a varying number of threads. The runtime clearly varies with the number of threads and suggests that oversubscribing the CPU cores is beneficial for this MPI/pthreads application.


Figure 2.5.6 PEPC with a varying number of threads for a fixed number of particles

and nodes.

We also tested whether pinning threads to specific CPU cores provides even better results. The rationale is that communication threads should run uninterrupted in order to handle any communication immediately. We tested a varying number of worker threads on one CPU core, with any thread performing communication bound to the other core. The runtime dependence on the number of threads then disappears, but without a clear improvement in runtime compared to the unpinned version of PEPC. The only cases where pinning helped were for very low particle numbers, where PEPC's communication scheme is no longer able to hide latencies.

2.5.6 Energy profiling

The energy profiling was performed for all of the scaling tests. We start by looking at individual power figures resolved over time. Figure 2.5.7 shows the reported Watts for each node (in this case 16, for brevity), where it is possible to identify the 10 time steps of a representative run.


Figure 2.5.7 Watts per node over time for one PEPC run with 2 worker threads on 16 nodes. The vertical lines indicate the actual PEPC execution; the times before and after provide an idle measurement of the nodes.

The next figure shows the estimated energy consumed by one of the scaling tests, averaged over the nodes that reported their power consumption, together with an envelope of the minimum and maximum energy used by the nodes (omitting zero results). Since our power measurements always included an idle period before and after the PEPC run (see Figure 2.5.7), we estimated the 'idle energy' consumed in those regions and subtracted this estimate from the total energy, giving a figure for the energy consumed by PEPC alone.
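One way to implement this idle-energy subtraction is sketched below (illustrative Python, not the actual analysis tooling; it assumes a per-node power trace sampled at known times):

```python
def app_energy(times, watts, t_start, t_end):
    """Energy attributable to the application alone: integrate the node's
    power trace over the run window [t_start, t_end] (trapezoidal rule)
    and subtract an idle baseline estimated from samples outside it."""
    # Idle baseline: mean power of samples before and after the run.
    idle = [w for t, w in zip(times, watts) if t < t_start or t > t_end]
    idle_power = sum(idle) / len(idle)
    # Trapezoidal integration of power over the run window.
    total = 0.0
    for (t0, w0), (t1, w1) in zip(zip(times, watts), zip(times[1:], watts[1:])):
        if t0 >= t_start and t1 <= t_end:
            total += 0.5 * (w0 + w1) * (t1 - t0)
    return total - idle_power * (t_end - t_start)
```

For a node idling at 5 W that draws 10 W during a 4 s run, this yields the 20 J of extra energy attributable to the application.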

Figure 2.5.8 PEPC Runtime and estimated energy of one of the weak scaling tests


Shown are the average, minimum and maximum energy consumed per node.

We also estimated the total energy consumed by the strong-scaling runs. It is an estimate because in some cases power figures could not be retrieved from all nodes; we therefore took the average and multiplied it by the number of nodes used. The resulting total energy is shown as the solid line in Figure 2.5.5, which can be used to identify the most power-efficient number of nodes. The power figures obtained in the weak-scaling tests showed the expected proportional increase in consumed energy with increasing runtime. Since the workload per node is kept fixed, no variation of power usage per node is expected for different numbers of nodes.

2.5.7 Report on scaling and energy profiling

PEPC's scaling is known to be very good once a certain minimum number of particles per thread or per MPI rank is used. Since PEPC has separate threads for communication, keeping the worker threads busy with enough particles nicely overlaps computation and communication and hides latencies. Experiments with forced thread pinning improved this situation only for very low particle numbers. In general, 4 worker threads plus 2 communication threads with OS-controlled thread placement worked best for a sufficient number of particles per rank. Figure 2.5.9 shows the efficiency of PEPC (relative to the theoretical ideal) for weak scaling and 4 different particle numbers, with good results. Figure 2.5.10, on the other hand, shows a strong-scaling test in the best case of 4 worker threads, with the particle number kept fixed. After approximately 100 nodes, scaling breaks down. The figure also includes one example with fewer threads, pinned so as to separate compute and communication threads, which shows better performance for low particle numbers.

Figure 2.5.9 Weak scaling efficiency of PEPC


Larger particle numbers per MPI rank/node show better scaling properties.

Figure 2.5.10 PEPC Strong scaling experiments with 2.56 million particles

with 4 worker threads (OS thread placement) and 2 worker threads (explicit thread placement). The solid lines without symbols also show an estimate of the energy used by those runs.

The energy profiling suffered from irregularities on the Mont-Blanc prototype, with some nodes running at substantially lower power and hence lower performance. Those cases were easily revealed by looking at histograms of the power usage, plotting power vs. runtime, or examining the performance of PEPC, which was sensitive to the individual nodes' speeds. PEPC has built-in load balancing that currently takes into account only the particle counts and the number of interactions computed, but ignores runtime differences between nodes; hence a few slower nodes affect the overall performance. These problems improved significantly over time and eventually power measurements were successful, as shown for example in Figure 2.5.10. No influence of the particle number or the number of worker threads was observed; hence the energy used by a run should depend only on the runtime and the number of nodes used.
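The interaction-count-based load balancing mentioned above can be sketched schematically. The following Python model is purely illustrative (PEPC's actual Fortran implementation differs); `balanced_partition` is a hypothetical helper that splits particles into contiguous chunks of roughly equal interaction weight, which is exactly what makes PEPC blind to per-node speed differences:

```python
def balanced_partition(weights, nranks):
    """Split per-particle interaction counts into nranks contiguous chunks
    of approximately equal total weight (greedy prefix-sum split)."""
    total = float(sum(weights))
    target = total / nranks
    bounds, acc, cut = [0], 0.0, 1
    for i, w in enumerate(weights, start=1):
        acc += w
        if cut < nranks and acc >= cut * target:
            bounds.append(i)   # close the current chunk here
            cut += 1
    bounds.append(len(weights))
    return [weights[bounds[k]:bounds[k + 1]] for k in range(nranks)]

# Uniform interaction counts: every rank receives the same number of particles,
# regardless of how fast each node actually is.
chunks = balanced_partition([3] * 8, 4)
assert [len(c) for c in chunks] == [2, 2, 2, 2]
```

A speed-aware variant would scale each rank's target weight by the node's measured throughput instead of using a uniform `target`.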


2.5.8 Synthesis & Best practices

Over the course of the project, the porting of 'Mont-Blanc' PEPC suffered several delays due to shortcomings of the initial OmpSs runtime or necessary additions to the OmpSs compiler. Working together with BSC solved most issues or resulted in work-arounds. It also led to several stages of the ported code with varying numbers of helper threads and degrees of OmpSs taskification. All this hindered specific tuning approaches that would especially benefit the ARM architecture.

The porting of PEPC to OmpSs was helped by the modular structure of PEPC and its hybrid programming model, which was available from the start. The detour via OpenACC before investing in OmpSs with OpenCL can be part of a best-practice advice: OpenACC is often easier and quicker to implement and helps assess the GPU capabilities of a code. The positive outcome here suggested that PEPC would profit from GPU kernels. Another best practice is to investigate thread pinning for hybrid applications. This appears especially important for OmpSs applications demanding more concurrent tasks than there are CPU cores available. The GPU version of PEPC is still under development and has not been used for performance tests, since the latest version with the most adaptation to the Mont-Blanc prototype did not produce correct simulation results.
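The thread-pinning best practice can be experimented with independently of OmpSs. A minimal Linux sketch follows, written in Python for brevity (the actual PEPC runs placed Fortran worker and communication threads via the OS or the runtime, not via Python); it pins the calling process to a chosen core and verifies the affinity mask:

```python
import os

def pin_to_cores(cores):
    """Restrict the calling process to the given CPU cores.

    Linux-only: wraps sched_setaffinity, the same kernel mechanism used by
    taskset/numactl when pinning MPI ranks and their worker threads.
    """
    os.sched_setaffinity(0, set(cores))    # 0 = the calling process
    return os.sched_getaffinity(0)         # report the resulting mask

# Pin to core 0 (always present) and check that the mask took effect.
assert pin_to_cores([0]) == {0}
```

Separating compute and communication threads onto distinct cores is the explicit-placement case shown in Figure 2.5.10.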

The lack of detailed traces (partly due to stability problems of the prototype, partly because the main development happened on JUDGE, with limited applicability to the prototype) allows us only to guess that the performance of 'Mont-Blanc' PEPC on the Mont-Blanc prototype is not as good as that of 'Vanilla' PEPC because of the granularity and number of the tasks. Traces on JUDGE seemed to indicate that GPU performance suffers from long memory transfers between CPU and GPU (relative to the GPU kernel time). However, the fact that the prototype has shared memory between GPU and CPU did not yet show the expected effect. Best practice should be to monitor the states of the OmpSs runtime and the times spent therein, and to aim for a large task granularity.

Obtaining scaling results and power figures was more complex than expected. While the stability of the cluster improved over time, it remained an issue when submitting several tens of jobs. Observed effects ranged from jobs terminating due to filesystem or network/MPI problems to largely varying runtimes and power figures. Since the prototype is under constant improvement, it is also difficult to obtain reproducible results that allow clear conclusions about which code changes are advisable.

References

[1] http://www.fz-juelich.de/ias/jsc/pepc

[2] http://www.fz-juelich.de/ias/jsc/slpp
[3] M. Winkel, R. Speck, H. Hübner, L. Arnold, R. Krause, P. Gibbon, A massively parallel, multi-disciplinary Barnes-Hut tree code for extreme-scale N-body simulations, Comp. Phys. Comm., 183 (4) 880
[4] PEPC, High-Q Club, http://www.fz-juelich.de/ias/jsc/EN/Expertise/High-Q-Club/PEPC/_node.html


2.6 Quantum Espresso

2.6.1 Description of the code

QUANTUM ESPRESSO[1] is an integrated suite of computer codes for electronic-structure calculations and materials modelling at the nanoscale, based on density-functional theory, plane waves, and pseudo-potentials (norm-conserving, ultrasoft, and PAW). QUANTUM ESPRESSO stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimization. Quantum ESPRESSO is an initiative of the DEMOCRITOS National Simulation Center (Trieste) and SISSA (Trieste), in collaboration with the CINECA National Supercomputing Center, the Ecole Polytechnique Fédérale de Lausanne, Université Pierre et Marie Curie, Princeton University, and Oxford University. Courses on modern electronic-structure theory with hands-on tutorials on the Quantum ESPRESSO codes are offered on a regular basis in collaboration with the Abdus Salam International Centre for Theoretical Physics in Trieste. Further details are available at the code websites http://www.quantum-espresso.org and http://www.qe-forge.org.

Quantum ESPRESSO (QE) is used both in academia and in industry. QE is mainly written in Fortran90, but it contains some auxiliary libraries written in C and Fortran77. The whole distribution is approximately 500K lines of code, although the core computational kernels (CP and PW) are roughly 50K lines each. Both data and computations are distributed hierarchically across the available processors, ending up with multiple parallelization levels that can be tuned to the specific application and to the specific architecture. More in detail, the various parallelization levels are organized into a hierarchy of processor groups, identified by different MPI communicators. Each task can take advantage both of shared-memory nodes, using OpenMP parallelization, and of NVIDIA accelerators, thanks to the CUDA drivers implemented for the most time-consuming subroutines. The QE distribution is self-contained by default; all that is needed is a working Fortran and C compiler. Nevertheless, it can be linked with the most common external libraries, such as FFTW, MKL, ACML, ESSL, ScaLAPACK and many others. External libraries for the FFT and linear algebra kernels are necessary to obtain optimal performance. QE contains dedicated drivers for the FFTW, ACML, MKL, ESSL, SCSL and SUNPERF FFT-specific subroutines.

Quantum ESPRESSO is not an I/O-intensive application; it performs significant I/O only at the end of the simulation, to save the electronic wave functions used both for post-processing and as a checkpoint for restart. Consequently, I/O activity is also expected at the beginning of the simulation in a restart run. Each task saves its own block of data using Fortran direct I/O primitives.

The code has been ported to almost all platforms. Its scalability depends very much on the simulated system. Usually, on architectures with a high-performance interconnect, the code shows strong scalability over two orders of magnitude in processor count (e.g. between 1 and 100), considering a dataset size that saturates the memory of the nodes used as the baseline for computing the relative speed-up. On


the other hand, the code shows good weak scalability. Recently, on a large simulation, good scalability up to 65K cores was obtained. The figure below shows the performance of QE running a significant dataset on a BG/Q system (FERMI at CINECA) from 2048 cores (with 4096 threads) to 32768 cores (with 65536 virtual cores). Each coloured bar corresponds to one of the major subroutines of the code (see figure below).

Figure 2.6.1 Scalability of the CP kernel of QE on BG/Q system using the

CNT10POR8 benchmark

2.6.2 Code version

To reduce possible porting problems, we configured QuantumESPRESSO to use all its internal libraries rather than external ones. QuantumESPRESSO is in fact self-contained, and external libraries can be introduced as an optimization step. We therefore did not link the code against external BLAS, FFT, LAPACK, or ScaLAPACK libraries.

As described above, the source code of QuantumESPRESSO is mainly Fortran90 with a small subset of C. Moreover, compilation of the whole package with gfortran and gcc is routinely checked. We did not find any problem compiling the code on the ARM architecture, and the original code was compiled with MPI and OpenMP. To validate the porting we selected a well-known test case (a water molecule), already used on many other systems and with different codes.

To profile the code, we used the internal profiling feature of QuantumESPRESSO as well as the Extrae tool, allowing us to monitor the performance of the most time-consuming subroutines and to compare them with the behaviour on other machine architectures. Among these subroutines, the 3D-FFT is computed in parallel by distributing the z-axis, i.e. each processor takes a subset of the total number of planes. This well-known distribution scheme implies a global ALLTOALL operation, making the quality of the communication subsystem a key factor for performance. To this end, we report in Figure 2.6.2 the wall-clock

[Figure 2.6.1 data: seconds/step for the subroutines calphi, dforce, rhoofr, updatc and ortho, from 2048 to 32768 real cores (4096 to 65536 virtual cores, 1 to 16 band groups)]


times of the communication and computation phases vs. the number of tasks of cp.x running on Tibidabo.

Figure 2.6.2 QE Time spent in 3D-FFTs varying the number of tasks

The trend of the timings shows that the on-core computations of the 3D-FFTs scale almost linearly (small deviations from linearity being possibly due to the limited number of planes per processor), while the communication time rises steadily as the number of nodes (and messages) increases. Although this behaviour is common to many similar applications, it paved the way for the most suitable application of the OmpSs compiler which, relying on the Nanox runtime, could optimize the overlap of the communication and computation phases of cp.x.
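The z-axis distribution described above can be sketched as follows. This is an illustrative Python model, not the actual QE routine; `plane_range` is a hypothetical helper computing which contiguous block of z-planes each processor owns under a standard block distribution:

```python
def plane_range(nz, nranks, rank):
    """First plane and plane count owned by `rank` when nz z-planes are
    block-distributed over nranks processors (remainder to the first ranks)."""
    base, rem = divmod(nz, nranks)
    start = rank * base + min(rank, rem)
    count = base + (1 if rank < rem else 0)
    return start, count

# 10 planes over 4 ranks: counts 3, 3, 2, 2 with full, non-overlapping coverage.
ranges = [plane_range(10, 4, r) for r in range(4)]
assert ranges == [(0, 3), (3, 3), (6, 2), (8, 2)]
assert sum(c for _, c in ranges) == 10
```

When `nranks` approaches `nz`, some ranks hold a single plane, which matches the deviation from linearity noted above for a limited number of planes per processor.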

This particular behaviour of cp.x, as well as of pw.x, was also revealed by the activities carried out in WP3, which identified the most-used low-level routines as those related to GEMM and 3D-FFT operations. We therefore began porting to OmpSs the ZGEMM() function of the phiGEMM library as well as the main FFT driver in CP/PW. Following the porting best practices gained after two years of experience with the Mont-Blanc development environment, we first identified the code of the two drivers (GEMM and FFT) as the zones where the calls should be taskified. Then, after wrapping the call structure in an F90 interface defining the CPU and GPU subroutines, we finalized the porting by translating the .cu (CUDA) code to OpenCL. Keeping in mind the final integration of these kernels into the full application, we turned to the literature on QuantumESPRESSO and identified a paper in which a porting and comprehensive performance analysis of the GEMM and FFT functions on hybrid GPU-based HPC architectures was carried out[2]. According to this work, when using PWscf of QuantumESPRESSO, the GEMM and 3D-FFT operations account for more than 70% of the total CPU time when running the DEISA AUSURF112 test case. This study, among others, further confirms that in D3.1 we correctly identified the small-size kernels eligible to be


ported to OmpSs, but it also provided a pathway for migrating the kernels into the full application. In the paper, the authors embedded the GEMM operations into the phiGEMM library and ported the 3D-FFT calls in the vloc_psi() code section to the NVIDIA CUFFT library. This was accomplished by creating a proper set of stub functions to integrate the Fortran calls with the C calling interface of the CUBLAS and CUFFT libraries. This methodology allowed us to clearly pinpoint the references to the CUDA kernels in the sources and to immediately identify the code sections eligible to be taskified with OmpSs. At the end of the porting activities in WP4, we have four versions of QuantumESPRESSO PWSCF on the Mont-Blanc prototype:

1. The original version of PWSCF ported to ARM, using MPI with standard (ATLAS) libraries;

2. The original version of PWSCF ported to ARM, using MPI with optimized (SCALAPACK) libraries;

3. The MPI version plus OmpSs directives, running on the ARM cores in SMP mode, with SCALAPACK;

4. The MPI version plus OmpSs directives (FFT) and OpenCL kernels (new ZGEMM), running on the ARM cores and the Mali GPU.

2.6.3 Problem sizes

To better characterize the performance of pw.x, we set up several input datasets aimed at revealing the runtime behaviour of the code on the Mont-Blanc prototype in both weak and strong scalability benchmarks. Among these, we selected: (i) a crystal system of 18 Si atoms for weak scaling benchmarking, setting the number of SCF iterations to 32 and increasing the number of K-points from 2 to 10 on 32-160 SDBs; (ii) the AUSURF112 PRACE benchmark[3], made of 112 Au atoms, with a fixed number of SCF iterations (1 or 21) and K-points (2).

2.6.4 Weak Scaling

QuantumESPRESSO cp.x and pw.x are well known to have almost linear weak scalability. In the figure below, we report the weak scalability of PWSCF when running the Si18 input dataset on a reference architecture based on Intel processors and K20 GPUs, varying the number of SCF iterations (the electron_maxstep input parameter) and the number of K-points at a fixed number of computing elements.


Keeping the number of SCF iterations constant at 2, we carried out a weak scaling benchmark of pw.x with the Si18 input using 32-160 SDBs (2 MPI processes per SDB) of the Mont-Blanc prototype. The code version used was the original code linked with the ATLAS libraries, as made available on the Mont-Blanc prototype, at various values of the K-points parameter.


2.6.5 Strong Scaling

To measure the strong scalability of pw.x, the AUSURF112 benchmark was used. In the figures below, we report the strong scalability of the Electrons part of PWSCF when running the AUSURF112 input dataset on the Mont-Blanc prototype. Strong scaling measurements were carried out using the newly developed version of PWSCF containing the PP-ZGEMM and FFT codes, compared with the reference version (using SCALAPACK and task groups) on the larger AUSURF112 input.

We measured a strong-scaling speed-up of up to 4 with the newly developed version of PWSCF, compared to 2 for the reference version. The speed-up is dominated by the FFT part, while the GEMM part performs very well in the newly developed OmpSs version.

[Figure: QuantumESPRESSO AUSURF112 speed-up, 2 threads per MPI process, original code (MPI) + SCALAPACK, strong scalability; curves ELECTRONS, SUM_BAND and linear vs. 16-128 SDBs/MPI processes]

[Figure: QuantumESPRESSO AUSURF112 speed-up, 2 threads per MPI process, MPI + OmpSs (FFT/GEMM) + SCALAPACK, strong scalability; curves ELECTRONS, C_BANDS, SUM_BAND and linear vs. 16-128 SDBs/MPI processes]


2.6.6 Energy profiling

Energy profiling was performed on the original and new versions of the PWSCF code, using AUSURF112 as input and testing on 128 SDBs (1 process per SDB); the results are reported in the next figure.

The results clearly show that, despite a 10-15% higher power consumption, the new MPI+OmpSs (SMP) version achieves a factor of 2 better energy-to-solution than the original MPI version of PWSCF.
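Energy-to-solution is simply average power multiplied by runtime, which is why a moderately higher power draw can still win overall. A sketch with purely illustrative numbers (not the measured prototype figures):

```python
def energy_to_solution(avg_power_w, runtime_s):
    """Energy-to-solution in joules (= Ws): average power times runtime."""
    return avg_power_w * runtime_s

# Illustrative only: 15% more power but half the runtime -> ~43% less energy.
e_ref = energy_to_solution(100.0, 1000.0)  # slower reference version
e_new = energy_to_solution(115.0, 500.0)   # faster, slightly hungrier version
assert e_new < e_ref
assert abs(e_new / e_ref - 0.575) < 1e-12
```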

2.6.7 Report on scaling and energy profiling

The global scaling of the MPI+OmpSs PWSCF is broadly satisfactory. The performance obtained shows that the global scaling is dominated by the c_bands kernel (which is FFT-bound), while very good scaling is measured for the new PP-ZGEMM routine (sum_band kernel). A better global speed-up could be obtained by taking advantage of the Mali GPU, since the current version of the kernels does not use it at all. The lack of fully optimised FFT and linear algebra kernels for the Mali GPU is at present the main bottleneck for QE strong scalability and overall performance.

2.6.8 Synthesis and best practices

We had different experiences while improving the performance of the two most important kernels of QuantumESPRESSO (GEMM and FFT) over the course of the project. The OmpSs implementation of GEMM provides very good performance, but we could not fully optimise the FFT.

There is a need for scientific libraries optimised for the underlying platform. To properly exploit the Mali GPU, we need scientific libraries optimised for its architecture. Having had GEMM and FFT operations tailored for the Mali

[Figure: QuantumESPRESSO/PWSCF, AUSURF on 128 SDBs: power consumption (mW) vs. execution time (s) per SDB, over roughly 0-1750 s, for the OmpSs(SMP) and reference versions]


GPU would have eased the porting and probably have led to better performance than that achieved so far.

References

[1] Paolo Giannozzi, Stefano Baroni, Nicola Bonini, Matteo Calandra, Roberto Car, Carlo Cavazzoni, Davide Ceresoli, Guido L. Chiarotti, Matteo Cococcioni, Ismaila Dabo, Andrea Dal Corso, Stefano de Gironcoli, Stefano Fabris, Guido Fratesi, Ralph Gebauer, Uwe Gerstmann, Christos Gougoussis, Anton Kokalj, Michele Lazzeri, Layla Martin-Samos, Nicola Marzari, Francesco Mauri, Riccardo Mazzarello, Stefano Paolini, Alfredo Pasquarello, Lorenzo Paulatto, Carlo Sbraccia, Sandro Scandolo, Gabriele Sclauzero, Ari P. Seitsonen, Alexander Smogunov, Paolo Umari, and Renata M. Wentzcovitch. QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials. Journal of Physics: Condensed Matter, 21(39):395502 (19pp), 2009.

[2] F. Spiga and I. Girotto, phiGEMM: a CPU-GPU library for porting Quantum ESPRESSO on hybrid systems, 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing (IEEE 2012), DOI: 10.1109/PDP.2012.72.

[3] See http://www.prace-ri.eu/IMG/pdf/d7.4_3ip.pdf, Test Case A from page 37.


2.7 SMMP

2.7.1 Description of the code

SMMP is used to study the thermodynamics of peptides and small proteins using Monte Carlo methods. It uses a protein model with fixed bond lengths and bond angles, reducing the number of degrees of freedom significantly while maintaining a complete atomistic description of the protein under investigation. Currently, four different force fields, which describe the interactions between atoms, are available. The interaction with water is approximated with the help of implicit solvent models. There are two levels of parallelism:

1. The energy function, which is called every time a new Monte Carlo update is evaluated, can be calculated in parallel.

2. Algorithms such as parallel tempering work on multiple copies of the system in parallel and exchange information between the systems at given intervals.

Most of the CPU time is spent evaluating the energy. Parallelising the energy calculation is therefore important to obtain sufficient statistics in a reasonable amount of time. The scaling of the energy function is always evaluated for a particular protein, and only strong scaling can be considered. For the parallel tempering algorithm, on the other hand, doubling the number of processors usually means doubling the number of replicas. It therefore fits a weak scaling model.

2.7.2 Code versions

Two versions were implemented and tested on the Mont-Blanc prototype. The first version is a pure MPI implementation: both the calculation of the energy and parallel tempering are implemented using MPI. The second version uses OmpSs and OpenCL to implement the calculation of the energy, while MPI is used for parallel tempering.

2.7.3 Problem sizes

Q80 is a poly-glutamine sequence consisting of 1283 atoms. Long poly-glutamine stretches are common in many disease-related proteins.

2.7.4 Weak Scaling

We measured the time for 1, 4, 16, and 64 replicas. Each replica uses 16 MPI tasks and 8 nodes.


Figure 2.7.1 SMMP Weak-scaling plot for pure MPI version.

There is a slight increase in compute time with increasing number of replicas. The

dashed line is a guide to the eye. Each replica uses 16 MPI tasks on 8 nodes.

2.7.5 Strong Scaling

We measured the time for 1, 2, 4, 8, 16, 32, and 64 MPI tasks with 1 or 2 tasks per

node.

Figure 2.7.2: SMMP Strong-scaling plot.

The scaling falls off rather quickly and is down to about 50% efficiency at 16 MPI tasks. The data is normalized to a single MPI process running on a single node.


Figure 2.7.3: SMMP Parallel efficiency of strong scaling.

The parallel efficiency drops off quickly and reaches 50% at about 16 MPI tasks.
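Strong-scaling parallel efficiency as plotted here is the single-task time divided by N times the N-task time. A short sketch with illustrative numbers:

```python
def parallel_efficiency(t1, tn, n):
    """Strong-scaling efficiency: speed-up (t1/tn) divided by task count n."""
    return t1 / (n * tn)

# A run that is only 8x faster on 16 tasks has 50% parallel efficiency,
# matching the behaviour reported for SMMP at 16 MPI tasks.
assert parallel_efficiency(100.0, 12.5, 16) == 0.5
```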

2.7.6 Energy profiling

Figure 2.7.4 SMMP power profile of the benchmark


One and two MPI tasks are used on one and two nodes, or OmpSs and OpenCL for the energy calculation. The dashed lines indicate the idle power. The start and end of the run are clearly visible. Running two MPI tasks on a single node halves the time of the benchmark but only increases the power consumption by about 20%. Unfortunately, the benchmark does not benefit from using the GPU. The numbers in parentheses show the total energy in Ws (= J), excluding idle times, used for each of the runs.

2.7.7 Report on scaling and energy profiling

The parallel scaling of the energy evaluation strongly depends on the quality of the implementation of MPI_Allreduce. Due to the small memory footprint of the program, it is possible to replicate the entire data and the Monte Carlo evaluation on each MPI rank. Each rank calculates a part of the interactions; the energy is then summed up and broadcast to all ranks involved using MPI_Allreduce.

I saw an unusually bad scaling behaviour: the parallel efficiency is down to about 50% already at 16 MPI tasks. This may be due to the fact that we use an Ethernet network and a TCP stack to communicate between nodes. TCP timeouts are large and can adversely affect the performance of an MPI_Allreduce call.

As expected, the energy consumption increases as more nodes are added. Running two MPI processes on a single node, the program needs about 25 s at 9 W (or 237 Ws) to finish the benchmark. If I run two processes on two nodes, the program also needs about 25 s, but the total power drawn increases to 14.7 W; the energy used is 387 Ws.

At the single-node level, we can also compare with the version that uses OmpSs and OpenCL to take advantage of the GPU. Here, the simulation takes about 63 s at 9 W and consumes 578 Ws. If I compile the program with OmpSs but do not use it, the benchmark takes about 100 s; apparently, the overhead of the runtime is significant in this case.

Weak scaling works reasonably well. The two-level parallelization lets me use the entire Mont-Blanc prototype.

2.7.8 Synthesis & Best practices

The strong scaling behaviour of SMMP is significantly worse than on an IBM Blue Gene system or a PC cluster connected with InfiniBand. The message sizes are very small and the scaling is strongly affected by the latency of the network, which, unfortunately, is fairly large on the current Mont-Blanc prototype. A faster, low-latency network interconnect is needed if SMMP is to scale to larger systems.

OmpSs seems to introduce a fairly large overhead: if the code is compiled with OmpSs, it takes about twice as long to complete the benchmark even if OmpSs is not used at all. Using the GPU makes up for some of the overhead, but not enough to compete with two MPI tasks on a node in terms of speed and energy


consumption. On the other hand, OmpSs is a very elegant way of calling OpenCL kernels from Fortran. To make good use of the GPU, its calculations need to be overlapped with calculations on the CPU. While this is true for any GPU, it is particularly relevant for systems where CPU and GPU offer comparable performance.

I also observed many problems with the file system. Lustre can be a very stable, high-performance file system, but it does not seem to work too well in combination with the ARM hardware used in the prototype. This could be partly due to the network.


2.8 SPECFEM3D

2.8.1 Description of the code

SPECFEM3D Globe is an HPC scientific code developed by the Computational Infrastructure for Geodynamics (CIG), Princeton University (USA), CNRS and the University of Marseille (France), and ETH Zurich (Switzerland). It simulates seismic wave propagation at the local or regional scale based upon the spectral-element method (SEM), with very good accuracy and convergence properties. This approach, combining finite elements and pseudo-spectral methods, allows the seismic wave equations to be formulated with greater accuracy and flexibility compared to more traditional methodologies. SPECFEM3D is a Fortran application, but a subset of the globe version has been ported to C to experiment with CUDA, StarSs and OpenCL. This subset contains the main computation loop of the application. The full application is composed of 50k lines of Fortran, while the subset contains 3k lines of C.

SPECFEM3D is a reference application for supercomputer benchmarking thanks to its good scaling capabilities. It supports asynchronous MPI communications as well as OpenCL and CUDA GPU acceleration. It has shown strong scaling up to 896 GPUs, and it has run on more than 21,675 Cray XE nodes with 693,600 MPI ranks, sustaining over 1 PF/s on the NCSA Blue Waters petascale system. In 2010, its multi-GPU port won the French Bull-Fourier supercomputing prize.

2.8.2 Code version

We used the official development version of SPECFEM3D Globe 7.0, dated March 10th, 2015 (Git commit #a717e94).

2.8.3 Problem sizes

SPECFEM3D was configured to simulate a regional-scale Greek earthquake (regional_Greece_small). The accuracy of the simulation (SPECFEM3D `NEX`, controlling the number of elements at the surface along the two sides of the first chunk) varied between 64 and 1008 (the higher, the more complex). The `NEX` value must be a multiple of 16.

The number of processor cores (SPECFEM3D `NPROC`) that run the application depends on the `NEX` parameter: `NEX` must be a multiple of 8 times the number of processes along each edge of the chunk. We successfully ran the application with the following configurations:

NEX   Number of tasks
64    1, 4, 16, 64
128   1, 4, 16, 64, 256
256   1, 4, 16, 64, 256, 1024
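The constraint above can be sketched as follows. This is an illustrative check, not SPECFEM3D code: the square NPROC_XI × NPROC_ETA process grid is an assumption about the regional configuration, but it reproduces exactly the task counts listed in the table.

```python
# Illustrative check of the NEX/NPROC constraint (the square process
# grid is an assumption): NEX must be a multiple of 16 and of 8 * n,
# where n is the number of processes along each chunk edge, giving
# n * n MPI tasks in total.
def valid_task_counts(nex):
    if nex % 16 != 0:
        raise ValueError("NEX must be a multiple of 16")
    counts = []
    n = 1
    while 8 * n <= nex:
        if nex % (8 * n) == 0:
            counts.append(n * n)  # square process grid: n x n tasks
        n += 1
    return counts

print(valid_task_counts(64))   # [1, 4, 16, 64]
print(valid_task_counts(128))  # [1, 4, 16, 64, 256]
print(valid_task_counts(256))  # [1, 4, 16, 64, 256, 1024]
```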


For each of these configurations, we activated one or two CPU cores per node, and either enabled or disabled the GPGPU accelerator.

Overall, we captured 45 valid execution results. Executions with bigger problem sizes or higher numbers of tasks could not run to completion, or exhibited clearly invalid performance measurements (e.g., nodes over-consuming, which implies that the CPU was used by other processes, or under-consuming, which means they were running too slowly and tainted the entire performance and power profile).

2.8.4 Weak Scaling

Figures 2.8.1 – 2.8.4 present the weak scaling graphs of the SPECFEM3D results on the four different hardware configurations: with one or two CPU cores, and with or without GPU usage. The reference time for a given NEX problem size is the largest time-step (i.e., the poorest performance) obtained on one CPU core without using the GPU.
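As a sketch of how such weak-scaling figures are derived from the reference time-step (the timing values below are made up for illustration, not measurements from the report):

```python
# Weak-scaling efficiency relative to the reference time-step (the
# slowest one-core, no-GPU run); 1.0 means perfect scaling.
# Timings are illustrative, not values measured on the prototype.
def weak_efficiency(reference_time_step, time_steps):
    return [reference_time_step / t for t in time_steps]

# Hypothetical per-iteration times on 1, 4 and 16 nodes, constant work per node:
print([round(e, 2) for e in weak_efficiency(2.0, [2.0, 2.1, 2.5])])  # [1.0, 0.95, 0.8]
```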

Figure 2.8.1 Specfem3D weak-scaling chart with 1 core and GPU


Figure 2.8.2 Specfem3D weak-scaling chart with 1 core and no GPU

Figure 2.8.3 Specfem3D weak-scaling chart with 2 cores and GPU


Figure 2.8.4 Specfem3D weak-scaling chart with 2 cores and no GPU

2.8.5 Strong Scaling

Figures 2.8.5 – 2.8.7 present the strong scaling graphs of the SPECFEM3D results. Each figure corresponds to a SPECFEM3D problem size: 64, 128 and 256 NEX (we could not collect enough data to plot 512 NEX because of the instability of the cluster and its file system). The reference time for a given NEX problem size is the largest time-step (i.e., the poorest performance) obtained on one CPU core without using the GPU.


Figure 2.8.5 Specfem3D strong-scaling chart with 64-NEX problem size

Figure 2.8.6 Specfem3D strong-scaling chart with 128-NEX problem size


Figure 2.8.7 Specfem3D strong-scaling chart with 256-NEX problem-size

The relative order of the curves, which is consistent over the different problem sizes, shows that the one-CPU-core configuration offers the lowest performance and the two-CPU-core configuration the best. When the GPU is activated, using one or two CPU cores leads to the same performance, which is better than one CPU core alone but lower than two CPU cores. The CPU-only curves scale well as the number of nodes increases; however, the GPU configurations tend to collapse when using more than 16 nodes for the smallest problem sizes (64 and 128 NEX).

2.8.6 Energy profiling

Figures 2.8.8 – 2.8.12 present an example of the energy consumption reported for an idle run and then on the four hardware configurations for a given SPECFEM3D problem size (128 NEX). We can notice two spikes during the first seconds of the profiling. They correspond to our calibration preamble code and indicate the time scale (a spike lasts 10 seconds) and the power scale (first one core is used intensively, then two cores). After that, and similarly at the end of the simulation, the power consumption becomes irregular for a few seconds, corresponding to the application set-up and tear-down. We did not keep these measurements in the following analyses.
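The trace trimming described above can be sketched as follows; this is a minimal illustration assuming a 1 Hz sampling rate and hand-picked cut points (both assumptions, not the report's actual tooling):

```python
# Keep only the steady-state samples of a power trace, dropping the
# calibration spikes and the irregular set-up/tear-down phases by
# cutting at explicit start/end timestamps (illustrative values).
def trim_power_trace(samples, start_s, end_s, rate_hz=1):
    """samples: instantaneous power readings taken at `rate_hz` Hz."""
    return samples[int(start_s * rate_hz):int(end_s * rate_hz)]

# Synthetic trace: 5 s idle, 10 s calibration spike, 20 s run, 5 s teardown.
trace = [5.4] * 5 + [9.0] * 10 + [7.8] * 20 + [6.0] * 5
steady = trim_power_trace(trace, start_s=15, end_s=35)
print(len(steady), round(sum(steady) / len(steady), 2))  # 20 7.8
```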


Figure 2.8.8 Specfem3D Energy consumption on Mont-Blanc cluster for an idle run.

Figure 2.8.9 Specfem3D Energy consumption on Mont-Blanc cluster using 1 core per node


Figure 2.8.10 Specfem3D Energy consumption on Mont-Blanc cluster using 2 cores per node

Figure 2.8.11 Specfem3D Energy consumption on Mont-Blanc cluster using 1 core and a GPU per node


Figure 2.8.12 Specfem3D Energy consumption on Mont-Blanc cluster using 2 cores and a GPU per node

We can notice in Figure 2.8.12 that when the GPU is used in combination with two CPU cores, the power consumption varies heavily between 7 W and 9 W. The other hardware configurations are more regular. We believe that this is due to the sharing of the GPU between the two CPU cores: in this situation, the GPU driver has to wait more often for a GPU answer, and a busy-wait loop may explain the consumption spikes.

The figures below illustrate the consumption variability for the different hardware configurations, with the instantaneous consumption of one of the nodes plotted over time.


Figure 2.8.13 Specfem3D Energy consumption for each hardware configuration

Finally, the following figures present the power efficiency (power consumption per SPECFEM3D time-step) for the different problem sizes (except 512 NEX, for which we could not collect enough data), and all together.

Figure 2.8.14 Specfem3D Power efficiency on Mont-Blanc cluster for the 64 NEX problem size


Figure 2.8.15 Specfem3D Power efficiency on Mont-Blanc cluster for the 128 NEX problem size

Figure 2.8.16 Specfem3D Power efficiency on Mont-Blanc cluster for the 256 NEX problem size


Figure 2.8.17 Specfem3D Power efficiency

2.8.7 Report on scaling and energy profiling

The Specfem3D strong-scaling charts show that the CPU benchmarks scale well. The scaling of the GPU version is not as good: we suspect that the overlapping of computation and communication may not be correctly handled by the Mont-Blanc cluster's GPU drivers, which would explain the poorer performance obtained with the largest executions.

Regarding energy consumption, good strong scaling should keep the energy cost per iteration constant. The power-efficiency charts (Figures 2.8.14 – 2.8.17) show that this is only the case for the CPU benchmarks and for GPU runs with a low node count. At higher node counts, the GPU consumption rises because of its bad scaling and exceeds the CPU-only consumption.

Regarding weak scaling (Figures 2.8.1 – 2.8.4), the Mont-Blanc cluster prototype has been able to execute a wide range of SPECFEM3D simulation problem sizes, with a 200× ratio between the smallest and the largest problems. The charts show that the execution scales well with increasing problem sizes.

On the energy consumption side, we can notice that the energy consumption of each node is different. Figures 2.8.18 – 2.8.22 below show the distribution of these consumptions, first for the idle run, then for SPECFEM3D runs with 256 NEX / 256 tasks for 1 core, 2 cores, and 1 core + GPU, and with 128 NEX / 256 tasks for 2 cores + GPU.


Figure 2.8.18 Specfem3D Node consumption for idle run

Figure 2.8.19 Specfem3D node consumption 1 core/node


Figure 2.8.20 Specfem3D Node consumption 2 cores/node

Figure 2.8.21 Specfem3D Node consumption 1 core – 1 GPU / node


Figure 2.8.22 Specfem3D Node consumption 2 cores – 1 GPU / node

We can also see that not all the nodes are equal: the per-node energy consumption during SPECFEM3D executions follows a Gaussian distribution. Over time, the consumption is quite regular, except when two CPU cores are used together with the GPU.

The average consumptions are as follows:

Configuration   Average consumption
Idle            Baseline = 5.44 W
1 core          Baseline + 2.46 W = 7.80 W
2 cores         Baseline + 4.38 W = 9.82 W
1 core + GPU    Baseline + 1.78 W = 7.21 W
2 cores + GPU   Baseline + 2.85 W = 8.29 W
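The net (above-baseline) power draw per configuration can be recomputed directly from the reported averages, as a quick sanity check:

```python
# Net power draw above the idle baseline, recomputed from the average
# node consumptions reported in the table above.
baseline = 5.44  # W, idle
average = {      # W, average node consumption per configuration
    "1 core": 7.80,
    "2 cores": 9.82,
    "1 core + GPU": 7.21,
    "2 cores + GPU": 8.29,
}
net = {cfg: round(avg - baseline, 2) for cfg, avg in average.items()}
print(net["2 cores"])        # 4.38
print(net["2 cores + GPU"])  # 2.85
```

Note that the GPU configurations add noticeably less power on top of the baseline than the corresponding CPU-only configurations.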


2.8.8 Synthesis & Best practices

The Mont-Blanc cluster prototype can execute SPECFEM3D, the real full-size application, without major problems, and it scales well. As part of the Mont-Blanc project, the OpenCL GPU port of SPECFEM3D has been developed, contributed and incorporated into the development branch of SPECFEM3D, together with the different optimization patches required to run SPECFEM3D on the Mont-Blanc cluster.

With GPU acceleration, executions with one CPU core are more efficient than with two CPU cores: both configurations achieve similar performance, but the one-CPU-core version has a lower consumption rate.

Without GPU acceleration, it is more efficient to use two CPU cores: this outperforms the single-core version at a lower energy cost per time-step.

Overall, the dual-core configuration is better than the single-CPU-core-plus-GPU configuration because of the poor GPU scaling. However, this is not true at small node counts (under 4 nodes for 64 and 128 NEX, or under 16 nodes for 256 NEX), where the GPU remains more efficient.


2.9 Alya RED

Although this application is not part of the original application set of the project, we decided to test it on the Mont-Blanc prototype because of its importance, recognized by the HPC Innovation Excellence Award given by IDC and the HPC User Forum. We report in this section the results in the same format as for the rest of the applications.

2.9.1 Description of the code

Alya Red is a biomedical mechanics simulator used for research in biological systems and is specially designed to run with high efficiency on large-scale supercomputers. It has reported extremely good scaling results, up to 100,000 cores on the Blue Waters supercomputer, proving the viability of engineering simulation codes on exascale systems. The data inputs we used represent a rabbit's heart and are part of the Cardiac Computational Modelling project at the Barcelona Supercomputing Center. Alya Red is based on BSC's parallel multiphysics code Alya, which is part of the PRACE Unified European Applications Benchmark Suite.

By default, Alya Red only depends on a single third-party library (METIS 4.0), provided with the package and statically linked.

Similar to a stencil code, an Alya Red execution repeats iterative steps composed largely of pure computation plus a few MPI calls (collective and point-to-point).

All the timing and energy measurements we show in the results section belong to the parallel iterative part, excluding initialization and finalization. We also disabled the output of final results to disk.

2.9.2 Code version

The version used for all the tests is the stable version pulled from internal repositories in September 2014. Alya is written in Fortran 90 and uses MPI for communication between processes. For the tests we used gfortran (GCC 4.9.1) and MPICH (3.1.4). No modification was made to the original code in order to optimize performance; we ran the tests with the out-of-the-box version of the application.

2.9.3 Problem sizes

The results are generated using two versions of the same input problem, with high (division parameter = 3) and low (division parameter = 2) resolution. With the integer division (div) parameter we can increase or decrease the number of elements of the matrix: from div=2 to div=3 we increase the number of elements by 8× (2× in each of the 3 dimensions).
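The div relation can be sketched as follows. Note that the idealized 8× rule gives 220.8M elements for div=3, slightly below the 221.2M reported in the table below, presumably because the mesh does not subdivide perfectly:

```python
# Idealized relation between the `div` parameter and the element count:
# each increment doubles the resolution in all three dimensions, i.e.
# multiplies the number of elements by 8.
def elements(base_elems, base_div, div):
    return base_elems * 8 ** (div - base_div)

print(elements(27.6e6, 2, 3) / 1e6)  # 220.8
```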


With those tests, we used two different MPI process allocation configurations: 1 process/thread per node (1TPN) and 2 processes/threads per node (2TPN). Each test (an execution of the application with a given configuration) is repeated 3 times, noted in the plots as r0, r1 and r2. We performed these repetitions to measure system stability.

Input resolution    Num. elements   Configurations   Cores
Rabbit 2 (div=2)    27.6 million    1 TPN, 2 TPN     32–512 (1TPN), 16–512 (2TPN)
Rabbit 3 (div=3)    221.2 million   1 TPN, 2 TPN     64–750 (1TPN), 128–1500 (2TPN)

2.9.4 Weak Scaling

We do not perform weak scaling tests because creating fine-grained problem sizes requires complex changes to the input files, demanding an excessive effort from third parties.

2.9.5 Strong Scaling

The next figures show the strong scaling performance results of Alya Red on the Mont-Blanc prototype.

Figure 2.9.1 Alya Red, low resolution input set 2TPN, strong scaling


Figure 2.9.2 Alya Red, low resolution input set 1TPN, strong scaling

Figure 2.9.3 Alya Red, high resolution input set 2TPN, strong scaling

Figure 2.9.4 Alya Red, high resolution input set 1TPN, strong scaling


Figure 2.9.5 Alya Red, execution times comparison 1TPN vs 2TPN, low resolution input

2.9.6 Energy profiling

The energy results we show compare the total energy-to-solution of both process distributions (1TPN vs 2TPN) with independent plots for each input problem.

Figure 2.9.6 Alya Red low resolution input set, total job energy consumption


Figure 2.9.7 Alya Red high resolution input set, total job energy consumption

2.9.7 Report on scaling and energy profiling

The strong scaling results for the low-resolution input set (Figures 2.9.1 and 2.9.2) show good scalability for both 2TPN and 1TPN, with more than 85% efficiency at 512 cores.

On the other hand, the high-resolution input (Figures 2.9.3 and 2.9.4) shows scaling up to 1500 cores with more than 66% efficiency. We observe almost perfect scaling up to 512 cores when using 1 process per node. Overall, considering all results, we seem to be experiencing network congestion when more than 512 MPI processes are running at the same time. Comparing the execution times of 1TPN vs 2TPN (Figure 2.9.5), we measured less than 5% difference for both input sets (div=2, div=3); therefore, no shared-resource contention between cores in the same node was experienced with this application.

As said, no benefit in execution time was observed when using 1TPN over 2TPN. In terms of energy, full usage of the node with 2 cores shows 9.7 W average power dissipation, while 1 core per node shows 7.8 W, a 20% reduction in power consumption. However, for 1TPN to process the same problem at the same speed as 2TPN, we need twice the number of nodes. So overall we experience about 60% higher energy-to-solution when using 1TPN, taking into account only the energy consumption of the nodes. If we add the energy of the rest of the blade, using 2TPN is even more advantageous in terms of energy expenditure.
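The ~60% figure above follows directly from the reported average node powers, as this back-of-the-envelope check shows:

```python
# Same runtime, but 1TPN needs twice the nodes; the energy-to-solution
# ratio follows directly from the average node powers reported above.
p_2tpn = 9.7  # W per node, 2 processes per node
p_1tpn = 7.8  # W per node, 1 process per node
ratio = (2 * p_1tpn) / p_2tpn
print(round((ratio - 1) * 100))  # 61 -> roughly the 60% quoted above
```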


2.9.8 Synthesis & Best practices

We obtained very high strong-scaling efficiency in 3 out of 4 tests. The 4th shows efficiency greater than 85% at 1024 cores, decreasing to 66% at 1500 cores.

We have been able to execute with moderate-to-good scaling efficiency up to 1500 cores, with low variability and high reliability.

In most executions, using only one process per node did not yield performance benefits and increased the energy-to-solution by about 60%.

We have been able to port a state-of-the-art production scientific application with relatively low effort and achieve good performance with an out-of-the-box version.

Overall, the poor performance of the network seems to be the main cause of the scaling problems beyond 512 cores.

Finally, some comments about the experience with the prototype:

Towards the end of the test period we observed that jobs using a low number of cores were less reliable in terms of how many of them completed. Although we did not perform specific tests, we found a high correlation between this low reliability and the number of jobs running on the prototype at the same time. To minimize variability and maximize reliability in important tests, we recommend force-reserving all the nodes in the prototype using SLURM, to prevent other jobs from running during the test period. On the other hand, also in terms of reliability, we experienced problems with codes that write big output files from many cores at the same time (1024+). These reliability problems are the missing points in the scalability lines of the plots: every missing value corresponds to a failed job. For example, reserving all the nodes for the rabbit 3 – 1TPN test yielded a 100% successful execution rate.

Another issue can be seen in Figure 2.9.4, right plot, at the last x-axis value (750 cores). It is caused by issues with the Dynamic Voltage and Frequency Scaling (DVFS) management of the kernel image and is unrelated to the application. Over the last months of testing, several cluster updates minimized this problem, and in the latest tests it was only experienced on specific occasions with very high node counts (e.g., 750 out of ~790 available nodes). The experiments shown in this section aided the identification and resolution of the DVFS problems, which are reported in D5.11 Section 10.


2.10 ExMatEx Proxy Applications

The ExMatEx Proxy Applications are not part of the initial application set of the project, but we tested them on the prototype given their popularity, the ease of setting them up, and the insight they provide thanks to their lower complexity.

2.10.1 Description of the code

The objective of the Exascale Co-design Center for Materials in Extreme Environments (ExMatEx) is to establish the interrelationship among algorithms, system software, and hardware required to develop a multi-physics exascale simulation framework for modelling materials subjected to extreme mechanical and radiation environments. The ExMatEx center claims that such a simulation capability will play a key role in solving many of today's most pressing problems, including producing clean energy, extending nuclear reactor lifetimes, and certifying the aging nuclear stockpile.

For this purpose, the ExMatEx center provides a set of Proxy Applications designed as the primary vehicle for collaboration with both vendor and academic partners. They are also heavily used within the center to explore programming models, systemware, and hardware-interfacing tools. From the eight publicly available Proxy Applications, we have run LULESH and CoMD on the Mont-Blanc prototype.

The ExMatEx center reports that LULESH, developed at LLNL as part of the DARPA UHPC program, was one of the earliest Proxy Applications released to the HPC community. Within ExMatEx, LULESH is used to explore programming models, including domain-specific languages, and as a representative coarse-scale component of their scale-bridging research and Proxy Application development.

They also report that CoMD is a reference implementation of classical molecular dynamics algorithms and workloads as used in materials science. The code is intended to serve as a vehicle for co-design by allowing others to extend and/or re-implement it as needed to test the performance of new architectures, programming models, etc. New versions of CoMD will be released to incorporate the lessons learned from the co-design process.

2.10.2 Code version

We obtained the LULESH code from the LLNL web page at https://codesign.llnl.gov/lulesh.php. We downloaded the CPU Models version 2.0.3.

We obtained the CoMD code from the ExMatEx GitHub repository at https://github.com/exmatex/CoMD. The code version is the latest revision from December 16th, 2013.

We use the MPI + OpenMP versions of the applications and run three different kinds of experiments depending on the process mapping to compute nodes:

- 1ppn 1th: one MPI process per node with one OpenMP thread per process (one core per node remains unused)
- 1ppn 2th: one MPI process per node with two OpenMP threads per process
- 2ppn: two MPI processes per node with one OpenMP thread per process

2.10.3 Problem sizes

Both LULESH and CoMD work on three-dimensional problems.

LULESH allows setting the problem size with a single number, which is the size of the three dimensions of the problem: the data set is a cube. It uses the number of MPI processes available, which must be the cube of a natural number (greater than 0). LULESH performs weak scaling automatically, scaling the input set proportionally with the number of processes.

CoMD allows setting the size of each of the three dimensions of the problem, as well as the number of processes per dimension. To perform weak scaling, the user must increase the input size and the number of processes manually for the three dimensions.

LULESH weak scaling parameters

Input “-s 65” for process counts 1, 8, 27, 64, 125, 216, 343, 512, 729, 1000, and 1331.

CoMD weak scaling parameters

Number of processes    Input set
Total    -i  -j  -k    -x   -y   -z
    1     1   1   1     40   40   40
    8     2   2   2     80   80   80
   27     3   3   3    120  120  120
   64     4   4   4    160  160  160
  125     5   5   5    200  200  200
  216     6   6   6    240  240  240
  343     7   7   7    280  280  280
  512     8   8   8    320  320  320
  729     9   9   9    360  360  360
 1000    10  10  10    400  400  400
 1331    11  11  11    420  420  420
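The CoMD inputs above follow a simple rule: with n processes per dimension (p = n³ in total) the box grows by 40 cells per dimension, keeping the per-process volume constant. A sketch of how such parameters can be generated (the helper name is ours; the 40-cells baseline is taken from the table, whose last row deviates slightly, using 420 instead of 440):

```python
def comd_weak_params(n, cells_per_proc=40):
    """Per-dimension process count n -> CoMD weak-scaling flags.

    Assumes the 40-cells-per-process-per-dimension baseline used in the
    table above; the 1331-process run in the table deviates from this
    rule (420 instead of 440 cells), presumably for memory reasons.
    """
    size = n * cells_per_proc
    return {"total": n ** 3,
            "-i": n, "-j": n, "-k": n,
            "-x": size, "-y": size, "-z": size}

for n in range(1, 12):
    row = comd_weak_params(n)
    print(row["total"], row["-i"], row["-x"])
```

Each row keeps the work per process constant, which is exactly what weak scaling requires.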


LULESH strong scaling parameters

Since LULESH performs automatic weak scaling, we have to manually and proportionally decrease the input set parameter for larger numbers of processes so that the problem size remains constant. We start at 27 processes, adjusting the starting parameter to fill the 3 GB of memory of a node. The initial input set for 1 process (not shown, as it cannot execute due to exceeding the available memory) is “-s 217”. The following table shows the parameters.

Number of processes    Input set
Total    -s
   27    72
   64    54
  125    43
  216    36
  343    31
  512    27
  729    24
 1000    22
 1331    20
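Assuming the process counts are perfect cubes (as LULESH requires), the -s values are consistent with keeping the total problem at 217³ elements: for p = n³ processes, the per-process cube has side 217/n, rounded to the nearest integer. A quick check of that reading (the rounding rule is our assumption; the deliverable does not state it explicitly):

```python
# Reproduce the LULESH strong-scaling inputs: a fixed 217^3 problem split
# over n^3 processes gives a per-process cube of side round(217 / n).
table = {}
for n in range(3, 12):          # 27 .. 1331 processes
    table[n ** 3] = round(217 / n)
print(table)
```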

CoMD strong scaling parameters

Input “-x 217 -y 217 -z 217” for numbers of processes 125, 216, 343, 512, 729, 1000, and 1331. We start at 125 processes; starting at 1-64 would not provide enough memory for this input set. The configuration 1ppn 2th is able to run on 64 nodes. The reference is the smallest number of processes for each process mapping.

2.10.4 Weak Scaling

The figures below show the weak scalability results of running LULESH and CoMD on the Mont-Blanc prototype. They report speed-up normalized to one process and parallel efficiency (1 means perfect scaling). The different lines correspond to the different process mappings explained before. We also show the “Ideal” speed-up as a reference of perfect scaling: a speed-up of N with N processes.
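Using the standard definitions for weak scaling (the problem grows with the process count, so the ideal run time stays constant), the reported metrics can be computed from wall-clock times as follows; a minimal sketch with hypothetical timings, not measured values:

```python
def weak_scaling_metrics(t_ref, timings):
    """Weak-scaling speed-up and parallel efficiency.

    t_ref:   wall-clock time of the 1-process reference run
    timings: {num_processes: wall-clock time}; the problem size grows
             proportionally with the process count, so ideal time
             stays equal to t_ref and ideal speed-up is N at N processes.
    """
    speedup = {p: p * t_ref / t for p, t in timings.items()}
    efficiency = {p: t_ref / t for p, t in timings.items()}
    return speedup, efficiency

# Hypothetical timings (seconds): perfect scaling would keep them at 100.
s, e = weak_scaling_metrics(100.0, {1: 100.0, 8: 105.0, 27: 110.0})
print(e[27])  # about 0.91, i.e. 91% parallel efficiency at 27 processes
```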


Figure 2.10.1 LULESH weak scalability speed-up

Figure 2.10.2 LULESH weak scalability parallel efficiency


Figure 2.10.3 CoMD weak scalability speed-up

Figure 2.10.4 CoMD weak scalability parallel efficiency


2.10.5 Strong Scaling

The figures below show the strong scalability results of running LULESH and CoMD on the Mont-Blanc prototype. They report speed-up normalized to the smallest number of processes for each process mapping, and parallel efficiency (1 means perfect scaling). As before, we also show the “Ideal” speed-up as a reference of perfect scaling: a speed-up of N with N processes. For all experiments we assume perfect scaling at the smallest number of processes: 27 in LULESH, and 64 (1ppn 2th) or 125 (1ppn 1th and 2ppn) in CoMD.

Figure 2.10.5 LULESH strong scalability speed-up


Figure 2.10.6 LULESH strong scalability parallel efficiency

Figure 2.10.7 CoMD strong scalability speed-up


Figure 2.10.8 CoMD strong scalability parallel efficiency

2.10.6 Energy profiling

The figures in this section show the average power per node of each execution, and the energy normalized to the 2ppn execution with 8 processes for weak scaling, with 27 processes for LULESH strong scaling, and with 125 processes for CoMD strong scaling. This way we can see not only the energy expenditure of the machine as the application scales, but also which process mapping is more energy efficient.

The energy shown for the weak scaling experiments represents the “energy scalability”. Computation on larger problem sizes requires more energy but, to ease the readability of the charts and to use the same representation as for strong scaling, we consider the energy spent for a given number of processes relative to the part proportional to the problem size of the reference. Therefore, the ideal value is 1 for all numbers of processes in both weak scaling and strong scaling experiments.

When interpreting the normalized energy results (lines in the figures), consider that 1ppn configurations use twice as many nodes as 2ppn configurations for the same number of processes. Therefore, for the same node power, they should provide twice the performance of 2ppn to spend the same energy.

In this study we have considered only the power of the nodes. To account for the power overhead of the blades, one should add an extra 25-30% of power.
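Our reading of the normalization described above can be written down explicitly: for weak scaling, the measured energy is divided by the reference energy scaled by the growth in problem size (proportional to the process count), so that ideal energy scaling is 1 everywhere; for strong scaling the comparison is direct. A sketch (the helper name and the numbers are ours, not measured values):

```python
def normalized_energy(e_measured, e_ref, procs, procs_ref, weak=True):
    """Energy normalized so that ideal scaling equals 1.

    Weak scaling: the problem grows with the process count, so the
    measured energy is compared against e_ref scaled by procs/procs_ref.
    Strong scaling: the problem size is fixed, so e_ref is used as is.
    """
    ideal = e_ref * (procs / procs_ref) if weak else e_ref
    return e_measured / ideal

# Hypothetical weak-scaling run: the 8-process reference used 2.0 kJ and
# the 64-process run used 17.6 kJ -> 1.1, i.e. a 10% energy overhead.
print(normalized_energy(17.6, 2.0, 64, 8))
```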


Figure 2.10.9 LULESH weak scalability energy and node power.

Figure 2.10.10 CoMD weak scalability energy and node power.


Figure 2.10.11 LULESH strong scalability energy and node power

Figure 2.10.12 CoMD strong scalability energy and node power


2.10.7 Report on scaling and energy profiling

The weak scaling results of LULESH are very satisfactory for 2ppn. LULESH with 1ppn scales worse. The worst case for LULESH is the MPI+OpenMP version with two threads per process (1ppn 2th), with a parallel efficiency that diminishes down to 60% at 729 processes compared to 1 process.

The weak scalability of CoMD is close to ideal for 1ppn 1th and 2ppn. The threaded configuration 1ppn 2th scales somewhat worse (90% efficiency), but we must consider that it achieves close to twice the performance of 2ppn, using twice as many nodes.

For weak scaling, parallel efficiencies are generally very satisfactory considering that we are covering a scaling factor of up to 1331x.

The strong scaling results of LULESH are less satisfactory. The limited amount of node memory (up to 3 GB available for user application data) restricts the problem size, which leads to small per-node data sets as the application scales up. The problem is that packets lost in the network force waits for the TCP/IP Retransmission Timeout (RTO), which is 200 ms by default. The following Paraver traces show this effect. Dark colours indicate longer periods and light colours shorter periods.

Figure 2.10.13 LULESH strong scaling, MPI waiting time for 2ppn 216 processes

Figure 2.10.14 LULESH strong scaling, useful compute time for 2ppn 216 processes

The blue periods in the trace show waiting periods slightly over 200 ms, corresponding to transfers delayed by the RTO (marked with red rectangles). The rest of the waiting time, shown in green, corresponds to transfers without lost packets, showing that the transfer sizes actually require much less time. The second trace shows the useful compute time. Iterations affected by the RTO take twice as long as those without this problem (the second iteration in the trace).

This problem also affects the experiments with weak scaling. However,


computation periods are much longer in weak scaling, so the 200 ms stall represents a load imbalance with a much smaller overall impact. The traces below show the effect for the weak scaling case.

Figure 2.10.15 LULESH weak scaling, useful compute time for 2ppn 216 processes

Figure 2.10.16 LULESH weak scaling, useful compute time for 2ppn 216 processes

LULESH has some imbalance in its computation periods. Figure 2.10.16 shows this effect. This leads to situations where an RTO delay hits a process whose computation in that iteration was shorter, so the delay is absorbed and the load is re-balanced. This further reduces the RTO impact on the overall execution.
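The different sensitivity of strong and weak scaling to the RTO can be captured with a back-of-the-envelope model (ours, not from the deliverable): an iteration whose transfers hit a retransmission timeout is stretched by roughly the 200 ms RTO, so the relative slowdown shrinks as compute periods grow. A sketch with illustrative, not measured, iteration times:

```python
RTO = 0.200  # default TCP/IP retransmission timeout, in seconds

def iteration_slowdown(t_compute, rto=RTO):
    """Relative cost of one RTO stall on an iteration of length t_compute."""
    return (t_compute + rto) / t_compute

# Strong scaling shrinks per-node work: a 0.2 s iteration doubles in length
# when it hits one RTO stall, while a 2 s weak-scaling iteration loses only
# about 10%, consistent with the smaller impact observed for weak scaling.
print(iteration_slowdown(0.2))
print(iteration_slowdown(2.0))
```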

CoMD strong scalability is much better. This might be in part because it starts from 125 processes in order to fit the problem size. It scales reasonably well up to 729 processes (over 80% efficiency), and reaches 60% efficiency at 1331 processes. The RTO effect in CoMD is studied in deliverable D6.8, Section 6.

The energy figures of the weak scaling experiments show that LULESH and CoMD differ in the effectiveness of their OpenMP parallelisation. While LULESH 1ppn 2th loses efficiency and consumes more energy, CoMD 1ppn 2th spends little more energy than 2ppn, because it provides almost twice the performance with twice the number of nodes. Using 1ppn 1th is never as effective: it actually provides more performance than 2ppn (for the same number of processes) for LULESH, but it has the same performance as 2ppn for CoMD (this indicates that CoMD is less sensitive to the network than LULESH in the weak scaling configuration). Since 1ppn 1th uses twice as many nodes as 2ppn, this ends up costing almost twice the energy for CoMD. The power of the nodes is actually lower when using 1ppn 1th instead of 2ppn, because the application only uses one of the two cores. The power of 1ppn 2th is higher, because the two threads actually use both cores, but it is still lower than with 2ppn. This may be because of the additional activity due to transfers within the node in 2ppn, or due to the lower efficiency 1ppn 2th achieves (the result of more waiting).


The energy results of the strong scaling experiments reflect the poor strong scalability of LULESH. Node power is lower for larger numbers of processes, showing that the nodes stay idle for long periods of time during the RTO delays. The static power, however, is significant, and that leads to increasing energy expenditure for larger numbers of nodes. CoMD behaves much better: 1ppn 2th matches the 2ppn configuration in energy to solution, which shows that, even for strong scaling, it provides almost twice the performance with twice the number of nodes.

2.10.8 Synthesis & Best practices

The experiments in this section confirmed the problem of using the default RTO configuration of 200 ms. We will undertake additional experiments with lower values to see the effect on these particular applications. Based on these experiments, we recommend adjusting this parameter so that retransmissions are not delayed unnecessarily while the network is not over-congested.

We also showed that different process mappings to nodes and programming models (pure MPI vs MPI+OpenMP) perform, and scale, differently for different applications. For LULESH, 2ppn is the best configuration, while 1ppn 1th and 2ppn perform equally well for CoMD. We recommend studying the performance and efficiency of your application with different mappings and selecting the best one to run your experiments at scale. Without that understanding, our experiments suggest that 2ppn is the most effective.

These experiments also demonstrate that we achieved a reasonably stable prototype. Weak scaling experiments are very satisfactory, with very high efficiencies, and we identified the problems preventing strong scalability (further analysis of how applications would scale were the RTO issue solved is shown in deliverable D6.8, Section 6). Furthermore, it is also satisfactory to observe cases of energy proportionality in the prototype: energy grows proportionally with performance for well-scaling applications.


2.11 Conclusions

The project has provided a platform to test the applications at a reasonably large scale. We had little time to carry out our analyses because the machine only achieved a stable-enough state for reproducible executions at the end of May. Despite the short time, we are satisfied with the amount of runs and preliminary analyses that were completed and reported in this deliverable.

The reported results show good scalability on the prototype for many of the applications under test:

- BigDFT
- COSMO
- MP2C
- QuantumESPRESSO
- SMMP
- SPECFEM3D
- Alya Red
- LULESH
- CoMD

Running these applications helped identify some of the configuration and stability problems of the prototype. We mentioned the issues regarding frequency scaling due to overheating, filesystem errors, and TCP retransmission timeouts. More details can be found in deliverable D5.11.

The prototype has a relatively weak network, with low achievable bandwidth per node (80 MB/s) and quite high latency (50 µs). This weak performance affected specific MPI primitives such as All2All and AllReduce. Also, the problem of long delays on MPI calls due to retransmission timeouts (RTO) affects some of the applications, such as SMMP and LULESH, and prevents overall strong scaling. However, the weak network performance did not affect most applications and there are, as mentioned, many cases showing good scalability.

The power measuring infrastructure provided by the prototype (described in deliverable D5.8) allowed fine-grained energy measurements. This allowed developers to understand the energy requirements of different parts of their applications. The use of the GPU for higher energy efficiency, either with MPI+OpenCL or MPI+OmpSs+OpenCL, showed contradictory cases: it was positive for SPECFEM3D and negative for MP2C, for instance.

The experience with OmpSs was positive in terms of productivity and ease of use of the GPU, as reported for COSMO and SMMP, but it was reported to have large overheads (SMMP), and OmpSs+OpenCL generally performed poorly. The overheads appeared due to a too fine-grained taskification, and the poor performance of the Mali GPU, principally for double-precision floating-point computation, has been reported as the reason behind the underperformance of


OmpSs+OpenCL implementations. No time was left to implement and analyse coarser-grained taskifications of the applications in OmpSs. This is one of the activities that will continue in Mont-Blanc 2.

Generally, we have provided hints or speculations on what could be the reasons for underscaling or underperformance in some of the executions reported, such as for COSMO and Alya Red. Network performance and frequency downscaling effects have been mentioned as possible causes, but these speculations need to be assessed with performance analysis tools. A success case is the analysis shown for LULESH confirming the RTO effects preventing strong scaling. The work on performance tools will be leveraged to understand the causes of these suboptimal results for some applications as part of Mont-Blanc 2.


List of figures

Figure 2.1.1 BigDFT Weak Scaling .... 9
Figure 2.1.2 BigDFT Strong scaling of the medium sized system .... 10
Figure 2.1.3 BigDFT Strong scaling of the big sized system .... 10
Figure 2.1.4 Energy consumption per node during a BigDFT run. Unit is mW .... 11
Figure 2.1.5 BigDFT: Ratio of nodes per 50 mW interval for two runs (big system) .... 12
Figure 2.1.6 BigDFT: Energy per iteration of the big system .... 12
Figure 2.1.7 (fig 2.1.3) with the energetic efficiency .... 13
Figure 2.1.8 Task distribution by core in a BigDFT run .... 14
Figure 2.1.9 BigDFT influence of the programming model .... 14
Figure 2.1.10 Mean time distribution inside a BigDFT process running the medium and big systems .... 15
Figure 2.2.1 Performance comparison for the MPI+OmpSs version of BQCD on Mont-Blanc .... 17
Figure 2.2.2 BQCD Performance comparison MPI/Hybrid/OmpSs .... 18
Figure 2.2.3 Performance comparison of the hybrid version of BQCD on SuperMUC phase 1 .... 19
Figure 2.2.4 BQCD Strong scaling on Mont-Blanc .... 19
Figure 2.2.5 Total energy consumed for weak scaling of the OmpSs + MPI version of BQCD .... 21
Figure 2.3.1 COSMO OpCode Structure .... 24
Figure 2.3.2 COSMO OpCode weak scalability on 256 SDBs .... 26
Figure 2.3.3 COSMO OpCode run on Tibidabo .... 27
Figure 2.3.4 COSMO RAPS Benchmark Strong Scaling on Cray XC40 .... 27
Figure 2.3.5 COSMO OpCode Benchmark elapsed time .... 28
Figure 2.3.6 COSMO OpCode benchmark strong scaling .... 29
Figure 2.3.7 COSMO OpCode power consumption profile .... 29
Figure 2.4.1 Trace of MP2C running on one node MPI+OmpSs+OpenCL .... 31
Figure 2.4.2 Weak scaling efficiency benchmark of the full application .... 33
Figure 2.4.3 Strong scaling benchmark for MP2C full application .... 34
Figure 2.4.4 Trace of the MPI+OmpSs+OpenCL version .... 35
Figure 2.4.5 Power consumption of MP2C for the strong scaling benchmark .... 36
Figure 2.4.6 Energy needed by MP2C to compute 20 time steps with 7290000 particles .... 36
Figure 2.4.7 Comparison of MP2C energy consumption using one and two MPI ranks on a single node .... 37
Figure 2.5.1 PEPC Structure of the Treecode Framework .... 39
Figure 2.5.2 Weak scaling of PEPC on JUQUEEN .... 40
Figure 2.5.3 Weak scalability of PEPC on the Mont-Blanc prototype .... 42
Figure 2.5.4 Efficiency of PEPC from previous figure .... 42
Figure 2.5.5 Strong scaling of PEPC on the Mont-Blanc prototype .... 43
Figure 2.5.6 PEPC with a varying number of threads for a fixed number of particles and nodes .... 44
Figure 2.5.7 Watts per node and time for one PEPC .... 45
Figure 2.5.8 PEPC Runtime and estimate of the consumed energy of one of the weak scaling .... 45
Figure 2.5.9 Weak scaling efficiency of PEPC .... 46
Figure 2.5.10 PEPC Strong scaling experiments with 2.56 million particles .... 47
Figure 2.6.1 Scalability of the CP kernel of QE on BG/Q system using the CNT10POR8 benchmark .... 50
Figure 2.6.2 QE Time spent in 3D-FFTs varying the number of tasks .... 51
Figure 2.7.1 SMMP Weak-scaling plot for pure MPI version .... 58
Figure 2.7.2 SMMP Strong-scaling plot .... 58
Figure 2.7.3 SMMP Parallel efficiency of strong scaling .... 59
Figure 2.7.4 SMMP power profile of the benchmark .... 59
Figure 2.8.1 Specfem3D weak-scaling chart with 1 core and GPU .... 63
Figure 2.8.2 Specfem3D weak-scaling chart with 1 core and no GPU .... 64
Figure 2.8.3 Specfem3D weak-scaling chart with 2 cores and GPU .... 64
Figure 2.8.4 Specfem3D weak-scaling chart with 2 cores and no GPU .... 65
Figure 2.8.5 Specfem3D strong-scaling chart with 64-NEX problem size .... 66
Figure 2.8.6 Specfem3D strong-scaling chart with 128-NEX problem size .... 66
Figure 2.8.7 Specfem3D strong-scaling chart with 256-NEX problem size .... 67
Figure 2.8.8 Specfem3D Energy consumption .... 68
Figure 2.8.9 Specfem3D Energy consumption .... 68
Figure 2.8.10 Specfem3D Energy consumption .... 69
Figure 2.8.11 Specfem3D Energy consumption .... 69
Figure 2.8.12 Specfem3D Energy consumption of Mont-Blanc cluster .... 70
Figure 2.8.13 Specfem3D Energy consumption for each hardware configuration .... 71
Figure 2.8.14 Specfem3D Power efficiency .... 71
Figure 2.8.15 Specfem3D Power efficiency .... 72
Figure 2.8.16 Specfem3D Power efficiency .... 72
Figure 2.8.17 Specfem3D Power efficiency .... 73
Figure 2.8.18 Specfem3D Node consumption for idle run .... 74
Figure 2.8.19 Specfem3D Node consumption 1 core/node .... 74
Figure 2.8.20 Specfem3D Node consumption 2 cores/node .... 75
Figure 2.8.21 Specfem3D Node consumption 1 core - 1 GPU / node .... 75
Figure 2.8.22 Specfem3D Node consumption 2 cores - 1 GPU / node .... 76
Figure 2.9.1 Alya Red, low resolution input set 2TPN, strong scaling .... 79
Figure 2.9.2 Alya Red, low resolution input set 1TPN, strong scaling .... 80
Figure 2.9.3 Alya Red, high resolution input set 2TPN, strong scaling .... 80
Figure 2.9.4 Alya Red, high resolution input set 1TPN, strong scaling .... 80
Figure 2.9.5 Alya Red, execution times comparison 1TPN vs 2TPN, low resolution input .... 81
Figure 2.9.6 Alya Red low resolution input set, total job energy consumption .... 81
Figure 2.9.7 Alya Red high resolution input set, total job energy consumption .... 82
Figure 2.10.1 LULESH weak scalability speed-up .... 87
Figure 2.10.2 LULESH weak scalability parallel efficiency .... 87
Figure 2.10.3 CoMD weak scalability speed-up .... 88
Figure 2.10.4 CoMD weak scalability parallel efficiency .... 88
Figure 2.10.5 LULESH strong scalability speed-up .... 89
Figure 2.10.6 LULESH strong scalability parallel efficiency .... 90
Figure 2.10.7 CoMD strong scalability speed-up .... 90
Figure 2.10.8 CoMD strong scalability parallel efficiency .... 91
Figure 2.10.9 LULESH weak scalability energy and node power .... 92
Figure 2.10.10 CoMD weak scalability energy and node power .... 92
Figure 2.10.11 LULESH strong scalability energy and node power .... 93
Figure 2.10.12 CoMD strong scalability energy and node power .... 93
Figure 2.10.13 LULESH strong scaling, MPI waiting time for 2ppn 216 processes .... 94
Figure 2.10.14 LULESH strong scaling, useful compute time for 2ppn 216 processes .... 94
Figure 2.10.15 LULESH weak scaling, useful compute time for 2ppn 216 processes .... 95
Figure 2.10.16 LULESH weak scaling, useful compute time for 2ppn 216 processes .... 95

List of tables

Acronyms and Abbreviations

- DEISA Distributed European Infrastructure for Supercomputing Applications
- GbE Gigabit Ethernet
- GPL General Public Licence
- GPU Graphics Processing Unit
- HPC High Performance Computing
- I/O Input (read) / Output (write) operations on memory or on disks/tapes
- MD Molecular Dynamics
- NCSA National Center for Supercomputing Applications (University of Illinois at Urbana-Champaign, USA)
- PRACE Partnership for Advanced Computing in Europe (http://www.prace-ri.eu)
- SoC System on Chip
- TDP Thermal Design Power
- WP Work Package
- WP2 Work Package 2 (“Dissemination and Exploitation”)
- WP3 Work Package 3 (“Optimized application kernels”)
- WP4 Work Package 4 (“Exascale applications”)
- WP5 Work Package 5 (“System software”)
- WP6 Work Package 6 (“Next-generation system architecture”)
- WP7 Work Package 7 (“Prototype system architecture”)
- WPL Work Package Leader

