Performance study of HPC applications on an Arm-based cluster
using a generic efficiency model
Filippo [email protected]
Arm HPC User Group @ SC19, Denver – 2019, Nov 18th
Performance Optimisation and Productivity
EU-H2020 GA–676553
Motivation: Student Cluster Competition
Each year the team has 6 months of preparation, focusing on:
- 3 or more new complex applications
- a new cluster to bring up and operate
We need a method to:
- analyze the performance of the applications on our cluster
- quickly understand if they are working well or not
[Timeline of the team's Student Cluster Competition participations: 2015–2020*]
* Team candidate to SCC 2020: in case of acceptance, we need a cluster!
Motivation: Emerging Technology Partition of MareNostrum
The procurement of MareNostrum 4 included:
- a main partition powered by Intel Skylake CPUs
- three emerging technology partitions:
  – Power9 + GPU (like Summit)
  – AMD-based (more details @ BSC booth #1975)
  – Arm-based (more details @ BSC booth #1975)
We need a method to:
- analyze the performance of the applications on our cluster
- quickly understand if they are working well or not
Motivation: Performance Optimisation and Productivity CoE
BSC coordinates POP, a Center of Excellence that provides performance optimisation and productivity services for academic and industrial codes in all domains.
POP developed a method to:
- analyze the performance of parallel applications on HPC clusters
- quickly understand where to look to spot inefficiencies
Apply for POP analysis of your code here: https://pop-coe.eu/request-service-form
The POP efficiency model
More details in: M. Wagner, S. Mohr, J. Gimenez, and J. Labarta, “A structured approach to performance analysis,” in Tools for High Performance Computing 2017. Cham: Springer International Publishing, 2019, pp. 1–15, and https://pop-coe.eu/node/69
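As a reference for the metrics used in the following slides, this is a compact summary of the POP hierarchy (following the definitions documented at https://pop-coe.eu/node/69; "useful" denotes time spent outside MPI):

```latex
% Compact summary of the POP efficiency model (see https://pop-coe.eu/node/69).
% "useful" = time spent outside MPI, measured per process on the traced iterations.
\begin{align*}
  \text{Global Efficiency}        &= \text{Parallel Efficiency} \times \text{Computation Scalability}\\
  \text{Parallel Efficiency}      &= \text{Load Balance} \times \text{Communication Efficiency}\\
  \text{Communication Efficiency} &= \text{Serialization Eff.} \times \text{Transfer Eff.}\\
  \text{Computation Scalability}  &= \text{Instruction Scal.} \times \text{IPC Scal.} \times \text{Frequency Scal.}\\[4pt]
  \text{Load Balance}             &= \frac{\operatorname{avg}(\text{useful})}{\max(\text{useful})},\qquad
  \text{Communication Efficiency}  = \frac{\max(\text{useful})}{\text{runtime}}
\end{align*}
```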
The Dibona cluster
- 48 nodes, each node includes:
  – 2× Armv8 Marvell ThunderX2 CPUs @ 2.0 GHz (2.5 GHz turbo), 32 cores and 4 threads/core
  – 8 memory channels per socket
  – 256 GB DDR4 memory
  – 128 GB SSD local storage
- Interconnection network: single-port Mellanox EDR
- 49 TeraFlops theoretical peak performance
- Integrated by Atos/BULL
F. Banchelli, et al., “MB3 D6.9 – Performance analysis of applications and mini-applications and benchmarking on the project test platforms,” Tech. Rep., 2019, http://bit.ly/mb3-dibona-apps
Methodology
We studied 4 complex applications:
- Selected from SCC challenges of 2019 and 2018
- Running the input set proposed by the organizers of the competition
                       | Pennant                  | CP2K                                    | OpenFOAM              | Grid
URL                    | github.com/lanl/PENNANT  | github.com/cp2k                         | openfoam.com          | github.com/paboyle/Grid
Scientific domain      | Lagrangian hydrodynamics | Quantum chemistry / solid state physics | CFD                   | Lattice QCD
Version / commit       | v0.9                     | v6.1.0                                  | v1812                 | 251b904
Programming language   | C++                      | F77 & F90                               | C++                   | C++
Lines of code          | 3.5k                     | 860k                                    | 835k                  | 75k
Parallelization scheme | MPI                      | MPI                                     | MPI                   | MPI
Compiler               | Arm/19.1                 | GNU/8.2.0                               | Arm/19.1              | Arm/19.1
Scientific libraries   | -                        | BLAS, LAPACK, FFT via Armpl/19.1        | Armpl/19.1            | Armpl/19.1
Focus of analysis      | 4 steps                  | 4 steps QMMM                            | 4 steps SIMPLE solver | 20 steps + 1 eval
Input set              | Leblancx4                | wdftb-1 (FIST, QuickStep DFTB, QMMM)    | DrivAer model         | Lanczos algorithm
We study the efficiency metrics (aka POP metrics) for each application:
- Run with different numbers of MPI processes, within a single node and on multiple nodes
- Gather Extrae traces on Dibona
- Isolate and study 5 representative iterations
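The metrics themselves are computed from the Extrae traces with the BSC tools. As a minimal sketch (not the actual POP tooling), the top-level parallel-efficiency numbers can be derived from per-process useful time and the runtime of the traced iterations like this:

```cpp
// Minimal sketch (not the BSC/POP tooling): top-level POP metrics derived from
// per-process "useful" time (time outside MPI) and the runtime of the traced
// iterations, following the definitions on the previous slide.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

struct PopMetrics {
    double load_balance;        // avg(useful) / max(useful)
    double comm_efficiency;     // max(useful) / runtime
    double parallel_efficiency; // load_balance * comm_efficiency
};

PopMetrics pop_metrics(const std::vector<double>& useful, double runtime) {
    const double max_u = *std::max_element(useful.begin(), useful.end());
    const double avg_u = std::accumulate(useful.begin(), useful.end(), 0.0) /
                         static_cast<double>(useful.size());
    return {avg_u / max_u, max_u / runtime, (avg_u / max_u) * (max_u / runtime)};
}

int main() {
    // Hypothetical numbers: 4 ranks over a 10 s window of 5 iterations.
    const std::vector<double> useful = {8.0, 7.5, 7.9, 6.8};
    const PopMetrics m = pop_metrics(useful, 10.0);
    std::cout << "LB=" << m.load_balance << " CommEff=" << m.comm_efficiency
              << " ParEff=" << m.parallel_efficiency << "\n";
}
```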
Single node analysis on Dibona – Overview
[Four panels with the POP efficiency metrics of Pennant, CP2K, OpenFOAM, and Grid]
Multi node analysis on Dibona – Overview
[Four panels with the POP efficiency metrics of Pennant, CP2K, OpenFOAM, and Grid]
Pennant – Single node
−7% IPC scalability when using 64 MPI processes
This behaviour is expected when several processes concurrently use all the resources of a compute node.
Pennant – Multi node
Good scalability behaviour up to 16 nodes
Transfer efficiency is the potential limiting factor for scalability at a high number of compute nodes
We studied the MPI calls:
MPI_Allreduce shows an irregular exit pattern (delay of ∼80 µs).
- Not due to load imbalance (all processes have reached the MPI call)
- No system preemption (we checked the cycles/µs in the trace)
- Probably due to the MPI implementation deployed on Dibona
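This kind of check can be reproduced with a small microbenchmark (a hedged sketch, not the exact experiment run on Dibona): synchronize all ranks so load imbalance is excluded, time MPI_Allreduce on every rank, and look at the spread between the fastest and slowest rank.

```cpp
// Sketch of an MPI_Allreduce exit-skew check (not the original Dibona experiment):
// a barrier rules out load imbalance, then every rank times the collective.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    double in = rank, out = 0.0, t_avg = 0.0;
    for (int i = 0; i < reps; ++i) {
        MPI_Barrier(MPI_COMM_WORLD);            // all ranks reach the call together
        const double t0 = MPI_Wtime();
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t_avg += MPI_Wtime() - t0;
    }
    t_avg /= reps;

    double t_min = 0.0, t_max = 0.0;
    MPI_Reduce(&t_avg, &t_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_avg, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("per-rank Allreduce time: min %.1f us, max %.1f us\n",
                    1e6 * t_min, 1e6 * t_max);
    MPI_Finalize();
}
```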
CP2K – Single node
When increasing the number of MPI processes, the number of instructions per rank does not decrease proportionally.
The limiting factor is Instruction Scalability.
Using the clustering tool* we identified regions with similar behaviour (e.g. similar IPC).
The identified clusters cover about 90% of the execution time regardless of the number of processes.
* https://tools.bsc.es/downloads
Clusters 1, 2 and 3 are responsible for the poor Instruction scalability: their instruction counts do not scale with the number of processes.
The next natural step would be to map the clusters to code regions.
CP2K – Multi node
Besides the problem of Instruction scalability, there is a steady decrease of Transfer efficiency.
The limiting factor is Transfer Efficiency: collective operations are the most frequently called MPI primitives.
MPI processes                        | 256      | 512      | 1024
% of MPI time in Allreduce calls     | 59.47%   | 58.17%   | 32.09%
% of MPI time in Alltoall calls      | 11.28%   | 7.21%    | 10.15%
% of MPI time in Alltoallv calls     | 15.00%   | 21.51%   | 43.04%
Avg. message size of Alltoallv calls | 6.35 KiB | 3.47 KiB | 1.72 KiB
% of MPI time in Alltoallv transfer  | 0.38%    | 0.85%    | 3.37%
Bad Transfer efficiency is caused by:
- Overheads of the MPI implementation when exchanging small messages
- Network congestion generated by collective calls with a high number of processes
Suggested mitigation: decrease the number of collective operations by aggregating several of them into a single MPI call, as sketched below.
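A minimal illustration of such aggregation (hypothetical code, not taken from CP2K): instead of one MPI_Allreduce per scalar, pack the scalars into a single buffer and reduce them in one call.

```cpp
// Hypothetical example (not CP2K code): three separate Allreduce calls ...
#include <mpi.h>

void reduce_separately(double& a, double& b, double& c, MPI_Comm comm) {
    MPI_Allreduce(MPI_IN_PLACE, &a, 1, MPI_DOUBLE, MPI_SUM, comm);
    MPI_Allreduce(MPI_IN_PLACE, &b, 1, MPI_DOUBLE, MPI_SUM, comm);
    MPI_Allreduce(MPI_IN_PLACE, &c, 1, MPI_DOUBLE, MPI_SUM, comm);
}

// ... replaced by a single call on a packed buffer, paying the per-call
// overhead and the network latency once instead of three times.
void reduce_aggregated(double& a, double& b, double& c, MPI_Comm comm) {
    double buf[3] = {a, b, c};
    MPI_Allreduce(MPI_IN_PLACE, buf, 3, MPI_DOUBLE, MPI_SUM, comm);
    a = buf[0]; b = buf[1]; c = buf[2];
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    double a = 1.0, b = 2.0, c = 3.0;
    reduce_aggregated(a, b, c, MPI_COMM_WORLD);
    MPI_Finalize();
}
```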
OpenFOAM – Single node
When increasing the number of MPI processes, the number of instructions scales well, while the number of cycles does not.
The limiting factor is IPC scalability.
Possible cause: saturation of shared hardware resources. The main suspect is memory bandwidth, but this is impossible to verify directly due to the lack of LLC hardware counters (a bandwidth-bound kernel sketch below illustrates the suspected effect).
Test with 32 MPI processes:
- socket #1: 32, socket #2: 0  → IPC = 0.58
- socket #1: 16, socket #2: 16 → IPC = 0.80
We can discard other causes for the IPC drop by looking at other hardware counters:
- no variation corresponding to 64 MPI processes
- the number of TLB misses can explain the lower IPC with 32 MPI processes compared to 64 MPI processes
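To illustrate the suspected mechanism, a bandwidth-bound kernel of this kind (a generic sketch, not an OpenFOAM kernel) loses per-process throughput when all processes share one socket's memory channels, which is consistent with the 32-vs-16+16 test above:

```cpp
// Generic bandwidth-bound triad kernel (not an OpenFOAM kernel): arrays are
// sized well beyond the LLC, so per-process throughput drops when all ranks
// are pinned to the memory channels of a single socket.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const std::size_t n = std::size_t(1) << 24;         // 128 MiB per array
    std::vector<double> a(n, 0.0), b(n, 2.0), c(n, 0.5);

    MPI_Barrier(MPI_COMM_WORLD);
    const double t0 = MPI_Wtime();
    const int reps = 10;
    for (int r = 0; r < reps; ++r)
        for (std::size_t i = 0; i < n; ++i)
            a[i] = b[i] + 3.0 * c[i];                    // 2 loads + 1 store per element
    const double seconds = MPI_Wtime() - t0;
    if (a[n / 2] != 3.5) std::printf("unexpected result\n");  // keep the loop live

    const double gbytes_per_s = reps * 3.0 * n * sizeof(double) / seconds / 1e9;
    double aggregate = 0.0;
    MPI_Reduce(&gbytes_per_s, &aggregate, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("aggregate bandwidth with %d ranks: %.1f GB/s\n", size, aggregate);
    MPI_Finalize();
}
```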
OpenFOAM – Multi node
Transfer efficiency < 80% with 512+ MPI processes
We studied regions containing many MPI calls
We isolated a section of the iteration corresponding to:
- 36% of the total iteration time
- 63% of the MPI time of the iteration
- a Transfer efficiency below 50%
We study the MPI_Isend calls in this region.
Low Transfer efficiency is due to:
- the high number of MPI calls sending small messages
- the MPI overhead and the network latency that are paid for each call (see the aggregation sketch below)
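The aggregation idea from the CP2K slide applies to point-to-point traffic as well (a hypothetical sketch, not OpenFOAM code): pack the small messages destined for the same neighbour into one buffer so the per-call overhead and latency are paid once.

```cpp
// Hypothetical sketch of the send side (not OpenFOAM code): many small
// non-blocking sends to the same neighbour ...
#include <mpi.h>
#include <vector>

// One MPI_Isend per small chunk: MPI overhead + network latency paid per call.
void send_separately(const std::vector<std::vector<double>>& chunks, int dst,
                     MPI_Comm comm, std::vector<MPI_Request>& reqs) {
    for (const auto& chunk : chunks) {
        reqs.emplace_back();
        MPI_Isend(chunk.data(), static_cast<int>(chunk.size()), MPI_DOUBLE,
                  dst, /*tag=*/0, comm, &reqs.back());
    }
}

// ... versus packing the chunks into one buffer and sending once.
// (Both variants require the caller to wait on the requests before
// reusing or freeing the buffers.)
void send_aggregated(const std::vector<std::vector<double>>& chunks, int dst,
                     MPI_Comm comm, std::vector<double>& packed, MPI_Request& req) {
    packed.clear();
    for (const auto& chunk : chunks)
        packed.insert(packed.end(), chunk.begin(), chunk.end());
    MPI_Isend(packed.data(), static_cast<int>(packed.size()), MPI_DOUBLE,
              dst, /*tag=*/1, comm, &req);
}
```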
OpenFOAM – Multi node – Dibona vs MareNostrum
[Two panels: Dibona, MareNostrum 4]
Worry about the programmer, not about the architecture!
OpenFOAM – Multi node – Dibona vs MareNostrum
[Two panels: Dibona, MareNostrum 4]
Conclusions
1. Studied complex applications on Dibona, a cluster powered by Marvell ThunderX2 CPUs
2. Applied an efficiency model to compare applications running on different HPC technologies
3. Highlighted the importance of having a complete set of hardware counters
Our method:
- Enables a deep understanding of the behaviour of complex applications
- Provides feedback to the developers for potential performance improvements
- Provides feedback to the system administrators to improve the general setup of the cluster (e.g. misconfiguration of the MPI implementation)
Acknowledgments:
- We warmly thank the Mont-Blanc team at Atos/BULL for supporting this research
- The Student Cluster Team 2018 of the Barcelona Supercomputing Center
BTW: We are still looking for a cluster for SCC 2020!
More info: BSC booth #1975 – https://pop-coe.eu/