Performance study of HPC applications on an Arm-based cluster
using a generic efficiency model
Filippo [email protected]
Arm HPC User Group @ SC19, Denver – 2019, Nov 18th
Performance Optimisation and Productivity
EU-H2020 GA–676553
Motivation: Student Cluster Competition
Each year the team has 6 months of preparation, focusing on:
- 3 or more new complex applications
- a new cluster to bring up and operate
We need a method to:
- analyze the performance of the applications on our cluster
- quickly understand if they are working well or not
[Timeline of the team's Student Cluster Competition participations: 2015–2020*]
* Team candidate to SCC 2020: in case of acceptance, we need a cluster!
Motivation: Emerging Technology Partition of MareNostrum
The procurement of MareNostrum 4 included:
- a main partition powered by Intel Skylake CPUs
- three emerging technology partitions:
  – Power9 + GPU (like Summit)
  – AMD-based (more details @ BSC booth #1975)
  – Arm-based (more details @ BSC booth #1975)
We need a method to:
- analyze the performance of the applications on our cluster
- quickly understand if they are working well or not
Motivation: Performance Optimisation and Productivity CoE
BSC coordinates POP, a Center of Excellence that provides performance optimisation and productivity services for academic and industrial codes in all domains.
POP developed a method to:
- analyze the performance of parallel applications on HPC clusters
- quickly understand where to look to spot inefficiencies
Apply for POP analysis of your code here: https://pop-coe.eu/request-service-form
The POP efficiency model
More details in: M. Wagner, S. Mohr, J. Gimenez, and J. Labarta, “A structured approach to performance analysis,” in Tools for High Performance Computing 2017. Cham: Springer International Publishing, 2019, pp. 1–15, and https://pop-coe.eu/node/69
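As a reference for the metrics used in the following slides, this is a compact summary of the POP hierarchy (following the definitions documented at https://pop-coe.eu/node/69; "useful" denotes time spent outside MPI):

```latex
% Compact summary of the POP efficiency model (see https://pop-coe.eu/node/69).
% "useful" = time spent outside MPI, measured per process on the traced iterations.
\begin{align*}
  \text{Global Efficiency}        &= \text{Parallel Efficiency} \times \text{Computation Scalability}\\
  \text{Parallel Efficiency}      &= \text{Load Balance} \times \text{Communication Efficiency}\\
  \text{Communication Efficiency} &= \text{Serialization Eff.} \times \text{Transfer Eff.}\\
  \text{Computation Scalability}  &= \text{Instruction Scal.} \times \text{IPC Scal.} \times \text{Frequency Scal.}\\[4pt]
  \text{Load Balance}             &= \frac{\operatorname{avg}(\text{useful})}{\max(\text{useful})},\qquad
  \text{Communication Efficiency}  = \frac{\max(\text{useful})}{\text{runtime}}
\end{align*}
```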
The Dibona cluster
- 48 nodes, each node includes:
  – 2× Armv8 Marvell ThunderX2 CPUs @ 2.0 GHz (2.5 GHz turbo), 32 cores and 4 threads/core
  – 8 memory channels per socket
  – 256 GB DDR4 memory
  – 128 GB SSD local storage
- Interconnection network: single-port Mellanox EDR
- 49 TeraFlops theoretical peak performance
- Integrated by Atos/BULL
F. Banchelli, et al., “MB3 D6.9 – Performance analysis of applications and mini-applications and benchmarking on the project test platforms,” Tech. Rep., 2019, http://bit.ly/mb3-dibona-apps
Methodology
We studied 4 complex applications:
- Selected from SCC challenges of 2019 and 2018
- Running the input set proposed by the organizers of the competition
                       | Pennant                  | CP2K                                    | OpenFOAM              | Grid
URL                    | github.com/lanl/PENNANT  | github.com/cp2k                         | openfoam.com          | github.com/paboyle/Grid
Scientific domain      | Lagrangian hydrodynamics | Quantum chemistry / solid state physics | CFD                   | Lattice QCD
Version / commit       | v0.9                     | v6.1.0                                  | v1812                 | 251b904
Programming language   | C++                      | F77 & F90                               | C++                   | C++
Lines of code          | 3.5k                     | 860k                                    | 835k                  | 75k
Parallelization scheme | MPI                      | MPI                                     | MPI                   | MPI
Compiler               | Arm/19.1                 | GNU/8.2.0                               | Arm/19.1              | Arm/19.1
Scientific libraries   | -                        | BLAS, LAPACK, FFT via Armpl/19.1        | Armpl/19.1            | Armpl/19.1
Focus of analysis      | 4 steps                  | 4 steps QMMM                            | 4 steps SIMPLE solver | 20 steps + 1 eval
Input set              | Leblancx4                | wdftb-1 (FIST, QuickStep DFTB, QMMM)    | DrivAer model         | Lanczos algorithm
We study the efficiency metrics (aka POP metrics) for each application:
- Run with different numbers of MPI processes, within a single node and on multiple nodes
- Gather Extrae traces on Dibona
- Isolate and study 5 representative iterations
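The metrics themselves are computed from the Extrae traces with the BSC tools. As a minimal sketch (not the actual POP tooling), the top-level parallel-efficiency numbers can be derived from per-process useful time and the runtime of the traced iterations like this:

```cpp
// Minimal sketch (not the BSC/POP tooling): top-level POP metrics derived from
// per-process "useful" time (time outside MPI) and the runtime of the traced
// iterations, following the definitions on the previous slide.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

struct PopMetrics {
    double load_balance;        // avg(useful) / max(useful)
    double comm_efficiency;     // max(useful) / runtime
    double parallel_efficiency; // load_balance * comm_efficiency
};

PopMetrics pop_metrics(const std::vector<double>& useful, double runtime) {
    const double max_u = *std::max_element(useful.begin(), useful.end());
    const double avg_u = std::accumulate(useful.begin(), useful.end(), 0.0) /
                         static_cast<double>(useful.size());
    return {avg_u / max_u, max_u / runtime, (avg_u / max_u) * (max_u / runtime)};
}

int main() {
    // Hypothetical numbers: 4 ranks over a 10 s window of 5 iterations.
    const std::vector<double> useful = {8.0, 7.5, 7.9, 6.8};
    const PopMetrics m = pop_metrics(useful, 10.0);
    std::cout << "LB=" << m.load_balance << " CommEff=" << m.comm_efficiency
              << " ParEff=" << m.parallel_efficiency << "\n";
}
```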
Single node analysis on Dibona – Overview
[Four panels with the POP efficiency metrics of Pennant, CP2K, OpenFOAM, and Grid]
Multi node analysis on Dibona – Overview
[Four panels with the POP efficiency metrics of Pennant, CP2K, OpenFOAM, and Grid]
Pennant – Single node
−7% IPC scalability when using 64 MPI processes
This behaviour is expected when several processes concurrently use all the resources of a compute node.
Pennant – Multi node
Good scalability behaviour up to 16 nodes
Transfer efficiency is the potential limiting factor for scalability at a high number of compute nodes
We studied the MPI calls:
MPI_Allreduce shows an irregular exit pattern (delay of ∼80 µs).
- Not due to load imbalance (all processes have reached the MPI call)
- No system preemption (we checked the cycles/µs in the trace)
- Probably due to the MPI implementation deployed on Dibona
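This kind of check can be reproduced with a small microbenchmark (a hedged sketch, not the exact experiment run on Dibona): synchronize all ranks so load imbalance is excluded, time MPI_Allreduce on every rank, and look at the spread between the fastest and slowest rank.

```cpp
// Sketch of an MPI_Allreduce exit-skew check (not the original Dibona experiment):
// a barrier rules out load imbalance, then every rank times the collective.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    double in = rank, out = 0.0, t_avg = 0.0;
    for (int i = 0; i < reps; ++i) {
        MPI_Barrier(MPI_COMM_WORLD);            // all ranks reach the call together
        const double t0 = MPI_Wtime();
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t_avg += MPI_Wtime() - t0;
    }
    t_avg /= reps;

    double t_min = 0.0, t_max = 0.0;
    MPI_Reduce(&t_avg, &t_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_avg, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("per-rank Allreduce time: min %.1f us, max %.1f us\n",
                    1e6 * t_min, 1e6 * t_max);
    MPI_Finalize();
}
```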
CP2K – Single node
When increasing the number of MPI processes, the number of instructions per rank does not decrease proportionally.
The limiting factor is Instruction Scalability.
Using the clustering tool* we identified regions with similar behaviour (e.g. similar IPC).
The identified clusters cover about 90% of the execution time regardless of the number of processes.
* https://tools.bsc.es/downloads
Clusters 1, 2 and 3 are responsible for the poor Instruction scalability: their instruction counts do not scale with the number of processes.
The next natural step would be to map the clusters to code regions.
CP2K – Multi node
Besides the problem of Instruction scalability, there is a steady decrease of Transfer efficiency.
The limiting factor is Transfer Efficiency: collective operations are the most frequently called MPI primitives.
MPI processes                        | 256      | 512      | 1024
% of MPI time in Allreduce calls     | 59.47%   | 58.17%   | 32.09%
% of MPI time in Alltoall calls      | 11.28%   | 7.21%    | 10.15%
% of MPI time in Alltoallv calls     | 15.00%   | 21.51%   | 43.04%
Avg. message size of Alltoallv calls | 6.35 KiB | 3.47 KiB | 1.72 KiB
% of MPI time in Alltoallv transfer  | 0.38%    | 0.85%    | 3.37%
Bad Transfer efficiency is caused by:
- Overheads of the MPI implementation when exchanging small messages
- Network congestion generated by collective calls with a high number of processes
Suggested mitigation: decrease the number of collective operations by aggregating several of them into a single MPI call, as sketched below.
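A minimal illustration of such aggregation (hypothetical code, not taken from CP2K): instead of one MPI_Allreduce per scalar, pack the scalars into a single buffer and reduce them in one call.

```cpp
// Hypothetical example (not CP2K code): three separate Allreduce calls ...
#include <mpi.h>

void reduce_separately(double& a, double& b, double& c, MPI_Comm comm) {
    MPI_Allreduce(MPI_IN_PLACE, &a, 1, MPI_DOUBLE, MPI_SUM, comm);
    MPI_Allreduce(MPI_IN_PLACE, &b, 1, MPI_DOUBLE, MPI_SUM, comm);
    MPI_Allreduce(MPI_IN_PLACE, &c, 1, MPI_DOUBLE, MPI_SUM, comm);
}

// ... replaced by a single call on a packed buffer, paying the per-call
// overhead and the network latency once instead of three times.
void reduce_aggregated(double& a, double& b, double& c, MPI_Comm comm) {
    double buf[3] = {a, b, c};
    MPI_Allreduce(MPI_IN_PLACE, buf, 3, MPI_DOUBLE, MPI_SUM, comm);
    a = buf[0]; b = buf[1]; c = buf[2];
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    double a = 1.0, b = 2.0, c = 3.0;
    reduce_aggregated(a, b, c, MPI_COMM_WORLD);
    MPI_Finalize();
}
```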
OpenFOAM – Single node
When increasing the number of MPI processes, the number of instructions scales well, while the number of cycles does not.
The limiting factor is IPC scalability.
Possible cause: saturation of shared hardware resources. The main suspect is memory bandwidth, but this is impossible to verify directly due to the lack of LLC hardware counters (a bandwidth-bound kernel sketch below illustrates the suspected effect).
Test with 32 MPI processes:
- socket #1: 32, socket #2: 0  → IPC = 0.58
- socket #1: 16, socket #2: 16 → IPC = 0.80
We can discard other causes for the IPC drop by looking at other hardware counters:
- no variation corresponding to 64 MPI processes
- the number of TLB misses can explain the lower IPC with 32 MPI processes compared to 64 MPI processes
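To illustrate the suspected mechanism, a bandwidth-bound kernel of this kind (a generic sketch, not an OpenFOAM kernel) loses per-process throughput when all processes share one socket's memory channels, which is consistent with the 32-vs-16+16 test above:

```cpp
// Generic bandwidth-bound triad kernel (not an OpenFOAM kernel): arrays are
// sized well beyond the LLC, so per-process throughput drops when all ranks
// are pinned to the memory channels of a single socket.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const std::size_t n = std::size_t(1) << 24;         // 128 MiB per array
    std::vector<double> a(n, 0.0), b(n, 2.0), c(n, 0.5);

    MPI_Barrier(MPI_COMM_WORLD);
    const double t0 = MPI_Wtime();
    const int reps = 10;
    for (int r = 0; r < reps; ++r)
        for (std::size_t i = 0; i < n; ++i)
            a[i] = b[i] + 3.0 * c[i];                    // 2 loads + 1 store per element
    const double seconds = MPI_Wtime() - t0;
    if (a[n / 2] != 3.5) std::printf("unexpected result\n");  // keep the loop live

    const double gbytes_per_s = reps * 3.0 * n * sizeof(double) / seconds / 1e9;
    double aggregate = 0.0;
    MPI_Reduce(&gbytes_per_s, &aggregate, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("aggregate bandwidth with %d ranks: %.1f GB/s\n", size, aggregate);
    MPI_Finalize();
}
```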
OpenFOAM – Multi node
Transfer efficiency < 80% with 512+ MPI processes
We studied regions containing many MPI calls
We isolated a section of the iteration corresponding to:
- 36% of the total iteration time
- 63% of the MPI time of the iteration
- a Transfer efficiency below 50%
We study the MPI_Isend calls in this region.
Low Transfer efficiency is due to:
- the high number of MPI calls sending small messages
- the MPI overhead and the network latency that are paid for each call (see the aggregation sketch below)
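The aggregation idea from the CP2K slide applies to point-to-point traffic as well (a hypothetical sketch, not OpenFOAM code): pack the small messages destined for the same neighbour into one buffer so the per-call overhead and latency are paid once.

```cpp
// Hypothetical sketch of the send side (not OpenFOAM code): many small
// non-blocking sends to the same neighbour ...
#include <mpi.h>
#include <vector>

// One MPI_Isend per small chunk: MPI overhead + network latency paid per call.
void send_separately(const std::vector<std::vector<double>>& chunks, int dst,
                     MPI_Comm comm, std::vector<MPI_Request>& reqs) {
    for (const auto& chunk : chunks) {
        reqs.emplace_back();
        MPI_Isend(chunk.data(), static_cast<int>(chunk.size()), MPI_DOUBLE,
                  dst, /*tag=*/0, comm, &reqs.back());
    }
}

// ... versus packing the chunks into one buffer and sending once.
// (Both variants require the caller to wait on the requests before
// reusing or freeing the buffers.)
void send_aggregated(const std::vector<std::vector<double>>& chunks, int dst,
                     MPI_Comm comm, std::vector<double>& packed, MPI_Request& req) {
    packed.clear();
    for (const auto& chunk : chunks)
        packed.insert(packed.end(), chunk.begin(), chunk.end());
    MPI_Isend(packed.data(), static_cast<int>(packed.size()), MPI_DOUBLE,
              dst, /*tag=*/1, comm, &req);
}
```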
OpenFOAM – Multi node – Dibona vs MareNostrum
[Two panels: Dibona, MareNostrum 4]
Worry about the programmer, not about the architecture!
OpenFOAM – Multi node – Dibona vs MareNostrum
[Two panels: Dibona, MareNostrum 4]
Conclusions
1. Studied complex applications on Dibona, a cluster powered by Marvell ThunderX2 CPUs
2. Applied an efficiency model to compare applications running on different HPC technologies
3. Highlighted the importance of having a complete set of hardware counters
Our method:
- Enables a deep understanding of the behaviour of complex applications
- Provides feedback to the developers for potential performance improvements
- Provides feedback to the system administrators to improve the general setup of the cluster (e.g. misconfiguration of the MPI implementation)
Acknowledgments:
- We warmly thank the Mont-Blanc team at Atos/BULL for supporting this research
- The Student Cluster Team 2018 of the Barcelona Supercomputing Center
BTW: We are still looking for a cluster for SCC 2020!
More info: BSC booth #1975 – https://pop-coe.eu/