The Arm architecture in HPC: from mobile phones to the Top500

Mont-Blanc Project

EU-H2020 GA–671697

Filippo Mantovani

[email protected]

Marta Garcia-Gasulla

[email protected]

Impact of Arm hardware from an HPC application perspective

Austin - 2019, Sep 15

Barcelona Supercomputing Center

BSC-CNS is a consortium that includes:

• Spanish Government 60%

• Catalan Government 30%

• Univ. Politècnica de Catalunya (UPC) 10%

Activities:

• Supercomputing services to Spanish and EU researchers

• R&D in Computer, Life, Earth and Engineering Sciences

• PhD programme, technology transfer, public engagement

The MareNostrum 4 supercomputer

Total peak performance: 13.7 PFlop/s

• 80%: Europe (access: prace-ri.eu/hpc-access)

• 16% (access: res.es)

• 4%

The archeology of Mont-Blanc (2012-2014)

In the early days Arm-based prototypes were made out of Android development kits

• Non-HPC platforms

– 32-bit CPUs

– 1 Gigabit Ethernet (bridged)

– With several slow I/O interfaces

• Targeting embedded market

– Not prepared for 24/7 operation

– With cooling issues

– With form factor issues

N. Rajovic, A. Rico, N. Puzovic, C. Adeniyi-Jones, and A. Ramirez, 2014. “Tibidabo: Making the case for an ARM-based HPC system”. Future Generation Computer Systems, 36, pp.322-334.

Tibidabo prototype, 2012

The first Mont-Blanc prototype (2015 - 2019)

N. Rajovic et al., “The Mont-Blanc Prototype: An Alternative Approach for HPC Systems,” in Proceedings of SC’16, p. 38:1-38:12.

Exynos 5 compute card

• 2 x Cortex-A15 @ 1.7 GHz

• 1 x Mali T604 GPU

• 2 GB of LPDDR3 RAM

• 1 Gb Ethernet

• 15 Watts

Mont-Blanc rack

• 8 BullX chassis

• 72 Compute blades

• 1080 Compute cards

• 2160 CPUs

• 1080 GPUs

• 4.3 TB of DRAM

• 17.2 TB of Flash

The “Dibona” Mont-Blanc prototype (2018)

F. Banchelli, et al., “MB3 D6.9 – Performance analysis of applications and mini-applications and benchmarking on the project test platforms,” Tech. Rep., 2019, http://bit.ly/mb3-dibona-apps

Dibona node

• 2 x ThunderX2 processors (CN9980-2000LG4077-Y21-G 32 2.0)

• 64 x Marvell’s cores @ 2.0 GHz

• 256 GB memory

• 256 GB local storage (+ 8 TB via NFS)

Dibona rack

• 48 x Dibona nodes

• Fat tree interconnect topology

• Infiniband EDR 100 Gb/s

• Theoretical peak: 49 TFlops

Based on Bull Sequana HPC infrastructure

(both x86 and Intel)
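
The theoretical peak quoted for the Dibona rack can be checked with a back-of-the-envelope calculation. The sketch below takes the node count, cores per node and clock frequency from the slide. The figure of 8 double-precision floating-point operations per cycle per core (two 128-bit NEON FMA pipelines) is an assumption that the slide does not state.

```python
# Back-of-the-envelope theoretical peak for the Dibona rack.
NODES = 48              # from the slide
CORES_PER_NODE = 64     # from the slide (2 sockets x 32 cores)
CLOCK_HZ = 2.0e9        # from the slide
FLOPS_PER_CYCLE = 8     # ASSUMPTION: 2 NEON pipes x 2 doubles x 2 (FMA)


def theoretical_peak_flops(nodes, cores_per_node, clock_hz, flops_per_cycle):
    """Peak = nodes x cores x frequency x flops issued per cycle per core."""
    return nodes * cores_per_node * clock_hz * flops_per_cycle


peak = theoretical_peak_flops(NODES, CORES_PER_NODE, CLOCK_HZ, FLOPS_PER_CYCLE)
print(f"Theoretical peak: {peak / 1e12:.1f} TFlop/s")
```

Under that assumption the result rounds to the 49 TFlops listed for the rack.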

Micro-benchmarks: Memory, Computational Throughput, Network

Memory bandwidth and latency

CPU floating point throughput

Putting all together into a “roofline model”
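
The roofline model combines the two previous measurements (memory bandwidth and floating-point throughput) into a single bound on attainable performance. The sketch below shows the standard formula. The machine numbers are hypothetical placeholders, not values measured on Dibona or MareNostrum 4.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the memory roof.

    arithmetic_intensity is in flop/byte, bandwidth in GB/s, peak in GFlop/s.
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory bound."""
    return peak_gflops / bandwidth_gbs


# HYPOTHETICAL node: 1000 GFlop/s peak and 200 GB/s sustained bandwidth.
PEAK, BW = 1000.0, 200.0
for ai in (0.125, 1.0, 5.0, 10.0):
    print(f"AI = {ai:6.3f} flop/byte -> {attainable_gflops(ai, PEAK, BW):7.1f} GFlop/s")
print(f"Ridge point: {ridge_point(PEAK, BW):.1f} flop/byte")
```

Kernels to the left of the ridge point are limited by memory bandwidth, kernels to the right by floating-point throughput.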

Network: Infiniband bandwidth and latency

Network: sendrecv congestion and weak links

Classical HPC benchmarks: LINPACK and HPCG

High-Performance LINPACK

Source: https://www.top500.org/statistics/efficiency-power-cores/

High-Performance Conjugate Gradient (HPCG)

Source: https://insidehpc.com/2018/11/green-hpcg-list-road-exascale/

HPCG shared memory implementation - 1

• Relevant for the HPC community

• Not fully explored on Arm architecture

• Part of the Student Cluster Competition 2017 and 2018

HPCG shared memory implementation - 2

MPI-only strong scaling (4 iterations) on KNL

• 64 MPI processes, 10% in MPI calls

• 128 MPI processes, 20% in MPI calls

• 192 MPI processes, 40% in MPI calls

HPCG shared memory implementation - 3

• B0 silicon, OpenMPI 3.0, GCC 7.1.0, Arm HPC Compiler 18.1

• Code released: https://gitlab.com/arm-hpc/benchmarks/hpcg

D. Ruiz, F. Spiga, et al., “Open-Source Shared Memory implementation of the HPCG benchmark: analysis, improvements and evaluation on Cavium ThunderX2,” in 2019 International Conference on High Performance Computing & Simulation (HPCS), in press.

Applications: Complex codes used in data-centers for production of scientific results

TensorFlow: HPC clusters evaluation

• TensorFlow v1.11

• Linear algebra using the built-in library (Eigen)

• CPU only comparison

• Node-to-node comparison

TensorFlow: Performance improvements targeting Arm

• Other vendors leverage optimized linear algebra libraries (e.g., MKL)

• We plugged Arm Performance Libraries into TensorFlow (proof of concept)

• Handed over to Arm: https://gitlab.com/arm-hpc/packages/wikis/packages/tensorflow

G. Ramirez-Gargallo, M. Garcia-Gasulla, and F. Mantovani, “TensorFlow on State-of-the-Art HPC Clusters: A Machine Learning use Case,” in 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2019, pp. 526–533.

A real CFPD problem: Alya

Alya for physicists

• It can simulate several physics models:

– Incompressible Flows, Compressible Flows, Non-linear Solid Mechanics, Species transport equations, Excitable Media, Thermal Flows, N-body collisions.

• Simulations can combine multiple models (multi-physics)

Alya for computer scientists

• Fortran code (500 kLines)

• MPI only (production)

• Shared memory with OpenMP (experimental)

• Unstructured and hybrid meshes

• Scalability up to 100 kCores

Alya: production scientific code in action

What happens if you run a production simulation on a real HPC cluster?

Despite the effort in preparing the grid we can notice a macroscopic load imbalance.
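
Load imbalance of this kind is commonly quantified as the ratio between the average and the maximum time that the MPI processes spend in useful computation. The sketch below applies that common definition to hypothetical per-process timings. The numbers are illustrative only and are not taken from an Alya run.

```python
def load_balance(compute_times):
    """Average over maximum useful-computation time across processes.

    Returns 1.0 for a perfectly balanced run; lower values mean that
    most processes wait for the slowest one.
    """
    average = sum(compute_times) / len(compute_times)
    return average / max(compute_times)


# HYPOTHETICAL per-process compute times in seconds (one overloaded process).
times = [10.0, 10.0, 10.0, 20.0]
lb = load_balance(times)
print(f"Load balance: {lb:.3f}")
print(f"Fraction of core time lost waiting: {1.0 - lb:.3f}")
```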

Protect the performance

Alya: Multidependences

• Goal: To avoid the race condition between two threads updating elements of the grid sharing variables.

Alya: Multidependency evaluation

• GCC + OmpSs

• Assembly phase only (no Subgrid scale)

• Impact of atomic implementation across different architectures

• Impact of coloring across different architectures (linked to micro-architecture of caches)
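
To illustrate the coloring alternative mentioned above, the sketch below assigns colors to mesh elements so that two elements sharing a mesh node never receive the same color. Elements of one color can then be assembled in parallel without atomics. This is a generic greedy coloring written for illustration. It is not Alya's implementation and it does not use the OmpSs multidependences feature. The function name and the tiny mesh are hypothetical.

```python
from collections import defaultdict


def color_elements(elements):
    """Greedy coloring: elements that share a mesh node get different colors.

    elements is a list of tuples of node ids; returns one color per element.
    """
    # Map each node to the elements that touch it.
    node_to_elems = defaultdict(list)
    for e, nodes in enumerate(elements):
        for n in nodes:
            node_to_elems[n].append(e)

    colors = [-1] * len(elements)  # -1 means "not colored yet"
    for e, nodes in enumerate(elements):
        # Colors already taken by neighbours that share at least one node.
        used = {colors[o] for n in nodes for o in node_to_elems[n] if colors[o] != -1}
        c = 0
        while c in used:
            c += 1
        colors[e] = c
    return colors


# HYPOTHETICAL 1D mesh: four line elements chained through shared nodes.
mesh = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(color_elements(mesh))
```

The trade-off hinted at on the slide is that coloring removes the need for atomic updates but reorders element traversal, which can hurt cache locality.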

Alya: Dynamic Load Balance

Alya: MPI only vs DLB evaluation

• Execution time of one “time-step”

• 2 nodes comparison

• GCC + OmpSs

Same cluster age, similar overall performance!

Alya: energy to solution [kJ] with production applications

• OmpSs + GCC + DLB best option for scaling

• Arm CLANG on Dibona more energy efficient than ICC on x86

• We cannot run DLB with Arm CLANG / ICC because of vendor-specific limitations

OpenFOAM: Dibona vs MareNostrum4

Worry about the programmer, not about the architecture!

Dibona highlights: overall strong scalability
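
Strong scalability is usually reported as speedup and parallel efficiency relative to the smallest run of a fixed-size problem. The sketch below computes both under that usual definition. The node counts and timings are hypothetical placeholders, not Dibona measurements.

```python
def strong_scaling(nodes, times):
    """Return (nodes, speedup, efficiency) rows relative to the first run.

    Speedup is T_base / T_n; efficiency divides the speedup by the
    increase in resources, n / n_base.
    """
    base_n, base_t = nodes[0], times[0]
    rows = []
    for n, t in zip(nodes, times):
        speedup = base_t / t
        efficiency = speedup / (n / base_n)
        rows.append((n, speedup, efficiency))
    return rows


# HYPOTHETICAL timings in seconds for a fixed-size problem.
rows = strong_scaling([1, 2, 4, 8], [100.0, 52.0, 28.0, 16.0])
for n, s, e in rows:
    print(f"{n:2d} nodes: speedup {s:5.2f}, efficiency {e:6.1%}")
```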

Education: The Student Cluster Competition “adventure”

The Student Cluster Competition: 5 years with Arm!

Competition rules

• Six undergraduate students

• One cluster managed by the team

• 3 kW power limit

Competition awards

• Best Linpack performance

• 1st, 2nd and 3rd overall

• Fan favorite

The challenge

• HPC Benchmarks

– HPL, HPCG

• Real scientific applications

– Quantum Espresso, TensorFlow

• Judge interviews

• Secret challenges

– Secret application, Power outage

F. Mantovani and F. Banchelli, “Filling the gap between education and industry: evidence-based methods for introducing undergraduate students to HPC”, EduHPC18 workshop held in conjunction with Supercomputing 2018, Dallas.

SCC with Arm: a successful educational package

Effort

• ∼160 h of work, which amounts to six ECTS (European Credit Transfer System) credits

• ∼530h of cluster usage

• 23 students (and counting!)

Cluster and other hardware resources

• Supported by sponsors

• Absorbed by university

Daily cost when traveling

• ∼100 EUR/day per student

F. Mantovani and F. Banchelli, “Filling the gap between education and industry: evidence-based methods for introducing undergraduate students to HPC”, EduHPC18 workshop held in conjunction with Supercomputing 2018, Dallas.

∼80% of students actively engaged in HPC after being part of the Barcelona SCC team

Conclusions

And what’s next?

Research on SoC architecture for next generation HPC systems

https://www.montblanc-project.eu/project/presentation#mont-blanc-2020-2017-2020

European roadmap to Exascale for data-centers and automotive

https://ec.europa.eu/digital-single-market/en/news/european-processor-initiative-consortium-develop-europes-microprocessors-future-supercomputers

Performance Optimization and Productivity (Center of Excellence)

• Promoting best practices in performance analysis and parallel programming

• Precise understanding of application and system behavior

• Suggestion/support on how to refactor code in the most productive way

• Transversal across application areas, platforms, scales

• For academic AND industrial codes and users

https://pop-coe.eu/

Publications

M. Garcia-Gasulla, M. Josep-Fabrego, B. Eguzkitza, and F. Mantovani, “Computational Fluid and Particle Dynamics Simulations for Respiratory System: Runtime Optimization on an Arm Cluster,” in Proceedings of the 47th International Conference on Parallel Processing Companion, ser. ICPP ’18. ACM, 2018, pp. 11:1-11:8.

E. Calore, F. Mantovani, and D. Ruiz, “Advanced performance analysis of HPC workloads on Cavium ThunderX,” in 2018 International Conference on High Performance Computing & Simulation (HPCS), pp. 375-382. IEEE, 2018.

F. Mantovani and E. Calore, “Performance and Power Analysis of HPC Workloads on Heterogeneous Multi-Node Clusters,” Journal of Low Power Electronics and Applications, vol. 8, no. 2, p. 13, 2018.

O. Rudyy, M. Garcia-Gasulla, F. Mantovani, A. Santiago, R. Sirvent, and M. Vazquez, “Containers in HPC: A scalability and portability study in production biological simulations,” in 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, in press. IEEE, 2019.

C. Gomez, F. Martinez, A. Armejach, M. Moreto, F. Mantovani, and M. Casas, “Design Space Exploration of Next-Generation HPC Machines,” in 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, in press. IEEE, 2019.

Student Cluster Competition and Education

F. Mantovani and F. Banchelli, “Filling the gap between education and industry: evidence-based methods for introducing undergraduate students to HPC”, EduHPC18 workshop held in conjunction with Supercomputing 2018, Dallas.

L. Alvarez, E. Ayguade, and F. Mantovani, “Teaching HPC Systems and Parallel Programming with Small-Scale Clusters”, EduHPC18 workshop held in conjunction with Supercomputing 2018, Dallas.

Acknowledgment: The Mont-Blanc application team at BSC

In case you liked the talk and you want to follow up:

• Speaker: [email protected]

• More info about Mont-Blanc:

– https://www.montblanc-project.eu/

– https://twitter.com/MontBlanc_Eu

– https://www.linkedin.com/company/mont-blanc-project/

• Visit BSC, the most beautiful datacenter in the world, if you come to Barcelona
