The rise of HPC accelerators toward a common vision for a petascale future
Filippo Spiga, Computational Scientist @ ICHEC
Disclaimer
This presentation contains the personal opinions, forecasts and forward-looking statements of the author, and not necessarily the opinions of any other person or organization mentioned.
For official communications and disclosures please refer to ICHEC, PRACE, PGI, CRAY, CAPS or NVIDIA websites
2 The rise of HPC accelerators - F. Spiga
Before we start: what an HPC accelerator is
← NVIDIA GPU M2090
Intel MIC prototype →
Very different hardware sharing the same idea: speed up the calculations
Outline
• Aim of this talk
• ICHEC
• PRACE – overview
• The state of the art in European HPC
• PRACE Preparatory Phase (PP)
• PRACE 1st Implementation Phase (1-IP)
• From “theory” to “practice”: approaching HPC accelerators
• Other interesting projects: EESI, Mont-Blanc, Autotune, e-FISCAL
• What about Exascale?
Aim of this talk
• Explain PRACE and the background behind it
• Explain what kind of technical expertise PRACE partners have gathered and brought to the HPC community
• Share its achievements and actions (wherever possible… NDA)
• Propose a critical point of view about the adoption of accelerators in HPC
If you do not know what an accelerator is, how a GPU is internally organized, or what CUDA (or OpenCL) is, I will be glad to have a chat with you after the talk
Irish Centre for High-End Computing (ICHEC)
• Founded in 2005
• Staff of 20–25 people
• Objectives:
  – provide High-Performance Computing (HPC) resources and support
  – provide education and training for researchers in third-level institutions
  – provide technology transfer and consultancy services to Irish industries
• Awards:
  – CUDA Research Centre (since 2010)
  – HMPP Centre of Competence (since 2011)
• FP7 European Projects:
  – PRACE 1IP & 2IP, Autotune, eFISCAL
Supercomputing Drives Science through Simulation
[Diagram: science domains driven by simulation]
• Environment: weather/climatology, pollution/ozone hole
• Ageing society: medicine, biology
• Energy: plasma physics, fuel cells
• Materials/Inf. Tech: spintronics, nano-science
Partnership for Advanced Computing in Europe (PRACE)
PRACE is creating and implementing a persistent pan-European HPC service providing European researchers with access to capability computers and forming the top level of the European HPC ecosystem.
→ The PRACE idea began in 2006, after a European Commission report on the necessity of a unified European strategy (and synergy) for HPC
Now 21 partners involved
PRACE timeline
[Timeline 2006–2013: MoU, then Preparatory Phase, then Implementation Phases (1-IP, 2-IP, 3-IP); the PRACE Initiative becomes the PRACE Project and then the PRACE Research Infrastructure; DEISA, eDEISA and DEISA2 run alongside]
PRACE mission & main tasks
• Creation of a permanent pan-European Research Infrastructure
• Provision of extensive HPC training
• Deployment and evaluation of promising architectures → prototyping
• Porting of applications to petascale
PRACE is providing a set of machines with successively increasing capability:
• 1 PF (2010) + 1.5 PF (2011) + 1 PF (2011) + 3 PF (2012) + >3 PF (2013), with upgrade steps to follow (+5 PF in 2013, …)
• accumulated capability estimated at >15 PF in 2013
→ HPC for science, industry and vendors
The European HPC Scenario
Tier-0 systems:
• JUGENE (Germany), 1 PFlop/s; CURIE (France), 1.6 PFlop/s; HERMIT (Germany), 1 PFlop/s; SuperMUC (Germany), ~3 PFlop/s; …
[Pyramid diagram: Tier-0 (PRACE) at the top, Tier-1 (DEISA/PRACE) in the middle, Tier-2 at the base; capability grows toward the top, the number of systems grows toward the base]
Tier-1 systems:
• every country has one or more HPC clusters of O(100) TFlop/s
• every country has its own accelerated prototypes (in or outside the PRACE RI)
• almost every country has a GPU cluster (99% is NVIDIA)
• other accelerators are out of the game, or not yet in it
The European HPC Scenario
Largest (public) accelerated cluster in Europe: PLX @ CINECA
• 3288 cores (548 Intel Westmere CPUs) + 528 NVIDIA M2070 GPUs
• used by both academic and industrial partners
• 150–160 TFlop/s (LINPACK)
• Tier-1 in PRACE
All PRACE Tier-0 systems are (and will be) NOT “classically” accelerated:
• BlueGene/P and BlueGene/Q
• CRAY XT6 (XK6? who knows…)
• Intel Sandy Bridge
A look outside Europe…
① Do European centers believe in accelerators for a (multi) petascale future?
TITAN:
• CRAY XK6
• expected 299,008 cores (AMD)
• Tesla (then Kepler) NVIDIA GPUs
• final expected peak: ~20 PFlop/s

STAMPEDE:
• Linux Dell machine
• Intel Sandy Bridge-EP (~2 PF)
• Intel MIC (~8 PF)
• final expected peak: ~10 PFlop/s
Outline
• Aim of this talk
• ICHEC
• PRACE – overview
• Present European HPC scenario
• PRACE Preparatory Phase (PP)
• PRACE 1st Implementation Phase (1-IP)
• From “theory” to “practice”: approaching HPC accelerators
• Other interesting projects: EESI, Mont-Blanc, Autotune, e-FISCAL
• What about Exascale?
Preamble
• Creation of a permanent pan-European Research Infrastructure
• Provision of extensive HPC training
• Deployment and evaluation of promising architectures
• Porting of applications to petascale
• Applications are part of the PRACE Application Benchmark Suite (PABS)
• Every centre usually has strong knowledge and expertise, but these competencies are normally focused on a few areas
• The technical WPs aim to push a selection of codes to run faster and more efficiently on Tier-0 and Tier-1 infrastructures
PRACE is mainly for SERVICE, and only then for research
PRACE Preparatory Phase
• WP1 Management
• WP2 Organizational concept à Statutes
• WP3 Dissemination, outreach and training
• WP4 Distributed computing
• WP5 Deployment of prototype systems
• WP6 Software enabling for prototype systems
• WP7 Petaflop/s systems for 2009/2010
• WP8 Future Peta to Exaflop/s technologies
→ ENDED
PRACE PP – WP 6
WP6: Software enabling for prototype systems
• T6.1: Identification & categorisation of applications (identification of applications & initial benchmark suite)
• T6.2: Application requirements capture (analysis of identified applications)
• T6.3: Performance analysis & benchmark selection (selection of tools and benchmarks)
• T6.4: Petascaling (petascaling of algorithms)
• T6.5: Optimisation (establishment of best practice in optimisation)
• T6.6: Software libraries & programming models (analysis of libraries and programming models)
PRACE PP – WP6 Task 6
Objective: identify and analyse the programming models and the software libraries required by petascaling applications in the PRACE implementation phase.
How?
• 3(+1) basic numerical kernels typical of the most important computational applications were selected
• these kernels were coded using twelve different programming languages and paradigms under investigation
• the coding activity was tracked (developer diary)
• assessment from a performance point of view (what is achieved in a reasonable amount of time)
• assessment from a productivity point of view
• the kernels were run on several PRACE prototypes
PRACE PP – WP6 Task 6
EUROBEN kernels:
• mod2am: dense matrix-matrix multiplication
• mod2as: sparse matrix-vector multiplication
• mod2f: 1D DFT transform
• mod2h: random number generator

Parallel paradigms and languages:
• standard parallel programming models: MPI, OpenMP, MPI+OpenMP
• Partitioned Global Address Space (PGAS): UPC, Coarray Fortran, Titanium*
• productivity-oriented languages: Chapel, X10, Fortress*
• languages and paradigms for accelerators: CUDA, OpenCL, RapidMind, Cn, CAPS HMPP, FPGA VHDL, Cell programming, CellSs/StarSs
PRACE PP – WP6 Task 6
Evaluation metrics:
• time spent to achieve a first implementation versus time spent to achieve a good implementation
• performance versus development time
• SLOC/NCSS (metrics over the source code)

What have we learnt?
• Some new languages are really easy to learn (UPC, Coarray Fortran)
• Some languages lack a proper HPC context and are not yet ready for production (X10, Chapel)
• Some languages and paradigms are difficult, but not impossible, for an experienced (and motivated) programmer to pick up (CUDA, OpenCL)
• If performance in a short time is the final goal, graphical accelerators are the ONLY solution
PRACE 1st Implementation Phase
• WP1 Management
• WP2 Evolution of the Research Infrastructure
• WP3 Dissemination and training
• WP4 HPC Ecosystem Relations
• WP5 Industrial User Relations
• WP6 Technical Operation and Evolution of the Distributed Infrastructure
• WP7 Enabling Petascale Applications: Efficient Use of Tier-0 Systems
• WP8 Support for the procurement and commissioning of HPC services
• WP9 Future Technologies
→ 01/07/2010 – 30/06/2012, RUNNING
PRACE 1-IP – WP 7
WP7: Enabling Petascale Applications: Efficient Use of Tier-0 Systems
• T7.1: Applications enabling for capability science (3 sub-tasks)
• T7.2: Applications Enabling with Communities (6 sub-tasks)
• T7.3: Efficient Use of PRACE Systems (4 sub-tasks)
• T7.4: Applications Requirements for Tier-0 Systems (3 sub-tasks)
• T7.5: Programming Techniques for High Performance Applications (6 sub-tasks: A/B - scalable algorithms; C - scalable libraries; D - multi-core/many-core systems; E - accelerators; F - novel HPC languages)
• T7.6: Efficient Handling of Petascale-Class Applications Data (4 sub-tasks)
PRACE 1-IP – WP7 Task 5.E
Since…
• all participating HPC centres have GPU machines,
• the technology has demonstrated growth and constant improvement,
• almost all the community codes have yet to be accelerated*,
• the hype around this technology is at its peak (for how long?)

HW accelerator → NVIDIA GPU
• task led by ICHEC, ongoing
• applications were selected during the F2F meeting one year ago
• collect experiences and best practices for both development and production
PRACE 1-IP – WP7 Task 5.E
Contributions:
• parallelization, optimization and tuning of the PWSCF code [condensed matter physics] using CUDA
• parallelization, optimization and tuning of the DL_POLY code [molecular dynamics] using OpenCL
• optimization and tuning of the TMLQCD code [lattice QCD, high-energy physics]
• porting of the IFS/EC_EARTH code [weather forecasting], feasibility study
• porting of the EUTERPE code [plasma physics] to multi-GPU systems
• evaluation of the P3DFFT code [general-purpose FFT library] on GPU clusters
→ applications were proposed based on specific interests
PRACE 1-IP – Call of new prototypes
As happened in the PP: prototype selection and then deployment
→ the EC will financially cover part of these prototypes
→ the prototypes will be accessible to every PRACE partner

Interesting areas for the 1-IP prototype call cover…
• efficient I/O
• energy/power-efficient clusters
• new interconnection topologies
• new promising accelerators → Intel MIC (?)

To date, no public decisions have been taken…
Outline
• Aim of this talk
• ICHEC
• PRACE – overview
• Present European HPC scenario
• PRACE Preparatory Phase (PP)
• PRACE 1st Implementation Phase (1-IP)
• From “theory” to “practice”: approaching HPC accelerators
• Other interesting projects: EESI, Mont-Blanc, Autotune, e-FISCAL
• What about Exascale?
PRACE 1-IP – WP7 Task 5.E: direct ICHEC involvement
As part of PRACE, we are responsible for…
• coordination of the sub-task
• gathering all the information and collecting best practices
• supporting PRACE partners directly in enabling applications to run on PRACE Tier-1 systems equipped with GPUs
• development of the GPU-enabled PWscf (with CINECA)
• development of the GPU-enabled DL_POLY (with STFC and PSNC)
• feasibility study of porting the IFS/EC_EARTH code to GPUs (more generally, to accelerators)

As an NVIDIA CRC and an HMPP CC…
• we are interested in every accelerator (but we need the hardware!)
• we carry out activities using CUDA, OpenCL (for different GPUs) and HMPP
From “theory” to “practice”
• Is an accelerator what you really need?
• Is your code fully exploiting its “traditional” possibilities?
• Does your problem fit a different architecture?
• Do you have the time to build the knowledge of how to use an accelerator efficiently?

Looking at NVIDIA GPUs and CUDA…
• benchmarks show impressive Flop/s throughput… in all cases?
• CUDA looks easy to learn… but how easy is it to optimize?

→ Accelerators help to reduce the time-to-solution
Comparing two case studies
[Profile pie charts: A (HARMONIE) spreads time over many routines (TRMTOL, TH_R_FROM_THL_RT_1D, TRLTOM, CNT0, LAITRI, SLCOMM, SLCOMM2A, APL_AROME, TRLTOG, TH_R_FROM_THL_RT_2D, TRGOTL, LFILDO, INITAPLPAR, RAIN_ICE_SEDIMENT, COMPUTE_ENTR_DETR, CPG, MODE_POS_SURF:DIWRGRID_SURF_EXT, TRMTOS, ELASCAW, others); B (PWSCF) concentrates time in a few (DGEMMs, vloc_psi, addusdens, newd, cdiaghg, other parts)]
A: HARMONIE (HIRLAM)
• internal profile
• a lot of small functions called many times
• limited use of external libraries
• already MPI+OpenMP
• I/O is a serious bottleneck

B: PWSCF (Quantum ESPRESSO)
• internal timing
• easy to identify heavy computational routines
• massive use of external libraries (red is all GEMM)
• already MPI+OpenMP
• I/O is under control
Comparing two typical situations
Acceleration strategy applied

GPU good (B, PWSCF):
• explicit CUDA kernels
• “transparent” use of multi-threaded libraries (phiGEMM, MAGMA, CUFFT)
• re-shaping the data to match the accelerator
→ time consuming, performance focused

GPU less good (A, HARMONIE):
• directive-based acceleration using HMPP (expertise in-house)
• place library calls wherever possible
→ fast evaluation for a feasibility study
directive-based programming model
Suitable for the new multi-/many-core systems and accelerators because it…
• provides a high level of abstraction;
• is (almost) language independent;
• minimizes code restructuring;
• keeps applications hardware independent;
• ensures their portability across new generations of hardware.

But…
• performance?
• run-time?
• what is the compiler really able to understand?
directive-based approach panorama
“standard”:
• OpenMP 4 (CRAY is in the OpenMP ARB)

non-standard:
• openHMPP (from CAPS and Pathscale)
• StarSs (free run-time, Spanish project)
• PGI compiler (NVIDIA sponsors it)

② Many players on the stage. Who will succeed?
directive-based approaches
OpenMP 4
  PRO: OpenMP is already a standard
  CON: currently only the CRAY compiler fully implements this draft

openHMPP
  PRO: collects concrete experiences from academia and industry
  CON: currently only CAPS and Pathscale plan to support it

StarSs family (GPUSs, CellSs, SMPSs)
  PRO: promising open-source run-time, ready to target GPU, Cell, many-cores, …
  CON: known at the European level but not (yet) fully considered

PGI Compiler
  PRO: good support and partnership with NVIDIA
  CON: translated code is hidden from the developer
directive-based approach versus performance
In the HPC environment codes HAVE to run fast
Exploiting an accelerator directly means…
• the ability to manage the data transfers
• the ability to finely control the work off-loaded to the accelerator
• you have to dedicate time to the development

Exploiting an accelerator indirectly means…
• less re-engineering work to do
• knowing in advance that the final solution is (probably) sub-optimal
• you expect a gain with less effort
→ nowadays, when approaching accelerators, the direct approach is preferred
Outline
• Aim of this talk
• ICHEC
• PRACE – overview
• Present European HPC scenario
• PRACE Preparatory Phase (PP)
• PRACE 1st Implementation Phase (1-IP)
• From “theory” to “practice”: approaching HPC accelerators
• Other interesting projects: EESI, Mont-Blanc, Autotune, e-FISCAL
• What about Exascale?
Other interesting EU projects
EESI: European Exascale Software Initiative [ENDED]
Build a European vision and roadmap to address the challenge of the new generation of massively parallel systems composed of millions of heterogeneous cores.

Mont-Blanc [JUST STARTED]
Design a new type of computer architecture capable of setting future global HPC standards, delivering Exascale performance while using 15 to 30 times less energy.

Autotune [JUST STARTED]
Extend performance analysis tools with online tuning plugins for performance and energy-efficiency tuning, targeting multi-core, many-core and GPU systems together.

e-FISCAL [JUST STARTED]
Analysis of the costs and cost structures of the European High-Throughput and High-Performance Computing (HTC and HPC) e-Infrastructures.
③ Are accelerators financially suitable also for non-scientific domains?
What about Exascale?
Thinking about Exascale…
• new programming models (many-many-core systems)
• tight interaction between run-times and the low-level kernel scheduler
• ensuring portability (eventually filling the gaps in the standards)
• strong resilience and fault tolerance required
• a solution for the I/O problem

Target: not before 2018/2020!!!

Q: the HPC community keeps guessing about a future Exascale system and the Exascale challenge… but are we ready for present petascale systems?
A: …
Let’s give an answer to the three questions…
① Do European centers believe in accelerators for a (multi) petascale future?
A: if HPC centers look for a balance between computational power, energy consumption, programmability and real performance… there is no other concrete option nowadays!
② Many players on the stage. Who will succeed?
A: very difficult to guess now; it mainly depends on who will adopt what, who will buy what, and where the research will be most promising
③ Are accelerators financially suitable also for non-scientific domains?
A: new metrics tightly bind the HW to the SW. Industry already evaluates these parameters too, not only cost-per-Flop. Research institutes?
Thank you for your attention! And feel free to ask questions