The rise of HPC accelerators toward a common vision for a petascale future
Filippo Spiga, Computational Scientist @ ICHEC
Disclaimer
This presentation contains the personal opinions, forecasts and forward-looking statements of the author, and not necessarily the opinions of any other person or organization mentioned.
For official communications and disclosures please refer to ICHEC, PRACE, PGI, CRAY, CAPS or NVIDIA websites
2 The rise of HPC accelerators - F. Spiga
Before we start: what an HPC accelerator is
← NVIDIA GPU M2090
Intel MIC prototype →
Very different hardware sharing the same idea: speed up the calculations
Outline
• Aim of this talk
• ICHEC
• PRACE – overview
• The state of the art in European HPC
• PRACE Preparatory Phase (PP)
• PRACE 1st Implementation Phase (1-IP)
• From “theory” to “practice”: approaching HPC accelerators
• Other interesting projects: EESI, Mont-Blanc, Autotune, e-FISCAL
• What about Exascale?
Aim of this talk
• Explain PRACE and the background behind it
• Explain what kind of technical expertise PRACE partners have gathered and brought to the HPC community
• Share its achievements and actions (wherever possible… NDA)
• Propose a critical point of view about the adoption of accelerators in HPC
If you do not know what an accelerator is, how a GPU is internally organized, or what CUDA (or OpenCL) is, I will be glad to have a chat with you after the talk
Irish Centre for High-End Computing (ICHEC)
• Founded in 2005
• Staff of 20–25 people
• Objectives:
  – provide High-Performance Computing (HPC) resources and support
  – provide education and training for researchers in third-level institutions
  – provide technology transfer and consultancy services to Irish industries
• Awards:
  – CUDA Research Centre (since 2010)
  – HMPP Centre of Competence (since 2011)
• FP7 European Projects:
  – PRACE 1IP & 2IP, Autotune, eFISCAL
Supercomputing Drives Science through Simulation
[Diagram: science domains driven by simulation]
• Environment: weather/climatology, pollution/ozone hole
• Ageing society: medicine, biology
• Energy: plasma physics, fuel cells
• Materials/Inf. Tech: spintronics, nano-science
Partnership for Advanced Computing in Europe (PRACE)
PRACE is creating and implementing a persistent pan-European HPC service providing European researchers with access to capability computers and forming the top level of the European HPC ecosystem.
→ The PRACE idea began in 2006, after a European Commission report on the necessity of a unified European strategy (and synergy) for HPC
Now 21 partners involved
PRACE timeline
[Timeline 2006–2013: MoU, then Preparatory Phase, then Implementation Phases (1-IP, 2-IP, 3-IP); the PRACE Initiative becomes the PRACE Project and then the PRACE Research Infrastructure; DEISA, eDEISA and DEISA2 run alongside]
PRACE mission & main tasks
• Creation of a permanent pan-European Research Infrastructure
• Provision of extensive HPC training
• Deployment and evaluation of promising architectures → prototyping
• Porting of applications to petascale
PRACE is providing a set of machines with successively increasing capability:
• 1 PF (2010) + 1.5 PF (2011) + 1 PF (2011) + 3 PF (2012) + >3 PF (2013), with upgrade steps to follow (+5 PF in 2013, …)
• accumulated capability estimated at >15 PF in 2013
→ HPC for science, industry and vendors
The European HPC Scenario
Tier-0 systems:
• JUGENE (Germany), 1 PFlop/s; CURIE (France), 1.6 PFlop/s; HERMIT (Germany), 1 PFlop/s; SuperMUC (Germany), ~3 PFlop/s; …
[Pyramid diagram: Tier-0 (PRACE) at the top, Tier-1 (DEISA/PRACE) in the middle, Tier-2 at the base; capability grows toward the top, the number of systems grows toward the base]
Tier-1 systems:
• every country has one or more HPC clusters of O(100) TFlop/s
• every country has its own accelerated prototypes (in or outside the PRACE RI)
• almost every country has a GPU cluster (99% is NVIDIA)
• other accelerators are out of the game, or not yet in it
The European HPC Scenario
Largest (public) accelerated cluster in Europe: PLX @ CINECA
• 3288 cores (548 Intel Westmere CPUs) + 528 NVIDIA M2070 GPUs
• used by both academic and industrial partners
• 150–160 TFlop/s (LINPACK)
• Tier-1 in PRACE
All PRACE Tier-0 systems are (and will be) NOT “classically” accelerated:
• BlueGene/P and BlueGene/Q
• CRAY XT6 (XK6? who knows…)
• Intel Sandy Bridge
A look outside Europe…
① Do European centers believe in accelerators for a (multi) petascale future?
TITAN:
• CRAY XK6
• expected 299,008 cores (AMD)
• Tesla (then Kepler) NVIDIA GPUs
• final expected peak: ~20 PFlop/s

STAMPEDE:
• Linux Dell machine
• Intel Sandy Bridge-EP (~2 PF)
• Intel MIC (~8 PF)
• final expected peak: ~10 PFlop/s
Outline
• Aim of this talk
• ICHEC
• PRACE – overview
• Present European HPC scenario
• PRACE Preparatory Phase (PP)
• PRACE 1st Implementation Phase (1-IP)
• From “theory” to “practice”: approaching HPC accelerators
• Other interesting projects: EESI, Mont-Blanc, Autotune, e-FISCAL
• What about Exascale?
Preamble
• Creation of a permanent pan-European Research Infrastructure
• Provision of extensive HPC training
• Deployment and evaluation of promising architectures
• Porting of applications to petascale
• Applications are part of the PRACE Application Benchmark Suite (PABS)
• Every centre usually has strong knowledge and expertise, but these competencies are normally focused on a few areas
• The technical WPs aim to push a selection of codes to run faster and more efficiently on Tier-0 and Tier-1 infrastructures
PRACE is mainly for SERVICE, and only then for research
PRACE Preparatory Phase
• WP1 Management
• WP2 Organizational concept à Statutes
• WP3 Dissemination, outreach and training
• WP4 Distributed computing
• WP5 Deployment of prototype systems
• WP6 Software enabling for prototype systems
• WP7 Petaflop/s systems for 2009/2010
• WP8 Future Peta to Exaflop/s technologies
→ ENDED
PRACE PP – WP 6
WP6: Software enabling for prototype systems
• T6.1: Identification & categorisation of applications (identification of applications & initial benchmark suite)
• T6.2: Application requirements capture (analysis of identified applications)
• T6.3: Performance analysis & benchmark selection (selection of tools and benchmarks)
• T6.4: Petascaling (petascaling of algorithms)
• T6.5: Optimisation (establishment of best practice in optimisation)
• T6.6: Software libraries & programming models (analysis of libraries and programming models)
PRACE PP – WP6 Task 6
Objective: identify and analyse the programming models and the software libraries required by petascaling applications in the PRACE implementation phase.
How?
• 3(+1) basic numerical kernels typical of the most important computational applications were selected
• these kernels were coded using twelve different programming languages and paradigms under investigation
• the coding activity was tracked (developer diary)
• assessment from a performance point of view (what is achieved in a reasonable amount of time)
• assessment from a productivity point of view
• the kernels were run on several PRACE prototypes
PRACE PP – WP6 Task 6
EUROBEN kernels:
• mod2am: dense matrix-matrix multiplication
• mod2as: sparse matrix-vector multiplication
• mod2f: 1D DFT transform
• mod2h: random number generator

Parallel paradigms and languages:
• standard parallel programming models: MPI, OpenMP, MPI+OpenMP
• Partitioned Global Address Space (PGAS): UPC, Coarray Fortran, Titanium*
• productivity-oriented languages: Chapel, X10, Fortress*
• languages and paradigms for accelerators: CUDA, OpenCL, RapidMind, Cn, CAPS HMPP, FPGA VHDL, Cell programming, CellSs/StarSs
PRACE PP – WP6 Task 6
Evaluation metrics:
• time spent to achieve a first implementation versus time spent to achieve a good implementation
• performance versus development time
• SLOC/NCSS (metrics over the source code)

What have we learnt?
• Some new languages are really easy to learn (UPC, Coarray Fortran)
• Some languages lack a proper HPC context and are not yet ready for production (X10, Chapel)
• Some languages and paradigms are difficult, but not impossible, for an experienced (and motivated) programmer to pick up (CUDA, OpenCL)
• If performance in a short time is the final goal, graphical accelerators are the ONLY solution
PRACE 1st Implementation Phase
• WP1 Management
• WP2 Evolution of the Research Infrastructure
• WP3 Dissemination and training
• WP4 HPC Ecosystem Relations
• WP5 Industrial User Relations
• WP6 Technical Operation and Evolution of the Distributed Infrastructure
• WP7 Enabling Petascale Applications: Efficient Use of Tier-0 Systems
• WP8 Support for the procurement and commissioning of HPC services
• WP9 Future Technologies
→ 01/07/2010 – 30/06/2012, RUNNING
PRACE 1-IP – WP 7
WP7: Enabling Petascale Applications: Efficient Use of Tier-0 Systems
• T7.1: Applications enabling for capability science (3 sub-tasks)
• T7.2: Applications Enabling with Communities (6 sub-tasks)
• T7.3: Efficient Use of PRACE Systems (4 sub-tasks)
• T7.4: Applications Requirements for Tier-0 Systems (3 sub-tasks)
• T7.5: Programming Techniques for High Performance Applications (6 sub-tasks: A/B - scalable algorithms; C - scalable libraries; D - multi-core/many-core systems; E - accelerators; F - novel HPC languages)
• T7.6: Efficient Handling of Petascale-Class Applications Data (4 sub-tasks)
PRACE 1-IP – WP7 Task 5.E
Since…
• all participating HPC centres have GPU machines,
• the technology has demonstrated growth and constant improvement,
• almost all the community codes have yet to be accelerated*,
• the hype around this technology is at its peak (for how long?)

HW accelerator → NVIDIA GPU
• task led by ICHEC, ongoing
• applications were selected during the F2F meeting one year ago
• collect experiences and best practices for both development and production
PRACE 1-IP – WP7 Task 5.E
Contributions:
• parallelization, optimization and tuning of the PWSCF code [condensed matter physics] using CUDA
• parallelization, optimization and tuning of the DL_POLY code [molecular dynamics] using OpenCL
• optimization and tuning of the TMLQCD code [lattice QCD, high-energy physics]
• porting of the IFS/EC_EARTH code [weather forecasting], feasibility study
• porting of the EUTERPE code [plasma physics] to multi-GPU systems
• evaluation of the P3DFFT code [general-purpose FFT library] on GPU clusters
→ applications were proposed based on specific interests
PRACE 1-IP – Call of new prototypes
As happened in the PP: prototype selection and then deployment
→ the EC will financially cover part of these prototypes
→ the prototypes will be accessible to every PRACE partner

Interesting areas for the 1-IP prototype call cover…
• efficient I/O
• energy/power-efficient clusters
• new interconnection topologies
• new promising accelerators → Intel MIC (?)

To date, no public decisions have been taken…
Outline
• Aim of this talk
• ICHEC
• PRACE – overview
• Present European HPC scenario
• PRACE Preparatory Phase (PP)
• PRACE 1st Implementation Phase (1-IP)
• From “theory” to “practice”: approaching HPC accelerators
• Other interesting projects: EESI, Mont-Blanc, Autotune, e-FISCAL
• What about Exascale?
PRACE 1-IP – WP7 Task 5.E: direct ICHEC involvement
As part of PRACE, we are responsible for…
• coordination of the sub-task
• gathering all the information and collecting best practices
• supporting PRACE partners directly in enabling applications to run on PRACE Tier-1 systems equipped with GPUs
• development of the GPU-enabled PWscf (with CINECA)
• development of the GPU-enabled DL_POLY (with STFC and PSNC)
• feasibility study of porting the IFS/EC_EARTH code to GPUs (more generally, to accelerators)

As an NVIDIA CRC and an HMPP CC…
• we are interested in every accelerator (but we need the hardware!)
• we carry out activities using CUDA, OpenCL (for different GPUs) and HMPP
From “theory” to “practice”
• Is an accelerator what you really need?
• Is your code fully exploiting its “traditional” possibilities?
• Does your problem fit a different architecture?
• Do you have the time to build the knowledge of how to use an accelerator efficiently?

Looking at NVIDIA GPUs and CUDA…
• benchmarks show impressive Flop/s throughput… in all cases?
• CUDA looks easy to learn… but how easy is it to optimize?

→ Accelerators help to reduce the time-to-solution
Comparing two case studies
[Profile pie charts: A (HARMONIE) spreads time over many routines (TRMTOL, TH_R_FROM_THL_RT_1D, TRLTOM, CNT0, LAITRI, SLCOMM, SLCOMM2A, APL_AROME, TRLTOG, TH_R_FROM_THL_RT_2D, TRGOTL, LFILDO, INITAPLPAR, RAIN_ICE_SEDIMENT, COMPUTE_ENTR_DETR, CPG, MODE_POS_SURF:DIWRGRID_SURF_EXT, TRMTOS, ELASCAW, others); B (PWSCF) concentrates time in a few (DGEMMs, vloc_psi, addusdens, newd, cdiaghg, other parts)]
A: HARMONIE (HIRLAM)
• internal profile
• a lot of small functions called many times
• limited use of external libraries
• already MPI+OpenMP
• I/O is a serious bottleneck

B: PWSCF (Quantum ESPRESSO)
• internal timing
• easy to identify heavy computational routines
• massive use of external libraries (red is all GEMM)
• already MPI+OpenMP
• I/O is under control
Comparing two typical situations
Acceleration strategy applied

GPU good (B, PWSCF):
• explicit CUDA kernels
• “transparent” use of multi-threaded libraries (phiGEMM, MAGMA, CUFFT)
• re-shaping the data to match the accelerator
→ time consuming, performance focused

GPU less good (A, HARMONIE):
• directive-based acceleration using HMPP (expertise in-house)
• place library calls wherever possible
→ fast evaluation for a feasibility study
directive-based programming model
Suitable for the new multi-/many-core systems and accelerators because it…
• provides a high level of abstraction;
• is (almost) language independent;
• minimizes code restructuring;
• keeps applications hardware independent;
• ensures their portability across new generations of hardware.

But…
• performance?
• run-time?
• what is the compiler really able to understand?
directive-based approach panorama
“standard”:
• OpenMP 4 (CRAY is in the OpenMP ARB)

non-standard:
• openHMPP (from CAPS and Pathscale)
• StarSs (free run-time, Spanish project)
• PGI compiler (NVIDIA sponsors it)

② Many players on the stage. Who will succeed?
directive-based approaches
OpenMP 4
  PRO: OpenMP is already a standard
  CON: currently only the CRAY compiler fully implements this draft

openHMPP
  PRO: collects concrete experiences from academia and industry
  CON: currently only CAPS and Pathscale plan to support it

StarSs family (GPUSs, CellSs, SMPSs)
  PRO: promising open-source run-time, ready to target GPU, Cell, many-cores, …
  CON: known at the European level but not (yet) fully considered

PGI Compiler
  PRO: good support and partnership with NVIDIA
  CON: translated code is hidden from the developer
directive-based approach versus performance
In the HPC environment codes HAVE to run fast
Exploiting an accelerator directly means…
• the ability to manage the data transfers
• the ability to finely control the work off-loaded to the accelerator
• you have to dedicate time to the development

Exploiting an accelerator indirectly means…
• less re-engineering work to do
• knowing in advance that the final solution is (probably) sub-optimal
• you expect a gain with less effort
→ nowadays, when approaching accelerators, the direct approach is preferred
Outline
• Aim of this talk
• ICHEC
• PRACE – overview
• Present European HPC scenario
• PRACE Preparatory Phase (PP)
• PRACE 1st Implementation Phase (1-IP)
• From “theory” to “practice”: approaching HPC accelerators
• Other interesting projects: EESI, Mont-Blanc, Autotune, e-FISCAL
• What about Exascale?
Other interesting EU projects
EESI: European Exascale Software Initiative [ENDED]
Build a European vision and roadmap to address the challenge of the new generation of massively parallel systems composed of millions of heterogeneous cores.

Mont-Blanc [JUST STARTED]
Design a new type of computer architecture capable of setting future global HPC standards, delivering Exascale performance while using 15 to 30 times less energy.

Autotune [JUST STARTED]
Extend performance analysis tools with online tuning plugins for performance and energy-efficiency tuning, targeting multi-core, many-core and GPU systems together.

e-FISCAL [JUST STARTED]
Analysis of the costs and cost structures of the European High-Throughput and High-Performance Computing (HTC and HPC) e-Infrastructures.
③ Are accelerators financially suitable also for non-scientific domains?
What about Exascale?
Thinking about Exascale…
• new programming models (many-many-core systems)
• tight interaction between run-times and the low-level kernel scheduler
• ensuring portability (eventually filling the gaps in the standards)
• strong resilience and fault tolerance required
• a solution for the I/O problem

Target: not before 2018/2020!!!

Q: the HPC community keeps guessing about a future Exascale system and the Exascale challenge… but are we ready for present petascale systems?
A: …
Let’s give an answer to the three questions…
① Do European centers believe in accelerators for a (multi) petascale future?
A: if HPC centers look for a balance between computational power, energy consumption, programmability and real performance… there is no other concrete option nowadays!
② Many players on the stage. Who will succeed?
A: very difficult to guess now; it mainly depends on who will adopt what, who will buy what, and where the research will be most promising
③ Are accelerators financially suitable also for non-scientific domains?
A: new metrics tightly bind the HW to the SW. Industry already evaluates these parameters too, not only cost-per-Flop. Research institutes?
Thank you for your attention! And feel free to ask questions