Common Coding Strategies for Lattice QCD Albert Deuzeman University of Bern Lattice 2013 Mainz, 02.08.2013

Common Coding Strategies for Lattice QCD

Albert Deuzeman

University of Bern

Lattice 2013
Mainz, 02.08.2013

Introduction

PRACE-2IP WP8

Incrementally update scientific numerical tools to innovative computational solutions.

Includes codes in astrophysics (3), material science (4), climate science (5), particle physics (1) and engineering (5).

Compared to other fields...

... the lattice community is small.

... typical problems rely less on input data.

... there are no real ‘standard’ codes.

Talk by Claudio Gheller.

A. Deuzeman (University of Bern) Common Coding Strategies Lattice 2013 1 / 24

Challenges

Lattice QCD codes

- High optimisation levels needed.
- Massive investment of time.
- High developer turnover.
- Divergent goals.
- Limited payoff.

Main challenge

Given the research we want to do, how to make the process of developing high performance codes as efficient as possible?

Developments

Name        Architecture      Rmax [TFlops]  Efficiency [MFlops / W]
Tianhe-2    Xeon + Xeon Phi   33862.7        1901.5
Titan       Opteron + Tesla   17590.0        2142.7
Sequoia     Blue Gene/Q       17173.2        2176.5
K Computer  SPARC64           10510.0        830.1
Mira        Blue Gene/Q       8586.6         2176.5
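The two numeric columns together imply a power draw for each machine, since power = Rmax / efficiency. The sketch below only redoes that arithmetic on the figures from the table; the function name and the unit conversions are illustrative choices, and the result is the power implied by the listed numbers, not an independently measured value.

```python
# Entries as listed in the table: (name, Rmax in TFlops, efficiency in MFlops/W).
SYSTEMS = [
    ("Tianhe-2", 33862.7, 1901.5),
    ("Titan", 17590.0, 2142.7),
    ("Sequoia", 17173.2, 2176.5),
    ("K Computer", 10510.0, 830.1),
    ("Mira", 8586.6, 2176.5),
]


def implied_power_mw(rmax_tflops, mflops_per_watt):
    """Power in MW implied by Rmax and the listed efficiency."""
    rmax_mflops = rmax_tflops * 1e6       # 1 TFlops = 10^6 MFlops
    watts = rmax_mflops / mflops_per_watt
    return watts / 1e6                    # 1 MW = 10^6 W


if __name__ == "__main__":
    for name, rmax, eff in SYSTEMS:
        print(f"{name:<11} {implied_power_mw(rmax, eff):6.2f} MW")
```

Read this way, the K Computer needs noticeably more power than Titan or Sequoia while delivering a lower Rmax, which is the efficiency gap the table points at.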

Developments

Console        Units [×10^6]  Time [years]  Architecture
Playstation    103            6             MIPS R3051
Playstation 2  155            6             Sony Emotion Engine
Playstation 3  78             7             Cell Broadband Engine
Xbox           24             4             Intel Pentium III Coppermine
Xbox 360       77             8             PowerPC tri-core Xenon
Playstation 4  ??             ?             AMD Jaguar APU
Xbox One       ??             ?             AMD Jaguar APU

Developments

Diverse processor architectures

- Quad processing extensions
- SPARC
- ARM

But also, helpful new developments on the software side!

- ILDG format as standard.
- Distributed source control management (e.g. git, Mercurial).
- Development tracking platforms (e.g. GitHub, Gitorious, Google Code).
- Improving compiler quality.
- Novel compiler architecture (LLVM).

Hybrid codes

[Figure panels: OpenMP (+ MPI); MPI]

Hybrid codes

S. Gottlieb and S. Tamhankar, Nucl.Phys.Proc.Suppl. 94 (2001) 841-845

Hybrid codes

- Breaking the balance between threads gives opportunities for performance gain.
- Main gains are in overlapping communication and computation.
- May alleviate bottlenecks due to shared resources between threads.
- By necessity architecture and system dependent, tricky to optimise at lower level.

Talks by Michele Brambilla and Bartosz Kostrzewa.
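The idea of overlapping communication and computation can be sketched in a few lines. The example below is a toy under stated assumptions, not code from any of the packages discussed: `exchange_halo` is a hypothetical stand-in for a neighbour exchange (in a real code this would be non-blocking MPI or SPI calls), the lattice is one-dimensional and periodic, and a single dedicated thread plays the role of the "unbalanced" communication thread.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def exchange_halo(left_edge, right_edge):
    """Hypothetical stand-in for a neighbour exchange (e.g. MPI send/receive)."""
    time.sleep(0.01)  # pretend network latency
    # Periodic 1D lattice held by a single process: the halo values are
    # simply our own opposite edges.
    return right_edge, left_edge  # (left halo, right halo)


def laplacian_overlapped(field):
    """Nearest-neighbour stencil with the halo exchange hidden behind interior work."""
    n = len(field)
    out = [0.0] * n
    with ThreadPoolExecutor(max_workers=1) as pool:
        # 1. Start the communication on a dedicated thread.
        pending = pool.submit(exchange_halo, field[0], field[-1])
        # 2. Meanwhile, update the interior sites, which need no halo data.
        for i in range(1, n - 1):
            out[i] = field[i - 1] - 2.0 * field[i] + field[i + 1]
        # 3. Wait for the halo, then finish the two boundary sites.
        left_halo, right_halo = pending.result()
    out[0] = left_halo - 2.0 * field[0] + field[1]
    out[-1] = field[-2] - 2.0 * field[-1] + right_halo
    return out


def laplacian_reference(field):
    """Same stencil without any overlap, used only to check the result."""
    n = len(field)
    return [field[(i - 1) % n] - 2.0 * field[i] + field[(i + 1) % n]
            for i in range(n)]


if __name__ == "__main__":
    data = [float(i * i) for i in range(8)]
    print(laplacian_overlapped(data) == laplacian_reference(data))
```

The split into interior and boundary sites is the part that carries over to real codes; how much latency can actually be hidden depends on the architecture and the communication layer, as noted above.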

Hybrid codes

Courtesy of Bartosz Kostrzewa

Accelerators

MPI + Accelerator

Frameworks

CUDA: Platform and code framework for off-loading to Nvidia GPUs.
OpenCL: Framework and API for heterogeneous computing.
OpenMP: Compiler directive based API for shared memory multiprocessing.
OpenACC: A compiler directive based API for shared memory multiprocessing, allowing also access to GPUs.

Talks by Matthias Bach and Pushan Majumdar.

Strategies for efficiency

One Code To Rule Them All

- Not practical!
- Not wanted!
- Not needed?

Rich Ecosystem

- When possible, use existing programs.
- When coding, use existing libraries / interfaces.
- When experimenting, use a high level approach.

Needs awareness of the existing options.

SciDAC

Bagel is the most highly optimised kernel available for the Blue Gene/Q.
It generates instructions for a range of architectures, however, including

- Power & PowerPC
- BG/LP Hummer
- BG/Q QPX
- Alpha
- Sparc
- C++

Intel MIC support is forthcoming.

SciDAC
QLA    QPX + OpenMP
QMP    SPI
QMT    pthreads
QDP/C  pthreads

SciDAC

[Figure: Weak Scaling for BAGEL DWF RB5D CG Solver. Vertical axis: Petaflops (0.0 to 8.0). Horizontal axis: BG/Q Racks (0 to 120), where 1 rack = 1,024 nodes = 16,384 cores. Annotation: 58 Gflops/node.]

Courtesy of Peter Boyle
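As a rough cross-check of the axis ranges, the annotated per-node rate can be scaled up to full racks. The sketch assumes ideal weak scaling, i.e. that the quoted 58 Gflops/node holds unchanged at every partition size; the measured points of the plot are not reproduced here, and the function name is illustrative.

```python
GFLOPS_PER_NODE = 58.0   # annotation on the figure
NODES_PER_RACK = 1024    # 1 rack = 1,024 nodes


def ideal_pflops(racks):
    """Aggregate rate in Petaflops if every node sustains GFLOPS_PER_NODE."""
    gflops = racks * NODES_PER_RACK * GFLOPS_PER_NODE
    return gflops / 1e6  # 1 Pflops = 10^6 Gflops


if __name__ == "__main__":
    for racks in (1, 48, 96, 120):
        print(f"{racks:4d} racks: {ideal_pflops(racks):.2f} Pflops")
```

Under this assumption one rack delivers about 0.06 Pflops and 120 racks a little over 7 Pflops, which fits within the 0 to 8 Pflops range of the vertical axis.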

SciDAC

CUDA based GPU acceleration in the SciDAC stack.

QUDA
- Implements solvers and performance critical gauge generation routines.
- Highly optimised: mixed precision methods, autotuning, cache blocking.

QDP/JIT
- Moves core QDP functionality to the GPU.
- Just-In-Time compilation to PTX.
- Works in conjunction with QUDA.
- QDP++ as an interface.

Talks and poster by Mike Clark, Alejandro Vaquero and Frank Winter.
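To illustrate what a mixed precision method buys, here is a minimal sketch of one such technique, defect correction (iterative refinement): the expensive inner solve runs in single precision, while the residual and the accumulated solution are kept in double precision. This is not QUDA's API and not necessarily the scheme QUDA uses; the Jacobi inner solver, the function names, and the small test matrix are all assumptions chosen to keep the example self-contained.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matvec(a, x):
    """Dense matrix-vector product in double precision."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def jacobi_single(a, r, sweeps=20):
    """Low-precision inner solve of a e = r; every stored value is rounded to float32."""
    n = len(r)
    a32 = [[f32(v) for v in row] for row in a]
    r32 = [f32(v) for v in r]
    e = [0.0] * n
    for _ in range(sweeps):
        e = [f32((r32[i] - sum(a32[i][j] * e[j] for j in range(n) if j != i))
                 / a32[i][i])
             for i in range(n)]
    return e


def solve_mixed(a, b, tol=1e-12, max_outer=50):
    """Defect correction: double-precision residual, single-precision correction."""
    x = [0.0] * len(b)
    for outer in range(max_outer):
        r = [bi - axi for bi, axi in zip(b, matvec(a, x))]  # double precision
        if max(abs(v) for v in r) < tol:
            return x, outer
        e = jacobi_single(a, r)                              # single precision
        x = [xi + ei for xi, ei in zip(x, e)]                # accumulate in double
    return x, max_outer


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [1.0, 2.0, 3.0]
    solution, outer_iterations = solve_mixed(A, b)
    print(solution, outer_iterations)
```

Each outer step gains roughly single-precision accuracy on the remaining error, so a handful of cheap inner solves reaches double-precision residuals. The matrix here is diagonally dominant so that Jacobi converges; a production solver would use a Krylov method instead.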

QUDA

Courtesy of Mike Clark and Balint Joo

Traditional codes

- Codes within the SciDAC stable have mainly improved through the developments within the underlying libraries.
- Integration of QUDA is available within e.g. Chroma, CPS, BQCD and the MILC code.
- Chroma, in particular, has seen a lot of work on efficient threading, with optimisations for the Xeon Phi.
  - Pioneering implementations of Xeon Phi tuned inverter libraries show performance on par with GPU codes.
  - Code should be useful for x86 libraries and could eventually be ported to the Blue Gene/Q.
  - Specific Blue Gene/Q optimisation is a secondary target for the moment.

Talk by Chulwoo Jung.

Traditional codes

IroIro++
- Uses several IBM Japan developed libraries for message passing, threading and linear algebra.
- Integrates Bagel for high performance inversions.
- Set to be used in production by JLQCD, publicly available soon.

Bridge++
- A modern concept code, written for extendability, readability and portability.
- Only MPI fully implemented, but support for OpenMP, pthreads and OpenCL is being worked on.
- Architecture specific tuning is underway, not yet mature.
- Development priorities guided by user base, focus on new features.

Talk and poster by Guido Cossu and Satoru Ueda.

Traditional codes

OpenQCD: Mainly algorithmic work (open boundaries), but some additional architecture tuning (AVX).

PLQCD: New Wilson operator inverter library with hybrid parallelisation.

tmLQCD
- Coding work has been focusing on efficiency on the Blue Gene/Q.
- Use of SPI, QPX intrinsics and hybrid parallelisation have dramatically increased efficiency.
- Somewhat limited from-scratch CUDA support is available.

Talks and poster by Abdou Abdel-Rehim, Bartosz Kostrzewa, Stefan Krieg, Stefan Schaefer and Carsten Urbach.

High Level Languages

Scripting languages are the natural medium for utility and analysis codes.

qcd_utils
- Written in Python and maintained by Massimo di Pierro.
- Utilities for fetching and manipulating data files, analysis and visualisation.

QLUA
- Analysis code using Lua as glue around the SciDAC libraries.
- Functions as a Domain Specific Language: flexibility with decent performance.

High Level Languages

The flexibility of scripting languages can also be leveraged for flexible Monte-Carlo codes.

QCL
- Designed for decent performance in a range of exotic scenarios.
- Actions described as paths in high level Python, manipulated symbolically for efficiency and then translated into OpenCL.

FUEL
- Partner to QLUA, providing an API on top of the SciDAC libraries.
- Designed for BSM physics and modern algorithms.
- Preliminary indications of especially good performance for N_c ≠ 3.

Talk and poster by Meifeng Lin and Massimo di Pierro.
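To give a flavour of what describing an action as paths and manipulating it symbolically might look like, here is a purely hypothetical sketch. It does not use QCL's actual representation or API: the encoding of a path as a tuple of signed direction indices, and all function names, are assumptions made for illustration only.

```python
def plaquette(mu, nu):
    """Elementary square in the (mu, nu) plane; +d is a forward step, -d a backward step."""
    return (mu, nu, -mu, -nu)


def is_closed(path, dim=4):
    """A path is a closed loop if its net displacement vanishes in every direction."""
    net = [0] * dim
    for step in path:
        net[abs(step) - 1] += 1 if step > 0 else -1
    return all(d == 0 for d in net)


def cancel_backtracking(path):
    """Remove adjacent step/anti-step pairs, since a link times its inverse is the identity."""
    stack = []
    for step in path:
        if stack and stack[-1] == -step:
            stack.pop()           # U_mu followed by its inverse cancels
        else:
            stack.append(step)
    return tuple(stack)


if __name__ == "__main__":
    loop = plaquette(1, 2)
    print(loop, is_closed(loop))
    print(cancel_backtracking((1, 2, -2, -1, 3)))
```

Simplifications of this kind are cheap on the symbolic level and reduce the number of link multiplications before any low-level kernel is generated, which is the sort of gain the symbolic step above is after.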

High Level Languages

The underlying concept of flexibility can be pursued along different routes than interfaces from higher level languages.

QIRAL
- Uses LaTeX expressions as native input.
- Translated to formal logic (Maude), which is converted into C.
- Uses OpenMP for shared memory parallelism.
- arXiv:1208.4035

parmalgt
- Objects on single spacetime points in a D-dimensional lattice as a basis, not specific to QCD.
- Hybrid parallelism through threads and MPI.
- Templates and C++11 features for efficiency and flexibility.

Talks by Michele Brambilla, Mattia dalla Brida and Dirk Hesse.

Summary

- The hardware landscape is changing rapidly and adjusting is a challenge.
- A massive amount of work is being done and we need to use this.
- Libraries are perhaps the most promising route to synergy.
- There is much potential in using existing libraries as interface definitions.
- High level languages, offering agility and ease of use, are starting to be explored.

Acknowledgements

Abdou Abdel-Rehim
Matthias Bach
Peter Boyle
Michele Brambilla
Mike Clark
Guido Cossu
Carleton DeTar
Massimo DiPierro
Claudio Gheller
Balint Joo
Chulwoo Jung
Bartosz Kostrzewa
Meifeng Lin
Pushan Majumdar
James Osborn
Stefan Schaefer
Satoru Ueda
Carsten Urbach
Alejandro Vaquero
Frank Winter
