Paul Sathre, Wu Feng
{sath6220, feng}@cs.vt.edu
Virginia Tech --- Center for Synergistic Environments for Experimental Computing
http://synergy.cs.vt.edu/

Design

"Make-your-own" library from modular building blocks: include only the plugins and backends you need.

[Figure: software stack. User apps sit on the C API and, through the Fortran compatibility layer, the Fortran 2003 API. Beneath the API are a plugin layer (Timer plugin, MPI plugin, future plugins, ...) and a backend layer (CUDA backend, OpenCL backend, future device backends, ...). Each component is enabled by a compile-time flag: -D WITH_CUDA, -D WITH_OPENCL, -D WITH_BACKENDX, -D WITH_TIMERS, -D WITH_MPI, -D WITH_FORTRAN, -D WITH_PLUGINX.]
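
A minimal sketch of how this compile-time modularity can work in C, assuming the poster's WITH_* flags; the function names below are hypothetical, not MetaMorph's actual API:

```c
/* Hypothetical sketch of "make-your-own" modularity: only components selected
 * with -D WITH_* flags are compiled in.  Function names are illustrative. */
#include <stdio.h>

#ifdef WITH_CUDA
static void reduce_cuda(void)   { /* CUDA backend kernel would run here   */ }
#endif
#ifdef WITH_OPENCL
static void reduce_opencl(void) { /* OpenCL backend kernel would run here */ }
#endif

void example_reduce(void)
{
#if defined(WITH_CUDA)
    reduce_cuda();
#elif defined(WITH_OPENCL)
    reduce_opencl();
#else
    fprintf(stderr, "no accelerator backend was compiled in\n");
#endif
}
```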

[Figure: example heterogeneous cluster. Five nodes (Node 0-4), each with eight CPU cores (Core 0-7) and a NIC, joined by an interconnect. Node 0 and Node 1 have NVIDIA GPUs driven through CUDA, Node 2 runs OpenCL on its CPUs, Node 3 has an Intel MIC driven through OpenCL, and Node 4 has an AMD GPU driven through OpenCL; accelerators attach over PCI-E, and the MPI plugin handles exchanges between nodes.]

Plugins

MPI
- On-device packing of ghost regions.
- GPU Direct option, if available with the MPI implementation and backend; fallback to host-staged transfers otherwise.
- Transparently exchange device buffers between processes, regardless of backend.
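
A minimal sketch of the two transfer paths described above, assuming a CUDA device buffer and a CUDA-aware (GPU Direct) MPI build when gpu_direct is nonzero; this is an illustration, not MetaMorph's exchange API:

```c
/* Minimal sketch (not MetaMorph's API): send a packed ghost region either
 * directly from device memory (GPU Direct / CUDA-aware MPI) or via a host
 * staging buffer. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

void send_face(float *d_packed, size_t count, int dest, int gpu_direct,
               MPI_Comm comm)
{
    if (gpu_direct) {
        /* CUDA-aware MPI can read the device pointer directly. */
        MPI_Send(d_packed, (int)count, MPI_FLOAT, dest, 0, comm);
    } else {
        /* Fallback: stage the data through a host buffer first. */
        float *h_staged = (float *)malloc(count * sizeof(float));
        cudaMemcpy(h_staged, d_packed, count * sizeof(float),
                   cudaMemcpyDeviceToHost);
        MPI_Send(h_staged, (int)count, MPI_FLOAT, dest, 0, comm);
        free(h_staged);
    }
}
```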

Automatic Profiling
- If compiled in, all kernels and transfers are timed behind the scenes automatically.
- An environment variable controls verbosity.
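
As a rough illustration (not the timer plugin's actual code), transparent kernel timing on the CUDA backend could be implemented with CUDA events around each launch:

```c
/* Hedged sketch of behind-the-scenes kernel timing with CUDA events. */
#include <cuda_runtime.h>
#include <stdio.h>

void timed_launch_example(void (*launch)(void))
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    launch();                        /* the wrapped kernel launch */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&ms, start, stop);
    fprintf(stderr, "[timers] kernel took %.3f ms\n", ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```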

Fortran Compatibility
- Verbose, pass-by-reference C API, usable with ISO_C_BINDING.
- The Fortran 2003 wrapper to the verbose API provides a simplified, unified calling convention for the supported real and integer types.
- Uses the C API internally, so all runtime control variables work equivalently.
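
For illustration only, a verbose, pass-by-reference C entry point of the kind described above might look as follows; because every argument is a pointer, a Fortran 2003 interface block using ISO_C_BINDING can bind to it directly. The name, parameters, and host-only body are assumptions, not MetaMorph's real signature:

```c
/* Hypothetical pass-by-reference prototype (not MetaMorph's real API).
 * All-pointer arguments map directly onto a Fortran ISO_C_BINDING interface. */
#include <stddef.h>

int example_dot_product(const double *x,      /* input vector A              */
                        const double *y,      /* input vector B              */
                        const size_t *length, /* vector length, by reference */
                        double *result,       /* scalar output, by reference */
                        const int *async)     /* non-blocking flag           */
{
    (void)async;                 /* ignored in this host-only sketch */
    double sum = 0.0;
    for (size_t i = 0; i < *length; ++i)
        sum += x[i] * y[i];
    *result = sum;               /* a real backend would compute this on-device */
    return 0;
}
```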

Introduction

Heterogeneity is becoming a fact of life in HPC, largely driven by demands for more parallelism and power efficiency than traditional CPUs can provide. However, extracting the full performance of heterogeneous systems is non-trivial and requires architecture expertise. Retrofitting existing codes for heterogeneity is tedious and error-prone, architecture experts are in short supply, and accelerators are moving targets. Therefore, a single API for transparently executing optimized code on accelerators with minimal intervention is needed for scientific productivity.

MetaMorph is purpose-built to provide a single API to accelerated kernels for current and future devices, with runtime control over backends, plugins, and accelerator device selection.

Accelerated Backends
- The heavy-lifters of the library, selected at runtime by a "mode" environment variable from those included at compile time.
- Each backend implements all C-API-supported kernels for a single accelerator model.
- Backends are standalone libraries in their own right: they can be used and distributed separately from the top-level API as long as they remain API-compliant, supporting community development of closed- or open-source alternatives.
- Both CUDA and OpenCL are currently supported, providing access to the most popular accelerators.
- Current kernels cover simple operations on subsets of 3D dense matrices: reduction-sum, dot product, 2D transpose, and pack/unpack of subregions.
- More kernels from computational fluid dynamics are in the pipeline; extending to other domains is simply a matter of adding the necessary kernels.
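
A minimal sketch of the runtime selection described above, assuming a hypothetical METAMORPH_MODE environment variable name and the poster's compile-time flags; the real variable name and values may differ:

```c
/* Hedged sketch of runtime backend selection via a "mode" environment
 * variable.  METAMORPH_MODE and the enum values are illustrative only. */
#include <stdlib.h>
#include <string.h>

enum backend { BACKEND_CUDA, BACKEND_OPENCL, BACKEND_NONE };

enum backend select_backend(void)
{
    const char *mode = getenv("METAMORPH_MODE");   /* hypothetical name */
    if (mode == NULL)
        return BACKEND_NONE;                       /* fall back to a default */
#ifdef WITH_CUDA
    if (strcmp(mode, "CUDA") == 0)
        return BACKEND_CUDA;
#endif
#ifdef WITH_OPENCL
    if (strcmp(mode, "OpenCL") == 0)
        return BACKEND_OPENCL;
#endif
    return BACKEND_NONE;
}
```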

Performance

All tests were performed in-node on a single system containing 2x Intel Xeon X5550 quad-core CPUs, 20 GB RAM, and 4x Tesla C2070 GPUs.

[Figure: MPI exchange performance for packing/unpacking of ghost regions using GPUDirect, host-staged, and mixed-backend transfers.]

[Figure: API plugin overheads (timing and Fortran API vs. strictly C API).]

Future Work
- Continue expanding the API’s provided set of kernels and backends with other primitive operations underlying fluid simulations, e.g. Krylov solvers, stencil computations, and various preconditioners.
- Generalize operations to work on non-3D data, and add primitives for computations on unstructured grids.
- Develop a third, automatically runtime-scheduled backend to transparently execute code across an entire node, a la CoreTSAR [7].

Related Efforts: Solver Frameworks and Libraries

OpenFOAM [1]
Pros: support for useful pre- and post-processing (mesh generation and visualization); many solvers for many domains.
Cons: no internal accelerator support; framework-centric development; cumbersome API and “case” construction.

PARALUTION [2]
Pros: many matrix storage formats, solvers, and preconditioners; support for OpenMP, CUDA, and OpenCL on CPUs/GPUs and MIC; plugins for Fortran and OpenFOAM.
Cons: framework-centric development; low interoperability with existing code; no MPI support (yet); asynchronous operations only on CUDA; lack of non-destructive copy to/from C arrays.

MAGMA [3]
Pros: full BLAS and LAPACK support for CUDA, OpenCL, and MIC; support for several factorizations and eigenvalue problems; smart scheduling of hybrid CPU/GPU algorithms with the QUARK directed-acyclic-graph scheduler; multi-GPU methods.
Cons: the CUDA, OpenCL, and MIC variants are separate implementations; no internal MPI support; the MKL/ACML dependency is poorly documented and cumbersome.

Trilinos [4]
Pros: massive set of capability areas beyond linear algebra, solvers, and meshes; built-in distributed-memory support; some preliminary CUDA/MIC work (e.g. the Kokkos, Phalanx, and Tpetra packages).
Cons: redundancies of capability between packages; the breadth of packages is difficult to navigate for newcomers.

References

[1] H. Jasak, A. Jemcov, and Z. Tukovic, “OpenFOAM: A C++ library for complex physics simulations.”
[2] D. Lukarski, “PARALUTION project v0.7.0,” 2012.
[3] J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and I. Yamazaki, “Accelerating Numerical Dense Linear Algebra Calculations with GPUs,” in Numerical Computations with GPUs. Springer, 2014, pp. 3–28.
[4] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M. Willenbring, A. Williams, and K. S. Stanley, “An overview of the Trilinos project,” ACM Trans. Math. Softw., vol. 31, no. 3, pp. 397–423, 2005.
[5] AMD. clMath (formerly APPML). Accessed 2014.10.17. [Online]. Available: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-math-libraries/
[6] D. K. Panda. MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE. Network-Based Computing Laboratory, The Ohio State University. Accessed 2014.10.17. [Online]. Available: http://mvapich.cse.ohio-state.edu/
[7] T. Scogland, W.-c. Feng, B. Rountree, and B. de Supinski, “CoreTSAR: Adaptive Worksharing for Heterogeneous Systems,” in Supercomputing, ser. Lecture Notes in Computer Science, J. Kunkel, T. Ludwig, and H. Meuer, Eds. Springer International Publishing, 2014, vol. 8488, pp. 172–186.

Acknowledgment

This work was funded in part by the Air Force Office of Scientific Research (AFOSR) Basic Research Initiative from the Computational Mathematics Program via Grant No. FA9550-12-1-0442.

[Figure: MetaMorph Transpose vs. Alternatives — transpose performance vs. alternative GPU BLAS libraries. Time per element (μs, log scale from 10^-4 to 10^0) vs. transpose size. Series: MKL float/double (CPU); MetaMorph+OpenCL float/double (GPU); MetaMorph+CUDA float/double (GPU); clMAGMA float/double (GPU); cuMAGMA float/double (GPU).]

[Figure: MetaMorph Dot Product vs. Alternatives — dot-product performance vs. alternative GPU BLAS libraries. Time per element (μs, log scale) vs. vector length (64 to 128M). Series: PLASMA/MKL float/double (CPU); ATLAS float/double (CPU); MetaMorph+OpenCL float/double (GPU); MetaMorph+CUDA float/double (GPU); clAmdBlas float/double (GPU); cuMAGMA float/double (GPU).]

[Figure: MetaMorph Fortran & Timer Plugin Overhead. Total time per call for 2^21 floats (μs, 0 to 3000) for H2D copy, D2D copy, D2H copy, and dot product, under C+CUDA, C+OpenCL, C+Timers+CUDA, C+Timers+OpenCL, Fortran+CUDA, Fortran+OpenCL, Fortran+Timers+CUDA, and Fortran+Timers+OpenCL.]

[Figure: MetaMorph Transparent Face Exchange Primitive (MPI exchange primitives). Time per float element (μs, 0 to 50) vs. 2D face size (2x2 to 64x64). Series: GPU Direct, CUDA, OpenCL, and CUDA-to-OpenCL variants of Pack, Send, Recv, and Unpack.]
