http://synergy.cs.vt.edu/
Paul Sathre, Wu Feng
{sath6220, feng} @cs.vt.edu
Virginia Tech --- Center for Synergistic Environments for Experimental Computing
Design
- "Make-your-own" library from modular building blocks
- Include only needed plugins and backends, selected by compile-time definitions (see the sketch below), e.g.:
  -D WITH_CUDA -D WITH_OPENCL -D WITH_BACKENDX ...
  -D WITH_TIMERS -D WITH_MPI -D WITH_FORTRAN -D WITH_PLUGINX ...
[Figure: layered library architecture. User apps call either the C API or the Fortran 2003 API (via the Fortran compatibility layer); a plugin layer (timer plugin, MPI plugin, future plugins) sits above a backend layer (CUDA backend, OpenCL backend, future device backends).]
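As a rough sketch of how this compile-time modularity can work (illustrative only; the init_* function names below are hypothetical stand-ins, not MetaMorph symbols), each WITH_* definition gates whether a backend or plugin is compiled into the build:

/* Hypothetical sketch: each backend/plugin is compiled in only when its
 * WITH_* definition is supplied at build time, e.g.
 *   cc -D WITH_CUDA -D WITH_TIMERS sketch.c
 * The init_* functions are illustrative stubs, not MetaMorph routines. */
#include <stdio.h>

#ifdef WITH_CUDA
static void init_cuda_backend(void)   { puts("CUDA backend registered"); }
#endif
#ifdef WITH_OPENCL
static void init_opencl_backend(void) { puts("OpenCL backend registered"); }
#endif
#ifdef WITH_TIMERS
static void init_timer_plugin(void)   { puts("Timer plugin registered"); }
#endif
#ifdef WITH_MPI
static void init_mpi_plugin(void)     { puts("MPI plugin registered"); }
#endif

int main(void) {
#ifdef WITH_CUDA
    init_cuda_backend();
#endif
#ifdef WITH_OPENCL
    init_opencl_backend();
#endif
#ifdef WITH_TIMERS
    init_timer_plugin();
#endif
#ifdef WITH_MPI
    init_mpi_plugin();
#endif
    return 0;
}

Building with different -D combinations then yields a library containing only the selected components.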
[Figure: example heterogeneous cluster. Five 8-core nodes (cores 0-7) joined by an interconnect, each with a NIC and a PCI-E-attached device: Node 0 and Node 1 carry NVIDIA GPUs (CUDA), Node 2 runs on the CPU (OpenCL), Node 3 carries an Intel MIC (OpenCL), and Node 4 carries an AMD GPU (OpenCL); the MPI plugin spans the nodes.]
MPI
- On-device packing of ghost regions
- GPU Direct option, if available with MPI/backend
- Fallback to host-staged transfers otherwise
- Transparently exchange device buffers between processes, regardless of backend (see the sketch below)
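The sketch below illustrates the two transfer paths just described; it is a simplified, hypothetical helper (exchange_face and its arguments are not MetaMorph's API). With a CUDA-aware "GPU Direct" MPI, the packed device buffer is handed to MPI directly; otherwise the exchange is staged through a host buffer, with the device-side copies indicated by comments to stay backend-neutral:

/* Illustrative only (assumed helper, not the MetaMorph API): exchange one
 * packed ghost-region face with a neighboring rank. */
#include <mpi.h>

void exchange_face(double *dev_packed,  /* face already packed on the device */
                   double *host_stage,  /* scratch buffer in host memory */
                   int count, int peer, int gpu_direct, MPI_Comm comm) {
    if (gpu_direct) {
        /* CUDA-aware MPI can read/write device memory directly. */
        MPI_Sendrecv_replace(dev_packed, count, MPI_DOUBLE,
                             peer, 0, peer, 0, comm, MPI_STATUS_IGNORE);
    } else {
        /* Host-staged fallback:
         *   1) copy dev_packed -> host_stage (cudaMemcpy / clEnqueueReadBuffer)
         *   2) exchange the host buffer over MPI
         *   3) copy host_stage -> dev_packed (cudaMemcpy / clEnqueueWriteBuffer) */
        MPI_Sendrecv_replace(host_stage, count, MPI_DOUBLE,
                             peer, 0, peer, 0, comm, MPI_STATUS_IGNORE);
    }
}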
Automatic Profiling
- If compiled in, all kernels and transfers are timed behind the scenes automatically (see the sketch below)
- An environment variable controls verbosity
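One common way to time calls behind the scenes is sketched below, purely as an assumption-laden illustration: the TIMED_CALL macro and the METAMORPH_TIMER_LEVEL variable name are hypothetical, not the library's actual mechanism. Each kernel launch or transfer is wrapped, and the elapsed time is reported only when the verbosity variable is set:

/* Hypothetical sketch of transparent call timing; METAMORPH_TIMER_LEVEL is
 * an assumed variable name, not necessarily the one the library uses. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Wraps any statement, timing it and printing when verbosity > 0. */
#define TIMED_CALL(label, stmt) do {                                   \
        struct timespec t0, t1;                                        \
        clock_gettime(CLOCK_MONOTONIC, &t0);                           \
        stmt;                                                          \
        clock_gettime(CLOCK_MONOTONIC, &t1);                           \
        const char *v = getenv("METAMORPH_TIMER_LEVEL");               \
        if (v && atoi(v) > 0)                                          \
            fprintf(stderr, "[timer] %s: %.3f us\n", (label),          \
                    (t1.tv_sec - t0.tv_sec) * 1e6 +                    \
                    (t1.tv_nsec - t0.tv_nsec) / 1e3);                  \
    } while (0)

int main(void) {
    double sum = 0.0;
    /* Stand-in for a kernel launch or device transfer. */
    TIMED_CALL("dummy_kernel", for (int i = 0; i < 1000000; ++i) sum += i);
    printf("sum = %f\n", sum);
    return 0;
}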
Introduction
Heterogeneity is becoming a fact of life in HPC, largely driven by demands for parallelism and power efficiency beyond what traditional CPUs can provide. However, extracting the full performance of heterogeneous systems is non-trivial and requires architecture expertise.
Retrofitting existing codes for heterogeneity is tedious and error-prone, architecture experts are in short supply, and accelerators are moving targets. Therefore, a single API for transparently executing optimized code on accelerators with minimal intervention is needed for scientific productivity.

Fortran Compatibility
- Verbose pass-by-reference C API, usable with ISO_C_BINDING (see the sketch after this list)
- Fortran 2003 wrapper to the verbose API provides a simplified, unified calling convention for supported real and integer types
- Uses the C API internally, so all runtime control variables work equivalently
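The calling style being described can be pictured as below; this is a hedged sketch with a hypothetical function name (example_dot_d), not the actual MetaMorph API. Every argument is a pointer, so the routine can be bound from Fortran 2003 through ISO_C_BINDING; a possible Fortran interface is shown in the trailing comment:

/* Hypothetical verbose, pass-by-reference entry point (not a MetaMorph
 * symbol). All arguments are pointers so Fortran can call it via
 * ISO_C_BINDING without value-passing surprises. */
#include <stddef.h>

int example_dot_d(const double *x, const double *y,
                  const size_t *n, double *result) {
    double s = 0.0;
    for (size_t i = 0; i < *n; ++i) s += x[i] * y[i];
    *result = s;
    return 0;   /* error code */
}

/* A Fortran 2003 binding to the routine above could look like:
 *
 *   interface
 *     function example_dot_d(x, y, n, result) bind(C) result(ierr)
 *       use iso_c_binding
 *       integer(c_int) :: ierr
 *       real(c_double), intent(in)    :: x(*), y(*)
 *       integer(c_size_t), intent(in) :: n
 *       real(c_double), intent(out)   :: result
 *     end function example_dot_d
 *   end interface
 */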
Accelerated Backends
- Purpose-built to provide a single API to accelerated kernels for current and future devices
- Runtime control over backends, plugins, and accelerator device selection
- The heavy lifters of the library, selected at runtime by a "mode" environment variable from those included at compile time (see the sketch after this list)
- Include implementations of all C API-supported kernels for a single accelerator model
- Standalone libraries in their own right: as long as they are API compliant, they can be used and distributed separately from the top-level API, supporting community development of closed- or open-source alternatives
- Currently support both CUDA and OpenCL, providing access to the most popular accelerators
- Currently support simple operations on subsets of 3D dense matrices: reduction-sum, dot product, 2D transpose, and pack/unpack of subregions
- More kernels from computational fluid dynamics are in the pipeline; extensions to other domains will follow, simply a matter of adding the necessary kernels
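A minimal sketch of runtime backend selection through a "mode" environment variable follows. The variable name METAMORPH_MODE, the dot_fn type, and the host stub are assumptions for illustration, not the library's actual symbols: the environment is read once and kernel calls are routed through a function pointer for the chosen backend.

/* Hypothetical sketch of runtime backend selection by a "mode" environment
 * variable (names are assumptions, not the library's). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef double (*dot_fn)(const double *, const double *, size_t);

/* Stub host implementation standing in for a real CUDA/OpenCL kernel. */
static double dot_host(const double *x, const double *y, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += x[i] * y[i];
    return s;
}

int main(void) {
    const char *mode = getenv("METAMORPH_MODE");   /* assumed variable name */
    dot_fn dot = dot_host;                         /* default backend */

    /* In the real library, each string would select a compiled-in backend
     * (e.g. CUDA or OpenCL); here every choice maps to the host stub. */
    if (mode && strcmp(mode, "OpenCL") == 0) dot = dot_host;
    else if (mode && strcmp(mode, "CUDA") == 0) dot = dot_host;

    double x[4] = {1, 2, 3, 4}, y[4] = {4, 3, 2, 1};
    printf("dot = %f (mode = %s)\n", dot(x, y, 4), mode ? mode : "default");
    return 0;
}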
Performance
- Figures (below): MPI exchange performance for packing/unpacking ghost regions under GPUDirect, host-staged, and mixed-backend transfers; and API plugin overheads (timer plugin, Fortran API vs. strictly C API).
- All tests were performed in-node on a single system containing 2x quad-core Intel Xeon X5550 CPUs, 20 GB RAM, and 4x NVIDIA Tesla C2070 GPUs.
Future Work
- Continue expanding the API's provided set of kernels and backends with other primitive operations underlying fluid simulations, e.g. Krylov solvers, stencil computations, and various preconditioners
- Generalize operations to work on non-3D data, and add primitives for computations on unstructured grids
- Generate a third, automatically runtime-scheduled backend to transparently execute code across an entire node, a la CoreTSAR [7]
Related Efforts: Solver Frameworks and Libraries

OpenFOAM [1]
- Pros: support for useful pre- and post-processing (mesh generation and visualization); many solvers for many domains
- Cons: no internal accelerator support; framework-centric development; cumbersome API and "case" construction

PARALUTION [2]
- Pros: many matrix storage formats; many solvers; many preconditioners; support for OpenMP, CUDA, and OpenCL on CPUs/GPUs and MIC; plugins for Fortran and OpenFOAM
- Cons: framework-centric development; low interoperability with existing code; no MPI support (yet); asynchronous operations only on CUDA; lack of non-destructive copy to/from C arrays

MAGMA [3]
- Pros: full BLAS and LAPACK support for CUDA, OpenCL, and MIC; support for several factorizations and eigenvalue problems; smart scheduling of hybrid CPU/GPU algorithms with the QUARK directed-acyclic-graph scheduler; multi-GPU methods
- Cons: the CUDA, OpenCL, and MIC variants are separate implementations; no internal MPI support; MKL/ACML dependency poorly documented and cumbersome

Trilinos [4]
- Pros: massive set of capability areas beyond linear algebra, solvers, and meshes; built-in distributed-memory support; some preliminary CUDA/MIC work (e.g. the Kokkos, Phalanx, and Tpetra packages)
- Cons: redundancies of capability between packages; breadth of packages difficult to navigate for newcomers
References
[1] H. Jasak, A. Jemcov, and Z. Tukovic, "OpenFOAM: A C++ library for complex physics simulations."
[2] D. Lukarski, "PARALUTION project v0.7.0," 2012.
[3] J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and I. Yamazaki, "Accelerating Numerical Dense Linear Algebra Calculations with GPUs," in Numerical Computations with GPUs. Springer, 2014, pp. 3–28.
[4] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M. Willenbring, A. Williams, and K. S. Stanley, "An overview of the Trilinos project," ACM Trans. Math. Softw., vol. 31, no. 3, pp. 397–423, 2005.
[5] AMD, clMath (formerly APPML). Accessed 2014.10.17. [Online]. Available: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-math-libraries/
[6] D. K. Panda, MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE. Network-Based Computing Laboratory, The Ohio State University. Accessed 2014.10.17. [Online]. Available: http://mvapich.cse.ohio-state.edu/
[7] T. Scogland, W.-c. Feng, B. Rountree, and B. de Supinski, "CoreTSAR: Adaptive Worksharing for Heterogeneous Systems," in Supercomputing, ser. Lecture Notes in Computer Science, J. Kunkel, T. Ludwig, and H. Meuer, Eds. Springer International Publishing, 2014, vol. 8488, pp. 172–186.
This work was funded in part by the Air Force Office of Scientific Research (AFOSR) Basic Research Initiative from the Computational Mathematics Program via Grant No. FA9550-12-1-0442.
[Figure: MetaMorph Transpose vs. Alternatives. Time per element (µs) vs. transpose size; series: MKL (CPU), MetaMorph+OpenCL (GPU), MetaMorph+CUDA (GPU), clMAGMA (GPU), and cuMAGMA (GPU), each in float and double.]
[Figure: MetaMorph Dot Product vs. Alternatives. Time per element (µs) vs. vector length (64 to 128M); series: PLASMA/MKL (CPU), ATLAS (CPU), MetaMorph+OpenCL (GPU), MetaMorph+CUDA (GPU), clAmdBlas (GPU), and cuMAGMA (GPU), each in float and double.]
[Figure: MPI exchange primitives. Time per element (µs) vs. vector length.]
[Figure: MetaMorph Fortran & Timer Plugin Overhead. Total time per call for 2^21 floats (µs) for H2D copy, D2D copy, D2H copy, and dot product; series: C and Fortran front ends, with and without timers, on the CUDA and OpenCL backends.]
[Figure: MetaMorph Transparent Face Exchange Primitive. Time per float element (µs) vs. 2D face size (2x2 to 64x64); series: pack, send, recv, and unpack under GPU Direct, CUDA (host-staged), OpenCL, and mixed CUDA-to-OpenCL transfers.]