Revolutionizing the Datacenter
Join the Conversation #OpenPOWERSummit
CUDA Programming and Performance Tuning with IBM XL Compilers on POWER8 Systems
Shereen Ghobrial, Yaoqing Gao (IBM Canada Lab)
Istvan Reguly (PPCU ITK, Hungary)
Agenda
• IBM XL Compiler Overview
• Performance Tuning with XL Compilers
• Performance Results
• Summary
2 4/11/16
Overview of XL C/C++ and Fortran Compilers
[Diagram: XL C++ and XL FORTRAN platform coverage: Power (Linux BE/LE, AIX, BG/Q) and zSystem (z/OS, Linux).]
Advanced Optimization Technology
• Full platform exploitation: enable and exploit POWER hardware features
• Loop transformation: analyze and transform loops to improve performance
• Automatic SIMDization/vectorization: convert operations to allow several calculations to occur simultaneously
• Parallelization: automatic parallelization and explicit parallelization through OpenMP
• Optimized math libraries: scalar MASS library and vector MASSV library tuned for POWER
• IPA (Inter-Procedural Analysis): apply optimization techniques to entire programs
• PDF (Profile-Directed Feedback): tune application performance for typical usage scenarios
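As an illustration of the kind of loop the automatic SIMDization targets, here is a minimal, hypothetical saxpy-style kernel (not taken from the deck); with independent iterations and non-overlapping arrays, -O3 -qhot can turn the scalar loop into VSX vector operations:

```c
#include <stddef.h>

/* y[i] = a*x[i] + y[i]: each iteration is independent, so the
 * compiler can convert the scalar loop into vector operations
 * that process several elements per instruction. The `restrict`
 * qualifiers tell the optimizer the arrays do not overlap. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The `restrict` qualifiers are what let the optimizer prove the loads and stores do not alias; without them, vectorization may be blocked or guarded by runtime overlap checks.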
Language standard compliance
• C99 compliance; C11 compliance on Linux on Power, and selected features on other platforms
• C++98 compliance; C++11 compliance on Linux on Power, and selected features on other platforms
• Fortran 2003 compliance; selected Fortran 2008 and TS 29113 features
• OpenMP 3.1 compliance; selected OpenMP 4.0 features
Open Source & GCC affinity
• Binary compatibility
[Diagram: XL C++ and gcc C++ object files link together; an existing gcc Makefile migrates to call XL.]
• Options and source compatibility
• IBM Advance Toolchain support
Performance Tuning with XL compilers and Libraries
• Identify application hot spots and performance bottlenecks: use performance tools such as perf, OProfile, gprof, etc.
• Use the XL compiler report to examine compiler optimization:
  -qlistfmt=[xml | html]=inlines     generates inlining information
  -qlistfmt=[xml | html]=transform   generates loop transformation information
  -qlistfmt=[xml | html]=data        generates data reorganization information
  -qlistfmt=[xml | html]=pdf         generates dynamic profiling information
  -qlistfmt=[xml | html]=all         turns on all optimization content
  -qlistfmt=[xml | html]=none        turns off all optimization content
• Tune performance with the XL compilers:
  – Typically start from -O2 or -O3
  – Add high-order optimization (-qhot) for floating-point computation and memory-bandwidth-intensive workloads
  – Add whole-program optimization (-qipa[=level=0 | 1 | 2]) for workloads with many small C/C++ function calls
  – Add -qsmp=omp for OpenMP workloads
  – Use CUDA C/C++ and Fortran for GPU-enabled workloads
Performance Tuning with XL compilers and Libraries
• Consider using the highly tuned MASS/MASSV and ESSL libraries
• Aggressive optimization may affect the results of the program. -qstrict guarantees results identical to noopt, at the expense of optimization. Suboptions allow fine-grained control over this guarantee. Examples:
  -qstrict=precision        Strict FP precision
  -qstrict=exceptions       Strict FP exceptions
  -qstrict=ieeefp           Strict IEEE FP implementation
  -qstrict=nans             Strict general and computational NaNs
  -qstrict=order            Do not modify evaluation order
  -qstrict=vectorprecision  Maintain precision over all loop iterations
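A small, hypothetical C example of why -qstrict=order exists: floating-point addition is not associative, so letting the optimizer reorder a reduction changes the rounded result.

```c
/* Summing the same float array in two orders gives different
 * results: adding 1.0f to 1e8f is absorbed by rounding, while
 * accumulating the small terms first preserves them.
 * -qstrict=order forbids the optimizer from reordering
 * evaluations in this way. */
float sum_forward(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += x[i];
    return s;
}

float sum_backward(const float *x, int n) {
    float s = 0.0f;
    for (int i = n - 1; i >= 0; --i) s += x[i];
    return s;
}
```

For x = {1e8f, 1, 1, 1, 1, 1, 1, 1, 1}, the forward sum stays at 1e8 (each 1.0f is lost to rounding) while the backward sum yields 1e8 + 8.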
XL C/C++ for Linux (LE) Community Edition, GA June 17, 2016
• Free of charge
• Unlimited license for production use
• Available for download via IBM developerWorks, or public apt-get, yum, and zypper repositories
• Includes all the core features of XL C/C++ on Linux
Summary of XL Compilers for OpenPOWER
Easy migration
• C/C++ language standard conformance
• Full binary compatibility with GCC
• Option and source compatibility with GCC and Clang

Industry-leading performance
• Full enablement and exploitation of the latest POWER hardware
• Advanced compiler optimization technologies
• Optimized math libraries
• 10 to 30% better than open-source compilers for typical workloads
World-class service and support
POWER8 Performance figures
• IBM S824L: 2x10 cores @ 3.69 GHz
• Off-chip memory (256 GB, Centaur): up to 3x over Intel E5-2650 v3
• On-chip L3 cache (160 MB total)
[Charts: bandwidth (GB/s) vs. multi-threading mode (SMT1/SMT2/SMT4/SMT8); off-chip memory up to ~350 GB/s for working sets of 3x1.5 GB down to 3x75 MB, on-chip L3 up to ~1200 GB/s for 3x37 MB.]
Computational throughput
• IBM ESSL 5.3 multi-threaded: SGEMM and DGEMM
[Chart: computational throughput (GFLOPS/s, up to ~1000) vs. multi-threading mode (SMT1/SMT2/SMT4/SMT8), single and double precision.]
• vs. MKL on E5-2650 v3 Haswell (with FMA): 1280 GFLOPS single (0.72x), 523 GFLOPS double (0.96x)
Black-Scholes 1D
• Compute-intensive financial benchmark (AltiVec)
[Charts: execution time (ms, log scale 128 to 8192) vs. multi-threading mode (SMT1/SMT2/SMT4/SMT8) for explicit1 scalar/vector, explicit2 vector, and implicit1 scalar/vector, in single and double precision.]
Figure 3: One-factor hand-vectorised Black-Scholes performance on the POWER8. The timings are for 50000 (explicit) or 2500 (implicit) timesteps, for 6144 options each on a grid of size 256.
Table 2: One-factor hand-vectorised Black-Scholes performance on the POWER8 and Intel. The timings are for 50000 (explicit) or 2500 (implicit) timesteps, for 6144 options each on a grid of size 256.

                   POWER8                        Intel E5-2650 v3
                   single prec.   double prec.   single prec.   double prec.
                   msec  GFlop/s  msec  GFlop/s  msec  GFlop/s  msec  GFlop/s
explicit1 scalar   1782  264      3679  127       942  499      1593  293
explicit1 vector    821  573      1395  337      1512  311      2805  167
explicit2 vector   1131  415      2412  195       756  620      1359  346
implicit1 scalar    946   82      1038   65      1191   65      1524   44
implicit1 vector    205  380       406  166       288  270       756   89
with u_{-1} = u_J = 0. Here n is the timestep index, which increases as the computation progresses, so u_j^{n+1} is a simple combination of the values at the nearest neighbours at the previous timestep. All of the numerical results are for a grid of size J = 256, which corresponds to a fairly high-resolution grid in financial applications. Additional parallelism is achieved by solving multiple one-factor problems at the same time, with each one having different model constants, or a different financial option payoff.

A standard implicit time-marching approximation leads to a tridiagonal set of equations of the form

    a_j u_{j-1}^{n+1} + b_j u_j^{n+1} + c_j u_{j+1}^{n+1} = u_j^n,    j = 0, 1, ..., J-1

with u_{-1} = u_J = 0. This is then solved using the Thomas algorithm, a sequential algorithm for each option at each timestep.
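The Thomas algorithm mentioned above can be sketched in plain C (a generic tridiagonal solver in the paper's notation; the scratch-array signature is an assumption, not the benchmark's actual code):

```c
#include <stddef.h>

/* Thomas algorithm: solves the tridiagonal system
 *   a[j]*x[j-1] + b[j]*x[j] + c[j]*x[j+1] = d[j],  j = 0..n-1,
 * with x[-1] = x[n] = 0. The forward sweep eliminates the
 * sub-diagonal; the backward sweep substitutes. O(n), and
 * inherently sequential within one system. cp is scratch for the
 * modified super-diagonal; d is overwritten in place. */
void thomas_solve(const double *a, const double *b, const double *c,
                  double *d, double *x, double *cp, size_t n) {
    cp[0] = c[0] / b[0];
    d[0]  = d[0] / b[0];
    for (size_t j = 1; j < n; ++j) {
        double m = 1.0 / (b[j] - a[j] * cp[j-1]);
        cp[j] = c[j] * m;
        d[j]  = (d[j] - a[j] * d[j-1]) * m;
    }
    x[n-1] = d[n-1];
    for (size_t j = n - 1; j-- > 0; )
        x[j] = d[j] - cp[j] * x[j+1];
}
```

The loop-carried dependence in both sweeps is why, as the text notes, the hand-vectorised implicit1 version vectorises across options rather than within a single solve.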
The baseline implementations for both explicit and implicit algorithms have an outermost loop over all options, and then, after initialising the a, b, and c coefficient arrays, iterate for a number of timesteps (50000 explicit, 2500 implicit), updating the values of all 256 grid points; this is the explicit1/implicit1 implementation. Various compilers may or may not vectorise over the outer loop, in which case different lanes of the vector represent different options, or, in the explicit case, over the finite-difference update of grid points, with each lane representing a different grid point. Hand-written vectorisations are created for all of these, using vector intrinsics on Intel and the AltiVec types on the POWER8. Scalar and vectorised versions (over different options) are named explicit1 and implicit1; explicit vectorisation over a single option is named explicit2.

Figure 3 shows the performance of the one-factor benchmarks in single precision (left figure) and double precision (right figure); both scalar and hand-vectorised versions are shown, the latter using AltiVec instructions. These are primarily compute-bound benchmarks: as the metrics in Table 2 show, around 62-67% of practical peak compute (930 GFLOPS/s in single, 500 GFLOPS/s in double) is achieved with the hand-vectorised explicit1 benchmarks. Hand-vectorised code gives significant improvements in all cases, up to four times in both single and double precision. The performance difference between single and double precision for both implicit and explicit hand-vectorised versions is close to 2x. In the implicit case, during the Thomas algorithm a reciprocal value has to be computed at each grid point; here the fast approximate reciprocal (vec_re) is used, which gives a 15% improvement in overall performance.

In comparison, on the Intel Xeon E5-2650 v3 (Haswell) the Intel compiler does auto-vectorize explicit1, although it is less performant than the hand-vectorised variant. Overall, it achieves about 50% of the benchmarked peak, slightly outperforming the POWER8 on the explicit benchmark, but it is significantly slower on the implicit benchmark.
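Hardware reciprocal estimates such as vec_re are only approximate (on the order of 12 correct bits); when more accuracy is needed, the standard practice is to refine the estimate with Newton-Raphson steps. A hypothetical scalar sketch of that refinement (the benchmark itself uses the raw estimate):

```c
/* One Newton-Raphson step for the reciprocal: given r ≈ 1/x,
 * r' = r * (2 - x*r) roughly doubles the number of correct bits.
 * Coarse hardware estimates (e.g. AltiVec vec_re) are typically
 * polished with one or two such steps when full float precision
 * is required. */
float refine_recip(float x, float r) {
    return r * (2.0f - x * r);
}
```

Each step squares the relative error, so two steps from a 12-bit estimate reach roughly full float precision; skipping the refinement trades accuracy for the 15% speedup reported above.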
5.2 Three-factor

The three-factor test application uses the Black-Scholes PDE for 3 underlying assets, each corresponding to Geometric Brownian Motion and with positive correlation between the 3 driving Brownian motions. This leads to a parabolic PDE with spatial cross-derivative terms with positive coefficients. The spatial approximation of this leads to a 13-point stencil involving offsets ±(1, 0, 0), ±(0, 1, 0), ±(0, 0, 1), ±(1, 1, 0), ±(0, 1, 1) and ±(1, 0, 1), relative to a point with 3D indices (i, j, k). The test case uses a grid of size 256^3, with all data stored in the main memory in 1D arrays.
CloverLeaf 3D – structured mesh PDE solver
• Mantevo Suite, primarily bandwidth-limited
[Chart: runtime (s, 0-60) of CloverLeaf 2D (3840^2, 87 iterations) with different compilers and implementations (ref XL Fortran kernels, ref XL C kernels, OPS XL, OPS Clang), over 20/40/80/160 MPI and 20 MPI + 1/2/4/8 OpenMP.]
• Average 160 GB/s, high 270 GB/s, low 70 GB/s
• Intel E5-2650 v3: 32 sec (3x); NVIDIA K40: 15 sec (1.4x)
Airfoil – Unstructured mesh FVM
[Charts: execution time (s, 0-80) vs. multi-threading mode (SMT1/SMT2/SMT4/SMT8), in double and single precision, for MPI full, MPI comm, 4 MPI + OpenMP (full and comm), and 8 MPI + OpenMP (full and comm).]
• Irregular computations are the bottleneck
• Intel E5-2650 v3: 23.38/30.27 sec (1.5x); NVIDIA K40: 10.5/17.6 sec (0.8x)
Table 5: Useful bandwidth (BW, GB/s) and computational (Comp, GFlops/s) throughput of baseline implementations on Airfoil

              POWER8                                   Intel E5-2650 v3
Kernel        Double precision    Single precision     Double prec.   Single prec.
              Time   BW   Spdup   Time   BW   Spdup    Time   BW      Time   BW
save_soln     0.61   301  4.7x    0.14   641  10.2x    2.86   64      1.44   64
adt_calc      8.39   38   0.7x    7.49   22   0.8x     5.89   55      6.44   25
res_calc      8.94   77   1.3x    8.29   42   1.3x     11.92  58      10.95  32
bres_calc     0.02   133  2.5x    0.01   57   3x       0.05   53      0.03   27
update        4.55   172  2.1x    3.39   115  1.3x     9.52   82      4.49   87
Rolls-Royce Hydra (unstructured mesh)
[Charts: runtime (s, 0-20) for 20/40/80/160 MPI and 20 MPI + 1/2/4/8 OpenMP, showing total time and MPI time; Hydra on Haswell CPU (Intel E5-2650 v3), POWER8, and NVIDIA K80 CUDA (0-10 s), with 1.4x and 1.8x speedup annotations.]
Two tracks to challenge and win:
1. The Open Road Test
   – Port and optimize for OpenPOWER
   – Go faster with accelerators (optional)
2. The Spark Rally
   – Train an accelerated DNN and recognize objects with greater accuracy
   – Show you can scale with Spark

Key Dates
• Register today: openpower.devpost.com
• Sun May 1st: submission period opens
• Tue Aug 2nd: submission period closes
• Grand prizes include a trip to Supercomputing 2016; other prizes include iPads and Apple Watches

Join the conversation at #OpenPOWERSummit
Additional Information
• XL C/C++ home page: http://www-142.ibm.com/software/products/us/en/ccompfami/
• C/C++ Café: http://ibm.biz/Bdx8XR
• XL Fortran home page: http://www-03.ibm.com/software/products/en/fortcompfami
• Fortran Café: http://ibm.biz/Bdx8XX
• IBM SDK Linux – Using the CPI plugin: http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liaal/iplsdkcpibm.htm
• PMU events: http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liaal/iplsdkcpievents.htm
Additional Information
• Code optimization with the IBM XL compilers on Power architectures: http://www-01.ibm.com/support/docview.wss?uid=swg27005174&aid=1
• Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8: http://www.redbooks.ibm.com/abstracts/sg248171.html
• Implementing an IBM High-Performance Computing Solution on IBM POWER8: http://www.redbooks.ibm.com/abstracts/sg248263.html?Open
Installing XL C/C++ and Fortran Compilers
• Get the XL C/C++ and Fortran compilers:
  C/C++: http://www.ibm.com/developerworks/downloads/r/xlcpluslinux
  Fortran: http://www.ibm.com/developerworks/downloads/r/xlfortranlinux
• Run the install utility, which performs the following tasks:
  – Detects the current architecture (big endian or little endian)
  – Installs all prerequisite software packages (using apt-get, zypper, or yum)
  – Installs all compiler packages into the default location, /opt/ibm/
  – Automatically invokes the xlc_configure utility, which installs the license file and generates the default configuration file
  – Creates symbolic links in /usr/bin/ to the compiler invocation commands
• To use XL C/C++ with the Advance Toolchain, run xlc_configure -at to create xlc_at and xlC_at
• Free download of the XL compiler libraries: http://public.dhe.ibm.com/software/server/POWER/Linux/rte/xlcpp/le/
Compiling Applications with XLC and XLF
• Set up the path for the IBM XL compilers:
  export PATH=/opt/ibm/xlC/13.1.3/bin:/opt/ibm/xlf/15.1.3/bin:$PATH
• Check the compiler release and version:
  xlc -qversion
  xlC -qversion
  xlf -qversion
• Compile an application:
  xlc for C (add -qlanglvl=stdc11 or -qlanglvl=extc1x to enable C11)
  xlC for C++ (add -qlanglvl=extended0x to enable C++11)
  xlf, xlf90, xlf95, xlf2003, xlf2008 for Fortran
• Specify compile options:
  – -O3 or -O3 -qhot for floating-point computation-intensive applications
  – -O3 or -O3 -qipa for integer applications
  – -qsmp=omp for OpenMP applications

Example output:
  xlc -qversion
  IBM XL C/C++ for Linux, V13.1.3
  Version: 13.01.0003.0000
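As a minimal sketch of the kind of loop -qsmp=omp parallelizes (a generic example, not from the deck):

```c
#include <stddef.h>

/* Parallel dot product: compiled with xlc -O3 -qsmp=omp (or
 * gcc -fopenmp), the loop iterations are divided among threads
 * and the partial sums are combined by the reduction clause.
 * Without an OpenMP flag the pragma is ignored and the loop
 * runs serially, producing the same result. */
double dot(const double *x, const double *y, size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Because the pragma degrades gracefully to serial code, the same source file builds with or without -qsmp=omp.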
GPU Programming with CUDA C/C++ and XL C/C++
• Check for a CUDA-capable GPU:
  lspci | grep -i nvidia
• Verify the Linux version:
  uname -m && cat /etc/*release
• Get and install the CUDA C/C++ and XL C/C++ and Fortran compilers:
  CUDA C/C++: https://developer.nvidia.com/cuda-downloads-power8
  XL C/C++: http://www.ibm.com/developerworks/downloads/r/xlcpluslinux
• Modify host_config.h for XL C/C++: change "!=" to ">=" and remove the irrelevant comment
• Set up the path for the IBM XL compilers:
  export PATH=/opt/ibm/xlC/13.1.3/bin:$PATH
• Environment setup:
  export PATH=/usr/local/cuda-7.0/bin:$PATH
  export LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:$LD_LIBRARY_PATH
• Compile and run:
  nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -q64 -Xcompiler -qsmp=omp -gencode arch=compute_20,code=sm_20 -o cudaOpenMP.o -c cudaOpenMP.cu
  nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -q64 -o cudaOpenMP cudaOpenMP.o -lxlsmp
SPEC CPU2006 Rate Performance
[Charts: SPECint_rate2006 (0 to 2000) and SPECfp_rate2006 (1000 to 1350) for POWER S824 vs. Haswell E5-4655 v3.]
• POWER8 is up to 1.4x the top 24-core Haswell
• XLC V13.1/XLF V15.1 are 10 to 30% better than GCC 4.9
STAC-A2 Performance
[Chart: relative performance (speedup); baselines E5-2699 v3 (36c/72t) and E7-4890 v2 (60c/120t) at 1x; POWER8 (24c/192t) at 2.3x on STAC-A2.β2.GREEKS.TIME.WARM and 2.1x on STAC-A2.β2.GREEKS.MAX_PATHS.]
https://stacresearch.com/news/2015/03/12/stac-report-stac-a2-ibm-power8-solution
• POWER8 doubles the performance of Haswell
• XLC 13.1 is 10% to 20% better than GCC 4.9
NAS Performance
[Chart: speedup (0 to 3) of XLC 13.1 and XLF 15.1 over GCC 5.2 on the NAS benchmarks bt, cg, dc, ep, ft, is, lu, mg, sp, ua.]
• XLC 13 / XLF 15 performs 1.36x better than GCC 5.2