© 2016 IBM Corporation
Performance Tuning with IBM XL C/C++/Fortran Compilers and Libraries
Yaoqing Gao <[email protected]>, Technical Staff Member
IBM Canada Lab
November 2016
@IBM_compilers1 IBM C/C++ and Fortran Compilers and Libraries
§ Overview of IBM XL C/C++ and Fortran Compilers
§ Performance Analysis
  § Performance tools
  § Hot spots and bottleneck detection
§ Performance Tuning
§ Summary
Agenda
Overview of IBM XL C/C++ and Fortran Compilers and Libraries
IBM XL C/C++ and Fortran Compilers
Easy migration
• C/C++ language standard conformance; Fortran 2003 compliance and selected Fortran 2008 features; OpenMP 3.1 compliance and selected OpenMP 4.0/4.5 features; CUDA C/C++ and Fortran
• Full binary compatibility with GCC
• Option and source compatibility with GCC and Clang
Industry leading performance
• Full enablement and exploitation of the latest Power hardware
• Leading edge advanced compiler optimization technologies
• Optimized math libraries
• 10 - 30% better than open source compilers for typical workloads
Agility
• Flexibility/speed in delivery schedules
• Superior service and support
• Successful customer engagements
Advanced Optimization Technology
• Full platform exploitation
  – Enable and exploit POWER hardware features
• Loop transformation
  – Analyze and transform loops to improve performance
• Automatic SIMDization/vectorization
  – Convert operations to allow several calculations to occur simultaneously
• Parallelization
  – Automatic parallelization and explicit parallelization through OpenMP
• Optimized math libraries
  – Scalar MASS library and vector MASSV library tuned for POWER
• IPA (Inter-Procedural Analysis)
  – Apply optimization techniques to entire programs
• PDF (Profile-Directed Feedback)
  – Tune application performance for typical usage scenarios
Open Source & GCC Affinity
[Diagram: C/C++ sources built with GCC/Clang or XLC from the same Makefiles produce object/binary files that call each other and are combined by the linker, illustrating the migration path]
§ Options, source compatibility
§ Binary compatibility
§ Clang adoption
Compiling Application with XLC and XLF
§ Check the compiler release and version
  • xlc -qversion
  • xlC -qversion
  • xlf -qversion
§ Compile an application
  • xlc for C code; xlC for C++ code
  • xlf, xlf90, xlf95, xlf2003, xlf2008 for Fortran code
§ Specify compile options
  • -O3 or -O3 -qhot for floating-point computation intensive applications
  • -O3 or -O3 -qipa for integer applications
  • -qsmp=omp for OpenMP applications
Example output:
  xlc -qversion
  IBM XL C/C++ for Linux, V13.1.4 Version: 13.01.0004.0000
CUDA C/C++
§ NVCC can use XLC as the host compiler for the POWER CPU
  • NVCC is the NVIDIA CUDA C++ compiler from the NVIDIA CUDA Toolkit
  • NVCC partitions C/C++ source code into CPU and GPU portions
§ Detailed instructions for using XLC
  • Red Book: http://www.redbooks.ibm.com/redpapers/pdfs/redp5169.pdf
§ Invocation example
  • nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -q64 -Xcompiler -qsmp=omp -gencode arch=compute_20,code=sm_20 -o cudaOpenMP.o -c cudaOpenMP.cu
  • nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -q64 -o cudaOpenMP cudaOpenMP.o -lxlsmp
CUDA Fortran
§ What's CUDA Fortran
  • Created by PGI and NVIDIA in 2009-2010
  • Functionally equivalent to CUDA C
  • Provides seamless integration of CUDA into Fortran declarations and statements
  • The CUDA runtime API is also available from CUDA Fortran
  • Fortran modules provide bind(c) interfaces for CUDA C libraries
§ XL Fortran V15.1.4 support for CUDA Fortran
  • Supports a commonly used subset of CUDA Fortran features
  • Programs benefit from industry-leading POWER CPU optimization while exploiting GPU performance
§ Invocation example
  • xlcuf -O3 demo.cuf -o demo
  • ./demo
OpenMP Support for C/C++ and Fortran
§ What’s OpenMP• A de-facto industry standard for parallel programming over 15 years• Supports Fortran and C/C++ programming languages, both shared-memory
and accelerator programming models§ OpenMP accelerator support (new in OpenMP 4.0)
• The host device offloads target regions to the target devices• The target devices can be GPU, DSP, coprocessor etc.• Insert directives to offload code blocks to a target device
§ XL C/C++ and XL Fortran support of OpenMP 4.0 & 4.5• Commonly used subset: The beta in June and GA at the end of 2016• Incremental GA deliveries to complete the support in the future releases
§ Invocation examples• For CPU only
– xlC_r -O3 -qsmp=omp saxpy.cpp ; ./a.out– xlf_r -O3 -qsmp=omp saxpy.f ; ./a.out
• For GPU offloading– xlC_r -O3 -qsmp=omp –qoffload saxpy.cpp ; ./a.out– xlf_r -O3 -qsmp=omp –qoffload saxpy.f ; ./a.out
@IBM_compilers10 IBM C/C++ and Fortran Compilers and Libraries
Compiler Options Quick Reference Guide: POWER8, OpenPOWER Linux (LE)

| Category | XL (xlc, xlC, xlf) | GNU (gcc, g++, gfortran) | Clang |
|---|---|---|---|
| Documentation | http://ibm.biz/xlcpp-linux http://ibm.biz/xlfortran-linux | http://gcc.gnu.org | http://clang.llvm.org |
| Architecture: generate instructions that run on POWER8 | -mcpu=power8 or -qarch=pwr8 (default) | -mcpu=power8 | -target powerpcle-unknown-linux-gnu -mcpu=pwr8 |
| Disable all optimizations | -O0 -qnoopt (default) | -O0 (default) | -O0 (default) |
| Optimization levels | -O, -O2, -O3, -O4, -O5 | -O or -O1, -O2, -O3, -Ofast | -O0, -O2, -O3, -Os |
| Recommended optimization (a good balance between runtime performance and compilation time) | Commercial code: -O3 or -O3 -qipa; technical computing/analytics: -O3 or -O3 -qhot | Commercial code: -O3 -mcpu=power8; technical computing/analytics: -O3 -mcpu=power8 -funroll-loops | -O2 |
| Feedback-directed optimization | -qpdf1 / -qpdf2 | -fprofile-generate / -fprofile-use | -fprofile-instr-generate / -fprofile-instr-use |
| Interprocedural optimizations | -qipa | -flto | -flto |
| OpenMP | -qsmp=omp | -fopenmp | -fopenmp |
| Loop optimizations | -qhot | -fpeel-loops, -funroll-loops | -funroll-loops |
| More info | http://ibm.biz/xlcpp-linux-ce http://ibm.biz/xl-info | http://ibm.biz/linuxonpower http://ibm.biz/sdk-linuxonpower | |
Identify Application Hot Spots and Performance Bottlenecks
Hot Spot and Bottleneck Detection
§ Identify hot spots and detect bottlenecks
  • Gather profile information: timing, call frequency, block frequency, frequently used values
  • Performance tools: gprof, oprofile, perf, PIF, IBM SDK, compiler instrumentation
§ Identify whether a workload is computation intensive, memory latency or bandwidth intensive, or IO intensive by gathering performance counter information about
  • CPI breakdown
  • FPU/FXU
  • Cache misses
  • Branch mispredictions
  • LSU, etc.
Profiling with gprof
§ gprof is a performance analysis tool using a hybrid of instrumentation and sampling
§ Step 1: Compile the application with the XL compiler option -pg
  • Instrumentation code is inserted into the program during compilation to gather caller/callee data at run time
  • xlc -O3 -pg -o app app.c
§ Step 2: Run the application
  • Sampling data is saved in the 'gmon.out' or 'progname.gmon' file just before the program exits
§ Step 3: Run the gprof tool
  • gprof app gmon.out > analysis.txt
§ Step 4: Analyze the profiling information
  • For each function: who called it, whom it called, and how many times
  • How many times each function was called and the total time involved, sorted by time consumed
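A minimal C program to exercise the workflow above (function and file names are illustrative, not from the slides). Built with -pg, the flat profile attributes nearly all samples to harmonic, and the call graph would show run as its only caller:

```c
/* Build and profile (hypothetical file name app.c):
 *   xlc -O3 -pg -o app app.c      (or: gcc -O2 -pg -o app app.c)
 *   ./app
 *   gprof app gmon.out > analysis.txt
 */

/* hot function: dominates the flat profile */
double harmonic(int n) {
    double s = 0.0;
    for (int i = 1; i <= n; i++)
        s += 1.0 / i;
    return s;
}

/* driver: calls harmonic repeatedly so gprof has samples to collect */
double run(void) {
    double total = 0.0;
    for (int k = 0; k < 1000; k++)
        total += harmonic(100000);
    return total;
}
```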
Profiling with operf/oprofile
§ OProfile is a system-wide statistical profiling tool; operf is the profiler tool provided with OProfile
§ Step 1: Select performance events
  • ophelp lists available events
§ Step 2: Gather the profile information
  • operf -e event1[,event2[,...]], where each event is specified as event_name:sampling_rate[:unitmask[:kernel[:user]]]
§ Step 3: Analyze the profiling information
  • opreport -l
Profiling with perf
§ perf is a performance tool that automatically groups events and cycles through them every N μsecs
§ Step 1: Select performance events
  • perf list lists available events
§ Step 2: Gather the profile information
  • perf <command>, where <command> = { lock, stat, sched, kmem, timechart, top, etc. }
  • perf record -e event_name|raw_PMU_event, or perf -e event1[,event2[,...]], where each event is specified as event_name:sampling_rate[:unitmask[:kernel[:user]]] | mem:addr[:[r][w][x]]
§ Step 3: Analyze the profiling information
  • perf report --source
Compiler Optimization: Basic and Advanced Optimization
Optimization Capabilities
§ Platform exploitation
  • -qarch: ISA exploitation
  • -qtune: skew performance tuning toward a specific processor, including -qtune=balanced
  • Large library of compiler builtins and performance annotations
§ Mature compiler optimization technology
  • Five distinct optimization packages
  • Debug support and assembly listings available at all optimization levels
  • Whole program optimization
  • Profile-directed optimization
§ Used at lower optimization levels (noopt, -O2)
  • Focus on fast compilation
§ More aggressive optimization with limited impact on compilation time (-O3 -qnohot)
  • Implies -qnostrict, which may affect program behavior (mainly precision of floating-point operations)
§ Optionally generate an assembly listing file (source.lst)
[Diagram: source file → C/C++ or Fortran front end → xl*code → object file and source.lst]
Basic Compilation
§ Focus on runtime performance, at the expense of compilation time
  • Aggressive loop transformations
  • More precise dataflow analysis
§ Triggered by several compiler flags: -O3, -qhot, -qsmp
§ Multiple levels of aggressiveness for loop transformations
  • -qhot=level=0 (default at -O3)
  • -qhot=level=1 (default at -qhot)
  • -qhot=level=2
§ Can be combined with -qstrict
[Diagram: source file → C or Fortran front end → ipa → xl*code → object file and source.lst]
Advanced Compilation
§ Collect a high-level program representation in preparation for link-time whole program optimization
§ Triggered by -qipa
  • Implied by -O4, -O5, -qpdf1/-qpdf2
  • Identical behavior at all -qipa levels
§ Can be used independently of -qhot
§ Output is a composite object file
  • Includes the regular object file and the intermediate representation
  • Allows linking the object file with or without link-time optimization
  • Skip generation of the regular object using -qipa=noobject
[Diagram: source file → C/C++ or Fortran front end → ipa → xl*code → extended object file (regular object file + source.lst)]
Whole-Program Optimization – Compile Phase
§ Intercept the system linker and re-optimize the whole program
  • -qipa=level=0 (default with -qpdf)
  • -qipa=level=1 (default with -qipa)
  • -qipa=level=2
§ Must use the compiler invocation to link the program, with -qipa
  • Do not use ld directly
§ Flexible handling of extended objects
  • Can be placed in archives
  • Accepts a combination of regular and extended object files
§ Whole program assembly listing
  • Default name a.lst
§ Under -qpdf1/-qpdf2 the compiler collects and uses runtime profile information about the program
[Diagram: extended object files and system library object files → ipa → xl*code → final object file → system linker → executable, plus a.lst and a profile data file]
Whole-Program Optimization – Link Phase
§ noopt, -O0
  – Quick local optimizations
  – Keep the semantics of the program (-qstrict)
§ -O2
  – Optimizations for the best combination of compile speed and runtime performance
  – Keep the semantics of the program (-qstrict)
§ -O3
  – Equivalent to -O3 -qhot=level=0 -qnostrict for XL C/C++
  – Equivalent to -O3 -qhot -qnostrict for XL Fortran
  – Focus on runtime performance at the expense of compilation time: loop transformations, dataflow analysis
  – May alter the semantics of the program (-qnostrict)
§ -O3 -qhot
  – Equivalent to -O3 -qhot=level=1 -qnostrict
  – Perform aggressive loop transformations and dataflow analysis at the expense of compilation time
§ -O4
  – Equivalent to -O3 -qhot=level=1 -qipa=level=1 -qnostrict
  – Aggressive optimization: whole program optimization, aggressive dataflow analysis and loop transformations
§ -O5
  – Equivalent to -O3 -qhot=level=1 -qipa=level=2 -qnostrict
  – More aggressive optimization: more aggressive whole program optimization, more precise dataflow analysis and loop transformations
Summary of Optimization Levels
Basic Optimization Techniques
§ Inlining
  • Replaces a call to a procedure with a copy of the procedure itself, both to eliminate the overhead of calling the function and to allow specialization of the function for the specific call site
§ Redundancy detection
  • Identifies computations that are redundant or partially redundant with previously computed values, so the value can be reused rather than recomputed
§ Platform exploitation
  • Uses a model of the target processor to determine the best mix of instructions to implement a given program sequence
§ Flow restructuring
  • Reorganizes the code to increase the density of hot code and to make conditional branches less frequently taken
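A small C sketch (an assumed example, not from the slides) of what inlining enables: once scale is inlined at the call site below, the constant factor becomes visible and the multiply can be specialized, with the call overhead gone.

```c
/* Hypothetical example: the 'inline' hint lets the compiler replace the
 * call with the body, then specialize it for the constant factor 2. */
static inline int scale(int x, int factor) { return x * factor; }

int doubled_sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += scale(a[i], 2);   /* after inlining: s += a[i] * 2 */
    return s;
}
```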
§ Analyze and transform loops to improve runtime performance
  • Analyze memory access patterns to improve cache utilization
  • Tailor the instruction schedule for the specific loop and target processor
  • Interleave execution of multiple loop iterations
§ Most effective on numerical applications, e.g. analytics and technical computing
  • Depends on loops with regular behavior that can be analyzed and restructured by the optimizer
§ Enabled at -O3 and above; aggressive loop optimization with -O3 -qhot
Loop Optimization
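An illustrative C sketch (assumed, not from the slides) of the memory-access analysis above: interchanging the two loops turns a stride-N traversal of a row-major C array into a stride-one traversal, which is what the cache-utilization analysis aims for. Both versions compute the same sum.

```c
#define N 64

/* inner loop strides by N doubles through memory: poor cache utilization */
double sum_colmajor(double m[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

/* after loop interchange the inner loop is stride-one */
double sum_rowmajor(double m[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}
```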
§ Supports the INTEGER, UNSIGNED, REAL and COMPLEX data types
§ Explicit SIMD programming with -qaltivec(=BE|LE)
§ Automatic SIMDization at -O3 -qhot
  • Basic block level SIMDization
  • Loop level aggregation
  • Data conversion
  • Reduction
  • Loops with limited control flow
  • Math SIMDization
  • Partial loop vectorization
  • Alignment handling
SIMDization/Vectorization
[Diagram: SIMDization examples: loop-level aggregation (for (i=0; i<n; i++) a[i] = ...), math vectorization (A = sqrt(B); C = sqrt(D)), data size conversion between INTEGER and FLOAT, alignment handling across 16-byte boundaries via vload/vpermute, partial loop vectorization (a[i] = c[i]*d[i]; b[i] = b[i-1]), and isomorphic vs. non-isomorphic basic-block SIMDization (a[i+k] = b[i+k]*c[i+k] ± d[i+0]), for GENERIC, POWER, BG/Q and CELL targets]
MASS Libraries
§ MASS stands for Mathematical Acceleration SubSystem
§ MASS libraries contain mathematical routines tuned for optimal performance on various POWER architectures
  • A general implementation tuned for POWER
  • Specific implementations tuned for specific POWER processors
§ 16x average speedup for POWER8 LE vector MASS vs. libm
§ Users can add explicit calls to the libraries
§ XL compilers can automatically insert calls to MASS/MASSV routines at higher optimization levels
MASS Contains Over 140 Functions
§ Over 140 functions in all
  • Single/double precision
  • Scalar/SIMD/vector functions
§ Trigonometric functions and inverses
  • cos, sin, cosisin, sincos, tan, acos, asin, atan, atan2
§ Hyperbolic functions and inverses
  • cosh, sinh, tanh, acosh, asinh, atanh
§ Exponential functions
  • exp, exp2, expm1, exp2m1
§ Logarithm functions
  • log, log2, log10, log1p, log21p
§ Roots and reciprocal roots
  • sqrt, cbrt, qdrt, rsqrt, rcbrt, rqdrt
§ Reciprocal and divide
  • rec, div
§ Power
  • pow
§ Rounding, sign copy
  • aint, dint, anint, dnint, rint, copysign
§ Special functions
  • hypot, erf, erfc, lgamma, popcnt4, popcnt8
MASS Scalar Library
§ Analogous to the libm math library
§ Produces one math function result per call (exception: sincos)
§ Easy to use in existing code since the names match libm (just link MASS)
§ Calling from C:

#include <math.h> // prototypes for most scalar MASS functions
#include <mass.h> // prototypes for scalar MASS functions not in math.h

double dx, dy;
dy = exp (dx); // compute dy = exponential function of dx

float fx, fy;
fy = expf (fx); // compute fy = exponential function of fx

double dx, dy, dz;
dz = pow (dx, dy); // compute dz = dx to the power dy

double dx, dsin, dcos;
sincos (dx, &dsin, &dcos); // dsin=sin(dx), dcos=cos(dx)
MASS Scalar Library – Calling from Fortran
include 'mass.include' ! interfaces for non-intrinsic scalar MASS
real*8 dx, dy
dy = exp (dx) ! compute dy = exponential function of dx
real*4 fx, fy
fy = exp (fx) ! compute fy = exponential function of fx
real*8 dx, dy, dz
dz = dx**dy ! compute dz = dx to the power dy
real*8 dx, dsin, dcos
sincos (dx, dsin, dcos) ! dsin=sin(dx), dcos=cos(dx)
MASS Vector Library
§ Computes the same math function for each of multiple inputs
§ Highest performance, provided the vector length is sufficient
  • Vector length of at least 2 to 10, depending on the function
§ Calling from C:

#include <massv.h> // prototypes for vector MASS functions
#define N 1000
int n=N;
double vdx[N], vdy[N];
vexp (vdy, vdx, &n); // vdy[i] = exp (vdx[i]), i=0,...,n-1
float vfx[N], vfy[N];
vsexp (vfy, vfx, &n); // vfy[i] = exp (vfx[i]), i=0,...,n-1
double vdx[N], vdy[N], vdz[N];
vpow (vdz, vdx, vdy, &n); // vdz[i] = pow (vdx[i], vdy[i]), i=0,...,n-1
MASS Vector Library – Calling from Fortran
include 'massv.include' ! interfaces for vector MASS functions
integer, parameter :: n=1000
real*8 vdx(n), vdy(n)
call vexp (vdy, vdx, n) ! vdy(i) = exp (vdx(i)), i=1,...,n
real*4 vfx(n), vfy(n)
call vsexp (vfy, vfx, n) ! vfy(i) = exp (vfx(i)), i=1,...,n
real*8 vdx(n), vdy(n), vdz(n)
call vpow (vdz, vdx, vdy, n) ! vdz(i) = vdx(i)**vdy(i), i=1,...,n
MASS SIMD Library
§ Computes the same math function for each element of a SIMD vector
  • Convenient when writing code with vector data types and built-in functions, e.g. vector double, vector float, vec_add(), etc.
  • Vector MASS is recommended for best performance if the vector length is non-trivial
§ Calling from C:
#include <mass_simd.h> // prototypes for vector SIMD functions
vector double vdx, vdy;
vdy = expd2 (vdx); // vdy[i] = exp (vdx[i]), i=0,1
vector float vfx, vfy;
vfy = expf4 (vfx); // vfy[i] = exp (vfx[i]), i=0,1,2,3
vector double vdx, vdy, vdz;
vdz = powd2 (vdx, vdy); // vdz[i] = pow (vdx[i], vdy[i]), i=0,1
MASS SIMD Library -- Calling from Fortran
include 'mass_simd.include' ! interfaces for SIMD MASS functions
vector(real*8) vdx, vdy
vdy = expd2 (vdx) ! vdy(i) = exp (vdx(i)), i=1,2
vector(real*4) vfx, vfy
vfy = expf4 (vfx) ! vfy(i) = exp (vfx(i)), i=1,2,3,4
vector(real*8) vdx, vdy, vdz
vdz = powd2 (vdx, vdy) ! vdz(i) = vdx(i)**vdy(i), i=1,2
Linking MASS (for manual use)
§ Scalar MASS
  • -l mass (common for all supported POWER processors)
§ Vector MASS
  • -l massv (generic, for all supported POWER processors)
  • -l massvp8 for POWER8 (available for LE or BE)
§ SIMD MASS
  • -l mass_simdp8 for POWER8 (available for LE or BE)
§ Example of linking all MASS libraries when compiling for POWER8
  • xlc main.c -qarch=pwr8 -qaltivec -q64 -l mass -l massvp8 -l mass_simdp8
§ The XL C/C++ and XL Fortran compilers are able to
  • Recognize opportunities in source code to use MASS
  • Auto-vectorize: generate calls to MASS vector functions
  • Auto-inline: inline MASS scalar functions
  • Auto-scalarize: generate calls to MASS scalar functions
§ Compiler optimization levels of -O3 -qhot or above enable automatic MASS exploitation
§ The transformation report shows automatic MASS usage; for example, the loop

  for (i=0; i<n; i++) {
    b[i] = sqrt(a[i]);
  }

  becomes a call to __vsqrt_P8(b, a, n), and the report states "Loop vectorization was performed."
MASS Library Exploitation with XL Compilers
§ User-driven parallelism
  • All optimization levels interoperate with the POSIX Threads implementation
  • Full OpenMP 3.1 implementation provides a simple mechanism for writing parallel applications
    – Based on pragmas/annotations on top of sequential code
    – Industry specification developed by the OpenMP consortium (www.openmp.org)
§ Compiler-driven parallelism
  • Mechanism for the compiler to automatically identify and exploit data parallelism
  • Identify parallelizable loops performing independent operations on arrays or vectors
    – Best results on loop-intensive, compute-intensive workloads
    – Aided by program annotations; fully interoperable with OpenMP
Parallelization
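A minimal OpenMP C example in the spirit of the saxpy invocations earlier in the deck (the function body is an assumed sketch). Compiled with xlC_r -qsmp=omp or gcc -fopenmp the loop iterations are divided among threads; without OpenMP support the pragma is ignored and the loop runs serially with the same result.

```c
/* saxpy: y = a*x + y, with iterations distributed across threads */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```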
§ Optimize the whole program at module scope
  • Intercept the linker and re-optimize the program at module scope
§ Three levels of aggressiveness (-qipa=level=0/1/2)
  • Balance between aggressive optimization and longer optimization time
§ Enables additional program optimizations
  • Cross-file inlining (including cross-language)
  • Global code placement based on call affinity
  • Global data reorganization
§ Reduction in TOC pressure through data coalescing
Inter-Procedural Analysis (IPA)
§ Collect program statistics on a training run to use in a subsequent optimization phase
  • Minor impact on execution time of the instrumented program (10% - 50%)
  • Static program information: call frequencies, basic block execution counts
  • Value profiling: collect histograms of values for expressions of interest
  • Hardware counter information (optional)
§ Supports multiple training runs and parallel instances of the program
  • Profiling information from multiple training runs is aggregated into a single file
  • Locking is used to avoid clobbering the profiling data on file
§ Integrated with the IPA process (implies -qipa=level=0)
  • PDF synchronization point at the beginning of the link-time optimization phase
  • No need to recompile source files for PDF2; only relink with the -qpdf2 option
§ Tolerates program changes between instrumentation and optimization
  • The compiler skips profile-based optimization for any modified functions
  • Shows an estimate of the relevance of the profiling data
Profile-Directed Optimization (PDF)
Performance Tuning Tips with XL Compilers and Libraries
Frequently used XL Compiler Options
§ Typically start from -O2 or -O3
§ Add high-order optimization -qhot for floating-point computation intensive and memory intensive workloads, e.g. those that spend a lot of time in loops
§ Add whole program optimization -qipa[=level=0|1|2] for workloads with many small C/C++ function calls
§ Add profile-directed feedback optimization -qpdf1/-qpdf2 for workloads with lots of branching and function calls. Usage:
  1. Instrumentation: export PDFDIR=your_work_dir; compile the program with -qpdf1=exename to generate an instrumented executable
  2. Profile: use typical input data to run the executable and generate the profile data in PDFDIR
  3. Recompile: re-compile the program with -qpdf2=exename to generate an optimized executable
§ Add -qsmp=omp for OpenMP workloads (-qoffload for GPU exploitation)
§ Add the cuda option for CUDA workloads
Performance Tuning using Compiler Transformation Reports
§ Generate compilation reports consumable by other tools
  • Enable better visualization and analysis of compiler information
  • Help users do manual performance tuning
  • Help automatic performance tuning through performance tool integration
§ Unified report from all compiler subcomponents and analyses
  • Compiler options
  • Pseudo-sources
  • Compiler transformations, including missed opportunities
§ Consistent support among Fortran and C/C++
§ Controlled under the -qlistfmt option:
  -qlistfmt=[xml|html]=inlines    generates inlining information
  -qlistfmt=[xml|html]=transform  generates loop transformation information
  -qlistfmt=[xml|html]=data       generates data reorganization information
  -qlistfmt=[xml|html]=pdf        generates dynamic profiling information
  -qlistfmt=[xml|html]=all        turns on all optimization content
  -qlistfmt=[xml|html]=none       turns off all optimization content
Performance Tuning with Compiler Reports (-qlistfmt=xml=all)

Original source file (file.c):

  foo (float *p, float *q, float *r, int n) {
    for (int i=0; i< n; i++) {
      p[i] = p[i] + q[i]*r[i];
    }
  }

file.xml: "Loop was not SIMD vectorized because a data dependence prevents SIMD vectorization"

Modified source file (file.c), after tuning:

  foo (float * restrict p, float * restrict q, float * restrict r, int n) {
    for (int i=0; i< n; i++) {
      p[i] = p[i] + q[i]*r[i];
    }
  }

file.xml: "Loop was SIMD vectorized"
SIMDization Tuning

Transformation report: "memory accesses have non-vectorizable alignment"
User actions:
§ Use __attribute__((aligned(n))) to set data alignment
§ Use __alignx(16, a) to indicate the data alignment to the compiler
§ Use array references instead of pointers where possible

Transformation report: "data dependence prevents SIMD vectorization"
User actions:
§ Use fewer pointers when possible
§ Use #pragma independent_loop if the loop has no loop-carried dependency
§ Use the restrict keyword

Transformation report: "it is not profitable to vectorize"
User actions:
§ Use #pragma simd_level(10) to force the compiler to do SIMDization

Transformation report: "Loop was SIMD vectorized" (success; no action needed)
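A sketch of the "data dependence" fix above, using the same loop as the compiler-report example (assumptions: C99 restrict, GCC/XL-style aligned attribute). restrict promises the compiler that the three arrays do not overlap, so the loop can be SIMDized; the aligned attribute addresses the alignment message.

```c
/* restrict-qualified pointers: no overlap, so SIMDization is legal */
void fma_loop(float * restrict p, const float * restrict q,
              const float * restrict r, int n) {
    for (int i = 0; i < n; i++)
        p[i] = p[i] + q[i] * r[i];
}

/* callers can align their buffers for vector loads, e.g.:
 *   float buf[256] __attribute__((aligned(16)));
 */
```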
Transformation report: "memory accesses have non-vectorizable strides"
User actions:
§ Loop interchange for stride-one accesses, when possible
§ Data layout reshape for stride-one accesses
§ Higher optimization levels to propagate compile-time known stride information
§ Stride versioning

Transformation report: "either operation or data type is not suitable for SIMD vectorization"
User actions:
§ Do statement splitting and loop splitting

Transformation report: "loop structure prevents SIMD vectorization"
User actions:
§ Convert while-loops into do-loops when possible
§ Limit the use of control flow in a loop
§ Use MIN, MAX instead of if-then-else
§ Eliminate function calls in a loop through inlining

SIMDization Tuning (continued)
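An illustrative C sketch (not from the slides) of the "use MIN, MAX instead of if-then-else" advice: the second loop replaces the branch with a MAX select that compilers can lower to a branch-free vector operation. Both versions clamp negative values to zero.

```c
/* branch-free select; compilers typically lower this to a max/cmov, not a branch */
#define MAXF(x, y) ((x) > (y) ? (x) : (y))

/* control flow in the loop body can block SIMDization */
void clamp_branchy(float *a, int n) {
    for (int i = 0; i < n; i++)
        if (a[i] < 0.0f)
            a[i] = 0.0f;
}

/* equivalent loop using a MAX operation instead of if-then-else */
void clamp_branchfree(float *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] = MAXF(a[i], 0.0f);
}
```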
Compiler Friendly Code
§ The compiler must be conservative when determining potential side effects
  • Procedure calls may access or modify any visible variables
  • Accesses through pointers may modify any visible variables
§ Pessimistic side effect analysis prevents compiler optimizations
  • Must re-compute expressions with operands that may have been modified
  • Must compute values that might otherwise be unneeded
§ Help the compiler identify side effects to improve application performance
  • Use suitable optimization levels
  • Include appropriate header files for any system routines in use
  • Use local variables to maintain values of global variables across function calls or pointer dereferences
  • Avoid using global variables when local variables are suitable
  • Avoid reusing local variables for unrelated purposes
  • Follow ANSI C/C++ pointer aliasing rules
    – An object of a certain data type can only be accessed through a pointer of the same (or compatible) data type
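A C sketch (hypothetical names) of the "use local variables to maintain values of global variables across function calls" advice: in total_global the compiler must reload and store counter on every iteration, because the opaque call might touch it; total_local keeps a register-resident copy and writes it back once.

```c
int counter;                        /* visible global */

void log_item(int v) { (void)v; }   /* opaque call: could modify 'counter' */

/* global updated inside the loop: reloaded/stored around every call */
int total_global(const int *a, int n) {
    for (int i = 0; i < n; i++) {
        counter += a[i];
        log_item(a[i]);
    }
    return counter;
}

/* local copy: the compiler can keep 'c' in a register across the calls */
int total_local(const int *a, int n) {
    int c = counter;
    for (int i = 0; i < n; i++) {
        c += a[i];
        log_item(a[i]);
    }
    counter = c;                    /* single write-back */
    return counter;
}
```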
§ Use the restrict keyword (XL C supports multiple levels and scopes of restricted pointers) or compiler directives/pragmas to help the compiler do dependence and alias analysis
§ Use "const" for globals, parameters and functions whenever possible
§ Group frequently used functions into the same file (compilation unit) to expose compiler optimization opportunities (e.g., intra-compilation-unit inlining, instruction cache utilization)
§ Limit exception handling
§ Avoid excessive hand-optimization such as unrolling, which may impede the compiler
§ Keep array index expressions as simple as possible for easy dependency analysis
§ Consider using the highly tuned MASS/MASSV and ESSL libraries
Compiler Friendly Code
§ Make use of the visibility attribute
  • Load time improvement
  • Better code with PLT overhead reduction
  • Code size reduction
  • Symbol collision avoidance
§ Inline tuning
  • Call overhead reduction
  • Load-hit-store avoidance
§ Whole program optimization by IPA
  • Cross-file inlining
  • Code partitioning
  • Data reorganization
  • TOC pressure reduction
Performance Tuning Tips
§ OpenMP environment variables
  • OMP_NUM_THREADS: control the number of threads
  • OMP_THREAD_LIMIT: control the maximum number of threads
  • OMP_PLACES: control thread affinity (THREADS | CORES | SOCKETS)
  • OMP_WAIT_POLICY: control the thread idle policy (ACTIVE | PASSIVE)
  • OMP_STACKSIZE: control the thread stack size
  • OMP_SCHEDULE: control the schedule type (DYNAMIC | GUIDED | STATIC)
  • OMP_PROC_BIND: control thread binding (TRUE | FALSE, MASTER | CLOSE | SPREAD)
  • OMP_DYNAMIC: control dynamic thread adjustment (TRUE | FALSE)
  • OMP_DISPLAY_ENV: control display of the environment variables (TRUE | FALSE)
§ OpenMP/OpenMPI affinity
  • OpenMP programs automatically detect whether they have been invoked by OpenMPI via OpenMPI-set environment variables
  • When OpenMPI has been detected, OpenMP restricts the default OMP_PLACES to the affinity that has been set for that process
    – e.g., with --bind-to core, each OpenMPI process is placed on a different core; each OpenMP program is restricted to the particular core for that process
    – This feature can be overridden by manually setting OMP_PLACES, i.e., it only applies to the default setting of OMP_PLACES
    – OMP_PROC_BIND will be set to TRUE
OpenMP Tuning
§ System configuration
  • Adjust the SMT level: ppc64_cpu --smt=<level>
  • Adjust hardware prefetch aggressiveness: ppc64_cpu --dscr=<value>
  • Adjust CPU/memory affinity: numactl <flags>
  • Set huge pages: sysctl -w vm.nr_hugepages=<number>
§ POWER8 exploitation
  • POWER8-specific ISA exploitation under -qarch=pwr8
  • Scheduling and instruction selection under -qtune=pwr8:SMTn (n=1, 2, 4, 8)
§ Automatic SIMDization at -O3 -qhot
  • Limit the use of control flow
  • Limit the use of pointers. Use the independent_loop directive to tell the compiler a loop has no loop-carried dependency; use either the restrict keyword or the disjoint pragma to tell the compiler the references do not share the same physical storage whenever possible
  • Limit the use of strided accesses. Expose stride-one accesses whenever possible
§ Data prefetch
  • Automatic data prefetch at -O3 -qhot or above
  • -qprefetch=dscr=N to control hardware prefetch aggressiveness
Architecture and System Specific Tuning Tips
Floating-point Computation Accuracy Control
§ Aggressive optimization may affect the results of the program• Precision of floating-point computation• Handling of special cases of IEEE FP standard (INF, NAN, etc)• Use of alternate math libraries
§ -qstrict guarantees results identical to noopt, at the expense of optimization
• Suboptions allow fine-grained control over this guarantee
• Examples:
– -qstrict=precision: strict FP precision
– -qstrict=exceptions: strict FP exceptions
– -qstrict=ieeefp: strict IEEE FP implementation
– -qstrict=nans: strict generation and computation of NaNs
– -qstrict=order: do not modify evaluation order
– -qstrict=vectorprecision: maintain precision over all loop iterations
§ Can be combined: -qstrict=precision:nonans
This presentation addresses:
• Frequently used XL compiler options
• How to identify program hot spots and detect performance bottlenecks with XL compilers and performance tools
• How to write compiler-friendly code for better performance
• How to do performance tuning with XL compilers and libraries
• How to do POWER8-specific optimization
Summary
Important XL Compiler Links

§ XL C/C++ home page
• http://ibm.biz/xlcpp-linux
§ C/C++/Fortran Community
• http://ibm.biz/xlcpp-linux-ce
§ XL Fortran home page
• http://ibm.biz/xlfortran-linux
Additional information
§ IBM SDK for Linux on Power
• http://ibm.biz/ibmsdklop
§ PMU events
• http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liaal/iplsdkcpievents.htm
§ Code optimization with the IBM XL compilers on Power architectures
• http://www-01.ibm.com/support/docview.wss?uid=swg27005174&aid=1
§ Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8
• http://www.redbooks.ibm.com/abstracts/sg248171.html
§ Implementing an IBM High-Performance Computing Solution on IBM POWER8
• http://www.redbooks.ibm.com/abstracts/sg248263.html?Open
§ NVIDIA CUDA on IBM POWER8: Technical overview, software installation, and application development
• http://www.redbooks.ibm.com/redpapers/pdfs/redp5169.pdf
This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area.
Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied.

All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions.

IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice.

IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies.

All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary.

IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration.
Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.
IBM, the IBM logo, ibm.com, AIX, AIX (logo), IBM Watson, DB2 Universal Database, POWER, PowerLinux, PowerVM, PowerVM (logo), PowerHA, Power Architecture, Power Family, POWER Hypervisor, Power Systems, Power Systems (logo), POWER2, POWER3, POWER4, POWER4+, POWER5, POWER5+, POWER6, POWER6+, POWER7, POWER7+, and POWER8 are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries.
A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.
NVIDIA, the NVIDIA logo, and NVLink are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries or both.

PowerLinux™ uses the registered trademark Linux® pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the Linux® mark on a world-wide basis.

The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.

The OpenPOWER word mark and the OpenPOWER Logo mark, and related marks, are trademarks and service marks licensed by OpenPOWER.
Other company, product and service names may be trademarks or service marks of others.
Notices and Disclaimers