IBM Haifa Labs © 2005 IBM Corporation
IBM Haifa Tools Update and Directions
http://www.haifa.il.ibm.com/dept/svt/code_paot.html
Gad Haber([email protected])
IBM Haifa Performance Tools
FDPR-Pro: feedback-based optimizer operating on binary executable files
  Part of AIX 5L
  Available on Linux on Power via alphaWorks
  Under development for z/OS
  To be available in SDK 2.0 for the Cell platform
CodeAnalyzer: Eclipse plugin for analyzing executable files and shared libraries
  Part of the Visual Performance Analyzer (VPA)
  To be available in the Cell SDK 2.0
ESTO: utility for identifying the optimal set of optimization options
  Embedded into FDPR-Pro
  Under development for tuning compilers' options
BProber: utility for instrumenting binary executable files
  Under development
PDT: Performance Debugging Tool for the Cell
  Operates on trace files from the Cell SPEs
FDPR-Pro
Feedback Directed Program Restructuring
FDPR-Pro - Feedback Directed Program Restructuring
Using a global view of the entire program
Operating on the executable file after linkage
These properties enable FDPR-Pro to perform:
  Global code reordering
  Optimizations across procedure boundaries
  Static data rearrangement
  Constant area rearrangement
  Data prefetching
Examples of additional FDPR-Pro optimizations:
  Usage of branch tables
  Usage of TOC load instructions
  More...
Method
Phase 1: Code instrumentation
  Basic block level
Phase 2: Profile information gathering
  Selection of the "right" input set (representative workload)
  Accumulation over several input sets
Phase 3: Global code and data optimizations
  Complements the compiler
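The accumulation step in Phase 2 can be pictured as summing per-basic-block execution counters over several profiling runs. The following is a minimal sketch under stated assumptions: the dictionary-based profile format, the block addresses, and the function name `accumulate_profiles` are all hypothetical and do not reflect FDPR-Pro's actual profile files.

```python
from collections import Counter

def accumulate_profiles(profiles):
    """Sum per-basic-block execution counts over several profiling runs."""
    total = Counter()
    for profile in profiles:
        total.update(profile)  # adds counts key by key
    return total

# Hypothetical counts from two training inputs, keyed by basic-block address.
run_a = {0x1000: 120, 0x1010: 3, 0x1024: 117}
run_b = {0x1000: 80, 0x1024: 80, 0x1040: 5}

merged = accumulate_profiles([run_a, run_b])
hottest = merged.most_common(1)[0]  # (address, accumulated count)
```

Blocks that are hot in only one workload still show up in the merged profile, which is one reason to accumulate over several representative inputs.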
Partial list of FDPR-Pro Optimizations
-RC                          Reorder code
-bf                          Branch folding
-bp                          Branch prediction bit setting
-align                       Code alignment
-uce                         Unreachable code elimination
-i_resched                   Instruction re-scheduling
-RD, -build_dcg              Static data reordering
-tocload, -reduce_toc        TOC load optimizations
-si, -ipht, -ihf, -isf       Function inlining options
-ptrgl_optimization          Optimize function calls via pointers
-dp                          Data prefetching
-link_reg_optimization       Eliminate stores/restores of the link register
-volatile_regs               Eliminate stores/restores using available volatile registers
-killed_regs                 Eliminate stores/restores of killed registers
-load_after_store            Separate a frequent load and store to the same address
-loop_unroll                 Loop unrolling
-stack_opt                   Reduce stack frame size of hot functions
-dce                         Dead code elimination
-cp                          Constant propagation
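To illustrate how basic-block profile data can drive code reordering, here is a simplified greedy layout: starting from the entry block, it keeps following the hottest not-yet-placed successor so that frequent paths become fall-through, and rarely executed blocks end up last. This is a generic sketch, not FDPR-Pro's actual algorithm; the block names, edge counts, and the function `layout_blocks` are hypothetical.

```python
def layout_blocks(entry, edges, block_counts):
    """Greedy layout: follow the hottest unplaced successor from each seed.

    edges maps (src, dst) to an execution count;
    block_counts maps a block name to its execution count.
    """
    placed, order = set(), []

    def hottest_successor(block):
        candidates = [(count, dst) for (src, dst), count in edges.items()
                      if src == block and dst not in placed and count > 0]
        return max(candidates)[1] if candidates else None

    # Seeds: the entry block first, then the rest from hottest to coldest.
    seeds = [entry] + sorted((b for b in block_counts if b != entry),
                             key=lambda b: -block_counts[b])
    for seed in seeds:
        block = seed
        while block is not None and block not in placed:
            placed.add(block)
            order.append(block)
            block = hottest_successor(block)
    return order

# Hypothetical profile of a small function with a rarely taken error path.
edges = {("entry", "check"): 100, ("check", "error"): 1,
         ("check", "work"): 99, ("work", "exit"): 99, ("error", "exit"): 1}
counts = {"entry": 100, "check": 100, "error": 1, "work": 99, "exit": 100}

order = layout_blocks("entry", edges, counts)
```

In this example the cold `error` block is moved behind the hot path, so the frequent path through `check`, `work`, and `exit` runs as straight-line code.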
FDPR-Pro Directions
New heavyweight analyses to enable more optimizations (under development)
  Value propagation
  Constant evaluation
  Stack aliasing
FDPR-Pro for multi-core
  FDPR-Pro for the Cell processor, to be available in SDK 2.0
    Special options for profile gathering on the Cell
    New optimizations for SPE code
  Auto-parallelization optimizations
FDPR-Pro for embedded PowerPC is available
  Special features added to FDPR-Pro
    Accepting a sampled profile and complementing it
    Optimizations that take into account pipeline stalls of embedded PowerPC
  New optimizations for space reduction are added
Code Analyzer
Why Code Analyzer?
Architectures are becoming more complex
  Now upcoming multi-core platforms
Using only hardware simulators to detect potential performance bottlenecks in a given program is hard
There is a need for performance tools that can statically analyze and visualize programs for a platform design, to be used by:
  Hardware architects
  Compiler writers
  Application developers
What is Code Analyzer?
Code Analyzer is an Eclipse plugin which performs comprehensive static analysis on given executable files and DLLs
  Relies on FDPR-Pro as the engine for the analysis phase
Code Analyzer displays the analyzed information together with profiling data collected by:
  tprof/OProfile (in VPA XML format, ETM files)
  FDPR-Pro (in binary or XML format)
The code is then colored according to:
  Frequency counters, gathered by FDPR-Pro
  Hardware event ticks, gathered by tprof/OProfile
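Coloring code by frequency can be sketched as mapping each block's counter to a color bucket relative to the hottest block. The thresholds, color names, and block names below are assumptions chosen for illustration; the slides do not specify how Code Analyzer actually picks colors.

```python
def heat_color(count, max_count):
    """Map an execution count to a color bucket relative to the hottest block."""
    if max_count <= 0 or count <= 0:
        return "uncolored"      # never executed
    share = count / max_count
    if share >= 0.5:
        return "red"            # hot
    if share >= 0.1:
        return "orange"         # warm
    return "yellow"             # executed, but cold

# Hypothetical frequency counters per basic block.
counters = {"init": 1, "loop": 1000, "check": 200, "error": 0}
hottest = max(counters.values())
colors = {block: heat_color(count, hottest) for block, count in counters.items()}
```

The same mapping would apply to hardware event ticks; only the source of the counters changes.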
Code Analyzer Views
Provides several views of the input binary
  Assembly instructions
  Basic blocks
  Procedures
  CSECT modules
  Control flow graph
  Hot loops
  Call graph
  Annotated source code
  Dispatch group formation
  Pipeline slots and functional units
Grouping, Performance Comments and Pipeline Views
Code Analyzer opened up from Profile Analyzer
Code Analyzer (on the right) synchronized with Profile Analyzer (on the left)
Code Analyzer - Available Performance Comments
Comments which do not require profiling
  Pipeline stalls for the Power architecture
  Pipeline stalls for the z9 platform
  Unreachable code and unused data
  Misaligned targets
Profile-based comments
  Invariant instructions within hot loops
  Hot function calls preceded by overwriting of non-volatile registers
  Hot saves and restores of registers which could be relocated to cold spill areas
  Hot instructions that could be scheduled to colder areas in the code
  Removable hot branches
    Hot direct unconditional branches
    Hot direct conditional branches that are taken and have a colder fall-through
  Hot call sites that are appropriate candidates for function inlining
  Hot TOC load instructions that can be replaced by immediate add instructions
  Hot branch-to-branch instructions
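As an illustration of a profile-based comment, the check for "hot direct conditional branches that are taken and have a colder fall-through" can be sketched as a filter over per-branch counters. The record layout `BranchProfile`, the threshold value, and the addresses are hypothetical; Code Analyzer's internal representation is not described in the slides.

```python
from dataclasses import dataclass

@dataclass
class BranchProfile:
    address: int
    taken: int        # times the branch was taken
    fallthrough: int  # times execution fell through instead

def hot_taken_branches(branches, hot_threshold):
    """Addresses of hot conditional branches whose taken path beats the fall-through."""
    return [b.address for b in branches
            if b.taken >= hot_threshold and b.taken > b.fallthrough]

# Hypothetical profile: only the first branch is both hot and mostly taken.
profile = [
    BranchProfile(0x100, taken=900, fallthrough=100),
    BranchProfile(0x140, taken=10, fallthrough=990),
    BranchProfile(0x180, taken=5, fallthrough=2),
]
flagged = hot_taken_branches(profile, hot_threshold=500)
```

A branch flagged this way is a candidate for inverting the condition and swapping the two successors, so that the hot path becomes the fall-through.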
Code Analyzer Directions
Enablement of more comments (under development)
  Using the analyses added to FDPR-Pro
    Value propagation
    Constant evaluation
    Stack aliasing
Code Analyzer for multi-core
  Code Analyzer for the Cell processor, to be available in SDK 2.0
    Special views for the distribution of instruction frequencies in SPE code
    New stall comments relevant to the PPE and SPEs
ESTO
Expert System for Tuning Optimizations
Why an automatic tool for tuning optimizations?
Optimization is controlled by a large number of options
The problem is finding the option set that maximizes performance
Parameterized (ranged) options complicate and multiply the possibilities
Each option performs a rather small change in the object program
Typical users do not know which options are best for their programs
The default (e.g. -O3) is adequate, but not best for a specific program
Optimizer (compiler) developers need to find the optimal option sets for the default combinations (e.g. -O3) and benchmarking (e.g. SPEC)
ESTO - Expert System for Tuning Optimizations
Purpose
  Enable a typical user to utilize the actual optimization potential
  Automate the search in the very complex option space
  Produce a ‘close to optimal’ program in a reasonable time
Method
  Trial-and-error search in the multidimensional option space
  In each step another option set is used to optimize the same program
  The program runtime is measured and compared to other results
  The algorithm converges to some ‘close to optimal’ option set
Features
  Flexible configuration for applications and running environments
  Possibility to extend the components, run parallel processes, etc.
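The trial-and-error method can be sketched as a greedy hill climb over on/off options: toggle one option at a time, measure, and keep the change only if the runtime improves. The slides do not describe ESTO's actual search algorithm, so this is only one plausible instance; the option names and the `simulated_runtime` cost model are hypothetical stand-ins for really optimizing and timing the program.

```python
def tune(options, measure):
    """Greedy hill climbing: toggle one option at a time, keep it if runtime drops."""
    current = frozenset()
    best_time = measure(current)
    improved = True
    while improved:                      # repeat passes until nothing helps
        improved = False
        for opt in options:
            candidate = current ^ {opt}  # toggle a single option
            runtime = measure(candidate)
            if runtime < best_time:
                current, best_time, improved = candidate, runtime, True
    return current, best_time

def simulated_runtime(enabled):
    """Hypothetical cost model; a real run would optimize and time the program."""
    t = 10.0
    if "opt_a" in enabled:
        t -= 1.0
    if "opt_b" in enabled:
        t += 0.5
    if "opt_c" in enabled:               # only pays off together with opt_a
        t += -0.5 if "opt_a" in enabled else 0.2
    return t

best_set, best_time = tune(["opt_c", "opt_a", "opt_b"], simulated_runtime)
```

The interaction between `opt_a` and `opt_c` shows why a single pass is not enough: `opt_c` is rejected on the first pass and accepted on the second, once `opt_a` is enabled. A greedy search like this can still stop at a local optimum, which matches the 'close to optimal' wording above.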
ESTO today
Embedded into FDPR-Pro
  By using the command line option --tune
Reaches impressive speed-ups on some benchmarks
Provides a good average
[Figure: bar chart, "ESTO gain % over FDPR-Pro -O3 on Linux with SPEC2000 train workload, 64 bit". Vertical axis: gain in percent, 0.00 to 16.00. Horizontal axis: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, ammp, applu, apsi, art, equake, mesa, mgrid, swim, wupwise, average.]
ESTO directions
Enabling ESTO to tune compiler optimizations
  Under development
  Requires a configuration file with descriptions of all optimization flags
Initial adaptation for GCC
  Looked at GCC “binary” (on/off) options: ~60 affect performance
  Runtime speed-up on SPEC benchmarks relative to -O1

  spec 64     runtime   gain over -O1
  applu        10.25       35.71%
  apsi         10.88       25.75%
  art           4.92       30.38%
  bzip2        30.20       75.26%
  equake       17.61       21.55%
  gap           7.48        3.53%
  gcc           3.51        0.11%
  mcf          13.41       25.41%
  mesa         68.29       10.82%
  mgrid        16.30       39.38%
  perlbmk      72.42        4.39%
  sixtrack     66.02       15.76%
  swim          9.89       17.60%
  twolf        12.22        6.71%
  vpr          22.80       19.77%
  average                  22.14%
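The average in the last row of the table is the arithmetic mean of the fifteen per-benchmark gains, which can be checked with a few lines. The gain values are copied from the table; the variable names are just for illustration.

```python
# Per-benchmark gain over GCC -O1, in percent, as listed in the table.
gains = {
    "applu": 35.71, "apsi": 25.75, "art": 30.38, "bzip2": 75.26,
    "equake": 21.55, "gap": 3.53, "gcc": 0.11, "mcf": 25.41,
    "mesa": 10.82, "mgrid": 39.38, "perlbmk": 4.39, "sixtrack": 15.76,
    "swim": 17.60, "twolf": 6.71, "vpr": 19.77,
}

average = sum(gains.values()) / len(gains)
print(f"{average:.2f}%")  # prints 22.14%
```

The mean is pulled up noticeably by bzip2; the median gain across the same fifteen benchmarks is lower.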
[Figure: bar chart, "ESTO gain over GCC -O1 (train 64)". Vertical axis: gain, 0.00% to 80.00%. Horizontal axis: applu, apsi, art, bzip2, equake, gap, gcc, mcf, mesa, mgrid, perlbmk, sixtrack, swim, twolf, vpr, average.]
BProber
Binary Prober
Why is binary probing technology needed?
Analysis
  Each application has its own characteristics
  Insert tailored instrumentation stubs
Simulation
  New architectures
  Insert code that simulates new functionality
Optimization
  Performing optimizations locally
  From function level down to instruction level
  Insert code to be executed instead of the existing code
BProber Today
Based on FDPR-Pro technology
Enables insertion of code at
  Specific address
  Specific function (entry and exit points)
The inserted code is defined as a function in a separate library
  Can be written in any language
  Control transfer to the code is done via an inserted call
  Parameters passed to the function
    Original address of instrumentation
    Save area of the registers prior to the call
A definition file of user code (libraries and functions) and insertion locations is used
Availability
  IBM internal use (alpha)
  Supports very large programs, including 64-bit applications
  Both AIX and Linux on Power
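BProber works on binaries, but the idea of transferring control to user code at a function's entry and exit points can be illustrated with a source-level analogy. The sketch below wraps a Python function so that probe callbacks run before and after it; the `probe` decorator, the callbacks, and the `checksum` function are hypothetical and are not part of BProber or its definition-file format.

```python
import functools

def probe(on_entry, on_exit):
    """Wrap a function so that probe callbacks run at its entry and exit."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            on_entry(func.__name__, args)    # probe at the entry point
            result = func(*args, **kwargs)
            on_exit(func.__name__, result)   # probe at the exit point
            return result
        return wrapper
    return decorate

events = []

@probe(lambda name, args: events.append(("entry", name, args)),
       lambda name, result: events.append(("exit", name, result)))
def checksum(data):
    return sum(data) % 256

value = checksum([200, 100])
```

The probed function behaves as before; the callbacks only observe it. In the binary setting the analogous information handed to the inserted function is the original instrumentation address and the saved registers, rather than Python arguments.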
PDT
Performance Debugging Tool for the Cell
PDT – Performance Debugging Tool
PDT enables analysis and visualization of traces from the various SPEs and the interactions between them