
IBM Haifa Labs © 2005 IBM Corporation

IBM Haifa Tools Update and Directions

http://www.haifa.il.ibm.com/dept/svt/code_paot.html

Gad Haber([email protected])


IBM Haifa Performance Tools

FDPR-Pro
  Feedback-based optimizer operating on binary executable files
  Part of AIX 5L
  Available on Linux on Power via alphaWorks
  Under development for z/OS
  To be available in SDK 2.0 for the Cell platform

CodeAnalyzer
  Eclipse plugin for analyzing executable files and shared libraries
  Part of the Visual Performance Analyzer (VPA)
  To be available in the Cell SDK 2.0

ESTO
  Utility for identifying the optimal set of optimization options
  Embedded into FDPR-Pro
  Under development for tuning compilers' options

BProber
  Utility for instrumenting binary executable files
  Under development

PDT - Performance Debugging Tool for the Cell
  Operates on trace files from the Cell SPEs


FDPR-Pro

Feedback Directed Program Restructuring



FDPR-Pro - Feedback Directed Program Restructuring

Using a global view of the entire program
Operating on the executable file after linkage

These properties enable FDPR-Pro to do:
  Global code reordering
  Inter-procedure boundary optimizations
  Static data rearrangement
  Constant area rearrangement
  Data prefetching

Examples of additional FDPR-Pro optimizations:
  Usage of branch tables
  Usage of TOC load instructions
  More...



Method

Phase 1: Code instrumentation
  Basic block level

Phase 2: Profile information gathering
  Selection of the "right" input set (representative workload)
  Accumulation over several input sets

Phase 3: Global code and data optimizations
  Complements the compiler
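The three phases can be illustrated with a toy model. The sketch below is not FDPR-Pro's implementation: block names, traces, and the hot/cold split are all hypothetical, and a real tool works on machine code rather than Python lists. It only shows the flow of counting basic-block executions, accumulating the counts over several input sets, and using the result to move never-executed blocks out of the hot path.

```python
from collections import Counter

def run_instrumented(trace):
    """Phase 1 stand-in: count how often each basic block executes in one run."""
    return Counter(trace)

def accumulate(profiles):
    """Phase 2: sum the per-block counts over several input sets."""
    total = Counter()
    for profile in profiles:
        total.update(profile)
    return total

def reorder_blocks(layout, profile):
    """Phase 3 stand-in: keep executed blocks together, move cold blocks to the end."""
    hot = [block for block in layout if profile[block] > 0]
    cold = [block for block in layout if profile[block] == 0]
    return hot + cold

# Hypothetical original layout and two hypothetical training runs.
layout = ["entry", "error_path", "loop_body", "exit"]
traces = [
    ["entry", "loop_body", "loop_body", "loop_body", "exit"],
    ["entry", "loop_body", "loop_body", "exit"],
]

profile = accumulate(run_instrumented(trace) for trace in traces)
new_layout = reorder_blocks(layout, profile)
print(new_layout)
```

In this toy run the never-executed `error_path` block ends up after the executed blocks, which is the basic idea behind profile-driven code reordering.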



Partial list of FDPR-Pro Optimizations

-RC                      Reorder code
-bf                      Branch folding
-bp                      Branch prediction bit setting
-align                   Code alignment
-uce                     Unreachable code elimination
-i_resched               Instruction rescheduling
-RD, -build_dcg          Static data reordering
-tocload, -reduce_toc    TOC load optimizations
-si, -ipht, -ihf, -isf   Function inlining options
-ptrgl_optimization      Optimize function calls via pointers
-dp                      Data prefetching
-link_reg_optimization   Eliminate stores/restores of the link register
-volatile_regs           Eliminate stores/restores using available volatile registers
-killed_regs             Eliminate stores/restores of killed registers
-load_after_store        Separate frequent loads and stores to the same address
-loop_unroll             Loop unrolling
-stack_opt               Reduce stack frame size of hot functions
-dce                     Dead code elimination
-cp                      Constant propagation
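As an illustration of one entry in the list, unreachable code elimination can be sketched as a reachability walk over a control flow graph. This is a generic textbook formulation, not the code behind the -uce option; the graph and block names are hypothetical.

```python
def reachable_blocks(cfg, entry):
    """Return the set of blocks reachable from the entry block."""
    seen = set()
    worklist = [entry]
    while worklist:
        block = worklist.pop()
        if block in seen:
            continue
        seen.add(block)
        worklist.extend(cfg.get(block, []))
    return seen

def eliminate_unreachable(cfg, entry):
    """Drop every block that no path from the entry can reach."""
    live = reachable_blocks(cfg, entry)
    return {block: succs for block, succs in cfg.items() if block in live}

# Hypothetical control flow graph: block name -> list of successor blocks.
cfg = {
    "main": ["check", "done"],
    "check": ["done"],
    "done": [],
    "old_helper": ["done"],  # nothing branches here, so it is unreachable
}

pruned = eliminate_unreachable(cfg, "main")
print(sorted(pruned))
```

Working on the linked executable, as the slides describe, is what makes a whole-program walk of this kind possible; a compiler that sees one module at a time cannot know that a function is never called.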



FDPR-Pro Directions

New heavy analyses to enable more optimizations (under development)
  Value propagation
  Constant evaluation
  Stack aliasing

FDPR-Pro for multi-core
  FDPR-Pro for the Cell processor, to be available in SDK 2.0
    Special options for profile gathering on the Cell
    New optimizations for SPE code
    Auto-parallelization optimizations

FDPR-Pro for embedded PowerPC is available
  Special features added to FDPR-Pro
    Accepting a sampled profile and complementing it
    Optimizations taking into account pipeline stalls of embedded PowerPC
  New optimizations for space reduction are added


Code Analyzer



Why Code Analyzer?

Architectures are becoming more complex
  Now upcoming multi-core platforms

Using only hardware simulators to detect information about potential performance bottlenecks in a given program is hard

There is a need for performance tools that can statically analyze and visualize programs for a platform design, to be used by:
  Hardware architects
  Compiler writers
  Application developers



What is Code Analyzer?

Code Analyzer is an Eclipse plugin which performs comprehensive static analysis on given executable files and DLLs
  Relies on FDPR-Pro as the engine for the analysis phase

Code Analyzer displays the analyzed information together with profiling data collected by:
  tprof/OProfile (in VPA XML format - ETM files)
  FDPR-Pro (in binary or XML format)

The code is then colored according to:
  Frequency counters - gathered by FDPR-Pro
  Hardware event ticks - gathered by tprof/OProfile
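The coloring step can be pictured as a mapping from a counter value to a heat bucket. The sketch below is only an assumption about how such a mapping might look: the thresholds, the color names, and the sample addresses are hypothetical and are not taken from Code Analyzer.

```python
def heat_color(count, max_count):
    """Map a frequency counter to a display color (hypothetical buckets)."""
    if max_count == 0 or count == 0:
        return "blue"    # never executed
    ratio = count / max_count
    if ratio >= 0.5:
        return "red"     # hottest code
    if ratio >= 0.1:
        return "orange"
    return "yellow"      # executed, but rarely

# Hypothetical per-instruction frequency counters keyed by address.
counters = {0x100: 1000, 0x104: 200, 0x108: 5, 0x10C: 0}
hottest = max(counters.values())
colors = {address: heat_color(count, hottest) for address, count in counters.items()}

for address, color in colors.items():
    print(f"{address:#x}: {color}")
```

The same function would work for hardware event ticks instead of frequency counters, since only the ratio to the maximum value is used.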



Code Analyzer Views

Provides several views of the input binary
  Assembly instructions
  Basic blocks
  Procedures
  CSECT modules
  Control flow graph
  Hot loops
  Call graph
  Annotated source code
  Dispatch group formation
  Pipeline slots and functional units



Grouping, Performance Comments and Pipeline Views



Code Analyzer opened up from Profile Analyzer



Code Analyzer (on the right) synchronized with Profile Analyzer (on the left)



Code Analyzer - Available Performance Comments

Comments which do not require profiling
  Pipeline stalls for the Power architecture
  Pipeline stalls for the z9 platform
  Unreachable code and unused data
  Misaligned targets

Profile-based comments
  Invariant instructions within hot loops
  Hot function calls proceeded by overwriting non-volatile registers
  Hot saves and restores of registers which could be relocated to cold spill areas
  Hot instructions that could be scheduled to colder areas in the code
  Removable hot branches
  Hot direct unconditional branches
  Hot direct conditional branches that are taken, which have a colder fall-through
  Hot call sites that are appropriate candidates for function inlining
  Hot TOC load instructions that can be replaced by immediate add instructions
  Hot branch-to-branch instructions
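One of the profile-based comments, a hot conditional branch that is mostly taken while its fall-through path is colder, can be detected with a simple comparison of edge counts. The sketch below assumes a hypothetical record layout and hotness threshold; it does not reflect Code Analyzer's internal data structures.

```python
from dataclasses import dataclass

@dataclass
class CondBranch:
    address: int            # address of the conditional branch instruction
    taken_count: int        # times the branch was taken
    fallthrough_count: int  # times execution fell through

def taken_with_colder_fallthrough(branches, hot_threshold):
    """Return addresses of hot branches whose taken edge is hotter than the fall-through."""
    return [
        branch.address
        for branch in branches
        if branch.taken_count >= hot_threshold
        and branch.fallthrough_count < branch.taken_count
    ]

# Hypothetical profile data.
branches = [
    CondBranch(0x1000, taken_count=900, fallthrough_count=100),  # candidate
    CondBranch(0x1040, taken_count=50, fallthrough_count=950),   # fall-through is hot
    CondBranch(0x1080, taken_count=3, fallthrough_count=1),      # too cold to matter
]

flagged = taken_with_colder_fallthrough(branches, hot_threshold=100)
print([hex(address) for address in flagged])
```

A branch flagged this way is a candidate for inverting the condition and swapping the two successors, so that the frequent path becomes the fall-through.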



Code Analyzer Directions

Enablement of more comments (under development)
  Using FDPR-Pro added analyses
    Value propagation
    Constant evaluation
    Stack aliasing

Code Analyzer for multi-core
  Code Analyzer for the Cell processor, to be available in SDK 2.0
    Special views for distribution of instructions' frequency on SPE code
    New stall comments relevant to the PPE and SPEs


ESTO Expert System for Tuning Optimizations



Why an automatic tool for tuning optimizations?

Optimization is controlled by a large number of options
The problem is finding the option set that maximizes performance
Parameterized (ranged) options complicate and multiply the possibilities
Each option performs a rather small change in the object program
Typical users do not know which options are best for their programs
The default (e.g. -O3) is adequate, but not best for a specific program
Optimizer (compiler) developers need to find the optimal option sets for the default combinations (e.g. -O3) and benchmarking (e.g. SPEC)



ESTO - Expert System for Tuning Optimizations

Purpose
  Enable a typical user to utilize the actual optimization potential
  Automate the search in the very complex option space
  Produce a 'close to optimal' program in a reasonable time

Method
  Trial-and-error search in the multidimensional options space
  In each step another option set is used to optimize the same program
  The program runtime is measured and compared to other results
  The algorithm converges to some 'close to optimal' option set

Features
  Flexible configuration for applications and running environments
  Possibility to extend the components, run parallel processes, etc.
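The trial-and-error method can be illustrated with a greedy search over on/off options. The slides do not say which search algorithm ESTO uses, so the hill climbing below is a swapped-in stand-in; the option names and the synthetic `measure_runtime` function are hypothetical and replace the real "optimize, run, and time the program" step.

```python
def measure_runtime(options):
    """Stand-in for optimizing and timing the program (synthetic numbers)."""
    runtime = 100.0
    if options["inline"]:
        runtime -= 8.0
    if options["unroll"]:
        runtime += 3.0   # harmful on its own in this toy model
    if options["prefetch"]:
        runtime -= 5.0
    if options["inline"] and options["unroll"]:
        runtime -= 6.0   # options interact, which is why search is needed
    return runtime

def tune(option_names, measure):
    """Flip one option at a time, keep any change that lowers the runtime."""
    best = {name: False for name in option_names}
    best_time = measure(best)
    improved = True
    while improved:
        improved = False
        for name in option_names:
            trial = dict(best)
            trial[name] = not trial[name]
            trial_time = measure(trial)
            if trial_time < best_time:
                best, best_time = trial, trial_time
                improved = True
    return best, best_time

best, best_time = tune(["inline", "unroll", "prefetch"], measure_runtime)
print(best, best_time)
```

A greedy search like this converges to a local optimum, which matches the "close to optimal" wording above; each call to the measurement function corresponds to one full optimize-and-run step, so the number of trials dominates the tuning time.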



ESTO today

Embedded into FDPR-Pro
  By using a command line option --tune
Reaches impressive speed-ups on some benchmarks
Provides a good average

[Figure: ESTO gain % over FDPR-Pro -O3 on Linux with SPEC2000 train workload, 64 bit. Bar chart, y-axis from 0.00 to 16.00; benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, ammp, applu, apsi, art, equake, mesa, mgrid, swim, wupwise, average]



ESTO directions

Enabling ESTO to tune compiler optimizations
  Under development
  Requires a configuration file with descriptions of all optimization flags

Initial adaptation for GCC
  Looked at GCC “binary” (on/off) options: ~60 affect performance
  Runtime speed-up on SPEC benchmarks relative to -O1

spec 64     runtime   gain over -O1
applu       10.25     35.71%
apsi        10.88     25.75%
art          4.92     30.38%
bzip2       30.20     75.26%
equake      17.61     21.55%
gap          7.48      3.53%
gcc          3.51      0.11%
mcf         13.41     25.41%
mesa        68.29     10.82%
mgrid       16.30     39.38%
perlbmk     72.42      4.39%
sixtrack    66.02     15.76%
swim         9.89     17.60%
twolf       12.22      6.71%
vpr         22.80     19.77%
average               22.14%
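The reported average is the plain arithmetic mean of the fifteen per-benchmark gains, which a few lines of Python confirm. The numbers are copied from the table; nothing else is assumed.

```python
# Gain over GCC -O1 in percent, per benchmark, as listed in the table.
gains = {
    "applu": 35.71, "apsi": 25.75, "art": 30.38, "bzip2": 75.26,
    "equake": 21.55, "gap": 3.53, "gcc": 0.11, "mcf": 25.41,
    "mesa": 10.82, "mgrid": 39.38, "perlbmk": 4.39, "sixtrack": 15.76,
    "swim": 17.60, "twolf": 6.71, "vpr": 19.77,
}

average = sum(gains.values()) / len(gains)
print(f"{average:.2f}%")  # prints 22.14%
```

Because it is an unweighted mean, the single large bzip2 gain contributes about five percentage points of the average on its own.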

[Figure: ESTO gain over GCC -O1 (train 64). Bar chart, y-axis from 0.00% to 80.00%; benchmarks: applu, apsi, art, bzip2, equake, gap, gcc, mcf, mesa, mgrid, perlbmk, sixtrack, swim, twolf, vpr, average]


BProber

Binary Prober



Why is binary probing technology needed?

Analysis
  Each application has its own characteristics
  Insert tailored instrumentation stubs

Simulation
  New architectures
  Insert code that simulates new functionality

Optimization
  Performing optimizations locally
  Function level down to instruction level
  Insert code to be executed instead of the existing code



BProber Today

Based on FDPR-Pro technology

Enables insertion of code at
  Specific address
  Specific function (entry and exit points)

The inserted code is defined as a function in a separate library
  Can be written in any language
  Control transfer to the code is done via an inserted call
  Parameters passed to the function
    Original address of instrumentation
    Save area of the registers prior to the call

Definition file of user code (libraries and functions) and insertion locations is used

Availability
  IBM internal use (alpha)
  Supports very large programs including 64-bit applications
  Both AIX and Linux on Power
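The calling convention described for inserted code, a user function that receives the original address and the register save area, can be mimicked with a toy interpreter. This is only a conceptual model: BProber rewrites real binaries, whereas everything below (the program representation, the register name `r3`, and the function names) is hypothetical.

```python
def run_with_probes(program, probes):
    """Run a toy program; before each probed address, call the user function."""
    registers = {"r3": 0}
    for address, operation in program:
        probe = probes.get(address)
        if probe is not None:
            save_area = dict(registers)  # snapshot of the registers prior to the call
            probe(address, save_area)    # parameters: original address, save area
        operation(registers)
    return registers

hits = []

def record_hit(address, save_area):
    """Hypothetical user function, standing in for code from a separate library."""
    hits.append((address, save_area["r3"]))

def add_one(registers):
    registers["r3"] += 1

def double(registers):
    registers["r3"] *= 2

# Toy program: (address, operation) pairs; probes requested at two addresses.
program = [(0x100, add_one), (0x104, double), (0x108, add_one)]
final = run_with_probes(program, {0x104: record_hit, 0x108: record_hit})
print(hits, final)
```

Passing a copy of the registers keeps the probe from disturbing program state, which mirrors the purpose of saving the registers before the inserted call.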


PDT

Performance Debugging Tool for the Cell



PDT - Performance Debugging Tool

PDT enables analysis and visualization of traces from the various SPEs and the interactions between them