Date post: | 03-Jan-2016 |
Upload: | shelley-hardin |
Compiler Techniques for Single Processor Tuning
An introduction

Outline
The compiler is the primary tool of computer program optimization:
• structure of the compiler and the compilation process
• compiler optimizations
• steering of the compilation process - compiler options
• structure of the run time libraries and scientific libraries
• computational domain and computation accuracy

The Compiler
The compiler manages processor resources:
• registers
• integer/floating-point execution units
• load/store/prefetch for data flow in/out of the processor
The implementation details of the processor and system architecture are built into the compiler.
The compilation process lowers the code from one representation to the next:
• user program (C/C++/Fortran, etc.): high level representation
• intermediate representation
• machine instructions: low level representation
In doing so the compiler solves:
– data dependencies
– control flow dependencies
– parallelization
– compactification of code
– optimal scheduling of the code

MIPSpro Compiler Components
• There are no source-to-source optimizers or parallelizers
• source code is translated to WHIRL (Winning Hierarchical Intermediate Representation Language):
– the same IR is used for different levels of representation
– whirl2f and whirl2c translate back into Fortran or C from the IR
• the Inter-Procedural Analyser requires final translation at link time
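Compilation pipeline (diagram), from source to executable:
• driver (f77/f90, cc/CC)
• macro pre-processor
• front-end (source to WHIRL format)
• Inter-Procedural Analyser
• Loop nest optimizer
• Parallel optimizer
• Global optimizer
• Code generator (produces the object)
• Linker, together with the Inter-Procedural Analyser (produces the executable)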

Compiler Optimizations
• Global optimizer:
– dead code elimination
– copy propagation
– loop normalisation (stride one loops, single induction variable)
– memory alias analysis
– strength reduction
• Inter-Procedural Analyser:
– cross-file function inlining
– dead function elimination
– dead variable elimination
– padding of variables in common blocks
– inter-procedural constant propagation
• Automatic parallelizer:
– loop level work distribution
• Loop Nest optimizer:
– loop unrolling (outer)
– loop interchange
– loop fusion/fission
– loop blocking
– memory prefetch
– padding local variables
• Code Generator:
– software pipelining
– inner loop unrolling
– if-conversion
– read/write optimization
– recurrence breaking
– instruction scheduling inside basic blocks

SGI Architecture, ABI, Languages
• Instruction Set Architecture (ISA):
– -mips4 (R1x000, R8000, R5000 processors)
– -mips3 (R4400)
– -mips[1|2] (R3000, R4000 processors, invokes old ucode compiler)
• ABI (Application Binary Interface):
– -n32 (32 bit pointers, 4 byte integers, 4 byte real)
– -64 (64 bit pointers, 4 byte integers, 4 byte real)
• Languages:
– C
– C++
– Fortran 77
– Fortran 90

Variable sizes in bits (C and Fortran):
Variable          C -n32   C -64   F -n32   F -64
char/character       8       8        8        8
short               16      16        -        -
int/integer         32      32       32       32
long                32      64        -        -
long long           64      64        -        -
logical              -       -       32       32
float/real          32      32       32       32
double              64      64       64       64
pointer             32      64        -        -

Options: ABI & ISA
Option         Functionality
-n32           invoke the MIPSpro compiler, use 32 bit addressing
-64            invoke the MIPSpro compiler, use 64 bit addressing
-o32/-32       invoke the old ucode compiler, 32 bit addressing
-mips[1234]    ISA; -mips[12] implies the ucode compiler
There are two more ways to define the ABI and ISA:
• the environment variable "SGI_ABI" can be set to -n32 or -64
• the ABI/ISA/processor/optimization can be set in a file ~/compiler.defaults or /etc/compiler.defaults. In addition, the location of the file can be defined by the "COMPILER_DEFAULTS_PATH" environment variable. The file should contain a line like:
    -DEFAULT:abi=n32:isa=mips4:proc=r10000:arith=3:opt=O3
To find out which compiler flags were used to build an object file:
dwarfdump -i file.o | grep DW_AT_producer

Optimization Levels
Compilation time increases with higher optimization levels.
• -O0 turn off all optimizations
• -O1 only local optimizations
• -O2 or -O extensive but conservative optimizations
• -O3 aggressive optimizations, LNO, software pipelining
• -ipa inter-procedural analysis (only at -O2 and -O3)
• -apo automatic parallelization option (same as -pfa)
• -g[0|3] debugging switch: -g0 forces -O0; -g3 allows debugging with -O3

Options: Performance
Option           Functionality
-r10000          Generate optimal instruction schedule for the R10000 processor
-r8000           Generate optimal instruction schedule for the R8000 processor
-O[0|1|2|3]      Set optimization level to 0, 1, 2 or 3
-Ofast=[ipXX]    Select best optimization for the given architecture
-mp              Enable multi-processing directives
-mpio            Support I/O from a parallel region
-apo             Invoke automatic parallelization option
XX is the machine type (output of the hinv -c processor command):
27    Origin2000 (all cpu frequencies and cache sizes)
35    Origin3000 (all cpu frequencies and cache sizes)
The optimizations selected by -Ofast may differ between compiler versions. Currently:
    -O3 -IPA -TARG:platform=ip27 -n32 -OPT:Olimit=0:roundoff=3:div_split=ON:alias=typed
(thus the -Ofast switch invokes the Inter-Procedural Analyser)

Options: Porting
Option         Functionality
-d8/d16        Double precision variables as 8 or 16 bytes
-r8            Convert REAL to REAL*8 and COMPLEX to COMPLEX*16 (1)
-i8            Convert INTEGER to INTEGER*8 and LOGICAL to 8 byte sizes (1)
-static        Local variables will be initialised in fixed locations on the heap
-col[72|120]   Source line is 72 or 120 columns
-Dname         Define name for the pre-processor
-Idir          Define include directory dir
-alignN        Assume alignment on the N=8,16,32,64,128 bit boundary
-version       Show compiler version
-show          Put the compiler in verbose mode: all switches are displayed
(1) Note: explicit sizes are preserved, i.e. REAL*4 remains 32 bit

Options: Debugging
Option         Functionality
-g             Disable optimization and keep all symbol tables
-DEBUG:        the DEBUG group option (man DEBUG_GROUP):
• check_div=n
  n=1 (default) check integer divide by zero
  n=2 check integer overflow
  n=3 check integer divide by zero and overflow
• subscript_check (default ON) to check for subscripts out of range
  C/C++: produces trap #8
  f77: aborts run and dumps core
  f90: aborts run if the environment variable F90_BOUNDS_CHECK_ABORT is set
• verbose_runtime (default OFF) to give the source line number of failures
• trap_uninitialized (default OFF) initialise all variables to 0xFFFA5A5A
  when used as pointer: access violation
  when used as fp values: NaN causes fp trap
Example:
    f77 -n32 -mips4 -g file.f \
        -DEBUG:subscript_check:verbose_runtime=ON \
        -DEBUG:check_div=3 -DEBUG:trap_uninitialized=ON

Compilation Examples
1. Produce executable a.out with default compilation options:
    f77 source.f
    cc source.c
   Be aware of the default settings (e.g. /etc/compiler.defaults); the same flags apply to Fortran and C.
2. Options for debugging:
    f77/cc -o prog -n32 -g -static source.f
3. Explicit setting of ABI/ISA/processor, highest optimization:
    f77/cc -o prog -n32 -mips4 -r10000 -O3 source.f
4. Detailed control of the optimization process with the group options:
    f77/cc -o prog -64 -mips4 -O3 -Ofast=ip27 -OPT:round=3:IEEE_arith=3 -IPA:dfe=on ...

Fine Tuning Compiler Actions
The compiler performs many sophisticated optimizations on the source code under certain assumptions about the program. Typically:
• program data is large (does not fit into the cache)
• program does not violate the language standard
• program is insensitive to roundoff errors
• all data in the program is aliased, unless the compiler can prove otherwise
If one or more of these assumptions does not hold, the compiler should be tuned to the program with the compiler options. Most important:
• OPT for general optimization assumptions
• LNO for the Loop Nest optimizer options
• IPA for the Inter-Procedural Analyser options
Additional options that help to tune the compiler properly:
• TENV, TARG for the target machine and environment description
• LIST, DEBUG for the listing and debugging options

Group Options
Compiler options can be set with key=value expressions on the command line. These options are combined in logical groups. Multiple key=val expressions are colon separated; the same group heading can be specified several times, and the effects are cumulative.
E.g.: -OPT:roundoff=2:alias=restrict -OPT:IEEE_arithmetic=3 etc.
Group Heading      Reference page          Usage comments
-OPT:key=val       cc(1) f77(1) opt(5)     optimizations
-TENV:key=val      cc(1) f77(1)            Control target environment
-TARG:key=val      cc(1) f77(1)            Control target architecture
-FLIST/CLIST       cc(1) f77(1)            Listing control
-LIST:key=val      cc(1) f77(1)            Options to control listing
-DEBUG:key=val     debug_group(5)          Debugging options
-IPA:key=val       ipa(5)                  Inter-Procedural Analyser control
-INLINE:key=val    ipa(5)                  Procedure inliner control
-LNO:key=val       lno(5)                  Loop Nest optimizer control
-MP:key=val        cc(1) f77(1)            parallelization control
-LANG:             cc(1) f77(1)            language compatibility features

Compiler man Pages
• Primary man pages:
man f77(1) f90(1) cc(1) CC(1) ld(1)
• some of the compiler option groups are rather large and deserve their own man pages:
man opt(5)
man lno(5)
man ipa(5)
man DEBUG_GROUP(5)
man mp(3F)
man pe_environ(5)
man sigfpe(3C)

The Run-Time Library Structure
/usr
  lib/      o32 (ucode compiler)
      *.a, *.so
      nonshared/*.a
  lib32/    n32 (mongoose compiler)
      *.a, *.so
      Cmplrs/
      nonshared/*.a
      mips3/    *.a, *.so, nonshared/*.a
      mips4/    *.a, *.so, nonshared/*.a
  lib64/    n64 (mongoose compiler)
      *.a, *.so
      Cmplrs/
      nonshared/*.a
      mips3/    *.a, *.so, nonshared/*.a
      mips4/    *.a, *.so, nonshared/*.a
Environment variables: LD_LIBRARY_PATH, LD_LIBRARYN32_PATH, LD_LIBRARY64_PATH

The Scientific Libraries
Standard scientific libraries containing:
• Basic Linear Algebra operations and algorithms:
– BLAS1, BLAS2, BLAS3 (see man intro_blas1, _blas2, _blas3)
– LAPACK (see man intro_lapack)
• Fast Fourier Transformations (FFT):
– 1D, 2D, 3D, multiple 1D transformations (see man intro_fft)
• Convolutions (Signal Processing, e.g. man SIIR2D)
• Sparse Solvers (see man solvers; man PSLDLT)
To use:
– -lscs for serial versions
– -lscs_mp -mp for parallel versions
– man intro_scsl for a detailed description
– -lcomplib.sgimath or -lcomplib.sgimath_mp for older versions
– man complib.sgimath for a detailed description

Computational Domain
Range of numbers (from /usr/include/limits.h):
FLT_DIG          6    /* decimal digits of precision of a float */
FLT_MAX          3.40282347E+38F
FLT_MIN          1.17549435E-38F
DBL_DIG          15   /* decimal digits of precision of a double */
DBL_MAX          1.7976931348623157E+308
DBL_MIN          2.2250738585072014E-308
LONGLONG_MIN     -9223372036854775807LL-1LL
LONGLONG_MAX     9223372036854775807LL
ULONGLONG_MAX    18446744073709551615LLU
Extended precision (REAL*16) is available and supported by the compiler, but this mode of calculation is slow (by a factor of ~40).

Underflow and Denormal Numbers
When de-normalised numbers emerge in a computation (i.e. non-zero numbers with |x| < DBL_MIN) they are flushed to zero by default. The program

      Program denorm
      real*8 a,b
      a = 2.2250738585072014D-308
      b = a/10.0D0
      write(6,10) b
 10   format(D25.17)
      end

will print zero. To force IEEE-754 gradual underflow it is necessary to manipulate the status register on the R1x000 cpu:

    #include <sys/fpu.h>
    /* clear the flush-to-zero bit in the fp control/status register */
    void no_flush_()
    {
        union fpc_csr f;
        f.fc_word = get_fpc_csr();
        f.fc_struct.flush = 0;
        set_fpc_csr(f.fc_word);
    }

With a call to no_flush at the beginning, the program will print 0.22250738585072014D-308.
The flush-to-zero property can lead to x-y=0 while x≠y. Keeping de-normalised numbers in computations avoids that condition, but causes an fp exception that must be processed in software.
Avoiding de-normalised numbers in calculations is therefore a performance issue.

Overflow Example
Program example that generates overflows and underflows:

      Parameter (N=20)
      INCLUDE "/usr/include/limits.h"
      Real*8 A(N),B(N)
      complex*16 C(N)

      do I=1,N
         A(I) = (FLT_MAX/10)*I   ! single precision range
         B(I) = (FLT_MIN*10)/I   ! will fit into double
      enddo

      C = CMPLX(A,B)   ! Standard requires passing through base precision: real*4 !

      write (0,'(I3,2(2G22.15/))') (I,A(I),B(I),C(I),I=1,N)

Compile with: f77 -n32 -mips4 -O3
Output with all exceptions ignored by default:

 10   0.340282347000000E+39   0.117549435000000E-37   A,B
      0.340282346638529E+39   0.117549435082229E-37   Cr,Ci
 11   0.374310581700000E+39   0.106863122727273E-37   A,B
      Infinity                0.000000000000000       Cr,Ci (Overflow! Flush to zero!)
 12   etc…

setenv TRAP_FPE "UNDERFL=TRACE; OVERFL=TRACE" will trap on overflow and underflow and produce a traceback (link with -lfpe).
Note: compilation with -r8 avoids the error.

Floating Point Exceptions
An fp status register flag is set when the FPU encounters an exceptional condition:
• division by zero
• overflow
• underflow
• invalid
• inexact
By default, all exceptions are ignored! (e.g. for 1/0 the result is set to Infinity, for 0/0 to NaN, and execution continues)
The status register can be programmed to raise a Floating Point Exception (FPE). If an FPE occurs, the system can take a specified action:
• abort
• ignore the exception
• repair the illegal condition
You can manipulate the status register to select the action:
• with calls to the FPE library, link with -lfpe
• with the environment variable TRAP_FPE
see man handle_sigfpes

Compiler-Generated Exceptions
The compiler can do more optimizations if it is allowed to generate code that could cause exceptions (-TENV:X=0..4):
X = 0 no speculative code motion
X = 1 IEEE-754 underflow and inexact FPE disabled (default at -O0 and -O2)
X = 2 all IEEE-754 exceptions disabled except 1/0 (default at -O3)
X = 3 all IEEE-754 exceptions are disabled
X = 4 memory access exceptions are disabled
IF-conversion with conditional moves for software pipelining (with -O3):

      do i=1,N
         if(a(i) .lt. eps) then
            x = x + 1/eps
         else
            x = x + 1/a(i)
         endif
      enddo

    # put eps in $f1 and (1/eps) in $f0
    ldc1    $f5,-8($2)        # load a(i)
    c.lt.d  $fcc0,$f5,$f1     # if (a(i) < eps)
    recip.d $f2,$f5           # 1/a(i)
    movt.d  $f2,$f0,$fcc0     # y = 1/a(i) or 1/eps
    add.d   $f1,$f1,$f2       # x = x + y

Removing the IF(…) means 1/a(i) is always computed, which can cause a divide by zero! In this case the exception must be ignored.
Note: this transformation is already applied at -O3 with X=1!

IEEE-754 Compliance
The MIPS4 instruction set contains instructions that are not IEEE-754 compliant:
• recip.s/d (reciprocal 1/x) instruction is accurate to 1 ulp
• rsqrt.s/d (reciprocal square root 1/sqrt(x)) instruction is accurate to 2 ulp
-OPT:IEEE_arithmetic=X specifies the degree of non-compliance and what to do with inf and NaN operands:
X=1 strict IEEE-754 compliance; does not use the recip and rsqrt instructions (-O1, -O2)
X=2 optimize 0*x=0 and x/x=1 although x can be NaN (default at -O3)
X=3 any mathematically valid transformation is allowed, including the recip and rsqrt instructions
Example: with -O3 -OPT:IEEE_arithmetic=3 the loop

      do i=1,n
         x = x + a(i)/y
      enddo

(21 cycles/iteration; 4% of peak) is transformed into

      y_tmp = 1/y
      do i=1,n
         x = x + a(i)*y_tmp
      enddo

(1 cycle/iteration; 100% of peak).
Note: X=3 is required!

Rounding Accuracy
Rounding mode can be specified with the -OPT:roundoff=X switch:
X = 0 no optimizations that affect fp behaviour (default at -O1 and -O2)
X = 1 allows simple transformations with limited round-off and overflow differences
X = 2 allows reordering of reduction loops (default at -O3)
X = 3 any mathematically valid transformation is allowed
Example: the reduction loop

      do i=1,n
         x = x + a(i)
      enddo

runs with -O3 -OPT:roundoff=1 at 2 cycles/iteration (25% of peak). With -O3 -OPT:roundoff=2 (default at -O3) it is transformed into

      do i=1,n,8
         x0 = x0 + a(i)
         x1 = x1 + a(i+1)
         …
      enddo
      x = x0 + x1 + …

and runs at 1 cycle/iteration (50% of peak).
Recommendation: Your program should work correctly when compiled with -O3 -OPT:IEEE_arithmetic=3:roundoff=3

Summary
The compiler is the primary tool of program optimization:
• compilation is the process of lowering the code representation from high level to low, i.e. processor level
• the MIPSpro compiler targets the MIPS R1x000 processor and has the features of the processor and of the Origin2000 architecture built in
• a large number of options exist to steer the compilation process:
– selection of ABI, ISA and optimization options
– setting of assumptions about the program behaviour
• there are optimized and parallelized libraries of subroutines for scientific computation
• when programming for a digital computer, it is important to remember the limitations due to the limited validity range of floating point calculations