Date post: | 14-Dec-2015 |
Upload: | joan-reasons |
Structure of Recent Compilers
Front End: Transforms Language to Common Intermediate Form
Note: Only a few companies make front ends for C. The source code of a C++ front end is about 30 times bigger than that of a C front end. Most front ends down-convert C++ to C before compilation.
High Level Optimization
High Level Loop Optimization
Example: Procedure In-lining
(Lang Dep., Machine Ind.)
Global Optimization
Global and Local Optimization and register allocation
(Small Lang Dep., Small Machine dep.)
Code Generation
Detailed Instruction Selection and machine dependent optimization (No Lang Dep., Highly Machine Dep.)
Compiler Prime Target
Program Correctness
Speed
Compilation Time?
Phases of compilers help write bug-free code
Optimizations
High-level
Local (Basic Block)
Global (across branches)
Register Allocation, Live Range Analysis
Processor Dependent
Optimization Names
Procedure Integration
Common Sub-expression Elimination / Dead Code Elimination
    A = b + c    ; dead code, eliminated: A is overwritten before any use of b + c
    A = x + y
    Similarly, a procedure that does not return a value and uses only local variables will be eliminated. (Test this in VC++)
Constant Propagation: a variable that is used as a constant is replaced by its value. (Constants aren't, variables won't. Osborn's Law)
Global Sub-expression Elimination
Copy Propagation (after a = b, later uses of a are replaced by b)
Code Motion (code that does not change with the loop index is moved out of the loop)
Induction Variable Elimination (A = A + 5 in a loop that runs n times is replaced with A = A + 5 * n and moved out of the loop, if A is not used inside the loop)
Strength Reduction (multiply replaced with shift and add if possible; A*25 + B*25 is replaced with (A+B) * 25)
Pipeline Scheduling
Branch Optimization
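A few of the optimizations named above can be illustrated by applying them by hand. The sketch below is only an illustration: the function names `before` and `after` and the loop body are hypothetical, and a real compiler performs these steps on its intermediate form rather than on source code.

```python
def before(values, x, y):
    """Loop as a programmer might write it (integer inputs assumed)."""
    a = 0
    out = []
    for v in values:
        k = x * y              # loop-invariant: does not depend on v
        a = a + 5              # grows by 5 per iteration, never read in the loop
        out.append(v * 25 + k)
    return out, a


def after(values, x, y):
    """Same result with the optimizations applied by hand."""
    k = x * y                  # code motion: computed once, outside the loop
    out = []
    for v in values:
        # strength reduction: v * 25 == 16*v + 8*v + v (shifts and adds)
        out.append((v << 4) + (v << 3) + v + k)
    a = 5 * len(values)        # induction variable update moved out of the loop
    return out, a
```

Both functions return the same pair for any list of integers, which is the property an optimizer has to preserve.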
Problems with Pointers
A = 5;
p = x + y;
*p = 9;    (only the programmer knows whether &A = p)
The compiler cannot assign A to a register, because the store through p may change A in memory.
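Why this blocks register allocation can be seen in a small simulation. This is a sketch under stated assumptions: memory is modelled as a Python list, `ADDR_A` is a made-up address for A, and `reg_a` stands for a register copy of A that a compiler might want to keep.

```python
def run(alias):
    """Simulate A = 5; p = x + y; *p = 9 with A cached in a register."""
    memory = [0] * 8           # hypothetical flat memory
    ADDR_A = 3                 # hypothetical address of A
    # Pick x and y so that p either equals &A or points elsewhere.
    x, y = (1, 2) if alias else (1, 4)

    memory[ADDR_A] = 5         # A = 5
    reg_a = memory[ADDR_A]     # register copy of A
    p = x + y                  # p = x + y
    memory[p] = 9              # *p = 9
    return reg_a, memory[ADDR_A]
```

`run(True)` returns `(5, 9)`: the register copy is stale because the store through p changed A. `run(False)` returns `(5, 5)`. Since the compiler usually cannot tell which case applies, it has to reload A from memory.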
Architecture Help
Provide Orthogonality: the operations, the data types, the addressing modes, and the register functions should be orthogonal
Simplify trade-offs between alternatives (with caches and pipelining, trade-offs have become very complex). For example, the most difficult one in a register-memory architecture: how many times must a variable be referenced before it is assigned a register?
Provide instructions to bind variables with constants
Most SIMD kernels are hand-coded, as there is no compiler support
Hand-Coded vs. Compiler-Generated on TMS320C6203 (VLIW CPU) (reported May 2000)

EEMBC Telecom Kernel         Ratio of Execution Time     Ratio of Code Size
                             (Compiler/Hand Written)     (Compiler/Hand Written)
Convolution Encoder          44.0                        0.5
Fixed Point Complex FFT      13.5                        1.0
Viterbi GSM Decoder          13.0                        0.7
Fixed Point Bit Allocation    7.0                        1.4
Auto-correlation              1.8                        0.7
Basic Compiler Techniques
Basic Pipelining
Static Loop Unrolling
Example:

Instruction Producing Result    Instruction Using Result    Latency in CC
FP ALU                          FP ALU                      3
FP ALU                          Store                       2
Load                            FP ALU                      1
FP Load                         FP Store                    0
Example (Contd…)
Loop:   L.D     F0, 0(R1)
        ADD.D   F4,F0,F2
        S.D     F4, 0(R1)
        DADDUI  R1,R1, #-8
        BNEQ    R1,R2, Loop
Example (Without Scheduling)
Loop:   L.D     F0, 0(R1)
        stall                   ;LUD
        ADD.D   F4,F0,F2
        stall
        stall
        S.D     F4, 0(R1)
        DADDUI  R1,R1, #-8
        stall
        BNEQ    R1,R2, Loop
        stall                   ;Successor flushed
Total 10cc
Example (With Scheduling)
Loop:   L.D     F0, 0(R1)
        DADDUI  R1,R1, #-8
        ADD.D   F4,F0,F2
        stall
        BNEQ    R1,R2, Loop
        S.D     F4, 8(R1)       ;delay slot
Total 6cc (3 for data, 3 overhead)
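The cycle totals for the unscheduled and scheduled loops can be reproduced with a small in-order issue model. This is a rough sketch, not a pipeline simulator. It assumes one instruction issues per cycle, it uses the latency table from the example, and it adds two assumptions that the table does not state: an integer ALU result needs one extra cycle before a branch can use it, and an unfilled branch delay slot costs one cycle. The names `LATENCY` and `count_cycles` are made up for illustration.

```python
# Extra cycles required between a producing and a consuming instruction.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int_alu", "branch"): 1,   # assumption, not part of the table above
}


def count_cycles(instrs, delay_slot_filled):
    """instrs: list of (kind, destination register or None, source registers)."""
    last_writer = {}            # register -> (issue cycle, kind of producer)
    cycle = 0
    for kind, dest, sources in instrs:
        issue = cycle + 1       # in-order, at most one instruction per cycle
        for reg in sources:
            if reg in last_writer:
                w_cycle, w_kind = last_writer[reg]
                wait = LATENCY.get((w_kind, kind), 0)
                issue = max(issue, w_cycle + 1 + wait)
        if dest is not None:
            last_writer[dest] = (issue, kind)
        cycle = issue
    # An empty branch delay slot costs one more cycle (assumption).
    return cycle if delay_slot_filled else cycle + 1


UNSCHEDULED = [
    ("load", "F0", ["R1"]),
    ("fp_alu", "F4", ["F0", "F2"]),
    ("store", None, ["F4", "R1"]),
    ("int_alu", "R1", ["R1"]),
    ("branch", None, ["R1", "R2"]),
]

SCHEDULED = [
    ("load", "F0", ["R1"]),
    ("int_alu", "R1", ["R1"]),
    ("fp_alu", "F4", ["F0", "F2"]),
    ("branch", None, ["R1", "R2"]),
    ("store", None, ["F4", "R1"]),   # placed in the branch delay slot
]
```

Under these assumptions `count_cycles(UNSCHEDULED, delay_slot_filled=False)` gives 10 and `count_cycles(SCHEDULED, delay_slot_filled=True)` gives 6, matching the totals in the example. The model only counts cycles; it does not decide where a stall is placed relative to the branch.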
Example (Static Loop Unrolling 4 times)
Loop:   L.D     F0, 0(R1)
        L.D     F6, -8(R1)
        L.D     F10, -16(R1)
        L.D     F14, -24(R1)
        ADD.D   F4,F0,F2
        ADD.D   F8,F6,F2
        ADD.D   F12,F10,F2
        ADD.D   F16,F14,F2
        S.D     F4, 0(R1)
        S.D     F8, -8(R1)
        DADDUI  R1,R1, #-32
        S.D     F12, 16(R1)
        BNEQ    R1,R2, Loop
        S.D     F16, 8(R1)      ; Delay slot
Total 14cc for 4 elements = 3.5cc per element
Compiler Considerations:
1. Use of delay slot
2. Loop level independence
3. Register Assignment
4. Proper Loop Adjustment
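The same considerations show up when the unrolling is written at source level. The following is a minimal sketch with made-up function names: the four copies of the body use separate temporaries (register assignment), they are only legal because the iterations are independent (loop-level independence), and a clean-up loop handles element counts that are not a multiple of four (loop adjustment).

```python
def add_scalar(x, s):
    """Original loop: x[i] = x[i] + s for every element."""
    for i in range(len(x)):
        x[i] = x[i] + s


def add_scalar_unrolled4(x, s):
    """The same loop unrolled four times, with a clean-up loop."""
    n = len(x)
    main = n - n % 4                 # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):      # one index update and test per 4 elements
        t0 = x[i] + s                # separate temporaries play the role of
        t1 = x[i + 1] + s            # the distinct registers F4, F8, F12, F16
        t2 = x[i + 2] + s
        t3 = x[i + 3] + s
        x[i], x[i + 1], x[i + 2], x[i + 3] = t0, t1, t2, t3
    for i in range(main, n):         # leftover 0 to 3 elements
        x[i] = x[i] + s
```

Both functions modify the list in place and leave it in the same final state for any length.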
Example (Static Dual Issue, 1 Int and 1 FP/CC)
CC   Integer instruction             FP instruction
1    Loop: L.D F0, 0(R1)
2    L.D F6, -8(R1)
3    L.D F10, -16(R1)                ADD.D F4,F0,F2
4    L.D F14, -24(R1)                ADD.D F8,F6,F2
5    L.D F18, -32(R1)                ADD.D F12,F10,F2
6    S.D F4, 0(R1)                   ADD.D F16,F14,F2
7    S.D F8, -8(R1)                  ADD.D F20,F18,F2
8    S.D F12, -16(R1)
9    DADDUI R1,R1, #-40
10   S.D F16, 16(R1)
11   BNEQ R1,R2, Loop
12   S.D F20, 8(R1)                  ; Delay slot
Total 12cc for 5 elements = 2.4cc per element
(LUD: load-use delay)
VLIW
Compiler formats issue packets
Compiler ensures that dependencies are not present
64 to 200-bit long instructions
Example (VLIW: 1 Int, 2 FP, 2 LD/ST per CC, 5 slots)

Mem 1 Slot         Mem 2 Slot         FP 1 Slot          FP 2 Slot          Int/Branch
L.D F0, 0(R1)      L.D F6, -8(R1)
L.D F10, -16(R1)   L.D F14, -24(R1)
L.D F18, -32(R1)   L.D F22, -40(R1)   ADD.D F4,F0,F2     ADD.D F8,F6,F2
L.D F26, -48(R1)                      ADD.D F12,F10,F2   ADD.D F16,F14,F2
                                      ADD.D F20,F18,F2   ADD.D F24,F22,F2
S.D F4, 0(R1)      S.D F8, -8(R1)     ADD.D F28,F26,F2
S.D F12, -16(R1)   S.D F16, -24(R1)                                         DADDUI R1,R1, #-56
S.D F20, 24(R1)    S.D F24, 16(R1)
S.D F28, 8(R1)                                                              BNEQ R1,R2, Loop

9cc for 7 elements = 1.29cc per element; 23 slots used out of a potential 45
Loop Level Parallelism
Loop-Carried Dependence: data calculated in one loop iteration is required in a later iteration.
A Parallel Loop (no loop-carried dependence):
for (i = 1000; i > 0; i = i-1)
    x[i] = x[i] + s;
Example
for (i = 1; i <= 100; i = i+1)
{
    A[i+1] = A[i] + C[i];
    B[i+1] = B[i] + A[i+1];
}
Dependences?
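One way to explore the question is to record which iteration last wrote each array element and check whether a later iteration reads it. The sketch below does this for true (read-after-write) dependences only. The helper `loop_carried_raw` and its calling convention are made up for illustration: `exposed_reads(i)` lists the elements that iteration i reads before writing them itself, and `writes(i)` lists the elements it writes.

```python
def loop_carried_raw(n, exposed_reads, writes):
    """Return {(array, distance)} for reads of values written in earlier iterations."""
    last_write = {}                      # (array, index) -> iteration that wrote it
    carried = set()
    for i in range(1, n + 1):
        for loc in exposed_reads(i):
            w = last_write.get(loc)
            if w is not None:            # written by an earlier iteration
                carried.add((loc[0], i - w))
        for loc in writes(i):
            last_write[loc] = i
    return carried


# The loop above: A[i+1] = A[i] + C[i];  B[i+1] = B[i] + A[i+1];
# A[i+1] is read only after being written in the same iteration,
# so it is not an exposed read.
def example_reads(i):
    return [("A", i), ("C", i), ("B", i)]


def example_writes(i):
    return [("A", i + 1), ("B", i + 1)]


result = loop_carried_raw(100, example_reads, example_writes)
```

For this loop `result` is `{("A", 1), ("B", 1)}`: each statement depends on its own result from the previous iteration. The use of A[i+1] in the second statement is a dependence within one iteration and is not loop carried.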
Example 2
Make the following loop parallel.
for (i = 1; i <= 100; i = i+1)
{
    A[i] = A[i] + B[i];
    B[i+1] = C[i] + D[i];
}
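One possible transformation, sketched here under the assumption that the arrays have valid indices 1 to 101, peels the first use of B and the last definition of B out of the loop and swaps the two statements. Each iteration then uses only the B element it computes itself. The function names are made up for illustration.

```python
def loop_serial(A, B, C, D):
    """The loop as given: B[i+1] is produced in iteration i and used in i+1."""
    A, B = A[:], B[:]
    for i in range(1, 101):
        A[i] = A[i] + B[i]
        B[i + 1] = C[i] + D[i]
    return A, B


def loop_parallel(A, B, C, D):
    """Restructured loop with no loop-carried dependence."""
    A, B = A[:], B[:]
    A[1] = A[1] + B[1]                   # peeled first statement
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]   # uses the B computed just above
    B[101] = C[100] + D[100]             # peeled last statement
    return A, B
```

Both versions produce the same A and B, but in the second one no iteration reads a value written by another iteration.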
The GCD Test
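A loop stores into element a*j + b and later fetches from element c*k + d.
If a loop-carried dependence exists, then GCD(c, a) must divide (d - b) with no remainder. If it does not divide, that is sufficient to rule out a dependence.
for (i = 1; i <= 100; i = i+1)
    x[2*i+3] = x[2*i] * 5;
Here a = 2, b = 3, c = 2, d = 0: GCD(2, 2) = 2 does not divide d - b = -3, so no dependence is possible.
This test ignores loop bounds, so it can report a possible dependence where none occurs within the bounds.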
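The test is short enough to write out. The sketch below uses made-up function names and pairs the GCD test with a brute-force check over the actual loop bounds, to show how ignoring the bounds makes the test conservative.

```python
from math import gcd


def gcd_test_may_depend(a, b, c, d):
    """GCD test for a store to x[a*j + b] and a load from x[c*k + d].

    False means a dependence is impossible; True means it cannot be ruled out.
    """
    g = gcd(a, c)
    if g == 0:                       # both indices are constants
        return b == d
    return (d - b) % g == 0


def carried_within_bounds(a, b, c, d, lo, hi):
    """Exhaustive check: do two different iterations touch the same element?"""
    return any(a * j + b == c * k + d
               for j in range(lo, hi + 1)
               for k in range(lo, hi + 1) if j != k)
```

For the loop above, `gcd_test_may_depend(2, 3, 2, 0)` is False, and the exhaustive check over i = 1..100 agrees. For a store to x[i+200] and a load from x[i], `gcd_test_may_depend(1, 200, 1, 0)` is True although `carried_within_bounds(1, 200, 1, 0, 1, 100)` is False, because the two index ranges never overlap within the bounds.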
Example 2
Use renaming to find ILP
for (i = 1; i <= 100; i = i+1)
{
    Y[i] = X[i] / c1;
    X[i] = X[i] + c2;
    Z[i] = Y[i] + c3;
    Y[i] = c4 - Y[i] / c;
}
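One possible renaming is sketched below. It assumes that c1 to c4 and c are loop-invariant constants; the temporary arrays `T` and `X1` and the function names are made up for illustration. Writing the first result to T removes the output dependence and the antidependence on Y, and writing the updated X to X1 removes the antidependence on X. Only the true dependences on T remain.

```python
def loop_given(X, c1, c2, c3, c4, c):
    """The loop as given, with its name dependences on X and Y."""
    n = len(X)
    X = X[:]
    Y = [0.0] * n
    Z = [0.0] * n
    for i in range(n):
        Y[i] = X[i] / c1
        X[i] = X[i] + c2
        Z[i] = Y[i] + c3
        Y[i] = c4 - Y[i] / c
    return X, Y, Z


def loop_renamed(X, c1, c2, c3, c4, c):
    """After renaming: T replaces the first Y, X1 replaces the updated X."""
    n = len(X)
    T = [0.0] * n
    X1 = [0.0] * n
    Y = [0.0] * n
    Z = [0.0] * n
    for i in range(n):
        T[i] = X[i] / c1          # was Y[i]
        X1[i] = X[i] + c2         # was X[i]; code after the loop must read X1
        Z[i] = T[i] + c3
        Y[i] = c4 - T[i] / c
    return X1, Y, Z
```

Both functions return the same three lists. In the renamed version the last three statements depend only on T[i] and the original X[i], so they can be scheduled in any order once T[i] is available.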