Computer Architecture
Chapter 4
Instruction-Level Parallelism - 3
Prof. Jerry Breecher
CS 240
Fall 2003
Chap. 4 - ILP 3 2
Chapter Overview
4.1 Compiler Techniques for Exposing ILP
4.2 Static Branch Prediction
4.3 Static Multiple Issue: VLIW
4.4 Advanced Compiler Support for ILP
4.5 Hardware Support for Exposing more Parallelism
Ideas To Reduce Stalls

Technique                                  Reduces                               Covered in
Dynamic scheduling                         Data hazard stalls                    Chapter 3
Dynamic branch prediction                  Control stalls                        Chapter 3
Issuing multiple instructions per cycle    Ideal CPI                             Chapter 3
Speculation                                Data and control stalls               Chapter 3
Dynamic memory disambiguation              Data hazard stalls involving memory   Chapter 3
Loop unrolling                             Control hazard stalls                 Chapter 4
Basic compiler pipeline scheduling         Data hazard stalls                    Chapter 4
Compiler dependence analysis               Ideal CPI and data hazard stalls      Chapter 4
Software pipelining and trace scheduling   Ideal CPI and data hazard stalls      Chapter 4
Compiler speculation                       Ideal CPI, data and control stalls    Chapter 4
Instruction Level Parallelism
4.1 Compiler Techniques for Exposing ILP
4.3 Static Multiple Issue: VLIW
4.4 Advanced Compiler Support for ILP
4.5 Hardware Support for Exposing more Parallelism
How can compilers recognize and take advantage of ILP?
Simple Loop and its Assembler Equivalent
for (i=1; i<=1000; i++) x(i) = x(i) + s;
Loop: LD   F0,0(R1)   ;F0=vector element
      ADDD F4,F0,F2   ;add scalar from F2
      SD   0(R1),F4   ;store result
      SUBI R1,R1,8    ;decrement pointer 8 bytes (DW)
      BNEZ R1,Loop    ;branch R1!=zero
      NOP             ;delayed branch slot
Compilers and ILP: Pipeline Scheduling and Loop Unrolling
This is a clean and simple example!
Loop: LD F0,0(R1) ;F0=vector element
ADDD F4,F0,F2 ;add scalar in F2
SD 0(R1),F4 ;store result
SUBI R1,R1,8 ;decrement pointer 8B (DW)
BNEZ R1,Loop ;branch R1!=zero
NOP ;delayed branch slot
FP Loop Hazards
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0
Where are the stalls?
FP Loop Showing Stalls
 1 Loop: LD   F0,0(R1)  ;F0=vector element
 2       stall
 3       ADDD F4,F0,F2  ;add scalar in F2
 4       stall
 5       stall
 6       SD   0(R1),F4  ;store result
 7       SUBI R1,R1,8   ;decrement pointer 8 bytes (DW)
 8       stall
 9       BNEZ R1,Loop   ;branch R1!=zero
10       stall          ;delayed branch slot
10 clocks: Rewrite code to minimize stalls?
Scheduled FP Loop Minimizing Stalls
Now 6 clocks. Next, unroll the loop 4 times to make it faster.
1 Loop: LD F0,0(R1)
2 SUBI R1,R1,8
3 ADDD F4,F0,F2
4 stall
5 BNEZ R1,Loop ;delayed branch
6 SD 8(R1),F4 ;altered when move past SUBI
Swap BNEZ and SD by changing the address of SD: the SD now follows the SUBI, so its offset becomes 8(R1).
The stall in cycle 4 is there because the SD can’t proceed until two cycles after the ADDD.
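The cycle counts on these two slides (10 clocks unscheduled, 6 clocks scheduled) can be reproduced with a small model. The sketch below is only an illustration: the `Instr` struct and `issue_cycles` function are hypothetical names, each instruction is given at most one producer, and the one-cycle SUBI-to-BNEZ stall is taken from the stall listing rather than from the latency table. The model issues the branch as early as it can, so the scheduled loop's stall falls after BNEZ instead of before it; the total is the same.

```c
#include <stddef.h>

/* Hypothetical description of one instruction in a straight-line sequence. */
typedef struct {
    int producer;  /* index of the earlier instruction it waits on, or -1 */
    int latency;   /* stall cycles needed between producer and this consumer */
} Instr;

/* Clock cycle (starting at 1) in which the last instruction issues on a
   single-issue, in-order pipeline. Sketch: handles at most 64 instructions. */
int issue_cycles(const Instr *code, size_t n) {
    int issue[64];
    int cycle = 0;
    for (size_t i = 0; i < n && i < 64; i++) {
        int earliest = cycle + 1;                 /* next free issue slot */
        if (code[i].producer >= 0) {
            int ready = issue[code[i].producer] + code[i].latency + 1;
            if (ready > earliest)
                earliest = ready;                 /* stall until operand is ready */
        }
        issue[i] = earliest;
        cycle = earliest;
    }
    return cycle;
}

/* Original order: LD, ADDD, SD, SUBI, BNEZ, delay slot. */
static const Instr original_loop[] = {
    {-1, 0},  /* LD   F0,0(R1)                          */
    { 0, 1},  /* ADDD needs LD result, latency 1        */
    { 1, 2},  /* SD   needs ADDD result, latency 2      */
    {-1, 0},  /* SUBI R1,R1,8                           */
    { 3, 1},  /* BNEZ needs SUBI result (assumed 1)     */
    {-1, 0},  /* delayed branch slot                    */
};

/* Scheduled order: LD, SUBI, ADDD, BNEZ, SD. */
static const Instr scheduled_loop[] = {
    {-1, 0},  /* LD   */
    {-1, 0},  /* SUBI */
    { 0, 1},  /* ADDD needs LD   */
    { 1, 1},  /* BNEZ needs SUBI */
    { 2, 2},  /* SD in the delay slot needs ADDD */
};
```

Calling `issue_cycles(original_loop, 6)` and `issue_cycles(scheduled_loop, 5)` gives the two per-iteration counts.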
Unroll Loop Four Times (straightforward way)
 1 Loop: LD   F0,0(R1)
 2       stall
 3       ADDD F4,F0,F2
 4       stall
 5       stall
 6       SD   0(R1),F4
 7       LD   F6,-8(R1)
 8       stall
 9       ADDD F8,F6,F2
10       stall
11       stall
12       SD   -8(R1),F8
13       LD   F10,-16(R1)
14       stall
15       ADDD F12,F10,F2
16       stall
17       stall
18       SD   -16(R1),F12
19       LD   F14,-24(R1)
20       stall
21       ADDD F16,F14,F2
22       stall
23       stall
24       SD   -24(R1),F16
25       SUBI R1,R1,#32
26       stall
27       BNEZ R1,Loop
28       NOP
15 + 4 x (1+2) + 1 = 28 clock cycles, or 7 per iteration.
Assumes the number of iterations is a multiple of 4 (R1 is initially a multiple of 32).
Rewrite loop to minimize stalls.
Unrolled Loop That Minimizes Stalls
 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,Loop
14       SD   8(R1),F16   ; 8-32 = -24
14 clock cycles, or 3.5 per iteration. No stalls!!
What assumptions were made when the code was moved?
– OK to move the store past SUBI even though SUBI changes the register
– OK to move loads before stores: do we still get the right data?
– When is it safe for the compiler to make such changes?
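At the source level, the same transformation can be sketched in C. This is an illustration under stated assumptions, not the compiler's actual output: the function names are made up, the array is walked upward rather than downward, and `n` is assumed to be a multiple of 4, just as the slide assumes for the iteration count.

```c
/* Original loop: x[i] = x[i] + s for every element. */
void add_scalar(double *x, int n, double s) {
    for (int i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled four times. Assumes n is a multiple of 4 (no cleanup loop). */
void add_scalar_unrolled4(double *x, int n, double s) {
    for (int i = 0; i < n; i += 4) {
        /* Four loads into distinct temporaries (like F0, F6, F10, F14). */
        double t0 = x[i], t1 = x[i + 1], t2 = x[i + 2], t3 = x[i + 3];
        /* Four independent adds: no result feeds another add. */
        t0 += s; t1 += s; t2 += s; t3 += s;
        /* Four stores; only one index update and one branch per trip. */
        x[i] = t0; x[i + 1] = t1; x[i + 2] = t2; x[i + 3] = t3;
    }
}
```

Using a separate temporary per element plays the role of the distinct registers in the assembly: it removes the name dependences that would otherwise force the four bodies to run one after another.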
Summary of Loop Unrolling Example
• Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset.
• Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code.
• Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.
• Eliminate the extra tests and branches and adjust the loop maintenance code.
• Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address.
• Schedule the code, preserving any dependences needed to yield the same result as the original code.
Compiler Perspectives on Code Movement
The compiler is concerned about dependencies in the program, whether or not they turn into a HW hazard on a given pipeline.
• Tries to schedule code to avoid hazards.
• Looks for Data dependencies (RAW if a hazard for HW)
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
• If dependent, can’t execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory:
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
Compilers and ILP: Dependencies
Compiler Perspectives on Code Movement
1 Loop: LD F0,0(R1)
2 ADDD F4,F0,F2
3 SUBI R1,R1,8
4 BNEZ R1,Loop ;delayed branch
5 SD 8(R1),F4 ;altered when move past SUBI
Compilers and ILP: Data Dependencies
Where are the data dependencies?
Compiler Perspectives on Code Movement
• Another kind of dependence called name dependence: two instructions use same name (register or memory location) but don’t exchange data
• Anti-dependence (WAR if a hazard for HW)
– Instruction j writes a register or memory location that instruction i reads from and instruction i is executed first
• Output dependence (WAW if a hazard for HW)
– Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved.
Compilers and ILP: Name Dependencies
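For registers, the three kinds of dependence can be checked mechanically because the names are fixed. The following is a minimal sketch, not any compiler's real data structure: `RegInstr`, the `DEP_*` flags and `classify_dependence` are hypothetical, registers are plain integers (F0 = 0, F2 = 2, F4 = 4), and only a pair of instructions is examined, ignoring anything between them.

```c
/* Hypothetical register-only view of an instruction; -1 means "unused". */
typedef struct {
    int dest;   /* register written */
    int src1;   /* first register read */
    int src2;   /* second register read */
} RegInstr;

enum { DEP_NONE = 0, DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

/* Classify the dependences of `second` on `first`, where `first` comes
   earlier in program order. The result is a bitwise OR of DEP_* flags. */
int classify_dependence(RegInstr first, RegInstr second) {
    int deps = DEP_NONE;
    /* Data dependence: second reads what first writes. */
    if (first.dest >= 0 &&
        (second.src1 == first.dest || second.src2 == first.dest))
        deps |= DEP_RAW;
    /* Anti-dependence: second writes what first reads. */
    if (second.dest >= 0 &&
        (first.src1 == second.dest || first.src2 == second.dest))
        deps |= DEP_WAR;
    /* Output dependence: both write the same register. */
    if (first.dest >= 0 && first.dest == second.dest)
        deps |= DEP_WAW;
    return deps;
}
```

For example, `LD F0,0(R1)` followed by `ADDD F4,F0,F2` is a data dependence, while `ADDD F4,F0,F2` followed by a later `LD F0,...` is an anti-dependence on the name F0.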
Compiler Perspectives on Code Movement
 1 Loop: LD   F0,0(R1)
 2       ADDD F4,F0,F2
 3       SD   0(R1),F4
 4       LD   F0,-8(R1)
 5       ADDD F4,F0,F2
 6       SD   -8(R1),F4
 7       LD   F0,-16(R1)
 8       ADDD F4,F0,F2
 9       SD   -16(R1),F4
10       LD   F0,-24(R1)
11       ADDD F4,F0,F2
12       SD   -24(R1),F4
13       SUBI R1,R1,#32
14       BNEZ R1,Loop
15       NOP
Where are the name dependencies?
No data is passed in F0, but can’t reuse F0 in cycle 4.
How can we remove these dependencies?
Compiler Perspectives on Code Movement
• Again, name dependencies are hard for memory accesses
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
• Our example required compiler to know that if R1 doesn’t change then:
0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
There were no dependencies between some loads and stores so they could be moved around each other
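When two references use the same base register and that register does not change between them, the question reduces to comparing offsets. A minimal sketch, assuming every access is `size` bytes wide (8 for the doubles in this loop); the function name is hypothetical. It says nothing about references with different base registers, such as 100(R4) versus 20(R6), which is exactly the hard case.

```c
#include <stdbool.h>

/* True when two accesses of `size` bytes, at offsets off_a and off_b from
   the SAME unchanged base register, share at least one byte. */
bool same_base_overlap(long off_a, long off_b, long size) {
    long lo = off_a < off_b ? off_a : off_b;
    long hi = off_a < off_b ? off_b : off_a;
    return hi - lo < size;   /* closer together than one access width */
}
```

With 8-byte accesses, offsets 0, -8, -16 and -24 never overlap, which is why the loads and stores of the unrolled loop could be moved around each other.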
Compiler Perspectives on Code Movement
Compilers and ILP: Control Dependencies
• Final kind of dependence called control dependence
• Example
if p1 {S1;};
if p2 {S2;};
S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.
• Two (obvious) constraints on control dependences:
– An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
– An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch.
• Control dependencies can be relaxed to get parallelism; the program has the same effect as long as we preserve the order of exceptions (an address in a register is checked by a branch before use) and the data flow (a value in a register depends on the branch).
Where are the control dependencies?
 1 Loop: LD   F0,0(R1)
 2       ADDD F4,F0,F2
 3       SD   0(R1),F4
 4       SUBI R1,R1,8
 5       BEQZ R1,exit
 6       LD   F0,0(R1)
 7       ADDD F4,F0,F2
 8       SD   0(R1),F4
 9       SUBI R1,R1,8
10       BEQZ R1,exit
11       LD   F0,0(R1)
12       ADDD F4,F0,F2
13       SD   0(R1),F4
14       SUBI R1,R1,8
15       BEQZ R1,exit
....
When Safe to Unroll Loop?
Compilers and ILP: Loop Level Parallelism
• Example: Where are data dependencies? (A,B,C distinct & non-overlapping)
for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
1. S2 uses the value, A[i+1], computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a “loop-carried dependence” between iterations.
• Implies that iterations are dependent, and can’t be executed in parallel
• Not the case for our prior example; each iteration was distinct
When Safe to Unroll Loop?
• Example: Where are data dependencies? (A,B,C,D distinct & non-overlapping)
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
1. No dependence from S1 to S2. If there were, then there would be a cycle in the dependencies and the loop would not be parallel. Since this other dependence is absent, interchanging the two statements will not affect the execution of S2.
2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to initiating the loop.
Now Safe to Unroll Loop? (p. 240)
OLD:
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
The loop caused a dependence on B, but there are no circular dependencies.
NEW:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
Have eliminated the loop-carried dependence.
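A quick way to gain confidence in a transformation like this is to run both forms on the same data and compare. The sketch below does that, taking S1 as A[i] = A[i] + B[i], the form used in Example 3 of this deck. The function names and the constant `N` are illustrative, and the arrays are assumed to hold at least N+2 elements so that index N+1 is valid.

```c
#define N 100   /* loop trip count, as on the slide */

/* OLD form: S1 in iteration i+1 reads the B[i+1] that S2 wrote in iteration i. */
void loop_old(double *A, double *B, const double *C, const double *D) {
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];       /* S1 */
        B[i + 1] = C[i] + D[i];   /* S2 */
    }
}

/* NEW form: the first S1 and the last S2 are peeled off, so each loop
   trip only uses values produced in that same trip. */
void loop_new(double *A, double *B, const double *C, const double *D) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}
```

In the new loop body, the B[i+1] read by the second statement was written one line earlier in the same trip, so separate trips no longer depend on each other.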
Example 1
There are NO dependencies
/* *****************************************************
   This is the example on page 305 of Hennessy &
   Patterson but running on an Intel Machine
   ***************************************************** */
#define MAX 1000
#define ITER 100000
int main( int argc, char *argv[] )
{
    double x[MAX + 2];
    double s = 3.14159;
    int i, j;

    for ( i = MAX; i > 0; i-- ) /* Init array */
        x[i] = 0;

    for ( j = ITER; j > 0; j-- )
        for ( i = MAX; i > 0; i-- )
            x[i] = x[i] + s;
    return 0;
}
This is the GCC optimized code
.L15:
fldl (%ecx,%eax)
fadd %st(1),%st
decl %edx
fstpl (%ecx,%eax)
addl $-8,%eax
testl %edx,%edx
jg .L15
This is the ICC optimized code
.L2:
fstpl 8(%esp,%edx,8)
fldl (%esp,%edx,8)
fadd %st(1), %st
fldl -8(%esp,%edx,8)
fldl -16(%esp,%edx,8)
fldl -24(%esp,%edx,8)
fldl -32(%esp,%edx,8)
fxch %st(4)
fstpl (%esp,%edx,8)
fxch %st(2)
fadd %st(4), %st
fstpl -8(%esp,%edx,8)
fadd %st(3), %st
fstpl -16(%esp,%edx,8)
fadd %st(2), %st
fstpl -24(%esp,%edx,8)
fadd %st(1), %st
addl $-5, %edx
testl %edx, %edx
jg .L2 # Prob 99%
fstpl 8(%esp,%edx,8)
Example 1
Elapsed seconds = 0.590026
Elapsed seconds = 0.122848
Example 2
// Example on Page 320
get_current_time( &start_time );
for ( j = ITER; j > 0; j-- )
{
for ( i = 1; i <= MAX; i++ )
{
A[i+1] = A[i] + C[i];
B[i+1] = B[i] + A[i+1];
}
}
get_current_time( &end_time );
There are two dependencies here. What are they?
This is GCC optimized code
.L55:
fldl -8(%esi,%eax)
faddl -8(%edi,%eax)
fstl (%esi,%eax)
faddl -8(%ecx,%eax)
incl %edx
fstpl (%ecx,%eax)
addl $8,%eax
cmpl $1000,%edx
jle .L55
This is the ICC optimized code
.L4:
fstpl 25368(%esp,%edx,8)
fldl 8472(%esp,%edx,8)
faddl 16920(%esp,%edx,8)
fldl 25368(%esp,%edx,8)
fldl 16928(%esp,%edx,8)
fxch %st(2)
fstl 8480(%esp,%edx,8)
fadd %st, %st(1)
fxch %st(1)
fstl 25376(%esp,%edx,8)
fxch %st(2)
faddp %st, %st(1)
fstl 8488(%esp,%edx,8)
faddp %st, %st(1)
addl $2, %edx
cmpl $1000, %edx
jle .L4 # Prob 99%
fstpl 25368(%esp,%edx,8)
Example 2
This is Microsoft optimized code
$L1225:
fld QWORD PTR _C$[esp+eax+40108]
add eax, 8
cmp eax, 7992
fadd QWORD PTR _A$[esp+eax+40100]
fst QWORD PTR _A$[esp+eax+40108]
fadd QWORD PTR _B$[esp+eax+40100]
fstp QWORD PTR _B$[esp+eax+40108]
jle $L1225
Elapsed seconds = 0.664073
Elapsed seconds = 1.357084
Example 3
// Example on Page 321
get_current_time( &start_time );
for ( j = ITER; j > 0; j-- )
{
for ( i = 1; i <= MAX; i++ )
{
A[i] = A[i] + B[i];
B[i+1] = C[i] + D[i];
}
}
get_current_time( &end_time );
What are the dependencies here??
This is the GCC optimized code
.L65:
fldl (%esi,%eax)
faddl (%ecx,%eax)
fstpl (%esi,%eax)
movl -40100(%ebp),%edi
fldl (%edi,%eax)
movl -40136(%ebp),%edi
faddl (%edi,%eax)
incl %edx
fstpl 8(%ecx,%eax)
addl $8,%eax
cmpl $1000,%edx
jle .L65
This is the ICC optimized code
.L6:
fstpl 8464(%esp,%edx,8)
fldl 8472(%esp,%edx,8)
faddl 25368(%esp,%edx,8)
fldl 16920(%esp,%edx,8)
faddl 33824(%esp,%edx,8)
fldl 8480(%esp,%edx,8)
fldl 16928(%esp,%edx,8)
faddl 33832(%esp,%edx,8)
fxch %st(3)
fstpl 8472(%esp,%edx,8)
fxch %st(1)
fstl 25376(%esp,%edx,8)
fxch %st(2)
fstpl 25384(%esp,%edx,8)
faddp %st, %st(1)
addl $2, %edx
cmpl $1000, %edx
jle .L6 # Prob 99%
fstpl 8464(%esp,%edx,8)
Example 3
Elapsed seconds = 0.325419
Elapsed seconds = 1.370478
Example 4
// Example on Page 322
get_current_time( &start_time );
for ( j = ITER; j > 0; j-- )
{
A[1] = A[1] + B[1];
for ( i = 1; i <= MAX - 1; i++ )
{
B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[MAX+1] = C[MAX] + D[MAX];
}
get_current_time( &end_time );
How many dependencies here??
Elapsed seconds = 1.200525
This is the GCC optimized code
.L75:
movl -40136(%ebp),%edi
fldl -8(%edi,%eax)
faddl -8(%esi,%eax)
movl -40104(%ebp),%edi
fstl (%edi,%eax)
faddl (%ecx,%eax)
incl %edx
fstpl (%ecx,%eax)
addl $8,%eax
cmpl $999,%edx
jle .L75
This is the Microsoft optimized code
$L1239:
fld QWORD PTR _D$[esp+eax+40108]
add eax, 8
cmp eax, 7984 ;00001f30H
fadd QWORD PTR _C$[esp+eax+40100]
fst QWORD PTR _B$[esp+eax+40108]
fadd QWORD PTR _A$[esp+eax+40108]
fstp QWORD PTR _A$[esp+eax+40108]
jle SHORT $L1239
Example 4
Elapsed seconds = 1.200525
This is the ICC optimized code
.L8:
fstpl 8472(%esp,%edx,8)
fldl 16920(%esp,%edx,8)
faddl 33824(%esp,%edx,8)
fldl 8480(%esp,%edx,8)
fldl 16928(%esp,%edx,8)
faddl 33832(%esp,%edx,8)
fldl 8488(%esp,%edx,8)
fldl 16936(%esp,%edx,8)
faddl 33840(%esp,%edx,8)
fldl 8496(%esp,%edx,8)
fxch %st(5)
fstl 25376(%esp,%edx,8)
fxch %st(3)
fstl 25384(%esp,%edx,8)
fxch %st(1)
fstl 25392(%esp,%edx,8)
fxch %st(3)
faddp %st, %st(4)
fxch %st(3)
fstpl 8480(%esp,%edx,8)
faddp %st, %st(2)
fxch %st(1)
fstpl 8488(%esp,%edx,8)
faddp %st, %st(1)
addl $3, %edx
cmpl $999, %edx
jle .L8
fstpl 8472(%esp,%edx,8)
Example 4
Elapsed seconds = 0.359232
Static Multiple Issue
4.1 Compiler Techniques for Exposing ILP
4.3 Static Multiple Issue: VLIW
4.4 Advanced Compiler Support for ILP
4.5 Hardware Support for Exposing more Parallelism
Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.
Flavor I: Superscalar processors issue a varying number of instructions per clock (1 to 8). They can be either statically scheduled (by the compiler) or dynamically scheduled (by the hardware, e.g. Tomasulo).
Examples: IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
Issuing Multiple Instructions/Cycle
Flavor II:
VLIW - Very Long Instruction Word - issues a fixed number of instructions formatted either as one very large instruction or as a fixed packet of smaller instructions.
A fixed number of instructions (4 to 16) is scheduled by the compiler, which puts the operations into wide templates.
– Joint HP/Intel agreement in 1999/2000
– Intel Architecture-64 (IA-64) 64-bit address
– Style: “Explicitly Parallel Instruction Computer (EPIC)”
Multiple Issue
Issuing Multiple Instructions/Cycle
Flavor II - continued:
• 3 Instructions in 128 bit “groups”; field determines if instructions dependent or independent
– Smaller code size than old VLIW, larger than x86/RISC
– Groups can be linked to show independence > 3 instr
• 64 integer registers + 64 floating point registers
– Not separate files per functional unit as in old VLIW
• Hardware checks dependencies (interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mis-predictions?
• IA-64 : name of instruction set architecture; EPIC is type
• Merced is name of first implementation (1999/2000?)
Issuing Multiple Instructions/Cycle
Multiple Issue: A Superscalar Version of MIPS
In our MIPS example, we can handle 2 instructions/cycle:
• Floating Point
• Anything Else
– Fetch 64 bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– More ports for FP registers to do FP load & FP op in a pair
Type               Pipe Stages
Int. instruction   IF  ID  EX  MEM WB
FP instruction     IF  ID  EX  MEM WB
Int. instruction       IF  ID  EX  MEM WB
FP instruction         IF  ID  EX  MEM WB
Int. instruction           IF  ID  EX  MEM WB
FP instruction             IF  ID  EX  MEM WB
• 1 cycle load delay causes delay to 3 instructions in Superscalar
– instruction in right half can’t use it, nor instructions in next slot
Unrolled Loop Minimizes Stalls for Scalar
 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,Loop
14       SD   8(R1),F16   ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
Latencies:
LD to ADDD: 1 Cycle
ADDD to SD: 2 Cycles
Loop Unrolling in Superscalar
       Integer instruction   FP instruction     Clock cycle
Loop:  LD   F0,0(R1)                             1
       LD   F6,-8(R1)                            2
       LD   F10,-16(R1)      ADDD F4,F0,F2       3
       LD   F14,-24(R1)      ADDD F8,F6,F2       4
       LD   F18,-32(R1)      ADDD F12,F10,F2     5
       SD   0(R1),F4         ADDD F16,F14,F2     6
       SD   -8(R1),F8        ADDD F20,F18,F2     7
       SD   -16(R1),F12                          8
       SD   -24(R1),F16                          9
       SUBI R1,R1,#40                           10
       BNEZ R1,LOOP                             11
       SD   8(R1),F20                           12
• Unrolled 5 times to avoid delays (+1 due to SS)
• 12 clocks, or 2.4 clocks per iteration
Dynamic Scheduling in Superscalar
Multiple Issue: Multiple Instruction Issue & Dynamic Scheduling
Code compiled for the scalar version will run poorly on a superscalar.
May want the code to vary depending on how superscalar the machine is.
Simple approach: separate Tomasulo control, with separate reservation stations, for the Integer FU/Reg and for the FP FU/Reg
Dynamic Scheduling in Superscalar
• How to do instruction issue with two instructions and keep in-order instruction issue for Tomasulo?
– Issue 2X Clock Rate, so that issue remains in order
– Only FP loads might cause dependency between integer and FP issue:
• Replace load reservation station with a load queue; operands must be read in the order they are fetched
• Load checks addresses in Store Queue to avoid RAW violation
• Store checks addresses in Load Queue to avoid WAR,WAW
Performance of Dynamic Superscalar
Iteration   Instructions       Issues   Executes   Writes result
no.                            (clock-cycle number)
1           LD   F0,0(R1)      1        2          4
1           ADDD F4,F0,F2      1        5          8
1           SD   0(R1),F4      2        9
1           SUBI R1,R1,#8      3        4          5
1           BNEZ R1,LOOP       4        5
2           LD   F0,0(R1)      5        6          8
2           ADDD F4,F0,F2      5        9          12
2           SD   0(R1),F4      6        13
2           SUBI R1,R1,#8      7        8          9
2           BNEZ R1,LOOP       8        9
4 clocks per iteration
Branches, Decrements still take 1 clock cycle
Loop Unrolling in VLIW
Memory reference 1   Memory reference 2   FP operation 1     FP op. 2           Int. op/branch    Clock
LD F0,0(R1)          LD F6,-8(R1)                                                                 1
LD F10,-16(R1)       LD F14,-24(R1)                                                               2
LD F18,-32(R1)       LD F22,-40(R1)       ADDD F4,F0,F2      ADDD F8,F6,F2                        3
LD F26,-48(R1)                            ADDD F12,F10,F2    ADDD F16,F14,F2                      4
                                          ADDD F20,F18,F2    ADDD F24,F22,F2                      5
SD 0(R1),F4          SD -8(R1),F8         ADDD F28,F26,F2                                         6
SD -16(R1),F12       SD -24(R1),F16                                                               7
SD -32(R1),F20       SD -40(R1),F24                                             SUBI R1,R1,#48    8
SD -0(R1),F28                                                                   BNEZ R1,LOOP      9
• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration
• Need more registers to effectively use VLIW
Multiple Issue: VLIW
Limits to Multi-Issue Machines
• Inherent limitations of ILP
– 1 branch in 5 instructions => how to keep a 5-way VLIW busy?
– Latencies of units => many operations must be scheduled
– Need about Pipeline Depth x No. Functional Units of independent operations to keep machines busy.
• Difficulties in building HW
– Duplicate Functional Units to get parallel execution
– Increase ports to Register File (VLIW example needs 6 read and 3 write for Int. Reg. & 6 read and 4 write for FP Reg.)
– Increase ports to memory
– Decoding SS and impact on clock rate, pipeline depth
Multiple Issue: Limitations With Multiple Issue
Limits to Multi-Issue Machines
• Limitations specific to either SS or VLIW implementation
– Decode issue in SS
– VLIW code size: unroll loops + wasted fields in VLIW
– VLIW lock step => 1 hazard & all instructions stall
– VLIW & binary compatibility
Multiple Issue Challenges
• While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
– Exactly 50% FP operations
– No hazards
• If more instructions issue at same time, greater difficulty of decode and issue
– Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
• VLIW: tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
• 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches
Compiler Support For ILP
4.1 Compiler Techniques for Exposing ILP
4.3 Static Multiple Issue: VLIW
4.4 Advanced Compiler Support for ILP
4.5 Hardware Support for Exposing more Parallelism
How can compilers be smart?
1. Produce good scheduling of code.
2. Determine which loops might contain parallelism.
3. Eliminate name dependencies.
Compilers must be REALLY smart to figure out aliases; pointers in C are a real problem.
Techniques lead to:
Symbolic Loop Unrolling
Critical Path Scheduling
Software Pipelining
• Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW)
[Figure: Iteration 0 through Iteration 4 of the original loop, with one software-pipelined iteration drawn across them]
Compiler Support For ILP: Symbolic Loop Unrolling
SW Pipelining Example
Before: Unrolled 3 times
 1 LD   F0,0(R1)
 2 ADDD F4,F0,F2
 3 SD   0(R1),F4
 4 LD   F6,-8(R1)
 5 ADDD F8,F6,F2
 6 SD   -8(R1),F8
 7 LD   F10,-16(R1)
 8 ADDD F12,F10,F2
 9 SD   -16(R1),F12
10 SUBI R1,R1,#24
11 BNEZ R1,LOOP
After: Software Pipelined
   LD   F0,0(R1)
   ADDD F4,F0,F2
   LD   F0,-8(R1)
 1 SD   0(R1),F4    ; Stores M[i]
 2 ADDD F4,F0,F2    ; Adds to M[i-1]
 3 LD   F0,-16(R1)  ; loads M[i-2]
 4 SUBI R1,R1,#8
 5 BNEZ R1,LOOP
   SD   0(R1),F4
   ADDD F4,F0,F2
   SD   -8(R1),F4
[Figure: overlapped pipeline stages (IF ID EX Mem WB) of SD, ADDD and LD, marking Read F4, Write F4, Read F0, Write F0]
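The same shape can be written in C to make the start-up code, the steady-state loop body and the wind-down code visible. This is a sketch only: the function name is invented, the array is walked upward instead of downward, and arrays shorter than three elements fall back to the plain loop.

```c
/* Software-pipelined form of x[i] = x[i] + s.
   Each trip of the main loop does the store for element i, the add for
   element i+1 and the load for element i+2. */
void add_scalar_swp(double *x, int n, double s) {
    if (n < 3) {                      /* too short to pipeline */
        for (int i = 0; i < n; i++)
            x[i] = x[i] + s;
        return;
    }

    /* Start-up code: fill the pipeline. */
    double loaded = x[0];             /* LD   for element 0 */
    double sum = loaded + s;          /* ADDD for element 0 */
    loaded = x[1];                    /* LD   for element 1 */

    /* Steady state: three instructions from three different iterations. */
    for (int i = 0; i < n - 2; i++) {
        x[i] = sum;                   /* SD   for element i   */
        sum = loaded + s;             /* ADDD for element i+1 */
        loaded = x[i + 2];            /* LD   for element i+2 */
    }

    /* Wind-down code: drain the pipeline. */
    x[n - 2] = sum;                   /* SD   for element n-2 */
    sum = loaded + s;                 /* ADDD for element n-1 */
    x[n - 1] = sum;                   /* SD   for element n-1 */
}
```

Within one trip of the main loop no statement waits on a value produced in that same trip, which is what removes the load-to-add and add-to-store stalls.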
SW Pipelining Example
Symbolic Loop Unrolling
– Less code space
– Overhead paid only once vs. each iteration in loop unrolling
[Figure: Software Pipelining vs. Loop Unrolling; 100 iterations = 25 loops with 4 unrolled iterations each]
Trace Scheduling
• Parallelism across IF branches vs. LOOP branches
• Two steps:
– Trace Selection
• Find likely sequence of basic blocks (trace) of (statically predicted or profile predicted) long sequence of straight-line code
– Trace Compaction
• Squeeze trace into few VLIW instructions
• Need bookkeeping code in case prediction is wrong
• Compiler undoes bad guess (discards values in registers)
• Subtle compiler bugs mean wrong answer vs. poorer performance; no hardware interlocks
Compiler Support For ILP: Critical Path Scheduling
Hardware Support For Parallelism
4.1 Compiler Techniques for Exposing ILP
4.3 Static Multiple Issue: VLIW
4.4 Advanced Compiler Support for ILP
4.5 Hardware Support for Exposing more Parallelism
Software support of ILP is best when code is predictable at compile time.
But what if there’s no predictability?
Here we’ll talk about hardware techniques. These include:
• Conditional or Predicated Instructions
• Hardware Speculation
Tell the Hardware To Ignore An Instruction
• Avoid branch prediction by turning branches into conditionally executed instructions:
    IF (x) then A = B op C else NOP
– If false, then neither store result nor cause exception
– Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move. PA-RISC can annul any following instruction.
– IA-64: 64 1-bit condition fields selected so conditional execution of any instruction
• Drawbacks to conditional instructions:
– Still takes a clock, even if “annulled”
– Stalls if condition evaluated late
– Complex conditions reduce effectiveness; condition becomes known late in pipeline.
This can be a major win because there is no time lost by taking a branch!!
Hardware Support For Parallelism: Nullified Instructions
Tell the Hardware To Ignore An Instruction
Suppose we have the code:
if ( VarA == 0 )
    VarS = VarT;
Previous Method:
    LD   R1, VarA
    BNEZ R1, Label
    LD   R2, VarT
    SD   VarS, R2
Label:
Nullified Method (Compare and Nullify Next Instr. If Not Zero):
    LD     R1, VarA
    LD     R2, VarT
    CMPNNZ R1, #0
    SD     VarS, R2
Label:
Nullified Method (Compare and Move IF Zero):
    LD    R1, VarA
    LD    R2, VarT
    CMOVZ VarS,R2, R1
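The idea behind the conditional move can be imitated in C by computing a select instead of branching. This is a sketch with made-up function names; whether a given compiler turns either version into a conditional-move instruction is not guaranteed and depends on the target and optimization settings.

```c
/* Branching form: the assignment happens only if the test succeeds. */
int assign_branch(int var_a, int var_s, int var_t) {
    if (var_a == 0)
        var_s = var_t;
    return var_s;
}

/* Branch-free form: both candidate values are already available and a
   mask picks one of them, in the spirit of "move if zero". */
int assign_select(int var_a, int var_s, int var_t) {
    int mask = -(var_a == 0);   /* all ones when var_a == 0, otherwise 0 */
    return (var_t & mask) | (var_s & ~mask);
}
```

As on the slide, the branch-free form always does the work of loading both values, even when the move turns out not to be needed.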
The theory here is to move an instruction across a branch so as to increase the size of a basic block and thus to increase parallelism.
Primary difficulty is in avoiding exceptions. For example,
if ( a != 0 ) c = b/a;
may cause a divide-by-zero error if the divide is moved ahead of the test.
Methods for increasing speculation include:
1. Use a set of status bits (poison bits) associated with the registers. They signal that an instruction’s result is invalid until some later time.
2. The result of an instruction isn’t written until it’s certain the instruction is no longer speculative.
Compiler Speculation
Increasing Parallelism
Example on Page 305.
Code for
if ( A == 0 )
A = B;
else
A = A + 4;
Assume A is at 0(R3) and B is at 0(R2)
Original Code:
    LW   R1, 0(R3)     ; Load A
    BNEZ R1, L1        ; Test A
    LW   R1, 0(R2)     ; If Clause
    J    L2            ; Skip Else
L1: ADDI R1, R1, #4    ; Else Clause
L2: SW   0(R3), R1     ; Store A
Speculated Code:
    LW   R1, 0(R3)     ; Load A
    LW   R14, 0(R2)    ; Spec Load B
    BEQZ R1, L3        ; Other if Branch
    ADDI R14, R1, #4   ; Else Clause
L3: SW   0(R3), R14    ; Non-Spec Store
Note here that only ONE side needs to take a branch!!
Compiler Speculation: Poison Bits
In the example on the last page, if the LW* produces an exception, a poison bit is set on that register. Then, if a later instruction tries to use the register, an exception is raised at that point.
Speculated Code:
    LW   R1, 0(R3)     ; Load A
    LW*  R14, 0(R2)    ; Spec Load B
    BEQZ R1, L3        ; Other if Branch
    ADDI R14, R1, #4   ; Else Clause
L3: SW   0(R3), R14    ; Non-Spec Store
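The deferred-exception behavior can be modeled in a few lines of C. Everything here is an assumption made for illustration: the `Reg` struct, the function names, the use of a NULL pointer to stand for any faulting address, and the rule that an ordinary write to a register clears its poison bit.

```c
#include <stdbool.h>
#include <stddef.h>

/* A register value with its poison bit. */
typedef struct {
    int value;
    bool poison;
} Reg;

/* Speculative load (LW*): a faulting address sets the poison bit instead
   of raising an exception. NULL stands in for a faulting address. */
Reg spec_load(const int *addr) {
    Reg r = {0, false};
    if (addr == NULL)
        r.poison = true;
    else
        r.value = *addr;
    return r;
}

/* Non-speculative use of a register. Returns false when the register is
   poisoned, which is the point where the exception would be raised. */
bool use_reg(Reg r, int *out) {
    if (r.poison)
        return false;
    *out = r.value;
    return true;
}

/* if (A == 0) A = B; else A = A + 4; with B loaded speculatively.
   Returns false if the deferred exception fires; *a is then unchanged. */
bool update_a(int *a, const int *b) {
    int r1 = *a;                 /* LW   R1,0(R3)   */
    Reg r14 = spec_load(b);      /* LW*  R14,0(R2)  */
    if (r1 != 0) {               /* BEQZ R1,L3 not taken */
        r14.value = r1 + 4;      /* ADDI R14,R1,#4 overwrites R14 */
        r14.poison = false;      /* assumed: a normal write clears poison */
    }
    return use_reg(r14, a);      /* SW   0(R3),R14  */
}
```

When A is nonzero the speculative load's result is never used, so a bad address for B causes no exception at all; the exception appears only on the path that actually stores the loaded value.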
HW support for More ILP
• Need HW buffer for results of uncommitted instructions: reorder buffer
– Reorder buffer can be operand source
– Once operand commits, result is found in register
– 3 fields: instr. type, destination, value
– Use reorder buffer number instead of reservation station
– Discard instructions on mis-predicted branches or on exceptions
[Figure 4.34, page 311. Labels: Reorder Buffer, FP Regs, FP Op Queue, Res Stations (x2), FP Adder (x2)]
Hardware Support For Parallelism: Hardware Speculation
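The bullets above can be sketched as a small circular buffer. This is a simplified model, not the design in the figure: the type and function names, the buffer size, and the integer instruction-type code are all hypothetical, and a flush here throws away every uncommitted entry, whereas real hardware would discard only the entries that follow the mispredicted branch.

```c
#include <stdbool.h>

#define ROB_SIZE 8   /* arbitrary size for the sketch */

/* One entry: instruction type, destination register, value (plus a flag
   that says whether the value has been produced yet). */
typedef struct {
    int type;
    int dest;
    double value;
    bool ready;
} RobEntry;

typedef struct {
    RobEntry e[ROB_SIZE];
    int head;    /* oldest uncommitted entry */
    int count;   /* number of uncommitted entries */
} Rob;

void rob_init(Rob *r) { r->head = 0; r->count = 0; }

/* Allocate an entry at issue. Returns the entry number that later
   instructions use to name this result, or -1 if the buffer is full. */
int rob_issue(Rob *r, int type, int dest) {
    if (r->count == ROB_SIZE)
        return -1;
    int idx = (r->head + r->count) % ROB_SIZE;
    r->e[idx] = (RobEntry){type, dest, 0.0, false};
    r->count++;
    return idx;
}

/* Execution finished: the value is buffered but not yet architectural. */
void rob_write_result(Rob *r, int idx, double value) {
    r->e[idx].value = value;
    r->e[idx].ready = true;
}

/* Commit the oldest entry if its value is ready, writing the register
   file. Returns false if nothing could commit. */
bool rob_commit(Rob *r, double *regs) {
    if (r->count == 0 || !r->e[r->head].ready)
        return false;
    regs[r->e[r->head].dest] = r->e[r->head].value;
    r->head = (r->head + 1) % ROB_SIZE;
    r->count--;
    return true;
}

/* Mispredicted branch or exception: discard all uncommitted results. */
void rob_flush(Rob *r) { r->count = 0; }
```

Results may be written into the buffer in any order, but registers change only at commit, in program order, which is what lets speculative work be discarded cleanly.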
HW support for More ILP
How is this used in practice?
Rather than predicting the direction of a branch, execute the instructions on both sides!!
We know the target of a branch early on, long before we know if it will be taken or not.
So begin fetching/executing at that new Target PC.
But also continue fetching/executing as if the branch NOT taken.
Summary
4.1 Compiler Techniques for Exposing ILP
4.3 Static Multiple Issue: VLIW
4.4 Advanced Compiler Support for ILP
4.5 Hardware Support for Exposing more Parallelism