Lecture: Static ILP
• Topics: loop unrolling, VLIW, software pipelines, predication
Scheduled and Unrolled Loop
Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D    F12, 16(R1)
      BNE    R1, R2, Loop
      S.D    F16, 8(R1)
• Execution time: 14 cycles or 3.5 cycles per original iteration
Latencies:
LD     -> any : 1 stall
FPALU  -> any : 3 stalls
FPALU  -> ST  : 2 stalls
IntALU -> BR  : 1 stall
Loop Unrolling
• Increases program size
• Requires more registers
• To unroll an n-iteration loop by degree k, we will need floor(n/k) iterations of the larger loop, followed by (n mod k) iterations of the original loop
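The iteration counts in the last bullet can be made concrete in C. This is a minimal sketch, assuming the add-a-scalar loop from the scheduled example and a fixed unroll degree of k = 4; the function names are illustrative and not part of the lecture.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: adds the scalar s to one element per iteration. */
void add_scalar_rolled(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Same computation unrolled by degree k = 4 (assumed degree). */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(x != NULL || n == 0);
    size_t i = 0;
    size_t big_trips = n / 4;      /* floor(n/k) iterations of the larger loop */
    for (size_t t = 0; t < big_trips; t++) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
        i += 4;                    /* one index update and one branch per 4 elements */
    }
    for (; i < n; i++)             /* (n mod k) leftover iterations of the original loop */
        x[i] = x[i] + s;
}
```

With n = 10 the larger loop runs 2 times (8 elements) and the original loop handles the remaining 2. The four statements in the larger body touch different elements, which is what lets a scheduler interleave their loads, adds, and stores.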
Automating Loop Unrolling
• Determine the dependences across iterations: in the example, we knew that loads and stores in different iterations did not conflict and could be re-ordered
• Determine if unrolling will help: this is possible only if iterations are independent
• Determine address offsets for different loads/stores
• Dependency analysis to schedule code without introducing hazards; eliminate name dependences by using additional registers
Problem 2
Source code:
for (i=1000; i>0; i--)
    x[i] = y[i] * s;
Assembly code:
Loop: L.D    F0, 0(R1)      ; F0 = array element
      MUL.D  F4, F0, F2     ; multiply scalar
      S.D    F4, 0(R2)      ; store result
      DADDUI R1, R1, #-8    ; decrement address pointer
      DADDUI R2, R2, #-8    ; decrement address pointer
      BNE    R1, R3, Loop   ; branch if R1 != R3
      NOP
Latencies:
LD     -> any : 1 stall
FPMUL  -> any : 5 stalls
FPMUL  -> ST  : 4 stalls
IntALU -> BR  : 1 stall
• How many unrolls does it take to avoid stall cycles?
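Degree 2: LD LD MUL MUL DA DA 1s SD BNE SD
          (1s is one stall cycle; 10 cycles for 2 iterations)
Degree 3: LD LD LD MUL MUL MUL DA DA SD SD BNE SD
          (no stalls; 12 cycles for 3 iterations)
• Unrolling by degree 3 is enough to avoid stall cycles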
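The two schedules can be checked mechanically. The sketch below is a toy model, not a description of any real pipeline: it assumes single issue, in-order execution, and the stall counts from the latency table, and it ignores dependences carried into the next trip of the loop. The names Instr, body_cycles, degree2_cycles, and degree3_cycles are illustrative, and the register numbers for the unrolled copies are assumptions, since the answer lists only mnemonics.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction classes that appear in the latency table. */
typedef enum { LD, FPMUL, ST, INTALU, BR } Kind;
enum { NONE = -1 };
#define R(n) (n)          /* integer register Rn */
#define F(n) (100 + (n))  /* FP register Fn, kept apart from Rn */

typedef struct {
    Kind kind;
    int dst;     /* register written, or NONE */
    int src[2];  /* registers read, or NONE */
} Instr;

/* Stall cycles needed between a producer and a dependent consumer. */
static int stalls_between(Kind producer, Kind consumer) {
    switch (producer) {
    case LD:     return 1;                       /* LD -> any */
    case FPMUL:  return consumer == ST ? 4 : 5;  /* FPMUL -> ST, FPMUL -> any */
    case INTALU: return consumer == BR ? 1 : 0;  /* IntALU -> BR */
    default:     return 0;                       /* ST and BR write no register */
    }
}

/* Cycle (1-based) in which the last instruction of one loop body issues. */
int body_cycles(const Instr *code, size_t n) {
    int issue[32];
    assert(n > 0 && n <= 32);
    for (size_t i = 0; i < n; i++) {
        int earliest = (i == 0) ? 1 : issue[i - 1] + 1;  /* in order, one per cycle */
        for (int s = 0; s < 2; s++) {
            if (code[i].src[s] == NONE) continue;
            for (size_t j = i; j-- > 0; ) {              /* nearest earlier writer */
                if (code[j].dst == code[i].src[s]) {
                    int ready = issue[j] + 1
                              + stalls_between(code[j].kind, code[i].kind);
                    if (ready > earliest) earliest = ready;
                    break;
                }
            }
        }
        issue[i] = earliest;
    }
    return issue[n - 1];
}

/* Degree 2: LD LD MUL MUL DA DA SD BNE SD */
int degree2_cycles(void) {
    const Instr body[] = {
        { LD,     F(0), { R(1), NONE } },
        { LD,     F(6), { R(1), NONE } },
        { FPMUL,  F(4), { F(0), F(2) } },
        { FPMUL,  F(8), { F(6), F(2) } },
        { INTALU, R(1), { R(1), NONE } },
        { INTALU, R(2), { R(2), NONE } },
        { ST,     NONE, { F(4), R(2) } },
        { BR,     NONE, { R(1), R(3) } },
        { ST,     NONE, { F(8), R(2) } },   /* branch delay slot */
    };
    return body_cycles(body, sizeof body / sizeof body[0]);
}

/* Degree 3: LD LD LD MUL MUL MUL DA DA SD SD BNE SD */
int degree3_cycles(void) {
    const Instr body[] = {
        { LD,     F(0),  { R(1),  NONE } },
        { LD,     F(6),  { R(1),  NONE } },
        { LD,     F(10), { R(1),  NONE } },
        { FPMUL,  F(4),  { F(0),  F(2) } },
        { FPMUL,  F(8),  { F(6),  F(2) } },
        { FPMUL,  F(12), { F(10), F(2) } },
        { INTALU, R(1),  { R(1),  NONE } },
        { INTALU, R(2),  { R(2),  NONE } },
        { ST,     NONE,  { F(4),  R(2) } },
        { ST,     NONE,  { F(8),  R(2) } },
        { BR,     NONE,  { R(1),  R(3) } },
        { ST,     NONE,  { F(12), R(2) } },  /* branch delay slot */
    };
    return body_cycles(body, sizeof body / sizeof body[0]);
}
```

Under these assumptions the degree-2 body has 9 instructions and takes 10 cycles, so one cycle is a stall, while the degree-3 body has 12 instructions and takes 12 cycles.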
Superscalar Pipelines
Integer pipeline: handles L.D, S.D, DADDUI, BNE
FP pipeline: handles ADD.D
• What is the schedule with an unroll degree of 5?
        Integer pipeline        FP pipeline
Loop:   L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)     ADD.D F4, F0, F2
        L.D    F14, -24(R1)     ADD.D F8, F6, F2
        L.D    F18, -32(R1)     ADD.D F12, F10, F2
        S.D    F4, 0(R1)        ADD.D F16, F14, F2
        S.D    F8, -8(R1)       ADD.D F20, F18, F2
        S.D    F12, -16(R1)
        DADDUI R1, R1, #-40
        S.D    F16, 16(R1)
        BNE    R1, R2, Loop
        S.D    F20, 8(R1)
• Need unroll by degree 5 to eliminate stalls (fewer if we move DADDUI up)
• The compiler may specify instructions that can be issued as one packet
• The compiler may specify a fixed number of instructions in each packet: Very Large Instruction Word (VLIW)
Problem 3
Source code:
for (i=1000; i>0; i--)
    x[i] = y[i] * s;
Assembly code:
Loop: L.D    F0, 0(R1)      ; F0 = array element
      MUL.D  F4, F0, F2     ; multiply scalar
      S.D    F4, 0(R2)      ; store result
      DADDUI R1, R1, #-8    ; decrement address pointer
      DADDUI R2, R2, #-8    ; decrement address pointer
      BNE    R1, R3, Loop   ; branch if R1 != R3
      NOP
Latencies:
LD     -> any : 1 stall
FPMUL  -> any : 5 stalls
FPMUL  -> ST  : 4 stalls
IntALU -> BR  : 1 stall
• How many unrolls does it take to avoid stalls in the superscalar pipeline?
Integer pipeline   FP pipeline
LD
LD
LD                 MUL
LD                 MUL
LD                 MUL
LD                 MUL
LD                 MUL
SD                 MUL
• 7 unrolls. Could also make do with 5 if we moved up the DADDUIs.
Software Pipeline?!
[Figure: repeated iterations of the loop, each an L.D, ADD.D, S.D group followed by DADDUI, BNE]
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
Software Pipeline
[Figure: original iterations 1 to 4, each an L.D, ADD.D, S.D sequence, drawn staggered in time; new iterations 1 to 4 each combine instructions taken from different original iterations]
Software Pipelining
Original loop:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
Software-pipelined loop:
Loop: S.D    F4, 16(R1)
      ADD.D  F4, F0, F2
      L.D    F0, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
• Advantages: achieves nearly the same effect as loop unrolling, but without the code expansion. An unrolled loop may have inefficiencies at the start and end of each iteration, while a sw-pipelined loop is almost always in steady state. A sw-pipelined loop can also be unrolled to reduce loop overhead
• Disadvantages: does not reduce loop overhead, may require more registers
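The same transformation can be sketched in C, where the start-up and wind-down code that surrounds the steady-state loop becomes visible. This is a minimal sketch under stated assumptions: it walks the array with ascending indices instead of a decrementing pointer, and the function names and the variables loaded and sum (standing in for F0 and F4) are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: x[i] = x[i] + s, one load, add, and store per iteration. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Software-pipelined version: each trip of the steady-state loop stores the
 * result for element i, adds for element i+1, and loads element i+2. */
void add_scalar_swp(double *x, size_t n, double s) {
    assert(x != NULL || n == 0);
    if (n < 2) {                   /* too short to overlap anything */
        add_scalar(x, n, s);
        return;
    }
    /* Prologue: fill the pipeline. */
    double loaded = x[0];          /* load for element 0 */
    double sum = loaded + s;       /* add for element 0 */
    loaded = x[1];                 /* load for element 1 */

    /* Steady state: store, add, load belong to three different elements. */
    for (size_t i = 0; i + 2 < n; i++) {
        x[i] = sum;                /* S.D   for element i     */
        sum = loaded + s;          /* ADD.D for element i + 1 */
        loaded = x[i + 2];         /* L.D   for element i + 2 */
    }

    /* Epilogue: drain the pipeline. */
    x[n - 2] = sum;
    x[n - 1] = loaded + s;
}
```

Within one trip of the steady-state loop no statement consumes a value produced in that same trip, so the load and add latencies are hidden behind the other work. The loop still executes one index update and one branch per element, which is why the loop overhead is not reduced.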
Problem 4
Source code:
for (i=1000; i>0; i--)
    x[i] = y[i] * s;
Assembly code:
Loop: L.D    F0, 0(R1)      ; F0 = array element
      MUL.D  F4, F0, F2     ; multiply scalar
      S.D    F4, 0(R2)      ; store result
      DADDUI R1, R1, #-8    ; decrement address pointer
      DADDUI R2, R2, #-8    ; decrement address pointer
      BNE    R1, R3, Loop   ; branch if R1 != R3
      NOP
Latencies:
LD     -> any : 1 stall
FPMUL  -> any : 5 stalls
FPMUL  -> ST  : 4 stalls
IntALU -> BR  : 1 stall
• Show the software-pipelined version of the code. Does it cause stalls?
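Loop: S.D    F4, 0(R2)
      MUL.D  F4, F0, F2
      L.D    F0, 0(R1)
      DADDUI R2, R2, #-8
      BNE    R1, R3, Loop
      DADDUI R1, R1, #-8
• There will be no stalls: the S.D that consumes each MUL.D result issues in the next iteration, four instructions after the MUL.D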
Predication
• A branch within a loop can be problematic to schedule
• Control dependences are a problem because of the need to re-fetch on a mispredict
• For short loop bodies, control dependences can be converted to data dependences by using predicated/conditional instructions
Predicated or Conditional Instructions
Branching code:
if (R1 == 0)
    R2 = R2 + R4
else
    R6 = R3 + R5
    R4 = R2 + R3
Predicated code:
R7 = !R1
R8 = R2
R2 = R2 + R4   (predicated on R7)
R6 = R3 + R5   (predicated on R1)
R4 = R8 + R3   (predicated on R1)
• The instruction has an additional operand that determines whether the instruction completes or gets converted into a no-op
• Example: lwc R1, 0(R2), R3 (load-word-conditional) will load the word at address (R2) into R1 if R3 is non-zero; if R3 is zero, the instruction becomes a no-op
• Replaces a control dependence with a data dependence (branches disappear); may need register copies for the condition or for values used by both directions
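A small C model can confirm that a predicated sequence computes the same register values as the branching code for the example if (R1 == 0) R2 = R2 + R4 else { R6 = R3 + R5; R4 = R2 + R3 }. This is only a sketch: the Regs struct and the helper add_if are hypothetical names that model a predicated add, not instructions from a real ISA, and a C compiler is free to implement the conditional expression with or without a branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register file for the example; unsigned so that wraparound is well defined. */
typedef struct { uint64_t r1, r2, r3, r4, r5, r6; } Regs;

/* Branching version of the example. */
Regs run_branching(Regs r) {
    if (r.r1 == 0) {
        r.r2 = r.r2 + r.r4;
    } else {
        r.r6 = r.r3 + r.r5;
        r.r4 = r.r2 + r.r3;
    }
    return r;
}

/* Model of a predicated add: the sum is always computed, but it is written
 * to *dst only when pred is non-zero; otherwise the operation is a no-op. */
static void add_if(uint64_t pred, uint64_t *dst, uint64_t a, uint64_t b) {
    assert(dst != NULL);
    uint64_t sum = a + b;
    *dst = pred ? sum : *dst;
}

/* Predicated version: straight-line code with no branch on R1. */
Regs run_predicated(Regs r) {
    uint64_t r7 = (r.r1 == 0);          /* R7 = !R1 */
    uint64_t r8 = r.r2;                 /* R8 = R2 (copy of the old value) */
    add_if(r7,   &r.r2, r.r2, r.r4);    /* R2 = R2 + R4 (predicated on R7) */
    add_if(r.r1, &r.r6, r.r3, r.r5);    /* R6 = R3 + R5 (predicated on R1) */
    add_if(r.r1, &r.r4, r8,   r.r3);    /* R4 = R8 + R3 (predicated on R1) */
    return r;
}

/* Field-by-field comparison of two register files. */
int regs_equal(Regs a, Regs b) {
    return a.r1 == b.r1 && a.r2 == b.r2 && a.r3 == b.r3 &&
           a.r4 == b.r4 && a.r5 == b.r5 && a.r6 == b.r6;
}
```

Every add executes on both paths, which illustrates the cost side of predication: the functional units do work whose result may be discarded.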
Problem 1
• Use predication to remove control hazards in this code
if (R1 == 0)
    R2 = R5 + R4
    R3 = R2 + R4
else
    R6 = R3 + R2
R7 = !R1
R6 = R3 + R2   (predicated on R1)
R2 = R5 + R4   (predicated on R7)
R3 = R2 + R4   (predicated on R7)
Complications
• Each instruction has one more input operand: more register ports/bypassing
• If the branch condition is not known, the instruction stalls (remember, these are in-order processors)
• Some implementations allow the instruction to continue without the branch condition and squash/complete later in the pipeline (wasted work)
• Increases register pressure, activity on functional units
• Does not help if the br-condition takes a while to evaluate
Support for Speculation
• When re-ordering instructions, we need hardware support to ensure that an exception is raised at the correct point and that we do not violate memory dependences
[Figure: instruction sequence st, br, ld]
Detecting Exceptions
• Some exceptions require that the program be terminated (memory protection violation), while other exceptions require execution to resume (page faults)
• For a speculative instruction, in the latter case, servicing the exception only implies potential performance loss
• In the former case, you want to defer servicing the exception until you are sure the instruction is not speculative
• Note that a speculative instruction needs a special opcode to indicate that it is speculative