Compiler Techniques for ILP

Date post: 05-Jan-2016
Upload: arista
Page 1: Compiler Techniques for ILP

CIS 662 – Computer Architecture – Fall 2004 - Class 16 – 11/09/04

Compiler Techniques for ILP

So far we have explored dynamic hardware techniques for ILP exploitation:
- BTB and branch prediction
- Dynamic scheduling
  - Scoreboard
  - Tomasulo's algorithm
- Speculation
- Multiple issue

How can compilers help?

Page 2: Compiler Techniques for ILP

Loop Unrolling

Let's look at the code:

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

        ADD    R2, R0, R0
Loop:   L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, Loop

Page 3: Compiler Techniques for ILP

Scheduling On A Simple 5 Stage MIPS

Loop:   L.D    F0, 0(R1)
        stall, wait for F0 value to propagate
        ADD.D  F4, F0, F2
        stall, wait for FP add to be completed
        stall, wait for FP add to be completed
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        stall, wait for R1 value to propagate
        BNE    R1, R2, Loop
        stall one cycle, branch penalty

10 cycles

Page 4: Compiler Techniques for ILP

We Could Rearrange The Instructions

Loop:   L.D    F0, 0(R1)
        stall, wait for F0 value to propagate
        ADD.D  F4, F0, F2
        stall, wait for FP add to be completed
        stall, wait for FP add to be completed
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        stall, wait for R1 value to propagate
        BNE    R1, R2, Loop
        stall one cycle, branch penalty

Interleave the stalled instructions with independent instructions. The best we can achieve is 6 cycles:

Loop:   L.D    F0, 0(R1)
        DADDUI R1, R1, #-8
        ADD.D  F4, F0, F2
        stall
        BNE    R1, R2, Loop
        S.D    F4, 8(R1)

The store displacement becomes 8 because S.D now executes after DADDUI has decremented R1.

6 cycles

Page 5: Compiler Techniques for ILP

Loop Unrolling

Getting more useful instructions into the loop and reducing overhead

Step 1: Put several iterations together

Original loop (assume the branch is taken):

Loop:   L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, Loop

Four iterations back to back:

Loop:   L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, Loop
        L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, Loop
        L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, Loop
        L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, Loop

Page 6: Compiler Techniques for ILP

Loop Unrolling

Step 2: Take out control instructions, adjust offsets

Loop:   L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        L.D    F0, -8(R1)
        ADD.D  F4, F0, F2
        S.D    F4, -8(R1)
        L.D    F0, -16(R1)
        ADD.D  F4, F0, F2
        S.D    F4, -16(R1)
        L.D    F0, -24(R1)
        ADD.D  F4, F0, F2
        S.D    F4, -24(R1)
        DADDUI R1, R1, #-32
        BNE    R1, R2, Loop

Page 7: Compiler Techniques for ILP

Loop Unrolling

Step 3: Rename registers

Loop:   L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        L.D    F6, -8(R1)
        ADD.D  F8, F6, F2
        S.D    F8, -8(R1)
        L.D    F10, -16(R1)
        ADD.D  F12, F10, F2
        S.D    F12, -16(R1)
        L.D    F14, -24(R1)
        ADD.D  F16, F14, F2
        S.D    F16, -24(R1)
        DADDUI R1, R1, #-32
        BNE    R1, R2, Loop

Page 8: Compiler Techniques for ILP

Loop Unrolling

Current loop still has stalls due to RAW dependencies. Each of the four copies stalls exactly as the original body did:

Loop:   L.D    F0, 0(R1)
        stall, wait for F0 value to propagate
        ADD.D  F4, F0, F2
        stall, wait for FP add to be completed
        stall, wait for FP add to be completed
        S.D    F4, 0(R1)
        (same pattern for the copies using F6/F8, F10/F12 and F14/F16)
        DADDUI R1, R1, #-32
        stall, wait for R1 value to propagate
        BNE    R1, R2, Loop
        stall one cycle, branch penalty

14 instructions + 14 stalls = 28 cycles = 7 per iteration

Page 9: Compiler Techniques for ILP

Loop Unrolling

Step 4: Interleave iterations

Loop:   L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)
        L.D    F14, -24(R1)
        ADD.D  F4, F0, F2
        ADD.D  F8, F6, F2
        ADD.D  F12, F10, F2
        ADD.D  F16, F14, F2
        S.D    F4, 0(R1)
        S.D    F8, -8(R1)
        DADDUI R1, R1, #-32
        S.D    F12, 16(R1)
        BNE    R1, R2, Loop
        S.D    F16, 8(R1)

14 cycles = 3.5 per iteration
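The four unrolling steps can also be viewed at the source level. The following is a minimal sketch in C, not the compiler's actual output: the function names and the remainder loop are assumptions added for illustration (the slides avoid a remainder because 1000 is a multiple of 4), and the temporaries t0..t3 play the role of the renamed registers.

```c
#include <stddef.h>

/* Original loop from the slides: x[i] = x[i] + s for i = n down to 1.
   x is indexed 1..n, so the caller must allocate at least n + 1 elements. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Unrolled by 4, with all loads and adds placed ahead of the stores,
   mirroring the interleaved schedule. The first loop is an assumed
   remainder handler for n not divisible by 4. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    size_t i = n;
    for (; i % 4 != 0; i = i - 1)      /* leftover iterations */
        x[i] = x[i] + s;
    for (; i > 0; i = i - 4) {         /* one index update per 4 elements */
        double t0 = x[i]     + s;      /* distinct temporaries = renamed registers */
        double t1 = x[i - 1] + s;
        double t2 = x[i - 2] + s;
        double t3 = x[i - 3] + s;
        x[i]     = t0;
        x[i - 1] = t1;
        x[i - 2] = t2;
        x[i - 3] = t3;
    }
}
```

Both functions leave the array in the same state; the unrolled one executes one loop test and one index update per four elements instead of one per element.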

Page 10: Compiler Techniques for ILP

Loop Unrolling + Multiple Issue

Let's unroll the loop 5 times and mark integer and FP operations (L.D, S.D, DADDUI and BNE issue as integer operations; ADD.D is the FP operation)

Loop:   L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        L.D    F6, -8(R1)
        ADD.D  F8, F6, F2
        S.D    F8, -8(R1)
        L.D    F10, -16(R1)
        ADD.D  F12, F10, F2
        S.D    F12, -16(R1)
        L.D    F14, -24(R1)
        ADD.D  F16, F14, F2
        S.D    F16, -24(R1)
        L.D    F18, -32(R1)
        ADD.D  F20, F18, F2
        S.D    F20, -32(R1)
        DADDUI R1, R1, #-40
        BNE    R1, R2, Loop

Page 11: Compiler Techniques for ILP

Loop Unrolling + Multiple Issue

Move all loads first, then ADD.D, then S.D

Loop:   L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)
        L.D    F14, -24(R1)
        L.D    F18, -32(R1)
        ADD.D  F4, F0, F2
        ADD.D  F8, F6, F2
        ADD.D  F12, F10, F2
        ADD.D  F16, F14, F2
        ADD.D  F20, F18, F2
        S.D    F4, 0(R1)
        S.D    F8, -8(R1)
        S.D    F12, -16(R1)
        S.D    F16, -24(R1)
        S.D    F20, -32(R1)
        DADDUI R1, R1, #-40
        BNE    R1, R2, Loop

Page 12: Compiler Techniques for ILP

Loop Unrolling + Multiple Issue

Rearrange instructions to handle delay for DADDUI and BNE (the displacements of the moved stores are not yet adjusted)

Loop:   L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)
        L.D    F14, -24(R1)
        L.D    F18, -32(R1)
        ADD.D  F4, F0, F2
        ADD.D  F8, F6, F2
        ADD.D  F12, F10, F2
        ADD.D  F16, F14, F2
        ADD.D  F20, F18, F2
        S.D    F4, 0(R1)
        S.D    F8, -8(R1)
        S.D    F12, -16(R1)
        DADDUI R1, R1, #-40
        S.D    F16, -24(R1)
        BNE    R1, R2, Loop
        S.D    F20, -32(R1)

Page 13: Compiler Techniques for ILP

Loop Unrolling + Multiple Issue

Fix immediate displacement values

Loop:   L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)
        L.D    F14, -24(R1)
        L.D    F18, -32(R1)
        ADD.D  F4, F0, F2
        ADD.D  F8, F6, F2
        ADD.D  F12, F10, F2
        ADD.D  F16, F14, F2
        ADD.D  F20, F18, F2
        S.D    F4, 0(R1)
        S.D    F8, -8(R1)
        S.D    F12, -16(R1)
        DADDUI R1, R1, #-40
        S.D    F16, 16(R1)
        BNE    R1, R2, Loop
        S.D    F20, 8(R1)

Page 14: Compiler Techniques for ILP

Loop Unrolling + Multiple Issue

Now imagine we can issue 2 instructions per cycle, one integer and one FP

Cycle   Integer instruction            FP instruction
1       Loop: L.D    F0, 0(R1)
2             L.D    F6, -8(R1)
3             L.D    F10, -16(R1)      ADD.D  F4, F0, F2
4             L.D    F14, -24(R1)      ADD.D  F8, F6, F2
5             L.D    F18, -32(R1)      ADD.D  F12, F10, F2
6             S.D    F4, 0(R1)         ADD.D  F16, F14, F2
7             S.D    F8, -8(R1)        ADD.D  F20, F18, F2
8             S.D    F12, -16(R1)
9             DADDUI R1, R1, #-40
10            S.D    F16, 16(R1)
11            BNE    R1, R2, Loop
12            S.D    F20, 8(R1)

12 cycles = 2.4 per iteration

Page 15: Compiler Techniques for ILP

Static Branch Prediction

Analyze the code, figure out which outcome of a branch is likely
- Always predict taken
- Predict backward branches as taken, forward as not taken
- Predict based on the profile of previous runs

Static branch prediction can help us schedule delayed branch slots
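The "backward taken, forward not taken" rule is simple enough to state as a one-line predicate. This is an illustrative sketch only: the function name and the use of raw instruction addresses are assumptions, and a real compiler would apply the rule to its own branch representation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: predict a branch as taken when its target lies at
   or before the branch itself (a backward branch, typically a loop
   back-edge), and as not taken when the target lies ahead of it. */
bool predict_taken_backward(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc <= branch_pc;
}
```

For the loop in these slides, BNE R1, R2, Loop jumps back to an earlier address, so the rule predicts it taken, which is right for all but the last of the 1000 iterations.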

Page 16: Compiler Techniques for ILP

Static Multiple Issue: VLIW

- Hardware checking for dependencies in issue packets may be expensive and complex
- Compiler can examine instructions and decide which ones can be scheduled in parallel: it groups instructions into instruction packets (VLIW)
- Hardware can then be simplified
- Processor has multiple functional units and each field of the VLIW is assigned to one unit
- For example, a VLIW could contain 5 fields: one has to contain an ALU instruction or branch, two have to contain FP instructions and two have to be memory references

Page 17: Compiler Techniques for ILP

Example

Assume VLIW contains 5 fields: ALU instruction or branch, two FP instructions and two memory references

Ignore branch delay slot

VLIW fields: ALU/branch | FP | FP | mem | mem

Loop:   L.D    F0, 0(R1)
        stall, wait for F0 value to propagate
        ADD.D  F4, F0, F2
        stall, wait for FP add to be completed
        stall, wait for FP add to be completed
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        stall, wait for R1 value to propagate
        BNE    R1, R2, Loop

Page 18: Compiler Techniques for ILP

Example

Unroll seven times and rearrange

Loop:   L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)
        L.D    F14, -24(R1)
        L.D    F18, -32(R1)
        L.D    F22, -40(R1)
        L.D    F26, -48(R1)
        ADD.D  F4, F0, F2
        ADD.D  F8, F6, F2
        ADD.D  F12, F10, F2
        ADD.D  F16, F14, F2
        ADD.D  F20, F18, F2
        ADD.D  F24, F22, F2
        ADD.D  F28, F26, F2
        S.D    F4, 0(R1)
        S.D    F8, -8(R1)
        S.D    F12, -16(R1)
        S.D    F16, -24(R1)
        S.D    F20, -32(R1)
        DADDUI R1, R1, #-56
        S.D    F24, 16(R1)
        BNE    R1, R2, Loop
        S.D    F28, 8(R1)

VLIW fields: ALU/branch | FP | FP | mem | mem

Page 26: Compiler Techniques for ILP

Example

Loop:   L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)
        L.D    F14, -24(R1)
        L.D    F18, -32(R1)
        L.D    F22, -40(R1)
        L.D    F26, -48(R1)
        ADD.D  F4, F0, F2
        ADD.D  F8, F6, F2
        ADD.D  F12, F10, F2
        ADD.D  F16, F14, F2
        ADD.D  F20, F18, F2
        ADD.D  F24, F22, F2
        ADD.D  F28, F26, F2
        S.D    F4, 0(R1)
        S.D    F8, -8(R1)
        S.D    F12, -16(R1)
        S.D    F16, -24(R1)
        S.D    F20, 24(R1)
        DADDUI R1, R1, #-56
        S.D    F24, 16(R1)
        BNE    R1, R2, Loop
        S.D    F28, 8(R1)

S.D F20 uses displacement 24 (= -32 + 56), which is correct only if it is placed in a VLIW packet that issues after DADDUI.

VLIW fields: ALU/branch | FP | FP | mem | mem

Overall 9 cycles for 7 iterations = 1.29 per iteration

But VLIW was always half-full
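The per-iteration figures quoted on these slides, and the "half-full" remark, follow from simple ratios. The sketch below uses helper names invented for illustration; the operation count of 23 comes from the unrolled loop itself (7 L.D + 7 ADD.D + 7 S.D + DADDUI + BNE).

```c
/* Cycles spent per original loop iteration when one pass through the
   scheduled code covers `iterations` iterations in `cycles` cycles. */
double cycles_per_iteration(int cycles, int iterations) {
    return (double)cycles / iterations;
}

/* Fraction of VLIW issue slots that hold a useful operation. */
double slot_utilization(int operations, int cycles, int fields) {
    return (double)operations / ((double)cycles * fields);
}
```

With 23 operations placed in 9 packets of 5 fields, slot_utilization(23, 9, 5) is 23/45, or about 0.51, which is where "half-full" comes from; cycles_per_iteration(9, 7) gives the 1.29 quoted above.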

Page 27: Compiler Techniques for ILP

Detecting and Enhancing Loop Level Parallelism

Determine whether data in later iterations depends on data in earlier iterations (a loop-carried dependence)

This is more easily detected at the source-code level than in machine code

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }

- S1 calculates a value A[i+1] which will be used in the next iteration of S1
- S2 calculates a value B[i+1] which will be used in the next iteration of S2
- These are loop-carried dependences and prevent parallelism
- S1 calculates a value A[i+1] which will be used in the current iteration of S2; this is a dependence within the loop

Page 28: Compiler Techniques for ILP

Detecting and Enhancing Loop Level Parallelism

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i+1] = C[i] + D[i];      /* S2 */
    }

- S1 calculates a value A[i] which is not used in later iterations
- S2 calculates a value B[i+1] which will be used in the next iteration of S1
- This is a loop-carried dependence, but S1 depends on S2, not on itself, and S2 does not depend on S1
- This loop can be made parallel if we transform it so that there is no loop-carried dependence

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];          /* S2 */
        A[i+1] = A[i+1] + B[i+1];      /* S1 */
    }
    B[101] = C[100] + D[100];
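One way to convince yourself that the transformed loop computes the same values is to run both versions on the same data. The sketch below does that; the function names, the macro N and the array sizing (indices 1 to N+1 must be valid, so N+2 elements) are assumptions made for this illustration.

```c
#define N 100

/* Original form: S1 in iteration i+1 reads the B[i+1] written by S2 in
   iteration i, so the dependence crosses iterations. */
void loop_original(double *A, double *B, const double *C, const double *D) {
    for (int i = 1; i <= N; i = i + 1) {
        A[i] = A[i] + B[i];          /* S1 */
        B[i + 1] = C[i] + D[i];      /* S2 */
    }
}

/* Transformed form: the first S1 and the last S2 are peeled off, and the
   remaining pairs are regrouped so S1 only reads a B value produced in
   the same iteration. No dependence is carried between iterations. */
void loop_transformed(double *A, double *B, const double *C, const double *D) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i = i + 1) {
        B[i + 1] = C[i] + D[i];              /* S2 */
        A[i + 1] = A[i + 1] + B[i + 1];      /* S1 */
    }
    B[N + 1] = C[N] + D[N];
}
```

Because each element is computed by the same operations on the same operands, the two functions produce identical arrays; only the grouping of statements into iterations differs.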

Page 29: Compiler Techniques for ILP

Detecting and Enhancing Loop Level Parallelism

A recurrence creates a loop-carried dependence

But sometimes the loop may be parallelizable if the distance between dependent elements is > 1

Dependence distance 1:

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i-1] + B[i];
    }

Dependence distance 5:

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i-5] + B[i];
    }
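With a dependence distance of 5, any five consecutive iterations read only values produced at least five iterations earlier, so they are independent of one another. The sketch below illustrates this under assumed conventions: 0-based arrays, a[0..4] holding the initial values, and function names chosen for the example.

```c
#include <stddef.h>

/* Reference recurrence with distance 5: a[i] = a[i-5] + b[i]. */
void recur5(double *a, const double *b, size_t n) {
    for (size_t i = 5; i < n; i++)
        a[i] = a[i - 5] + b[i];
}

/* Same recurrence in groups of five. Within a group, every read
   (a[i-5] .. a[i-1]) refers to an element written by an earlier group,
   so the five additions could execute in parallel. Computing all five
   results before storing any of them makes that independence explicit. */
void recur5_grouped(double *a, const double *b, size_t n) {
    size_t i = 5;
    for (; i + 5 <= n; i += 5) {
        double t0 = a[i - 5] + b[i];
        double t1 = a[i - 4] + b[i + 1];
        double t2 = a[i - 3] + b[i + 2];
        double t3 = a[i - 2] + b[i + 3];
        double t4 = a[i - 1] + b[i + 4];
        a[i]     = t0;
        a[i + 1] = t1;
        a[i + 2] = t2;
        a[i + 3] = t3;
        a[i + 4] = t4;
    }
    for (; i < n; i++)                 /* leftover iterations */
        a[i] = a[i - 5] + b[i];
}
```

The same regrouping would be incorrect for the distance-1 loop, because there each iteration needs the value stored by the one immediately before it.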

Page 30: Compiler Techniques for ILP

Detecting and Enhancing Loop Level Parallelism

Find all dependences in the following loop (there are 5) and eliminate as many as you can:

    for (i = 1; i <= 100; i = i + 1) {
        Y[i] = X[i] / c;       /* S1 */
        X[i] = X[i] + c;       /* S2 */
        Z[i] = Y[i] + c;       /* S3 */
        Y[i] = c - Y[i];       /* S4 */
    }

Solution on page 325

Page 31: Compiler Techniques for ILP

Code Transformation

Eliminating dependent computations

- Copy propagation

        DADDUI R1, R2, #4
        DADDUI R1, R1, #4

  becomes

        DADDUI R1, R2, #8

- Tree height reduction

        ADD    R1, R2, R3
        ADD    R4, R1, R6
        ADD    R8, R4, R7

  becomes

        ADD    R1, R2, R3
        ADD    R4, R6, R7
        ADD    R8, R1, R4

  The first two ADDs of the new sequence can be done in parallel.

    sum = sum + x    /* suppose this is in a loop and we unroll it 5 times */

  becomes

    sum = sum + x1 + x2 + x3 + x4 + x5
    sum = (sum + x1) + (x2 + x3) + (x4 + x5)

  Evaluated left to right, the first form must be done sequentially. In the regrouped form the three parenthesized additions can be done in parallel.
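The regrouping can be written out at the source level. This is a small sketch with assumed function names; it uses integers because integer addition is associative, whereas reassociating floating-point additions can change the rounded result, so a compiler may only do this for FP code when that is permitted.

```c
/* Left-to-right evaluation: five additions, each waiting on the one
   before it (a dependence chain of height 5). */
long sum_sequential(long sum, const long x[5]) {
    return ((((sum + x[0]) + x[1]) + x[2]) + x[3]) + x[4];
}

/* Tree-shaped evaluation: the three partial sums a, b and c do not
   depend on each other, so the chain height drops from 5 to 3. */
long sum_tree(long sum, const long x[5]) {
    long a = sum + x[0];
    long b = x[1] + x[2];
    long c = x[3] + x[4];
    return (a + b) + c;
}
```

Both functions return the same value; the second simply exposes which additions are independent.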

Page 32: Compiler Techniques for ILP

Software Pipelining

Combining instructions from different loop iterations to separate dependent instructions within an iteration

Page 33: Compiler Techniques for ILP

Software Pipelining

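Apply the software pipelining technique to the following loop:

Loop:   L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, Loop

Three consecutive iterations, working on the elements at R1+16, R1+8 and R1:

    Element at R1+16:   L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)
    Element at R1+8:    L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)
    Element at R1:      L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)

The pipelined loop body takes the S.D from the first, the ADD.D from the second and the L.D from the third:

Loop:   S.D    F4, 16(R1)      ; stores the result for the element at R1+16
        ADD.D  F4, F0, F2      ; adds for the element at R1+8
        L.D    F0, 0(R1)       ; loads the element at R1
        DADDUI R1, R1, #-8
        BNE    R1, R2, Loop

Startup code is needed before the loop and cleanup code after it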
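The same structure (startup, steady-state body mixing three iterations, cleanup) can be sketched in C. This is an illustration under assumed conventions, not the slide's code: the array is 0-based, the function name is invented, and the variables loaded and summed stand in for registers F0 and F4.

```c
#include <stddef.h>

/* Software-pipelined x[i] = x[i] + s over x[0..n-1], walking from the
   high end down as the slides do. In the steady state each pass stores
   element i, adds for element i-1 and loads element i-2. */
void add_scalar_swp(double *x, size_t n, double s) {
    if (n < 2) {                       /* too short to pipeline */
        if (n == 1)
            x[0] = x[0] + s;
        return;
    }

    /* Startup code: fill the pipeline. */
    double loaded = x[n - 1];          /* load for element n-1 */
    double summed = loaded + s;        /* add for element n-1 */
    loaded = x[n - 2];                 /* load for element n-2 */

    /* Steady state: one store, one add, one load, each for a different element. */
    for (size_t i = n - 1; i >= 2; i--) {
        x[i] = summed;                 /* store element i */
        summed = loaded + s;           /* add for element i-1 */
        loaded = x[i - 2];             /* load element i-2 */
    }

    /* Cleanup code: drain the pipeline. */
    x[1] = summed;                     /* store element 1 */
    x[0] = loaded + s;                 /* add and store element 0 */
}
```

Inside the loop body no statement uses a value produced earlier in the same pass, which is the source-level counterpart of removing the RAW stalls between L.D, ADD.D and S.D.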

Page 34: Compiler Techniques for ILP

Software Pipelining vs. Loop Unrolling

- Loop unrolling eliminates loop maintenance overhead, exposing parallelism between iterations
  - Creates larger code

- Software pipelining enables some loop iterations to run at top speed by eliminating RAW hazards that create latencies within an iteration
  - Requires more complex transformations

Page 35: Compiler Techniques for ILP

Homework #8

Due Tuesday, November 16, by the end of class

Submit either in class (paper) or by e-mail (PS or PDF only), or bring the paper copy to my office

Do exercises 4.2, 4.6, 4.9 (skip parts d and e), 4.11

