
Detailed look at the TigerSHARC pipeline

Date post: 08-Jan-2016
Upload: zurina
Page 1: Detailed look at the TigerSHARC pipeline

Detailed look at the TigerSHARC pipeline

Cycle counting for COMPUTE block versions of the DC_Removal algorithm

Page 2: Detailed look at the TigerSHARC pipeline

DC_Removal algorithm performance 2 / 28

To be tackled today

Expected and actual cycle count for the Compute Block version of the DC_Removal algorithm

Understanding why the stalls occur and how to fix them

Understanding some operations “first time into function” – cache issues?

Page 3: Detailed look at the TigerSHARC pipeline

Set up time

In principle 1 cycle / instruction

2 + 4 instructions

Page 4: Detailed look at the TigerSHARC pipeline

First key element – Sum Loop – Order (N)

4 + N * 5 instructions

Second key element – Shift Loop – Order (log2 N)

1 + 2 * log2 N instructions

Page 5: Detailed look at the TigerSHARC pipeline

Third key element – FIFO circular buffer – Order (N)

Update outgoing parameters: 6 instructions

Update FIFO: 3 + 6 * N instructions

Function return: 2 instructions

Page 6: Detailed look at the TigerSHARC pipeline

TigerSHARC pipeline

Page 7: Detailed look at the TigerSHARC pipeline

Time in theory

Set up pointers to buffers       2
Insert values into buffers       4
SUM LOOP                         4 + N * 5
SHIFT LOOP                       1 + 2 * log2 N
Update outgoing parameters       6
Update FIFO                      3 + 6 * N
Function return                  2
                                 ---------------------------
Total                            22 + 11 N + 2 log2 N

N = 128 – instructions = 1444

1444 cycles + 1100 delay cycles

C++ debug mode – 9500 cycles???????

Note: other tests were executed before this test, which means “cache filled”.
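
The total above can be checked with a few lines of Python. This is a minimal sketch: the function name `theoretical_instructions` and the dictionary labels are illustrative and do not come from the original code, and N is assumed to be a power of two so that log2 N is an integer. The per-stage counts are the ones listed on this slide.

```python
def theoretical_instructions(n: int) -> int:
    """Instruction count from the per-stage table: 22 + 11*N + 2*log2(N).

    Assumes n is a power of two, so log2(n) is an exact integer.
    """
    log2_n = n.bit_length() - 1  # exact log2 for powers of two
    stages = {
        "set up pointers to buffers": 2,
        "insert values into buffers": 4,
        "sum loop": 4 + n * 5,
        "shift loop": 1 + 2 * log2_n,
        "update outgoing parameters": 6,
        "update FIFO": 3 + 6 * n,
        "function return": 2,
    }
    return sum(stages.values())


print(theoretical_instructions(128))  # 1444
```

At one cycle per instruction this gives the 1444-cycle figure; the delay cycles come on top of it.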

Page 8: Detailed look at the TigerSHARC pipeline

Set up time

Expected: 2 + 4 instructions

Actual: 2 + 4 instructions + 2 stalls

Why not 4 stalls?

Page 9: Detailed look at the TigerSHARC pipeline

First time round sum loop

Expected: 9 instructions

LC0 load – 3 stalls

Each memory fetch – 4 stalls

Actual: 9 + 11 stalls (3 for the LC0 load + 2 memory fetches * 4)

Page 10: Detailed look at the TigerSHARC pipeline

Other times around the loop

Expected: 5 instructions

Each memory fetch – 4 stalls

Actual: 5 + 8 stalls (2 memory fetches * 4)

Page 11: Detailed look at the TigerSHARC pipeline

Shift Loop – 1st time around

Expected: 3 instructions

No stalls on LC0 load?

4 stalls on ASHIFTR

BTB hit followed by 5 aborts

Page 12: Detailed look at the TigerSHARC pipeline

Time in theory / practice

                                 Instructions            Stalls
Entry into subroutine            --                      10 stalls?
Set up pointers to buffers       2                       0 stalls
Insert values into buffers       4                       2 stalls
SUM LOOP                         4 + N * 5               N * 8 = 1024 stalls
SHIFT LOOP                       1 + 2 * log2 N          9 stalls
Update outgoing parameters       6                       3 stalls
Update FIFO                      3 + 6 * N               3 stalls
Function return                  2                       --
Exit from subroutine             --                      10 stalls?
                                 ----------------------  --------------
Total                            22 + 11 N + 2 log2 N    1061 stalls

N = 128 – instructions = 1444

1444 cycles + 1061 stalls = 2505 cycles

In practice 2507 cycles

C++ debug mode – 9500 cycles???????

Note: other tests were executed before this test, which means “cache filled”.
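
The stall column can be summed the same way as the instruction column. A minimal sketch, using the stall counts from this slide; the variable names are illustrative, and the two "10 stalls?" entries for subroutine entry and exit are taken at face value even though the slide marks them as uncertain.

```python
N = 128

# Stall counts per stage, as listed on the slide (entry/exit values are tentative there).
stalls = {
    "entry into subroutine": 10,
    "set up pointers to buffers": 0,
    "insert values into buffers": 2,
    "sum loop": N * 8,              # 2 memory fetches * 4 stalls per iteration
    "shift loop": 9,                # 4 on ASHIFTR + 5 aborts after the BTB hit
    "update outgoing parameters": 3,
    "update FIFO": 3,
    "exit from subroutine": 10,
}

instructions = 22 + 11 * N + 2 * 7  # log2(128) = 7
total_stalls = sum(stalls.values())
predicted_cycles = instructions + total_stalls

print(total_stalls, predicted_cycles)  # 1061 2505
```

The prediction is within 2 cycles of the measured 2507.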

Page 13: Detailed look at the TigerSHARC pipeline

Final sum code – Using XR registers

Page 14: Detailed look at the TigerSHARC pipeline

Time in Practice

                                 Instructions            Stalls
Entry into subroutine            --                      10 stalls
Set up pointers to buffers       2                       0 stalls
Insert values into buffers       4                       2 stalls
SUM LOOP                         4 + N * 5               Was 1024 stalls
SHIFT LOOP                       1                       Was 1 + 2 * log2 N + 9 stalls
Update outgoing parameters       6                       3 stalls
Update FIFO                      3 + 6 * N               3 stalls
Function return                  2                       10 stalls
                                 ----------------------
Total                            22 + 11 N               Was 22 + 11 N + 2 log2 N

N = 128 – instructions = 1430

1430 + 279 delay cycles = 1709 cycles

Was 2,505 cycles with JALU (1444 cycles + 1061 delay cycles)

Predicted stalls with X-compute block = 249 stalls – close enough to 256 = N * 2, or one stall for each memory access

Improved more than expected, as we are accidentally making better use of available resources
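
The comparison between the JALU and X-compute versions can be put into numbers. A minimal sketch with illustrative variable names; it uses 22 + 11 N for the instruction count, which is what the quoted 1430 instructions imply for N = 128, and takes the 279 delay cycles and 249 sum-loop stalls directly from the slide.

```python
N = 128

jalu_cycles = (22 + 11 * N + 2 * 7) + 1061     # 1444 instructions + 1061 stalls
xcompute_instructions = 22 + 11 * N            # shift loop no longer scales with log2 N
xcompute_cycles = xcompute_instructions + 279  # 279 delay cycles, as quoted on the slide

speedup = jalu_cycles / xcompute_cycles
stalls_per_access = 249 / (2 * N)              # two memory accesses per sum-loop iteration

print(xcompute_instructions, xcompute_cycles)           # 1430 1709
print(round(speedup, 2), round(stalls_per_access, 2))   # 1.47 0.97
```

Almost all of the gain comes from the stall count rather than the instruction count, which changes by only 14.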

Page 15: Detailed look at the TigerSHARC pipeline

Second time into function, first time around the loop

2 stalls per loop iteration, as predicted

Page 16: Detailed look at the TigerSHARC pipeline

2nd time into function, 9th time around the loop

Stalls as expected

Note: sets of 5 quad instructions appear to be fetched in

Page 17: Detailed look at the TigerSHARC pipeline

Interpretation

Currently

    XR2 = [J0 + J8];;
    XR6 = R6 + R2;;      // Must wait 1 cycle for XR2 to be brought in
    XR3 = [J1 + J8];;
    XR7 = R7 + R3;;      // Must wait 1 cycle for XR3?

Next improvement?

    XR2 = [J0 + J8];;
    XR3 = [J1 + J8];;
    XR6 = R6 + R2;;      // XR2 and XR3 are now ready when we want to use them?
    XR7 = R7 + R3;;      // or do we get DATA / DATA clash along J-bus?
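
The effect of the reordering can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: one instruction line issues per cycle, in order, and a loaded register becomes readable 2 cycles after its load issues (a value chosen to reproduce the 1-cycle wait noted above, not a figure taken from the TigerSHARC manual). The function `count_stalls` and the tuple layout are hypothetical, and the model deliberately ignores the possible DATA / DATA clash on the J-bus that the slide asks about.

```python
def count_stalls(program, load_latency=2):
    """Count stall cycles for an in-order, one-line-per-cycle sequence.

    program: list of (dest, sources, is_load) tuples, one per instruction line.
    load_latency: assumed cycles between issuing a load and its result being readable.
    """
    ready = {}   # register name -> first cycle at which it can be read
    cycle = 0
    stalls = 0
    for dest, sources, is_load in program:
        # Issue is delayed until every source register is available.
        earliest = max([cycle] + [ready.get(src, 0) for src in sources])
        stalls += earliest - cycle
        cycle = earliest
        ready[dest] = cycle + (load_latency if is_load else 1)
        cycle += 1
    return stalls


current = [
    ("XR2", ["J0", "J8"], True),
    ("XR6", ["XR6", "XR2"], False),
    ("XR3", ["J1", "J8"], True),
    ("XR7", ["XR7", "XR3"], False),
]
reordered = [
    ("XR2", ["J0", "J8"], True),
    ("XR3", ["J1", "J8"], True),
    ("XR6", ["XR6", "XR2"], False),
    ("XR7", ["XR7", "XR3"], False),
]

print(count_stalls(current), count_stalls(reordered))  # 2 0
```

Under these assumptions the current ordering costs 2 stalls per iteration and the reordered version none. Whether the hardware behaves this way depends on the bus question raised above.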

Page 18: Detailed look at the TigerSHARC pipeline

Pipeline “intermingled” left and right filter operation

Page 19: Detailed look at the TigerSHARC pipeline

Time in Practice

                                 Instructions            Stalls
Entry into subroutine            --                      10 stalls
Set up pointers to buffers       2                       0 stalls
Insert values into buffers       4                       2 stalls
SUM LOOP                         4 + N * 5               Was 1024 stalls
SHIFT LOOP                       1                       Was 1 + 2 * log2 N + 9 stalls
Update outgoing parameters       6                       3 stalls
Update FIFO                      3 + 6 * N               3 stalls
Function return                  2                       10 stalls
                                 ----------------------
Total                            22 + 11 N               Was 22 + 11 N + 2 log2 N

N = 128 – instructions = 1430

1430 + 279 delay cycles = 1709 cycles

Was 2,505 cycles with JALU (1444 cycles + 1061 delay cycles)

Predicted stalls with X-compute block = 249 stalls – close enough to 256 = N * 2, or one stall for each memory access

Intermingled code – around 1430 cycles + 30 stalls

Page 20: Detailed look at the TigerSHARC pipeline

1st time into function, 1st time round the loop

Page 21: Detailed look at the TigerSHARC pipeline

1st time into function, 2nd, 3rd, … time round the loop

Page 22: Detailed look at the TigerSHARC pipeline

9th, 17th, etc. time into the loop

Page 23: Detailed look at the TigerSHARC pipeline

From TigerSHARC p9-11

Reading in 8 words at a time from “memory” into “cache” MIGHT explain the behaviour

Page 24: Detailed look at the TigerSHARC pipeline

Again, talking about “8” data values

Page 25: Detailed look at the TigerSHARC pipeline

Read buffer

Page 26: Detailed look at the TigerSHARC pipeline

Implications – read buffer

Prefetch buffer: 4 pages

Each page: 8 256-bit words = 64 items (32 bits each)

Buffer = 256 items – exactly enough to handle 128 left and 128 right

Does that imply that speed does not scale up – are 256-point arrays more than 2 x as slow as 128-point arrays?

May make sense to process all of left and then all of right?
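
The buffer arithmetic above can be written out explicitly. A minimal sketch using only the sizes quoted on this slide; the constant names and the helper `fits_in_prefetch_buffer` are illustrative, and the check says nothing about how the hardware actually replaces pages.

```python
BITS_PER_ITEM = 32          # one data value
PAGES = 4                   # prefetch buffer pages
WIDE_WORDS_PER_PAGE = 8     # 256-bit words per page
BITS_PER_WIDE_WORD = 256

items_per_page = WIDE_WORDS_PER_PAGE * BITS_PER_WIDE_WORD // BITS_PER_ITEM
buffer_items = PAGES * items_per_page


def fits_in_prefetch_buffer(points_per_channel: int, channels: int = 2) -> bool:
    """True if all channels' arrays fit in the prefetch buffer at the same time."""
    return points_per_channel * channels <= buffer_items


print(items_per_page, buffer_items)                  # 64 256
print(fits_in_prefetch_buffer(128))                  # True: 128 left + 128 right
print(fits_in_prefetch_buffer(256))                  # False: interleaved left/right
print(fits_in_prefetch_buffer(256, channels=1))      # True: all of left, then all of right
```

The last two lines correspond to the two questions on the slide: interleaved 256-point arrays exceed the buffer, while processing one channel at a time does not.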

Page 27: Detailed look at the TigerSHARC pipeline

Implications – cache

4-way associative cache

128 cache sets

Each cache set has four cache ways

Each cache way – 8 32-bit words

That’s 128 * 8 = 1024 32-bit words per way (4096 32-bit words over all four ways)

Things break down when left / right arrays are of size 512, or else do all left then all right – things change at 1024
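
The cache sizes quoted above multiply out as follows. A minimal sketch with illustrative constant names, using only the set, way, and word counts from this slide. It shows that the 1024-word figure is 128 sets times 8 words, that is, one way across all sets, and that the 512 and 1024 thresholds mentioned above follow from that figure. Whether the observed slowdown really appears at those sizes is the slide's conjecture, not something this arithmetic confirms.

```python
CACHE_SETS = 128
WAYS_PER_SET = 4
WORDS_PER_WAY = 8           # 32-bit words held by one cache way

words_per_way_across_sets = CACHE_SETS * WORDS_PER_WAY      # one way in every set
total_words = words_per_way_across_sets * WAYS_PER_SET      # all four ways

# Thresholds suggested on the slide, derived from the 1024-word figure.
interleaved_limit = words_per_way_across_sets // 2   # left and right processed together
sequential_limit = words_per_way_across_sets         # all of left, then all of right

print(words_per_way_across_sets, total_words)   # 1024 4096
print(interleaved_limit, sequential_limit)      # 512 1024
```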

Page 28: Detailed look at the TigerSHARC pipeline

To be tackled today

Expected and actual cycle count for the Compute Block version of the DC_Removal algorithm

Understanding why the stalls occur and how to fix them

Understanding some operations “first time into function” – cache issues?

