CTA-Aware Prefetching and Scheduling for GPU (gunjaeko/pubs/Gunjae_IPDPS18_slides.pdf)

Date post: 21-Jul-2020

CTA-Aware Prefetching and Scheduling for GPU

Gunjae Koo*, Hyeran Jeon†, Zhenhong Liu‡, Nam Sung Kim‡, Murali Annavaram*

*University of Southern California
†San Jose State University
‡University of Illinois at Urbana-Champaign

Memory Latency – GPU Compute Hurdle

GPU execution model
• Can hide tens of cycles by quick context switching

MUL R2,R1,8
ADD R4,R2,R3
LD R5,[R4+0]
SUB R6,R5,R1
……

[Figure: timeline of Warp 0 to Warp 3 issuing the instructions above, with "Cache miss" and "Fetched" markers.]

GPU demand fetch latency
• Hundreds of cycles if fetched from LLC or DRAM
• Not hidden by warp throttling
• Critical performance hurdle: pipeline stalls for long cycles

[Figure: timeline of Warp 0 to Warp 3 with "Cache miss", "Fetched", and "Pipeline stalls" markers.]

Pipeline stalls by memory operations [Kim (HPCA’16)]: 38% (average), 63% (memory intensive)

Prefetch: A Solution for Long Data Fetch Latency

Prefetch can be a solution for long data fetch latency in GPU

Prefetch performance factors
• Coverage
• Accuracy
• Timeliness

Limitations of GPU Prefetch

Intra-warp stride prefetching
• Applied for iterative global loads in a loop
• Limited coverage: loops are rare in GPU kernels
• Bad timeliness: early prefetch

for (int i=0; i < n; i++)
    y[i] = a*x[i] + y[i];

int i = blockIdx.x * blockDim.x + threadIdx.x;
y[i] = a*x[i] + y[i];
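
The coverage limitation can be illustrated with a small host-side C++ model. This is only a sketch under assumptions: `IntraWarpStrideDetector` and `count_predictions` are hypothetical names, and the detector is a generic per-warp, per-PC stride table rather than any specific published design. It shows that a warp must execute the same load at least twice before a stride exists, so a loop-free kernel yields no predictions.

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical per-warp stride detector, keyed by (warp ID, load PC).
struct IntraWarpStrideDetector {
    std::map<std::pair<int, std::uint64_t>, std::uint64_t> last_addr;

    // Observe one executed load. Returns true and sets `next` only when this
    // warp has executed this PC before, i.e. when a stride can be measured.
    bool observe(int warp, std::uint64_t pc, std::uint64_t addr, std::uint64_t &next) {
        auto key = std::make_pair(warp, pc);
        auto it = last_addr.find(key);
        bool seen = (it != last_addr.end());
        if (seen) next = addr + (addr - it->second);  // addr + measured stride
        last_addr[key] = addr;
        return seen;
    }
};

// Number of loads for which a prediction was possible when each of `warps`
// warps executes the same load PC `iters` times (made-up addresses).
int count_predictions(int warps, int iters) {
    IntraWarpStrideDetector d;
    std::uint64_t next = 0;
    int n = 0;
    for (int w = 0; w < warps; ++w)
        for (int i = 0; i < iters; ++i)
            if (d.observe(w, 0x100, 4096ull * w + 128ull * i, next)) ++n;
    return n;
}
```

With `iters = 1`, which corresponds to the loop-free kernel form, `count_predictions` returns 0 regardless of the number of warps.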

Inter-warp prefetching
• Regular strides observed among threads
• Low accuracy: discrepancy across CTA boundaries
• Bad timeliness: base address estimation for each CTA
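
The discrepancy across CTA boundaries can be made concrete with a host-side C++ sketch. It assumes 4-byte elements and the 1D index `i = blockIdx.x * blockDim.x + threadIdx.x`. The list of CTAs resident on one SM is an assumption (for example CTA 0 and CTA 15 when CTAs are spread over 15 cores), and `warp_addresses` and `deltas` are hypothetical helper names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte address of the first element touched by each warp running on one SM,
// for an array indexed by i = blockIdx.x * blockDim.x + threadIdx.x.
// cta_ids lists the CTAs resident on that SM (an assumed assignment).
std::vector<std::int64_t> warp_addresses(const std::vector<int> &cta_ids,
                                         int block_dim, int warp_size) {
    std::vector<std::int64_t> addrs;
    for (int cta : cta_ids)
        for (int tid = 0; tid < block_dim; tid += warp_size)
            addrs.push_back(4 * (static_cast<std::int64_t>(cta) * block_dim + tid));
    return addrs;
}

// Differences between the addresses of consecutive warps.
std::vector<std::int64_t> deltas(const std::vector<std::int64_t> &a) {
    std::vector<std::int64_t> d;
    for (std::size_t k = 1; k < a.size(); ++k) d.push_back(a[k] - a[k - 1]);
    return d;
}
```

For CTAs {0, 15} with 128 threads per CTA and 32-thread warps, every delta inside a CTA is 128 bytes, while the delta across the CTA boundary is 7296 bytes, so a single inter-warp stride mispredicts at the boundary.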

CTA-Aware Prefetcher and Scheduler

✓ CTA-Aware Prefetcher and Scheduler (CAPS)
• CTA-aware prefetcher
  • Base address of each CTA
  • Common stride per load between warps
• Prefetch-aware warp scheduler
  • Reorganizing warp execution priority to detect required information for prefetcher
  • Warp-wakeup

Software Perspective

#define INDEX(i,j,j_off) (i +__mul24(j,j_off))

__shared__ float u1[3*KOFF];

i = threadIdx.x;
j = threadIdx.y;

i = INDEX(i,blockIdx.x,BLOCK_X);
j = INDEX(j,blockIdx.y,BLOCK_Y);
indg = INDEX(i,j,pitch);

active = (i<NX) && (j<NY);

if (active) u1[ind+KOFF] = d_u1[indg];
......

indg = threadIdx.x + blockIdx.x * BLOCK_X + (threadIdx.y + blockIdx.y * BLOCK_Y) * pitch

With C1 = blockIdx.x * BLOCK_X, C2 = blockIdx.y * BLOCK_Y, C3 = pitch:

indg = threadIdx.x + C1 + (threadIdx.y + C2) * C3
     = threadIdx.x + threadIdx.y * C3 + (C1 + C2 * C3)
     = threadIdx.x + threadIdx.y * C3 + Θ

• threadIdx.x + threadIdx.y * C3: function of thread IDs (stride between warps)
• Θ = C1 + C2 * C3: function of CTA IDs (base address of each CTA)
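
The decomposition can be checked numerically with a host-side C++ sketch. The names `Dim2`, `index_of`, `indg_direct`, `theta`, and `indg_decomposed` are hypothetical, and `__mul24` is replaced by a plain multiply, which gives the same result for the small operands used here.

```cpp
// Host-side stand-in for the slide's INDEX macro (plain multiply instead of __mul24).
inline int index_of(int i, int j, int j_off) { return i + j * j_off; }

struct Dim2 { int x, y; };  // stand-in for threadIdx / blockIdx

// indg computed exactly as in the kernel snippet.
int indg_direct(Dim2 tid, Dim2 bid, int block_x, int block_y, int pitch) {
    int i = index_of(tid.x, bid.x, block_x);
    int j = index_of(tid.y, bid.y, block_y);
    return index_of(i, j, pitch);
}

// CTA-dependent term: theta = C1 + C2 * C3.
int theta(Dim2 bid, int block_x, int block_y, int pitch) {
    return bid.x * block_x + bid.y * block_y * pitch;
}

// Thread-dependent term plus the CTA base.
int indg_decomposed(Dim2 tid, Dim2 bid, int block_x, int block_y, int pitch) {
    return tid.x + tid.y * pitch + theta(bid, block_x, block_y, pitch);
}
```

For example, thread (3, 2) of CTA (1, 4) with BLOCK_X = 32, BLOCK_Y = 4, and pitch = 256 gives 4643 by both routes, with Θ = 4128.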

CTA-Aware Prefetcher

CTA-aware prefetcher
• Estimates prefetch addresses for trailing warp executions
• Base address of CTA + (stride between warps) × distance

[Figure: CTA 0 holds Warp 0 to Warp 3, CTA 1 holds Warp 4 to Warp 7, CTA 2 holds Warp 8 to Warp 11. Each CTA is marked with its own base address (CTA0, CTA1, CTA2), and the stride is marked between warps.]
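
The address estimate above can be written as a one-line function. This is a sketch that assumes "distance" means the target warp's offset, in warps, from the leading warp of its CTA; the slides do not spell that out, and `prefetch_address` is a hypothetical name.

```cpp
#include <cstdint>

// Base address of CTA + (stride between warps) x distance.
// distance is taken as (target warp ID - leading warp ID), an assumption.
std::uint64_t prefetch_address(std::uint64_t cta_base, std::int64_t stride,
                               int lead_warp, int target_warp) {
    std::int64_t distance = target_warp - lead_warp;
    return cta_base + static_cast<std::uint64_t>(stride * distance);
}
```

With a CTA base of 0x1000, a stride of 128 bytes, and leading warp 4, the estimate for warp 6 is 0x1100.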

CTA-aware prefetcher hardware
• PerCTA table: CTA base addresses, leading warp ID
• DIST table: stride, misprediction count

[Figure: PerCTA table with one entry (base addr, lead wid) for each of CTA 0 (W0 to W3), CTA 1 (W4 to W7), and CTA 2 (W8 to W11); DIST table entry (stride, misprediction cnt); prefetch request generator.]
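
A minimal host-side C++ sketch of how the two tables might cooperate is given below. The field names follow the slide, but the update policy, the single shared DIST entry, and the `threshold` cutoff are assumptions made for illustration, not details confirmed by the slides.

```cpp
#include <cstdint>
#include <map>

struct PerCtaEntry { std::uint64_t base_addr; int lead_wid; };
struct DistEntry   { std::int64_t stride; int mispredictions; bool valid; };

struct CtaAwarePrefetcher {
    std::map<int, PerCtaEntry> per_cta;  // PerCTA table, keyed by CTA ID
    DistEntry dist{0, 0, false};         // DIST table entry for one load PC
    int threshold = 4;                   // hypothetical misprediction cutoff

    // Record a demand load issued by warp `wid` of CTA `cta`.
    void on_load(int cta, int wid, std::uint64_t addr) {
        auto it = per_cta.find(cta);
        if (it == per_cta.end()) {       // first warp of this CTA: it becomes the leader
            per_cta[cta] = PerCtaEntry{addr, wid};
            return;
        }
        std::int64_t distance = wid - it->second.lead_wid;
        if (distance == 0) return;
        std::int64_t offset = static_cast<std::int64_t>(addr - it->second.base_addr);
        if (!dist.valid) {               // second warp: learn the common stride
            dist.stride = offset / distance;
            dist.valid = true;
        } else if (dist.stride * distance != offset) {
            ++dist.mispredictions;       // observed address disagrees with the stride
        }
    }

    // Estimate the address for warp `wid` of CTA `cta`, if enough is known.
    bool predict(int cta, int wid, std::uint64_t &out) const {
        auto it = per_cta.find(cta);
        if (it == per_cta.end() || !dist.valid || dist.mispredictions >= threshold)
            return false;
        std::int64_t distance = wid - it->second.lead_wid;
        out = it->second.base_addr + static_cast<std::uint64_t>(dist.stride * distance);
        return true;
    }
};
```

In this model a prefetch for a trailing warp becomes possible only after the leading warp of its CTA has registered a base address and some CTA has supplied the stride, which is why the order in which warps issue their first load matters.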

Prefetch-Aware Scheduler

Timeliness matters
• Prefetch is pending until detecting strides and CTA bases
• Reordering warp executions to detect required info quickly
• Warp-wakeup to prevent eviction of prefetched data

Conventional two-level scheduler
• Warps are fairly assigned to ready queue
• CTA base addresses are detected late

[Figure: animation over CTA0 = {W0, W1, W2}, CTA1 = {W3, W4, W5}, CTA2 = {W6, W7, W8}, starting with W0 W1 W2 W3 in the ready queue and W4 to W8 in the pending queue. Legend: warp exec, load, ld.global (cache miss). Steps shown:
1. W0 issues its load; the PerCTA table records Base0.
2. W1 issues its load; the DIST table records the stride.
3. Prefetch Pr(W2) is issued, then W2 runs.
4. W3 issues its load; the PerCTA table records Base1.
5. Prefetches Pr(W4) and Pr(W5) are issued, followed by the compute phase of W4 to W7.]

Prefetch-aware scheduler
• Reorganizes warp priorities to detect CTA addresses quickly
• Required information can be detected early

[Figure: animation over CTA0 = {W0, W1, W2}, CTA1 = {W3, W4, W5}, CTA2 = {W6, W7, W8}, starting with W0 W1 W3 W6 in the ready queue and W2 W4 W5 W7 W8 in the pending queue. Legend: warp exec, load. Steps shown:
1. W0 issues its load; the PerCTA table records Base0.
2. W1 issues its load; the DIST table records the stride.
3. W3 and W6 issue their loads; the PerCTA table records Base1 and Base2, and prefetch Pr(W2) is issued.
4. Prefetches Pr(W4), Pr(W5), Pr(W7), and Pr(W8) are issued.
5. Compute phase of W2, W4, W5, and W7.]
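
The initial ready-queue order in the animation (W0 W1 W3 W6 ahead of W2 W4 W5 W7 W8) is consistent with promoting the leading warp of every CTA plus one extra warp of the first CTA, so that all CTA bases and the common stride are observed first. The host-side C++ sketch below reproduces that order under this assumed reading; `prefetch_aware_order` is a hypothetical name and the promotion rule is an interpretation of the slide, not a confirmed policy.

```cpp
#include <cstddef>
#include <vector>

// ctas[c] lists the warp IDs of CTA c, leading warp first.
// Returns the warps in priority order: promoted warps, then the rest.
std::vector<int> prefetch_aware_order(const std::vector<std::vector<int>> &ctas) {
    std::vector<int> head, tail;
    for (std::size_t c = 0; c < ctas.size(); ++c) {
        // The first CTA promotes two warps (base address and stride);
        // every other CTA promotes only its leading warp (base address).
        std::size_t promoted = (c == 0) ? 2 : 1;
        for (std::size_t k = 0; k < ctas[c].size(); ++k)
            (k < promoted ? head : tail).push_back(ctas[c][k]);
    }
    head.insert(head.end(), tail.begin(), tail.end());
    return head;
}
```

For the three CTAs of the animation, the function returns W0, W1, W3, W6, W2, W4, W5, W7, W8, whereas the fair assignment of the conventional scheduler starts with W0, W1, W2, W3 and does not reach the leading warp of CTA2 until later.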

Evaluation

• GPGPU-Sim v3.2.2
• Configuration: NVIDIA GTX480 (Fermi architecture)

Parameter        Configuration
Core             15 cores, 32 SIMT lanes, 1400MHz
Resources / SM   48 concurrent warps, 8 concurrent CTAs
Register file    128KB
Shared memory    48KB
Scheduler        two-level scheduler (8 ready warps)
L1I cache        2KB, 128B line, 4-way
L1D cache        16KB, 128B line, 4-way, LRU, 32 MSHR entries
L2 cache         64KB per partition (12 partitions), 128B line, 8-way, LRU, 32 MSHR entries
DRAM             924MHz GDDR5, 6 channels, FR-FCFS

Evaluation: Performance

• Performance is improved by 8%
• Irregular applications: 6%

[Figure: normalized IPC (y-axis 0.8 to 1.3) for CP, LPS, BPR, HSP, MRQ, STE, CNV, HST, JC1, FFT, SCN, MM, PVR, CCL, BFS, KM, and Mean. Series: INTRA, INTER, MTA, NLP, LAP, ORCH, CAPS. Labeled value: 1.08.]

Evaluation: Performance by Number of CTAs

Concurrent CTAs
• Exploit more parallelism & hardware resources
• CAPS is effective if more concurrent CTAs are running

[Figure: normalized IPC (y-axis 0.4 to 1.1) for concurrent CTA = 1, 2, 4, and 8. Series: INTRA, INTER, MTA, NLP, LAP, ORCH, CAPS. Labeled value: 1.08. Annotation: "Effective for a single CTA".]

Performance: Accuracy

Accuracy of prefetcher
• Inaccurate prefetch requests degrade performance
• CAPS provides high prefetching accuracy
• Accuracy: 97%

[Figure: accuracy (y-axis 0% to 100%) for CP, LPS, BPR, HSP, MRQ, STE, CNV, HST, JC1, FFT, SCN, MM, PVR, CCL, BFS, KM, and Mean. Series: INTRA, INTER, MTA, NLP, LAP, ORCH, CAPS.]

Performance: Timeliness

Timeliness
• Early prefetch: prefetched data is evicted before demand requests
• Prefetch distance: cycle gaps between prefetch and demand fetch

[Figure: early prefetch ratio (%) (y-axis 0 to 12), MEAN. Series: INTRA, INTER, MTA, CAPS, CAPS w/o Wakeup. Labeled values: 0.91, 1.16.]

[Figure: average cycles (y-axis 0 to 200), MEAN. Series: LRR, TLV, PA-TLV. Labeled values: 64.3, 145.0, 172.7.]

Conclusion

✓ CTA-aware prefetcher and scheduler
• Accurate prefetching address estimation by detecting CTA base addresses and strides
• Better timeliness with prefetch-aware warp scheduling
• Improves performance of GPU

Thank you

CTA-Aware Prefetching and Scheduling for GPU

Gunjae Koo, Hyeran Jeon, Zhenhong Liu, Nam Sung Kim, Murali Annavaram

[email protected] http://gunjaekoo.com
