Page 1: CAPRI - SJTU

CAPRI Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures

Minsoo Rhu and Mattan Erez

The University of Texas at Austin Electrical and Computer Engineering Dept.

Page 2: CAPRI - SJTU

CAPRI: Compaction-Adequacy Prediction

•  Dynamic SIMD-lane compaction
   –  Recovers SIMD lane utilization lost to control divergence
   –  Fills idle lanes in one SIMD group with threads from others

•  Problem
   –  Compaction is often ineffective
   –  A compaction-barrier is always imposed at all branches

•  Our solution
   –  Predict whether compaction will be beneficial
   –  Mitigate the detrimental impact of the extra barriers

2 CAPRI [ISCA'12] (c) Minsoo Rhu

[Figure: normalized IPC for BFS, MUM, MUM++, LPS, BACKP, and TPACF; Baseline vs. Compaction vs. CAPRI]

Page 3: CAPRI - SJTU

Outline

•  GPU and SIMD compaction background •  Compaction can degrade performance •  CAPRI – compaction adequacy prediction •  Evaluation


Page 4: CAPRI - SJTU

Graphics Processing Units (GPUs)

•  General-purpose many-core accelerators
   –  Support non-graphics APIs (e.g., CUDA, OpenCL)

•  Scalar frontend (fetch & decode) + parallel backend
   –  Amortizes the cost of the frontend and control


Page 5: CAPRI - SJTU

CUDA exposes hierarchy of data-parallel threads

•  SPMD model: single kernel executed by all threads

•  Kernel / Thread-block
   –  Multiple thread-blocks (cooperative thread arrays, CTAs) compose a kernel


Page 6: CAPRI - SJTU

CUDA exposes hierarchy of data-parallel threads

•  SPMD model: single kernel executed by all threads

•  Kernel / Thread-block / Warp
   –  Multiple warps compose a thread-block
   –  Multiple threads (32) compose a warp


A thread is executed in a lane: a thread maintains its home execution lane until completion
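The thread hierarchy above can be sketched as a simple index mapping; this is a minimal illustration (names and the tuple return are my own), assuming only the 32-threads-per-warp figure stated on the slide:

```python
WARP_SIZE = 32  # threads per warp, as stated on the slide

def warp_and_lane(tid: int) -> tuple:
    """Map a thread id within a CTA to its (warp id, home lane).

    Illustrative only: the lane is fixed for the thread's lifetime,
    which is why compaction must respect home lanes later on.
    """
    return tid // WARP_SIZE, tid % WARP_SIZE

# Thread 37 of a CTA lives in warp 1, lane 5.
print(warp_and_lane(37))  # (1, 5)
```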

Page 7: CAPRI - SJTU

GPUs have HW support for conditional branches

•  Control-divergence: threads in a warp branch differently
   –  Predicate registers: only a subset of the warp commits results
   –  Stack-based re-convergence model
   –  The active threads at the top-of-stack (TOS) always execute
   –  The re-convergence PC (RPC) is derived at compile-time


(b) Per-warp re-convergence stack
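A minimal sketch of the stack-based re-convergence model described above. The entry layout (pc, rpc, active_mask) and the push order are illustrative assumptions, not the exact hardware encoding:

```python
def diverge(stack, taken_mask, taken_pc, fallthru_pc, rpc):
    """Replace the TOS entry with one entry per branch outcome.

    The warp then executes whichever entry sits at the TOS; an entry is
    popped once its PC reaches the re-convergence PC (RPC), at which point
    execution falls through to the entry below.
    """
    _pc, _rpc, active = stack.pop()
    stack.append((taken_pc, rpc, active & taken_mask))      # taken path
    stack.append((fallthru_pc, rpc, active & ~taken_mask))  # not-taken path

# 4-thread warp, all active (0b1111), diverging at PC 0x10 with RPC 0x30:
stack = [(0x10, None, 0b1111)]
diverge(stack, taken_mask=0b0011, taken_pc=0x20, fallthru_pc=0x14, rpc=0x30)
print(stack)  # TOS is the not-taken path (threads 2 and 3)
```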

Page 8: CAPRI - SJTU

Thread-block compaction (TBC) [Fung11]

•  Dynamically compact warps within a thread-block
   –  Synchronizes all warps at conditional branches and re-convergence points
   –  Adopts a CTA-wide re-convergence stack for synchronization
   –  A similar approach in [Narasiman11] is denoted TBC+


Page 9: CAPRI - SJTU

Thread-block compaction (TBC)

•  CTA-wide re-convergence stack
   –  Each warp incrementally updates the active bitmasks
   –  Enforces HW-generated compaction-barriers

•  Warp compaction unit (WCU)
   –  Composed of a set of priority encoders
   –  Generates one valid warp per cycle (after compaction)
   –  Only threads not occupying a common lane are compacted into the same warp


* 4-threads/warp, 2-warps/CTA

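The lane constraint above can be sketched as a packing problem: a compacted warp may take at most one thread per home lane. The greedy loop below is an assumption standing in for the priority encoders, at the slide's example scale (4 threads/warp):

```python
def compact(active_masks, warp_size=4):
    """Pack per-warp active masks into as few warps as possible, lane-wise."""
    out = []
    for mask in active_masks:
        for lane in range(warp_size):
            if not (mask >> lane) & 1:
                continue
            # Place this thread in the first output warp whose lane is free.
            for i, omask in enumerate(out):
                if not (omask >> lane) & 1:
                    out[i] |= 1 << lane
                    break
            else:
                out.append(1 << lane)
    return out

# Warp 0 active in lanes {0,1}, warp 1 in lanes {2,3}: compacts to one warp.
print(compact([0b0011, 0b1100]))  # [0b1111]
# Both warps active only in lane 0: compaction cannot help.
print(compact([0b0001, 0b0001]))  # [0b0001, 0b0001]
```

The second case is exactly the pathology CAPRI later predicts away: the barrier cost is paid but the warp count does not shrink.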

Page 10: CAPRI - SJTU

Outline

•  GPU and SIMD compaction background •  Compaction can degrade performance •  CAPRI – compaction adequacy prediction •  Evaluation


Page 11: CAPRI - SJTU

Problem 1: extra synchronization overhead

•  Compaction is applied at all branching points
   –  A branch's divergence is determined only after execution
   –  Not all branches are amenable to compaction
   –  Synchronization is imposed regardless


Warps are always forced to execute only within a basic-block (compaction barrier); the number of warps remains the same!

Page 12: CAPRI - SJTU

Problem 2: increased memory divergence

•  Threads within a warp are shuffled after compaction
   –  The compaction unit "slides" threads up
   –  May break memory accesses that coalesced in the original order

•  Compaction is worthwhile only when the application is highly irregular

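The coalescing point can be made concrete by counting the memory segments a warp's accesses touch; the 128-byte segment size here is an assumption for illustration:

```python
def num_transactions(addresses, segment=128):
    """Count the distinct memory segments a warp's accesses touch."""
    return len({addr // segment for addr in addresses})

# Two warps each accessing consecutive 4-byte words: fully coalesced.
warp_a = [i * 4 for i in range(0, 32)]   # bytes 0..127  -> 1 segment
warp_b = [i * 4 for i in range(32, 64)]  # bytes 128..255 -> 1 segment
print(num_transactions(warp_a), num_transactions(warp_b))  # 1 1

# After compaction interleaves threads of the two warps, a single compacted
# warp can straddle both segments:
mixed = warp_a[:16] + warp_b[:16]
print(num_transactions(mixed))  # 2
```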

Page 13: CAPRI - SJTU

Outline

•  GPU and SIMD compaction background •  Compaction can degrade performance •  CAPRI – compaction adequacy prediction •  Evaluation


Page 14: CAPRI - SJTU

CAPRI: Compaction-Adequacy PRedIction

•  Intuition
   –  Not all branch points are likely to be compactable
   –  Activate compaction only when past history indicates compaction was adequate
   –  Bypass warps past the compaction-barrier when compaction is inadequate

•  Compaction-adequacy
   –  Compaction is adequate when it reduces the number of executing warps


Page 15: CAPRI - SJTU

Compaction-adequacy assessment

•  Check whether compaction was beneficial
   –  The numbers of warps before and after compaction are compared


Page 16: CAPRI - SJTU

Microarchitecture of CAPRI

•  Design
   –  Compaction-Adequacy Prediction Table (CAPT)
   –  Single CAPT per shader core (shared among CTAs within the core)
   –  Fully-associative structure (32 entries)
      •  8 entries are typically sufficient
   –  Each entry contains a history-bit for compaction-adequacy
   –  Tag: PC of the branch instruction (BADDR)

•  History-bit configuration
   –  Holds the most recently evaluated compaction-adequacy

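A sketch of the CAPT described above: a small fully-associative table tagged by branch PC (BADDR), each entry holding one history bit that records the most recent compaction-adequacy. The LRU replacement policy and the compact-by-default prediction for untracked branches are my assumptions; the slide does not specify either:

```python
from collections import OrderedDict

class CAPT:
    def __init__(self, entries=32):
        self.entries = entries
        self.table = OrderedDict()  # BADDR -> history bit

    def predict(self, baddr: int) -> bool:
        """Predict 'compact' (True) or 'bypass' (False) for a branch."""
        if baddr in self.table:
            self.table.move_to_end(baddr)  # refresh LRU position
            return self.table[baddr]
        return True  # untracked divergent branch: default to compacting

    def update(self, baddr: int, warps_before: int, warps_after: int):
        """Record whether compaction actually reduced the warp count."""
        if len(self.table) >= self.entries and baddr not in self.table:
            self.table.popitem(last=False)  # evict the LRU entry
        self.table[baddr] = warps_after < warps_before

capt = CAPT()
capt.update(0x1A0, warps_before=2, warps_after=2)  # compaction was futile
print(capt.predict(0x1A0))  # False: bypass this branch next time
```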

Page 17: CAPRI - SJTU

CAPRI Algorithm


Page 18: CAPRI - SJTU

CAPRI Algorithm


All non-divergent branches are always bypassed

Page 19: CAPRI - SJTU

CAPRI Algorithm


The 1st divergent warp invokes CAPT entry allocation

Page 20: CAPRI - SJTU

CAPRI Algorithm


Following warps also wait for compaction

Page 21: CAPRI - SJTU

CAPRI Algorithm


Compaction-adequacy is evaluated when all warps arrive
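The algorithm steps above can be stitched together in one sketch (function and variable names are my own; the dict is a Python stand-in for the CAPT hardware): non-divergent branches always bypass, the first divergent warp allocates an entry, predicted-adequate branches stall warps at the compaction barrier, and once all warps arrive adequacy is re-evaluated to update the history bit:

```python
def capri_branch(history, baddr, warp_masks, warp_size=4):
    """Return (compacted?, resulting warp count); updates history in place."""
    full = (1 << warp_size) - 1
    if all(m == full for m in warp_masks):
        return False, len(warp_masks)          # non-divergent: always bypass
    predict = history.setdefault(baddr, True)  # 1st divergent warp allocates
    if not predict:
        return False, len(warp_masks)          # bypass: history says futile
    # Warps wait at the compaction barrier; compact lane-wise on arrival.
    # Lane-constrained packing needs as many warps as the busiest lane.
    occupancy = [sum((m >> l) & 1 for m in warp_masks) for l in range(warp_size)]
    after = max(occupancy)
    history[baddr] = after < len(warp_masks)   # evaluate compaction-adequacy
    return True, after

history = {}
# Complementary masks compact perfectly: 2 warps become 1.
print(capri_branch(history, 0x40, [0b0011, 0b1100]))  # (True, 1)
# Clashing lanes: compaction runs once but is futile ...
print(capri_branch(history, 0x80, [0b0001, 0b0001]))  # (True, 2)
# ... so the next visit to that branch bypasses the barrier.
print(capri_branch(history, 0x80, [0b0001, 0b0001]))  # (False, 2)
```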

Page 22: CAPRI - SJTU

Outline

•  GPU and SIMD compaction background •  Compaction can degrade performance •  CAPRI – compaction adequacy prediction •  Evaluation


Page 23: CAPRI - SJTU

Simulation Environment

•  GPGPU-Sim (v2.1.1b)
   –  30 shader cores (streaming multiprocessors)
   –  1024 threads per core, 16K registers per core
   –  Cache: 32 kB L1, 1024 kB unified L2
   –  Warp scheduling policy: round-robin
   –  Memory controller: FR-FCFS

•  Workloads
   –  Regular/irregular apps with various input sets
   –  Chosen from the CUDA SDK (v2.2), Rodinia, Parboil, etc.


Page 24: CAPRI - SJTU

Idle Cycles (Normalized)


[Figure: normalized idle cycles; (a) divergent benchmarks: BFS, MUM, MUM++, LPS, BACKP, TPACF; (b) non-divergent benchmarks: RAY, LIB, PFLT, CFD, FFT, QSRDM; BASELINE vs. TBC vs. TBC+ vs. CAPRI]

Page 25: CAPRI - SJTU

SIMD lane utilization*


* Average number of lanes utilized when a warp is issued for execution

[Figure: SIMD lane utilization (0–100%); (a) divergent benchmarks: BFS, MUM, MUM++, LPS, BACKP, TPACF; (b) non-divergent benchmarks: RAY, LIB, PFLT, CFD, FFT, QSRDM; BASELINE vs. TBC vs. TBC+ vs. CAPRI]

Page 26: CAPRI - SJTU

Overall Performance


[Figure: (a) normalized IPC, divergent benchmarks (BFS, MUM, MUM++, LPS, BACKP, TPACF, harmonic mean); (b) normalized IPC, non-divergent benchmarks (RAY, LIB, PFLT, CFD, FFT, QSRDM, harmonic mean); BASELINE vs. TBC vs. TBC+ vs. CAPRI]

Page 27: CAPRI - SJTU

Conclusions

•  CAPRI provides the best of both the baseline and TBC
   –  Higher resource utilization with minimized synchronization

•  Throughput improvements
   –  Divergent: 7.6% average (max 10.8%) improvement on top of TBC
   –  Non-divergent: avoids the 10.1% average performance degradation of TBC
   –  Robust to the scheduling policy thanks to bypassing, unlike TBC and TBC+

•  Implementation cost is negligible
   –  32-entry prediction table per shader core


