Performance in GPU Architectures: Potentials and Distances
Ahmad Lashgar, ECE, University of Tehran
Amirali Baniasadi, ECE, University of Victoria
WDDD-9, June 5, 2011
This Work
Goal: Investigating GPU performance for general-purpose workloads
How: Studying the isolated impact of
I. Memory divergence
II. Branch divergence
III. Context-keeping resources
Key finding: Memory has the biggest impact. Branch divergence solution needs memory consideration.
A. Lashgar and A. Baniasadi. Performance in GPU architectures: Potentials and Distances.
Outline
Background
Performance Impacting Parameters
Machine Models
Performance Potentials
Performance Distances
Sensitivity Analysis
Conclusion
GPU Architecture
Figure: GPU organization. TPC1 to TPC10, each containing three SMs (SM1 to SM3), connect through an interconnection network to six memory controllers (MCtrl1 to MCtrl6), each with its own DRAM (DRAM1 to DRAM6). Inside each SM: a thread pool holding TID, CTAID, and program counter per thread; a register file; 32 PEs (PE1 to PE32); and L1 data, L1 constant, and L1 texture caches.
• Number of concurrent CTAs per SM is limited by the size of 3 shared resources:
1. Thread Pool
2. Register File
3. Shared Memory
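The limit can be sketched as a minimum over the three resources. This is only an illustration: the function name and the example kernel are hypothetical, and the default sizes are the per-SM values from the simulated configuration in this deck (1024 threads, 16384 registers, 16KB shared memory).

```python
def max_concurrent_ctas(threads_per_cta, regs_per_thread, smem_per_cta,
                        thread_pool=1024, register_file=16384,
                        shared_memory=16 * 1024):
    """Return (CTA limit, limiting resource) for one SM."""
    limits = {
        "thread pool": thread_pool // threads_per_cta,
        "register file": register_file // (regs_per_thread * threads_per_cta),
        # A kernel that uses no shared memory is not limited by it.
        "shared memory": (shared_memory // smem_per_cta
                          if smem_per_cta else float("inf")),
    }
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


# Hypothetical kernel: 256 threads per CTA, 20 registers per thread,
# 4 KB of shared memory per CTA.
print(max_concurrent_ctas(256, 20, 4096))  # (3, 'register file')
```

In this made-up case the register file, not the thread pool, caps concurrency at 3 CTAs.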
Branch Divergence
An SM is a SIMD processor: a group of threads (a warp) executes the same instruction across the lanes.
A branch instruction can diverge a warp into two groups:
1. Threads with a taken outcome
2. Threads with a not-taken outcome
Active mask per block (one bit per thread of an 8-thread warp):
A 1 1 1 1 1 1 1 1
B 1 1 0 1 0 0 1 0
C 0 0 1 0 1 1 0 1
D 1 1 1 1 1 1 1 1

A: // Pre-divergence
if (CONDITION) {
B: // NT path
} else {
C: // T path
}
D: // Reconvergence point
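A minimal sketch of how the masks above arise and what they cost, assuming one instruction per block (a simplification made here, not stated on the slide). The function names are hypothetical.

```python
def split_on_branch(active_mask, taken):
    """Split a warp's active mask into (taken, not-taken) masks."""
    taken_mask = [a & t for a, t in zip(active_mask, taken)]
    not_taken_mask = [a & (1 - t) for a, t in zip(active_mask, taken)]
    return taken_mask, not_taken_mask


def simd_efficiency(masks):
    """Fraction of lane slots doing useful work over the issued instructions."""
    return sum(sum(m) for m in masks) / sum(len(m) for m in masks)


A = [1] * 8
outcome = [0, 0, 1, 0, 1, 1, 0, 1]   # 1 = taken (path C), as in the masks above
C, B = split_on_branch(A, outcome)
D = A                                 # all threads are active again at D
print(B)                              # [1, 1, 0, 1, 0, 0, 1, 0]
print(C)                              # [0, 0, 1, 0, 1, 1, 0, 1]
print(simd_efficiency([A, B, C, D]))  # 0.75
```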
Control-flow mechanism
Control-flow mechanisms handle branch divergence. Previous solutions:
Postdominator Reconvergence (PDOM): masks and serializes the diverging paths, then reconverges all paths.
Dynamic Warp Formation (DWF): regroups the threads in diverging paths into new warps.
PDOM
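Figure: SIMD utilization over time under PDOM, for two 4-thread warps (TOS = top of each warp's reconvergence stack).
Active masks per block:
A: W0 1111, W1 1111
B: W0 0110, W1 0001
C: W0 1001, W1 1110
D: W0 1111, W1 1111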
Dynamic regrouping of diverged threads on the same path increases utilization
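The serialization under PDOM can be sketched with a per-warp reconvergence stack. This is a simplified illustration for the single if/else of the running example (blocks A, B, C, D), not the simulator's implementation; the function names are hypothetical and one instruction per block is assumed.

```python
def pdom_trace(warp_mask, taken):
    """Issue order (block, mask) for one warp through A; if/else B|C; D."""
    # Stack entries: (PC, reconvergence PC, active mask).
    stack = [("A", None, list(warp_mask))]
    trace = []
    while stack:
        pc, _reconv_pc, mask = stack.pop()   # the top of stack (TOS) executes
        trace.append((pc, mask))
        if pc == "A":                        # diverging branch at the end of A
            t = [m & x for m, x in zip(mask, taken)]
            nt = [m & (1 - x) for m, x in zip(mask, taken)]
            stack.append(("D", None, mask))  # reconvergence entry, full mask
            if any(t):
                stack.append(("C", "D", t))  # taken path
            if any(nt):
                stack.append(("B", "D", nt)) # not-taken path is issued first
        # B and C end at D: popping them exposes the next stack entry.
    return trace


def utilization(traces):
    """Active lanes divided by lane slots over every issued instruction."""
    issued = [mask for trace in traces for _, mask in trace]
    return sum(map(sum, issued)) / sum(map(len, issued))


w0 = pdom_trace([1, 1, 1, 1], taken=[1, 0, 0, 1])
w1 = pdom_trace([1, 1, 1, 1], taken=[1, 1, 1, 0])
print(w0)                     # A 1111, B 0110, C 1001, D 1111
print(utilization([w0, w1]))  # 0.75
```

Each warp issues B and C separately with partial masks, which is the utilization loss that regrouping tries to recover.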
DWF
Figure: SIMD utilization over time under DWF, for two 4-thread warps running blocks A, B, C, D. The figure carries a "Merge Possibility" annotation.
Warp pool snapshots (Wi: PC, mask vector), in order:
1. W0: A 1111; W1: A 1111
2. W0: B 0110; W1: A 1111; W2: C 1001
3. W0: B 0110; W1: B 0001; W2: C 1001; W3: C 1110
4. W0: B 0111; W1: C 1111; W2: C 1000
5. W0: D 0111; W1: C 1111; W2: C 1000
6. W0: D 0111; W1: D 1111; W2: C 1000
7. W0: D 0111; W1: D 1111; W2: D 1000
8. W0: D 1111; W1: D 1111
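The regrouping in the warp pool can be sketched as lane-aware merging: threads at the same PC are packed into as few warps as possible, and each thread stays in its own lane. This is a simplified model that tracks only masks (not thread identities) and ignores the issue heuristic; the function name is hypothetical.

```python
def dwf_regroup(pool):
    """Regroup (PC, mask) warps so that same-PC threads share warps."""
    counts = {}                              # PC -> threads waiting per lane
    for pc, mask in pool:
        per_lane = counts.setdefault(pc, [0] * len(mask))
        for lane, bit in enumerate(mask):
            per_lane[lane] += bit
    new_pool = []
    for pc, per_lane in counts.items():
        # The k-th new warp takes one thread from every lane that still has
        # more than k threads waiting.
        for k in range(max(per_lane)):
            new_pool.append((pc, [1 if n > k else 0 for n in per_lane]))
    return new_pool


diverged = [("B", [0, 1, 1, 0]), ("B", [0, 0, 0, 1]),
            ("C", [1, 0, 0, 1]), ("C", [1, 1, 1, 0])]
print(dwf_regroup(diverged))
# [('B', [0, 1, 1, 1]), ('C', [1, 1, 1, 1]), ('C', [1, 0, 0, 0])]
```

The two C warps cannot fully merge because both have a thread in lane 0, so one thread is left in a warp of its own.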
Performance impacting parameters
Memory Divergence: un-coalesced memory accesses increase memory pressure.
Branch Divergence: intra-warp diverging branches decrease SIMD efficiency.
Workload Parallelism: CTA-limiting resources bound the memory-latency-hiding capability.
Concurrent CTAs share 3 CTA-limiting resources:
1. Shared Memory
2. Register File
3. Thread Pool
Machine Models
Each machine is named X-Y-Z:
X (resources): LR = Limited Resources, UR = Unlimited Resources
Y (control flow): DC = DWF Control-flow, PC = PDOM Control-flow, IC = Ideal Control-flow (MIMD)
Z (memory): IM = Ideal Memory, M = Real Memory
Isolates the impact of each parameter.
Machine Models continued…
Limited per-SM resources
  Real memory: LR-DC-M, LR-PC-M, LR-IC-M
  Ideal memory: LR-DC-IM, LR-PC-IM, LR-IC-IM
Unlimited per-SM resources
  Real memory: UR-DC-M, UR-PC-M, UR-IC-M
  Ideal memory: UR-DC-IM, UR-PC-IM, UR-IC-IM
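The twelve machines are the cross product of the three naming fields. A short sketch that enumerates them in the order listed above (the constant and function names are chosen here for illustration):

```python
from itertools import product

RESOURCES = ("LR", "UR")      # limited / unlimited per-SM resources
CONTROL = ("DC", "PC", "IC")  # DWF / PDOM / ideal (MIMD) control flow
MEMORY = ("M", "IM")          # real / ideal memory


def machine_models():
    """All resource-control-memory names, grouped by resources, then memory."""
    return [f"{res}-{ctrl}-{mem}"
            for res, mem, ctrl in product(RESOURCES, MEMORY, CONTROL)]


models = machine_models()
print(len(models))  # 12
print(models[:3])   # ['LR-DC-M', 'LR-PC-M', 'LR-IC-M']
```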
Methodology
GPGPU-sim v2.1.1b
13 benchmarks from the RODINIA benchmark suite and CUDA SDK 2.3

Parameter: Value
NoC
  Total number of SMs: 30
  Number of memory controllers: 6
  Number of SMs sharing an interconnect: 3
SM
  Warp size: 32 threads
  Number of threads per SM: 1024
  Number of registers per SM: 16384 32-bit
  Number of PEs per SM: 32
  Shared memory size: 16KB
  L1 data cache: 32KB
Clocking
  Core clock: 325 MHz
  Interconnect clock: 650 MHz
  DRAM memory clock: 800 MHz
Control-Flow Mechanisms
  Base DWF issue heuristic: Majority
  PDOM warp scheduling: round-robin
Performance Potentials
The speedup that can be reached if the impacting parameter is idealized.
3 potentials (per control-flow mechanism):
Memory Potential: speedup due to ideal memory
Control Potential: speedup due to a divergence-free architecture
Resource Potential: speedup due to infinite CTA-limiting resources per SM
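One way to read these definitions, sketched under two assumptions that the slides do not spell out: the baseline is the limited-resource, real-memory machine for the given control-flow mechanism (for example LR-DC-M for DWF), and the potential is the relative speedup of the machine with one field idealized. The function names and the IPC numbers are hypothetical, not measured results.

```python
def idealized(machine, parameter):
    """Name of the machine with one parameter idealized (assumed mapping)."""
    res, ctrl, mem = machine.split("-")
    if parameter == "memory":
        mem = "IM"
    elif parameter == "control":
        ctrl = "IC"
    elif parameter == "resource":
        res = "UR"
    else:
        raise ValueError(f"unknown parameter: {parameter}")
    return f"{res}-{ctrl}-{mem}"


def potential(ipc_baseline, ipc_idealized):
    """Relative speedup as a fraction; negative if the idealized run is slower."""
    return ipc_idealized / ipc_baseline - 1.0


print(idealized("LR-DC-M", "memory"))  # LR-DC-IM
print(idealized("LR-PC-M", "control")) # LR-IC-M
print(potential(200.0, 250.0))         # 0.25 (made-up IPC values)
```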
Performance Potentials continued…
Memory Potentials
DWF: 61%, PDOM: 59%
Resource Potentials
DWF: 8.6%, PDOM: 9.4%
Control Potentials
DWF: 2%, PDOM: -7%
Performance Distances
How far an otherwise ideal GPU falls from ideal because of one parameter.
3 distances:
Memory Distance: distance from the ideal GPU due to real memory
Resource Distance: distance from the ideal GPU due to limited resources
Control Distance: distance from the ideal GPU due to branch divergence
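A sketch of which machine pairs these definitions compare, taking UR-IC-IM as the ideal GPU. The normalization used in `distance` (fraction of the ideal machine's performance that is lost) is an assumption made here; the slides do not give the formula. Function names and IPC values are hypothetical.

```python
IDEAL = "UR-IC-IM"  # unlimited resources, ideal control flow, ideal memory


def distance_machine(parameter, control="DC"):
    """Machine that is ideal except for one parameter."""
    res, ctrl, mem = IDEAL.split("-")
    if parameter == "memory":
        mem = "M"
    elif parameter == "resource":
        res = "LR"
    elif parameter == "control":
        ctrl = control  # "DC" for DWF or "PC" for PDOM
    else:
        raise ValueError(f"unknown parameter: {parameter}")
    return f"{res}-{ctrl}-{mem}"


def distance(ipc_ideal, ipc_with_real_parameter):
    """Fraction of the ideal machine's performance lost (assumed definition)."""
    return 1.0 - ipc_with_real_parameter / ipc_ideal


print(distance_machine("memory"))         # UR-IC-M
print(distance_machine("control", "PC"))  # UR-PC-IM
print(distance(400.0, 300.0))             # 0.25 (made-up IPC values)
```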
Performance Distances continued…
Memory Distance
40%
Resource Distance
2%
Control Distances
DWF: 15%, PDOM: 8%
Sensitivity Analysis
Validating the findings under aggressive configurations:
Aggressive-Memory: 2x L1 caches, 2x number of memory controllers
Aggressive-Resource: 2x CTA-limiting resources
Limited to performance potentials
Aggressive-memory
Memory Potentials
PDOM memory potential
28%
DWF memory potential
28%
Aggressive-memory continued…
Control Potentials
PDOM control potential
-8%
DWF control potential
-0.4%
Aggressive-memory continued…
Resource Potentials
PDOM resource potential
8%
DWF resource potential
~0%
Aggressive-resource
Memory Potentials
PDOM memory potential
51%
DWF memory potential
52%
Aggressive-resource continued…
Control Potentials
PDOM control potential
-8%
DWF control potential
2%
Aggressive-resource continued…
Resource Potentials
PDOM resource potential
4%
DWF resource potential
3%
Conclusion
Performance in GPUs
Potentials: improvement by idealizing one factor
  Memory: 59% and 61% for PDOM and DWF
  Control: -7% and 2% for PDOM and DWF
  Resource: 9.4% and 8.6% for PDOM and DWF
Distances: distance from the ideal system due to a non-ideal factor
  Memory: 40%
  Control: 8% and 15% for PDOM and DWF
  Resource: 2%
Findings:
  Memory has the biggest impact among the 3 factors
  Improving the control-flow mechanism has to consider memory pressure
  Same trend under aggressive memory and context-keeping resources
Thank you.
Questions?
Why 32 PEs per SM
GPGPU-sim v2.1.1b coalesces memory accesses over each SIMD-width slice of a warp separately, similar to pre-Fermi GPUs.
Example: warp size = 32, PEs per SM = 8, giving 4 independent coalescing domains in a warp (threads 0-7, 8-15, 16-23, 24-31).
We used 32 PEs per SM at ¼ clock rate to model coalescing similar to Fermi GPUs: a single coalescing domain per warp (threads 0-31).
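The effect of the SIMD width on coalescing can be sketched with a toy model: each slice of the warp is coalesced on its own, and every distinct memory segment touched by a slice costs one transaction. The 128-byte segment size and the function names are assumptions for illustration, not GPGPU-sim's actual coalescing logic.

```python
def coalescing_domains(warp_size, simd_width):
    """Thread-ID ranges (first, last) that are coalesced independently."""
    return [(start, start + simd_width - 1)
            for start in range(0, warp_size, simd_width)]


def count_transactions(addresses, simd_width, segment_bytes=128):
    """Transactions for one warp: distinct segments per domain, summed."""
    total = 0
    for start in range(0, len(addresses), simd_width):
        domain = addresses[start:start + simd_width]
        total += len({addr // segment_bytes for addr in domain})
    return total


# 32 threads reading consecutive 4-byte words (128 contiguous bytes).
addresses = [4 * tid for tid in range(32)]
print(coalescing_domains(32, 8))          # [(0, 7), (8, 15), (16, 23), (24, 31)]
print(count_transactions(addresses, 8))   # 4
print(count_transactions(addresses, 32))  # 1
```

With 8 PEs the same fully contiguous access is split into four requests, while 32 PEs let the whole warp coalesce into one.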