Escaping the SIMD vs. MIMD mindset
A new class of hybrid microarchitectures between GPUs and CPUs
Sylvain Collange, Università degli Studi di Siena
Séminaire DALI, December 15, 2011
This talk is not about GPUs
Yesterday (2000-2010): homogeneous multi-core, discrete components
  Central Processing Unit (CPU): latency-optimized cores
  Graphics Processing Unit (GPU): throughput-optimized cores
Today (2011-...): chip-level integration
  Intel Sandy Bridge
  AMD Fusion
  NVIDIA Denver/Maxwell project…
Tomorrow: heterogeneous multi-core chip combining latency-optimized cores, throughput-optimized cores, and hardware accelerators
This talk focuses on the throughput-optimized part.
Programming model: SPMD
Outline
SIMT architectures
  Parallel locality and its exploitation
  Revisiting Flynn's taxonomy
How to keep threads synchronized
Two instructions, multiple data
Parallel value locality
Locality, regularity in sequential apps
Application behavior is likely to follow regular patterns.

for(i…) {
  if(f(i)) {}
  j = g(i); x = a[j];
}

Control regularity
  Regular/local case: branch outcomes repeat over time (taken, taken, taken, taken)
  Irregular case: outcomes vary unpredictably (taken, not taken, not taken, taken)
Memory locality, regularity
  Regular: j=17, j=18, j=19, j=20 (consecutive addresses)
  Irregular: j=21, j=4, j=2, j=17 (scattered addresses)
Value locality
  Regular: x=42, x=42, x=42, x=42 (over iterations i=0..3)
  Irregular: x=15, x=0, x=52, x=2

Hardware already exploits these patterns in sequential applications: branch prediction, caches, instruction prefetch, data prefetch, write combining…
Regularity in parallel applications
Similarity in behavior between SPMD threads (here, threads 1-4 over time).

Parallel control regularity
  switch(i) { case 2: … case 17: … case 21: … }
  Regular: i=17, i=17, i=17, i=17 (all threads follow the same path)
  Irregular: i=21, i=4, i=2, i=17 (threads diverge)
Parallel memory locality
  r = A[i]
  Regular: load A[8], A[9], A[10], A[11] (contiguous accesses to memory)
  Irregular: load A[8], A[0], A[11], A[3] (scattered accesses)
Parallel value locality
  r = a*b
  Regular: a = 32, 32, 32, 32 and b = 52, 52, 52, 52
  Irregular: a = 17, -5, 11, 42 and b = 15, 0, -2, 52
How to exploit parallel locality?
Multi-threading implementation options:

Replication: different resources, same time
  Chip Multi-Processing (CMP): threads T0-T3 spread across space
Time-multiplexing: same resource, different times
  Multi-Threading (MT): threads T0-T3 interleaved in time
Factorization: same resource, same time, if we have parallel locality
  Single-Instruction Multi-Threading (SIMT): threads T0-T3 execute together
Single Instruction, Multiple Threads (SIMT)
Area/power-efficient thanks to parallel locality.

Factorization of the fetch/decode and load-store units:
  Fetch 1 instruction on behalf of several threads: one trip through IF/ID feeds a mul executed by threads 0-3 in EX.
  Read 1 memory location and broadcast it to several registers: one LSU access serves a (0-3) load or (0-3) store.
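The factorization above can be sketched as a tiny SIMT execution model in C. This is a didactic sketch, not real hardware: `warp_state`, `simt_mul`, and the 4-thread warp are all invented for illustration. The point is that the instruction is fetched and decoded once, then applied per active lane.

```c
#include <assert.h>

#define WARP 4

/* One architectural register per thread, plus an activity bit. */
typedef struct {
    int r[WARP];       /* register "r" for threads 0..WARP-1 */
    int active[WARP];  /* activity bit per thread            */
} warp_state;

/* Execute "r = r * k" once for the whole warp: a single
   fetch/decode, then per-lane execution where active. */
static void simt_mul(warp_state *w, int k)
{
    for (int t = 0; t < WARP; t++)
        if (w->active[t])
            w->r[t] *= k;
}

int simt_demo(void)
{
    warp_state w = { {1, 2, 3, 4}, {1, 1, 0, 1} };
    simt_mul(&w, 10);   /* one instruction; lanes 0, 1, 3 execute */
    return w.r[0] + w.r[1] + w.r[2] + w.r[3];  /* 10 + 20 + 3 + 40 */
}
```

The disabled lane (thread 2) keeps its old value, which is exactly the role of the activity bits discussed later.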
Flynn's taxonomy revisited
Consider each pipeline stage separately, shared by threads T0-T3 as a single resource or replicated as multiple resources:

Resource (pipeline stage)        Single resource   Multiple resources
Instruction Fetch (F)            SIMT              MIMT
Memory port, address (M)         SAMT              MAMT
RF/Execute, data (X)             SDMT              MDMT

These axes are mostly orthogonal: mix and match to build your own _I_D_A_T pipeline!
Examples: conventional design points
Multi-core, MIMD (MI MD MA MT): each thread T0-T2 gets its own F, M, X pipeline.
GPU, SIMT (SI MD SA MT): one F and one M port shared by all threads, with per-thread X lanes.
Short-vector SIMD (SI MD SA ST): like the GPU datapath, but driven by a single thread T0.
A GPU: NVIDIA GeForce GTX 580
SIMT: warps of 32 threads
16 SMs / chip
2×16 cores / SM, 48 warps / SM
1580 Gflop/s
Up to 24576 threads in flight
Within an SM, warps are time-multiplexed over the cores: warps 1, 3, …, 47 on cores 1-16 and warps 2, 4, …, 48 on cores 17-32.
Outline
SIMT architectures
How to keep threads synchronized
  The old way: mask stacks
  The new way: distributed control and arbitration
Two instructions, multiple data
Parallel value locality
How to keep threads synchronized?
Issue: control divergence.

Rules of the game:
  One thread per Processing Element (PE)
  All PEs execute the same instruction
  PEs can be individually disabled

Running example, with threads 0-3 on PEs 0-3:
x = 0;
if(tid > 17) {   // Uniform condition
  x = 1;
}
if(tid < 2) {    // Divergent conditions
  if(tid == 0) {
    x = 2;
  }
  else {
    x = 3;
  }
}
The standard way: mask stack
One activity bit per thread (tid = 0, 1, 2, 3), with the stack initially holding the mask 1111:

x = 0;           // stack: 1111
if(tid > 17) {   // Uniform condition: no thread takes it
  x = 1;         // body skipped entirely
}
if(tid < 2) {    // Divergent: push -> 1111 1100
  if(tid == 0) { // Divergent: push -> 1111 1100 1000
    x = 2;
  }              // pop -> 1111 1100
  else {         // push -> 1111 1100 0100
    x = 3;
  }              // pop -> 1111 1100
}                // pop -> 1111
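The mask-stack mechanism can be replayed with a small C model. This is a sketch with explicit masks rather than predicate evaluation; bit 0 stands for tid 0, so the slide notation 1100 becomes 0x3.

```c
#include <assert.h>

#define WARP 4

typedef unsigned mask;          /* 1 activity bit per thread */

static mask stack[16];
static int  sp  = 0;
static mask cur = 0xF;          /* all 4 threads active */

static void push_if(mask cond)  /* enter an if: keep threads with cond */
{
    stack[sp++] = cur;
    cur &= cond;
}
static void pop_endif(void)     /* leave the if: restore parent mask */
{
    cur = stack[--sp];
}

/* Replay the divergent example for threads tid = 0..3. */
int mask_stack_demo(int x[WARP])
{
    for (int t = 0; t < WARP; t++) x[t] = 0;       /* x = 0;            */
    push_if(0x0);                                  /* tid > 17: nobody  */
    pop_endif();                                   /* (body skipped)    */
    push_if(0x3);                                  /* tid < 2: tids 0,1 */
    push_if(0x1);                                  /*   tid == 0        */
    for (int t = 0; t < WARP; t++) if (cur >> t & 1) x[t] = 2;
    pop_endif();
    push_if(0x2);                                  /*   else: 0x3&~0x1  */
    for (int t = 0; t < WARP; t++) if (cur >> t & 1) x[t] = 3;
    pop_endif();
    pop_endif();
    return sp;                                     /* balanced: 0       */
}
```

Every divergent construct must push and pop in a strictly nested fashion, which is why this scheme only supports structured control flow.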
Goto considered harmful?
Control instructions in some CPU and GPU instruction sets:

MIPS: j, jal, jr, syscall
Intel GMA Gen4 (2006): jmpi, if, iff, else, endif, do, while, break, cont, halt, msave, mrest, push, pop
Intel GMA SB (2011): jmpi, if, else, endif, case, while, break, cont, halt, call, return, fork
AMD R500 (2005): jump, loop, endloop, rep, endrep, breakloop, breakrep, continue
AMD R600 (2007): push, push_else, pop, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after
AMD Cayman (2011): push, push_else, pop, push_wqm, pop_wqm, else_wqm, jump_any, reactivate, reactivate_wqm, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after
NVIDIA Tesla (2007): bar, bra, brk, brkpt, cal, cont, kil, pbk, pret, ret, ssy, trap, .s
NVIDIA Fermi (2010): bar, bpt, bra, brk, brx, cal, cont, exit, jcal, jmp, jmx, longjmp, pbk, pcnt, plongjmp, pret, ret, ssy, .s

Why so many? To expose the control-flow structure to the instruction sequencer: there is no generic support for arbitrary control flow.
Alternative: 1 PC / thread
Keep one Program Counter per thread (PC0-PC3) plus a master PC.

x = 0;
if(tid > 17) {
  x = 1;
}
if(tid < 2) {
  if(tid == 0) {
    x = 2;       // only PC0 points here: activity 1 0 0 0
  }
  else {
    x = 3;
  }
}

Each cycle, each thread compares its PC against the master PC:
  Match -> thread active
  No match -> thread inactive
Scheduling policy: min(SP:PC)
Which PC should be chosen as the master PC?

Conditionals and loops: follow the order of code addresses, i.e. min(PC).
  if(…){} else {} compiles to: p? br else; …; br endif; else: …; endif:
  Executing basic blocks in address order brings both paths to the reconvergence point at endif.
  while(…){} compiles to: start: …; p? br start; …
  Address order keeps threads still inside the loop ahead of threads that have exited it.
Functions: favor the maximum nesting depth, i.e. min(SP).
  For f(); with void f(){…}, execute the callee body before the code following the call site, even if the callee sits at a higher address.

G. Diamos, A. Kerr, H. Wu, S. Yalamanchili, B. Ashbaugh, S. Maiyuran. SIMD re-convergence at thread frontiers. MICRO 44, 2011.
  With compiler support, this handles unstructured control flow too, without code duplication.
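The min(PC) part of the policy can be modeled in a few lines of C. This is a simplified sketch that ignores the SP component; the function names are illustrative.

```c
#include <assert.h>
#include <limits.h>

#define WARP 4

/* Elect the master PC: smallest code address among the threads.
   Executing lower addresses first lets both sides of an if/else
   reach their common reconvergence point, at a higher address. */
static unsigned master_pc(const unsigned pc[WARP])
{
    unsigned m = UINT_MAX;
    for (int t = 0; t < WARP; t++)
        if (pc[t] < m) m = pc[t];
    return m;
}

/* A thread is active exactly when its PC matches the master PC. */
static int is_active(unsigned pc, unsigned mpc) { return pc == mpc; }

int min_pc_demo(void)
{
    /* After a divergent branch: thread 0 in the "then" block (0x20),
       threads 1-3 already at the reconvergence point (0x30). */
    unsigned pc[WARP] = { 0x20, 0x30, 0x30, 0x30 };
    unsigned mpc = master_pc(pc);  /* 0x20: run the then-block first */
    int n = 0;
    for (int t = 0; t < WARP; t++) n += is_active(pc[t], mpc);
    return n;                      /* only thread 0 is active */
}
```

Once thread 0 reaches 0x30, all four PCs agree and the warp reconverges without any stack bookkeeping.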
Our new SIMT pipeline
Each lane keeps its own PC (PC0 … PCn). A vote across the PCs elects the master PC (MPC), which drives the instruction fetch. The fetched instruction and the MPC are broadcast to every lane; each lane matches its own PC against the MPC, and on a match it executes the instruction and updates its PC. On no match, the lane discards the instruction.

S. Collange. Une architecture unifiée pour traiter la divergence de contrôle et la divergence mémoire en SIMT. SympA'14, 2011.
Benefits of multiple-PC arbitration
Before: stack, counters
  O(d) or O(log d) memory, where d = nesting depth
  1 R/W port to memory; shared state
  Exceptions: stack overflow, underflow
  C-style structured control flow only
  Partial SIMD semantics (Bougé-Levaire)
  Specific instruction sets
After: multiple PCs
  O(1) memory per thread, no shared state
  Arbitrary control flow; allows thread suspension, restart, migration
  Full SPMD semantics (multi-thread)
  Traditional languages, compilers, and instruction sets
Enables many new architecture ideas.
Outline
SIMT architectures
How to keep threads synchronized
Two instructions, multiple data
  From divergent branches
  From multiple warps
Parallel value locality
Sharing 2 resources
Each axis of the taxonomy gains an intermediate design point with a resource count of 2, between a single shared resource and one resource per thread (T0-T3):

Resource type                      1            2            Multiple
Instruction Fetch (F)              SIMT         DIMT         MIMT
Memory port, address (M)           SAMT         DAMT         MAMT
Computation/registers, data (X)    SDMT         DDMT         MDMT

A. Glew. Coherent vector lane threading. Berkeley ParLab Seminar, 2009.
Simultaneous Branch Interweaving
Co-issue instructions from divergent branches: fill holes in the issue schedule using parallelism from the divergent paths of the control-flow graph (basic blocks 1-6). Where baseline SIMT issues one instruction per cycle for the warp, SBI also issues a second instruction from the same warp but a different branch.
Secondary scheduler policy
Primary scheduler: MPC1 = min over i of PCi
Secondary scheduler: MPC2 = min over i of PCi, restricted to PCi ≠ MPC1

Enforce control-flow reconvergence:
  Annotate reconvergence points with a pointer to their dominator
  A thread at a reconvergence point PCrec waits while any thread of the warp still has its PC between PCdiv and PCrec
  Example: T0 and T2 (at F) wait for T1 (in D); T3 (in B) can proceed in parallel.
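The two cascaded elections can be sketched as follows, assuming the simple min-based formulation above (names are illustrative).

```c
#include <assert.h>
#include <limits.h>

#define WARP 4

/* Primary: MPC1 = min over all thread PCs.
   Secondary: MPC2 = min over thread PCs different from MPC1,
   so instructions from the two divergent paths can co-issue. */
static void dual_mpc(const unsigned pc[WARP],
                     unsigned *mpc1, unsigned *mpc2)
{
    *mpc1 = UINT_MAX;
    for (int t = 0; t < WARP; t++)
        if (pc[t] < *mpc1) *mpc1 = pc[t];
    *mpc2 = UINT_MAX;                /* UINT_MAX: no second path */
    for (int t = 0; t < WARP; t++)
        if (pc[t] != *mpc1 && pc[t] < *mpc2) *mpc2 = pc[t];
}

int sbi_demo(void)
{
    /* Threads 0 and 2 in one branch, threads 1 and 3 in the other. */
    unsigned pc[WARP] = { 0x40, 0x80, 0x40, 0x80 };
    unsigned m1, m2;
    dual_mpc(pc, &m1, &m2);
    return m1 == 0x40 && m2 == 0x80; /* both paths issue this cycle */
}
```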
Implementation
Fermi GPUs already have 2 instruction schedulers; direct both schedulers to the same execution units.
Fermi: warp size 32, 2 warps / clock, 1 instruction / warp
SBI: warp size 64, 1 warp / clock, 2 instructions / warp
Simultaneous Warp Interweaving
Co-issue instructions from different warps: the transposition of Simultaneous Multi-Threading (SMT) into the SIMD world. SWI issues the second instruction from a different warp; SBI+SWI combines both techniques.
Implementation: cascaded scheduling
The secondary scheduler refines the initial schedule: it looks for a warp instruction with a disjoint set of active threads.
SBI/SWI: warp size 64, 1 warp / clock
Detecting compatible warps
Bitset inclusion test with a Content-Addressable Memory: treating zeros as don't-care bits, search the activity masks of warps W0-W6 for one disjoint from the primary mask m. A fully-associative lookup is power-hungry.
Set-associative lookup: split the warps into sets and restrict the lookup to warps of the same set. More power-efficient.
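The compatibility test reduces to a bitwise check, and the set-associative restriction to probing only the warps of one set. The set mapping below (warp index modulo the number of sets) is an assumption for illustration.

```c
#include <assert.h>

/* A warp is compatible with the primary instruction's activity
   mask m iff their active-thread sets are disjoint: m AND w == 0. */
static int compatible(unsigned m, unsigned w) { return (m & w) == 0; }

/* Set-associative variant: probe only warps mapped to the given set
   instead of a full CAM search over every warp. */
static int find_compatible(unsigned m, const unsigned warp_mask[],
                           int nwarps, int set, int nsets)
{
    for (int w = set; w < nwarps; w += nsets)  /* warps of this set */
        if (compatible(m, warp_mask[w]))
            return w;
    return -1;                                 /* no hit in this set */
}

int swi_demo(void)
{
    unsigned masks[6] = { 0xF0, 0x0F, 0xFF, 0x33, 0xCC, 0x0F };
    /* Primary mask 0x0F (lanes 0-3 active); probe set 0 of 2 sets,
       i.e. warps 0, 2, 4. Warp 0 (mask 0xF0) is disjoint. */
    return find_compatible(0x0F, masks, 6, 0, 2);
}
```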
Set-associative lookup is good enough
3-way: 97% of fully-associative (23-way) performance
Direct-mapped: 96%
Using divergence correlations
Issue: unbalanced divergence introduces conflicts. In a parallel reduction, for example, warp 0 is never compatible with warp 2: after divergence, both keep only lane 0 active, so their masks always conflict in lane 0.
Solution: static lane shuffling. Apply a different lane permutation to each warp, so that the threads 0 of different warps map to different physical lanes and no longer conflict. The permutation preserves inter-thread memory locality.
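One possible permutation is a per-warp rotation, sketched below; this particular choice is illustrative and the actual permutation used in the work may differ, as long as it is fixed per warp.

```c
#include <assert.h>

#define LANES 4

/* Static lane shuffling: warp w maps logical thread t to physical
   lane (t + w) % LANES. Any fixed per-warp permutation that spreads
   the surviving logical threads across lanes would do. */
static int physical_lane(int warp, int t) { return (t + warp) % LANES; }

/* In a parallel reduction, only logical thread 0 of each warp
   survives. Check that warps 0 and 2 no longer collide on the
   same physical lane after shuffling. */
int shuffle_demo(void)
{
    int lane_w0 = physical_lane(0, 0);  /* warp 0, thread 0 */
    int lane_w2 = physical_lane(2, 0);  /* warp 2, thread 0 */
    return lane_w0 != lane_w2;          /* disjoint: now compatible */
}
```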
Results
Collaboration with Nicolas Brunie (LIP, ENS Lyon / Kalray) and Gregory Diamos (Georgia Tech / NVIDIA).

Speedup    Regular applications    Irregular applications
SBI        +15%                    +41%
SWI        +25%                    +33%
SBI+SWI    +23%                    +40%
Outline
SIMT architectures
How to keep threads synchronized
Two instructions, multiple data
Parallel value locality
  Dynamic scalarization
  Affine vector cache
  Affine-aware register allocation
32 birds with 1 stone
What about SI SD SA MT: a single instruction stream, data path, and memory port shared by all threads? Not as crazy as it looks…
Phenomenon: parallel value locality
Applications: instruction sharing, register sharing
What are we computing on?
Uniform data: in a warp, v[tid] = c. Example: 5 5 5 5 5 5 5 5 (c = 5).
Affine data: in a warp, v[tid] = b + tid × s, with base b and stride s. Example: 8 9 10 11 12 13 14 15 (b = 8, s = 1).
(Chart: average frequency of Uniform, Affine, and Other vectors among RF reads and operations in GPGPU applications.)
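Both classes can be recognized from the first two lanes of a vector, since a uniform vector is simply an affine one with stride 0. A small illustrative sketch:

```c
#include <assert.h>

#define WARP 8

/* Classify a warp-wide vector v[tid]: affine if v[tid] = b + tid*s,
   uniform in the special case s = 0. The first two lanes propose
   (b, s); the remaining lanes only confirm the pattern. */
static int is_affine(const int v[WARP], int *base, int *stride)
{
    *base   = v[0];
    *stride = v[1] - v[0];
    for (int t = 2; t < WARP; t++)
        if (v[t] != *base + t * *stride) return 0;
    return 1;
}

int classify_demo(void)
{
    int b, s;
    int uni[WARP]   = { 5, 5, 5, 5, 5, 5, 5, 5 };
    int aff[WARP]   = { 8, 9, 10, 11, 12, 13, 14, 15 };
    int other[WARP] = { 3, 1, 4, 1, 5, 9, 2, 6 };
    int ok = 1;
    ok &= is_affine(uni, &b, &s) && s == 0;       /* uniform: stride 0 */
    ok &= is_affine(aff, &b, &s) && b == 8 && s == 1;
    ok &= !is_affine(other, &b, &s);
    return ok;
}
```

This is why two lanes suffice to encode a uniform or affine vector: (b, s) carries all the information.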
Dynamic scalarization: tagging registers
Associate a tag with each vector register: Uniform (U), Affine (A), or unKnown (K). Propagate tags across arithmetic instructions. Two lanes are enough to encode uniform and affine vectors.

Instructions                    Tags
mov    i ← tid                  A ← A
loop:
load   t ← X[i]                 K ← U[A]
mul    t ← a×t                  K ← U×K
store  X[i] ← t                 U[A] ← K
add    i ← i+tcnt               A ← A+U
branch i<n? loop                A < U?
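The tag propagation can be sketched as a small lattice over {U, A, K}. The rules below are inferred from the trace above; a real implementation would also track the affine base and stride in two scalar lanes.

```c
#include <assert.h>

typedef enum { U, A, K } tag;   /* Uniform, Affine, unKnown */

/* add: U+U=U; anything affine stays affine (bases and strides add);
   anything involving K becomes K. */
static tag tag_add(tag x, tag y)
{
    if (x == K || y == K) return K;
    if (x == A || y == A) return A;
    return U;
}
/* mul: U*U=U; affine times uniform stays affine (stride scales);
   affine times affine is quadratic in tid, hence unknown. */
static tag tag_mul(tag x, tag y)
{
    if (x == U && y == U) return U;
    if ((x == A && y == U) || (x == U && y == A)) return A;
    return K;
}

int tag_demo(void)
{
    tag i = A;             /* mov i <- tid    : tid is affine        */
    tag t = K;             /* load t <- X[i]  : loaded data unknown  */
    t = tag_mul(U, t);     /* mul t <- a*t    : U*K = K              */
    tag tcnt = U;          /* thread count: same for all threads     */
    i = tag_add(i, tcnt);  /* add i <- i+tcnt : A+U = A              */
    return i == A && t == K;
}
```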
Dynamic scalarization: clock-gating
Pipeline: Fetch, Decode, De-duplication (reading the tags), operand read from the scalar RF or the vector RF, Execute, Branch/Mask. After de-duplication, register IDs travel with their tag, and uniform or affine work is steered to the scalar path. The vector resources can then be clock-gated: they are inactive for 24% of instructions and for 38% of operands.

S. Collange, D. Defour, Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Euro-Par HPPC'09, 2009.
Why on-chip memory size matters
Conventional wisdom: the NVIDIA CUDA Programming Guide depicts GPUs as devoting far less area to caches than CPUs.
Actual data: counting register files and caches together, GPUs hold 3.9 MB on the NVIDIA GF110 and 7.7 MB on the AMD Cayman. At this rate, GPUs will catch up with CPU on-chip memory capacities by 2012…
What is inside thread-private memory?
Private memory is an extension to the RF: it contains the call stack, local arrays, and spilled registers.
80% of private memory traffic is affine (versus 50% for RF traffic).
Affine Vector Cache
Used as a level-1 cache. Private memory is physically interleaved across threads, so 1 cache line = 1 spilled vector register. Affine vectors are stored as (base, stride) only: 16× more compact.

Research project of Alexandre Kouyoumdjian, LIP, ENS Lyon, April-May 2011.
What is inside a GPU register file?
50% to 92% of the GPU RF contains affine variables: more than the share of register reads, because non-affine variables are short-lived. This also explains the private memory traffic.
Non-affine registers alive in the inner loop:
  MatrixMul: 3 non-affine out of 14
  Needleman-Wunsch: 2 non-affine out of 24
  Convolution: 4 non-affine in the hotspot, out of 14

Research project of Élie Gédéon, LIP, ENS Lyon, June-July 2011.
Compilers to the rescue
Static analysis to identify affine registers.
Issue: divergent control flow introduces dependencies.
Solution: gated-SSA form + live-range splitting.
Application: spill affine variables to shared memory. Up to 40% speedup on current GPUs, when spilling 8 registers / thread.

Collaboration with Fernando Magno Quintão Pereira, Diogo Sampaio, Rafael Martins, Universidade Federal de Minas Gerais, Brazil.
Future direction: affine cache as RF
SIMT execution needs only 1 tag lookup per operand: the same architectural-to-microarchitectural register translation applies to all lanes. Instruction fetch and decode go through the tag array and broadcast instructions to the lanes. An affine ALU, backed by its own affine array, handles most control flow and address computation, while the per-lane L0 arrays and the vector ALUs/FPUs do the heavy lifting.
Open question: should the warp scheduling and the replacement policy be coordinated?
Bottom line: the missing link
A new micro-architecture space opens between Clustered Multi-Threading (CMT) and SIMD, with new ways to exploit parallel value locality for higher perf/W.
The design spectrum runs from the SIMD programming model to the multi-thread programming model: SIMD, stack-based SIMT, PC-based SIMT, CMT, SMT, CMP. Moving toward SIMD optimizes for parallel locality; moving toward CMP allows more flexibility.
Conclusion: research factorization?
Clustered multi-thread architectures: choose between
  Replication
  Time-multiplexing
  Factorization (new!)
Instruction fetch policy in multi-thread processors: balance
  Instruction throughput
  Fairness
  Parallel locality (new!)
Control-flow reconvergence points
  For latency: to reduce the branch misprediction penalty
  For throughput: to restore thread synchronization (new!)
Cross-fertilization with ideas from "classical" superscalar microarchitecture?