Escaping the SIMD vs. MIMD mindset
A new class of hybrid microarchitectures between GPUs and CPUs
Sylvain Collange, Università degli Studi di Siena
Séminaire DALI, December 15, 2011
This talk is not about GPUs
Yesterday (2000-2010): homogeneous multi-core, discrete components
  Central Processing Unit (CPU): latency-optimized cores
  Graphics Processing Unit (GPU): throughput-optimized cores
Today (2011-...): chip-level integration
  Intel Sandy Bridge
  AMD Fusion
  NVIDIA Denver/Maxwell project…
Tomorrow: heterogeneous multi-core chip combining latency-optimized cores, throughput-optimized cores, and hardware accelerators
This talk focuses on the throughput-optimized part.
Programming model: SPMD
Outline
SIMT architectures
  Parallel locality and its exploitation
  Revisiting Flynn's taxonomy
How to keep threads synchronized
Two instructions, multiple data
Parallel value locality
Locality, regularity in sequential apps
Application behavior is likely to follow regular patterns.

for(i…) {
  if(f(i)) {}
  j = g(i); x = a[j];
}

Control regularity
  Regular/local case: branch outcomes repeat over time (taken, taken, taken, taken)
  Irregular case: outcomes vary unpredictably (taken, not taken, not taken, taken)
Memory locality, regularity
  Regular: j=17, j=18, j=19, j=20 (consecutive addresses)
  Irregular: j=21, j=4, j=2, j=17 (scattered addresses)
Value locality
  Regular: x=42, x=42, x=42, x=42 (over iterations i=0..3)
  Irregular: x=15, x=0, x=52, x=2

Hardware already exploits these patterns in sequential applications: branch prediction, caches, instruction prefetch, data prefetch, write combining…
Regularity in parallel applications
Similarity in behavior between SPMD threads (here, threads 1-4 over time).

Parallel control regularity
  switch(i) { case 2: … case 17: … case 21: … }
  Regular: i=17, i=17, i=17, i=17 (all threads follow the same path)
  Irregular: i=21, i=4, i=2, i=17 (threads diverge)
Parallel memory locality
  r = A[i]
  Regular: load A[8], A[9], A[10], A[11] (contiguous accesses to memory)
  Irregular: load A[8], A[0], A[11], A[3] (scattered accesses)
Parallel value locality
  r = a*b
  Regular: a = 32, 32, 32, 32 and b = 52, 52, 52, 52
  Irregular: a = 17, -5, 11, 42 and b = 15, 0, -2, 52
How to exploit parallel locality?
Multi-threading implementation options:

Replication: different resources, same time
  Chip Multi-Processing (CMP): threads T0-T3 spread across space
Time-multiplexing: same resource, different times
  Multi-Threading (MT): threads T0-T3 interleaved in time
Factorization: same resource, same time, if we have parallel locality
  Single-Instruction Multi-Threading (SIMT): threads T0-T3 execute together
Single Instruction, Multiple Threads (SIMT)
Area/power-efficient thanks to parallel locality.

Factorization of the fetch/decode and load-store units:
  Fetch 1 instruction on behalf of several threads: one trip through IF/ID feeds a mul executed by threads 0-3 in EX.
  Read 1 memory location and broadcast it to several registers: one LSU access serves a (0-3) load or (0-3) store.
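The factorization above can be sketched as a tiny SIMT execution model in C. This is a didactic sketch, not real hardware: `warp_state`, `simt_mul`, and the 4-thread warp are all invented for illustration. The point is that the instruction is fetched and decoded once, then applied per active lane.

```c
#include <assert.h>

#define WARP 4

/* One architectural register per thread, plus an activity bit. */
typedef struct {
    int r[WARP];       /* register "r" for threads 0..WARP-1 */
    int active[WARP];  /* activity bit per thread            */
} warp_state;

/* Execute "r = r * k" once for the whole warp: a single
   fetch/decode, then per-lane execution where active. */
static void simt_mul(warp_state *w, int k)
{
    for (int t = 0; t < WARP; t++)
        if (w->active[t])
            w->r[t] *= k;
}

int simt_demo(void)
{
    warp_state w = { {1, 2, 3, 4}, {1, 1, 0, 1} };
    simt_mul(&w, 10);   /* one instruction; lanes 0, 1, 3 execute */
    return w.r[0] + w.r[1] + w.r[2] + w.r[3];  /* 10 + 20 + 3 + 40 */
}
```

The disabled lane (thread 2) keeps its old value, which is exactly the role of the activity bits discussed later.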
Flynn's taxonomy revisited
Consider each pipeline stage separately, shared by threads T0-T3 as a single resource or replicated as multiple resources:

Resource (pipeline stage)        Single resource   Multiple resources
Instruction Fetch (F)            SIMT              MIMT
Memory port, address (M)         SAMT              MAMT
RF/Execute, data (X)             SDMT              MDMT

These axes are mostly orthogonal: mix and match to build your own _I_D_A_T pipeline!
Examples: conventional design points
Multi-core, MIMD (MI MD MA MT): each thread T0-T2 gets its own F, M, X pipeline.
GPU, SIMT (SI MD SA MT): one F and one M port shared by all threads, with per-thread X lanes.
Short-vector SIMD (SI MD SA ST): like the GPU datapath, but driven by a single thread T0.
A GPU: NVIDIA GeForce GTX 580
SIMT: warps of 32 threads
16 SMs / chip
2×16 cores / SM, 48 warps / SM
1580 Gflop/s
Up to 24576 threads in flight
Within an SM, warps are time-multiplexed over the cores: warps 1, 3, …, 47 on cores 1-16 and warps 2, 4, …, 48 on cores 17-32.
Outline
SIMT architectures
How to keep threads synchronized
  The old way: mask stacks
  The new way: distributed control and arbitration
Two instructions, multiple data
Parallel value locality
How to keep threads synchronized?
Issue: control divergence.

Rules of the game:
  One thread per Processing Element (PE)
  All PEs execute the same instruction
  PEs can be individually disabled

Running example, with threads 0-3 on PEs 0-3:
x = 0;
if(tid > 17) {   // Uniform condition
  x = 1;
}
if(tid < 2) {    // Divergent conditions
  if(tid == 0) {
    x = 2;
  }
  else {
    x = 3;
  }
}
The standard way: mask stack
One activity bit per thread (tid = 0, 1, 2, 3), with the stack initially holding the mask 1111:

x = 0;           // stack: 1111
if(tid > 17) {   // Uniform condition: no thread takes it
  x = 1;         // body skipped entirely
}
if(tid < 2) {    // Divergent: push -> 1111 1100
  if(tid == 0) { // Divergent: push -> 1111 1100 1000
    x = 2;
  }              // pop -> 1111 1100
  else {         // push -> 1111 1100 0100
    x = 3;
  }              // pop -> 1111 1100
}                // pop -> 1111
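The mask-stack mechanism can be replayed with a small C model. This is a sketch with explicit masks rather than predicate evaluation; bit 0 stands for tid 0, so the slide notation 1100 becomes 0x3.

```c
#include <assert.h>

#define WARP 4

typedef unsigned mask;          /* 1 activity bit per thread */

static mask stack[16];
static int  sp  = 0;
static mask cur = 0xF;          /* all 4 threads active */

static void push_if(mask cond)  /* enter an if: keep threads with cond */
{
    stack[sp++] = cur;
    cur &= cond;
}
static void pop_endif(void)     /* leave the if: restore parent mask */
{
    cur = stack[--sp];
}

/* Replay the divergent example for threads tid = 0..3. */
int mask_stack_demo(int x[WARP])
{
    for (int t = 0; t < WARP; t++) x[t] = 0;       /* x = 0;            */
    push_if(0x0);                                  /* tid > 17: nobody  */
    pop_endif();                                   /* (body skipped)    */
    push_if(0x3);                                  /* tid < 2: tids 0,1 */
    push_if(0x1);                                  /*   tid == 0        */
    for (int t = 0; t < WARP; t++) if (cur >> t & 1) x[t] = 2;
    pop_endif();
    push_if(0x2);                                  /*   else: 0x3&~0x1  */
    for (int t = 0; t < WARP; t++) if (cur >> t & 1) x[t] = 3;
    pop_endif();
    pop_endif();
    return sp;                                     /* balanced: 0       */
}
```

Every divergent construct must push and pop in a strictly nested fashion, which is why this scheme only supports structured control flow.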
Goto considered harmful?
Control instructions in some CPU and GPU instruction sets:

MIPS: j, jal, jr, syscall
Intel GMA Gen4 (2006): jmpi, if, iff, else, endif, do, while, break, cont, halt, msave, mrest, push, pop
Intel GMA SB (2011): jmpi, if, else, endif, case, while, break, cont, halt, call, return, fork
AMD R500 (2005): jump, loop, endloop, rep, endrep, breakloop, breakrep, continue
AMD R600 (2007): push, push_else, pop, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after
AMD Cayman (2011): push, push_else, pop, push_wqm, pop_wqm, else_wqm, jump_any, reactivate, reactivate_wqm, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after
NVIDIA Tesla (2007): bar, bra, brk, brkpt, cal, cont, kil, pbk, pret, ret, ssy, trap, .s
NVIDIA Fermi (2010): bar, bpt, bra, brk, brx, cal, cont, exit, jcal, jmp, jmx, longjmp, pbk, pcnt, plongjmp, pret, ret, ssy, .s

Why so many? To expose the control-flow structure to the instruction sequencer: there is no generic support for arbitrary control flow.
Alternative: 1 PC / thread
Keep one Program Counter per thread (PC0-PC3) plus a master PC.

x = 0;
if(tid > 17) {
  x = 1;
}
if(tid < 2) {
  if(tid == 0) {
    x = 2;       // only PC0 points here: activity 1 0 0 0
  }
  else {
    x = 3;
  }
}

Each cycle, each thread compares its PC against the master PC:
  Match -> thread active
  No match -> thread inactive
Scheduling policy: min(SP:PC)
Which PC should be chosen as the master PC?

Conditionals and loops: follow the order of code addresses, i.e. min(PC).
  if(…){} else {} compiles to: p? br else; …; br endif; else: …; endif:
  Executing basic blocks in address order brings both paths to the reconvergence point at endif.
  while(…){} compiles to: start: …; p? br start; …
  Address order keeps threads still inside the loop ahead of threads that have exited it.
Functions: favor the maximum nesting depth, i.e. min(SP).
  For f(); with void f(){…}, execute the callee body before the code following the call site, even if the callee sits at a higher address.

G. Diamos, A. Kerr, H. Wu, S. Yalamanchili, B. Ashbaugh, S. Maiyuran. SIMD re-convergence at thread frontiers. MICRO 44, 2011.
  With compiler support, this handles unstructured control flow too, without code duplication.
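The min(PC) part of the policy can be modeled in a few lines of C. This is a simplified sketch that ignores the SP component; the function names are illustrative.

```c
#include <assert.h>
#include <limits.h>

#define WARP 4

/* Elect the master PC: smallest code address among the threads.
   Executing lower addresses first lets both sides of an if/else
   reach their common reconvergence point, at a higher address. */
static unsigned master_pc(const unsigned pc[WARP])
{
    unsigned m = UINT_MAX;
    for (int t = 0; t < WARP; t++)
        if (pc[t] < m) m = pc[t];
    return m;
}

/* A thread is active exactly when its PC matches the master PC. */
static int is_active(unsigned pc, unsigned mpc) { return pc == mpc; }

int min_pc_demo(void)
{
    /* After a divergent branch: thread 0 in the "then" block (0x20),
       threads 1-3 already at the reconvergence point (0x30). */
    unsigned pc[WARP] = { 0x20, 0x30, 0x30, 0x30 };
    unsigned mpc = master_pc(pc);  /* 0x20: run the then-block first */
    int n = 0;
    for (int t = 0; t < WARP; t++) n += is_active(pc[t], mpc);
    return n;                      /* only thread 0 is active */
}
```

Once thread 0 reaches 0x30, all four PCs agree and the warp reconverges without any stack bookkeeping.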
Our new SIMT pipeline
Each lane keeps its own PC (PC0 … PCn). A vote across the PCs elects the master PC (MPC), which drives the instruction fetch. The fetched instruction and the MPC are broadcast to every lane; each lane matches its own PC against the MPC, and on a match it executes the instruction and updates its PC. On no match, the lane discards the instruction.

S. Collange. Une architecture unifiée pour traiter la divergence de contrôle et la divergence mémoire en SIMT. SympA'14, 2011.
Benefits of multiple-PC arbitration
Before: stack, counters
  O(d) or O(log d) memory, where d = nesting depth
  1 R/W port to memory; shared state
  Exceptions: stack overflow, underflow
  C-style structured control flow only
  Partial SIMD semantics (Bougé-Levaire)
  Specific instruction sets
After: multiple PCs
  O(1) memory per thread, no shared state
  Arbitrary control flow; allows thread suspension, restart, migration
  Full SPMD semantics (multi-thread)
  Traditional languages, compilers, and instruction sets
Enables many new architecture ideas.
Outline
SIMT architectures
How to keep threads synchronized
Two instructions, multiple data
  From divergent branches
  From multiple warps
Parallel value locality
Sharing 2 resources
Each axis of the taxonomy gains an intermediate design point with a resource count of 2, between a single shared resource and one resource per thread (T0-T3):

Resource type                      1            2            Multiple
Instruction Fetch (F)              SIMT         DIMT         MIMT
Memory port, address (M)           SAMT         DAMT         MAMT
Computation/registers, data (X)    SDMT         DDMT         MDMT

A. Glew. Coherent vector lane threading. Berkeley ParLab Seminar, 2009.
Simultaneous Branch Interweaving
Co-issue instructions from divergent branches: fill holes in the issue schedule using parallelism from the divergent paths of the control-flow graph (basic blocks 1-6). Where baseline SIMT issues one instruction per cycle for the warp, SBI also issues a second instruction from the same warp but a different branch.
Secondary scheduler policy
Primary scheduler: MPC1 = min over i of PCi
Secondary scheduler: MPC2 = min over i of PCi, restricted to PCi ≠ MPC1

Enforce control-flow reconvergence:
  Annotate reconvergence points with a pointer to their dominator
  A thread at a reconvergence point PCrec waits while any thread of the warp still has its PC between PCdiv and PCrec
  Example: T0 and T2 (at F) wait for T1 (in D); T3 (in B) can proceed in parallel.
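The two cascaded elections can be sketched as follows, assuming the simple min-based formulation above (names are illustrative).

```c
#include <assert.h>
#include <limits.h>

#define WARP 4

/* Primary: MPC1 = min over all thread PCs.
   Secondary: MPC2 = min over thread PCs different from MPC1,
   so instructions from the two divergent paths can co-issue. */
static void dual_mpc(const unsigned pc[WARP],
                     unsigned *mpc1, unsigned *mpc2)
{
    *mpc1 = UINT_MAX;
    for (int t = 0; t < WARP; t++)
        if (pc[t] < *mpc1) *mpc1 = pc[t];
    *mpc2 = UINT_MAX;                /* UINT_MAX: no second path */
    for (int t = 0; t < WARP; t++)
        if (pc[t] != *mpc1 && pc[t] < *mpc2) *mpc2 = pc[t];
}

int sbi_demo(void)
{
    /* Threads 0 and 2 in one branch, threads 1 and 3 in the other. */
    unsigned pc[WARP] = { 0x40, 0x80, 0x40, 0x80 };
    unsigned m1, m2;
    dual_mpc(pc, &m1, &m2);
    return m1 == 0x40 && m2 == 0x80; /* both paths issue this cycle */
}
```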
Implementation
Fermi GPUs already have 2 instruction schedulers; direct both schedulers to the same execution units.
Fermi: warp size 32, 2 warps / clock, 1 instruction / warp
SBI: warp size 64, 1 warp / clock, 2 instructions / warp
Simultaneous Warp Interweaving
Co-issue instructions from different warps: the transposition of Simultaneous Multi-Threading (SMT) into the SIMD world. SWI issues the second instruction from a different warp; SBI+SWI combines both techniques.
Implementation: cascaded scheduling
The secondary scheduler refines the initial schedule: it looks for a warp instruction with a disjoint set of active threads.
SBI/SWI: warp size 64, 1 warp / clock
Detecting compatible warps
Bitset inclusion test with a Content-Addressable Memory: treating zeros as don't-care bits, search the activity masks of warps W0-W6 for one disjoint from the primary mask m. A fully-associative lookup is power-hungry.
Set-associative lookup: split the warps into sets and restrict the lookup to warps of the same set. More power-efficient.
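The compatibility test reduces to a bitwise check, and the set-associative restriction to probing only the warps of one set. The set mapping below (warp index modulo the number of sets) is an assumption for illustration.

```c
#include <assert.h>

/* A warp is compatible with the primary instruction's activity
   mask m iff their active-thread sets are disjoint: m AND w == 0. */
static int compatible(unsigned m, unsigned w) { return (m & w) == 0; }

/* Set-associative variant: probe only warps mapped to the given set
   instead of a full CAM search over every warp. */
static int find_compatible(unsigned m, const unsigned warp_mask[],
                           int nwarps, int set, int nsets)
{
    for (int w = set; w < nwarps; w += nsets)  /* warps of this set */
        if (compatible(m, warp_mask[w]))
            return w;
    return -1;                                 /* no hit in this set */
}

int swi_demo(void)
{
    unsigned masks[6] = { 0xF0, 0x0F, 0xFF, 0x33, 0xCC, 0x0F };
    /* Primary mask 0x0F (lanes 0-3 active); probe set 0 of 2 sets,
       i.e. warps 0, 2, 4. Warp 0 (mask 0xF0) is disjoint. */
    return find_compatible(0x0F, masks, 6, 0, 2);
}
```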
Set-associative lookup is good enough
3-way: 97% of fully-associative (23-way) performance
Direct-mapped: 96%
Using divergence correlations
Issue: unbalanced divergence introduces conflicts. In a parallel reduction, for example, warp 0 is never compatible with warp 2: after divergence, both keep only lane 0 active, so their masks always conflict in lane 0.
Solution: static lane shuffling. Apply a different lane permutation to each warp, so that the threads 0 of different warps map to different physical lanes and no longer conflict. The permutation preserves inter-thread memory locality.
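One possible permutation is a per-warp rotation, sketched below; this particular choice is illustrative and the actual permutation used in the work may differ, as long as it is fixed per warp.

```c
#include <assert.h>

#define LANES 4

/* Static lane shuffling: warp w maps logical thread t to physical
   lane (t + w) % LANES. Any fixed per-warp permutation that spreads
   the surviving logical threads across lanes would do. */
static int physical_lane(int warp, int t) { return (t + warp) % LANES; }

/* In a parallel reduction, only logical thread 0 of each warp
   survives. Check that warps 0 and 2 no longer collide on the
   same physical lane after shuffling. */
int shuffle_demo(void)
{
    int lane_w0 = physical_lane(0, 0);  /* warp 0, thread 0 */
    int lane_w2 = physical_lane(2, 0);  /* warp 2, thread 0 */
    return lane_w0 != lane_w2;          /* disjoint: now compatible */
}
```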
Results
Collaboration with Nicolas Brunie (LIP, ENS Lyon / Kalray) and Gregory Diamos (Georgia Tech / NVIDIA).

Speedup    Regular applications    Irregular applications
SBI        +15%                    +41%
SWI        +25%                    +33%
SBI+SWI    +23%                    +40%
Outline
SIMT architectures
How to keep threads synchronized
Two instructions, multiple data
Parallel value locality
  Dynamic scalarization
  Affine vector cache
  Affine-aware register allocation
32 birds with 1 stone
What about SI SD SA MT: a single instruction stream, data path, and memory port shared by all threads? Not as crazy as it looks…
Phenomenon: parallel value locality
Applications: instruction sharing, register sharing
What are we computing on?
Uniform data: in a warp, v[tid] = c. Example: 5 5 5 5 5 5 5 5 (c = 5).
Affine data: in a warp, v[tid] = b + tid × s, with base b and stride s. Example: 8 9 10 11 12 13 14 15 (b = 8, s = 1).
(Chart: average frequency of Uniform, Affine, and Other vectors among RF reads and operations in GPGPU applications.)
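Both classes can be recognized from the first two lanes of a vector, since a uniform vector is simply an affine one with stride 0. A small illustrative sketch:

```c
#include <assert.h>

#define WARP 8

/* Classify a warp-wide vector v[tid]: affine if v[tid] = b + tid*s,
   uniform in the special case s = 0. The first two lanes propose
   (b, s); the remaining lanes only confirm the pattern. */
static int is_affine(const int v[WARP], int *base, int *stride)
{
    *base   = v[0];
    *stride = v[1] - v[0];
    for (int t = 2; t < WARP; t++)
        if (v[t] != *base + t * *stride) return 0;
    return 1;
}

int classify_demo(void)
{
    int b, s;
    int uni[WARP]   = { 5, 5, 5, 5, 5, 5, 5, 5 };
    int aff[WARP]   = { 8, 9, 10, 11, 12, 13, 14, 15 };
    int other[WARP] = { 3, 1, 4, 1, 5, 9, 2, 6 };
    int ok = 1;
    ok &= is_affine(uni, &b, &s) && s == 0;       /* uniform: stride 0 */
    ok &= is_affine(aff, &b, &s) && b == 8 && s == 1;
    ok &= !is_affine(other, &b, &s);
    return ok;
}
```

This is why two lanes suffice to encode a uniform or affine vector: (b, s) carries all the information.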
Dynamic scalarization: tagging registers
Associate a tag with each vector register: Uniform (U), Affine (A), or unKnown (K). Propagate tags across arithmetic instructions. Two lanes are enough to encode uniform and affine vectors.

Instructions                    Tags
mov    i ← tid                  A ← A
loop:
load   t ← X[i]                 K ← U[A]
mul    t ← a×t                  K ← U×K
store  X[i] ← t                 U[A] ← K
add    i ← i+tcnt               A ← A+U
branch i<n? loop                A < U?
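The tag propagation can be sketched as a small lattice over {U, A, K}. The rules below are inferred from the trace above; a real implementation would also track the affine base and stride in two scalar lanes.

```c
#include <assert.h>

typedef enum { U, A, K } tag;   /* Uniform, Affine, unKnown */

/* add: U+U=U; anything affine stays affine (bases and strides add);
   anything involving K becomes K. */
static tag tag_add(tag x, tag y)
{
    if (x == K || y == K) return K;
    if (x == A || y == A) return A;
    return U;
}
/* mul: U*U=U; affine times uniform stays affine (stride scales);
   affine times affine is quadratic in tid, hence unknown. */
static tag tag_mul(tag x, tag y)
{
    if (x == U && y == U) return U;
    if ((x == A && y == U) || (x == U && y == A)) return A;
    return K;
}

int tag_demo(void)
{
    tag i = A;             /* mov i <- tid    : tid is affine        */
    tag t = K;             /* load t <- X[i]  : loaded data unknown  */
    t = tag_mul(U, t);     /* mul t <- a*t    : U*K = K              */
    tag tcnt = U;          /* thread count: same for all threads     */
    i = tag_add(i, tcnt);  /* add i <- i+tcnt : A+U = A              */
    return i == A && t == K;
}
```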
Dynamic scalarization: clock-gating
Pipeline: Fetch, Decode, De-duplication (reading the tags), operand read from the scalar RF or the vector RF, Execute, Branch/Mask. After de-duplication, register IDs travel with their tag, and uniform or affine work is steered to the scalar path. The vector resources can then be clock-gated: they are inactive for 24% of instructions and for 38% of operands.

S. Collange, D. Defour, Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Euro-Par HPPC'09, 2009.
Why on-chip memory size matters
Conventional wisdom: the NVIDIA CUDA Programming Guide depicts GPUs as devoting far less area to caches than CPUs.
Actual data: counting register files and caches together, GPUs hold 3.9 MB on the NVIDIA GF110 and 7.7 MB on the AMD Cayman. At this rate, GPUs will catch up with CPU on-chip memory capacities by 2012…
What is inside thread-private memory?
Private memory is an extension to the RF: it contains the call stack, local arrays, and spilled registers.
80% of private memory traffic is affine (versus 50% for RF traffic).
Affine Vector Cache
Used as a level-1 cache. Private memory is physically interleaved across threads, so 1 cache line = 1 spilled vector register. Affine vectors are stored as (base, stride) only: 16× more compact.

Research project of Alexandre Kouyoumdjian, LIP, ENS Lyon, April-May 2011.
What is inside a GPU register file?
50% to 92% of the GPU RF contains affine variables: more than the share of register reads, because non-affine variables are short-lived. This also explains the private memory traffic.
Non-affine registers alive in the inner loop:
  MatrixMul: 3 non-affine out of 14
  Needleman-Wunsch: 2 non-affine out of 24
  Convolution: 4 non-affine in the hotspot, out of 14

Research project of Élie Gédéon, LIP, ENS Lyon, June-July 2011.
Compilers to the rescue
Static analysis to identify affine registers.
Issue: divergent control flow introduces dependencies.
Solution: gated-SSA form + live-range splitting.
Application: spill affine variables to shared memory. Up to 40% speedup on current GPUs, when spilling 8 registers / thread.

Collaboration with Fernando Magno Quintão Pereira, Diogo Sampaio, Rafael Martins, Universidade Federal de Minas Gerais, Brazil.
Future direction: affine cache as RF
SIMT execution needs only 1 tag lookup per operand: the same architectural-to-microarchitectural register translation applies to all lanes. Instruction fetch and decode go through the tag array and broadcast instructions to the lanes. An affine ALU, backed by its own affine array, handles most control flow and address computation, while the per-lane L0 arrays and the vector ALUs/FPUs do the heavy lifting.
Open question: should the warp scheduling and the replacement policy be coordinated?
Bottom line: the missing link
A new micro-architecture space opens between Clustered Multi-Threading (CMT) and SIMD, with new ways to exploit parallel value locality for higher perf/W.
The design spectrum runs from the SIMD programming model to the multi-thread programming model: SIMD, stack-based SIMT, PC-based SIMT, CMT, SMT, CMP. Moving toward SIMD optimizes for parallel locality; moving toward CMP allows more flexibility.
Conclusion: research factorization?
Clustered multi-thread architectures: choose between
  Replication
  Time-multiplexing
  Factorization (new!)
Instruction fetch policy in multi-thread processors: balance
  Instruction throughput
  Fairness
  Parallel locality (new!)
Control-flow reconvergence points
  For latency: to reduce the branch misprediction penalty
  For throughput: to restore thread synchronization (new!)
Cross-fertilization with ideas from "classical" superscalar microarchitecture?