CODE GPU WITH CUDA
SIMT
NVIDIA GPU ARCHITECTURE
Created by Marina Kolpakova (cuda.geek) for Itseez
OUTLINE
- Hardware revisions
- SIMT architecture
- Warp scheduling
- Divergence & convergence
- Predicated execution
- Conditional execution
OUT OF SCOPE
- Computer graphics capabilities
HARDWARE REVISIONS
SM (shader model) – a particular hardware implementation.

Generation | SM    | GPU models
-----------|-------|--------------------------------
Tesla      | sm_10 | G80 G92(b) G94(b)
           | sm_11 | G86 G84 G98 G96(b) G94(b) G92(b)
           | sm_12 | GT218 GT216 GT215
           | sm_13 | GT200 GT200b
Fermi      | sm_20 | GF100 GF110
           | sm_21 | GF104 GF114 GF116 GF108 GF106
Kepler     | sm_30 | GK104 GK106 GK107
           | sm_32 | GK20A
           | sm_35 | GK110 GK208
           | sm_37 | GK210
Maxwell    | sm_50 | GM107 GM108
           | sm_52 | GM204
           | sm_53 | GM20B
LATENCY VS THROUGHPUT ARCHITECTURES
Modern CPUs and GPUs are both multi-core systems.

CPUs are latency oriented:
- Pipelining, out-of-order, superscalar execution
- Caching, on-die memory controllers
- Speculative execution, branch prediction
- Compute cores occupy only a small part of the die

GPUs are throughput oriented:
- 100s of simple compute cores
- Zero-cost scheduling of 1000s of threads
- Compute cores occupy most of the die
SIMD – SIMT – SMT
SIMT: Single Instruction, Multiple Threads

SIMD: elements of short vectors are processed in parallel. The problem is represented as short vectors and processed vector by vector. Hardware support for wide arithmetic.

SMT: instructions from several threads are run in parallel. The problem is represented as a set of independent tasks, which are assigned to different threads. Hardware support for multi-threading.

SIMT = vector processing + light-weight threading:
- A warp is the unit of execution; all its lanes perform the same instruction each cycle. A warp is 32 lanes wide.
- Thread scheduling and fast context switching between warps minimize stalls.
SIMT
DEPTH OF MULTI-THREADING × WIDTH OF SIMD
1. SIMT is an abstraction over vector hardware:
- Threads are grouped into warps (32 for NVIDIA)
- A thread in a warp is usually called a lane
- Vector register file; registers are accessed line by line
- A lane loads the laneId-th element of a vector register
- Single program counter (PC) for the whole warp
- Only a couple of special registers, like the PC, are scalar

2. SIMT hardware is responsible for warp scheduling:
- Static for all latest hardware revisions
- Zero overhead on context switching
- Score-boarding for long-latency operations
SASS ISA
SIMT is RISC-like:
- Memory instructions are separate from arithmetic
- Arithmetic is performed only on registers and immediates
SIMT PIPELINE
- The warp scheduler manages warps and selects those ready to execute
- A fetch/decode unit is associated with each warp scheduler
- Execution units are the SCs, SFUs, and LD/ST units

Area- and power-efficiency thanks to regularity.
VECTOR REGISTER FILE
Near-zero-cost warp switching requires a big vector register file (RF):
- While a warp is resident on an SM it occupies a portion of the RF
- The GPU RF is 32-bit; 64-bit values are stored in register pairs
- Fast switching costs register wastage on duplicated items
- Narrow data types are as costly as wide data types

The RF size depends on the architecture. Fermi: 128 KB per SM; Kepler: 256 KB per SM; Maxwell: 64 KB per scheduler.
DYNAMIC VS STATIC SCHEDULING
Static scheduling:
- Instructions are fetched, executed & completed in compiler-generated order
- In-order execution
- If one instruction stalls, all following instructions stall too

Dynamic scheduling:
- Instructions are fetched in compiler-generated order
- Instructions are executed out of order
- A special unit tracks dependencies and reorders instructions
- Independent instructions behind a stalled instruction can pass it
WARP SCHEDULING
- The GigaThread engine subdivides work between SMs
- Work for an SM is sent to a warp scheduler
- An assigned warp cannot migrate between schedulers
- Each warp has its own lines in the register file, its own PC, and an activity mask
- A warp can be in one of the following states:
  - Executed: performs an operation
  - Ready: waits to be executed
  - Wait: waits for resources
  - Resident: waits for completion of other warps within the same block
WARP SCHEDULING
Depending on the generation, scheduling is dynamic (Fermi) or static (Kepler, Maxwell).
WARP SCHEDULING (CONT)
- Modern warp schedulers support dual issue (sm_21+): an instruction pair is decoded for the active warp each clock
- An SM has 2 or 4 warp schedulers depending on the architecture
- Warps belong to blocks; the hardware tracks this relation as well
DIVERGENCE & (RE)CONVERGENCE
Divergence: not all lanes in a warp take the same code path.

Convergence is handled via the convergence stack. A convergence stack entry includes:
- the convergence PC
- the next-path PC
- a lane mask (marks the active lanes on that path)

The SSY instruction pushes a convergence stack entry; it is issued before potentially divergent instructions. <INSTR>.S marks the convergence point – the instruction after which all lanes in the warp take the same code path again.
DIVERGENT CODE EXAMPLE
(void)atomicAdd(&smem[0], src[threadIdx.x]);

/*0050*/ SSY 0x80;
/*0058*/ LDSLK P0, R3, [RZ];
/*0060*/ @P0 IADD R3, R3, R0;
/*0068*/ @P0 STSUL [RZ], R3;
/*0070*/ @!P0 BRA 0x58;
/*0078*/ NOP.S;

Assume warp size == 4.
PREDICATED & CONDITIONAL EXECUTION
Predicated execution:
- Frequently used for if-then statements, rarely for if-then-else. The decision is made by a compiler heuristic.
- Reduces divergence overhead.
Conditional execution:
- A compare instruction sets the condition code (CC) register
- CC is a 4-bit state vector (sign, carry, zero, overflow)
- No write-back stage for CC-marked registers
- Used in Maxwell to skip unneeded computations for arithmetic operations implemented in hardware with multiple instructions
IMAD R8.CC, R0, 0x4, R3;
FINAL WORDS
- SIMT is a RISC-based, throughput-oriented architecture
- SIMT combines vector processing and light-weight threading
- SIMT instructions are executed per warp
- A warp has its own PC and activity mask
- Branching is implemented via divergence, predicated, or conditional execution
THE END
BY CUDA.GEEK / 2013–2015