ECE 8823A
GPU Architectures
Module 5: Execution and Resources - I
Reading Assignment
• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6
• CUDA Programming Guide – http://docs.nvidia.com/cuda/cuda-c-programming-guide/
Objective
• To understand the implications of programming model constructs on the demand for execution resources
• To be able to reason about the performance consequences of programming model parameters
– Thread blocks, warps, memory behaviors, etc.
– A deeper understanding of the architecture (covered later) is needed for this reasoning to be really valuable
• To understand DRAM bandwidth
– The cause of the DRAM bandwidth problem
– Programming techniques that address the problem: memory coalescing and corner turning
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
Formation of Warps
• How do you form warps out of multidimensional arrays of threads?
– Linearize the thread IDs, then partition the linear sequence into warps

[Figure: a 2x2 grid of thread blocks; within Block (1,1), the threads of a 1D and a 3D thread block are grouped into a warp]
Formation of Warps
[Figure: within Block (1,1), the threads of a 2D/3D thread block are laid out in linear order: T0,0,0 T0,0,1 T0,0,2 T0,0,3 T0,1,0 T0,1,1 T0,1,2 T0,1,3 T1,0,0 T1,0,1 T1,0,2 T1,0,3 T1,1,0 T1,1,1 T1,1,2 T1,1,3]
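The linearization rule can be sketched in plain C (a host-side illustration; the x-fastest ordering and warp size of 32 follow the slides, the function names are mine):

```c
/* CUDA linearizes a 3D thread index with x varying fastest, then y, then z. */
static int linear_tid(int tx, int ty, int tz, int dimx, int dimy) {
    return tz * dimx * dimy + ty * dimx + tx;
}

/* Consecutive linear IDs are then partitioned into warps. */
static int warp_of(int tx, int ty, int tz, int dimx, int dimy, int warp_size) {
    return linear_tid(tx, ty, tz, dimx, dimy) / warp_size;
}
```

For the 4x2x2 block in the figure, thread (z=1, y=0, x=2) gets linear ID 1*8 + 0 + 2 = 10, matching its position in the linear order shown above.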
Execution of Warps
• Each warp is executed as a SIMD bundle
• How do we handle divergent control flow among threads in a warp?
– Execution semantics
– How is it implemented? (later)
Reduction: Approach 1
extern __shared__ float partialsum[];
...
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2*stride) == 0)
        partialsum[t] += partialsum[t + stride];
}
[Figure: reduction tree over a thread block, indexed by threadIdx.x; pairs 0+1, 2+3, 4+5, 6+7 are combined into partial sums 0..3 and 4..7, and then into 0..7]
Reduction: Approach 2
extern __shared__ float partialsum[];
...
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x/2; stride >= 1; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialsum[t] += partialsum[t + stride];
}
• The difference is in which threads diverge!
• For a thread block of 512 threads:
– On the first step, threads 0-255 take the branch and threads 256-511 do not
• For a warp size of 32, all threads in a warp then have identical branch conditions → no divergence!
• Once the number of active threads drops below the warp size, the old divergence problem returns
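The difference can be checked with a small C model (a sketch; it counts the warps whose threads disagree on the branch outcome at a given stride):

```c
/* Returns how many warp_size-thread warps contain both active and inactive
 * threads (i.e., diverge) at one reduction step. Scheme 1 activates threads
 * with t % (2*stride) == 0; scheme 2 activates threads with t < stride. */
static int divergent_warps(int nthreads, int warp_size, int stride, int scheme) {
    int count = 0;
    for (int w = 0; w < nthreads / warp_size; ++w) {
        int active = 0, inactive = 0;
        for (int t = w * warp_size; t < (w + 1) * warp_size; ++t) {
            int on = (scheme == 1) ? (t % (2 * stride) == 0) : (t < stride);
            if (on) active++; else inactive++;
        }
        if (active > 0 && inactive > 0) count++;
    }
    return count;
}
```

For 512 threads on the first step, scheme 1 (stride 1) diverges in all 16 warps, while scheme 2 (stride 256) diverges in none; scheme 2 only starts diverging once the stride drops below the warp size.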
Global Memory (DRAM) Bandwidth
[Figure: ideal vs. actual DRAM bandwidth]
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
DRAM Bank Organization
• Each core array holds about 1M bits
• Each bit is stored in a tiny cell made of one transistor and one capacitor
[Figure: DRAM bank organization; the row address drives a row decoder that selects a row of the memory cell core array, sense amps capture the row into column latches, and a mux driven by the column address narrows the wide internal path down to the off-chip pin interface]
A very small (8x2 bit) DRAM Bank
[Figure: an 8x2-bit bank; a decoder selects a row of the cell array, sense amps capture it, and a mux selects one of the two output bits]
DRAM core arrays are slow
• Reading from a cell in the core array is a very slow process
– DDR: core speed = ½ interface speed
– DDR2/GDDR3: core speed = ¼ interface speed
– DDR3/GDDR4: core speed = ⅛ interface speed
– ... likely to be worse in the future
[Figure: one column of the core array; each cell is a very small capacitance that stores one data bit, and about 1000 cells are connected to each vertical line running to the sense amps]
DRAM Bursting
• For DDR{2,3} SDRAM cores clocked at 1/N the speed of the interface:
– Load (N × interface width) DRAM bits from the same row at once into an internal buffer, then transfer them in N steps at interface speed
– DDR2/GDDR3: buffer width = 4× interface width
[Figure: two snapshots of the 8x2 bank during a burst; the selected row is first read into the sense amps, then the sense amps and buffer feed the mux, which streams the bits out at interface speed]
“You can buy bandwidth but you can’t bribe God.” -- Unknown
DRAM Bursting for the 8x2 Bank
[Figure: timing comparison; non-burst timing sends address bits to the decoder and pays the full core-array access delay before every 2-bit transfer to the pins, while burst timing pays the delay once and streams several 2-bit transfers back to back]
Modern DRAM systems are designed to always be accessed in burst mode. When accesses are not to sequential locations, the burst bytes are still transferred but are discarded.
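A toy C model of that waste (the 64-byte burst size is an assumption for illustration, not a figure from the slides):

```c
/* Count the distinct bursts touched by n accesses issued in increasing
 * address order at a fixed byte stride; every touched burst is transferred
 * whole, whether or not its bytes are used. */
static int bursts_touched(int n, int stride_bytes, int burst_bytes) {
    int count = 0;
    long last = -1;
    for (int i = 0; i < n; ++i) {
        long b = (long)i * stride_bytes / burst_bytes;
        if (b != last) { count++; last = b; }
    }
    return count;
}
```

With 64-byte bursts, 32 sequential 4-byte accesses touch only 2 bursts (every transferred byte is used); the same 32 accesses at a 64-byte stride touch 32 bursts and discard most of each one.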
Multiple DRAM Banks
[Figure: two DRAM banks (Bank 0 and Bank 1), each with its own decoder, sense amps, and mux, sharing the data path]
DRAM Bursting for the 8x2 Bank
[Figure: timing comparison; single-bank burst timing leaves dead time on the interface between bursts while the core array is accessed, whereas multi-bank burst timing interleaves bursts from different banks and reduces the dead time]
First-order Look at the GPU Off-chip Memory Subsystem
• NVIDIA GTX280 GPU:
– Peak global memory bandwidth = 141.7 GB/s
– Global memory (GDDR3) interface @ 1.1 GHz (core speed @ 276 MHz)
– A typical 64-bit interface can sustain only about 17.6 GB/s (recall DDR: 2 transfers per clock)
– We need a lot more bandwidth (141.7 GB/s), hence 8 memory channels
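The slide's arithmetic, reproduced in C:

```c
/* Sustained bandwidth of one DRAM channel in GB/s:
 * clock rate x transfers per clock x bus width in bytes. */
static double channel_gbps(double clock_hz, int transfers_per_clock,
                           int bus_width_bytes) {
    return clock_hz * transfers_per_clock * bus_width_bytes / 1e9;
}
```

channel_gbps(1.1e9, 2, 8) gives 17.6 GB/s for one 64-bit DDR channel at 1.1 GHz, and 141.7 / 17.6 ≈ 8.05, so 8 channels are needed to reach the GTX280's peak.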
Multiple Memory Channels
• Divide the memory address space into N parts
– N is the number of memory channels
– Assign each portion to a channel
[Figure: four memory channels (Channel 0 through Channel 3), each containing multiple interleaved banks]
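A minimal sketch of such a partition in C, using low-order interleaving (the 256-byte interleave granularity is a made-up parameter for illustration, not a figure from the slides):

```c
/* Map a byte address to a memory channel: consecutive granularity-byte
 * chunks of the address space rotate round-robin across the channels. */
static int channel_of(unsigned long addr, int nchannels,
                      unsigned long granularity) {
    return (int)((addr / granularity) % (unsigned long)nchannels);
}
```

Consecutive chunks then land on channels 0, 1, 2, 3, 0, ..., so a long sequential stream keeps all channels busy.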
Memory Controller Organization of a Many-Core Processor
• GTX280: 30 Streaming Multiprocessors (SMs) connected to 8-channel DRAM controllers through an interconnect
– DRAM controllers are interleaved
– Within each DRAM controller (channel), DRAM banks are interleaved for incoming memory requests
Lessons
• Organize data accesses to maximize burst-mode bandwidth
– Access consecutive locations
– Algorithmic strategies + data layout
• Thread blocks issue warp-sized load/store instructions
– 32 addresses per warp in Fermi
– Coalesce these accesses to create a smaller number of memory transactions and maximize memory bandwidth
– More later as we discuss the microarchitecture
Memory Coalescing
• Memory references are coalesced into a sequence of memory transactions
– Accesses that fall within the same segment (e.g., a 128-byte segment) are coalesced

[Figure: an opportunity to coalesce; a warp issues loads (LD) to consecutive locations, 16 × 4 = 64 bytes, which can be combined into one transaction]
Implications of Memory Coalescing
• Reduces the request rate to L1 and DRAM
• Distinct from CPU optimizations -- why?
• Need to be able to re-map the entries of each access back to threads
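A C sketch of the coalescing rule (128-byte aligned segments as on the previous slide; addresses are assumed to be issued in increasing order, and the helper is mine):

```c
/* Number of seg_bytes-aligned memory transactions needed to cover a warp's
 * accesses: n lanes, one access each, at a fixed byte stride from base. */
static int warp_transactions(unsigned long base, unsigned long stride,
                             int n, unsigned long seg_bytes) {
    int count = 0;
    unsigned long last_seg = (unsigned long)-1;
    for (int i = 0; i < n; ++i) {
        unsigned long seg = (base + (unsigned long)i * stride) / seg_bytes;
        if (seg != last_seg) { count++; last_seg = seg; }
    }
    return count;
}
```

Thirty-two consecutive 4-byte accesses starting at an aligned base coalesce into a single 128-byte transaction; at a 128-byte stride, every lane needs its own transaction, and a misaligned base adds one extra transaction.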
[Figure: an SM with warp schedulers, a register file, an array of SPs, and L1/shared memory, connected to DRAM; contrasting L1 access bandwidth with DRAM access bandwidth]
Placing a 2D C array into linear memory space

[Figure: the 4x4 matrix M is stored row by row in linearized order of increasing address: M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3]
Base Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Calculate the row index of the d_P element and d_M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of d_P and d_N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the output sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

    d_P[Row*Width+Col] = Pvalue;
}
Two Access Patterns
[Figure: the WIDTH x WIDTH matrices d_M and d_N with the elements touched by Thread 1 and Thread 2: (a) d_M[Row*Width+k] walks along a row of d_M, (b) d_N[k*Width+Col] walks down a column of d_N; k is the loop counter of the kernel's inner-product loop]
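The two index expressions can be compared directly in C (a sketch of the kernel's address arithmetic; the helper names are mine):

```c
/* Element offsets used by the base matrix multiplication kernel. */
static int m_offset(int Row, int k, int Width) { return Row * Width + k; }
static int n_offset(int k, int Col, int Width) { return k * Width + Col; }
```

At a fixed k, consecutive values of Col touch consecutive d_N elements (offset difference 1), while consecutive values of Row touch d_M elements a full row apart (offset difference Width); this difference is what the next two slides illustrate.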
N accesses are coalesced

[Figure: for d_N[k*Width+Col], threads T0-T3 of a warp load N0,0 N0,1 N0,2 N0,3 in iteration 0 and N1,0 N1,1 N1,2 N1,3 in iteration 1; across successive threads the addresses are consecutive in the linearized array N0,0 N0,1 ... N3,3, so each load coalesces, while the access direction within one thread runs down a column]
M accesses are not coalesced

[Figure: for d_M[Row*Width+k], threads T0-T3 of a warp load M0,0 M1,0 M2,0 M3,0 in iteration 0 and M0,1 M1,1 M2,1 M3,1 in iteration 1; across successive threads the addresses are a full row (Width elements) apart in the linearized array, so the loads cannot coalesce, while the access direction within one thread runs along a row]
Using Shared Memory
[Figure: original access pattern vs. tiled access pattern on the WIDTH x WIDTH matrices d_M and d_N; tiles are first copied into scratchpad (shared) memory, and the multiplication is then performed out of the scratchpad values]
Shared Memory Accesses
• Shared memory is banked
– No coalescing here
• Data access patterns should be structured to avoid bank conflicts
• Low-order interleaved mapping?
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the d_P element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the d_M and d_N tiles required to compute the d_P element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of d_M and d_N tiles into shared memory
        Mds[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
        Nds[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    d_P[Row*Width+Col] = Pvalue;
}
• Accesses now come from shared memory, so coalescing is not necessary
• But consider shared memory bank conflicts
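Bank conflicts can be checked with a small C model (32 banks of 4-byte words is a Fermi-style assumption; the function is my sketch, not CUDA API):

```c
/* Worst-case number of threads hitting the same bank when thread t accesses
 * shared-memory word index t * stride_words; 32 banks assumed, with bank =
 * word index mod 32. A result of 1 means the access is conflict-free. */
static int max_bank_conflict(int stride_words, int nthreads) {
    int hits[32] = {0};
    int worst = 0;
    for (int t = 0; t < nthreads; ++t) {
        int bank = (t * stride_words) % 32;
        if (++hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

Stride-1 access (Nds[k][tx] across tx) is conflict-free; a stride of 32 words (walking a column of a 32-wide tile) sends all 32 threads to the same bank, a 32-way conflict.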
Coalescing Behavior
[Figure: tiled matrix multiplication; a TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub of d_P at (Row, Col) is computed from a row of d_M tiles and a column of d_N tiles, each offset by m*TILE_WIDTH in phase m]
Thread Granularity
[Figure: an SM with fetch/decode, warp schedulers, a register file, an array of SPs, and L1/shared memory, connected to DRAM]

• Consider instruction bandwidth vs. memory bandwidth
• Control the amount of work per thread
Thread Granularity Tradeoffs
[Figure: tiled matrix multiplication as before; two horizontally adjacent Pdsub tiles are computed from the same row of d_M tiles]

• Preserving instruction bandwidth (and memory bandwidth)
– Increase thread granularity
– Merge adjacent tiles: the merged threads share tile data
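A back-of-the-envelope C model of the saving (it counts whole-tile loads only; the sharing pattern follows the slide, the function itself is my sketch):

```c
/* Global tile loads needed to produce `outputs` horizontally adjacent d_P
 * tiles over `phases` = Width/TILE_WIDTH phases: the row of d_M tiles is
 * loaded once and shared, but each output needs its own d_N tile column. */
static int tile_loads(int outputs, int phases) {
    return phases + outputs * phases;
}
```

For Width/TILE_WIDTH = 8, two merged tiles need 24 tile loads instead of the 2 * 16 = 32 required when they are computed separately, a 25% cut in global tile traffic, at the cost of more registers per thread and fewer thread blocks per SM.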
Thread Granularity Tradeoffs (2)
[Figure: tiled matrix multiplication as before]

• Impact on parallelism
– Number of thread blocks, registers per thread
– The impact must be explored empirically: autotuning
ANY MORE QUESTIONS? READ CHAPTER 6!