Prepared 8/9/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
CUDA Lecture 8: CUDA Memories
Each thread can:
  Read/write per-thread registers
  Read/write per-thread local memory
  Read/write per-block shared memory
  Read/write per-grid global memory
  Read (only) per-grid constant memory
CUDA Memories – Slide 2
Hardware Implementation of CUDA Memories
[Figure: CUDA memory spaces. The host reads/writes the grid's global and constant memory; each block has its own shared memory; each thread in a block has its own registers.]
__device__ is optional when used with __local__, __shared__ or __constant__
CUDA Memories – Slide 3
CUDA Variable Type Qualifiers

Variable declaration                       Memory     Scope    Lifetime
int LocalVar;                              register   thread   thread
__device__ __local__ int LocalVar;         local      thread   thread
  (automatic arrays such as int ArrayVar[10]; also go to local memory)
__device__ __shared__ int SharedVar;       shared     block    block
__device__ int GlobalVar;                  global     grid     application
__device__ __constant__ int ConstantVar;   constant   grid     application
Automatic scalar variables without any qualifier reside in registers; the compiler will spill them to thread-local memory when registers run out.
Automatic array variables without any qualifier reside in thread-local memory.
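As a hedged illustration of these placement rules (not taken from the slides; the kernel and variable names are made up for this example):

// Hypothetical kernel showing where unqualified automatic variables live.
__global__ void placement_demo(float *out)
{
  int i = threadIdx.x;            // automatic scalar: lives in a register
  float temp[32];                 // automatic array: placed in thread-local memory (per the rule above)

  for (int k = 0; k < 32; ++k)    // k: another register
    temp[k] = (float)(i + k);

  out[i] = temp[i % 32];
}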
CUDA Memories – Slide 4
CUDA Variable Type Qualifiers (cont.)
Variable declaration                       Memory     Scope    Lifetime
int LocalVar;                              register   thread   thread
__device__ __local__ int LocalVar;         local      thread   thread
  (automatic arrays such as int ArrayVar[10]; also go to local memory)
__device__ __shared__ int SharedVar;       shared     block    block
__device__ int GlobalVar;                  global     grid     application
__device__ __constant__ int ConstantVar;   constant   grid     application
Scalar variables reside in fast, on-chip registers.
Shared variables reside in fast, on-chip memories.
Thread-local arrays and global variables reside in uncached off-chip memory.
Constant variables reside in cached off-chip memory.
CUDA Memories – Slide 5
CUDA Variable Type Performance

Variable declaration                       Memory     Penalty
int LocalVar;                              register   1x
__device__ __local__ int LocalVar;         local      100x
  (and automatic arrays such as int ArrayVar[10];)
__device__ __shared__ int SharedVar;       shared     1x
__device__ int GlobalVar;                  global     100x
__device__ __constant__ int ConstantVar;   constant   1x
100,000s of per-thread variables, each read/written by 1 thread
100s of shared variables, each read/written by 100s of threads
1 global variable, read/written by 100,000s of threads
1 constant variable, readable by 100,000s of threads
CUDA Memories – Slide 6
CUDA Variable Type Scale

Variable declaration                       Instances   Visibility
int LocalVar;                              100,000s    1
__device__ __local__ int LocalVar;         100,000s    1
  (and automatic arrays such as int ArrayVar[10];)
__device__ __shared__ int SharedVar;       100s        100s
__device__ int GlobalVar;                  1           100,000s
__device__ __constant__ int ConstantVar;   1           100,000s
CUDA Memories – Slide 7
Where to declare variables?

Can the host access it?
  Yes: declare outside of any function
    __constant__ int ConstantVar;
    __device__ int GlobalVar;
  No: declare in the kernel
    int LocalVar;
    int ArrayVar[10];
    __shared__ int SharedVar;
CUDA Memories – Slide 8
Example: Thread-local Variables

// motivate per-thread variables with
// Ten Nearest Neighbors application
__global__ void ten_nn(float2 *result, float2 *ps, float2 *qs,
                       size_t num_qs)
{
  // p goes in a register
  float2 p = ps[threadIdx.x];

  // per-thread heap goes in off-chip local memory
  float2 heap[10];

  // read through num_qs points, maintaining
  // the nearest 10 qs to p in the heap
  ...

  // write out the contents of heap to result
  ...
}
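The two elided sections stay as "..." on the slide. Purely as a hedged illustration of what could fill them (not the original code), a simple linear-replacement version is sketched below; it assumes num_qs >= 10 and that result stores ten float2 neighbors per thread, both of which are assumptions for this example.

// Hypothetical body for the elided sections of ten_nn above; keeps the 10
// nearest qs by replacing the current farthest candidate each step.
float best_d2[10];
for (int j = 0; j < 10; ++j) {
  heap[j] = qs[j];                                   // assumes num_qs >= 10
  float dx = qs[j].x - p.x, dy = qs[j].y - p.y;
  best_d2[j] = dx * dx + dy * dy;
}
for (size_t j = 10; j < num_qs; ++j) {
  float dx = qs[j].x - p.x, dy = qs[j].y - p.y;
  float d2 = dx * dx + dy * dy;
  int worst = 0;                                     // current farthest candidate
  for (int k = 1; k < 10; ++k)
    if (best_d2[k] > best_d2[worst]) worst = k;
  if (d2 < best_d2[worst]) { best_d2[worst] = d2; heap[worst] = qs[j]; }
}
for (int j = 0; j < 10; ++j)                         // assumed result layout: 10 per thread
  result[threadIdx.x * 10 + j] = heap[j];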
CUDA Memories – Slide 9
Example: Shared Variables

// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // each thread loads two elements from global memory
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}
CUDA Memories – Slide 10
Example: Shared Variables (cont.)

// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // what are the bandwidth requirements of this kernel?
    int x_i = input[i];               // two loads
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}
CUDA Memories – Slide 11
Example: Shared Variables (cont.)

// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // how many times does this kernel load input[i]?
    int x_i = input[i];               // once by thread i
    int x_i_minus_one = input[i-1];   // again by thread i+1

    result[i] = x_i - x_i_minus_one;
  }
}
CUDA Memories – Slide 12
Example: Shared Variables (cont.)

// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // Idea: eliminate redundancy by sharing data
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}
CUDA Memories – Slide 13
Example: Shared Variables (cont.)

// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
  // shorthand for threadIdx.x
  int tx = threadIdx.x;

  // allocate a __shared__ array, one element per thread
  __shared__ int s_data[BLOCK_SIZE];

  // each thread reads one element to s_data
  unsigned int i = blockDim.x * blockIdx.x + tx;
  s_data[tx] = input[i];

  // avoid race condition: ensure all loads
  // complete before continuing
  __syncthreads();

  if (tx > 0)
    result[i] = s_data[tx] - s_data[tx-1];
  else if (i > 0)
  {
    // handle thread block boundary
    result[i] = s_data[tx] - input[i-1];
  }
}
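For context, a hedged sketch of the host-side driver that could launch this kernel is shown below; BLOCK_SIZE = 256, the function name, and the assumption that n is a multiple of BLOCK_SIZE are illustrative choices, not part of the slide.

// Hypothetical host-side driver for adj_diff (assumes n % BLOCK_SIZE == 0).
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // must match the kernel's compile-time constant

void run_adj_diff(const int *h_input, int *h_result, int n)
{
  int *d_input = 0, *d_result = 0;
  cudaMalloc(&d_input,  n * sizeof(int));
  cudaMalloc(&d_result, n * sizeof(int));
  cudaMemcpy(d_input, h_input, n * sizeof(int), cudaMemcpyHostToDevice);

  // one thread per element, BLOCK_SIZE threads per block
  adj_diff<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_result, d_input);

  cudaMemcpy(h_result, d_result, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_input);
  cudaFree(d_result);
}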
CUDA Memories – Slide 14
Example: Shared Variables (cont.)

// when the size of the array isn't known at compile time...
__global__ void adj_diff(int *result, int *input)
{
  // use extern to indicate a __shared__ array will be
  // allocated dynamically at kernel launch time
  extern __shared__ int s_data[];
  ...
}

// pass the size of the per-block array, in bytes, as the third
// argument to the triple chevrons
adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);
Experiment performed on a GT200 chip
  Improvement likely better on an older architecture
  Improvement likely worse on a newer architecture
Optimizations tend to come with a development cost
CUDA Memories – Slide 15
Optimization Analysis

Implementation                 Original     Improved
Global loads                   2N           N + N/BLOCK_SIZE
Global stores                  N            N
Throughput                     36.8 GB/s    57.5 GB/s
Source lines of code (SLOCs)   18           35
Relative improvement           1x           1.57x
Improvement/SLOC               1x           0.81x
Pointers can only point to memory allocated or declared in global memory:
Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
Obtained as the address of a global variable: float* ptr = &GlobalVar;
CUDA Memories – Slide 16
Variable Type Restrictions
So you can use pointers to point at any memory space:
CUDA Memories – Slide 17
Variable Type Restrictions (cont.)
__device__ int my_global_variable;
__constant__ int my_constant_variable = 13;

__global__ void foo(void)
{
  __shared__ int my_shared_variable;

  int *ptr_to_global = &my_global_variable;
  const int *ptr_to_constant = &my_constant_variable;
  int *ptr_to_shared = &my_shared_variable;
  ...
  *ptr_to_global = *ptr_to_shared;
}
Pointers aren’t typed on memory space
Where does ptr point?
ptr is a __shared__ pointer variable, not a pointer to a __shared__ variable!
CUDA Memories – Slide 18
Variable Type Restrictions (cont.)
__shared__ int *ptr;
CUDA Memories – Slide 19
Don't confuse the compiler!

__device__ int my_global_variable;

__global__ void foo(int *input)
{
  __shared__ int my_shared_variable;

  int *ptr = 0;
  if (input[threadIdx.x] % 2)
    ptr = &my_global_variable;
  else
    ptr = &my_shared_variable;

  // where does ptr point?
}
Prefer dereferencing pointers in simple, regular access patterns
Avoid propagating pointers
Avoid pointers to pointers
  The GPU would rather not pointer chase
  Linked lists will not perform well
Pay attention to compiler warning messages
  Warning: Cannot tell what pointer points to, assuming global memory space
  Crash waiting to happen
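One hedged way to follow this advice for the foo example above is to select a value rather than a pointer when the two candidates live in different memory spaces. The sketch below is illustrative only; the output parameter and initial values are assumptions.

// Hypothetical rewrite of foo that never forms a pointer whose memory
// space the compiler cannot determine: choose the value, not the address.
__device__ int my_global_variable = 3;

__global__ void foo_fixed(int *input, int *output)
{
  __shared__ int my_shared_variable;
  my_shared_variable = 7;          // example initialization (assumption)
  __syncthreads();

  // choose between two values; no ambiguous pointer is ever created
  int chosen = (input[threadIdx.x] % 2) ? my_global_variable
                                        : my_shared_variable;
  output[threadIdx.x] = chosen;
}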
CUDA Memories – Slide 20
Advice
Global memory resides in device memory (DRAM)
  Much slower access than shared memory
So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  Generalize from the adjacent_difference example
  Divide and conquer
CUDA Memories – Slide 21
A Common Programming Strategy
Partition data into subsets that fit into shared memory
CUDA Memories – Slide 22
A Common Programming Strategy (cont.)
Handle each data subset with one thread block as follows:
CUDA Memories – Slide 23
A Common Programming Strategy (cont.)
Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
CUDA Memories – Slide 24
A Common Programming Strategy (cont.)
Perform the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
CUDA Memories – Slide 25
A Common Programming Strategy (cont.)
Copy the results from shared memory back to global memory
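Putting these steps together, a generic, hedged skeleton of the strategy might look like the sketch below. TILE, the element type, and the doubling "computation" are placeholders standing in for real work; they are not from the slides. The kernel would be launched with TILE threads per block.

// Hypothetical skeleton of the tile-in-shared-memory strategy.
#define TILE 256

__global__ void tiled_pattern(float *out, const float *in, int n)
{
  __shared__ float tile[TILE];

  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // load the subset from global memory to shared memory
  if (i < n) tile[threadIdx.x] = in[i];
  __syncthreads();

  // perform the computation on the subset in shared memory
  if (i < n) tile[threadIdx.x] = 2.0f * tile[threadIdx.x];   // stand-in for real work
  __syncthreads();

  // copy the results from shared memory back to global memory
  if (i < n) out[i] = tile[threadIdx.x];
}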
CUDA Memories – Slide 26
A Common Programming Strategy (cont.)
Constant memory also resides in device memory (DRAM)
  Much slower access than shared memory
  But... cached!
  Highly efficient access for read-only data
CUDA Memories – Slide 27
A Common Programming Strategy (cont.)
Carefully partition data according to access patterns:
  Read-only: __constant__ memory (very fast if in cache)
  Read/write and shared within block: __shared__ memory (very fast)
  Read/write within each thread: registers (very fast)
  Indexed read/write within each thread: local memory (slow)
  Read/write inputs/results: cudaMalloc'ed global memory (very slow)
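As a hedged illustration of the read-only case, __constant__ data is typically declared at file scope and filled from the host with cudaMemcpyToSymbol. The coefficient table, its size, and the kernel below are assumptions made for this example.

// Hypothetical use of __constant__ memory for small read-only coefficients.
__constant__ float c_coeffs[16];            // cached, read-only on the device

__global__ void apply_coeffs(float *out, const float *in, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = c_coeffs[i % 16] * in[i];      // every thread reads the cached copy
}

// host side: copy the table into constant memory before launching
//   float h_coeffs[16] = { ... };
//   cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));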
CUDA Memories – Slide 28
A Common Programming Strategy (cont.)
This is a race condition; the result is undefined
The order in which threads access the variable is undefined without explicit coordination
Two ways to enforce well-defined semantics:
CUDA Memories – Slide 29
Communication through Memory
__global__ void race(void)
{
  __shared__ int my_shared_variable;
  my_shared_variable = threadIdx.x;
}
Use barriers (e.g., __syncthreads) to ensure data is ready for access
The state of the entire data array is now well-defined for all threads in this block.
CUDA Memories – Slide 30
Communication through Memory (cont.)
__global__ void share_data(int *input)
{
  __shared__ int data[BLOCK_SIZE];
  data[threadIdx.x] = input[threadIdx.x];
  __syncthreads();
}
Use atomic operations (e.g., atomicAdd) to ensure exclusive access to a variable
After this kernel exits, the value of *result will be the sum of the inputs
CUDA Memories – Slide 31
Communication through Memory (cont.)
// assume *result is initialized to 0

__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
}
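The slide assumes *result already holds 0. One hedged way to guarantee that from the host is sketched below; d_input, N, and the single-block launch are assumptions for illustration.

// Hypothetical host-side setup for sum: zero the accumulator before launch.
void launch_sum(int *d_input, int *h_result, int N)
{
  int *d_result = 0;
  cudaMalloc(&d_result, sizeof(int));
  cudaMemset(d_result, 0, sizeof(int));   // "assume *result is initialized to 0"

  sum<<<1, N>>>(d_input, d_result);       // one block of N threads

  cudaMemcpy(h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_result);
}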
Atomic operations aren’t cheap; they imply serialized access to a variable.
How many threads will contend for exclusive access to result?
CUDA Memories – Slide 32
Resource Contention
__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
}
sum<<<B,N/B>>>(input,result);
Divide and conquer:
  Per-thread atomicAdd to a __shared__ partial sum
  Per-block atomicAdd to the total sum
CUDA Memories – Slide 33
Hierarchical Atomics

[Figure: per-block partial sums S0, S1, ..., Si are combined into the total sum S.]
CUDA Memories – Slide 34
Hierarchical Atomics (cont.)

__global__ void sum(int *input, int *result)
{
  __shared__ int partial_sum;

  // thread 0 is responsible for initializing partial_sum
  if (threadIdx.x == 0)
    partial_sum = 0;
  __syncthreads();

  // each thread updates the partial sum
  atomicAdd(&partial_sum, input[threadIdx.x]);
  __syncthreads();

  // thread 0 updates the total sum
  if (threadIdx.x == 0)
    atomicAdd(result, partial_sum);
}
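As written, each thread indexes input with threadIdx.x only; for a launch with several blocks one would typically switch to the global index. A hedged multi-block variant (the name, the bounds check, and the extra n parameter are assumptions):

// Hypothetical multi-block variant: each thread reads its own global element.
__global__ void sum_blocks(int *input, int *result, int n)
{
  __shared__ int partial_sum;

  if (threadIdx.x == 0)
    partial_sum = 0;
  __syncthreads();

  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    atomicAdd(&partial_sum, input[i]);     // per-thread add to the shared partial sum
  __syncthreads();

  if (threadIdx.x == 0)
    atomicAdd(result, partial_sum);        // per-block add to the global total
}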
Use barriers such as __syncthreads to wait until __shared__ data is ready
Prefer barriers to atomics when data access patterns are regular or predictable
Prefer atomics to barriers when data access patterns are sparse or unpredictable
Atomics to __shared__ variables are much faster than atomics to global variables
Don’t synchronize or serialize unnecessarily
CUDA Memories – Slide 35
Advice
Generalize the adjacent_difference example: compute AB = A * B
  Each element ABij = dot(row(A,i), col(B,j))
Parallelization strategy
  One thread per element ABij
  2D kernel
CUDA Memories – Slide 36
Example: Matrix Multiplication using Shared Memory
[Figure: matrices A and B and their product AB.]
CUDA Memories – Slide 37
First Try: Matrix Multiply Kernel using Multiple Blocks
__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
  // calculate the row & col index of the element
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  float result = 0;

  // do dot product between row of a and col of b
  for (int k = 0; k < width; ++k)
    result += a[row * width + k] * b[k * width + col];

  ab[row * width + col] = result;
}
CUDA Memories – Slide 38
How will this perform?

How many loads per term of the dot product?
  2 (one from a, one from b) = 8 bytes
How many floating-point (FP) operations?
  2 (multiply and add)
Global memory access to FLOP ratio (GMAC)?
  8 bytes / 2 ops = 4 B/op
What is the peak FP performance of the GeForce GTX 260?
  805 GFLOPS
Lower bound on the bandwidth required to reach peak FP performance?
  GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s
What is the actual memory bandwidth of the GeForce GTX 260?
  112 GB/s
So what is an upper bound on the performance of our implementation?
  Actual BW / GMAC = 112 / 4 = 28 GFLOPS
All threads access global memory for their input matrix elements
The actual code runs at about 15 GFLOPS
Need to drastically cut down memory accesses to get closer to the peak 805 GFLOPS
CUDA Memories – Slide 39
How will this perform? (cont.)

[Figure: the CUDA memory-space diagram again - per-thread registers, per-block shared memory, and grid-wide global and constant memory accessible from the host.]
Each input element is read by width threads
Load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth
CUDA Memories – Slide 40
Idea: Use __shared__ memory to reuse global data
[Figure: matrices A, B and AB, each width x width.]
Partition kernel loop into phases so that the data accesses in each phase are focused on one subset (tile) of A and B
Load a tile of each matrix into __shared__ memory in each phase
CUDA Memories – Slide 41
Tiled Multiply

[Figure: A, B and AB divided into TILE_WIDTH x TILE_WIDTH tiles; one tile of AB and the corresponding tiles of A and B are highlighted.]

Each block computes one square sub-matrix ABsub of size TILE_WIDTH x TILE_WIDTH
In each phase, each thread computes one partial result for its element of ABsub
CUDA Memories – Slide 42
Tiled Multiply (cont.)

[Figure: the same tiling diagram, with TILE_WIDTH marked on A, B and AB.]
CUDA Memories – Slide 43
A Small Example
[Figure: a small worked example with TILE_WIDTH = 2, showing the A and B elements used to compute each 2x2 tile of the 4x4 product AB.]
Every A and B element is used exactly twice in generating a 2-by-2 tile of AB
CUDA Memories – Slide 44
A Small Example (cont.)
                 AB0,0         AB1,0         AB0,1         AB1,1
                 thread0,0     thread1,0     thread0,1     thread1,1

Access order:    A0,0 * B0,0   A0,0 * B1,0   A0,1 * B0,0   A0,1 * B1,0
(top to bottom)  A1,0 * B0,1   A1,0 * B1,1   A1,1 * B0,1   A1,1 * B1,1
                 A2,0 * B0,2   A2,0 * B1,2   A2,1 * B0,2   A2,1 * B1,2
                 A3,0 * B0,3   A3,0 * B1,3   A3,1 * B0,3   A3,1 * B1,3
CUDA Memories – Slide 45
Breaking A and B into Tiles
[Figure: the same small example with A and B partitioned into 2x2 tiles.]
Each phase of a thread block uses one tile from A and one from B
CUDA Memories – Slide 46
Breaking A and B into Tiles (cont.)
       Phase 1                                       Phase 2
T0,0   A0,0 ↓ s_a0,0    B0,0 ↓ s_b0,0                A2,0 ↓ s_a0,0    B0,2 ↓ s_b0,0
       AB0,0 += s_a0,0*s_b0,0 + s_a1,0*s_b0,1        AB0,0 += s_a0,0*s_b0,0 + s_a1,0*s_b0,1
T1,0   A1,0 ↓ s_a1,0    B1,0 ↓ s_b1,0                A3,0 ↓ s_a1,0    B1,2 ↓ s_b1,0
       AB1,0 += s_a0,0*s_b1,0 + s_a1,0*s_b1,1        AB1,0 += s_a0,0*s_b1,0 + s_a1,0*s_b1,1
T0,1   A0,1 ↓ s_a0,1    B0,1 ↓ s_b0,1                A2,1 ↓ s_a0,1    B0,3 ↓ s_b0,1
       AB0,1 += s_a0,1*s_b0,0 + s_a1,1*s_b0,1        AB0,1 += s_a0,1*s_b0,0 + s_a1,1*s_b0,1
T1,1   A1,1 ↓ s_a1,1    B1,1 ↓ s_b1,1                A3,1 ↓ s_a1,1    B1,3 ↓ s_b1,1
       AB1,1 += s_a0,1*s_b1,0 + s_a1,1*s_b1,1        AB1,1 += s_a0,1*s_b1,0 + s_a1,1*s_b1,1

(time runs left to right; "X ↓ s_y" means element X is loaded into shared-memory location s_y)
Each block computes one square sub-matrix ABsub of size TILE_WIDTH x TILE_WIDTH
In each phase, each thread computes one partial result for its element of ABsub
CUDA Memories – Slide 47
Tiled Multiply (cont.)

[Figure: the same tiling diagram, with TILE_WIDTH marked on A, B and AB.]
Set up the execution configuration
CUDA Memories – Slide 48
Better Implementation
dim3 dimBlock (TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid (Width / TILE_WIDTH, Width / TILE_WIDTH);
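For completeness, a hedged sketch of the surrounding host code is given below. The function name, allocation sizes, and error handling are assumptions, and Width is assumed to be a multiple of TILE_WIDTH as on the slide.

// Hypothetical host-side driver for the tiled mat_mul kernel.
#define TILE_WIDTH 16

void run_mat_mul(const float *h_a, const float *h_b, float *h_ab, int Width)
{
  size_t bytes = (size_t)Width * Width * sizeof(float);
  float *d_a = 0, *d_b = 0, *d_ab = 0;

  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);
  cudaMalloc(&d_ab, bytes);
  cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

  dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
  dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
  mat_mul<<<dimGrid, dimBlock>>>(d_a, d_b, d_ab, Width);

  cudaMemcpy(h_ab, d_ab, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_a);  cudaFree(d_b);  cudaFree(d_ab);
}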
CUDA Memories – Slide 49
Better Implementation (cont.)

__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
  // shorthand
  int tx = threadIdx.x, ty = threadIdx.y;
  int bx = blockIdx.x,  by = blockIdx.y;

  // allocate tiles in __shared__ memory
  __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
  __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

  // calculate the row & col index
  int row = by * blockDim.y + ty;
  int col = bx * blockDim.x + tx;

  float result = 0;
CUDA Memories – Slide 50
Better Implementation (cont.)

  // loop over the tiles of the input in phases
  for (int p = 0; p < width/TILE_WIDTH; ++p)
  {
    // collaboratively load tiles into __shared__
    s_a[ty][tx] = a[row * width + (p * TILE_WIDTH + tx)];
    s_b[ty][tx] = b[(p * TILE_WIDTH + ty) * width + col];
    __syncthreads();

    // dot product between row of s_a and col of s_b
    for (int k = 0; k < TILE_WIDTH; ++k)
      result += s_a[ty][k] * s_b[k][tx];
    __syncthreads();
  }

  ab[row * width + col] = result;
}
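The kernel above assumes width is an exact multiple of TILE_WIDTH. A hedged replacement for its phase loop and final store that also handles arbitrary widths is sketched below (not from the slides); out-of-range threads load zeros so the dot product is unaffected.

  // Hypothetical bounds-checked phase loop for widths that are not a
  // multiple of TILE_WIDTH.
  for (int p = 0; p < (width + TILE_WIDTH - 1) / TILE_WIDTH; ++p)
  {
    int a_col = p * TILE_WIDTH + tx;
    int b_row = p * TILE_WIDTH + ty;

    s_a[ty][tx] = (row < width && a_col < width) ? a[row * width + a_col] : 0.0f;
    s_b[ty][tx] = (b_row < width && col < width) ? b[b_row * width + col] : 0.0f;
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      result += s_a[ty][k] * s_b[k][tx];
    __syncthreads();
  }

  if (row < width && col < width)
    ab[row * width + col] = result;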
Two barriers per phase:
  __syncthreads after all data is loaded into __shared__ memory
  __syncthreads after all data is read from __shared__ memory
Note that the second __syncthreads in phase p guards the load in phase p+1
Use barriers to guard data:
  Guard against using uninitialized data
  Guard against bashing live data
CUDA Memories – Slide 51
Use of Barriers in mat_mul
Each thread block should have many threads
  TILE_WIDTH = 16 gives 16 * 16 = 256 threads
There should be many thread blocks
  1024-by-1024 matrices give 64 * 64 = 4096 thread blocks
  TILE_WIDTH = 16 gives each SM 3 blocks, 768 threads
  Full occupancy
Each thread block performs 2 * 256 = 512 32-bit loads from global memory for 256 * (2 * 16) = 8,192 FP operations
  Memory bandwidth is no longer a limiting factor
CUDA Memories – Slide 52
First Order Size Considerations
Experiment performed on a GT200
This optimization was clearly worth the effort
Better performance is still possible in theory
CUDA Memories – Slide 53
Optimization Analysis

Implementation          Original      Improved
Global loads            2N^3          2N^2 * (N/TILE_WIDTH)
Throughput              10.7 GFLOPS   183.9 GFLOPS
SLOCs                   20            44
Relative improvement    1x            17.2x
Improvement/SLOC        1x            7.8x
Effective use of different memory resources reduces the number of accesses to global memory
But these resources are finite!
The more memory locations each thread requires, the fewer threads an SM can accommodate
CUDA Memories – Slide 54
Memory Resources as Limit to Parallelism
Resource             Per GT200 SM   Full Occupancy on GT200
Registers            16384          ≤ 16384 / 768 threads = 21 per thread
__shared__ memory    16 KB          ≤ 16 KB / 8 blocks = 2 KB per block
Each SM in GT200 has 16 KB of shared memory
  Per-SM shared memory size is implementation dependent!
  For TILE_WIDTH = 16, each thread block uses 2 * 256 * 4 B = 2 KB of shared memory
  So up to 8 thread blocks can potentially be actively executing
    This allows up to 8 * 512 = 4,096 pending loads (2 per thread, 256 threads per block)
  The next TILE_WIDTH, 32, would lead to 2 * 32 * 32 * 4 B = 8 KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time
Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
  The 112 GB/s bandwidth can now support (112/4) * 16 = 448 GFLOPS!
CUDA Memories – Slide 55
GT200 Shared Memory and Threading
CUDA Memories – Slide 56
TILE_SIZE Effects
Global variable declarations
  __host__, __device__... __global__, __constant__, __texture__
Function prototypes
  __global__ void kernelOne(…)
  float handyFunction(…)
main()
  allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
  transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
  execution configuration setup
  kernel call – kernelOne<<<execution configuration>>>( args… );
  transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …)
  optional: compare against golden (host-computed) solution
Kernel – void kernelOne(type args, …)
  variable declarations – __local__, __shared__
  automatic variables transparently assigned to registers or local memory
  __syncthreads()…
Other functions
  float handyFunction(int inVar, …);
CUDA Memories – Slide 57
Summary: Typical Structure of a CUDA Program
(the data-transfer and kernel-call steps in main() repeat as needed)
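As a hedged end-to-end illustration of this structure, a minimal skeleton program might look like the following; the names, sizes, and the simple scaling kernel are assumptions for this example.

// Hypothetical minimal program following the structure summarized above.
#include <cuda_runtime.h>
#include <stdio.h>

__constant__ float scale = 2.0f;                    // global-scope declaration

__global__ void kernelOne(float *d_out, const float *d_in, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;    // automatic variable -> register
  if (i < n) d_out[i] = scale * d_in[i];
}

int main(void)
{
  const int n = 1024;
  size_t bytes = n * sizeof(float);
  float h_in[1024], h_out[1024];
  for (int i = 0; i < n; ++i) h_in[i] = (float)i;

  float *d_in = 0, *d_out = 0;
  cudaMalloc(&d_in, bytes);                         // allocate memory on the device
  cudaMalloc(&d_out, bytes);
  cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // host -> device

  kernelOne<<<n / 256, 256>>>(d_out, d_in, n);      // execution configuration + launch

  cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // device -> host
  printf("h_out[1] = %f\n", h_out[1]);              // compare against host-computed value

  cudaFree(d_in);  cudaFree(d_out);
  return 0;
}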
Effective use of CUDA memory hierarchy decreases bandwidth consumption to increase throughput
Use __shared__ memory to eliminate redundant loads from global memory
Use __syncthreads barriers to protect __shared__ data
Use atomics if access patterns are sparse or unpredictable
Optimization comes with a development cost
Memory resources ultimately limit parallelism
CUDA Memories – Slide 58
Final Thoughts
Reading: Chapter 5, “Programming Massively Parallel Processors” by Kirk and Hwu.
Based on original material from
  The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
  Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 8/9/2011.
CUDA Memories – Slide 59
End Credits