1
ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2011
GPUMemories.ppt
GPU Memories
These notes will introduce:
• The basic memory hierarchy in the NVIDIA GPU: global memory, shared memory, register file, constant memory
• How to declare variables for each memory
• Cache memory and making the most effective use of it in a program
2
Host-Device Connection
[Diagram: host-device connection]
Host memory – host (CPU): DDR 400, 3.2 GB/s; HyperTransport and Intel's QuickPath, currently 25.6 GB/s
Host (CPU) – device (GPU): PCIe x16, 4 GB/s; PCIe x16 Gen2, 8 GB/s peak
Device (GPU) – device global memory (GPU bus): C2050 1030.4 GB/s; GTX 280 141.7 GB/s; GDDR5 230 GB/s
Memory bus limited by memory and processor-memory connection bandwidth
Note: transferring data between the host and the GPU is much slower than transferring between the device and global memory. Hence the need to minimize host-device transfers. A GPU in a laptop, such as a MacBook Pro, may share the system memory.
3
GPU Memory Hierarchy
Global memory is off-chip on the GPU card.
Even though global memory is an order of magnitude faster than CPU memory, it is still relatively slow and a bottleneck for performance.
The GPU is provided with faster on-chip memory, although data has to be transferred explicitly into shared memory. Pointers created with cudaMalloc() point to global memory.
Two principal levels on-chip: shared memory and registers.
4
[Diagram: scope of global memory, shared memory, and registers. Host memory resides on the host. On the device, global memory and constant memory are visible to the whole grid; shared memory is visible to the threads of a block; registers and local memory are private to each thread.]
Storing global constants is covered later. There is also a read-only global memory space called texture memory.
5
Currently data can only be transferred from the host to global memory (and constant memory), not from the host directly to shared memory.
Constant memory is used for data that does not change (i.e., is read-only by the GPU).
Shared memory is said to provide up to 15x the speed of global memory.
Registers are similar in speed to shared memory if threads read the same address or there are no bank conflicts.
6
Lifetimes
Global/constant memory – lifetime of the application
Shared memory – lifetime of a kernel
Registers – lifetime of a kernel

Scope
Global/constant memory – grid
Shared memory – block
Registers – thread
7
Declaring program variables for registers, shared memory and global memory
Memory      Declaration                              Scope     Lifetime
Registers   Automatic variables* other than arrays   Thread    Kernel
Local       Automatic array variables                Thread    Kernel
Shared      __shared__                               Block     Kernel
Global      __device__                               Grid      Application
Constant    __constant__                             Grid      Application
*Automatic variables are allocated automatically on entering the scope of the variable and de-allocated on leaving it. In C, all variables declared within a block are "automatic" by default; see http://en.wikipedia.org/wiki/Automatic_variable
8
Global Memory: __device__
For data available to all threads in device.
Declared outside function bodies
Scope of Grid and lifetime of application
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#define N 1000
…
__device__ int A[N];           // global memory, visible to all threads in the grid

__global__ void kernel() {
   int tid = blockIdx.x * blockDim.x + threadIdx.x;
   A[tid] = …
}

int main() {
   …
}
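The body of main() is elided above; if the host needs to initialize or read the __device__ array, one possibility (a minimal sketch, not from the slide, with a hypothetical host array h_A) is cudaMemcpyToSymbol()/cudaMemcpyFromSymbol():

int h_A[N];                                    // host-side copy (hypothetical)
…
cudaMemcpyToSymbol(A, h_A, sizeof(h_A));       // host -> device global array A
kernel<<<G, B>>>();                            // launch configuration as elsewhere in these notes
cudaMemcpyFromSymbol(h_A, A, sizeof(h_A));     // device global array A -> host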
9
Issues with using Global memory
• Long delays, slow
• Access congestion
• Cannot synchronize accesses
• Need to ensure no conflicts of accesses between threads (one way of handling them is sketched below)
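One common way to cope with conflicting accesses, not shown on this slide, is an atomic operation. A minimal sketch using atomicAdd() on a hypothetical global counter:

__device__ int counter;                        // hypothetical global counter

__global__ void countHits(int *data, int n, int threshold) {
   int i = threadIdx.x + blockDim.x * blockIdx.x;
   if (i < n && data[i] > threshold)
      atomicAdd(&counter, 1);                  // conflicting updates are serialized safely
}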
10
Shared Memory
Shared memory is on the GPU chip and very fast.
A separate copy of the data is available to the threads of each block.
Declared inside function bodies.
Scope of a block and lifetime of the kernel call.
So each block would have its own array A[N].
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#define N 1000
…
__global__ void kernel() {
   __shared__ int A[N];        // one copy per block, on-chip
   int tid = threadIdx.x;
   A[tid] = …
}

int main() {
   …
}
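If the array size is not known at compile time, shared memory can instead be declared extern and sized at kernel launch with the third execution-configuration parameter. A minimal sketch (assuming one element per thread):

__global__ void kernel() {
   extern __shared__ int A[];                  // size supplied at launch time
   int tid = threadIdx.x;
   A[tid] = tid;                               // example write, one element per thread
}

// third parameter = bytes of dynamic shared memory per block
kernel<<<G, B, B * sizeof(int)>>>();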
11
Transferring data to shared memory
int A[N][N];                  // host array, to be copied into the device with cudaMalloc/cudaMemcpy

__global__ void myKernel(int *A_global) {
   __shared__ int A_sh[n][n];                  // declare shared memory (n is a compile-time tile size)
   int i = …, j = …                            // indices within the shared tile
   int row = …, col = …                        // indices within the global array
   A_sh[i][j] = A_global[row + col*N];         // copy from global to shared
   …
}

int main() {
   …
   cudaMalloc((void**)&dev_A, size);                        // allocate global memory
   cudaMemcpy(dev_A, A, size, cudaMemcpyHostToDevice);      // copy host array to global memory
   myKernel<<<G, B>>>(dev_A);
   …
}
12
Issues with Shared Memory
Writes to shared memory by one thread are not immediately visible to the other threads in the block.
Usually it is the writes that matter.
Use __syncthreads() before you read data that has been altered by other threads (see the sketch below).
Shared memory is very limited (Fermi has up to 48 KB per SM, NOT per block).
Hence you may have to divide your data into "chunks".
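A minimal sketch of the pattern (names illustrative, assuming a block of 256 threads): each thread writes one element into shared memory, the block synchronizes, and only then does any thread read an element written by another thread.

__global__ void kernel(int *in, int *out) {
   __shared__ int buf[256];                    // assumes blockDim.x == 256
   int tid = threadIdx.x;
   int i = blockIdx.x * blockDim.x + tid;

   buf[tid] = in[i];                           // each thread writes one element
   __syncthreads();                            // wait until the whole block has written

   out[i] = buf[(tid + 1) % blockDim.x];       // now safe to read another thread's element
}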
13
Example uses of shared data
Where the data can be divided into independent parts:
Image processing
- Image can be divided into blocks and placed into shared memory for processing
Block matrix multiplication
- Sub-matrices can be stored in shared memory (slides to follow on this)
14
Registers
The compiler will place variables declared in a kernel in registers when possible.
There is a limit to the number of registers: Fermi has 32768 32-bit registers per SM.
Registers are divided across "warps" (groups of 32 threads that operate in SIMT mode) and have the lifetime of the warps.
__global__ void kernel() {
   int x, y, z;        // scalar automatic variables: placed in registers when possible
   …
}
15
Arrays declared within a kernel (automatic array variables)
__global__ void kernel() {
   int A[10];          // automatic array variable
   …
}
Generally stored in global memory, but a private copy is made for each thread.*
Can be as slow to access as global memory, except that it is cached (see later).
If the array is indexed only with constant values, the compiler may use registers instead (a small sketch follows).
* In global "local" memory, see later.
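A small illustrative sketch (the exact behaviour is compiler-dependent, so treat this as an assumption rather than a rule):

__global__ void kernel() {
   int A[4];
   A[0] = 1; A[1] = 2;                         // constant indices: may be kept in registers
   int j = threadIdx.x & 3;
   A[j] = j;                                   // variable index: array likely spills to local memory
}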
16
Constant Memory: __constant__
For data not altered by the device.
Although stored in global memory, it is cached and has fast access.
Declared outside function bodies.
Scope of the grid and lifetime of the application.
Size currently limited to 65536 bytes (64 KB).
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
…
__constant__ int n;                 // constant memory, written by the host

__global__ void kernel() {
   …
}

int main() {
   int value = …                                 // host value (illustrative)
   cudaMemcpyToSymbol(n, &value, sizeof(int));   // host cannot assign n directly; copy it in
   …
}
17
Local memory
Local memory resides in device memory space (global memory) and is slow, except that it is organized so that consecutive 32-bit words are accessed by consecutive thread IDs, giving the best coalesced accesses when possible.
For compute capability 2.x it is cached in the on-chip L1 and L2 caches.
It is used to hold arrays that are not indexed with constant values, and to hold variables when there are no more registers available for them.
18
Cache memory
More recent GPUs have L1 and L2 cache memory, but apparently without cache coherence, so it is up to the programmer to ensure correctness.
Make sure each thread accesses different locations.
Ideally, arrange accesses to fall in the same cache lines.
Compute capability 1.3 Teslas do not have cache memory.
Compute capability 2.0 Fermis have L1/L2 caches.
19
Fermi Caches
[Diagram: Fermi caches. Each streaming multiprocessor (SM) has its own register file and L1 cache/shared memory; all SMs share a unified L2 cache in front of device memory.]
20
Fermi Cache Sizes
L2
• Unified 384 kB L2 cache for all SMs
• 384-bit memory bus from device memory to the L2 cache
• Up to 160 GB/s bandwidth
• 128-byte cache line (32 32-bit integers or floats, or 16 doubles)
L1
• Each SM has 16 kB or 48 kB of L1 cache (64 kB split 16/48 or 48/16 between L1 cache and shared memory; see the sketch below)
• No global cache coherency!
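As an aside, not in the original slides: the CUDA runtime lets a program state which side of the 64 kB split a particular kernel prefers, using cudaFuncSetCacheConfig() (a preference, not a guarantee). A minimal sketch, reusing the myKernel name from slide 11:

cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);   // 48 kB shared memory / 16 kB L1
// or
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);       // 16 kB shared memory / 48 kB L1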
21
Poor Performance from Poor Data Layout
__global__ void kernel(int *A) {
int i = threadIdx.x + blockDim.x*blockIdx.x;
A[1000*i] = …
}
Very Bad!
Each thread accesses a location on a different line.
Fermi line size is 32 integers or floats
22
Taking Advantage of Cache
__global__ void kernel(int *A) {
int i = threadIdx.x + blockDim.x*blockIdx.x;
A[i] = …
}
Good!
Groups of 32 accesses by consecutive threads fall on the same cache line; these threads will be in the same warp. The Fermi line size is 32 integers or floats.
23
Warp
A “warp” in CUDA is a group of 32 threads that operate in SIMT mode.
A “half warp” (16 threads) actually executes simultaneously on current GPUs.
Using knowledge of warps and of how the memory is laid out can improve code performance.
24
Memory Banks
[Diagram: device (GPU) connected to memory banks Memory 1 to Memory 4, with consecutive locations A[0], A[1], A[2], A[3] placed on successive memory banks.]
Device can fetch A[0], A[1], A[2], A[3] … A[B-1] at the same time, where there are B banks.
25
Shared Memory Banks
Shared memory is divided into 16 or 32 banks, each 32 bits wide. Banks can be accessed simultaneously.
Compute capability 1.x: 16 banks; accesses are processed per half warp.
Compute capability 2.x: 32 banks; accesses are processed per warp.*
To achieve maximum bandwidth, threads in a half warp should access different banks of shared memory.
Exception: if all threads read the same location, the result is a broadcast operation.
* coit-grid06's C2050 (compute capability 2.0) has 32 banks.
26
Global memory banks
Global memory is also partitioned into banks, the number depending on the version of the GPU.
200-series and 10-series NVIDIA GPUs have 8 partitions, each 256 bytes wide.
C2050 has ??
27
Achieving best data access patterns
Requires a lot of thought; this will be considered in detail for specific problems.
Generally:
Padding data to make it aligned
For matrix operations: tiling, pre-transpose operations, and padding (adding columns/rows), as in the sketch below
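A minimal sketch of the padding idea for shared memory tiles (names and the 32 x 32 tile size are illustrative, not from the slides; assumes n is a multiple of 32 and a 32 x 32 thread block). Adding one extra column shifts each row of the tile onto a different bank, so column-wise reads by a warp are conflict-free:

#define TILE 32

__global__ void transposeTile(float *in, float *out, int n) {
   __shared__ float tile[TILE][TILE + 1];      // +1 column of padding avoids bank conflicts

   int x = blockIdx.x * TILE + threadIdx.x;
   int y = blockIdx.y * TILE + threadIdx.y;
   tile[threadIdx.y][threadIdx.x] = in[y * n + x];     // coalesced read from global memory

   __syncthreads();

   x = blockIdx.y * TILE + threadIdx.x;                // swap block indices for the transpose
   y = blockIdx.x * TILE + threadIdx.y;
   out[y * n + x] = tile[threadIdx.x][threadIdx.y];    // column read of the padded tile
}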
28
Memory Coalescing
Aligned memory accesses
Threads can read 4, 8, or 16 bytes at a time from global memory, but only if the accesses are aligned:
A 4-byte read must start at address …xxxxx00
An 8-byte read must start at address …xxxx000
A 16-byte read must start at address …xxx0000
Then the access is much faster (perhaps twice as fast); see the sketch below.
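A minimal sketch of one way to get aligned 16-byte accesses (illustrative, not from the slides): use the built-in float4 vector type so each thread issues a single 16-byte load and store. Memory returned by cudaMalloc() is aligned to at least 256 bytes, so each float4 element starts on a 16-byte boundary.

__global__ void scale4(float4 *data, float s, int n4) {    // n4 = number of float4 elements
   int i = threadIdx.x + blockDim.x * blockIdx.x;
   if (i < n4) {
      float4 v = data[i];                      // one aligned 16-byte read
      v.x *= s;  v.y *= s;  v.z *= s;  v.w *= s;
      data[i] = v;                             // one aligned 16-byte write
   }
}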
29
Ideally, try to arrange for threads to access different memory modules at the same time, and at consecutive addresses.
A bad case would be:
• Thread 0 accesses A[0], A[1], ... A[15]
• Thread 1 accesses A[16], A[17], ... A[31]
• Thread 2 accesses A[32], A[33], ... A[47]
… etc.
A good case would be:
• Thread 0 accesses A[0], A[16], A[32], ...
• Thread 1 accesses A[1], A[17], A[33], ...
• Thread 2 accesses A[2], A[18], A[34], ...
… etc., if there are 16 banks. You need to know that detail!
30
Wikipedia, "CUDA": http://en.wikipedia.org/wiki/CUDA
coit-grid06: C2050, compute capability 2.0
Questions