© NVIDIA Corporation 2010
GPU Tutorial @ Lund Observatory
Gernot Ziegler, NVIDIA UK
HISTORY / INTRODUCTION
Parallel vs Sequential Architecture Evolution
[Timeline diagram: high performance computing architectures (ILLIAC IV, Cray-1, MasPar, Thinking Machines, Blue Gene, many-core, GPUs) alongside sequential architectures for databases and operating systems (DEC PDP-1, IBM System 360, VAX, Intel 4004, x86, IBM POWER4, multi-core).]
Recent History
Specialised machines faded out (e.g. Cray)
Cost, economies of scale
Intel and AMD chips designed for home/office use
Increasing clock frequencies gave increasing performance
Commodity clusters
Computer gaming drives the Graphics Processing Unit (GPU)
NVIDIA and ATI
Present
Clock frequency no longer increasing
Power consumption ∝ f²
Multi-core dominates
GPU Computing
CPU + GPU Co-Processing
[Diagram: a 4-core CPU alongside a many-core GPU]
Graphics Pipelines for the Last 20 Years
Processor per function:
T&L evolved to vertex shading
Triangle, point, line setup
Flat shading, texturing, eventually pixel shading
Blending, Z-buffering, anti-aliasing
Wider and faster over the years
Previous Pipelined Architectures
Heavy geometry workload: Perf = 4
Heavy pixel workload: Perf = 8
[Diagram: fixed pipeline Vertex → Triangle → Pixel → ROP → Memory; with dedicated vertex and pixel shader units, part of the hardware sits idle depending on the workload.]
Unified Architecture Replaces the Pipeline Model
The future of GPUs is programmable processing
So: build the architecture around the processor
[Diagram: G80 unified architecture. The host feeds a data assembler and vertex/geometry/pixel thread issue units; a single thread processor array of streaming processor (SP) clusters, each with texture fetch (TF) units and an L1 cache, serves all shader types, backed by L2 caches and framebuffer (FB) partitions; setup/raster/ZCull sits alongside.]
Low Latency or High Throughput?
CPU: optimised for low-latency access to cached data sets; control logic for out-of-order and speculative execution
GPU: optimised for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation
[Diagram: CPU die dominated by control logic and cache with a few ALUs, vs GPU die dominated by ALUs; each has its own DRAM.]
Heterogeneous Computing Domains
[Diagram: a spectrum of domains (oil & gas, finance, medical, biophysics, numerics, audio, video, imaging) spanning from the CPU (sequential computing, instruction-level parallelism, data fits in cache) to the GPU (parallel computing and graphics, massive data parallelism, larger data sets).]
Typical application speedups: 50x – 150x
146X  Medical Imaging, U of Utah
36X   Molecular Dynamics, U of Illinois, Urbana
18X   Video Transcoding, Elemental Tech
50X   Matlab Computing, AccelerEyes
100X  Astrophysics, RIKEN
149X  Financial simulation, Oxford
47X   Linear Algebra, Universidad Jaime
20X   3D Ultrasound, Techniscan
130X  Quantum Chemistry, U of Illinois, Urbana
30X   Gene Sequencing, U of Maryland
Tesla, CUDA & PSC: definitions
CUDA Architecture
NVIDIA's enabling technology for GPU computing
The architecture of the GPU to support compute, plus C language extensions and a retargeting compiler
Usable with any GeForce 8 series or later GPU
Tesla
Dedicated compute hardware
C1060 and S1070
Fermi: C2050 and S2070
PSC
Personal Super Computer
A desktop machine with at least 3 C1060s
NVIDIA Tesla 20-Series (Fermi) Products

                        Workstation             Data Center Products
                        C2050 / C2070           S2050 / S2070        M2050 / M2070
                        Workstation Board       1U System            Module
GPUs                    1 Tesla GPU             4 Tesla GPUs         1 Tesla GPU
Single Precision Perf.  1030 Gigaflops          4.12 Teraflops       1030 Gigaflops
Double Precision Perf.  515 Gigaflops           2.06 Teraflops       515 Gigaflops
Memory: x2050           3 GB                    12 GB (3 GB / GPU)   3 GB
Memory: x2070           6 GB                    24 GB (6 GB / GPU)   6 GB
The Performance Gap Widens Further
[Chart: peak single precision performance (GFlops/sec), 2003–2010. NVIDIA GPUs (Tesla 8-series, 10-series, 20-series) pull away from x86 CPUs (3 GHz Nehalem). Tesla 20-series callout: 8x double precision, ECC, L1/L2 caches, 1 TF single precision, 4 GB memory.]
GPU Computing Applications
CUDA Parallel Computing Architecture
NVIDIA GPU with the CUDA Parallel Computing Architecture
Language bindings: C, C++, Fortran, Java and Python, OpenCL™, DirectCompute
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
GPU Parallel Computing Developer Eco-System
Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView
Numerical Packages: MATLAB, Mathematica, NI LabVIEW, pyCUDA
Languages: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
GPU Compilers / Parallelizing Compilers: PGI Accelerator, CAPS HMPP, MCUDA, OpenMP
Libraries: BLAS, FFT, LAPACK, NPP, Video/Imaging
Solution Providers, CUDA Consultants & Training: ANEO, GPU Tech
CUDA OVERVIEW
Processing Flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
CUDA Parallel Computing Architecture
Parallel computing architecture
and programming model
Includes a CUDA C compiler,
support for OpenCL and
DirectCompute
Architected to natively support
multiple computational
interfaces (standard languages
and APIs)
C for CUDA: C with a few keywords

Standard C Code:
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0f, x, y);

Parallel C Code:
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x, y);
CUDA Parallel Computing Architecture
CUDA defines:
Programming model
Memory model
Execution model
CUDA uses the GPU, but is for general-purpose computing
Facilitate heterogeneous computing: CPU + GPU
CUDA is scalable
Scale to run on 100s of cores/1000s of parallel threads
Compiling CUDA C Applications (Runtime API)

void serial_function(...) {
    ...
}
void other_function(int ...) {
    ...
}
void saxpy_serial(float ...) {
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
void main() {
    float x;
    saxpy_serial(..);
    ...
}

[Diagram: the key kernels (here saxpy_serial, modified into parallel CUDA code) are compiled by NVCC (Open64) into CUDA object files; the rest of the C application is compiled by the host CPU compiler into CPU object files; the linker combines both into a single CPU-GPU executable.]
PROGRAMMING MODEL
CUDA Review
CUDA Kernels
Parallel portion of application: execute as a kernel
Entire GPU executes kernel, many threads
CUDA threads:
Lightweight
Fast switching
1000s execute simultaneously
CPU = Host: executes functions
GPU = Device: executes kernels
CUDA Kernels: Parallel Threads
A kernel is a function executed
on the GPU
Array of threads, in parallel
All threads execute the same
code, can take different paths
Each thread has an ID
Select input/output data
Control decisions
float x = input[threadID];
float y = func(x);
output[threadID] = y;
CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed on the GPU as a grid of blocks of threads
Communication Within a Block
Threads may need to cooperate
Memory accesses
Share results
Cooperate using shared memory
Accessible by all threads within a block
Restriction to “within a block” permits scalability
Fast communication between N threads is not feasible when N is large
Transparent Scalability
The same grid of 12 blocks runs unchanged on different GPUs: a G84 executes the blocks 2 at a time, a G80 8 at a time, and a GT200 all 12 at once (leaving some multiprocessors idle). The hardware schedules blocks onto however many multiprocessors are available.
CUDA Programming Model - Summary
A kernel executes as a grid of thread blocks
A block is a batch of threads; they communicate through shared memory
Each block has a block ID; each thread has a thread ID
[Diagram: the host launches Kernel 1 on a 1D grid of blocks (0-3) and Kernel 2 on a 2D grid of blocks ((0,0)-(1,3)) on the device.]
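The block/thread ID arithmetic can be checked on the host. A minimal sketch in plain C; the function names are mine, only the blockIdx.x*blockDim.x + threadIdx.x formula comes from the SAXPY slide:

```c
#include <assert.h>

/* Host-side sketch of the device-side index computation
   blockIdx.x * blockDim.x + threadIdx.x. Illustrative names,
   not CUDA built-ins. */
static int global_thread_id(int block_id, int block_dim, int thread_id)
{
    return block_id * block_dim + thread_id;
}

/* Flatten a 2D block ID (as in the Kernel 2 grid) plus a 1D thread ID
   into one global index. grid_w is the grid width in blocks. */
static int global_id_2d(int block_x, int block_y, int grid_w,
                        int block_dim, int thread_id)
{
    int block_id = block_y * grid_w + block_x;
    return block_id * block_dim + thread_id;
}
```

Every thread computes a distinct index this way, which is how each one selects its own input/output element.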
MEMORY MODEL
CUDA Review
Memory hierarchy
Thread: registers
Thread: local memory
Block of threads: shared memory
All blocks: global memory
Additional Memories
Host can also allocate textures and arrays of constants
Textures and constants have dedicated caches
PROGRAMMING ENVIRONMENT
CUDA Review
CUDA APIs
API allows the host to manage the devices
Allocate memory & transfer data
Launch kernels
CUDA C “Runtime” API
High level of abstraction - start here!
CUDA C “Driver” API
More control, more verbose
(OpenCL: Similar to CUDA C Driver API)
CUDA C and OpenCL
Shared back-end compiler and optimization technology
OpenCL: entry point for developers who want a low-level API
CUDA C: entry point for developers who prefer high-level C
Visual Studio
Separate file types
.c/.cpp for host code
.cu for device/mixed code
Compilation rules: cuda.rules
Syntax highlighting
Intellisense
Integrated debugger and
profiler: Nexus
Linux
Separate file types
.c/.cpp for host code
.cu for device/mixed code
Typically makefile driven
cuda-gdb for debugging
CUDA Visual Profiler
CUDA OPTIMIZATION GUIDELINES
Performance
Optimize Algorithms for GPU
Algorithm selection: understand the problem, consider alternate algorithms
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)
Recompute? The GPU allocates transistors to arithmetic, not memory, so it is sometimes better to recompute than to cache
Serial computation on GPU? A low-parallelism computation may still be faster on the GPU than copying data to and from the host
Optimize Memory Access
Coalesce global memory access
Maximise DRAM efficiency
Order of magnitude impact on performance
Avoid serialization
Minimize shared memory bank conflicts
Understand constant cache semantics
Understand spatial locality
Optimize use of textures to ensure spatial locality
Exploit Shared Memory
Hundreds of times faster than global memory
Inter-thread cooperation via shared memory and synchronization
Cache data that is reused by multiple threads
Stage loads/stores to allow reordering
Avoid non-coalesced global memory accesses
Use Resources Efficiently
Partition the computation to keep multiprocessors busy: many threads, many thread blocks, multiple GPUs
Monitor per-multiprocessor resource utilization: registers and shared memory
Low utilization per thread block permits multiple active blocks per multiprocessor
Overlap computation with I/O: use asynchronous memory transfers
DEBUGGING AND PROFILING
cuda-gdb and Visual Profiler
CUDA-GDB
Extended version of GDB with support for C for CUDA
Supported on 32-bit and 64-bit Linux systems
Seamlessly debug both host (CPU) and device (GPU) code
• Set breakpoints on any source line or symbol name
• Single step executes only one warp, except at __syncthreads()
• Access and print all CUDA memory allocations, local, global, constant and shared variables
Linux GDB
Integration with
EMACS
Linux GDB
Integration with
DDD
CUDA Driver – Low-level Profiling support
1. Set up environment variables:
   export CUDA_PROFILE=1
   export CUDA_PROFILE_CSV=1
   export CUDA_PROFILE_CONFIG=config.txt
   export CUDA_PROFILE_LOG=profile.csv
2. Set up the configuration file "config.txt":
   gpustarttimestamp
   instructions
3. Run the application:
   matrixMul
4. View the profiler output in "profile.csv":
   # CUDA_PROFILE_LOG_VERSION 1.5
   # CUDA_DEVICE 0 GeForce 8800 GT
   # CUDA_PROFILE_CSV 1
   # TIMESTAMPFACTOR fa292bb1ea2c12c
   gpustarttimestamp,method,gputime,cputime,occupancy,instructions
   115f4eaa10e3b220,memcpyHtoD,7.328,12.000
   115f4eaa10e5dac0,memcpyHtoD,5.664,4.000
   115f4eaa10e95ce0,memcpyHtoD,7.328,6.000
   115f4eaa10f2ea60,_Z10dmatrixmulPfiiS_iiS_,19.296,40.000,0.333,4352
   115f4eaa10f443a0,memcpyDtoH,7.776,36.000
CUDA Visual Profiler - Overview
• Performance analysis tool to fine tune CUDA applications
• Supported on Linux/Windows/Mac platforms
• Functionality:
• Execute a CUDA application and collect profiling data
• Multiple application runs to collect data for all hardware performance counters
• Profiling data for all kernels and memory transfers
• Analyze profiling data
CUDA Visual Profiler – data for kernels
CUDA Visual Profiler – computed data for kernels
• Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate
• Global memory read throughput (Gigabytes/second)
• Global memory write throughput (Gigabytes/second)
• Overall global memory access throughput (Gigabytes/second)
• Global memory load efficiency
• Global memory store efficiency
CUDA Visual Profiler – data for memory transfers
• Memory transfer type and direction
(D=Device, H=Host, A=cuArray)
• e.g. H to D: Host to Device
• Synchronous / Asynchronous
• Memory transfer size, in bytes
• Stream ID
CUDA Visual Profiler – data analysis views
• Views:
• Summary table
• Kernel table
• Memcopy table
• Summary plot
• GPU Time Height plot
• GPU Time Width plot
• Profiler counter plot
• Profiler table column plot
• Multi-device plot
• Multi-stream plot
• Analyze profiler counters
• Analyze kernel occupancy
CUDA Visual Profiler – Misc.
• Multiple sessions
• Compare views for different sessions
• Comparison Summary plot
• Profiler projects – save & load
• Import/Export profiler data
(.CSV format)
NVIDIA Parallel Nsight
Accelerates GPU + CPU application development
The industry's first development environment for massively parallel applications
Complete Visual Studio-integrated development environment
Parallel Nsight 1.0
Nsight Parallel Debugger
GPU source code debugging
Variable & memory inspection
Nsight Analyzer
Platform-level Analysis
For the CPU and GPU
Nsight Graphics Inspector
Visualize and debug graphics content
Source Debugging
Supporting CUDA C and HLSL code.
Hardware breakpoints
GPU memory and variable views
Nsight menu and toolbars
Analysis
View a correlated trace timeline with both CPU and GPU events.
Detailed tooltips are available for every event on the timeline.
1.0 System Requirements
Operating System: Windows Server 2008 R2; Windows 7 / Vista; 32 or 64-bit
Hardware: GeForce 9 series or higher; Tesla C1060/S1070 or higher; Quadro (G9x or higher)
Visual Studio: Visual Studio 2008 SP1
Supported System Configurations
#1: Single machine, single GPU: Analyzer, Graphics Inspector
#2: Two machines connected over the network (TCP/IP): Debugger, Analyzer, Graphics Inspector
#3: Single SLI Multi-OS machine, two Quadro GPUs: Debugger, Analyzer, Graphics Inspector
Parallel Nsight 1.0 Versions
Standard (free): GPU Source Debugger, Graphics Inspector
Professional ($349): Analyzer, Data Breakpoints, premium ticket-based support
Volume and site licensing available
NVIDIA Nexus IDE
The industry’s first IDE for massively
parallel applications
Accelerates co-processing (CPU + GPU)
application development
Complete Visual Studio-integrated
development environment
NVIDIA Nexus IDE - Debugging
NVIDIA Nexus IDE - Profiling
RESOURCES
Productivity
Getting Started
CUDA Zone
www.nvidia.com/cuda
Introductory tutorials
GPU computing online seminars
(aka Webinars)
Forums
Documentation
Programming Guide
Best Practices Guide
Examples
CUDA SDK
Libraries
NVIDIA
cuBLAS Dense linear algebra (subset of full BLAS suite)
cuFFT 1D/2D/3D real and complex
Third party
NAG Numeric libraries e.g. RNGs
cuLAPACK/MAGMA
Open Source
Thrust STL/Boost style template language
cuDPP Data parallel primitives (e.g. scan, sort and reduction)
CUSP Sparse linear algebra and graph computation
Many more...
Additional material
Targeting Multiple Platforms with CUDA
CUDA C / C++ source can target:
NVIDIA GPUs via NVCC (NVIDIA CUDA Toolkit)
Multi-core CPUs via MCUDA (CUDA to multi-core) or Ocelot (PTX to multi-core)
Other GPUs via Swan (CUDA to OpenCL)
MCUDA: http://impact.crhc.illinois.edu/mcuda.php
Ocelot: http://code.google.com/p/gpuocelot/
Swan: http://www.multiscalelab.org/swan
OPTIMIZATION 1:
MEMORY TRANSFERS &
COALESCING
Execution Model (Software → Hardware)
Threads are executed by scalar processors
Thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
A kernel is launched as a grid of thread blocks
Only one kernel can execute on a device at one time
Warps and Half Warps
A thread block consists of 32-thread warps
A warp is executed physically in parallel (SIMD) on a multiprocessor
A half-warp of 16 threads can coordinate global memory accesses into a single transaction
[Diagram: a thread block splits into 32-thread warps on a multiprocessor; global and local memory live in device DRAM.]
Memory Architecture
[Diagram: the host (CPU, chipset, DRAM) connects to the device. Device DRAM holds global, constant, texture and local memory. Each GPU multiprocessor has registers and shared memory, plus constant and texture caches.]
Host-Device Data Transfers
Device to host memory bandwidth much lower than device to
device bandwidth
8 GB/s peak (PCI-e x16 Gen 2) vs. 141 GB/s peak (GTX 280)
Minimize transfers
Intermediate data can be allocated, operated on, and deallocated without
ever copying them to host memory
Group transfers
One large transfer much better than many small ones
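The "group transfers" advice follows from a simple cost model: every copy pays a fixed setup overhead before streaming at bus bandwidth. The sketch below uses assumed numbers (10 µs overhead, 8 GB/s) for illustration only, not measured figures:

```c
#include <assert.h>

/* Toy PCIe transfer model: fixed per-copy overhead plus bytes moved
   at bus bandwidth. Returns microseconds. Overhead value is an
   assumption, not a measurement. */
static double transfer_us(int n_copies, double bytes_each,
                          double overhead_us, double gb_per_s)
{
    double stream_us = bytes_each / (gb_per_s * 1e3); /* 1 GB/s = 1e3 bytes/us */
    return n_copies * (overhead_us + stream_us);
}
```

With these numbers, one 16 MB copy takes about 2010 µs while sixteen 1 MB copies take about 2160 µs; the gap grows as the pieces get smaller, since the fixed overhead is paid once per copy.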
Page-Locked Data Transfers
cudaMallocHost() allows allocation of page-locked (“pinned”) host memory
Enables highest cudaMemcpy performance: 3.2 GB/s on PCI-e x16 Gen1, 5.2 GB/s on PCI-e x16 Gen2
See the “bandwidthTest” CUDA SDK sample
Use with caution! Allocating too much page-locked memory can reduce overall system performance
Test your systems and apps to learn their limits
Overlapping Data Transfers and Computation
Async and Stream APIs allow overlap of H2D or D2H data transfers with computation
CPU computation can overlap data transfers on all CUDA capable devices
Kernel computation can overlap data transfers on devices with “Concurrent copy and execution” (roughly compute capability >= 1.1)
Stream = sequence of operations that execute in order on GPU
Operations from different streams can be interleaved
Stream ID used as argument to async calls and kernel launches
Coalescing
Global memory access of 32-, 64-, or 128-bit words by a half-warp of threads (Fermi: a full warp of threads) can result in as few as one (or two) transactions if certain access requirements are met
[Diagram: float (32-bit) data example, a half-warp of threads accessing global memory laid out in 32-byte, 64-byte and 128-byte segments.]
Coalescing: compute capability 1.2 and higher
Issues transactions for segments of 32B, 64B, and 128B
Smaller transactions used to avoid wasted bandwidth
[Diagram examples: 1 transaction (64B segment); 2 transactions (64B and 32B segments); 1 transaction (128B segment).]
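A simplified model of these rules can be run on the host: count the distinct aligned segments a half-warp touches. This sketch only counts 64B segments; real hardware additionally narrows transactions to 32B or widens to 128B where appropriate:

```c
#include <assert.h>

/* Count distinct aligned 64-byte segments touched by a half-warp of
   16 four-byte accesses -- a simplified stand-in for the transaction
   count on compute capability 1.2+ hardware. */
static int segments_touched(const unsigned addr[16])
{
    unsigned segs[16];
    int count = 0;
    for (int i = 0; i < 16; ++i) {
        unsigned seg = addr[i] / 64;   /* aligned 64B segment index */
        int seen = 0;
        for (int j = 0; j < count; ++j)
            if (segs[j] == seg) seen = 1;
        if (!seen) segs[count++] = seg;
    }
    return count;
}
```

Sequential aligned accesses land in one segment; a misaligned start crosses a boundary and costs two; a 64-byte stride touches a separate segment per thread.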
OPTIMIZATION 2:
EXECUTION CONFIG
Occupancy
Thread instructions are executed sequentially, so executing
other warps is the only way to hide latencies and keep the
hardware busy
Occupancy = Number of warps running concurrently on a
multiprocessor divided by maximum number of warps that can
run concurrently
Limited by resource usage:
Registers
Shared memory
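The resource-limiting arithmetic can be sketched on the host. The per-SM limits used below (16384 registers, 16 KB shared memory, 32 warps) are roughly GT200 figures chosen for illustration; the real occupancy calculator also accounts for allocation granularity and the per-SM block limit, which this sketch ignores:

```c
#include <assert.h>

/* Warps that can be resident on one SM, limited by register file and
   shared memory. Allocation granularity and the max-blocks-per-SM
   limit are deliberately ignored in this simplified model. */
static int active_warps(int threads_per_block, int regs_per_thread,
                        int smem_per_block, int sm_regs, int sm_smem,
                        int sm_max_warps)
{
    int warps_per_block = (threads_per_block + 31) / 32;
    int by_regs = sm_regs / (regs_per_thread * threads_per_block);
    int by_smem = smem_per_block > 0 ? sm_smem / smem_per_block : by_regs;
    int blocks  = by_regs < by_smem ? by_regs : by_smem;
    int warps   = blocks * warps_per_block;
    return warps < sm_max_warps ? warps : sm_max_warps;
}
```

Occupancy is then active_warps divided by sm_max_warps; doubling register use per thread halves the resident blocks, and a large shared memory footprint can become the limiter instead.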
Blocks per Grid Heuristics
# of blocks > # of multiprocessors
So all multiprocessors have at least one block to execute
# of blocks / # of multiprocessors > 2
Multiple blocks can run concurrently in a multiprocessor
Blocks that aren’t waiting at a __syncthreads() keep the hardware busy
Subject to resource availability – registers, shared memory
# of blocks > 100 to scale to future devices
Blocks executed in pipeline fashion
1000 blocks per grid will scale across multiple generations
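These heuristics are easy to check when sizing a grid; the ceiling division below is the same arithmetic as the (n + 255) / 256 in the SAXPY launch earlier:

```c
#include <assert.h>

/* Blocks needed to cover n elements at one thread per element,
   rounded up to whole blocks. */
static int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For example, a million elements at 256 threads per block yields 3907 blocks, comfortably past both the "more than 2 per multiprocessor" and the "more than 100 blocks" heuristics.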
Register Pressure
Hide latency by using more threads per multiprocessor
Limiting factors:
Number of registers per kernel
8K/16K registers per multiprocessor, partitioned among concurrent threads
Amount of shared memory
16KB per multiprocessor, partitioned among concurrent thread blocks
Compile with the --ptxas-options=-v flag to see per-kernel usage
Use the --maxrregcount=N flag to NVCC
N = desired maximum registers / kernel
At some point “spilling” into local memory may occur
Spilling reduces performance, since local memory is slow
Occupancy Calculator
Optimizing threads per block
Choose threads per block as a multiple of warp size
Avoid wasting computation on under-populated warps
Facilitates coalescing
Want to run as many warps as possible per multiprocessor (hide
latency)
Multiprocessor can run up to 8 blocks at a time
Heuristics
Minimum: 64 threads per block
Only if multiple concurrent blocks
192 or 256 threads a better choice
Usually still enough regs to compile and invoke successfully
This all depends on your computation, so experiment!
Occupancy != Performance
Increasing occupancy does not necessarily increase
performance
BUT …
Low-occupancy multiprocessors cannot adequately hide latency
on memory-bound kernels
(It all comes down to arithmetic intensity and available parallelism)
OPTIMIZATION 3:
MATH FUNCS & BRANCHING
Runtime Math Library
There are two types of runtime math operations in single
precision
__funcf(): direct mapping to hardware ISA
Fast but lower accuracy (see prog. guide for details)
Examples: __sinf(x), __expf(x), __powf(x,y)
funcf() : compile to multiple instructions
Slower but higher accuracy (5 ulp or less)
Examples: sinf(x), expf(x), powf(x,y)
The -use_fast_math compiler option forces every funcf() to
compile to __funcf()
Control Flow Instructions
Main performance concern with branching is divergence
Threads within a single warp take different paths
Different execution paths must be serialized
Avoid divergence when branch condition is a function of thread
ID
Example with divergence:
if (threadIdx.x > 2) { }
Branch granularity < warp size
Example without divergence:
if (threadIdx.x / WARP_SIZE > 2) { }
Branch granularity is a whole multiple of warp size
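The two examples can be modelled by counting the distinct branch outcomes inside one warp; each distinct outcome is one serialized pass. This is a sketch of a single if/else only, ignoring nested branches:

```c
#include <assert.h>

#define WARP_SIZE 32

/* Number of serialized passes for a single if/else inside one warp:
   1 if all 32 threads agree on the condition, 2 if the warp diverges. */
static int paths_in_warp(const int taken[WARP_SIZE])
{
    int any_true = 0, any_false = 0;
    for (int i = 0; i < WARP_SIZE; ++i) {
        if (taken[i]) any_true = 1; else any_false = 1;
    }
    return any_true + any_false;
}
```

Evaluating threadIdx.x > 2 for warp 0 gives three false threads and twenty-nine true ones (two passes), while threadIdx.x / WARP_SIZE > 2 is uniform across every warp (one pass).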
OPTIMIZATION 4: SHARED MEMORY
Shared Memory
~Hundred times faster than global memory
Cache data to reduce global memory accesses
Threads can cooperate via shared memory
Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-coalesceable addressing
Shared Memory Architecture
Many threads accessing memory
Therefore, memory is divided into banks
Successive 32-bit words are assigned to successive banks
Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized
[Diagram: shared memory split across Bank 0 … Bank 15.]
Bank Addressing Examples
No bank conflicts: linear addressing, stride == 1
No bank conflicts: random 1:1 permutation
[Diagram: Thread 0 … Thread 15 each mapped to a distinct bank among Bank 0 … Bank 15.]
Bank Addressing Examples
2-way bank conflicts: linear addressing, stride == 2
8-way bank conflicts: linear addressing, stride == 8
[Diagram: with stride 2, pairs of threads land on the same bank; with stride 8, eight threads land on each of banks 0 and 8.]
Shared memory bank conflicts
Shared memory is ~ as fast as registers if there are no bank conflicts
warp_serialize profiler signal reflects conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict
(broadcast)
The slow case:
Bank Conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank
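That cost rule can be written down directly: for each half-warp, the conflict degree is the largest number of threads landing in one bank. The broadcast exception for identical addresses is deliberately ignored in this sketch:

```c
#include <assert.h>

/* Max number of half-warp threads hitting one bank; 16 banks of
   32-bit words assumed (pre-Fermi). word_index[i] is the shared
   memory word accessed by thread i. Broadcast is not modelled. */
static int conflict_degree(const unsigned word_index[16])
{
    int hits[16] = {0};
    int worst = 0;
    for (int i = 0; i < 16; ++i) {
        int b = (int)(word_index[i] % 16);   /* bank of a 32-bit word */
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

This reproduces the earlier slides: stride 1 is conflict-free (degree 1), stride 2 gives 2-way conflicts, stride 8 gives 8-way conflicts.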
Shared Memory Example: Transpose
Each thread block works on a tile of the matrix
Naïve implementation exhibits strided access to global memory
[Diagram: elements transposed from idata to odata by a half-warp of threads.]
Naïve Transpose
Loads are coalesced, stores are not (strided by height)

__global__ void transposeNaive(float *odata, float *idata,
                               int width, int height)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in  = xIndex + width * yIndex;
    int index_out = yIndex + height * xIndex;
    odata[index_out] = idata[index_in];
}
Coalescing through shared memory
Access columns of a tile in shared memory to write contiguous data to global memory
Requires __syncthreads() since threads access data in shared memory stored by other threads
[Diagram: idata → tile → odata, elements transposed by a half-warp of threads.]
Coalescing through shared memory

__global__ void transposeCoalesced(float *odata, float *idata,
                                   int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;
    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;
    tile[threadIdx.y][threadIdx.x] = idata[index_in];
    __syncthreads();
    odata[index_out] = tile[threadIdx.x][threadIdx.y];
}
Bank Conflicts in Transpose
16x16 shared memory tile of floats
Data in columns are in the same bank: 16-way bank conflict reading columns in tile
Solution: pad the shared memory array
__shared__ float tile[TILE_DIM][TILE_DIM+1];
After padding, data in anti-diagonals are in the same bank, and column reads are conflict-free
[Diagram: idata → tile → odata, elements transposed by a half-warp of threads.]
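The effect of the +1 padding is pure index arithmetic: with a row pitch of 16 words, every element of a column lands in the same bank, while a pitch of 17 spreads a column across all 16 banks. The 16-bank, 32-bit-word layout below is the pre-Fermi configuration:

```c
#include <assert.h>

/* Bank of tile[row][col] for a row-major float tile with the given
   row pitch in 32-bit words; 16 banks assumed (pre-Fermi). */
static int bank_of(int row, int col, int pitch)
{
    return (row * pitch + col) % 16;
}
```

Since (row * 17) % 16 == row % 16, the padded tile maps each row of a column to a different bank, removing the 16-way conflict at the cost of one unused word per row.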
FERMI: NEW ARCHITECTURE
Fermi: The Computational GPU
Disclaimer: specifications subject to change
Performance: 13x the double precision performance of CPUs; IEEE 754-2008 SP & DP floating point
Flexibility: shared memory increased from 16 KB to 64 KB; added L1 and L2 caches; ECC on all internal and external memories; support for up to 1 TB of GPU memory; high-speed GDDR5 memory interface
Usability: multiple simultaneous tasks on the GPU; 10x faster atomic operations; C++ support; system calls and printf support
[Diagram: Fermi die with DRAM interfaces, host interface, GigaThread scheduler and shared L2.]
Availability: Q2 2010
Fermi
Memory operations are done per warp (32 threads) instead of per half-warp
Global memory, shared memory
Shared memory:
16 or 48 KB (configurable)
Now 32 banks, each 32 bits wide
No bank conflicts when accessing 8-byte words
L1 cache per multiprocessor
Should help with misaligned access, strided access, some register spilling
Much improved dual-issue:
Can dual-issue fp32 pairs, fp32-mem, fp64-mem, etc.
IEEE-conformant rounding
Uniform 64-bit address space
L1 cache
For all memory operations (global memory, shared memory)
Shares 64 KB with shared memory: the split can be switched between 16 and 48 KB (CUDA API call)
Caches global memory reads only
Particularly beneficial when the compiler detects that all threads load the same value
One L1 cache per multiprocessor
NOT coherent! Use volatile for global memory accesses if threads on other SMs may change the location (but consider whether this is really needed: not all blocks run concurrently, so spinning on such a flag risks deadlock)
Also caches local memory reads and writes
This improves register-spilling behaviour
(Coherence is no problem here, since local memory is SM-private)
Fermi has a 64-bit address space
But only 32-bit registers, so in unfortunate cases register allocation adds unnecessary overhead on Fermi (e.g. on a C2050 with 3 GB, where 32-bit addressing would suffice)
Driver API: compile kernels in 32-bit mode; they can be loaded by a 64-bit app
Runtime API (CUDART): compile the application in 32-bit mode (nvcc -m32), which also produces GPU code in 32-bit
Use the new __launch_bounds__() intrinsic to help the compiler optimize register usage
__umul24 not optimal on Fermi
On the Tesla C1060 / GT200 architecture, bounded integer multiplications could be accelerated with __umul24(a, b) instead of a * b, e.g.
unsigned int tid = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
On Fermi, __umul24() is emulated, and thus slower than a * b
HPC and IEEE conformance
Default settings for computation on the GPU are now more conservative (for HPC):
Denormal support, IEEE-conformant division and square root
Accuracy over speed
If your app runs faster on Fermi with -arch=sm_13 than with -arch=sm_20, then the PTX JIT has used the “old” Tesla C1060 settings, which favor speed: flush-to-zero instead of denormals, no IEEE-precise division, no IEEE-precise square root
For similar results with -arch=sm_20, use:
-ftz=true -prec-div=false -prec-sqrt=false
See the NVIDIA CUDA Programming Guide, sections 5.4.1 and G.2, and The CUDA Compiler Driver NVCC, pp. 14-15
(Section 5.4.1 also contains information on instruction timings)
CONCLUSION, QUESTIONS
& GTC INVITE
GPU Technology Conference 2010
Monday, Sept. 20 - Thursday, Sept. 23, 2010
San Jose Convention Center, San Jose, California
The most important event in the GPU ecosystem:
Learn about seismic shifts in GPU computing
Preview disruptive technologies and emerging applications
Get tools and techniques to impact mission critical projects
Network with experts, colleagues, and peers across industries
Opportunities:
Call for Submissions: sessions & posters
Sponsors / Exhibitors: reach decision makers
“CEO on Stage” Showcase for Startups: tell your story to VCs and analysts
“I consider the GPU Technology Conference to be the single best place to see the amazing work enabled by the GPU. It’s a great venue for meeting researchers, developers, scientists, and entrepreneurs from around the world.”
-- Professor Hanspeter Pfister, Harvard University and GTC 2009 keynote speaker
Thank You
Questions?