© NVIDIA Corporation 2010
GPU Tutorial
Build environment
Debugging/Profiling
Fermi
Optimization / CUDA 3.1 and Fermi advice
BUILD ENVIRONMENT
Compilation
Linux
Separate file types
.c/.cpp for host code
.cu for device/mixed code
Typically makefile driven
cuda-gdb for debugging
CUDA Visual Profiler
Windows: Visual Studio
Separate file types
.c/.cpp for host code
.cu for device/mixed code
Compilation rules: cuda.rules
Syntax highlighting
Intellisense
CUDA Visual Profiler
Integrated debugger and
profiler: Parallel Nsight (Win7)
Additional Libraries
NVIDIA
cuBLAS Dense linear algebra (subset of full BLAS suite)
cuFFT 1D/2D/3D real and complex
Third party
NAG Numeric libraries e.g. RNGs
cuLAPACK/MAGMA
Open Source
Thrust STL/Boost-style template library
cuDPP Data parallel primitives (e.g. scan, sort and reduction)
CUSP Sparse linear algebra and graph computation
Many more...
DEBUGGING AND PROFILING
cuda-gdb and Visual Profiler
CUDA-GDB
Extended version of GDB with support for C for CUDA
Supported on Linux 32bit/64bit systems
Seamlessly debug both host (CPU) and device (GPU) code
• Set breakpoints on any source line or symbol name
• Single step executes only one warp, except at __syncthreads()
• Access and print all CUDA memory allocations: local, global, constant and shared variables
Walkthrough example with source code: CUDA-GDB manual
Linux GDB
Integration with
EMACS
Linux GDB
Integration with
DDD
CUDA-MemCheck
Detects/tracks memory errors
Out of bounds accesses
Misaligned accesses (types must be aligned on their size)
Integrated into CUDA-GDB
Linux and WinXP
Win7 and Vista support coming
CUDA Driver – Low-level Profiling Support
1. Set up environment variables:
   export CUDA_PROFILE=1
   export CUDA_PROFILE_CSV=1
   export CUDA_PROFILE_CONFIG=config.txt
   export CUDA_PROFILE_LOG=profile.csv
2. Set up configuration file (config.txt):
   gpustarttimestamp
   instructions
3. Run the application:
   matrixMul
4. View profiler output (profile.csv):
   # CUDA_PROFILE_LOG_VERSION 1.5
   # CUDA_DEVICE 0 GeForce 8800 GT
   # CUDA_PROFILE_CSV 1
   # TIMESTAMPFACTOR fa292bb1ea2c12c
   gpustarttimestamp,method,gputime,cputime,occupancy,instructions
   115f4eaa10e3b220,memcpyHtoD,7.328,12.000
   115f4eaa10e5dac0,memcpyHtoD,5.664,4.000
   115f4eaa10e95ce0,memcpyHtoD,7.328,6.000
   115f4eaa10f2ea60,_Z10dmatrixmulPfiiS_iiS_,19.296,40.000,0.333,4352
   115f4eaa10f443a0,memcpyDtoH,7.776,36.000
CUDA Visual Profiler - Overview
• Performance analysis tool to fine-tune CUDA applications
• Supported on Linux/Windows/Mac platforms
• Functionality:
• Execute a CUDA application and collect profiling data
• Multiple application runs to collect data for all hardware performance counters
• Profiling data for all kernels and memory transfers
• Analyze profiling data
CUDA Visual Profiler – data for kernels
CUDA Visual Profiler – computed data for kernels
• Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate
• Global memory read throughput (Gigabytes/second)
• Global memory write throughput (Gigabytes/second)
• Overall global memory access throughput (Gigabytes/second)
• Global memory load efficiency
• Global memory store efficiency
CUDA Visual Profiler – data for memory transfers
• Memory transfer type and direction
(D=Device, H=Host, A=cuArray)
• e.g. H to D: Host to Device
• Synchronous / Asynchronous
• Memory transfer size, in bytes
• Stream ID
CUDA Visual Profiler – data analysis views
• Views:
• Summary table
• Kernel table
• Memcopy table
• Summary plot
• GPU Time Height plot
• GPU Time Width plot
• Profiler counter plot
• Profiler table column plot
• Multi-device plot
• Multi-stream plot
• Analyze profiler counters
• Analyze kernel occupancy
CUDA Visual Profiler – Misc.
• Multiple sessions
• Compare views for different sessions
• Comparison Summary plot
• Profiler projects – save & load
• Import/Export profiler data
(.CSV format)
NVIDIA Parallel Nsight
Accelerates GPU + CPU application development
The industry’s 1st development environment for massively parallel applications
Complete Visual Studio-integrated development environment
Parallel Nsight 1.0
Nsight Parallel Debugger
GPU source code debugging
Variable & memory inspection
Nsight Analyzer
Platform-level Analysis
For the CPU and GPU
Nsight Graphics Inspector
Visualize and debug graphics content
Source Debugging
Supporting CUDA C and HLSL code.
Hardware breakpoints
GPU memory and variable views
Nsight menu and toolbars
View a correlated trace timeline with both CPU and GPU events.
Analysis
Detailed tooltips are available for every event on the timeline.
Analysis
1.0 System Requirements
Operating System: Windows Server 2008 R2, Windows 7 / Vista (32 or 64-bit)
Hardware: GeForce 9 series or higher; Tesla C1060/S1070 or higher; Quadro (G9x or higher)
Visual Studio: Visual Studio 2008 SP1
Supported System Configurations
#1: Single machine, single GPU – Analyzer, Graphics Inspector
#2: Two machines connected over the network (TCP/IP) – Debugger, Analyzer, Graphics Inspector
#3: Single SLI Multi-OS (MOS) machine, two Quadro GPUs – Debugger, Analyzer, Graphics Inspector
Parallel Nsight 1.0 Versions
Standard (free): GPU Source Debugger, Graphics Inspector
Professional ($): Analyzer, Data Breakpoints, premium ticket-based support
Volume and site licensing available
GPU OPTIMIZATION GUIDELINES
Performance
Optimize Algorithms for the GPU
Algorithm selection: understand the problem, consider alternate algorithms
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)
Recompute? The GPU allocates transistors to arithmetic, not memory
Sometimes better to recompute than to cache
Serial computation on GPU? Low-parallelism computation may still be faster on the GPU than the cost of copying to/from the host
Optimize Memory Access
Coalesce global memory access
Maximize DRAM efficiency
Order of magnitude impact on performance
Avoid serialization
Minimize shared memory bank conflicts
Understand constant cache semantics
Understand spatial locality
Optimize use of textures to ensure spatial locality
Exploit Shared Memory
Hundreds of times faster than global memory
Inter-thread cooperation via shared memory and synchronization
Cache data that is reused by multiple threads
Stage loads/stores to allow reordering
Avoid non-coalesced global memory accesses
Use Resources Efficiently
Partition the computation to keep multiprocessors busy: many threads, many thread blocks
Multiple GPUs
Monitor per-multiprocessor resource utilization: registers and shared memory
Low utilization per thread block permits multiple active blocks per multiprocessor
Overlap computation with I/O: use asynchronous memory transfers
OPTIMIZATION 1:
MEMORY TRANSFERS &
COALESCING
Execution Model (Software → Hardware)
Thread → Scalar Processor: threads are executed by scalar processors
Thread Block → Multiprocessor: thread blocks are executed on multiprocessors; thread blocks do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
Grid → Device: a kernel is launched as a grid of thread blocks; only one kernel can execute on a device at one time
Warps and Half Warps
A thread block consists of 32-thread warps
A warp is executed physically in parallel (SIMD) on a multiprocessor
A half-warp of 16 threads can coordinate global memory accesses into a single transaction to device memory (DRAM: global and local)
Memory Architecture
Host: CPU, chipset, host DRAM
Device: GPU with multiple multiprocessors, each with registers and shared memory, plus constant and texture caches
Device DRAM holds global, constant, texture and local memory
Host-Device Data Transfers
Device to host memory bandwidth much lower than device to
device bandwidth
8 GB/s peak (PCI-e x16 Gen 2) vs. 141 GB/s peak (GTX 280)
Minimize transfers
Intermediate data can be allocated, operated on, and deallocated without
ever copying them to host memory
Group transfers
One large transfer much better than many small ones
Page-Locked Data Transfers
cudaMallocHost() allows allocation of page-locked (“pinned”) host memory
Enables highest cudaMemcpy performance: 3.2 GB/s on PCI-e x16 Gen1, 5.2 GB/s on PCI-e x16 Gen2
See the “bandwidthTest” CUDA SDK sample
Use with caution! Allocating too much page-locked memory can reduce overall system performance
Test your systems and apps to learn their limits
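A minimal sketch of the pattern above; the buffer size and omitted error checking are illustrative assumptions, not from the slides:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 32 * 1024 * 1024;   // 32 MB, arbitrary example size
    float *h_pinned = 0;
    float *d_buf = 0;

    // Page-locked host allocation: enables the fastest cudaMemcpy path
    cudaMallocHost((void**)&h_pinned, bytes);
    cudaMalloc((void**)&d_buf, bytes);

    // One large transfer instead of many small ones
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);   // pinned memory must be freed with cudaFreeHost
    return 0;
}
```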
Overlapping Data Transfers and Computation
Async and Stream APIs allow overlap of H2D or D2H data transfers with computation
CPU computation can overlap data transfers on all CUDA capable devices
Kernel computation can overlap data transfers on devices with “Concurrent copy and execution” (roughly compute capability >= 1.1)
Stream = sequence of operations that execute in order on GPU
Operations from different streams can be interleaved
Stream ID used as argument to async calls and kernel launches
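A sketch of the stream idea, assuming pinned host memory and a device with concurrent copy and execution; the kernel and function names are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) { /* ... */ }

// Split the work into two streams so the copy for chunk 1 can
// overlap the kernel for chunk 0.
void overlapped(float *h_pinned, float *d, int n, int threads)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int half = n / 2;
    for (int i = 0; i < 2; ++i) {
        float *hp = h_pinned + i * half;
        float *dp = d + i * half;
        // Async copy and kernel launch take the stream ID as argument
        cudaMemcpyAsync(dp, hp, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(half + threads - 1) / threads, threads, 0, s[i]>>>(dp, half);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```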
Coalescing: GT200 architecture, Tesla C1060
Global memory access of 32-, 64-, or 128-bit words by a half-warp of threads can result in as few as one (or two) transaction(s) if certain access requirements are met
Float (32-bit) data example: transactions use 32-byte, 64-byte, or 128-byte segments
Coalescing: Compute capability 1.2 and 1.3 (GT200, Tesla C1060)
Issues transactions for segments of 32B, 64B, and 128B
Smaller transactions are used to avoid wasted bandwidth
Examples: 1 transaction (64B segment); 2 transactions (64B and 32B segments); 1 transaction (128B segment)
Coalescing: Compute capability 2.0 (Fermi, Tesla C2050)
Memory transactions are handled per warp (32 threads)
L1 cache ON: always issues 128B segment transactions and caches them in the 16 KB or 48 KB L1 cache per multiprocessor
Example: 2 transactions (2 x 128B segments), but the next warp probably needs only 1 extra transaction, due to the L1 cache
L1 cache OFF: always issues 32B segment transactions
Example: 32 widely scattered accesses take 32 x 32B segments instead of 32 x 128B segments, an advantage for scattered access patterns
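The access patterns above can be sketched with two illustrative kernels (the names and the stride value are assumptions for the example):

```cuda
// Coalesced: thread k of a half-warp reads consecutive 32-bit word k,
// so the 16 accesses fall into a single 64B segment on GT200.
__global__ void coalesced(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                 // consecutive 32-bit words
}

// Strided: neighbouring threads touch addresses `stride` floats apart,
// so each access can land in a different memory segment.
__global__ void strided(float *out, const float *in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];                 // e.g. stride = 32 is a worst case
}
```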
OPTIMIZATION 2:
EXECUTION CONFIG
Occupancy
Thread instructions are executed sequentially, so executing
other warps is the only way to hide latencies and keep the
hardware busy
Occupancy = Number of warps running concurrently on a
multiprocessor divided by maximum number of warps that can
run concurrently
Limited by resource usage:
Registers
Shared memory
Blocks per Grid Heuristics
# of blocks > # of multiprocessors
So all multiprocessors have at least one block to execute
# of blocks / # of multiprocessors > 2
Multiple blocks can run concurrently in a multiprocessor
Blocks that aren’t waiting at a __syncthreads() keep the hardware busy
Subject to resource availability – registers, shared memory
# of blocks > 100 to scale to future devices
Blocks executed in pipeline fashion
1000 blocks per grid will scale across multiple generations
Register Pressure
Hide latency by using more threads per multiprocessor
Limiting Factors:
Number of registers per kernel
8K/16K per multiprocessor, partitioned among concurrent threads
Amount of shared memory
16KB per multiprocessor, partitioned among concurrent threadblocks
Compile with the --ptxas-options=-v flag
Use the --maxrregcount=N flag to NVCC
N = desired maximum registers / kernel
At some point “spilling” into local memory may occur
Reduces performance – local memory is slow
Occupancy Calculator: Excel sheet to calculate GPU occupancy, part of the CUDA SDK (Windows!)
Optimizing threads per block
Choose threads per block as a multiple of warp size
Avoid wasting computation on under-populated warps
Facilitates coalescing
Want to run as many warps as possible per multiprocessor (hide
latency)
Multiprocessor can run up to 8 blocks at a time
Heuristics
Minimum: 64 threads per block
Only if multiple concurrent blocks
192 or 256 threads a better choice
Usually still enough regs to compile and invoke successfully
This all depends on your computation, so experiment!
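The heuristics above can be sketched as a launch configuration; the kernel name and block size of 256 are illustrative choices, not prescriptions:

```cuda
__global__ void kernel(float *d, int n) { /* ... */ }

// 256 threads per block: a multiple of the warp size (32), in the
// 192–256 range the slides recommend. Enough blocks to cover n
// and keep all multiprocessors busy.
void launch(float *d, int n)
{
    const int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / 256)
    kernel<<<blocks, threadsPerBlock>>>(d, n);
}
```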
Occupancy != Performance
Increasing occupancy does not necessarily increase
performance
BUT …
Low-occupancy multiprocessors cannot adequately hide latency
on memory-bound kernels
(It all comes down to arithmetic intensity and available parallelism)
OPTIMIZATION 3:
MATH FUNCS & BRANCHING
Runtime Math Library
There are two types of runtime math operations in single
precision
__funcf(): direct mapping to hardware ISA
Fast but lower accuracy (see prog. guide for details)
Examples: __sinf(x), __expf(x), __powf(x,y)
funcf() : compile to multiple instructions
Slower but higher accuracy (5 ulp or less)
Examples: sinf(x), expf(x), powf(x,y)
The -use_fast_math compiler option forces every funcf() to
compile to __funcf()
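A minimal sketch contrasting the two variants; the kernel name and expression are illustrative:

```cuda
__global__ void wave(float *out, const float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float accurate = sinf(x[i]) * expf(x[i]);      // multiple instructions, higher accuracy
    float fast     = __sinf(x[i]) * __expf(x[i]);  // hardware ISA, lower accuracy
    out[i] = accurate - fast;   // the difference exposes the accuracy trade-off
}
```

Compiling with -use_fast_math would make both lines equivalent, since every sinf()/expf() is then forced to the __funcf() form.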
Control Flow Instructions
Main performance concern with branching is divergence
Threads within a single warp take different paths
Different execution paths must be serialized
Avoid divergence when branch condition is a function of thread
ID
Example with divergence:
if (threadIdx.x > 2) { }
Branch granularity < warp size
Example without divergence:
if (threadIdx.x / WARP_SIZE > 2) { }
Branch granularity is a whole multiple of warp size
OPTIMIZATION 4: SHARED MEMORY
Shared Memory
~Hundred times faster than global memory
Cache data to reduce global memory accesses
Threads can cooperate via shared memory
Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-coalesceable addressing
Shared Memory Architecture
Many threads accessing memory
Therefore, memory is divided into banks
Successive 32-bit words assigned to successive banks
Each bank can service one address per cycle
A memory can service as many simultaneous
accesses as it has banks
Multiple simultaneous accesses to a bank
result in a bank conflict
Conflicting accesses are serialized
Bank Addressing Examples
No bank conflicts: linear addressing, stride == 1 (each thread accesses a different bank)
No bank conflicts: random 1:1 permutation (still one thread per bank)
Bank Addressing Examples
2-way bank conflicts: linear addressing, stride == 2 (two threads per bank)
8-way bank conflicts: linear addressing, stride == 8 (eight threads per bank)
Shared memory bank conflicts
Shared memory is ~ as fast as registers if there are no bank conflicts
warp_serialize profiler signal reflects conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict
(broadcast)
The slow case:
Bank Conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank
Shared Memory Example: Transpose
Each thread block works on a tile of the matrix
Naïve implementation exhibits strided access to global memory
Elements are transposed by a half-warp of threads
Naïve Transpose
Loads are coalesced, stores are not (strided by height)

__global__ void transposeNaive(float *odata, float *idata,
                               int width, int height)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in  = xIndex + width  * yIndex;
    int index_out = yIndex + height * xIndex;
    odata[index_out] = idata[index_in];
}
Coalescing through shared memory
Access columns of a tile in shared memory to write contiguous data to global memory
Requires __syncthreads() since threads access data in shared memory stored by other threads
Elements are transposed by a half-warp of threads
Coalescing through shared memory

__global__ void transposeCoalesced(float *odata, float *idata,
                                   int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;

    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;

    tile[threadIdx.y][threadIdx.x] = idata[index_in];
    __syncthreads();
    odata[index_out] = tile[threadIdx.x][threadIdx.y];
}
Bank Conflicts in Transpose
16x16 shared memory tile of floats
Data in columns are in the same bank: 16-way bank conflict when reading columns in the tile
Solution: pad the shared memory array
    __shared__ float tile[TILE_DIM][TILE_DIM+1];
After padding, data in anti-diagonals are in the same bank
FERMI: NEW ARCHITECTURE
Fermi: The Computational GPU
Disclaimer: Specifications subject to change
Performance:
• 13x double precision of CPUs
• IEEE 754-2008 SP & DP floating point
Flexibility:
• Increased shared memory from 16 KB to 48 KB
• Added L1 and L2 caches
• ECC on all internal and external memories
• Enables up to 1 terabyte of GPU memory
• High-speed GDDR5 memory interface
Usability:
• Multiple simultaneous tasks on the GPU
• 10x faster atomic operations
• C++ support
• System calls, printf support
Fermi
Up to 1536 threads per multiprocessor
Memory operations are done per warp (32 threads) instead of half-warp
Global memory, Shared memory
Shared memory:
16 or 48KB
Now 32 banks, 32-bit wide each
No bank-conflicts when accessing 8-byte words
L1 cache per multiprocessor
Should help with misaligned access, strides access, some register spilling
Much improved dual-issue:
Can dual issue fp32 pairs, fp32-mem, fp64-mem, etc.
IEEE-conformant rounding
ECC option, 64-bit address space, generic addressing
Fermi: Additional capabilities
Fermi can execute several kernels concurrently
Threadblocks from one kernel are launched first
If there are resources available, threadblocks from a kernel in another
stream are launched
Fermi has two copy engines (e.g. Tesla C2050)
Can concurrently copy CPU-GPU and GPU-CPU across PCIe
PCIe is duplex, so aggregate bandwidth is doubled in such cases
Previous generation could only do one copy
Fermi: L1 cache
L1 cache is designed for spatial reuse, not temporal reuse (similar to coalescing)
Shares 64 KB with shared memory:
Switch the split between 16 and 48 KB (CUDA API call)
Caches global memory reads only
Benefits when the compiler detects that all threads load the same value (LDU, load uniform)
L1 cache can be deactivated: smaller granularity of memory transactions
L1 cache is per multiprocessor
NOT coherent! Use volatile for global memory accesses if threads of other blocks change the location (but consider whether this is needed: if not all blocks are active, there is a danger of deadlock!)
Caches local memory reads and writes
To improve spilling behavior
(Coherence is no problem since local memory is SM-private)
CUDA 3.1 Goodies
ABI (Application Binary Interface)
"Real" function calls, including function pointers
User stack, e.g. for recursion (1kB per default, user manipulable)
Surface reads/writes (Fermi only)
2D coordinate mapping to cuArray's opaque data arrangement.
Direct access to a 2D cuArray from within a kernel.
Write-to-texture (caution: max resolution of cuArray 8192x8192).
printf()
Compiler improvements
16-way kernel concurrency
cudaGetDeviceProperties() returns the PCI bus ID and device ID
More detailed feedback from cuda-memcheck
(instead of Unspecified Launch Failure - ULF)
CUDA RESOURCES
Productivity
Getting Started
www.gpgpu.org / www.gpucomputing.net
Gernot Ziegler <[email protected]>
www.nvidia.com/gtc2010-content
CUDA Zone www.nvidia.com/cuda
Introductory tutorials
GPU computing online seminars
(aka Webinars)
Forums
Documentation
Programming Guide
Best Practices Guide
Examples
CUDA SDK
GPU Technology Conference 2010
Monday, Sept. 20 - Thurs., Sept. 23, 2010
San Jose Convention Center, San Jose, California
The most important event in the GPU ecosystem:
Learn about seismic shifts in GPU computing
Preview disruptive technologies and emerging applications
Get tools and techniques to impact mission-critical projects
Network with experts, colleagues, and peers across industries
Opportunities
Call for Submissions
Sessions & posters
Sponsors / Exhibitors
Reach decision makers
“CEO on Stage”
Showcase for Startups
Tell your story to VCs and
analysts
“I consider the GPU Technology Conference to be the single best place to see the amazing
work enabled by the GPU. It’s a great venue for meeting researchers, developers, scientists,
and entrepreneurs from around the world.”
-- Professor Hanspeter Pfister, Harvard University and GTC 2009 keynote speaker
THANK YOU!
Questions?
Additional material
Targeting Multiple Platforms with CUDA
CUDA C / C++ can target several back ends:
NVCC (NVIDIA CUDA Toolkit) → NVIDIA GPUs
MCUDA (CUDA to multi-core) → multi-core CPUs
Ocelot (PTX to multi-core) → multi-core CPUs, via PTX
Swan (CUDA to OpenCL) → other GPUs
MCUDA: http://impact.crhc.illinois.edu/mcuda.php
Ocelot: http://code.google.com/p/gpuocelot/
Swan: http://www.multiscalelab.org/swan
Fermi: Register pressure different
The fewer registers a kernel uses, the more threads and thread blocks are likely
to reside on a multiprocessor, which can improve performance.
--maxrregcount was previously used to control register count,
but the GT200 and Fermi architectures differ (new load/store architecture),
so the same maxrregcount setting is not useful for both C1060 and C2050.
With new kernel-specific directive, application can aid compiler heuristics:
__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
MyKernel(...) { ... }
(more details in B.16 of CUDA 3.1 Programming Guide)
Fermi: __umul24 no longer optimal
On Tesla C1060 / GT200 architecture, bounded integer multiplications could be
accelerated with __umul24(a, b) instead of a * b, e.g. for
unsigned int tid = __umul24(blockIdx.x, blockDim.x) + threadIdx.x
On Fermi, __umul24() is emulated, and thus slower than a * b!
Fermi: LDU (load uniform) Constant Loading
Previously, data that was uniform for all threads (such as runtime-determined
constants that were read-only to a kernel) was preferably stored in the 64 KB
constant memory (the constant cache could broadcast to all threads).
Fermi: the LDU instruction can perform similar constant caching for any global
memory location. LDU = load (block-)uniform variable from memory.
It is loaded into the uniform cache.
Conditions: a) prefix the pointer with the const keyword;
b) the memory access must be uniform across all threads in the block, e.g.
__global__ void kernel( float *g_dst, const float *g_src )
{
    g_dst[blockIdx.x] = g_src[0] + g_src[blockIdx.x];
}
Fermi: Shared memory bank conflicts
On Tesla C1060/SM13, shared memory bank conflicts were determined per
half-warp
Fermi: 32 shared memory banks, and bank conflicts occur per warp (instead
of per half-warp)
Thus: if you used padding to avoid shared memory bank conflicts on GT200/Tesla
C1060, e.g. in the 2D transpose:
    __shared__ float tile[16][17];
then on Fermi, make sure to change both tile size and padding to warp size:
    __shared__ float tile[32][33];
Shmem for cross-thread communication:
check volatile !
Threads can communicate via shared memory without using __syncthreads(),
if all of them belong to the same warp, e.g. if (tid < 32) { … }
On Tesla C1060, a simple declaration sufficed
__shared__ int cross[32];
On C2050 (Fermi), make sure to have a volatile in front of the shared memory
declaration, if you want to use it for interwarp communication like above!
volatile __shared__ int cross[32];
Reason:
C1060 (GT200) could access shmem directly as operand, while
C2050 (Fermi) uses load/store architecture into registers!
Fermi: IEEE accuracy vs. speed
Default settings for computation on GPU now more conservative (for HPC)
Denormal support, IEEE-conformant division and square root
Accuracy over speed
If your app runs faster on Fermi with -arch=sm_13 than -arch=sm_20
then the PTX JIT has used "old" Tesla C1060 settings, which favor speed:
flush-to-zero instead of denormals, no IEEE-precise division, no IEEE-precise
square root
For similar results in -arch=sm_20, use:
-ftz=true -prec-div=false -prec-sqrt=false
NVIDIA CUDA Programming Guide, sections 5.4.1, G.2
The CUDA Compiler Driver NVCC, pg. 14-15
(BTW, Sections 5.4.1 also contains information on instruction timings)
Fermi: CUDA 3.2 ABI allows for inlining
ABI = Application Binary Interface
Fermi supports function pointers and callstack
Reduces kernel size drastically
However, can use slightly more registers!
Use __forceinline__ on functions to enforce inlining again.
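A minimal sketch of forcing inlining under the new ABI; the helper and kernel names are illustrative:

```cuda
// __forceinline__ asks the compiler to inline the call despite the
// ABI's support for real function calls (avoids a stack frame).
__forceinline__ __device__ float axpy(float a, float x, float y)
{
    return a * x + y;    // fused multiply-add candidate
}

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = axpy(a, x[i], y[i]);   // call is inlined into the kernel
}
```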