© NVIDIA Corporation 2010
GPU Tutorial
Build environment
Debugging/Profiling
Fermi
Optimization / CUDA 3.1 and Fermi advice
BUILD ENVIRONMENT
Compilation
Linux
Separate file types
.c/.cpp for host code
.cu for device/mixed code
Typically makefile driven
cuda-gdb for debugging
CUDA Visual Profiler
Windows: Visual Studio
Separate file types
.c/.cpp for host code
.cu for device/mixed code
Compilation rules: cuda.rules
Syntax highlighting
Intellisense
CUDA Visual Profiler
Integrated debugger and
profiler: Parallel Nsight (Win7)
Additional Libraries
NVIDIA
cuBLAS Dense linear algebra (subset of full BLAS suite)
cuFFT 1D/2D/3D real and complex
Third party
NAG Numeric libraries e.g. RNGs
cuLAPACK/MAGMA
Open Source
Thrust STL/Boost-style template library
cuDPP Data parallel primitives (e.g. scan, sort and reduction)
CUSP Sparse linear algebra and graph computation
Many more...
DEBUGGING AND PROFILING
cuda-gdb and Visual Profiler
CUDA-GDB
Extended version of GDB with support for C for CUDA
Supported on Linux 32bit/64bit systems
Seamlessly debug both host (CPU) and device (GPU) code
• Set breakpoints on any source line or symbol name
• Single step executes only one warp, except at __syncthreads()
• Access and print all CUDA memory allocations: local, global, constant and shared variables
Walkthrough example with source code: CUDA-GDB manual
Linux GDB
Integration with
EMACS
Linux GDB
Integration with
DDD
CUDA-MemCheck
Detects/tracks memory errors
Out of bounds accesses
Misaligned accesses (types must be aligned on their size)
Integrated into CUDA-GDB
Linux and WinXP
Win7 and Vista support coming
CUDA Driver – Low-level Profiling Support
1. Set up environment variables:
   export CUDA_PROFILE=1
   export CUDA_PROFILE_CSV=1
   export CUDA_PROFILE_CONFIG=config.txt
   export CUDA_PROFILE_LOG=profile.csv
2. Set up configuration file (config.txt):
   gpustarttimestamp
   instructions
3. Run the application:
   matrixMul
4. View profiler output (profile.csv):
   # CUDA_PROFILE_LOG_VERSION 1.5
   # CUDA_DEVICE 0 GeForce 8800 GT
   # CUDA_PROFILE_CSV 1
   # TIMESTAMPFACTOR fa292bb1ea2c12c
   gpustarttimestamp,method,gputime,cputime,occupancy,instructions
   115f4eaa10e3b220,memcpyHtoD,7.328,12.000
   115f4eaa10e5dac0,memcpyHtoD,5.664,4.000
   115f4eaa10e95ce0,memcpyHtoD,7.328,6.000
   115f4eaa10f2ea60,_Z10dmatrixmulPfiiS_iiS_,19.296,40.000,0.333,4352
   115f4eaa10f443a0,memcpyDtoH,7.776,36.000
CUDA Visual Profiler - Overview
• Performance analysis tool to fine-tune CUDA applications
• Supported on Linux/Windows/Mac platforms
• Functionality:
• Execute a CUDA application and collect profiling data
• Multiple application runs to collect data for all hardware performance counters
• Profiling data for all kernels and memory transfers
• Analyze profiling data
CUDA Visual Profiler – data for kernels
CUDA Visual Profiler – computed data for kernels
• Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate
• Global memory read throughput (Gigabytes/second)
• Global memory write throughput (Gigabytes/second)
• Overall global memory access throughput (Gigabytes/second)
• Global memory load efficiency
• Global memory store efficiency
CUDA Visual Profiler – data for memory transfers
• Memory transfer type and direction
(D=Device, H=Host, A=cuArray)
• e.g. H to D: Host to Device
• Synchronous / Asynchronous
• Memory transfer size, in bytes
• Stream ID
CUDA Visual Profiler – data analysis views
• Views:
• Summary table
• Kernel table
• Memcopy table
• Summary plot
• GPU Time Height plot
• GPU Time Width plot
• Profiler counter plot
• Profiler table column plot
• Multi-device plot
• Multi-stream plot
• Analyze profiler counters
• Analyze kernel occupancy
CUDA Visual Profiler – Misc.
• Multiple sessions
• Compare views for different sessions
• Comparison Summary plot
• Profiler projects – save & load
• Import/Export profiler data
(.CSV format)
NVIDIA Parallel Nsight
Accelerates GPU + CPU application development
The industry’s 1st development environment for massively parallel applications
Complete Visual Studio-integrated development environment
Parallel Nsight 1.0
Nsight Parallel Debugger
GPU source code debugging
Variable & memory inspection
Nsight Analyzer
Platform-level Analysis
For the CPU and GPU
Nsight Graphics Inspector
Visualize and debug graphics content
Source Debugging
Supporting CUDA C and HLSL code.
Hardware breakpoints
GPU memory and variable views
Nsight menu and toolbars
View a correlated trace timeline with both CPU and GPU events.
Analysis
Detailed tooltips are available for every event on the timeline.
Analysis
1.0 System Requirements
Operating System: Windows Server 2008 R2, Windows 7 / Vista (32 or 64-bit)
Hardware: GeForce 9 series or higher; Tesla C1060/S1070 or higher; Quadro (G9x or higher)
Visual Studio: Visual Studio 2008 SP1
Supported System Configurations
#1: Single machine, single GPU – Analyzer, Graphics Inspector
#2: Two machines connected over the network (TCP/IP) – Debugger, Analyzer, Graphics Inspector
#3: Single SLI Multi-OS (MOS) machine, two Quadro GPUs – Debugger, Analyzer, Graphics Inspector
Parallel Nsight 1.0 Versions
Standard (free): GPU Source Debugger, Graphics Inspector
Professional ($): Analyzer, Data Breakpoints, premium ticket-based support
Volume and site licensing available
GPU OPTIMIZATION GUIDELINES
Performance
Optimize Algorithms for the GPU
Algorithm selection: understand the problem, consider alternate algorithms
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)
Recompute? The GPU allocates transistors to arithmetic, not memory
Sometimes better to recompute than to cache
Serial computation on GPU? Low-parallelism computation may still be faster on the GPU than the cost of copying to/from the host
Optimize Memory Access
Coalesce global memory access
Maximize DRAM efficiency
Order of magnitude impact on performance
Avoid serialization
Minimize shared memory bank conflicts
Understand constant cache semantics
Understand spatial locality
Optimize use of textures to ensure spatial locality
Exploit Shared Memory
Hundreds of times faster than global memory
Inter-thread cooperation via shared memory and synchronization
Cache data that is reused by multiple threads
Stage loads/stores to allow reordering
Avoid non-coalesced global memory accesses
Use Resources Efficiently
Partition the computation to keep multiprocessors busy: many threads, many thread blocks
Multiple GPUs
Monitor per-multiprocessor resource utilization: registers and shared memory
Low utilization per thread block permits multiple active blocks per multiprocessor
Overlap computation with I/O: use asynchronous memory transfers
OPTIMIZATION 1:
MEMORY TRANSFERS &
COALESCING
Execution Model (Software → Hardware)
Thread → Scalar Processor: threads are executed by scalar processors
Thread Block → Multiprocessor: thread blocks are executed on multiprocessors; thread blocks do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
Grid → Device: a kernel is launched as a grid of thread blocks; only one kernel can execute on a device at one time
Warps and Half Warps
A thread block consists of 32-thread warps
A warp is executed physically in parallel (SIMD) on a multiprocessor
A half-warp of 16 threads can coordinate global memory accesses into a single transaction to device memory (DRAM: global and local)
Memory Architecture
Host: CPU, chipset, host DRAM
Device: GPU with multiple multiprocessors, each with registers and shared memory, plus constant and texture caches
Device DRAM holds global, constant, texture and local memory
Host-Device Data Transfers
Device to host memory bandwidth much lower than device to
device bandwidth
8 GB/s peak (PCI-e x16 Gen 2) vs. 141 GB/s peak (GTX 280)
Minimize transfers
Intermediate data can be allocated, operated on, and deallocated without
ever copying them to host memory
Group transfers
One large transfer much better than many small ones
Page-Locked Data Transfers
cudaMallocHost() allows allocation of page-locked (“pinned”) host memory
Enables highest cudaMemcpy performance: 3.2 GB/s on PCI-e x16 Gen1, 5.2 GB/s on PCI-e x16 Gen2
See the “bandwidthTest” CUDA SDK sample
Use with caution! Allocating too much page-locked memory can reduce overall system performance
Test your systems and apps to learn their limits
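A minimal sketch of the pattern above; the buffer size and omitted error checking are illustrative assumptions, not from the slides:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 32 * 1024 * 1024;   // 32 MB, arbitrary example size
    float *h_pinned = 0;
    float *d_buf = 0;

    // Page-locked host allocation: enables the fastest cudaMemcpy path
    cudaMallocHost((void**)&h_pinned, bytes);
    cudaMalloc((void**)&d_buf, bytes);

    // One large transfer instead of many small ones
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);   // pinned memory must be freed with cudaFreeHost
    return 0;
}
```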
Overlapping Data Transfers and Computation
Async and Stream APIs allow overlap of H2D or D2H data transfers with computation
CPU computation can overlap data transfers on all CUDA capable devices
Kernel computation can overlap data transfers on devices with “Concurrent copy and execution” (roughly compute capability >= 1.1)
Stream = sequence of operations that execute in order on GPU
Operations from different streams can be interleaved
Stream ID used as argument to async calls and kernel launches
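A sketch of the stream idea, assuming pinned host memory and a device with concurrent copy and execution; the kernel and function names are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) { /* ... */ }

// Split the work into two streams so the copy for chunk 1 can
// overlap the kernel for chunk 0.
void overlapped(float *h_pinned, float *d, int n, int threads)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int half = n / 2;
    for (int i = 0; i < 2; ++i) {
        float *hp = h_pinned + i * half;
        float *dp = d + i * half;
        // Async copy and kernel launch take the stream ID as argument
        cudaMemcpyAsync(dp, hp, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(half + threads - 1) / threads, threads, 0, s[i]>>>(dp, half);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```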
Coalescing: GT200 architecture, Tesla C1060
Global memory access of 32-, 64-, or 128-bit words by a half-warp of threads can result in as few as one (or two) transaction(s) if certain access requirements are met
Float (32-bit) data example: transactions use 32-byte, 64-byte, or 128-byte segments
Coalescing: Compute capability 1.2 and 1.3 (GT200, Tesla C1060)
Issues transactions for segments of 32B, 64B, and 128B
Smaller transactions are used to avoid wasted bandwidth
Examples: 1 transaction (64B segment); 2 transactions (64B and 32B segments); 1 transaction (128B segment)
Coalescing: Compute capability 2.0 (Fermi, Tesla C2050)
Memory transactions are handled per warp (32 threads)
L1 cache ON: always issues 128B segment transactions and caches them in the 16 KB or 48 KB L1 cache per multiprocessor
Example: 2 transactions (2 x 128B segments), but the next warp probably needs only 1 extra transaction, due to the L1 cache
L1 cache OFF: always issues 32B segment transactions
Example: 32 widely scattered accesses take 32 x 32B segments instead of 32 x 128B segments, an advantage for scattered access patterns
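The access patterns above can be sketched with two illustrative kernels (the names and the stride value are assumptions for the example):

```cuda
// Coalesced: thread k of a half-warp reads consecutive 32-bit word k,
// so the 16 accesses fall into a single 64B segment on GT200.
__global__ void coalesced(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                 // consecutive 32-bit words
}

// Strided: neighbouring threads touch addresses `stride` floats apart,
// so each access can land in a different memory segment.
__global__ void strided(float *out, const float *in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];                 // e.g. stride = 32 is a worst case
}
```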
OPTIMIZATION 2:
EXECUTION CONFIG
Occupancy
Thread instructions are executed sequentially, so executing
other warps is the only way to hide latencies and keep the
hardware busy
Occupancy = Number of warps running concurrently on a
multiprocessor divided by maximum number of warps that can
run concurrently
Limited by resource usage:
Registers
Shared memory
Blocks per Grid Heuristics
# of blocks > # of multiprocessors
So all multiprocessors have at least one block to execute
# of blocks / # of multiprocessors > 2
Multiple blocks can run concurrently in a multiprocessor
Blocks that aren’t waiting at a __syncthreads() keep the hardware busy
Subject to resource availability – registers, shared memory
# of blocks > 100 to scale to future devices
Blocks executed in pipeline fashion
1000 blocks per grid will scale across multiple generations
Register Pressure
Hide latency by using more threads per multiprocessor
Limiting Factors:
Number of registers per kernel
8K/16K per multiprocessor, partitioned among concurrent threads
Amount of shared memory
16KB per multiprocessor, partitioned among concurrent threadblocks
Compile with the --ptxas-options=-v flag
Use the --maxrregcount=N flag to NVCC
N = desired maximum registers / kernel
At some point “spilling” into local memory may occur
Reduces performance – local memory is slow
Occupancy Calculator: Excel sheet to calculate GPU occupancy, part of the CUDA SDK (Windows!)
Optimizing threads per block
Choose threads per block as a multiple of warp size
Avoid wasting computation on under-populated warps
Facilitates coalescing
Want to run as many warps as possible per multiprocessor (hide
latency)
Multiprocessor can run up to 8 blocks at a time
Heuristics
Minimum: 64 threads per block
Only if multiple concurrent blocks
192 or 256 threads a better choice
Usually still enough regs to compile and invoke successfully
This all depends on your computation, so experiment!
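The heuristics above can be sketched as a launch configuration; the kernel name and block size of 256 are illustrative choices, not prescriptions:

```cuda
__global__ void kernel(float *d, int n) { /* ... */ }

// 256 threads per block: a multiple of the warp size (32), in the
// 192–256 range the slides recommend. Enough blocks to cover n
// and keep all multiprocessors busy.
void launch(float *d, int n)
{
    const int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / 256)
    kernel<<<blocks, threadsPerBlock>>>(d, n);
}
```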
Occupancy != Performance
Increasing occupancy does not necessarily increase
performance
BUT …
Low-occupancy multiprocessors cannot adequately hide latency
on memory-bound kernels
(It all comes down to arithmetic intensity and available parallelism)
OPTIMIZATION 3:
MATH FUNCS & BRANCHING
Runtime Math Library
There are two types of runtime math operations in single
precision
__funcf(): direct mapping to hardware ISA
Fast but lower accuracy (see prog. guide for details)
Examples: __sinf(x), __expf(x), __powf(x,y)
funcf() : compile to multiple instructions
Slower but higher accuracy (5 ulp or less)
Examples: sinf(x), expf(x), powf(x,y)
The -use_fast_math compiler option forces every funcf() to
compile to __funcf()
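A minimal sketch contrasting the two variants; the kernel name and expression are illustrative:

```cuda
__global__ void wave(float *out, const float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float accurate = sinf(x[i]) * expf(x[i]);      // multiple instructions, higher accuracy
    float fast     = __sinf(x[i]) * __expf(x[i]);  // hardware ISA, lower accuracy
    out[i] = accurate - fast;   // the difference exposes the accuracy trade-off
}
```

Compiling with -use_fast_math would make both lines equivalent, since every sinf()/expf() is then forced to the __funcf() form.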
Control Flow Instructions
Main performance concern with branching is divergence
Threads within a single warp take different paths
Different execution paths must be serialized
Avoid divergence when branch condition is a function of thread
ID
Example with divergence:
if (threadIdx.x > 2) { }
Branch granularity < warp size
Example without divergence:
if (threadIdx.x / WARP_SIZE > 2) { }
Branch granularity is a whole multiple of warp size
OPTIMIZATION 4: SHARED MEMORY
Shared Memory
~Hundred times faster than global memory
Cache data to reduce global memory accesses
Threads can cooperate via shared memory
Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-coalesceable addressing
Shared Memory Architecture
Many threads accessing memory
Therefore, memory is divided into banks
Successive 32-bit words assigned to successive banks
Each bank can service one address per cycle
A memory can service as many simultaneous
accesses as it has banks
Multiple simultaneous accesses to a bank
result in a bank conflict
Conflicting accesses are serialized
Bank Addressing Examples
No bank conflicts: linear addressing, stride == 1 (each thread accesses a different bank)
No bank conflicts: random 1:1 permutation (still one thread per bank)
Bank Addressing Examples
2-way bank conflicts: linear addressing, stride == 2 (two threads per bank)
8-way bank conflicts: linear addressing, stride == 8 (eight threads per bank)
Shared memory bank conflicts
Shared memory is ~ as fast as registers if there are no bank conflicts
warp_serialize profiler signal reflects conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict
(broadcast)
The slow case:
Bank Conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank
Shared Memory Example: Transpose
Each thread block works on a tile of the matrix
Naïve implementation exhibits strided access to global memory
Elements are transposed by a half-warp of threads
Naïve Transpose
Loads are coalesced, stores are not (strided by height)

__global__ void transposeNaive(float *odata, float *idata,
                               int width, int height)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in  = xIndex + width  * yIndex;
    int index_out = yIndex + height * xIndex;
    odata[index_out] = idata[index_in];
}
Coalescing through shared memory
Access columns of a tile in shared memory to write contiguous data to global memory
Requires __syncthreads() since threads access data in shared memory stored by other threads
Elements are transposed by a half-warp of threads
Coalescing through shared memory

__global__ void transposeCoalesced(float *odata, float *idata,
                                   int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;

    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;

    tile[threadIdx.y][threadIdx.x] = idata[index_in];
    __syncthreads();
    odata[index_out] = tile[threadIdx.x][threadIdx.y];
}
Bank Conflicts in Transpose
16x16 shared memory tile of floats
Data in columns are in the same bank: 16-way bank conflict when reading columns in the tile
Solution: pad the shared memory array
    __shared__ float tile[TILE_DIM][TILE_DIM+1];
After padding, data in anti-diagonals are in the same bank
FERMI: NEW ARCHITECTURE
Fermi: The Computational GPU
Disclaimer: Specifications subject to change
Performance:
• 13x double precision of CPUs
• IEEE 754-2008 SP & DP floating point
Flexibility:
• Increased shared memory from 16 KB to 48 KB
• Added L1 and L2 caches
• ECC on all internal and external memories
• Enables up to 1 terabyte of GPU memory
• High-speed GDDR5 memory interface
Usability:
• Multiple simultaneous tasks on the GPU
• 10x faster atomic operations
• C++ support
• System calls, printf support
Fermi
Up to 1536 threads per multiprocessor
Memory operations are done per warp (32 threads) instead of half-warp
Global memory, Shared memory
Shared memory:
16 or 48KB
Now 32 banks, 32-bit wide each
No bank-conflicts when accessing 8-byte words
L1 cache per multiprocessor
Should help with misaligned access, strides access, some register spilling
Much improved dual-issue:
Can dual issue fp32 pairs, fp32-mem, fp64-mem, etc.
IEEE-conformant rounding
ECC option, 64-bit address space, generic addressing
Fermi: Additional capabilities
Fermi can execute several kernels concurrently
Threadblocks from one kernel are launched first
If there are resources available, threadblocks from a kernel in another
stream are launched
Fermi has two copy engines (e.g. Tesla C2050)
Can concurrently copy CPU-GPU and GPU-CPU across PCIe
PCIe is duplex, so aggregate bandwidth is doubled in such cases
Previous generation could only do one copy
Fermi: L1 cache
L1 cache is designed for spatial reuse, not temporal reuse (similar to coalescing)
Shares 64 KB with shared memory:
Switch the split between 16 and 48 KB (CUDA API call)
Caches global memory reads only
Benefits when the compiler detects that all threads load the same value (LDU, load uniform)
L1 cache can be deactivated: smaller granularity of memory transactions
L1 cache is per multiprocessor
NOT coherent! Use volatile for global memory accesses if threads of other blocks change the location (but consider whether this is needed: if not all blocks are active, there is a danger of deadlock!)
Caches local memory reads and writes
To improve spilling behavior
(Coherence is no problem since local memory is SM-private)
CUDA 3.1 Goodies
ABI (Application Binary Interface)
"Real" function calls, including function pointers
User stack, e.g. for recursion (1kB per default, user manipulable)
Surface reads/writes (Fermi only)
2D coordinate mapping to cuArray's opaque data arrangement.
Direct access to a 2D cuArray from within a kernel.
Write-to-texture (caution: max resolution of cuArray 8192x8192).
printf()
Compiler improvements
16-way kernel concurrency
cudaGetDeviceProperties() returns the PCI bus ID and device ID
More detailed feedback from cuda-memcheck
(instead of Unspecified Launch Failure - ULF)
CUDA RESOURCES
Productivity
Getting Started
www.gpgpu.org / www.gpucomputing.net
Gernot Ziegler <[email protected]>
www.nvidia.com/gtc2010-content
CUDA Zone www.nvidia.com/cuda
Introductory tutorials
GPU computing online seminars
(aka Webinars)
Forums
Documentation
Programming Guide
Best Practices Guide
Examples
CUDA SDK
GPU Technology Conference 2010
Monday, Sept. 20 - Thurs., Sept. 23, 2010
San Jose Convention Center, San Jose, California
The most important event in the GPU ecosystem:
Learn about seismic shifts in GPU computing
Preview disruptive technologies and emerging applications
Get tools and techniques to impact mission-critical projects
Network with experts, colleagues, and peers across industries
Opportunities
Call for Submissions
Sessions & posters
Sponsors / Exhibitors
Reach decision makers
“CEO on Stage”
Showcase for Startups
Tell your story to VCs and
analysts
“I consider the GPU Technology Conference to be the single best place to see the amazing
work enabled by the GPU. It’s a great venue for meeting researchers, developers, scientists,
and entrepreneurs from around the world.”
-- Professor Hanspeter Pfister, Harvard University and GTC 2009 keynote speaker
THANK YOU!
Questions?
Additional material
Targeting Multiple Platforms with CUDA
CUDA C / C++ can target several back ends:
NVCC (NVIDIA CUDA Toolkit) → NVIDIA GPUs
MCUDA (CUDA to multi-core) → multi-core CPUs
Ocelot (PTX to multi-core) → multi-core CPUs, via PTX
Swan (CUDA to OpenCL) → other GPUs
MCUDA: http://impact.crhc.illinois.edu/mcuda.php
Ocelot: http://code.google.com/p/gpuocelot/
Swan: http://www.multiscalelab.org/swan
Fermi: Register pressure different
The fewer registers a kernel uses, the more threads and thread blocks are likely
to reside on a multiprocessor, which can improve performance.
--maxrregcount was previously used to control register count,
but the GT200 and Fermi architectures differ (new load/store architecture),
so the same maxrregcount setting is not useful for both C1060 and C2050.
With new kernel-specific directive, application can aid compiler heuristics:
__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
MyKernel(...) { ... }
(more details in B.16 of CUDA 3.1 Programming Guide)
Fermi: __umul24 no longer optimal
On Tesla C1060 / GT200 architecture, bounded integer multiplications could be
accelerated with __umul24(a, b) instead of a * b, e.g. for
unsigned int tid = __umul24(blockIdx.x, blockDim.x) + threadIdx.x
On Fermi, __umul24() is emulated, and thus slower than a * b!
Fermi: LDU (load uniform) Constant Loading
Previously, data that was uniform for all threads (such as runtime-determined
constants that were read-only to a kernel) was preferably stored in the 64 KB
constant memory (the constant cache could broadcast to all threads).
Fermi: the LDU instruction can perform similar constant caching for any global
memory location. LDU = load (block-)uniform variable from memory.
It is loaded into the uniform cache.
Conditions: a) prefix the pointer with the const keyword;
b) the memory access must be uniform across all threads in the block, e.g.
__global__ void kernel( float *g_dst, const float *g_src )
{
    g_dst[blockIdx.x] = g_src[0] + g_src[blockIdx.x];
}
Fermi: Shared memory bank conflicts
On Tesla C1060/SM13, shared memory bank conflicts were determined per
half-warp
Fermi: 32 shared memory banks, and bank conflicts occur per warp (instead
of per half-warp)
Thus: if you used padding to avoid shared memory bank conflicts on GT200/Tesla
C1060, e.g. in the 2D transpose:
    __shared__ float tile[16][17];
then on Fermi, make sure to change both tile size and padding to warp size:
    __shared__ float tile[32][33];
Shmem for cross-thread communication:
check volatile !
Threads can communicate via shared memory without using __syncthreads(),
if all of them belong to the same warp, e.g. if (tid < 32) { … }
On Tesla C1060, a simple declaration sufficed
__shared__ int cross[32];
On C2050 (Fermi), make sure to have a volatile in front of the shared memory
declaration, if you want to use it for interwarp communication like above!
volatile __shared__ int cross[32];
Reason:
C1060 (GT200) could access shmem directly as operand, while
C2050 (Fermi) uses load/store architecture into registers!
Fermi: IEEE accuracy vs. speed
Default settings for computation on GPU now more conservative (for HPC)
Denormal support, IEEE-conformant division and square root
Accuracy over speed
If your app runs faster on Fermi with -arch=sm_13 than -arch=sm_20
then the PTX JIT has used "old" Tesla C1060 settings, which favor speed:
flush-to-zero instead of denormals, no IEEE-precise division, no IEEE-precise
square root
For similar results in -arch=sm_20, use:
-ftz=true -prec-div=false -prec-sqrt=false
NVIDIA CUDA Programming Guide, sections 5.4.1, G.2
The CUDA Compiler Driver NVCC, pg. 14-15
(BTW, Sections 5.4.1 also contains information on instruction timings)
Fermi: CUDA 3.2 ABI allows for inlining
ABI = Application Binary Interface
Fermi supports function pointers and callstack
Reduces kernel size drastically
However, can use slightly more registers!
Use __forceinline__ on functions to enforce inlining again.
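A minimal sketch of forcing inlining under the new ABI; the helper and kernel names are illustrative:

```cuda
// __forceinline__ asks the compiler to inline the call despite the
// ABI's support for real function calls (avoids a stack frame).
__forceinline__ __device__ float axpy(float a, float x, float y)
{
    return a * x + y;    // fused multiply-add candidate
}

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = axpy(a, x[i], y[i]);   // call is inlined into the kernel
}
```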