© NVIDIA Corporation 2010
GPU Tutorial @ Lund Observatory
Gernot Ziegler, NVIDIA UK
HISTORY / INTRODUCTION
Parallel vs Sequential Architecture Evolution
[Timeline diagram: high performance computing architectures (ILLIAC IV, Cray-1, MasPar, Thinking Machines, Blue Gene, many-core, GPUs) alongside sequential architectures for databases and operating systems (DEC PDP-1, IBM System 360, VAX, Intel 4004, x86, IBM POWER4, multi-core).]
Recent History
Specialised machines faded out (e.g. Cray)
Cost, economies of scale
Intel and AMD chips designed for home/office use
Increasing clock frequencies gave increasing performance
Commodity clusters
Computer gaming drives the Graphics Processing Unit (GPU)
NVIDIA and ATI
Present
Clock frequency no longer increasing
Power consumption ∝ f²
Multi-core dominates
GPU Computing
CPU + GPU Co-Processing
[Diagram: a 4-core CPU alongside a many-core GPU]
Graphics Pipelines for the Last 20 Years
Processor per function:
T&L evolved to vertex shading
Triangle, point, line setup
Flat shading, texturing, eventually pixel shading
Blending, Z-buffering, anti-aliasing
Wider and faster over the years
Previous Pipelined Architectures
Heavy geometry workload: Perf = 4
Heavy pixel workload: Perf = 8
[Diagram: fixed pipeline Vertex → Triangle → Pixel → ROP → Memory; with dedicated vertex and pixel shader units, part of the hardware sits idle depending on the workload.]
Unified Architecture Replaces the Pipeline Model
The future of GPUs is programmable processing
So: build the architecture around the processor
[Diagram: G80 unified architecture. The host feeds a data assembler and vertex/geometry/pixel thread issue units; a single thread processor array of streaming processor (SP) clusters, each with texture fetch (TF) units and an L1 cache, serves all shader types, backed by L2 caches and framebuffer (FB) partitions; setup/raster/ZCull sits alongside.]
Low Latency or High Throughput?
CPU: optimised for low-latency access to cached data sets; control logic for out-of-order and speculative execution
GPU: optimised for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation
[Diagram: CPU die dominated by control logic and cache with a few ALUs, vs GPU die dominated by ALUs; each has its own DRAM.]
Heterogeneous Computing Domains
[Diagram: a spectrum of domains (oil & gas, finance, medical, biophysics, numerics, audio, video, imaging) spanning from the CPU (sequential computing, instruction-level parallelism, data fits in cache) to the GPU (parallel computing and graphics, massive data parallelism, larger data sets).]
Typical application speedups: 50x – 150x
146X  Medical Imaging, U of Utah
36X   Molecular Dynamics, U of Illinois, Urbana
18X   Video Transcoding, Elemental Tech
50X   Matlab Computing, AccelerEyes
100X  Astrophysics, RIKEN
149X  Financial simulation, Oxford
47X   Linear Algebra, Universidad Jaime
20X   3D Ultrasound, Techniscan
130X  Quantum Chemistry, U of Illinois, Urbana
30X   Gene Sequencing, U of Maryland
Tesla, CUDA & PSC: definitions
CUDA Architecture
NVIDIA's enabling technology for GPU computing
The architecture of the GPU to support compute, plus C language extensions and a retargeting compiler
Usable with any GeForce 8 series or later GPU
Tesla
Dedicated compute hardware
C1060 and S1070
Fermi: C2050 and S2070
PSC
Personal Super Computer
A desktop machine with at least 3 C1060s
NVIDIA Tesla 20-Series (Fermi) Products

                        Workstation             Data Center Products
                        C2050 / C2070           S2050 / S2070        M2050 / M2070
                        Workstation Board       1U System            Module
GPUs                    1 Tesla GPU             4 Tesla GPUs         1 Tesla GPU
Single Precision Perf.  1030 Gigaflops          4.12 Teraflops       1030 Gigaflops
Double Precision Perf.  515 Gigaflops           2.06 Teraflops       515 Gigaflops
Memory: x2050           3 GB                    12 GB (3 GB / GPU)   3 GB
Memory: x2070           6 GB                    24 GB (6 GB / GPU)   6 GB
The Performance Gap Widens Further
[Chart: peak single precision performance (GFlops/sec), 2003–2010. NVIDIA GPUs (Tesla 8-series, 10-series, 20-series) pull away from x86 CPUs (3 GHz Nehalem). Tesla 20-series callout: 8x double precision, ECC, L1/L2 caches, 1 TF single precision, 4 GB memory.]
GPU Computing Applications
CUDA Parallel Computing Architecture
NVIDIA GPU with the CUDA Parallel Computing Architecture
Language bindings: C, C++, Fortran, Java and Python, OpenCL™, DirectCompute
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
GPU Parallel Computing Developer Eco-System
Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView
Numerical Packages: MATLAB, Mathematica, NI LabVIEW, pyCUDA
Languages: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
GPU Compilers / Parallelizing Compilers: PGI Accelerator, CAPS HMPP, MCUDA, OpenMP
Libraries: BLAS, FFT, LAPACK, NPP, Video/Imaging
Solution Providers, CUDA Consultants & Training: ANEO, GPU Tech
CUDA OVERVIEW
Processing Flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
CUDA Parallel Computing Architecture
Parallel computing architecture
and programming model
Includes a CUDA C compiler,
support for OpenCL and
DirectCompute
Architected to natively support
multiple computational
interfaces (standard languages
and APIs)
C for CUDA: C with a few keywords

Standard C Code:
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0f, x, y);

Parallel C Code:
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x, y);
CUDA Parallel Computing Architecture
CUDA defines:
Programming model
Memory model
Execution model
CUDA uses the GPU, but is for general-purpose computing
Facilitate heterogeneous computing: CPU + GPU
CUDA is scalable
Scale to run on 100s of cores/1000s of parallel threads
Compiling CUDA C Applications (Runtime API)

void serial_function(...) {
    ...
}
void other_function(int ...) {
    ...
}
void saxpy_serial(float ...) {
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
void main() {
    float x;
    saxpy_serial(..);
    ...
}

[Diagram: the key kernels (here saxpy_serial, modified into parallel CUDA code) are compiled by NVCC (Open64) into CUDA object files; the rest of the C application is compiled by the host CPU compiler into CPU object files; the linker combines both into a single CPU-GPU executable.]
PROGRAMMING MODEL
CUDA Review
CUDA Kernels
Parallel portion of application: execute as a kernel
Entire GPU executes kernel, many threads
CUDA threads:
Lightweight
Fast switching
1000s execute simultaneously
CPU = Host: executes functions
GPU = Device: executes kernels
CUDA Kernels: Parallel Threads
A kernel is a function executed
on the GPU
Array of threads, in parallel
All threads execute the same
code, can take different paths
Each thread has an ID
Select input/output data
Control decisions
float x = input[threadID];
float y = func(x);
output[threadID] = y;
CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed on the GPU as a grid of blocks of threads
Communication Within a Block
Threads may need to cooperate
Memory accesses
Share results
Cooperate using shared memory
Accessible by all threads within a block
Restriction to “within a block” permits scalability
Fast communication between N threads is not feasible when N is large
Transparent Scalability
The same grid of 12 blocks runs unchanged on different GPUs: a G84 executes the blocks 2 at a time, a G80 8 at a time, and a GT200 all 12 at once (leaving some multiprocessors idle). The hardware schedules blocks onto however many multiprocessors are available.
CUDA Programming Model - Summary
A kernel executes as a grid of thread blocks
A block is a batch of threads; they communicate through shared memory
Each block has a block ID; each thread has a thread ID
[Diagram: the host launches Kernel 1 on a 1D grid of blocks (0-3) and Kernel 2 on a 2D grid of blocks ((0,0)-(1,3)) on the device.]
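The block/thread ID arithmetic can be checked on the host. A minimal sketch in plain C; the function names are mine, only the blockIdx.x*blockDim.x + threadIdx.x formula comes from the SAXPY slide:

```c
#include <assert.h>

/* Host-side sketch of the device-side index computation
   blockIdx.x * blockDim.x + threadIdx.x. Illustrative names,
   not CUDA built-ins. */
static int global_thread_id(int block_id, int block_dim, int thread_id)
{
    return block_id * block_dim + thread_id;
}

/* Flatten a 2D block ID (as in the Kernel 2 grid) plus a 1D thread ID
   into one global index. grid_w is the grid width in blocks. */
static int global_id_2d(int block_x, int block_y, int grid_w,
                        int block_dim, int thread_id)
{
    int block_id = block_y * grid_w + block_x;
    return block_id * block_dim + thread_id;
}
```

Every thread computes a distinct index this way, which is how each one selects its own input/output element.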
MEMORY MODEL
CUDA Review
Memory hierarchy
Thread: registers
Thread: local memory
Block of threads: shared memory
All blocks: global memory
Additional Memories
Host can also allocate textures and arrays of constants
Textures and constants have dedicated caches
PROGRAMMING ENVIRONMENT
CUDA Review
CUDA APIs
API allows the host to manage the devices
Allocate memory & transfer data
Launch kernels
CUDA C “Runtime” API
High level of abstraction - start here!
CUDA C “Driver” API
More control, more verbose
(OpenCL: Similar to CUDA C Driver API)
CUDA C and OpenCL
Shared back-end compiler and optimization technology
OpenCL: entry point for developers who want a low-level API
CUDA C: entry point for developers who prefer high-level C
Visual Studio
Separate file types
.c/.cpp for host code
.cu for device/mixed code
Compilation rules: cuda.rules
Syntax highlighting
Intellisense
Integrated debugger and
profiler: Nexus
Linux
Separate file types
.c/.cpp for host code
.cu for device/mixed code
Typically makefile driven
cuda-gdb for debugging
CUDA Visual Profiler
CUDA OPTIMIZATION GUIDELINES
Performance
Optimize Algorithms for GPU
Algorithm selection: understand the problem, consider alternate algorithms
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)
Recompute? The GPU allocates transistors to arithmetic, not memory, so it is sometimes better to recompute than to cache
Serial computation on GPU? A low-parallelism computation may still be faster on the GPU than copying data to and from the host
Optimize Memory Access
Coalesce global memory access
Maximise DRAM efficiency
Order of magnitude impact on performance
Avoid serialization
Minimize shared memory bank conflicts
Understand constant cache semantics
Understand spatial locality
Optimize use of textures to ensure spatial locality
Exploit Shared Memory
Hundreds of times faster than global memory
Inter-thread cooperation via shared memory and synchronization
Cache data that is reused by multiple threads
Stage loads/stores to allow reordering
Avoid non-coalesced global memory accesses
Use Resources Efficiently
Partition the computation to keep multiprocessors busy: many threads, many thread blocks, multiple GPUs
Monitor per-multiprocessor resource utilization: registers and shared memory
Low utilization per thread block permits multiple active blocks per multiprocessor
Overlap computation with I/O: use asynchronous memory transfers
DEBUGGING AND PROFILING
cuda-gdb and Visual Profiler
CUDA-GDB
Extended version of GDB with support for C for CUDA
Supported on 32-bit and 64-bit Linux systems
Seamlessly debug both host (CPU) and device (GPU) code
• Set breakpoints on any source line or symbol name
• Single step executes only one warp, except at __syncthreads()
• Access and print all CUDA memory allocations, local, global, constant and shared variables
Linux GDB
Integration with
EMACS
Linux GDB
Integration with
DDD
CUDA Driver – Low-level Profiling support
1. Set up environment variables:
   export CUDA_PROFILE=1
   export CUDA_PROFILE_CSV=1
   export CUDA_PROFILE_CONFIG=config.txt
   export CUDA_PROFILE_LOG=profile.csv
2. Set up the configuration file "config.txt":
   gpustarttimestamp
   instructions
3. Run the application:
   matrixMul
4. View the profiler output in "profile.csv":
   # CUDA_PROFILE_LOG_VERSION 1.5
   # CUDA_DEVICE 0 GeForce 8800 GT
   # CUDA_PROFILE_CSV 1
   # TIMESTAMPFACTOR fa292bb1ea2c12c
   gpustarttimestamp,method,gputime,cputime,occupancy,instructions
   115f4eaa10e3b220,memcpyHtoD,7.328,12.000
   115f4eaa10e5dac0,memcpyHtoD,5.664,4.000
   115f4eaa10e95ce0,memcpyHtoD,7.328,6.000
   115f4eaa10f2ea60,_Z10dmatrixmulPfiiS_iiS_,19.296,40.000,0.333,4352
   115f4eaa10f443a0,memcpyDtoH,7.776,36.000
CUDA Visual Profiler - Overview
• Performance analysis tool to fine tune CUDA applications
• Supported on Linux/Windows/Mac platforms
• Functionality:
• Execute a CUDA application and collect profiling data
• Multiple application runs to collect data for all hardware performance counters
• Profiling data for all kernels and memory transfers
• Analyze profiling data
CUDA Visual Profiler – data for kernels
CUDA Visual Profiler – computed data for kernels
• Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate
• Global memory read throughput (Gigabytes/second)
• Global memory write throughput (Gigabytes/second)
• Overall global memory access throughput (Gigabytes/second)
• Global memory load efficiency
• Global memory store efficiency
CUDA Visual Profiler – data for memory transfers
• Memory transfer type and direction
(D=Device, H=Host, A=cuArray)
• e.g. H to D: Host to Device
• Synchronous / Asynchronous
• Memory transfer size, in bytes
• Stream ID
CUDA Visual Profiler – data analysis views
• Views:
• Summary table
• Kernel table
• Memcopy table
• Summary plot
• GPU Time Height plot
• GPU Time Width plot
• Profiler counter plot
• Profiler table column plot
• Multi-device plot
• Multi-stream plot
• Analyze profiler counters
• Analyze kernel occupancy
CUDA Visual Profiler – Misc.
• Multiple sessions
• Compare views for different sessions
• Comparison Summary plot
• Profiler projects – save & load
• Import/Export profiler data
(.CSV format)
NVIDIA Parallel Nsight
Accelerates GPU + CPU application development
The industry's first development environment for massively parallel applications
Complete Visual Studio-integrated development environment
Parallel Nsight 1.0
Nsight Parallel Debugger
GPU source code debugging
Variable & memory inspection
Nsight Analyzer
Platform-level Analysis
For the CPU and GPU
Nsight Graphics Inspector
Visualize and debug graphics content
Source Debugging
Supporting CUDA C and HLSL code.
Hardware breakpoints
GPU memory and variable views
Nsight menu and toolbars
Analysis
View a correlated trace timeline with both CPU and GPU events.
Detailed tooltips are available for every event on the timeline.
1.0 System Requirements
Operating System: Windows Server 2008 R2; Windows 7 / Vista; 32 or 64-bit
Hardware: GeForce 9 series or higher; Tesla C1060/S1070 or higher; Quadro (G9x or higher)
Visual Studio: Visual Studio 2008 SP1
Supported System Configurations
#1: Single machine, single GPU: Analyzer, Graphics Inspector
#2: Two machines connected over the network (TCP/IP): Debugger, Analyzer, Graphics Inspector
#3: Single SLI Multi-OS machine, two Quadro GPUs: Debugger, Analyzer, Graphics Inspector
Parallel Nsight 1.0 Versions
Standard (free): GPU Source Debugger, Graphics Inspector
Professional ($349): Analyzer, Data Breakpoints, premium ticket-based support
Volume and site licensing available
NVIDIA Nexus IDE
The industry’s first IDE for massively
parallel applications
Accelerates co-processing (CPU + GPU)
application development
Complete Visual Studio-integrated
development environment
NVIDIA Nexus IDE - Debugging
NVIDIA Nexus IDE - Profiling
RESOURCES
Productivity
Getting Started
CUDA Zone
www.nvidia.com/cuda
Introductory tutorials
GPU computing online seminars
(aka Webinars)
Forums
Documentation
Programming Guide
Best Practices Guide
Examples
CUDA SDK
Libraries
NVIDIA
cuBLAS Dense linear algebra (subset of full BLAS suite)
cuFFT 1D/2D/3D real and complex
Third party
NAG Numeric libraries e.g. RNGs
cuLAPACK/MAGMA
Open Source
Thrust STL/Boost style template language
cuDPP Data parallel primitives (e.g. scan, sort and reduction)
CUSP Sparse linear algebra and graph computation
Many more...
Additional material
Targeting Multiple Platforms with CUDA
CUDA C / C++ source can target:
NVIDIA GPUs via NVCC (NVIDIA CUDA Toolkit)
Multi-core CPUs via MCUDA (CUDA to multi-core) or Ocelot (PTX to multi-core)
Other GPUs via Swan (CUDA to OpenCL)
MCUDA: http://impact.crhc.illinois.edu/mcuda.php
Ocelot: http://code.google.com/p/gpuocelot/
Swan: http://www.multiscalelab.org/swan
OPTIMIZATION 1:
MEMORY TRANSFERS &
COALESCING
Execution Model (Software → Hardware)
Threads are executed by scalar processors
Thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
A kernel is launched as a grid of thread blocks
Only one kernel can execute on a device at one time
Warps and Half Warps
A thread block consists of 32-thread warps
A warp is executed physically in parallel (SIMD) on a multiprocessor
A half-warp of 16 threads can coordinate global memory accesses into a single transaction
[Diagram: a thread block splits into 32-thread warps on a multiprocessor; global and local memory live in device DRAM.]
Memory Architecture
[Diagram: the host (CPU, chipset, DRAM) connects to the device. Device DRAM holds global, constant, texture and local memory. Each GPU multiprocessor has registers and shared memory, plus constant and texture caches.]
Host-Device Data Transfers
Device to host memory bandwidth much lower than device to
device bandwidth
8 GB/s peak (PCI-e x16 Gen 2) vs. 141 GB/s peak (GTX 280)
Minimize transfers
Intermediate data can be allocated, operated on, and deallocated without
ever copying them to host memory
Group transfers
One large transfer much better than many small ones
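The "group transfers" advice follows from a simple cost model: every copy pays a fixed setup overhead before streaming at bus bandwidth. The sketch below uses assumed numbers (10 µs overhead, 8 GB/s) for illustration only, not measured figures:

```c
#include <assert.h>

/* Toy PCIe transfer model: fixed per-copy overhead plus bytes moved
   at bus bandwidth. Returns microseconds. Overhead value is an
   assumption, not a measurement. */
static double transfer_us(int n_copies, double bytes_each,
                          double overhead_us, double gb_per_s)
{
    double stream_us = bytes_each / (gb_per_s * 1e3); /* 1 GB/s = 1e3 bytes/us */
    return n_copies * (overhead_us + stream_us);
}
```

With these numbers, one 16 MB copy takes about 2010 µs while sixteen 1 MB copies take about 2160 µs; the gap grows as the pieces get smaller, since the fixed overhead is paid once per copy.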
Page-Locked Data Transfers
cudaMallocHost() allows allocation of page-locked (“pinned”) host memory
Enables highest cudaMemcpy performance: 3.2 GB/s on PCI-e x16 Gen1, 5.2 GB/s on PCI-e x16 Gen2
See the “bandwidthTest” CUDA SDK sample
Use with caution! Allocating too much page-locked memory can reduce overall system performance
Test your systems and apps to learn their limits
Overlapping Data Transfers and Computation
Async and Stream APIs allow overlap of H2D or D2H data transfers with computation
CPU computation can overlap data transfers on all CUDA capable devices
Kernel computation can overlap data transfers on devices with “Concurrent copy and execution” (roughly compute capability >= 1.1)
Stream = sequence of operations that execute in order on GPU
Operations from different streams can be interleaved
Stream ID used as argument to async calls and kernel launches
Coalescing
Global memory access of 32-, 64-, or 128-bit words by a half-warp of threads (Fermi: a full warp of threads) can result in as few as one (or two) transactions if certain access requirements are met
[Diagram: float (32-bit) data example, a half-warp of threads accessing global memory laid out in 32-byte, 64-byte and 128-byte segments.]
Coalescing: compute capability 1.2 and higher
Issues transactions for segments of 32B, 64B, and 128B
Smaller transactions used to avoid wasted bandwidth
[Diagram examples: 1 transaction (64B segment); 2 transactions (64B and 32B segments); 1 transaction (128B segment).]
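A simplified model of these rules can be run on the host: count the distinct aligned segments a half-warp touches. This sketch only counts 64B segments; real hardware additionally narrows transactions to 32B or widens to 128B where appropriate:

```c
#include <assert.h>

/* Count distinct aligned 64-byte segments touched by a half-warp of
   16 four-byte accesses -- a simplified stand-in for the transaction
   count on compute capability 1.2+ hardware. */
static int segments_touched(const unsigned addr[16])
{
    unsigned segs[16];
    int count = 0;
    for (int i = 0; i < 16; ++i) {
        unsigned seg = addr[i] / 64;   /* aligned 64B segment index */
        int seen = 0;
        for (int j = 0; j < count; ++j)
            if (segs[j] == seg) seen = 1;
        if (!seen) segs[count++] = seg;
    }
    return count;
}
```

Sequential aligned accesses land in one segment; a misaligned start crosses a boundary and costs two; a 64-byte stride touches a separate segment per thread.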
OPTIMIZATION 2:
EXECUTION CONFIG
Occupancy
Thread instructions are executed sequentially, so executing
other warps is the only way to hide latencies and keep the
hardware busy
Occupancy = Number of warps running concurrently on a
multiprocessor divided by maximum number of warps that can
run concurrently
Limited by resource usage:
Registers
Shared memory
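The resource-limiting arithmetic can be sketched on the host. The per-SM limits used below (16384 registers, 16 KB shared memory, 32 warps) are roughly GT200 figures chosen for illustration; the real occupancy calculator also accounts for allocation granularity and the per-SM block limit, which this sketch ignores:

```c
#include <assert.h>

/* Warps that can be resident on one SM, limited by register file and
   shared memory. Allocation granularity and the max-blocks-per-SM
   limit are deliberately ignored in this simplified model. */
static int active_warps(int threads_per_block, int regs_per_thread,
                        int smem_per_block, int sm_regs, int sm_smem,
                        int sm_max_warps)
{
    int warps_per_block = (threads_per_block + 31) / 32;
    int by_regs = sm_regs / (regs_per_thread * threads_per_block);
    int by_smem = smem_per_block > 0 ? sm_smem / smem_per_block : by_regs;
    int blocks  = by_regs < by_smem ? by_regs : by_smem;
    int warps   = blocks * warps_per_block;
    return warps < sm_max_warps ? warps : sm_max_warps;
}
```

Occupancy is then active_warps divided by sm_max_warps; doubling register use per thread halves the resident blocks, and a large shared memory footprint can become the limiter instead.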
Blocks per Grid Heuristics
# of blocks > # of multiprocessors
So all multiprocessors have at least one block to execute
# of blocks / # of multiprocessors > 2
Multiple blocks can run concurrently in a multiprocessor
Blocks that aren’t waiting at a __syncthreads() keep the hardware busy
Subject to resource availability – registers, shared memory
# of blocks > 100 to scale to future devices
Blocks executed in pipeline fashion
1000 blocks per grid will scale across multiple generations
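These heuristics are easy to check when sizing a grid; the ceiling division below is the same arithmetic as the (n + 255) / 256 in the SAXPY launch earlier:

```c
#include <assert.h>

/* Blocks needed to cover n elements at one thread per element,
   rounded up to whole blocks. */
static int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For example, a million elements at 256 threads per block yields 3907 blocks, comfortably past both the "more than 2 per multiprocessor" and the "more than 100 blocks" heuristics.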
Register Pressure
Hide latency by using more threads per multiprocessor
Limiting factors:
Number of registers per kernel
8K/16K registers per multiprocessor, partitioned among concurrent threads
Amount of shared memory
16KB per multiprocessor, partitioned among concurrent thread blocks
Compile with the --ptxas-options=-v flag to see per-kernel usage
Use the --maxrregcount=N flag to NVCC
N = desired maximum registers / kernel
At some point “spilling” into local memory may occur
Spilling reduces performance, since local memory is slow
Occupancy Calculator
Optimizing threads per block
Choose threads per block as a multiple of warp size
Avoid wasting computation on under-populated warps
Facilitates coalescing
Want to run as many warps as possible per multiprocessor (hide
latency)
Multiprocessor can run up to 8 blocks at a time
Heuristics
Minimum: 64 threads per block
Only if multiple concurrent blocks
192 or 256 threads a better choice
Usually still enough regs to compile and invoke successfully
This all depends on your computation, so experiment!
Occupancy != Performance
Increasing occupancy does not necessarily increase
performance
BUT …
Low-occupancy multiprocessors cannot adequately hide latency
on memory-bound kernels
(It all comes down to arithmetic intensity and available parallelism)
OPTIMIZATION 3:
MATH FUNCS & BRANCHING
Runtime Math Library
There are two types of runtime math operations in single
precision
__funcf(): direct mapping to hardware ISA
Fast but lower accuracy (see prog. guide for details)
Examples: __sinf(x), __expf(x), __powf(x,y)
funcf() : compile to multiple instructions
Slower but higher accuracy (5 ulp or less)
Examples: sinf(x), expf(x), powf(x,y)
The -use_fast_math compiler option forces every funcf() to
compile to __funcf()
Control Flow Instructions
Main performance concern with branching is divergence
Threads within a single warp take different paths
Different execution paths must be serialized
Avoid divergence when branch condition is a function of thread
ID
Example with divergence:
if (threadIdx.x > 2) { }
Branch granularity < warp size
Example without divergence:
if (threadIdx.x / WARP_SIZE > 2) { }
Branch granularity is a whole multiple of warp size
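The two examples can be modelled by counting the distinct branch outcomes inside one warp; each distinct outcome is one serialized pass. This is a sketch of a single if/else only, ignoring nested branches:

```c
#include <assert.h>

#define WARP_SIZE 32

/* Number of serialized passes for a single if/else inside one warp:
   1 if all 32 threads agree on the condition, 2 if the warp diverges. */
static int paths_in_warp(const int taken[WARP_SIZE])
{
    int any_true = 0, any_false = 0;
    for (int i = 0; i < WARP_SIZE; ++i) {
        if (taken[i]) any_true = 1; else any_false = 1;
    }
    return any_true + any_false;
}
```

Evaluating threadIdx.x > 2 for warp 0 gives three false threads and twenty-nine true ones (two passes), while threadIdx.x / WARP_SIZE > 2 is uniform across every warp (one pass).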
OPTIMIZATION 4: SHARED MEMORY
Shared Memory
~Hundred times faster than global memory
Cache data to reduce global memory accesses
Threads can cooperate via shared memory
Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-coalesceable addressing
Shared Memory Architecture
Many threads accessing memory
Therefore, memory is divided into banks
Successive 32-bit words are assigned to successive banks
Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized
[Diagram: shared memory split across Bank 0 … Bank 15.]
Bank Addressing Examples
No bank conflicts: linear addressing, stride == 1
No bank conflicts: random 1:1 permutation
[Diagram: Thread 0 … Thread 15 each mapped to a distinct bank among Bank 0 … Bank 15.]
Bank Addressing Examples
2-way bank conflicts: linear addressing, stride == 2
8-way bank conflicts: linear addressing, stride == 8
[Diagram: with stride 2, pairs of threads land on the same bank; with stride 8, eight threads land on each of banks 0 and 8.]
Shared memory bank conflicts
Shared memory is ~ as fast as registers if there are no bank conflicts
warp_serialize profiler signal reflects conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict
(broadcast)
The slow case:
Bank Conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank
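That cost rule can be written down directly: for each half-warp, the conflict degree is the largest number of threads landing in one bank. The broadcast exception for identical addresses is deliberately ignored in this sketch:

```c
#include <assert.h>

/* Max number of half-warp threads hitting one bank; 16 banks of
   32-bit words assumed (pre-Fermi). word_index[i] is the shared
   memory word accessed by thread i. Broadcast is not modelled. */
static int conflict_degree(const unsigned word_index[16])
{
    int hits[16] = {0};
    int worst = 0;
    for (int i = 0; i < 16; ++i) {
        int b = (int)(word_index[i] % 16);   /* bank of a 32-bit word */
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

This reproduces the earlier slides: stride 1 is conflict-free (degree 1), stride 2 gives 2-way conflicts, stride 8 gives 8-way conflicts.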
Shared Memory Example: Transpose
Each thread block works on a tile of the matrix
Naïve implementation exhibits strided access to global memory
[Diagram: elements transposed from idata to odata by a half-warp of threads.]
Naïve Transpose
Loads are coalesced, stores are not (strided by height)

__global__ void transposeNaive(float *odata, float *idata,
                               int width, int height)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in  = xIndex + width * yIndex;
    int index_out = yIndex + height * xIndex;
    odata[index_out] = idata[index_in];
}
Coalescing through shared memory
Access columns of a tile in shared memory to write contiguous data to global memory
Requires __syncthreads() since threads access data in shared memory stored by other threads
[Diagram: idata → tile → odata, elements transposed by a half-warp of threads.]
Coalescing through shared memory

__global__ void transposeCoalesced(float *odata, float *idata,
                                   int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;
    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;
    tile[threadIdx.y][threadIdx.x] = idata[index_in];
    __syncthreads();
    odata[index_out] = tile[threadIdx.x][threadIdx.y];
}
Bank Conflicts in Transpose
16x16 shared memory tile of floats
Data in columns are in the same bank: 16-way bank conflict reading columns in tile
Solution: pad the shared memory array
__shared__ float tile[TILE_DIM][TILE_DIM+1];
After padding, data in anti-diagonals are in the same bank, and column reads are conflict-free
[Diagram: idata → tile → odata, elements transposed by a half-warp of threads.]
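The effect of the +1 padding is pure index arithmetic: with a row pitch of 16 words, every element of a column lands in the same bank, while a pitch of 17 spreads a column across all 16 banks. The 16-bank, 32-bit-word layout below is the pre-Fermi configuration:

```c
#include <assert.h>

/* Bank of tile[row][col] for a row-major float tile with the given
   row pitch in 32-bit words; 16 banks assumed (pre-Fermi). */
static int bank_of(int row, int col, int pitch)
{
    return (row * pitch + col) % 16;
}
```

Since (row * 17) % 16 == row % 16, the padded tile maps each row of a column to a different bank, removing the 16-way conflict at the cost of one unused word per row.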
FERMI: NEW ARCHITECTURE
Fermi: The Computational GPU
Disclaimer: specifications subject to change
Performance: 13x the double precision performance of CPUs; IEEE 754-2008 SP & DP floating point
Flexibility: shared memory increased from 16 KB to 64 KB; added L1 and L2 caches; ECC on all internal and external memories; support for up to 1 TB of GPU memory; high-speed GDDR5 memory interface
Usability: multiple simultaneous tasks on the GPU; 10x faster atomic operations; C++ support; system calls and printf support
[Diagram: Fermi die with DRAM interfaces, host interface, GigaThread scheduler and shared L2.]
Availability: Q2 2010
Fermi
Memory operations are done per warp (32 threads) instead of per half-warp
Global memory, shared memory
Shared memory:
16 or 48 KB (configurable)
Now 32 banks, each 32 bits wide
No bank conflicts when accessing 8-byte words
L1 cache per multiprocessor
Should help with misaligned access, strided access, some register spilling
Much improved dual-issue:
Can dual-issue fp32 pairs, fp32-mem, fp64-mem, etc.
IEEE-conformant rounding
Uniform 64-bit address space
L1 cache
For all memory operations (global memory, shared memory)
Shares 64 KB with shared memory: the split can be switched between 16 and 48 KB (CUDA API call)
Caches global memory reads only
Particularly beneficial when the compiler detects that all threads load the same value
One L1 cache per multiprocessor
NOT coherent! Use volatile for global memory accesses if threads on other SMs may change the location (but consider whether this is really needed: not all blocks run concurrently, so spinning on such a flag risks deadlock)
Also caches local memory reads and writes
This improves register-spilling behaviour
(Coherence is no problem here, since local memory is SM-private)
Fermi has a 64-bit address space
But only 32-bit registers, so in unfortunate cases register allocation adds unnecessary overhead on Fermi (e.g. on a C2050 with 3 GB, where 32-bit addressing would suffice)
Driver API: compile kernels in 32-bit mode; they can be loaded by a 64-bit app
Runtime API (CUDART): compile the application in 32-bit mode (nvcc -m32), which also produces GPU code in 32-bit
Use the new __launch_bounds__() intrinsic to help the compiler optimize register usage
__umul24 not optimal on Fermi
On the Tesla C1060 / GT200 architecture, bounded integer multiplications could be accelerated with __umul24(a, b) instead of a * b, e.g.
unsigned int tid = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
On Fermi, __umul24() is emulated, and thus slower than a * b
HPC and IEEE conformance
Default settings for computation on the GPU are now more conservative (for HPC):
Denormal support, IEEE-conformant division and square root
Accuracy over speed
If your app runs faster on Fermi with -arch=sm_13 than with -arch=sm_20, then the PTX JIT has used the “old” Tesla C1060 settings, which favor speed: flush-to-zero instead of denormals, no IEEE-precise division, no IEEE-precise square root
For similar results with -arch=sm_20, use:
-ftz=true -prec-div=false -prec-sqrt=false
See the NVIDIA CUDA Programming Guide, sections 5.4.1 and G.2, and The CUDA Compiler Driver NVCC, pp. 14-15
(Section 5.4.1 also contains information on instruction timings)
CONCLUSION, QUESTIONS
& GTC INVITE
GPU Technology Conference 2010
Monday, Sept. 20 - Thursday, Sept. 23, 2010
San Jose Convention Center, San Jose, California
The most important event in the GPU ecosystem:
Learn about seismic shifts in GPU computing
Preview disruptive technologies and emerging applications
Get tools and techniques to impact mission critical projects
Network with experts, colleagues, and peers across industries
Opportunities:
Call for Submissions: sessions & posters
Sponsors / Exhibitors: reach decision makers
“CEO on Stage” Showcase for Startups: tell your story to VCs and analysts
“I consider the GPU Technology Conference to be the single best place to see the amazing work enabled by the GPU. It’s a great venue for meeting researchers, developers, scientists, and entrepreneurs from around the world.”
-- Professor Hanspeter Pfister, Harvard University and GTC 2009 keynote speaker
Thank You
Questions?