[Harvard CS264] 05 - Advanced-level CUDA Programming


http://cs264.org


Lecture #5: Advanced CUDA | February 22nd, 2011

Nicolas Pinto (MIT, Harvard) pinto@mit.edu

Massively Parallel Computing (CS 264 / CSCI E-292)

Administrivia

• HW2: out, due Mon 3/14/11 (not Fri 3/11/11)

• Projects: think about it, consult the staff (*), proposals due ~ Fri 3/25/11

• Guest lectures:

• schedule coming soon

• on Fridays 7.35-9.35pm (March, April) ?

During this course, we’ll try to “borrow” and use existing material ;-) (adapted for CS264)

Today

Outline

1. Hardware Review

2. Memory/Communication Optimizations

3. Threading/Execution Optimizations

1. Hardware Review

8© NVIDIA Corporation 2008

10-Series Architecture

240 thread processors execute kernel threads

30 multiprocessors, each contains:
  8 thread processors
  One double-precision unit
  Shared memory (enables thread cooperation)

© 2008 NVIDIA Corporation.

Execution Model

Software → Hardware

Thread → Thread Processor: threads are executed by thread processors.

Thread Block → Multiprocessor: thread blocks are executed on multiprocessors. Thread blocks do not migrate. Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file).

Grid → Device: a kernel is launched as a grid of thread blocks. Only one kernel can execute on a device at one time.
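To make the mapping concrete, here is a minimal sketch (not from the slides; the kernel name and sizes are illustrative) of a kernel launched as a grid of thread blocks, with each thread deriving a global index from its block and thread IDs:

// Minimal sketch of the grid/block/thread hierarchy (illustrative only).
__global__ void scale(float *data, float alpha, int n)
{
    // One thread per element: global index from block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= alpha;
}

// Host side: a grid of thread blocks covers the whole array.
// dim3 block(256);
// dim3 grid((n + block.x - 1) / block.x);
// scale<<<grid, block>>>(d_data, 2.0f, n);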

Threading Hierarchy

10© NVIDIA Corporation 2008

Warps and Half Warps

A thread block consists of 32-thread warps.

A warp is executed physically in parallel (SIMD) on a multiprocessor.

A half-warp of 16 threads can coordinate global memory accesses into a single transaction.

11© NVIDIA Corporation 2008

Memory Architecture

[Figure: the host (CPU, chipset, DRAM) connects to the device; the GPU contains multiprocessors, each with registers and shared memory, plus constant and texture caches; device DRAM holds global, constant, texture, and local memory.]

© 2008 NVIDIA Corporation.

Kernel Memory Access

Per-thread:
  Registers (on-chip, fast)
  Local memory (off-chip, uncached)

Per-block:
  Shared memory (on-chip, small, fast)

Per-device:
  Global memory (off-chip, large, uncached; persistent across kernel launches; used for kernel I/O)


Global Memory

Per-device global memory:

• Off-chip, large

• Uncached

• Persistent across kernel launches

• Kernel I/O

• Different types of “global memory”

• Linear Memory

• Texture Memory

• Constant Memory

12© NVIDIA Corporation 2008

Memory Architecture

Memory   | Location | Cached | Access | Scope                  | Lifetime
Register | On-chip  | N/A    | R/W    | One thread             | Thread
Local    | Off-chip | No     | R/W    | One thread             | Thread
Shared   | On-chip  | N/A    | R/W    | All threads in a block | Block
Global   | Off-chip | No     | R/W    | All threads + host     | Application
Constant | Off-chip | Yes    | R      | All threads + host     | Application
Texture  | Off-chip | Yes    | R      | All threads + host     | Application
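As a quick sketch of where these spaces show up in CUDA C (variable and kernel names are illustrative assumptions; a block of 256 threads is assumed):

// Sketch: the different memory spaces as they appear in CUDA C.
__constant__ float c_coeff[16];       // constant memory: cached, read-only from kernels

__global__ void spaces_demo(const float *g_in, float *g_out)   // g_in/g_out are in global memory
{
    __shared__ float s_tile[256];     // shared memory: one copy per thread block, on-chip

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // per-thread scalars live in registers
    s_tile[threadIdx.x] = g_in[i];    // global -> shared (assumes blockDim.x == 256)
    __syncthreads();

    // Large or dynamically indexed per-thread arrays may spill to local memory,
    // which on this hardware is off-chip and uncached.
    g_out[i] = s_tile[threadIdx.x] * c_coeff[0];
}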

2. Memory/Communication Optimizations

2.1 Host/Device Transfer Optimizations

Review

Thread Blocks

• A thread block may have up to 512 threads
• All threads in a thread block are run on the same multiprocessor
  • Thus they can communicate via shared memory
  • And synchronize
• Threads of a block are multiplexed onto a multiprocessor as warps

PC Architecture

modified from Matthew Bolitho

[Figure: CPU, Northbridge, DRAM, Southbridge, SATA, Ethernet, and the graphics card (CUDA device), connected by the Front Side Bus, Memory Bus, PCI bus, and PCI-Express bus; bandwidths shown: 3+ Gb/s, 8 GB/s, 25+ GB/s, and 160+ GB/s to VRAM.]

• PCIe (PCI-Express) replaced AGP
  • Full duplex, serial, symmetric bus: 250 MB/s of bandwidth in each direction
  • Devices can use different configurations, e.g. PCI-E 16x = 16 lanes
  • 16 times the bandwidth (4 GB/s)
• The CUDA specification has been updated
  • Version 1.0: initial release
  • Version 1.1: update with newer hardware
  • Backwards compatible
  • Expected updates in the near future: Version 1.2 / 2.0
    • 64-bit floating point support (i.e. double)
• Version 1.1 added some important useful features:
  • Software: asynchronous memory copies, asynchronous GPU program launch
  • Hardware: atomic memory instructions

Review

The PCI-“not-so”-e Bus

• PCIe bus is slow

• Try to minimize/group transfers

• Use pinned memory on host whenever possible

• Try to perform copies asynchronously (e.g. Streams)

• Use “Zero-Copy” when appropriate

• Examples in the SDK (e.g. bandwidthTest)
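A minimal sketch of the pinned-memory and asynchronous-copy advice above (illustrative names and sizes; error checking omitted):

// Sketch: pinned host memory + asynchronous transfer on a stream.
float *h_buf, *d_buf;
size_t bytes = N * sizeof(float);              // N assumed defined elsewhere

cudaMallocHost((void**)&h_buf, bytes);         // pinned (page-locked) host memory
cudaMalloc((void**)&d_buf, bytes);

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);  // returns immediately
// kernel<<<grid, block, 0, stream>>>(d_buf);  // work queued on the same stream runs after the copy
cudaStreamSynchronize(stream);                 // wait for the copy (and kernel) to finish

cudaStreamDestroy(stream);
cudaFreeHost(h_buf);
cudaFree(d_buf);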

Review

2.2 Device Memory Optimizations

Definitions

• gmem: global memory

• smem: shared memory

• tmem: texture memory

• cmem: constant memory

• bmem: binary code (cubin) memory ?!? (covered next week)

Performance Analysis: e.g. Matrix Transpose

22© NVIDIA Corporation 2008

Matrix Transpose

Transpose 2048x2048 matrix of floats

Performed out-of-place

Separate input and output matrices

Use tile of 32x32 elements, block of 32x8 threads

Each thread processes 4 matrix elements

In general, tile and block size are fair game for optimization

Process

Get the right answer

Measure effective bandwidth (relative to theoretical or reference case)

Address global memory coalescing, shared memory bank conflicts, and partition camping while repeating the above steps

23© NVIDIA Corporation 2008

Theoretical Bandwidth

Device Bandwidth of GTX 280

1107 * 10^6 (memory clock, Hz) * (512 / 8) (memory interface, bytes) * 2 (DDR) / 1024^3 = 131.9 GB/s

Specs report 141 GB/s
  They use the 10^9 B/GB conversion rather than 1024^3
  Whichever you use, be consistent

24© NVIDIA Corporation 2008

Effective Bandwidth

Transpose Effective Bandwidth

2048^2 (matrix size) * 4 B/element * 2 (read and write) / 1024^3 / (time in secs) = GB/s

Reference Case - Matrix Copy

Transpose operates on tiles - need a better comparison than raw device bandwidth

Look at the effective bandwidth of a copy that uses tiles
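One way this timing is commonly done in practice is with CUDA events; the sketch below (illustrative, not from the slides) converts the measured time into effective bandwidth for the 2048x2048 case:

// Sketch: time a kernel with CUDA events and convert to effective bandwidth.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// copy<<<grid, block>>>(d_odata, d_idata, width, height);   // kernel under test
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);        // elapsed time in milliseconds

double bytes = 2048.0 * 2048.0 * 4.0 * 2.0;    // read + write
double gbps  = bytes / (1024.0 * 1024.0 * 1024.0) / (ms / 1000.0);

cudaEventDestroy(start);
cudaEventDestroy(stop);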

25© NVIDIA Corporation 2008

Matrix Copy Kernel

__global__ void copy(float *odata, float *idata, int width, int height)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index  = xIndex + width * yIndex;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
        odata[index + i*width] = idata[index + i*width];
    }
}

TILE_DIM = 32, BLOCK_ROWS = 8
32x32 tile, 32x8 thread block

idata and odata in global memory

Elements copied by a half-warp of threads

26© NVIDIA Corporation 2008

Matrix Copy Kernel Timing

Measure elapsed time over loop

Looping/timing done in two ways:

Over kernel launches (nreps = 1)

Includes launch/indexing overhead

Within the kernel over loads/stores (nreps > 1)

Amortizes launch/indexing overhead

__global__ void copy(float *odata, float *idata, int width, int height, int nreps)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index  = xIndex + width * yIndex;

    for (int r = 0; r < nreps; r++) {
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index + i*width] = idata[index + i*width];
        }
    }
}

27© NVIDIA Corporation 2008

Naïve Transpose

Similar to copy

Input and output matrices have different indices

__global__ void transposeNaive(float *odata, float *idata, int width, int height, int nreps)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

    int index_in  = xIndex + width  * yIndex;
    int index_out = yIndex + height * xIndex;

    for (int r = 0; r < nreps; r++) {
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index_out + i] = idata[index_in + i*width];
        }
    }
}

28© NVIDIA Corporation 2008

Effective Bandwidth

Effective Bandwidth (GB/s), 2048x2048, GTX 280

                 Loop over kernel | Loop in kernel
Simple Copy      96.9             | 81.6
Naïve Transpose  2.2              | 2.2

gmem coalescing

Memory Coalescing

GPU memory controller granularity is 64 or 128 bytes

Must also be 64 or 128 byte aligned

Suppose thread loads a float (4 bytes)

Controller loads 64 bytes, throws 60 bytes away

Memory Coalescing

Memory controller actually more intelligent

Consider half-warp (16 threads)

Suppose each thread reads consecutive float

Memory controller will perform one 64 byte load

This is known as coalescing

Make threads read consecutive locations
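To make the contrast concrete, here is a small sketch (not from the slides) of a coalesced read next to a strided one that defeats coalescing:

// Sketch: coalesced vs. non-coalesced global memory reads.
__global__ void coalesced_read(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];          // consecutive threads read consecutive floats: one transaction per half-warp
}

__global__ void strided_read(const float *in, float *out, int n, int stride)
{
    // Assumes in[] holds at least n*stride elements.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride]; // threads touch addresses far apart: up to one transaction per thread
}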

30© NVIDIA Corporation 2008

Coalescing

[Figure: a half-warp of threads accessing global memory within a 64B aligned segment (16 floats) or a 128B aligned segment (32 floats).]

Global memory access of 32, 64, or 128-bit words by a half-warp of threads can result in as few as one (or two) transaction(s) if certain access requirements are met.

Depends on compute capability: 1.0 and 1.1 have stricter access requirements.

Examples use float (32-bit) data.

31© NVIDIA Corporation 2008

CoalescingCompute capability 1.0 and 1.1

k-th thread must access the k-th word in the segment (or the k-th word in 2 contiguous 128B segments for 128-bit words); not all threads need to participate.

Coalesces: 1 transaction

Out of sequence: 16 transactions. Misaligned: 16 transactions.

Memory Coalescing

GT200 has hardware coalescer

Inspects memory requests from each half-warp

Determines minimum set of transactions which are

64 or 128 bytes long

64 or 128 byte aligned

32© NVIDIA Corporation 2008

CoalescingCompute capability 1.2 and higher

1 transaction - 64B segment

2 transactions - 64B and 32B segments
1 transaction - 128B segment

Coalescing is achieved for any pattern of addresses that fits into a segment of size: 32B for 8-bit words, 64B for 16-bit words, 128B for 32- and 64-bit words.

Smaller transactions may be issued to avoid wasted bandwidth due to unused words.

(e.g. GT200 like the C1060)

© NVIDIA Corporation 2010

CoalescingCompute capability 2.0 (Fermi, Tesla C2050)

32 transactions: 32 x 32B segments, instead of 32 x 128B segments.

2 transactions: 2 x 128B segments; the next warp probably needs only 1 extra transaction, due to the L1 cache.

Memory transactions are handled per warp (32 threads).
L1 cache ON: always issues 128B segment transactions and caches them in the 16kB or 48kB L1 cache per multiprocessor.

L1 cache OFF: always issues 32B segment transactions. E.g. an advantage for widely scattered thread accesses.

Coalescing Summary

Coalescing dramatically speeds global memory access

Strive for perfect coalescing:

Align starting address (may require padding)

A warp should access within a contiguous region

33© NVIDIA Corporation 2008

Coalescing in Transpose

Naïve transpose coalesces reads, but not writes

idata odata

Elements transposed by a half-warp of threads

Q: How to coalesce writes ?

smem as a cache

Shared Memory

SMs can access gmem at 80+ GiB/sec

but have hundreds of cycles of latency

Each SM has 16 kiB ‘shared’ memory

Essentially user-managed cache

Speed comparable to registers

Accessible to all threads in a block

Reduces load/stores to device memory

34© NVIDIA Corporation 2008

Shared Memory

~Hundred times faster than global memory

Cache data to reduce global memory accesses

Threads can cooperate via shared memory

Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing


© 2008 NVIDIA Corporation

A Common Programming Strategy

• Partition data into subsets that fit into shared memory

• Handle each data subset with one thread block

• Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism

• Perform the computation on the subset from shared memory

• Copy the result from shared memory back to global memory
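A generic sketch of that five-step pattern, using an assumed 1D tile of 256 elements (names and the trivial "computation" are illustrative):

// Sketch: load a tile into shared memory, compute on it, write the result back.
#define TILE 256    // block size is assumed to equal TILE

__global__ void process_tiles(const float *g_in, float *g_out, int n)
{
    __shared__ float s_tile[TILE];                 // one subset per thread block

    int i = blockIdx.x * TILE + threadIdx.x;

    if (i < n)
        s_tile[threadIdx.x] = g_in[i];             // 1) cooperative load: global -> shared
    __syncthreads();                               // make the whole tile visible to the block

    float v = 0.0f;
    if (i < n)
        v = 2.0f * s_tile[threadIdx.x];            // 2) compute on the shared-memory copy
    __syncthreads();

    if (i < n)
        g_out[i] = v;                              // 3) write results back to global memory
}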

35© NVIDIA Corporation 2008

Coalescing through shared memory

Access columns of a tile in shared memory to write contiguous data to global memory.

Requires __syncthreads() since threads write data read by other threads.

Elements transposed by a half-warp of threads

idata odata

tile

36

__global__ void transposeCoalesced(float *odata, float *idata, int width, int height, int nreps)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;

    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;

    for (int r = 0; r < nreps; r++) {
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i*width];
        }

        __syncthreads();

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index_out + i*height] = tile[threadIdx.x][threadIdx.y + i];
        }
    }
}

© NVIDIA Corporation 2008

Coalescing through shared memory

37© NVIDIA Corporation 2008

Effective Bandwidth

Effective Bandwidth (GB/s), 2048x2048, GTX 280

                     Loop over kernel | Loop in kernel
Simple Copy          96.9             | 81.6
Shared Memory Copy   80.9             | 81.1
Naïve Transpose      2.2              | 2.2
Coalesced Transpose  16.5             | 17.1

(The coalesced transpose uses a shared memory tile and __syncthreads().)

smem bank conflicts

39© NVIDIA Corporation 2008

Shared Memory Architecture

Many threads accessing memory: therefore, memory is divided into banks.

Successive 32-bit words are assigned to successive banks.

Each bank can service one address per cycle; a memory can service as many simultaneous accesses as it has banks.

Multiple simultaneous accesses to a bank result in a bank conflict.

Conflicting accesses are serialized.


Shared Memory Banks

Shared memory divided into 16 ‘banks’

Shared memory is (almost) as fast as registers (...)

Exception is in case of bank conflicts

[Figure: 32-bit words 0..31 laid out round-robin across banks 0..15 (4 bytes per bank), so word k and word k+16 fall in the same bank.]

40© NVIDIA Corporation 2008

Bank Addressing Examples

No Bank Conflicts: linear addressing, stride == 1

No Bank Conflicts: random 1:1 permutation

[Figure: in both cases, each of threads 0..15 maps to a distinct bank 0..15.]

41© NVIDIA Corporation 2008

Bank Addressing Examples

2-way Bank Conflicts: linear addressing, stride == 2

8-way Bank Conflicts: linear addressing, stride == 8

[Figure: with stride 2, pairs of threads land on the same bank; with stride 8, eight threads (x8) land on each of two banks.]

42© NVIDIA Corporation 2008

Shared memory bank conflicts

Shared memory is ~as fast as registers if there are no bank conflicts.

The warp_serialize profiler signal reflects conflicts.

The fast case:
  If all threads of a half-warp access different banks, there is no bank conflict.
  If all threads of a half-warp read the identical address, there is no bank conflict (broadcast).

The slow case:
  Bank conflict: multiple threads in the same half-warp access the same bank.
  The accesses must be serialized.
  Cost = max # of simultaneous accesses to a single bank.

43© NVIDIA Corporation 2008

Bank Conflicts in Transpose

32x32 shared memory tile of floats

Data in columns k and k+16 are in the same bank.

16-way bank conflict reading half columns in the tile.

Solution: pad the shared memory array
  __shared__ float tile[TILE_DIM][TILE_DIM+1];
Data in anti-diagonals are in the same bank.

Q: How to avoid bank conflicts?


Illustration: Shared Memory, Avoiding Bank Conflicts
• 32x32 SMEM array
• Warp accesses a column:
  • 32-way bank conflicts (threads in a warp access the same bank)

Illustration: Shared Memory, Avoiding Bank Conflicts
• Add a column for padding: 32x33 SMEM array
• Warp accesses a column:
  • 32 different banks, no bank conflicts

[Figure: warps 0..31 vs. banks 0..31, without and with the padding column.]

© NVIDIA 2010

44© NVIDIA Corporation 2008

Effective Bandwidth

Effective Bandwidth (GB/s), 2048x2048, GTX 280

                              Loop over kernel | Loop in kernel
Simple Copy                   96.9             | 81.6
Shared Memory Copy            80.9             | 81.1
Naïve Transpose               2.2              | 2.2
Coalesced Transpose           16.5             | 17.1
Bank Conflict Free Transpose  16.6             | 17.2

Need a pause?

Unrelated: Thatcher Illusion

gmem partition camping

46© NVIDIA Corporation 2008

Partition Camping

Global memory accesses go through partitions.

6 partitions on 8-series GPUs, 8 partitions on 10-series GPUs.

Successive 256-byte regions of global memory are assigned to successive partitions.

For best performance:
  Simultaneous global memory accesses GPU-wide should be distributed evenly amongst partitions.

Partition camping occurs when global memory accesses at an instant use only a subset of partitions.
  Directly analogous to shared memory bank conflicts, but on a larger scale.

47© NVIDIA Corporation 2008

Partition Camping in Transpose

[Figure: tiles in the matrices, colors = partitions. idata tiles are numbered row by row (0 1 2 3 4 5, 64 65 66 ..., 128 129 130 ...), while the corresponding odata tiles run down columns (0 64 128, 1 65 129, ...).]

blockId = gridDim.x * blockIdx.y + blockIdx.x

Partition width = 256 bytes = 64 floats
  Twice the width of a tile

On GTX 280 (8 partitions), data 2KB apart map to the same partition.

2048 floats divide evenly by 2KB => columns of the matrices map to the same partition.

48© NVIDIA Corporation 2008

Partition Camping Solutions

blockId = gridDim.x * blockIdx.y + blockIdx.x

Pad matrices (by two tiles)
  In general might be expensive/prohibitive memory-wise

Diagonally reorder blocks
  Interpret blockIdx.y as different diagonal slices and blockIdx.x as distance along a diagonal

[Figure: with diagonal block ordering, the idata and odata tiles accessed at a given instant spread across different partitions.]

49

__global__ void transposeDiagonal(float *odata, float *idata, int width, int height, int nreps)
{
    __shared__ float tile[TILE_DIM][TILE_DIM+1];

    int blockIdx_y = blockIdx.x;
    int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;

    int xIndex = blockIdx_x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx_y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;

    xIndex = blockIdx_y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx_x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;

    for (int r = 0; r < nreps; r++) {
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i*width];
        }
        __syncthreads();
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index_out + i*height] = tile[threadIdx.x][threadIdx.y + i];
        }
    }
}

© NVIDIA Corporation 2008

Diagonal Transpose

Add lines to map diagonal to Cartesian coordinates.

Replace blockIdx.x with blockIdx_x, and blockIdx.y with blockIdx_y.

50

if (width == height) {
    blockIdx_y = blockIdx.x;
    blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;
} else {
    int bid = blockIdx.x + gridDim.x * blockIdx.y;
    blockIdx_y = bid % gridDim.y;
    blockIdx_x = ((bid / gridDim.y) + blockIdx_y) % gridDim.x;
}

© NVIDIA Corporation 2008

Diagonal Transpose

Previous slide for square matrices (width == height)

More generally:

51© NVIDIA Corporation 2008

Effective Bandwidth

Effective Bandwidth (GB/s), 2048x2048, GTX 280

                              Loop over kernel | Loop in kernel
Simple Copy                   96.9             | 81.6
Shared Memory Copy            80.9             | 81.1
Naïve Transpose               2.2              | 2.2
Coalesced Transpose           16.5             | 17.1
Bank Conflict Free Transpose  16.6             | 17.2
Diagonal                      69.5             | 78.3

52© NVIDIA Corporation 2008

Order of Optimizations

Larger optimization issues can mask smaller ones.

The proper order of some optimization techniques is not known a priori.
  E.g. partition camping is problem-size dependent.

Don't dismiss an optimization technique as ineffective until you know it was applied at the right time.

[Diagram: Naïve (2.2 GB/s) -> Coalescing (16.5 GB/s); fixing Bank Conflicts first gives 16.6 GB/s and then Partition Camping reaches 69.5 GB/s, while fixing Partition Camping first gives 48.8 GB/s and then Bank Conflicts reaches 69.5 GB/s.]

53© NVIDIA Corporation 2008

Transpose Summary

Coalescing and shared memory bank conflicts are small-scale phenomena:
  Deal with memory access within a half-warp.
  Problem-size independent.

Partition camping is a large-scale phenomenon:
  Deals with simultaneous memory accesses by warps on different multiprocessors.
  Problem-size dependent.
  Wouldn't be seen in a (2048+32)^2 matrix.

Coalescing is generally the most critical.

SDK Transpose Example: http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html

tmem

55© NVIDIA Corporation 2008

Textures in CUDA

Texture is an object for reading data.

Benefits:
  Data is cached (optimized for 2D locality)
    Helpful when coalescing is a problem
  Filtering
    Linear / bilinear / trilinear
    Dedicated hardware
  Wrap modes (for “out-of-bounds” addresses)
    Clamp to edge / repeat
  Addressable in 1D, 2D, or 3D
    Using integer or normalized coordinates

Usage:
  CPU code binds data to a texture object
  Kernel reads data by calling a fetch function

Other goodies:
  Optional “format conversion”
  • {char, short, int, half (16-bit)} to float (32-bit)
  • “for free”
  • useful for *mem compression (see later)

56© NVIDIA Corporation 2008

Texture Addressing

Wrap: an out-of-bounds coordinate is wrapped (modulo arithmetic).

Clamp: an out-of-bounds coordinate is replaced with the closest boundary.

[Figure: a 5x4 texture addressed at (5.5, 1.5) and (2.5, 0.5)/(1.0, 1.0); with Wrap the coordinate wraps around, with Clamp it sticks to the edge.]

57© NVIDIA Corporation 2008

Two CUDA Texture Types

Bound to linear memory:
  Global memory address is bound to a texture
  Only 1D
  Integer addressing
  No filtering, no addressing modes

Bound to CUDA arrays:
  CUDA array is bound to a texture
  1D, 2D, or 3D
  Float addressing (size-based or normalized)
  Filtering
  Addressing modes (clamping, repeat)

Both:
  Return either element type or normalized float

58© NVIDIA Corporation 2008

CUDA Texturing Steps

Host (CPU) code:
  Allocate/obtain memory (global linear, or CUDA array)
  Create a texture reference object
    Currently must be at file scope
  Bind the texture reference to memory/array
  When done: unbind the texture reference, free resources

Device (kernel) code:
  Fetch using the texture reference
  Linear memory textures: tex1Dfetch()
  Array textures: tex1D(), tex2D(), or tex3D()
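A minimal sketch of those steps using the texture reference API for linear memory (names are illustrative):

// Sketch: bind a linear-memory buffer to a 1D texture reference and fetch from it.
texture<float, 1, cudaReadModeElementType> texIn;   // file-scope texture reference

__global__ void read_through_tex(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texIn, i);              // cached, read-only fetch
}

// Host side:
// cudaMalloc((void**)&d_in, n * sizeof(float));
// cudaBindTexture(0, texIn, d_in, n * sizeof(float));
// read_through_tex<<<grid, block>>>(d_out, n);
// cudaUnbindTexture(texIn);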

cmem

Constant Memory
• Ideal for coefficients and other data that is read uniformly by warps
• Data is stored in global memory, read through a constant-cache
  • __constant__ qualifier in declarations
  • Can only be read by GPU kernels
  • Limited to 64KB
• Fermi adds uniform accesses:
  • Kernel pointer argument qualified with const
  • Compiler must determine that all threads in a threadblock will dereference the same address
  • No limit on array size, can use any global memory pointer
• Constant cache throughput:
  • 32 bits per warp per 2 clocks per multiprocessor
  • To be used when all threads in a warp read the same address
  • Serializes otherwise

__global__ void kernel( const float *g_a )
{
    ...
    float x = g_a[15];              // uniform
    float y = g_a[blockIdx.x + 5];  // uniform
    float z = g_a[threadIdx.x];     // non-uniform
    ...
}

© NVIDIA 2010

Constant Memory
• Kernel executes 10K threads (300 warps) per SM during its lifetime
• All threads access the same 4B word
• Using GMEM:
  • Each warp fetches 32B of bus traffic
  • Caching loads potentially worse: the 128B line is very likely to be evicted multiple times
• Using constant/uniform access:
  • First warp fetches 32 bytes
  • All other warps hit in the constant cache: 32 bytes of bus traffic in total
  • Unlikely to be evicted over the kernel lifetime: other loads do not go through this cache

*mem compression

Optimizing with Compression
• When all else has been optimized and the kernel is limited by the number of bytes needed, consider compression
• Approaches:
  • Int: conversion between 8-, 16-, 32-bit integers is 1 instruction (64-bit requires a couple)
  • FP: conversion between fp16, fp32, fp64 is one instruction
    • fp16 (half) is storage only, no math instructions
  • Range-based:
    • Lower and upper limits are kernel arguments
    • Data is an index for interpolation
• Application in practice:
  • Clark et al., "Solving Lattice QCD systems of equations using mixed precision solvers on GPUs"
  • http://arxiv.org/abs/0911.3191

© NVIDIA 2010

34

Accelerating GPU computation through

mixed-precision methods

Michael Clark

Harvard-Smithsonian Center for Astrophysics

Harvard University

SC’10
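As a sketch of the range-based idea above (the 8-bit index width and all names are illustrative assumptions, not from the slides):

// Sketch: store data as 8-bit indices, reconstruct floats from per-kernel [lo, hi] limits.
__global__ void decompress_and_scale(const unsigned char *g_idx, float *g_out,
                                     float lo, float hi, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Interpolate between the lower and upper limits passed as kernel arguments.
        float t = g_idx[i] * (1.0f / 255.0f);
        g_out[i] = lo + t * (hi - lo);
    }
}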

caching, bank conflicts, coalescing, partition camping, clamping, mixed precision, broadcasting, streams, zero-copy ... too much?

Parallel Programming is Hard (but you'll pick it up)

(you are not alone)

3. Threading/Execution Optimizations

3.1 Exec. Configuration Optimizations

60© NVIDIA Corporation 2008

Occupancy

Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy.

Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently.

Limited by resource usage:
  Registers
  Shared memory

61© NVIDIA Corporation 2008

Grid/Block Size Heuristics

# of blocks > # of multiprocessors
  So all multiprocessors have at least one block to execute.

# of blocks / # of multiprocessors > 2
  Multiple blocks can run concurrently on a multiprocessor.
  Blocks that aren't waiting at a __syncthreads() keep the hardware busy.
  Subject to resource availability: registers, shared memory.

# of blocks > 100 to scale to future devices
  Blocks are executed in pipeline fashion.
  1000 blocks per grid will scale across multiple generations.

62© NVIDIA Corporation 2008

Register Dependency

Read-after-write register dependency: an instruction's result can be read ~24 cycles later.

Scenarios (CUDA on the left, PTX on the right):

  x = y + 5;            add.f32  $f3, $f1, $f2
  z = x + 3;            add.f32  $f5, $f3, $f4

  s_data[0] += 3;       ld.shared.f32  $f3, [$r31+0]
                        add.f32        $f3, $f3, $f4

To completely hide the latency:
  Run at least 192 threads (6 warps) per multiprocessor.
  At least 25% occupancy (1.0/1.1), 18.75% (1.2/1.3).
  Threads do not have to belong to the same thread block.

63© NVIDIA Corporation 2008

Register Pressure

Hide latency by using more threads per SM.

Limiting factors:
  Number of registers per kernel
    8K/16K per SM, partitioned among concurrent threads
  Amount of shared memory
    16KB per SM, partitioned among concurrent thread blocks

Compile with the --ptxas-options=-v flag.

Use the --maxrregcount=N flag to NVCC
  N = desired maximum registers / kernel
  At some point "spilling" into local memory may occur.
  Reduces performance: local memory is slow.
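For example, a compile line combining both flags might look like this (the file name is a placeholder):

  nvcc --ptxas-options=-v --maxrregcount=32 transpose.cu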

64© NVIDIA Corporation 2008

Occupancy Calculator

65© NVIDIA Corporation 2008

Optimizing threads per block

Choose threads per block as a multiple of warp sizeAvoid wasting computation on under-populated warps

Want to run as many warps as possible permultiprocessor (hide latency)

Multiprocessor can run up to 8 blocks at a time

HeuristicsMinimum: 64 threads per block

Only if multiple concurrent blocks

192 or 256 threads a better choice

Usually still enough regs to compile and invoke successfully

This all depends on your computation, so experiment!

66© NVIDIA Corporation 2008

Occupancy != Performance

Increasing occupancy does not necessarily increase performance.

BUT ...

Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels.

(It all comes down to arithmetic intensity and available parallelism.)

"Better Performance at Lower Occupancy"
Vasily Volkov, UC Berkeley
September 22, 2010

GTC'10


3.2 Instruction Optimizations

69© NVIDIA Corporation 2008

CUDA Instruction Performance

Instruction cycles (per warp) = sum of:
  Operand read cycles
  Instruction execution cycles
  Result update cycles

Therefore instruction throughput depends on:
  Nominal instruction throughput
  Memory latency
  Memory bandwidth

"Cycle" refers to the multiprocessor clock rate (1.3 GHz on the Tesla C1060, for example).

70© NVIDIA Corporation 2008

Maximizing Instruction Throughput

Maximize use of high-bandwidth memory:
  Maximize use of shared memory
  Minimize accesses to global memory
  Maximize coalescing of global memory accesses

Optimize performance by overlapping memory accesses with HW computation:
  High arithmetic intensity programs (i.e. a high ratio of math to memory transactions)
  Many concurrent threads

71© NVIDIA Corporation 2008

Arithmetic Instruction Throughput

int and float add, shift, min, max and float mul, mad: 4 cycles per warp
  int multiply (*) is by default 32-bit
    requires multiple cycles / warp
  Use the __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply

Integer divide and modulo are more expensive:
  The compiler will convert literal power-of-2 divides to shifts
    But we have seen it miss some cases
  Be explicit in cases where the compiler can't tell that the divisor is a power of 2!
  Useful trick: foo % n == foo & (n-1) if n is a power of 2
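A small sketch of those two tricks (illustrative kernel; TILE is assumed to be a compile-time power of two):

// Sketch: 24-bit multiply intrinsic and power-of-2 modulo via a mask.
#define TILE 256   // power of 2 (assumed)

__global__ void index_tricks(int *out, int n)
{
    // __mul24 is a 4-cycle 24-bit multiply; fine while blockIdx.x * blockDim.x fits in 24 bits.
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (i < n)
        out[i] = i & (TILE - 1);   // same as i % TILE when TILE is a power of 2
}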

72© NVIDIA Corporation 2008

Runtime Math Library

There are two types of runtime math operations in single precision:

__funcf(): direct mapping to the hardware ISA
  Fast but lower accuracy (see the programming guide for details)
  Examples: __sinf(x), __expf(x), __powf(x,y)

funcf(): compiles to multiple instructions
  Slower but higher accuracy (5 ulp or less)
  Examples: sinf(x), expf(x), powf(x,y)

The -use_fast_math compiler option forces every funcf() to compile to __funcf().
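For instance (an illustrative sketch):

// Sketch: fast intrinsic vs. accurate library call for the same math.
__global__ void sines(const float *x, float *fast, float *accurate, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast[i]     = __sinf(x[i]);   // hardware intrinsic: fast, lower accuracy
        accurate[i] = sinf(x[i]);     // library version: slower, higher accuracy
    }
}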

73© NVIDIA Corporation 2008

GPU results may not match CPU

Many variables: hardware, compiler, optimization settings.

CPU operations aren't strictly limited to 0.5 ulp:
  Sequences of operations can be more accurate due to 80-bit extended precision ALUs.

Floating-point arithmetic is not associative!

74© NVIDIA Corporation 2008

FP Math is Not Associative!

In symbolic math, (x+y)+z == x+(y+z).

This is not necessarily true for floating-point addition:
  Try x = 10^30, y = -10^30 and z = 1 in the above equation.

When you parallelize computations, you potentially change the order of operations.

Parallel results may not exactly match sequential results.
  This is not specific to GPUs or CUDA; it is an inherent part of parallel execution.

75© NVIDIA Corporation 2008

Control Flow Instructions

The main performance concern with branching is divergence:
  Threads within a single warp take different paths.
  Different execution paths must be serialized.

Avoid divergence when the branch condition is a function of the thread ID.

Example with divergence:
  if (threadIdx.x > 2) { }
  Branch granularity < warp size.

Example without divergence:
  if (threadIdx.x / WARP_SIZE > 2) { }
  Branch granularity is a whole multiple of warp size.
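Wrapped into a complete, illustrative kernel, the divergence-free form looks like this:

// Sketch: branch on a whole-warp granularity so no warp diverges internally.
#define WARP_SIZE 32

__global__ void warp_branch(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x / WARP_SIZE > 2)   // all 32 threads of a warp take the same path
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}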

Scared ?

Howwwwww?! (do I start)

Scared ?

Profiler

Analysis with Profiler
• Profiler counters:
  • instructions_issued, instructions_executed
    • Both incremented by 1 per warp
    • "Issued" includes replays, "executed" does not
  • gld_request, gst_request
    • Incremented by 1 per warp for each load/store instruction
    • Instruction may be counted if it is "predicated out"
  • l1_global_load_miss, l1_global_load_hit, global_store_transaction
    • Incremented by 1 per L1 line (line is 128B)
  • uncached_global_load_transaction
    • Incremented by 1 per group of 1, 2, 3, or 4 transactions
• Compare:
  • 32 * instructions_issued            /* 32 = warp size */
  • 128B * (global_store_transaction + l1_global_load_miss)

© NVIDIA 2010

© NVIDIA Corporation 2010

CUDA Visual Profiler data for memory transfers

Memory transfer type and direction (D=Device, H=Host, A=cuArray)

e.g. H to D: Host to Device

Synchronous / Asynchronous

Memory transfer size, in bytes

Stream ID

© NVIDIA Corporation 2010

CUDA Visual Profiler data for kernels

© NVIDIA Corporation 2010

CUDA Visual Profiler computed data for kernels

Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate

Global memory read throughput (Gigabytes/second)

Global memory write throughput (Gigabytes/second)

Overall global memory access throughput (Gigabytes/second)

Global memory load efficiency

Global memory store efficiency

© NVIDIA Corporation 2010

CUDA Visual Profiler data analysis views:
  Summary table
  Kernel table
  Memcopy table
  Summary plot
  GPU Time Height plot
  GPU Time Width plot
  Profiler counter plot
  Profiler table column plot
  Multi-device plot
  Multi-stream plot

Analyze profiler counters

Analyze kernel occupancy

© NVIDIA Corporation 2010

CUDA Visual Profiler Misc.:
  Multiple sessions

Compare views for different sessions

Comparison Summary plot

Profiler projects save & load

Import/Export profiler data (.CSV format)

meh!!!! I don’t like to profile

Scared ?

Modified source code

Analysis with Modified Source Code
• Time memory-only and math-only versions of the kernel
  • Easier for codes that don't have data-dependent control-flow or addressing
  • Gives you good estimates for:
    • Time spent accessing memory
    • Time spent executing instructions
• Compare the times for modified kernels
  • Helps decide whether the kernel is mem or math bound
  • Shows how well memory operations are overlapped with arithmetic
    • Compare the sum of mem-only and math-only times to the full-kernel time

© NVIDIA 2010
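A sketch of what such modified kernels might look like for a simple saxpy-style operation (entirely illustrative, not from the slides):

// Sketch: full, memory-only, and math-only variants of the same kernel for timing.
__global__ void full_kernel(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];          // loads, math, and store
}

__global__ void mem_only(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i];                     // same memory traffic, essentially no math
}

__global__ void math_only(float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = a * (float)i + (float)i;          // same arithmetic, operands derived from the index
    // Never-taken store (a and i are non-negative in practice) keeps the compiler
    // from optimizing the math away while avoiding global loads/stores.
    if (v < 0.0f && i < n) y[i] = v;
}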

I want to believe...

Scared ?

Some Example Scenarios

[Figure: four bar charts comparing "mem", "math", and "full" times.]

Memory and latency bound:
  Poor mem-math overlap: latency is a problem.

Math-bound:
  Good mem-math overlap: latency not a problem
  (assuming instruction throughput is not low compared to HW theory).

Memory-bound:
  Good mem-math overlap: latency not a problem
  (assuming memory throughput is not low compared to HW theory).

Balanced:
  Good mem-math overlap: latency not a problem
  (assuming memory/instruction throughput is not low compared to HW theory).

© NVIDIA 2010

Memory bound ?

Math bound ?

Latency bound ?


Argn&%#$... too many optimizations !!!

67© NVIDIA Corporation 2008

Parameterize Your Application

Parameterization helps adaptation to different GPUs.

GPUs vary in many ways:
  # of multiprocessors
  Memory bandwidth
  Shared memory size
  Register file size
  Max. threads per block

You can even make apps self-tuning (like FFTW and ATLAS):
  "Experiment" mode discovers and saves the optimal configuration.
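One common way to parameterize a kernel is to make the block size a compile-time template parameter and benchmark several instantiations (a sketch with assumed names; BLOCK is assumed to be a power of two):

// Sketch: block size as a template parameter so several variants can be timed.
template <int BLOCK>
__global__ void reduce_partial(const float *in, float *out, int n)
{
    __shared__ float s[BLOCK];
    int i = blockIdx.x * BLOCK + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = s[0];
}

// A host-side "experiment" mode might time reduce_partial<64>, <128>, <256>, ...
// and save the best configuration for the current GPU.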

More?

• Next week: GPU “Scripting”, Meta-programming, Auto-tuning

• Thu 3/31/11: PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU)

• Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC)

• Tue 4/5/11: Analysis-driven Optimization (C. Woolley, NVIDIA)

• Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J. Owens, UC Davis)

• Thu 4/14/11: Optimization for Ninjas (D. Merrill, UVirginia)

• ...

iPhD: one more thing or two...

Life/Code Hacking #2.x: Speed {listen,read,writ}ing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Life/Code Hacking #2.2: Speed writing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Life/Code Hacking #2.2: Speed writing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

http://steve-yegge.blogspot.com/2008/09/programmings-dirtiest-little-secret.html

Life/Code Hacking #2.2: Speed writing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Typing tutors: gtypist, ktouch, typingweb.com, etc.

Life/Code Hacking #2.2: Speed writing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Kinesis Advantage (QWERTY/DVORAK)

Demo

COME