Page 1: Optimization on  Kepler

Optimization on Kepler

Zehuan Wang, [email protected]

Page 2: Optimization on  Kepler

Fundamental Optimization

Page 3: Optimization on  Kepler

Optimization Overview

GPU architecture
Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
CPU-GPU interaction optimization
  - Overlapped execution using streams

Page 4: Optimization on  Kepler

Optimization Overview

GPU architecture
Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
CPU-GPU interaction optimization
  - Overlapped execution using streams

Page 5: Optimization on  Kepler

GPU High Level View

Diagram: several streaming multiprocessors (each with its own shared memory, SMEM) share a global memory; the GPU is connected to the CPU and chipset over PCIe.

Page 6: Optimization on  Kepler

GK110 SM

Control unit:
  - 4 warp schedulers
  - 8 instruction dispatchers
Execution units:
  - 192 single-precision CUDA cores
  - 64 double-precision units
  - 32 SFUs, 32 LD/ST units
Memory:
  - Registers: 64K 32-bit
  - Caches: L1 + shared memory (64 KB), texture, constant

Page 7: Optimization on  Kepler

GPU and Programming Model

Software-to-hardware mapping: thread -> CUDA core, thread block -> multiprocessor, grid -> device.

Threads are executed by CUDA cores.

Thread blocks are executed on multiprocessors
  - Thread blocks do not migrate
  - Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources

A kernel is launched as a grid of thread blocks (see the sketch below)

Up to 16 kernels can execute on a device at one time
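As a minimal sketch of this model (the kernel, sizes, and pointer names below are illustrative, not from the slides), a kernel is launched as a grid of blocks; each block is scheduled onto one SM and each of its threads runs on a CUDA core:

// Hypothetical kernel: one thread per array element
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= factor;
}

// Host side: a grid of blocks covering n elements, 256 threads per block
int threads = 256;
int blocks  = (n + threads - 1) / threads;
scale<<<blocks, threads>>>(d_data, 2.0f, n);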

Page 8: Optimization on  Kepler

Warp

A warp is 32 successive threads in a block.

E.g. blockDim = 160: the block is automatically divided into 5 warps by the GPU
  - Warp 0 (threads 0~31), Warp 1 (32~63), Warp 2 (64~95), Warp 3 (96~127), Warp 4 (128~159)
E.g. blockDim = 161: if blockDim is not a multiple of 32, the remaining threads occupy one more warp
  - Warp 5 holds only thread 160

A block is therefore a sequence of warps of 32 threads each.
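As a quick illustration of the warp arithmetic above (a sketch, not from the slides):

// Number of warps in a block, rounding up when blockDim.x is not a multiple of 32
int warpsPerBlock = (blockDim.x + 31) / 32;   // blockDim = 160 -> 5 warps, 161 -> 6 warps

// Inside a kernel: which warp and lane a thread belongs to
int warpId = threadIdx.x / 32;
int laneId = threadIdx.x % 32;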

Page 9: Optimization on  Kepler

Warp

SIMD: Same Instruction, Multiple Data. The threads in the same warp always execute the same instruction, and instructions are issued to the execution units per warp.

Diagram: two warp schedulers, each issuing instructions from several resident warps in turn (e.g. warp 8 instruction 11, warp 2 instruction 42, warp 14 instruction 95, warp 8 instruction 12, ...).

Page 10: Optimization on  Kepler

Warp

Latency is caused by dependencies between neighboring instructions in the same warp. While one warp waits, instructions from other warps can be executed. Context switching is free, so having many resident warps can hide memory latency.


Page 11: Optimization on  Kepler

Kepler Memory Hierarchy

Registers
  - Spill to local memory
Caches
  - Shared memory
  - L1 cache
  - L2 cache
  - Constant cache
  - Texture cache

Global memory

Page 12: Optimization on  Kepler

Kepler/Fermi Memory Hierarchy

Diagram: each SM (SM-0 ... SM-N) has its own registers, L1/shared memory, texture cache, and constant cache; all SMs share a unified L2 cache in front of global memory. Moving from global memory toward registers, capacity gets lower and access gets faster (Kepler).

Page 13: Optimization on  Kepler

A Wrong View of Optimization

Trying every optimization method in the book: optimization is endless, so this approach has low efficiency.

Page 14: Optimization on  Kepler

General Optimization Strategies: Measurement

Find the limiting factor of kernel performance:
  - Memory bandwidth bound (memory optimization)
  - Instruction throughput bound (instruction optimization)
  - Latency bound (configuration optimization)

Measure effective memory/instruction throughput with the NVIDIA Visual Profiler (a manual measurement sketch follows).
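A sketch of how effective memory throughput can be measured by hand with CUDA events (the kernel name and the assumption that it reads and writes n floats are illustrative); the Visual Profiler reports the same kind of numbers automatically:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myKernel<<<grid, block>>>(d_in, d_out, n);       // hypothetical kernel
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// n floats read + n floats written
double effectiveGBps = 2.0 * n * sizeof(float) / (ms / 1000.0) / 1e9;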

Page 15: Optimization on  Kepler

Find the limiter, then:
  - Memory bound: compare achieved bandwidth to the effective value in GB/s. If it is much lower (<<), do memory optimization; if it is comparable (~), the issue is resolved.
  - Instruction bound: compare achieved throughput to the effective value in inst/s. If it is much lower (<<), do instruction optimization; if it is comparable (~), the issue is resolved.
  - Latency bound: do configuration optimization.
When the achieved value is close to the effective one: done!

Page 16: Optimization on  Kepler

Optimization Overview

GPU architecture
Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
CPU-GPU interaction optimization
  - Overlapped execution using streams

Page 17: Optimization on  Kepler

Memory Optimization

Apply when the code is memory bound and the effective memory throughput is much lower than the peak.

Purpose: access only the data that is absolutely necessary.

Major techniques:
  - Improve access patterns to reduce wasted transactions
  - Reduce redundant accesses: read-only cache, shared memory

Page 18: Optimization on  Kepler

Reduce Wasted Transactions

Memory accesses are per warp, and memory is accessed in discrete chunks (segments).
In Kepler, L1 is reserved for register spills and stack data only: global loads go directly to L2 (the line in L1 is invalidated), and on an L2 miss they go to DRAM.
Memory is transported in segments of 32 B (the same as for writes).
If a warp cannot use all of the data in the segments it touches, the rest of the memory transaction is wasted.

Page 19: Optimization on  Kepler

Kepler/Fermi Memory Hierarchy

Diagram: the same Kepler/Fermi memory hierarchy as before — per-SM registers, L1/shared memory, texture and constant caches; a shared L2 cache; global memory.

Page 20: Optimization on  Kepler

Reduce Wasted Transactions

Scenario: warp requests 32 aligned, consecutive 4-byte words
  - Addresses fall within 4 segments
  - No replays
  - Bus utilization: 100% (warp needs 128 bytes, and 128 bytes move across the bus on a miss)

Diagram: the 32 warp addresses map onto memory addresses 0-448 in consecutive 32 B segments.

Page 21: Optimization on  Kepler

Reduce Wasted Transactions

Scenario: warp requests 32 aligned, permuted 4-byte words
  - Addresses fall within 4 segments
  - No replays
  - Bus utilization: 100% (warp needs 128 bytes, and 128 bytes move across the bus on a miss)

Diagram: the warp's addresses are permuted within the same 4 segments of the memory address line.

Page 22: Optimization on  Kepler

Reduce Wasted Transactions

Scenario: warp requests 32 consecutive 4-byte words, offset from perfect alignment
  - Addresses fall within at most 5 segments
  - 1 replay (2 transactions)
  - Bus utilization: at least 80% (warp needs 128 bytes, at most 160 bytes move across the bus)
  - Some misaligned patterns will still fall within 4 segments, giving 100% utilization

Diagram: the warp's addresses are shifted relative to the segment boundaries of the memory address line.

Page 23: Optimization on  Kepler

Reduce Wasted Transactions

Scenario: all threads in a warp request the same 4-byte word
  - Addresses fall within a single segment
  - No replays
  - Bus utilization: 12.5% (warp needs 4 bytes, but 32 bytes move across the bus on a miss)

Diagram: all 32 warp addresses point at one word in a single segment of the memory address line.

Page 24: Optimization on  Kepler

Reduce Wasted Transactions

Scenario: warp requests 32 scattered 4-byte words
  - Addresses fall within N segments
  - (N-1) replays (N transactions); could be lower if some addresses fall within the same segment
  - Bus utilization: 128 / (N*32) (4x higher than caching loads)
  - Warp needs 128 bytes, but N*32 bytes move across the bus on a miss

Diagram: the warp's addresses are scattered across the memory address line. A small code sketch of the alignment effect follows.
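As an illustrative sketch of these scenarios (not from the slides; names are made up), an offset copy kernel shows how alignment affects wasted transactions: with offset = 0 each warp touches exactly 4 segments, while a nonzero offset adds an extra segment per warp.

// Assumes "in" has at least n + offset valid elements
__global__ void offsetCopy(float *out, const float *in, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i + offset];   // offset != 0 breaks perfect alignment
}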

Page 25: Optimization on  Kepler

Read-Only Cache

An alternative to L1 when accessing DRAM
  - Also known as the texture cache: all texture accesses use this cache
  - On CC 3.5 and higher it can also serve global memory reads
  - Should not be used if a kernel reads and writes the same addresses

Compared to L1:
  - Generally better for scattered reads than L1
  - Caches at 32 B granularity (L1 caches at 128 B granularity)
  - Does not require replays for multiple transactions (L1 does)
  - Higher latency than L1 reads, and also tends to increase register use

Page 26: Optimization on  Kepler

Read-Only Cache

Annotate eligible kernel pointer parameters with const ... __restrict__; the compiler will then automatically map those loads to the read-only data cache path.

__global__ void saxpy(float x, float y,
                      const float * __restrict__ input,
                      float *output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

    // Compiler will automatically use the read-only (texture) path for "input"
    output[offset] = (input[offset] * x) + y;
}
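Not on the slide, but worth noting: on CC 3.5+ the read-only path can also be requested explicitly with the __ldg() intrinsic, e.g. inside the same kernel body:

// Explicit read-only load through the texture/read-only cache (CC 3.5+)
output[offset] = (__ldg(&input[offset]) * x) + y;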

Page 27: Optimization on  Kepler

Shared Memory

Low latency: a few cycles
High throughput
Main uses:
  - Inter-thread communication within a block
  - User-managed cache to reduce redundant global memory accesses
  - Avoiding non-coalesced access

Page 28: Optimization on  Kepler

Shared Memory Example: Matrix Multiplication

Diagram: matrices A and B multiplied to give C (C = A x B).

Every thread corresponds to one entry in C.

Page 29: Optimization on  Kepler

Naive Kernel

__global__ void simpleMultiply(float *a, float *b, float *c, int N)
{
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    int col = threadIdx.y + blockIdx.y * blockDim.y;
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += a[row*N+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}

Every thread corresponds to one entry in C.

Page 30: Optimization on  Kepler

Blocked Matrix Multiplication

Diagram: C = A x B computed tile by tile.

Data reuse in the blocked version: each tile of A and B is loaded into shared memory once and reused for a whole tile of C.

Page 31: Optimization on  Kepler

Blocked and cached kernel

__global__ void coalescedMultiply(double *a, double *b, double *c, int N)
{
    __shared__ double aTile[TILE_DIM][TILE_DIM];
    __shared__ double bTile[TILE_DIM][TILE_DIM];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;

    for (int k = 0; k < N; k += TILE_DIM) {
        // Each thread loads one element of the A and B tiles
        aTile[threadIdx.y][threadIdx.x] = a[row*N + threadIdx.x + k];
        bTile[threadIdx.y][threadIdx.x] = b[(threadIdx.y + k)*N + col];
        __syncthreads();            // wait until both tiles are loaded

        for (int i = 0; i < TILE_DIM; i++) {
            sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
        }
        __syncthreads();            // wait before the tiles are overwritten
    }
    c[row*N + col] = sum;
}
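A possible launch configuration for this kernel (a sketch; the TILE_DIM value and the assumption that N is a multiple of TILE_DIM are illustrative):

#define TILE_DIM 16

dim3 block(TILE_DIM, TILE_DIM);             // one thread per entry of a C tile
dim3 grid(N / TILE_DIM, N / TILE_DIM);      // assumes N is a multiple of TILE_DIM
coalescedMultiply<<<grid, block>>>(d_a, d_b, d_c, N);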

Page 32: Optimization on  Kepler

Optimization Overview

GPU architecture
Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
CPU-GPU interaction optimization
  - Overlapped execution using streams

Page 33: Optimization on  Kepler

Latency Optimization

Apply when the code is latency bound: both the memory and instruction throughputs are far from the peak.

Latency hiding works by switching threads: a thread blocks when one of its operands isn't ready, and other warps run in the meantime.

Purpose: have enough warps to hide latency.

Major technique: increase the number of active warps.

Page 34: Optimization on  Kepler

Enough Blocks and Block Size

Number of blocks: much greater than the number of SMs, and > 100 to scale well to future devices.
Block size: minimum 64. I generally use 128 or 256, but use whatever is best for your app. It depends on the problem, so do experiments! (A sizing sketch follows.)
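A minimal sizing sketch (the kernel and data names are made up): pick a block size and derive enough blocks to cover the data, which for large problems also keeps the block count well above the number of SMs:

int blockSize = 256;                              // 128 or 256 are common starting points
int gridSize  = (n + blockSize - 1) / blockSize;  // ceil(n / blockSize) blocks
myKernel<<<gridSize, blockSize>>>(d_data, n);     // hypothetical kernel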

Page 35: Optimization on  Kepler

Occupancy & Active Warps

Occupancy: ratio of active warps per SM to the maximum number of allowed warps

Maximum number: 48 on Fermi, 64 on Kepler (2048 thread slots / 32 threads per warp)

We need the occupancy to be high enough to hide latency

Occupancy is limited by resource usage

Page 36: Optimization on  Kepler

Dynamic Partitioning of SM Resources

Shared memory is partitioned among blocks.
Registers are partitioned among threads: <= 255 per thread.
Thread block slots: <= 16.
Thread slots: <= 2048.
Any of these can be the limiting factor on how many threads can be launched at the same time on an SM.

Page 37: Optimization on  Kepler

Occupancy Calculator

http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

Page 38: Optimization on  Kepler

Occupancy Optimizations

Know the current occupancy:
  - Visual Profiler
  - --ptxas-options=-v: outputs resource usage info, which is the input to the Occupancy Calculator

Adjust resource usage to increase occupancy:
  - Change the block size
  - Limit register usage: compiler option -maxrregcount=n (per file) or __launch_bounds__ (per kernel)
  - Allocate shared memory dynamically (see the sketch below)
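A hedged sketch of the per-kernel knobs listed above (the kernel and sizes are made up): __launch_bounds__ bounds the block size the compiler must support (and so its register budget), and dynamic shared memory is sized at launch time:

// At most 256 threads per block, aim for >= 4 resident blocks per SM
__global__ void __launch_bounds__(256, 4)
tileKernel(float *data, int n)
{
    extern __shared__ float tile[];   // size given by the 3rd launch parameter
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = data[i];
    // ...
}

// 256 threads and 256 floats of dynamic shared memory per block
tileKernel<<<gridSize, 256, 256 * sizeof(float)>>>(d_data, n);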

Page 39: Optimization on  Kepler

Optimization Overview

GPU architecture
Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
CPU-GPU interaction optimization
  - Overlapped execution using streams

Page 40: Optimization on  Kepler

Instruction Optimization

Apply if you find the code is instruction bound
  - A compute-intensive algorithm can easily become memory bound if you are not careful enough
  - Typically, worry about instruction optimization after memory and execution-configuration optimizations

Purpose: reduce the instruction count; use fewer instructions to get the same job done.

Major techniques:
  - Use high-throughput instructions
  - Reduce wasted instructions: branch divergence, etc.

Page 41: Optimization on  Kepler

Reduce Instruction Count

Use float if precision allows; add "f" to floating-point literals (e.g. 1.0f), because the default is double.

Fast math functions: there are two types of runtime math library functions (a sketch follows):
  - func(): slower but higher accuracy (5 ulp or less)
  - __func(): fast but lower accuracy (see the programming guide for full details)
  - -use_fast_math: forces every func() to __func()
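For illustration (a sketch; x is assumed to be a float in scope), the same expression written with the accurate library calls and with the fast intrinsics; the compiler flag performs the substitution globally:

// Accurate library calls: a few ulp of error
float y_accurate = sinf(x) * expf(x);

// Fast intrinsics: lower accuracy, higher throughput
float y_fast = __sinf(x) * __expf(x);

// Building with "nvcc -use_fast_math" maps sinf(), expf(), etc.
// to their __ intrinsic counterparts automatically.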

Page 42: Optimization on  Kepler

Control Flow

Divergent branches: threads within a single warp take different paths. Different execution paths within a warp are serialized; different warps can execute different code with no impact on performance.

Example with divergence (branch granularity < warp size):
    if (threadIdx.x > 2) {...} else {...}

Avoid diverging within a warp. Example without divergence (branch granularity is a whole multiple of warp size):
    if (threadIdx.x / WARP_SIZE > 2) {...} else {...}

A short sketch of the warp-uniform pattern follows.
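A small sketch of the divergence-free idea (not from the slides): make the branch condition uniform across a warp, so all 32 threads take the same path:

#define WARP_SIZE 32

__global__ void warpUniformBranch(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warpInBlock = threadIdx.x / WARP_SIZE;   // identical for all threads of a warp

    if (i < n) {
        if (warpInBlock % 2 == 0)   // whole warp takes the same branch
            data[i] *= 2.0f;
        else
            data[i] += 1.0f;
    }
}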

Page 43: Optimization on  Kepler

Kernel Optimization Workflow

Find the limiter, then:
  - Memory bound: compare achieved bandwidth to peak GB/s. If it is much lower (<<), do memory optimization; if it is comparable (~), done.
  - Instruction bound: compare achieved throughput to peak inst/s. If it is much lower (<<), do instruction optimization; if it is comparable (~), done.
  - Latency bound: do configuration optimization.

Page 44: Optimization on  Kepler

Optimization Overview

GPU architecture
Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
CPU-GPU interaction optimization
  - Overlapped execution using streams

Page 45: Optimization on  Kepler

Minimizing CPU-GPU Data Transfer

Host <-> device data transfer has much lower bandwidth than global memory access: 16 GB/s (PCIe x16 Gen3) vs. 250 GB/s and 3.95 Tinst/s (GK110).

Minimize transfers:
  - Intermediate data can be allocated, operated on, and de-allocated directly on the GPU
  - Sometimes it is even better to recompute on the GPU

Group transfers:
  - One large transfer is much better than many small ones (see the sketch below)
  - Overlap memory transfers with computation
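A sketch of the grouping idea (buffer names and sizes are made up): one large cudaMemcpy of a contiguous buffer instead of many small per-chunk copies:

// Many small copies: high per-call overhead
for (int i = 0; i < numChunks; i++)
    cudaMemcpy(d_buf + i * chunkSize, h_buf + i * chunkSize,
               chunkSize * sizeof(float), cudaMemcpyHostToDevice);

// One grouped copy of the same data: fewer API calls, better bandwidth
cudaMemcpy(d_buf, h_buf, numChunks * chunkSize * sizeof(float),
           cudaMemcpyHostToDevice);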

Page 46: Optimization on  Kepler

Streams and Async API

Default API:
  - Kernel launches are asynchronous with the CPU
  - Memcopies (D2H, H2D) block the CPU thread
  - CUDA calls are serialized by the driver

Streams and async functions provide:
  - Memcopies (D2H, H2D) asynchronous with the CPU
  - The ability to concurrently execute a kernel and a memcopy
  - Concurrent kernels (since Fermi)

Stream = a sequence of operations that execute in issue order on the GPU
  - Operations from different streams can be interleaved
  - A kernel and a memcopy from different streams can be overlapped

Page 47: Optimization on  Kepler

Pinned (non-pageable) memory

Pinned memory enables memcopies that are asynchronous with both the CPU and the GPU.

Usage: cudaHostAlloc / cudaFreeHost instead of malloc / free.

Note:
  - Pinned memory is essentially removed from pageable virtual memory (it cannot be swapped out)
  - cudaHostAlloc is typically very expensive, so allocate once and reuse (a usage sketch follows)
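A minimal usage sketch (buffer size and names are illustrative): allocate the pinned buffer once, reuse it for async copies, and free it with cudaFreeHost:

float *h_buf;
size_t bytes = 1 << 20;

cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);  // instead of malloc

// ... fill h_buf, issue cudaMemcpyAsync(..., stream), launch kernels ...

cudaFreeHost(h_buf);                                         // instead of free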

Page 48: Optimization on  Kepler

Overlap kernel and memory copy

Requirements:
  - D2H or H2D memcopy from pinned memory
  - Device with compute capability >= 1.1 (G84 and later)
  - Kernel and memcopy in different, non-0 streams

Code (the memcopy and the kernel are potentially overlapped):

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(…);
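Not shown on the slide: the host usually synchronizes before consuming the results and releases the streams afterwards, e.g.:

cudaStreamSynchronize(stream1);   // wait for the async copy
cudaStreamSynchronize(stream2);   // wait for the kernel

cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);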

