Programming with CUDA
WS 08/09
Lecture 10
Tue, 25 Nov 2008
Previously
Optimizing Instruction Throughput
– Low-throughput instructions
  Different versions of math functions
  Type conversions are costly
  Avoid warp divergence
  Accessing global memory is expensive
  Overlap memory ops with math ops
Previously
Optimizing Instruction Throughput
– Optimal use of memory bandwidth
  Global memory: coalesce accesses
  Local memory: coalesced automatically
  Constant memory: cached, cost proportional to the number of addresses read
  Texture memory: cached, optimized for 2D spatial locality
  Shared memory: on chip, fast, but avoid bank conflicts
Today
Optimizing Instruction Throughput
– Optimal use of memory bandwidth
  Shared memory: on chip, fast, but avoid bank conflicts
  Registers
Optimizing #threads per block
Memory copies
Texture vs. global vs. constant memory
General optimizations
Shared Memory
Bank conflicts
– Shared memory is divided into 32-bit modules called banks
– Banks allow simultaneous reads
– An N-way bank conflict occurs when N threads try to read from the same bank
  Leads to serialization of the reads
  Not necessarily N serial reads
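The access patterns above can be sketched as two small kernels (the kernel names are illustrative). This is a sketch assuming hardware of the lecture's era, compute capability 1.x, where shared memory has 16 banks and conflicts are resolved per half-warp:

```cuda
// Sketch: conflict-free vs. 2-way conflicting shared memory access.
// Assumes 16 banks of 32-bit words, conflicts checked per half-warp
// (compute capability 1.x); bank of word w is (w % 16).

__global__ void conflictFree(float *out)
{
    __shared__ float s[256];
    int tid = threadIdx.x;
    s[tid] = (float)tid;          // stride 1: thread i -> bank (i % 16)
    __syncthreads();
    out[tid] = s[tid];            // each thread of a half-warp hits a distinct bank
}

__global__ void twoWayConflict(float *out)
{
    __shared__ float s[512];
    int tid = threadIdx.x;
    s[2 * tid] = (float)tid;      // stride 2: threads i and i+8 map to the same bank
    __syncthreads();
    out[tid] = s[2 * tid];        // 2-way bank conflict: reads are serialized
}
```

With stride 2, thread i accesses word 2i, which lands in bank (2i mod 16), so threads i and i+8 of the same half-warp collide.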
Shared Memory
Bank conflicts
– Broadcast mechanism
  One word is chosen as a broadcast word
  It is automatically passed to all other threads reading from that word
– The application cannot control which word is picked as the broadcast word
Registers
Generally 0 clock cycles
– The time to access registers is included in the instruction time
– There can still be delays
Registers
Delays may occur due to register memory bank conflicts
– Register memory banks are handled by the compiler and thread scheduler
  They try to schedule instructions to avoid conflicts
  This works best with a multiple of 64 threads per block
– The application has no other control
Registers
Delays may occur due to read-after-write dependencies
– These can be hidden if each SM has at least 192 active threads
Optimizing #Threads per Block
Use 2 or more blocks per SM
– A waiting block (thread sync, memory copy) can be overlapped with running blocks
– Shared memory per block should then be less than half the shared memory per SM
Optimizing #Threads per Block
Using a multiple of 32 threads per block fully populates warps
Using a multiple of 64 threads per block allows the compiler and thread scheduler to avoid register memory bank conflicts
Optimizing #Threads per Block
More threads per block = fewer registers per thread
– Compiler option to report the memory requirements of a kernel: --ptxas-options=-v
– The number of available registers varies with the device's compute capability
Optimizing #Threads per Block
When optimizing, go for a multiple of 64 threads per block
– 192 or 256 threads recommended
Occupancy of an SM = (#active warps) / (max. active warps)
– The compiler tries to maximize occupancy
Optimizing Memory Copies
Host mem <=> device mem
– Low bandwidth
– Higher bandwidth can be achieved using page-locked (pinned) memory
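A minimal sketch of a transfer through pinned memory, using the CUDA runtime API (buffer sizes and names are illustrative):

```cuda
// Sketch: copying through page-locked (pinned) host memory.
// cudaMallocHost allocates pinned memory, which the driver can
// DMA directly, giving higher host<->device bandwidth than a
// pageable malloc'd buffer.

#include <cuda_runtime.h>

int main()
{
    const size_t n = 1 << 20;
    float *hPinned, *dData;

    cudaMallocHost((void**)&hPinned, n * sizeof(float)); // pinned host buffer
    cudaMalloc((void**)&dData, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) hPinned[i] = 1.0f;

    // Same call as with pageable memory, but faster with a pinned source
    cudaMemcpy(dData, hPinned, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(dData);
    cudaFreeHost(hPinned);   // pinned memory has its own free call
    return 0;
}
```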
Optimizing Memory Copies
Minimize such transfers
– Move more code to the device, even if it does not fully utilize parallelism
– Create intermediate data structures in device memory
– Group several small transfers into one large one
Texture Fetches vs. Reading Global/Constant Mem
– Cached, optimized for spatial locality
– No coalescing constraints
– Address calculation latency is better hidden
– Data can be packed
– Optional conversion of integers to normalized floats in [0.0,1.0] or [-1.0,1.0]
Texture Fetches vs. Reading Global/Constant Mem
For textures stored in CUDA arrays:
– Filtering
– Normalized texture coordinates
– Addressing modes
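A texture fetch from linear device memory can be sketched with the texture reference API of the CUDA 2.x era (the reference name and kernel are illustrative; this API was later deprecated in favor of texture objects):

```cuda
// Sketch: reading linear device memory through the texture cache.
// tex1Dfetch goes through the texture cache, so the reads have no
// coalescing constraints, unlike a plain global memory load.

#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texRef;

__global__ void scale(float *out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * tex1Dfetch(texRef, i);  // cached fetch
}

void launch(float *dIn, float *dOut, int n)
{
    cudaBindTexture(0, texRef, dIn, n * sizeof(float)); // bind linear memory
    scale<<<(n + 255) / 256, 256>>>(dOut, n, 2.0f);
    cudaUnbindTexture(texRef);
}
```

Filtering, normalized coordinates, and addressing modes require the texture to live in a CUDA array rather than linear memory.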
General Guidelines
– Maximize parallelism
– Maximize memory bandwidth
– Maximize instruction throughput
Maximize Parallelism
Build on data parallelism
– Broken in case of thread dependency
– For threads in the same block:
  __syncthreads()
  Share data using shared memory
– For threads in different blocks:
  Share data using global memory
  Two kernel calls: the first to write the data, the second to read it
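The two-kernel pattern for sharing data between blocks can be sketched as follows (kernel names are illustrative). The kernel launch boundary acts as a device-wide synchronization point: all writes from the first kernel are visible when the second starts.

```cuda
// Sketch: sharing data across blocks via global memory and two
// kernel launches. There is no __syncthreads() equivalent across
// blocks, so the split into two kernels provides the ordering.

#include <cuda_runtime.h>

__global__ void writeData(float *buf)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[i] = (float)i;                 // each thread writes one element
}

__global__ void readData(const float *buf, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = buf[(i + 1) % n];         // may read another block's element
}

void run(float *dBuf, float *dOut, int n)
{
    writeData<<<n / 256, 256>>>(dBuf);
    readData<<<n / 256, 256>>>(dBuf, dOut, n); // sees all writes above
}
```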
Maximize Parallelism
Build on data parallelism
– Choose kernel parameters accordingly
– Clever device use: streams
– Clever host use: async kernels
Maximize Memory Bandwidth
Minimize host <=> device memory copies
Minimize device <=> device memory data transfer
– Use shared memory
It might even be better not to copy at all
– Just recompute on the device
Maximize Memory Bandwidth
Organize data for optimal memory access patterns
– Crucial for accesses to global memory
Maximize Instruction Throughput
For non-crucial cases, use higher-throughput arithmetic instructions
– Sacrifice accuracy for performance
– Replace double with float operations
Pay attention to warp divergence
– Try to arrange diverging threads per warp, e.g.
  if (threadIdx.x / warpSize > n)
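The condition above can be shown in context (the kernel is illustrative): because every thread of a warp computes the same value of threadIdx.x / warpSize, the branch never splits a warp.

```cuda
// Sketch: branching at warp granularity so that all threads of a
// warp take the same path, avoiding intra-warp divergence.

__global__ void warpAligned(float *data, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;   // identical for a whole warp

    if (warp > n)
        data[tid] *= 2.0f;   // the entire warp takes this path...
    else
        data[tid] += 1.0f;   // ...or the entire warp takes this one

    // In contrast, a condition such as (threadIdx.x % 2 == 0) splits
    // every warp into two serialized execution paths.
}
```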
Final Projects
Time-line
– Thu, 20 Nov: float write-ups on ideas of Jens & Waqar
– Tue, 25 Nov (today): suggest groups and topics
– Thu, 27 Nov: groups and topics assigned
– Tue, 2 Dec: last chance to change groups/topics; groups and topics finalized
All for today
Next time
– A full-fledged example project
On to exercises!