PyCUDA
Continued...
gpuarray Vector Types
● pycuda.gpuarray.vec
● All CUDA vector types are supported
– float3, int3, long4, etc.
● Available as numpy data types
● Field names x, y, z, and w as in CUDA
● Construct using the make_type function:
– e.g. make_float3(x, y, z)
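● A minimal usage sketch (the values are arbitrary):

import numpy
from pycuda.gpuarray import vec

# float3 is exposed as a numpy structured dtype
p = vec.make_float3(1.0, 2.0, 3.0)
print p['x'], p['y'], p['z']   # field access, as in CUDA

# arrays of vector types behave like any other numpy dtype
a = numpy.empty(4, dtype=vec.float3)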
Conditionals
● gpuarray.if_positive(criterion, then, else)
– Return an array like then, which, for the element at index i, contains then[i] if criterion[i]>0, otherwise else[i]
● gpuarray.maximum(a, b)
– Return the elementwise maximum of a and b
● gpuarray.minimum(a, b)
– Return the elementwise minimum of a and b
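● A small sketch exercising all three (array values are arbitrary):

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy

a = gpuarray.to_gpu(numpy.array([-1., 2., -3.], dtype=numpy.float32))
b = gpuarray.to_gpu(numpy.array([10., 20., 30.], dtype=numpy.float32))

sel = gpuarray.if_positive(a, b, -b)   # b[i] where a[i] > 0, else -b[i]
mx = gpuarray.maximum(a, b)            # elementwise maximum
mn = gpuarray.minimum(a, b)            # elementwise minimum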
Elementwise Kernel
● Avoid extra store-fetch cycles for elementwise math
[Diagram: arrays x and y scaled by a and b and combined into z in a single kernel pass, instead of separate store-fetch steps per operation]
Elementwise Kernel Example
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.curandom import rand as curand

a_gpu = curand((50,))
b_gpu = curand((50,))

from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]",
        "linear_combination")

c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

import numpy.linalg as la
assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5
Reduction Kernel Example
● Example: a dot-product calculation

import numpy
import pycuda.autoinit
from pycuda.reduction import ReductionKernel
from pycuda.curandom import rand as curand

dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="float *x, float *y")

x = curand((1000*1000,), dtype=numpy.float32)
y = curand((1000*1000,), dtype=numpy.float32)

x_dot_y = dot(x, y).get()
x_dot_y_cpu = numpy.dot(x.get(), y.get())
Parallel Scan / Prefix Sum
● Inclusive scan of x into y: y[i] = x[0] + x[1] + … + x[i]
Prefix Sum Example
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.scan import InclusiveScanKernel

knl = InclusiveScanKernel(np.int32, "a+b")

n = 2**20 - 2**18 + 5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = gpuarray.to_gpu(host_data)

knl(dev_data)
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
Custom Data Types
● Use your own data types in scan and reduction
● Define the custom type in the kernel preamble
● Register it with tools.register_dtype(dtype, name)
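● A sketch of the pattern, following PyCUDA's struct-reduction demo: a min/max collector reduced over a float array. The struct and helper names come from that demo; newer PyCUDA spells the registration call tools.get_or_register_dtype.

import pycuda.autoinit
import numpy
from pycuda import tools
from pycuda.reduction import ReductionKernel
from pycuda.curandom import rand as curand

preamble = """
struct minmax_collector {
    float cur_min, cur_max;
    __device__ minmax_collector() { }
    __device__ minmax_collector(float cmin, float cmax)
        : cur_min(cmin), cur_max(cmax) { }
    __device__ minmax_collector(minmax_collector const volatile &src)
        : cur_min(src.cur_min), cur_max(src.cur_max) { }
    __device__ minmax_collector volatile &
    operator=(minmax_collector const &src) volatile
    { cur_min = src.cur_min; cur_max = src.cur_max; return *this; }
};
__device__ minmax_collector agg_mmc(minmax_collector a, minmax_collector b)
{
    return minmax_collector(fminf(a.cur_min, b.cur_min),
                            fmaxf(a.cur_max, b.cur_max));
}
"""

# matching numpy dtype, registered under the CUDA struct's name
mmc_dtype = numpy.dtype([("cur_min", numpy.float32),
                         ("cur_max", numpy.float32)])
tools.register_dtype(mmc_dtype, "minmax_collector")

red = ReductionKernel(mmc_dtype,
        neutral="minmax_collector(10000, -10000)",
        reduce_expr="agg_mmc(a, b)",
        map_expr="minmax_collector(x[i], x[i])",
        arguments="float *x", preamble=preamble)

minmax = red(curand((10000,), dtype=numpy.float32)).get()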
Elementwise Math Functions
● Rounding and absolute value
– fabs, ceil, floor
● Exponentials, logarithms and roots
– exp, log, log10, sqrt
● Trigonometric functions
– sin, cos, tan, asin, acos, atan
● Hyperbolic functions
– sinh, cosh, tanh
● Floating point decomposition and assembly
– fmod, frexp, ldexp, modf
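● These live in pycuda.cumath and apply elementwise to gpuarrays; a quick sketch:

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath
import numpy

a = gpuarray.to_gpu(numpy.linspace(0, 1, 8).astype(numpy.float32))
b = cumath.sin(a)           # elementwise, computed on the GPU
c = cumath.sqrt(a) + b      # results are gpuarrays, so they compose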
Random Number Generation
● curandom.rand(shape)
– Returns values in the range [0, 1)
● For more control over random number generation, use one of the following curandom generator classes (usage sketched after the list):
– XORWOWRandomNumberGenerator()
– Sobol32RandomNumberGenerator()
– ScrambledSobol32RandomNumberGenerator()
– Sobol64RandomNumberGenerator()
– ScrambledSobol64RandomNumberGenerator()
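● A short sketch of both interfaces (sizes are arbitrary):

import pycuda.autoinit
import numpy
import pycuda.curandom as curandom

# one-shot convenience: uniform floats in [0, 1)
a = curandom.rand((1000,), dtype=numpy.float32)

# explicit generator state, with normal as well as uniform sampling
rng = curandom.XORWOWRandomNumberGenerator()
u = rng.gen_uniform((1000,), dtype=numpy.float32)
n = rng.gen_normal((1000,), dtype=numpy.float32)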
Monte Carlo Simulation: Calculate PI
● Circle of radius R inscribed in a square with side length 2R
● Square area = (2R)²
● Circle area = πR²
● Ratio of the two areas is π/4
● Pick N random points in the square; on average Nπ/4 of them fall within the circle
● If M points fall inside the circle: π ≈ 4M/N
Monte Carlo PI Example
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.curandom import XORWOWRandomNumberGenerator
from pycuda.reduction import ReductionKernel

rng = XORWOWRandomNumberGenerator()

N = 10000000

x_gpu = rng.gen_uniform((N,), dtype=numpy.float32)
y_gpu = rng.gen_uniform((N,), dtype=numpy.float32)

circle = ReductionKernel(numpy.dtype(numpy.float32), neutral="0",
        reduce_expr="a+b",
        map_expr="float((x[i]*x[i]+y[i]*y[i])<=1.0f)",
        arguments="float *x, float *y")

result = 4.0 * circle(x_gpu, y_gpu).get() / N

print 'Estimate for PI on GPU: {}'.format(result)
CUDA Programming Paradigm
Performance Analysis Tools
● CUDA_PROFILE=1 python calculate.py
– Generates cuda_profile_0.log file in same directory
– Provides breakdown of method executed, GPU time, CPU time and occupancy
● CUDA Visual Profiler
– Run the computeprof executable
– Plots graphs for easy analysis
Event Timing Example
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.curandom as curandom
import numpy

# create two timers
start = cuda.Event()
end = cuda.Event()

# record the starting time
start.record()

# perform gpu computations
for i in range(1000):
    curandom.rand((10000000,))

# record finishing time
end.record()
end.synchronize()

print "GPU time: %.2f seconds" % (start.time_till(end) * 1e-3)
Occupancy
● pycuda.tools.OccupancyRecord
– tb_per_mp: how many thread blocks execute on each multiprocessor
– limited_by: what tb_per_mp is limited by; one of “device”, “warps”, “regs”, “smem”
– warps_per_mp: how many warps execute on each multiprocessor
– occupancy: a float between 0 and 1 indicating how much of each multiprocessor's scheduling capability is occupied by the kernel
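● A usage sketch (the launch configuration below — 256 threads, 4 KB shared memory, 20 registers per thread — is hypothetical):

import pycuda.autoinit
from pycuda.tools import DeviceData, OccupancyRecord

devdata = DeviceData()  # properties of the current device
occ = OccupancyRecord(devdata, threads=256, shared_mem=4096, registers=20)
print "occupancy: %.2f (limited by %s)" % (occ.occupancy, occ.limited_by)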
Important Hardware Properties
● Warp size
● Maximum block dimensions
● Max threads per block
● Max threads per multiprocessor
● Multiprocessor count
Query Hardware Example
import pycuda.driver as drv

drv.init()
print "%d device(s) found." % drv.Device.count()

for ordinal in range(drv.Device.count()):
    dev = drv.Device(ordinal)
    print "Device #%d: %s" % (ordinal, dev.name())
    print "  Compute Capability: %d.%d" % dev.compute_capability()
    print "  Total Memory: %s KB" % (dev.total_memory()//(1024))
    atts = [(str(att), value)
            for att, value in dev.get_attributes().iteritems()]
    atts.sort()

    for att, value in atts:
        print "  %s: %s" % (att, value)
Simple Optimisation Strategy
● Choose parameters based on your hardware constraints (e.g. block dimensions)
● Threads per block should be multiple of warp size
● Each multiprocessor should have enough active warps to hide instruction and memory latency
● Pad input data if necessary
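● One way to put these rules into practice, sketched against the driver's device attributes (the factor of 8 warps per block is an arbitrary starting point):

import pycuda.autoinit
import pycuda.driver as drv

dev = pycuda.autoinit.device
warp_size = dev.get_attribute(drv.device_attribute.WARP_SIZE)
max_threads = dev.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK)

# threads per block: a multiple of the warp size, within the device limit
block_size = min(8 * warp_size, max_threads)

# round the grid up so every element is covered, padding the tail block
n = 1000003
grid_size = (n + block_size - 1) // block_size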
N-Body Simulation
N-Body Example