PyCUDA
Continued...
gpuarray Vector Types
● pycuda.gpuarray.vec
● All CUDA vector types are supported
– float3, int3, long4, etc.
● Available as numpy data types
● Field names x, y, z, and w as in CUDA
● Construct using the make_type function:
– e.g. make_float3(x, y, z)
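● A minimal usage sketch (the values are arbitrary):

import numpy
from pycuda.gpuarray import vec

# float3 is exposed as a numpy structured dtype
p = vec.make_float3(1.0, 2.0, 3.0)
print p['x'], p['y'], p['z']   # field access, as in CUDA

# arrays of vector types behave like any other numpy dtype
a = numpy.empty(4, dtype=vec.float3)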
Conditionals
● gpuarray.if_positive(criterion, then, else)
– Return an array like then, which, for the element at index i, contains then[i] if criterion[i]>0, otherwise else[i]
● gpuarray.maximum(a, b)
– Return the elementwise maximum of a and b
● gpuarray.minimum(a, b)
– Return the elementwise minimum of a and b
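● A small sketch exercising all three (array values are arbitrary):

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy

a = gpuarray.to_gpu(numpy.array([-1., 2., -3.], dtype=numpy.float32))
b = gpuarray.to_gpu(numpy.array([10., 20., 30.], dtype=numpy.float32))

sel = gpuarray.if_positive(a, b, -b)   # b[i] where a[i] > 0, else -b[i]
mx = gpuarray.maximum(a, b)            # elementwise maximum
mn = gpuarray.minimum(a, b)            # elementwise minimum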
Elementwise Kernel
● Avoid extra store-fetch cycles for elementwise math
[Diagram: arrays x and y scaled by a and b and combined into z in a single kernel pass, instead of separate store-fetch steps per operation]
Elementwise Kernel Example
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.curandom import rand as curand

a_gpu = curand((50,))
b_gpu = curand((50,))

from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]",
        "linear_combination")

c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

import numpy.linalg as la
assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5
Reduction Kernel Example
● Example: a dot-product calculation

import numpy
import pycuda.autoinit
from pycuda.reduction import ReductionKernel
from pycuda.curandom import rand as curand

dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="float *x, float *y")

x = curand((1000*1000,), dtype=numpy.float32)
y = curand((1000*1000,), dtype=numpy.float32)

x_dot_y = dot(x, y).get()
x_dot_y_cpu = numpy.dot(x.get(), y.get())
Parallel Scan / Prefix Sum
● Inclusive scan of x into y: y[i] = x[0] + x[1] + … + x[i]
Prefix Sum Example
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.scan import InclusiveScanKernel

knl = InclusiveScanKernel(np.int32, "a+b")

n = 2**20 - 2**18 + 5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = gpuarray.to_gpu(host_data)

knl(dev_data)
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
Custom Data Types
● Use your own data types in scan and reduction
● Define the custom type in the kernel preamble
● Register it with tools.register_dtype(dtype, name)
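● A sketch of the pattern, following PyCUDA's struct-reduction demo: a min/max collector reduced over a float array. The struct and helper names come from that demo; newer PyCUDA spells the registration call tools.get_or_register_dtype.

import pycuda.autoinit
import numpy
from pycuda import tools
from pycuda.reduction import ReductionKernel
from pycuda.curandom import rand as curand

preamble = """
struct minmax_collector {
    float cur_min, cur_max;
    __device__ minmax_collector() { }
    __device__ minmax_collector(float cmin, float cmax)
        : cur_min(cmin), cur_max(cmax) { }
    __device__ minmax_collector(minmax_collector const volatile &src)
        : cur_min(src.cur_min), cur_max(src.cur_max) { }
    __device__ minmax_collector volatile &
    operator=(minmax_collector const &src) volatile
    { cur_min = src.cur_min; cur_max = src.cur_max; return *this; }
};
__device__ minmax_collector agg_mmc(minmax_collector a, minmax_collector b)
{
    return minmax_collector(fminf(a.cur_min, b.cur_min),
                            fmaxf(a.cur_max, b.cur_max));
}
"""

# matching numpy dtype, registered under the CUDA struct's name
mmc_dtype = numpy.dtype([("cur_min", numpy.float32),
                         ("cur_max", numpy.float32)])
tools.register_dtype(mmc_dtype, "minmax_collector")

red = ReductionKernel(mmc_dtype,
        neutral="minmax_collector(10000, -10000)",
        reduce_expr="agg_mmc(a, b)",
        map_expr="minmax_collector(x[i], x[i])",
        arguments="float *x", preamble=preamble)

minmax = red(curand((10000,), dtype=numpy.float32)).get()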
Elementwise Math Functions
● Rounding and absolute value
– fabs, ceil, floor
● Exponentials, logarithms and roots
– exp, log, log10, sqrt
● Trigonometric functions
– sin, cos, tan, asin, acos, atan
● Hyperbolic functions
– sinh, cosh, tanh
● Floating point decomposition and assembly
– fmod, frexp, ldexp, modf
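● These live in pycuda.cumath and apply elementwise to gpuarrays; a quick sketch:

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath
import numpy

a = gpuarray.to_gpu(numpy.linspace(0, 1, 8).astype(numpy.float32))
b = cumath.sin(a)           # elementwise, computed on the GPU
c = cumath.sqrt(a) + b      # results are gpuarrays, so they compose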
Random Number Generation
● curandom.rand(shape)
– Returns values in the range [0, 1)
● For more control over random number generation, use one of the following curandom generator classes (usage sketched after the list):
– XORWOWRandomNumberGenerator()
– Sobol32RandomNumberGenerator()
– ScrambledSobol32RandomNumberGenerator()
– Sobol64RandomNumberGenerator()
– ScrambledSobol64RandomNumberGenerator()
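● A short sketch of both interfaces (sizes are arbitrary):

import pycuda.autoinit
import numpy
import pycuda.curandom as curandom

# one-shot convenience: uniform floats in [0, 1)
a = curandom.rand((1000,), dtype=numpy.float32)

# explicit generator state, with normal as well as uniform sampling
rng = curandom.XORWOWRandomNumberGenerator()
u = rng.gen_uniform((1000,), dtype=numpy.float32)
n = rng.gen_normal((1000,), dtype=numpy.float32)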
Monte Carlo Simulation: Calculate PI
● Circle of radius R inscribed in a square with side length 2R
● Square area = (2R)²
● Circle area = πR²
● Ratio of the two areas is π/4
● Pick N random points in the square; on average Nπ/4 of them fall within the circle
● If M points fall inside the circle: π ≈ 4M/N
Monte Carlo PI Example
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.curandom import XORWOWRandomNumberGenerator
from pycuda.reduction import ReductionKernel

rng = XORWOWRandomNumberGenerator()

N = 10000000

x_gpu = rng.gen_uniform((N,), dtype=numpy.float32)
y_gpu = rng.gen_uniform((N,), dtype=numpy.float32)

circle = ReductionKernel(numpy.dtype(numpy.float32), neutral="0",
        reduce_expr="a+b",
        map_expr="float((x[i]*x[i]+y[i]*y[i])<=1.0f)",
        arguments="float *x, float *y")

result = 4.0 * circle(x_gpu, y_gpu).get() / N

print 'Estimate for PI on GPU: {}'.format(result)
CUDA Programming Paradigm
Performance Analysis Tools
● CUDA_PROFILE=1 python calculate.py
– Generates cuda_profile_0.log file in same directory
– Provides breakdown of method executed, GPU time, CPU time and occupancy
● CUDA Visual Profiler
– Run the computeprof executable
– Plots graphs for easy analysis
Event Timing Example
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.curandom as curandom
import numpy

# create two timers
start = cuda.Event()
end = cuda.Event()

# record the starting time
start.record()

# perform gpu computations
for i in range(1000):
    curandom.rand((10000000,))

# record finishing time
end.record()
end.synchronize()

print "GPU time: %.2f seconds" % (start.time_till(end) * 1e-3)
Occupancy
● pycuda.tools.OccupancyRecord
– tb_per_mp: how many thread blocks execute on each multiprocessor
– limited_by: what tb_per_mp is limited by; one of “device”, “warps”, “regs”, “smem”
– warps_per_mp: how many warps execute on each multiprocessor
– occupancy: a float between 0 and 1 indicating how much of each multiprocessor's scheduling capability is occupied by the kernel
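● A usage sketch (the launch configuration below — 256 threads, 4 KB shared memory, 20 registers per thread — is hypothetical):

import pycuda.autoinit
from pycuda.tools import DeviceData, OccupancyRecord

devdata = DeviceData()  # properties of the current device
occ = OccupancyRecord(devdata, threads=256, shared_mem=4096, registers=20)
print "occupancy: %.2f (limited by %s)" % (occ.occupancy, occ.limited_by)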
Important Hardware Properties
● Warp size
● Maximum block dimensions
● Max threads per block
● Max threads per multiprocessor
● Multiprocessor count
Query Hardware Example
import pycuda.driver as drv

drv.init()
print "%d device(s) found." % drv.Device.count()

for ordinal in range(drv.Device.count()):
    dev = drv.Device(ordinal)
    print "Device #%d: %s" % (ordinal, dev.name())
    print "  Compute Capability: %d.%d" % dev.compute_capability()
    print "  Total Memory: %s KB" % (dev.total_memory()//(1024))
    atts = [(str(att), value)
            for att, value in dev.get_attributes().iteritems()]
    atts.sort()

    for att, value in atts:
        print "  %s: %s" % (att, value)
Simple Optimisation Strategy
● Choose parameters based on your hardware constraints (e.g. block dimensions)
● Threads per block should be multiple of warp size
● Each multiprocessor should have enough active warps to hide instruction and memory latency
● Pad input data if necessary
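● One way to put these rules into practice, sketched against the driver's device attributes (the factor of 8 warps per block is an arbitrary starting point):

import pycuda.autoinit
import pycuda.driver as drv

dev = pycuda.autoinit.device
warp_size = dev.get_attribute(drv.device_attribute.WARP_SIZE)
max_threads = dev.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK)

# threads per block: a multiple of the warp size, within the device limit
block_size = min(8 * warp_size, max_threads)

# round the grid up so every element is covered, padding the tail block
n = 1000003
grid_size = (n + block_size - 1) // block_size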
N-Body Simulation
N-Body Example