PyCUDA Continued...
Page 1:

PyCUDA

Continued...

Page 2:

gpuarray Vector Types

● pycuda.gpuarray.vec

● All CUDA vector types are supported:

– float3, int3, long4, etc.

● Available as numpy data types

● Field names x, y, z, and w as in CUDA

● Construct using make_type function:

– e.g. make_float3(x, y, z)
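
For illustration, a minimal sketch of constructing a float3 value and allocating a GPU array of float3 elements (the variable names are illustrative):

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

# build a single float3 value; fields are addressed numpy-style
v = gpuarray.vec.make_float3(1.0, 2.0, 3.0)
print(v["x"])

# allocate a GPU array whose elements are float3
a = gpuarray.zeros(16, dtype=gpuarray.vec.float3)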

Page 3:

Conditionals

● gpuarray.if_positive(criterion, then_, else_)

– Returns an array like then_, which, for the element at index i, contains then_[i] if criterion[i] > 0, otherwise else_[i].

● gpuarray.maximum(a, b)

– Return the elementwise maximum of a and b

● gpuarray.minimum(a, b)

– Return the elementwise minimum of a and b
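
A minimal sketch exercising all three calls (array contents and names are illustrative):

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.curandom import rand as curand

a = curand((10,)) - 0.5          # uniform values shifted into [-0.5, 0.5)
b = curand((10,))

# keep a[i] where a[i] > 0, take 0 elsewhere
zeros = gpuarray.zeros_like(a)
clamped = gpuarray.if_positive(a, a, zeros)

hi = gpuarray.maximum(a, b)      # elementwise max
lo = gpuarray.minimum(a, b)      # elementwise min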

Page 4:

Elementwise Kernel

● Avoid extra store-fetch cycles for elementwise math

[Diagram: Z = a·X + b·Y computed in a single fused elementwise pass, instead of separate scale and add kernels with intermediate stores]

Page 5:

Elementwise Kernel Example


import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.curandom import rand as curand

a_gpu = curand((50,))
b_gpu = curand((50,))

from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]",
    "linear_combination")

c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

import numpy.linalg as la
assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5

Page 6:

Reduction Kernel Example


import numpy
import pycuda.autoinit
from pycuda.reduction import ReductionKernel
from pycuda.curandom import rand as curand

dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
    reduce_expr="a+b", map_expr="x[i]*y[i]",
    arguments="float *x, float *y")

x = curand((1000*1000), dtype=numpy.float32)
y = curand((1000*1000), dtype=numpy.float32)

x_dot_y = dot(x, y).get()
x_dot_y_cpu = numpy.dot(x.get(), y.get())

● Example: A dot product calculation

Page 7:

Parallel Scan / Prefix Sum

[Diagram: input array X and its prefix-sum output Y]

Page 8:

Prefix Sum Example


import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.scan import InclusiveScanKernel

knl = InclusiveScanKernel(np.int32, "a+b")

n = 2**20 - 2**18 + 5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = gpuarray.to_gpu(host_data)

knl(dev_data)
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()

Page 9:

Custom Data Types

● Use your own data types in scan and reduction
● Define the custom type in the preamble
● tools.register_dtype(dtype, name)
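
Below is a minimal sketch of the pattern. The point struct, its helper functions, and the reduction kernel are illustrative, and register_dtype is assumed to be available under this name (newer PyCUDA releases spell it tools.get_or_register_dtype):

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda import tools
from pycuda.reduction import ReductionKernel

# hypothetical 2-float struct; layout must match the C definition below
point = numpy.dtype([("x", numpy.float32), ("y", numpy.float32)])
tools.register_dtype(point, "point")

preamble = """
struct point { float x; float y; };
__device__ point make_point(float x, float y)
{ point p; p.x = x; p.y = y; return p; }
__device__ point add_point(point a, point b)
{ return make_point(a.x + b.x, a.y + b.y); }
"""

# reduce an array of points to their componentwise sum
sum_points = ReductionKernel(point, neutral="make_point(0.f, 0.f)",
    reduce_expr="add_point(a, b)", map_expr="pts[i]",
    arguments="point *pts", preamble=preamble)

pts = gpuarray.zeros(128, dtype=point)   # fill with real data in practice
total = sum_points(pts).get()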

Page 10:

Elementwise Math Functions

● Rounding and absolute value
– fabs, ceil, floor

● Exponentials, logarithms and roots
– exp, log, log10, sqrt

● Trigonometric functions
– sin, cos, tan, asin, acos, atan

● Hyperbolic functions
– sinh, cosh, tanh

● Floating point decomposition and assembly
– fmod, frexp, ldexp, modf
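
These functions live in pycuda.cumath and operate elementwise on GPU arrays; a short sketch:

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath

a = gpuarray.to_gpu(numpy.linspace(0.1, 4.0, 8).astype(numpy.float32))

# each call launches one elementwise kernel and returns a new GPUArray
roots = cumath.sqrt(a)
logs = cumath.log(a)
sines = cumath.sin(a)
print(roots.get())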

Page 11:

Random Number Generation

● curandom.rand(shape)

– Returns values in the range [0, 1)

● For more control over random number generation, use the following curandom classes:

– XORWOWRandomNumberGenerator()

– Sobol32RandomNumberGenerator()

– ScrambledSobol32RandomNumberGenerator()

– Sobol64RandomNumberGenerator()

– ScrambledSobol64RandomNumberGenerator()
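
A minimal sketch contrasting the convenience function with a generator object; gen_uniform and gen_normal produce uniform and normally distributed samples respectively:

import numpy
import pycuda.autoinit
from pycuda.curandom import rand as curand, XORWOWRandomNumberGenerator

# quick one-shot uniforms in [0, 1)
a = curand((1000,), dtype=numpy.float32)

# generator object: reusable state, more distributions
rng = XORWOWRandomNumberGenerator()
u = rng.gen_uniform((1000,), dtype=numpy.float32)
n = rng.gen_normal((1000,), dtype=numpy.float32)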

Page 12:

Monte Carlo Simulation
Calculate PI

● Circle of radius R inscribed in a square with side length 2R
● Square area = (2R)²
● Circle area = πR²
● Ratio of the two areas is π/4
● Pick N random points in the square; about Nπ/4 of them fall inside the circle
● If M points land inside the circle: π ≈ 4M/N

Page 13:

Monte Carlo PI Example


import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.curandom import XORWOWRandomNumberGenerator
from pycuda.reduction import ReductionKernel

rng = XORWOWRandomNumberGenerator()

N = 10000000

x_gpu = rng.gen_uniform((N,), dtype=numpy.float32)
y_gpu = rng.gen_uniform((N,), dtype=numpy.float32)

circle = ReductionKernel(numpy.dtype(numpy.float32), neutral="0",
    reduce_expr="a+b",
    map_expr="float((x[i]*x[i]+y[i]*y[i])<=1.0f)",
    arguments="float *x, float *y")

result = 4.0 * circle(x_gpu, y_gpu).get() / N

print 'Estimate for PI on GPU: {}'.format(result)

Page 14:

CUDA Programming Paradigm

Page 15:

Performance Analysis Tools

● CUDA_PROFILE=1 python calculate.py

– Generates a cuda_profile_0.log file in the same directory

– Provides a breakdown of each method executed, its GPU time, CPU time and occupancy

● CUDA Visual Profiler
– Run the computeprof executable
– Plots graphs for easy analysis

Page 16:

Event Timing Example


import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.curandom as curandom
import numpy

# create two timers
start = cuda.Event()
end = cuda.Event()

# record the starting time
start.record()

# perform gpu computations
for i in range(1000):
    curandom.rand((10000000,))

# record finishing time
end.record()
end.synchronize()

print "GPU time: %.2f seconds" % (start.time_till(end)*1e-3)

Page 17:

Occupancy

● pycuda.tools.OccupancyRecord

– tb_per_mp
● How many thread blocks execute on each multi-processor

– limited_by
● What tb_per_mp is limited by. One of "device", "warps", "regs", "smem"

– warps_per_mp
● How many warps execute on each multi-processor

– occupancy
● A float value between 0 and 1 indicating how much of each multi-processor's scheduling capability is occupied by the kernel
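
A minimal sketch of querying an OccupancyRecord, assuming a hypothetical launch configuration (256 threads per block, 20 registers per thread, 4 KB shared memory per block):

import pycuda.autoinit
from pycuda.tools import DeviceData, OccupancyRecord

devdata = DeviceData()   # limits of the current device

# hypothetical kernel resource usage
occ = OccupancyRecord(devdata, threads=256, shared_mem=4096, registers=20)

print(occ.tb_per_mp)
print(occ.warps_per_mp)
print(occ.limited_by)
print(occ.occupancy)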

Page 18:

Important Hardware Properties

● Warp size
● Maximum block dimensions
● Max threads per block
● Max threads per multiprocessor
● Multiprocessor count

Page 19:

Query Hardware Example


import pycuda.driver as drv

drv.init()
print "%d device(s) found." % drv.Device.count()

for ordinal in range(drv.Device.count()):
    dev = drv.Device(ordinal)
    print "Device #%d: %s" % (ordinal, dev.name())
    print "  Compute Capability: %d.%d" % dev.compute_capability()
    print "  Total Memory: %s KB" % (dev.total_memory()//(1024))
    atts = [(str(att), value)
            for att, value in dev.get_attributes().iteritems()]
    atts.sort()
    for att, value in atts:
        print "  %s: %s" % (att, value)

Page 20:

Simple Optimisation Strategy

● Choose parameters based on your hardware constraints (e.g. block dimensions)

● Threads per block should be multiple of warp size

● Each multiprocessor should have enough active warps to hide instruction and memory latency

● Pad input data if necessary (a short sketch follows this list)
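
A minimal sketch of applying these rules, with a hypothetical problem size n and an example block size:

import pycuda.autoinit
import pycuda.driver as cuda

dev = pycuda.autoinit.device
warp_size = dev.get_attribute(cuda.device_attribute.WARP_SIZE)
max_threads = dev.get_attribute(cuda.device_attribute.MAX_THREADS_PER_BLOCK)

n = 100000                       # hypothetical problem size
block = 8 * warp_size            # a multiple of the warp size, e.g. 256
assert block <= max_threads

# round the grid up so the padded threads cover all n elements
grid = (n + block - 1) // block
print(block)
print(grid)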

Page 21:

N-Body Simulation

Page 22:

N-Body Example

