Introduction to GPGPUs and Massively Threaded Programming
Barzan Shkeh
Transcript
Slide 1
Barzan Shkeh
Slide 2
Outline
- Introduction
- Massive multithreading
- GPGPU
- CUDA memory types
- CUDA C/C++ programming
- CUDA in Bioinformatics
Slide 3
Introduction Today, science and technology are inextricably
linked. Human insight in bioinformatics, in particular, is driven
by the vast amount of data that can be collected with sufficient
computational capability to extract, analyze, model, and visualize
results.
Slide 4
Introduction cont. For example, the Harvard Connectome project is in the process of creating a complete wiring diagram of a rat brain at 3 nm/pixel resolution using automated slicing and data-collection instruments.
Slide 5
Introduction cont. Many important problems have remained intractable because there were no computers powerful enough, or because scientists simply could not afford access to machines with the necessary capabilities. The current revolution in scientific computation is happening because the intense computation required by computer graphics, driven mainly by the computer gaming industry, has evolved graphics processors into extremely capable yet low-cost general-purpose computation platforms.
Slide 6
Introduction cont. GPGPU (general-purpose graphics processing unit). Many scientists and programmers, using existing tools, are able to achieve one to two orders of magnitude (10x-100x) of performance increase over conventional hardware when running their applications on GPGPUs.
Slide 7
Massive Multithreading Massive multithreading is the key to harnessing the computational power of GPGPUs because it provides a common paradigm that both programmers and hardware designers can exploit to attain the highest possible performance. It permits graphics processors to achieve extremely high floating-point performance because the latency of memory accesses can be hidden and the full bandwidth of the memory subsystem can be utilized.
Slide 8
Massive Multithreading cont. Roughly speaking, graphics processors can be considered streaming processors, because the best performance is achieved when coalesced memory operations are used to simultaneously stream data from all of the on-board graphics memory banks. A coalesced memory operation combines simultaneous memory accesses by multiple threads into a single memory transaction.
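As an illustration (not from the original slides; kernel and variable names are made up), the first kernel below lets each warp issue coalesced loads, while the second forces strided, uncoalesced ones:

    // Coalesced: consecutive threads read consecutive addresses,
    // so a warp's 32 loads combine into a few memory transactions.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: consecutive threads touch addresses `stride` elements apart,
    // so the same warp issues many separate transactions.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];
    }

On most GPUs the coalesced version tends to approach the full memory bandwidth, while the strided one wastes most of each transaction.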
Slide 9
GPGPU
Slide 10
GPGPU cont. GPU hardware effectively evolved into a single-program, multiple-data (SPMD) architecture. NVIDIA generally bundles 32 threads into a warp, which runs in single-instruction, multiple-data (SIMD) fashion on each streaming multiprocessor (SM).
Slide 11
Slide 12
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA. The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives, and extensions to industry-standard programming languages, including C, C++, etc.
Slide 13
Slide 14
CUDA Memory Types cont.
- Global memory (read and write): slow, but cached
- Texture memory (read only): cache optimized for 2D access patterns
- Constant memory: where constants and kernel arguments are stored
- Shared memory (48 KB per SM): fast, but subject to (different) bank conflicts
- Local memory: used for whatever doesn't fit into registers; part of global memory; slow, but now cached
- Registers: 32,768 32-bit registers per SM
Slide 15
CUDA Memory Types cont. Using registers
- Registers are read/write per-thread
- Can't access registers outside of each thread
- Used for storing local variables in functions, etc.
- No special syntax for doing this: just declare local variables as usual
- Physically stored in each SM
- Can't index (no arrays)
- Obviously, can't access from host code
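A minimal sketch (not from the slides; names are illustrative) showing that ordinary local scalars need no qualifier and end up in registers:

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // i and tmp are
        if (i < n) {                                    // per-thread registers
            float tmp = data[i] * factor;  // ordinary local variable
            data[i] = tmp;
        }
    }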
Slide 16
CUDA Memory Types cont. Local memory
- Also read/write per-thread
- A thread can't read another thread's local memory
- Arrays declared inside a kernel, e.g. float results[32];, are placed in local memory when they don't fit in registers (CUDA C has no separate keyword for this)
- Can index (this is where local arrays go)
- Much slower than register memory!
- Don't use local arrays if you don't have to
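A small sketch (illustrative names) of the kind of per-thread array that typically ends up in local memory; note the compiler may still keep small, constant-indexed arrays in registers:

    __global__ void reverse_chunks(float *data, int n) {
        float buf[32];  // per-thread array; typically spilled to local memory
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
        if (base + 32 <= n) {
            for (int j = 0; j < 32; ++j)
                buf[j] = data[base + j];
            for (int j = 0; j < 32; ++j)
                data[base + j] = buf[31 - j];  // runtime indexing
        }
    }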
Slide 17
CUDA Memory Types cont. Shared memory
- Read/write per-block
- All threads in a block share the same memory
- In general, pretty fast
- Certain access patterns (e.g. bank conflicts) can hinder performance...
Slide 18
CUDA Memory Types cont. Using shared memory
- Declared much like local variables; a dynamically sized array must be extern:
  extern __shared__ float current_row[];
- Only declare one variable this way! Multiple extern __shared__ declarations will occupy the same memory space:
  extern __shared__ float a[];
  extern __shared__ float b[];
  b[0] = 0.5f; // now a[0] == 0.5f also!
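A hedged sketch of the usual pattern (names are illustrative): one extern __shared__ array whose size is supplied as the third parameter of the <<<...>>> launch configuration:

    extern __shared__ float current_row[];   // sized at launch time

    __global__ void row_sum(const float *in, float *out, int width) {
        int tx = threadIdx.x;
        current_row[tx] = in[blockIdx.x * width + tx];  // stage one row
        __syncthreads();            // make all writes visible to the block

        if (tx == 0) {
            float s = 0.0f;
            for (int j = 0; j < width; ++j)
                s += current_row[j];
            out[blockIdx.x] = s;
        }
    }

    // Launch with the shared-memory size as the third parameter:
    // row_sum<<<rows, width, width * sizeof(float)>>>(d_in, d_out, width);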
Slide 19
CUDA Memory Types cont. Global memory
- Read/write per-application
- Can share between blocks and grids
- Persistent across kernel executions
- Un-cached on early GPUs (later architectures add a cache, as noted above)
- Really slow!
Slide 20
CUDA Memory Types cont. Constant memory
- Read-only from device code
- Cached in each SM
- The cache can broadcast to every running thread - very efficient!
- Keyword: __constant__
- Accessed from device code like normal variables
- Set values from host code with cudaMemcpyToSymbol
- Can't use pointers
- Can't dynamically allocate
Slide 21
CUDA Memory Types cont. Texture memory
- Read-only from device code
- Complex 2D caching method
- Linear filtering/interpolation available
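A hedged sketch using the legacy texture-reference API of this slide deck's era (since deprecated and removed in newer CUDA toolkits; names are illustrative):

    texture<float, 2, cudaReadModeElementType> tex;  // 2D float texture

    __global__ void sample(float *out, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < w && y < h)   // hardware performs the cached 2D lookup
            out[y * w + x] = tex2D(tex, x + 0.5f, y + 0.5f);
    }

    // Host side (sketch):
    // cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    // cudaArray *arr; cudaMallocArray(&arr, &desc, w, h);
    // cudaMemcpyToArray(arr, 0, 0, h_data, w * h * sizeof(float),
    //                   cudaMemcpyHostToDevice);
    // tex.filterMode = cudaFilterModeLinear;   // enable interpolation
    // cudaBindTextureToArray(tex, arr);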
Slide 22
CUDA C/C++ programming
Slide 23
Heterogeneous Computing Terminology:
- Host: the CPU and its memory (host memory)
- Device: the GPU and its memory (device memory)
Slide 24
Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

Standard C that runs on the host. The NVIDIA compiler (nvcc) can be used to compile programs with no device code.

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$
Slide 25
Hello World! with Device Code

#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

The CUDA C/C++ keyword __global__ indicates a function that:
- Runs on the device
- Is called from host code
nvcc separates source code into host and device components:
- Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
- Host functions (e.g. main()) are processed by the standard host compiler (gcc, cl.exe)
Slide 26
Memory Management
Host and device memory are separate entities.
Device pointers point to GPU memory:
- May be passed to/from host code
- May not be dereferenced in host code
Host pointers point to CPU memory:
- May be passed to/from device code
- May not be dereferenced in device code
Simple CUDA API for handling device memory: cudaMalloc(), cudaFree(), cudaMemcpy()
Similar to the C equivalents malloc(), free(), memcpy()
Slide 27
Addition on the Device: add()

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

int main(void) {
    int *a, *b, *c;          // host copies of a, b, c
    int *d_a, *d_b, *d_c;    // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
Slide 28
Addition on the Device: add() cont.

    // Alloc space for host copies of a, b, c and set up input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Slide 29
A kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.
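When N is not an exact multiple of the block size, the usual pattern (a sketch, not from the slides; the array length n is passed as an extra argument) rounds the block count up and guards the index:

    __global__ void add(const int *a, const int *b, int *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)                 // last block may be partially full
            c[index] = a[index] + b[index];
    }

    // int threads = 512;
    // int blocks  = (N + threads - 1) / threads;   // round up
    // add<<<blocks, threads>>>(d_a, d_b, d_c, N);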
Slide 30
CUDA in Bioinformatics MUMmerGPU: high-throughput DNA sequence alignment using GPUs.
Slide 31
CUDA in Bioinformatics SmithWaterman-CUDA performs alignments between one or more sequences and a database (all the sequences, including those in the database, are assumed to be proteins). LISSOM (Laterally Interconnected Synergetically Self-Organizing Map) is a model of the human neocortex (mainly modeled on the visual cortex) at the neural-column level.
Slide 32
CUDA in Bioinformatics CUDA-MEME is an ultrafast, scalable motif discovery algorithm based on MEME (Multiple Expectation maximization for Motif Elicitation). CUSHAW: a CUDA-compatible short-read aligner for large genomes based on the Burrows-Wheeler transform.