Introduction to GPGPUs and Massively Threaded Programming Barzan Shkeh
Transcript
  • Slide 1
  • Barzan Shkeh
  • Slide 2
  • Outline Introduction Massive multithreading GPGPU CUDA memory types CUDA C/C++ programming CUDA in Bioinformatics
  • Slide 3
  • Introduction Today, science and technology are inextricably linked. Human insight in bioinformatics, in particular, is driven by the vast amount of data that can be collected, together with sufficient computational capability to extract, analyze, model, and visualize results.
  • Slide 4
  • Introduction cont. For example, the Harvard Connectome project is in the process of creating a complete wiring diagram of a rat brain at 3 nm/pixel resolution using automated slicing and data-collection instruments.
  • Slide 5
  • Introduction cont. Many important problems have remained intractable because there was no computer powerful enough, or because scientists simply could not afford access to machines with the necessary capabilities. The current revolution in scientific computation is happening because intense computation in computer graphics, mainly driven by the computer gaming industry, has evolved graphics processors into extremely capable yet low-cost general-purpose computation platforms.
  • Slide 6
  • Introduction cont. GPGPU (general-purpose graphics processing unit). Many scientists and programmers, using existing tools, are able to achieve a one to two orders of magnitude (10x-100x) performance increase over conventional hardware when running their applications on GPGPUs.
  • Slide 7
  • Massive Multithreading Massive multithreading is the key to harnessing the computational power of GPGPUs because it provides a common paradigm that both programmers and hardware designers can exploit to attain the highest possible performance. It permits graphics processors to achieve extremely high floating-point performance, because the latency of memory accesses can be hidden and the full bandwidth of the memory subsystem can be utilized.
  • Slide 8
  • Massive Multithreading cont. Roughly speaking, graphics processors can be considered streaming processors, because the best performance is achieved when coalesced memory operations are used to simultaneously stream data from all of the on-board graphics memory banks. A coalesced memory operation combines simultaneous memory accesses by multiple threads into a single memory transaction.
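    As an illustration of the point above (a sketch, not from the slides; kernel names and data layout are assumed), compare a coalesced copy with a strided one:

      // Coalesced: consecutive threads read consecutive addresses, so the 32
      // accesses of a warp can be combined into one (or a few) transactions.
      __global__ void copy_coalesced(const float *in, float *out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = in[i];
      }

      // Strided: consecutive threads touch addresses 'stride' elements apart,
      // so each warp needs many separate transactions and wastes bandwidth.
      __global__ void copy_strided(const float *in, float *out, int n, int stride) {
          int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
          if (i < n) out[i] = in[i];
      }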
  • Slide 9
  • GPGPU
  • Slide 10
  • GPGPU cont. GPU hardware has effectively evolved into a single-program, multiple-data (SPMD) architecture. NVIDIA generally bundles 32 threads into a warp, which runs in single-instruction, multiple-data (SIMD) fashion on each streaming multiprocessor (SM).
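    A small side note (an assumed example, not on the slide) on what SIMD execution of a warp implies: a data-dependent branch makes the warp run both paths one after the other, with the threads on the inactive path masked off.

      __global__ void divergent(int *out) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i % 2 == 0)      // even and odd lanes of the same warp diverge,
              out[i] = 1;      // so the two branches execute serially
          else
              out[i] = 2;
      }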
  • Slide 11
  • Slide 12
  • CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA. The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives, and extensions to industry-standard programming languages, including C and C++.
  • Slide 13
  • Slide 14
  • CUDA Memory Types cont. Global memory (read and write): slow, but cached. Texture memory (read only): cache optimized for 2D access patterns. Constant memory: where constants and kernel arguments are stored. Shared memory (48 KB per SM): fast, but subject to (different) bank conflicts. Local memory: used for whatever doesn't fit into registers; part of global memory; slow, but now cached. Registers: 32768 32-bit registers per SM.
  • Slide 15
  • CUDA Memory Types cont. Using registers: registers are read/write per-thread. Can't access a thread's registers from outside that thread. Used for storing local variables in functions, etc. No special syntax for doing this - just declare local variables as usual. Physically stored in each multiprocessor. Can't be indexed (no arrays). Obviously, can't be accessed from host code.
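    A minimal sketch (names are illustrative, not from the slides) of per-thread local variables that the compiler normally keeps in registers:

      __global__ void saxpy(float a, const float *x, const float *y, float *out, int n) {
          // i, xi and yi are ordinary per-thread local variables; with no
          // dynamic indexing and low register pressure, they live in registers.
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) {
              float xi = x[i];
              float yi = y[i];
              out[i] = a * xi + yi;
          }
      }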
  • Slide 16
  • CUDA Memory Types cont. Local memory: also read/write per-thread. Can't read other threads' local memory. Declare a variable in local memory using the __local__ keyword: __local__ float results[32]; Can be indexed (this is where local arrays go). Much slower than register memory! Don't use local arrays if you don't have to.
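    For comparison, a sketch (an assumed example) of a per-thread array that is indexed dynamically and therefore typically ends up in local memory:

      __global__ void per_thread_histogram(const unsigned char *data, int n, int *out) {
          // Registers cannot be indexed, so a dynamically indexed per-thread
          // array such as counts[] is normally placed in (slow) local memory.
          int counts[32];
          for (int b = 0; b < 32; ++b) counts[b] = 0;

          // Each thread walks the input with a grid-wide stride.
          for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
               i += gridDim.x * blockDim.x)
              counts[data[i] % 32] += 1;

          // Merge the per-thread counts into the global histogram.
          for (int b = 0; b < 32; ++b)
              atomicAdd(&out[b], counts[b]);
      }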
  • Slide 17
  • CUDA Memory Types cont. Shared memory: read/write per-block. All threads in a block share the same memory. In general, pretty fast. Certain cases can hinder performance...
  • Slide 18
  • CUDA Memory Types cont. Using shared memory: similar to local memory: extern __shared__ float current_row[]; Only declare one such variable as shared! Multiple declarations of extern __shared__ variables will occupy the same memory space: extern __shared__ float a[]; extern __shared__ float b[]; b[0] = 0.5f; // now a[0] == 0.5f also!
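    A minimal sketch (kernel name and launch parameters are assumptions) of how such a dynamically sized shared array is declared with extern and used by all threads of a block:

      __global__ void reverse_in_block(float *d) {
          extern __shared__ float tile[];            // size given at launch time
          int t = threadIdx.x;

          tile[t] = d[blockIdx.x * blockDim.x + t];  // stage this block's slice
          __syncthreads();                           // wait for every thread in the block

          d[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
      }

      // Host side: the third launch parameter sets the shared-memory size in bytes.
      // reverse_in_block<<<blocks, 256, 256 * sizeof(float)>>>(d_data);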
  • Slide 19
  • CUDA Memory Types cont. Global memory: read/write per-application. Can be shared between blocks and grids. Persistent across kernel executions. Un-cached on early GPUs (newer devices cache it, as noted above). Really slow!
  • Slide 20
  • CUDA Memory Types cont. Constant memory: read-only from the device. Cached in each SM. The cache can broadcast a value to every running thread - very efficient! Keyword: __constant__. Access from device code like normal variables. Set values from host code with cudaMemcpyToSymbol. Can't use pointers. Can't dynamically allocate.
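    A minimal sketch of the __constant__ / cudaMemcpyToSymbol usage described above (symbol and kernel names are assumptions):

      __constant__ float d_coeffs[16];               // lives in constant memory

      __global__ void weighted_sum(const float *in, float *out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= n) return;
          float s = 0.0f;
          // In each iteration every active thread of the warp reads the same
          // d_coeffs[k], which the constant cache can broadcast efficiently.
          for (int k = 0; k < 16 && i + k < n; ++k)
              s += d_coeffs[k] * in[i + k];
          out[i] = s;
      }

      // Host side: fill the constant symbol before launching the kernel.
      // float h_coeffs[16] = { ... };
      // cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));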
  • Slide 21
  • CUDA Memory Types cont. Texture memory: read-only from the device. Complex 2D caching method. Linear filtering/interpolation available.
  • Slide 22
  • CUDA C/C++ programming
  • Slide 23
  • Heterogeneous Computing Terminology: -Host: the CPU and its memory (host memory) -Device: the GPU and its memory (device memory)
  • Slide 24
  • Hello World! int main(void) { printf("Hello World!\n"); return 0; } Standard C that runs on the host. The NVIDIA compiler (nvcc) can be used to compile programs with no device code. Output: $ nvcc hello_world.cu $ a.out Hello World! $
  • Slide 25
  • Hello World! with Device Code __global__ void mykernel(void) { } int main(void) { mykernel<<<1,1>>>(); printf("Hello World!\n"); return 0; } The CUDA C/C++ keyword __global__ indicates a function that: -Runs on the device -Is called from host code. nvcc separates source code into host and device components: -Device functions (e.g. mykernel()) are processed by the NVIDIA compiler -Host functions (e.g. main()) are processed by a standard host compiler (gcc, cl.exe)
  • Slide 26
  • Memory Management Host and device memory are separate entities. Device pointers point to GPU memory: -May be passed to/from host code -May not be dereferenced in host code. Host pointers point to CPU memory: -May be passed to/from device code -May not be dereferenced in device code. Simple CUDA API for handling device memory: cudaMalloc(), cudaFree(), cudaMemcpy() - similar to the C equivalents malloc(), free(), memcpy()
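    A short round trip through the three calls listed above (a sketch; buffer names are illustrative):

      #include <cstdlib>
      #include <cuda_runtime.h>

      int main(void) {
          int n = 1024;
          size_t size = n * sizeof(float);
          float *h_buf = (float *)malloc(size);      // host memory
          float *d_buf = NULL;
          cudaMalloc((void **)&d_buf, size);         // device memory

          cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);  // host -> device
          // ... launch kernels that read and write d_buf here ...
          cudaMemcpy(h_buf, d_buf, size, cudaMemcpyDeviceToHost);  // device -> host

          cudaFree(d_buf);
          free(h_buf);
          return 0;
      }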
  • Slide 27
  • Addition on the Device: add() __global__ void add(int *a, int *b, int *c) { int index = threadIdx.x + blockIdx.x * blockDim.x; c[index] = a[index] + b[index]; } #define N (2048*2048) #define THREADS_PER_BLOCK 512 int main(void) { int *a, *b, *c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = N * sizeof(int); // Alloc space for device copies of a, b, c cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size);
  • Slide 28
  • Addition on the Device: add() a = (int *)malloc(size); random_ints(a, N); b = (int *)malloc(size); random_ints(b, N); c = (int *)malloc(size); // Copy inputs to device cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice); // Launch add() kernel on GPU add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c); // Copy result back to host cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); free(a); free(b); free(c); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); return 0; }
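    The slides call a random_ints() helper without showing it; a plausible minimal version (an assumption, not part of the original deck) is:

      #include <stdlib.h>

      // Fill an array with n pseudo-random integers (host-side helper).
      void random_ints(int *a, int n) {
          for (int i = 0; i < n; ++i)
              a[i] = rand();
      }

    In production code each cudaMalloc/cudaMemcpy call would also be checked against cudaSuccess.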
  • Slide 29
  • A kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.
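    A common way to pick this launch configuration (the numbers are illustrative): round the block count up and guard the last, possibly partial, block inside the kernel.

      int n = 1000000;                // problem size, not a multiple of the block size
      int threadsPerBlock = 256;
      int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // round up

      // blocks * threadsPerBlock >= n, so the kernel needs the problem size as an
      // argument and must check "if (index < n)" before touching its arrays.
      // kernel<<<blocks, threadsPerBlock>>>(/* device pointers */, n);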
  • Slide 30
  • CUDA in Bioinformatics MUMmerGPU: high-throughput DNA sequence alignment using GPUs.
  • Slide 31
  • CUDA in Bioinformatics SmithWaterman-CUDA performs alignments between one or more sequences and a database (all of the sequences, including those in the database, are expected to be protein sequences). LISSOM (Laterally Interconnected Synergetically Self-Organizing Map) is a model of the human neocortex (mainly modeled on the visual cortex) at the neural-column level.
  • Slide 32
  • CUDA in Bioinformatics CUDA-MEME is an ultrafast, scalable motif discovery algorithm based on MEME (Multiple Expectation maximization for Motif Elicitation). CUSHAW: a CUDA-compatible short-read aligner to large genomes based on the Burrows-Wheeler transform.
