M.I. Zuev, O.I. Streltsova, M. Vala
Heterogeneous Computations Team, HybriLIT
Laboratory of Information Technologies
Joint Institute for Nuclear Research
Parallel Computing with CUDA
SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS, JINR, DUBNA
2-6 November, 2015
NVIDIA Tesla K40 "Atlas" GPU Accelerator
Supports CUDA and OpenCL Specifications
• 2880 CUDA GPU cores
• Performance: 4.29 TFLOPS single-precision; 1.43 TFLOPS double-precision
• 12 GB of global memory
• Memory bandwidth up to 288 GB/s
CUDA (Compute Unified Device Architecture) programming model, CUDA C
[Figure: CPU vs. GPU architecture. A CPU has a few cores (Core 1 … Core 4), while the GPU consists of 15 multiprocessors (Multiprocessor 1 … Multiprocessor 15) with 192 cores each, i.e. 2880 CUDA GPU cores in total. Source: http://blog.goldenhelix.com/?p=374]
CUDA architecture (Kepler™ GK110/210)
Source: http://international.download.nvidia.com/pdf/kepler/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
Streaming Multiprocessor (SMX) Architecture:
• 192 single-precision CUDA cores
• 64 double-precision units
• 32 special function units (SFU)
• 32 load/store units (LD/ST)
Source: http://www.realworldtech.com/includes/images/articles/g100-2.gif
CUDA (Compute Unified Device Architecture) programming model
Tutorial's materials

#Task 0: deviceInfo_CUDA.cu
Location: /tmp/Tutorial_AIS-GRID/CUDA/

Module (NVIDIA CUDA Compiler Driver NVCC; full information: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#axzz37LQKVSFi):
$ module add hlit/cuda/7.0

Compilation:
$ nvcc -arch=compute_35 deviceInfo_CUDA.cu -o test_CUDA

Script:
#!/bin/sh
#SBATCH -p gpu
#SBATCH --gres=gpu:1
srun test_CUDA

Execution:
$ sbatch script_CUDA
Submitted batch job 8881
Definitions: device = GPU; host = CPU; kernel = a function that runs on the device.
Parallel parts of an application are executed on the device as kernels:
• One kernel is executed at a time
• Many threads execute each kernel
Function Qualifiers
Function type qualifiers specify whether a function executes on the host or on the device, and from where it is callable.

Declaration specifier   Callable from   Executed on
__global__              host            device
__host__                host            host
__device__              device          device
Launching a kernel: kernel <<< dim3 dG, dim3 dB >>> (…)
• dG – dimension and size of the grid in blocks
  - Two-dimensional: x and y
  - Blocks launched in the grid: dG.x * dG.y
• dB – dimension and size of each block in threads
  - Three-dimensional: x, y and z
  - Threads per block: dB.x * dB.y * dB.z
dim3 is an integer vector type that can be used in CUDA code:
dim3 dG(512);        // 512 x 1 x 1
dim3 dB(1024, 1024); // 1024 x 1024 x 1 (illustrates the dim3 syntax; note that a real block is limited to 1024 threads in total)
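As a minimal sketch of a one-dimensional launch (the kernel, array size and configuration below are illustrative, not from the tutorial's materials):

__global__ void fill(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = i;                   // each thread writes its own global index
}

int main() {
    int *dA;
    cudaMalloc((void**)&dA, 512 * 256 * sizeof(int));
    dim3 dG(512);               // grid: 512 x 1 x 1 blocks
    dim3 dB(256);               // block: 256 x 1 x 1 threads
    fill<<<dG, dB>>>(dA);       // dG.x * dB.x = 131072 threads in total
    cudaDeviceSynchronize();
    cudaFree(dA);
    return 0;
}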
CUDA (Compute Unified Device Architecture) programming model: CUDA C
Device Memory Hierarchy
• Registers: on-chip and fast; off-chip local memory has high latency
• Shared memory: tens of KB per block, on-chip, very fast
• Global memory: size up to 12 GB, high latency
• Random access to global memory is very expensive; coalesced access is much more efficient
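To illustrate the access-pattern point, a hedged sketch (both kernels are ours, not from the tutorial's materials):

// Coalesced: consecutive threads read consecutive addresses, so the
// loads of a warp combine into a few wide memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// so each warp issues many separate transactions.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (threadIdx.x + blockIdx.x * blockDim.x) * stride;
    if (i < n) out[i] = in[i];
}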
Threads and blocks
[Figure: a grid of blocks (Block 0 … Block 3), each containing the same arrangement of threads (Thread 0, Thread 1, …).]
int tid = threadIdx.x + blockIdx.x * blockDim.x;
tid – the global index of a thread
Hybrid model of code

__global__ void kernel_1 (args) {
    …
}
int main() {
    …
    kernel_1 <<< nBlock1, nThread1, nShMem, nStream >>> (args);
    …
}
dim3 nBlock – dimension of the grid, dim3 nThread – dimension of blocks

Execution flow:
Serial code → Kernel_1 <<< nBlock1, nThread1 >>> (args); → Serial code → Kernel_2 <<< nBlock2, nThread2 >>> (args); → …
Function Type Qualifiers
[Figure: __host__ code runs on the CPU; __global__ kernels are launched from the CPU and executed on the GPU; __device__ functions are called and executed on the GPU.]

__global__ void kernel (void) {
    …
}
int main() {
    …
    kernel <<< nBlock, nThread, nShMem, nStream >>> (parameters);
    …
}
dim3 nBlock – dimension of the grid, dim3 nThread – dimension of blocks
Function Type Qualifiers

__global__ void kernel (args) {
    …
}
int main() {
    …
    kernel <<< nBlock, nThread, nShMem, nStream >>> (args);
    …
}
• dim3 nBlock – dimension of the grid (number of blocks)
• dim3 nThread – dimension of blocks (threads per block)
• nShMem – the amount of shared memory to be allocated for each block, in bytes
• nStream – the stream (a cudaStream_t) in which the kernel is launched
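A launch that uses all four configuration parameters might look as follows (a sketch; the kernel body, sizes and stream handling are our assumptions):

__global__ void kernel(int *data) {
    extern __shared__ double shm[];       // sized by the 3rd launch parameter
    shm[threadIdx.x] = (double)data[threadIdx.x];
    __syncthreads();
    // … work with shm …
}

int main() {
    int *dData;
    cudaMalloc((void**)&dData, 128 * sizeof(int));
    cudaMemset(dData, 0, 128 * sizeof(int));
    dim3 nBlock(64);                       // grid: 64 blocks
    dim3 nThread(128);                     // block: 128 threads
    size_t nShMem = 128 * sizeof(double);  // dynamic shared memory per block
    cudaStream_t stream;                   // 4th parameter: a stream
    cudaStreamCreate(&stream);
    kernel<<<nBlock, nThread, nShMem, stream>>>(dData);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(dData);
    return 0;
}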
Scheme of a program in CUDA C/C++ vs. plain C/C++

1. Memory allocation
   CUDA C/C++: cudaMalloc(void **devPtr, size_t size);
   C/C++:      void *malloc(size_t size);

2. Copying data
   CUDA C/C++: cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);
               copies: host → device, device → host, host ↔ host, device ↔ device
   C/C++:      void *memcpy(void *destination, const void *source, size_t num);

3. Function call
   CUDA C/C++: kernel <<< gridDim, blockDim >>> (args);
   C/C++:      function(args);

4. Copying the results to the host
   CUDA C/C++: cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyDeviceToHost);
   C/C++:      ---
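Putting the four steps together, a minimal end-to-end sketch (the kernel and array size are illustrative assumptions):

#include <stdio.h>
#define N 1024

__global__ void kernel(int n, float *x) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) x[i] *= 2.0f;                 // some device-side computation
}

int main() {
    float hX[N];
    for (int i = 0; i < N; i++) hX[i] = (float)i;

    float *dX;
    cudaMalloc((void**)&dX, N * sizeof(float));                    // 1. allocation
    cudaMemcpy(dX, hX, N * sizeof(float), cudaMemcpyHostToDevice); // 2. copy in
    kernel<<<(N + 255) / 256, 256>>>(N, dX);                       // 3. kernel call
    cudaMemcpy(hX, dX, N * sizeof(float), cudaMemcpyDeviceToHost); // 4. copy out
    cudaFree(dX);
    printf("hX[2] = %f\n", hX[2]);           // expect 4.0
    return 0;
}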
GPU vector sums: $c_i = a_i + b_i$, i = 0, …, N−1

Kernel call:
kernel <<< gridDim, blockDim, nSharedMem, nStream >>> (args);
gridDim – number of blocks in the grid; blockDim – number of threads per block

Serial code:
void sum(int N, int *a, int *b, int *c) {
    int tid = 0;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 1;
    }
}

CUDA code:
__global__ void kernel(int N, int *a, int *b, int *c) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;   // grid-stride step
    }
}
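A possible host-side driver for this kernel, assuming it is compiled in the same file (the grid and block sizes are our choice):

#include <stdio.h>
#define N 100000

int main() {
    int *hA = new int[N], *hB = new int[N], *hC = new int[N];
    for (int i = 0; i < N; i++) { hA[i] = i; hB[i] = 2 * i; }

    int *dA, *dB, *dC;
    cudaMalloc((void**)&dA, N * sizeof(int));
    cudaMalloc((void**)&dB, N * sizeof(int));
    cudaMalloc((void**)&dC, N * sizeof(int));
    cudaMemcpy(dA, hA, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, N * sizeof(int), cudaMemcpyHostToDevice);

    // 128 blocks of 128 threads; the grid-stride loop covers all N elements
    kernel<<<128, 128>>>(N, dA, dB, dC);

    cudaMemcpy(hC, dC, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("c[10] = %d\n", hC[10]);   // expect 30
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}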
Error Handling

Run-Time Errors: every CUDA call (except kernel launches) returns an error code of type cudaError_t.

Getting the last error:
cudaError_t cudaGetLastError(void)

//--------------------------------- Getting the last error ---------------------//
cudaError_t error = cudaGetLastError();
if (error != cudaSuccess) {
    printf("CUDA error, CUFFT: %s\n", cudaGetErrorString(error));
    exit(-1);
}
Error Handling: using a macro substitution that works with any CUDA call that returns an error code.

Example:
#define CUDA_CALL(x) do { cudaError_t err = x; if ((err) != cudaSuccess) { \
    printf("Error \"%s\" at %s:%d\n", cudaGetErrorString(err), \
           __FILE__, __LINE__); return -1; \
}} while (0)

CUDA_CALL(cudaMalloc((void**)&dA, n*n*sizeof(float)));

• GPU kernel launches return before completing, so you cannot immediately use cudaGetLastError to catch their errors:
  - First call cudaDeviceSynchronize() as a synchronization point;
  - Then call cudaGetLastError().
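A minimal sketch of that two-step check (the kernel and launch configuration are as in the earlier slides; the arrangement is ours):

kernel<<<nBlock, nThread>>>(args);    // returns before the kernel completes
cudaDeviceSynchronize();              // synchronization point
cudaError_t err = cudaGetLastError(); // now reports launch/execution errors
if (err != cudaSuccess) {
    printf("Kernel error: %s\n", cudaGetErrorString(err));
    return -1;
}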
Error Handling: Example

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define NX 64
//---------------------------------------------------------------------------
#define CUDA_CALL(x) do { cudaError_t err = x; if ((err) != cudaSuccess) { \
    printf("Error \"%s\" at %s:%d\n", cudaGetErrorString(err), \
           __FILE__, __LINE__); return -1; \
}} while (0)

int main() {
    printf(" =============== Test Error Handling ================ \n");
    printf("Array size NX = %d \n", NX);
    // Input data: allocate memory on host
    double* hA;
    hA = new double[NX];
    for (int i = 0; i < NX; i++) {
        hA[i] = (double) i;
    }
    …
Error Handling: Example (continued)

    // Allocate memory on device
    double* dev_A1;
    // cudaMalloc((void**)&dev_A1, NX*sizeof(double));
    CUDA_CALL(cudaMalloc((void**)&dev_A1, NX*sizeof(double)));

    //---------------------------- Last error ----------------------------//
    // Getting the last error
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA error : %s\n", cudaGetErrorString(error));
        exit(-1);
    }

    // Memory volume:
    size_t avail, total;
    cudaMemGetInfo(&avail, &total);   // in bytes
    size_t used = total - avail;
    printf("GPU memory total: %lf (Mb), Used memory: %lf (Mb)\n",
           total/(pow(2,20)), used/(pow(2,20)));
    …
Error Handling: Example (end)

    // Copy input data from Host to Device
    CUDA_CALL(cudaMemcpy(dev_A1, hA, NX*sizeof(double), cudaMemcpyHostToDevice));

    //---------------- some computation ----------------//

    // Free memory on CPU and GPU
    delete[] hA;
    cudaFree(dev_A1);
    return 0;
}
Timing a CUDA application using events

//------------------------------ Event Management ---------------------//
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);   // equivalently: cudaEventRecord(start, 0);
// … some operations on CPU and GPU …
cudaEventRecord(stop);
float time0 = 0.0;
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time0, start, stop);
printf("GPU compute time_load data (msec): %.5f\n", time0);
#Task 1: sum_vectors_CUDA.cu
Location: /Tutorial_AIS-GRID/CUDA/

Module:
$ module add cuda-6.0-x86_64

Compilation:
$ nvcc -arch=compute_35 sum_vectors_CUDA.cu -o test_CUDA

Script:
#!/bin/sh
#SBATCH -p gpu
#SBATCH --gres=gpu:1
srun test_CUDA

Execution:
$ sbatch script_CUDA
Submitted batch job 8882
Exercise #1: Initialize the vectors dev_A, dev_B, dev_C on the device.
Effective and quick development of parallel applications:
• Use numerical libraries for GPUs: CUBLAS, CUFFT, MAGMA, …
• Use existing GPU software: NAMD, GROMACS, TeraChem, Matlab, …
• Program GPU code with directives:
  - PGI Accelerator by the Portland Group
  - HMPP Workbench by CAPS Enterprise
Some GPU-accelerated Libraries:
• NVIDIA cuBLAS, NVIDIA cuRAND, NVIDIA cuSPARSE, NVIDIA NPP, NVIDIA cuFFT
• Vector Signal Image Processing
• GPU Accelerated Linear Algebra
• Matrix Algebra on GPU and Multicore (MAGMA)
• C++ STL Features for CUDA
• IMSL Library
• Building-block Algorithms for CUDA
• ArrayFire Matrix Computations
• Sparse Linear Algebra
Source: https://developer.nvidia.com/cuda-education (Will Ramey, NVIDIA Corporation)
Task #2: Element-wise vector multiplication by a scalar

$A_k = \alpha \cdot A_k$

Supported element types: float, double, cuComplex, cuDoubleComplex
cuBLAS library: cublas<t>scal()

The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA® CUDA™ runtime.
Site: http://docs.nvidia.com/cuda/cublas/
• cublasStatus_t cublasSscal(cublasHandle_t handle, int n,
const float *alpha, float *x, int incx)
• cublasStatus_t cublasDscal(cublasHandle_t handle, int n,
const double *alpha, double *x, int incx)
• cublasStatus_t cublasCscal(cublasHandle_t handle, int n,
const cuComplex *alpha, cuComplex *x, int incx)
• cublasStatus_t cublasCsscal(cublasHandle_t handle, int n,
const float *alpha, cuComplex *x, int incx)
• cublasStatus_t cublasZscal(cublasHandle_t handle, int n,
const cuDoubleComplex *alpha,
cuDoubleComplex *x, int incx)
• cublasStatus_t cublasZdscal(cublasHandle_t handle, int n,
const double *alpha, cuDoubleComplex *x, int incx)
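A hedged usage sketch for cublasDscal (the setup, values and printout are ours; link against cuBLAS as in the compilation line below):

#include <stdio.h>
#include <cublas_v2.h>
#define N 8

int main() {
    double hX[N];
    for (int i = 0; i < N; i++) hX[i] = (double)i;

    double *dX;
    cudaMalloc((void**)&dX, N * sizeof(double));
    cudaMemcpy(dX, hX, N * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 2.0;
    cublasDscal(handle, N, &alpha, dX, 1);   // x = alpha * x, unit stride
    cublasDestroy(handle);

    cudaMemcpy(hX, dX, N * sizeof(double), cudaMemcpyDeviceToHost);
    printf("x[3] = %f\n", hX[3]);            // expect 6.0
    cudaFree(dX);
    return 0;
}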
Compilation (linking CUBLAS):
$ nvcc -arch=compute_35 -lcublas testBlas_CUDA.cu -o test_CUDA
The Discrete Fourier Transform (DFT)

The DFT maps a complex-valued vector $x_k$ (time domain) into its frequency-domain representation $X_k$:

Forward transform:
$X_k = \sum_{n=0}^{N-1} x_n e^{-2\pi i kn/N}, \quad k = 0, 1, \ldots, N-1$

Inverse transform:
$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k e^{+2\pi i kn/N}, \quad n = 0, 1, \ldots, N-1$

Caveat: the output is not normalized, $\mathrm{IFFT}(\mathrm{FFT}(A)) = \mathrm{len}(A) \cdot A$.
cuFFT library

cuFFT is the NVIDIA® CUDA™ Fast Fourier Transform (FFT) product.
Site: http://docs.nvidia.com/cuda/cufft/

Key Features:
• 1D, 2D, 3D transforms of complex and real data types
• 1D transform sizes up to 128 million elements
• Flexible data layouts by allowing arbitrary strides between individual elements and array dimensions
• FFT algorithms based on Cooley-Tukey and Bluestein
• Familiar API similar to FFTW Advanced Interface
• Streamed asynchronous execution
• Single and double precision transforms
• Batch execution for doing multiple transforms
• In-place and out-of-place transforms
• Flexible input and output data layouts, similar to FFTW "Advanced Interface"
• Thread-safe and callable from multiple host threads
cuFFT library

cuFFT uses algorithms based on the well-known Cooley-Tukey and Bluestein algorithms. The algorithms are highly optimized for input sizes that can be written in the form $2^a \cdot 3^b \cdot 5^c \cdot 7^d$.

The cuFFT library allocates space for
8 * batch * n[0] * … * n[rank-1]
cufftComplex or cufftDoubleComplex elements, where
• batch denotes the number of transforms that will be executed in parallel
• rank is the number of dimensions of the input data
#Task 1: 1D transform of complex data type for a vector of length NX
#Task 2: NY × 1D transforms of complex data type for vectors of length NX, stored contiguously as A_input[ix + NX * iy], ix = 0, 1, …, NX−1, iy = 0, 1, …, NY−1

cuFFT transform types:
‣ R2C: real to complex
‣ C2R: complex to real
‣ C2C: complex to complex
‣ D2Z: double to double complex
‣ Z2D: double complex to double
‣ Z2Z: double complex to double complex

Typical scheme (a sketch follows below):
• Allocate memory on the device
• Create and configure the transformation plan: cufftPlan1d(), cufftPlan2d(), cufftPlan3d(), cufftPlanMany()
• Perform the transform
• Free the memory and destroy the plan
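For the single 1D transform of #Task 1, a minimal sketch following this scheme (the names and sizes are our assumptions):

#include <cufft.h>
#define NX 64

int main() {
    cufftDoubleComplex *dData;
    cudaMalloc((void**)&dData, NX * sizeof(cufftDoubleComplex));
    // … fill dData, e.g. with cudaMemcpy from the host …

    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_Z2Z, 1);             // one Z2Z transform of length NX
    cufftExecZ2Z(plan, dData, dData, CUFFT_FORWARD);  // in-place forward FFT

    cufftDestroy(plan);
    cudaFree(dData);
    return 0;
}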
cuFFT library: cufftPlanMany()

cufftResult cufftPlanMany(cufftHandle *plan, int rank, int *n,
                          int *inembed, int istride, int idist,
                          int *onembed, int ostride, int odist,
                          cufftType type, int batch);

batch denotes the number of transforms that will be executed in parallel; rank is the number of dimensions of the input data.

Plan parameters for NY batched 1D Z2Z transforms of length NX:

Parameter    Value
rank         1
n[0]         NX
inembed[0]   0
istride      1
idist        NX
onembed[0]   0
ostride      1
odist        NX
type         CUFFT_Z2Z
batch        NY
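A sketch of the corresponding plan creation (our code; passing NULL for inembed/onembed selects the basic contiguous layout, matching the table above):

#include <cufft.h>
#define NX 64
#define NY 16

int main() {
    cufftDoubleComplex *dA;
    cudaMalloc((void**)&dA, NX * NY * sizeof(cufftDoubleComplex));
    // … fill dA with NY rows of length NX …

    cufftHandle plan;
    int n[1] = { NX };
    // rank = 1, istride = ostride = 1, idist = odist = NX, batch = NY
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, NX,
                  NULL, 1, NX,
                  CUFFT_Z2Z, NY);
    cufftExecZ2Z(plan, dA, dA, CUFFT_FORWARD);  // NY transforms in one call

    cufftDestroy(plan);
    cudaFree(dA);
    return 0;
}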
#Task 3: using cufftPlanMany()
Location: /Tutorial_AIS-GRID/CUDA

Compilation:
$ nvcc -arch=compute_35 -lcublas -lcufft test_cufft_cublas_CUDA.cu -o test_CUDA

Script:
#!/bin/sh
#SBATCH -p gpu
#SBATCH --gres=gpu:1
srun test_CUDA

Execution:
$ sbatch script_CUDA
Submitted batch job 8881
#Task 4: Shared memory
cuBLAS: Level-1 Basic Linear Algebra Subprograms (BLAS1)

For a complex vector $x_n$, $n = 0, 1, \ldots, N-1$:
• cublas<t>asum(): $\sum_{n=0}^{N-1} (|\mathrm{Re}\,x_n| + |\mathrm{Im}\,x_n|)$
• cublas<t>nrm2(): $\sqrt{\sum_{n=0}^{N-1} x_n \bar{x}_n}$

Neither gives the sum of moduli directly; with shared memory we compute
$\mathrm{sum} = \sum_{n=0}^{N-1} \sqrt{x_n \bar{x}_n} = \sum_{n=0}^{N-1} \sqrt{(\mathrm{Re}\,x_n)^2 + (\mathrm{Im}\,x_n)^2}$
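For comparison with the hand-written kernel, a hedged sketch of the two BLAS1 calls (the setup is ours):

#include <stdio.h>
#include <cublas_v2.h>
#define N 1024

int main() {
    cuDoubleComplex *dX;
    cudaMalloc((void**)&dX, N * sizeof(cuDoubleComplex));
    // … fill dX …

    cublasHandle_t handle;
    cublasCreate(&handle);
    double asum, nrm2;
    cublasDzasum(handle, N, dX, 1, &asum);  // sum of |Re x_n| + |Im x_n|
    cublasDznrm2(handle, N, dX, 1, &nrm2);  // sqrt of the sum of |x_n|^2
    cublasDestroy(handle);

    printf("asum = %f, nrm2 = %f\n", asum, nrm2);
    cudaFree(dX);
    return 0;
}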
#Task 4: Shared memory (parallel computation of the sum of moduli of complex numbers)

$\mathrm{sum} = \sum_{n=0}^{N-1} \sqrt{x_n \bar{x}_n} = \sum_{n=0}^{N-1} \sqrt{(\mathrm{Re}\,x_n)^2 + (\mathrm{Im}\,x_n)^2}$

[Figure: the N input elements are distributed over BLOCKS × TH_BLOCKS threads, where BLOCKS is the number of blocks in the grid and TH_BLOCKS is the number of threads per block. Inside a block tid = threadIdx.x; the global index is
int i = threadIdx.x + blockDim.x * blockIdx.x;
e.g. thread tid = 2 of block 1 with blockDim.x = 3 gets i = 2 + 3*1 = 5.]

Shared memory available per block: 49152 bytes
#Task 4: Shared memory (sum_compex_CUDA.cu)

BLOCKS = 64, TH_BLOCKS = 128
Each block gets its own copy of the shared-memory array:
__shared__ double cache[TH_BLOCKS];
[Figure: blocks 0, 1, 2, … each hold a private cache[TH_BLOCKS] array indexed by tid = threadIdx.x.]
kernel_sum_shared()

__global__ void kernel_sum_shared(int nx, cuDoubleComplex* dev_vrC, double* c) {
    volatile __shared__ double cache[TH_BLOCKS];
    int tid = threadIdx.x;
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    // Step 1: each thread accumulates |dev_vrC[i]| over a grid-stride loop
    // and copies its partial sum to shared memory
    double a = 0.0;
    while (i < nx) {
        a += cuCabs(dev_vrC[i]);
        i += blockDim.x * gridDim.x;
    }
    cache[tid] = a;
    __syncthreads();

    // Step 2: tree reduction in shared memory
    int s = blockDim.x / 2;
    while (s != 0) {
        if (tid < s)
            cache[tid] += cache[tid + s];   // reduction
        __syncthreads();
        s /= 2;
    }

    // Thread 0 writes the block's partial sum to global memory
    if (tid == 0) c[blockIdx.x] = cache[0];
}
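A possible host-side completion of the reduction, where the final sum over the per-block partial results is done on the CPU (the sizes follow the slides; the rest is our sketch, assuming kernel_sum_shared above is in the same file):

#include <stdio.h>
#include <cuComplex.h>
#define TH_BLOCKS 128
#define BLOCKS 64
#define NX 100000

int main() {
    cuDoubleComplex *dV;
    double *dPartial, hPartial[BLOCKS];
    cudaMalloc((void**)&dV, NX * sizeof(cuDoubleComplex));
    cudaMalloc((void**)&dPartial, BLOCKS * sizeof(double));
    // … fill dV …

    kernel_sum_shared<<<BLOCKS, TH_BLOCKS>>>(NX, dV, dPartial);
    cudaMemcpy(hPartial, dPartial, BLOCKS * sizeof(double),
               cudaMemcpyDeviceToHost);

    double sum = 0.0;
    for (int b = 0; b < BLOCKS; b++) sum += hPartial[b];  // final reduction
    printf("sum = %f\n", sum);

    cudaFree(dV); cudaFree(dPartial);
    return 0;
}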
CUDA materials
1. NVIDIA website: http://www.nvidia.com/
2. NVIDIA CUDA ZONE: https://developer.nvidia.com/cuda-zone
3. Jason Sanders and Edward Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Jul 29, 2010.
4. List of books: https://developer.nvidia.com/suggested-reading
5. Boreskov A.V., Kharlamov A.A., Markovsky N.D., et al. Parallel Computing on GPU: CUDA Architecture and Programming Model (in Russian). Textbook; M.V. Lomonosov Moscow State University; Supercomputing Consortium of Russian Universities. Moscow: Moscow University Press, 2012. 336 pp. (Supercomputing Education series).
Thank you for your attention!
Join HybriLIT!