M.I. Zuev, O.I. Streltsova, M. Vala Heterogeneous Computations Team, HybriLIT Laboratory of Information Technologies Joint Institute for Nuclear Research Parallel Computing with CUDA Heterogeneous Computation Team, HybriLIT SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS JINR, DUBNA 2-6 November, 2015
Page 1: Parallel Computing with CUDA M.I. Zuev, O.I. Streltsova, M ...ais-grid-2015.jinr.ru/pdf/Lecture_CUDA_AIS-GRID_2015.pdf · M.I. Zuev, O.I. Streltsova, M. Vala Heterogeneous Computations

M.I. Zuev, O.I. Streltsova, M. Vala

Heterogeneous Computations Team, HybriLIT

Laboratory of Information Technologies

Joint Institute for Nuclear Research

Parallel Computing with CUDA

Heterogeneous Computation Team, HybriLIT

SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS

JINR, DUBNA

2-6 November, 2015

NVIDIA Tesla K40 “Atlas” GPU Accelerator

Supports CUDA and OpenCL Specifications

• 2880 CUDA GPU cores
• Performance: 4.29 TFLOPS single-precision; 1.43 TFLOPS double-precision
• 12 GB of global memory
• Memory bandwidth up to 288 GB/s

CUDA (Compute Unified Device Architecture) programming model, CUDA C

Source: http://blog.goldenhelix.com/?p=374

[Figure: CPU / GPU Architecture. CPU: Core 1 to Core 4. GPU: Multiprocessor 1 to Multiprocessor 15, 192 cores each; 2880 CUDA GPU cores in total.]

CUDA architecture (Kepler™ GK110/210)

Streaming Multiprocessor (SMX) Architecture:
• 192 single-precision CUDA cores
• 64 double-precision units
• 32 special function units (SFU)
• 32 load/store units (LD/ST)

Source: http://international.download.nvidia.com/pdf/kepler/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf

Source: http://www.realworldtech.com/includes/images/articles/g100-2.gif

CUDA (Compute Unified Device Architecture) programming model

Tutorial’s materials

Module:
module add hlit/cuda/7.0

• NVIDIA CUDA Compiler Driver NVCC:
nvcc -arch=compute_35 deviceInfo_CUDA.cu -o test_CUDA

• Full information:
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#axzz37LQKVSFi

#Task 0: deviceInfo_CUDA.cu

Location:
/tmp/Tutorial_AIS-GRID/CUDA/

Compilation:
$ module add hlit/cuda/7.0
$ nvcc -arch=compute_35 deviceInfo_CUDA.cu -o test_CUDA

Script:
#!/bin/sh
#SBATCH -p gpu
#SBATCH --gres=gpu:1
srun test_CUDA

Execution:
$ sbatch script_CUDA
Submitted batch job 8881

CUDA (Compute Unified Device Architecture) programming model: CUDA C

Definitions: device = GPU; host = CPU; kernel = function that runs on the device

Parallel parts of an application are executed on the device as kernels
• Kernels launched into the same stream are executed one at a time (kernels in different streams may run concurrently)
• Many threads execute each kernel

Function Qualifiers

Function type qualifiers specify whether a function executes on the host or on the device, and whether it is callable from the host or from the device.

Declaration specifier    Callable from    Executed on
__global__               host             device
__host__                 host             host
__device__               device           device

Launching a kernel: kernel <<< dim3 dG, dim3 dB >>> (…)

dG: dimension and size of the grid in blocks
• Up to three-dimensional on this hardware: x, y and z
• Blocks launched in the grid: dG.x * dG.y * dG.z

dB: dimension and size of a block in threads
• Three-dimensional: x, y and z
• Threads per block: dB.x * dB.y * dB.z (at most 1024)

dim3 is an integer vector type that can be used in CUDA code; unspecified components default to 1.

dim3 dG(512); // 512 x 1 x 1

dim3 dB(32, 32); // 32 x 32 x 1 = 1024 threads
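The launch configuration described above can be sketched with a two-dimensional grid. This is only an illustration under stated assumptions: the kernel name fill_index, the helper run_fill_index and the 16 x 16 block size are hypothetical and not part of the tutorial materials.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread stores its flattened 2D index.
__global__ void fill_index(int *out, int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;  // column
    int y = threadIdx.y + blockIdx.y * blockDim.y;  // row
    if (x < width && y < height) {
        out[y * width + x] = y * width + x;
    }
}

// Runs fill_index on a width x height array.
// Returns the number of wrong entries, or -1 on a CUDA error.
int run_fill_index(int width, int height) {
    int n = width * height;
    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    dim3 dB(16, 16);                        // 16 x 16 x 1 = 256 threads per block
    dim3 dG((width + dB.x - 1) / dB.x,      // round up so the grid covers the array
            (height + dB.y - 1) / dB.y);
    fill_index<<<dG, dB>>>(d_out, width, height);

    int *h_out = (int *)malloc(n * sizeof(int));
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    int bad = (err == cudaSuccess) ? 0 : -1;
    for (int i = 0; bad >= 0 && i < n; i++) {
        if (h_out[i] != i) bad++;
    }
    free(h_out);
    cudaFree(d_out);
    return bad;
}

int main() {
    int bad = run_fill_index(100, 60);
    printf("wrong entries: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Rounding the grid size up and guarding with the bounds check inside the kernel is one common way to handle array sizes that are not multiples of the block size.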

Device Memory Hierarchy

• Registers are fast; off-chip local memory has high latency

• Shared memory: tens of KB per block, on-chip, very fast

• Global memory: size up to 12 GB, high latency. Random access is very expensive! Coalesced access is much more efficient

Threads and blocks

[Figure: a grid of four blocks (Block 0 to Block 3), each containing two threads (Thread 0 and Thread 1).]

int tid = threadIdx.x + blockIdx.x * blockDim.x;

tid: global index of the thread
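As a small sketch of the indexing formula, the program below launches four blocks of two threads (the layout of the figure) and lets each thread record its own tid. The names write_tid, run_write_tid, N_BLOCKS and N_THREADS are hypothetical and chosen only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N_BLOCKS 4   // blocks in the grid (as in the figure)
#define N_THREADS 2  // threads per block (as in the figure)

// Each thread writes its global index into the output array.
__global__ void write_tid(int *out) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    out[tid] = tid;
}

// Fills h_out with N_BLOCKS * N_THREADS values.
// Returns 0 on success, -1 on a CUDA error.
int run_write_tid(int *h_out) {
    const size_t bytes = N_BLOCKS * N_THREADS * sizeof(int);
    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) return -1;
    write_tid<<<N_BLOCKS, N_THREADS>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : -1;
}

int main() {
    int h_out[N_BLOCKS * N_THREADS];
    if (run_write_tid(h_out) != 0) return 1;
    for (int b = 0; b < N_BLOCKS; b++) {
        for (int t = 0; t < N_THREADS; t++) {
            printf("block %d, thread %d: tid = %d\n", b, t,
                   h_out[t + b * N_THREADS]);
        }
    }
    return 0;
}
```

For example, Thread 1 of Block 2 gets tid = 1 + 2 * 2 = 5.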

Hybrid model of code

__global__ void kernel_1(args) {
    …
}

int main() {
    …
    kernel_1 <<< nBlock1, nThread1, nShMem, nStream >>> (parameters);
    …
}

dim3 nBlock: dimension of the grid,

dim3 nThread: dimension of the blocks

Execution flow:

Serial code

kernel_1 <<< nBlock1, nThread1 >>> (args);
…

kernel_2 <<< nBlock2, nThread2 >>> (args);
…

Serial code

Function Type Qualifiers

[Figure: the function type qualifiers __host__, __global__ and __device__ placed between the CPU and the GPU.]

__global__ void kernel(void) {
    …
}

int main() {
    …
    kernel <<< nBlock, nThread, nShMem, nStream >>> (parameters);
    …
}

dim3 nBlock: dimension of the grid,

dim3 nThread: dimension of the blocks

Function Type Qualifiers

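__global__ void kernel(args) {
    …
}

int main() {
    …
    kernel <<< nBlock, nThread, nShMem, nStream >>> (args);
    …
}

• dim3 nBlock: dimension of the grid (number of blocks)

• dim3 nThread: dimension of the blocks (number of threads per block)

• nShMem: the amount of shared memory, in bytes, to be allocated dynamically for each block (optional, defaults to 0)

• nStream: the stream in which the kernel is launched (optional, defaults to stream 0)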

Scheme of a program in CUDA C/C++ and in C/C++

1. Memory allocation

   CUDA C/C++:  cudaMalloc (void ** devPtr, size_t size);
   C/C++:       void * malloc (size_t size);

2. Copy variables

   CUDA C/C++:  cudaMemcpy (void * dst, const void * src, size_t count, enum cudaMemcpyKind kind);
                copy: host → device, device → host, host ↔ host, device ↔ device
   C/C++:       void * memcpy (void * destination, const void * source, size_t num);

3. Function call

   CUDA C/C++:  kernel <<< gridDim, blockDim >>> (args);
   C/C++:       Function (args);

4. Copy results to host

   CUDA C/C++:  cudaMemcpy (void * dst, const void * src, size_t count, device → host);
   C/C++:       ---
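The four steps of the scheme can be put together in one short program. The sketch below is illustrative only: the kernel add, the helper run_add, the float data type and the block size of 256 are assumptions, not the tutorial's own code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per element, c[i] = a[i] + b[i].
__global__ void add(int n, const float *a, const float *b, float *c) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Returns 0 if the device result matches the host result, 1 otherwise.
int run_add(int n) {
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // 1. Memory allocation on the device
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    // 2. Copy variables: host -> device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 3. Function (kernel) call
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(n, d_a, d_b, d_c);

    // 4. Copy results to host: device -> host (waits for the kernel)
    cudaError_t err = cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    int bad = (err != cudaSuccess) ? 1 : 0;
    for (int i = 0; i < n; i++) {
        if (h_c[i] != h_a[i] + h_b[i]) bad = 1;
    }

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return bad;
}

int main() {
    int bad = run_add(1000);
    printf("%s\n", bad == 0 ? "results match" : "results differ");
    return bad;
}
```

Return codes of the individual CUDA calls are ignored here to keep the four steps visible; only the final copy is checked.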

Sum of vectors on the GPU

$c_i = a_i + b_i, \quad i = 0, \dots, N-1$

Serial code:

void sum(int N, int *a, int *b, int *c) {
    int tid = 0;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 1;
    }
}

CUDA code:

__global__ void kernel(int N, int *a, int *b, int *c) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

Kernel call:

kernel <<< gridDim, blockDim, nSharedMem, nStream >>> (args);

gridDim: the number of blocks in the grid

blockDim: the number of threads per block

Error Handling

Run-time errors:

every CUDA call (except kernel launches) returns an error code of type cudaError_t

Getting the last error:

cudaError_t cudaGetLastError (void)

//--------------------------------- Getting the last error ---------------------//
cudaError_t error = cudaGetLastError();
if (error != cudaSuccess) {
    printf("CUDA error, CUFFT: %s\n", cudaGetErrorString(error));
    exit(-1);
}

Error Handling: using macro substitution

A macro can wrap any CUDA call that returns an error code.

Example:

#define CUDA_CALL(x) do { cudaError_t err = (x); if (err != cudaSuccess) { \
    printf("Error \"%s\" at %s:%d\n", cudaGetErrorString(err), \
           __FILE__, __LINE__); return -1; \
}} while (0)

CUDA_CALL(cudaMalloc((void**)&dA, n*n*sizeof(float)));

• GPU kernel launches return before the kernel completes, so cudaGetLastError cannot be used immediately after the launch to get execution errors:

First use cudaDeviceSynchronize() as a synchronization point;

Then call cudaGetLastError.
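The synchronize-then-check pattern can be sketched as follows. The kernel noop and the helper checked_launch are hypothetical names; the second launch deliberately requests 2048 threads per block, which exceeds the limit of 1024, so that an error is reported.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only to exercise the launch path.
__global__ void noop() {}

// Launches noop with the given number of threads per block and returns
// the first error found after the synchronization point.
cudaError_t checked_launch(int threads) {
    noop<<<1, threads>>>();
    cudaError_t sync = cudaDeviceSynchronize();  // wait for the kernel
    cudaError_t last = cudaGetLastError();       // read and reset the last error
    return (sync != cudaSuccess) ? sync : last;
}

int main() {
    cudaError_t ok = checked_launch(128);    // valid configuration
    cudaError_t bad = checked_launch(2048);  // too many threads per block
    printf("128 threads:  %s\n", cudaGetErrorString(ok));
    printf("2048 threads: %s\n", cudaGetErrorString(bad));
    return 0;
}
```

Checking both return values is a design choice of this sketch: it reports errors raised while the kernel ran as well as errors from the launch itself.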

Error Handling: Example

#include <stdio.h>
#include <stdlib.h>
#include <math.h>   // pow(), used in the memory report

#define NX 64

#define CUDA_CALL(x) do { cudaError_t err = (x); if (err != cudaSuccess) { \
    printf("Error \"%s\" at %s:%d\n", cudaGetErrorString(err), \
           __FILE__, __LINE__); return -1; \
}} while (0)

int main() {
    printf(" =============== Test Error Handling ================ \n");
    printf("Array size NX = %d \n", NX);

    // Input data: allocate memory on host
    double* hA;
    hA = new double[NX];

    for (int i = 0; i < NX; i++) {
        hA[i] = (double) i;
    }

Error Handling: Example (continued)

    // Allocate memory on device
    double* dev_A1;

    // cudaMalloc((void**)&dev_A1, NX*sizeof(double));

    CUDA_CALL(cudaMalloc((void**)&dev_A1, NX*sizeof(double)));

    //………………………… Last error ………………………………..//
    // Getting the last error
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA error : %s\n", cudaGetErrorString(error));
        exit(-1);
    }

    // Memory volume:
    size_t avail, total;
    cudaMemGetInfo(&avail, &total);   // in bytes
    size_t used = total - avail;
    printf("GPU memory total: %lf (MB), Used memory: %lf (MB)\n",
           total/(pow(2,20)), used/(pow(2,20)));
    …

Error Handling: Example (end)

    // Copy input data from Host to Device
    CUDA_CALL(cudaMemcpy(dev_A1, hA, NX*sizeof(double), cudaMemcpyHostToDevice));

    //………………some computation………….//

    // Free memory on CPU and GPU
    delete[] hA;
    cudaFree(dev_A1);

    return 0;
}

Timing a CUDA application using events

//------------------------------ Event Management---------------------//
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);   // cudaEventRecord(start, 0);

// … some operations on CPU and GPU …

cudaEventRecord(stop);

float time0 = 0.0;
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time0, start, stop);
printf("GPU compute time_load data(msec): %.5f\n", time0);
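A minimal sketch of the same event sequence wrapped around a kernel launch is given below. The kernel scale, the helper time_scale_ms and the launch configuration of 64 blocks with 128 threads are assumptions made for the illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by a factor.
__global__ void scale(int n, float *x, float factor) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    while (i < n) {
        x[i] *= factor;
        i += blockDim.x * gridDim.x;
    }
}

// Returns the kernel time in milliseconds, or -1.0f on a CUDA error.
float time_scale_ms(int n) {
    float *d_x = NULL;
    if (cudaMalloc((void **)&d_x, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);           // mark the start in stream 0
    scale<<<64, 128>>>(n, d_x, 2.0f);    // the work being timed
    cudaEventRecord(stop, 0);            // mark the end in stream 0

    float ms = 0.0f;
    cudaEventSynchronize(stop);          // wait until the stop event is reached
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return (err == cudaSuccess) ? ms : -1.0f;
}

int main() {
    float ms = time_scale_ms(1 << 20);
    printf("GPU compute time (msec): %.5f\n", ms);
    return ms >= 0.0f ? 0 : 1;
}
```

The measured value depends on the device and on whether the kernel has been run before, so no particular number should be expected.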

#Task 1: sum_vectors_CUDA.cu

Location:
/Tutorial_AIS-GRID/CUDA/

Compilation:
$ module add cuda-6.0-x86_64
$ nvcc -arch=compute_35 sum_vectors_CUDA.cu -o test_CUDA

Script:
#!/bin/sh
#SBATCH -p gpu
#SBATCH --gres=gpu:1
srun test_CUDA

Execution:
$ sbatch script_CUDA
Submitted batch job 8882

Exercise #1

Exercise #1: Initialize the vectors dev_A, dev_B, dev_C on the device.

Efficient and rapid development of parallel applications

• Use numerical libraries for GPUs: CUBLAS, CUFFT, MAGMA, …

• Use existing GPU software: NAMD, GROMACS, TeraChem, Matlab, …

• Program GPU code with directives:
  • PGI Accelerator by the Portland Group;
  • HMPP Workbench by CAPS Enterprise

Some GPU-accelerated Libraries

• NVIDIA cuBLAS
• NVIDIA cuRAND
• NVIDIA cuSPARSE
• NVIDIA NPP
• NVIDIA cuFFT
• Vector Signal Image Processing
• GPU Accelerated Linear Algebra
• Matrix Algebra on GPU and Multicore
• C++ STL Features for CUDA
• IMSL Library
• Building-block Algorithms for CUDA
• ArrayFire Matrix Computations
• Sparse Linear Algebra

Source: https://developer.nvidia.com/cuda-education (Will Ramey, NVIDIA Corporation)

Task #2: Element-wise multiplication of a vector by a scalar

$A_k = \mathrm{alpha} \cdot A_k$

Types of alpha: float, double, cuComplex, cuDoubleComplex

Types of $A_k$: float, double, cuComplex, cuDoubleComplex

cuBLAS library: cublas<t>scal()

The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA® CUDA runtime.

Site: http://docs.nvidia.com/cuda/cublas/

• cublasStatus_t cublasSscal(cublasHandle_t handle, int n,
                             const float *alpha, float *x, int incx)

• cublasStatus_t cublasDscal(cublasHandle_t handle, int n,
                             const double *alpha, double *x, int incx)

• cublasStatus_t cublasCscal(cublasHandle_t handle, int n,
                             const cuComplex *alpha, cuComplex *x, int incx)

• cublasStatus_t cublasCsscal(cublasHandle_t handle, int n,
                              const float *alpha, cuComplex *x, int incx)

• cublasStatus_t cublasZscal(cublasHandle_t handle, int n,
                             const cuDoubleComplex *alpha,
                             cuDoubleComplex *x, int incx)

• cublasStatus_t cublasZdscal(cublasHandle_t handle, int n,
                              const double *alpha, cuDoubleComplex *x, int incx)
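A short sketch of how cublasDscal might be called for a double-precision vector is shown below. The helper run_dscal and the test data are hypothetical; the program needs to be linked with -lcublas.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Scales the n-element host vector h_x in place by alpha on the GPU.
// Returns 0 on success, -1 on any CUDA or cuBLAS error.
int run_dscal(int n, double alpha, double *h_x) {
    double *d_x = NULL;
    size_t bytes = n * sizeof(double);
    if (cudaMalloc((void **)&d_x, bytes) != cudaSuccess) return -1;

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        cudaFree(d_x);
        return -1;
    }

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    // x = alpha * x, with stride incx = 1; alpha is read from host memory
    // (the default pointer mode of the handle)
    cublasStatus_t status = cublasDscal(handle, n, &alpha, d_x, 1);

    cudaError_t err = cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(d_x);
    return (status == CUBLAS_STATUS_SUCCESS && err == cudaSuccess) ? 0 : -1;
}

int main() {
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    if (run_dscal(4, 2.5, x) != 0) {
        printf("cuBLAS call failed\n");
        return 1;
    }
    printf("%g %g %g %g\n", x[0], x[1], x[2], x[3]);
    return 0;
}
```

With alpha = 2.5 the vector (1, 2, 3, 4) becomes (2.5, 5, 7.5, 10).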

Compilation with CUBLAS:

nvcc -arch=compute_35 -lcublas testBlas_CUDA.cu -o test_CUDA

The Discrete Fourier Transform (DFT)

The discrete Fourier transform (DFT) maps a complex-valued vector $x_k$ (time domain) into its frequency domain representation $X_k$.

Forward transform:

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-\frac{2\pi i}{N} k n}, \quad k = 0, 1, \dots, N-1$$

Inverse transform:

$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{+\frac{2\pi i}{N} k n}, \quad n = 0, 1, \dots, N-1$$

Note: the output is not normalized: $\mathrm{IFFT}(\mathrm{FFT}(A)) = \mathrm{len}(A) \cdot A$

CUFFT library: cuFFT

cuFFT is the NVIDIA® CUDA™ Fast Fourier Transform (FFT) product.

Site: http://docs.nvidia.com/cuda/cufft/

Key Features
• 1D, 2D, 3D transforms of complex and real data types
• 1D transform sizes up to 128 million elements
• Flexible data layouts by allowing arbitrary strides between individual elements and array dimensions
• FFT algorithms based on Cooley-Tukey and Bluestein
• Familiar API similar to FFTW Advanced Interface
• Streamed asynchronous execution
• Single and double precision transforms
• Batch execution for doing multiple transforms
• In-place and out-of-place transforms
• Flexible input & output data layouts, similar to FFTW "Advanced Interface"
• Thread-safe & callable from multiple host threads

CUFFT library: cuFFT

cuFFT uses algorithms based on the well-known Cooley-Tukey and Bluestein algorithms.

The algorithms are highly optimized for input sizes that can be written in the form

$2^a \cdot 3^b \cdot 5^c \cdot 7^d$

The cuFFT library allocates space for

8*batch*n[0]*..*n[rank-1]

cufftComplex or cufftDoubleComplex elements, where
• batch denotes the number of transforms that will be executed in parallel
• rank is the number of dimensions of the input data

#Task 1: 1D transform of complex data type for a vector of length NX

#Task 2: NY × 1D transforms of complex data type for vectors of length NX

[Figure: NY vectors of length NX stored one after another.]

A_input[$i_x + NX \cdot i_y$], $i_x = 0, 1, \dots, NX-1$, $i_y = 0, 1, \dots, NY-1$

CUFFT transform types:
‣ R2C: real to complex
‣ C2R: complex to real
‣ C2C: complex to complex
‣ D2Z: double to double complex
‣ Z2D: double complex to double
‣ Z2Z: double complex to double complex

Typical scheme:
• Allocate memory on the device
• Create and configure the transformation plan (cufftPlan):
  ‣ cufftPlan1d()
  ‣ cufftPlan2d()
  ‣ cufftPlan3d()
  ‣ cufftPlanMany()
• Perform the transform
• Free the memory and destroy the plan

CUFFT library: cuFFT

Parameters of cufftPlanMany() and the values used for NY transforms of length NX:

cufftResult cufftPlanMany(
    cufftHandle *plan,     // Plan
    int rank,              // 1
    int *n,                // num[0] = NX
    int *inembed,          // inembed[0] = 0
    int istride,           // 1
    int idist,             // NX
    int *onembed,          // onembed[0] = 0
    int ostride,           // 1
    int odist,             // NX
    cufftType type,        // CUFFT_Z2Z
    int batch);            // NY

where
• batch denotes the number of transforms that will be executed in parallel
• rank is the number of dimensions of the input data
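A possible use of cufftPlanMany for NY batched 1D transforms of length NX is sketched below; it also checks the unnormalized round trip IFFT(FFT(A)) = NX·A. The helper roundtrip_error and the test data are hypothetical. As an assumption of this sketch, inembed[0] and onembed[0] are set to NX rather than to 0; for rank-1 plans the layout should be determined by the stride and distance parameters. Link with -lcufft.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>
#include <cufft.h>

// Runs a forward and an inverse batched Z2Z transform in place and returns
// the largest deviation of the result from NX * input, or -1.0 on error.
double roundtrip_error(int NX, int NY) {
    size_t count = (size_t)NX * NY;
    size_t bytes = count * sizeof(cufftDoubleComplex);
    cufftDoubleComplex *h = (cufftDoubleComplex *)malloc(bytes);
    for (int iy = 0; iy < NY; iy++) {
        for (int ix = 0; ix < NX; ix++) {
            h[ix + NX * iy].x = (double)(ix + iy);  // A_input[ix + NX*iy]
            h[ix + NX * iy].y = 0.0;
        }
    }

    cufftDoubleComplex *d = NULL;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) { free(h); return -1.0; }
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // NY transforms of length NX, stored contiguously one after another
    int n[1] = {NX};
    int inembed[1] = {NX};   // assumption: NX instead of 0
    int onembed[1] = {NX};
    cufftHandle plan;
    cufftResult r = cufftPlanMany(&plan, 1, n,
                                  inembed, 1, NX,
                                  onembed, 1, NX,
                                  CUFFT_Z2Z, NY);
    if (r != CUFFT_SUCCESS) { cudaFree(d); free(h); return -1.0; }

    cufftResult rf = cufftExecZ2Z(plan, d, d, CUFFT_FORWARD);
    cufftResult ri = cufftExecZ2Z(plan, d, d, CUFFT_INVERSE);
    cudaError_t err = cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    double maxerr = 0.0;
    for (int iy = 0; iy < NY; iy++) {
        for (int ix = 0; ix < NX; ix++) {
            double re = fabs(h[ix + NX * iy].x - (double)NX * (ix + iy));
            double im = fabs(h[ix + NX * iy].y);
            if (re > maxerr) maxerr = re;
            if (im > maxerr) maxerr = im;
        }
    }

    cufftDestroy(plan);
    cudaFree(d);
    free(h);
    if (rf != CUFFT_SUCCESS || ri != CUFFT_SUCCESS || err != cudaSuccess) return -1.0;
    return maxerr;
}

int main() {
    double e = roundtrip_error(64, 8);
    printf("max deviation from NX * A: %g\n", e);
    return (e >= 0.0 && e < 1e-6) ? 0 : 1;
}
```

Dividing the result by NX after the inverse transform would recover the original data.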

#Task 3: using cufftPlanMany()

Location:
/Tutorial_AIS-GRID/CUDA

Compilation:
$ nvcc -arch=compute_35 -lcublas -lcufft test_cufft_cublas_CUDA.cu -o test_CUDA

Script:
#!/bin/sh
#SBATCH -p gpu
#SBATCH --gres=gpu:1
srun test_CUDA

Execution:
$ sbatch script_CUDA
Submitted batch job 8881

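#Task 4: Shared memory

cuBLAS: Level-1 Basic Linear Algebra Subprograms (BLAS1) for a vector $x_n$, $n = 0, 1, \dots, N-1$:

• cublas<t>asum(): $\sum_{n=0}^{N-1} \left( |\mathrm{Re}(x_n)| + |\mathrm{Im}(x_n)| \right)$

• cublas<t>nrm2(): $\sqrt{\sum_{n=0}^{N-1} x_n \cdot \bar{x}_n}$

Sum of the moduli, to be computed using shared memory:

$$\mathrm{sum} = \sum_{n=0}^{N-1} \sqrt{x_n \cdot \bar{x}_n} = \sum_{n=0}^{N-1} \sqrt{\mathrm{Re}(x_n)^2 + \mathrm{Im}(x_n)^2}$$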

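#Task 4: Shared memory (parallel computation of the sum of the moduli of complex numbers)

$$\mathrm{sum} = \sum_{n=0}^{N-1} \sqrt{x_n \cdot \bar{x}_n} = \sum_{n=0}^{N-1} \sqrt{\mathrm{Re}(x_n)^2 + \mathrm{Im}(x_n)^2}$$

[Figure: the N elements are distributed over block 0, block 1, block 2, …; in each block the threads are numbered tid = 0, 1, 2, …, where tid = threadIdx.x.]

Launch configuration: BLOCKS × TH_BLOCKS

• BLOCKS: the number of blocks in the grid

• TH_BLOCKS: the number of threads per block

Shared memory available per block: 49152 bytes

int i = threadIdx.x + blockDim.x * blockIdx.x;

Example with 3 threads per block: for tid = 2 in block 1, i = 2 + 3*1 = 5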

#Task 4: Shared memory (sum_compex_CUDA.cu)

BLOCKS = 64

TH_BLOCKS = 128

__shared__ double cache[TH_BLOCKS];

[Figure: the N elements are distributed over block 0, block 1, block 2, …; each block has threads tid = 0, 1, 2, … (tid = threadIdx.x) and its own shared array cache[TH_BLOCKS].]

kernel_sum_shared()

__global__ void kernel_sum_shared(int nx, cuDoubleComplex* dev_vrC, double* c) {
    volatile __shared__ double cache[TH_BLOCKS];
    int tid = threadIdx.x;
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    double a;
    a = 0.0;

    // Step 1: compute abs(dev_vrC[i]) and copy the per-thread sum to shared memory
    while (i < nx) {
        a += cuCabs(dev_vrC[i]);
        i += blockDim.x * gridDim.x;
    }
    cache[tid] = a;
    __syncthreads();

    // Step 2: reduction in shared memory (blockDim.x must be a power of two)
    int s = blockDim.x / 2;
    while (s != 0) {
        if (tid < s)
            cache[tid] += cache[tid + s];   // reduction
        __syncthreads();
        s /= 2;
    }

    // One partial sum per block
    if (tid == 0) c[blockIdx.x] = cache[0];
    __syncthreads();
}
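The kernel leaves one partial sum per block in c, so the host still has to add these up. The sketch below repeats the kernel to stay self-contained and adds a hypothetical host helper, sum_of_moduli; the test input (all elements equal to 3 + 4i, modulus 5) is an assumption chosen so that the expected sum is 5·N.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cuComplex.h>

#define BLOCKS 64       // number of blocks in the grid
#define TH_BLOCKS 128   // number of threads per block (power of two)

__global__ void kernel_sum_shared(int nx, cuDoubleComplex *dev_vrC, double *c) {
    volatile __shared__ double cache[TH_BLOCKS];
    int tid = threadIdx.x;
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    double a = 0.0;
    while (i < nx) {                      // per-thread sum of moduli
        a += cuCabs(dev_vrC[i]);
        i += blockDim.x * gridDim.x;
    }
    cache[tid] = a;
    __syncthreads();
    for (int s = blockDim.x / 2; s != 0; s /= 2) {  // reduction in shared memory
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0) c[blockIdx.x] = cache[0];
}

// Returns the sum of |h_x[n]| for n = 0..nx-1, or -1.0 on a CUDA error.
double sum_of_moduli(int nx, const cuDoubleComplex *h_x) {
    cuDoubleComplex *d_x = NULL;
    double *d_c = NULL;
    double h_c[BLOCKS];
    if (cudaMalloc((void **)&d_x, nx * sizeof(cuDoubleComplex)) != cudaSuccess) return -1.0;
    if (cudaMalloc((void **)&d_c, BLOCKS * sizeof(double)) != cudaSuccess) {
        cudaFree(d_x);
        return -1.0;
    }
    cudaMemcpy(d_x, h_x, nx * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);

    kernel_sum_shared<<<BLOCKS, TH_BLOCKS>>>(nx, d_x, d_c);

    cudaError_t err = cudaMemcpy(h_c, d_c, BLOCKS * sizeof(double),
                                 cudaMemcpyDeviceToHost);
    double sum = 0.0;
    for (int b = 0; b < BLOCKS; b++) sum += h_c[b];  // final sum on the host

    cudaFree(d_x);
    cudaFree(d_c);
    return (err == cudaSuccess) ? sum : -1.0;
}

int main() {
    const int N = 10000;
    cuDoubleComplex *x = (cuDoubleComplex *)malloc(N * sizeof(cuDoubleComplex));
    for (int n = 0; n < N; n++) x[n] = make_cuDoubleComplex(3.0, 4.0);
    double sum = sum_of_moduli(N, x);
    printf("sum = %f (reference 5 * N = %f)\n", sum, 5.0 * N);
    free(x);
    return 0;
}
```

Finishing the sum on the host keeps the kernel simple; with only BLOCKS = 64 partial sums the extra copy is small.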

CUDA materials

1. NVIDIA website: http://www.nvidia.com/

2. NVIDIA CUDAZONE: https://developer.nvidia.com/cuda-zone

3. Jason Sanders and Edward Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Jul 29, 2010

4. List of books: https://developer.nvidia.com/suggested-reading

5. Boreskov A. V., Kharlamov A. A., Markovsky N. D. et al. Parallel Computing on GPU. Architecture and Programming Model of CUDA: a textbook for universities. Lomonosov Moscow State University; Supercomputing Consortium of Russian Universities. Moscow: Moscow University Press, 2012. 336 pp., ill. (Supercomputing Education series).

Thank you for your attention!

Join HybriLIT!
