
CUDA - 101

Basics

Overview

• What is CUDA?
• Data Parallelism
• Host-Device model
• Thread execution
• Matrix multiplication

GPU revisited!

What is CUDA?

• Compute Unified Device Architecture
• Programming interface to the GPU
• Supports C/C++ and Fortran natively
  – Third-party wrappers for Python, Java, MATLAB, etc.

• Various libraries available
  – cuBLAS, cuFFT and many more…
  – https://developer.nvidia.com/gpu-accelerated-libraries

CUDA computing stack


Data Parallel programming

(figure: inputs i1, i2, i3, …, iN each processed by the same kernel to produce outputs o1, o2, o3, …, oN)
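A minimal sketch of that picture (the names and the squaring operation are illustrative, not from the slides): every thread runs the same kernel on its own element.

    // Illustrative data-parallel kernel: one thread per element.
    // Thread k reads in[k], applies the same operation, and writes out[k].
    __global__ void squareKernel(const float *in, float *out, int n)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (k < n)
            out[k] = in[k] * in[k];
    }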

Data parallel algorithm

• Dot product: C = A · B
  – The kernel multiplies each pair in parallel: C1 = A1·B1, C2 = A2·B2, C3 = A3·B3, …, CN = AN·BN
  – The "+" stages then sum the partial products to give C
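A hedged sketch of that diagram (illustrative names): the kernel computes the pairwise products in parallel; the "+" stages correspond to a reduction over the results.

    // Pairwise products: C[i] = A[i] * B[i], one thread per i.
    __global__ void pairwiseMulKernel(const float *A, const float *B, float *C, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            C[i] = A[i] * B[i];
    }

    // The "+" stages sum C[0..n-1], e.g. in a loop on the host
    // or with a parallel reduction kernel.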

Host-Device model

CPU (Host) GPU (Device)

Threads

• A thread is an instance of the kernel program
  – Independent in a data-parallel model
  – Can be executed on a different core
• Host tells the device to run a kernel program
  – And how many threads to launch (see the launch sketch below)
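For instance (a sketch; the kernel name, pointers and thread count are illustrative), the host states the thread count in the <<< >>> launch configuration:

    // Host side: run 1 block of 256 threads, each executing an instance of myKernel.
    myKernel<<<1, 256>>>(d_in, d_out, 256);
    cudaDeviceSynchronize();   // host waits here until the device has finished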

Matrix-Multiplication

CPU-only Matrix Multiplication

• Execute this code for all elements of P (a C sketch follows)
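The slide's code is not reproduced in this transcript; a plain-C sketch of the same idea (square width × width matrices in row-major order, illustrative names):

    // CPU-only matrix multiplication: P = M * N, all matrices width x width.
    void MatrixMulOnHost(const float *M, const float *N, float *P, int width)
    {
        for (int row = 0; row < width; ++row)        // for all elements of P...
            for (int col = 0; col < width; ++col) {
                float sum = 0.0f;                    // dot product of M's row
                for (int k = 0; k < width; ++k)      // and N's column
                    sum += M[row * width + k] * N[k * width + col];
                P[row * width + col] = sum;
            }
    }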

Memory Indexing in C (and CUDA)

M(i, j) = M[i + j * width], i.e. the element in column i, row j of a row-major matrix of the given width

CUDA version - I

CUDA program flow

• Allocate input and output memory on host
  – Do the same for device
• Transfer input data from host -> device
• Launch kernel on device
• Transfer output data from device -> host

Allocating Device memory

• Host tells the device when to allocate and free device memory
• Functions for the host program:
  – cudaMalloc(memory reference, size)
  – cudaFree(memory reference)
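In the host program that looks roughly like this (sizes are in bytes; d_M and width are illustrative names):

    float *d_M = NULL;
    size_t bytes = width * width * sizeof(float);

    cudaMalloc((void **)&d_M, bytes);   // allocate device memory
    /* ... use d_M in kernel launches ... */
    cudaFree(d_M);                      // release it when no longer needed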

Transfer Data to/from device

• Again, host tells device when to transfer data
• cudaMemcpy(target, source, size, flag)
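The flag names the direction of the copy, for example (illustrative host/device pointer names):

    // Host -> device: copy the input matrix into its device buffer.
    cudaMemcpy(d_M, h_M, bytes, cudaMemcpyHostToDevice);

    // Device -> host: copy the result back once the kernel has finished.
    cudaMemcpy(h_P, d_P, bytes, cudaMemcpyDeviceToHost);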

CUDA version - II

(figure: data flow between host memory and device memory)

• Allocate matrix M on device, transfer M from host -> device
• Allocate matrix N on device, transfer N from host -> device
• Allocate matrix P on device
• Execute kernel on device
• Transfer P from device -> host
• Free device memories for M, N and P
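Put together, a hedged sketch of that host-side flow (function and variable names are illustrative; the kernel itself is sketched under the next slide):

    void MatrixMulOnDevice(const float *h_M, const float *h_N, float *h_P, int width)
    {
        size_t bytes = width * width * sizeof(float);
        float *d_M, *d_N, *d_P;

        // Allocate M and N on the device and transfer them host -> device.
        cudaMalloc((void **)&d_M, bytes);
        cudaMemcpy(d_M, h_M, bytes, cudaMemcpyHostToDevice);
        cudaMalloc((void **)&d_N, bytes);
        cudaMemcpy(d_N, h_N, bytes, cudaMemcpyHostToDevice);

        // Allocate the output matrix P on the device.
        cudaMalloc((void **)&d_P, bytes);

        // Execute the kernel on the device: one thread per element of P.
        dim3 dimBlock(width, width);
        MatrixMulKernel<<<1, dimBlock>>>(d_M, d_N, d_P, width);

        // Transfer P from device -> host.
        cudaMemcpy(h_P, d_P, bytes, cudaMemcpyDeviceToHost);

        // Free device memories for M, N and P.
        cudaFree(d_M);
        cudaFree(d_N);
        cudaFree(d_P);
    }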

Matrix Multiplication Kernel

• Kernel specifies the function to be executed on the device
  – Parameters = device memories, width
  – Thread = each element of output matrix P
  – Computes the dot product of M's row and N's column
  – Writes the dot product at the current location
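A sketch of such a kernel (single-block version, illustrative names; each thread finds its element of P from its built-in thread index):

    // Each thread computes one element of P: the dot product of one row of M
    // and one column of N, then writes it at the current location.
    __global__ void MatrixMulKernel(const float *M, const float *N, float *P, int width)
    {
        int row = threadIdx.y;      // this thread's row of P
        int col = threadIdx.x;      // this thread's column of P

        float sum = 0.0f;
        for (int k = 0; k < width; ++k)
            sum += M[row * width + k] * N[k * width + col];

        P[row * width + col] = sum;
    }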

Extensions: Function qualifiers
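CUDA adds qualifiers that state where a function runs and from where it may be called; a minimal illustration (function bodies are placeholders):

    __global__ void myKernel(float *data) { }                     // runs on device, launched from host
    __device__ float deviceHelper(float x) { return 2.0f * x; }   // runs on device, called from device code
    __host__   float hostHelper(float x)   { return 2.0f * x; }   // runs on host (default for ordinary functions)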

Extensions: Thread indexing

• All threads execute the same code
  – But they need to work on separate memory data
• threadIdx.x & threadIdx.y
  – These variables automatically receive the corresponding values for their threads
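For example (a sketch, assuming one block of width × width threads), each thread picks its own element from its built-in index:

    __global__ void addOneKernel(float *data, int width)
    {
        int x = threadIdx.x;            // set automatically, different for every thread
        int y = threadIdx.y;
        data[x + y * width] += 1.0f;    // each thread updates its own element
    }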

Thread Grid

• Represents the group of all threads to be executed for a particular kernel
• Two-level hierarchy
  – The grid is composed of blocks
  – Each block is composed of threads

Thread Grid

(figure: a width × width grid of thread indices, from (0, 0) at one corner to (width - 1, width - 1) at the other)
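In code the two levels appear as dim3 launch parameters (sizes here are illustrative, as is the kernel name):

    dim3 dimGrid(4, 4);      // the grid: 4 x 4 blocks
    dim3 dimBlock(16, 16);   // each block: 16 x 16 threads
    // 4*4 blocks x 16*16 threads = 4096 threads in total
    someKernel<<<dimGrid, dimBlock>>>(d_data);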

Conclusion

• Sample code and tutorials
• CUDA nodes?
• Programming guide
  – http://docs.nvidia.com/cuda/cuda-c-programming-guide/
• SDK
  – https://developer.nvidia.com/cuda-downloads
  – Available for Windows, Mac and Linux
  – Lots of sample programs

QUESTIONS?