Source: cis565-fall-2012.github.io/lectures/09-17-CUDA-Introduction-1-of-2.pdf

Introduction to CUDA 1 of 2

Patrick Cozzi
University of Pennsylvania
CIS 565 - Fall 2012

Announcements
• Project 0 due Tues 09/18
• Project 1 released today. Due Friday 09/28
  • Starts student blogs
  • Get approval for all third-party code

Acknowledgements
• Many slides are from David Kirk and Wen-mei Hwu’s UIUC course: http://courses.engr.illinois.edu/ece498/al/

GPU Architecture Review
• GPUs are specialized for
  • Compute-intensive, highly parallel computation
  • Graphics!
• Transistors are devoted to:
  • Processing
  • Not:
    • Data caching
    • Flow control

GPU Architecture Review: Transistor Usage

Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf

Let’s program this thing!

GPU Computing History
• 2001/2002 – researchers see GPU as data-parallel coprocessor
  • The GPGPU field is born
• 2007 – NVIDIA releases CUDA
  • CUDA – Compute Unified Device Architecture
  • GPGPU shifts to GPU Computing
• 2008 – Khronos releases OpenCL specification

CUDA Abstractions
• A hierarchy of thread groups
• Shared memories
• Barrier synchronization

CUDA Terminology
• Host – typically the CPU
  • Code written in ANSI C
• Device – typically the GPU (data-parallel)
  • Code written in extended ANSI C
• Host and device have separate memories
• CUDA Program
  • Contains both host and device code

CUDA Terminology
• Kernel – data-parallel function
  • Invoking a kernel creates lightweight threads on the device
    • Threads are generated and scheduled with hardware
• Similar to a shader in OpenGL?

CUDA Kernels
• Executed N times in parallel by N different CUDA threads
Callouts: thread ID, execution configuration, declaration specifier
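The three callouts can be illustrated with a minimal sketch, assuming a vector-add kernel in the style of the CUDA C Programming Guide. The names `VecAdd` and `vecAddOnDevice` are illustrative and not taken from the slide, and the host-side memory calls are the ones covered under CUDA Memory Transfers.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Declaration specifier: __global__ marks a kernel, which the host launches
// and the device executes.
__global__ void VecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;      // thread ID within the single 1D block
    C[i] = A[i] + B[i];       // each of the N threads handles one element
}

// Hypothetical host wrapper (not from the slide). Returns 0 on success.
// Error checks on the intermediate calls are omitted for brevity.
int vecAddOnDevice(const float* A, const float* B, float* C, int n)
{
    assert(n > 0 && n <= 512);   // one block; 512 is the G80/GT200 limit
    size_t size = n * sizeof(float);
    float *dA = 0, *dB = 0, *dC = 0;
    cudaMalloc((void**)&dA, size);
    cudaMalloc((void**)&dB, size);
    cudaMalloc((void**)&dC, size);
    cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice);

    // Execution configuration: <<<number of blocks, threads per block>>>
    VecAdd<<<1, n>>>(dA, dB, dC);

    // The blocking copy back also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling `vecAddOnDevice(a, b, c, 4)` launches four threads, one per element, which is the "executed N times by N threads" idea with N = 4.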

CUDA Program Execution

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Thread Hierarchies
• Grid – one or more thread blocks
  • 1D or 2D
• Block – array of threads
  • 1D, 2D, or 3D
  • Each block in a grid has the same number of threads
  • Each thread in a block can
    • Synchronize
    • Access shared memory

Thread Hierarchies

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Thread Hierarchies
• Block – 1D, 2D, or 3D
  • Example: Index into vector, matrix, volume

Thread Hierarchies
• Thread ID: Scalar thread identifier
• Thread Index: threadIdx
• 1D: Thread ID == Thread Index
• 2D with size (Dx, Dy)
  • Thread ID of index (x, y) == x + y Dx
• 3D with size (Dx, Dy, Dz)
  • Thread ID of index (x, y, z) == x + y Dx + z Dx Dy
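The mapping from a thread index to a scalar thread ID can be sketched as follows. This is a minimal example, not code from the lecture: `flatThreadId`, `RecordIndices`, and `recordIndices` are hypothetical names, and the encoding `x + 100 y + 10000 z` exists only so the host can see which thread wrote which slot.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Scalar thread ID for index (x, y, z) in a block of size (Dx, Dy, Dz):
// x varies fastest, then y, then z.
__host__ __device__ inline int flatThreadId(int x, int y, int z, int Dx, int Dy)
{
    return x + y * Dx + z * Dx * Dy;
}

// Each thread stores an encoding of its 3D index at its scalar thread ID.
__global__ void RecordIndices(int* out)
{
    int tid = flatThreadId(threadIdx.x, threadIdx.y, threadIdx.z,
                           blockDim.x, blockDim.y);
    out[tid] = threadIdx.x + 100 * threadIdx.y + 10000 * threadIdx.z;
}

// Hypothetical host wrapper: launches one (Dx, Dy, Dz) block.
// Returns 0 on success.
int recordIndices(int Dx, int Dy, int Dz, int* out)
{
    assert(Dx > 0 && Dy > 0 && Dz > 0 && Dx * Dy * Dz <= 512);
    size_t size = (size_t)Dx * Dy * Dz * sizeof(int);
    int* dOut = 0;
    cudaMalloc((void**)&dOut, size);
    dim3 dimBlock(Dx, Dy, Dz);
    RecordIndices<<<1, dimBlock>>>(dOut);
    cudaError_t err = cudaMemcpy(out, dOut, size, cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return err == cudaSuccess ? 0 : 1;
}
```

For a block of size (4, 3, 2), index (1, 2, 1) has thread ID 1 + 2·4 + 1·4·3 = 21.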

Thread Hierarchies
Callouts: 1 thread block, 2D block, 2D index

Thread Hierarchies
• Thread Block
  • Group of threads
    • G80 and GT200: Up to 512 threads
    • Fermi: Up to 1024 threads
  • Reside on same processor core
  • Share memory of that core

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Thread Hierarchies
• Block Index: blockIdx
• Dimension: blockDim
  • 1D or 2D

Thread Hierarchies
Callouts: 2D thread block, 16x16 threads per block

Thread Hierarchies
• Example: N = 32
  • 16x16 threads per block (independent of N)
    • threadIdx ([0, 15], [0, 15])
  • 2x2 thread blocks in grid
    • blockIdx ([0, 1], [0, 1])
    • blockDim = 16
• i = [0, 1] * 16 + [0, 15]
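The index arithmetic in this example can be sketched with an element-wise matrix add, assuming a kernel in the style of the CUDA C Programming Guide. `MatAdd` and `matAddOnDevice` are illustrative names, and the bounds check is an addition for sizes that are not a multiple of 16.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Element-wise add of two N x N row-major matrices.
__global__ void MatAdd(const float* A, const float* B, float* C, int N)
{
    // Global index = block index * block size + index within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (i < N && j < N)                              // guard when N % 16 != 0
        C[j * N + i] = A[j * N + i] + B[j * N + i];
}

// Hypothetical host wrapper. Returns 0 on success.
int matAddOnDevice(const float* A, const float* B, float* C, int N)
{
    assert(N > 0);
    size_t size = (size_t)N * N * sizeof(float);
    float *dA = 0, *dB = 0, *dC = 0;
    cudaMalloc((void**)&dA, size);
    cudaMalloc((void**)&dB, size);
    cudaMalloc((void**)&dC, size);
    cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(16, 16);                    // independent of N
    dim3 numBlocks((N + 15) / 16, (N + 15) / 16);    // 2x2 when N = 32
    MatAdd<<<numBlocks, threadsPerBlock>>>(dA, dB, dC, N);

    cudaError_t err = cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err == cudaSuccess ? 0 : 1;
}
```

With N = 32, `blockIdx.x` ranges over [0, 1] and `threadIdx.x` over [0, 15], so `i` covers 0 through 31.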

Thread Hierarchies
• Thread blocks execute independently
  • In any order: parallel or series
  • Scheduled in any order by any number of cores
    • Allows code to scale with core count

Thread Hierarchies

Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf

CUDA Memory Transfers

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

CUDA Memory Transfers
• Host can transfer to/from device
  • Global memory
  • Constant memory

CUDA Memory Transfers
• cudaMalloc()
  • Allocate global memory on device
• cudaFree()
  • Frees memory

CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Callouts: pointer to device memory, size in bytes
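The two callouts correspond to the two arguments of `cudaMalloc()`. A minimal sketch, assuming a square `float` matrix as in the matrix multiply example; the helper name `allocDeviceMatrix` is hypothetical:

```cuda
#include <assert.h>
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: allocates a width x width float matrix in device
// global memory. Returns the device pointer, or NULL on failure.
float* allocDeviceMatrix(int width)
{
    assert(width > 0);
    float* Md = NULL;
    size_t size = (size_t)width * width * sizeof(float);   // size in bytes

    // First argument: address of the pointer that receives the device address.
    // Second argument: number of bytes to allocate.
    cudaError_t err = cudaMalloc((void**)&Md, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return NULL;
    }
    return Md;
}
```

The returned pointer refers to device memory, so host code should not dereference it; it is passed to kernels and to `cudaMemcpy()`, and released with `cudaFree(Md)`.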

CUDA Memory Transfers
• cudaMemcpy()
  • Memory transfer
    • Host to host
    • Host to device
    • Device to host
    • Device to device
Figure: host and device, with global memory on the device
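The transfer directions map to the `cudaMemcpyKind` argument of `cudaMemcpy()`. The following sketch exercises three of them in a round trip; `roundTrip` is a hypothetical helper, and error checks on the intermediate calls are omitted for brevity. The fourth direction, host to host, uses `cudaMemcpyHostToHost`.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Copies n floats host -> device -> device -> host.
// Returns 0 on success.
int roundTrip(const float* src, float* dst, int n)
{
    assert(n > 0);
    size_t size = n * sizeof(float);
    float *dA = 0, *dB = 0;
    cudaMalloc((void**)&dA, size);
    cudaMalloc((void**)&dB, size);

    // Argument order: destination, source, size in bytes, direction.
    cudaMemcpy(dA, src, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, dA, size, cudaMemcpyDeviceToDevice);
    cudaError_t err = cudaMemcpy(dst, dB, size, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    return err == cudaSuccess ? 0 : 1;
}
```

Note that the destination comes first, as with `memcpy()` in C.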

CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Callouts: source (host), destination (device), host to device
Figure: host and device, with global memory on the device

Matrix Multiply Reminder
• Vectors
• Dot products
• Row major or column major?
• Dot product per output element

Matrix Multiply
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
• P = M * N
• Assume M and N are square for simplicity

Matrix Multiply
• 1,000 x 1,000 matrix
• 1,000,000 dot products
  • Each 1,000 multiplies and 1,000 adds
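Worked out from the numbers on this slide, the total operation count is roughly:

```latex
\[
10^{3} \times 10^{3} = 10^{6}\ \text{dot products},
\qquad
10^{6} \times \left(10^{3} + 10^{3}\right) = 2 \times 10^{9}\ \text{floating-point operations}
\]
```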

Matrix Multiply: CPU Implementation

Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt

void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    // One dot product per output element; matrices are stored row major.
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j)
        {
            float sum = 0;
            for (int k = 0; k < width; ++k)
            {
                float a = M[i * width + k];   // element k of row i of M
                float b = N[k * width + j];   // element k of column j of N
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}

Matrix Multiply: CUDA Skeleton

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Matrix Multiply
• Step 1
  • Add CUDA memory transfers to the skeleton

Matrix Multiply: Data Transfer
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Callouts: allocate input, allocate output, read back from device
• Similar to GPGPU with GLSL.
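Step 1 can be sketched as below. This is an assumption about the shape of the skeleton, not the textbook's exact code: the function name `MatrixMulTransfersOnly` is hypothetical, the `Md`/`Nd`/`Pd` names follow the later slide, and the `cudaMemset()` call is a stand-in added here so that the read-back has a defined value before any kernel exists.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Step 1 only: allocate, copy in, (kernel would run here), read back, free.
// Returns 0 on success. Intermediate error checks are omitted for brevity.
int MatrixMulTransfersOnly(const float* M, const float* N, float* P, int width)
{
    assert(width > 0);
    size_t size = (size_t)width * width * sizeof(float);
    float *Md = 0, *Nd = 0, *Pd = 0;

    // Allocate input matrices on the device and copy them over.
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate output.
    cudaMalloc((void**)&Pd, size);
    cudaMemset(Pd, 0, size);   // stand-in for the kernel (an assumption)

    // Kernel invocation goes here (Steps 2 and 3).

    // Read back from device.
    cudaError_t err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
    return err == cudaSuccess ? 0 : 1;
}
```

The pattern is the same upload, compute, read back sequence used for GPGPU with GLSL, with explicit copies in place of texture uploads and framebuffer reads.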

Matrix Multiply
• Step 2
  • Implement the kernel in CUDA C

Matrix Multiply: CUDA Kernel
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
• Accessing a matrix, so using a 2D block
• Each thread running the kernel computes one output element

Matrix Multiply: CUDA Kernel
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
• Where did the two outer for loops in the CPU implementation go?
• No locks or synchronization, why?

Matrix Multiply
• Step 3
  • Invoke the kernel in CUDA C

Matrix Multiply: Invoke Kernel
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
• One block with width by width threads
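Steps 2 and 3 together can be sketched as follows. This is a reconstruction under assumptions, not the textbook's exact listing: the names `MatrixMulKernel` and `MatrixMulOnDevice` and the `int` return value are choices made here. The two outer loops of the CPU version become the thread index (`ty`, `tx`), and no locks are needed because every thread writes a different element of `Pd` and only reads `Md` and `Nd`.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Each thread computes one element of Pd as a dot product of a row of Md
// and a column of Nd (row-major storage).
__global__ void MatrixMulKernel(const float* Md, const float* Nd, float* Pd, int width)
{
    int tx = threadIdx.x;   // column of the output element
    int ty = threadIdx.y;   // row of the output element

    float sum = 0;
    for (int k = 0; k < width; ++k)
        sum += Md[ty * width + k] * Nd[k * width + tx];

    Pd[ty * width + tx] = sum;
}

// Returns 0 on success. Intermediate error checks are omitted for brevity.
int MatrixMulOnDevice(const float* M, const float* N, float* P, int width)
{
    assert(width > 0 && width * width <= 512);   // must fit in one block
    size_t size = (size_t)width * width * sizeof(float);
    float *Md = 0, *Nd = 0, *Pd = 0;
    cudaMalloc((void**)&Md, size);
    cudaMalloc((void**)&Nd, size);
    cudaMalloc((void**)&Pd, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // One block with width by width threads.
    dim3 dimBlock(width, width);
    dim3 dimGrid(1, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

    cudaError_t err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
    return err == cudaSuccess ? 0 : 1;
}
```

For M = [1 2; 3 4] and N = [5 6; 7 8], the result is P = [19 22; 43 50], matching `MatrixMulOnHost`.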

Matrix Multiply
• One block of threads computes matrix Pd
  • Each thread computes one element of Pd
• Each thread
  • Loads a row of matrix Md
  • Loads a column of matrix Nd
  • Performs one multiply and one addition for each pair of Md and Nd elements
  • Compute to off-chip memory access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block
Figure: Grid 1, Block 1, Thread (2, 2); matrices Md, Nd, and Pd of size WIDTH
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt
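The size limit in the last bullet can be made concrete with a short sketch. Both function names are hypothetical, and querying device 0 is an assumption; `maxThreadsPerBlock` is the field of `cudaDeviceProp` that holds the per-block thread limit.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Largest w such that a w x w thread block has at most maxThreadsPerBlock threads.
int maxSquareBlockWidth(int maxThreadsPerBlock)
{
    assert(maxThreadsPerBlock >= 1);
    int w = 0;
    while ((w + 1) * (w + 1) <= maxThreadsPerBlock)
        ++w;
    return w;
}

// Largest matrix width the single-block approach can handle on device 0,
// or -1 if the device query fails.
int maxMatrixWidthForOneBlock(void)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return -1;
    return maxSquareBlockWidth(prop.maxThreadsPerBlock);
}
```

With the limits quoted earlier in the slides, 512 threads per block allows at most a 22 x 22 matrix and 1024 threads per block allows 32 x 32, far short of the 1,000 x 1,000 example.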

Matrix Multiply
• What is the major performance problem with our implementation?
• What is the major limitation?

