GPGPU Performance & Tools I
Page 1: Gpgpu intro

GPGPU Performance & Tools I

Page 2: Gpgpu intro

Outline

1. Introduction

2. Threads

3. Physical Memory

4. Logical Memory

5. Efficient GPU Programming

6. Some Examples

7. CUDA Programming

8. CUDA Tools Introduction

9. CUDA Debugger

10. CUDA Visual Profiler

NOTE: A lot of this serves as a recap of what was covered so far.

REMEMBER: Repetition is the key to remembering things.

Page 3: Gpgpu intro

But first…

• Do you believe that there can be a school without exams?

• Do you believe that a 9-year-old kid in a South Indian village can understand how DNA works?

• Do you believe that schools and universities should be changed entirely?

• http://www.ted.com/talks/sugata_mitra_build_a_school_in_the_cloud.html

• Fixing education is a task that requires everyone’s attention…

Page 4: Gpgpu intro

Most importantly…

• Do you believe that we can learn, driven entirely by motivation?

• If your answer is “NO”, then try to…

• … Get a new perspective on life…

…leave your comfort zone!

Break through your own limits! (突破自己!)

Page 5: Gpgpu intro

Introduction

Page 6: Gpgpu intro

Why are we here? CPU vs. GPU

Page 7: Gpgpu intro

Combining strengths: CPU + GPU

• Can’t we just build a new device that combines the two?

• Short answer: Some new devices are just that!

• AMD Fusion

• Intel MIC (Xeon Phi)

• Long answer:

• Take 楊佳玲’s Advanced Computer Architecture class!

Page 8: Gpgpu intro

Writing Code: Performance vs. Design

• Programmers have two contradictory goals:

1. Good Performance (FAST!)

2. Good Design (bug-resilient, extensible, easy to use, etc.)

• Rule of thumb: Fast code is not pretty

• Example:

• Mathematical description – 1 line

• Algorithm pseudocode – 10 lines

• Algorithm code – 20 lines

• Optimized algorithm code – 50 lines

Page 9: Gpgpu intro

Writing Code: Common Fallacies

1. “GPU Programs are always faster than their CPU counterpart”

• Only if: 1. the problem allows it, and 2. you invest a lot of time

2. “I don’t need a profiler”

• A profiler helps you analyze performance and find bottlenecks.

• If you don’t care for performance, do NOT use the GPU.

3. “I don’t need a debugger”

• Yes you do.

• Adding tons of printf's makes debugging a lot more difficult (and slower)

• (Plus, people are lazy)

4. “I can write bug-free code”

• No, you can’t – No one can.

Page 10: Gpgpu intro

Writing Code: A Tale of Two Address Spaces…

• Never forget – in the current architecture:

• The CPU and each GPU have their own address space and code

• We CANNOT access host pointers from the device, or vice versa

• We CANNOT call host code from the device, or vice versa

• We CANNOT access device pointers or call code from different devices

[Diagram: HOST (CPU + memory) and DEVICE (GPU + memory), each on its own bus, connected via PCIe]
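• In code this means every buffer exists twice and must be copied explicitly. A minimal sketch, assuming a hypothetical buffer of n floats (error checking omitted):

int n = 1024;
float *h_data = (float*)malloc(n * sizeof(float)); // host pointer – valid only on the CPU
float *d_data; // device pointer – valid only on the GPU
cudaMalloc((void**)&d_data, n * sizeof(float));
cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice); // explicit copy over PCIe
// kernel<<<blocks, threads>>>(d_data); // device code may touch d_data, never h_data
cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost); // copy results back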

Page 11: Gpgpu intro

Threads &Parallel Programming

Page 12: Gpgpu intro

Why do we need multithreading?

• First and foremost: Speed!

• There are some other reasons, but not today…

• Real-life example:

• Ship 10k containers from Taipei to Hong Kong

• Question: Do you use 1 very fast ship, or 4 slow ships?

• Program example:

• Add a scalar to 10k numbers

• Question: Do you use 1 very fast processor, or 4 slow processors?

• The real issue: single-unit speed never scales! There is no very fast ship or very fast processor.

Page 13: Gpgpu intro

Why do we hate multithreading?

• Multithreading adds whole new dimensions of complications to programming

• … Communication

• … Synchronization

• (… Dead-locks – But generally not on the GPU)

• Plus, debugging is a lot more complicated

Page 14: Gpgpu intro

How many Threads?

[Diagram: four threads (T1–T4) sharing one kitchen, shown twice – the kitchen analogy for choosing a thread count]

Page 15: Gpgpu intro

GPU Threads: Recap

Page 16: Gpgpu intro

Physical Memory: How our computer works

Page 17: Gpgpu intro

Memory Hierarchy & Shared Memory: Smaller is faster!

Page 18: Gpgpu intro

Processor vs. Memory Speed

• Memory latency keeps getting worse relative to processor speed!

• http://seven-degrees-of-freedom.blogspot.tw/2009/10/latency-elephant.html

Page 19: Gpgpu intro

Logical Memory: How we see memory in our programs

Page 20: Gpgpu intro

Working with Memory: What is memory, logically?

• Let's define: Memory = 1D array of bytes

• An object is a set of 1 or more bytes with a special meaning

• If the bytes are contiguous, the object is a struct

• Examples of structs:

• byte

• int

• float

• pointer

• sequence of structs

• A pointer is a struct that represents a memory address

• Basically, it's the same as a 1D array index!

[Diagram: bytes 0–9 of memory with an int, a float*, and a short laid out inside]
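• To make "a pointer is a 1D array index" concrete, a tiny hypothetical sketch:

unsigned char memory[16]; // pretend this is all of memory: a 1D array of bytes
int *p = (int*)&memory[4]; // a pointer is an address, i.e. an index into that array
*p = 42; // interprets the 4 bytes at indices 4..7 as one int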

Page 21: Gpgpu intro

Working with Memory: Structs vs. Arrays

• A chunk of contiguous memory is either an array or a struct

• Array: 1 or more of the same element

• Struct: 1 or more of (possibly different) elements

• Determined at compile time

• Don't make silly assumptions about structs!

• The compiler might change alignment

• The compiler might reorder elements

• GPU pointers must be word-aligned (4 bytes)

• If the object is only a single element, it can be said to be both:

• A one-element struct

• A one-element array

But don't overthink it…
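• For example, a hypothetical struct that shows why layout assumptions break (exact numbers depend on the compiler):

struct Mixed {
    char c; // 1 byte
    int i; // 4 bytes – the compiler typically inserts 3 padding bytes before it
};
// sizeof(struct Mixed) is usually 8, not 5 – never hard-code struct layouts!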

Page 22: Gpgpu intro

Working with Memory: Multi-dimensional Arrays

• Arrays are often multi-dimensional!

• …a line (1D)

• …a rectangle (2D)

• …a box (3D)

• … and so on

• But address space is only 1D!

• We have to map higher dimensional space into 1D…

• C and CUDA-C do not allow for multi-dimensional array indices

• We need to compute indices ourselves

Page 23: Gpgpu intro

Working with Memory: Row-Major Indexing

[Diagram: a w×h grid (w=5) with one element at coordinates (x, y)]
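• The mapping the figure illustrates, as a sketch (data is a hypothetical buffer of w * h floats):

int idx = y * w + x; // row-major: skip y full rows of w elements, then x within the row
float value = data[idx]; // the same element as 2D coordinate (x, y)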

Page 24: Gpgpu intro

Working with Memory: Summary

Page 25: Gpgpu intro

Efficient GPU Programming

Page 26: Gpgpu intro

Must Read!

• If you want to understand the GPU and write fast programs, read these:

• CUDA C Programming Guide

• CUDA Best Practices Guide

• All important CUDA documentation is right here:

• http://docs.nvidia.com/cuda/index.html

• OpenCL documentation:

• http://developer.amd.com/resources/heterogeneous-computing/opencl-zone/

• http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingGuide.pdf

Page 27: Gpgpu intro

Can Read! Some More Optimization Slides

• The power of ILP:

• http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

• Some tips and tricks:

• http://www.nvidia.com/content/cudazone/download/Advanced_CUDA_Training_NVISION08.pdf

Page 28: Gpgpu intro

ILP Magic

• The GPU facilitates both TLP and ILP

• Thread-level parallelism

• Instruction-level parallelism

• ILP means: We can execute multiple instructions at the same time

• Thread does not stall on memory access

• It only stalls on RAW (Read-After-Write) dependencies:

a = A[i]; // no stall

b = B[i]; // no stall

// …

c = a * b; // stall

• Threads can execute multiple arithmetic instructions in parallel:

a = k1 + c * d; // no stall

b = k2 + f * g; // no stall

Page 29: Gpgpu intro

SM Scheduler

Warps occupying an SM (SM = Streaming Multiprocessor)

• Using the previous example:

a = A[i]; // no stall

b = B[i]; // no stall

// …

c = a * b; // stall

• What happens on a stall?

• The current warp is placed in the I/O queue and another can run on the SM

• That is why we want as many threads (warps) per SM as possible

• Also need multiple blocks

• E.g. a GeForce 660 can have 2048 threads/SM but only 1024 threads/block

[Diagram: warps (warp4, warp5, warp6, warp8) occupying an SM]

Page 30: Gpgpu intro

TLP vs. ILP: What is good Occupancy?

Ex.: Only 50% processor utilization!

Page 31: Gpgpu intro

Registers + Shared Memory vs. Working Set Size

• Shared memory + registers must hold the current working set of all active warps on an SM

• In other words: shared memory + registers must hold all (or most of) the data that all of the threads currently and most often need

• More threads = better TLP = fewer actual stalls

• More threads = less space for the working set

• Fewer registers/thread & less shared memory/thread

• If shm + registers are too small for the working set, you must use an out-of-core method

• For example: external merge sort

• http://en.wikipedia.org/wiki/External_sorting

Page 32: Gpgpu intro

Memory Coalescing and Bank Conflicts

• VERY big bottleneck!

• See the professor’s slides

• Also, see the Must Read! section

Page 33: Gpgpu intro

OOP vs. DOP

• Array-of-Struct vs. Struct-of-Array (AoS vs. SoA)

• You have probably all heard of Object-Oriented Programming

• Idealistic OOP is slow

• OOP groups data (and code) into logical chunks (structs)

• OOP generally ignores temporal locality of data

• Good performance requires: Data-Oriented Programming

• http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

• Bundle data together that is accessed at about the same time – see the sketch below!

• I.e. group data in a way that maximizes temporal locality
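• A sketch of the two layouts for hypothetical particle data:

#define N 1024
// Array-of-Structs (OOP-style): x values of neighboring particles are 12 bytes apart
struct ParticleAoS { float x, y, z; };
struct ParticleAoS aos[N];
// Struct-of-Arrays (data-oriented): all x values are contiguous, so consecutive
// GPU threads read consecutive words – which coalesces nicely
struct ParticlesSoA { float x[N], y[N], z[N]; };
struct ParticlesSoA soa;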

Page 34: Gpgpu intro

Streams – Pipelining: memcpy vs. computation

Why? Because memcpy between host and device is a huge bottleneck!
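• A minimal sketch of such pipelining with two CUDA streams; process, chunk and nChunks are hypothetical, and h_in/h_out must be page-locked (cudaMallocHost):

cudaStream_t s[2];
cudaStreamCreate(&s[0]);
cudaStreamCreate(&s[1]);
for (int i = 0; i < nChunks; ++i) {
    cudaStream_t st = s[i % 2]; // alternate streams so copies and kernels overlap
    cudaMemcpyAsync(d_in + i * chunk, h_in + i * chunk, chunk * sizeof(float), cudaMemcpyHostToDevice, st);
    process<<<blocks, threads, 0, st>>>(d_in + i * chunk, d_out + i * chunk); // hypothetical kernel
    cudaMemcpyAsync(h_out + i * chunk, d_out + i * chunk, chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
}
cudaDeviceSynchronize(); // wait for all streams to finish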

Page 35: Gpgpu intro

Look beyond the code – E.g.:

int a = …, wA = …;

int tx = threadIdx.x, ty = threadIdx.y;

__shared__ int A[128];

As[ty][tx] = A[a + wA * ty + tx];

• Which resources does the line of code use?

• Several registers and constant cache

• Variables and constants

• Intermediate results

• Memory (shared or global)

• Reads from A (shared)

• Writes to As (maybe global)

Page 36: Gpgpu intro

Where to get the numbers?

• For actual NVIDIA device properties, check CUDA programming guide Appendix F, Table 10

• (The appendix lists a lot of info complementary to device query)

• Note: Every device has a max Compute Capability (CC) version

• The CC version of your device decides which features it supports

• More info can be found in each CC section (all in Appendix F)

• E.g. # warp schedulers (2 for CC 2.x; 4 for CC 3.x)

• Dual-issue since CC 2.1

• For comparisons of device stats, see e.g.:

• http://en.wikipedia.org/wiki/GeForce_600_Series#Products

• etc…

• E.g. Memory latency (from section 5.2.3 of the Progr. Guide)

• “400 to 800 clock cycles for devices of compute capability 1.x and 2.x and about 200 to 400 clock cycles for devices of compute capability 3.x”
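• Most of these numbers can also be queried at runtime, e.g.:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0); // properties of device 0
printf("CC %d.%d, max %d threads/block, %zu bytes shared memory/block\n",
       prop.major, prop.minor, prop.maxThreadsPerBlock, prop.sharedMemPerBlock);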

Page 37: Gpgpu intro

Other Tuning Tips

• The most important contributor to performance is the algorithm!

• Block size always a multiple of Warp size (32 on NVIDIA, 64 on AMD)!

• There is a lot more…

• Page-lock Host Memory

• Etc…

• Read all the references mentioned in this talk and you’ll get it.

Page 38: Gpgpu intro

Writing the Code…

• Do not ask the TA via email to help you with the code!

• Use the forum instead

• Other people probably have similar questions!

• The TA (this guy) will answer all forum posts to the best of his judgment

• Other students can also help!

• Just one rule: Do not share your actual code!

Page 39: Gpgpu intro

Some Examples

Page 40: Gpgpu intro

Example 1: Scalar-Vector Multiplication

Why?
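• The slide's code is not in the transcript; a minimal sketch of such a kernel (hypothetical names) might look like:

__global__ void scalarMul(float *vec, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // one element per thread
    if (i < n)
        vec[i] *= s; // no communication between threads – trivially parallel
}
// launch: scalarMul<<<(n + 255) / 256, 256>>>(d_vec, 2.0f, n);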

Page 41: Gpgpu intro

Example 2: A typical CUDA kernel…

Shared memory declarations

Repeat:

Copy some input to shared memory (shm)

__syncthreads();

Use shm data for actual computation

__syncthreads();

Write to global memory
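• As a concrete, hypothetical illustration of this skeleton (one round of the Repeat loop; assumes blockDim.x == 256):

__global__ void kernel(const float *in, float *out, int n) {
    __shared__ float shm[256]; // shared memory declarations
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    shm[threadIdx.x] = (i < n) ? in[i] : 0.0f; // copy some input to shm
    __syncthreads(); // wait until the whole block has loaded
    float r = 2.0f * shm[threadIdx.x]; // use shm data for actual computation
    __syncthreads(); // ensure everyone is done before shm is reused
    if (i < n) out[i] = r; // write to global memory
}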

Page 42: Gpgpu intro

Example 3: Median Filter

• No code (sorry!), but here are some hints…

• Use shared memory!

• The code skeleton looks like Example 2

• Remember: All threads in a block can access the same shared memory

• Use 2D blocks!

• To get increased shared memory data re-use

• Each thread computes one output pixel!

• Use the debugger!

• Use the profiler!

• Some more hints are in the homework description…

Page 43: Gpgpu intro

Many More Examples…

• Check out the NVIDIA CUDA and AMD APP SDK samples

• Some of them come with documents, explaining:

• The parallel algorithm (and how it was developed)

• Exactly how much speed up was gained from each optimization step

• CUDA 5 samples with docs:

• simpleMultiCopy

• Mandelbrot

• Eigenvalue

• recursiveGaussian

• sobelFilter

• smokeParticles

• BlackScholes

• …and many more…

Page 44: Gpgpu intro

CUDA Tools

Page 46: Gpgpu intro

CUDA Debugger: VS 2010 & NSIGHT

Works with Eclipse and VS 2010 (no VS 2012 support yet)

Page 47: Gpgpu intro

NSIGHT 3 and 2.2: Setup

• Get NSIGHT 3.0:

• Go to: https://developer.nvidia.com/nvidia-nsight-visual-studio-edition

• Register (Create an account)

• Login

• https://developer.nvidia.com/rdp/nsight-visual-studio-edition-early-access

• Download NSIGHT 3

• Works for CUDA 5

• Also has an OpenGL debugger and more

• Alternative: Get NSIGHT 2.2

• No login required

• Only works for CUDA 4

Page 48: Gpgpu intro

CUDA Debugger: Some References

• http://http.developer.nvidia.com/NsightVisualStudio/3.0/Documentation/UserGuide/HTML/Content/Debugging_CUDA_Application.htm

• https://www.youtube.com/watch?v=FLQuqXhlx40

• A bit outdated, but still very useful

• etc…

Page 49: Gpgpu intro

Visual Studio 2010 & NSIGHT

• System Info

Page 50: Gpgpu intro

Visual Studio 2010 & NSIGHT

1. Enable Debugging

• NOTE: CPU and GPU debugging are entirely separated at this point

• You must set everything explicitly for GPU

• When GPU debug mode is enabled GPU kernels will run a lot slower!

Page 51: Gpgpu intro

Visual Studio 2010 & NSIGHT

2. Set breakpoint in code:

3. Start CUDA Debugger

• DO NOT USE THE VS DEBUGGER (F5) for CUDA debugging

Page 52: Gpgpu intro

Visual Studio 2010 & NSIGHT

4. Step through the code

• Step Into (F11)

• Step Over (F10)

• Step Out (Shift + F11)

5. Open the corresponding windows

Page 53: Gpgpu intro

Visual Studio 2010 & NSIGHT

6. Inspect everything…

Page 54: Gpgpu intro

Visual Studio 2010 & NSIGHT

Conditions

• Right-click on the breakpoint

• Result:

Remember?

Page 55: Gpgpu intro

Visual Studio 2010 & NSIGHT

• Move between warps

Page 56: Gpgpu intro

Visual Studio 2010 & NSIGHT

• Select a specific thread

Page 57: Gpgpu intro

Visual Studio 2010 & NSIGHT

• Inspect Thread and Warp State

• Lists state information of all Threads. E.g.:

• Id, Block, Warp, File, Line, PC (Program Counter), etc…

• Barrier information (is warp currently waiting for sync?)

• Active Mask

• Which threads of the thread’s warp are currently running

• One bit per thread

• Prof. Chen will cover warp divergence later in the class

Page 58: Gpgpu intro

Visual Studio 2010 & NSIGHT

• Inspect Memory

• Can use Drag & Drop!

Why is 1 == 00 00 80 3f?

Floating-point representation! 1.0f is 0x3F800000 in IEEE 754, and the memory window shows its bytes in little-endian order: 00 00 80 3F.

Page 59: Gpgpu intro

CUDA Profilers: Understand your program's performance profiles!

Page 60: Gpgpu intro

Comprehensive References

• Great Overview:

• http://people.maths.ox.ac.uk/gilesm/cuda/lecs/NV_Profiling_lowres.pdf

• http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0419B-GTC2012-Profiling-Profiling-Tools.pdf

Page 61: Gpgpu intro

NVIDIA Visual Profiler – TODO…

• Great Tool!

• Chance for bonus points:

• Put together a comprehensive and easily understandable tutorial!

• We will cast a vote!

• The best tutorial gets bonus points!

Page 62: Gpgpu intro

nvprof – TODO

• Text-based profiler

• For everyone without a GUI

• Maybe also bonus points?

• We will post more details on the forum…
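• Basic usage is simple (assuming your executable is ./myapp):

nvprof ./myapp                   # summary of time spent in kernels and memcpys
nvprof --print-gpu-trace ./myapp # one line per kernel launch / copy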

Page 63: Gpgpu intro

GTC – More about the GPU

• NVIDIA’s annual GPU Technology Conference hosts many talks available online

• This year’s GTC is in progress RIGHT NOW!

• http://www.gputechconf.com/page/sessions.html

• Of course it’s a big advertisement campaign for NVIDIA

• But it also has a lot of interesting stuff!

Page 64: Gpgpu intro

The End – Any Questions?

Page 65: Gpgpu intro

Update (1)

1. Compiler Options

nvcc (the NVIDIA CUDA Compiler) has a lot of options worth playing with. I recommend printing nvcc's help to a file and consulting it regularly before you start writing code:

nvcc --help > nvcchelp.txt

2. Compute Capability 1.3

The test system is quite old, so the CUDA version it runs is probably different from the one most of you have at home. If your code passes at home but 批改娘 (the grading system) does not let you pass, here is a good fix: compile with "-arch=sm_13" so you generate the same machine code the test system runs:

nvcc -arch=sm_13

3. Register Pressure & Register Usage

This Stack Overflow post discusses nvcc and register usage:

http://stackoverflow.com/questions/9723431/tracking-down-cuda-kernel-register-usage

If you pass -Xptxas="-v" to nvcc, it will report how many registers each thread actually uses.

(My Chinese is not great – please feel free to correct me.)

