EECS 583 – Class 21: Research Topic 3 – Compilation for GPUs
University of Michigan
December 12, 2011 – Last Class!!
- 2 -
Announcements & Reading Material
This class's reading
» "Program optimization space pruning for a multithreaded GPU," S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008.
Project demos
» Dec 13-16, 19 (the 19th is full)
» Send me an email with a date and a few timeslots if your group does not have a slot
» Almost all groups have signed up already
- 3 -
Project Demos
Demo format
» Each group gets 30 mins
» Strict deadlines because many groups are back to back – don't be late!
» Plan for 20 mins of presentation (no more!) and 10 mins of questions
» Some slides are helpful; try to have all group members say something
» Talk about what you did (basic idea, previous work), how you did it (approach + implementation), and results
» A demo or real code examples are good
Report
» 5 pages, double spaced, including figures – same content as the presentation
» Due either when you do your demo or Dec 19 at 6pm
- 4 -
Midterm Results
Mean: 97.9, StdDev: 13.9, High: 128, Low: 50
If you did poorly, all is not lost. This is a grad class, and the project is by far the most important part!
The answer key is on the course webpage; pick up graded exams from Daya
- 5 -
Why GPUs?
- 6 -
Efficiency of GPUs
» High memory bandwidth: GTX 285: 159 GB/s vs. i7: 32 GB/s
» High flop rate:
  Single precision: GTX 285: 1062 GFLOPS vs. i7: 102 GFLOPS
  Double precision: GTX 285: 88.5 GFLOPS, GTX 480: 168 GFLOPS vs. i7: 51 GFLOPS
» High flops per watt: GTX 285: 5.2 GFLOP/W vs. i7: 0.78 GFLOP/W
» High flops per dollar: GTX 285: 3.54 GFLOP/$ vs. i7: 0.36 GFLOP/$
- 7 -
GPU Architecture
[Figure: GPU architecture block diagram – 30 streaming multiprocessors (SM 0 … SM 29), each with its own shared memory, register file, and 8 scalar cores (0–7), connected through an interconnection network to global (device) memory; a PCIe bridge links the GPU to the CPU and host memory.]
- 8 -
CUDA
"Compute Unified Device Architecture"
General-purpose programming model
» User kicks off batches of threads on the GPU
Advantages of CUDA
» Interface designed for compute – graphics-free API
» Orchestration of on-chip cores
» Explicit GPU memory management
» Full support for integer and bitwise operations
- 9 -
Programming Model
[Figure: execution timeline – the host launches Kernel 1, which runs on the device as Grid 1; the host then launches Kernel 2, which runs as Grid 2; time flows from one kernel launch to the next.]
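The host/device model above can be sketched in CUDA. This is an illustrative sketch, not code from the slides: the kernel name `scale`, the data, and the launch sizes are all made up for the example.

```cuda
#include <cuda_runtime.h>

// Each kernel launch creates a grid of thread blocks on the device.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    if (i < n)
        data[i] *= factor;
}

int main(void) {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));   // explicit GPU memory management

    dim3 block(256);                          // threads per block
    dim3 grid((n + block.x - 1) / block.x);   // blocks per grid

    scale<<<grid, block>>>(d_data, 2.0f, n);  // Kernel 1 runs as Grid 1
    scale<<<grid, block>>>(d_data, 0.5f, n);  // Kernel 2 runs as Grid 2, after Grid 1
    cudaDeviceSynchronize();                  // host waits for the device to finish

    cudaFree(d_data);
    return 0;
}
```

Because both launches go to the default stream, Grid 2 does not start until Grid 1 completes – matching the serialized timeline in the figure.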
- 10 -
GPU Scheduling
[Figure: the thread blocks of Grid 1 are distributed across the streaming multiprocessors (SM 0, SM 1, SM 2, SM 3, … SM 30 shown), each SM with its own shared memory, register file, and cores 0–7.]
- 11 -
Warp Generation
[Figure: Blocks 0–3 are assigned to SM 0 (with its shared memory, registers, and cores 0–7); within a block, threads are grouped into warps by thread id – threads 0–31 form Warp 0, threads 32–63 form Warp 1.]
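The thread-id-to-warp mapping in the figure can be written out as device code. This kernel is a hypothetical illustration (the name `whoAmI` and the output array are made up), not part of the slides:

```cuda
// Warps are formed from consecutive thread ids within a block.
// warpSize is a built-in CUDA constant (32 on the hardware discussed here).
__global__ void whoAmI(int *laneOfThread) {
    int tid    = threadIdx.x;      // id within the block
    int warpId = tid / warpSize;   // threads 0-31 -> Warp 0, 32-63 -> Warp 1, ...
    int lane   = tid % warpSize;   // position within the warp
    laneOfThread[blockIdx.x * blockDim.x + tid] = warpId * 100 + lane;
}
```

All 32 threads of a warp issue the same instruction together, which is why divergent branches within a warp cost performance.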
- 12 -
Memory Hierarchy
12
Per BlockShared Memory
__shared__ int SharedVar
Block 0
Per-threadRegister
int LocalVarArray[10]
Per-threadLocal Memory
int RegisterVarThread 0
Grid 0
Per appGlobal Memory
Host
__global__ int GlobalVar
__constant__ int ConstVar
Texture<float,1,ReadMode> TextureVar
Per appTexture Memory
Per appConstant Memory
Devic
e
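The memory-space qualifiers above can be combined in one kernel. This is a hedged sketch – the kernel name `memorySpaces` and all variable uses are invented for illustration:

```cuda
#include <cuda_runtime.h>

__device__   int GlobalVar;    // per-application, lives in device (global) memory
__constant__ int ConstVar;     // per-application, cached read-only constant memory

__global__ void memorySpaces(int *out) {
    int RegisterVar = threadIdx.x;  // scalar locals normally live in registers
    int LocalVarArray[10];          // indexed per-thread arrays may spill to local memory
    __shared__ int SharedVar;       // one copy per block, visible to all its threads

    if (threadIdx.x == 0)
        SharedVar = ConstVar;       // thread 0 initializes the block's shared variable
    __syncthreads();                // the whole block waits for the store

    LocalVarArray[RegisterVar % 10] = SharedVar + GlobalVar;
    out[blockIdx.x * blockDim.x + threadIdx.x] = LocalVarArray[RegisterVar % 10];
}
```

Choosing which data goes in shared memory versus global memory is exactly the kind of decision the Ryoo et al. paper's optimization-space pruning tries to automate.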
- 13 -
Discussion Points
Who has written CUDA? How have you optimized it, and how long did it take?
» Did you tune using a better algorithm than trial and error?
Is there any hope of building a GPU compiler that can automatically do what CUDA programmers do?
» How would you do it?
» What's the input language? C, C++, Java, StreamIt?
Are GPUs a compiler writer's best friend or worst enemy?
What about non-scientific codes – can they be mapped to GPUs?
» How can GPUs be made more "general"?