7/29/2019 06-programmingModelsRealTimeGraphics-BPS2011-lefohn
1/25
Beyond Programmable Shading Course
Parallel Programming Models for Real-Time Graphics
Aaron Lefohn, Intel Corporation
Hardware Resources
Core
Execution Context
SIMD functional units
On-chip memory
CPU-GPU System-on-a-Chip
Abstraction
Abstraction enables portability and system optimization
E.g., dynamic load balancing, SIMD utilization, producer-consumer
Lack of abstraction enables architecture-specific programmer optimizations
E.g., multiple execution contexts jointly building on-chip data structures
When a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource
Execution Definitions
Execution context
The state required to execute an instruction stream: instruction pointer, registers, etc. (aka thread)
Work
A logically related set of instructions executed in a single execution context (aka shader, instance of a kernel, task)
Concurrent execution
Multiple units of work that may execute simultaneously (because they are logically independent)
Parallel execution
Multiple units of work whose execution contexts are guaranteed to be live (because you want them to be, for locality, synchronization, etc.)
Synchronization
Synchronization between execution contexts
Enables inter-context communication
Restricts when work is permitted to execute
Granularity of permitted synchronization determines at which granularity the system allows the programmer to control scheduling
Vertex Shaders: Pure Data Parallelism
Execution
Concurrent execution of identical per-vertex work
What is abstracted?
Cores, execution contexts, SIMD functional units, memory hierarchy
What synchronization is allowed?
Between draw calls
Pure Data-parallel Pseudocode
concurrent_for( i = 1 to numVertices)
{
// Execute vertex shader
}
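The concurrent_for above can be sketched concretely. The following Python sketch uses a thread pool as a stand-in for the GPU's scheduler; the function name vertex_shader and the scale-by-two "shading" it does are hypothetical placeholders, not part of the original slides.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-vertex work standing in for a vertex shader.
def vertex_shader(v):
    x, y, z = v
    return (2.0 * x, 2.0 * y, 2.0 * z)

vertices = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# concurrent_for: every unit of work is logically independent, so the
# system is free to run them in any order on any number of execution
# contexts. map() preserves the output order regardless of scheduling.
with ThreadPoolExecutor() as pool:
    transformed = list(pool.map(vertex_shader, vertices))
```

Because no vertex communicates with any other, the same code scales from one context to many without change, which is exactly what the abstraction buys.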
Conventional Thread Parallelism
Execution
Parallel execution of N different units of work on N execution contexts
Parallel execution of M identical units of work on M-wide SIMD functional units
What is abstracted?
Nothing (ignoring preemption)
Where is synchronization allowed?
Between any execution context at various granularities
Conventional Thread Parallelism
CPU
Launch a pthread per hardware execution context
GPU
Persistent threads
Launch a workgroup per hardware execution context, sized to the HW SIMD width
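The persistent-threads idea can be sketched as follows: launch exactly as many workers as there are hardware execution contexts, and have each worker loop, pulling items from a shared queue, instead of launching one short-lived thread per item. NUM_CONTEXTS and the squaring "work" are hypothetical stand-ins.

```python
import queue
import threading

NUM_CONTEXTS = 4  # stand-in for the number of hardware execution contexts
work = queue.Queue()
for i in range(100):
    work.put(i)

results = []
lock = threading.Lock()

def persistent_worker():
    # Each "persistent thread" stays live and keeps pulling work until the
    # queue drains, giving the programmer explicit control of scheduling.
    while True:
        try:
            item = work.get_nowait()
        except queue.Empty:
            return
        out = item * item  # hypothetical per-item work
        with lock:
            results.append(out)

threads = [threading.Thread(target=persistent_worker) for _ in range(NUM_CONTEXTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note the trade-off the slides describe: nothing is abstracted here, so the code bakes in an assumption about how many contexts exist.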
D3D/OpenGL Rendering Pipeline
Execution
Concurrent execution of identical work within each shading stage
Concurrent execution of different shading stages
Each stage spawns work to the next stage
No parallelism exposed to user
What is abstracted?
Cores, execution contexts, SIMD functional units, memory hierarchy, and fixed-function graphics units (tessellator, rasterizer, ROPs)
Where is synchronization allowed?
Between draw calls
Abstracting SIMD ALUs
Explicit SIMD Programming
float16 a = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
float16 b = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
float16 c = a + b;
Mechanisms
Intrinsics
Assembly
Wide vector types
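The float16 example can be mirrored with a minimal stand-in for a wide vector type. The Float16 class below is hypothetical, written only to show the explicit-SIMD style: the programmer names the full SIMD width and one "+" conceptually operates on all lanes at once.

```python
# A minimal stand-in for a 16-wide vector type, like the slide's float16.
class Float16:
    WIDTH = 16

    def __init__(self, lanes):
        lanes = list(lanes)
        assert len(lanes) == self.WIDTH  # explicit SIMD: width is fixed and visible
        self.lanes = lanes

    def __add__(self, other):
        # Conceptually a single wide add across all 16 lanes; in real
        # explicit-SIMD code this would be one vector instruction.
        return Float16(x + y for x, y in zip(self.lanes, other.lanes))

a = Float16(range(16))
b = Float16(range(16))
c = a + b
```

The cost of this style is the same as the slides note for any lack of abstraction: code written against a 16-wide type does not automatically scale to hardware with a different SIMD width.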
SPMD/Implicit SIMD Programming
parallel_for( i = 1 to SIMD_width)
{
// Per-lane code goes here
}
concurrent_for( i = 1 to someBigNumber)
{
// Per-lane code goes here
}
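In the SPMD style above, the programmer writes scalar code for one lane and never names the SIMD width; the system decides the mapping. A sketch of that split, where spmd_launch and per_lane are hypothetical names and a plain loop stands in for the compiler/driver's mapping onto lanes:

```python
def per_lane(i, out):
    # Scalar code written for ONE logical lane; the SIMD width never appears.
    out[i] = i * i

def spmd_launch(kernel, n):
    # In a real SPMD system the compiler/runtime maps the n logical
    # instances onto SIMD lanes and execution contexts. Here a plain
    # loop stands in for that mapping.
    out = [0] * n
    for i in range(n):
        kernel(i, out)
    return out

squares = spmd_launch(per_lane, 8)
```

Because the width is abstracted, the same per-lane code can run on 4-wide, 8-wide, or 16-wide hardware, which is the portability argument the slides make for SPMD.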
SPMD/Implicit SIMD Programming
GPU
Current GPU programming models are always SPMD
CPU
Intel SPMD Program Compiler (ISPC)
SPMD combined with other abstractions
OpenCL (some implementations)
Intel Array Building Blocks
Abstracting Cores and Execution Contexts
Task Systems (Cilk, TBB, ConcRT, GCD, ...)
Execution
Concurrent execution of many (likely different) units of work
Work runs in a single execution context
What is abstracted?
Cores and execution contexts
Not abstracted: SIMD functional units or memory hierarchy
Where is synchronization allowed?
Between tasks
Task Pseudo Code
void myTask(some arguments)
{
}
void main()
{
for( i = 0 to NumTasks - 1 )
{
spawn myTask();
}
sync;
// More work
}
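The spawn/sync pattern above maps closely onto a real task API. A sketch using Python's concurrent.futures, where my_task and its increment work are hypothetical; submit plays the role of spawn and wait plays the role of sync:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def my_task(i):
    # Hypothetical unit of work; each task runs in a single execution context.
    return i + 1

NUM_TASKS = 8
with ThreadPoolExecutor() as pool:
    # spawn: hand tasks to the runtime, which maps them onto cores/contexts.
    futures = [pool.submit(my_task, i) for i in range(NUM_TASKS)]
    # sync: block until every spawned task has completed.
    wait(futures)
    results = [f.result() for f in futures]
# "More work" after the sync point sees all task results.
```

Note what is abstracted: the code never says how many cores exist, only how many tasks there are.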
Nested Task Pseudo Code
void barTask(some parameters)
{
}
void fooTask(some parameters)
{
  if (someCondition) {
    spawn barTask();
  } else {
    spawn fooTask();
  }
}
void main()
{
  concurrent_for( i = 0 to NumTasks - 1 )
  {
    fooTask();
  }
  sync;
  // More code
}
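Nesting, where a task spawns further tasks, can be sketched the same way. The names foo_task/bar_task mirror the pseudocode; the decrement-until-one condition is a hypothetical stand-in for someCondition:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()  # default pool is large enough for this shallow nesting

def bar_task(n):
    return n  # hypothetical leaf work

def foo_task(n):
    # A task may itself spawn tasks; the condition decides whether to
    # recurse or hand off to the leaf task.
    if n <= 1:
        return pool.submit(bar_task, n).result()
    return pool.submit(foo_task, n - 1).result() + 1

# Caveat: blocking on nested tasks can deadlock a small fixed-size thread
# pool; production task systems (Cilk, TBB) avoid this via work-stealing.
results = [pool.submit(foo_task, i).result() for i in range(4)]
pool.shutdown()
```

The caveat in the comment is why "tasks spawning tasks" is a feature the runtime must support deliberately, not something that falls out of raw threads for free.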
GPU Compute Pseudo Code
void myWorkGroup()
{
parallel_for( i = 0 to NumWorkItems - 1 )
{
// GPU kernel code (this is where you write GPU compute code)
}
}
void main()
{
concurrent_for( i = 0 to NumWorkGroups - 1 )
{
myWorkGroup();
}
sync;
}
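The two levels above can be sketched with threads standing in for work-items: items within a group run in parallel and may synchronize at a barrier, while groups are merely concurrent and here simply run one after another. The group/item sizes and the neighbour-exchange kernel are hypothetical.

```python
import threading

NUM_GROUPS, ITEMS_PER_GROUP = 4, 8
output = [0] * (NUM_GROUPS * ITEMS_PER_GROUP)

def work_group(g):
    # parallel_for level: work-items are live together and may synchronize.
    barrier = threading.Barrier(ITEMS_PER_GROUP)
    local = [0] * ITEMS_PER_GROUP  # stand-in for on-chip shared memory

    def work_item(i):
        local[i] = i       # phase 1: each item writes shared memory
        barrier.wait()     # all items arrive here before any proceeds
        # phase 2: read a neighbour's value, safe only because of the barrier
        output[g * ITEMS_PER_GROUP + i] = local[(i + 1) % ITEMS_PER_GROUP]

    items = [threading.Thread(target=work_item, args=(i,)) for i in range(ITEMS_PER_GROUP)]
    for t in items:
        t.start()
    for t in items:
        t.join()

# concurrent_for level: groups are independent, so any order is legal;
# running them sequentially here is one valid schedule.
for g in range(NUM_GROUPS):
    work_group(g)
```

The barrier is the key detail: it is legal precisely because all work-items in a group are guaranteed live, whereas no such guarantee (and hence no such barrier) exists across work-groups.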
GPU Compute Languages
Execution
Lower level is parallel execution of identical work (work-items) within a work-group
Upper level is concurrent execution of identical work-groups
What is abstracted?
A work-group abstracts a core's execution contexts, SIMD functional units, and memory
Where is synchronization allowed?
Between work-items in a work-group
Between passes (sets of work-groups)
Summary of Concepts
Abstraction
When a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource
Concurrency versus parallelism
Concurrency provides scalability and portability
Parallel execution permits explicit communication and control of locality
Synchronization
Where is the user allowed to control scheduling?
Conclusions
Current real-time rendering programming uses a mix of data-, task-, and pipeline-parallel programming (and conventional threads as a means to an end)
Future SOC (CPU + GPU) programming model directions:
Tasks are an effective way to abstract execution contexts
SPMD is an effective way to abstract over SIMD ALUs
Many open questions
Look for uses of these different models throughout the rest of the course
Acknowledgements
Tim Foley and Matt Pharr at Intel
Mike Houston at AMD
Kayvon Fatahalian at CMU
The Advanced Rendering Technology research team, Pete Baker, Aaron Co and Elliot Garbus at Intel
References
GPU-inspired compute languages
DX11 DirectCompute, OpenCL (CPU+GPU+), CUDA
The Fusion APU Architecture: A Programmer's Perspective (Ben Gaster)
http://developer.amd.com/afds/assets/presentations/2901_final.pdf
Task systems (CPU and CPU+GPU+)
Cilk, Threading Building Blocks (TBB), Grand Central Dispatch (GCD), ConcRT, Task Parallel Library
Conventional CPU thread programming
Pthreads
GPU task systems and persistent threads (i.e., conventional thread programming on GPU)
Aila et al., Understanding the Efficiency of Ray Traversal on GPUs, High Performance Graphics 2009
Tzeng et al., Task Management for Irregular-Parallel Workloads on the GPU, High Performance Graphics 2010
Parker et al., OptiX: A General Purpose Ray Tracing Engine, SIGGRAPH 2010
Additional input (concepts, terminology, patterns, etc.)
Foley, Parallel Programming for Graphics, Beyond Programmable Shading SIGGRAPH 2009; Beyond Programmable Shading CS448s Stanford course
Fatahalian, Running Code at a Teraflop: How a GPU Shader Core Works, Beyond Programmable Shading SIGGRAPH 2009
Keutzer et al., A Design Pattern Language for Engineering (Parallel) Software: Merging the PLPP and OPL projects, ParaPLoP 2010