7/29/2019 06-programmingModelsRealTimeGraphics-BPS2011-lefohn
1/25
Beyond Programmable Shading Course
Parallel Programming Models for Real-Time Graphics
Aaron Lefohn, Intel Corporation
Hardware Resources
Core
Execution Context
SIMD functional units
On-chip memory
CPU-GPU System-on-a-Chip
Abstraction
Abstraction enables portability and system optimization
E.g., dynamic load balancing, SIMD utilization, producer-consumer
Lack of abstraction enables architecture-specific programmer optimizations
E.g., multiple execution contexts jointly building on-chip data structures
When a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource
Execution Definitions
Execution context
The state required to execute an instruction stream: instruction pointer, registers, etc. (aka thread)
Work
A logically related set of instructions executed in a single execution context (aka shader, instance of a kernel, task)
Concurrent execution
Multiple units of work that may execute simultaneously (because they are logically independent)
Parallel execution
Multiple units of work whose execution contexts are guaranteed to be live (because you want them to be, for locality, synchronization, etc.)
Synchronization
Synchronization between execution contexts
Enables inter-context communication
Restricts when work is permitted to execute
Granularity of permitted synchronization determines at which granularity the system allows the programmer to control scheduling
Vertex Shaders: Pure Data Parallelism
Execution
Concurrent execution of identical per-vertex work
What is abstracted?
Cores, execution contexts, SIMD functional units, memory hierarchy
What synchronization is allowed?
Between draw calls
Pure Data-parallel Pseudocode
concurrent_for( i = 1 to numVertices)
{
// Execute vertex shader
}
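The concurrent_for above can be sketched concretely. The following Python sketch uses a thread pool as a stand-in for the GPU's scheduler; the function name vertex_shader and the scale-by-two "shading" it does are hypothetical placeholders, not part of the original slides.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-vertex work standing in for a vertex shader.
def vertex_shader(v):
    x, y, z = v
    return (2.0 * x, 2.0 * y, 2.0 * z)

vertices = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# concurrent_for: every unit of work is logically independent, so the
# system is free to run them in any order on any number of execution
# contexts. map() preserves the output order regardless of scheduling.
with ThreadPoolExecutor() as pool:
    transformed = list(pool.map(vertex_shader, vertices))
```

Because no vertex communicates with any other, the same code scales from one context to many without change, which is exactly what the abstraction buys.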
Conventional Thread Parallelism
Execution
Parallel execution of N different units of work on N execution contexts
Parallel execution of M identical units of work on M-wide SIMD functional units
What is abstracted?
Nothing (ignoring preemption)
Where is synchronization allowed?
Between any execution context at various granularities
Conventional Thread Parallelism
CPU
Launch a pthread per hardware execution context
GPU
Persistent threads
Launch a workgroup per hardware execution context, sized to the HW SIMD width
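The persistent-threads idea can be sketched as follows: launch exactly as many workers as there are hardware execution contexts, and have each worker loop, pulling items from a shared queue, instead of launching one short-lived thread per item. NUM_CONTEXTS and the squaring "work" are hypothetical stand-ins.

```python
import queue
import threading

NUM_CONTEXTS = 4  # stand-in for the number of hardware execution contexts
work = queue.Queue()
for i in range(100):
    work.put(i)

results = []
lock = threading.Lock()

def persistent_worker():
    # Each "persistent thread" stays live and keeps pulling work until the
    # queue drains, giving the programmer explicit control of scheduling.
    while True:
        try:
            item = work.get_nowait()
        except queue.Empty:
            return
        out = item * item  # hypothetical per-item work
        with lock:
            results.append(out)

threads = [threading.Thread(target=persistent_worker) for _ in range(NUM_CONTEXTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note the trade-off the slides describe: nothing is abstracted here, so the code bakes in an assumption about how many contexts exist.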
D3D/OpenGL Rendering Pipeline
Execution
Concurrent execution of identical work within each shading stage
Concurrent execution of different shading stages
Each stage spawns work to the next stage
No parallelism exposed to user
What is abstracted?
Cores, execution contexts, SIMD functional units, memory hierarchy, and fixed-function graphics units (tessellator, rasterizer, ROPs)
Where is synchronization allowed?
Between draw calls
Abstracting SIMD ALUs
Explicit SIMD Programming
float16 a = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
float16 b = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
float16 c = a + b;
Mechanisms
Intrinsics
Assembly
Wide vector types
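The float16 example can be mirrored with a minimal stand-in for a wide vector type. The Float16 class below is hypothetical, written only to show the explicit-SIMD style: the programmer names the full SIMD width and one "+" conceptually operates on all lanes at once.

```python
# A minimal stand-in for a 16-wide vector type, like the slide's float16.
class Float16:
    WIDTH = 16

    def __init__(self, lanes):
        lanes = list(lanes)
        assert len(lanes) == self.WIDTH  # explicit SIMD: width is fixed and visible
        self.lanes = lanes

    def __add__(self, other):
        # Conceptually a single wide add across all 16 lanes; in real
        # explicit-SIMD code this would be one vector instruction.
        return Float16(x + y for x, y in zip(self.lanes, other.lanes))

a = Float16(range(16))
b = Float16(range(16))
c = a + b
```

The cost of this style is the same as the slides note for any lack of abstraction: code written against a 16-wide type does not automatically scale to hardware with a different SIMD width.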
SPMD/Implicit SIMD Programming
parallel_for( i = 1 to SIMD_width)
{
// Per-lane code goes here
}
concurrent_for( i = 1 to someBigNumber)
{
// Per-lane code goes here
}
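In the SPMD style above, the programmer writes scalar code for one lane and never names the SIMD width; the system decides the mapping. A sketch of that split, where spmd_launch and per_lane are hypothetical names and a plain loop stands in for the compiler/driver's mapping onto lanes:

```python
def per_lane(i, out):
    # Scalar code written for ONE logical lane; the SIMD width never appears.
    out[i] = i * i

def spmd_launch(kernel, n):
    # In a real SPMD system the compiler/runtime maps the n logical
    # instances onto SIMD lanes and execution contexts. Here a plain
    # loop stands in for that mapping.
    out = [0] * n
    for i in range(n):
        kernel(i, out)
    return out

squares = spmd_launch(per_lane, 8)
```

Because the width is abstracted, the same per-lane code can run on 4-wide, 8-wide, or 16-wide hardware, which is the portability argument the slides make for SPMD.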
SPMD/Implicit SIMD Programming
GPU
Current GPU programming models are always SPMD
CPU
Intel SPMD Program Compiler (ISPC)
SPMD combined with other abstractions
OpenCL (some implementations)
Intel Array Building Blocks
Abstracting Cores and Execution Contexts
Task Systems (Cilk, TBB, ConcRT, GCD, ...)
Execution
Concurrent execution of many (likely different) units of work
Work runs in a single execution context
What is abstracted?
Cores and execution contexts
Not abstracted: SIMD functional units or memory hierarchy
Where is synchronization allowed?
Between tasks
Task Pseudo Code
void myTask(some arguments)
{
}
void main()
{
for( i = 0 to NumTasks - 1 )
{
spawn myTask();
}
sync;
// More work
}
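The spawn/sync pattern above maps closely onto a real task API. A sketch using Python's concurrent.futures, where my_task and its increment work are hypothetical; submit plays the role of spawn and wait plays the role of sync:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def my_task(i):
    # Hypothetical unit of work; each task runs in a single execution context.
    return i + 1

NUM_TASKS = 8
with ThreadPoolExecutor() as pool:
    # spawn: hand tasks to the runtime, which maps them onto cores/contexts.
    futures = [pool.submit(my_task, i) for i in range(NUM_TASKS)]
    # sync: block until every spawned task has completed.
    wait(futures)
    results = [f.result() for f in futures]
# "More work" after the sync point sees all task results.
```

Note what is abstracted: the code never says how many cores exist, only how many tasks there are.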
Nested Task Pseudo Code
void barTask(some parameters)
{
}
void fooTask(some parameters)
{
  if (someCondition) {
    spawn barTask();
  } else {
    spawn fooTask();
  }
}
void main()
{
  concurrent_for( i = 0 to NumTasks - 1 )
  {
    fooTask();
  }
  sync;
  // More code
}
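Nesting, where a task spawns further tasks, can be sketched the same way. The names foo_task/bar_task mirror the pseudocode; the decrement-until-one condition is a hypothetical stand-in for someCondition:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()  # default pool is large enough for this shallow nesting

def bar_task(n):
    return n  # hypothetical leaf work

def foo_task(n):
    # A task may itself spawn tasks; the condition decides whether to
    # recurse or hand off to the leaf task.
    if n <= 1:
        return pool.submit(bar_task, n).result()
    return pool.submit(foo_task, n - 1).result() + 1

# Caveat: blocking on nested tasks can deadlock a small fixed-size thread
# pool; production task systems (Cilk, TBB) avoid this via work-stealing.
results = [pool.submit(foo_task, i).result() for i in range(4)]
pool.shutdown()
```

The caveat in the comment is why "tasks spawning tasks" is a feature the runtime must support deliberately, not something that falls out of raw threads for free.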
GPU Compute Pseudo Code
void myWorkGroup()
{
parallel_for( i = 0 to NumWorkItems - 1 )
{
// GPU kernel code (this is where you write GPU compute code)
}
}
void main()
{
concurrent_for( i = 0 to NumWorkGroups - 1 )
{
myWorkGroup();
}
sync;
}
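The two levels above can be sketched with threads standing in for work-items: items within a group run in parallel and may synchronize at a barrier, while groups are merely concurrent and here simply run one after another. The group/item sizes and the neighbour-exchange kernel are hypothetical.

```python
import threading

NUM_GROUPS, ITEMS_PER_GROUP = 4, 8
output = [0] * (NUM_GROUPS * ITEMS_PER_GROUP)

def work_group(g):
    # parallel_for level: work-items are live together and may synchronize.
    barrier = threading.Barrier(ITEMS_PER_GROUP)
    local = [0] * ITEMS_PER_GROUP  # stand-in for on-chip shared memory

    def work_item(i):
        local[i] = i       # phase 1: each item writes shared memory
        barrier.wait()     # all items arrive here before any proceeds
        # phase 2: read a neighbour's value, safe only because of the barrier
        output[g * ITEMS_PER_GROUP + i] = local[(i + 1) % ITEMS_PER_GROUP]

    items = [threading.Thread(target=work_item, args=(i,)) for i in range(ITEMS_PER_GROUP)]
    for t in items:
        t.start()
    for t in items:
        t.join()

# concurrent_for level: groups are independent, so any order is legal;
# running them sequentially here is one valid schedule.
for g in range(NUM_GROUPS):
    work_group(g)
```

The barrier is the key detail: it is legal precisely because all work-items in a group are guaranteed live, whereas no such guarantee (and hence no such barrier) exists across work-groups.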
GPU Compute Languages
Execution
Lower level is parallel execution of identical work (work-items) within a work-group
Upper level is concurrent execution of identical work-groups
What is abstracted?
A work-group abstracts a core's execution contexts, SIMD functional units, and memory
Where is synchronization allowed?
Between work-items in a work-group
Between passes (sets of work-groups)
Summary of Concepts
Abstraction
When a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource
Concurrency versus parallelism
Concurrency provides scalability and portability
Parallel execution permits explicit communication and control of locality
Synchronization
Where is the user allowed to control scheduling?
Conclusions
Current real-time rendering programming uses a mix of data-, task-, and pipeline-parallel programming (and conventional threads as a means to an end)
Future SOC (CPU + GPU) programming model directions:
Tasks are an effective way to abstract execution contexts
SPMD is an effective way to abstract over SIMD ALUs
Many open questions
Look for uses of these different models throughout the rest of the course
Acknowledgements
Tim Foley and Matt Pharr at Intel
Mike Houston at AMD
Kayvon Fatahalian at CMU
The Advanced Rendering Technology research team, Pete Baker, Aaron Co and Elliot Garbus at Intel
References
GPU-inspired compute languages
DX11 DirectCompute, OpenCL (CPU+GPU+), CUDA
The Fusion APU Architecture: A Programmer's Perspective (Ben Gaster)
http://developer.amd.com/afds/assets/presentations/2901_final.pdf
Task systems (CPU and CPU+GPU+)
Cilk, Threading Building Blocks (TBB), Grand Central Dispatch (GCD), ConcRT, Task Parallel Library
Conventional CPU thread programming
Pthreads
GPU task systems and persistent threads (i.e., conventional thread programming on GPU)
Aila et al., Understanding the Efficiency of Ray Traversal on GPUs, High Performance Graphics 2009
Tzeng et al., Task Management for Irregular-Parallel Workloads on the GPU, High Performance Graphics 2010
Parker et al., OptiX: A General Purpose Ray Tracing Engine, SIGGRAPH 2010
Additional input (concepts, terminology, patterns, etc.)
Foley, Parallel Programming for Graphics, Beyond Programmable Shading SIGGRAPH 2009; Beyond Programmable Shading CS448s Stanford course
Fatahalian, Running Code at a Teraflop: How a GPU Shader Core Works, Beyond Programmable Shading SIGGRAPH 2009
Keutzer et al., A Design Pattern Language for Engineering (Parallel) Software: Merging the PLPP and OPL projects, ParaPLoP 2010