06-programmingModelsRealTimeGraphics-BPS2011-lefohn

  • 7/29/2019 06-programmingModelsRealTimeGraphics-BPS2011-lefohn

    Beyond Programmable Shading Course

    Parallel Programming Models for Real-Time Graphics

    Aaron Lefohn, Intel Corporation


    Hardware Resources

    Core

    Execution Context

    SIMD functional units

    On-chip memory


    CPU-GPU System-on-a-Chip


    Abstraction

    Abstraction enables portability and system optimization

    E.g., dynamic load balancing, SIMD utilization, producer-consumer

    Lack of abstraction enables arch-specific programmer optimization

    E.g., multiple execution contexts jointly building on-chip data structures

    When a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource


    Execution Definitions

    Execution context

    The state required to execute an instruction stream: instruction pointer, registers, … (aka thread)

    Work

    A logically related set of instructions executed in a single execution context (aka shader, instance of a kernel, task)

    Concurrent execution

    Multiple units of work that may execute simultaneously (because they are logically independent)

    Parallel execution

    Multiple units of work whose execution contexts are guaranteed to be live simultaneously (because you want them to be for locality, synchronization, etc.)


    Synchronization

    Synchronization between execution contexts

    Enables inter-context communication

    Restricts when work is permitted to execute

    Granularity of permitted synchronization determines at which granularity the system allows the programmer to control scheduling


    Vertex Shaders: Pure Data Parallelism

    Execution

    Concurrent execution of identical per-vertex work

    What is abstracted?

    Cores, execution contexts, SIMD functional units, memory hierarchy

    What synchronization is allowed?

    Between draw calls


    Pure Data-parallel Pseudocode

    concurrent_for( i = 1 to numVertices )
    {
        // Execute vertex shader
    }
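The `concurrent_for` above can be sketched in Python. This is only an illustration of the model's semantics, not of GPU performance: `concurrent_for`, `vertex_shader`, and `num_contexts` are hypothetical names introduced here, and a thread pool stands in for the hardware. Because every iteration is independent, the same program is correct on one execution context or on many.

```python
from concurrent.futures import ThreadPoolExecutor

def vertex_shader(vertex):
    # Hypothetical per-vertex work: scale a 3D position by 2.
    x, y, z = vertex
    return (2.0 * x, 2.0 * y, 2.0 * z)

def concurrent_for(work, items, num_contexts):
    # Each item is logically independent, so the runtime may use any
    # number of execution contexts; results keep the input order.
    with ThreadPoolExecutor(max_workers=num_contexts) as pool:
        return list(pool.map(work, items))

vertices = [(float(i), 0.0, 1.0) for i in range(8)]
shaded_1 = concurrent_for(vertex_shader, vertices, num_contexts=1)
shaded_4 = concurrent_for(vertex_shader, vertices, num_contexts=4)
```

The caller never names a core or a thread, which is the sense in which the model abstracts those resources.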


    Conventional Thread Parallelism

    Execution

    Parallel execution of N different units of work on N execution contexts

    Parallel execution of M identical units of work on M-wide SIMD functional units

    What is abstracted?

    Nothing (ignoring preemption)

    Where is synchronization allowed?

    Between any execution context at various granularities


    Conventional Thread Parallelism

    CPU

    Launch a pthread per hardware execution context

    GPU

    Persistent threads

    Launch a workgroup per hardware execution context, sized to the HW SIMD width
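A minimal Python sketch of the "one thread per hardware execution context" pattern follows. It assumes `os.cpu_count()` is an acceptable stand-in for the number of hardware contexts; `run_persistent_threads` and the squaring workload are hypothetical. Each thread is persistent: it stays alive and pulls work from a shared queue instead of being created per work item.

```python
import os
import queue
import threading

def run_persistent_threads(work_items, num_contexts=None):
    # Assumption: os.cpu_count() approximates the number of hardware
    # execution contexts; fall back to 1 if it cannot be determined.
    num_contexts = num_contexts or os.cpu_count() or 1
    pending = queue.Queue()
    for item in work_items:
        pending.put(item)
    results = []
    results_lock = threading.Lock()

    def persistent_thread():
        # The thread stays alive and keeps pulling work until none is left.
        while True:
            try:
                item = pending.get_nowait()
            except queue.Empty:
                return
            value = item * item  # hypothetical unit of work
            with results_lock:   # explicit synchronization between contexts
                results.append(value)

    threads = [threading.Thread(target=persistent_thread)
               for _ in range(num_contexts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

Nothing is abstracted here: the programmer chooses the thread count, the scheduling policy, and the locking.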


    D3D/OpenGL Rendering Pipeline

    Execution

    Concurrent execution of identical work within each shading stage

    Concurrent execution of different shading stages

    Each stage spawns work to the next stage

    No parallelism exposed to user

    What is abstracted?

    Cores, execution contexts, SIMD functional units, memory and fixed-function graphics units (tessellator, rasterizer, ROPs, …)

    Where is synchronization allowed?

    Between draw calls
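The "each stage spawns work to the next stage" structure can be sketched as threads connected by queues. This is a simplified illustration, not how D3D or OpenGL is implemented: the stage functions, the `DONE` sentinel, and `run_pipeline` are hypothetical, and each stage uses a single thread, whereas a real pipeline also runs identical work concurrently within a stage.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream

def stage(work, inbox, outbox):
    # Run one pipeline stage: apply `work` to each input item and
    # forward every produced item to the next stage.
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        for produced in work(item):
            outbox.put(produced)

def vertex_stage(v):
    return [v + 1]          # one output per input

def amplify_stage(v):
    return [v, v * 10]      # a stage may spawn more work than it receives

def run_pipeline(inputs):
    q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=stage, args=(vertex_stage, q0, q1)),
        threading.Thread(target=stage, args=(amplify_stage, q1, q2)),
    ]
    for t in threads:
        t.start()
    for item in inputs:
        q0.put(item)
    q0.put(DONE)
    for t in threads:
        t.join()
    out = []
    while True:
        item = q2.get()
        if item is DONE:
            return out
        out.append(item)
```

The caller only submits inputs and collects outputs, which mirrors the point that no parallelism is exposed to the user.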


    Abstracting SIMD ALUs


    Explicit SIMD Programming

    float16 a = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
    float16 b = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
    float16 c = a + b;

    Mechanisms

    Intrinsics

    Assembly

    Wide vector types


    SPMD/Implicit SIMD Programming

    parallel_for( i = 1 to SIMD_width )
    {
        // Per-lane code goes here
    }

    concurrent_for( i = 1 to someBigNumber )
    {
        // Per-lane code goes here
    }
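One way to read the two loops above: the programmer writes scalar per-lane code, and the runtime maps it onto gangs of `SIMD_width` lanes. The Python sketch below models only that mapping. `SIMD_WIDTH`, `per_lane`, and `run_spmd` are hypothetical names, and the loops run sequentially here, so this shows the indexing and masking rather than real SIMD execution.

```python
SIMD_WIDTH = 4  # assumed hardware SIMD width for this sketch

def per_lane(i, data):
    # Per-lane code: written as if for a single scalar index.
    return data[i] * 2 if data[i] % 2 == 0 else data[i] + 1

def run_spmd(program, data):
    out = [None] * len(data)
    # Outer loop: gangs are independent of each other (concurrent_for).
    for base in range(0, len(data), SIMD_WIDTH):
        # Inner loop: the lanes of one gang (parallel_for); lanes past
        # the end of the data are masked off.
        mask = [base + lane < len(data) for lane in range(SIMD_WIDTH)]
        for lane in range(SIMD_WIDTH):
            if mask[lane]:
                out[base + lane] = program(base + lane, data)
    return out
```

The per-lane function contains a branch, yet the author never writes a mask or a vector type; handling that is the job of the SPMD compiler or runtime.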


    SPMD/Implicit SIMD Programming

    GPU

    Current GPU programming models are always SPMD

    CPU

    Intel SPMD Program Compiler (ISPC)

    SPMD combined with other abstractions

    OpenCL (some implementations)

    Intel Array Building Blocks


    Abstracting Cores and Execution Contexts


    Task Systems (Cilk, TBB, ConcRT, GCD, …)

    Execution

    Concurrent execution of many (likely different) units of work

    Work runs in a single execution context

    What is abstracted?

    Cores and execution contexts

    Not abstracted: SIMD functional units or memory hierarchy

    Where is synchronization allowed?

    Between tasks


    Task Pseudo Code

    void myTask(some arguments)
    {
    }

    void main()
    {
        for( i = 0 to NumTasks - 1 )
        {
            spawn myTask();
        }
        sync;
        // More work
    }
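The spawn/sync pattern above maps closely onto Python's `concurrent.futures`. In this sketch `pool.submit` plays the role of `spawn` and `wait` plays the role of `sync`; `my_task`, its squaring body, and `num_contexts` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def my_task(task_id):
    # Hypothetical task body; each task runs in a single execution context.
    return task_id * task_id

def main(num_tasks, num_contexts=4):
    with ThreadPoolExecutor(max_workers=num_contexts) as pool:
        # spawn: hand each task to the task system.
        futures = [pool.submit(my_task, i) for i in range(num_tasks)]
        # sync: block until every spawned task has finished.
        wait(futures)
        results = [f.result() for f in futures]
    # More work can safely use the task results here.
    return sum(results)
```

The number of tasks and the number of contexts are independent, which is how a task system abstracts cores and execution contexts.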


    Nested Task Pseudo Code

    void barTask(some parameters)
    {
    }

    void fooTask(some parameters)
    {
        if (someCondition) {
            spawn barTask();
        }
        else {
            spawn fooTask();
        }
    }

    void main()
    {
        concurrent_for( i = 0 to NumTasks - 1 ) {
            fooTask();
        }
        sync;
        // More code
    }
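Nested spawning needs a `sync` that also waits for tasks created by other tasks. The sketch below is one possible way to get that in Python, assuming a pending-task counter guarded by a condition variable. `TaskSystem` and its methods are hypothetical and do not correspond to the API of Cilk, TBB, or any other library. Unlike the pseudocode, `main` here spawns `foo_task` instead of calling it directly, and `depth == 0` stands in for `someCondition`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class TaskSystem:
    # Minimal hypothetical task system: spawn() may be called from inside
    # a running task, and sync() waits for all tasks, nested ones included.
    def __init__(self, num_contexts):
        self._pool = ThreadPoolExecutor(max_workers=num_contexts)
        self._cond = threading.Condition()
        self._pending = 0

    def spawn(self, fn, *args):
        with self._cond:
            self._pending += 1
        self._pool.submit(self._run, fn, args)

    def _run(self, fn, args):
        try:
            fn(self, *args)
        finally:
            # A child is counted before its parent finishes, so the
            # counter cannot reach zero while work is still outstanding.
            with self._cond:
                self._pending -= 1
                if self._pending == 0:
                    self._cond.notify_all()

    def sync(self):
        with self._cond:
            self._cond.wait_for(lambda: self._pending == 0)

    def shutdown(self):
        self._pool.shutdown()

def bar_task(ts, log, depth):
    log.append(depth)  # list.append is atomic in CPython

def foo_task(ts, log, depth):
    if depth == 0:  # stands in for someCondition
        ts.spawn(bar_task, log, depth)
    else:
        ts.spawn(foo_task, log, depth - 1)

def main(num_tasks, depth):
    ts = TaskSystem(num_contexts=2)
    log = []
    for _ in range(num_tasks):
        ts.spawn(foo_task, log, depth)
    ts.sync()
    ts.shutdown()
    return log
```

Tasks never block waiting for their children, so the sketch cannot deadlock even when the pool has fewer contexts than live tasks.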


    GPU Compute Pseudo Code

    void myWorkGroup()
    {
        parallel_for( i = 0 to NumWorkItems - 1 )
        {
            // GPU kernel code (this is where you write GPU compute code)
        }
    }

    void main()
    {
        concurrent_for( i = 0 to NumWorkGroups - 1 )
        {
            myWorkGroup();
        }
        sync;
    }
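The two-level structure above can be imitated with Python threads to show why the inner loop must be parallel while the outer loop only needs to be concurrent. In this sketch, which is an analogy and not a GPU API, the work-items of a group are threads that are all live at once and meet at a `threading.Barrier`, and the groups are scheduled on a smaller pool. `NUM_WORK_ITEMS`, `my_work_group`, and the neighbour-sum kernel are assumptions made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

NUM_WORK_ITEMS = 4  # work-items per work-group (assumed for this sketch)

def my_work_group(group_id, data):
    # Work-items in a group are all live at once, so they can share
    # local memory and synchronize with a barrier.
    local = [0] * NUM_WORK_ITEMS
    barrier = threading.Barrier(NUM_WORK_ITEMS)
    out = [0] * NUM_WORK_ITEMS

    def work_item(i):
        local[i] = data[group_id * NUM_WORK_ITEMS + i]
        barrier.wait()  # every work-item has written its value
        # Read a neighbour's value; safe only because of the barrier.
        out[i] = local[i] + local[(i + 1) % NUM_WORK_ITEMS]

    items = [threading.Thread(target=work_item, args=(i,))
             for i in range(NUM_WORK_ITEMS)]
    for t in items:
        t.start()
    for t in items:
        t.join()
    return out

def main(data, num_contexts=2):
    num_groups = len(data) // NUM_WORK_ITEMS
    # Work-groups are independent, so they run as concurrent work on
    # however many contexts are available.
    with ThreadPoolExecutor(max_workers=num_contexts) as pool:
        groups = pool.map(lambda g: my_work_group(g, data), range(num_groups))
        return [v for group in groups for v in group]
```

If the work-items were merely concurrent, the barrier could wait forever for an item that has not been started; the guarantee that all of them are live is what makes the barrier legal.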


    GPU Compute Languages

    Execution

    Lower level is parallel execution of identical work (work-items) within a work-group

    Upper level is concurrent execution of identical work-groups

    What is abstracted?

    Work-group abstracts a core's execution contexts, SIMD functional units, memory

    Where is synchronization allowed?

    Between work-items in a work-group

    Between passes (set of work-groups)


    Summary of Concepts

    Abstraction

    When a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource

    Concurrency versus parallelism

    Concurrency provides scalability and portability

    Parallel execution permits explicit communication and c… locality

    Synchronization

    Where is user allowed to control scheduling?


    Conclusions

    Current real-time rendering programming uses a mix of data-, task-, and pipeline-parallel programming (and conventional threads as means to an end)

    Future SOC (CPU + GPU) programming model d…

    Tasks are an effective way to abstract execution contexts

    SPMD is an effective way to abstract over SIMD ALUs

    Many open questions

    Look for uses of these different models throughout the rest of the course


    Acknowledgements

    Tim Foley and Matt Pharr at Intel

    Mike Houston at AMD

    Kayvon Fatahalian at CMU

    The Advanced Rendering Technology research team, Pete Baker, Aaron Co…, and Elliot Garbus at Intel


    References

    GPU-inspired compute languages
      DX11 DirectCompute: http://en.wikipedia.org/wiki/DirectCompute
      OpenCL (CPU+GPU+…): http://www.khronos.org/opencl/
      CUDA: http://www.nvidia.com/object/what_is_cuda_new.html
      The Fusion APU Architecture: A Programmer's Perspective (Ben Gaster): http://developer.amd.com/afds/assets/presentations/2901_final.pdf

    Task systems (CPU and CPU+GPU+…)
      Cilk: http://software.intel.com/en-us/articles/intel-cilk/
      Thread Building Blocks (TBB): http://www.threadingbuildingblocks.org/
      Grand Central Dispatch (GCD): http://developer.apple.com/mac/library/documentation/Performance/Reference/GCD_libdispatch_Ref/Reference/reference.html
      ConcRT: http://msdn.microsoft.com/en-us/library/dd504870.aspx
      Task Parallel Library: http://msdn.microsoft.com/en-us/library/dd460717.aspx

    Conventional CPU thread programming
      Pthreads: http://en.wikipedia.org/wiki/POSIX_Threads

    GPU task systems and persistent threads (i.e., conventional thread programming on GPU)
      Aila et al., Understanding the Efficiency of Ray Traversal on GPUs, High Performance Graphics 2009: http://www.tml.tkk.fi/~timo/
      Tzeng et al., Task Management for Irregular-Parallel Workloads on the GPU, High Performance Graphics 2010: http://idav.ucdavis.edu/publications/print_pub?pub_id=1036
      Parker et al., OptiX: A General Purpose Ray Tracing Engine, SIGGRAPH 2010: http://graphics.cs.williams.edu/papers/OptiXSIGGRAPH10/

    Additional input (concepts, terminology, patterns, etc.)
      Foley, Parallel Programming for Graphics, Beyond Programmable Shading SIGGRAPH 2009: http://s09.idav.ucdavis.edu/talks/03_tfoley_ProgrammingModels.pdf
      Foley, Parallel Programming for Graphics, Beyond Programmable Shading CS448s Stanford course: https://graphics.stanford.edu/wikis/cs448s-10/FrontPage?action=AttachFile&do=get&target=tfoley-Programming+Models.pdf
      Fatahalian, Running Code at a Teraflop: How a GPU Shader Core Works, Beyond Programmable Shading SIGGRAPH 2009: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
      Keutzer et al., A Design Pattern Language for Engineering (Parallel) Software: Merging the PLPP and OPL projects, ParaPL…: http://www.upcrc.illinois.edu/workshops/paraplop10/papers/paraplop10_submission_17.pdf
