8/3/2019 (videogame) Rendering 102
1/32
2/32
3/32
4/32
This is where we left off: http://c0de517e.blogspot.com/2011/09/rendering-101.html
5/32
In the middle, Nvidia Fermi GPU Die
6/32
7/32
In the background image and in the scheme: the Pentium 4 CPU. Compare its
complicated design, with its many different logic blocks, to the Fermi GPU in Slide 5.
See e.g. http://www.azillionmonkeys.com/qed/cpujihad.shtml
8/32
The code is written in a dumb way here; we could have written out[i] =
(in*i+105)in*i+, but we wanted to write it in a way that's closer to how things are
translated to assembly and executed.
9/32
10/32
It's interesting to notice the similarities between manual unrolling and prefetching used
in traditional CPU optimization, fibers, continuations and async I/O used in modern
servers, and GPGPU architecture.
Note about the fixed vectors: SIMD instructions in the language (e.g. math on float4s) do not translate into SIMD execution in HW (nor are they needed for HW SIMD). HW SIMD width
may be very different from language SIMD width!
11/32
SIGGRAPH 2008 - From Shader Code to a Teraflop: How Shader Cores Work, Kayvon
Fatahalian, Stanford University
http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf
12/32
13/32
14/32
http://en.wikipedia.org/wiki/Parallel_computing
15/32
Moore's law: it was originally about transistor count, and processors roughly managed to respect it. But CPUs are also respecting it in performance, which is odd, as performance should increase due to both the transistor count AND the CPU frequency (faster!). GPUs are following Moore's law in transistor count but beating it (as should be expected) when it comes to performance, but only on heavily data-parallel tasks where all the code runs in parallel (Amdahl's law is the limiting factor there).
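To put a number on that last point, here is a minimal sketch of Amdahl's law in Python. The function name and the sample values are only illustrative; they are not taken from the slides.

```python
def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    """Upper bound on the speedup when only `parallel_fraction` of the work
    can be spread over `n_units` processing units (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_units < 1:
        raise ValueError("n_units must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


if __name__ == "__main__":
    # How the bound behaves as more units are added, for a few fractions.
    for p in (0.5, 0.9, 0.99, 1.0):
        print(p, [round(amdahl_speedup(p, n), 1) for n in (1, 8, 64, 512)])
```

With 90% of the work parallel the speedup stays below 10x no matter how many units are added, which is why the wide hardware only pays off when nearly all of the code is data-parallel.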
What's on the die (PC processors...)
8086...386 --- Mostly processing power: logic units.
486...Pentium2 --- Processing power and caches: a bit of cache, FPUs. Multiple pipelines.
Pentium3...Pentium4 --- Caches and scheduling logic: heavy instruction decode/reorder units, branch prediction, cache prediction. Longer pipelines.
Core2...i7 --- Multicore + big caches.
Future --- Back to pure processing power, ALUs on most of the die (and cache). Manycore, small decode stages (in-order, shared between units) and caches (shared between units), wide hardware and logical SIMD, lower power/flops ratio (GPUs, Cell...).
Manycore (GPU) integrated with multicore (CPU), sharing a cache level or direct bus interconnection (single die or fast paths between units: Xenon/Xenos, PS3 PPU/SPU...).
Past:
http://www.tayloredge.com/museum/processor/processorhistory.html
http://www.cpu-world.com/CPUs/index.html
http://www.chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html
http://www.thg.ru/cpu/19990809/onepage.html
http://www.cs.clemson.edu/~mark/330/p6.html
Future:
www.gpgpu.org/static/s2007/slides/02-gpu-architecture-overview-s07.pdf
s09.idav.ucdavis.edu/talks/02_kayvonf_gpuArchTalk09.pdf
bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
http://www.research.ibm.com/cell/
http://en.wikipedia.org/wiki/Cell_(microprocessor)
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i5-2600k-i5-2500k-and-core-i3-2100-tested
http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx
16/32
17/32
How much latency? On the Xbox 360 GPU, from the start of a shader (task) to the end (write
into the framebuffer) there are roughly 1000 GPU cycles of latency.
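A rough way to see what such a latency implies is Little's law: the work in flight equals throughput times latency. The sketch below is only a back-of-the-envelope model; the function names and the throughput figures used with it are made up for illustration, only the order of magnitude of the latency comes from the note above.

```python
import math


def items_in_flight(latency_cycles: float, items_per_cycle: float) -> int:
    """Little's law: how many tasks must be in flight to sustain a given
    throughput when each task takes `latency_cycles` from start to end."""
    return math.ceil(latency_cycles * items_per_cycle)


def sustained_throughput(latency_cycles: float, max_in_flight: int,
                         peak_items_per_cycle: float) -> float:
    """Throughput actually reached when the hardware can keep at most
    `max_in_flight` tasks alive at the same time."""
    return min(peak_items_per_cycle, max_in_flight / latency_cycles)
```

For example, `items_in_flight(1000, 1)` says that retiring one task per cycle with 1000 cycles of latency needs about a thousand tasks alive at once, which is one way to read why GPUs keep so much state on chip.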
18/32
Just a few examples! There are many fast sequential sorts (e.g. Radix and the other
distribution sorts), many are even faster if the sequence to sort has certain properties
(e.g. Uniform: Flash, Almost sorted: Smooth) or if some given behaviour is
desirable (e.g. Cache efficient: Funnel, Few writes: Cycle, Extract LIS: Patience, Online:
Library), and most of them can be parallelized (not only the MergeSort). Also, hybrids are often useful (e.g. Radix sort and parallel merge).
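As one concrete instance of a distribution sort, here is a minimal sequential LSD radix sort in Python. It is a sketch, not the code behind the slide: the function name and the `key_bits` / `radix_bits` parameters are my own choices, and it assumes non-negative integer keys below 2**key_bits.

```python
def radix_sort(keys, key_bits=32, radix_bits=8):
    """Sort non-negative integers below 2**key_bits, least significant
    digit first, using a stable counting pass per digit."""
    keys = list(keys)
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        # 1) Histogram of the current digit.
        counts = [0] * (mask + 1)
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # 2) Exclusive prefix sum: counts become output offsets.
        offsets, total = [], 0
        for c in counts:
            offsets.append(total)
            total += c
        # 3) Stable scatter into the output array.
        out = [0] * len(keys)
        for k in keys:
            digit = (k >> shift) & mask
            out[offsets[digit]] = k
            offsets[digit] += 1
        keys = out
    return keys
```

Each pass is a histogram, a prefix sum and a scatter, three steps that are usually straightforward to split across many threads, which is part of why radix sort shows up so often in parallel hybrids.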
www.cse.ohio-state.edu/~kerwin/MPIGpu.pdf
theory.csail.mit.edu/classes/6.895/fall03/projects/final/youn.ppt
http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/algoen.htm
http://elliottback.com/wp/sorting-in-linear-time/
http://en.wikipedia.org/wiki/Sorting_algorithm
http://scholar.google.ca/scholar?hl=en&lr=&q=related:W94guGpq0ZIJ:scholar.google.com/&um=1&ie=UTF-8&ei=hiWvTeuCMamw0QGStaCiCw&sa=X&oi=science_links&ct=sl-related&resnum=1&ved=0CCQQzwIwAA
19/32
http://www.amazon.com/Programming-Massively-Parallel-Processors-Hands-/dp/0123814723/ref=sr_1_2?ie=UTF8&qid=1319062601&sr=8-2
http://www.amazon.com/GPU-Computing-Gems-Emerald-Applications/dp/0123849888/ref=pd_bxgy_b_img_c
http://developer.nvidia.com/category/zone/cuda-zone
http://gpgpu.org/
20/32
Also, you might want to start experimenting with some code. SlimDX is a good C# (.NET)
wrapper for DirectX 9/10/11 and comes with some simple examples: http://slimdx.org/. Note
that the various APIs carry different amounts of legacy; it's probably not a bad idea to start with
DX11. It's more complicated than, for example, old immediate mode OpenGL, but it's closer to
reality.
Immediate mode refers to a way of drawing primitives that places the vertex data directly in
the command buffer (instead of in a resource that is created upfront and then bound). It's
slow and mostly deprecated.
States (switches for the fixed function parts of the pipelines), resources and shader constants
are very different concepts in DX9, where you can set/get a single render state or a single
shader constant. This is expensive: it leads to lots of commands being sent and requires
careful practices to avoid generating too many redundant state sets. DX10-11 manage
everything as resources: buffers that the CPU can modify, transfer to the GPU and then set, and
that contain data, textures, groups of states and shader constants.
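One of those "careful practices" is filtering redundant sets on the CPU before they turn into commands. The sketch below shows the idea only: `StateCache`, its `submit` callback and the state names used with it are hypothetical stand-ins, not part of any DirectX API.

```python
class StateCache:
    """Drops state sets that would not change anything, so that only real
    changes are forwarded to the (hypothetical) device."""

    def __init__(self, submit):
        self._submit = submit   # callable(name, value) issuing the real command
        self._current = {}      # last value forwarded for each state
        self.skipped = 0        # number of redundant sets filtered out

    def set_state(self, name, value):
        if name in self._current and self._current[name] == value:
            self.skipped += 1   # same value as last time: nothing to send
            return
        self._current[name] = value
        self._submit(name, value)
```

Usage would look like `cache = StateCache(device_set)` followed by `cache.set_state("ZENABLE", True)` wherever the renderer used to call the device directly; the names here are plain strings picked for the example.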
Some GPU implementations are very different from what I will sketch from here on. For example,
the PowerVR GPU, which is used in many mobile platforms, uses tile-based deferred rendering,
which is pretty different from what I will discuss in terms of pipelines and rasterization, and it's
pretty distant from the logical stages that the DirectX API exposes (even if you might find a
DirectX or, equivalently, OpenGL implementation for such platforms):
http://en.wikipedia.org/wiki/PowerVR
Remember that these pipeline stages are only a logical view of the API with some logical view
of a typical implementation, not a low-level physical view.
Also, we won't delve deep; for further reference see:
http://c0de517e.blogspot.com/2008/04/gpu-part-1.html
http://c0de517e.blogspot.com/2008/04/how-gpu-works-part-2.html
21/32
Dump of GPU commands, captured with Windows DirectX PIX (comes with the DirectX
SDK).
There are a few other API capture applications; the best ones for DirectX 9 are PIX and
Intel GPA (http://software.intel.com/en-us/articles/intel-gpa/).
For DirectX 11 you can use PIX but also Nvidia's Parallel Nsight:
http://developer.nvidia.com/nvidia-parallel-nsight and AMD GPU PerfStudio
http://developer.amd.com/tools/PerfStudio/Pages/default.aspx
To peek into games the old DXExplorer http://www.sandboxsoftware.net/dxexplorer/
and 3D Ripper are fun too http://www.deep-shadows.com/hax/3DRipperDX.htm
Similar tools exist for OpenGL, for example gDEBugger http://www.gremedy.com/
22/32
Data fetching usually has a simple linear cache associated with it (pre-transform cache).
Indexed reading is useful as triangles often share the same vertices, so we don't want to replicate the same data over and over in source buffers; by using indexing we can
just replicate indices. Moreover, GPUs store a few vertices out of the vertex shader, and if you ask to shade the same vertex again it can reuse the output without re-executing the shader (post-transform cache).
Usually indexed triangle lists are the best way to render objects on modern GPUs, and there are ways to optimize the order of triangles in an object in order to maximize post-transform cache use.
Usually using interleaved inputs (a single buffer, an array of structures) is the best choice, but we might want to split the data into different buffers, for example if we need to use the same data in multiple draws but each draw needs only some attributes, or if part of the data needs to be modified by the CPU, etc.
As for all the units, don't confuse the logical representation with the physical one. For example, on the Xbox 360 Xenos GPU the vertex data fetching is done by the vertex shader: vertex fetching instructions are added at the beginning of a vertex shader and the fetching unit is only a kind of memory interface, like the texture fetching unit, and it's available to both vertex and pixel shaders (as the shader engine is unified on that GPU). The only unit that generates vertices is the unit that handles indexing and post-transform caching, and vertices are only indices that are an input to the vertex shader.
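The effect of the post-transform cache on an indexed triangle list can be estimated with a tiny simulation. This is a sketch under assumptions: it models the cache as a FIFO of 16 entries, while real sizes and replacement policies vary between GPUs, and the function names are mine.

```python
from collections import deque


def vertex_shader_runs(indices, cache_size=16):
    """Count vertex shader invocations for an index stream, assuming a
    FIFO post-transform cache holding `cache_size` shaded vertices."""
    cache = deque(maxlen=cache_size)
    runs = 0
    for idx in indices:
        if idx not in cache:
            runs += 1
            cache.append(idx)   # a full deque drops its oldest entry
    return runs


def acmr(indices, cache_size=16):
    """Average cache miss ratio: shaded vertices per triangle."""
    if len(indices) < 3:
        raise ValueError("need at least one triangle")
    return vertex_shader_runs(indices, cache_size) / (len(indices) // 3)
```

Two triangles sharing an edge, `[0, 1, 2, 2, 1, 3]`, need four shader runs instead of the six a non-indexed list would cost; reordering triangles to lower `acmr` is the kind of optimization mentioned above.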
23/32
24/32
Nvidia FX Composer is a good tool to experiment with shaders without having to care
about how to draw primitives and bind resources: http://developer.nvidia.com/fx-composer
The older 1.8 version of FX Composer is actually a better starting point than the newer .NET based one.
The DirectX shader language is called HLSL; it's very close to Nvidia's Cg, which can be
used both from OpenGL and DirectX (and works on non-Nvidia cards too). OpenGL uses
GLSL, which is a bit more confusing and in general messier. HLSL and Cg support an
effect framework (FX) that enables the specification of not only shader code but also
control over some GPU states, thus making it possible to data-drive most of what is needed to
render a given object in an easier way.
25/32
Note: in this slide we are really collapsing multiple related fixed units. Clipping/culling,
primitive setup/assembly, interpolation and so on are different stages, each with its
own limits (throughput) and buffers.
There is a lot of documentation on how to write rasterizers and rasterization algorithms. I'll link here two software-based approaches, a simple and mostly illustrative
one: http://www.devmaster.net/codespotlight/show.php?id=17 and a more advanced
one that was devised for the (now defunct) Larrabee GPU, which had a programmable
rasterizer: http://software.intel.com/en-us/articles/rasterization-on-larrabee/
Often rasterization happens in more than a single step. There can, for example, be a
coarse rasterizer that generates bigger tiles (e.g. 8x8 pixels) and then does some early
rejection (see next slides), to then pass the results to a fine rasterizer that generates 2x2
quads.
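The two-step idea can be sketched in software with edge functions. This is an illustrative toy, not how any specific GPU does it: the coarse step here only rejects tiles that lie fully outside the triangle (no depth-based early rejection), the fine step emits single pixels rather than 2x2 quads, and there is no fill rule, so pixels exactly on a shared edge would be drawn by both triangles.

```python
def edge(ax, ay, bx, by, px, py):
    """Edge function: positive when p lies to the left of the edge a->b."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(tri, width, height, tile=8):
    """Return the set of (x, y) pixels whose centers are inside `tri`,
    a sequence of three (x, y) vertices in either winding order."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return set()                       # degenerate triangle
    if area < 0:                           # make the winding consistent
        (x1, y1), (x2, y2) = (x2, y2), (x1, y1)
    edges = [(x0, y0, x1, y1), (x1, y1, x2, y2), (x2, y2, x0, y0)]

    covered = set()
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            tx1, ty1 = min(tx + tile, width), min(ty + tile, height)
            corners = [(tx, ty), (tx1, ty), (tx, ty1), (tx1, ty1)]
            # Coarse step: skip the tile if all of its corners are
            # outside the same edge (the whole tile is then outside).
            if any(all(edge(*e, cx, cy) < 0 for cx, cy in corners)
                   for e in edges):
                continue
            # Fine step: test each pixel center of the surviving tile.
            for y in range(ty, ty1):
                for x in range(tx, tx1):
                    if all(edge(*e, x + 0.5, y + 0.5) >= 0 for e in edges):
                        covered.add((x, y))
    return covered
```

Because the edge function is linear, a tile whose four corners are all on the outside of one edge cannot contain any covered pixel, which is what makes the coarse rejection safe.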
26/32
27/32
Quad waste is a big problem for dense meshes, tessellation and displacement. Mesh
level of detail is often more useful to avoid generating too much waste in the shaded
quads than to reduce the number of vertices and their associated load in the vertex
shading stages of the pipeline. It's a problem that needs to be solved for the future of
GPU graphics. http://graphics.stanford.edu/papers/fragmerging/shade_sig10.pdf
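A quick way to quantify quad waste: given the pixels a triangle covers, count how many 2x2 quads it touches and compare covered pixels with shaded lanes. The sketch assumes quads are aligned to even pixel coordinates and that every touched quad is shaded in full; the function name is hypothetical.

```python
def quad_efficiency(covered):
    """Fraction of shaded lanes that are actually covered pixels when
    every touched 2x2 quad is shaded in full.

    `covered` is an iterable of (x, y) integer pixel coordinates."""
    covered = set(covered)
    quads = {(x // 2, y // 2) for x, y in covered}
    if not quads:
        return 1.0                          # nothing shaded, nothing wasted
    return len(covered) / (4 * len(quads))
```

A triangle covering a single pixel runs at 25% efficiency, and so does a small 2x2-pixel triangle that happens to straddle four quads, which is why very dense meshes pay so much more in pixel shading than their pixel count suggests.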
It's a long discussion, and in practice very few shaders use differentials to integrate
discontinuities (we should though, or try to avoid discontinuities and high frequencies
in shaders), but the differentials are used by default when fetching textures to pre-filter
them (mipmaps, anisotropic filtering), and this alone makes a HUGE difference.
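To make the texture case concrete, here is a simplified sketch of how a mip level can be derived from the UV differences inside one 2x2 quad. It uses a common isotropic approximation (log2 of the longest UV step measured in texels); actual hardware formulas differ and add anisotropy, bias and clamping, and the function name and argument layout are my own.

```python
import math


def mip_level(uv_quad, tex_size):
    """Pick a mip level from the UVs of one 2x2 quad.

    `uv_quad` is ((uv00, uv10), (uv01, uv11)): rows go along y, columns
    along x, each uv is a (u, v) pair. `tex_size` is the size in texels
    of the (square) base level."""
    (uv00, uv10), (uv01, _uv11) = uv_quad
    # Finite differences across the quad, scaled to texel units.
    ddx = ((uv10[0] - uv00[0]) * tex_size, (uv10[1] - uv00[1]) * tex_size)
    ddy = ((uv01[0] - uv00[0]) * tex_size, (uv01[1] - uv00[1]) * tex_size)
    rho = max(math.hypot(*ddx), math.hypot(*ddy))   # texels per pixel
    if rho <= 0.0:
        return 0.0
    return max(0.0, math.log2(rho))                 # clamp magnification to 0
```

The differences need the neighbouring pixels of the quad, which is the reason all four lanes get shaded even when the triangle covers only one of them.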
Some pointers for further reading:
http://en.wikipedia.org/wiki/Spatial_anti-aliasing
http://en.wikipedia.org/wiki/Multisample_anti-aliasing
http://www.amazon.com/Texturing-Modeling-Second-Procedural-Approach/dp/0122287304
http://www.amazon.com/Real-Time-Rendering-Third-Tomas-Akenine-Moller/dp/1568814240/ref=sr_1_1?s=books&ie=UTF8&qid=1318980645&sr=1-1
28/32
29/32
I won't go into any details of what a stencil buffer is and how early-stencil can be used,
or the many ways a depth buffer can be used and configured... But early rejections are
fundamental for GPU performance and they can be used for many different
optimizations. It's fundamental to understand how they work on a specific GPU
architecture, as often they will work only in some given state configurations and are easy to invalidate. Some data here:
http://www.gpgpu.org/w/index.php/Code_Examples#Early-z
Also, each GPU vendor calls these optimizations by a different name: Hierarchical Z (HiZ),
HyperZ, Early-Z, Zcull and so on... The actual representation that permits this early test
varies with the hardware, and there can even be multiple levels of rejection. Most of
them are inspired by Hierarchical Occlusion Maps:
http://www.cs.unc.edu/~zhangh/hom.html but the literature is rich, e.g.
http://citeseerx.ist.psu.edu/showciting;jsessionid=921596071E227DDD76FAFE435BCBFC89?cid=78472
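The general shape of a hierarchical depth test can be sketched as a coarse buffer storing, per tile, the farthest depth currently in that tile. Everything below is an assumption made for illustration (class name, 8x8 tiles, a "less than" depth test with smaller values being nearer, one level only); the real representations are vendor specific.

```python
class HiZ:
    """Coarse depth buffer: one conservative 'farthest depth' per tile."""

    def __init__(self, width, height, tile=8, far=1.0):
        self.tile = tile
        self.cols = -(-width // tile)      # ceiling division
        self.rows = -(-height // tile)
        self.max_z = [[far] * self.cols for _ in range(self.rows)]

    def can_reject(self, tx, ty, prim_min_z):
        """True when even the nearest point of the primitive is behind
        everything already stored in tile (tx, ty)."""
        return prim_min_z > self.max_z[ty][tx]

    def update(self, tx, ty, tile_depths):
        """Refresh the bound from the full-resolution depths of the tile."""
        self.max_z[ty][tx] = max(tile_depths)
```

The test is only valid while the coarse bound stays conservative, which hints at why these schemes are easy to invalidate: anything that makes the final depth unknown before shading leaves `prim_min_z` without a trustworthy value.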
30/32
31/32
32/32
Example image and Z-buffer from a PIX capture of a retail version of Battlefield: Bad
Company 2.