8/3/2019 (videogame) Rendering 102

1/32


2/32


3/32


4/32

This is where we left off: http://c0de517e.blogspot.com/2011/09/rendering-101.html


5/32

In the middle, Nvidia Fermi GPU Die


6/32


7/32

In the background image and in the scheme, the Pentium 4 CPU. Compare its complicated design, with many different logic blocks, with the Fermi GPU in Slide 5.

See e.g. http://www.azillionmonkeys.com/qed/cpujihad.shtml


8/32

The code is written in a dumb way here, we could have written out[i] = (in*i+105)in*i+, but we wanted to write it in a way that's closer to how things are translated into assembly and executed.


9/32


10/32

It's interesting to notice the similarities between manual unrolling and prefetching used in traditional CPU optimization, fibers, continuations and async I/O used in modern servers, and GPGPU architecture.
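What these techniques share is keeping other work in flight while one task waits on memory. Here is a minimal Python sketch of that idea, not taken from the slides: a toy in-order core issues at most one ALU operation per cycle, and each operation is followed by a fixed-latency fetch. The function `run`, its parameters and all the numbers are illustrative assumptions, not a model of any real CPU or GPU.

```python
def run(num_tasks, alu_steps, mem_latency, in_flight):
    """Count cycles on a toy core that issues at most one ALU op per cycle.

    Every task performs `alu_steps` ALU ops, and after each op it waits
    `mem_latency` cycles for a fetch. Up to `in_flight` (>= 1) tasks are
    resident at once; switching between resident tasks is assumed free.
    """
    pending = num_tasks   # tasks not started yet
    resident = []         # entries: [cycle_when_ready, alu_ops_left]
    cycle = 0
    while True:
        # Retire tasks that issued all their ops and got their last fetch back.
        resident = [t for t in resident if t[1] > 0 or t[0] > cycle]
        # Fill the free slots with new tasks, ready to issue immediately.
        while pending and len(resident) < in_flight:
            resident.append([cycle, alu_steps])
            pending -= 1
        if not resident:
            return cycle
        # Issue one op from the first resident task that is not waiting.
        for task in resident:
            if task[1] > 0 and task[0] <= cycle:
                task[1] -= 1
                task[0] = cycle + 1 + mem_latency
                break
        cycle += 1


if __name__ == "__main__":
    print("one task at a time:", run(8, 4, 7, in_flight=1))
    print("eight tasks in flight:", run(8, 4, 7, in_flight=8))
```

With these made-up numbers the one-at-a-time run takes 256 cycles and the interleaved run 39: the ALU never idles once enough tasks are resident to cover the fetch latency.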

Note about the fixed vectors: SIMD instructions in the language (e.g. math on float4s) do not translate into SIMD execution in HW (nor are they needed for HW SIMD). The HW SIMD width may be very different from the language SIMD width!


11/32

SIGGRAPH 2008 - "From Shader Code to a Teraflop: How Shader Cores Work", Kayvon Fatahalian, Stanford University

http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf


12/32


13/32


14/32

http://en.wikipedia.org/wiki/Parallel_computing


15/32

Moore's law was originally about transistor count, and processors have roughly managed to respect it. But CPUs are also following it in performance, which is odd: performance should increase due to both the transistor count AND the CPU frequency (faster!). GPUs follow Moore's law in transistor count but beat it (as should be expected) when it comes to performance, though only on heavily data-parallel tasks where all the code runs in parallel (Amdahl's law is the limiting factor there).
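To see why Amdahl's law is the limiting factor, here is a small sketch of its standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n the number of parallel units. The function name and the sample values are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction, units):
    """Ideal speedup when `parallel_fraction` of the work scales over `units`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / units)


if __name__ == "__main__":
    for p in (0.5, 0.95, 1.0):
        print(p, round(amdahl_speedup(p, 512), 2))
```

Even with hundreds of units, a task that is 95% parallel cannot go faster than about 20x, which is why only heavily data-parallel workloads see the full benefit.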

What's on the die (PC processors...)

8086...386 --- Mostly processing power: logic units.

486...Pentium2 --- Processing power and caches: a bit of cache, FPUs. Multiple pipelines.

Pentium3...Pentium4 --- Caches and scheduling logic: heavy instruction decode/reorder units, branch prediction, cache prediction. Longer pipelines.

Core2...i7 --- Multicore + big caches.

Future --- Back to pure processing power, ALUs on most of the die (and cache). Manycore, small decode stages (in-order, shared between units) and caches (shared between units), wide hardware and logical SIMD, lower power/flops ratio (GPUs, Cell...). Manycore (GPU) integrated with multicore (CPU), sharing a cache level or a direct bus interconnection (single die or fast paths between units: Xenon/Xenos, PS3 PPU/SPU...).

Past:
http://www.tayloredge.com/museum/processor/processorhistory.html
http://www.cpu-world.com/CPUs/index.html
http://www.chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html
http://www.thg.ru/cpu/19990809/onepage.html
http://www.cs.clemson.edu/~mark/330/p6.html

Future:
www.gpgpu.org/static/s2007/slides/02-gpu-architecture-overview-s07.pdf
s09.idav.ucdavis.edu/talks/02_kayvonf_gpuArchTalk09.pdf
bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
http://www.research.ibm.com/cell/
http://en.wikipedia.org/wiki/Cell_(microprocessor)
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i5-2600k-i5-2500k-and-core-i3-2100-tested
http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx


16/32


17/32

How much latency? On the 360 GPU, from the start of a shader (task) to the end (write into the framebuffer) there are roughly 1000 GPU cycles of latency.


18/32

Just a few examples! There are many fast sequential sorts (e.g. radix and the other distribution sorts). Many are even faster if the sequence to sort has certain properties (e.g. uniform: Flash, almost sorted: Smooth) or if some given behaviour is desirable (e.g. cache efficient: Funnel, few writes: Cycle, extract LIS: Patience, online: Library), and most of them can be parallelized (not only merge sort). Hybrids are also often useful (e.g. radix sort and parallel merge).
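As a rough illustration of the hybrid idea, the sketch below sorts independent chunks with an LSD radix sort and then merges the sorted chunks. It is a sequential Python sketch under simplifying assumptions (non-negative integers only, illustrative function names and chunk size); the chunks are independent, so each could be handed to a different worker, which is not done here.

```python
import heapq


def radix_sort(values, bits_per_pass=8):
    """LSD radix sort for non-negative integers (a distribution sort)."""
    out = list(values)
    if not out:
        return out
    mask = (1 << bits_per_pass) - 1
    largest = max(out)
    shift = 0
    while (largest >> shift) > 0:
        # Stable distribution of the keys by the current digit.
        buckets = [[] for _ in range(mask + 1)]
        for v in out:
            buckets[(v >> shift) & mask].append(v)
        out = [v for bucket in buckets for v in bucket]
        shift += bits_per_pass
    return out


def hybrid_sort(values, chunk_size):
    """Radix-sort fixed-size chunks, then k-way merge the sorted chunks."""
    chunks = [radix_sort(values[i:i + chunk_size])
              for i in range(0, len(values), chunk_size)]
    return list(heapq.merge(*chunks))


if __name__ == "__main__":
    data = [170, 45, 75, 90, 802, 24, 2, 66]
    print(hybrid_sort(data, chunk_size=3))
```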

www.cse.ohio-state.edu/~kerwin/MPIGpu.pdf

theory.csail.mit.edu/classes/6.895/fall03/projects/final/youn.ppt

http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/algoen.htm

http://elliottback.com/wp/sorting-in-linear-time/

http://en.wikipedia.org/wiki/Sorting_algorithm

http://scholar.google.ca/scholar?hl=en&lr=&q=related:W94guGpq0ZIJ:scholar.google.com/&um=1&ie=UTF-8&ei=hiWvTeuCMamw0QGStaCiCw&sa=X&oi=science_links&ct=sl-related&resnum=1&ved=0CCQQzwIwAA


19/32

http://www.amazon.com/Programming-Massively-Parallel-Processors-Hands-/dp/0123814723/ref=sr_1_2?ie=UTF8&qid=1319062601&sr=8-2

http://www.amazon.com/GPU-Computing-Gems-Emerald-Applications/dp/0123849888/ref=pd_bxgy_b_img_c

http://developer.nvidia.com/category/zone/cuda-zone

http://gpgpu.org/


20/32

Also you might want to start experimenting with some code. SlimDX is a good C# (.NET) wrapper for DirectX 9/10/11 and comes with some simple examples: http://slimdx.org/. Note that the various APIs carry different amounts of legacy. It's probably not a bad idea to start with DX11: it's more complicated than, for example, old immediate mode OpenGL, but it's closer to reality.

Immediate mode refers to a way of drawing primitives that places the vertex data directly in the command buffer (instead of in a resource that is created upfront and then bound). It's slow and mostly deprecated.

States (switches for the fixed function parts of the pipelines), resources and shader constants are very different concepts in DX9, where you can set/get a single render state or a single shader constant. This is expensive: it leads to lots of commands being sent and requires careful practices to avoid generating too many redundant state sets. DX10-11 manage everything as resources: buffers that the CPU can modify, transfer to the GPU and then set, and that contain data, textures, groups of states and shader constants.
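One common practice for avoiding redundant state sets in a DX9-style API is to shadow the current state on the CPU and drop any set that would not change it. The sketch below shows only that idea: the class, the `submit` callback and the state names are hypothetical and do not correspond to any real DirectX call.

```python
class StateCache:
    """Drops state sets that would not change the current (shadowed) value."""

    def __init__(self, submit):
        self._submit = submit   # hypothetical callable(name, value) that emits a command
        self._current = {}      # CPU-side shadow of what the device has been told
        self.skipped = 0

    def set_state(self, name, value):
        if name in self._current and self._current[name] == value:
            self.skipped += 1   # redundant: nothing is sent
            return
        self._current[name] = value
        self._submit(name, value)


if __name__ == "__main__":
    commands = []
    cache = StateCache(lambda name, value: commands.append((name, value)))
    # Two draws that ask for almost the same (made-up) states.
    for states in ({"z_enable": True, "cull_mode": "back"},
                   {"z_enable": True, "cull_mode": "none"}):
        for name, value in states.items():
            cache.set_state(name, value)
    print(len(commands), "commands sent,", cache.skipped, "skipped")
```

Grouping states into immutable objects, as DX10-11 do, moves this bookkeeping from every single state to a handful of object bindings.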

Some GPU implementations are very different from what I will sketch here. For example the PowerVR GPU, which is used in many mobile platforms, uses tile based deferred rendering, which is pretty different from what I will discuss in terms of pipelines and rasterization, and it's pretty distant from the logical stages that the DirectX API exposes (even if you might find a DirectX, or equivalently OpenGL, implementation for such platforms).

http://en.wikipedia.org/wiki/PowerVR

Remember that these pipeline stages are only a logical view of the API with some logical view of a typical implementation, not a low-level physical view.

Also, we won't delve deep, for further reference see:

http://c0de517e.blogspot.com/2008/04/gpu-part-1.html

http://c0de517e.blogspot.com/2008/04/how-gpu-works-part-2.html


21/32

Dump of GPU commands, captured with Windows DirectX PIX (comes with the DirectX SDK).

There are a few other API capture applications. The best ones for DirectX 9 are PIX and Intel GPA (http://software.intel.com/en-us/articles/intel-gpa/).

For DirectX 11 you can use PIX but also Nvidia's Parallel Nsight: http://developer.nvidia.com/nvidia-parallel-nsight and AMD GPU PerfStudio http://developer.amd.com/tools/PerfStudio/Pages/default.aspx

To peek into games, the old DXExplorer http://www.sandboxsoftware.net/dxexplorer/ and 3D Ripper are fun too http://www.deep-shadows.com/hax/3DRipperDX.htm

Similar tools exist for OpenGL, for example gDEBugger http://www.gremedy.com/


22/32

Data fetching usually has a simple linear cache associated with it (pre-transform cache).

Indexed reading is useful as triangles often share the same vertices, so we don't want to replicate the same data over and over in source buffers; by using indexing we can just replicate indices. Moreover, GPUs store a few vertices out of the vertex shader, and if you ask to shade the same vertex again they can reuse the output without re-executing the shader (post-transform cache).

Usually indexed triangle lists are the best way to render objects on modern GPUs, and there are ways to optimize the order of triangles in an object in order to maximize post-transform cache use.
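The effect of the post-transform cache on an index stream can be estimated with a tiny simulation. This is a sketch that assumes a plain FIFO cache keyed by vertex index; real GPUs differ in cache size and replacement policy, and the function name and sample mesh are illustrative.

```python
from collections import deque


def vertex_shader_runs(indices, cache_size):
    """Count vertex shader executions for an index stream, assuming a FIFO cache."""
    cache = deque(maxlen=cache_size)   # cache_size 0 models "no cache at all"
    runs = 0
    for index in indices:
        if index not in cache:
            runs += 1
            cache.append(index)        # the oldest entry falls out when full
    return runs


if __name__ == "__main__":
    # Two triangles sharing the edge (1, 2), as an indexed triangle list.
    quad = [0, 1, 2, 2, 1, 3]
    print("no cache:", vertex_shader_runs(quad, 0))
    print("4 entries:", vertex_shader_runs(quad, 4))
```

Dividing the result by the number of triangles gives the average number of vertex shader runs per triangle, which is the quantity that triangle order optimizers try to minimize.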

Usually using interleaved inputs (a single buffer, an array of structures) is the best choice, but we might want to split the data into different buffers, for example if we need to use the same data in multiple draws but each draw needs only some attributes, or if part of the data needs to be modified by the CPU, etc.

As for all the units, don't confuse the logical representation with the physical one. For example, on the Xbox 360 Xenos GPU the vertex data fetching is done by the vertex shader: vertex fetching instructions are added at the beginning of a vertex shader and the fetching unit is only a kind of memory interface, like the texture fetching unit, and it's available both to vertex and pixel shaders (as the shader engine is unified on that GPU). The only unit that generates vertices is the unit that handles indexing and post-transform caching, and vertices are only indices that are an input to the vertex shader.


23/32


24/32

Nvidia FX Composer is a good tool to experiment with shaders without having to care about how to draw primitives and bind resources: http://developer.nvidia.com/fx-composer

The older 1.8 version of FX Composer is actually a better starting point than the newer .NET based one.

The DirectX shader language is called HLSL. It's very close to Nvidia's Cg, which can be used both from OpenGL and DirectX (and works on non-Nvidia cards too). OpenGL uses GLSL, which is a bit more confusing and in general messier. HLSL and Cg support an effect framework (FX) that lets you specify not only shader code but also some GPU states, making it easier to data-drive most of what is needed to render a given object.


25/32

Note: in this slide we are really collapsing multiple related fixed-function units. Clipping/culling, primitive setup/assembly, interpolation and so on are different stages, each with its own limits (throughput) and buffers.

There is a lot of documentation on how to write rasterizers and on rasterization algorithms. I'll link here two software based approaches, a simple and mostly illustrative one: http://www.devmaster.net/codespotlight/show.php?id=17 and a more advanced one that was devised for the (now defunct) Larrabee GPU, which had a programmable rasterizer: http://software.intel.com/en-us/articles/rasterization-on-larrabee/

Often rasterization happens in more than a single step. There can for example be a coarse rasterizer that generates bigger tiles (e.g. 8x8 pixels) and then does some early rejection (see next slides), to then pass the results to a fine rasterizer that generates 2x2 quads.
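A minimal software sketch of this two-step scheme, using edge functions: a coarse pass walks 8x8 tiles and drops any tile that lies entirely outside one edge, and a fine pass tests pixel centers inside the surviving tiles. The tile size, the vertex ordering convention (positive signed area) and the function names are assumptions of the sketch; it emits single pixels instead of 2x2 quads and ignores fill rules, clipping and fixed-point precision.

```python
TILE = 8  # coarse tile size in pixels (an assumption)


def edge(a, b, p):
    """Edge function: >= 0 when p is on the inner side of edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(tri, width, height):
    """Pixels whose centers are inside `tri` (vertices ordered so edge(v0, v1, v2) > 0)."""
    v0, v1, v2 = tri
    edges = [(v0, v1), (v1, v2), (v2, v0)]
    covered = set()
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            corners = [(tx, ty), (tx + TILE, ty),
                       (tx, ty + TILE), (tx + TILE, ty + TILE)]
            # Coarse step: if all four corners are outside the same edge,
            # no pixel of the tile can be inside (edge functions are linear).
            if any(all(edge(a, b, c) < 0 for c in corners) for a, b in edges):
                continue
            # Fine step: test the pixel centers of the surviving tile.
            for y in range(ty, min(ty + TILE, height)):
                for x in range(tx, min(tx + TILE, width)):
                    center = (x + 0.5, y + 0.5)
                    if all(edge(a, b, center) >= 0 for a, b in edges):
                        covered.add((x, y))
    return covered


if __name__ == "__main__":
    pixels = rasterize(((0, 0), (16, 0), (0, 16)), 32, 32)
    print(len(pixels), "pixels covered")
```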


26/32


27/32

Quad waste is a big problem for dense meshes, tessellation and displacement. Mesh level of detail is often more useful to avoid generating too much waste in the shaded quads than to reduce the number of vertices and their associated load in the vertex shading stages of the pipeline. It's a problem that needs to be solved for the future of GPU graphics. http://graphics.stanford.edu/papers/fragmerging/shade_sig10.pdf
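Quad waste can be quantified with a few lines of code: since shading happens in 2x2 quads, every quad touched by a triangle costs four shader lanes, however many of its pixels are actually covered. The sketch below assumes quads aligned to even pixel coordinates; the function name is illustrative.

```python
def quad_utilization(pixels):
    """Return (quads touched, lanes shaded, fraction of lanes that are useful)."""
    pixels = set(pixels)
    quads = {(x // 2, y // 2) for x, y in pixels}   # 2x2 quads touched
    lanes = 4 * len(quads)                          # every quad shades 4 lanes
    useful = len(pixels) / lanes if lanes else 0.0
    return len(quads), lanes, useful


if __name__ == "__main__":
    # A one-pixel-wide vertical sliver, typical of a very thin triangle.
    sliver = [(0, y) for y in range(8)]
    print(quad_utilization(sliver))
```

For the sliver above only half of the shaded lanes produce visible pixels, and a triangle that covers a single pixel wastes three lanes out of four.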

It's a long discussion, and in practice very few shaders use differentials to integrate discontinuities (we should though, or try to avoid discontinuities and high frequencies in shaders), but the differentials are by default used when fetching textures to pre-filter them (mipmaps, anisotropic filtering), and this alone makes a HUGE difference.
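As an illustration of how differentials drive texture pre-filtering, the sketch below estimates a mip level from the texture coordinates of a 2x2 quad: finite differences between neighbouring pixels approximate the derivatives, and the log2 of the larger footprint (in texels) picks the level. This is the usual textbook isotropic formula; actual hardware uses approximations and adds anisotropic filtering, and the function name, quad layout and square texture are assumptions.

```python
import math


def mip_level(uv_quad, texture_size):
    """Isotropic mip level for a 2x2 quad of UVs.

    uv_quad is ((uv_x0y0, uv_x1y0), (uv_x0y1, uv_x1y1)) with UVs in [0, 1];
    texture_size is the side of a square texture in texels.
    """
    (uv00, uv10), (uv01, _) = uv_quad
    # Finite differences across the quad, converted to texels per pixel.
    dudx = (uv10[0] - uv00[0]) * texture_size
    dvdx = (uv10[1] - uv00[1]) * texture_size
    dudy = (uv01[0] - uv00[0]) * texture_size
    dvdy = (uv01[1] - uv00[1]) * texture_size
    rho = max(math.hypot(dudx, dvdx), math.hypot(dudy, dvdy))
    if rho <= 1.0:
        return 0.0   # magnification: stay on the most detailed level
    return math.log2(rho)


if __name__ == "__main__":
    step = 4 / 256   # each screen pixel spans 4 texels of a 256x256 texture
    quad = (((0.0, 0.0), (step, 0.0)), ((0.0, step), (step, step)))
    print(mip_level(quad, 256))
```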

Some pointers for further reading:

http://en.wikipedia.org/wiki/Spatial_anti-aliasing

http://en.wikipedia.org/wiki/Multisample_anti-aliasing

http://www.amazon.com/Texturing-Modeling-Second-Procedural-Approach/dp/0122287304

http://www.amazon.com/Real-Time-Rendering-Third-Tomas-Akenine-Moller/dp/1568814240/ref=sr_1_1?s=books&ie=UTF8&qid=1318980645&sr=1-1


28/32


29/32

I won't go into any details of what a stencil buffer is and how early-stencil can be used, or the many ways a depth buffer can be used and configured... But early rejections are fundamental for GPU performance and they can be used for many different optimizations. It's fundamental to understand how they work on a specific GPU architecture, as often they will work only in some given state configurations and are easy to invalidate. Some data here: http://www.gpgpu.org/w/index.php/Code_Examples#Early-z

Also, each GPU vendor calls these optimizations by a different name: Hierarchical Z (HiZ), HyperZ, Early-Z, Zcull and so on... The actual representation that permits this early test varies with the hardware, and there can even be multiple levels of rejection. Most of them are inspired by Hierarchical Occlusion Maps: http://www.cs.unc.edu/~zhangh/hom.html but the literature is rich, e.g. http://citeseerx.ist.psu.edu/showciting;jsessionid=921596071E227DDD76FAFE435BCBFC89?cid=78472
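The core of a hierarchical Z test can be sketched in a few lines: keep the farthest depth of every tile, and reject a whole tile of incoming fragments when even the nearest of them cannot pass the depth test. This is a sketch under stated assumptions (8x8 tiles, smaller depth is closer, a "less" depth test, one level of hierarchy, illustrative names), not the representation used by any specific GPU.

```python
TILE = 8  # tile size in pixels (an assumption)


def build_hiz(depth, width, height):
    """Farthest (maximum) depth stored in each tile of a row-major depth buffer."""
    hiz = {}
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            hiz[(tx // TILE, ty // TILE)] = max(
                depth[y][x]
                for y in range(ty, min(ty + TILE, height))
                for x in range(tx, min(tx + TILE, width)))
    return hiz


def tile_rejected(hiz, tile, nearest_incoming_depth):
    """True if no incoming fragment in `tile` can pass a 'less' depth test."""
    return nearest_incoming_depth >= hiz[tile]


if __name__ == "__main__":
    # 16x16 buffer cleared to the far plane, with an occluder in the top-left tile.
    depth = [[1.0] * 16 for _ in range(16)]
    for y in range(8):
        for x in range(8):
            depth[y][x] = 0.2
    hiz = build_hiz(depth, 16, 16)
    print(tile_rejected(hiz, (0, 0), 0.5), tile_rejected(hiz, (1, 0), 0.5))
```

The test is only valid while the depth comparison keeps the same direction and shaders do not change depth or kill pixels after the fact, which is one reason these optimizations are easy to invalidate.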


30/32


31/32


32/32

Example image and Z-buffer from a PIX capture of a retail version of Battlefield: Bad Company 2.