DirectX 9 & Radeon 9700 Performance Optimizations Richard Huddy [email protected].


DirectX 9 and Radeon 9700 considerations

• Resources
• Sorting and Clearing
• Vertex Buffers and Index Buffers
• Render States
• How to draw primitives
• Vertex Data
• Vertex Shaders
• Pixel Shaders
• Textures
• Targets (both Z and color)
• Miscellaneous

General resource management

• Create your most important resources first (that’s targets, shaders, textures, VB’s, IB’s etc)
• “Most important” is “most frequently used”
• Never call Create in your main loop
  – So create the main colour and Z buffers before you do anything else…
    • The “main buffer” is the one through which the largest number of pixels pass…

Sorting

• Sort roughly front to back
  – There’s a staggering amount of hardware devoted to making this highly efficient
• Sort by vertex shader, or…
• Sort by pixel shader, or
• Sort by texture

• When you change VS or PS it’s good to go back to that shader as soon as possible…

• Short shaders are faster^2 when sorted
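One way to combine these sorting rules is a composite sort key. The sketch below is illustrative only: the `DrawCall` record, the shader and texture ids, and the 10-unit depth bucket are all assumptions rather than anything from the slides. It sorts coarsely by depth first (a rough front-to-back order, not an exact one) and groups by shader and texture inside each depth bucket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrawCall:
    """Hypothetical record describing one draw call."""
    name: str
    view_depth: float   # distance from the camera to the object
    vertex_shader: int  # hypothetical shader id
    texture: int        # hypothetical texture id


def sort_key(call, bucket_size=10.0):
    # Coarse depth bucket first: "roughly" front to back, not an exact sort.
    depth_bucket = int(call.view_depth // bucket_size)
    # Inside a bucket, group by shader and then by texture to cut state changes.
    return (depth_bucket, call.vertex_shader, call.texture)


def sort_draw_calls(calls, bucket_size=10.0):
    return sorted(calls, key=lambda c: sort_key(c, bucket_size))
```

A larger `bucket_size` trades depth ordering for fewer shader and texture switches; the right balance depends on the scene.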

Clearing

• Ideally use Clear once per frame (not less)
  – Always clear the whole render target
    • Don’t track dirty regions at all
  – Always clear colour, Z and stencil together unless you can just clear Z/stencil
    • Most importantly don’t force us to preserve stencil
• Don’t use 2 triangles to clear…
• Using Clear() is the way to get all the fancy Z buffer hardware working for you

Vertex Buffers

• Use the standard DirectX8/9 VB handling algorithm with NOOVERWRITE etc

• Try to always use DISCARD at the start of the frame on dynamic VB’s

• Specify write-only whenever possible
• Use the default pool whenever possible
• Roughly 2 to 4 MB for best performance
  – This allows large batches
  – And gives the driver sufficient granularity
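The NOOVERWRITE/DISCARD pattern for dynamic buffers can be sketched as a cursor that appends data and wraps when the buffer is full. This is a simulation of the bookkeeping only, not Direct3D code: `DynamicBufferCursor` and its methods are hypothetical names, and the returned strings merely name the real lock flags an application would pass to `Lock()`.

```python
DISCARD = "D3DLOCK_DISCARD"
NOOVERWRITE = "D3DLOCK_NOOVERWRITE"


class DynamicBufferCursor:
    """Hypothetical helper tracking where the next batch goes in a dynamic VB."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.offset = 0
        self.first_lock_this_frame = True

    def begin_frame(self):
        # The first lock of each frame uses DISCARD, as the slide recommends.
        self.first_lock_this_frame = True

    def allocate(self, size_bytes):
        """Return (byte offset, lock flag) for a batch of size_bytes."""
        if size_bytes > self.capacity:
            raise ValueError("batch is larger than the whole buffer")
        out_of_room = self.offset + size_bytes > self.capacity
        if self.first_lock_this_frame or out_of_room:
            # Ask for a fresh buffer and start writing from the beginning.
            self.offset = 0
            flag = DISCARD
        else:
            # Append after data the GPU may still be reading.
            flag = NOOVERWRITE
        start = self.offset
        self.offset += size_bytes
        self.first_lock_this_frame = False
        return start, flag
```

With a buffer in the 2 to 4 MB range, most locks in a frame should come back as NOOVERWRITE, with DISCARD only at the start of the frame or on wrap-around.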

Index Buffers

• Treat Index Buffers exactly as if they were vertex buffers – except that you always choose the smallest element possible

– i.e. Use 32 bit indices only if you need to

– Use 16 bit indices whenever you can

• All recent ATI hardware treats Index Buffers as ‘first class citizens’

– They don’t have to be copied about before the chip gets access

– So keep them out of system memory
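As a small illustration of "smallest element possible", the sketch below picks an index format from the largest vertex index a mesh references and shows the memory saved. The function names are hypothetical; the strings are the names of the Direct3D 9 index formats.

```python
def choose_index_format(max_vertex_index):
    """Pick the smallest index format that can address max_vertex_index."""
    if max_vertex_index < 0:
        raise ValueError("vertex index cannot be negative")
    # 16-bit indices can address vertices 0..65535.
    return "D3DFMT_INDEX16" if max_vertex_index <= 0xFFFF else "D3DFMT_INDEX32"


def index_buffer_bytes(index_count, fmt):
    """Size in bytes of an index buffer holding index_count indices."""
    bytes_per_index = 2 if fmt == "D3DFMT_INDEX16" else 4
    return index_count * bytes_per_index
```

For a mesh with 3,000 indices this is 6,000 bytes with 16-bit indices against 12,000 bytes with 32-bit indices, which halves the index bandwidth.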

Updating Index and Vertex Buffers

• IBs and VBs which are optimally located need to be updated with sequential DWORD writes.

• AGP memory and LVM both benefit from this treatment…

Handling Render States

• Prefer minimal state blocks
  – ‘Minimal’ means you should weed out any redundant state changes where possible
    • If 5% of state changes are redundant that’s OK
    • If 50% are redundant then get it fixed!
• The expensive state changes:
  – Switching between VS and FF
  – Switching Vertex Shader
  – Changing Texture
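To find out whether an application is nearer the 5% or the 50% case, a shadow copy of the current state can filter and count redundant changes. The sketch below is a hypothetical application-side cache, not part of Direct3D; the state names passed in are just dictionary keys.

```python
_UNSET = object()  # sentinel for "this state has never been set"


class StateCache:
    """Hypothetical filter that drops state changes which change nothing."""

    def __init__(self):
        self.current = {}
        self.requested = 0
        self.redundant = 0

    def set_state(self, state, value):
        """Return True if the change should be forwarded to the device."""
        self.requested += 1
        if self.current.get(state, _UNSET) == value:
            self.redundant += 1
            return False
        self.current[state] = value
        return True

    def redundant_fraction(self):
        if self.requested == 0:
            return 0.0
        return self.redundant / self.requested
```

Logging `redundant_fraction()` once per frame gives a direct measure of how much redundant state traffic the renderer generates.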

How to draw primitives

• DrawIndexedPrimitive( strip or list )
  – Indexing is a big win on real world data
  – Long strips beat everything else
  – Use lists if you would have to add large numbers of degenerate polys to stick with strips (more than ~20% means use lists)
  – Make sure your VB’s and IB’s are in optimal memory for best performance
  – Give the card hundreds of polys per call
    • Small batches kill performance
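The ~20% rule of thumb for strips versus lists can be written down directly. This sketch assumes the percentage is measured as degenerate triangles over all triangles submitted in the stitched strip; the slide does not say which denominator is meant, so treat that as an assumption. The function names are hypothetical.

```python
def degenerate_fraction(real_triangles, degenerate_triangles):
    """Share of submitted triangles that exist only to stitch strips together."""
    total = real_triangles + degenerate_triangles
    if total <= 0:
        raise ValueError("no triangles submitted")
    return degenerate_triangles / total


def choose_primitive_type(real_triangles, degenerate_triangles, threshold=0.20):
    """Return 'list' when stitching would waste more than the threshold."""
    fraction = degenerate_fraction(real_triangles, degenerate_triangles)
    return "list" if fraction > threshold else "strip"
```

This is best run at authoring time, once per mesh, after the stripifier has reported how many degenerate triangles it had to insert.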

Vertex data

• Don’t scatter it around
  – Fewer streams give better cache behaviour
• Compress it if you can
  – 16 bits or less per component
  – Even if it costs you 1 or 2 ops in the shader…
• Try to avoid spilling into AGP
  – Because AGP has high latency
• Power-of-2 vertex sizes help; 32 bytes is best
  – Work the cache on the GPU
• Avoid random access patterns where possible by reordering vertex data before the main loop…
  – That’s at app start up or at authoring time
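As a worked example of compression, the sketch below packs a vertex with position, normal, tangent and one texture coordinate set. Stored as full floats that layout takes 44 bytes; with the normal, tangent and UVs quantised to signed 16-bit values it fits in 32 bytes. The layout itself is an assumption made for illustration. It corresponds to what formats such as D3DDECLTYPE_SHORT4N and D3DDECLTYPE_SHORT2N could hold, and hardware support for those types needs checking against the device caps.

```python
import struct

UNCOMPRESSED_BYTES = struct.calcsize("<3f3f3f2f")  # all components as floats


def quantize_snorm16(value):
    """Map a float in [-1, 1] to a signed 16-bit integer, clamping outliers."""
    clamped = max(-1.0, min(1.0, value))
    return int(round(clamped * 32767))


def pack_vertex(position, normal, tangent, uv):
    """Pack one vertex into 32 bytes (hypothetical layout).

    position: 3 floats, kept at full precision      (12 bytes)
    normal:   3 components + 1 pad as 16-bit values  (8 bytes)
    tangent:  3 components + 1 pad as 16-bit values  (8 bytes)
    uv:       2 components as 16-bit values          (4 bytes)
    """
    n = [quantize_snorm16(c) for c in normal] + [0]
    t = [quantize_snorm16(c) for c in tangent] + [0]
    st = [quantize_snorm16(c) for c in uv]
    return struct.pack("<3f4h4h2h", *position, *n, *t, *st)
```

The shader then spends an instruction or two rescaling the values, which is the "1 or 2 ops" cost mentioned on the slide.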

Compiling and Linking shaders

• Do this all “up front”
  – It may not be obvious to you, but you have to actually use a shader to force its complete instantiation in DirectX 9
  – So, if you’re not careful you may get linking happening in your main loop
  – And linking may be time consuming
  – Draw a little of everything before you start for real. Think of this as priming the caches…

Vertex shaders I

• Shorter shaders are faster, no surprises here…
• Avoid all unnecessary writes
  – This includes the output registers of the VS
  – So use the write masks aggressively
  – Pack constants as much as possible
  – Prefer locality of reference on constants too…
• Be aware of the expansion of macros but prefer them anyway if they match exactly what you want
• Pack your shader constant updates
• You should optimise the algorithm and leave the object-code optimisation to the driver/runtime

Vertex shaders II

• Branches and conditionals are fast so use them aggressively
  – That’s not like the CPU where branches are slow…
  – Longer shaders allow better batching
• Shorter shaders are also more cache friendly
  – i.e. it’s usually faster to switch to the previous shader than to any other
  – But the shorter your shaders are…
  – …the more of them fit into the cache.

Vertex shaders III

• API Change:
  – Now you don’t “mov” to the address register, you use “mova”
  – And this performs round to nearest, not floor
  – And now A0 is a 4D register
    • A0.x, A0.y, A0.z, A0.w
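The change from floor to round-to-nearest matters when a shader computes a constant-table index as a float. The sketch below contrasts the two behaviours in plain arithmetic. It models round-to-nearest as `floor(x + 0.5)`, which is an assumption about how exact .5 ties are resolved; the function names are hypothetical.

```python
import math


def address_with_floor(x):
    """Index selected when the address register load takes the floor."""
    return math.floor(x)


def address_with_round_nearest(x):
    """Index selected when the load rounds to nearest (ties assumed to round up)."""
    return math.floor(x + 0.5)
```

An index computed as 2.7 selects constant 2 under floor but constant 3 under round-to-nearest, so shaders ported from the older behaviour may need their index arithmetic checked.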

Pixel shaders I

• API change to accommodate MET’s:
  – You now have to explicitly write to oC0, oC1, oC2 and oC3 to set the output colour
  – And the write has to be with a mov instruction
  – If you write to oC[n] you must write to all elements from oC[0] to oC[n-1]
    • i.e. Writes must be contiguous starting at oC0
    • But the writes can happen in any order
• You can also write to oDepth to update the Z buffer but note that this kills the early Z cull… (this replaces ps1.3 texdepth)
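The contiguity rule for colour outputs can be expressed as a small check. This is an illustrative validator, not something the API provides; the function name is hypothetical and the limit of four outputs follows the oC0 to oC3 registers named above.

```python
MAX_COLOUR_OUTPUTS = 4  # oC0 .. oC3


def outputs_are_valid(written_indices):
    """True if the written oC registers form the set {0, 1, ..., n} with no gaps.

    The order of the writes does not matter, only which registers are written.
    """
    written = set(written_indices)
    if not written or len(written) > MAX_COLOUR_OUTPUTS:
        return False
    return written == set(range(len(written)))
```

For example, writing oC2, oC0 and oC1 in that order passes, while writing only oC0 and oC2 fails because oC1 is skipped.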

Pixel shaders II

• Shorter is much faster
  – It’s much easier to be pixel limited than vertex limited
  – Short shaders are more cache friendly
  – Be aggressive with write masks
  – Think dual-issue (“+”) even though it’s gone from the API (so split colour and alpha out)
• Generally prefer to spend cycles on shader ops rather than using texture lookups
  – Because memory latency is the enemy here

Pixel shaders III

• Dual issue?
  – But that’s not in the 2.0 shader spec…
  – But remember that DX9 hardware like the Radeon 9700 has to run DirectX 8 apps very fast indeed
  – And that means it has dual issue hardware ready for you to use

Pixel shaders IV

• Example: Diffuse + specular lighting

    …
    dp3 r0, r1, r0          // N.H
    dp3 r2, r1, r2          // N.L
    mul r2, r2, r3          // * color
    mul r2, r2, r4          // * texture
    mul r0.r, r0.r, r0.r    // spec^2
    mul r0.r, r0.r, r0.r    // spec^4
    mul r0.r, r0.r, r0.r    // spec^8
    mad r0.rgb, r0.r, r5, r2
    …

  Total: 8 instructions

    …
    dp3 r0, r1, r0          // N.H
    dp3 r2.r, r1, r2        // N.L
    mul r6.a, r0.r, r0.r    // spec^2
    mul r2.rgb, r2.r, r3    // * color
    mul r6.a, r6.a, r6.a    // spec^4
    mul r2.rgb, r2, r4      // * texture
    mul r6.a, r6.a, r6.a    // spec^8
    mad r0.rgb, r6.a, r5, r2
    …

  Optimized to 5 “DI” (dual-issue) instructions
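The 8-versus-5 count can be reproduced with a toy pairing model. The sketch below is a deliberate simplification and not a description of the real hardware scheduler: it assumes that an instruction writing only colour channels can share an issue slot with an adjacent instruction writing only alpha, and it ignores data dependencies. The two mask lists transcribe the destination write masks of the two listings above, with an unmasked destination treated as "rgba".

```python
def pipe(write_mask):
    """Classify a destination write mask by the pipe(s) it occupies."""
    if write_mask == "a":
        return "alpha"
    if "a" in write_mask:
        return "both"    # e.g. "rgba", or no mask at all
    return "colour"      # e.g. "r" or "rgb"


def count_issue_slots(write_masks):
    """Greedily pair each colour-only op with an adjacent alpha-only op."""
    slots = 0
    i = 0
    while i < len(write_masks):
        current = pipe(write_masks[i])
        following = pipe(write_masks[i + 1]) if i + 1 < len(write_masks) else None
        if {current, following} == {"colour", "alpha"}:
            i += 2   # the two instructions share one slot
        else:
            i += 1
        slots += 1
    return slots


# Destination write masks of the two listings, in program order.
BEFORE = ["rgba", "rgba", "rgba", "rgba", "r", "r", "r", "rgb"]
AFTER = ["rgba", "r", "a", "rgb", "a", "rgb", "a", "rgb"]
```

Under this model the first listing needs 8 slots and the second needs 5, which matches the counts on the slide. Moving the specular power chain into the alpha channel is what creates the pairing opportunities.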

Pixel shaders V

• Texture instructions
  – Avoid TEXDEPTH to retain the early Z-reject
  – If you do choose to use TEXKILL then use it as early as possible. [But, the positioning of TEXKILL within texture loading code is unimportant]
• Register usage
  – Minimize total number of registers used
  – No problems with dependency

Vertex and Pixel shaders

• If you’re fed up with writing assembler, and don’t feel excited by the opportunity to code 256 VS ops and 96 PS ops then…

• …maybe you should consider HLSL?
• In most cases it is as good as hand written assembler
• And much faster to author…
  – Perfect for prototyping
  – And for release code where you use D3DX

Textures I

• API addition
  – SetSamplerState()
  – Handles the now-decoupled texture sampler setup.
  – You may now freely mix and match texture coordinates with texture samplers to fetch texels in arbitrary ways
    • Texture coordinates are now just iterated floats
    • Samplers handle clamp, wrap, bias and filter modes
  – You have 8 texture coordinates
  – And 16 texture samplers
    • texld r11, t7, s15 (all register numbers are max)

Textures II

• Use compressed textures
  – Do you need a good compressor?
• Use smaller textures
• Use 16 bit textures in preference to 32 bit
• Use textures with few components
  – Use an L8 or A8 format if that’s what you want
• Pack textures together
  – e.g. If you’re using two 2D textures then consider using a single RGBA texture
• Texture performance is bandwidth limited
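Since texture performance is bandwidth limited, it helps to see how format choice changes the footprint. The sketch below computes storage sizes from bits per texel, using 4x4 blocks of 8 or 16 bytes for DXT1 and DXT5. The format keys follow the D3DFMT names with the prefix dropped, and the function names are hypothetical.

```python
BITS_PER_TEXEL = {"A8R8G8B8": 32, "R5G6B5": 16, "L8": 8, "A8": 8}
DXT_BLOCK_BYTES = {"DXT1": 8, "DXT5": 16}  # bytes per 4x4 texel block


def level_bytes(width, height, fmt):
    """Bytes needed for a single mip level."""
    if fmt in DXT_BLOCK_BYTES:
        blocks_wide = (width + 3) // 4
        blocks_high = (height + 3) // 4
        return blocks_wide * blocks_high * DXT_BLOCK_BYTES[fmt]
    return width * height * BITS_PER_TEXEL[fmt] // 8


def texture_bytes(width, height, fmt, mipmapped=False):
    """Bytes for the top level, plus the full mip chain if requested."""
    total = level_bytes(width, height, fmt)
    while mipmapped and (width > 1 or height > 1):
        width = max(1, width // 2)
        height = max(1, height // 2)
        total += level_bytes(width, height, fmt)
    return total
```

For a 256x256 texture without mipmaps this gives 262,144 bytes as A8R8G8B8, 131,072 as R5G6B5, 65,536 as L8 and 32,768 as DXT1.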

Textures III

• Filtering modes
  – Use trilinear filtering to improve texture cache coherency
  – Only use anisotropic or trilinear filtering when they make sense; they are more expensive
  – Avoid using anisotropic filtering with bumpmapping
  – Avoid using trilinear anisotropic filtering unless the quality win justifies it
  – More costly filtering is more affordable with longer pixel shaders

Targets

• Always clear the whole of the target
• Present():
  – WASSTILLDRAWING makes a comeback
  – Please use it!
  – Because using it properly will gain you CPU cycles, and that’s typically your scarcest resource

Depth Buffer I

• Never lock depth buffers
• Clearing depth buffers
  – Clear the whole surface
  – When stencil is present clear both depth and stencil simultaneously
• If possible disable depth buffering when alpha blending (e.g. drawing HUD’s)
• Use as few depth buffers as possible…
  – i.e. re-use them across multiple render targets

Depth Buffer II

• Efficiently use Hyper-Z
  – Render front to back
  – Make Znear, Zfar close to active depth range of the scene
  – The EQUAL and NOT EQUAL depth tests require exact compares which kill the early Z comparisons. Avoid them!

Occlusion query

• New to DirectX 9
  – In GL you have HP_occlusion_test and NV_occlusion_query to avoid the need for locks
• Not free, but much cheaper than Lock()
• Supported on all ATI hardware since the Radeon 8500
• CreateQuery(OCCLUSION, ppQuery)
• Issue(Begin/End)
• GetData() returns S_OK to signal completion, but please don’t spin waiting for the answer…
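One way to avoid spinning is to poll the query once per frame and reuse the previous answer until a new one arrives. The sketch below simulates that pattern without touching Direct3D: `FakeOcclusionQuery` is a hypothetical stand-in whose result becomes available a few frames after it is issued, and `S_OK` / `S_FALSE` carry the usual HRESULT values for "ready" and "not ready yet".

```python
S_OK = 0      # result is available
S_FALSE = 1   # query has not finished yet


class FakeOcclusionQuery:
    """Hypothetical stand-in for a query whose answer arrives some frames late."""

    def __init__(self, visible_pixels, frames_of_latency):
        self.visible_pixels = visible_pixels
        self.frames_left = frames_of_latency

    def advance_frame(self):
        self.frames_left = max(0, self.frames_left - 1)

    def get_data(self):
        if self.frames_left > 0:
            return S_FALSE, None
        return S_OK, self.visible_pixels


def object_is_visible(query, last_known_visible):
    """Poll once; fall back to the previous answer instead of busy-waiting."""
    status, pixels = query.get_data()
    if status != S_OK:
        return last_known_visible
    return pixels > 0
```

The cost is that visibility decisions lag by a frame or two, which is usually acceptable and keeps the CPU and GPU running in parallel.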

AGP 8X

• Is fast at ~2GB per second
• But has high latency compared to LVM
• And is 10 times slower than LVM
• Radeon 9700 has up to 20GB per sec of bandwidth available when talking to LVM
  – (LVM = Local Video Memory)
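The bandwidth figures above translate into per-transfer and per-frame budgets. The arithmetic below uses the slide's round numbers and takes 1 GB as 10^9 bytes, which is an assumption. It covers peak bandwidth only and ignores the latency difference the slide also mentions.

```python
AGP_8X_BYTES_PER_SEC = 2_000_000_000    # ~2 GB/s, from the slide
LVM_BYTES_PER_SEC = 20_000_000_000      # up to 20 GB/s, from the slide


def transfer_ms(num_bytes, bytes_per_sec):
    """Milliseconds needed to move num_bytes at the given peak bandwidth."""
    return num_bytes * 1000 / bytes_per_sec


def bytes_per_frame(bytes_per_sec, frames_per_sec):
    """Upper bound on the data that can be moved in one frame."""
    return bytes_per_sec / frames_per_sec
```

Reading a 4,000,000-byte vertex buffer takes about 2 ms over AGP 8X at peak rate but about 0.2 ms from local video memory, which is why frequently used data belongs in LVM.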

User clip planes

• User clip planes are much more efficient than texkill because:
  1. They insert a per-vertex test, rather than a per-pixel test, and vertices are typically fewer in number than pixels
  2. It’s important always to kill data at the earliest stage possible in the pipeline

• Plus, clipping is essentially a geometric operation

• All hardware which supports ps1.4 supports user clip planes in hardware

Sky box. First or last?

• Draw it last because:
  – That’s a rough front to back sort
  – In this case you know that most sky pixels will fail the Z test.
• Draw it first because:
  – That way you don’t need any Z tests
  – In this case you know that most sky pixels would pass the Z test

So, here is our target:

• DX9 style mainstream graphics (per frame):
  – > 500K triangles
  – < 500 DrawIndexedPrimitive() calls
  – < 500 VertexBuffer switches
  – < 200 different textures
  – < 200 State change groups
  – Few calls to SetRenderTarget - aim for 0 to 4...
  – 1 pass per poly is typical, but 2 is sometimes smart
  – Runs at monitor refresh rate
  – Which gives more than 40 million polys per second
• And everything goes through the programmable pipeline
  – No occurrences of Lock(0), DrawPrimitive(), DPUP()
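The per-frame targets can be cross-checked with simple arithmetic. The refresh rates used below (60 Hz and 85 Hz) are assumptions, since the slide only says "monitor refresh rate"; the function names are hypothetical.

```python
def polys_per_second(triangles_per_frame, refresh_hz):
    """Throughput implied by a per-frame triangle count at a given refresh rate."""
    return triangles_per_frame * refresh_hz


def triangles_per_call(triangles_per_frame, draw_calls):
    """Average batch size implied by the per-frame budgets."""
    return triangles_per_frame / draw_calls
```

At exactly 500K triangles per frame, an 85 Hz display gives 42.5 million polys per second, while 60 Hz gives 30 million, so passing 40 million needs either a refresh rate above 80 Hz or more than 500K triangles per frame. The same budgets imply at least 1,000 triangles per DrawIndexedPrimitive() call on average, in line with the earlier advice to give the card hundreds of polys per call.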

Questions…?

Richard Huddy

[email protected]
