Direct3D 11 Compute Shader
More Generality for Advanced Techniques

Chas. Boyd
Architect, Windows Desktop & Graphics Technology
Microsoft

Overview
GPGPU vs. Data Parallel Computing
Introducing the Compute Shader
Advantages
Target Applications
Key Features
Examples
    Image reduction, histogram, convolution
API Support

GPGPU = Data Parallel Computing
GPU performance continues to grow
    More algorithms want this performance
    Apps can scale to massive parallelism without tricky code changes
General recognition that this model is applicable beyond just rendering…
    …although that is our primary target
Deliver scalable performance
    Code scales with core count with no changes

Introducing: Compute Shader
A new processing model for GPUs
    Data-parallel programming for mass markets
Integrated with Direct3D
    For efficient interoperability in client scenarios
Supports more general constructs:
    Cross-thread data sharing
    Unordered access I/O operations
Enables more general data structures
    Irregular arrays, trees, etc.
Enables more general algorithms
    Far beyond shading

Optimized for Client Scenarios
Simpler setup syntax
    Balance between power and complexity
Real-time rendering of results
    Working to reduce cost of transition from compute mode to graphics mode
Better integration with media data types:
    Pixels, samples, text, vs. only floats
Need consistency between implementations
    Both across vendors and over time/generations

Compute Shader Features
Predictable Thread Invocation
    Regular arrays of threads: 1D, 2D, 3D
    Don't have to "draw a quad" anymore
Shared registers between threads
    Reduces register pressure
    Can eliminate redundant compute and I/O
Scattered Writes
    Can read/write arbitrary data structures
    Enables new classes of algorithms
Integrated with Direct3D resources


Target Applications
Image/post-processing:
    Image reduction, histogram, convolution, FFT
Effect physics
    Particles, smoke, water, cloth, etc.
A-Buffer/OIT
Ray-tracing, radiosity, etc.
Gameplay physics, AI

Integrated with Direct3D
Fully supports all Direct3D resources
Targets graphics/media data types
Evolution of DirectX HLSL
Graphics pipeline updated to emit general data structures
    which can then be manipulated by compute shader
    and then rendered by D3D again

Integration with Pipeline
[Diagram labels: Input Assembler, Vertex Shader, Tessellation, Geometry Shader, Rasterizer, Pixel Shader, Output Merger, Scene Image, Compute Shader, Data Structure, Final Image]
Render scene
Write out scene image
Use Compute for image post-processing
Output final image

Direct Thread Invocation
The ability to explicitly launch a known number of threads onto the GPU
    pD3D11Device->Dispatch( … numThreads… );
Analogous to graphics DrawPrimitive() calls
Enables algorithms to execute the optimal number of threads
    Not how many vertices are read, or pixels written
Current thread id is available to shader code:
    sv_ThreadID.x
    Analogous to sv_PrimitiveID system value
Enables predictable memory access and register usage


Shared Register Class
New register type/variable storage class
    shared float sfFoo;
Multiple threads can access same memory
    Enables uses like user-controlled cache
Maximum of 32 KB of registers can be shared in DirectX 11
    8K floats or 2K float4s
vs. 64 KB of total temporary registers available
    16K floats or 4K float4s

Sub Blocking
Not all threads in the call can/should share registers with each other
Sharing threads are broken down into subsets (groups) of threads
Thread indices are made available in shader
    sv_ThreadID
    sv_ThreadGroupID
    sv_ThreadIDInGroup
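The three indices are related by simple arithmetic. Here is a minimal Python sketch, assuming the per-axis relation used by Direct3D 11 as it later shipped: global thread ID = group ID × group size + ID within the group. The function name and tuple layout are illustrative only and are not part of any API.

```python
from itertools import product

def enumerate_threads(num_groups, group_size):
    """Yield (thread_id, group_id, id_in_group) for a 3D dispatch.

    num_groups and group_size are (x, y, z) tuples. Per axis,
    thread_id = group_id * group_size + id_in_group.
    """
    for group_id in product(*(range(n) for n in num_groups)):
        for id_in_group in product(*(range(n) for n in group_size)):
            thread_id = tuple(
                g * s + t for g, s, t in zip(group_id, group_size, id_in_group)
            )
            yield thread_id, group_id, id_in_group

# Two groups of 4x2x1 threads each: 16 threads in total.
threads = list(enumerate_threads(num_groups=(2, 1, 1), group_size=(4, 2, 1)))
```

Every thread gets a unique global ID, so a shader can use it directly as an array index, while the in-group ID picks the thread's slot in shared registers.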


Atomic Intrinsics
Enable parallel operations on individual 32-bit memory locations without requiring full synchronization
    Either video memory or shared registers
Can be used to implement higher-level synch constructs
    Semaphores, etc.
    Not intended for heavy lifting
Support an immediate return argument
    At some performance cost

Atomic Intrinsics
Enables basic operations:
    InterlockedAdd( rVar, val );
    InterlockedMin( rVar, val );
    InterlockedMax( rVar, val );
    InterlockedOr( rVar, val );
    InterlockedXOr( rVar, val );
    InterlockedCompareWrite( rVar, val );
    InterlockedCompareExchange( rVar, val );
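To show how a higher-level construct can be built from these primitives, here is a CPU emulation in Python. It is a sketch under stated assumptions: the `AtomicU32` and `SpinLock` classes and their method signatures are hypothetical, an ordinary lock stands in for hardware atomicity, and the compare-exchange follows the usual compare-and-swap semantics (store only if the current value matches, always return the previous value).

```python
import threading
import time

class AtomicU32:
    """Stand-in for one 32-bit location; a lock makes each call atomic."""
    def __init__(self, value=0):
        self._value = value & 0xFFFFFFFF
        self._guard = threading.Lock()

    def interlocked_add(self, val):
        with self._guard:
            original = self._value
            self._value = (original + val) & 0xFFFFFFFF
            return original  # the "immediate return" form

    def interlocked_compare_exchange(self, compare, val):
        with self._guard:
            original = self._value
            if original == compare:
                self._value = val & 0xFFFFFFFF
            return original

    def load(self):
        with self._guard:
            return self._value

class SpinLock:
    """Binary semaphore built only from compare-exchange."""
    def __init__(self):
        self._flag = AtomicU32(0)

    def acquire(self):
        while self._flag.interlocked_compare_exchange(0, 1) != 0:
            time.sleep(0)  # yield to other threads while waiting

    def release(self):
        self._flag.interlocked_compare_exchange(1, 0)

def run_demo(num_threads=4, iterations=100):
    atomic_total = AtomicU32(0)
    lock = SpinLock()
    plain = {"total": 0}

    def worker():
        for _ in range(iterations):
            atomic_total.interlocked_add(1)  # direct atomic update
            lock.acquire()                   # lock-protected plain update
            plain["total"] += 1
            lock.release()

    workers = [threading.Thread(target=worker) for _ in range(num_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return atomic_total.load(), plain["total"]
```

Both counters end at `num_threads * iterations`. The lock-based path does far more work per update, which fits the slide's caution that these constructs are not intended for heavy lifting.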

Unordered Memory Accesses
HLSL 'resource variables'
    Declared in the language
DXGI resources
Enables out-of-bounds memory checking
    Returns 0 on reads
    Writes are No-Ops
Improves security, reliability of shipped code
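The out-of-bounds rules can be modeled in a few lines. This Python sketch is a CPU illustration only; the `CheckedBuffer` class is hypothetical and simply reproduces the two behaviors listed above.

```python
class CheckedBuffer:
    """Fixed-size buffer with the out-of-bounds rules described above."""
    def __init__(self, size):
        self._data = [0] * size

    def load(self, index):
        # Out-of-range reads return 0 instead of faulting.
        if 0 <= index < len(self._data):
            return self._data[index]
        return 0

    def store(self, index, value):
        # Out-of-range writes are silently dropped (no-ops).
        if 0 <= index < len(self._data):
            self._data[index] = value

buf = CheckedBuffer(4)
buf.store(2, 7)    # in range: stored
buf.store(9, 99)   # out of range: ignored
```

A stray index therefore cannot corrupt neighboring memory, although it can still hide a logic error, so a debug build may want to log such accesses.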

Unordered I/O
For fastest performance when ordering of records need not be preserved
Both reads and writes:
    UnorderedLoad( ResourceVar, val );
    UnorderedStore( ResourceVar, val );
Requires buffer allocated beforehand

Integration with Direct3D
Pixel shaders can also perform scattered writes
Enables rendering output to data structures more complex than a 2D array
    Histogram, linked list, irregular array, tree, etc.

Don't Forget
Texture sampling still works:
    Object.Load( Loc, Offset, Samples );
    Object.Gather( Sampler, Loc );
    Object.Sample( Sampler, Loc );
    Object.SampleLevel( Sampler, Loc, LoD );
No automatic trilinear LoD calculation
Other graphics features are not present:
    Antialiasing, depth culling, alpha blending, triangle rasterization

Examples
Image Reduction
Image Histogram
FFT

Image Post-Processing
Significant fraction of frame time
    10–20% for most games
    50–70% for deferred shading-based engines
Savings here mean more time for 3D

Image Reduction
Find the average intensity of an image
    E.g. for HDR exposure adjustment
    Optimizes scene for viewing on SDR monitor
Algorithm breakdown:
    Input: 1 million pixels
    Compute: 1 MAD per pixel read
    Output: 1 value
Should this run at texture sample rate?
    Does not, due to write contention
    Million-to-1 reduction
[Diagram labels: Input, GPU, Output]

Reduction Compute Code
    Buffer<uint> Values;
    OutputBuffer<uint> Result;

    ImageAverage()
    {
        groupshared uint Total;  // Total so far
        groupshared uint Count;  // Count added
        // (Total and Count are assumed to start at zero)

        float3 vPixel = load( sampler, sv_ThreadID );
        float fLuminance = dot( vPixel, LUM_VECTOR );
        uint value = fLuminance*65536;  // fixed point, so it can be added atomically

        InterlockedAdd( Count, 1 );
        InterlockedAdd( Total, value );

        // Allow all threads in group to complete
        SynchronizeThreadGroup();

        // Compute the average and store it in our output buffer
        if ( sv_ThreadID.x == 0 )
        {
            float fAverage = (float)Total / Count;  // compute avg (still scaled by 65536)
            UnorderedStore( Result[0], fAverage );  // write it out
        }
    }
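A CPU reference is useful for checking what the shader should produce. This Python sketch mirrors the fixed-point accumulation in the listing above. The luminance weights are an assumption (Rec. 709 values), because the slides do not define `LUM_VECTOR`, and the function name is illustrative.

```python
# Assumed Rec. 709 luminance weights; the slides do not define LUM_VECTOR.
LUM_VECTOR = (0.2126, 0.7152, 0.0722)
FIXED_ONE = 65536  # same fixed-point scale as the shader listing

def average_luminance(pixels):
    """Average luminance of an iterable of (r, g, b) floats in [0, 1]."""
    total = 0  # plays the role of the shared Total
    count = 0  # plays the role of the shared Count
    for r, g, b in pixels:
        luminance = r * LUM_VECTOR[0] + g * LUM_VECTOR[1] + b * LUM_VECTOR[2]
        total += int(luminance * FIXED_ONE)  # truncates like the uint conversion
        count += 1
    return (total / count) / FIXED_ONE  # undo the fixed-point scale

checker = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)] * 8
average = average_luminance(checker)
```

Python integers do not overflow, but a 32-bit accumulator would: at this scale the sum reaches 2^32 after 65,536 pixels of luminance 1.0. A real shader would therefore probably accumulate per thread group, or use a smaller scale.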

Fast Reduction Compute Code
    Buffer<uint> Values;
    OutputBuffer<uint> Result;

    ImageAverage()
    {
        groupshared uint Total[32];  // array of 32 totals
        groupshared uint Count[32];  // array of 32 counts
        // (all entries are assumed to start at zero)

        float3 vPixel = load( sampler, sv_ThreadID );
        float fLuminance = dot( vPixel, LUM_VECTOR );
        uint value = fLuminance*65536;
        // Spread the threads across the 32 slots; idx is in 0..31
        uint idx = (sv_ThreadID.x + sv_ThreadID.y + sv_ThreadID.z) % 32;

        InterlockedAdd( Total[idx], value );
        InterlockedAdd( Count[idx], 1 );

        // Allow all threads in group to complete
        SynchronizeThreadGroup();

        // Compute the average and store it in our output buffer
        if ( sv_ThreadIDInGroup.x == 0 )
        {
            uint TheTotal = 0;
            uint TheCount = 0;
            for ( uint i = 0; i < 32; i++ )
            {
                TheTotal += Total[i];
                TheCount += Count[i];
            }
            float fAverage = (float)TheTotal / TheCount;  // compute avg
            UnorderedStore( Result[sv_ThreadGroupID.x], fAverage );  // write
        }
    }
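The slot-splitting idea can be checked on the CPU. This Python sketch assumes an 8×8 thread group and uses illustrative names. It spreads the additions over 32 partial accumulators and then folds them together, as the listing above does.

```python
def grouped_average(values, thread_ids, slots=32):
    """Average `values`, accumulating into `slots` partial sums.

    thread_ids holds one (x, y, z) tuple per value; the slot is chosen from
    the thread ID so that neighboring threads rarely share an accumulator.
    """
    totals = [0] * slots
    counts = [0] * slots
    for (x, y, z), value in zip(thread_ids, values):
        idx = (x + y + z) % slots  # always in 0..slots-1
        totals[idx] += value
        counts[idx] += 1
    # One final pass folds the partial results together.
    return sum(totals) / sum(counts), counts

ids = [(x, y, 0) for y in range(8) for x in range(8)]  # assumed 8x8 group
average, counts = grouped_average(list(range(64)), ids)
```

With a power-of-two slot count the modulo can also be written as a mask with `slots - 1` (31 here). Masking with the count itself (32) would produce only the values 0 and 32, so it would not spread the load at all.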

Reduction Performance
Pyramid approaches work today
    Some choice in reduction level per pass
    Tradeoff is contention for destination
1M pixels takes ~0.4 ms in Direct3D
    Pass-count-limited at small end of pyramid
Ideally should run at texture read rate
    < 0.1 ms in theory, or 4–10x faster
Compute shader features should help
    Such as local read-write cache
    Prototypes show ~2x speed boost so far
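The pass-count tradeoff in a pyramid reduction is easy to see in a small model. In this Python sketch each pass collapses `factor` inputs into one output, and the function returns the result with the number of passes. The function name and the flat-list layout are illustrative assumptions, not the Direct3D implementation.

```python
def pyramid_sum(values, factor):
    """Sum a non-empty sequence by repeated `factor`-to-1 reduction passes."""
    level = list(values)
    passes = 0
    while len(level) > 1:
        # One "render pass": each output element reads `factor` inputs.
        level = [sum(level[i:i + factor]) for i in range(0, len(level), factor)]
        passes += 1
    return level[0], passes

total_4, passes_4 = pyramid_sum([1] * 4096, factor=4)     # 6 passes
total_16, passes_16 = pyramid_sum([1] * 4096, factor=16)  # 3 passes
```

For one million pixels (2^20), a 4-to-1 pyramid needs 10 passes and a 16-to-1 pyramid needs 5. A larger factor means fewer passes but more inputs competing for each destination, which is the contention tradeoff noted above.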

Histogram Generation
Similar to reduction problem
    Reduce to 64–256 destinations at data-dependent (unpredictable) addresses
Still suffers contention when multiple pixels increment same bin
    So replicate bins, e.g. 16x
    Increment bins using InterlockedAdd() math operations
Currently showing 2x speedup

Histogram Generation Code
    Histogram()
    {
        shared int Histograms[16][256];  // 16 copies of a 256-bin histogram
        // (all bins are assumed to start at zero)

        float3 vPixel = load( sampler, sv_ThreadID );
        float fLuminance = dot( vPixel, LUM_VECTOR );
        int iBin = fLuminance*255.0f;  // compute bin to increment

        int iHist = sv_ThreadIDInGroup.x % 16;  // use thread index to pick a copy
        InterlockedAdd( Histograms[iHist][iBin], 1 );  // update bin

        SynchronizeThreadGroup();  // allow all threads in group to complete

        // Write register histograms out to memory:
        iBin = sv_ThreadIDInGroup.x;
        if ( iBin < 256 )
        {
            for ( iHist = 0; iHist < 16; iHist++ )
            {
                int2 destAddr = int2( iHist, iBin );
                OutputResource.add( destAddr, Histograms[iHist][iBin] );  // atomic
            }
        }
    }
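Here is a CPU model of the replicated-bin scheme, written in Python with illustrative names. It assumes luminance values in [0, 1] and adds a final merge of the copies so that the result can be compared against a plain histogram.

```python
def replicated_histogram(luminances, copies=16, bins=256):
    """Build `copies` partial histograms, then merge them into one.

    Each "thread" (one per input value) increments the copy selected by its
    index, so threads hitting the same bin usually touch different counters.
    """
    histograms = [[0] * bins for _ in range(copies)]
    for thread_index, luminance in enumerate(luminances):
        bin_index = min(int(luminance * (bins - 1)), bins - 1)
        histograms[thread_index % copies][bin_index] += 1
    merged = [sum(h[b] for h in histograms) for b in range(bins)]
    return merged, histograms

# 32 black pixels followed by 16 white ones.
merged, partials = replicated_histogram([0.0] * 32 + [1.0] * 16)
```

Even though 32 pixels land in bin 0, no single counter receives more than 2 increments. That is the contention reduction the replication is buying.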

Histogram Performance
Recent work shows similar performance to reductions:
Direct3D takes ~2.4 ms per megapixel
    On DirectX 10 hardware
2x speedup shown via prototypes
    On same hardware but using shared registers
8x theoretically possible
    if purely read limited

Image Convolution
Fundamental operation for blurs:
    HDR flares, depth-of-field, soft shadows, streaks
Need fairly large kernels for these
    100 wide is possible at high resolutions
    (sparse sampling produces artifacts)

[Figure: 7-Tap Separable Kernel]

Convolution Performance
Massively variable depending on method
Direct3D does 5x5 kernel in 0.65 ms/Mpix
    Separable kernel
Prototype does slightly better
    Using shared register capability
Theoretical performance should be higher
    Some opportunity remains
Need to evaluate relevant kernel sizes
    Games need 100x100 effectively
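Separability matters at these kernel sizes because a k×k blur built from a 1D kernel can run as a horizontal pass followed by a vertical pass, costing 2k taps per pixel instead of k². This pure-Python sketch checks that the two forms agree. It assumes an odd kernel length and clamp-to-edge addressing, and all names are illustrative.

```python
def clamp(value, low, high):
    return max(low, min(value, high))

def convolve_rows(image, kernel):
    """Apply a 1D kernel along each row with clamp-to-edge addressing."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [
            sum(w * row[clamp(x + i - radius, 0, width - 1)]
                for i, w in enumerate(kernel))
            for x in range(width)
        ]
        for row in image
    ]

def transpose(image):
    return [list(column) for column in zip(*image)]

def separable_blur(image, kernel):
    """Horizontal pass, then vertical pass (rows of the transposed image)."""
    horizontal = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(horizontal), kernel))

def full_blur(image, kernel):
    """Reference: the equivalent 2D kernel (outer product) in a single pass."""
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    return [
        [
            sum(
                wy * wx * image[clamp(y + j - radius, 0, height - 1)]
                               [clamp(x + i - radius, 0, width - 1)]
                for j, wy in enumerate(kernel)
                for i, wx in enumerate(kernel)
            )
            for x in range(width)
        ]
        for y in range(height)
    ]

def taps_per_pixel(kernel_width):
    """Return (taps for the full 2D kernel, taps for two separable passes)."""
    return kernel_width * kernel_width, 2 * kernel_width
```

For the 5×5 case this is 25 taps versus 10. For a 100-wide kernel it is 10,000 versus 200, which suggests why the larger sizes are only practical in separable form.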

Other Example Techniques
These are not used directly in game post-processing today, but are key foundations of other algorithms
    Scan (prefix sum), and
    FFT (fast Fourier transform)

Scan (Prefix-sum)
Each number in data sequence is sum of all previous numbers
    Used to compute writes in irregular arrays
    Foundation of Summed Area Tables
Known GPU algorithms (Horn's method)
    Pyramid scheme, so I/O bound
    Sharing memory between threads results in ~2x speedup
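To make the operation concrete, here is a small Python sketch of a scan. It uses the simple step-doubling formulation (often attributed to Hillis and Steele), not Horn's method named above. It was chosen because each pass maps naturally onto one data-parallel step, and the function names are illustrative.

```python
def inclusive_scan(values):
    """Return (prefix sums including each element, number of passes)."""
    data = list(values)
    offset = 1
    passes = 0
    while offset < len(data):
        # One parallel pass: every element adds the value `offset` places back.
        data = [
            data[i] + (data[i - offset] if i >= offset else 0)
            for i in range(len(data))
        ]
        offset *= 2
        passes += 1
    return data, passes

def exclusive_scan(values):
    """Sum of all *previous* elements, as described on the slide."""
    inclusive, _ = inclusive_scan(values)
    return [0] + inclusive[:-1] if inclusive else []

# Each thread wants to write a different number of records; the exclusive
# scan of those counts gives every thread its starting write offset.
records_per_thread = [2, 0, 1, 3]
write_offsets = exclusive_scan(records_per_thread)
```

Eight elements take three passes (log2 of the length). Every pass reads and writes the whole array, which fits the slide's point that the operation is I/O bound.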

Scan (Prefix-sum)
We are looking at providing this in a library routine
    Along with FFT, etc.

Summed Area Table
2D equivalent of Scan
    Each element of 2D array has sum of all elements up/left of it
Enables box filter with performance independent of kernel size k, i.e. O(1) per pixel
Fast generation of
    Shadow blur with distance
    Depth-of-field
    Area light integrals, etc.
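The constant-cost box filter follows from the table's definition: any rectangle sum needs only four lookups. This Python sketch shows it with illustrative function names. It pads the table with a leading row and column of zeros, an implementation choice that avoids edge cases.

```python
def summed_area_table(image):
    """sat[y][x] holds the sum of image[0..y-1][0..x-1]; row/col 0 are zero."""
    height, width = len(image), len(image[0])
    sat = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat

def box_sum(sat, x0, y0, x1, y1):
    """Sum over columns x0..x1-1 and rows y0..y1-1 using four lookups."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
sat = summed_area_table(image)
```

Dividing `box_sum` by the box area gives the box-filtered value. The cost stays at four reads whether the box covers 4 pixels or 10,000.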

Fast Fourier Transform
Converts image into frequency domain
Many operations are faster in frequency domain than in spatial domain
    e.g. convolution becomes a multiply
    Trivial detection of periodic noise
    Some application to motion estimation
Core algorithm similar to scan
    Similar I/O patterns,
    But more math-intensive inner loop
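The claim that convolution becomes a multiply can be demonstrated in one dimension. This Python sketch uses a textbook recursive radix-2 FFT, not the multi-pass GPU formulation, and assumes power-of-two lengths and circular (wrap-around) convolution. All names are illustrative.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: the trig (twiddle factor) dominates the inner loop.
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def ifft(values):
    n = len(values)
    return [v / n for v in fft(values, inverse=True)]

def circular_convolve_direct(a, b):
    """O(n^2) reference convolution with wrap-around indexing."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

def circular_convolve_fft(a, b):
    """Transform, multiply pointwise, transform back."""
    product = [x * y for x, y in zip(fft(a), fft(b))]
    return [v.real for v in ifft(product)]

signal = [1, 2, 3, 4, 0, 0, 0, 0]
kernel = [1, 1, 0, 0, 0, 0, 0, 0]
direct = circular_convolve_direct(signal, kernel)
via_fft = circular_convolve_fft(signal, kernel)
```

The two results agree to within rounding error. The direct form costs n multiplies per output, while the FFT route costs on the order of log n, which is where the advantage for wide kernels comes from.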

Direct3D FFT
Ping-pong between 2 R32G32F surfaces
    R is Real, G is Imaginary
Do LogN passes along rows then columns
Pixel shader only
Does not use blenders or iterators
    Uses vPos.xy as array indices [i][j]
Inner loop is math intensive
    20+ instructions including trig
    Indexing math dominates unless DX10

[Figures: image before FFT and after FFT]

FFT Performance
Complex 1024x1024 2D FFT:
    Implementation    Time     Throughput    Speedup
    Software          42 ms    6 GFlops
    Direct3D 9        15 ms    17 GFlops     3x
    Prototype DX11    6 ms     42 GFlops     6x
    Latest chips      3 ms     100 GFlops
Shared register space and random access writes enable ~2x speedups

Order-Independent Translucency
Eliminates draw-order issues, and shimmer in moving scenes
Correct AA even of transparent objects
    Any object is transparent if antialiased
    e.g. alpha tested leaves in forests
Current methods require large sample counts
    Alpha-To-Coverage
    Depth Peeling with Occlusion Queries

The A-Buffer Method
A-buffer is a more accurate method
    Accumulate object data in per-pixel list
    Then sort each pixel into order
    Collapse to final color and display
Brings visual quality to movie levels without requiring 256-sample MSAA
Something to keep an eye on for OIT
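The three A-buffer steps (accumulate, sort, collapse) can be sketched on the CPU. This Python model makes several assumptions for illustration: fragments are `(depth, rgb, alpha)` tuples, larger depth means farther from the viewer, and the collapse uses standard back-to-front "over" blending. The names are hypothetical.

```python
from collections import defaultdict

def resolve_pixel(fragments, background):
    """Sort one pixel's fragments far-to-near and blend them over the background."""
    color = background
    for _depth, rgb, alpha in sorted(fragments, key=lambda f: f[0], reverse=True):
        color = tuple(alpha * src + (1.0 - alpha) * dst
                      for src, dst in zip(rgb, color))
    return color

# Step 1: accumulate fragments per pixel in whatever order they are drawn.
a_buffer = defaultdict(list)
a_buffer[(10, 20)].append((0.2, (0.0, 0.0, 1.0), 0.5))  # near, blue
a_buffer[(10, 20)].append((0.8, (1.0, 0.0, 0.0), 0.5))  # far, red

# Steps 2 and 3: sort and collapse to the final color.
final = resolve_pixel(a_buffer[(10, 20)], background=(0.0, 0.0, 0.0))
```

Because the sort happens per pixel at resolve time, submitting the two fragments in the opposite order gives the same final color. That is the draw-order independence the method is after.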

A-Buffer Rendering
Currently prototyping using refrast
    DirectX reference rasterizer running on CPU
    Measuring memory access patterns/locality
    Evaluating feasibility of hardware
Not really feasible with current Direct3D
Compute shader features enable this
    Such as indexed writes, counters, etc.
    Rendering to structures beyond regular arrays
But performance is still largely unknown

Additional Algorithms
New rendering methods
    Ray-tracing, collision detection, etc.
    Rendering elements at different resolutions
Non-rendering algorithms
    IK, physics, AI, simulation, fluid simulation, radiosity
Need more general data structures
    Quad/octrees, irregular arrays, sparse arrays
Need linear algebra

Summary
Compute Shader is coming in Direct3D 11
    GPU performance levels for more applications
Scalable parallel processing model
    Code should scale for several generations
Increased generality will enable both:
    Improved performance on existing GPU tasks
    More CPU tasks can switch to DP cores
Full cross-vendor support
    Enables broadest possible installed base

Questions?

www.xnagamefest.com

© 2008 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

http://www.xna.com/

