
Direct3D 11 Compute Shader: More Generality for Advanced Techniques

Date post: 16-Jul-2016
Upload: chi-zhi
Page 1: Direct3D 11 Computer Shader More Generality for Advanced Techniques
Page 2: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Direct3D 11 Compute Shader
More Generality for Advanced Techniques

Chas. Boyd
Architect
Windows Desktop & Graphics Technology
Microsoft

Page 3: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Overview
GPGPU vs. Data Parallel Computing
Introducing the Compute Shader
Advantages
Target Applications
Key Features
Examples
  Image reduction, histogram, convolution
API Support

Page 4: Direct3D 11 Computer Shader More Generality for Advanced Techniques

GPGPU = Data Parallel Computing
GPU performance continues to grow
  More algorithms want this performance
  Apps can scale to massive parallelism without tricky code changes
General recognition that this model is applicable beyond just rendering…
  …although that is our primary target
Deliver scalable performance
  Code scales with core count with no changes

Page 5: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Introducing: Compute Shader
A new processing model for GPUs
  Data-parallel programming for mass markets
Integrated with Direct3D
  For efficient interoperability in client scenarios
Supports more general constructs:
  Cross-thread data sharing
  Unordered access I/O operations
Enables more general data structures
  Irregular arrays, trees, etc.
Enables more general algorithms
  Far beyond shading

Page 6: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Optimized for Client Scenarios
Simpler setup syntax
  Balance between power and complexity
Real-time rendering of results
  Working to reduce cost of transition from compute mode to graphics mode
Better integration with media data types:
  Pixels, samples, text, vs. only floats
Need consistency between implementations
  Both across vendors and over time/generations

Page 7: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Compute Shader Features
Predictable thread invocation
  Regular arrays of threads: 1D, 2D, 3D
  Don’t have to “draw a quad” anymore
Shared registers between threads
  Reduces register pressure
  Can eliminate redundant compute and I/O
Scattered writes
  Can read/write arbitrary data structures
  Enables new classes of algorithms
  Integrated with Direct3D resources


Page 8: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Target Applications
Image/post-processing:
  Image reduction, histogram, convolution, FFT
Effect physics
  Particles, smoke, water, cloth, etc.
A-Buffer/OIT
Ray-tracing, radiosity, etc.
Gameplay physics, AI

Page 9: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Integrated with Direct3D
Fully supports all Direct3D resources
Targets graphics/media data types
Evolution of DirectX HLSL
Graphics pipeline updated to emit general data structures
  which can then be manipulated by the compute shader
  and then rendered by D3D again

Page 10: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Integration with Pipeline
[Diagram: pipeline stages Input Assembler, Vertex Shader, Tessellation, Geometry Shader, Rasterizer, Pixel Shader, Output Merger, plus Compute Shader and Data Structure blocks; image labels: Scene Image, Final Image]
Render scene
Write out scene image
Use Compute for image post-processing
Output final image

Page 11: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Direct Thread Invocation
The ability to explicitly launch a known number of threads onto the GPU
  pD3D11Device->Dispatch( … numThreads… );
Analogous to graphics DrawPrimitive() calls
Enables algorithms to execute the optimal number of threads
  Not how many vertices are read, or pixels written
Current thread id is available to shader code:
  sv_ThreadID.x
  Analogous to sv_PrimitiveID system value
Enables predictable memory access and register usage


Page 12: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Shared Register Class
New register type/variable storage class
  shared float sfFoo;
Multiple threads can access same memory
  Enables uses like user-controlled cache
Maximum of 32 KB of registers can be shared in DirectX 11
  8K floats or 2K float4s
  vs. 64 KB of total temporary registers available
  16K floats or 4K float4s

Page 13: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Sub Blocking
Not all threads in the call can/should share registers with each other
Sharing threads are broken down into subsets (groups) of threads
Thread indices are made available in shader
  sv_ThreadID
  sv_ThreadGroupID
  sv_ThreadIDinGroup
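The relation between the three indices can be sketched on the CPU. This is a minimal 1D model, assuming that the global thread index equals the group index times the group size plus the index within the group; the struct, the function names, and the group sizes used in the usage note are hypothetical and do not come from the slides.

```cpp
#include <cstdint>

// Hypothetical 1D model of the three thread indices named on the slide.
struct ThreadIndices {
    std::uint32_t threadID;        // global index across the whole dispatch
    std::uint32_t threadGroupID;   // which group the thread belongs to
    std::uint32_t threadIDInGroup; // index of the thread inside its group
};

// Number of groups needed to cover itemCount work items (rounds up).
inline std::uint32_t groupsFor(std::uint32_t itemCount, std::uint32_t groupSize) {
    return (itemCount + groupSize - 1) / groupSize;
}

// Assumed relation: threadID = threadGroupID * groupSize + threadIDInGroup.
inline ThreadIndices indicesFor(std::uint32_t groupID, std::uint32_t idInGroup,
                                std::uint32_t groupSize) {
    return ThreadIndices{groupID * groupSize + idInGroup, groupID, idInGroup};
}
```

For example, covering one million pixels with groups of 256 threads needs groupsFor(1000000, 256) groups, and the last group is only partly used, so a kernel written this way has to guard against thread indices past the end of the data.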


Page 14: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Atomic Intrinsics
Enable parallel operations on individual 32-bit memory locations without requiring full synchronization
  Either video memory or shared registers
Can be used to implement higher-level synch constructs
  Semaphores, etc.
Not intended for heavy lifting
Support an immediate return argument
  At some performance cost

Page 15: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Atomic Intrinsics
Enables basic operations:
  InterlockedAdd( rVar, val );
  InterlockedMin( rVar, val );
  InterlockedMax( rVar, val );
  InterlockedOr( rVar, val );
  InterlockedXOr( rVar, val );
  InterlockedCompareWrite( rVar, val );
  InterlockedCompareExchange( rVar, val );
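As a CPU analogy for how a compare-and-exchange primitive can build other atomic operations, the sketch below implements an atomic maximum on a 32-bit value with std::atomic. It is not the HLSL intrinsic itself; the function name atomicMax is hypothetical, and returning the previous value is meant to mirror the "immediate return argument" mentioned earlier.

```cpp
#include <atomic>
#include <cstdint>

// Atomic maximum built from compare-and-exchange.
// Returns the value that was stored before this call took effect.
inline std::uint32_t atomicMax(std::atomic<std::uint32_t>& target, std::uint32_t value) {
    std::uint32_t observed = target.load();
    // Retry until the stored value is already >= value or our swap succeeds.
    // On failure, compare_exchange_weak refreshes `observed` with the current value.
    while (observed < value && !target.compare_exchange_weak(observed, value)) {
    }
    return observed;
}
```

The retry loop is why such operations are cheap under low contention and expensive when many threads hit the same location, which matches the slide's caution that atomics are not intended for heavy lifting.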

Page 16: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Unordered Memory Accesses

HLSL ‘resource variables’
  Declared in the language
DXGI resources
Enables out-of-bounds memory checking
  Returns 0 on reads
  Writes are no-ops
Improves security, reliability of shipped code
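The out-of-bounds behaviour described above (reads return 0, writes are ignored) can be modelled on the CPU as follows. This is only an illustrative sketch; the class name CheckedBuffer and its interface are hypothetical and are not part of Direct3D.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU model of a bounds-checked resource.
class CheckedBuffer {
public:
    explicit CheckedBuffer(std::size_t count) : data_(count, 0u) {}

    // Out-of-bounds reads return 0 instead of touching memory.
    std::uint32_t load(std::size_t index) const {
        return index < data_.size() ? data_[index] : 0u;
    }

    // Out-of-bounds writes are silently dropped (no-ops).
    void store(std::size_t index, std::uint32_t value) {
        if (index < data_.size()) {
            data_[index] = value;
        }
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<std::uint32_t> data_;
};
```

With this contract a stray index can produce a wrong result, but it cannot corrupt neighbouring data or crash, which is the reliability argument the slide makes.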

Page 17: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Unordered I/O
For fastest performance when ordering of records need not be preserved
Both reads and writes:
  UnorderedLoad( ResourceVar, val );
  UnorderedStore( ResourceVar, val );
Requires buffer allocated beforehand

Page 18: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Integration with Direct3D
Pixel shaders can also perform scattered writes
Enables rendering output to data structures more complex than a 2D array
  Histogram, linked list, irregular array, tree, etc.

Page 19: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Don’t Forget
Texture sampling still works:
  Object.Load( Loc, Offset, Samples );
  Object.Gather( Sampler, Loc );
  Object.Sample( Sampler, Loc );
  Object.SampleLevel( Sampler, Loc, LoD );
No automatic trilinear LoD calculation
Other graphics features are not present:
  Antialiasing, depth culling, alpha blending, triangle rasterization

Page 20: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Examples
Image Reduction
Image Histogram
FFT

Page 21: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Image Post-Processing
Significant fraction of frame time
  10–20% for most games
  50–70% for deferred shading-based engines
Savings here mean more time for 3D

Page 22: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Image Reduction
Find the average intensity of an image
  E.g. for HDR exposure adjustment
  Optimizes scene for viewing on SDR monitor
Algorithm breakdown:
  Input: 1 million pixels
  Compute: 1 MAD per pixel read
  Output: 1 value
Should this run at texture sample rate?
  It does not, due to write contention

Page 23: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Million-to-1 reduction
[Diagram: Input, GPU, Output]

Page 24: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Reduction Compute Code
Buffer<uint> Values;
OutputBuffer<uint> Result;

ImageAverage()
{
    groupshared uint Total; // Total so far
    groupshared uint Count; // Count added

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    uint value = fLuminance*65536;

    InterlockedAdd( Count, 1 );
    InterlockedAdd( Total, value );

    SynchronizeThreadGroup(); // wait for all threads in group to complete

Page 25: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Reduction Compute Code 2
    // Allow all threads in group to complete
    SynchronizeThreadGroup();

    // Compute the average and store it in our output buffer
    if ( sv_ThreadID.x == 0 )
    {
        float fAverage = Total/Count; // compute avg
        UnorderedStore( Result[0], fAverage ); // write it out
    }
}

Page 26: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Fast Reduction Compute Code
Buffer<uint> Values;
OutputBuffer<uint> Result;

ImageAverage()
{
    groupshared uint Total[32]; // array of 32 totals
    groupshared uint Count[32]; // array of 32 counts

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    uint value = fLuminance*65536;

    // Mask with 31 so the index stays in 0..31
    uint idx = (sv_ThreadID.x + sv_ThreadID.y + sv_ThreadID.z) & 31;

    // Several threads still map to the same slot, so keep the adds atomic
    InterlockedAdd( Total[idx], value );
    InterlockedAdd( Count[idx], 1 );

Page 27: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Fast Reduction Compute Code 2
    // Allow all threads in group to complete
    SynchronizeThreadGroup();

    // Compute the average and store it in our output buffer
    if ( sv_ThreadIDInGroup.x == 0 )
    {
        uint TheTotal = 0;
        uint TheCount = 0;
        for ( uint i = 0; i < 32; i++ )
        {
            TheTotal += Total[i];
            TheCount += Count[i];
        }
        float fAverage = TheTotal/TheCount; // compute avg
        UnorderedStore( Result[sv_ThreadGroupID.x], fAverage ); // write
    }
}
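The arithmetic of the striped reduction can be checked with a sequential CPU reference. This is a sketch under stated assumptions, not the shader: the loop index stands in for the thread id, the function name averageLuminance is hypothetical, luminance is assumed to lie in [0, 1], and 64-bit totals are used here so that a whole image can be summed without overflow. It also converts the fixed-point result back to a float, which the slide code leaves to the consumer.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential reference for the striped reduction: 32 partial totals and
// counts, combined at the end into a single average.
inline float averageLuminance(const std::vector<float>& luminance) {
    constexpr std::size_t kStripes = 32;
    std::array<std::uint64_t, kStripes> total{}; // zero-initialised
    std::array<std::uint64_t, kStripes> count{};

    for (std::size_t threadID = 0; threadID < luminance.size(); ++threadID) {
        // Same 16-bit fixed-point scaling as the slide code.
        const std::uint64_t value =
            static_cast<std::uint64_t>(luminance[threadID] * 65536.0f);
        const std::size_t idx = threadID % kStripes; // stripe for this "thread"
        total[idx] += value;
        count[idx] += 1;
    }

    std::uint64_t theTotal = 0;
    std::uint64_t theCount = 0;
    for (std::size_t i = 0; i < kStripes; ++i) {
        theTotal += total[i];
        theCount += count[i];
    }
    if (theCount == 0) {
        return 0.0f; // empty input: define the average as 0
    }
    // Undo the fixed-point scaling to get back a value in [0, 1].
    return static_cast<float>(static_cast<double>(theTotal) /
                              static_cast<double>(theCount) / 65536.0);
}
```

Spreading the additions over 32 slots does not change the result; it only reduces how often two threads contend for the same destination.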

Page 28: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Reduction Performance
Pyramid approaches work today
  Some choice in reduction level per pass
  Tradeoff is contention for destination
1M pixels takes ~0.4 ms in Direct3D
  Pass-count-limited at small end of pyramid
Ideally should run at texture read rate
  < 0.1 ms in theory, or 4–10x faster
Compute shader features should help
  Such as local read-write cache
  Prototypes show ~2x speed boost so far

Page 29: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Histogram Generation
Similar to reduction problem
  Reduce to 64–256 destinations at data-dependent (unpredictable) addresses
Still suffers contention when multiple pixels increment same bin
  So replicate bins, e.g. 16x
Increment bins using InterlockedAdd() math operations
  Currently showing 2x speedup

Page 30: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Histogram Generation Code
Histogram()
{
    shared int Histograms[16][256]; // 16 copies of a 256-bin histogram

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    int iBin = fLuminance*255.0f;

    // compute bin to increment
    int iHist = sv_ThreadIDInGroup.x & 15; // use thread index to pick a copy (0..15)
    InterlockedAdd( Histograms[iHist][iBin], 1 ); // update bin atomically

    SynchronizeThreadGroup(); // wait for all threads in group to complete

Page 31: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Histogram Generation Code 2
    // Write register histograms out to memory:
    iBin = sv_ThreadIDInGroup.x;
    if ( iBin < 256 )
    {
        for ( iHist = 0; iHist < 16; iHist++ )
        {
            int2 destAddr = int2( iHist, iBin );
            OutputResource.add( destAddr, Histograms[iHist][iBin] ); // atomic
        }
    }
}
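The bin-replication idea can be checked with a sequential CPU reference that fills 16 copies of a 256-bin histogram and then merges them. This is a sketch, not the shader: the loop index stands in for the thread index, the function name luminanceHistogram is hypothetical, and unlike the slide code it merges the copies into a single histogram and clamps the input to [0, 1].

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Histogram = std::array<std::uint32_t, 256>;

// Sequential reference for the replicated-bin histogram.
inline Histogram luminanceHistogram(const std::vector<float>& luminance) {
    constexpr std::size_t kCopies = 16;
    std::array<Histogram, kCopies> copies{}; // all bins start at zero

    for (std::size_t threadID = 0; threadID < luminance.size(); ++threadID) {
        float l = luminance[threadID];
        if (l < 0.0f) l = 0.0f; // clamp so the bin index stays in range
        if (l > 1.0f) l = 1.0f;
        const std::size_t bin = static_cast<std::size_t>(l * 255.0f);
        copies[threadID % kCopies][bin] += 1; // pick a copy from the "thread" index
    }

    // Merge the copies into one histogram.
    Histogram merged{};
    for (const Histogram& copy : copies) {
        for (std::size_t bin = 0; bin < merged.size(); ++bin) {
            merged[bin] += copy[bin];
        }
    }
    return merged;
}
```

Replication trades memory (16 x 256 counters instead of 256) for fewer collisions when many neighbouring pixels fall into the same bin.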

Page 32: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Histogram Performance
Recent work shows similar performance to reductions:
  Direct3D takes ~2.4 ms per megapixel
  On DirectX 10 hardware
2x speedup shown via prototypes
  On same hardware but using shared registers
8x theoretically possible
  if purely read limited

Page 33: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Image Convolution
Fundamental operation for blurs:
  HDR flares, depth-of-field, soft shadows, streaks
Need fairly large kernels for these
  100 wide is possible at high resolutions
  (sparse sampling produces artifacts)

Page 34: Direct3D 11 Computer Shader More Generality for Advanced Techniques

7-Tap Separable Kernel

Page 35: Direct3D 11 Computer Shader More Generality for Advanced Techniques

7-Tap Separable Kernel

Page 36: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Convolution Performance
Massively variable depending on method
Direct3D does 5x5 kernel in 0.65 ms/Mpix
  Separable kernel
Prototype does slightly better
  Using shared register capability
Theoretical performance should be higher
  Some opportunity remains
Need to evaluate relevant kernel sizes
  Games need 100x100 effectively
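A separable kernel applies one 1D filter along rows and then the same filter along columns, so a k x k blur costs about 2k taps per pixel instead of k*k. The CPU sketch below illustrates this; the function names, the row-major float image layout, and the clamp-to-edge border handling are assumptions made for the example, not details from the slides.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One 1D pass over a row-major image, along rows (horizontal) or columns.
// Samples outside the image are clamped to the nearest edge pixel.
inline std::vector<float> convolve1D(const std::vector<float>& src, int width, int height,
                                     const std::vector<float>& kernel, bool horizontal) {
    const int radius = static_cast<int>(kernel.size()) / 2;
    std::vector<float> dst(src.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int sx = horizontal ? std::min(std::max(x + k, 0), width - 1) : x;
                const int sy = horizontal ? y : std::min(std::max(y + k, 0), height - 1);
                sum += kernel[static_cast<std::size_t>(k + radius)] *
                       src[static_cast<std::size_t>(sy * width + sx)];
            }
            dst[static_cast<std::size_t>(y * width + x)] = sum;
        }
    }
    return dst;
}

// Separable blur: horizontal pass followed by vertical pass.
inline std::vector<float> separableBlur(const std::vector<float>& src, int width, int height,
                                        const std::vector<float>& kernel) {
    return convolve1D(convolve1D(src, width, height, kernel, true),
                      width, height, kernel, false);
}
```

Blurring a single bright pixel with the kernel {0.25, 0.5, 0.25} yields the outer product of the kernel with itself, which is the 2D kernel the two passes are equivalent to.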

Page 37: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Other Example Techniques
These are not used directly in game post-processing today, but are key foundations of other algorithms
  Scan (prefix sum), and
  FFT (fast Fourier transform)

Page 38: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Scan (Prefix-sum)
Each number in the output sequence is the sum of all previous numbers in the input sequence
  Used to compute write addresses in irregular arrays
  Foundation of Summed Area Tables
Known GPU algorithms (Horn’s method)
  Pyramid scheme, so I/O bound
  Sharing memory between threads results in ~2x speedup
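To make the "compute writes in irregular arrays" use concrete, the sketch below shows a sequential exclusive scan and a stream compaction built on it. It is a plain CPU reference, not Horn's pyramid method; the function names are hypothetical, and compact assumes the flags are 0 or 1 and that both inputs have the same length.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Exclusive scan: out[i] is the sum of in[0] .. in[i-1].
inline std::vector<std::uint32_t> exclusiveScan(const std::vector<std::uint32_t>& in) {
    std::vector<std::uint32_t> out(in.size());
    std::uint32_t running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Stream compaction: keep values[i] where keep[i] == 1.
// The scan of the flags gives each surviving element its output slot.
inline std::vector<int> compact(const std::vector<int>& values,
                                const std::vector<std::uint32_t>& keep) {
    const std::vector<std::uint32_t> offsets = exclusiveScan(keep);
    const std::size_t outCount = values.empty() ? 0 : offsets.back() + keep.back();
    std::vector<int> out(outCount);
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (keep[i] != 0) {
            out[offsets[i]] = values[i];
        }
    }
    return out;
}
```

Because every element learns its destination from the scan alone, the final scatter loop has no dependencies between iterations, which is what makes the pattern attractive for data-parallel hardware.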

Page 39: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Scan (Prefix-sum)
We are looking at providing this in a library routine
  Along with FFT, etc.

Page 40: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Summed Area Table
2D equivalent of Scan
  Each element of 2D array has sum of all elements up/left of it
Enables box filter with performance independent of kernel size: O(1) per pixel
Fast generation of
  Shadow blur with distance
  Depth-of-field
  Area light integrals, etc.
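The constant-cost box filter follows from the fact that any rectangle sum can be read from four table entries. The CPU sketch below builds the table with one extra row and column of zeros, which is an implementation choice made here to avoid edge cases; the class name and interface are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Summed area table with a zero border: entry (x+1, y+1) holds the sum of
// all pixels in the rectangle from (0, 0) to (x, y) inclusive.
class SummedAreaTable {
public:
    SummedAreaTable(const std::vector<std::uint32_t>& pixels,
                    std::size_t width, std::size_t height)
        : stride_(width + 1), sums_((width + 1) * (height + 1), 0) {
        for (std::size_t y = 0; y < height; ++y) {
            for (std::size_t x = 0; x < width; ++x) {
                sums_[(y + 1) * stride_ + (x + 1)] =
                    pixels[y * width + x]
                    + sums_[y * stride_ + (x + 1)]   // sum above
                    + sums_[(y + 1) * stride_ + x]   // sum to the left
                    - sums_[y * stride_ + x];        // overlap counted twice
            }
        }
    }

    // Sum over the inclusive rectangle (x0, y0)-(x1, y1): four lookups,
    // regardless of how large the rectangle is.
    std::uint64_t boxSum(std::size_t x0, std::size_t y0,
                         std::size_t x1, std::size_t y1) const {
        return sums_[(y1 + 1) * stride_ + (x1 + 1)]
             - sums_[(y1 + 1) * stride_ + x0]
             - sums_[y0 * stride_ + (x1 + 1)]
             + sums_[y0 * stride_ + x0];
    }

private:
    std::size_t stride_;
    std::vector<std::uint64_t> sums_;
};
```

Dividing boxSum by the rectangle's pixel count gives the box-filtered value, so a 3x3 and a 100x100 blur cost the same per pixel once the table exists.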

Page 41: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Fast Fourier Transform
Converts image into frequency domain
Many operations are faster in frequency domain than in spatial domain
  e.g. convolution becomes a multiply
  Trivial detection of periodic noise
  Some application to motion estimation
Core algorithm similar to scan
  Similar I/O patterns,
  But more math-intensive inner loop
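The statement that convolution becomes a multiply can be demonstrated in 1D. The sketch below uses a naive O(N^2) discrete Fourier transform rather than a fast Fourier transform, purely to keep the example short; the names dft and circularConvolve are hypothetical, and both inputs are assumed to have the same length.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Signal = std::vector<std::complex<double>>;

// Naive discrete Fourier transform (not an FFT). The inverse transform
// flips the sign of the exponent and divides by the length.
inline Signal dft(const Signal& in, bool inverse) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    const double sign = inverse ? 1.0 : -1.0;
    Signal out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * 2.0 * pi *
                                 static_cast<double>((k * t) % n) / static_cast<double>(n);
            sum += in[t] * std::polar(1.0, angle);
        }
        out[k] = inverse ? sum / static_cast<double>(n) : sum;
    }
    return out;
}

// Circular convolution via the frequency domain:
// transform both inputs, multiply element-wise, transform back.
inline Signal circularConvolve(const Signal& a, const Signal& b) {
    Signal fa = dft(a, false);
    const Signal fb = dft(b, false);
    for (std::size_t i = 0; i < fa.size(); ++i) {
        fa[i] *= fb[i];
    }
    return dft(fa, true);
}
```

Convolving with a unit impulse delayed by one sample simply rotates the signal by one position, which gives an easy check of the result. A real implementation would replace dft with an FFT to get the O(N log N) cost the slides rely on.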

Page 42: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Direct3D FFT
Ping-pong between 2 R32G32F surfaces
  R is Real, G is Imaginary
Do log N passes along rows then columns
Pixel shader only
Does not use blenders or iterators
  Uses vPos.xy as array indices [i][j]
Inner loop is math intensive
  20+ instructions including trig
  Indexing math dominates unless DX10

Page 43: Direct3D 11 Computer Shader More Generality for Advanced Techniques

FFT Before

Page 44: Direct3D 11 Computer Shader More Generality for Advanced Techniques

FFT After

Page 45: Direct3D 11 Computer Shader More Generality for Advanced Techniques

After

Page 46: Direct3D 11 Computer Shader More Generality for Advanced Techniques

FFT Performance
Complex 1024x1024 2D FFT:

  Implementation    Time     Throughput    Speedup
  Software          42 ms    6 GFlops
  Direct3D 9        15 ms    17 GFlops     3x
  Prototype DX11    6 ms     42 GFlops     6x
  Latest chips      3 ms     100 GFlops

Shared register space and random access writes enable ~2x speedups

Page 47: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Order-Independent Translucency
Eliminates draw-order issues, and shimmer in moving scenes
Correct AA even of transparent objects
  Any object is transparent if antialiased
  e.g. alpha-tested leaves in forests
Current methods require large sample counts
  Alpha-To-Coverage
  Depth Peeling with Occlusion Queries

Page 48: Direct3D 11 Computer Shader More Generality for Advanced Techniques

The A-Buffer Method
A-buffer is a more accurate method
  Accumulate object data in per-pixel list
  Then sort each pixel into order
  Collapse to final color and display
Brings visual quality to movie levels without requiring 256-sample MSAA
Something to keep an eye on for OIT
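The accumulate, sort, and collapse steps can be sketched for a single pixel on the CPU. This is an illustration only: the Fragment struct and resolvePixel function are hypothetical, color is reduced to one channel, and standard "over" blending from far to near is assumed as the collapse rule.

```cpp
#include <algorithm>
#include <vector>

// One entry in a pixel's fragment list (single color channel for brevity).
struct Fragment {
    float depth; // larger means farther from the viewer
    float color;
    float alpha; // 0 = fully transparent, 1 = fully opaque
};

// Sort the pixel's fragments far-to-near, then collapse them onto the
// background with "over" blending. The input order does not matter.
inline float resolvePixel(std::vector<Fragment> fragments, float background) {
    std::sort(fragments.begin(), fragments.end(),
              [](const Fragment& a, const Fragment& b) { return a.depth > b.depth; });
    float result = background;
    for (const Fragment& f : fragments) {
        result = f.color * f.alpha + result * (1.0f - f.alpha);
    }
    return result;
}
```

Because the sort happens per pixel after all fragments are collected, the application can submit transparent geometry in any order, which is the draw-order independence the method is named for.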

Page 49: Direct3D 11 Computer Shader More Generality for Advanced Techniques

A-Buffer Rendering
Currently prototyping using refrast
  DirectX reference rasterizer running on CPU
  Measuring memory access patterns/locality
  Evaluating feasibility of hardware
Not really feasible with current Direct3D
  Compute shader features enable this
  Such as indexed writes, counters, etc.
  Rendering to structures beyond regular arrays
But performance is still largely unknown

Page 50: Direct3D 11 Computer Shader More Generality for Advanced Techniques
Page 51: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Additional Algorithms
New rendering methods
  Ray-tracing, collision detection, etc.
  Rendering elements at different resolutions
Non-rendering algorithms
  IK, physics, AI, simulation, fluid simulation, radiosity
Need more general data structures
  Quad/octrees, irregular arrays, sparse arrays
Need linear algebra

Page 52: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Summary
Compute Shader is coming in Direct3D 11
  GPU performance levels for more applications
Scalable parallel processing model
  Code should scale for several generations
Increased generality will enable both:
  Improved performance on existing GPU tasks
  More CPU tasks can switch to DP cores
Full cross-vendor support
  Enables broadest possible installed base

Page 53: Direct3D 11 Computer Shader More Generality for Advanced Techniques

Questions?

Page 54: Direct3D 11 Computer Shader More Generality for Advanced Techniques

www.xnagamefest.com

© 2008 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only.

Microsoft makes no warranties, express or implied, in this summary.

