Direct3D 11 Compute Shader
More Generality for Advanced Techniques
Chas. Boyd
Architect
Windows Desktop & Graphics Technology
Microsoft
Overview
GPGPU vs. Data Parallel Computing
Introducing the Compute Shader
Advantages
Target Applications
Key Features
Examples
  Image reduction, histogram, convolution
API Support
GPGPU = Data Parallel Computing
GPU performance continues to grow
  More algorithms want this performance
Apps can scale to massive parallelism without tricky code changes
General recognition that this model is applicable beyond just rendering…
  …although that is our primary target
Deliver scalable performance
  Code scales with core count with no changes
Introducing: Compute Shader
A new processing model for GPUs
  Data-parallel programming for mass markets
Integrated with Direct3D
  For efficient interoperability in client scenarios
Supports more general constructs:
  Cross-thread data sharing
  Unordered access I/O operations
Enables more general data structures
  Irregular arrays, trees, etc.
Enables more general algorithms
  Far beyond shading
Optimized for Client Scenarios
Simpler setup syntax
  Balance between power and complexity
Real-time rendering of results
  Working to reduce cost of transition from compute mode to graphics mode
Better integration with media data types:
  Pixels, samples, text, vs. only floats
Need consistency between implementations
  Both across vendors and over time/generations
Compute Shader Features
Predictable Thread Invocation
  Regular arrays of threads: 1D, 2D, 3D
  Don’t have to “draw a quad” anymore
Shared registers between threads
  Reduces register pressure
  Can eliminate redundant compute and I/O
Scattered Writes
  Can read/write arbitrary data structures
  Enables new classes of algorithms
  Integrated with Direct3D resources
Target Applications
Image/post-processing:
  Image reduction, histogram, convolution, FFT
Effect physics
  Particles, smoke, water, cloth, etc.
A-Buffer/OIT
Ray-tracing, radiosity, etc.
Gameplay physics, AI
Integrated with Direct3D
Fully supports all Direct3D resources
Targets graphics/media data types
Evolution of DirectX HLSL
Graphics pipeline updated to emit general data structures
  which can then be manipulated by compute shader
  and then rendered by D3D again
Integration with Pipeline
[Diagram: graphics pipeline stages Input Assembler, Vertex Shader, Tessellation, Geometry Shader, Rasterizer, Pixel Shader, Output Merger, with the Compute Shader alongside; data labels: Scene Image, Data Structure, Final Image]
Render scene
Write out scene image
Use Compute for image post-processing
Output final image
Direct Thread Invocation
The ability to explicitly launch a known number of threads onto the GPU
  pD3D11Device->Dispatch( … numThreads… );
  Analogous to graphics DrawPrimitive() calls
Enables algorithms to execute the optimal number of threads
  Not how many vertices are read, or pixels written
Current thread id is available to shader code:
  sv_ThreadID.x
  Analogous to sv_PrimitiveID system value
Enables predictable memory access and register usage
Shared Register Class
New register type/variable storage class
  shared float sfFoo;
Multiple threads can access same memory
  Enables uses like user-controlled cache
Maximum of 32 KB of registers can be shared in DirectX 11
  8K floats or 2K float4s
vs. 64 KB of total temporary registers available
  16K floats or 4K float4s
Sub Blocking
Not all threads in the call can/should share registers with each other
Sharing threads are broken down into subsets (groups) of threads
Thread indices are made available in shader
  sv_ThreadID
  sv_ThreadGroupID
  sv_ThreadIDInGroup
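The relationship between the three thread indices can be sketched on the CPU. This is a minimal Python illustration, not Direct3D code: the function name `dispatch` and its parameters are hypothetical, and it assumes that the global thread id equals the group id times the group size plus the index within the group.

```python
from itertools import product

def dispatch(num_groups, group_size):
    """Yield (thread_id, group_id, id_in_group) for every thread of a 3D launch.

    Assumption for illustration: thread_id = group_id * group_size + id_in_group,
    applied per axis.
    """
    for group_id in product(*(range(n) for n in num_groups)):
        for id_in_group in product(*(range(n) for n in group_size)):
            thread_id = tuple(
                g * s + t for g, s, t in zip(group_id, group_size, id_in_group)
            )
            yield thread_id, group_id, id_in_group

# A 2x2x1 grid of groups, each holding 4x4x1 threads: 64 threads in total.
threads = list(dispatch(num_groups=(2, 2, 1), group_size=(4, 4, 1)))
```

Each thread gets a unique global id, while only the threads that share a group id would share registers.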
Atomic Intrinsics
Enable parallel operations on individual 32-bit memory locations without requiring full synchronization
  Either video memory or shared registers
Can be used to implement higher-level synchronization constructs
  Semaphores, etc.
Not intended for heavy lifting
Support an immediate return argument
  At some performance cost
Atomic Intrinsics
Enables basic operations:
  InterlockedAdd( rVar, val );
  InterlockedMin( rVar, val );
  InterlockedMax( rVar, val );
  InterlockedOr( rVar, val );
  InterlockedXOr( rVar, val );
  InterlockedCompareWrite( rVar, val );
  InterlockedCompareExchange( rVar, val );
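As a rough illustration of how a higher-level operation can be built from a compare-exchange primitive, the Python sketch below emulates one 32-bit location. The class `Cell32`, the three-argument `compare_exchange` that returns the old value, and the use of a lock to stand in for hardware atomicity are all assumptions made for this example; they are not the HLSL signatures listed above.

```python
import threading

class Cell32:
    """One 32-bit memory location with emulated atomic operations."""

    def __init__(self, value=0):
        self._value = value & 0xFFFFFFFF
        self._lock = threading.Lock()  # stands in for the hardware's atomicity

    def load(self):
        with self._lock:
            return self._value

    def compare_exchange(self, expected, new):
        """Store `new` only if the cell holds `expected`; always return the old value."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new & 0xFFFFFFFF
            return old

def interlocked_max(cell, val):
    """Atomic max built from a compare-exchange retry loop; returns the value seen."""
    while True:
        old = cell.load()
        if old >= val:
            return old
        if cell.compare_exchange(old, val) == old:
            return old  # our store won; otherwise another thread interfered, retry

cell = Cell32(0)
workers = [threading.Thread(target=interlocked_max, args=(cell, v)) for v in range(100)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The retry loop is where the performance cost comes from when many threads target the same location.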
Unordered Memory Accesses
HLSL ‘resource variables’
  Declared in the language
DXGI resources
Enables out-of-bounds memory checking
  Returns 0 on reads
  Writes are no-ops
Improves security, reliability of shipped code
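The out-of-bounds rule above (reads return 0, writes are dropped) can be mimicked in a few lines. This Python class is only a sketch of the behavior; the name `CheckedBuffer` and its methods are hypothetical.

```python
class CheckedBuffer:
    """Fixed-size buffer that follows the rule above for out-of-range accesses."""

    def __init__(self, size):
        self._data = [0] * size

    def load(self, index):
        if 0 <= index < len(self._data):
            return self._data[index]
        return 0  # out-of-bounds read returns 0

    def store(self, index, value):
        if 0 <= index < len(self._data):
            self._data[index] = value
        # out-of-bounds write is silently dropped (a no-op)

buf = CheckedBuffer(4)
buf.store(2, 7)    # in range: stored
buf.store(10, 99)  # out of range: ignored
buf.store(-1, 5)   # out of range: ignored
```

A stray index therefore cannot corrupt neighbouring memory, which is the security and reliability point of the slide.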
Unordered I/O
For fastest performance when ordering of records need not be preserved
Both reads and writes:
  UnorderedLoad( ResourceVar, val);
  UnorderedStore( ResourceVar, val);
Requires buffer allocated beforehand
Integration with Direct3D
Pixel shaders can also perform scattered writes
Enables rendering output to data structures more complex than a 2D array
  Histogram, linked list, irregular array, tree, etc.
Don’t Forget
Texture sampling still works:
  Object.Load( Loc, Offset, Samples );
  Object.Gather( Sampler, Loc );
  Object.Sample( Sampler, Loc );
  Object.SampleLevel( Sampler, Loc, LoD );
No automatic trilinear LoD calculation
Other graphics features are not present:
  Antialiasing, depth culling, alpha blending, triangle rasterization
Examples
Image Reduction
Image Histogram
FFT
Image Post-Processing
Significant fraction of frame time
  10–20% for most games
  50–70% for deferred shading-based engines
Savings here mean more time for 3D
Image Reduction
Find the average intensity of an image
  E.g. for HDR exposure adjustment
  Optimizes scene for viewing on SDR monitor
Algorithm breakdown:
  Input: 1 million pixels
  Compute: 1 MAD per pixel read
  Output: 1 value
Should this run at texture sample rate?
  It does not, because of write contention
[Diagram: million-to-1 reduction, Input to GPU to Output]
Reduction Compute Code
Buffer<uint> Values;
OutputBuffer<uint> Result;

ImageAverage()
{
    groupshared uint Total;  // Total so far (assumed to start at zero)
    groupshared uint Count;  // Count added (assumed to start at zero)

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    uint value = fLuminance*65536;  // luminance scaled to fixed point

    InterlockedAdd( Count, 1 );
    InterlockedAdd( Total, value );

Reduction Compute Code 2
    // Allow all threads in group to complete
    SynchronizeThreadGroup();

    // Compute the average and store it in our output buffer
    if ( sv_ThreadID.x == 0 )
    {
        float fAverage = Total/Count;           // compute avg (still scaled by 65536)
        UnorderedStore( Result[0], fAverage );  // write it out
    }
}
Fast Reduction Compute Code
Buffer<uint> Values;
OutputBuffer<uint> Result;

ImageAverage()
{
    groupshared uint Total[32];  // array of 32 totals
    groupshared uint Count[32];  // array of 32 counts

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    uint value = fLuminance*65536;
    uint idx = (sv_ThreadID.x + sv_ThreadID.y + sv_ThreadID.z) & 31;  // slot 0..31

    InterlockedAdd( Total[idx], value );  // threads mapping to the same slot still need atomic adds
    InterlockedAdd( Count[idx], 1 );

Fast Reduction Compute Code 2
    // Allow all threads in group to complete
    SynchronizeThreadGroup();

    // Compute the average and store it in our output buffer
    if ( sv_ThreadIDInGroup.x == 0 )
    {
        uint TheTotal = 0;
        uint TheCount = 0;
        for ( uint i = 0; i < 32; i++ )
        {
            TheTotal += Total[i];
            TheCount += Count[i];
        }
        float fAverage = TheTotal/TheCount;                    // compute avg
        UnorderedStore( Result[sv_ThreadGroupID], fAverage );  // write
    }
}
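To make the slot-spreading idea concrete, here is a small CPU-side Python sketch of one thread group. It is an illustration under stated assumptions rather than the shader itself: the function name `group_average`, the 8x8x1 group size, and the sequential loop standing in for parallel threads are all hypothetical. It also shows why the slot mask must be 31 to select from 32 slots.

```python
def group_average(luminances, group_dims=(8, 8, 1), slots=32):
    """Average luminance of one thread group using `slots` partial accumulators."""
    totals = [0] * slots
    counts = [0] * slots
    gx, gy, gz = group_dims
    pixels = iter(luminances)
    for z in range(gz):
        for y in range(gy):
            for x in range(gx):
                value = int(next(pixels) * 65536)  # fixed-point scale used on the slide
                idx = (x + y + z) & (slots - 1)    # slots must be a power of two
                totals[idx] += value
                counts[idx] += 1
    total, count = sum(totals), sum(counts)
    return total / count / 65536.0                 # undo the fixed-point scale

lum = [i / 64 for i in range(64)]                  # one value per thread of an 8x8x1 group
average = group_average(lum)

# Masking with 32 reaches only two indices (one of them out of range for a
# 32-entry array); masking with 31 reaches all 32 slots.
mask_32 = {t & 32 for t in range(64)}
mask_31 = {t & 31 for t in range(64)}
```

Spreading the adds over 32 slots reduces how many threads contend for any single accumulator, at the price of a short final loop.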
Reduction Performance
Pyramid approaches work today
  Some choice in reduction level per pass
  Tradeoff is contention for destination
1M pixels takes ~0.4 ms in Direct3D
  Pass-count-limited at small end of pyramid
Ideally should run at texture read rate
  < 0.1 ms in theory, or 4–10x faster
Compute shader features should help
  Such as local read-write cache
  Prototypes show ~2x speed boost so far
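The pass-count tradeoff of a pyramid reduction can be sketched as follows. This Python version is a CPU illustration only; `reduce_pass` and `pyramid_average` are hypothetical names, and a 1D list stands in for the image.

```python
def reduce_pass(values, factor):
    """One pyramid pass: each output sums `factor` neighbouring inputs."""
    return [sum(values[i:i + factor]) for i in range(0, len(values), factor)]

def pyramid_average(values, factor=4):
    """Repeat passes until one value is left; return (average, number of passes)."""
    n = len(values)
    passes = 0
    while len(values) > 1:
        values = reduce_pass(values, factor)
        passes += 1
    return values[0] / n, passes

pixels = list(range(1024))
avg_by_4, passes_by_4 = pyramid_average(pixels, factor=4)     # 1024 -> 256 -> ... -> 1
avg_by_32, passes_by_32 = pyramid_average(pixels, factor=32)  # 1024 -> 32 -> 1
```

A larger factor means fewer passes but more inputs funnelled into each destination, which is the contention tradeoff named above. At a factor of 4, one million pixels would need 10 passes.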
Histogram Generation
Similar to reduction problem
  Reduce to 64–256 destinations at data-dependent (unpredictable) addresses
Still suffers contention when multiple pixels increment same bin
  So replicate bins, e.g. 16x
  Increment bins using InterlockedAdd() math operations
Currently showing 2x speedup
Histogram Generation Code
Histogram()
{
    shared int Histograms[16][256];  // array of 16 histograms, 256 bins each

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    int iBin = fLuminance*255.0f;    // compute bin to increment

    int iHist = sv_ThreadIDInGroup.x & 15;         // use thread index to pick one of the 16 copies
    InterlockedAdd( Histograms[iHist][iBin], 1 );  // update bin

    SynchronizeThreadGroup();  // enable all threads in group to complete

Histogram Generation Code 2
    // Write register histograms out to memory:
    iBin = sv_ThreadIDInGroup.x;
    if ( iBin < 256 )
    {
        for ( iHist = 0; iHist < 16; iHist++ )
        {
            int2 destAddr = int2( iHist, iBin );
            OutputResource.add( destAddr, Histograms[iHist][iBin] );  // atomic
        }
    }
}
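The replicated-bin scheme can be checked on the CPU with a short Python sketch. The names and the group size of 64 are assumptions for illustration, and unlike the slide code, which writes the 16 copies to separate destination addresses, this version merges them into one histogram so the result can be compared with a direct count.

```python
def replicated_histogram(luminances, group_size=64, copies=16, bins=256):
    """Histogram built per thread group with `copies` private copies of each bin."""
    result = [0] * bins
    for start in range(0, len(luminances), group_size):
        group = luminances[start:start + group_size]
        local = [[0] * bins for _ in range(copies)]  # stands in for the shared registers
        for tid, lum in enumerate(group):
            i_bin = min(int(lum * 255.0), bins - 1)
            local[tid % copies][i_bin] += 1          # thread index selects the copy
        for b in range(bins):                        # merge the copies for this group
            result[b] += sum(local[c][b] for c in range(copies))
    return result

def direct_histogram(luminances, bins=256):
    result = [0] * bins
    for lum in luminances:
        result[min(int(lum * 255.0), bins - 1)] += 1
    return result

# Bin centres avoid rounding ambiguity at bin edges.
data = [((i % 200) + 0.5) / 255.0 for i in range(1000)]
hist = replicated_histogram(data)
```

With 16 copies, neighbouring threads that hit the same bin usually update different counters, which is what reduces contention.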
Histogram Performance
Recent work shows similar performance to reductions:
  Direct3D takes ~2.4 ms per megapixel
  On DirectX 10 hardware
2x speedup shown via prototypes
  On same hardware but using shared registers
8x theoretically possible
  if purely read limited
Image Convolution
Fundamental operation for blurs:
  HDR flares, depth-of-field, soft shadows, streaks
Need fairly large kernels for these
  100 wide is possible at high resolutions
  (sparse sampling produces artifacts)
7-Tap Separable Kernel
…
Convolution Performance
Massively variable depending on method
Direct3D does 5x5 kernel in 0.65 ms/Mpix
  Separable kernel
Prototype does slightly better
  Using shared register capability
Theoretical performance should be higher
  Some opportunity remains
Need to evaluate relevant kernel sizes
  Games need 100x100 effectively
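Why separable kernels matter for large blurs can be shown with a small Python sketch. It assumes a symmetric 7-tap binomial kernel and zero padding outside the image; both are choices made for this example, and all function names are hypothetical. Two 1D passes cost 7 + 7 taps per pixel, against 49 for the equivalent 7x7 kernel.

```python
def convolve_rows(image, kernel):
    """Apply a 1D kernel along each row; samples outside the image count as zero."""
    r = len(kernel) // 2
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for k, weight in enumerate(kernel):
                sx = x + k - r
                if 0 <= sx < w:
                    acc += weight * image[y][sx]
            out[y][x] = acc
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def separable(image, kernel):
    """Row pass, then column pass (done as a row pass on the transposed image)."""
    return transpose(convolve_rows(transpose(convolve_rows(image, kernel)), kernel))

def full_2d(image, kernel):
    """Reference: the 2D kernel formed by the outer product of `kernel` with itself."""
    r = len(kernel) // 2
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for j, wj in enumerate(kernel):
                for i, wi in enumerate(kernel):
                    sy, sx = y + j - r, x + i - r
                    if 0 <= sy < h and 0 <= sx < w:
                        acc += wj * wi * image[sy][sx]
            out[y][x] = acc
    return out

kernel = [1, 6, 15, 20, 15, 6, 1]  # 7-tap binomial weights (unnormalized)
image = [[(3 * x + 5 * y) % 11 for x in range(9)] for y in range(8)]
blurred = separable(image, kernel)
```

For a kernel 100 taps wide the gap is 200 taps per pixel against 10,000, which is why the non-separable case is the expensive one.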
Other Example Techniques
These are not used directly in game post-processing today, but are key foundations of other algorithms
  Scan (prefix sum), and
  FFT (fast Fourier transform)
Scan (Prefix-sum)
Each number in the output sequence is the sum of all previous numbers in the data sequence
  Used to compute write locations in irregular arrays
  Foundation of Summed Area Tables
Known GPU algorithms (Horn’s method)
  Pyramid scheme, so I/O bound
Sharing memory between threads results in ~2x speedup
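A log-step scan can be sketched in Python as below. This is a generic doubling-offset scan written for illustration, assumed to be similar in spirit to the multi-pass GPU scheme the slide refers to; it is not taken from the talk. Each pass reads only the previous pass's buffer, as ping-pong rendering would.

```python
def inclusive_scan(values):
    """Log-step scan: each pass adds the element `offset` positions to the left."""
    data = list(values)
    offset = 1
    passes = 0
    while offset < len(data):
        # Every "thread" i reads the previous pass's buffer only.
        data = [
            data[i] + (data[i - offset] if i >= offset else 0)
            for i in range(len(data))
        ]
        offset *= 2
        passes += 1
    return data, passes

def exclusive_scan(values):
    """Sum of all strictly earlier elements, as described on the slide."""
    inc, _ = inclusive_scan(values)
    return ([0] + inc[:-1]) if inc else []

scanned, num_passes = inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])

# Write locations for an irregular array: keep-flags become output offsets.
flags = [1, 0, 1, 1, 0]
offsets = exclusive_scan(flags)
```

Each kept element can then be written to `offsets[i]` without any thread needing to know what the others are doing.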
Scan (Prefix-sum)
We are looking at providing this in a library routine
  Along with FFT, etc.
Summed Area Table
2D equivalent of Scan
  Each element of 2D array has sum of all elements up/left of it
Enables box filter with performance independent of kernel size k (a constant number of lookups per pixel)
Fast generation of:
  Shadow blur with distance
  Depth-of-field
  Area light integrals, etc.
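The constant-cost box filter follows from four table lookups, as this Python sketch shows. The function names and the extra zero row and column are choices made for this example.

```python
def summed_area_table(image):
    """sat[y][x] holds the sum of image rows < y and columns < x."""
    h, w = len(image), len(image[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]  # extra zero row/column simplifies edges
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += image[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat

def box_sum(sat, x0, y0, x1, y1):
    """Sum of image[y][x] for x0 <= x < x1 and y0 <= y < y1, using four lookups."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]

image = [[x + 4 * y for x in range(4)] for y in range(4)]  # values 0..15
sat = summed_area_table(image)
whole = box_sum(sat, 0, 0, 4, 4)
centre = box_sum(sat, 1, 1, 3, 3)
```

Dividing a box sum by its area gives the box-filtered value, and the cost stays at four lookups whether the box is 2x2 or 100x100.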
Fast Fourier Transform
Converts image into frequency domain
Many operations are faster in frequency domain than in spatial domain
  e.g. convolution becomes a multiply
  Trivial detection of periodic noise
  Some application to motion estimation
Core algorithm similar to scan
  Similar I/O patterns,
  But more math-intensive inner loop
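The claim that convolution becomes a multiply can be checked in one dimension with a textbook radix-2 FFT. The Python sketch below is a plain recursive CPU version written for illustration, not the GPU implementation discussed in the talk; all function names are hypothetical.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def circular_convolve_fft(a, b):
    """Transform both inputs, multiply element-wise, transform back."""
    n = len(a)
    spectrum = [x * y for x, y in zip(fft(a), fft(b))]
    return [v.real / n for v in fft(spectrum, inverse=True)]

def circular_convolve_direct(a, b):
    """Reference O(n^2) circular convolution."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [1, 1, 0, 0, 0, 0, 0, 0]
via_fft = circular_convolve_fft(a, b)
direct = circular_convolve_direct(a, b)
```

The butterfly loop touches pairs of elements a growing stride apart, which is the scan-like I/O pattern, while the twiddle factors account for the heavier math.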
Direct3D FFT
Ping-pong between 2 R32G32F surfaces
  R is Real, G is Imaginary
Do LogN passes along rows then columns
Pixel shader only
Does not use blenders or iterators
  Uses vPos.xy as array indices [i][j]
Inner loop is math intensive
  20+ instructions including trig
  Indexing math dominates unless DX10
[Figure slides: FFT Before, FFT After]
FFT Performance
Complex 1024x1024 2D FFT:

  Implementation    Time     Throughput    Speedup
  Software          42 ms    6 GFlops
  Direct3D 9        15 ms    17 GFlops     3x
  Prototype DX11    6 ms     42 GFlops     6x
  Latest chips      3 ms     100 GFlops

Shared register space and random access writes enable ~2x speedups
Order-Independent Translucency
Eliminates draw-order issues and shimmer in moving scenes
Correct AA even of transparent objects
  Any object is transparent if antialiased
  e.g. alpha-tested leaves in forests
Current methods require large sample counts
  Alpha-To-Coverage
  Depth Peeling with Occlusion Queries
The A-Buffer Method
A-buffer is a more accurate method
  Accumulate object data in per-pixel list
  Then sort each pixel’s list into depth order
  Collapse to final color and display
Brings visual quality to movie levels without requiring 256-sample MSAA
Something to keep an eye on for OIT
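The three A-buffer steps (accumulate, sort, collapse) can be sketched in Python. The data layout, the names `emit` and `resolve_pixel`, the convention that larger depth means farther away, and the standard "over" blend are all assumptions made for this illustration.

```python
a_buffer = {}  # pixel (x, y) -> unsorted list of (depth, (r, g, b), alpha)

def emit(x, y, depth, rgb, alpha):
    """Accumulate a fragment in the pixel's list, in whatever order it arrives."""
    a_buffer.setdefault((x, y), []).append((depth, rgb, alpha))

def resolve_pixel(fragments, background=(0.0, 0.0, 0.0)):
    """Sort far-to-near, then composite each fragment over what lies behind it."""
    color = background
    for depth, rgb, alpha in sorted(fragments, key=lambda f: f[0], reverse=True):
        color = tuple(alpha * c + (1.0 - alpha) * dst for c, dst in zip(rgb, color))
    return color

# Same two surfaces submitted in opposite orders at two different pixels.
emit(0, 0, 2.0, (0.0, 0.0, 1.0), 1.0)  # far, opaque blue
emit(0, 0, 1.0, (1.0, 0.0, 0.0), 0.5)  # near, half-transparent red
emit(1, 0, 1.0, (1.0, 0.0, 0.0), 0.5)
emit(1, 0, 2.0, (0.0, 0.0, 1.0), 1.0)

pixel_a = resolve_pixel(a_buffer[(0, 0)])
pixel_b = resolve_pixel(a_buffer[(1, 0)])
```

Both pixels resolve to the same color, which is the order independence the method is after. The per-pixel lists are exactly the kind of irregular structure that needs scattered writes.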
A-Buffer Rendering
Currently prototyping using refrast
  DirectX reference rasterizer running on CPU
  Measuring memory access patterns/locality
  Evaluating feasibility of hardware
Not really feasible with current Direct3D
Compute shader features enable this
  Such as indexed writes, counters, etc.
  Rendering to structures beyond regular arrays
But performance is still largely unknown
Additional Algorithms
New rendering methods
  Ray-tracing, collision detection, etc.
  Rendering elements at different resolutions
Non-rendering algorithms
  IK, physics, AI, simulation, fluid simulation, radiosity
Need more general data structures
  Quad/octrees, irregular arrays, sparse arrays
Need linear algebra
Summary
Compute Shader is coming in Direct3D 11
  GPU performance levels for more applications
Scalable parallel processing model
  Code should scale for several generations
Increased generality will enable both:
  Improved performance on existing GPU tasks
  More CPU tasks can switch to data-parallel (DP) cores
Full cross-vendor support
  Enables broadest possible installed base
Questions?
www.xnagamefest.com
© 2008 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only.
Microsoft makes no warranties, express or implied, in this summary.