GPU-Efficient Recursive Filtering and Summed-Area Tables
D. Nehab (1), A. Maximo (1), R. S. Lima (2), H. Hoppe (3)
(1) IMPA, (2) Digitok, (3) Microsoft Research
Recursive filters
• Linear, shift-invariant filters
• But use feedback from earlier outputs
• Sequential dependency chain
[Figure: input, prologue, and output of a recursive filter]
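A minimal sketch of the feedback described on this slide, for a first-order causal filter. The sign convention `y[i] = x[i] - a * y[i-1]` and the names `causal_filter`, `a`, and `prologue` are assumptions made for illustration; the slides do not spell out the recurrence.

```python
def causal_filter(x, a, prologue=0.0):
    """First-order causal recursive filter (assumed convention):
    y[i] = x[i] - a * y[i-1], where y[-1] is the prologue."""
    y = []
    prev = prologue
    for xi in x:
        # Each output needs the previous output: a sequential dependency chain.
        prev = xi - a * prev
        y.append(prev)
    return y

x = [1.0, 2.0, 3.0, 4.0]
y = causal_filter(x, a=-0.5)
```

The loop cannot be parallelized along the signal as written, which is the dependency chain the slide refers to.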
Applications of recursive filtering
• B-Spline (or other) interpolation
[Figure: input, coefficients (recursive preprocessing step), interpolation (from coefficients)]
• Fast, wide, Gaussian-blur approximation
[Figure: input, blurred (recursive filters)]
• Summed-area tables
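As an illustration of the recursive preprocessing step for cubic B-Spline interpolation, the sketch below computes coefficients with one causal and one anticausal first-order pass. The pole sqrt(3) - 2 and the gain of 6 follow from inverting the cubic B-Spline sampling kernel (1, 4, 1)/6. The zero prologue and epilogue are a simplifying assumption (practical prefilters usually use mirror or similar boundary handling), and all function names are hypothetical.

```python
import math

def causal(x, a, p=0.0):
    # y[i] = x[i] - a * y[i-1], with y[-1] = p (assumed convention).
    y = []
    for xi in x:
        p = xi - a * p
        y.append(p)
    return y

def anticausal(x, a, e=0.0):
    # Same recurrence, run from the last sample back to the first.
    return causal(x[::-1], a, e)[::-1]

def cubic_bspline_coefficients(samples):
    z1 = math.sqrt(3.0) - 2.0   # pole of the cubic B-Spline prefilter
    a = -z1                     # feedback coefficient in the convention above
    gain = -6.0 * z1            # normalization of the two-pass inverse
    return [gain * v for v in anticausal(causal(samples, a), a)]

def resample_at_knots(c, i):
    # Evaluating the cubic B-Spline at an integer position uses weights (1, 4, 1)/6.
    return (c[i - 1] + 4.0 * c[i] + c[i + 1]) / 6.0

samples = [float((7 * i) % 5) for i in range(64)]
coeffs = cubic_bspline_coefficients(samples)
```

Away from the right boundary, where the zero-epilogue assumption has decayed, `resample_at_knots(coeffs, i)` reproduces `samples[i]`, which is what makes the coefficients usable for interpolation.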
Causality and order
• Recursive filters can be causal or anticausal
  • Causal goes forward, anticausal in reverse direction
• Filter order is simply the number r of feedbacks
[Figure: input, epilogue, and output of an anticausal filter]
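The sketch below generalizes to order r and shows the anticausal direction. It assumes the convention `y[i] = x[i] - sum_k a[k-1] * y[i-k]`; the layout of `prologue` (the r outputs before the signal, oldest first) and `epilogue` (the r outputs after the signal, nearest first) is a choice made here for illustration.

```python
def causal(x, a, prologue):
    """Order-r causal filter; r = len(a) is the number of feedbacks."""
    r = len(a)
    hist = list(prologue)            # last r outputs, oldest first
    y = []
    for xi in x:
        # a[0] multiplies the most recent output, a[r-1] the oldest one.
        yi = xi - sum(a[k] * hist[-1 - k] for k in range(r))
        y.append(yi)
        hist = hist[1:] + [yi]
    return y

def anticausal(x, a, epilogue):
    """Same recurrence, run in the reverse direction."""
    return causal(x[::-1], a, epilogue[::-1])[::-1]

first_order_fwd = causal([1.0, 2.0, 3.0, 4.0], [-0.5], [0.0])
first_order_bwd = anticausal([1.0, 2.0, 3.0, 4.0], [-0.5], [0.0])
second_order = causal([1.0, 0.0, 0.0, 0.0], [-1.0, 0.5], [0.0, 0.0])
```

Implementing the anticausal pass as a reversed causal pass keeps a single recurrence to reason about.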
Filter sequences and separability
• Often, sequences of recursive filters are needed
• Independent columns
  • Causal
  • Anticausal
• Independent rows
  • Causal
  • Anticausal
Algorithm RT
• The baseline algorithm
• Process columns in parallel, then rows in parallel
• Ruijters et al. 2010 “GPU prefilter […]”
[Figure: input, stages, output; column processing, then row processing]
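The column-then-row structure of the baseline can be sketched on the CPU as follows. This is not the GPU code of Ruijters et al.; it only shows the data flow, assuming a first-order filter with zero prologues and epilogues. On a GPU, each column (and later each row) would be handled by its own thread.

```python
def causal(x, a):
    # y[i] = x[i] - a * y[i-1], zero prologue (assumed convention).
    y, prev = [], 0.0
    for xi in x:
        prev = xi - a * prev
        y.append(prev)
    return y

def anticausal(x, a):
    return causal(x[::-1], a)[::-1]

def transpose(img):
    return [list(col) for col in zip(*img)]

def filter_rows(img, a):
    # Rows are independent of each other: one task per row.
    return [anticausal(causal(row, a), a) for row in img]

def baseline_rt(img, a):
    # Stage 1: columns (causal, then anticausal). Stage 2: rows.
    after_columns = transpose(filter_rows(transpose(img), a))
    return filter_rows(after_columns, a)

out = baseline_rt([[1.0, 0.0], [0.0, 0.0]], a=-0.5)
```

Parallelism here is limited to the number of columns or rows, and every intermediate result is written out and read back between passes.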
First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (64² to 4096² pixels). Curve: RT]
[Table columns: Alg., Step Complexity, Max. # of Threads, Used Bandwidth. Row: RT]
Optimization roadmap
• Modern GPUs have several hundred cores
• Latency-hiding requires many times more tasks
• Images are not large enough: must parallelize further
[Table columns: Alg., Step Complexity, Max. # of Threads, Used Bandwidth. Row: RT]
Increasing parallelism
• Similar to parallel prefix-sum algorithms
  • Sengupta et al. 2007 “Scan primitives for GPU computing”
  • Dotsenko et al. 2008 “Fast scan algorithms […]”
• Compute and store incomplete prologues
• Fix incomplete prologues
  • Somewhat more complicated than a recursive invocation
• Use prologues to compute and store causal results
[Figure: a row split into blocks, with incomplete prologues marked ✗]
Fixing incomplete prologues
[Figure: derivation of the fix, with steps labeled “superposition” and “linearity”]
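The three steps (compute incomplete prologues, fix them, finish the blocks) can be sketched for a first-order filter as follows. The fix relies on linearity and superposition: the output of a block equals its zero-prologue output plus the response to the prologue alone. The block size `b` and all names are illustrative assumptions, not the paper's notation.

```python
def causal(x, a, p=0.0):
    # y[i] = x[i] - a * y[i-1], with y[-1] = p (assumed convention).
    y = []
    for xi in x:
        p = xi - a * p
        y.append(p)
    return y

def blocked_causal(x, a, b):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    # Step 1 (parallel over blocks): filter with a zero prologue and
    # keep only the last output, the "incomplete prologue" of the next block.
    incomplete = [causal(blk, a)[-1] for blk in blocks]
    # Step 2 (short sequential pass): a prologue p alone contributes
    # (-a)**b * p to the last output of a full block of size b.
    carry = (-a) ** b
    prologues = [0.0]
    for m in range(len(blocks) - 1):
        prologues.append(incomplete[m] + carry * prologues[m])
    # Step 3 (parallel over blocks): redo each block with its true prologue.
    y = []
    for blk, p in zip(blocks, prologues):
        y.extend(causal(blk, a, p))
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
blocked = blocked_causal(x, a=-0.5, b=2)
sequential = causal(x, a=-0.5)
```

Only step 2 is sequential, and it touches one value per block rather than every sample, which is where the extra parallelism comes from.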
Algorithm 2
• Adds block parallelism
  • Sung et al. 1986 “Efficient […] recursive […]”, or
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
[Figure: input, stages, output; four “fix” steps]
First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (64² to 4096² pixels). Curves: RT, 2]
[Table columns: Alg., Step Complexity, Max. # of Threads, Memory Bandwidth. Rows: 2, RT]
Optimization roadmap
• Modern GPUs have several hundred cores
• Latency-hiding requires many times more tasks
• Images are not large enough: must parallelize further
• FLOP/IO ratio of recursive filters is too low
• Can use even more FLOPs but must reduce IO
• To do so, we introduce overlapping
[Table columns: Alg., Step Complexity, Max. # of Threads, Memory Bandwidth. Rows: 2, RT]
Causal-anticausal overlapping
• Start anticausal processing before causal is done
  • Saves reading and writing causal results!
• Compute and store incomplete prologues & epilogues
• Fix incomplete prologues & twice-incomplete epilogues
  • Twice-incomplete epilogues are trickier
• Use them to compute and store anticausal results
Fixing twice-incomplete epilogues
• Repeatedly apply linearity and superposition
• Tedious derivation, simple result
[Figure: twice-incomplete epilogue, corrected prologue, corrected epilogue]
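A first-order sketch of causal-anticausal overlapping, under the assumptions that both passes use the same coefficient `a`, that the signal length is a multiple of the block size `b`, and that the global prologue and epilogue are zero. The three constants `cp`, `ce`, and `cpe` are obtained by filtering unit boundary values rather than from closed forms. Names and structure are illustrative and not taken from the paper.

```python
def causal(x, a, p=0.0):
    y = []
    for xi in x:
        p = xi - a * p
        y.append(p)
    return y

def anticausal(x, a, e=0.0):
    return causal(x[::-1], a, e)[::-1]

def overlapped(x, a, b):
    assert len(x) % b == 0
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    n = len(blocks)
    zeros = [0.0] * b
    g = causal(zeros, a, 1.0)             # block response to a unit prologue
    cp = g[-1]                            # prologue -> next block's prologue
    ce = anticausal(zeros, a, 1.0)[0]     # epilogue -> previous block's epilogue
    cpe = anticausal(g, a)[0]             # prologue -> previous block's epilogue
    # Step 1 (parallel): one pass per block, storing only two border values.
    pbar, ebar = [], []
    for blk in blocks:
        ybar = causal(blk, a)                  # incomplete causal result
        pbar.append(ybar[-1])                  # incomplete prologue
        ebar.append(anticausal(ybar, a)[0])    # twice-incomplete epilogue
    # Step 2: fix prologues forward, then epilogues backward.
    P = [0.0] * n
    for m in range(1, n):
        P[m] = pbar[m - 1] + cp * P[m - 1]
    E = [0.0] * n
    for m in range(n - 2, -1, -1):
        E[m] = ebar[m + 1] + cpe * P[m + 1] + ce * E[m + 1]
    # Step 3 (parallel): causal and anticausal passes back to back per block,
    # so the causal result never has to be written out.
    out = []
    for blk, p, e in zip(blocks, P, E):
        out.extend(anticausal(causal(blk, a, p), a, e))
    return out

x = [float(i % 7) for i in range(12)]
fused = overlapped(x, a=-0.5, b=4)
reference = anticausal(causal(x, a=-0.5), a=-0.5)
```

The epilogue fix needs the corrected prologue of the same block (the `cpe * P[m + 1]` term), which is why the epilogues are "twice-incomplete" after step 1.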
Algorithm 4
• Adds causal-anticausal overlapping
  • Eliminates reading and writing causal results
  • Both in column and in row processing
  • Modest increase in computation
[Figure: input, stages, output; two “fix both” steps]
[Table columns: Alg., Step Complexity, Max. # of Threads, Memory Bandwidth. Rows: 4, 2, RT]
First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping
  • Eliminates 4hw of IO
  • Modest increase in computation
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (64² to 4096² pixels). Curves: RT, 2, 4]
Algorithm 5
• Adds row-column overlapping
  • Eliminates reading and writing column results
  • Modest increase in computation
[Figure: input, stages, output; a single “fix all!” step]
Start from input and global borders
Load blocks into shared memory
Compute & store incomplete borders
All borders in global memory
Fix incomplete borders
Fix twice-incomplete borders
Fix thrice-incomplete borders
Fix four-times-incomplete borders
Done fixing all borders
Load blocks into shared memory
Finish causal columns
Finish anticausal columns
Finish causal rows
Finish anticausal rows
Store results to global memory
Done!
Row-column overlapping rules
• Fixing thrice-incomplete row-prologues
• Fixing four-times-incomplete row-epilogues
First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping
  • Eliminates 4hw of IO
  • Modest increase in computation
• Alg. 5 adds row-column overlapping
  • Eliminates additional 2hw of IO
  • Modest increase in computation
[Table columns: Alg., Step Complexity, Max. # of Threads, Memory Bandwidth. Rows: 5, 4, 2, RT]
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (64² to 4096² pixels). Curves: RT, 2, 4, 5]
Second-order filter benchmarks
• Alg. 4_2 uses causal-anticausal overlapping
• Alg. 5_2 adds row-column overlapping
  • Added complexity outweighs IO reduction
  • Balance will change (hardware, compiler, implementation)
[Table columns: Alg., Step Complexity, Max. # of Threads, Memory Bandwidth. Rows: 4_2, 5_2]
[Plot: Quintic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (64² to 4096² pixels). Curves: 5_2, 4_2]
Gaussian blur results
• CUFFT is in frequency domain
  • complexity
• DIR is direct convolution
  • complexity
  • Podlozhnyuk 2007 whitepaper “Image convolution with CUDA”
• Overlapped recursive
  • 3rd order approximation
  • complexity
  • van Vliet et al. 1998 “Recursive Gaussian derivative filters”
  • Implemented as 5_1 fused with 4_2
• Recursive approximation is faster
  • Even for modest size images
  • Also modest standard-deviations
[Plot: Gaussian Blur (GeForce GTX 480). Throughput (GiP/s) vs. input size (64² to 4096² pixels). Curves: DIR 2.5, DIR 5, DIR 10, Overlapped Recursive, CUFFT]
Summed-area table benchmarks
• Harris et al 2008, GPU Gems 3
  • “Parallel prefix-scan […]”
  • Multi-scan + transpose + multi-scan
  • Implemented with CUDPP
• Hensley 2010, Gamefest
  • “High-quality depth of field”
  • Multi-wave method
  • Our improvements
    + specialized row and column kernels
    + save only incomplete borders
    + fuse row and column stages
• Overlapped SAT
  • Row-column overlapping
  • First-order filter, unit coefficient, no anticausal component
[Plot: Summed-area Table (GeForce GTX 480). Throughput (GiP/s) vs. input size (64² to 4096² pixels). Curves: Harris et al [2008], Hensley [2010], Improved Hensley [2010], Overlapped SAT]
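A summed-area table is the causal-only special case: a running sum down each column followed by a running sum along each row. The sketch below is a plain sequential version for illustration, not any of the benchmarked GPU methods; `box_sum` is a hypothetical helper showing the usual four-lookup query that a summed-area table enables.

```python
def prefix_sum(x):
    # First-order causal filter with unit feedback: y[i] = x[i] + y[i-1].
    y, acc = [], 0
    for xi in x:
        acc += xi
        y.append(acc)
    return y

def summed_area_table(img):
    cols = [prefix_sum(list(c)) for c in zip(*img)]   # causal pass on columns
    rows = [list(r) for r in zip(*cols)]              # back to row-major order
    return [prefix_sum(r) for r in rows]              # causal pass on rows

def box_sum(sat, top, left, bottom, right):
    """Sum of img[top..bottom][left..right], bounds inclusive."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sat = summed_area_table(img)
```

With no anticausal pass, only prologues need fixing when the computation is split into blocks.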
Future work
• Volumetric processing
  • Overlapping should generalize
  • Not enough shared memory (yet?)
• CPU implementation
  • Blocking should increase L1 cache effectiveness
  • Is doubling amount of computation worth it?
• Solving general narrow-banded linear systems
  • Overlapping back- and forward-substitution
Conclusions
• Recursive filters are useful in many applications
  • Cubic and quintic B-Spline interpolation
  • Gaussian-blur approximation
  • Summed-area table computation
• We introduced parallel algorithms for GPUs
  • Overlapping reduces IO requirements
  • Leads to faster algorithms
• Code is available from project page
  • Most is already there, rest is on the way
Questions?
• Alg. RT (0.5 GiP/s): baseline
• Alg. 2 (3 GiP/s): + block parallelism
• Alg. 4 (5 GiP/s): + causal-anticausal overlapping
• Alg. 5 (6 GiP/s): + row-column overlapping