GPU-Efficient Recursive Filtering and Summed-Area Tables


D. Nehab¹, A. Maximo¹, R. S. Lima², H. Hoppe³

¹IMPA, ²Digitok, ³Microsoft Research

Recursive filters

• Linear, shift-invariant filters
• But use feedback from earlier outputs
• Sequential dependency chain

[Figure: input, prologue, and output sequences]
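As a minimal sketch of the feedback idea, the Python function below runs a first-order causal recursive filter. The sign convention `y[i] = x[i] - d * y[i-1]` and the names `causal_first_order`, `d`, and `prologue` are assumptions made for illustration; the slides do not spell out a formula. The loop makes the sequential dependency chain visible: each output needs the previous one.

```python
def causal_first_order(x, d, prologue=0.0):
    """First-order causal recursive filter (illustrative convention).

    y[i] = x[i] - d * y[i-1], where y[-1] is the prologue.
    """
    y = []
    prev = prologue              # feedback value y[i-1]
    for xi in x:
        prev = xi - d * prev     # each output depends on the previous output
        y.append(prev)
    return y


# Impulse response with feedback weight 0.5 (d = -0.5): a decaying tail.
print(causal_first_order([1.0, 0.0, 0.0, 0.0], d=-0.5))
```

With `d = -1` the same loop computes a running sum, which is the case the summed-area table slides rely on.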

Applications of recursive filtering

• B-Spline (or other) interpolation
• Fast, wide, Gaussian-blur approximation
• Summed-area tables

[Figure: input → coefficients (recursive preprocessing step) → interpolation (from coefficients)]

[Figure: input → blurred (recursive filters)]
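To illustrate the recursive preprocessing step for B-Spline interpolation, here is a hedged Python sketch of the standard cubic B-Spline prefilter: a causal and an anticausal first-order pass with pole `sqrt(3) - 2`, followed by a gain. The zero prologue and epilogue are a simplifying assumption of this sketch (real implementations usually handle image boundaries more carefully), and all function names are hypothetical.

```python
import math

Z1 = math.sqrt(3.0) - 2.0    # pole of the cubic B-Spline prefilter


def causal(x, d, p=0.0):
    """y[i] = x[i] - d * y[i-1], with y[-1] = p."""
    y, prev = [], p
    for xi in x:
        prev = xi - d * prev
        y.append(prev)
    return y


def anticausal(y, d, e=0.0):
    """Same recurrence, run in the reverse direction."""
    return causal(y[::-1], d, e)[::-1]


def cubic_bspline_coefficients(samples):
    """Recursive preprocessing: samples -> B-Spline coefficients."""
    d = -Z1                                  # feedback weight in this convention
    z = anticausal(causal(samples, d), d)
    gain = -6.0 * Z1
    return [gain * v for v in z]


def evaluate_at_integers(c):
    """Cubic B-Spline evaluated at the sample positions: weights 1/6, 4/6, 1/6."""
    def get(i):
        return c[i] if 0 <= i < len(c) else 0.0
    return [(get(i - 1) + 4.0 * get(i) + get(i + 1)) / 6.0 for i in range(len(c))]
```

Away from the boundaries, evaluating the spline built from these coefficients at the sample positions reproduces the input samples, which is what makes the result an interpolation rather than a smoothing.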

Causality and order

• Recursive filters can be causal or anticausal
• Causal goes forward, anticausal in reverse direction
• Filter order is simply the number r of feedbacks

[Figure: input, epilogue, and output sequences]
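The following Python sketch generalizes to order r and shows both directions. It assumes the convention `y[i] = x[i] - sum_k d[k] * y[i-1-k]`; the names `causal`, `anticausal`, `prologue`, and `epilogue` are illustrative. The anticausal filter is written as a causal filter applied to the reversed sequence, which is one simple way to express "in reverse direction".

```python
def causal(x, d, prologue):
    """Order-r causal filter, r = len(d).

    prologue = [y[-r], ..., y[-1]] holds the outputs that precede x.
    """
    r = len(d)
    hist = list(prologue)
    y = []
    for xi in x:
        yi = xi - sum(d[k] * hist[-1 - k] for k in range(r))
        y.append(yi)
        hist.append(yi)
    return y


def anticausal(y, d, epilogue):
    """Order-r anticausal filter.

    epilogue = [z[n], ..., z[n+r-1]] holds the outputs that follow y.
    """
    return causal(y[::-1], d, epilogue[::-1])[::-1]


# Second-order example (r = 2): feedback weights of 1 give Fibonacci numbers.
print(causal([1.0, 0.0, 0.0, 0.0, 0.0], [-1.0, -1.0], [0.0, 0.0]))
```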

Filter sequences and separability

• Often, sequences of recursive filters are needed
• Independent columns: causal and anticausal
• Independent rows: causal and anticausal

Algorithm RT

• The baseline algorithm
• Process columns in parallel, then rows in parallel
• Ruijters et al. 2010 “GPU prefilter […]”

[Figure: stages from input to output: column processing, row processing]
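A sequential Python sketch of this baseline structure follows. It is not the Ruijters et al. code: the first-order filter, zero boundary conditions, and all names are assumptions for illustration. Each list comprehension stands in for a set of GPU threads, one per column and then one per row.

```python
def causal(x, d, p=0.0):
    """y[i] = x[i] - d * y[i-1], with y[-1] = p."""
    y, prev = [], p
    for xi in x:
        prev = xi - d * prev
        y.append(prev)
    return y


def anticausal(y, d, e=0.0):
    return causal(y[::-1], d, e)[::-1]


def filter_line(line, d):
    """Causal pass followed by anticausal pass along one column or row."""
    return anticausal(causal(line, d), d)


def transpose(img):
    return [list(col) for col in zip(*img)]


def separable_filter(img, d):
    # Column processing: columns are independent of one another.
    cols = [filter_line(col, d) for col in transpose(img)]
    # Row processing: rows are independent of one another.
    return [filter_line(row, d) for row in transpose(cols)]
```

Within one column or row the work is still sequential, so an h-by-w image exposes at most h or w independent tasks at a time, which is the limitation the later algorithms address.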

First-order filter benchmarks

• Alg. RT is the baseline implementation
• Ruijters et al. 2010 “GPU prefilter […]”

[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) versus input size (pixels), 64² to 4096². Curve: RT]

[Table columns: Alg., Step Complexity, Max. # of Threads, Used Bandwidth. Row: RT]

Optimization roadmap

• Modern GPUs have several hundred cores
• Latency-hiding requires many times more tasks
• Images are not large enough: must parallelize further

[Table columns: Max. # of Threads, Used Bandwidth. Row: RT]

Increasing parallelism

• Similar to parallel prefix-sum algorithms
• Sengupta et al. 2007 “Scan primitives for GPU computing”
• Dotsenko et al. 2008 “Fast scan algorithms […]”
• Compute and store incomplete prologues
• Fix incomplete prologues
• Somewhat more complicated than a recursive invocation
• Use prologues to compute and store causal results

Fixing incomplete prologues

[Figure: fixing an incomplete prologue using superposition and linearity]
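The three steps (compute incomplete prologues, fix them, then compute causal results) can be sketched in Python for a first-order filter. This is an illustration under assumed conventions (`y[i] = x[i] - d * y[i-1]`, hypothetical function names), not the authors' implementation. The fix uses superposition: the response to an input with prologue p equals the response with zero prologue plus the response of a zero input to p, and by linearity the latter contributes `(-d) ** b * p` at the end of a block of length b.

```python
def causal(x, d, p=0.0):
    """y[i] = x[i] - d * y[i-1], with y[-1] = p."""
    y, prev = [], p
    for xi in x:
        prev = xi - d * prev
        y.append(prev)
    return y


def blocked_causal(x, d, b):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]

    # Stage 1 (parallel over blocks): filter each block with a zero prologue
    # and keep only its last output, the incomplete prologue.
    incomplete = [causal(blk, d, 0.0)[-1] for blk in blocks]

    # Stage 2 (short sequential pass over blocks): fix the prologues.
    prologues = [0.0]                        # prologue entering block 0
    for blk, tail in zip(blocks[:-1], incomplete[:-1]):
        prologues.append(tail + (-d) ** len(blk) * prologues[-1])

    # Stage 3 (parallel over blocks): filter again with the fixed prologues.
    out = []
    for blk, p in zip(blocks, prologues):
        out.extend(causal(blk, d, p))
    return out
```

The blocked result matches a single sequential pass, at the cost of filtering each block twice; only one value per block travels between stages.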

Algorithm 2

• Adds block parallelism
• Sung et al. 1986 “Efficient […] recursive […]”, or
• Blelloch 1990 “Prefix sums […]”
• + tricks from GPU parallel scan algorithms

[Figure: stages from input to output, with four fix steps]

First-order filter benchmarks

• Alg. RT is the baseline implementation
• Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
• Sung et al. 1986 “Efficient […] recursive […]”
• Blelloch 1990 “Prefix sums […]”
• + tricks from GPU parallel scan algorithms

[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) versus input size (pixels), 64² to 4096². Curves: RT, 2]



[Table columns: Max. # of Threads, Memory Bandwidth. Rows: 2, RT]

Optimization roadmap

• Modern GPUs have several hundred cores
• Latency-hiding requires many times more tasks
• Images are not large enough: must parallelize further
• FLOP/IO ratio of recursive filters is too low
• Can use even more FLOPs but must reduce IO
• To do so, we introduce overlapping

[Table columns: Max. # of Threads, Memory Bandwidth. Rows: 2, RT]

Causal-anticausal overlapping

• Start anticausal processing before causal is done
• Saves reading and writing causal results!
• Compute and store incomplete prologues & epilogues
• Fix incomplete prologues & twice-incomplete epilogues
• Twice-incomplete epilogues are trickier
• Use them to compute and store anticausal results

Fixing twice-incomplete epilogues

• Repeatedly apply linearity and superposition
• Tedious derivation, simple result

[Figure: a twice-incomplete epilogue is fixed using the corrected prologue and the corrected epilogue]
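A first-order Python sketch of causal-anticausal overlapping is given below. It is an illustration under assumptions, not the derivation from the talk: the convention is `y[i] = x[i] - d * y[i-1]`, the causal and anticausal passes share the same `d`, all blocks have equal length, and the three fix-up weights `pp`, `ee`, and `pe` (hypothetical names) are obtained by filtering a zero block with a unit prologue or epilogue. An epilogue computed from a block alone is twice-incomplete because it lacks both the contribution of the block's prologue and that of the following epilogue.

```python
def causal(x, d, p=0.0):
    y, prev = [], p
    for xi in x:
        prev = xi - d * prev
        y.append(prev)
    return y


def anticausal(y, d, e=0.0):
    return causal(y[::-1], d, e)[::-1]


def overlapped(x, d, b):
    assert len(x) % b == 0, "sketch assumes equal-size blocks"
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    m_count = len(blocks)
    zeros = [0.0] * b

    # Fix-up weights, from unit responses over one zero block.
    pp = causal(zeros, d, 1.0)[-1]                      # prologue -> next prologue
    ee = anticausal(zeros, d, 1.0)[0]                   # epilogue -> previous epilogue
    pe = anticausal(causal(zeros, d, 1.0), d, 0.0)[0]   # prologue -> epilogue, same block

    # Stage 1 (parallel): incomplete prologues, twice-incomplete epilogues.
    p_bar, e_bar = [], []
    for blk in blocks:
        y = causal(blk, d, 0.0)
        p_bar.append(y[-1])
        e_bar.append(anticausal(y, d, 0.0)[0])

    # Stage 2: fix prologues forward, then epilogues backward.
    p = [0.0] * (m_count + 1)            # p[m] enters block m
    for m in range(m_count):
        p[m + 1] = p_bar[m] + pp * p[m]
    e = [0.0] * (m_count + 1)            # e[m] is the first output of block m
    for m in range(m_count - 1, -1, -1):
        e[m] = e_bar[m] + pe * p[m] + ee * e[m + 1]

    # Stage 3 (parallel): causal then anticausal per block; the causal
    # results stay local and are never written out.
    out = []
    for m, blk in enumerate(blocks):
        out.extend(anticausal(causal(blk, d, p[m]), d, e[m + 1]))
    return out
```

Only one prologue and one epilogue value per block are exchanged between stages, which is where the IO saving in this sketch comes from.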

Algorithm 4

• Adds causal-anticausal overlapping
• Eliminates reading and writing causal results
• Both in column and in row processing
• Modest increase in computation

[Figure: stages from input to output, with two “fix both” steps]

[Table columns: Max. # of Threads, Memory Bandwidth. Rows: 4, 2, RT]


• Alg. 4 adds causal-anticausal overlapping
• Eliminates 4hw of IO
• Modest increase in computation

[Plot: Throughput (GiP/s) versus input size (pixels), 64² to 4096². Curves: RT, 2, 4]


Algorithm 5

• Adds row-column overlapping
• Eliminates reading and writing column results
• Modest increase in computation

[Figure: stages from input to output, with a single “fix all!” step]

• Start from input and global borders
• Load blocks into shared memory
• Compute & store incomplete borders
• All borders in global memory
• Fix incomplete borders
• Fix twice-incomplete borders
• Fix thrice-incomplete borders
• Fix four-times-incomplete borders
• Done fixing all borders
• Load blocks into shared memory
• Finish causal columns
• Finish anticausal columns
• Finish causal rows
• Finish anticausal rows
• Store results to global memory
• Done!

Row-column overlapping rules

• Fixing thrice-incomplete row-prologues
• Fixing four-times-incomplete row-epilogues


• Alg. 4 adds causal-anticausal overlapping
• Eliminates 4hw of IO
• Modest increase in computation
• Alg. 5 adds row-column overlapping
• Eliminates additional 2hw of IO
• Modest increase in computation

[Table columns: Max. # of Threads, Memory Bandwidth. Rows: 5, 4, 2, RT]

[Plot: Throughput (GiP/s) versus input size (pixels), 64² to 4096². Curves: RT, 2, 4, 5]


Second-order filter benchmarks

• Alg. 4₂ uses causal-anticausal overlapping
• Alg. 5₂ adds row-column overlapping
• Added complexity outweighs IO reduction
• Balance will change (hardware, compiler, implementation)

[Table columns: Max. # of Threads, Memory Bandwidth. Rows: 4₂, 5₂]

[Plot: Quintic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) versus input size (pixels), 64² to 4096². Curves: 5₂, 4₂]

Gaussian blur results

• CUFFT is in frequency domain
• complexity
• DIR is direct convolution
• complexity
• Podlozhnyuk 2007 whitepaper “Image convolution with CUDA”
• Overlapped recursive
• 3rd order approximation
• complexity
• van Vliet et al. 1998 “Recursive Gaussian derivative filters”
• Implemented as 5₁ fused with 4₂
• Recursive approximation is faster
• Even for modest size images
• Also modest standard-deviations

[Plot: Gaussian Blur (GeForce GTX 480). Throughput (GiP/s) versus input size (pixels), 64² to 4096². Curves: DIR 2.5, DIR 5, DIR 10, Overlapped Recursive, CUFFT]

Summed-area table benchmarks

• Harris et al. 2008, GPU Gems 3
• “Parallel prefix-scan […]”
• Multi-scan + transpose + multi-scan
• Implemented with CUDPP
• Hensley 2010, Gamefest
• “High-quality depth of field”
• Multi-wave method
• Our improvements
• + specialized row and column kernels
• + save only incomplete borders
• + fuse row and column stages
• Overlapped SAT
• Row-column overlapping
• First-order filter, unit coefficient, no anticausal component

[Plot: Summed-area Table (GeForce GTX 480). Throughput (GiP/s) versus input size (pixels), 64² to 4096². Curves: Harris et al. [2008], Hensley [2010], Improved Hensley [2010], Overlapped SAT]
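Seen as a filter, a summed-area table is a causal first-order pass with unit feedback along rows followed by the same pass along columns, with no anticausal pass. The Python sketch below shows this with running sums and adds the usual four-lookup box query; the function names are hypothetical and the code is sequential, so it illustrates the definition rather than any of the benchmarked GPU methods.

```python
from itertools import accumulate


def summed_area_table(img):
    """sat[i][j] = sum of img[0..i][0..j]."""
    # Causal pass along each row: y[j] = x[j] + y[j-1].
    rows = [list(accumulate(row)) for row in img]
    # Causal pass along each column of the row-filtered image.
    cols = [list(accumulate(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def box_sum(sat, top, left, bottom, right):
    """Sum of img over the inclusive rectangle, using at most four lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total


sat = summed_area_table([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(box_sum(sat, 1, 1, 2, 2))   # sum of 5, 6, 8, 9
```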

Future work

• Volumetric processing
• Overlapping should generalize
• Not enough shared memory (yet?)
• CPU implementation
• Blocking should increase L1 cache effectiveness
• Is doubling amount of computation worth it?
• Solving general narrow-banded linear systems
• Overlapping back- and forward-substitution

Conclusions

• Recursive filters are useful in many applications
• Cubic and quintic B-Spline interpolation
• Gaussian-blur approximation
• Summed-area table computation
• We introduced parallel algorithms for GPUs
• Overlapping reduces IO requirements
• Leads to faster algorithms
• Code is available from project page
• Most is already there, rest is on the way

Questions?

• Baseline: Alg. RT (0.5 GiP/s)
• + block parallelism: Alg. 2 (3 GiP/s)
• + causal-anticausal overlapping: Alg. 4 (5 GiP/s)
• + row-column overlapping: Alg. 5 (6 GiP/s)
