Page 1: ECE 498AL Lecture 23: Kernel and Algorithm Patterns for CUDA

© John A. Stratton 2009, University of Illinois, Urbana-Champaign

Page 2: Objective

• Learn about algorithm patterns and principles
  – What are my threads? Where does my data go?

• This lecture goes over several patterns that have seen significant success with CUDA, and how they got there:
  – Input/Output Convolution
  – Bounded Input/Output Convolution
  – Stencil computation
  – Input/Input Convolution
  – Bounded Input/Input Convolution

Page 3: Two Questions

• For every application, a CUDA implementation begins with these two questions:

• What work does each thread do?

• In what memory space should each piece of data go?

Page 4: Assumptions

• Computation is independent (parallel) unless otherwise stated
  – i.e., reductions are the only real presence of serialization

• A work unit (as presented) is reasonably small for one CUDA thread.
  – We won't be discussing cases where the "tasks" are just too large to fit into a thread.

• Global memory is big enough to hold your entire dataset
  – There is another level of issues to address when it isn't

Page 5: A Few Commonalities: Reductions and Memory Patterns

Page 6: Reduction patterns in CUDA

• Local
  – One thread performs an entire, unique reduction
  • Example: Matrix Multiplication

• In-Block
  – Only threads within a block contribute to a reduction
  • Only slightly less efficient than local reduction, especially on new hardware (see the sketch after this list)

• Global
  – Every thread contributes to the reduction: two subtypes
  • Blocked: threads within a block contribute to the same reduction, allowing some component of block reduction
    – The more reduction you can do within a block, the better
  • Scattered: threads in a block do not contribute to the same reduction
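As a point of reference for the in-block case, here is a minimal sketch of a shared-memory tree reduction producing one partial sum per block; the kernel name, BLOCK_SIZE, and blockSums are illustrative, not from the lecture:

```
#define BLOCK_SIZE 256

// Each block sums BLOCK_SIZE elements of `in` and writes one partial
// sum to `blockSums`; a second pass (or the host) combines the partials.
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float partial[BLOCK_SIZE];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    // Load one element per thread (0 if past the end of the array).
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];
}

// Launch sketch:
// blockSum<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_sums, n);
```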

Page 7: Mapping data into CUDA's memories

• Output must finally end up in global memory
  – No other way to communicate results to the rest of the world
  – Intermediate outputs can (and should) be stored in registers or shared memory

• Globally-shared input goes in constant memory
  – Run several kernels to process chunks at a time

• Input shared only by adjacent threads should be tiled into shared memory (see the sketch after this list)
  – Matrix Multiplication tiles
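For the "tiled into shared memory" point, a minimal sketch in the style of tiled matrix multiplication, assuming square n x n row-major matrices with n a multiple of TILE (all names are illustrative):

```
#define TILE 16

// Each block stages TILE x TILE sub-matrices of A and B in shared
// memory, so adjacent threads reuse each loaded element TILE times.
__global__ void matMulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Cooperative load: one element of each tile per thread.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```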

Page 8: Mapping data continued...

• Input not shared by adjacent threads should just be loaded from global memory
  – There are cases where shared memory is still useful
  • E.g., coalescing data structure loads from global memory

• Texture memory should really only be used if its specialized indexing features are useful (see the sketch after this list)
  – Just accessing it "normally" is usually not worth it
  – Applications needing specific features might find it helpful
  • Linear interpolation (good for FP function lookup tables)
  • Array bounds clipping or wraparound
  • Subword type unpacking
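A sketch of a float lookup table read through texture memory with hardware linear interpolation and clamping, using the texture-reference API current when this lecture was written (since deprecated in modern CUDA); names like tableTex are illustrative:

```
#include <cuda_runtime.h>

// Legacy texture reference bound to a 1D float table.
texture<float, 1, cudaReadModeElementType> tableTex;

__global__ void lookup(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = tex1D(tableTex, x[i]);  // interpolated, clamped read
}

// Host setup: copy the table into a cudaArray and bind it
// (error checking omitted).
void setupTable(const float *hostTable, int len)
{
    cudaArray *arr;
    cudaChannelFormatDesc ch = cudaCreateChannelDesc<float>();
    cudaMallocArray(&arr, &ch, len);
    cudaMemcpyToArray(arr, 0, 0, hostTable, len * sizeof(float),
                      cudaMemcpyHostToDevice);
    tableTex.filterMode = cudaFilterModeLinear;     // linear interpolation
    tableTex.addressMode[0] = cudaAddressModeClamp; // bounds clipping
    cudaBindTextureToArray(tableTex, arr);
}
```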

Page 9: Input/Output Convolution: e.g., MRI, Direct Summation CP

Page 10: Generic Algorithm Description

• Every input element contributes to every output element

• Each output element is dependent on all input elements

• Input contributions are combined through some reduction operation

Assumptions:
• All input contributions and output elements are independent
• An interaction is reasonably small for one CUDA thread

[Figure: input array indexed 0 to N-1, output array indexed 0 to M-1, with every input contributing to every output]
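As a concrete baseline, a minimal sketch of the "one thread per output element" formulation, where each thread performs a local reduction over all N inputs; contribution() is a hypothetical stand-in for the app-specific interaction:

```
// Hypothetical per-pair interaction; stands in for the app-specific math.
__device__ float contribution(float x, int j)
{
    return x / (float)(j + 1);
}

// Thread j reduces the contributions of all n inputs into output j.
__global__ void ioConvolve(const float *in, float *out, int n, int m)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < m) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i)
            acc += contribution(in[i], j);  // local reduction
        out[j] = acc;
    }
}
```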

Page 11: What could each thread be assigned?

Input elements
• O(N) threads
• Each contributes to M global reductions (scattered reduce)

Output elements
• ~M threads
• Each reads N input elements, local reduction only

Input/Output pairs
• O(N*M) threads
• Each thread contributes to one of M global reductions

Pros and cons to each possibility!

Page 12: Thread Assignment Tradeoffs

• Input elements / global reductions
  – Usually ineffective, as it requires M global reductions

• Input/Output pairs
  – Effective when you need the extra parallelism
  – You can group threads into blocks based on input or output elements
  • Basically a choice between blocked and scattered reductions

• Output elements / local reductions
  – Very effective if a reasonable amount of input can fit into the constant memory

Page 13: What memory space does the data use?

• Output has to be in global memory

• If input is globally shared (threads assigned output elements), constant memory is best.
  – Again, it's likely that the whole input won't fit in constant memory at once.
  – Break up your implementation into "mini-kernels", each reading a chunk of the input at a time from constant memory (see the sketch below).

• Even if constant memory doesn't make sense (threads assigned input/output pairs), shared memory can probably help some.
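A sketch of the mini-kernel idea under some assumptions: a CHUNK-sized __constant__ buffer, a zero-initialized output array, and a placeholder interaction; each launch processes the one chunk of input currently resident in constant memory (all names illustrative):

```
#define CHUNK 4096

__constant__ float inChunk[CHUNK];

// One thread per output element; accumulates this chunk's contributions
// on top of the running totals from earlier launches.
__global__ void accumulateChunk(float *out, int m, int chunkLen)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < m) {
        float acc = out[j];
        for (int i = 0; i < chunkLen; ++i)
            acc += inChunk[i] * 0.5f;   // placeholder interaction
        out[j] = acc;
    }
}

// Host loop: stage one chunk into constant memory, then launch.
void run(const float *hostIn, float *devOut, int n, int m)
{
    for (int base = 0; base < n; base += CHUNK) {
        int len = (n - base < CHUNK) ? (n - base) : CHUNK;
        cudaMemcpyToSymbol(inChunk, hostIn + base, len * sizeof(float));
        accumulateChunk<<<(m + 255) / 256, 256>>>(devOut, m, len);
    }
}
```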

Page 14: Bounded Input/Output Convolution: e.g., Cutoff Summation (and Matrix Multiplication)

Page 15: Generic problem description

• Input elements will contribute to a bounded range of output elements
  – Conversely, each output element is affected by a limited range/set of input elements

• Usually arises from cutoff distances in spatial representations
  – O(# output elements) instead of O(# output elements * # input elements)

[Figure: only inputs within a cutoff radius contribute to an output element]

Page 16: Revisiting thread-assignment tradeoffs

• Input elements / "global" reductions
  – The reductions that an input element affects are restricted
  – Might be reasonable if the "global" reduction can become mostly an in-block reduction

• Input/Output pairs
  – Still most effective when you need the extra parallelism
  – Try as much as possible to keep conceptually "global" reductions as in-block reductions in reality
  • If it works, this strategy will likely be very competitive

• Output elements / local reductions
  – Still most effective if feasible

Page 17: Data Mapping?

• Input isn't globally shared anymore
  – Constant memory doesn't make sense, because most threads won't need a particular input element

• Read "tiles" of input data relevant to a tile of output data into shared memory
  – Not all threads in the grid will need the data, but adjacent threads will with high probability

Page 18: Stencil Computation: Fluid Dynamics, Image Convolution

Page 19: Generic Algorithm Description

• Class of in-place applications where the next "step" for an element depends on a predetermined set of other (usually adjacent) elements.

• Dataset should either be double-buffered or red-black colored to prevent dependencies (a double-buffered sketch follows below).

[Figure: grid of elements at T=0 advancing to T=1]
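A minimal double-buffered sketch, assuming a 2D grid and a 5-point averaging stencil (the stencil and names are illustrative); the host swaps the in/out pointers between timesteps so reads never see the current step's writes:

```
__global__ void stencilStep(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Interior points only; boundary values are left untouched here.
    if (x > 0 && x < w - 1 && y > 0 && y < h - 1)
        out[y * w + x] = 0.2f * (in[y * w + x]
                               + in[y * w + x - 1] + in[y * w + x + 1]
                               + in[(y - 1) * w + x] + in[(y + 1) * w + x]);
}

// Host loop per timestep:
//   stencilStep<<<grid, block>>>(d_a, d_b, w, h);
//   then swap d_a and d_b for the next step.
```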

Page 20: Basic Questions again

• What does each thread do?
  • One input component for the element?
    – Thread blocks compute one or a small number of elements
  • The whole computation for one element?
    – Thread blocks compute tiles of elements

• Again, tradeoffs: mostly determined by how much work goes into an element for the next timestep
  – Directly related to the size of the stencil

Page 21: What memory space?

• Depends on the app.
  – If adjacent elements share input values, this is the ideal case for shared memory tiling
  – If entire thread blocks compute single elements, tiling doesn't help
    • No intra-block sharing

[Figure: overlapping input tiles at T=0 mapping to non-overlapping output tiles at T=1]

Page 22: What if basic tiling isn't good enough?

• Sometimes, the bandwidth of loading and storing tiles far outweighs the needed computation

• Multi-step kernels
  • Larger input tiles, multiple steps within a block
    – Means there's some redundant computation for the edges of the T=1 intermediate tiles

[Figure: input tiles at T=0, overlapping intermediate tiles at T=1, output at T=2]

Page 23: Input/Input Convolution: e.g., N-body Interaction

Page 24: Generic Algorithm Description

• All input elements interact
  – Pairwise most common, sometimes even higher-degree

• Interactions usually contribute to reductions
  – Either a per-element reduction or a global reduction, depending on the app

• Examples: gravitational or electrical points in space, two-point autocorrelation function in astronomy

• Threads and data storage?

Page 25: What does each thread do?

• If the reduction is per-element, this looks a lot like the input/output convolution case
  – Input pair is the new "element"
  – Apply tradeoffs from that case

• If the reduction is global, it looks a lot like a simple reduction.
  – Again: input pair is the new "element"
  – Try to load and reduce many "elements" in-block

Page 26: What memory space does the data use?

• Per-element reductions are handled the same way as input/output convolution
  – Tile one copy of the input through constant memory, read in the other from global memory

• Global reductions could be handled by loading input tile pairs into shared memory, and doing as much in-block reduction as possible (see the sketch after this slide)
  – N^2 interactions per block for 2 N-element input tiles
  • Should prevent memory bandwidth from being a bottleneck if N is reasonable.

[Figure: pairs of input tiles feeding output tiles or partial reductions]
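A sketch of the tile-pair idea for a global reduction, assuming float4 bodies (xyz position, w = mass), blockDim.x == TILE, n a multiple of TILE, and a hypothetical pairTerm() interaction; each block sweeps every input tile past its own bodies in shared memory:

```
#define TILE 256

// Hypothetical pairwise term, softened so the self-pair is harmless.
__device__ float pairTerm(float4 a, float4 b)
{
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return a.w * b.w / sqrtf(dx * dx + dy * dy + dz * dz + 1e-6f);
}

__global__ void tilePairs(const float4 *bodies, float *partial, int n)
{
    __shared__ float4 tile[TILE];

    float4 mine = bodies[blockIdx.x * blockDim.x + threadIdx.x];
    float acc = 0.0f;

    // Sweep every TILE-sized tile of the input past this block's bodies:
    // TILE^2 interactions per tile load.
    for (int base = 0; base < n; base += TILE) {
        tile[threadIdx.x] = bodies[base + threadIdx.x];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += pairTerm(mine, tile[k]);
        __syncthreads();
    }

    // One partial value per thread; reduce these in a later pass.
    partial[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```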

Page 27: Bounded Input/Input Convolution: e.g., N-body approximations (NAMD)

Page 28: Generic Algorithm

• Input/Input convolution with some cutoff
  – O(N) instead of O(N^2)

• Similar to the bounded input/output convolution approach
  – Use some kind of spatial binning to reduce algorithmic complexity

Page 29: Modifications from Unbounded Case

• An input "tile" is naturally defined by the binning process (a binning sketch follows below)
  – A tile is all input elements in one (or several) bins

• Each tile has a limited number of other tiles to interact with
  – For per-element reductions, this looks a lot like the bounded input/output convolution case
    • Load a tile, and interact with every other relevant tile within one block
  – For global reductions, this is essentially unchanged from the unbounded case, except that fewer input pairs are considered for reduction contributions
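A small sketch of the kind of spatial binning assumed here: positions on a uniform grid of cells with edge length cellSize map to bin indices, and only nearby bins' tiles are interacted (names and layout are illustrative):

```
// Map a 3D position to a bin index on an nx * ny * nz uniform grid.
__host__ __device__ int binIndex(float3 p, float3 origin, float cellSize,
                                 int nx, int ny)
{
    int bx = (int)((p.x - origin.x) / cellSize);
    int by = (int)((p.y - origin.y) / cellSize);
    int bz = (int)((p.z - origin.z) / cellSize);
    return (bz * ny + by) * nx + bx;
}

// With cellSize >= the cutoff distance, all partners of an element lie
// in its own bin or the 26 neighboring bins, so each tile interacts
// with only a bounded number of other tiles.
```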

