Page 1: GPGPU Programming

GPGPU Programming

Dominik Göddeke

Page 2: GPGPU Programming

Overview

• Choices in GPGPU programming

• Illustrated CPU vs. GPU step by step example

• GPU kernels in detail

Page 3: GPGPU Programming

Choices in GPU Programming

• Application: e.g. in C/C++, Java, Fortran, Perl

• Operating system: e.g. Windows, Unix, Linux, MacOS

• Window manager: e.g. GLUT, Qt, Win32, Motif

• Graphics API: e.g. OpenGL, DirectX

• Shader programs: e.g. in HLSL, GLSL, Cg

• Graphics hardware: e.g. Radeon (ATI), GeForce (NV)

• Metaprogramming language (e.g. BrookGPU, Sh) OR a self-written libGPU that hides the graphics details

Page 4: GPGPU Programming

Bottom lines

• This is not as difficult as it seems
  – Similar choices must be made in all software projects
  – Some options are mutually exclusive
  – Some can be used without in-depth knowledge
  – There is no direct access to the hardware; the driver does all the tedious thread management anyway

• Advantages and disadvantages
  – Steeper learning curve vs. higher flexibility
  – Focus on the algorithm, not on (unnecessary) graphics
  – Portable code vs. platform- and hardware-specific code

Page 5: GPGPU Programming

Libraries and Abstractions

• Some coding is required
  – There is no library available that you just link against
  – It is tremendously hard to massively parallelize existing complex code automatically

• Good news
  – Much functionality can be added to applications in a minimally invasive way, with no rewrite from scratch

• First libraries under development
  – Accelerator (Microsoft): linear algebra, BLAS-like
  – Glift (Lefohn et al.): abstract data structures, e.g. trees

Page 6: GPGPU Programming

Overview

• Choices in GPGPU programming

• Illustrated CPU vs. GPU step by step example

• GPU kernels in detail

Page 7: GPGPU Programming

Native Data Layout

• CPU: 1D array

• GPU: 2D array

Indices are floats, addressing array element centers (GL) or top-left corners (D3D).

This will be important later.
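As an illustration, a minimal sketch of the address translation between the two layouts; the row-major width W, the helper names and the GL-style +0.5 center offset are assumptions for this example:

/* Minimal sketch: map a 1D array index to 2D texel coordinates
   and back, assuming a W-wide row-major layout and GL-style
   element centers. */
void index_1d_to_2d(int i, int W, float *s, float *t)
{
    *s = (float)(i % W) + 0.5f;   /* +0.5: address the texel center (GL) */
    *t = (float)(i / W) + 0.5f;
}

int index_2d_to_1d(float s, float t, int W)
{
    return (int)t * W + (int)s;   /* truncation undoes the +0.5 offset */
}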

Page 8: GPGPU Programming

Example Problem

• saxpy (from BLAS)
  – given two vectors x and y of size N and a scalar a
  – compute the scaled vector-vector addition y = y + a*x

• CPU implementation
  – store each vector in one array, loop over all elements

    for (i=0; i<N; i++)
        y[i] = y[i] + a*x[i];

• Identify the computation inside the loop as the kernel
  – no logic in this basic kernel, pure computation
  – logic and computation are fully separated

Page 9: GPGPU Programming

Understanding GPU Limitations

• No simultaneous reads from and writes to the same memory
  – No read-modify-write buffer means no logic is required to handle read-before-write hazards
  – Not a missing feature, but an essential hardware design decision for good performance and throughput
  – saxpy: introduce an additional array: ynew = yold + a*x (see the sketch after this list)

• Coherent memory access
  – For a given output element, read from the same index in the two input arrays
  – Trivially achieved in this basic example
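On the CPU, the same constraint corresponds to writing into a separate output array rather than updating in place; a minimal sketch (ynew and yold are the names from the slide, the surrounding loop is illustrative):

    /* saxpy without in-place updates: read yold, write ynew */
    for (i = 0; i < N; i++)
        ynew[i] = yold[i] + a * x[i];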

Page 10: GPGPU Programming

Performing Computations

• Load a kernel program
  – Detailed examples later on

• Specify the output and input arrays
  – Pseudocode (see the OpenGL sketch below):

    setInputArrays(yold, x);
    setOutputArray(ynew);

• Trigger the computation
  – The GPU is, after all, a graphics processor
  – So just draw something appropriate
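In OpenGL, that pseudocode could correspond to texture and framebuffer bindings. A minimal sketch, assuming the arrays already live in textures texYold, texX and texYnew and a framebuffer object fb has been created (all names are assumptions):

/* Sketch: "setInputArrays" = bind the input textures to texture units */
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texYold);
glActiveTexture(GL_TEXTURE1);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texX);

/* Sketch: "setOutputArray" = attach the output texture to the FBO */
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fb);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                          GL_TEXTURE_RECTANGLE_ARB, texYnew, 0);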

Page 11: GPGPU Programming

Computing = Drawing

• Specify input and output regions
  – Set up a 1:1 mapping from the graphics viewport to the output array elements, and set up the input regions
  – saxpy: input and output regions coincide

• Generate data streams
  – Literally draw some geometry that covers all elements in the output array
  – In this example, a 4x4 filled quad from four vertices (see the sketch below)
  – The GPU will interpolate output array indices from the vertices across the output region
  – And generate a data stream flowing through the parallel PEs
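A minimal sketch of such a draw call in OpenGL, assuming a 4x4 output region with a matching viewport and orthographic projection (the size 4 is just this example's):

/* One quad covering the whole 4x4 output region; the rasterizer
   generates one fragment (= one kernel invocation) per output
   element. The kernel reads its index from WPOS, so no texture
   coordinates are needed. Assumes gluOrtho2D(0, 4, 0, 4). */
glViewport(0, 0, 4, 4);
glBegin(GL_QUADS);
    glVertex2f(0.0f, 0.0f);
    glVertex2f(4.0f, 0.0f);
    glVertex2f(4.0f, 4.0f);
    glVertex2f(0.0f, 4.0f);
glEnd();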

Page 12: GPGPU Programming

Example

[Figure: a saxpy pass with the kernel y + 0.5*x, i.e. a = 0.5, applied to every element of the output region]

Page 13: GPGPU Programming

Performing Computations

• High-level view
  – The kernel is executed simultaneously on all elements in the output region
  – The kernel knows its output index (and eventually additional input indices, more on that later)
  – Drawing replaces CPU loops: foreach-execution
  – The output array is write-only

• Feedback loop (ping-pong technique)
  – The output array can be used read-only as input for the next operation (see the sketch below)
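A minimal host-side sketch of the ping-pong technique; the two texture handles and the helper functions are assumptions standing in for the binding and drawing code shown earlier:

/* Ping-pong: alternate the roles of two arrays between passes.
   bind_as_input()/attach_as_output()/draw_quad() are hypothetical
   helpers wrapping the glBindTexture/glFramebufferTexture2DEXT/
   quad-drawing calls sketched above. */
GLuint src = texA, dst = texB;
for (int pass = 0; pass < numPasses; pass++) {
    bind_as_input(src);        /* last pass's output, now read-only */
    attach_as_output(dst);     /* write-only render target */
    draw_quad();               /* trigger the kernel */
    GLuint tmp = src; src = dst; dst = tmp;   /* swap roles */
}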

Page 14: GPGPU Programming

Overview

• Choices in GPGPU programming

• Illustrated CPU vs. GPU step by step example

• GPU kernels in detail

Page 15: GPGPU Programming

GPU Kernels: saxpy

• Kernel on the CPU:

    y[i] = y[i] + a*x[i]

• Written in Cg for the GPU:

    float saxpy(float2 coords : WPOS,        // array index
                uniform samplerRECT arrayX,  // input arrays
                uniform samplerRECT arrayY,
                uniform float a) : COLOR
    {
        float y = texRECT(arrayY, coords);   // gather
        float x = texRECT(arrayX, coords);
        return y + a*x;                      // compute
    }
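For the "load a kernel program" step mentioned earlier, a minimal sketch of how this kernel could be loaded and fed with parameters through the Cg runtime; the file name saxpy.cg, the fp30 profile and the texture handles are assumptions:

/* Sketch: load the saxpy kernel and set its parameters via the Cg runtime */
CGcontext ctx  = cgCreateContext();
CGprogram prog = cgCreateProgramFromFile(ctx, CG_SOURCE, "saxpy.cg",
                                         CG_PROFILE_FP30, "saxpy", NULL);
cgGLLoadProgram(prog);
cgGLEnableProfile(CG_PROFILE_FP30);
cgGLBindProgram(prog);

/* uniform inputs: the two arrays and the scalar a */
CGparameter pX = cgGetNamedParameter(prog, "arrayX");
CGparameter pY = cgGetNamedParameter(prog, "arrayY");
cgGLSetTextureParameter(pX, texX);
cgGLSetTextureParameter(pY, texYold);
cgGLEnableTextureParameter(pX);
cgGLEnableTextureParameter(pY);
cgSetParameter1f(cgGetNamedParameter(prog, "a"), 0.5f);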

Page 16: GPGPU Programming

GPU Kernels: Jacobi Iteration

• Good news
  – A simple linear system solver can be built with exactly these basic techniques!

• Example: finite differences
  – x: vector of unknowns, sampled with a 5-point stencil (offsets)
  – b: right-hand side
  – regular, equidistant grid
  – "solved" with Jacobi iteration
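For reference, the update that the kernel on the next slide implements can be written as the standard Jacobi step (a sketch; d denotes the diagonal entry of the system matrix A, which for this 5-point stencil scaling is 4/h):

    x_{i,j}^{(k+1)} = x_{i,j}^{(k)} + \frac{1}{d}\Bigl(b_{i,j} - (A x^{(k)})_{i,j}\Bigr),
    \qquad
    (A x)_{i,j} = \frac{1}{h}\bigl(4x_{i,j} - x_{i-1,j} - x_{i+1,j} - x_{i,j-1} - x_{i,j+1}\bigr)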

Page 17: GPGPU Programming

GPU Kernels: Jacobi Iteration

    float jacobi(float2 center : WPOS,
                 uniform samplerRECT x,
                 uniform samplerRECT b,
                 uniform float one_over_h) : COLOR
    {
        // calculate offsets
        float2 left   = center - float2(1,0);
        float2 right  = center + float2(1,0);
        float2 bottom = center - float2(0,1);
        float2 top    = center + float2(0,1);

        // gather values
        float x_center = texRECT(x, center);
        float x_left   = texRECT(x, left);
        float x_right  = texRECT(x, right);
        float x_bottom = texRECT(x, bottom);
        float x_top    = texRECT(x, top);
        float rhs      = texRECT(b, center);

        // matrix-vector (5-point stencil)
        float Ax = one_over_h *
                   ( 4.0 * x_center - x_left -
                     x_right - x_bottom - x_top );

        // Jacobi step: reciprocal of the diagonal entry 4/h
        float inv_diag = 1.0 / (4.0 * one_over_h);

        return x_center + inv_diag * (rhs - Ax);
    }

Page 18: GPGPU Programming

Maximum of an Array

• An entirely different operation
  – Output is a single scalar, input is an array of length N

• Naive approach
  – Use a 1x1 array as output and gather all N values in one step
  – Doomed: will only use one PE, no parallelism at all
  – Runs into all sorts of other troubles

• Solution: parallel reduction
  – Idea based on global communication in parallel computing
  – Smart interplay of output and input regions
  – The same technique applies to dot products, norms, etc.

Page 19: GPGPU Programming

Maximum of an Array

• Each pass maps the input array to an N/2 x N/2 output: the indices are adjusted so that every output element gathers the maximum of a 2x2 input region. The first output then serves as input for the next pass, and so on through the intermediate results (a host-side loop sketch follows below).

    float maximum(float2 coords : WPOS,
                  uniform samplerRECT array) : COLOR
    {
        // map the output index to the top-left element of its 2x2 input region
        float2 topleft = ((coords - 0.5) * 2.0) + 0.5;

        // gather the four candidates
        float val1 = texRECT(array, topleft);
        float val2 = texRECT(array, topleft + float2(1,0));
        float val3 = texRECT(array, topleft + float2(1,1));
        float val4 = texRECT(array, topleft + float2(0,1));

        return max(val1, max(val2, max(val3, val4)));
    }
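A minimal host-side sketch of the repeated passes, under the same ping-pong assumptions as before (texA/texB, the helper names and the power-of-two size N are assumptions):

/* Reduce an N x N array to 1x1 by repeated 2x2-max passes */
GLuint src = texA, dst = texB;
for (int size = N / 2; size >= 1; size /= 2) {
    bind_as_input(src);            /* current array */
    attach_as_output(dst);         /* size x size intermediate result */
    glViewport(0, 0, size, size);  /* shrink the output region */
    draw_quad();                   /* run the maximum kernel */
    GLuint tmp = src; src = dst; dst = tmp;
}
/* the maximum now sits in the single element of src */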

Page 20: GPGPU Programming

Multigrid Transfers

• Restriction
  – Interpolate values from the fine into the coarse array
  – Local neighborhood weighted gather on both CPU and GPU (see the sketch below)

[Figure: the output region is the coarse array; coarse element i adjusts its index to read its fine-array neighbors 2i-1, 2i, 2i+1 and gathers the result]
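As a CPU illustration of that weighted gather, a minimal 1D sketch; the full-weighting stencil (1/4, 1/2, 1/4) is an assumption, since the slides do not fix the weights (boundaries omitted):

    /* 1D restriction sketch: coarse element i gathers fine
       neighbors 2i-1, 2i, 2i+1 with assumed weights 1/4, 1/2, 1/4 */
    for (i = 1; i < Nc - 1; i++)
        coarse[i] = 0.25f * fine[2*i - 1]
                  + 0.50f * fine[2*i]
                  + 0.25f * fine[2*i + 1];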

Page 21: GPGPU Programming

Multigrid Transfers

• Prolongation
  – Scatter values from the coarse to the fine array with a weighting stencil
  – Typical CPU implementation: loop over the coarse array with stride-2 daxpys (see the sketch below)
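A minimal 1D CPU sketch of that scatter; in 1D only the edge (half weight) and copy cases of the next slide occur, and the weights 1 and 1/2 are an assumption for linear interpolation:

    /* 1D prolongation sketch: each coarse value is scattered into
       the fine correction array (a stride-2 "daxpy"): full weight
       onto the coincident fine node, half weight onto its neighbors */
    for (i = 0; i < Nc; i++) {
        fine[2*i] += coarse[i];                                /* copy case */
        if (2*i - 1 >= 0) fine[2*i - 1] += 0.5f * coarse[i];   /* neighbors */
        if (2*i + 1 < Nf) fine[2*i + 1] += 0.5f * coarse[i];
    }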

Page 22: GPGPU Programming

Multigrid Transfers

• Three cases
  1) Fine node lies in the center of an element (4 interpolants)
  2) Fine node lies on the edge of an element (2 interpolants)
  3) Fine node lies on top of a coarse node (copy)

• Reformulate the scatter as a gather for the GPU
  – Set the fine array as the output region
  – Sample with index offset 0.25: the offset snaps back to the center for case 3, and snaps to the neighbors for cases 1 and 2
  – Same code for all three cases, no conditionals or red-black map

Page 23: GPGPU Programming

Conclusions

• This is not as complicated as it might seem
  – Course notes online: http://www.mathematik.uni-dortmund.de/~goeddeke/iccs
  – GPGPU community site: http://www.gpgpu.org
    • Developer information, lots of useful references
    • Paper archive
    • Help from real people in the GPGPU forums

