History of GPGPU

● Evolution of GPU architecture & GPU programming interfaces

● GPGPU applications

GPU Architecture Evolution

● Excerpts from Nvidia's presentations
  ● D. Luebke, I. Buck
  ● Nvidia-centric...

History of GPU APIs

● Graphics/rendering
  ● IRIS GL: early 90s by SGI
  ● OpenGL: OpenGL ARB founded in 1992
  ● Current version: OpenGL 4.1

    glBegin(GL_TRIANGLES);
      glMultiTexCoord2f(GL_TEXTURE0_ARB, 0, 0);
      glMultiTexCoord4f(GL_TEXTURE1_ARB, 0, 0, 0, 0);
      glMultiTexCoord4f(GL_TEXTURE2_ARB, 0, 0, 0, 0);
      glVertex2f(0, 0);
      glMultiTexCoord2f(GL_TEXTURE0_ARB, 2, 0);
      glMultiTexCoord4f(GL_TEXTURE1_ARB, 2, 0, 2, 0);
      glMultiTexCoord4f(GL_TEXTURE2_ARB, 2, 0, 2, 0);
      glVertex2f(2, 0);
      glMultiTexCoord2f(GL_TEXTURE0_ARB, 0, 2);
      glMultiTexCoord4f(GL_TEXTURE1_ARB, 0, 2, 0, 2);
      glMultiTexCoord4f(GL_TEXTURE2_ARB, 0, 2, 0, 2);
      glVertex2f(0, 2);
    glEnd();

OpenGL Extensions for GPGPU

● Vendors can extend OpenGL via extensions
  ● Complete list: http://www.opengl.org/registry/
  ● About 500 extensions!

● Extensions relevant for GPGPU are
  ● Making the pipeline programmable
    – Not only fixed function
    – Most relevant for GPGPU: programmable fragment processing
  ● More versatile texture (data storage) formats
    – Floating point vs. fixed point format
  ● Rendering to textures (output arrays)
    – Not only to frame buffer for the display

Mapping GPGPU to Rendering

● Streams (data-parallel arrays): CPU array = GPU texture
  ● float a[1024] ↔ glGenTextures(...); glTexImage2D(...)
● Kernel: body of parallel for loop = GPU fragment program
● Output / input for next stage (parallel for)
  ● CPU target array = GPU render-to-texture
● Execute computation
  ● CPU run parallel for loop = render quad with shaders enabled
● Gather op: CPU array access = GPU texture fetch
  ● … a[i] … ↔ … tex2D(a_tex, st) …
● Scatter op: CPU array write = GPU adjust vertex coordinates
  ● a[k] = … ↔ OPOS = ...
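The mapping above can be emulated on the CPU to make the roles concrete. The following is a minimal Python sketch, not actual OpenGL: `run_kernel` stands in for "render a quad with shaders enabled", lists stand in for textures, and the kernel names (`saxpy_kernel`, `reverse_kernel`) are made up for illustration. Each kernel call may read any input element (gather) but produces only the one output element it was launched for, which is why scatter needs a different mechanism.

```python
def run_kernel(kernel, size, *inputs):
    """Emulate one rendering pass: call the kernel once per output element."""
    # The returned list plays the role of the render-to-texture target.
    return [kernel(i, *inputs) for i in range(size)]


def saxpy_kernel(i, x, y, alpha):
    """Body of the parallel for loop: out[i] = alpha * x[i] + y[i]."""
    return alpha * x[i] + y[i]


def reverse_kernel(i, src):
    """Gather with a computed index, similar to a dependent texture fetch."""
    return src[len(src) - 1 - i]


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

# Pass 1 writes a new "texture"; pass 2 reads it as its input.
pass1 = run_kernel(saxpy_kernel, len(x), x, y, 2.0)
pass2 = run_kernel(reverse_kernel, len(pass1), pass1)
```

Chaining passes this way mirrors the "output / input for next stage" bullet: the result of one pass becomes the input of the next.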

Adding Programming Power

● Nvidia:
  ● NV_register_combiners
    – Mostly used for correct per-pixel Phong shading/bump mapping
  ● NV_texture_shader, NV_texture_shader2
    – 23 different ways to fetch data from a texture
  ● Nvparse
    – Easier use of texture shaders / register combiners

● ATI: ATI_fragment_shader
  ● Assembly-like instructions

    ...
    glPassTexCoordATI(GL_REG_1_ATI, GL_TEXTURE1_ARB, GL_SWIZZLE_STR_ATI); // N
    glPassTexCoordATI(GL_REG_2_ATI, GL_TEXTURE2_ARB, GL_SWIZZLE_STR_ATI); // light to vertex vector in light space
    glPassTexCoordATI(GL_REG_3_ATI, GL_TEXTURE3_ARB, GL_SWIZZLE_STR_ATI); // H
    glSampleMapATI(GL_REG_4_ATI, GL_TEXTURE4_ARB, GL_SWIZZLE_STR_ATI); // L (sample cubemap normalizer)

    // reg4 = N.L
    glColorFragmentOp2ATI(GL_DOT3_ATI, GL_REG_4_ATI, GL_NONE, GL_NONE,
                          GL_REG_1_ATI, GL_NONE, GL_NONE,
                          GL_REG_4_ATI, GL_NONE, GL_2X_BIT_ATI|GL_BIAS_BIT_ATI);

    // reg1 = N.H
    glColorFragmentOp2ATI(GL_DOT3_ATI, GL_REG_1_ATI, GL_NONE, GL_NONE,
                          GL_REG_1_ATI, GL_NONE, GL_NONE,
                          GL_REG_3_ATI, GL_NONE, GL_NONE);

    // reg1(green) = H.H (aka |H|^2)
    glColorFragmentOp2ATI(GL_DOT3_ATI, GL_REG_1_ATI, GL_GREEN_BIT_ATI, GL_NONE,
                          GL_REG_3_ATI, GL_NONE, GL_NONE,
                          GL_REG_3_ATI, GL_NONE, GL_NONE);
    ...

● ARB_fragment_program & NV_fragment_program*
  ● Assembly instructions (the example shown is a vertex program with a vertex texture lookup)

    !!ARBvp1.0
    OPTION NV_vertex_program3;
    PARAM mvp[4] = { state.matrix.mvp };
    PARAM scale = program.local[0];
    TEMP pos, displace;
    # vertex texture lookup
    TEX displace, vertex.texcoord, texture[0], 2D;
    MUL displace.x, displace.x, scale;
    # displace along normal
    MAD pos.xyz, vertex.normal, displace.x, vertex.position;
    MOV pos.w, 1.0;
    # transform to clip space
    DP4 result.position.x, mvp[0], pos;
    DP4 result.position.y, mvp[1], pos;
    DP4 result.position.z, mvp[2], pos;
    DP4 result.position.w, mvp[3], pos;
    MOV result.color, vertex.color;
    # pass the incoming texture coordinate through unchanged
    MOV result.texcoord[0], vertex.texcoord;
    END

Shader Metaprogramming

● Lib Sh (2003)
  ● Embed shader code in C++, heavy operator/function overloading
  ● Translated to “real shader” code at run-time
● Cg (NVidia) / GLSL (OpenGL) / HLSL (Direct3D)
  ● Cg 1.1 released in 2002, Cg 3.0 Nov. 2010

More Useful Texture Formats

● OpenGL originally supported only 8-bit textures
● Now
  ● 16-bit half floats
  ● 32-bit single-precision floats
  ● 16/32-bit integer formats
  ● 1, 2, 3 and 4 channels

Render-to-Texture

● WGL_ARB_pbuffer / WGL_ARB_render_texture (1997)
● The nightmare of every GPU programmer (and rather slow)

    static int attributes[] = /* OpenGL attributes */
    {
      GLX_RGBA,
      GLX_RED_SIZE, 8,
      GLX_GREEN_SIZE, 8,
      GLX_BLUE_SIZE, 8,
      GLX_DEPTH_SIZE, 16,
      0, /* Save space for GLX_DOUBLEBUFFER */
      0
    };

    static int pbattrs[] = /* Pbuffer attributes */
    {
      GLX_PBUFFER_WIDTH, 1024,
      GLX_PBUFFER_HEIGHT, 1024,
      0
    };

    configs = glXChooseFBConfig(display, DefaultScreen(display), attributes, &nconfigs);
    pbuffer = glXCreatePbuffer(display, *configs, pbattrs);
    context = glXCreateNewContext(display, *configs, GLX_RGBA_BIT, 0, True);
    glXMakeCurrent(display, pbuffer, context);

Render-to-Texture

● EXT_framebuffer_object (2004)

Non-Graphics GPU Programming

● BrookGPU [Buck et al., 2004]

● High-level data-parallel language mapped on GPU

    /*** Kernels needed for Sparse Matrix-Vector multiplication *******************/

    kernel void gather(float index<>, float x[NUM_ROWS+1], out float result<>) {
      result = x[index];
    }

    // componentwise multiply
    kernel void mult(float a<>, float b<>, out float c<>) {
      c = a*b;
    }

    reduce void sumRows(float nzValues<>, reduce float result<>) {
      result += nzValues;
    }
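To see how the three Brook kernels combine into a sparse matrix-vector product, here is a rough CPU emulation in Python. It is only a sketch: the storage layout (a flat list of nonzeros plus a `row_lengths` list) and the driver function `spmv` are assumptions made for illustration, since the slide shows the kernels but not how the streams are shaped or launched.

```python
def gather(index, x):
    """result[k] = x[index[k]]: fetch the vector entry for each nonzero."""
    return [x[i] for i in index]


def mult(a, b):
    """Componentwise multiply of two equally long streams."""
    return [ai * bi for ai, bi in zip(a, b)]


def sum_rows(nz_values, row_lengths):
    """Reduce the per-nonzero products to one sum per matrix row."""
    result, start = [], 0
    for length in row_lengths:
        result.append(sum(nz_values[start:start + length]))
        start += length
    return result


def spmv(values, col_index, row_lengths, x):
    """y = A * x, with A given by its nonzeros in row-major order."""
    return sum_rows(mult(values, gather(col_index, x)), row_lengths)


# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
values = [2, 1, 3, 4, 5]
col_index = [0, 2, 1, 0, 2]
row_lengths = [2, 1, 2]
y = spmv(values, col_index, row_lengths, [1, 2, 3])
```

The gather step is the only irregular memory access; the multiply and the row-wise reduction are plain data-parallel operations.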

● Close-to-metal (AMD)
  ● Very low-level
  ● NDA required
● Stream SDK (AMD)
  ● 1.0 in 2007, 2.0 in 2010
  ● Includes Brook+ (successor of BrookGPU)
● DirectCompute (Microsoft)
  ● Part of DirectX 11
  ● Computational model similar to CUDA & OpenCL
● CUDA (Feb 2007) / OpenCL

Evolution of GPGPU Applications

● GPU for general purpose
  ● Data structures for search
  ● Simulation / stochastic sampling
  ● Linear algebra / PDEs
● GPU for visual computing beyond rendering
  ● Computer vision
  ● Global illumination (raytracing, radiosity)
  ● Non-photorealistic rendering (NPR)
● Survey of early work!

General Comments

● Early approaches (pre-CUDA time) use aspects of the fixed-function pipeline in a smart way
  ● Z-buffer (point-wise minimum)
  ● Triangle rasterization / tex-coord/depth interpolation
  ● Blending (in-place summation)
  ● Stencil buffer & early Z-test for flow control
  ● Simulate >8 bit values
  ● Exploit vec3/vec4 parallelism
● With CUDA/OpenCL
  ● Reformulate problem to make it data-parallel
  ● Hide memory latency
  ● Exploit shared memory
  ● Increase GPU occupancy

Voronoi Diagrams

● [Hoff et al., 1999]

● Use rasterization & Z-buffer HW to compute Voronoi cells
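The idea can be read as follows: each site is drawn as a distance cone whose depth at a pixel equals the distance to that site, and the depth test keeps the nearest site per pixel. The Python sketch below emulates this on the CPU. The function name `voronoi_zbuffer` and the brute-force loop that replaces real rasterization are illustrative assumptions, not the implementation from the paper.

```python
import math


def voronoi_zbuffer(width, height, sites):
    """Label every pixel with the index of its nearest site."""
    depth = [[float("inf")] * width for _ in range(height)]
    label = [[-1] * width for _ in range(height)]
    for site_id, (sx, sy) in enumerate(sites):
        # "Rasterize" the distance cone of this site over the whole image.
        for y in range(height):
            for x in range(width):
                d = math.hypot(x - sx, y - sy)
                # Depth test (less-than): keep the closest cone.
                if d < depth[y][x]:
                    depth[y][x] = d
                    label[y][x] = site_id
    return label, depth


label, depth = voronoi_zbuffer(4, 1, [(0, 0), (3, 0)])
```

After all sites are drawn, `label` holds the Voronoi cells (the color buffer) and `depth` holds the distance field (the Z-buffer).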

Image Processing: Filtering

● [Hopf & Ertl, 1999]
  ● Gaussian filtering of volumetric (medical) data
  ● Uses EXT_convolution

● [Hadwiger et al, 2001]
  ● Texture filtering
  ● Register combiner / ATI fragment shaders

GPU Raytracing

● [Purcell et al, 2002]

● Simulates GPU-based raytracing on not-yet-available HW
  – Simulator in SW
● Octree traversal
  – Looping and branching (not available in 2002)
  – Multi-pass rendering

Game-of-Life on the GPU

● [Harris et al., 2002] (Mark Harris coined the term GPGPU)

● Run extensions of cellular automata on the GPU
● Reaction-diffusion processes
  – Simpler variant of fluid dynamics
  – Simulate boiling water
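A cellular automaton fits the fragment-program model well because every cell is updated from a fixed neighbourhood of the previous state, which is a pure gather. As a minimal illustration, the Python sketch below performs one step of Conway's Game of Life; it is a stand-in chosen for brevity, not the extended automata of the cited work, and the wrap-around boundary is an assumption.

```python
def life_step(grid):
    """Compute the next generation from the current one (read-only input)."""
    rows, cols = len(grid), len(grid[0])

    def alive_neighbours(r, c):
        total = 0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) != (0, 0):
                    # Wrap-around lookup, like a repeating texture.
                    total += grid[(r + dr) % rows][(c + dc) % cols]
        return total

    def kernel(r, c):
        n = alive_neighbours(r, c)
        return 1 if n == 3 or (grid[r][c] == 1 and n == 2) else 0

    # One "rendering pass": every output cell is computed independently.
    return [[kernel(r, c) for c in range(cols)] for r in range(rows)]


blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
next_gen = life_step(blinker)
```

On the GPU the old and new grids would be two textures that swap roles after each pass, since a pass cannot read and write the same texture.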

Stereo on the GPU

● [Yang et al., 2002]

● Plane-sweep approach for stereo

● Free-viewpoint video / tele-presence

Stereo on the GPU

● [Zach et al., 2003]

● Hierarchical search for depth

● Use 8-bit RGB channels to encode 18 bit integers
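The slide does not say how the 18 bits are distributed over the channels. Purely as an illustration of the general trick of simulating wider values with 8-bit channels, the Python sketch below assumes 6 bits per channel (3 x 6 = 18); the split, the constant `BITS_PER_CHANNEL`, and the function names are all assumptions, not the encoding used in the cited work.

```python
BITS_PER_CHANNEL = 6  # assumption: 3 channels x 6 bits = 18 bits
MASK = (1 << BITS_PER_CHANNEL) - 1


def pack18(value):
    """Split an 18-bit integer into (r, g, b), each fitting in 8 bits."""
    if not 0 <= value < (1 << 18):
        raise ValueError("value does not fit into 18 bits")
    r = (value >> (2 * BITS_PER_CHANNEL)) & MASK  # most significant part
    g = (value >> BITS_PER_CHANNEL) & MASK
    b = value & MASK                              # least significant part
    return r, g, b


def unpack18(r, g, b):
    """Recombine the three channel values into the original integer."""
    return (r << (2 * BITS_PER_CHANNEL)) | (g << BITS_PER_CHANNEL) | b


rgb = pack18(175053)
restored = unpack18(*rgb)
```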

Sparse Linear Algebra

● [Krueger & Westermann, 2003]

● Solve sparse linear systems
  – Conjugate gradients
  – Basic set of LinAlg operations
  – Most important: sparse matrix-vector multiplication
● Uses mix of vertex and fragment shaders
● Fluid flow
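Conjugate gradients needs only a handful of building blocks: dot products, scaled vector additions, and a sparse matrix-vector product. The Python sketch below shows the textbook iteration built from these operations. It assumes a symmetric positive definite matrix stored as per-row lists of (column, value) pairs; that layout and all function names are illustrative choices, not the data structures of the cited paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def spmv(rows, x):
    """Sparse matrix-vector product; rows[i] lists (column, value) pairs."""
    return [sum(val * x[col] for col, val in row) for row in rows]


def conjugate_gradient(rows, b, tol=1e-20, max_iter=100):
    """Solve A x = b for symmetric positive definite A."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the start vector x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:  # tol is a threshold on the squared residual norm
            break
        ap = spmv(rows, p)
        alpha = rs_old / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs_old, p, r)
        rs_old = rs_new
    return x


# A = [[4, 1], [1, 3]], b = [1, 2]; exact solution is [1/11, 7/11].
A = [[(0, 4.0), (1, 1.0)], [(0, 1.0), (1, 3.0)]]
solution = conjugate_gradient(A, [1.0, 2.0])
```

Every line of the loop body is a data-parallel vector operation or a reduction, which is why a small set of GPU LinAlg operations is enough to run the whole solver.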

Sparse Linear Algebra 2

● [Bolz et al., 2003]

● Sparse matrix-vector multiplication
● Multigrid
● Surface denoising & particle advection

Level-set segmentation

● [Lefohn et al., 2003]

● Level-set segmentation & volume rendering
● Mixed approach for level sets
  – Bookkeeping done by CPU
  – PDE updates in narrow band on the GPU

Scan / Reductions

● All-prefix-sums
  ● Generate all partial sums in a sequence
  ● O(n log(n)) algorithm based on recursive doubling
  ● Applications
    – Collision detection [Horn 2005]
    – Integral images for fast box filtering [Hensley et al., 2005]

$b_i := \sum_{j=1}^{i} a_j$
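Recursive doubling computes all prefix sums in about log2(n) passes: in the pass with offset d, every element adds the value d positions to its left, and d doubles after each pass. The Python sketch below is a CPU emulation of this scheme (often called a Hillis-Steele scan); the function name is made up, and each list comprehension stands in for one rendering pass that reads the previous result and writes a new one.

```python
def scan_recursive_doubling(a):
    """Inclusive all-prefix-sums using recursive doubling."""
    b = list(a)
    offset = 1
    while offset < len(b):
        # One pass: all elements are updated independently from the old b.
        b = [b[i] + b[i - offset] if i >= offset else b[i]
             for i in range(len(b))]
        offset *= 2
    return b


prefix = scan_recursive_doubling([3, 1, 7, 0, 4, 1, 6, 3])
```

Each pass performs up to n additions and there are about log2(n) passes, which gives the O(n log(n)) work stated above, compared with O(n) for a sequential running sum.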

Scan / Reductions

● Horn 2005

Compaction

● Extract all elements from a sequence satisfying a predicate

● E.g. histogram pyramids [Ziegler et al., 2006]
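A common way to express compaction in data-parallel terms is flag, scan, scatter: mark the elements that satisfy the predicate, take an exclusive prefix sum of the flags to obtain each survivor's output position, then write each survivor to that position. The Python sketch below follows this scheme. Note that it is not the histogram pyramid method mentioned above, and the sequential running sum merely stands in for a parallel scan.

```python
def compact(values, predicate):
    """Return the elements satisfying predicate, in their original order."""
    flags = [1 if predicate(v) else 0 for v in values]

    # Exclusive prefix sum of the flags: output slot of each kept element.
    slots, running = [], 0
    for f in flags:
        slots.append(running)
        running += f

    out = [None] * running
    for v, f, s in zip(values, flags, slots):
        if f:
            out[s] = v  # scatter to the computed position
    return out


positives = compact([5, -2, 7, 0, -9, 3], lambda v: v > 0)
```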

Dynamic Programming

● [Zach et al., 2006]

● DP for stereo based on recursive doubling (O(n log n) complexity)

● Not needed anymore with CUDA/OpenCL programming model

2006-Today: Plethora of GPU Applications

● Graphics
  ● Global illumination
  ● Tone mapping / NPR
  ● Cloth simulation / animation
● Computer vision & image processing
  ● Image features & matching
  ● Segmentation / medical imaging
  ● Stereo & 3D reconstruction from images
● Scientific computing
  ● Monte-Carlo simulation for quantum mechanics, climate research etc.
  ● PDEs for fluid flow, heat flow...
  ● Bioinformatics (genome sequence alignment)
● Video encoding
● Brute-force password cracking

CUDA Showcase

● Currently 1260 applications registered (imaging subset shown)

Acknowledgements

● Lots of internet resources and papers

● Owens et al., A Survey of General-Purpose Computation on Graphics Hardware, 2007
