Post on 03-Feb-2022
History of GPGPU
● Evolution of
  ● GPU architecture &
  ● GPU programming interfaces
● GPGPU applications
GPU Architecture Evolution
● Excerpts from Nvidia's presentations
  ● D. Luebke, I. Buck
  ● Nvidia-centric...
History of GPU APIs
● Graphics/rendering
  ● IRIS GL: early 90s by SGI
  ● OpenGL: OpenGL ARB founded in 1992
  ● Current version: OpenGL 4.1
glBegin(GL_TRIANGLES);
    glMultiTexCoord2f(GL_TEXTURE0_ARB, 0, 0);
    glMultiTexCoord4f(GL_TEXTURE1_ARB, 0, 0, 0, 0);
    glMultiTexCoord4f(GL_TEXTURE2_ARB, 0, 0, 0, 0);
    glVertex2f(0, 0);
    glMultiTexCoord2f(GL_TEXTURE0_ARB, 2, 0);
    glMultiTexCoord4f(GL_TEXTURE1_ARB, 2, 0, 2, 0);
    glMultiTexCoord4f(GL_TEXTURE2_ARB, 2, 0, 2, 0);
    glVertex2f(2, 0);
    glMultiTexCoord2f(GL_TEXTURE0_ARB, 0, 2);
    glMultiTexCoord4f(GL_TEXTURE1_ARB, 0, 2, 0, 2);
    glMultiTexCoord4f(GL_TEXTURE2_ARB, 0, 2, 0, 2);
    glVertex2f(0, 2);
glEnd();
OpenGL Extensions for GPGPU
● Vendors can extend OpenGL via extensions
  ● Complete list: http://www.opengl.org/registry/
  ● About 500 extensions!
● Extensions relevant for GPGPU are
  ● Making the pipeline programmable
    – not only fixed function
    – Most relevant for GPGPU: programmable fragment processing
  ● More versatile texture (data storage) formats
    – Floating point vs. fixed point format
  ● Rendering to textures (output arrays)
    – Not only to frame buffer for the display
Mapping GPGPU to Rendering
● Streams (data-parallel arrays): CPU array = GPU texture
  ● float a[1024] ↔ glGenTextures(...); glTexImage2D(...)
● Kernel: Body of parallel for loop = GPU fragment program
● Output / input for next stage (parallel for)
  ● CPU target array = GPU render-to-texture
● Execute computation
  ● CPU run parallel for loop = render quad with shaders enabled
● Gather op: CPU array access = GPU texture fetch
  ● … a[i] … ↔ … tex2D(a_tex, st) …
● Scatter op: CPU array write = GPU adjust vertex coordinates
  ● a[k] = … ↔ OPOS = ...
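The mapping above can be imitated on the CPU to make the roles clearer. This is only a rough sketch: `render_pass`, `saxpy_kernel` and `reverse_kernel` are hypothetical names chosen for illustration, and no real OpenGL call is involved. Nested lists stand in for textures, and one kernel invocation per output texel stands in for rendering a quad.

```python
# CPU-side sketch of the stream/kernel mapping (hypothetical names, no GPU calls).

def render_pass(kernel, width, height, *textures):
    """Run `kernel` once per output texel, like drawing a screen-filling quad.

    Each invocation may read (gather) from any position of any input texture,
    but it produces only the value for its own output position. That is the
    restriction a fragment program works under.
    """
    return [[kernel(x, y, *textures) for x in range(width)]
            for y in range(height)]


def saxpy_kernel(x, y, a_tex, b_tex):
    # Body of the parallel for loop: out[i] = 2 * a[i] + b[i]
    return 2.0 * a_tex[y][x] + b_tex[y][x]


def reverse_kernel(x, y, a_tex):
    # Gather: read from a computed address, the analogue of tex2D(a_tex, st)
    width = len(a_tex[0])
    return a_tex[y][width - 1 - x]


a = [[1.0, 2.0, 3.0, 4.0]]        # a 4x1 "texture" holding float a[4]
b = [[10.0, 20.0, 30.0, 40.0]]

out = render_pass(saxpy_kernel, 4, 1, a, b)    # "render-to-texture"
rev = render_pass(reverse_kernel, 4, 1, out)   # output feeds the next pass
```

Scatter is deliberately missing from the sketch: a kernel here cannot choose where it writes, which is why the slide maps scatter to moving vertex coordinates instead.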
Adding Programming Power
● Nvidia:
  ● NV_register_combiners
    – Mostly used for correct per-pixel Phong shading/bump mapping
  ● NV_texture_shader, NV_texture_shader2
    – 23 different ways to fetch data from a texture
Adding Programming Power
● Nvparse
● Easier use of texture shaders / register combiners
Adding Programming Power
● ATI: ATI_fragment_shader
● Assembly-like instructions
...
glPassTexCoordATI(GL_REG_1_ATI, GL_TEXTURE1_ARB, GL_SWIZZLE_STR_ATI); // N
glPassTexCoordATI(GL_REG_2_ATI, GL_TEXTURE2_ARB, GL_SWIZZLE_STR_ATI); // light to vertex vector in light space
glPassTexCoordATI(GL_REG_3_ATI, GL_TEXTURE3_ARB, GL_SWIZZLE_STR_ATI); // H
glSampleMapATI(GL_REG_4_ATI, GL_TEXTURE4_ARB, GL_SWIZZLE_STR_ATI); // L (sample cubemap normalizer)
// reg4 = N.L
glColorFragmentOp2ATI(GL_DOT3_ATI, GL_REG_4_ATI, GL_NONE, GL_NONE,
                      GL_REG_1_ATI, GL_NONE, GL_NONE,
                      GL_REG_4_ATI, GL_NONE, GL_2X_BIT_ATI|GL_BIAS_BIT_ATI);
// reg1 = N.H
glColorFragmentOp2ATI(GL_DOT3_ATI, GL_REG_1_ATI, GL_NONE, GL_NONE,
                      GL_REG_1_ATI, GL_NONE, GL_NONE,
                      GL_REG_3_ATI, GL_NONE, GL_NONE);
// reg1(green) = H.H (aka |H|^2)
glColorFragmentOp2ATI(GL_DOT3_ATI, GL_REG_1_ATI, GL_GREEN_BIT_ATI, GL_NONE,
                      GL_REG_3_ATI, GL_NONE, GL_NONE,
                      GL_REG_3_ATI, GL_NONE, GL_NONE);
...
Adding Programming Power
● ARB_fragment_program & NV_fragment_program*
● Assembly instructions
!!ARBvp1.0
OPTION NV_vertex_program3;
PARAM mvp[4] = { state.matrix.mvp };
PARAM scale = program.local[0];
TEMP pos, displace;
# vertex texture lookup
TEX displace, vertex.texcoord, texture[0], 2D;
MUL displace.x, displace.x, scale;
# displace along normal
MAD pos.xyz, vertex.normal, displace.x, vertex.position;
MOV pos.w, 1.0;
# transform to clip space
DP4 result.position.x, mvp[0], pos;
DP4 result.position.y, mvp[1], pos;
DP4 result.position.z, mvp[2], pos;
DP4 result.position.w, mvp[3], pos;
MOV result.color, vertex.color;
MOV result.texcoord[0], vertex.texcoord;
END
Shader Metaprogramming
● Lib Sh (2003)
  ● Embed shader code in C++, heavy operator/function overloading
  ● Translated to “real shader” code at run-time
Adding Programming Power
● Cg (NVidia) / GLSL (OpenGL) / HLSL (Direct3D)
● Cg 1.1 released in 2002, Cg 3.0 Nov. 2010
More Useful Texture Formats
● OpenGL originally supported only 8-bit textures
● Now
  ● 16-bit half floats
  ● 32-bit single-precision floats
  ● 16/32-bit integer formats
  ● 1, 2, 3 and 4 channels
Render-to-Texture
● WGL_ARB_pbuffer / WGL_ARB_render_texture (1997)
● The nightmare of every GPU programmer (and rather slow)
static int attributes[] = /* OpenGL attributes */
{
    GLX_RGBA,
    GLX_RED_SIZE, 8,
    GLX_GREEN_SIZE, 8,
    GLX_BLUE_SIZE, 8,
    GLX_DEPTH_SIZE, 16,
    0, /* Save space for GLX_DOUBLEBUFFER */
    0
};
static int pbattrs[] = /* Pbuffer attributes */
{
    GLX_PBUFFER_WIDTH, 1024,
    GLX_PBUFFER_HEIGHT, 1024,
    0
};
configs = glXChooseFBConfig(display, DefaultScreen(display), attributes, &nconfigs);
pbuffer = glXCreatePbuffer(display, *configs, pbattrs);
context = glXCreateNewContext(display, *configs, GLX_RGBA_BIT, 0, True);
glXMakeCurrent(display, pbuffer, context);
Render-to-Texture
● EXT_framebuffer_object (2004)
Non-Graphics GPU Programming
● BrookGPU [Buck et al., 2004]
● High-level data-parallel language mapped on GPU
/*** Kernels needed for Sparse Matrix-Vector multiplication *******************/
kernel void gather(float index<>, float x[NUM_ROWS+1], out float result<>) {
    result = x[index];
}
// componentwise multiply
kernel void mult(float a<>, float b<>, out float c<>) {
    c = a*b;
}
reduce void sumRows(float nzValues<>, reduce float result<>) {
    result += nzValues;
}
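How the three kernels combine into a sparse matrix-vector product can be sketched on the CPU. The storage layout below (every row padded to the same number of nonzeros) is an assumption made for this example, and the Python function names only mirror the Brook kernels; none of this is Brook API.

```python
# CPU sketch of SpMV built from gather / mult / sumRows (layout is an assumption).

def gather(index, x):
    # result[i] = x[index[i]]: fetch the vector entry each nonzero multiplies
    return [x[i] for i in index]


def mult(a, b):
    # componentwise multiply of two equally long streams
    return [ai * bi for ai, bi in zip(a, b)]


def sum_rows(values, row_length):
    # reduction: add up each consecutive group of `row_length` products
    return [sum(values[r:r + row_length])
            for r in range(0, len(values), row_length)]


# Matrix [[2, 0, 1],
#         [0, 3, 0],
#         [4, 0, 5]] stored with 2 entries per row (row 1 is zero-padded).
nz_values = [2.0, 1.0, 3.0, 0.0, 4.0, 5.0]
col_index = [0, 2, 1, 0, 0, 2]
x = [1.0, 2.0, 3.0]

y = sum_rows(mult(nz_values, gather(col_index, x)), 2)
```

Padding wastes a few multiplications, but it keeps every stream rectangular, which is what makes each step a plain data-parallel operation.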
Non-Graphics GPU Programming
● Close-to-metal (AMD)
  ● Very low-level
  ● NDA required
● Stream SDK (AMD)
  ● 1.0 in 2007, 2.0 in 2010
  ● Includes Brook+ (successor of BrookGPU)
● DirectCompute (Microsoft)
  ● Part of DirectX 11
  ● Computational model similar to CUDA & OpenCL
Non-Graphics GPU Programming
● CUDA (Feb 2007) / OpenCL
Evolution of GPGPU Applications
● GPU for General purpose
  ● Data structures for search
  ● Simulation / stochastic sampling
  ● Linear algebra / PDEs
● GPU for visual computing beyond rendering
  ● Computer vision
  ● Global illumination (raytracing, radiosity)
  ● Non-photorealistic rendering (NPR)
● Survey of early work!
General Comments
● Early approaches (pre-CUDA time) use aspects of fixed-function pipeline in a smart way
  ● Z-buffer (point-wise minimum)
  ● Triangle rasterization / tex-coord/depth interpolation
  ● Blending (in-place summation)
  ● Stencil buffer & early Z-test for flow control
  ● Simulate >8 bit values
  ● Exploit vec3/vec4 parallelism
● With CUDA/OpenCL
  ● Reformulate problem to make it data-parallel
  ● Hide memory latency
  ● Exploit shared memory
  ● Increase GPU occupancy
Voronoi Diagrams
● [Hoff et al., 1999]
● Use rasterization & Z-buffer HW to compute Voronoi cells
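The Z-buffer trick can be imitated in software. This is a minimal sketch, not the implementation from the paper: each site "renders" its distance to every pixel, and a less-than depth test keeps the nearest site. The function name `discrete_voronoi` and the use of squared distance (which gives the same ordering as true distance) are choices made for this example.

```python
# Software imitation of Voronoi-by-Z-buffer (illustrative names, no GPU calls).

def discrete_voronoi(sites, width, height):
    inf = float("inf")
    depth = [[inf] * width for _ in range(height)]   # Z-buffer
    owner = [[-1] * width for _ in range(height)]    # colour buffer: site id

    for site_id, (sx, sy) in enumerate(sites):
        # "Rasterize" this site's distance function over the whole image.
        for y in range(height):
            for x in range(width):
                d = (x - sx) ** 2 + (y - sy) ** 2
                if d < depth[y][x]:                  # depth test: keep minimum
                    depth[y][x] = d
                    owner[y][x] = site_id
    return owner


cells = discrete_voronoi([(0, 0), (3, 0)], width=4, height=1)
```

On the GPU the inner two loops are what the rasterizer does for free, and the comparison is the fixed-function depth test.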
Image Processing: Filtering
● [Hopf & Ertl, 1999]
  ● Gaussian filtering of volumetric (medical) data
  ● Uses EXT_convolution
● [Hadwiger et al., 2001]
  ● Texture filtering
  ● Register combiner / ATI fragment shaders
GPU Raytracing
● [Purcell et al, 2002]
● Simulates GPU-based raytracing on not-yet available HW
  – Simulator in SW
● Octree traversal
  – Looping and branching (not available in 2002)
  – Multi-pass rendering
Game-of-Life on the GPU
● [Harris et al., 2002] (Mark Harris coined the term GPGPU)
● Run extensions of cellular automata on the GPU
  ● Reaction-diffusion processes
    – Simpler variant of fluid dynamics
    – Simulate boiling water
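A cellular automaton fits the GPU because every cell is updated by the same small rule reading only its neighbours. The sketch below uses the plain Game of Life from the slide title rather than the reaction-diffusion extensions, with wrap-around borders as an assumption (similar to a repeating texture); `life_step` is an illustrative name.

```python
# One Game-of-Life generation written as a per-cell "kernel" (illustrative sketch).

def life_step(grid):
    height, width = len(grid), len(grid[0])

    def kernel(x, y):
        # Gather the 8 neighbours from the previous generation (wrap-around).
        neighbours = sum(
            grid[(y + dy) % height][(x + dx) % width]
            for dy in (-1, 0, 1)
            for dx in (-1, 0, 1)
            if (dx, dy) != (0, 0)
        )
        alive = grid[y][x]
        return 1 if neighbours == 3 or (alive and neighbours == 2) else 0

    # Every output cell is computed independently from the old grid,
    # so all kernel calls could run in parallel.
    return [[kernel(x, y) for x in range(width)] for y in range(height)]


blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
after_one = life_step(blinker)
after_two = life_step(after_one)
```

Writing into a fresh grid instead of updating in place corresponds to rendering into a second texture and swapping the two between passes.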
Stereo on the GPU
● [Yang et al., 2002]
● Plane-sweep approach for stereo
● Free viewpoint video / tele-presence
Stereo on the GPU
● [Zach et al., 2003]
● Hierarchical search for depth
● Use 8-bit RGB channels to encode 18 bit integers
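The slide does not say how the 18 bits are distributed over the channels. One possible layout, assumed here purely for illustration, is 6 bits per channel, which leaves 2 unused bits in each 8-bit channel. `encode_rgb` and `decode_rgb` are hypothetical helper names.

```python
# Packing an 18-bit integer into three 8-bit channels (assumed 6 bits each).

BITS_PER_CHANNEL = 6
CHANNEL_MASK = (1 << BITS_PER_CHANNEL) - 1   # 0b111111 = 63


def encode_rgb(value):
    if not 0 <= value < (1 << (3 * BITS_PER_CHANNEL)):
        raise ValueError("value does not fit into 18 bits")
    r = (value >> (2 * BITS_PER_CHANNEL)) & CHANNEL_MASK   # most significant
    g = (value >> BITS_PER_CHANNEL) & CHANNEL_MASK
    b = value & CHANNEL_MASK                               # least significant
    return (r, g, b)


def decode_rgb(rgb):
    r, g, b = rgb
    return (r << (2 * BITS_PER_CHANNEL)) | (g << BITS_PER_CHANNEL) | b


packed = encode_rgb(123456)
restored = decode_rgb(packed)
```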
Sparse Linear Algebra
● [Krueger & Westermann, 2003]
● Solve sparse linear systems
  – Conjugate gradients
  – Basic set of LinAlg operations
  – Most important: sparse matrix-vector multiplication
● Uses mix of vertex and fragment shaders
● Fluid flow
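Conjugate gradients needs only a few building blocks, which is why a small set of linear algebra operations is enough. The following is a textbook CG sketch on the CPU, not the shader implementation from the paper; the row-list matrix format and all function names are assumptions for this example, and the matrix is expected to be symmetric positive definite.

```python
# Textbook conjugate gradients from three primitives: spmv, dot, axpy.

def spmv(rows, x):
    # rows[i] is a list of (column, value) pairs for the nonzeros of row i
    return [sum(v * x[c] for c, v in row) for row in rows]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def axpy(alpha, x, y):
    # returns alpha * x + y
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(rows, b, iterations=50, tol=1e-12):
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the start vector x = 0
    p = list(r)            # search direction
    rr = dot(r, r)
    for _ in range(iterations):
        if rr < tol:       # squared residual norm small enough
            break
        ap = spmv(rows, p)
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)
        rr = rr_new
    return x


# A = [[4, 1], [1, 3]], b = [1, 2]
A = [[(0, 4.0), (1, 1.0)], [(0, 1.0), (1, 3.0)]]
solution = conjugate_gradient(A, [1.0, 2.0])
```

Each loop iteration performs one sparse matrix-vector product, two dot products (reductions) and three vector updates, all of which are data-parallel.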
Sparse Linear Algebra 2
● [Bolz et al., 2003]
● Sparse matrix-vector multiplication
● Multigrid
● Surface denoising & particle advection
Level-set segmentation
● [Lefohn et al., 2003]
● Level-set segmentation & volume rendering
● Mixed approach for level sets
  – Bookkeeping done by CPU
  – PDE updates in narrow band on the GPU
Scan / Reductions
● All-prefix-sums
  ● Generate all partial sums in a sequence
● O(n log(n)) algorithm based on recursive doubling
● Applications
  – Collision detection [Horn 2005]
  – Integral images for fast box filtering [Hensley et al., 2005]
$b_i := \sum_{j=1}^{i} a_j$
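Recursive doubling can be sketched as follows. In every pass each element adds the value that lies `offset` positions to its left in the previous buffer, and `offset` doubles from pass to pass, giving about log2(n) passes of n operations each. `inclusive_scan` is an illustrative name; each pass would be one render pass on the GPU.

```python
# All-prefix-sums by recursive doubling: O(n log n) work, about log2(n) passes.

def inclusive_scan(a):
    b = list(a)
    offset = 1
    while offset < len(b):
        # One pass: read only from the old buffer `b`, write a new buffer.
        b = [b[i] + b[i - offset] if i >= offset else b[i]
             for i in range(len(b))]
        offset *= 2
    return b


partial_sums = inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])
```

A sequential loop would need only n additions, so this variant trades extra work for a much shorter chain of dependent steps.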
Scan / Reductions
● Horn 2005
Compaction
● Extract all elements from a sequence satisfying a predicate
● E.g. histogram pyramids [Ziegler et al., 2006]
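The idea behind a histogram pyramid can be shown with a simplified 1D variant. This is a sketch under stated assumptions, not the method as published: it works on a 1D sequence with pairwise reduction, and the function names are invented for the example. The pyramid stores how many selected elements lie below each node, and each output element finds its source by walking down from the top.

```python
# Simplified 1D histogram pyramid for stream compaction (illustrative sketch).

def build_pyramid(flags):
    # flags: list of 0/1 whose length is a power of two
    levels = [list(flags)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels            # levels[-1][0] is the total number of hits


def find_kth(levels, k):
    # Walk from the top level down to the base to locate the k-th hit.
    idx = 0
    for level in reversed(levels[:-1]):
        idx *= 2             # move to the left child
        left_count = level[idx]
        if k >= left_count:  # k-th hit is in the right child
            k -= left_count
            idx += 1
    return idx


def compact(values, predicate):
    size = 1
    while size < len(values):
        size *= 2            # pad to a power of two with "not selected"
    flags = [1 if predicate(v) else 0 for v in values]
    flags += [0] * (size - len(values))
    levels = build_pyramid(flags)
    total = levels[-1][0]
    # Each output element is found independently, so this loop is parallel.
    return [values[find_kth(levels, k)] for k in range(total)]


kept = compact([5, 2, 8, 1, 9, 3], lambda v: v > 4)
```

The output size is known from the top of the pyramid before any element is written, which is useful when the result has to be allocated as a texture.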
Dynamic Programming
● [Zach et al., 2006]
● DP for stereo based on recursive doubling (O(n log n) complexity)
● Not needed anymore with CUDA/OpenCL programming model
2006-Today: Plethora of GPU Applications
● Graphics
  ● Global illumination
  ● Tone mapping / NPR
  ● Cloth simulation / animation
● Computer vision & image processing
  ● Image features & matching
  ● Segmentation / medical imaging
  ● Stereo & 3D reconstruction from images
● Scientific computing
  ● Monte-Carlo simulation for quantum mechanics, climate research etc.
  ● PDEs for fluid flow, heat flow...
  ● Bioinformatics (genome sequence alignment)
● Video encoding
● Brute-force password cracking
CUDA Showcase
● Currently 1260 applications registered (imaging subset shown)
Acknowledgements
● Lots of internet resources and papers
● Owens et al., A Survey of General-Purpose Computation on Graphics Hardware, 2007