Page 1

Computing using GPUs

● Introduction, architecture overview
● CUDA and OpenCL programming models (recap)
● Programming tools: CUDA Fortran, debuggers, profilers, libraries
● Software and applications: Mathematica, MATLAB, Linpack, CFD, imaging, ...
● Visualization, enhancing the browser experience (WebGL)
● GPU clusters, power issues
● Future trends?

Page 2

High performance computing

● “Grand computational challenges” – lattice QCD, astrophysics, Earth system models, full-scale traffic simulation, LHC data analysis, etc. – large synergies, sometimes custom hardware
● Many compute-intensive scientific and engineering problems (e.g. CFD)
● Supercomputers built increasingly from off-the-shelf hardware
● www.top500.org
● Requirements for multimedia

Page 3

Trends in architecture

● Feature size (32 nm) and clock frequency (~3 GHz) approaching their limits (the 'power wall')
● At 3 GHz a clock cycle lasts ~0.3 ns, in which light travels only ~10 cm, roughly the size of a processor package
● Power (~160 W per processor) becoming an issue
● Instruction-level parallelism already fully exploited
● Chip makers now actively exploring multi-core architectures

Page 4

Memory

● Memory performance has improved more slowly than CPU performance
● Addressed with a hierarchy of caches
● A large portion of the processor die is dedicated to memory management

[Die photos: Nehalem vs. Pentium (1993)]

Page 5

Multi-core and GPUs

● Chip makers are looking into future multi-core architectures
● See e.g. www.intel.com/go/terascale/

● On the other hand, VGA controllers have been evolving into powerful programmable multiprocessors since the 1990s
● More of the transistor budget is allocated to arithmetic operations
● Early GPUs could be programmed only with graphics-specific languages
● Graphics processors lacked various features (instructions, memory hardware, floating point) required to support general-purpose computing

Page 6

GPGPUs

● NVIDIA recognized the potential of GPUs for general-purpose computing and launched CUDA (Compute Unified Device Architecture) around 2007
● Allows the latest NVIDIA GPUs to be used as general-purpose processors
● Programs are written in an extension of C99

Page 7

Architecture

Page 8

Programming model

● A host (normally a CPU) + several devices (GPUs)
● A program is structured into host code and device code (kernels)
● Workflow: move data to the device(s) → load programs (kernels) onto the device → move data back
● Running kernels, moving data between host and device, and coarse synchronization are managed through queues
● SIMT model – kernels are executed by many independent threads
● Multiple threads per processor
● Threads are scheduled in 'blocks' ('warps')
● Threads communicate through local and global memories
● Partitioning of the problem is programmed manually
● Programming for performance is challenging (see the sketch after this list):
   ● Aligned and coalesced memory access
   ● Use local memory instead of global ('manual caching')
   ● Minimize host/device data transfers
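The three points above are easiest to see in code. Below is a minimal sketch, not taken from the original slides (the kernel name reverseTiles and all sizes are illustrative only): each thread block stages a tile of the input in shared memory ('manual caching'), global loads and stores stay aligned and coalesced, and there is exactly one host-to-device and one device-to-host transfer.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void reverseTiles(const float *in, float *out, int n)
{
    __shared__ float tile[256];                 // per-block 'manual cache' in shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];              // coalesced global load: consecutive threads
    __syncthreads();                            // read consecutive addresses
    if (i < n)
        out[i] = tile[blockDim.x - 1 - threadIdx.x];  // reversal done in shared memory;
}                                                     // the global store stays coalesced

int main(void)
{
    const int n = 1 << 20;                      // multiple of the block size
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);   // one transfer in ...

    reverseTiles<<<n / 256, 256>>>(d_in, d_out, n);

    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);  // ... one transfer out
    printf("h[0] = %g (expect 255)\n", h[0]);             // first 256-element tile reversed
    cudaFree(d_in); cudaFree(d_out); free(h);
    return 0;
}

Because the reordering happens entirely in on-chip shared memory, the slow global memory only ever sees contiguous, aligned accesses.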

Page 9

CUDA C

● See e.g. a tutorial at www.drdobbs.com//high-performance-computing/207200659
● NVIDIA nvcc compiler, C99 extension

#include <cuda.h>

__global__ void incrementArrayOnDevice(float *a, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] + 1.f;
}

int main(void)
{
    float *a_h, *b_h;   // pointers to host memory
    float *a_d;         // pointer to device memory
    int i, N = 10;
    size_t size = N*sizeof(float);
    a_h = (float *)malloc(size);
    b_h = (float *)malloc(size);
    cudaMalloc((void **) &a_d, size);
    for (i=0; i<N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);
    int blockSize = 4;
    int nBlocks = N/blockSize + (N%blockSize == 0 ? 0 : 1);
    incrementArrayOnDevice<<< nBlocks, blockSize >>>(a_d, N);
    cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    free(a_h); free(b_h); cudaFree(a_d);
}
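For reference (not on the original slide): assuming the listing is saved as increment.cu, it is built with NVIDIA's nvcc compiler and run like any ordinary executable:

nvcc -o increment increment.cu
./increment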

Page 10

OpenCL

● Open standard for GPU computing http://www.khronos.org/opencl/
● Khronos group, 100+ members
● Supported by AMD
● Less mature than CUDA, but gaining momentum
● Based on JIT compilation of kernels
● Check out http://www.khronos.org/developers/resources/opencl/

Page 11

// create an OpenCL context and a command queue on a GPU device
context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
queue = clCreateCommandQueue(context, NULL, 0, NULL);

// device buffers: input copied from host memory, output read/write
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float)*2*num_entries, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            sizeof(float)*2*num_entries, NULL, NULL);

// build the kernel at run time (JIT) from its source string
program = clCreateProgramWithSource(context, 1, &fft1D_1024_kernel_src, NULL, NULL);
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
kernel = clCreateKernel(program, "fft1D_1024", NULL);

// work sizes, set before they are used for the local-memory arguments below
global_work_size[0] = num_entries;
local_work_size[0] = 64;

// kernel arguments: the two buffers plus two local (per-work-group) scratch arrays
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
clSetKernelArg(kernel, 2, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, sizeof(float)*(local_work_size[0]+1)*16, NULL);

// enqueue the kernel for execution
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       global_work_size, local_work_size, 0, NULL, NULL);

.....

__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];
    ......
}

Page 12

CUDA Fortran

● From the Portland Group: http://www.pgroup.com/resources/cudafortran.htm

subroutine mmul( A, B, C )
   use cudafor
   real, dimension(:,:) :: A, B, C
   integer :: N, M, L
   real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
   type(dim3) :: dimGrid, dimBlock

   N = size(A,1) ; M = size(A,2) ; L = size(B,2)
   allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )
   Adev = A(1:N,1:M)
   Bdev = B(1:M,1:L)
   dimGrid = dim3( N/16, L/16, 1 )
   dimBlock = dim3( 16, 16, 1 )
   call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )
   C(1:N,1:L) = Cdev
   deallocate( Adev, Bdev, Cdev )
end subroutine

attributes(global) subroutine MMUL_KERNEL( A, B, C, N, M, L )
   real, device :: A(N,M), B(M,L), C(N,L)
   ...
   real, shared :: Ab(16,16), Bb(16,16)
   real :: Cij

   tx = threadidx%x ; ty = threadidx%y

   .....
end subroutine

Page 13

PyCUDA and PyOpenCL

● PyCUDA: a Python wrapper for CUDA C http://mathema.tician.de/software/pycuda
● Similar bindings exist for other languages: CUDA.NET, jCUDA, ...

import pycuda.compiler as comp
import pycuda.driver as drv
import numpy
import pycuda.autoinit

mod = comp.SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400,1,1))

print dest-a*b

Page 14

Thrust

● STL-like library for CUDA code.google.com/p/thrust/

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void)
{
    // generate 32M random numbers on the host
    thrust::host_vector<int> h_vec(32 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on the device (846M keys per second on a GeForce GTX 480)
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to the host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    return 0;
}

Page 15

Debuggers and profilers

● cuda-gdb, cuda-memcheck (part of the NVIDIA SDK; usage sketch below)
● Allinea DDT www.allinea.com/
● RogueWave TotalView
● NVIDIA Parallel Nsight (Windows only)
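A brief usage sketch (the executable name ./myapp is only a placeholder, not from the slides); the command-line tools are invoked much like their CPU counterparts:

cuda-memcheck ./myapp    (reports out-of-bounds and misaligned device memory accesses)
cuda-gdb ./myapp         (gdb-style source-level debugging of kernels)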

Page 16

Software exploiting GPUs

● CUDA libraries: CUFFT, CUBLAS, math.h, CURAND, NPP (a minimal CUBLAS call is sketched below)
● MATLAB and Mathematica
● CST Microwave Studio, ANSYS, AMBER 11, CERN Level-2 trigger software, CFD (OpenCurrent), H.264 video codec, biostatistics, finance, medical imaging (Siemens, CERA CT software, MRI software from GE), microtomography (ESRF, CFEL, ...), visualization, autonomous car navigation, ...
● Comparing a Nehalem quad-core to a C2050 (1-2 per system), most applications report 5x-20x speedups
● PCIe is becoming a bottleneck (8 GB/s max)
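As an illustration of how the CUDA libraries above are used, here is a minimal CUBLAS sketch (not from the original slides; the SAXPY operation, sizes and values are chosen only for illustration). It computes y = alpha*x + y on the GPU through the library's v2 API, with no hand-written kernel, and would typically be built with nvcc and linked with -lcublas.

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;
    const float alpha = 2.0f;
    float *x_h = (float *)malloc(n * sizeof(float));
    float *y_h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { x_h[i] = 1.0f; y_h[i] = 2.0f; }

    float *x_d, *y_d;                                    // device copies of x and y
    cudaMalloc((void **)&x_d, n * sizeof(float));
    cudaMalloc((void **)&y_d, n * sizeof(float));
    cudaMemcpy(x_d, x_h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y_h, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);                               // initialize the library
    cublasSaxpy(handle, n, &alpha, x_d, 1, y_d, 1);      // y = alpha*x + y on the device
    cublasDestroy(handle);

    cudaMemcpy(y_h, y_d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %g (expect 4)\n", y_h[0]);            // 2*1 + 2 = 4

    cudaFree(x_d); cudaFree(y_d); free(x_h); free(y_h);
    return 0;
}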

Page 17

Visualization – ray tracing

● mental images iray
● Realistic rendering of CAD-generated scenes with ray tracing
● Uses a GPU cluster in the cloud to compute the scene
● Navigable in real time

Page 18

Visualization – WebGL

● http://www.khronos.org/webgl/
● Accelerated graphics in the browser, embedded in HTML5 tags and programmable through a JavaScript API similar to OpenGL
● Expected to be supported by all major browsers during 2011
● Demos: https://sites.google.com/a/chromium.org/dev/developers/demos-gpu-acceleration-and-webgl
● Probably a great choice for scientific visualization (integrated with the browser)
● http://dan.lecocq.us/wordpress/tag/webglot/ – a high-performance visualization project

Page 19

GPU clusters

● In the November 2010 Top500 list, #1 (Tianhe-1A, 2.5 PFlops) and #3 (Nebulae, 1.27 PFlops) are GPU clusters
● Longhorn visualization cluster with 256 nodes, 2 GPUs/node, InfiniBand, Lustre, attached to the Ranger supercomputer (#14 on the Top500) www.tacc.utexas.edu
● Power consumption is becoming a serious issue (Jaguar at Oak Ridge: 6.9 MW)
● With current trends, future HPC machines ('exascale') might consume ~100 MW (!)
● which would cost several $M per year in electricity
● US data centres accounted for 2% of energy consumption in 2007
● Memory consumes a substantial portion of the energy for systems with >128 GB (a 4 GB DIMM draws ~5 W)
● GPUs (T1060 at 4 GFlops/Watt) are more power-efficient than CPUs (i7 at 0.8 GFlops/Watt), but heat dissipation is still a serious problem; there is room to improve power efficiency further
● Ongoing research on reducing GPU power consumption by clocking down and by software techniques such as efficient scheduling and virtualization (gVirtuS)

Page 20

NVIDIA's roadmap

● More energy-efficient GPUs

Page 21

AMD's roadmap

● Active player in the market
● ATI Stream SDK for GPU programming
● OpenCL support
● New architecture announced (fusion.amd.com/): an x86 CPU with programmable vector processing engines on a single die (Accelerated Processing Unit)

Page 22

Conclusions/outlook

● GPGPU computing has happened.
● Will GPU-based HPC be sustainable?
● Will sufficient open source software emerge?
● Will open standards mature and become established?
● Writing software is the hardest part (most of the software mentioned was developed in collaboration with NVIDIA engineers)
● To make use of the available FLOPS, a lot of application software will need to be rewritten.
● Software usually outlives hardware (what did gcc run on in 1987?) → sustainable, open programming models (like MPI) are required
● There is space and time to rethink algorithms and approaches
● 2011 promises to be an exciting year for GPU computing

Page 23

Amdahl's law

● Optimize the parts where the code spends most of its time first (see the formula below)
● Profile your code
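Stated as a formula (the slide itself only names the law): if a fraction p of the runtime can be accelerated by a factor s, the overall speedup is bounded by

S(p, s) = \frac{1}{(1 - p) + p/s}

For example, accelerating a part that accounts for 80% of the runtime (p = 0.8) by s = 20 yields only 1 / (0.2 + 0.04) ≈ 4.2x overall, which is why profiling and optimizing the dominant parts first pays off.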

