Algorithm Engineering „GPGPU“

Algorithm Engineering

„GPGPU“

Stefan Edelkamp

Graphics Processing Units

• GPGPU = (GP)²U: General Purpose Programming on the GPU
• „Parallelism for the masses“
• Applications: Fourier transformation, model checking, bioinformatics; see CUDA-ZONE

Programming the Graphics Processing Unit with CUDA

Overview

• Cluster / Multicore / GPU comparison
• Computing on the GPU
• GPGPU languages
• CUDA
• Small Example

Cluster / Multicore / GPU

• Cluster system
  • many individual systems, each one with:
    • one (or more) processors
    • internal memory
    • often an HDD
  • communication over the network
    • slow compared to internal memory
    • no shared memory

[Diagram: several cluster nodes, each with its own CPU, RAM and HDD, connected through a switch]

Cluster / Multicore / GPU

• Multicore system
  • multiple CPUs
  • shared RAM
  • external memory on HDD
  • communication over RAM

[Diagram: four CPUs (CPU1-CPU4) sharing one RAM, with an HDD attached]

Cluster / Multicore / GPU

• System with a Graphics Processing Unit
  • many (240) parallel processing units
  • hierarchical memory structure

[Diagram: CPU with RAM and hard disk drive, communicating over the PCI bus with the graphics card, which holds the GPU, shared RAM (SRAM) and video RAM (VRAM)]

Overview

• Cluster / Multicore / GPU comparison
• Computing on the GPU
• GPGPU languages
• CUDA
• Small Example

Computing on the GPU

• Hierarchical execution
  • Groups - executed sequentially
  • Threads - executed in parallel, lightweight (creation / switching nearly free)
• One kernel function is executed by each thread

[Diagram: groups of threads (Group 0, ...)]
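
A minimal sketch of this execution model (the names whoAmI, group_d and thread_d are illustrative, not from the slides): every thread runs the same kernel function; blockIdx identifies its group, threadIdx its position inside the group.

__global__ void whoAmI(int *group, int *thread) {
  int id = blockDim.x * blockIdx.x + threadIdx.x;  // global thread id
  group[id]  = blockIdx.x;                         // group (block) of this thread
  thread[id] = threadIdx.x;                        // position inside the group
}

// launched e.g. as: whoAmI<<<4, 64>>>(group_d, thread_d);  // 4 groups of 64 threads each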

Computing on the GPU

• Hierarchical memory
  • Video RAM - 1 GB, comparable to main RAM
  • Shared RAM in the GPU - 16 KB, comparable to registers, parallel access by the threads

[Diagram: graphics card with the GPU, shared RAM and video RAM]

Example architecture: G200, e.g. in the GTX 280

Example Problems

Ranking and Unranking with Parity

2-Bit BFS

1-Bit BFS

Sliding-Tile Puzzles

Some Results …

Further Results …

Overview

• Cluster / Multicore / GPU comparison
• Computing on the GPU
• GPGPU languages
• CUDA
• Small Example

GPGPU Languages

• RapidMind
  • supports multicore, ATI, NVIDIA and Cell
  • C++ code is analysed and compiled for the target hardware
• Accelerator (Microsoft)
  • library for .NET languages
• BrookGPU (Stanford University)
  • supports ATI, NVIDIA
  • own language, a variant of ANSI C

Overview

• Cluster / Multicore / GPU comparison
• Computing on the GPU
• Programming languages
• CUDA
• Small Example

CUDA

• Programming language similar to C
• File suffix .cu
• Own compiler called nvcc
• Can be linked with C code

CUDA

[Build flow: C++ code is compiled with GCC, CUDA code with nvcc, and both are linked with ld into a single executable]

CUDA

• Additional variable types
  • dim3
  • int3
  • char3
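
A small sketch of these types (the concrete values are arbitrary; make_int3 and make_char3 are CUDA's helper constructors):

#include <cuda_runtime.h>

int main() {
  dim3 grid(8, 8);                    // unspecified components default to 1
  int3  p = make_int3(1, 2, 3);       // components accessed as p.x, p.y, p.z
  char3 c = make_char3('a', 'b', 'c');
  return (grid.z + p.z + c.z > 0) ? 0 : 1;  // just to use the values
}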

CUDA

• Different types of functions
  • __global__ - invoked from the host
  • __device__ - called from the device
• Different types of variables
  • __device__ - located in VRAM
  • __shared__ - located in SRAM
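
A hedged sketch of where these qualifiers go (the names scale, square, squareAndScale and the buffer size are illustrative, not from the slides):

__device__ int scale = 2;                  // global variable in VRAM

__device__ int square(int x) {             // callable from device code only
  return x * x;
}

__global__ void squareAndScale(int *out) { // launched from the host
  __shared__ int buffer[256];              // per-group scratch space in SRAM
  int tid = threadIdx.x;                   // assumes at most 256 threads per group
  buffer[tid] = square(tid) * scale;
  __syncthreads();
  out[blockDim.x * blockIdx.x + tid] = buffer[tid];
}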

CUDA

• Calling the kernel function
  • name<<<dim3 grid, dim3 block>>>(...)
  • grid dimensions - number of groups
  • block dimensions - number of threads per group
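
A minimal host-side launch snippet, reusing the squareAndScale kernel from the sketch above (group and thread counts are arbitrary; out_d is assumed to be a sufficiently large device pointer):

dim3 block(256);                         // 256 threads per group
dim3 grid(16);                           // 16 groups
squareAndScale<<<grid, block>>>(out_d);  // every thread executes the kernel once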

CUDA

• Memory handling
  • cudaMalloc(...) - allocate VRAM
  • cudaMemcpy(...) - copy memory
  • cudaFree(...) - free VRAM
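
A minimal round trip through the memory API (N and the array contents are arbitrary):

#include <cuda_runtime.h>

int main() {
  const int N = 1024;
  int host[N];
  for (int i = 0; i < N; i++) host[i] = i;

  int *dev = NULL;
  cudaMalloc((void**)&dev, N * sizeof(int));                      // allocate VRAM
  cudaMemcpy(dev, host, N * sizeof(int), cudaMemcpyHostToDevice); // host -> device
  // ... launch kernels working on dev ...
  cudaMemcpy(host, dev, N * sizeof(int), cudaMemcpyDeviceToHost); // device -> host
  cudaFree(dev);                                                  // release VRAM
  return 0;
}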

CUDA

• Distinguishing threads
  • gridDim - number of groups
  • blockDim - number of threads per group
  • blockIdx - id of the group (starting with 0)
  • threadIdx - id of the thread within its group (starting with 0)
• Global id = blockDim.x * blockIdx.x + threadIdx.x
  • e.g. blockDim.x = 256, blockIdx.x = 2, threadIdx.x = 5 gives id = 256 * 2 + 5 = 517

Overview

Cluster / Multicore / GPU comparisonComputing on the GPUProgramming languagesCUDASmall Example

CUDA

Sequential CPU version:

void inc(int *a, int b, int N) {
  for (int i = 0; i < N; i++)
    a[i] = a[i] + b;
}

int main() {
  ...
  inc(a, b, N);
}

CUDA version (one thread per element):

__global__ void inc(int *a, int b, int N) {
  int id = blockDim.x * blockIdx.x + threadIdx.x;
  if (id < N)
    a[id] = a[id] + b;
}

int main() {
  ...
  int *a_d;
  cudaMalloc((void**)&a_d, N * sizeof(int));
  cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);
  dim3 dimBlock(blocksize, 1, 1);
  dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);
  inc<<<dimGrid, dimBlock>>>(a_d, b, N);
}
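
The slides stop at the kernel launch; a hedged completion of the host code would copy the result back and release the VRAM afterwards:

cudaMemcpy(a, a_d, N * sizeof(int), cudaMemcpyDeviceToHost);  // fetch the incremented array
cudaFree(a_d);                                                // release the VRAM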

Real-World Example

• LTL model checking
  • traversing an implicit graph G = (V, E)
  • vertices are called states
  • edges are represented by transitions
  • duplicate removal is needed

Real-World Example

• External model checking
  • generate the graph with external BFS
  • each BFS layer needs to be sorted
• The GPU has proven to be fast at sorting

Real-World Example

• Challenges
  • millions of states in one layer
  • huge state size
  • fast access only in SRAM
  • elements need to be moved

Real-World Example

• Solutions:
  • Gpuqsort
    • quicksort optimized for GPUs
    • intensive swapping in VRAM
  • Bitonic-based sorting
    • fast for subgroups
    • concatenating the groups is slow

Real-World Example

• Our solution
  • states S are presorted by their hash H(S)
  • each bucket is sorted in SRAM by one group

[Diagram: VRAM and SRAM]

Real-World Example

• Our solution
  • the order is given by (H(S), S)
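
A hedged sketch of this idea (not the original implementation): each group loads one bucket of presorted states into SRAM and sorts it by the key (H(S), S) with an odd-even transposition sort. state_t, H and BUCKET_SIZE are illustrative assumptions.

#include <cuda_runtime.h>
#include <stdint.h>

#define BUCKET_SIZE 256                   // bucket must fit into the 16 KB SRAM

typedef uint32_t state_t;                 // placeholder for a packed state

__device__ uint32_t H(state_t s) {        // hypothetical hash function
  return s * 2654435761u;                 // Knuth-style multiplicative hash
}

__global__ void sortBuckets(state_t *states) {
  __shared__ state_t bucket[BUCKET_SIZE]; // one bucket per group in SRAM

  int tid  = threadIdx.x;
  int base = blockIdx.x * BUCKET_SIZE;

  bucket[tid] = states[base + tid];       // load the bucket from VRAM
  __syncthreads();

  // odd-even transposition sort on the key (H(S), S)
  for (int phase = 0; phase < BUCKET_SIZE; ++phase) {
    int i = 2 * tid + (phase & 1);
    if (i + 1 < BUCKET_SIZE) {
      state_t a = bucket[i], b = bucket[i + 1];
      if (H(a) > H(b) || (H(a) == H(b) && a > b)) {
        bucket[i]     = b;                // swap out-of-order neighbours
        bucket[i + 1] = a;
      }
    }
    __syncthreads();
  }

  states[base + tid] = bucket[tid];       // write the sorted bucket back to VRAM
}

// launched e.g. as: sortBuckets<<<numBuckets, BUCKET_SIZE>>>(states_d);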

Real-World Example

Results

Questions???

Programming the GPU