General Purpose Computing Using Graphics Hardware

Hanspeter Pfister, Harvard University
Acknowledgements
• Won-Ki Jeong, Harvard University
• Kayvon Fatahalian, Stanford University
GPU (Graphics Processing Unit)
• PC hardware dedicated to 3D graphics
  – Massively parallel SIMD processor
• Performance pushed by the game industry

[Image: NVIDIA SLI system]
GPGPU
• General Purpose computation on the GPU
  – Started in the computer graphics research community
  – Mapping computational problems to the graphics rendering pipeline

[Image courtesy Jens Krueger, Aaron Lefohn, and Won-Ki Jeong]
Why GPU for computing?
• GPU is fast
  – Massively parallel
    • CPU: ~4 cores (16 SIMD lanes) @ 3.2 GHz (Intel Quad Core)
    • GPU: ~30 cores (240 SIMD lanes) @ 1.3 GHz (NVIDIA GT200)
  – High memory bandwidth
• Programmable
  – NVIDIA CUDA, DirectX Compute Shader, OpenCL
• High-precision floating point support
  – 64-bit floating point (IEEE 754)
• Inexpensive desktop supercomputer
  – NVIDIA Tesla C1060: ~1 TFLOPS @ $1000
[Figure: peak FLOPS over time, GPU vs. CPU. Image courtesy NVIDIA]
[Figure: memory bandwidth over time, GPU vs. CPU. Image courtesy NVIDIA]
GPGPU Biomedical Examples
• Level-Set Segmentation (Lefohn et al.)
• EM Image Processing (Jeong et al.)
• Image Registration (Strzodka et al.)
• CT/MRI Reconstruction (Sumanaweera et al.)
Overview
1. GPU Architecture Overview
2. GPU Programming Overview
  – Programming Model
  – NVIDIA CUDA
  – OpenCL
3. Application Example
  – CUDA ITK

1. GPU Architecture Overview

Kayvon Fatahalian, Stanford University
What's in a GPU?

[Block diagram: many compute cores and texture units (Tex), plus input assembly, rasterizer, output blend, video decode, and a work distributor]

Heterogeneous chip multi-processor (highly tuned for graphics). HW or SW?
CPU-"style" cores

[Diagram: one core containing Fetch/Decode, ALU (Execute), and Execution Context, surrounded by out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache]
Slimming down

Idea #1: Remove the components that help a single instruction stream run fast.

[Diagram: the slimmed core keeps only Fetch/Decode, ALU (Execute), and Execution Context]
Two cores (two threads in parallel)

[Diagram: two slimmed cores, each with Fetch/Decode, ALU (Execute), and Execution Context; thread 1 and thread 2 each run the same compiled fragment shader:]

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)
Four cores (four threads in parallel)

[Diagram: four slimmed cores, each with Fetch/Decode, ALU (Execute), and Execution Context]
Sixteen cores (sixteen threads in parallel)

[Diagram: sixteen slimmed cores, each with its own ALUs]

16 cores = 16 simultaneous instruction streams
Instruction stream sharing

[Diagram: many fragments, all shaded by the same compiled <diffuseShader> listed above]

But... many threads should be able to share an instruction stream!
Recall: simple processing core

[Diagram: Fetch/Decode, ALU (Execute), Execution Context]
Add ALUs

Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs (SIMD processing).

[Diagram: one Fetch/Decode unit feeding ALU 1 through ALU 8, with eight execution contexts (Ctx) and shared Ctx data]
Modifying the code

Original compiled shader (processes one thread using scalar ops on scalar registers):

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)
Modifying the code

New compiled shader (processes 8 threads using vector ops on vector registers):

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul vec_o0, vec_r0, vec_r3
VEC8_mul vec_o1, vec_r1, vec_r3
VEC8_mul vec_o2, vec_r2, vec_r3
VEC8_mov vec_o3, l(1.0)
[Diagram: the eight fragments (1 through 8) are mapped onto ALU 1 through ALU 8, all driven by the single VEC8 instruction stream]
128 threads in parallel

16 cores = 128 ALUs = 16 simultaneous instruction streams
But what about branches?

[Diagram: ALU 1 ... ALU 8 executing the shader below in lockstep over time (clocks)]

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

When the eight lanes diverge (e.g., mask T T T F F F F F), the taken and not-taken paths execute one after the other, with each lane masked off on the path it did not take.

Not all ALUs do useful work! Worst case: 1/8 performance.
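As a concrete illustration in CUDA terms (a hypothetical sketch, not from the slides), threads in one warp that take different sides of a branch are serialized; here half of each warp takes each side, so the two paths run back to back:

// Hypothetical sketch: branch divergence inside a warp.
// Even and odd lanes take different paths, so the hardware runs the "if" side
// and the "else" side one after the other, masking off the inactive lanes.
__global__ void divergentKernel(const float* x, float* refl, float Ks, float Ka, float expo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float v = x[i];
    if ((threadIdx.x & 1) == 0) {          // half of every warp
        float y = powf(v, expo) * Ks;      // "then" path
        refl[i] = y + Ka;
    } else {
        refl[i] = Ka;                      // "else" path
    }
}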
Clarification
• Option 1: Explicit vector instructions
  – Intel/AMD x86 SSE, Intel Larrabee
• Option 2: Scalar instructions, implicit HW vectorization
  – HW determines instruction stream sharing across ALUs (amount of sharing hidden from software)
  – NVIDIA GeForce ("SIMT" warps), ATI Radeon architectures

SIMD processing does not imply SIMD instructions.
In practice: 16 to 64 threads share an instruction stream.
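A minimal sketch of "Option 1" (hypothetical example, plain host C++ with x86 SSE intrinsics): one instruction stream explicitly operates on 4 floats per instruction. Assumes n is a multiple of 4.

#include <xmmintrin.h>

// Explicit SIMD: each _mm_add_ps performs 4 float additions at once.
void add4(const float* a, const float* b, float* c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            // load 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));   // 4 adds in one instruction
    }
}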
Stalls!
Texture access latency = hundreds to thousands of cycles.

We've removed the fancy caches and logic that help avoid stalls.

Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
But we have LOTS of independent threads.
Idea #3: Interleave processing of many threads on a single core to avoid stalls caused by high-latency operations.
Hiding stalls

[Diagram: one core with Fetch/Decode, eight ALUs, and its context storage partitioned into four groups holding threads 1-8, 9-16, 17-24, and 25-32]

[Animation over time (clocks): group 1 runs until it stalls on a long-latency operation; the core then switches to group 2, then group 3, then group 4; when a stalled group's data arrives, it becomes runnable again]
Throughput!

[Diagram over time (clocks): each group in turn starts, stalls, later becomes runnable, and finishes; the stalls of one group are hidden by running the other groups]

Increase the run time of one group to maximize the throughput of many groups.
Storing contexts
[Diagram: Fetch/Decode, eight ALUs, and a pool of context storage (32 KB)]
Twenty small contexts
[Diagram: the 32 KB pool split into twenty small contexts (maximal latency-hiding ability)]
Twelve medium contexts
[Diagram: the same pool split into twelve medium contexts]
Four large contexts
[Diagram: the same pool split into four large contexts (low latency-hiding ability)]
GPU block diagram key
• = single "physical" instruction stream fetch/decode (functional unit control)
• = SIMD programmable functional unit (FU), control shared with other functional units; this functional unit may contain multiple 32-bit "ALUs"
• = execution context storage
• = fixed-function unit
• = 32-bit mul-add unit / 32-bit multiply unit
Example: NVIDIA GeForce GTX 280
• NVIDIA-speak:
  – 240 stream processors
  – "SIMT execution" (automatic HW-managed sharing of the instruction stream)
• Generic speak:
  – 30 processing cores
  – 8 SIMD functional units per core
  – 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/clock)
  – Best case: 240 mul-adds + 240 muls per clock
  – 1.3 GHz clock
  – 30 * 8 * (2 + 1) * 1.3 ≈ 933 GFLOPS
• Mapping data-parallelism to the chip:
  – Instruction stream shared across 32 threads
  – 8 threads run on the 8 SIMD functional units in one clock
GTX 280 core

[Die-level diagram: the 30 cores grouped into clusters that share texture units (Tex), plus Zcull/Clip/Rast, Output Blend, and the Work Distributor]
Example: ATI Radeon 4870
• AMD/ATI-speak:
  – 800 stream processors
  – Automatic HW-managed sharing of the scalar instruction stream (like "SIMT")
• Generic speak:
  – 10 processing cores
  – 16 SIMD functional units per core
  – 5 mul-adds per functional unit (5 * 2 = 10 flops/clock)
  – Best case: 800 mul-adds per clock
  – 750 MHz clock
  – 10 * 16 * 5 * 2 * 0.75 = 1.2 TFLOPS
• Mapping data-parallelism to the chip:
  – Instruction stream shared across 64 threads
  – 16 threads run on the 16 SIMD functional units in one clock
ATI Radeon 4870 core

[Die-level diagram: the 10 cores with their texture units (Tex), plus Zcull/Clip/Rast, Output Blend, and the Work Distributor]
Summary: three key ideas
1. Use many "slimmed down" cores to run in parallel.
2. Pack cores full of ALUs (by sharing an instruction stream across groups of threads).
  – Option 1: Explicit SIMD vector instructions
  – Option 2: Implicit sharing managed by hardware
3. Avoid latency stalls by interleaving execution of many groups of threads.
  – When one group stalls, work on another group.
2. GPU Programming Models
Programming Model
NVIDIA CUDA
OpenCL
Task parallelism
• Distribute the tasks across processors based on their dependencies
• Coarse-grain parallelism

[Diagram: a task dependency graph (Tasks 1 through 9) and its assignment across three processors (P1, P2, P3) over time]
Data parallelism
• Run a single kernel over many elements
  – Each element is independently updated
  – The same operation is applied to each element
• Fine-grain parallelism
  – Many lightweight threads, easy to switch context
  – Maps well to an ALU-heavy architecture: the GPU

[Diagram: one kernel applied in parallel to data elements P1, P2, P3, ..., Pn]
GPU-friendly Problems
• Data-parallel processing
• High arithmetic intensity
  – Keep the GPU busy all the time
  – Computation offsets memory latency
• Coherent data access
  – Access large chunks of contiguous memory
  – Exploit fast on-chip shared memory
The Algorithm Matters
• Jacobi: parallelizable

// v: iteration n, vnext: iteration n+1 (separate arrays)
for(int i=1; i<num-1; i++)
{
    vnext[i] = (v[i-1] + v[i+1]) / 2.0;
}

• Gauss-Seidel: difficult to parallelize

// in-place: each update depends on the one just computed
for(int i=1; i<num-1; i++)
{
    v[i] = (v[i-1] + v[i+1]) / 2.0;
}
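A minimal CUDA sketch of the Jacobi update (hypothetical, one thread per element, reading the iteration-n array and writing the iteration-n+1 array):

// Hypothetical sketch: each thread updates one interior element independently.
__global__ void jacobiStep(const float* v, float* vnext, int num)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= 1 && i < num - 1)
        vnext[i] = 0.5f * (v[i - 1] + v[i + 1]);
}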
Example: Reduction
• Serial version (O(N))

for(int i=1; i<N; i++)
{
    v[0] += v[i];
}

• Parallel version (O(log N)), assuming N is a power of two

width = N/2;
while(width >= 1)
{
    for(int i=0; i<width; i++)
    {
        v[i] += v[i+width];  // computed in parallel
    }
    width /= 2;
}
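A minimal CUDA sketch of one reduction pass (hypothetical example using shared memory and __syncthreads(); each block reduces its tile to a single partial sum):

// Hypothetical sketch: per-block tree reduction in shared memory.
// Assumes blockDim.x is a power of two; launch with blockDim.x * sizeof(float) shared bytes.
__global__ void reduceBlock(const float* in, float* blockSums, int N)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < N) ? in[i] : 0.0f;
    __syncthreads();

    for (int width = blockDim.x / 2; width >= 1; width /= 2) {
        if (tid < width)
            s[tid] += s[tid + width];
        __syncthreads();               // wait for all adds in this step
    }
    if (tid == 0)
        blockSums[blockIdx.x] = s[0];  // partial sum for this block
}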
GPU programming languages
• Using graphics APIs
  – GLSL, Cg, HLSL
• Computing-specific APIs
  – DX11 Compute Shaders
  – NVIDIA CUDA
  – OpenCL
NVIDIA CUDA
• C-extension programming language
  – No graphics API
  – Supports debugging tools
• Extensions / API
  – Function types: __global__, __device__, __host__
  – Variable types: __shared__, __constant__
  – Low-level functions
    • cudaMalloc(), cudaFree(), cudaMemcpy(), ...
    • __syncthreads(), atomicAdd(), ...
• Program types
  – Device program (kernel): runs on the GPU
  – Host program: runs on the CPU and calls device programs
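A tiny sketch (hypothetical example) showing how these qualifiers fit together in one file:

__constant__ float scale;              // constant memory (set from the host with cudaMemcpyToSymbol)

__device__ float square(float x)       // callable only from device code
{
    return x * x;
}

__global__ void scaleAndSquare(const float* in, float* out)
{
    __shared__ float tile[256];        // shared within one block (assumes blockDim.x == 256)
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // assumes the grid exactly covers the input
    tile[threadIdx.x] = in[i] * scale;
    __syncthreads();
    out[i] = square(tile[threadIdx.x]);
}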
CUDA Programming Model
• Kernel
  – GPU program that runs on a thread grid
• Thread hierarchy
  – Grid: a set of blocks
  – Block: a set of threads
  – Grid size * block size = total # of threads

[Diagram: a grid of blocks (Block 1, Block 2, ..., Block n), each containing threads, all executing the same kernel]
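A small sketch (hypothetical) of how the hierarchy maps to a launch configuration and a per-thread global index:

__global__ void kernel(float* data, int N)
{
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;  // position within the whole grid
    if (globalIdx < N) data[globalIdx] *= 2.0f;
}

void launch(float* d_data, int N)
{
    dim3 block(256);
    dim3 grid((N + block.x - 1) / block.x);  // grid size * block size >= N
    kernel<<<grid, block>>>(d_data, N);
}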
CUDA Memory Structure
• Memory hierarchy
  – PC memory: off-card
  – GPU global memory: off-chip / on-card
  – Shared memory / registers / cache: on-chip
• The host can read/write global memory
• Threads within a block communicate through shared memory

[Diagram: PC memory (DRAM) <-> GPU global memory (DRAM, on the graphics card) <-> GPU shared memory (on-chip, next to the ALUs in the GPU core)]
Synchronization
• Threads in the same block can communicate using shared memory
• No HW global synchronization function yet
• __syncthreads()
  – Barrier for threads only within the current block
• __threadfence()
  – Flushes global memory writes to make them visible to all threads
Example: CPU Vector Addition
// Pair-wise addition of vector elements
// CPU version: serial add
void vectorAdd(float* iA, float* iB, float* oC, int num)
{
    for(int i=0; i<num; i++)
    {
        oC[i] = iA[i] + iB[i];
    }
}
Example: CUDA Vector Addition
// Pair-wise addition of vector elements
// CUDA version: one thread per addition
__global__ void vectorAdd(float* iA, float* iB, float* oC)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    oC[idx] = iA[idx] + iB[idx];
}
Example: CUDA Host Code
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// ... initialize h_A and h_B

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );
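To complete the round trip (not shown on the slide), the result is copied back to the host and device memory is released; a minimal sketch, assuming h_C is allocated like h_A:

// copy the result back to the host and clean up
float* h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);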
OpenCL (Open Computing Language)
• First industry standard for a computing language
  – Based on the C language
  – Platform independent
    • NVIDIA, ATI, Intel, ...
• Data- and task-parallel compute model
  – Uses all computational resources in the system
    • CPU, GPU, ...
  – Work-item: same as a thread / fragment / etc.
  – Work-group: a group of work-items
    • Work-items in the same work-group can communicate
  – Executes multiple work-groups in parallel
OpenCL program structure
• Host program (CPU)
  – Platform layer
    • Query compute devices
    • Create context
  – Runtime
    • Create memory objects
    • Compile and create kernel program objects
    • Issue commands (e.g., kernel launches) to a command queue
    • Synchronization of commands
    • Clean up OpenCL resources
• Kernel (CPU, GPU)
  – C-like code with some extensions
  – Runs on a compute device
CUDA vs. OpenCL comparison
• Conceptually almost identical
  – Work-item == thread
  – Work-group == block
  – Similar memory model
    • Global, local, shared memory
  – Kernel, host program
• CUDA is highly optimized, but only for NVIDIA GPUs
• OpenCL can be widely used for any GPUs/CPUs
Implementation status of OpenCL
• Specification 1.0 released by Khronos
• NVIDIA has released a Beta 1.2 driver and SDK
  – Available to registered GPU computing developers
• Apple will include it in Mac OS X Snow Leopard
  – Q3 2009
  – NVIDIA and ATI GPUs, Intel CPUs for Mac
• More companies will join
GPU optimization tips: configuration
• Identify the bottleneck
  – Compute-bound or bandwidth-bound (use the profiler)
  – Focus on the most expensive but parallelizable parts (Amdahl's law)
• Maximize parallel execution
  – Use large inputs (many threads)
  – Avoid divergent execution
  – Use limited resources efficiently
    • Minimize shared memory / register use
GPU optimization tips: memory
• Memory access is the most important optimization
  – Minimize device-to-host memory overhead
    • Overlap kernels with memory copies (asynchronous copy)
  – Avoid shared memory bank conflicts
  – Use coalesced global memory accesses
  – Texture or constant memory can help (cached)

[Diagram: PC memory (DRAM) <-> GPU global memory (DRAM) <-> GPU shared memory (on-chip), as in the memory-structure figure above]
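A small sketch (hypothetical) of coalesced vs. strided global memory access; adjacent threads reading adjacent addresses coalesce into few transactions, while a large stride does not:

// Coalesced: thread i reads element i, so a warp touches one contiguous segment.
__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i*stride, scattering the warp's accesses
// across many memory segments (much lower effective bandwidth).
__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i * stride < n) out[i] = in[i * stride];
}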
GPU optimization tips: instructions
• Use less expensive operators
  – Division: 32 cycles, multiplication: 4 cycles
    • *0.5 instead of /2.0
  – Atomic operators are expensive
    • Possible race conditions
  – Double precision is much slower than float
  – Use less accurate floating point instructions when possible
    • __sinf(), __expf(), __powf()
• Save unnecessary instructions (as in the sketch below)
  – Loop unrolling
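A tiny sketch (hypothetical) of trading accuracy for speed with the fast-math intrinsics mentioned above:

__global__ void cheapMath(const float* x, float* y, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        float v = x[i] * 0.5f;           // multiply instead of dividing by 2.0
        y[i] = __expf(v) + __sinf(v);    // fast, less accurate intrinsics
    }
}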
3. Application Example
CUDA ITK
ITK image filters implemented using CUDA
• Convolution filters
  – Mean filter
  – Gaussian filter
  – Derivative filter
  – Hessian of Gaussian filter
• Statistical filter
  – Median filter
• PDE-based filter
  – Anisotropic diffusion filter
CUDA ITK
• CUDA code is integrated into ITK
  – Transparent to ITK users
  – No need to modify existing code that uses the ITK library
• Checks the environment variable ITK_CUDA
  – Entry point: GenerateData() or ThreadedGenerateData()
  – If ITK_CUDA == 0: execute the original ITK code
  – If ITK_CUDA == 1: execute the CUDA code
Convolution filters
• Weighted sum of neighbors
  – For a size-n filter, each pixel is reused n times
• Non-separable filter (anisotropic)
  – Reuse data through shared memory
• Separable filter (Gaussian)
  – An N-dimensional convolution = N 1-D convolutions

[Diagram: the convolution kernel swept across the image along each axis]
Naive C/CUDA implementation
• Read from the input image whenever needed

int xdim, ydim;   // size of input image
float *in, *out;  // input/output image of size xdim*ydim
float w[][];      // convolution kernel of size n*m

// xdim*ydim pixels, n*m global memory loads per pixel
for(x=0; x<xdim; x++)
{
    for(y=0; y<ydim; y++)
    {
        // compute convolution
        for(sx=x-n/2; sx<=x+n/2; sx++)
        {
            for(sy=y-m/2; sy<=y+m/2; sy++)
            {
                wx = sx - x + n/2;
                wy = sy - y + m/2;
                out[x][y] += w[wx][wy]*in[sx][sy];  // load from global memory, n*m times
            }
        }
    }
}
Improved CUDA convolution filter
• For a size n*m filter, each pixel is reused n*m times
  – Save n*m-1 global memory loads per pixel by using shared memory

__global__ void cudaConvolutionFilter2DKernel(in, out, w)
{
    // copy global to shared memory: slow load from global memory, only once
    sharedmem[] = in[][];
    __syncthreads();

    // sum neighbor pixel values: fast loads from shared memory, n*m times
    float _sum = 0;
    for(uint j=threadIdx.y; j<=threadIdx.y + m; j++)
    {
        for(uint i=threadIdx.x; i<=threadIdx.x + n; i++)
        {
            wx = i - threadIdx.x;
            wy = j - threadIdx.y;
            _sum += w[wx][wy]*sharedmem[j*sharedmemdim.x + i];
        }
    }
}
CUDA Gaussian filter
• Apply a 1-D convolution filter along each axis
  – Use temporary buffers: ping-pong rendering

// temp[0], temp[1] : temporary buffers to store intermediate results
void cudaDiscreteGaussianImageFilter(in, out, stddev)
{
    // create Gaussian weights
    w = ComputeGaussKernel(stddev);
    temp[0] = in;

    // call the 1-D convolution CUDA kernel once per axis
    dim3 G, B;
    for(i=0; i<dimension; i++)
    {
        cudaConvolutionFilter1DKernel<<<G,B>>>(temp[i%2], temp[(i+1)%2], w);
    }
    out = temp[i%2];
}
Median filter
• Viola et al. [VIS 03]
  – Find the median by bisection of the histogram bins
  – log2(# bins) iterations
    • 8-bit pixels: log2(256) = 8 iterations

[Diagram: successive bisection steps 1-4 narrowing the intensity range that contains the median of an example pixel block (14, 3, 18, 2, 10, ...)]

// Copy current block from global to shared memory
min = 0;
max = 255;
pivot = (min+max)/2.0f;
for(i=0; i<8; i++)
{
    count = 0;
    for(j=0; j<kernelsize; j++)
    {
        if(kernel[j] > pivot) count++;
    }
    if(count < kernelsize/2) max = floor(pivot);
    else                     min = ceil(pivot);
    pivot = (min + max)/2.0f;
}
return floor(pivot);
Perona & Malik anisotropic diffusion
• Nonlinear diffusion
  – Adaptive smoothing based on the magnitude of the gradient
  – Preserves edges (high gradient)
• Numerical solution
  – Euler explicit integration (iterative method)
  – Finite differences for derivative computation

[Images: input image, linear diffusion result, Perona & Malik diffusion result]
Performance
• Convolution filters
  – Mean filter: ~140x
  – Gaussian filter: ~60x
  – Derivative filter
  – Hessian of Gaussian filter
• Statistical filter
  – Median filter: ~25x
• PDE-based filter
  – Anisotropic diffusion filter: ~70x
CUDA ITK
• Source code available at
  – http://sourceforge.net/projects/cudaitk/
CUDA ITK Future Work
• ITK GPU image class
  – Reduce CPU-to-GPU memory I/O
  – Pipelining support
• Native interface for GPU code
  – Similar to ThreadedGenerateData(), but for GPU threads
• Numerical library (vnl)
• Out-of-GPU-core / GPU-cluster processing
  – Processing large images (10-100 terabytes)
• GPU platform-independent implementation
  – OpenCL could be a solution
Conclusions
• GPU computing delivers high performance
  – Many scientific computing problems are parallelizable
  – More consistency/stability in HW/SW
    • The main GPU architecture is mature
    • An industry-wide programming standard now exists (OpenCL)
  – Better support/tools available
    • C-based language, compiler, and debugger
• Issues
  – Not every problem is suitable for GPUs
    • Re-engineering of algorithms/software required
  – Unclear future performance growth of GPU hardware
    • Intel's Larrabee
thrust
• thrust: a library of data-parallel algorithms and data structures for CUDA, with an interface similar to the C++ Standard Template Library
• C++ template metaprogramming automatically chooses the fastest code path at compile time
thrust::sort
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(1000000);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;

    // sorts ~140M 32-bit keys/sec on GT200
    thrust::sort(d_vec.begin(), d_vec.end());

    return 0;
}

http://thrust.googlecode.com
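As a further illustration (a hypothetical sketch, not from the slides), the reduction example from earlier collapses to a single call with thrust:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

float sumOnDevice(const thrust::device_vector<float>& d_vec)
{
    // parallel sum on the GPU, equivalent to the hand-written reduction kernel
    return thrust::reduce(d_vec.begin(), d_vec.end(), 0.0f);
}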