01/05/2023
GPU acceleration of image processing Jan
Lemeire
1
3
GPU vs CPU Peak Performance Trends
GPU peak performance has grown aggressively. Hardware has kept up with Moore’s law
Source : NVIDIA
2010350 Million triangles/second3 Billion transistors GPU
19955,000 triangles/second800,000 transistors GPU
01/05/2023
To the rescue: Graphical Processing Units (GPUs)
94 fps (AMD Tahiti Pro)
GPU: 1-3 TeraFlop/second instead of 10-20 GigaFlop/second for CPU
4
Figure 1.1. Enlarging Performance Gap between GPUs and CPUs.
Multi-core CPU
Many-core GPU
Courtesy: John Owens
01/05/2023
GPUs are an alternative for CPUs
in offering processing power
6
01/05/2023 7
pixel rescaling lens correction pattern detection
CPU gives only 4 fpsnext generation machines need 50fps
01/05/2023
CPU: 4 fps GPU: 70 fps
8
01/05/2023
Methodology
9
Application
Identification of compute-intensive parts
Feasibility study of GPU acceleration
GPU implementation
GPU optimization
Hardware
01/05/2023
Obstacle 1Hard(er) to implement
10
01/05/2023 11
Device/GPU ± 1TFLOPS
Global Memory (1GB)
Multiprocessor 1
Local Memory (16/48KB)
ScalarProcessor
± 1GHz
Private 16K/8
ScalarProcessor
Private
Multiprocessor 2
Local Memory
ScalarProcessor
Private
ScalarProcessor
PrivateHost/CPU
Constant Memory (64KB)
GPU Programming Concepts
Texture Memory (in global memory)
RAM
Proces-sor
Grid (1D, 2D or 3D)
Group(0, 0)
Group(1, 0)
Group(0, 1)
Group(1, 1)
Group(2, 0)
Group(2, 1)
Work group
Work item(0, 0)
Work item(1, 0)
Work item(2, 0)
Work item(0, 1)
Work item(1, 1)
Work item(2, 1)
Work item(0, 2)
Work item(1, 2)
Work item(2, 2)
kernel
Max #work items per work group: 1024Executed in warps/wavefronts of 32/64 work itemsMax work groups simultaneously on MP: 8Max active warps on MP: 24/48
get_local_size(0)
get_
local_
size(1
)Work
group
size
S yWork group size Sx
(get_local_id(0), get_local_id(1))
(get_group_id(0),get_group_id(1))100GB/s 200 cycles
40GB/s few cycles
4-8 GB/s
OpenCL terminology
01/05/2023
Semi-abstract scalable hardware model
Need to know model for effective and efficient code
CPU: processor ensures efficient execution
12
Need to know more details than of CPU
Code remains compatible/efficient
01/05/2023
Increased code complexity
1. Complex index calculations Mapping data elements on processing elements (at
least 2 levels) Sometimes better to group elements
2. Optimizations Impact on performance need to be tested
3. A lot of parameters:a. Algorithm, implementationb. Configuration of mappingc. Hardware parameters (limits)d. Optimized versions
13
01/05/2023 14
Application
Identification of compute-intensive parts
Feasibility study of GPU acceleration
GPU implementation
GPU optimization
Hardware
Skeleton-based
OpenCL
Pragma-based
Parallelization by compiler
Methodology
01/05/2023
Obstacle 2Hard(er) to get efficiency
15
01/05/2023
We expect peak performance Speedup of 100x possible
At least, we expect some speedup But what is 5x worth?
Reasons for low efficiency?
16
01/05/2023 17
Roofline model
01/05/2023 18
01/05/2023 19
Application
Identification of compute-intensive parts
Feasibility study of GPU acceleration
Performance estimation
GPU implementation Performance analysis
GPU optimization
Hardware
Algorithm characterization
Hardwarecharacterization
bottlenecks & trade-offs
Skeleton-based
OpenCL
Pragma-based
Parallelization by compiler
Roofline model& benchmarks
Analytical model
benchmarks
Methodology: our contribution
Anti-parallel patterns
01/05/2023
Conclusions
20
01/05/2023
Conclusions
21
Changed into…
01/05/2023
Conclusions
22
01/05/2023
Competence Center for Personal Supercomputing
Offer trainings (overcome obstacle 1) Acquire expertise Take an independent, critical position
Offer feasibility and performance studies (overcome obstacle 2)
Symposium: Brussels, December 13th 2012
http://parallel.vub.ac.be 23