Building Real-Time Professional Visualization Solutions on GPUs
Kristof Denolf
Samuel Maroy
Ronny Dewaele
Outline
• Barco’s professional visualization solutions
• The need for performance portability
• Real PCIe data rates to/from GPU
– Transfer only (e.g. the bandwidth test)
– Transfers with parallel GPU compute/rendering
• Comparison of CUDA, OpenCL, OpenGL and GPUdirect for video data rates
• The cost of OpenGL/CL (or CUDA) interoperability
• Towards partial transfers to reduce latency
• Conclusions
Company structure
Four core divisions, five wholly-owned ventures
• Core divisions: Healthcare, Entertainment, Control rooms & Simulation, Defense & Aerospace
• Ventures: Digital signage, Lighting, Design services, ATM software, LED
Healthcare
“Supporting healthcare professionals a billion times a year”
Control Rooms
“Helping over 2.5 billion commuters get home safely every day”
Media & Entertainment
“Setting the scene for over 2,500 gigs and shows every year”
Professional Visualization
• High quality
– High resolutions
– Multiple sources
– True colours
• Low latency
• Perfect calibration
• Synchronization
OpenCL as Initial Answer for Portability
• OpenCL for GPU and multi-core CPU programming of image processing chains
• OpenCL for GPU-accelerated prototypes of new algorithms
Portability also Towards FPGA Design
[Desh Singh, presented at DATE 2011 and FPGA 2011 Pre-Conference Workshop]
[Altera news: San Jose, Calif., November 15, 2011]
Ideal Data Transfer has Highest Rate and Virtually no Compute Impact
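[Figure: CPU and GPU timelines. One case shows in (n), GPUproc (n) and out (n) for the same frame; the other shows GPUproc (n) overlapping in (n+1) and out (n-1). Goal: maximize GPU compute and transfer time.]
• CPU asynchronous
• Pinned CPU memory
• GPU parallel transfer
• Highest rate
[Figure: Quadro graphics card with two copy engines and DRAM, connected over the PCIe bus to the CPU and its DRAM.]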
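The benefit of overlapping in (n+1), GPUproc (n) and out (n-1) can be illustrated with a small timing model. This is a minimal sketch, not a measurement: the stage durations are hypothetical, the function names are introduced here for illustration, and it assumes that upload, compute and download can all run concurrently (two copy engines plus the compute engine).

```python
def frame_period_serialized(t_in, t_proc, t_out):
    """Frame period when upload, processing and download run back to back."""
    return t_in + t_proc + t_out


def frame_period_pipelined(t_in, t_proc, t_out):
    """Steady-state frame period when the three stages overlap fully.

    Assumes upload, compute and download can run concurrently, so the
    slowest stage sets the pace.
    """
    return max(t_in, t_proc, t_out)


# Hypothetical per-frame stage durations in milliseconds (not measured values).
t_in, t_proc, t_out = 2.0, 5.0, 2.0

serialized = frame_period_serialized(t_in, t_proc, t_out)
pipelined = frame_period_pipelined(t_in, t_proc, t_out)

print(f"serialized: {serialized:.1f} ms per frame")
print(f"pipelined:  {pipelined:.1f} ms per frame")
print(f"GPU compute share of the frame period: "
      f"{t_proc / serialized:.0%} -> {t_proc / pipelined:.0%}")
```

In this model the pipelined period equals the compute time whenever the transfers are shorter than the kernel, which is the "virtually no compute impact" situation the slide aims for.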
(Over) Peak Data Rates Highest for Direct Transfers from/to Pinned Host Memory
[Figure: oclBandwidthTest. Transfer rate (MBps, 0 to 10000) versus transfer size (MB, 0 to 30). Series: Cpu2Gpu and Gpu2Cpu, each for "pinned, direct", "pinned, mapped" and "pinned, paged".]
[Figure: testTransferSpeedOpenCL. Transfer rate (MBps, 0 to 10000) versus transfer size (MB, 0 to 30). Series: Cpu2Gpu and Gpu2Cpu, each for "pinned, direct", "pinned, mapped" and "paged, direct".]
All tests were run on a Q3000M on PCIe x16 Gen2 (GPUdirect on a Q4000)
Other Transfers Sustain a Similar Rate
• CL buffers, images and GL-interoperable variants perform similarly
• Choose the most appropriate CL memory type
• Efficiency
– > 4 GBps from 480p (1.3 MB)
– > 4.8 GBps from 720p (3.5 MB)
• All numbers for RGBA
– Write to GPU: 10 streams of 1080p60
– Read from GPU: 10 streams of 1080p60
[Figure: OpenCL. Transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30). Series: Cpu2Gpu and Gpu2Cpu, each for buffer, image, bufferGL and imageGL.]
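The 1.3 MB and 3.5 MB transfer sizes quoted for 480p and 720p can be reproduced from the frame dimensions. The sketch below assumes 8-bit RGBA (4 bytes per pixel), frame sizes of 720x480 and 1280x720, and that "MB" on the chart axis means 2^20 bytes; none of these are stated explicitly on the slides.

```python
BYTES_PER_PIXEL = 4  # RGBA with 8 bits per channel (assumption)

# Assumed frame dimensions; the slides only name the formats.
FORMATS = {
    "480p": (720, 480),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
}


def frame_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Size in bytes of one uncompressed frame."""
    return width * height * bytes_per_pixel


for name, (width, height) in FORMATS.items():
    size = frame_bytes(width, height)
    # 2**20 bytes per MB is an assumption about the chart's unit.
    print(f"{name}: {size} bytes = {size / 2**20:.1f} MB")
```

Under these assumptions a 1080p frame is just under 8 MB, so it sits on the flat part of the transfer-rate curve.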
OpenCL/CUDA Transfers with Parallel Compute (Transfer Dominated)
[Figure: timelines for OpenCL and CUDA, showing GPUproc (n) alongside in (n+1) and out (n-1).]
• GPU dual copy engines working
• GPU compute in parallel with data transfers
• Still some gaps present
Throughput Impact Related to Kernel Duration
[Figure: OpenCL. Transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30). Series: Peak transfer, GPU parallel.]
• Efficiency (OpenCL)
– > 3.2 GBps from 480p (1.3 MB)
– > 3.4 GBps from 720p (3.5 MB)
• All numbers for RGBA
– Write to GPU: 6.8 streams of 1080p60
– Read from GPU: 6.8 streams of 1080p60
• Note that maximizing the GPU compute time is also hampered
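To put the drop in perspective, the OpenCL rates quoted with parallel compute can be divided by the transfer-only rates quoted earlier. This is a rough sketch: the slides give lower bounds ("> x GBps"), so the ratios are indicative only, and the helper name is introduced here for illustration.

```python
# OpenCL rates in GBps as quoted on the slides (lower bounds, so ratios are indicative).
TRANSFER_ONLY = {"480p": 4.0, "720p": 4.8}
WITH_PARALLEL_COMPUTE = {"480p": 3.2, "720p": 3.4}


def retained_fraction(parallel_rate, transfer_only_rate):
    """Share of the transfer-only rate that survives when compute runs in parallel."""
    return parallel_rate / transfer_only_rate


for fmt, peak in TRANSFER_ONLY.items():
    fraction = retained_fraction(WITH_PARALLEL_COMPUTE[fmt], peak)
    print(f"{fmt}: about {fraction:.0%} of the transfer-only rate")

# The same ratio expressed in 1080p60 RGBA streams: 6.8 with compute versus 10 without.
stream_fraction = retained_fraction(6.8, 10)
print(f"1080p60 streams: about {stream_fraction:.0%} of the transfer-only count")
```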
CUDA and GPUdirect Achieve Highest Peak Rate
• Efficiency
– CUDA transfers boost to 6 GBps
– DVP read from GPU up to 7.5 GBps for very large transfers
– Other: all around 5.2 GBps
• How to get 6 GBps for all programming models?
[Figure: Transfer rate (MBps, 0 to 9000) versus transfer size (MB, 0 to 30). Series: Cpu2Gpu and Gpu2Cpu, each for OpenCL, OpenGL, GPUdirect and CUDA.]
CL/GL Interoperability Hampers Parallelism
[Figure: OpenCL. Transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30). Series: Peak transfer, GPU parallel.]
CUDA / GL Interoperability not Trivial
[Figure: two panels, "No GL rendering" and "With GL rendering".]
Return to OpenGL, Render on Full HD Screen (1/2)
[Figure: OpenGL. Transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30). Series: Peak transfer, GPU parallel.]
Return to OpenGL, Readback to CPU Memory (2/2)
[Figure: OpenGL. Transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30). Series: Peak transfer, GPU parallel.]
to Avoid Interoperability Issue
• 9 HD 1080p inputs at 60 fps → 4.5 GBps
• Parallel rendering
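The 4.5 GBps figure follows from the stream parameters. A minimal sketch of the arithmetic, assuming 8-bit RGBA (4 bytes per pixel, in line with the "all numbers for RGBA" note earlier) and 1 GBps = 10^9 bytes per second; the function names are illustrative only.

```python
def stream_rate_bytes_per_s(width, height, fps, bytes_per_pixel=4):
    """Uncompressed data rate of one video stream in bytes per second."""
    return width * height * bytes_per_pixel * fps


def aggregate_rate_gbps(n_streams, width=1920, height=1080, fps=60, bytes_per_pixel=4):
    """Combined rate of n identical streams in GBps (1 GBps = 1e9 bytes/s, assumed)."""
    return n_streams * stream_rate_bytes_per_s(width, height, fps, bytes_per_pixel) / 1e9


one_stream = aggregate_rate_gbps(1)
nine_streams = aggregate_rate_gbps(9)

print(f"1 x 1080p60 RGBA: {one_stream:.2f} GBps")
print(f"9 x 1080p60 RGBA: {nine_streams:.2f} GBps")
```

With these assumptions one stream needs roughly 0.5 GBps, so each additional 1080p60 input costs about a tenth of the transfer-only rate measured earlier.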
Partial Image Transfers for Low Latency
• A 1/8 HD (1 MB) transfer still achieves a reasonable rate (certainly for CUDA)
• Concurrent partial updates of the same image?
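The latency argument for partial transfers can be sketched numerically: processing can start once the first slice has arrived, rather than after the whole frame. The sketch assumes a 1080p RGBA frame, a hypothetical sustained rate of 4 GBps, and the same rate for every slice size; the measurements suggest small transfers are somewhat slower, so treat the small-slice numbers as optimistic.

```python
FRAME_BYTES = 1920 * 1080 * 4  # one 1080p RGBA frame (assumed format)


def transfer_time_ms(n_bytes, rate_gbps):
    """Transfer time in milliseconds at a given rate (1 GBps = 1e9 bytes/s)."""
    return n_bytes / (rate_gbps * 1e9) * 1e3


def first_slice_latency_ms(n_slices, rate_gbps, frame_bytes=FRAME_BYTES):
    """Time until the first of n equal slices has fully arrived.

    Simplification: the rate is taken to be independent of the slice size.
    """
    return transfer_time_ms(frame_bytes / n_slices, rate_gbps)


RATE_GBPS = 4.0  # hypothetical sustained rate, not a measured value

for n_slices in (1, 2, 4, 8):
    latency = first_slice_latency_ms(n_slices, RATE_GBPS)
    print(f"{n_slices} slice(s): first data after {latency:.2f} ms")
```

Splitting the frame into eight slices of about 1 MB each cuts the wait for the first data by a factor of eight in this model, provided the per-transfer overhead stays small.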
Conclusions
• Barco’s professional visualization requires
– High quality
– High resolution
– Multiple sources
• Barco’s professional visualization desires portability
• DMA-enabled and fully parallel data transfers are essential
• Mind the gap: peak data rates cannot be achieved continuously
• CL or CUDA/GL interoperability is difficult