Building Real-Time Professional Visualization Solutions on GPUs … · 2013. 8. 23. · Page 14...

Building Real-Time Professional Visualization Solutions on GPUs

Kristof Denolf

Samuel Maroy

Ronny Dewaele

Outline

• Barco’s professional visualization solutions

• The need for performance portability

• Real PCIe data rates to/from GPU

– Transfer only (e.g. the bandwidth test)

– Transfers with parallel GPU compute/rendering

• Comparison of CUDA, OpenCL, OpenGL and GPUdirect for video data rates

• The cost of OpenGL/CL (or CUDA) interoperability

• Towards partial transfers to reduce the latency

• Conclusions

Company structure: four core divisions, five wholly-owned ventures

Ventures: Digital signage, Lighting, Design services, ATM software, LED

Divisions: Healthcare, Entertainment, Control rooms & Simulation, Defense & Aerospace

Healthcare

“Supporting healthcare professionals a billion times a year”

Control Rooms

“Helping over 2.5 billion commuters get home safely every day”

Media & Entertainment

“Setting the scene for over 2,500 gigs and shows every year”

Professional Visualization

• High quality

– High resolutions

– Multiple sources

– True colours

• Low latency

• Perfect calibration

• Synchronization

OpenCL as Initial Answer for Portability

• OpenCL for GPU and multi-core CPU programming of image processing chains

• OpenCL for GPU-accelerated prototypes of new algorithms

Portability also Towards FPGA Design

[Desh Singh, presented at DATE 2011 and FPGA 2011 Pre-Conference Workshop]

[Altera news: San Jose, Calif., November 15, 2011]

Ideal Data Transfer has Highest Rate and Virtually no Compute Impact

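[Figure: CPU and GPU timelines. Serial case: in (n), GPUproc (n), out (n). Pipelined case: GPUproc (n) runs in parallel with in (n+1) and out (n-1).]

Maximize GPU compute & transfer time

• CPU asynchronous

• Pinned CPU memory

• GPU parallel transfer

• Highest rate

[Figure: Quadro graphics card with two copy engines and DRAM, connected over the PCIe bus to the CPU and its DRAM.]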
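The difference between the serial and the pipelined schedule can be illustrated with a small timing model. This is only a sketch: the function name and the stage durations are hypothetical, and it assumes that upload, compute and download overlap perfectly (two copy engines, asynchronous CPU calls, pinned host memory).

```python
def frame_period_ms(t_in, t_proc, t_out, overlapped):
    """Time between consecutive output frames, in milliseconds."""
    if overlapped:
        # Upload of frame n+1, processing of frame n and download of
        # frame n-1 run concurrently, so the slowest stage sets the pace.
        return max(t_in, t_proc, t_out)
    # Serial case: each frame is uploaded, processed and downloaded in turn.
    return t_in + t_proc + t_out


# Hypothetical stage durations in milliseconds (not measurements from the slides).
t_in, t_proc, t_out = 2.0, 5.0, 2.0

serial = frame_period_ms(t_in, t_proc, t_out, overlapped=False)
pipelined = frame_period_ms(t_in, t_proc, t_out, overlapped=True)
print(f"serial: {serial} ms/frame, pipelined: {pipelined} ms/frame")
```

In the ideal pipelined case the transfers are hidden behind the compute stage; any gap between stages moves the result back towards the serial figure.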

(Over) Peak Data Rates Highest for Direct Transfers from/to Pinned Host Memory

[Figure: oclBandwidthTest. Transfer rate (MBps, 0 to 10000) versus transfer size (MB, 0 to 30). Series: Cpu2Gpu and Gpu2Cpu for pinned/direct, pinned/mapped and pinned/paged.]

[Figure: testTransferSpeedOpenCL. Transfer rate (MBps, 0 to 10000) versus transfer size (MB, 0 to 30). Series: Cpu2Gpu and Gpu2Cpu for pinned/direct, pinned/mapped and paged/direct.]

All tests done on Q3000M on PCIe x16 Gen2 (GPUdirect on Q4000)

Other Transfers Sustain a Similar Rate

• CL buffers, images and GL interoperable variants reach similar rates

• Choose the most appropriate CL memory type

• Efficiency

– > 4 GBps from 480p (1.3 MB)

– > 4.8 GBps from 720p (3.5 MB)

• All numbers for RGBA

– Write to GPU: 10 1080p60 streams

– Read from GPU: 10 1080p60 streams
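The link between a sustained transfer rate and a number of video streams can be sketched as follows. The function names are hypothetical, 4 bytes per pixel follows from the RGBA format named above, and the 5.0e9 B/s input is a round illustrative value under the assumption that 1 GB = 10^9 bytes.

```python
def stream_rate_bytes_per_s(width, height, fps, bytes_per_pixel=4):
    """Raw data rate of one uncompressed video stream."""
    return width * height * bytes_per_pixel * fps


def streams_supported(sustained_bytes_per_s, width=1920, height=1080, fps=60):
    """How many such streams fit into a sustained transfer rate."""
    return sustained_bytes_per_s / stream_rate_bytes_per_s(width, height, fps)


per_stream = stream_rate_bytes_per_s(1920, 1080, 60)
print(f"one 1080p60 RGBA stream: {per_stream / 1e9:.3f} GB/s")
print(f"streams at 5.0 GB/s: {streams_supported(5.0e9):.1f}")
```

One 1080p60 RGBA stream needs roughly 0.5 GB/s, so ten streams in one direction need roughly 5 GB/s.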

[Figure: OpenCL. Transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30). Series: Cpu2Gpu and Gpu2Cpu for buffer, image, bufferGL and imageGL.]

OpenCL/CUDA Transfers with Parallel Compute (Transfer Dominated)

[Figure: OpenCL and CUDA timelines. GPUproc (n) runs in parallel with in (n+1) and out (n-1).]

• GPU dual copy engines working

• GPU compute in parallel with data transfers

• Still some gaps present

Throughput Impact Related to Kernel Duration

[Figure: OpenCL. Transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30). Series: peak transfer, GPU parallel.]

• Efficiency (OpenCL)

– > 3.2 GBps from 480p (1.3 MB)

– > 3.4 GBps from 720p (3.5 MB)

• All numbers for RGBA

– Write to GPU: 6.8 1080p60 streams

– Read from GPU: 6.8 1080p60 streams

• Note that maximizing the GPU compute time is also hampered

CUDA and GPUdirect Achieve Highest Peak Rate

• Efficiency

– CUDA transfers boost to 6 GBps

– DVP read from GPU up to 7.5 GBps for very large transfers

– Other: all around 5.2 GBps

• How to get 6 GBps for all programming models?

[Figure: transfer rate (MBps, 0 to 9000) versus transfer size (MB, 0 to 30). Series: Cpu2Gpu and Gpu2Cpu for OpenCL, OpenGL, GPUdirect and CUDA.]

CL/GL Interoperability Hampers Parallelism

[Figure: OpenCL. Transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30).]

CUDA / GL Interoperability not Trivial

[Figure: panels "No GL rendering" and "With GL rendering"; plot of transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30).]


Return to OpenGL, render on full HD screen (1/2)

OpenGL

Return to OpenGL, Readback to CPU Memory (2/2)

[Figure: OpenGL. Transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30).]

to Avoid Interoperability Issue

• 9 HD 1080p in at 60 fps → 4.5 GBps

• Parallel rendering
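The 4.5 GBps figure can be checked by hand, assuming 4 bytes per pixel (RGBA, as stated for the earlier measurements) and 1 GB = 10^9 bytes:

```latex
9 \times 1920 \times 1080 \times 4\,\text{B} \times 60\,\text{s}^{-1}
  = 4\,478\,976\,000\,\text{B/s} \approx 4.5\,\text{GB/s}
```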

Partial Image Transfers for Low Latency

• A 1/8 HD (1 MB) transfer size still reaches a reasonable rate (certainly for CUDA)

• Concurrent partial updates of the same image?
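Why smaller transfers help latency can be sketched with simple arithmetic: the first slice of a frame arrives on the GPU long before a full frame would. The helper names and both sustained rates below are hypothetical illustrations, not measurements from the slides; the lower rate for the small transfer reflects that small transfers are less efficient.

```python
FRAME_BYTES = 1920 * 1080 * 4  # one 1080p RGBA frame


def transfer_time_ms(num_bytes, rate_bytes_per_s):
    """Duration of one transfer at a given sustained rate, in milliseconds."""
    return num_bytes / rate_bytes_per_s * 1e3


def first_data_latency_ms(slices, rate_bytes_per_s):
    """Time until the first of `slices` equal parts of a frame has arrived."""
    return transfer_time_ms(FRAME_BYTES / slices, rate_bytes_per_s)


# Hypothetical sustained rates: 5.0e9 B/s for full frames,
# 4.0e9 B/s for the smaller 1/8-frame transfers.
full = first_data_latency_ms(1, 5.0e9)
eighth = first_data_latency_ms(8, 4.0e9)
print(f"full frame: {full:.2f} ms, 1/8 frame: {eighth:.2f} ms")
```

Under these assumptions processing can start several times earlier, at the cost of a somewhat lower transfer rate per slice.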

Conclusions

• Barco’s professional visualization requires

– High quality

– High resolution

– Multiple sources

• Barco’s professional visualization desires portability

• DMA-enabled and fully parallel data transfers are essential

• Mind the gap: peak data rates cannot be achieved continuously

• CL or CUDA/GL interoperability is difficult

Date post:	31-Jan-2021
