Building Real-Time Professional Visualization Solutions on GPUs … · 2013. 8. 23. · Page 14...

Date post: 31-Jan-2021
Building Real-Time Professional Visualization Solutions on GPUs Kristof Denolf Samuel Maroy Ronny Dewaele
Transcript
  • Building Real-Time Professional Visualization Solutions on GPUs

    Kristof Denolf

    Samuel Maroy

    Ronny Dewaele

  • Page 2

  • Page 3

    Outline

    • Barco’s professional visualization solutions

    • The need for performance portability

    • Real PCIe data rates to/from GPU

      – Transfer only (e.g. the bandwidth test)

      – Transfers with parallel GPU compute/rendering

    • Comparison of CUDA, OpenCL, OpenGL and GPUdirect for video data rates

    • The cost of OpenGL/CL (or CUDA) interoperability

    • Towards partial transfers to reduce the latency

    • Conclusions

  • Page 4

    Company structure

    Four core divisions, five wholly-owned ventures

    Core divisions: Healthcare, Entertainment, Control rooms & Simulation, Defense & Aerospace

    Ventures: Digital signage, Lighting, Design services, ATM software, LED

  • Page 5

    Healthcare

    “Supporting healthcare professionals a billion times a year”

  • Page 6

    Control Rooms

    “Helping over 2.5 billion commuters get home safely every day”

  • Page 7

    Media & Entertainment

    “Setting the scene for over 2,500 gigs and shows every year”

  • Page 8

    Professional Visualization

    • High quality

      – High resolutions

      – Multiple sources

      – True colours

    • Low latency

    • Perfect calibration

    • Synchronization

  • Page 9

    OpenCL as Initial Answer for Portability

    • OpenCL for GPU and multi-core CPU programming of image processing chains

    • OpenCL for GPU-accelerated prototypes of new algorithms

  • Page 10

    Portability also Towards FPGA Design

    [Desh Singh, presented at DATE 2011 and FPGA 2011 Pre-Conference Workshop]

    [Altera news: San Jose, Calif., November 15, 2011]

  • Page 11

  • Page 12

    Ideal Data Transfer has Highest Rate and Virtually no Compute Impact

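    [Diagram: CPU and GPU timelines for two schedules. Serial: in (n), GPU proc (n), out (n) one after another. Overlapped: in (n+1) and out (n-1) run in parallel with GPU proc (n). Goal: maximize GPU compute and transfer time.]

    • CPU asynchronous

    • Pinned CPU memory

    • GPU parallel transfer

    • Highest rate

    [Diagram: Quadro graphics card with two copy engines and DRAM, connected over the PCIe bus to the CPU and CPU DRAM.]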
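
The benefit of the overlapped schedule can be illustrated with a small timing model. This is only a sketch: the stage times are hypothetical, and it assumes uploads and downloads can run at the same time as each other and as the kernel (as with two copy engines), which real hardware only approximates.

```python
def frame_period(t_in, t_proc, t_out, overlapped):
    """Steady-state time per frame, in the same unit as the inputs."""
    if overlapped:
        # in (n+1), proc (n) and out (n-1) run concurrently, so the
        # slowest stage sets the pace.
        return max(t_in, t_proc, t_out)
    # Serial schedule: every stage waits for the previous one.
    return t_in + t_proc + t_out


# Hypothetical stage times in milliseconds, chosen only for illustration.
t_in, t_proc, t_out = 2.0, 5.0, 2.0

serial = frame_period(t_in, t_proc, t_out, overlapped=False)
pipelined = frame_period(t_in, t_proc, t_out, overlapped=True)

print(f"serial:    {serial:.1f} ms/frame")
print(f"pipelined: {pipelined:.1f} ms/frame")
print(f"GPU busy computing: {t_proc / serial:.0%} vs {t_proc / pipelined:.0%}")
```

In this model the transfers cost nothing as long as each one is shorter than the kernel, which is the "virtually no compute impact" case in the slide title.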

  • Page 13

    (Over) Peak Data Rates Highest for Direct Transfers from/to Pinned Host Memory

    [Figure: two plots of transfer rate (MBps, 0 to 10000) versus transfer size (MB, 0 to 30).]

    Panel "oclBandwidthTest", series: Cpu2Gpu, pinned, direct; Gpu2Cpu, pinned, direct; Cpu2Gpu, pinned, mapped; Gpu2Cpu, pinned, mapped; Cpu2Gpu, pinned, paged; Gpu2Cpu, pinned, paged

    Panel "testTransferSpeedOpenCL", series: Cpu2Gpu, pinned, direct; Gpu2Cpu, pinned, direct; Cpu2Gpu, pinned, mapped; Gpu2Cpu, pinned, mapped; Cpu2Gpu, paged, direct; Gpu2Cpu, paged, direct

    All tests done on Q3000M on PCIe x16 Gen2 (GPUdirect on Q4000)

  • Page 14

    Other Transfers Sustain a Similar Rate

    • CL buffers, images and GL-interoperable variants perform similarly

    • Choose most appropriate CL memory type

    • Efficiency

      – > 4 GBps from 480p (1.3 MB)

      – > 4.8 GBps from 720p (3.5 MB)

    • All numbers for RGBA

      – Write to GPU: 10 streams of 1080p60

      – Read from GPU: 10 streams of 1080p60
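
The frame sizes and stream counts above follow from simple arithmetic. The sketch below assumes uncompressed RGBA at 4 bytes per pixel, the usual resolutions for the 480p/720p/1080p labels, binary megabytes for the "MB" sizes, and decimal gigabytes for the rates. The deck states none of these explicitly, so treat them as assumptions.

```python
MIB = 1024 * 1024  # assumption: the slide's "MB" sizes are binary megabytes


def frame_bytes(width, height, bytes_per_pixel=4):
    """Size of one uncompressed frame; 4 bytes per pixel for RGBA."""
    return width * height * bytes_per_pixel


def streams_supported(rate_bytes_per_s, width, height, fps):
    """How many streams of the given format fit in a sustained rate."""
    return rate_bytes_per_s / (frame_bytes(width, height) * fps)


# Assumed resolutions for the labels used on the slide.
formats = {"480p": (720, 480), "720p": (1280, 720), "1080p": (1920, 1080)}

for name, (w, h) in formats.items():
    print(f"{name}: {frame_bytes(w, h) / MIB:.2f} MiB per RGBA frame")

# 5.2e9 bytes/s corresponds to the "around 5.2 GBps" quoted on Page 17.
print(f"{streams_supported(5.2e9, 1920, 1080, 60):.1f} streams of 1080p60")
```

Under these assumptions the 480p and 720p frames come out at roughly 1.3 and 3.5 MiB, and a sustained rate near 5 GBps carries about ten 1080p60 streams, consistent with the figures on the slide.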

    [Figure "OpenCL": transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30). Series: Cpu2Gpu, buffer; Gpu2Cpu, buffer; Cpu2Gpu, image; Gpu2Cpu, image; Cpu2Gpu, bufferGL; Gpu2Cpu, bufferGL; Cpu2Gpu, imageGL; Gpu2Cpu, imageGL]

  • Page 15

    OpenCL/CUDA Transfers with Parallel Compute (Transfer Dominated)

    [Timeline traces for OpenCL and CUDA: in (n+1) and out (n-1) overlapping GPU proc (n).]

    • GPU dual copy engines working

    • GPU compute in parallel with data transfers

    • Still some gaps present

  • Page 16

    Throughput Impact Related to Kernel Duration

    [Figure "OpenCL": transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30). Series: Peak transfer; GPU parallel]

    • Efficiency (OpenCL)

      – > 3.2 GBps from 480p (1.3 MB)

      – > 3.4 GBps from 720p (3.5 MB)

    • All numbers for RGBA

      – Write to GPU: 6.8 streams of 1080p60

      – Read from GPU: 6.8 streams of 1080p60

    • Note that maximizing the GPU compute time is also hampered
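
To put the throughput impact in proportion, the rates quoted here can be compared with the transfer-only rates from Page 14. The slide figures are lower bounds ("> ..."), so the ratios below are rough indications rather than measurements.

```python
# Sustained rates in GBps as quoted on the slides: transfer only versus
# transfers running in parallel with GPU compute.
transfer_only = {"480p": 4.0, "720p": 4.8}
with_compute = {"480p": 3.2, "720p": 3.4}

retained = {k: with_compute[k] / transfer_only[k] for k in transfer_only}
for size, fraction in retained.items():
    print(f"{size}: {fraction:.0%} of the transfer-only rate retained")

# Same comparison expressed in 1080p60 RGBA streams (6.8 versus 10).
stream_fraction = 6.8 / 10
print(f"1080p60 streams: {stream_fraction:.0%}")
```

By this estimate, running kernels alongside the copies costs somewhere between a fifth and a third of the transfer-only rate.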

  • Page 17

    CUDA and GPUdirect Achieve Highest Peak Rate

    • Efficiency

      – CUDA transfers boost to 6 GBps

      – DVP read from GPU up to 7.5 GBps for very large transfers

      – Other: all around 5.2 GBps

    • How to get 6 GBps for all programming models?

    [Figure: transfer rate (MBps, 0 to 9000) versus transfer size (MB, 0 to 30). Series: Cpu2Gpu, OpenCL; Gpu2Cpu, OpenCL; Cpu2Gpu, OpenGL; Gpu2Cpu, OpenGL; Cpu2Gpu, GPUdirect; Gpu2Cpu, GPUdirect; Cpu2Gpu, CUDA; Gpu2Cpu, CUDA]

  • Page 18

  • Page 19

    CL/GL Interoperability Hampers Parallelism

    [Figure "OpenCL": transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30). Series: Peak transfer; GPU parallel]

  • Page 20

    CUDA / GL Interoperability not Trivial

    [Two panels: "No GL rendering" and "With GL rendering"]

  • Page 21

    Return to OpenGL, Render on Full HD Screen (1/2)

    [Figure "OpenGL": transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30). Series: Peak transfer; GPU parallel]

  • Page 22

    Return to OpenGL, Readback to CPU Memory (2/2)

    [Figure "OpenGL": transfer rate (MBps, 0 to 6000) versus transfer size (MB, 0 to 30). Series: Peak transfer; GPU parallel]

  • Page 23

    to Avoid Interoperability Issue

    • 9 HD 1080p in at 60 fps → 4.5 GBps

    • Parallel rendering
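
The 4.5 GBps figure can be checked by hand. This sketch assumes uncompressed RGBA input (4 bytes per pixel, as stated for the earlier measurements) and decimal gigabytes. The function name and defaults are illustrative only.

```python
def required_rate_gbps(streams, width=1920, height=1080, fps=60, bytes_per_pixel=4):
    """Aggregate input rate in GB/s (10^9 bytes/s) for uncompressed video."""
    return streams * width * height * bytes_per_pixel * fps / 1e9


rate = required_rate_gbps(9)
print(f"9 x 1080p60 RGBA: {rate:.2f} GB/s")
```

Nine such streams need just under 4.5 GB/s, which sits below the roughly 5.2 GBps sustained by the OpenGL transfer path on Page 17.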

  • Page 24

  • Page 25

    Partial Image Transfers for Low Latency

    • A 1/8 HD (1 MB) transfer size still achieves a reasonable rate (certainly for CUDA)

    • Concurrent partial updates of the same image?
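
Why slicing a frame reduces latency can be sketched with a two-stage pipeline model. This is not the authors' measurement. It assumes the transfer rate stays constant for small slices, that compute divides evenly across slices, and that slice k is processed while slice k+1 is uploaded. The numbers are hypothetical.

```python
def frame_latency_ms(frame_mb, rate_mb_per_ms, proc_ms, parts):
    """Latency from start of upload to end of processing for one frame.

    The frame is split into `parts` equal slices; each slice is processed
    while the next one is still being transferred.
    """
    transfer_slice = frame_mb / rate_mb_per_ms / parts
    proc_slice = proc_ms / parts
    # The first slice must be transferred, the last slice must be processed,
    # and the remaining parts - 1 steps are paced by the slower stage.
    return transfer_slice + proc_slice + (parts - 1) * max(transfer_slice, proc_slice)


# Hypothetical numbers: an 8 MB frame, 4 MB/ms (about 4 GBps), 2 ms of compute.
whole = frame_latency_ms(8.0, 4.0, 2.0, parts=1)
eighths = frame_latency_ms(8.0, 4.0, 2.0, parts=8)
print(f"whole frame: {whole:.2f} ms, 1/8-frame slices: {eighths:.2f} ms")
```

The gain depends on the rate holding up at small transfer sizes, which is why the 1 MB point on the rate curves matters.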

  • Page 26

    Conclusions

    • Barco’s professional visualization requires

      – High quality

      – High resolution

      – Multiple sources

    • Barco’s professional visualization desires portability

    • DMA-enabled and fully parallel data transfers are essential

    • Mind the gap: peak data rates cannot be achieved continuously

    • CL or CUDA/GL interoperability is difficult

