High Performance Video Pipelining: A Flexible Architecture for GPU Processing of Broadcast Video
Peter Walsh, Chief Emerging Technology Engineer
ESPN
Overview
• Real-time GPU processing of broadcast video
– Maximize GPU utilization
– Maintain flexibility
• High Performance Video Pipeline
– CPU and GPU buffers
– Data transfer
Monday Night Football production truck
NASCAR production truck
Studio (BCS championship “Film Room”)
GPU Processing
• Segmentation (generating chromakey)
• Inserting graphics (linear and chromakeying)
• Field (camera) tracking
• Object (player) tracking
[Diagram: processing blocks spanning CPU and GPU: Input Video, Segmentation, GFX insertion, Field Tracking, Object Tracking, Rendering, Interop, Output Video]
Background
• “Best Practices in GPU-Based Video Processing,” Tom True, NVIDIA, GTC 2013
• “Topics in GPU-Based Video Processing,” Tom True, NVIDIA, GTC 2014
Naïve Sequential Implementation
• Acquire
• Upload
• Process
• Download
• Output
1 Frame Time
Simultaneous Operations
• Acquire
• Upload
• Process
• Download
• Output
1 Frame Time
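One way to read the two slides above: in the naive version all five steps for a frame run back to back inside a single frame time, while with simultaneous operations each step works on a different frame during the same frame time. The following is a minimal C++ sketch of such a schedule. It assumes (the slides do not state this) that every stage takes at most one frame time and that stage s handles frame t - s during frame time t. The names `kStages`, `frameAtStage` and `printSchedule` are hypothetical.

```cpp
#include <array>
#include <cstdio>

// The five pipeline stages named on the slide, in order.
constexpr std::array<const char*, 5> kStages = {
    "Acquire", "Upload", "Process", "Download", "Output"};

// Frame handled by `stage` during frame time `tick`, or -1 if the
// pipeline has not filled that far yet.
// Assumption: each stage needs at most one frame time.
constexpr int frameAtStage(int tick, int stage) {
    return tick >= stage ? tick - stage : -1;
}

// Print which frame each stage works on for the first `ticks` frame times.
void printSchedule(int ticks) {
    for (int t = 0; t < ticks; ++t) {
        std::printf("tick %d:", t);
        for (int s = 0; s < static_cast<int>(kStages.size()); ++s) {
            const int f = frameAtStage(t, s);
            if (f >= 0) std::printf(" %s(frame %d)", kStages[s], f);
        }
        std::printf("\n");
    }
}
```

Under these assumptions every stage is busy once the pipeline has filled (from tick 4 onward). Each stage then has a full frame time available, at the cost of several frame times of latency between Acquire and Output for any given frame.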
Techniques
• Avoid CPU memory copies
• Use pinned system memory
• DMA Video I/O using pinned memory
• DMA between CPU and GPU
• Asynchronous – using multiple CUDA streams
• Double buffers for simultaneous R/W
Frame Buffers
Pinned System
System
GPU
Buffer Allocation
• Device
• System
• Pinned System
• 1D
• 2D (pitch specified)
• 2D (pitch determined by CUDA allocation)
Pitch
CUDA API
Allocation: cudaMalloc(), cudaHostAlloc(), cudaMallocPitch()
Memory Copies: cudaMemcpy(), cudaMemcpy2D(), cudaMemcpyAsync(), cudaMemcpy2DAsync()
Buffer Transfers
B.Copy(A, pStream)
• Source and destination buffers
– System, pinned system, device
– Different pitches
• Supports Synchronous/Asynchronous transfers
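The slides do not show how `B.Copy(A, pStream)` is implemented. As a sketch of one possible approach, the copy direction could be derived from where each buffer lives, and the asynchronous call chosen whenever a stream is supplied. The code below is a host-only stand-in that only plans the transfer and makes no CUDA calls. `MemKind`, `Direction`, `Stream`, `Buffer`, `CopyPlan` and `planCopy` are all hypothetical names; the four directions correspond to the `cudaMemcpyKind` values `cudaMemcpyHostToHost`, `cudaMemcpyHostToDevice`, `cudaMemcpyDeviceToHost` and `cudaMemcpyDeviceToDevice`.

```cpp
#include <cstddef>
#include <string>

enum class MemKind { System, PinnedSystem, Device };
enum class Direction { HostToHost, HostToDevice, DeviceToHost, DeviceToDevice };

struct Stream {};  // hypothetical placeholder for a CUDA stream handle

struct Buffer {
    MemKind kind;
    std::size_t pitch;       // bytes between row starts
    std::size_t widthBytes;  // payload bytes per row
    std::size_t height;      // number of rows
};

struct CopyPlan {
    Direction direction;
    bool async;        // true when a stream was supplied
    bool mayOverlap;   // async and no pageable system memory involved
    std::string call;  // CUDA copy routine such a wrapper might forward to
};

inline bool onDevice(const Buffer& b) { return b.kind == MemKind::Device; }

// Decide how a pitched 2D copy from `src` to `dst` could be issued.
// Assumption: a null stream pointer requests a synchronous copy.
CopyPlan planCopy(const Buffer& dst, const Buffer& src, const Stream* pStream) {
    Direction d;
    if (onDevice(src)) {
        d = onDevice(dst) ? Direction::DeviceToDevice : Direction::DeviceToHost;
    } else {
        d = onDevice(dst) ? Direction::HostToDevice : Direction::HostToHost;
    }
    const bool async = pStream != nullptr;
    // Async copies that touch pageable system memory generally cannot
    // overlap with other work, so only pinned or device memory qualifies.
    const bool mayOverlap = async && src.kind != MemKind::System &&
                            dst.kind != MemKind::System;
    return {d, async, mayOverlap, async ? "cudaMemcpy2DAsync" : "cudaMemcpy2D"};
}
```

A plan like this keeps the caller's code identical for uploads, downloads and device-to-device copies, which is consistent with the single `Copy` call shown on the slide.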
CUDA Kernels
LaunchKernel( A, B, pStream, …)
• Buffers A and B are in device memory
• Sync/Async behavior controlled by pStream
Processing
• Acquire(A)
• B.Copy(A, pUploadStream)
• Process(B, C, pProcessingStream, params)
• D.Copy(C, pDownLoadStream)
• Output(D)
[Diagram: buffers A and D in CPU memory; buffers B and C in GPU memory]
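Double Buffering
[Diagram: two buffers swap roles between frames; the Dst buffer of frame “i” becomes the Src buffer of frame “i + 1”, and vice versa]
[Diagram: processing pipeline across CPU and GPU with a Src/Dst buffer pair at each stage]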
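The role swap between frame “i” and frame “i + 1” can be expressed as an index that flips once per frame. The following is a minimal host-only C++ sketch; `DoubleBuffer` is a hypothetical name, and the sketch assumes exactly two buffers and one swap per frame boundary.

```cpp
#include <array>
#include <cstddef>

template <typename T>
class DoubleBuffer {
public:
    // Buffer the producer writes into during the current frame.
    T& dst() { return buffers_[writeIndex_]; }

    // Buffer the consumer reads during the current frame
    // (the one written during the previous frame).
    const T& src() const { return buffers_[1 - writeIndex_]; }

    // Call once per frame boundary: Dst becomes Src and vice versa.
    void swap() { writeIndex_ = 1 - writeIndex_; }

private:
    std::array<T, 2> buffers_{};
    std::size_t writeIndex_ = 0;
};
```

The sketch leaves out synchronization. In a real pipeline the swap may only happen once both the write and the read for that frame have finished, for example after the relevant CUDA streams have been synchronized.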
Intel IPP: ippiFilter_8u_C1R(pSrcImgOffset, srcPitch, pDstImgOffset, dstPitch, roi, filterKernel, kernelSize, anchor, divisor);
NVIDIA NPP: nppiFilter_8u_C1R(pSrcImgOffset, srcPitch, pDstImgOffset, dstPitch, roi, filterKernel, kernelSize, anchor, divisor);
HPVP: Filter_8u_C1R(pSrc, pDest, roi, pFilterKernel);
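One plausible reading of the shorter HPVP call is that the buffer and kernel objects carry their own pointer, pitch, size, anchor and divisor, so a thin wrapper can unpack them and forward to a pointer-and-pitch routine. The host-only C++ sketch below illustrates that idea. Everything in it is hypothetical: `Roi`, `Image`, `FilterKernel`, `filterRaw` and `Filter8uC1Sketch` are illustrative names, and `filterRaw` is a simple stand-in (a plain weighted sum that skips taps falling outside the ROI), not the IPP or NPP implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Roi { int width; int height; };

// Hypothetical buffer object: owns its pixels and remembers its pitch.
struct Image {
    int width, height, pitch;  // pitch in bytes, >= width
    std::vector<std::uint8_t> data;
    Image(int w, int h, int p)
        : width(w), height(h), pitch(p),
          data(static_cast<std::size_t>(p) * h, 0) {}
};

// Hypothetical kernel object bundling size, anchor, divisor and weights.
struct FilterKernel {
    int width, height;
    int anchorX, anchorY;     // kernel cell aligned with the output pixel
    int divisor;              // weighted sum is divided by this value
    std::vector<int> coeffs;  // row-major, width * height entries
};

// Pointer-and-pitch style routine with a long argument list (stand-in).
// The ROI is assumed to start at the image origin.
void filterRaw(const std::uint8_t* src, int srcPitch,
               std::uint8_t* dst, int dstPitch, Roi roi,
               const int* kernel, int kW, int kH,
               int anchorX, int anchorY, int divisor) {
    for (int y = 0; y < roi.height; ++y) {
        for (int x = 0; x < roi.width; ++x) {
            int sum = 0;
            for (int ky = 0; ky < kH; ++ky) {
                for (int kx = 0; kx < kW; ++kx) {
                    const int sx = x + kx - anchorX;
                    const int sy = y + ky - anchorY;
                    if (sx < 0 || sy < 0 || sx >= roi.width || sy >= roi.height)
                        continue;  // tap outside the ROI: skipped
                    sum += kernel[ky * kW + kx] * src[sy * srcPitch + sx];
                }
            }
            dst[y * dstPitch + x] =
                static_cast<std::uint8_t>(std::clamp(sum / divisor, 0, 255));
        }
    }
}

// Short wrapper: the objects supply pointer, pitch and kernel details.
void Filter8uC1Sketch(const Image& src, Image& dst, Roi roi,
                      const FilterKernel& k) {
    filterRaw(src.data.data(), src.pitch, dst.data.data(), dst.pitch, roi,
              k.coeffs.data(), k.width, k.height,
              k.anchorX, k.anchorY, k.divisor);
}
```

With this layout the caller never handles pitches directly, so source and destination buffers with different pitches can be passed to the same call.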
Live Filtering
• Acquire(A)
• B.Copy(A, pUploadStream)
• Filter_8u_C3R(B, C, roi, pFilterKernel) *
• D.Copy(C, pDownLoadStream)
• Output(D)
* CUDA stream for processing already defined
References/Links
“Best Practices in GPU-Based Video Processing,” Tom True, NVIDIA, GTC 2013
“Topics in GPU-Based Video Processing,” Tom True, NVIDIA, GTC 2014
http://www.youtube.com/watch?v=QpEV-XVIxNw
http://frontrow.espn.go.com/2014/01/espns-advanced-replay-tool-art-graphically-enhances-sports-telecasts/