A few words about UbiCast
● Products
  ○ presentation capture appliances based on GStreamer since 2007 (lectures, e-learning, events)
  ○ video CMS: gstconf.ubicast.tv
● Supports a variety of sources:
  ○ v4l (SDI, VGA, YUV, HDMI, Composite, …)
  ○ select network cameras (Axis, Canon, SONY)
  ○ USB audio sources
Context
● Our capture product is stable but based on a software-only encoding architecture (gst-0.10)
● Upcoming markets require multi-channel solutions (e.g. medical): 4+ 1080p inputs
● We want to look into smaller form-factor solutions (fanless? embedded?)
● We wanted to benchmark the available accelerated encoding solutions
Target use case: live video encoding for high quality archival (20 Mb/s)
Live encoding constraints
- support worst case scenarios (e.g. noise/snow)
- CBR (predictable file size/complexity), GOP of 30 frames
- prefer low latency whenever possible
Simulating live inputs
A live source cannot be used for benchmarking ← caps at 60 fps
Closest equivalent: push raw frames to encoders (e.g. actual footage)
Challenges:
● with uncompressed media, the HDD can be the bottleneck
● with compressed media, decoding introduces a CPU/processing bias
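A back-of-the-envelope calculation shows why the disk becomes the bottleneck with uncompressed media. The Python sketch below is illustrative only and not part of the benchmark tooling; the function names are hypothetical, and it assumes 8-bit formats without stride padding (12 bits per pixel for I420/NV12, 16 for YUY2/UYVY).

```python
# Bits per pixel for the 8-bit raw formats discussed in this talk (assumption:
# no row padding, which holds for 1920x1080 and 3840x2160).
BITS_PER_PIXEL = {"I420": 12, "NV12": 12, "YUY2": 16, "UYVY": 16}


def frame_size_bytes(width, height, fmt="I420"):
    """Size of one raw frame in bytes."""
    return width * height * BITS_PER_PIXEL[fmt] // 8


def raw_rate_mb_per_s(width, height, fps, fmt="I420"):
    """Sustained read rate needed to feed raw frames at `fps`, in MB/s (10^6 bytes)."""
    return frame_size_bytes(width, height, fmt) * fps / 1e6


if __name__ == "__main__":
    # A live 30 fps feed is modest, but a benchmark pushes frames much faster.
    for fps in (30, 60, 300):
        rate = raw_rate_mb_per_s(1920, 1080, fps)
        print(f"1080p I420 at {fps} fps: {rate:.0f} MB/s")
```

A single 1080p I420 frame is about 3.1 MB, so pushing frames at several hundred fps needs on the order of 1 GB/s, which is more than a hard disk can typically sustain.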
Initial target platforms
RaspberryPi B+
Tegra K1 (Jetson)
Nvidia NVENC (GTX 760)
VAAPI / CPU:
● i7-4771 @ 3.50GHz (GT2 HD Graphics 4600) “Haswell”
● i3-6100U (GT2 HD Graphics 520, 24 EU): NUC6i3SYK “Skylake”
● i5-6260U (GT3e Iris Graphics 540, 48 EU): NUC6i5SYK “Skylake”
● i7-6770HQ (GT4e Intel Iris Pro Graphics 580, 72 EU): NUC6i7KYK “Skylake”
Had to lend it
Forget it
Target formats
Codecs: h264, h265, vp8, mjpeg
Resolutions:
● 1080p (1920x1080)
● 4k (3840x2160)
Colorspaces:
● I420/NV12 ← natively supported by most encoders (“raw” numbers)
● YUY2, UYVY ← widespread capture device colorspaces, will require conversion
Generating test load (... on embedded!)
Options:
● videotestsrc ← too slow even on i7
gst-launch-1.0 videotestsrc pattern=snow num-buffers=1000 ! video/x-raw\,\ format\=\(string\)I420\,\ width\=\(int\)1920\,\ height\=\(int\)1080\,\ framerate\=\(fraction\)30/1\,\ pixel-aspect-ratio\=\(fraction\)1/1\,\ interlace-mode\=\(string\)progressive ! fakesink
takes ~11 s for 1000 frames on a 3.5 GHz i7 Haswell ← about 85 to 90 fps
Plus, we don’t get to inject actual footage
Muxed video files
● muxed raw video (e.g. mkvdemux 1700 fps) ← CPU overhead (esp. embedded)
● gdppay'ed video ← copy
gst-launch-1.0 videotestsrc pattern=snow num-buffers=1000 ! video/x-raw\,\ format\=\(string\)I420\,\ width\=\(int\)1920\,\ height\=\(int\)1080\,\ framerate\=\(fraction\)30/1\,\ pixel-aspect-ratio\=\(fraction\)1/1\,\ interlace-mode\=\(string\)progressive ! matroskamux ! filesink location=test.mkv
gst-launch-1.0 filesrc location=test.mkv ! matroskademux ! fakesink ← 290 fps
gst-launch-1.0 filesrc location=test.gdp ! gdpdepay ! fakesink ← 558 fps
● videoparse
● filesrc blocksize (side effects)
gst-launch-1.0 videotestsrc pattern=snow num-buffers=100 ! video/x-raw\,\ format\=\(string\)I420\,\ width\=\(int\)1920\,\ height\=\(int\)1080\,\ framerate\=\(fraction\)30/1\,\ pixel-aspect-ratio\=\(fraction\)1/1\,\ interlace-mode\=\(string\)progressive ! filesink location=test.raw
gst-launch-1.0 filesrc location=test.raw ! videoparse format=i420 width=1920 height=1080 ! fakesink ← 2564 fps on PC, 30.67 fps on Pi
Even faster
Embedded
About HDD/RAM
Media must fit into RAM/cache ← automatic free RAM detection and copy into tmpfs (otherwise reading is slowed down to disk speed, e.g. 150 fps on PC)
Always run once (cache “heating”) to ensure that media is loaded into RAM
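The "does the media fit into RAM" check can be sketched as follows. This is not the tmpfs-based detection used for the benchmark; it is a hypothetical Python illustration that reads `MemAvailable` from `/proc/meminfo` (Linux only), and the 80% safety margin is an arbitrary assumption.

```python
def available_ram_bytes():
    """Best-effort estimate of available RAM from /proc/meminfo (Linux only).

    Returns None when the information cannot be read.
    """
    try:
        with open("/proc/meminfo") as meminfo:
            for line in meminfo:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) * 1024  # value is reported in kB
    except OSError:
        pass
    return None


def fits_in_ram(media_bytes, available_bytes, safety_margin=0.8):
    """True if the media uses at most `safety_margin` of the available RAM."""
    return media_bytes <= available_bytes * safety_margin


def max_raw_frames(available_bytes, frame_bytes, safety_margin=0.8):
    """How many raw frames fit in the RAM budget."""
    return int(available_bytes * safety_margin) // frame_bytes


if __name__ == "__main__":
    frame = 1920 * 1080 * 3 // 2  # one 1080p I420 frame, in bytes
    ram = available_ram_bytes()
    if ram is None:
        print("available RAM unknown on this platform")
    else:
        print(f"up to {max_raw_frames(ram, frame)} raw 1080p I420 frames fit in RAM")
```

With 4 GiB available, this budget holds roughly 1100 raw 1080p I420 frames, so test clips have to stay short.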
Encountered issues
- #771528 vaapivp8enc segfault with raw video (thanks sreerenj)
- #772457 delayed linking failed when requesting I420 from decodebin (vaapi)
- #771524 GPU reset and pipeline hangs randomly when using low-power tune (GT3e / GT4e)
- #772584 x265 fails to encode Y444
- Jetson TK1 not booting, decodebin fails on tegra with mp4/h264 file
- total machine freeze when using vaapi + xorg heavily with intel_gpu_running
- no cbr on vaapih265enc
- pi way too slow to give results
- #772688 nvenc does not build on Ubuntu 16.04
- vpXenc abysmally slow with noise/snow despite: vp8enc end-usage=cbr target-bitrate=20000000 keyframe-mode=disabled keyframe-max-dist=30 threads=8 cpu-used=16
1.8.3 vs 1.9.90
software encoders performance is stable
vaapih264enc shows a massive 30% bump on GT2 and 10% on Haswell
vaapivp8enc shows a ~80% regression on the snow case
GT2 (i3)
Encoder          Pattern           Avg Mpx/s (1.8.3)  Avg Mpx/s (1.9.90)  var
*vaapih264enc*   pattern=snow-*    214                239                 11,6%
*vaapivp8enc*    pattern=snow-*    418                89                  -78,8%
*vaapih264enc*   pattern=smpte-*   211                291                 38,0%
*vaapivp8enc*    pattern=smpte-*   172                166                 -3,2%
*vaapih264enc*   pattern=black-*   226                272                 20,1%
*vaapivp8enc*    pattern=black-*   171                180                 5,6%
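The "var" column appears to be the relative change from 1.8.3 to 1.9.90. The Python sketch below recomputes it from the rounded Mpx/s figures of the GT2 (i3) table, assuming the formula (new - old) / old; results can differ from the slide by a few tenths of a point, presumably because the slide was computed from unrounded measurements.

```python
def variation_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (encoder, pattern, Mpx/s with 1.8.3, Mpx/s with 1.9.90), copied from the table.
ROWS = [
    ("vaapih264enc", "snow", 214, 239),
    ("vaapivp8enc", "snow", 418, 89),
    ("vaapih264enc", "smpte", 211, 291),
    ("vaapivp8enc", "smpte", 172, 166),
    ("vaapih264enc", "black", 226, 272),
    ("vaapivp8enc", "black", 171, 180),
]

if __name__ == "__main__":
    for encoder, pattern, old, new in ROWS:
        print(f"{encoder:14s} {pattern:6s} {variation_percent(old, new):+6.1f}%")
```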
Codecs
No surprises: jpeg is fastest, h264enc is second
jpegenc is not multi-threaded, and the benchmark does not parallelize
h265 hw accel is very efficient (compared to x265)
vp8 suffers from snow
NUC6i3
Encoder Compared to avenc_mjpeg
*avenc_mjpeg* 0%
*vaapijpegenc* -8%
*x264enc* -22%
*jpegenc* -34%
*vaapih264enc* -62%
*vaapih265enc* -127%
*vaapivp8enc* -175%
*x265enc* -490%
Software vs Vaapi
On average, across all form factors, x264enc is faster than vaapih264enc by 40%
However, the low power mode is only 13% slower (but tests only available on i3 because of GPU hang bug).
Format  Comparison                                               i3 (Skylake)  i5 (Skylake)  i7 (Haswell)
I420    x264enc vs vaapih264enc                                  25%           51%           48%
I420    x264enc vs vaapih264enc LP                               13%
YUY2    videoconvert + x264enc vs vaapipostproc + vaapih264enc   30%
Haswell vs Skylake
The Haswell CPU (84W) is not a fair match for the Skylake mobile CPUs (7.5W and 9.5W).
The performance per watt of SKL is impressive considering it draws about 10x less power
More hardware needed (desktop SKL!)
The i3 is faster than the i5 on vaapi encoders (!!!); the i7 was not much faster either (except on CPU-based encoding)
Encoder          i5 vs i3   Haswell vs i3
*avenc_mjpeg*    15%        173%
*x264enc*        10%        99%
*jpegenc*        0%         18%
*x265enc*        16%        108%
*vaapijpegenc*   -15%
*vaapih264enc*   -29%       38%
*vaapih265enc*   -13%
*vaapivp8enc*    -8%
Influence of the test pattern
● Footage (bbb, content, camera) vs videotestsrc (black, smpte, snow): on average, similar
● Big Buck Bunny is too easy (not a good test)
● Snow is actually a very good worst case scenario
● vaapi is more robust to snow
NUC5i3
Pattern Avg Mpx/s 1.9.90
pattern 261
sample 266
pattern=black 311
pattern=snow 193
pattern=smpte 280
sample=slides 270
sample=bbb 285
sample=speaker 225
Encoder Pattern Avg Mpx/s 1.9.90
software pattern=snow 144
vaapi pattern=snow 232
software sample=bbb-* 312
vaapi sample=bbb-* 256
1080p vs 4k
4K always squeezes in a bit more Mpx/s (10-15%)
Resolution   NUC6i3 (Mpx/s)   NUC6i5 (Mpx/s)   Haswell i7 (Mpx/s)
1920         249              253              525
3840         284              277              593
4k vs 1080   14%              10%              13%
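Mpx/s figures are easier to picture once converted back to frames per second at each resolution. The Python sketch below is an illustration only, assuming Mpx/s means 10^6 pixels encoded per second; the function names are hypothetical.

```python
def fps_from_mpx(mpx_per_s, width, height):
    """Frames per second corresponding to a throughput in Mpx/s."""
    return mpx_per_s * 1e6 / (width * height)


def gain_percent(mpx_1080p, mpx_4k):
    """Relative throughput gain of 4k over 1080p, in percent."""
    return (mpx_4k - mpx_1080p) / mpx_1080p * 100.0


if __name__ == "__main__":
    # (platform, Mpx/s at 1920x1080, Mpx/s at 3840x2160), from the table above.
    for name, hd, uhd in [("NUC6i3", 249, 284), ("NUC6i5", 253, 277), ("Haswell i7", 525, 593)]:
        print(
            f"{name}: {fps_from_mpx(hd, 1920, 1080):.0f} fps at 1080p, "
            f"{fps_from_mpx(uhd, 3840, 2160):.0f} fps at 4k, "
            f"gain {gain_percent(hd, uhd):+.1f}%"
        )
```

On the NUC6i3, for instance, 249 Mpx/s corresponds to about 120 fps at 1080p, and 284 Mpx/s to about 34 fps at 4k.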
I420 vs YUY2 vs UYVY
YUY2 and UYVY perform exactly the same
YUY2 costs ~40% performance (videoconvert/vaapipostproc)
While jpegenc accepts YUY2 natively, its performance is still 15% worse than I420 + videoconvert ! avenc_mjpeg max-threads=1
1080p30 streams equivalence (i3)
Encoder                         YUY2   I420   Avg cost
*avenc_mjpeg*                   4,7    6,8    -43%
*x264enc*                       3,9    5,2    -33%
*jpegenc*                       3,0    4,9    -63%
*x265enc*                       1,0    1,0    -1%
*vaapijpegenc*                  3,6    5,9    -65%
*vaapih264enc*                  2,7    4,0    -45%
*vaapih264enc tune=low-power*   2,9    4,7    -59%
*vaapih265enc*                  2,2    2,9    -32%
*vaapivp8enc*                   1,9    2,4    -26%
Low power Skylake
Can it be used in parallel (//)?
intel_gpu_top
Approx: 183 fps vs 128 (+43%) or 100 (+83%) ← maybe 30% more
Low power (fixed function mode)
Slice-based
Conclusion
This study does not include any quality metrics
Sadly capture devices always offer YUY2/UYVY (and not directly I420, rarely NV12)
GT3e/GT4e vaapi implementation seems less optimized than GT2
x264enc is blazing fast, and CPU-based encoding will stay around a bit longer
avenc_mjpeg even more (1171 Mpx/s on Haswell: x18 1080p streams!)
snow is a good worst case scenario
Embedded was painful
The combination of x264enc and vaapi (esp. LP) packs a lot of encoding performance into a sub-10W machine! 4x1080p robust streams seem possible
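The "x18 1080p streams" figure follows from dividing the throughput by the pixel rate of one 1080p30 stream. The Python sketch below is a hypothetical illustration of that conversion, assuming 30 fps streams and ignoring any overhead from running several encoders at once (parallel streams were not actually tested).

```python
# Pixel rate of a single 1920x1080 stream at 30 fps, in Mpx/s.
MPX_PER_1080P30_STREAM = 1920 * 1080 * 30 / 1e6  # 62.208


def streams_1080p30(mpx_per_s):
    """Theoretical number of 1080p30 streams for a given throughput."""
    return mpx_per_s / MPX_PER_1080P30_STREAM


def mpx_needed(n_streams):
    """Throughput in Mpx/s required to sustain `n_streams` 1080p30 streams."""
    return n_streams * MPX_PER_1080P30_STREAM


if __name__ == "__main__":
    print(f"1171 Mpx/s is about {streams_1080p30(1171):.1f} streams")
    print(f"4 streams need about {mpx_needed(4):.0f} Mpx/s")
```

By this estimate, four 1080p30 streams need roughly 249 Mpx/s in total.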
Future work
Evaluate nvenc
Evaluate (more) desktop hw (SKL/KBL)
Try LP mode on i5 & i7 once fixed
Actually test parallel streams (with drops & queue measurements)
Add ssim/psnr quality measurement