Date post: 09-Jun-2020
An overview of Video Encoding Benchmarks with GStreamer

A few words about UbiCast

● Products
  ○ presentation capture appliances based on GStreamer since 2007 (lectures, e-learning, events)
  ○ video CMS: gstconf.ubicast.tv
● Supports a variety of sources:
  ○ v4l (SDI, VGA, YUV, HDMI, Composite, …)
  ○ select network cameras (Axis, Canon, SONY)
  ○ USB audio sources

Context

● Our capture product is stable but based on a software-only encoding architecture (gst-0.10)
● Upcoming markets require multi-channel solutions (e.g. medical): 4+ 1080p inputs
● We want to look into smaller form-factor solutions (fanless? embedded?)
● We wanted to benchmark the available accelerated encoding solutions

Target use case: live video encoding for high-quality archival (20 Mb/s)

Live encoding constraints

- support worst-case scenarios (e.g. noise/snow)
- CBR (predictable file size/complexity), GOP of 30 frames
- prefer low latency whenever possible
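The CBR constraint is what makes archive size predictable. A minimal sketch of that arithmetic, assuming 1 Mb = 10^6 bits, a 30 fps source, and no container overhead (the function names are illustrative, not part of the benchmark):

```python
def cbr_file_size_bytes(bitrate_bps: float, duration_s: float) -> float:
    """Payload size of a constant-bitrate stream, ignoring container overhead."""
    return bitrate_bps * duration_s / 8


def gop_duration_s(gop_frames: int, fps: float) -> float:
    """Time between two keyframes for a fixed GOP length."""
    return gop_frames / fps


if __name__ == "__main__":
    # Assumption: the 20 Mb/s archival target, recorded for one hour.
    one_hour = cbr_file_size_bytes(20_000_000, 3600)
    print(f"1 h at 20 Mb/s: {one_hour / 1e9:.1f} GB")
    # Assumption: 30 fps input, so a GOP of 30 frames is one keyframe per second.
    print(f"keyframe interval: {gop_duration_s(30, 30.0):.1f} s")
```

Under these assumptions, one hour of 20 Mb/s video is about 9 GB per channel, which is the kind of figure a multi-channel capture box has to budget for.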


Simulating live inputs

A live source cannot be used for benchmarking ← caps at 60 fps
Closest equivalent: push raw frames (e.g. actual footage) to the encoders

Challenges:
● with uncompressed input, the HDD can be the bottleneck
● with compressed input, decoding adds a CPU/processing bias

Initial target platforms

● Raspberry Pi B+
● Tegra K1 (Jetson)
● Nvidia NVENC (GTX 760)

VAAPI / CPU:
● i7-4771 @ 3.50 GHz (GT2 HD Graphics 4600) “Haswell”
● i3-6100U (GT2 HD Graphics 520, 24 EU): NUC6i3SYK “Skylake”
● i5-6260U (GT3e Iris Graphics 540, 48 EU): NUC6i5SYK “Skylake”
● i7-6770HQ (GT4e Intel Iris Pro Graphics 580, 72 EU): NUC6i7KYK “Skylake”

Had to lend it

Forget it


Target formats

Codecs: h264, h265, vp8, mjpeg

Resolutions:
● 1080p (1920x1080)
● 4k (3840x2160)

Colorspaces:
● I420/NV12 ← natively supported by most encoders (“raw” numbers)
● YUY2, UYVY ← widespread capture device colorspaces, will require conversion

Generating test load (... on embedded!)

Options:

● videotestsrc ← too slow, even on an i7

gst-launch-1.0 videotestsrc pattern=snow num-buffers=1000 ! "video/x-raw,format=I420,width=1920,height=1080,framerate=30/1,pixel-aspect-ratio=1/1,interlace-mode=progressive" ! fakesink

takes about 11 s on a 3.5 GHz Haswell i7 ← roughly 85 to 91 fps (1000/11 ≈ 91)
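The source rate is simply the number of buffers divided by the wall-clock time, and it can be converted to the Mpx/s unit used in the result tables. A small sketch, assuming 1 Mpx = 10^6 pixels (the helper names are hypothetical):

```python
def frames_per_second(num_frames: int, elapsed_s: float) -> float:
    """Average frame rate of a run that pushed num_frames in elapsed_s seconds."""
    return num_frames / elapsed_s


def megapixels_per_second(width: int, height: int, fps: float) -> float:
    """Pixel throughput in Mpx/s, assuming 1 Mpx = 1e6 pixels."""
    return width * height * fps / 1e6


if __name__ == "__main__":
    # Figures from the videotestsrc run: 1000 buffers in about 11 s at 1080p.
    fps = frames_per_second(1000, 11.0)
    mpx = megapixels_per_second(1920, 1080, fps)
    print(f"{fps:.1f} fps, {mpx:.1f} Mpx/s")
```

With 1000 frames in 11 s this gives about 91 fps, or roughly 188 Mpx/s at 1080p, which is lower than most encoder throughputs reported later, so the generator itself would be the bottleneck.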

Plus, we don’t get to inject actual footage


Muxed video files

● muxed raw video (e.g. mkvdemux 1700 fps) ← CPU overhead (esp. on embedded)
● gdppay'ed video ← copy

gst-launch-1.0 videotestsrc pattern=snow num-buffers=1000 ! "video/x-raw,format=I420,width=1920,height=1080,framerate=30/1,pixel-aspect-ratio=1/1,interlace-mode=progressive" ! matroskamux ! filesink location=test.mkv

gst-launch-1.0 filesrc location=test.mkv ! matroskademux ! fakesink
← 290 fps

gst-launch-1.0 filesrc location=test.gdp ! gdpdepay ! fakesink
← 558 fps

● videoparse
● filesrc blocksize (side effects)

gst-launch-1.0 videotestsrc pattern=snow num-buffers=100 ! "video/x-raw,format=I420,width=1920,height=1080,framerate=30/1,pixel-aspect-ratio=1/1,interlace-mode=progressive" ! filesink location=test.raw

gst-launch-1.0 filesrc location=test.raw ! videoparse format=i420 width=1920 height=1080 ! fakesink
← 2564 fps on PC, 30.67 fps on Pi
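Reading headerless raw video is cheap because every frame has the same size, so the reader only has to cut the byte stream into fixed-size slices. The sketch below illustrates that idea in Python; it is not GStreamer's implementation, and it assumes I420 with even dimensions and no row padding (true for 1920x1080):

```python
import io
from typing import BinaryIO, Iterator


def i420_frame_size(width: int, height: int) -> int:
    """Bytes per I420 frame: full-size Y plane plus two quarter-size chroma planes."""
    return width * height * 3 // 2


def iter_raw_frames(stream: BinaryIO, frame_size: int) -> Iterator[bytes]:
    """Yield fixed-size frames from a headerless stream; drop a trailing partial frame."""
    while True:
        frame = stream.read(frame_size)
        if len(frame) < frame_size:
            return
        yield frame


if __name__ == "__main__":
    # Tiny in-memory stand-in for test.raw: three 4x2 frames plus 5 stray bytes.
    size = i420_frame_size(4, 2)
    fake_file = io.BytesIO(bytes(size * 3 + 5))
    print(sum(1 for _ in iter_raw_frames(fake_file, size)), "complete frames")
```

No demuxing or depayloading is involved, which is consistent with this option being the fastest of the three.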

Even faster

Embedded


About HDD/RAM

Media must fit into RAM/cache ← the benchmark detects free RAM automatically and copies the media into tmpfs (otherwise throughput drops to disk speed, e.g. 150 fps on PC)

Always run once first (cache “heating”) to ensure that the media is loaded into RAM
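Whether a raw clip fits in RAM can be estimated before generating it. A rough sketch, assuming unpadded frames and a hypothetical 80% headroom factor (the names and the headroom value are illustrative, not taken from the benchmark scripts):

```python
# Average bytes per pixel for the raw formats used in this benchmark.
BYTES_PER_PIXEL = {"I420": 1.5, "NV12": 1.5, "YUY2": 2.0, "UYVY": 2.0}


def raw_media_bytes(width: int, height: int, num_frames: int, fmt: str = "I420") -> int:
    """Size of an uncompressed clip, assuming no row padding."""
    frame_bytes = int(width * height * BYTES_PER_PIXEL[fmt])
    return frame_bytes * num_frames


def fits_in_ram(media_bytes: int, free_ram_bytes: int, headroom: float = 0.8) -> bool:
    """True if the clip fits within a fraction (headroom) of the free RAM."""
    return media_bytes <= free_ram_bytes * headroom


if __name__ == "__main__":
    clip = raw_media_bytes(1920, 1080, 1000)  # 1000 frames of 1080p I420
    print(f"{clip / 1e9:.2f} GB, fits in 8 GiB free: {fits_in_ram(clip, 8 * 2**30)}")
```

A 1000-frame 1080p I420 clip is about 3.1 GB, so on small embedded boards the clip length has to be reduced for the media to stay in RAM.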


Encountered issues

- #771528 vaapivp8enc segfault with raw video (thanks sreerenj)
- #772457 delayed linking failed when requesting I420 from decodebin (vaapi)
- #771524 GPU reset and pipeline hangs randomly when using low-power tune (GT3e / GT4e)
- #772584 x265 fails to encode Y444
- Jetson TK1 not booting
- decodebin fails on tegra with mp4/h264 file
- total machine freeze when using vaapi + xorg heavily with intel_gpu_running
- no CBR on vaapih265enc
- pi way too slow to give results
- #772688 nvenc does not build on Ubuntu 16.04
- vpXenc abysmally slow with noise/snow despite: vp8enc end-usage=cbr target-bitrate=20000000 keyframe-mode=disabled keyframe-max-dist=30 threads=8 cpu-used=16

PC vs Embedded

… No comment

1.8.3 vs 1.9.90

● software encoder performance is stable
● vaapih264enc shows a massive 30% bump on GT2 and 10% on Haswell
● vaapivp8enc shows a ~80% regression in the snow case

GT2 (i3)

Encoder          Pattern           Avg Mpx/s (1.8.3)   Avg Mpx/s (1.9.90)   Variation
*vaapih264enc*   pattern=snow-*    214                 239                  11.6%
*vaapivp8enc*    pattern=snow-*    418                 89                   -78.8%
*vaapih264enc*   pattern=smpte-*   211                 291                  38.0%
*vaapivp8enc*    pattern=smpte-*   172                 166                  -3.2%
*vaapih264enc*   pattern=black-*   226                 272                  20.1%
*vaapivp8enc*    pattern=black-*   171                 180                  5.6%
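The variation column is the relative change in throughput between the two GStreamer versions. A minimal sketch of that calculation, using only the rounded snow figures from the GT2 (i3) table (the function name is illustrative):

```python
def variation_pct(before: float, after: float) -> float:
    """Relative change from 'before' to 'after', in percent."""
    return (after - before) / before * 100.0


# (encoder, pattern, Mpx/s on 1.8.3, Mpx/s on 1.9.90), copied from the table above.
ROWS = [
    ("vaapih264enc", "snow", 214, 239),
    ("vaapivp8enc", "snow", 418, 89),
]

if __name__ == "__main__":
    for encoder, pattern, old, new in ROWS:
        print(f"{encoder:14s} {pattern:6s} {variation_pct(old, new):+.1f}%")
```

Because the table shows throughputs rounded to whole Mpx/s, recomputed values can differ from the printed column by a few tenths of a percent.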

Codecs

● No surprises: jpeg is fastest, h264enc is second
● jpegenc is not multi-threaded, and the benchmark does not parallelize
● h264 hardware acceleration is very efficient (compared to x265)
● vp8 suffers from snow

NUC6i3

Encoder          Compared to avenc_mjpeg
*avenc_mjpeg*    0%
*vaapijpegenc*   -8%
*x264enc*        -22%
*jpegenc*        -34%
*vaapih264enc*   -62%
*vaapih265enc*   -127%
*vaapivp8enc*    -175%
*x265enc*        -490%

Software vs Vaapi

On average, across all form factors, x264enc is 40% faster than vaapih264enc.

However, the low-power mode is only 13% slower than x264enc (tests are only available on the i3 because of the GPU hang bug).

I420, x264enc vs vaapih264enc: 25% on i3 (Skylake), 51% on i5 (Skylake), 48% on i7 (Haswell)
I420, x264enc vs vaapih264enc LP: 13% on i3 (Skylake)
YUY2, videoconvert + x264enc vs vaapipostproc + vaapih264enc: 30%

Haswell vs Skylake

The Haswell CPU (84 W) is not a fair match for the Skylake mobile CPUs (7.5 W and 9.5 W).

The performance per watt of SKL is impressive, considering that it draws about 10x less power.

More hardware needed (desktop SKL!)

With vaapi, the i3 is faster than the i5 (!!!); the Haswell i7 is not much faster either (apart from its CPU).

Encoder          i5 vs i3   Haswell vs i3
*avenc_mjpeg*    15%        173%
*x264enc*        10%        99%
*jpegenc*        0%         18%
*x265enc*        16%        108%
*vaapijpegenc*   -15%
*vaapih264enc*   -29%       38%
*vaapih265enc*   -13%
*vaapivp8enc*    -8%

Influence of the test pattern

● Footage (bbb, content, camera) vs videotestsrc (black, smpte, snow): on average, similar
● Big Buck Bunny is too easy (not a good test)
● Snow is actually a very good worst-case scenario
● vaapi is more robust to snow

NUC6i3

Pattern          Avg Mpx/s (1.9.90)
pattern          261
sample           266
pattern=black    311
pattern=snow     193
pattern=smpte    280
sample=slides    270
sample=bbb       285
sample=speaker   225

Encoder    Pattern         Avg Mpx/s (1.9.90)
software   pattern=snow    144
vaapi      pattern=snow    232
software   sample=bbb-*    312
vaapi      sample=bbb-*    256

1080p vs 4k

4k always squeezes in a bit more Mpx/s (10-15%)

Resolution     NUC6i3 (Mpx/s)   NUC6i5 (Mpx/s)   Haswell i7 (Mpx/s)
1920 (1080p)   249              253              525
3840 (4k)      284              277              593
4k vs 1080p    14%              10%              13%

I420 vs YUY2 vs UYVY

● YUY2 and UYVY perform exactly the same
● YUY2 costs ~40% performance (videoconvert/vaapipostproc)
● While jpegenc accepts YUY2 natively, its performance is still 15% worse than I420 + videoconvert ! avenc_mjpeg max-threads=1

1080p30 streams equivalence (i3)

Encoder                          YUY2   I420   Avg cost
*avenc_mjpeg*                    4.7    6.8    -43%
*x264enc*                        3.9    5.2    -33%
*jpegenc*                        3.0    4.9    -63%
*x265enc*                        1.0    1.0    -1%
*vaapijpegenc*                   3.6    5.9    -65%
*vaapih264enc*                   2.7    4.0    -45%
*vaapih264enc tune=low-power*    2.9    4.7    -59%
*vaapih265enc*                   2.2    2.9    -32%
*vaapivp8enc*                    1.9    2.4    -26%

Headless (drm) vs xorg

No noticeable impact (except if using the GPU at the same time)

Low power Skylake

Can it be used in parallel?

intel_gpu_top

Approx: 183 fps vs 128 (+43%) or 100 (+83%) ← maybe 30% more

Low power (fixed function mode)
Slice-based

Conclusion

● This study does not include any quality metrics
● Sadly, capture devices always offer YUY2/UYVY (not I420 directly, and rarely NV12)
● The GT3e/GT4e vaapi implementation seems less optimized than GT2
● x264enc is blazing fast, and CPU-based encoding will stay around a bit longer
● avenc_mjpeg even more so (1171 Mpx/s on Haswell: x18 1080p streams!)
● snow is a good worst-case scenario
● Embedded was painful

The combination of x264enc and vaapi (esp. LP) packs a lot of encoding performance into a sub-10 W machine! 4x1080p robust streams seem possible
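The "streams" figures follow from dividing an encoder's throughput by the pixel rate of a single stream. A short sketch, assuming one stream is 1920x1080 at 30 fps and 1 Mpx = 10^6 pixels (helper names are hypothetical):

```python
def stream_mpx_per_s(width: int = 1920, height: int = 1080, fps: float = 30.0) -> float:
    """Pixel rate of one live stream in Mpx/s (default: 1080p30)."""
    return width * height * fps / 1e6


def stream_equivalence(throughput_mpx_s: float, **stream) -> float:
    """How many real-time streams a given encoder throughput corresponds to."""
    return throughput_mpx_s / stream_mpx_per_s(**stream)


if __name__ == "__main__":
    # avenc_mjpeg on Haswell, as quoted in the conclusion.
    n = stream_equivalence(1171)
    print(f"1171 Mpx/s is about {n:.1f} 1080p30 streams")
```

Under these assumptions a 1080p30 stream needs about 62 Mpx/s, so 1171 Mpx/s corresponds to 18 full streams. This is a single-encoder average, not a measurement of parallel streams.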


Future work

● Evaluate nvenc
● Evaluate (more) desktop hardware (SKL/KBL)
● Try LP mode on i5 & i7 once fixed
● Actually test parallel streams (with drops & queue measurements)
● Add ssim/psnr quality measurement

Thank you for your attention
