1
Hierarchical Parallelization of an H.264/AVC Video Encoder
A. Rodriguez, A. Gonzalez, and M.P. Malumbres
IEEE PARELEC 2006
2
Outline
Introduction
Performance Analysis
Hierarchical H.264 Parallel Encoder
Experimental Results
Conclusions
3
Introduction: Background Knowledge (1/5)
Video Communication
4
Introduction: Background Knowledge (2/5)
H.264/AVC
Removes redundant information from the video signal
Reaching the limits of compression efficiency requires intensive computation
Applications: video on demand, video conferencing, live broadcasting, etc.
5
Introduction: Background Knowledge (3/5)
H.264/AVC encoder
High CPU demand
Low latency required for real-time response
Platforms with supercomputing capabilities: clusters, multiprocessors, special-purpose devices
6
Introduction: Background Knowledge (4/5)
Cluster
A group of linked computers
Improves performance and/or availability over that provided by a single computer
Categorizations: high-availability clusters, load-balancing clusters, high-performance clusters
7
Introduction: Background Knowledge (5/5)
Message Passing Parallelism
Message-passing runtimes and libraries: MPI
Multithread Parallelism: OpenMP
Optimized libraries
SIMD extensions and the graphics processing unit: Intel IPP, AMD ACML, etc.
8
Introduction: Main Purpose (1/6)
Apply parallel processing to H.264 encoders in order to reduce the computational load.
Given video quality and bit rate: image resolution, frame rate, latency
9
Introduction: Main Purpose (2/6)
Hierarchical parallelization of the H.264 encoder
Two-level MPI message-passing parallelization: GOP level and slice level
10
Introduction: Main Purpose (3/6)
GOP level parallelism: good speed-up, high latency
(Figure: the video sequence split into consecutive GOPs, each encoded in parallel)
11
Introduction: Main Purpose (4/6)
Example of latency
1 GOP = 10 frames
Frame rate = 30 frames/sec
Time for encoding 1 GOP = 3 seconds
We have to encode 9 GOPs in parallel in order to achieve real-time response
Latency = 3 seconds
12
Introduction: Main Purpose (5/6)
Slice level parallelism: low latency, but lower coding efficiency
13
Introduction: Main Purpose (6/6)
Combining both approaches yields both speed-up and efficiency
14
Performance Analysis: Overview (1/2)
"Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations" [2]
"A Parallel implementation of H.26L video encoder" [1]
Combining them yields scalability and low latency
15
Performance Analysis: Overview (2/2)
Processing flow
(Figure: the video sequence is split into GOPs; GOP level parallelism increases throughput, slice level parallelism reduces latency)
16
Performance Analysis: Equation Definition
Little's law: N = X * R
N: number of GOPs processed in parallel
X: number of GOPs encoded per second
R: elapsed time between a GOP entering the system and the same GOP being completely encoded
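Little's law can be checked numerically against the latency example from slide 11 (10-frame GOPs at 30 frames/sec, 3 seconds to encode one GOP); a minimal Python sketch:

```python
# Little's law for the pipelined GOP encoder: N = X * R.
# Numbers from the latency example on slide 11.

frames_per_gop = 10
frame_rate = 30          # frames/sec the encoder must sustain
encode_time = 3.0        # R: seconds to encode one GOP

throughput = frame_rate / frames_per_gop   # X = 3 GOPs/sec
gops_in_flight = throughput * encode_time  # N = X * R

print(gops_in_flight)  # 9.0 GOPs must be encoded concurrently
```

This reproduces the slide's figure: 9 GOPs must be in flight to keep up with real time, at the cost of 3 seconds of latency.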
17
Performance Analysis: Analysis (1/2)
If we have n_p nodes in the cluster and every GOP is decomposed into n_s slices:
N = n_p / n_s
R = R_SEQ / (n_s * E_s)
R_SEQ: sequential encoding time of a GOP
E_s: parallel efficiency at the slice level
18
Performance Analysis: Analysis (2/2)
GOP throughput of the combined parallel encoder:
X = N / R = (n_p / n_s) / (R_SEQ / (n_s * E_s)) = n_p * E_s / R_SEQ
If E_s is significantly less than 1, throughput is affected negatively.
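As a sanity check, the closed form X = n_p * E_s / R_SEQ can be verified against Little's law X = N / R; the parameter values below are hypothetical:

```python
# Verify that substituting N = n_p/n_s and R = R_SEQ/(n_s*E_s) into
# Little's law X = N/R collapses to X = n_p * E_s / R_SEQ.

n_p, n_s = 24, 6        # nodes, slices per frame (hypothetical values)
R_seq, E_s = 5.0, 0.8   # sequential GOP time (sec), slice-level efficiency

N = n_p / n_s                      # GOPs processed in parallel
R = R_seq / (n_s * E_s)            # latency of one GOP
X_littles_law = N / R
X_closed_form = n_p * E_s / R_seq

print(X_littles_law, X_closed_form)  # both ~3.84 GOPs/sec
```

Note that n_s cancels out of the throughput: adding slices per frame reduces latency but only E_s (which degrades with n_s) changes the throughput per node.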
19
Performance Analysis: Example (1/4)
Video sequence in HDTV format at 1280*720
Frame rate = 60 frames/sec
We suppose that the sequential H.264 encoder encodes one GOP (15 frames) in 5 seconds
Only one slice per frame is defined, so:
X = n_p / R_SEQ
20
Performance Analysis: Example (2/4)
To get real-time response, X has to equal 60 frames/sec, i.e. 4 GOPs/sec
n_p = X * R_SEQ = 4 * 5 = 20 nodes
21
Performance Analysis: Example (3/4)
Combined with slice level parallelism
Maximum allowed latency R = 1 sec
Slice parallelism efficiency E_s = 0.8
n_s = R_SEQ / (E_s * R) = 5 / (0.8 * 1) = 6.25 slices
n_p = X * R_SEQ / E_s = 4 * 5 / 0.8 = 25 nodes
22
Performance Analysis: Example (4/4)
We set n_s to 7 and N to 4, so the number of required nodes is adjusted to 28
Throughput: X = N * n_s * E_s / R_SEQ = 4 * 7 * 0.8 / 5 = 4.48 GOPs/sec
Latency: R = R_SEQ / (n_s * E_s) = 5 / (7 * 0.8) = 0.89 sec
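The arithmetic of this example can be reproduced directly; the sketch below uses only the values given on the slides:

```python
# Worked example from slides 21-22: 7 slices per frame, 4 GOP groups,
# R_SEQ = 5 sec per GOP, slice-level efficiency E_s = 0.8.

N, n_s = 4, 7               # GOP groups, slices per frame
n_p = N * n_s               # 28 nodes in total
R_seq, E_s = 5.0, 0.8

X = n_p * E_s / R_seq       # throughput, GOPs/sec
R = R_seq / (n_s * E_s)     # latency, sec

print(round(X, 2), round(R, 2))  # 4.48 0.89
```

Both figures match the slide: 4.48 GOPs/sec (just above the 4 GOPs/sec real-time target) with 0.89 sec latency (under the 1 sec bound).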
23
Performance Analysis: Efficiency Estimation (1/5)
Why do we have to estimate E_s? Because throughput and latency depend on it.
How do we estimate E_s? With a PAMELA (PerformAnce ModEling LAnguage) model.
24
Performance Analysis: Efficiency Estimation (2/5)
The DPB (Decoded Picture Buffer) is updated in every node using MPI_Allgather
In this PAMELA model, MPI_Allgather is implemented using a binary tree
25
Performance Analysis: Efficiency Estimation (3/5)
The PAMELA model for parallel encoding of one frame is:

L = par (p = 1…n_s)
        delay(t_s); delay(t_w);
        seq (i = 0…log2(n_s)-1)
            par (j = 1…n_s)
                delay(t_L + t_c * 2^i)

n_s: the number of slices processed in parallel
t_s: the mean slice encoding time
t_w: the mean wait time due to variations in t_s and global synchronization
t_L: start-up time
t_c: transmission time of one encoded slice
26
Performance Analysis: Efficiency Estimation (4/5)
The parallel time obtained by solving this model is:
T(L) = t_s + t_w + t_AG, where t_AG = log2(n_s) * t_L + (n_s - 1) * t_c
Since the sequential time per frame is approximately n_s * t_s, the efficiency can be computed as E_s = (n_s * t_s) / (n_s * T(L)) = t_s / T(L).
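A small script can evaluate these formulas; it reads the parameter table on the next slide as t_L = 6.0, t_c = 0.0133 * 4056, t_s = 840000 and t_w = 20586 (this mapping, the choice n_s = 8, and using E_s = t_s / T(L) for the efficiency are assumptions here, not the paper's exact numbers):

```python
import math

# Evaluate the solved PAMELA model: T(L) = t_s + t_w + t_AG with
# t_AG = log2(n_s)*t_L + (n_s - 1)*t_c, then E_s = t_s / T(L).
# Parameter values are read off the estimates on slide 27 (assumed
# mapping and time units); n_s = 8 is an illustrative slice count.

n_s = 8            # slices encoded in parallel
t_s = 840_000.0    # mean slice encoding time
t_w = 20_586.0     # mean wait/synchronization time
t_L = 6.0          # message start-up time
t_c = 0.0133 * 4056  # transmission time of one encoded slice

t_AG = math.log2(n_s) * t_L + (n_s - 1) * t_c  # allgather cost
T_L = t_s + t_w + t_AG                         # parallel frame time
E_s = t_s / T_L                                # slice-level efficiency

print(round(E_s, 3))  # efficiency estimate, just under 1
```

With these values the allgather cost is negligible next to t_s, so the estimated efficiency stays high; it is the wait term t_w and the growth of t_AG with n_s that erode E_s as more slices are used.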
27
Performance Analysis: Efficiency Estimation (5/5)
The experimental estimations of the parameter values:

t_L = 6.0, t_c = 0.0133 * 4056, t_s = 840000, t_w = 20586, t_AG = 421

(Figure: estimated efficiency for a slice based parallel encoder)
28
Performance Analysis: Slice Parallelism Scalability (1/4)
The feasible number of slices will depend on the video resolution
(Table: bit rate increment (%) vs. number of MBs per slice)
29
Performance Analysis: Slice Parallelism Scalability (2/4)
Bit rate overhead vs. number of slices per frame
30
Performance Analysis: Slice Parallelism Scalability (3/4)
PSNR loss vs. number of slices per frame
31
Performance Analysis: Slice Parallelism Scalability (4/4)
Encoding time vs. number of slices per frame
32
Hierarchical Parallel Encoder Overview
In order to achieve scalability and low latency, combine GOP and slice level parallelism
In the first level:
Divide the sequence into GOPs (15 frames)
Every GOP is assigned to a processor group inside the cluster
Each group encodes independently
33
Hierarchical Parallel Encoder GOP assignment method
Local manager: communicates with the global manager
Global manager: informs the GOP assignment by sending a message with the GOP number to the requesting local manager
Simple and load-balanced
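The request/assign protocol above can be sketched without MPI; here a thread-safe queue stands in for the global manager's message handling, and the names and threading setup are illustrative rather than the paper's implementation:

```python
# Demand-driven GOP assignment: each local manager repeatedly asks the
# global manager for the next GOP number until the sequence is exhausted.
# A queue.Queue plays the role of the global manager's reply messages.
import queue
import threading

NUM_GOPS = 16    # as in the Ayersroc test sequence
NUM_GROUPS = 4   # processor groups (hypothetical)

gop_numbers = queue.Queue()
for g in range(NUM_GOPS):          # global manager's work list
    gop_numbers.put(g)

assignments = {i: [] for i in range(NUM_GROUPS)}
lock = threading.Lock()

def local_manager(group_id):
    """Request GOPs until none remain (self-balancing under uneven load)."""
    while True:
        try:
            gop = gop_numbers.get_nowait()
        except queue.Empty:
            return                  # no GOPs left: group is done
        with lock:
            assignments[group_id].append(gop)  # "encode" the GOP

threads = [threading.Thread(target=local_manager, args=(i,))
           for i in range(NUM_GROUPS)]
for t in threads: t.start()
for t in threads: t.join()

total = sum(len(v) for v in assignments.values())
print(total)  # 16: every GOP assigned exactly once
```

Because groups pull work instead of receiving a fixed share, a group that finishes a cheap GOP early simply requests another, which is why the scheme stays load-balanced.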
34
Hierarchical Parallel Encoder Framework
Hierarchical H.264 parallel encoder
(Figure: a global manager coordinates several processor groups, each containing processors P0, P1, P2)
35
Experimental Results: Environments (1/2)
Mozart: 4 biprocessor nodes with AMD Opteron 246 at 2 GHz, interconnected by a switched Gigabit Ethernet
Aldebaran: SGI Altix 3700 with 44 Itanium II nodes, interconnected by a high-performance proprietary network
36
Experimental Results: Environments (2/2)
720 * 480 standard sequence Ayersroc, composed of 16 GOPs

Configuration  Cluster    #Groups  #Slices
01_Gr_08S1     Mozart        1        8
02_Gr_04S1     Mozart        2        4
04_Gr_02S1     Mozart        4        2
08_Gr_01S1     Mozart        8        1
01_Gr_16S1     Aldebaran     1       16
02_Gr_08S1     Aldebaran     2        8
04_Gr_04S1     Aldebaran     4        4
08_Gr_02S1     Aldebaran     8        2
16_Gr_01S1     Aldebaran    16        1
37
Experimental Results: System Speedup (1/2)
Speed up in Mozart
38
Experimental Results: System Speedup (2/2)
Speed up in Aldebaran
39
Experimental Results: Encoding Latency
Mean GOP encoding time
40
Conclusions
A hierarchical parallel video encoder based on H.264/AVC was proposed.
The experimental results confirm the previous analysis, showing that a scalable, low-latency H.264 encoder can be obtained.
Some issues remain open, as mentioned in the previous sections.
41
Reference
[1] J.C. Fernández and M.P. Malumbres, "A Parallel implementation of H.26L video encoder", in Proc. of EuroPar 2002 Conf. (LNCS 2400), pp. 830-833, Paderborn, 2002.
[2] A. Rodriguez, A. González and M.P. Malumbres, "Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations", IEEE Int. Conference on Parallel Computing in Electrical Engineering, pp. 354-357, Dresden, 2004.
[3] Arjan J.C. van Gemund, "Symbolic Performance Modeling of Parallel Systems", IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 2, Feb. 2003.
[4] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc.