1
Hierarchical Parallelization of an H.264/AVC Video Encoder
A. Rodriguez, A. Gonzalez, and M.P. Malumbres
IEEE PARELEC 2006
2
Outline
Introduction
Performance Analysis
Hierarchical H.264 Parallel Encoder
Experimental Results
Conclusions
3
Introduction: Background Knowledge (1/5)
Video Communication
4
Introduction: Background Knowledge (2/5)
H.264/AVC
Removes redundant information from the video signal
Reaching the limits of compression efficiency requires intensive computation
Applications: video on demand, video conferencing, live broadcasting, etc.
5
Introduction: Background Knowledge (3/5)
H.264/AVC encoder
High CPU demand
Low latency required for real-time response
Platforms with supercomputing capabilities: clusters, multiprocessors, special-purpose devices
6
Introduction: Background Knowledge (4/5)
Cluster
A group of linked computers
Improves performance and/or availability over that provided by a single computer
Categorizations: high-availability clusters, load-balancing clusters, high-performance clusters
7
Introduction: Background Knowledge (5/5)
Message Passing Parallelism
Message-passing runtimes and libraries: MPI
Multithread Parallelism: OpenMP
Optimized libraries
SIMD extensions and the graphics processing unit: Intel IPP, AMD ACML, etc.
8
Introduction: Main Purpose (1/6)
Apply parallel processing to H.264 encoders in order to reduce the computational load.
Given video quality and bit rate: image resolution, frame rate, latency
9
Introduction: Main Purpose (2/6)
Hierarchical parallelization of the H.264 encoder
Two-level MPI message-passing parallelization: GOP level and slice level
10
Introduction: Main Purpose (3/6)
GOP level parallelism: good speed-up, high latency
(Figure: the video sequence split into consecutive GOPs, each encoded in parallel)
11
Introduction: Main Purpose (4/6)
Example of latency
1 GOP = 10 frames
Frame rate = 30 frames/sec
Time for encoding 1 GOP = 3 seconds
We have to encode 9 GOPs in parallel in order to achieve real-time response
Latency = 3 seconds
12
Introduction: Main Purpose (5/6)
Slice level parallelism: low latency, but lower coding efficiency
13
Introduction: Main Purpose (6/6)
Combining both approaches yields both speed-up and efficiency
14
Performance Analysis: Overview (1/2)
"Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations" [2]
"A Parallel implementation of H.26L video encoder" [1]
Combining them yields scalability and low latency
15
Performance Analysis: Overview (2/2)
Processing flow
(Figure: the video sequence is split into GOPs; GOP level parallelism increases throughput, slice level parallelism reduces latency)
16
Performance Analysis: Equation Definition
Little's law: N = X * R
N: number of GOPs processed in parallel
X: number of GOPs encoded per second
R: elapsed time between a GOP entering the system and the same GOP being completely encoded
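Little's law can be checked numerically against the latency example from slide 11 (10-frame GOPs at 30 frames/sec, 3 seconds to encode one GOP); a minimal Python sketch:

```python
# Little's law for the pipelined GOP encoder: N = X * R.
# Numbers from the latency example on slide 11.

frames_per_gop = 10
frame_rate = 30          # frames/sec the encoder must sustain
encode_time = 3.0        # R: seconds to encode one GOP

throughput = frame_rate / frames_per_gop   # X = 3 GOPs/sec
gops_in_flight = throughput * encode_time  # N = X * R

print(gops_in_flight)  # 9.0 GOPs must be encoded concurrently
```

This reproduces the slide's figure: 9 GOPs must be in flight to keep up with real time, at the cost of 3 seconds of latency.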
17
Performance Analysis: Analysis (1/2)
If we have n_p nodes in the cluster and every GOP is decomposed into n_s slices:
N = n_p / n_s
R = R_SEQ / (n_s * E_s)
R_SEQ: sequential encoding time of a GOP
E_s: parallel efficiency at the slice level
18
Performance Analysis: Analysis (2/2)
GOP throughput of the combined parallel encoder:
X = N / R = (n_p / n_s) / (R_SEQ / (n_s * E_s)) = n_p * E_s / R_SEQ
If E_s is significantly less than 1, throughput is affected negatively.
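As a sanity check, the closed form X = n_p * E_s / R_SEQ can be verified against Little's law X = N / R; the parameter values below are hypothetical:

```python
# Verify that substituting N = n_p/n_s and R = R_SEQ/(n_s*E_s) into
# Little's law X = N/R collapses to X = n_p * E_s / R_SEQ.

n_p, n_s = 24, 6        # nodes, slices per frame (hypothetical values)
R_seq, E_s = 5.0, 0.8   # sequential GOP time (sec), slice-level efficiency

N = n_p / n_s                      # GOPs processed in parallel
R = R_seq / (n_s * E_s)            # latency of one GOP
X_littles_law = N / R
X_closed_form = n_p * E_s / R_seq

print(X_littles_law, X_closed_form)  # both ~3.84 GOPs/sec
```

Note that n_s cancels out of the throughput: adding slices per frame reduces latency but only E_s (which degrades with n_s) changes the throughput per node.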
19
Performance Analysis: Example (1/4)
Video sequence in HDTV format at 1280*720
Frame rate = 60 frames/sec
We suppose that the sequential H.264 encoder encodes one GOP (15 frames) in 5 seconds
Only one slice per frame is defined, so:
X = n_p / R_SEQ
20
Performance Analysis: Example (2/4)
To get real-time response, X has to equal 60 frames/sec, i.e. 4 GOPs/sec
n_p = X * R_SEQ = 4 * 5 = 20 nodes
21
Performance Analysis: Example (3/4)
Combined with slice level parallelism
Maximum allowed latency R = 1 sec
Slice parallelism efficiency E_s = 0.8
n_s = R_SEQ / (E_s * R) = 5 / (0.8 * 1) = 6.25 slices
n_p = X * R_SEQ / E_s = 4 * 5 / 0.8 = 25 nodes
22
Performance Analysis: Example (4/4)
We set n_s to 7 and N to 4, so the number of required nodes is adjusted to 28
Throughput: X = N * n_s * E_s / R_SEQ = 4 * 7 * 0.8 / 5 = 4.48 GOPs/sec
Latency: R = R_SEQ / (n_s * E_s) = 5 / (7 * 0.8) = 0.89 sec
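The arithmetic of this example can be reproduced directly; the sketch below uses only the values given on the slides:

```python
# Worked example from slides 21-22: 7 slices per frame, 4 GOP groups,
# R_SEQ = 5 sec per GOP, slice-level efficiency E_s = 0.8.

N, n_s = 4, 7               # GOP groups, slices per frame
n_p = N * n_s               # 28 nodes in total
R_seq, E_s = 5.0, 0.8

X = n_p * E_s / R_seq       # throughput, GOPs/sec
R = R_seq / (n_s * E_s)     # latency, sec

print(round(X, 2), round(R, 2))  # 4.48 0.89
```

Both figures match the slide: 4.48 GOPs/sec (just above the 4 GOPs/sec real-time target) with 0.89 sec latency (under the 1 sec bound).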
23
Performance Analysis: Efficiency Estimation (1/5)
Why do we have to estimate E_s? Because throughput and latency depend on it.
How do we estimate E_s? With a PAMELA (PerformAnce ModEling LAnguage) model.
24
Performance Analysis: Efficiency Estimation (2/5)
The DPB (Decoded Picture Buffer) is updated in every node using MPI_Allgather
In this PAMELA model, MPI_Allgather is implemented using a binary tree
25
Performance Analysis: Efficiency Estimation (3/5)
The PAMELA model for parallel encoding of one frame is:

L = par (p = 1…n_s)
        delay(t_s); delay(t_w);
        seq (i = 0…log2(n_s)-1)
            par (j = 1…n_s)
                delay(t_L + t_c * 2^i)

n_s: the number of slices processed in parallel
t_s: the mean slice encoding time
t_w: the mean wait time due to variations in t_s and global synchronization
t_L: start-up time
t_c: transmission time of one encoded slice
26
Performance Analysis: Efficiency Estimation (4/5)
The parallel time obtained by solving this model is:
T(L) = t_s + t_w + t_AG, where t_AG = log2(n_s) * t_L + (n_s - 1) * t_c
Since the sequential time per frame is approximately n_s * t_s, the efficiency can be computed as E_s = (n_s * t_s) / (n_s * T(L)) = t_s / T(L).
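A small script can evaluate these formulas; it reads the parameter table on the next slide as t_L = 6.0, t_c = 0.0133 * 4056, t_s = 840000 and t_w = 20586 (this mapping, the choice n_s = 8, and using E_s = t_s / T(L) for the efficiency are assumptions here, not the paper's exact numbers):

```python
import math

# Evaluate the solved PAMELA model: T(L) = t_s + t_w + t_AG with
# t_AG = log2(n_s)*t_L + (n_s - 1)*t_c, then E_s = t_s / T(L).
# Parameter values are read off the estimates on slide 27 (assumed
# mapping and time units); n_s = 8 is an illustrative slice count.

n_s = 8            # slices encoded in parallel
t_s = 840_000.0    # mean slice encoding time
t_w = 20_586.0     # mean wait/synchronization time
t_L = 6.0          # message start-up time
t_c = 0.0133 * 4056  # transmission time of one encoded slice

t_AG = math.log2(n_s) * t_L + (n_s - 1) * t_c  # allgather cost
T_L = t_s + t_w + t_AG                         # parallel frame time
E_s = t_s / T_L                                # slice-level efficiency

print(round(E_s, 3))  # efficiency estimate, just under 1
```

With these values the allgather cost is negligible next to t_s, so the estimated efficiency stays high; it is the wait term t_w and the growth of t_AG with n_s that erode E_s as more slices are used.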
27
Performance Analysis: Efficiency Estimation (5/5)
The experimental estimations of the parameter values:

t_L = 6.0, t_c = 0.0133 * 4056, t_s = 840000, t_w = 20586, t_AG = 421

(Figure: estimated efficiency for a slice based parallel encoder)
28
Performance Analysis: Slice Parallelism Scalability (1/4)
The feasible number of slices will depend on the video resolution
(Table: bit rate increment (%) vs. number of MBs per slice)
29
Performance Analysis: Slice Parallelism Scalability (2/4)
Bit rate overhead vs. number of slices per frame
30
Performance Analysis: Slice Parallelism Scalability (3/4)
PSNR loss vs. number of slices per frame
31
Performance Analysis: Slice Parallelism Scalability (4/4)
Encoding time vs. number of slices per frame
32
Hierarchical Parallel Encoder Overview
In order to achieve scalability and low latency, combine GOP and slice level parallelism
In the first level:
Divide the sequence into GOPs (15 frames)
Every GOP is assigned to a processor group inside the cluster
Each group encodes independently
33
Hierarchical Parallel Encoder GOP assignment method
Local manager: communicates with the global manager
Global manager: informs the GOP assignment by sending a message with the GOP number to the requesting local manager
Simple and load-balanced
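The request/assign protocol above can be sketched without MPI; here a thread-safe queue stands in for the global manager's message handling, and the names and threading setup are illustrative rather than the paper's implementation:

```python
# Demand-driven GOP assignment: each local manager repeatedly asks the
# global manager for the next GOP number until the sequence is exhausted.
# A queue.Queue plays the role of the global manager's reply messages.
import queue
import threading

NUM_GOPS = 16    # as in the Ayersroc test sequence
NUM_GROUPS = 4   # processor groups (hypothetical)

gop_numbers = queue.Queue()
for g in range(NUM_GOPS):          # global manager's work list
    gop_numbers.put(g)

assignments = {i: [] for i in range(NUM_GROUPS)}
lock = threading.Lock()

def local_manager(group_id):
    """Request GOPs until none remain (self-balancing under uneven load)."""
    while True:
        try:
            gop = gop_numbers.get_nowait()
        except queue.Empty:
            return                  # no GOPs left: group is done
        with lock:
            assignments[group_id].append(gop)  # "encode" the GOP

threads = [threading.Thread(target=local_manager, args=(i,))
           for i in range(NUM_GROUPS)]
for t in threads: t.start()
for t in threads: t.join()

total = sum(len(v) for v in assignments.values())
print(total)  # 16: every GOP assigned exactly once
```

Because groups pull work instead of receiving a fixed share, a group that finishes a cheap GOP early simply requests another, which is why the scheme stays load-balanced.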
34
Hierarchical Parallel Encoder Framework
Hierarchical H.264 parallel encoder
(Figure: a global manager coordinates several processor groups, each containing processors P0, P1, P2)
35
Experimental Results: Environments (1/2)
Mozart: 4 biprocessor nodes with AMD Opteron 246 at 2 GHz, interconnected by a switched Gigabit Ethernet
Aldebaran: SGI Altix 3700 with 44 Itanium II nodes, interconnected by a high-performance proprietary network
36
Experimental Results: Environments (2/2)
720 * 480 standard sequence Ayersroc, composed of 16 GOPs

Configuration  Cluster    #Groups  #Slices
01_Gr_08S1     Mozart        1        8
02_Gr_04S1     Mozart        2        4
04_Gr_02S1     Mozart        4        2
08_Gr_01S1     Mozart        8        1
01_Gr_16S1     Aldebaran     1       16
02_Gr_08S1     Aldebaran     2        8
04_Gr_04S1     Aldebaran     4        4
08_Gr_02S1     Aldebaran     8        2
16_Gr_01S1     Aldebaran    16        1
37
Experimental Results: System Speedup (1/2)
Speed up in Mozart
38
Experimental Results: System Speedup (2/2)
Speed up in Aldebaran
39
Experimental Results: Encoding Latency
Mean GOP encoding time
40
Conclusions
A hierarchical parallel video encoder based on H.264/AVC was proposed.
The experimental results confirm the previous analysis, showing that a scalable, low-latency H.264 encoder can be obtained.
Some issues remain open, as mentioned in the previous sections.
41
Reference
[1] J.C. Fernández and M.P. Malumbres, "A Parallel implementation of H.26L video encoder", in Proc. of EuroPar 2002 Conf. (LNCS 2400), pp. 830-833, Paderborn, 2002.
[2] A. Rodriguez, A. González and M.P. Malumbres, "Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations", IEEE Int. Conference on Parallel Computing in Electrical Engineering, pp. 354-357, Dresden, 2004.
[3] Arjan J.C. van Gemund, "Symbolic Performance Modeling of Parallel Systems", IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 2, Feb. 2003.
[4] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc.