
Prediction of Transmission Distortion for Wireless Video Communication: Analysis

Zhifeng Chen and Dapeng Wu, Department of Electrical and Computer Engineering, University of Florida, Gainesville, Florida 32611

Abstract—Transmitting video over wireless is a challenging problem since video may be seriously distorted due to packet errors caused by wireless channels. The capability of predicting transmission distortion (i.e., video distortion caused by packet errors) can assist in designing video encoding and transmission schemes that achieve maximum video quality or minimum end-to-end video distortion. This paper is aimed at deriving formulae for predicting transmission distortion. The contribution of this paper is twofold. First, we identify the governing law that describes how the transmission distortion process evolves over time, and analytically derive the transmission distortion formula as a closed-form function of video frame statistics, channel error statistics, and system parameters. Second, we identify, for the first time, two important properties of transmission distortion. The first property is that the clipping noise, produced by non-linear clipping, causes decay of propagated error. The second property is that the correlation between motion vector concealment error and propagated error is negative, and has a dominant impact on transmission distortion compared to other correlations. Due to these two properties and an elegant error/distortion decomposition, our formula provides not only more accurate prediction but also lower complexity than the existing methods.

Index Terms—Wireless video, transmission distortion, clipping noise, slice data partitioning (SDP), unequal error protection (UEP), time-varying channel.

    I. INTRODUCTION

Both multimedia technology and mobile communications have experienced massive growth and commercial success in recent years. As the two technologies converge, wireless video, such as video phone and mobile TV in 3G/4G systems, is expected to achieve unprecedented growth and worldwide success. However, different from the traditional video coding system, transmitting video over wireless channels with good quality or low end-to-end distortion is particularly challenging since the received video is subject to not only quantization error but also transmission error. In a wireless video communication system, end-to-end distortion consists of two parts: quantization distortion and transmission distortion. Quantization distortion is caused by quantization errors during the encoding process, and has been extensively studied in rate distortion theory [1], [2]. Transmission distortion is caused by packet errors during the transmission of a video sequence, and it is the major part of the end-to-end distortion in delay-sensitive wireless video communication¹ under high packet error probability (PEP), e.g., in a wireless fading channel.

Please direct all correspondence to Prof. Dapeng Wu, University of Florida, Dept. of Electrical & Computer Engineering, P.O. Box 116130, Gainesville, FL 32611, USA. Tel. (352) 392-4954. Fax (352) 392-0044. Email: [email protected]. Homepage: http://www.wu.ece.ufl.edu. This work was supported in part by the US National Science Foundation under grant ECCS-1002214. Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

The capability of predicting transmission distortion at the transmitter can assist in designing video encoding and transmission schemes that achieve maximum video quality under resource constraints. Specifically, transmission distortion prediction can be used in the following three applications in video encoding and transmission: 1) mode decision, which is to find the best intra/inter-prediction mode for encoding a macroblock (MB) with the minimum rate-distortion (R-D) cost given the instantaneous PEP; 2) cross-layer encoding rate control, which is to control the instantaneously encoded bit rate for a real-time encoder to minimize the frame-level end-to-end distortion given the instantaneous PEP, e.g., in video conferencing; 3) packet scheduling, which chooses a subset of packets of the pre-coded video to transmit and intentionally discards the remaining packets to minimize the group of picture (GOP)-level end-to-end distortion given the average PEP and average burst length, e.g., in streaming pre-coded video over networks. All three applications require a formula for predicting how transmission distortion is affected by their respective control policy, in order to choose the optimal mode, encoding rate, or transmission schedule.

However, predicting transmission distortion poses a great challenge due to the spatio-temporal correlation inside the input video sequence, the nonlinearity of both the encoder and the decoder, and the varying PEP in time-varying channels. In a typical video codec, the temporal correlation among consecutive frames and the spatial correlation among the adjacent pixels of one frame are exploited to improve the coding efficiency. Nevertheless, such a coding scheme brings much difficulty to predicting transmission distortion because a packet error will degrade not only the video quality of the current frame but also that of the following frames due to error propagation. In addition, as we will see in Section III, the nonlinearity of both the encoder and the decoder makes the instantaneous transmission distortion not equal to the sum of the distortions caused by individual error events. Furthermore, in a wireless fading channel, the PEP is time-varying, which makes the error process a non-stationary random process; hence, as a function of the error process, the distortion process is also a non-stationary random process.

¹Delay-sensitive wireless video communication usually does not allow retransmission to correct packet errors since retransmission may cause long delay.

According to the aforementioned three applications, the existing algorithms for estimating transmission distortion can be categorized into the following three classes: 1) pixel-level or block-level algorithms (applied to mode decision), e.g., the Recursive Optimal Per-pixel Estimate (ROPE) algorithm [3] and the Law of Large Numbers (LLN) algorithm [4], [5]; 2) frame-level or packet-level or slice-level algorithms (applied to cross-layer encoding rate control) [6], [7], [8], [9], [10]; 3) GOP-level or sequence-level algorithms (applied to packet scheduling) [11], [12], [13], [14], [15]. Although the existing distortion estimation algorithms work at different levels, they share some common properties, which come from the inherent characteristics of a wireless video communication system, that is, spatio-temporal correlation, nonlinear codec, and time-varying channel. However, none of the existing works analyzed the effect of non-linear clipping noise on the transmission distortion, and therefore they cannot provide accurate distortion estimation.

In this paper, we derive the transmission distortion formulae for wireless video communication systems. With consideration of spatio-temporal correlation, nonlinear codec and time-varying channel, our distortion prediction formulae improve the accuracy of distortion estimation over existing works. Besides that, our formulae support, for the first time, the following capabilities: 1) prediction at different levels (e.g., pixel/frame/GOP level), 2) prediction for multi-reference motion compensation, 3) prediction under slice data partitioning (SDP) [16], 4) prediction under arbitrary slice-level packetization with the flexible macroblock ordering (FMO) mechanism [17], [18], 5) prediction under time-varying channels, 6) one unified formula for both I-MB and P-MB, and 7) prediction for both low motion and high motion video sequences. In addition, this paper also identifies two important properties of transmission distortion for the first time: 1) clipping noise, produced by non-linear clipping, causes decay of propagated error; 2) the correlation between motion vector concealment error and propagated error is negative, and has a dominant impact on transmission distortion, among all the correlations between any two of the four components in transmission error. Due to the page limit, we move most of the experimental results to our sequel paper [19], which 1) verifies the accuracy of the formulae derived in this paper and compares it to existing models, 2) discusses the algorithms designed based on the formulae, 3) applies our algorithms in practical video codec design, and 4) compares the R-D performance between our algorithms and existing estimation algorithms.

The rest of the paper is organized as follows. Section II presents the preliminaries of our system model under study to facilitate the derivations in the later sections, and illustrates the limitations of existing transmission distortion models. In Section III, we derive the transmission distortion formula as a function of frame statistics, channel condition, and system parameters. Section IV concludes the paper.

    II. SYSTEM DESCRIPTION

    A. Structure of a Wireless Video Communication System

Fig. 1 shows the structure of a typical wireless video communication system. It consists of an encoder, two channels, and a decoder, where the residual channel and the motion vector (MV) channel may be either the same channel or different channels. If residual packets or MV packets are erroneous, the error concealment module will be activated. In typical video standards such as H.263/H.264 and MPEG-2/4, the functional blocks of an encoder can be divided into two classes: 1) basic parts, such as predictive coding, transform, quantization, entropy coding, motion compensation, and clipping; and 2) performance-enhancing parts, such as interpolation filtering, deblocking filtering, B-frames, multi-reference prediction, etc. Although up-to-date video standards, e.g., the emerging HEVC standard, include more and more performance-enhancing parts, the basic parts do not change. In this paper, we analyze the transmission distortion for the structure with the basic parts in Fig. 1.

[Fig. 1. System structure, where T, Q, Q⁻¹, and T⁻¹ denote transform, quantization, inverse quantization, and inverse transform, respectively. The block diagram shows video capture, T/Q and Q⁻¹/T⁻¹, motion estimation, motion compensation, clipping, and memory at the encoder; the residual channel and the MV channel; and Q⁻¹/T⁻¹, motion compensation, residual and MV error concealment, clipping, memory, and video display at the decoder.]

Note that in this system, both the residual channel and the MV channel are application-layer channels; specifically, both channels consist of entropy coding and entropy decoding, networking layers², and the physical layer (including channel encoding, modulation, wireless fading channel, demodulation, and channel decoding). Although the residual channel and the MV channel usually share the same physical-layer channel, the two application-layer channels may have different parameter settings (e.g., different channel code-rates) for different SDP packets under unequal error protection (UEP) consideration.

Table I lists the notations used in this paper. All vectors are in bold font. Note that the encoder needs to reconstruct the compressed video for predictive coding; hence the encoder and the decoder have a similar structure for pixel value reconstruction. To distinguish the variables in the reconstruction module of the encoder from those in the reconstruction module of the decoder, we add a hat (^) on top of the variables at the encoder and a tilde (~) on top of the variables at the decoder.

B. Clipping Noise

In this subsection, we examine the effect of clipping noise on the reconstructed pixel value along each pixel trajectory over time (frames). All pixel positions in a video sequence form a three-dimensional spatio-temporal domain, i.e., two dimensions in the spatial domain and one dimension in the temporal domain.

²Here, networking layers can include any layers other than the physical layer.


TABLE I
SUMMARY OF NOTATIONS

$\mathbf{u}^k$ : A pixel with position $\mathbf{u}$ in the $k$-th frame
$f_{\mathbf{u}}^k$ : Value of the pixel $\mathbf{u}^k$
$e_{\mathbf{u}}^k$ : Residual of the pixel $\mathbf{u}^k$
$mv_{\mathbf{u}}^k$ : MV of the pixel $\mathbf{u}^k$
$\Delta_{\mathbf{u}}^k$ : Clipping noise of the pixel $\mathbf{u}^k$
$\varepsilon_{\mathbf{u}}^k$ : Residual concealment error of the pixel $\mathbf{u}^k$
$\xi_{\mathbf{u}}^k$ : MV concealment error of the pixel $\mathbf{u}^k$
$\zeta_{\mathbf{u}}^k$ : Transmission reconstructed error of the pixel $\mathbf{u}^k$
$S_{\mathbf{u}}^k$ : Error state of the pixel $\mathbf{u}^k$
$P_{\mathbf{u}}^k$ : Error probability of the pixel $\mathbf{u}^k$
$D_{\mathbf{u}}^k$ : Transmission distortion of the pixel $\mathbf{u}^k$
$D^k$ : Transmission distortion of the $k$-th frame
$\mathcal{V}^k$ : Set of all the pixels in the $k$-th frame
$|\mathcal{V}^k|$ : Number of elements in set $\mathcal{V}^k$ (cardinality of $\mathcal{V}^k$)
$\alpha^k$ : Propagation factor of the $k$-th frame
$\beta^k$ : Percentage of I-MBs in the $k$-th frame
$\lambda^k$ : Correlation ratio of the $k$-th frame

Each pixel can be uniquely represented by $\mathbf{u}^k$ in this three-dimensional time-space, where $k$ means the $k$-th frame in the temporal domain and $\mathbf{u}$ is a two-dimensional vector in the spatial domain, i.e., a position in the $k$-th frame. The philosophy behind inter-prediction of a video sequence is to represent the video sequence by virtual motion of each pixel, i.e., each pixel recursively moves from position $\mathbf{v}$ in the $(k-1)$-th frame, i.e., $\mathbf{v}^{k-1}$, to position $\mathbf{u}^k$. The difference between these two positions is a two-dimensional vector called the MV of pixel $\mathbf{u}^k$, i.e., $mv_{\mathbf{u}}^k = \mathbf{v}^{k-1} - \mathbf{u}^k$. The difference between the pixel values of these two positions is called the residual of pixel $\mathbf{u}^k$, that is, $e_{\mathbf{u}}^k = f_{\mathbf{u}}^k - \hat{f}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$. Recursively, each inter-predicted pixel in the $k$-th frame has one and only one reference pixel trajectory backward towards the latest I-block.³

At the encoder, after transform, quantization, inverse quantization, and inverse transform for the residual, the reconstructed pixel value for $\mathbf{u}^k$ may be out-of-range and should be clipped as

$\hat{f}_{\mathbf{u}}^k = \Gamma(\hat{f}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} + \hat{e}_{\mathbf{u}}^k),$   (1)

where $\Gamma(\cdot)$ is a clipping function defined by

$\Gamma(x) = \begin{cases} \gamma_l, & x < \gamma_l \\ x, & \gamma_l \le x \le \gamma_h \\ \gamma_h, & x > \gamma_h, \end{cases}$   (2)

where $\gamma_l$ and $\gamma_h$ are user-specified low and high thresholds, respectively. Usually, $\gamma_l = 0$ and $\gamma_h = 255$.
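For concreteness, a minimal Python sketch of the clipping operation in (1) and (2), using the usual 8-bit thresholds; the function name and example values are ours, not from the paper:

```python
def clip(x, gamma_l=0, gamma_h=255):
    """Clipping function Gamma(x) of Eq. (2): saturate x into [gamma_l, gamma_h]."""
    return max(gamma_l, min(x, gamma_h))

# Encoder-side reconstruction of Eq. (1): clip the motion-compensated
# reference value plus the dequantized residual.
f_hat_ref = 250   # reconstructed reference pixel value (made-up example)
e_hat = 20        # dequantized residual (made-up example)
f_hat = clip(f_hat_ref + e_hat)   # 270 is clipped to 255
```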

The residual and MV at the decoder may be different from their counterparts at the encoder because of channel impairments. Denote $\tilde{mv}_{\mathbf{u}}^k$ and $\tilde{e}_{\mathbf{u}}^k$ the MV and residual at the decoder, respectively. Then, the reference pixel position for $\mathbf{u}^k$ at the decoder is $\tilde{\mathbf{v}}^{k-1} = \mathbf{u}^k + \tilde{mv}_{\mathbf{u}}^k$, and the reconstructed pixel value for $\mathbf{u}^k$ at the decoder is

$\tilde{f}_{\mathbf{u}}^k = \Gamma(\tilde{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{e}_{\mathbf{u}}^k).$   (3)

In error-free channels, the reconstructed pixel value at the receiver is exactly the same as the reconstructed pixel value at the transmitter, because there is no transmission error and hence no transmission distortion. However, in error-prone channels, we know from (3) that $\tilde{f}_{\mathbf{u}}^k$ is a function of three factors: the received residual $\tilde{e}_{\mathbf{u}}^k$, the received MV $\tilde{mv}_{\mathbf{u}}^k$, and the propagated error $\tilde{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1}$. The received residual $\tilde{e}_{\mathbf{u}}^k$ depends on three factors, namely, 1) the transmitted residual $\hat{e}_{\mathbf{u}}^k$, 2) the residual packet error state, which depends on the instantaneous residual channel condition, and 3) the residual error concealment algorithm if the received residual packet is erroneous. Similarly, the received MV $\tilde{mv}_{\mathbf{u}}^k$ depends on 1) the transmitted $mv_{\mathbf{u}}^k$, 2) the MV packet error state, which depends on the instantaneous MV channel condition, and 3) the MV error concealment algorithm if the received MV packet is erroneous. The propagated error $\tilde{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1}$ includes the error propagated from the reference frames, and therefore depends on all samples in the previous frames indexed by $i$, where $1 \le i < k$, and on their reception error states as well as error concealment algorithms. In this paper, we consider temporal error concealment [20], [21] in deriving the transmission distortion formulae.

³We will also discuss intra-predicted pixels in Section III.

The non-linear clipping function within the pixel trajectory makes the distortion estimation more challenging. However, it is interesting to observe that clipping actually reduces transmission distortion. In Section III, we will quantify the effect of clipping on transmission distortion.

    C. Definition of Transmission Distortion

In a video sequence, all pixel positions in the $k$-th frame form a two-dimensional vector set $\mathcal{V}^k$, and we denote the number of elements in set $\mathcal{V}^k$ by $|\mathcal{V}^k|$. So, for any pixel at position $\mathbf{u}$ in the $k$-th frame, i.e., $\mathbf{u} \in \mathcal{V}^k$, its reference pixel position is chosen from set $\mathcal{V}^{k-1}$ for single-reference motion compensation.

Given the joint probability mass function (PMF) of $\hat{f}_{\mathbf{u}}^k$ and $\tilde{f}_{\mathbf{u}}^k$, we define the pixel-level transmission distortion (PTD) for pixel $\mathbf{u}^k$ by

$D_{\mathbf{u}}^k \triangleq E[(\hat{f}_{\mathbf{u}}^k - \tilde{f}_{\mathbf{u}}^k)^2],$   (4)

where $E[\cdot]$ represents expectation and the randomness comes from both the random video input and the random channel error state. Then, we define the frame-level transmission distortion (FTD) for the $k$-th frame by

$D^k \triangleq E\Big[\frac{1}{|\mathcal{V}^k|} \cdot \sum_{\mathbf{u}\in\mathcal{V}^k} (\hat{f}_{\mathbf{u}}^k - \tilde{f}_{\mathbf{u}}^k)^2\Big].$   (5)

It is easy to prove that the relationship between FTD and PTD is characterized by

$D^k = \frac{1}{|\mathcal{V}^k|} \cdot \sum_{\mathbf{u}\in\mathcal{V}^k} D_{\mathbf{u}}^k.$   (6)

In fact, (6) is a general form for distortions of all levels. If $|\mathcal{V}^k| = 1$, (6) reduces to (4). For slice/packet-level distortion, $\mathcal{V}^k$ is the set of the pixels contained in a slice/packet. For GOP-level distortion, $\mathcal{V}^k$ could be replaced by the set of the pixels contained in a GOP. In this paper, we only show how to derive formulae for PTD and FTD. Our methodology is also applicable to deriving formulae for slice/packet/GOP-level distortion by using an appropriate $\mathcal{V}^k$.


    D. Limitations of the Existing Transmission Distortion Models

We define the clipping noise for pixel $\mathbf{u}^k$ at the encoder as

$\hat{\Delta}_{\mathbf{u}}^k \triangleq (\hat{f}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} + \hat{e}_{\mathbf{u}}^k) - \Gamma(\hat{f}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} + \hat{e}_{\mathbf{u}}^k),$   (7)

and the clipping noise for pixel $\mathbf{u}^k$ at the decoder as

$\tilde{\Delta}_{\mathbf{u}}^k \triangleq (\tilde{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{e}_{\mathbf{u}}^k) - \Gamma(\tilde{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{e}_{\mathbf{u}}^k).$   (8)

Using (1), Eq. (7) becomes

$\hat{f}_{\mathbf{u}}^k = \hat{f}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} + \hat{e}_{\mathbf{u}}^k - \hat{\Delta}_{\mathbf{u}}^k,$   (9)

and using (3), Eq. (8) becomes

$\tilde{f}_{\mathbf{u}}^k = \tilde{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{e}_{\mathbf{u}}^k - \tilde{\Delta}_{\mathbf{u}}^k,$   (10)

where $\hat{\Delta}_{\mathbf{u}}^k$ only depends on the video content and encoder structure, e.g., motion estimation, quantization, mode decision and clipping function; and $\tilde{\Delta}_{\mathbf{u}}^k$ depends on not only the video content and encoder structure, but also the channel conditions and decoder structure, e.g., error concealment and clipping function.

In most existing works [3], [7], [9], [10], [15], both $\hat{\Delta}_{\mathbf{u}}^k$ and $\tilde{\Delta}_{\mathbf{u}}^k$ are neglected, i.e., these works assume $\hat{f}_{\mathbf{u}}^k = \hat{f}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} + \hat{e}_{\mathbf{u}}^k$ and $\tilde{f}_{\mathbf{u}}^k = \tilde{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{e}_{\mathbf{u}}^k$. However, this assumption is only valid for stored video or error-free communication, where $\tilde{\Delta}_{\mathbf{u}}^k = \hat{\Delta}_{\mathbf{u}}^k$, since $\hat{\Delta}_{\mathbf{u}}^k = 0$ with very high probability. For error-prone communication, the decoder clipping noise $\tilde{\Delta}_{\mathbf{u}}^k$ has a significant impact on transmission distortion and hence should not be neglected. Without taking $\tilde{\Delta}_{\mathbf{u}}^k$ into consideration, the estimated distortion can be much larger than the true distortion [22].

    III. TRANSMISSION DISTORTION FORMULAE

In this section, we derive formulae for PTD and FTD. The section is organized as follows: Section III-A presents an overview of our approach to analyzing PTD and FTD. Then, we elaborate on the derivation details in Section III-B through Section III-E. Specifically, Section III-B quantifies the effect of residual concealment error (RCE) on transmission distortion; Section III-C quantifies the effect of motion vector concealment error (MVCE) on transmission distortion; Section III-D quantifies the effect of propagated error and clipping noise on transmission distortion; Section III-E quantifies the effect of correlations (between any two of the error sources) on transmission distortion. Finally, Section III-F summarizes the key results of this paper, i.e., the formulae for PTD and FTD.

    A. Overview of the Approach to Analyzing PTD and FTD

To analyze PTD and FTD, we take a divide-and-conquer approach. We first divide the transmission reconstructed error into four components: three random errors (RCE, MVCE and propagated error) due to their different physical causes, and clipping noise, which is a non-linear function of these three random errors. This error decomposition allows us to further decompose transmission distortion into four terms, i.e., distortion caused by 1) RCE, 2) MVCE, 3) propagated error plus clipping noise, and 4) correlations between any two of the error sources, respectively. This distortion decomposition facilitates the derivation of a simple and accurate closed-form formula for each of the four distortion terms. Next, we elaborate on the error decomposition and the distortion decomposition.

TABLE II
DEFINITIONS

RCE : residual concealment error
MVCE : motion vector concealment error
PTD : pixel-level transmission distortion
FTD : frame-level transmission distortion
XEP : pixel error probability
PEP : packet error probability
FMO : flexible macroblock ordering
UEP : unequal error protection
SDP : slice data partitioning
PMF : probability mass function

Define the transmission reconstructed error for pixel $\mathbf{u}^k$ by $\tilde{\zeta}_{\mathbf{u}}^k \triangleq \hat{f}_{\mathbf{u}}^k - \tilde{f}_{\mathbf{u}}^k$. From (9) and (10), we obtain

$\tilde{\zeta}_{\mathbf{u}}^k = (\hat{e}_{\mathbf{u}}^k + \hat{f}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} - \hat{\Delta}_{\mathbf{u}}^k) - (\tilde{e}_{\mathbf{u}}^k + \tilde{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} - \tilde{\Delta}_{\mathbf{u}}^k) = (\hat{e}_{\mathbf{u}}^k - \tilde{e}_{\mathbf{u}}^k) + (\hat{f}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} - \hat{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1}) + (\hat{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} - \tilde{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1}) - (\hat{\Delta}_{\mathbf{u}}^k - \tilde{\Delta}_{\mathbf{u}}^k).$   (11)

Define RCE $\tilde{\varepsilon}_{\mathbf{u}}^k$ by $\tilde{\varepsilon}_{\mathbf{u}}^k \triangleq \hat{e}_{\mathbf{u}}^k - \tilde{e}_{\mathbf{u}}^k$, and define MVCE $\tilde{\xi}_{\mathbf{u}}^k$ by $\tilde{\xi}_{\mathbf{u}}^k \triangleq \hat{f}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} - \hat{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1}$. Note that $\hat{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} - \tilde{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} = \tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1}$, which is the transmission reconstructed error of the concealed reference pixel in the reference frame; we call $\tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1}$ the propagated error. As mentioned in Section II-D, we assume $\hat{\Delta}_{\mathbf{u}}^k = 0$. Therefore, (11) becomes

$\tilde{\zeta}_{\mathbf{u}}^k = \tilde{\varepsilon}_{\mathbf{u}}^k + \tilde{\xi}_{\mathbf{u}}^k + \tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k.$   (12)

Eq. (12) is our proposed error decomposition. In Table II, we list the abbreviations that will be used frequently in the following sections.
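The decomposition (11)-(12) can be checked numerically; a minimal sketch with made-up pixel values, chosen so that the encoder does not clip (so $\hat{\Delta}_{\mathbf{u}}^k = 0$):

```python
def clip(x, lo=0.0, hi=255.0):   # clipping function Gamma of Eq. (2)
    return max(lo, min(x, hi))

# Made-up example values (not from the paper):
f_hat_ref  = 200.0   # encoder reconstruction at u + mv (true reference)
f_hat_conc = 230.0   # encoder reconstruction at u + mv_tilde (concealed reference)
f_til_conc = 250.0   # decoder reconstruction at u + mv_tilde
e_hat, e_til = 30.0, 20.0

f_hat = clip(f_hat_ref + e_hat)    # Eq. (1): 230, no clipping at the encoder
f_til = clip(f_til_conc + e_til)   # Eq. (3): 270 is clipped to 255

zeta      = f_hat - f_til                     # transmission reconstructed error
eps       = e_hat - e_til                     # RCE
xi        = f_hat_ref - f_hat_conc            # MVCE
zeta_prev = f_hat_conc - f_til_conc           # propagated error
delta     = (f_til_conc + e_til) - f_til      # decoder clipping noise, Eq. (8)

assert zeta == eps + xi + zeta_prev + delta   # Eq. (12): -25 = 10 - 30 - 20 + 15
```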

Combining (4) and (12), we have

$D_{\mathbf{u}}^k = E[(\tilde{\varepsilon}_{\mathbf{u}}^k + \tilde{\xi}_{\mathbf{u}}^k + \tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k)^2] = E[(\tilde{\varepsilon}_{\mathbf{u}}^k)^2] + E[(\tilde{\xi}_{\mathbf{u}}^k)^2] + E[(\tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k)^2] + 2E[\tilde{\varepsilon}_{\mathbf{u}}^k \cdot \tilde{\xi}_{\mathbf{u}}^k] + 2E[\tilde{\varepsilon}_{\mathbf{u}}^k \cdot (\tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k)] + 2E[\tilde{\xi}_{\mathbf{u}}^k \cdot (\tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k)].$   (13)

Denote $D_{\mathbf{u}}^k(r) \triangleq E[(\tilde{\varepsilon}_{\mathbf{u}}^k)^2]$, $D_{\mathbf{u}}^k(m) \triangleq E[(\tilde{\xi}_{\mathbf{u}}^k)^2]$, $D_{\mathbf{u}}^k(P) \triangleq E[(\tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k)^2]$, and $D_{\mathbf{u}}^k(c) \triangleq 2E[\tilde{\varepsilon}_{\mathbf{u}}^k \cdot \tilde{\xi}_{\mathbf{u}}^k] + 2E[\tilde{\varepsilon}_{\mathbf{u}}^k \cdot (\tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k)] + 2E[\tilde{\xi}_{\mathbf{u}}^k \cdot (\tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k)]$. Then, (13) becomes

$D_{\mathbf{u}}^k = D_{\mathbf{u}}^k(r) + D_{\mathbf{u}}^k(m) + D_{\mathbf{u}}^k(P) + D_{\mathbf{u}}^k(c).$   (14)

Eq. (14) is our proposed distortion decomposition for PTD. The reason why we combine propagated error and clipping noise into one term (called clipped propagated error) is that clipping noise is mainly caused by propagated error, and such a decomposition simplifies the formulae.

There are three major reasons for our decompositions in (12) and (14). First, if we directly substitute the terms in (4) by (9) and (10), it will produce 5 second moments and 10 cross-correlation terms (assuming $\hat{\Delta}_{\mathbf{u}}^k = 0$); since there are 8 possible error events due to the three individual random errors, there are a total of $8 \times (5 + 10) = 120$ terms for PTD, making the analysis highly complicated. In contrast, our decompositions in (12) and (14) significantly simplify the analysis. Second, each term in (12) and (14) has a clear physical meaning, which lessens the requirement for the joint PMF of $\hat{f}_{\mathbf{u}}^k$ and $\tilde{f}_{\mathbf{u}}^k$ and leads to accurate estimation algorithms with low complexity. Third, such decompositions allow our formulae to be easily extended to support advanced video codecs with more performance-enhancing parts, e.g., multi-reference prediction [22] and interpolation filtering in fractional-pel motion estimation [23].

To derive the formula for FTD, from (6) and (14), we obtain

$D^k = D^k(r) + D^k(m) + D^k(P) + D^k(c),$   (15)

where

$D^k(r) = \frac{1}{|\mathcal{V}|} \cdot \sum_{\mathbf{u}\in\mathcal{V}^k} D_{\mathbf{u}}^k(r),$   (16)

$D^k(m) = \frac{1}{|\mathcal{V}|} \cdot \sum_{\mathbf{u}\in\mathcal{V}^k} D_{\mathbf{u}}^k(m),$   (17)

$D^k(P) = \frac{1}{|\mathcal{V}|} \cdot \sum_{\mathbf{u}\in\mathcal{V}^k} D_{\mathbf{u}}^k(P),$   (18)

$D^k(c) = \frac{1}{|\mathcal{V}|} \cdot \sum_{\mathbf{u}\in\mathcal{V}^k} D_{\mathbf{u}}^k(c).$   (19)

Eq. (15) is our proposed distortion decomposition for FTD. Usually, the cardinality, i.e., the number of elements, of set $\mathcal{V}^k$ in a video sequence is the same for all frames. That is, $|\mathcal{V}^1| = \cdots = |\mathcal{V}^k| = |\mathcal{V}|$ for all $k \ge 1$.⁴ Hence, we remove the frame index $k$ and denote $|\mathcal{V}^k|$ for all $k \ge 1$ by $|\mathcal{V}|$. Note that in a video codec, e.g., H.264/AVC [16], a reference pixel may be in a position out of the picture boundary; however, the cardinality of the set consisting of reference pixels, although larger than the cardinality of the input pixel set $|\mathcal{V}|$, is still the same for all frames.

⁴Note that although they have the same cardinality, different sets are very different, i.e., $\mathcal{V}^{k-1} \ne \mathcal{V}^k$.

    B. Analysis of Distortion Caused by RCE

In this subsection, we first derive the pixel-level residual caused distortion $D_{\mathbf{u}}^k(r)$. Then, we derive the frame-level residual caused distortion $D^k(r)$.

1) Pixel-level Distortion Caused by RCE: We denote $S_{\mathbf{u}}^k$ as the state indicator of whether there is a transmission error for pixel $\mathbf{u}^k$ after channel decoding. Note that as mentioned in Section II-A, both the residual channel and the MV channel contain channel decoding; hence in this paper, a transmission error in the residual channel or the MV channel means an uncorrectable error after channel decoding. To distinguish the residual error state from the MV error state, here we use $S_{\mathbf{u}}^k(r)$ to denote the residual error state for pixel $\mathbf{u}^k$. That is, $S_{\mathbf{u}}^k(r) = 1$ if $\hat{e}_{\mathbf{u}}^k$ is received with error, and $S_{\mathbf{u}}^k(r) = 0$ if $\hat{e}_{\mathbf{u}}^k$ is received without error. At the receiver, if there is no residual transmission error for pixel $\mathbf{u}^k$, $\tilde{e}_{\mathbf{u}}^k$ is equal to $\hat{e}_{\mathbf{u}}^k$. However, if the residual packets are received with error, we need to conceal the residual error at the receiver. Denote $\check{e}_{\mathbf{u}}^k$ the concealed residual when $S_{\mathbf{u}}^k(r) = 1$; then we have

$\tilde{e}_{\mathbf{u}}^k = \begin{cases} \check{e}_{\mathbf{u}}^k, & S_{\mathbf{u}}^k(r) = 1 \\ \hat{e}_{\mathbf{u}}^k, & S_{\mathbf{u}}^k(r) = 0. \end{cases}$   (20)

Note that $\check{e}_{\mathbf{u}}^k$ depends on $\hat{e}_{\mathbf{u}}^k$ and the residual concealment method, but does not depend on the channel condition. From the definition of $\tilde{\varepsilon}_{\mathbf{u}}^k$ and (20), we have

$\tilde{\varepsilon}_{\mathbf{u}}^k = (\hat{e}_{\mathbf{u}}^k - \check{e}_{\mathbf{u}}^k) \cdot S_{\mathbf{u}}^k(r) + (\hat{e}_{\mathbf{u}}^k - \hat{e}_{\mathbf{u}}^k) \cdot (1 - S_{\mathbf{u}}^k(r)) = (\hat{e}_{\mathbf{u}}^k - \check{e}_{\mathbf{u}}^k) \cdot S_{\mathbf{u}}^k(r).$   (21)

$\hat{e}_{\mathbf{u}}^k$ depends on the input video sequence and the encoder structure, while $S_{\mathbf{u}}^k(r)$ depends on the random multiplicative and additive noises in the wireless channel. Under our framework shown in Fig. 1, the input video sequence and the encoder structure are independent of the communication system parameters. Therefore, we make the following assumption.

Assumption 1: $S_{\mathbf{u}}^k(r)$ is independent of $\hat{e}_{\mathbf{u}}^k$.

Denote $\varepsilon_{\mathbf{u}}^k \triangleq \hat{e}_{\mathbf{u}}^k - \check{e}_{\mathbf{u}}^k$; we have $\tilde{\varepsilon}_{\mathbf{u}}^k = \varepsilon_{\mathbf{u}}^k \cdot S_{\mathbf{u}}^k(r)$. Denote $P_{\mathbf{u}}^k(r)$ as the residual pixel error probability (XEP) for pixel $\mathbf{u}^k$, that is, $P_{\mathbf{u}}^k(r) \triangleq P\{S_{\mathbf{u}}^k(r) = 1\}$.⁵ Then, given $P_{\mathbf{u}}^k(r)$, from (21) and Assumption 1, we have

$D_{\mathbf{u}}^k(r) = E[(\tilde{\varepsilon}_{\mathbf{u}}^k)^2] = E[(\varepsilon_{\mathbf{u}}^k)^2] \cdot E[(S_{\mathbf{u}}^k(r))^2] = E[(\varepsilon_{\mathbf{u}}^k)^2] \cdot (1 \cdot P_{\mathbf{u}}^k(r)) = E[(\varepsilon_{\mathbf{u}}^k)^2] \cdot P_{\mathbf{u}}^k(r).$   (22)

Hence, our formula for the pixel-level residual caused distortion is

$D_{\mathbf{u}}^k(r) = E[(\varepsilon_{\mathbf{u}}^k)^2] \cdot P_{\mathbf{u}}^k(r).$   (23)

Note that we may also generalize (23) to I-MBs. For pixels in an I-MB, if the packet containing those pixels has an error, $\check{e}_{\mathbf{u}}^k$ is still available since all the erroneous pixels will be concealed in the same way. However, since there is no $\hat{e}_{\mathbf{u}}^k$ available, in order to use (23) to predict the transmission distortion, we may need to find the best reference, in terms of R-D cost, for the reconstructed I-MB by doing a virtual motion estimation and then calculating $\hat{e}_{\mathbf{u}}^k$ for (23). The estimated $mv_{\mathbf{u}}^k$ can be used to predict $D_{\mathbf{u}}^k(m)$ for I-MBs in the next subsection. An alternative method to calculate $\hat{e}_{\mathbf{u}}^k$ for an I-MB is to use the same position in the previous frame as reference, i.e., assuming $mv_{\mathbf{u}}^k = 0$. Note that if the packet containing those pixels in an I-MB is correctly received, $D_{\mathbf{u}}^k(r) = 0$.

2) Frame-level Distortion Caused by RCE: To derive the frame-level residual caused distortion, the encoder needs to know the second moment of RCE for each pixel in that frame. In most, if not all, existing distortion models [3], [7], [9], [10], [15], the residual error concealment method is to let $\check{e}_{\mathbf{u}}^k = 0$ for all erroneous pixels. However, as long as $\hat{e}_{\mathbf{u}}^k$ and $\check{e}_{\mathbf{u}}^k$ satisfy some properties, we can derive a formula for more general residual error concealment methods instead of assuming $\check{e}_{\mathbf{u}}^k = 0$. We make the following assumption for $\hat{e}_{\mathbf{u}}^k$ and $\check{e}_{\mathbf{u}}^k$.

⁵$P_{\mathbf{u}}^k(r)$ depends on the communication system parameters such as delay bound, channel coding rate, transmission power, and channel gain of the wireless channel.


Assumption 2: The residual $\hat{e}_{\mathbf{u}}^k$ is stationary with respect to the 2D variable $\mathbf{u}$ in the same frame. In addition, $\check{e}_{\mathbf{u}}^k$ only depends on $\{\hat{e}_{\mathbf{v}}^k : \mathbf{v} \in N_{\mathbf{u}}\}$, where $N_{\mathbf{u}}$ is a fixed neighborhood of $\mathbf{u}$.

In other words, Assumption 2 assumes that 1) $\hat{e}_{\mathbf{u}}^k$ is a 2D stationary stochastic process and the distribution of $\hat{e}_{\mathbf{u}}^k$ is the same for all $\mathbf{u} \in \mathcal{V}^k$, and 2) $\check{e}_{\mathbf{u}}^k$ is also a 2D stationary stochastic process since it only depends on the neighboring $\hat{e}_{\mathbf{u}}^k$. Hence, $\hat{e}_{\mathbf{u}}^k - \check{e}_{\mathbf{u}}^k$ is also a 2D stationary stochastic process, and its second moment $E[(\hat{e}_{\mathbf{u}}^k - \check{e}_{\mathbf{u}}^k)^2] = E[(\varepsilon_{\mathbf{u}}^k)^2]$ is the same for all $\mathbf{u} \in \mathcal{V}^k$. Therefore, we can drop $\mathbf{u}$ from the notation and let $E[(\varepsilon^k)^2] = E[(\varepsilon_{\mathbf{u}}^k)^2]$ for all $\mathbf{u} \in \mathcal{V}^k$.

Denote $N_i^k(r)$ as the number of pixels contained in the $i$-th residual packet of the $k$-th frame; denote $P_i^k(r)$ as the PEP of the $i$-th residual packet of the $k$-th frame; denote $N^k(r)$ as the total number of residual packets of the $k$-th frame. Since for all pixels in the same packet the residual XEP is equal to its PEP, from (16) and (23), we have

$D^k(r) = \frac{1}{|\mathcal{V}|} \sum_{\mathbf{u}\in\mathcal{V}^k} E[(\varepsilon_{\mathbf{u}}^k)^2] \cdot P_{\mathbf{u}}^k(r)$   (24)

$= \frac{1}{|\mathcal{V}|} \sum_{\mathbf{u}\in\mathcal{V}^k} E[(\varepsilon^k)^2] \cdot P_{\mathbf{u}}^k(r)$   (25)

$\stackrel{(a)}{=} \frac{E[(\varepsilon^k)^2]}{|\mathcal{V}|} \sum_{i=1}^{N^k(r)} (P_i^k(r) \cdot N_i^k(r))$   (26)

$\stackrel{(b)}{=} E[(\varepsilon^k)^2] \cdot \bar{P}^k(r),$   (27)

where (a) is due to $P_{\mathbf{u}}^k(r) = P_i^k(r)$ for pixel $\mathbf{u}$ in the $i$-th residual packet, and (b) is due to

$\bar{P}^k(r) \triangleq \frac{1}{|\mathcal{V}|} \sum_{i=1}^{N^k(r)} (P_i^k(r) \cdot N_i^k(r)).$   (28)

$\bar{P}^k(r)$ is a weighted average over the PEPs of all residual packets in the $k$-th frame, in which different packets may contain different numbers of pixels. Hence, given the PEPs of all residual packets in the $k$-th frame, our formula for the frame-level residual caused distortion is

$D^k(r) = E[(\varepsilon^k)^2] \cdot \bar{P}^k(r).$   (29)

Note that with the FMO mechanism, many neighboring pixels may be encoded into different slices and transmitted in different packets. Since each packet may experience a different PEP, especially in a fast fading channel, even neighboring pixels may have very different XEPs. Therefore, (29) remains valid under FMO. This situation is taken into consideration throughout this paper.
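A minimal sketch of (28) and (29), assuming the per-packet PEPs and pixel counts of frame $k$ are known (names are ours):

```python
def frame_rce_distortion(eps_sq, packet_pep, packet_npix):
    """Frame-level RCE distortion of Eq. (29).

    eps_sq      -- E[(eps^k)^2], second moment of the residual concealment error
    packet_pep  -- P_i^k(r), PEP of each residual packet of frame k
    packet_npix -- N_i^k(r), number of pixels carried by each residual packet
    """
    n_pixels = sum(packet_npix)   # |V|
    # Eq. (28): pixel-count-weighted average PEP over all residual packets
    pep_bar = sum(p * n for p, n in zip(packet_pep, packet_npix)) / n_pixels
    return eps_sq * pep_bar       # Eq. (29)
```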

    C. Analysis of Distortion Caused by MVCE

Similar to the derivations in Section III-B1, in this subsection we derive the formula for the pixel-level MV caused distortion $D_{\mathbf{u}}^k(m)$, and the frame-level MV caused distortion $D^k(m)$.

1) Pixel-level Distortion Caused by MVCE: Denote the MV error state for pixel $\mathbf{u}^k$ by $S_{\mathbf{u}}^k(m)$, and denote the concealed MV by $\check{mv}_{\mathbf{u}}^k$ for general temporal error concealment methods when $S_{\mathbf{u}}^k(m) = 1$. Therefore, we have

$\tilde{mv}_{\mathbf{u}}^k = \begin{cases} \check{mv}_{\mathbf{u}}^k, & S_{\mathbf{u}}^k(m) = 1 \\ mv_{\mathbf{u}}^k, & S_{\mathbf{u}}^k(m) = 0. \end{cases}$   (30)

Denote $\xi_{\mathbf{u}}^k \triangleq \hat{f}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} - \hat{f}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1}$, where $\xi_{\mathbf{u}}^k$ depends on the accuracy of MV concealment, and on the spatial correlation between the reference pixel and the concealed reference pixel at the encoder. A more comprehensive analysis of the effect of inaccurate MV estimation on $\xi_{\mathbf{u}}^k$ can be found in Ref. [24], which is then extended to support multihypothesis motion-compensated prediction [25] and to derive a rate-distortion model taking into account the temporal prediction distance [26].

We also make the following assumption.

Assumption 3: $S_{\mathbf{u}}^k(m)$ is independent of $\xi_{\mathbf{u}}^k$.

Denote $P_{\mathbf{u}}^k(m)$ as the MV XEP for pixel $\mathbf{u}^k$, that is, $P_{\mathbf{u}}^k(m) \triangleq P\{S_{\mathbf{u}}^k(m) = 1\}$. Note that it is possible that $P_{\mathbf{u}}^k(m) \ne P_{\mathbf{u}}^k(r)$ if SDP and UEP are applied. Given $P_{\mathbf{u}}^k(m)$, following the same derivation process as in Section III-B1, we can obtain

$D_{\mathbf{u}}^k(m) = E[(\xi_{\mathbf{u}}^k)^2] \cdot P_{\mathbf{u}}^k(m).$   (31)

Also note that in the H.264/AVC specification [16], there is no SDP for an instantaneous decoding refresh (IDR) frame; so $S_{\mathbf{u}}^k(r) = S_{\mathbf{u}}^k(m)$ in an IDR-frame and hence $P_{\mathbf{u}}^k(r) = P_{\mathbf{u}}^k(m)$. This is also true for MBs without SDP. For P-MBs with SDP in H.264/AVC, $S_{\mathbf{u}}^k(r)$ and $S_{\mathbf{u}}^k(m)$ are dependent. In other words, if the MV packet is lost, the corresponding residual packet cannot be decoded even if it is correctly received, since there is no slice header in the residual packet. Therefore, the residual channel and the MV channel in Fig. 1 are actually dependent if the encoder follows the H.264/AVC specification. In this paper, we study transmission distortion in a more general case where $S_{\mathbf{u}}^k(r)$ and $S_{\mathbf{u}}^k(m)$ can be either independent or dependent.⁶

2) Frame-level Distortion Caused by MVCE: To derive the frame-level MV caused distortion, we also make the following assumption.

Assumption 4: The second moment of $\xi_{\mathbf{u}}^k$ is the same for all $\mathbf{u} \in \mathcal{V}^k$.

Under Assumption 4, we can drop $\mathbf{u}$ from the notation and let $E[(\xi^k)^2] = E[(\xi_{\mathbf{u}}^k)^2]$ for all $\mathbf{u} \in \mathcal{V}^k$. Denote $N_i^k(m)$ as the number of pixels contained in the $i$-th MV packet of the $k$-th frame; denote $P_i^k(m)$ as the PEP of the $i$-th MV packet of the $k$-th frame; denote $N^k(m)$ as the total number of MV packets of the $k$-th frame. Then, given the PEPs of all MV packets in the $k$-th frame, following the same derivation process as in Section III-B2, we obtain the frame-level MV caused distortion for the $k$-th frame as

$D^k(m) = E[(\xi^k)^2] \cdot \bar{P}^k(m),$   (32)

⁶To achieve this, we add side information to the H.264/AVC reference code JM14.0 by allowing residual packets to be used by the decoder without the corresponding MV packets being correctly received; that is, $\hat{e}_{\mathbf{u}}^k$ can be used to reconstruct $\tilde{f}_{\mathbf{u}}^k$ even if $mv_{\mathbf{u}}^k$ is not correctly received.


where $\bar{P}^k(m) \triangleq \frac{1}{|\mathcal{V}|} \sum_{i=1}^{N^k(m)} (P_i^k(m) \cdot N_i^k(m))$ is a weighted average over the PEPs of all MV packets in the $k$-th frame, in which different packets may contain different numbers of pixels.
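Eq. (32) has the same shape as (29), with $E[(\xi^k)^2]$ and the MV packets' statistics in place of the residual ones; a usage sketch with illustrative numbers (all values are ours):

```python
# Frame-level MVCE distortion of Eq. (32) (all numbers are illustrative):
xi_sq   = 40.0                      # E[(xi^k)^2]
mv_pep  = [0.05, 0.02, 0.10]        # P_i^k(m) for each MV packet of frame k
mv_npix = [25344, 25344, 50688]     # N_i^k(m) pixels carried by each MV packet
pep_bar_m = sum(p * n for p, n in zip(mv_pep, mv_npix)) / sum(mv_npix)
d_k_m = xi_sq * pep_bar_m           # D^k(m) = E[(xi^k)^2] * P_bar^k(m)
```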

D. Analysis of Distortion Caused by Propagated Error Plus Clipping Noise

In this subsection, we derive the distortion caused by error propagation in a non-linear decoder with clipping. We first derive the pixel-level propagation and clipping caused distortion $D_{\mathbf{u}}^k(P)$. Then, we derive the frame-level propagation and clipping caused distortion $D^k(P)$.

1) Pixel-level Distortion Caused by Propagated Error Plus Clipping Noise: First, we analyze the pixel-level propagation and clipping caused distortion $D_{\mathbf{u}}^k(P)$ in P-MBs. From the definition, we know $D_{\mathbf{u}}^k(P)$ depends on propagated error and clipping noise; and clipping noise is a function of RCE, MVCE and propagated error. Hence, $D_{\mathbf{u}}^k(P)$ depends on RCE, MVCE and propagated error. Let $r, m, p$ denote the events of occurrence of RCE, MVCE and propagated error, respectively, and let $\bar{r}, \bar{m}, \bar{p}$ denote the logical NOT of $r, m, p$, respectively (indicating no error). We use a triplet to denote the joint event of the three types of error; e.g., $\{r, m, p\}$ denotes the event that all three types of errors occur, and $\mathbf{u}^k\{\bar{r}, \bar{m}, \bar{p}\}$ denotes the pixel $\mathbf{u}^k$ experiencing none of the three types of errors.

When several error events may occur, the notation can be simplified by the principles of formal logic. For example, $\tilde{\Delta}_{\mathbf{u}}^k\{\bar{r}, \bar{m}\}$ denotes the clipping noise under the condition that there is neither RCE nor MVCE for pixel $\mathbf{u}^k$, while it is not certain whether the reference pixel has an error. Correspondingly, denote $P_{\mathbf{u}}^k\{\bar{r}, \bar{m}\}$ as the probability of event $\{\bar{r}, \bar{m}\}$, that is, $P_{\mathbf{u}}^k\{\bar{r}, \bar{m}\} = P\{S_{\mathbf{u}}^k(r) = 0 \text{ and } S_{\mathbf{u}}^k(m) = 0\}$. From the definition of $P_{\mathbf{u}}^k(r)$, the marginal probability $P_{\mathbf{u}}^k\{r\} = P_{\mathbf{u}}^k(r)$ and the marginal probability $P_{\mathbf{u}}^k\{\bar{r}\} = 1 - P_{\mathbf{u}}^k(r)$. Similarly, $P_{\mathbf{u}}^k\{m\} = P_{\mathbf{u}}^k(m)$ and $P_{\mathbf{u}}^k\{\bar{m}\} = 1 - P_{\mathbf{u}}^k(m)$.

Define $D_{\mathbf{u}}^k(p) \triangleq E[(\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k\{\bar{r}, \bar{m}\})^2]$, and define $\alpha_{\mathbf{u}}^k \triangleq \frac{D_{\mathbf{u}}^k(p)}{D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}}$, which is called the propagation factor for pixel $\mathbf{u}^k$. The propagation factor $\alpha_{\mathbf{u}}^k$ defined in this paper is different from the propagation factor [10], leakage [7], or attenuation factor [15], which are modeled as the effect of spatial filtering or intra update; our propagation factor $\alpha_{\mathbf{u}}^k$ is also different from the fading factor [8], which is modeled as the effect of using a fraction of referenced pixels in the reference frame for motion prediction. Note that $D_{\mathbf{u}}^k(p)$ is only a special case of $D_{\mathbf{u}}^k(P)$ under the error event $\{\bar{r}, \bar{m}\}$ for pixel $\mathbf{u}^k$. However, most existing models inappropriately use their propagation factor, obtained under the error event $\{\bar{r}, \bar{m}\}$, to replace $D_{\mathbf{u}}^k(P)$ directly.

To calculate $E[(\tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k)^2]$ in (13), we need to analyze $\tilde{\Delta}_{\mathbf{u}}^k$ in four different error events for pixel $\mathbf{u}^k$: 1) both residual and MV are erroneous, denoted by $\mathbf{u}^k\{r, m\}$; 2) residual is erroneous but MV is correct, denoted by $\mathbf{u}^k\{r, \bar{m}\}$; 3) residual is correct but MV is erroneous, denoted by $\mathbf{u}^k\{\bar{r}, m\}$; and 4) both residual and MV are correct, denoted by $\mathbf{u}^k\{\bar{r}, \bar{m}\}$. So,

$D_{\mathbf{u}}^k(P) = P_{\mathbf{u}}^k\{r, m\} \cdot E[(\tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k\{r, m\})^2] + P_{\mathbf{u}}^k\{r, \bar{m}\} \cdot E[(\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k\{r, \bar{m}\})^2] + P_{\mathbf{u}}^k\{\bar{r}, m\} \cdot E[(\tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k\{\bar{r}, m\})^2] + P_{\mathbf{u}}^k\{\bar{r}, \bar{m}\} \cdot E[(\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k\{\bar{r}, \bar{m}\})^2].$   (33)

Note that the concealed pixel value should be within the clipping function range, that is, $\Gamma(\tilde{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \check{e}_{\mathbf{u}}^k) = \tilde{f}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \check{e}_{\mathbf{u}}^k$, so $\tilde{\Delta}_{\mathbf{u}}^k\{r\} = 0$. Also note that if the MV channel is independent of the residual channel, we have $P_{\mathbf{u}}^k\{r, m\} = P_{\mathbf{u}}^k(r) \cdot P_{\mathbf{u}}^k(m)$. However, as mentioned in Section III-C1, in the H.264/AVC specification these two channels are dependent. In other words, $P_{\mathbf{u}}^k\{\bar{r}, m\} = 0$ and $P_{\mathbf{u}}^k\{\bar{r}, \bar{m}\} = P_{\mathbf{u}}^k\{\bar{r}\}$ for P-MBs with SDP in H.264/AVC.⁷ In such a case, (33) simplifies to

$D_{\mathbf{u}}^k(P) = P_{\mathbf{u}}^k\{r, m\} \cdot D_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1} + P_{\mathbf{u}}^k\{r, \bar{m}\} \cdot D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} + P_{\mathbf{u}}^k\{\bar{r}\} \cdot D_{\mathbf{u}}^k(p).$   (34)

Note that for P-MBs without SDP, we have $P_{\mathbf{u}}^k\{r, \bar{m}\} = P_{\mathbf{u}}^k\{\bar{r}, m\} = 0$, $P_{\mathbf{u}}^k\{r, m\} = P_{\mathbf{u}}^k\{r\} = P_{\mathbf{u}}^k\{m\} = P_{\mathbf{u}}^k$, and $P_{\mathbf{u}}^k\{\bar{r}, \bar{m}\} = P_{\mathbf{u}}^k\{\bar{r}\} = P_{\mathbf{u}}^k\{\bar{m}\} = 1 - P_{\mathbf{u}}^k$. Therefore, (34) can be further simplified to

$D_{\mathbf{u}}^k(P) = P_{\mathbf{u}}^k \cdot D_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1} + (1 - P_{\mathbf{u}}^k) \cdot D_{\mathbf{u}}^k(p).$   (35)

Also note that for an I-MB, there will be no transmission distortion if it is correctly received, that is, $D_{\mathbf{u}}^k(p) = 0$. So (35) can be further simplified to

$D_{\mathbf{u}}^k(P) = P_{\mathbf{u}}^k \cdot D_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1}.$   (36)

⁷In a more general case, where $P_{\mathbf{u}}^k\{\bar{r}, m\} \ne 0$, Eq. (34) can be used as an approximation. This is because $E[(\tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k\{\bar{r}, m\})^2]$ only arises under SDP, where the probability of an MV packet error is usually less than that of a residual packet error, and the probability of the event that a residual packet is correctly received but the corresponding MV packet is in error, i.e., $P_{\mathbf{u}}^k\{\bar{r}, m\}$, is very small. In addition, since $\tilde{\Delta}_{\mathbf{u}}^k\{r\} = 0$, among the four error events in (33), $\tilde{\Delta}_{\mathbf{u}}^k\{\bar{r}, m\}$ is much more similar to $\tilde{\Delta}_{\mathbf{u}}^k\{\bar{r}, \bar{m}\}$ than to $\tilde{\Delta}_{\mathbf{u}}^k\{r, m\}$ and $\tilde{\Delta}_{\mathbf{u}}^k\{r, \bar{m}\}$. Therefore, we may approximate the last two terms in (33) by $P_{\mathbf{u}}^k\{\bar{r}\} \cdot E[(\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k\{\bar{r}, \bar{m}\})^2]$, i.e., $P_{\mathbf{u}}^k\{\bar{r}\} \cdot D_{\mathbf{u}}^k(p)$.

Comparing (36) with (35), we see that an I-MB is a special case of a P-MB with $D_{\mathbf{u}}^k(p) = 0$, that is, the propagation factor $\alpha_{\mathbf{u}}^k = 0$ according to the definition. It is important to note that $D_{\mathbf{u}}^k(P) > 0$ for an I-MB since $P_{\mathbf{u}}^k \ne 0$. In other words, an I-MB also contains the distortion caused by propagated error, and it can be predicted by (36). However, existing linear time-invariant (LTI) models [7], [8] assume that there is no distortion caused by propagated error for I-MBs, which underestimates the transmission distortion.
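A minimal sketch of the per-pixel recursion (35) for a P-MB without SDP, assuming the previous frame's pixel distortion map and the propagation factor are available (names are ours):

```python
def pixel_propagation_distortion(pep, d_prev_concealed, alpha, d_prev_ref):
    """Eq. (35): D_u^k(P) for a P-MB without SDP.

    pep              -- P_u^k, probability that the pixel's packet is in error
    d_prev_concealed -- D^{k-1} at u + mv_check (concealed reference pixel)
    alpha            -- propagation factor alpha_u^k (see Eq. (38))
    d_prev_ref       -- D^{k-1} at u + mv (true reference pixel)
    """
    d_p = alpha * d_prev_ref   # D_u^k(p) = alpha_u^k * D^{k-1}_{u+mv}
    return pep * d_prev_concealed + (1.0 - pep) * d_p
```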

In the remainder of this subsection, we derive the propagation factor $\alpha_{\mathbf{u}}^k$ for P-MBs and prove some important properties of clipping noise. To derive $\alpha_{\mathbf{u}}^k$, we first give Lemma 1 below.

Lemma 1: Given the PMF of the random variable $\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ and the value of $\hat{f}_{\mathbf{u}}^k$, $D_{\mathbf{u}}^k(p)$ can be calculated at the encoder by $D_{\mathbf{u}}^k(p) = E[\Phi^2(\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}, \hat{f}_{\mathbf{u}}^k)]$, where $\Phi(x, y)$ is called the error reduction function and is defined by

$\Phi(x, y) \triangleq y - \Gamma(y - x) = \begin{cases} y - \gamma_l, & y - x < \gamma_l \\ x, & \gamma_l \le y - x \le \gamma_h \\ y - \gamma_h, & y - x > \gamma_h. \end{cases}$   (37)

Lemma 1 is proved in Appendix A. In fact, we have found in our experiments that in any error event, $\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ approximately follows a Laplacian distribution with zero mean. If we assume $\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ follows a Laplacian distribution with zero mean, the calculation of $D_{\mathbf{u}}^k(p)$ becomes simpler since the only unknown parameter of the PMF of $\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ is its variance. Under this assumption, we have the following proposition.

Proposition 1: The propagation factor $\alpha$ for a propagated error with Laplacian distribution of zero mean and variance $\sigma^2$ is given by

$\alpha = 1 - \frac{1}{2} e^{-\frac{y-\gamma_l}{b}} \Big(\frac{y-\gamma_l}{b} + 1\Big) - \frac{1}{2} e^{-\frac{\gamma_h-y}{b}} \Big(\frac{\gamma_h-y}{b} + 1\Big),$   (38)

where $y$ is the reconstructed pixel value, and $b = \frac{\sqrt{2}}{2}\sigma$.
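A direct transcription of (38); the propagated-error variance $\sigma^2$ and the reconstructed value $y$ are inputs, and we assume $\sigma^2 > 0$ and $\gamma_l \le y \le \gamma_h$:

```python
import math

def propagation_factor(y, sigma2, gamma_l=0.0, gamma_h=255.0):
    """Propagation factor alpha of Eq. (38) for a zero-mean Laplacian
    propagated error with variance sigma2; y is the reconstructed pixel value."""
    b = math.sqrt(sigma2 / 2.0)   # Laplacian scale b = (sqrt(2)/2) * sigma
    tl = (y - gamma_l) / b
    th = (gamma_h - y) / b
    return 1.0 - 0.5 * math.exp(-tl) * (tl + 1.0) - 0.5 * math.exp(-th) * (th + 1.0)
```

Consistent with Proposition 2 below, this value never exceeds 1: for a mid-range $y$ and small variance it approaches 1, while for $y$ at a clipping threshold it drops to about 0.5, reflecting the error energy removed by clipping.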

Proposition 1 is proved in Appendix B. In the zero-mean Laplacian case, $\alpha_{\mathbf{u}}^k$ is only a function of $\hat{f}_{\mathbf{u}}^k$ and the variance of $\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$, which is equal to $D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ in this case. Since $D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ has already been calculated during the phase of predicting the $(k-1)$-th frame transmission distortion, $D_{\mathbf{u}}^k(p)$ can be calculated by $D_{\mathbf{u}}^k(p) = \alpha_{\mathbf{u}}^k \cdot D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ via the definition of $\alpha_{\mathbf{u}}^k$. Then, we can recursively calculate $D_{\mathbf{u}}^k(P)$ in (34) since both $D_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1}$ and $D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ have been calculated previously for the $(k-1)$-th frame.

Next, we prove an important property of the non-linear clipping function in Proposition 2. To prove Proposition 2, we need the following lemma.

Lemma 2: The error reduction function $\Phi(x, y)$ satisfies $\Phi^2(x, y) \le x^2$ for any $\gamma_l \le y \le \gamma_h$.

Lemma 2 is proved in Appendix C. From Lemma 2, we know that the function $\Phi(x, y)$ reduces the energy of the propagated error. This is the reason why we call it the error reduction function. With Lemma 1, it is straightforward to prove that whatever the PMF of $\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ is,

$D_{\mathbf{u}}^k(p) = E[\Phi^2(\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}, \hat{f}_{\mathbf{u}}^k)] \le E[(\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1})^2] = D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1},$   (39)

i.e., $\alpha_{\mathbf{u}}^k \le 1$. In other words, we have the following proposition.

Proposition 2: Clipping reduces propagated error, that is, $D_{\mathbf{u}}^k(p) \le D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$, or $\alpha_{\mathbf{u}}^k \le 1$.

Proposition 2 tells us that if there is no newly induced error in the $k$-th frame, the transmission distortion decreases from the $(k-1)$-th frame to the $k$-th frame. Fig. 2 shows the experimental result of transmission distortion propagation for the 'bus' sequence in cif format, where the third frame is lost at the decoder and all other frames are correctly received.⁸

⁸Since showing the experimental results for all trajectories is almost impossible in the paper, we just show the result as the mean square error (MSE) over all pixels in the same frame.

[Fig. 2. The effect of clipping noise on distortion propagation: MSE distortion versus frame index for the 'bus' sequence, where only the third frame is lost.]

The experiment setup for Fig. 2, Fig. 3, Fig. 4, Fig. 5 and Fig. 6 is as follows: the JM14.0 [27] encoder and decoder are used; the first frame is an I-frame, and the subsequent frames are all P-frames without I-MBs; for temporal error concealment, MV error concealment is the default frame copy in the JM14.0 decoder due to its simplicity; residual packets can be used by the decoder without the corresponding MV packets being correctly received, as aforementioned; the interpolation filter and deblocking filter are disabled. That is, the error reduction is caused only by the clipping noise.

In fact, if we consider the more general case where there may be a new error induced in the $k$-th frame, we can still prove that $E[(\tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k)^2] \le E[(\tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1})^2]$, as shown in (60) during the proof of the following corollary.

Corollary 1: The correlation coefficient between $\tilde{\zeta}_{\mathbf{u}+\tilde{mv}_{\mathbf{u}}^k}^{k-1}$ and $\tilde{\Delta}_{\mathbf{u}}^k$ is non-positive. Specifically, they are negatively correlated under the condition $\{\bar{r}, p\}$, and uncorrelated under other conditions.

Corollary 1 is proved in Appendix D. This property is very important for designing a low complexity algorithm to estimate the propagation and clipping caused distortion in PTD, which is presented in the sequel paper [19].

2) Frame-level Distortion Caused by Propagated Error Plus Clipping Noise: Define $D^k(p)$ as the mean of $D_{\mathbf{u}}^k(p)$ over all $\mathbf{u} \in \mathcal{V}^k$, i.e., $D^k(p) \triangleq \frac{1}{|\mathcal{V}|} \sum_{\mathbf{u}\in\mathcal{V}^k} D_{\mathbf{u}}^k(p)$; the formula for the frame-level propagation and clipping caused distortion is given in Lemma 3.

Lemma 3: The frame-level propagation and clipping caused distortion in the $k$-th frame is

$D^k(P) = D^{k-1} \cdot \bar{P}^k(r) + D^k(p) \cdot (1 - \bar{P}^k(r))(1 - \beta^k),$   (40)

where $\bar{P}^k(r)$ is defined in (28); $\beta^k$ is the percentage of I-MBs in the $k$-th frame; $D^{k-1}$ is the transmission distortion of the $(k-1)$-th frame.

Lemma 3 is proved in Appendix F. Define the propagation factor for the $k$-th frame as $\alpha^k \triangleq \frac{D^k(p)}{D^{k-1}}$; then we have $\alpha^k = \frac{1}{|\mathcal{V}| \, D^{k-1}} \sum_{\mathbf{u}\in\mathcal{V}^k} \alpha_{\mathbf{u}}^k \cdot D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$. As explained in Appendix F, when the number of pixels in the $(k-1)$-th frame is sufficiently large, the average of $D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ over all the pixels in the $(k-1)$-th frame will converge to $D^{k-1}$ due to the randomness of $mv_{\mathbf{u}}^k$. Therefore, we have $\alpha^k \approx \frac{\sum_{\mathbf{u}\in\mathcal{V}^k} \alpha_{\mathbf{u}}^k \cdot D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}}{\sum_{\mathbf{u}\in\mathcal{V}^k} D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}}$, which is a weighted average of $\alpha_{\mathbf{u}}^k$ with the weight being $D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$. As a result, $D^k(p) \le D^k(P)$ with high probability.⁹ However, most existing works directly use $D^k(P) = D^k(p)$ in predicting transmission distortion. This is another reason why LTI models [7], [8] underestimate transmission distortion when there is no MV error.

E. Analysis of Correlation Caused Distortion

In this subsection, we first derive the pixel-level correlation caused distortion $D_{\mathbf{u}}^k(c)$. Then, we derive the frame-level correlation caused distortion $D^k(c)$.

1) Pixel-level Correlation Caused Distortion: We analyze the correlation caused distortion $D_{\mathbf{u}}^k(c)$ at the decoder in four different cases: i) for $\mathbf{u}^k\{\bar{r}, \bar{m}\}$, both $\tilde{\varepsilon}_{\mathbf{u}}^k = 0$ and $\tilde{\xi}_{\mathbf{u}}^k = 0$, so $D_{\mathbf{u}}^k(c) = 0$; ii) for $\mathbf{u}^k\{r, \bar{m}\}$, $\tilde{\xi}_{\mathbf{u}}^k = 0$ and $D_{\mathbf{u}}^k(c) = 2E[\varepsilon_{\mathbf{u}}^k \cdot (\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k\{r, \bar{m}\})]$; iii) for $\mathbf{u}^k\{\bar{r}, m\}$, $\tilde{\varepsilon}_{\mathbf{u}}^k = 0$ and $D_{\mathbf{u}}^k(c) = 2E[\xi_{\mathbf{u}}^k \cdot (\tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k\{\bar{r}, m\})]$; iv) for $\mathbf{u}^k\{r, m\}$, $D_{\mathbf{u}}^k(c) = 2E[\varepsilon_{\mathbf{u}}^k \cdot \xi_{\mathbf{u}}^k] + 2E[\varepsilon_{\mathbf{u}}^k \cdot (\tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k\{r, m\})] + 2E[\xi_{\mathbf{u}}^k \cdot (\tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k\{r, m\})]$. From Section III-D1, we know $\tilde{\Delta}_{\mathbf{u}}^k\{r\} = 0$. So, we obtain

$D_{\mathbf{u}}^k(c) = P_{\mathbf{u}}^k\{r, \bar{m}\} \cdot 2E[\varepsilon_{\mathbf{u}}^k \cdot \tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}] + P_{\mathbf{u}}^k\{\bar{r}, m\} \cdot 2E[\xi_{\mathbf{u}}^k \cdot (\tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1} + \tilde{\Delta}_{\mathbf{u}}^k\{\bar{r}, m\})] + P_{\mathbf{u}}^k\{r, m\} \cdot (2E[\varepsilon_{\mathbf{u}}^k \cdot \xi_{\mathbf{u}}^k] + 2E[\varepsilon_{\mathbf{u}}^k \cdot \tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1}] + 2E[\xi_{\mathbf{u}}^k \cdot \tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1}]).$   (41)

In our experiments, we find that in the trajectory of pixel $\mathbf{u}^k$: 1) the residual $\hat{e}_{\mathbf{u}}^k$ is almost uncorrelated with the residual in all other frames $\hat{e}_{\mathbf{v}}^i$ where $i \ne k$, i.e., their correlation coefficient is almost zero, as shown in Fig. 3;¹⁰ and 2) the residual $\hat{e}_{\mathbf{u}}^k$ is also almost uncorrelated with the MVCE of the corresponding pixel, i.e., $\xi_{\mathbf{u}}^k$, and with the MVCE in all previous frames, i.e., $\xi_{\mathbf{v}}^i$ where $i < k$, as shown in Fig. 4. Based on the above observations, we further assume that for any $i < k$, $\hat{e}_{\mathbf{u}}^k$ is uncorrelated with $\hat{e}_{\mathbf{v}}^i$ and $\xi_{\mathbf{v}}^i$ if $\mathbf{v}^i$ is not in the trajectory of pixel $\mathbf{u}^k$, and we make the following assumption.

Assumption 5: $\hat{e}_{\mathbf{u}}^k$ is uncorrelated with $\xi_{\mathbf{u}}^k$, and is uncorrelated with both $\hat{e}_{\mathbf{v}}^i$ and $\xi_{\mathbf{v}}^i$ for any $i < k$.

Since $\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ and $\tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1}$ are the transmission reconstructed errors accumulated from all the frames before the $k$-th frame, $\varepsilon_{\mathbf{u}}^k$ is uncorrelated with $\tilde{\zeta}_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ and $\tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1}$ due to Assumption 5. Thus, (41) becomes

$D_{\mathbf{u}}^k(c) = 2P_{\mathbf{u}}^k\{m\} \cdot E[\xi_{\mathbf{u}}^k \cdot \tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1}] + 2P_{\mathbf{u}}^k\{\bar{r}, m\} \cdot E[\xi_{\mathbf{u}}^k \cdot \tilde{\Delta}_{\mathbf{u}}^k\{\bar{r}, m\}].$   (42)

⁹When the number of reference pixels in the $(k-1)$-th frame is small, $\frac{1}{|\mathcal{V}|}\sum_{\mathbf{u}\in\mathcal{V}^k} \alpha_{\mathbf{u}}^k \cdot D_{\mathbf{u}+mv_{\mathbf{u}}^k}^{k-1}$ may be larger than $D^{k-1}$ in case the reference pixels with high distortion are used more often than the reference pixels with low distortion.

¹⁰Fig. 3, Fig. 4, Fig. 5 and Fig. 6 are plotted for a low motion sequence, e.g., 'foreman', and a high motion sequence, e.g., 'stefan', in cif format. All other sequences show similar statistics.

[Fig. 3. Temporal correlation between residuals in one trajectory: correlation coefficient versus frame indices for (a) foreman-cif and (b) stefan-cif.]

However, we observe that in the trajectory of pixel $\mathbf{u}^k$: 1) $\hat{e}_{\mathbf{u}}^k$ is correlated with $\xi_{\mathbf{v}}^i$ when $i > k$, and especially when $i = k + 1$ there are peaks, as seen in Fig. 4; and 2) $\xi_{\mathbf{u}}^k$ is highly correlated with $\xi_{\mathbf{v}}^i$, as shown in Fig. 5. These interesting statistical relationships could be exploited by an error concealment algorithm, e.g., finding a concealed MV for pixel $\mathbf{v}^i$ with proper $\xi_{\mathbf{v}}^i$ given $\hat{e}_{\mathbf{u}}^k$ or $\xi_{\mathbf{u}}^k$; this is subject to our future study.

As mentioned in Section III-D1, for P-MBs with SDP in H.264/AVC, $P_{\mathbf{u}}^k\{\bar{r}, m\} = 0$. So, (42) becomes

$D_{\mathbf{u}}^k(c) = 2P_{\mathbf{u}}^k\{m\} \cdot E[\xi_{\mathbf{u}}^k \cdot (\hat{f}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1} - \tilde{f}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1})].$   (43)

Note that in the more general case where $P_{\mathbf{u}}^k\{\bar{r}, m\} \ne 0$, Eq. (43) can be used as an approximation since in (42), $E[\xi_{\mathbf{u}}^k \cdot \tilde{\Delta}_{\mathbf{u}}^k\{\bar{r}, m\}]$ is much smaller than $E[\xi_{\mathbf{u}}^k \cdot \tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1}]$ and $P_{\mathbf{u}}^k\{\bar{r}, m\}$ is much smaller than $P_{\mathbf{u}}^k\{m\}$.

For MBs without SDP, since $P_{\mathbf{u}}^k\{r, \bar{m}\} = P_{\mathbf{u}}^k\{\bar{r}, m\} = 0$ and $P_{\mathbf{u}}^k\{r, m\} = P_{\mathbf{u}}^k\{r\} = P_{\mathbf{u}}^k\{m\} = P_{\mathbf{u}}^k$ as mentioned in Section III-D1, (41) can be simplified to

$D_{\mathbf{u}}^k(c) = P_{\mathbf{u}}^k \cdot (2E[\varepsilon_{\mathbf{u}}^k \cdot \xi_{\mathbf{u}}^k] + 2E[\varepsilon_{\mathbf{u}}^k \cdot \tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1}] + 2E[\xi_{\mathbf{u}}^k \cdot \tilde{\zeta}_{\mathbf{u}+\check{mv}_{\mathbf{u}}^k}^{k-1}]).$   (44)

Under Assumption 5, (44) reduces to (43).


Fig. 4. Temporal correlation between residual and concealment error in one trajectory (correlation coefficient vs. residual frame index and MV frame index): (a) foreman-cif, (b) stefan-cif.

Define $\lambda_u^k \triangleq \frac{E[\xi_u^k \cdot \tilde{f}^{k-1}_{u+\check{mv}_u^k}]}{E[\xi_u^k \cdot \hat{f}^{k-1}_{u+\check{mv}_u^k}]}$; $\lambda_u^k$ is a correlation ratio, that is, the ratio of the correlation between the MVCE and the concealed reference pixel value at the decoder to the correlation between the MVCE and the concealed reference pixel value at the encoder. $\lambda_u^k$ quantifies the effect of the correlation between the MVCE and the propagated error on transmission distortion.
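In practice, the two expectations in the definition of $\lambda_u^k$ can be replaced by sample means over a set of co-located pixels. The following is a minimal sketch of such an estimator under that assumption; the three input arrays (MVCE samples and the concealed reference pixel values at the decoder and at the encoder) are hypothetical inputs to be supplied by an instrumented codec.

```python
import numpy as np

def estimate_lambda(xi, f_dec, f_enc):
    """Sample-mean estimate of the correlation ratio lambda.

    xi    : MVCE samples xi_u^k
    f_dec : concealed reference pixel values at the decoder
    f_enc : concealed reference pixel values at the encoder
    Returns E[xi * f_dec] / E[xi * f_enc] with expectations
    replaced by averages over the supplied pixel set.
    """
    xi = np.asarray(xi, dtype=float)
    num = np.mean(xi * np.asarray(f_dec, dtype=float))
    den = np.mean(xi * np.asarray(f_enc, dtype=float))
    return num / den
```

Since the decoder-side values are not available at the encoder, an estimator of this form is only usable offline; online prediction instead relies on the bounds in (45) below.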

Note that although we do not know the exact value of $\lambda_u^k$ at the encoder, its range is characterized by the XEP of all pixels in the trajectory $T$ that passes through the pixel $u^k$, as

$$\prod_{i=1}^{k-1} P^i_{T(i)}\{\bar{r},\bar{m}\} \leq \lambda_u^k \leq 1, \qquad (45)$$

where $T(i)$ is the reference pixel position in the $i$-th frame for the trajectory $T$. For example, $T(k-1) = u + mv_u^k$ and $T(k-2) = T(k-1) + mv^{k-1}_{T(k-1)}$. The left inequality in (45) holds in the extreme case that any error in the trajectory causes $\xi_u^k$ and $\tilde{f}^{k-1}_{u+\check{mv}_u^k}$ to be uncorrelated, which is usually true for high motion video. The right inequality in (45) holds in the other extreme case that no error in the trajectory affects the correlation between $\xi_u^k$ and $\tilde{f}^{k-1}_{u+\check{mv}_u^k}$, that is, $E[\xi_u^k \cdot \tilde{f}^{k-1}_{u+\check{mv}_u^k}] \approx E[\xi_u^k \cdot \hat{f}^{k-1}_{u+\check{mv}_u^k}]$, which is usually true for low motion video. The details on how to estimate $\lambda_u^k$ are presented in the sequel paper [19].
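As a rough illustration of (45), the lower bound is simply the probability that the whole trajectory is error-free; a minimal sketch, assuming the per-frame XEPs along the trajectory are known, could read:

```python
import numpy as np

def lambda_bounds(p_error_along_trajectory):
    """Bounds on lambda_u^k from (45).

    p_error_along_trajectory: per-frame probabilities that the
    trajectory's reference pixel in frame i suffers a residual or
    MV error, so 1 - p is P^i_{T(i)}{no residual, no MV error}.
    """
    p = np.asarray(p_error_along_trajectory, dtype=float)
    lower = float(np.prod(1.0 - p))
    return lower, 1.0

print(lambda_bounds([0.01] * 20))  # 20-frame trajectory at 1% XEP:
                                   # roughly (0.818, 1.0)
```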

Fig. 5. Temporal correlation between concealment errors in one trajectory (correlation coefficient vs. frame-index pair): (a) foreman-cif, (b) stefan-cif.

Using the definition of $\lambda_u^k$, we have the following lemma.

Lemma 4:

$$D_u^k(c) = (\lambda_u^k - 1) \cdot D_u^k(m). \qquad (46)$$

Lemma 4 is proved in Appendix G.

If we assume $E[\xi_u^k] = 0$, we may further derive the correlation coefficient between $\xi_u^k$ and $\hat{f}^{k-1}_{u+\check{mv}_u^k}$. Denoting their correlation coefficient by $\rho$, from (70) we have

$$\rho = \frac{E[\xi_u^k \cdot \hat{f}^{k-1}_{u+\check{mv}_u^k}] - E[\xi_u^k] \cdot E[\hat{f}^{k-1}_{u+\check{mv}_u^k}]}{\sigma_{\xi_u^k} \cdot \sigma_{\hat{f}_u^k}} = -\frac{E[(\xi_u^k)^2]}{2 \sigma_{\xi_u^k} \cdot \sigma_{\hat{f}_u^k}} = -\frac{\sigma_{\xi_u^k}}{2 \sigma_{\hat{f}_u^k}}. \qquad (47)$$

Similarly, it is easy to prove that the correlation coefficient between $\xi_u^k$ and $\hat{f}^{k-1}_{u+mv_u^k}$ is $\frac{\sigma_{\xi_u^k}}{2\sigma_{\hat{f}_u^k}}$. This agrees well with the experimental results shown in Fig. 6. Via the same derivation process, one can obtain the correlation coefficient between $\hat{e}_u^k$ and $\hat{f}^{k-1}_{u+mv_u^k}$, and between $\hat{e}_u^k$ and $\hat{f}_u^k$. One possible application of these correlation properties is error concealment with partial information available.
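The derivation of (47) uses only $E[\xi_u^k] = 0$ and the equal second moments assumed in Appendix G, so it can be checked against any stationary surrogate signal. Below is a small Monte Carlo sketch; the smoothed-noise "pixel row" and the offset d standing in for the MV/concealed-MV displacement are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200_000, 2   # d: illustrative offset between mv and concealed mv

# Illustrative stationary "pixel row": smoothed noise around mid-gray.
row = 128 + np.convolve(rng.normal(0, 20, n + d), np.ones(8) / 8, 'same')
f_mv, f_cmv = row[d:], row[:-d]   # equal second moments by stationarity
xi = f_mv - f_cmv                 # MVCE as a pixel-value difference

measured = np.corrcoef(xi, f_cmv)[0, 1]
predicted = -np.std(xi) / (2 * np.std(f_cmv))
print(measured, predicted)        # the two values agree closely
```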


Fig. 6. Comparison between measured and estimated correlation coefficients of $\xi_u^k$ with $\hat{f}^{k-1}_{u+mv_u^k}$ and with $\hat{f}^{k-1}_{u+\check{mv}_u^k}$: (a) foreman-cif, (b) stefan-cif.

2) Frame-Level Correlation Caused Distortion: Denote $V_i^k\{m\}$ the set of pixels in the $i$-th MV packet of the $k$-th frame. From (19), (71) and Assumption 4, we obtain

$$D^k(c) = \frac{E[(\xi^k)^2]}{|V|} \sum_{u \in V^k} (\lambda_u^k - 1) \cdot P_u^k(m) = \frac{E[(\xi^k)^2]}{|V|} \sum_{i=1}^{N^k(m)} \Big\{ P_i^k(m) \sum_{u \in V_i^k\{m\}} (\lambda_u^k - 1) \Big\}. \qquad (48)$$

Define $\lambda^k \triangleq \frac{1}{|V|}\sum_{u \in V^k} \lambda_u^k$; $\frac{1}{N_i^k(m)}\sum_{u \in V_i^k\{m\}} \lambda_u^k$ will converge to $\lambda^k$ for any packet that contains a sufficiently large number of pixels. By rearranging (48), we obtain

$$D^k(c) = \frac{E[(\xi^k)^2]}{|V|} \sum_{i=1}^{N^k(m)} \left\{ P_i^k(m) \cdot N_i^k(m) \cdot (\lambda^k - 1) \right\} = (\lambda^k - 1) \cdot E[(\xi^k)^2] \cdot \bar{P}^k(m). \qquad (49)$$

From (32), we know that $E[(\xi^k)^2] \cdot \bar{P}^k(m)$ is exactly equal to $D^k(m)$. Therefore, (49) is further simplified to

$$D^k(c) = (\lambda^k - 1) \cdot D^k(m). \qquad (50)$$
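A short sketch of the aggregation in (48)-(50), assuming the per-packet MV-loss probabilities and per-pixel $\lambda$ values are available (the packetization itself is illustrative):

```python
import numpy as np

def frame_correlation_distortion(E_xi2, packet_p, packet_lambdas, V):
    """Frame-level correlation-caused distortion D^k(c) via (48).

    E_xi2          : frame-level MVCE power E[(xi^k)^2]
    packet_p       : per-packet MV error probabilities P_i^k(m)
    packet_lambdas : one array of lambda_u^k values per packet
    V              : number of pixels in the frame, |V|
    """
    acc = sum(p * np.sum(np.asarray(lams) - 1.0)
              for p, lams in zip(packet_p, packet_lambdas))
    return E_xi2 / V * acc
```

When every packet is large enough for its mean $\lambda$ to approach $\lambda^k$, this reduces to $(\lambda^k - 1) \cdot E[(\xi^k)^2] \cdot \bar{P}^k(m)$, i.e., to (50).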

F. Summary

In Section III-A, we decomposed transmission distortion into four terms, and we derived a formula for each term in Sections III-B through III-E. In this section, we combine the formulae for the four terms into a single formula.

1) Pixel-Level Transmission Distortion:

Theorem 1: Under single-reference motion compensation, the PTD of pixel $u^k$ is

$$D_u^k = D_u^k(r) + \lambda_u^k \cdot D_u^k(m) + P_u^k\{r,m\} \cdot D^{k-1}_{u+\check{mv}_u^k} + P_u^k\{r,\bar{m}\} \cdot D^{k-1}_{u+mv_u^k} + P_u^k\{\bar{r}\} \cdot \alpha_u^k \cdot D^{k-1}_{u+mv_u^k}. \qquad (51)$$

Proof: (51) can be obtained by plugging (23), (31), (34), and (71) into (14).

Corollary 2: Under single-reference motion compensation and no SDP, (51) is simplified to

$$D_u^k = P_u^k \cdot \left(E[(\varepsilon_u^k)^2] + \lambda_u^k \cdot E[(\xi_u^k)^2] + D^{k-1}_{u+\check{mv}_u^k}\right) + (1 - P_u^k) \cdot \alpha_u^k \cdot D^{k-1}_{u+mv_u^k}. \qquad (52)$$
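Corollary 2 is directly implementable as a per-pixel forward recursion. A minimal sketch (no SDP) follows, with every input assumed to be supplied by the encoder's statistics-gathering stage:

```python
def pixel_ptd(P, E_eps2, lam, E_xi2, D_prev_cmv, alpha, D_prev_mv):
    """Pixel-level transmission distortion via (52), no SDP.

    P          : pixel error probability P_u^k
    E_eps2     : residual concealment error power E[(eps_u^k)^2]
    lam        : correlation ratio lambda_u^k
    E_xi2      : MV concealment error power E[(xi_u^k)^2]
    D_prev_cmv : distortion of the concealed reference pixel, frame k-1
    alpha      : propagation (clipping decay) factor alpha_u^k
    D_prev_mv  : distortion of the true reference pixel, frame k-1
    """
    return (P * (E_eps2 + lam * E_xi2 + D_prev_cmv)
            + (1.0 - P) * alpha * D_prev_mv)
```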

2) Frame-Level Transmission Distortion:

Theorem 2: Under single-reference motion compensation, the FTD of the $k$-th frame is

$$D^k = D^k(r) + \lambda^k \cdot D^k(m) + \bar{P}^k(r) \cdot D^{k-1} + (1 - \bar{P}^k(r)) \cdot D^k(p) \cdot (1 - \beta^k). \qquad (53)$$

Proof: (53) can be obtained by plugging (29), (32), (40) and (50) into (15).

Corollary 3: Under single-reference motion compensation and no SDP, the FTD of the $k$-th frame is simplified to

$$D^k = \bar{P}^k \cdot \left(E[(\varepsilon^k)^2] + \lambda^k \cdot E[(\xi^k)^2] + D^{k-1}\right) + (1 - \bar{P}^k) \cdot \alpha^k \cdot D^{k-1} \cdot (1 - \beta^k). \qquad (54)$$
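At the frame level, (54) is a one-line recursion that can be rolled forward over a GOP from an error-free intra frame ($D^0 = 0$). The sketch below is an illustration only; the per-frame statistics fed to it are assumed known, and the numbers in the example are made up.

```python
def gop_ftd(P_bar, E_eps2, lam, E_xi2, alpha, beta):
    """Frame-level transmission distortion over a GOP via (54), no SDP.

    All arguments are equal-length per-frame sequences: frame-average
    PEP P_bar^k, residual power E[(eps^k)^2], correlation ratio
    lambda^k, MVCE power E[(xi^k)^2], propagation factor alpha^k,
    and the frame-level factor beta^k defined in Section III.
    Returns the list of predicted D^k, starting from D^0 = 0.
    """
    D, out = 0.0, []
    for P, e2, lm, x2, a, b in zip(P_bar, E_eps2, lam, E_xi2, alpha, beta):
        D = P * (e2 + lm * x2 + D) + (1.0 - P) * a * D * (1.0 - b)
        out.append(D)
    return out

# Example: 30 frames, 2% frame-average PEP, mild clipping decay.
print(gop_ftd([0.02] * 30, [20.0] * 30, [0.8] * 30,
              [50.0] * 30, [0.95] * 30, [0.0] * 30)[-1])
```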

Following the same derivation process, it is not difficult to obtain the distortion prediction formulae for the multi-reference case. Due to space limitations, in this paper we present only the formulae for distortion estimation in the single-reference case. Interested readers may refer to Ref. [22] for the analysis of the multi-reference case. In Ref. [22], we also identify the relationship between our result and existing models, and specify the conditions under which those models are accurate.

IV. CONCLUSION

In this paper, we derived transmission distortion formulae for wireless video communication systems. By taking into account spatio-temporal correlation, the nonlinear codec and time-varying channels, our distortion prediction formulae improve the accuracy of distortion estimation over existing works. Besides that, our formulae support, for the first time, the following capabilities: 1) prediction at different levels (e.g., pixel/frame/GOP level), 2) prediction for multi-reference motion compensation, 3) prediction under SDP, 4) prediction under arbitrary slice-level packetization with the FMO mechanism, 5) prediction under time-varying channels, 6) one unified formula for both I-MBs and P-MBs, and 7) prediction for both low motion and high motion video sequences. In addition, this paper also identified two important properties of transmission distortion for the first time: 1) clipping noise, produced by non-linear clipping, causes decay of propagated error; 2) the correlation between motion vector concealment error and propagated error is negative, and has a dominant impact on transmission distortion among all the correlations between any two of the four components of the transmission error.

In the sequel paper [19], we use the formulae derived in this paper to design algorithms for estimating pixel-level and frame-level transmission distortion and apply the algorithms to video codec design; we also verify the accuracy of the formulae derived in this paper through experiments. The application of these formulae shows superior performance over existing models.

ACKNOWLEDGMENTS

This work was supported in part by an Intel gift and the US National Science Foundation under grants CNS-0643731 and ECCS-1002214. The authors would like to thank Jun Xu and Qian Chen for many fruitful discussions related to this work and suggestions that helped to improve the presentation of this paper. The authors would also like to thank the anonymous reviewers for their valuable comments, which improved the quality of this paper.

APPENDIX

A. Proof of Lemma 1

Proof: From (10) and (12), we obtain $\tilde{f}^{k-1}_{u+\widetilde{mv}_u^k} + \tilde{e}_u^k = \hat{f}_u^k - \tilde{\xi}_u^k - \tilde{\varepsilon}_u^k - \tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k}$. Together with (8), we obtain

$$\tilde{\Delta}_u^k = (\hat{f}_u^k - \tilde{\xi}_u^k - \tilde{\varepsilon}_u^k - \tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k}) - \Gamma(\hat{f}_u^k - \tilde{\xi}_u^k - \tilde{\varepsilon}_u^k - \tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k}). \qquad (55)$$

So, $\tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k} + \tilde{\Delta}_u^k = (\hat{f}_u^k - \tilde{\xi}_u^k - \tilde{\varepsilon}_u^k) - \Gamma(\hat{f}_u^k - \tilde{\xi}_u^k - \tilde{\varepsilon}_u^k - \tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k})$, and

$$D_u^k(P) = E[(\tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k} + \tilde{\Delta}_u^k)^2] = E[\Phi^2(\tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k}, \hat{f}_u^k - \tilde{\xi}_u^k - \tilde{\varepsilon}_u^k)]. \qquad (56)$$

We know from the definition that $D_u^k(p)$ is a special case of $D_u^k(P)$ under the condition $\{\bar{r},\bar{m}\}$, which means $\tilde{e}_u^k = \hat{e}_u^k$, i.e., $\tilde{\varepsilon}_u^k = 0$, and $\widetilde{mv}_u^k = mv_u^k$, i.e., $\tilde{\xi}_u^k = 0$. Therefore, we obtain

$$D_u^k(p) = E[\Phi^2(\tilde{\zeta}^{k-1}_{u+mv_u^k}, \hat{f}_u^k)]. \qquad (57)$$

B. Proof of Proposition 1

Proof: The probability density function of a random variable having a Laplacian distribution is $f(x|\mu,b) = \frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right)$. Since $\mu = 0$, we have $E[x^2] = 2b^2$, and from (37), we obtain

$$E[x^2] - E[\Phi^2(x,y)] = \int_{y-\gamma_l}^{+\infty} \left(x^2 - (y-\gamma_l)^2\right) \frac{1}{2b} e^{-\frac{x}{b}} \, dx + \int_{-\infty}^{y-\gamma_h} \left(x^2 - (y-\gamma_h)^2\right) \frac{1}{2b} e^{\frac{x}{b}} \, dx = e^{-\frac{y-\gamma_l}{b}}\left((y-\gamma_l) \cdot b + b^2\right) + e^{-\frac{\gamma_h-y}{b}}\left((\gamma_h-y) \cdot b + b^2\right). \qquad (58)$$

From the definition of the propagation factor, we obtain

$$\alpha = \frac{E[\Phi^2(x,y)]}{E[x^2]} = 1 - \frac{1}{2} e^{-\frac{y-\gamma_l}{b}}\left(\frac{y-\gamma_l}{b} + 1\right) - \frac{1}{2} e^{-\frac{\gamma_h-y}{b}}\left(\frac{\gamma_h-y}{b} + 1\right).$$

Fig. 7. Comparison of $\Phi^2(x,y)$ and $x^2$ for $y = 100$, $\gamma_H = 255$, $\gamma_L = 0$.
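Proposition 1's closed form is easy to sanity-check against a direct Monte Carlo average of $\Phi^2$, using the form $\Phi(x,y) = y - \Gamma(y - x)$ implied by the proof of Lemma 1. A sketch, with an illustrative operating point near the lower clipping bound where the decay is pronounced:

```python
import numpy as np

def alpha_closed_form(y, b, gl=0.0, gh=255.0):
    """Propagation factor alpha from Proposition 1 for a zero-mean
    Laplacian propagated error with scale b and pixel value y."""
    t_l, t_h = (y - gl) / b, (gh - y) / b
    return (1.0 - 0.5 * np.exp(-t_l) * (t_l + 1.0)
                - 0.5 * np.exp(-t_h) * (t_h + 1.0))

def alpha_monte_carlo(y, b, gl=0.0, gh=255.0, n=10**6, seed=0):
    x = np.random.default_rng(seed).laplace(0.0, b, n)
    phi = y - np.clip(y - x, gl, gh)   # Phi(x, y): error left after clipping
    return np.mean(phi**2) / np.mean(x**2)

# Near the lower bound, clipping removes a large share of the error.
print(alpha_closed_form(10.0, 10.0), alpha_monte_carlo(10.0, 10.0))
# both values are close to 1 - exp(-1), roughly 0.632
```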

C. Proof of Lemma 2

Proof: From the definition in (37), we obtain

$$\Phi^2(x,y) - x^2 = \begin{cases} (y-\gamma_l)^2 - x^2, & x > y-\gamma_l \\ 0, & y-\gamma_h \leq x \leq y-\gamma_l \\ (y-\gamma_h)^2 - x^2, & x < y-\gamma_h. \end{cases} \qquad (59)$$

Since $y \geq \gamma_l$, we obtain $(y-\gamma_l)^2 < x^2$ when $x > y-\gamma_l$. Similarly, since $y \leq \gamma_h$, we obtain $(y-\gamma_h)^2 < x^2$ when $x < y-\gamma_h$. Therefore, $\Phi^2(x,y) - x^2 \leq 0$ for $\gamma_l \leq y \leq \gamma_h$. Fig. 7 shows a pictorial example of the case where $\gamma_h = 255$, $\gamma_l = 0$ and $y = 100$.
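Lemma 2 states that clipping can only shrink the propagated error whenever the pixel value is within the dynamic range; a quick numerical spot-check of this inequality, again with $\Phi(x,y) = y - \Gamma(y-x)$:

```python
import numpy as np

gl, gh = 0.0, 255.0
rng = np.random.default_rng(2)
x = rng.uniform(-300.0, 300.0, 10_000)  # propagated error samples
y = rng.uniform(gl, gh, 10_000)         # in-range reconstructed pixels
phi2 = (y - np.clip(y - x, gl, gh)) ** 2
assert np.all(phi2 <= x**2 + 1e-9)      # Phi^2(x, y) <= x^2 per Lemma 2
print("Lemma 2 inequality holds on all samples")
```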

D. Proof of Corollary 1

Proof: From (55), we obtain $\tilde{\Delta}_u^k\{\bar{p}\} = (\hat{f}_u^k - \tilde{\xi}_u^k - \tilde{\varepsilon}_u^k) - \Gamma(\hat{f}_u^k - \tilde{\xi}_u^k - \tilde{\varepsilon}_u^k)$. Together with Lemma 5, which is presented and proved in Appendix E, we have $\gamma_l \leq \hat{f}_u^k - \tilde{\xi}_u^k - \tilde{\varepsilon}_u^k \leq \gamma_h$. From Lemma 2, we have $\Phi^2(x,y) \leq x^2$ for any $\gamma_l \leq y \leq \gamma_h$; therefore, $E[\Phi^2(\tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k}, \hat{f}_u^k - \tilde{\xi}_u^k - \tilde{\varepsilon}_u^k)] \leq E[(\tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k})^2]$. Together with (56), it is straightforward to prove that

$$E[(\tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k} + \tilde{\Delta}_u^k)^2] \leq E[(\tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k})^2]. \qquad (60)$$

By expanding $E[(\tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k} + \tilde{\Delta}_u^k)^2]$, we obtain

$$E[\tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k} \cdot \tilde{\Delta}_u^k] \leq -\frac{1}{2} E[(\tilde{\Delta}_u^k)^2] \leq 0. \qquad (61)$$

The physical meaning of (61) is that $\tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k}$ and $\tilde{\Delta}_u^k$ are negatively correlated if $\tilde{\Delta}_u^k \neq 0$. Since $\tilde{\Delta}_u^k\{r\} = 0$ as noted in Section III-D1 and $\tilde{\Delta}_u^k\{\bar{p}\} = 0$ as proved in Lemma 5, we know that $\tilde{\Delta}_u^k \neq 0$ is possible only for the error events $\{\bar{r},m,p\}$ and $\{\bar{r},\bar{m},p\}$, and $\tilde{\Delta}_u^k = 0$ for any other error event. In other words, $\tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k}$ and $\tilde{\Delta}_u^k$ are negatively correlated under the condition $\{\bar{r},p\}$, and they are uncorrelated under other conditions.

E. Lemma 5 and Its Proof

Before presenting the proof, we first give the definition of an ideal codec.

Definition 1 (Ideal Codec): Both the true MV and the concealed MV are within the search range, and the position pointed to by the true MV, i.e., $u + mv_u^k$, is the best reference pixel, under the MMSE criterion, for pixel $u^k$ within the whole search range $V^{k-1}_{SR}$, that is, $v = \arg\min_{v \in V^{k-1}_{SR}} \{(\hat{f}_u^k - \hat{f}_v^{k-1})^2\}$.

To prove Corollary 1, we need the following lemma.

Lemma 5: In an ideal codec, $\tilde{\Delta}_u^k\{\bar{p}\} = 0$. In other words, if there is no propagated error, the clipping noise for the pixel $u^k$ at the decoder is always zero no matter what kind of error event occurs in the $k$-th frame.

Proof: In an ideal codec, we have $(\hat{e}_u^k)^2 = (\hat{f}_u^k - \hat{f}^{k-1}_{u+mv_u^k})^2 \leq (\hat{f}_u^k - \hat{f}^{k-1}_{u+\check{mv}_u^k})^2$. Due to the spatial and temporal continuity of natural video, we can prove by contradiction that in an ideal codec $\hat{f}_u^k - \hat{f}^{k-1}_{u+mv_u^k}$ and $\hat{f}_u^k - \hat{f}^{k-1}_{u+\check{mv}_u^k}$ have the same sign, that is, either

$$\hat{f}_u^k - \hat{f}^{k-1}_{u+\check{mv}_u^k} \geq \hat{e}_u^k \geq 0, \quad \text{or} \quad \hat{f}_u^k - \hat{f}^{k-1}_{u+\check{mv}_u^k} \leq \hat{e}_u^k \leq 0. \qquad (62)$$

If the signs of $\hat{f}_u^k - \hat{f}^{k-1}_{u+mv_u^k}$ and $\hat{f}_u^k - \hat{f}^{k-1}_{u+\check{mv}_u^k}$ differ, then due to the spatial and temporal continuity of the input video, there exists a better position $v \in V^{k-1}$ between $mv_u^k$ and $\check{mv}_u^k$, and therefore within the search range, such that $(\hat{e}_u^k)^2 \geq (\hat{f}_u^k - \hat{f}_v^{k-1})^2$. In this case, the encoder would choose $v$ as the best reference pixel within the search range. This contradicts the assumption that the best reference pixel within the search range is $u + mv_u^k$.

Therefore, from (62), we obtain

$$\hat{f}_u^k \geq \hat{f}^{k-1}_{u+\check{mv}_u^k} + \hat{e}_u^k \geq \hat{f}^{k-1}_{u+\check{mv}_u^k}, \quad \text{or} \quad \hat{f}_u^k \leq \hat{f}^{k-1}_{u+\check{mv}_u^k} + \hat{e}_u^k \leq \hat{f}^{k-1}_{u+\check{mv}_u^k}. \qquad (63)$$

Since both $\hat{f}_u^k$ and $\hat{f}^{k-1}_{u+\check{mv}_u^k}$ are reconstructed pixel values, they are within the range $\gamma_h \geq \hat{f}_u^k, \hat{f}^{k-1}_{u+\check{mv}_u^k} \geq \gamma_l$. From (63), we have $\gamma_h \geq \hat{f}^{k-1}_{u+\check{mv}_u^k} + \hat{e}_u^k \geq \gamma_l$, and thus $\Gamma(\hat{f}^{k-1}_{u+\check{mv}_u^k} + \hat{e}_u^k) = \hat{f}^{k-1}_{u+\check{mv}_u^k} + \hat{e}_u^k$. As a result, we obtain $\tilde{\Delta}_u^k\{\bar{r},m,\bar{p}\} = (\hat{f}^{k-1}_{u+\check{mv}_u^k} + \hat{e}_u^k) - \Gamma(\hat{f}^{k-1}_{u+\check{mv}_u^k} + \hat{e}_u^k) = 0$.

Since $\tilde{\Delta}_u^k\{\bar{r},\bar{m},\bar{p}\} = \hat{\Delta}_u^k = 0$, and from Section III-D1 we know that $\tilde{\Delta}_u^k\{r,\bar{p}\} = 0$, we obtain $\tilde{\Delta}_u^k\{\bar{p}\} = 0$.

Remark 1: Note that Lemma 5 is proved under the assumption of pixel-level motion estimation. In a practical encoder, block-level motion estimation is adopted, with the criterion of minimizing the MSE of the whole block, e.g., in H.263, or minimizing the cost of residual bits and MV bits, e.g., in H.264/AVC. Therefore, some reference pixels in the block may not be the best reference pixels within the search range. On the other hand, Rate Distortion Optimization (RDO) as used in H.264/AVC may also cause some reference pixels not to be the best reference pixels. However, the experimental results for all the test video sequences show that the probability of $\tilde{\Delta}_u^k\{\bar{r},m,\bar{p}\} \neq 0$ is negligible.

F. Proof of Lemma 3

Proof: For P-MBs with SDP, from (18) and (34) we obtain

$$D^k(P) = \frac{1}{|V|} \sum_{u \in V^k} \left(P_u^k\{r,m\} \cdot D^{k-1}_{u+\check{mv}_u^k}\right) + \frac{1}{|V|} \sum_{u \in V^k} \left(P_u^k\{r,\bar{m}\} \cdot D^{k-1}_{u+mv_u^k}\right) + \frac{1}{|V|} \sum_{u \in V^k} \left(P_u^k\{\bar{r}\} \cdot D_u^k(p)\right). \qquad (64)$$

Denote $V_i^k\{r,\bar{m}\}$ the set of pixels in the $k$-th frame with the same XEP $P_i^k\{r,\bar{m}\}$; denote $N_i^k\{r,\bar{m}\}$ the number of pixels in $V_i^k\{r,\bar{m}\}$; denote $N^k\{r,\bar{m}\}$ the number of sets with different XEP $P_i^k\{r,\bar{m}\}$ in the $k$-th frame. Although $D^{k-1}_{u+mv_u^k}$ may be very different for different pixels $u+mv_u^k$ in the $(k-1)$-th frame, e.g., under a fast fading channel with the FMO mechanism, for large $N_i^k\{r,\bar{m}\}$, $\frac{1}{N_i^k\{r,\bar{m}\}}\sum_{u \in V_i^k\{r,\bar{m}\}} D^{k-1}_{u+mv_u^k}$ converges to $D^{k-1}$ (see Footnote 11). Therefore,

$$\frac{1}{|V|} \sum_{u \in V^k} \left(P_u^k\{r,\bar{m}\} \cdot D^{k-1}_{u+mv_u^k}\right) = \frac{1}{|V|} \sum_{i=1}^{N^k\{r,\bar{m}\}} \Big(P_i^k\{r,\bar{m}\} \sum_{u \in V_i^k\{r,\bar{m}\}} D^{k-1}_{u+mv_u^k}\Big) = \frac{1}{|V|} \sum_{i=1}^{N^k\{r,\bar{m}\}} \left(P_i^k\{r,\bar{m}\} \cdot N_i^k\{r,\bar{m}\} \cdot D^{k-1}\right) = D^{k-1} \cdot \bar{P}^k\{r,\bar{m}\}, \qquad (65)$$

where $\bar{P}^k\{r,\bar{m}\} = \frac{1}{|V|}\sum_{i=1}^{N^k\{r,\bar{m}\}} \left(P_i^k\{r,\bar{m}\} \cdot N_i^k\{r,\bar{m}\}\right)$.

Following the same process, we obtain the first term on the right-hand side of (64) as $D^{k-1} \cdot \bar{P}^k\{r,m\}$, where $\bar{P}^k\{r,m\} = \frac{1}{|V|}\sum_{i=1}^{N^k\{r,m\}} \left(P_i^k\{r,m\} \cdot N_i^k\{r,m\}\right)$.

Footnote 11: According to the definition, for any given $u \in V^{k-1}$, $D_u^{k-1}$ is an expected value, that is, it is not a random variable. However, due to the randomness of $mv_u^k$, each pixel in the $(k-1)$-th frame can be used as a reference for multiple pixels in the $k$-th frame. In other words, $\frac{1}{N_i^k\{r,\bar{m}\}}\sum_{u \in V_i^k\{r,\bar{m}\}} D^{k-1}_{u+mv_u^k}$ can be described as simple random sampling with replacement (SRSWR) followed by taking the average. On the other hand, according to (6), $D^{k-1}$ is the mean of $D_u^{k-1}$ over all $u \in V^{k-1}$. Therefore, using Theorem 5.2.6 in Ref. [28], it is easy to prove that the expectation of $D^{k-1}_{u+mv_u^k}$ is exactly equal to $D^{k-1}$; and using Theorem 5.5.2 in Ref. [28], it is also easy to prove that $\frac{1}{N_i^k\{r,\bar{m}\}}\sum_{u \in V_i^k\{r,\bar{m}\}} D^{k-1}_{u+mv_u^k}$ converges in probability to $D^{k-1}$. Note again that the randomness of $D^{k-1}_{u+mv_u^k}$ is caused by $mv_u^k$.


In addition, we have

$$\frac{1}{|V|} \sum_{u \in V^k} \left(P_u^k\{\bar{r}\} \cdot D_u^k(p)\right) = \frac{1}{|V|} \sum_{i=1}^{N^k\{\bar{r}\}} \Big(P_i^k\{\bar{r}\} \sum_{u \in V_i^k\{\bar{r}\}} D_u^k(p)\Big). \qquad (66)$$

For large $N_i^k\{\bar{r}\}$, $\frac{1}{N_i^k\{\bar{r}\}}\sum_{u \in V_i^k\{\bar{r}\}} D_u^k(p)$ converges to $D^k(p)$, so the third term on the right-hand side of (64) is $D^k(p) \cdot (1 - \bar{P}^k(r))$.

Note that $P_i^k\{r,m\} + P_i^k\{r,\bar{m}\} = P_i^k\{r\}$ and $N_i^k\{r,m\} = N_i^k\{r,\bar{m}\}$. So, we obtain

$$D^k(P) = D^{k-1} \cdot \bar{P}^k(r) + D^k(p) \cdot (1 - \bar{P}^k(r)). \qquad (67)$$

For P-MBs without SDP, it is straightforward to obtain (67) from (35). For I-MBs, from (36), it is also easy to obtain $D^k(P) = D^{k-1} \cdot \bar{P}^k(r)$. So, together with (67), we obtain (40).

G. Proof of Lemma 4

Proof: Using the definition of $\lambda_u^k$, (43) becomes

$$D_u^k(c) = 2P_u^k\{m\} \cdot (1 - \lambda_u^k) \cdot E[\xi_u^k \cdot \hat{f}^{k-1}_{u+\check{mv}_u^k}]. \qquad (68)$$

Under the condition that the distance between $mv_u^k$ and $\check{mv}_u^k$ is small, for example, inside the same MB, the statistics of $\hat{f}^{k-1}_{u+\check{mv}_u^k}$ and $\hat{f}^{k-1}_{u+mv_u^k}$ are almost the same. Therefore, we may assume $E[(\hat{f}^{k-1}_{u+\check{mv}_u^k})^2] = E[(\hat{f}^{k-1}_{u+mv_u^k})^2]$.

Since $\xi_u^k = \hat{f}^{k-1}_{u+mv_u^k} - \hat{f}^{k-1}_{u+\check{mv}_u^k}$, we have

$$E[(\hat{f}^{k-1}_{u+\check{mv}_u^k})^2] = E[(\hat{f}^{k-1}_{u+mv_u^k})^2] = E[(\xi_u^k + \hat{f}^{k-1}_{u+\check{mv}_u^k})^2], \qquad (69)$$

and therefore

$$E[\xi_u^k \cdot \hat{f}^{k-1}_{u+\check{mv}_u^k}] = -\frac{E[(\xi_u^k)^2]}{2}. \qquad (70)$$

Note that following the same derivation process, we can prove $E[\xi_u^k \cdot \hat{f}^{k-1}_{u+mv_u^k}] = \frac{E[(\xi_u^k)^2]}{2}$.

Therefore, (68) can be simplified to

$$D_u^k(c) = (\lambda_u^k - 1) \cdot E[(\xi_u^k)^2] \cdot P_u^k(m). \qquad (71)$$

From (31), we know that $E[(\xi_u^k)^2] \cdot P_u^k(m)$ is exactly equal to $D_u^k(m)$. Therefore, (71) is further simplified to (46).

REFERENCES

[1] C. E. Shannon, "Coding theorems for a discrete source with a fidelity criterion," IRE Nat. Conv. Rec., Part 4, pp. 142-163, 1959.
[2] T. Berger and J. Gibson, "Lossy source coding," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2693-2723, 1998.
[3] R. Zhang, S. L. Regunathan, and K. Rose, "Video coding with optimal inter/intra-mode switching for packet loss resilience," IEEE Journal on Selected Areas in Communications, vol. 18, no. 6, pp. 966-976, Jun. 2000.
[4] T. Stockhammer, M. Hannuksela, and T. Wiegand, "H.264/AVC in wireless environments," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 657-673, 2003.
[5] T. Stockhammer, T. Wiegand, and S. Wenger, "Optimized transmission of H.26L/JVT coded video over packet-lossy networks," in IEEE ICIP, 2002.
[6] M. Sabir, R. Heath, and A. Bovik, "Joint source-channel distortion modeling for MPEG-4 video," IEEE Transactions on Image Processing, vol. 18, no. 1, pp. 90-105, 2009.
[7] K. Stuhlmuller, N. Farber, M. Link, and B. Girod, "Analysis of video transmission over lossy channels," IEEE Journal on Selected Areas in Communications, vol. 18, pp. 1012-1032, Jun. 2000.
[8] J. U. Dani, Z. He, and H. Xiong, "Transmission distortion modeling for wireless video communication," in Proceedings of IEEE Global Telecommunications Conference (GLOBECOM'05), 2005.
[9] Z. He, J. Cai, and C. W. Chen, "Joint source channel rate-distortion analysis for adaptive mode selection and rate control in wireless video coding," IEEE Transactions on Circuits and Systems for Video Technology, special issue on wireless video, vol. 12, pp. 511-523, Jun. 2002.
[10] Y. Wang, Z. Wu, and J. M. Boyce, "Modeling of transmission-loss-induced distortion in decoded video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 6, pp. 716-732, Jun. 2006.
[11] J. Chakareski, J. Apostolopoulos, W.-T. Tan, S. Wee, and B. Girod, "Distortion chains for predicting the video distortion for general packet loss patterns," in Proc. ICASSP, 2004.
[12] J. Chakareski, J. Apostolopoulos, S. Wee, W.-T. Tan, and B. Girod, "Rate-distortion hint tracks for adaptive video streaming," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 10, pp. 1257-1269, 2005.
[13] C. Zhang, H. Yang, S. Yu, and X. Yang, "GOP-level transmission distortion modeling for mobile streaming video," Signal Processing: Image Communication, 2007.
[14] M. T. Ivrlač, L. U. Choi, E. Steinbach, and J. A. Nossek, "Models and analysis of streaming video transmission over wireless fading channels," Signal Processing: Image Communication, vol. 24, no. 8, pp. 651-665, Sep. 2009.
[15] Y. J. Liang, J. G. Apostolopoulos, and B. Girod, "Analysis of packet loss for compressed video: Effect of burst losses and correlation between error frames," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 7, pp. 861-874, Jul. 2008.
[16] ITU-T Series H: Audiovisual and Multimedia Systems, Advanced video coding for generic audiovisual services, Nov. 2007.
[17] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, Jul. 2003.
[18] P. Lambert, W. De Neve, Y. Dhondt, and R. Van de Walle, "Flexible macroblock ordering in H.264/AVC," Journal of Visual Communication and Image Representation, vol. 17, no. 2, pp. 358-375, 2006.
[19] Z. Chen and D. Wu, "Prediction of transmission distortion for wireless video communication: Algorithm and application," Journal of Visual Communication and Image Representation, vol. 21, no. 8, pp. 948-964, 2010.
[20] Y. Wang and Q.-F. Zhu, "Error control and concealment for video communication: a review," Proceedings of the IEEE, vol. 86, no. 5, pp. 974-997, 1998.
[21] D. Agrafiotis, D. R. Bull, and C. N. Canagarajah, "Enhanced error concealment with mode selection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 8, pp. 960-973, Aug. 2006.
[22] Z. Chen and D. Wu, "Prediction of transmission distortion for wireless video communication: Part I: Analysis," 2010, http://www.wu.ece.ufl.edu/mypapers/journal-1.pdf.
[23] Z. Chen, P. Pahalawatta, A. M. Tourapis, and D. Wu, "The ERMPC algorithm for error resilient rate distortion optimization in video coding," IEEE Transactions on Circuits and Systems for Video Technology, 2011, accepted.
[24] B. Girod, "The efficiency of motion-compensating prediction for hybrid coding of video sequences," IEEE Journal on Selected Areas in Communications, vol. 5, no. 7, pp. 1140-1154, Aug. 1987.
[25] B. Girod, "Efficiency analysis of multihypothesis motion-compensated prediction for video coding," IEEE Transactions on Image Processing, vol. 9, no. 2, pp. 173-183, 2000.
[26] A. Leontaris and P. Cosman, "Compression efficiency and delay tradeoffs for hierarchical B-pictures and pulsed-quality frames," IEEE Transactions on Image Processing, vol. 16, no. 7, pp. 1726-1740, 2007.
[27] "H.264/AVC reference software JM14.0," May 2008. [Online]. Available: http://iphome.hhi.de/suehring/tml/download
[28] G. Casella and R. L. Berger, Statistical Inference, 2nd ed. Duxbury Press, 2001.


Zhifeng Chen received the Ph.D. degree in Electrical and Computer Engineering from the University of Florida, Gainesville, Florida, in 2010. He joined InterDigital Inc. in 2010, where he is currently a staff engineer working on video coding research.

Dapeng Wu (S'98-M'04-SM'06) received the Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University, Pittsburgh, PA, in 2003. Currently, he is a professor in the Electrical and Computer Engineering Department at the University of Florida, Gainesville, FL.

