
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 12, DECEMBER 2014 5519

Lossless Predictive Coding for Images With Bayesian Treatment

Jing Liu, Guangtao Zhai, Member, IEEE, Xiaokang Yang, Senior Member, IEEE, and Li Chen, Member, IEEE

Abstract—Adaptive predictors have long been used for lossless predictive coding of images. Most existing lossless predictive coding techniques mainly focus on the suitability of the prediction model for the training set, under the underlying assumption of local consistency, which may not hold well on object boundaries and can cause large predictive errors. In this paper, we propose a novel approach based on the assumption that local consistency and patch redundancy exist simultaneously in natural images. We derive a family of linear models and design a new algorithm to automatically select one suitable model for prediction. From the Bayesian perspective, the model with the maximum posterior probability is considered the best. Two types of model evidence are included in our algorithm. One is traditional training evidence, which represents a model's suitability for the current pixel under the assumption of local consistency. The other is target evidence, which is proposed to express the preference for different models from the perspective of patch redundancy. It is shown that the fusion of training evidence and target evidence jointly exploits the benefits of local consistency and patch redundancy. As a result, the proposed predictor is more suitable for natural images with textures and object boundaries. Comprehensive experiments demonstrate that the proposed predictor achieves higher efficiency than state-of-the-art lossless predictors.

Index Terms—Lossless image predictive coding, Bayesian method, local consistency, patch redundancy.

I. INTRODUCTION

IMAGE compression has been widely researched in the last few decades due to the popularity of electronic imaging devices. Lossy coded images, though easy to transmit, may destroy valuable information, such as fine details in medical images. Moreover, people nowadays are more inclined to have total control over image quality. Lossless image predictive coding has become the interest of many researchers in the past few years, especially since Rissanen et al. [1]–[3] established the intimate connection between prediction and universal source coding. The key of lossless predictive coding techniques, as well as of many other image processing techniques (e.g., super-resolution, denoising), is to accurately predict future pixels from past ones. In the literature, numerous prediction algorithms have been proposed to improve prediction accuracy. Yet most existing lossless predictors endeavor to exploit local correlation or redundancy of image structures as much as possible.

Manuscript received August 16, 2013; revised January 2, 2014 and August 8, 2014; accepted October 15, 2014. Date of publication October 29, 2014; date of current version November 17, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 61025005, Grant 61129001, Grant 61371146, Grant 61422112, Grant 61331014, and Grant 61102098, in part by the 973 Program under Grant 2010CB731401, in part by the Shanghai Municipal Science and Technology Commission under Grant 13511504500, in part by the 111 Program under Grant B07022, and in part by the Foundation for the Author of National Excellent Doctoral Dissertation of China under Grant 201339. (Corresponding author: Xiaokang Yang.)

The authors are with the Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2014.2365698

Linear predictors have been widely used in signal processing, especially in lossless compression [4], [5], due to their low computational complexity and high efficiency in decorrelating stationary Gaussian processes. Gradient adaptive prediction (GAP) used in CALIC [4] and edge-directed prediction (EDP) [6] are typical representatives of linear adaptive prediction schemes. These context-based methods estimate the coefficients of a linear model from causal neighbors, and significant improvements have been seen over fixed prediction schemes such as lossless JPEG [7]. However, the pre-determined model support and training set of CALIC and EDP limit their prediction performance. Some researchers moved one step further: Takeda et al. [8] used an adaptive ellipse-shaped support, with the major axis aligned with the direction of the local edge. Kervrann and Boulanger [9] adaptively chose the size of the support based on local image statistics. In [10], Wu et al. proposed an MDL-based sequential predictor that adapts to changing statistics with a locally tuned support shape and training set.

In spite of the superiority of CALIC, EDP, and the MDL-based sequential predictor, the information exploited from image textures is far from sufficient to meet the demand for even more efficient predictive coding techniques. The existing predictors prefer the model that minimizes the predictive residuals of the training set, under the underlying assumption of local consistency: it is believed that the predictive model suitable for the training set is also applicable to the current target. This assumption is usually far from being true when it comes to abrupt changes such as object boundaries. Obviously, the predictors break down when the assumption of local consistency fails. Consequently, researchers have searched deeper for more intrinsic natural image properties. The introduction of nonlocal means [11] opened the floodgate to the exploitation of patch redundancy. Patch redundancy assumes that most patches have similar 'replicates' in the image. This work has inspired many variants for image restoration with 'prediction' as the core [12]–[16], and lossless predictive coding has no reason to be the exception. However, as has been noticed by many researchers, patch redundancy-based methods tend to over-smooth fine details and are not suitable to be used alone.

1057-7149 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Fig. 1. Illustration of the prediction model. The support vector u_x is composed of pixels within the red windows. Three blue windows 'centered' on y1, y2 and y3 are three training samples. (Better refer to the electronic version for color notations.)

In this work, we propose an adaptive predictor by exploiting both local consistency and patch redundancy of natural images. Given a family of linear models, a novel method is proposed to adaptively select one proper model on a pixel-by-pixel basis from the Bayesian perspective. More specifically, the model with the maximum posterior probability is regarded as the best one. Two types of self-defined model evidence are involved to produce the posterior probability. They are called Training Evidence (TrE) and Target Evidence (TaE). TrE, defined as the likelihood function of the training samples, represents a model's suitability for the current pixel under the assumption of local consistency. The novel TaE expresses the preference for different models from the perspective of patch redundancy. With the combination of TrE and TaE, the predictor deals with various image structures more efficiently. Comprehensive experimental results show that the proposed lossless predictive coding algorithm achieves 0.0864 bit per pixel (bpp) lower entropy on average than the MDL-PAR model [10], which only considers TrE. Note that the basic assumption of the proposed algorithm holds not only for natural images but also for many other multi-dimensional signals. Therefore, the proposed algorithm is also suitable for a wide range of applications, although we focus on image predictive coding in this paper.

The remainder of this paper is organized as follows. Section II starts with the linear model and describes the details of Training Evidence, Target Evidence and the MAP selector. Section III presents the procedure to derive a limited number of alternative models for lower computational complexity. Experimental results are given in Section IV to demonstrate the efficiency of the proposed Bayesian-based prediction scheme. Section V provides some discussions on computational complexity and comparisons with the well-known MRP and MDL-PAR techniques. Section VI concludes this paper.

II. BAYESIAN-BASED PREDICTION

Without loss of generality, the proposed lossless predictor encodes the pixels in raster scan order. Fig. 1 illustrates the prediction model. x represents the current pixel, which is predicted from its causal pixels, denoted by circles. Traditional prediction methods, such as CALIC and EDP, assume that natural images are piecewise autoregressive. The causal pixels correlated with x are located in a local window, denoted as the red shadow (we call the window 'centered' on x throughout this paper). The linear model with an additive Gaussian noise term is formulated as:

x = u_x^T · a + ε, (1)

where u_x is the support vector comprised of causal neighbors, a is the vector of linear coefficients, and ε is a zero-mean Gaussian random variable. The dimension of u_x is called the model order. Obviously, the support vector u_x and the linear coefficients a determine the linear model. Given a predetermined approach to estimate a from training samples (e.g., the least-squares solution), a linear model is characterized by its support vector and training samples. It is assumed that support pixels and training samples are located in a support window W_s and a training window W_t around x. The support vector and training set that include all causal pixels in W_s and W_t are called the full support vector and the full training set, respectively.
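As a minimal sketch of the model in equation (1) with least-squares coefficient estimation (names and data are illustrative, not the paper's implementation):

```python
import numpy as np

def fit_linear_predictor(U, y):
    """Least-squares coefficients a for y ≈ U @ a.

    U : (N, p) matrix whose rows are the support vectors of training samples.
    y : (N,) vector of training-sample 'centers'.
    """
    a, *_ = np.linalg.lstsq(U, y, rcond=None)
    return a

def predict(u_x, a):
    """Predicted value x_hat = u_x^T a for the current pixel."""
    return float(u_x @ a)

# Toy usage: a 2-tap predictor recovers the exact weights on noiseless data.
rng = np.random.default_rng(0)
U = rng.uniform(0, 255, size=(20, 2))
a_true = np.array([0.7, 0.3])
y = U @ a_true
a = fit_linear_predictor(U, y)
x_hat = predict(np.array([100.0, 50.0]), a)
```

In the paper's setting, the rows of U would be the support vectors of the causal training samples in W_t, and u_x the support vector of the current pixel.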

Suppose we have a family of linear models, M = {M_k}. The support vector and training samples of each model are different from each other. For convenience, let u^k and S^k stand for the support vector and training set of M_k, and let a^k represent the least-squares solution for the linear coefficients. Thus, the predicted value for the current target is x̂^k = (u_x^k)^T · a^k.

It is well known that the predictive residuals of natural images fit a zero-mean Laplacian distribution well [17]. Thus, the 'optimal' model M_opt refers to the model that gives the minimum absolute predictive error. However, it is impossible to determine the 'optimal' model unless the current pixel is available. A commonly used assumption is that the model that fits past data better gives smaller predictive errors for future data. The fitness can be measured by the posterior probability of different models given the causal pixels. Therefore, the model with the maximum posterior probability is determined as the 'best' model M*.

In this paper, an adaptive predictor is designed to automatically select the 'best' model on a pixel-by-pixel basis from the Bayesian perspective. The diagram of the proposed lossless predictor is shown in Fig. 2. Given a set of linear models M_k and their corresponding predicted values x̂^k, the proposed predictor adaptively determines the 'best' prediction model M* for the current pixel x. Two types of self-defined model evidence, called TrE and TaE, are defined and incorporated into the Bayesian framework. TrE is defined as the likelihood function of the training samples. It represents a model's suitability for the current pixel under the assumption of local consistency. However, local consistency does not always hold, especially for natural images. In the case of object boundaries, local consistency almost breaks down, and the model selected by TrE alone may produce large predictive errors. Therefore, the patch redundancy of natural images is explored to derive a series of estimates for the current pixel. TaE, defined as the likelihood function of these estimates, indicates the preference for different models from the perspective of patch redundancy. Since local consistency and patch redundancy co-exist in natural images, the combination of TrE and TaE is able to make a more reasonable decision on model selection.


Fig. 2. The diagram of the proposed lossless predictor for each pixel. Given a set of linear models M_k along with their predicted target values x̂^k, the novel TaE is combined with traditional TrE under the Bayesian framework to select the best prediction model M* for the current pixel x. Then x̂* is estimated using M* and subtracted from the ground-truth value x to produce the predictive residual e*.

Fig. 3. An example of local consistency and patch redundancy. The current patch around x is in the red window. Green windows show similar neighboring patches. Yellow windows represent reference patches. (Better refer to the electronic version for color notations.)

Fig. 3 gives an example of the basic assumption that local consistency and patch redundancy co-exist in natural images. The current patch around x is in the red window. Green windows show similar neighboring patches. Yellow windows represent reference patches, which resemble the current patch but are not necessarily geometrically close to it. It can be seen that the local patch 'repeats' in many places. These similar patches, being either close to or far away from the current pixel, show that the assumption of local consistency and patch redundancy is valid.

A. Training Evidence

Local consistency assumes that local pixels fit the same model. If a model is suitable for the local training samples, it is highly likely that this model fits unknown samples well. TrE is defined as the likelihood function of the training samples. It stands for the preference shown by the training samples. For model M_k, assume that the training samples in training set S^k are independent. Let y^k = [y_1^k, y_2^k, ..., y_{N_k}^k]^T be the vector of 'centers' of the training samples, where N_k is the number of training samples in S^k. Given the estimated linear coefficients a^k, the likelihood function of the training samples can be written as:

Pr(y^k | M_k, a^k) = ∏_{j=1}^{N_k} Pr(y_j^k | M_k, a^k)
                   = ∏_{j=1}^{N_k} Pr(y_j^k − (u_{y_j}^k)^T · a^k), (2)

wherein u_{y_j}^k is the support vector of y_j. As has been pointed out by natural image statistics, the predictive residuals can be well modeled by a Laplacian distribution [17]. We integrate the Laplacian distribution and get the probability of the integer predictive residual e_j^k = y_j^k − (u_{y_j}^k)^T · a^k as:

Pr(e_j^k) =
    1 − exp(−1/(√2·σ)),                                                      e_j^k = 0,
    (1/2)·[exp(−(|e_j^k| − 0.5)/(σ/√2)) − exp(−(|e_j^k| + 0.5)/(σ/√2))],     0 < |e_j^k| < V,
    (1/2)·exp(−(|e_j^k| − 0.5)/(σ/√2)),                                      |e_j^k| = V,
(3)

wherein σ² is the variance estimated from the e_j^k's, and V denotes the maximum absolute residual value, which is 255 for 8-bit images.
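The three branches of equation (3) can be sketched as a short routine (illustrative code; a zero-mean Laplacian of variance σ², i.e. scale b = σ/√2, integrated over unit-width bins, with the remaining tail mass assigned to |e| = V so the probabilities sum to one):

```python
import numpy as np

def residual_prob(e, sigma, V=255):
    """Probability of the integer predictive residual e under equation (3)."""
    b = sigma / np.sqrt(2.0)  # Laplacian scale, so that variance = sigma^2
    e = abs(int(e))
    if e == 0:
        return 1.0 - np.exp(-0.5 / b)  # note 0.5/b = 1/(sqrt(2)*sigma)
    if e < V:
        return 0.5 * (np.exp(-(e - 0.5) / b) - np.exp(-(e + 0.5) / b))
    return 0.5 * np.exp(-(e - 0.5) / b)  # both tails folded into |e| = V

# The bin probabilities over e in [-V, V] telescope and sum to one:
total = residual_prob(0, 10.0) + 2 * sum(residual_prob(e, 10.0)
                                         for e in range(1, 256))
```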

Then, the likelihood function of the training samples, i.e., TrE, becomes:

Pr(y^k | M_k) = ∫ Pr(y^k, a^k | M_k) da^k. (4)

Assuming that Pr(a^k | y^k) is approximately normal and that a^k has a prior distribution that may be approximated by a normal distribution with large variance, the following result can be derived using the Laplacian approximation [18]–[20] with some simplifications [21]:

log Pr(y^k | M_k) ≐ log Pr(y^k | M_k, a^k) − (p_k/2)·log(N_k). (5)

A detailed derivation from equation (4) to equation (5) is given in the Appendix.

Note that likelihood functions among distinct models are incomparable since the sizes of the corresponding training sets are different. Thus, the likelihood functions are normalized by the number of training samples. The logarithm of TrE turns out to be:

log Pr(y | M_k) = (1/N_k)·log Pr(y^k | M_k, a^k) − (p_k/(2N_k))·log(N_k). (6)

B. Target Evidence

The assumption of local consistency holds well for smooth regions, but it breaks down at abrupt changes such as object boundaries. In this case, TrE, which focuses on training samples, fails to represent the model preference. Thus, additional model evidence is required, and patch redundancy is exploited in our algorithm. Patch redundancy assumes that an image patch usually has 'replicas' that are not necessarily close to it. Therefore, causal similar patches can help to predict the current pixel x, and x is estimated from every causal similar patch. Given the prediction x̂^k of model M_k, the likelihood function of these estimated values is a new type of model evidence, called TaE. For simplicity, the estimated values of x are called target samples in this article.

If the value of the current pixel is considered as a random variable and the assumption of patch redundancy holds, the current pixel will take the values of the target samples. Therefore, the likelihood function of the target samples, i.e., TaE, is able to represent a model's suitability for the current pixel under the assumption of patch redundancy. Two critical issues are related to the derivation of TaE. The first is how to derive target samples by exploiting patch redundancy. The second is how to formulate the likelihood function of the target samples.

1) Target Samples Derivation: Patch redundancy assumes that a local patch always 'repeats' in natural images. The causal similar patches are called reference patches, as shown in Fig. 3. Let C be the set of all causal pixels of x, and let C_r ⊆ C contain the 'centers' of the reference patches. Let c ∈ C_r be the 'center' of one reference patch. Since the context of x and that of c are similar to each other, it is reasonable to assume that c shares its 'optimal' model with x to some extent. For convenience, we denote M_c^{opt(c)} as the 'optimal' model of c, with support vector u_c^{opt(c)} and linear coefficients a_c^{opt(c)}.

A straightforward approach to estimate x is to map u_c^{opt(c)} to the context of x to form the support vector u_x^{opt(c)}, and then use the linear coefficients a_c^{opt(c)} to derive the target sample x_c as x_c = (a_c^{opt(c)})^T · u_x^{opt(c)}. However, this simple method does not perform well in most cases. One interpretation is that the similarity among patches is not strong enough to guarantee the sharing of the whole model. Therefore, only the support vector is borrowed, and the linear coefficients a_x^{opt(c)} are adaptively determined from the current training set. Thus, the target sample x_c is derived as x_c = (a_x^{opt(c)})^T · u_x^{opt(c)}. For convenience, let x_{C_r} represent the vector of target samples derived from the different reference patches.

One important problem in deriving target samples is that the linear models may fail to capture the local context in some cases. A linear model is believed to be ineffective once the 'optimal' predictive error is large. Target samples derived from ineffective 'optimal' models are not credible for the current target and hence need to be excluded. Therefore, we start by calculating the 'optimal' predictive residual for each reference patch and only keep reference patches with an 'optimal' residual below a certain threshold. The steps to derive target samples are summarized in Algorithm 1.
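The exclusion step just described might be sketched as follows (a simplified stand-in for the corresponding step of Algorithm 1, whose listing is an image in the source; the threshold value is an assumption, not the paper's):

```python
def filter_reference_patches(centers, optimal_residuals, threshold=8.0):
    """Keep only reference-patch centers whose 'optimal' predictive residual
    is small enough for the local linear model to be considered effective.
    `threshold` (8.0) is purely illustrative."""
    return [c for c, e in zip(centers, optimal_residuals)
            if abs(e) <= threshold]

# Usage: the patch centered at (9, 2) is dropped, its model being ineffective.
kept = filter_reference_patches([(3, 4), (9, 2), (5, 5)], [1.5, 20.0, 7.0])
```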

Algorithm 1 Algorithm to Derive Current Target Samples

2) Target Evidence Formulation: With the assumption of the linear model (i.e., equation (1)), we get x ∼ N(x̂, σ_c²), where σ_c² is the variance estimated from the target samples. As mentioned above, x will take the value of x_c if the assumption of patch redundancy holds. Thus, the probability of x_c given M_k can be written as:

Pr(x_c | M_k) = N(x_c | x̂^k, σ_c²). (7)

Since the target samples are derived independently of each other, the likelihood function of the target samples is:

Pr(x_{C_r} | M_k) = ∏_{c∈C_r} Pr(x_c | M_k)
                  = ∏_{c∈C_r} (1/√(2πσ_c²))·exp{−(x_c − x̂^k)²/(2σ_c²)}. (8)

Similar to TrE, TaE is defined as the normalized likelihood function. It is formulated as:

Pr(x_{C_r} | M_k) = {Pr(x_{C_r} | M_k)}^{1/Card(C_r)}, (9)

where Card(C_r) is the number of elements in C_r.
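Equations (8) and (9) can be sketched together as follows (illustrative code, computed in the log domain for numerical stability; names are ours):

```python
import numpy as np

def target_evidence(target_samples, x_hat_k):
    """TaE of equations (8)-(9): the Gaussian likelihood of the target
    samples around the model's prediction x_hat_k, with variance sigma_c^2
    estimated from the target samples, normalized by Card(C_r) via the
    exponent 1/Card(C_r), i.e. a geometric mean of the factors."""
    x_c = np.asarray(target_samples, dtype=float)
    var_c = x_c.var()                      # sigma_c^2
    log_lik = np.sum(-0.5 * np.log(2 * np.pi * var_c)
                     - (x_c - x_hat_k) ** 2 / (2 * var_c))
    return np.exp(log_lik / x_c.size)      # normalization of equation (9)
```

A model whose prediction sits near the target samples receives higher TaE than one whose prediction is far from them.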

C. MAP Selector

The Bayesian view of model selection uses probabilities to represent uncertainty in the choice of model. The model selection issue is formulated as a maximum a posteriori (MAP) problem. Given the two types of model evidence Pr(y | M_k) and Pr(x_{C_r} | M_k), the model that maximizes the posterior probability Pr(M_k | y, x_{C_r}) is considered the best, i.e.,

M* = argmax_k Pr(M_k | y, x_{C_r}), (10)

where M* stands for the best model. Supposing the prior probability of each model is equal, Bayes' theorem gives:

Pr(M_k | y, x_{C_r}) ∝ Pr(y, x_{C_r} | M_k). (11)


Fig. 4. Graphical model of M_k, y and x_{C_r}. Naive Bayes assumes independence between y and x_{C_r} given M_k, which fails for most natural images.

Fig. 5. Comparison of residual entropy for naive Bayes and the proposed method. The points below the zero line stand for patches that can be represented with fewer bits by the proposed algorithm than by the naive Bayes model, hence preferring the proposed method over the naive Bayes model.

A straightforward approach to decompose Pr(y, x_{C_r} | M_k) is to assume conditional independence of y and x_{C_r} given M_k, which is known as naive Bayes. The graphical model of M_k, y and x_{C_r} is given in Fig. 4.

However, naive Bayes is not sufficient for most natural images, which possess high correlation among pixels. Therefore, the following decomposition is proposed:

Pr(y, x_{C_r} | M_k) = γ(y, x_{C_r}, M_k)·Pr(y | M_k)·Pr(x_{C_r} | M_k), (12)

where γ(y, x_{C_r}, M_k) is defined as:

γ(y, x_{C_r}, M_k) = I(Pr(y | M_k)/max_k Pr(y | M_k) > τ_l) · I(Pr(x_{C_r} | M_k)/max_k Pr(x_{C_r} | M_k) > τ_l). (13)

The indicator function I(STATEMENT) is defined to be 1 when STATEMENT is true and 0 otherwise. By mandatorily setting the joint conditional probability to zero when one evidence falls below a specific threshold, the two evidences interact with each other.
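The gated MAP selection of equations (10), (12) and (13) might look like this sketch (the value of tau_l is illustrative; the paper does not specify it in this section):

```python
import numpy as np

def select_model(tre, tae, tau_l=0.1):
    """MAP model selection with the gating factor of equation (13): a model
    is eliminated (its joint evidence forced to zero) unless both its TrE
    and its TaE are within a factor tau_l of the best evidence of that type.
    `tre`, `tae` are arrays of Pr(y|M_k) and Pr(x_Cr|M_k) over the models."""
    tre = np.asarray(tre, dtype=float)
    tae = np.asarray(tae, dtype=float)
    gate = (tre / tre.max() > tau_l) & (tae / tae.max() > tau_l)
    joint = gate * tre * tae          # equation (12), up to the flat prior
    return int(np.argmax(joint))      # equation (10)
```

Models that are strongly preferred by one evidence but contradicted by the other are vetoed by the indicator product rather than rescued by a large single factor.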

Fig. 5 compares the proposed method with the naive Bayes model. 178 non-overlapping patches of size 100 × 100 are randomly extracted from the Kodak database [22]. For each patch, the entropies of the predictive residuals under the proposed method and under the naive Bayes model are subtracted. A point below the zero line represents a patch for which the proposed method achieves lower residual entropy than the naive Bayes model; in other words, the points below the zero line represent patches that can be represented with fewer bits by the proposed method than by the naive Bayes model, and vice versa. It can be seen that most patches prefer the proposed method over naive Bayes.

Fig. 6. Flow diagram of the proposed Bayesian-based prediction method, with the path of the novel TaE and the MAP selector highlighted. (Better refer to the electronic version for color notations.)

The flow diagram of the proposed Bayesian-based prediction method is illustrated in Fig. 6. For the current pixel x, filled in red, causal pixels are marked in pink in the second sub-figure from the top. Training Evidence and Target Evidence are derived separately from neighboring patches and reference patches. Then, these two types of model evidence are combined under the Bayesian framework to produce the posterior probability. Finally, the best model is selected using the MAP criterion. Our main contributions, i.e., the path of the novel TaE and the MAP selector, are highlighted in the flow diagram.

III. DERIVATION OF ALTERNATIVE MODELS

Although the algorithm described above is theoretically sound, searching for the best model among a large set of alternative models is extremely time-consuming, making the algorithm impractical. To overcome this problem, we derive a limited number of alternative models M while retaining their capability of handling different image contexts.


A nested family of predictors similar to [10] is adopted. A unique alternative model M_p is derived for each model order p. As p increases, the support vector u^p grows and creates nested support vectors, i.e., u^{p1} ⊂ u^{p2} if p1 < p2.

A. Model Support Derivation

In our algorithm, the Z-score (also known as the standardized coefficient) is used to order the causal pixels. The Z-score represents the effect of dropping a specific pixel of the support vector from the model, and is hence a good measure of a support pixel's importance. For each increment of the model order p, the support vector u^p grows by including the causal pixel with the next highest Z-score.

The Z-score of the i-th support pixel of the support vector u_x in (1) is defined as:

z(i) = a(i)·√β_ε / √v(i),   (14)

where a(i) is the estimated linear coefficient for the i-th support pixel, β_ε is the precision of ε in (1), and v(i) is the i-th diagonal element of (NᵀN)⁻¹, with N the design matrix of the full support vector and full training set. Let P_m be the size of the full support vector and N_m the size of the full training set. Under the null hypothesis that a(i) = 0, z(i) follows a t-distribution with N_m − P_m − 1 degrees of freedom [18]. Hence, a Z-score greater than 2 in absolute value is approximately significant at the 5% level for a maximum 24-order model with more than 85 training samples (the 0.025 tail quantiles of the t_{85−24−1} distribution are ±2.000). Therefore, a Z-score greater than 2 in absolute value leads to rejection of the null hypothesis a(i) = 0 and indicates that a(i) is significant. In this work, we set the minimum value of p as:

p_min = min{4, Card({j : |z(j)| > 2})}.   (15)
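The Z-score ranking above can be sketched in a few lines. This is our own illustration, not the paper's code: `zscore_order` is a hypothetical helper that fits the full-order least-squares predictor once and ranks support pixels by |z(i)|, which is exactly the order in which the nested supports uᵖ grow.

```python
import numpy as np

def zscore_order(N, y):
    """Rank support pixels by |Z-score| from a full-order least-squares fit.

    N : (Nm, Pm) design matrix (rows = training samples, cols = support pixels)
    y : (Nm,) target values of the training samples
    Returns indices of support pixels sorted by decreasing |z|, and the z values.
    """
    Nm, Pm = N.shape
    a, *_ = np.linalg.lstsq(N, y, rcond=None)       # full-order coefficients
    resid = y - N @ a
    sigma2 = resid @ resid / (Nm - Pm - 1)          # noise variance, df as in the text
    beta_eps = 1.0 / sigma2                         # precision of the noise term
    v = np.diag(np.linalg.inv(N.T @ N))             # v(i): diagonal of (N^T N)^{-1}
    z = a * np.sqrt(beta_eps) / np.sqrt(v)          # Eq. (14): z(i) = a(i)√β_ε / √v(i)
    order = np.argsort(-np.abs(z))                  # most significant pixel first
    return order, z
```

A support pixel with |z| > 2 would be kept as significant; the nested support of order p simply takes the first p entries of `order`.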

B. Training Set Derivation

Having determined the model support uᵖ, the training set is derived as follows:

Sᵖ = {(y_j, uᵖ_{y_j}) : ‖uᵖ_{y_j} − uᵖ_x‖ < τ_s},   (16)

where τ_s is a threshold adaptive to local texture:

τ_s = max_j ‖u^{p_min}_{y_j} − u^{p_min}_x‖.   (17)

One rough heuristic is that the number of training samples should be no less than some multiple (say 3 or 5) of the number of adaptive parameters in the model. Besides, Kass and Raftery [23] have suggested that at least 5×p training samples are required for the Laplace approximation to hold. Thus, a hard constraint Card(Sᵖ) ≥ 5×p is imposed for each alternative model. The same method is used to derive the training set in the fifth step of Algorithm 1. The steps of alternative model derivation are summarized in Algorithm 2.
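The training-set selection of Eqs. (16)–(17) can be sketched as follows. This is a minimal illustration under our own conventions (the `patches` layout and the fallback used to enforce the hard constraint are assumptions, since the text does not specify how Card(Sᵖ) ≥ 5×p is enforced when the threshold admits too few samples):

```python
import numpy as np

def build_training_set(patches, target, p, p_min):
    """Pick training samples for a model of order p, per Eqs. (16)-(17).

    patches : (n, P) array; row j is the full causal support vector of y_j
    target  : (P,) full support vector of the current pixel x
    Returns a boolean mask over the n candidate samples.
    """
    # tau_s adapts to local texture: largest p_min-order distance (Eq. 17)
    d_min = np.linalg.norm(patches[:, :p_min] - target[:p_min], axis=1)
    tau_s = d_min.max()
    # keep samples whose order-p support is within tau_s of the target (Eq. 16)
    d_p = np.linalg.norm(patches[:, :p] - target[:p], axis=1)
    mask = d_p < tau_s
    # hard constraint Card(S^p) >= 5*p; falling back to the 5*p nearest
    # samples is our assumption, not specified in the text
    if mask.sum() < 5 * p:
        mask = np.zeros(len(d_p), dtype=bool)
        mask[np.argsort(d_p)[:5 * p]] = True
    return mask
```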

Algorithm 2 Algorithm of Alternative Models Derivation

Fig. 7. The test images used in the experiments and comparison studies. From left to right and top to bottom: 'Barbara', 'Girl', 'Lena', 'Building', 'Bush', 'Flower', 'Foam', 'Leaves1', 'Leaves2', 'Reeds1', 'Reeds2', 'Riverbank1', 'Riverbank2', 'Tree' and 'Wall'.

IV. EXPERIMENTAL RESULTS

A. Experiment Settings

Extensive experiments are conducted to evaluate the superiority of the proposed context model with Bayesian treatment for lossless predictive coding. The proposed method is compared with the median predictor (MED) [24],



Fig. 8. Histograms of residual images under five different predictive methods. (a) 'Riverbank1'. (b) 'Barbara'. (c) 'Lena'. (d) 'Wall'. (Better refer to the electronic version for color notations.)

the BMF method [25], the gradient adaptive predictor in CALIC [4], the edge-directed predictor (EDP) [6], the Minimum Rate Prediction (MRP) [26]–[28] and the MDL-based adaptive predictor (MDL-PAR) [10]. For thoroughness and fairness, 15 images are used for lossless predictive coding, as shown in Fig. 7. The selected image set has rich diversity: it includes popular benchmark images ('Lena' and 'Barbara' of size 512 × 512) and 'Girl' with fine details. The remaining images are natural images from Van Hateren's Natural Image Database [29], containing trees, bushes, rivers and buildings commonly seen in the real world. The parameter settings of the proposed method are as follows. The support window is 3 pixels up and left of the current pixel. The training window is 10 pixels up and left of the current pixel.¹ The other parameters are set as τ_l = 0.5 and τ_e = 5.

¹Theoretically, reference patches can be freely chosen in the causal area. However, the computational complexity would be very high even if fast-searching algorithms were used. In this work, we only search for reference patches within a window around the current pixel to accelerate the algorithm, at the expense of some degradation in compression rate. Exhaustive searching is used within this window.

B. Visual Results

Since the proposed algorithm focuses on predictive techniques, predictive errors, or predictive residuals, demonstrate the performance of the different predictors. Fig. 8 gives part of the histograms of residual images for five lossless coding techniques. It clearly shows that the predictive residuals of the proposed method are more closely gathered around zero than those of the other methods. The higher peak and lighter tails of the proposed residual histogram indicate higher coding efficiency.

Visual residuals of the different predictors are illustrated in Fig. 9. Image structures appear in the residual image when the predictor fails to capture image textures. MED and CALIC leave image structures clearly visible in the residual images because they adapt little to sharp changes. These methods almost break down in cases where pixel values jump very fast, such as the clothing of 'Barbara' and the hat boundaries of 'Lena'. EDP reduces structural residuals by adapting the predictor reasonably to edge areas, such as the stem of 'Leaves2'. But the predetermined predictor order and support limit EDP's performance, and this problem



Fig. 9. Zoom-in residual images for five different predictive coding methods. From the first row to the fourth row: the hat edge of 'Lena' (sharp edges), the stem of 'Leaves2' (cross lines), the surface of 'Building' (periodic textures) and the scarf boundary of 'Barbara' (context pattern changes). The predictive residual images for MED, CALIC, EDP, MDL-PAR and the proposed method are illustrated from left to right. (a) MED. (b) CALIC. (c) EDP. (d) MDL-PAR. (e) Proposed.

is solved by MDL-PAR. MDL-PAR selects the appropriate model order and samples to eliminate predictive residuals, as shown in the hat of 'Lena'. However, MDL-PAR relies merely on local consistency, and the prediction performance can be further improved if more properties of natural images are involved. The proposed algorithm effectively combines local consistency and patch redundancy, and the predictive residuals are largely reduced for various kinds of singularities.

Fig. 9 includes four common image structures. The hat edge of 'Lena' is an example of sharp edges; information from reference patches makes the predictor adapt well to edges, hence leaving fewer residuals. The stem of 'Leaves2' shows the case of cross lines, which are composed of multiple sharp edges. Traditional adaptive predictors fail to capture the structure of different lines simultaneously. It can be seen that fewer residuals are left at the cross section by the proposed algorithm than by any other method, owing to the additional information provided by patch redundancy. The surface of 'Building' shows periodic textures, where the proposed method gives the best results: the remaining brick structure is greatly reduced compared to MDL-PAR. Our proposed predictor is even capable of coping with abrupt changes in a periodic pattern, such as the scarf boundary in 'Barbara'. It is evident that the proposed predictor under Bayesian treatment produces the residual images with the least structural information and achieves the most accurate predictions.

C. Statistical Results

To further validate the superiority of the proposed algorithm, statistical measurements are given in this section. The entropy of the predictive residuals is reported in Table I. Smaller residual entropy indicates a shorter required coding length and hence a better predictor. The second-to-last column shows the minimum reduction of residual entropy compared with the state-of-the-art methods, and the last column shows the minimum reduction rate. It can be observed that the proposed Bayesian predictor achieves the lowest entropy among the five methods.
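The residual-entropy measure reported in Table I is the zeroth-order empirical entropy of the residual image in bits per pixel. A minimal sketch of how such a figure is computed (our own helper name):

```python
import numpy as np

def residual_entropy(residuals):
    """Zeroth-order empirical entropy of prediction residuals, in bits/pixel."""
    _, counts = np.unique(np.asarray(residuals).ravel(), return_counts=True)
    p = counts / counts.sum()           # empirical symbol probabilities
    return float(-(p * np.log2(p)).sum())
```

A perfect predictor gives a constant residual and 0 bpp; four equally likely residual values give exactly 2 bpp.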



TABLE I

RESIDUAL ENTROPY (IN BITS PER PIXEL) COMPARISONS OF FIVE METHODS WITH BEST METHOD HIGHLIGHTED IN BOLDFACE

TABLE II

CODING RATE (IN BITS PER PIXEL) COMPARISONS OF SIX METHODS WITH BEST METHOD HIGHLIGHTED IN BOLDFACE2

The largest minimum reductions appear on 'Barbara' and 'Building', which contain many periodic textures; over 3% reduction rate is achieved compared with the most recent MDL-PAR predictor. The proposed method also performs well on fine-detailed images ('Girl') and images of natural scenes (such as 'Bush', 'Flower' and 'Leaves2'). The average reductions of residual entropy compared with MED, CALIC, EDP and MDL-PAR are 0.379 bpp, 0.365 bpp, 0.187 bpp and 0.086 bpp, respectively, with reduction rates of 8.82%, 8.56%, 4.46% and 1.99%.

Finally, the lossless code lengths of the test images are provided to help evaluate the practical coding performance of the proposed coding scheme. Table II lists the coding rates of the proposed scheme together with those of state-of-the-art lossless coding techniques. The proposed algorithm obtains shorter lossless code lengths than the other methods on average. The superiority of our algorithm is obvious for the test images 'Wall' and 'Building', with reductions of more than 0.03 bpp. The average rate reductions compared to BMF, JPEG-LS, EDP, CALIC and MRP are 0.303 bpp, 0.391 bpp, 0.400 bpp, 0.215 bpp and 0.018 bpp, respectively. Although the margin of our algorithm over the MRP technique is quite slight, our algorithm enjoys the flexibility of easy extension to other image processing problems (e.g. inpainting and interpolation), which is not possible for the MRP model. A detailed discussion is given in Section V.

V. DISCUSSION

A. Complexity Reduction

Since we focus on the theoretical study of lossless predictive coding, the proposed predictive coding algorithm is implemented in a combination of MATLAB and C. Several approaches have been used to accelerate the proposed algorithm. 1) As mentioned in Section III, only one alternative linear model is chosen for each model order. 2) Motivated by equation (13), the best model is selected in a coarse-to-fine manner. At the coarse level, the TrE of all alternative models is derived and then used to preliminarily exclude a few models from the MAP solution. At the fine level,

²Since the arithmetic coding part of MDL-PAR is unavailable, we do not compare with it in this table. The gap between the proposed algorithm and MDL-PAR in residual entropy is able to verify the superiority of the proposed adaptive predictor. The codes of JPEG-LS, BMF, CALIC, EDP and MRP are from http://www.stat.columbia.edu/~jakulin/jpeg-ls/mirror.htm, http://compression.ru/ds/, http://www.ece.mcmaster.ca/~xwu/calicexe/, http://www.csee.wvu.edu/~xinl/source.html and http://itohws03.ee.noda.sut.ac.jp/~matsuda/mrp/, respectively. The range coder is used for BMF, MRP and the proposed algorithm; arithmetic coding is used for CALIC and EDP; and a Golomb-type coder is used for JPEG-LS.



the TaE and TrE of the remaining models are combined for the final selection. In this way, the steps to compute TaE for the pre-excluded models are skipped. Although several methods have been designed to reduce the computational complexity, the processing time of our serial program is much longer than that of some other available encoders (e.g. EDP and MRP, written in C or C++). Fortunately, our algorithm is highly parallel-friendly: the derivations of the alternative models and model evidence for each model order are independent, and the derivations of the target samples are also independent of each other. One of our future works is to exploit parallel programming to speed up our program. Another possible approach is to shorten the runtime at the expense of memory consumption. For example, generating causal patches within a large window is a necessary but time-consuming step of the prediction process. If the causal patches are stored in pre-allocated memory that is updated accordingly when proceeding to the next pixel, the computational complexity will be significantly reduced.
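The coarse-to-fine selection above can be sketched compactly. This is our own illustration (the function and parameter names are ours): the cheap TrE screening prunes most candidates, and the expensive TaE computation is only performed for the survivors before taking the MAP under equal model priors.

```python
import numpy as np

def map_select(tre, tae_fn, keep=3):
    """Coarse-to-fine MAP model selection (a sketch).

    tre    : array of log Training Evidence, one entry per alternative model
    tae_fn : callable k -> log Target Evidence of model k (the costly step)
    keep   : number of candidates surviving the coarse TrE screening
    """
    tre = np.asarray(tre, dtype=float)
    # coarse level: keep only the models with the highest TrE
    cand = np.argsort(-tre)[:keep]
    # fine level: combine TrE and TaE (equal model priors) and take the MAP
    post = {k: tre[k] + tae_fn(k) for k in cand}
    return max(post, key=post.get)
```

Note that pruning trades accuracy for speed: a model excluded at the coarse level can never win, however large its TaE would have been.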

B. Comparison With MRP

In this section, our algorithm is compared with the well-known adaptive predictor MRP proposed by Matsuda et al. The proposed prediction method is an implicit model, where both the encoder and the decoder can automatically determine the prediction model from causal pixels. Therefore, the encoding and decoding processes can be performed synchronously in one-pass coding. In contrast, MRP exploits an explicit model, where the chosen best model is sent to the decoder as side information.

Since MRP uses the ground truth of the current pixel to choose the best linear predictor, the chosen model is more accurate than that of our method, which does not have access to the current pixel. However, sending the model choice for every pixel is impossible, so MRP assumes the same model for a whole block. This assumption inevitably degrades the prediction performance, since the optimal choice of predictor is likely to vary among the pixels inside a block, especially for natural images. The proposed algorithm, on the other hand, runs in raster-scan order and employs pixel-wise models for prediction. Only basic information such as image width and height needs to be transmitted, which is much less than that of MRP.
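The implicit-model property can be demonstrated with a toy round-trip. This is a deliberately simplified sketch of our own (a fixed 3-pixel support and plain least squares, not the paper's full method): because the predictor is fit from causal pixels only, encoder and decoder compute identical predictions, and only residuals cross the channel.

```python
import numpy as np

def causal_predict(recon, r, c):
    """Predict pixel (r, c) from a fixed causal support (W, N, NW), with
    coefficients fit on previously reconstructed pixels only. Encoder and
    decoder call this on identical causal data, so predictions agree."""
    X, y = [], []
    for i in range(1, r + 1):
        for j in range(1, recon.shape[1]):
            if (i, j) >= (r, c):        # tuple order = raster-scan order
                break
            X.append([recon[i, j - 1], recon[i - 1, j], recon[i - 1, j - 1]])
            y.append(recon[i, j])
    a, *_ = np.linalg.lstsq(np.array(X, float), np.array(y, float), rcond=None)
    support = np.array([recon[r, c - 1], recon[r - 1, c], recon[r - 1, c - 1]], float)
    return float(support @ a)
```

In a round trip, the encoder transmits the border plus residuals `img[r, c] - round(pred)`; the decoder re-runs the identical fit on its reconstruction and adds the residuals back, recovering the image exactly.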

Another advantage of our method is that a precise image model is proposed, which can be easily extended to other image processing applications (e.g. inpainting, interpolation and error concealment). The model selection and the linear coefficients of the predictor are automatically determined using causal pixels. When the original pixel is not available (e.g. lost during transmission), the proposed algorithm can proceed to fill in the missing part with preferable visual quality [30]. In contrast, the MRP procedure does not work in this setting, as there is no explicit underlying image model.

C. Comparison With MDL-PAR

The minimum description length (MDL) criterion used in the MDL-PAR algorithm [10] is motivated from an optimal coding viewpoint. The objective of the MDL approach is defined as −log Pr(y_k | M_k, a_k) + (p_k/2)·log(N_k), where the first term is the log-likelihood function and the second term represents the order penalty. This objective is exactly the negative of the TrE defined in equation (6). Therefore, the minimum-description-length solution is equal to the maximum-TrE solution. With equal prior probability for the different models, the MDL-PAR algorithm can thus be viewed as a simplified version of our algorithm with TaE ignored.

Although choosing the model with the minimum description length is equivalent to choosing the model with the maximum posterior probability, the Bayesian framework is more comprehensive: more properties of natural images can be incorporated to complement the traditional evidence. In this paper, we explore the property of patch redundancy and use the 'optimal' models of previously encoded pixels to produce high-quality target samples. Based on these target samples, a novel type of model evidence called TaE is derived and combined with the traditional TrE to determine the 'best' model. The visual results in Fig. 9 and the statistical results in Table I validate the superiority of the proposed Bayesian-based algorithm.

VI. CONCLUSION

Traditional lossless predictors have difficulty determining a proper predictive model from training data alone. To overcome this problem, we exploit the patch redundancy of natural images and propose a novel model for lossless image predictive coding from the Bayesian perspective. A set of alternative models is derived, and the prediction problem is formulated as a model selection problem. Different from traditional algorithms, we investigate target samples and derive a new type of model evidence called Target Evidence. The novel evidence is combined with traditional Training Evidence to jointly exploit the benefits of local consistency and patch redundancy. With the fusion of these two types of model evidence, the predictor is able to adapt to different contexts, hence producing good prediction performance. Moreover, a method to reduce the number of alternative models is also proposed to lower the computational complexity of our algorithm. Experimental results show that the proposed predictor captures image textures better and suppresses predictive error more efficiently. The proposed algorithm outperforms state-of-the-art algorithms in terms of objective and subjective evaluations.

APPENDIX

As described in Section II-A, a series of assumptions and simplifications are needed to obtain equation (5) from equation (4). They are detailed here (for more details, please refer to [18]). Equation (4) is written as:

Pr(y_k | M_k) = ∫ Pr(y_k, a_k | M_k) da_k.

Assuming that Pr(a_k | y_k) is approximately normal with mean â_k and covariance matrix V_k, a so-called saddle-point or Laplace approximation [18]–[20] to



equation (4) gives:

Pr(y_k | M_k) ≐ Pr(y_k, â_k | M_k) · ∫ exp{−(1/2)(a_k − â_k)ᵀ V_k⁻¹ (a_k − â_k)} da_k
            = Pr(y_k, â_k | M_k) · (2π)^{p_k/2} |V_k|^{1/2}
            = Pr(y_k | â_k, M_k) Pr(â_k | M_k) · (2π)^{p_k/2} |V_k|^{1/2},   (18)

where p_k is the dimension of a_k. Thus,

log Pr(y_k | M_k) ≐ log Pr(y_k | M_k, â_k) + log Pr(â_k | M_k) + (p_k/2) log(2π) + (1/2) log|V_k|.   (19)

Supposing a_k has a prior distribution that may be approximated by N(a_k⁰, V_k⁰), equation (19) becomes:

log Pr(y_k | M_k) ≐ log Pr(y_k | M_k, â_k) − (1/2)(â_k − a_k⁰)ᵀ (V_k⁰)⁻¹ (â_k − a_k⁰) − (1/2) log|V_k⁰| + (1/2) log|V_k|.   (20)

Assuming that the prior distribution of a_k is very diffuse, the second term can be neglected, so the penalty on the log-likelihood is −(1/2) log|V_k⁰| + (1/2) log|V_k|. This is roughly proportional to −(p_k/2) log N_k provided the parameters are identifiable [21]. Finally,

log Pr(y_k | M_k) ≐ log Pr(y_k | M_k, â_k) − (p_k/2) log(N_k),

which is the same as equation (5).

REFERENCES

[1] J. Rissanen and G. G. Langdon, Jr., "Universal modeling and coding," IEEE Trans. Inf. Theory, vol. 27, no. 1, pp. 12–23, Jan. 1981.

[2] J. Rissanen, "A universal data compression system," IEEE Trans. Inf. Theory, vol. 29, no. 5, pp. 656–664, Sep. 1983.

[3] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Trans. Inf. Theory, vol. 30, no. 4, pp. 629–636, Jul. 1984.

[4] X. Wu and N. Memon, "Context-based, adaptive, lossless image coding," IEEE Trans. Commun., vol. 45, no. 4, pp. 437–444, Apr. 1997.

[5] Y.-L. Lee, K.-H. Han, and G. J. Sullivan, "Improved lossless intra coding for H.264/MPEG-4 AVC," IEEE Trans. Image Process., vol. 15, no. 9, pp. 2610–2615, Sep. 2006.

[6] X. Li and M. T. Orchard, "Edge-directed prediction for lossless compression of natural images," IEEE Trans. Image Process., vol. 10, no. 6, pp. 813–817, Jun. 2001.

[7] W. B. Pennebaker and J. L. Mitchell, JPEG: Still Image Data Compression Standard. New York, NY, USA: Springer-Verlag, 1993.

[8] H. Takeda, S. Farsiu, and P. Milanfar, "Kernel regression for image processing and reconstruction," IEEE Trans. Image Process., vol. 16, no. 2, pp. 349–366, Feb. 2007.

[9] C. Kervrann and J. Boulanger, "Local adaptivity to variable smoothness for exemplar-based image regularization and representation," Int. J. Comput. Vis., vol. 79, no. 1, pp. 45–69, 2008.

[10] X. Wu, G. Zhai, X. Yang, and W. Zhang, "Adaptive sequential prediction of multidimensional signals with applications to lossless image coding," IEEE Trans. Image Process., vol. 20, no. 1, pp. 36–42, Jan. 2011.

[11] A. Buades, B. Coll, and J.-M. Morel, "A non-local algorithm for image denoising," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, Jun. 2005, pp. 60–65.

[12] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, "Image denoising by sparse 3-D transform-domain collaborative filtering," IEEE Trans. Image Process., vol. 16, no. 8, pp. 2080–2095, Aug. 2007.

[13] Z. Xu and J. Sun, "Image inpainting by patch propagation using patch sparsity," IEEE Trans. Image Process., vol. 19, no. 5, pp. 1153–1165, May 2010.

[14] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Non-local sparse models for image restoration," in Proc. IEEE 12th Int. Conf. Comput. Vis. (ICCV), Sep./Oct. 2009, pp. 2272–2279.

[15] M. Ebrahimi and E. R. Vrscay, "Solving the inverse problem of image zooming using 'self-examples'," in Image Analysis and Recognition. Berlin, Germany: Springer-Verlag, 2007, pp. 117–130.

[16] G. Freedman and R. Fattal, "Image and video upscaling from local self-examples," ACM Trans. Graph., vol. 30, no. 2, 2011, Art. ID 12.

[17] E. Y. Lam and J. W. Goodman, "A mathematical analysis of the DCT coefficient distributions for images," IEEE Trans. Image Process., vol. 9, no. 10, pp. 1661–1666, Oct. 2000.

[18] B. D. Ripley, Pattern Recognition and Neural Networks. Cambridge, U.K.: Cambridge Univ. Press, 1996.

[19] D. V. Lindley, "Approximate Bayesian methods," Trabajos de Estadística y de Investigación Operativa, vol. 31, no. 1, pp. 223–245, 1980.

[20] L. Tierney and J. B. Kadane, "Accurate approximations for posterior moments and marginal densities," J. Amer. Statist. Assoc., vol. 81, no. 393, pp. 82–86, 1986.

[21] G. Schwarz, "Estimating the dimension of a model," Ann. Statist., vol. 6, no. 2, pp. 461–464, 1978.

[22] Kodak. (1999). Kodak Lossless True Color Image Suite. [Online]. Available: http://r0k.us/graphics/kodak/, accessed Jan. 27, 2013.

[23] R. E. Kass and A. E. Raftery, "Bayes factors," J. Amer. Statist. Assoc., vol. 90, no. 430, pp. 773–795, 1995.

[24] M. J. Weinberger, G. Seroussi, and G. Sapiro, "The LOCO-I lossless image compression algorithm: Principles and standardization into JPEG-LS," IEEE Trans. Image Process., vol. 9, no. 8, pp. 1309–1324, Aug. 2000.

[25] D. Shkarin. (2009). A Special Lossless Compressor BMF. [Online]. Available: http://compression.ru/ds/

[26] I. Matsuda, H. Mori, and S. Itoh, "Lossless coding of still images using minimum-rate predictors," in Proc. Eur. Signal Process. Conf. (EUSIPCO), vol. 1, 2000, pp. 1205–1208.

[27] I. Matsuda, N. Shirai, and S. Itoh, "Lossless coding using predictors and arithmetic code optimized for each image," in Visual Content Processing and Representation. Berlin, Germany: Springer-Verlag, 2003, pp. 199–207.

[28] I. Matsuda, N. Ozaki, Y. Umezu, and S. Itoh, "Lossless coding using variable block-size adaptive prediction optimized for each image," in Proc. 13th Eur. Signal Process. Conf. (EUSIPCO), 2005, pp. 818–821.

[29] J. H. van Hateren and A. van der Schaaf, "Independent component filters of natural images compared with simple cells in primary visual cortex," Proc. Roy. Soc. London B, Biol. Sci., vol. 265, no. 1394, pp. 359–366, 1998.

[30] J. Liu, G. Zhai, X. Yang, B. Yang, and L. Chen, "Spatial error concealment with adaptive linear predictor," IEEE Trans. Circuits Syst. Video Technol., doi: 10.1109/TCSVT.2014.2359145.

Jing Liu received the B.E. degree from Shanghai Jiao Tong University, Shanghai, China, in 2011, where she is currently pursuing the Ph.D. degree with the Institute of Image Communication and Network Engineering. Her research interests include image and video processing.



Guangtao Zhai (M'10) received the B.E. and M.E. degrees from Shandong University, Jinan, China, in 2001 and 2004, respectively, and the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2009, where he is currently a Research Professor with the Institute of Image Communication and Network Engineering.

He was a Student Intern with the Institute for Infocomm Research, Singapore, from 2006 to 2007; a Visiting Student with the School of Computer Engineering, Nanyang Technological University, Singapore, from 2007 to 2008; and a Visiting Student with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON, Canada, from 2008 to 2009, where he was a Post-Doctoral Fellow from 2010 to 2012. From 2012 to 2013, he was a Humboldt Research Fellow with the Institute of Multimedia Communication and Signal Processing, Friedrich Alexander University of Erlangen-Nuremberg, Erlangen, Germany. He was a recipient of the National Excellent Ph.D. Thesis Award from the Ministry of Education of China in 2012. His research interests include multimedia signal processing and perceptual signal processing.

Xiaokang Yang (A'00–SM'04) received the B.S. degree from Xiamen University, Xiamen, China, in 1994, the M.S. degree from the Chinese Academy of Sciences, Shanghai, China, in 1997, and the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, in 2000.

He is currently a Full Professor and the Deputy Director with the Department of Electronic Engineering, Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University. From 2000 to 2002, he was a Research Fellow with the Center for Signal Processing, Nanyang Technological University, Singapore, and a Research Scientist with the Institute for Infocomm Research, Singapore, from 2002 to 2004. He has authored over 80 refereed papers and holds six patents. His current research interests include video processing and communication, media analysis and retrieval, perceptual visual processing, and pattern recognition.

Dr. Yang was a recipient of the Microsoft Young Professorship Award in 2006, the Best Young Investigator Paper Award at the IS&T/SPIE International Conference on Video Communication and Image Processing in 2003, and several awards from the Agency for Science, Technology and Research, and the Tan Kah Kee Foundation. He is a member of the Visual Signal Processing and Communications Technical Committee of the IEEE Circuits and Systems Society. He was the Special Session Chair of Perceptual Visual Processing at the IEEE International Conference on Multimedia and Expo in 2006, the Local Co-Chair of the International Conference on Communications and Networking in China in 2007, and the Technical Program Co-Chair of the IEEE Workshop on Signal Processing Systems in 2007. He actively participates in international standards such as MPEG-4, JVT, and MPEG-21.

Li Chen received the B.S. and M.S. degrees from Northwestern Polytechnical University, Xi'an, China, and the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2006, all in electrical engineering. His research interests include image and video processing, DSP, and very large scale integration (VLSI) for image and video processing. Under grants from the NSFC, he has worked on image completion and inpainting, video frame rate conversion, and image deshake and deblur. He currently focuses on VLSI for image and video processing.

