
Distribution Preserving Quantization

Minyue Li

Doctoral Thesis in Telecommunications
Stockholm, Sweden 2011

www.kth.se

ISBN 978-91-7501-075-5
ISSN 1653-5146

TRITA-EE 2011:055



Thesis for the degree of Doctor of Philosophy

Distribution Preserving Quantization

Minyue Li

Sound and Image Processing Laboratory
School of Electrical Engineering

KTH (Royal Institute of Technology)

Stockholm 2011


Li, Minyue
Distribution Preserving Quantization

Copyright © 2011 Minyue Li except where otherwise stated. All rights reserved.

ISBN 978-91-7501-075-5
ISSN 1653-5146
TRITA-EE 2011:055

Sound and Image Processing Laboratory
School of Electrical Engineering
KTH (Royal Institute of Technology)
SE-100 44 Stockholm, Sweden


(This is a quote from Shimin Li, an emperor of China in the Tang Dynasty (618 – 907). Trying my best to translate it, this quote says "on objects with observable forms, a fool has no doubts; on a substance without a form, sages may have little knowledge." It is the hidden truths that motivate me to do research.)


Abstract

In the lossy coding of perceptually relevant signals, such as sound and images, the ultimate goal is to achieve good perceived quality of the reconstructed signal under a constraint on the bit-rate. Conventional methodologies focus either on a rate-distortion optimization or on the preservation of signal features. Technologies resulting from these two perspectives are efficient only for high-rate or low-rate scenarios, respectively. In this dissertation, a new objective is proposed: to seek the optimal rate-distortion trade-off under a constraint that statistical properties of the reconstruction are similar to those of the source.

The new objective leads to a new quantization concept: distribution preserving quantization (DPQ). DPQ preserves the probability distribution of the source by stochastically switching among an ensemble of quantizers. At low rates, DPQ exhibits a synthesis nature, resembling existing coding methods that preserve signal features. Compared with rate-distortion optimized quantization, DPQ trades off some rate-distortion performance for perceptual benefits.

The rate-distortion optimization for DPQ facilitates mathematical analysis. The dissertation defines a distribution preserving rate-distortion function (DP-RDF), which serves as a lower bound on the rate of any DPQ method for a given distortion. For a large range of sources and distortion measures, the DP-RDF approaches the classic rate-distortion function with increasing rate. This suggests that, at high rates, an optimal DPQ can approach conventional quantization in terms of rate-distortion characteristics.

After verifying the perceptual advantages of DPQ with a relatively simple realization, this dissertation focuses on a method called transformation-based DPQ, which is based on dithered quantization and a non-linear transformation. Asymptotically, with increasing dimensionality, a transformation-based DPQ achieves the DP-RDF for i.i.d. Gaussian sources and the mean squared error (MSE).

This dissertation further proposes a DPQ scheme that asymptotically achieves the DP-RDF for stationary Gaussian processes and the MSE. For practical applications, this scheme can be reduced to dithered quantization with pre- and post-filtering. The simplified scheme preserves the power spectral density (PSD) of the source.

The use of dithered quantization and non-linear transformations to construct DPQ is extended to multiple description coding, which leads to a multiple description DPQ (MD-DPQ) scheme. MD-DPQ preserves the source probability distribution for any packet loss scenario.

The proposed schemes generally require efficient entropy coding. The dissertation also includes an entropy coding algorithm for lossy coding systems, which is referred to as sequential entropy coding of quantization indices with update recursion on probability (SECURE).

The proposed lossy coding methods were subjected to evaluations in the context of audio coding. The experimental results confirm the benefits of the methods and, therewith, the effectiveness of the proposed new lossy coding objective.

Keywords: lossy source coding, perceived quality, perceptual coding, rate-distortion function, quantization, synthesis, distribution preserving quantization, distribution preserving rate-distortion function, multiple description coding, entropy coding.


List of Papers

The dissertation is based on the following papers:

[A] M. Li and W. B. Kleijn, "Quantization with Constrained Relative Entropy and Its Application to Audio Coding," in Audio Engineering Society Convention 127, 2009.

[B] M. Li, J. Klejsa, and W. B. Kleijn, "Distribution Preserving Quantization with Dithering and Transformation," IEEE Signal Processing Letters, vol. 17, no. 12, pp. 1014–1017, 2010.

[C] M. Li, J. Klejsa, and W. B. Kleijn, "On Distribution Preserving Quantization," submitted for publication.

[D] M. Li, A. Ozerov, J. Klejsa, and W. B. Kleijn, "Asymptotically Optimal Distribution Preserving Quantization for Stationary Gaussian Processes," submitted for publication.

[E] J. Klejsa, G. Zhang, M. Li, and W. B. Kleijn, "Multiple Description Distribution Preserving Quantization," submitted for publication.

[F] M. Li and W. B. Kleijn, "Sequential Entropy Coding of Quantization Indices with Update Recursion on Probability," to be submitted.


In addition to papers A–F, the following papers have also been produced in part by the author of the dissertation:

[1] M. Li and W. B. Kleijn, "A Low-Delay Audio Coder with Constrained-Entropy Quantization," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 191–194, 2007.

[2] W. B. Kleijn and M. Li, "Least Significant Bit Coding of Speech," in Asilomar Conference on Signals, Systems & Computers, pp. 1485–1490, 2007.

[3] S. Bruhn, V. Grancharov, W. B. Kleijn, J. Klejsa, M. Li, J. Plasberg, H. Pobloth, S. Ragot, and A. Vasilache, "The FlexCode Speech and Audio Coding Approach," in ITG Fachtagung Sprachkommunikation, 2008.

[4] J. Klejsa, M. Li, and W. B. Kleijn, "FlexCode — Flexible Audio Coding," in IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 361–364, 2010.

[5] O. A. Moussa, M. Li, and W. B. Kleijn, "Predictive Audio Coding Using Rate-Distortion-Optimal Pre- and Post-Filtering," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2011, to appear.


Acknowledgements

Five years ago, in the late summer like now, I left my hometown, traveled one-fourth of the way around the globe with luggage of half my own weight, to Stockholm, and started my life as a doctoral student. However worrisome it felt to me, moving to this completely new environment turned out not to be difficult, thanks to the help of many people. My time as a doctoral student is now coming to a close with the completion of this dissertation. It is time to express my deepest gratitude to all those who have supported me throughout these years.

First and foremost, I am heartily thankful to my supervisor, Prof. W. Bastiaan Kleijn, whose profound knowledge and sense of responsibility provided me with very effective supervision. Besides countless valuable suggestions on my work, his scientific way of thinking, his enthusiasm and seriousness in research, and his strong work ethic have influenced me greatly and will benefit me for the rest of my life.

I would also like to thank my chief co-authors: Dr. Alexey Ozerov, Dr. Guoqiang Zhang, Janusz Klejsa, and Obada Alhaj Moussa. I appreciate all the valuable discussions with them, without which this dissertation would not have been possible. My special thanks to Janusz Klejsa, who is so ready to help that I never hesitated to interrupt him with questions, sometimes even ludicrous ones.

I am indebted to Prof. Arne Leijon, Prof. Jiakang Liu, Dr. Guoqiang Zhang, Gustav Henter, and Zhanyu Ma for proof-reading parts of the dissertation and offering numerous constructive comments. My special thanks to Gustav Henter, whose comments covered all aspects, ranging from punctuation to philosophy.

I would like to thank all my colleagues, both current and past. It is my luck to have had Christos Koniaris and Haopeng Li as office mates. I enjoyed the teaching experience with Prof. Markus Flierl and Nasser Mohammadiha. I owe my gratitude to Dora Soderberg for her support with various administrative matters. My apologies for omitting other names, since the list is far too long to put here.

I also thank my collaborators in the FlexCode project and from the research department of Huawei Technologies.


Life as a doctoral student is by no means limited to that on campus. A colorful life kept me refreshed and brought inspiration to my work. I thank all my friends, particularly my Chinese friends, for all the unforgettable get-togethers. I thank Maestro Gunnar Julin, who gave me the chance to play a part in the university orchestra. I thank Simba Qiu for directing the Chinese choir and letting me be a part of it. I am grateful to Jieying for all the love and happiness she brings to me.

Last but not least, my parents deserve my special thanks. Despite a distance of almost 7000 km between Beijing and Stockholm, I constantly felt strong encouragement from them.

Minyue Li

Stockholm, August 2011


Contents

Abstract i

List of Papers iii

Acknowledgements v

Contents vii

Notations xi

I Introduction  1

1 Rate-Distortion Theory and Quantization  4
   1.1 Rate-Distortion Theory  5
   1.2 Quantization  8
   1.3 Entropy Coding  20
   1.4 Distortion Measure Considerations  21
2 Perceived Quality  22
   2.1 Quality Assessment  23
   2.2 Cognitive Theories of Similarity  26
3 Perceptual Coding  31
   3.1 Analysis-by-Synthesis  31
   3.2 Quantization with Perceptual Distortion Measures  33
   3.3 Parametric Coding  35
   3.4 Quantization with Constraints on Statistical Properties of Reconstruction  39
4 Distribution Preserving Quantization  40
   4.1 Definition  40
   4.2 Distribution Preserving Rate-Distortion Function  43
   4.3 Realizations  45
   4.4 Variations  47
   4.5 Applications  47


5 Summary of Contributions  48
6 Conclusions and Future Work  50
Appendix  51
References  54

II Included papers 71

A Quantization with Constrained Relative Entropy and Its Application to Audio Coding  A1

1 Introduction  A1
2 CRE Quantization  A3
   2.1 Generic Quantization  A3
   2.2 Distribution Preserving Quantization  A4
   2.3 CRE Quantization  A5
   2.4 Optimal Reconstruction Distribution  A5
   2.5 Optimal Partition  A6
3 Audio Coding with CRE Quantization  A8
   3.1 The Audio Coder  A8
   3.2 Implementation  A10
4 Results  A11
5 Conclusions  A13
Appendix  A13
References  A14

B Distribution Preserving Quantization with Dithering and Transformation  B1

1 Introduction  B1
2 Quantization Scheme  B3
   2.1 Transformation  B4
   2.2 Asymptotical Optimality  B6
   2.3 Example: Gaussian Source  B6
3 Application  B6
4 Conclusions  B7
Appendix A. Proof of Proposition 1  B9
References  B10

C On Distribution Preserving Quantization C1

1 Introduction  C1
2 Definition of DPQ  C4
   2.1 Simple Example of DPQ  C7
   2.2 DPQ Derived from Any Quantizer  C8
   2.3 Scope of This Article  C9
3 Distribution Preserving Rate-Distortion Function  C9


   3.1 DP-RDF for Gaussian Distributions and MSE  C12
   3.2 Relationship between DP-RDF and RDF  C14
4 Transformation-Based DPQ  C17
   4.1 Quantization Scheme  C17
   4.2 Properties of Transformation-Based DPQ  C20
   4.3 Asymptotic Properties w.r.t. High Rates  C22
   4.4 Asymptotic Properties w.r.t. High Dimensionality  C23
5 Achievability of DP-RDF for Gaussian Distributions and MSE  C26
6 Conclusions  C28
Appendix A. Proof of Proposition 6  C28
References  C30

D Asymptotically Optimal Distribution Preserving Quantization for Stationary Gaussian Processes  D1

1 Introduction  D1
2 DP-RDF for Stationary Gaussian Processes and MSE  D3
   2.1 Background  D3
   2.2 New Results  D6
3 Asymptotically Optimal DPQ for Stationary Gaussian Processes and MSE  D11
4 Simplified Scheme Based on Pre- and Post-Filtering  D17
5 Optimal Rate-MSE Trade-off for PSD-PQ  D20
6 Applicable Scheme Based on Pre- and Post-Filtered DPCM  D22
7 Application  D24
8 Evaluation and Results  D27
9 Conclusions  D30
References  D30

E Multiple Description Distribution Preserving Quantization E1

1 Introduction  E1
2 Problem Formulation  E3
   2.1 Background  E3
   2.2 Multiple Description Distribution Preserving Quantization  E5
3 Two-Description DPQ  E6
   3.1 Construction of Quantizers  E7
   3.2 Index Assignment  E8
   3.3 Subtractive Dithering  E9
   3.4 Distribution Preserving Transformations  E13
4 Analysis  E15
   4.1 High-Rate Analysis  E15
   4.2 Experimental Evaluation  E17
5 Application  E19
   5.1 Robust Audio Coder  E19


   5.2 Listening Test  E23
6 Conclusions  E23
Appendix. High-Rate Analysis of Side Quantizers  E24
References  E26

F Sequential Entropy Coding of Quantization Indices with Update Recursion on Probability  F1

1 Introduction  F1
2 Model of Lossy Coding Systems  F3
3 Exact Update Recursion  F5
4 Parametric Update Recursion  F6
5 Application to Linear Stochastic Systems  F7
   5.1 Update Recursion with Type-I Side Information  F8
   5.2 Update Recursion with Type-II Side Information  F10
6 Experiments  F12
   6.1 Experiment I  F12
   6.2 Experiment II  F17
7 Conclusions  F20
Appendix I. Construction of Linear Stochastic Systems for ARMA Processes  F20
Appendix II. Conditional Mean and Covariance Matrix of Normal Distributions  F21
References  F22


Notations

The notations throughout the dissertation follow the conventions below, except where otherwise stated.

(A, 𝒜, µ)  The probability space with sample space A, σ-algebra 𝒜, and probability measure µ
Pr{A}  The probability of event A
Pr{A|B}  The conditional probability of event A given event B
F_X(x)  The cumulative distribution function (c.d.f.) of random vector X
F_{X|Y}(x|y)  The conditional c.d.f. of random vector X given random vector Y
F_X^{-1}(x)  The inverse c.d.f. of random vector X
F_{X|Y}^{-1}(x|y)  The inverse conditional c.d.f. of random vector X given random vector Y
p_X(x)  The probability mass function (p.m.f.) of discrete random vector X
p_{X|Y}(x|y)  The conditional p.m.f. of discrete random vector X given random vector Y
f_X(x)  The probability density function (p.d.f.) of continuous random vector X
f_{X|Y}(x|y)  The conditional p.d.f. of continuous random vector X given random vector Y
H(X)  The entropy of discrete random vector X
H(X|Y)  The conditional entropy of discrete random vector X given random vector Y
h(X)  The differential entropy of continuous random vector X
h(X|Y)  The conditional differential entropy of continuous random vector X given random vector Y
I(X;Y)  The mutual information between random vectors X and Y


X ⊥ Y | Z  Random vector X is conditionally independent of random vector Y given random vector Z
E{X}  The expectation of random vector X
E{X|Y}  The conditional expectation of random vector X given random vector Y
var{X}  The variance of random variable (r.v.) X
var{X|Y}  The conditional variance of r.v. X given random vector Y
cov{X}  The covariance matrix of random vector X
N(x; µ, Σ)  The Gaussian p.d.f. with mean µ and covariance matrix Σ; x is the argument of the p.d.f., which can be omitted when the context makes it clear
U(x; a, b)  The univariate uniform p.d.f. with left boundary a and right boundary b; x is the argument of the p.d.f., which can be omitted when the context makes it clear
O(g(x))  A function that converges to zero equally fast or faster than g(x)
f(x) ∝ g(x)  Function f(x) is proportional to function g(x)
ḟ(x)  The first derivative of function f(x)
[f(x)]_a^b  f(b) − f(a)
f^{-1}(x)  The inverse of function f(x)
Γ(x)  The Gamma function
det(C)  The determinant of matrix C
C^T  The transpose of matrix C
R^k  The k-dimensional Euclidean space
⟨x, y⟩  The inner product of two vectors x and y
‖x‖_p  The p-norm of vector x ∈ R^k
‖x‖  The Euclidean norm of x ∈ R^k, i.e., ‖x‖_2
Vol(G)  The volume of region G
B_k(r)  The k-dimensional ball with radius r
|A|  The cardinality of set A
A ∪ B  The union of set A and set B
A ∩ B  The intersection of set A and set B
A \ B  The relative complement of set B in set A
A × B  The Cartesian product of set A and set B
N  The set of all natural numbers
Z  The set of all integers


Part I

Introduction


Introduction

To some extent, our mobile phones, digital cameras, camcorders, and many other audio-visual devices perform the same task: transforming a sound or a scene into a storable or transmittable "cipher", which can then be used to reproduce the original sound or scene. Two major concerns in this process are the cost of storage or transmission, and the perceived quality of the reproduction. These two aspects are generally in conflict with each other: a lower cost usually means worse quality. An obvious question is what the best quality under a limited cost is, and how to achieve it. Considered from a technical viewpoint, this is the lossy source coding problem.

Lossy source coding has its roots in digital communication, in which a source signal is encoded into binary (0-1) sequences and stored on media or transmitted over networks. The unit for the length of a binary sequence is the bit. To evaluate the efficiency of a source coding system, which usually operates on a source of arbitrary duration or with an arbitrary number of samples, a commonly considered measure is the rate, defined as the length of the binary sequence produced by the coding system on a per-unit-time or per-source-sample basis.
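As a quick numerical illustration of the rate definition (the figures below are hypothetical, chosen only for this example): for L bits spent on N samples at sampling frequency f_s,

```latex
R = \frac{L}{N}\ \text{bits per sample}, \qquad R_{\text{time}} = R \cdot f_s\ \text{bits per second}.
```

For instance, a coder that spends L = 32000 bits on N = 16000 samples has a rate of 2 bits per sample; at f_s = 16 kHz, this corresponds to 32 kbit/s.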

In 1948, Claude Elwood Shannon published his famous "A Mathematical Theory of Communication" [1], which described the lossy source coding problem as "the rate for a source relative to a fidelity evaluation". The idea was further developed in his 1959 paper [2], giving birth to rate-distortion theory, which has been regarded as the foundation of lossy source coding. In rate-distortion theory, a distortion measure between a source signal and its reconstruction is used to evaluate the quality of the reconstruction, thereby translating lossy source coding into the mathematical problem of minimizing the rate under a constraint on distortion (or minimizing the distortion under a constraint on rate). The most significant contribution of rate-distortion theory is that it provides bounds on the best possible trade-off between rate and distortion.
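A standard illustration of such a bound (a textbook result, not specific to this dissertation): for an i.i.d. Gaussian source with variance σ² under the mean-squared-error distortion, the rate-distortion function is

```latex
R(D) =
\begin{cases}
\dfrac{1}{2}\log_2 \dfrac{\sigma^2}{D}, & 0 < D \le \sigma^2,\\[4pt]
0, & D > \sigma^2,
\end{cases}
```

so every additional bit per sample reduces the achievable distortion by a factor of four (about 6 dB).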

Extensive efforts from numerous researchers over a period of more than half a century have made lossy source coding a highly successful discipline with many advances in both theory and practice. However, the lossy source coding problem is not yet completely solved. Challenges remain in the coding of perceptually relevant signals (PRS), e.g., sound and images, which constitute a major part of the objects that lossy source coding is concerned with. In coding PRS, the quality of the reconstruction is inevitably related to human perception, of which the existing knowledge is limited. Many efforts have been made to characterize perceived quality in the form of distortion measures that are compatible with rate-distortion theory. This paradigm and the resulting technologies have proved to be effective, however, usually only when the rate is relatively high. At low rates, rate-distortion optimized methods usually lead to poor perceived quality.

Coding technologies that are aimed specifically at low rates do exist, but they are generally not based on the formal principles of rate-distortion theory. In general, these methods reconstruct the source by synthesizing a signal according to a model. The model captures the most perceptually relevant features of the source signal, which are maintained in the reconstruction. The synthesis is often inconsistent with the notion of rate-distortion optimization. In audio coding, for example, rate-distortion theory suggests that, to achieve the minimum mean squared error, frequency bands with small power should be suppressed, while from the synthesis point of view, these bands should maintain their original spectral structure.
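The band-suppression tendency mentioned above can be made concrete with the classic reverse water-filling result for independent Gaussian components (a standard result, quoted here for illustration): with component variances σ_k² and a water level θ chosen to meet the total distortion budget, the optimal per-component distortion and rate are

```latex
D_k = \min(\theta, \sigma_k^2), \qquad
R_k = \max\!\left(0,\ \frac{1}{2}\log_2\frac{\sigma_k^2}{\theta}\right),
```

so components (frequency bands) with σ_k² ≤ θ receive zero rate and are reconstructed as zero, i.e., they are suppressed rather than synthesized.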

Because of the disparity between coding for high rates and for low rates, a unified objective, which maximizes the perceived quality for the lossy coding of PRS at any rate, is desirable. Such an objective may convey new insights into lossy coding of PRS and thus improve the perceived quality of existing methods at different rates. Moreover, it may lead to a coding method that can operate over a large range of rates and achieve reasonable quality at any rate. Such a flexible coding method is of great significance for today's highly heterogeneous network environments. This dissertation is devoted to the investigation of such an objective and the corresponding source coding methods for PRS.

A good objective for lossy coding of PRS should be:

1. Meaningful: A good objective should exhibit sensible behavior under various circumstances. For example, an increasing rate must correspond to better perceived quality, which classic rate-distortion theory already facilitates. However, the tendency of a rate-distortion optimization to remove frequency bands is not very reasonable;

2. Interpretive: A good objective should be able to explain the successes or failures of existing methods. Classic rate-distortion theory cannot explain the success of many low-rate coders and hence is not sufficiently interpretive. However, rate-distortion theory provides an efficient guideline for high-rate coders. Therefore, the new objective that we are aiming at is likely to resemble rate-distortion theory at high rates;


3. Universal: A good objective should cover as many scenarios as possible. These scenarios include different sources. Extensive research has been performed to quantify the perceived quality of signals of a particular type, e.g., speech, audio, images, and video. These efforts are not universal enough to make good objectives. Another shortcoming of these quality measures is that they are usually too complicated to be mathematically tractable, which leads to the following consideration;

4. Analyzable: A good objective should facilitate mathematical analysis. In fact, an ultimate goal of constructing an objective is to enable mathematics-aided design of lossy coding systems. For analysis purposes, an important consideration is the definition of optimality. A success of rate-distortion theory is that it establishes the optimal performance over all lossy coding schemes;

5. Feasible: A good objective should be practically relevant and facilitate implementation. PRS coding is a practical problem, and the effort of looking for an objective is meant to guide the design of practical systems. Therefore, the ease of implementing such an objective is also important.

The objective proposed in this dissertation is to seek the optimal trade-off between rate and distortion, under a constraint that statistical properties of the reconstruction remain similar to those of the source. With this notion, a lossy coding system is expected to yield reconstructions that are always recognized as being in the same class as the source and that can approach a particular source sample as the rate increases. For example, in the context of speech coding, the reconstruction should always sound like speech, while its fidelity to a given source sample depends on the rate. The proposed objective and its realizations in coding systems will be detailed later. Let us first evaluate the appropriateness of this concept from the aforementioned viewpoints.

1. The objective is meaningful. Perceived signal quality is determinedby neural activities. It is widely accepted that neural processing isinfluenced by environmental statistics [3]. As an example, the associ-ation of a speech signal to a certain speaker, emotion, and/or mean-ing depends on the statistical properties of the given speech and ofall speech utterances that a listener has ever listened to. Therefore,two stimuli have to share certain statistical properties to be judgedas being similar to each other. Meanwhile, tiny differences betweenparticular stimuli can also be discerned. For example, suppose thata speaker repeats a word multiple times; though the utterances allsound similar, it is always possible to detect differences among them.


Therefore, placing the preservation of source statistical properties as a premise for a rate-distortion optimization is an intuitively efficient way to achieve reasonable perceived quality for the coding of PRS, at any rate.

2. The objective is interpretive. On one hand, it converges to the notion of classic rate-distortion theory at high rates, when the statistical properties of the source are automatically preserved. On the other hand, it can explain the successes of low-rate coding techniques, which, in principle, aim to reconstruct some chosen statistical properties of the source.

3. The objective is universal. The quantities used in this objective are rate, distortion, and statistical properties, which are sufficiently abstract to capture a large scope of source coding scenarios.

4. The objective is tractable. This objective, in a simplified perspective, merely imposes a constraint on classic rate-distortion theory. In fact, it possesses good mathematical properties. This dissertation includes a theoretical treatment of the proposed objective.

5. The objective is feasible. As stated, the new objective leads to conventional methods at both high and low rates. Furthermore, appropriate realizations of the objective enable a smooth transition between the two ends. In this dissertation, a number of implementations are considered.

This dissertation is titled “distribution preserving quantization” (DPQ), which is the name of a featured coding method under the proposed objective. The modifier “distribution preserving” indicates that it preserves statistical properties of the source, and “quantization” implies that it applies a rate-distortion optimization.

This dissertation is organized as follows. First, a brief survey of rate-distortion theory and studies of perceived signal quality is presented. Existing coding methods for PRS are also reviewed. During the review, the relationship between theory and practice is emphasized. Finally, the new lossy coding objective and the resulting coding methods are introduced. Detailed discussions of the new theory and associated methods are included in a number of papers in Part II.

1 Rate-Distortion Theory and Quantization

Since the birth of rate-distortion theory, many contributions have been made both to extend the theory to various situations and to find structural designs that approach the theoretically optimal rate-distortion trade-off. For a


historical review of rate-distortion theory, the reader is referred to [4]. Here, some important results are recapitulated. Many of them form a foundation for the theoretical treatment of the new coding strategy.

1.1 Rate-Distortion Theory

Rate-distortion theory has its roots in probability theory. A source X and its reconstruction X̂ are regarded as consisting of random variables (r.v.). A lossy source coding method encodes any realization of the source into a binary sequence, from which the rate gains its definition. Any binary sequence can be decoded into a reconstruction. A distortion measure is used to describe the quality of a particular reconstruction with respect to (w.r.t.) its original form. In the context of probability theory, we can define the mathematical expectation of the rate and the distortion measure of a lossy coding method. Throughout this dissertation, the terms “rate” and “distortion” are both used in such an average sense.

Rate-distortion theory defines a so-called rate-distortion function (RDF) [5], which describes the minimum possible rate of any lossy source coding method, subject to a constraint on distortion. Depending on the source type, e.g., with or without memory, the RDF can take different forms. Here, a relatively general definition is adopted, which is defined for discrete-time processes with memory.

Definition 1 (RDF [5]) Given a discrete-time process {X_t}_{t=1}^∞ and a sequence of distortion measures {ρ_t}_{t=1}^∞ satisfying

    \rho_k(x, \hat{x}) = k^{-1} \sum_{i=1}^{k} \rho_1(x_i, \hat{x}_i), \quad x = (x_1, \dots, x_k), \; \hat{x} = (\hat{x}_1, \dots, \hat{x}_k),    (1)

the rate-distortion function (RDF) for {X_t}_{t=1}^∞ and {ρ_t}_{t=1}^∞ is defined as

    R(D) = \lim_{k \to \infty} k^{-1} \inf_{f_{\hat{X}|X}(\cdot|\cdot) \in Q_k(D)} I(X; \hat{X}), \quad X = (X_1, \dots, X_k), \; \hat{X} = (\hat{X}_1, \dots, \hat{X}_k),    (2)

where Q_k(D) consists of all conditional probability distributions for which the expectation of the k-th distortion measure is bounded by D, i.e.,

    Q_k(D) = \left\{ f_{\hat{X}|X}(\cdot|\cdot) : \mathrm{E}\{\rho_k(X, \hat{X})\} \le D \right\}.    (3)

A sequence of distortion measures satisfying (1) is called a single-letter fidelity criterion. The single-letter nature of the distortion measures is critical for the definition of the RDF. It is also essential for the RDF to be the minimum achievable rate of any lossy coding approach, which is illustrated by the source coding theorem.


Theorem 1 (Source coding theorem [5]) For a stationary and ergodic discrete-time process {X_t}_{t=1}^∞, let R(·) denote the RDF for {X_t}_{t=1}^∞ and a single-letter fidelity criterion {ρ_t}_{t=1}^∞, and assume there exists a y such that E{ρ_1(X_1, y)} < ∞. Then, for any D ≥ 0 such that R(D) < ∞ and any ε > 0, there exist a k > 0 and a lossy source code that operates on a k-tuple of the source, for which the rate is less than R(D) + ε and the distortion w.r.t. ρ_k is less than D + ε. No lossy source code with a rate less than R(D) can achieve a distortion less than D.

There are two aspects to the source coding theorem: the minimality and the achievability of the RDF. The fact that no source code can achieve a smaller rate than the RDF at any distortion level is almost a straightforward consequence of the definition of the RDF, while the proof of the achievability is non-trivial. A commonly used technique to prove the achievability is the random coding argument, in which an ensemble of randomly generated lossy source codes is considered. The average distortion of the ensemble can reach the RDF, so there is at least one source code that achieves the RDF. It is worth noting that the source coding theorem applies not only to a particular source code, but also to a source code ensemble. It is also important to see that the RDF and the source coding theorem can only be valid if the distortion measures are 1) defined on the samples of the source and the reconstruction, and 2) defined in a single-letter manner.

Among all possible combinations of sources and distortion measures, Gaussian sources with mean squared error (MSE) distortion have garnered the most interest. This is due to the ubiquity of Gaussian distributions in nature and the intuitiveness of the MSE, and also due to the ease of mathematical analysis.

RDF for i.i.d. Gaussian sources and MSE

When {X_t}_{t=1}^∞ are identically and independently distributed (i.i.d.), given any k-tuple X, it follows that I(X; X̂) ≥ Σ_{i=1}^k I(X_i; X̂_i), where the equality holds when f_{X̂|X}(x̂|x) = Π_{i=1}^k f_{X̂_i|X_i}(x̂_i|x_i) (see, e.g., [6, Chapter 10.3.3]). To obtain the infimum of I(X; X̂) under the constraint that E{ρ_k(X, X̂)} ≤ D, we may find the infimum of each I(X_i; X̂_i) under the constraints that E{ρ_1(X_i, X̂_i)} ≤ D_i and Σ_{i=1}^k D_i ≤ kD, and then let f_{X̂|X}(·|·) be the product of the conditional probabilities f_{X̂_i|X_i}(·|·) that achieve the infimum. It turns out that (see, e.g., [6, Chapter 10.2]) the RDF for an i.i.d. process can be defined on an individual sample of the process, i.e.,

    R(D) = \inf_{f_{\hat{X}_1|X_1}(\cdot|\cdot) \in Q_1(D)} I(X_1; \hat{X}_1).    (4)


Figure 1: Backward channel model that achieves the RDF for Gaussian sources and MSE: the source is the sum of the reconstruction and an independent noise.

Since I(X_1; X̂_1) = h(X_1) − h(X_1|X̂_1) ≥ h(X_1) − h(X_1 − X̂_1), with equality when X_1 − X̂_1 is independent of X̂_1, minimizing I(X_1; X̂_1) amounts to achieving such independence and maximizing h(X_1 − X̂_1). With an MSE constraint E{(X_1 − X̂_1)²} ≤ D, h(X_1 − X̂_1) is maximized when X_1 − X̂_1 is Gaussian distributed (see, e.g., [6, Theorem 8.6.5]). When X_1 is Gaussian distributed, it is possible to have X_1 − X̂_1 both Gaussian and independent of X̂_1. The RDF for an i.i.d. Gaussian source and MSE can then be derived as (see, e.g., [5, Theorem 10.3.2])

    R(D) = \begin{cases} \frac{1}{2} \log_2 \frac{\sigma^2}{D}, & D \le \sigma^2, \\ 0, & D > \sigma^2, \end{cases}    (5)

where σ² is the variance of the source Gaussian distribution.

To achieve this RDF, it is required that the difference between the source and its reconstruction be independent of the reconstruction. Such a condition can be described as a “backward channel”, as shown in Figure 1.
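As a quick numerical companion, the closed-form RDF (5) is easy to evaluate. The following Python sketch (illustrative only, not part of the dissertation) returns the rate in bits per sample and shows that each additional bit reduces the achievable MSE by a factor of four.

```python
import math

def gaussian_rdf(variance: float, distortion: float) -> float:
    """RDF (5) for an i.i.d. Gaussian source under MSE:
    R(D) = (1/2) log2(sigma^2 / D) for D <= sigma^2, and 0 otherwise."""
    if distortion >= variance:
        return 0.0  # coding the mean alone already meets the target
    return 0.5 * math.log2(variance / distortion)

# Each extra bit per sample reduces the achievable MSE by a factor of 4:
r1 = gaussian_rdf(1.0, 0.25)    # 1.0 bit/sample
r2 = gaussian_rdf(1.0, 0.0625)  # 2.0 bits/sample
```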

RDF for stationary Gaussian processes and MSE

The RDF for a stationary Gaussian process and MSE can be derived using the Karhunen-Loève transform (KLT) and Szegő's theorem on Toeplitz matrices.

For any k-dimensional vector X taken from the source process, the covariance matrix is Toeplitz. The KLT performs Y = UX such that Y consists of independent Gaussian r.v.'s, whose variances are the eigenvalues of the covariance matrix of X. Letting Ŷ be a reconstruction of Y, the reconstruction of X is the inverse KLT of Ŷ, i.e., X̂ = U^T Ŷ. It follows that I(X; X̂) = I(Y; Ŷ) and ‖X − X̂‖² = ‖Y − Ŷ‖².

Since the elements of Y are mutually independent, it again suffices to use the RDF for each element Y_i, as for the i.i.d. source. However, to attain the global infimum, a proper distortion allocation is needed. The optimal distortion allocation solves

    \min \; k^{-1} \sum_{i=1}^{k} R_i(D_i) \quad \text{subject to} \quad k^{-1} \sum_{i=1}^{k} D_i \le D,    (6)


where R_i(D_i) follows (5) with σ_i² being the variance of Y_i. Using the Lagrange multiplier method, it can be shown (see, e.g., [6, Chapter 10.3.3]) that the optimal distortion allocation is D_i = min{σ_i², θ}, where θ is chosen to satisfy the constraint on the total distortion. Accordingly, the rate allocation is R_i = \max\{ \frac{1}{2} \log_2 \frac{\sigma_i^2}{\theta}, 0 \}.

Increasing the dimensionality of X, the RDF for a stationary Gaussian process and MSE can be obtained. Using Szegő's theorem on Toeplitz matrices (see, e.g., [7]), the per-dimension rate becomes

    R = \lim_{k \to \infty} k^{-1} \sum_{i=1}^{k} R_i = \frac{1}{4\pi} \int_0^{2\pi} \max\left\{ \log_2 \frac{P(\omega)}{\theta}, 0 \right\} d\omega,    (7)

where P(ω) denotes the power spectral density (PSD) of the source Gaussian process. Correspondingly, the per-dimension MSE is

    D = \lim_{k \to \infty} k^{-1} \sum_{i=1}^{k} D_i = \frac{1}{2\pi} \int_0^{2\pi} \min\{P(\omega), \theta\} \, d\omega.    (8)

For D ≤ (1/2π) ∫_0^{2π} P(ω) dω, (7) and (8) define the RDF for a stationary Gaussian process and MSE, while for greater distortion the RDF is zero. When the RDF for a stationary Gaussian process and MSE is attained, the source, the reconstruction, and their difference also satisfy the backward channel condition in Figure 1. It can also be noticed that, for frequency components with a power below θ, no rate is allocated, the distortion equals the source power, and the reconstruction has zero power. This effect is known as reverse water-filling [6].
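The reverse-water-filling solution can be evaluated numerically from samples of the PSD. The sketch below (plain Python; the function name and the bisection search are my own choices, not from the thesis) finds the water level θ for a target per-dimension distortion and evaluates (7) and (8) with the integrals replaced by averages over a uniform frequency grid.

```python
import math

def reverse_water_filling(psd, target_d, iters=100):
    """Find the water level theta with mean(min(P, theta)) = target_d,
    then evaluate the rate (7) and distortion (8) on a uniform grid
    of PSD samples covering [0, 2*pi)."""
    assert 0.0 < target_d <= sum(psd) / len(psd)
    lo, hi = 0.0, max(psd)
    for _ in range(iters):  # distortion is non-decreasing in theta
        theta = 0.5 * (lo + hi)
        d = sum(min(p, theta) for p in psd) / len(psd)
        lo, hi = (theta, hi) if d < target_d else (lo, theta)
    # Components with power below theta get no rate (reverse water-filling):
    rate = sum(0.5 * math.log2(p / theta) for p in psd if p > theta) / len(psd)
    return rate, d

# A flat PSD reduces to the i.i.d. case (5): theta = D and R = 0.5*log2(1/D).
rate, dist = reverse_water_filling([1.0] * 16, 0.25)
```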

To summarize, rate-distortion theory establishes the primary consideration of lossy source coding: to seek the optimal rate-distortion trade-off. Rate-distortion theory also defines a guideline for the rate-distortion optimization by stating that the RDF is an achievable lower bound. The achievability of the RDF is proven with a random coding argument, which is, however, of limited value in constructing real-world coding schemes. Over the decades, researchers have made successful efforts in finding structured lossy coding methods that approach the RDF under different circumstances. In the following, the major contributions are reviewed.

1.2 Quantization

Most structured lossy source coding methods are carried out in the form of quantization.

Quantization maps a source X into an index I and then maps the index to a reconstruction X̂. The mapping from the source to the index is many-to-one and establishes a space partition. It segments the sample space of


the source, denoted as X, into a countable set of quantization cells {S_i ⊆ X, i ∈ I ⊆ N}, which satisfies 1) S_i ∩ S_j = ∅ for i ≠ j, and 2) ∪_{i∈I} S_i = X. The index I indicates which cell a realization of X belongs to. Each index i is associated with a reconstruction point x̂_i, which is used as the reconstruction for all realizations of X that fall in S_i. Given a distortion measure d(·, ·), the distortion D of a quantizer is defined as the expectation of d(·, ·) on a source and its reconstruction:

    D = \mathrm{E}\{d(X, \hat{X})\} = \sum_{i \in I} \Pr\{X \in S_i\} \, \mathrm{E}\{d(X, \hat{x}_i) \mid X \in S_i\}.    (9)

Another important facet of quantization is the rate. The rate is related to how the quantization index is encoded into a binary sequence. Typically, each index is represented by a binary codeword. A binary sequence that is formed by arbitrarily concatenating the codewords must be uniquely decodable. To fulfill this, the lengths of the codewords must satisfy Kraft's inequality [8]: Σ_{i∈I} 2^{−l_i} ≤ 1, where l_i gives the length of the codeword for index i. The rate R is then defined as the expected codeword length, normalized by the dimensionality of the source, say k, i.e.,

    R = k^{-1} \sum_{i \in I} \Pr\{X \in S_i\} \, l_i.    (10)

The goal of quantization is to achieve a good rate-distortion trade-off by choosing a proper partition and proper reconstruction points. Observing (9), we can see that the reconstruction point of a quantization cell should be the point that minimizes the expected distortion conditioned on the cell, i.e.,

    \hat{x}_i^\star = \operatorname{argmin}_{\hat{x} \in X} \mathrm{E}\{d(X, \hat{x}) \mid X \in S_i\}.    (11)

We refer to this as the optimal reconstruction given a quantization cell. When d(·, ·) represents the squared error, it follows that x̂_i = E{X | X ∈ S_i}, i.e., the reconstruction is the minimum mean squared error (MMSE) estimate of the source given the quantization cell it belongs to.

Applying Kraft’s inequality to (10), one can show that the rate of aquantizer is lower bounded by the entropy of the index, i.e.,

R ≥ k−1H(I) = −k−1∑

i∈I

Pr{X ∈ Si} log2 Pr{X ∈ Si}. (12)

Using entropy codes, e.g., Shannon codes [1], Huffman codes [9], and arithmetic coding [10], the rate can approach this lower bound. In particular, when infinitely many independent realizations of the source are coded jointly, the rate per realization can get arbitrarily close to the lower bound. The entropy-coded codeword for the i-th quantization cell has a length that


is close to −log₂ Pr{X ∈ S_i}, which generally varies among cells. Quantization with codewords of variable lengths is known as variable-rate quantization. In contrast, fixed-rate quantization refers to a quantization method that forces the codeword length to be a constant. It follows from Kraft's inequality that the rate of a fixed-rate quantizer satisfies R ≥ k^{-1} log₂ |I|. With a large dimensionality, the rate can be made arbitrarily close to k^{-1} log₂ |I|.
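The entropy bound (12) and its near-achievability can be illustrated with Shannon code lengths, l_i = ⌈−log₂ p_i⌉, which always satisfy Kraft's inequality and give an expected length within one bit of the entropy (a toy illustration, not from the dissertation):

```python
import math

def shannon_code_lengths(probs):
    """Codeword lengths l_i = ceil(-log2 p_i); by construction
    sum_i 2^{-l_i} <= 1, so a uniquely decodable code exists."""
    return [math.ceil(-math.log2(p)) for p in probs]

probs = [0.5, 0.25, 0.125, 0.125]            # cell probabilities Pr{X in S_i}
lengths = shannon_code_lengths(probs)        # -> [1, 2, 3, 3]
kraft = sum(2.0 ** -l for l in lengths)      # <= 1
entropy = -sum(p * math.log2(p) for p in probs)       # lower bound in (12)
avg_len = sum(p * l for p, l in zip(probs, lengths))  # achieved rate (k = 1)
# For dyadic probabilities the bound (12) is met with equality: 1.75 bits.
```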

In general, a variable-rate quantizer leads to better rate-distortion performance than a fixed-rate quantizer. The new lossy coding methods proposed in this dissertation are mainly based on variable-rate quantization, which, therefore, will be the focus of our discussion. Fixed-rate quantization, however, will also be discussed for completeness.

Fixed-rate quantization

Compared to variable-rate quantization, fixed-rate quantization is generally simpler and more robust to channel errors. Despite a loss in rate-distortion performance in general, fixed-rate quantization can be asymptotically as efficient as variable-rate quantization as the dimensionality increases. In the random coding argument proof for the achievability of the RDF, the commonly used random code is in fact a fixed-rate quantizer. The efficiency of fixed-rate quantization can be explained by the so-called asymptotic equipartition property (AEP), which basically states that all typical sequences of a sufficiently large length are approximately equally likely.

For fixed-rate quantization, the rate is invariant under changes of the partition, as long as the number of quantization cells is fixed. Minimizing the distortion (9) implies that each quantization cell should be a Voronoi region, consisting of all points that are closer to its reconstruction point than to any other, i.e.,

    S_i^\star = \{x \in X : d(x, \hat{x}_i) \le d(x, \hat{x}_j), \; \forall j \in I\},    (13)

where ties between adjacent regions can be broken in an arbitrary manner. Let us call (13) the optimal cell for a fixed-rate quantizer given the reconstruction points. It can be seen that a fixed-rate quantizer is uniquely determined by its reconstruction points, which are much easier to handle than the quantization cells.

Given a target rate, an optimized fixed-rate quantizer can be obtained by the Lloyd-Max algorithms [11–13], which apply iterative optimization based on the aforementioned conditions of optimal reconstruction (11) and optimal cell (13). These algorithms can, in principle, deal with arbitrary sources, distortion measures, and rates. However, they have two major shortcomings: 1) the methods may lead to local optima, and 2) the algorithms are computationally complex [14]. Modifications to the Lloyd-Max algorithms exist to solve the first problem (see, e.g., [15]), but they come at the price of a further increase in complexity.
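For the squared error, one Lloyd iteration can be written in a few lines: assign samples with the optimal-cell rule (13), then update points with the optimal-reconstruction rule (11), i.e., the conditional mean. The sketch below trains on sample data rather than a known density; it is a minimal illustration, not the thesis's implementation.

```python
import random

def lloyd_max(samples, num_cells, iters=50):
    """Iterate nearest-point partition (13) and centroid update (11)
    for a scalar squared-error quantizer, trained on sample data."""
    points = sorted(random.sample(samples, num_cells))  # initial codebook
    for _ in range(iters):
        cells = [[] for _ in range(num_cells)]
        for x in samples:  # optimal cell: nearest reconstruction point
            i = min(range(num_cells), key=lambda j: (x - points[j]) ** 2)
            cells[i].append(x)
        points = [sum(c) / len(c) if c else p           # conditional mean;
                  for c, p in zip(cells, points)]       # keep points of empty cells
    return points

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(5000)]
codebook = lloyd_max(data, 4)
mse = sum(min((x - p) ** 2 for p in codebook) for x in data) / len(data)
```

For a unit-variance Gaussian, the optimal 4-level quantizer is known to reach an MSE of roughly 0.12, far above the variance-matched RDF bound at 2 bits, which illustrates the loss of scalar fixed-rate quantization.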


Analytical expressions for optimal fixed-rate quantization exist in high-rate theory (or high-resolution theory) [16–19]. To arrive at these, a cell density is introduced to describe a quantizer:

    \lambda(x) = \mathrm{Vol}(S_i)^{-1}, \quad x \in S_i,    (14)

where Vol(·) returns the volume of a region. The cell density should integrate to the number of cells, yielding a relation to the rate as

    2^{kR} \approx |I| = \int_{X} \lambda(x) \, dx.    (15)

The goal is to find the optimal cell density. Assuming that the quantization cells are so tiny that the probability density of the source is effectively constant within every cell, the distortion contributed by S_i follows

    \mathrm{E}\{d(X, \hat{x}_i) \mid X \in S_i\} \approx \mathrm{Vol}(S_i)^{-1} \int_{S_i} d(x, \hat{x}_i) \, dx,    (16)

in which the source probability distribution becomes irrelevant.

To express the distortion of the quantizer (9) in terms of the cell density, a remaining difficulty is that the differences in cell shapes have to be taken into account. A simple way to overcome this problem is to adopt Gersho's conjecture [19]. Let us confine the source sample space to R^k and let the distortion measure be the per-dimension r-th power Euclidean norm. Gersho's conjecture hypothesizes that, to achieve the global minimum distortion, all cells are congruent to a basic geometry. In particular, after a normalization, the distortion for every cell becomes a constant

    C(k, r) = k^{-1} \mathrm{Vol}(S_i)^{-\frac{k+r}{k}} \int_{S_i} \|x - \hat{x}_i^\star\|^r dx, \quad \forall i \in I,    (17)

and the distortion of the quantizer (9) becomes

    D \approx \sum_{i \in I} \Pr\{X \in S_i\} \, C(k, r) \, \mathrm{Vol}(S_i)^{\frac{r}{k}} \approx C(k, r) \int_{\mathbb{R}^k} f_X(x) \, \lambda(x)^{-\frac{r}{k}} dx.    (18)

Minimizing (18) subject to (15), one can derive the optimal cell density for fixed-rate quantization as [19]

    \lambda(x) = |I| \, \frac{f_X(x)^{\frac{k}{k+r}}}{\int_{\mathbb{R}^k} f_X(x)^{\frac{k}{k+r}} dx}.    (19)


Substituting (19) into (18), one can further obtain the rate-distortion relation for optimal fixed-rate quantization at high rates:

    R \approx k^{-1} \log_2 |I| \approx \frac{1}{r} \log_2 C(k, r) + \frac{k+r}{kr} \log_2 \int_{\mathbb{R}^k} f_X(x)^{\frac{k}{k+r}} dx - \frac{1}{r} \log_2 D.    (20)

For i.i.d. Gaussian variables with variance σ², it follows that

    \int_{\mathbb{R}^k} f_X(x)^{\frac{k}{k+r}} dx = (2\pi\sigma^2)^{\frac{kr}{2(k+r)}} \left( \frac{k+r}{k} \right)^{\frac{k}{2}}.    (21)

For this case it can then be shown that, as the dimensionality approaches infinity, the rate satisfies

    \lim_{k \to \infty} R = \frac{1}{r} \log_2 C(\infty, r) + \frac{1}{2} \log_2 (2\pi e \sigma^2) - \frac{1}{r} \log_2 D.    (22)

Considering the squared error (r = 2), it is easy to see that (22) matches the RDF for the i.i.d. Gaussian source and MSE, i.e., (5), if C(∞, 2) = (2πe)^{-1}. It is worth noting that, as the dimensionality increases, an i.i.d. Gaussian source can be seen as being uniformly distributed on a sphere [20], which means that the cell density should also be uniform on that sphere.

We will not consider fixed-rate quantization any further, but instead switch to variable-rate quantization, because the techniques involved in this dissertation are mainly based on the latter.

Variable-rate quantization

As mentioned, the rate of a variable-rate quantizer can approach the entropy of the quantization index. Therefore, variable-rate quantization is also referred to as entropy-constrained quantization. An optimal variable-rate quantizer should fulfill the optimal reconstruction condition (11). However, it does not necessarily obey the optimal cell condition of fixed-rate quantization, because the cell arrangement determines not only the distortion but also the rate. To define a locally optimal condition for a cell, an extended criterion η = D + λR can be formulated, where λ is a Lagrange multiplier. Given λ and the reconstruction points, it can be shown that the optimal cell for variable-rate quantization is

    S_i^\star = \{x \in X : d(x, \hat{x}_i) + \lambda l_i \le d(x, \hat{x}_j) + \lambda l_j, \; \forall j \in I\},    (23)

where the codeword lengths satisfy

    l_i = -\log_2 \Pr\{X \in S_i\}.    (24)


With this consideration, an optimized variable-rate quantizer can be obtained by adapting the Lloyd-Max algorithms (see, e.g., [21]). However, the adapted Lloyd-Max algorithms inherit the shortcomings of those for fixed-rate quantization. Moreover, it is unclear how to select λ, since it is difficult to know a priori the rate (and the distortion) that a particular λ will result in.
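A sketch of such an adapted (entropy-constrained) iteration, assuming squared error and training data, is given below. The per-iteration re-estimation of codeword lengths from empirical cell probabilities follows (23)-(24); the function name, the initialization, and the probability floor for empty cells are my own choices, not from the thesis.

```python
import math, random

def entropy_constrained_lloyd(samples, points, lam, iters=30):
    """Alternate the Lagrangian cell rule (23), with l_i = -log2 Pr{cell i}
    as in (24), the centroid rule (11), and re-estimation of the lengths."""
    n = len(points)
    lengths = [math.log2(n)] * n           # start from fixed-rate lengths
    for _ in range(iters):
        cells = [[] for _ in range(n)]
        for x in samples:                  # rule (23): distortion + lam * bits
            i = min(range(n),
                    key=lambda j: (x - points[j]) ** 2 + lam * lengths[j])
            cells[i].append(x)
        probs = [max(len(c), 1) / len(samples) for c in cells]  # floor empties
        lengths = [-math.log2(p) for p in probs]
        points = [sum(c) / len(c) if c else p for c, p in zip(cells, points)]
    rate = sum((len(c) / len(samples)) * l for c, l in zip(cells, lengths))
    dist = sum(sum((x - p) ** 2 for x in c)
               for c, p in zip(cells, points)) / len(samples)
    return points, rate, dist

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(4000)]
pts, rate, dist = entropy_constrained_lloyd(
    data, sorted(random.sample(data, 8)), lam=0.05)
```

Sweeping λ from small to large traces out an operational rate-distortion curve, which is one practical answer to the λ-selection problem mentioned above.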

High-rate theory can also be applied to variable-rate quantization [18, 22, 23]. Following the same procedure as for fixed-rate quantization, the distortion remains the same as in (18), while the cell density is related to the rate by

    R \approx k^{-1} H(I) = -k^{-1} \sum_{i \in I} \Pr\{X \in S_i\} \log_2 \Pr\{X \in S_i\} \approx -k^{-1} \int_{\mathbb{R}^k} f_X(x) \log_2\left( f_X(x) \, \lambda(x)^{-1} \right) dx = k^{-1} h(X) + k^{-1} \mathrm{E}\{\log_2 \lambda(X)\}.    (25)

Minimizing (18) subject to (25), it can be shown that the optimal cell density for a variable-rate quantizer is a constant (see, e.g., [24]). Specifically,

    \lambda(x) \equiv 2^{H(I) - h(X)}.    (26)

This result suggests that optimal high-rate variable-rate quantization uses a uniform partition, independently of the source probability distribution. A simple scheme that facilitates uniform partitioning is lattice quantization, which will be discussed later.

It follows from (18), (25), and (26) that the rate-distortion relation for an optimal variable-rate quantizer at high rates is

    R \approx k^{-1} H(I) \approx \frac{1}{r} \log_2 C(k, r) + k^{-1} h(X) - \frac{1}{r} \log_2 D.    (27)

In parallel to the final discussion of fixed-rate quantization, let us consider an i.i.d. Gaussian source and the squared error (r = 2), and again assume C(∞, 2) = (2πe)^{-1}. It is easy to see that the rate of the optimal variable-rate quantizer also asymptotically achieves the RDF for the i.i.d. Gaussian source and MSE. It is not surprising that both fixed-rate and variable-rate quantization can approach this RDF with increasing dimensionality, because they both lead to a uniform partition over the sphere on which i.i.d. Gaussian data are concentrated.
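The relation (27) can be probed numerically even at k = 1, where C(1, 2) = 1/12: a uniform scalar quantizer followed by ideal entropy coding sits roughly (1/2)log₂(2πe/12) ≈ 0.254 bits above the Gaussian RDF at the same distortion. The Monte-Carlo sketch below is illustrative only, not part of the dissertation.

```python
import math, random

random.seed(0)
delta = 0.1                         # small step => high-rate regime
samples = [random.gauss(0.0, 1.0) for _ in range(200000)]

# Uniform (lattice) quantization with lattice decoding:
indices = [round(x / delta) for x in samples]
recon = [delta * i for i in indices]
dist = sum((x - y) ** 2 for x, y in zip(samples, recon)) / len(samples)

# Empirical index entropy = rate of an ideal entropy coder, cf. (12):
counts = {}
for i in indices:
    counts[i] = counts.get(i, 0) + 1
n = len(samples)
rate = -sum(c / n * math.log2(c / n) for c in counts.values())

gap = rate - 0.5 * math.log2(1.0 / dist)   # rate above the RDF (5) at dist
# gap is close to 0.5 * log2(2*pi*e/12), about 0.254 bits
```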

Lattice quantization

An intriguing result of high-rate theory is that uniform quantization is asymptotically optimal for variable-rate coding. However, high-rate theory


is based on an indirect description of the quantizer, i.e., the cell density, and hence does not yield a direct design for quantizers. Lattice quantization is a way to address this.

A k-dimensional lattice is a countable subset of R^k that forms a group [25]. A lattice Λ can be seen as generated from a set of vectors {v_i ∈ R^k, i = 1, ..., n ≤ k}, using integer coordinates, i.e.,

    \Lambda = \left\{ \sum_{i=1}^{n} u_i v_i, \; u_i \in \mathbb{Z} \right\}.    (28)

Associated with each lattice point ℓ_i ∈ Λ is a Voronoi region:

    V_i = \{x \in \mathbb{R}^k : d(x, \ell_i) \le d(x, \ell_j), \; \forall \ell_j \in \Lambda\},    (29)

where d(·, ·) denotes a distortion measure and ties between adjacent regions are broken in a systematic manner. If d(·, ·) is a function of the difference of its arguments, i.e., d(a, b) = ρ(a − b), which is called a difference distortion measure, every Voronoi region is a translation of a fundamental region P, which is the Voronoi region of the origin (the origin is always a lattice point). In particular, V_i = P + ℓ_i. A lattice quantizer is comprised of all the Voronoi regions. Therefore, lattice quantization forms a uniform partition when a difference distortion measure is considered.

For each quantization cell, one can take the corresponding lattice point as the reconstruction point. Such a reconstruction is called lattice reconstruction or lattice decoding. This setup fits Gersho's conjecture. Under the high-rate assumption and using the r-th power Euclidean distance as the distortion measure, the rate-distortion relation for optimal variable-rate quantization at high rates, i.e., (27), applies to lattice quantization. A major task of lattice quantization is then to find a lattice such that C(k, r) is minimized. In fact, C(k, r) is determined by the fundamental region of the lattice:

    C(k, r) = k^{-1} \mathrm{Vol}(P)^{-\frac{k+r}{k}} \int_{P} \|x\|^r dx.    (30)

Most of the research on lattice quantization has been focused on the squared error distortion (r = 2). The constant C(k, 2) is often referred to as the normalized second moment of the corresponding lattice. It has been shown [26, 27] that there exists a sequence of lattices with increasing dimensionality such that lim_{k→∞} C(k, 2) = (2πe)^{-1}, as was assumed in the earlier discussion. Such a lattice sequence is known as good for (MSE) quantization. Since the same behavior is found for a ball of infinite dimensionality, the concept of being good for quantization defines a sense of convergence in which a lattice cell approaches a ball.
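For the cubic lattice Z^k the normalized second moment follows directly from (30): the fundamental region is the unit cube, so C(k, 2) = 1/12 for every k, noticeably above the ideal (2πe)^{-1} ≈ 0.0585. A Monte-Carlo check (illustrative, not from the thesis):

```python
import random

def second_moment_cubic(k, n=200000, seed=0):
    """Monte-Carlo estimate of C(k, 2) in (30) for the lattice Z^k.
    The fundamental region is the unit cube centered at the origin
    (Vol(P) = 1), so C(k, 2) = k^{-1} E{||x||^2}, x uniform on the cube."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += sum((rng.random() - 0.5) ** 2 for _ in range(k))
    return total / (n * k)

est = second_moment_cubic(3)   # analytic value: 1/12 = 0.0833...
```

At r = 2, the ratio between 1/12 and (2πe)^{-1} corresponds to about 0.254 bits per sample; this is the space-filling loss that good lattice sequences recover as the dimensionality grows.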

Besides quantization, lattices are also used in packing, covering, channel coding, etc. The application of lattices in AWGN channel coding will be


briefly mentioned, since it is used in Paper C of this dissertation. To transmit a discrete source through an additive white Gaussian noise (AWGN) channel, each source sample is sent as a lattice point. Due to the channel noise, the received sample has an offset from the transmitted point. The source sample is recovered according to which Voronoi region the received sample belongs to. Channel coding is interested in the probability that the source sample is interpreted incorrectly. It has been shown [28] that there exists a sequence of lattices with increasing dimensionality such that the error probability approaches zero, as long as the volume of the Voronoi regions is larger than the volume of a noise “ball”. The noise “ball” is a k-dimensional ball with radius √(kσ_N²), where σ_N² denotes the variance of the channel noise. The condition of zero error probability means that the Voronoi region must cover the sphere on which the noise vectors are concentrated when the dimensionality goes to infinity. A sequence of lattices that fulfills zero error probability is called sphere-bound-achieving [29]. There exists a stricter requirement for lattices, which defines a convergence speed of the error probability, known as Poltyrev's exponent [26]. A lattice sequence that achieves Poltyrev's exponent is called good for (AWGN) channel coding. It is worth noting that the lattice coding for the AWGN channel mentioned above is not strictly a channel code, since the transmission power is unbounded. This problem can be solved by truncation [30] or, more optimally, by lattice nesting [31]. Being good for channel coding also defines a sense in which a lattice cell converges to a ball. It has been proved that there exist lattice sequences that are good for channel coding and quantization simultaneously [32].

According to the earlier discussion, lattice quantization can be asymptotically optimal for quantizing i.i.d. Gaussian sources with MSE distortion, when both the rate and the dimensionality approach infinity. In the following, it will be shown that lattice quantization, together with dithering and filtering, can be optimal for all stationary Gaussian processes at all rates.

Dithered quantization

Dithered quantization was introduced as a method for enhancing the perceptual quality of conventional quantization in image compression [33]. The perceptual benefits of dithered quantization will be discussed later. Here, dithered (lattice) quantization is discussed because of its significance in rate-distortion theory.

A dithered quantizer consists of a dither generator and a lattice quantizer, as shown in Figure 2. The dither generator performs random sampling according to a probability distribution. Given a source sample X, a dither Z is generated and added to the source before quantization. The same dither is then subtracted from the reconstruction of the lattice quantizer, yielding the final reconstruction X̂. Another type of dithered quantization does



Figure 2: A dithered quantizer.

not involve subtraction of the dither and is called non-subtractive dithered quantization [34]. The former setup is known as subtractive. Here, only subtractive dithered quantization is considered.

A dithered quantizer can be seen as a quantizer with a stochastic partition. This perspective is essential for the later discussion of the main theme of this dissertation: distribution preserving quantization. Next, some existing results on dithered quantization are reviewed.

One of the most important properties of dithered quantization states that a dithered lattice quantizer is effectively a channel with additive noise, for which the probability distribution of the channel noise depends only on the geometry of the lattice. This property was first discovered in [35], which, however, only considered the one-dimensional case. A relatively general statement for any dimensionality follows:

Theorem 2 [36] If the dither is independent of the source and uniformly distributed over the fundamental region P of the lattice quantizer, then the error of the dithered quantizer, X̂ − X, is independent of the source and uniformly distributed over −P.
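Theorem 2 is easy to verify numerically in the scalar case, where the lattice is ΔZ and the fundamental region is [−Δ/2, Δ/2). In the sketch below (an illustrative check, not from the dissertation) the source is deliberately non-Gaussian, yet the error statistics are those of a uniform variable on −P:

```python
import random

random.seed(0)
delta = 0.5
errors = []
for _ in range(100000):
    x = random.expovariate(1.0)                  # skewed, non-Gaussian source
    z = random.uniform(-delta / 2, delta / 2)    # dither uniform over P
    q = delta * round((x + z) / delta)           # lattice quantization of X + Z
    errors.append(q - z - x)                     # error of X_hat = Q(X+Z) - Z

max_err = max(abs(e) for e in errors)            # confined to [-delta/2, delta/2]
mean = sum(errors) / len(errors)                 # ~ 0
var = sum(e * e for e in errors) / len(errors)   # ~ delta**2 / 12 (uniform)
```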

Another important property is that the rate of a dithered quantizer equals the mutual information between the source and the reconstruction, as indicated by the following theorem.

Theorem 3 [36] If the dither is independent of the source and uniformly distributed over the fundamental region P of the lattice quantizer, the conditional entropy of the index I resulting from the lattice quantizer, given the dither, i.e.,

H(I|Z) = −Vol(P)⁻¹ ∫_P Σ_{i∈I} Pr{X + z ∈ V_i} log₂ Pr{X + z ∈ V_i} dz,

equals I(X; X̂).



Figure 3: A filtered dithered quantizer.

Using entropy coding, this conditional entropy can be reached asymptotically when coding a long sequence. Therefore, Theorem 3 links the rate of dithered quantization with the mutual information between the source and the reconstruction. In particular, if a dithered quantizer realizes a source-reconstruction relation that leads to the RDF, it is theoretically optimal.

Considering an i.i.d. Gaussian source and MSE, we have seen that the RDF is achieved by a “backward channel”. The dithered quantizer does not directly realize such a channel since: 1) according to Theorem 2, a dithered quantizer is a “forward channel”, i.e., the noise is independent of the source, rather than the reconstruction; and 2) the noise is not i.i.d. Gaussian distributed, but uniformly distributed over the fundamental region. To address the first issue, one can transform the “forward channel” into the “backward channel” by scaling the source and/or the reconstruction, as proposed in [5, Section 4.3]. To overcome the second problem, a lattice sequence that is good for quantization can be used, for which the uniform distribution over the fundamental region converges to an i.i.d. Gaussian distribution in the sense of a per-dimension Kullback-Leibler divergence [27].

Following this logic, it has been proved [37] that dithered quantization with a pre-scaling and a post-scaling can achieve the RDF for i.i.d. Gaussian sources and MSE at any rate, as the lattice dimensionality increases. In the same paper, this method was generalized to a scheme that asymptotically achieves the RDF for any stationary Gaussian process and MSE. The scheme is comprised of a pre-filter, a dithered quantizer, and a post-filter, as shown in Figure 3. The pre-filter has an amplitude response satisfying

|H(ω)|² = 1 − min{P_X(ω), k⁻¹ ∫_P ‖τ‖² dτ} / P_X(ω), (31)

and an arbitrary phase response. The post-filter has a frequency response that equals the complex conjugate of H(ω):

G(ω) = H*(ω). (32)

In this setup, the pre- and post-filter jointly introduce zero delay. Let {X_t}∞t=1 be a stationary Gaussian process and {Y_t}∞t=1 be the output process of the pre-filter. For a k-tuple of {Y_t}∞t=1, say Y, a dithered quantizer is applied, yielding Ŷ. It follows that, with a lattice sequence that is good for quantization, lim_{k→∞} k⁻¹ I(Y; Ŷ) equals the RDF for {X_t}∞t=1 and MSE. According to Theorem 3, the filtered dithered quantizer shown in Figure



3 asymptotically achieves the RDF for stationary Gaussian processes and MSE.

Dithered quantization has theoretical significance in many other contexts, such as distributed source coding, AWGN channel coding, and multiple description coding (see, e.g., [36]). The new methods proposed in this dissertation are also heavily based on it.

Until now, the discussion has been based on quantization with arbitrary dimensionality. Such quantization is known as vector quantization. Although vector quantization is of great theoretical interest, it has limited applications in practice due to implementational complexity. Scalar quantization (one-dimensional vector quantization) is sometimes preferred. In general, scalar quantization has an inferior performance [38]. An important reason is that it cannot exploit dependencies among source samples. These dependencies, however, can be reduced by transforms or prediction. This respectively leads to two source coding realms: transform coding and predictive coding. These two methodologies will now be discussed in more detail.

Transform coding

Transform coding [39] applies a multivariate transformation to the source, leading to a number of coefficients. A separate quantizer is used for each coefficient, and the quantized coefficients are finally mapped to a reconstruction by the inverse of the transformation. The transformation that leads to the best rate-distortion performance for the source (rather than the coefficients) is preferred [40]. In practice, transform coding is usually confined to unitary transforms, because these transforms preserve the squared error distortion, meaning that the rate-MSE optimization for the source is tantamount to that for the coefficients.

An important benefit of transform coding is that it can perform rate-distortion optimization by distributing a bit budget among the transform coefficients, which is known as bit allocation [41,42]. The optimal bit allocation must satisfy that moving bits from one transform coefficient to any other cannot decrease the distortion. Suppose the distortion is the sum of the individual distortions of the transform coefficients; this condition then implies that the derivative of the distortion w.r.t. the rate of the used quantizer is a constant among all coefficients except those for which the derivative cannot reach this constant.

For transform coding of a Gaussian random vector subject to the MSE distortion, the KLT is the optimal transform if the quantizer applied to the coefficients, when quantizing a Gaussian r.v. with variance σ², yields an operational distortion D and rate R satisfying D = σ²g(R) for some non-increasing function g(·) [43]. In this case, the optimal bit allocation



for the KLT coefficients follows [41]:

R_i = { g⁻¹((θ/σ_i²) g(0))   if σ_i² > θ,
      { 0                    if σ_i² ≤ θ,      (33)

where σ_i² denotes the variance of the i-th KLT coefficient and θ is chosen to satisfy the total rate budget. This bit allocation assigns zero bits to coefficients with small variances, resembling reverse water-filling. This result is related to the RDF for Gaussian sources and MSE. In particular, if the quantization is not limited to be scalar, but rather a separate vector quantizer is applied to each transform coefficient through accumulating multiple realizations of that coefficient, g(R) can approach 2⁻²ᴿ. Applying such a g(R) to (33) leads to the optimal bit allocation for independent Gaussian r.v.'s, which has been used as an essential step in the derivation of the RDF for stationary Gaussian processes and MSE. For stationary Gaussian processes, the discrete Fourier transform (DFT) and discrete cosine transform (DCT) [44] can achieve a similar coding efficiency as the KLT at high dimensionality [45].
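For the particular choice g(R) = 2⁻²ᴿ, for which g(0) = 1 and g⁻¹(u) = −(1/2) log₂ u, the allocation (33) reduces to R_i = max{0, (1/2) log₂(σ_i²/θ)}. The sketch below (function names and the bisection tolerance are my own) finds θ by bisection so that the rates meet a total budget:

```python
import math

def bit_allocation(variances, total_rate, tol=1e-9):
    """Reverse water-filling: R_i = max(0, 0.5*log2(var_i/theta)), with the
    water level theta chosen by bisection so that sum(R_i) = total_rate."""
    def rates(theta):
        return [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    lo, hi = min(variances) * 1e-12, max(variances)  # bracket for theta
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if sum(rates(theta)) > total_rate:
            lo = theta  # spending too many bits: raise the water level
        else:
            hi = theta
    return rates(hi)

R = bit_allocation([4.0, 1.0, 0.25], total_rate=3.0)
print(R)  # approximately [2.0, 1.0, 0.0]
```

For variances (4, 1, 0.25) and a budget of 3 bits, the water level works out to θ = 0.25, giving roughly 2, 1, and 0 bits: the smallest coefficient sits exactly at the threshold and receives nothing.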

A shortcoming of transform coding is that it needs to gather a large number of source samples to operate, thus causing a large delay. Predictive coding is a way to address this problem.

Predictive coding

Predictive coding subtracts a prediction of a source sample, quantizes the residual of the prediction, and then adds the prediction back. One of the earliest realizations of predictive coding is differential pulse-code modulation (DPCM) [46]. The prediction aims to remove dependencies in the source so as to enable a good rate-distortion performance using scalar quantizers. Usually, the prediction is restricted to be linear.

Prediction can be categorized into open-loop and closed-loop. Open-loop prediction is based on previous source samples. Since the source samples are not available at the decoder, an approximate prediction must be used at the decoder for reconstructing the source. Closed-loop prediction is based on previously reconstructed samples, which are known to both the encoder and the decoder, and can therefore lead to better rate-distortion performance.
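The closed-loop arrangement can be sketched in a few lines. In the toy DPCM loop below (a first-order linear predictor and a uniform scalar quantizer, both chosen purely for illustration), the encoder predicts from previously reconstructed samples, so the decoder can form the identical prediction; the overall reconstruction error then equals the quantization error of the residual.

```python
def dpcm_encode_decode(samples, step=0.1, a=0.9):
    """Closed-loop DPCM: predict from the previous *reconstructed* sample,
    quantize the residual, and add the prediction back. The decoder runs
    the same loop, so encoder and decoder predictions agree."""
    recon, prev = [], 0.0
    for x in samples:
        pred = a * prev                 # linear prediction from reconstruction
        e = x - pred                    # prediction residual
        e_hat = round(e / step) * step  # uniform scalar quantization
        x_hat = pred + e_hat            # reconstruction
        recon.append(x_hat)
        prev = x_hat
    return recon

src = [0.0, 0.9, 1.7, 2.2, 2.0, 1.1]
rec = dpcm_encode_decode(src)
# Closed-loop property: x_hat - x = e_hat - e, so the reconstruction error
# is bounded by half the quantizer step for every sample.
assert all(abs(x - y) <= 0.05 + 1e-9 for x, y in zip(src, rec))
```

An open-loop variant would predict from the true samples instead, and the quantization errors would then accumulate at the decoder.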

Recently, a predictive coding structure that achieves the RDF for stationary Gaussian processes and MSE has been proposed [47]. The coding structure, as shown in Figure 4, uses a pre-filter and a post-filter that follow (31) and (32). A closed-loop predictor is also applied. The predictor is designed such that the power of the prediction error {E_t}∞t=1 is minimized. The predictive coding structure assumes the quantization noise to be an AWGN. Two results can be obtained [47]:



Figure 4: An RDF-achieving structure with prediction.

1. Let {X_t}∞t=1 be a stationary Gaussian process, and let Y and Ŷ be k-tuples of the processes {Y_t}∞t=1 and {Ŷ_t}∞t=1, respectively. As for the filtered dithered quantizer shown in Figure 3, it follows that lim_{k→∞} k⁻¹ I(Y; Ŷ) equals the RDF for {X_t}∞t=1 and MSE;

2. Due to the Gaussianity of {Y_t}∞t=1 and the AWGN, it follows that I(E_t; Ê_t) = lim_{k→∞} k⁻¹ I(Y; Ŷ).

These results imply that encoding the residual samples independently can achieve the RDF. However, when a scalar quantizer is used, the quantization noise is non-Gaussian. This introduces two deficiencies: 1) lim_{k→∞} k⁻¹ I(Y; Ŷ) becomes greater than the RDF, and 2) I(E_t; Ê_t) > lim_{k→∞} k⁻¹ I(Y; Ŷ), meaning that independent entropy coding of the residual samples is suboptimal. While the first problem is inevitable, the second can be avoided by using high-order entropy coding.

1.3 Entropy Coding

It has been mentioned that entropy coding is an integral part of variable-rate quantization. The average length of any code for a random vector cannot be less than its entropy. It is also known that the optimal average code length can be within 1 bit of the entropy [1], which means that the entropy is a relatively tight lower bound. Entropy coding aims to reach the entropy. To this end, it is usually desirable to code many samples jointly.

The advantages of joint encoding can be one or more of the following:

1. The entropy of a collection of samples is smaller than the sum of their individual entropies, unless the samples are independent. So encoding samples jointly can exploit their dependencies;

2. With many entropy coding methods, e.g., Shannon codes [1], Huffman codes [9], and arithmetic coding [10], when coding a random vector A, the average code length L satisfies H(A) ≤ L ≤ H(A) + c for some constant c that does not vary with the dimensionality of A. When coding infinitely many samples together, the contribution of c to each sample diminishes;



3. Many entropy coding methods require the probability distribution of the source. However, the probability distribution cannot be precisely known a priori in all cases. Collecting samples facilitates probabilistic modeling of the source, thus enabling these entropy coding methods. There exist entropy coding methods that avoid probabilistic modeling by using an adaptive code. The most well-known method is the Lempel-Ziv code [48,49]. These entropy codes also require long sequences to perform efficiently.

Although it is capable of good performance, joint encoding may cause computational problems. Using entropy codes that pre-define a codeword for each possible outcome of the source vector (e.g., Huffman codes), the size of the codebook grows exponentially w.r.t. the dimensionality. This problem can be solved by sequential entropy coding methods, which encode source samples one-by-one according to the conditional probability distribution of each sample given all samples preceding it. Arithmetic coding uses this idea. However, the codebook design for each sample requires the entire history to be efficient and can, therefore, also be computationally prohibitive. When the random sequence is Markovian, the conditional probability distribution depends on a finite number of past samples and can thus be calculated efficiently.
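The sequential idea can be illustrated with the ideal (arithmetic-coding-style) code length, −log₂ of the sequence probability accumulated through the chain rule. The two-state Markov chain below is an invented stand-in; the point is only that each sample is coded under its conditional probability given the past, which for a Markov source reduces to the previous sample.

```python
import math

# Hypothetical two-state first-order Markov source over symbols {0, 1}.
INITIAL = {0: 0.5, 1: 0.5}
TRANS = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

def ideal_code_length(seq):
    """Ideal sequential code length in bits: each symbol contributes
    -log2 of its conditional probability given the previous symbol."""
    bits = -math.log2(INITIAL[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        bits += -math.log2(TRANS[prev][cur])
    return bits

print(ideal_code_length([0, 0, 0, 1, 1, 0]))  # about 6.68 bits
```

Likely symbols (here, staying in state 0) cost a fraction of a bit, while rare transitions cost several; an arithmetic coder approaches this total to within a constant.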

In this dissertation, we are concerned with entropy coding of quantized signals. For such signals, Markovianity can be too strong an assumption. Firstly, the source being quantized may not be Markovian, and secondly, even if the source is a Markov process, the quantization operation may destroy the Markovianity. To address this problem, Paper F proposes a sequential entropy coding method for variable-rate quantization. Modeling the source with a hidden Markov model (HMM), this method applies a recursive update on the conditional probability distribution for the output of a generic quantizer.

1.4 Distortion Measure Considerations

The choice of distortion measure is critical for rate-distortion theory and quantization techniques. Together with the source probability distribution, it determines whether the rate-distortion optimization problem is analyzable.

The distortion measures considered in rate-distortion theory and quantization methods are limited. Firstly, distortion measures are usually limited to being additive, i.e., the distortion of two vectors is the sum of a distortion for each pair of their elements. Additivity is the essence of the single-letter nature of a fidelity criterion, which is commonly assumed in rate-distortion theory. Additivity is also important for bit allocation. Secondly, the distortion measure is also commonly limited to be a difference



Figure 5: A generic companding structure.

distortion measure, which is necessary for Gersho's conjecture and also important for lattice quantization and predictive coding. Finally, the most well studied distortion measure is the squared error. The squared error is an additive difference distortion measure. As seen earlier, abundant theoretical and practical findings are available for the squared error.

For many applications, more general distortion measures are desired. At low distortion levels, some distortion measures can be approximated by locally quadratic distortion measures:

d(x, x̂) = (x − x̂)ᵀ M(x) (x − x̂) + O(‖x − x̂‖³), (34)

where M(x) is known as a sensitivity matrix. The optimal quantization for such a distortion measure can be obtained by formulating a refined rate-distortion optimization problem and solving it with existing methodologies (see, e.g., [50]). A more general method is so-called companding [51], as shown in Figure 5. In a companding structure, an MSE-optimal quantizer is embedded between a multidimensional compressor F(·) and its inverse. At high rates, the optimal compressor satisfies

F′(x)ᵀ F′(x) = c M(x), (35)

almost everywhere, where F′(x) denotes the Jacobian matrix of the compressor. The companding structure provides a systematic solution for quantization with general distortion measures. However, its optimality is limited to high rates.
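A classical scalar instance of the structure in Figure 5 is μ-law companding from telephony: the compressor expands low amplitudes so that the embedded uniform quantizer effectively spends finer resolution on them. The sketch below illustrates only the compress-quantize-expand chain; it does not implement the multidimensional optimal compressor of (35).

```python
import math

MU = 255.0  # standard mu-law constant

def compress(x):
    """mu-law compressor F(.) for x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """Inverse compressor F^{-1}(.)."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def compand_quantize(x, levels=256):
    """Compress, quantize uniformly (the MSE quantizer in the middle), expand."""
    step = 2.0 / levels
    y_hat = round(compress(x) / step) * step
    return expand(y_hat)

# Low-amplitude inputs are reconstructed far more accurately than the
# plain uniform step (2/256, about 0.008) would allow.
print(compand_quantize(0.001), compand_quantize(0.5))
```

The price is coarser reconstruction at high amplitudes, which matches a relative-error (rather than absolute-error) sensitivity.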

In summary, rate-distortion theory and quantization place the lossy source coding problem in a mathematical framework. In this framework, many techniques can be analyzed and optimized. In particular, the RDF, an achievable lower bound for all lossy coding systems, is rigorously defined. However, the choice of distortion measures for the mathematical framework is subject to certain limitations.

2 Perceived Quality

The ultimate goal of lossy coding of PRS is to maximize the perceived quality of the reconstruction, subject to a constraint on the rate. Assessing perceived quality forms a stand-alone research topic and has a much broader application than lossy source coding, including, e.g., signal enhancement [52] and watermarking [53]. The studies of perceived signal



quality can be either microscopic or macroscopic. A microscopic study investigates how a human assesses the quality of a specific type of signal, while a macroscopic study seeks a generalized theory of human cognition. Since the goal of this dissertation is to find a generalized objective, a macroscopic view of the perceived quality is of great interest. However, a good generalized theory must be able to explain most, if not all, of the findings made for specific cases. In the following, the two streams of research are briefly reviewed one after another.

2.1 Quality Assessment

Significant research efforts have been made in looking for objective measures that predict the perceived quality of particular types of signals. Common quality assessment techniques are designed to deal with one of the following media: speech [54,55], audio [56], images [57], or video [58]. A recent review of all these subjects can be found in [59]. Here, an attempt is made to provide a unified view of the different types of quality assessment.

Since the perceived quality is inherently subjective, the performance of any objective measure must be verified against subjective measures. In fact, many objective measures are derived by training on subjective test results. Therefore, the acquisition of reliable subjective quality data is critical. This is, however, beyond the scope of this dissertation. We will focus on objective quality assessment.

Quality estimation for particular signals can focus either on the impact of noise on clean signals or on the features of input signals. In this dissertation, these are referred to as error-based and feature-based quality assessment, respectively.

Error-based quality assessment

Error-based quality assessment, as the name suggests, is based on the error of a noisy signal w.r.t. a clean signal. The simplest error-based quality measure is the MSE. A related measure is the signal-to-noise ratio (SNR). Although MSE and SNR are commonly used due to their simplicity, there is a consensus on their inefficiency in predicting perceived quality [60]. Advanced error-based quality assessment utilizes the masking effects of human sensory (e.g., auditory and visual) systems.

Masking refers to the fact that the human sensory system often cannot perceive a weak signal in the presence of a strong signal. Masking effects are evaluated through masking experiments, in which chosen test signals are added to prototype signals and a human's sensitivity to these test signals is assessed by subjective tests. The prototype signal is called a masker and the test signal is called a maskee. A masking experiment is usually



concerned with the just noticeable difference (JND), which describes the smallest detectable intensity of a maskee under a masker.

Masking experiments have been performed for both the human auditory system and the human visual system (HVS). Auditory masking experiments include spectral (or simultaneous) masking and temporal (or non-simultaneous) masking. They account for a human's sensitivity to maskees of different frequency and different timing characteristics, respectively. Spectral masking experiments, depending on the contents of the masker and the maskee, can be classified into tone-tone, noise-tone, and tone-noise [61–64]. Temporal masking [65] includes forward masking (masker before maskee) and backward masking (maskee before masker). For the HVS, known masking effects include spatial frequency masking [66] and contrast masking [67].

Given masking data, a systematic method for deriving quality measures is to fit the data into a model of a human sensory system and then obtain a quality measure out of the model. To model a human sensory system, other knowledge, e.g., physiological findings, can also be utilized.

Various auditory models have been developed [68–72]. Models for the HVS have also been proposed [73,74]. These models produce an “internal” representation of the input signal, which can be used for general purposes, such as quality assessment and recognition. For quality assessment, one can define a measure as a weighted sum of the squared errors (or some other metric) on different dimensions of the “internal” representation:

d(x, x̂) = Σ_i w_i(x) (φ_i(x) − φ_i(x̂))², (36)

where x and x̂ are regarded as a clean signal and a noisy signal, respectively, φ_i(·) returns the i-th dimension of the “internal” representation, and w_i(·) is a weighting factor, which can be learned from masking data. Examples of error-based quality assessment can be found for both audio [75] and images [76,77].
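In code, such a measure is simply a weighted squared error computed after a fixed transform. The two-dimensional "internal representation" and the weights below are invented placeholders; a real system would substitute an auditory or visual model for φ and masking-derived weights for w.

```python
def internal(x):
    """Toy 'internal representation' (hypothetical stand-in for a
    perceptual model): a fixed linear transform of the signal."""
    return [x[0] + x[1], x[0] - x[1]]

def perceptual_distortion(x, x_hat, weights):
    """Weighted squared error in the internal domain, as in (36)."""
    phi, phi_hat = internal(x), internal(x_hat)
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, phi, phi_hat))

d = perceptual_distortion([1.0, 0.5], [0.9, 0.6], weights=[1.0, 4.0])
# Only the second internal dimension differs here, so its weight dominates.
print(d)  # about 0.16
```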

Using a sensitivity matrix, an error-based quality measure defined on the “internal” representation can be approximated by a weighted squared error in the signal domain [78]. This approximation is valid only when the difference between the input signals is small.

Error-based measures can be accurate at high quality levels, but usually become poor for low quality signals [79]. Reasons for their deficiencies include:

1. Error-based quality assessment can only work for signals subjected to additive distortion. It is not suitable for non-additive distortions, e.g., when an audio signal is made longer or when an image is shifted or rotated;



2. The acquisition of masking data has many restrictions. In masking experiments, the maskers and maskees are purposely selected. The maskers are usually laboratory signals like pure sinusoids, and the maskees are commonly statistically independent of the masker. In addition, since masking experiments are mostly concerned with the JND, the maskees usually have small intensity;

3. Due to the limitations in masking experiments and the fact that existing knowledge about human sensory systems is limited, the perceptual model may not be sufficiently reliable;

4. When using the sensitivity matrix for a quadratic approximation of an error-based quality measure, the error is required to be small.

Given the drawbacks of error-based quality assessment, other methodologies have been considered. A successful track has been to find perceptually relevant features and to define the quality as a function of the features. This feature-based quality assessment will be discussed below.

Feature-based quality assessment

Feature-based quality assessment does not require the difference of two signals. An interesting consequence is that it does not necessarily require any reference (clean) signal. This endows feature-based quality assessment with a logical advantage in that it reflects the fact that a human is able to judge the quality even if no reference is given. Quality assessment with or without a reference is referred to as being intrusive or non-intrusive, respectively. Intrusive measures can exploit a comparison of the two presented signals, while non-intrusive measures depend heavily on a priori knowledge. Topics related to non-intrusive quality assessment include the estimation of single-signal features such as loudness [80], speech intelligibility [81], and voice naturalness (see, e.g., a number of indices proposed in [82]). Methodologically, there are no essential differences between the assessment of these quantities and the non-intrusive assessment of perceived signal quality.

Quality assessment is essentially a machine learning problem and can be decomposed into a feature extraction and a feature-to-quality mapping. The difference between intrusive and non-intrusive measures can be seen as whether there is a feature combination process before the feature-to-quality mapping. Figure 6 depicts a typical feature-based quality assessment system.

Features are the aspects of an object that a human uses to distinguish the object from others. A necessary condition for an aspect of an object to be distinctive is that the human sensory system is sensitive to it. Therefore, good feature extraction usually utilizes knowledge of the human perception of a particular type of signal. For speech and audio quality assessment, widely-used features include perceptual linear prediction



Figure 6: A diagram of feature-based quality assessment. The dashed parts are not used for non-intrusive measures.

(PLP) coefficients [83], mel-frequency cepstral coefficients (MFCC) [84], the temporal-modulation spectrum [85], and the spectro-temporal modulation spectrum [81,86]. These features are in line with findings in psychoacoustics, e.g., [87–89]. For image and video quality assessment, two classes of features recently found to perform well are structural information [79] and natural scene statistics [90,91].

The feature-to-quality mapping can choose from a large pool of pattern recognition tools [92] such as artificial neural networks (ANN), Gaussian mixture models (GMM), and support vector machines (SVM). Many pattern recognition methods are of a form where a pre-defined function g(·) is applied to a linear combination of the features: q = g(wᵀφ + b), where w is a weighting vector, b is a bias, and φ is the extracted feature vector. To avoid over-fitting, regularization is usually applied to the weighting vector [93], which facilitates automatic rejection of irrelevant features. Pattern recognition methods have been successfully used in both intrusive [94,95] and non-intrusive [90,96–98] quality measures. However, these methods usually require training data, which are not always available.
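As a concrete illustration of the q = g(wᵀφ + b) form (all numbers below are hypothetical, not trained values), a logistic g maps a linear combination of features to a bounded quality score:

```python
import math

def quality_score(features, w, b):
    """Feature-to-quality mapping q = g(w^T phi + b), with a logistic g
    so the score lands in (0, 1). Weights would normally be trained."""
    z = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))

q = quality_score([0.8, 0.1, 0.3], w=[2.0, -1.5, 0.5], b=-0.2)
print(q)  # about 0.80
```

A regularizer on w would shrink the weights of irrelevant features toward zero, which is the automatic rejection mentioned above.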

To avoid the requirement of training data and/or to reduce the chance of over-fitting, heuristic feature-to-quality mappings are also commonly used, especially in intrusive quality measures. Examples of such mappings include metrics [99], the correlation coefficient [85], and the mutual information [91], all defined on features.

2.2 Cognitive Theories of Similarity

Human perception has also been studied by philosophers and psychologists. From a philosophical or psychological perspective, the perceived quality is essentially a judgment of similarity. Judging the quality of a signal against a reference is naturally interpretable as a similarity assessment, while judging the quality of a signal without any reference amounts to assessing the similarity between the signal and a signal class. For example, asking “how good (bad) is this image?” is almost equivalent to asking “how similar is



Figure 7: A diagram of similarity measures.

this image to good (bad) images?”. Similarity is a topic of great importance to cognitive psychology, since it is fundamental for knowledge and behavior [100,101]. Cognitive theories of similarity are very general and may inspire new insights for lossy coding of PRS. In the following, these theories are briefly reviewed.

Most cognitive theories assume that there is a psychological function that transforms a perceptual stimulus into a representation in a psychological space. Similarity judgment is performed by a cognitive function on the psychological representations of two stimuli. Let X and X̂ denote two stimuli, and let Φ and Φ̂ denote their psychological representations. A similarity measure operates as illustrated in Figure 7.

Geometric models

Geometric models are among the earliest developments of similarity modeling. In a basic geometric model, the similarity is defined as a monotonically decreasing function of a metric in the psychological space:

s(X, X̂) = g(d(Φ, Φ̂)). (37)

The geometric model satisfies

1. The similarity is based on a metric in the psychological space. A commonly used metric is the Minkowski power metric:

d(φ, φ̂) = ( Σ_i |φ_i − φ̂_i|^r )^{1/r}; (38)

2. The metric in the psychological space is monotonically related to the similarity. No further assumption is made for g(·). Therefore, the form of g(·) is somewhat irrelevant. The only necessity is that the rank order of the metric inversely matches that of the similarity;

3. The psychological space has small dimensionality.
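A minimal rendering of the basic geometric model combines the Minkowski power metric (38) with some decreasing g; the choice g(d) = e⁻ᵈ below is arbitrary, since by the second property only the rank order matters.

```python
import math

def minkowski(phi, phi_hat, r):
    """Minkowski power metric of order r, eq. (38)."""
    return sum(abs(a - b) ** r for a, b in zip(phi, phi_hat)) ** (1.0 / r)

def similarity(phi, phi_hat, r=2):
    """Basic geometric model (37): a decreasing function of the metric.
    g(d) = exp(-d) is an arbitrary choice; only monotonicity matters."""
    return math.exp(-minkowski(phi, phi_hat, r))

assert similarity([0, 0], [0, 0]) == 1.0  # identical stimuli: maximal similarity
assert similarity([0, 0], [3, 4]) < similarity([0, 0], [1, 1])
```

Note that this construction automatically inherits minimality, symmetry, and the triangle inequality from the metric, which is exactly what the criticisms below take issue with.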



The basic geometric model has been called into question by both experiments and intellectual arguments [102–104]. It is commonly accepted that the three axioms of a metric are invalid in the context of similarity assessment [103]. These three axioms are 1) minimality, i.e., d(φ, φ̂) ≥ d(φ, φ) = 0, 2) symmetry, i.e., d(φ, φ̂) = d(φ̂, φ), and 3) the triangle inequality, i.e., d(φ, τ) + d(τ, φ̂) ≥ d(φ, φ̂). A less widely known issue with geometric models is that the additivity and the subtractivity in typical metrics, e.g., the mentioned Minkowski power metric (38), may be undesirable [102]. The drawbacks of the basic geometric model have been partially alleviated by some of its variations (see, e.g., [105,106]).

A widely used application of the geometric model is multidimensional scaling (MDS) [107–110]. MDS provides a procedure for constructing geometric models from similarity data. Given n stimuli and empirical similarity data on all stimulus pairs, the method assigns each stimulus a position in a metric space with the minimum possible dimensionality, such that, for some non-increasing function g(·), (37) fits the similarity data well. MDS treats each stimulus as a symbol and hence can only deal with a finite number of stimuli. In addition, the mapping from the physical space to the psychological space is implicit, but can be critical to the accuracy of the similarity measure. Therefore, MDS can hardly provide a feasible quality measure for PRS.

Feature-contrast models

Tversky proposed a feature-contrast similarity measure [103] based on set theory. The model assumes that the psychological representation of a stimulus takes the form of a set consisting of categorical features of the stimulus. The similarity of one stimulus to another is defined as a function of their common and distinctive features:

s(X, X̂) = g(Φ ∩ Φ̂) − α g(Φ \ Φ̂) − β g(Φ̂ \ Φ), (39)

where g(·) satisfies g(A) ≤ g(B) whenever A ⊆ B.

The feature-contrast model does not suffer from the metric dilemma of geometric models and can be consistent with many similarity experiments. Since the feature-contrast model is based on the notion of categorical features, it seems unable to deal with continuous-valued psychological spaces, though, in fact, it can be extended to do so. If the measurement of a psychological dimension takes values in R, one can define a sequence of features by constructing a sequence of nested intervals (−∞, b_i], b_1 < b_2 < ···, and checking which of the intervals a psychological value belongs to [103]. Another solution is to apply fuzzy predicates [111]. However, these refinements are difficult to implement. In addition, as in MDS, the psychological function is unclear. Therefore, the feature-contrast model also lacks feasibility.
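Taking g to be set cardinality, one admissible monotone choice, the feature-contrast model (39) becomes a few lines of set arithmetic. The example features are invented; note that with α ≠ β the measure is asymmetric, as Tversky intended.

```python
def tversky_similarity(features, features_hat, alpha=0.5, beta=0.5):
    """Feature-contrast model (39) with g taken as set cardinality."""
    common = features & features_hat
    only_a = features - features_hat
    only_b = features_hat - features
    return len(common) - alpha * len(only_a) - beta * len(only_b)

a = {"sharp", "bright", "textured", "natural"}  # invented feature sets
b = {"sharp", "bright", "blurry"}
print(tversky_similarity(a, b))  # 2 - 0.5*2 - 0.5*1 = 0.5
```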



Statistical models

Yet another class of similarity measures are statistical models [112, 113], for which the psychological representations of the stimuli are regarded as random variables, characterized by their probability density functions (p.d.f.) f_Φ and f_Φ̃. The randomness can account for the noise in neural activities and, more interestingly, allows uncertainty in the stimuli if the stimuli are also modeled as random variables. In deterministic models, including the aforementioned geometric model and feature-contrast model, different psychological functions apply to signals and signal classes. In contrast, statistical models allow the same psychological function to be used. Differences in stimulus type can be reflected by the probability distributions. For example, a single image can have a narrower p.d.f. than a class of images.

A statistical model of similarity is based on general recognition theory (GRT) [113, 114]. In this model, the similarity of X w.r.t. X̃ is measured as the probability that X is mistaken for X̃, i.e.,

s(X, X̃) = c ∫_{R(Φ̃)} f_Φ(φ) dφ,  (40)

where c is a constant and R(Φ̃) is the region in which Φ is mistaken for Φ̃. With the maximum-likelihood decision criterion, the region is defined by R(Φ̃) = {φ : f_Φ̃(φ) ≥ f_Φ(φ)}. When f_Φ(φ) = f_Φ̃(φ), ∀φ, the similarity is maximized and equal to c. The GRT model has shown good performance against real data. Another advantage is that its accuracy does not necessarily depend on the psychological function, which will be addressed in the following.

As discussed, with statistical models, the psychological function can be the same for both stimuli. Let p be the psychological function and assume that it is invertible and differentiable. The p.d.f. of a stimulus X and that of its psychological representation Φ are related by

f_Φ(φ) = f_X(p⁻¹(φ)) |det(J(φ))|,  (41)

where J(φ) is the Jacobian of p⁻¹ evaluated at φ. The GRT model then becomes

s(X, X̃) = c ∫_{R(X̃)} f_X(x) dx,  (42)

where R(X̃) = {x : f_X̃(x) ≥ f_X(x)}. This means that the similarity can be thoroughly determined by the probability distributions of the stimuli. Unfortunately, the psychological function is generally non-invertible, since 1) the psychological space, as discussed, usually has a lower dimensionality than the physical spaces of the stimuli; and 2) if neural variability is considered, such a function is stochastic. However, in many cases, letting the probability distributions of the stimuli be identical is sufficient to maximize their similarity.
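The GRT similarity (42) with the maximum-likelihood decision region can be estimated by Monte Carlo sampling. The sketch below uses two one-dimensional Gaussians as made-up stimulus distributions; when the two densities coincide, the estimate attains its maximum c.

```python
import numpy as np

rng = np.random.default_rng(0)

def grt_similarity(f_x, f_xt, samples, c=1.0):
    """Monte Carlo estimate of s(X, Xt) = c * P(f_xt(x) >= f_x(x)),
    with x drawn from f_x (eq. (42) with the ML decision region)."""
    return c * np.mean(f_xt(samples) >= f_x(samples))

def gauss_pdf(mu, sigma):
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 1.0, 200000)          # samples of X ~ N(0, 1)
s_same = grt_similarity(gauss_pdf(0, 1), gauss_pdf(0, 1), x)
s_diff = grt_similarity(gauss_pdf(0, 1), gauss_pdf(2, 1), x)
# identical distributions maximize the similarity (s_same equals c = 1);
# for the shifted Gaussian the region R is x >= 1, so s_diff is well below c
```

This mirrors the observation in the text: making the two distributions identical is sufficient to maximize the GRT similarity.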

When two stimuli are generated from two distinct probability distributions, using the generating probability distributions for the GRT model can result in a relatively accurate similarity estimate. However, if two stimuli are generated from the same probability distribution, the GRT similarity based on the generating probability distribution reaches its maximum, while a human may still notice differences between them. This can be interpreted as a change in attention, i.e., when the stimuli belong to different classes, a human focuses on the differences between their corresponding classes, and when the stimuli belong to the same class, a human's attention can switch to their detailed differences. Therefore, it is advisable to split the concept of similarity into intra-class and inter-class similarity [115]. Knowing stimulus statistics, the GRT can be used to predict the inter-class similarity, while the intra-class similarity can be obtained through other models, e.g., the deterministic models mentioned earlier.

To summarize our discussion of perceived quality, we have seen that

1. Perceived quality is essentially determined by features of signals. This is reflected by the fact that feature-based quality assessment generally works better than error-based assessment. In addition, cognitive theories of similarity always assume a psychological space, on which the similarity measure is directly defined. In fact, the psychological space is composed of features of the source;

2. It is difficult to identify all perceptually relevant features. Methods that avoid explicit feature extraction are therefore favored. Error-based quality estimation based on the sensitivity matrix is one choice. However, it only works well at low distortion levels. The GRT model of similarity using the probability distributions of stimuli is another approach. However, it is somewhat limited to inter-class similarity;

3. Differentiating between inter-class and intra-class similarity can be helpful. In particular, the inter-class similarity should be primary. Once a high inter-class similarity is attained, a human's attention can switch to the intra-class similarity;

4. A priori knowledge plays an important role in quality assessment. In non-intrusive quality assessment, perceived quality is purely determined by experience. For intrusive quality assessment, experience is also critical, since it influences how features are mapped to the quality. The GRT model of similarity provides a natural way of handling a priori knowledge, which is to account for it in the selection of probability distributions.



[Figure 8: A generic AbS structure: a dictionary feeds a synthesizer, and the synthesized candidate X̃ is compared with the source X by a quality measure.]

3 Perceptual Coding

An ideal lossy coding system for PRS should yield the optimal trade-off between the rate and a reliable measure of the perceived quality of the reconstruction. This notion is consistent with rate-distortion theory, if the perceived quality lends itself to a distortion measure. Numerous rate-distortion optimized compression algorithms have been developed [116]. These methods usually utilize an error-based quality measure. Despite apparent successes, they usually yield reconstructions of unreasonably poor quality when the rate becomes low. This is consistent with the fact that error-based quality measures generally are inaccurate at high distortions. For low-rate scenarios, PRS coding methods mostly rely on an objective to preserve features of the signal, which is consistent with the notion that the perceived quality depends on features.

In the following, perceptual coding methods are briefly reviewed, with an emphasis on how perceived quality measures, rate-distortion theory, and source coding methods are applied in practice.

3.1 Analysis-by-Synthesis

The simplest way to apply perceived quality measures in source coding is probably analysis-by-synthesis (AbS). A basic AbS structure is shown in Figure 8. In this approach, candidate reconstructions are synthesized from atoms of a dictionary and compared against a source signal according to a measure of perceived quality. The reconstruction that yields the best quality is selected and the indices of the chosen atoms are coded.

A prominent example of AbS is code-excited linear prediction (CELP) coding, in which an excitation codebook and an adaptive filter are used as the dictionary and the synthesizer, respectively. CELP is based on a synthetic model of speech, which regards a speech signal as generated by filtering an excitation with a short-term and a long-term filter. The first CELP system used Gaussian random excitations, based on an observation that excitations are statistically similar to Gaussian random sequences [117]. This coder applied a perceptually weighted squared error as the quality measure. Further developments of CELP include using an algebraic codebook for fast searching [118], using an adaptive codebook to better exploit the pitch structure in the signal [119], and using a post-filter to enhance the quality [120].

Another type of AbS system uses a combination of atoms in the dictionary to synthesize the reconstruction. Usually, the number of atoms that are used for synthesis is small. Such a synthesis is known as a sparse approximation. The goal is to maximize a quality measure under a constraint on the number of chosen atoms. An exhaustive search is computationally infeasible. An efficient alternative is greedy pursuit [121–123], which chooses one atom at a time according to a locally optimal criterion. Greedy pursuit is not guaranteed to yield the global optimum. For greedy pursuit, the optimization criterion is restricted to be a norm of the difference between the synthesized and the source signal. By using an error-based quality measure that takes the form of a norm, perceptual effects can be taken into account [124].
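The greedy pursuit idea can be sketched as matching pursuit, one of its simplest instances: pick the atom most correlated with the current residual, subtract its contribution, and repeat. The dictionary and signal below are made-up toy data.

```python
import numpy as np

def matching_pursuit(x, dictionary, n_atoms):
    """Greedy pursuit sketch (matching pursuit). `dictionary` holds
    unit-norm atoms as rows; n_atoms atoms are selected one at a time."""
    residual = x.astype(float).copy()
    chosen, weights = [], []
    for _ in range(n_atoms):
        corr = dictionary @ residual          # inner products with all atoms
        i = int(np.argmax(np.abs(corr)))      # locally optimal choice
        chosen.append(i)
        weights.append(corr[i])
        residual -= corr[i] * dictionary[i]   # update the residual
    return chosen, weights, residual

# Toy example: overcomplete dictionary of normalized random atoms.
rng = np.random.default_rng(1)
D = rng.normal(size=(32, 8))
D /= np.linalg.norm(D, axis=1, keepdims=True)
x = 2.0 * D[3] - 1.0 * D[17]                  # sparse in the dictionary
idx, w, r = matching_pursuit(x, D, n_atoms=4)
# each greedy step reduces the residual norm, but the selection
# is only locally optimal, as noted in the text
```

Because the criterion is a norm of the residual, replacing the Euclidean norm by a perceptually weighted one fits directly into this loop.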

Another method for obtaining sparse approximations is to solve an optimization problem of the form:

min_w ‖x − wᵀΞ‖²  subject to  ‖w‖₀ = N,  (43)

where x represents the source signal, Ξ represents the dictionary, w is a weighting vector, ‖·‖₀ denotes the zero-norm, and N denotes the number of atoms to be used for synthesis. By introducing a Lagrange multiplier, a new criterion is obtained: J = ‖x − wᵀΞ‖² + λ‖w‖₀. This criterion is non-convex and thus hard to optimize. For an efficient solution, one can apply convex relaxation [125], which replaces the zero-norm with a one-norm, yielding a convex criterion to which convex optimization techniques [126] can be applied. In principle, the squared error can be replaced by a perceptually weighted squared error to gain perceptual benefits. However, the unclear relation between λ and N makes this Lagrange-multiplier-based approach hard to use in practice.
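The one-norm relaxation can be solved by many convex optimization methods; the sketch below uses iterative soft-thresholding (ISTA), one standard choice that is not prescribed by the text. The dictionary, signal, and λ are made-up examples.

```python
import numpy as np

def ista(x, Xi, lam, n_iter=500):
    """Convex-relaxation sketch: minimize ||x - Xi.T @ w||^2 + lam*||w||_1
    by iterative soft-thresholding (ISTA). Xi holds atoms as rows."""
    A = Xi.T                                   # synthesized signal = A @ w
    L = np.linalg.norm(A, 2) ** 2              # ||A||^2; gradient is 2L-Lipschitz
    w = np.zeros(Xi.shape[0])
    for _ in range(n_iter):
        grad = 2 * A.T @ (A @ w - x)
        z = w - grad / (2 * L)                 # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / (2 * L), 0.0)  # soft threshold
    return w

rng = np.random.default_rng(2)
Xi = rng.normal(size=(20, 50))
Xi /= np.linalg.norm(Xi, axis=1, keepdims=True)
x = 3.0 * Xi[5] - 2.0 * Xi[11]                 # truly sparse in the dictionary
w = ista(x, Xi, lam=0.1)
# the one-norm penalty drives most weights to exactly zero while the
# two active atoms keep large (slightly biased) weights
```

The trade-off the text points out is visible here: λ controls sparsity only indirectly, so hitting an exact atom count N requires tuning.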

An advantage of AbS is that, in principle, an arbitrary quality measure can be used. However, as mentioned above, to make fast searching possible, practical AbS methods usually are limited in their choice of objective functions.

AbS can be seen as a naïve quantizer, in the sense that it outputs the reconstruction that leads to the minimum distortion. However, AbS is not really optimal in the rate-distortion sense. For example, it has been argued that the codebook of CELP is suboptimal, since the synthesizer stretches the excitation domain where the codebook is defined [127]. Actually, the use of an excitation codebook in CELP is, to a large extent, motivated by the mentioned synthetic model of speech, rather than by rate-distortion considerations. This implies that the design of the dictionary and the synthesizer may capture perceptual effects that are not accounted for in the quality measure. To this end, [128] argues that the dictionary should be adapted to natural signals. Commonly used dictionaries include sinusoids [124], Gabor functions [129], damped sinusoids [130], wavelets [131], and compositions of various dictionaries [132]. These choices are all perceptually meaningful.

3.2 Quantization with Perceptual Distortion Measures

Quantization methods can be used directly for the coding of PRS when the perceived quality is captured by a proper distortion measure. Error-based quality measures are suitable for this purpose because, with quantization, the source and its reconstruction have the same format, enabling a subtraction, where the difference is the quantization noise. In particular, an error-based quality measure can usually be approximated by a weighted squared error, for which quantization theory is well developed. To be consistent with the literature, a distortion measure that accounts for perceptual effects is called a perceptual distortion measure [50].

Perceptual distortion measures can be exploited in transform coding. Usually, the perceptual domains in which many error-based quality measures are defined are also amenable to quantization. This is consistent with the efficient coding theory of human perception [3, 133], which assumes that human perceptual systems attempt to decompose stimuli into independent components. The distortion measures usually take the form of a (weighted) sum of the distortions of transform coefficients. Consequently, the rate-distortion optimization is actually the aforementioned bit allocation (or distortion allocation) problem. It is worth noting that the weighting factors in perceptual distortion measures, see (36), can be dependent on the source signal and therefore, additional transmission may be necessary to provide the bit allocation to the decoder.

Transform-based perceptual coding is ubiquitous in both audio and image compression. In this context, another commonly used term is the filter bank, which can be seen as a special implementation of a transform [24]. Depending on the format of the source signal, the transform can be one-dimensional (e.g., DCT), two-dimensional (e.g., 2D-DCT), or three-dimensional [134]. Although quantization theory provides transforms that are theoretically efficient, the choice of transforms in practical systems usually involves additional considerations, mainly related to the following two characteristics of PRS.

1. Non-stationarity: Audio and image signals are regarded as being segmentally stationary. Transforms (e.g., DFT and DCT) that are optimized for stationary signals perform well for stationary parts of a signal. When they are applied to transitions within a signal, artifacts can appear. In audio coding, a common artifact is the pre-echo.



When a transform is applied to a segment consisting of a quiet piece followed by a loud signal, the quantization noise may extend to the quiet part, causing a pre-echo. In image coding, the same problem is visible as blurred object boundaries. A solution is window switching, which uses different transform lengths for stationary and transitional frames. Window switching has been successfully used in audio coding [135]. Another solution is to use transforms that are able to detect the transitions, e.g., the wavelet transform [136], which has shown its strengths in image coding [137]. The wavelet transform, however, has poor frequency resolution in high frequency regions and hence cannot, on its own, achieve high coding efficiency for audio signals [138].

2. Inter-block dependency: Due to design limitations such as complexity and delay, the transform must be applied to segments of the source. Dependencies among blocks can cause two major problems: 1) a reduction in coding efficiency, and 2) discontinuities at the block boundaries, known as the blocking effect. A solution is to use inter-block prediction and transforms that are adapted to the prediction. Examples of this treatment include a zero-input-response removed KLT for audio coding [139, 140] and motion-compensated transforms for video coding [141]. Another solution to the problems related to inter-block dependencies is to apply lapped transforms [142, 143]. A lapped transform has two desirable features: 1) the basis functions have decaying magnitude toward the boundaries to suppress the blocking effect, and 2) no redundant data is produced, which is known as critical sampling. The MDCT [144] is such a transform, which is now widely used in audio coding [145].

Bit allocation is an integral part of transform coding. In practical systems, bit allocation can be implemented in different manners. A direct implementation is to adjust the step sizes of the quantizers for different transform coefficients according to the solution of an explicit optimization problem [146]. Another possibility is embedded coding, which sorts the bits according to their perceptual significance and then truncates the bit stream to fulfill the bit budget. The bits can be ordered from the most significant to the least significant bit plane and/or from the most significant to the least significant coefficient [137].
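As a toy illustration of explicit bit allocation, the following greedy marginal-return rule assumes the standard high-rate model in which each extra bit quarters a coefficient's distortion, D_i = var_i · 2^(−2 b_i). The variances and the bit budget are made-up examples, not taken from any coder in the text.

```python
import numpy as np

def greedy_bit_allocation(variances, total_bits):
    """Greedy (marginal-return) bit allocation sketch: repeatedly give one
    bit to the coefficient whose distortion drops the most, assuming
    D_i = var_i * 2^(-2*b_i)."""
    b = np.zeros(len(variances), dtype=int)
    d = np.asarray(variances, dtype=float)     # current per-coefficient distortion
    for _ in range(total_bits):
        i = int(np.argmax(d))                  # largest distortion gains the most
        b[i] += 1
        d[i] /= 4.0                            # one extra bit quarters the distortion
    return b

variances = [16.0, 4.0, 1.0, 1.0]
bits = greedy_bit_allocation(variances, 8)     # allocates [4, 2, 1, 1]
```

High-variance coefficients receive more bits, which is the behavior the reverse water-filling solution of the classical bit-allocation problem also exhibits.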

A shortcoming of transform-based perceptual coding is that it tries to use the same transform to achieve both perceptual benefits and coding efficiency. The two purposes, however, may not be achieved simultaneously by a single transform. For example, the human auditory system has poor frequency resolution at high frequency bands, while for coding, frequency bias is undesirable. It is preferable to decouple the perceptual and the coding considerations. The companding structure in Figure 5 yields a natural solution. As stated, a perceptual distortion measure can be approximated by a quadratic form of the error in the signal domain. Using the compressor function, the optimization between the rate and a perceptual distortion measure can be transformed into the classic optimization between the rate and the MSE. An early use of companding is the A-law and µ-law algorithms in PCM coding of audio (see, e.g., [147, Chapter 4.5]). More recently, the use of perceptual pre-/post-filters has been recommended for audio coding [148].
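The µ-law companding idea can be sketched as follows: compress, quantize uniformly in the compressed domain, then expand. The parameter value and quantizer resolution are illustrative choices, not a claim about any particular PCM standard.

```python
import numpy as np

MU = 255.0  # illustrative companding parameter

def mu_law_compress(x):
    """mu-law compressor for x in [-1, 1]."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    """Inverse (expander), recovering the signal domain."""
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

# Compress, quantize uniformly in the compressed domain, then expand:
x = np.linspace(-1, 1, 11)
y = mu_law_compress(x)
steps = np.round(y * 127) / 127          # uniform quantizer in the warped domain
xh = mu_law_expand(steps)
# the compressor is globally invertible: expand(compress(x)) recovers x
```

A uniform quantizer in the compressed domain behaves like a non-uniform quantizer in the signal domain, with finer cells near zero where small-amplitude errors matter most.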

Companding requires a globally invertible compressor that satisfies (35). However, the sensitivity matrix for a perceptual distortion measure can be so complicated that no proper compressor can be found. With some side information, it is possible to find a locally invertible compressor. This side information requires some additional data transmission, which corresponds to the transmission of the bit allocation.

With companding, quantization is not limited to transform coding; predictive coding can also be used. A benefit of predictive coding compared to transform coding is a shorter delay. Recently, a perceptual audio coder that uses a perceptual pre-/post-filter within the aforementioned optimal predictive coding structure (see Figure 4) has been proposed [149].

Quantization-based perceptual coding can systematically exploit the rigorously developed rate-distortion theory and lossy coding methods. Its usefulness is widely accepted [116]. However, its efficiency is not satisfactory. A common problem is that the reconstructed signal becomes unnatural when the rate is low. Examples of such issues include a “band-limited” artifact [150] and a so-called “birdies” artifact [151] in perceptual audio coding. A fundamental reason for the problem is, as discussed, that perceptual distortion measures generally become irrelevant to the perceived quality when the distortion is large. The feature-based approach to quality assessment suggests that, to achieve a good quality, signal characteristics have to be preserved. Parametric coding follows this philosophy.

3.3 Parametric Coding

Coding based on signal quantization is also known as waveform coding [147]. Another paradigm is parametric coding [152], which does not focus on the exactness of the reconstruction compared to the source on a sample level. Rather, it synthesizes the reconstruction from a model of the source, so that the reconstruction and the source are similar on a feature level.

Parametric coding utilizes a model of the source signal. A model generally provides a probability distribution of the source. Let us denote the conditional p.d.f. of the source given the model parameters by f_{X|Θ}(·|·). If the source is deterministically related to the parameters, such a p.d.f. can be described by a Dirac delta function. Knowing the parameters, one can synthesize a reconstruction according to f_{X|Θ}(·|·). A major task of parametric coding is to estimate and code the parameters. The coding of the parameters can make use of the source coding techniques discussed earlier, where the parameter rather than the signal itself serves as the source. Parameter estimation can draw on another productive research area: estimation theory (see, e.g., [153]).

By selecting a model that is related to the generating mechanism and/or human perception of the source signal, the reconstruction conveys physical and/or perceptual meaning. This explains the fact that parametric coding is the predominant coding approach for low-rate scenarios. In the following, some widely used parametric coding tools are reviewed, with a focus on how signal modeling and synthesis can be beneficial for low-rate coding of PRS.

Speech vocoder

In the 1930s, Dudley invented the first vocoder [154]. The basic idea of this vocoder is to reproduce a speech signal by generating the contents of a number of narrow frequency bands, such that the power distribution among different bands is the same for the generated signal and the source signal. This vocoder is called a channel vocoder. Clearly, the vocoder does not try to reproduce the waveform, since the phase information is lost. The human auditory system has a low sensitivity to absolute phase changes in a monaural signal and therefore, losing phase information has a relatively small influence on the perceived quality. Another type of vocoder is the linear predictive coding (LPC) vocoder. It synthesizes speech signals by filtering random noise and/or a pulse train [155]. Such a synthetic model of speech is also a basis for CELP.

Sinusoidal coding

Sinusoidal coding models an audio signal as a sum of sinusoids [156]. Recent developments have also introduced transient and noise components into the sinusoidal model [157]. In an early sinusoidal coder [158], the main sinusoidal components are selected and their magnitudes are transmitted, while the phases are replaced by generated phases that fulfill a maximum smoothness criterion. The performance of sinusoidal coding can be enhanced by using psychoacoustic criteria for the selection of sinusoids [159].

Waveform interpolation

Waveform interpolation [160] models a speech signal as a concatenation of characteristic waveforms. The characteristic waveform as a function of time is extracted from the input signal. After some alignment, the characteristic waveform evolves smoothly in time. The encoder then down-samples and transmits the characteristic waveform contour. The decoder performs an interpolation on the received contour and synthesizes the reconstruction. It is worth noting that, due to the alignment of the characteristic waveform before encoding, the reconstruction is not time-aligned with the source.



The characteristic waveform can also be decomposed into a rapidly-evolving waveform and a slowly-evolving waveform to model the voiced and unvoiced signal components more efficiently [161].

Binaural cue coding

Binaural cue coding (BCC) [162, 163] is a paradigm for multichannel audio coding. In BCC, the input signals are down-mixed to a single-channel signal, which is coded with high precision. The relations among the signals in different channels can be described by parameters, e.g., the inter-channel time difference, inter-channel level difference, and inter-channel coherence. The output signals are synthesized based on the decoded down-mixed signal and the parameters. A particular approach to synthesizing signals for the different channels is to model the signal in each channel as a linear combination of the down-mixed signal and a randomly filtered version of it, where the weighting factors for the combination must fulfill the inter-channel characteristics described by the mentioned parameters [164].

Model-based image coding

The parametric coding methods discussed above all deal with audio signals. Image compression can also apply parametric coding techniques. Parametric image coding is usually known as model-based image coding [165]. In this field, the analysis and synthesis of faces and facial movements have aroused much interest. A static facial image can be modeled as a wire frame and a texture [165], while facial movements can be described by a facial action coding system (FACS) [166, 167].

Others

There are a handful of technologies that are closely related to parametric coding in the sense that they synthesize a signal based on a model. Some of them are listed below together with a brief discussion:

1. Bandwidth extension and spectral band replication: Bandwidth extension (BWE) [168, 169] is a method for converting a narrow-band audio signal to a wide-band signal, so as to enhance the perceived quality. To this end, a signal is synthesized using a model of the spectrum of the missing band and added to the narrow-band signal. The model of the missing spectrum can be obtained by analyzing a reference signal or by mapping from the known band through, e.g., a trained dictionary. Spectral band replication (SBR) [157, 170] is a particular form of BWE, for which the spectrum of the missing band is copied from the known bands.



2. Noise-fill: Noise-fill [171, 172] is a way to eliminate the “birdies” artifact in transform-based audio coding. In the case of a small bit budget, some transform coefficients can be assigned zero bits. A slight change in the energy distribution can change the bit allocation so that these coefficients receive some bits in another frame. This is effectively an on-off switching, which produces an unpleasant sound known as the “birdies” artifact. To eliminate this artifact, noise with certain characteristics can be added to the coefficients when they receive zero bits; this technique is known as noise-fill.

3. Speech and music synthesis: In the earlier discussion of parametric audio coding, synthetic models of speech and music were touched upon. Here, we consider synthesis that is based on linguistic or musical models. For speech, text-to-speech (TTS) synthesis [173] is an active research area. Besides the text, speaker [174] and emotional characteristics [175] can also be used for synthesis. Together with a recognition system that identifies the text [176], the speaker [177], or the emotion [178], these speech synthesis methods can lend themselves to speech coding. A corresponding technology in the music industry is the synthesis of music from a musical description; see [179] for a review.

4. Image rendering: Image rendering is a procedure for synthesizing an image. Realism is a typical objective of image rendering. To this end, a physical model of scenes can be used. Image rendering has great significance in movie making, video gaming, pilot training, architectural design, etc., and forms a substantial research area. For an introduction to the topic, one may refer to [180].

Signal synthesis generally outperforms signal quantization at low rates. However, when a high rate is possible, signal synthesis becomes less efficient, due to its inefficiency in rate-distortion optimization. To address this problem, hybrid methods have been considered. A combination of signal quantization and synthesis can lead to a method that facilitates a transition between the two. A simple composition is to subtract a synthesized signal from the source and then quantize the residual. A bit allocation can be made between the parameters used for the synthesis and the residual signal (see, e.g., [181]). AbS can also be interpreted as a hybrid coding method, since it utilizes a synthetic model and chooses the final reconstruction according to a distortion criterion. Yet another paradigm for combining the two coding concepts is to introduce feature-preserving capabilities into quantization. This paradigm can improve the low-rate performance of quantization, while maintaining its high-rate advantages. This topic will be addressed further in the following.



3.4 Quantization with Constraints on Statistical Properties of Reconstruction

From the discussion of perceived quality, we notice that the features of a signal play an important role in its quality. Many signal features are statistical. Therefore, introducing constraints on statistical properties of the reconstruction may enhance the perceptual aspects of a quantizer.

An early consideration of statistical properties of the reconstruction in quantization is the use of dithering [33]. A related technique is halftoning [182]. Typically, a dither is used in conjunction with a quantizer to produce a continuous-tone image. Dithered quantization is also used in audio applications [183]. A dithered quantizer aims to optimize rate-distortion performance like a conventional quantizer, while trying to achieve certain statistical properties. The statistical constraints that dithered quantization tries to satisfy are usually implicit and vaguely stated. Depending on its application and implementation, a dithered quantizer can achieve one or more of the following properties: 1) independence or uncorrelatedness between the quantization noise and the source [184], 2) a certain quantization noise distribution, and 3) a continuous-valued reconstruction.
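The first two of these properties can be demonstrated with a minimal sketch of subtractive dithering applied to a uniform scalar quantizer; the step size and source distribution are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
DELTA = 0.5  # quantizer step size (illustrative)

def subtractive_dither_quantize(x, dither):
    """Subtractive dithered quantization: add dither before a uniform
    quantizer and subtract it again at the reconstruction."""
    return DELTA * np.round((x + dither) / DELTA) - dither

x = rng.normal(0.0, 1.0, 100000)                    # arbitrary source
u = rng.uniform(-DELTA / 2, DELTA / 2, x.size)      # uniform dither
xh = subtractive_dither_quantize(x, u)
noise = xh - x
# the total noise stays in [-DELTA/2, DELTA/2], has (near) zero mean,
# and is empirically uncorrelated with the source
```

The reconstruction is also continuous-valued, since the subtracted dither shifts the lattice of reconstruction points by a random offset.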

An explicit use of constraints on statistical properties of the reconstruction is found in moment preserving quantization (MPQ) [185]. MPQ forces certain statistical moments of the reconstruction to be identical to those of the source. Methodologically, the cells and reconstruction points are arranged to satisfy the moment preserving constraints. MPQ has been successfully applied to image coding [186, 187]. However, due to this methodology, the rate of an MPQ and the number of moments it can preserve are interdependent.
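The moment-preserving idea can be sketched with a one-bit quantizer in the style of block truncation coding: the two output levels are chosen so that the sample mean and variance of the reconstruction match those of the input block. The block data are made-up; the sketch assumes both quantization cells are non-empty.

```python
import numpy as np

def two_level_mpq(x):
    """Moment-preserving sketch of a 1-bit quantizer (block-truncation
    style): pick the two output levels so that the sample mean and
    variance of the reconstruction match those of the input block."""
    m, s = np.mean(x), np.std(x)
    above = x > m
    q = np.count_nonzero(above)                 # samples mapped to the high level
    n = x.size
    lo = m - s * np.sqrt(q / (n - q))           # low output level
    hi = m + s * np.sqrt((n - q) / q)           # high output level
    return np.where(above, hi, lo)

rng = np.random.default_rng(4)
x = rng.normal(5.0, 2.0, 256)
xh = two_level_mpq(x)
# first two sample moments are preserved despite a 1-bit representation
```

With one bit per sample, only two moments can be matched; preserving more moments requires more reconstruction levels, which is the rate dependence the text points out.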

We recently proposed distribution preserving quantization (DPQ) in [188, 189] (Papers A and B). DPQ uses a quantizer ensemble, which consists of quantizers that switch in a stochastic manner. In this way, the preservation of statistical properties is independent of the rate. Even the probability distribution of the source, and therefore all statistical properties, can be preserved at any rate. On top of the preservation of statistical properties, a rate-distortion optimization is performed. DPQ yields a smooth transition between synthesis and conventional quantization. At low rates, DPQ synthesizes signals according to the probability distribution of the source, while at high rates, it behaves more like a regular quantizer. In particular, the transition between the two ends is a natural outcome of a rate-distortion optimization. DPQ has demonstrated good performance in audio coding [188–190]. A large part of this dissertation is devoted to the theoretical and practical aspects of DPQ.



4 Distribution Preserving Quantization

At the beginning of this dissertation, we mentioned a new objective for lossy coding of PRS: optimizing the trade-off between rate and distortion, under a constraint that the statistical properties of the reconstruction are similar to those of the source. From the earlier discussion, we can see that this objective is consistent with findings from studies of the perceived quality of PRS. In particular, it facilitates the preservation of perceptually relevant features, thus enabling a high perceived quality. One manifestation of the new objective is DPQ, which has demonstrated superior performance over conventional methods. Intriguingly, DPQ yields a smooth transition between the two coding paradigms, synthesis and signal quantization, which have proven to be efficient at low rates and at high rates, respectively.

In this section, DPQ will be discussed in detail. We will present its definition, a theoretical analysis of its optimal performance, its realizations, variations, and applications.

4.1 Definition

Before going into the definition of DPQ, let us take a tour of why and how to define DPQ. We begin by seeking ways to formulate the preservation of signal features.

Formulation

When a signal feature is discussed, one may view it either as a summary of observed data, or as a statistical property related to a random entity. These different understandings of the term "feature" may lead to different formulations for feature preservation.

When considering a feature as a summary of data, one may end up with constraints such as "the sample mean of the reconstructed signal should be close to that of the observed source." This constraint can be expressed by a distortion measure, e.g.,

$$ d(x, \hat{x}) = \left( k^{-1} \sum_{i=1}^{k} x_i - k^{-1} \sum_{i=1}^{k} \hat{x}_i \right)^2, \quad (44) $$
$$ x = (x_1, \cdots, x_k), \quad \hat{x} = (\hat{x}_1, \cdots, \hat{x}_k). $$

More generally, one can define a canonical distortion measure [191], which is a distortion measure on one or more functions of a source sample and its reconstruction. In general, canonical distortion measures are difficult to optimize. In addition, they are of limited theoretical interest because they usually cannot form single-letter fidelity criteria. We can see that even such a simple distortion measure as (44) is not single-letter. A consequence is that the RDF loses its definition and the rate-distortion optimization becomes intractable.

The other viewpoint is to regard both the source and the reconstruction as random variables, which are characterized by their probability distributions, and then define a feature as a statistical property. There can be two logical advantages of this viewpoint over the observation-based definition of a feature: 1) it reflects the abstraction process in neural processing, which associates an observed object to an underlying concept, and 2) it allows better use of a priori knowledge, i.e., we may know the features of the source and the reconstruction without observing any particular realizations.

Besides conceptual benefits, viewing features as statistical properties also brings technical convenience. By defining a statistical model of the source, its features are fixed and can be preserved by imposing a constraint on statistical properties of the reconstruction, like "the mathematical expectation of the reconstruction should be close to zero". Such a constraint is, per se, a constraint on the probability distribution of the reconstruction, which can be fulfilled at any rate, since it is always possible to achieve a certain probability distribution by post-processing, e.g., random number generation or transformation. In contrast, a constraint on a canonical distortion measure may be satisfied only at a limited range of rates.

From the discussion above, we see that it is advantageous to formulate feature preservation as a constraint on statistical properties of the reconstruction. Such a constraint can be defined as a threshold on a divergence measure between certain statistical properties of the source and those of its reconstruction. Another treatment is to confine certain statistical properties to be identical. The latter is simpler, since it requires no additional divergence measure and threshold. We will focus on this approach below. Constraining a divergence measure will be considered later as a variation of the methodology being developed now.

The next question for the new objective for lossy source coding is what statistical properties to preserve.

What statistical properties to preserve

There are two possibilities: 1) to preserve some selected statistical properties and 2) to preserve the probability distribution so that all statistical properties are preserved. These two perspectives are compared in the following.

1. The perceived quality may depend on multiple statistical properties. It is difficult to identify all of them. Even if it is possible to identify all perceptually relevant statistical properties, there may be an arbitrary number of them. Therefore, lossy coding of PRS can become an optimization problem with an arbitrary number of constraints, making it hard to analyze. Preserving the probability distribution is a single constraint and can be more mathematically tractable.

2. Quantization that preserves the probability distribution facilitates the companding structure in Figure 5. Specifically, a quantizer that preserves the probability distribution in a companded domain also preserves the probability distribution in the source domain. In contrast, the preservation of a statistical property in a companded domain does not necessarily mean that the same property is preserved in the source domain. For example, let X be a positive-valued random variable and let the compressor be F(x) = x^2; it is clear that E{F(X)} = E{F(X̂)} does not guarantee E{X} = E{X̂}.

3. A quantizer that preserves certain statistical properties can be implemented as a conventional quantizer with fixed cells and reconstruction points, as for MPQ. However, this is not a universal solution, since there is a limitation on the rate. A better implementation is to use a quantizer ensemble, which breaks the interdependency between the rate and the statistical properties of the reconstruction. To implement a quantizer that preserves the probability distribution, the quantizer ensemble is the only choice (see Paper C).
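The counterexample in point 2 above can be checked numerically. The distributions below are illustrative choices, not taken from the papers: X takes values 1 and 3 with equal probability, and the "reconstruction" is the constant sqrt(5).

```python
# With the compressor F(x) = x^2, the second moments match,
# E{F(X)} = E{F(Xhat)} = 5, yet E{X} = 2 while E{Xhat} = sqrt(5).
import math

p = [0.5, 0.5]
x_vals = [1.0, 3.0]
xhat = math.sqrt(5.0)            # a deterministic "reconstruction"

E_FX = sum(pi * v ** 2 for pi, v in zip(p, x_vals))   # E{F(X)} = 5
E_FXhat = xhat ** 2                                   # E{F(Xhat)} = 5
E_X = sum(pi * v for pi, v in zip(p, x_vals))         # E{X} = 2
E_Xhat = xhat                                         # about 2.236

assert abs(E_FX - E_FXhat) < 1e-12   # property matches in companded domain
assert abs(E_X - E_Xhat) > 0.2       # but the means differ in source domain
```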

Preserving the probability distribution is thus advantageous compared to preserving selected statistical properties, and it will be the focus in the following. A method that preserves a specific statistical property will be considered later as a variation.

As discussed, a quantizer ensemble is a proper choice for preserving statistical properties. The concept of the quantizer ensemble will be addressed next.

Quantizer ensemble

A quantizer ensemble is a set of quantizers, together with a random variable that selects the quantizer for each use of the ensemble. In this dissertation, the selecting variable is confined to be statistically independent of the source. A conventional quantizer is a special quantizer ensemble, which contains only one quantizer.

The rate and the distortion of a quantizer ensemble are the mathematical expectations of the rate and the distortion of an individual quantizer in the ensemble. If all quantizers in an ensemble are fixed-rate, the quantizer ensemble is also fixed-rate. If any quantizer has a variable rate, the quantizer ensemble is variable-rate.

The quantizer ensemble provides reconstructions with high flexibility in their statistical properties. The probability distribution of the reconstruction $\hat{X}$ is

$$ f_{\hat{X}}(\hat{x}) = \int_{\mathcal{Z}} f_Z(z) \Pr\{\hat{X} = \hat{x} \mid Z = z\} \, dz, \quad (45) $$

where Z is the selecting variable of the quantizer ensemble. By choosing a proper set of quantizers and a proper selecting variable, the reconstruction can have an arbitrary probability distribution.

An example of a quantizer ensemble is the dithered quantizer. The dithered quantizer was introduced as the combination of a dither generator and a quantizer. From the quantizer ensemble point of view, the dither generator is a random variable that selects from a pool of quantizers. It will be shown that many lossy source coding schemes that facilitate preservation of source statistical properties can be based on a dithered quantizer.
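As a concrete illustration (a minimal sketch, not code from the papers; the step size is an arbitrary choice), a subtractively dithered uniform scalar quantizer can be read as a quantizer ensemble: each dither value selects one member quantizer, i.e., one shifted lattice of cells and reconstruction points.

```python
import random

DELTA = 0.25  # quantizer step size (illustrative choice)

def dithered_quantizer(x, z, delta=DELTA):
    """One member of the ensemble: a uniform quantizer whose lattice is
    shifted by the dither z. Returns (index, reconstruction)."""
    i = round((x + z) / delta)        # encoder: quantize x + z
    x_hat = i * delta - z             # decoder: subtract the dither
    return i, x_hat

# The selecting variable Z is uniform over one cell and independent of the source.
random.seed(0)
x = 0.7
z = random.uniform(-DELTA / 2, DELTA / 2)
i, x_hat = dithered_quantizer(x, z)

# The error x_hat - x lies in [-DELTA/2, DELTA/2] and, for uniform dither,
# is independent of x, which makes the output statistics easy to analyze.
assert abs(x_hat - x) <= DELTA / 2 + 1e-12
```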

Definition of DPQ

It is now possible to define distribution preserving quantization.

Definition 2 (DPQ) A distribution preserving quantizer is a quantizer ensemble for which the reconstruction has the same probability distribution as the source.

According to the GRT model of similarity, DPQ maximizes the similarity between the source and the reconstruction. As discussed, such similarity can be seen as inter-class similarity. Certainly, there can be detectable differences between a source sample and a reconstruction, even if they come from the same probability distribution. It is expected that such differences decrease as the rate increases. This can be achieved efficiently through the rate-distortion optimization of classic rate-distortion theory. Such optimization can be seen as maximization of intra-class similarity.

Two major concerns of DPQ are 1) what is the optimal rate-distortion trade-off under the constraint of probability distribution preservation, and how this optimal trade-off is related to the classic RDF, and 2) how to implement a DPQ that is useful for practical applications. In the following, these two questions will be addressed.

4.2 Distribution Preserving Rate-Distortion Function

Analogous to the RDF, a so-called distribution preserving rate-distortion function (DP-RDF) can be used as a guideline for the rate-distortion performance of DPQ.

Definition 3 (DP-RDF) Given a discrete-time process $\{X_t\}_{t=1}^{\infty}$ and a single-letter fidelity criterion $\{\rho_t\}_{t=1}^{\infty}$, the distribution preserving rate-distortion function (DP-RDF) for $\{X_t\}_{t=1}^{\infty}$ and $\{\rho_t\}_{t=1}^{\infty}$ is defined as

$$ R_{\mathrm{DP}}(D) = \lim_{k \to \infty} k^{-1} \inf_{f_{\hat{X}|X}(\cdot|\cdot) \in \mathcal{Q}_k(D)} I(X; \hat{X}), \quad (46) $$
$$ X = (X_1, \cdots, X_k), \quad \hat{X} = (\hat{X}_1, \cdots, \hat{X}_k), $$

where $\mathcal{Q}_k(D)$ consists of all conditional probability distributions such that 1) the expectation of the k-th distortion measure is bounded by D and 2) the reconstruction has the same probability distribution as the source, i.e.,

$$ \mathcal{Q}_k(D) = \left\{ f_{\hat{X}|X}(\cdot|\cdot) : \mathrm{E}\{\rho_k(X, \hat{X})\} \leq D, \; f_{\hat{X}}(x) = f_X(x), \; \forall x \right\}. \quad (47) $$

To verify the feasibility of this definition, we will show by the following theorem that, for stationary processes with decaying memory, the limit in (46) exists.

Theorem 4 For a stationary discrete-time process $\{X_t\}_{t=1}^{\infty}$ consisting of continuous r.v.'s, the sequence

$$ R_k(D) = k^{-1} \inf_{f_{\hat{X}|X}(\cdot|\cdot) \in \mathcal{Q}_k(D)} I(X; \hat{X}), \quad (48) $$
$$ X = (X_1, \cdots, X_k), \quad \hat{X} = (\hat{X}_1, \cdots, \hat{X}_k), $$

with $\mathcal{Q}_k(D)$ defined by (47), converges when $k \to \infty$, if the process has decaying memory in the following sense:

$$ \lim_{k \to \infty} k^{-1} I(X_1, \cdots, X_k; X_{k+1}, \cdots) = 0. \quad (49) $$

A proof of Theorem 4 follows a similar procedure as [192, Theorem 9.8.1] and [5, Problem 4.17]. However, for the sake of probability distribution preservation, a copula [193] is introduced. The proof is provided in the Appendix.

We can also see that, when $\{X_t\}_{t=1}^{\infty}$ are i.i.d., following the same logic as for the RDF, the DP-RDF can be defined by an individual sample of the process. This case is considered in Paper C.

Like the RDF for conventional quantization, the DP-RDF is a lower bound on the rate of all DPQ schemes, given any distortion level. This is established by the following theorem.

Theorem 5 For a discrete-time process $\{X_t\}_{t=1}^{\infty}$ and a single-letter fidelity criterion $\{\rho_t\}_{t=1}^{\infty}$, if the DP-RDF $R_{\mathrm{DP}}(D)$ is well defined, no DPQ with a rate less than $R_{\mathrm{DP}}(D)$ can achieve a distortion less than D.


Proof: We consider a k-tuple of the source $X = (X_1, \cdots, X_k)$ and a DPQ that operates on X, yielding $\hat{X}$. Let Z be the selecting variable of the DPQ; we can see that

$$ R \geq k^{-1} H(\hat{X} \mid Z) \quad (50) $$
$$ = k^{-1} \left( H(\hat{X} \mid Z) - H(\hat{X} \mid X, Z) \right) \quad (51) $$
$$ = k^{-1} \left( h(X \mid Z) - h(X \mid \hat{X}, Z) \right) \quad (52) $$
$$ \geq k^{-1} \left( h(X) - h(X \mid \hat{X}) \right) \quad (53) $$
$$ = k^{-1} I(X; \hat{X}), \quad (54) $$

where (51) stems from $H(\hat{X} \mid X, Z) = 0$, since $\hat{X}$ is determined by X and Z; (52) rewrites the conditional mutual information $I(X; \hat{X} \mid Z)$ in terms of differential entropies; and (53) uses the independence between X and Z, together with the fact that conditioning cannot increase entropy.

A DPQ must preserve the probability distribution of an X of arbitrary dimensionality. So, letting $k \to \infty$, Theorem 5 follows from (54) and the definition of the DP-RDF.

The DP-RDF is in general greater than the RDF for the same source and distortion level, due to the additional constraint of probability distribution preservation. However, such a loss can be justified when perceived quality is taken into account. Moreover, it is shown in Paper C that, for a large range of sources and distortion measures, the DP-RDF approaches the RDF as the distortion decreases. This implies that the optimal DPQ may perform as well as conventional quantization at high rates, even from the classic rate-distortion perspective.
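The last observation can be illustrated with a back-of-the-envelope computation (a sketch restricted to jointly Gaussian test channels, not the rigorous derivation of Paper C). For an i.i.d. Gaussian source with variance $\sigma^2$ under MSE, preserving the distribution forces $\mathrm{Var}(\hat{X}) = \sigma^2$, so a jointly Gaussian channel with correlation coefficient $\rho$ gives $D = 2\sigma^2(1-\rho)$ and rate

```latex
I(X;\hat{X}) = \frac{1}{2}\log_2\frac{1}{1-\rho^2}
             = \frac{1}{2}\log_2\frac{\sigma^2}{D\bigl(1 - D/(4\sigma^2)\bigr)}
             = \frac{1}{2}\log_2\frac{\sigma^2}{D}
             + \frac{1}{2}\log_2\frac{1}{1 - D/(4\sigma^2)},
\qquad \rho = 1 - \frac{D}{2\sigma^2}.
```

The second term vanishes as $D \to 0$, so this distribution preserving rate approaches the classic Gaussian RDF $\frac{1}{2}\log_2(\sigma^2/D)$ at small distortion, consistent with the claim above.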

Unlike the RDF, which has been proved to be an achievable lower bound, the achievability of the DP-RDF for a general source and single-letter fidelity criterion is an open problem. However, for i.i.d. Gaussian sources or stationary Gaussian processes subject to MSE, DPQ schemes that asymptotically achieve the DP-RDF have been identified. The details are enclosed in Papers C and D.

4.3 Realizations

DPQ can be implemented in different ways. A simple implementation is to alter a conventional quantizer so that it randomly selects a reconstruction in the quantization cell that a source sample belongs to. In this process, the random selection follows the probability distribution of the source in that cell, i.e.,

$$ f_{S_i}(x) = \frac{1}{\Pr\{X \in S_i\}} \begin{cases} f_X(x), & x \in S_i \\ 0, & x \notin S_i \end{cases}. \quad (55) $$

This DPQ scheme was proposed in [188] (Paper A). Here, it is referred to as random-reconstruction-based DPQ. In the language of the quantizer ensemble, such a DPQ scheme uses a set of quantizers with identical quantization cells but different reconstruction points.

Figure 9: A transformation-based DPQ (source → dithered quantizer → transformation → reconstruction).
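A minimal sketch of the random-reconstruction idea for a uniform scalar partition and a standard Gaussian source model follows; the cell layout and the use of inverse-CDF sampling are illustrative choices, not the exact construction of Paper A.

```python
import random
from statistics import NormalDist

DELTA = 0.5                      # cell width (illustrative choice)
SRC = NormalDist(0.0, 1.0)       # assumed source model: standard Gaussian

def encode(x, delta=DELTA):
    """Conventional uniform partition: return the cell index of x."""
    return round(x / delta)

def decode(i, delta=DELTA):
    """Random reconstruction: draw from the source distribution restricted
    (and renormalized) to cell S_i, as in (55), via inverse-CDF sampling."""
    lo, hi = (i - 0.5) * delta, (i + 0.5) * delta
    u = random.uniform(SRC.cdf(lo), SRC.cdf(hi))  # uniform over the cell's CDF mass
    return SRC.inv_cdf(u)

random.seed(1)
x = 0.8
i = encode(x)
x_hat = decode(i)
assert encode(x_hat) == i        # reconstruction falls in the same cell as x
```

Because the within-cell distribution matches the source model, repeated use reproduces the source probability distribution at the output, at the cost of extra MSE relative to centroid reconstruction.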

Another DPQ scheme uses a dithered quantizer and a transformation, as shown in Figure 9. The basic idea is that, according to Theorem 2, the probability distribution of the output of a dithered quantizer is easy to analyze. A transformation on this output can then be used to retrieve the source distribution. Papers B and C in this dissertation deal with this type of DPQ, which is referred to as transformation-based DPQ.

Comparing the two DPQ schemes, random-reconstruction-based DPQ is simpler to implement, since it uses a fixed partition, avoiding the need for synchronizing the cell arrangement between the encoder and the decoder. However, this DPQ is suboptimal in terms of rate-distortion performance, causing a loss of up to 3 dB in the MSE at any rate. In contrast, transformation-based DPQ, although possibly more complex to implement, is asymptotically rate-distortion optimal in the following senses:

1. With increasing rate, the MSE of a transformation-based DPQ can approach the MSE of the dithered quantizer used in it. Since it has been shown that dithered quantization can be asymptotically optimal at high rates [194], transformation-based DPQ should also be optimal in the same circumstances. See Papers B and C for details on this argument;

2. Transformation-based DPQ asymptotically achieves the DP-RDF for i.i.d. Gaussian sources and MSE, as the dimensionality increases. Paper C provides more details on this point;

3. With a transformation-based DPQ as the core, a DPQ scheme asymptotically achieves the DP-RDF for stationary Gaussian processes and MSE. Paper D contains the details of this DPQ scheme.
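The structure in Figure 9 can be sketched as follows for a scalar standard Gaussian source. This is a hedged illustration assuming a subtractive dither and a memoryless CDF-matching transformation; the papers use more refined transformations. Under subtractive dithering, the decoder output is Y = X + U with U uniform on one cell and independent of X, so its CDF is known in closed form and can be mapped back to the source distribution.

```python
import random
from statistics import NormalDist

DELTA = 0.5                      # step size (illustrative choice)
SRC = NormalDist(0.0, 1.0)       # assumed source model: standard Gaussian

def G(t):
    """Antiderivative of the standard normal CDF: t*Phi(t) + phi(t)."""
    return t * SRC.cdf(t) + SRC.pdf(t)

def cdf_Y(y, delta=DELTA):
    """CDF of Y = X + U, U ~ Uniform(-delta/2, delta/2) independent of X:
    the distribution of the subtractive-dither output."""
    return (G(y + delta / 2) - G(y - delta / 2)) / delta

def transform(y):
    """Distribution-restoring map: Xhat = F_X^{-1}(F_Y(y))."""
    return SRC.inv_cdf(cdf_Y(y))

def dpq(x, z, delta=DELTA):
    i = round((x + z) / delta)   # encoder: dithered quantization index
    y = i * delta - z            # decoder: subtractive-dither output
    return transform(y)          # decoder: retrieve the source distribution

random.seed(0)
x = SRC.inv_cdf(random.random())            # a source sample
z = random.uniform(-DELTA / 2, DELTA / 2)   # dither, independent of x
x_hat = dpq(x, z)
assert abs(x_hat - x) < 3 * DELTA           # reconstruction stays close
```

The transformation preserves the source distribution exactly for this model, while the distortion it adds on top of the dithered quantizer shrinks as the step size decreases.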

DPQ can also be realized in multiple description coding (Paper E). Multiple description coding (MDC) [195] is a method to combat packet losses. It encodes a source sample into multiple codewords. Depending on the set of codewords that are received, a particular decoder is called upon to produce the reconstruction. Since the effective rate varies in time, the perceived quality can be poor when a conventional MSE-based design is used. A multiple description DPQ (MD-DPQ) is introduced to address this problem. It follows the concept of transformation-based DPQ in using a dither and a transformation to preserve the source probability distribution. In particular, a transformation is designed for each packet-loss scenario, so that the source probability distribution is always preserved. In terms of rate-distortion performance, the proposed method is as efficient as a conventional MSE-optimized multiple description quantizer in the high-rate limit.

4.4 Variations

In the preceding discussion of the definition of DPQ, we encountered two alternatives to DPQ: 1) quantization with a constraint on a measure of the divergence between the probability distributions of the source and the reconstruction and 2) quantization with specific statistical properties preserved. As mentioned, these are treated as variations of DPQ.

For the first variation, a possible measure of the divergence between two probability distributions is the Kullback-Leibler (KL) divergence, also known as the relative entropy. Quantization with a constraint on the KL divergence, in conjunction with a conventional distortion measure, was considered in [188] (Paper A).
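For intuition, the KL divergence has a simple closed form when both distributions are Gaussian. The snippet below is an illustrative check of that standard formula, not the derivation or the constraint machinery of Paper A.

```python
import math

def kl_gauss(mu1, var1, mu2, var2):
    """KL divergence D( N(mu1,var1) || N(mu2,var2) ) in nats."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Identical distributions have zero divergence ...
assert kl_gauss(0.0, 1.0, 0.0, 1.0) == 0.0
# ... and the divergence grows as the reconstruction statistics drift away.
assert kl_gauss(0.0, 1.0, 0.5, 1.0) > kl_gauss(0.0, 1.0, 0.1, 1.0)
```

A small divergence threshold therefore keeps the reconstruction statistics close to those of the source, while allowing some slack to reduce the MSE.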

For the second variation, a quantizer ensemble that preserves the power spectral density (PSD) is considered in Paper D. Such a quantizer ensemble is called a PSD preserving quantization (PSD-PQ) method. In analogy to the DP-RDF, Paper D derives a rate-distortion lower bound for PSD-PQ. Paper D also proposes asymptotically optimal PSD-PQ schemes. Interestingly, one proposed scheme in Paper D has the same structure as the pre- and post-filtered dithered lattice quantizer in Figure 3, but with a different choice of pre- and post-filter.

4.5 Applications

Some of the proposed DPQ schemes (including their variations) have been applied to audio coding, where DPQ facilitates the construction of audio coders that can operate over a vast range of rates. These applications mainly considered scalar DPQ methods. To enhance their performance, the scalar DPQ schemes were used either in transform coding [188-190] or in predictive coding (Paper D). Audio coders based on DPQ have demonstrated significant perceptual advantages over coders based on conventional rate-distortion optimized quantization. In particular, DPQ has proved to be an effective way to remove the "birdies" and "band-limited" artifacts. Details of the applications of DPQ in audio coding can be found in Papers A, B, D, and E.

The proposed DPQ schemes all require a statistical model of the source. Their performance relies on the accuracy of this model. In the theoretical treatment of DPQ, the model is assumed known a priori. In practice, however, it usually needs to be estimated from the source samples. In the aforementioned applications of DPQ in audio coding, audio signals were modeled as quasi-stationary Gaussian autoregressive (AR) processes. Such models are generally accurate, but can be inefficient for some audio signals, which may affect the performance of DPQ. Nevertheless, each application has provided strong evidence to validate the philosophy of this dissertation: rate-distortion optimization subject to preserving source statistical properties is an attractive objective for lossy coding of PRS.

5 Summary of Contributions

This dissertation focuses on the theory and practice of lossy source coding with constraints on statistical properties of the reconstruction. The main contributions include

1. the definition of distribution preserving quantization (DPQ) and the distribution preserving rate-distortion function (DP-RDF), and their properties,

2. the optimal DPQ for Gaussian sources and the mean squared error (MSE), and

3. the design and applications of practical DPQ schemes and their variations.

Some of the contributions are enclosed in the six papers in Part II of this dissertation. Short summaries of these papers are presented below.

Paper A: Quantization with Constrained Relative Entropy and Its Application to Audio Coding

As the first development in the context of lossy source coding with a constraint on statistical properties of the reconstruction, Paper A considers a simple implementation of DPQ, i.e., the random-reconstruction-based DPQ. This DPQ causes a loss of up to 3 dB in terms of the MSE, compared to an MSE-optimized quantizer, at any rate. To reduce the MSE, the random reconstruction is set up to sample from a probability distribution that is within a certain Kullback-Leibler divergence from the source probability distribution conditioned on the quantization index. The optimal probability distribution for reconstruction and the optimal quantization cell arrangement are both analyzed under high-rate assumptions. The proposed method was applied to a transform-based audio coder, and a subjective test confirmed its perceptual benefits. The proposed method is particularly efficient in eliminating the "birdies" artifact.

Page 68: ISBN 978-91-7501-075-5 ISSN 1653-5146 Distribution ...437204/FULLTEXT02.pdf · Distribution Preserving Quantization Minyue Li Doctoral Thesis in Telecommunications Stockholm, Sweden

5 Summary of Contributions 49

Paper B: Distribution Preserving Quantization with Dithering and Transformation

In Paper B, a transformation-based scalar DPQ is proposed. Unlike the random-reconstruction-based DPQ in Paper A, which causes a loss in the MSE at any rate, the transformation-based DPQ asymptotically achieves zero loss with an increasing rate. A transformation-based DPQ was applied to a transform-based audio coder and compared to a random-reconstruction-based DPQ and a conventional MSE-optimized quantizer. It outperformed both systems in a subjective test.

Paper C: On Distribution Preserving Quantization

Paper C deals with a variety of theoretical aspects of DPQ. First, it provides a formal definition of DPQ based on the notion of a quantizer ensemble. Second, Paper C defines the DP-RDF for i.i.d. sources and a single-letter fidelity criterion, and proves that the DP-RDF is a lower bound on the rate under a constraint on the distortion, for any DPQ scheme. The DP-RDF for i.i.d. Gaussian sources and MSE is derived. It is also observed that, for a broad class of sources and fidelity criteria, including i.i.d. Gaussian sources and MSE, the DP-RDF approaches the RDF as the distortion decreases. Third, the transformation-based DPQ proposed in Paper B is generalized to a vector quantizer. The asymptotic behaviors of such a DPQ are analyzed for high-rate and large-dimensionality situations, respectively. In particular, Paper C shows that, for i.i.d. Gaussian sources and MSE, transformation-based DPQ achieves the DP-RDF for the entire range of rates, as the dimensionality increases.

Paper D: Asymptotically Optimal Distribution Preserving Quantization for Stationary Gaussian Processes

In Paper D, the DP-RDF is derived for stationary Gaussian processes and MSE. A DPQ scheme that asymptotically achieves this DP-RDF is proposed based on the findings of Paper C. For the sake of applicability, the quantizer is simplified to a power spectral density preserving quantization (PSD-PQ) scheme. In Paper D, the optimal rate-MSE trade-off for all PSD-PQ schemes is analyzed. Interestingly, this optimal trade-off is the same as the RDF for stationary Gaussian processes. The proposed PSD-PQ scheme is proven to be asymptotically optimal. It was implemented in a predictive coding framework and applied to audio compression. A subjective test verified its advantages over a conventional rate-MSE optimized predictive coding scheme.


Paper E: Multiple Description Distribution Preserving Quantization

Paper E extends the idea of the transformation-based DPQ proposed in Paper B to multiple description coding. The resulting method uses a dithered multiple description quantizer that facilitates analytically tractable statistical modeling of the quantizer output for any description-loss scenario. For each description-loss scenario, a transformation is performed on the quantizer output, so that the probability distribution of the source is always retrieved. The proposed method asymptotically achieves the same rate-distortion performance as a conventional MSE-optimized multiple description quantizer, as the rate increases, while it has a significant perceptual advantage. The efficiency of the scheme was illustrated in an application to packet-loss robust audio coding.

Paper F: Sequential Entropy Coding of Quantization Indices with Update Recursion on Probability

The quantization schemes proposed in Papers A-E are all variable-rate. Paper F provides an entropy coding algorithm for variable-rate quantizers to achieve a high coding efficiency. The algorithm is called Sequential Entropy Coding with Update REcursion on probability (SECURE). SECURE applies a generic model of a lossy coding system, in which source statistics and side information are considered. The source is assumed to be associated with a hidden Markov model, which facilitates a recursive update of the probability distribution of a quantization index given all past indices and side information. The side information is particularly useful for modeling a dither, which is extensively used by the methods in Papers B-E. The efficiency of SECURE was verified in two lossy coding scenarios, both of which involved a dithered quantizer.
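A hedged sketch of the kind of recursion such a scheme builds on follows: a generic hidden-Markov forward update, with an invented two-state toy model (the matrices A and EMIT are illustrative numbers, and this is not the SECURE algorithm of Paper F).

```python
import math

# A[s][s2]: P(next state s2 | state s); EMIT[s][i]: P(index i | state s).
A = [[0.9, 0.1],
     [0.2, 0.8]]
EMIT = [[0.7, 0.3],
        [0.4, 0.6]]

def step(posterior, index):
    """One recursion: predict the state through A, then condition on the
    newly observed quantization index. Returns the updated posterior and
    P(index | past indices), the probability handed to the entropy coder."""
    predicted = [sum(posterior[s] * A[s][s2] for s in range(2)) for s2 in range(2)]
    joint = [predicted[s2] * EMIT[s2][index] for s2 in range(2)]
    norm = sum(joint)
    return [j / norm for j in joint], norm

posterior = [0.5, 0.5]           # initial state distribution
code_length_bits = 0.0
for idx in [0, 0, 1, 1]:
    posterior, p_idx = step(posterior, idx)
    code_length_bits += -math.log2(p_idx)   # ideal codeword length per index
```

Feeding these sequentially updated probabilities to an arithmetic coder yields a code length close to `code_length_bits`, which is the sense in which a recursive probability model improves coding efficiency.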

6 Conclusions and Future Work

Summarizing the discussion in Part I and collecting important findings from the papers in Part II, we can draw the following conclusions for this dissertation:

1. Rate-distortion optimization with a constraint that statistical properties of the reconstruction remain similar to those of the source facilitates lossy coding of perceptually relevant signals with state-of-the-art quality over a large range of rates. In particular, coding methods based on this objective enable a smooth transition between synthesis and conventional quantization;


2. A constraint on statistical properties of the reconstruction is technically feasible. In particular, such a constraint can be fulfilled at any rate using a quantizer ensemble;

3. Distribution preserving quantization (DPQ) is a proper realization of the proposed objective. It facilitates mathematical analysis;

4. The distribution preserving rate-distortion function (DP-RDF) defines a lower bound on the rate of any DPQ at any distortion level. For a large range of sources and distortion measures, the DP-RDF approaches the RDF with decreasing distortion;

5. The DP-RDF for stationary Gaussian processes and the mean squared error can be asymptotically achieved based on a transformation-based DPQ with increasing dimensionality;

6. Transformation-based DPQ has good rate-distortion performance at high rates for any dimensionality.

The concept of source coding with a constraint on statistical properties of the reconstruction is still in its infancy. Many aspects remain to be worked out. Some interesting topics are:

1. Achievability of the DP-RDF: The achievability of the DP-RDF for a general source and fidelity criterion is an open problem;

2. Influence of a mismatched model: Current DPQ schemes rely on an accurate source model. If a priori knowledge of the source is inaccurate, both the rate-distortion performance and the preservation of source statistical properties can be affected. Evaluation of this sensitivity, and solutions to minimize it, may be studied;

3. Universal coding: Another way to combat possible mismatches in an a priori model is to avoid using such a model altogether and, instead, apply a coding system that adapts to its output. This is known as backward adaptation, which makes a coding system universal. Backward adaptation has been considered in a transform coding method [43]. A theoretical interest for a backward adaptive system is whether its performance converges to the optimal performance;

4. Applications: DPQ and its variations can be applied to additional scenarios, e.g., multichannel audio coding, image coding, and video coding. In principle, if a synthesis model exists and can be reformulated as a statistical model, a DPQ can be derived.


Appendix. A Proof of Theorem 4

Proof: As stated in Theorem 4, $\{X_t\}_{t=1}^{\infty}$ is a stationary process. Let m and L be arbitrary positive integers. For an (m + L)-tuple of the source $X = (X_1, \cdots, X_{m+L})$, we may split it into an m-tuple $A = (X_1, \cdots, X_m)$ and an L-tuple $B = (X_{m+1}, \cdots, X_{m+L})$. We want to find a relation among $R_{m+L}(D)$, $R_m(D)$, and $R_L(D)$.

According to Sklar's theorem (see, e.g., [193]), there always exists a copula $C(\cdot,\cdot)$ that relates the marginal probability distributions of the m-tuple and the L-tuple to their joint probability distribution in the following manner:

$$ F_{A,B}(a, b) = C(F_A(a), F_B(b)). \quad (56) $$

For a given D, we select the $f_{\hat{A}|A}(\cdot|\cdot)$ that achieves $R_m(D)$, subject to $f_{\hat{A}|A}(\cdot|\cdot) \in \mathcal{Q}_m(D)$. Similarly, we select the $f_{\hat{B}|B}(\cdot|\cdot) \in \mathcal{Q}_L(D)$ that achieves $R_L(D)$. Because of the stationarity of the source process, the shift of B from the first sample of the entire sequence does not influence the result.

Let $F_{A,\hat{A}}(\cdot,\cdot)$ be the c.d.f. induced by $f_{\hat{A}|A}(\cdot|\cdot)$ and $F_{B,\hat{B}}(\cdot,\cdot)$ be the c.d.f. induced by $f_{\hat{B}|B}(\cdot|\cdot)$. Then we define a joint probability distribution of X and its reconstruction $\hat{X}$, using the copula:

$$ F_{A,B,\hat{A},\hat{B}}(a, b, \hat{a}, \hat{b}) = C(F_{A,\hat{A}}(a, \hat{a}), F_{B,\hat{B}}(b, \hat{b})). \quad (57) $$

It can be seen that

$$ F_{\hat{A},\hat{B}}(\hat{a}, \hat{b}) = \lim_{a \to \infty^m, \, b \to \infty^L} C(F_{A,\hat{A}}(a, \hat{a}), F_{B,\hat{B}}(b, \hat{b})) \quad (58) $$
$$ = C(F_{\hat{A}}(\hat{a}), F_{\hat{B}}(\hat{b})) \quad (59) $$
$$ = C(F_A(\hat{a}), F_B(\hat{b})) \quad (60) $$
$$ = F_{A,B}(\hat{a}, \hat{b}), \quad (61) $$

where (59) stems from the fact that $C(\cdot,\cdot)$ is a continuous function, and (60) is due to the fact that $f_{\hat{A}|A}(\cdot|\cdot)$ and $f_{\hat{B}|B}(\cdot|\cdot)$ preserve the probability distributions of A and B, respectively. In addition, since the fidelity criterion is single-letter, it follows that $\mathrm{E}\{\rho_{m+L}(X, \hat{X})\} \leq D$. Therefore, the conditional p.d.f. of $\hat{X}$ given X induced from (57) belongs to $\mathcal{Q}_{m+L}(D)$, so it provides an upper bound on $R_{m+L}(D)$.

In addition, it follows from (57) that the joint p.d.f. of X and $\hat{X}$ can be written as

$$ f_{A,B,\hat{A},\hat{B}}(a, b, \hat{a}, \hat{b}) = \frac{\partial^4 F_{A,B,\hat{A},\hat{B}}(a, b, \hat{a}, \hat{b})}{\partial a \, \partial b \, \partial \hat{a} \, \partial \hat{b}} = c(F_{A,\hat{A}}(a, \hat{a}), F_{B,\hat{B}}(b, \hat{b})) \, f_{A,\hat{A}}(a, \hat{a}) \, f_{B,\hat{B}}(b, \hat{b}), \quad (62) $$


where c(·, ·) is the derivative of C(·, ·) and known as a copula density. Thenwe can relate the mutual information between X and X to the mutualinformation between their parts as

I(X; \hat{X}) = \int f_{A,B,\hat{A},\hat{B}}(a, b, \hat{a}, \hat{b}) \log_2 \frac{f_{A,B,\hat{A},\hat{B}}(a, b, \hat{a}, \hat{b})}{f_{A,B}(a, b) \, f_{\hat{A},\hat{B}}(\hat{a}, \hat{b})} \, da \, db \, d\hat{a} \, d\hat{b}    (63)

             = \int f_{A,B,\hat{A},\hat{B}}(a, b, \hat{a}, \hat{b}) \log_2 \frac{f_{A,\hat{A}}(a, \hat{a}) \, f_{B,\hat{B}}(b, \hat{b})}{f_{A,B}(a, b) \, f_{\hat{A},\hat{B}}(\hat{a}, \hat{b})} \, da \, db \, d\hat{a} \, d\hat{b} + \int_{[0,1]^2} c(\mu, \nu) \log_2 c(\mu, \nu) \, d\mu \, d\nu    (64)

             = h(A, B) + h(\hat{A}, \hat{B}) - h(A, \hat{A}) - h(B, \hat{B}) + I(A; B)    (65)

             \le I(A; \hat{A}) + I(B; \hat{B}) + I(A; B),    (66)

where (65) uses a property of the copula, i.e., the negative copula entropy equals the mutual information of the random variables that the copula describes (see, e.g., [196]). Then we can see that

R_{m+L}(D) \le \frac{1}{m+L} I(X; \hat{X})    (67)

           \le \frac{m}{m+L} R_m(D) + \frac{L}{m+L} R_L(D) + \frac{1}{m+L} I(A; B).    (68)
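As an aside, the copula-entropy property used in (65) can be checked numerically for the Gaussian copula, whose density c has a closed form: for a bivariate Gaussian pair with correlation coefficient rho, the sample average of log2 c at the probability-transformed pair should converge to I(A;B) = -(1/2) log2(1 - rho^2). A Monte Carlo sketch (illustrative, not from the thesis):

```python
import math
import random

# Check that E[log2 c(U, V)], i.e. the negative copula entropy, equals
# I(A;B) for a bivariate Gaussian (A, B) with correlation rho.  At
# (u, v) = (Phi(x), Phi(y)), the Gaussian copula density is
#   c = 1/sqrt(1-rho^2) * exp(-(rho^2 x^2 - 2 rho x y + rho^2 y^2)
#                             / (2 (1-rho^2))).
random.seed(1)
rho = 0.8
n = 400_000

def log2_copula_density(x, y):
    q = rho * rho * x * x - 2 * rho * x * y + rho * rho * y * y
    return (-0.5 * math.log(1 - rho * rho)
            - q / (2 * (1 - rho * rho))) / math.log(2)

acc = 0.0
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    y = rho * x + math.sqrt(1 - rho * rho) * random.gauss(0.0, 1.0)
    acc += log2_copula_density(x, y)

mi_mc = acc / n                               # Monte Carlo estimate
mi_true = -0.5 * math.log2(1 - rho * rho)     # I(A;B) for the Gaussian pair
print(mi_mc, mi_true)
```

The two printed values agree to Monte Carlo accuracy, consistent with the property cited from [196].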

Using (68) with L = m and letting X_i^j denote (X_i, \ldots, X_j), we can see that

R_{2m}(D) \le R_m(D) + \frac{1}{2m} I(X_1^m; X_{m+1}^{2m}).    (69)

Next, we prove the following inequality by induction:

R_{km}(D) \le R_m(D) + \frac{1}{m} I(X_1^m; X_{m+1}^\infty),    (70)

which is true for k = 2 by (69). We can verify the induction step by

R_{(k+1)m}(D) \le \frac{k}{k+1} R_{km}(D) + \frac{1}{k+1} R_m(D) + \frac{1}{(k+1)m} I(X_1^m; X_{m+1}^{(k+1)m})    (71)

              \le R_m(D) + \frac{k}{(k+1)m} I(X_1^m; X_{m+1}^\infty) + \frac{1}{(k+1)m} I(X_1^m; X_{m+1}^\infty)    (72)

              \le R_m(D) + \frac{1}{m} I(X_1^m; X_{m+1}^\infty).    (73)
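The coefficient bookkeeping behind the step from (71) to (73), namely that k/(k+1) (R + I/m) + R/(k+1) + I/((k+1)m) collapses to R + I/m, can be verified exactly with rational arithmetic; the placeholder values of R and I below are arbitrary:

```python
from fractions import Fraction as F

# Exact check of the induction-step coefficients in (71)-(73):
# substituting the hypothesis R_{km} <= R + I/m into
#   k/(k+1) * R_{km} + 1/(k+1) * R + 1/((k+1)m) * I
# must return R + I/m for every k and m.
R, I = F(3, 2), F(7, 5)   # arbitrary placeholder rate and information values
for m in range(1, 6):
    for k in range(2, 8):
        bound = (F(k, k + 1) * (R + I / m)
                 + F(1, k + 1) * R
                 + F(1, (k + 1) * m) * I)
        assert bound == R + F(1, m) * I
print("induction step coefficients verified")
```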

Let \overline{R} = \limsup_{m \to \infty} R_m(D) and \underline{R} = \liminf_{m \to \infty} R_m(D). Due to the condition of decaying memory, it follows that

\liminf_{m \to \infty} \left( R_m(D) + m^{-1} I(X_1^m; X_{m+1}^\infty) \right) = \underline{R}.    (74)


Given any \epsilon > 0, we find an m such that R_m(D) + m^{-1} I(X_1^m; X_{m+1}^\infty) < \underline{R} + \epsilon. Given any n, write n = km + j with j < m; then, applying (68) and (70), it follows that

R_n(D) \le R_{km}(D) + \frac{j}{n} R_j(D) + \frac{1}{n} I(X_1^{km}; X_{km+1}^n)    (75)

       \le R_m(D) + \frac{j}{n} R_j(D) + \frac{1}{m} I(X_1^m; X_{m+1}^\infty) + \frac{1}{km} I(X_1^{km}; X_{km+1}^\infty).    (76)

Taking \limsup_{n \to \infty} on both sides of (76), we find that \overline{R} < \underline{R} + \epsilon. Since \epsilon can be arbitrarily small, \overline{R} must equal \underline{R}, i.e., R_n(D) converges as n \to \infty.
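The decaying-memory condition can be made concrete with a hypothetical example (not taken from the thesis): for a stationary Gaussian AR(1) source, the Markov property makes I(X_1^m; X_{m+1}^\infty) independent of m, so the normalized term m^{-1} I(X_1^m; X_{m+1}^\infty) vanishes as m grows:

```python
import math

# For a stationary Gaussian AR(1) source x[t] = a*x[t-1] + w[t], the past
# and the future interact only through the boundary samples, so
#   I(X_1^m; X_{m+1}^inf) = I(X_m; X_{m+1}) = -0.5*log2(1 - a^2),
# a constant; the per-sample term m^{-1} I therefore decays like 1/m.
a = 0.9
block_info = -0.5 * math.log2(1 - a * a)   # I(X_1^m; X_{m+1}^inf), any m
per_sample = [block_info / m for m in (1, 10, 100, 1000)]
print(per_sample)   # monotonically shrinks toward zero
```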

References

[1] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423 and 623–656, 1948.

[2] ——, “Coding theorems for a discrete source with a fidelity criterion,” IRE National Convention Record, pp. 143–163, 1959.

[3] E. Simoncelli and B. Olshausen, “Natural image statistics and neural representation,” Annual Review of Neuroscience, vol. 24, pp. 1193–1216, 2001.

[4] T. Berger and J. D. Gibson, “Lossy source coding,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2693–2723, 1998.

[5] T. Berger, Rate-distortion theory: A mathematical basis for data compression. Prentice-Hall, 1971.

[6] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. John Wiley & Sons, 2006.

[7] R. M. Gray, “Toeplitz and circulant matrices: A review,” Foundations and Trends in Communications and Information Theory, vol. 2, no. 3, pp. 155–239, 2006.

[8] B. McMillan, “Two inequalities implied by unique decipherability,” IRE Transactions on Information Theory, vol. 2, no. 4, pp. 115–116, 1956.

[9] D. A. Huffman, “A method for the construction of minimum-redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952.

[10] P. G. Howard and J. S. Vitter, “Arithmetic coding for data compression,” Proceedings of the IEEE, vol. 82, no. 6, pp. 857–865, 1994.


[11] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.

[12] J. Max, “Quantizing for minimum distortion,” IRE Transactions on Information Theory, vol. 6, no. 1, pp. 7–12, 1960.

[13] Y. Linde, A. Buzo, and R. Gray, “An algorithm for vector quantizer design,” IEEE Transactions on Communications, vol. 28, no. 1, pp. 84–95, 1980.

[14] M. Garey, D. Johnson, and H. Witsenhausen, “The complexity of the generalized Lloyd–Max problem,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 255–256, 1982.

[15] K. Zeger, J. Vaisey, and A. Gersho, “Globally optimal vector quantizer design by stochastic relaxation,” IEEE Transactions on Signal Processing, vol. 40, no. 2, pp. 310–322, 1992.

[16] W. Bennett, “Spectra of quantized signals,” Bell System Technical Journal, vol. 27, pp. 446–472, 1948.

[17] P. F. Panter and W. Dite, “Quantization distortion in pulse-count modulation with nonuniform spacing of levels,” Proceedings of the IRE, vol. 39, no. 1, pp. 44–48, 1951.

[18] P. Zador, “Asymptotic quantization error of continuous signals and the quantization dimension,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 139–149, 1982.

[19] A. Gersho, “Asymptotically optimal block quantization,” IEEE Transactions on Information Theory, vol. 25, no. 4, pp. 373–380, 1979.

[20] D. Sakrison, “A geometric treatment of the source encoding of a Gaussian random variable,” IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 481–486, 1968.

[21] P. A. Chou, T. Lookabaugh, and R. M. Gray, “Entropy-constrained vector quantization,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 1, pp. 31–42, 1989.

[22] H. Gish and J. Pierce, “Asymptotically efficient quantizing,” IEEE Transactions on Information Theory, vol. 14, no. 5, pp. 676–683, 1968.

[23] R. M. Gray, T. Linder, and J. Li, “A Lagrangian formulation of Zador’s entropy-constrained quantization theorem,” IEEE Transactions on Information Theory, vol. 48, no. 3, pp. 695–707, 2002.

[24] W. B. Kleijn, “A basis for source coding,” 2010, lecture notes.


[25] J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices and Groups, 3rd ed. Springer, 1998.

[26] G. Poltyrev, “On coding without restrictions for the AWGN channel,” IEEE Transactions on Information Theory, vol. 40, no. 2, pp. 409–417, 1994.

[27] R. Zamir and M. Feder, “On lattice quantization noise,” IEEE Transactions on Information Theory, vol. 42, no. 4, pp. 1152–1159, 1996.

[28] H.-A. Loeliger, “Averaging bounds for lattices and linear codes,” IEEE Transactions on Information Theory, vol. 43, no. 6, pp. 1767–1773, 1997.

[29] G. D. Forney, M. D. Trott, and S.-Y. Chung, “Sphere-bound-achieving coset codes and multilevel coset codes,” IEEE Transactions on Information Theory, vol. 46, no. 3, pp. 820–850, 2000.

[30] R. Urbanke and B. Rimoldi, “Lattice codes can achieve capacity on the AWGN channel,” IEEE Transactions on Information Theory, vol. 44, no. 1, pp. 273–278, 1998.

[31] U. Erez and R. Zamir, “Achieving 1/2 log(1+SNR) on the AWGN channel with lattice encoding and decoding,” IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2293–2314, 2004.

[32] U. Erez, S. Litsyn, and R. Zamir, “Lattices which are good for (almost) everything,” IEEE Transactions on Information Theory, vol. 51, no. 10, pp. 3401–3416, 2005.

[33] L. Roberts, “Picture coding using pseudo-random noise,” IRE Transactions on Information Theory, vol. 8, no. 2, pp. 145–154, 1962.

[34] R. M. Gray and T. G. Stockham, Jr., “Dithered quantizers,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 805–812, 1993.

[35] L. Schuchman, “Dither signals and their effect on quantization noise,” IEEE Transactions on Communication Technology, vol. 12, no. 4, pp. 162–165, 1964.

[36] R. Zamir, “Lattices are everywhere,” in Information Theory and Applications Workshop, 2009, pp. 392–421.

[37] R. Zamir and M. Feder, “Information rates of pre/post-filtered dithered quantizers,” IEEE Transactions on Information Theory, vol. 42, no. 5, pp. 1340–1353, 1996.


[38] T. D. Lookabaugh and R. M. Gray, “High-resolution quantization theory and the vector quantizer advantage,” IEEE Transactions on Information Theory, vol. 35, no. 5, pp. 1020–1033, 1989.

[39] J. Huang and P. Schultheiss, “Block quantization of correlated Gaussian random variables,” IEEE Transactions on Communications Systems, vol. 11, no. 3, pp. 289–296, 1963.

[40] V. K. Goyal, “Theoretical foundations of transform coding,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 9–21, 2001.

[41] A. Segall, “Bit allocation and encoding for vector sources,” IEEE Transactions on Information Theory, vol. 22, no. 2, pp. 162–169, 1976.

[42] Y. Shoham and A. Gersho, “Efficient bit allocation for an arbitrary set of quantizers,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 9, pp. 1445–1453, 1988.

[43] V. K. Goyal, J. Zhuang, and M. Vetterli, “Transform coding with backward adaptive updates,” IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1623–1633, 2000.

[44] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Transactions on Computers, no. 1, pp. 90–93, 1974.

[45] M. Hamidi and J. Pearl, “Comparison of the cosine and Fourier transforms of Markov-1 signals,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 428–429, 1976.

[46] C. C. Cutler, “Differential quantization of communication signals,” U.S. Patent 2 605 361, 1952.

[47] R. Zamir, Y. Kochman, and U. Erez, “Achieving the Gaussian rate–distortion function by prediction,” IEEE Transactions on Information Theory, vol. 54, no. 7, pp. 3354–3364, 2008.

[48] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977.

[49] ——, “Compression of individual sequences via variable-rate coding,” IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530–536, 1978.

[50] J. Li, N. Chaddha, and R. M. Gray, “Asymptotic performance of vector quantizers with a perceptual distortion measure,” IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1082–1091, 1999.


[51] T. Linder, R. Zamir, and K. Zeger, “High-resolution source coding for non-difference distortion measures: Multidimensional companding,” IEEE Transactions on Information Theory, vol. 45, no. 2, pp. 548–561, 1999.

[52] N. Virag, “Single channel speech enhancement based on masking properties of the human auditory system,” IEEE Transactions on Speech and Audio Processing, vol. 7, no. 2, pp. 126–137, 1999.

[53] F. Hartung and M. Kutter, “Multimedia watermarking techniques,” Proceedings of the IEEE, vol. 87, no. 7, pp. 1079–1107, 1999.

[54] V. Grancharov and W. B. Kleijn, “Speech quality assessment,” in Springer Handbook of Speech Processing, J. Benesty, M. M. Sondhi, and Y. Huang, Eds. Springer, 2007, ch. 5, pp. 83–102.

[55] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, and O. Ghitza, “Objective assessment of speech and audio quality — technology and applications,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1890–1901, 2006.

[56] D. Campbell, E. Jones, and M. Glavin, “Audio quality assessment techniques — a review, and recent developments,” Signal Processing, vol. 89, no. 8, pp. 1489–1500, 2009.

[57] Z. Wang and A. C. Bovik, Modern Image Quality Assessment. Morgan & Claypool, 2006.

[58] S. Winkler, “Issues in vision modeling for perceptual video quality assessment,” Signal Processing, vol. 78, no. 2, pp. 231–252, 1999.

[59] K. Seshadrinathan and A. Bovik, “Automatic prediction of perceptual quality of multimedia signals — a survey,” Multimedia Tools and Applications, vol. 51, pp. 163–186, 2011.

[60] Z. Wang and A. C. Bovik, “Mean squared error: Love it or leave it? A new look at signal fidelity measures,” IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, 2009.

[61] R. L. Wegel and C. E. Lane, “The auditory masking of one pure tone by another and its probable relation to the dynamics of the inner ear,” Physical Review, vol. 23, no. 2, pp. 266–285, 1924.

[62] J. E. Hawkins, Jr. and S. S. Stevens, “The masking of pure tones and of speech by white noise,” Journal of the Acoustical Society of America, vol. 22, pp. 6–13, 1950.


[63] R. Hellman, “Asymmetry of masking between noise and tone,” Attention, Perception, & Psychophysics, vol. 11, no. 3, pp. 241–246, 1972.

[64] B. C. J. Moore, J. I. Alcantara, and T. Dau, “Masking patterns for sinusoidal and narrow-band noise maskers,” Journal of the Acoustical Society of America, vol. 104, no. 2, pp. 1023–1038, 1998.

[65] L. L. Elliott, “Backward and forward masking,” International Journal of Audiology, vol. 10, no. 2, pp. 65–76, 1971.

[66] C. F. Stromeyer and B. Julesz, “Spatial-frequency masking in vision: Critical bands and spread of masking,” Journal of the Optical Society of America, vol. 62, no. 10, pp. 1221–1232, 1972.

[67] G. E. Legge and J. M. Foley, “Contrast masking in human vision,” Journal of the Optical Society of America, vol. 70, no. 12, pp. 1458–1471, 1980.

[68] T. Dau, D. Puschel, and A. Kohlrausch, “A quantitative model of the ’effective’ signal processing in the auditory system. I. Model structure,” Journal of the Acoustical Society of America, vol. 99, no. 6, pp. 3615–3622, 1996.

[69] ——, “A quantitative model of the ’effective’ signal processing in the auditory system. II. Simulations and measurements,” Journal of the Acoustical Society of America, vol. 99, no. 6, pp. 3623–3631, 1996.

[70] T. Dau, B. Kollmeier, and A. Kohlrausch, “Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers,” Journal of the Acoustical Society of America, vol. 102, no. 5, pp. 2892–2905, 1997.

[71] ——, “Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration,” Journal of the Acoustical Society of America, vol. 102, no. 5, pp. 2906–2919, 1997.

[72] M. L. Jepsen, S. D. Ewert, and T. Dau, “A computational model of human auditory signal processing and perception,” Journal of the Acoustical Society of America, vol. 124, no. 1, pp. 422–438, 2008.

[73] C. J. van den Branden Lambrecht and O. Verscheure, “Perceptual quality measure using a spatiotemporal model of the human visual system,” Proceedings of SPIE, vol. 2668, pp. 450–461, 1996.

[74] J. A. Ferwerda, P. Shirley, S. N. Pattanaik, and D. P. Greenberg, “A model of visual masking for computer graphics,” in International Conference on Computer Graphics and Interactive Techniques, 1997, pp. 143–152.


[75] S. van de Par, A. Kohlrausch, G. Charestan, and R. Heusdens, “A new psychoacoustical masking model for audio coding applications,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, 2002.

[76] A. P. Bradley, “A wavelet visible difference predictor,” IEEE Transactions on Image Processing, vol. 8, no. 5, pp. 717–730, 1999.

[77] A. B. Watson, J. Hu, and J. F. McGowan, “Digital video quality metric based on human vision,” Journal of Electronic Imaging, vol. 10, no. 1, pp. 20–29, 2001.

[78] J. H. Plasberg and W. B. Kleijn, “The sensitivity matrix: Using advanced auditory models in speech and audio processing,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 310–319, 2007.

[79] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.

[80] B. C. J. Moore, B. R. Glasberg, and T. Baer, “A model for the prediction of thresholds, loudness, and partial loudness,” Journal of the Audio Engineering Society, vol. 45, no. 4, pp. 224–240, 1997.

[81] M. Elhilali, T. Chi, and S. A. Shamma, “A spectro-temporal modulation index (STMI) for assessment of speech intelligibility,” Speech Communication, vol. 41, no. 2-3, pp. 331–348, 2003.

[82] M. Hirano, “Objective evaluation of the human voice: Clinical aspects,” Folia Phoniatrica, vol. 41, pp. 89–144, 1989.

[83] T. H. Falk and W.-Y. Chan, “Single-ended speech quality measurement using machine learning methods,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1935–1947, 2006.

[84] M. Narwaria, W. Lin, I. McLoughlin, S. Emmanuel, and C. Tien, “Non-intrusive speech quality assessment with support vector regression,” in Advances in Multimedia Modeling, ser. Lecture Notes in Computer Science, S. Boll, Q. Tian, L. Zhang, Z. Zhang, and Y.-P. Chen, Eds. Springer, 2010, pp. 325–335.

[85] R. Huber and B. Kollmeier, “PEMO-Q — a new method for objective audio quality assessment using a model of auditory perception,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1902–1911, 2006.


[86] T.-Y. Yen, J.-H. Chen, and T.-S. Chi, “Perception-based objective speech quality assessment,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, pp. 4521–4524.

[87] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.

[88] T. Ezzat, J. Bouvrie, and T. Poggio, “Spectro-temporal analysis of speech using 2-d Gabor filters,” in Interspeech, 2007, pp. 506–509.

[89] T. Dau, “Auditory processing models,” in Handbook of Signal Processing in Acoustics, D. Havelock, S. Kuwano, and M. Vorlander, Eds. Springer, 2008, pp. 175–196.

[90] H. R. Sheikh, A. C. Bovik, and L. Cormack, “No-reference quality assessment using natural scene statistics: JPEG2000,” IEEE Transactions on Image Processing, vol. 14, no. 11, pp. 1918–1927, 2005.

[91] H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, 2006.

[92] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[93] A. Y. Ng, “Feature selection, L1 vs. L2 regularization, and rotational invariance,” in International Conference on Machine Learning, 2004, pp. 78–85.

[94] T. Thiede, W. C. Treurniet, R. Bitto, C. Schmidmer, T. Sporer, J. G. Beerends, and C. Colomes, “PEAQ — the ITU standard for objective measurement of perceived audio quality,” Journal of the Audio Engineering Society, vol. 48, no. 1/2, pp. 3–29, 2000.

[95] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ) — a new method for speech quality assessment of telephone networks and codecs,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, 2001, pp. 749–752.

[96] V. Grancharov, D. Y. Zhao, J. Lindblom, and W. B. Kleijn, “Low-complexity, nonintrusive speech quality assessment,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1948–1956, 2006.

[97] D.-S. Kim and A. Tarraf, “ANIQUE+: A new American national standard for non-intrusive estimation of narrowband speech quality,” Bell Labs Technical Journal, vol. 12, no. 12, pp. 221–236, 2007.


[98] P. N. Petkov, I. S. Mossavat, and W. B. Kleijn, “A Bayesian approach to non-intrusive quality assessment of speech,” in Interspeech, 2009, pp. 2875–2878.

[99] M. Karjalainen, “A new auditory model for the evaluation of sound quality of audio systems,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 10, 1985, pp. 608–611.

[100] R. N. Shepard, “Toward a universal law of generalization for psychological science,” Science, vol. 237, pp. 1317–1323, 1987.

[101] R. L. Goldstone and J. Son, “Similarity,” in Cambridge Handbook of Thinking and Reasoning, K. Holyoak and R. Morrison, Eds. Cambridge University Press, 2005, pp. 13–36.

[102] D. H. Krantz and A. Tversky, “Similarity of rectangles: An analysis of subjective dimensions,” Journal of Mathematical Psychology, vol. 12, no. 1, pp. 4–34, 1975.

[103] A. Tversky, “Features of similarity,” Psychological Review, vol. 84, no. 4, pp. 327–352, 1977.

[104] A. Tversky and I. Gati, “Similarity, separability, and the triangle inequality,” Psychological Review, vol. 89, no. 2, pp. 123–154, 1982.

[105] C. L. Krumhansl, “Concerning the applicability of geometric models to similarity data: The interrelationship between similarity and spatial density,” Psychological Review, vol. 85, no. 5, pp. 445–463, 1978.

[106] E. W. Holman, “Monotonic models for asymmetric proximities,” Journal of Mathematical Psychology, vol. 20, no. 1, pp. 1–15, 1979.

[107] R. Shepard, “The analysis of proximities: Multidimensional scaling with an unknown distance function. I,” Psychometrika, vol. 27, pp. 125–140, 1962.

[108] ——, “The analysis of proximities: Multidimensional scaling with an unknown distance function. II,” Psychometrika, vol. 27, pp. 219–246, 1962.

[109] J. Kruskal, “Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis,” Psychometrika, vol. 29, pp. 1–27, 1964.

[110] M. A. A. Cox and T. F. Cox, “Multidimensional scaling,” in Handbook of Data Visualization, ser. Springer Handbooks of Computational Statistics, C.-h. Chen, W. Hardle, and A. Unwin, Eds. Springer, 2008, pp. 315–347.


[111] S. Santini and R. Jain, “Similarity measures,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pp. 871–883, 1999.

[112] D. M. Ennis, J. J. Palen, and K. Mullen, “A multidimensional stochastic theory of similarity,” Journal of Mathematical Psychology, vol. 32, no. 4, pp. 449–465, 1988.

[113] F. G. Ashby and N. A. Perrin, “Toward a unified theory of similarity and recognition,” Psychological Review, vol. 95, no. 1, pp. 124–150, 1988.

[114] F. G. Ashby, Ed., Multidimensional Models of Perception and Cognition. Lawrence Erlbaum Associates, 1992.

[115] R. M. Nosofsky, “Attention, similarity, and the identification-categorization relationship,” Journal of Experimental Psychology: General, vol. 115, no. 1, pp. 39–57, 1986.

[116] N. Jayant, J. Johnston, and R. Safranek, “Signal compression based on models of human perception,” Proceedings of the IEEE, vol. 81, no. 10, pp. 1385–1422, 1993.

[117] M. Schroeder and B. Atal, “Code-excited linear prediction (CELP): High-quality speech at very low bit rates,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 10, 1985, pp. 937–940.

[118] J.-P. Adoul, P. Mabilleau, M. Delprat, and S. Morissette, “Fast CELP coding based on algebraic codes,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 12, 1987, pp. 1957–1960.

[119] W. Kleijn, D. Krasinski, and R. Ketchum, “An efficient stochastically excited linear predictive coding algorithm for high quality low bit rate transmission of speech,” Speech Communication, vol. 7, no. 3, pp. 305–316, 1988.

[120] J.-H. Chen and A. Gersho, “Adaptive postfiltering for quality enhancement of coded speech,” IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 59–71, 1995.

[121] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.

[122] G. M. Davis, S. G. Mallat, and Z. Zhang, “Adaptive time-frequency decompositions,” Optical Engineering, vol. 33, no. 7, pp. 2183–2191, 1994.


[123] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.

[124] R. Heusdens, R. Vafin, and W. B. Kleijn, “Sinusoidal modeling using psychoacoustic-adaptive matching pursuits,” IEEE Signal Processing Letters, vol. 9, no. 8, pp. 262–265, 2002.

[125] J. A. Tropp, “Just relax: Convex programming methods for identifying sparse signals in noise,” IEEE Transactions on Information Theory, vol. 52, no. 3, pp. 1030–1051, 2006.

[126] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[127] M. Y. Kim and W. B. Kleijn, “KLT-based adaptive classified VQ of the speech signal,” IEEE Transactions on Speech and Audio Processing, vol. 12, no. 3, pp. 277–289, 2004.

[128] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.

[129] R. Neff and A. Zakhor, “Very low bit-rate video coding based on matching pursuits,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 1, pp. 158–171, 1997.

[130] M. Goodwin, “Matching pursuit with damped sinusoids,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3, 1997, pp. 2037–2040.

[131] P. Vera-Candeas, N. Ruiz-Reyes, M. Rosa-Zurera, D. Martinez-Munoz, and F. Lopez-Ferreras, “Transient modeling by matching pursuits with a wavelet dictionary for parametric audio coding,” IEEE Signal Processing Letters, vol. 11, no. 3, pp. 349–352, 2004.

[132] N. Ruiz Reyes and P. V. Candeas, “Adaptive signal modeling based on sparse approximations for scalable parametric audio coding,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 447–460, 2010.

[133] M. S. Lewicki, “Efficient coding of natural sounds,” Nature Neuroscience, vol. 5, no. 4, pp. 356–363, 2002.

[134] G. Karlsson and M. Vetterli, “Three dimensional sub-band coding of video,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 1988, pp. 1100–1103.


[135] P. Noll, “MPEG digital audio coding,” IEEE Signal Processing Magazine, vol. 14, no. 5, pp. 59–81, 1997.

[136] S. Mallat, A Wavelet Tour of Signal Processing. Academic Press, 1999.

[137] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3445–3462, 1993.

[138] K. N. Hamdy, M. Ali, and A. H. Tewfik, “Low bit rate high quality audio coding with combined harmonic and wavelet representations,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, 1996, pp. 1045–1048.

[139] M. Li and W. B. Kleijn, “A low-delay audio coder with constrained-entropy quantization,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2007, pp. 191–194.

[140] A. Ozerov and W. B. Kleijn, “Flexible quantization of audio and speech based on the autoregressive model,” in Asilomar Conference on Signals, Systems, and Computers, 2007, pp. 535–539.

[141] M. Flierl and B. Girod, “A motion-compensated orthogonal transform with energy-concentration constraint,” in IEEE Workshop on Multimedia Signal Processing, 2006, pp. 391–394.

[142] H. S. Malvar and D. H. Staelin, “The LOT: Transform coding without blocking effects,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 4, pp. 553–559, 1989.

[143] H. S. Malvar, “Lapped transforms for efficient transform/subband coding,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 6, pp. 969–978, 1990.

[144] J. Princen, A. Johnson, and A. Bradley, “Subband/transform coding using filter bank designs based on time domain aliasing cancellation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 12, 1987, pp. 2161–2164.

[145] T. Painter and A. Spanias, “Perceptual coding of digital audio,” Proceedings of the IEEE, vol. 88, no. 4, pp. 451–515, 2000.

[146] J. Ribas-Corbera and S. Lei, “Rate control in DCT video coding for low-delay communications,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 1, pp. 172–185, 1999.


[147] N. S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video. Prentice Hall, 1990.

[148] G. D. T. Schuller, B. Yu, D. Huang, and B. Edler, “Perceptual audio coding using adaptive pre- and post-filters and lossless compression,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 6, pp. 379–390, 2002.

[149] O. A. Moussa, M. Li, and W. B. Kleijn, “Predictive audio coding using rate-distortion-optimal pre- and post-filtering,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2011, to appear.

[150] C.-M. Liu, H.-W. Hsu, and W.-C. Lee, “Compression artifacts in perceptual audio coding,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 681–695, 2008.

[151] M. Erne, “Perceptual audio coders ‘what to listen for’,” in Audio Engineering Society Convention 111, 2001.

[152] A. V. McCree, “Low-bit-rate speech coding,” in Springer Handbook of Speech Processing, J. Benesty, M. M. Sondhi, and Y. Huang, Eds. Springer, 2008, ch. 16, pp. 331–350.

[153] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, 1993.

[154] H. Dudley, “Signal transmission,” U.S. Patent 2 151 091, 1939.

[155] A. V. McCree and T. P. Barnwell, III, “A mixed excitation LPC vocoder model for low bit rate speech coding,” IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 242–250, 1995.

[156] P. Hedelin, “A tone oriented voice excited vocoder,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 6, 1981, pp. 205–208.

[157] E. Schuijers, W. Oomen, B. den Brinker, and J. Breebaart, “Advances in parametric coding for high-quality audio,” in Audio Engineering Society Convention 114, 2003.

[158] R. McAulay and T. Quatieri, “Speech analysis/synthesis based on a sinusoidal representation,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.

[159] C. O. Etemoglu and V. Cuperman, “Matching pursuits sinusoidal speech coding,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5, pp. 413–424, 2003.


[160] W. B. Kleijn, “Encoding speech using prototype waveforms,” IEEETransactions on Speech and Audio Processing, vol. 1, no. 4, pp. 386–399, 1993.

[161] W. B. Kleijn and J. Haagen, “Waveform interpolation for coding andsynthesis,” in Speech Coding and Synthesis, W. B. Kleijn and K. K.Paliwal, Eds. Elsevier, 1995, ch. 5, pp. 175–207.

[162] F. Baumgarte and C. Faller, “Binaural cue coding — part I: Psychoa-coustic fundamentals and design principles,” IEEE Transactions onSpeech and Audio Processing, vol. 11, no. 6, pp. 509–519, 2003.

[163] C. Faller and F. Baumgarte, “Binaural cue coding — part II: Schemesand applications,” IEEE Transactions on Speech and Audio Process-ing, vol. 11, no. 6, pp. 520–531, 2003.

[164] C. Faller, “Parametric multichannel audio coding: Synthesis of co-herence cues,” IEEE Transactions on Audio, Speech, and LanguageProcessing, vol. 14, no. 1, pp. 299–310, 2006.

[165] K. Aizawa and T. S. Huang, “Model-based image coding: Advanced video coding techniques for very low bit-rate applications,” Proceedings of the IEEE, vol. 83, no. 2, pp. 259–271, 1995.

[166] P. Ekman and W. V. Friesen, Facial Action Coding System. Consulting Psychologists Press, 1978.

[167] I. A. Essa and A. P. Pentland, “Coding, analysis, interpretation, and recognition of facial expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 757–763, 1997.

[168] P. Jax and P. Vary, “Bandwidth extension of speech signals: A catalyst for the introduction of wideband speech coding?” IEEE Communications Magazine, vol. 44, no. 5, pp. 106–111, 2006.

[169] B. Geiser, P. Jax, P. Vary, H. Taddei, S. Schandl, M. Gartner, C. Guillaume, and S. Ragot, “Bandwidth extension for hierarchical speech and audio coding in ITU-T Rec. G.729.1,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2496–2509, 2007.

[170] M. Dietz, L. Liljeryd, K. Kjorling, and O. Kunz, “Spectral band replication, a novel approach in audio coding,” in Audio Engineering Society Convention 112, 2002.

[171] B. Bessette, R. Salami, C. Laflamme, and R. Lefebvre, “A wideband speech and audio codec at 16/24/32 kbit/s using hybrid ACELP/TCX techniques,” in IEEE Workshop on Speech Coding, 1999, pp. 7–9.

[172] M. Xie, D. Lindbergh, and P. Chu, “ITU-T G.722.1 Annex C: A new low-complexity 14 kHz audio coding standard,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 5, 2006.

[173] T. Dutoit, An Introduction to Text-to-Speech Synthesis. Kluwer Academic Publishers, 1997.

[174] A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, 1998, pp. 285–288.

[175] M. Schroder, “Emotional speech synthesis — a review,” in European Conference on Speech Communication and Technology, vol. 1, 2001, pp. 561–564.

[176] J. Baker, L. Deng, J. Glass, S. Khudanpur, C.-H. Lee, N. Morgan, and D. O’Shaughnessy, “Research developments and directions in speech recognition and understanding, part 1,” IEEE Signal Processing Magazine, vol. 26, no. 3, pp. 75–80, 2009.

[177] J. P. Campbell, “Speaker recognition: A tutorial,” Proceedings of the IEEE, vol. 85, no. 9, pp. 1437–1462, 1997.

[178] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, “Emotion recognition in human-computer interaction,” IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32–80, 2001.

[179] B. L. Vercoe, W. G. Gardner, and E. D. Scheirer, “Structured audio: Creation, transmission, and rendering of parametric sound representations,” Proceedings of the IEEE, vol. 86, no. 5, pp. 922–940, 1998.

[180] D. P. Greenberg, K. E. Torrance, P. Shirley, J. Arvo, E. Lafortune, J. A. Ferwerda, B. Walter, B. Trumbore, S. Pattanaik, and S.-C. Foo, “A framework for realistic image synthesis,” in International Conference on Computer Graphics and Interactive Techniques, 1997, pp. 477–494.

[181] R. Vafin and W. B. Kleijn, “Rate-distortion optimized quantization in multistage audio coding,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 311–320, 2006.

[182] R. Ulichney, Digital Halftoning. The MIT Press, 1987.

[183] R. A. Wannamaker, “Psychoacoustically optimal noise shaping,” Journal of the Audio Engineering Society, vol. 40, no. 7/8, pp. 611–620, 1992.

[184] L. R. Rabiner and J. A. Johnson, “Perceptual evaluation of the effects of dither on low bit rate PCM systems,” Bell Labs Technical Journal, vol. 51, no. 7, pp. 1487–1494, 1972.

[185] E. J. Delp and O. R. Mitchell, “Moment preserving quantization,” IEEE Transactions on Communications, vol. 39, no. 11, pp. 1549–1558, 1991.

[186] E. Delp and O. Mitchell, “Image compression using block truncation coding,” IEEE Transactions on Communications, vol. 27, no. 9, pp. 1335–1342, 1979.

[187] M. Lema and O. Mitchell, “Absolute moment block truncation coding and its application to color images,” IEEE Transactions on Communications, vol. 32, no. 10, pp. 1148–1157, 1984.

[188] M. Li and W. B. Kleijn, “Quantization with constrained relative entropy and its application to audio coding,” in Audio Engineering Society Convention 127, 2009.

[189] M. Li, J. Klejsa, and W. B. Kleijn, “Distribution preserving quantization with dithering and transformation,” IEEE Signal Processing Letters, vol. 17, no. 12, pp. 1014–1017, 2010.

[190] J. Klejsa, M. Li, and W. B. Kleijn, “Flexcode — flexible audio coding,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 361–364.

[191] J. Baxter, “The canonical distortion measure for vector quantization and function approximation,” in Learning to Learn, S. Thrun and L. Pratt, Eds. Kluwer Academic Publishers, 1998.

[192] R. G. Gallager, Information Theory and Reliable Communication. John Wiley & Sons, 1968.

[193] R. B. Nelsen, An Introduction to Copulas, 2nd ed. Springer, 2006.

[194] R. Zamir and M. Feder, “On universal quantization by randomized uniform/lattice quantizers,” IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 428–436, 1992.

[195] V. A. Vaishampayan, “Design of multiple description scalar quantizers,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 821–834, 1993.

[196] R. S. Calsaverini and R. Vicente, “An information-theoretic approach to statistical dependence: Copula information,” EPL (Europhysics Letters), vol. 88, no. 6, 2009.
