The Rate-Distortion Function and Excess-Distortion Exponent of Sparse Regression Codes with Optimal Encoding

Ramji Venkataramanan, Senior Member, IEEE, and Sekhar Tatikonda, Senior Member, IEEE

Abstract—This paper studies the performance of sparse regression codes for lossy compression with the squared-error distortion criterion. In a sparse regression code, codewords are linear combinations of subsets of columns of a design matrix. It is shown that with minimum-distance encoding, sparse regression codes achieve the Shannon rate-distortion function for i.i.d. Gaussian sources R*(D) as well as the optimal excess-distortion exponent. This completes a previous result which showed that R*(D) and the optimal exponent were achievable for distortions below a certain threshold. The proof of the rate-distortion result is based on the second moment method, a popular technique to show that a non-negative random variable X is strictly positive with high probability. In our context, X is the number of codewords within target distortion D of the source sequence. We first identify the reason behind the failure of the standard second moment method for certain distortions, and illustrate the different failure modes via a stylized example. We then use a refinement of the second moment method to show that R*(D) is achievable for all distortion values. Finally, the refinement technique is applied to Suen's correlation inequality to prove the achievability of the optimal Gaussian excess-distortion exponent.

Index Terms—Lossy compression, sparse superposition codes, rate-distortion function, Gaussian source, error exponent, second moment method, large deviations

I. INTRODUCTION

DEVELOPING practical codes for lossy compression at rates approaching Shannon's rate-distortion bound has long been an important goal in information theory. A practical compression code requires a codebook with low storage complexity as well as encoding and decoding with low computational complexity. Sparse Superposition Codes or Sparse Regression Codes (SPARCs) are a recent class of codes introduced by Barron and Joseph, originally for communication over the AWGN channel [1], [2]. They were subsequently used for lossy compression with the squared-error distortion criterion in [3]-[5]. The codewords in a SPARC are linear combinations of columns of a design matrix A. The storage complexity of the code is proportional to the size of the matrix, which is polynomial in the block length n. A computationally efficient encoder for compression with SPARCs was proposed in [5] and shown to achieve rates approaching the Shannon rate-distortion function for i.i.d. Gaussian sources.

This work was partially supported by a Marie Curie Career Integration Grant (Grant Agreement Number 631489) and by NSF Grant CCF-1217023. This paper was presented in part at the 2014 IEEE International Symposium on Information Theory.

R. Venkataramanan is with the Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, UK (e-mail: [email protected]).

S. Tatikonda is with the Department of Statistics and Data Science, Yale University, New Haven CT 06511, USA (e-mail: [email protected]).

[Figure 1 appears here: a plot of Rate (bits) versus D/σ², showing the curves 0.5 log σ²/D and 1 − D/σ².]

Fig. 1. The solid line shows the previous achievable rate R0(D), given in (1). The rate-distortion function R*(D) is shown in dashed lines. It coincides with R0(D) for D/σ² ≤ x*, where x* ≈ 0.203.

In this paper, we study the compression performance of SPARCs with the squared-error distortion criterion under optimal (minimum-distance) encoding. We show that for any ergodic source with variance σ², SPARCs with optimal encoding achieve a rate-distortion trade-off given by R*(D) := ½ log(σ²/D). Note that R*(D) is the optimal rate-distortion function for an i.i.d. Gaussian source with variance σ². The performance of SPARCs with optimal encoding was first studied in [4], where it was shown that for any distortion-level D, rates greater than

R0(D) := max{ ½ log(σ²/D), 1 − D/σ² }   (1)

are achievable with the optimal Gaussian excess-distortion exponent. The rate R0(D) in (1) is equal to R*(D) when D/σ² ≤ x*, but is strictly larger than R*(D) when D/σ² > x*, where x* ≈ 0.203; see Fig. 1. In this paper, we complete the result of [4] by proving that sparse regression codes achieve the Gaussian rate-distortion function R*(D) for all distortions D ∈ (0, σ²). We also show that these codes attain the optimal excess-distortion exponent for i.i.d. Gaussian sources at all rates.

Though minimum-distance encoding is not practically feasible (indeed, the main motivation for sparse regression codes is that they enable low-complexity encoding and decoding), characterizing the rate-distortion function and excess-distortion exponent under optimal encoding establishes a benchmark to compare the performance of various computationally efficient encoding schemes. Further, the results of this paper and [4] together show that SPARCs retain the good covering properties of the i.i.d. Gaussian random codebook, while having a compact representation in terms of a matrix whose size is a low-order polynomial in the block length.

[arXiv:1401.5272v6 [cs.IT] 19 Jun 2017]

[Figure 2 appears here: a schematic of the design matrix A, divided into L sections of M columns each, and the vector β with one non-zero entry of value c/√L per section.]

Fig. 2. A is an n × ML matrix and β is an ML × 1 vector. The positions of the non-zeros in β correspond to the gray columns of A which combine to form the codeword Aβ.

Let us specify some notation before proceeding. Upper-case letters are used to denote random variables, and lower-case letters for their realizations. Bold-face letters are used to denote random vectors and matrices. All vectors have length n. The source sequence is S := (S1, . . . , Sn), and the reconstruction sequence is Ŝ := (Ŝ1, . . . , Ŝn). ‖x‖ denotes the ℓ2-norm of vector x, and |x| = ‖x‖/√n is the normalized version. N(µ, σ²) denotes the Gaussian distribution with mean µ and variance σ². Logarithms are with base e and rate is measured in nats, unless otherwise mentioned. The notation an ∼ bn means that lim_{n→∞} (1/n) log an = lim_{n→∞} (1/n) log bn, and w.h.p. is used to abbreviate the phrase 'with high probability'. We will use κ, κ1, κ2 to denote generic positive constants whose exact value is not needed.

A. SPARCs with Optimal Encoding

A sparse regression code is defined in terms of a design matrix A of dimension n × ML whose entries are i.i.d. N(0, 1). Here n is the block length, and M and L are integers whose values will be specified in terms of n and the rate R. As shown in Fig. 2, one can think of the matrix A as composed of L sections with M columns each. Each codeword is a linear combination of L columns, with one column from each section. Formally, a codeword can be expressed as Aβ, where β is an ML × 1 vector (β1, . . . , βML) with the following property: there is exactly one non-zero βi for 1 ≤ i ≤ M, one non-zero βi for M + 1 ≤ i ≤ 2M, and so forth. The non-zero values of β are all set equal to c/√L, where c is a constant that will be specified later. Denote the set of all β's that satisfy this property by B_{M,L}.

Minimum-distance encoder: This is defined by a mapping g : Rⁿ → B_{M,L}. Given the source sequence S, the encoder determines the β that produces the codeword closest in Euclidean distance, i.e.,

g(S) = argmin_{β ∈ B_{M,L}} ‖S − Aβ‖.

Decoder: This is a mapping h : B_{M,L} → Rⁿ. On receiving β ∈ B_{M,L} from the encoder, the decoder produces the reconstruction h(β) = Aβ.

Since there are M columns in each of the L sections, the total number of codewords is M^L. To obtain a compression rate of R nats/sample, we therefore need

M^L = e^{nR}.   (2)

For our constructions, we choose M = L^b for some b > 1, so that (2) implies

L log L = nR/b.   (3)

Thus L is Θ(n/log n), and the number of columns ML in the dictionary A is Θ((n/log n)^{b+1}), a polynomial in n.
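To make the construction concrete, here is a small numerical sketch (ours, not from the paper) of a SPARC codebook with a brute-force minimum-distance encoder. The toy parameters n, L, M below are illustrative only and far from the asymptotic regime M = L^b with L log L = nR/b.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy SPARC parameters (illustrative only; the paper takes M = L^b with n large).
n, L, M = 12, 3, 4          # block length, number of sections, columns per section
c = 1.0                     # scale of the non-zero entries; each equals c/sqrt(L)

A = rng.standard_normal((n, M * L))   # design matrix with i.i.d. N(0,1) entries

def beta_from_indices(idx):
    """Build beta in B_{M,L} from one chosen column index per section."""
    beta = np.zeros(M * L)
    for sec, j in enumerate(idx):
        beta[sec * M + j] = c / np.sqrt(L)
    return beta

def min_distance_encode(S):
    """Brute-force minimum-distance encoder g(S) over all M^L codewords."""
    best, best_idx = np.inf, None
    for idx in itertools.product(range(M), repeat=L):
        d = np.linalg.norm(S - A @ beta_from_indices(idx))
        if d < best:
            best, best_idx = d, idx
    return beta_from_indices(best_idx)

S = rng.standard_normal(n)            # a source realization
beta_hat = min_distance_encode(S)
S_hat = A @ beta_hat                  # decoder output h(beta_hat)
print("rate R =", L * np.log(M) / n, "nats/sample")
print("normalized distortion |S - S_hat|^2 =", np.sum((S - S_hat) ** 2) / n)
```

The exponential sweep over M^L codewords is exactly why minimum-distance encoding is impractical; it is used here only as the benchmark the paper analyzes.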

B. Overview of our Approach

To show that a rate R can be achieved at distortion-level D, we need to show that with high probability at least one of the e^{nR} choices for β satisfies

|S − Aβ|² ≤ D.   (4)

If β satisfies (4), we call it a solution. Denoting the number of solutions by X, the goal is to show that X > 0 with high probability when R > R*(D). Note that X can be expressed as the sum of e^{nR} indicator random variables, where the ith indicator is 1 if β(i) is a solution and zero otherwise, for 1 ≤ i ≤ e^{nR}. Analyzing the probability P(X > 0) is challenging because these indicator random variables are dependent: codewords Aβ(1) and Aβ(2) will be dependent if β(1) and β(2) share common non-zero terms. To handle the dependence, we use the second moment method (second MoM), a technique commonly used to prove existence ('achievability') results in random graphs and random constraint satisfaction problems [6]. In the setting of lossy compression, the second MoM was used in [7] to obtain the rate-distortion function of LDGM codes for binary symmetric sources with Hamming distortion.

For any non-negative random variable X, the second MoM [8] bounds the probability of the event X > 0 from below as¹

P(X > 0) ≥ (EX)² / E[X²].   (5)

Therefore the second MoM succeeds if we can show that (EX)²/E[X²] → 1 as n → ∞. It was shown in [4] that the second MoM succeeds for R > R0(D), where R0(D) is defined in (1). In contrast, for R*(D) < R < R0(D) it was found that (EX)²/E[X²] → 0, so the second MoM fails. From this result in [4], it is not clear whether the gap from R*(D) is due to an inherent weakness of the sparse regression codebook, or if it is just a limitation of the second MoM as a proof technique. In this paper, we demonstrate that it is the latter, and refine the second MoM to prove that all rates greater than R*(D) are achievable.

¹The inequality (5) follows from the Cauchy-Schwarz inequality (E[XY])² ≤ E[X²] E[Y²] by substituting Y = 1{X > 0}.


Our refinement of the second MoM is inspired by the work of Coja-Oghlan and Zdeborova [9] on finding sharp thresholds for two-coloring of random hypergraphs. The high-level idea is as follows. The key ratio (EX)²/E[X²] can be expressed as (EX)/E[X(β)], where X(β) denotes the total number of solutions conditioned on the event that a given β is a solution. (Recall that β is a solution if |S − Aβ|² ≤ D.) Thus when the second MoM fails, i.e. the ratio goes to zero, we have a situation where the expected number of solutions is much smaller than the expected number of solutions conditioned on the event that β is a solution. This happens because for any S, there are atypical realizations of the design matrix that yield a very large number of solutions. The total probability of these matrices is small enough that EX is not significantly affected by these realizations. However, conditioning on β being a solution increases the probability that the realized design matrix is one that yields an unusually large number of solutions. At low rates, the conditional probability of the design matrix being atypical is large enough to make E[X(β)] ≫ EX, causing the second MoM to fail.²

The key to rectifying the second MoM failure is to show that X(β) ≈ EX with high probability although E[X(β)] ≫ EX. We then apply the second MoM to count just the 'good' solutions, i.e., solutions β for which X(β) ≈ EX. This succeeds, letting us conclude that X > 0 with high probability.

C. Related Work

As mentioned above, the second moment method was used in [7] to analyze the rate-distortion function of LDGM codes for binary symmetric sources with Hamming distortion. The idea of applying the second MoM to a random variable that counts just the 'good' solutions was recently used to obtain improved thresholds for problems such as random hypergraph 2-coloring [9], k-colorability of random graphs [10], and random k-SAT [11]. However, the key step of showing that a given solution is 'good' with high probability depends heavily on the geometry of the problem being considered. This step requires identifying a specific property of the random object being considered (e.g., SPARC design matrix, hypergraph, or boolean formula) that leads to a very large number of solutions in atypical realizations of the object. For example, in SPARC compression, the atypical realizations are design matrices with columns that are unusually well-aligned with the source sequence to be compressed; in random hypergraph 2-coloring, the atypical realizations are hypergraphs with an edge structure that allows an unusually large number of vertices to take on either color [9].

It is interesting to contrast the analysis of SPARC lossy compression with that of SPARC AWGN channel coding in [1]. The dependence structure of the SPARC codewords makes the analysis challenging in both problems, but the techniques required to analyze SPARC channel coding are very different from those used here for the excess-distortion analysis. In the channel coding case, the authors use a modified union bound together with a novel bounding technique for the probability of pairwise error events [1, Lemmas 3, 4] to establish that the error probability decays exponentially for all rates smaller than the channel capacity. In contrast, we use a refinement of the second moment method for the rate-distortion function, and Suen's correlation inequality to obtain the excess-distortion exponent.

²This is similar to the inspection paradox in renewal processes.

Beyond the excess-distortion exponent, the dispersion is another quantity of interest in a lossy compression problem [12], [13]. For a fixed excess-distortion probability, the dispersion specifies how fast the rate can approach the rate-distortion function with growing block length. It was shown that for discrete memoryless and i.i.d. Gaussian sources, the optimal dispersion was equal to the inverse of the second derivative of the excess-distortion exponent. Given that SPARCs attain the optimal excess-distortion exponent, it would be interesting to explore if they also achieve the optimal dispersion for i.i.d. Gaussian sources with squared-error distortion.

The rest of the paper is organized as follows. The main results, specifying the rate-distortion function and the excess-distortion exponent of SPARCs, are stated in Section II. In Section III, we set up the proof and show why the second MoM fails for R < 1 − D/ρ². As the proofs of the main theorems are technical, we motivate the main ideas with a stylized example in Section III-C. The main results are proved in Section IV, with the proof of the main technical lemma given in Section V.

II. MAIN RESULTS

The probability of excess distortion at distortion-level D of a rate-distortion code Cn with block length n and encoder and decoder mappings g, h is

Pe(Cn, D) = P( |S − h(g(S))|² > D ).   (6)

For a SPARC generated as described in Section I-A, the probability measure in (6) is with respect to the random source sequence S and the random design matrix A.

A. Rate-Distortion Trade-off of SPARC

Definition 1. A rate R is achievable at distortion level D if there exists a sequence of SPARCs {Cn}_{n=1,2,...} such that lim_{n→∞} Pe(Cn, D) = 0, where for all n, Cn is a rate R code defined by an n × MnLn design matrix whose parameter Ln satisfies (3) with a fixed b and Mn = Ln^b.

Theorem 1. Let S be drawn from an ergodic source with mean 0 and variance σ². For D ∈ (0, σ²), let R*(D) = ½ log(σ²/D). Fix R > R*(D) and b > bmin(σ²/D), where

bmin(x) = 20Rx⁴ / [ (1 + 1/x)² (1 − 1/x) ( −1 + √( 1 + (2√x/(x − 1)) (R − ½(1 − 1/x)) ) )² ]   (7)

for 1 < x ≤ e^{2R}. Then there exists a sequence of rate R SPARCs {Cn}_{n=1,2,...} for which lim_{n→∞} Pe(Cn, D) = 0, where Cn is defined by an n × MnLn design matrix, with Mn = Ln^b and Ln determined by (3).
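For concreteness, the following is a direct transcription of (7) into code (a sketch under the parenthesization reconstructed above); the function name b_min and the example values are ours.

```python
import math

def b_min(x, R):
    """Transcription of b_min(x) from (7); valid for 1 < x <= exp(2R)."""
    root = math.sqrt(1.0 + (2.0 * math.sqrt(x) / (x - 1.0)) * (R - 0.5 * (1.0 - 1.0 / x)))
    return (20.0 * R * x**4) / ((1.0 + 1.0 / x)**2 * (1.0 - 1.0 / x) * (root - 1.0)**2)

# Example: D/sigma^2 = 0.5, i.e. x = sigma^2/D = 2, at a rate 10% above R*(D).
R_star = 0.5 * math.log(2.0)          # R*(D) = 0.5 log(sigma^2/D)
print(b_min(2.0, R=1.1 * R_star))     # minimum design parameter b for this (R, D)
```

Note that b_min(x) blows up as R approaches R*(D), reflecting that larger design matrices are needed near the rate-distortion limit.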


Remark: Though the theorem is valid for all D ∈ (0, σ²), it is most relevant for the case D/σ² > x*, where x* ≈ 0.203 is the solution to the equation

(1 − x) + ½ log x = 0.

For D/σ² ≤ x*, [4, Theorem 1] already guarantees that the optimal rate-distortion function can be achieved, with a smaller value of b than that required by the theorem above.
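The threshold x* can be checked numerically; the following short bisection sketch (ours) finds the root of (1 − x) + ½ log x on (0, 1):

```python
import math

def g(x):
    # g(x) = (1 - x) + 0.5*log(x); x* is its root in (0, 1)
    return (1.0 - x) + 0.5 * math.log(x)

lo, hi = 1e-6, 0.5          # g(1e-6) < 0 and g(0.5) > 0, so the root lies between
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
print(round(0.5 * (lo + hi), 6))   # ~0.2032; the paper quotes x* ~ 0.203
```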

B. Excess-distortion exponent of SPARC

The excess-distortion exponent at distortion-level D of a sequence of rate R codes {Cn}_{n=1,2,...} is given by

r(D, R) = − lim sup_{n→∞} (1/n) log Pe(Cn, D),   (8)

where Pe(Cn, D) is defined in (6). The optimal excess-distortion exponent for a rate-distortion pair (R, D) is the supremum of the excess-distortion exponents over all sequences of codes with rate R, at distortion-level D.

The optimal excess-distortion exponent for discrete memoryless sources was obtained by Marton [14], and the result was extended to memoryless Gaussian sources by Ihara and Kubo [15].

Fact 1. [15] For an i.i.d. Gaussian source distributed as N(0, σ²) and squared-error distortion criterion, the optimal excess-distortion exponent at rate R and distortion-level D is

r*(D, R) = ½ ( a²/σ² − 1 − log(a²/σ²) ) for R > R*(D), and r*(D, R) = 0 for R ≤ R*(D),   (9)

where a² = De^{2R}.

For R > R*(D), the exponent in (9) is the Kullback-Leibler divergence between two zero-mean Gaussians, distributed as N(0, a²) and N(0, σ²), respectively.
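A direct transcription of (9) (a sketch; the function name is ours):

```python
import math

def excess_distortion_exponent(R, D, sigma2):
    """r*(D,R) from (9): KL divergence between N(0, a^2) and N(0, sigma^2), a^2 = D*e^{2R}."""
    if R <= 0.5 * math.log(sigma2 / D):   # at or below R*(D) the exponent is 0
        return 0.0
    a2 = D * math.exp(2 * R)
    t = a2 / sigma2
    return 0.5 * (t - 1.0 - math.log(t))

print(excess_distortion_exponent(R=1.0, D=0.3, sigma2=1.0))  # > 0 since R > R*(D) ~ 0.602
```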

The next theorem characterizes the excess-distortion exponent performance of SPARCs.

Theorem 2. Let S be drawn from an ergodic source with mean zero and variance σ². Let D ∈ (0, σ²), R > ½ log(σ²/D), and γ² ∈ (σ², De^{2R}). Let

b > max{ 2, (7/5) bmin(γ²/D) },   (10)

where bmin(·) is defined in (7). Then there exists a sequence of rate R SPARCs {Cn}_{n=1,2,...}, where Cn is defined by an n × MnLn design matrix with Mn = Ln^b and Ln determined by (3), whose probability of excess distortion at distortion-level D can be bounded as follows for all sufficiently large n:

Pe(Cn, D) ≤ P(|S|² ≥ γ²) + exp(−κn^{1+c}),   (11)

where κ, c are strictly positive universal constants.

Corollary 1. Let S be drawn from an i.i.d. Gaussian source with mean zero and variance σ². Fix rate R > ½ log(σ²/D), and let a² = De^{2R}. Fix any ε ∈ (0, a² − σ²), and

b > max{ 2, (7/5) bmin((a² − ε)/D) }.   (12)

There exists a sequence of rate R SPARCs with parameter b that achieves the excess-distortion exponent

½ ( (a² − ε)/σ² − 1 − log((a² − ε)/σ²) ).

Consequently, the supremum of excess-distortion exponents achievable by SPARCs for i.i.d. Gaussian sources is equal to the optimal one, given by (9).

Proof: From Theorem 2, we know that for any ε ∈ (0, a² − σ²), there exists a sequence of rate R SPARCs {Cn} for which

Pe(Cn, D) ≤ P(|S|² ≥ a² − ε) ( 1 + exp(−κn^{1+c}) / P(|S|² ≥ a² − ε) )   (13)

for sufficiently large n, as long as the parameter b satisfies (12). For S that is i.i.d. N(0, σ²), Cramér's large deviation theorem [16] yields

lim_{n→∞} −(1/n) log P(|S|² ≥ a² − ε) = ½ ( (a² − ε)/σ² − 1 − log((a² − ε)/σ²) )   (14)

for (a² − ε) > σ². Thus P(|S|² ≥ a² − ε) decays exponentially with n; in comparison, exp(−κn^{1+c}) decays faster than exponentially with n. Therefore, from (13), the excess-distortion exponent satisfies

lim inf_{n→∞} −(1/n) log Pe(Cn, D) ≥ lim inf_{n→∞} −(1/n) [ log P(|S|² ≥ a² − ε) + log( 1 + exp(−κn^{1+c}) / P(|S|² ≥ a² − ε) ) ] = ½ ( (a² − ε)/σ² − 1 − log((a² − ε)/σ²) ).   (15)

Since ε > 0 can be chosen arbitrarily small, the supremum of all achievable excess-distortion exponents is

½ ( a²/σ² − 1 − log(a²/σ²) ),

which is optimal from Fact 1.

We remark that the function bmin(x) is increasing in x. Therefore (12) implies that larger values of the design parameter b are required to achieve excess-distortion exponents closer to the optimal value (i.e., smaller values of ε in Corollary 1).

III. INADEQUACY OF THE DIRECT SECOND MOM

A. First steps of the proof

Fix a rate R > R*(D), and b greater than the minimum value specified by the theorem. Note that De^{2R} > σ² since R > ½ log(σ²/D). Let γ² be any number such that σ² < γ² < De^{2R}.


Code Construction: For each block length n, pick L as specified by (3) and M = L^b. Construct an n × ML design matrix A with entries drawn i.i.d. N(0, 1). The codebook consists of all vectors Aβ such that β ∈ B_{M,L}. The non-zero entries of β are all set equal to a value specified below.

Encoding and Decoding: If the source sequence S is such that |S|² ≥ γ², then the encoder declares an error. If |S|² ≤ D, then S can be trivially compressed to within distortion D using the all-zero codeword. The addition of this extra codeword to the codebook affects the rate in a negligible way.

If |S|² ∈ (D, γ²), then S is compressed in two steps. First, quantize |S|² with an n-level uniform scalar quantizer Q(·) with support in the interval (D, γ²]. For input x ∈ (D, γ²], if

x ∈ ( D + (γ² − D)(i − 1)/n, D + (γ² − D)i/n ]

for i ∈ {1, . . . , n}, then the quantizer output is

Q(x) = D + (γ² − D)(i − ½)/n.

Conveying the scalar quantization index to the decoder (with an additional log n nats) allows us to adjust the codebook variance according to the norm of the observed source sequence.³ The non-zero entries of β are each set to √((Q(|S|²) − D)/L), so that each SPARC codeword has variance Q(|S|²) − D. Define a "quantized-norm" version of S as

S̃ := √( Q(|S|²)/|S|² ) S.   (16)

Note that |S̃|² = Q(|S|²). We use the SPARC to compress S̃. The encoder finds

β̂ := argmin_{β ∈ B_{M,L}} ‖S̃ − Aβ‖².

The decoder receives β̂ and reconstructs Ŝ = Aβ̂. Note that for block length n, the total number of nats transmitted by the encoder is log n + L log M, yielding an overall rate of R + (log n)/n nats/sample.

³The scalar quantization step is only included to simplify the analysis. In fact, we could use the same codebook variance (γ² − D) for all S that satisfy |S|² ≤ (γ² − D), but this would make the forthcoming large deviations analysis quite cumbersome.
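The n-level scalar quantizer Q(·) is simple enough to sketch directly (illustrative code, not from the paper):

```python
import math

def Q(x, D, gamma2, n):
    """n-level uniform scalar quantizer on (D, gamma2]: returns (index i, Q(x))."""
    step = (gamma2 - D) / n
    # find i in {1,...,n} with x in (D + step*(i-1), D + step*i]
    i = min(n, max(1, math.ceil((x - D) / step)))
    return i, D + step * (i - 0.5)        # reconstruction at the cell midpoint

# Example: with D=0.2, gamma2=1.2, n=10 the cell width is 0.1
i, q = Q(0.37, D=0.2, gamma2=1.2, n=10)
print(i, q)   # cell 2, midpoint 0.35; |Q(x) - x| <= step/2 always
```

Since the step size is (γ² − D)/n, the quantization error |Q(|S|²) − |S|²| is O(1/n), which is what drives the κ/n terms in the error analysis below.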

Error Analysis: For S such that |S|² ∈ (D, γ²), the overall distortion can be bounded as

|S − Aβ̂|² = |S − S̃ + S̃ − Aβ̂|² ≤ |S − S̃|² + 2|S − S̃||S̃ − Aβ̂| + |S̃ − Aβ̂|² ≤ κ1/n² + (κ2/n)|S̃ − Aβ̂| + |S̃ − Aβ̂|²   (17)

for some positive constants κ1, κ2. The last inequality holds because the step-size of the scalar quantizer is (γ² − D)/n, and |S|² ∈ (D, γ²).

Let E(S̃) be the event that the minimum of |S̃ − Aβ|² over β ∈ B_{M,L} is greater than D. The encoder declares an error if E(S̃) occurs. If E(S̃) does not occur, the overall distortion in (17) can be bounded as

|S − Aβ̂|² ≤ D + κ/n,   (18)

for some positive constant κ. The overall rate (including that of the scalar quantizer) is R + (log n)/n.

Denoting the probability of excess distortion for this random code by Pe,n, we have

Pe,n ≤ P(|S|² ≥ γ²) + max_{ρ² ∈ (D,γ²)} P( E(S̃) | |S̃|² = ρ² ).   (19)

As γ² > σ², the ergodicity of the source guarantees that

lim_{n→∞} P(|S|² ≥ γ²) = 0.   (20)

To bound the second term in (19), without loss of generality we can assume that the source sequence

S̃ = (ρ, . . . , ρ).

This is because the codebook distribution is rotationally invariant, due to the i.i.d. N(0, 1) design matrix A. For any β, the entries of Aβ are i.i.d. N(0, ρ² − D). We enumerate the codewords as Aβ(i), where β(i) ∈ B_{M,L} for i = 1, . . . , e^{nR}.

Define the indicator random variables

Ui(S̃) = 1 if |Aβ(i) − S̃|² ≤ D, and 0 otherwise.   (21)

We can then write

P(E(S̃)) = P( Σ_{i=1}^{e^{nR}} Ui(S̃) = 0 ).   (22)

For a fixed S̃, the Ui(S̃)'s are dependent. To see this, consider codewords Ŝ(i), Ŝ(j) corresponding to the vectors β(i), β(j) ∈ B_{M,L}, respectively. Recall that a vector in B_{M,L} is uniquely defined by the position of the non-zero value in each of its L sections. If β(i) and β(j) overlap in r of their non-zero positions, then the column sums forming codewords Ŝ(i) and Ŝ(j) will share r common terms, and consequently Ui(S̃) and Uj(S̃) will be dependent.

For brevity, we henceforth denote Ui(S̃) by just Ui. Applying the second MoM with

X := Σ_{i=1}^{e^{nR}} Ui,

we have from (5)

P(X > 0) ≥ (EX)²/E[X²] = EX / E[X | U1 = 1],   (23)

where the equality is obtained by expressing E[X²] as follows:

E[X²] = E[ X Σ_{i=1}^{e^{nR}} Ui ] = Σ_{i=1}^{e^{nR}} E[X Ui] = Σ_{i=1}^{e^{nR}} P(Ui = 1) E[X | Ui = 1] = EX · E[X | U1 = 1].   (24)

The last equality in (24) holds because EX = Σ_{i=1}^{e^{nR}} P(Ui = 1), and due to the symmetry of the code construction. As E[X²] ≥ (EX)², (23) implies that E[X | U1 = 1] ≥ EX. Therefore, to show that X > 0 w.h.p., we need

E[X | U1 = 1] / EX → 1 as n → ∞.   (25)

B. EX versus E[X|U1 = 1]

To compute EX, we derive a general lemma specifying the probability that a randomly chosen i.i.d. N(0, y) codeword is within distortion z of a source sequence S with |S|² = x. This lemma will be used in other parts of the proof as well.

Lemma 1. Let S be a vector with |S|² = x. Let Ŝ be an i.i.d. N(0, y) random vector that is independent of S. Then for x, y, z > 0 and sufficiently large n, we have

(κ/√n) e^{−nf(x,y,z)} ≤ P( |S − Ŝ|² ≤ z ) ≤ e^{−nf(x,y,z)},   (26)

where κ is a universal positive constant and, for x, y, z > 0, the large-deviation rate function f is

f(x, y, z) = (x + z)/(2y) − xz/(Ay) − A/(4y) − ½ log(A/(2x)) if z ≤ x + y, and 0 otherwise,   (27)

where

A = √(y² + 4xz) − y.   (28)

Proof: We have

P( |S − Ŝ|² ≤ z ) = P( (1/n) Σ_{k=1}^n (Ŝk − Sk)² ≤ z ) = P( (1/n) Σ_{k=1}^n (Ŝk − √x)² ≤ z ),   (29)

where the last equality is due to the rotational invariance of the distribution of Ŝ, i.e., Ŝ has the same joint distribution as OŜ for any orthogonal (rotation) matrix O. In particular, we choose O to be the matrix that rotates S to the vector (√x, . . . , √x), and note that |S − Ŝ|² = |OS − OŜ|². Then, using the strong version of Cramér's large deviation theorem due to Bahadur and Rao [16], [17], we have

(κ/√n) e^{−nI(x,y,z)} ≤ P( (1/n) Σ_{k=1}^n (Ŝk − √x)² ≤ z ) ≤ e^{−nI(x,y,z)},   (30)

where the large-deviation rate function I is given by

I(x, y, z) = sup_{λ ≤ 0} { λz − log E e^{λ(Ŝ − √x)²} }.   (31)

The expectation on the RHS of (31) is computed with Ŝ ∼ N(0, y). Using standard calculations, we obtain

log E e^{λ(Ŝ − √x)²} = λx/(1 − 2yλ) − ½ log(1 − 2yλ), λ < 1/(2y).   (32)

Substituting the expression in (32) in (31) and maximizing over λ ≤ 0 yields I(x, y, z) = f(x, y, z), where f is given by (27).
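The rate function (27)-(28) is easy to transcribe and sanity-check numerically; the sketch below (ours) verifies the identity f(ρ², ρ² − D, D) = ½ log(ρ²/D) used in (35):

```python
import math

def f(x, y, z):
    """Large-deviation rate function of (27)-(28)."""
    if z > x + y:
        return 0.0
    A = math.sqrt(y * y + 4 * x * z) - y
    return (x + z) / (2 * y) - x * z / (A * y) - A / (4 * y) - 0.5 * math.log(A / (2 * x))

rho2, D = 1.0, 0.3
print(f(rho2, rho2 - D, D))         # 0.60199... 
print(0.5 * math.log(rho2 / D))     # matches: f(rho2, rho2-D, D) = 0.5*log(rho2/D)
```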

The expected number of solutions is given by

EX = e^{nR} P(U1 = 1) = e^{nR} P( |Aβ(1) − S̃|² ≤ D ).   (33)

Since S̃ = (ρ, ρ, . . . , ρ), and Aβ(1) is i.i.d. N(0, ρ² − D), applying Lemma 1 we obtain the bounds

(κ/√n) e^{nR} e^{−nf(ρ², ρ²−D, D)} ≤ EX ≤ e^{nR} e^{−nf(ρ², ρ²−D, D)}.   (34)

Note that

f(ρ², ρ² − D, D) = ½ log(ρ²/D).   (35)

Next consider E[X | U1 = 1]. If β(i) and β(j) overlap in r of their non-zero positions, the column sums forming codewords Ŝ(i) and Ŝ(j) will share r common terms. Therefore,

E[X | U1 = 1] = Σ_{i=1}^{e^{nR}} P(Ui = 1 | U1 = 1) = Σ_{i=1}^{e^{nR}} P(Ui = 1, U1 = 1)/P(U1 = 1) = Σ_{r=0}^{L} (L choose r)(M − 1)^{L−r} P(U2 = U1 = 1 | F12(r)) / P(U1 = 1),   (36)

where F12(r) is the event that the codewords corresponding to U1 and U2 share r common terms. The last equality in (36) holds because for each codeword Ŝ(i), there are a total of (L choose r)(M − 1)^{L−r} codewords which share exactly r common terms with Ŝ(i), for 0 ≤ r ≤ L. From (36) and (33), we obtain

E[X | U1 = 1] / EX = Σ_{r=0}^{L} (L choose r)(M − 1)^{L−r} P(U2 = U1 = 1 | F12(r)) / ( e^{nR} (P(U1 = 1))² )
∼ 1 + Σ_{α = 1/L, ..., L/L} (L choose Lα) P(U2 = U1 = 1 | F12(α)) / ( M^{Lα} (P(U1 = 1))² )
= 1 + Σ_{α = 1/L, ..., L/L} e^{nΔ_α},   (37)

where the middle step is obtained by substituting α = r/L and e^{nR} = M^L. The notation x_L ∼ y_L means that x_L/y_L → 1 as L → ∞. The last equality is from [4, Appendix A], where it was also shown that

Δ_α ≤ κ/L + (R/b) min{ α, ᾱ, log 2/log L } − h(α),   (38)

where ᾱ := 1 − α and

h(α) := αR − ½ log( 1 + (α/(1 − α))(1 − 2D/ρ²) ).   (39)

The inequality in (38) is asymptotically tight [4]. The term e^{nΔ_α} in (37) may be interpreted as follows. Conditioned on β(1) being a solution, the expected number of solutions that share αL common terms with β(1) is ∼ e^{nΔ_α} EX. Recall that we require the left side of (37) to tend to 1 as n → ∞. Therefore, we need Δ_α < 0 for α = 1/L, . . . , L/L. From (38), we need h(α) to be positive in order to guarantee that Δ_α < 0.

However, when R < 1 − D/ρ², it can be verified that h(α) < 0 for α ∈ (0, α*), where α* ∈ (0, 1) is the solution to h(α) = 0. Thus Δ_α is positive for α ∈ (0, α*) when ½ log(ρ²/D) ≤ R ≤ 1 − D/ρ². Consequently, (37) implies that

E[X | U1 = 1] / EX ∼ Σ_α e^{nΔ_α} → ∞ as n → ∞,   (40)

and the second MoM fails.

C. A Stylized Example

Before describing how to rectify the second MoM failure in the SPARC setting, we present a simple example to give intuition about the failure modes of the second MoM. The proofs in the next two sections do not rely on the discussion here.

Consider a sequence of generic random structures (e.g., a sequence of random graphs or SPARC design matrices) denoted by Rn, n ≥ 1. Suppose that for each n, the realization of Rn belongs to one of two categories: a category C1 structure which has e^n solutions, or a category C2 structure which has e^{2n} solutions. In the case of SPARC, a solution is a codeword that is within the target distortion. Let the probabilities of Rn being of each category be

P(Rn ∈ C1) = 1 − e^{−np}, P(Rn ∈ C2) = e^{−np},   (41)

where p > 0 is a constant. Regardless of the realization, we note that Rn always has at least e^n solutions.

We now examine whether the second MoM can guarantee the existence of a solution for this problem as n → ∞. The number of solutions X can be expressed as a sum of indicator random variables:

X = Σ_{i=1}^{N} Ui,

where Ui = 1 if configuration i is a solution, and N is the total number of configurations. (In the SPARC context, a configuration is a codeword.) We assume that the configurations are symmetric (as in the SPARC set-up), so that each one has equal probability of being a solution, i.e.,

P(Ui = 1 | Rn ∈ C1) = e^n/N, P(Ui = 1 | Rn ∈ C2) = e^{2n}/N.   (42)

Due to symmetry, the second moment ratio can be expressed as

E[X²]/(EX)² = E[X | U1 = 1]/EX = E[X | U1 = 1] / ( (1 − e^{−np})e^n + e^{−np}e^{2n} ).   (43)

The conditional expectation in the numerator can be computed as follows:

E[X | U1 = 1] = P(Rn ∈ C1 | U1 = 1) E[X | U1 = 1, C1] + P(Rn ∈ C2 | U1 = 1) E[X | U1 = 1, C2]
= [ (1 − e^{−np})(e^n/N) / ( (1 − e^{−np})(e^n/N) + e^{−np}(e^{2n}/N) ) ] e^n + [ e^{−np}(e^{2n}/N) / ( (1 − e^{−np})(e^n/N) + e^{−np}(e^{2n}/N) ) ] e^{2n}
= ( (1 − e^{−np})e^{2n} + e^{n(4−p)} ) / ( (1 − e^{−np})e^n + e^{n(2−p)} ),   (44)

where the second equality is obtained by using Bayes' rule to compute P(Rn ∈ C1 | U1 = 1). The second MoM ratio in (43) therefore equals

E[X²]/(EX)² = E[X | U1 = 1]/EX = ( (1 − e^{−np})e^{2n} + e^{n(4−p)} ) / [ (1 − e^{−np})e^n + e^{n(2−p)} ]².   (45)

We examine the behavior of the ratio above as n → ∞ for different values of p.

Case 1: p ≥ 2. The dominant term in both the numerator and the denominator of (45) is e^{2n}, and we get

E[X | U1 = 1]/EX → 1 as n → ∞,   (46)

and the second MoM succeeds.

Case 2: 1 < p < 2. The dominant term in the numerator is e^{n(4−p)}, while the dominant term in the denominator is e^{2n}. Hence

E[X | U1 = 1]/EX = ( e^{n(4−p)}/e^{2n} )(1 + o(1)) ∼ e^{n(2−p)} → ∞ as n → ∞.   (47)

Case 3: 0 < p ≤ 1. The dominant term in the numerator is e^{n(4−p)}, while the dominant term in the denominator is e^{n(4−2p)}. Hence

E[X | U1 = 1]/EX = ( e^{n(4−p)}/e^{n(4−2p)} )(1 + o(1)) ∼ e^{np} → ∞ as n → ∞.   (48)
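The three cases can be verified numerically. The sketch below (ours) evaluates (1/n) log of the ratio in (45) in the log domain to avoid overflow; the exponent approaches 0 in Case 1, 2 − p in Case 2, and p in Case 3:

```python
import numpy as np

def log_ratio_rate(p, n):
    """(1/n) * log of the second-MoM ratio in (45), computed stably in log domain."""
    log1m = np.log1p(-np.exp(-n * p))            # log(1 - e^{-np})
    log_num = np.logaddexp(log1m + 2 * n, n * (4 - p))
    log_den = 2 * np.logaddexp(log1m + n, n * (2 - p))
    return (log_num - log_den) / n

for p in (3.0, 1.5, 0.5):                        # Cases 1, 2, 3
    print(p, [round(log_ratio_rate(p, n), 3) for n in (20, 50, 100)])
# p=3.0 -> exponent ~ 0 (ratio -> 1); p=1.5 -> ~ 0.5 = 2-p; p=0.5 -> ~ 0.5 = p
```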

Thus in both Case 2 and Case 3, the second MoM fails because the expected number of solutions conditioned on a solution (U1 = 1) is exponentially larger than the unconditional expected value. However, there is an important distinction between the two cases, which allows us to fix the failure of the second MoM in Case 2 but not in Case 3.

Consider the conditional distribution of the number of solutions given U1 = 1. From the calculation in (44), we have

P(X = e^n | U1 = 1) = P(Rn ∈ C1 | U1 = 1) = (1 − e^{−np})e^n / ( (1 − e^{−np})e^n + e^{n(2−p)} ),
P(X = e^{2n} | U1 = 1) = P(Rn ∈ C2 | U1 = 1) = e^{n(2−p)} / ( (1 − e^{−np})e^n + e^{n(2−p)} ).   (49)

When 1 < p ≤ 2, the first term in the denominator of the RHS dominates, and the conditional distribution of X is

P(X = e^n | U1 = 1) = 1 − e^{−n(p−1)}(1 + o(1)),
P(X = e^{2n} | U1 = 1) = e^{−n(p−1)}(1 + o(1)).   (50)

Thus the conditional probability of a realization Rn being category C1 given U1 = 1 is slightly smaller than the unconditional probability, which is 1 − e^{−np}. However, conditioned on U1 = 1, a realization Rn is still extremely likely to have come from category C1, i.e., to have e^n solutions. Therefore, when 1 < p ≤ 2, conditioning on a solution does not change the nature of the 'typical' or 'high-probability' realization. This makes it possible to fix the failure of the second MoM in this case. The idea is to define a new random variable X′ which counts the number of solutions coming from typical realizations, i.e., only category C1 structures. The second MoM is then applied to X′ to show that it is strictly positive with high probability.

When p < 1, conditioning on a solution completely changes the distribution of X. The dominant term in the denominator of the RHS in (49) is e^{n(2−p)}, so the conditional distribution of X is

P(X = e^n | U1 = 1) = e^{−n(1−p)}(1 + o(1)),
P(X = e^{2n} | U1 = 1) = 1 − e^{−n(1−p)}(1 + o(1)).   (51)

Thus, conditioned on a solution, a typical realization of Rn belongs to category C2, i.e., has e^{2n} solutions. On the other hand, if we draw from the unconditional distribution of Rn in (41), a typical realization has e^n solutions. In this case, the second moment method cannot be fixed by counting only the solutions from realizations of category C1, because the total conditional probability of such realizations is very small. This is the analog of the "condensation phase" that is found in problems such as random hypergraph coloring [9]. In this phase, although solutions may exist, even an enhanced second MoM does not prove their existence.

Fortunately, there is no condensation phase in the SPARC compression problem. Despite the failure of the direct second MoM, we prove (Lemma 2) that conditioning on a solution does not significantly alter the total number of solutions for a very large fraction of design matrices. Analogous to Case 2 above, we can apply the second MoM to a new random variable that counts only the solutions coming from typical realizations of the design matrix. This yields the desired result that solutions exist for all rates R > R*(D).

IV. PROOFS OF MAIN RESULTS

A. Proof of Theorem 1

The code parameters, encoding and decoding are as described in Section III-A. We build on the proof set-up of Section III-B. Given that β ∈ B_{M,L} is a solution, for α = 0, 1/L, . . . , L/L define Xα(β) to be the number of solutions that share αL non-zero terms with β. The total number of solutions given that β is a solution is

X(β) = Σ_{α = 0, 1/L, ..., L/L} Xα(β).   (52)

Using this notation, we have

E[X | U1 = 1]/EX = E[X(β)]/EX = Σ_{α = 0, 1/L, ..., L/L} E[Xα(β)]/EX ∼ 1 + Σ_{α = 1/L, ..., L/L} e^{nΔ_α},   (53)

where the first equality holds because the symmetry of the code construction allows us to condition on a generic β ∈ B_{M,L} being a solution, and the last step follows from (37). Note that E[Xα(β)] and E[X(β)] are expectations evaluated with the conditional distribution over the space of design matrices given that β is a solution.

The key ingredient in the proof is the following lemma, which shows that Xα(β) is much smaller than EX w.h.p. for all α ∈ {1/L, . . . , L/L}. In particular, Xα(β) ≪ EX even for α for which

E[Xα(β)]/EX ∼ e^{nΔ_α} → ∞ as n → ∞.

Lemma 2. Let R > ½ log(ρ²/D). If β ∈ B_{M,L} is a solution, then for sufficiently large L

P( Xα(β) ≤ L^{−3/2} EX, for 1/L ≤ α ≤ (L−1)/L ) ≥ 1 − η,   (54)

where

η = L^{−2.5( b/bmin(ρ²/D) − 1 )}.   (55)

The function bmin(·) is defined in (7).

Proof: The proof of the lemma is given in Section V.

The probability measure in Lemma 2 is the conditional distribution on the space of design matrices A given that β is a solution.

Definition 2. For ε > 0, call a solution β "ε-good" if

Σ_{α = 1/L, ..., L/L} Xα(β) < ε EX.   (56)

Since we have fixed S̃ = (ρ, . . . , ρ), whether a solution β is ε-good or not is determined by the design matrix. Lemma 2 guarantees that w.h.p. any solution β will be ε-good, i.e., if β is a solution, w.h.p. the design matrix is such that the number of solutions sharing any common terms with β is less than ε E[X].

The key to proving Theorem 1 is to apply the second MoM only to ε-good solutions. Fix ε = L^{−0.5}. For i = 1, . . . , e^{nR}, define the indicator random variables

Vi = 1 if |Aβ(i) − S̃|² ≤ D and β(i) is ε-good, and 0 otherwise.   (57)

The number of ε-good solutions, denoted by Xg, is given by

Xg = V1 + V2 + . . . + V_{e^{nR}}.   (58)

We will apply the second MoM to Xg to show that P(Xg > 0) → 1 as n → ∞. We have

P(Xg > 0) ≥ (EXg)²/E[Xg²] = EXg / E[Xg | V1 = 1],   (59)

where the second equality is obtained by writing E[Xg²] = (EXg) E[Xg | V1 = 1], similar to (24).


Lemma 3. a) EXg ≥ (1 − η) EX, where η is defined in (55). b) E[Xg | V1 = 1] ≤ (1 + L^{−0.5}) EX.

Proof: Due to the symmetry of the code construction, we have

EXg = e^{nR} P(V1 = 1) = e^{nR} P(U1 = 1) P(V1 = 1 | U1 = 1) = EX · P(β(1) is ε-good | β(1) is a solution),   (60)

where the second equality follows from the definitions of Vi in (57) and Ui in (21). Given that β(1) is a solution, Lemma 2 shows that

Σ_{α = 1/L, ..., L/L} Xα(β(1)) < (EX) L^{−0.5}   (61)

with probability at least 1 − η. As ε = L^{−0.5}, β(1) is ε-good according to Definition 2 if (61) is satisfied. Thus EXg in (60) can be lower bounded as

EXg ≥ (1 − η) EX.   (62)

For part (b), first observe that the total number of solutions X is an upper bound for the number of ε-good solutions Xg. Therefore

E[Xg | V1 = 1] ≤ E[X | V1 = 1].   (63)

Given that β(1) is an ε-good solution, the expected number of solutions can be expressed as

E[X | V1 = 1] = E[X0(β(1)) | V1 = 1] + E[ Σ_{α = 1/L, ..., L/L} Xα(β(1)) | V1 = 1 ].   (64)

There are (M − 1)^L codewords that share no common terms with β(1). Each of these codewords is independent of β(1), and thus independent of the event V1 = 1. Hence

E[X0(β(1)) | V1 = 1] = E[X0(β(1))] = (M − 1)^L P(|S̃ − Aβ|² ≤ D) ≤ M^L P(|S̃ − Aβ|² ≤ D) = EX.   (65)

Next, note that conditioned on β(1) being an ε-good solution (i.e., V1 = 1),

Σ_{α = 1/L, ..., L/L} Xα(β(1)) < ε EX   (66)

with certainty. This follows from the definition of ε-good in (56). Using (65) and (66) in (64), we conclude that

E[X | V1 = 1] < (1 + ε) EX.   (67)

Combining (67) with (63) completes the proof of Lemma 3.

Using Lemma 3 in (59), we obtain

P(Xg > 0) ≥ EXg / E[Xg | V1 = 1] ≥ (1 − η)/(1 + ε) = ( 1 − L^{−2.5(b/bmin(ρ²/D) − 1)} ) / ( 1 + L^{−1/2} ),   (68)

where the last equality is obtained by using the definition of η in (55) and ε = L^{−0.5}. Hence the probability of the existence of at least one good solution goes to 1 as L → ∞. Thus we have shown that for any ρ² ∈ (D, γ²), the quantity

P( E(S̃) | |S̃|² = ρ² )

in (19) tends to zero whenever R > ½ log(ρ²/D) and b > bmin(ρ²/D). Combining this with (18)-(20), we conclude that the probability that

|S − Aβ̂|² ≤ D + κ/n

goes to one as n → ∞. As γ² > σ² can be chosen arbitrarily close to σ², the proof of Theorem 1 is complete.

B. Proof of Theorem 2

The code construction is as described in Section III-A, with the parameter b now chosen to satisfy (10). Recall the definition of an ε-good solution in Definition 2. We follow the set-up of Section IV-A and count the number of ε-good solutions, for an appropriately defined ε. As before, we want an upper bound for the probability of the event Xg = 0, where the number of ε-good solutions Xg is defined in (58).

Theorem 2 is obtained using Suen's correlation inequality to upper bound the probability of the event Xg = 0. Suen's inequality yields a sharper upper bound than the second MoM. We use it to prove that the probability of Xg = 0 decays super-exponentially in L. In comparison, the second MoM only guarantees a polynomial decay.

We begin with some definitions required for Suen's inequality.

Definition 3 (Dependency Graphs [8]). Let {Vi}_{i∈I} be a family of random variables (defined on a common probability space). A dependency graph for {Vi} is any graph Γ with vertex set V(Γ) = I whose set of edges satisfies the following property: if A and B are two disjoint subsets of I such that there are no edges with one vertex in A and the other in B, then the families {Vi}_{i∈A} and {Vi}_{i∈B} are independent.

Fact 2. [8, Example 1.5, p. 11] Suppose {Yα}_{α∈A} is a family of independent random variables, and each Vi, i ∈ I, is a function of the variables {Yα}_{α∈Ai} for some subset Ai ⊆ A. Then the graph with vertex set I and edge set {ij : Ai ∩ Aj ≠ ∅} is a dependency graph for {Vi}_{i∈I}.

In our setting, we fix ε = L^{−3/2} and let Vi be the indicator random variable defined in (57). Note that Vi is one if and only if β(i) is an ε-good solution. The set of codewords that share at least one common term with β(i) are the ones that play a role in determining whether β(i) is an ε-good solution or not. Hence, the graph Γ with vertex set V(Γ) = {1, . . . , e^{nR}} and edge set e(Γ) given by

{ij : i ≠ j and the codewords β(i), β(j) share at least one common term}

is a dependency graph for the family {Vi}_{i=1}^{e^{nR}}. This follows from Fact 2 by observing that: i) each Vi is a function of the columns of A that define β(i) and all other codewords that share at least one common term with β(i); and ii) the columns of A are generated independently of one another.

For a given codeword β(i), there are (L choose r)(M − 1)^{L−r} other codewords that have exactly r terms in common with β(i), for 0 ≤ r ≤ L − 1. Therefore each vertex in the dependency graph for the family {Vi}_{i=1}^{e^{nR}} is connected to

Σ_{r=1}^{L−1} (L choose r)(M − 1)^{L−r} = M^L − 1 − (M − 1)^L

other vertices.

Fact 3 (Suen's Inequality [8]). Let Vi ∼ Bern(pi), i ∈ I, be a finite family of Bernoulli random variables having a dependency graph Γ. Write i ∼ j if ij is an edge in Γ. Define

λ = Σ_{i∈I} EVi, Δ = ½ Σ_{i∈I} Σ_{j∼i} E(ViVj), δ = max_{i∈I} Σ_{k∼i} EVk.

Then

P( Σ_{i∈I} Vi = 0 ) ≤ exp( −min{ λ/2, λ/(6δ), λ²/(8Δ) } ).   (69)
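Fact 3 translates directly into code; this small helper (ours) evaluates the bound (69) given λ, Δ, δ:

```python
import math

def suen_upper_bound(lam, Delta, delta):
    """Upper bound of (69) on P(sum of Vi = 0), given lambda, Delta, delta."""
    exponent = min(lam / 2.0, lam / (6.0 * delta), lam**2 / (8.0 * Delta))
    return math.exp(-exponent)

# Toy numbers only: the bound decays once lambda grows faster than delta and Delta.
print(suen_upper_bound(lam=100.0, Delta=50.0, delta=2.0))
```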

We apply Suen's inequality with the dependency graph specified above for {Vi}_{i=1}^{e^{nR}} to compute an upper bound for P(Xg = 0), where Xg = Σ_{i=1}^{e^{nR}} Vi is the total number of ε-good solutions for ε = L^{−3/2}. Note that the ε chosen here is smaller than the value of L^{−1/2} used for Theorem 1. This smaller value is required to prove the super-exponential decay of the excess-distortion probability via Suen's inequality. We also need a stronger version of Lemma 2.

Lemma 4. Let R > ½ log(ρ²/D). If β ∈ B_{M,L} is a solution, then for sufficiently large L

P( Xα(β) ≤ L^{−5/2} EX, for 1/L ≤ α ≤ L/L ) ≥ 1 − ξ,   (70)

where

ξ = L^{−2.5( b/bmin(ρ²/D) − 7/5 )}.   (71)

Proof: The proof is nearly identical to that of Lemma 2 given in Section V, with the terms L^{−3/2} and 3/(2L) replaced by L^{−5/2} and 5/(2L), respectively, throughout the lemma. Thus we obtain the following condition on b, which is the analog of (107):

b > max_{α ∈ {1/L, ..., L/L}} { ( R / min{αΛ(α), c1} ) [ min{ α, ᾱ, log 2/log L } + 5/(2L) ] } = 3.5R/Λ(0) + O(1/L) = (7/5) bmin(ρ²/D) + O(1/L).   (72)

The result is then obtained using arguments analogous to (108) and (109).

We now compute each of the three terms in the RHS of Suen's inequality.

First Term λ/2: We have

λ = Σ_{i=1}^{e^{nR}} EVi = EXg = EX · P(β(1) is ε-good | β(1) is a solution),   (73)

where the last equality follows from (60). Given that β(1) is a solution, Lemma 4 shows that

Σ_{α = 1/L, ..., L/L} Xα(β(1)) < (EX) L^{−3/2}   (74)

with probability at least 1 − ξ. As ε = L^{−3/2}, β(1) is ε-good according to Definition 2 if (74) is satisfied. Thus the RHS of (73) can be lower bounded as follows:

λ = EX · P(β(1) is ε-good | β(1) is a solution) ≥ EX · (1 − ξ).   (75)

Using the expression from (33) for the expected number of solutions EX, we have

λ ≥ (1 − ξ) (κ/√n) e^{n(R − ½ log(ρ²/D))},   (76)

where κ > 0 is a constant. For b > (7/5) bmin(ρ²/D), (71) implies that ξ approaches 0 with growing L.

Second term λ/(6δ): Due to the symmetry of the code construction, we have

δ = max_{i ∈ {1,...,e^{nR}}} Σ_{k∼i} P(Vk = 1) = Σ_{k∼i} P(Vk = 1) for all i ∈ {1, . . . , e^{nR}}
= Σ_{r=1}^{L−1} (L choose r)(M − 1)^{L−r} · P(V1 = 1) = ( M^L − 1 − (M − 1)^L ) P(V1 = 1).   (77)

Combining this with the fact that

λ = Σ_{i=1}^{M^L} EVi = M^L P(V1 = 1),

we obtain

λ/δ = M^L / ( M^L − 1 − (M − 1)^L ) = 1 / ( 1 − L^{−bL} − (1 − L^{−b})^L ),   (78)

where the second equality is obtained by substituting M = L^b. Using a Taylor series bound for the denominator of (78) (see [4, Sec. V] for details) yields the following lower bound for sufficiently large L:

λ/δ ≥ L^{b−1}/2.   (79)
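A quick numerical check (ours) of (78)-(79): the denominator of (78) behaves like L^{1−b}, so λ/δ grows like L^{b−1} and exceeds L^{b−1}/2 for moderate L already:

```python
def lam_over_delta(L, b):
    """Exact ratio in (78) with M = L^b (L^{-bL} underflows to 0 for moderate L)."""
    return 1.0 / (1.0 - L ** (-b * L) - (1.0 - L ** (-b)) ** L)

for L in (10, 100, 1000):
    b = 2.0
    print(L, lam_over_delta(L, b), "vs bound", L ** (b - 1) / 2.0)
# e.g. L=100, b=2: ratio ~ 100.5, bound 50.0
```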


Third Term λ²/(8Δ): We have

Δ = ½ Σ_{i=1}^{M^L} Σ_{j∼i} E[ViVj] = ½ Σ_{i=1}^{M^L} P(Vi = 1) Σ_{j∼i} P(Vj = 1 | Vi = 1)
(a)= ½ EXg Σ_{j∼1} P(Vj = 1 | V1 = 1) = ½ EXg E[ Σ_{j∼1} 1{Vj = 1} | V1 = 1 ]
(b)≤ ½ EXg E[ Σ_{α = 1/L, ..., (L−1)/L} Xα(β(1)) | V1 = 1 ].   (80)

In (80), (a) holds because of the symmetry of the code construction. The inequality (b) is obtained as follows. The number of ε-good solutions that share common terms with β(1) is bounded above by the total number of solutions sharing common terms with β(1). The latter quantity can be expressed as the sum of the number of solutions sharing exactly αL common terms with β(1), for α ∈ {1/L, . . . , (L−1)/L}.

Conditioned on V1 = 1, i.e., the event that β(1) is an ε-good solution, the total number of solutions that share common terms with β(1) is bounded by ε EX. Therefore, from (80) we have

Δ ≤ ½ EXg E[ Σ_{α = 1/L, ..., (L−1)/L} Xα(β(1)) | V1 = 1 ] ≤ ½ (EXg)(L^{−3/2} EX) ≤ (L^{−3/2}/2)(EX)²,   (81)

where we have used ε = L^{−3/2}, and the fact that Xg ≤ X. Combining (81) and (75), we obtain

λ²/(8Δ) ≥ (1 − ξ)²(EX)² / ( 4L^{−3/2}(EX)² ) ≥ κL^{3/2},   (82)

where κ is a strictly positive constant.

Applying Suen's inequality: Using the lower bounds obtained in (76), (79), and (82) in (69), we obtain

P( Σ_{i=1}^{e^{nR}} Vi = 0 ) ≤ exp( −κ min{ e^{n(R − ½ log(ρ²/D) − (log n)/(2n))}, L^{b−1}, L^{3/2} } ),   (83)

where κ is a positive constant. Recalling from (3) that L = Θ(n/log n) and R > ½ log(ρ²/D), we see that for b > 2,

P( Σ_{i=1}^{e^{nR}} Vi = 0 ) ≤ exp( −κn^{1+c} ),   (84)

where c > 0 is a constant. Note that the condition b > (7/5) bmin(ρ²/D) was also needed to obtain (83) via Suen's inequality. In particular, this condition on b is required for ξ in Lemma 4 to go to 0 with growing L.

Using (84) in (19), we conclude that for any γ² ∈ (σ², De^{2R}) the probability of excess distortion can be bounded as

Pe,n ≤ P(|S|² ≥ γ²) + max_{ρ² ∈ (D,γ²)} P( E(S̃) | |S̃|² = ρ² ) ≤ P(|S|² ≥ γ²) + exp(−κn^{1+c}),   (85)

provided the parameter b satisfies

b > max_{ρ² ∈ (D,γ²)} max{ 2, (7/5) bmin(ρ²/D) }.   (86)

It can be verified from the definition in (7) that bmin(x) is strictly increasing in x ∈ (1, e^{2R}). Therefore, the maximum on the RHS of (86) is bounded by max{ 2, (7/5) bmin(γ²/D) }. Choosing b to be larger than this value will guarantee that (85) holds. This completes the proof of the theorem.

V. PROOF OF LEMMA 2

We begin by listing three useful properties of the function f(x, y, z) defined in (27). Recall that the probability that an i.i.d. N(0, y) sequence is within distortion z of a norm-x sequence is ∼ e^{−nf(x,y,z)}.

1) For fixed x, y, f is strictly decreasing in z ∈ (0, x + y).
2) For fixed y, z, f is strictly increasing in x ∈ (z, ∞).
3) For fixed x, z with x > z, f is convex in y and attains its minimum value of ½ log(x/z) at y = x − z.

These properties are straightforward to verify from the definition (27) using elementary calculus.

For K ⊆ {1, . . . , L}, let βK denote the restriction of β to the set K, i.e., βK coincides with β in the sections indicated by K and the remaining entries are all equal to zero. For example, if K = {2, 3}, the second and third sections of βK will each have one non-zero entry, and the other entries are all zeros.

Definition 4. Given that β is a solution, for α = 1/L, . . . , L/L, define Fα(β) as the event that

|S̃ − AβK|² ≥ D_α

for every size-αL subset K ⊂ {1, . . . , L}, where D_α is the solution to the equation

Rα = f(ρ², (ρ² − D)α, D_α).   (87)

The intuition behind choosing D_α according to (87) is the following. Any subset of αL sections of the design matrix A defines a SPARC of rate Rα, with each codeword consisting of i.i.d. N(0, (ρ² − D)α) entries. (Note that the entries of a single codeword are i.i.d., though the codewords are dependent due to the SPARC structure.) The probability that a codeword from this rate Rα code is within distortion z of the source sequence S̃ is ∼ e^{−nf(ρ², (ρ²−D)α, z)}. Hence the expected number of codewords in the rate Rα codebook within distortion z of S̃ is

e^{nRα} e^{−nf(ρ², (ρ²−D)α, z)}.

As f(ρ², (ρ²−D)α, z) is a strictly decreasing function of z in (0, ρ²), (87) says that D_α is the smallest expected distortion for any rate Rα code with codeword entries chosen i.i.d. N(0, (ρ² − D)α).⁴ For z < D_α, the expected number of codewords within distortion z of S̃ is vanishingly small.

⁴Note that D_α is not the distortion-rate function at rate Rα, as the codewords are not chosen with the optimal variance for rate Rα.
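Equation (87) has no closed-form solution for D_α in general, but since f is strictly decreasing in its third argument it can be solved by bisection; a numerical sketch (ours, reusing the rate function of (27)-(28)):

```python
import math

def f(x, y, z):
    # rate function of (27)-(28)
    if z > x + y:
        return 0.0
    A = math.sqrt(y * y + 4 * x * z) - y
    return (x + z) / (2 * y) - x * z / (A * y) - A / (4 * y) - 0.5 * math.log(A / (2 * x))

def D_alpha(alpha, rho2, D, R, iters=200):
    """Solve R*alpha = f(rho2, (rho2-D)*alpha, D_alpha) for D_alpha by bisection."""
    y = (rho2 - D) * alpha
    lo, hi = 1e-9, rho2 + y        # f decreases from very large to 0 on this range
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # f decreasing in z: if f(mid) is still too large, the root lies above mid
        lo, hi = (mid, hi) if f(rho2, y, mid) > R * alpha else (lo, mid)
    return 0.5 * (lo + hi)

rho2, D = 1.0, 0.3
R = 1.1 * 0.5 * math.log(rho2 / D)     # a rate slightly above R*(D)
for a in (0.25, 0.5, 0.75, 1.0):
    print(a, D_alpha(a, rho2, D, R))
# D_alpha shrinks as alpha grows; at alpha = 1 it lies below D since R > R*(D)
```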

Conditioned on Fα(β), the idea is that any αL sections of β cannot by themselves represent S̃ with distortion less than D_α. In other words, in a typical realization of the design matrix, all the sections contribute roughly equal amounts to finding a codeword within distortion D of S̃. On the other hand, if some αL sections of the SPARC can represent S̃ with distortion less than D_α, the remaining ᾱL sections have "less work" to do; this creates a proliferation of solutions that share these αL common sections with β. Consequently, the total number of solutions is much greater than EX for these atypical design matrices.

The first step in proving the lemma is to show that for any β, the event Fα(β) holds w.h.p. The second step is showing that when Fα(β) holds, the expected number of solutions that share any common terms with β is small compared to EX. Indeed, using Fα(β) we can write

P( Xα(β) > L^{−3/2} EX ) = P( {Xα(β) > L^{−3/2} EX}, Fᶜα(β) ) + P( {Xα(β) > L^{−3/2} EX}, Fα(β) )
≤ P(Fᶜα(β)) + P(Fα(β)) P( Xα(β) > L^{−3/2} EX | Fα(β) )
≤ P(Fᶜα(β)) + E[Xα(β) | Fα(β)] / (L^{−3/2} EX),   (88)

where the last line follows from Markov's inequality. We will show that the probability on the left side of (88) is small for any solution β by showing that each of the two terms on the RHS of (88) is small. First, a bound on D_α.

Lemma 5. For α ∈ (0, 1],

Rα > f(ρ², (ρ² − D)α, ρ²ᾱ + Dα) = ½ log( ρ²/(ρ²ᾱ + Dα) ).   (89)

Consequently, D_α < ρ²ᾱ + Dα for α = 1/L, . . . , L/L.

Proof: The last equality in (89) holds because f(x, x − z, z) = ½ log(x/z). Define the function

g(α) = Rα − ½ log( ρ²/(ρ²ᾱ + Dα) ).

Then g(0) = 0, g(1) = R − ½ log(ρ²/D) > 0, and the second derivative is

d²g/dα² = −½ (1 − D/ρ²)² / ( 1 − (1 − D/ρ²)α )² < 0.

Therefore g is strictly concave in [0, 1], and its minimum value (attained at α = 0) is 0. This proves (89). Recalling the definition of D_α in (87), (89) implies that

f(ρ², (ρ²−D)α, D_α) = Rα > f(ρ², (ρ²−D)α, ρ²ᾱ + Dα).

As f is decreasing in its third argument (the distortion), we conclude that D_α < ρ²ᾱ + Dα.

We now bound each term on the RHS of (88). Showing that the first term of (88) is small implies that w.h.p. any αL sections by themselves will leave a residual distortion of at least D_α. Showing that the second term is small implies that under this condition, the expected number of solutions sharing any common terms with β is small compared to EX.

Bounding Fcα(β): From the definition of the event Fα(β),we have

P (Fcα(β)) = ∪K P (|S−AβK|2 < Dα | β is a solution)(90)

where the union is over all size-αL subsets of {1, . . . , L}.Using a union bound, (90) becomes

P (Fcα(β)) ≤(L

)P (|S−AβK|2 < Dα, |S−Aβ|2 < D)

P (|S−Aβ|2 < D)(91)

where K is a generic size-αL subset of {1, . . . , L}, say K ={1, . . . , αL}. Recall from (33) that for sufficiently large n, thedenominator in (91) can be bounded from below as

P (|S−Aβ|2 < D) ≥ κ√ne−nf(ρ2,ρ2−D,D) (92)

and f(ρ2, ρ2 −D,D) = 12 log ρ2

D . The numerator in (91) canbe expressed as

P (|S−AβK|2 < Dα, |S−Aβ|2 < D)

=

∫ Dα

0

ψ(y)P (|S−Aβ|2 < D | |S−AβK|2 = y) dy

(93)

where ψ is the density of the random variable |S − AβK|2.Using the cdf at y to bound ψ(y) in the RHS of (93), weobtain the following upper bound for sufficiently large n.

$$
\begin{aligned}
P\big(|S - &A\beta_K|^2 < \tilde{D}_\alpha,\; |S - A\beta|^2 < D\big) \\
&\le \int_0^{\tilde{D}_\alpha} P\big(|S - A\beta_K|^2 < y\big)\cdot P\big(|S - A\beta|^2 < D \,\big|\, |S - A\beta_K|^2 = y\big)\, dy \\
&\overset{(a)}{\le} \int_0^{\tilde{D}_\alpha} \frac{\kappa}{\sqrt{n}}\, e^{-n f(\rho^2,\, (\rho^2-D)\alpha,\, y)}\cdot P\big(|(S - A\beta_K) - A\beta_{K^c}|^2 < D \,\big|\, |S - A\beta_K|^2 = y\big)\, dy \\
&\overset{(b)}{\le} \int_0^{\tilde{D}_\alpha} \frac{\kappa}{\sqrt{n}}\, e^{-n f(\rho^2,\, (\rho^2-D)\alpha,\, y)}\cdot e^{-n f(y,\, (\rho^2-D)\bar{\alpha},\, D)}\, dy \\
&\overset{(c)}{\le} \int_0^{\tilde{D}_\alpha} \frac{\kappa}{\sqrt{n}}\, e^{-n f(\rho^2,\, (\rho^2-D)\alpha,\, \tilde{D}_\alpha)}\cdot e^{-n f(\tilde{D}_\alpha,\, (\rho^2-D)\bar{\alpha},\, D)}\, dy.
\end{aligned}
\qquad (94)
$$
In (94), (a) holds for sufficiently large $n$ and is obtained using the strong version of Cramér's large deviation theorem: note that $A\beta_K$ is a linear combination of $\alpha L$ columns of $A$, hence it is a Gaussian random vector with i.i.d. $\mathcal{N}(0, (\rho^2-D)\alpha)$ entries that is independent of $S$. Inequality (b) is similarly obtained: $A\beta_{K^c}$ has i.i.d. $\mathcal{N}(0, (\rho^2-D)\bar{\alpha})$ entries, and is independent of both $S$ and $A\beta_K$. Finally, (c) holds because the overall exponent
$$
f\big(\rho^2, (\rho^2-D)\alpha, y\big) + f\big(y, (\rho^2-D)\bar{\alpha}, D\big)
$$
is a decreasing function of $y$ for $y \in (0, \rho^2\bar{\alpha} + D\alpha]$, and $\tilde{D}_\alpha \le \rho^2\bar{\alpha} + D\alpha$.

Using (92) and (94) in (91), for sufficiently large $n$ we have
$$
P\big(F_\alpha^c(\beta)\big) \le \kappa \binom{L}{\alpha L}\, e^{-n\big[f(\rho^2,\, (\rho^2-D)\alpha,\, \tilde{D}_\alpha) + f(\tilde{D}_\alpha,\, (\rho^2-D)\bar{\alpha},\, D) - f(\rho^2,\, \rho^2-D,\, D)\big]}. \qquad (95)
$$

Bounding $\mathbb{E}[X_\alpha(\beta) \mid F_\alpha(\beta)]$: There are $\binom{L}{\alpha L}(M-1)^{\bar{\alpha}L}$ codewords which share exactly $\alpha L$ common terms with $\beta$. Therefore
$$
\mathbb{E}\big[X_\alpha(\beta) \,\big|\, F_\alpha(\beta)\big] = \binom{L}{\alpha L}(M-1)^{\bar{\alpha}L}\; P\big(|S - A\beta'|^2 < D \,\big|\, |S - A\beta|^2 < D,\, F_\alpha(\beta)\big) \qquad (96)
$$
where $\beta'$ is a codeword that shares exactly $\alpha L$ common terms with $\beta$. If $K$ is the size-$\alpha L$ set of common sections between $\beta$ and $\beta'$, then $\beta' = \beta_K + \beta'_{K^c}$ and

$$
\begin{aligned}
P\big(|S - A\beta'|^2 < D \,\big|\, |S - A\beta|^2 < D,\, F_\alpha(\beta)\big)
&= P\big(|(S - A\beta_K) - A\beta'_{K^c}|^2 < D \,\big|\, |S - A\beta|^2 < D,\, F_\alpha(\beta)\big) \\
&\overset{(a)}{\le} P\Big(\frac{1}{n}\sum_{i=1}^n \big(\sqrt{\tilde{D}_\alpha} - (A\beta'_{K^c})_i\big)^2 < D\Big) \\
&\overset{(b)}{\le} \frac{\kappa}{\sqrt{n}}\, e^{-n f(\tilde{D}_\alpha,\, (\rho^2-D)\bar{\alpha},\, D)},
\end{aligned}
\qquad (97)
$$
where (b) holds for sufficiently large $n$. In (97), (a) is obtained as follows. Under the event $F_\alpha(\beta)$, the norm $|S - A\beta_K|^2$ is at least $\tilde{D}_\alpha$, and $A\beta'_{K^c}$ is an i.i.d. $\mathcal{N}(0, (\rho^2-D)\bar{\alpha})$ vector independent of $S$, $A\beta$, and $A\beta_K$. (a) then follows from the rotational invariance of the distribution of $A\beta'_{K^c}$. Inequality (b) is obtained using the strong version of Cramér's large deviation theorem.
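The reduction in step (a), replacing the residual $S - A\beta_K$ by a constant vector with the same norm, rests on the fact that for an isotropic Gaussian $W$, the probability $P(|v - W|^2 < D)$ depends on $v$ only through $|v|$ and is decreasing in $|v|$. A minimal Monte Carlo sketch (the dimension, variance, and norms below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, var, D = 20, 0.35, 0.5     # hypothetical dimension / variance / distortion
trials = 100_000

def hit_prob(v):
    # P(|v - W|^2 < D) with W ~ N(0, var*I) and |x|^2 := (1/n) * sum x_i^2
    W = rng.normal(0.0, np.sqrt(var), size=(trials, n))
    return np.mean(np.mean((v - W) ** 2, axis=1) < D)

norm2 = 0.25                              # plays the role of D_alpha~
v1 = np.full(n, np.sqrt(norm2))           # constant vector, |v1|^2 = norm2
u = rng.normal(size=n)
v2 = u * np.sqrt(norm2 * n) / np.linalg.norm(u)  # random direction, same norm
v3 = np.full(n, np.sqrt(0.40))            # larger norm

p1, p2, p3 = hit_prob(v1), hit_prob(v2), hit_prob(v3)
print(p1, p2, p3)
assert abs(p1 - p2) < 0.02    # rotational invariance: only |v| matters
assert p3 < p1                # probability decreases as the residual norm grows
```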

Using (97) in (96), we obtain for sufficiently large $n$
$$
\begin{aligned}
\mathbb{E}\big[X_\alpha(\beta) \,\big|\, F_\alpha(\beta)\big]
&\le \binom{L}{\alpha L}(M-1)^{\bar{\alpha}L}\, \frac{\kappa}{\sqrt{n}}\, e^{-n f(\tilde{D}_\alpha,\, (\rho^2-D)\bar{\alpha},\, D)} \\
&\le \binom{L}{\alpha L}\, \frac{\kappa}{\sqrt{n}}\, e^{n\big(R\bar{\alpha} - f(\tilde{D}_\alpha,\, (\rho^2-D)\bar{\alpha},\, D)\big)},
\end{aligned}
\qquad (98)
$$
where the last step uses $(M-1)^{\bar{\alpha}L} < M^{\bar{\alpha}L} = e^{nR\bar{\alpha}}$.

Overall bound: Substituting the bounds from (95), (98) and (33) in (88), for sufficiently large $n$ we have, for $\alpha \in \{\frac{1}{L}, \ldots, \frac{L-1}{L}\}$:
$$
\begin{aligned}
P\big(X_\alpha(\beta) > L^{-3/2}\,\mathbb{E}X\big) \le \kappa \binom{L}{\alpha L}\Big(
&\, e^{-n\big[f(\rho^2,\, (\rho^2-D)\alpha,\, \tilde{D}_\alpha) + f(\tilde{D}_\alpha,\, (\rho^2-D)\bar{\alpha},\, D) - f(\rho^2,\, \rho^2-D,\, D)\big]} \\
&+ L^{3/2}\, e^{-n\big[R\alpha + f(\tilde{D}_\alpha,\, (\rho^2-D)\bar{\alpha},\, D) - f(\rho^2,\, \rho^2-D,\, D)\big]}\Big).
\end{aligned}
\qquad (99)
$$
Since $\tilde{D}_\alpha$ is chosen to satisfy $R\alpha = f(\rho^2, (\rho^2-D)\alpha, \tilde{D}_\alpha)$, the two exponents in (99) are equal. To bound (99), we use the following lemma.

Lemma 6. For $\alpha \in \{\frac{1}{L}, \ldots, \frac{L-1}{L}\}$, we have
$$
f\big(\rho^2, (\rho^2-D)\alpha, \tilde{D}_\alpha\big) + f\big(\tilde{D}_\alpha, (\rho^2-D)\bar{\alpha}, D\big) - f\big(\rho^2, \rho^2-D, D\big)
\;>\;
\begin{cases}
\alpha\,\Lambda(\alpha) & \text{if } \tilde{D}_\alpha > D, \\
c_1 & \text{if } \tilde{D}_\alpha \le D,
\end{cases}
\qquad (100)
$$
where $\tilde{D}_\alpha$ is the solution of (87), $c_1$ is a positive constant given by (137), and
$$
\Lambda(\alpha) = \frac{1}{8}\Big(\frac{D}{\rho^2}\Big)^4 \Big(1 + \frac{D}{\rho^2}\Big)^2 \Big(1 - \frac{D}{\rho^2}\Big)
\left[-1 + \left(1 + \frac{2\sqrt{\rho^2/D}}{\big(\frac{\rho^2}{D} - 1\big)}\Big(R - \frac{1}{2\alpha}\log\frac{\rho^2}{\rho^2\bar{\alpha} + D\alpha}\Big)\right)^{\frac{1}{2}}\right]^2.
\qquad (101)
$$

Proof: See Appendix I.

We observe that $\Lambda(\alpha)$ is strictly decreasing for $\alpha \in (0, 1]$. This can be seen by using the Taylor expansion of $\log(1-x)$ for $0 < x < 1$ to write
$$
R - \frac{1}{2\alpha}\log\frac{\rho^2}{\rho^2\bar{\alpha} + D\alpha}
= R + \frac{1}{2\alpha}\log\Big(1 - \alpha\Big(1 - \frac{D}{\rho^2}\Big)\Big)
= R - \frac{1}{2}\sum_{k=1}^{\infty} \Big(1 - \frac{D}{\rho^2}\Big)^k \frac{\alpha^{k-1}}{k}.
\qquad (102)
$$
Since
$$
R > \frac{1}{2}\log\frac{\rho^2}{D} > \frac{1}{2}\Big(1 - \frac{D}{\rho^2}\Big),
$$
(102) shows that $\Lambda(\alpha)$ is strictly positive and strictly decreasing in $\alpha \in (0, 1)$, with
$$
\Lambda(0) := \lim_{\alpha \to 0} \Lambda(\alpha) = \frac{1}{8}\Big(\frac{D}{\rho^2}\Big)^4 \Big(1 + \frac{D}{\rho^2}\Big)^2 \Big(1 - \frac{D}{\rho^2}\Big)
\left[-1 + \left(1 + \frac{2\sqrt{\rho^2/D}}{\big(\frac{\rho^2}{D} - 1\big)}\Big(R - \frac{1}{2}\Big(1 - \frac{D}{\rho^2}\Big)\Big)\right)^{\frac{1}{2}}\right]^2,
$$
$$
\Lambda(1) = \frac{1}{8}\Big(\frac{D}{\rho^2}\Big)^4 \Big(1 + \frac{D}{\rho^2}\Big)^2 \Big(1 - \frac{D}{\rho^2}\Big)
\left[-1 + \left(1 + \frac{2\sqrt{\rho^2/D}}{\big(\frac{\rho^2}{D} - 1\big)}\Big(R - \frac{1}{2}\log\frac{\rho^2}{D}\Big)\right)^{\frac{1}{2}}\right]^2.
\qquad (103)
$$
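As a numerical illustration (hypothetical values $\rho^2 = 1$, $D = 0.3$, $R = 0.65$), the following sketch evaluates $\Lambda(\alpha)$ from (101) on a grid and checks that it is positive, strictly decreasing, and consistent with the endpoint expressions (103):

```python
import numpy as np

rho2, D, R = 1.0, 0.3, 0.65          # hypothetical, R > 0.5*log(rho2/D)
r = rho2 / D
pre = (1/8)*(D/rho2)**4*(1 + D/rho2)**2*(1 - D/rho2)

def Lam(term):
    # Common shape of (101)/(103); 'term' is R minus the alpha-dependent log factor
    return pre * (-1 + np.sqrt(1 + 2*np.sqrt(r)/(r - 1)*term))**2

a = np.linspace(1e-6, 1.0, 1000)
vals = Lam(R - (1/(2*a))*np.log(rho2/(rho2*(1 - a) + D*a)))
assert np.all(vals > 0) and np.all(np.diff(vals) < 0)  # positive, strictly decreasing

Lam0 = Lam(R - 0.5*(1 - D/rho2))     # limit alpha -> 0, first line of (103)
Lam1 = Lam(R - 0.5*np.log(rho2/D))   # alpha = 1, second line of (103)
print(Lam0, vals[0], Lam1, vals[-1]) # endpoints match the grid values
```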

Substituting (100) in (99), we have, for $\alpha \in \{\frac{1}{L}, \ldots, \frac{L-1}{L}\}$:
$$
P\big(X_\alpha(\beta) > L^{-3/2}\,\mathbb{E}X\big) < \kappa \binom{L}{\alpha L}\, L^{3/2}\, \exp\big(-n\cdot\min\{\alpha\Lambda(\alpha),\, c_1\}\big). \qquad (104)
$$


Taking logarithms and dividing both sides by $L\log L$, we obtain
$$
\begin{aligned}
\frac{1}{L\log L}\,\log P\big(X_\alpha(\beta) > L^{-3/2}\,\mathbb{E}X\big)
&< \frac{\log\kappa}{L\log L} + \frac{\log\binom{L}{\alpha L}}{L\log L} + \frac{3}{2L} - \frac{n\,\min\{\alpha\Lambda(\alpha),\, c_1\}}{L\log L} \\
&\overset{(a)}{\le} \frac{\log\kappa}{L\log L} + \min\Big\{\alpha,\, \bar{\alpha},\, \frac{\log 2}{\log L}\Big\} + \frac{3}{2L} - \frac{\min\{\alpha\Lambda(\alpha),\, c_1\}\; b}{R}
\end{aligned}
\qquad (105)
$$
where to obtain (a), we have used the bound
$$
\log\binom{L}{\alpha L} \le \min\big\{\alpha L\log L,\; \bar{\alpha}L\log L,\; L\log 2\big\}
$$
and the relation (3). For the right side of (105) to be negative for sufficiently large $L$, we need
$$
\frac{\min\{\alpha\Lambda(\alpha),\, c_1\}\; b}{R} > \min\Big\{\alpha,\, \bar{\alpha},\, \frac{\log 2}{\log L}\Big\} + \frac{3}{2L}. \qquad (106)
$$
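The combinatorial bound used in step (a) of (105) is easy to check numerically; a short sketch (the value $L = 64$ is arbitrary):

```python
from math import comb, log

L = 64  # arbitrary illustration
for aL in range(1, L):          # alpha = aL / L
    a = aL / L
    lhs = log(comb(L, aL))
    rhs = min(a*L*log(L), (1 - a)*L*log(L), L*log(2))
    assert lhs <= rhs           # equality only at the extremes aL = 1, L-1
print("log C(L, aL) <= min{aL logL, (1-a)L logL, L log2} for all a")
```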

Condition (106) can be arranged by choosing $b$ large enough. Since (106) has to be satisfied for all $\alpha \in \{\frac{1}{L}, \ldots, \frac{L-1}{L}\}$, we need
$$
b > \max_{\alpha \in \{\frac{1}{L}, \ldots, \frac{L-1}{L}\}} \left\{ \frac{R}{\min\{\alpha\Lambda(\alpha),\, c_1\}} \left[\min\Big\{\alpha,\, \bar{\alpha},\, \frac{\log 2}{\log L}\Big\} + \frac{3}{2L}\right] \right\}
\;\overset{(a)}{=}\; \frac{2.5\,R}{\Lambda(0)} + O\Big(\frac{1}{L}\Big)
\;=\; b_{\min}\Big(\frac{\rho^2}{D}\Big) + O\Big(\frac{1}{L}\Big).
\qquad (107)
$$
In (107), (a) holds because $\Lambda(\alpha)$ is of constant order for all $\alpha \in (0, 1]$, hence the maximum is attained at $\alpha = \frac{1}{L}$. The constant $\Lambda(0)$ is given by (103), and $b_{\min}(\cdot)$ is defined in the statement of Theorem 1.
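A numerical check of step (a) in (107), reusing the hypothetical parameters $\rho^2 = 1$, $D = 0.3$, $R = 0.65$, and an illustrative $L$:

```python
import numpy as np

rho2, D, R = 1.0, 0.3, 0.65   # hypothetical parameters, as before
r = rho2 / D
L = 200                       # illustrative number of sections

pre = (1/8)*(D/rho2)**4*(1 + D/rho2)**2*(1 - D/rho2)
def Lam(term):
    return pre * (-1 + np.sqrt(1 + 2*np.sqrt(r)/(r - 1)*term))**2

a = np.arange(1, L) / L
Lam_a = Lam(R - (1/(2*a))*np.log(rho2/(rho2*(1 - a) + D*a)))
Lam0  = Lam(R - 0.5*(1 - D/rho2))
# c1 from (137)
c1 = (rho2 - D)/(24*rho2) * (-R + np.sqrt(R**2 + 2*rho2*(R - 0.5*np.log(r))/(rho2 - D)))**2

bracket = np.minimum(np.minimum(a, 1 - a), np.log(2)/np.log(L)) + 3/(2*L)
rhs107 = R / np.minimum(a*Lam_a, c1) * bracket
print(np.argmax(rhs107) == 0)        # maximum over alpha attained at alpha = 1/L
print(rhs107.max() / (2.5*R/Lam0))   # ratio -> 1 as L grows: the O(1/L) term
```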

When $b$ satisfies (107) and $L$ is sufficiently large, for $\alpha \in \{\frac{1}{L}, \ldots, \frac{L-1}{L}\}$ the bound in (105) becomes
$$
\begin{aligned}
\frac{1}{L\log L}\,\log P\big(X_\alpha(\beta) > L^{-3/2}\,\mathbb{E}X\big)
&< \frac{\log\kappa}{L\log L} - \frac{\min\{\alpha\Lambda(\alpha),\, c_1\}\,\big(b - b_{\min} - O(\frac{1}{L})\big)}{R} \\
&\le \frac{\log\kappa}{L\log L} - \frac{\Lambda(0)}{L}\,\frac{(b - b_{\min})}{R}
= \frac{\log\kappa}{L\log L} - \frac{2.5\big(\frac{b}{b_{\min}} - 1\big)}{L}.
\end{aligned}
\qquad (108)
$$
Therefore
$$
P\big(X_\alpha(\beta) > L^{-3/2}\,\mathbb{E}X\big) < \kappa\, L^{-2.5\left(\frac{b}{b_{\min}} - 1\right)}. \qquad (109)
$$
This completes the proof of Lemma 2. $\blacksquare$

APPENDIX I
PROOF OF LEMMA 6

For $\alpha \in \{\frac{1}{L}, \ldots, \frac{L-1}{L}\}$, define the function $g_\alpha : \mathbb{R} \to \mathbb{R}$ as
$$
g_\alpha(u) = f\big(\rho^2, (\rho^2-D)\alpha, u\big) + f\big(u, (\rho^2-D)\bar{\alpha}, D\big) - \frac{1}{2}\log\frac{\rho^2}{D}. \qquad (110)
$$
We want to bound $g_\alpha(\tilde{D}_\alpha)$ from below as in (100), where $\tilde{D}_\alpha$ is the solution to
$$
R\alpha = f\big(\rho^2, (\rho^2-D)\alpha, \tilde{D}_\alpha\big). \qquad (111)
$$
We consider the cases $\tilde{D}_\alpha > D$ and $\tilde{D}_\alpha \le D$ separately (dropping the subscript $\alpha$ of $g_\alpha$ in what follows). Recall from Lemma 5 that $\tilde{D}_\alpha < \rho^2\bar{\alpha} + D\alpha$.

Case 1: $D < \tilde{D}_\alpha < \rho^2\bar{\alpha} + D\alpha$. In this case, both the $f(\cdot)$ terms in the definition of $g(\tilde{D}_\alpha)$ are strictly positive. We can write
$$
\tilde{D}_\alpha = \rho^2\bar{\alpha} + D\alpha - \delta, \qquad (112)
$$
where $\delta \in (0, (\rho^2-D)\bar{\alpha})$. Expanding $g(\rho^2\bar{\alpha} + D\alpha - \delta)$ around $\rho^2\bar{\alpha} + D\alpha$ using Taylor's theorem, we obtain
$$
g(\tilde{D}_\alpha) = g(\rho^2\bar{\alpha} + D\alpha) - g'(\rho^2\bar{\alpha} + D\alpha)\,\delta + g''(\xi)\,\frac{\delta^2}{2} = g''(\xi)\,\frac{\delta^2}{2} \qquad (113)
$$
since $g(\rho^2\bar{\alpha} + D\alpha) = g'(\rho^2\bar{\alpha} + D\alpha) = 0$. Here $\xi$ is a number in the interval $(D, \rho^2\bar{\alpha} + D\alpha)$. We bound $g(\tilde{D}_\alpha)$ from below by obtaining separate lower bounds for $g''(\xi)$ and $\delta$.

Lower bound for $g''(\xi)$: Using the definition of $f$ in (27), the second derivative of $g(u)$ is
$$
\begin{aligned}
g''(u) = -\frac{1}{2u^2}
&+ \frac{2\rho^4\,\big[(\rho^2-D)^2\alpha^2 + 4\rho^2 u\big]^{-1/2}}{\alpha(\rho^2-D)\Big[\sqrt{(\rho^2-D)^2\alpha^2 + 4\rho^2 u} - (\rho^2-D)\alpha\Big]^2} \\
&+ \frac{2D^2\,\big[(\rho^2-D)^2\bar{\alpha}^2 + 4D u\big]^{-1/2}}{\bar{\alpha}(\rho^2-D)\Big[\sqrt{(\rho^2-D)^2\bar{\alpha}^2 + 4D u} - (\rho^2-D)\bar{\alpha}\Big]^2}.
\end{aligned}
\qquad (114)
$$
It can be verified that $g''(u)$ is a decreasing function, and hence for $\xi \in (D, \rho^2\bar{\alpha} + D\alpha)$,
$$
\begin{aligned}
g''(\xi) \ge g''(\rho^2\bar{\alpha} + D\alpha)
&= \frac{-1}{2(\rho^2\bar{\alpha} + D\alpha)^2}
+ \frac{\rho^4}{2\alpha(\rho^2-D)(\rho^2\bar{\alpha} + D\alpha)^2\big(\rho^2(1+\bar{\alpha}) + D\alpha\big)} \\
&\qquad + \frac{1}{2\bar{\alpha}(\rho^2-D)\big(\rho^2\bar{\alpha} + D(1+\alpha)\big)} \\
&= \frac{(\rho^2 + D)}{2\alpha\bar{\alpha}(\rho^2-D)\big(\rho^2(1+\bar{\alpha}) + D\alpha\big)\big(\rho^2\bar{\alpha} + D(1+\alpha)\big)} \\
&\ge \frac{1}{4\alpha\bar{\alpha}(\rho^2-D)\rho^2}.
\end{aligned}
\qquad (115)
$$
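The algebraic simplification in (115) is straightforward to verify numerically; a sketch with arbitrary test values $\rho^2 = 1$, $D = 0.3$, $\alpha = 0.4$:

```python
import numpy as np

rho2, D, a = 1.0, 0.3, 0.4      # arbitrary test values
ab = 1 - a                       # alpha bar
u = rho2*ab + D*a                # evaluation point in (115)
c = rho2 - D

def gpp(u):
    # g''(u) from (114); note rho^4 = rho2**2
    s1 = np.sqrt(c**2 * a**2 + 4*rho2*u)
    s2 = np.sqrt(c**2 * ab**2 + 4*D*u)
    return (-1/(2*u**2)
            + 2*rho2**2 / (a*c*s1*(s1 - c*a)**2)
            + 2*D**2    / (ab*c*s2*(s2 - c*ab)**2))

closed = (rho2 + D) / (2*a*ab*c*(rho2*(1 + ab) + D*a)*(rho2*ab + D*(1 + a)))
lower  = 1 / (4*a*ab*c*rho2)
print(gpp(u), closed, lower)     # first two agree; both exceed the final lower bound
assert np.isclose(gpp(u), closed) and closed >= lower
```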

Lower bound for $\delta$: From (111) and (112), note that $\delta$ is the solution to
$$
R\alpha = f\big(\rho^2, (\rho^2-D)\alpha, \rho^2\bar{\alpha} + D\alpha - \delta\big). \qquad (116)
$$
Using Taylor's theorem for $f$ in its third argument around the point $p := (\rho^2, (\rho^2-D)\alpha, \rho^2\bar{\alpha} + D\alpha)$, we have
$$
\begin{aligned}
R\alpha &= f(p) - \frac{\partial f}{\partial z}\Big|_p\,\delta + \frac{\partial^2 f}{\partial z^2}\Big|_{\bar{p}}\,\frac{\delta^2}{2} \\
&= \frac{1}{2}\log\frac{\rho^2}{\rho^2\bar{\alpha} + D\alpha} + \frac{\delta}{2(\rho^2\bar{\alpha} + D\alpha)} + \frac{1}{2}\frac{\partial^2 f}{\partial z^2}\Big|_{\bar{p}}\,\delta^2,
\end{aligned}
\qquad (117)
$$


where $\bar{p} = (\rho^2, (\rho^2-D)\alpha, \bar{z})$ for some $\bar{z} \in (D, \rho^2\bar{\alpha} + D\alpha)$. As (117) is a quadratic in $\delta$ with positive coefficients for the $\delta$ and $\delta^2$ terms, replacing the $\delta^2$ coefficient with an upper bound and solving the resulting quadratic will yield a lower bound for $\delta$. Since the function
$$
\frac{\partial^2 f}{\partial z^2}\Big|_{(x,y,z)} = \frac{2x^2}{y\,\sqrt{y^2 + 4xz}\,\big(\sqrt{y^2 + 4xz} - y\big)^2} \qquad (118)
$$
is decreasing in $z$, the $\delta^2$ coefficient can be bounded as follows:

$$
\frac{1}{2}\frac{\partial^2 f}{\partial z^2}\Big|_{\bar{p} = (\rho^2,\, (\rho^2-D)\alpha,\, \bar{z})} \;\le\; a^* := \frac{1}{2}\frac{\partial^2 f}{\partial z^2}\Big|_{(\rho^2,\, (\rho^2-D)\alpha,\, D)}, \qquad (119)
$$
where $a^*$ can be computed to be
$$
a^* = \rho^4\left(\alpha(\rho^2-D)\,\sqrt{(\rho^2-D)^2\alpha^2 + 4\rho^2 D}\,\Big[\sqrt{(\rho^2-D)^2\alpha^2 + 4\rho^2 D} - (\rho^2-D)\alpha\Big]^2\right)^{-1}. \qquad (120)
$$

Therefore we can obtain a lower bound for $\delta$, denoted by $\underline{\delta}$, by solving the equation
$$
\delta^2\, a^* + \delta\,\frac{1}{2(\rho^2\bar{\alpha} + D\alpha)} - \Big(R\alpha - \frac{1}{2}\log\frac{\rho^2}{\rho^2\bar{\alpha} + D\alpha}\Big) = 0. \qquad (121)
$$
We thus obtain
$$
\delta > \underline{\delta} = \frac{1}{4(\rho^2\bar{\alpha} + D\alpha)\,a^*}\left[-1 + \left(1 + 16(\rho^2\bar{\alpha} + D\alpha)^2\, a^*\alpha\,\Big(R - \frac{1}{2\alpha}\log\frac{\rho^2}{\rho^2\bar{\alpha} + D\alpha}\Big)\right)^{1/2}\right]. \qquad (122)
$$
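The closed form (122) is just the positive root of (121); a quick check with the hypothetical parameters used earlier ($\rho^2 = 1$, $D = 0.3$, $R = 0.65$, $\alpha = 0.4$):

```python
import numpy as np

rho2, D, R, a = 1.0, 0.3, 0.65, 0.4   # hypothetical values, as before
ab = 1 - a
c = rho2 - D
u = rho2*ab + D*a

# a* from (120); rho^4 = rho2**2
s = np.sqrt(c**2 * a**2 + 4*rho2*D)
astar = rho2**2 / (a*c*s*(s - c*a)**2)

# positive root of (121) via the quadratic formula, and the closed form (122)
C = R*a - 0.5*np.log(rho2/u)          # > 0 by Lemma 5
root = np.roots([astar, 1/(2*u), -C]).max()
delta_lb = (1/(4*u*astar)) * (-1 + np.sqrt(1 + 16*u**2*astar*a*(R - (1/(2*a))*np.log(rho2/u))))
print(root, delta_lb)                  # the two coincide
assert np.isclose(root, delta_lb)
```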

We now bound $\underline{\delta}$ from below by obtaining lower and upper bounds for $a^*\alpha$. From (120) we have
$$
a^*\alpha = \frac{\rho^4\,\big[(\rho^2-D)^2\alpha^2 + 4\rho^2 D\big]^{-1/2}}{(\rho^2-D)\Big[\sqrt{(\rho^2-D)^2\alpha^2 + 4\rho^2 D} - (\rho^2-D)\alpha\Big]^2} \;\ge\; \frac{\sqrt{\rho^2/D}}{8D(\rho^2-D)}, \qquad (123)
$$
where the inequality is obtained by noting that $a^*\alpha$ is strictly increasing in $\alpha$, and hence taking $\alpha = 0$ gives a lower bound. Analogously, taking $\alpha = 1$ yields the upper bound
$$
a^*\alpha \le \frac{\rho^4}{4D^2(\rho^4 - D^2)}. \qquad (124)
$$
Using the bounds of (123) and (124) in (122), we obtain
$$
\delta > \underline{\delta} \ge \alpha\,\frac{D^2(\rho^4 - D^2)}{\rho^6}\left[-1 + \left(1 + \frac{2\sqrt{\rho^2/D}}{\big(\frac{\rho^2}{D} - 1\big)}\Big(R - \frac{1}{2\alpha}\log\frac{\rho^2}{\rho^2\bar{\alpha} + D\alpha}\Big)\right)^{1/2}\right]. \qquad (125)
$$

Finally, using the lower bounds for $g''(\xi)$ and $\delta$ from (115) and (125) in (113), and dropping a factor $\frac{1}{\bar{\alpha}} \ge 1$, we obtain
$$
\begin{aligned}
g(\tilde{D}_\alpha) &> \frac{\alpha}{8}\Big(\frac{D}{\rho^2}\Big)^4\Big(1 + \frac{D}{\rho^2}\Big)^2\Big(1 - \frac{D}{\rho^2}\Big)
\left[-1 + \left(1 + \frac{2\sqrt{\rho^2/D}}{\big(\frac{\rho^2}{D} - 1\big)}\Big(R - \frac{1}{2\alpha}\log\frac{\rho^2}{\rho^2\bar{\alpha} + D\alpha}\Big)\right)^{\frac{1}{2}}\right]^2 \\
&= \alpha\,\Lambda(\alpha).
\end{aligned}
\qquad (126)
$$
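As a check on the assembly of (113), (115) and (125) into (126), the following sketch verifies numerically that $\frac{1}{2}\, g''_{\mathrm{lb}}\, \underline{\delta}_{\mathrm{lb}}^{\,2} = \frac{\alpha}{\bar{\alpha}}\Lambda(\alpha) \ge \alpha\Lambda(\alpha)$ (same hypothetical parameters as above):

```python
import numpy as np

rho2, D, R, a = 1.0, 0.3, 0.65, 0.4   # hypothetical values, as before
ab, c, r = 1 - a, rho2 - D, rho2/D

term = R - (1/(2*a))*np.log(rho2/(rho2*ab + D*a))
bracket = -1 + np.sqrt(1 + 2*np.sqrt(r)/(r - 1)*term)

gpp_lb   = 1/(4*a*ab*c*rho2)                             # final bound in (115)
delta_lb = a * D**2*(rho2**2 - D**2)/rho2**3 * bracket   # final bound in (125)
Lam      = (1/8)*(D/rho2)**4*(1 + D/rho2)**2*(1 - D/rho2)*bracket**2  # (101)

lhs = 0.5 * gpp_lb * delta_lb**2
assert np.isclose(lhs, (a/ab)*Lam) and lhs >= a*Lam
print(lhs, a*Lam)
```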

Case 2: $\tilde{D}_\alpha \le D$. In this case, $g(\tilde{D}_\alpha)$ is given by
$$
g(\tilde{D}_\alpha) = f\big(\rho^2, (\rho^2-D)\alpha, \tilde{D}_\alpha\big) + f\big(\tilde{D}_\alpha, (\rho^2-D)\bar{\alpha}, D\big) - \frac{1}{2}\log\frac{\rho^2}{D}
= R\alpha - \frac{1}{2}\log\frac{\rho^2}{D}, \qquad (127)
$$
where we have used (111) and the fact that $f(\tilde{D}_\alpha, (\rho^2-D)\bar{\alpha}, D) = 0$ for $\tilde{D}_\alpha \le D$. The right side of the equation
$$
R\alpha = f\big(\rho^2, (\rho^2-D)\alpha, z\big)
$$
is decreasing in $z$ for $z \in (0, D]$. Therefore, it is sufficient to consider $\tilde{D}_\alpha = D$ in order to obtain a lower bound for $R\alpha$ that holds for all $\tilde{D}_\alpha \in (0, D]$.

Next, we claim that the $\alpha$ that solves the equation
$$
R\alpha = f\big(\rho^2, (\rho^2-D)\alpha, D\big) \qquad (128)
$$
lies in the interval $(\frac{1}{2}, 1)$. Indeed, observe that the left side of (128) is increasing in $\alpha$, while the right side is decreasing in $\alpha$ for $\alpha \in (0, 1]$. Since the left side is strictly greater than the right side at $\alpha = 1$ (as $R > \frac{1}{2}\log\frac{\rho^2}{D}$), the solution is strictly less than $1$. On the other hand, for $\alpha \le \frac{1}{2}$, we have (using $R \le 1 - \frac{D}{\rho^2}$)
$$
R\alpha \le \frac{R}{2} \le \frac{1}{2}\Big(1 - \frac{D}{\rho^2}\Big) < \frac{1}{2}\log\frac{\rho^2}{D} = f\big(\rho^2, (\rho^2-D), D\big) < f\Big(\rho^2, \frac{(\rho^2-D)}{2}, D\Big), \qquad (129)
$$
i.e., the left side of (128) is strictly less than the right side. Therefore the $\alpha$ that solves (128) lies in $(\frac{1}{2}, 1)$.

To obtain a lower bound on the right side of (127), we expand $f(\rho^2, (\rho^2-D)\alpha, D)$ using Taylor's theorem in the second argument:
$$
\begin{aligned}
f\big(\rho^2, (\rho^2-D)\alpha, D\big) &= f\big(\rho^2, (\rho^2-D) - \Delta, D\big) \\
&= \frac{1}{2}\log\frac{\rho^2}{D} - \Delta\,\frac{\partial f}{\partial y}\Big|_{(\rho^2,\, \rho^2-D,\, D)} + \frac{\Delta^2}{2}\frac{\partial^2 f}{\partial y^2}\Big|_{(\rho^2,\, y_0,\, D)} \\
&= \frac{1}{2}\log\frac{\rho^2}{D} + \frac{\Delta^2}{2}\frac{\partial^2 f}{\partial y^2}\Big|_{(\rho^2,\, y_0,\, D)},
\end{aligned}
\qquad (130)
$$
where $\Delta = (\rho^2-D)\bar{\alpha}$, $y_0$ lies in the interval $\big(\frac{1}{2}(\rho^2-D),\, (\rho^2-D)\big)$, and the first-order term vanishes since $\frac{\partial f}{\partial y}\big|_{(\rho^2,\, \rho^2-D,\, D)} = 0$. Using (130) and the shorthand
$$
f''(y_0) := \frac{\partial^2 f}{\partial y^2}\Big|_{(\rho^2,\, y_0,\, D)},
$$


(128) can be written as
$$
R\alpha - \frac{1}{2}\log\frac{\rho^2}{D} = \bar{\alpha}^2\,\frac{(\rho^2-D)^2}{2}\, f''(y_0), \qquad (131)
$$
or
$$
R - \frac{1}{2}\log\frac{\rho^2}{D} = R\bar{\alpha} + \bar{\alpha}^2\,\frac{(\rho^2-D)^2}{2}\, f''(y_0). \qquad (132)
$$

Solving the quadratic in $\bar{\alpha}$, we get
$$
\bar{\alpha} = \frac{-R + \Big[R^2 + 2(\rho^2-D)^2\big(R - \frac{1}{2}\log\frac{\rho^2}{D}\big)\, f''(y_0)\Big]^{1/2}}{(\rho^2-D)^2\, f''(y_0)}. \qquad (133)
$$
Using this in (131), we get
$$
R\alpha - \frac{1}{2}\log\frac{\rho^2}{D} = \frac{\Big(-R + \Big[R^2 + 2(\rho^2-D)^2\big(R - \frac{1}{2}\log\frac{\rho^2}{D}\big)\, f''(y_0)\Big]^{1/2}\Big)^2}{2(\rho^2-D)^2\, f''(y_0)}. \qquad (134)
$$

The left side is exactly the quantity we want to bound from below. From the definition of $f$ in (27), the second partial derivative with respect to $y$ can be computed:
$$
f''(y) = \frac{\partial^2 f}{\partial y^2}\Big|_{(\rho^2,\, y,\, D)} = \frac{\rho^2}{y^3} + \frac{1}{y^2} - \frac{y}{2\,(y^2 + 4\rho^2 D)^{3/2}}. \qquad (135)
$$

The right side of (135) is strictly decreasing in $y$. We can therefore bound $f''(y_0)$ as
$$
\frac{\rho^2}{(\rho^2-D)^3} < f''(\rho^2-D) < f''(y_0) < f''\Big(\frac{\rho^2-D}{2}\Big) < \frac{12\rho^2}{(\rho^2-D)^3}. \qquad (136)
$$
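A numerical check of (136) using the expression (135) (hypothetical $\rho^2 = 1$, $D = 0.3$):

```python
import numpy as np

rho2, D = 1.0, 0.3          # hypothetical values
c = rho2 - D

def fpp(y):
    # f''(y) from (135)
    return rho2/y**3 + 1/y**2 - y/(2*(y**2 + 4*rho2*D)**1.5)

y = np.linspace(c/2, c, 1000)
vals = fpp(y)
assert np.all(np.diff(vals) < 0)                        # strictly decreasing here
assert rho2/c**3 < fpp(c) and fpp(c/2) < 12*rho2/c**3   # the bounds in (136)
print(rho2/c**3, fpp(c), fpp(c/2), 12*rho2/c**3)
```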

Substituting the bounds (136) in (134), we conclude that for $\tilde{D}_\alpha \le D$,
$$
g(\tilde{D}_\alpha) = R\alpha - \frac{1}{2}\log\frac{\rho^2}{D}
\;\ge\; c_1 := \frac{(\rho^2-D)}{24\rho^2}\left[-R + \left(R^2 + \frac{2\rho^2\big(R - \frac{1}{2}\log\frac{\rho^2}{D}\big)}{(\rho^2-D)}\right)^{\frac{1}{2}}\right]^2. \qquad (137)
$$
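Finally, a quick evaluation of $c_1$ from (137) (hypothetical $\rho^2 = 1$, $D = 0.3$, $R = 0.65$), confirming it is a positive constant whenever $R > \frac{1}{2}\log\frac{\rho^2}{D}$:

```python
import numpy as np

rho2, D, R = 1.0, 0.3, 0.65   # hypothetical values with R > 0.5*log(rho2/D)
excess = R - 0.5*np.log(rho2/D)
c1 = (rho2 - D)/(24*rho2) * (-R + np.sqrt(R**2 + 2*rho2*excess/(rho2 - D)))**2
print(excess, c1)             # both strictly positive
assert excess > 0 and c1 > 0
```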

ACKNOWLEDGEMENT

We thank the anonymous referee for comments which helped improve the paper.
