The Perception-Distortion Tradeoff

Yochai Blau and Tomer Michaeli

arXiv:1711.06077v3 [cs.CV] 3 Oct 2019

Abstract: Image restoration algorithms are typically evaluated by some distortion measure (e.g. PSNR, SSIM, IFC, VIF) or by human opinion scores that quantify perceived perceptual quality. In this paper, we prove mathematically that distortion and perceptual quality are at odds with each other. Specifically, we study the optimal probability for correctly discriminating the outputs of an image restoration algorithm from real images. We show that as the mean distortion decreases, this probability must increase (indicating worse perceptual quality). As opposed to the common belief, this result holds true for any distortion measure, and is not only a problem of the PSNR or SSIM criteria. We also show that generative-adversarial-nets (GANs) provide a principled way to approach the perception-distortion bound. This constitutes theoretical support to their observed success in low-level vision tasks. Based on our analysis, we propose a new methodology for evaluating image restoration methods, and use it to perform an extensive comparison between recent super-resolution algorithms.


1 INTRODUCTION

The last decades have seen continuous progress in image restoration algorithms (e.g. for denoising, deblurring, super-resolution) both in visual quality and in distortion measures like peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [2]. However, in recent years, it seems that the improvement in reconstruction accuracy is not always accompanied by an improvement in visual quality. In fact, and perhaps counter-intuitively, algorithms that are superior in terms of perceptual quality are often inferior in terms of e.g. PSNR and SSIM [3], [4], [5], [6], [7], [8], [9]. This phenomenon is commonly interpreted as a shortcoming of the existing distortion measures [10], which fuels a constant search for alternative "more perceptual" criteria.

In this paper, we offer a complementary explanation for the apparent tradeoff between perceptual quality and distortion measures. Specifically, we prove that there exists a region in the perception-distortion plane which cannot be attained regardless of the algorithmic scheme (see Fig. 1). Furthermore, the boundary of this region is monotone. Therefore, in its proximity, it is only possible to improve either perceptual quality or distortion, one at the expense of the other. The perception-distortion tradeoff exists for all distortion measures, and is not only a problem of the mean-square error (MSE) or SSIM criteria.

Let us clarify the difference between distortion and perceptual quality. The goal in image restoration is to estimate an image $x$ from its degraded version $y$ (e.g. noisy, blurry, etc.). Distortion refers to the dissimilarity between the reconstructed image $\hat{x}$ and the original image $x$. Perceptual quality, on the other hand, refers only to the visual quality of $\hat{x}$, regardless of its similarity to $x$. Namely, it is the extent to which $\hat{x}$ looks like a valid natural image. An increasingly popular way of measuring perceptual quality is by using real-vs.-fake user studies, which examine the ability of human observers to tell whether $\hat{x}$ is real or the output of an algorithm [5], [11], [12], [13], [14], [15], [16], [17] (similarly to the idea underlying generative adversarial nets [18]). Therefore, perceptual quality can be defined as the best possible probability of success in such discrimination experiments, which, as we show, increases linearly with the distance between the distribution of reconstructed images and that of natural images.

[Figure 1 plot: the perception-distortion plane, with axes Perception and Distortion, regions labeled Impossible and Possible, arrows labeled Less distortion and Better quality, and points Alg. 1, Alg. 2, Alg. 3.]

Fig. 1. The perception-distortion tradeoff. Image restoration algorithms can be characterized by their average distortion and by the perceptual quality of the images they produce. We show that there exists a region in the perception-distortion plane which cannot be attained, regardless of the algorithmic scheme. When in proximity of this unattainable region, an algorithm can be potentially improved only in terms of its distortion or in terms of its perceptual quality, one at the expense of the other.

• Y. Blau and T. Michaeli are with the Viterbi Faculty of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel. E-mail: {yochai@campus, tomer.m@ee}.technion.ac.il

This is an extended version of a paper published in the Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [1]. https://ieeexplore.ieee.org/abstract/document/8578750

Based on these definitions of perception and distortion, we follow the logic of rate-distortion theory [19]. That is, we seek to characterize the behavior of the best attainable perceptual quality (minimal deviation from natural image statistics) as a function of the maximal allowable average distortion, for any estimator. This perception-distortion function (wide curve in Fig. 1) separates the attainable and unattainable regions in the perception-distortion plane and thus describes the fundamental tradeoff between perception and distortion. Our analysis shows that algorithms cannot be simultaneously very accurate and produce images that fool observers into believing they are real, no matter what measure is used to quantify accuracy. This tradeoff implies that optimizing distortion measures can be not only ineffective, but also potentially damaging in terms of visual quality. This has been empirically observed e.g. in [3], [4], [5], [6], [7], but was never established theoretically.

From the standpoint of algorithm design, we show that generative adversarial nets (GANs) provide a principled way to approach the perception-distortion bound. This gives theoretical support to the growing empirical evidence of the advantages of GANs in image restoration [3], [6], [7], [11], [20], [21], [22].

The perception-distortion tradeoff has major implications for low-level vision. In certain applications, reconstruction accuracy is of key importance (e.g. medical imaging). In others, perceptual quality may be preferred. The impossibility of simultaneously achieving both goals calls for a new way of evaluating algorithms: placing them on the perception-distortion plane. We use this new methodology to conduct an extensive comparison between recent super-resolution (SR) methods, revealing which SR methods lie closest to the perception-distortion bound.

2 DISTORTION AND PERCEPTUAL QUALITY

Distortion and perceptual quality have been studied in many different contexts, and are sometimes referred to by different names. Let us briefly put past works in our context.

2.1 Distortion (full-reference) measures

Given a distorted image $\hat{x}$ and a ground-truth reference image $x$, full-reference distortion measures quantify the quality of $\hat{x}$ by its discrepancy to $x$. These measures are often called full-reference image quality criteria because of the reasoning that if $\hat{x}$ is similar to $x$ and $x$ is of high quality, then $\hat{x}$ is also of high quality. However, as we show in this paper, this logic is not always correct. We thus prefer to call these measures distortion or dissimilarity criteria.

The most common distortion measure is the MSE, which is quite poorly correlated with semantic similarity between images [10]. Many alternative, more perceptual, distortion measures have been proposed over the years, including SSIM [2], MS-SSIM [23], IFC [24], VIF [25], VSNR [26] and FSIM [27]. Recently, measures based on the $\ell_2$-distance between deep feature maps of a neural net have been shown to capture more semantic similarities [3], [4], [28].

2.2 Perceptual quality

The perceptual quality of an image $\hat{x}$ is the degree to which it looks like a natural image, and has nothing to do with its similarity to any reference image. In many image processing domains, perceptual quality has been associated with deviations from natural image statistics.

Human opinion based quality assessment

Perceptual quality is commonly evaluated empirically by the mean opinion score of human subjects [29], [30]. Recently, it has become increasingly popular to perform such studies through real vs. fake questionnaires [5], [11], [12], [13], [14], [15], [16], [17]. These test the ability of a human observer to distinguish whether an image is real or the output of some algorithm. The probability of success $p_{\text{success}}$ of the optimal decision rule in this hypothesis testing task is known to be (see Appendix A)

$$p_{\text{success}} = \tfrac{1}{2}\, d_{\text{TV}}(p_X, p_{\hat{X}}) + \tfrac{1}{2}, \tag{1}$$

where $d_{\text{TV}}(p_X, p_{\hat{X}})$ is the total-variation (TV) distance between the distribution $p_{\hat{X}}$ of images produced by the algorithm in question, and the distribution $p_X$ of natural images [31]. Note that $p_{\text{success}}$ decreases as the deviation between $p_{\hat{X}}$ and $p_X$ decreases, becoming $\tfrac{1}{2}$ (no better than a coin toss) when $p_{\hat{X}} = p_X$.
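As a small numerical illustration of Eq. (1), the sketch below compares the success rate of the best real-vs.-fake decision rule with $\tfrac{1}{2} d_{\text{TV}} + \tfrac{1}{2}$ for discrete distributions. The two toy distributions, the function names and the assumption that real and fake images are shown with equal probability are illustrative choices, not taken from the paper.

```python
def tv_distance(p, q):
    """Total-variation distance between two discrete distributions (dicts)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)


def optimal_success_probability(p_real, p_fake):
    """Success rate of the best real-vs-fake rule, assuming equal priors.

    For every outcome, the optimal rule picks the hypothesis under which
    that outcome is more likely.
    """
    support = set(p_real) | set(p_fake)
    return 0.5 * sum(max(p_real.get(s, 0.0), p_fake.get(s, 0.0)) for s in support)


# Hypothetical toy distributions over three "image" classes.
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}      # natural images
p_xhat = {"a": 0.7, "b": 0.3, "c": 0.0}   # algorithm outputs

d_tv = tv_distance(p_x, p_xhat)
p_success = optimal_success_probability(p_x, p_xhat)
print(d_tv, p_success, 0.5 * d_tv + 0.5)
```

When the two distributions coincide, the same function returns 0.5, the coin-toss level mentioned above.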

No-reference quality measures

Perceptual quality can also be measured by an algorithm. In particular, no-reference measures quantify the perceptual quality $Q(\hat{x})$ of an image $\hat{x}$ without depending on a reference image. These measures are commonly based on estimating deviations from natural image statistics. Note, for example, that if $Q(\hat{x})$ is taken to be the log-likelihood $\log(p_X(\hat{x}))$, then the average quality of generated images is given by

$$\mathbb{E}_{\hat{X} \sim p_{\hat{X}}}[Q(\hat{X})] = -d_{\text{KL}}(p_{\hat{X}}, p_X) - H(p_{\hat{X}}), \tag{2}$$

where $d_{\text{KL}}(\cdot,\cdot)$ is the Kullback-Leibler divergence and $H$ denotes entropy. Since $-d_{\text{KL}}(p_{\hat{X}}, p_X)$ equals the sum of the average quality and the entropy of the generated samples, a small KL divergence $d_{\text{KL}}(p_{\hat{X}}, p_X)$ is indicative of high average quality and large diversity of generated samples. The works [32], [33], [34] proposed perceptual quality indices based on the KL divergence between the distribution of the wavelet coefficients of $\hat{x}$ and that of natural scenes. This idea was further extended by the popular methods DIIVINE [29], BRISQUE [30], BLIINDS-II [35] and NIQE [36], which quantify perceptual quality by various measures of deviation from natural image statistics in the spatial, wavelet and DCT domains.

GAN-based image restoration

Most recently, GAN-based methods have demonstrated unprecedented perceptual quality in super-resolution [3], [6], [9], [37], inpainting [7], [20], [38], compression [21], [39], [40], deblurring [41] and image-to-image translation [11], [22], [42]. This was accomplished by utilizing an adversarial loss, which minimizes some distance $d(p_X, p_{\hat{X}_{\text{GAN}}})$ between the distribution $p_{\hat{X}_{\text{GAN}}}$ of images produced by the generator and the distribution $p_X$ of images in the training dataset. A large variety of GAN schemes have been proposed, which minimize different distances between distributions. These include the Jensen-Shannon divergence [18], the Wasserstein distance [43], and any $f$-divergence [44].

3 PROBLEM FORMULATION

In statistical terms, a natural image $x$ can be thought of as a realization from the distribution of natural images $p_X$. In image restoration, we observe a degraded version $y$ relating to $x$ via some conditional distribution $p_{Y|X}$. In this paper we focus on non-invertible settings¹, where $x$ cannot be estimated from $y$ with zero error. This is typically the case in denoising, deblurring, inpainting, super-resolution, etc. Given $y$, an image restoration algorithm produces an estimate $\hat{x}$ according to some distribution $p_{\hat{X}|Y}$. Note that this description is quite general in that it does not restrict the estimator $\hat{x}$ to be a deterministic function of $y$. This problem setting is illustrated in Fig. 2.

1. By invertible we mean that the support of $p_{X|Y}(\cdot|y)$ is a singleton for almost all $y$'s (see Appendix C for a formal definition).

[Figure 2 diagram: Original, Degraded and Reconstructed images, with distortion measured between the original and the reconstruction and perception measured between their distributions.]

Fig. 2. Problem setting. Given an original image $x \sim p_X$, a degraded image $y$ is observed according to some conditional distribution $p_{Y|X}$. Given the degraded image $y$, an estimate $\hat{x}$ is constructed according to some conditional distribution $p_{\hat{X}|Y}$. Distortion is quantified by the mean of some distortion measure between $\hat{X}$ and $X$. The perceptual quality index corresponds to the deviation between $p_{\hat{X}}$ and $p_X$.

Given a full-reference dissimilarity criterion $\Delta(x, \hat{x})$, the average distortion of an estimator $\hat{X}$ is given by

$$\mathbb{E}[\Delta(X, \hat{X})], \tag{3}$$

where the expectation is over the joint distribution $p_{X,\hat{X}}$. This definition aligns with the common practice of evaluating average performance over a database of degraded natural images. We assume that the dissimilarity criterion is such that $\Delta(x, \hat{x}) \ge 0$ with equality when $\hat{x} = x$. Note that some distortion measures, e.g. SSIM, are actually similarity measures (higher is better), yet can always be inverted (and shifted) to become dissimilarity measures.

As discussed in Sec. 2.2, the perceptual quality of an estimator $\hat{X}$ (as quantified e.g. by real vs. fake human opinion studies) is directly related to the distance between the distribution of its reconstructed images $p_{\hat{X}}$, and the distribution of natural images $p_X$. We thus define the perceptual quality index (lower is better) of an estimator $\hat{X}$ as

$$d(p_X, p_{\hat{X}}), \tag{4}$$

where $d(p, q)$ is some divergence between distributions that satisfies $d(p, q) \ge 0$ with equality if $p = q$, e.g. the KL divergence, TV distance, Wasserstein distance, etc. It should be pointed out that the divergence function $d(\cdot,\cdot)$ which best relates to human perception is a subject of ongoing research. Yet, our results below hold for (nearly) any divergence.

Notice that the best possible perceptual quality is obtained when the outputs of the algorithm follow the distribution of natural images (i.e. $p_{\hat{X}} = p_X$). In this situation, by looking at the reconstructed images, it is impossible to tell that they were generated by an algorithm. However, not every estimator with this property is necessarily accurate. Indeed, we could achieve perfect perceptual quality by randomly drawing natural images that have nothing to do with the original ground-truth images. In this case the distortion would be quite large.

Fig. 3. The distribution of the MMSE and MAP estimates. In this example, $Y = X + N$, where $X \sim p_X$ and $N \sim \mathcal{N}(0, 1)$. The distributions of both the MMSE and the MAP estimates deviate significantly from the distribution $p_X$.

Our goal is to characterize the tradeoff between (3) and (4). But let us first see why minimizing the average distortion (3) does not necessarily lead to a low perceptual quality index (4). We start by illustrating this with the square-error distortion $\Delta(x, \hat{x}) = \|x - \hat{x}\|^2$ and the 0−1 distortion $\Delta(x, \hat{x}) = 1 - \delta_{x,\hat{x}}$ (where $\delta$ is Kronecker's delta). More details about those examples are provided in Appendix B. We then proceed to discuss this phenomenon for arbitrary distortions in Sec. 3.3.

3.1 The square-error distortion

The minimum mean square-error (MMSE) estimator is given by the posterior mean $\hat{x}(y) = \mathbb{E}[X | Y = y]$. Consider the case $Y = X + N$, where $X$ is a discrete random variable with probability mass function

$$p_X(x) = \begin{cases} p_1 & x = \pm 1, \\ p_0 & x = 0, \end{cases} \tag{5}$$

and $N \sim \mathcal{N}(0, 1)$ is independent of $X$ (see Fig. 3). In this setting, the MMSE estimate is given by

$$\hat{x}_{\text{MMSE}}(y) = \sum_{n \in \{-1, 0, 1\}} n \, p(X = n | y), \tag{6}$$

where

$$p(X = n | y) = \frac{p_n \exp\{-\tfrac{1}{2}(y - n)^2\}}{\sum_{m \in \{-1, 0, 1\}} p_m \exp\{-\tfrac{1}{2}(y - m)^2\}}. \tag{7}$$

Notice that $\hat{x}_{\text{MMSE}}$ can take any value in the range $(-1, 1)$, whereas $x$ can only take the discrete values $\{-1, 0, 1\}$. Thus, clearly, $p_{\hat{X}_{\text{MMSE}}}$ is very different from $p_X$, as illustrated in Fig. 3. This demonstrates that minimizing the MSE distortion does not generally lead to $p_{\hat{X}} \approx p_X$.
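Equations (6) and (7) are simple enough to evaluate directly. The following is a minimal sketch for the three-point prior of (5); the prior values `p1 = 0.3`, `p0 = 0.4`, the sample observations and the function names are illustrative assumptions rather than values from the paper.

```python
import math


def posterior(y, p1, p0):
    """Posterior p(X = n | y) of Eq. (7) for X in {-1, 0, 1}, N ~ N(0, 1)."""
    prior = {-1: p1, 0: p0, 1: p1}
    weights = {n: prior[n] * math.exp(-0.5 * (y - n) ** 2) for n in prior}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def mmse_estimate(y, p1, p0):
    """Posterior mean of X given Y = y, as in Eq. (6)."""
    return sum(n * prob for n, prob in posterior(y, p1, p0).items())


# Hypothetical prior: p1 = 0.3 for x = +/-1 and p0 = 0.4 for x = 0.
p1, p0 = 0.3, 0.4
observations = (-2.0, -0.5, 0.0, 0.5, 2.0)
estimates = [mmse_estimate(y, p1, p0) for y in observations]
print(estimates)
```

Apart from the symmetric case $y = 0$, the estimates fall strictly between the values that $x$ can take, which is the averaging effect described above.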

The same intuition holds for images. The MMSE estimate is an average over all possible explanations to the measured data, weighted by their likelihoods. However, the average of valid images is not necessarily a valid image, so that the MMSE estimate frequently "falls off" the natural image manifold [3]. This leads to unnatural blurry reconstructions, as illustrated in Fig. 4. In this experiment, $x$ is a 280 × 280 image comprising 100 smaller 28 × 28 digit images. Each digit is chosen uniformly at random from a dataset comprising 54K images from the MNIST dataset [45] and an additional 5.4K blank images. The degraded image $y$ is a noisy version of $x$. As can be seen, the MMSE estimator produces blurry reconstructions, which do not follow the statistics of the (binary) images in the dataset.

[Figure 4 panels: Original; Noisy ($\sigma = 1$); Denoised with the MMSE and MAP estimators for $\sigma = 1, 3, 5$.]

Fig. 4. MMSE and MAP denoising. Here, the original image consists of 100 smaller images, chosen uniformly at random from the MNIST dataset enriched with blank images. After adding Gaussian noise ($\sigma = 1, 3, 5$), the image is denoised using the MMSE and MAP estimators. In both cases, the estimates significantly deviate from the distribution of images in the dataset.

3.2 The 0−1 distortion

The discussion above may give the impression that unnatural estimates are mainly a problem of the square-error distortion, which causes averaging. One way to avoid averaging is to minimize the binary 0−1 loss, which restricts the estimator to choose $\hat{x}$ only from the set of values that $x$ can take. In fact, the minimum mean 0−1 distortion is attained by the maximum-a-posteriori (MAP) rule, which is very popular in image restoration. However, as we exemplify next, the distribution of the MAP estimator also deviates from $p_X$. This behavior has also been studied in [46].

Consider again the setting of (5). In this case, the MAP estimate is given by

$$\hat{x}_{\text{MAP}}(y) = \arg\max_{n \in \{-1, 0, 1\}} p(X = n | y), \tag{8}$$

where $p(X = n | y)$ is as in (7). Now, it can be easily verified that when $\log(p_1 / p_0) > 1/2$, we have $\hat{x}_{\text{MAP}}(y) = \text{sign}(y)$. Namely, the MAP estimator never predicts the value 0. Therefore, in this case, the distribution of the estimate is

$$p_{\hat{X}_{\text{MAP}}}(\hat{x}) = \begin{cases} 0.5 & \hat{x} = +1, \\ 0.5 & \hat{x} = -1, \end{cases} \tag{9}$$

which is obviously different from $p_X$ of (5) (see Fig. 3).

This effect can also be seen in the experiment of Fig. 4. Here, the MAP estimates become increasingly dominated by blank images as the noise level rises, and thus clearly deviate from the underlying prior distribution.
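The claim that the MAP rule of (8) reduces to $\text{sign}(y)$ when $\log(p_1/p_0) > 1/2$ can be checked numerically for the toy setting of (5). In this sketch, the prior values, the grid of observations and the function name are hypothetical choices made for illustration.

```python
import math


def map_estimate(y, p1, p0):
    """MAP estimate of Eq. (8) for X in {-1, 0, 1} and Y = X + N, N ~ N(0, 1).

    The normalization constant of Eq. (7) does not affect the arg max,
    so only the numerators are compared.
    """
    prior = {-1: p1, 0: p0, 1: p1}
    return max(prior, key=lambda n: prior[n] * math.exp(-0.5 * (y - n) ** 2))


# Hypothetical prior with log(p1 / p0) = log(4.5) > 1/2.
p1, p0 = 0.45, 0.10
ys = [k / 100.0 for k in range(-500, 501) if k != 0]  # y = 0 is a tie, skipped
outputs = {map_estimate(y, p1, p0) for y in ys}
print(outputs)  # the value 0 never appears

# Contrast: with a dominant p0, the value 0 is selected near y = 0.
print(map_estimate(0.0, 0.2, 0.6))
```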

3.3 Arbitrary distortion measures

We saw that neither the square-error nor the 0−1 loss is distribution preserving, in the sense that their minimization does not generally lead to $p_{\hat{X}} = p_X$ (i.e. perfect perceptual quality). However, these two examples do not yet preclude the existence of a distribution preserving distortion measure. Does there exist a measure whose minimization is guaranteed to lead to $p_{\hat{X}} = p_X$? If we limit ourselves to one single setting, then the answer may be positive. For example, in the setting of Fig. 3, if $p_0$ of (5) equals 0, then the 0−1 loss is distribution preserving as its minimization leads to an estimate satisfying $p_{\hat{X}} = p_X$. This illustrates that a distortion measure may be distribution preserving for certain underlying distributions $p_{X,Y}$ but not for others.

However, from a practical standpoint, we typically want our distortion measure to be adequate in more than one single setting. For example, if our goal is to train a neural network to perform denoising, then it is reasonable to expect that the same distortion measure be equally adequate as a loss function for different noise levels. In fact, we may also want to use the same distortion measure across different tasks (e.g. super-resolution, deblurring, inpainting). The interesting question is, therefore, whether there exists a stable distribution preserving distortion measure.

Definition 1. Assume that the distortion measure $\Delta(\cdot,\cdot)$ is distribution preserving at a distribution $p_{X,Y}$. We say that it is stably distribution preserving at $p_{X,Y}$ if for every perturbation of $p_{X,Y}$, the estimator minimizing the mean distortion (3) satisfies² $p_{\hat{X}} = p_X$.

As we show next, if the optimal estimator is unique, then the distortion metric cannot be stably distribution preserving (see proof in Appendix C).

Theorem 1. If $p_{X,Y}$ defines a non-invertible degradation and the estimator minimizing the mean distortion (3) is unique, then $\Delta(\cdot,\cdot)$ is not a stably distribution preserving distortion at $p_{X,Y}$.

2. That is, $\Delta(\cdot,\cdot)$ is not stably distribution preserving at $p_{X,Y}$ if for any (arbitrarily small) $\alpha > 0$, there exists a distribution $\tilde{p}_{X,Y}$ such that minimizing $\mathbb{E}[\Delta(X, \hat{X})]$ under the distribution $(1 - \alpha) p_{X,Y} + \alpha \tilde{p}_{X,Y}$ results in $p_{\hat{X}} \neq p_X$.

Fig. 5. Plot of Eq. (11) for the setting of Example 1. The minimal attainable KL distance between $p_X$ and $p_{\hat{X}}$ subject to a constraint on the maximal allowable MSE between $X$ and $\hat{X}$. Here, $Y = X + N$, where $X \sim \mathcal{N}(0, 1)$ and $N \sim \mathcal{N}(0, \sigma_N^2)$, and the estimator is linear, $\hat{X} = aY$. Notice the clear tradeoff: the perceptual index ($d_{\text{KL}}$) drops as the allowable distortion (MSE) increases. The graphs cut off at the MMSE (marked by a square).

Sometimes the optimal estimator is not unique. This may happen, for example, when using the distance between deep features of a pre-trained network (e.g. VGG) as a distortion metric [4], [28],

$$\Delta(x, \hat{x}) = \|\psi(x) - \psi(\hat{x})\|^2. \tag{10}$$

Since the function $\psi(\cdot)$ implemented by the first few layers of a network is often not invertible (e.g. due to the ReLU activations and the dimensionality reduction at the pooling layers), different inputs may be mapped to the same deep representation, so that we may have more than one optimal estimator. The next theorem shows that in such cases it is also impossible to guarantee $p_{\hat{X}} = p_X$ (see proof in Appendix D).

Theorem 2. For any $p_{X,Y}$, if the estimator minimizing the mean distortion (3) is not unique, then at least one of the optimal estimators satisfies $p_{\hat{X}} \neq p_X$.
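The non-uniqueness behind Theorem 2 stems from the non-invertibility of the feature map in (10). The toy sketch below uses a hypothetical $\psi$ made of a ReLU followed by max-pooling over pairs (a stand-in for the first layers of a network, not VGG itself) to show two distinct inputs with zero feature distortion between them.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, t) for t in v]


def psi(x):
    """Hypothetical feature map: ReLU, then max-pooling over adjacent pairs."""
    activated = relu(x)
    return [max(activated[i], activated[i + 1])
            for i in range(0, len(activated), 2)]


def feature_distortion(x, x_hat):
    """Squared Euclidean distance between features, in the form of Eq. (10)."""
    return sum((a - b) ** 2 for a, b in zip(psi(x), psi(x_hat)))


# Two different toy signals that psi maps to the same representation.
x = [1.0, -2.0, 0.5, 3.0]
x_hat = [1.0, 0.7, -4.0, 3.0]

print(psi(x), psi(x_hat))            # identical feature vectors
print(feature_distortion(x, x_hat))  # zero distortion although x != x_hat
```

Any estimator that outputs either signal therefore attains the same distortion, which is why more than one optimal estimator may exist.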

4 THE PERCEPTION-DISTORTION TRADEOFF

We saw that for any distortion measure, a low distortion does not generally imply good perceptual quality. An interesting question, then, is: What is the best perceptual quality that can be attained by an estimator with a prescribed distortion level?

Definition 2. The perception-distortion function of a signal restoration task is given by

$$P(D) = \min_{p_{\hat{X}|Y}} d(p_X, p_{\hat{X}}) \quad \text{s.t.} \quad \mathbb{E}[\Delta(X, \hat{X})] \le D, \tag{11}$$

where $\Delta(\cdot,\cdot)$ is a distortion measure and $d(\cdot,\cdot)$ is a divergence between distributions.

In words, $P(D)$ is the minimal deviation between the distributions $p_X$ and $p_{\hat{X}}$ that can be attained by an estimator whose average distortion is at most $D$. To gain intuition into the typical behavior of this function, consider the following example.

Example 1. Suppose that $Y = X + N$, where $X \sim \mathcal{N}(0, 1)$ and $N \sim \mathcal{N}(0, \sigma_N^2)$ are independent. Take $\Delta(\cdot,\cdot)$ to be the square-error distortion and $d(\cdot,\cdot)$ to be the KL divergence. For simplicity, let us focus on estimators of the form $\hat{X} = aY$. In this case, we can derive a closed form solution to Eq. (11) (see Appendix E), which is plotted for several noise levels $\sigma_N$ in Fig. 5. As can be seen, the minimal attainable $d_{\text{KL}}(p_X, p_{\hat{X}})$ drops as the maximal allowable distortion (MSE) increases. Furthermore, the tradeoff is convex and becomes more severe at higher noise levels $\sigma_N$.
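The setting of Example 1 can also be explored numerically. The sketch below is not the closed-form solution of Appendix E: it approximates $P(D)$ by a grid search over the gain $a$. It assumes the argument order $d_{\text{KL}}(p_X, p_{\hat{X}})$, uses $\text{MSE}(a) = (1 - a)^2 + a^2 \sigma_N^2$ and $\hat{X} \sim \mathcal{N}(0, a^2(1 + \sigma_N^2))$, which follow from $\hat{X} = aY$, and picks the function names, grid resolution and distortion levels for illustration only.

```python
import math


def mse(a, sigma_n):
    """E[(X - aY)^2] for X ~ N(0, 1), Y = X + N, N ~ N(0, sigma_n^2)."""
    return (1.0 - a) ** 2 + (a * sigma_n) ** 2


def kl_index(a, sigma_n):
    """KL(p_X || p_Xhat), p_X = N(0, 1), p_Xhat = N(0, a^2 (1 + sigma_n^2)).

    The argument order is an assumption made for this sketch.
    """
    var_hat = a ** 2 * (1.0 + sigma_n ** 2)
    return 0.5 * (math.log(var_hat) + 1.0 / var_hat - 1.0)


def perception_distortion(d, sigma_n, grid=20001):
    """Grid-search approximation of P(D) over linear estimators Xhat = aY."""
    best = math.inf  # stays infinite if no gain satisfies the constraint
    for k in range(1, grid):
        a = k / (grid - 1)  # gains in (0, 1]
        if mse(a, sigma_n) <= d:
            best = min(best, kl_index(a, sigma_n))
    return best


sigma_n = 1.0
d_values = [0.50, 0.52, 0.54, 0.56, 0.58, 0.60]
curve = [perception_distortion(d, sigma_n) for d in d_values]
for d, p in zip(d_values, curve):
    print(f"D = {d:.2f}  P(D) ~ {p:.4f}")
```

For $\sigma_N = 1$ the curve starts at the MMSE point $D = 0.5$ (gain $a = 0.5$) and decreases toward zero as $D$ grows, reaching it once the gain $a = 1/\sqrt{1 + \sigma_N^2}$ becomes feasible.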

In general settings, it is impossible to solve (11) analytically. However, it turns out that the behavior seen in Fig. 5 is typical, as we show next (see proof in Appendix F).

Theorem 3 (The perception-distortion tradeoff). Assume the problem setting of Section 3. If $d(p, q)$ of (4) is convex in its second argument³, then the perception-distortion function $P(D)$ of (11) is

1) monotonically non-increasing;
2) convex.

Note that Theorem 3 requires no assumptions on the distortion measure $\Delta(\cdot,\cdot)$. This implies that a tradeoff between perceptual quality and distortion exists for any distortion measure, including e.g. MSE, SSIM, square error between VGG features [3], [4], etc. Yet, this does not imply that all distortion measures have the same perception-distortion function. Indeed, as we demonstrate in Sec. 6, the tradeoff tends to be less severe for distortion measures that capture semantic similarities between images.

The convexity of $P(D)$ implies that the tradeoff is more severe at the low-distortion and at the high-perceptual-quality extremes. This is particularly important when considering the TV divergence, which is associated with the ability to distinguish between real and fake images (see Sec. 2.2). Since $P(D)$ is steeper in the low-distortion regime, any small improvement in distortion for an algorithm whose distortion is already low must be accompanied by a large degradation in the ability to fool a discriminator. Similarly, any small improvement in the perceptual quality of an algorithm whose perceptual index is already low must be accompanied by a large increase in distortion. We note that the assumption that $d(p, q)$ is convex in its second argument is not very limiting. For instance, any $f$-divergence (e.g. KL, TV, Hellinger, $\chi^2$) as well as the Rényi divergence satisfy this assumption [47], [48]. In any case, the function $P(D)$ is monotonically non-increasing even without this assumption.

4.1 Bounding the Perception-Distortion function

Several past works attempted to answer the question: What is the minimal attainable distortion $D_{\min}$ in various restoration tasks? [49], [50], [51], [52], [53]. This corresponds to the value

$$D_{\min} = \min_{p_{\hat{X}|Y}} \mathbb{E}[\Delta(X, \hat{X})], \tag{12}$$

which is the horizontal coordinate of the leftmost point on the perception-distortion function. However, as the minimum distortion estimator is generally not distribution preserving (Sec. 3.3), an important complementary question is: What is the minimal distortion that can be attained by an estimator having perfect perceptual quality? This corresponds to the value

$$D_{\max} = \min_{p_{\hat{X}|Y}} \mathbb{E}[\Delta(X, \hat{X})] \quad \text{s.t.} \quad p_{\hat{X}} = p_X, \tag{13}$$

which is the horizontal coordinate of the point where the perception-distortion function first touches the horizontal axis (see Fig. 6).

Observe that perfect perceptual quality ($p_{\hat{X}} = p_X$) is always attainable, for example by drawing $\hat{x}$ from $p_X$ independently of the input $y$. This method, however, ignores the input and is thus not good in terms of distortion. It turns out that perfect perceptual quality can generally be achieved with a significantly lower MSE distortion, as we show next (see proof in Appendix G).

3. That is, $d(p, \lambda q_1 + (1 - \lambda) q_2) \le \lambda d(p, q_1) + (1 - \lambda) d(p, q_2)$ for any three distributions $p, q_1, q_2$ and any $\lambda \in [0, 1]$.

[Figure 6 plot: perceptual index versus distortion, with $D_{\min}$ and $D_{\max}$ marked on the distortion axis.]

Fig. 6. Bounding the perception-distortion function. The distance between $D_{\min}$ and $D_{\max}$ is the increase in distortion which is needed to obtain perfect perceptual quality. For the MSE, Theorem 4 proves this will never be more than a factor of 2 (which is 3dB in terms of PSNR).

Theorem 4. For the square error distortion $\Delta(x, \hat{x}) = \|x - \hat{x}\|^2$,

$$D_{\max} \le 2 D_{\min}, \tag{14}$$

where $D_{\min}$ and $D_{\max}$ are defined by (12) and (13), respectively. This bound is attained by the estimator $\hat{X}$ defined through

$$p_{\hat{X}|Y}(x|y) = p_{X|Y}(x|y), \tag{15}$$

which achieves $p_{\hat{X}} = p_X$ and has an MSE of $2 D_{\min}$.
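A Monte Carlo check of the posterior-sampling estimator (15) is straightforward in a scalar Gaussian setting, where the posterior is known in closed form: for $X \sim \mathcal{N}(0, 1)$ and $Y = X + N$ with $N \sim \mathcal{N}(0, \sigma_N^2)$, the posterior of $X$ given $Y = y$ is Gaussian with mean $y / (1 + \sigma_N^2)$ and variance $\sigma_N^2 / (1 + \sigma_N^2)$. The sketch below relies on that standard fact; the sample size, seed and function name are arbitrary choices for illustration.

```python
import math
import random


def simulate(sigma_n=1.0, n=200_000, seed=0):
    """Compare the MMSE estimator with a sample drawn from the posterior.

    Returns (MSE of posterior mean, MSE of posterior sample,
             second moment of the posterior sample).
    """
    rng = random.Random(seed)
    shrink = 1.0 / (1.0 + sigma_n ** 2)          # posterior mean is shrink * y
    post_std = math.sqrt(sigma_n ** 2 * shrink)  # posterior standard deviation
    se_mmse = se_post = second_moment = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, sigma_n)
        x_mmse = shrink * y                          # minimizes the MSE
        x_post = x_mmse + rng.gauss(0.0, post_std)   # draw from p(X | Y = y)
        se_mmse += (x - x_mmse) ** 2
        se_post += (x - x_post) ** 2
        second_moment += x_post ** 2
    return se_mmse / n, se_post / n, second_moment / n


mse_mmse, mse_post, var_post = simulate()
print(mse_mmse, mse_post, mse_post / mse_mmse, var_post)
```

Up to sampling noise, the ratio of the two errors is close to 2 and the posterior samples have unit variance, matching the variance of $p_X$ in this toy setting.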

In simple words, Theorem 4 states that one would never need to sacrifice more than 3dB in PSNR to obtain perfect perceptual quality. This can be achieved by drawing $\hat{x}$ from the posterior distribution $p_{X|Y}$. Interestingly, such a degradation was indeed incurred by all super-resolution methods that achieved state-of-the-art perceptual quality to date. This can be seen in Fig. 9, where the RMSE of the algorithms with the lowest perceptual index is nearly a factor of $\sqrt{2}$ larger than the RMSE of the methods with the lowest RMSE (see also [3], [6]). However, note that this bound is generally not tight. For example, in the scalar Gaussian toy example of Fig. 5, $D_{\max}$ can be considerably smaller than $2 D_{\min}$, depending on the noise level.
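The factor-of-2 bound can be checked numerically in a scalar Gaussian setting. The sketch below is illustrative only and assumes $X \sim \mathcal{N}(0,1)$ and $Y = X + N$ with $N \sim \mathcal{N}(0,\sigma^2)$, which may differ in detail from the setup of Fig. 5. The function names are hypothetical. The value `d_max` is the MSE of the rescaled estimator $y/\sqrt{1+\sigma^2}$, whose marginal matches $p_X$; treating it as $D_{\max}$ for this example is an assumption of the sketch.

```python
import math
import random


def gaussian_bounds(sigma):
    """Closed-form MSE values for X ~ N(0, 1), Y = X + N, N ~ N(0, sigma^2)."""
    s2 = 1.0 + sigma ** 2               # variance of Y
    d_min = sigma ** 2 / s2             # MSE of the MMSE estimator y / s2
    d_post = 2.0 * d_min                # MSE of a draw from the posterior, as in (15)
    d_max = 2.0 - 2.0 / math.sqrt(s2)   # MSE of y / sqrt(s2), which is N(0, 1) distributed
    return d_min, d_post, d_max


def monte_carlo(sigma, n=100_000, seed=0):
    """Empirical MSE of the three estimators above (assumed toy model)."""
    rng = random.Random(seed)
    s2 = 1.0 + sigma ** 2
    post_std = math.sqrt(sigma ** 2 / s2)   # posterior standard deviation
    err_mmse = err_post = err_scaled = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, sigma)
        x_mmse = y / s2                               # posterior mean
        x_post = x_mmse + rng.gauss(0.0, post_std)    # posterior sample
        x_scaled = y / math.sqrt(s2)                  # rescaled to unit variance
        err_mmse += (x - x_mmse) ** 2
        err_post += (x - x_post) ** 2
        err_scaled += (x - x_scaled) ** 2
    return err_mmse / n, err_post / n, err_scaled / n


if __name__ == "__main__":
    for sigma in (0.5, 1.0, 3.0):
        print(sigma, gaussian_bounds(sigma), monte_carlo(sigma))
    # A factor of 2 in MSE corresponds to 10*log10(2) dB in PSNR.
    print(10.0 * math.log10(2.0))
```

In this toy model the posterior-sampling estimator always costs exactly twice the minimal MSE, while the rescaled estimator reaches the same marginal at a lower cost, which is consistent with the remark that the bound is generally not tight.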

4.2 Connection to rate-distortion theory

The perception-distortion tradeoff is closely related to the well-established rate-distortion theory [19]. This theory characterizes the tradeoff between the bit-rate required to communicate a signal, and the distortion incurred in the signal’s reconstruction at the receiver. More formally, the rate-distortion function of a signal $X$ is defined by

$$R(D) = \min_{p_{\hat{X}|X}} I(X; \hat{X}) \quad \text{s.t.} \quad \mathbb{E}[\Delta(X, \hat{X})] \le D, \tag{16}$$

where $I(X; \hat{X})$ is the mutual information between $X$ and $\hat{X}$.

There are, however, several key differences between the two tradeoffs. First, in rate-distortion the optimization is over all conditional distributions $p_{\hat{X}|X}$, i.e. given the original signal. In the perception-distortion case, the estimator has access only to the degraded signal $Y$, so that the optimization is over the conditional distributions $p_{\hat{X}|Y}$, which is more restrictive. In other words, the perception-distortion tradeoff depends on the degradation $p_{Y|X}$, and not only on the signal’s distribution $p_X$ (see Example 1). Second, in rate-distortion the rate is quantified by the mutual information $I(X; \hat{X})$, which depends on the joint distribution $p_{X,\hat{X}}$. In our case, perception is quantified by the similarity between $p_X$ and $p_{\hat{X}}$, which does not depend on their joint distribution. Lastly, mutual information is inherently convex, while the convexity of the perception-distortion curve is guaranteed only when $d(\cdot, \cdot)$ is convex.

While the two tradeoffs are different, it is important to note that perceptual quality does play a role in lossy compression, as evident from the success of recent GAN-based compression schemes [39], [40], [54]. Theoretically, its effect can be studied through the rate-distortion-perception function [55], [56], [57], which is an extension of the rate-distortion function (16) and the perception-distortion function (11), characterizing the triple tradeoff between rate, distortion, and perceptual quality.

5 TRAVERSING THE TRADEOFF WITH A GAN

There exists a systematic way to design estimators that approach the perception-distortion curve: using GANs. Specifically, motivated by [3], [6], [7], [11], [20], [21], restoration problems can be approached by modifying the loss of the generator of a GAN to be

$$\ell_{\text{gen}} = \ell_{\text{distortion}} + \lambda\, \ell_{\text{adv}}, \tag{17}$$

where $\ell_{\text{distortion}}$ is the distortion between the original and reconstructed images, and $\ell_{\text{adv}}$ is the standard GAN adversarial loss. It is well known that $\ell_{\text{adv}}$ is proportional to some divergence $d(p_X, p_{\hat{X}})$ between the generator and data distributions [18], [43], [44] (the type of divergence depends on the loss). Thus, (17) in fact approximates the objective

$$\ell_{\text{gen}} \approx \mathbb{E}[\Delta(x, \hat{x})] + \lambda\, d(p_X, p_{\hat{X}}). \tag{18}$$

Viewing $\lambda$ as a Lagrange multiplier, it is clear that minimizing $\ell_{\text{gen}}$ is equivalent to minimizing (11) for some $D$. Varying $\lambda$ corresponds to varying $D$, thus producing estimators along the perception-distortion function.
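The effect of sweeping $\lambda$ in (18) can be illustrated without training a GAN. The sketch below is a toy stand-in, not the WGAN experiment of this section. It assumes $X \sim \mathcal{N}(0,1)$, $Y = X + N$ with $N \sim \mathcal{N}(0,\sigma^2)$, restricts the estimator to the linear family $\hat{x} = a y$, and takes $d$ to be the 2-Wasserstein distance between zero-mean Gaussians, which equals the absolute difference of their standard deviations. Under these assumptions the minimizer of (18) has a closed form. The function name `sweep_lambda` is hypothetical.

```python
import math


def sweep_lambda(sigma, lambdas):
    """Minimize MSE(a) + lam * d(a) over linear estimators x_hat = a * y.

    Assumed toy model: X ~ N(0, 1), Y = X + N(0, sigma^2).
      MSE(a) = 1 - 2a + a^2 * s^2, with s = sqrt(1 + sigma^2)
      d(a)   = |a * s - 1|  (2-Wasserstein distance between N(0, a^2 s^2) and N(0, 1))
    Returns a list of (lam, a, distortion, perception) tuples.
    """
    s = math.sqrt(1.0 + sigma ** 2)
    points = []
    for lam in lambdas:
        # Stationary point of the objective on a <= 1/s, clipped at a = 1/s,
        # beyond which both terms of the objective increase.
        a = min((2.0 + lam * s) / (2.0 * s * s), 1.0 / s)
        distortion = 1.0 - 2.0 * a + (a * s) ** 2
        perception = abs(a * s - 1.0)
        points.append((lam, a, distortion, perception))
    return points


if __name__ == "__main__":
    for lam, a, dist, perc in sweep_lambda(1.0, [0.0, 0.2, 0.4, 0.6, 1.0]):
        print(f"lambda={lam:.1f}  a={a:.3f}  distortion={dist:.3f}  perception={perc:.3f}")
```

With $\lambda = 0$ the sweep returns the MMSE estimator, and as $\lambda$ grows the distortion rises while the divergence falls to zero, tracing a monotone curve of the kind described above.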

Let us use this approach to explore the perception-distortion tradeoff for the digit denoising example of Fig. 4 with $\sigma = 3$. We train a Wasserstein GAN (WGAN) based denoiser [43], [58] with an MSE distortion loss $\ell_{\text{distortion}}$. Here, $\ell_{\text{adv}}$ is proportional to the Wasserstein distance $d_W(p_X, p_{\hat{X}})$ between the generator and data distributions. The WGAN has the valuable property that its discriminator (critic) loss is an accurate estimate (up to a constant factor) of $d_W(p_X, p_{\hat{X}})$ [43]. This allows us to easily compute the perceptual quality index of the trained denoiser. We obtain a set of estimators with several values of $\lambda \in [0, 0.3]$. For each denoiser, we evaluate the perceptual quality by the final discriminator loss. As seen in Fig. 7, the curve connecting the estimators on the perception-distortion plane is monotonically decreasing. Moreover, it is associated with estimates that gradually transition from blurry and accurate to sharp and inaccurate. This curve obviously does not coincide with the analytic bound (11) (illustrated by a dashed line). However, it seems to be adjacent to it. This is indicated by the fact that the left-most point of the WGAN curve is very close to the left-most point of the theoretical bound, which corresponds to the MMSE estimator. See Appendix H for the WGAN training details and architecture.

[Figure 7: estimators plotted on the perception-distortion plane, with denoised examples for the points labeled A, B, C and D.]

Fig. 7. Image denoising utilizing a GAN. A Wasserstein GAN was trained to denoise the images of the experiment in Fig. 4. The generator loss $\ell_{\text{gen}} = \ell_{\text{MSE}} + \lambda\, \ell_{\text{adv}}$ consists of a perceptual quality (adversarial) loss and a distortion (MSE) loss, where $\lambda$ controls the trade-off between the two. For each $\lambda \in [0, 0.3]$, the graph depicts the distortion (MSE) and perceptual quality (Wasserstein distance between $p_X$ and $p_{\hat{X}}$). The curve connecting the estimators is a good approximation to the theoretical perception-distortion tradeoff (illustrated by a dashed line).

Besides the MMSE estimator, Figure 7 also includes the MAP estimator, the random draw estimator $\hat{x} \sim p_X$ (which ignores the noisy image $y$), and the conditional draw estimator of (15). The perceptual quality of these estimators is evaluated, as above, by the final loss of the WGAN discriminator [43], trained (without a generator) to distinguish between the estimators’ outputs and images from the dataset. Note that the denoising WGAN estimator (D) achieves the same distortion as the MAP estimator, but with far better perceptual quality. Furthermore, it achieves nearly the same perceptual quality as the random draw estimator, but with a significantly lower distortion.
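The perceptual index above is a Wasserstein distance estimated by a trained critic. For intuition about the quantity being estimated, the sketch below computes the empirical Wasserstein-1 distance between two equal-size one-dimensional samples, where it reduces to the mean absolute difference of the sorted samples. This is a simplified stand-in under that one-dimensional assumption, not the critic-based procedure used for Fig. 7, and the function name is hypothetical.

```python
def wasserstein1_1d(sample_a, sample_b):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples.

    In one dimension the optimal transport plan matches sorted values,
    so the distance is the mean absolute difference of the order statistics.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes samples of equal size")
    if not sample_a:
        raise ValueError("samples must be non-empty")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(u - v) for u, v in pairs) / len(sample_a)


if __name__ == "__main__":
    real = [0, 1, 0, 1]               # binary "pixels" from the data distribution
    blurry = [0.5, 0.5, 0.5, 0.5]     # averaged output, accurate on average but not binary
    sharp = [1, 1, 0, 0]              # binary output, possibly wrong per sample
    print(wasserstein1_1d(real, blurry))
    print(wasserstein1_1d(real, sharp))
```

The blurry sample is far from the data distribution even though each value is close to the data on average, while the sharp sample matches the distribution exactly. This mirrors the transition from blurry and accurate to sharp and inaccurate described for Fig. 7.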

6 PRACTICAL METHOD FOR EVALUATING ALGORITHMS

Certain applications may require low distortion (e.g. in medical imaging), while others may prefer superior perceptual quality. How should image restoration algorithms be evaluated, then?

Definition 3. We say that Algorithm A dominates Algorithm B if it has better perceptual quality and less distortion.

Note that if Algorithm A is better than B in only one of the two criteria, then neither A dominates B nor B dominates A. Therefore, among a group of algorithms, there may be a large subset which can be considered equally good.

Definition 4. We say that an algorithm is admissible among a group of algorithms, if it is not dominated by any other algorithm in the group.

[Figure 8: four algorithms, labeled A, B, C and D, plotted with the perceptual index on the vertical axis and distortion on the horizontal axis.]

Fig. 8. Dominance and admissibility. Algorithm A is dominated by Algorithm B, and is thus inadmissible. Algorithms B, C and D are all admissible, as they are not dominated by any algorithm.

As shown in Figure 8, these definitions have very simple interpretations when plotting algorithms on the perception-distortion plane. In particular, the admissible algorithms in the group are those which lie closest to the perception-distortion bound.
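Definitions 3 and 4 translate directly into a small dominance check. The sketch below is illustrative: it represents each algorithm by a hypothetical pair (distortion, perceptual index), with lower values better on both axes, and the coordinates are made up to mimic the layout of Fig. 8 rather than taken from any experiment.

```python
def dominates(a, b):
    """Definition 3: a dominates b if it is strictly better on both criteria.

    Each point is (distortion, perceptual_index); lower is better for both.
    """
    return a[0] < b[0] and a[1] < b[1]


def admissible(algorithms):
    """Definition 4: names of algorithms not dominated by any other in the group."""
    return sorted(
        name
        for name, point in algorithms.items()
        if not any(
            dominates(other_point, point)
            for other_name, other_point in algorithms.items()
            if other_name != name
        )
    )


if __name__ == "__main__":
    # Hypothetical coordinates arranged like Fig. 8.
    group = {"A": (2.5, 3.5), "B": (2.0, 3.0), "C": (3.0, 2.0), "D": (4.0, 1.0)}
    print(admissible(group))  # ['B', 'C', 'D']
```

Because dominance requires improvement in both criteria, B, C and D remain admissible even though each is worse than some other algorithm in one of the two.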

As discussed in Sec. 2, distortion is measured by full-reference (FR) metrics, e.g. [2], [4], [23], [24], [25], [26], [27]. The choice of the FR metric depends on the type of similarities we want to measure (per-pixel, semantic, etc.). Perceptual quality, on the other hand, is ideally quantified by collecting human opinion scores, which is time consuming and costly [29], [35]. Instead, the divergence $d(p_X, p_{\hat{X}})$ can be computed, for instance by training a discriminator net (see Sec. 5). However, this requires many training images and is thus also time consuming. A practical alternative is to utilize no-reference (NR) metrics, e.g. [29], [30], [35], [36], [59], [60], [61], which quantify the perceptual quality of an image without a corresponding original image. In scenarios where NR metrics are highly correlated with human mean-opinion-scores (e.g. 4× super-resolution [61]), they can be used as a fast and simple method for approximating the perceptual quality of an algorithm⁴.

We use this approach to evaluate 16 SR algorithms in a 4× magnification task, by plotting them on the perception-distortion plane (Fig. 9). We measure perceptual quality using the NR metric NIQE [36], which was shown to correlate well with human opinion scores in a recent SR challenge [64] (see Appendix I for experiments with the NR metrics BRISQUE [30], BLIINDS-II [35] and the recent NR metric by Ma et al. [61]). We measure distortion by the five common FR metrics RMSE, SSIM [2], MS-SSIM [23], IFC [24] and VIF [25], and additionally by the recent VGG$_{2,2}$ metric (the distance in the feature space of a VGG net) [3], [4]. To conform to previous evaluations, we compute all metrics on the y-channel after discarding a 4-pixel border (except for VGG$_{2,2}$, which is computed on RGB images). Comparisons on color images can be found in Appendix I. The algorithms are evaluated on the BSD100 dataset [65]. The evaluated algorithms include: A+ [66], SRCNN [67], SelfEx [68], VDSR [69], Johnson et al. [4], LapSRN [70], Bae et al. [71] (“primary” variant), EDSR [72], SRResNet variants which optimize MSE and VGG$_{2,2}$ [3], SRGAN variants which optimize MSE, VGG$_{2,2}$, and VGG$_{5,4}$, in addition to an adversarial loss [3], ENet [6] (“PAT” variant), Deng [73] ($\gamma = 0.55$), and Mechrez et al. [74].

4. In scenarios where NR metrics are inaccurate (e.g. blind deblurring with large blurs [62], [63]), the perceptual metric should be human-opinion-scores or the loss of a discriminator trained to distinguish the algorithms’ outputs from natural images.
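The RMSE protocol described above (y-channel, 4-pixel border discarded) can be sketched as follows. The text does not specify the color conversion, so the sketch assumes the ITU-R BT.601 luma transform on RGB values in [0, 1] that is commonly used in super-resolution benchmarks. Images are plain nested lists of RGB tuples, and all function names are hypothetical.

```python
import math


def luma(rgb):
    """BT.601 luma in the range [16, 235] from RGB in [0, 1] (assumed convention)."""
    r, g, b = rgb
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b


def rmse_y(reference, estimate, border=4):
    """RMSE between the luma channels of two images after cropping a border.

    Both images are lists of rows, each row a list of (r, g, b) tuples.
    """
    height, width = len(reference), len(reference[0])
    if height <= 2 * border or width <= 2 * border:
        raise ValueError("image too small for the requested border")
    total = 0.0
    count = 0
    for i in range(border, height - border):
        for j in range(border, width - border):
            diff = luma(reference[i][j]) - luma(estimate[i][j])
            total += diff * diff
            count += 1
    return math.sqrt(total / count)


def psnr_from_rmse(rmse, peak=255.0):
    """PSNR in dB for a given RMSE, assuming an 8-bit peak value."""
    return 20.0 * math.log10(peak / rmse)


if __name__ == "__main__":
    ref = [[(0.5, 0.5, 0.5)] * 10 for _ in range(10)]
    est = [[(0.6, 0.6, 0.6)] * 10 for _ in range(10)]
    value = rmse_y(ref, est)
    print(value, psnr_from_rmse(value))
```

Cropping the border keeps boundary effects of the upscaling method out of the comparison, so changes confined to the outer four pixels leave the score unchanged.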

[Figure 9: scatter plots of the NIQE perceptual index against each distortion measure.]

Fig. 9. Perception-distortion evaluation of SR algorithms. We plot 16 algorithms on the perception-distortion plane. Perception is measured by the NR metric NIQE [36]. Distortion is measured by the common full-reference metrics RMSE, SSIM, MS-SSIM, IFC, VIF and VGG$_{2,2}$. In all plots, the lower left corner is blank, revealing an unattainable region in the perception-distortion plane. In proximity of the unattainable region, an improvement in perceptual quality comes at the expense of higher distortion.

Fig. 10. Visual comparison of algorithms closest to the perception-distortion bound. The algorithms are ordered from low to high distortion (as evaluated by RMSE, MS-SSIM, IFC, VIF). Notice the co-occurring increase in perceptual quality.

Interestingly, the same pattern is observed in all plots: (i) The lower left corner is blank, revealing an unattainable region in the perception-distortion plane. (ii) In proximity of this blank region, NR and FR metrics are anti-correlated, indicating a tradeoff between perception and distortion. Notice that the tradeoff exists even for the IFC, VIF and VGG$_{2,2}$ measures, which are considered to capture visual quality better than MSE and SSIM.

Figure 10 depicts the outputs of several algorithms lying closest to the perception-distortion bound in the IFC graph in Fig. 9. While the images are ordered from low to high distortion (according to IFC), their perceptual quality clearly improves from left to right.

Both FR and NR measures are commonly validated by calculating their correlation with human opinion scores, based on the assumption that both should be correlated with perceptual quality. However, as Fig. 11 shows, while FR measures can be well-correlated with perceptual quality when distant from the unattainable region, this is clearly not the case when approaching the perception-distortion bound. In particular, all tested FR methods are inconsistent with human opinion scores which found the SRGAN to be superb in terms of perceptual quality [3], while NR methods successfully determine this. We conclude that image restoration algorithms should always be evaluated by a pair of NR and FR metrics, constituting a reliable, reproducible and simple method for comparison, which accounts for both perceptual quality and distortion. This evaluation method was demonstrated and validated by a human opinion study in the 2018 PIRM super-resolution challenge [64].

Up until 2016, SR algorithms occupied only the upper-left section of the perception-distortion plane. Nowadays, emerging techniques are exploring new regions in this plane. The SRGAN, ENet, Deng, Johnson et al. and Mechrez et al. methods are the first (to our knowledge) to populate the high perceptual quality region. In the near future we will most likely witness continued efforts to approach the perception-distortion bound, not only in the low-distortion region, but throughout the entire plane.

[Figure 11: two panels, titled "Until 2017: IFC well-correlated with perceptual quality" and "After 2017: IFC anti-correlated with perceptual quality".]

Fig. 11. Correlation between distortion and perceptual quality. In proximity of the perception-distortion bound, distortion and perceptual quality are anti-correlated. However, correlation is possible at distance from the bound.

7 CONCLUSION

We proved and demonstrated the counter-intuitive phenomenon that distortion and perceptual quality are at odds with each other. Namely, the lower the distortion of an algorithm, the more its distribution must deviate from the statistics of natural scenes. We showed empirically that this tradeoff exists for many popular distortion measures, including those considered to be well-correlated with human perception. Therefore, any distortion measure alone is unsuitable for assessing image restoration methods. Our novel methodology utilizes a pair of NR and FR metrics to place each algorithm on the perception-distortion plane, facilitating a more informative comparison of image restoration methods.

Acknowledgements This research was supported by the Israel Science Foundation (grant no. 852/17), and by the Technion Ollendorff Minerva Center.

REFERENCES

[1] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6228–6237, doi: 10.1109/CVPR.2018.00652.

[2] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.

[3] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4681–4690.

[4] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision (ECCV), 2016, pp. 694–711.

[5] R. Dahl, M. Norouzi, and J. Shlens, “Pixel recursive super resolution,” in International Conference on Computer Vision (ICCV), 2017, pp. 5439–5448.

[6] M. S. M. Sajjadi, B. Scholkopf, and M. Hirsch, “EnhanceNet: Single image super-resolution through automated texture synthesis,” in International Conference on Computer Vision (ICCV), 2017, pp. 4491–4500.

[7] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do, “Semantic image inpainting with deep generative models,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5485–5493.

[8] C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: A benchmark,” in European Conference on Computer Vision (ECCV), 2014, pp. 372–386.

[9] T. R. Shaham, T. Dekel, and T. Michaeli, “SinGAN: Learning a generative model from a single natural image,” in International Conference on Computer Vision (ICCV), 2019.

[10] Z. Wang and A. C. Bovik, “Mean squared error: Love it or leave it? A new look at signal fidelity measures,” IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, 2009.

[11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1125–1134.

[12] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in European Conference on Computer Vision (ECCV), 2016, pp. 649–666.

[13] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 2234–2242.

[14] E. L. Denton, S. Chintala, R. Fergus et al., “Deep generative image models using a Laplacian pyramid of adversarial networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1486–1494.

[15] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification,” ACM Transactions on Graphics (TOG), vol. 35, no. 4, p. 110, 2016.

[16] R. Zhang, J.-Y. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. Efros, “Real-time user-guided image colorization with learned deep priors,” ACM Transactions on Graphics (TOG), vol. 9, no. 4, 2017.

[17] S. Guadarrama, R. Dahl, D. Bieber, M. Norouzi, J. Shlens, and K. Murphy, “PixColor: Pixel recursive colorization,” British Machine Vision Conference (BMVC), 2017.

[18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.

[19] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012.

[20] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2536–2544.

[21] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” in International Conference on Machine Learning (ICML), 2017, pp. 2922–2930.

[22] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in International Conference on Computer Vision (ICCV), 2017, pp. 2223–2232.

[23] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Conference on Signals, Systems and Computers, vol. 2, 2003, pp. 1398–1402.

[24] H. R. Sheikh, A. C. Bovik, and G. De Veciana, “An information fidelity criterion for image quality assessment using natural scene statistics,” IEEE Transactions on Image Processing, vol. 14, no. 12, pp. 2117–2128, 2005.

[25] H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, 2006.

[26] D. M. Chandler and S. S. Hemami, “VSNR: A wavelet-based visual signal-to-noise ratio for natural images,” IEEE Transactions on Image Processing, vol. 16, no. 9, pp. 2284–2298, 2007.

[27] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, 2011.

[28] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 586–595.

[29] A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” IEEE Transactions on Image Processing, vol. 20, no. 12, pp. 3350–3364, 2011.


[30] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012.

[31] F. Nielsen, “Hypothesis testing, information divergence and computational geometry,” in Geometric Science of Information, 2013, pp. 241–248.

[32] Z. Wang and E. P. Simoncelli, “Reduced-reference image quality assessment using a wavelet-domain natural image statistic model,” in Human Vision and Electronic Imaging, vol. 5666, 2005, pp. 149–159.

[33] Z. Wang, G. Wu, H. R. Sheikh, E. P. Simoncelli, E.-H. Yang, and A. C. Bovik, “Quality-aware images,” IEEE Transactions on Image Processing, vol. 15, no. 6, pp. 1680–1689, 2006.

[34] Q. Li and Z. Wang, “Reduced-reference image quality assessment using divisive normalization-based image representation,” IEEE Journal of Selected Topics in Signal Processing, vol. 3, no. 2, pp. 202–211, 2009.

[35] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, 2012.

[36] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a completely blind image quality analyzer,” IEEE Signal Processing Letters, vol. 20, no. 3, pp. 209–212, 2013.

[37] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “Esrgan: Enhanced super-resolution generative adversarial networks,” in Proceedings of the European Conference on Computer Vision (ECCV) workshops, 2018.

[38] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Generative image inpainting with contextual attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5505–5514.

[39] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, “Generative adversarial networks for extreme learned image compression,” arXiv preprint arXiv:1804.02958, 2018.

[40] M. Tschannen, E. Agustsson, and M. Lucic, “Deep generative models for distribution-preserving lossy compression,” in Proc. Conference on Neural Information Processing Systems (NeurIPS), 2018.

[41] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, “Deblurgan: Blind motion deblurring using conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8183–8192.

[42] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 700–708.

[43] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning (ICML), 2017, pp. 214–223.

[44] S. Nowozin, B. Cseke, and R. Tomioka, “f-gan: Training generative neural samplers using variational divergence minimization,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 271–279.

[45] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[46] M. Nikolova, “Model distortions in Bayesian MAP reconstruction,” Inverse Problems and Imaging, vol. 1, no. 2, p. 399, 2007.

[47] I. Csiszar, P. C. Shields et al., “Information theory and statistics: A tutorial,” Foundations and Trends® in Communications and Information Theory, vol. 1, no. 4, pp. 417–528, 2004.

[48] T. Van Erven and P. Harremos, “Renyi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.

[49] A. Levin and B. Nadler, “Natural image denoising: Optimality and inherent bounds,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2833–2840.

[50] A. Levin, B. Nadler, F. Durand, and W. T. Freeman, “Patch complexity, finite pixel correlations and optimal denoising,” in European Conference on Computer Vision (ECCV), 2012, pp. 73–86.

[51] P. Chatterjee and P. Milanfar, “Is denoising dead?” IEEE Transactions on Image Processing, vol. 19, no. 4, pp. 895–911, 2010.

[52] P. Chatterjee and P. Milanfar, “Practical bounds on image denoising: From estimation to information,” IEEE Transactions on Image Processing (TIP), vol. 20, no. 5, pp. 1221–1233, 2011.

[53] S. Baker and T. Kanade, “Limits on super-resolution and how to break them,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 24, no. 9, pp. 1167–1183, 2002.

[54] S. Santurkar, D. Budden, and N. Shavit, “Generative compression,” in Proc. Picture Coding Symposium (PCS), 2018.

[55] Y. Blau and T. Michaeli, “Rethinking lossy compression: The rate-distortion-perception tradeoff,” in International Conference on Machine Learning (ICML), 2019.

[56] R. Matsumoto, “Introducing the perception-distortion tradeoff into the rate-distortion theory of general information sources,” IEICE Communications Express, vol. 7, no. 11, pp. 427–431, 2018.

[57] R. Matsumoto, “Rate-distortion-perception tradeoff of variable-length source coding for general information sources,” IEICE Communications Express, vol. 8, no. 2, pp. 38–42, 2019.

[58] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 5769–5779.

[59] P. Ye, J. Kumar, L. Kang, and D. Doermann, “Unsupervised feature learning framework for no-reference image quality assessment,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1098–1105.

[60] L. Kang, P. Ye, Y. Li, and D. Doermann, “Convolutional neural networks for no-reference image quality assessment,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1733–1740.

[61] C. Ma, C.-Y. Yang, X. Yang, and M.-H. Yang, “Learning a no-reference quality metric for single-image super-resolution,” Computer Vision and Image Understanding, vol. 158, pp. 1–16, 2017.

[62] W.-S. Lai, J.-B. Huang, Z. Hu, N. Ahuja, and M.-H. Yang, “A comparative study for single image blind deblurring,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1701–1709.

[63] Y. Liu, J. Wang, S. Cho, A. Finkelstein, and S. Rusinkiewicz, “A no-reference metric for evaluating the quality of motion deblurring,” ACM Transactions on Graphics (TOG), vol. 32, no. 6, pp. 175–1, 2013.

[64] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor, “The 2018 pirm challenge on perceptual image super-resolution,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[65] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in International Conference on Computer Vision (ICCV), vol. 2, 2001, pp. 416–423.

[66] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Asian Conference on Computer Vision, 2014, pp. 111–126.

[67] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European Conference on Computer Vision (ECCV), 2014, pp. 184–199.

[68] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5197–5206.

[69] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.

[70] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep Laplacian pyramid networks for fast and accurate super-resolution,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 624–632.

[71] W. Bae, J. Yoo, and J. Chul Ye, “Beyond deep residual learning for image restoration: Persistent homology-guided manifold simplification,” in Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017, pp. 145–153.

[72] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.

[73] X. Deng, “Enhancing image quality via style transfer for single image super-resolution,” IEEE Signal Processing Letters, 2018.

[74] R. Mechrez, I. Talmi, F. Shama, and L. Zelnik-Manor, “Maintaining natural image statistics with the contextual loss,” in Asian Conference on Computer Vision. Springer, 2018, pp. 427–443.

[75] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2014.


APPENDIX A
REAL-VS.-FAKE USER STUDIES AND HYPOTHESIS TESTING

We assume the setting where an observer is shown a real image (a draw from $p_X$) or an algorithm output (a draw from $p_{\hat{X}}$), with a prior probability of 0.5 each. The task is to identify which distribution the image was drawn from ($p_X$ or $p_{\hat{X}}$) with maximal probability of success. This is the setting of the Bayesian hypothesis testing problem, for which the maximum a-posteriori (MAP) decision rule minimizes the probability of error (see Section 1 in [31]). When there are two possible hypotheses with equal probabilities (as in our setting), the relation between the probability of error and the total-variation distance between $p_X$ and $p_{\hat{X}}$ in (1) can be easily derived (see Section 2 in [31]).
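For discrete distributions this relation can be verified directly. The sketch below assumes the convention $d_{TV}(p, q) = \frac{1}{2}\sum_i |p_i - q_i|$ and equal priors, in which case the MAP observer succeeds with probability $\frac{1}{2}\sum_i \max(p_i, q_i) = \frac{1}{2} + \frac{1}{2} d_{TV}(p, q)$. The function names and the example distributions are hypothetical.

```python
def total_variation(p, q):
    """Total-variation distance between two discrete distributions,
    using the convention d_TV = 0.5 * sum |p_i - q_i| (assumed here)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def map_success_probability(p, q):
    """Success probability of the MAP observer with equal priors.

    For each outcome the observer guesses the distribution with larger mass,
    so the probability of a correct guess is 0.5 * sum max(p_i, q_i).
    """
    return 0.5 * sum(max(pi, qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    p_real = [0.5, 0.3, 0.2]   # hypothetical distribution of real images
    p_fake = [0.2, 0.3, 0.5]   # hypothetical distribution of algorithm outputs
    tv = total_variation(p_real, p_fake)
    print(tv, map_success_probability(p_real, p_fake), 0.5 + 0.5 * tv)
```

Identical distributions give a success probability of 0.5 (pure guessing), and distributions with disjoint supports give 1.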

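APPENDIX B
THE MMSE AND MAP EXAMPLES OF SEC. 3

Sections 3.1 and 3.2 exemplify that the MSE and the 0−1 loss are not distribution preserving in the setting of estimating a discrete random variable (vector) $X$ from its noisy version $Y = X + N$, where $N \sim \mathcal{N}(0, \sigma^2 I)$ is independent of $X$. Since the conditional distribution of $Y$ given $X = x$ is $\mathcal{N}(x, \sigma^2 I)$, the MMSE estimator is given by

$$\begin{aligned}
\hat{x}_{\text{MMSE}}(y) &= \mathbb{E}[X \mid Y = y] \\
&= \sum_x x\, p(x|y) \\
&= \sum_x x\, \frac{p(y|x)\, p(x)}{\sum_{x'} p(y|x')\, p(x')} \\
&= \sum_x x\, \frac{\exp\!\left(-\frac{1}{2\sigma^2}\|y - x\|^2\right) p(x)}{\sum_{x'} \exp\!\left(-\frac{1}{2\sigma^2}\|y - x'\|^2\right) p(x')},
\end{aligned} \tag{19}$$

and the MAP estimator is given by

$$\begin{aligned}
\hat{x}_{\text{MAP}}(y) &= \arg\max_x\, p(x|y) \\
&= \arg\min_x\, -\log\big(p(y|x)\, p(x)\big) \\
&= \arg\min_x\, \frac{1}{2\sigma^2}\|y - x\|^2 - \log(p(x)).
\end{aligned} \tag{20}$$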
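Equations (19) and (20) can be evaluated directly for a scalar discrete prior. The sketch below is illustrative only: the three-point support and its probabilities are hypothetical placeholders rather than the values of (5), and the function names are not from the paper. The posterior weights are computed in the log domain so that large values of $|y|$ do not cause underflow.

```python
import math


def posterior_weights(y, support, prior, sigma):
    """Posterior p(x|y) over a discrete support for Y = X + N(0, sigma^2)."""
    logits = [
        -((y - x) ** 2) / (2.0 * sigma ** 2) + math.log(p)
        for x, p in zip(support, prior)
    ]
    top = max(logits)                       # subtract the maximum for stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def mmse_estimate(y, support, prior, sigma):
    """Posterior mean, as in (19)."""
    weights = posterior_weights(y, support, prior, sigma)
    return sum(x * w for x, w in zip(support, weights))


def map_estimate(y, support, prior, sigma):
    """Posterior mode, as in (20): minimize ||y - x||^2 / (2 sigma^2) - log p(x)."""
    costs = [
        ((y - x) ** 2) / (2.0 * sigma ** 2) - math.log(p)
        for x, p in zip(support, prior)
    ]
    return support[costs.index(min(costs))]


if __name__ == "__main__":
    support = [-1.0, 0.0, 1.0]   # hypothetical trinary support
    prior = [0.2, 0.6, 0.2]      # hypothetical prior, not the values of (5)
    for y in (0.0, 0.8, 3.0):
        print(y, mmse_estimate(y, support, prior, 1.0), map_estimate(y, support, prior, 1.0))
```

The MMSE output generally falls between the support points, so its distribution cannot match the discrete $p_X$, while the MAP output always lies on the support but with altered probabilities. Both behaviors are in line with the claim that neither criterion is distribution preserving.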

In the example of Fig. 4, $x$ is a 280 × 280 binary image comprising 28 × 28 blocks chosen uniformly at random from a finite database. Since the noise $N$ is i.i.d., each 28 × 28 block of $y$ can be denoised separately, both in the case of the MSE criterion and in the case of MAP. For each block, we have $p(x) = 1/59400$ for the non-blank images and $p(x) = 1/11$ for the blank image.

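In the trinary example (5), we calculate the distribution of the MMSE estimate (Fig. 3) by

$$p_{\hat{X}_{\text{MMSE}}}(\hat{x}) = p_Y\big(\hat{x}_{\text{MMSE}}^{-1}(\hat{x})\big) \left| \frac{d}{d\hat{x}}\, \hat{x}_{\text{MMSE}}^{-1}(\hat{x}) \right| \tag{21}$$

where the inverse of $\hat{x}_{\text{MMSE}}(y)$ (see (6)) and its derivative are calculated numerically, and $p_Y(y) = \sum_x p(y|x)\, p(x)$ with $p(y|x) \sim \mathcal{N}(x, 1)$ and $p(x)$ of (5).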

APPENDIX C
PROOF OF THEOREM 1

Let us first clarify what we mean by a non-invertible degradation and by a unique estimator.

Definition 5. We say that a degradation is not invertible if $p_{X|Y}(x|y) > 0$ for all $(x, y) \in S_x \times S_y$, where $S_x$ is a set with positive Lebesgue measure and $S_y$ satisfies $\mathbb{P}(Y \in S_y) > 0$.

Definition 6. We say that the optimal estimator is not unique if there exist two estimators, $p_{\hat{X}_1|Y}$ and $p_{\hat{X}_2|Y}$, that minimize the mean distortion (3) and differ from one another in the sense that

$$p_{\hat{X}_1|Y}(x|y) \neq p_{\hat{X}_2|Y}(x|y) \quad \forall (x, y) \in S_x \times S_y, \tag{22}$$

where $S_x$ is a set with positive Lebesgue measure and $S_y$ satisfies $\mathbb{P}(Y \in S_y) > 0$.

We start with the following observation.

Lemma 1. If the estimator $\hat{X}^*$ that minimizes the mean distortion (3) is unique and stably distribution preserving at $p_{X,Y}$, then it must be defined by $p_{\hat{X}^*|Y} = p_{X|Y}$.

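Proof. Since $X$ and $\hat{X}$ are independent given $Y$, the mean distortion can be written as

$$\begin{aligned}
\mathbb{E}[\Delta(X, \hat{X})] &= \iiint \Delta(x, \hat{x})\, p_{\hat{X}|Y}(\hat{x}|y)\, p_{X|Y}(x|y)\, p_Y(y)\, dx\, d\hat{x}\, dy \\
&= \int \left( \int f(\hat{x}, y)\, p_{\hat{X}|Y}(\hat{x}|y)\, d\hat{x} \right) p_Y(y)\, dy
\end{aligned} \tag{23}$$

where we defined

$$f(\hat{x}, y) = \int \Delta(x, \hat{x})\, p_{X|Y}(x|y)\, dx. \tag{24}$$

Therefore, the optimal $p_{\hat{X}|Y}$ is that which minimizes $\int f(\hat{x}, y)\, p_{\hat{X}|Y}(\hat{x}|y)\, d\hat{x}$ for each $y$. This shows that the optimal estimator depends only on the posterior distribution $p_{X|Y}$ and not on $p_Y$ (as $f(\hat{x}, y)$ does not depend on $p_Y$).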

Now assume that this optimal estimator satisfies pX∗ =pX for any perturbation of pX,Y . Consider a perturbed jointdistribution pX,Y = pX|Y pY having the same posteriorpX|Y as pX,Y , but a perturbed marginal

pY = αpY + (1− α)pε, (25)

where α ∈ (0, 1]. Since the posterior has not changed, theoptimal estimator pX∗|Y remains the same. Its marginal pX∗ ,however, is modified to

pX∗(x) =

∫pX∗|Y (x|y)pY (y)dy

= α

∫pX∗|Y (x|y)pY (y)dy

+ (1− α)

∫pX∗|Y (x|y)pε(y)dy

= αpX + (1− α)

∫pX∗|Y (x|y)pε(y)dy, (26)

12

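where we used the assumption that $p_{\hat{X}^*} = p_X$. Similarly, the distribution of $X$ has changed to

$$\tilde{p}_X(x) = \int p_{X|Y}(x|y)\, \tilde{p}_Y(y)\, dy = \alpha \int p_{X|Y}(x|y)\, p_Y(y)\, dy + (1 - \alpha) \int p_{X|Y}(x|y)\, p_\varepsilon(y)\, dy = \alpha p_X + (1 - \alpha) \int p_{X|Y}(x|y)\, p_\varepsilon(y)\, dy. \tag{27}$$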

Thus, equality between $\tilde{p}_{\hat{X}^*}$ and $\tilde{p}_X$ is kept only if

$$\int p_{\hat{X}^*|Y}(x|y)\, p_\varepsilon(y)\, dy = \int p_{X|Y}(x|y)\, p_\varepsilon(y)\, dy. \tag{28}$$

This equality can hold for every perturbation $p_\varepsilon$ only if $p_{\hat{X}^*|Y} = p_{X|Y}$, completing the proof.

We now prove the theorem. Let us assume to the contrary that $p_{X,Y}$ defines a non-invertible degradation, and that the distortion function $\Delta(\cdot, \cdot)$ is stably distribution preserving at $p_{X,Y}$ and is minimized by a unique estimator $\hat{X}^*$.

Since the degradation is non-invertible, $p_{X|Y}(x|y) > 0$ for all $(x, y) \in \mathcal{S}_x \times \mathcal{S}_y$, where $\mathcal{S}_x$ is some set with positive Lebesgue measure and $\mathcal{S}_y$ is a set that satisfies $P(Y \in \mathcal{S}_y) > 0$ (Definition 5). Furthermore, from Lemma 1, we know that $p_{\hat{X}^*|Y} = p_{X|Y}$ in this setting. Therefore, we also have that $p_{\hat{X}^*|Y}(x|y) > 0$ for all $(x, y) \in \mathcal{S}_x \times \mathcal{S}_y$. Now, since $\hat{X}^*$ is an optimal estimator, $p_{\hat{X}^*|Y}$ must minimize $\int f(\hat{x}, y)\, p_{\hat{X}|Y}(\hat{x}|y)\, d\hat{x}$ for each $y$ (see proof of Lemma 1). This means that for any $y$, the conditional $p_{\hat{X}^*|Y}(x|y)$ must assign positive probability only to $x$ in the set of minima $\mathcal{S}_{\min}(y) = \arg\min_{\hat{x}} f(\hat{x}, y)$. We conclude that $\mathcal{S}_x \subseteq \mathcal{S}_{\min}(y)$ for every $y \in \mathcal{S}_y$. But this implies that any other estimator that assigns zero probability to $x \notin \mathcal{S}_x$ for every $y \in \mathcal{S}_y$ is also optimal. Thus, by Definition 6, the optimal estimator is not unique, contradicting our assumption.

APPENDIX D
PROOF OF THEOREM 2

As discussed in Appendix C, for an optimal estimator, $p_{\hat{X}|Y}(x|y)$ must assign positive probability only to $x$ in the set of minima $\mathcal{S}_{\min}(y) = \arg\min_{\hat{x}} f(\hat{x}, y)$. This implies that if we have two optimal estimators $p_{\hat{X}_1|Y}(x|y)$ and $p_{\hat{X}_2|Y}(x|y)$, then for each $y$, they can differ only on $x \in \mathcal{S}_{\min}(y)$. Thus, if the optimal estimator is not unique, then from Definition 6 we must have that $\mathcal{S}_x \subseteq \mathcal{S}_{\min}(y)$ for every $y \in \mathcal{S}_y$. Consequently, any other estimator that assigns zero probability to $x \notin \mathcal{S}_x$ for every $y \in \mathcal{S}_y$ is also optimal. All that remains to show is that there exists at least one such optimal estimator that satisfies $p_{\hat{X}} \neq p_X$ (i.e. is not distribution preserving). To this end, let us construct two optimal estimators with different marginal distributions, $p_{\hat{X}_a} \neq p_{\hat{X}_b}$, which will show that at least one of them must satisfy $p_{\hat{X}} \neq p_X$.

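Formally, let $\mathcal{S}_x^a, \mathcal{S}_x^b$ be non-empty disjoint sets such that $\mathcal{S}_x^a \cup \mathcal{S}_x^b = \mathcal{S}_x$. Now, let $p_{\hat{X}_a|Y}$ and $p_{\hat{X}_b|Y}$ be two optimal estimators that satisfy $p_{\hat{X}_a|Y}(x|y) = p_{\hat{X}_b|Y}(x|y)$ for every $(x, y) \notin \mathcal{S}_x \times \mathcal{S}_y$, but also

$$p_{\hat{X}_a|Y}(x|y) = 0, \quad \forall (x, y) \in \mathcal{S}_x^a \times \mathcal{S}_y \tag{29}$$

$$p_{\hat{X}_a|Y}(x|y) > 0, \quad \forall (x, y) \in \mathcal{S}_x^b \times \mathcal{S}_y \tag{30}$$

and

$$p_{\hat{X}_b|Y}(x|y) > 0, \quad \forall (x, y) \in \mathcal{S}_x^a \times \mathcal{S}_y \tag{31}$$

$$p_{\hat{X}_b|Y}(x|y) = 0, \quad \forall (x, y) \in \mathcal{S}_x^b \times \mathcal{S}_y. \tag{32}$$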

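To see that the marginals $p_{\hat{X}_a}, p_{\hat{X}_b}$ are different, let us examine the probability of each estimator to take values in $\mathcal{S}_x^a$. We have that

$$\begin{aligned}
P(\hat{X}_a \in \mathcal{S}_x^a) &= P(\hat{X}_a \in \mathcal{S}_x^a \,|\, Y \in \mathcal{S}_y)\, P(Y \in \mathcal{S}_y) + P(\hat{X}_a \in \mathcal{S}_x^a \,|\, Y \notin \mathcal{S}_y)\, P(Y \notin \mathcal{S}_y) \\
&< P(\hat{X}_b \in \mathcal{S}_x^a \,|\, Y \in \mathcal{S}_y)\, P(Y \in \mathcal{S}_y) + P(\hat{X}_b \in \mathcal{S}_x^a \,|\, Y \notin \mathcal{S}_y)\, P(Y \notin \mathcal{S}_y) \\
&= P(\hat{X}_b \in \mathcal{S}_x^a),
\end{aligned} \tag{33}$$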

thus $p_{\hat{X}_a} \neq p_{\hat{X}_b}$. Therefore, if there is more than one estimator that minimizes the mean distortion, then at least one such estimator satisfies $p_{\hat{X}} \neq p_X$.

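APPENDIX E
DERIVATION OF EXAMPLE 1

Since $\hat{X} = aY = a(X + N)$, it is a zero-mean Gaussian random variable. Now, the Kullback-Leibler divergence between two zero-mean normal distributions is given by

$$d_{\text{KL}}(p_X \| p_{\hat{X}}) = \ln\left(\frac{\sigma_{\hat{X}}}{\sigma_X}\right) + \frac{\sigma_X^2}{2\sigma_{\hat{X}}^2} - \frac{1}{2}, \tag{34}$$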

and the MSE between $X$ and $\hat{X}$ is given by

$$\text{MSE}(X, \hat{X}) = \mathbb{E}[(X - \hat{X})^2] = \sigma_X^2 - 2\sigma_{X\hat{X}} + \sigma_{\hat{X}}^2. \tag{35}$$

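Substituting $\hat{X} = aY$ and $\sigma_X^2 = 1$, we obtain that $\sigma_{\hat{X}} = |a|\sqrt{1 + \sigma_N^2}$ and $\sigma_{X\hat{X}} = a$, so that

$$d_{\text{KL}}(a) = \ln\left(|a|\sqrt{1 + \sigma_N^2}\right) + \frac{1}{2a^2(1 + \sigma_N^2)} - \frac{1}{2}, \tag{36}$$

$$\text{MSE}(a) = 1 + a^2(1 + \sigma_N^2) - 2a, \tag{37}$$

and

$$P(D) = \min_a\, d_{\text{KL}}(a) \quad \text{s.t.} \quad \text{MSE}(a) \le D. \tag{38}$$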

Notice that $d_{\text{KL}}(a)$ is symmetric in $a$, and $\text{MSE}(|a|) \le \text{MSE}(a)$ (see Fig. 12). Thus, for any negative $a$, there always exists a positive $a$ for which $d_{\text{KL}}$ is the same and the MSE is not larger. Therefore, without loss of generality, we focus on the range $a \ge 0$.

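For $D < D_{\min} = \frac{\sigma_N^2}{1 + \sigma_N^2}$ the constraint set $\text{MSE}(a) \le D$ is empty, and there is no solution to (38). For $D \ge D_{\min}$, the constraint is satisfied for $a_- \le a \le a_+$, where

$$a_\pm(D) = \frac{1}{1 + \sigma_N^2}\left(1 \pm \sqrt{D(1 + \sigma_N^2) - \sigma_N^2}\right). \tag{39}$$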

For $D = D_{\min}$, the optimal (and only possible) $a$ is

$$a = a_+(D_{\min}) = a_-(D_{\min}) = \frac{1}{1 + \sigma_N^2}. \tag{40}$$


Fig. 12. Plots of (36) and (37). $D$ defines the range $(a_-, a_+)$ of $a$ values complying with the MSE constraint (marked in red). The objective $d_{\text{KL}}$ is minimized over this range of possible $a$ values.

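[Figure 13: perceptual index (vertical axis) versus distortion (horizontal axis). The estimators $\hat{X}_1$, $\hat{X}_2$ and $\hat{X}_\lambda$ are marked at distortions $D_1$, $D_2$ and $D_\lambda$, together with the values $P(D_1)$, $P(D_2)$, $P(D_\lambda)$, $\lambda P(D_1) + (1 - \lambda)P(D_2)$ and $P(\lambda D_1 + (1 - \lambda)D_2)$.]

Fig. 13. Illustration of the proof of Theorem 3.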

For $D > D_{\min}$, $a_+$ monotonically increases with $D$, broadening the constraint set. The objective $d_{\text{KL}}(a)$ monotonically decreases with $a$ in the range $a \in (0, 1/\sqrt{1 + \sigma_N^2})$ (see Fig. 12 and the mathematical justification below). Thus, for $D_{\min} < D \le D_0$, the optimal $a$ is always the largest possible $a$, which is $a = a_+(D)$, where $D_0$ is defined by $a_+(D_0) = 1/\sqrt{1 + \sigma_N^2}$ (see Fig. 12). For $D > D_0$, the optimal $a$ is $a = 1/\sqrt{1 + \sigma_N^2}$, which achieves the global minimum $d_{\text{KL}}(a) = 0$. The closed form solution is therefore given by

$$P(D) = \begin{cases} d_{\text{KL}}(a_+(D)) & D_{\min} \le D < D_0 \\ 0 & D_0 \le D \end{cases} \tag{41}$$

To justify the monotonicity of $d_{\text{KL}}(a)$ in the range $a \in (0, 1/\sqrt{1 + \sigma_N^2})$, notice that for $a > 0$,

$$\frac{d}{da} d_{\text{KL}}(a) = \frac{1}{a} - \frac{1}{(1 + \sigma_N^2)}\frac{1}{a^3}, \tag{42}$$

which is negative for $a \in (0, 1/\sqrt{1 + \sigma_N^2})$.

APPENDIX F
PROOF OF THEOREM 3

The proof of Theorem 3 follows closely that of the rate-distortion theorem from information theory [19]. The value $P(D)$ is the minimal distance $d(p_X, p_{\hat{X}})$ over a constraint set whose size does not decrease with $D$. This implies that the function $P(D)$ is non-increasing in $D$. Now, to prove the convexity of $P(D)$, we will show that

λP (D1) + (1− λ)P (D2) ≥ P (λD1 + (1− λ)D2), (43)

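for all $\lambda \in [0, 1]$ (see Fig. 13). First, by definition, the left hand side of (43) can be written as

$$\lambda d(p_X, p_{\hat{X}_1}) + (1 - \lambda) d(p_X, p_{\hat{X}_2}), \tag{44}$$

where $\hat{X}_1$ and $\hat{X}_2$ are the estimators defined by

$$p_{\hat{X}_1|Y} = \arg\min_{p_{\hat{X}|Y}} d(p_X, p_{\hat{X}}) \quad \text{s.t.} \quad \mathbb{E}[\Delta(X, \hat{X})] \le D_1, \tag{45}$$

$$p_{\hat{X}_2|Y} = \arg\min_{p_{\hat{X}|Y}} d(p_X, p_{\hat{X}}) \quad \text{s.t.} \quad \mathbb{E}[\Delta(X, \hat{X})] \le D_2. \tag{46}$$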

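Since $d(\cdot, \cdot)$ is convex in its second argument,

$$\lambda d(p_X, p_{\hat{X}_1}) + (1 - \lambda) d(p_X, p_{\hat{X}_2}) \ge d(p_X, p_{\hat{X}_\lambda}), \tag{47}$$

where $\hat{X}_\lambda$ is defined by

$$p_{\hat{X}_\lambda|Y} = \lambda p_{\hat{X}_1|Y} + (1 - \lambda) p_{\hat{X}_2|Y}. \tag{48}$$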

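Denoting $D_\lambda = \mathbb{E}[\Delta(X, \hat{X}_\lambda)]$, we have that

$$d(p_X, p_{\hat{X}_\lambda}) \ge \min_{p_{\hat{X}|Y}} \left\{ d(p_X, p_{\hat{X}}) : \mathbb{E}[\Delta(X, \hat{X})] \le D_\lambda \right\} = P(D_\lambda), \tag{49}$$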

because $\hat{X}_\lambda$ is in the constraint set. Below, we show that

Dλ ≤ λD1 + (1− λ)D2. (50)

Therefore, since P (D) is non-increasing in D, we have that

P (Dλ) ≥ P (λD1 + (1− λ)D2). (51)

Combining (44), (47), (49) and (51) proves (43), thus demonstrating that $P(D)$ is convex.

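To justify (50), note that

$$\begin{aligned}
D_\lambda &= \mathbb{E}[\Delta(X, \hat{X}_\lambda)] \\
&= \mathbb{E}\big[\mathbb{E}[\Delta(X, \hat{X}_\lambda) | Y]\big] \\
&= \mathbb{E}\big[\lambda \mathbb{E}[\Delta(X, \hat{X}_1) | Y] + (1 - \lambda) \mathbb{E}[\Delta(X, \hat{X}_2) | Y]\big] \\
&= \lambda \mathbb{E}[\Delta(X, \hat{X}_1)] + (1 - \lambda) \mathbb{E}[\Delta(X, \hat{X}_2)] \\
&\le \lambda D_1 + (1 - \lambda) D_2,
\end{aligned} \tag{52}$$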


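where the second and fourth transitions are according to the law of total expectation and the third transition is justified by

$$\begin{aligned}
p(x, \hat{x}_\lambda | y) &= p(\hat{x}_\lambda | x, y)\, p(x|y) \\
&= p(\hat{x}_\lambda | y)\, p(x|y) \\
&= \big(\lambda p(\hat{x}_1|y) + (1 - \lambda) p(\hat{x}_2|y)\big)\, p(x|y) \\
&= \lambda p(\hat{x}_1|y)\, p(x|y) + (1 - \lambda) p(\hat{x}_2|y)\, p(x|y) \\
&= \lambda p(x, \hat{x}_1|y) + (1 - \lambda) p(x, \hat{x}_2|y).
\end{aligned} \tag{53}$$

Here we used (48) and the fact that $X$ and $\hat{X}_\lambda$ are independent given $Y$, and similarly for the pairs $(X, \hat{X}_1)$ and $(X, \hat{X}_2)$.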
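The linearity used in (52) and (53) is easy to verify on a small discrete problem. The following sketch uses a made-up binary joint distribution, two arbitrary estimators and a squared-error distortion; every number and name in it is an assumption for illustration only.

```python
xs = [0, 1]                                   # values of X and of the estimates
p_y = [0.4, 0.6]                              # p_Y(y) for y = 0, 1
p_x_given_y = [[0.8, 0.2], [0.3, 0.7]]        # p_{X|Y}(x|y), rows indexed by y
est1 = [[1.0, 0.0], [0.0, 1.0]]               # p_{X1_hat|Y}: a deterministic rule
est2 = [[0.8, 0.2], [0.3, 0.7]]               # p_{X2_hat|Y}: a stochastic rule

def squared_error(x, x_hat):
    return float((x - x_hat) ** 2)

def mean_distortion(est, delta):
    """E[delta(X, X_hat)] with X and X_hat independent given Y, as in (23)."""
    total = 0.0
    for y, py in enumerate(p_y):
        for x in xs:
            for x_hat in xs:
                total += delta(x, x_hat) * p_x_given_y[y][x] * est[y][x_hat] * py
    return total

def mix(est_a, est_b, lam):
    """Mixture estimator of (48): lam * est_a + (1 - lam) * est_b, per y."""
    return [[lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(est_a, est_b)]

lam = 0.3
d1 = mean_distortion(est1, squared_error)
d2 = mean_distortion(est2, squared_error)
d_lam = mean_distortion(mix(est1, est2, lam), squared_error)
print(d_lam, lam * d1 + (1.0 - lam) * d2)     # the two values should agree
```

The distortion of the mixture equals the same convex combination of the individual distortions, so whenever these are bounded by $D_1$ and $D_2$ the bound (50) follows.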

APPENDIX G
PROOF OF THEOREM 4

The estimator $\hat{X}$ of (15) attains perfect perceptual quality since

$$p_{\hat{X}}(x) = \int p_{\hat{X}|Y}(x|y)\, p_Y(y)\, dy = \int p_{X|Y}(x|y)\, p_Y(y)\, dy = p_X(x). \tag{54}$$

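Furthermore, note that

$$\mathbb{E}[X^T \hat{X}] = \mathbb{E}\big[\mathbb{E}[X^T \hat{X} | Y]\big] = \mathbb{E}\big[\mathbb{E}[X|Y]^T\, \mathbb{E}[\hat{X}|Y]\big] = \mathbb{E}\big[\|\mathbb{E}[X|Y]\|^2\big], \tag{55}$$

and

$$\mathbb{E}[\|\hat{X}\|^2] = \mathbb{E}\big[\mathbb{E}[\|\hat{X}\|^2 | Y]\big] = \mathbb{E}\big[\mathbb{E}[\|X\|^2 | Y]\big] = \mathbb{E}[\|X\|^2], \tag{56}$$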

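where we used the law of total expectation and the fact that given $Y$, $X$ and $\hat{X}$ are independent and identically distributed. The MSE of $\hat{X}$ is therefore

$$\begin{aligned}
\mathbb{E}[\|X - \hat{X}\|^2] &= \mathbb{E}[\|X\|^2] - 2\mathbb{E}[X^T \hat{X}] + \mathbb{E}[\|\hat{X}\|^2] \\
&= 2\big(\mathbb{E}[\|X\|^2] - \mathbb{E}[\|\mathbb{E}[X|Y]\|^2]\big) \\
&= 2\mathbb{E}[\|X - \mathbb{E}[X|Y]\|^2] \\
&= 2\mathbb{E}[\|X - \hat{X}_{\text{MMSE}}\|^2],
\end{aligned} \tag{57}$$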

where the second equality is due to (55) and (56), and the third equality is due to the orthogonality principle. We thus established that $\hat{X}$ is a distribution preserving estimator whose MSE is precisely twice the MSE of the MMSE estimator. This implies that

$$D_{\max} \le \mathbb{E}[\|X - \hat{X}\|^2] = 2D_{\min}, \tag{58}$$

completing the proof.
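The factor of two in (57) can be observed in a quick Monte Carlo experiment. The sketch below assumes a scalar Gaussian setting ($X \sim \mathcal{N}(0, 1)$, $Y = X + N$ with $N \sim \mathcal{N}(0, \sigma_N^2)$, as in Example 1), for which the posterior is Gaussian with mean $y/(1 + \sigma_N^2)$ and variance $\sigma_N^2/(1 + \sigma_N^2)$; the sample size and seed are arbitrary choices.

```python
import random

def simulate(var_n=1.0, n=200_000, seed=0):
    """Empirical MSE of the MMSE estimator and of a posterior-sampling estimator."""
    rng = random.Random(seed)
    gain = 1.0 / (1.0 + var_n)                  # E[X|Y=y] = gain * y
    post_std = (var_n / (1.0 + var_n)) ** 0.5   # std of X given Y
    noise_std = var_n ** 0.5
    se_mmse = 0.0
    se_post = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, noise_std)
        x_mmse = gain * y                             # posterior mean
        x_post = x_mmse + rng.gauss(0.0, post_std)    # draw from p_{X|Y}(.|y)
        se_mmse += (x - x_mmse) ** 2
        se_post += (x - x_post) ** 2
    return se_mmse / n, se_post / n

mmse, post = simulate()
print(mmse, post, post / mmse)   # the ratio should be close to 2
```

Since the sampled estimate is drawn independently of $X$ given $Y$ and from the same conditional distribution, this is a scalar instance of the estimator analyzed above, under the stated Gaussian assumption.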

APPENDIX H
WGAN ARCHITECTURE AND TRAINING DETAILS (SEC. 5)

The architecture of the WGAN trained for denoising the MNIST images is detailed in Table 1. The training algorithm and adversarial losses are as proposed in [58]. The generator loss was modified to include a content loss term, i.e. $\ell_{\text{gen}} = \ell_{\text{MSE}} + \lambda\, \ell_{\text{adv}}$, where $\ell_{\text{MSE}}$ is the standard MSE loss. For each $\lambda$ the WGAN was trained for 35 epochs, with a batch size of 64 images. The Adam optimizer [75] was used, with $\beta_1 = 0.5$, $\beta_2 = 0.9$. The initial learning rates of the generator and discriminator are $10^{-3}$ and $10^{-4}$, respectively, and both are halved every 10 epochs. The filter size of the discriminator convolutional layers is $5 \times 5$, and these are performed without padding. The filter size of the generator transposed-convolutional layers is $5 \times 5$ for the first layer and $4 \times 4$ for the second and third layers, with 2-pixel and 1-pixel padding, respectively. The stride of each convolutional layer and the slope for the leaky-ReLU layers appear in Table 1. Note that the perception-distortion curve in Fig. 7 is generated by training on single digit images, which in general may deviate from the perception-distortion curve of whole images containing i.i.d. sub-blocks of digits.
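The training schedule described above can be summarized in a few lines. This is a hedged sketch rather than the authors' code: it assumes zero-indexed epochs and a step decay applied at epochs 10, 20 and 30, and the function names are hypothetical.

```python
def learning_rate(initial_lr, epoch, drop_every=10, factor=0.5):
    """Step decay: multiply the rate by `factor` once per `drop_every` epochs.

    Assumes epochs are counted from 0; the exact epoch at which the
    drop is applied is not stated in the text.
    """
    return initial_lr * factor ** (epoch // drop_every)

def generator_loss(mse_loss, adv_loss, lam):
    """Combined generator objective: MSE content loss plus weighted adversarial loss."""
    return mse_loss + lam * adv_loss

# Usage: rates of the generator (1e-3) and discriminator (1e-4) over 35 epochs.
for epoch in (0, 10, 20, 30, 34):
    print(epoch, learning_rate(1e-3, epoch), learning_rate(1e-4, epoch))
```

Sweeping `lam` over a range of values and training one model per value is what traces out the different points of the perception-distortion curve described in the text.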

APPENDIX I
SUPER-RESOLUTION EVALUATION DETAILS (SEC. 6) AND ADDITIONAL COMPARISONS

The no-reference (NR) and full-reference (FR) methods BRISQUE, BLIINDS-II, NIQE, SSIM, MS-SSIM, IFC and VIF were obtained from the LIVE laboratory website⁵, the NR method of Ma et al. was obtained from the project webpage⁶, and the pretrained VGG-19 network was obtained through the PyTorch torchvision package⁷. The low-resolution images were obtained by factor 4 downsampling with a bicubic kernel. The super-resolution results on the BSD100 dataset of the SRGAN and SRResNet variants were obtained online⁸, and the results of EDSR, Deng, Johnson et al. and Mechrez et al. were kindly provided by the authors. The algorithms for testing the other SR methods were obtained online: A+⁹, SRCNN¹⁰, SelfEx¹¹, VDSR¹², LapSRN¹³, Bae et al.¹⁴ and ENet¹⁵. All NR and FR metrics and all SR algorithms were used with the default parameters and models.

The general pattern appearing in Fig. 9 will appear for any NR method which accurately predicts the perceptual quality of images. We show here three additional popular NR methods: BRISQUE [30], BLIINDS-II [35] and the recent measure by Ma et al. [61] in Figs. 14, 15, 16, where the same conclusions as for NIQE [36] (see Sec. 6) are apparent. The same pattern appears for RGB images as well, as shown in Figs. 17, 18. Note that the perceptual quality of Johnson et al. and SRResNet-VGG$_{2,2}$ is inconsistent between NR metrics, likely due to varying sensitivity to the cross-hatch pattern artifacts which are present in these methods'

5. http://live.ece.utexas.edu/research/Quality/index.htm
6. https://github.com/chaoma99/sr-metric
7. http://pytorch.org/docs/master/torchvision/index.html
8. https://twitter.box.com/s/lcue6vlrd01ljkdtdkhmfvk7vtjhetog
9. http://www.vision.ee.ethz.ch/~timofter/ACCV2014_ID820_SUPPLEMENTARY/
10. http://mmlab.ie.cuhk.edu.hk/projects/SRCNN.html
11. https://github.com/jbhuang0604/SelfExSR
12. http://cv.snu.ac.kr/research/VDSR/
13. https://github.com/phoenix104104/LapSRN
14. https://github.com/iorism/CNN
15. https://webdav.tue.mpg.de/pixel/enhancenet/


TABLE 1
Generator and discriminator architecture. FC is a fully-connected layer, BN is a batch-norm layer, and l-ReLU is a leaky-ReLU layer.

Discriminator

Size            Layer
28 × 28 × 1     Input
12 × 12 × 32    Conv (stride=2), l-ReLU (slope=0.2)
4 × 4 × 64      Conv (stride=2), l-ReLU (slope=0.2)
1024            Flatten
1               FC
1               Output

Generator

Size            Layer
28 × 28 × 1     Input
784             Flatten
4 × 4 × 128     FC, unflatten, BN, ReLU
7 × 7 × 64      transposed-Conv (stride=2), BN, ReLU
14 × 14 × 32    transposed-Conv (stride=2), BN, ReLU
28 × 28 × 1     transposed-Conv (stride=2), sigmoid
28 × 28 × 1     Output
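The spatial sizes in Table 1 follow from the filter sizes, strides and paddings given in Appendix H. The sketch below checks this arithmetic using the standard output-size formulas for convolutions and transposed convolutions, assuming unit dilation and no output padding (neither is stated in the text); the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output size of a strided convolution (floor division, unit dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

def tconv_out(size, kernel, stride, padding=0):
    """Output size of a transposed convolution (no output padding assumed)."""
    return (size - 1) * stride - 2 * padding + kernel

# Discriminator: two 5x5 convolutions, stride 2, no padding.
discriminator_sizes = [28]
for _ in range(2):
    discriminator_sizes.append(conv_out(discriminator_sizes[-1], kernel=5, stride=2))

# Generator: 5x5 filter with 2-pixel padding, then two 4x4 filters with 1-pixel padding.
generator_sizes = [4]
for kernel, padding in [(5, 2), (4, 1), (4, 1)]:
    generator_sizes.append(tconv_out(generator_sizes[-1], kernel, stride=2, padding=padding))

print(discriminator_sizes)                        # spatial sizes 28, 12, 4
print(generator_sizes)                            # spatial sizes 4, 7, 14, 28
print(discriminator_sizes[-1] ** 2 * 64)          # flattened discriminator features
```

The last printed value, $4 \times 4 \times 64 = 1024$, matches the size of the Flatten row of the discriminator.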

Fig. 14. Plot of 15 algorithms on the perception-distortion plane, where perception is measured by the NR metric by Ma et al. [61], and distortion is measured by the common full-reference metrics RMSE, SSIM, MS-SSIM, IFC, VIF and VGG$_{2,2}$. All metrics were calculated on the y-channel alone.

outputs. For this reason, Johnson et al. does not appear in the NIQE plots, as its NIQE score is 13.55 (far off the plots).


Fig. 15. Plot of 16 algorithms on the perception-distortion plane, where perception is measured by the NR metric BRISQUE, and distortion is measured by the common full-reference metrics RMSE, SSIM, MS-SSIM, IFC, VIF and VGG$_{2,2}$. All metrics were calculated on the y-channel alone.

Fig. 16. Plot of 16 algorithms on the perception-distortion plane, where perception is measured by the NR metric BLIINDS-II, and distortion is measured by the common full-reference metrics RMSE, SSIM, MS-SSIM, IFC, VIF and VGG$_{2,2}$. All metrics were calculated on the y-channel alone.


Fig. 17. Plot of 16 algorithms on the perception-distortion plane. Perception is measured by the NR metrics of Ma et al. and NIQE, and distortion is measured by the common full-reference metrics RMSE, SSIM and MS-SSIM. All metrics were calculated on three-channel RGB images.

Fig. 18. Plot of 16 algorithms on the perception-distortion plane. Perception is measured by the NR metrics BRISQUE and BLIINDS-II, and distortion is measured by the common full-reference metrics RMSE, SSIM and MS-SSIM. All metrics were calculated on three-channel RGB images.

